I'm new to small RNA sequencing and a bit confused about some of the sequences I'm seeing, so I'm hoping you can provide some tips. I'm looking at a public dataset (SRP032955) in which the reads have already been de-multiplexed (if they were barcoded in the first place) into single-sample files. After I trim off the Illumina 3' adapter, I still have a number of sequences that are very highly repeated, but which I can't identify as sequencing artifacts. Here's a good example:
>raw read (3' adapter is the trailing TGGAATTCTCGGGTGCCAAGGAACTCCAG)
TCTATGTTCAGTGCGACTGCATGGAATTCTCGGGTGCCAAGGAACTCCAG
>resulting trimmed read
TCTATGTTCAGTGCGACTGCA
This trimmed read appears 23584 times, or 1.3% of the reads. FastQC labels it as "No Hit" in the Overrepresented sequences list, along with a whole bunch of other "No Hit" highly-repeated sequences.
Is this a real *RNA or an artifact? It doesn't BLAST to the Maize B73 genome, although this sample is Maize W23, for which I don't have a genome sequence handy.
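For anyone who wants to reproduce the trimming and counting step, here is a minimal Python sketch. It is not the trimmer I actually ran. It assumes exact matching only, uses the first 21 bases of the adapter portion of the raw read above as the adapter, and the function names and the `min_overlap` parameter are made up for illustration. Real trimmers such as cutadapt also tolerate mismatches, which this does not.

```python
from collections import Counter

# Assumption: first 21 bases of the adapter portion of the raw read shown above.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_3p_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Remove a 3' adapter by exact match (hypothetical helper)."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter, so look for the
    # longest adapter prefix (at least min_overlap bases) at the read's end.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


def count_trimmed(reads):
    """Count how often each trimmed sequence occurs (hypothetical helper)."""
    return Counter(trim_3p_adapter(r) for r in reads)


raw = "TCTATGTTCAGTGCGACTGCATGGAATTCTCGGGTGCCAAGGAACTCCAG"
print(trim_3p_adapter(raw))
```

Running `count_trimmed` over all reads in a sample and dividing each count by the total number of reads gives the kind of percentage FastQC reports in its Overrepresented sequences list.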