Hi all,
We had an external company run RNA-seq for us and I'm now knee-deep in trying to assemble the reads. The platform was an Illumina HiSeq 2000, which produced two fq files containing paired-end data. I've noticed that some of the reads in file 1 begin with an N that has a quality character of B. Other threads here say that this is a low quality score, equivalent to 2. The paired reads in file 2 don't seem to have this issue, although some of them end with a B-quality base. An example pair is shown below.
I don't think it's a huge issue in the data as a whole as FastQC doesn't flag any problem with the number of Ns at the first base, so it's likely a small subset of the sequences.
Code:
@ABC123:1:1101:1423:1934#/1
NACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
+
BP\aceeca]cgcdgcegfdgdgdcgd_aa^cSXcgecaW^eeg_[aW\Za_fghhh]ddgdbaabbccZ_R`Z`T\KTTZZ`b^WXX]bY_bY`baa[[
@ABC123:1:1101:1423:1934#/2
GACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
+
_^aceeeegegggfffefb`eeaffggcgh_cgffhghhibeffgfgfegdfgighhghhihggcghigdgggggdabc_abbb`a_`_ccb`Z_bcccB
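For anyone who wants to check the encoding by hand, the quality characters can be decoded with a couple of lines of Python. This is only a minimal sketch, and it assumes the files use the Phred+64 offset; the function name `phred64` is made up for illustration.

```python
def phred64(char):
    """Convert one quality character to its numeric score, assuming Phred+64."""
    return ord(char) - 64

print(phred64("B"))  # quality character of the leading N in read 1 -> 2
print(phred64("P"))  # second quality character of read 1 -> 16
print(phred64("i"))  # an 'i' from the read 2 quality string -> 41
```

If the files turned out to use the Phred+33 offset instead, the same characters would decode to much higher scores, so the offset assumption matters.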
Nevertheless, I'd like to remove these bases and am struggling to find a tool that does what I need (or, perhaps more likely, am struggling to use the available tools correctly). The fastx toolkit only seems to remove bases from the 3' end. When I run Trimmomatic with the options PE -phred64 LEADING:3 TRAILING:3, it happily removes the poor-quality bases from the 3' end but not from the 5' end: in the above example the final A of the file 2 read is removed, but the first N of the file 1 read is not. I don't know whether this is because the base is an N rather than a called nucleotide, or because of its position in the read.
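To make the goal concrete, here is a rough Python sketch of the 5' trimming I have in mind. It is not how Trimmomatic or the fastx toolkit work internally. The function name `trim_5prime`, the threshold of 3, and the Phred+64 offset are all assumptions chosen to mirror the options above, and the read in the demo is a shortened, made-up one.

```python
def trim_5prime(seq, qual, min_q=3, offset=64):
    """Strip leading bases that are N or score below min_q (Phred+offset).

    Hypothetical helper for illustration only; it returns the trimmed
    sequence and the matching trimmed quality string.
    """
    start = 0
    while start < len(seq) and (
        seq[start] == "N" or ord(qual[start]) - offset < min_q
    ):
        start += 1
    return seq[start:], qual[start:]

# Shortened, made-up read in the same style as the file 1 example.
seq, qual = trim_5prime("NACGAGACCA", "BPaceecacg")
print(seq)   # ACGAGACCA
print(qual)  # Paceecacg
```

This only illustrates the behaviour I'm after. A real script would also have to keep the two files in sync, for instance by dropping both mates if one read were trimmed away entirely.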
Any advice on the nature of these initial Ns in Illumina data and how best to remove them would be much appreciated!