
**mastal** · 10-14-2013, 09:32 AM

Trimmomatic's LEADING:3 command should remove very low quality bases from the 5' end.

What version of Illumina's software was used to produce your fastq files?
The quality encodings used by Illumina have changed a few times.
See the Wikipedia article on the FASTQ format: http://en.wikipedia.org/wiki/FASTQ_format

Since v1.8 of Illumina's software, the output uses the phred33 quality scale,
so a mismatch in the assumed encoding might explain why Trimmomatic didn't remove the N base from the 5' end of your sequence.
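If the software version isn't known, the encoding can often be guessed from the range of quality characters in the file. The following is only a rough sketch of that heuristic, not a Trimmomatic feature: the function name `guess_phred_offset` and the cut-off characters (anything below `;` suggests phred33, anything above `J` suggests phred64) are assumptions based on the ranges listed on the Wikipedia page.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from a sample of FASTQ quality lines.

    Returns 33, 64, or None when the characters seen fit both encodings.
    Heuristic only: a small or unusually uniform sample can be ambiguous.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    if min(codes) < ord(";"):   # characters this low do not occur in the 64-based encodings
        return 33
    if max(codes) > ord("J"):   # above 'J' would mean Q > 41 under phred33, which is unusual
        return 64
    return None                 # every character is valid in both encodings


# First ten quality characters of a read that starts with 'B'
print(guess_phred_offset([r"BP\aceeca]"]))
```

Feeding it a few thousand quality lines rather than a single read makes the guess more reliable.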

**beej** · 10-15-2013, 02:15 AM

Hi mastal, thanks for your reply.

We weren't given the info on Illumina software when we received the data, but I've gone back to request that just to make sure.

Going by the wiki page it does look to be phred64; most of the quality scores tend to be in the range of "[\]^_`abcdefghi" ASCII characters. I've tried running Trimmomatic with the -phred33 option and it removes nothing, not even the 3' bases that were removed before.
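The "removes nothing" result with -phred33 is what the arithmetic predicts: a `B` decodes to 2 with an offset of 64, but to 33 with an offset of 33, which is well above a threshold of 3. Below is a minimal sketch of that decoding together with a simplified LEADING/TRAILING-style trim. It is an illustration written for this thread, not Trimmomatic's actual code, and the function names are made up.

```python
def phred_scores(quality, offset):
    """Decode a FASTQ quality string into integer scores for the given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


def trim_leading_trailing(seq, quality, offset, leading=3, trailing=3):
    """Drop bases from each end while their score is below the threshold.

    Simplified imitation of LEADING:<n> / TRAILING:<n>; assumes seq and
    quality have the same length.
    """
    scores = phred_scores(quality, offset)
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], quality[start:end]


# First ten bases of a read that begins with an N of quality 'B'
seq, qual = "NACGAGACCA", r"BP\aceeca]"
print(phred_scores("B", 64), phred_scores("B", 33))   # [2] [33]
print(trim_leading_trailing(seq, qual, offset=64))    # leading N is dropped
print(trim_leading_trailing(seq, qual, offset=33))    # read is left untouched
```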

I've also tried editing the first base of the first sequence to a G rather than an N, and it's still not removing it - the LEADING command just doesn't seem to be working for me, it's baffling.

EDIT: OK, I've just tried using the single end (SE) version on the first sequence file only, and it's removed the leading Ns. As I'm not removing any reads, just trimming the 5' and 3' ends where there is a low quality base, I suppose running it this way on each file in turn should do the trick and I shouldn't (in theory) have any problems with unmatched reads. I'd still prefer to get it working using the PE function just in case, though.

FURTHER EDIT: Problem resolved. Short answer: I'm an idiot. Long answer: my Trimmomatic command was missing a path for the second unpaired output file. Add one of these in, and it worked fine.
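For anyone hitting the same problem: in PE mode Trimmomatic takes two input files followed by four output files (paired and unpaired output for each mate), and only then the trimming steps. One way to avoid dropping a path is to build the argument list in a script. This is a sketch only; the jar name, input file names and the output naming scheme are placeholders, not anything from the original command.

```python
def trimmomatic_pe_command(jar, in1, in2, out_prefix, steps, phred=64):
    """Build the argument list for a paired-end Trimmomatic run.

    The four output names derived from out_prefix are a hypothetical
    naming scheme chosen for this example.
    """
    outputs = [
        f"{out_prefix}_1_paired.fq",
        f"{out_prefix}_1_unpaired.fq",
        f"{out_prefix}_2_paired.fq",
        f"{out_prefix}_2_unpaired.fq",
    ]
    return ["java", "-jar", jar, "PE", f"-phred{phred}",
            in1, in2, *outputs, *steps]


cmd = trimmomatic_pe_command(
    "trimmomatic.jar",            # placeholder path to the jar
    "reads_1.fq", "reads_2.fq",   # placeholder input files
    "trimmed",
    ["LEADING:3", "TRAILING:3"],
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

With only three output paths given, the first trimming step ends up in the position of the fourth output file. That is a plausible explanation for LEADING appearing to do nothing, although it isn't confirmed in this thread.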

**cement_head** · 10-15-2013, 04:02 AM

Originally posted by beej:

Hi all,

We had an external company run RNA-seq for us and I'm now knee-deep in trying to assemble these sequences. The platform used was Illumina HiSeq 2000, producing a couple of fq files containing paired end data. I've noticed that some of the sequences in file 1 begin with an N, with a quality score of B - I've read other threads here that advise that this is a low quality score equivalent to 2. The paired sequences in file 2 don't seem to have this issue, although they may end with a B quality base - here's an example:

Code:

@ABC123:1:1101:1423:1934#/1
NACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
+
BP\aceeca]cgcdgcegfdgdgdcgd_aa^cSXcgecaW^eeg_[aW\Za_fghhh]ddgdbaabbccZ_R`Z`T\KTTZZ`b^WXX]bY_bY`baa[[

@ABC123:1:1101:1423:1934#/2
GACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
+
_^aceeeegegggfffefb`eeaffggcgh_cgffhghhibeffgfgfegdfgighhghhihggcghigdgggggdabc_abbb`a_`_ccb`Z_bcccB

I don't think it's a huge issue in the data as a whole as FastQC doesn't flag any problem with the number of Ns at the first base, so it's likely a small subset of the sequences.

Nevertheless I'd like to remove these bases and am struggling to find a tool that does what I need (or, perhaps more likely, am struggling to use the tools available correctly) - fastx toolkit only seems to remove bases from the 3' end, and when I use Trimmomatic with options PE -phred64 LEADING:3 TRAILING:3 it happily removes the poor quality bases from the 3' end but not the 5' - so in the above example the final A of the file2 sequence is removed, but not the first N of file1. I don't know if this is because it is an N rather than a nucleotide or if it's due to its position in the sequence.

Any advice on the nature of these initial Ns in Illumina data and how best to remove them would be much appreciated!

Some programs give you the option to simply remove the first & last one, two or three bases from every read - I'm guessing that this is a pre-emptive way of dealing with data that may have a lower score.
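A fixed crop like that is easy to sketch. The function below is a hypothetical helper, not taken from any particular tool; it removes a set number of bases from each end regardless of quality (Trimmomatic's HEADCROP step does the 5' part of this).

```python
def crop_ends(seq, quality, head=1, tail=1):
    """Remove `head` bases from the 5' end and `tail` bases from the 3' end.

    Quality is ignored entirely; returns empty strings if the read is
    too short to survive the crop.
    """
    if len(seq) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq) - tail
    if head >= end:
        return "", ""
    return seq[head:end], quality[head:end]


# First ten bases of a read starting with a low-quality N
print(crop_ends("NACGAGACCA", r"BP\aceeca]", head=1, tail=1))
```

The trade-off is that every read loses those bases, including the ones that were fine, whereas quality-based trimming only touches reads that need it.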

Illumina HiSeq - first base quality score and trimming
