**shurjo** · 04-29-2010, 04:18 PM

If this is Illumina data, were your reads processed with pipeline v1.3 or later? If so, you have to include the --solexa-quals option in your TopHat run.

**bzhang** · 04-29-2010, 04:24 PM

This is Illumina data. What I received was a sequence.txt file, and I have converted it into fastq (sanger) format. Do I still need to use --solexa-quals?

**shurjo** · 04-29-2010, 04:26 PM

Fastq files include quality scores, so the answer would be yes (once again, only if your reads were processed with pipeline v1.3 or later).

**bzhang** · 04-29-2010, 04:33 PM

I have already converted the Illumina quality score to Sanger standard quality score (shift each character by 31). Do I still need to use the option?
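For readers following along, the conversion described above can be sketched roughly as below. This is only an illustration, assuming the input scores are Phred+64 (Illumina pipeline 1.3 or later), where a fixed shift of 31 maps them onto the Sanger Phred+33 range; the function name `illumina13_to_sanger` is made up for this example. Older Solexa-scaled scores are not a simple offset and would need a different conversion.

```python
def illumina13_to_sanger(qual: str) -> str:
    """Convert a Phred+64 quality string to Phred+33 (Sanger).

    Assumes every character encodes a Phred score with offset 64,
    so subtracting 31 from each character code gives offset 33.
    """
    out = []
    for ch in qual:
        code = ord(ch) - 31
        if code < 33:
            # A character this low cannot be a valid Phred+64 score.
            raise ValueError(f"character {ch!r} is below the Phred+64 range")
        out.append(chr(code))
    return "".join(out)


if __name__ == "__main__":
    # '@' (64) -> '!' (33), 'B' (66) -> '#' (35), 'h' (104) -> 'I' (73)
    print(illumina13_to_sanger("@Bh"))
```

If a file has already been shifted once, running it through a function like this a second time would raise an error on low-quality characters rather than silently producing garbage, which is one quick way to notice a double conversion.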

**shurjo** · 04-29-2010, 08:58 PM

I guess not. At this point my knowledge ends and I would go running to the nearest full-time bioinformatics geek. One last thing though: I do see an extra newline at the end of the sample you posted, so I would double-check your input file to make sure that you don't have any in there.

Sorry and best of luck,

Shurjo

**bzhang** · 04-29-2010, 09:25 PM

Shurjo, thanks for the help. I have checked the file again to make sure there is no extra newline. These two reads were taken from a large data file. prep_reads apparently runs fine for the first 200,000 or so reads and then chokes on these two, and I just cannot see how they differ from the other reads.

**Cole Trapnell** · 04-29-2010, 10:05 PM

Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer. TopHat's FASTQ parser occasionally screws up when FASTQ records are incorrectly formatted or when the read and/or quality sequences span more than one line in the file. We plan to replace the parser in an upcoming version to make it more robust to this kind of thing.

**bzhang** · 04-29-2010, 11:13 PM

Cole, could you take a look at the fastq file I attached? The original fastq file was converted from the Illumina SCARF format and contains millions of reads. prep_reads gave the error after 10 minutes, and the two reads I attached seem to be responsible for the problem.

**maubp** · 04-30-2010, 01:00 AM

> Originally posted by bzhang:
>
> Saw ASCII character 10 but expected 33-based Phred qual.
> terminate called after throwing an instance of 'int'
>
> I looked through data and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?

Are you on Linux/Unix? It sounds like the file has DOS/Windows newlines (CR, LF, i.e. ASCII 13, 10) rather than Unix style (LF only, ASCII 10). Try using dos2unix on it (or a similar tool).
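If dos2unix is not at hand, a rough way to check for this is to count the line-ending styles in the raw bytes. The sketch below is only an illustration; the helper names `count_line_endings` and `to_unix` are invented for this example and are not part of TopHat or any other tool.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (DOS), bare LF (Unix) and bare CR (old Mac) line endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf, "cr": cr}


def to_unix(data: bytes) -> bytes:
    """Rewrite DOS line endings as Unix ones, similar in effect to dos2unix."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"@r1\r\nACGT\r\n+\nIIII\n"
    print(count_line_endings(sample))
    print(to_unix(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`; a non-zero `crlf` count means the file has DOS line endings.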

**bzhang** · 04-30-2010, 11:06 AM

I think I figured out the problem. The Illumina sequence file uses '.' for undetermined bases, and prep_reads filters these out when reading the sequence. This creates a mismatch between the sequences and the quality scores. For the problematic reads I attached, the first sequence contains 11 '.'s, so prep_reads reads in only 90 bases. There happens to be an '@' in the quality scores after position 90; prep_reads treats it as the start of a new record, which messes up the next record and produces the error. I don't know whether using '.' in the sequences is a new convention adopted by Illumina. I am surprised that I am the first one to encounter this problem. For now I guess I'll just convert all those '.'s into 'N's, but prep_reads could certainly be more robust.

I am sort of lucky, in the sense that my data contains enough reads to expose this problem. If I had only 200,000 reads, I might never have seen it and would have happily carried on with the downstream analysis, unaware of the mismatch between the sequences and the quality scores.
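The workaround mentioned above (turning '.' into 'N') could look something like the sketch below. It assumes strict four-line FASTQ records, so that the sequence is always the second line of each record; the function name `dots_to_n` is hypothetical. Only sequence lines are touched, because '.' is also a legitimate quality character in Phred+33.

```python
def dots_to_n(lines):
    """Yield FASTQ lines with '.' replaced by 'N' on sequence lines only.

    Assumes strict 4-line records: header, sequence, '+' line, quality.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            # Second line of each record is the sequence.
            yield line.replace(".", "N")
        else:
            # Headers, '+' lines and quality strings pass through unchanged.
            yield line


if __name__ == "__main__":
    record = ["@r1\n", "AC..GT\n", "+\n", "II.@II\n"]
    for line in dots_to_n(record):
        print(line, end="")
```

Because each '.' is swapped for exactly one 'N', the sequence keeps its original length and stays in step with the quality string, which is the property that was being lost when the dots were dropped.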

**Cole Trapnell** · 04-30-2010, 11:09 AM

Thanks for the heads up. We'll add the bug to our tracker and address it in the next release. Others are likely to have this problem.

**darked89** · 05-01-2010, 04:19 AM

> Originally posted by Cole Trapnell:
>
> Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer.

I am also getting seed lengths equal to the read length (54bp and 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I ran it in paired-end mode, so I assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).

Is there any way I can check that my FASTQ files are Tophat compatible?

**bzhang** · 05-01-2010, 08:30 AM

From what I understand by reading the code, at least in the recent versions, the seed length is equal to the shortest read length. So if all the reads are of the same length, the seed length is set to the read length. I am not sure about the impact of setting the seed length this way; I guess I have to read more papers to understand this.

**bzhang** · 05-01-2010, 01:23 PM

> Originally posted by darked89:
>
> I am also getting seed lengths equal to the read length (54bp and 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I ran it in paired-end mode, so I assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).
>
> Is there any way I can check that my FASTQ files are Tophat compatible?

It seems TopHat calls Bowtie with the option -v 2, which, according to the manual, means that at most 2 mismatches are allowed and that the option -l (which specifies the seed length) is ignored. I think your FASTQ files are fine as long as they don't contain non-alphabetical characters in the sequences.
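As a rough answer to the "how can I check my files" question, a few sanity checks can be scripted. The sketch below is not TopHat's own parser and makes no claim about what TopHat actually accepts; it assumes strict four-line records and simply flags the kinds of problems discussed in this thread (non-alphabetical sequence characters, sequence and quality lengths that differ, malformed header or separator lines). The name `check_fastq` is made up for this example.

```python
import re

# Sequence lines are expected to contain letters only (A, C, G, T, N, ...).
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def check_fastq(lines):
    """Return a list of (record_number, message) for suspicious records.

    Assumes strict 4-line FASTQ records; multi-line sequences are not handled.
    """
    problems = []
    lines = [ln.rstrip("\n") for ln in lines]
    if len(lines) % 4 != 0:
        problems.append((len(lines) // 4 + 1, "incomplete record at end of file"))
    for rec in range(len(lines) // 4):
        header, seq, plus, qual = lines[4 * rec: 4 * rec + 4]
        n = rec + 1  # 1-based record number for reporting
        if not header.startswith("@"):
            problems.append((n, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((n, "separator line does not start with '+'"))
        if not LETTERS_ONLY.fullmatch(seq):
            # Also catches '.' bases and stray '\r' from DOS line endings.
            problems.append((n, "sequence contains non-alphabetical characters"))
        if len(seq) != len(qual):
            problems.append((n, "sequence and quality lengths differ"))
    return problems


if __name__ == "__main__":
    bad = ["@r2\n", "AC.T\n", "+\n", "III\n"]
    for number, message in check_fastq(bad):
        print(f"record {number}: {message}")
```

An empty list from a check like this does not prove a file will work with any particular aligner, but a non-empty one points at records worth looking at by hand.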


prep_reads error when running Tophat
