  • bzhang
    Member
    • Apr 2010
    • 10

    prep_reads error when running Tophat

    I am running TopHat on a set of test reads and got the following error:

    [Thu Apr 29 16:48:07 2010] Beginning TopHat run (v1.0.13)
    -----------------------------------------------
    [Thu Apr 29 16:48:07 2010] Preparing output location ./tophat_out/
    [Thu Apr 29 16:48:07 2010] Checking for Bowtie index files
    [Thu Apr 29 16:48:07 2010] Checking for reference FASTA file
    [Thu Apr 29 16:48:07 2010] Checking for Bowtie
    Bowtie version: 0.12.5.0
    [Thu Apr 29 16:48:07 2010] Checking reads
    seed length: 101bp
    format: fastq
    quality scale: phred33 (default)
    [FAILED]
    Error: could not execute prep_reads

    The prep_reads.log file has this information,

    prep_reads v1.0.13
    ---------------------------
    Saw ASCII character 10 but expected 33-based Phred qual.
    terminate called after throwing an instance of 'int'

    I looked through the data, and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?
  • shurjo
    Senior Member
    • Jan 2009
    • 132

    #2
    If this is Illumina data, were your reads processed with pipeline v1.3 or later? If so, you have to include the --solexa-quals option in your TopHat run.


    • bzhang
      Member
      • Apr 2010
      • 10

      #3
      This is Illumina data. What I received was a sequence.txt file, and I have converted it into fastq (sanger) format. Do I still need to use --solexa-quals?


      • shurjo
        Senior Member
        • Jan 2009
        • 132

        #4
        Fastq files include quality scores, so the answer would be yes (once again, only if your reads were processed with pipeline v1.3 or later).


        • bzhang
          Member
          • Apr 2010
          • 10

          #5
          I have already converted the Illumina quality score to Sanger standard quality score (shift each character by 31). Do I still need to use the option?
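For readers following along, the conversion described in #5 can be sketched roughly as below. This is only an illustration, assuming the input uses the Illumina 1.3+ encoding (Phred scores offset by 64); the function name `illumina64_to_sanger33` is made up for this example. Older Solexa-scaled scores use a different formula, so a plain shift would not be exact for them.

```python
def illumina64_to_sanger33(qual: str) -> str:
    """Shift each quality character down by 31 (offset 64 -> offset 33).

    Assumes Illumina 1.3+ style Phred+64 input; hypothetical helper.
    """
    out = []
    for ch in qual:
        code = ord(ch) - 31
        if code < 33:
            # A character below '@' cannot be a Phred+64 quality.
            raise ValueError(f"character {ch!r} is below the Phred+64 range")
        out.append(chr(code))
    return "".join(out)


print(illumina64_to_sanger33("hhh@"))  # 'h' (104) -> 'I' (73), '@' (64) -> '!' (33)
```

If the shift has already been applied once, running it again will raise an error on most reads, which is a cheap way to notice a double conversion.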


          • shurjo
            Senior Member
            • Jan 2009
            • 132

            #6
            I guess not. At this point my knowledge ends and I would go running to the nearest full-time bioinformatics geek. One last thing though: I do see an extra newline at the end of the sample you posted, so I would double-check your input file to make sure that you don't have any in there.

            Sorry and best of luck,

            Shurjo


            • bzhang
              Member
              • Apr 2010
              • 10

              #7
              Shurjo, thanks for the help. I have checked the file again to make sure there is no extra newline. These two reads were taken from a large data file. prep_reads apparently runs fine for the first 200,000 or so reads and then chokes on these two, and I just cannot see how they differ from the other reads.


              • Cole Trapnell
                Senior Member
                • Nov 2008
                • 213

                #8
                Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer. TopHat's FASTQ parser occasionally screws up when FASTQ records are incorrectly formatted or when the read and/or quality sequences span more than one line in the file. We plan to replace the parser in an upcoming version to make it more robust to this kind of thing.


                • bzhang
                  Member
                  • Apr 2010
                  • 10

                  #9
                  Cole, could you take a look at the fastq file I attached? The original fastq file was converted from the Illumina SCARF format and contains millions of reads. prep_reads gave the error after 10 minutes, and the two reads I attached seem to be responsible for the problem.


                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Originally posted by bzhang
                    Saw ASCII character 10 but expected 33-based Phred qual.
                    terminate called after throwing an instance of 'int'

                    I looked through data and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?
                    Are you on Linux/Unix? It sounds like the file has DOS/Windows newlines (CR, LF, i.e. ASCII 13, 10) rather than Unix style (LF only, ASCII 10). Try using dos2unix on it (or a similar tool).
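If dos2unix is not at hand, a quick way to test this suggestion is to count lines that end in CR LF. The sketch below is only an illustration; the function name `count_crlf_lines` is hypothetical, and the stream must be opened in binary mode so that Python does not translate the line endings.

```python
import io


def count_crlf_lines(stream) -> int:
    """Count lines ending in CR LF (bytes 13, 10) in a binary stream."""
    count = 0
    for line in stream:
        if line.endswith(b"\r\n"):
            count += 1
    return count


# In-memory stand-ins for a DOS-style and a Unix-style FASTQ record.
dos_file = io.BytesIO(b"@read1\r\nACGT\r\n+\r\nIIII\r\n")
unix_file = io.BytesIO(b"@read1\nACGT\n+\nIIII\n")
print(count_crlf_lines(dos_file), count_crlf_lines(unix_file))
```

For a real file, pass `open(path, "rb")` instead of the `io.BytesIO` objects. A result of zero means the line endings are not the culprit.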


                    • bzhang
                      Member
                      • Apr 2010
                      • 10

                      #11
                      I think I figured out the problem. The Illumina sequence file uses '.' for undetermined bases, and prep_reads filters these out when reading the sequence. This creates a mismatch between the sequences and the quality scores. For the problematic reads I attached, the first sequence contains 11 '.'s, so prep_reads reads in only 90 of its 101 bases. There happens to be a '@' in the quality string after position 90, and prep_reads treats it as the start of a new record. This messes up the next record, hence the error. I don't know whether using '.' in the sequences is a new convention adopted by Illumina. I am surprised that I am the first one to encounter this problem. For now I guess I'll just convert all those '.'s into 'N's, but prep_reads could certainly be more robust.

                      I am sort of lucky, in a sense, that my data contains enough reads to expose this problem. If I had only had 200,000 reads, I might never have seen it and would have happily carried on with the downstream analysis, unaware of the mismatch between the sequences and the quality scores.
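The workaround mentioned in #11 (turning '.' into 'N') needs to touch only the sequence line of each record, because '.' is also a valid quality character in Sanger-style FASTQ (ASCII 46). A minimal sketch, assuming every record is exactly four lines with no wrapped sequences; `dots_to_n` is a hypothetical helper, not part of TopHat:

```python
from typing import Iterable, Iterator


def dots_to_n(lines: Iterable[str]) -> Iterator[str]:
    """Replace '.' with 'N' on the sequence line of each four-line record.

    Assumes header, sequence, '+', qualities, in that order, one line each.
    """
    for index, line in enumerate(lines):
        text = line.rstrip("\r\n")
        if index % 4 == 1:  # second line of each record is the sequence
            text = text.replace(".", "N")
        yield text


record = ["@read1", "AC..GT", "+", "I.@III"]
print(list(dots_to_n(record)))
# ['@read1', 'ACNNGT', '+', 'I.@III']
```

Note that the quality line keeps its '.' and '@' characters untouched, so the sequence and quality strings stay the same length.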


                      • Cole Trapnell
                        Senior Member
                        • Nov 2008
                        • 213

                        #12
                        Thanks for the heads up. We'll add the bug to our tracker and address it in the next release. Others are likely to have this problem.


                        • darked89
                          Member
                          • Jun 2009
                          • 38

                          #13
                          Originally posted by Cole Trapnell
                          Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer.
                          I am also getting seed lengths equal to the read length (54, 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I ran it in paired-end mode and therefore assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these to (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).

                          Is there any way I can check that my FASTQ files are Tophat compatible?


                          • bzhang
                            Member
                            • Apr 2010
                            • 10

                            #14
                            From what I understand by reading the code, at least in the recent versions, the seed length is equal to the shortest read length. So if all the reads are of the same length, the seed length is set to the read length. I am not sure about the impact of setting the seed length this way; I guess I have to read more papers to understand this.


                            • bzhang
                              Member
                              • Apr 2010
                              • 10

                              #15
                              Originally posted by darked89
                              I am also getting seed lengths equal to the read length (54, 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I ran it in paired-end mode and therefore assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these to (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).

                              Is there any way I can check that my FASTQ files are Tophat compatible?
                              It seems TopHat calls Bowtie with the option -v 2, which, according to the manual, means that at most 2 mismatches are allowed and that the -l option (which specifies the seed length) is ignored. I think your fastq files are fine as long as they don't contain non-alphabetical characters in the sequences.
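As a rough answer to the "are my FASTQ files compatible" question, the two conditions raised in this thread (only alphabetic characters in the sequence, and equal sequence and quality lengths) can be checked with a short script. This is a sketch under the assumption of strict four-line records; `check_fastq` is a hypothetical name, and passing this check does not guarantee that TopHat will accept the file.

```python
from typing import Iterable, List, Tuple


def check_fastq(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (record_number, problem) pairs for four-line FASTQ records."""
    problems = []
    rows = [line.rstrip("\r\n") for line in lines]
    if len(rows) % 4 != 0:
        problems.append((len(rows) // 4 + 1, "incomplete record at end of input"))
    # Only complete records are inspected below.
    for start in range(0, len(rows) - len(rows) % 4, 4):
        number = start // 4 + 1
        header, seq, plus, qual = rows[start:start + 4]
        if not header.startswith("@"):
            problems.append((number, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((number, "third line does not start with '+'"))
        if not seq.isalpha():
            problems.append((number, "non-alphabetic character in sequence"))
        if len(seq) != len(qual):
            problems.append((number, "sequence and quality lengths differ"))
    return problems


print(check_fastq(["@r1", "AC.T", "+", "IIII"]))
# [(1, 'non-alphabetic character in sequence')]
```

To run it on a file, pass an open text handle, e.g. `check_fastq(open(path))`. For very large files the list of rows could be replaced by reading four lines at a time.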
