
  • Problem with quality in fastq file

    Hi all,

    I have a small problem with the bowtie2 aligner related to the quality values of the reads in my FASTQ file. I have some raw RNA-seq data (Illumina, single end, 50 bp), and when I try to align it against my reference sequence, after a while it prints messages like "read HWI-ST766:125........ has more quality values than read characters" or "read HWI-ST766:125........ has spaces in quality check".

    So, obviously, there are some reads in my file that are "corrupted" in some way and bowtie2 doesn't like that.

    I tried deleting these specific lines with grep and sed. It worked, but it takes too long and I can't do it every time I have this issue.

    So I was wondering if I could somehow clean my data according to the quality, or perhaps eliminate all the reads that make bowtie2 fail...

    Does anyone have a clue how to do this? It's frustrating, because I feel I'm not far from getting my results, but there is always something else!

    Here is my command for alignment (if it can help):

    bowtie2 -q -a -p 6 -t -x IndexFile -U FastqFile -S SamFile

    Thank you in advance!

  • #2
    Personally I would first try to redownload the FASTQ file in case it was corrupted over the network, and if applicable repeat the decompression as well - again, just in case there is a bad sector on your drive or something. It might also be worth running a test on your RAM (e.g. memcheck) to make sure that is working fine - otherwise you can get problems from that too, e.g. bases flipping as in http://mira-assembler.sourceforge.ne...onus_part.html



    • #3
      How did it get that way?

      Assuming the bowtie2 error messages speak the truth (which you can verify by examining the relevant fastq lines), I'd sure recommend tracing the problem back to its source, rather than trying to clean up the data after the fact.

      Are the bad reads interspersed with good ones, or do they fall at the end of the fastq file? In the latter case, you may have filled up your disk.

      What is the output of the instrument -- .bcl files? How do you turn that into fastq files?

      Do the reads which bowtie2 does NOT complain about look plausible? E.g., quality characters in the correct range?

      If all else fails, post a few (6) reads here, showing a bad read in context with other 'good' ones.

      --SP
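A minimal sketch of the check suggested above: it walks a FASTQ file and reports which records are malformed and where they sit in the file, so it is easy to see whether the bad reads are interspersed or clustered at the end. The function name `find_bad_records` is hypothetical, and the sketch assumes plain four-line records (no wrapped sequence lines). If a record has lost a line, every record after it will be reported too, which is itself a useful symptom.

```python
from typing import Iterable, Iterator, List, Tuple


def find_bad_records(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (record_index, read_name, problem) for each malformed record.

    Assumes every record is exactly four lines: header, sequence, '+', quality.
    Record indexes are 1-based.
    """
    buf: List[str] = []
    index = 0
    for raw in lines:
        buf.append(raw.rstrip("\r\n"))
        if len(buf) < 4:
            continue
        header, seq, plus, qual = buf
        buf = []
        index += 1
        # Read name is the header up to the first space, without the '@'.
        name = header[1:].split(" ")[0] if header.startswith("@") else header
        if not header.startswith("@"):
            yield index, name, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield index, name, "separator line does not start with '+'"
        elif " " in qual or "\t" in qual:
            yield index, name, "whitespace in quality string"
        elif len(seq) != len(qual):
            yield index, name, f"{len(seq)} bases but {len(qual)} quality values"
    if buf:
        # Fewer than four lines left over: the file was probably cut short.
        yield index + 1, buf[0], "truncated record at end of file"
```

To use it, open the FASTQ file and loop over `find_bad_records(handle)`, printing each tuple. A truncated final record points towards an interrupted write or a full disk rather than a problem at the instrument.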



      • #4
        I too am running into this issue with quite a few datasets, using bowtie2.0.0-beta6.

        Examples:

        @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
        T10331232322220002110220221110022020211222032021222
        +
        !%85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)

        @SRR096575.4651 VAB_0513_20101119_1_SP_ANG_TG_NA11830_3_1sA_01003380693_2853_102_63/1
        T322003021302112201213211122322210023002300122221.1
        +
        !9,%.7%6-/9.)975+%),8+(<.(*19*%+&%%*2%<)'*5*&.)%(!&

        @SRR096590.1165 VAB_0510_20101117_2_SP_ANG_TG_NA11831_5_1sA_01003380706_11279_16_41/1
        T300321021030023001031320311113312212333223222232.3
        +
        !557*.7;6925=46+:>-9:-690>;%(3-2-&5)/&'5)%8%&)*(%!2


        Strangely enough, these are all the first read in their respective files, and all of them appear to be correct (i.e. same number of quality values as read chars.)



        • #5
          Originally posted by kz26
          I too am running into this issue with quite a few datasets, using bowtie2.0.0-beta6.

          ...

          Strangely enough, these are all the first read in their respective files, and all of them appear to be correct (i.e. same number of quality values as read chars.)
          Those are colour space FASTQ, and frustratingly there seem to be two schools of thought on how many quality scores are needed, specifically should there be a score for the adaptor base or not.



          • #6
            maubp, what does that mean? I have the same problem as kz26. Help please!



            • #7
              I mean some sources include a quality for the adapter, e.g. here we have an adapter plus 50 colour space calls. Should there be 51 qualities or just 50?

              Code:
              @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
              T10331232322220002110220221110022020211222032021222
              +
              !%85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)
              That file has 51 quality scores, including one for the adapter. Some tools do not expect a quality for the adapter. So if we remove the "!" for the adapter "T" in this case we'd get:

              Code:
              @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
              T10331232322220002110220221110022020211222032021222
              +
              %85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)
              I don't do any work with colour space, so I've not researched this issue. But this is my observation and guess about the apparent problem.
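A small sketch of the edit described above, assuming a colour-space record whose sequence line is one adapter base followed by the colour calls. The helper name `drop_adapter_quality` is hypothetical. Whether a given aligner wants the extra score or not is something to confirm in its documentation; this only shows how to convert from one convention to the other.

```python
def drop_adapter_quality(seq: str, qual: str) -> str:
    """Return the quality string without a score for the leading adapter base.

    seq is assumed to be a colour-space read: one adapter base, then colours.
    """
    if len(qual) == len(seq):
        # One score per character, including the adapter base: drop the first.
        return qual[1:]
    if len(qual) == len(seq) - 1:
        # Already one score per colour call only: nothing to do.
        return qual
    raise ValueError(
        f"unexpected lengths: {len(seq)} sequence characters, "
        f"{len(qual)} quality values"
    )
```

Raising on any other length mismatch is deliberate: such a record is damaged in some other way and should not be silently "fixed".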



              • #8
                What I have is this:

                @HWI-ST1146:66:C0YHCACXX:7:1101:2909:2074 1:N:0:ATCACG
                CCACTAGCTTTCCTGGCAC
                +
                JJEHIJIIJJJHEHFHFFF

                So the number of letters is the same for the read and the quality. I'm using Bowtie 0.12.7, and I've used it dozens of times before, but with output from older machines. This new data is from a HiSeq.



                • #9
                  Originally posted by afadda
                  What I have is this:

                  @HWI-ST1146:66:C0YHCACXX:7:1101:2909:2074 1:N:0:ATCACG
                  CCACTAGCTTTCCTGGCAC
                  +
                  JJEHIJIIJJJHEHFHFFF

                  So the number of letters is the same for the read and the quality. I'm using Bowtie 0.12.7, and I've used it dozens of times before, but with output from older machines. This new data is from a HiSeq.
                  Is there an error message? The recent Illumina pipelines use the original Sanger FASTQ encoding for quality scores - perhaps you are using an option specific to the obsolete Illumina specific FASTQ encoding?
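A rough heuristic for telling the two encodings apart, sketched under stated assumptions: Sanger-style FASTQ stores each quality as the character with code Phred + 33, the older Illumina-specific variants use an offset of 64 (with the lowest possible character being ';', code 59), and offset-33 files are assumed here to top out at 'J' (Phred 41), which does not hold for every platform. The function name `guess_phred_offset` is hypothetical.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns None when the observed range fits both encodings.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' cannot occur in the offset-64 variants.
        return 33
    if highest > 74:
        # Above 'J' would mean Phred > 41 at offset 33 (assumed not to occur).
        return 64
    return None
```

Note that the single read quoted above ("JJEHIJIIJJJHEHFHFFF") falls entirely in the ambiguous range, so the function returns None for it. Scanning many reads usually turns up a low-quality character that settles the question.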



                  • #10
                    Yes. The message is:
                    Too few quality values for read: HWI-ST1146:66:C0YHCACXX:7:1101:8166:5424 1:N:0:ACTTGA
                    are you sure this is a FASTQ-int file?

                    My command line is:
                    bowtie -S -a --best --strata -v2 -m14 $reference $seqfile > $samfile --un $unalignfile



                    • #11
                      OK - so what does that read look like in the FASTQ input file? You showed a different read (which was only 19 bases long, and had as expected a matching 19 quality scores).



                      • #12
                        You're absolutely right. It was a programming mistake on my side when I was trimming the reads, so the read in the error message ended up with a quality string of a different length from the sequence.
                        Thanks for troubleshooting!
                        (I should never program when sleepy.)
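One way to avoid this kind of trimming bug is to cut the sequence and the quality string with the same slice in a single place, and to refuse input that is already inconsistent. This is a minimal sketch, not the poster's actual script. The name `trim_record` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def trim_record(
    seq: str, qual: str, start: int = 0, end: Optional[int] = None
) -> Tuple[str, str]:
    """Trim a read and its quality string to the same [start, end) window."""
    if len(seq) != len(qual):
        raise ValueError(
            f"{len(seq)} bases but {len(qual)} quality values before trimming"
        )
    # A single slice object guarantees both strings are cut identically.
    window = slice(start, end)
    return seq[window], qual[window]
```

For example, `trim_record("ACGTACGT", "IIIIHHHH", 2, 6)` returns `("GTAC", "IIHH")`, so the two lines can never drift apart in length.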
