  • PFS
    Member
    • Mar 2010
    • 55

    FASTQC guessing wrong quality encoding

    Hello,

    I have some Illumina files processed with CASAVA 1.8.
    The program FASTQC is guessing the format to be Illumina 1.5.
    Is there a way to explicitly tell fastqc what encoding the data is? If not, what else can I do?

    Thanks!
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    Just to make sure, do you have the most recent version of FastQC? 9-9-11: Version 0.10.0 released. That version added support for CASAVA 1.8 type of files and thus may be a solution to your problem.

    Comment

    • PFS
      Member
      • Mar 2010
      • 55

      #3
      I thought that v.0.9 should be able to distinguish between encodings (see below) ... but I will try to see if the latest version can help.

      Thanks!


      From the release notes:
      "30-3-11: Version 0.9.1 released
      Added --quiet and --nogroup options to command line
      Added encoding type to the basic stats
      Added detection of Illumina <1.3 1.3 1.5 and 1.9 encodings"

      Comment

      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        The encoding detection hasn't changed since v0.9.1 so moving to 0.10.0 won't help.

        The encoding detection is done entirely on the basis of the range of Phred values seen in the file. In order to incorrectly detect Sanger encoded data as Illumina 1.5 you'd have to have a dataset where no base call's quality value was lower than 31. This would seem very unlikely in any normal Illumina dataset, unless it had been (very harshly?) quality trimmed before being put through FastQC.

        I've just double checked on some of our CASAVA 1.8 data and the encoding is correctly detected in all of the cases I looked at.
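
For anyone who wants to check what range of raw quality characters their own file contains, here is a minimal Python sketch. It is not FastQC's code; the function name `quality_char_range` is made up for illustration, and it assumes plain four-line FASTQ records with no wrapped sequence or quality lines.

```python
from typing import Iterable, Tuple


def quality_char_range(fastq_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (lowest, highest) ASCII code seen on the quality lines.

    Assumes four-line FASTQ records: header, sequence, '+', quality.
    """
    lowest, highest = 255, 0
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:  # only every fourth line holds quality characters
            continue
        codes = [ord(ch) for ch in line.rstrip("\r\n")]
        if codes:
            lowest = min(lowest, min(codes))
            highest = max(highest, max(codes))
    if highest == 0:
        raise ValueError("no quality lines found")
    return lowest, highest


# Tiny in-memory examples (hypothetical reads, offset 33 encoding).
untrimmed = ["@r1", "ACGT", "+", "#5?I"]
trimmed = ["@r2", "ACGT", "+", "BDFI"]

print(quality_char_range(untrimmed))  # (35, 73)
print(quality_char_range(trimmed))    # (66, 73)
```

In the second example the lowest raw character is 66, i.e. Phred 33 with an offset of 33, so nothing in the data falls below Phred 31. That is the situation described above in which range-based detection can no longer tell the two offsets apart.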

        Is there something unusual about the sequence file you analysed? Very low number of reads, or very unusual quality distribution? If it's not obvious what went wrong in this case would you be willing to make a small subset of the data available so I can see what happened?

        Comment

        • robs
          Senior Member
          • May 2010
          • 116

          #5
          Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).

          Comment

          • simonandrews
            Simon Andrews
            • May 2009
            • 870

            #6
            Originally posted by robs:
            Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).
            I'm really not keen on doing this. In practice there is very little ambiguity between the different encodings and in real samples it's extremely unlikely that the encoding will be mis-detected (I'm still waiting for the original author of this thread to get back to me about their sample). The only cases we've ever seen where this went wrong were in simulated datasets where samples were being given an artificially narrow range of quality values.

            What we have seen numerous times is complaints that FastQC was getting the quality detection wrong when it was actually correct. Providing an option to set the encoding type will result in people getting it wrong, and this is not going to be handled well in the program. You're likely to end up with corrupted plots and odd errors which are just going to generate confusion and unnecessary bug reports.

            If there are cases starting to crop up where the detection is actually wrong then please let me know. We're not seeing them, but I'm absolutely prepared to believe they exist. It may be that we can improve the algorithm which guesses the encoding to cope with them or there may be other bugs we can fix, but I think the correct answer is to get the automatic detection correct rather than have people specify the encoding manually.

            Comment

            • robs
              Senior Member
              • May 2010
              • 116

              #7
              I think you should give users more credit for knowing what they are doing. Having the automatic detection as the default, but still offering an option to specify the encoding, would be nice to have. You could add a meaningful warning if someone specifies an encoding that the program does not agree with. (The overlap between the different encodings allows an incorrect prediction, no matter how good your automatic detection is.)

              Given the "numerous times" people complained, maybe a short report explaining why the specific encoding has been selected by the program might be quite useful for both sides.

              Comment

              • simonandrews
                Simon Andrews
                • May 2009
                • 870

                #8
                The point I'd stress is that we have never yet seen a real sample where the encoding was guessed incorrectly (maybe between illumina 1.3 and 1.5, but the offset is the same for those two anyway so it makes no difference). I know there are cases where this could theoretically happen but until we actually see that then adding this option is just something to go wrong.

                The complaints we've had before have all been resolved either by finding that the pipeline version used wasn't what people expected, or that the encodings had been altered by a third party (SRA recodes into Sanger encoding in some cases, for example), or on a couple of occasions by finding that the file had become corrupted. None of these cases would have been helped by adding a forced encoding mode.

                In terms of reporting why an encoding was selected, it's really just done off the lowest untransformed value so there's not much which could be reported.

                Comment

                • curtish
                  Junior Member
                  • Oct 2011
                  • 2

                  #9
                  Simon,

                  First, we love FastQC, and are particularly addicted to having it available in our local Galaxy installation! It has saved us from many headaches.

                  So, I'm not sure you would consider this a "real" sample, but it's a real nuisance for us. We're working on a type of metagenomics project where we must use only reads with no low-quality bases. So, after FastQC'ing the raw reads, we *do* filter them very aggressively. We then run FastQC again to see what our selected subpopulation of high quality reads looks like. Unfortunately, FastQC decided our Illumina1.9/fastqsanger reads are really illumina1.3 reads, and the result is hard to work with. So, we will implement the ability to pass the encoding type down from Galaxy. Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...

                  Comment

                  • simonandrews
                    Simon Andrews
                    • May 2009
                    • 870

                    #10
                    Originally posted by curtish:
                    Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...
                    We don't have a publicly accessible source repository for FastQC, but I'm happy to take patches against the source of the latest release.

                    If you want to add this option then it will require a change to the wrapper to collect and validate the forced offset. This will then need to be picked up in the Sequence.QualityEncoding.PhredEncoding class. I'd suggest that the change be structured such that the suggested offset is overridden if the lowest encoding found in the file is lower than the offset supplied, to avoid odd errors elsewhere. Alternatively you could have the getFastQEncodingOffset method throw an exception if the supplied encoding isn't compatible with the data, but this will require modifications in a number of places.
                    Last edited by simonandrews; 10-07-2011, 10:24 AM. Reason: Spelling fail!
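
The two behaviours suggested in this post (silently override an incompatible forced offset, or raise an error instead) can be sketched roughly as follows. This is a Python illustration only, not a patch against FastQC, which is written in Java; `resolve_offset`, its parameters, and the `strict` switch are hypothetical names invented for the example.

```python
VALID_OFFSETS = (33, 64)


def resolve_offset(lowest_char: int, forced_offset: int,
                   detected_offset: int, strict: bool = False) -> int:
    """Decide which offset to use when the user supplies one.

    lowest_char     -- lowest raw ASCII code found on any quality line
    forced_offset   -- offset requested by the user (hypothetical option)
    detected_offset -- offset the automatic detection would have chosen
    strict          -- if True, raise instead of silently overriding
    """
    if forced_offset not in VALID_OFFSETS:
        raise ValueError(f"unsupported offset: {forced_offset}")
    if lowest_char >= forced_offset:
        # The data is compatible with the request, so honour it.
        return forced_offset
    if strict:
        raise ValueError(
            f"offset {forced_offset} is incompatible with lowest "
            f"quality character {lowest_char}"
        )
    # The file contains values below the requested offset: fall back.
    return detected_offset


# Harshly trimmed offset-33 data (lowest char 66): the request is honoured.
print(resolve_offset(66, forced_offset=33, detected_offset=64))  # 33
# Offset 64 requested but the file contains char 40: overridden.
print(resolve_offset(40, forced_offset=64, detected_offset=33))  # 33
```

The non-strict path matches the "override to avoid odd errors elsewhere" suggestion; the strict path corresponds to the alternative of throwing an exception.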

                    Comment

                    • david_2012
                      Junior Member
                      • Mar 2012
                      • 4

                      #11
                      encoding-specification through command-line option would be welcome

                      Hey Simon,

                      I can only second curtish, both in how useful FastQC is as a tool and in how useful it would be to have a command-line option that specifies a certain quality encoding.

                      In my case, I did some strong quality trimming, resulting in no quality scores of 31 or lower. That in turn makes FastQC guess it is Illumina <1.3 encoding as opposed to the correct encoding, which is Illumina 1.8+.

                      So is there a patch available yet, curtish? Or is this planned for future versions of FastQC?

                      Thanks,
                      David

                      Comment

                      • magofiura
                        Junior Member
                        • Jan 2012
                        • 2

                        #12
                        Same problem as above.
                        Does someone know a way to fix or bypass it?

                        Thanks,

                        Leo.

                        Comment

                        • Axel
                          Junior Member
                          • Feb 2014
                          • 8

                          #13
                          Same problem as those above. I have reads encoded at Illumina 1.9 which a first pass of FastQC correctly identifies. I filter my reads very heavily leaving no reads with quality below 31. On the second pass FastQC mis-identifies the encoding as Illumina <1.3.

                          I love the tool as it is and will continue using it, but a function where the user can specify encoding in addition to the automatic detection would be really good.

                          Comment

                          • simonandrews
                            Simon Andrews
                            • May 2009
                            • 870

                            #14
                            We've had an ongoing discussion about this issue for some time. We went over it again this morning and I think we've decided on a way forward.

                            Our basic position has always been that we didn't want to introduce a flag to force an encoding since our experience has been that the vast majority (but not all) of reports of mis-detection we've had have turned out to be correct detection, and the file wasn't what the user thought it was. True mis-detection only occurs on data which has been manipulated (usually by quality trimming) - we've never seen a raw sequencing file which got the detection wrong.

                            The problem is that for trimmed data the window for unambiguous detection isn't as wide as we'd like. From a base 33 encoding you become ambiguous at 59, meaning that data trimmed to a Phred of above 26 (about 3/1000 errors) can no longer be detected unambiguously, and that is a realistic level at which people could filter.

                            The reason for putting the break at 59 was to support the Illumina <1.3 files, which used a base 64 encoding but allowed quality scores down to -5. Normal Phred 64 wouldn't become ambiguous until 64, which would be a Phred of 31 (below 1/1000 errors).
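
The arithmetic behind those two break points can be checked with a few lines of Python. This is only a worked example of the standard Phred relation, error probability = 10^(-Q/10); the function name is made up.

```python
def error_probability(phred: int) -> float:
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-phred / 10)


# Raw ASCII break points mentioned above, read as offset-33 Phred scores.
for boundary in (59, 64):
    phred = boundary - 33
    print(f"char {boundary} -> Phred {phred} -> "
          f"error rate {error_probability(phred):.5f}")
```

A raw value of 59 corresponds to Phred 26 (roughly 2.5 errors per 1000 bases), and 64 corresponds to Phred 31 (a little under 1 per 1000).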

                            To try to alleviate this situation we're therefore going to remove support for the Illumina <1.3 encoding in the next (imminent) fastqc release. Since this was replaced in 2009 we don't envisage that this will have much of an effect on anyone, and will mean that as long as data is not trimmed so that no base is less than Q31 the auto-detection will still work.
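
To illustrate the effect of dropping the Illumina <1.3 case, here is a minimal sketch of offset guessing from the lowest raw quality character, with the break at 59 or at 64. It uses only the two thresholds quoted in this post; it is not FastQC's actual code, it does not try to tell Illumina 1.3 from 1.5 (both use offset 64), and `guess_offset` and `support_old_illumina` are hypothetical names.

```python
def guess_offset(lowest_char: int, support_old_illumina: bool = True) -> int:
    """Guess the quality offset (33 or 64) from the lowest raw ASCII code.

    With support for Illumina <1.3 (scores down to -5, offset 64) the break
    sits at 59; without it the break can move up to 64.
    """
    if lowest_char < 33:
        raise ValueError(f"no known encoding has a character below 33: {lowest_char}")
    boundary = 59 if support_old_illumina else 64
    return 33 if lowest_char < boundary else 64


# Offset-33 data trimmed so that the lowest score is Q28 (raw char 61).
print(guess_offset(61, support_old_illumina=True))   # 64 (mis-detected)
print(guess_offset(61, support_old_illumina=False))  # 33 (correct)

# Trimmed so that nothing is below Q31 (raw char 64): still ambiguous.
print(guess_offset(64, support_old_illumina=False))  # 64
```

The last call shows the remaining limit described above: once no base is below Q31, the lowest character alone cannot distinguish the two offsets.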

                            Comment

                            • blakeoft
                              Member
                              • Oct 2013
                              • 79

                              #15
                              Could you include a read at the beginning of the fastq file with the following structure:
                              @readName
                              AA
                              +
                              mM
                              where m and M are the min and max possible quality scores used by your encoding, respectively? Sure, this will throw off your data, but since it's only one read, I think that it won't make that much of a difference. I'm not sure how FASTQC works, but I assume that it keeps track of the 'smallest' and 'biggest' qual scores that are observed throughout all of the reads. If both extremes are present right at the start, it would seem to me that it wouldn't have much of a chance at getting it wrong.
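
The workaround proposed here could be scripted along these lines. This is a sketch only: the helper names are invented, the choice of '!' and 'J' as the extremes for Sanger / Illumina 1.8+ data is an assumption ('!' is Phred 0 at offset 33, 'J' is Phred 41), and whether the extra read actually changes FastQC's guess rests on the poster's assumption about how the detection works.

```python
def sentinel_record(min_char: str, max_char: str, name: str = "readName") -> str:
    """Build a two-base FASTQ record carrying the extreme quality characters."""
    if len(min_char) != 1 or len(max_char) != 1:
        raise ValueError("min_char and max_char must be single characters")
    if ord(min_char) > ord(max_char):
        raise ValueError("min_char must not sort after max_char")
    return f"@{name}\nAA\n+\n{min_char}{max_char}\n"


def with_sentinel(fastq_text: str, min_char: str, max_char: str) -> str:
    """Return the FASTQ text with the sentinel record placed first."""
    return sentinel_record(min_char, max_char) + fastq_text


# Hypothetical trimmed read whose lowest quality character is 'B' (66).
trimmed = "@r1\nACGT\n+\nBDFI\n"
patched = with_sentinel(trimmed, "!", "J")
print(patched)
```

As noted in the post, the extra read will show up in the report, so it slightly distorts the per-base statistics.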

                              Comment
