**westerman** · 09-12-2011, 06:27 AM

Just to make sure, do you have the most recent version of FastQC? 9-9-11: Version 0.10.0 released. That version added support for CASAVA 1.8 type of files and thus may be a solution to your problem.

**PFS** · 09-12-2011, 07:23 AM

I thought that v.0.9 should be able to distinguish between encodings (see below) ... but I will try to see if the latest version can help.

Thanks!

From the release notes:
"30-3-11: Version 0.9.1 released
Added --quiet and --nogroup options to command line
Added encoding type to the basic stats
Added detection of Illumina <1.3 1.3 1.5 and 1.9 encodings"

**simonandrews** · 09-12-2011, 08:19 AM

The encoding detection hasn't changed since v0.9.1 so moving to 0.10.0 won't help.

The encoding detection is done entirely on the basis of the range of Phred values seen in the file. In order to incorrectly detect Sanger encoded data as Illumina 1.5 you'd have to have a dataset where no base call's quality value was lower than 31. This would seem very unlikely in any normal illumina dataset, unless it had been (very harshly?) quality trimmed before being put through fastqc.
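
As a rough illustration of the range-based detection described above, here is a minimal Python sketch. It is not FastQC's code: the function names are hypothetical, the input is assumed to be unwrapped four-line FASTQ records, and the cutoff of 64 is inferred from the "lower than 31" figure in the post (33 + 31 = 64).

```python
from typing import Iterable


def lowest_quality_char(fastq_lines: Iterable[str]) -> int:
    """Return the smallest ASCII code seen on any quality line.

    Assumes unwrapped four-line FASTQ records, so every fourth line
    (index 3, 7, 11, ...) is a quality string.
    """
    lowest = None
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:
            continue
        quality = line.rstrip("\r\n")
        if not quality:
            continue
        smallest = min(ord(c) for c in quality)
        if lowest is None or smallest < lowest:
            lowest = smallest
    if lowest is None:
        raise ValueError("no quality lines found")
    return lowest


def guess_offset(lowest_char: int, cutoff: int = 64) -> int:
    """Guess the ASCII offset from the lowest character alone.

    The cutoff is an assumption for illustration, not a FastQC constant.
    """
    if lowest_char < 33:
        raise ValueError("character below '!' is not valid in any encoding")
    return 33 if lowest_char < cutoff else 64


# Untrimmed Sanger-encoded read: '#' (ASCII 35) is Phred 2 at offset 33.
raw = ["@r1", "ACGT", "+", "#5?I"]
# Harshly trimmed Sanger-encoded read: lowest char '@' (ASCII 64) is Phred 31.
trimmed = ["@r1", "ACGT", "+", "@BDI"]

print(guess_offset(lowest_quality_char(raw)))      # 33
print(guess_offset(lowest_quality_char(trimmed)))  # 64
```

The second call shows the failure mode under discussion: once no base is below Phred 31, nothing in the value range distinguishes the file from offset 64 data.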

I've just double checked on some of our casava 1.8 data and the encoding is correctly detected in all of the cases I looked at.

Is there something unusual about the sequence file you analysed? Very low number of reads, or very unusual quality distribution? If it's not obvious what went wrong in this case would you be willing to make a small subset of the data available so I can see what happened?

**robs** · 09-13-2011, 06:50 PM

Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).

**simonandrews** · 09-14-2011, 12:04 AM

Originally posted by robs:

Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).

I'm really not keen on doing this. In practice there is very little ambiguity between the different encodings and in real samples it's extremely unlikely that the encoding will be mis-detected (I'm still waiting for the original author of this thread to get back to me about their sample). The only cases we've ever seen where this went wrong were in simulated datasets where samples were being given an artificially narrow range of quality values.

What we have seen numerous times is complaints that FastQC was getting the quality detection wrong when it was actually correct. Providing an option to set the encoding type will result in people getting it wrong, and this is not going to be handled well in the program. You're likely to end up with corrupted plots and odd errors which are just going to generate confusion and unnecessary bug reports.

If there are cases starting to crop up where the detection is actually wrong then please let me know. We're not seeing them, but I'm absolutely prepared to believe they exist. It may be that we can improve the algorithm which guesses the encoding to cope with them or there may be other bugs we can fix, but I think the correct answer is to get the automatic detection correct rather than have people specify the encoding manually.

**robs** · 09-14-2011, 09:53 AM

I think you should give users more credit for knowing what they do. Having the automatic detection as default, but still offering an option to specify the encoding would be nice to have. You could add a meaningful warning if someone specifies an encoding that the program does not agree with. (The overlap between the different encodings allows an incorrect prediction, no matter how good your automatic detection is.)

Given the "numerous times" people complained, maybe a short report/output why the specific encoding has been selected by the program might be quite useful for both sides.

**simonandrews** · 09-15-2011, 12:43 AM

The point I'd stress is that we have never yet seen a real sample where the encoding was guessed incorrectly (maybe between illumina 1.3 and 1.5, but the offset is the same for those two anyway so it makes no difference). I know there are cases where this could theoretically happen but until we actually see that then adding this option is just something to go wrong.

The complaints we've had before have all been resolved either by finding that the pipeline version used wasn't what people expected, or that the encodings had been altered by a third party (SRA recodes into Sanger encoding in some cases, for example), or on a couple of occasions by finding that the file had become corrupted. None of these cases would have been helped by adding a forced encoding mode.

In terms of reporting why an encoding was selected, it's really just done off the lowest untransformed value so there's not much which could be reported.

**curtish** · 10-07-2011, 09:31 AM

Simon,

First, we love FastQC, and are particularly addicted to having it available in our local Galaxy installation! It has saved us from many headaches.

So, I'm not sure you would consider this a "real" sample, but it's a real nuisance for us. We're working on a type of metagenomics project where we must use only reads with no low-quality bases. So, after FastQC'ing the raw reads, we *do* filter them very aggressively. We then run FastQC again to see what our selected subpopulation of high quality reads looks like. Unfortunately, FastQC decided our Illumina1.9/fastqsanger reads are really illumina1.3 reads, and the result is hard to work with. So, we will implement the ability to pass the encoding type down from Galaxy. Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...

**simonandrews** · 10-07-2011, 10:23 AM

Originally posted by curtish:

Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...

We don't have a publicly accessible source repository for FastQC, but I'm happy to take patches against the source of the latest release.

If you want to add this option then it will require a change to the wrapper to collect and validate the forced offset. This will then need to be picked up in the Sequence.QualityEncoding.PhredEncoding class. I'd suggest that the change be structured such that the suggested offset is overridden if the lowest encoding found in the file is lower than the offset supplied, to avoid odd errors elsewhere. Alternatively you could have the getFastQEncodingOffset method throw an exception if the supplied encoding isn't compatible with the data, but this will require modifications in a number of places.
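
The two behaviours suggested above (silently falling back, or throwing an exception) might look roughly like the following. This is only a Python sketch of the logic, not a patch against FastQC's Java source; `resolve_offset`, `IncompatibleEncodingError` and the `strict` flag are hypothetical names introduced for illustration.

```python
from typing import Optional


class IncompatibleEncodingError(ValueError):
    """Raised when a forced offset cannot be valid for the data (hypothetical)."""


def resolve_offset(
    lowest_char: int,
    detected_offset: int,
    forced_offset: Optional[int] = None,
    strict: bool = False,
) -> int:
    """Pick the offset to use, given an optional user-supplied one.

    lowest_char     -- smallest ASCII code seen in any quality string
    detected_offset -- what automatic detection came up with
    forced_offset   -- offset requested by the user, or None
    strict          -- raise instead of falling back when incompatible
    """
    if forced_offset is None:
        return detected_offset
    if lowest_char < forced_offset:
        # The file contains a character below the requested offset, which
        # would decode to a negative quality score.
        if strict:
            raise IncompatibleEncodingError(
                f"lowest character {lowest_char} is below offset {forced_offset}"
            )
        return detected_offset
    return forced_offset


# Heavily trimmed Sanger data detected as offset 64; the user forces 33.
print(resolve_offset(lowest_char=64, detected_offset=64, forced_offset=33))  # 33
# The user wrongly forces 64 on data that clearly needs 33; fall back.
print(resolve_offset(lowest_char=40, detected_offset=33, forced_offset=64))  # 33
```

The fallback variant keeps downstream plots sane when a user picks the wrong encoding, while the strict variant makes the mismatch visible at the cost of handling the error in more places.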

**david_2012** · 03-02-2012, 08:53 AM

encoding-specification through command-line option would be welcome

Hey Simon,

I can only second curtish. Both in how useful FastQC is as a tool and in how useful it would be, to have a command-line option that specifies a certain quality encoding.

In my case, I did some strong quality trimming, resulting in no quality scores 31 or lower. And that in turn makes FastQC guess it is Illumina <1.3 encoding as opposed to the correct encoding, which is Illumina 1.8+.

So is there a patch available yet, curtish? Or is this planned for future versions of FastQC?

Thanks,
David

**magofiura** · 04-20-2012, 01:20 AM

Same problem as above.
Does someone know a way to fix or bypass it?

Thanks,

Leo.

**Axel** · 05-20-2014, 09:29 AM

Same problem as those above. I have reads encoded at Illumina 1.9 which a first pass of FastQC correctly identifies. I filter my reads very heavily leaving no reads with quality below 31. On the second pass FastQC mis-identifies the encoding as Illumina <1.3.

I love the tool as it is and will continue using it, but a function where the user can specify encoding in addition to the automatic detection would be really good.

**simonandrews** · 05-21-2014, 02:56 AM

We've had an ongoing discussion about this issue for some time and we've gone over this again this morning and I think we've decided on a way forward.

Our basic position has always been that we didn't want to introduce a flag to force an encoding since our experience has been that the vast majority (but not all) of reports of mis-detection we've had have turned out to be correct detection, and the file wasn't what the user thought it was. True mis-detection only occurs on data which has been manipulated (usually by quality trimming) - we've never seen a raw sequencing file which got the detection wrong.

The problem is that for trimmed data the window for unambiguous detection isn't as wide as we'd like. From a base 33 encoding you become ambiguous at 59, meaning that data trimmed to a Phred of above 26 (about 3/1000 errors) can be mis-detected, and that is a realistic level at which people could filter.

The reason for putting the break at 59 was to support the Illumina <1.3 files, which used a Base64 encoding, but which allowed quality scores down to -5. Normal Phred 64 wouldn't become ambiguous until 64 which would be a Phred of 31 (below 1/1000 errors).

To try to alleviate this situation we're therefore going to remove support for the Illumina <1.3 encoding in the next (imminent) fastqc release. Since this was replaced in 2009 we don't envisage that this will have much of an effect on anyone, and will mean that as long as data is not trimmed so that no base is less than Q31 the auto-detection will still work.
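
The numbers quoted above can be checked with a few lines of Python. This is just the arithmetic from the post, using the standard Phred relation p = 10^(-Q/10); the function names are made up for the example.

```python
def phred_at_cutoff(cutoff_char: int, offset: int = 33) -> int:
    """Phred score that a base-33 file must stay at or above to hit the cutoff."""
    return cutoff_char - offset


def error_probability(phred: int) -> float:
    """Convert a Phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# 59: break used while Illumina <1.3 (scores down to -5 at base 64) was supported.
# 64: break once that encoding is dropped.
for cutoff in (59, 64):
    q = phred_at_cutoff(cutoff)
    p = error_probability(q)
    print(f"cutoff {cutoff}: ambiguous once every base is >= Q{q} "
          f"(error probability about {p:.4f})")
```

With the break at 59 the ambiguity starts at Q26 (roughly 0.0025, about 3 errors per 1000 bases); moving it to 64 pushes that to Q31 (roughly 0.0008, below 1 per 1000).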

**blakeoft** · 05-21-2014, 07:41 AM

Could you include a read at the beginning of the fastq file with the following structure:

@readName
AA
+
mM

where m and M are the min and max possible quality scores used by your encoding, respectively? Sure, this will throw off your data, but since it's only one read, I think that it won't make that much of a difference. I'm not sure how FASTQC works, but I assume that it keeps track of the 'smallest' and 'biggest' qual scores that are observed throughout all of the reads. If both extremes are present right at the start, it would seem to me that it wouldn't have much of a chance at getting it wrong.
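
A minimal Python sketch of this workaround, for anyone who wants to try it. The helper names are hypothetical, and the example characters are assumptions: '!' is Phred 0 at offset 33, and 'J' (Phred 41) is assumed here as the top of the Illumina 1.8+ range, so adjust both to your own data.

```python
from typing import Iterable, Iterator, List


def sentinel_record(min_char: str, max_char: str,
                    name: str = "encoding_sentinel") -> List[str]:
    """Build a two-base FASTQ record carrying the extreme quality characters."""
    if len(min_char) != 1 or len(max_char) != 1:
        raise ValueError("min_char and max_char must be single characters")
    if ord(min_char) > ord(max_char):
        raise ValueError("min_char must not sort after max_char")
    return [f"@{name}", "AA", "+", min_char + max_char]


def with_sentinel(fastq_lines: Iterable[str],
                  min_char: str, max_char: str) -> Iterator[str]:
    """Yield the sentinel record first, then the original lines unchanged."""
    yield from sentinel_record(min_char, max_char)
    for line in fastq_lines:
        yield line.rstrip("\r\n")


# A trimmed read whose lowest quality character is '@' (Phred 31 at offset 33).
trimmed = ["@r1", "ACGT", "+", "@BDI"]
for line in with_sentinel(trimmed, "!", "J"):
    print(line)
```

The extra read adds two bases to every per-base statistic, so it is worth remembering it is there when reading the resulting plots.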

FASTQC guessing wrong quality encoding
