Seqanswers Leaderboard Ad

**kmcarr** · 04-14-2010, 12:31 PM

Originally posted by Bio.X2Y View Post

Hi there,

I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total.

I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

As I understand it (from the Wikipedia article on FastQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126.

Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

I'd appreciate it if someone could help answer the following:

(1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?

I guess I'll jump in here too. I don't know in which version of the pipeline Illumina started capping the Q score at 34. I mentioned v 1.6 because in that version of RTA they added a Q score plot to their status displays making the upper bound quite obvious. Since they have been using the same Q score scheme since v 1.3 I wouldn't be surprised if it was bounded there.

(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?

I also see a lower limit of 2 in my data. Again, I'm guessing it is some bound imposed by their algorithm.

(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?

I have seen this as well. I assumed it was some artifact of their algorithm which causes a discontinuity. In other words, you're not alone so you probably shouldn't worry about it too much.

Please note that I have no real information to base any of these statements on. It's purely speculation.

**Torst** · 04-24-2010, 08:21 PM

Originally posted by Bio.X2Y View Post

I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3).
(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?
(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?

Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

"The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."

**maubp** · 04-29-2010, 04:26 AM

That's a very useful bit of information - thanks Torst!

Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+

**Torst** · 04-29-2010, 04:19 PM

Originally posted by maubp View Post

That's a very useful bit of information - thanks Torst! Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+

I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia. Another person came and adjusted the page after me and didn't change the 1.5, so hopefully they agreed with my guesstimate.

As "B" is very poor quality anyway, if you ever see an exact run of BBBBB at the 3' end of a read, it is very likely to be the read quality indicator, so should be trimmed. In older versions of the pipeline, you would be unlikely to see such clear runs of BBBBB anyway - it would be a mix of A,B,C and so on.

**maubp** · 04-29-2010, 11:50 PM

Originally posted by Torst View Post

I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia.

My question was perhaps unclear - I checked the history and saw you'd put 1.5+ on the wiki. But as you point out, either way people will probably trim off reads at Q2 level anyway.

Here's a quick Biopython snippet to do this trimming:

OBF » Illumina FASTQ files – Read Segment Quality Control Indicator » Illumina FASTQ files – Read Segment Quality Control Indicator

http://news.open-bio.org/news/2010/04/illumina-q2-trim-fastq/

Open Bioinformatics Foundation Homepage

**Hena** · 05-03-2010, 12:23 AM

Originally posted by Torst View Post

Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

"The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."

What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?

**maubp** · 05-03-2010, 09:06 AM

Originally posted by Hena View Post

What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

That looks like an isolated B character.

Originally posted by Hena View Post

Also how the region is marked? It seems that biopython example removes everything from B onwards.

well yes, that's what I wrote the script to, and what I claimed it would do.

What version of the Illumnia pipeline did your data come from?

**Torst** · 05-03-2010, 05:25 PM

Originally posted by Hena View Post

What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?

The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.

**maubp** · 05-04-2010, 12:27 AM

Originally posted by Torst View Post

The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.

So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me - maybe as I suggested this data is from an older Illumina pipeline before the new use of B as a flag? Afterall the slide did say there would not be any isolated B characters (granted the final base likely be possible).

**Torst** · 05-04-2010, 12:36 AM

Originally posted by maubp View Post

So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me.

Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".

**maubp** · 05-04-2010, 02:02 AM

Originally posted by Torst View Post

Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".

That makes sense.

Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.

However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?

(P.S. Apologies if my recent posts were too terse, typing on a mobile device is a pain - I'm back at a full keyboard now)

**Torst** · 05-04-2010, 07:33 PM

maubp,

Originally posted by maubp View Post

That makes sense. Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.

Yes, for reads from older pipelines that don't use the RSQC, your script is not useful.

However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?

Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.

**maubp** · 05-04-2010, 11:49 PM

Originally posted by Torst View Post

Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.

I quite agree about defensive programming, and have worked out a neat way to do this alternative (trim trailing B characters only). I'll update the blog post.

Thanks for your advice Torst.

Peter

**jkbonfield** · 05-05-2010, 12:12 AM

Do they still emit varying quality values for N bases?

That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now.

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Illumina FASTQ Quality Scores - Missing Value

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News