  • gaffa
    Member
    • Oct 2010
    • 82

    Accepted practices of NGS quality filtering?

    Hi all,

    There is a lot of software for performing various forms of quality control and filtering on short-read data generated by NGS platforms, but it's much harder to find information about what decisions one should actually make when filtering - where to put your thresholds, how much to trim, and so on.

    I have just begun to map >100 bp Illumina reads to a small genome, but I seem to have a lot of noise in the data set (both adapter contamination and low-quality sequence), resulting in low mapping rates. Quality plummets towards the 3' end, and trimming all reads gives better rates, but of course you don't want to trim away good sequence.

    I understand BWA has a pretty neat approach to trimming reads individually (the -q flag), but there is the question of what value to set for this parameter.
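
    As far as I can tell (and please correct me if I've misunderstood), the rule works roughly like this - a Python sketch of the phred-style trimming I believe -q implements, with the threshold and minimum length being made-up defaults:

        def phred_style_trim_len(quals, q=20, min_len=35):
            # quals: per-base phred quality scores, 5' to 3'
            # Scan from the 3' end accumulating (q - Q[i]); keep the prefix
            # that ends where this running sum is maximal. A read whose sum
            # never goes positive is kept at full length.
            running, best, keep = 0, 0, len(quals)
            for i in range(len(quals) - 1, min_len - 1, -1):
                running += q - quals[i]
                if running < 0:
                    break
                if running > best:
                    best, keep = running, i
            return keep

    So a read would be kept as read[:phred_style_trim_len(quals, q=20)], and the whole question is what q to pass.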

    Other approaches are also mentioned every now and then - discarding whole reads that fall below some quality threshold, counting ambiguous bases ("N"s), looking for short windows of poor-quality sequence, and so on. These sound good in principle, but here too it is hard to know exactly how aggressively to apply them. Another thought is that if a read is really bad it probably won't align anyway - but it still feels better to identify and exclude such reads beforehand, right? I have also thought about mapping full-length reads, then trimming the unmapped reads and trying them again. There seem to be many different approaches one could take, and it's not obvious which would be best.
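
    To make the whole-read filtering idea concrete, this is the kind of thing I mean - a toy sketch where every threshold is a number I would somehow have to justify:

        def passes_filters(seq, quals, max_n=2, min_mean_q=20,
                           window=10, min_window_q=15):
            # Reject reads with too many ambiguous bases
            if seq.upper().count('N') > max_n:
                return False
            # Reject reads whose overall mean quality is poor
            if sum(quals) / float(len(quals)) < min_mean_q:
                return False
            # Reject reads containing any short window of poor-quality calls
            for i in range(len(quals) - window + 1):
                if sum(quals[i:i + window]) / float(window) < min_window_q:
                    return False
            return True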

    Since this is such a common task, you would expect some kind of standard practice to have emerged by now - at least there should be a lot of people with experience of these kinds of decisions. So does anyone have any opinions on this, or possibly links to other resources on the topic? Many thanks in advance.
  • Bruins
    Member
    • Feb 2010
    • 78

    #2
    Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.


    • simonandrews
      Simon Andrews
      • May 2009
      • 870

      #3
      Originally posted by Bruins
      Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.
      Good luck with that! This field is new enough that I don't think anyone has a definitive answer for what you should filter and at what cutoff. If there is a book on this I'm not sure I'd trust it.

      I suppose this boils down to there being two kinds of quality problem: either you're making calls with low confidence, or you're making correct calls on something you don't want (e.g. adapters).

      For low-confidence data you would ideally leave everything in place and have your downstream analysis tools take the confidence into account, so mappers and aligners won't care too much if low-confidence calls mismatch. This lets you retain as much information as possible - which should be a good thing. However, this falls down if the scores assigned to your calls prove not to be accurate - which is probably the case a lot of the time - and then you end up ignoring good-quality data because of the poor data on the end. We've therefore often decided to trim really poor sequence from our data (normally by truncating a whole run at a particular position), since aligners then have less excuse for getting the mapping wrong.
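
      (Mechanically, the whole-run truncation is trivial - something like this Python sketch for plain 4-line FASTQ records, where the cutoff of 75 is just a placeholder you'd read off the run's quality plot:)

          def truncate_fastq(in_path, out_path, keep=75):
              # Truncate every read in a 4-line-record FASTQ file to `keep` bases
              with open(in_path) as fin, open(out_path, 'w') as fout:
                  while True:
                      header = fin.readline()
                      if not header:
                          break  # end of file
                      seq = fin.readline().rstrip('\n')
                      plus = fin.readline()
                      qual = fin.readline().rstrip('\n')
                      fout.write(header)
                      fout.write(seq[:keep] + '\n')
                      fout.write(plus)
                      fout.write(qual[:keep] + '\n')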

      In some applications (SNP calling, bisulphite seq etc) you may prefer to never have to deal with low confidence calls and so would trim your data at an early stage and just live with the reduced coverage you get, rather than have to worry about dealing with a large number of low confidence predictions later on.

      For contaminated data you may not have to worry about the contamination - if you're getting some adapter sequence in your data then it probably won't align to your reference and you can just ignore it. However, if you have partial adapter sequence on the end of real data, it can make a mess of the alignment, creating false overlaps between otherwise unrelated sequences. As read lengths get longer you may find that an increasing percentage of your library has some adapter on the end of the reads, and it becomes more important to remove this to preserve as much data as possible.
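
      The partial-adapter case is the one that needs actual work: you're looking for a suffix of the read that matches a prefix of the adapter. A toy sketch (exact matches only - real trimmers allow mismatches and score the overlap; the adapter sequence is whatever your library prep used):

          def trim_3prime_adapter(seq, adapter, min_overlap=5):
              # Try the longest possible overlap first, so a full adapter hit
              # wins over a shorter spurious one
              for olap in range(min(len(seq), len(adapter)), min_overlap - 1, -1):
                  if seq.endswith(adapter[:olap]):
                      return seq[:-olap]
              return seq  # no adapter found; read unchanged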


      • gaffa
        Member
        • Oct 2010
        • 82

        #4
        Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position; however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.

        I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.

        It would be interesting if anyone had additional figures on what kind of success rate one can expect.


        • simonandrews
          Simon Andrews
          • May 2009
          • 870

          #5
          Originally posted by gaffa
          Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position; however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.
          My concern with that approach would be that I might be biasing my results. What if AT-rich sequences show poorer quality? Would I introduce a %GC bias by trimming each read individually? If a whole run is becoming poor quality then trimming the whole thing is effectively the same as doing a shorter run, and I'm happier with that. Also, depending on your downstream analysis, it may be trickier to handle runs with variable read lengths - a lot of the statistics are easier if you can remove read length as a factor you have to consider.
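
          That worry is at least cheap to check: compare the %GC of the same library before and after per-read trimming, and a shift tells you the trimming is selecting on base composition rather than purely on quality. A sketch (comparing the full distributions would be better than just the means):

              def gc_percent(seq):
                  # Percentage of G and C bases in one read
                  seq = seq.upper()
                  return 100.0 * (seq.count('G') + seq.count('C')) / len(seq)

              def mean_gc(reads):
                  # Mean %GC over a list of read sequences
                  return sum(gc_percent(r) for r in reads) / len(reads)

              # e.g. compare mean_gc(untrimmed_reads) with mean_gc(trimmed_reads),
              # where the two lists hold the sequences before and after trimming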

          Originally posted by gaffa
          I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.
          30-40% mapping is absurdly low for most applications. We generally see mapping efficiency over 70% for RNA-Seq, and maybe 60% for ChIP-Seq with 40 bp reads (this will be very antibody-dependent). In many cases the reads which don't map (at least in our case) are things which aren't present in the assemblies (centromeres, telomeres etc.), or regions which are duplicated with high identity and would require much longer sequence to map uniquely. We have a repeat-mapping pipeline where we assign reads to repeat classes, and don't care if they map to more than one instance of the class (or even to multiple classes). This lets us look at most of the otherwise unmappable data, albeit in a slightly different way to the conventionally mapped regions.


          • lh3
            Senior Member
            • Feb 2008
            • 686

            #6
            1. BWA's quality trimming method is taken from phred.

            2. I have seen quite a lot of papers citing Harismendy et al. (2009). This paper was great at the time of submission (end of 2008; the sequencing would have been done even earlier), but it is not representative any more. NGS is a fast-changing field; many things have happened in the past two years.


            • bioinfosm
              Senior Member
              • Jan 2008
              • 483

              #7
              I totally agree with lh3 on the second point. Also, we have seen that in the majority of cases, letting the aligner and variant caller deal with low quality works fine. If quality really goes downhill, it is usually consistent across all lanes, and using a pre-defined trim for all reads (essentially a shorter run) avoids the bias Simon mentioned.

              FastQC is very useful for summarizing all of this information.
              --
              bioinfosm


              • gaffa
                Member
                • Oct 2010
                • 82

                #8
                Thanks for the replies. I wonder if anyone has opinions on what Q-value cutoffs should be used - I've seen the values 15 and 20 thrown around.

                One thing that I've been thinking about is to map the full reads first, then take the unmapped ones, trim them from the 3' end and try them again. I haven't seen this approach used, but it doesn't seem like it should be that controversial (though of course it would be time-consuming).
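
                Roughly what I have in mind, as an (untested) sketch: run the aligner as usual for the first pass, then pull the unmapped reads out of the resulting SAM, trim them and re-run. The 50 bp cutoff and file names are placeholders:

                    def unmapped_to_trimmed_fastq(sam_path, fastq_path, trim_to=50):
                        # Unmapped reads carry FLAG bit 0x4; SAM columns 1, 10 and 11
                        # are the read name, sequence and qualities. Unmapped reads
                        # have no strand, so no reverse-complementing is needed.
                        with open(sam_path) as sam, open(fastq_path, 'w') as fq:
                            for line in sam:
                                if line.startswith('@'):
                                    continue  # skip header lines
                                f = line.rstrip('\n').split('\t')
                                if int(f[1]) & 0x4:
                                    fq.write('@%s\n%s\n+\n%s\n'
                                             % (f[0], f[9][:trim_to], f[10][:trim_to]))

                The trimmed FASTQ would then go back through the aligner, and the two sets of alignments get merged afterwards.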
                Last edited by gaffa; 11-17-2010, 09:09 AM.
