  • Accepted practices of NGS quality filtering?

    Hi all,

    There is a lot of software for performing various forms of quality control and filtering on short read data generated by NGS platforms, but it's harder to find information about what decisions one should make when doing filtering - where to put your thresholds, how much to trim etc.

    I have just begun mapping >100 bp Illumina reads to a small genome, but there seems to be a lot of noise in the data set (both adapter contamination and low-quality sequence), resulting in low mapping rates. Quality plummets towards the 3' end, and trimming all reads improves the mapping rate, but of course you don't want to trim away good sequence.

    I understand BWA has a pretty neat approach to trimming reads individually (the -q flag), but there is the question of what value to set for this parameter.

    Other approaches - discarding whole reads that fail some quality threshold, counting ambiguous bases ("N"s), looking for small windows of poor-quality sequence, etc. - are also mentioned every now and then and sound good in principle, but here too it is hard to know exactly how far to take a given approach. Another thought is that if a read is really bad it probably won't align anyway - but it still feels better to identify and exclude such reads beforehand, right? I have also thought about mapping full-length reads first, then trimming the unmapped ones and trying them again. There are many different approaches one could take, and it's not obvious which would be best.
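
    For concreteness, the window idea I mean is something like the following sketch (in Python; the thresholds are invented for illustration, not recommendations):

    def keep_read(seq, quals, max_n=2, win=10, min_window_q=15):
        # Toy read filter: discard reads with too many N calls, or with
        # any win-bp window whose mean quality drops below min_window_q.
        if seq.count("N") > max_n:
            return False
        for i in range(len(quals) - win + 1):
            if sum(quals[i:i + win]) / win < min_window_q:
                return False
        return True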

    Since this is such a common task, you would think some standard practices would have emerged by now - at least there should be a lot of people with experience of these kinds of decisions. So does anyone have any opinions on this, or possibly links to other resources on the topic? Many thanks in advance.

  • #2
    Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.



    • #3
      Originally posted by Bruins View Post
      Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.
      Good luck with that! This field is new enough that I don't think anyone has a definitive answer for what you should filter and at what cutoff. If there is a book on this I'm not sure I'd trust it.

      I suppose this boils down to there being two kinds of quality problem: either you're making calls with low confidence, or you're making correct calls of something you don't want (e.g. adapters).

      For low-confidence data you would ideally leave everything in place and have your downstream analysis tools take the confidence into account, so that mappers and aligners don't penalise mismatches at low-confidence calls too heavily. This lets you retain as much information as possible - which should be a good thing. However, it falls down if the scores assigned to your calls turn out not to be accurate - which is probably the case a lot of the time - and you end up discarding good-quality data because of the poor data at the end of the read. We have therefore often decided to trim really poor sequence from our data (normally by truncating a whole run at a particular position), since aligners then have less excuse for getting the mapping wrong.

      In some applications (SNP calling, bisulphite sequencing, etc.) you may prefer never to have to deal with low-confidence calls, and so would trim your data at an early stage and just live with the reduced coverage, rather than worry about handling a large number of low-confidence predictions later on.

      For contaminated data you may not have to worry about the contamination at all - if you're getting some adapter sequence in your data, it probably won't align to your reference and you can just ignore it. However, if you have partial adapter sequence on the end of real data, this can make a mess of alignment, creating false overlaps between otherwise unrelated sequences. As read lengths get longer, an increasing percentage of your library will have some adapter on the end of the reads, and it will become more important to remove it in order to preserve as much data as possible.
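
      To illustrate, the simplest form of partial-adapter removal is to look for a prefix of the adapter at the 3' end of each read - something like this sketch (exact matches only; real trimmers also tolerate mismatches):

      def trim_3prime_adapter(seq, adapter, min_overlap=5):
          # Try the longest possible overlap first, down to min_overlap.
          for olap in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
              if seq.endswith(adapter[:olap]):
                  return seq[:-olap]   # cut the adapter prefix off the 3' end
          return seq                   # no adapter found; leave the read alone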



      • #4
        Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position - however, this feels a little wrong at some level, since you know that you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions involving quality thresholds.

        I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.

        It would be interesting if anyone had additional figures on what kind of success rates one can expect.



        • #5
          Originally posted by gaffa View Post
          Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position - however, this feels a little wrong at some level, since you know that you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions involving quality thresholds.
          My concern with that approach would be that I might be biasing my results. What if AT-rich sequences show poorer quality? Would I introduce a %GC bias by trimming each read individually? If a whole run is becoming poor quality then trimming the whole thing is effectively the same as doing a shorter run, and I'm happier with that. Also, depending on your downstream analysis, it may be trickier to handle runs with variable read lengths; a lot of the statistics are easier if you can remove read length as a factor you have to consider.

          Originally posted by gaffa View Post
          I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.
          30-40% mapping is absurdly low for most applications. We generally see mapping efficiency over 70% for RNA-Seq and maybe 60% for ChIP-Seq with 40 bp reads (this will be very antibody-dependent). In many cases the reads which don't map (at least in our case) come from things which aren't present in the assemblies (centromeres, telomeres, etc.), or from regions duplicated with such high identity that much more sequence would be needed to map uniquely. We have a repeat-mapping pipeline where we assign reads to repeat classes, and don't care if they map to more than one instance of the class (or even to multiple classes). This lets us look at most of the unmappable data, albeit in a slightly different way from the conventionally mappable regions.
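
          In outline, the class-level assignment amounts to something like the sketch below (names are illustrative, not our actual code):

          from collections import Counter

          def count_by_repeat_class(read_hits, instance_to_class):
              # read_hits: read id -> list of repeat instances it aligned to
              # instance_to_class: repeat instance -> repeat class (e.g. 'Alu')
              counts = Counter()
              for read_id, instances in read_hits.items():
                  # A read hitting several instances of one class counts once
                  # for that class; hitting several classes counts once per class.
                  for klass in {instance_to_class[i] for i in instances}:
                      counts[klass] += 1
              return counts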



          • #6
            1. BWA's quality trimming method is adapted from phred (see the sketch after this list).

            2. I have seen quite a lot of papers citing Harismendy et al. (2009). That paper was great at the time of submission (end of 2008; the sequencing would have been done even earlier), but it is not representative any more. NGS is a fast-changing field; many things have happened in the past two years.
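
            For reference, the phred-style rule behind -q works roughly as follows - a sketch of the idea, not bwa's actual code; the cutoff corresponds to the -q value:

            def phred_style_trim_length(quals, q_cutoff=20, min_len=30):
                # Scan in from the 3' end, accumulating (q_cutoff - q); cut the
                # read where that running sum peaks, keeping at least min_len bases.
                best, running, keep = 0, 0, len(quals)
                for i in range(len(quals) - 1, min_len - 1, -1):
                    running += q_cutoff - quals[i]
                    if running < 0:        # reached a solidly high-quality stretch
                        break
                    if running > best:
                        best, keep = running, i
                return keep                # number of 5' bases to keep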



            • #7
              I totally agree with lh3 on the second point. Also, we have seen that in the majority of cases, letting the aligner and variant caller deal with low quality works fine. If quality really goes downhill, it is usually consistent across all lanes, and applying a pre-defined trim to all reads (essentially a shorter run) avoids the bias Simon mentioned.

              FastQC is very useful for summarizing all of this information.
              --
              bioinfosm



              • #8
                Thanks for the replies. I wonder if anyone has opinions on what Q-value cutoffs should be used - I've seen the values 15 and 20 thrown around.
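
                For reference, on the phred scale Q = -10 * log10(p_error), so those two cutoffs translate to per-base error probabilities of roughly 3.2% and 1% respectively:

                # p_error = 10 ** (-Q / 10)
                for q in (15, 20):
                    print(q, 10 ** (-q / 10))   # Q15 -> ~0.032, Q20 -> 0.010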

                One thing that I've been thinking about is to map full reads first, and then take the unmapped ones, trim them from the 3' end and try them again. I haven't seen this approach used but it doesn't seem like it should be that controversial. (Though of course it would be time-consuming).
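
                Roughly the workflow I have in mind, sketched with bwa's aln/samse commands (file names and the trim length are placeholders; unmapped reads are picked out via SAM flag 0x4):

                import subprocess

                def sh(cmd):
                    subprocess.run(cmd, shell=True, check=True)

                # Round 1: map the full-length reads.
                sh("bwa aln ref.fa reads.fq > r1.sai")
                sh("bwa samse ref.fa r1.sai reads.fq > r1.sam")

                # Collect the names of unmapped reads (SAM flag bit 0x4).
                unmapped = set()
                for line in open("r1.sam"):
                    if not line.startswith("@"):
                        fields = line.split("\t")
                        if int(fields[1]) & 0x4:
                            unmapped.add(fields[0])

                # Write a 3'-trimmed FASTQ of just those reads, then remap them.
                TRIM_TO = 60  # arbitrary illustration, not a recommendation
                with open("reads.fq") as fin, open("retry.fq", "w") as fout:
                    while True:
                        rec = [fin.readline() for _ in range(4)]
                        if not rec[0]:
                            break
                        if rec[0][1:].split()[0] in unmapped:
                            fout.write(rec[0] + rec[1].strip()[:TRIM_TO] + "\n")
                            fout.write(rec[2] + rec[3].strip()[:TRIM_TO] + "\n")

                sh("bwa aln ref.fa retry.fq > r2.sai")
                sh("bwa samse ref.fa r2.sai retry.fq > r2.sam")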
                Last edited by gaffa; 11-17-2010, 09:09 AM.

