  • gaffa
    Member
    • Oct 2010
    • 82

    Accepted practices of NGS quality filtering?

    Hi all,

    There is a lot of software for performing various forms of quality control and filtering on short-read data generated by NGS platforms, but it's much harder to find information about what decisions one should actually make when filtering - where to put your thresholds, how much to trim, and so on.

    I have just begun to map >100 bp Illumina reads to a small genome, but I seem to have a lot of noise in the data set (both adapter contamination and low-quality sequence), resulting in low mapping rates. Quality plummets towards the 3' end, and trimming all reads gives better rates, but of course you don't want to trim away good sequence.

    I understand BWA has a pretty neat approach to trimming reads individually (the -q flag), but there is the question of what value to set for this parameter.
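
    As far as I can tell (and please correct me if I've misunderstood), the rule works roughly like this - a Python sketch of the phred-style trimming I believe -q implements, with the threshold and minimum length being made-up defaults:

        def phred_style_trim_len(quals, q=20, min_len=35):
            # quals: per-base phred quality scores, 5' to 3'
            # Scan from the 3' end accumulating (q - Q[i]); keep the prefix
            # that ends where this running sum is maximal. A read whose sum
            # never goes positive is kept at full length.
            running, best, keep = 0, 0, len(quals)
            for i in range(len(quals) - 1, min_len - 1, -1):
                running += q - quals[i]
                if running < 0:
                    break
                if running > best:
                    best, keep = running, i
            return keep

    So a read would be kept as read[:phred_style_trim_len(quals, q=20)], and the whole question is what q to pass.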

    Other approaches are also mentioned every now and then - discarding whole reads that fall below some quality threshold, counting ambiguous bases ("N"s), looking for short windows of poor-quality sequence, and so on. These sound good in principle, but here too it is hard to know exactly how aggressively to apply them. Another thought is that if a read is really bad it probably won't align anyway - but it still feels better to identify and exclude such reads beforehand, right? I have also thought about mapping full-length reads, then trimming the unmapped reads and trying them again. There seem to be many different approaches one could take, and it's not obvious which would be best.
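
    To make the whole-read filtering idea concrete, this is the kind of thing I mean - a toy sketch where every threshold is a number I would somehow have to justify:

        def passes_filters(seq, quals, max_n=2, min_mean_q=20,
                           window=10, min_window_q=15):
            # Reject reads with too many ambiguous bases
            if seq.upper().count('N') > max_n:
                return False
            # Reject reads whose overall mean quality is poor
            if sum(quals) / float(len(quals)) < min_mean_q:
                return False
            # Reject reads containing any short window of poor-quality calls
            for i in range(len(quals) - window + 1):
                if sum(quals[i:i + window]) / float(window) < min_window_q:
                    return False
            return True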

    Since this is such a common task, you would expect some kind of standard practice to have emerged by now - at least there should be a lot of people with experience of these kinds of decisions. So does anyone have any opinions on this, or possibly links to other resources on the topic? Many thanks in advance.
  • Bruins
    Member
    • Feb 2010
    • 78

    #2
    Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.


    • simonandrews
      Simon Andrews
      • May 2009
      • 870

      #3
      Originally posted by Bruins
      Perhaps someone could recommend a book on this subject, if one exists? I find myself worrying about these questions too.
      Good luck with that! This field is new enough that I don't think anyone has a definitive answer for what you should filter and at what cutoff. If there is a book on this I'm not sure I'd trust it.

      I suppose this boils down to there being two kinds of quality problem: either you're making calls with low confidence, or you're making correct calls on something you don't want (e.g. adapters).

      For low-confidence data you would ideally leave everything in place and have your downstream analysis tools take the confidence into account, so mappers and aligners won't care too much if low-confidence calls mismatch. This lets you retain as much information as possible - which should be a good thing. However, this falls down if the scores assigned to your calls prove not to be accurate - which is probably the case a lot of the time - and then you end up ignoring good-quality data because of the poor data on the end. We've therefore often decided to trim really poor sequence from our data (normally by truncating a whole run at a particular position), since aligners then have less excuse for getting the mapping wrong.
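
      (Mechanically, the whole-run truncation is trivial - something like this Python sketch for plain 4-line FASTQ records, where the cutoff of 75 is just a placeholder you'd read off the run's quality plot:)

          def truncate_fastq(in_path, out_path, keep=75):
              # Truncate every read in a 4-line-record FASTQ file to `keep` bases
              with open(in_path) as fin, open(out_path, 'w') as fout:
                  while True:
                      header = fin.readline()
                      if not header:
                          break  # end of file
                      seq = fin.readline().rstrip('\n')
                      plus = fin.readline()
                      qual = fin.readline().rstrip('\n')
                      fout.write(header)
                      fout.write(seq[:keep] + '\n')
                      fout.write(plus)
                      fout.write(qual[:keep] + '\n')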

      In some applications (SNP calling, bisulphite seq etc) you may prefer to never have to deal with low confidence calls and so would trim your data at an early stage and just live with the reduced coverage you get, rather than have to worry about dealing with a large number of low confidence predictions later on.

      For contaminated data you may not have to worry about the contamination - if you're getting some adapter sequence in your data then it probably won't align to your reference and you can just ignore it. However, if you have partial adapter sequence on the end of real data, it can make a mess of the alignment, creating false overlaps between otherwise unrelated sequences. As read lengths get longer you may find that an increasing percentage of your library has some adapter on the end of the reads, and it becomes more important to remove this to preserve as much data as possible.
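
      The partial-adapter case is the one that needs actual work: you're looking for a suffix of the read that matches a prefix of the adapter. A toy sketch (exact matches only - real trimmers allow mismatches and score the overlap; the adapter sequence is whatever your library prep used):

          def trim_3prime_adapter(seq, adapter, min_overlap=5):
              # Try the longest possible overlap first, so a full adapter hit
              # wins over a shorter spurious one
              for olap in range(min(len(seq), len(adapter)), min_overlap - 1, -1):
                  if seq.endswith(adapter[:olap]):
                      return seq[:-olap]
              return seq  # no adapter found; read unchanged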


      • gaffa
        Member
        • Oct 2010
        • 82

        #4
        Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position; however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.

        I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.

        It would be interesting if anyone had additional figures on what kind of success rate one can expect.


        • simonandrews
          Simon Andrews
          • May 2009
          • 870

          #5
          Originally posted by gaffa
          Thanks for your reply, simonandrews - you bring up several good points. I've also been toying with trimming all reads of a run at the same position; however, this feels a little wrong at some level, since you know you're throwing away some good sequence. A per-read approach would feel better, though of course then you have to make the trickier decisions about quality thresholds.
          My concern with that approach would be that I might be biasing my results. What if AT-rich sequences show poorer quality? Would I introduce a %GC bias by trimming each read individually? If a whole run is becoming poor quality then trimming the whole thing is effectively the same as doing a shorter run, and I'm happier with that. Also, depending on your downstream analysis, it may be trickier to handle runs with variable read lengths - a lot of the statistics are easier if you can remove read length as a factor you have to consider.
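
          That worry is at least cheap to check: compare the %GC of the same library before and after per-read trimming, and a shift tells you the trimming is selecting on base composition rather than purely on quality. A sketch (comparing the full distributions would be better than just the means):

              def gc_percent(seq):
                  # Percentage of G and C bases in one read
                  seq = seq.upper()
                  return 100.0 * (seq.count('G') + seq.count('C')) / len(seq)

              def mean_gc(reads):
                  # Mean %GC over a list of read sequences
                  return sum(gc_percent(r) for r in reads) / len(reads)

              # e.g. compare mean_gc(untrimmed_reads) with mean_gc(trimmed_reads),
              # where the two lists hold the sequences before and after trimming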

          Originally posted by gaffa
          I've also been pondering what kind of mapping success rate one can expect. Harismendy et al. 2009 (http://genomebiology.com/2009/10/3/R32) report that, in their experiment, "only 43% and 34% of the Illumina GA and ABI SOLiD raw reads, respectively, are usable". This seems pretty low - though I've seen higher figures reported elsewhere, and for the Illumina sequencing this study used 36 bp reads; I assume longer reads will improve mapping rates.
          30-40% mapping is absurdly low for most applications. We generally see mapping efficiency over 70% for RNA-Seq, and maybe 60% for ChIP-Seq with 40 bp reads (this will be very antibody-dependent). In many cases the reads which don't map (at least in our case) are things which aren't present in the assemblies (centromeres, telomeres etc.), or regions which are duplicated with high identity and would require much longer sequence to map uniquely. We have a repeat-mapping pipeline where we assign reads to repeat classes, and don't care if they map to more than one instance of the class (or even to multiple classes). This lets us look at most of the otherwise unmappable data, albeit in a slightly different way to the conventionally mapped regions.


          • lh3
            Senior Member
            • Feb 2008
            • 686

            #6
            1. BWA's quality trimming method is taken from phred.

            2. I have seen quite a lot of papers citing Harismendy et al. (2009). This paper was great at the time of submission (end of 2008; the sequencing would have been done even earlier), but it is not representative any more. NGS is a fast-changing field; many things have happened in the past two years.


            • bioinfosm
              Senior Member
              • Jan 2008
              • 483

              #7
              I totally agree with lh3 on the second point. Also, we have seen that in the majority of cases, letting the aligner and variant caller deal with low quality works fine. If quality really goes downhill, it is usually consistent across all lanes, and using a pre-defined trim for all reads (essentially a shorter run) avoids the bias Simon mentioned.

              FastQC is very useful for summarizing all of this information.
              --
              bioinfosm


              • gaffa
                Member
                • Oct 2010
                • 82

                #8
                Thanks for the replies. I wonder if anyone has opinions on what Q-value cutoffs should be used - I've seen the values 15 and 20 thrown around.

                One thing that I've been thinking about is to map the full reads first, then take the unmapped ones, trim them from the 3' end and try them again. I haven't seen this approach used, but it doesn't seem like it should be that controversial (though of course it would be time-consuming).
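
                Roughly what I have in mind, as an (untested) sketch: run the aligner as usual for the first pass, then pull the unmapped reads out of the resulting SAM, trim them and re-run. The 50 bp cutoff and file names are placeholders:

                    def unmapped_to_trimmed_fastq(sam_path, fastq_path, trim_to=50):
                        # Unmapped reads carry FLAG bit 0x4; SAM columns 1, 10 and 11
                        # are the read name, sequence and qualities. Unmapped reads
                        # have no strand, so no reverse-complementing is needed.
                        with open(sam_path) as sam, open(fastq_path, 'w') as fq:
                            for line in sam:
                                if line.startswith('@'):
                                    continue  # skip header lines
                                f = line.rstrip('\n').split('\t')
                                if int(f[1]) & 0x4:
                                    fq.write('@%s\n%s\n+\n%s\n'
                                             % (f[0], f[9][:trim_to], f[10][:trim_to]))

                The trimmed FASTQ would then go back through the aligner, and the two sets of alignments get merged afterwards.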
                Last edited by gaffa; 11-17-2010, 09:09 AM.
