  • Error correcting Illumina raw reads -- necessary after trimming?

    I've been trying to figure this one out for a while. Most of the genomes that have been assembled using exclusively Illumina short PE reads (such as the giant panda) include an error correction step (e.g., Quake, Reptile, SOAPec, or ECHO) before assembly, since Illumina reads can contain base-substitution errors. Error correction can reduce the total number of unique k-mers, which helps assembly.

    But if I'm already trimming sequences before assembly to Phred quality scores > 20 (in other words, > 99% base-call accuracy), is using error correction software really necessary? Thanks for any thoughts.

  • #2
    I think it would be a reasonable thing to do, because even bases with Q30 have a 0.1% chance of being wrong and can cause dead ends in the assembly graph.
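
    For reference, the Phred scale maps to error probabilities as p = 10^(-Q/10). A quick illustrative check in Python (phred_to_error_prob is just a throwaway helper, not from any particular toolkit):

    Code:
    def phred_to_error_prob(q: int) -> float:
        # Standard Phred definition: Q = -10 * log10(p)  =>  p = 10 ** (-Q / 10)
        return 10 ** (-q / 10)

    for q in (20, 30):
        p = phred_to_error_prob(q)
        print(f"Q{q}: {p:.4f} error probability ({100 - p * 100:.1f}% base-call accuracy)")
    # Q20: 0.0100 error probability (99.0% base-call accuracy)
    # Q30: 0.0010 error probability (99.9% base-call accuracy)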

    Other thoughts?



    • #3
      Also, depending on how or what you use to trim, there can still be low-quality bases in the read itself even after trimming.



      • #4
        I trimmed using left/right trimming for Q > 20. No sliding window, just a cut wherever a base fell below Q20. So there shouldn't be any bases below Q20 left at all.

        It ended up eliminating 5% of my bases (out of ~130 Gbp).

        Would I still get dead ends in the assembly graph from a miscalled base if I have 30-50X coverage?



        • #5
          The trade-off here is that only regions with relatively high coverage will be corrected, and depending on how you run these programs, you may want to trim your reads based on how the k-mers for correction align. So if you have low coverage in some regions, you might end up trimming them right out of your data. However, if you don't trim them, you might end up with dead ends.

          Ultimately, I think error correction can definitely help, but like all the steps in assembly, it's worth doing some parameter sweeps. And remember, with bigger k-mer sizes for correction you get more uniqueness, but you can't trim/correct the ends of the reads as easily.

          Also, with a k-mer size of n for assembly, a single base error will corrupt up to n k-mers. And while you may have had the coverage to support correcting that base at a small correction k, and thus likely have high enough coverage there for assembly, when you move to larger k-mers you may be missing regions farther from the error that are not as well covered.



          • #6
            Originally posted by Wallysb01 View Post
            Thanks for the thoughts. I have been avoiding error correction because most of the software isn't user-friendly, but it seems like it can't hurt.

            In the giant panda genome, they corrected k-mers at k=17 and then ran the assembly with SOAPdenovo at k=27. Does anyone know if that means they corrected 17-mers and then reassembled the reads before chopping them into 27-mers? Is there any point to this? It seems strange that they would correct k-mers at a different size than they assemble at.



            • #7
              Generally it is too computationally expensive in CPU time/memory to error-correct with long k-mers. Another problem is that as k-mers approach the read length, the k-mer depth of coverage falls, and it becomes harder to distinguish errors from low coverage. If you have 100 bp reads and a read depth of coverage of 20x, then a k-mer of 20 would have a k-mer depth of coverage of about 16x (the remaining sequence depth is accounted for by k-mers that run off the end of the read). On the other hand, a k-mer of 80 would have a k-mer depth of coverage of only about 4x for the same raw data.
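
              To make that concrete, here is the arithmetic as a quick sketch (kmer_coverage is only an illustrative helper, assuming uniform coverage and reads of exactly L bp):

              Code:
              def kmer_coverage(read_cov: float, read_len: int, k: int) -> float:
                  # A read of length L contains L - k + 1 k-mers, so
                  # k-mer coverage ~= read coverage * (L - k + 1) / L
                  return read_cov * (read_len - k + 1) / read_len

              for k in (20, 80):
                  print(f"k={k}: ~{kmer_coverage(20, 100, k):.1f}x k-mer coverage")
              # k=20: ~16.2x k-mer coverage
              # k=80: ~4.2x k-mer coverage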

              Of course, for the assembly itself, longer k-mers are desirable to overcome short multi-copy sequences.

              And with 4x expected, many k-mers would be represented only 1x or 2x, making it very hard to distinguish errors from low coverage.

              Even with a mammalian-size genome, a k-mer of 17 would detect most of the errors, though about 20-23 would be better. The question to ask is whether the genome is large enough that a "random sequence" error k-mer would match somewhere else in the genome by chance. In a human-size genome, a random 17-mer has about a 20% chance of occurring somewhere; in a bacterial genome, a random 17-mer would be observed about 1/1000 as often. But even in a human genome, 80% of the error k-mers would not be represented (if they are essentially random). Therefore a k-mer of 17 would potentially remove 80% of the errors in human data by flagging them as low coverage.
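
              Those odds can be roughly reproduced with a back-of-the-envelope calculation, assuming error k-mers are effectively random and ignoring repeats and reverse complements:

              Code:
              def collision_chance(genome_size: float, k: int) -> float:
                  # There are 4**k possible k-mers and ~genome_size distinct k-mer
                  # positions, so a random k-mer hits the genome with probability
                  # ~ genome_size / 4**k.
                  return genome_size / 4 ** k

              print(f"human (3 Gbp), k=17:     ~{collision_chance(3e9, 17):.0%}")   # ~17%
              print(f"bacterium (5 Mbp), k=17: ~{collision_chance(5e6, 17):.4%}")   # ~0.03%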

              Removing 70%-90% of the errors does a lot for the quality of the assembly and increases contig/scaffold sizes.



              • #8
                Originally posted by jebowers View Post
                That makes sense as to why you would use a smaller k. But if I wanted to use a k-mer size in my de novo assembly greater than the one I error-corrected at, wouldn't I have to reassemble the corrected k-mers back into reads longer than the assembly k-mer?

                edit: it seems that many of the EC programs include a merger that will merge corrected k-mers back into full reads
                Last edited by jwag; 08-02-2013, 12:14 PM.



                • #9
                  The error correction process generally outputs the reads with the suspected errors "fixed", in a format similar to the original reads (i.e., fasta or fastq reads the same size as the originals, except for some trimming). For example, if the EC process finds an error at base 15 of a 100 bp sequence, it changes that base and returns the whole 100 bp edited sequence. The assembler can then independently compute k-mers of the appropriate size from the corrected reads. If a few short (trimmed) reads exist that are shorter than the k-mer used, most assemblers will ignore those reads without any problem.
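
                  A toy sketch of that downstream behavior (kmers is a hypothetical helper, not any assembler's actual code); reads trimmed shorter than k simply yield no k-mers and drop out:

                  Code:
                  def kmers(read: str, k: int):
                      # A read shorter than k gives an empty range, i.e. no k-mers at all.
                      for i in range(len(read) - k + 1):
                          yield read[i : i + k]

                  reads = ["ACGTACGTACGT", "ACG"]  # toy corrected reads; the second is too short
                  print([list(kmers(r, 5)) for r in reads])  # second list is empty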



                  • #10
                    Once trimming is done, another approach/option is to simply let the assembly program handle low-count k-mers and not worry about error-correcting at all. In other words, you can set most genome assembly programs so that if the program sees a k-mer occurring only once (or whatever threshold you want), it drops the k-mer. This tends to reduce assembly problems and dead ends.

                    Of course, doing the above means you are throwing away data instead of correcting it. However, I am leery of data correction. SOLiD reads could be corrected because every base was scanned twice (due to color-space encoding) within a single read; given only a few overlapping reads, it became obvious which read was bad and thus what the correct base was. Illumina obviously doesn't have dual-scanned bases, so you have to rely on multiple reads external to the suspect read to do the correcting. Not that this process is inherently wrong, but if I have enough "correct" reads/k-mers to fix up a single-occurrence k-mer, then why not keep it simple and throw that single-occurrence k-mer away? I'll still have enough data from the correct k-mers to produce a good assembly.
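
                    A minimal sketch of that "just drop rare k-mers" idea, using a plain Python Counter as a stand-in for an assembler's internal k-mer table (solid_kmers is purely illustrative):

                    Code:
                    from collections import Counter

                    def solid_kmers(reads, k, min_count=2):
                        counts = Counter(r[i : i + k] for r in reads for i in range(len(r) - k + 1))
                        # Keep k-mers seen at least min_count times; singletons (likely
                        # errors at reasonable coverage) are discarded, not corrected.
                        return {km: c for km, c in counts.items() if c >= min_count}

                    reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]  # third read carries one error
                    print(solid_kmers(reads, 5))  # only the error-free k-mers survive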



                    • #11
                      Originally posted by westerman View Post
                      Similar to what you're saying, what I've been doing so far is assuming that the assembler (I've been using CLC and SOAPdenovo2) would simply form a bubble at a miscalled base. My line of thought is that if a miscalled base appears in only one read's k-mers, it creates a bubble rather than a dead end. Assuming a reasonable amount of coverage around and within that bubble with the correct base(s), the assembler should be able to resolve the base(s). So the k-mer isn't necessarily getting thrown out in this case.

                      I'm not really sure of the downstream consequences of going this route, though, compared to doing a more thorough error correction.



                      • #12
                        Originally posted by jwag View Post
                        Assuming a reasonable amount of coverage around and within that bubble with the correct base(s), the assembler should be able to resolve the base(s). So the k-mer isn't necessarily getting thrown out in this case.
                        That's the main assumption: by not doing error correction, you're betting that the erroneous base has good coverage within a distance of k around it. When you're doing assembly, you want to push your k-mer size as high as reasonably possible to get through short repeats, but a single base error in a read can trash a very large portion of that read. In fact, you'll lose k k-mers from your read if the error has at least k-1 bases on each side of it (fewer, obviously, if it's toward the ends).

                        For example, for 100 bp reads with a k-mer size of 31 (a pretty modest k-mer size for assembly), you can generate 70 31-mers per read. If you have an error anywhere in the middle of the read, where 31-mers can completely surround it, you lose 31 of those 70 k-mers, or 44% of them. At k = L/2 + 1, where L is the read length, an error near the middle costs you every k-mer in the read. In an ideal world, I think you'd like to use k-mers that at least approach half your read length.
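
                        A quick check of those numbers (kmers_lost is a throwaway helper; positions are 0-based):

                        Code:
                        def kmers_lost(read_len: int, k: int, pos: int) -> int:
                            # k-mers starting in [max(0, pos - k + 1), min(read_len - k, pos)]
                            # all contain the base at pos.
                            first = max(0, pos - k + 1)
                            last = min(read_len - k, pos)
                            return last - first + 1

                        L, k = 100, 31
                        total = L - k + 1                    # 70 31-mers per 100 bp read
                        print(kmers_lost(L, k, 50), total)   # 31 of 70 lost, i.e. 44%
                        print(kmers_lost(L, 51, 50))         # at k = L/2 + 1 = 51, all 50 k-mers are lost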

                        If you look at the Assemblathon 2 submission commands, you see very large values of k in general: http://korflab.ucdavis.edu/Datasets/...nal_file_3.pdf

                        ABySS was run with k=56 or 80, depending on the genome. ALLPATHS does error correction by default. For SOAP, the bird was corrected at k=27 and assembled at k=29; the snake was corrected at k=18 but assembled at k=35.

                        In my mind, it's not so much the correction of the actual base that you gain from error correction, but the coverage ~k distance away from the error, which allows you to increase your k.



                        • #13
                          As Wallys said, the advantage of error correcting is that many individual k-mers would otherwise have low k-mer coverage in a long-k-mer assembly. That can result in bubbles, or breakpoints if two or more errors are near each other. The problem is that with long k-mers it may not be easy to ascertain which k-mer is correct when an error is encountered, if the effective depth of coverage of that k-mer is low. If using a long k-mer on short reads (e.g., k=80 on 100 bp reads), the k-mer depth of coverage is going to be about 1/5 the sequencing depth of coverage, and much lower if uncorrected errors exist.



                          • #14
                            Thanks for these great posts. I went ahead and put my trimmed data (trimmed at Q < 20) through the SOAP error corrector (SOAPec) at k=17. This is for one lane of 101 bp Illumina reads (forward reads only):

                            num_raw_reads: 184648159
                            num_raw_bases: 18164107828
                            num_result_reads: 182183710
                            num_result_bases: 17778439456

                            num_trimmed_reads: 10985799
                            num_trimmed_bases: 244794760
                            num_deleted_reads: 2464449

                            num_corrected_bases_by_Fast_method: 784837
                            num_corrected_bases_by_BBtree_method: 15912816
                            num_corrected_bases_by_two_methods: 16697653

                            low_quality_bases_filter_ratio: 0.0212324
                            estimated_raw_base_error_ratio: 0.000939208

                            So it looks like about 0.1% of total bases were corrected in this case. I was planning to use k=47 for my de novo assembly. Maybe I should increase my k during correction too?
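
                            For the record, the arithmetic from the summary above:

                            Code:
                            corrected = 16_697_653        # num_corrected_bases_by_two_methods
                            raw_bases = 18_164_107_828    # num_raw_bases
                            print(f"{corrected / raw_bases:.3%}")  # 0.092%, roughly consistent with
                                                                   # the estimated_raw_base_error_ratio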

                            I'm also guessing I should put ALL of my lanes of data into the error corrector at the same time, so that it counts the occurrences of each particular k-mer more accurately.



                            • #15
                              Originally posted by jwag View Post
                              So it looks like about 0.1% of total bases were corrected in this case. I was planning to use k=47 for my de novo assembly. Maybe I should increase my k during correction too?
                              You might as well push it as high as you can, until your k-mer coverage distribution plot starts to lack a distinct valley and peak, or until your assembly quality starts to suffer from insufficient coverage. Usually, in my assemblies it was pretty easy to tell when k was just too big, and you see a point of diminishing returns in a fairly large range around your ideal k. So you might do a sweep of k at about every 5 bp (e.g., 31, 36, 41, 46, ..., or similar). From that you'll usually find a good spot to start further refining your assemblies. To save time, you could just assemble to contigs, but then you lose the quality-control information from mapping the reads back for scaffolding, since map % is often a decent measure of how well you're using your data.
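
                              If it helps, a minimal sketch of building that k-mer coverage histogram per candidate k (in practice you'd use a dedicated counter such as Jellyfish or KMC; kmer_histogram is just illustrative):

                              Code:
                              from collections import Counter

                              def kmer_histogram(reads, k):
                                  counts = Counter(r[i : i + k] for r in reads for i in range(len(r) - k + 1))
                                  hist = Counter(counts.values())  # multiplicity -> number of distinct k-mers
                                  return sorted(hist.items())

                              # Run per candidate k in the sweep (e.g. k = 31, 36, 41, 46) and look for a
                              # distinct error peak at 1-2x separated by a clear valley from the main peak.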

                              I'm also guessing I should put ALL of my lanes of data into the error corrector at the same time, so that it counts the occurrences of each particular k-mer more accurately.
                              You don't necessarily need to put all your data into the error correction, though it will depend on how much coverage you have to start with and how big a k-mer you're doing correction at. If you have 100x raw coverage, don't be afraid to correct with only 50x if your correction k-mer is small (say ~21 bp or less); but if you push k up toward 27, you might want all of it. It will also depend on your computational resources: RAM usage (which can easily approach 1 TB for 100x vertebrate genomes) and run time go up quite a lot as you increase k and add more data. But if you have the resources, you might as well use all your data, even prior to quality trimming (but after adapter clipping), since you can let SOAP's error correction do trimming based on suspicious k-mers in combination with quality data by playing with the options e, w and q. Also, watch your value for "l"; it should be roughly the value of that local minimum in your k-mer coverage plot.

                              In my data, I ended up trimming at Q=10 and letting SOAPec correct anything with Q<30, while retaining error-mers with Q>30. For whatever reason, with my data I was getting worse results when letting SOAPec correct all bases. But you may find things to be different in your case.
