Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • New way to compare mappers and variant callers

    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love eveyone's input. It's a cool project but brand new and it will need lots of work!

  • #2
    In response to @lh3 from http://seqanswers.com/forums/showthr...t=15200&page=4

    Although the metrics used on your website's comparison are similar, it looks like you used a wiggle of 50bp, which could account for differences. The curve axes are incorrect reads / mapped reads and correct reads / total in FASTQ

    From other reports with different read lengths and mutations I see pretty consistently that BWA MEM does indeed deprecate BWASW, and MEM is also the most accurate open source aligner i have seen so far!

    Comment


    • #3
      Originally posted by adaptivegenome View Post
      Guys,

      I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



      You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

      Would love eveyone's input. It's a cool project but brand new and it will need lots of work!
      This looks an interesting project and I'm interested in running Subread aligner on the test datasets. But could you please provide a bit more information about the datasets such as the fragment length of PE data and the indel lengths in the small indel datasets and large indel datasets?

      Thanks,
      Wei

      Comment


      • #4
        The inner distance for paired end is always 300bp (so a template size of 500bp for 100bp reads)

        The small indel rate has sizes 1-10bp, and the large indel rate is 10-24bp

        Looking forward to seeing the Subread results!

        Comment


        • #5
          Thanks for the info, oiiio. I will try to upload the Subread results soon.

          I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

          The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

          In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

          Cheers,
          Wei

          Comment


          • #6
            GCAT actually uses DWGSIM, but you are right about the assumptions. GCAT also includes real data for variant calling benchmarks but I think you are right, we should try some other simulators.

            We will give MASON a try and it would be awesome if you were to upload some results in the meantime.

            Thanks for the input!


            Originally posted by shi View Post
            Thanks for the info, oiiio. I will try to upload the Subread results soon.

            I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

            The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

            In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

            Cheers,
            Wei

            Comment


            • #7
              This is a good point, and I think it would be very interesting to see how the alignment reports change with simulator as well. Comparing simulators that use and don't use that kind of error model could test how reliant base quality-sensitive aligners are these qualities, and also on expected error position, which would be very cool!

              In the meantime you will find the variant calling test interesting also, as it is performed on real whole exome data (NA12878) on multiple platforms. With this you can compare a VCF generated by an aligner+variant caller pipeline.

              For example, here is the variant calling report for the 100bp 30x coverage Illumina exome :




              Originally posted by shi View Post
              Thanks for the info, oiiio. I will try to upload the Subread results soon.

              I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

              The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

              In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

              Cheers,
              Wei
              Last edited by oiiio; 04-15-2013, 06:50 AM.

              Comment


              • #8
                No read simulators simulate realistic data. ART and mason infer quality from empirical alignment, which will: 1) bias towards mappers using similar algorithms; 2) miss reads with excessive mismatches/gaps; 3) be dependent of the reference and sequences in use. I don't know much about ACANA used by ART, but RazorS used by mason does not do affine-gap alignment, which will affect its gap modeling. My understanding is both of them assume position independence. This is clearly not true for Illumina reads: some reads are systematically worse at every base than others (for example, if there is a Q2 base, the following bases are also Q2). If I am right, both ART and mason assume no variations between reads and the reference genome. This is rarely the case in practice. In an Illumina alignment, the majority of gaps, especially long gaps, are variations instead of sequencing errors. Furthermore, when we align reads to a related species ~1% divergence away from the sequenced sample, the reads will apparently have 1% uniform errors. 1% divergence is not rare when we study non human organisms. To this end, ART and mason do something better but something worse.

                Wgsim uses a default sequencing error rate as high as 2% because I want to stress aligners. All mappers make no mistake for a perfectly matched read. However, with 100bp or longer reads, we should be able to access 2% divergence (for human, quite a lot of reads span two SNPs). It is these regions that show and need the capability of an aligner. Longer indels (say >5bp), which I guess are very rare from mason/art simulation, also stress the mappers a lot.

                Finally, my experience is that the relative performance of mappers stay the same within a reasonably large range error configurations. The maq read simulator generates more realistic errors. I stopped using that in wgsim mainly because as I tested that time, the maq error model added little to the evaluation of mappers.


                I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                Last edited by lh3; 04-15-2013, 07:27 AM.

                Comment


                • #9
                  Originally posted by lh3 View Post
                  I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                  Another great quote from Heng

                  Comment


                  • #10
                    Mason and Art do allow you to specify the rates of indels and SNPs introduced to simulation reads.

                    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome and then mutated read bases according to their quality scores (the lower the quality scores, the more likely they will be changed to different bases). These made the simulation reads very similar to the real reads. We also included simulation reads generated from Mason and Art for the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742

                    Comment


                    • #11
                      Originally posted by shi View Post
                      Mason and Art do allow you to specify the rates of indels and SNPs introduced to simulation reads.
                      Thanks for the clarification. I was reading their readme only. I can see more options from the command-line helps. ART is unable to simulate variants. Its indel error model is unclear. Mason is quite complete in features (though I have not figured out how to feed error profiles to mason). I may recommend it over wgsim/dwgsim. However, to use mason, we should still use high variant probability or high sequencing error rate to stress the mapper. The default setting is much easier than real data - due to the coalescent process, there are far more hard SNP-dense regions in real data than most simulations.

                      It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome and then mutated read bases according to their quality scores (the lower the quality scores, the more likely they will be changed to different bases). These made the simulation reads very similar to the real reads.
                      Also thanks for this. I missed that bit. I assume you are extracting the full quality string of a read, not a single base. Is that right (it is not clear from the text)? Using real quality has other concerns, but anyway the key to using real quality properly is to sample the quality of a whole read, instead of sample the quality of each base.

                      We also included simulation reads generated from Mason and Art for the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
                      Actually I reviewed a version of your manuscript. I kept anonymous that time (I sign most reviews nowadays). I suggested the ROC-like representation and recommended to focus on RNA-seq.

                      Comment


                      • #12
                        Yes, we extracted the entire string of quality scores from each read in a SEQC dataset and assigned them to simulation reads.

                        You can feed a quality profile to Art, but Mason does not allow you do that. When running Mason, we used the same Indel and SNP rates as used in the BWA paper.

                        It makes more sense to me to use the typical rates of indel, SNP and the sequencing error because this will make it easier to see the close to real performance of aligners. For example, 1 SNP out of 1000 bases, 1 indel out of 10000 bases and sequencing error rate of 1 to 5 percent.

                        Comment


                        • #13
                          We should use higher SNP/indel rate, because that is more real. If you plot the heterozygosity of a single individual along the genome, you will find the heterozygosity is far from uniform. Some 100kb regions may contain no SNPs but other regions may reach a heterozygosity 1-2%. That is caused primarily by coalescence and partly by selection, non-homologous gene conversions and variable recombination/mutation rates. Although the fraction of high heterozygosity regions is small, the variant error rate is much higher. If you simply simulate 1 SNP per 1000 base, you can rarely get regions with high SNP density, which deviates from real data and is less useful for evaluating the performance of a mapper in the context of variable calling and the discovery of structural variations that demands even higher mapping accuracy.

                          As to mason, I think it has a parameter to feed error profiles. I just do not know the format.

                          Comment


                          • #14
                            It will certainly be useful to incorporate into the generation of simulation reads the information of the distribution of positions of SNPs and indels on the reference genome. I think the dbSNP database might be useful for this, but it seems harder for indels.

                            Comment


                            • #15
                              Originally posted by lh3 View Post
                              I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                              I like this idea a lot as well. We should try it sometime.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 02:44 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-11-2024, 06:55 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              110 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              117 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X