
  • zee
    replied
    Originally posted by lh3 View Post
    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
    I like this idea a lot as well. We should try it sometime.



  • shi
    replied
    It would certainly be useful to incorporate the positional distribution of SNPs and indels on the reference genome into the generation of simulated reads. I think the dbSNP database might be useful for this, but it seems harder for indels.



  • lh3
    replied
    We should use a higher SNP/indel rate, because that is more realistic. If you plot the heterozygosity of a single individual along the genome, you will find that it is far from uniform. Some 100kb regions may contain no SNPs, while other regions may reach a heterozygosity of 1-2%. This is caused primarily by coalescence, and partly by selection, non-homologous gene conversion and variable recombination/mutation rates. Although the fraction of high-heterozygosity regions is small, the variant error rate in those regions is much higher. If you simply simulate 1 SNP per 1,000 bases, you will rarely get regions with high SNP density, which deviates from real data and is less useful for evaluating the performance of a mapper in the context of variant calling and the discovery of structural variations, which demand even higher mapping accuracy.

    As to Mason, I think it has a parameter for feeding in error profiles; I just do not know the format.
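    The non-uniform heterozygosity described above can be illustrated with a minimal sketch (not code from any of the tools discussed): draw a per-window SNP rate from an exponential distribution, a simple heavy-tailed stand-in for coalescent variation. The function name and parameters are made up for illustration.

```python
import random

def window_snp_rates(n_windows, mean_rate=0.001, cap=0.02, seed=42):
    """Draw one SNP rate per 100kb window from an exponential distribution,
    capped at 2% heterozygosity. Most windows end up near-empty, while a
    small fraction are SNP-dense, unlike a uniform 1-SNP-per-1000bp model."""
    rng = random.Random(seed)
    return [min(rng.expovariate(1.0 / mean_rate), cap) for _ in range(n_windows)]

rates = window_snp_rates(30000)            # ~3Gb genome in 100kb windows
dense = sum(r > 0.01 for r in rates)       # windows above 1% heterozygosity
```

    A uniform simulator never produces the `dense` windows, which is exactly the part of real data that stresses a mapper.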



  • shi
    replied
    Yes, we extracted the entire string of quality scores from each read in a SEQC dataset and assigned them to the simulated reads.

    You can feed a quality profile to Art, but Mason does not allow you to do that. When running Mason, we used the same indel and SNP rates as used in the BWA paper.

    It makes more sense to me to use typical rates of indels, SNPs and sequencing errors, because this makes it easier to see aligner performance close to what it would be on real data: for example, 1 SNP per 1,000 bases, 1 indel per 10,000 bases and a sequencing error rate of 1 to 5 percent.
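    A minimal sketch of the quality-driven mutation scheme described above, assuming Sanger-encoded (offset-33) quality strings; the function is hypothetical, not code from the Subread paper:

```python
import random

def mutate_read(seq, qual, offset=33):
    """Mutate each base with the error probability implied by its Phred
    score, p_err = 10^(-Q/10): the lower the quality, the more likely
    the base is replaced by a different one."""
    out = []
    for base, q in zip(seq, qual):
        p_err = 10 ** (-(ord(q) - offset) / 10)
        if random.random() < p_err:
            out.append(random.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

# '!' is Phred 0 (p_err = 1.0), 'I' is Phred 40 (p_err = 0.0001)
noisy = mutate_read("ACGTACGT", "!!!!IIII")
```

    Sampling the whole quality string from one real read, rather than per-base, preserves the correlation of error rates along a read.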



  • lh3
    replied
    Originally posted by shi View Post
    Mason and Art do allow you to specify the rates of indels and SNPs introduced into the simulated reads.
    Thanks for the clarification. I was only reading their READMEs; I can see more options in the command-line help. ART is unable to simulate variants, and its indel error model is unclear. Mason is quite complete in features (though I have not figured out how to feed error profiles to it). I may recommend it over wgsim/dwgsim. However, to use Mason, we should still use a high variant probability or a high sequencing error rate to stress the mapper. The default setting is much easier than real data: due to the coalescent process, there are far more hard, SNP-dense regions in real data than in most simulations.

    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what we did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated the read bases according to their quality scores (the lower the quality score, the more likely the base was changed). This made the simulated reads very similar to real reads.
    Also thanks for this; I missed that bit. I assume you are extracting the full quality string of a read, not a single base. Is that right (it is not clear from the text)? Using real qualities has other concerns, but the key to using them properly is to sample the quality string of a whole read, instead of sampling the quality of each base independently.

    We also included simulated reads generated from Mason and Art in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
    Actually, I reviewed a version of your manuscript. I stayed anonymous that time (I sign most of my reviews nowadays). I suggested the ROC-like representation and recommended focusing on RNA-seq.



  • shi
    replied
    Mason and Art do allow you to specify the rates of indels and SNPs introduced into the simulated reads.

    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what we did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated the read bases according to their quality scores (the lower the quality score, the more likely the base was changed). This made the simulated reads very similar to real reads. We also included simulated reads generated from Mason and Art in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742



  • adaptivegenome
    replied
    Originally posted by lh3 View Post
    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
    Another great quote from Heng



  • lh3
    replied
    No read simulator simulates realistic data. ART and Mason infer quality from empirical alignments, which will: 1) bias towards mappers using similar algorithms; 2) miss reads with excessive mismatches/gaps; and 3) be dependent on the reference and sequences in use. I don't know much about ACANA, used by ART, but RazerS, used by Mason, does not do affine-gap alignment, which will affect its gap modeling. My understanding is that both of them assume position independence. This is clearly not true for Illumina reads: some reads are systematically worse at every base than others (for example, if there is a Q2 base, the following bases are also Q2). If I am right, both ART and Mason assume no variation between the reads and the reference genome. This is rarely the case in practice. In an Illumina alignment, the majority of gaps, especially long gaps, are variants rather than sequencing errors. Furthermore, when we align reads to a related species ~1% divergence away from the sequenced sample, the reads will effectively carry 1% uniform errors, and 1% divergence is not rare when we study non-human organisms. In this sense, ART and Mason do some things better but others worse.

    Wgsim uses a default sequencing error rate as high as 2% because I want to stress aligners. No mapper makes a mistake on a perfectly matched read. However, with 100bp or longer reads, we should be able to access regions of 2% divergence (for human, quite a lot of reads span two SNPs). It is these regions that reveal, and demand, the capability of an aligner. Longer indels (say >5bp), which I guess are very rare in mason/art simulations, also stress the mappers a lot.

    Finally, my experience is that the relative performance of mappers stays the same within a reasonably large range of error configurations. The maq read simulator generates more realistic errors; I stopped using that model in wgsim mainly because, when I tested it at the time, it added little to the evaluation of mappers.


    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
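    The coordinate-preserving part of this idea is straightforward for SNPs (indels would shift coordinates and need a liftover table). A minimal sketch with a made-up input format, not actual HuRef data:

```python
def apply_snps(ref, snps):
    """Overlay alternate alleles at known positions so that reads simulated
    from the donor sequence keep reference coordinates for truth checking.
    ref: reference sequence; snps: {0-based position: alternate base}."""
    donor = list(ref)
    for pos, alt in snps.items():
        donor[pos] = alt
    return "".join(donor)

donor = apply_snps("ACGTACGTACGT", {3: "A", 7: "C"})  # same length as ref
```

    Because the donor sequence has the same length as the reference, the true origin of each simulated read can be compared directly against the mapper's reported position.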
    Last edited by lh3; 04-15-2013, 07:27 AM.



  • oiiio
    replied
    This is a good point, and I think it would be very interesting to see how the alignment reports change with the simulator as well. Comparing simulators that do and don't use that kind of error model could test how reliant quality-sensitive aligners are on base qualities, and also on expected error positions, which would be very cool!

    In the meantime you may also find the variant calling test interesting, as it is performed on real whole-exome data (NA12878) from multiple platforms. With it you can compare the VCFs generated by different aligner + variant caller pipelines.

    For example, here is the variant calling report for the 100bp 30x coverage Illumina exome :




    Originally posted by shi View Post
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei
    Last edited by oiiio; 04-15-2013, 06:50 AM.



  • adaptivegenome
    replied
    GCAT actually uses DWGSIM, but you are right about its assumptions. GCAT also includes real data for the variant calling benchmarks, but I think you are right that we should try some other simulators.

    We will give Mason a try, and it would be awesome if you were to upload some results in the meantime.

    Thanks for the input!


    Originally posted by shi View Post
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei



  • shi
    replied
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei
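    For reference, the '2' → Phred 17 conversion mentioned in this thread follows from the Sanger/Illumina 1.8+ FASTQ encoding (ASCII code minus 33); a quick sketch:

```python
def phred(ch, offset=33):
    """Phred quality of a FASTQ quality character (Sanger offset 33)."""
    return ord(ch) - offset

def error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

q = phred("2")     # ord('2') = 50, so 50 - 33 = 17
p = error_prob(q)  # about 0.02, i.e. roughly a 2% per-base error rate
```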



  • oiiio
    replied
    The inner distance for paired-end reads is always 300bp (so a template size of 500bp for 100bp reads).

    The small indels range from 1-10bp in size, and the large indels from 10-24bp.

    Looking forward to seeing the Subread results!



  • shi
    replied
    Originally posted by adaptivegenome View Post
    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love everyone's input. It's a cool project, but it's brand new and will need lots of work!
    This looks like an interesting project and I'm interested in running the Subread aligner on the test datasets. But could you please provide a bit more information about the datasets, such as the fragment length of the PE data and the indel lengths in the small-indel and large-indel datasets?

    Thanks,
    Wei



  • oiiio
    replied
    In response to @lh3 from http://seqanswers.com/forums/showthr...t=15200&page=4

    Although the metrics used in your website's comparison are similar, it looks like you used a wiggle of 50bp, which could account for the differences. The curve axes are incorrect reads / mapped reads and correct reads / total reads in the FASTQ.
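    A minimal sketch of how these two ratios could be computed for one point on such a curve, using a hypothetical record format (true position vs. mapped position, `None` for unmapped) and the 50bp wiggle:

```python
def roc_point(records, wiggle=50):
    """records: list of (true_pos, mapped_pos) pairs; mapped_pos is None for
    unmapped reads. A read is 'correct' when its mapped position falls within
    `wiggle` bp of the truth. Returns (incorrect/mapped, correct/total)."""
    mapped = [(t, m) for t, m in records if m is not None]
    correct = sum(1 for t, m in mapped if abs(m - t) <= wiggle)
    x = (len(mapped) - correct) / len(mapped) if mapped else 0.0
    y = correct / len(records) if records else 0.0
    return x, y

x, y = roc_point([(100, 120), (200, 400), (300, None)])
# one correct, one wrong, one unmapped: x = 0.5, y = 1/3
```

    On the real curve, one such point is computed per mapping-quality cutoff; the wiggle size directly changes which reads count as correct, which is why two evaluations with different wiggles can disagree.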

    From other reports with different read lengths and mutation settings, I see pretty consistently that BWA-MEM does indeed supersede BWA-SW, and MEM is also the most accurate open-source aligner I have seen so far!



  • New way to compare mappers and variant callers

    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love everyone's input. It's a cool project, but it's brand new and will need lots of work!
