
  • zee
    replied
    Originally posted by lh3 View Post
    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
    I like this idea a lot as well. We should try it sometime.



  • shi
    replied
    It would certainly be useful to incorporate the positional distribution of SNPs and indels on the reference genome into the generation of simulated reads. I think the dbSNP database might be useful for this, but it seems harder for indels.



  • lh3
    replied
    We should use a higher SNP/indel rate, because that is more realistic. If you plot the heterozygosity of a single individual along the genome, you will find that it is far from uniform. Some 100kb regions may contain no SNPs, while other regions may reach a heterozygosity of 1-2%. This is caused primarily by coalescence, and partly by selection, non-homologous gene conversion and variable recombination/mutation rates. Although the fraction of high-heterozygosity regions is small, the variant error rate in those regions is much higher. If you simply simulate 1 SNP per 1,000 bases, you will rarely get regions with high SNP density, which deviates from real data and is less useful for evaluating the performance of a mapper in the context of variant calling and the discovery of structural variations, which demand even higher mapping accuracy.

    As to Mason, I think it has a parameter for feeding in error profiles; I just do not know the format.
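    The non-uniform heterozygosity described above can be illustrated with a minimal sketch (not code from any of the tools discussed): draw a per-window SNP rate from an exponential distribution, a simple heavy-tailed stand-in for coalescent variation. The function name and parameters are made up for illustration.

```python
import random

def window_snp_rates(n_windows, mean_rate=0.001, cap=0.02, seed=42):
    """Draw one SNP rate per 100kb window from an exponential distribution,
    capped at 2% heterozygosity. Most windows end up near-empty, while a
    small fraction are SNP-dense, unlike a uniform 1-SNP-per-1000bp model."""
    rng = random.Random(seed)
    return [min(rng.expovariate(1.0 / mean_rate), cap) for _ in range(n_windows)]

rates = window_snp_rates(30000)            # ~3Gb genome in 100kb windows
dense = sum(r > 0.01 for r in rates)       # windows above 1% heterozygosity
```

    A uniform simulator never produces the `dense` windows, which is exactly the part of real data that stresses a mapper.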



  • shi
    replied
    Yes, we extracted the entire string of quality scores from each read in a SEQC dataset and assigned them to the simulated reads.

    You can feed a quality profile to Art, but Mason does not allow you to do that. When running Mason, we used the same indel and SNP rates as used in the BWA paper.

    It makes more sense to me to use typical rates of indels, SNPs and sequencing errors, because this makes it easier to see aligner performance close to what it would be on real data: for example, 1 SNP per 1,000 bases, 1 indel per 10,000 bases and a sequencing error rate of 1 to 5 percent.
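    A minimal sketch of the quality-driven mutation scheme described above, assuming Sanger-encoded (offset-33) quality strings; the function is hypothetical, not code from the Subread paper:

```python
import random

def mutate_read(seq, qual, offset=33):
    """Mutate each base with the error probability implied by its Phred
    score, p_err = 10^(-Q/10): the lower the quality, the more likely
    the base is replaced by a different one."""
    out = []
    for base, q in zip(seq, qual):
        p_err = 10 ** (-(ord(q) - offset) / 10)
        if random.random() < p_err:
            out.append(random.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

# '!' is Phred 0 (p_err = 1.0), 'I' is Phred 40 (p_err = 0.0001)
noisy = mutate_read("ACGTACGT", "!!!!IIII")
```

    Sampling the whole quality string from one real read, rather than per-base, preserves the correlation of error rates along a read.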



  • lh3
    replied
    Originally posted by shi View Post
    Mason and Art do allow you to specify the rates of indels and SNPs introduced into the simulated reads.
    Thanks for the clarification. I was only reading their READMEs; I can see more options in the command-line help. ART is unable to simulate variants, and its indel error model is unclear. Mason is quite complete in features (though I have not figured out how to feed error profiles to it). I may recommend it over wgsim/dwgsim. However, to use Mason, we should still use a high variant probability or a high sequencing error rate to stress the mapper. The default setting is much easier than real data: due to the coalescent process, there are far more hard, SNP-dense regions in real data than in most simulations.

    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what we did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated the read bases according to their quality scores (the lower the quality score, the more likely the base was changed). This made the simulated reads very similar to real reads.
    Also thanks for this; I missed that bit. I assume you are extracting the full quality string of a read, not a single base. Is that right (it is not clear from the text)? Using real qualities has other concerns, but the key to using them properly is to sample the quality string of a whole read, instead of sampling the quality of each base independently.

    We also included simulated reads generated from Mason and Art in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
    Actually, I reviewed a version of your manuscript. I stayed anonymous that time (I sign most of my reviews nowadays). I suggested the ROC-like representation and recommended focusing on RNA-seq.



  • shi
    replied
    Mason and Art do allow you to specify the rates of indels and SNPs introduced into the simulated reads.

    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what we did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated the read bases according to their quality scores (the lower the quality score, the more likely the base was changed). This made the simulated reads very similar to real reads. We also included simulated reads generated from Mason and Art in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742



  • adaptivegenome
    replied
    Originally posted by lh3 View Post
    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
    Another great quote from Heng



  • lh3
    replied
    No read simulator simulates realistic data. ART and Mason infer quality from empirical alignments, which will: 1) bias towards mappers using similar algorithms; 2) miss reads with excessive mismatches/gaps; and 3) be dependent on the reference and sequences in use. I don't know much about ACANA, used by ART, but RazerS, used by Mason, does not do affine-gap alignment, which will affect its gap modeling. My understanding is that both of them assume position independence. This is clearly not true for Illumina reads: some reads are systematically worse at every base than others (for example, if there is a Q2 base, the following bases are also Q2). If I am right, both ART and Mason assume no variation between the reads and the reference genome. This is rarely the case in practice. In an Illumina alignment, the majority of gaps, especially long gaps, are variants rather than sequencing errors. Furthermore, when we align reads to a related species ~1% divergence away from the sequenced sample, the reads will effectively carry 1% uniform errors, and 1% divergence is not rare when we study non-human organisms. In this sense, ART and Mason do some things better but others worse.

    Wgsim uses a default sequencing error rate as high as 2% because I want to stress aligners. No mapper makes a mistake on a perfectly matched read. However, with 100bp or longer reads, we should be able to access regions of 2% divergence (for human, quite a lot of reads span two SNPs). It is these regions that reveal, and demand, the capability of an aligner. Longer indels (say >5bp), which I guess are very rare in mason/art simulations, also stress the mappers a lot.

    Finally, my experience is that the relative performance of mappers stays the same within a reasonably large range of error configurations. The maq read simulator generates more realistic errors; I stopped using that model in wgsim mainly because, when I tested it at the time, it added little to the evaluation of mappers.


    I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this would probably not make much difference, and life is too short to do everything perfectly.
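    The coordinate-preserving part of this idea is straightforward for SNPs (indels would shift coordinates and need a liftover table). A minimal sketch with a made-up input format, not actual HuRef data:

```python
def apply_snps(ref, snps):
    """Overlay alternate alleles at known positions so that reads simulated
    from the donor sequence keep reference coordinates for truth checking.
    ref: reference sequence; snps: {0-based position: alternate base}."""
    donor = list(ref)
    for pos, alt in snps.items():
        donor[pos] = alt
    return "".join(donor)

donor = apply_snps("ACGTACGTACGT", {3: "A", 7: "C"})  # same length as ref
```

    Because the donor sequence has the same length as the reference, the true origin of each simulated read can be compared directly against the mapper's reported position.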
    Last edited by lh3; 04-15-2013, 07:27 AM.



  • oiiio
    replied
    This is a good point, and I think it would be very interesting to see how the alignment reports change with the simulator as well. Comparing simulators that do and don't use that kind of error model could test how reliant quality-sensitive aligners are on base qualities, and also on expected error positions, which would be very cool!

    In the meantime you may also find the variant calling test interesting, as it is performed on real whole-exome data (NA12878) from multiple platforms. With it you can compare the VCFs generated by different aligner + variant caller pipelines.

    For example, here is the variant calling report for the 100bp 30x coverage Illumina exome :




    Originally posted by shi View Post
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei
    Last edited by oiiio; 04-15-2013, 06:50 AM.



  • adaptivegenome
    replied
    GCAT actually uses DWGSIM, but you are right about its assumptions. GCAT also includes real data for the variant calling benchmarks, but I think you are right that we should try some other simulators.

    We will give Mason a try, and it would be awesome if you were to upload some results in the meantime.

    Thanks for the input!


    Originally posted by shi View Post
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei



  • shi
    replied
    Thanks for the info, oiiio. I will try to upload the Subread results soon.

    I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true, because, for example, Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulators to your project as well.

    The simulator 'Mason' was used in both the bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator 'Art' was also used in the Subread paper.

    In the simulation data you generated using wgsim, all read bases have exactly the same quality score ("2"), which corresponds to a Phred score of 17; this is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. Art and Mason seem to do better in this regard.

    Cheers,
    Wei
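    For reference, the '2' → Phred 17 conversion mentioned in this thread follows from the Sanger/Illumina 1.8+ FASTQ encoding (ASCII code minus 33); a quick sketch:

```python
def phred(ch, offset=33):
    """Phred quality of a FASTQ quality character (Sanger offset 33)."""
    return ord(ch) - offset

def error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

q = phred("2")     # ord('2') = 50, so 50 - 33 = 17
p = error_prob(q)  # about 0.02, i.e. roughly a 2% per-base error rate
```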



  • oiiio
    replied
    The inner distance for paired-end reads is always 300bp (so a template size of 500bp for 100bp reads).

    The small indels range from 1-10bp in size, and the large indels from 10-24bp.

    Looking forward to seeing the Subread results!



  • shi
    replied
    Originally posted by adaptivegenome View Post
    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love everyone's input. It's a cool project, but it's brand new and will need lots of work!
    This looks like an interesting project and I'm interested in running the Subread aligner on the test datasets. But could you please provide a bit more information about the datasets, such as the fragment length of the PE data and the indel lengths in the small-indel and large-indel datasets?

    Thanks,
    Wei



  • oiiio
    replied
    In response to @lh3 from http://seqanswers.com/forums/showthr...t=15200&page=4

    Although the metrics used in your website's comparison are similar, it looks like you used a wiggle of 50bp, which could account for the differences. The curve axes are incorrect reads / mapped reads and correct reads / total reads in the FASTQ.
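    A minimal sketch of how these two ratios could be computed for one point on such a curve, using a hypothetical record format (true position vs. mapped position, `None` for unmapped) and the 50bp wiggle:

```python
def roc_point(records, wiggle=50):
    """records: list of (true_pos, mapped_pos) pairs; mapped_pos is None for
    unmapped reads. A read is 'correct' when its mapped position falls within
    `wiggle` bp of the truth. Returns (incorrect/mapped, correct/total)."""
    mapped = [(t, m) for t, m in records if m is not None]
    correct = sum(1 for t, m in mapped if abs(m - t) <= wiggle)
    x = (len(mapped) - correct) / len(mapped) if mapped else 0.0
    y = correct / len(records) if records else 0.0
    return x, y

x, y = roc_point([(100, 120), (200, 400), (300, None)])
# one correct, one wrong, one unmapped: x = 0.5, y = 1/3
```

    On the real curve, one such point is computed per mapping-quality cutoff; the wiggle size directly changes which reads count as correct, which is why two evaluations with different wiggles can disagree.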

    From other reports with different read lengths and mutation settings, I see pretty consistently that BWA-MEM does indeed supersede BWA-SW, and MEM is also the most accurate open-source aligner I have seen so far!



  • New way to compare mappers and variant callers

    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love everyone's input. It's a cool project, but it's brand new and will need lots of work!
