  • lh3
    replied
    I should add that subread has the potential to become a very competitive RNA-seq mapper, achieving similar speed and accuracy to STAR. DNA-seq and RNA-seq are different worlds. For DNA-seq, we need very high mapping accuracy to reduce errors in downstream analyses, especially for the discovery of de novo mutations and structural variations. For RNA-seq, however, fluctuations in expression and library contamination with unspliced and randomly spliced RNAs outweigh mapping errors. It does not matter if a mapper misplaces 1% of reads as long as it can pinpoint most splice junctions. Mapping accuracy comes second. If subread focuses on RNA-seq mapping, it can be a success, especially as reads get longer. DNA-seq is not the strength of subread in its current implementation.

  • lh3
    replied
    The entire thread is in the context of genomic read alignment. As to RNA-seq, I was talking about M/I/D -- you do not care where a small indel is placed. For RNA-seq, it is true that the CIGAR operator "N" is frequently more important than anything else, but that is arguably beyond the scope of this thread. Thanks a lot for the clarification anyway.

    As I said above, the difficult part of evaluating alignments is that the alignment is determined by the scoring system, but the true scoring system varies with loci. In the end, all you are comparing is how close the mapper's scoring is to the simulator's scoring, and neither of them is realistic. Of course we can be more careful when doing the evaluation to alleviate this problem, but that would be more complicated and arguably overkill. Bwa-mem is probably better at alignments, but I have not found a simple and convincing way to evaluate it.

  • zee
    replied
    Originally posted by shi View Post
    An interesting paper. The results included in it seemed to agree with our evaluation results in which we found Novoalign and Subread have the best accuracy -- http://www.ncbi.nlm.nih.gov/pubmed/23558742
    Congratulations Shi on getting Subread published and I think the theory of it is quite interesting.
    Subread makes an interesting comparison of aligners in Table 2 of the publication. This table compares mapping RNA-seq data to a reference genome. Since the input reads are from the transcriptome, I can't see how aligning a higher percentage of reads makes Subread more accurate than Bowtie2, BWA or Novoalign, which all report lower alignment percentages. How do we know that these extra alignments are indeed good mappings? If you were aligning with all these mappers against a reference transcriptome (transcript sequences), then this table would make more sense to me.
    The reason other aligners probably report lower alignment percentages is related to Heng's statement about the false positive rate. RNA-seq reads spanning exon-exon junctions don't align well to a reference genome, and we usually need to add exon-exon junctions to the search space. An aligner could therefore introduce more error by trying to align an exon-junction read in genome space. This is also a concern for SV reads that point to gene fusions.
    It would have been better to compare against GSNAP, STAR, Mapsplice and Tophat for an RNA-seq comparison with the SEQC dataset. In the practical world I would either need to use these specialized aligners or build my own transcriptome database for searching.

  • alexdobin
    replied
    Originally posted by lh3 View Post
    In practice, RNA-seq/ChIP-seq and the discovery of structural variations do not care about CIGAR.
    Hi Heng,

    I beg to disagree with you on one point: of all the *-seqs, RNA-seq is probably the one that cares about the CIGAR string the most. Splice junctions provide crucial information about expression of isoforms (alternative splicing). Unlike indels, splice junctions are very abundant in long RNA libraries, and with longer reads we are seeing >50% of read pairs with at least one splice, so realignment would not be practical. Also, standard global alignment algorithms will not be feasible, I believe, since junctions span tens of kilobases in mammalian genomes.
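    To make the role of the "N" operator concrete, here is a small sketch (hypothetical helper, not tied to any particular aligner) that parses a CIGAR string and recovers how many reference bases versus read bases the alignment consumes. A spliced read's junction gap shows up as a large "N" run that consumes reference but no read bases, which is exactly the information isoform analysis needs.

```python
# Parse a CIGAR string and return (reference_span, read_span).
# Per the SAM spec: M/=/X consume both; D and N consume reference only
# (N being the intron gap in spliced RNA-seq alignments); I and S consume
# the read only; H and P consume neither.
import re

def cigar_spans(cigar):
    ref_len = read_len = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=XDN":
            ref_len += length
        if op in "M=XIS":
            read_len += length
    return ref_len, read_len
```

    For example, a 100bp read spanning a 10kb intron might carry the CIGAR "50M10000N50M": it covers 10100 reference bases with only 100 read bases, something no fixed-band global alignment would recover.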

    On the other hand I agree that aligners should not be judged harshly for soft-clipping a few bases from the ends of the reads when these bases cannot be placed confidently. In addition to the standard "correctly mapped locus" metric, one could imagine a metric which would count all correctly (true positive) and incorrectly (false positive) mapped bases for each read - that would show which aligners perform better for recovering the entire alignment.
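    The base-level metric suggested above could be realized roughly as follows (a sketch; the pairing representation and function name are illustrative, not from any published tool): represent both the true and the reported alignment as (read position, reference position) pairings and count agreements and disagreements. Soft-clipped bases appear in neither pairing and so are simply not counted against the aligner.

```python
# Hypothetical per-base accuracy metric: count read bases placed on the
# correct reference coordinate (true positives) versus bases placed on a
# wrong coordinate (false positives).

def base_level_accuracy(true_pairs, reported_pairs):
    truth = dict(true_pairs)  # read position -> true reference position
    tp = fp = 0
    for read_pos, ref_pos in reported_pairs:
        if truth.get(read_pos) == ref_pos:
            tp += 1
        else:
            fp += 1
    return tp, fp
```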

    Cheers
    Alex

  • lh3
    replied
    You should not require the CIGAR to be the same. When there is an indel at the last couple of bases of a read, there is no way to find the correct CIGAR. In a tandem repeat, multiple CIGARs are equivalent. The CIGAR is also greatly affected by the scoring matrix. A mapper that uses scoring similar to the simulation will perform much better (PS: mason, as I remember, does not simulate gaps under the affine-gap model, which will bias against the more biologically meaningful affine-gap model), while in fact the true scoring matrix varies greatly with loci - you cannot simulate that realistically. In practice, RNA-seq/ChIP-seq and the discovery of structural variations do not care about CIGAR. The mainstream indel calling pipelines, including samtools, gatk and pindel, all use realignment. They do not trust the CIGAR. For SNP calling, BAQ and GATK realignment are specifically designed to tackle this problem, which is not solvable from single read alignment. In addition, if we worry about the CIGAR, we can do a global alignment to reconstruct it, which is very fast.
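    The "global alignment to reconstruct the CIGAR" step can be sketched as a toy Needleman-Wunsch with linear gap penalties (an assumption for brevity; real mappers use affine gaps). Note that in a tandem repeat several placements of the gap score identically, and the traceback picks one arbitrarily, which illustrates why equally valid CIGARs exist.

```python
# Toy Needleman-Wunsch global alignment returning (score, CIGAR string).
# Scoring: match=1, mismatch=-1, gap=-2 (linear gaps, illustrative only).

def global_align_cigar(ref, read, match=1, mismatch=-1, gap=-2):
    n, m = len(ref), len(read)
    # score[i][j] = best score aligning ref[:i] against read[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if ref[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # M: match/mismatch
                              score[i][j - 1] + gap,    # I: insertion to read
                              score[i - 1][j] + gap)    # D: deletion from read
    # Traceback, preferring the diagonal on ties.
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if ref[i - 1] == read[j - 1] else mismatch):
            ops.append('M'); i -= 1; j -= 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            ops.append('I'); j -= 1
        else:
            ops.append('D'); i -= 1
    ops.reverse()
    # Run-length encode the operations, e.g. ['M','M','D'] -> "2M1D".
    cigar, k = "", 0
    while k < len(ops):
        run = k
        while run < len(ops) and ops[run] == ops[k]:
            run += 1
        cigar += f"{run - k}{ops[k]}"
        k = run
    return score[n][m], cigar
```

    For instance, aligning read "ACGACGT" to reference "ACGTACGT" recovers a one-base deletion; aligning "ACAC" to the tandem repeat "ACACAC" scores the same whichever AC unit is deleted.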

    The primary goal of a mapper is to find correct mappings or positions. Without correct mapping positions, none of the above (realignment etc.) will work. The CIGAR is a secondary target. By requiring an exact CIGAR, you are counting many correct mappings as wrong ones, but these have little effect on downstream analyses. You are evaluating an accuracy that does not have much to do with practical analyses. At the same time, the large number of incorrect CIGARs completely hides the true mapping accuracy of accurate mappers such as bowtie 2 and novoalign. The LAST and bowtie 2 papers are talking about a 1e-5 error rate for high-mapQ mappings, while you are showing a 2% error rate. These are not comparable.
    Last edited by lh3; 04-26-2013, 05:45 AM.

  • shi
    replied
    Originally posted by lh3 View Post
    In Bowtie 2 Fig 2a, bowtie 2 and bwa essentially made no mistakes at sensitivity 90%. In your figure 4c, at the highest mapQ (the leftmost dot), bowtie 2 has a ~1500/8e4 error rate and bwa has a 2000/9e4 error rate. These are very different results from that in the bowtie 2 paper, where the error rate for high mapping quality is vanishingly small.

    I have explained that not performing alignments for suboptimal loci is exactly why subread is fast but not as accurate as others. Due to repeats and sequencing errors/variants, a locus with fewer votes may turn out to be optimal. To achieve high accuracy, you need to see enough suboptimal loci and perform alignment around them to finally determine which has the optimal score. GSNAP is taking the right strategy. If I remember correctly, it uses votes (GSNAP does not use a terminology "vote" though) to filter bad alignments and then perform extension for the rest even if they are not optimal. YABOS, as I remember, is doing something similar. Both bwa-sw and bwa-mem also collect votes to filter out very bad loci but may perform DP extension for many suboptimal loci in repetitive regions. In some way, voting is just the starting point of these mappers. They do not stop at voting because they want to achieve higher accuracy. This is why they are more accurate.
    If you read our paper closely, you will find that the CIGAR strings of mapped reads were required to be correct, in addition to the mapping locations being correct, when we called correctly aligned reads in our simulation. This may explain why aligners in our simulation had higher error rates. The parameter differences when using Mason could cause this difference as well. But the overall error rates between the two studies were comparable. I wouldn't expect the results from two different studies to be identical.

    I'm not convinced by your explanation. We get around this suboptimal alignment issue by using more subreads in the voting. This gives the Subread aligner a lot more power in mapping reads correctly. You will find this very helpful if you implement it in your aligner, should you truly use the seed-and-vote approach as proposed in our Subread paper.

    Subread does not stop at voting; it employs a powerful in-fill algorithm to fill in the gaps in the read sequence after voting to finalize the alignment.

  • lh3
    replied
    In Bowtie 2 Fig 2a, bowtie 2 and bwa essentially made no mistakes at sensitivity 90%. In your figure 4c, at the highest mapQ (the leftmost dot), bowtie 2 has a ~1500/8e4 error rate and bwa has a 2000/9e4 error rate. These are very different results from that in the bowtie 2 paper, where the error rate for high mapping quality is vanishingly small.

    I have explained that not performing alignments for suboptimal loci is exactly why subread is fast but not as accurate as others. Due to repeats and sequencing errors/variants, a locus with fewer votes may turn out to be optimal. To achieve high accuracy, you need to see enough suboptimal loci and perform alignment around them to finally determine which has the optimal score. GSNAP is taking the right strategy. If I remember correctly, it uses votes (GSNAP does not use a terminology "vote" though) to filter bad alignments and then perform extension for the rest even if they are not optimal. YABOS, as I remember, is doing something similar. Both bwa-sw and bwa-mem also collect votes to filter out very bad loci but may perform DP extension for many suboptimal loci in repetitive regions. In some way, voting is just the starting point of these mappers. They do not stop at voting because they want to achieve higher accuracy. This is why they are more accurate.

  • shi
    replied
    Figure 2(a) of the Bowtie2 paper shows that the false discovery rates of aligners were around 3 percent, which is similar to what was shown in our paper.

    The major difference between the seed-and-vote paradigm used by Subread and the one you referred to is that Subread does not perform alignments for the seed sequences which have made successful votes in its finalization step, whereas seed-and-extend needs to perform the final alignment for almost every base in the read using dynamic programming or backtracking. Therefore Subread has a very small computational cost in its final alignment. For example, if 8 out of 10 subreads (16mers) made successful votes when mapping a 100bp read, there will be at most 20 bases which need to be aligned in the finalization step. This is one of the reasons why Subread is very fast. Another reason is that the mapping location is determined by the voting process, i.e. the genomic location receiving the largest number of votes (supported by the majority of the extracted subreads) is chosen as the final mapping location. The size of the genomic span of the voting subreads is used to break ties when multiple best locations are found.
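    The voting step described above can be sketched as follows (a toy illustration, not Subread's actual implementation): each exact subread hit in a k-mer index votes for the read start position it implies, and the position with the most votes wins. A sequencing error only knocks out the subreads overlapping it; the remaining votes still converge on the correct location.

```python
# Toy seed-and-vote mapper: evenly spaced k-mer subreads vote for the
# genomic start position of the read implied by each exact index hit.
from collections import defaultdict

def build_index(ref, k):
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index

def seed_and_vote(read, index, k, n_subreads=10):
    # Sample up to n_subreads evenly spaced k-mers from the read.
    step = max(1, (len(read) - k) // max(1, n_subreads - 1))
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, step):
        for hit in index[read[offset:offset + k]]:
            # Each hit votes for where the read's first base would sit.
            votes[hit - offset] += 1
    if not votes:
        return None, 0
    best = max(votes, key=votes.get)
    return best, votes[best]
```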

    Given the decreasing error rates of sequencers, seed-and-vote can quickly and accurately map most reads because a sufficient number of good subreads can be extracted from them. Subread spends most of its time on mapping those reads which have more errors, indels and junctions.

  • lh3
    replied
    In my understanding, the review process being kept confidential before publication is meant to prevent others from stealing your ideas. As your paper has already been published, I am not sure why I cannot disclose that I was a reviewer of an earlier version of your manuscript, without saying which version. On the contrary, I think that once a paper is published, it would be better to make the review process transparent. I am currently signing all my reviews. I am responsible for my words in a review, even if I, admittedly, give bad suggestions sometimes. On the other hand, I could be wrong about the journal policy. If I have violated it, I deeply apologize and promise not to do that again.

    You should read the bowtie2 and the recent LAST papers. Bowtie2 uses mason, the same simulator you use in Fig 4a. LAST simulates sequencing errors based on arguably realistic error profiles, very similar to yours in Fig 4a. You can also ask the developers of those popular mappers whether a 2% error rate for mapQ>~20 alignments represents the accuracy of their mappers on real data.

    Simulating data based on empirical qualities is quite common. I did that 5 years ago (albeit with a simpler version of simulation) and many others have done similar or better. As to seed-and-vote, both bwa-sw and bwa-mem use it. I did not emphasize it because this idea is frequently seen among long-read and whole-genome aligners. Among more recent papers, GSNAP and YABOS essentially use the same idea, and they apply the strategy very carefully to avoid wrong alignments.

  • shi
    replied
    Firstly let me point out what I received in my inbox about your reply --- "Well, in another thread, I have told you that I reviewed your manuscript. I meant to let you know my opinion without making it public. In fact, ...". I think that here, and before, you broke the trust that the prestigious journal PNAS placed in you when asking you to review our paper.

    I would also like to point out that Figure 4 was NOT included in the version of our paper you reviewed.

    Let me tell you the reason why Subread had a high error rate in your evaluation. It is because your simulation data were generated using wgsim, which assumes that sequencing errors are uniformly distributed along the read sequence. This assumption is invalid because, for example, Illumina sequencing data have higher error rates at the start and end positions of the reads. You agreed with this in a post of your own from a few days ago:

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    The parameters of our Subread aligner were tuned using a real sequencing error model. We used the sequencing error information provided in the SEQC (MAQC III) data to generate simulation data very close to the real data, to tune Subread's parameters to make it work best on real data. This fundamental difference in generating simulated reads gives rise to the remarkable differences between our evaluation and yours. This difference may not change the evaluation results much for the family of seed-and-extend aligners, but it changed a lot for the evaluation of the Subread aligner, which uses an entirely different paradigm -- seed-and-vote.

    The ERCC spike-in sequencing data included in the SEQC study are very useful for evaluating the accuracy of alternative aligners, because they are real sequencing data with known truth. Table 3 in our paper clearly shows the superiority of the Subread aligner over other aligners.

    Figure 4 is not surprising to us at all. It is simply because we used the real sequencing errors (extracted from SEQC reads), which were not used in other evaluations at all.

  • lh3
    replied
    Well, in another thread, I have told you that I reviewed your manuscript. In fact, part of my review is still true. Subread is really fast, but this comes at the cost of accuracy. On the data set in my manuscript, subread reports 1455 wrong alignments out of its top 1704k alignments at mapQ threshold 101, and 41770/1803k at mapQ=1 (note that I am trying to choose favorable thresholds for subread). In comparison, bwa-mem makes only 3 mistakes out of 1735k alignments at mapQ=47 and 1797/1910k at mapQ=1. As to novoalign, 2/1843k at mapQ=65 and 1010/1922k at mapQ=4. Bowtie2: 74/1713k at mapQ=24; 4701/1884k at mapQ=4. These aligners are all more accurate than subread - they report fewer wrong alignments at similar or better sensitivity.

    In addition, Figure 4 in your paper is surprising. It shows that novoalign made 40k wrong alignments out of 10 million reads even in unique regions with repeats removed (Fig 4a), while bwa wrongly aligned 2k reads out of 100k (Fig 4c; a 2% error rate!). These error rates are exceedingly high. If you think my plots are biased, you can have a look at the ROC curves in the Bowtie2 paper, or more clearly in the recent LAST paper [PMID:23413433]. All these ROC curves, including mine, are talking about a 1e-5 or even lower error rate. That is the accuracy a mapper should achieve.
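    The ROC-style curves being compared here can be sketched as follows (a hypothetical helper, assuming simulated reads whose true positions are known): sweep the mapQ threshold from high to low and record, at each threshold, how many alignments are retained and how many of those are wrong. Plotting wrong count against retained count gives the curves the numbers above are read from.

```python
# Build ROC-style points from simulated alignments.
# alignments: list of (mapq, is_correct) pairs, truth known from simulation.

def mapq_roc(alignments):
    points = []
    for thr in sorted({q for q, _ in alignments}, reverse=True):
        kept = [(q, ok) for q, ok in alignments if q >= thr]
        wrong = sum(1 for _, ok in kept if not ok)
        points.append((thr, len(kept), wrong))  # (threshold, #mapped, #wrong)
    return points
```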
    Last edited by lh3; 04-25-2013, 06:51 PM.

  • shi
    replied
    An interesting paper. The results included in it seemed to agree with our evaluation results in which we found Novoalign and Subread have the best accuracy -- http://www.ncbi.nlm.nih.gov/pubmed/23558742

  • lh3
    replied
    Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-the-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/lh3/bwa. Contact: [email protected]

  • narain
    replied
    Dear @lh3 and @adaptivegenome

    Thank you for addressing my concerns with regard to the applicability and merits of various applications. I can now use the tools wisely. Is there a paper due to explain the improvements in bwa-mem over its predecessor bwa-sw?

    Narain

  • lh3
    replied
    Originally posted by narain View Post
    @lh3 and other eminent speakers in the thread

    It's nice to know that so many new aligners have come up in competition to BLAST, which dominated for about two decades and in many ways is still the most popular.
    Actually, none of the new aligners takes BLAST as a competitor. BLAST and NGS mappers are in different domains.

    In your opinion, for what purposes is bwa-sw better than bwa-mem? I am assuming bwa-mem to be more of an equivalent of bowtie2 for long reads such as 250 to 400bp, but not very good for reads on the order of kilobases.
    You can find in the bwa manpage that both bwa-mem and bwa-sw are designed for 70bp up to a few megabases. As long as you are aware of the several differences between 500bp and 5Mbp alignment, it is not hard to write an aligner that works with a wide range of read lengths. As I have already done that in bwa-sw, I can do the same in bwa-mem.

    I would say Bowtie2 is primarily designed for sequence reads several hundred bp in length, and its performance drops with increasing read length.

    For what purposes could bwt-sw be more applicable than bwa-sw?
    Bwt-sw guarantees to find all local hits while bwa-sw does not. When you want to make sure a read alignment is correct, bwt-sw is the way to go.

    In what ways is bwa-sw better than MegaBlast, given that both align long sequences on the order of kilobases or larger?
    You can find the arguments in the bwa-sw paper. Blast and megablast are good for homology searches. However, they report local hits only, which is not quite suitable for long-read, contig or BAC alignment. The bwa-sw paper gives an example:

    Say we have a 1kb read. The first 800bp, which contains an Alu, is unambiguously mapped to chr1 and the remaining 200bp unambiguously to chr2. When you align this read with bwa-sw/bwa-mem, you will get two hits, as we would want to see. If you align with blast/blat/ssaha2, you will get the 800bp hit and a long list of Alu hits before seeing the 200bp hit, but for read alignment we do not care about these Alu hits, as they are contained in the 800bp hit.

    In addition to this problem, blast does not compute mapping quality. For resequencing, contig alignment etc., this is a more practical metric than the e-value, which is only useful for homology searching. Furthermore, bwa-sw/bwa-mem are far faster than megablast. (PS: another problem with blast, though I am not sure about megablast, is its X-dropoff heuristic. This at times leads to fragmented alignments which are hard to process.)

    In what way could bwt-sw be different from BLAST? Although the latter uses heuristics, it still comes up with all the relevant local alignments if you tune the -e parameter accordingly, such as increasing it to 10. The website of bwt-sw states to use it at your own risk, implying that the software is not well tested!
    Blast uses heuristics, which means it may miss optimal hits regardless of the -e in use. If there are two mismatches at positions 11 and 22 of a 32bp read, blast will miss it, as there are no seed hits. Bwt-sw will not have this problem. Bwt-sw instead has problems with repeats. Internally, bwt-sw switches to the standard SW when the score is high enough. When the input is a repeat, it may have millions of candidates, which can cause bwt-sw to fail.
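    The seeding argument can be checked directly (a sketch; the helper name is illustrative): with mismatches at 1-based positions 11 and 22 of a 32bp read, the longest exact stretch shared with the reference is 10bp, one base short of blastn's default word size of 11, so no exact-word seed hit exists.

```python
# Length of the longest run of identical bases between a read and the
# reference window it should align to; an exact-word seeder needs a run
# at least as long as its word size to find the hit.

def longest_exact_stretch(read, ref_window):
    best = run = 0
    for a, b in zip(read, ref_window):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best
```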

    On GCTA, the developers have contacted me off the forum. I have one concern with the evaluation, and they have acknowledged it. I think that until the concern is resolved, I would take the results with a little caution. In general, though, GCTA is great. I really like the idea and its implementation, and appreciate the great efforts.
    Last edited by lh3; 04-25-2013, 10:37 AM.
