
  • wlangdon
    replied
I just did a speed v accuracy test on Cancer Institute paired-end sequences. The GP tweak to
    Bowtie2 came out fastest (4 times the speed of BWA), took less than half the memory, and
    had almost the same accuracy (82.1% v 83.1%).
    See http://arxiv.org/abs/1301.5187
    Bill

  • cjp
    replied
    I think they are here (and other aligners' parameters too):



    Chris

  • zee
    replied
    Hi Heng,

    Would you mind sharing the parameters you used for Bowtie2-beta4 on 100bp Illumina reads?

    Thanks.

    Originally posted by lh3 View Post
    Updated to bowtie2-beta4. On accuracy, bowtie2-beta4 is similar to bwa-sw overall. I have also done the comparison on real data following the way I used in the bwa-sw paper. Out of 138k 454 reads with average read length 355bp, bwa-sw misses 1094+58 good alignments (~90% shorter than 100bp) and gives 31 questionable alignments, while bowtie2-beta4 misses 13+91 good alignments and gives 65 questionable alignments. The accuracy is largely indistinguishable for practical applications. On speed, Bowtie2 is about 20% faster and uses less memory.

    In conclusion, bowtie2-beta4 has similar accuracy to bwa-sw for both 100bp simulated data and 350bp real 454 data. It is one of the best (accuracy+speed) mappers for hiseq and 454 reads. I will start to recommend it to others along with smalt/novoalign/gsnap. I think a missing feature in bowtie2 is to properly report chimeric alignments, which is essential to mapping even longer sequences. This should be fairly easy to implement.

  • twu
    replied
Heng, thanks for your comments about GSNAP. I will think more about how to get more informative mapping quality results, and would welcome any further suggestions you might have. Actually, one of the reasons I haven't done much with the mapping quality calculations is that my colleagues here have used BWA+GATK for SNP calling, and they told me that GSNAP had similar behavior to BWA in its mapping quality calculations. But perhaps they were wrong.

    I also noted your timing results where the GSNAP paired-end algorithm is more than 2 times slower than the single-end algorithm. One of the reasons is that for paired-end data, GSNAP looks deeper at suboptimal results on each of the two ends in order to get a concordant result. In some cases, GSNAP may need to do its own version of a Smith-Waterman alignment in the neighborhood of a good alignment for the other end. Instead of using Smith-Waterman, though, GSNAP uses its GMAP algorithm, which is good for finding splicing, because our main application so far has been RNA-Seq, rather than DNA-Seq.

    GSNAP is also like BWA in that it does not use base quality scores for alignment. We also do not use base quality scores for trimming, but just pass the information on to the SNP caller.
    Last edited by twu; 12-06-2011, 12:40 PM.

  • lh3
    replied
    Updated to bowtie2-beta4. On accuracy, bowtie2-beta4 is similar to bwa-sw overall. I have also done the comparison on real data following the way I used in the bwa-sw paper. Out of 138k 454 reads with average read length 355bp, bwa-sw misses 1094+58 good alignments (~90% shorter than 100bp) and gives 31 questionable alignments, while bowtie2-beta4 misses 13+91 good alignments and gives 65 questionable alignments. The accuracy is largely indistinguishable for practical applications. On speed, Bowtie2 is about 20% faster and uses less memory.

    In conclusion, bowtie2-beta4 has similar accuracy to bwa-sw for both 100bp simulated data and 350bp real 454 data. It is one of the best (accuracy+speed) mappers for hiseq and 454 reads. I will start to recommend it to others along with smalt/novoalign/gsnap. I think a missing feature in bowtie2 is to properly report chimeric alignments, which is essential to mapping even longer sequences. This should be fairly easy to implement.
    Last edited by lh3; 12-06-2011, 12:24 PM. Reason: typo

  • adaptivegenome
    replied
I have been wondering the same thing. So if I were to compare the recall of mutations from BWA-mapped reads against a mapper that does recalibrate base qualities, do you think it would matter if I used GATK to first recalibrate the reads mapped by BWA? I think you did this previously in a paper with Nils, right? How did you do the comparison?

    Sorry for all the questions!

  • lh3
    replied
    BWA does not use base quality during alignment except for trimming. I have not been convinced that the difference between using base quality or not has a significant effect on downstream data analyses.
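
    For reference, BWA's quality-based trimming (the -q option) works roughly like this; a minimal sketch, not BWA's actual code, and the function name and default threshold are made up for illustration:

```python
def trim_3prime(quals, q_thresh=15):
    """BWA-style 3' quality trimming (sketch): return the length to keep,
    where the cut maximizes sum(q_thresh - q) over the trimmed tail.
    quals: list of Phred base qualities, 5' to 3'."""
    best_sum, cur_sum, best_len = 0, 0, len(quals)
    for i in range(len(quals) - 1, -1, -1):  # walk in from the 3' end
        cur_sum += q_thresh - quals[i]
        if cur_sum > best_sum:               # best place to cut so far
            best_sum, best_len = cur_sum, i
    return best_len

trim_3prime([30, 30, 30, 5, 5, 5])  # trims the low-quality tail, keeps 3 bases
```

    The point is that trimming only uses qualities to shorten the read; the alignment itself is then quality-blind.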

  • adaptivegenome
    replied
Heng, I have a quick question. In trying to use simulated data to recall mutations, would you base-quality recalibrate BWA-mapped reads? With real human data you can use dbSNP, but with simulated data what would you use?

  • lh3
    replied
BTW, here is an interesting observation on speed. I simulated 100k single-end (SE) reads and 100k pairs of paired-end (PE) reads (200k reads). One would think a program should run about twice as slow in the PE mode simply because there are twice as many reads. This is true for bowtie2/bwa/bwa-sw. Nonetheless, gsnap is much slower in the PE mode. My guess is that in the PE mode, gsnap visits more suboptimal hits to get more reads paired. It is slower, but the accuracy is also higher. On the other hand, both novoalign and smalt are faster in the PE mode, but at the cost of a slightly higher false positive rate. My explanation is that they do not map each end separately and then pair them (as bwa/bwa-sw do), but rather map the pair as a whole.
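
    The "map each end separately, then pair" strategy can be sketched like this; a toy illustration only (the helper is made up, and strand/chromosome checks are omitted):

```python
def pair_hits(hits1, hits2, max_insert=500):
    """Pair independently mapped ends. Each hit is (pos, score).
    Return the (hit1, hit2) pair with the best combined score whose
    implied insert size is within max_insert; None if no pair qualifies."""
    best, best_score = None, float("-inf")
    for h1 in hits1:
        for h2 in hits2:
            if abs(h2[0] - h1[0]) <= max_insert:   # ends close enough to pair
                s = h1[1] + h2[1]
                if s > best_score:
                    best, best_score = (h1, h2), s
    return best

# A weaker hit at pos 100 beats a stronger one at pos 9000
# because only the former can be paired with the mate:
pair_hits([(100, 50), (9000, 60)], [(300, 55), (9400, 40)])
```

    To pair more reads, a mapper must keep suboptimal single-end hits around, which is one plausible reason PE mode can cost more than 2x the SE time.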

  • lh3
    replied
@Thomas I quite like the GSNAP algorithm as well as its implementation, and I have recommended it to others already. It is one of the top NGS mappers nowadays. My opinion is that for GSNAP the only thing that might be improved is a more useful mapping quality. I know GSNAP gives mapQ, but for "unique" hits the vast majority get a mapQ of 40, and the few hits with higher mapQ are actually not more accurate. Perhaps having higher mapQ may not help standard SNP calling much, but there are areas where extremely high mapping accuracy is preferred.

I am not sure how to improve mapQ for single-end mapping, but I think you should be able to derive better mapQ for paired-end mapping. It seems to me that GSNAP visits more suboptimal hits in the PE mode. By seeing more hits and using the pairing information, you can know that some hits can barely be wrong.

  • lh3
    replied
@kopi-o As Steve has argued, the experiment I am doing now is not a classical binary classification. There are only "wrongly mapped" reads, but no "false positives" in the strict sense.

I guess what you mean is to look at reads generated from regions with simulated polymorphisms. This is also what Gerton insists on doing, and I agree it is a good thing to do. The current ROC does not tell you whether a mapper misses hits uniformly or consistently misses hits in some regions. Similarly, the ROC does not tell you whether a mapper tends to produce consistent errors or random errors. All of these matter in variant calling.

    If you are interested in variant calling, the right evaluation is to plot the ROC for variant calls. This is a standard binary classification and is more telling.
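
    For concreteness, the kind of mapper "ROC" I am plotting can be built by sweeping a mapQ threshold from high to low; a sketch with a made-up function name:

```python
def mapper_roc(alignments):
    """alignments: list of (mapq, is_wrong) per mapped read.
    Sweep a mapQ threshold from high to low; at each threshold record
    (#reads mapped, #wrongly mapped) - one curve point per read."""
    pts = []
    mapped = wrong = 0
    for q, is_wrong in sorted(alignments, key=lambda a: -a[0]):
        mapped += 1
        wrong += is_wrong
        pts.append((mapped, wrong))
    return pts

# Errors should concentrate at low mapQ if mapQ is informative:
mapper_roc([(40, 0), (3, 1), (40, 0), (20, 0)])
```

    A mapper that emits only one or two distinct mapQ values (the GSNAP dichotomy discussed above) contributes only one or two usable points to such a curve.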

  • kopi-o
    replied
    Heng, in cases where the distribution of positive vs negative examples is very skewed (such as variant calling), the ROC curve can also be misleading. The ROC curve typically only looks at positive examples (false positive rate vs true positive rate), but one should also look at the corresponding ROC curve for negative examples (TNR vs FNR), or, look at precision-recall curves.
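
    A quick numerical illustration of the skew problem: the same ROC operating point yields wildly different precision depending on class balance (the numbers below are arbitrary):

```python
def precision(tpr, fpr, pos, neg):
    """Precision implied by a (TPR, FPR) operating point given
    pos true positives and neg true negatives in the population."""
    tp = tpr * pos   # true positives recovered
    fp = fpr * neg   # false positives admitted
    return tp / (tp + fp)

# Same ROC point (TPR=0.9, FPR=0.01), two class balances:
balanced = precision(0.9, 0.01, pos=1000, neg=1000)       # ~0.99
skewed   = precision(0.9, 0.01, pos=1000, neg=1_000_000)  # ~0.08
```

    This is why precision-recall curves are often preferred for rare-event problems like variant calling, even when the ROC looks excellent.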

  • twu
    replied
    Heng, thanks for making the ROC plots available. I think they're pretty interesting.

    About GSNAP having only a single point on the plots: actually, GSNAP does calculate mapping quality, but my understanding was that the quality should translate into the probability that the given read alignment is the correct one. So GSNAP does a Bayesian normalization of all of the raw mapping qualities, and then reports the normalized mapping quality. This tends to produce a dichotomous result, where if one alignment is much better than the others, it gets a very high mapping quality of 40 or so, but if there are two or more roughly similar multimappers, they all get a mapping quality of 3 or so (where 3 corresponds to the log probability of 0.5). Perhaps I have the wrong understanding of mapping quality here (and maybe someone can correct me), but I am told that GATK has a similar expectation.

    To get around this, I have added a new field in the SAM output of GSNAP called XQ, which gives the non-normalized mapping quality, although it is still scaled so the best alignment gets a mapping quality of 40. (There are certain reasons why the scaling is important, mostly having to do with GSTRUCT needing to know this information.)
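
    The kind of Bayesian normalization described above might look like the following sketch (not GSNAP's actual code; the phred-like raw-score scale and the cap are assumptions for illustration):

```python
import math

def normalized_mapq(raw_scores, cap=40):
    """Normalize raw per-hit scores into a posterior for the best hit,
    then Phred-scale it: mapQ = -10*log10(1 - p_best), capped.
    Two near-equal hits collapse to ~3; one clear winner hits the cap."""
    likes = [10 ** (s / 10.0) for s in raw_scores]  # assumed phred-like scale
    p_best = max(likes) / sum(likes)
    if p_best >= 1.0:                               # single hit: certain
        return cap
    return min(cap, int(round(-10 * math.log10(1 - p_best))))

normalized_mapq([30, 30])  # two equal multimappers -> mapQ 3
normalized_mapq([50, 0])   # dominant hit -> capped at 40
```

    This reproduces the dichotomous behavior described above: roughly-tied multimappers all get ~3 (posterior 0.5), while any clearly best alignment saturates at the cap.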

Regarding the comment someone made about multimappers, GSNAP is designed to report all multimappers, but its notion of what a multimapper is differs from other programs'. Some programs expect the user to give a specific search parameter, like 5 mismatches or less, and then return all mappings that satisfy that parameter. GSNAP, on the other hand, interprets the search parameter as the maximum perimeter of the search space, so it has a hard limit: it will never look for alignments with 6 or more mismatches. However, if GSNAP finds an exact match, it will also place a soft limit at that point and report only the multimappers that are also exact matches, not the 1-mismatch through 5-mismatch answers. The exception is when GSNAP is given a value for suboptimal mismatches: then it will search that many mismatches past its optimal finding. For example, if GSNAP cannot find an exact match but finds a 1-mismatch alignment, and is given a suboptimal-mismatch value of 1, it will report all 1-mismatch and 2-mismatch alignments, but will still not go on to report 3-mismatch through 5-mismatch answers.
    Last edited by twu; 11-18-2011, 02:41 PM.
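
    The hard-limit/soft-limit reporting rule above could be sketched as follows (illustrative only, not GSNAP's implementation):

```python
def report_hits(hits, max_mm=5, subopt=0):
    """Sketch of the reporting rule: search only up to max_mm mismatches
    (hard limit), but once a best alignment is found, report only hits
    within `subopt` mismatches of it (soft limit)."""
    eligible = [h for h in hits if h["mm"] <= max_mm]
    if not eligible:
        return []
    best = min(h["mm"] for h in eligible)
    return [h for h in eligible if h["mm"] <= best + subopt]

hits = [{"mm": 1}, {"mm": 1}, {"mm": 2}, {"mm": 3}, {"mm": 7}]
report_hits(hits)            # only the two best (1-mismatch) hits
report_hits(hits, subopt=1)  # also includes the 2-mismatch hit
```

    The 7-mismatch hit is excluded by the hard limit in both calls; the soft limit then decides how far past the best hit to report.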

  • sparks
    replied
    Originally posted by Heisman View Post
    Looking at those ROC curves, it appears to me that Novoalign is the best mapper in the specified simulation that was run with respect to sensitivity and specificity. Is this a correct interpretation?
    For sure, we believe it's still the best

    Colin

PS. Some mapper comparisons have shown different results, but this can be the result of targeting the simulation at the mapper the developer is promoting. One recent example used high simulated indel rates and then didn't adjust the other mappers' gap open & extend penalties to suit. Their mapper, with its default low gap penalties, came out as the clear winner.
    It can be a problem to optimise the parameters for every mapper when doing these comparisons, and the tendency is to use defaults, which is probably reasonable.

    I think Heng Li is doing an honest and unbiased comparison at mutation rates you'd expect in a resequencing project.

  • zee
    replied
    Originally posted by Heisman View Post
    Looking at those ROC curves, it appears to me that Novoalign is the best mapper in the specified simulation that was run with respect to sensitivity and specificity. Is this a correct interpretation?
From this comparison, yes, that would be the case, with smalt showing good performance as well. What would be interesting would be to do this with genericforms' suggestion of a 30x coverage genome.
