Bowtie, an ultrafast, memory-efficient, open source short read aligner

xwu replied

01-12-2009, 09:33 AM
I should have read Ben's post earlier. I spent a lot of time to find out that the sequences are all reversed, which caused weird results. I will try out the 0.9.8.1 now.
Leave a comment:
Ben Langmead replied

01-07-2009, 10:50 AM
Hello Joachim,

Sorry for the delay in replying! Please see responses below.

Originally posted by joa_ds View Post

[*]I don't quite understand what the 'nostrattum' flag does

Without the --nostratum flag, Bowtie will only report alignments for the best "stratum" where alignments were found. By "best stratum", I mean the best category of alignment, categorized by number of mismatches in the seed region. Say you use -k 3 and a given read aligns once with 1 mismatch in the seed and twice with 2 mismatches in the seed. If you do *not* specify --nostratum (the default), then Bowtie will only report the single 1-mismatch hit. If you do specify --nostratum, Bowtie will report all 3 hits.

I'll make a note to add a clear example in the documentation for future releases.

[*]I am only interested in rather unique maps, the rest can go to another file and i can have a look at it later. the --unfa flag moves unmapped seqs to a file, but when i use -m 3 i will discard seqs that map more than 3 times, right? Those go that same file? or they are lost forever? The idea is that i want to do a preliminary analyses fast and i can remap those multimaps overnight or during a weekend when the server is not used.

(Let me answer your question with respect to the just released 0.9.8.1 version of Bowtie, since version 0.9.8 had issues with --unfa and --unfq.)

As of 0.9.8.1 Bowtie supports --maxfa/--maxfq options that dump reads that exceed the -m limit to a separate file. If --maxfa/--maxfq is not specified but --unfa/--unfq is, then these reads are dumped to the same file as the reads that don't align at all.

[*]If i use the -k 3 flag, i want to report 3 maps, will it take the first 3 it encounters? And if i use the --best flag, will it go find all the possible maps and only report the best 3?

./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie

is the commando i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best) Will this work? Just experimenting... In a later stage i will map everything(even x100 repeats) and output it to a db, so it doesnt really matter if it doesnt work, just trying to understand the program completely.

Based on what you say, yes, that command will do what you intend. -k 3 --best should guarantee that you get up to three alignments of the "best" kind (best in terms of # of mismatches in the seed) and -m 10 will ensure that no alignments are reported for a read that aligns to more than 10 places. If you don't care whether the alignments come from the same strata, then you should also use "--nostrata".

Hope that helps.

Thanks,
Ben
Leave a comment:
Ben Langmead replied

01-07-2009, 10:35 AM
Thanks Florian! - Indeed, version 0.9.8.1 of Bowtie was just released minutes ago and it fixes all known issues with the --unfa and --unfq options. It's highly recommended that you use that version.

Thanks,
Ben
Leave a comment:
florian replied

01-07-2009, 09:53 AM
Yeh, I agree. However, there seems to be a problem with multi-threading, so be advised NOT to use the -p option in conjunction with it. Ben (the main developer) has told me he was already working on a solution, though, so this will only be a intermittent drawback.
Leave a comment:
xwu replied

01-07-2009, 09:48 AM
Florian, thanks a lot for your reply. I was using v0.9.7, and did not know there is a new version available. It is a very helpful function. Thanks.
Leave a comment:
florian replied

01-07-2009, 09:42 AM
hi there

as of version 0.9.8 bowtie comes with the --unfa <filename> / --unfq <filename> options, which should be doing exactly what you are looking for

Bowtie: Manual

http://bowtie-bio.sourceforge.net/manual.html#algn_unfa

cheers,

florian
Leave a comment:
xwu replied

01-07-2009, 09:36 AM
I am a new user of Bowtie, and I got one question. Is there a way to output the reads that CAN NOT be aligned to the reference genome?
Leave a comment:
joa_ds replied

12-16-2008, 07:57 AM
Hi,

first of all, we here at Ghent were very impressed with the bowtie results... We have ordered, but are still waiting for delivery, both a Solexa and a 454 sequencer and our group will be responsible for the data analysis and server installation etc.

Some people here were testing the classis programs such as GMAP and BLAT and a Solexa human genome test-set would take 2-3-4 days on our 8core 32G RAM server.

I went on a search and found novocraft. Taking only 6h to do the same job.

Now i found bowtie and it only takes 1h i guess? I started it before the lunch break and it was finished when i came back. Very impressive (without the -p flag!). (we have a 16core 64G coming... and i will use the -p flag... mapping the genome in 10 mins amazing.

I have some questions though:
I don't quite understand what the 'nostrattum' flag does

I am only interested in rather unique maps, the rest can go to another file and i can have a look at it later. the --unfa flag moves unmapped seqs to a file, but when i use -m 3 i will discard seqs that map more than 3 times, right? Those go that same file? or they are lost forever? The idea is that i want to do a preliminary analyses fast and i can remap those multimaps overnight or during a weekend when the server is not used.

If i use the -k 3 flag, i want to report 3 maps, will it take the first 3 it encounters? And if i use the --best flag, will it go find all the possible maps and only report the best 3?

./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie

is the commando i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best) Will this work? Just experimenting... In a later stage i will map everything(even x100 repeats) and output it to a db, so it doesnt really matter if it doesnt work, just trying to understand the program completely.

greetz,
Joachim
Leave a comment:
lh3 replied

12-13-2008, 02:09 AM
Cole, pre-masking the reference genome is definitely worth trying, although I have not got much time to follow this line. Most algorithms spend a lot of time on repeats. Suffix array/tree based methods are fast largely because they collapse exact repeats and save time. However, using repeatmasker to mask genome is not the way to go. Repeatmasker does not guarantees to mask sequences having multiple copies. The sequences it masks may also be unique. Biological repeats are not necessarily sequences with multiple copies. What we more like to do is to simulate reads from each position on the reference, map them back, and then calculate some statistics indicating whether the region is a repeat. Illumina's "sequence-ability" is such a statistics, although it is not frequently use. I think a better statistics would also consider whether a read from the reference can be mapped elsewhere when there is an additional mismatch (bwa behaviour).

A potential problem with premasking is this strategy (at least the one I envision) does not work well with paired end mapping. Reads from some masked regions can be mapped correctly when pairing is in use. I can vaguely see possible solutions, but maybe it is too vague to say here. I am sure people can come up with better solutions.

[PS: I am thinking a better replacement with "sequence-ability". Let S be the reference. For all position x, we extract a read S[x,x+31] and map it back to S. Define a statistics R[x] such that it equals 0 if S[x,x+31] can be mapped to multiple places; equals 1 if S[x,x+31] has an 1-difference (mismatch+indel) hit elsewhere and 2 otherwise. Suppose a 32bp read is mapped to y. We can discard the hits if R[y]=0; or downweight it in SNP calling if R[y]=1. Approximation has to be made for read length longer/shorter than 32bp. A simplified the version would just set R[x] as 0 or 2.]

Last edited by lh3; 12-13-2008, 02:32 AM.
Leave a comment:
Cole Trapnell replied

12-12-2008, 06:53 PM
Hi lh3,

I first wanted to say thanks for your compliments and constructive criticisms on Bowtie. To follow up on the discussion about reporting hits for multireads: would it not be better to penalize or even ignore alignments to predefined, annotated repeats in the reference as a post-processing step?

In your experiment, I wonder how many of the mapped reads fall entirely within RepeatMasker regions, for example.
Leave a comment:
lh3 replied

12-12-2008, 02:13 AM
Ben, you are welcome. I was looking at 200x36x36-071113_EAS56_0053-s_1_?.fastq.gz from SRA000271. Soap2 is also available and so you can try by yourself. Let me know if I was doing something wrong. As for SV detection, a naive but most widely used way is to cluster, based on coordinates, read pairs that are mapped with excessively large/small insert size or across chromosomes. I think the experience is we need to look at reads mapped confidently. bwa's behaviour is mostly preferred. Soap2's is also ok, but we need to be more careful to filter out wrong alignments. I a bit worry about the default bowtie behaviour on such applications, but of course we can only be sure when it gets evaluated.

[PS: Just read your reply to ewingad. I can see that bowtie will stop searching the reverse strand if it finds a hit on the forward strand. Suppose there is a segmental duplication with one copy on the forward strand and the other on the reverse. Bowtie will almost always report reads from either copy as unique and map all of them to the forward strand only (if the two copies are almost identical). This will incur higher SNP error rate and confuse SV detection. Note that your proposed method in your previous reply would not solve this case which may actually happen frequently. Always searching both strands will largely alleviate, though not completely solve, this problem.

In addition, from your reply I just notice this --nostrata thing. I wonder whether bowtie keeps multiple best strata. Say the best hit contains one mismatch and two strata may yield two different 1-mismatch alignments (e.g. the mismatch occurs at 10bp and 20bp respectively). When I use --best -a, does bowtie report the alignments in both strata? I suppose it does. It is not very clear from the manual.]

Last edited by lh3; 12-12-2008, 03:11 AM.
Leave a comment:
Ben Langmead replied

12-11-2008, 01:03 PM
Originally posted by ewingad View Post

I agree with lh3 that it may be vital to know whether a given alignment is to a repetitive sequence i.e. if other equally good alignments exist - I suppose that if there are multiple _identical_ alignments they should all be coming from the same BW rows? If this is provable then it is possible to reassure users about lh3's point #2.

Yes, that's true. The alignments included in a single range or rows returned by Bowtie's search routine are "identical" in the sense that the reference characters aligned to are the same. This is provable owing to the properties of the matrix.

I also agree that something like lh3's point #5 is useful - when aligning short reads it is often a useful quality measure to know how many alignments would result from a 1-bp change in the query sequence. So, if for a given read, if I know that even if I changed one base, the reported alignment would not change, I feel more confident about that alignment.

That's good to know - thanks; for now, using -a or -k 2 (or whatever other -k you prefer) together with --nostrata should tell you what you need to know.

Thanks,
Ben
Leave a comment:
Ben Langmead replied

12-11-2008, 12:57 PM
That's a good experiment - can I recreate that? Are these public reads that I can access? This will help us think about the issue of what the contents of field #7 should actually be.

Can you explain a little more (or point us somewhere that explains) the SV detection issue? Is there a paper that describes the algorithm you have in mind?

Thanks again,
Ben
Leave a comment:
lh3 replied

12-11-2008, 11:14 AM
I have just done a comparison between bowtie and soap2. I am looking at 3.9 million alignments reported by both soap2 and bowtie. Soap2 says that there are 649k reads can be placed more than once, while 171k of them with the 7th col equal to 0 in bowtie. This means bowtie will claim 26% (171/649) of repetitive alignments as unique. This ratio seems a bit high to me. (cmd: soap2, default; bowtie -f -v2) I agree that not all of these wrong alignments may yield wrong SNPs, but SV detection will be affected a lot. Of course we need further evaluation on real data.
Leave a comment:
Ben Langmead replied

12-11-2008, 10:12 AM
Hey lh3,

This discussion is very helpful - thanks, and thanks for checking out Bowtie.

Originally posted by lh3 View Post

To Ben:

First thank you for Bowtie. It is really amazing. I have not read Bowtie codes, but from README, I think all the BWT based aligners (together with SOAP2 and mine) share a lot in common. Some comments:

1. I forget how much time is spent on building BWT index, but my impression is BWT-SW (also used in soap2 and bwa) is much faster and more light-weighted. Maybe it is worth having a look at the publications by Tam's group. Of course, building index is an once-for-all process. Just remind you of a possibly better algorithm.

I agree; it's entirely possible that there's something faster, e.g., what BWT-SW does. The biggest advantage of the algorithm Bowtie uses as I've implemented it is that it's quite flexible with how much memory it uses. However, I too suspect that something faster could be done. That's on my TODO list (though it's pretty far down, since, as you say, that cost is typically amortized away).

2. My main concern about bowtie is actually related to the column 7. I think by default (no --best), bowtie just outputs the first group of hits it meets. Users would not know whether it is the best or whether it is a repeat or not. I think (maybe wrong) this behaviour is only useful for screening human contaminations. With "--best", user would know the output is the best hit, but whether it is a repeat is still unknown in some cases. I know the "unknown" cases should be rare, but it would be necessary to convince users that the rare cases would not affect accuracy. Only with "--best -k 2", a user may know whether it is a repeat or not, although he/she would not know the number of occurrences. I think the "--best -k2" is the most desired behaviour and should become the default. Bowtie is fast enough. Slowing it down by a factor 3 will still make most users quite happy (see also below). Also quoting the speed under the default option would be unfair to others.

Your characterization that Bowtie "outputs the first group of hits it meets" is right - specifically, it will randomly select k distinct alignments from that range of BWT rows. (Of course, if k is greater than the number of alignments in that range, it has to go and search for another range.)

First, let me reemphasize that I think of Bowtie's target application as mammalian resequencing - that's how I characterize it in the manual and that's what we spend our time trying to optimize it for. I also mention in the manual that Bowtie should not be considered a general-purpose alignment tool, because there are definitely alignment scenarios where Bowtie won't give the exact information needed, or may not give it all that much faster than other tools. Given that, I want to argue that the "unknown" cases are not a significant concern, and so I disagree that quoting the speed under the default option is unfair.

First, the "unknown" cases. In resequencing, the result of the alignment step is essentially a multiple alignment across the whole genome with relatively deep coverage. Looking for, e.g., a SNP variant involves walking along the columns of the multiple alignment, looking at the alignments that span that column and combining the relevant evidence from each of the alignments to see if there's a call to be made. As described in the Maq paper (are you familiar with it? ), there's a need to discount evidence from alignments whose placement was ambiguous. That's what field #7 does (or will do, when we've settled on what should actually go there; we've marked it as "reserved" for now). The good news is that we can reasonably assume deep coverage, which means that we don't have to rely on the value in field #7 for just one read to know how reliable its evidence is. We can, for example, take the max of the values for field #7 for all the reads that span the column. That's just an example; probably something smarter than max is needed. In short, I argue that A) the case where field #7 is very wrong is rare, and B) in cases where it is very wrong, it can probably correct it by looking at nearby aligned reads. So I think the problem is manageable.

3. I think it is worth metioning that the speed of soap/bowtie/novoalign is sensitive to error rate, while eland/zoom/rmap not. On bowtie's home page, it is claimed that bowtie is 35X faster than maq and 350X than soap. However, to my experiences, on high quality data, maq and soap is about the same speed. We can only see this big difference when base quality is low. Actually on my small experiment, bowtie is 170X faster than maq (cmd option: -f -v 2; I know using quality may be a bit slower), but I guess the difference would be smaller if the error rate is higher.

Those stats are helpful - thank you. I have noticed that the maq/bowtie speed ratio varies substantially based on CPU architecture and whether the input is filtered for poly-As. 35x (rounded down from about 38x) is the most Maq-favorable ratio I obtained in a set of experiments using filtered human read data from 1000 Genomes on a few different machines.

I do mention in the manual that Bowtie is faster for higher-quality input, though I could make that point more prominent.

4. Comparison of BWT based aligners. So far as I know, there are three BWT based short read aligners: bowtie, soap2 and bwa. Soap2 gives the number of occurences of the best hits and bwa also reports the number of hits with additional mismatches if the best hit is unique (like what chipper is asking for). bwa is the only one that finds short indels at present, although I am sure this is not hard for soap2/bowtie. In addition, soap2 only finds hits with up to 2 mismatches. I think they have to do so according to a brief description of its algorithm. On speed, bowtie-f-v2 is 3X faster than soap2 (bowtie--best-v2-f-k2 is similar to soap2 in speed) and soap2 is 2.5X faster than bwa. On memory, both bwa and bowtie use 2.3GB while soap2 uses 5.3GB.

Those stats are very good to know. I admit that I haven't had much time to look at soap2 and bwa.

Note that Bowtie uses a bit more than half that amount of member when -z is specified. -z precludes the use of some options, but the default alignment mode works fine.

5. back to how the alignments are reported. I think the bwa behaviour is useful if people do not care too much about speed. Knowing the number of suboptimal hits would help us to decide which alignments are reliable. I know this is important to some (not all) SV detection algorithms. If you think the bwa behaviour is costly (possibly it is), I would recommend the soap2's one. Frequently, we may want to know the exact number of occurences (no need to output the detailed aligments). I am sure having the soap2 behaviour would make bowtie more popular.

These are helpful suggestions. The TODO list for Bowtie grows at an alarming rate! In the short run, we're working on supporting paired-end alignment and indels, but we'll spend some time thinking about the best way to accomplish some of these "field #7" improvements too.

Many thanks,
Ben
Leave a comment:

Previous 1 22 29 30 31 32 33 34 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 15 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 21 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News