Seqanswers Leaderboard Ad

**aloraine** · 10-19-2012, 10:16 AM

TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?

**ugolino** · 10-19-2012, 11:54 AM

Following your tip I ran a PE RNA-seq alignment with these options.
bowtie2 --no-contain -a --no-mixed

Then fed SAM output file to htseq-count (w/ -s no option as it's a non-stranded library) and got ~ 6x more counts than there are PE reads to begin with. Apparently, htseq-count tallied every alignment as unique, not ambiguous, which makes me think that the SAM output from bowtie2 does not flag multiple alignments properly or htseq-count does not process them ?

**ecSeq Bioinformatics** · 10-22-2012, 10:10 AM

Originally posted by aloraine View Post

TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?

There are also several other tools that map the reads with bowtie1/2. It would be nice to check they use the -k paramter and if not, any bias/error is added to their results.

To name some of these tools:

miRDeep2
miRanalyzer
miREvo
MapSplice
SpliceSeq
RUM
complete list of tools

**fkrueger** · 11-01-2012, 12:54 AM

As of yesterday night a new version of Bowtie 2 has been released (Bowtie 2 version 2.0.2) which adds an option --non-deterministic:

Code:

*    Added --non-deterministic option, which better matches how some users
     expect the pseudo-random generator inside Bowtie 2 to work.  Normally, if
     you give the same read (same name, sequence and qualities) and --seed, you
     get the same answer.  --non-deterministic breaks that guarantee.  This can
     be more appropriate for datasets where the input contains many identical
     reads (same name, same sequence, same qualities).

Does this option address the multiple mapping issue reported by original poster?

**ecSeq Bioinformatics** · 11-01-2012, 05:42 AM

That is a very good point.

I actually like this statement:

...which better matches how some users expect the pseudo-random generator...

How do the bowtie developer expect a random generator to work?

But, in my opinion, the problem still exists.

When I run bowtie2 with default parameters, I still get one non-random hit for all reads.
All tools that use bowtie2 (e.g. tophat2) for mapping, do not use --non-deterministic, or -a/-k parameter.

So, downstream analysis will still be affected.

Why not set it as default parameter? Time complexity?

**ecSeq Bioinformatics** · 11-02-2012, 06:34 AM

We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

1 .
(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

2 .
(1) no --non-deterministic option + (2) different name for all reads:
99% of all reads map to one locus.
1% of the reads map to on of the other loci.

3 .
(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3

4 .
(1) --non-deterministic option + (2) different name for all reads:
same result as in 3.

**knchess** · 11-06-2012, 10:44 AM

Originally posted by ecSeq Bioinformatics View Post

We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

1 .
(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

2 .
(1) no --non-deterministic option + (2) different name for all reads:
99% of all reads map to one locus.
1% of the reads map to on of the other loci.

3 .
(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3

4 .
(1) --non-deterministic option + (2) different name for all reads:
same result as in 3.

From bowtie2's manual:

The pseudo-random number generator is re-initialized for every read, and the seed used to initialize it is a function of the read name, nucleotide string, quality string, and the value specified with --seed. If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.

So the result for 1. is expected. It's also desirable because otherwise alignment runs can't be parallelized. But I agree that --non-deterministic doesn't fix anything because as 2. shows, different names still get biased hits. Deterministic should still be random, which is what the manual also claims. Even --non-deterministic doesn't give random hits, as 3. and 4. show..

I actually got the same results as 3. and 4. without using --non-deterministic by completely randomizing the read names, so even random read names still give bias. It seems the more similar the read names, the worse the bias.

**ecSeq Bioinformatics** · 11-07-2012, 06:52 AM

... it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.

I think that might be intuitive, but by far not random.

I think this discussion is really fruitful, since it is really important to understand what results are produced by widely used mapping tools. Sometimes developer assume that some behavior is "intuitive and desirable", but the users actually expect a completely different one, ending up with a strong bias.

To sum it up:
I suggest, when using bowtie2, users should set the -a parameter to assure, that they will not get a bias. Or know and accept the bias... of course!

**knchess** · 11-07-2012, 11:10 AM

I agree that knowing what the tools are really doing is important. Ideally some communication from the developers would help. There doesn't seem to be a mailing list for bowtie though. Are they aware of the problem?

What do you do downstream if you use the -a parameter? Randomly choose one of the reported hits manually? If each identical loci got 1 reported hit for a matching read, then that itself is also a bias (compared to a read that only has one loci).

bwa actually reports all hits by default. But the nice thing is the 'primary' hit reported for a SAM entry is actually random (so about 33/33/33 for the experiment done in OP) while still been deterministic even if the read names are identical. Then you can just limit reported hits to 1 to get the behaviour that bowtie2 is supposed to have by default. Unfortunetely, bwa has its own issues...

**ecSeq Bioinformatics** · 11-07-2012, 12:07 PM

Are they aware of the problem?

I reported the bug at http://sourceforge.net/tracker/?func...7&atid=1101606

What do you do downstream if you use the -a parameter?

In my opinion, you can do 3 different things for downstream analysis:

Discard all reads with multiple hits
For a DE analysis it should be ok, since you delete them for all runs. But you will underestimate, or even completely miss the expression of all genes with multiple copies.
Take all hits
For a DE analysis it should be ok, since you do the same for all runs and you won't miss genes with multiple copies. But you will overestimate the expression of all genes with multiple copies.
Take all hits, but divide the "expression impact" of the read by the number of its mapping loci
I think this is the prefered way, since you do not know where the read actually comes from. You have all mappings, but by dividing the impact, you will "punish" the hit and equally distribute its impact in expression (it's somehow the same result as the random generator, but in my opinion much cleaner). That is actually, how we handle this issue by default. This way, you minimize the error and don't loose any region, where the read might come from. Another benefit of this approach is the possibility that you can always access the mapping loci at a later time and filter out multiple hits, or change the way of how you want to normalize them. Of course, most of the downstream tools are not able to handle this normalization, but we use our own tools for that.

For mapping we use segemehl. With this tool, you get all multiple hits by default AND it sets the NH-TAG in the sam file, which gives you the number of multiple hits for the read. That allows you do directly "normalize the expression" of the read.

This software does also allow mismatches and indels in the seed, which in my view is very important and still a problem of almost all mapping tools.

**yangjr** · 01-09-2013, 12:36 PM

[*]Take all hits, but divide the "expression impact" of the read by the number of its mapping loci

A better but more complex way is
1. Ask bowtie to generate all hits.
2. Take the hits that are uniquely best, estimate a "hit density" for single site
3. Take the multiple hits, divide their "expression impact" according to the "hit density" of its mapping loci (or a flanking window around the loci). For example, one tag which is sequenced 100 times, and it hits 3 loci equally good, and that 3 loci have a hit density of a, b and c according to the unique hits. Then the "expression impact" contributed by this tag would be 100*a/(a+b+c), 100*b/(a+b+c) and 100*c/(a+b+c) respectively.

**syfo** · 01-10-2013, 09:24 AM

Originally posted by ecSeq Bioinformatics View Post

That is a very good point.
I actually like this statement:

Code:

*    Added --non-deterministic option, which better matches how some users
     expect the pseudo-random generator inside Bowtie 2 to work.

How do the bowtie developer expect a random generator to work?

As all computationally-based random generators do.

I think there might be some confusion about the term "random" here. Computational methods for the generation of "random" numbers do not pick up numbers at random but use algorithms to generate long series of apparently random numbers, which are in fact completely determined by an initial number called "seed". Hence the "pseudo-random". The seed is often hard-coded or built from the input. This is why a given input will always give the same output, which is actually more than expected (reproducibility is crucial for both developpers and users -just consider testing/debugging for instance).

What can be done to give a better illusion of randomness is to use a "random" -i.e. variable- seed: the generator then combines some extremely variable and unpredictable data (like the current time, the ID given by the computer to the process or some other dynamic system information) to compute a different seed at each execution, which will produce a different series of numbers each time. I bet that this is exactly what the non-deterministic option is doing.

**raphael123** · 02-07-2014, 06:07 PM

Very usefull discussion !

Syfo you are rigth but it is not explaining this difference :

(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3

knchess don t you think it depend how much you try to scan the full genome ? Is the first locus quicker to access in the genome ?

**clintp** · 02-10-2016, 11:54 AM

Has there been any update to this issue?

Topics	Statistics	Last Post
Study Reveals How Bacteria Defend Against Viral Attacks by seqadmin Started by seqadmin, 08-27-2024, 04:40 AM	0 responses 16 views 0 likes	Last Post by seqadmin 08-27-2024, 04:40 AM
New Single-Molecule Sequencing Platform Introduces Advanced Features for High-Throughput Genomics by seqadmin Started by seqadmin, 08-22-2024, 05:00 AM	0 responses 293 views 0 likes	Last Post by seqadmin 08-22-2024, 05:00 AM
New DNA Code Discovered Revealing Complex Gene Regulation Mechanisms by seqadmin Started by seqadmin, 08-21-2024, 10:49 AM	0 responses 135 views 0 likes	Last Post by seqadmin 08-21-2024, 10:49 AM
Epigenetic Clocks Derived from Retroelements Offer New Insights into Aging by seqadmin Started by seqadmin, 08-19-2024, 05:12 AM	0 responses 124 views 0 likes	Last Post by seqadmin 08-19-2024, 05:12 AM

Seqanswers Leaderboard Ad

Announcement

ATTENTION: bowtie2 and multiple hits

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News