
**nilshomer** · 04-02-2010, 05:52 PM

Maybe Heng will comment, but I will take a shot at the first part.

[QUOTE=jmartin;16446]I'm seeing some alignments come out of BWA that don't make sense to me. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

Our understanding of the '-n' parameter is that it sets the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field shows 70-75 (NM:i is supposed to show the number of mismatches).

How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
[/QUOTE]

BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).

**jmartin** · 04-06-2010, 04:24 PM

Heh, it actually does complete with -n 20 (I had tried -n 30 and -n 33 as well, but those did not complete on a 32 GB blade).

The reason I'd been trying such large values for -n is to overcome some ambiguous bases that seem to exist in my Illumina data. While I don't expect more than 1-2 real errors per 100bp of human data mapping to another human genome, we have a highly variable distribution of ambiguous bases that appear in the data generated from some of our metagenomic samples (I'm trying to remove human sequence from metagenomic bacterial samples harvested from various human body sites). Some sites have ~30% of the reads showing >= 20 Ns in their sequence. It depends on the body site (different collection techniques are used at each site, by different sets of hands, and with different people making the library preps).

But Heng has mentioned in another thread that BWA is really not designed for such sequence, and that it's not really safe to use -n values > 7. Anyway, I appreciate the reply.
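For anyone who wants to measure how many reads are affected before picking a -n value, here is a minimal sketch that counts reads with at least 20 Ns. It assumes plain, well-formed 4-line FASTQ records; the function names `read_fastq` and `fraction_with_many_ns` are made up for this example and are not part of any aligner.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from 4-line FASTQ records.

    Assumes well-formed input: header, sequence, '+' line, quality.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header[1:], seq, qual


def fraction_with_many_ns(lines: Iterable[str], min_ns: int = 20) -> float:
    """Return the fraction of reads containing at least min_ns N bases."""
    total = 0
    flagged = 0
    for _name, seq, _qual in read_fastq(lines):
        total += 1
        if seq.upper().count("N") >= min_ns:
            flagged += 1
    return flagged / total if total else 0.0
```

Running it per body site (for example on `open("reads.fastq")`, a hypothetical file name) would show which samples are driving the problem, and the same loop could be turned into a filter that drops or trims N-heavy reads before alignment.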

**Chipper** · 04-07-2010, 02:01 AM

[QUOTE=nilshomer]BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).
[/QUOTE]

BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
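If -n applies to the full read, records with an NM:i value well above the -n setting should stand out in the SAM output. A minimal sketch for pulling them out, assuming tab-separated SAM text with optional tags starting at the 12th column; the helper names `nm_value` and `over_limit` are hypothetical. Note that the SAM format defines NM as the edit distance to the reference, so it counts indels as well as mismatches.

```python
from typing import Iterable, Iterator, Optional, Tuple


def nm_value(sam_line: str) -> Optional[int]:
    """Return the NM:i value of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags follow the 11 mandatory fields
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def over_limit(sam_lines: Iterable[str], max_nm: int) -> Iterator[Tuple[str, int]]:
    """Yield (read name, NM) for alignments whose NM exceeds max_nm."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        nm = nm_value(line)
        if nm is not None and nm > max_nm:
            yield line.split("\t")[0], nm
```

Calling `list(over_limit(open("out.sam"), 20))` on the output (file name is a placeholder) would list the reads to look at more closely.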

**nilshomer** · 04-07-2010, 07:53 AM

[QUOTE=Chipper]BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
[/QUOTE]

You are absolutely right about BWA "-n"; that was my error.

As for standardizing the options, that would be lovely (BFAST came out in mid/early 2008, look how many aligners there are now), but the differences between the algorithms are too substantial in my opinion. For example, BWA and other BWT algorithms search (exponentially) over a certain # of mismatches/differences, while BFAST and other spaced seed (index/hash) algorithms do not necessarily guarantee finding every hit up to a certain # of mismatches (they might find, say, 99% of reads with k mismatches). Therefore, there can be a parameter "up to k mismatches" in the former but not in the latter.
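To give a rough feel for the "exponentially" part, here is a back-of-the-envelope sketch. It only counts how many sequences lie within k substitutions of a read (no indels), and it ignores all the pruning and bounding a real BWT search does, so treat it as an illustration of how fast the space grows and not as a description of what BWA actually enumerates. The function name is made up for this example.

```python
from math import comb


def substitution_variants(read_len: int, max_mismatches: int) -> int:
    """Count sequences within max_mismatches substitutions of a read.

    For each k, choose k positions and one of 3 alternative bases at each.
    Indels are not counted.
    """
    return sum(comb(read_len, k) * 3 ** k for k in range(max_mismatches + 1))


# Growth for a 100 bp read as the mismatch limit increases.
for n in (1, 2, 4, 7):
    print(n, substitution_variants(100, n))
```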

Remember that this software is usually written by graduate students (who need to graduate) or post-docs. Maybe a faculty position would allow us to give better support and standardization.

BWA odd behaviors

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News