  • BWA odd behaviors

    I'm seeing some alignments that don't make sense to me come out of BWA. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

    Our understanding of the '-n' parameter is that it's setting the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field is showing 70-75 (NM:i is supposed to show the number of mismatches).

    How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
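
    A minimal sketch (not part of the original post) for pulling the offending records out of the SAM output. It assumes tab-separated SAM text with optional tags from column 12 onward; the function names and the threshold of 20 are illustrative. Note that the SAM tag specification defines NM as the edit distance to the reference, so inserted and deleted bases count as well as mismatches.

```python
def nm_value(sam_line):
    """Return the NM:i value of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at column 12 (index 11) of an alignment line.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def reads_over_limit(sam_lines, max_nm=20):
    """Yield (read_name, NM) for mapped reads whose NM exceeds max_nm."""
    for line in sam_lines:
        if line.startswith("@"):  # header line, not an alignment
            continue
        fields = line.split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:  # FLAG bit 0x4: read is unmapped
            continue
        nm = nm_value(line)
        if nm is not None and nm > max_nm:
            yield fields[0], nm
```

    Passing an open SAM file handle as `sam_lines` lists every mapped read whose NM exceeds the limit, which makes it easy to compare the full run against the chunked runs.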

    Another oddity I've seen is that if we reduce the size of the input query file, the exact same reads that previously were showing 70-75 mismatches now show < 20 mismatches. We've seen this weird error in fasta files of ~600k 100mer reads, and then we've broken that query file into chunks of 5000 100mer reads, and the same reads do not give this error. But the results of the small chunks seem to not match entirely with the BLASTN results. Mainly the small chunks will either give the same hit as BLASTN, or will fail to find a hit that BLASTN finds.
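
    To reproduce the chunked runs described above, the query file can be split into fixed-size batches. This is only a sketch under the assumption of plain FASTA input; `fasta_records` and `chunks` are hypothetical helper names, and 5000 is the chunk size mentioned in the post.

```python
from itertools import islice


def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunks(records, size=5000):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

    Each batch can then be written to its own file and aligned separately, so that the same read can be compared between the 600k-read run and the 5000-read run.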

    Is this a known issue? Or could I be doing something wrong by failing to set some needed parameter? I'm using BWA 0.5.7 on a 64-bit machine.

    Thanks,
    John Martin

  • #2
    Maybe Heng will comment, but I will take a shot at the first part.

    [QUOTE=jmartin;16446]I'm seeing some alignments that don't make sense to me come out of BWA. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

    Our understanding of the '-n' parameter is that it's setting the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field is showing 70-75 (NM:i is supposed to show the number of mismatches).

    How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
    [/QUOTE]

    BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).

    • #3
      Heh, it actually does complete with -n 20 (I had tried -n 30 and -n 33 as well; those values did not complete on a 32 GB blade).

      The reason I'd been trying such large values for -n is to overcome some ambiguous bases that seem to exist in my Illumina data. While I don't expect more than 1-2 real errors per 100bp of human data mapping to another human genome, we have a highly variable distribution of ambiguous bases that appear in the data generated from some of our metagenomic samples (I'm trying to remove human sequence from metagenomic bacterial samples harvested from various human body sites). Some sites have ~30% of the reads showing >= 20 Ns in their sequence. It depends on the body site (different collection techniques are used at each site, by different sets of hands, and with different people making the library preps).
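
      A quick way to measure how many reads are affected is to count ambiguous bases per read. The sketch below is an illustration only: it assumes the reads are already available as plain sequence strings, and the function names are hypothetical. The cutoff of 20 Ns is the one mentioned above.

```python
def n_count(seq):
    """Count ambiguous 'N' bases in a read, ignoring case."""
    return seq.upper().count("N")


def fraction_with_many_ns(seqs, min_ns=20):
    """Return the fraction of reads that contain at least min_ns N bases."""
    total = 0
    flagged = 0
    for seq in seqs:
        total += 1
        if n_count(seq) >= min_ns:
            flagged += 1
    return flagged / total if total else 0.0
```

      Running this per body site would show which samples are dominated by N-rich reads, and those reads could be set aside before alignment instead of raising -n to absorb them.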

      But Heng has mentioned in another thread that bwa is really not designed for such sequence, and that it's not really safe to use -n values > 7. Anyway, I appreciate the reply.

      • #4

        [QUOTE]BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).[/QUOTE]
        BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
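
        To make the distinction concrete, here is a small sketch that builds a `bwa aln` argument list with the whole-read limit (-n) kept separate from the seed settings (-l for seed length, -k for differences allowed in the seed). The helper name, the index prefix `db`, the file name `reads.fa`, and the seed values are assumptions for illustration; only -n 20 comes from this thread.

```python
def bwa_aln_command(index_prefix, reads_path, max_diff, seed_len=None, seed_diff=None):
    """Build the argument list for `bwa aln`.

    max_diff  -> -n : maximum differences over the whole read
    seed_len  -> -l : length of the seed taken from the start of the read
    seed_diff -> -k : maximum differences allowed inside the seed
    """
    cmd = ["bwa", "aln", "-n", str(max_diff)]
    if seed_len is not None:
        cmd += ["-l", str(seed_len)]
    if seed_diff is not None:
        cmd += ["-k", str(seed_diff)]
    cmd += [index_prefix, reads_path]
    return cmd


# Illustrative values only; the .sai output goes to stdout and must be redirected.
example = bwa_aln_command("db", "reads.fa", max_diff=20, seed_len=32, seed_diff=2)
print(" ".join(example))
```

        The list can be handed to `subprocess.run` with stdout redirected to a file, which avoids shell quoting problems when read file names contain spaces.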

        • #5
          Originally posted by Chipper
          BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
          You are absolutely right about BWA "-n"; that is my error.

          As for standardizing the options, that would be lovely (BFAST came out in mid/early 2008, look how many aligners there are now), but the differences between the algorithms are too substantial, in my opinion. For example, BWA and other BWT algorithms search (exponentially) over a certain # of mismatches/differences, while BFAST and spaced seed (index/hash) algorithms do not necessarily guarantee finding every alignment with up to a certain # of mismatches (they may find, say, 99% of reads with k mismatches). Therefore, there can be an "up to k mismatches" parameter in the former but not in the latter.
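
          A rough illustration of why the cost grows so quickly with the mismatch limit: the number of distinct DNA sequences within k substitutions of a read of length L is the sum over i from 0 to k of C(L, i) * 3^i. The sketch below just evaluates that count. It considers substitutions only and is an upper bound on a naive search space, not a model of what BWA actually explores, since real aligners prune heavily; the function name is hypothetical.

```python
from math import comb


def mismatch_neighborhood(read_len, max_mismatches):
    """Count sequences within max_mismatches substitutions of a DNA read.

    For each i, choose which i positions differ (comb) and which of the
    3 other bases appears at each of those positions (3 ** i).
    """
    return sum(comb(read_len, i) * 3 ** i for i in range(max_mismatches + 1))


for k in (0, 1, 2, 7, 20):
    print(k, mismatch_neighborhood(100, k))
```

          For a 100 bp read this gives 301 sequences at one mismatch and 44,851 at two, and the count keeps multiplying from there, which is consistent with the large -n values above failing to finish.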

          Remember that this software is usually written by graduate students (who need to graduate) or post-docs. Maybe a faculty position would allow us to give better support and standardization.
