
  • BWA odd behaviors

    I'm seeing some alignments that don't make sense to me come out of BWA. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

    Our understanding of the '-n' parameter is that it sets the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field is showing 70-75 (NM:i is supposed to show the number of mismatches).

    How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
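A minimal sketch of how one might list the alignments whose NM:i value exceeds the '-n' setting, assuming tab-separated SAM text with the optional tags starting at the twelfth column. In the SAM format NM is the edit distance to the reference, so it counts inserted and deleted bases as well as mismatches. The function names and the example records are hypothetical, and the sequences are shortened for readability.

```python
def nm_values(sam_lines):
    """Yield (read_name, NM) for each mapped alignment that carries an NM:i tag."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                yield fields[0], int(tag[5:])
                break


def over_limit(sam_lines, max_nm):
    """Return the (read_name, NM) pairs whose NM is larger than max_nm."""
    return [(name, nm) for name, nm in nm_values(sam_lines) if nm > max_nm]


# Made-up records for illustration only (SEQ/QUAL shortened).
example = [
    "@SQ\tSN:ref1\tLN:5000",
    "read1\t0\tref1\t100\t37\t100M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:3",
    "read2\t16\tref1\t900\t0\t100M\t*\t0\t0\tACGT\tIIII\tXT:A:R\tNM:i:72",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
suspicious = over_limit(example, max_nm=20)
print(suspicious)
```

With a real file, the same function could be fed an open file handle instead of the example list.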

    Another oddity I've seen is that if we reduce the size of the input query file, the exact same reads that previously showed 70-75 mismatches now show < 20 mismatches. We've seen this in FASTA files of ~600k 100mer reads; when we break that query file into chunks of 5000 100mer reads, the same reads do not show the problem. However, the results from the small chunks do not entirely match the BLASTN results: a small chunk will either give the same hit as BLASTN or fail to find a hit that BLASTN finds.
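One way to pin down which reads change between the full run and a chunked run is to compare the NM value per read name. The sketch below assumes the NM values have already been collected into two dictionaries keyed by read name; `compare_nm` and the example values are hypothetical.

```python
def compare_nm(full_run, chunk_run):
    """Compare per-read NM values from two runs of the same reads.

    Both arguments map read name -> NM (edit distance) for mapped reads.
    """
    report = {"changed": {}, "only_full": [], "only_chunk": []}
    for name in sorted(set(full_run) | set(chunk_run)):
        a = full_run.get(name)
        b = chunk_run.get(name)
        if a is None:
            report["only_chunk"].append(name)   # mapped only in the chunked run
        elif b is None:
            report["only_full"].append(name)    # mapped only in the full run
        elif a != b:
            report["changed"][name] = (a, b)    # mapped in both, NM differs
    return report


# Made-up values for illustration only.
full = {"r1": 72, "r2": 3, "r3": 5}
chunk = {"r1": 12, "r2": 3, "r4": 1}
report = compare_nm(full, chunk)
print(report)
```

Reads listed under "changed" are the ones worth inspecting by hand, since the aligner was given identical sequence in both runs.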

    Is this a known issue? Or could I be doing something wrong by failing to set some needed parameter? I'm using BWA 0.5.7 on a 64bit machine.

    Thanks,
    John Martin

  • #2
    Maybe Heng will comment, but I will take a shot at the first part.

    [QUOTE=jmartin;16446]I'm seeing some alignments that don't make sense to me come out of BWA. The only parameter we are setting is '-n 20', and these are 100mer reads from metagenomic samples being mapped against a bacterial database.

    Our understanding of the '-n' parameter is that it sets the max allowable edit distance between the query and the reference for a good alignment, so it's something like the max number of mismatches allowed. But in the SAM output we're seeing alignments where the NM:i field is showing 70-75 (NM:i is supposed to show the number of mismatches).

    How can BWA even be making an alignment of a 100mer query where there are 70-75 mismatches?
    [/QUOTE]

    BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).



    • #3
      Heh, it actually does complete with -n 20 (I had tried -n 30 and -n 33 as well; those values did not complete on a 32 GB blade).

      The reason I'd been trying such large values for -n is to overcome some ambiguous bases that seem to exist in my Illumina data. While I don't expect more than 1-2 real errors per 100bp of human data mapping to another human genome, we have a highly variable distribution of ambiguous bases that appear in the data generated from some of our metagenomic samples (I'm trying to remove human sequence from metagenomic bacterial samples harvested from various human body sites). Some sites have ~30% of the reads showing >= 20 Ns in their sequence. It depends on the body site (different collection techniques are used at each site, by different sets of hands, and with different people making the library preps).
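To put a number on how many reads carry that many ambiguous bases, a short sketch such as the following could be used. It assumes the read sequences are already available as plain strings; the function name, the threshold default, and the example reads are hypothetical.

```python
def fraction_with_many_ns(reads, min_ns=20):
    """Return the fraction of reads containing at least min_ns 'N' bases."""
    reads = list(reads)
    if not reads:
        return 0.0
    flagged = sum(1 for seq in reads if seq.upper().count("N") >= min_ns)
    return flagged / len(reads)


# Four made-up 100mers: 0, 20, 20 and 19 ambiguous bases respectively.
reads = [
    "ACGT" * 25,
    "N" * 20 + "ACGT" * 20,
    "ACGTN" * 20,
    "N" * 19 + "A" * 81,
]
fraction = fraction_with_many_ns(reads)
print(fraction)
```

Running this per body site would show where the ambiguous-base problem is concentrated before any alignment parameters are chosen.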

      But Heng has mentioned in another thread that BWA is really not designed for such sequence, and that it's not really safe to use -n values > 7. Anyway, I appreciate the reply.



      • #4

        [QUOTE]BWA uses the first 32 bases in its initial lookup, so that your "20" mismatches can only occur in the first 32 bases (see the "-l" option). The rest of the bases are filled in later and can have any # of mismatches. Note that the algorithm is exponential with respect to the "-n" option so I am quite amused that it was even possible for the program to complete with "-n 20" (that is a greater than 60% error rate!).[/QUOTE]

        BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?



        • #5
          Originally posted by Chipper
          BWA uses the -n parameter for the number of mismatches in the full read (-k is for the seed). Bowtie uses -n for the seed and then allows any # of mismatches in the 3' end. And then BFAST uses -n for the number of threads. Wouldn't it be great if these parameters were standardized...?
          You are absolutely right about BWA "-n"; that is my error.

          As for standardizing the options, that would be lovely (BFAST came out in mid/early 2008; look how many aligners there are now), but the differences between the algorithms are too substantial in my opinion. For example, BWA and other BWT algorithms search (exponentially) over a certain # of mismatches/differences, while BFAST and other spaced-seed (index/hash) algorithms do not necessarily guarantee to find every alignment with up to a certain # of mismatches (they might find, say, 99% of reads with k mismatches). Therefore, there can be an "up to k mismatches" parameter in the former but not in the latter.
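To give a rough feel for why searching over mismatches gets expensive so quickly, the sketch below counts how many distinct sequences lie within k substitutions of a read of a given length. This is only the size of the substitution-only neighbourhood, i.e. the sum over i from 0 to k of C(length, i) * 3^i for a four-letter alphabet. It is not a model of what BWA actually does, which prunes its search and also considers gaps; the function name is hypothetical.

```python
from math import comb


def hamming_neighbourhood(length, max_mismatches, alphabet=4):
    """Count sequences within max_mismatches substitutions of a given sequence."""
    return sum(
        comb(length, i) * (alphabet - 1) ** i
        for i in range(max_mismatches + 1)
    )


# Growth for a 32-base seed as the allowed number of mismatches increases.
for k in (0, 1, 2, 3):
    print(k, hamming_neighbourhood(32, k))
# prints:
# 0 1
# 1 97
# 2 4561
# 3 138481
```

Each additional allowed mismatch multiplies the candidate space by a large factor, which is consistent with large -n values failing to complete.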

          Remember that this software is usually written by graduate students (who need to graduate) or post-docs. Maybe a faculty position would allow us to give better support and standardization.
