Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How BWA handles mismatches?

    Hi everyone,

    I have 5 million raw reads (76 bp read length) per sample from Illumina platform. Now I am using BWA to align these reads to reference human genome.

    After build up the index for the human genome, I used the following bwa commands to align short reads to human genome:

    ~/…/BWA/bwa-0.5.5/bwa aln -t 30 -M 7 hs_ref.fa reads/index3.fq > 020310_bwa_m7/aln_index3.sai

    ~/…/BWA/bwa-0.5.5/bwa samse hs_ref.fa 020310_bwa_m7/aln_index3.sai reads/index3.fq > 020310_bwa_m7/index3bwa.sam

    I used parameter –M 7 in order to allow 7 mismatches in alignment, but it seemed not work. I had the same results as one when I used default –M.

    If anyone could tell me how BWA handles the mismatches and allow higher number of mismatches. I did not get it from the manual.

    Many thanks beforehand!

  • #2
    Hi all,

    Although nobody’s replied my post yet, I like to share some testing results of using different parameters of BWA, maybe this could be helpful for somebody or somebody could help me with these inputs.

    The purpose of my testing is to allow more mismatches to see if I could have more alignments (particularly alignments with repeats) in human reference genome. I modified parameters with 6 different combinations in BWA, surprised to me that I had very similar results: 49% unique alignments, ~4% multiple alignments, and about 47% reads failed to align.

    The combination I used for tests are as below:
    -M 1
    -k 6
    -k6 -l32 -m1
    -n6 -l32 -m1
    -l32 –k20 –m1 (for this test, I liked to go extreme on –k to see what happened, however, it turned out with nothing changed)

    I took a look at the unaligned reads. Some could be aligned by BLAT although some were not. Some of ones that could be aligned by BLAT have repeat markers. It seems I do lost some true alignments. I am wondering why I could not have these true alignments using BWA… Any help would be appreciated if you have a clue!
    Last edited by jlmlj; 02-08-2010, 09:15 AM.

    Comment


    • #3
      try

      bwa aln -n 7 -l 1000000

      This will be very slow.

      Comment


      • #4
        Originally posted by lh3 View Post
        try

        bwa aln -n 7 -l 1000000

        This will be very slow.
        Thank you so much, very excided to get the feedback from the author of this beautiful software! I am going to try it now.

        I know -n is the max number of differences (mismatches + gaps) for the whole read length, and -l is to take the first INT as seed. However, why you set INT for -l so large, like "1000000"? Thanks in advance for the explanation!

        updates:
        I have run your parameters for 20mins, it seems the progress is very very slow: it's been staying at the process of the first step:
        [bwa_aln_core] calculate SA coordinate... (I only have 1 line for the progress)
        And it's used up all 30 nodes on our cluster. So I am thinking if it is possbile to decrease a bit the number for -l...
        Thanks!
        Last edited by jlmlj; 02-05-2010, 02:15 PM.

        Comment


        • #5
          -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.

          Comment


          • #6
            Originally posted by lh3 View Post
            -l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.
            Hi lh3,

            Thank you very much for the reply! So in this test, I disable the seed, BWA allows 7 mismatches for the total 75 read length, even for those low-quality bases, am I correct?

            The test has done, it took ~49hrs with 30-node cluster. However I still have results very similar to what I had in previous tests, which means I have 48% reads failed to align to anything in the human reference genome. (I counted "XT:A:U" as unique matches, and "XT:A:R" as repeat matches in the output SAM files).

            The results confuse me a lot: we should have much more repeat matches in the human genome. I am trying to figure out what unaligned reads are? It would be appreciated very much for any suggetion!
            Last edited by jlmlj; 02-08-2010, 09:17 AM.

            Comment


            • #7
              Dear jlmlj,

              I used the parameters (bwa aln -n 7 -l 1000000) and I was able to align a read that had 5 mismatches to the reference. Running bwa on the default settings didn't report this alignment. So perhaps you can try taking one or two individual unaligned reads and do your tests again? Just a suggestion, if you haven't already done this.

              As a more general note, I'm new to next-gen sequencing so I'd just like to point out something I found out. When I was looking at the sam file for this alignment, the CIGAR string was 27M and that looked like a mistake to me because I knew there were mismatches in the alignment. So I looked up the documentation, and found out that the "M" can be a sequence match or mismatch. It wasn't intuitive to me, so just thought I'd point it out.

              Cheers,

              Dave

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 08:47 AM
              0 responses
              14 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              54 views
              0 likes
              Last Post seqadmin  
              Working...
              X