  • bwa quality trimming and samtools rmdup

    I am trying to use bwa and samtools to map 76 bp reads from multiple bacterial strains back onto a reference sequence, with the ultimate goal of extracting SNP frequencies. To obtain accurate variant frequencies, it is important to me to remove PCR duplicates. It occurred to me that quality trimming reads with BWA using the "-q" flag during alignment could affect how well rmdup works downstream. Say 2 reads are PCR duplicates, but one is rather low-quality and is trimmed to a different length than the other. The alignment start and stop positions would no longer be the same for these duplicates, and they would not be filtered using samtools rmdup, which requires identical external coordinates.

    Is this right? Does BWA do hard trimming of reads with the "-q" flag? Or does it ignore low quality bases when calculating the alignment, but still use them to determine alignment coordinates?

    While I'm at it, could anyone explain how the BWA quality trimmer works? Please don't say "read the man page"; I did, and was very confused by the explanation there:
    Code:
    -q INT    Parameter for read trimming. BWA trims a read down to
              argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT
              where l is the original read length. [0]

    Noob question #3: Does rmdup work for single-end data or doesn't it? The samtools man page states:
    Code:
    Samtools’ rmdup does not work for single-end data and does not remove duplicates across chromosomes. Picard is better.
    However, there is an option "-s" to use rmdup on single-end data, and when I apply it to an alignment, it does appear to reduce the size of the resulting BAM file. The man page and the command itself don't agree!
    Last edited by greigite; 02-16-2010, 03:33 PM. Reason: new question
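
The worry in the first question can be made concrete with a toy model. The sketch below is not samtools code, and which key rmdup really uses should be checked against its source. It only compares two hypothetical ways of deciding that reads are duplicates: keying on both outer coordinates, or keying on the 5' end plus strand. The names `Aln`, `span_key`, `five_prime_key`, and `count_unique`, and all coordinates, are made up for illustration.

```python
from collections import namedtuple

# Hypothetical alignment record: 1-based inclusive coordinates on the reference.
Aln = namedtuple("Aln", "name start end reverse")


def span_key(a):
    """Duplicate key built from both outer coordinates plus strand."""
    return (a.start, a.end, a.reverse)


def five_prime_key(a):
    """Duplicate key built from the 5' end only.

    For a forward read the 5' end is the leftmost coordinate; for a
    reverse-strand read it is the rightmost coordinate.
    """
    return (a.end if a.reverse else a.start, a.reverse)


def count_unique(alns, key):
    """Number of reads that would survive if one read per key were kept."""
    return len({key(a) for a in alns})


# Two made-up PCR duplicates of a 76 bp forward read; the second one lost
# 10 bases from its 3' end to quality trimming.
forward_pair = [Aln("dup1", 1000, 1075, False), Aln("dup2", 1000, 1065, False)]

# Same situation on the reverse strand: the 3' end is now the leftmost
# coordinate, so trimming moves the start instead of the end.
reverse_pair = [Aln("dup3", 2000, 2075, True), Aln("dup4", 2010, 2075, True)]

for pair in (forward_pair, reverse_pair):
    print(count_unique(pair, span_key), count_unique(pair, five_prime_key))
```

In this toy model, 3' trimming splits the duplicates when both ends are part of the key but not when only the 5' end is used, so the answer to the question depends on which kind of key the duplicate remover applies.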

  • #2
    Did you get an answer? Does nobody know the answer?



    • #3
      I did not get an answer. I now use Mosaik instead of bwa; it has much better documentation.



      • #4
        Explanation of BWA read trimming

        The BWA trimming feature seems to be explained a little more clearly here: http://seqanswers.com/forums/showthread.php?t=6251 . The actual C source code is in the function bwa_trim_read() in the file bwaseqio.c, but I found the comments and variable names of the Perl example referenced in the other thread clearer.
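
For readers stuck on the man-page formula, here is a minimal Python sketch of what it literally says. It is a reading of the formula only, not a port of bwa_trim_read(), and the real function may differ in details such as stopping early or enforcing a minimum read length. It assumes qualities are already decoded to integers and listed 5' to 3'; the function name `trim_length` is hypothetical.

```python
def trim_length(quals, threshold):
    """Return how many bases to keep, per the man-page formula.

    quals: integer base qualities, listed 5' to 3'.
    threshold: the INT given to -q.

    The formula picks the kept length x that maximises the sum of
    (threshold - q_i) over the trimmed bases i = x+1 .. l.
    """
    l = len(quals)
    # The man page only trims when the last base is below the threshold.
    if l == 0 or threshold <= 0 or quals[-1] >= threshold:
        return l

    best_x, best_sum, running = l, 0, 0
    # Walk inward from the 3' end; x is the candidate kept length.
    for x in range(l - 1, -1, -1):
        # 0-based index x is base i = x + 1 in the formula's 1-based numbering.
        running += threshold - quals[x]
        if running > best_sum:
            best_sum, best_x = running, x
    return best_x


# Made-up qualities: a good read with a poor 3' tail.
example = [30, 30, 30, 30, 10, 25, 5]
kept = trim_length(example, 20)
print(kept, example[:kept])
```

Each low-quality base adds to the running sum and each good base subtracts from it, so a single good base inside a bad tail (the 25 above) does not stop the trimming, while a long run of good bases does.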



        • #5
          Also see the SolexaQA FAQ for an enlightening discussion of the bwa algorithm vs. SolexaQA algorithm.



          • #6
            Originally posted by greigite
            I did not get an answer. I now use Mosaik instead of bwa; it has much better documentation.
            Where is the real bottleneck ;-)
            Homepage: Dan Bolser
            MetaBase the database of biological databases.



            • #7
              Duplicate Marking & Trimming

              I have the same question. I know it has been a long time since you wrote this post, but I am wondering what you ended up doing. I am trying to develop a good pipeline for doing similar analyses.



              • #8
                Originally posted by mrood
                I have the same question. I know it has been a long time since you wrote this post, but I am wondering what you ended up doing. I am trying to develop a good pipeline for doing similar analyses.
                Which question? He posted 3. I found a good description of the protocol here:
                Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.
                Homepage: Dan Bolser
                MetaBase the database of biological databases.
