Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Gen2007
    Member
    • Jun 2010
    • 10

    Unique Reads from BAM/SAM

    Hi all,

    I'm using published data that includes all reads from an experiment and am not familiar with the SAM/BAM output. I'm having trouble interpreting the bitwise flag and MAPQ values to extract uniquely mapping reads and understand appropriate cut-offs.

    The SAM FAQ says to use samtools view -q 1 in order to extract "unique" reads; to which value does q refer? MAPQ?

    I want to replicate the number of acceptable reads in the publications using their criteria: "The best alignment was kept only in cases in which the second-best alignment had at least three more mismatches." What sort of q value (MAPQ) would this be?

    In all, I just want to be confident that I can filter out all unique reads.

    Thanks for any help!

    Further info: the dataset is single end bisulfite treated and only has 10 bitwise flags:
    0
    4
    16
    512
    516
    528
    1024
    1040
    1536
    1552
    Last edited by Gen2007; 01-13-2011, 10:31 AM.
  • zhidkov.ilia
    Member
    • Dec 2010
    • 25

    #2
    Hope this will help:

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

    Comment

    • swbarnes2
      Senior Member
      • May 2008
      • 910

      #3
      The only flags you want are 0 and 16. Everything higher than 1023 is a duplicate. Everything higher than 511 failed a platform quality check. The 4's are unmapped.

      view -f 0x010 is what you want, I think.

      Comment

      • Gen2007
        Member
        • Jun 2010
        • 10

        #4
        Thanks for the suggestions. Depending on how I filter I got different results (out of 24.5 million reads). Do you guys have any idea of what is contributing to the disparities?

        samtools view -q 1 Colon_tumor_FFPE_lane2.bam | wc -l : 8,481,938 reads
        view -f 0x010 Colon_tumor_FFPE_lane2.bam | wc -l : 9,597,123 reads
        awk '$2~/^(0|16)$/' Colon_tumor_FFPE_lane2.sam | wc -l : 1,237,334

        Any ideas?

        Thanks a lot!

        Comment

        • yxi
          Junior Member
          • Apr 2009
          • 6

          #5
          I have the same question. After searching the forum I found many people have similar questions, and people have different understandings of how "unique/multiple" mappings were defined in SAM format. From the SAM specification, the flag 0x100 indicates whether the alignment is primary or not. But this is different from "unique", because multiple mappings will still have one primary mapping and other non-primary mappings, so just counting primary mappings doesn't give you the answer. I wonder if SAM format could add an aux field for this basic information, which could be extracted easily? If I'm missing something, please correct me.

          Comment

          • dariober
            Senior Member
            • May 2010
            • 311

            #6
            Hello,
            Originally posted by Gen2007 View Post
            The SAM FAQ says to use samtools view -q 1 in order to extract "unique" reads; to which value does q refer? MAPQ?
            ...Well, the FAQ on samtools shows an example of how to filter out reads with MAPQ below a certain threshold. In that example they put 1 which means 'filter out nearly nothing!'. As they say, more stringent thresholds would be appropriate, say 15 (p-value ~0.05 that a read is incorrectly mapped) or 20 (p-value ~0.01).
            Also note that 'uniqueness' has a rather vague definition, better to think in terms of reliability (i.e. MAPQ).

            I want to replicate the number of acceptable reads in the publications using their criteria: "The best alignment was kept only in cases in which the second-best alignment had at least three more mismatches." What sort of q value (MAPQ) would this be?
            I can't tell exactly, but I would say that -q 15 or -q 20 should be stringent enough while avoiding to throw away too many genuine reads.
            The filtering may become trickier if you want to replicate *exactly* the published results.

            Just my 2p...
            Dario

            Comment

            • Gen2007
              Member
              • Jun 2010
              • 10

              #7
              Thanks a lot for the reply Dario. It seemed to me that -q 1 was not removing very much, and I was not very confident in using that value as a proper filter. Let me try your suggestions. Thanks again!

              Comment

              Latest Articles

              Collapse

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, 06-05-2026, 10:09 AM
              0 responses
              17 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-04-2026, 08:59 AM
              0 responses
              34 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-02-2026, 12:03 PM
              0 responses
              37 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-02-2026, 11:40 AM
              0 responses
              24 views
              0 reactions
              Last Post SEQadmin2  
              Working...