
  • remove reads from BAM whose mate has already been filtered

    Hi,

    I have removed duplicates from a paired-end BAM using Picard MarkDuplicates. In some cases only one read of a pair was retained. I am not exactly sure why (perhaps the retained read was unmapped), but my BAM no longer has an even number of reads. No other filtering was done.

    For some downstream methods (e.g., bedtools pairtobed) I need to have a BAM where both reads are present for each fragment and no "singletons" of this type are present.

    Is there an available method to remove such singleton reads?

    If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?

  • #2
    You can use bam flags to do this filtering.

    Here is a webpage with some good information on BAM flags:



    INTERPRETING THE BAM FLAGS


    The second column in a SAM/BAM file is the flag column. Flags may seem confusing at first, but the encoding allows several details about a read to be stored in a single number. The trick is to convert the decimal flag value into binary, and then use the table to interpret each binary digit, where 1 = true and 0 = false.

    Here are some common BAM flags:

    163: 10100011 in binary
    147: 10010011 in binary
    99: 1100011 in binary
    83: 1010011 in binary

    Interpretation of 10100011 (reading the binary from right to left, starting with the least significant bit):

    1 the read is paired in sequencing, no matter whether it is mapped in a pair
    1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
    0 the query sequence itself is unmapped
    0 the mate is unmapped
    0 strand of the query (0 for forward; 1 for reverse strand)
    1 strand of the mate
    0 the read is the first read in a pair
    1 the read is the second read in a pair
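The bit-by-bit reading above can be sketched in a few lines of Python. This is only an illustration: the `SAM_FLAG_BITS` table and the `decode_flag` helper are names made up for this example, and the bit meanings follow the FLAG field as defined in the SAM specification.

```python
# Bit values and meanings of the SAM FLAG field (lowest bit first).
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the meanings of all bits that are set in a SAM flag."""
    return [meaning for bit, meaning in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for flag in (163, 147, 99, 83):
        # format(flag, "b") gives the binary string shown in the post.
        print(flag, format(flag, "b"), decode_flag(flag))
```

Note that none of these bits records whether the mate is still present in the file, only how the mate was sequenced and mapped.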



    • #3
      Hi vivek,

      BAM flags won't work for this. The information about the paired read does not tell you anything about whether the read is still in the file. It only contains information about its mapping properties.



      • #4
        If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?
        I think that's what you'll have to do.

        Maybe you can go back and confirm that MarkDuplicates was treating your reads as paired end, and not single end? Maybe that was the problem.

        Or, try filtering your original file to only have reads where both ends mapped, then run MarkDuplicates. Maybe that's why MarkDuplicates didn't mark both reads.
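The read-name approach quoted at the top of this reply could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes the BAM has been converted to SAM text (for example with `samtools view -h`), that the whole file fits in memory, and that a fragment counts as complete when both a first-in-pair and a second-in-pair primary record are present. The function names `find_singletons` and `drop_singletons` are hypothetical. No sorting by read name is needed because names are collected in a dictionary.

```python
from collections import defaultdict

FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
NOT_PRIMARY = 0x100 | 0x800  # secondary or supplementary alignment


def _ends_by_name(sam_lines):
    """Map each read name to the set of pair ends seen among primary records."""
    ends = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & NOT_PRIMARY:
            continue
        ends[qname].add(flag & (FIRST_IN_PAIR | SECOND_IN_PAIR))
    return ends


def find_singletons(sam_lines):
    """Return sorted names of reads whose mate is missing from the records."""
    ends = _ends_by_name(sam_lines)
    wanted = {FIRST_IN_PAIR, SECOND_IN_PAIR}
    return sorted(name for name, seen in ends.items() if not wanted <= seen)


def drop_singletons(sam_lines):
    """Return header lines plus records that belong to complete pairs."""
    lines = list(sam_lines)
    singletons = set(find_singletons(lines))
    return [
        line for line in lines
        if line.startswith("@") or line.split("\t")[0] not in singletons
    ]
```

The list returned by `find_singletons` could be written to a file, one name per line, and passed to a read-list filter such as Picard FilterSamReads, as suggested in the original post. For a BAM too large for memory, a name-sorted streaming pass would be the safer design.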



        • #5
          Hi swbarnes2,

          Here is my output in the MarkDuplicates metrics file:

          ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
          LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
          CR503-1 4628209 124378530 7792909 3928406 30212247 83840 0.253973 213024317

          It certainly looks like MarkDuplicates detected paired ends. UNPAIRED_READS_EXAMINED and UNPAIRED_READ_DUPLICATES are the classes in question. I had always interpreted these as cases where one read mapped and the other didn't. In any event, if I were to guess, the UNPAIRED_READ_DUPLICATES are cases where a read whose mate was unmapped was removed because it mapped to the exact same coordinates as other reads.

          If this looks unusual I would appreciate feedback, but my guess is that the expected behavior is that MarkDuplicates will leave some orphan unmapped reads when REMOVE_DUPLICATES=true.
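As a sanity check on the metrics line above, the reported PERCENT_DUPLICATION can be recomputed from the other columns. The sketch below assumes the formula given in the Picard DuplicationMetrics documentation, where each read pair counts as two reads. The variable names are made up for this example, and the numbers are copied from the CR503-1 row.

```python
# Values copied from the MarkDuplicates metrics row for library CR503-1.
metrics = {
    "UNPAIRED_READS_EXAMINED": 4628209,
    "READ_PAIRS_EXAMINED": 124378530,
    "UNMAPPED_READS": 7792909,
    "UNPAIRED_READ_DUPLICATES": 3928406,
    "READ_PAIR_DUPLICATES": 30212247,
}


def percent_duplication(m):
    """Fraction of examined reads marked duplicate; a pair counts as two reads."""
    duplicates = m["UNPAIRED_READ_DUPLICATES"] + 2 * m["READ_PAIR_DUPLICATES"]
    examined = m["UNPAIRED_READS_EXAMINED"] + 2 * m["READ_PAIRS_EXAMINED"]
    return duplicates / examined


pct = percent_duplication(metrics)

# Mapped reads examined without a mapped mate that were NOT marked duplicate.
unpaired_kept = (
    metrics["UNPAIRED_READS_EXAMINED"] - metrics["UNPAIRED_READ_DUPLICATES"]
)

if __name__ == "__main__":
    print(f"recomputed PERCENT_DUPLICATION: {pct:.6f}")
    print(f"unpaired reads not marked duplicate: {unpaired_kept}")
```

The recomputed fraction agrees with the reported 0.253973 to within rounding. If the interpretation in this post is right, each of the 3928406 removed unpaired duplicates could leave an unmapped mate behind as a singleton, but that part is a guess rather than something the metrics file states.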

