Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • remove reads from BAM whose mate has already been filtered

    Hi,

    I have removed duplicates from a paired end run BAM using Picard MarkDuplicates. In some cases, a single read was retained (not exactly sure why, perhaps the retained read was unmapped, but my BAM no longer has an even number of reads). No other filtering was done.

    For some downstream methods (e.g., bedtools pairtobed) I need to have a BAM where both reads are present for each fragment and no "singletons" of this type are present.

    Is there an available method to remove such singleton reads?

    If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?

  • #2
    You can use bam flags to do this filtering.

    Here is a webpage with some good information on BAM flags:



    INTERPRETING THE BAM FLAGS


    The second column in a SAM/BAM file is the flag column. They may seem confusing at first but the encoding allows details about a read to be stored by just using a few digits. The trick is to convert the numerical digit into binary, and then use the table to interpret the binary numbers, where 1 = true and 0 = false.

    Here are some common BAM flags:

    163: 10100011 in binary
    147: 10010011 in binary
    99: 1100011 in binary
    83: 1010011 in binary

    Interpretation of 10100011 (reading the binary from left to right):

    1 the read is paired in sequencing, no matter whether it is mapped in a pair
    1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
    0 the query sequence itself is unmapped
    0 the mate is unmapped
    0 strand of the query (0 for forward; 1 for reverse strand)
    1 strand of the mate
    0 the read is the first read in a pair
    1 the read is the second read in a pair

    Comment


    • #3
      Hi vivek,

      BAM flags won't work for this. The information about the paired read does not tell you anything about whether the read is still in the file. It only contains information about its mapping properties.

      Comment


      • #4
        If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?
        I think that's what you'll have to do.

        Maybe you can go back and confirm that MarkDuplicates was treating your reads as paired end, and not single end? Maybe that was the problem.

        Or, try filtering your orignal file to only have reads where both ends mapped, then MarkDuplictes. Maybe that's why MarkDuplicates didn't mark both reads.

        Comment


        • #5
          Hi swbarnes2,

          Here is my output in the MarkDuplicates metrics file:

          ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
          LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICAT
          ES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
          CR503-1 4628209 124378530 7792909 3928406 30212247 83840 0.253973 213024317

          It certainly looks like MarkDups detected paired ends. The UNPAIRED_READS_EXAMINED and UNPAIRED_READ_DUPLICATES are the classes in question. I had always interpreted these to be cases where one read mapped and the other didn't. In any event, if I were to guess the UNPAIRED_READ_DUPLICATES are cases where a read, whose mate was unmapped, was removed because it mapped to the exact same coordinates as other reads.

          If this looks unusual I would appreciate feedback, but my guess is that the expected behavior is that MarkDuplicates will leave some orphan unmapped reads when REMOVE_DUPLICATES=true.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Best Practices for Single-Cell Sequencing Analysis
            by seqadmin



            While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
            06-06-2024, 07:15 AM
          • seqadmin
            Latest Developments in Precision Medicine
            by seqadmin



            Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

            Somatic Genomics
            “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
            05-24-2024, 01:16 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 06-14-2024, 07:24 AM
          0 responses
          12 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-13-2024, 08:58 AM
          0 responses
          13 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-12-2024, 02:20 PM
          0 responses
          17 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-07-2024, 06:58 AM
          0 responses
          184 views
          0 likes
          Last Post seqadmin  
          Working...
          X