Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem removing duplicate reads? (samtools and picard)

    Hello,

    I am having trouble removing duplicate reads with samtools and picard. I have a sam/bam file that is of paired-end illumina reads mapped to the genome (minimal file below). I have coordinate-sorted the sam/bam file and tried both:

    java -jar picard-tools-1.38/MarkDuplicates.jar I=problemSorted.bam O=out.sam M=log.txt REMOVE_DUPLICATES=true
    -and-
    samtools rmdup problemSorted.bam out.bam

    and neither seem to remove the lines that I believe are duplicates. The two read pairs have the exact same sequence and mapping, so I believe I should only have one pair after removing duplicates. I have read that samtools may not work in this case because the two ends of the reads map to different chromosomes, but I am surprised that picard is not removing them.

    The duplicate reads are removed by:
    samtools rmdup problemSorted.bam out.bam -S
    but I am worried that treating all my paired-end reads as single reads when removing duplicates may lead to other issues.

    Thanks for your help! Minimal input file is below:

    @HD VN:1.0 SO:coordinate
    @SQ SN:chr1 LN:28195914
    @SQ SN:chr2 LN:19369704
    HWI-EAS000_1:5:71:14037:15683:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC IIIIIIIIIIGIIIIIHIIIIIGIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
    HWI-EAS000_1:5:80:6679:2759:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC HGIIIIHIHHIIIHIIIIIIIIHIIIIIIIIIIIIH X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
    HWI-EAS000_1:5:71:14037:15683:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIIIIIIIHIHIHIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
    HWI-EAS000_1:5:80:6679:2759:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

  • #2
    According to the SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) , read names must be the same for read pairs. Your read pairs do not seem to follow this rule, so it appears they are not being recognized as pairs by MarkDuplicates.

    Comment


    • #3
      Hi,

      I'm also facing the same problem here is case in question. The read names are same.

      HWI-ST220_63:6:2108:11340:115090 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIHGIGJIJJIJHGIHGGJJJJJJJJIJJJIJJJIJGHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
      HWI-ST220_63:6:2108:11340:115090 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA HHJIJJIIIJJJJJJJJIHDJJJIJIJJJJJIIJJJJHHHHHDFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
      HWI-ST220_63:6:1108:12937:121403 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIGJIHGJJJJIHGJJJIJJJJJJIJHJIJIJJJIJHHHFHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
      HWI-ST220_63:6:1108:12937:121403 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA JJIIJIJIGJJGIIJJIGE;JJJJIIJJJJIIIJJJJHFHHHFFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

      Any pointers would be most appreciated.

      Comment


      • #4
        Need to include the command you used, etc.
        Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
        Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
        Projects: U87MG whole genome sequence [Website] [Paper]

        Comment


        • #5
          I had 1st sorted the file
          samtools sort BothEndsMapped.bam BothEndsMapped.sorted
          and then rmdup..
          samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
          I guess by default it does for paired end reads.

          Comment


          • #6
            Hi cbl,

            Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.

            Thank you,
            Douglas

            Comment


            • #7
              Well I'm looking for cases of rearrangements hence, reads mapping in different chromosomes is important.
              As far as removing duplicates is considered, its quite subjective to the need as well as coverage.
              To start with being strict on the data I'm removing the duplicates. Cases of different reads showing same rearrangement would be say 90% chance that they are case of rearrangement (provided they are not into repeats/segmental duplicates). In case of low depth of coverage it would be quite difficult distinguish rearrangements due to the real data and PCR duplicates, so have to be little strict on that front.

              Comment


              • #8
                Hi shruti,

                I understand your points. For me, I use a different approach in rearrangement analysis. I rely on a very tiny subset of the raw reads to infer the rearrangement. So at the beginning I use very nonrestrictive criteria to locate candidate reads that indicate potential rearrangement. Then I will dive into these reads. In the context of this post, including duplicate reads at the beginning does not hurt your chance finding/validating the rearrangement. When you have a small subset of candidate reads, you may look into duplication, most likely with custom scripts.

                So my suggestion here is not to dwell too much on removing duplicated reads at the beginning as some general-purpose tools (e.g., samtools) may not be suitable for your specific needs.

                Douglas

                Comment


                • #9
                  Well..yes I guess one can start with being lenient. Just curious, if you start with these reads, then how do you filter at the end. or do you also experimentally validate these cases. how much of a false positive do you observe.

                  Comment


                  • #10
                    Hi shruti,

                    I usually use custom scripts to screen for 1) whether they map to the same chr. 2) how far apart and in which orientation 3) mapping quality and ambiguity 4) read counts (independent or PCR-duplication?). I am reluctant to put a hard number on false positive but I agree with you that it is critical to experimentally follow up and validate the models obtained from next-gen data.

                    Douglas

                    Comment


                    • #11
                      Originally posted by shruti View Post
                      I had 1st sorted the file
                      samtools sort BothEndsMapped.bam BothEndsMapped.sorted
                      and then rmdup..
                      samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
                      I guess by default it does for paired end reads.
                      Use Picard MarkDuplicates instead and get back to us. Samtools rmdup is not recommended.

                      If you add more responses, copy and paste the exact command. Remove file names if they're sensitive.

                      Originally posted by DZhang View Post
                      Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.
                      There has never been a demonstrated case where leaving duplicates in is advantageous.

                      In contrast, numerous analyses have demonstrated beneficial effects of removing duplicates.

                      Many discussions of this have been done on this site in the past. http://seqanswers.com/forums/showthread.php?t=6854 for example.
                      Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                      Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                      Projects: U87MG whole genome sequence [Website] [Paper]

                      Comment


                      • #12
                        Hi Michael.James.Clark,

                        I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons. I certainly recognize the benefits of duplicate removal, due to the limitations of current laboratory technology and bioinformatic methods, I would not put it as must-have as long as the user considers the impact of duplicate when interpreting the analysis results.

                        Douglas

                        Comment


                        • #13
                          Originally posted by DZhang View Post
                          I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons.
                          I think you ought to read that thread more carefully then, because that is not what I said, and I was not the only one to explain that duplicate removal benefits many downstream analyses. (Oh, by the way, I chose that thread because it is in my history since I replied to it, but there are many others on the same topic that you could read up on that might interest you.)

                          I will clarify my statement for you: Were we talking about ChIP-seq or RNA-seq here, I might agree that duplicate removal is a questionable step to take. Certainly if we were talking about single-end data, I would say that it's probably not necessary (and, in fact, is likely throwing the baby out with the bathwater in many cases).

                          However, it seems this is paired WGS (or some other DNA-seq) with the goal of variant detection. In that case, de-dupping has been shown to be beneficial for variant calling (SNVs, indels, SVs, CNVs).

                          With regards to SVs, de-dupping is particularly useful for long mate-paired data. For simple PE data at adequate depth, it may have little impact. However, I've not seen anyone demonstrate it hurts SV detection in those cases, therefore why not do it? Especially if you'll be using that same data for SNV calling, where it has demonstrably significant effects.
                          Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                          Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                          Projects: U87MG whole genome sequence [Website] [Paper]

                          Comment


                          • #14
                            Hi Michael.James.Clark,

                            Thank you for your additional comment. I would not characterize my opinion on duplicate removal as "questionable" step. I completely recognize the benefits of doing it at the beginning of the pipeline, but would not take it as a mandatory step - "must-have". In this particular case, cbl had executed the duplicate removal. My suggestion to cbl was to go ahead with the downstream analysis instead of halting the project progress. The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.

                            It of course has value to find a solution that can remove the duplicates that are mapped to diffferent chr. and I hope your suggestion of picard will resolve that.

                            Douglas

                            Comment


                            • #15
                              Originally posted by DZhang View Post
                              The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.
                              Samtools rmdup and Picard MarkDuplicates are not different due to "judgment calls".

                              See here: http://sourceforge.net/apps/mediawik...tools_rmdup.3F

                              "samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

                              That's because of the way samtools rmdup works... it looks for successive reads mapping to the same spot after sorting without considering the pair. It's not because the authors intended it to miss those from what I understand--it's because way back when they wrote it, they didn't consider that case.

                              (By the way, that also explains why shruti is seeing interchromosomal duplicates in her results, and why I told her to use Picard rather than samtools in the first place. )

                              Also, I'm not sure actually if samtools rmdup considers the second end at all. I think it does, but I think it simply doesn't work on interchromosomals. Either way, it's safest to just use Picard from what I've seen.
                              Last edited by Michael.James.Clark; 07-24-2011, 04:35 PM.
                              Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                              Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                              Projects: U87MG whole genome sequence [Website] [Paper]

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin


                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                Today, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              37 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              35 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X