Announcement

Collapse
No announcement yet.

Samtools's rmdup vs. Picard's MarkDuplicates

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools's rmdup vs. Picard's MarkDuplicates

    I am trying to remove duplicate using samtools's rmdup and rmdupse, they don't seem to just remove everything that share the same start/end sites. I couldn't seem to find a good description of what exactly rmdup/rmdupse does, nor could i understand the C source code.

    I found in another post that Picard might does a better job at this, but it seems controversial. So I am wondering what people use and what are the pros/cons of the two options.

    thanks,

  • #2
    Not too controversial, use Picard.

    Comment


    • #3
      Originally posted by nilshomer View Post
      Not too controversial, use Picard.
      Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.

      Comment


      • #4
        Originally posted by xguo View Post
        Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.
        Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?

        Comment


        • #5
          Originally posted by nilshomer View Post
          Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?
          I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.

          Comment


          • #6
            Originally posted by xguo View Post
            I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.
            I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

            Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.
            Last edited by nilshomer; 06-13-2010, 08:22 PM. Reason: more info

            Comment


            • #7
              Originally posted by nilshomer View Post
              I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

              Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.
              The following is what I found from samtools-help list:

              "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
              have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

              Comment


              • #8
                Originally posted by xguo View Post
                The following is what I found from samtools-help list:

                "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
                have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."
                This would be useful to add to http://wiki.seqanswers.com/ so we can build up a base of answers and not have to search the forums. Do you want to add it?

                Comment


                • #9
                  now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?



                  Originally posted by xguo View Post
                  The following is what I found from samtools-help list:

                  "Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
                  have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

                  Comment


                  • #10
                    Originally posted by xguo View Post
                    I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.
                    Are you sure of this?

                    Comment


                    • #11
                      I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.

                      Comment


                      • #12
                        Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!

                        Comment


                        • #13
                          Originally posted by JohnK View Post
                          Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!
                          There is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854

                          Hope that helps.

                          Originally posted by mingkunli
                          now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?
                          PCR duplicates do not necessarily have the same sequence OR the same length. The 5' end of the read is the most reliable position because if it's a PCR duplicate, it will be guaranteed to be the same at that end but not at the 3' end. That is, I think, the reasoning behind not considering the 3' end.

                          Also, reads from the reverse strand are still in 5'->3' context. The enzymes always go in the same direction, right?
                          Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                          Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                          Projects: U87MG whole genome sequence [Website] [Paper]

                          Comment


                          • #14
                            Originally posted by lh3 View Post
                            I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.
                            I might arrive quite late to this thread, but I am going through the same dilemma... In one section of the 1000 genomes paper (suppl) they mention that they use rmdup and then to make sure the run MarkDuplicates on top of that... Would this be the ultimate best approach??

                            In my case I am using SE data, and if I'm guessing right for SE picard and samtools work the same for dupl removal??

                            Thanks!

                            Comment


                            • #15
                              I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

                              java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

                              java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

                              The resultant file is only a little smaller than the original. So I checked some known duplicates;

                              grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
                              100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                              100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                              grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
                              100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
                              100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

                              I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.
                              Last edited by swbarnes2; 11-19-2010, 03:18 PM. Reason: typos

                              Comment

                              Working...
                              X