Seqanswers Leaderboard Ad

**nilshomer** · 06-08-2010, 03:09 PM

Not too controversial, use Picard.

**xguo** · 06-13-2010, 06:58 PM

Originally posted by nilshomer View Post

Not too controversial, use Picard.

Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.

**nilshomer** · 06-13-2010, 07:02 PM

Originally posted by xguo View Post

Do you know the difference between Picard and samtools in terms of duplicate removal? Thanks.

Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?

**xguo** · 06-13-2010, 08:10 PM

Originally posted by nilshomer View Post

Something about not considering paired-end/mate-pair reads correctly. Did you try searching the site yet?

I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.

**nilshomer** · 06-13-2010, 08:21 PM

Originally posted by xguo View Post

I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.

I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.

**xguo** · 06-14-2010, 05:40 AM

Originally posted by nilshomer View Post

I don't think the sequences are considered, since two observations of the same PCR fragment could be different. You could read the source code or email the samtools help list ([email protected]). Please post back when you get the answer, or maybe add it to the wiki under duplicate removal for the future benefit of others.

Also, the first search on google using the string "site: seqanswers.com duplicate removal" was useful.

The following is what I found from samtools-help list:

"Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

**nilshomer** · 06-14-2010, 08:37 AM

Originally posted by xguo View Post

The following is what I found from samtools-help list:

"Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

This would be useful to add to http://wiki.seqanswers.com/ so we can build up a base of answers and not have to search the forums. Do you want to add it?

**mingkunli** · 08-06-2010, 09:22 AM

now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?

Originally posted by xguo View Post

The following is what I found from samtools-help list:

"Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

**JohnK** · 10-29-2010, 07:37 AM

Originally posted by xguo View Post

I did, but haven't found any details. For single-end reads, samtools consider them to be duplicates as long as their mapping locations are the same. Even if the sequences of two reads are not the same, they are still removed as duplicates. I'm not sure if it's desired, and how picard handles it.

Are you sure of this?

**lh3** · 10-29-2010, 08:37 AM

I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.

**JohnK** · 10-29-2010, 11:00 AM

Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!

**Michael.James.Clark** · 10-29-2010, 11:40 AM

Originally posted by JohnK View Post

Hi lh3, thanks for the response! I understand PCR duplicates, but can you clarify what an optical duplicate is? I like what you are saying. Thanks again!

There is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854

Hope that helps.

Originally posted by mingkunli

now I come to anther question: once we have reads with various lengths, should 3' coordinates should also be considerred(or length of the aligned part) ? for reads from the reverse strand, 3' is the beginning of the fragment, am I right?

PCR duplicates do not necessarily have the same sequence OR the same length. The 5' end of the read is the most reliable position because if it's a PCR duplicate, it will be guaranteed to be the same at that end but not at the 3' end. That is, I think, the reasoning behind not considering the 3' end.

Also, reads from the reverse strand are still in 5'->3' context. The enzymes always go in the same direction, right?

**maricu** · 11-11-2010, 03:49 AM

Originally posted by lh3 View Post

I believe Picard is doing the same for SE and this is the right thing to do. PCR duplicates can have different sequences. Optical duplicates have identical sequences.

I might arrive quite late to this thread, but I am going through the same dilemma... In one section of the 1000 genomes paper (suppl) they mention that they use rmdup and then to make sure the run MarkDuplicates on top of that... Would this be the ultimate best approach??

In my case I am using SE data, and if I'm guessing right for SE picard and samtools work the same for dupl removal??

Thanks!

**swbarnes2** · 11-19-2010, 01:03 PM

I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

The resultant file is only a little smaller than the original. So I checked some known duplicates;

grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Samtools's rmdup vs. Picard's MarkDuplicates

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News