Seqanswers Leaderboard Ad

**drio** · 11-19-2010, 05:45 PM

The example you are showing is not significant to determine if there was any problem with picard. You are showing the two ends of a pair. They are both in your sam file before and after running picard. That's the expected behavior.

**bioinfosm** · 11-19-2010, 09:29 PM

Originally posted by Michael.James.Clark View Post

There is a nice discussion of such things (including a post where Heng explains the answer to your question) as well as links to numerous other threads on the topic here: http://seqanswers.com/forums/showthread.php?t=6854

Hope that helps.

Referring to this topic, is duplicate removal really necessary for DNA sequencing?

Removing duplicates is it really necessary? - SEQanswers

http://seqanswers.com/forums/showthread.php?t=6854

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

**JohnK** · 11-20-2010, 12:56 PM

Originally posted by swbarnes2 View Post

I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

The resultant file is only a little smaller than the original. So I checked some known duplicates;

grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.

Hi,

Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.

**swbarnes2** · 11-22-2010, 10:11 AM

Originally posted by JohnK View Post

Hi,

Picard's markdups program's supposed advantage is in its ability to remove duplicates across chromosomes. I ran markdups and samtools and samtools ended up removing more dups, but not across chromosomes of course. Either way, you can physically go through and filter out pairs and sequences with poor pairing or mapping QVs. I also remove reads across chromosomes now. Depends on what you're doing really.

Yeah, that's what I wrote a script for a while back, but I wanted to use something standard, something that other people can trust works.

But Picard doesn't seem to be removing dupes within chromosomes. I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?

**JohnK** · 11-22-2010, 03:09 PM

I'm not sure really. I actually use samtools now to remove the dups and I believe we're moving towards filtering out the singletons, cross-chromosome pairs, invalid insert sizes, and anything else under a QV of ten. Just for the sake of the most accurate data.

**drio** · 11-23-2010, 05:42 AM

Originally posted by swbarnes2 View Post

I posted the .sam entries for two different clusters with the exact same sequence at each end; why didn't Picard remove one cluster or the other?

Picard looks for the mapping coordinates (and orientation), not the actual sequence.

Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.

**swbarnes2** · 11-23-2010, 11:56 AM

Originally posted by drio View Post

Picard looks for the mapping coordinates (and orientation), not the actual sequence.

Notice the MAPQ of those alignments. It is very low. Those alignments should be filtered out and should not contribute to the downstream analysis.

Ah, so it's just ignoring those because they are repetitive in the genome. Okay, I guess that makes sense, though I'd rather have have them in there, but deduped, so I can tell the difference between a repetative part, and a deletion.

But I have other things that align uniquely, and still aren't being trimmed.

grep 2:78:8570:13184 full_genome_grepped_sorted_deduped.sam
100729:2:78:8570:13184:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT hhghhhhhhhhhhghhhhhhchhhhghheafdfdfafcfcfcfafafafa X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
100729:2:78:8570:13184:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghghhhhhhfhh X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

grep 2:33:3309:20360 full_genome_grepped_sorted_deduped.sam
100729:2:33:3309:20360:b: 129 chr1 149128373 37 50M = 149128649 276 CCTTTTCCTCTGCATGCAATAGATCATTGAGTGTGTGTGTGTGTGTGTGT ggggggggggdffffgggggggfagggggffafcfafacab_b]b_fa]a X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
100729:2:33:3309:20360:a: 65 chr1 149128649 37 50M = 149128373 -276 ACTTTCAATCTCTGCTTTTCAAAATGTGAAAATGCCCAATTGATTACAGC hhhhhhhhhhhhhhhhhhhhhhhg_hhhhhhhhahhhhhhfhfhheggcf X0:i:1 X1:i:0 MD:Z:50 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

The coordinates for each cluster are identical, even the sequences are identical at both ends, and isn't a mapping quality of 37 fine? Why isn't one of these being removed by Picard?

**drio** · 11-23-2010, 03:14 PM

I do not know why one of those pairs was not flagged as PCR duplicate.

**cjp** · 06-07-2011, 05:54 AM

Originally posted by swbarnes2 View Post

I tried to use picard's mark duplicates, but it doesn't seem to be doing much of a job (The sam file is from a bwa paired end alignment)

java -jar SortSam.jar I=full_genome_grepped.sam O=full_genome_grepped_sorted.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT

java -jar MarkDuplicates.jar I=full_genome_grepped_sorted.sam O=full_genome_grepped_sorted_deduped.sam M=Dedup_metrics.txt VALIDATION_STRINGENCY=LENIENT QUIET=true REMOVE_DUPLICATES=true

The resultant file is only a little smaller than the original. So I checked some known duplicates;

grep 1:78:10366:8808 full_genome_grepped_sorted_deduped.sam
100729:1:78:10366:8808:a: 99 chr9 3000477 0 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA gggggggggggggggggggggdgfcgggggdggggggggegggggggggg X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:78:10366:8808:b: 147 chr9 3000810 0 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhghhhhhhhhhhhhhhhhhghhhhhhhghhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3022120,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

grep 1:80:15287:8131 full_genome_grepped_sorted_deduped.sam
100729:1:80:15287:8131:a: 99 chr9 3000477 3 50M = 3000810 383 AGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCCATATTCCA ggggggggggggggfggggggcggggggggggggggggggggggggggga X0:i:23 X1:i:27 MD:Z:50 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
100729:1:80:15287:8131:b: 147 chr9 3000810 3 50M = 3000477 -383 GCCATATTTTACGTCCTAAAGTGTGCATTTCTCATTTTTCACGTTTTTCA hhhhhhhhhhhghhhhhhhhhfhhhhhhhhhhghhhhhhhhhhhhhhhhh X0:i:4 X1:i:0 XA:Z:chr9,-3000810,50M,3;chr9,-3011768,50M,3;chr9,-3019372,50M,3; MD:Z:9C1G6C31 XG:i:0 AM:i:0 NM:i:3 SM:i:0 XM:i:3 XO:i:0 XT:A:R

I was hoping that one of the two would be deleted from the deduped file, since they look like perfect duplicates to me.

Hi there,

I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

Chris

**swbarnes2** · 06-07-2011, 08:11 AM

Originally posted by cjp View Post

Hi there,

I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

Chris

Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.

**Michael.James.Clark** · 06-07-2011, 01:00 PM

Originally posted by cjp View Post

Hi there,

I was getting the same problem and played with the code a bit. If your forward and reverse reads have slightly different names - am guessing you have :a: and :b: as suffixes for your forward and reverse read names - then the program doesn't recognise them as being from the same pair.

Chris

Indeed, that's likely what it is. Actually, if you check the SAM format spec sheet, paired reads are supposed to have identical names.

http://samtools.sourceforge.net/SAM1.pdf

See the example.

**cjp** · 06-08-2011, 01:01 AM

Originally posted by swbarnes2 View Post

Thanks for figuring it out. That makes sense, though I thought when I ran my file as it was, that it did trim a few entries out, just not as many as I thought should be trimmed. And samtools does gets that samples named like that are paired.

Those are probably unpaired reads - you can look in the metrics file to see how many unpaired and paired end reads were marked as duplicates.

Before read names were changed to be the same:

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
A0600028 3172904 67528513 0 2607750 0 0 0.018865

After making the name change:

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
A0600028 3172904 67528513 0 2607750 38116791 0 0.570364 34131325

Chris

**Heisman** · 09-20-2011, 09:46 AM

Nevermind, figured out my problem. The problem was I'm an idiot. Our paired end sequence data comes back with read one ending in '#0/1' and read two ending in '#0/3', and the 3 has to be converted to a 2 for MarkDuplicates to work. I naturally used a command that changed every 3 to a 2...

**amsci8** · 04-19-2013, 08:54 AM

Mark duplicates post fastq trimming?

Hello, I am fairly new to bioinformatics and have a question about using Picard Mark Duplicates after trimming sequence reads.

(I could not find the exact answer to my question (so I apologize if this a repeat question), but if it exists please link to the answer.)

Does anyone know how Picard Mark Duplicates knows which bases were trimmed from a read? If you trim bases from the 5' end of the read, how does Picard know that? See quoted comment from xguo below.

Note, I am using trimmomatic to trim low quality bases and adapters from paired end whole genome DNA re-sequencing data. I am aligning this data to a reference then using the GATK to call snp variants for population genetic analyses... so I need remove duplicates to ensure accurate genotype/snp calls.

Thank you!
Amanda

Originally posted by xguo View Post

The following is what I found from samtools-help list:

"Essentially what Picard does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base
have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15."

**syfo** · 04-22-2013, 05:04 AM

I think that MarkDuplicates is considering the bases that were "clipped" during the alignment process: those terminal parts of the sequences do belong to the read but are not reported as part of the alignments. This is not the same as the "trimming" process that removes low-quality and/or adapter sequences before the alignment. As you pointed out, those are definitely lost.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 17 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 46 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News