Hello,

I am having trouble removing duplicate reads with samtools and Picard. I have a SAM/BAM file of paired-end Illumina reads mapped to the genome (minimal file below). I have coordinate-sorted the BAM file and tried both:

java -jar picard-tools-1.38/MarkDuplicates.jar I=problemSorted.bam O=out.sam M=log.txt REMOVE_DUPLICATES=true

-and-

samtools rmdup problemSorted.bam out.bam

and neither seems to remove the lines that I believe are duplicates. The two read pairs have exactly the same sequence and mapping, so I would expect to be left with only one pair after duplicate removal. I have read that samtools rmdup may not work in this case because the two ends of each pair map to different chromosomes, but I am surprised that Picard is not removing them either.

The duplicate reads are removed by:

samtools rmdup -S problemSorted.bam out.bam

but I am worried that treating all my paired-end reads as single-end reads during duplicate removal may cause other problems.

Thanks for your help! Minimal input file is below:

@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:28195914
@SQ SN:chr2 LN:19369704
HWI-EAS000_1:5:71:14037:15683:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC IIIIIIIIIIGIIIIIHIIIIIGIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:80:6679:2759:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC HGIIIIHIHHIIIHIIIIIIIIHIIIIIIIIIIIIH X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:71:14037:15683:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIIIIIIIHIHIHIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:80:6679:2759:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
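To show concretely what I mean by "duplicates", here is a quick standalone Python check of the four alignment lines above. It just groups reads by (reference, position, strand, sequence); the quality strings are replaced with * purely to keep the lines short:

```python
# Group the four alignments by (reference, position, strand, sequence).
# Both mapping positions are covered by two reads with identical sequence,
# which is why I expect duplicate removal to collapse each pair to one.
from collections import Counter

sam_records = "\n".join([
    "HWI-EAS000_1:5:71:14037:15683:0:1:1\t113\tchr1\t11732000\t37\t36M\tchr2\t54321\t0\tCTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC\t*",
    "HWI-EAS000_1:5:80:6679:2759:0:1:1\t113\tchr1\t11732000\t37\t36M\tchr2\t54321\t0\tCTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC\t*",
    "HWI-EAS000_1:5:71:14037:15683:0:2:1\t177\tchr2\t54321\t37\t36M\tchr1\t11732000\t0\tGTATGTACTGTATTATCTGAGTTTTTTATTCACAAG\t*",
    "HWI-EAS000_1:5:80:6679:2759:0:2:1\t177\tchr2\t54321\t37\t36M\tchr1\t11732000\t0\tGTATGTACTGTATTATCTGAGTTTTTTATTCACAAG\t*",
])

groups = Counter()
for line in sam_records.splitlines():
    fields = line.split("\t")
    flag = int(fields[1])
    strand = "-" if flag & 16 else "+"  # FLAG bit 0x10 = reverse strand
    groups[(fields[2], int(fields[3]), strand, fields[9])] += 1

print(sorted(groups.values()))  # [2, 2]: each mapping position is seen twice
```

So each of the two mapping positions carries two identical reads, which is the duplication I am trying to remove.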
