**mard** · 03-18-2010, 09:40 PM

Yes, it tells you the number of reads that have been marked as duplicates, as well as the total number of reads. But note that reads that Picard marks as duplicates do not necessarily have identical sequences; they just map to the same chromosomal location.

**bair** · 03-19-2010, 01:36 AM

Thanks. How do I pick which duplicates to remove? Should I keep the one with the best alignment if they do not have identical sequences?

Here is what I got from Picard:

## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 27221401 548559917 190908169 14563968 58165860 0 0.11642 2400441897

## HISTOGRAM java.lang.Double
BIN VALUE
1.0 1
2.0 1.795707
3.0 2.428856
4.0 2.932657
5.0 3.333535
6.0 3.652516
7.0 3.906332
8.0 4.108295

What is this histogram about?

My original BAM file has 657624702 read pairs, so 2*657624702 reads in total. After removing duplicates, the BAM file has 1184353716 reads in total. So presumably
2*657624702 - 1184353716 = 130895688 reads were removed.

I couldn't get this number from the Picard metrics (M) output file. Any help?

Thanks

**psm3426** · 12-23-2010, 12:00 PM

The meaning of the histogram is covered in one of the FAQs on their wiki.

http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_What_is_meaning_of_the_histogram_produced_by_MarkDuplicates.3F
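
For what it's worth, the histogram values above seem to be reproducible from the metrics line if you read BIN as "sequence this many times as much" and VALUE as "expect this many times as many unique pairs". The sketch below is my own reconstruction, not Picard's code: the function name `estimate_roi` and the exact formula are assumptions, checked only against the numbers posted in this thread.

```python
import math

# Values copied from the metrics line posted above.
READ_PAIRS_EXAMINED = 548_559_917
READ_PAIR_DUPLICATES = 58_165_860
READ_PAIR_OPTICAL_DUPLICATES = 0
ESTIMATED_LIBRARY_SIZE = 2_400_441_897


def estimate_roi(library_size, x, pairs, unique_pairs):
    """Expected unique pairs at x times the current depth, relative to now.

    Assumed model: sampling pairs at random from a library of
    `library_size` distinct molecules.
    """
    expected_unique = library_size * (1 - math.exp(-(x * pairs) / library_size))
    return expected_unique / unique_pairs


# Optical duplicates are excluded from the pair count (zero here anyway).
pairs = READ_PAIRS_EXAMINED - READ_PAIR_OPTICAL_DUPLICATES
unique_pairs = pairs - READ_PAIR_DUPLICATES

# Bins 1..8, matching the rows shown in the post.
histogram = {x: estimate_roi(ESTIMATED_LIBRARY_SIZE, x, pairs, unique_pairs)
             for x in range(1, 9)}

for x, value in histogram.items():
    print(f"{float(x)} {value:.6f}")
```

If that reading is right, bin 2.0 says that doubling the sequencing would yield only about 1.8 times as many unique pairs, and the curve keeps flattening from there.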

The reason you couldn't get that number is that READ_PAIR_DUPLICATES counts duplicate pairs rather than reads, so it is half the actual number of duplicate reads from pairs. You also need to add the unpaired duplicates. So in your case, 2 * 58165860 (READ_PAIR_DUPLICATES) + 14563968 (UNPAIRED_READ_DUPLICATES) = 130895688, which was the number of duplicates you were missing. =)
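
Here is a small sanity check of that bookkeeping, using only the numbers posted in this thread. The variable and function names are mine, and the PERCENT_DUPLICATION formula is an assumption that happens to match the reported value; treat it as a sketch rather than Picard's definition.

```python
# Values copied from the metrics line posted above.
unpaired_reads_examined = 27_221_401
read_pairs_examined = 548_559_917
unmapped_reads = 190_908_169
unpaired_read_duplicates = 14_563_968
read_pair_duplicates = 58_165_860

# Counts reported by the original poster.
reads_before = 2 * 657_624_702
reads_after = 1_184_353_716


def duplicate_reads(unpaired_dups, pair_dups):
    """Duplicate reads: each duplicate pair contributes two reads."""
    return unpaired_dups + 2 * pair_dups


def examined_reads(unpaired, pairs):
    """Mapped reads examined: each pair contributes two reads."""
    return unpaired + 2 * pairs


removed = duplicate_reads(unpaired_read_duplicates, read_pair_duplicates)
examined = examined_reads(unpaired_reads_examined, read_pairs_examined)

# Assumed definition: duplicate reads over mapped reads examined.
percent_duplication = removed / examined

print(removed, reads_before - reads_after)
print(examined + unmapped_reads, reads_before)
print(round(percent_duplication, 5))
```

The three printed lines check that the duplicates account for the reads removed, that examined plus unmapped reads add up to the input file, and that the ratio agrees with the PERCENT_DUPLICATION column.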

MarkDuplicates in picard