**nilshomer** · 06-12-2010, 04:25 PM

Originally posted by cliff:

I just tried Picard to remove PCR duplicates and used the test_sorted.bam (obtained by using samtools sort) as the input file. My following command

java -jar MarkDuplicates.jar test_sorted.bam test_rmdup.bam

gave me an error

ERROR: Invalid argument 'test_sorted.bam'.

Does anybody know what I did wrong?

Thanks for all your help in advance.

Try it without any arguments to see how to specify input and output files. The command is different from samtools.

**cliff** · 06-12-2010, 06:15 PM

I tried again

java -Xmx2g -jar ~/picard-tools-1.21/MarkDuplicates.jar INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true

And I got this error:

[Sat Jun 12 22:11:22 EDT 2010] net.sf.picard.sam.MarkDuplicates INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 TMP_DIR=/tmp/cliff VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
INFO 2010-06-12 22:11:22 MarkDuplicates Start of doWork freeMemory: 31062256; totalMemory: 31588352; maxMemory: 1908932608
INFO 2010-06-12 22:11:22 MarkDuplicates Reading input file and constructing read end information.
INFO 2010-06-12 22:11:22 MarkDuplicates Will retain up to 7575129 data points before spilling to disk.
[Sat Jun 12 22:11:23 EDT 2010] net.sf.picard.sam.MarkDuplicates done.
Runtime.totalMemory()=152829952
Exception in thread "main" net.sf.picard.PicardException: test_sorted.bam is not coordinate sorted.
at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:250)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:112)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:96)

It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

Where did I go wrong?

**kmcarr** · 06-12-2010, 09:03 PM

Originally posted by cliff:

It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

Where did I go wrong?

Nowhere, this is samtools' fault. The SAM specification lists a header (HD) tag for sort order (SO). The three permissible values for this tag are "unsorted", "coordinate", indicating that the entries have been sorted by chromosome and start position, and "queryname", meaning the file is sorted by the read IDs. When you sort the file with samtools it does not update the SO tag to reflect the fact the file has been sorted. According to the author of samtools, the SAM specification does not require this so it is not a bug (see this thread). Perhaps not but it's damned annoying.

You can view the header information for your bam file with the command

Code:

samtools view -H test_sorted.bam

Picard reads the SO tag to determine whether or not the file is sorted. This is obviously much easier and more efficient than actually checking every line of the file to determine whether or not it has been sorted.
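To make the header check concrete, here is a minimal Python sketch of reading the SO value from header text such as `samtools view -H` prints. It is not Picard's code; the function name `sort_order` and the example header are made up for illustration.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line of SAM header text, or None."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            for field in line.split("\t")[1:]:
                tag, _, value = field.partition(":")
                if tag == "SO":
                    return value
    return None


# Hypothetical header of a file whose SO tag was never updated after sorting.
example_header = "@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n"
print(sort_order(example_header))
```

If this reports anything other than "coordinate", a tool that trusts the header will treat the file as unsorted, whatever order the records are actually in.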

Before you can use Picard to remove duplicates you will have to fix the SO tag. Fortunately Picard has a command to do this, ReplaceSamHeader. Alternatively you could use Picard's SortSam instead of samtools sort. (For the record, I don't know for sure whether Picard SortSam properly updates the SO tag.)
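As a rough illustration of what "fixing the SO tag" means, the sketch below rewrites the @HD line of header text so that it declares coordinate order. It only edits text; getting the corrected header back into the BAM is left to a header-replacement tool. The function name `set_sort_order` and the `VN:1.0` fallback are assumptions for illustration.

```python
def set_sort_order(header_text, order="coordinate"):
    """Return header text whose @HD line carries SO:<order>."""
    out = []
    seen_hd = False
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            seen_hd = True
            # Drop any existing SO field, then append the new one.
            fields = [f for f in line.split("\t") if not f.startswith("SO:")]
            fields.append("SO:" + order)
            line = "\t".join(fields)
        out.append(line)
    if not seen_hd:
        # Assumption: no @HD line present, so add one with a placeholder version.
        out.insert(0, "@HD\tVN:1.0\tSO:" + order)
    return "\n".join(out) + "\n"


old_header = "@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n"
print(set_sort_order(old_header), end="")
```

This changes only the label. It is safe only when the records really are in coordinate order, as they should be after a samtools sort.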

**nilshomer** · 06-12-2010, 09:35 PM

You can also add the "AS=true" option to assume that the input is sorted.

**bosTau2** · 06-23-2010, 05:38 AM

Thanks. I got exactly the same problem...

**mmuratet** · 07-20-2010, 10:54 AM

Definition of 'coordinate sorted'?

Greetings
I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
Thanks
Mike

**Lee Sam** · 07-20-2010, 12:17 PM

The simple solution is to use samtools to sort the file first. I've been using the Picard tools MergeSamFiles.jar to both merge and sort because I typically have multiple lanes of data for each sample.

Mike, I don't think it will work without being aligned because I believe that Picard works by looking at the mappings.

**kmcarr** · 07-20-2010, 01:41 PM

Originally posted by mmuratet:

Greetings
I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
Thanks
Mike

Coordinate sorted means sorted by their genomic alignment coordinates. Picard identifies duplicates as those reads mapping to the identical coordinates on the genome; obviously this task is made immensely easier if the alignments are already sorted.
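A simplified Python sketch of why sorting helps: once records are in coordinate order, reads at the same position sit next to each other, so a single pass can group them. This is a deliberately reduced model and not Picard's actual algorithm, which as far as I know considers more than a single start position. The tuple layout and the "keep the highest quality score" rule are assumptions for illustration.

```python
from itertools import groupby


def mark_duplicates(sorted_reads):
    """Flag duplicates among reads already sorted by (chrom, pos).

    Each read is a tuple (name, chrom, pos, strand, quality_sum).
    Returns the set of read names flagged as duplicates; within each
    group sharing chrom, pos and strand, the best-scoring read is kept.
    """
    duplicates = set()
    # groupby only merges adjacent items, which is exactly why the
    # input has to be coordinate sorted.
    for _, group in groupby(sorted_reads, key=lambda r: (r[1], r[2])):
        by_strand = {}
        for read in group:
            by_strand.setdefault(read[3], []).append(read)
        for reads in by_strand.values():
            best = max(reads, key=lambda r: r[4])
            duplicates.update(r[0] for r in reads if r is not best)
    return duplicates


reads = [
    ("r1", "chr1", 100, "+", 30),
    ("r2", "chr1", 100, "+", 35),
    ("r3", "chr1", 100, "-", 20),
    ("r4", "chr1", 250, "+", 40),
]
print(mark_duplicates(reads))
```

Here r1 is flagged because r2 maps to the same position and strand with a higher score; r3 is on the other strand and r4 is at a different position, so both are kept.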

Yes, you could find duplicates without reference to a genome. You would have to perform an all vs. all search, which would require a huge amount of time and RAM when you are talking about tens or hundreds of millions of reads.
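To put a number on the all vs. all cost, a naive search compares every read with every other read, which is n(n-1)/2 comparisons for n reads. The short Python sketch below just evaluates that formula; the function name is made up for illustration, and cleverer schemes than a naive pairwise search are not considered here.

```python
def pairwise_comparisons(n_reads):
    """Number of unordered read pairs in a naive all vs. all comparison."""
    return n_reads * (n_reads - 1) // 2


for n in (10_000_000, 100_000_000):
    print(f"{n:,} reads -> {pairwise_comparisons(n):,} comparisons")
```

Ten times more reads means roughly a hundred times more comparisons, which is why working from sorted alignments is so much cheaper.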

**thomasvangurp** · 08-12-2010, 04:32 AM

I would like to use Picard duplicate removal as well. However, I ran into some trouble using a SAM file output by CLC-Bio Genomics Workbench. Does anyone have an idea how to fix this issue?

Code:

root@thomasg-desktop:/home/thomasg/Downloads/\tmp/picard-tools-1.27# java -jar MergeSamFiles.jar I=/home/thomasg/RF_7.fastq\ trimmed\ \(paired\)\ mapping\ \(11205\ references\).sam SO=coordinate AS=false O=/home/thomasg/out.sam
[Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles OUTPUT=/home/thomasg/out.sam SORT_ORDER=coordinate ASSUME_SORTED=false    MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false TMP_DIR=/tmp/root VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
INFO	2010-08-12 14:30:53	MergeSamFiles	Sorting input files using temp directory /tmp/root
[Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles done.
Runtime.totalMemory()=379322368
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Paired read should be marked as first of pair or second of pair.; File /home/thomasg/RF_7.fastq trimmed (paired) mapping (11205 references).sam; Line 11208
Line: RF_43280	25	Contig_1	1	60	50M	*	0	0	ACAGCGACTCAACCAAAGGAATCCTATATAGAAATGCTATTAGGAATCCC	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NH:i:1
	at net.sf.samtools.SAMTextReader.reportErrorParsingLine(SAMTextReader.java:220)
	at net.sf.samtools.SAMTextReader.access$500(SAMTextReader.java:40)
	at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:424)
	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
	at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:71)
	at net.sf.picard.util.PeekableIterator.<init>(PeekableIterator.java:41)
	at net.sf.picard.sam.ComparableSamRecordIterator.<init>(ComparableSamRecordIterator.java:51)
	at net.sf.picard.sam.MergingSamRecordIterator.addIterator(MergingSamRecordIterator.java:93)
	at net.sf.picard.sam.MergingSamRecordIterator.startIterationIfRequired(MergingSamRecordIterator.java:102)
	at net.sf.picard.sam.MergingSamRecordIterator.hasNext(MergingSamRecordIterator.java:117)
	at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:190)
	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
	at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:83)

**mmuratet** · 08-12-2010, 06:20 AM

Picard duplicate removal problem

I had a similar problem with sam files derived from Illumina output. The problem was the mate IDs that Illumina uses, i.e., index:pairN:filterFlag. I believe the tools expect pair IDs in the form /1 and /2. Check the output from the workbench to see how they identify pairs.
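It may also be worth looking at the FLAG column of the record quoted in the error message, which is 25. Going by the flag bits defined in the SAM specification, the Python sketch below decodes that value. The helper names are made up for illustration, and the check is only a guess at what the parser's validation amounts to.

```python
# Flag bits as defined in the SAM specification (subset relevant here).
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """Return the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def has_pair_position(flag):
    """True unless the read is marked paired but neither first nor second."""
    if not flag & 0x1:
        return True
    return bool(flag & (0x40 | 0x80))


print(decode_flag(25))
print(has_pair_position(25))
```

A FLAG of 25 decodes to paired, mate unmapped and reverse strand, with neither the first-in-pair nor the second-in-pair bit set, which matches the wording of the exception.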

**scientifica** · 12-22-2010, 05:47 AM

Dear all,

For my sequencing project I would also like to remove duplicates. Did any of you already work with the CLC Assembly Cell to remove them?
I have no idea where to start.

**shanlan.mo** · 01-26-2015, 11:56 PM

Originally posted by cliff:

It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

Where did I go wrong?

The BAM is sorted with Picard tools, like this:

java -jar $softwave/SamFormatConverter.jar I=$I/HFHm001_1_Tri.fastq_bismark_bt2_pe.sam o=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam
java -jar $softwave/SortSam.jar I=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam O=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.sorted.bam sort_order=coordinate

How to use Picard's MarkDuplicates