  • How to use Picard's MarkDuplicates

    I just tried Picard to remove PCR duplicates, using test_sorted.bam (obtained with samtools sort) as the input file. The following command

    java -jar MarkDuplicates.jar test_sorted.bam test_rmdup.bam

    gave me an error

    ERROR: Invalid argument 'test_sorted.bam'.

    Does anybody know where I went wrong?

    Thanks for all your help in advance.

  • #2
    Try it without any arguments to see how to specify input and output files. The command is different from samtools.

    • #3
      I tried again

      java -Xmx2g -jar ~/picard-tools-1.21/MarkDuplicates.jar INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true

      And I got this error:

      [Sat Jun 12 22:11:22 EDT 2010] net.sf.picard.sam.MarkDuplicates INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 TMP_DIR=/tmp/cliff VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
      INFO 2010-06-12 22:11:22 MarkDuplicates Start of doWork freeMemory: 31062256; totalMemory: 31588352; maxMemory: 1908932608
      INFO 2010-06-12 22:11:22 MarkDuplicates Reading input file and constructing read end information.
      INFO 2010-06-12 22:11:22 MarkDuplicates Will retain up to 7575129 data points before spilling to disk.
      [Sat Jun 12 22:11:23 EDT 2010] net.sf.picard.sam.MarkDuplicates done.
      Runtime.totalMemory()=152829952
      Exception in thread "main" net.sf.picard.PicardException: test_sorted.bam is not coordinate sorted.
      at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:250)
      at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:112)
      at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
      at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:96)


      It said "test_sorted.bam is not coordinate sorted.", but I actually produced test_sorted.bam with "samtools sort".

      Where did I go wrong?

      • #4
        Originally posted by cliff
        It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

        where did I do wrong?..
        Nowhere; this is samtools' fault. The SAM specification defines a sort order (SO) tag on the header (HD) line. The three permissible values for this tag are "unsorted"; "coordinate", meaning the entries are sorted by chromosome and start position; and "queryname", meaning the file is sorted by read ID. When you sort a file with samtools, it does not update the SO tag to reflect the fact that the file has been sorted. According to the author of samtools, the SAM specification does not require this, so it is not a bug (see this thread). Perhaps not, but it's damned annoying.

        You can view the header information for your bam file with the command
        Code:
        samtools view -H test_sorted.bam
        Picard reads the SO tag to determine whether or not the file is sorted. This is obviously much easier and more efficient than actually checking every line of the file to determine whether or not it has been sorted.

        Before you can use Picard to remove duplicates you will have to fix the SO tag. Fortunately, Picard has a command to do this, ReplaceSamHeader. Alternatively, you could use Picard's SortSam instead of samtools sort. (For the record, I don't know for sure whether Picard's SortSam properly updates the SO tag.)
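The header fix described above can be sketched in shell, under some assumptions. `fix_so_header` is a hypothetical helper (not part of samtools or Picard) that reads a SAM header on standard input and sets the SO field on the @HD line to "coordinate"; if there is no @HD line it adds one with an assumed `VN:1.0`. The commented usage lines assume a samtools build that has the `reheader` subcommand, and `header.sam` and `test_sorted.fixed.bam` are placeholder names.

```shell
# Hypothetical helper: rewrite (or add) SO:coordinate on the @HD header line.
# Reads a SAM header on stdin, writes the corrected header on stdout.
fix_so_header() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^@HD/ {
      found = 0
      # Replace an existing SO field if there is one.
      for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) { $i = "SO:coordinate"; found = 1 }
      # Otherwise append the field to the @HD line.
      if (!found) $0 = $0 OFS "SO:coordinate"
      print; next
    }
    # No @HD line at the top: add one (VN:1.0 is an assumption).
    NR == 1 { print "@HD", "VN:1.0", "SO:coordinate" }
    { print }'
}

# Possible usage (requires samtools; file names are placeholders):
#   samtools view -H test_sorted.bam | fix_so_header > header.sam
#   samtools reheader header.sam test_sorted.bam > test_sorted.fixed.bam
```

This only relabels the file. It is appropriate only when the records really are in coordinate order, as they should be after samtools sort.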

        • #5
          You can also add the "AS=true" option to assume that the input is sorted.
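Since telling Picard to assume the input is sorted skips its header check, it may be worth confirming the order first. The following is a minimal sketch, not Picard's own validation: `is_coordinate_sorted` is a hypothetical helper that reads SAM text (header plus records) on standard input and succeeds only if records follow the @SQ order and then ascending POS. It assumes every reference name appears in an @SQ line and treats reads with RNAME `*` as sorting last.

```shell
# Hypothetical helper: exit 0 if SAM records on stdin are coordinate sorted, 1 otherwise.
is_coordinate_sorted() {
  awk -F'\t' 'BEGIN { bad = 0 }
    # Remember the order in which references are declared in the header.
    /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n; next }
    /^@/   { next }
    {
      r = ($3 == "*") ? n + 1 : rank[$3]   # reads without a reference sort last
      if (r < prev_r || (r == prev_r && $4 + 0 < prev_pos)) { bad = 1; exit }
      prev_r = r; prev_pos = $4 + 0
    }
    END { exit bad }'
}

# Possible usage (requires samtools and Picard; options as in the command from post #3):
#   samtools view -h test_sorted.bam | is_coordinate_sorted &&
#     java -Xmx2g -jar MarkDuplicates.jar INPUT=test_sorted.bam OUTPUT=test_rmdup.bam \
#       METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true ASSUME_SORTED=true
```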

          • #6
            Thanks. I got exactly the same problem...

            • #7
              Definition of 'coordinate sorted'?

              Greetings
              I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When MarkDuplicates says it wants 'coordinate sorted' data, is it referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
              Thanks
              Mike

              • #8
                The simple solution is to use samtools to sort the file first. I've been using the Picard tool MergeSamFiles.jar to both merge and sort, because I typically have multiple lanes of data for each sample.

                Mike, I don't think it will work without being aligned because I believe that Picard works by looking at the mappings.
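As a rough illustration of the merge-and-sort approach, here is a dry-run sketch that only prints a MergeSamFiles command line for several per-lane files. `merge_cmd` is a hypothetical helper, the file names are placeholders, and the `I=`, `O=` and `SO=` option spellings are the ones that appear elsewhere in this thread. Paths containing spaces would need extra quoting.

```shell
# Hypothetical helper: print (not run) a MergeSamFiles command.
# $1 = output file, remaining arguments = input files, one I= per file.
merge_cmd() {
  out=$1
  shift
  printf 'java -jar MergeSamFiles.jar'
  for bam in "$@"; do
    printf ' I=%s' "$bam"
  done
  printf ' O=%s SO=coordinate\n' "$out"
}

# Example: build the command for two lanes of one sample.
merge_cmd merged.bam lane1.bam lane2.bam
```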

                • #9
                  Originally posted by mmuratet
                  Greetings
                  I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
                  Thanks
                  Mike
                  Coordinate sorted means the reads are sorted by their genomic alignment coordinates. Picard identifies duplicates as reads that map to identical coordinates on the genome; obviously this task is made immensely easier if the alignments are already sorted.

                  Yes, you could find duplicates without reference to a genome, but you would have to perform an all vs. all search, which would require a huge amount of time and RAM when you are talking about tens or hundreds of millions of reads.
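To see why sorted input helps, here is a deliberately simplified sketch and not Picard's actual algorithm (which, among other things, also considers mate positions and read orientation). `count_position_duplicates` is a hypothetical helper that reads coordinate-sorted SAM records on standard input and counts mapped reads that share RNAME, POS and strand with an earlier read. Because the input is sorted, it only has to remember the reads at the current position.

```shell
# Hypothetical helper: count reads that repeat an earlier (RNAME, POS, strand).
# Assumes coordinate-sorted SAM text on stdin; header lines are ignored.
count_position_duplicates() {
  awk -F'\t' '
    /^@/ { next }
    int($2 / 4) % 2 { next }                 # skip unmapped reads (FLAG bit 4)
    {
      pos = $3 ":" $4
      # New position: forget what was seen at the previous one.
      if (pos != cur) { split("", seen); cur = pos }
      strand = (int($2 / 16) % 2) ? "-" : "+"   # FLAG bit 16 = reverse strand
      if (seen[strand]++) dups++
    }
    END { print dups + 0 }'
}

# Possible usage (requires samtools):
#   samtools view test_sorted.bam | count_position_duplicates
```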

                  • #10
                    I would also like to use Picard duplicate removal. However, I ran into some trouble using a SAM file output by CLC bio Genomics Workbench. Does anyone have an idea how to fix this issue?

                    Code:
                    root@thomasg-desktop:/home/thomasg/Downloads/\tmp/picard-tools-1.27# java -jar MergeSamFiles.jar I=/home/thomasg/RF_7.fastq\ trimmed\ \(paired\)\ mapping\ \(11205\ references\).sam SO=coordinate AS=false O=/home/thomasg/out.sam
                    [Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles OUTPUT=/home/thomasg/out.sam SORT_ORDER=coordinate ASSUME_SORTED=false    MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false TMP_DIR=/tmp/root VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
                    INFO	2010-08-12 14:30:53	MergeSamFiles	Sorting input files using temp directory /tmp/root
                    [Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles done.
                    Runtime.totalMemory()=379322368
                    Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Paired read should be marked as first of pair or second of pair.; File /home/thomasg/RF_7.fastq trimmed (paired) mapping (11205 references).sam; Line 11208
                    Line: RF_43280	25	Contig_1	1	60	50M	*	0	0	ACAGCGACTCAACCAAAGGAATCCTATATAGAAATGCTATTAGGAATCCC	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NH:i:1
                    	at net.sf.samtools.SAMTextReader.reportErrorParsingLine(SAMTextReader.java:220)
                    	at net.sf.samtools.SAMTextReader.access$500(SAMTextReader.java:40)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:424)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
                    	at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:71)
                    	at net.sf.picard.util.PeekableIterator.<init>(PeekableIterator.java:41)
                    	at net.sf.picard.sam.ComparableSamRecordIterator.<init>(ComparableSamRecordIterator.java:51)
                    	at net.sf.picard.sam.MergingSamRecordIterator.addIterator(MergingSamRecordIterator.java:93)
                    	at net.sf.picard.sam.MergingSamRecordIterator.startIterationIfRequired(MergingSamRecordIterator.java:102)
                    	at net.sf.picard.sam.MergingSamRecordIterator.hasNext(MergingSamRecordIterator.java:117)
                    	at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:190)
                    	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
                    	at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:83)
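A note on reading the error above: the record it quotes has FLAG 25, which is 1 + 8 + 16 (read paired, mate unmapped, read on the reverse strand). Neither 64 (first of pair) nor 128 (second of pair) is set, which matches the complaint in the exception. The check can be sketched as below; `pair_flags_consistent` is a hypothetical helper, and it covers only this one rule rather than everything the parser validates.

```shell
# Hypothetical helper: succeed unless a FLAG marks the read as paired (bit 1)
# without marking it as first (bit 64) or second (bit 128) of the pair.
pair_flags_consistent() {
  flag=$1
  paired=$(( flag & 1 ))
  end_bits=$(( flag & 192 ))    # 192 = 64 + 128
  if [ "$paired" -ne 0 ] && [ "$end_bits" -eq 0 ]; then
    return 1
  fi
  return 0
}

# The FLAG from the failing line in the log above:
if pair_flags_consistent 25; then
  echo "flag 25 is consistent"
else
  echo "flag 25 is paired but has neither the first nor the second of pair bit"
fi
```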

                    • #11
                      Picard duplicate removal problem

                      I had a similar problem with SAM files derived from Illumina output. The problem was the mate IDs that Illumina uses, i.e., index:pairN:filterFlag. I believe the tools expect pair IDs in the form /1 and /2. Check the output from the workbench to see how it identifies pairs.

                      • #12
                        Dear all,

                        For my sequencing project I would also like to remove duplicates. Did any of you already work with the CLC Assembly Cell to remove them?
                        I have no idea where to start.
                        Time is a great teacher. Unfortunately, it kills all its pupils.

                        • #13
                          Originally posted by cliff
                          It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

                          where did I do wrong?..
                          The BAM can instead be sorted with Picard tools, for example:
                          java -jar $softwave/SamFormatConverter.jar I=$I/HFHm001_1_Tri.fastq_bismark_bt2_pe.sam O=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam
                          java -jar $softwave/SortSam.jar I=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam O=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.sorted.bam SORT_ORDER=coordinate
