  • elgor
    Junior Member
    • May 2011
    • 8

    Picard's MarkDuplicates -> OutOfMemoryError

    Hi folks,

    Here is my first question for you. I'm trying to remove duplicates from a large sorted, merged BAM file (~270 GB) with Picard's MarkDuplicates, but I keep running into OutOfMemoryErrors. I'm fairly new to real-world sequencing and would appreciate any help you can give me.

    This is the command I'm using:

    Code:
    /usr/lib/jvm/java-1.6.0-ibm-1.6.0.8.x86_64/jre/bin/java -jar -Xmx40g /illumina/tools/picard-tools-1.45/MarkDuplicates.jar 
    INPUT=BL14_sorted_merged.bam 
    OUTPUT=BL14_sorted_merged_deduped.bam 
    METRICS_FILE=metrics.txt 
    REMOVE_DUPLICATES=true 
    ASSUME_SORTED=true 
    VALIDATION_STRINGENCY=LENIENT 
    TMP_DIR=/illumina/runs/temp/
    The error message usually looks like this after about 8 hours of running:

    Code:
    Exception in thread "main" java.lang.OutOfMemoryError
    at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
    at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:443)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:115)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:158)
    at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:97)
    The machine I'm running it on has 48275 MB RAM and 2000 MB Swap.

    Please tell me if you need more info, if I'm doing something completely wrong, or if the amount of memory just isn't enough to get a result. Thanks in advance.
  • elgor
    Junior Member
    • May 2011
    • 8

    #2
    It seems I've finally found a working set of arguments! After more than 14 hours it's still running. Fingers crossed that it keeps going and eventually finishes successfully.


    • oiiio
      Senior Member
      • Jan 2011
      • 105

      #3
      Do you mind posting your working set of arguments? I'm in a very similar situation with this error.


      • elgor
        Junior Member
        • May 2011
        • 8

        #4
        Sorry, I forgot to post my solution here. It has already solved a similar problem for someone on the samtools/picard mailing list:

        [Samtools-help] Picard MarkDuplicates memory error on very large file


        In short: a smaller heap makes Picard more stable. -Xmx4g seems optimal.


        • dGho
          Member
          • Jan 2013
          • 43

          #5
          Hi, I am having a lot of trouble with MarkDuplicates on some of my BAM files. It throws the same error as shown earlier in this thread:

          Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
          I have tried the following with no success:
          1. -Xmx2g (the most my cluster allows me, for some reason): the program ran longer but still threw the same error.
          2. MAX_RECORDS_IN_RAM=5000000: this gave me a different error (below).

          Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
          at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
          at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
          at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
          at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1601)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
          at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
          at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
          at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:386)
          at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
          at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
          at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
          I don't really know where to go from here. Has anyone else had the above error and been able to solve it?
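
The two messages in this post point at the same underlying condition. "Java heap space" means an allocation did not fit in the heap, while "GC overhead limit exceeded" is HotSpot reporting that it spends almost all of its time in garbage collection and recovers very little memory. In both cases the heap is too small for the workload. A minimal sketch for telling them apart in a saved log follows; `classify_oom` is a hypothetical helper name, not part of Picard or the JDK.

```shell
#!/bin/sh
# Classify a Picard/JVM log by the kind of OutOfMemoryError it contains.
# classify_oom is a hypothetical helper; it reads the log text from stdin.
classify_oom() {
    log=$(cat)
    case "$log" in
        *"OutOfMemoryError: GC overhead limit exceeded"*)
            echo "gc-overhead" ;;   # JVM is thrashing in garbage collection
        *"OutOfMemoryError: Java heap space"*)
            echo "heap-space" ;;    # an allocation did not fit in the heap
        *"OutOfMemoryError"*)
            echo "oom-other" ;;     # OOM without a recognised detail message
        *)
            echo "no-oom" ;;
    esac
}

# Usage, assuming the run's stderr was saved to markdup.log:
#   classify_oom < markdup.log
```

Either of the first two results suggests trying a larger `-Xmx` value if the machine and the JVM allow it.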


          • dGho
            Member
            • Jan 2013
            • 43

            #6
            And to add insult to injury, I have attempted to add -XX:-UseGCOverheadLimit to my command, which now looks like this:

            java -Xmx2g -XX:-UseGCOverheadLimit -jar /usr/local/picard/1.84/MarkDuplicates.jar INPUT="$f1"a1.clean.bam OUTPUT="$f1"a1.ddup.bam METRICS_FILE="$f1"a1.ddup.metrics REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/scratch/apaciork_group/tmp TMP_DIR=/scratch/dghoneim/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=5000000
            And now I am getting the original error again:
            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.ArrayList.<init>(Unknown Source)
            at java.util.ArrayList.<init>(Unknown Source)
            at net.sf.samtools.SAMRecord.getAlignmentBlocks(SAMRecord.java:1370)
            at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1413)
            at net.sf.samtools.BAMRecord.getCigar(BAMRecord.java:247)
            at net.sf.samtools.SAMRecord.getUnclippedStart(SAMRecord.java:472)
            at net.sf.picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:463)
            at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:402)
            at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
            at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
            Hmm, I am going in circles. Does anyone have a clue what is going on?


            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              #7
              Are you sure you are running 64-bit Java? I wonder if that is why you can only allocate 2 GB of heap. Both 32-bit and 64-bit Java may be installed on your cluster.

              Can you post the output of

              Code:
              $ java -version


              • dGho
                Member
                • Jan 2013
                • 43

                #8
                Thank you so much, Geno.
                I am using Java 7:


                java version "1.7.0_11"
                Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
                Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)


                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9

                  Can you check the following to see if you get an error?

                  Code:
                  $ java -d64 -version


                  • dGho
                    Member
                    • Jan 2013
                    • 43

                    #10
                    Error: This Java instance does not support a 64-bit JVM.

                    is what I get, so I guess I am running 32-bit Java. Could this be my problem?
                    I am a little confused because I never had trouble running MarkDuplicates on my older BAM files, only on our most recent ones.


                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      You are running 32-bit Java. That explains why you have not been able to allocate more heap memory.

                      Can you look around to see if there is a 64-bit version of Java available on your cluster?

                      Are these BAM files larger than the previous ones?
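
The 32-bit versus 64-bit question can also be answered from the `java -version` text alone, which helps on newer JVMs that may no longer accept `-d64`. The sketch below assumes that 64-bit HotSpot and OpenJDK builds print "64-Bit" in the VM line; other vendors may format this differently, so a miss is reported as "32-bit or unknown" rather than as a definite answer. `jvm_bitness` is a hypothetical helper name.

```shell
#!/bin/sh
# Guess JVM bitness from the text printed by `java -version`.
# jvm_bitness is a hypothetical helper; it reads that text from stdin.
# Assumption: 64-bit HotSpot/OpenJDK builds include "64-Bit" in the VM line.
jvm_bitness() {
    if grep -q '64-Bit'; then
        echo "64-bit"
    else
        echo "32-bit or unknown"
    fi
}

# Usage (java -version writes to stderr, hence 2>&1):
#   java -version 2>&1 | jvm_bitness
```

Fed the output from post #8, where the VM line reads "Java HotSpot(TM) Server VM" with no "64-Bit" marker, this prints "32-bit or unknown", which matches the `-d64` result reported in post #10.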


                      • dGho
                        Member
                        • Jan 2013
                        • 43

                        #12
                        Yes, these BAM files are slightly larger. I will see if I can use 64-bit Java on our cluster. Thank you so much, Geno, for your suggestion!


                        • dGho
                          Member
                          • Jan 2013
                          • 43

                          #13
                          So, I tried 64-bit Java with the -Xmx4g option. This allowed MarkDuplicates to run longer (72 min), but then it ran out of memory again. Any thoughts?


                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Are you sure the process ran out of RAM, or did it run out of temp space on disk? How big is the BAM file?
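
One way to rule out the temp-space explanation is to compare the free space under TMP_DIR with the size of the input BAM before starting a long run. This is a rough sketch only: `free_kb` and `check_tmp_space` are hypothetical helpers, and using "at least the size of the BAM" as the threshold is an arbitrary assumption, not a documented Picard requirement.

```shell
#!/bin/sh
# Report free space, in KiB, on the filesystem that holds a directory.
# free_kb is a hypothetical helper; df -P keeps the POSIX one-line layout.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Compare free temp space with the size of the input BAM file (both in KiB).
# Prints "ok" when free space is at least the BAM size, "low" otherwise.
# The threshold is an assumption chosen for illustration.
check_tmp_space() {
    tmp_dir=$1
    bam=$2
    need=$(du -k "$bam" | awk '{ print $1 }')
    have=$(free_kb "$tmp_dir")
    if [ "$have" -ge "$need" ]; then
        echo "ok"
    else
        echo "low"
    fi
}

# Usage, with placeholder paths:
#   check_tmp_space /scratch/tmp sample.bam
```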


                            • dGho
                              Member
                              • Jan 2013
                              • 43

                              #15
                              Thank you, Geno. I guess it was just a problem with RAM. I am working on a cluster that has "unlimited" disk space, so I was pretty sure that was not the problem. I wanted to post the solution that worked for me. -Xmx4g was not enough for the data set I am working on, although 2 GB had been enough for all the past exomes.

                              In my case the solution was:
                              I used -Xmx8g and that ran fine, so I guess -Xmx4g is not always optimal. Thank you, Geno, for all your help. 64-bit Java was definitely the way to go.
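
Since the right heap size evidently depends on the data set, it can help to keep `-Xmx` as a parameter instead of hard-coding it. The sketch below only prints a command line so it can be inspected before running. `markdup_cmd` is a hypothetical helper, the jar path and file names are placeholders, and every Picard option it emits appears in commands quoted earlier in this thread.

```shell
#!/bin/sh
# Print (not run) a MarkDuplicates command line with a configurable heap size.
# markdup_cmd is a hypothetical helper; all paths passed in are placeholders.
markdup_cmd() {
    heap=$1       # heap size for -Xmx, e.g. 4g or 8g
    jar=$2        # path to MarkDuplicates.jar
    in_bam=$3     # sorted input BAM
    out_bam=$4    # output BAM
    tmp_dir=$5    # directory for Picard's temporary files
    # JVM options go before -jar; Picard options follow the jar path.
    echo "java -Xmx${heap} -jar ${jar}" \
        "INPUT=${in_bam} OUTPUT=${out_bam}" \
        "METRICS_FILE=${out_bam}.metrics" \
        "REMOVE_DUPLICATES=false ASSUME_SORTED=true" \
        "VALIDATION_STRINGENCY=LENIENT TMP_DIR=${tmp_dir}"
}

# Usage: print the command for an 8 GB heap, review it, then run it by hand.
#   markdup_cmd 8g /usr/local/picard/1.84/MarkDuplicates.jar in.bam out.bam /scratch/tmp
```

Going by the reports in this thread, starting at 4g on a 64-bit JVM and raising the value only if a heap error comes back seems a reasonable order to try.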
