  • Picard: EstimateLibraryComplexity -> OutOfMemoryError

    I want to run EstimateLibraryComplexity.jar on a 9.8 GB BAM file, but I always get an OutOfMemoryError. I have already tried increasing -Xmx (up to 60 GB) and still get the error. Does anybody have an idea of how to run EstimateLibraryComplexity on larger BAM files?

    Here are my command and the error message:

    Code:
    $ java -Xmx10g -jar EstimateLibraryComplexity.jar INPUT=file.bam OUTPUT=file.libraryComplexity
    
    [Wed Jun 04 21:43:08 CEST 2014] picard.sam.EstimateLibraryComplexity INPUT=[file.bam] OUTPUT=file.libraryComplexity MIN_IDENTICAL_BASES=5 MAX_DIFF_RATE=0.03 MIN_MEAN_QUALITY=20 MAX_GROUP_RATIO=500 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Wed Jun 04 21:43:08 CEST 2014] Executing as me@work on Linux 3.6.2-1.fc16.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_07-b10; Picard version: 1.114(444810c1de1433d9eca8130be63ccc7fd70a9499_1400593393) JdkDeflater
    INFO    2014-06-04 21:43:08     EstimateLibraryComplexity       Will store 15494157 read pairs in memory before sorting.
    INFO    2014-06-04 21:43:13     EstimateLibraryComplexity       Read 1,000,000 records.  Elapsed time: 00:00:05s.  Time for last 1,000,000:    5s.  Last read position: chr10:38,239,480
    
    ....
    
    INFO    2014-06-04 21:53:21     EstimateLibraryComplexity       Read 30,000,000 records.  Elapsed time: 00:10:13s.  Time for last 1,000,000:  183s.  Last read position: chr15:34,522,127
    
    [Wed Jun 04 22:54:26 CEST 2014] picard.sam.EstimateLibraryComplexity done. Elapsed time: 71.30 minutes.
    Runtime.totalMemory()=5801312256
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:2694)
            at java.lang.String.<init>(String.java:203)
            at java.lang.String.substring(String.java:1913)
            at htsjdk.samtools.util.StringUtil.split(StringUtil.java:89)
            at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:71)
            at picard.sam.EstimateLibraryComplexity.doWork(EstimateLibraryComplexity.java:256)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
            at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:124)
            at picard.sam.EstimateLibraryComplexity.main(EstimateLibraryComplexity.java:217)

    And here is the Java version:

    Code:
    $ java -showversion
    java version "1.7.0_07"
    Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
    I also posted this question at Biostars!

  • #2
    Try to use -Xms as well. If that does not help, you need a bigger machine.
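
    For example (a sketch; the heap size is a placeholder, adjust it to your machine — note that -Xms only pre-allocates the heap, it does not raise the limit set by -Xmx):

    Code:
    java -Xms60g -Xmx60g -jar EstimateLibraryComplexity.jar INPUT=file.bam OUTPUT=file.libraryComplexity

    Setting -Xms equal to -Xmx makes the JVM claim the whole heap at startup, so you find out immediately whether the machine can actually provide that much memory, instead of crashing an hour into the run.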



    • #3
      Well, of course I could give it more memory (up to 2 TB). But since I doubt that this huge memory consumption is intended, I assume it is either a user error or a bug.



      • #4
        A sorted, compressed BAM file can have an extremely high compression ratio, and Java is not very memory-efficient, particularly when working with Strings, which appear to be used here. So I would just run it with all available memory (setting the -Xmx flag to around 85% of physical RAM). But in your case it looks like the program may have actually completed:
        [Wed Jun 04 22:54:26 CEST 2014] picard.sam.EstimateLibraryComplexity done.
        Elapsed time: 71.30 minutes
        ...and then crashed, possibly while generating some kind of output, which sounds like a bug in the program.

        If it still does not work when you give it more RAM, I have a program that will estimate library complexity that you could try, invoked by the shellscript "bbcountunique.sh", available at my BBMap website.

        Code:
        bbcountunique.sh -Xmx100g in=reads.fq out=results.txt

        It's very memory-efficient, as it does not store Strings, just numeric kmers. And it does not use mapping information, just the raw sequence. So it's designed for fastq or fasta input, but it still works on sam input and should work on a bam file if samtools is installed.

        The output is a histogram of the percentage of reads that are unique, reported every 25,000 reads (you can adjust that number with the 'interval' flag). Uniqueness is determined by whether a read's kmers have been seen before, using the read's first kmer and a random kmer; k defaults to 20. So you can plot the histogram to observe the library's complexity; we run this on all of our data.

        For paired data, it's best to use it with fastq or fasta, though, because then you also get information about unique pairs rather than just unique reads.
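
        The per-interval uniqueness tracking described above can be sketched in a few lines of Python (a toy illustration of the idea, not BBTools code; for simplicity it tracks only each read's first k-mer, whereas bbcountunique.sh also samples a random k-mer per read):

```python
def unique_histogram(reads, k=20, interval=25000):
    """Percentage of 'unique' reads per interval: a read counts as
    unique if its first k-mer has not been seen in any earlier read."""
    seen = set()   # all first k-mers observed so far
    hist = []      # one percentage per completed interval
    unique = 0     # unique reads in the current interval
    for i, seq in enumerate(reads, 1):
        kmer = seq[:k]
        if kmer not in seen:
            seen.add(kmer)
            unique += 1
        if i % interval == 0:
            hist.append(100.0 * unique / interval)
            unique = 0
    return hist

# Two intervals of two reads each; the second read of each pair is a
# duplicate, so half of the reads in every interval are unique.
print(unique_histogram(["A" * 25, "A" * 25, "C" * 25, "C" * 25],
                       k=20, interval=2))  # prints [50.0, 50.0]
```

        A fresh (complex) library stays near 100% for a long time before drifting down; a low-complexity library drops quickly, which is what you would look for in the plotted histogram.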
