  • Picard: EstimateLibraryComplexity -> OutOfMemoryError

    I want to run EstimateLibraryComplexity.jar on a 9.8 GB BAM file, but I always get an OutOfMemoryError. I have already tried increasing -Xmx (up to 60 GB) and still get the error. Does anybody have an idea how to run EstimateLibraryComplexity on larger BAM files?

    Here is my command and the error message:

    Code:
    $ java -Xmx10g -jar EstimateLibraryComplexity.jar INPUT=file.bam OUTPUT=file.libraryComplexity
    
    [Wed Jun 04 21:43:08 CEST 2014] picard.sam.EstimateLibraryComplexity INPUT=[file.bam] OUTPUT=file.libraryComplexity MIN_IDENTICAL_BASES=5 MAX_DIFF_RATE=0.03 MIN_MEAN_QUALITY=20 MAX_GROUP_RATIO=500 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Wed Jun 04 21:43:08 CEST 2014] Executing as me@work on Linux 3.6.2-1.fc16.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_07-b10; Picard version: 1.114(444810c1de1433d9eca8130be63ccc7fd70a9499_1400593393) JdkDeflater
    INFO    2014-06-04 21:43:08     EstimateLibraryComplexity       Will store 15494157 read pairs in memory before sorting.
    INFO    2014-06-04 21:43:13     EstimateLibraryComplexity       Read 1,000,000 records.  Elapsed time: 00:00:05s.  Time for last 1,000,000:    5s.  Last read position: chr10:38,239,480
    
    ....
    
    INFO    2014-06-04 21:53:21     EstimateLibraryComplexity       Read 30,000,000 records.  Elapsed time: 00:10:13s.  Time for last 1,000,000:  183s.  Last read position: chr15:34,522,127
    
    [Wed Jun 04 22:54:26 CEST 2014] picard.sam.EstimateLibraryComplexity done. Elapsed time: 71.30 minutes.
    Runtime.totalMemory()=5801312256
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:2694)
            at java.lang.String.<init>(String.java:203)
            at java.lang.String.substring(String.java:1913)
            at htsjdk.samtools.util.StringUtil.split(StringUtil.java:89)
            at picard.sam.AbstractDuplicateFindingAlgorithm.addLocationInformation(AbstractDuplicateFindingAlgorithm.java:71)
            at picard.sam.EstimateLibraryComplexity.doWork(EstimateLibraryComplexity.java:256)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
            at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:124)
            at picard.sam.EstimateLibraryComplexity.main(EstimateLibraryComplexity.java:217)

    And here is the Java version:

    Code:
    $ java -showversion
    java version "1.7.0_07"
    Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
    I also posted this question at Biostars!

  • #2
    Try setting -Xms as well. If that does not help, you need a bigger machine.
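
    For example, combining both flags (the heap sizes here are just illustrative; -Xms sets the initial heap size and -Xmx the maximum):

    Code:
    $ java -Xms10g -Xmx60g -jar EstimateLibraryComplexity.jar INPUT=file.bam OUTPUT=file.libraryComplexity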



    • #3
      Well, of course I could give it more memory (up to 2 TB). But since I don't think the huge memory consumption is a feature, I assume it is either a user error or a bug.



      • #4
        A sorted BAM file can achieve extremely high compression, and Java is not very memory-efficient, particularly when working with Strings, which appear to be used here. So I would just run it with all available memory (setting the -Xmx flag to around 85% of physical RAM). But in your case it looks like the program may have actually completed:
        [Wed Jun 04 22:54:26 CEST 2014] picard.sam.EstimateLibraryComplexity done.
        Elapsed time: 71.30 minutes
        ...and then crashed, possibly while generating some kind of output, which sounds like a bug in the program.

        If it still does not work when you give it more RAM, I have a program that estimates library complexity which you could try; it is invoked by the shell script "bbcountunique.sh", available on my BBMap website.

        bbcountunique.sh -Xmx100g in=reads.fq out=results.txt

        It's very memory-efficient, as it does not store Strings, just numeric kmers. And it does not use mapping information, just the raw sequence. So it's designed for fastq or fasta input, but it still works on sam input and should work on a bam file if samtools is installed.

        The output is a histogram of the percentage of reads that are unique, reported every 25,000 reads (you can adjust that number with the 'interval' flag). Uniqueness is determined by whether kmers have been seen before, using the read's first kmer and a random kmer; k is 20 by default. So you can plot the histogram to observe the library's complexity; we run this on all of our data.

        For paired data, though, it's best to use fastq or fasta input, because then you also get information about unique pairs rather than just unique reads.
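
        For example, to report every 50,000 reads with a kmer length of 25 (values purely illustrative, using the 'interval' and 'k' flags described above):

        Code:
        bbcountunique.sh -Xmx100g in=reads.fq out=results.txt interval=50000 k=25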
