  • how to best remove low-coverage reads

    Hi all,

    I need to remove low-coverage reads from a ChIP-seq experiment before I attempt assembly (I don't have a reference genome to map back to, so I am improvising). It looks like there are several options, but I am not familiar enough with any of them to use them. I have access to iPlant Atmosphere for computing power, so my memory is going to be a bit limited. I am also somewhat of a novice: I have unpacked and run some programs, but I haven't run any scripts of my own. Any suggestions?

  • #2
    The best way to remove low-coverage reads without a reference is to use kmer-counting to estimate read depth. BBNorm will do this: in=reads.fq out=highpass.fq target=99999999 min=5 passes=1

    This will remove all reads that have a depth below 5x. The "target=99999999" ensures that it will not remove high-depth reads unless the depth is above 99999999, which it won't be.
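    To make the idea of a kmer-based high-pass filter concrete, here is a minimal Python sketch. It is not BBNorm's implementation: it counts kmers exactly in a dictionary (BBNorm uses fixed-size probabilistic tables), and it uses the median kmer count of a read as its depth estimate, which is an assumption made for illustration. The function names are hypothetical.

```python
from collections import Counter
from statistics import median

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def kmers(seq, k):
    """Yield canonical kmers of seq, skipping windows with non-ACGT bases."""
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield canonical(window)

def highpass(reads, k=31, min_depth=5):
    """Keep reads whose median kmer count is at least min_depth.

    Illustrative only: exact counting needs far more memory than a
    fixed-size count table would on real data.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        depths = [counts[km] for km in kmers(read, k)]
        if depths and median(depths) >= min_depth:
            kept.append(read)
    return kept
```

    On a toy input, a sequence seen five times survives a min_depth of 5 while a sequence seen once is dropped. For real data sets of hundreds of millions of reads, an exact dictionary like this would not fit in a few GB of RAM, which is why a dedicated tool is the practical choice.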

    BBNorm is fast and does not use very much memory.


    • #3
      Excellent! I just downloaded your program and am unpacking my reads now. I take it that I don't need to specify a kmer length. Also, I have paired end reads, and would like to keep them that way for eventual assembly. Is there a good way to do that or should I just run the output through a re-pairing program afterwards?


      • #4
        Pairing is retained. It takes 2 files (using in1= in2= out1= out2=) if you want, or pairs interleaved in a single file (alternating read 1 then read 2). Also, it reads/writes gzip (but not tar).
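        For readers unfamiliar with the interleaved layout mentioned above (read 1, then read 2, alternating in one file), the following Python sketch shows the record order. It works on lists of FASTQ lines rather than files, assumes plain 4-line FASTQ records, and uses hypothetical function names; it is an illustration of the layout, not part of BBTools.

```python
from itertools import islice, zip_longest

def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record

def interleave(r1_lines, r2_lines):
    """Yield lines alternating one read-1 record, then its read-2 mate."""
    pairs = zip_longest(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 have different numbers of reads")
        yield from rec1
        yield from rec2

def deinterleave(lines):
    """Split interleaved lines back into (read-1 lines, read-2 lines)."""
    r1, r2 = [], []
    for index, record in enumerate(fastq_records(lines)):
        (r1 if index % 2 == 0 else r2).extend(record)
    return r1, r2
```

        Raising an error on unequal read counts is a deliberate choice here: silently dropping an unpaired read would shift every later pair out of sync.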

        The default kmer length is 31; you can change that with e.g. "k=25". If you have short reads, you may want to reduce it, as the depth it targets is kmer depth, and once kmers get close to read length, kmer depth and read depth diverge.
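        The divergence between kmer depth and read depth can be worked out with simple arithmetic: a read of length L contains L - k + 1 kmers, so each base position is covered by fewer kmers than reads as k approaches L. The sketch below assumes error-free reads of uniform length; the function names are made up for this example.

```python
def kmer_depth(read_depth, read_len, k):
    """Expected kmer depth for a given read depth (error-free, uniform reads)."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return read_depth * (read_len - k + 1) / read_len

def read_depth(kmer_depth_value, read_len, k):
    """Inverse of kmer_depth: recover read depth from an observed kmer depth."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return kmer_depth_value * read_len / (read_len - k + 1)

# 160-base reads at 30x read depth, default k=31
print(kmer_depth(30, 160, 31))
# 36-base reads at the same read depth: kmer depth collapses
print(kmer_depth(30, 36, 31))
```

        With 160-base reads the correction is modest (a factor of 130/160), while with 36-base reads and k=31 only 6 of every 36 positions start a kmer, so a depth cutoff chosen in read-depth terms would be far too strict.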
        Last edited by Brian Bushnell; 01-27-2015, 04:18 PM.


        • #5
          Good to know. I'm working my way through the docs. It seems like a nice, fairly intuitive program. Incidentally, I tried to run the program last night, and it finished in seconds, but the output files were empty. My guess is that I don't have sufficient RAM or memory on my small instance. Can you take a guess as to the minimum I need to run this program on ~400 million 160 base fragments?



          • #6
            BBNorm is fast, but not THAT fast - if 400 million reads finished in seconds, then I'm sure it crashed or aborted - could you post the exact command and console output (std error)?

            BBNorm uses a constant amount of memory that is set when it starts, regardless of the input. What changes is accuracy: its estimates of read depth become less accurate as the input volume grows, and more accurate as the amount of available memory grows. The amount of memory needed for optimal accuracy varies based on the error rate and genome size of the data. For a ChIP-seq experiment, if only 5% of the genome is covered at all by reads, then the effective genome size would be very small.

            Generally, I would say that the amount of memory needed for very high accuracy, with typical Illumina data, is around 15 bytes per covered reference base, regardless of the number of reads. In your case, since you only care about low coverage, you can use the flag "bits=4" (which caps the counters at a max of 15, and makes them use less memory) to increase the accuracy.
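            The arithmetic behind these numbers can be sketched in a few lines of Python. The 15 bytes per covered base figure comes from the post above; the relationship between memory, bits per counter, and number of cells is a simple assumption (cells = memory in bits / bits per cell), which appears roughly consistent with the console logs later in this thread (258.06 MB at 32 bits per cell reported as 67.65M cells, taking 1 MB as 2^20 bytes). Function names are hypothetical.

```python
def memory_for_genome(covered_bases, bytes_per_base=15):
    """Rule-of-thumb memory (bytes) for high accuracy, per the post above."""
    return covered_bases * bytes_per_base

def cells_for_memory(mem_bytes, bits_per_cell):
    """Number of counters that fit in a table of mem_bytes (assumed layout)."""
    return mem_bytes * 8 // bits_per_cell

def counter_max(bits_per_cell):
    """Largest value an unsigned counter of this width can hold."""
    return (1 << bits_per_cell) - 1

table_bytes = int(258.06 * 2**20)           # table size taken from the log
print(cells_for_memory(table_bytes, 32))     # cells at 32 bits each
print(cells_for_memory(table_bytes, 4))      # 8x as many cells at 4 bits
print(counter_max(4))                        # 4-bit counters saturate at 15
```

            Since a cutoff of 5 fits comfortably under a cap of 15, narrowing the counters gives up nothing for this use case while giving the table eight times as many cells in the same memory, which is where the accuracy gain comes from.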


            • #7
              Here is the output when I try to generate a histogram using the command:
              ./ in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt

              I decided this might be more informative since I have such histograms that were generated by jellyfish for this dataset (of spurge genomic illumina sequences) so I could compare them.

              I got the histogram file, but it is empty except for the column headers.

              I am running only 1 cpu with 4G of ram.

              The output is below

              horvathdp@vm64-16:~/myvol/Price/PriceSource130506/bbmap$ ./ in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
              java -ea -Xmx679m -Xms679m -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
              Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]

              threads: 1
              k: 31
              deterministic: false
              toss error reads: false
              passes: 1
              bits per cell: 32
              cells: 67.65M
              hashes: 3
              prefilter bits: 2
              prefilter cells: 582.87M
              prefilter hashes: 2
              base min quality: 0
              kmer min prob: 0.0

              target depth: 40
              min depth: 0
              max depth: 40
              min good kmers: 1
              depth percentile: 54.0
              ignore dupe kmers: true
              fix spikes: false
              histogram length: 1048576
              print zero cov: false

              Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -3230147
              at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(
              at java.util.concurrent.atomic.AtomicIntegerArray.get(
              at kmer.KCountArray7MTA.incrementHashedLocal(
              at kmer.KCountArray7MTA.increment(
              at kmer.KmerCount7MTA$CountThread.addRead_Advanced(
              at kmer.KmerCount7MTA$CountThread.count(
              at kmer.KmerCount7MTA$
              Made prefilter: hashes = 2 mem = 138.88 MB cells = 582.51M used = 0.000%
              Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -3230147
              at java.util.concurrent.atomic.AtomicIntegerArray.rawIndex(
              at java.util.concurrent.atomic.AtomicIntegerArray.get(
              at kmer.KCountArray7MTA.readHashed(
              at kmer.KCountArray7MTA.increment(
              at kmer.KmerCount7MTA$CountThread.addRead_Advanced(
              at kmer.KmerCount7MTA$CountThread.count(
              at kmer.KmerCount7MTA$
              Made hash table: hashes = 3 mem = 258.06 MB cells = 67.65M used = 0.000%

              Estimated kmers of depth 1-3: 1
              Estimated kmers of depth 4+ : 0
              Estimated unique kmers: 1

              Table creation time: 1.279 seconds.
              Exception in thread "Thread-10" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
              at jgi.KmerNormalize$
              Caused by: java.lang.ClassNotFoundException: java.util.concurrent.ThreadLocalRandom
              at Method)
              at java.lang.ClassLoader.loadClass(
              at sun.misc.Launcher$AppClassLoader.loadClass(
              at java.lang.ClassLoader.loadClass(
              ... 1 more
              Table read time: 0.185 seconds. 0.00 kb/sec
              Total reads in: 0 NaN% Kept
              Total bases in: 0 NaN% Kept
              Error reads in: 0 NaN%
              Error type 1: 0 NaN%
              Error type 2: 0 NaN%
              Error type 3: 0 NaN%

              Wrote histogram to histogram.txt
              Total kmers counted: 0
              Total unique kmer count: 0
              Includes forward kmers only.
              The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
              The most accurate value is the greater of the two.

              Percent unique: Infinity%
              Depth average: NaN (unique kmers)
              Depth median: 0 (unique kmers)
              Depth standard deviation: NaN (unique kmers)

              Depth average: NaN (all kmers)
              Depth median: 0 (all kmers)
              Depth standard deviation: 0.00 (all kmers)

              Approx. read depth median: NaN

              Total time: 2.074 seconds. 0.00 kb/sec


              • #8
                Oh! I think I understand the problem now. Almost all of the programs in BBTools are compatible with Java 6, but BBNorm and its related scripts require Java 7 or higher. Also, it needs a 64-bit OS and a 64-bit version of Java if you want to be able to use all of your memory. If you install 64-bit Java 7, then you can add the flag "-Xmx3g", which will force the program to use 3 GB of RAM.
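                A quick way to check both requirements is to look at what `java -version` reports. The NoClassDefFoundError for java.util.concurrent.ThreadLocalRandom in the log above fits this diagnosis, since that class was added in Java 7. The Python sketch below parses the version banner; it assumes the usual banner formats (for example `java version "1.6.0_45"` or `openjdk version "11.0.2"`) and that a 64-bit HotSpot/OpenJDK VM mentions "64-Bit" in its banner. The helper names are hypothetical.

```python
import re
import subprocess

def java_major_version(banner):
    """Extract the major Java version from `java -version` banner text."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("could not find a version string")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.6.0_45" means Java 6; newer scheme: "11.0.2" means 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)

def is_64bit(banner):
    """Heuristic: 64-bit HotSpot/OpenJDK VMs say "64-Bit" in the banner."""
    return "64-Bit" in banner

def installed_java_banner():
    """Run `java -version`; the banner is written to stderr, not stdout."""
    result = subprocess.run(["java", "-version"],
                            capture_output=True, text=True)
    return result.stderr

if __name__ == "__main__":
    banner = installed_java_banner()
    print("major version:", java_major_version(banner))
    print("64-bit VM:", is_64bit(banner))
```

                Note that installing a newer Java is not enough if an older one is still first on the PATH, so it is worth checking the banner from the same shell that launches the program.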


                • #9
                  So I loaded java7,

                  made sure it was the default java

                  Reading state information... Done
                  default-jre is already the newest version.
                  default-jre set to manually installed.
                  0 upgraded, 0 newly installed, 0 to remove and 127 not upgraded.

                  and I still get the following:

                  ./ -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
                  java -ea -Xmx3g -Xms3g -cp /home/horvathdp/myvol/Price/PriceSource130506/bbmap/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 -Xmx3g in=spurge_genome_all_trimmed_R1.fastq hist=histogram.txt
                  Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, -Xmx3g, in=spurge_genome_all_trimmed_R1.fastq, hist=histogram.txt]

                  threads: 1
                  k: 31
                  deterministic: false
                  toss error reads: false
                  passes: 1
                  bits per cell: 32
                  cells: 355.27M
                  hashes: 3
                  prefilter bits: 2
                  prefilter cells: 3060.78M
                  prefilter hashes: 2
                  base min quality: 0
                  kmer min prob: 0.0

                  target depth: 40
                  min depth: 0
                  max depth: 40
                  min good kmers: 1
                  depth percentile: 54.0
                  ignore dupe kmers: true
                  fix spikes: false
                  histogram length: 1048576
                  print zero cov: false

                  Exception in thread "Thread-1" java.lang.IndexOutOfBoundsException: index -19148990
                  at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(
                  at java.util.concurrent.atomic.AtomicIntegerArray.get(
                  at kmer.KCountArray7MTA.incrementHashedLocal(
                  at kmer.KCountArray7MTA.increment(
                  at kmer.KmerCount7MTA$CountThread.addRead_Advanced(
                  at kmer.KmerCount7MTA$CountThread.count(
                  at kmer.KmerCount7MTA$
                  Made prefilter: hashes = 2 mem = 729.14 MB cells = 3058.25M used = 0.000%
                  Exception in thread "Thread-5" java.lang.IndexOutOfBoundsException: index -19148990
                  at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(
                  at java.util.concurrent.atomic.AtomicIntegerArray.get(
                  at kmer.KCountArray7MTA.readHashed(
                  at kmer.KCountArray7MTA.increment(
                  at kmer.KmerCount7MTA$CountThread.addRead_Advanced(
                  at kmer.KmerCount7MTA$CountThread.count(
                  at kmer.KmerCount7MTA$
                  Made hash table: hashes = 3 mem = 1.32 GB cells = 355.16M used = 0.000%

                  Estimated kmers of depth 1-3: 1
                  Estimated kmers of depth 4+ : 0
                  Estimated unique kmers: 1

                  Table creation time: 9.133 seconds.
                  Exception in thread "Thread-11" java.lang.IndexOutOfBoundsException: index -19148990
                  at java.util.concurrent.atomic.AtomicIntegerArray.checkedByteOffset(
                  at java.util.concurrent.atomic.AtomicIntegerArray.get(
                  at kmer.KCountArray7MTA.readHashed(
                  at jgi.KmerNormalize.generateCoverage(
                  at jgi.KmerNormalize$ProcessThread.normalizeInThread(
                  at jgi.KmerNormalize$
                  Table read time: 0.129 seconds. 0.00 kb/sec
                  Total reads in: 0 NaN% Kept
                  Total bases in: 0 NaN% Kept
                  Error reads in: 0 NaN%
                  Error type 1: 0 NaN%
                  Error type 2: 0 NaN%
                  Error type 3: 0 NaN%

                  Wrote histogram to histogram.txt
                  Total kmers counted: 0
                  Total unique kmer count: 0
                  Includes forward kmers only.
                  The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
                  The most accurate value is the greater of the two.

                  Percent unique: Infinity%
                  Depth average: NaN (unique kmers)
                  Depth median: 0 (unique kmers)
                  Depth standard deviation: NaN (unique kmers)

                  Depth average: NaN (all kmers)
                  Depth median: 0 (all kmers)
                  Depth standard deviation: 0.00 (all kmers)

                  Approx. read depth median: NaN

                  Total time: 15.108 seconds. 0.00 kb/sec


                  • #10
                    Well, sorry for the inconvenience, then... I have never seen that error before and I can't imagine why it is happening. It works fine for me when I run it with -Xmx679m or -Xmx3g, and 1 thread, which should be exactly the same as what you are doing.

                    I can only suggest that another similar tool, such as khmer or Trinity, might be able to accomplish the same thing.


                    • #11
                      @horvathdp: Do you have only 4G RAM? Are you running a 32-bit or 64-bit OS?


                      • #12
                        Thanks for trying. It always amazes me how helpful and patient you developers are with newbies like myself. Yes, just 4G of RAM on a Linux (Ubuntu 12) 64-bit instance.


                        • #13
                          I managed to get the program to run on a different system. Not sure what the trick was, but all is well. I have another question though. When my data was run on Jellyfish, the histogram I got back had a peak at 15X coverage. However, when I ran your program, my peak was at about 35X coverage. To be fair, based on the predicted size of my genome, I was expecting about 30X coverage. Any idea why Jellyfish indicates about half the coverage that your program gives?


                          • #14
                            The most likely cause is that the program stores only one copy of a kmer or its reverse-complement. I am not sure about Jellyfish's behavior, but this is the result I would expect if it stored forward and reverse kmers independently. In this case, you would expect roughly 30x coverage of reads on the genome, or 15x coverage of forward reads. But you'd have to check Jellyfish's documentation to see exactly how it processes kmers and their reverse-complements.
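                            The factor of two described above can be reproduced on toy data. The Python sketch below counts kmers two ways: as-is (forward and reverse-complement kmers kept separately) and canonically (a kmer and its reverse complement share one counter). It says nothing about how Jellyfish is actually configured; it only illustrates the effect, with hypothetical function names and a made-up 12-base "genome".

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers(reads, k, canonical):
    """Count kmers; if canonical, merge each kmer with its reverse complement."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, revcomp(kmer))
            counts[kmer] += 1
    return counts

genome = "AAACCAGATTCG"
# 30 reads in total: half sequenced from each strand
reads = [genome] * 15 + [revcomp(genome)] * 15

per_strand = count_kmers(reads, k=5, canonical=False)
merged = count_kmers(reads, k=5, canonical=True)
print(sorted(set(per_strand.values())))  # depth seen when strands are separate
print(sorted(set(merged.values())))      # depth seen when strands are merged
```

                            An odd k is used so that no kmer is its own reverse complement, which keeps the toy counts exact multiples of the per-strand depth.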


                            • #15
                              So I generated a bunch of lovely histograms and figured out my cutoffs for the samples, but I am getting an error about the hash table being too full when running the program, even when I cut my sample size in half. Are there any run parameters I could add to the command line to allow the program to work? The machine I am working on has a TB of hard drive space and 12 GB of RAM, so I am surprised that it is running low.

