Header Leaderboard Ad


Introducing BBNorm, a read normalization and error-correction tool



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    Hello! I have RNA-Seq data and I am processing them with BBtools for adapter trimming, quality trimming, contaminant filtering, error correction. I want to use these data as support in denovo genome assembly with Abyss, which has this ability. Should I normalize them? What is the target number that I should set in bbnorm? Can someone explain how I calculate the right number?
    Last edited by kokyriakidis; 07-28-2018, 12:38 AM.


    • #47
      khist.sh assertion error

      Hi Brian:

      I think this is your preferred place for issue reporting - let me know if I missed some other location.

      I've tried to do this to calculate dinucleotide frequencies. Not sure I'm asking for khist something reasonable though.
      khist.sh k=2 in=tiny.bam
      java -ea -Xmx27710m -Xms27710m -cp /home/NEB/langhorst/miniconda2/envs/bbmap/opt/bbmap-37.78/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 k=2 in=tiny.bam
      Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, k=2, in=tiny.bam]
      threads:          	32
      k:                	2
      deterministic:    	false
      toss error reads: 	false
      passes:           	1
      bits per cell:    	32
      cells:            	3289.05M
      hashes:           	3
      prefilter bits:   	2
      prefilter cells:  	28.34B
      prefilter hashes: 	2
      base min quality: 	0
      kmer min prob:    	0.0
      target depth:     	100
      min depth:        	0
      max depth:        	100
      min good kmers:   	1
      depth percentile: 	54.0
      ignore dupe kmers:	true
      fix spikes:       	false
      histogram length: 	1048576
      print zero cov:   	false
      Made prefilter:   	hashes = 2   	 mem = 6.60 GB   	cells = 28.34B   	used = 0.000%
      Made hash table:  	hashes = 3   	 mem = 12.25 GB   	cells = 3287.47M   	used = 0.000%
      Estimated kmers of depth 1-3: 	0
      Estimated kmers of depth 4+ : 	9
      Estimated unique kmers:     	9
      Table creation time:		38.189 seconds.
      Exception in thread "Thread-201" Exception in thread "Thread-230" Exception in thread "Thread-229" Exception in thread "Thread-228" Exception in thread "Thread-232" Exception in thread "Thread-231" Exception in thread "Thread-227" java.lang.AssertionError
      	at bloom.KCountArray.makeCanonical2(KCountArray.java:489)
      	at bloom.KCountArray.read(KCountArray.java:135)
      	at jgi.KmerNormalize.generateCoverage(KmerNormalize.java:2752)
      	at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:2859)
      	at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2808)
      I was able to do what I want with jellyfish count -s 1000000 -m 2 -t 16 <(samtools fastq tiny.bam) && jellyfish dump -c -t mer_counts_0
      but I thought you'd want to know about the assertion.

      AA 1652511
      AC 369103
      AG 388077
      AT 1270520
      CA 325270
      CC 238680
      CG 36991
      CT 369537
      GA 371297
      GC 10646
      GG 254788
      GT 391079
      TA 1313592
      TC 355048
      TG 348502
      TT 1676488


      • #48
        Hello there,

        I have performed a metagenomics sequencing on my samples on an Illumina Hi-Seq. How do I determine the target coverage value to use?

        Thanks in advance.


        • #49
          Exception in thread &quot;Thread-175&quot; java.lang.AssertionError:


          Thanks for the great software!

          I want to use bbnorm to normalize single end RNAseq library reads.

          I'm using the software on my university cluster (linux)

          this is the exception message I get:

          Exception in thread "Thread-175" java.lang.AssertionError:

          at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:460)
          at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:97)
          at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:42)
          at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:28)

          this is the script I used:
          #SBATCH -N 1
          #SBATCH -n 24
          #SBATCH -p hive7d
          #SBATCH -J ecc_bbnormSE # Job name
          #SBATCH --mem=128000
          #SBATCH --time=2-23:00:00 # Runtime in D-HH:MM:SS
          #SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
          #SBATCH [email protected] # Email to send notifications to

          . /etc/profile.d/modules.sh
          module purge
          module load java/jre1.8.0

          export PATH=/data/home/steindler/mbritstei/programs/anaconda2/bin:$PATH

          source activate bbtools


          bbnorm.sh in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq target=100 min=5 ecc prefilter

          source deactivate

          ## Can you please tell me what is wrong?

          # thanks!!



          • #50
            @Maya: Can you explicitly add -Xmx128g (is that what you are asking for) threads=24 in your bbnorm.sh command?

            I also don't understand this part: in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq There should just be one "out="?

            Normalizing RNAseq data is not appropriate. You are going to lose vital count information. See the section on "When not to normalize" in BBNorm guide here.
            Last edited by GenoMax; 01-27-2019, 05:38 AM.


            • #51
              Thanks @GenoMax

              I did not see that "out= " there... will try again, also with the suggested flags.

              I'm using the normalization just for assembly, not for quantification.


              • #52
                Originally posted by Brian Bushnell View Post
                I'd like to introduce BBNorm, a member of the BBTools package.

                BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error-correction and generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

                First, what is normalization? Many assemblers perform poorly in the presence of too much data, and data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler performs well with these datasets (Spades, for example, does a very good job with single cells, though still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

                Subsampling randomly discards some percentage of the reads, to reduce the average coverage, and is computationally cheap (you can do that quickly with reformat). However, if you have one area with 10000x coverage and another area with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, it is not ideal!

                Normalization discards reads with a probability based on the local coverage. So, for example, if you normalize to a target 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a 0.5 probability; and in a 10000x area they will be discarded with a 0.004 probability. As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases continuity (L50) by a substantial amount.

                To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):

                bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

                Error-correction is also useful in many cases when you have sufficient depth and wish to avoid false-positive variant calls, or achieve a higher mapping rate, or want to merge paired reads via overlap, or assemble high error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a DeBruijn assembler. Like normalization, error-correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since that is the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error-correction by the OLC assembler Omega.

                To error-correct reads with BBNorm:

                ecc.sh in=reads.fq out=corrected.fq

                To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.

                Lastly, BBNorm can be used for producing kmer frequency histograms, and binning reads based on coverage depth. The histograms are useful to determine things like the ploidy of an organism, the genome size and coverage, the heterozygousity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).

                To generate a kmer frequency histogram:

                khist.sh in=reads.fq hist=histogram.txt

                You can also use the "hist" flag during normalization or error-correction, for the histogram of input reads, for free; "histout" will generate the histogram of output reads, but cost an additional pass.

                So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

                How does BBNorm work, and why is it better than other tools?

                BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error-correction, situations where there are adjacent kmers in a read with drastically different frequencies - for example, differing by a factor of 180 - are detected; the offending base is altered to a different base, if doing so will restore the kmer to a similar frequency as its adjacent kmers.

                BBNorm is memory-efficient because it does not explicitly store kmers - everything is in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy will decline as the table's loading increases - but because kmers are not explicitly stored, it can store several times more than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. As a result - in practice, you should use all available memory even for a tiny genome with a small number of reads; but even for a huge genome with very high coverage, BBNorm will still work, and produce good results quickly on a computer with limited memory.

                Speedwise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking - this allows it to scale efficiently with processor core counts.

                BBTools has another program functionally related to BBNorm, "kmercountexact.sh". It does NOT use probabilistic data structures, and uses locking rather than atomic counters, and as a result may not scale as well, and will run out of memory on large datasets. However, it is still extremely fast and memory-efficient - using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer frequency histograms, I use KmerCountExact for isolates, and BBNorm for metagenomes.

                BBNorm (and all BBTools) can be downloaded here:


                Edit: Note! Most programs in the BBMap package run in Java 6 or higher, but BBNorm requires Java 7 or higher.
                Hi!!! A question, please.

                BBNorm can also be used to normalize the coverage of PacBio reads??

                Thank you in advance!

                I just read that it indeed can be used on PacBio reads but doesn't perform error-correction! that's fine for me I have HiFi PacBio reads and wanted to try this normalization step
                Last edited by silverfox; 04-10-2020, 03:00 PM.


                • #53
                  normalized reads file empty

                  Hi Brian and everyone! I'm using bbnorm but i keep getting a problem.

                  I ran the command:

                  $ bbnorm.sh -Xmx64g t=18 in=pt.raw.fastq out=pt.raw.normalized.fq target=90 mindepth=2

                  everything looked good but when the proccess ended, I realized the file pt.raw.normalized.fq was empty


                  I just run the following command:

                  $ bbnorm.sh in=pt.raw.fastq out=pt.raw.normalized3.fq target=90 min=2

                  But at the end, my pt.raw.normalized3.fq file was still empty, like before T-T

                  I think the problem could be here:

                  In the second pass says

                  HTML Code:
                  Made hash table:        hashes = 3       mem = 65.66 GB         cells = 35.25B          used = 0.000%
                  Estimated unique kmers:         0
                  Table creation time:            17.804 seconds.
                  Started output threads.
                  Table read time:                0.012 seconds.          0.00 kb/sec
                  Total reads in:                 0               NaN% Kept
                  Total bases in:                 0               NaN% Kept
                  Error reads in:                 0               NaN%
                  Error type 1:                   0               NaN%
                  Error type 2:                   0               NaN%
                  Error type 3:                   0               NaN%
                  Total kmers counted:            0
                  Please, can someone tell me what did I do wrong?
                  Thanks a lot in advance!


                  I just found one of your comments:
                  Originally posted by Brian Bushnell View Post
                  BBNorm cannot be used for raw PacBio data, as the error rate is too high; it is throwing everything away as the apparent depth is always 1, since all the kmers are unique. Normalization uses 31-mers (by default) which requires that, on average, the error rate is below around 1/40 (so that a large number of kmers will be error-free). However, raw PacBio data has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and it can process PacBio data, but does not do normalization - a longer kmer is needed.

                  PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

                  PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.
                  I will try to normalize my HiFi PacBio (PacBio CCS) reads then

                  Can I not reduce the kmer length? (default=31)
                  Last edited by silverfox; 04-12-2020, 06:25 PM.


                  • #54
                    Hi all,
                    does any one has experience with processing ancient DNA using bbtools? My question is how short the kmer length for bbnorm should be, regarding issues with the fragmented DNA. Thanks in advance!


                    • #55
                      Multiple Samples for BBNorm

                      Hi Brian -- How would this work with paired-end reads? Would it be okay to concatenate all forward and all reverse reads then run BBNorm?

                      Originally posted by Brian Bushnell View Post
                      At this point, BBNorm does not accept multiple input files (other than dual files for paired reads). You would have to concatenate them first:

                      cat a.fastq.gz b.fastq.gz > all.fa.gz

                      ...which works fine for gzipped files. Most of my programs can accept piped input from stdin, but not BBNorm since it needs to read the files twice.


                      • #56
                        Yes BBNorm works with paired-end files. If you have data that ran over multiple lanes then it would be fine to concatenate the reads in identical order for R1/R2 data.


                        • #57
                          Core dump error with BBNorm

                          Hi all,
                          I'm using BBNorm to normalize a metagenomic dataset but am getting the following error during 'pass 1':

                          HTML Code:
                          Estimated unique kmers:         15483482031
                          Table creation time:            1619.850 seconds.
                          Started output threads.
                          /opt/nesi/CS400_centos7_bdw/BBMap/38.90-gimkl-2020a/bbnorm.sh: line 164: 106596 Bus error               (core dumped) java -ea -Xmx17653m -Xms17653m -cp /opt/nesi/CS400_centos7_bdw/BBMap/38.90-gimkl-2020a/current/ jgi.KmerNormalize bits=32 t=10 in=S1_all_R1_cpfy_clean.fq.gz in2=S1_all_R2_cpfy_clean.fq.gz out=S1_all_R1_cpfy_clean_norm.fq.gz out2=S1_all_R2_cpfy_clean_norm.fq.gz target=100 min=5
                          From manual I assumed that memory shouldn't be an issue, and have given 21G of memory to the job scheduler (slurm).

                          I'm wondering if I'm exceeding my storage with the temporary files that are created by BBNorm, although I should have ~10TB so wouldn't have thought that was the issue.

                          Any insight on this error would be much appreciated. Thanks in advance!


                          • #58
                            @Sean Depending on size of your dataset 21G memory may not be enough.


                            Latest Articles


                            • seqadmin
                              Improved Targeted Sequencing: A Comprehensive Guide to Amplicon Sequencing
                              by seqadmin

                              Amplicon sequencing is a targeted approach that allows researchers to investigate specific regions of the genome. This technique is routinely used in applications such as variant identification, clinical research, and infectious disease surveillance. The amplicon sequencing process begins by designing primers that flank the regions of interest. The DNA sequences are then amplified through PCR (typically multiplex PCR) to produce amplicons complementary to the targets. RNA targets...
                              03-21-2023, 01:49 PM
                            • seqadmin
                              Targeted Sequencing: Choosing Between Hybridization Capture and Amplicon Sequencing
                              by seqadmin

                              Targeted sequencing is an effective way to sequence and analyze specific genomic regions of interest. This method enables researchers to focus their efforts on their desired targets, as opposed to other methods like whole genome sequencing that involve the sequencing of total DNA. Utilizing targeted sequencing is an attractive option for many researchers because it is often faster, more cost-effective, and only generates applicable data. While there are many approaches...
                              03-10-2023, 05:31 AM





                            Topics Statistics Last Post
                            Started by seqadmin, 03-24-2023, 02:45 PM
                            0 responses
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2023, 12:26 PM
                            0 responses
                            Last Post seqadmin  
                            Started by seqadmin, 03-17-2023, 12:32 PM
                            0 responses
                            Last Post seqadmin  
                            Started by seqadmin, 03-15-2023, 12:42 PM
                            0 responses
                            Last Post seqadmin