Introducing BBNorm, a read normalization and error-correction tool

  • Introducing BBNorm, a read normalization and error-correction tool

    I'd like to introduce BBNorm, a member of the BBTools package.

    BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error-correction and generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

    First, what is normalization? Many assemblers perform poorly when given too much data, or data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler performs well with these datasets (SPAdes, for example, does a very good job with single cells, though it still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

    Subsampling randomly discards some percentage of the reads, to reduce the average coverage, and is computationally cheap (you can do that quickly with reformat). However, if you have one area with 10000x coverage and another area with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, it is not ideal!

    Normalization discards reads with a probability based on the local coverage. So, for example, if you normalize to a target 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a 0.5 probability; and in a 10000x area they will be retained with a 0.004 probability. As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases continuity (L50) by a substantial amount.
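
    The contrast between subsampling and normalization can be checked with a few lines of arithmetic. This is a minimal sketch of the idealized rule described above (keep with probability target/depth), not BBNorm's actual code; the function names are made up for illustration.

```python
def keep_probability(local_depth, target):
    """Idealized normalization: keep everything at or below the target,
    otherwise keep with probability target / local_depth."""
    if local_depth <= target:
        return 1.0
    return target / local_depth


def subsampled_depth(local_depth, rate):
    """Random subsampling scales every region by the same rate."""
    return local_depth * rate


for depth in (10, 80, 10000):
    p = keep_probability(depth, target=40)
    print(f"{depth}x: keep p={p}, normalized depth ~{depth * p}x, "
          f"1% subsample ~{subsampled_depth(depth, 0.01)}x")
```

    Normalization leaves the 10x region at 10x and brings the 80x and 10000x regions down to about 40x, while a 1% subsample drops the 10x region to about 0.1x.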

    To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):

    bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

    Error-correction is also useful in many cases when you have sufficient depth and wish to avoid false-positive variant calls, or achieve a higher mapping rate, or want to merge paired reads via overlap, or assemble high error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a de Bruijn assembler. Like normalization, error-correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since substitutions are the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error-correction by the OLC assembler Omega.

    To error-correct reads with BBNorm:

    ecc.sh in=reads.fq out=corrected.fq

    To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.

    Lastly, BBNorm can be used for producing kmer frequency histograms, and binning reads based on coverage depth. The histograms are useful to determine things like the ploidy of an organism, the genome size and coverage, the heterozygosity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).

    To generate a kmer frequency histogram:

    khist.sh in=reads.fq hist=histogram.txt

    You can also use the "hist" flag during normalization or error-correction, for the histogram of input reads, for free; "histout" will generate the histogram of output reads, but cost an additional pass.
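
    For readers unfamiliar with kmer frequency histograms, here is a small sketch of what one contains: for each depth, the number of distinct kmers seen exactly that many times. It uses exact counting on a toy input and ignores reverse complements, so it is only an illustration and says nothing about the file format khist.sh writes.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every kmer in every read (forward strand only, exact counts)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map depth -> number of distinct kmers observed at that depth."""
    return dict(sorted(Counter(counts.values()).items()))


reads = ["ACGTAC", "ACGTAC", "ACGTTC"]  # third read differs by one base
hist = frequency_histogram(kmer_counts(reads, k=4))
print(hist)  # {1: 2, 2: 2, 3: 1}
```

    The two kmers at depth 1 come from the single mismatching base, which is why error kmers pile up at the low end of a real histogram.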

    So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

    How does BBNorm work, and why is it better than other tools?

    BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error-correction, situations where there are adjacent kmers in a read with drastically different frequencies - for example, differing by a factor of 180 - are detected; the offending base is altered to a different base, if doing so will restore the kmer to a similar frequency as its adjacent kmers.
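
    The two-pass structure can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an exact Counter rather than a probabilistic structure, a tiny k, and only the plain median rule, whereas BBNorm applies additional heuristics. The function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_depth(read, counts, k):
    """Apparent depth of a read: the median count of its kmers."""
    ks = kmers(read, k)
    return median(counts[x] for x in ks) if ks else 0


def normalize(reads, k, target, rng):
    counts = count_kmers(reads, k)      # pass 1: count kmers
    kept = []
    for read in reads:                  # pass 2: revisit every read
        depth = read_depth(read, counts, k)
        # Reads at or below the target are always kept.
        if depth <= target or rng.random() < target / depth:
            kept.append(read)
    return kept


reads = ["ACGTTGCAAGCT"] * 100 + ["GGATCCTTAGGA"] * 5
kept = normalize(reads, k=5, target=20, rng=random.Random(0))
print(kept.count("ACGTTGCAAGCT"), kept.count("GGATCCTTAGGA"))
```

    Because the input is traversed twice, it has to be something that can be re-read, which is the same reason piped input is not accepted.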

    BBNorm is memory-efficient because it does not explicitly store kmers - everything is in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy will decline as the table's loading increases - but because kmers are not explicitly stored, it can store several times more than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. As a result - in practice, you should use all available memory even for a tiny genome with a small number of reads; but even for a huge genome with very high coverage, BBNorm will still work, and produce good results quickly on a computer with limited memory.
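
    A count-min sketch is easy to illustrate. The version below is a generic textbook sketch, not BBNorm's implementation: the hashing scheme (salted blake2b), the width, and the number of rows are all assumptions chosen for readability.

```python
import hashlib
from itertools import product


class CountMinSketch:
    """Fixed-size table of counters; kmers themselves are never stored."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._cells(item))


sketch = CountMinSketch(width=64, depth=3)
for _ in range(5):
    sketch.add("ACGTACGT")
noise = ["".join(p) for p in product("ACGT", repeat=5)]  # 1024 distinct 5-mers
for kmer in noise:
    sketch.add(kmer)

print(sketch.estimate("ACGTACGT"))  # at least 5, possibly higher
```

    Even after far more distinct kmers than cells have been added, the table stays the same size; estimates are never too low, only increasingly too high, which matches the trade-off described above.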

    Speedwise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking - this allows it to scale efficiently with processor core counts.

    BBTools has another program functionally related to BBNorm, "kmercountexact.sh". It does NOT use probabilistic data structures, and uses locking rather than atomic counters, and as a result may not scale as well, and will run out of memory on large datasets. However, it is still extremely fast and memory-efficient - using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer frequency histograms, I use KmerCountExact for isolates, and BBNorm for metagenomes.

    BBNorm (and all BBTools) can be downloaded here:

    http://sourceforge.net/projects/bbmap/

    Edit: Note! Most programs in the BBMap package run in Java 6 or higher, but BBNorm requires Java 7 or higher.
    Last edited by Brian Bushnell; 01-29-2015, 09:47 AM.

  • #2
    How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.

    • #3
      There are a number of similar tools now --

      Digital normalization, http://ivory.idyll.org/blog/diginorm-paper-posted.html

      Trinity's in silico read normalization, based on Jellyfish and custom Perl scripts: http://trinityrnaseq.sourceforge.net...alization.html

      NeatFreq, written in Java (I think): http://www.biomedcentral.com/1471-2105/15/357/abstract

      Mira also contains an implementation of a similar approach.

      I'd love to see a comparison of the algorithms in use! I know what Trinity's approach does, but I haven't looked into NeatFreq, BBNorm, or Mira.

      --titus

      • #4
        Originally posted by sarvidsson:
        How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.
        The implementation is a bit different in a couple of respects. Normalization can preferentially retain reads with errors, since they have a low apparent coverage; as a result, normalized data - particularly from single-cells - will often have a much higher error rate than the original data, even if low-depth reads are discarded. BBNorm, by default, uses 2-pass normalization which allows it - if there is sufficient initial depth - to preferentially discard low-quality reads, and still hit the target depth with a very narrow peak. So, if you look at the post-normalization kmer frequency histogram, BBNorm's output will have substantially fewer error kmers and a substantially narrower peak. This can be confirmed by mapping; the error rate in the resulting data is much lower.

        I'm working on publishing BBNorm, which will have comparative benchmarks versus other normalization tools, but in my informal testing it's way faster and yields better assemblies than the two other normalizers I have tested. The specific way that the decision is made on whether or not to discard a read has a huge impact on the end results, as does the way in which pairs are handled, and exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error-correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.

        I guess I would say that the theory is similar, but the implementation is probably very different from that of other normalizers.

        • #5
          Hi Brian,

          I'm trying to do some normalization but I want to set my target coverage to 10X rather than 40X. Is there any way to change that in BBNorm? I tried target=10, but it still says 40X on the run description.

          • #6
            By default, BBNorm will run 2 passes. The first pass will normalize to some depth higher than the ultimate desired depth, and the second pass will normalize to the target depth. This allows, in the first pass, preferential discarding of reads that are low quality. So the result from the second pass should still be a target of 10x.

            You can instead set "passes=1" which will aim for the target on the first pass and not do a second pass. This is slightly faster but will typically yield data with more errors. Neither is universally better, though.

            If you are going to target a depth of 10x, it's important to also reduce "mindepth" - by default it is 6, which is appropriate for 40x but not for 10x. Probably 2 would be better. Everything with apparent depth below that gets discarded.

            • #7
              Hi Brian,

              This tool looks great. Is there a way to accept multiple fastq.gz files for inputs? I want to run all my reads (multiple fastq.gz) through bbnorm.

              • #8
                At this point, BBNorm does not accept multiple input files (other than dual files for paired reads). You would have to concatenate them first:

                cat a.fastq.gz b.fastq.gz > all.fq.gz

                ...which works fine for gzipped files. Most of my programs can accept piped input from stdin, but not BBNorm since it needs to read the files twice.
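
                Concatenating gzipped files works because a gzip stream may consist of several members back to back. A quick way to convince yourself, using only Python's standard library and made-up two-read FASTQ data:

```python
import gzip

a = gzip.compress(b"@r1\nACGT\n+\nIIII\n")
b = gzip.compress(b"@r2\nTTGA\n+\nIIII\n")

# Byte-for-byte, this is what `cat` produces from the two files.
combined = a + b

# gzip.decompress handles multi-member data and returns one continuous stream.
text = gzip.decompress(combined)
print(text.decode())
```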

                • #9
                  Dear Brian Bushnell,

                  Can BBNorm be used to normalize mate-pair sequences from the Nextera kit (for example, 2k - 20k inserts) to reduce the input size?

                  • #10
                    Yes, it can. By default (this can be changed), BBNorm decides whether to discard a pair based on the depth of the lower-depth mate, so if read 1 has high coverage and read 2 has low coverage, the pair will not be discarded. If both mates have high depth, the pair is subject to discarding.
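
                    The default pair rule can be sketched like this, assuming the same idealized target/depth probability used for single reads; the function name is hypothetical and the real decision involves more heuristics.

```python
def pair_keep_probability(depth1, depth2, target):
    """Judge a pair by its lower-depth mate."""
    lower = min(depth1, depth2)
    if lower <= target:
        return 1.0          # one low-depth mate protects the whole pair
    return target / lower


print(pair_keep_probability(200, 10, target=40))   # 1.0
print(pair_keep_probability(200, 80, target=40))   # 0.5
```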

                    • #11
                      Dear Brian Bushnell

                      Thank you for your valuable response and for the tool; your tools have reduced my processing time by 80%.

                      I have a few more questions and would appreciate your suggestions.

                      If the libraries come from different platforms (HiSeq, MiSeq, and NextSeq) or have different insert sizes (2k, 4k, 8k, ...),

                      which is the best method to normalize?

                      1) Should I pool everything together and normalize, or normalize each sequencing platform separately?

                      Another issue: if I pre-process the reads, the read lengths will vary according to the sequencing artifacts removed.

                      2) So, is it better to normalize before or after pre-processing?

                      3) If I want to use only 40X out of 120X for the given genome (estimated size: 1.2GB), will the normalized data be <= (40*1.2GB), or will BBNorm give more than that?

                      4) Can I use it on RNA-Seq libraries before performing de novo assembly? Will it affect isoform detection or risk missing transcripts?

                      Thank you

                      • #12
                        Sorry, I somehow missed your post!

                        1) This is kind of tricky. Typically, though, I would recommend normalizing data independently if it is different (such as different insert size) since it has a different use, and you don't want it all mixed together anyway. If it is the same type - for example, 2x150bp reads with short inserts - then I would normalize it all together regardless of whether it came from a different platform or library, because it will all be used the same way.

                        2) I recommend pre-processing (adapter trimming, contaminant removal, quality-trimming or filtering) prior to normalization, because those processes all remove spurious kmers that make it harder to determine read depth, and thus improve the normalization results.

                        3) If you target 40x coverage for a 1.2Gbp genome, BBNorm should output approximately 40*1.2Gbp of data. Normally it will go a little bit over to try to ensure everywhere has at least 40x.

                        4) Normalizing RNA-seq data can certainly be done prior to assembly. But if you have 2 isoforms of a gene - one that uses exons 1, 2, and 3, and one that only uses exons 1 and 3, and one of them is expressed 100x more highly than the other, then after normalization, the less-expressed isoform may not get assembled, only the more abundant one. So there are definite disadvantages. But, it's worth trying if you get a bad assembly from the raw data.
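
                        For point 3, the expected yield is simply target depth times genome size. A quick check of the numbers from the question (120x of input, 40x target, 1.2 Gbp genome), assuming perfectly uniform coverage, which real data will not have:

```python
def expected_bases(genome_size_bp, depth):
    """Total bases needed to cover a genome at the given depth."""
    return genome_size_bp * depth


genome = 1_200_000_000                    # 1.2 Gbp
output = expected_bases(genome, 40)       # 48 Gbp after normalization
original = expected_bases(genome, 120)    # 144 Gbp of input
print(output, original, output / original)
```

                        So roughly a third of the input would be retained, plus the small overshoot mentioned above.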

                        • #13
                          Dear Brian Bushnell

                          Thank You !!!

                          Could you please point me to the article that explains the BBNorm methodology in detail? I would like it both for a complete understanding and so that I can cite the tool.

                          • #14
                            I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.

                            • #15
                              Optimal depth for read error correcting

                              Hi Brian and others,

                              I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correcting using BBMap or any other error correcting program. The Quake website mentioned >15x coverage but a quick round of Googling hasn't given me much more than that.

                              The reason I ask is that I have a couple of lanes of MiSeq data (600 cycle PE sequencing), each of which totals about 3x coverage of my genome. Therefore, a kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100bp PE) and a few GAII lanes (so 50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

                              Therefore, my second question is about what workflow would be best to accomplish this task? Are there some settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to give the coverage necessary to confidently correct? Thanks in advance for the help.
