Header Leaderboard Ad

Collapse

Introducing BBNorm, a read normalization and error-correction tool

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not put it below ~16x typically. You can't get good error-correction with 3x depth nor matter what you do. Bear in mind that the longer the kmer is compared to read length, the lower the kmer depth will be compared to read depth.

    To deal with multiple different data sources, you can run BBNorm with the "extra" flag to use additional data to build kmer frequency tables but not as output, like this:

    ecc.sh in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

    That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternately, you can do this:

    bbrename.sh miseq.fq addprefix prefix=miseq
    bbrename.sh gaII_1.fq addprefix prefix=gaII_1

    ...etc

    Then cat all the files together, and error-correct them:
    ecc.sh in=combined.fq out=ecc.fq ordered int=f

    Then demultiplex:
    demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f
    (note the % symbol; it will be replaced by a name)

    That will keep all the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library) pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved with reformat.sh.

    Comment


    • #17
      Dear Brian,

      Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing?

      BBNorm seems to work well for my Illumina MiSeq data.
      But, when I tried it for PacBio filtered subreads (fastq format), the resulting output file was empty.

      Actually, I'm working on PacBio sequencing data from MDA-amplified single cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of bad assembly. So, I've been trying to normalize raw data.

      Thanks.

      Comment


      • #18
        BBNorm cannot be used for raw PacBio data, as the error rate is too high; it is throwing everything away as the apparent depth is always 1, since all the kmers are unique. Normalization uses 31-mers (by default) which requires that, on average, the error rate is below around 1/40 (so that a large number of kmers will be error-free). However, raw PacBio data has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and it can process PacBio data, but does not do normalization - a longer kmer is needed.

        PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

        PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.

        Comment


        • #19
          Coverage Binning

          Hi Brian
          I am trying to assemble a genome from single cell amplification using spades and subsequently using tetramer + coverage analysis (i.e. CONCOCT) to remove "contaminating reads" which seem to be created at the sequencing. If I understood correctly bbnorm will keep enough information to allow coverage information to be used subsequently to bin contigs. Is that the case?

          I also would like to know your opinion on how bbnorm compares to the strategy described in http://www.sciencemag.org/cgi/conten.../6047/1296/DC1

          I quote:

          "High-depth k-mers,presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and
          trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."

          Comment


          • #20
            Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.

            Re the science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.

            Comment


            • #21
              I'll read the article in a few days, and comment on it then. As Titus stated, you cannot do binning by depth after normalization - it destroys that information. Furthermore, MDA'd single cells cannot be individually binned for contaminants based on depth, as the depth is exponentially random across the genome.

              I use BBNorm (with the settings target=100 min=2) to preprocess amplified single cells prior to assembly with Spades, as it vastly reduces the total runtime and memory use, meaning that the jobs are much less likely to crash or be killed. If you want to reduce contamination, though, I have a different tool called CrossBlock, which is designed to eliminate cross-contamination between multiplexed single-cell libraries. You need to first assemble all the libraries, then run CrossBlock with all of the libraries and their reads (raw, not normalized!) as input; it essentially removes contigs from assemblies that have greater coverage from another library than their own library. Incidentally, CrossBlock does in fact use BBNorm.

              The latest version of Spades does not really have too much trouble with high-abundance kmers, unless they get extremely high or you have a limited amount of memory. So, you don't HAVE to normalize before running Spades, but it tends to give a comparable assemble with a small fraction of the resources - typically with slightly better continuity and slightly lower misassembly rates, with slightly lower genome recovery, but a slightly higher rate of long genes being called (according to Quast).

              On the other hand, if you want to assemble MDA-amplified single-cell data with an assembler designed for isolate data, normalization is pretty much essential for a decent assembly.
              Last edited by Brian Bushnell; 09-05-2015, 09:32 PM.

              Comment


              • #22
                Comparison paper

                Hello Brian,

                You mentioned the intent to submit a paper in March comparing BBNorm w/other methods. Were you able to scrape up any time to submit something? I'm keen to read it.

                Comment


                • #23
                  I have the manuscript mostly written, but it's not really ready to submit anywhere yet. However, there is a postdoc who is eager to get started on preparing it for submission, so... hopefully soon?

                  Comment


                  • #24
                    Unique kmers from BBNorm?

                    Brian,

                    I see that in the output of BBNorm there are counts of unique kmers. Is there an option within BBNorm to export the unique kmers as a list or FASTA?

                    Best,

                    Bob

                    Comment


                    • #25
                      Hi Bob,

                      That's not possible with BBNorm, as it uses a lossy data structure called a count-min sketch to store counts. However, you can do that with KmerCountExact, which is faster than BBNorm, but less memory-efficient. Usage:

                      kmercountexact.sh in=reads.fq out=kmers.fasta


                      That will print them in fasta format; for 2-column tsv, add the flag "fastadump=f". There are also flags to suppress storing or printing of kmers with low counts.

                      Comment


                      • #26
                        Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
                        Last edited by jazz710; 12-05-2015, 09:17 AM. Reason: More complete

                        Comment


                        • #27
                          Originally posted by jazz710 View Post
                          Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
                          Yes and yes!

                          Comment


                          • #28
                            Hi Brian,

                            I am using bbnorm.sh to filter low-depth reads and correct them together. I have a question about it as below:
                            The raw pair-end reads file size(.gz): 223+229Mb;
                            After running bbnorm.sh:111+113+64Mb (out, out2 and outt);
                            So, what is different between 64Mb reads with filtered reads(164Mb=223+229-111-113-64).
                            This is my running shell:
                            bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t maxdepth=1000 lowbindepth=10 highbindepth=500 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=exclued.fq.gz

                            Thanks,
                            Xiao

                            Comment


                            • #29
                              Hi Xiao,

                              The size difference is likely due to compression, and the fact that error-free reads compress better than reads with errors. Comparing the file size of compressed files tends to be confusing. If you want to know the truth, look at the actual amount of data in the files. For example, "reformat.sh in=file.fq.gz" will tell you the exact number of bases in the file.

                              Comment


                              • #30
                                Hi Brian,

                                Thanks very much for your quick reply. According to you suggestion, I ran reformat.sh to calculate the exact number of bases in all files, but the reason seems not relating to "compression" very much! See below,
                                1)reads before running bbnorm.sh:
                                reads 1
                                Input : 2728414 reads 336952602 bases
                                reads 2
                                Input: 2728414 reads 338676300 bases

                                2) reads after...:
                                reads 1
                                Input: 1307784 reads 162040282 bases
                                reads 2
                                Input: 1307784 reads 162053968 bases
                                excluded reads
                                Input: 767030 reads 95017289 bases

                                Thanks,
                                Xiao

                                Comment

                                Working...
                                X