  • 348056755@qq.com
    replied
    Hi Brian,

    I am using bbnorm.sh to filter out low-depth reads and error-correct them in the same pass. I have a question about it:
    The raw paired-end read files (.gz) are 223 + 229 MB; after running bbnorm.sh, the outputs are 111 + 113 + 64 MB (out, out2, and outt).
    So, what is the difference between the 64 MB of reads in outt and the reads that were filtered out (164 MB = 223 + 229 - 111 - 113 - 64)?
    This is my command:
    bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t maxdepth=1000 lowbindepth=10 highbindepth=500 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=exclued.fq.gz

    Thanks,
    Xiao

  • Brian Bushnell
    replied
    Originally posted by jazz710:
    Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
    Yes and yes!

  • jazz710
    replied
    Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
    Last edited by jazz710; 12-05-2015, 09:17 AM. Reason: More complete

  • Brian Bushnell
    replied
    Hi Bob,

    That's not possible with BBNorm, as it uses a lossy data structure called a count-min sketch to store counts. However, you can do that with KmerCountExact, which is faster than BBNorm, but less memory-efficient. Usage:

    kmercountexact.sh in=reads.fq out=kmers.fasta


    That will print them in fasta format; for 2-column tsv, add the flag "fastadump=f". There are also flags to suppress storing or printing of kmers with low counts.
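
    To make the distinction concrete, here is a toy count-min sketch in Python (my own illustration, not BBNorm's actual code): the structure keeps only hashed counters, so it can answer "how many times did this kmer occur?" but the kmer strings themselves are never stored and so cannot be listed back out.

    # Toy count-min sketch: approximate counts in fixed-size hash tables.
    import random

    class CountMinSketch:
        def __init__(self, width=1000, depth=4, seed=1):
            rng = random.Random(seed)
            self.salts = [rng.getrandbits(64) for _ in range(depth)]
            self.tables = [[0] * width for _ in range(depth)]
            self.width = width

        def add(self, kmer):
            for salt, table in zip(self.salts, self.tables):
                table[hash((salt, kmer)) % self.width] += 1

        def count(self, kmer):
            # Collisions can only inflate a counter, so the minimum across
            # tables is an upper bound that is usually close to the truth.
            return min(table[hash((salt, kmer)) % self.width]
                       for salt, table in zip(self.salts, self.tables))

    cms = CountMinSketch()
    for kmer in ["ACGTA", "ACGTA", "TTTTC"]:
        cms.add(kmer)
    print(cms.count("ACGTA"))  # ~2; no kmer string is retained anywhere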

  • jazz710
    replied
    Unique kmers from BBNorm?

    Brian,

    I see that in the output of BBNorm there are counts of unique kmers. Is there an option within BBNorm to export the unique kmers as a list or FASTA?

    Best,

    Bob

  • Brian Bushnell
    replied
    I have the manuscript mostly written, but it's not really ready to submit anywhere yet. However, there is a postdoc who is eager to get started on preparing it for submission, so... hopefully soon?

  • C. Olsen
    replied
    Comparison paper

    Hello Brian,

    You mentioned the intent to submit a paper in March comparing BBNorm with other methods. Were you able to scrape up any time to submit something? I'm keen to read it.

  • Brian Bushnell
    replied
    I'll read the article in a few days, and comment on it then. As Titus stated, you cannot do binning by depth after normalization - it destroys that information. Furthermore, MDA'd single cells cannot be individually binned for contaminants based on depth, as the depth is exponentially random across the genome.

    I use BBNorm (with the settings target=100 min=2) to preprocess amplified single cells prior to assembly with Spades, as it vastly reduces the total runtime and memory use, meaning that the jobs are much less likely to crash or be killed. If you want to reduce contamination, though, I have a different tool called CrossBlock, which is designed to eliminate cross-contamination between multiplexed single-cell libraries. You need to first assemble all the libraries, then run CrossBlock with all of the libraries and their reads (raw, not normalized!) as input; it essentially removes contigs from assemblies that have greater coverage from another library than their own library. Incidentally, CrossBlock does in fact use BBNorm.
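
    As a toy illustration of that decision rule (my own sketch with hypothetical library names, not CrossBlock's actual implementation): a contig is flagged when some other library covers it more deeply than the library it was assembled from.

    # Hypothetical per-contig coverage table: contig -> {library: fold coverage}.
    coverage = {
        "contig_1": {"libA": 80.0, "libB": 2.0},
        "contig_2": {"libA": 1.5, "libB": 95.0},  # probably belongs to libB
    }

    def flag_cross_contamination(coverage, own_library):
        """Return contigs whose coverage from any other library exceeds
        the coverage from the library they were assembled in."""
        flagged = []
        for contig, per_lib in coverage.items():
            own = per_lib.get(own_library, 0.0)
            foreign = max((c for lib, c in per_lib.items() if lib != own_library),
                          default=0.0)
            if foreign > own:
                flagged.append(contig)
        return flagged

    print(flag_cross_contamination(coverage, "libA"))  # ['contig_2']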

    The latest version of Spades does not really have too much trouble with high-abundance kmers, unless they get extremely high or you have a limited amount of memory. So, you don't HAVE to normalize before running Spades, but normalization tends to give a comparable assembly with a small fraction of the resources - typically with slightly better contiguity and slightly lower misassembly rates, with slightly lower genome recovery, but a slightly higher rate of long genes being called (according to Quast).

    On the other hand, if you want to assemble MDA-amplified single-cell data with an assembler designed for isolate data, normalization is pretty much essential for a decent assembly.
    Last edited by Brian Bushnell; 09-05-2015, 09:32 PM.

  • titusbrown
    replied
    Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.
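
    For example, once the raw reads are mapped back to the assembly, mean per-contig coverage can be computed from a per-base depth table (a minimal sketch; it assumes a three-column contig/position/depth TSV such as "samtools depth" produces, and the filename is hypothetical):

    # Mean coverage per contig from a contig<TAB>pos<TAB>depth table.
    from collections import defaultdict

    totals = defaultdict(lambda: [0, 0])  # contig -> [depth sum, positions seen]
    with open("depth.tsv") as fh:
        for line in fh:
            contig, _, depth = line.rstrip("\n").split("\t")
            totals[contig][0] += int(depth)
            totals[contig][1] += 1

    for contig, (depth_sum, n) in totals.items():
        print(contig, depth_sum / n)  # mean depth over covered positions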

    Re the Science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.

  • suzumar
    replied
    Coverage Binning

    Hi Brian
    I am trying to assemble a genome from single-cell amplification using Spades, and subsequently using tetramer + coverage analysis (e.g., CONCOCT) to remove "contaminating reads" that seem to be introduced during sequencing. If I understood correctly, BBNorm will keep enough information to allow coverage to be used subsequently to bin contigs. Is that the case?

    I also would like to know your opinion on how bbnorm compares to the strategy described in https://www.sciencemag.org/cgi/conte.../6047/1296/DC1

    I quote:

    "High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."

  • Brian Bushnell
    replied
    BBNorm cannot be used for raw PacBio data, as the error rate is too high; it throws everything away because the apparent depth is always 1, since essentially all the kmers are unique. Normalization uses 31-mers (by default), which requires that, on average, the error rate be below around 1/40, so that a large fraction of kmers will be error-free. Raw PacBio data, however, has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and can process PacBio data, but it does not do normalization - a longer kmer is needed for that.
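
    The arithmetic behind that cutoff, assuming errors land independently at each base so that the chance a 31-mer is error-free is (1 - e)^31:

    # Fraction of error-free kmers under an independent-errors model.
    k = 31
    for label, e in [("error rate 1/40", 1 / 40), ("raw PacBio, 1/7", 1 / 7)]:
        p = (1 - e) ** k
        print(f"{label}: {p:.3f} of {k}-mers error-free")
    # error rate 1/40: 0.456 -> nearly half of all kmers are usable
    # raw PacBio, 1/7: 0.008 -> almost every kmer looks unique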

    PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

    PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.

  • inkang
    replied
    Dear Brian,

    Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing?

    BBNorm seems to work well for my Illumina MiSeq data.
    But, when I tried it for PacBio filtered subreads (fastq format), the resulting output file was empty.

    Actually, I'm working on PacBio sequencing data from MDA-amplified single cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of bad assembly. So, I've been trying to normalize raw data.

    Thanks.

  • Brian Bushnell
    replied
    At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (the echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not put it below ~16x typically. You can't get good error-correction with 3x depth no matter what you do. Bear in mind that the longer the kmer is compared to read length, the lower the kmer depth will be compared to read depth.
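
    To put that last sentence in numbers (a back-of-the-envelope model, ignoring errors and read edges): a read of length L contributes L - k + 1 kmers, so kmer depth is roughly read depth scaled by (L - k + 1) / L.

    # Approximate kmer depth from read depth for kmer size k.
    def kmer_depth(read_depth, read_len, k=31):
        return read_depth * (read_len - k + 1) / read_len

    print(kmer_depth(50, 100))  # 100 bp reads at 50x -> 35x kmer depth
    print(kmer_depth(50, 300))  # 300 bp reads at 50x -> 45x kmer depth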

    To deal with multiple different data sources, you can run BBNorm with the "extra" flag, which uses additional data to build the kmer frequency tables without including it in the output, like this:

    ecc.sh in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

    That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternately, you can do this:

    bbrename.sh miseq.fq addprefix prefix=miseq
    bbrename.sh gaII_1.fq addprefix prefix=gaII_1

    ...etc

    Then cat all the files together, and error-correct them:
    ecc.sh in=combined.fq out=ecc.fq ordered int=f

    Then demultiplex:
    demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f
    (note the % symbol; it will be replaced by a name)

    That will keep the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library), pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved formats with reformat.sh.

  • dcard
    replied
    Optimal depth for read error correcting

    Hi Brian and others,

    I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correction using BBMap or any other error-correcting program. The Quake website mentioned >15x coverage, but a quick round of Googling hasn't given me much more than that.

    The reason I ask is that I have a couple lanes of MiSeq data (600-cycle PE sequencing), which each total about 3x coverage of my genome. Therefore, kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100 bp PE) and a few GAII lanes (50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

    Therefore, my second question is: what workflow would be best to accomplish this task? Are there settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to reach the coverage necessary to correct confidently? Thanks in advance for the help.

  • Brian Bushnell
    replied
    I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.
