**sarvidsson** · 01-23-2015, 01:42 AM

How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.

**titusbrown** · 01-23-2015, 03:47 AM

There are a number of similar tools now --

Digital normalization, http://ivory.idyll.org/blog/diginorm-paper-posted.html

Trinity's in silico read normalization, based on Jellyfish and custom Perl scripts: http://trinityrnaseq.sourceforge.net...alization.html

NeatFreq, written in Java (I think): http://www.biomedcentral.com/1471-2105/15/357/abstract

Mira also contains an implementation of a similar approach.

I'd love to see a comparison of the algorithms in use! I know what Trinity's approach does, but I haven't looked into NeatFreq, BBNorm, or Mira.

--titus

**Brian Bushnell** · 01-23-2015, 12:49 PM

Originally posted by sarvidsson:

> How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.

The implementation is a bit different in a couple of respects. Normalization can preferentially retain reads with errors, since they have a low apparent coverage; as a result, normalized data - particularly from single-cells - will often have a much higher error rate than the original data, even if low-depth reads are discarded. BBNorm, by default, uses 2-pass normalization which allows it - if there is sufficient initial depth - to preferentially discard low-quality reads, and still hit the target depth with a very narrow peak. So, if you look at the post-normalization kmer frequency histogram, BBNorm's output will have substantially fewer error kmers and a substantially narrower peak. This can be confirmed by mapping; the error rate in the resulting data is much lower.

I'm working on publishing BBNorm, which will have comparative benchmarks versus other normalization tools, but in my informal testing it's way faster and yields better assemblies than the two other normalizers I have tested. The specific way that the decision is made on whether or not to discard a read has a huge impact on the end results, as does the way in which pairs are handled, and exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error-correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.

I guess I would say the theory is similar, but the implementation is probably very different from that of other normalizers.

**jazz710** · 01-29-2015, 11:04 AM

Hi Brian,

I'm trying to do some normalization but I want to set my target coverage to 10X rather than 40X. Is there any way to change that in BBNorm? I tried target=10, but it still says 40X on the run description.

**Brian Bushnell** · 01-29-2015, 11:30 AM

By default, BBNorm will run 2 passes. The first pass will normalize to some depth higher than the ultimate desired depth, and the second pass will normalize to the target depth. This allows, in the first pass, preferential discarding of reads that are low quality. So the result from the second pass should still be a target of 10x.

You can instead set "passes=1" which will aim for the target on the first pass and not do a second pass. This is slightly faster but will typically yield data with more errors. Neither is universally better, though.

If you are going to target a depth of 10x, it's important to also reduce "mindepth" - by default it is 6, which is appropriate for 40x but not for 10x. Probably 2 would be better. Everything with apparent depth below that gets discarded.
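Put together, a 10x run might look like the sketch below. The file names are placeholders, and the script only prints the command unless bbnorm.sh is on the PATH and the input file exists. It is an illustration of the settings discussed above, so check the flags against the bbnorm.sh help text for your version.

```shell
#!/bin/sh
# Hypothetical file names; replace with your own.
in_reads="reads.fastq.gz"
out_reads="normalized.fastq.gz"

# target: desired final depth.
# mindepth: lowered to suit a 10x target, as suggested above.
cmd="bbnorm.sh in=$in_reads out=$out_reads target=10 mindepth=2"

# Show the command that would be run.
echo "$cmd"

# Run only when BBTools is installed and the input file is present.
if command -v bbnorm.sh >/dev/null 2>&1 && [ -f "$in_reads" ]; then
    $cmd
fi
```

Appending passes=1 to the command would skip the second pass, with the trade-off described above.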

**damiankao** · 02-04-2015, 01:21 PM

Hi Brian,

This tool looks great. Is there a way to accept multiple fastq.gz files for inputs? I want to run all my reads (multiple fastq.gz) through bbnorm.

**Brian Bushnell** · 02-04-2015, 01:30 PM

At this point, BBNorm does not accept multiple input files (other than dual files for paired reads). You would have to concatenate them first:

cat a.fastq.gz b.fastq.gz > all.fastq.gz

...which works fine for gzipped files. Most of my programs can accept piped input from stdin, but not BBNorm since it needs to read the files twice.
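Concatenation works because the gzip format allows several compressed members back to back in one file, and decompressors read them as a single stream. A small self-contained check, using made-up two-read data in a temporary directory, could look like this:

```shell
#!/bin/sh
# Self-contained demo with tiny made-up FASTQ records.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/a.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/b.fastq.gz"

# Concatenate the compressed files directly; no decompression needed.
cat "$tmp/a.fastq.gz" "$tmp/b.fastq.gz" > "$tmp/all.fastq.gz"

# Decompress the combined file and count lines (4 per FASTQ record).
lines=$(gzip -dc "$tmp/all.fastq.gz" | wc -l | tr -d ' ')
echo "lines: $lines"

rm -rf "$tmp"
```

Both records come back out, so the line count is 8.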

**sathiyamurthi** · 02-09-2015, 09:08 PM

Dear Brian Bushnell

Can BBNorm be used to normalize mate-pair sequences from the Nextera kit (e.g. 2k - 20k inserts) to reduce the input size?

**Brian Bushnell** · 02-09-2015, 09:28 PM

Yes, it can. BBNorm will (by default, it can be changed) discard pairs based on the depth of the lower mate, so if read 1 has high coverage and read 2 has low coverage, the pair will not be discarded. If both are high depth, they will be discarded.
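The pair rule described above can be sketched as a small function. This is a deliberately simplified illustration, not BBNorm's actual code: the function name, the integer depths, and the hard threshold are assumptions, and the real tool applies further heuristics when deciding what to discard.

```shell
#!/bin/sh
# Simplified illustration only; not BBNorm's actual implementation.
# Args: depth of read 1, depth of read 2, target depth (all integers).
pair_decision() {
    d1=$1; d2=$2; target=$3
    # Use the lower mate's depth as the depth of the whole pair.
    if [ "$d1" -lt "$d2" ]; then low=$d1; else low=$d2; fi
    if [ "$low" -gt "$target" ]; then
        echo "discard-candidate"
    else
        echo "keep"
    fi
}

pair_decision 200 15 40    # one mate sits in a low-depth region
pair_decision 200 180 40   # both mates sit in high-depth regions
```

The first call prints keep and the second prints discard-candidate, matching the behavior described in the post.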

**sathiyamurthi** · 02-09-2015, 09:48 PM

Dear Brian Bushnell

Thank you for your valuable response and tool; your tools have cut my processing time by 80%.

I have a few more questions; please share your suggestions.

If the libraries come from different platforms (HiSeq, MiSeq, and NextSeq) or have different insert sizes (2k, 4k, 8k, ...),

which is the best way to normalize?

1) Pool everything together and normalize, or normalize separately per sequencing platform?

Another issue: if I perform pre-processing, the read lengths will vary depending on sequencing artifacts.

2) So, is it better to normalize before or after pre-processing?

3) If I want to use only 40X out of 120X for the given genome (estimated size: 1.2 Gbp), will the normalized data be <= (40 * 1.2 Gbp), or will BBNorm output more than that?

4) Can I use it on RNA-Seq libraries before de novo assembly? Will it affect isoform detection or risk missing transcripts?

Thank you

**Brian Bushnell** · 02-19-2015, 11:30 AM

Sorry, I somehow missed your post!

1) This is kind of tricky. Typically, though, I would recommend normalizing data independently if it is different (such as different insert size) since it has a different use, and you don't want it all mixed together anyway. If it is the same type - for example, 2x150bp reads with short inserts - then I would normalize it all together regardless of whether it came from a different platform or library, because it will all be used the same way.

2) I recommend pre-processing (adapter trimming, contaminant removal, quality-trimming or filtering) prior to normalization, because those processes all remove spurious kmers that make it harder to determine read depth, and thus improve the normalization results.

3) If you target 40x coverage for a 1.2Gbp genome, BBNorm should output approximately 40*1.2Gbp of data. Normally it will go a little bit over to try to ensure everywhere has at least 40x.
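As a rough worked example of that estimate: coverage is total bases divided by genome size, so the expected output is target depth times genome size. The numbers below come from the question (1.2 Gbp genome, 120x input, 40x target); the variable names are only for illustration, and real output will be somewhat larger, as noted above.

```shell
#!/bin/sh
# Back-of-the-envelope estimate; sizes in Mbp to stay in integer arithmetic.
genome_mbp=1200   # estimated genome size, 1.2 Gbp
input_cov=120     # coverage of the raw data
target=40         # target coverage after normalization

in_mbp=$((genome_mbp * input_cov))
out_mbp=$((genome_mbp * target))

echo "input:  about $((in_mbp / 1000)) Gbp"
echo "output: about $((out_mbp / 1000)) Gbp"
```

This gives roughly 48 Gbp out of 144 Gbp of input, or about a third of the data.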

4) Normalizing RNA-seq data can certainly be done prior to assembly. But if you have 2 isoforms of a gene - one that uses exons 1, 2, and 3, and one that only uses exons 1 and 3, and one of them is expressed 100x more highly than the other, then after normalization, the less-expressed isoform may not get assembled, only the more abundant one. So there are definite disadvantages. But, it's worth trying if you get a bad assembly from the raw data.

**sathiyamurthi** · 02-22-2015, 11:43 PM

Dear Brian Bushnell

Thank You !!!

Could you please point me to the article that explains the BBNorm methodology in detail? I would like it both for a complete understanding and for citation.

**Brian Bushnell** · 02-23-2015, 12:13 AM

I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.

**dcard** · 03-10-2015, 12:27 PM

Optimal depth for read error correcting

Hi Brian and others,

I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correcting using BBMap or any other error correcting program. The Quake website mentioned >15x coverage but a quick round of Googling hasn't given me much more than that.

The reason I ask is that I have a couple of lanes of MiSeq data (600-cycle PE sequencing), each of which totals about 3x coverage of my genome. Therefore, a kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100bp PE) and a few GAII lanes (so 50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

Therefore, my second question is about what workflow would be best to accomplish this task? Are there some settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to give the coverage necessary to confidently correct? Thanks in advance for the help.
