Whether or not to use "prefilter" depends on how much memory you have rather than on the workflow. It makes BBNorm take roughly twice as long, but it increases accuracy when the dataset is very large relative to memory. There is no penalty for using it other than runtime, and it never hurts accuracy, but the gain is trivial if you have plenty of memory. So if you have lots of RAM or a small dataset, you don't need it.
In your case the dataset has approximately 5 billion unique kmers, which is what the loglog.sh output (the "Cardinality" value) means.
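For reference, that estimate comes from a loglog.sh run along these lines (file name illustrative):
loglog.sh in=reads.fq
The "Cardinality" value it reports is an approximate count of the distinct kmers in the input, which is the number that matters when deciding whether you need prefilter.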
As for BBNorm's memory use:
-Xmx is a Java flag that specifies how much heap memory Java will use. This is most, but not all, of the memory your job will use; there is some overhead. Normally BBNorm will auto-detect how much memory is available and everything should be fine without you specifying -Xmx, but that depends on the job manager and system configuration. If you do set -Xmx manually, it must be lower than the memory you request from the scheduler, not higher - I recommend about 84% of the request on our cluster, but this varies. So, basically, if you submit a job requesting 100G, set -Xmx84g. If the job still gets killed by the scheduler, decrease -Xmx rather than increasing it.
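As a purely illustrative sketch (assuming a SLURM-style scheduler and illustrative file names; your submission syntax may differ), the idea looks like this:
sbatch --mem=100G bbnorm_job.sh
# then, inside bbnorm_job.sh, keep the Java heap at roughly 84% of the request:
bbnorm.sh -Xmx84g in=reads.fq out=normalized.fq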
For 5 billion unique kmers, I recommend using the prefilter flag. The overall command would be something like:
bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 prefilter=t lowbindepth=10 highbindepth=80
Even though BBNorm will report a "target depth" and a "min depth", those values will not affect your outputs - they only apply to reads sent to the "out=" stream (which you did not specify), not to reads sent to "outlow=" and so forth. Sorry, it's a little confusing.
Introducing BBNorm, a read normalization and error-correction tool
-
A question regarding the partitioning option of BBNorm
Hi,
I want to preferentially assemble the genome of a low-abundance community member from a metagenome, so I am interested in the partitioning option of BBNorm.
I have some questions on how to choose the best parameters though:
- For the other BBNorm workflows (normalization, filtering, error correction) you recommend the "prefilter" option. Is this also advisable for the partitioning workflow? (This option appears in most of the example usages of BBNorm in the documentation EXCEPT the partitioning workflow.)
- From the description, I assumed that by giving "outlow", "outmid" and "outhigh" arguments, the usual normalization workflow would be overridden and ALL reads would be grouped into one of these categories. However, the preliminary output of BBNorm states that a "target depth" of 100 and a "min depth" of 5 is being applied. Does that mean that all reads below a coverage of five will be discarded? Do I need to adjust the "mindepth" parameter as well?
- Our job-submission pipeline requires specifying a maximum RAM usage for every script that is started. However, BBNorm keeps exceeding this value, which leads to termination of the job. I kept increasing BBNorm's memory limit via the "-Xmx" argument, up to 200G, but BBNorm always exceeds the allotted limit (even when using the "prefilter" option mentioned above).
Do I have to consider any additional memory requirements of the script, on top of the "-Xmx" limit? How would I determine how much memory is needed?
(The dataset consists of about 84,547,019 read pairs; loglog.sh calculated a "Cardinality" of 5,373,179,884, but I do not know exactly how to interpret this value.)
Thanks for any suggestions.
-
Originally posted by evanname: Brian, thank you so much for the excellent tools!
Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?
For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBnorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?
Thanks!
I recommend using Tadpole for error-correction now; it is substantially better than BBNorm because it uses exact kmer counts and algorithms designed to take advantage of them. I now only use BBNorm for normalization and for plotting kmer-frequency histograms of datasets too big to fit into memory, not for error-correction.
I don't recommend doing error-correction at all on data in which you hope to find rare SNPs. That said, by default BBNorm calls a base an error only when there is at least a 1:140 ratio between its kmer counts and those of the adjacent kmers, so a 2% SNP should be safe. Tadpole, on the other hand, defaults to a 1:16 ratio for detecting errors, which is much more aggressive and would wipe out a 2% SNP. Why is it more aggressive? Well... I tried to optimize the parameters for the best Spades assemblies, and Spades seems to perform best with pretty aggressive error-correction. You can change that threshold, though.
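For reference, a minimal Tadpole error-correction invocation (file names illustrative, thresholds left at their defaults) would look something like:
tadpole.sh in=reads.fq out=corrected.fq mode=correct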
-
Garbage collection
If you are using Oracle's JVM (and perhaps others too), what you're seeing as excess CPU consumption from BBNorm might actually stem from garbage collection within the JVM. This really depends on an application's behaviour.
There has been a lot of work on the performance of garbage collectors in Java and there are a few to choose between.
As a quick validation test, you could try forcing the single-threaded collector by adding the following option to the java invocation inside the bbnorm.sh script (sorry, there doesn't seem to be a way to pass it in on the command line):
-XX:+UseSerialGC
Alternatively, you could cap the number of threads the parallel collectors use with something like:
-XX:ParallelGCThreads=4 -XX:ConcGCThreads=4
Lots of further information can be found in Oracle's documentation of the JVM options.
Keep in mind that the serial GC means the program will likely pause briefly at GC events, so at best you should expect a runtime penalty if the parallel GC was already working quite hard.
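Concretely, that means editing the java launch line inside bbnorm.sh. The sketch below is schematic - the exact line in your copy of the script will look different - but the change is just adding the GC flag next to -Xmx:
# original launch line (roughly):
java -ea -Xmx100g -cp "$CP" jgi.KmerNormalize "$@"
# with the serial collector forced:
java -ea -Xmx100g -XX:+UseSerialGC -cp "$CP" jgi.KmerNormalize "$@"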
-
Brian, thank you so much for the excellent tools!
Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?
For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBnorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?
Thanks!
-
40 cores, Intel(R) Xeon(R) CPU E7-8860 @ 2.27GHz
1 TB RAM
Dataset: 2x 75 million 125bp reads
Thanks!
-
Will have to wait on @Brian.
For reference: How many cores/memory is available on this system? What is the size of the dataset?
-
Can you test these two options "gunzip=f bf2=f" and report what happens?
-
Hi
Thanks for the suggestion, but it didn't help. The load goes through the ceiling. I used it like so:
bbnorm.sh -Xmx100g in= in2= out= out2= target=200 mindepth=3 threads=4 pigz=f unpigz=f
I could do much more than 4 threads but just wanted to see what happens.
Best
-
While @Brian will be along later with an official answer, I feel that this may not be directly related to BBMap. If you have pigz installed on your machine, then BBMap tools use it by default to decompress files, and that program may be starting additional threads that overwhelm your system.
If pigz is installed, you could turn it off by adding "pigz=f unpigz=f" to your BBMap tool commands and see if that stops the problem. Do keep using the threads= option. You are not running this under a job scheduler, correct?
-
Hi Brian
I am having difficulty controlling the load of BBNorm on our server. Regardless of what number I enter for threads=, it always uses all idle cores, and the load average eventually goes above the number of cores.
our java is:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
Any solution?
Best
-
Oh... sorry, the explanation is a bit different here. By default BBNorm runs in 2-pass mode, which gives the best normalization. However, that generates temp files (which are later deleted), and the final outputs come only from the second pass - reads discarded in the first pass disappear completely.
For what you are doing I recommend this command:
bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t target=1000 min=10 passes=1 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=excluded.fq.gz
Then the output numbers should add up as expected.
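To sanity-check, you can then count reads in each output with reformat.sh (file names as in the command above) and verify that they sum to the input:
reformat.sh in=bbnorm.fq1.gz in2=bbnorm.fq2.gz
reformat.sh in=excluded.fq.gz
# kept reads plus excluded reads should now equal the number of input reads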
-
Hi Brian,
Thanks very much for your quick reply. Following your suggestion, I ran reformat.sh to calculate the exact number of bases in all files, but the cause does not seem to be related to compression! See below:
1) Reads before running bbnorm.sh:
reads 1 - Input: 2728414 reads, 336952602 bases
reads 2 - Input: 2728414 reads, 338676300 bases
2) Reads after:
reads 1 - Input: 1307784 reads, 162040282 bases
reads 2 - Input: 1307784 reads, 162053968 bases
excluded reads - Input: 767030 reads, 95017289 bases
Thanks,
Xiao
-
Hi Xiao,
The size difference is likely due to compression, and the fact that error-free reads compress better than reads with errors. Comparing the sizes of compressed files tends to be misleading. If you want to know the truth, look at the actual amount of data in the files; for example, "reformat.sh in=file.fq.gz" will tell you the exact number of bases in the file.