Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

GenoMax replied

10-02-2017, 04:03 AM
@Wouter: @Brian would be along with an official answer but this may be because most NGS aligners are non-determinstic by default i.e. they may not produce exactly the same answer for each run.

bbmap.sh provides a "determinisitc" flag that you can set to "true" to get exactly identical results each time. I don't see bbsplit listed as having that flag but it may support it since it is based on bbmap. @Brian will confirm.

You would also need to consider how similar your references are and how you are choosing to handle multi-mappers (reads which map across genomes). Are you leaving the ambiguous2= option default. It may help to post the entire bbsplit command you are using for this analysis.
Leave a comment:
wlokhorst replied

10-02-2017, 03:01 AM
Hello Brian,

Your tool is great and on first sight it seems to work perfectly, but after running several batches of samples I've noticed some quirky behaviour.

To investigate this, I've run the BBSplit tool several times on the same sample (and deleting the index every time) and it doesn't give me the same results every time. I've pasted the summarised results below and, as you can see, they differ with every run. This certainly isn't what I expected. Mapping the same reference sequences to a sample multiple times should always give me the same results, so something is wrong here.

The numbers are the the number of reads mapped to a certain reference sequence.

Code:

run1 run2 run3 run4 run5 run6 run7 run8 run9 run10 ref_seq_1 4 8 8 8 4 8 8 8 4 8 ref_seq_2 4 4 4 4 4 4 4 4 4 4 ref_seq_3 8 4 12 40 16 4 4 4 16 4 ref_seq_4 24 24 8 4 8 4 24 24 8 4 ref_seq_5 4 8 12 4 4 40 16 12 4 12 ref_seq_6 12 8 8 8 4 16 4 32 4 4 ref_seq_7 4 4 4 4 56 4 56 4 56 4 ref_seq_8 16 4 4 4 16 4 16 4 16 4 ref_seq_9 4 4 4 4 4 4 4 4 4 4 ref_seq_10 20 20 20 20 20 20 20 20 20 20

Hopefully you're able to tell me what causes this and (even better) solve it easily.

Kind regards,

Wouter Lokhorst
Last edited by GenoMax; 10-02-2017, 06:05 AM.
Leave a comment:
Brian Bushnell replied

07-17-2017, 08:52 AM
Hi Jack,

This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

Code:

bbsplit.sh ref=ecoli.fa,salmonella.fa (wait for finish) bbsplit.sh a.fq basename=out_a_%.fq bbsplit.sh b.fq basename=out_b_%.fq bbsplit.sh c.fq basename=out_c_%.fq ...etc

If each one needs different references, then either run them serially, or use a different directory/build each time.
Leave a comment:
xc611 replied

07-16-2017, 05:55 AM
Hi, Brian:

I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

Any suggestions?

Thanks

Jack
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
at fileIO.ReadWrite.readObject(ReadWrite.java:788)
at fileIO.ReadWrite.read(ReadWrite.java:1154)
at dna.ChromosomeArray.read(ChromosomeArray.java:65)
at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
at dna.Data.loadChromosomes(Data.java:272)
at align2.BBMap.loadIndex(BBMap.java:354)
at align2.BBMap.main(BBMap.java:33)
at align2.BBSplitter.main(BBSplitter.java:49)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at fileIO.ReadWrite.readObject(ReadWrite.java:783)
... 7 more
Leave a comment:
RubenD replied

03-05-2017, 08:32 PM
Originally posted by Brian Bushnell View Post

Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.

Perfect! Thanks a lot Brian.
Leave a comment:
Brian Bushnell replied

03-05-2017, 07:57 PM
Hi Ruben,

When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
Leave a comment:
RubenD replied

03-05-2017, 05:47 PM
@GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

I have first generated an index for mouse and human:

bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
ref_mouse=/gcdata/gcproj/Ruben/GENOMES/Mouse/Sequences/NCBIM37.67/Mus_musculus.NCBIM37.67.dna.toplevel.fa

Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

Thanks for your help,
Ruben
Leave a comment:
GenoMax replied

03-05-2017, 05:34 PM
@RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.
Leave a comment:
RubenD replied

03-05-2017, 03:51 PM
how to change default bbsplit parameters?

Hi all,

I would like to use bbsplit to separate my fastq files.

First I generated an index and now I am running a for loop for all my fastq's.

However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

Am I doing something wrong here?

Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
Ref merge time: 0.028 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

BBMap version 37.00
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 10
Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
Retaining first best site only for ambiguous mappings.
NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
Set genome to 2
Leave a comment:
Brian Bushnell replied

09-10-2015, 12:18 AM
Hi Luc!

The syntax for BBSplit is different from BBMap. It's like this:

bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).

Syntax:

seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.

Last edited by Brian Bushnell; 09-10-2015, 12:23 AM.
Leave a comment:
luc replied

09-09-2015, 11:45 PM
Hi Brian,

yes I am using version 35.43.

I ran for example:
bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

The results look like this:
------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 100
Minimum Score Ratio: 0.8
Mapping Mode: normal
Reads Used: 887008 (419139365 bases)

Mapping: 18.567 seconds.
Reads/sec: 47772.92
kBases/sec: 22574.22

Read 1 data: pct reads num reads pct bases num bases

mapped: 0.3272% 2902 0.3307% 1386018
unambiguous: 0.2006% 1779 0.2065% 865395
ambiguous: 0.1266% 1123 0.1242% 520623
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 0.0046% 41 0.0024% 9939
semiperfect site: 0.0046% 41 0.0024% 9939

Match Rate: NA NA 94.2618% 1332227
Error Rate: 43.3616% 2861 5.7382% 81100
Sub Rate: 43.2707% 2855 2.4932% 35237
Del Rate: 29.7818% 1965 1.9322% 27309
Ins Rate: 28.8118% 1901 1.3128% 18554
N Rate: 0.0000% 0 0.0000% 0

The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.

Last edited by luc; 09-09-2015, 11:48 PM.
Leave a comment:
Brian Bushnell replied

09-08-2015, 11:54 PM
Hi Luc,

It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.

Thanks,
Brian

P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.

Last edited by Brian Bushnell; 09-08-2015, 11:57 PM.
Leave a comment:
luc replied

09-08-2015, 11:36 PM
Hi Brian,

when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?

Thanks!
Leave a comment:
manuelkleiner replied

03-16-2015, 02:09 PM
Wow that was fast! Thank you Brian!
I will test it tomorrow and let you know how it works out for me.

@GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.
Leave a comment:
Brian Bushnell replied

03-13-2015, 04:41 PM
Manuel,

I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!
Leave a comment:

Previous 1 2 3 4 5 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News