Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

GenoMax replied

10-12-2017, 04:40 AM
While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.
Leave a comment:
Microalgues replied

10-12-2017, 03:53 AM
Biological question about contaminants DB

Dear Brian,

I have created a de novo transcriptome of a non-model organism. After BLASTing the contigs I found some homology with possible contaminants. My question is if I should use DNA genomes (un/masked) or the CDS gene predictions from ENSMBL database.

In the first case, do you recommend the unmasked, the masked version of genomes (ensmbl) or should I use BBMask?
And in the second case, could BBsplit handle with CDS mapping?

I am a bit lost, I think that contigs from de novo transcriptome have to be mapped against CDS sequences.

With thanks,
Xavier
Leave a comment:
sk8bro replied

10-11-2017, 01:48 PM
Thanks Brian, I will look into the options you suggested to handle this use case!

Kate
Leave a comment:
Brian Bushnell replied

10-11-2017, 01:43 PM
Originally posted by GenoMax View Post

You can set nodisk=t to prevent the index from being written to disk (memory will always be used).

Actually, "nodisk" does not work with BBSplit... sorry! I'll clarify that in the documentation. It's not like it's impossible to make it work, but it would be pretty complicated; one of those things that I would do if I could clone myself.

Hi Brian,

I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

I am trying to remove a nearly-identical putative contaminant sequence from my data...

(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

Thx! Kate

Hi Kate,

You can certainly set "perfectmode" to only allow mappings that are 100% identity to the reference. For example, you could run once in perfectmode, and then run the remaining unmapped reads normally. But generally, yes, this is intended behavior. BBSplit is intended to do things like separate mouse reads from human reads when a mouse is used as the vector for some human DNA study. Also, it is designed to separate reads belonging to various bacteria in a metagenome, and chloroplast/plant reads in a plant. In any of these cases, if there is a single bp mismatch, you can't unambiguously assign a read to one reference or the other, since it could be a sequencing error or (more importantly) an actual variation.

You might find Seal (seal.sh) useful. It has functionality similar to BBSplit, but is alignment free (meaning it is way faster, but uses more memory). It allows you to specify cutoffs, so you can for example send all of the perfectly-matching reads to one file, regardless of whether they almost match another genome. It decides which file to assign a read to based on the number of kmers (by default, 31-mers) matching that reference. When dealing with bacteria, I always prefer Seal over BBSplit because it is so much faster and easier to use, and bacteria have tiny genomes that are under high evolutionary pressure to avoid low-complexity sequence. Euks have huge genomes with a lot of low-complexity regions so BBSplit is better in that case, since it is more precise.

If you want to remove a particular contaminant from your data, I suggest trying BBDuk. BBDuk does not allow kmers longer than 31. However, it emulates longer kmers. For example, if you set "k=90", then it will consider reads as matching (by default, they get discarded) if there is a 90bp stretch in which all 31-mers match 31-mers in the reference. In practice this is very similar to matching 90-mers.

So, for example:

bbduk.sh in=reads.fq out=filtered.fq ref=contaminant.fa k=90 mm=f

That will remove all reads containing at least 90bp that shares all kmers with your reference (which should be the unwanted sequence). The exact values depend on the length of the sequence in question. Note that read length is very important here; if a read only overlaps 80bp of the 90bp sequence in question, it would not be removed.
Leave a comment:
sk8bro replied

10-11-2017, 01:04 PM
Hi Brian,

I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

I am trying to remove a nearly-identical putative contaminant sequence from my data...

(FYI I have ambiguous is set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but the reads are almost all considered to be ambiguous in that scenario too)

Thx! Kate

Last edited by sk8bro; 10-11-2017, 01:08 PM.
Leave a comment:
GenoMax replied

10-05-2017, 04:39 AM
For some other programs in BBMap suite there has been a "race condition" which produces a nasty (though according to @Brian a harmless) error related to threads.

At least in this case the result appears to be identical. In the last table you had given us the results all looked different.

Not sure why you still have this in the output, if you had deleted the ref folder:

NOTE: Ignoring reference file because it already appears to have been processed.
NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt

You can set nodisk=t to prevent the index from being written to disk (memory will always be used).

Last edited by GenoMax; 10-05-2017, 04:43 AM.
Leave a comment:

wlokhorst replied

10-05-2017, 03:29 AM

40428 reads indeed. The number of threads shouldn't matter for the succes or failure of the process. However, I did run it on 4 threads for you, so here are the results:

Code:

BBMap version 37.53
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Set threads to 4
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:       1.647 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        1.542 seconds.
Analyzed Index:         3.989 seconds.
Cleared Memory:         6.978 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 4 mapping threads.
Detecting finished threads: 0, 1, 2, 3

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              20
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             40428   (3919820 bases)

Mapping:                10.603 seconds.
Reads/sec:              3812.77
kBases/sec:             369.68


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.5464%            1838         4.3524%             170607
unambiguous:              1.2120%             490         1.2217%              47890
ambiguous:                3.3343%            1348         3.1307%             122717
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.7173%             290         0.6776%              26561
semiperfect site:         0.7297%             295         0.6905%              27066

Match Rate:                   NA               NA        91.4514%             156445
Error Rate:              42.4134%            1543         8.4773%              14502
Sub Rate:                42.1935%            1535         7.7799%              13309
Del Rate:                 2.2265%              81         0.2701%                462
Ins Rate:                 4.1506%             151         0.4273%                731
N Rate:                   0.2474%               9         0.0713%                122

Total time:             25.038 seconds.

This also includes removal of the ref folder (which I do every time before testing the next run) and re-creation with the first command I've given you before.

Leave a comment:

Brian Bushnell replied

10-04-2017, 09:24 AM
The numbers of mapped reads do not add up to the same record in each case... it seems like the program is crashing and not processing all the reads, or else the input is different. The output is deterministic for single-ended reads so I can't see any other explanation. Does the program print an error message?
Leave a comment:
GenoMax replied

10-02-2017, 06:06 PM
@Brian: Reads are single end 100 bp. Commands used are in post #37
Leave a comment:
Brian Bushnell replied

10-02-2017, 03:26 PM
That's odd, if you are not intentionally producing random assignments for sequences. Can you please post your full command line? Also, are the reads paired or unpaired?
Leave a comment:
wlokhorst replied

10-02-2017, 08:34 AM
Neither deterministic=t nor ambiguous2=all helped change the results. Unless you or Brian know a way of solving this, I'm sorry to say that I'll try using another tool (or attempt writing something of my own).

Reads are 100bp.
Leave a comment:
GenoMax replied

10-02-2017, 06:00 AM
You could try changing the ambiguous2= parameter and see how that changes the result. "toss/all" may generate results that look more similar. With very similar genomes you are going to have trouble assigning reads reproducibly. How long are these reads?
Leave a comment:
wlokhorst replied

10-02-2017, 04:22 AM
@GenoMax: I did not consider the non-deterministic behaviour of mappers, however BBMap should still be deterministic in my case, since I have single-end reads. See copied text from the BBMap help section below. I could still try to explicitly call this though.

Run in deterministic mode. In this case it is good to set averagepairdist. BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded.

My references are very similar in some cases (e.g. different strains of the same species) and I'm leaving the ambiguous parameters untouched.

Commands:
bash ~/bbmap/bbsplit.sh ref={}
# ref= a folder containing my reference sequences
bash ~/bbmap/bbsplit.sh in={} basename={}/out_%.fq
# in= sample, basename= output folder and output filename
Leave a comment:

Previous 1 2 3 4 5 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News