Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Yumeko replied

12-19-2023, 08:34 PM
Hi there,
I am trying to run BBSplit on a huge chromosome-level assembled reference genome (~24 Gb) together with its contigs that are not assembled to chromosome level (ca. 1 Gb), using the following command on a remote server (I set the maximum memory use on the server to 64G).

bbsplit.sh -Xmx40g ambiguous=toss ambiguous2=toss in1=HKs_fq/HK002_L1_1_trimmed.fastq.gz in2=HKs_fq/HK002_L1_2_trimmed.fastq.gz ref=P.tabuliformis_V1.0_contig.fa,P.tabuliformis_V1.0_chr.fa basename=out_%_#.fq.gz

However, the reference-merging step produces a much smaller FASTA (8 Gb), and the mapping step also produces the following error:

Exception in thread "main" java.lang.AssertionError: Resizing to an non-longer array (2147483627); probable array size overflow.
    at structures.ByteBuilder.expand(ByteBuilder.java:606)
    at structures.ByteBuilder.append(ByteBuilder.java:379)
    at dna.FastaToChromArrays2.nextScaffold(FastaToChromArrays2.java:539)
    at dna.FastaToChromArrays2.makeNextChrom(FastaToChromArrays2.java:460)
    at dna.FastaToChromArrays2.makeChroms(FastaToChromArrays2.java:345)
    at dna.FastaToChromArrays2.main2(FastaToChromArrays2.java:153)
    at align2.RefToIndex.makeIndex(RefToIndex.java:147)
    at align2.BBMap.setup(BBMap.java:280)
    at align2.AbstractMapper.<init>(AbstractMapper.java:58)
    at align2.BBMap.<init>(BBMap.java:42)
    at align2.BBMap.main(BBMap.java:30)
    at align2.BBSplitter.main(BBSplitter.java:48)
---------------------------------

Is there any way for me to handle this large genome so that the merging and mapping steps complete properly?
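
The number in the assertion, 2147483627, sits just below 2^31 - 1 = 2147483647, the largest signed 32-bit integer and the upper bound on a Java array length. One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that a single sequence in the reference is longer than that bound. A minimal Python sketch for checking this follows; the helper names `fasta_lengths` and `oversized` are hypothetical and are not part of BBTools.

```python
import gzip

JAVA_INT_MAX = 2**31 - 1  # 2147483647, the upper bound on a Java array length


def fasta_lengths(path):
    """Yield (name, length) for each record in a FASTA file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    name, length = None, 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length
                # Keep only the first word of the header as the record name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                length = 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length


def oversized(path, limit=JAVA_INT_MAX):
    """Return the (name, length) records whose length exceeds `limit`."""
    return [(n, size) for n, size in fasta_lengths(path) if size > limit]
```

Calling `oversized("P.tabuliformis_V1.0_chr.fa")` would list any sequence in the chromosome-level file that is longer than the bound; an empty list would point away from this explanation.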
GenoMax replied

03-28-2020, 03:54 AM
@Amanda: I will need to dig through some past correspondence with Brian, but I think he had recommended splitting first and then mapping, to avoid the problem of having all references present in the BAM file, which indeed causes issues with visualization programs.

If you look at the in-line help for "ambiguous2" you can see what it is doing:

Code:

ambiguous2=<best> Set behavior only for reads that map ambiguously to multiple different references. Normal 'ambiguous=' controls behavior on all ambiguous reads; Ambiguous2 excludes reads that map ambiguously within a single reference.
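
The distinction drawn in the help text can be pictured with a toy Python model. This is not BBSplit's implementation; the function name `classify` and the representation of a read's equally good sites as `(reference, position)` pairs are assumptions made purely for illustration.

```python
def classify(best_sites):
    """Classify a read by its equally good mapping sites.

    best_sites: list of (reference_name, position) tuples sharing the top score.
    """
    if not best_sites:
        return "unmapped"
    if len(best_sites) == 1:
        return "unambiguous"
    references = {ref for ref, _ in best_sites}
    if len(references) == 1:
        # Several sites, all in one reference: governed by 'ambiguous=' only.
        return "ambiguous within one reference"
    # Sites in more than one reference: the case 'ambiguous2=' addresses.
    return "ambiguous across references"
```

In this picture, a read with two top-scoring sites in the same reference falls under `ambiguous=` but not `ambiguous2=`, while a read with top-scoring sites in two different references is the case that `ambiguous2=` is described as controlling.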
ahurley2 replied

03-27-2020, 12:49 PM
Question about BBsplit ambig2=toss and bam files

Hello!

I am using BBSplit to separate reads from a paired-end, three-species bacterial RNA-seq project. I set the flag ambig2=toss but then see this sentence in the printed output:

"Retaining first best site only for ambiguous mappings."

To me, that looks like default ambiguous=best. Is that what I should be seeing? How do I know if the ambiguous reads are being tossed?

Additionally, I am mapping directly into a BAM file. From earlier posts, it looks like BBSplit BAM files are incompatible with IGV, but would they be okay with a feature counter like HTSeq or edgeR?

Thanks very much,
Amanda
phuongbigbig replied

10-08-2018, 01:48 AM
Contamination from human genome?

Hi,

I am working on RNA-seq data from a non-model fish and am considering removing human contamination from the reads. Is this feasible, given that there are a number of orthologs between human and fish?
Is there any recommendation regarding the choice of "minratio" for this case? It seems that 0.56 may be too low. (I don't have a reference genome for this non-model fish, by the way.)

P.S.: I think the sensitivity/specificity strategy should differ between binning (two references, i.e. host vs. contaminant, so the alignment scores can be compared with each other) and decontamination (only the contaminant reference is available, so the judgement rests on the alignment to the contaminant alone).

Thank you very much for your suggestion !

kcamnairb replied

08-20-2018, 05:59 AM

Hi Brian,

I'm trying to use bbsplit to separate rnaseq reads from two mixed fungal samples. I'm using the individual transcriptomes as references. I was getting some unexpected results. It seemed that more reads were unambiguously mapping to the reference that is listed first, so I swapped the order of the references and the results changed dramatically. I have ambiguous2=toss, but it seems like it's still using the first best site. Below are my commands and refstats output. Is there anything I'm doing wrong?

Thanks,
Brian

Code:

bbsplit.sh ref=53.fasta,17.fasta \
        in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
        out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
        out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
        refstats=53_30_r1_S7.stats ambiguous2=toss

#name	%unambiguousReads	unambiguousMB	%ambiguousReads	ambiguousMB	unambiguousReads	ambiguousReads
53	41.51013	1625.01508	57.30665	2219.25878	11241396	15519266
17	1.13394	44.03152	57.30665	2219.25878	307084	15519266        
        
bbsplit.sh ref=17.fasta,53.fasta \
        in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
        out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
        out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
        refstats=53_30_r1_S7.stats2 ambiguous2=toss

#name	%unambiguousReads	unambiguousMB	%ambiguousReads	ambiguousMB	unambiguousReads	ambiguousReads
53	21.37940	838.36051	67.54242	2623.22348	5789774	18291224
17	11.02890	426.72088	67.54242	2623.22348	2986746	18291224
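
One way to compare two refstats tables like these is to load them and subtract one from the other. A sketch in Python, assuming the tab-separated layout shown in the post; `read_refstats` and `shift` are hypothetical helper names, not BBTools utilities.

```python
import csv
import io


def read_refstats(text):
    """Parse refstats text into {reference name: {column: float}}."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    # The header line starts with '#'; drop it from the first column name.
    header = [h.strip().lstrip("#") for h in next(rows)]
    table = {}
    for row in rows:
        row = [cell.strip() for cell in row]
        if not row or not row[0]:
            continue  # skip blank or whitespace-only lines
        table[row[0]] = {k: float(v) for k, v in zip(header[1:], row[1:])}
    return table


def shift(first, second, column="unambiguousReads"):
    """Return {reference: second - first} for one column, for shared references."""
    return {ref: second[ref][column] - first[ref][column]
            for ref in first if ref in second}
```

Applied to the two tables in the post, the `unambiguousReads` column drops by 5,451,622 for reference 53 and rises by 2,679,662 for reference 17 once the reference order is swapped.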

Last edited by GenoMax; 08-20-2018, 08:03 AM.


rajarapupriya replied

06-01-2018, 10:09 AM
Thanks for your quick reply. I ran a test run with both the references in the same command.
GenoMax replied

06-01-2018, 09:26 AM
@Priya: You should be able to include both sequences in the same command. There is always a debate about how to handle the multi-mapping (to both species) reads. First take a look to see how big that number is. If it is not large then you should be able to move forward.
rajarapupriya replied

06-01-2018, 09:05 AM
Hi Brian,

Thanks for developing great tools for the community !!
We are using bbsplit to separate insect and symbiont transcripts from a ribodepleted transcriptome. However, the reference for the insect is a cDNA transcriptome, while the reference for the symbiont is a genome. Given that the references are of different types, do we need to map sequentially to bin each species, or can we include both references, a transcriptome and a genome, in the same command?

Thanks
Priya
GenoMax replied

05-25-2018, 07:45 AM
BBsplit should work here. Make sure you clean fasta headers from your genome sequences (remove spaces in headers etc, make sure they are unique). You have the option of handling multi-mapping data in various ways (discard, assign to all genomes etc) with BBSplit. So consider those carefully.

While you could make a BAM file(s) directly from BBSplit, you should split the data into separate fastq's first and then re-align to the respective genomes using "bbmap.sh". This avoids having a large number of @SQ header lines in BAM files which can cause problems with some tools (e.g. IGV).
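
The header clean-up mentioned above can be done with any text tool. A minimal Python sketch, assuming that keeping only the first whitespace-delimited word of each header and appending a counter to repeated names is acceptable for your data; `clean_headers` is a hypothetical helper name.

```python
def clean_headers(lines):
    """Yield FASTA lines with headers cut at the first whitespace and made unique."""
    used = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            base = fields[0] if fields else "unnamed"
            name, n = base, 1
            # Append _2, _3, ... until the name has not been seen before.
            while name in used:
                n += 1
                name = f"{base}_{n}"
            used.add(name)
            yield ">" + name
        else:
            yield line
```

The generator accepts any iterable of lines, so an open file handle can be passed directly and the cleaned lines written to a new FASTA file before it is given to `ref=`.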
JenBarb replied

05-25-2018, 07:29 AM
Is your tool suitable for microbiome data where the reference database is a very large number of bacterial 16S sequences?
I am looking for a tool that will take my fastq reads, align them to a given bacterial database, and then report the start and stop positions where the reads aligned.

thanks!
Jen
GenoMax replied

11-14-2017, 04:48 AM
@Kate: You should take a look at tadpole.sh from the BBMap suite, which is a k-mer based assembler. Someone recently used it to assemble the axolotl genome, so whatever you are working with may be feasible to do. See post #64 to get started.
sk8bro replied

11-13-2017, 09:07 PM
Consensus seq from bbsplit

Hi Brian,

Do you have any thoughts on how to generate a consensus sequence from the read-pairs that I've used BBsplit to align and separate from each other?

I tried out a set of samtools mpileup/bcftools commands but it is reference-based and each of my reference files has many sequences in it so I'd have to additionally pick one reference or generate a consensus reference and then re-align to it, which seems like it could introduce reference-based bias.

I also tried out a set of ClustalW2/ANDES commands that is reference-free but it is too memory-intensive doing the MSA for the millions of reads that I have

I was thinking to go down the Mothur rabbit hole next because it looks like there is hope of combining tools to compress both unique sequences and sequences contained in others, and then keep track such that the consensus reflects the true base frequencies

Anyway, I am just wondering whether any BBMap tools are suited to this task, or whether you've used something I'm not thinking of (I took a quick look at clumpify, but I believe it output more than one sequence).

Thx! Kate
sk8bro replied

10-13-2017, 02:12 PM
Hi Brian,

So I've considered your 3 suggestions, and Seal seems the path of least resistance. To separate sequences that differ by 1 SNP of importance, in the presence of other SNPs caused by sequencing error, the references need to be considered simultaneously; otherwise it is no longer clear which SNP is the important one.

So I took the approach of using BBSplit (to enforce the 0.95 minid/idfilter that I want), and then used everything that mapped as input to Seal (default settings except k=8, ambiguous=toss).

There are a few read pairs that are coming out as "Unmapped" in the Seal step, but they aligned with BBsplit. I looked at one and saw it has 1 SNP in each of F and R ~150 bp reads. Which Processing parameter do I need to change in order to "Map" this read pair?

Thx, Kate
GenoMax replied

10-12-2017, 07:17 AM
@Brian is doing something very specific to remove human contamination in JGI's non-human samples.

Re-reading your original question it would be good to know if your contaminants are "close" relatives or can be considered a distant species. Success or failure of bbsplit is going to largely depend on that. No tool is going to be able to separate reads from a very closely related species based on sequence alone.

Last edited by GenoMax; 10-12-2017, 07:38 AM.
Microalgues replied

10-12-2017, 06:28 AM
Originally posted by GenoMax:

While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.

Thanks GenoMax,

I see from another thread that masked genomes are used to prevent false positives when removing contamination.

At the same time, in that thread you suggest simply using BBSplit, which should be enough.

Originally posted by GenoMax:

You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

I assume you have seen BBsplit, which can be used for this purpose.

With thanks,
Xavier