Hi there,
I am trying to run BBSplit on a very large chromosome-level reference assembly (~24 Gb) together with its unplaced contigs (ca. 1 Gb), using the following command on a remote server (I requested a maximum of 64 GB of memory on the server):
bbsplit.sh -Xmx40g ambiguous=toss ambiguous2=toss in1=HKs_fq/HK002_L1_1_trimmed.fastq.gz in2=HKs_fq/HK002_L1_2_trimmed.fastq.gz ref=P.tabuliformis_V1.0_contig.fa,P.tabuliformis_V1.0_chr.fa basename=out_%_#.fq.gz
However, the reference-merging step produces a much smaller FASTA file (8 Gb), and the mapping step then fails with the following error:
Exception in thread "main" java.lang.AssertionError: Resizing to an non-longer array (2147483627); probable array size overflow.
at structures.ByteBuilder.expand(ByteBuilder.java:606)
at structures.ByteBuilder.append(ByteBuilder.java:379)
at dna.FastaToChromArrays2.nextScaffold(FastaToChromArrays2.java:539)
at dna.FastaToChromArrays2.makeNextChrom(FastaToChromArrays2.java:460)
at dna.FastaToChromArrays2.makeChroms(FastaToChromArrays2.java:345)
at dna.FastaToChromArrays2.main2(FastaToChromArrays2.java:153)
at align2.RefToIndex.makeIndex(RefToIndex.java:147)
at align2.BBMap.setup(BBMap.java:280)
at align2.AbstractMapper.<init>(AbstractMapper.java:58)
at align2.BBMap.<init>(BBMap.java:42)
at align2.BBMap.main(BBMap.java:30)
at align2.BBSplitter.main(BBSplitter.java:48)
---------------------------------
Is there any way for me to handle this large genome so that the merging and mapping steps complete properly?
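One thing the stack trace hints at: the failure happens while a single scaffold is being appended to a byte buffer, and the requested size (2147483627) sits just under 2^31 - 1, the hard ceiling on Java array length. A minimal sketch for listing any sequences in a FASTA file whose length reaches that ceiling is shown here. It assumes plain or gzipped FASTA input; the function names are illustrative and not part of BBTools.

```python
import gzip
import sys
from typing import Dict, Iterable, Iterator, Tuple

# Java arrays are indexed by a signed 32-bit int, so no array can be longer.
JAVA_MAX_ARRAY = 2**31 - 1  # 2147483647


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            # Keep only the first whitespace-delimited token as the name.
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def oversized(lines: Iterable[str], limit: int = JAVA_MAX_ARRAY) -> Dict[str, int]:
    """Return the records whose length is at or above the limit."""
    return {n: size for n, size in fasta_lengths(lines) if size >= limit}


def open_fasta(path: str):
    """Open a FASTA file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open_fasta(path) as handle:
            for name, size in oversized(handle).items():
                print(f"{path}\t{name}\t{size}")
```

If any chromosome is reported, that sequence on its own is longer than a single Java array can hold. Whether BBSplit can be configured to work around that is a question for the developer.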
-
@Amanda: I will need to dig through some past correspondence with Brian, but I think he recommended splitting first and then mapping, to avoid having all references present in the BAM file, which does indeed cause issues with visualization programs.
If you look at the in-line help for "ambiguous2" you can see what it is doing:
Code:
ambiguous2=<best>   Set behavior only for reads that map ambiguously to multiple different references.
                    Normal 'ambiguous=' controls behavior on all ambiguous reads;
                    Ambiguous2 excludes reads that map ambiguously within a single reference.
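To make the distinction in that help text concrete, here is a toy sketch of the two kinds of ambiguity. It is only an illustration of the wording above, not BBSplit's actual logic: it assumes "ambiguous" means an exact tie for the top alignment score, and the `Site` tuple and `classify` function are hypothetical names.

```python
from typing import List, Tuple

# (reference name, position, alignment score); a hypothetical record layout.
Site = Tuple[str, int, float]


def classify(sites: List[Site]) -> str:
    """Classify a read by where its top-scoring alignments fall."""
    if not sites:
        return "unmapped"
    best = max(score for _, _, score in sites)
    top = [site for site in sites if site[2] == best]
    if len(top) == 1:
        return "unambiguous"
    refs = {ref for ref, _, _ in top}
    if len(refs) == 1:
        # Tied sites all lie in one reference: covered by 'ambiguous=' only.
        return "ambiguous_within_one_reference"
    # Tied sites span several references: the case 'ambiguous2=' addresses.
    return "ambiguous_across_references"
```

Under this reading, a read tied between two sites in the same reference would be handled by ambiguous= alone, while a read tied between two references is the case that ambiguous2= controls.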
-
Question about BBsplit ambig2=toss and bam files
Hello!
I am using BBSplit to separate reads from a paired-end, three-species bacterial RNA-seq project. I set the flag ambig2=toss, but then I see this line in the program's printout:
"Retaining first best site only for ambiguous mappings."
To me, that looks like the default, ambiguous=best. Is that what I should be seeing? How can I tell whether the ambiguous reads are being tossed?
Additionally, I am mapping directly into a BAM file. From earlier posts, it looks like BBSplit BAM files are incompatible with IGV, but would they be okay with a feature counter like HTSeq or edgeR?
Thanks very much,
Amanda
-
Contamination from human genome?
Hi,
I am working on RNA-seq data from a non-model fish and am considering removing human contamination from the reads. Is this feasible, given that there are a number of orthologs between human and fish?
Is there any recommendation regarding the choice of "-minratio" for this case? It seems that 0.56 may be too low. (I don't have a reference genome for this non-model fish, by the way.)
P.S. I think the balance between sensitivity and specificity should be handled differently for binning (two references, i.e. host vs. contaminant, so alignment scores to both can be compared) and for decontamination (only the contaminant reference is available, so the decision rests solely on the alignment to it).
Thank you very much for your suggestions!
-
Hi Brian,
I'm trying to use BBSplit to separate RNA-seq reads from two mixed fungal samples, using the individual transcriptomes as references. I was getting some unexpected results: more reads seemed to map unambiguously to whichever reference is listed first, so I swapped the order of the references and the results changed dramatically. I have ambiguous2=toss, but it seems like it's still using the first best site. Below are my commands and refstats output. Is there anything I'm doing wrong?
Thanks,
Brian
Code:
bbsplit.sh ref=53.fasta,17.fasta \
  in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
  out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
  out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
  refstats=53_30_r1_S7.stats ambiguous2=toss

#name  %unambiguousReads  unambiguousMB  %ambiguousReads  ambiguousMB  unambiguousReads  ambiguousReads
53     41.51013           1625.01508     57.30665         2219.25878   11241396          15519266
17     1.13394            44.03152       57.30665         2219.25878   307084            15519266

bbsplit.sh ref=17.fasta,53.fasta \
  in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
  out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
  out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
  refstats=53_30_r1_S7.stats2 ambiguous2=toss

#name  %unambiguousReads  unambiguousMB  %ambiguousReads  ambiguousMB  unambiguousReads  ambiguousReads
53     21.37940           838.36051      67.54242         2623.22348   5789774           18291224
17     11.02890           426.72088      67.54242         2623.22348   2986746           18291224
Last edited by GenoMax; 08-20-2018, 08:03 AM.
-
Thanks for your quick reply. I ran a test run with both the references in the same command.
-
@Priya: You should be able to include both sequences in the same command. There is always a debate about how to handle reads that map to both species. First take a look to see how big that number is. If it is not large, then you should be able to move forward.
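One way to look at that number is to read it out of the refstats file. The sketch below is a minimal parser, assuming a whitespace-delimited file whose header matches the one shown elsewhere in this thread (#name, %unambiguousReads, ..., ambiguousReads) and reference names without spaces; `parse_refstats` and `ambiguous_fraction` are illustrative names, not BBTools functions.

```python
from typing import Dict, List


def parse_refstats(text: str) -> List[Dict[str, object]]:
    """Parse refstats text into one dict per reference, keyed by column name."""
    rows: List[Dict[str, object]] = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("#"):
            # Header line, e.g. "#name %unambiguousReads ... ambiguousReads".
            header = [field.lstrip("#") for field in fields]
            continue
        if header is None:
            raise ValueError("refstats header line not found")
        row: Dict[str, object] = {"name": fields[0]}
        for key, value in zip(header[1:], fields[1:]):
            row[key] = float(value)
        rows.append(row)
    return rows


def ambiguous_fraction(row: Dict[str, object]) -> float:
    """Share of the reads counted for this reference that were ambiguous."""
    total = row["unambiguousReads"] + row["ambiguousReads"]
    return row["ambiguousReads"] / total if total else 0.0
```

Calling `ambiguous_fraction` on each row gives a quick per-reference view of how much of the data falls into the multi-mapping category.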
-
Hi Brian,
Thanks for developing great tools for the community!
We are using BBSplit to separate insect transcripts from those of its symbiont in a ribodepleted transcriptome. However, the reference for the insect is a cDNA transcriptome, while for the symbiont it's a genome. Given that the references are of different types, do we need to map sequentially to bin reads for each species, or can we include both references, a transcriptome and a genome, in the same command?
Thanks
Priya
-
BBsplit should work here. Make sure you clean fasta headers from your genome sequences (remove spaces in headers etc, make sure they are unique). You have the option of handling multi-mapping data in various ways (discard, assign to all genomes etc) with BBSplit. So consider those carefully.
While you could make a BAM file(s) directly from BBSplit, you should split the data into separate fastq's first and then re-align to the respective genomes using "bbmap.sh". This avoids having a large number of @SQ header lines in BAM files which can cause problems with some tools (e.g. IGV).
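The header clean-up mentioned above could be done along these lines. This is a minimal sketch under my own assumptions: it keeps only the first whitespace-delimited token of each header, optionally adds a per-genome prefix, and appends a numeric suffix to duplicates. The function name and the suffix scheme are illustrative choices, not something BBSplit requires.

```python
from typing import Iterable, Iterator, Set


def clean_fasta_headers(lines: Iterable[str], prefix: str = "") -> Iterator[str]:
    """Yield FASTA lines with space-free, unique header names."""
    seen: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            base = prefix + (parts[0] if parts else "unnamed")
            name, n = base, 1
            # Append _2, _3, ... until the name has not been used before.
            while name in seen:
                n += 1
                name = f"{base}_{n}"
            seen.add(name)
            yield ">" + name
        else:
            yield line
```

Passing a different prefix for each genome (for example a short species tag) also keeps names unique across references when they are combined.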
-
Is your tool suitable for microbiome data where the reference database consists of a very large number of bacterial 16S sequences?
I am looking for a tool that will take my FASTQ reads, align them to a given bacterial database, and then report the start and stop positions of where each read aligned.
thanks!
Jen
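For any aligner that writes SAM output, the start coordinate is the POS field and the end can be derived from the CIGAR string, since only the M, D, N, = and X operations consume reference bases. A minimal sketch, assuming standard SAM records; the helper names are illustrative:

```python
import re
from typing import Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
_CONSUMES_REF = set("MDN=X")


def reference_span(pos: int, cigar: str) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) covered on the reference."""
    ops = _CIGAR_OP.findall(cigar)
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR string: {cigar!r}")
    ref_len = sum(int(n) for n, op in ops if op in _CONSUMES_REF)
    return pos, pos + ref_len - 1


def span_from_sam_line(line: str) -> Tuple[str, int, int]:
    """Return (reference name, start, end) for one mapped SAM record."""
    fields = line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    start, end = reference_span(pos, cigar)
    return rname, start, end
```

Unmapped records (CIGAR "*") raise a ValueError here, so they would need to be filtered out first.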
-
Consensus seq from bbsplit
Hi Brian,
Do you have any thoughts on how to generate a consensus sequence from the read-pairs that I've used BBsplit to align and separate from each other?
I tried out a set of samtools mpileup/bcftools commands but it is reference-based and each of my reference files has many sequences in it so I'd have to additionally pick one reference or generate a consensus reference and then re-align to it, which seems like it could introduce reference-based bias.
I also tried a set of ClustalW2/ANDES commands, which is reference-free, but the MSA is too memory-intensive for the millions of reads that I have.
I was thinking of going down the Mothur rabbit hole next, because it looks like there is hope of combining tools to collapse both identical sequences and sequences contained within others, while keeping track of counts so that the consensus reflects the true base frequencies.
Anyway, I am just wondering whether any BBMap tools are suited to this task, or whether you've used something I'm not thinking of. (I took a quick look at Clumpify, but I believe it output more than one sequence.)
Thx! Kate
-
Hi Brian,
So I've considered your 3 suggestions, and Seal seems the path of least resistance. To separate sequences that differ by 1 important SNP in the presence of other SNPs caused by sequencing error, the references need to be considered simultaneously; otherwise it is no longer clear which SNP is the important one.
So I took the approach of using BBSplit first (to enforce the 0.95 minid/idfilter that I want) and then used everything that mapped as input to Seal (defaults except k=8, ambiguous=toss).
A few read pairs come out as "Unmapped" in the Seal step even though they aligned with BBSplit. I looked at one and saw that it has 1 SNP in each of the forward and reverse ~150 bp reads. Which processing parameter do I need to change in order to "map" this read pair?
Thx, Kate
-
@Brian is doing something very specific to remove human contamination in JGI's non-human samples.
Re-reading your original question, it would be good to know if your contaminants are "close" relatives or can be considered a distant species. Success or failure of BBSplit is going to largely depend on that. No tool is going to be able to separate reads from a very closely related species based on sequence alone.
Last edited by GenoMax; 10-12-2017, 07:38 AM.
-
Originally posted by GenoMax: "While Brian will have a more detailed insight, it should be OK to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less the same with the genome or with just CDS sequences."
I see from another thread that masked genomes are used in order to prevent false positives when removing contamination.
At the same time, in this thread you suggest simply using BBSplit, which should be enough.
Xavier