  • Yumeko
    replied
    Hi there,
    I am trying to run BBSplit on a huge chromosome-level reference assembly (~24 Gb) together with its unplaced contigs (ca. 1 Gb), using the following command on a remote server (I set the maximum memory use on the server to 64G).

    bbsplit.sh -Xmx40g ambiguous=toss ambiguous2=toss in1=HKs_fq/HK002_L1_1_trimmed.fastq.gz in2=HKs_fq/HK002_L1_2_trimmed.fastq.gz ref=P.tabuliformis_V1.0_contig.fa,P.tabuliformis_V1.0_chr.fa basename=out_%_#.fq.gz

    But the reference-merging step produces a much smaller (8 Gb) FASTA, and the mapping step also produces the following error:

    Exception in thread "main" java.lang.AssertionError: Resizing to an non-longer array (2147483627); probable array size overflow.
        at structures.ByteBuilder.expand(ByteBuilder.java:606)
        at structures.ByteBuilder.append(ByteBuilder.java:379)
        at dna.FastaToChromArrays2.nextScaffold(FastaToChromArrays2.java:539)
        at dna.FastaToChromArrays2.makeNextChrom(FastaToChromArrays2.java:460)
        at dna.FastaToChromArrays2.makeChroms(FastaToChromArrays2.java:345)
        at dna.FastaToChromArrays2.main2(FastaToChromArrays2.java:153)
        at align2.RefToIndex.makeIndex(RefToIndex.java:147)
        at align2.BBMap.setup(BBMap.java:280)
        at align2.AbstractMapper.<init>(AbstractMapper.java:58)
        at align2.BBMap.<init>(BBMap.java:42)
        at align2.BBMap.main(BBMap.java:30)
        at align2.BBSplitter.main(BBSplitter.java:48)
    ---------------------------------

    Is there any way for me to handle this large genome so that the merging and mapping steps complete properly?
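
The size in the assertion (2147483627) sits just under 2^31 - 1, the largest length a Java array can have, so one possible reading is that a single sequence in the reference is too long for the buffer that holds it. That is a guess from the stack trace, not confirmed BBTools behaviour. A minimal sketch for checking sequence lengths, where `fasta_lengths` is a hypothetical helper name and the default limit is simply the number quoted in the error message:

```shell
# Sketch: list every sequence in a FASTA file with its length and flag
# those longer than a limit. Treating the number from the AssertionError
# as a per-sequence ceiling is an assumption, not documented behaviour.
fasta_lengths() {
    # $1 = FASTA file (plain text; decompress first if it is gzipped)
    # $2 = optional length limit in bases
    awk -v limit="${2:-2147483627}" '
        function report() {
            if (seen) {
                flag = (len > limit) ? "TOO_LONG" : "ok"
                printf "%s\t%.0f\t%s\n", name, len, flag
            }
        }
        /^>/ {
            report()               # emit the previous sequence, if any
            name = substr($1, 2)   # header text up to the first whitespace
            len = 0
            seen = 1
            next
        }
        { len += length($0) }      # sum residues across wrapped lines
        END { report() }
    ' "$1"
}
```

Running `fasta_lengths P.tabuliformis_V1.0_chr.fa | grep TOO_LONG` would then show whether any chromosome exceeds that size.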


  • GenoMax
    replied
    @Amanda: I will need to dig through some past correspondence with Brian, but I think he had recommended splitting first and then mapping, to avoid the problem of having all references present in the BAM file, which does indeed cause issues with visualization programs.

    If you look at the in-line help for "ambiguous2" you can see what it is doing:
    Code:
    ambiguous2=<best>    Set behavior only for reads that map ambiguously to multiple different references.
                         Normal 'ambiguous=' controls behavior on all ambiguous reads;
                         Ambiguous2 excludes reads that map ambiguously within a single reference.


  • ahurley2
    replied
    Question about BBsplit ambig2=toss and bam files

    Hello!

    I am using BBsplit to separate reads from a paired-end, three-species bacterial RNA-seq project. I set the flag ambig2=toss but then see this sentence in the program's printout:

    "Retaining first best site only for ambiguous mappings."

    To me, that looks like default ambiguous=best. Is that what I should be seeing? How do I know if the ambiguous reads are being tossed?

    Additionally, I am mapping directly into a BAM file. From earlier posts, it looks like BBsplit BAM files are incompatible with IGV, but would they be okay with a feature counter like HTSeq or edgeR?

    Thanks very much,
    Amanda


  • phuongbigbig
    replied
    Contamination from human genome?

    Hi,

    I am working on RNA-seq data from a non-model fish and am considering removing human contamination from the reads. Is this feasible, given the number of orthologs shared between human and fish?
    Is there any recommendation regarding the choice of "-minratio" for this case? It seems that 0.56 may be too low. (I don't have a reference genome for this non-model fish, by the way.)

    P.S.: I think the sensitivity/specificity strategy should differ between binning (two references, i.e. host vs. contaminant, whose alignment scores can be compared) and decontamination (only the contaminant reference is available, so the judgement rests on the alignment to the contaminant alone).

    Thank you very much for your suggestion !


  • kcamnairb
    replied
    Hi Brian,

    I'm trying to use bbsplit to separate rnaseq reads from two mixed fungal samples. I'm using the individual transcriptomes as references. I was getting some unexpected results. It seemed that more reads were unambiguously mapping to the reference that is listed first, so I swapped the order of the references and the results changed dramatically. I have ambiguous2=toss, but it seems like it's still using the first best site. Below are my commands and refstats output. Is there anything I'm doing wrong?

    Thanks,
    Brian
    Code:
    bbsplit.sh ref=53.fasta,17.fasta \
            in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
            out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
            out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
            refstats=53_30_r1_S7.stats ambiguous2=toss
    
    #name	%unambiguousReads	unambiguousMB	%ambiguousReads	ambiguousMB	unambiguousReads	ambiguousReads
    53	41.51013	1625.01508	57.30665	2219.25878	11241396	15519266
    17	1.13394	44.03152	57.30665	2219.25878	307084	15519266        
            
    bbsplit.sh ref=17.fasta,53.fasta \
            in=53_30_r1_S7_R1_001.fastq.gz in2=53_30_r1_S7_R2_001.fastq.gz \
            out_17=map17_53_30_r1_S7_R#_001.fastq.gz \
            out_53=map53_53_30_r1_S7_R#_001.fastq.gz \
            refstats=53_30_r1_S7.stats2 ambiguous2=toss
    
    #name	%unambiguousReads	unambiguousMB	%ambiguousReads	ambiguousMB	unambiguousReads	ambiguousReads
    53	21.37940	838.36051	67.54242	2623.22348	5789774	18291224
    17	11.02890	426.72088	67.54242	2623.22348	2986746	18291224
    Last edited by GenoMax; 08-20-2018, 08:03 AM.
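
To see how much the assignment shifts when the reference order is swapped, the two refstats tables can be compared per reference. A minimal sketch, assuming the tab-separated layout shown above with unambiguousReads in column 6; `compare_refstats` is a hypothetical helper name, not part of BBTools:

```shell
# Sketch: print, for each reference, the unambiguous read count from
# run 1, the count from run 2, and the difference (run 2 minus run 1).
# Assumes tab-separated refstats files whose header line starts with "#".
compare_refstats() {
    # $1, $2 = refstats files from the two runs
    awk -F'\t' '
        /^#/ { next }                        # skip the header line
        FNR == NR { first[$1] = $6; next }   # run 1: unambiguousReads by name
        ($1 in first) {
            printf "%s\t%d\t%d\t%d\n", $1, first[$1], $6, $6 - first[$1]
        }
    ' "$1" "$2"
}
```

For the two runs above this would be called as `compare_refstats 53_30_r1_S7.stats 53_30_r1_S7.stats2`.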


  • rajarapupriya
    replied
    Thanks for your quick reply. I ran a test run with both the references in the same command.


  • GenoMax
    replied
    @Priya: You should be able to include both sequences in the same command. There is always a debate about how to handle the multi-mapping (to both species) reads. First take a look to see how big that number is. If it is not large then you should be able to move forward.


  • rajarapupriya
    replied
    Hi Brian,

    Thanks for developing great tools for the community!
    We are using bbsplit to separate insect transcripts from those of its symbiont in a ribodepleted transcriptome. However, the reference for the insect is a cDNA transcriptome and for the symbiont it's a genome. Given that the references are of different types, do we need to map sequentially to bin each species, or can we include both references, a transcriptome and a genome, in the same command?

    Thanks
    Priya


  • GenoMax
    replied
    BBsplit should work here. Make sure you clean fasta headers from your genome sequences (remove spaces in headers etc, make sure they are unique). You have the option of handling multi-mapping data in various ways (discard, assign to all genomes etc) with BBSplit. So consider those carefully.

    While you could make a BAM file(s) directly from BBSplit, you should split the data into separate fastq's first and then re-align to the respective genomes using "bbmap.sh". This avoids having a large number of @SQ header lines in BAM files which can cause problems with some tools (e.g. IGV).
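
The header clean-up mentioned above could look roughly like this. It is a minimal sketch, assuming "clean" means keeping only the text before the first whitespace and making repeated IDs unique with a numeric suffix; `clean_fasta_headers` is a hypothetical helper name, not a BBTools command:

```shell
# Sketch: truncate each FASTA header at the first whitespace and append
# _2, _3, ... to IDs that repeat. Limitation: a suffixed ID could still
# collide with an ID that already ends in the same suffix.
clean_fasta_headers() {
    # $1 = input FASTA; the cleaned FASTA is written to stdout
    awk '
        /^>/ {
            id = substr($1, 2)           # keep text up to the first whitespace
            n = ++seen[id]               # how many times this ID has appeared
            if (n > 1) id = id "_" n     # disambiguate repeated IDs
            print ">" id
            next
        }
        { print }                        # sequence lines pass through unchanged
    ' "$1"
}
```

The cleaned output can be redirected to a new file and passed to `ref=` in place of the original.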


  • JenBarb
    replied
    Is your tool suitable for microbiome data where the reference database consists of a very large number of bacterial 16S sequences?
    I am looking for a tool that will take my fastq reads, align them to a given bacterial database, and then report the start and stop positions of where the reads aligned.

    thanks!
    Jen


  • GenoMax
    replied
    @Kate: You should take a look at tadpole.sh from the BBMap suite, which is a k-mer based assembler. Someone recently used it to assemble the axolotl genome, so whatever you are working with may be feasible to do. See post #64 to get started.


  • sk8bro
    replied
    Consensus seq from bbsplit

    Hi Brian,

    Do you have any thoughts on how to generate a consensus sequence from the read-pairs that I've used BBsplit to align and separate from each other?

    I tried out a set of samtools mpileup/bcftools commands but it is reference-based and each of my reference files has many sequences in it so I'd have to additionally pick one reference or generate a consensus reference and then re-align to it, which seems like it could introduce reference-based bias.

    I also tried out a set of ClustalW2/ANDES commands that is reference-free, but the MSA is too memory-intensive for the millions of reads that I have.

    I was thinking of going down the Mothur rabbit hole next, because it looks like there is hope of combining tools to compress both unique sequences and sequences contained in others, and then keep track such that the consensus reflects the true base frequencies.

    Anyway, I am just wondering if any BBMap tools are suited to this task, or if you've used something I'm not thinking of (I took a quick look at clumpify, but I believe it output more than one sequence).

    Thx! Kate


  • sk8bro
    replied
    Hi Brian,

    So I've considered your 3 suggestions and Seal seems the path of least resistance. To separate sequences with 1 SNP of importance, in presence of other sequencing error SNPs the references need to be considered simultaneously or else which SNP is the important one gets lost.

    So... I took the approach of using BBsplit (to enforce the 0.95 minid/idfilter that I want), and then fed everything that mapped into Seal (defaults except k=8, ambiguous=toss).

    There are a few read pairs that are coming out as "Unmapped" in the Seal step, but they aligned with BBsplit. I looked at one and saw it has 1 SNP in each of F and R ~150 bp reads. Which Processing parameter do I need to change in order to "Map" this read pair?

    Thx, Kate


  • GenoMax
    replied
    @Brian is doing something very specific to remove human contamination in JGI's non-human samples.

    Re-reading your original question it would be good to know if your contaminants are "close" relatives or can be considered a distant species. Success or failure of bbsplit is going to largely depend on that. No tool is going to be able to separate reads from a very closely related species based on sequence alone.
    Last edited by GenoMax; 10-12-2017, 07:38 AM.


  • Microalgues
    replied
    Originally posted by GenoMax:
    While Brian will have a more detailed insight it should be ok to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less same with genome or just CDS sequences.
    Thanks GenoMax,

    I see from another thread that masked genomes are used in order to prevent false positives when removing contamination.

    At the same time, in that thread you suggest simply using BBSplit, which should be enough.

    Originally posted by GenoMax:
    You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

    I assume you have seen BBsplit, which can be used for this purpose.
    With thanks,
    Xavier
