Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GenoMax
    replied
    @Wouter: @Brian would be along with an official answer but this may be because most NGS aligners are non-determinstic by default i.e. they may not produce exactly the same answer for each run.

    bbmap.sh provides a "determinisitc" flag that you can set to "true" to get exactly identical results each time. I don't see bbsplit listed as having that flag but it may support it since it is based on bbmap. @Brian will confirm.

    You would also need to consider how similar your references are and how you are choosing to handle multi-mappers (reads which map across genomes). Are you leaving the ambiguous2= option default. It may help to post the entire bbsplit command you are using for this analysis.

    Leave a comment:


  • wlokhorst
    replied
    Hello Brian,

    Your tool is great and on first sight it seems to work perfectly, but after running several batches of samples I've noticed some quirky behaviour.

    To investigate this, I've run the BBSplit tool several times on the same sample (and deleting the index every time) and it doesn't give me the same results every time. I've pasted the summarised results below and, as you can see, they differ with every run. This certainly isn't what I expected. Mapping the same reference sequences to a sample multiple times should always give me the same results, so something is wrong here.

    The numbers are the the number of reads mapped to a certain reference sequence.
    Code:
    		run1	run2	run3	run4	run5	run6	run7	run8	run9	run10	
    ref_seq_1	4	8	8	8	4	8	8	8	4	8
    ref_seq_2	4	4	4	4	4	4	4	4	4	4
    ref_seq_3	8	4	12	40	16	4	4	4	16	4
    ref_seq_4	24	24	8	4	8	4	24	24	8	4
    ref_seq_5	4	8	12	4	4	40	16	12	4	12
    ref_seq_6	12	8	8	8	4	16	4	32	4	4
    ref_seq_7	4	4	4	4	56	4	56	4	56	4
    ref_seq_8	16	4	4	4	16	4	16	4	16	4
    ref_seq_9	4	4	4	4	4	4	4	4	4	4
    ref_seq_10	20	20	20	20	20	20	20	20	20	20
    Hopefully you're able to tell me what causes this and (even better) solve it easily.

    Kind regards,

    Wouter Lokhorst
    Last edited by GenoMax; 10-02-2017, 06:05 AM.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Jack,

    This means that you were running multiple different indexing processes in the same directory at the same time. Unless you use a different directory for each process, or specify a different index location with "path=", or specify a different build number, the indexes can overwrite each other leading to corrupt zip files (which, fortunately, normally get detected, as in this case).

    If you want to do all of these mapping operations to the same references, just index once, wait for it to finish, and then run all the mapping operations without specifying "ref=". E.g.

    Code:
    bbsplit.sh ref=ecoli.fa,salmonella.fa
    
    (wait for finish)
    
    bbsplit.sh a.fq basename=out_a_%.fq
    bbsplit.sh b.fq basename=out_b_%.fq
    bbsplit.sh c.fq basename=out_c_%.fq
    ...etc
    If each one needs different references, then either run them serially, or use a different directory/build each time.

    Leave a comment:


  • xc611
    replied
    Hi, Brian:

    I have run bbsplit to number of samples for binning the rRNA reads. Most of them (total of 172 samples) run fine but some of them gave me strange java errors. I rerun couple of times and every time different sample had this error, so I know its not the fastq file error.

    Any suggestions?

    Thanks

    Jack
    Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
    at fileIO.ReadWrite.readObject(ReadWrite.java:788)
    at fileIO.ReadWrite.read(ReadWrite.java:1154)
    at dna.ChromosomeArray.read(ChromosomeArray.java:65)
    at align2.ChromLoadThread.loadAll(ChromLoadThread.java:50)
    at dna.Data.loadChromosomes(Data.java:272)
    at align2.BBMap.loadIndex(BBMap.java:354)
    at align2.BBMap.main(BBMap.java:33)
    at align2.BBSplitter.main(BBSplitter.java:49)
    Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
    at fileIO.ReadWrite.readObject(ReadWrite.java:783)
    ... 7 more

    Leave a comment:


  • RubenD
    replied
    Originally posted by Brian Bushnell View Post
    Hi Ruben,

    When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.
    Perfect! Thanks a lot Brian.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Ruben,

    When the same flag is given twice, the second one takes precedence. So, in this case the command is working fine and minhits=2 / maxindel=200000 are being used.

    Leave a comment:


  • RubenD
    replied
    @GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

    I have first generated an index for mouse and human:

    bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
    ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
    ref_mouse=/gcdata/gcproj/Ruben/GENOMES/Mouse/Sequences/NCBIM37.67/Mus_musculus.NCBIM37.67.dna.toplevel.fa


    Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

    bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

    Thanks for your help,
    Ruben

    Leave a comment:


  • GenoMax
    replied
    @RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.

    Leave a comment:


  • RubenD
    replied
    how to change default bbsplit parameters?

    Hi all,

    I would like to use bbsplit to separate my fastq files.

    First I generated an index and now I am running a for loop for all my fastq's.

    However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

    Am I doing something wrong here?


    Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
    Ref merge time: 0.028 seconds.
    Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

    BBMap version 37.00
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
    Set threads to 10
    Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
    Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
    Retaining first best site only for ambiguous mappings.
    NOTE: Ignoring reference file because it already appears to have been processed.
    NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
    Set genome to 2

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Luc!

    The syntax for BBSplit is different from BBMap. It's like this:

    bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

    That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

    Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).

    Syntax:

    seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

    The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.
    Last edited by Brian Bushnell; 09-10-2015, 12:23 AM.

    Leave a comment:


  • luc
    replied
    Hi Brian,

    yes I am using version 35.43.

    I ran for example:
    bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

    The results look like this:
    ------------------ Results ------------------

    Genome: 1
    Key Length: 13
    Max Indel: 100
    Minimum Score Ratio: 0.8
    Mapping Mode: normal
    Reads Used: 887008 (419139365 bases)

    Mapping: 18.567 seconds.
    Reads/sec: 47772.92
    kBases/sec: 22574.22


    Read 1 data: pct reads num reads pct bases num bases

    mapped: 0.3272% 2902 0.3307% 1386018
    unambiguous: 0.2006% 1779 0.2065% 865395
    ambiguous: 0.1266% 1123 0.1242% 520623
    low-Q discards: 0.0000% 0 0.0000% 0

    perfect best site: 0.0046% 41 0.0024% 9939
    semiperfect site: 0.0046% 41 0.0024% 9939

    Match Rate: NA NA 94.2618% 1332227
    Error Rate: 43.3616% 2861 5.7382% 81100
    Sub Rate: 43.2707% 2855 2.4932% 35237
    Del Rate: 29.7818% 1965 1.9322% 27309
    Ins Rate: 28.8118% 1901 1.3128% 18554
    N Rate: 0.0000% 0 0.0000% 0

    The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.
    Last edited by luc; 09-09-2015, 11:48 PM.

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Luc,

    It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.

    Thanks,
    Brian

    P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.
    Last edited by Brian Bushnell; 09-08-2015, 11:57 PM.

    Leave a comment:


  • luc
    replied
    Hi Brian,

    when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?

    Thanks!

    Leave a comment:


  • manuelkleiner
    replied
    Wow that was fast! Thank you Brian!
    I will test it tomorrow and let you know how it works out for me.

    @GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.

    Leave a comment:


  • Brian Bushnell
    replied
    Manuel,

    I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 08:47 AM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
59 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
54 views
0 likes
Last Post seqadmin  
Working...
X