Header Leaderboard Ad


Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by vingomez View Post
    (1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?
    Hi Vicente,

    I released a new version of BBTools, 34.64. BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

    bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

    will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

    You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.


    • #17
      Problem with argument length

      Dear Brian,
      I have run into a problem when trying to create an index for bbsplit using a large number of files (~20,000). In the example below the variable $files contains 20,000 comma delimited, file names.

      [email protected][BBMap] bbsplit.sh -Xmx100g threads=20 ref=$files                                                                                                                         [ 2:47PM]
      zsh: argument list too long: bbsplit.sh
      What I have found out so far is that the linux kernel has a upper limit for the argument length in the shell, which causes this problem. I have tried various workarounds e.g. trying to pipe the file names in from STDIN or creating a bash script. But nothing has worked.
      Any suggestions how to get bbsplit to indext such a large number of files before running the splitting?
      Maybe one solution would be if one could hand the ref= argument a folder containing all fasta files and then bbsplit writes an index for each .fasta file in the folder!?

      Thanks a lot in advance for your help,


      • #18
        Hi Manuel,

        I will add the possibility to point "ref=" to a directory in the next release; that's a pretty good idea.



        • #19
          Brian: Concatenating some files to make the number smaller should make it work now, correct?


          • #20
            Well... the easiest way to use BBSplit is to have it split input into one output file per reference file. Once you concatenate them, you lose that ability. It's technically possible to specify sets of named sequences within a file but that way is too complicated. Using a tiered approach is possible (though a pain):

            for references a, b, c, d, e, f:

            abc=cat a b c
            def=cat d e f

            bbsplit ref=abc,def

            then re-split the abc pile between a, b, and c, and the def pile between d, e, and f. That will basically cut the command line lengths (and open file handles) by a factor of the square root of the number of references.

            But, I will add this feature quickly since it's hard to do otherwise.

            Note that if all you care about are statistics of which reads mapped to which scaffolds, then if all the ref sequences have unique names, you can just run BBMap on the concatenated file with the "scafstats" output flag. But for actually splitting the files easily, the references currently need to be in individual files.


            • #21
              @Manuel: Are these 20000 sequences related (contigs)?

              @Brian: Thanks for the additional explanation.


              • #22

                I released a new version of BBTools, 34.65, in which you can now specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Have a nice weekend!


                • #23
                  Wow that was fast! Thank you Brian!
                  I will test it tomorrow and let you know how it works out for me.

                  @GenoMax: the 20,000 Mulit-fasta files are the product of a rather strict binning of multiple metagenomes. So some files contain multiple sequences, some only contain one. Concatenating would thus not be an option and the scaffold/chromosome statistics function would also not work to get the statistics per file.


                  • #24
                    Hi Brian,

                    when using BBsplit (also latest version) with long reads (Moleculo data) the program shreds these long reads into 500 bp pieces. It seems to me that the distributing of the reads according to reference is not working in this case. For me about 0.5% of the genomic reads are mapping to mitochondria according to the BBsplit report, but the out-files contain about 99% of the reads. Perhaps I am misinterpreting anything?



                    • #25
                      Hi Luc,

                      It's quite possible that you are not misinterpreting anything, but the results are incorrect. Would you mind giving the exact parameters and version (I assume 35.43) you used? I will probably be able to replicate the bug without your specific data.


                      P.S. By the way, "mapPacBio.sh" will use approximately the same algorithm but only shred them into 6000bp shreds.
                      Last edited by Brian Bushnell; 09-08-2015, 11:57 PM.


                      • #26
                        Hi Brian,

                        yes I am using version 35.43.

                        I ran for example:
                        bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta build=1 maxindel=100 minratio=0.8 refstats=mitorefstats.txt out=MolToMitoreads.fasta &

                        The results look like this:
                        ------------------ Results ------------------

                        Genome: 1
                        Key Length: 13
                        Max Indel: 100
                        Minimum Score Ratio: 0.8
                        Mapping Mode: normal
                        Reads Used: 887008 (419139365 bases)

                        Mapping: 18.567 seconds.
                        Reads/sec: 47772.92
                        kBases/sec: 22574.22

                        Read 1 data: pct reads num reads pct bases num bases

                        mapped: 0.3272% 2902 0.3307% 1386018
                        unambiguous: 0.2006% 1779 0.2065% 865395
                        ambiguous: 0.1266% 1123 0.1242% 520623
                        low-Q discards: 0.0000% 0 0.0000% 0

                        perfect best site: 0.0046% 41 0.0024% 9939
                        semiperfect site: 0.0046% 41 0.0024% 9939

                        Match Rate: NA NA 94.2618% 1332227
                        Error Rate: 43.3616% 2861 5.7382% 81100
                        Sub Rate: 43.2707% 2855 2.4932% 35237
                        Del Rate: 29.7818% 1965 1.9322% 27309
                        Ins Rate: 28.8118% 1901 1.3128% 18554
                        N Rate: 0.0000% 0 0.0000% 0

                        The problems is that the out-file contains almost all of the Moleculo data - in 500bp pieces - instead of the expected 0.33 %.
                        Last edited by luc; 09-09-2015, 11:48 PM.


                        • #27
                          Hi Luc!

                          The syntax for BBSplit is different from BBMap. It's like this:

                          bbsplit.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta basename=out_%.fa outu=unmapped.fa

                          That will give you 3 output files, "out_mitochondrion1.fa", "out_mitochondrion2.fa", and "unmapped.fa".

                          Unfortunately, BBSplit by default breaks reads longer than 500bp into 500bp pieces. It's technically possible to change this to 6kbp pieces with the use of a couple extra parameters (specifically "msa=MultiStateAligner9PacBio fastareadlen=6000"), but I suggest you try Seal instead. Seal does not do alignment, just kmer-matching, but I tend to use it in this kind of situation as it is much faster and can handle reads of unlimited length without breaking them up. The default is a hamming distance of zero and considering something a match if even a single kmer is shared (and by default k=31).


                          seal.sh in=moleculo.fasta.gz ref=mitochondrion1.fasta,mitochondrion2.fasta pattern=out_%.fa outu=unmatched.fa mkf=0.05 hdist=1

                          The last 2 parameters are optional, but specify a hamming distance of 1 (which is fine in the case of mitochondria, which are tiny) and require a read to share 5% of its kmers with a reference sequence to be considered matching. You can increase that to a substantially higher fraction (like 0.5), and eliminate the hamming distance, if you expect a very close match between the reads and the reference.
                          Last edited by Brian Bushnell; 09-10-2015, 12:23 AM.


                          • #28
                            how to change default bbsplit parameters?

                            Hi all,

                            I would like to use bbsplit to separate my fastq files.

                            First I generated an index and now I am running a for loop for all my fastq's.

                            However when I run the command it seems that the default parameter values are not changed, see for example parameters minhits and maxindel in bold below. Which parameter will it take?

                            Am I doing something wrong here?

                            Merged reference file /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
                            Ref merge time: 0.028 seconds.
                            Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, minhits=2, maxindel=200000, t=10, ambiguous=best, ambiguous2=best, build=2, in=../Fastq_files/20161223_CDX10_CC3737_S7_R1_001.fastq.gz, out_human=20161223_CDX10_CC3737_S7_R1_001_human.fastq, out_mouse=20161223_CDX10_CC3737_S7_R1_001_mouse.fastq, scafstats=scafstats_20161223_CDX10_CC3737_S7_R1_001.txt, refstats=refstats_20161223_CDX10_CC3737_S7_R1_001.txt, ref=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/merged_ref_208959345802405.fa.gz]

                            BBMap version 37.00
                            Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
                            Set threads to 10
                            Scaffold statistics will be written to scafstats_20161223_CDX10_CC3737_S7_R1_001.txt
                            Reference set statistics will be written to refstats_20161223_CDX10_CC3737_S7_R1_001.txt
                            Retaining first best site only for ambiguous mappings.
                            NOTE: Ignoring reference file because it already appears to have been processed.
                            NOTE: If you wish to regenerate the index, please manually delete /gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ref/genome/2/summary.txt
                            Set genome to 2


                            • #29
                              @RubenD: Please provide the exact BBSplit command you are using. You need to provide more than one reference for BBSplit to work right.


                              • #30
                                @GenoMax: it seems to be running correctly and produces sensible results, however I'm not sure whether it uses the default or my custom parameter settings.

                                I have first generated an index for mouse and human:

                                bbsplit.sh build=2 maxindel=200000 minhits=2 ambiguous=best ambiguous2=best \
                                ref_human=/gcdata/gcproj/Ruben/GENOMES/Human/Sequences/Ensembl/GRCh37/Release75/Genome/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \

                                Then I split my fastq files, again this works but I'm not sure whether it actually used the parameters that I provide...

                                bbsplit.sh minhits=2 maxindel=200000 t=10 ambiguous=best ambiguous2=best path=/gcdata/gcproj/Ruben/GENOMES/BBmap_indices/ build=2 in=sample.fastq.gz out_human=sample_human.fastq out_mouse=sample_mouse.fastq scafstats=scafstats.txt refstats=refstats.txt

                                Thanks for your help,


                                Latest Articles


                                • seqadmin
                                  Improved Targeted Sequencing: A Comprehensive Guide to Amplicon Sequencing
                                  by seqadmin

                                  Amplicon sequencing is a targeted approach that allows researchers to investigate specific regions of the genome. This technique is routinely used in applications such as variant identification, clinical research, and infectious disease surveillance. The amplicon sequencing process begins by designing primers that flank the regions of interest. The DNA sequences are then amplified through PCR (typically multiplex PCR) to produce amplicons complementary to the targets. RNA targets...
                                  03-21-2023, 01:49 PM
                                • seqadmin
                                  Targeted Sequencing: Choosing Between Hybridization Capture and Amplicon Sequencing
                                  by seqadmin

                                  Targeted sequencing is an effective way to sequence and analyze specific genomic regions of interest. This method enables researchers to focus their efforts on their desired targets, as opposed to other methods like whole genome sequencing that involve the sequencing of total DNA. Utilizing targeted sequencing is an attractive option for many researchers because it is often faster, more cost-effective, and only generates applicable data. While there are many approaches...
                                  03-10-2023, 05:31 AM





                                Topics Statistics Last Post
                                Started by seqadmin, 03-24-2023, 02:45 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2023, 12:26 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 03-17-2023, 12:32 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 03-15-2023, 12:42 PM
                                0 responses
                                Last Post seqadmin