Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

    BBMap is now available here:

    Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).


    Links to other BBTools forum threads:
    BBSplit (Binning reads based on mapping to multiple references at once)
    BBDuk and Seal (Decontamination, filtering, adapter-trimming, kmer-masking, and alignment-free expression quantification)
    BBNorm (Normalization, error-correction, kmer frequency histograms, and genome size estimation)
    BBMerge (Paired read overlap merging and insert size calculation)
    Reformat (Read reformatting, deinterleaving, subsampling, etc)
    CalcUniqueness (Library uniqueness/saturation plots)
    RemoveHuman (Filtering out reads from human, or some other specific organism, with zero false positives removals)
    Repair (Re-pairing reads that got out of order, based on their names)
    CalcTrueQuality (Recalibrating quality scores of reads)
    Tadpole (Assembler, error-correction, read extension)
    KmerCompressor (Kmer set generation and set operations)
    Clumpify (Increases compression of gzipped/bzipped fastq files)

    A thread with answers to frequently asked questions about BBTools, collated by GenoMax, is here.

    BBMap/BBTools are now open source. Please try it out - it's a 3MB download, and written in pure Java, so installation is trivial - just unzip and run. Handles all sequencing platforms (Illumina, PacBio, 454, Sanger, Nanopore, etc) except Solid colorspace, which I removed to simplify the code.

    A Powerpoint comparison of performance (speed, memory, sensitivity, specificity) on various genomes, compared to bwa, bowtie2, gsnap, smalt:



    ...but in summary, BBMap is similar in speed to bwa, with much better sensitivity and specificity than any other aligner I've compared it to. It uses more memory than Burrows-Wheeler-based aligners, but in exchange, the indexing speed is many times faster.

    How to use

    There is documentation in the docs folder and displayed by shellscripts when run with no arguments. But for example:

    bbmap.sh ref=ecoli.fa
    ...will build an index and write it to the present directory

    bbmap.sh in=reads.fq out=mapped.sam
    ...will map to the indexed reference

    bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk
    ...will build an index in memory and map paired reads to it in a single command

    If your OS does not support shellscripts, replace 'bbmap.sh' like this:
    java -Xmx23g -cp /path/to/current align2.BBMap in=reads.fq out=mapped.sam

    ...where /path/to/current is the location of the 'current' directory, and -Xmx23g specifies the amount of memory to use. This should be set to about 85% of physical memory (the symbols 'm' or 'g' specify megs or gigs), or more, depending on your virtual memory configuration. Human reference requires around 21 GB; generally, references need around 7 bytes per base pair, and a minimum of 500 MB at default settings. However, there is a reduced memory mode ('usemodulo') that only needs half as much memory. The shellscripts are just wrappers that display usage information and set the -Xmx parameter.

    Please ask if you encounter any problems or need help! And there are other neat tools too, for error correction, normalization, depth-binning, reference-based binning, contaminant filtering, adapter trimming, optimal quality trimming, reformatting files, paired-read merging, deduplication of assemblies, and histogram generation for things like kmer depth and insert size.

    NOTE: BBMap (and all related tools) shellscripts will try to autodetect memory, but may fail (resulting in the jvm failing to start or running out of memory), depending on the system configuration. This can be overridden by adding the -Xmx30g flag to the parameter list of the shellscript (adjusted for your computer's physical memory) and it will be passed to java.
    Last edited by Brian Bushnell; 12-04-2016, 12:09 AM.

  • #2
    Dear Brian,

    thanks a lot for posting your aligner package.
    Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)?
    Would you recommend any specific settings for the alignment of RNA seq reads?
    Last edited by luc; 02-23-2014, 07:10 PM.

    Comment


    • #3
      Originally posted by luc View Post
      Dear Brian,

      thanks a lot for posting your aligner package.
      Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)?
      Would you recommend any specific settings for the alignment of RNA seq reads?
      Luc,

      Unfortunately, the limit is still 6kb. I am working on increasing that but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which despite the name only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will be automatically broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of the PacBio reads will be under 6kbp and thus won't be fragmented.

      For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio.

      bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another.

      For RNA-seq (using Illumina reads), I would recommend a command line like this:

      bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random

      This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded).

      The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns mostly under 400bp, while human average is much higher and has a few that are even longer than 100kbp. It should be set to above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000 which is fine for the plants and fungi I've worked with but if you are working with vertebrates then 200000 would probably be appropriate.
      Last edited by Brian Bushnell; 02-24-2014, 03:19 PM.

      Comment


      • #4
        Thanks a lot for the advice!
        I will give it a try.

        Comment


        • #5
          Originally posted by luc View Post
          Thanks a lot for the advice!
          I will give it a try.
          You're welcome. By the way, I added a java 6-compiled package to the Sourceforge page (and a newer version of the java 7-compiled package). These may also address some other issues; e.g., some shellscripts had Windows-style newlines (\n\r), and I changed them all to Unix-style newlines (\n). Some shellscript parsers choke on \n\r.

          Comment


          • #6
            I would like to try BBMap on human exomes, but I have a strange output, for the reference indexing (it goes only to chr7), therefore the rest of the computation fails:

            $ /home/max/appli/bbmap/bbmap_31.32/bbmap.sh -Xmx40G build=1 ref=/home/max/reference/ucsc.hg19.fasta
            This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
            If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
            java -ea -Xmx40G -cp /home/max/appli/bbmap/bbmap_31.32/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 -Xmx40G build=1 ref=/home/max/reference/ucsc.hg19.fasta
            Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, -Xmx40G, build=1, ref=/home/max/reference/ucsc.hg19.fasta]

            BBMap version 31.32
            Set OVERWRITE to true
            Cigar strings enabled.
            Retaining first best site only for ambiguous mappings.
            No output file.
            NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
            NOTE: Deleting contents of ref/index/1 because reference is specified and overwrite=true
            Writing reference.
            Executing dna.FastaToChromArrays [/home/max/reference/ucsc.hg19.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

            Set genScaffoldInfo=true
            Writing ref/genome/1/chr1.chrom.gz
            Writing ref/genome/1/chr2.chrom.gz
            Writing ref/genome/1/chr3.chrom.gz
            Writing ref/genome/1/chr4.chrom.gz
            Writing ref/genome/1/chr5.chrom.gz
            Writing ref/genome/1/chr6.chrom.gz
            Writing ref/genome/1/chr7.chrom.gz
            Set genome to 1

            Loaded Reference: 0,011 seconds.
            Loading index for chrom 1-7, genome 1
            No index available; generating from reference genome: /home/max/data/test/ref/index/1/chr1-3_index_k13_c2_b1.block
            No index available; generating from reference genome: /home/max/data/test/ref/index/1/chr4-7_index_k13_c2_b1.block
            Indexing threads started for block 4-7
            Indexing threads started for block 0-3
            Indexing threads finished for block 0-3
            Indexing threads finished for block 4-7
            Generated Index: 176,884 seconds.
            Finished Writing: 4,670 seconds.
            No reads to process; quitting.

            Total time: 230,800 seconds.

            And then the mapping:

            $ /home/max/appli/bbmap/bbmap_31.32/bbmap.sh -Xmx40G build=1 in=test.bam out=test_bbmap.bam
            This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
            If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
            java -ea -Xmx40G -cp /home/max/appli/bbmap/bbmap_31.32/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 -Xmx40G build=1 in=test.bam out=test_bbmap.bam
            Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, -Xmx40G, build=1, in=test.bam, out=test_bbmap.bam]

            BBMap version 31.32
            Set OVERWRITE to true
            Cigar strings enabled.
            Retaining first best site only for ambiguous mappings.
            Set genome to 1

            Loaded Reference: 4,994 seconds.
            Loading index for chrom 1-7, genome 1
            Generated Index: 10,028 seconds.
            Analyzed Index: 9,399 seconds.
            Could not find samtools.
            Started output stream: 0,049 seconds.
            Cleared Memory: 0,441 seconds.
            Processing reads in single-ended mode.
            Started read stream.
            Exception in thread "Thread-45" java.lang.AssertionError: 57 = i
            array=�BC���R����UR�t���l��� iqp.)�C�9\�ʣ�1�
            �rX�[�%}hW�|t��NO_�n�������w�݇��i�+}z~z�+�������O��oηw�v���쨻Y*��%d� ������ �{@}, start=44, stop=141
            at align2.Tools.parseInt(Tools.java:1097)
            at stream.SamLine.<init>(SamLine.java:1242)
            at stream.SamReadInputStream.toReadList(SamReadInputStream.java:130)
            at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:102)
            at stream.SamReadInputStream.hasMore(SamReadInputStream.java:58)
            at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:717)
            at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:709)
            Started 32 mapping threads.

            And nothing goes on.

            Comment


            • #7
              Brian: Trying out BBMap and I am having problems indexing the reference. My species is Pig which has the following Fasta entry lines in the genome.fa file:

              grep '>' ../References/GENOME_DIR/genome.fa
              >10
              >11
              >12
              >13
              >14
              >15
              >16
              >17
              >18
              >1
              >2
              >3
              >4
              >5
              >6
              >7
              >8
              >9
              >MT
              >X
              >Y
              However it appears that BBMap is only creating 6 chromosomes as follows:

              Executing align2.BBMap [build=1, overwrite=true, match=long, fastareadlen=500, ref=../References/GENOME_DIR/genome.fa]

              BBMap version 31.32
              Set OVERWRITE to true
              Cigar strings enabled.
              Retaining first best site only for ambiguous mappings.
              No output file.
              Writing reference.
              Executing dna.FastaToChromArrays [../References/GENOME_DIR/genome.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

              Set genScaffoldInfo=true
              Writing ref/genome/1/chr1.chrom.gz
              Writing ref/genome/1/chr2.chrom.gz
              Writing ref/genome/1/chr3.chrom.gz
              Writing ref/genome/1/chr4.chrom.gz
              Writing ref/genome/1/chr5.chrom.gz
              Writing ref/genome/1/chr6.chrom.gz
              Set genome to 1

              Loaded Reference: 0.008 seconds.
              Loading index for chrom 1-6, genome 1
              No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr1-3_index_k13_c2_b1.block
              No index available; generating from reference genome: /group/gcore/machine/sbs/run00289/131127_M00657_0033_000000000-A6EDC/hr00671_and-yanhua/006356_-945-Flag-1-946-/ref/index/1/chr4-6_index_k13_c2_b1.block
              Indexing threads started for block 0-3
              Indexing threads started for block 4-6
              Indexing threads finished for block 4-6
              Indexing threads finished for block 0-3
              Generated Index: 274.860 seconds.
              Finished Writing: 29.271 seconds.
              No reads to process; quitting.

              Total time: 362.577 seconds.
              Any ideas? Feel free to send me email westerman at purdue dot edu

              Comment


              • #8
                Westerman,

                Sorry for the confusion. The files are called "chrom" for legacy reasons, and they don't correspond to chromosomes anymore. I should rename them "chunks". Originally, BBMap wrote one chromosome per chrom file, which is fine for complete reference genomes. But de-novo assemblies can have thousands or millions of scaffolds, so now I bundle them into chunks.

                Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.

                Comment


                • #9
                  Originally posted by Brian Bushnell View Post
                  Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes.
                  So they are. Different mapping and slightly more -- in terms of raw matches -- than tophat. I'll have to do more work in order to see if the matches are accurate but so far it is encouraging.

                  samtools idxstats BBMapped.bam | sort -n

                  * 0 0 174134
                  MT 16613 121832 3149
                  X 144288218 11619 559
                  Y 1637716 16 4
                  1 315321322 31627 2152
                  2 162569375 33762 1920
                  3 144787322 40457 3319
                  4 143465943 49722 2064
                  5 111506441 27714 1515
                  6 157765593 60316 2405
                  7 134764511 27906 2778
                  8 148491826 14523 1241
                  9 153670197 26155 1155
                  10 79102373 6924 601
                  11 87690581 13340 650
                  12 63588571 25561 1007
                  13 218635234 29541 1744
                  14 153851969 29737 2588
                  15 157681621 25974 2666
                  16 86898991 9360 334
                  17 69701581 23454 913
                  18 61220071 6550 332

                  Tophat accepted_hits.bam

                  MT 16613 85810 0
                  X 144288218 11197 0
                  Y 1637716 12 0
                  1 315321322 35442 0
                  2 162569375 24092 0
                  3 144787322 46519 0
                  4 143465943 43747 0
                  5 111506441 22186 0
                  6 157765593 45857 0
                  7 134764511 18867 0
                  8 148491826 13938 0
                  9 153670197 27813 0
                  10 79102373 4661 0
                  11 87690581 12493 0
                  12 63588571 19511 0
                  13 218635234 33190 0
                  14 153851969 19774 0
                  15 157681621 11172 0
                  16 86898991 14862 0
                  17 69701581 19776 0
                  18 61220071 5159 0

                  Comment


                  • #10
                    By the way, BBMap's default mode is "ambig=best", meaning that any ambiguously-mapped reads will go to the first top-scoring location (ambiguous hits will be annotated "XT:A:R" rather than "XT:A:U", so you will know which ones are ambiguous). This is fine for variant-calling, but if you are doing RNA-seq expression quantification, it's usually better to set "ambig=random" or "ambig=all". Otherwise, if there are two almost-identical genes, all the reads will map to the first one.

                    Also, for vertebrate RNA-seq, I suggest these settings:
                    xstag=us
                    generate XS tags, needed by cufflinks, according to "unstranded" protocol, or if you don't know the protocol. Alternatives are "ss" (second strand) and "fs" (first strand). If you DO know the protocol, please set it as ss or fs.

                    intronlen=10
                    All deletions up to this length will be annotated as 'D' in the cigar string; longer ones will be annotated as 'N' (meaning skipped). This may or may not matter to downstream tools. For DNA mapping you don't need this flag.

                    maxindel=200000
                    The longest deletion (i.e., intron) that it will look for. May still find longer ones. Around 98% of human introns are shorter than 100kbp. If you do not set this, it will default to 16kbp, which is too short for vertebrate RNA-seq (though lots of plants and fungi have tiny introns of around 200bp).

                    I'll add this recommendation to the readme for the next release.

                    Comment


                    • #11
                      Thanks for the recommendations, especially the ambig= one. The help for bbmap.sh is ... confusing, at least at first glance.

                      One thing I think is working but giving me an improper warning message is the -Xmx parameter. From starting like so:

                      bbmap.sh -Xmx30G path=../BBMap_ref in1=$R1 in2=$R2 ...
                      I get:

                      This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
                      If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool.
                      java -ea -Xmx30G ....
                      In other words a warning but Java itself is starting up properly with 30G.

                      Comment


                      • #12
                        Hmm, that's interesting, thanks for noting it. It's because I first try to detect memory, then parse the -Xmx override. I will reverse the order

                        I'll try to improve the help information, so any concrete suggestions/complaints would be very helpful.

                        Comment


                        • #13
                          As for help, the major problem is the formatting on screens with 80 columns. Wrapping is hard to read plus there is too much information for a quick reference. I suggest trying to limit the help messages to 80 columns and then direct the user to a longer formatted README or extended help. I know that the current help says to look at the README.

                          Also parsing '-h' on the command line would be nice. While displaying help when no commands are given is good it is also useful to be able to use '-h'.

                          As an example of the first paragraph, at the moment the help includes:

                          Code:
                          [FONT="Courier New"]maxsites=5          Maximum number of total alignments to print per read.  Only relevant when secondary=t.
                          quickmatch=f        Generate cigar strings more quickly.  Must be true to generate secondary site cigar strings.
                          keepnames=f         Keep original names of paired reads, rather than ensuring both reads have the same name.[/FONT]
                          All useful information to be sure but these could shortened and the extended information put into a formatted README. E.g.

                          Code:
                          [FONT="Courier New"]maxsites=5          Max. number of total alignments to print
                          quickmatch=f        Generate cigar strings more quickly.
                          keepnames=f         Keep original names of paired reads[/FONT]

                          Comment


                          • #14
                            Hi,

                            I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version).

                            "This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
                            If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

                            "Invalid maximum heap size: -Xmx30G
                            The specified size exceeds the maximum representable size.
                            Could not create the Java virtual machine."

                            Comment


                            • #15
                              Originally posted by luc View Post
                              Hi,

                              I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version).

                              "This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G.
                              If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool."

                              "Invalid maximum heap size: -Xmx30G
                              The specified size exceeds the maximum representable size.
                              Could not create the Java virtual machine."
                              luc,

                              That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here.

                              Westerman,

                              Thanks for your feedback; I'll simplify the help menus for the next version.

                              -Brian

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 11-08-2024, 11:09 AM
                              0 responses
                              35 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-08-2024, 06:13 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              23 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X