BBMap is now available here:
Links to other BBTools forum threads:
BBSplit (Binning reads based on mapping to multiple references at once)
BBDuk and Seal (Decontamination, filtering, adapter-trimming, kmer-masking, and alignment-free expression quantification)
BBNorm (Normalization, error-correction, kmer frequency histograms, and genome size estimation)
BBMerge (Paired read overlap merging and insert size calculation)
Reformat (Read reformatting, deinterleaving, subsampling, etc)
CalcUniqueness (Library uniqueness/saturation plots)
RemoveHuman (Filtering out reads from human, or some other specific organism, with zero false-positive removals)
Repair (Re-pairing reads that got out of order, based on their names)
CalcTrueQuality (Recalibrating quality scores of reads)
Tadpole (Assembler, error-correction, read extension)
KmerCompressor (Kmer set generation and set operations)
Clumpify (Increases compression of gzipped/bzipped fastq files)
A thread with answers to frequently asked questions about BBTools, collated by GenoMax, is here.
BBMap/BBTools are now open source. Please try it out - it's a 3 MB download, and written in pure Java, so installation is trivial - just unzip and run. It handles all sequencing platforms (Illumina, PacBio, 454, Sanger, Nanopore, etc) except SOLiD colorspace, which I removed to simplify the code.
A Powerpoint comparison of performance (speed, memory, sensitivity, specificity) on various genomes, compared to bwa, bowtie2, gsnap, smalt:
...but in summary, BBMap is similar in speed to bwa, with much better sensitivity and specificity than any other aligner I've compared it to. It uses more memory than Burrows-Wheeler-based aligners, but in exchange, the indexing speed is many times faster.
How to use
Documentation is in the docs folder, and each shellscript also displays its usage information when run with no arguments. For example:
bbmap.sh ref=ecoli.fa
...will build an index and write it to the present directory
bbmap.sh in=reads.fq out=mapped.sam
...will map to the indexed reference
bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk
...will build an index in memory and map paired reads to it in a single command
If your OS does not support shellscripts, replace 'bbmap.sh' like this:
java -Xmx23g -cp /path/to/current align2.BBMap in=reads.fq out=mapped.sam
...where /path/to/current is the location of the 'current' directory, and -Xmx23g specifies the amount of memory to use. This should be set to about 85% of physical memory (the symbols 'm' or 'g' specify megs or gigs), or more, depending on your virtual memory configuration. The human reference requires around 21 GB; generally, a reference needs around 7 bytes per base pair, with a minimum of 500 MB at default settings. However, there is a reduced memory mode ('usemodulo') that only needs half as much memory. The shellscripts are just wrappers that display usage information and set the -Xmx parameter.
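As a rough illustration of the rule of thumb above, here is a small shell sketch that turns a reference size (in bases) into a memory estimate. The function name estimate_bbmap_mem_mb is made up for this example and is not part of BBTools, and it uses decimal megabytes for simplicity, so treat the result as a ballpark figure only.

```shell
# Hypothetical helper (not part of BBTools): ballpark memory for a reference,
# assuming ~7 bytes per base with a 500 MB floor, reported in decimal MB.
estimate_bbmap_mem_mb() {
    bases=$1
    mb=$(( bases * 7 / 1000000 ))
    # Apply the 500 MB minimum for small references
    if [ "$mb" -lt 500 ]; then
        mb=500
    fi
    echo "$mb"
}
```

For a FASTA file, something like grep -v '^>' ecoli.fa | tr -d '\n' | wc -c should give an approximate base count to pass in. A small bacterial genome lands on the 500 MB floor, while a 3 Gbp reference comes out at about 21000 MB, in line with the human figure quoted above.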
Please ask if you encounter any problems or need help! And there are other neat tools too, for error correction, normalization, depth-binning, reference-based binning, contaminant filtering, adapter trimming, optimal quality trimming, reformatting files, paired-read merging, deduplication of assemblies, and histogram generation for things like kmer depth and insert size.
NOTE: The BBMap (and all related tools) shellscripts will try to autodetect memory, but this may fail (resulting in the JVM failing to start or running out of memory), depending on the system configuration. The autodetected value can be overridden by adding a flag such as -Xmx30g (adjusted for your computer's physical memory) to the parameter list of the shellscript, and it will be passed to java.
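If autodetection fails, one way to pick the override value is to compute roughly 85% of physical memory yourself. The sketch below assumes a Linux system where /proc/meminfo is available; the helper name xmx_flag_from_kb is invented for this example and is not something BBTools provides.

```shell
# Hypothetical helper (not part of BBTools): build an -Xmx flag set to
# 85% of the given amount of memory, where the input is in kB.
xmx_flag_from_kb() {
    total_kb=$1
    mb=$(( total_kb * 85 / 100 / 1024 ))
    echo "-Xmx${mb}m"
}

# Assumed usage on Linux, reading physical memory (kB) from /proc/meminfo:
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   bbmap.sh "$(xmx_flag_from_kb "$total_kb")" in=reads.fq out=mapped.sam
```

The flag is given in megabytes rather than gigabytes so that integer shell arithmetic does not round a large fraction of memory away.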