Seqanswers Leaderboard Ad

**sdriscoll** · 11-10-2014, 11:14 AM

Hi Brian,
New week...new question. I'm doing something where I have a bunch of 15-mer segments of several larger sequences and I'd like to match the mers up to two separate references. The references may be as large as two entire mammalian genomes. The fact that bbmap can conveniently and rapidly generate an index on the fly and map reads super fast in 'perfectmode' makes it look like a good tool for this. I'm just wondering if aligning 15-mers with bbmap is reliable or would I be better of making my own program that essentially just hashes all 15-mers from the two references and runs a simple look-up.

**sdriscoll** · 11-10-2014, 11:35 AM

Originally posted by Brian Bushnell View Post

Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.

This isn't working for me. I've built and index like this:

Code:

bbmap.sh ref=hg19e.fa path=./bbmap trimreaddescriptions=t

however in the output alignments to this index I still see full-length reference names:

Code:

uc011kko.2_1158_1257_0_0_1      0       7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF  101957711        44      100=    *       0       0       ACCCATGGTTCCACGGGACACTGTCCCGGGTCAAGGCTGCTCAACTGGTTCTGGCAGGGGGGCCCCGGAACCACGGCCTCTTCGTGATCCGCCAAAGTGA        ????????????????????????????????????????????????????????????????????????????????????????????????????        NM:i:0  AM:i:44

I downloaded the current version last week (Thursday or Friday).

**Brian Bushnell** · 11-10-2014, 12:01 PM

You can run BBMap with the "k=15" flag to use 15mers for indexing (15 is the max). It will be reliable except with very-low-complexity kmers - polymonomers (e.g. AAAAAAAAAAAAAAA) and polydimers (e.g. ACACACACACACACA) are not indexed. You can also use BBSplit (sort of a wrapper for BBMap that maps to multiple references simultaneously) to output the reads into different files depending on which reference they hit.

If you don't care about the mapping location, there's also BBDuk, which simply associates kmers with sequences (ignoring position information). So you can do something like this:

bbduk.sh ref=human.fa in=reads.fq k=15 mm=f outm=matched.fq outu=unmatched.fq

It uses more memory than BBMap (around 15 bytes per unique kmer at best) but should be faster, and does not exclude homopolymer kmers. Optionally, you can allow a hamming distance or edit distance in the kmers, though that increases memory usage by a factor of 3*K (for hdist=1). I'm working on a version that does binning like BBSplit but it's not quite done yet.

**Brian Bushnell** · 11-10-2014, 12:03 PM

Originally posted by sdriscoll View Post

This isn't working for me. I've built and index like this:

Code:

bbmap.sh ref=hg19e.fa path=./bbmap trimreaddescriptions=t

however in the output alignments to this index I still see full-length reference names:

Code:

uc011kko.2_1158_1257_0_0_1      0       7 dna_sm:chromosome chromosome:GRCh37:7:1:159138663:1 REF  101957711        44      100=    *       0       0       ACCCATGGTTCCACGGGACACTGTCCCGGGTCAAGGCTGCTCAACTGGTTCTGGCAGGGGGGCCCCGGAACCACGGCCTCTTCGTGATCCGCCAAAGTGA        ????????????????????????????????????????????????????????????????????????????????????????????????????        NM:i:0  AM:i:44

I downloaded the current version last week (Thursday or Friday).

The "trd" flag needs to be set when mapping, not indexing. Please let me know if you used it when mapping - if so, there's a bug; but if it was only when indexing, that behavior is expected.

**sdriscoll** · 11-10-2014, 12:31 PM

Ah, thanks. I did not use it when mapping. Also thanks for the primer on bbsplit and bbduk. Fun tools!

**sdriscoll** · 11-10-2014, 01:06 PM

bbduk and bbsplit are super close to what I need except I actually want a lower level result (i.e. the number of mers per query sequence that match to mers in each of the two references). I don't wany any assignment decisions to be made...just counting.

**Brian Bushnell** · 11-10-2014, 01:43 PM

Coincidentally, I'm making a new version of BBDuk designed for RNA-seq which will be able to do just that. The current version only stores one value per kmer so if both references have the same kmer, it will only be considered a match to the first one. The new version will store an arbitrary number of values per kmer, so it will be able to correctly resolve situations where multiple references share a kmer (as in alternatively spliced transcriptomes). I'll build in the ability to note the counts per reference, but it probably won't be ready for a few days.

**sdriscoll** · 11-10-2014, 02:09 PM

Thanks, that sounds great. BBDuk is looking like a really handy tool. For example even taking two kmer FASTA files and doing a "set difference" between them is possible here. I dig it.

**susanklein** · 11-10-2014, 07:50 PM

bbmap error

Hi,

any idea what could cause this error? Thanks:

$ ~/bin/bbmap/bbmap.sh ref=plasmids.fna

java -Djava.library.path=/OSM/HOME-MEL/all29c/bin/bbmap/jni/ -ea -Xmx33004m -cp /OSM/HOME-MEL/all29c/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=plasmids.fna
Exception in thread "main" java.lang.UnsupportedClassVersionError: align2/BBMap : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: align2.BBMap. Program will exit.

**Brian Bushnell** · 11-10-2014, 08:12 PM

Hi Susan,

That means you are using an old version of Java. BBMap requires Java 7 or higher. You can install it yourself if you want by downloading it from here, though the easiest way would simply be to ask your sysadmin to install it, if you have one. Java 7 has been out since 2011, and it's backwards compatible with software made for older versions, so I highly recommend installing it and removing any older versions.

Java 8 is out now, too, which will also work fine.

**susanklein** · 11-10-2014, 08:22 PM

Hi,

ok, thanks apparently 7 or 8 not an option on the cluster I use. Stuck with 6 for now. I mostly have to avoid Java programs because of this. Thanks for your efforts anyway.

S.

**Brian Bushnell** · 11-10-2014, 08:24 PM

Susan,

I will put up a Java 6-compiled version up tomorrow.

-Brian

**susanklein** · 11-10-2014, 08:27 PM

Thanks very much. That's what I call service!

S.

**Brian Bushnell** · 11-11-2014, 08:57 AM

Susan,

The Java 6-compiled version is up now. It's available here, the second file down (BBMap_33.89_java6.tar.gz). Please let me know if you have any problems with it.

**sdriscoll** · 11-11-2014, 01:54 PM

Hi Brian,

Since I live off of command line help with all of the tools I use I had to reformat your usage function in the bbmap.sh script. I thought I'd share with you in case you'd like to build off of this one and keep it a little cleaner.

Code:

usage(){
echo "
BBMap v33.x
Written by Brian Bushnell, from Dec. 2010 - present
Last modified October 1, 2014

Description:  Fast and accurate short-read aligner for DNA and RNA.

To index:   bbmap.sh ref=<reference fasta>
To map:     bbmap.sh in=<reads> out=<output sam>
To map without an index:
    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

in=stdin will accept reads from standard in, and out=stdout will write to 
standard out, but file extensions are still needed to specify the format of the 
input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 
standard in; out=stdout.sam.gz will write gzipped sam.

Indexing Parameters (required when building the index):

nodisk=f                Set to true to build index in memory and write nothing 
                        to disk except output.
ref=<file>              Specify the reference sequence.  Only do this ONCE, 
                        when building the index (unless using 'nodisk').
build=1                 If multiple references are indexed in the same directory, 
                        each needs a unique numeric ID (unless using 'nodisk').
k=13                    Kmer length, range 8-15.  Longer is faster but uses more 
                        memory.  Shorter is more sensitive.
path=<.>                Specify the location to write the index, if you don't 
                        want it in the current working directory.
usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 
                        number (reduces RAM by 50% and sensitivity slightly).
                        Should be enabled both when building the index AND 
                        when mapping.

Input Parameters:

build=1                 Designate index to use.  Corresponds to the number 
                        specified when building the index.
in=<file>               Primary reads input; required parameter.
in2=<file>              For paired reads in two files.
interleaved=auto        True forces paired/interleaved input; false forces 
                        single-ended mapping. If not specified, interleaved 
                        status will be autodetected from read names.
fastareadlen=500        Break up FASTA reads longer than this.  Max is 500.  
                        Only works for FASTA input.
unpigz=f                Spawn a pigz (parallel gzip) process for faster 
                        decompression than using Java.  
                        Requires pigz to be installed.

Sampling Parameters:

reads=-1                Set to a positive number N to only process the first 
                        N reads (or pairs), then quit.  -1 means use all reads.
skipreads=0             Set to a number N to skip the first N reads (or pairs), 
                        then map the rest.
touppercase=t           (tuc) Convert lowercase letters in reads to upper case 
                        (otherwise they will not match the reference).

Mapping Parameters:

fast=f                  This flag is a macro which sets other paramters to run 
                        faster, at reduced sensitivity.  Bad for RNA-seq.
maxindel=16000          Don't look for indels longer than this. Lower is faster.
                        Set to >=100k for RNAseq with long introns like mammals.
strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.
                        By default these are not sought, but may be found anyway.
minid=0.76              Approximate minimum alignment identity to look for.  Higher 
                        is faster and less sensitive.
idfilter=0              Independant of minid; sets exact minimum identity allowed 
                        for alignments.  Range is 0 to 1.
minhits=1               Minimum number of seed hits required for candidate sites.
                        Higher is faster.
k=13                    Kmer length of index.  Higher is faster (for large genomes),
                        less sensitive, and uses more RAM.  Max is 15.
local=f                 Set to true to use local, rather than global, alignments.
                        This will soft-clip ugly ends of poor alignments.
perfectmode=f           Allow only perfect mappings when set to true (very fast).
semiperfectmode=f       Allow only perfect and semiperfect (perfect except for N's 
                        in reference) mappings.
threads=auto            (t) Set to number of threads desired.  By default, uses 
                        all cores available.
ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 
                        multiple top-scoring mapping locations).
                            best    (use the first best site)
                            toss    (consider unmapped)
                            random  (select one top-scoring site randomly)
                            all     (retain all top-scoring sites)
samestrandpairs=f       (ssp) Specify whether paired reads should map to the same 
                        strand or opposite strands.
requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 
                        orientation.  Set to false for long-mate-pair libraries.
killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate insert
                        size or orientation, the read with the lower mapping quality 
                        is marked unmapped.
pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will be 
                        sent to 'outu' but not 'outm'.
rcompmate=f             Reverse complement second read in each pair prior to 
                        mapping.
pairlen=32000           Set max allowed distance between paired reads.  
                        (insert size)=(pairlen)+(read1 length)+(read2 length)
bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 
                        fraction of read length.  Faster but less accurate.
usejni=f                (jni) Do alignments in C code, which is faster.  Requires 
                        compiling the C code; details are in /jni/README.txt.
maxsites2=800           Don't analyze (or print) more than this many alignments 
                        per read.

Quality and Trimming Parameters:

qin=auto                Set to 33 or 64 to specify input quality value ASCII
                        offset. 33 is Sanger, 64 is old Solexa.
qout=auto               Set to 33 or 64 to specify output quality value ASCII 
                        offset (only if output format is fastq).
qtrim=f                 Quality-trim ends before mapping.  Options are 'f' (false), 
                        'l' (left), 'r' (right), and 'lr' (both).
untrim=f                Undo trimming after mapping.  Untrimmed bases will be 
                        soft-clipped in cigar strings.
trimq=7                 Trim regions with average quality below this 
                        (phred algorithm).
mintl=60                (mintrimlength) Don't trim reads to be shorter than this.
fakequality=-1          Set to a positive number 1-50 to generate fake quality 
                        strings for fasta input reads.
ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 
                        out-of-range quality values.

Output Parameters:

outputunmapped=t        Set to false if unmapped reads should not be printed to 
                        'out=' target (saves time and disk space).
out=<file>              Write all reads to this file (unless outputunmapped=t).
outu=<file>             Write only unmapped reads to this file.  Does not include 
                        unmapped paired reads with a mapped mate.
outm=<file>             Write only mapped reads to this file.  Includes unmapped 
                        paired reads with a mapped mate.
bamscript=<file>        (bs) Write a shell script to <file> that will turn the sam 
                        output into a sorted, indexed bam file.
ordered=f               Set to true to output reads in same order as input.  
                        Slower and uses more memory.
overwrite=f             (ow) Allow process to overwrite existing files.
secondary=f             Print secondary alignments.
sssr=0.95               (secondarysitescoreratio) Print only secondary alignments 
                        with score of at least this fraction of primary.
ssao=f                  (secondarysiteasambiguousonly) Only print secondary 
                        alignments for ambiguously-mapped reads.
maxsites=5              Maximum number of total alignments to print per read.
                        Only relevant when secondary=t.
quickmatch=f            Generate cigar strings more quickly.  Must be true to 
                        generate secondary site cigar strings.
trimreaddescriptions=f  (trd) Truncate read names at the first whitespace, 
                        assuming that the remaineder is a comment or description.
ziplevel=2              (zl) Compression level for zip or gzip output.
pigz=f                  Spawn a pigz (parallel gzip) process for faster compression 
                        than using Java.  Requires pigz to be installed.
machineout=f            Set to true to output statistics in machine-friendly 
                        'key=value' format.

Sam flags and settings:

sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 
                        with = and X, or 1.3 to use M.
saa=t                   (secondaryalignmentasterisks) Use asterisks instead of
                        bases for sam secondary alignments.
cigar=t                 Set to 'f' to skip generation of cigar strings (faster).
keepnames=f             Keep original names of paired reads, rather than ensuring 
                        both reads have the same name.
intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 
                        cigar strings for deletions of at least that length.
rgid=                   Set readgroup ID.  All other readgroup fields can be set 
                        similarly, with the flag rgXX=
mdtag=f                 Write MD tags.
nhtag=f                 Write NH tags.
xmtag=f                 Write XM tags (may only work correctly with ambig=all).
xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 
                        for RNAseq using firststrand, secondstrand, or 
                        unstranded libraries.  Needed by Cufflinks.  
                        JGI mainly uses 'firststrand'.
stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:
idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:
inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:
scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:
noheader=f              Disable generation of header lines.

Histogram and statistics output parameters:

scafstats=<file>        Statistics on how many reads mapped to which scaffold 
                        to this file.
bhist=<file>            Base composition histogram by position.
qhist=<file>            Quality histogram by position.
aqhist=<file>           Histogram of average read quality.
bqhist=<file>           Quality histogram designed for box plots.
lhist=<file>            Read length histogram.
ihist=<file>            Write histogram of insert sizes (for paired reads).
ehist=<file>            Errors-per-read histogram.
qahist=<file>           Quality accuracy histogram of error rates versus 
                        quality score.
indelhist=<file>        Indel length histogram.
mhist=<file>            Histogram of match, sub, del, and ins rates by 
                        read location.
gchist=<file>           Read GC content histogram.
gcbins=100              Set the number of bins in the GC histogram.
idhist=<file>           Histogram of read count versus percent identity.
idbins=100              Set the number of bins in the identity histogram.

Coverage output parameters (these may reduce speed and use more RAM):

covstats=<file>         Per-scaffold coverage info.
covhist=<file>          Histogram of # occurrences of each depth level.
basecov=<file>          Coverage per base location.
bincov=<file>           Print binned coverage per location (one line per X bases).
covbinsize=1000         Set the binsize for binned coverage output.
nzo=f                   Only print scaffolds with nonzero coverage.
twocolumn=f             Change to true to print only ID and Avg_fold instead of 
                        all 6 columns to the 'out=' file.
uscov=t                 Include secondary alignments when calculating coverage.
32bit=f                 Set to true if you need per-base coverage over 64k.
strandedcov=f           Track coverage for plus and minus strand independently.
startcov=f              Only track start positions of reads.

Java Parameters:
-Xmx                    This will be passed to Java to set memory usage, 
                        overriding the program's automatic memory detection.
                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 
                        will specify 800 megs.  The max is typically 85% of 
                        physical memory.

This list is not complete.  For more information, please consult 
$DIR""docs/readme.txt. 
Please contact Brian Bushnell at [email protected] if you encounter 
any problems.
"   
}

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News