Seqanswers Leaderboard Ad

**Brian Bushnell** · 11-11-2014, 02:01 PM

Wow, that's actually... a lot nicer. Thanks! I'll put it in the next release (and credit you of course).

**susanklein** · 11-11-2014, 02:34 PM

Hi,

sorry, the Javva 6 version is not working for me either. If I check my version I get:

$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1~deb7u1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
all29c@aahl-02-mel:~/data/019/19.5$

The error is:
Exception in thread "main" java.lang.UnsupportedClassVersionError: align2/BBMap : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: align2.BBMap. Program will exit.

**sdriscoll** · 11-11-2014, 02:36 PM

Cool! I did bbduk.sh as well. I've been using those two the most the last couple of days. FYI I had to make sure there wasn't any mixture of tabs and spaces (fortunately Sublime Text 2 can make that swap for me). Also it's a lot easier to write those tedious usage things in a single echo instead of multiple.

Code:

usage(){
echo "
Written by Brian Bushnell
Last modified October 27, 2014

Description:  Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads.

Usage:  bbduk.sh in=<input file> out=<output file> ref=<contaminant files>

Input may be stdin or a fasta, fastq, or sam file, compressed or uncompressed.
Output may be stdout or a file.
If you pipe via stdin/stdout, please include the file type; e.g. for gzipped 
fasta input, set in=stdin.fa.gz

Optional parameters (and their defaults)

Input parameters:

in=<file>           Main input. in=stdin.fq will pipe from stdin.
in2=<file>          Input for 2nd read of pairs in a different file.
ref=<file,file>     Comma-delimited list of reference files.
touppercase=f       (tuc) Change all bases upper-case.
interleaved=auto    (int) t/f overrides interleaved autodetection.
qin=auto            Input quality offset: 33 (Sanger), 64, or auto.
reads=-1            If positive, quit after processing X reads or pairs.

Output parameters:

out=<file>          (outnonmatch) Write reads here that do not contain 
                    kmers matching the database.  'out=stdout.fq' will pipe 
                    to standard out.
out2=<file>         (outnonmatch2) Use this to write 2nd read of pairs to a 
                    different file.
outmatch=<file>     (outm or outb) Write reads here that contain kmers 
                    matching the database.
outmatch2=<file>    (outm2 or outb2) Use this to write 2nd read of pairs 
                    to a different file.
outsingle=<file>    (outs) Use this to write singleton reads whose mate was 
                    trimmed shorter than minlen.
stats=<file>        Write statistics about which contamininants were detected.
duk=<file>          Write statistics in duk's format.
overwrite=t         (ow) Grant permission to overwrite files.
showspeed=t         (ss) 'f' suppresses display of processing speed.
ziplevel=2          (zl) Compression level; 1 (min) through 9 (max).
fastawrap=80        Length of lines in fasta output.
qout=auto           Output quality offset: 33 (Sanger), 64, or auto.

Histogram output parameters:

bhist=<file>        Base composition histogram by position.
qhist=<file>        Quality histogram by position.
aqhist=<file>       Histogram of average read quality.
bqhist=<file>       Quality histogram designed for box plots.
lhist=<file>        Read length histogram.
gchist=<file>       Read GC content histogram.

Histograms for sam files only (requires sam format 1.4 or higher):

ehist=<file>        Errors-per-read histogram.
qahist=<file>       Quality accuracy histogram of error rates versus quality 
                    score.
indelhist=<file>    Indel length histogram.
mhist=<file>        Histogram of match, sub, del, and ins rates by read location.
idhist=<file>       Histogram of read count versus percent identity.

Processing parameters:

threads=auto        (t) Set number of threads to use; default is number of 
                    logical processors.
k=27                Kmer length used for finding contaminants.  Contaminants 
                    shorter than k will not be found.  k must be at least 1.
rcomp=t             Look for reverse-complements of kmers in addition to 
                    forward kmers.
maskmiddle=t        (mm) Treat the middle base of a kmer as a wildcard, to 
                    increase sensitivity in the presence of errors.
maxbadkmers=0       (mbk) Reads with more than this many contaminant kmers 
                    will be discarded.
hammingdistance=0   (hdist) Maximum Hamming distance from ref kmers (subs only).  
                    Memory use is proportional to (3*K)^hdist.
editdistance=0      (edist) Maximum edit distance from ref kmers (subs and indels).  
                    Memory use is proportional to (8*K)^edist.
forbidn=f           (fn) Forbids matching of read kmers containing N.  By default, 
                    these will match a reference 'A' if hdist>0 or edist>0, to 
                    increase sensitivity.
maxns=-1            If non-negative, reads with more Ns than this 
                    (before trimming) will be discarded.
minskip=1           (mns) Force minimal skip interval when indexing reference 
                    kmers.  1 means use all, 2 means use every other kmer, etc.
maxskip=99          (mxs) Restrict maximal skip interval when indexing 
                    reference kmers. Normally all are used for scaffolds<100kb, 
                    but with longer scaffolds, up to K-1 are skipped.
removeifeitherbad=t (rieb) Paired reads get sent to 'outmatch' if either is 
                    match (or either is trimmed shorter than minlen).  
                    Set to false to require both.
findbestmatch=f     (fbm) If multiple matches, associate read with sequence 
                    sharing most kmers.
prealloc=f          Preallocate memory in table.  Allows faster table loading 
                    and more efficient memory usage, for a large reference.

Trimming/Masking parameters:

ktrim=f             Trim reads to remove bases matching reference kmers.
                    Values: 
                             t (trim)
                             f (don't trim), 
                             r (trim right end), 
                             l (trim left end), 
                             n (convert to N instead of trimming).
                    Any non-whitespace character other than t, f, r, l, n: 
                    convert to that symbol rather than trimming, and process 
                    short kmers on both ends.
useshortkmers=f     (usk) Look for shorter kmers at read tips (only for 
                    k-trimming).  Enabling this will disable maskmiddle.
mink=6              Minimum length of short kmers.  Setting this automatically 
                    sets useshortkmers=t.
qtrim=f             Trim read ends to remove bases with quality below minq.
                    Performed AFTER looking for kmers.
                    Values: 
                            t (trim both ends), 
                            f (neither end), 
                            r (right end only), 
                            l (left end only).
trimq=6             Trim quality threshold.
minlength=10        (ml) Reads shorter than this after trimming will be 
                    discarded.  Pairs will be discarded only if both are shorter.
maxlength=          Reads longer than this after trimming will be discarded.  Pairs 
                    will be discarded only if both are longer.
minavgquality=0     (maq) Reads with average quality (before trimming) below 
                    this will be discarded.
otm=f               (outputtrimmedtomatch) Output reads trimmed to shorter than 
                    minlength to outm rather than discarding.
tp=0                (trimpad) Trim this much extra around matching kmers.
tbo=f               (trimbyoverlap) Trim adapters based on where paired 
                    reads overlap.
minoverlap=24       Require this many bases of overlap for overlap detection.
mininsert=25        Require insert size of at least this much for overlap 
                    detection (will automatically set minoverlap too).
tpe=f               (trimpairsevenly) When kmer right-trimming, trim both reads 
                    to the minimum length of either.
forcetrimleft=0     (ftl) If positive, trim bases to the left of this position 
                    (exclusive, 0-based).
forcetrimright=0    (ftr) If positive, trim bases to the right of this position 
                    (exclusive, 0-based).
restrictleft=0      If positive, only look for kmer matches in the 
                    leftmost X bases.
restrictright=0     If positive, only look for kmer matches in the 
                    rightmost X bases.

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding 
                    the program's automatic memory detection. -Xmx20g will specify 
                    20 gigs of RAM, and -Xmx200m will specify 200 megs.  
                    The max is typically 85% of physical memory.

There is a changelog at 
/global/projectb/sandbox/gaag/bbtools/docs/changelog_bbduk.txt
Please contact Brian Bushnell at [email protected] if you encounter any problems.
"	
}

**GenoMax** · 11-11-2014, 02:55 PM

Originally posted by susanklein View Post

Hi,

sorry, the Javva 6 version is not working for me either. If I check my version I get:

$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1~deb7u1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
all29c@aahl-02-mel:~/data/019/19.5$

Your sys admins must hate making any changes

For Java 6 .. last release is 45. Looks like you are on release 27.

**Brian Bushnell** · 11-11-2014, 03:07 PM

Originally posted by susanklein View Post

Hi,

sorry, the Javva 6 version is not working for me either. If I check my version I get:

$ java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1~deb7u1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
all29c@aahl-02-mel:~/data/019/19.5$

The error is:
Exception in thread "main" java.lang.UnsupportedClassVersionError: align2/BBMap : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: align2.BBMap. Program will exit.

Susan,

Sorry, that was my fault; I was able to replicate that error. The wrong binaries were somehow in the java6 package. I've fixed it for real this time with BBMap_33.89b_java6.tar.gz which is now uploaded.

-Brian

**Brian Bushnell** · 11-11-2014, 03:08 PM

Originally posted by sdriscoll View Post

Cool! I did bbduk.sh as well. I've been using those two the most the last couple of days. FYI I had to make sure there wasn't any mixture of tabs and spaces (fortunately Sublime Text 2 can make that swap for me). Also it's a lot easier to write those tedious usage things in a single echo instead of multiple.

Thanks again! I'll have to try out Sublime Text 2.

**susanklein** · 11-11-2014, 05:14 PM

Hi,

this is very wierd.. it must be a machine problem, but bbmap (and bowtie1/2) are not working for me for certain file sizes (~200k). The error is below. The file is there, looks normal. If I double or half its size, then the index builds. Very strange.

Thanks for the help though!

5$ ~/bin/bbmap/bbmap.sh ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna
java -Djava.library.path=/OSM/HOME-MEL/all29c/bin/bbmap/jni/ -ea -Xmx32955m -cp /OSM/HOME-MEL/all29c/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna]

BBMap version 33.89
Retaining first best site only for ambiguous mappings.
No output file.
Writing reference.
Executing dna.FastaToChromArrays2 [/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1

Loaded Reference: 0.012 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: /OSM/HOME-MEL/all29c/data/019/19.5/ref/index/1/chr1_index_k13_c13_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 2.162 seconds.
No reads to process; quitting.

Attached Files

plasmids1.fna.gz (59.2 KB, 41 views)

**susanklein** · 11-11-2014, 05:29 PM

Although.. not sure now.. it seems to have made two files chr1_index_k13_c13_b1.block and chr1_index_k13_c13_b1.block2.gz
is this right? Maybe it has worked.. The 'no reads to process' threw me.

Thanks.

S

Originally posted by susanklein View Post

Hi,

this is very wierd.. it must be a machine problem, but bbmap (and bowtie1/2) are not working for me for certain file sizes (~200k). The error is below. The file is there, looks normal. If I double or half its size, then the index builds. Very strange.

Thanks for the help though!

5$ ~/bin/bbmap/bbmap.sh ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna
java -Djava.library.path=/OSM/HOME-MEL/all29c/bin/bbmap/jni/ -ea -Xmx32955m -cp /OSM/HOME-MEL/all29c/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna]

BBMap version 33.89
Retaining first best site only for ambiguous mappings.
No output file.
Writing reference.
Executing dna.FastaToChromArrays2 [/OSM/HOME-MEL/all29c/data/019/19.5/plasmids.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gzip=true, chromc=false, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1

Loaded Reference: 0.012 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: /OSM/HOME-MEL/all29c/data/019/19.5/ref/index/1/chr1_index_k13_c13_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 2.162 seconds.
No reads to process; quitting.

**GenoMax** · 11-11-2014, 05:57 PM

Originally posted by susanklein View Post

Although.. not sure now.. it seems to have made two files chr1_index_k13_c13_b1.block and chr1_index_k13_c13_b1.block2.gz
is this right? Maybe it has worked.. The 'no reads to process' threw me.

Thanks.

S

It has worked. One could do the reference generation and mapping at the same time or pre-create the indexes (as you did). In the latter case the program outputs the "no reads to process" message.

Use path=/OSM/HOME-MEL/all29c/data/019/19.5 when you do the mapping.

**sdriscoll** · 11-14-2014, 09:47 AM

how do you control the quality filtering threshold in bbmap? i see the maq threshold setting in reformat.sh - is that same setting usable in bbmap? thanks.

**Brian Bushnell** · 11-14-2014, 10:08 AM

Reformat and BBDuk both support 'maq' (minAverageQuality), but it's not in BBMap - I didn't really think it would be useful. I will add it, though. You can also pipe input:

reformat.sh in=reads.fq out=stdout.fq maq=10 | bbmap.sh in=stdin.fq ...

...though that's not really ideal.

**sdriscoll** · 11-14-2014, 10:16 AM

I must be misunderstanding something then. I aligned some PE 100 reads from 2011 that had some crappy qualities with bbmap and in the output bam it had flagged many reads with x200 (qc failed). Is bbmap not filtering by base qualities?

**Brian Bushnell** · 11-14-2014, 10:32 AM

Oh... that's a heuristic that cannot be adjusted, though it CAN be disabled (with the "usequality=f" or "uq=f" flag). I'll add that to the help menu.

Not every kmer in a read is used to search the index. Most are discarded, and they are discarded based on several factors:
1) How close they are to other retained kmers (to get a good distribution across the read).
2) The probability that the kmer is correct, based on base qualities; by default, kmers with a 94%+ chance of containing an error are usually discarded, and kmers with a 99.99%+ chance of containing an error are always discarded - this includes any kmer that spans an 'N'.
3) Kmers that are more common in that specific genome are more likely to be discarded.

If a read ends up with fewer than "minhits" kmers (default 1) then it will be discarded as "low quality".

"uq=f" will turn off rule 2, so quality scores will not be considered. However, a read will still be discarded as low quality if there are so many Ns in it that there are not K consecutive defined bases anywhere.

**sdriscoll** · 11-14-2014, 10:58 AM

Ah thanks for the explanation! Very interesting. I'm not used to aligners making those kind of decisions but I think it's useful. I have another question for you.

Originally posted by sdriscoll View Post

OK one more question. What would the correct alignment option setting be to get alignments with 0 mismatches and still allow splices? If it's not possible I can just do downstream filtering.

Thanks.

Originally posted by Brian Bushnell View Post

There is no option for that, unfortunately.

I'm wondering how the setting of minid changes the reporting of intron spanning alignments in RNA-seq reads. At what point will minid cause bbmap to NOT report introns? For example I may want to be very restrictive about mismatches (minid=0.98) but I'd still want my RNA-seq reads to map across junctions. Is there a measurable identity penalty when reads map across introns?

Thanks.

**Brian Bushnell** · 11-14-2014, 11:41 AM

First, let me describe "minid" and "idfilter", which are slightly different:

"minid" sets a threshold for the minimum allowable alignment score, which is based on a custom hard-coded affine-transform weight matrix, and only loosely correlated with the actual percent identity, depending on the types and lengths of mutation events. It directly affects the mapping process (and speed), as sites that can be proven to have a score below the minimum ratio are not further considered.

"idfilter" does not affect mapping at all, but rather, sites that have an true percent identity below "idfilter" are not printed. You can use both or either; the difference is subtle, but important in some cases.

Now, as for scoring - the affine score matrix is deeply embedded in BBMap. A 100bp read that maps perfectly with only matches has a 100% identity and BBMap score of 9970 (100 points per match, but only 70 points for the first match). If a 100bp read can map somewhere with 1 mismatch (99% identity, BBMap score of 9713: 9970+1*(-127-100-100+70)), that site will always be chosen over a site with 50bp match, 20k intron, 50bp match (4.7% identity, BBMap score of roughly 8841). But if the 100bp read can map somewhere with 5 mismatches (95% identity, BBMap score of 8685: 9970 +5*(-127-100-100+70)), the spliced alignment will be chosen, because 8841>8685. However they are very close so it would get a low mapping score.

These site-selection priorities will be made no matter what you set for idfilter or minid; they could only be changed by a new weight matrix, which is nontrivial because it affects a lot of the proofs scattered through the code. A couple mismatches will always be preferred over a really long splice.

So, you can post-filter based on the number of mismatches, but there's currently no way to adjust what BBMap thinks is the best alignment beyond using the "maxindel" flag. However, I could make a new weight matrix that penalizes substitutions more and deletions less, as long as a single deletion strictly has a greater score penalty than a single substitution. Would that be helpful?

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News