Seqanswers Leaderboard Ad

**Brian Bushnell** · 10-20-2014, 10:06 AM

I hope to have it submitted by the end of this year.

-Brian

**dbrami** · 10-22-2014, 10:24 AM

Hi Brian,
Thank you for previous answer through email which I am posting:

"Daniel,

Unfortunately, there is not currently a master document, though I certainly want to make one. BBMap is documented in /docs/readme.txt while everything else is just documented in its shellscript. But I do have individual threads on SeqAnswers about the most commonly-used tools:

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

http://seqanswers.com/forums/showthread.php?t=41057

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

(first post contains links to the other threads)

For metagenomics, I consider bbmap.sh, bbsplit.sh, bbduk.sh, bbnorm.sh, pileup.sh, dedupe.sh, and stats.sh to be the most directly relevant.

As for your specific questions:

Is there any documentation on the mapping output summary? Ie: How do numbers in mapped pairs vs read 1 mapped differ?

To be considered proper pairs, by default, the reads must map to opposite strands, on the same scaffold, with an insert size of under 32kbp. Read 1 will typically have a higher mapping rate than the pairing rate, because sometimes the reads map discordantly, and sometimes only one read maps. This is especially true in a metagenome which may have very short scaffolds.

Can we output the mapping results to .fast.gz directly?

All tools in the BBMap package can input and output gzipped files, fasta or fastq. It detects this based on filename, so a gzipped fastq file should be named "*.fq.gz" or "*.fastq.gz".

Is it possible to bin out mappings where both reads map vs either read maps vs unmapped through command line without bitwise comparison of SAM flags?

Yes. BBMap supports a "po" or "pairedonly" flag; you can for example do this:

bbmap.sh in1=read1.fq in2=read2.fq po=t outm=mapped.fq outu=unmapped.fq

...which will split the reads into proper pairs with both mapped (outm stream) and pairs where either one or both didn't map, or they did not map in the proper orientation (outu stream). With "po=f" (pairedonly=false), outm will accept anything where at least one read mapped, and outu will only get pairs in which neither mapped."

I am trying to find a value in mapping summary to would correspond to bowtie2's '% overall alignment rate'

Also, which BBMap parameters should I tweak if I want to accurately want to reproduce bowtie2 results obtained using "-k 1 --fast-local" ?

**Brian Bushnell** · 10-22-2014, 10:48 AM

Daniel,

Originally posted by dbrami View Post

I am trying to find a value in mapping summary to would correspond to bowtie2's '% overall alignment rate'

Also, which BBMap parameters should I tweak if I want to accurately want to reproduce bowtie2 results obtained using "-k 1 --fast-local" ?

For paired reads, the overall alignment rate is ((percent mapped for read 1)+(percent mapped for read 2))/2. Or in other words half of the sum of the two numbers in red below.

Code:

Pairing data:           pct reads       num reads       pct bases          num bases

mated pairs:             99.5206%            3114        99.5206%             934200
bad pairs:                0.3196%              10         0.3196%               3000
insert size avg:          321.31


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  [B][COLOR="Red"]99.9361%[/COLOR][/B]            3127        99.9361%             469050
unambiguous:             98.6897%            3088        98.6897%             463200
ambiguous:                1.2464%              39         1.2464%               5850
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        1.9175%              60         1.9175%               9000
semiperfect site:         1.9175%              60         1.9175%               9000
rescued:                  0.0000%               0

Match Rate:                   NA               NA        62.0582%             441661
Error Rate:              96.5462%            3019        37.6163%             267711
Sub Rate:                87.4320%            2734         2.3412%              16662
Del Rate:                42.8846%            1341        34.0933%             242638
Ins Rate:                49.8881%            1560         1.1818%               8411
N Rate:                  49.4084%            1545         0.3254%               2316


Read 2 data:            pct reads       num reads       pct bases          num bases

mapped:                  [B][COLOR="Red"]99.9041%[/COLOR][/B]            3126        99.9041%             468900
unambiguous:             98.6897%            3088        98.6897%             463200
ambiguous:                1.2144%              38         1.2144%               5700
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        1.6619%              52         1.6619%               7800
semiperfect site:         1.6619%              52         1.6619%               7800
rescued:                  0.0000%               0

Match Rate:                   NA               NA        61.2099%             441930
Error Rate:              96.7370%            3024        38.4685%             277739
Sub Rate:                88.2278%            2758         2.2755%              16429
Del Rate:                43.9859%            1375        35.0546%             253091
Ins Rate:                49.1683%            1537         1.1384%               8219
N Rate:                  50.7358%            1586         0.3216%               2322

Flags equivalent to bowtie2's "-k 1 --fast-local" would be something like:

"fast local maxindel=10"

**dbrami** · 10-22-2014, 02:38 PM

de-interleave fastq outtput

Can BBMap de-interleave the fastq output for reads 1 and reads 2?
Ie: if I Use "pairedonly=f outu=${B}.noplant.fastq.gz", can BBMap put the unmapped forward reads ${B}_R1.noplant.fastq.gz and the unmapped reverse reads in ${B}_R2.noplant.fastq.gz.
Or should I just de-interleave the reads as a post process?

**Brian Bushnell** · 10-22-2014, 02:46 PM

Daniel,

You can set "outu1=" and "outu2=" or "outm1=" and "outm2=" for output in paired files.

**prashant_singh** · 11-07-2014, 06:46 AM

Issues with BBMap

Hi
I am a newbie to NGS data analysis- basically a bench scientist trying to do analysis.
I have sequenced human tissue samples on the PacBio RSII and am trying to align to human genome using the bbmap tool.

As mentioned in the forum, I am trying to use the MapPacBio.sh script for the alignment. Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then

Here is the new error I am getting:

java -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ -ea -Xmx51798m -cp BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88 ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa
Error: Could not find or load main class build=1

Java version is "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Could you please help?

Thanks
Prashant

**westerman** · 11-07-2014, 07:09 AM

@prashant_singh. I am hardly an expert on BBTools. Brian will probably log in soon -- he is on USA west coast time as far as I know -- and so will give a better answer. But I happened to start up a mapPacBio run just as your post came in. Comparing your Java to my Java reveals:

Yours ...

Code:

java 
  -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ 
  -ea -Xmx51798m 
  -cp BBMapPacBio 
  build=1 
  overwrite=true 
  minratio=0.40 
  fastareadlen=6000 
  ambiguous=best 
  minscaf=100 
  startpad=10000 
  stoppad=10000 
  midpad=6000 
  in=/data/*****/PacBio/****/consensus_isoforms.fasta 
  out=/data/***/pac88  
  ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa

Mine ...

Code:

java 
  -Djava.library.path=/group/bioinfo/apps/apps/BBMap-33.57/jni/ 
  -ea -Xmx126058m 
  -cp /group/bioinfo/apps/apps/BBMap-33.57/current/ 
  align2.BBMapPacBio 
  build=1 
  overwrite=true 
  minratio=0.40 
  fastareadlen=6000 
  ambiguous=best 
  minscaf=100 
  startpad=10000 
  stoppad=10000 
  midpad=6000 
  ref=kmer60-10.fa 
  in=all_spades_2000plus.fa 
  out=LR_vs_scaffolds.sam

The major different is the 'cp' where I have the complete path plus the line after the 'cp' which shows the program to run.

You mentioned:

Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then
....
Error: Could not find or load main class build=1

I suspect that you mis-coded the path and probably erased the part which states the program to run. I suggest starting anew and if you really need to hard-code the path then be careful in doing so.

**Brian Bushnell** · 11-07-2014, 09:38 AM

That's correct, and yes, I'm on West Coast time

The command line should probably be:

java -ea -Xmx51798m -cp /opt/compsci/bbmap/33.73/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88.sam ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa

Note both of the things in bold: "/opt/compsci/bbmap/33.73/current/" is probably correct, but maybe not, depending on how it was installed. I'm guessing that somewhere under "/opt/compsci/bbmap/33.73/" is a subdirectory named "current" - that is what you want to point it to.

For "out=/data/***/pac88.sam", I just added the ".sam" suffix. All of the programs in BBTools are extension-sensitive, so if you don't include an extension, it will just output the default format (which in this case is sam, but it's always best to specify it).

**sdriscoll** · 11-07-2014, 04:21 PM

I apologize if this has been covered already but I've just downloaded and run bbmap a few times (seems great so far and I like the control provided via the available command line options) and I have a quick question. When building the index is it possible for bbmap to ignore parts of the reference names past the first blank space? For example I'm using the Ensembl human release and the FASTA reference names look something like this:

>1 dna_sm:chromosome chromosome:GRCh37:1:1:249250621:1 REF

Currently when I run alignments on this FASTA file that entire name shows up in the SAM file while instead I should only see "1". Other aligners (STAR, bowtie, bwa, etc) throw out everything after the first space when indexing leaving just the main reference name. This is particularly important for using aligners as part of a gene expression pipeline using something like eXpress since some transcriptome extraction tools (programs that parse a GTF and extract transcripts from an accompanying genome FASTA file) usually name the transcripts as >TRANSCRIPT_ID GENE_SYMBOL. eXpress will throw out everything after the first space however if the alignments contain the full reference names then no alignments are matched up to transcripts in the output.

**Brian Bushnell** · 11-07-2014, 04:26 PM

Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.

**sdriscoll** · 11-07-2014, 04:43 PM

rad. thanks for that.

**sdriscoll** · 11-07-2014, 04:50 PM

OK one more question. What would the correct alignment option setting be to get alignments with 0 mismatches and still allow splices? If it's not possible I can just do downstream filtering.

Thanks.

**Brian Bushnell** · 11-07-2014, 05:02 PM

There is no option for that, unfortunately.

**sdriscoll** · 11-07-2014, 05:26 PM

OK, thanks. Also thanks for the aligner.

**Brian Bushnell** · 11-07-2014, 05:41 PM

You're welcome!

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News