Unconfigured Ad

GenoMax · 01-24-2018, 06:41 AM

I am not sure why BBMap is not reporting the two sites in the output SAM. Perhaps that is by design.

Unfortunately @Brian has been missing from online forums for last couple of months. I pinged him again yesterday with ref to this question. We will need him to chime in on these questions.

milan.molbio · 01-24-2018, 06:03 AM

@sdriscoll thanks for the tip! with these two options bbmap does report one 100% semiperfect site read but it's not in the output file. Anything else I'm missing?

the exact command was:

Code:

bbmap.sh secondary=t sssr=0.80  k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all   maxsites=10  in=sample.read.fa out=mapped.sam vslow

sdriscoll · 01-23-2018, 12:46 PM

@milan.molbio-

bbmap looks for the best alignment and reports only that one by default. If there are multiple "best" then you can get them all by setting 'ambig=all'.

What you're looking for should be possible by enabling 'secondary=t' and then setting both 'sssr' and 'maxsites' to your liking. "secondary" alignments are those with lower mapping scores relative to the best alignment. "sssr" controls the ratio of the best score that is acceptable (default is 0.95 to allow secondary alignments with mapping scores as low as 95% of the best score). Finally "maxsites" sets a cap on the number of secondary alignment sites to report. You can probably just set that to a very high value to be sure you have them "all" - or at least all that bbmap can find.

milan.molbio · 01-23-2018, 01:46 AM

Hi guys,

I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

Code:

# file: sample.genome.fa
> 10 GRCz10
TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT

# file sample.reads.fa:
> r1
GTTTTACAACAACATGGCAG
GTTTTTTAAGAACATGGCAG

# run.bash
bbmap.sh ref=sample.genome.fa k=3
bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow

# results: mapped.sam
mapped.sam
@HD	VN:1.4	SO:unsorted
@SQ	SN: 10 GRCz10	LN:177
@PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
 r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1

expected output:

1 perfect hit (position 28)
1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)

Code:

# stdout:

Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
Retaining all best sites for ambiguous mappings.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
NOTE:	Ignoring reference file because it already appears to have been processed.
NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:	0.075 seconds.
Loading index for chunk 1-1, build 1
Generated Index:	0.004 seconds.
Analyzed Index:   	0.002 seconds.
Started output stream:	0.021 seconds.
Cleared Memory:    	0.116 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

   ------------------   Results   ------------------

Genome:                	1
Key Length:            	3
Max Indel:             	0
Minimum Score Ratio:  	0.4486
Mapping Mode:         	normal
Reads Used:           	1	(20 bases)

Mapping:          	0.208 seconds.
Reads/sec:       	4.81
kBases/sec:      	0.10


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	100.0000% 	        1 	100.0000% 	          20
unambiguous:     	100.0000% 	        1 	100.0000% 	          20
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	100.0000% 	        1 	100.0000% 	          20
semiperfect site:	100.0000% 	        1 	100.0000% 	          20

Match Rate:      	      NA 	       NA 	100.0000% 	          20
Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
N Rate:          	  0.0000% 	        0 	  0.0000% 	           0

Total time:     	0.571 seconds.

sdriscoll · 01-22-2018, 01:57 PM

Hello-

I'm trying to understand this example in the original post:

Code:

bbmap.sh in=reads.fq outu=unmapped.fq int=f
repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq

It is provided as an example of how to deal with the non-deterministic behavior of bbmap when aligning paired-end reads. However what is described (map reads as single-end and then re-pair them using repair.sh) doesn't seem to be what these code lines show. If the goal is to get bbmap to produce deterministic paired-end alignments wouldn't the final outcome of the method be an alignment file and not just a FASTQ? Why would I be using the un-aligned reads as part of the final product?

milan.molbio · 01-18-2018, 05:53 AM

Hi guys,

I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

Code:

# file: sample.genome.fa
> 10 GRCz10
TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT

# file sample.reads.fa:
> r1
GTTTTACAACAACATGGCAG
GTTTTTTAAGAACATGGCAG

# run.bash
bbmap.sh ref=sample.genome.fa k=3
bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow

# results: mapped.sam
mapped.sam
@HD	VN:1.4	SO:unsorted
@SQ	SN: 10 GRCz10	LN:177
@PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
 r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1

expected output:

1 perfect hit (position 28)
1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)

Code:

# stdout:

Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
Retaining all best sites for ambiguous mappings.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
NOTE:	Ignoring reference file because it already appears to have been processed.
NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:	0.075 seconds.
Loading index for chunk 1-1, build 1
Generated Index:	0.004 seconds.
Analyzed Index:   	0.002 seconds.
Started output stream:	0.021 seconds.
Cleared Memory:    	0.116 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

   ------------------   Results   ------------------

Genome:                	1
Key Length:            	3
Max Indel:             	0
Minimum Score Ratio:  	0.4486
Mapping Mode:         	normal
Reads Used:           	1	(20 bases)

Mapping:          	0.208 seconds.
Reads/sec:       	4.81
kBases/sec:      	0.10


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	100.0000% 	        1 	100.0000% 	          20
unambiguous:     	100.0000% 	        1 	100.0000% 	          20
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	100.0000% 	        1 	100.0000% 	          20
semiperfect site:	100.0000% 	        1 	100.0000% 	          20

Match Rate:      	      NA 	       NA 	100.0000% 	          20
Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
N Rate:          	  0.0000% 	        0 	  0.0000% 	           0

Total time:     	0.571 seconds.

milan.molbio · 01-18-2018, 05:47 AM

how to find all ambiguous reads?

Hi guys,

I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

Code:

# file: sample.genome.fa
> 10 GRCz10
TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT

# file sample.reads.fa:
> r1
GTTTTACAACAACATGGCAG
GTTTTTTAAGAACATGGCAG

# run.bash
bbmap.sh ref=sample.genome.fa k=3
bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow

# results: mapped.sam
mapped.sam
@HD	VN:1.4	SO:unsorted
@SQ	SN: 10 GRCz10	LN:177
@PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
 r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1

expected output:

1 perfect hit (position 28)
1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)

Code:

# stdout:

Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
Retaining all best sites for ambiguous mappings.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
NOTE:	Ignoring reference file because it already appears to have been processed.
NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:	0.075 seconds.
Loading index for chunk 1-1, build 1
Generated Index:	0.004 seconds.
Analyzed Index:   	0.002 seconds.
Started output stream:	0.021 seconds.
Cleared Memory:    	0.116 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

   ------------------   Results   ------------------

Genome:                	1
Key Length:            	3
Max Indel:             	0
Minimum Score Ratio:  	0.4486
Mapping Mode:         	normal
Reads Used:           	1	(20 bases)

Mapping:          	0.208 seconds.
Reads/sec:       	4.81
kBases/sec:      	0.10


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	100.0000% 	        1 	100.0000% 	          20
unambiguous:     	100.0000% 	        1 	100.0000% 	          20
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	100.0000% 	        1 	100.0000% 	          20
semiperfect site:	100.0000% 	        1 	100.0000% 	          20

Match Rate:      	      NA 	       NA 	100.0000% 	          20
Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
N Rate:          	  0.0000% 	        0 	  0.0000% 	           0

Total time:     	0.571 seconds.

phylloxera · 01-10-2018, 01:31 PM

Ultimately, i'd like to do variant calling on a combined pac bio / illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent to the reference.

Here is my working command:
bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh

It optimizes the number of reads mapped (4148/4164) and minimizes the number of ambiguous mapping reads (1).

Given, k=8 and minid=0.4, geneious mapper maps all 4164 reads for maxindel ranging from 20-500. If it is in the cards, I'd like to be able to map the remaining stragglers but don't know what other BBMap flags I should try in this endeavor. Also, I'm curious why bbmap is so much more sensitive to the valueof maxindel... here are select bbmap results:
maxindel num reads num ambiguous
20 4145 2
40 4148 1
60 4137 1
100 4130 3
200 4125 4

mcmc · 12-13-2017, 01:36 PM

Originally posted by GenoMax View Post

BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.

Thanks! I'm surprised there's something BBTools doesn't do

GenoMax · 12-13-2017, 10:27 AM

Originally posted by mcmc View Post

Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

Thanks,
MCMC

BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.

mcmc · 12-13-2017, 09:29 AM

summarizing mapped reads by orf

Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

Thanks,
MCMC

mcmc · 12-11-2017, 09:21 PM

BBSplit ambig=toss

Hi Brian et al.,
When I run BBsplit with ambig=toss, the ambiguous reads are not written to unmapped.fq; but when I run BBmap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBsplit (both within/between two references).
Thanks,
MC

GenoMax · 12-04-2017, 07:01 AM

Have you looked through the logs and such to see if there is any indication of any issues? There is always the possibility that @Brian may not have checked extreme usage case like this for randomreads.sh and this may be a genuine bug that is clearly a road-block.

Since you have said that 100 genomes seem to work fine you could do 10 runs of 100 genomes each and then perhaps merge the data. A thought.

vzinche · 12-04-2017, 06:29 AM

I will try that, thank you.

And regarding the third question, that is actually a problem. I have no idea where these reads come from. I tried to search them or parts of them in the original genomes, but apparently with no success. Could be chimeras made up of short sequences, but I can't say for sure.

The first thought was that it could be some memory problem, since it gets worse when increasing the size of the initial file, but it's just a random idea.

GenoMax · 12-04-2017, 06:20 AM

@Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different ") but be aware that he has been behind on support of late.

A few things to check that I can think of:

1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?

Topics	Statistics	Last Post
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 14 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 28 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 33 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 23 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM

Unconfigured Ad

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News