Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GenoMax
    replied
    I am not sure why BBMap is not reporting the two sites in the output SAM. Perhaps that is by design.

    Unfortunately @Brian has been missing from online forums for last couple of months. I pinged him again yesterday with ref to this question. We will need him to chime in on these questions.

    Leave a comment:


  • milan.molbio
    replied
    @sdriscoll thanks for the tip! with these two options bbmap does report one 100% semiperfect site read but it's not in the output file. Anything else I'm missing?

    the exact command was:
    Code:
    bbmap.sh secondary=t sssr=0.80  k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all   maxsites=10  in=sample.read.fa out=mapped.sam vslow

    Leave a comment:


  • sdriscoll
    replied
    @milan.molbio-

    bbmap looks for the best alignment and reports only that one by default. If there are multiple "best" then you can get them all by setting 'ambig=all'.

    What you're looking for should be possible by enabling 'secondary=t' and then setting both 'sssr' and 'maxsites' to your liking. "secondary" alignments are those with lower mapping scores relative to the best alignment. "sssr" controls the ratio of the best score that is acceptable (default is 0.95 to allow secondary alignments with mapping scores as low as 95% of the best score). Finally "maxsites" sets a cap on the number of secondary alignment sites to report. You can probably just set that to a very high value to be sure you have them "all" - or at least all that bbmap can find.

    Leave a comment:


  • milan.molbio
    replied
    Hi guys,

    I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

    Code:
    # file: sample.genome.fa
    > 10 GRCz10
    TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
    
    # file sample.reads.fa:
    > r1
    GTTTTACAACAACATGGCAG
    GTTTTTTAAGAACATGGCAG
    
    # run.bash
    bbmap.sh ref=sample.genome.fa k=3
    bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
    
    # results: mapped.sam
    mapped.sam
    @HD	VN:1.4	SO:unsorted
    @SQ	SN: 10 GRCz10	LN:177
    @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
     r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
    expected output:
    1. 1 perfect hit (position 28)
    2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
    3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



    Code:
    # stdout:
    
    Max memory cannot be determined.  Attempting to use 3200 MB.
    If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
    or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
    java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
    Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
    Retaining all best sites for ambiguous mappings.
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
    NOTE:	Ignoring reference file because it already appears to have been processed.
    NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:	0.075 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:	0.004 seconds.
    Analyzed Index:   	0.002 seconds.
    Started output stream:	0.021 seconds.
    Cleared Memory:    	0.116 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 8 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
    
       ------------------   Results   ------------------
    
    Genome:                	1
    Key Length:            	3
    Max Indel:             	0
    Minimum Score Ratio:  	0.4486
    Mapping Mode:         	normal
    Reads Used:           	1	(20 bases)
    
    Mapping:          	0.208 seconds.
    Reads/sec:       	4.81
    kBases/sec:      	0.10
    
    
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	100.0000% 	        1 	100.0000% 	          20
    unambiguous:     	100.0000% 	        1 	100.0000% 	          20
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	100.0000% 	        1 	100.0000% 	          20
    semiperfect site:	100.0000% 	        1 	100.0000% 	          20
    
    Match Rate:      	      NA 	       NA 	100.0000% 	          20
    Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
    Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
    
    Total time:     	0.571 seconds.

    Leave a comment:


  • sdriscoll
    replied
    Hello-

    I'm trying to understand this example in the original post:

    Code:
    bbmap.sh in=reads.fq outu=unmapped.fq int=f
    repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq
    It is provided as an example of how to deal with the non-deterministic behavior of bbmap when aligning paired-end reads. However what is described (map reads as single-end and then re-pair them using repair.sh) doesn't seem to be what these code lines show. If the goal is to get bbmap to produce deterministic paired-end alignments wouldn't the final outcome of the method be an alignment file and not just a FASTQ? Why would I be using the un-aligned reads as part of the final product?

    Leave a comment:


  • milan.molbio
    replied
    Hi guys,

    I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

    Code:
    # file: sample.genome.fa
    > 10 GRCz10
    TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
    
    # file sample.reads.fa:
    > r1
    GTTTTACAACAACATGGCAG
    GTTTTTTAAGAACATGGCAG
    
    # run.bash
    bbmap.sh ref=sample.genome.fa k=3
    bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
    
    # results: mapped.sam
    mapped.sam
    @HD	VN:1.4	SO:unsorted
    @SQ	SN: 10 GRCz10	LN:177
    @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
     r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
    expected output:
    1. 1 perfect hit (position 28)
    2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
    3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



    Code:
    # stdout:
    
    Max memory cannot be determined.  Attempting to use 3200 MB.
    If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
    or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
    java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
    Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
    Retaining all best sites for ambiguous mappings.
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
    NOTE:	Ignoring reference file because it already appears to have been processed.
    NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:	0.075 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:	0.004 seconds.
    Analyzed Index:   	0.002 seconds.
    Started output stream:	0.021 seconds.
    Cleared Memory:    	0.116 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 8 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
    
       ------------------   Results   ------------------
    
    Genome:                	1
    Key Length:            	3
    Max Indel:             	0
    Minimum Score Ratio:  	0.4486
    Mapping Mode:         	normal
    Reads Used:           	1	(20 bases)
    
    Mapping:          	0.208 seconds.
    Reads/sec:       	4.81
    kBases/sec:      	0.10
    
    
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	100.0000% 	        1 	100.0000% 	          20
    unambiguous:     	100.0000% 	        1 	100.0000% 	          20
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	100.0000% 	        1 	100.0000% 	          20
    semiperfect site:	100.0000% 	        1 	100.0000% 	          20
    
    Match Rate:      	      NA 	       NA 	100.0000% 	          20
    Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
    Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
    
    Total time:     	0.571 seconds.

    Leave a comment:


  • milan.molbio
    replied
    how to find all ambiguous reads?

    Hi guys,

    I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

    Code:
    # file: sample.genome.fa
    > 10 GRCz10
    TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
    
    # file sample.reads.fa:
    > r1
    GTTTTACAACAACATGGCAG
    GTTTTTTAAGAACATGGCAG
    
    # run.bash
    bbmap.sh ref=sample.genome.fa k=3
    bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
    
    # results: mapped.sam
    mapped.sam
    @HD	VN:1.4	SO:unsorted
    @SQ	SN: 10 GRCz10	LN:177
    @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
     r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
    expected output:
    1. 1 perfect hit (position 28)
    2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
    3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



    Code:
    # stdout:
    
    Max memory cannot be determined.  Attempting to use 3200 MB.
    If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
    or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
    java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
    Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
    Retaining all best sites for ambiguous mappings.
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
    NOTE:	Ignoring reference file because it already appears to have been processed.
    NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:	0.075 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:	0.004 seconds.
    Analyzed Index:   	0.002 seconds.
    Started output stream:	0.021 seconds.
    Cleared Memory:    	0.116 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 8 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
    
       ------------------   Results   ------------------
    
    Genome:                	1
    Key Length:            	3
    Max Indel:             	0
    Minimum Score Ratio:  	0.4486
    Mapping Mode:         	normal
    Reads Used:           	1	(20 bases)
    
    Mapping:          	0.208 seconds.
    Reads/sec:       	4.81
    kBases/sec:      	0.10
    
    
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	100.0000% 	        1 	100.0000% 	          20
    unambiguous:     	100.0000% 	        1 	100.0000% 	          20
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	100.0000% 	        1 	100.0000% 	          20
    semiperfect site:	100.0000% 	        1 	100.0000% 	          20
    
    Match Rate:      	      NA 	       NA 	100.0000% 	          20
    Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
    Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
    
    Total time:     	0.571 seconds.

    Leave a comment:


  • phylloxera
    replied
    Ultimately, i'd like to do variant calling on a combined pac bio / illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent to the reference.

    Here is my working command:
    bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh

    It optimizes the number of reads mapped (4148/4164) and minimizes the number of ambiguous mapping reads (1).

    Given, k=8 and minid=0.4, geneious mapper maps all 4164 reads for maxindel ranging from 20-500. If it is in the cards, I'd like to be able to map the remaining stragglers but don't know what other BBMap flags I should try in this endeavor. Also, I'm curious why bbmap is so much more sensitive to the valueof maxindel... here are select bbmap results:
    maxindel num reads num ambiguous
    20 4145 2
    40 4148 1
    60 4137 1
    100 4130 3
    200 4125 4

    Leave a comment:


  • mcmc
    replied
    Originally posted by GenoMax View Post
    BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.
    Thanks! I'm surprised there's something BBTools doesn't do

    Leave a comment:


  • GenoMax
    replied
    Originally posted by mcmc View Post
    Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

    While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

    Thanks,
    MCMC
    BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.

    Leave a comment:


  • mcmc
    replied
    summarizing mapped reads by orf

    Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

    While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

    Thanks,
    MCMC

    Leave a comment:


  • mcmc
    replied
    BBSplit ambig=toss

    Hi Brian et al.,
    When I run BBsplit with ambig=toss, the ambiguous reads are not written to unmapped.fq; but when I run BBmap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBsplit (both within/between two references).
    Thanks,
    MC

    Leave a comment:


  • GenoMax
    replied
    Have you looked through the logs and such to see if there is any indication of any issues? There is always the possibility that @Brian may not have checked extreme usage case like this for randomreads.sh and this may be a genuine bug that is clearly a road-block.

    Since you have said that 100 genomes seem to work fine you could do 10 runs of 100 genomes each and then perhaps merge the data. A thought.

    Leave a comment:


  • vzinche
    replied
    I will try that, thank you.

    And regarding the third question, that is actually a problem. I have no idea where these reads come from. I tried to search them or parts of them in the original genomes, but apparently with no success. Could be chimeras made up of short sequences, but I can't say for sure.

    The first thought was that it could be some memory problem, since it gets worse when increasing the size of the initial file, but it's just a random idea.

    Leave a comment:


  • GenoMax
    replied
    @Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different ") but be aware that he has been behind on support of late.

    A few things to check that I can think of:

    1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
    2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
    3. Do you know why the remaining reads are not mapping (are they chimeras)?

    Leave a comment:

Latest Articles

Collapse

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by SEQadmin2, 06-05-2026, 10:09 AM
0 responses
14 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-04-2026, 08:59 AM
0 responses
28 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-02-2026, 12:03 PM
0 responses
33 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-02-2026, 11:40 AM
0 responses
23 views
0 reactions
Last Post SEQadmin2  
Working...