Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to find all ambiguous reads?

    Hi guys,

    I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

    Code:
    # file: sample.genome.fa
    > 10 GRCz10
    TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
    
    # file sample.reads.fa:
    > r1
    GTTTTACAACAACATGGCAG
    GTTTTTTAAGAACATGGCAG
    
    # run.bash
    bbmap.sh ref=sample.genome.fa k=3
    bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
    
    # results: mapped.sam
    mapped.sam
    @HD	VN:1.4	SO:unsorted
    @SQ	SN: 10 GRCz10	LN:177
    @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
     r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
    expected output:
    1. 1 perfect hit (position 28)
    2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
    3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



    Code:
    # stdout:
    
    Max memory cannot be determined.  Attempting to use 3200 MB.
    If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
    or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
    java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
    Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
    
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
    Retaining all best sites for ambiguous mappings.
    Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
    NOTE:	Ignoring reference file because it already appears to have been processed.
    NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    Set genome to 1
    
    Loaded Reference:	0.075 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index:	0.004 seconds.
    Analyzed Index:   	0.002 seconds.
    Started output stream:	0.021 seconds.
    Cleared Memory:    	0.116 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 8 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
    
       ------------------   Results   ------------------
    
    Genome:                	1
    Key Length:            	3
    Max Indel:             	0
    Minimum Score Ratio:  	0.4486
    Mapping Mode:         	normal
    Reads Used:           	1	(20 bases)
    
    Mapping:          	0.208 seconds.
    Reads/sec:       	4.81
    kBases/sec:      	0.10
    
    
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	100.0000% 	        1 	100.0000% 	          20
    unambiguous:     	100.0000% 	        1 	100.0000% 	          20
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	100.0000% 	        1 	100.0000% 	          20
    semiperfect site:	100.0000% 	        1 	100.0000% 	          20
    
    Match Rate:      	      NA 	       NA 	100.0000% 	          20
    Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
    Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
    N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
    
    Total time:     	0.571 seconds.

    Comment


    • Hi guys,

      I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

      Code:
      # file: sample.genome.fa
      > 10 GRCz10
      TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
      
      # file sample.reads.fa:
      > r1
      GTTTTACAACAACATGGCAG
      GTTTTTTAAGAACATGGCAG
      
      # run.bash
      bbmap.sh ref=sample.genome.fa k=3
      bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
      
      # results: mapped.sam
      mapped.sam
      @HD	VN:1.4	SO:unsorted
      @SQ	SN: 10 GRCz10	LN:177
      @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
       r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
      expected output:
      1. 1 perfect hit (position 28)
      2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
      3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



      Code:
      # stdout:
      
      Max memory cannot be determined.  Attempting to use 3200 MB.
      If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
      or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
      java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
      Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
      Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
      
      Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
      Retaining all best sites for ambiguous mappings.
      Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
      NOTE:	Ignoring reference file because it already appears to have been processed.
      NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
      Set genome to 1
      
      Loaded Reference:	0.075 seconds.
      Loading index for chunk 1-1, build 1
      Generated Index:	0.004 seconds.
      Analyzed Index:   	0.002 seconds.
      Started output stream:	0.021 seconds.
      Cleared Memory:    	0.116 seconds.
      Processing reads in single-ended mode.
      Started read stream.
      Started 8 mapping threads.
      Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
      
         ------------------   Results   ------------------
      
      Genome:                	1
      Key Length:            	3
      Max Indel:             	0
      Minimum Score Ratio:  	0.4486
      Mapping Mode:         	normal
      Reads Used:           	1	(20 bases)
      
      Mapping:          	0.208 seconds.
      Reads/sec:       	4.81
      kBases/sec:      	0.10
      
      
      Read 1 data:      	pct reads	num reads 	pct bases	   num bases
      
      mapped:          	100.0000% 	        1 	100.0000% 	          20
      unambiguous:     	100.0000% 	        1 	100.0000% 	          20
      ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
      low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
      
      perfect best site:	100.0000% 	        1 	100.0000% 	          20
      semiperfect site:	100.0000% 	        1 	100.0000% 	          20
      
      Match Rate:      	      NA 	       NA 	100.0000% 	          20
      Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
      Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
      Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
      Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
      N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
      
      Total time:     	0.571 seconds.

      Comment


      • Hello-

        I'm trying to understand this example in the original post:

        Code:
        bbmap.sh in=reads.fq outu=unmapped.fq int=f
        repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq
        It is provided as an example of how to deal with the non-deterministic behavior of bbmap when aligning paired-end reads. However what is described (map reads as single-end and then re-pair them using repair.sh) doesn't seem to be what these code lines show. If the goal is to get bbmap to produce deterministic paired-end alignments wouldn't the final outcome of the method be an alignment file and not just a FASTQ? Why would I be using the un-aligned reads as part of the final product?
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • Hi guys,

          I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong. Here's the test scenario:

          Code:
          # file: sample.genome.fa
          > 10 GRCz10
          TTTCCTGAGGATGTGAGATTACTTCCGGGTTTTACAACAACATGGCAGCGCCCTTAGGTCTGCATCTGGAGTTTGGGTTTGTATTTATTGTTTCCTTTATCATTAATCTATTTTTTTTTTCTTTCTCGTGTTTTTTAAGAACATGGCAGTTTTGTTTTACATCAACTTGGCAGTAAT
          
          # file sample.reads.fa:
          > r1
          GTTTTACAACAACATGGCAG
          GTTTTTTAAGAACATGGCAG
          
          # run.bash
          bbmap.sh ref=sample.genome.fa k=3
          bbmap.sh minid=0.7 k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all  in=sample.read.fa out=mapped.sam vslow
          
          # results: mapped.sam
          mapped.sam
          @HD	VN:1.4	SO:unsorted
          @SQ	SN: 10 GRCz10	LN:177
          @PG	ID:BBMap	PN:BBMap	VN:37.85	CL:java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
           r1	0	 10 GRCz10	29	40	20=	*	0	0	GTTTTACAACAACATGGCAG	*	NM:i:0	AM:i:40	NH:i:1
          expected output:
          1. 1 perfect hit (position 28)
          2. 1 hit with 2 substitutions (position 153, 8:T>A,13:T>A - GTTTTACATCAACTTGGCAG)
          3. 1 hit with 3 substitutions (position 129, 5:T>A,6:T>C,9:G>C - GTTTTTTAAGAACATGGCAG)



          Code:
          # stdout:
          
          Max memory cannot be determined.  Attempting to use 3200 MB.
          If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
          or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
          java -Djava.library.path=/Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/jni/ -ea -Xmx3200m -cp /Users/milan/Projects/geneassassin/code/ga-pipeline/bin/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 minid=0.7 k=3 maxindel=0 strictmaxindel ref=sample.genome.fa ambiguous=all in=sample.read.fa out=mapped.sam vslow
          Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
          Version 37.85 [tipsearch=150, minhits=1, minratio=0.25, rescuemismatches=50, rescuedist=3000, build=1, overwrite=true, fastareadlen=500, minid=0.7, k=3, maxindel=0, strictmaxindel, ref=sample.genome.fa, ambiguous=all, in=sample.read.fa, out=mapped.sam]
          
          Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.250
          Retaining all best sites for ambiguous mappings.
          Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.449
          NOTE:	Ignoring reference file because it already appears to have been processed.
          NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
          Set genome to 1
          
          Loaded Reference:	0.075 seconds.
          Loading index for chunk 1-1, build 1
          Generated Index:	0.004 seconds.
          Analyzed Index:   	0.002 seconds.
          Started output stream:	0.021 seconds.
          Cleared Memory:    	0.116 seconds.
          Processing reads in single-ended mode.
          Started read stream.
          Started 8 mapping threads.
          Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7
          
             ------------------   Results   ------------------
          
          Genome:                	1
          Key Length:            	3
          Max Indel:             	0
          Minimum Score Ratio:  	0.4486
          Mapping Mode:         	normal
          Reads Used:           	1	(20 bases)
          
          Mapping:          	0.208 seconds.
          Reads/sec:       	4.81
          kBases/sec:      	0.10
          
          
          Read 1 data:      	pct reads	num reads 	pct bases	   num bases
          
          mapped:          	100.0000% 	        1 	100.0000% 	          20
          unambiguous:     	100.0000% 	        1 	100.0000% 	          20
          ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
          low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
          
          perfect best site:	100.0000% 	        1 	100.0000% 	          20
          semiperfect site:	100.0000% 	        1 	100.0000% 	          20
          
          Match Rate:      	      NA 	       NA 	100.0000% 	          20
          Error Rate:      	  0.0000% 	        0 	  0.0000% 	           0
          Sub Rate:        	  0.0000% 	        0 	  0.0000% 	           0
          Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
          Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
          N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
          
          Total time:     	0.571 seconds.

          Comment


          • @milan.molbio-

            bbmap looks for the best alignment and reports only that one by default. If there are multiple "best" then you can get them all by setting 'ambig=all'.

            What you're looking for should be possible by enabling 'secondary=t' and then setting both 'sssr' and 'maxsites' to your liking. "secondary" alignments are those with lower mapping scores relative to the best alignment. "sssr" controls the ratio of the best score that is acceptable (default is 0.95 to allow secondary alignments with mapping scores as low as 95% of the best score). Finally "maxsites" sets a cap on the number of secondary alignment sites to report. You can probably just set that to a very high value to be sure you have them "all" - or at least all that bbmap can find.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • @sdriscoll thanks for the tip! with these two options bbmap does report one 100% semiperfect site read but it's not in the output file. Anything else I'm missing?

              the exact command was:
              Code:
              bbmap.sh secondary=t sssr=0.80  k=3 maxindel=0 strictmaxindel  ref=sample.genome.fa ambiguous=all   maxsites=10  in=sample.read.fa out=mapped.sam vslow

              Comment


              • I am not sure why BBMap is not reporting the two sites in the output SAM. Perhaps that is by design.

                Unfortunately @Brian has been missing from online forums for last couple of months. I pinged him again yesterday with ref to this question. We will need him to chime in on these questions.

                Comment


                • Yeah I've come to terms with the fact that BBMap wasn't really designed for this type of use. When I want to work with an alignment strategy where the goal is to find "all" alignments of a read up to some maximum error threshold I turn to bowtie (1 or 2) or BWA.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • Originally posted by sdriscoll View Post
                    Yeah I've come to terms with the fact that BBMap wasn't really designed for this type of use. When I want to work with an alignment strategy where the goal is to find "all" alignments of a read up to some maximum error threshold I turn to bowtie (1 or 2) or BWA.
                    i did give them both a try. bowtie seems to reliably find and report them but it's got a limit of maximum 3 mismatches (and I need 4 bwa is also struggling to find all the matches, and i've filed an issue: https://github.com/lh3/bwa/issues/176

                    Comment


                    • Hi Brian, I have a problem whereby I have a about 12 fastq files and I'm not completely convinced that the barcode for each of the samples are as named in the key. If I have 12 unique 6-mer barcodes eg GTCCGC, etc is there a way to definitively say which samples have the following barcode with bbmerge? thanks!

                      Comment


                      • I assume you are referring to a standard barcode/index from Illumina tech. If so those are never part of actual read. If they are present in the fastq header then you could use "demuxbyname.sh" to separate the reads you are interested in.

                        If your index is "inline" (and at a specific location in the read e.g. beginning) then you could use bbduk.sh to match/separate those reads.

                        I am not sure bbmerge.sh is going to help in your case.

                        Comment


                        • Originally posted by Alex Lee View Post
                          Hi Brian, I have a problem whereby I have a about 12 fastq files and I'm not completely convinced that the barcode for each of the samples are as named in the key. If I have 12 unique 6-mer barcodes eg GTCCGC, etc is there a way to definitively say which samples have the following barcode with bbmerge? thanks!
                          BBMerge might be able to help in this case, if you have paired reads with a sufficient number of short inserts. You can run it like this:

                          Code:
                          bbmerge.sh in1=read1.fq in2=read2.fq outa=adapters.fa
                          This will indicate the adapter sequence for the library, based on read-through from short-insert reads. The barcode is typically part of the adapter, so if you compare the adapters from 2 multiplexed libraries, the location where they differ is the barcode region.

                          Comment


                          • Originally posted by milan.molbio View Post
                            Hi guys,

                            I'm trying to get bbmap to find all hits up to some identity threshold for a 20bp read, but it only finds the perfect hit. This should be possible right? So I must be doing something wrong.
                            While BBMap is not originally designed for this purpose; I made a version that does a much better job at finding all mappings above some identity threshold, bbmapskimmer.sh. The usage is the same as BBMap; just add the ambig=all flag.

                            Comment


                            • Originally posted by Brian Bushnell View Post
                              While BBMap is not originally designed for this purpose; I made a version that does a much better job at finding all mappings above some identity threshold, bbmapskimmer.sh. The usage is the same as BBMap; just add the ambig=all flag.
                              Does this mean that bbmap functions identically but does not report all the alignments in the output file but includes them in the statistics for alignment?

                              Comment


                              • Is there a multi-sample pileup? I have a bunch of samples mapped to the same reference. It would be awesome to have that as a table with the coverage for each sample per reference sequence.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X