Announcement

Collapse
No announcement yet.

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Brian,
    Thanks for helping with the PacBio mapping issues. I have another, unrelated question about usage of bbmap. Is there a way to have bbmap output ALL mappings that meet a certain threshold specified with options such as subfilter=X? Using the ambiguous=all parameter, I cant get all equivalent mappings reported for a read, but can I use options to return all mappings that score >= a threshold specified with options like subfilter? Although this isn't the intended usage of the program, I would like to be able to search large sequence sets with primer sequences and look for opportunities where primer pairs sit in proximity to each other while allowing some mismatching.
    For example, if a set of primers is designed to perfectly match an intended target site, can I use bbmap to explore less perfect sites for hybridization opportunities?
    Last edited by lankage; 05-22-2015, 10:32 AM.

    Comment


    • BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.

      I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

      So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.

      I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.

      Comment


      • BBmapskimmer.sh

        Hi Brian,
        I have been experimenting with bbmapskimmer.sh and had a question about flags in the SAM output. I have been looking at a pair of primers matched against the Macaca fasicularis reference genome V5 for the potential to amplify unwanted sites. I have some python code that reads in sam output and draws a visualization of a primer to subject sequence alignment. I noticed in one case, the sam output has the 256 flag for a sub alignment of a sequence that would seem to suggest that the reverse complement of the query sequence (primer) maps to chromosome 12 at the given position. When i pull out Macaca reference genomic sequence at the given coordinates, I see a match to the 5' - 3' FWD orientation of the primer, rather than the reverse complement of the primer. Is there not a way to tell what orientation the query sequence matches the reference if the query hit is stored as a sub mapping?

        Sam output:
        @HD VN:1.4 SO:unsorted
        @SQ SN:10 LN:96509753
        @SQ SN:11 LN:137757926
        @SQ SN:12 LN:132586672
        @SQ SN:13 LN:111193037
        @SQ SN:14 LN:130733371
        @SQ SN:15 LN:112612857
        @SQ SN:16 LN:80997621
        @SQ SN:17 LN:96864807
        @SQ SN:18 LN:75711847
        @SQ SN:19 LN:59248254
        @SQ SN:1 LN:227556264
        @SQ SN:20 LN:78541002
        @SQ SN:2 LN:192460366
        @SQ SN:3 LN:192294377
        @SQ SN:4 LN:170955103
        @SQ SN:5 LN:189454096
        @SQ SN:6 LN:181584905
        @SQ SN:7 LN:171882078
        @SQ SN:8 LN:146850525
        @SQ SN:9 LN:133195287
        @SQ SN:MT LN:16575
        @SQ SN:X LN:152835861
        @PG ID:BBMap PN:BBMap VN:34.92 CL:java -Djava.library.path=/slipstream/home/graham/bbmap/jni/ -ea -Xmx160531m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=primers.fas ref=mafa5.fa out=primerhits.4.sam k=6 slow subfilter=4
        FCGR2A.3F 16 1 90182861 40 19= * 0 0 ACTGTGCCAACGTCCAGTC * NM:i:0 AM:i:40 NH:i:2
        FCGR2A.3F 256 12 127476172 32 1=1X17= * 0 0 * * NM:i:1 AM:i:32 NH:i:2
        FCGR2A.3R 0 1 90169340 40 20= * 0 0 TTGTCATCCACTCAGCAAGC * NM:i:0 AM:i:40 NH:i:2
        FCGR2A.3R 256 3 184441394 32 9=1X10= * 0 0 * * NM:i:1 AM:i:32 NH:i:2

        This line in question:
        FCGR2A.3F 256 12 127476172 32 1=1X17= * 0 0 * * NM:i:1 AM:i:32 NH:i:2



        My output:
        [PRE]
        *** FCGR2A.3F 1 RC 90182861 19=
        ACTGTGCCAACGTCCAGTC
        |||||||||||||||||||
        ACTGTGCCAACGTCCAGTCGGGTT
        *** FCGR2A.3F 12 RC 127476172 1=1X17=
        ACTGTGCCAACGTCCAGTC
        | |||||||||||||||||
        GGCTGGACGTTGGCACAGTGACCA

        *** FCGR2A.3R 1 FWD 90169340 20=
        TTGTCATCCACTCAGCAAGC
        ||||||||||||||||||||
        TTGTCATCCACTCAGCAAGCTGAGA
        *** FCGR2A.3R 3 FWD 184441394 9=1X10=
        TTGTCATCCACTCAGCAAGC
        ||||||||| ||||||||||
        TTGTCATCCTCTCAGCAAGCCCAAG[/PRE]
        Last edited by lankage; 05-26-2015, 08:13 AM.

        Comment


        • The first alignment has bitflag 16, indicating it is reverse-complemented, and the second has bitflag 256, indicating it is forward (but a secondary alignment). Since the primary alignment is reverse-complemented, the bases are displayed reverse-complemented (required by SAM specification). So, they need to be reverse-complemented again before displaying them for the second alignment.

          You can optionally add the flag "saa=f" (secondaryalignmentasterisks=false) which will make BBMap print out the query bases in the correct orientation for every alignment, rather than just the primary.

          Comment


          • mapPacBio.sh - soft clipped mappings

            Hi Brian, yet again another question about mapPacBio.sh. I would like to use mapPacBio.sh to map full length MHC allele sequences from pac bio sequencing against a reference dataset which contains partial MHC allele sequences, and separate out those that represent extensions of alleles in the dataset, and those that are putative novels. I would like to be able to specify perfect identity to a reference within the overlap portion but when i use the "idfilter=1 local" parameter, I get no mappings. If i use idfilter=.99, i can see the type of mapping i expect to find as below with only soft clipped bases, but also have mappings with mismatches. Is there a way to restrict the mapping results to those with ONLY soft clipped bases?

            bbmap/mapPacBio.sh in=filtered_contigs.fasta ref=Mafa_MHC-I_coding_throughPacBio5-15.05.22.fasta outm=filtered_contigs_mapped.sam outu=filtered_contigs_unmapped.fasta idfilter=.99 local -Xmx4g


            m150505_221415_42134_c100814192550000001823181311031594_s1_p0/161219/ccs/lbc13.merged.zerolen.fastq_1;size=22; 0 Mafa_B_148_01_01 1 43 9S1080= *

            Comment


            • Ah... that's unintended. When you set "idfilter=1" it turns on "perfectmode", which was unintentional; I'll fix that. It SHOULD turn on "semiperfectmode", which is what you need - no substitutions or indels, but going off the end of a reference is allowed.

              So, in this case, take out the "idfilter" flag and add "semiperfectmode", and you'll get the results you want. Although 100% perfect, non-soft-clipped reads will still be present. You could remove those with a double-mapping, though, like this:

              mapPacBio.sh in=reads.fq out=semiperfect.fq semiperfectmode

              mapPacBio.sh in=semiperfect.fq outm=perfect.sam outu=clipped.sam perfectmode


              -Brian

              Comment


              • Originally posted by Brian Bushnell View Post
                Sure. For simplicity, I will assume the reads are interleaved in a single file initially (you can transform between interleaved and dual-file reads with reformat.sh if you want). First thing is to trim the adapters with BBDuk:

                Code:
                bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe ref=nextera_LMP_adapter.fa.gz
                nextera_LMP_adapter.fa.gz is in /bbmap/resources/.

                Next, if you want to remove contaminants/spikeins such as PhiX, you can do so now with a filtering pass of BBDuk, but that's optional.

                Finally, split the reads into LMP pairs, short pairs, and singletons:

                Code:
                splitnextera.sh in=trimmed.fq out=longmatepairs.fq outf=fragmentpairs.fq outu=unknownpairs.fq outs=singletons.fq mask=t
                At this point you could do quality-trimming if you want to on the individual sub-libraries; it's important not to do so sooner as it interferes with the ability to detect junctions. If all you care about is the LMP data you can just keep longmatepairs.fq and throw away the others, but theoretically, you might be able to get enough fragment pairs and singletons to assemble the contigs from those, then use the LMP for scaffolding, and thus get a complete single-library assembly. That's what Illumina is encouraging, at any rate.

                The majority of the reads in the "unknown" bin were LMP, not short-fragment reads, in the one library I tested (and also according to Illumina's data), but I would not really consider it safe to use for scaffolding OR contig-building, and would probably throw it away unless you don't have enough data.

                Hi Brian,

                Thesedays I am thinking of several assembly questions related to what you said above.
                1) After removing those "pair-end" and "single-end" contamination, why can't we reverse_complement those LMP reads and treat them as pair-end reads for the contig assembly stage? Maybe due to the larger variance of the LMP insert sizes?
                2) For those "unknown" reads, I think it is also helpful to assemble them as single ends, right?

                Comment


                • Originally posted by blsfoxfox View Post
                  Hi Brian,

                  Thesedays I am thinking of several assembly questions related to what you said above.
                  1) After removing those "pair-end" and "single-end" contamination, why can't we reverse_complement those LMP reads and treat them as pair-end reads for the contig assembly stage? Maybe due to the larger variance of the LMP insert sizes?
                  You can... whether that's a good idea depends on the assembler. Some assemblers probably use LMP reads for kmers just like the pair-end reads. Particularly if you don't have enough coverage of PE reads, using the LMP reads that way is probably fine, though they will probably have a more biased coverage distribution than the PE reads, which could interfere with the heuristics of some assemblers.
                  2) For those "unknown" reads, I think it is also helpful to assemble them as single ends, right?
                  Virtually all of the unknown-binned reads are non-overlapping LMP reads that you can't assemble into single reads (in the data I have examined). They end up in the unknown bin because the junction adapter was not visible in either read. But that usually means that the junctions were in the unsequenced portion between the two reads. This really depends on your insert size distribution (the physical insert size of the sequenced fragments, not the insert size of the long transposased pieces). If you fragment to substantially longer than 2x read length, a lot of LMP pairs will end up in the unknown bin because the junction is in the unsequenced middle. If you fragment to a shorter insert size, such that most of your pairs overlap, the unknown bin would consist more of PE reads.

                  The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.
                  Last edited by Brian Bushnell; 06-01-2015, 09:18 AM.

                  Comment


                  • Originally posted by Brian Bushnell View Post
                    You can... whether that's a good idea depends on the assembler. Some assemblers probably use LMP reads for kmers just like the pair-end reads. Particularly if you don't have enough coverage of PE reads, using the LMP reads that way is probably fine, though they will probably have a more biased coverage distribution than the PE reads, which could interfere with the heuristics of some assemblers.

                    Virtually all of the unknown-binned reads are non-overlapping LMP reads that you can't assemble into single reads (in the data I have examined). They end up in the unknown bin because the junction adapter was not visible in either read. But that usually means that the junctions were in the unsequenced portion between the two reads. This really depends on your insert size distribution (the physical insert size of the sequenced fragments, not the insert size of the long transposased pieces). If you fragment to substantially longer than 2x read length, a lot of LMP pairs will end up in the unknown bin because the junction is in the unsequenced middle. If you fragment to a shorter insert size, such that most of your pairs overlap, the unknown bin would consist more of PE reads.

                    The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.
                    Hi Brian,

                    Thanks for your reply!
                    I understand your points of the differences between "unknown" and "single-end" reads. However, I am not going to merge the "unknown" pairs into one single read, but treat one pair of reads as two single reads only for the contig assembly step. By doing this, we neither throw the "unknown" reads away nor rely on their mate-pair information to do scaffolding as insurance. Is it valid to you?

                    Comment


                    • Oh, yes, that sounds perfectly fine. You may want to adapter-trim the unknown reads first (treating them as singletons) by specifying the junction-adapter as the adapter sequence, like this:

                      bbduk.sh in=unknown.fq int=f out=trimmed.fq adapter=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG k=21 mink=5 ktrim=r hdist=1

                      That will ensure you trim junction adapter from the end of the read, in case it was present but not detected because it was just a few bases. Otherwise the unknown bin may be enriched for reads with junction adapter at the end, that was just too short to be positively identified. Technically it could be present on the left side, as well. Making a base-frequency histogram of the unknown bin might be useful; I have not done that.
                      Last edited by Brian Bushnell; 06-01-2015, 10:10 AM.

                      Comment


                      • Hi Brian,

                        what are the applications/assembly operations for which tadpole.sh is designed? In part it replaces BBmerge?

                        Thanks!

                        Comment


                        • Hi Luc,

                          Tadpole does not replace BBMerge, though I plan to further integrate them in the future - mainly, so that BBMerge can first attempt to merge, then if unsuccessful extend the reads using Tadpole, then attempt to merge again, and if still unsuccessful, undo all the changes.

                          The main reason I made "Tadpole" was because I need to quantify the insert size of libraries, to determine whether they are acceptable; for example, if a project needs a 2x150bp library with a 500bp insert size, for an unknown organism... how do you determine whether it passes? BBMerge only works when the reads are largely overlapping. So, I wrote Tadpole to extend the right end of non-overlapping reads so that they will overlap and can be merged. So far, it works really well on 2x150bp single-cell data with a 350bp insert size, but I have not tested it further than that.

                          Tadpole is a complete (and very fast) assembler; you can run "tadpole.sh in=reads.fq out=contigs.fa" and it will give you a conservative assembly with a low error rate. The main drawback is that currently the max kmer length is 31, and as such the continuity is poor. I'm evaluating it for use in making a quick assembly for mapping reads to recalibrate their quality, prior to feeding them to a more sophisticated assembler; for recalibration, a low error rate and low misassembly rate is more important than continuity.

                          For extending reads, the command would be:
                          tadpole.sh in=reads.fq extend=reads.fq oute=extended.fq mode=extend extendleft=100 extendright=100

                          That will extend the reads by up to 100bp in each direction, stopping early if a branch is hit. For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.

                          Comment


                          • Hi Brian, got another one for you. This time im working with Illumina miseq reads and mapping them to a HIV reference sequence for the purpose of determining the presence or absence of a variety of HIV drug resistance associated mutations. One step in my workflow requires the use of the mapping information to determine which mutation sites are spanned by a mapped read. I have a list of base coordinates based on the HIV reference sequence. I noticed that in some instances with a large insert relative to the reference, that the mapping start coordinate does not begin with the first base of the read sequence. In the example provided, the start coordinate of 3804 is actually 29 bases in from the start of the read. This is following a 27 base insert per the cigar string. I would like to be able to reliably determine the complete interval with which a read overlaps the reference sequence so I can know which mutation sites are covered. Can i tweak the parameters to get the start coordinate to reflect the start of the read sequence in a non-soft clipped mapping?

                            Example SAM record:
                            Code:
                            M00196:47:000000000-A1CEL:1:1103:11833:21618 1:N:0:1	16	NC_001802DRannotations_(modified)	3804	19	2=27I2=1X5=1X23=1X27=1X82=	*	0	0	ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAGAGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC	????ABBBDDDDDEEDGGGGGGIIIIIIHHIHHHHIHIIIIIIIHIIIIHHHHHIIIIIHHHHHHIHIIIIIHHIIHHGHIIIHFFHIIIHGGGIHHHIIHHGFIHIIIIHIHHHFHGHHIHHHHHHFFHHHHHHIHHHHHHHHIIHHFIHIHHFFFFFFFDDDDDDDDBBB	NM:i:31	AM:i:19
                            
                            Read with expanded cigar string:
                            ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAGAGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC
                            MMIIIIIIIIIIIIIIIIIIIIIIIIIIIMMXMMMMMXMMMMMMMMMMMMMMMMMMMMMMMXMMMMMMMMMMMMMMMMMMMMMMMMMMMXMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
                            HTML Code:
                            [PRE]
                            >>NC_001802DRannotations_(modified)                       (9181 nt)
                             initn: 652 init1: 652 opt: 684  Z-score: 836.6  bits: 167.0 E(1): 8.7e-45
                            banded Smith-Waterman score: 684; 97.2% identity (97.2% similar) in 144 nt overlap (29-172:3805-3948)
                            
                                             10        20        30        40        50        
                            offend   ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAG
                                                                 ::: ::::: ::::::::::::::::::::
                            NC_001 AUUUUUAGAUGGAAUAGAUAAGGCCCAAGAUGAACAUGAGAAAUAUCACAGUAAUUGGAG
                                     3780      3790      3800      3810      3820      3830    
                            
                                   60        70        80        90       100       110        
                            offend AGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTG
                                   ::: ::::::::::::::::::::::::::: ::::::::::::::::::::::::::::
                            NC_001 AGCAAUGGCUAGUGAUUUUAACCUGCCACCUGUAGUAGCAAAAGAAAUAGUAGCCAGCUG
                                     3840      3850      3860      3870      3880      3890    
                            
                                  120       130       140       150       160       170        
                            offend TGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC      
                                   ::::::::::::::::::::::::::::::::::::::::::::::::::::::      
                            NC_001 UGAUAAAUGUCAGCUAAAAGGAGAAGCCAUGCAUGGACAAGUAGACUGUAGUCCAGGAAU
                                     3900      3910      3920      3930      3940      3950    
                            
                            NC_001 AUGGCAACUAGAUUGUACACAUUUAGAAGGAAAAGUUAUCCUGGUAGCAGUUCAUGUAGC
                                     3960      3970      3980      3990      4000      4010[/PRE]
                            Last edited by Brian Bushnell; 06-08-2015, 02:21 PM.

                            Comment


                            • You can change gap penalties and get a different alignment by adding the flag "msa=MultiStateAligner9Flat". That will generally reduce the rate of long indels by flattening the affine transforms. The resulting cigar string is "2=1D1X1=1X3=2X1=2X2=3I4=1I1X1=3I3=1X5=1X23=1X27=1X82=", which may not be exactly what you are looking for, but is much closer.

                              Alternatively, depending on what you are doing with the data, you can enable soft-clipping with BBMap's "local" flag, then calculate coverage using pileup.sh with the "includesoftclip" flag. This will calculate coverage including soft-clipped sequences.

                              Please let me know if neither of those is adequate...

                              Comment


                              • Hi Brian,
                                Came across an instance today where I was attempting to parse bbmap SAM output for cigar string information to observe the mapping of some HIV read sequences against a reference. I was expecting for a cigar string to be present for each record when using the outm=output.sam parameter. In one of my mapping records, I observed an asterisk instead. Am I wrong to assume outm= is intended to include only those mapped reads? I can filter out these reads, but I'd like to make sure my mapping parameters make sense.

                                From the SAM output:
                                @HD VN:1.4 SO:unsorted
                                @SQ SN:NC_001802DRannotations_(modified) LN:9181
                                @PG ID:BBMap PN:BBMap VN:34.92 CL:java -Djava.library.path=/home/dnanexus/bbmap/jni/ -ea -Xmx10g align2.BBMap build=1 overwrite=true fastareadlen=500 in=reads_file ref=ref_file outm=sam_output minid=.8 strictmaxindel=10 k=8 subfilter=15 -Xmx10g

                                The offending mapping record:
                                M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78 0 NC_001802DRannotations_(modified) 2154 4 * * 0 0 AAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTCATAGTAATATGGGGAAAGACTCCTAAATTTAAATTACCCATACAAAAGGAAACATGGGAAGCATGGTGGACAGAGTATTGGC CCCCCGGGGGGGGGGGGGGGGGGGGGGFDCFGECGGGF<AFEGGFEFGGGGFGGGGGGGGGGGGGGGGGFGFFGGFGEFGGAGFGEAFF<,FGGGGGGGGGGFGGFDGGGFECFGGGGGGGGGGCCCCC AM:i:4

                                This only occurred after parsing ~8 million mapping records out of a total 50 million.

                                Comment

                                Working...
                                X