Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • MinHash Sketch - A Tool for Rapid Sequence Comparison

    MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:

    1) Gather all the kmers in a genome.
    2) Apply a hash function to them.
    3) Keep the 10000 smallest hashcodes and call this set a "sketch".

    If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli.

    The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch:

    Code:
    sendsketch.sh in=contigs.fa
    This will make a sketch of your assembly, open an HTTP connection to JGI's taxonomy server, and send the sketch. The server will compare it to sketches of all of nt, and return the top hits. This is kind of convenient because the sketches sit around in memory all the time rather than needing to be loaded. Alternately, you can run a local server with "taxserver.sh" if you want to use a different database.

    The other tools are for processing sketches locally. To sketch and query refseq, for example:

    Code:
    sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence
    That will produce one sketch per sequence, and split them among 31 output files (for rapid loading). Alternatively, and preferably, you can produce one sketch per taxID, like this:

    Code:
    sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=species
    That will produce only one sketch per species, so for things like E.coli with thousands of variants, you won't get thousands of sketches, just one. For some purposes the subspecies level is probably better, though. tree.taxtree.gz and gitable.int1d.gz are created from NCBI's tax dump, like this:

    Code:
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
    
    unzip taxdmp.zip
    taxtree.sh names.dmp nodes.dmp tree.taxtree.gz
    gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz
    Now that you have created some sketches, you can query them like this:

    Code:
    comparesketch.sh in=contigs.fa refseq*.sketch
    "in=" can be a comma-delimited list of either fasta or sketch files. Also, you can turn an input fasta into one sketch per sequence rather than just a single sketch with the "mode=sequence" flag. If you want to process using taxonomic information, add "tree=taxtree.gz" as well. You alternately compare raw fastq reads if you want. KmerCountExact can produce sketches like this:

    Code:
    kmercountexact.sh in=reads.fq sketch=out.sketch mincount=5
    That will make a sketch only from kmers occuring at least 5 times (to avoid error kmers). Overall, the process is probably similar speed-wise to assembling with Tadpole and feeding the assembly to SendSketch/CompareSketch:

    Code:
    tadpole.sh in=reads.fq out=contigs.fa
    sendsketch.sh in=contigs.fa
    So, that's how you run Sketch. Now, what does the output mean? Here's an example, using a strain of Aspergillus:

    Code:
    sendsketch.sh in=ref.fa
    
    Loaded 1 sketches in    1.288 seconds.
    
    Results for ChrIII_A_nidulans_FGSC_A4:
    
    WKID 99.75%     KID 96.96%      matches 9696    compared 9720   taxID 227321    gSize 43368715  Aspergillus nidulans FGSC A4
    WKID 0.10%      KID 0.05%       matches 5       compared 5103   taxID 331117    gSize 14931471  Aspergillus fischeri NRRL 181
    WKID 0.06%      KID 0.04%       matches 4       compared 6204   taxID 1509407   gSize 18030896  Aspergillus nomius NRRL 13137
    WKID 0.05%      KID 0.04%       matches 4       compared 8485   taxID 5061      gSize 39417003  Aspergillus niger
    WKID 0.07%      KID 0.03%       matches 3       compared 4561   taxID 344612    gSize 13236981  Aspergillus clavatus NRRL 1
    WKID 0.06%      KID 0.03%       matches 3       compared 4654   taxID 330879    gSize 13837118  Aspergillus fumigatus Af293
    
    Total Time:     2.605 seconds.
    "matches" is the number of matching kmers (well, technically hash codes) between the query and reference sketch. KID (or kmer identity) is simply the percent of the kmers that matched, while WKID is the weighted kmer identity, which has been adjusted to compensate for genome size. For example, say human chromosome 1 is 8% of the human genome. It would get a KID of 8% when compared to human, since it would only share 8% of the kmers, but the WKID would still be 100%. So, that's the column that tells you the actual sequence similarity (disregarding size).

    taxID is the NCBI taxon identifier, gSize is the number of bases used to create the sketch, and "compared" basically tells you how many kmers were compared before the went out of the kmer range of the larger sketch (this is related to genome size and WKID calculation).

    As you can see, there are a few kmer hits to other Aspergillus strains, but what we have here is clearly Aspergillus nidulans.
    Last edited by Brian Bushnell; 02-03-2017, 08:53 AM.

  • #2
    Insane!

    Perfect timing... I've been doing 10x linked read assemblies of metagenomes and this will greatly speed the characterization of the resulting scaffolds.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment


    • #3
      Oh, good! Currently, by the way, that comparesketch.sh (for local comparisons) currently has more configurability for displaying results than sendsketch.sh (in terms of being able to specify number of records returned per query, identity cutoffs, displaying complete lineage, etc). Eventually sendsketch.sh will catch up though. sendsketch.sh can currently do per-sequence queries from a single file, which is probably what you want to do.

      Comment


      • #4
        Super rad! I tried to use sendsketch.sh to quickly annotate a small assembly:

        Code:
        sendsketch.sh mode=sequence k=31,24 in=phage_07.fna
        but got the following:

        ERROR: The sketch is not compatible with this server.
        Settings: k=31,24 amino=false
        You may need to download a newer version of BBTools; this is version 37.38

        It's a fresh download of the most recent version, so not sure what to do! Super excited to try it out. In the meantime, will try local comparison!
        Last edited by nate85; 08-07-2017, 10:02 AM.

        Comment


        • #5
          Hi Nate,

          Thanks for letting me know! I accidentally included old blacklists in 37.38. That's fixed in 37.40 which I just released. I validated that downloading it from SourceForge and running it works correctly, but please let me know if you still encounter problems.

          Comment


          • #6
            Brian,

            I'd like to combine a few of your tools to put together a way to identify unknown, and possibly novel, viral sequences.

            I have some sequences with known hits to viruses that would map directly to the virus, and then also that will map to some viruses after being translated into different frames.

            Using tadpole, translate6frames and sendsketch, I am unable to consistently find all of the hits that I know are there. These are 150bp reads.

            This is what I tried

            tadpole.sh in=reads.fq out=contigs.fa k=50
            translate6frames.sh in=contigs.fa out=contigs_6frames.fa aaout=f
            sendsketch.sh in=contigs_6frames.fa address=refseq

            Can you think of any way to improve this to increase sensitivity?

            Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.

            Comment


            • #7
              Originally posted by jweger1988 View Post
              This is what I tried

              tadpole.sh in=reads.fq out=contigs.fa k=50
              translate6frames.sh in=contigs.fa out=contigs_6frames.fa aaout=f
              sendsketch.sh in=contigs_6frames.fa address=refseq

              Can you think of any way to improve this to increase sensitivity?
              Actually, that should not work at all. The JGI RefSeq sketch server is using nucleotide references rather than protein references, so you should try without using translate6frames. To use an amino acid query you need to add the flag "amino", and basically use comparesketch.sh with a local reference, because none of the JGI sketch servers use amino acids currently. I may add one for nr though.

              You could try something like this (will take a while as large files need to be downloaded):

              First, run bbmap/pipelines/fetchTaxonomy.sh

              Then:

              Code:
              wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.*.protein.faa.gz
              
              cat viral.*.protein.faa.gz > refseq_viral.faa.gz
              
              gi2taxid.sh in=refseq_viral.fa.gz out=renamed.fa.gz tree=auto table=null accession=auto taxpath=/path/to/taxonomy
              
              sketch.sh in=renamed.fa.gz out=viral_AA.sketch mode=taxa tree=auto accession=null gi=null ow minsize=20 prefilter autosize k=9,6 amino taxpath=/path/to/taxonomy
              That will give you a local copy of a all RefSeq viral proteins, indexed on a per-taxa level. It's much simpler to do on a per-sequence level (since you don't need all the taxonomy-related stuff), but the sensitivity is much lower because proteins are so short, so I don't really recommend it. Sketch is really designed for long reference sequences. To subsequently run a comparison:

              Code:
              comparesketch.sh in=translated6frames.faa k=9,6 amino ref=viral_AA.sketch
              Running that series of commands on Lambda phage, I got:

              Code:
              Query: Escherichia virus Lambda Seqs: 6         Bases: 97000    gSize: 94790    SketchLen: 969  TaxID: 10710
              WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName
              100.00% 15.81%  100.00% 100.00% 32.37%  147     43      482     10710   14785   74      Escherichia virus Lambda
              20.35%  3.76%   80.87%  100.00% 44.31%  35      0       484     10730   17271   80      Enterobacteria phage 933W
              11.92%  3.78%   75.31%  100.00% 45.37%  31      0       417     194949  26090   170     Escherichia phage Stx2 II
              11.37%  3.68%   74.84%  100.00% 45.37%  29      0       402     194948  25585   167     Escherichia Stx1 converting phage
              15.03%  2.83%   77.67%  100.00% 45.48%  26      0       475     489779  17432   83      Escherichia phage Min27
              (etc)
              You might need the latest version which I just uploaded now (37.50), though, which fixes an "amino" flag parse error.

              That said, the RefSeq viral database is so small, and viral assemblies are so small, that it would be a lot faster in this case to just use BLAST (either versus just viral or all of RefSeq) unless you plan on doing a lot of queries.

              Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.
              You're welcome; I'm glad you're finding it helpful!

              Comment


              • #8
                What would be the optimal way to use these sketching tools if I want to make metagenome sample comparisons similar to what e.g. Sourmash can do? My first thought would be to sketch each sample and then do an all-vs-all comparison using comparesketch.sh, something like this example:

                Code:
                for sample in sample1 sample2 sample3; do
                    sketch.sh in=${sample}_R1.fq.gz out=${sample}.sketch.gz
                done
                comparesketch.sh alltoall *sketch.gz format=3
                That would give me similarity measures between each sample, right?
                It looks like the "name" of each sketch is called after the first read in each file, which gives very hard-to-interpret names in the output listing. Is it possible to adjust this somehow, other than editing the NM0-field in the sketch file?
                Last edited by boulund; 04-14-2018, 06:38 AM.

                Comment


                • #9
                  Originally posted by boulund View Post
                  It looks like the "name" of each sketch is called after the first read in each file, which gives very hard-to-interpret names in the output listing. Is it possible to adjust this somehow, other than editing the NM0-field in the sketch file?
                  I noticed that I missed the
                  Code:
                  name0=
                  argument to sketch.sh that changes the name. Great!

                  I wrote a quick Python script that takes the output from comparesketch.sh and produces a heatmap comparison of sample similarity, and it looks all right. Good for now! Thanks for a great tool Brian!

                  Comment


                  • #10
                    Has the new Mash2.0 'screen' function been implemented in bbtools 'sketch'?

                    Hi Brian Bushnell,

                    Has the new 'screen' function of Mash2.0 been incorporated into BBtools sketch yet? It would be super useful for finding stuff in the SRA db with sendsketch! Thanks!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X