**SNPsaurus** · 02-03-2017, 12:55 PM

Insane!

Perfect timing... I've been doing 10x linked read assemblies of metagenomes and this will greatly speed the characterization of the resulting scaffolds.

**Brian Bushnell** · 02-03-2017, 01:12 PM

Oh, good! By the way, comparesketch.sh (for local comparisons) currently has more configurability for displaying results than sendsketch.sh (number of records returned per query, identity cutoffs, displaying complete lineage, etc.). Eventually sendsketch.sh will catch up, though. sendsketch.sh can already do per-sequence queries from a single file, which is probably what you want to do.

**nate85** · 08-07-2017, 09:59 AM

Super rad! I tried to use sendsketch.sh to quickly annotate a small assembly:

Code:

sendsketch.sh mode=sequence k=31,24 in=phage_07.fna

but got the following:

ERROR: The sketch is not compatible with this server.
Settings: k=31,24 amino=false
You may need to download a newer version of BBTools; this is version 37.38

It's a fresh download of the most recent version, so not sure what to do! Super excited to try it out. In the meantime, will try local comparison!

**Brian Bushnell** · 08-07-2017, 02:19 PM

Hi Nate,

Thanks for letting me know! I accidentally included old blacklists in 37.38. That's fixed in 37.40 which I just released. I validated that downloading it from SourceForge and running it works correctly, but please let me know if you still encounter problems.

**jweger1988** · 08-27-2017, 11:33 AM

Brian,

I'd like to combine a few of your tools to put together a way to identify unknown, and possibly novel, viral sequences.

I have some sequences with known hits to viruses. Some would map directly to the virus, and others would map to a virus only after being translated into different frames.

Using tadpole, translate6frames and sendsketch, I am unable to consistently find all of the hits that I know are there. These are 150bp reads.

This is what I tried

tadpole.sh in=reads.fq out=contigs.fa k=50
translate6frames.sh in=contigs.fa out=contigs_6frames.fa aaout=f
sendsketch.sh in=contigs_6frames.fa address=refseq

Can you think of any way to improve this to increase sensitivity?

Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.

**Brian Bushnell** · 08-28-2017, 10:33 AM

Originally posted by jweger1988

This is what I tried

tadpole.sh in=reads.fq out=contigs.fa k=50
translate6frames.sh in=contigs.fa out=contigs_6frames.fa aaout=f
sendsketch.sh in=contigs_6frames.fa address=refseq

Can you think of any way to improve this to increase sensitivity?

Actually, that should not work at all. The JGI RefSeq sketch server is using nucleotide references rather than protein references, so you should try without using translate6frames. To use an amino acid query you need to add the flag "amino", and basically use comparesketch.sh with a local reference, because none of the JGI sketch servers use amino acids currently. I may add one for nr though.
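For readers unfamiliar with what "translated into different frames" means, here is a minimal Python sketch of six-frame translation using the standard genetic code. It only illustrates the concept and is not how translate6frames.sh is implemented; the function names and the example sequence are made up for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate starting at position 0; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the 3 forward-strand and 3 reverse-strand translations."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


# Hypothetical 9 bp example: ATG GCC TAA reads as M, A, stop in frame 1.
print(six_frames("ATGGCCTAA"))
```

The output of such a translation is amino acid sequence, which is why it cannot be matched against a server that holds nucleotide sketches.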

You could try something like this (will take a while as large files need to be downloaded):

First, run bbmap/pipelines/fetchTaxonomy.sh

Then:

Code:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.*.protein.faa.gz

cat viral.*.protein.faa.gz > refseq_viral.faa.gz

gi2taxid.sh in=refseq_viral.faa.gz out=renamed.fa.gz tree=auto table=null accession=auto taxpath=/path/to/taxonomy

sketch.sh in=renamed.fa.gz out=viral_AA.sketch mode=taxa tree=auto accession=null gi=null ow minsize=20 prefilter autosize k=9,6 amino taxpath=/path/to/taxonomy

That will give you a local copy of all RefSeq viral proteins, indexed on a per-taxa level. It's much simpler to do on a per-sequence level (since you don't need all the taxonomy-related stuff), but the sensitivity is much lower because proteins are so short, so I don't really recommend it. Sketch is really designed for long reference sequences. To subsequently run a comparison:

Code:

comparesketch.sh in=translated6frames.faa k=9,6 amino ref=viral_AA.sketch

Running that series of commands on Lambda phage, I got:

Code:

Query: Escherichia virus Lambda Seqs: 6         Bases: 97000    gSize: 94790    SketchLen: 969  TaxID: 10710
WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName
100.00% 15.81%  100.00% 100.00% 32.37%  147     43      482     10710   14785   74      Escherichia virus Lambda
20.35%  3.76%   80.87%  100.00% 44.31%  35      0       484     10730   17271   80      Enterobacteria phage 933W
11.92%  3.78%   75.31%  100.00% 45.37%  31      0       417     194949  26090   170     Escherichia phage Stx2 II
11.37%  3.68%   74.84%  100.00% 45.37%  29      0       402     194948  25585   167     Escherichia Stx1 converting phage
15.03%  2.83%   77.67%  100.00% 45.48%  26      0       475     489779  17432   83      Escherichia phage Min27
(etc)

You might need the latest version (37.50), though, which I just uploaded; it fixes an "amino" flag parse error.

That said, the RefSeq viral database is so small, and viral assemblies are so small, that it would be a lot faster in this case to just use BLAST (either versus just viral or all of RefSeq) unless you plan on doing a lot of queries.
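To make the earlier point about short proteins concrete: a sequence of length L contains at most L - k + 1 k-mers, so a short protein yields very few hashes to sketch, and a single substitution wipes out up to k of them. The following is a toy bottom-k MinHash in Python, written only for illustration. It is not the BBTools implementation; the hash function, the function names, and the 20-letter example string (not a real protein) are all assumptions.

```python
import hashlib


def kmers(seq, k):
    """Set of distinct k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is arbitrary here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, size):
    """Bottom-k sketch: the `size` smallest k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sk_a, sk_b, size):
    """Estimate Jaccard similarity from the smallest hashes of the union."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    shared = set(sk_a) & set(sk_b)
    hits = sum(1 for h in merged if h in shared)
    return hits / len(merged) if merged else 0.0


# Toy 20-letter sequence with one substitution in the middle.
original = "ABCDEFGHIJKLMNOPQRST"
mutated = original[:10] + "X" + original[11:]
k, size = 9, 1000

print(len(kmers(original, k)))  # only 20 - 9 + 1 = 12 k-mers to work with
print(jaccard_estimate(sketch(original, k, size), sketch(mutated, k, size), size))
```

In this toy case one substitution leaves only 3 of the 12 original k-mers intact, whereas the same change in a sequence thousands of residues long would leave nearly all of them untouched.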

Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.

You're welcome; I'm glad you're finding it helpful!

**boulund** · 04-14-2018, 06:33 AM

What would be the optimal way to use these sketching tools if I want to make metagenome sample comparisons similar to what e.g. Sourmash can do? My first thought would be to sketch each sample and then do an all-vs-all comparison using comparesketch.sh, something like this example:

Code:

for sample in sample1 sample2 sample3; do
    sketch.sh in=${sample}_R1.fq.gz out=${sample}.sketch.gz
done
comparesketch.sh alltoall *sketch.gz format=3

That would give me similarity measures between each pair of samples, right?
It looks like the "name" of each sketch is taken from the first read in each file, which gives very hard-to-interpret names in the output listing. Is it possible to adjust this somehow, other than editing the NM0 field in the sketch file?

**boulund** · 04-15-2018, 10:54 AM

Originally posted by boulund

It looks like the "name" of each sketch is called after the first read in each file, which gives very hard-to-interpret names in the output listing. Is it possible to adjust this somehow, other than editing the NM0-field in the sketch file?

I noticed that I missed the

Code:

name0=

argument to sketch.sh that changes the name. Great!

I wrote a quick Python script that takes the output from comparesketch.sh and produces a heatmap comparison of sample similarity, and it looks all right. Good for now! Thanks for a great tool Brian!
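The script itself was not posted, so the following is only a rough sketch of how such a conversion might look. It assumes the comparesketch.sh output is tab-separated with the query name, reference name, and a similarity value in the first three columns, and that header lines start with "#". That layout is an assumption, so check it against your own output before relying on this. The function name and sample names are hypothetical.

```python
def similarity_matrix(lines):
    """Build a symmetric sample-by-sample matrix from pairwise result lines.

    Assumed layout per line: query <TAB> reference <TAB> value [<TAB> ...].
    Lines that are blank, start with '#', or lack a numeric value are skipped.
    """
    names = []
    pairs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        try:
            value = float(fields[2].rstrip("%"))
        except ValueError:
            continue  # e.g. an unmarked column-header line
        query, ref = fields[0], fields[1]
        for name in (query, ref):
            if name not in names:
                names.append(name)
        pairs[(query, ref)] = value
    # Fall back to the mirrored pair, then to 0.0, when a comparison is missing.
    matrix = [
        [pairs.get((q, r), pairs.get((r, q), 0.0)) for r in names] for q in names
    ]
    return names, matrix


# Made-up example input, not real comparesketch.sh output.
example = [
    "#Query\tRef\tANI",
    "sample1\tsample1\t100.0",
    "sample1\tsample2\t85.5",
    "sample2\tsample2\t100.0",
]
names, matrix = similarity_matrix(example)
print(names)
print(matrix)
```

The resulting list of lists can be passed to a plotting library, for example matplotlib's `imshow`, to draw the heatmap.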

**nano85** · 10-21-2018, 01:08 PM

Has the new Mash2.0 'screen' function been implemented in bbtools 'sketch'?

Hi Brian Bushnell,

Has the new 'screen' function of Mash2.0 been incorporated into BBtools sketch yet? It would be super useful for finding stuff in the SRA db with sendsketch! Thanks!

Publications — Mash 2.0 documentation

https://mash.readthedocs.io/en/latest/

MinHash Sketch - A Tool for Rapid Sequence Comparison

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News