**rhinoceros** · 11-20-2013, 08:26 AM

ftp://ftp.metagenomics.anl.gov/data/

**Bachbioinfo** · 11-20-2013, 08:46 AM

Thanks!

For MD5nr, the publication link is here: http://www.biomedcentral.com/1471-2105/13/141

It seems they did the same for the ribosomal databases, as md5rna
(ftp://ftp.metagenomics.anl.gov/data/...rent/md5rna.gz)

**Bachbioinfo** · 12-09-2013, 07:07 AM

16S microbial

Originally posted by rhinoceros

ftp://ftp.metagenomics.anl.gov/data/

Hello,

I have another question, please.

Does the database at "ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz" contain the same information as "ftp://ftp.metagenomics.anl.gov/data/MD5nr/20130801/md5rna.gz"?

The README for the NCBI BLAST databases gives no information on how this db was constructed or what it contains.

Many thanks

**rhinoceros** · 12-09-2013, 07:21 AM

NCBI's 16S db is tiny: it contains about 7k near-full-length bacterial and a few hundred near-full-length archaeal SSU sequences. m5rna, on the other hand, is Greengenes, SILVA and RDP (maybe something else too?) combined, and contains something like 3.5M SSU/LSU sequences of various lengths.

The alphanumeric characters you see are md5 checksums. See here.
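As a rough sketch of where such IDs come from, the snippet below hashes a sequence string with `md5sum` (GNU coreutils assumed). The helper name `seq_md5` is made up for illustration, and whether the M5nr build normalizes case or whitespace before hashing is not stated in this thread, so treat that part as an open assumption.

```shell
# Hypothetical helper: print the MD5 checksum of a sequence string.
seq_md5() {
    # printf '%s' avoids hashing a trailing newline along with the sequence.
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# Example call on a toy sequence; prints a 32-character hex string.
seq_md5 "ACGT"
```

Identical sequences always give the same checksum, which is what lets one ID stand in for the same sequence across several source databases.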

**boetsie** · 12-11-2013, 09:13 AM

Thanks for pointing out this 16S non-redundant database. I am trying to use this database, but have some difficulties with the md5 checksum. My goal is to align my reads to the database and process them with MEGAN. However, MEGAN needs some IDs to get the taxonomic name. For example, my current GreenGenes file looks like:

HTML Code:

>AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae; otu_204

Here, the ID 'AF068820.2' is important.

The header of the md5rna file looks like:

HTML Code:

>0000175eddb4b05d0bd52467315668ac

As rhinoceros pointed out, there is some information about the md5 checksums here: http://blog.metagenomics.anl.gov/m5nr-api/

and after some searching I found this:

HTML Code:

http://blog.metagenomics.anl.gov/m5tools-pl-the-m5nr-database-command-line-tool/

Two questions:
- First, I can't find the tool 'm5tools.pl' on the FTP site. Can someone point me to it?
- With this tool, can I regenerate the 'original' header from GreenGenes, including the ID 'AF068820', or at least the taxonomy ID of the organism? In the examples I saw this, which could help me:

http://api.metagenomics.anl.gov/m5nr/md5/000821a2e2f63df1a3873e4b280002a

But if I do this with my md5 ID, I get no results:
http://api.metagenomics.anl.gov/m5nr/md5/0000175eddb4b05d0bd52467315668ac

Thanks in advance,
Boetsie

**rhinoceros** · 12-11-2013, 12:45 PM

The m5tools script is at least here as "m5nr-tools.pl"

There's API documentation here too. I don't know why your query doesn't work. Are you sure it's a good checksum?

You could probably get taxonomic annotations with the map files too from here. You just need to apply join and sort to the right columns of the right files.
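On the "good checksum" question, a quick format check is possible before querying anything: an MD5 checksum in hex is always 32 characters from `0-9a-f`. The sketch below is only a format test with a made-up function name, `is_md5`. It cannot tell whether a well-formed ID actually exists in the database.

```shell
# Hypothetical helper: succeed only if $1 looks like a lowercase hex MD5.
is_md5() {
    case "$1" in
        *[!0-9a-f]*) return 1 ;;   # contains a character outside 0-9a-f
    esac
    [ "${#1}" -eq 32 ]             # must be exactly 32 characters long
}

# Check the two IDs from the URLs quoted in the question above.
for id in 000821a2e2f63df1a3873e4b280002a 0000175eddb4b05d0bd52467315668ac; do
    if is_md5 "$id"; then
        echo "$id: well-formed"
    else
        echo "$id: not a 32-character hex string"
    fi
done
```

Run on those two IDs, the check accepts the second (32 characters) and rejects the first as it appears in this thread (31 characters).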

**boetsie** · 12-12-2013, 01:02 AM

Thank you for pointing me to the m5nr-tools.pl script. However, if I take the first two md5 sums of the md5rna database:

HTML Code:

grep -m 2 ">" md5rna
>000000bce90ad07d3161ffac8cea5874
>0000029042cc6c69f2b830142508acb1

and search for them in the map file:

HTML Code:

grep "000000bce90ad07d3161ffac8cea5874" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
000000bce90ad07d3161ffac8cea5874        16      3385    2304
grep "0000029042cc6c69f2b830142508acb1" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
0000029042cc6c69f2b830142508acb1        16      3385    382680

Both have '16' as database, which is the RDP database (if 16 corresponds to the 'source'). So I try to find them in the RDP database:

HTML Code:

perl MG-RAST-Tools-master/tools/bin/m5nr-tools.pl --api http://kbase.us/services/communities/1 --option annotation --source RDP --md5 000000bce90ad07d3161ffac8cea5874,0000029042cc6c69f2b830142508acb1
S003289208      000000bce90ad07d3161ffac8cea5874        16S ribosomal RNA       Acinetobacter lwoffi

I get only one hit.

Since this did not work and is probably very slow, I am trying to work with the map files.

Thank you rhinoceros
Boetsie
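For looking up many IDs in the map file at once, one option is to collect the FASTA header IDs into a file and pass that to `grep -F -f`, which scans the map once instead of once per ID. This is a minimal sketch: the temporary files stand in for the real `md5rna` and `md5_rna_map`, the two map rows are copied from the post above, and the sequences are dummy placeholders.

```shell
tmp=$(mktemp -d)

# Stand-in map file holding the two rows shown above (tab-separated).
printf '%s\t%s\t%s\t%s\n' \
    000000bce90ad07d3161ffac8cea5874 16 3385 2304 \
    0000029042cc6c69f2b830142508acb1 16 3385 382680 > "$tmp/md5_rna_map"

# Stand-in FASTA with the same two headers and dummy sequences.
printf '>%s\nACGT\n' \
    000000bce90ad07d3161ffac8cea5874 \
    0000029042cc6c69f2b830142508acb1 > "$tmp/md5rna"

# Take the first two header lines and strip the leading '>'.
grep '^>' "$tmp/md5rna" | head -n 2 | cut -c2- > "$tmp/ids.txt"

# -F: treat patterns as fixed strings; -f: read one pattern per line.
hits=$(grep -F -f "$tmp/ids.txt" "$tmp/md5_rna_map")
printf '%s\n' "$hits"

rm -rf "$tmp"
```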

**rhinoceros** · 12-12-2013, 02:09 AM

The syntax of the sort and join combination you'll be using will be something like:

Code:

join -1 2 -2 1 -o 2.1,1.3 <(sort -k2,2 file1) <(sort -k1,1 file2)

This looks for matches in column 2 of file1 and column 1 of file2, and outputs column 1 of file2 and column 3 of file1. (Obviously you'll first need to figure out which columns are relevant in your files.) In my experience, this kind of combination of join and sort is very fast and works well for huge multimillion-row tables.
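A small worked version of that join may help. It uses temporary files instead of process substitution so it also runs in plain `sh`, sets `LC_ALL=C` so that `sort` and `join` agree on ordering, and uses a tab delimiter so names containing spaces stay in one field. The file layouts are assumptions made for illustration: the first row of `file1` reuses the single hit quoted earlier in the thread, and the second row is a made-up placeholder.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# file1 (assumed layout): source ID, md5, organism name.
printf '%s\t%s\t%s\n' \
    S003289208 000000bce90ad07d3161ffac8cea5874 'Acinetobacter lwoffi' \
    PLACEHOLDER1 ffffffffffffffffffffffffffffffff 'placeholder taxon' > "$tmp/file1"

# file2 (assumed layout): one md5 per line, e.g. the IDs your reads hit.
printf '%s\n' \
    0000029042cc6c69f2b830142508acb1 \
    000000bce90ad07d3161ffac8cea5874 > "$tmp/file2"

# Both inputs must be sorted on their join field, in the same collation.
LC_ALL=C sort -t "$tab" -k2,2 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Match column 2 of file1 against column 1 of file2;
# print the md5 (2.1) and the organism name (1.3).
joined=$(LC_ALL=C join -t "$tab" -1 2 -2 1 -o 2.1,1.3 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Only IDs present in both files are printed, so an md5 that is missing from `file1` simply drops out of the result.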

**boetsie** · 12-12-2013, 05:13 AM

I've already figured that out

Thanks for your help though!

**Bachbioinfo** · 12-12-2013, 10:38 AM

Thanks for posting your question here. I have not yet tried to map the md5rna IDs to taxonomic info or other annotations.
As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.

I will post my comments on the md5rna mapping steps later.


**rhinoceros** · 12-12-2013, 11:08 AM

Originally posted by Bachbioinfo

As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.

There's this article that gives rather good suggestions for blast in general. They also have an accompanying website updated for blast+. I'm looking forward to their 2013 article.

But I wouldn't know how applicable this stuff is to 16S and nucleotide queries in general. In my opinion, blast is the wrong approach to 16S amplicon data to begin with. Both QIIME (MacQIIME for Mac OS X) and mothur are far better suited for 16S stuff, and blast is definitely not the best method for assigning taxonomy to 16S reads.

**Bachbioinfo** · 12-12-2013, 11:19 AM

I totally agree with what you are suggesting about mothur and QIIME. However, I have metagenomes rather than amplicons. In this case, what is the best way to estimate OTU abundance? I do not know whether QIIME would be the best choice for that too. There are a lot of tools and methods, and a lot of literature comparing them, but the most appropriate approach for metagenomes is not always the same. I have to start the data analysis somewhere.

Best,

Originally posted by rhinoceros

There's this article that gives rather good suggestions for blast in general. They also have an accompanying website update for blast+. But I wouldn't know how applicable this stuff is to 16S queries in general. In my opinion, blast is the wrong approach to 16S amplicon data to begin with. Both QIIME (MacQIIME for Mac OS X) and mothur are far better suited for 16S stuff, and blast is definitely not the best method for assigning taxonomy to 16S reads.

**rhinoceros** · 12-12-2013, 11:39 AM

Well, you could start by submitting your data to MG-RAST. You can read on their website what the pipeline does. You can download your data after any particular step, e.g. predicted proteins or annotations against some specific db. You can also export biom tables for QIIME, for example. It's not perfect, but it's a good start and gives you initial results very fast. I noticed that the way they assign KEGG orthologs leaves a lot of real hits out. I'm sure it's the same with a lot of other stuff too.

**Bachbioinfo** · 12-16-2013, 12:51 PM

Hello,
I have just noticed the same thing. For example, I cannot find the key 0000029042cc6c69f2b830142508acb1 with m5nr-tools.pl, despite trying all the ribosomal sources described here: "http://api.metagenomics.anl.gov/api.html#annotation". One more question, please: what do the last two columns in md5_rna_map correspond to?
e.g. 0000029042cc6c69f2b830142508acb1 16 3385 382680

Taxon ID and GI, respectively? If so, I cannot find "382680" with a simple search of the NCBI databases.

Thank you all
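One way to get a feel for what the unlabeled columns hold is to count how many distinct values each one takes: a column with only a handful of values is more likely a category such as a source ID, while one with many values looks more like a per-record identifier. The sketch below does this with `awk` on a stand-in file holding just the two rows quoted in this thread. It does not settle what the columns actually mean.

```shell
tmp=$(mktemp -d)

# Stand-in for md5_rna_map with the two rows quoted in this thread.
printf '%s\t%s\t%s\t%s\n' \
    000000bce90ad07d3161ffac8cea5874 16 3385 2304 \
    0000029042cc6c69f2b830142508acb1 16 3385 382680 > "$tmp/md5_rna_map"

# For columns 2 to 4, count each (column, value) pair the first time it is seen.
col_summary=$(awk -F '\t' '
    { for (i = 2; i <= NF; i++) if (!seen[i, $i]++) n[i]++ }
    END { for (i = 2; i <= 4; i++) printf "column %d: %d distinct\n", i, n[i] }
' "$tmp/md5_rna_map")
printf '%s\n' "$col_summary"

rm -rf "$tmp"
```

On the full map file the same command would show whether columns 3 and 4 vary per sequence or repeat across many rows.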

Mother of Ribosomal Databases ?