  • Bachbioinfo
    Member
    • Nov 2013
    • 24

    Mother of Ribosomal Databases?

    Is the M5rna database available on the MG-RAST server a non-redundant database of ribosomal genes that combines SILVA, Greengenes, and RDP?

    Do you know on which FTP site I can find this database, so that I can use it locally?
    Last edited by Bachbioinfo; 12-12-2013, 11:23 AM.
    __Bach__
  • rhinoceros
    Senior Member
    • Apr 2013
    • 372

    #2
    ftp://ftp.metagenomics.anl.gov/data/
    Last edited by rhinoceros; 11-20-2013, 08:36 AM.
    savetherhino.org

    • Bachbioinfo
      Member
      • Nov 2013
      • 24

      #3
      Thanks!

      For MD5nr, the publication is here: http://www.biomedcentral.com/1471-2105/13/141

      It seems they did the same for the ribosomal databases with md5rna
      (ftp://ftp.metagenomics.anl.gov/data/...rent/md5rna.gz)
      __Bach__

      • Bachbioinfo
        Member
        • Nov 2013
        • 24

        #4
        16S microbial

        Hello,

        I have another question, please.

        Does the database at "ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz" contain the same information as "ftp://ftp.metagenomics.anl.gov/data/MD5nr/20130801/md5rna.gz"?

        The README for the NCBI BLAST databases gives no information about how this database was constructed or what it contains.

        Many thanks
        Last edited by Bachbioinfo; 12-09-2013, 07:11 AM.
        __Bach__

        • rhinoceros
          Senior Member
          • Apr 2013
          • 372

          #5
          NCBI's 16S db is tiny: it contains about 7k near-full-length bacterial and a few hundred near-full-length archaeal SSU sequences. m5rna, on the other hand, is Greengenes, SILVA and RDP (maybe something else too?) combined, and contains something like 3.5M SSU/LSU sequences of various lengths.

          The alphanumeric characters you see are md5 checksums. See here.
          Last edited by rhinoceros; 12-09-2013, 07:26 AM.
          savetherhino.org
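
To make the md5 point concrete, here is a minimal sketch of how such an identifier could be recomputed from a sequence. It assumes the ID is simply the md5 hex digest of the bare sequence string with whitespace removed; whether the database also normalises case before hashing is not stated in this thread and should be checked against the M5nr documentation. `seq_md5` is a hypothetical helper name, and `md5sum` (GNU coreutils) is assumed to be installed.

```shell
# Hypothetical helper: print the md5 hex digest of a sequence string.
# Assumption: the ID is the md5 of the bare sequence with all whitespace
# stripped. Case is left untouched; check the M5nr docs for the real rule.
seq_md5() {
    printf '%s' "$1" | tr -d ' \n\r\t' | md5sum | cut -d ' ' -f 1
}

# Toy sequence, not a real database entry: prints a 32-character hex string.
seq_md5 "ACGTACGTACGT"
```

Because the digest depends on every character, two records with identical sequences collapse to one ID, which is presumably how the combined database is made non-redundant.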

          • boetsie
            Senior Member
            • Feb 2010
            • 245

            #6
            Thanks for pointing out this 16S non-redundant database. I am trying to use it, but I am having some difficulty with the md5 checksums. My goal is to align my reads to the database and process them with MEGAN. However, MEGAN needs IDs from which it can derive the taxonomic name. For example, my current Greengenes file looks like this:

            HTML Code:
            >AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae; otu_204
            Here, the ID 'AF068820.2' is important.

            The headers in the md5rna file look like this:

            HTML Code:
            >0000175eddb4b05d0bd52467315668ac
            As rhinoceros pointed out, there is some information about the md5 checksums here: http://blog.metagenomics.anl.gov/m5nr-api/

            and after some searching I found this;

            HTML Code:
            http://blog.metagenomics.anl.gov/m5tools-pl-the-m5nr-database-command-line-tool/
            Two questions:
            - First, I can't find the tool 'm5tools.pl' on the FTP site. Can someone point me to it?
            - With this tool, can I regenerate the original Greengenes header, i.e. one with the ID 'AF068820', or at least the taxonomy ID of the organism? In the examples I saw this, which could help me:



            But if I do this with my md5 ID, I get no results;
            http://api.metagenomics.anl.gov/m5nr/md5/0000175eddb4b05d0bd52467315668ac

            Thanks in advance,
            Boetsie

            • rhinoceros
              Senior Member
              • Apr 2013
              • 372

              #7
              The m5tools script is available here, at least, as "m5nr-tools.pl".

              There's API documentation here too. I don't know why your query doesn't work; are you sure it's a valid checksum?

              You could probably also get taxonomic annotations from the map files here. You just need to apply join and sort to the right columns of the right files.
              Last edited by rhinoceros; 12-11-2013, 01:03 PM.
              savetherhino.org

              • boetsie
                Senior Member
                • Feb 2010
                • 245

                #8
                Thank you for pointing me to the m5nr-tools.pl script. However, if I take the first two md5 sums from the md5rna database:
                HTML Code:
                grep -m 2 ">" md5rna
                >000000bce90ad07d3161ffac8cea5874
                >0000029042cc6c69f2b830142508acb1
                and search for them in the map file:

                HTML Code:
                grep "000000bce90ad07d3161ffac8cea5874" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
                000000bce90ad07d3161ffac8cea5874        16      3385    2304
                grep "0000029042cc6c69f2b830142508acb1" /data/testfolder/metagenomics_pipeline_test/16s/non-redundant-database/md5_rna_map
                0000029042cc6c69f2b830142508acb1        16      3385    382680
                Both have '16' as the database, which is the RDP database (if '16' corresponds to the 'source'). So I tried to find them in the RDP database:


                HTML Code:
                perl MG-RAST-Tools-master/tools/bin/m5nr-tools.pl --api http://kbase.us/services/communities/1 --option annotation --source RDP --md5 000000bce90ad07d3161ffac8cea5874,0000029042cc6c69f2b830142508acb1
                S003289208      000000bce90ad07d3161ffac8cea5874        16S ribosomal RNA       Acinetobacter lwoffi
                I get only one hit.

                Since this did not work, and is probably very slow anyway, I am now trying to work with the map files.

                Thank you rhinoceros
                Boetsie
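
A rough sketch of inspecting the map file directly, under stated assumptions: the file is whitespace/tab-delimited with the md5 in column 1 and a numeric source code in column 2, as the two grep hits above suggest. The meaning of the remaining columns is not assumed. `count_sources` and `lookup_md5` are hypothetical helper names, and the third row of the toy file is made up.

```shell
# Count how many rows each source code (column 2) has in a map file.
count_sources() {
    awk '{ n[$2]++ } END { for (s in n) print s "\t" n[s] }' "$1" | LC_ALL=C sort
}

# Print every map row whose first column is exactly the given md5.
# Appending "" forces a string comparison, so an all-digit md5 is never
# compared numerically; exact field matching also avoids substring hits.
lookup_md5() {
    awk -v id="$2" '($1 "") == (id "")' "$1"
}

# Toy map file: the two rows shown above plus one made-up row.
toy_map=$(mktemp)
printf '%s\t16\t3385\t2304\n' 000000bce90ad07d3161ffac8cea5874 > "$toy_map"
printf '%s\t16\t3385\t382680\n' 0000029042cc6c69f2b830142508acb1 >> "$toy_map"
printf '%s\t7\t1\t1\n' ffffffffffffffffffffffffffffffff >> "$toy_map"   # made-up row

count_sources "$toy_map"
lookup_md5 "$toy_map" 0000029042cc6c69f2b830142508acb1
rm -f "$toy_map"
```

Tallying column 2 first shows how many distinct source codes the file actually uses, which helps when checking a guess such as "16 means RDP" against the source list in the API documentation.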

                • rhinoceros
                  Senior Member
                  • Apr 2013
                  • 372

                  #9
                  The syntax of the sort and join combination you'll be using will be something like:

                  Code:
                  join -1 2 -2 1 -o 2.1,1.3 <(sort -k2,2 file1) <(sort -k1,1 file2)
                  This looks for matches between column 2 of file1 and column 1 of file2, and outputs column 1 of file2 and column 3 of file1. Obviously, you'll first need to figure out which columns are relevant in which files. In my experience, this kind of join and sort combination is very fast and works well for huge, multimillion-row tables.
                  savetherhino.org
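
A small worked example of the join/sort combination above, as a sketch on made-up data. The toy files and their column layout are purely illustrative and are not the layout of the real map files. Temporary files stand in for the bash-only `<(...)` process substitution so that the snippet also runs under plain sh, and `LC_ALL=C` is set so that sort and join agree on ordering.

```shell
# Toy, made-up inputs; the real column layout of the map files is not assumed.
workdir=$(mktemp -d)

# file1: name, key, value (join key in column 2)
printf 'a\tk2\tv2\nb\tk1\tv1\nc\tk3\tv3\n' > "$workdir/file1"
# file2: key, label (join key in column 1); k4 has no partner in file1
printf 'k3\tL3\nk1\tL1\nk4\tL4\n' > "$workdir/file2"

# Sort both inputs on their join column under the same locale, then join.
# -t sets TAB as the field separator for both sort and join.
tab=$(printf '\t')
LC_ALL=C sort -t "$tab" -k2,2 "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/file2" > "$workdir/file2.sorted"
joined=$(LC_ALL=C join -t "$tab" -1 2 -2 1 -o 2.1,1.3 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$joined"
rm -r "$workdir"
```

Only k1 and k3 occur in both toy files, so two rows are printed: the key from file2 followed by the third column of file1. Keys present in just one file (k2, k4) are dropped, which is join's default behaviour.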

                  • boetsie
                    Senior Member
                    • Feb 2010
                    • 245

                    #10
                    I've already figured that out. Thanks for your help, though!

                    • Bachbioinfo
                      Member
                      • Nov 2013
                      • 24

                      #11
                      Thanks for posting your question here. I have not yet tried to map the md5rna IDs to taxonomic information or other annotations.
                      As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.

                      I will post my comments on the md5rna mapping steps later.

                      __Bach__

                      • rhinoceros
                        Senior Member
                        • Apr 2013
                        • 372

                        #12
                        Originally posted by Bachbioinfo View Post
                        As I am using MEGAN5 too, I would like to know whether it is a good idea to select the soft masking option with blastn.
                        There's this article that gives rather good suggestions for blast in general. They also have an accompanying website updated for blast+. I'm looking forward to their 2013 article.

                        But I wouldn't know how applicable this stuff is to 16S and nucleotide queries in general. In my opinion, blast is the wrong approach to 16S amplicon data to begin with. Both QIIME (MacQIIME for Mac OS X) and mothur are far better suited for 16S stuff, and blast is definitely not the best method for assigning taxonomy to 16S reads.
                        Last edited by rhinoceros; 12-12-2013, 11:10 AM.
                        savetherhino.org

                        • Bachbioinfo
                          Member
                          • Nov 2013
                          • 24

                          #13
                          I totally agree with what you suggest regarding mothur and QIIME, but I have metagenomes rather than amplicons. In that case, what is the best way to estimate OTU abundance? I do not know whether QIIME would also be the best choice for that. There are many tools and methods, and a large body of literature comparing them, but the most appropriate approach for metagenomes is not always the same. I have to start the data analysis somewhere.

                          best,



                          __Bach__

                          • rhinoceros
                            Senior Member
                            • Apr 2013
                            • 372

                            #14
                            Well, you could start by submitting your data to MG-RAST. You can read on their website what the pipeline does. You can download your data after any particular step, e.g. predicted proteins or annotations against a specific db. You can also export, for example, BIOM tables for QIIME. It's not perfect, but it's a good start and gives you initial results very quickly. I noticed that the way they assign KEGG orthologs leaves out a lot of real hits. I'm sure it's the same with a lot of other stuff too.
                            savetherhino.org

                            • Bachbioinfo
                              Member
                              • Nov 2013
                              • 24

                              #15
                              Hello,
                              I have just noticed the same thing boetsie describes in #8. For example, I cannot find the key 0000029042cc6c69f2b830142508acb1 with m5nr-tools.pl, despite trying all the ribosomal sources listed here: "http://api.metagenomics.anl.gov/api.html#annotation". One question, please: what do the last two columns in md5_rna_map correspond to?
                              e.g. 0000029042cc6c69f2b830142508acb1 16 3385 382680

                              Taxon ID and GI, respectively? If so, I cannot find "382680" with a simple search of the NCBI databases.

                              Thank you all
                              __Bach__
