Downloading Genomic Data from NCBI from E-utils Direct


  • Downloading Genomic Data from NCBI from E-utils Direct

    Hello,

    I have been given a list of the names of organisms whose genomes are stored at NCBI. I have been able to grab the "Project Accession" for each of these names using the following NCBI "Entrez Direct" command:

    esearch -db genome -query "Acidothermus cellulolyticus" | efetch -format docsum

    Which was a start, but this does not enable me to download the FASTA, which I would like. The Project Accession is associated with numerous genomes and I'm OK with downloading all the genomes within a given project. However, I cannot figure out which command will perform that download.

    Any help would be greatly appreciated (even in a totally different approach to go from genome name to data).

    Thanks

  • #2
    If you have genome IDs, then simply:

    Code:
    efetch -db nuccore -id $ID -format fasta
    savetherhino.org
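
For a whole list of IDs, the same call can be wrapped in a loop. This is only a minimal sketch, assuming a plain text file with one nuccore accession or GI per line. The function name `fetch_fasta_for_ids` and the one-file-per-ID layout are made up for illustration; only the `efetch` call itself comes from the post above.

```shell
# Hypothetical helper: fetch one FASTA file per ID listed in a text file.
# $1: file with one nuccore accession or GI per line
# $2: output directory (created if missing)
fetch_fasta_for_ids() {
    mkdir -p "$2"
    while IFS= read -r id; do
        # Skip blank lines in the ID list.
        [ -n "$id" ] || continue
        # stdin is redirected so that efetch cannot swallow the rest of the
        # ID list in case it reads from standard input.
        efetch -db nuccore -id "$id" -format fasta </dev/null > "$2/$id.fasta"
    done < "$1"
}

# Example call (needs Entrez Direct installed and network access):
#   fetch_fasta_for_ids ids.txt fasta_out
```

Writing one file per ID keeps a failed download easy to spot and retry.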


    • #3
      Many thanks!

      That works for genome accessions, but not for project accessions. Does anyone have advice on how to a) get a list of genome accessions associated with a project or b) download all genomes for a given project? The 'esearch' command I mentioned above only provided the project accession for any given genome.


      • #4
        Perhaps you could post a few examples of project numbers?
        savetherhino.org


        • #5
          Example Project Numbers

          Fair enough!

          These are project numbers that have genomic data in either a single chromosome (PRJNA47909) or a chromosome with multiple plasmids (PRJNA35077). I would like a command that downloads all data from within a single project.

          Thanks again,


          • #6
            It's essentially:

            Code:
            esearch -db genome -query PRJNA35077 | elink -target nuccore | efetch -format fasta
            However, in real life this fetches too much data because multiple genomes are associated with this project as we see here.

            Code:
            esearch -db genome -query PRJNA35077 | elink -target nuccore | efetch -format docsum | xtract -Pattern DocumentSummary -element Title -element Extra
            Ruminococcus albus SY3, whole genome shotgun sequencing project	gi|739430767|ref|NZ_JEOB00000000.1|NZ_JEOB01000000
            Ruminococcus albus 7 = DSM 20455, whole genome shotgun sequencing project	gi|655056149|ref|NZ_JHYT00000000.1|NZ_JHYT01000000
            Ruminococcus albus AD2013, whole genome shotgun sequencing project	gi|640306017|ref|NZ_JAGS00000000.1|NZ_JAGS01000000
            Ruminococcus albus 8, whole genome shotgun sequencing project	gi|325681559|ref|NZ_ADKM00000000.2|NZ_ADKM02000000
            Ruminococcus albus 7 plasmid pRUMAL02, complete sequence	gi|319788691|ref|NC_014825.1|
            Ruminococcus albus 7 plasmid pRUMAL01, complete sequence	gi|317133719|ref|NC_014824.1|
            Ruminococcus albus 7 plasmid pRUMAL04, complete sequence	gi|317133710|ref|NC_014827.1|
            Ruminococcus albus 7, complete genome	gi|317054731|ref|NC_014833.1|
            Ruminococcus albus 7 plasmid pRUMAL03, complete sequence	gi|315630409|ref|NC_014826.1|
            Ruminococcus albus 7 plasmid pRUMAL04, complete sequence	gi|315450868|gb|CP002407.1|
            Ruminococcus albus 7 plasmid pRUMAL03, complete sequence	gi|315450849|gb|CP002406.1|
            Ruminococcus albus 7 plasmid pRUMAL02, complete sequence	gi|315450558|gb|CP002405.1|
            Ruminococcus albus 7 plasmid pRUMAL01, complete sequence	gi|315450181|gb|CP002404.1|
            Ruminococcus albus 7, complete genome	gi|315447000|gb|CP002403.1|
            Ruminococcus albus SY3, whole genome shotgun sequencing project	gi|593023627|gb|JEOB00000000.1|JEOB01000000
            Ruminococcus albus 7 = DSM 20455, whole genome shotgun sequencing project	gi|607835232|gb|JHYT00000000.1|JHYT01000000
            Ruminococcus albus AD2013, whole genome shotgun sequencing project	gi|573973220|gb|JAGS00000000.1|JAGS01000000
            Ruminococcus albus 8, whole genome shotgun sequencing project	gi|324110714|gb|ADKM00000000.2|ADKM02000000
            Last edited by rhinoceros; 09-09-2015, 01:01 AM.
            savetherhino.org
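
In the listing above, most sequences appear twice: once as a RefSeq record (`ref|` in the second column) and once as a GenBank record (`gb|`). One way to cut the list down before fetching is to keep only the RefSeq lines and pull out their accessions. A minimal sketch, assuming the tab-separated Title/Extra output shown above; the function name `refseq_accessions` is made up for illustration.

```shell
# Hypothetical filter: read "Title<TAB>Extra" lines on stdin and print the
# accession of every RefSeq record (Extra looks like gi|NUMBER|ref|ACCESSION|...).
refseq_accessions() {
    awk -F'\t' '
        # Keep only lines whose second column is tagged as RefSeq.
        $2 ~ /\|ref\|/ {
            # Fields of the Extra column are pipe-separated; the 4th is the accession.
            split($2, part, "|")
            print part[4]
        }
    '
}

# Example call (needs Entrez Direct installed and network access):
#   esearch -db genome -query PRJNA35077 | elink -target nuccore |
#     efetch -format docsum |
#     xtract -Pattern DocumentSummary -element Title -element Extra |
#     refseq_accessions
```

Each accession printed this way can then be passed to `efetch -db nuccore -id ... -format fasta` as in post #2. Swapping `ref` for `gb` in the pattern would keep the GenBank copies instead.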


            • #7
              Significant Progress

              Very helpful, thanks! Unfortunately, you're right that those commands download a number of genome files. It appears that the commands treat redundant records of the same genomic data (stored @NCBI and @INSDC) as separate and download both. I think I can work with what you've given me so far. Thanks!

              I did not find the NCBI manual on Entrez Direct to be very helpful, where did you pick up your knowledge?


              • #8
                The final road block.

                You have given me both helpful one-liners and a diverse set of commands, and I have tried working with them to solve this final problem. I am now able to parse the accession for the genome file from the project file, but in cases where a genome has multiple contigs, this accession number leads to another metadata record that lists the accessions for all the contigs. (The web page reads: "This entry is the master record for a whole genome shotgun sequencing project and contains no sequence data.") In these cases your one-liner falls short: I need one more command to pull that metadata so that I can parse the range of accession numbers.

                Here's an example of that last page for a genome (NZ_LDPH00000000.1).

                Thank you very much,
                Roli
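
Once a contig range has been read off a master record, it still has to be expanded into individual accessions before each one can be fetched. A minimal sketch of that step, assuming the range is written as FIRST-LAST and that both ends share the same prefix and digit width. The function name `expand_accession_range` is made up, and the range in the example call is a placeholder, not the real contig range of NZ_LDPH00000000.1.

```shell
# Hypothetical helper: expand "PREFIX000001-PREFIX000003" into one accession per line.
# Assumes both ends share the same non-numeric prefix and the same digit width.
expand_accession_range() {
    printf '%s\n' "$1" | awk -F- '
        {
            first = $1; last = $2
            # Split the first accession into a head and its trailing run of digits.
            match(first, /[0-9]+$/)
            width = RLENGTH
            head  = substr(first, 1, RSTART - 1)
            lo    = substr(first, RSTART) + 0
            # Only the numeric tail of the last accession is needed.
            match(last, /[0-9]+$/)
            hi = substr(last, RSTART) + 0
            # Zero-pad every number back to the original width.
            fmt = "%s%0" width "d\n"
            for (i = lo; i <= hi; i++) printf fmt, head, i
        }'
}

# Example call with a made-up range:
#   expand_accession_range ABCD01000001-ABCD01000003
```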


                • #9
                  Hi. I picked up stuff mostly from the manual and random Google searches. Whatever applies to eutils can also be achieved with edirect with a little experimentation. For WGS projects, you link through the assembly db. You can also find out about existing links by running queries from the GenBank home page. For example, for the NZ_LDPH.. query we see that there are records in the nucleotide and genome dbs. From the genome link at the top right we see "related information", and when we click assembly we land on this page, where "related information" again gives us links to the actual sequences in nuccore:

                  Code:
                  esearch -db genome -query NZ_LDPH00000000.1 | elink -target assembly | elink -target nuccore | efetch -format fasta
                  We can also see the existing links without visiting the website at all. For example:

                  Code:
                  esearch -db genome -query NZ_LDPH00000000.1 | elink -target all -cmd acheck
                  <?xml version="1.0" encoding="UTF-8"?>
                  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD elink 20101123//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20101123/elink.dtd">
                  <eLinkResult>
                  	<LinkSet>
                  		<DbFrom>genome</DbFrom>
                  		<IdCheckList>
                  			<IdLinkSet>
                  				<Id>38503</Id>
                  				<LinkInfo>
                  					<DbTo>assembly</DbTo>
                  					<LinkName>genome_assembly</LinkName>
                  					<MenuTag>Assembly</MenuTag>
                  					<HtmlTag>Assembly</HtmlTag>
                  					<Priority>128</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>bioproject</DbTo>
                  					<LinkName>genome_bioproject</LinkName>
                  					<MenuTag>BioProject Links</MenuTag>
                  					<HtmlTag>BioProject</HtmlTag>
                  					<Priority>128</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>nuccore</DbTo>
                  					<LinkName>genome_nuccore</LinkName>
                  					<MenuTag>Components</MenuTag>
                  					<HtmlTag>Components</HtmlTag>
                  					<Priority>150</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>nucleotide</DbTo>
                  					<LinkName>genome_nucleotide</LinkName>
                  					<MenuTag>Assembly</MenuTag>
                  					<HtmlTag>Assembly</HtmlTag>
                  					<Priority>150</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>protein</DbTo>
                  					<LinkName>genome_protein</LinkName>
                  					<MenuTag>Protein Links</MenuTag>
                  					<HtmlTag>Protein</HtmlTag>
                  					<Priority>170</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>pubmed</DbTo>
                  					<LinkName>genome_pubmed</LinkName>
                  					<MenuTag>PubMed Links</MenuTag>
                  					<HtmlTag>PubMed</HtmlTag>
                  					<Priority>180</Priority>
                  				</LinkInfo>
                  				<LinkInfo>
                  					<DbTo>taxonomy</DbTo>
                  					<LinkName>genome_taxonomy</LinkName>
                  					<MenuTag>Taxonomy Links</MenuTag>
                  					<HtmlTag>Taxonomy</HtmlTag>
                  					<Priority>200</Priority>
                  				</LinkInfo>
                  			</IdLinkSet>
                  		</IdCheckList>
                  	</LinkSet>
                  </eLinkResult>
                  p.s. I totally agree that the documentation concerning edirect could be a lot more comprehensive.
                  Last edited by rhinoceros; 09-09-2015, 11:06 PM.
                  savetherhino.org
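
The `acheck` XML above is long, and usually only the target database and link name of each entry matter. A minimal sketch that condenses it with plain awk, assuming one tag per line as in the output shown; the function name `list_link_targets` is made up for illustration.

```shell
# Hypothetical helper: read elink acheck XML on stdin and print
# "DbTo<TAB>LinkName" for every <LinkInfo> entry.
# Assumes each <DbTo> and <LinkName> tag sits on its own line.
list_link_targets() {
    awk '
        /<DbTo>/ {
            # Strip everything around the tag content and remember the database.
            sub(/.*<DbTo>/, ""); sub(/<\/DbTo>.*/, "")
            db = $0
        }
        /<LinkName>/ {
            sub(/.*<LinkName>/, ""); sub(/<\/LinkName>.*/, "")
            print db "\t" $0
        }
    '
}

# Example call (needs Entrez Direct installed and network access):
#   esearch -db genome -query NZ_LDPH00000000.1 |
#     elink -target all -cmd acheck | list_link_targets
```

For the output shown above this would print one line per link, starting with `assembly` and `genome_assembly`.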


                  • #10
                    Possibly Helpful Python Scripts

                    Thank you very much Rhino! Those details will help in future.

                    I have prepared a few Python scripts to automate the process of going from a list of genome names ->Script1-> list of project accessions ->Script2-> de-replicated genome FASTA files.

                    For whomever finds this thread useful, the scripts can be found here:
                    Script1
                    Script2

                    I have not provided any documentation (yet ... or maybe ever), but a novice Python user (like myself) should be able to use them. The second script was designed to account for the fact that the 'efetch' command inevitably downloads identical files (identical in all but accession number) from both repositories.

                    Don't hesitate to message me privately if you would like to use these scripts.
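
The scripts themselves are not reproduced in the thread, so the following is not Script2. It is only a sketch of one way to do the de-replication step described above: keep the first FASTA record for each distinct sequence and drop later records whose sequence is identical. The function name `dedupe_fasta` is made up, and the approach assumes the duplicates match exactly, letter for letter.

```shell
# Hypothetical helper: read FASTA on stdin, print each distinct sequence once
# (first header wins). Sequences are written unwrapped, on a single line.
# Whole sequences are held in memory, so this suits bacterial-sized genomes.
dedupe_fasta() {
    awk '
        # Print the pending record unless its sequence was already seen.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }      # join wrapped sequence lines
        END { emit() }
    '
}

# Example call on a file fetched earlier:
#   dedupe_fasta < all_records.fasta > unique_records.fasta
```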
