
**rhinoceros** · 09-06-2015, 11:34 PM

If you have genome IDs, then simply:

Code:

efetch -db nuccore -id $ID -format fasta
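
For anyone wanting to loop over many IDs, here is a minimal Python sketch built around the same one-liner. It assumes EDirect is installed and on the PATH; the function names and the one-file-per-accession layout are my own choices for illustration, not part of EDirect.

```python
import subprocess


def efetch_fasta_argv(accession, db="nuccore"):
    """Build the argument list for: efetch -db nuccore -id ACCESSION -format fasta"""
    return ["efetch", "-db", db, "-id", accession, "-format", "fasta"]


def fetch_all(accessions, out_dir="."):
    """Write one FASTA file per accession.

    Needs EDirect on the PATH and network access. The file naming
    (<accession>.fasta) is an arbitrary choice for this sketch.
    """
    for accession in accessions:
        with open(f"{out_dir}/{accession}.fasta", "w") as handle:
            subprocess.run(
                efetch_fasta_argv(accession),
                stdout=handle,
                stdin=subprocess.DEVNULL,  # keep efetch from reading our stdin
                check=True,                # raise if efetch exits with an error
            )
```

Calling `fetch_all(["NC_014833.1", "NC_014824.1"])` would then run the command once per accession.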

**roliwilhelm** · 09-07-2015, 10:31 AM

Many thanks!

That works for genome accessions, but not for project accessions. Does anyone have advice on how to a) get a list of genome accessions associated with a project or b) download all genomes for a given project? The 'esearch' command I mentioned above only provided the project accession for any given genome.

**rhinoceros** · 09-07-2015, 10:11 PM

Perhaps you could post a few examples of project numbers?

**roliwilhelm** · 09-08-2015, 12:48 PM

Example Project Numbers

Fair enough!

These are project numbers that have genomic data in either a single chromosome (PRJNA47909) or a chromosome with multiple plasmids (PRJNA35077). I would like a command that downloads all data from within a single project.

Thanks again,

**rhinoceros** · 09-08-2015, 11:36 PM

It's essentially:

Code:

esearch -db genome -query PRJNA35077 | elink -target nuccore | efetch -format fasta

However, in real life this fetches too much data, because multiple genomes are associated with this project, as we see here:

Code:

esearch -db genome -query PRJNA35077 | elink -target nuccore | efetch -format docsum | xtract -Pattern DocumentSummary -element Title -element Extra
Ruminococcus albus SY3, whole genome shotgun sequencing project	gi|739430767|ref|NZ_JEOB00000000.1|NZ_JEOB01000000
Ruminococcus albus 7 = DSM 20455, whole genome shotgun sequencing project	gi|655056149|ref|NZ_JHYT00000000.1|NZ_JHYT01000000
Ruminococcus albus AD2013, whole genome shotgun sequencing project	gi|640306017|ref|NZ_JAGS00000000.1|NZ_JAGS01000000
Ruminococcus albus 8, whole genome shotgun sequencing project	gi|325681559|ref|NZ_ADKM00000000.2|NZ_ADKM02000000
Ruminococcus albus 7 plasmid pRUMAL02, complete sequence	gi|319788691|ref|NC_014825.1|
Ruminococcus albus 7 plasmid pRUMAL01, complete sequence	gi|317133719|ref|NC_014824.1|
Ruminococcus albus 7 plasmid pRUMAL04, complete sequence	gi|317133710|ref|NC_014827.1|
Ruminococcus albus 7, complete genome	gi|317054731|ref|NC_014833.1|
Ruminococcus albus 7 plasmid pRUMAL03, complete sequence	gi|315630409|ref|NC_014826.1|
Ruminococcus albus 7 plasmid pRUMAL04, complete sequence	gi|315450868|gb|CP002407.1|
Ruminococcus albus 7 plasmid pRUMAL03, complete sequence	gi|315450849|gb|CP002406.1|
Ruminococcus albus 7 plasmid pRUMAL02, complete sequence	gi|315450558|gb|CP002405.1|
Ruminococcus albus 7 plasmid pRUMAL01, complete sequence	gi|315450181|gb|CP002404.1|
Ruminococcus albus 7, complete genome	gi|315447000|gb|CP002403.1|
Ruminococcus albus SY3, whole genome shotgun sequencing project	gi|593023627|gb|JEOB00000000.1|JEOB01000000
Ruminococcus albus 7 = DSM 20455, whole genome shotgun sequencing project	gi|607835232|gb|JHYT00000000.1|JHYT01000000
Ruminococcus albus AD2013, whole genome shotgun sequencing project	gi|573973220|gb|JAGS00000000.1|JAGS01000000
Ruminococcus albus 8, whole genome shotgun sequencing project	gi|324110714|gb|ADKM00000000.2|ADKM02000000

**roliwilhelm** · 09-09-2015, 09:13 AM

Significant Progress

Very helpful, thanks! Unfortunately, you're right that those commands download a number of genome files. It appears that the commands treat redundant records of the same genomic data (stored @NCBI and @INSDC) as separate and download both. I think I can work with what you've given me so far. Thanks!

I did not find the NCBI manual on Entrez Direct to be very helpful, where did you pick up your knowledge?
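
One way to handle the redundancy described above is to filter the `Title` / `Extra` listing before fetching anything. The sketch below is only a heuristic: it assumes that two records with an identical title are the RefSeq (`ref`) and GenBank (`gb`) copies of the same sequence, which holds for the PRJNA35077 listing in this thread but is not guaranteed in general. The function names are made up for illustration.

```python
def parse_docsum_line(line):
    """Split one 'Title<TAB>Extra' line into (title, source, accession).

    Extra looks like 'gi|319788691|ref|NC_014825.1|', so after splitting
    on '|' the source tag is field 2 and the accession is field 3.
    """
    title, extra = line.rstrip("\n").rsplit("\t", 1)
    fields = extra.split("|")
    return title, fields[2], fields[3]


def dereplicate(lines, prefer="ref"):
    """Keep one accession per title, preferring the given source tag."""
    chosen = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        title, source, accession = parse_docsum_line(line)
        already_preferred = title in chosen and chosen[title][0] == prefer
        if title not in chosen or (source == prefer and not already_preferred):
            chosen[title] = (source, accession)
    return {title: accession for title, (_, accession) in chosen.items()}
```

The resulting accessions can then be passed to `efetch -db nuccore -id ... -format fasta` one at a time, so each sequence is downloaded only once. Use `prefer="gb"` to keep the GenBank copies instead.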

**roliwilhelm** · 09-09-2015, 06:36 PM

The final road block.

You have given me both helpful one-liners and a diverse set of commands, and I have tried working with them to solve this final problem. I am now able to parse the accession for the genome file from the project file, but in cases where a genome has multiple contigs, this accession number leads to another metadata file, which provides the accessions for all the contigs. (The webpage version reads: "This entry is the master record for a whole genome shotgun sequencing project and contains no sequence data."). In these cases, your helpful script falls short because I need one more command to pull the metadata so that I can parse the range of accession numbers.

Here's an example of that last page for a genome (NZ_LDPH00000000.1).

Thank you very much,
Roli
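
For the master-record case described above, once the contig range has been pulled out of the metadata, expanding it into individual accessions is straightforward. This is a rough sketch that assumes the range is written as `FIRST-LAST` with a shared prefix and equal-width numbering; the range in the example is made up for illustration and is not taken from the NZ_LDPH00000000.1 record.

```python
import re

# Non-digit prefix (e.g. 'LDPH' or 'NZ_LDPH') followed by the numeric part.
ACCESSION = re.compile(r"^(\D+)(\d+)$")


def expand_wgs_range(text):
    """Expand 'FIRST-LAST' into the full list of contig accessions."""
    first, last = text.strip().split("-")
    match_first, match_last = ACCESSION.match(first), ACCESSION.match(last)
    if match_first is None or match_last is None:
        raise ValueError(f"unrecognised accession range: {text!r}")
    prefix_first, digits_first = match_first.groups()
    prefix_last, digits_last = match_last.groups()
    if prefix_first != prefix_last or len(digits_first) != len(digits_last):
        raise ValueError(f"range endpoints do not match: {text!r}")
    width = len(digits_first)  # keep the zero padding of the original
    return [
        f"{prefix_first}{number:0{width}d}"
        for number in range(int(digits_first), int(digits_last) + 1)
    ]


# Hypothetical range, for illustration only:
print(expand_wgs_range("LDPH01000001-LDPH01000003"))
# ['LDPH01000001', 'LDPH01000002', 'LDPH01000003']
```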

**rhinoceros** · 09-09-2015, 10:34 PM

Hi. I picked up stuff mostly from the manual and random Google searches. Whatever applies to eutils can also be achieved with edirect with a little experimentation. For the WGS records, you link through the assembly db. You can also find out about existing links by queries from the GenBank home page. For example, for the NZ_LDPH.. query we see that there are records in the dbs nucleotide and genome. From the genome link on the top right we see "related information", and when we click assembly we land on this page, and again from "related information" we can find links to the actual sequences in nuccore:

Code:

esearch -db genome -query NZ_LDPH00000000.1 | elink -target assembly | elink -target nuccore | efetch -format fasta

We can also see the existing links without visiting the website at all. For example:

Code:

esearch -db genome -query NZ_LDPH00000000.1 | elink -target all -cmd acheck
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD elink 20101123//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20101123/elink.dtd">
<eLinkResult>
	<LinkSet>
		<DbFrom>genome</DbFrom>
		<IdCheckList>
			<IdLinkSet>
				<Id>38503</Id>
				<LinkInfo>
					<DbTo>assembly</DbTo>
					<LinkName>genome_assembly</LinkName>
					<MenuTag>Assembly</MenuTag>
					<HtmlTag>Assembly</HtmlTag>
					<Priority>128</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>bioproject</DbTo>
					<LinkName>genome_bioproject</LinkName>
					<MenuTag>BioProject Links</MenuTag>
					<HtmlTag>BioProject</HtmlTag>
					<Priority>128</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>nuccore</DbTo>
					<LinkName>genome_nuccore</LinkName>
					<MenuTag>Components</MenuTag>
					<HtmlTag>Components</HtmlTag>
					<Priority>150</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>nucleotide</DbTo>
					<LinkName>genome_nucleotide</LinkName>
					<MenuTag>Assembly</MenuTag>
					<HtmlTag>Assembly</HtmlTag>
					<Priority>150</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>protein</DbTo>
					<LinkName>genome_protein</LinkName>
					<MenuTag>Protein Links</MenuTag>
					<HtmlTag>Protein</HtmlTag>
					<Priority>170</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>pubmed</DbTo>
					<LinkName>genome_pubmed</LinkName>
					<MenuTag>PubMed Links</MenuTag>
					<HtmlTag>PubMed</HtmlTag>
					<Priority>180</Priority>
				</LinkInfo>
				<LinkInfo>
					<DbTo>taxonomy</DbTo>
					<LinkName>genome_taxonomy</LinkName>
					<MenuTag>Taxonomy Links</MenuTag>
					<HtmlTag>Taxonomy</HtmlTag>
					<Priority>200</Priority>
				</LinkInfo>
			</IdLinkSet>
		</IdCheckList>
	</LinkSet>
</eLinkResult>

p.s. I totally agree that the documentation concerning edirect could be a lot more comprehensive.
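
If you would rather not read the `acheck` XML by eye, a short script can list the available link targets. A minimal sketch using only the Python standard library; it assumes the XML has the `LinkInfo` / `DbTo` / `LinkName` layout shown above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def list_links(xml_bytes):
    """Return (DbTo, LinkName) pairs from an 'elink -cmd acheck' result."""
    root = ET.fromstring(xml_bytes)
    return [
        (info.findtext("DbTo"), info.findtext("LinkName"))
        for info in root.iter("LinkInfo")
    ]
```

Saving the command output to a file and calling `list_links(open("links.xml", "rb").read())` should give pairs such as `("assembly", "genome_assembly")` and `("nuccore", "genome_nuccore")`, which tell you which `-target` values are worth trying with `elink`.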

**roliwilhelm** · 09-10-2015, 12:12 PM

Possibly Helpful Python Scripts

Thank you very much Rhino! Those details will help in future.

I have prepared a few Python scripts to automate the process of going from an input list of genome names ->Script1-> output: list of project accessions ->Script2-> de-replicated genome FASTA files.

For whomever finds this thread useful, the scripts can be found here:
Script1
Script2

I have not provided any documentation (yet ... or maybe ever), but a novice Python user (like myself) should be able to use them. The second script was designed to account for the fact that the 'efetch' command inevitably downloads identical files (in all but accession number) from both repositories.

Don't hesitate to message me privately if you would like to use these scripts.


Downloading Genomic Data from NCBI from E-utils Direct
