**GenoMax** · 12-08-2014, 12:44 PM

There may be another way of doing this. One solution:

Do YP accessions refer to bacterial sequences? You can get corresponding "gi" ID's from the "faa" files here: ftp://ftp.ncbi.nih.gov/refseq/release/bacteria/

The gi ID's can then be mapped to the NC* from *genomic* files in the same directory.

**Richard Finney** · 12-08-2014, 01:56 PM

The grande flat text file "gene2accession" from NCBI has this information.

There are many other interesting files in the same directory (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/), and they are updated frequently.
There is a README file which helps explain the data thereabouts: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/README

The URL for gene2accession is ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Command to get it is : wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

or use a browser.

Be sure to run "gzip -d gene2accession.gz" to decompress the file.

_____

The "YP" is protein_accession.version in column 6 and the "NC" is genomic_nucleotide_accession.version in column 8

the gory details ...

The header is this ...

-bash-4.1$ head -1 gene2accession
#Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)

"YPs" look like this ...
-bash-4.1$ grep YP_ gene2accession | head
9 8655732 PROVISIONAL - - YP_003329478.1 270208711 NC_013549.1 270208709 1111 2502 + - - - leuC
9 8655733 PROVISIONAL - - YP_003329479.1 270208712 NC_013549.1 270208709 2560 3162 + - - - leuD
9 8655734 PROVISIONAL - - YP_003329480.1 270208713 NC_013549.1 270208709 3488 5035 + - - - leuA
9 8655735 PROVISIONAL - - YP_003329481.1 270208714 NC_013549.1 270208709 5466 6209 + - - - repA
9 8655736 PROVISIONAL - - YP_003329477.1 270208710 NC_013549.1 270208709 14 1111 + - - - leuB
9 20468915 PROVISIONAL - - YP_009062868.1 690387890 NC_025017.1 690387888 2298 2882 + - - - trpG
9 20468916 PROVISIONAL - - YP_009062867.1 690387889 NC_025017.1 690387888 0 1580 + - - - trpE
33 5961931 PROVISIONAL - - YP_001691218.1 169302958 NC_010372.1 169302939 15822 16589 - - - - pMF1.19c
33 5961932 PROVISIONAL - - YP_001691211.1 169302951 NC_010372.1 169302939 10004 11044 + - - - pMF1.12
33 5961933 PROVISIONAL - - YP_001691221.1 169302961 NC_010372.1 169302939 17650 18333 + - - - pMF1.22
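
The column layout above can be turned into a lookup table. Here is a minimal Python sketch, assuming the data lines are tab-separated (as the header says), with the protein accession in column 6 and the genomic accession in column 8. The function name `protein_to_genomic` is made up for illustration, and the two sample rows are copied from the grep output above and re-joined with tabs.

```python
import io

def protein_to_genomic(handle):
    """Map protein_accession.version (column 6) to the set of
    genomic_nucleotide_accession.version values (column 8)."""
    mapping = {}
    for line in handle:
        if line.startswith("#"):          # header / comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:               # skip malformed or short lines
            continue
        protein, genomic = fields[5], fields[7]
        if protein == "-" or genomic == "-":
            continue                      # '-' means "not applicable"
        mapping.setdefault(protein, set()).add(genomic)
    return mapping

# Two rows from the grep output above; whitespace re-joined as tabs.
rows = [
    "9 8655732 PROVISIONAL - - YP_003329478.1 270208711 NC_013549.1 270208709 1111 2502 + - - - leuC",
    "33 5961931 PROVISIONAL - - YP_001691218.1 169302958 NC_010372.1 169302939 15822 16589 - - - - pMF1.19c",
]
sample = io.StringIO("".join("\t".join(r.split()) + "\n" for r in rows))

table = protein_to_genomic(sample)
print(table["YP_003329478.1"])
```

For the real file, passing `gzip.open("gene2accession.gz", "rt")` as the handle should work without decompressing first. A set is kept per protein because nothing here guarantees that a protein accession appears on only one line.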

**GenoMax** · 12-08-2014, 06:02 PM

Thanks for sharing that Richard. Learned something new.

Is this file continually updated?

**Richard Finney** · 12-08-2014, 06:23 PM

Theoretically these files are regenerated daily, though sometimes the actual contents don't change.

Using a little script-fu you can do things like create a GO term counts file for a set of gene inputs, just to get some bearings. There are ENSEMBL to gene Ref/HUGO lookups too, which come in handy when dealing with "European oriented" software like DESeq2.
Not that there's anything wrong with using default DESeq annotation files.

**carolW** · 12-09-2014, 02:04 AM

very nice and practical.

Can I grep for a protein ID in this file, gene2accession? Won't grep pull out more than one protein ID if they share the same pattern, e.g. IDs ending in 1, 10, 100, etc.?

Many thx

**Richard Finney** · 12-09-2014, 08:34 AM

Correct. Grepping is a problem unless the desired string match is unique.

Rolling your own "match lines with items in this string set with items in that column" is a rite of passage in the business.

Whether you can most easily do this in python/perl/java/c or a bash script using standard utils is an open question.
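
As one possible take on the "match a set of strings against one column" problem, here is a short Python sketch. It compares whole fields rather than substrings, so an ID never matches a longer ID that merely starts with it. The function name `match_ids` and the sample rows are made up for illustration; the rows are not real NCBI records.

```python
def match_ids(lines, wanted, column=5):
    """Yield the split fields of every line whose field at `column`
    (0-based; 5 is the protein accession in gene2accession) is exactly
    one of the strings in `wanted`."""
    wanted = set(wanted)                  # set lookup is fast for large ID lists
    for line in lines:
        if line.startswith("#"):          # header / comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column] in wanted:
            yield fields

# Made-up rows that reproduce the 1 / 10 / 100 situation.
rows = [
    "#Format: made-up example rows, not real NCBI records",
    "0\t0\t-\t-\t-\tYP_1.1\t-\tNC_1.1",
    "0\t0\t-\t-\t-\tYP_10.1\t-\tNC_2.1",
    "0\t0\t-\t-\t-\tYP_100.1\t-\tNC_3.1",
]

hits = list(match_ids(rows, {"YP_10.1"}))
print([fields[7] for fields in hits])     # only the YP_10.1 row is reported
```

A plain substring search for "YP_1" would have hit all three data rows, whereas the whole-field comparison returns a row only when the ID is identical.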

**Michael Love** · 12-09-2014, 11:41 AM

DESeq is database agnostic. Although I like "European oriented"

e.g. in our demo data package, airway,

http://bioconductor.org/packages/release/data/experiment/vignettes/airway/inst/doc/airway.html

...just replace this line:

Code:

txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

with

Code:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

**carolW** · 12-10-2014, 12:58 AM

Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?

**rhinoceros** · 12-10-2014, 01:10 AM

Originally posted by carolW

Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?

Not sure, but you can get this information with Entrez Direct. For example, for two proteins, the query would be:

Code:

efetch -db protein -id 195954015,553836951 -format docsum | xtract -element Slen | tr "\t" "\n" 
225
74

For nucleotides, the db would be "nuccore".

**carolW** · 12-10-2014, 01:25 AM

if I have a set of IDs, what would be the file to search in?

**rhinoceros** · 12-10-2014, 01:27 AM

Originally posted by carolW

if I have a set of IDs, what would be the file to search in?

I don't understand your question

**GenoMax** · 12-10-2014, 06:30 AM

Originally posted by carolW

Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?

The file Richard referred to has the genomic coordinates. From its README:

start position on the genomic accession:
position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based

end position on the genomic accession:
position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based

If you are dealing with bacterial ORFs then converting that to AA lengths should be easy.
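
A rough Python sketch of that conversion follows. It assumes that both coordinates are 0-based and inclusive, that the gene feature is a single uninterrupted ORF, and that the feature includes the stop codon. None of these assumptions is confirmed by the README text quoted above, so check them against a few known proteins first. The function name `orf_lengths` is made up for illustration.

```python
def orf_lengths(start, end, includes_stop=True):
    """Return (nucleotide_length, amino_acid_length) for an ORF given
    0-based, inclusive start/end positions on the genomic accession."""
    nt = end - start + 1                  # inclusive coordinates
    if nt % 3 != 0:
        raise ValueError(f"length {nt} is not a multiple of 3; not a simple ORF?")
    aa = nt // 3 - (1 if includes_stop else 0)   # stop codon codes for no residue
    return nt, aa

# leuB row from the gene2accession listing earlier in the thread: start 14, end 1111
print(orf_lengths(14, 1111))              # (1098, 365)
```

The rows shown earlier in the thread all give a multiple of 3 under the inclusive reading (e.g. 0 to 1580 is 1581 nt), which is consistent with that assumption but does not prove it.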

Otherwise rhinoceros posted a programmatic way you can get that information directly from NCBI. You would need to iterate through your ID's.

**carolW** · 01-30-2015, 01:21 AM

Since proteins whose IDs start with WP_ are not in this file, how can I find the info for these proteins?

**GenoMax** · 01-30-2015, 04:04 AM

Originally posted by carolW

Since proteins whose IDs start with WP_ are not in this file, how can I find the info for these proteins?

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

Look for files with *non_redundant* in their names.

Perhaps Richard knows of a file where this information is in one spot.

NCBI Reference Sequence ID to refseq accession
