  • entrez ID conversion

    Hello,

    Does anyone know how to convert Entrez IDs to either RefSeq IDs or gene symbols?
    I have found resources on RefSeq-to-gene-symbol conversion, but I can't find anything on Entrez IDs.
    The genome I work with is C. elegans.
    Thanks in advance for any suggestions.

  • #2
    Try UniProt's online conversion service: http://www.uniprot.org -> "ID Mapping" tab



    • #3
      NCBI maintains flat files of gene annotations that contain the information you're after:
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
      [ There are other interesting files in that directory ]


      The tax_id (taxonomy ID) for C. elegans is 6239 [ from the Taxonomy browser: http://www.ncbi.nlm.nih.gov/taxonomy ]

      You can type "wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz" from the command line, or download the file via a browser.

      Example using this data (after decompressing the files, e.g. with gunzip):
      bash-3.00$ cat gene2refseq | awk '{if ($1==6239) print $0}' | head
      6239 171590 REVIEWED NM_058260.3 193203640 NP_490660.1 17510631 NC_003279.6 193203938 4123 10231 - -
      6239 171591 REVIEWED NM_058259.3 193203639 NP_490661.1 17510629 NC_003279.6 193203938 11498 16830 + -
      6239 171592 REVIEWED NM_058261.3 133902001 NP_490662.1 17510633 NC_003279.6 193203938 17496 26780 - -
      6239 171592 REVIEWED NM_058262.3 86561628 NP_490663.1 17510635 NC_003279.6 193203938 17496 26780 - -
      6239 171593 REVIEWED NM_058263.3 115533565 NP_490664.2 115533566 NC_003279.6 193203938 27594 32481 - -
      6239 171594 REVIEWED NM_058265.3 71995026 NP_490666.2 25143331 NC_003279.6 193203938 49918 54359 + -
      6239 171595 REVIEWED NM_058267.4 115533567 NP_490668.4 115533568 NC_003279.6 193203938 55315 64020 - -
      6239 171597 REVIEWED NM_058269.2 71995034 NP_490670.1 17510145 NC_003279.6 193203938 85044 86283 - -
      6239 171599 REVIEWED NM_058271.6 212645149 NP_490672.2 25143337 NC_003279.6 193203938 93030 94880 + -
      6239 171600 REVIEWED NM_058272.4 212645150 NP_490673.1 17510147 NC_003279.6 193203938 96478 100612 - -
      -bash-3.00$ cat gene_info | grep 171590 | awk '{if ($1==6239) print $0}'
      6239 171590 Y74C9A.3 Y74C9A.3 - WormBase:WBGene00022277 I - hypothetical protein protein-coding - - - - 20101017
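
The two lookups above can be combined into one step. The following is a minimal sketch, assuming the decompressed files are tab-delimited with the column order visible in the output above (tax_id, GeneID, then status, RNA accession, RNA GI, protein accession for gene2refseq; tax_id, GeneID, Symbol for gene_info). The function name entrez_lookup is made up for illustration and is not an NCBI tool.

```shell
#!/usr/bin/env bash
# Sketch: print Entrez Gene ID, symbol, RefSeq mRNA and RefSeq protein
# for one gene of one organism.
# Assumed tab-delimited column layout (taken from the example output):
#   gene_info:   $1 tax_id, $2 GeneID, $3 Symbol
#   gene2refseq: $1 tax_id, $2 GeneID, $4 RNA accession, $6 protein accession
# entrez_lookup is a hypothetical helper name.
entrez_lookup() {
    local tax_id="$1" gene_id="$2" gene_info="$3" gene2refseq="$4"
    awk -F'\t' -v tax="$tax_id" -v gid="$gene_id" '
        # First file (gene_info): remember the symbol of the requested gene.
        NR == FNR { if ($1 == tax && $2 == gid) symbol = $3; next }
        # Second file (gene2refseq): one output line per RefSeq record.
        $1 == tax && $2 == gid { print gid "\t" symbol "\t" $4 "\t" $6 }
    ' "$gene_info" "$gene2refseq"
}

# Example call, assuming both files sit in the current directory:
# entrez_lookup 6239 171590 gene_info gene2refseq
```

Splitting on tabs with -F'\t' matters for gene_info, because its description field ("hypothetical protein") contains spaces, and matching on the GeneID column avoids the accidental hits that a plain grep for the number could produce in other columns.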



      • #4
        DAVID has a Gene ID Conversion tool:



        Fuad



        • #5
          The Bioconductor package "biomaRt" can also do it.



          • #6
            In Bioconductor, just use the following code (shown here for a human gene with org.Hs.eg.db; the C. elegans counterpart is org.Ce.eg.db):

            > library(org.Hs.eg.db)
            > library(annotate)
            > lookUp('3815', 'org.Hs.eg', 'SYMBOL')
            $`3815`
            [1] "KIT"

            > lookUp('3815', 'org.Hs.eg', 'REFSEQ')
            $`3815`
            [1] "NM_000222" "NM_001093772" "NP_000213" "NP_001087241"



            • #7
              You can also do ID conversion using BioMart at EBI.



              • #8
                Always a fan of the Linux one-liner, here is an example for the human ACTB gene using hg18:

                mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e "select k2ll.value as entrezGeneId, kx.refseq as refseqMrna, kx.geneSymbol as entrezGeneSymbol, kx.description as entrezGeneDesc from kgXref kx, knownToLocusLink k2ll where k2ll.name=kx.kgID and kx.geneSymbol='ACTB';"
                UCSC's C. elegans tables don't include the knownGene and kg% tables, but some poking around (using "show tables like '%locus%';") led me to this MySQL query, which takes a locusLinkId (the Entrez Gene ID) as input and prints the gene symbol, RefSeq mRNA, description, etc.

                mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D ce6 -e "select rl.locusLinkId, rl.name as geneName, rl.product as geneDescription, rl.mrnaAcc as refseqMrna, rl.protAcc as refseqProt from refLink rl where rl.locusLinkId=174288;"
                The bummer is that you have to tell it to use "ce6"; it isn't generic enough to sniff out a priori which organism and version to use. But you'll know which one to use, right? :-) And you can of course change the "=174288" to "IN (174288, 174289, 174290)" for more of a bulk-input experience, depending on what you need. If you end up batch-scripting some gene ID conversions, I'd definitely use the "IN" clause instead of querying the IDs one by one. It is markedly faster.
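
For batch use, the "IN" list can be assembled from the gene IDs instead of typed by hand. This is a rough sketch rather than a tested pipeline: build_reflink_query is a made-up helper name, the query text is the ce6 refLink query from this post, and the digits-only check is an added precaution so that nothing other than numeric IDs ends up in the SQL.

```shell
#!/usr/bin/env bash
# Sketch: build the refLink query from this post with an IN (...) list.
# build_reflink_query is a hypothetical helper name.
build_reflink_query() {
    local ids id
    # Refuse to build "IN ()", which is not valid SQL.
    [ "$#" -gt 0 ] || { echo "no gene IDs given" >&2; return 1; }
    for id in "$@"; do
        # Accept digits only.
        case "$id" in
            ''|*[!0-9]*) echo "not a numeric gene ID: $id" >&2; return 1 ;;
        esac
    done
    # Join all arguments with commas, e.g. 174288,174289,174290
    ids=$(IFS=,; echo "$*")
    echo "select rl.locusLinkId, rl.name as geneName, rl.product as geneDescription, rl.mrnaAcc as refseqMrna, rl.protAcc as refseqProt from refLink rl where rl.locusLinkId IN ($ids);"
}

# Example call (needs network access and the mysql client; host and
# database are the ones quoted in this post):
# mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D ce6 \
#     -e "$(build_reflink_query 174288 174289 174290)"
```

With the IDs in a file, one per line, something like build_reflink_query $(cat ids.txt) would pass them all in one call; ids.txt is only a placeholder name.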

                DAVID is in theory a great resource, but could be opened up to increase the API limits, or to allow direct data downloads.



                • #9
                  Thank you all guys



                  • #10
                    How to do the opposite?

                    Originally posted by peachgil
                    In Bioconductor, just use the following code:

                    > library(org.Hs.eg.db)
                    > library(annotate)
                    > lookUp('3815', 'org.Hs.eg', 'SYMBOL')
                    $`3815`
                    [1] "KIT"

                    > lookUp('3815', 'org.Hs.eg', 'REFSEQ')
                    $`3815`
                    [1] "NM_000222" "NM_001093772" "NP_000213" "NP_001087241"
                    I have a set of HGNC gene symbols, and I want to convert them to Entrez Gene IDs.

                    Thanks much!
