  • Reference Gene Sequence file

    Dear all,

    I'm new to the NGS field and I'm currently facing many issues with different bioinformatic analysis workflows. My questions may look naive to most of you, but I hope you can help me even if the answer is obvious to you.

    The issue I'm on right now is this:

    Is there any public database or repository where I can download a file that lists all the genes of an organism (in this case those corresponding to the hg19 version of the human genome), showing the name of each gene, its coordinates and the chromosome it's located on?

    The file format should be roughly equivalent to these fields:

    #GeneID Start End Chromosome ...

    Thanks in advance

  • #2
    You should be able (I assume, although I've never worked with the human genome before) to download GFF format files from the NCBI FTP site, which has all the information that you need.

    The ftp site is here:


    • #3
      UCSC has good data.
      Go to
      and download refFlat.txt.gz

      Direct link:

      This is the refseq "definition" of the human genes from the NCBI at NIH.

      It is carefully curated "by hand".

      Other folks have their own "guesses" as to what the complete set of human genes is.

      It's a "moving target" and the final catalog of what genes are in the genome isn't done.

      refseq is a pretty good picture, based on current understanding, of what's there.
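
For the table the original poster asked for (#GeneID Start End Chromosome), a minimal Python sketch along these lines may help. It assumes the usual refFlat column order (geneName, name, chrom, strand, txStart, txEnd, ...) and collapses the transcript rows into one span per gene and chromosome. The function names `gene_table` and `read_refflat` are made up for illustration, and note that UCSC table starts are 0-based.

```python
import gzip


def gene_table(lines):
    """Collapse refFlat transcript rows into (gene, start, end, chrom) tuples.

    Assumed refFlat columns: geneName, name, chrom, strand, txStart, txEnd, ...
    Starts are 0-based and ends are exclusive, as in other UCSC tables.
    """
    spans = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene, chrom = fields[0], fields[2]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        key = (gene, chrom)
        if key in spans:
            # Widen the span so it covers every transcript of the gene.
            start, end = spans[key]
            spans[key] = (min(start, tx_start), max(end, tx_end))
        else:
            spans[key] = (tx_start, tx_end)
    return sorted((g, s, e, c) for (g, c), (s, e) in spans.items())


def read_refflat(path):
    """Read a gzipped refFlat file, e.g. read_refflat("refFlat.txt.gz")."""
    with gzip.open(path, "rt") as handle:
        return gene_table(handle)
```

One caveat with this approach: a gene name that appears at several separate loci on the same chromosome would be merged into one wide span, so it is worth spot-checking a few genes.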


      • #4
        Dear jimmybee and Richard Finney,

        The information you've given me here has been of great value. I've followed your instructions and downloaded the references. Now I have to see if I can establish correspondences between the info in those reference tables and the info I obtained from my bowtie alignment files against the hg19 genome. I'll keep you informed on how I'll try to do this.

        Best wishes and thank you once more for your kind help.
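
One way to set up that correspondence is to look up which gene spans contain each alignment position. The Python sketch below is only an illustration under stated assumptions: gene rows are (gene, start, end, chrom) tuples with 0-based, end-exclusive coordinates (as in UCSC tables), while alignment positions are 1-based as in the SAM POS field. The names `build_index` and `genes_at` are hypothetical.

```python
from collections import defaultdict


def build_index(gene_rows):
    """Group (gene, start, end, chrom) rows by chromosome.

    Coordinates are assumed 0-based with an exclusive end.
    """
    index = defaultdict(list)
    for gene, start, end, chrom in gene_rows:
        index[chrom].append((start, end, gene))
    for spans in index.values():
        spans.sort()
    return index


def genes_at(index, chrom, pos1):
    """Return the sorted gene names whose span contains a 1-based position."""
    pos0 = pos1 - 1  # convert the 1-based SAM position to 0-based
    return sorted(
        gene for start, end, gene in index.get(chrom, []) if start <= pos0 < end
    )


if __name__ == "__main__":
    # Made-up gene spans, purely for demonstration.
    idx = build_index([("GENE_A", 100, 200, "chr1"), ("GENE_B", 150, 300, "chr1")])
    print(genes_at(idx, "chr1", 160))
```

The linear scan per lookup is fine for a handful of positions; for millions of reads an interval tree or a dedicated tool would be the more sensible choice. The chromosome names in both inputs also need to follow the same convention.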


        • #5
          You can also get this data from Ensembl

          The human data is mapped to GRCh37/hg19


          • #6
            Dear Laura,

            Thank you very much for your feedback. It was the last piece of information I was looking for in order to compare my mapping results to the table of counts I had generated using R, whose reference was the Ensembl annotation of hg19.

            You've all helped me a lot with this issue.


            • #7
              Doesn't the Ensembl gene annotation have the problem of "too many transcripts"? I tried using Ensembl gene references with snpEff to annotate a list of variants I had generated (for a tumor-normal pair), and the output file had more than five times as many lines as the input. Multiple transcripts (an average of 3-4) are assigned to the same position. Doesn't this make downstream analysis difficult?

              I am new to Bioinformatics too. And this has been bugging me for a few days.

              Is it okay to use refseq gene definitions, if my variants were obtained by aligning to GRCh37.67?


              Originally posted by laura
              You can also get this data from Ensembl


              The human data is mapped to GRCh37/hg19



              • #8
                An interesting complaint. I have rarely seen Ensembl accused of having too many alt splice forms; normally people complain about too few.

                The majority of loci have more than one transcript which will be expressed at different times in different tissues

                This does give you more information but this extra information can be important

                The Ensembl Variant Effect Predictor script does have an option to output only the most severe consequence per gene; if this consequence occurs in more than one transcript, the transcript is chosen arbitrarily.

                The best solution to this problem is of course to have expression data (RNA-seq or microarray) for the tumor and normal tissue, so you actually know which transcripts from a particular locus are being expressed and need only consider those transcripts.

                All the annotation data should be based on the same reference. For the autosomes (and mostly for X and Y), hg19 and the primary GRCh37 assembly should be identical in terms of sequence and coordinates. UCSC chooses to use its own naming convention for its assemblies rather than the official GRC name.
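
If the annotation tool at hand has no such option, the "most severe consequence per gene" idea from this post can be approximated after the fact. The Python sketch below is a rough illustration, not how any particular tool does it. The severity order is an assumption chosen for the example (real tools each document their own ranking), and `most_severe_per_gene` is a hypothetical helper.

```python
# Illustrative severity order, most severe first. This ranking is an
# assumption for the example; check the ranking your annotation tool uses.
SEVERITY = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY)}


def most_severe_per_gene(annotations):
    """Keep one (transcript, consequence) per (variant, gene) pair.

    annotations: iterable of (variant_id, gene, transcript, consequence).
    Consequences missing from SEVERITY are treated as least severe.
    """
    best = {}
    for variant, gene, transcript, consequence in annotations:
        key = (variant, gene)
        rank = RANK.get(consequence, len(SEVERITY))
        # Strict '<' means the first transcript seen wins a tie, so the
        # choice among equally severe transcripts depends on input order.
        if key not in best or rank < best[key][0]:
            best[key] = (rank, transcript, consequence)
    return {key: (tx, cons) for key, (_, tx, cons) in best.items()}
```

This reduces the output to one line per variant and gene, at the cost of hiding the other transcripts, which is exactly the trade-off discussed in this post.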


                • #9
                  Originally posted by laura

                  All the annotation data should be based on the same reference. For the autosomes (and mostly for X and Y), hg19 and the primary GRCh37 assembly should be identical in terms of sequence and coordinates. UCSC chooses to use its own naming convention for its assemblies rather than the official GRC name.
                  You mean that the current hg19, if I download it today, and GRCh37.67 will have the same sequence and coordinates?


                  • #10
                    Originally posted by laura
                    An interesting complaint. I have rarely seen Ensembl accused of having too many alt splice forms; normally people complain about too few.
                    That is an interesting statement! Ensembl gene set has 57000 "genes" while RefSeq has only 24000. I would have expected that to be a non-rare complaint..

                    Kindly clarify.


                    • #11
                      hg19 only represents the primary GRCh37 assembly as far as I am aware (someone please correct me if I am wrong). So long as you stick to the primary chromosomes (so 1-22, X and Y), the coordinates should be identical.
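
In practice the visible difference for those primary chromosomes is the name: UCSC files use a "chr" prefix (chr1, chrX) while GRCh37/Ensembl files use the bare name (1, X). A small Python sketch for translating between the two is given here, restricted on purpose to chromosomes 1-22, X and Y since those are the only ones this thread vouches for. The helper names `to_grc` and `to_ucsc` are made up.

```python
# Primary chromosomes only: 1-22, X and Y.
PRIMARY = [str(n) for n in range(1, 23)] + ["X", "Y"]
UCSC_TO_GRC = {"chr" + name: name for name in PRIMARY}
GRC_TO_UCSC = {grc: ucsc for ucsc, grc in UCSC_TO_GRC.items()}


def to_grc(chrom):
    """Map a UCSC-style name (chr1) to the GRC-style name (1).

    Returns None for anything outside the primary chromosomes, so
    haplotypes, unplaced scaffolds and the like are never silently renamed.
    """
    return UCSC_TO_GRC.get(chrom)


def to_ucsc(chrom):
    """Map a GRC-style name (1) to the UCSC-style name (chr1), or None."""
    return GRC_TO_UCSC.get(chrom)
```

Returning None instead of guessing forces the caller to decide what to do with sequences that may not be equivalent between the two releases.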


                      • #12
                        Originally posted by shyam_la
                        That is an interesting statement! Ensembl gene set has 57000 "genes" while RefSeq has only 24000. I would have expected that to be a non-rare complaint..

                        Kindly clarify.
                        The RefSeq cDNA set probably only considers protein coding transcripts/cDNAs.

                        Ensembl's 57000 includes not just protein coding genes but also pseudogenes, ncRNAs and some other things

                        If you go to biomart and just look at protein coding genes you get 21976 genes
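
The same breakdown can be checked offline from a downloaded annotation file. The Python sketch below counts distinct genes per biotype in GTF lines. It assumes each line carries `gene_id` and `gene_biotype` attributes in column 9, which depends on the Ensembl release, so inspect the file first. `genes_per_biotype` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def genes_per_biotype(gtf_lines):
    """Count distinct gene IDs per biotype.

    Assumes gene_id and gene_biotype attributes in the 9th column;
    lines lacking either attribute are skipped.
    """
    genes = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        biotype = attrs.get("gene_biotype")
        if gene_id and biotype:
            # A set per biotype counts each gene once, however many
            # transcript or exon lines it has.
            genes[biotype].add(gene_id)
    return {biotype: len(ids) for biotype, ids in genes.items()}
```

Counting gene IDs rather than lines matters here, because a GTF has one line per exon or transcript feature and most genes appear many times.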


                        • #13
                          Originally posted by laura
                          hg19 only represents the primary GRCh37 assembly as far as I am aware (someone please correct me if I am wrong). So long as you stick to the primary chromosomes (so 1-22, X and Y), the coordinates should be identical.
                          Well, I have come across another source that said that hg19 is also updated but the name stays the same. There is no version number unlike GRCh37 and hence the latter should be preferred so that you can keep track of which version you are using.

                          You seem to suggest that hg19 has stayed unchanged since Feb 2009, when it was first released.

                          So many conflicting ideas on the internet! Sigh.. A newbie is bound to get lost..


                          • #14
                            I am no expert in UCSC, so I am just going by what the browser is called. If your other source is the UCSC help pages or something similar, I would trust it.

                            I will point out, though, that the primary assembly (i.e. the main chromosomes 1-22, chrX and chrY) has not changed since GRCh37 was released in 2009, so if you are just using the main chromosomes without any repeat masking, it doesn't matter what the pN number is: the chromosomes are identical.

                            The version only matters if you are also considering alternative haplotypes and GRC fix patches
                            Last edited by laura; 07-03-2012, 01:18 PM.


                            • #15
                              Originally posted by laura
                              I am no expert in UCSC, so I am just going by what the browser is called. If your other source is the UCSC help pages or something similar, I would trust it.

                              I will point out, though, that the primary assembly (i.e. the main chromosomes 1-22, chrX and chrY) has not changed since GRCh37 was released in 2009, so if you are just using the main chromosomes without any repeat masking, it doesn't matter what the pN number is: the chromosomes are identical.

                              The version only matters if you are also considering alternative haplotypes and GRC fix patches

                              This was my source..

                              But I know you are correct now!

                              Just one more question: the statistics page for GRCh37.p7 lists "Number of FIX Patch scaffolds aligned to the Primary Assembly: 60". What does 60 mean here?

