  • Reference Gene Sequence file

    Dear all,

    I'm new to the NGS field and am currently running into many issues with different bioinformatics analysis workflows. My questions may look naive to most of you, but I hope you can help me even if the answers are obvious to you.

    The issue I'm working on right now is this:

    Is there a public database or repository where I can download a file listing all the genes of an organism (in this case, the hg19 version of the human genome), showing the name of each gene, its coordinates, and the chromosome it is located on?

    The file format should be roughly equivalent to these fields:

    #GeneID Start End Chromosome ...



    Thanks in advance

  • #2
    You should be able to download GFF-format files (spec: http://www.sanger.ac.uk/resources/so.../gff/spec.html) from the NCBI FTP site, which has all the information that you need. I am assuming this, as I've never worked with the human genome before.

    The ftp site is here: ftp://ftp.ncbi.nlm.nih.gov


    • #3
      UCSC has good data.
      Go to http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database
      and download refFlat.txt.gz

      Direct link:


      This is the refseq "definition" of the human genes from the NCBI at NIH.

      It is carefully curated "by hand".

      Other folks have their own "guesses" as to what the complete set of human genes is.

      It's a "moving target" and the final catalog of what genes are in the genome isn't done.

      refseq is a pretty good picture, based on current understanding, of what's there.
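
For anyone wanting the "#GeneID Start End Chromosome" layout asked for in the first post, here is a minimal sketch that collapses refFlat rows into one span per gene and chromosome. It assumes the usual refFlat column order (geneName, name, chrom, strand, txStart, txEnd, ...) with 0-based start coordinates; the names `parse_refflat` and `load_refflat` are made up for this example, and merging all transcripts of a gene into a single span is just one possible choice.

```python
import gzip
from collections import namedtuple

GeneSpan = namedtuple("GeneSpan", "gene chrom start end")


def parse_refflat(lines):
    """Collapse refFlat transcript rows into one span per (gene, chromosome)."""
    spans = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed columns: 0 geneName, 2 chrom, 4 txStart, 5 txEnd
        gene, chrom = fields[0], fields[2]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        key = (gene, chrom)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, tx_start), max(old_end, tx_end))
        else:
            spans[key] = (tx_start, tx_end)
    return [GeneSpan(g, c, s, e) for (g, c), (s, e) in sorted(spans.items())]


def load_refflat(path):
    """Read a gzipped refFlat file (e.g. a downloaded refFlat.txt.gz)."""
    with gzip.open(path, "rt") as handle:
        return parse_refflat(handle)
```

Calling `load_refflat("refFlat.txt.gz")` on the downloaded file should give a list that can be written out as a tab-separated table. Genes that appear on more than one chromosome or contig are kept as separate rows.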


      • #4
        Dear jimmybee and Richard Finney,

        The information you've given me here has been of great value. I've followed your instructions and downloaded the references. Now I have to see if I can establish correspondences between the info in those reference tables and the info I obtained from my bowtie alignment files against the hg19 genome. I'll keep you informed on how I'll try to do this.

        Best wishes and thank you once more for your kind help.


        • #5
          You can also get this data from Ensembl

          ftp://ftp.ensembl.org/pub/current_gtf

          The human data is mapped to GRCh37/hg19

          ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
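
A rough sketch of how gene coordinates could be pulled out of such a GTF file. It assumes the standard nine tab-separated GTF columns with `gene_id` and (optionally) `gene_name` in the attribute column, and it builds each gene span from the exon records rather than relying on dedicated gene lines being present. The function name `gtf_gene_spans` is invented for the example.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_gene_spans(lines):
    """Return {gene_id: (gene_name, chrom, start, end)} from GTF exon records."""
    spans = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # GTF coordinates are 1-based and inclusive
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        if gene_id in spans:
            name, _, old_start, old_end = spans[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        else:
            name = attrs.get("gene_name", gene_id)
        spans[gene_id] = (name, chrom, start, end)
    return spans
```

The input can be any iterable of lines, for example an open file handle on the unpacked GTF.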


          • #6
            Dear Laura,

            Thank you very much for your feedback. It was the last piece of information I was looking for in order to compare my mapping results to the table of counts I had generated in R, which used the Ensembl annotation of hg19 as its reference.

            You've all helped me a lot with this issue.


            • #7
              Doesn't the Ensembl gene annotation have the problem of "too many transcripts"? I tried using Ensembl gene references with snpEff to annotate a list of variants I had generated (for a tumor-normal pair), and the output file grew more than 5-fold in number of lines. Multiple transcripts (3-4 on average) are assigned to the same position. Doesn't this make downstream analysis difficult?

              I am new to bioinformatics too, and this has been bugging me for a few days.

              Is it okay to use RefSeq gene definitions if my variants were obtained by aligning to GRCh37.67?

              Thanks.

              Originally posted by laura
              You can also get this data from Ensembl

              ftp://ftp.ensembl.org/pub/current_gtf

              The human data is mapped to GRCh37/hg19

              ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/


              • #8
                An interesting complaint. I have rarely seen Ensembl accused of having too many alternative splice forms; normally people complain about too few.

                The majority of loci have more than one transcript, and these will be expressed at different times in different tissues.

                This does give you more information to handle, but that extra information can be important.

                The Ensembl Variant Effect Predictor script does have an option to output only the most severe consequence per gene. If that consequence occurs in more than one transcript, one of them is chosen arbitrarily.
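
To illustrate the idea (this is not the predictor's own code), here is a small sketch that keeps one annotation per variant and gene. The severity order below is an assumption made up for the example and covers only four consequence terms; a real analysis should use the ranking documented by the annotation tool. Ties are resolved by keeping whichever transcript was seen first, which is arbitrary in the same sense as described above.

```python
# Illustrative ranking only, most severe first (an assumption for this sketch)
SEVERITY = ["stop_gained", "missense_variant", "synonymous_variant", "intron_variant"]
RANK = {term: i for i, term in enumerate(SEVERITY)}


def most_severe_per_gene(annotations):
    """Keep one (transcript, consequence) per (variant, gene).

    `annotations` is an iterable of (variant, gene, transcript, consequence).
    Unknown consequence terms are treated as least severe.
    """
    best = {}
    for variant, gene, transcript, consequence in annotations:
        key = (variant, gene)
        rank = RANK.get(consequence, len(SEVERITY))
        # Strictly better rank replaces the stored entry; ties keep the first seen
        if key not in best or rank < best[key][0]:
            best[key] = (rank, transcript, consequence)
    return {key: (t, c) for key, (_, t, c) in best.items()}
```

With three or four transcripts per position, this brings the output back down to one line per variant and gene, at the cost of hiding the other transcripts.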



                The best solution to this problem is, of course, to have expression data (RNA-seq or microarray) for the tumor and normal tissue, so you actually know which transcripts from a particular locus are being expressed and need only consider those.

                All the annotation data should be based on the same reference. For the autosomes (and mostly for X and Y), hg19 and the primary GRCh37 assembly should be identical in terms of sequence and coordinates. UCSC chooses to use its own naming convention for its assemblies rather than the official GRC name.


                • #9
                  Originally posted by laura

                  All the annotation data should be based on the same reference. For the autosomes (and mostly for X and Y), hg19 and the primary GRCh37 assembly should be identical in terms of sequence and coordinates. UCSC chooses to use its own naming convention for its assemblies rather than the official GRC name.
                  You mean that the current hg19, if I download it today, and GRCh37.67 will have the same sequence and coordinates?


                  • #10
                    Originally posted by laura
                    An interesting complaint. I have rarely seen Ensembl accused of having too many alternative splice forms; normally people complain about too few.
                    That is an interesting statement! The Ensembl gene set has 57000 "genes" while RefSeq has only 24000. I would have expected that to be a common complaint.

                    Kindly clarify.


                    • #11
                      As far as I am aware, hg19 only represents the primary GRCh37 assembly (someone please correct me if I am wrong), so as long as you stick to the primary chromosomes (1-22, X and Y) the coordinates should be identical.
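
One practical wrinkle when combining files from the two sources: UCSC files name chromosomes "chr1", "chrX", and so on, while Ensembl files use "1", "X". A minimal sketch of a name translation limited to the primary chromosomes mentioned above follows; the helper names `to_grc` and `to_ucsc` are invented for this example, and anything outside 1-22, X and Y returns `None` so that it gets checked by hand instead of being converted silently.

```python
# Primary chromosomes only: 1-22, X and Y
PRIMARY = [str(n) for n in range(1, 23)] + ["X", "Y"]
UCSC_TO_GRC = {"chr" + name: name for name in PRIMARY}
GRC_TO_UCSC = {grc: ucsc for ucsc, grc in UCSC_TO_GRC.items()}


def to_grc(ucsc_name):
    """'chr1' -> '1'; returns None for anything outside the primary chromosomes."""
    return UCSC_TO_GRC.get(ucsc_name)


def to_ucsc(grc_name):
    """'1' -> 'chr1'; returns None for anything outside the primary chromosomes."""
    return GRC_TO_UCSC.get(grc_name)
```

Only the names are translated. The coordinates are passed through untouched, on the assumption stated above that the primary chromosomes are identical.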


                      • #12
                        Originally posted by shyam_la
                        That is an interesting statement! The Ensembl gene set has 57000 "genes" while RefSeq has only 24000. I would have expected that to be a common complaint.

                        Kindly clarify.
                        The RefSeq cDNA set probably only considers protein-coding transcripts/cDNAs.

                        Ensembl's 57000 includes not just protein-coding genes but also pseudogenes, ncRNAs and some other things.

                        If you go to BioMart and just look at protein-coding genes, you get 21976 genes.
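
The same breakdown can be approximated from a GTF file by counting distinct gene IDs per biotype. This is a sketch under assumptions: it reads a `gene_biotype` attribute when one is present and otherwise falls back to the second GTF column, which older Ensembl GTF files used for the biotype. Check which convention your file follows before trusting the numbers. The function name is made up for the example.

```python
import re
from collections import Counter

# Matches attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_genes_by_biotype(lines):
    """Count distinct gene_ids per biotype in GTF lines."""
    biotype_of = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Assumption: gene_biotype attribute if present, else column 2
        biotype = attrs.get("gene_biotype", fields[1])
        # Each gene is counted once, under the first biotype seen for it
        biotype_of.setdefault(gene_id, biotype)
    return Counter(biotype_of.values())
```

Looking up `"protein_coding"` in the returned counter gives the figure to compare against the BioMart number.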


                        • #13
                          Originally posted by laura
                          As far as I am aware, hg19 only represents the primary GRCh37 assembly (someone please correct me if I am wrong), so as long as you stick to the primary chromosomes (1-22, X and Y) the coordinates should be identical.
                          Well, I have come across another source that said hg19 is also updated but the name stays the same. It has no version number, unlike GRCh37, and hence the latter should be preferred so that you can keep track of which version you are using.

                          You seem to suggest that hg19 has stayed unchanged since Feb 2009, when it was first released.

                          So many conflicting ideas on the internet! Sigh. A newbie is bound to get lost.


                          • #14
                            I am no expert in UCSC, so I am just going by what the browser is called. If your other source is the UCSC help pages or something similar, I would trust it.

                            I will point out, though, that the primary assembly (i.e. the main chromosomes 1-22, chrX and chrY) has not changed since GRCh37 was released in 2009. So if you are just using the main chromosomes without any repeat masking, it doesn't matter what the pN number is; the chromosomes are identical.

                            The version only matters if you are also considering alternative haplotypes and GRC fix patches.
                            Last edited by laura; 07-03-2012, 01:18 PM.


                            • #15
                              Originally posted by laura
                              I am no expert in UCSC, so I am just going by what the browser is called. If your other source is the UCSC help pages or something similar, I would trust it.

                              I will point out, though, that the primary assembly (i.e. the main chromosomes 1-22, chrX and chrY) has not changed since GRCh37 was released in 2009. So if you are just using the main chromosomes without any repeat masking, it doesn't matter what the pN number is; the chromosomes are identical.

                              The version only matters if you are also considering alternative haplotypes and GRC fix patches.


                              This was my source..

                              But I know you are correct now!

                              Just one more question. The statistics page for GRCh37.p7 says: "Number of FIX Patch scaffolds aligned to the Primary Assembly: 60". What does 60 mean here?
