  • Ensembl/NCBI/UCSC mouse gene annotations for cufflinks

    Dear seqA community,

    I'm assembling transcripts on the mouse reference annotations (*.gtf files) provided by Ensembl, NCBI and UCSC. Ideally, I would like to use Ensembl, because they annotate genes as protein-coding, non-coding, pseudogenes, etc. But I have a problem with Ensembl: some important transcripts are not in the database, for example Kcnq1ot1 and Ipw.

    Question #1: Why is that? Should I expect these and other similar genes to be included in a future version of Ensembl?

    Both Refseq and UCSC have entries for these genes, but they lack the convenient categorization provided by Ensembl (protein-coding, non-coding, pseudogenes, etc.).

    Question #2: I have been unable to find an equivalent categorization file matching UCSC or NCBI identifiers. Can someone point me in the right direction?

    Thank you for any advice you can give!
    Last edited by sp144; 11-30-2013, 05:46 PM.

  • #2
    I would try the new Gencode annotation file for mouse. It should be the most comprehensive annotation out there.


    Sorry, misunderstood.
    Not sure why it's not in there.
    Last edited by jeppepeppe; 12-01-2013, 04:12 PM.



    • #3
      Thank you jeppepeppe,

      that was a good suggestion, but I went and checked and both examples are indeed missing. From what I can tell right now, my only options are to:
      a.) exclude genes not listed in ensembl
      b.) use UCSC annotation and ID converter tools to retrieve ensembl annotation matching UCSC IDs. But in this process I'll lose ~ 10% of my data, so not ideal:

      The document shows gene identifiers from different databases that correspond to the same genes. It lists Ensembl gene IDs (prefixed with ENSG) and gene symbols for many genes. Several databases and tools are able to convert between the different gene identifiers.
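Option b.) can be made concrete with a small sketch that attaches Ensembl biotypes to UCSC IDs through a conversion table and reports how many IDs drop out along the way. This is only an illustration: the function name, the two lookup tables and all identifiers below are hypothetical placeholders, not real IDs or any converter tool's actual output.

```python
def annotate_with_biotype(ucsc_ids, ucsc_to_ensembl, ensembl_biotype):
    """Attach an Ensembl biotype to each UCSC ID via a conversion table.

    Returns (annotated, unmatched, lost_fraction). An ID counts as lost
    when it has no Ensembl counterpart or no biotype is known for it.
    """
    annotated = {}
    unmatched = []
    for uid in ucsc_ids:
        ens = ucsc_to_ensembl.get(uid)
        biotype = ensembl_biotype.get(ens) if ens is not None else None
        if biotype is None:
            unmatched.append(uid)
        else:
            annotated[uid] = biotype
    lost_fraction = len(unmatched) / len(ucsc_ids) if ucsc_ids else 0.0
    return annotated, unmatched, lost_fraction


# Hypothetical placeholder identifiers, for illustration only.
ucsc_ids = ["uc_A", "uc_B", "uc_C", "uc_D"]
ucsc_to_ensembl = {"uc_A": "ENS_1", "uc_B": "ENS_2", "uc_C": "ENS_3"}
ensembl_biotype = {
    "ENS_1": "protein_coding",
    "ENS_2": "lincRNA",
    "ENS_3": "pseudogene",
}

annotated, unmatched, lost_fraction = annotate_with_biotype(
    ucsc_ids, ucsc_to_ensembl, ensembl_biotype
)
print(annotated)
print(unmatched, lost_fraction)
```

Keeping the unmatched IDs as a list, rather than only counting them, makes it easy to check which genes (such as those discussed in this thread) fall through the conversion.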



      • #4
        Hi,

        It's good to hear the biotype categorizations are useful to you.

        Ensembl will have a more updated mouse gene set than what's on the GENCODE page, as the GENCODE set has been taken from a previous release of Ensembl. (GENCODE is using Ensembl genes- i.e. the merged set between Ensembl automatic annotation and Vega/Havana manual annotation).

        We will update the mouse genes in the next release (e74), due out this week. This will include updated Vega/Havana manual annotation. I have checked the first gene you mention (KCNQ1OT1) on our test site, and it will be present in the next release.

        I hope that helps.



        • #5
          Thank you, Giulietta!
          That is indeed very helpful news and very lucky for me! My data is aligned to mouse mm9 (build 37) however. Will the e74 annotation only be available for mm10/NCBI38 coordinates? Will it be possible to perform a simple liftover back to mm9 coordinates?

          I could of course re-align to mm10 (build 38), but for reasons relating to my custom-built pipeline, I'd prefer to stay in mm9 (build 37) if at all possible.
          Thank you and Best wishes!

          PS. on a related note I'm a bit unclear as to why the transcript biotypes and gene biotypes differ - is it because some transcripts of protein-coding genes are not translated, etc?
          Last edited by sp144; 12-02-2013, 05:46 PM.



          • #6
            Hi sp144,

            We don't update old assemblies with the new annotation, so for NCBIm37 you will only see the release 67 annotation from May 2012, as that was the last release with the old assembly.

            Gene and transcript biotypes differ because a gene will have multiple transcripts, which will each have their own biotypes. For example, this gene has some coding and some non-coding transcripts.
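The gene-versus-transcript distinction can be checked directly in a GTF file. The sketch below is a minimal illustration, assuming the attribute keys `gene_biotype` and `transcript_biotype` are present in the ninth column (GENCODE files use `gene_type` and `transcript_type` instead, and older Ensembl files may differ). The function names and the sample lines with their IDs are made up for the example.

```python
import re

# Matches key "value" pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def transcript_biotypes_by_gene(gtf_lines):
    """Map gene_id -> (gene_biotype, set of transcript biotypes)."""
    result = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        entry = result.setdefault(
            attrs["gene_id"], (attrs.get("gene_biotype"), set())
        )
        entry[1].add(attrs.get("transcript_biotype"))
    return result


# Made-up records: one gene with a coding and a non-coding transcript.
sample = [
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; '
    'transcript_id "TX_A1"; gene_biotype "protein_coding"; '
    'transcript_biotype "protein_coding";',
    '1\tsrc\ttranscript\t100\t700\t.\t+\t.\tgene_id "GENE_A"; '
    'transcript_id "TX_A2"; gene_biotype "protein_coding"; '
    'transcript_biotype "retained_intron";',
]

biotypes = transcript_biotypes_by_gene(sample)
for gene_id, (gene_bt, tx_bts) in biotypes.items():
    print(gene_id, gene_bt, sorted(tx_bts))
```

A gene whose set of transcript biotypes has more than one member is exactly the case described above: one gene-level label, several transcript-level labels.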

            Emily



            • #7
              Hello sp144,

              To add to Emily's message, yes you can lift over coordinates of the new annotation to the older assembly. Ensembl provides an assembly converter tool for this:



              By the way, if you have a list of genes which are not in the most current Ensembl database, we'd like you to send those along to Vega/Havana- they manually annotate genes which we then merge into our geneset generated by automatic annotation. The contact email is in the link:

              Wellcome Sanger Institute tools directory


              Best wishes,
              Giulietta
              Last edited by Giulietta EnsemblHelpdesk; 12-03-2013, 03:04 AM. Reason: forgot link



              • #8
                Thank you Giulietta and Emily,

                I took a look at the new assembly, but sadly Ipw is not annotated at all and Kcnq1ot1 is incorrectly annotated as being on the forward strand and consisting of 5 exons. It's actually on the reverse strand and consists of a single exon. I also don't understand why Kcnq1ot1 is capitalized in the gtf.

                I'm surprised given that these genes have long been in Refseq and UCSC. I'll contact the Vega/Havana people - but I'm guessing these won't be updated until the next ensembl release. When do you think it will come out? Thank you!



                • #9
                  I find Kcnq1ot1 in mouse on the forward strand, consisting of a single exon:



                  Are we looking at the same gene?



                  • #10
                    I find four of them, KCNQ1OT1_1, KCNQ1OT1_2, KCNQ1OT1_3 and KCNQ1OT1_5, all neighbours on the forward strand.



                    • #11
                      Yes, thank you Emily, in the gtf there are 4 entries, neighbors on the forward strand. But in Refseq and UCSC there is a single 1-exon transcript on the reverse strand, hence the name: Kcnq1 "opposite transcript" 1 = Kcnq1ot1.
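A comparison like this (strand and exon count per transcript) can be scripted so that two annotation files are summarized the same way. The sketch below is a hedged illustration: it assumes both files carry `gene_name` and `transcript_id` attributes on their exon lines, matches names case-insensitively by prefix so that entries such as KCNQ1OT1_1 are picked up, and uses made-up coordinates and IDs rather than the real records. The helper names are hypothetical.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def summarize_gene(gtf_lines, name_prefix):
    """Return {transcript_id: (strand, exon_count)} for matching genes.

    gene_name is compared case-insensitively and by prefix, which can
    also pull in unrelated genes that share the prefix.
    """
    wanted = name_prefix.lower()
    strands = {}
    exon_counts = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if not attrs.get("gene_name", "").lower().startswith(wanted):
            continue
        tx = attrs["transcript_id"]
        strands[tx] = cols[6]
        exon_counts[tx] += 1
    return {tx: (strands[tx], exon_counts[tx]) for tx in strands}


def exon_line(chrom, start, end, strand, gene_name, tx_id):
    """Build one illustrative GTF exon record (coordinates are made up)."""
    attrs = f'gene_id "{tx_id}"; transcript_id "{tx_id}"; gene_name "{gene_name}";'
    return "\t".join([chrom, "src", "exon", str(start), str(end), ".", strand, ".", attrs])


# Made-up stand-ins for the two situations described in the thread.
ensembl_like = [
    exon_line("7", 100, 200, "+", "KCNQ1OT1_1", "T1"),
    exon_line("7", 300, 400, "+", "KCNQ1OT1_2", "T2"),
    exon_line("7", 500, 600, "+", "KCNQ1OT1_3", "T3"),
    exon_line("7", 700, 800, "+", "KCNQ1OT1_5", "T5"),
]
refseq_like = [exon_line("chr7", 100, 900, "-", "Kcnq1ot1", "NR_X")]

ensembl_summary = summarize_gene(ensembl_like, "Kcnq1ot1")
refseq_summary = summarize_gene(refseq_like, "Kcnq1ot1")
print(ensembl_summary)
print(refseq_summary)
```

Prefix matching is a deliberate trade-off here: exact matching would miss the suffixed names, while a prefix can over-match, so the output is worth checking by eye.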

                      I emailed the VEGA group, but no response yet.
                      Thank you.



                      • #12
                        Hi

                        I'm from the HAVANA group at Sanger and although I haven't yet received your email via Vega, I was alerted to this thread via the Ensembl team.

                        Neither Ipw nor Kcnq1ot1 had been manually annotated, but this is not entirely surprising as we have only just started genome-wide manual annotation of non-coding loci in mouse.

                        I have had the annotation for these loci updated, and both will appear in future releases of GENCODE/Ensembl. Just to clarify, the GENCODE and Ensembl genesets are identical and released in sync: for human and mouse, Ensembl displays the GENCODE geneset, which is created by merging manual gene annotation with Ensembl gene predictions. This is well established for human. The Ensembl geneset for mouse has been created in the same way for several years, but the separate release of GENCODE gene annotation for mouse is more limited (GENCODE M1 = Ensembl 65 and GENCODE M2 = Ensembl 74). Updates to annotation can take some time to appear in new releases of GENCODE/Ensembl; however, updated manual annotation (which will be included in future releases) can be seen via the Vega browser. Click through 'Configure this page' and then click on the 'Havana update' box in the Genes and transcripts section. This track is updated approximately fortnightly.

                        I hope this is useful
                        Last edited by afrankish; 12-09-2013, 09:50 AM.



                        • #13
                          Thank you, afrankish; I'm mostly looking for a gtf annotation that includes the very useful Ensembl biotype categories yet captures RefSeq and UCSC gene entries missing from Ensembl. I'm sure updating these transcript annotations is challenging, as they represent a moving target with increasing sequencing depth. I just wish there was a mechanism to "fast-track" entries from other major databases for annotation. Both of these genes have been in RefSeq and UCSC for quite some time.

                          I will keep an eye out for the next ensembl release. Thank you to everyone for contributing to this post - it was my first on seqanswers and I'm impressed that you VEGA and Ensembl folks responded so quickly. Thank you!



                          • #14
                            Hi sp144,

                            Just to clarify, in the Ensembl pipeline Kcnq1ot1 has been annotated from RFAM, which has four separate entries:

                            RF01946 KCNQ1OT1_1 KCNQ1 overlapping transcript 1 conserved region 1
                            RF01947 KCNQ1OT1_2 KCNQ1 overlapping transcript 1 conserved region 2
                            RF01948 KCNQ1OT1_3 KCNQ1 overlapping transcript 1 conserved region 3
                            RF01950 KCNQ1OT1_5 KCNQ1 overlapping transcript 1 conserved region 5

                            The Ensembl pipeline's strength is very much on coding sequences, and we prefer to receive annotation on ncRNAs from Havana (who manually annotate the genome). As afrankish points out, we merge the Havana manual annotation into the transcript set from the Ensembl automatic pipelines to create the GENCODE set.

                            We hope to have this annotation for you in the future.
