  • GTF file with gene name attribute for Cuffcompare

    Sorry if this question has already been asked, but to get a good annotation with Cuffcompare I need a GTF file with the reference gene symbol name, such as the "Myog" example given in the Cufflinks manual. I can generate a GTF file using the UCSC Table Browser, but all genes and transcripts are named in UCSC format, e.g. "uc007cr1", which is not a helpful annotation for a biologist.

    Is there an easy way to get a GTF file where the reference gene name is the gene symbol? I am working with hg19 and mm9.

  • #2
    In the manual of Cufflinks:

    "Cuffcompare Input

    Cuffcompare takes Cufflinks' GTF output as input, and optionally can take a "reference" annotation (such as from Ensembl)"

    Just follow the Ensembl link and you will get a GTF file for each species. Hope this helps.

    • #3
      Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl code, e.g. ENSG00000122180.

      • #4
        Originally posted by ChrisL
        Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl code, e.g. ENSG00000122180.
        You will have to convert the Ensembl IDs to the corresponding gene symbols. Check out BioMart.

        There you can select the Ensembl gene ID and the gene symbol and download a file that will help you translate between them. This will require some programming.
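The "some programming" step could look roughly like the sketch below. It assumes the BioMart export is a two-column, tab-separated table (Ensembl gene ID, then gene symbol) and that the GTF has the usual `gene_id "..."` attribute. The function names `load_symbol_table` and `add_gene_name` are made up for this example, and the ID/symbol pair in the usage note is only illustrative.

```python
import csv
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def load_symbol_table(rows):
    """Build {ensembl_gene_id: symbol} from tab-separated text lines.

    `rows` can be an open file or any iterable of strings.
    Rows with a missing ID or symbol are skipped.
    """
    table = {}
    for row in csv.reader(rows, delimiter="\t"):
        if len(row) >= 2 and row[0] and row[1]:
            table[row[0]] = row[1]
    return table


def add_gene_name(gtf_line, table):
    """Append gene_name "<symbol>"; to a GTF line when the ID is in the table.

    Comment lines, lines that already carry gene_name, and lines with an
    unknown gene_id are returned unchanged (minus the trailing newline).
    """
    line = gtf_line.rstrip("\n")
    if line.startswith("#") or "gene_name " in line:
        return line
    match = GENE_ID_RE.search(line)
    if match is None or match.group(1) not in table:
        return line
    separator = "" if line.endswith(";") else ";"
    return f'{line}{separator} gene_name "{table[match.group(1)]}";'
```

With a table built from a line such as `ENSG00000122180<TAB>MYOG`, each GTF line whose `gene_id` is in the table gets a `gene_name` attribute appended, and everything else passes through untouched.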

        • #5
          I use MGI Batch Query to convert Ensembl IDs to gene names:
          MGI: the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data for researching human health and disease.

          • #6
            GTF File for Cufflinks

            Have you tried to download the RefSeq refFlat file in GTF format from the UCSC table browser? That might also work (and be a lot easier).

            • #7
              Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" is "NC_000001.10" in RefSeq.

              If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.
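A minimal sketch of such a script, assuming a tab-delimited GTF and a lookup table already loaded into a Python dict: it pulls the `gene_id` out of the attribute column (the "separate column" that `join` would need) and then swaps in the symbol. The helper names `gene_id_column` and `replace_gene_id` are hypothetical.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_id_column(gtf_lines):
    """Yield (gene_id, line) pairs for every feature line that has a gene_id.

    Blank lines, comment lines, and lines with fewer than nine
    tab-separated fields are skipped.
    """
    for raw in gtf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            yield match.group(1), line


def replace_gene_id(gtf_lines, lookup):
    """Yield feature lines with gene_id replaced by its symbol.

    IDs missing from `lookup` keep their original value.
    """
    for gene_id, line in gene_id_column(gtf_lines):
        symbol = lookup.get(gene_id, gene_id)
        yield line.replace(f'gene_id "{gene_id}"', f'gene_id "{symbol}"', 1)
```

Note that this version drops comment lines and overwrites the original ID. Keeping the ID and adding the symbol as a separate attribute may be safer if downstream tools rely on unique gene IDs.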

              • #8
                Originally posted by ChrisL
                Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" is "NC_000001.10" in RefSeq.

                If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.
                Did you use the refGene table or the refFlat table?

                • #9
                  Brilliant! That worked.

                  Thanks RockChalkJayhawk.

                  Chris

                  • #10
                    @RockChalkJayhawk or ChrisL,
                    Can one of you elaborate on that workflow?

                    • #11
                      Originally posted by genbio64
                      @RockChalkJayhawk or ChrisL,
                      Can one of you elaborate on that workflow?
                      In the UCSC Table Browser, choose RefSeq Genes for the track, then the refFlat table.

                      • #12
                        Hi everyone,
                        I've been searching for an answer to my question but could not find one, and this post comes close to my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
                        Could someone please explain what I am doing wrong?
                        Thank you very much,
                        Filippos

                        • #13
                          Originally posted by filippos
                          Hi everyone,
                          I've been searching for an answer to my question but could not find one, and this post comes close to my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
                          Could someone please explain what I am doing wrong?
                          Thank you very much,
                          Filippos
                          You may want other people to verify anything I say, but this is what I think.

                          Make sure you add "chr" to column 1 of your Ensembl reference. Then use that reference to make your combined GTF in Cuffcompare.
                          Code:
                          cuffcompare -r ensembl.gtf ensembl.gtf ensembl.gtf
                          Then run Cufflinks with the resulting stdout.combined.gtf.
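For the "add chr to column 1" step, here is a rough Python sketch. It assumes a tab-delimited GTF. The function name `add_chr_prefix` and its optional `names` argument are invented for this example. Whether a given sequence name should get the prefix at all depends on what the aligned reads use, so check that first.

```python
def add_chr_prefix(gtf_lines, names=None):
    """Yield GTF lines with "chr" prepended to the first column.

    If `names` is given, only sequence names in that collection are
    prefixed; all others (scaffolds, for example) are left as they are.
    Names that already start with "chr", blank lines, and comment lines
    are passed through unchanged.
    """
    for raw in gtf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        # Split off only the first column; the rest of the line is untouched.
        seqname, tab, rest = line.partition("\t")
        if not seqname.startswith("chr") and (names is None or seqname in names):
            seqname = "chr" + seqname
        yield seqname + tab + rest
```

Typical use would be to open the Ensembl GTF, pass the file object to `add_chr_prefix`, and write each yielded line plus a newline to a new file.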

                          • #14
                            Thank you jbrwn for your answer.
                            The first lines of the ensembl GTF that I'm using are:

                            NT_166433 protein_coding exon 11955 12166 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding CDS 12026 12166 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
                            NT_166433 protein_coding start_codon 12026 12028 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding exon 16677 16841 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding CDS 16677 16841 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
                            NT_166433 protein_coding exon 17745 17814 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "3"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";

                            At some point (around line 100) the thing changes to:

                            18 protein_coding exon 3122455 3123465 . - . gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding CDS 3122495 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201"; protein_id "ENSMUSP00000129804";
                            18 protein_coding start_codon 3123410 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding stop_codon 3122492 3122494 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding exon 3327492 3327589 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
                            18 protein_coding CDS 3327492 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020"; protein_id "ENSMUSP00000118267";
                            18 protein_coding start_codon 3327533 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
                            18 protein_coding exon 3325359 3325476 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "2"; gene_name "Crem"; transcript_name "Crem-020";

                            The file came from the UCSC Table browser.
                            I guess that I should add the "chr" before the "18" in the above lines and probably delete the first 100 lines? The first time I tried to use this file, TopHat didn't let me because it had some kind of duplicate entries. Is it possible that the first lines are problematic? Is there an easy way to add the "chr" in all the lines? I am really new to all this.
                            Thanks again for the quick reply, and excuse me for asking such obvious questions.
                            Filippos

                            • #15
                              Oh, I should have specified that my instructions applied to human, as I'm not familiar with anything associated with other organisms. My aligned reads come out of TopHat as "chr1" and "chrX", which is why I treated my Ensembl reference the way I did in my previous reply. I don't know what you'll need to do with NT_166433 or MT. Take a look at your reads or wait till someone comes along who's worked with mice.
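One way to "take a look" is to list the sequence names used in the GTF and compare them with the names the reads were aligned to. The sketch below assumes a tab-delimited GTF and that you already have the reference names as a collection of strings (for instance, copied from the header of the alignment file). Both function names are hypothetical.

```python
from collections import Counter


def seqname_counts(gtf_lines):
    """Count feature lines per sequence name (first GTF column)."""
    counts = Counter()
    for raw in gtf_lines:
        if raw.startswith("#") or not raw.strip():
            continue  # skip comments and blank lines
        counts[raw.split("\t", 1)[0]] += 1
    return counts


def unmatched_seqnames(gtf_lines, reference_names):
    """Return the sorted GTF sequence names absent from reference_names."""
    return sorted(set(seqname_counts(gtf_lines)) - set(reference_names))
```

Any name this reports (for example "18" when the reads use "chr18", or a scaffold such as NT_166433) is a candidate for renaming or dropping before the GTF is used as a reference.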
