  • GTF file with gene name attribute for Cuffcompare

    Sorry if this question has already been asked, but to get a good annotation with Cuffcompare I need a GTF file with the reference gene symbol as the gene name, such as the "Myog" example given in the Cufflinks manual. I can generate a GTF file using the UCSC Table Browser, but all genes and transcripts are named in UCSC format, e.g. "uc007cr1", which is not a helpful annotation for a biologist.

    Is there an easy way to get a GTF file where the reference gene name is the gene symbol? I am working with hg19 and mm9.

  • #2
    In the manual of Cufflinks:

    "Cuffcompare Input

    Cuffcompare takes Cufflinks' GTF output as input, and optionally can take a "reference" annotation (such as from Ensembl)"

    Just follow the Ensembl link and you can download a GTF file for each species. Hope this helps.


    • #3
      Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl ID, e.g. ENSG00000122180.


      • #4
        Originally posted by ChrisL:
        Yes, I have looked at Ensembl GTF files and they also lack the HGNC gene symbol attribute. Genes are identified by their Ensembl ID, e.g. ENSG00000122180.

        You will have to convert the Ensembl IDs to the corresponding gene symbols. Check out BioMart.

        There you can select the Ensembl gene ID and the gene symbol as attributes and export a file that will help you translate. This will require some programming.
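The translation step described above could be sketched in Python roughly as follows. This is only an illustration: it assumes a two-column, tab-separated export (Ensembl gene ID, then gene symbol), and the function names `load_lookup` and `add_gene_name` are made up for this example rather than taken from any tool.

```python
import re

# Matches the gene_id attribute in column 9 of a GTF record.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)";')


def load_lookup(lines):
    """Build {ensembl_gene_id: symbol} from tab-separated lines.

    Assumed layout: column 1 = Ensembl gene ID, column 2 = gene symbol.
    Rows with an empty symbol are skipped.
    """
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            lookup[fields[0]] = fields[1]
    return lookup


def add_gene_name(gtf_line, lookup):
    """Append a gene_name attribute when the gene_id has a known symbol.

    Comment lines, lines that already carry gene_name, and lines whose
    gene_id is not in the lookup are returned unchanged.
    """
    if gtf_line.startswith("#") or "gene_name" in gtf_line:
        return gtf_line
    match = GENE_ID_RE.search(gtf_line)
    if not match or match.group(1) not in lookup:
        return gtf_line
    body = gtf_line.rstrip("\n").rstrip()
    if not body.endswith(";"):
        body += ";"
    return f'{body} gene_name "{lookup[match.group(1)]}";\n'
```

Applied line by line to a GTF file, this leaves the original gene_id in place and only adds the symbol, so the Ensembl IDs remain available for later cross-referencing.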


        • #5
          I use the MGI Batch Query to convert Ensembl IDs to gene names:
          MGI: the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data for researching human health and disease.


          • #6
            GTF File for Cufflinks

            Have you tried to download the RefSeq refFlat file in GTF format from the UCSC table browser? That might also work (and be a lot easier).


            • #7
              Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" appears only under a RefSeq accession such as "NM_002479".

              If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.
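The obstacle mentioned here is that the gene ID sits inside the free-text attributes field (column 9) rather than in a column of its own. A minimal Python sketch of pulling it out into a leading tab-delimited column, so that tools such as sort and join can key on it, might look like this; the function name `gene_id_column` is hypothetical.

```python
import re

# Matches the gene_id attribute inside the GTF attributes field.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_id_column(gtf_lines):
    """Yield 'gene_id<TAB>original record' for each well-formed GTF record.

    Comment lines, blank lines, records without exactly nine tab-delimited
    fields, and records without a gene_id attribute are skipped.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        record = line.rstrip("\n")
        fields = record.split("\t")
        if len(fields) != 9:
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            yield match.group(1) + "\t" + record
```

The output keeps the whole original record after the new key column, so the key can be dropped again once the lookup has been joined on.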


              • #8
                Originally posted by ChrisL:
                Yes, I used UCSC to generate a GTF file based on RefSeq, but the RefSeq annotation is really no better. For example, the human gene "MYOG" appears only under a RefSeq accession such as "NM_002479".

                If the GTF file had the gene id in a separate delimited column it would be easy to replace with the HGNC gene symbol using the UNIX join command and a lookup table. Luckily I have access to programmers as it looks like a job for a script.

                Did you use the refGene table or the refFlat table?


                • #9
                  Brilliant! That worked.

                  Thanks RockChalkJayhawk.

                  Chris


                  • #10
                    @RockChalkJayhawk or ChrisL,
                    Can one of you elaborate on that workflow?


                    • #11
                      Originally posted by genbio64:
                      @RockChalkJayhawk or ChrisL,
                      Can one of you elaborate on that workflow?

                      In the UCSC Table Browser, choose "RefSeq Genes" as the track, then "refFlat" as the table.


                      • #12
                        Hi everyone,
                        I've been searching for an answer to my question but could not find one. Then I saw this thread, which touches on my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
                        Could someone please explain what I am doing wrong?
                        Thank you very much,
                        Filippos


                        • #13
                          Originally posted by filippos:
                          Hi everyone,
                          I've been searching for an answer to my question but could not find one. Then I saw this thread, which touches on my problem. I downloaded one GTF file with the Ensembl annotation and the one you propose here. I used the same GTF in the TopHat, Cufflinks and Cuffcompare steps, but the final output from Cuffdiff does not contain either of the two annotations. I thought that I had to do another step to match the statistical analysis with the annotation, but I cannot find what that step is. As they are now, the data mean nothing unless I manually match the Cufflinks names with the Ensembl ones.
                          Could someone please explain what I am doing wrong?
                          Thank you very much,
                          Filippos

                          you may want other people to verify anything i say, but this is what i think.

                          make sure you add "chr" to column 1 of your Ensembl reference. then use that reference to make your combined gtf in cuffcompare.
                          Code:
                          cuffcompare -r ensembl.gtf ensembl.gtf ensembl.gtf
                          then run cufflinks with the resulting stdout.combined.gtf


                          • #14
                            Thank you jbrwn for your answer.
                            The first lines of the ensembl GTF that I'm using are:

                            NT_166433 protein_coding exon 11955 12166 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding CDS 12026 12166 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
                            NT_166433 protein_coding start_codon 12026 12028 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "1"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding exon 16677 16841 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";
                            NT_166433 protein_coding CDS 16677 16841 . + 0 gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "2"; gene_name "AC007307.1"; transcript_name "AC007307.1-201"; protein_id "ENSMUSP00000100851";
                            NT_166433 protein_coding exon 17745 17814 . + . gene_id "ENSMUSG00000000702"; transcript_id "ENSMUST00000105216"; exon_number "3"; gene_name "AC007307.1"; transcript_name "AC007307.1-201";

                            At some point (around line 100) the records change to:

                            18 protein_coding exon 3122455 3123465 . - . gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding CDS 3122495 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201"; protein_id "ENSMUSP00000129804";
                            18 protein_coding start_codon 3123410 3123412 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding stop_codon 3122492 3122494 . - 0 gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "AC125218.1"; transcript_name "AC125218.1-201";
                            18 protein_coding exon 3327492 3327589 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
                            18 protein_coding CDS 3327492 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020"; protein_id "ENSMUSP00000118267";
                            18 protein_coding start_codon 3327533 3327535 . - 0 gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "1"; gene_name "Crem"; transcript_name "Crem-020";
                            18 protein_coding exon 3325359 3325476 . - . gene_id "ENSMUSG00000063889"; transcript_id "ENSMUST00000151311"; exon_number "2"; gene_name "Crem"; transcript_name "Crem-020";

                            The file came from the UCSC Table browser.
                            I guess that I should add "chr" before the "18" in the lines above and probably delete the first 100 lines? The first time I tried to use this file, TopHat rejected it because of some kind of duplicate entries. Is it possible that the first lines are the problem? Is there an easy way to add "chr" to all the lines? I am really new to all this.
                            Thanks again for the quick reply, and excuse me for asking such obvious questions.
                            Filippos
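One way to add the prefix to every line could be sketched in Python as below. This is a rough illustration under stated assumptions: the GTF is tab-delimited (as real GTF files are), the target naming is UCSC-style ("chr18", "chrX"), and the mitochondrial sequence "MT" is renamed to "chrM". Whether scaffolds such as NT_166433 should be dropped depends on which sequences the genome index contains, so that behaviour is left as an option. The function names are made up for this example.

```python
def to_ucsc_name(seqname):
    """Map an Ensembl-style sequence name to a UCSC-style one, or None if unsure."""
    if seqname.startswith("chr"):
        return seqname
    if seqname == "MT":
        return "chrM"  # assumption: the genome index calls the mitochondrion chrM
    if seqname.isdigit() or seqname in ("X", "Y"):
        return "chr" + seqname
    return None  # e.g. unplaced scaffolds such as NT_166433


def add_chr_prefix(gtf_lines, keep_unmapped=False):
    """Yield GTF lines with column 1 renamed; comment lines pass through.

    Records whose sequence name cannot be mapped are dropped unless
    keep_unmapped is True, in which case they are passed through unchanged.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        seqname, sep, rest = line.partition("\t")
        new_name = to_ucsc_name(seqname)
        if new_name is None:
            if keep_unmapped:
                yield line
            continue
        yield new_name + sep + rest
```

Dropping the unmapped records by default also removes the leading NT_ scaffold block, which is the "delete the first 100 lines" idea expressed by name rather than by line count.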


                            • #15
                              oh, i should have specified that my instructions applied to human, as i'm not familiar with anything associated with other organisms. my aligned reads come out of tophat as "chr1" and "chrX", which is why i treated my Ensembl reference the way i did in my previous reply. i don't know what you'll need to do with NT_166433 or MT. take a look at your reads, or wait till someone comes along who's worked with mice.
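Checking the reads against the annotation, as suggested here, comes down to comparing two sets of sequence names. A minimal Python sketch, assuming the alignment header is available as text (for example from `samtools view -H`) and that the GTF is tab-delimited; the function names are hypothetical:

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct values of column 1 from GTF records."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def sam_header_seqnames(sam_header_lines):
    """Collect the SN: values from @SQ lines of a SAM header."""
    names = set()
    for line in sam_header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def unmatched_names(gtf_lines, sam_header_lines):
    """Return GTF sequence names that never occur in the alignment header."""
    return sorted(gtf_seqnames(gtf_lines) - sam_header_seqnames(sam_header_lines))
```

Any name this reports (for instance "18" when the reads are aligned to "chr18") marks annotation records that cannot overlap the alignments until the names are reconciled.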
