  • sdarko
    Member
    • Apr 2009
    • 52

    Best source for GTF file for use with TopHat/Cufflinks

    I've been grabbing the "refSeq genes" table (human, hg19) from UCSC in GTF file format for use with TopHat/Cufflinks.

    I'm curious what everyone else is using, and whether there is a better GTF file I could use.
  • gringer
    David Eccles (gringer)
    • May 2011
    • 845

    #2
    That's what I've been using, because I can understand how to use the UCSC table browser to output a GTF file (just change the output format). Just be mindful that it updates fairly frequently, and you might discover new annotations by looking again at something you mapped a few months ago.
    Last edited by gringer; 07-13-2011, 05:07 AM. Reason: got the URL wrong

    Comment

    • gavin.oliver
      Senior Member
      • Jan 2010
      • 110

      #3
      Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

      Comment

      • sdarko
        Member
        • Apr 2009
        • 52

        #4
        Originally posted by gavin.oliver View Post
        Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
        Thanks for the advice. I will try it out today.

        Comment

        • sdarko
          Member
          • Apr 2009
          • 52

          #5
          Originally posted by gavin.oliver View Post
          Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.
          Did you get your genome and gtf from the ensembl website?

          Other than renaming the gtf and fasta header entries so that they matched, was there anything else that you had to do to make the gtf file work with tophat? I'm getting an error that my gtf doesn't contain junctions.

          Comment

          • gavin.oliver
            Senior Member
            • Jan 2010
            • 110

            #6
            I didn't have to do anything else, no.

            What command are you using to execute Tophat?

            Comment

            • sdarko
              Member
              • Apr 2009
              • 52

              #7
              Originally posted by gavin.oliver View Post
              I didn't have to do anything else, no.

              What command are you using to execute Tophat?
              I want to make sure that I'm not messing up anything too basic first.

              I grabbed the reference genome in fasta format from here --> ftp://ftp.ensembl.org/pub/release-63...o_sapiens/dna/

              I grabbed the associated GTF file from here --> ftp://ftp.ensembl.org/pub/release-63/gtf/homo_sapiens/

              I then process them so that the entry names match in both files.

              Here are a few lines from my GTF from ensembl:
              Code:
              chr18           protein_coding  exon    49501   49557   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49501   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  start_codon     49555   49557   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "1"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  exon    49129   49237   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     49129   49237   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "2"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    48940   49050   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     48940   49050   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "3"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  exon    47390   48447   .       -       .        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           protein_coding  CDS     47393   48447   .       -       2        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001"; protein_id "ENSP00000309431";
              chr18           protein_coding  stop_codon      47390   47392   .       -       0        gene_id "ENSG00000173213"; transcript_id "ENST00000308911"; exon_number "4"; gene_name "RP11-683L23.1"; transcript_name "RP11-683L23.1-001";
              chr18           miRNA   exon    48162   48272   .       +       .        gene_id "ENSG00000221441"; transcript_id "ENST00000408514"; exon_number "1"; gene_name "AP001005.1"; transcript_name "AP001005.1-201";
              chr18           protein_coding  exon    158483  158714  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     158699  158714  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  start_codon     158699  158701  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "1"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  exon    163308  163453  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     163308  163453  .       +       2        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "2"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              chr18           protein_coding  exon    166787  166819  .       +       .        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201";
              chr18           protein_coding  CDS     166787  166819  .       +       0        gene_id "ENSG00000101557"; transcript_id "ENST00000261601"; exon_number "3"; gene_name "USP14"; transcript_name "USP14-201"; protein_id "ENSP00000261601";
              Here are the names of my chromosomes according to bowtie-inspect:
              Code:
              chr1
              chr2
              chr3
              chr4
              chr5
              chr6
              chr7
              chr8
              chr9
              chr10
              chr11
              chr12
              chr13
              chr14
              chr15
              chr16
              chr17
              chr18
              chr19
              chr20
              chr21
              chr22
              chrX
              chrY
              chrM
              In TopHat, I get the following warning:
              Code:
              [Mon Jul 18 15:15:21 2011] Reading known junctions from GTF file
                      Warning: TopHat did not find any junctions in GTF file
              In Cufflinks, I get the following error:
              Code:
              [08:34:37] Loading reference annotation.
              Error: duplicate GFF ID 'ENST00000445581' encountered!

              Comment

              • gavin.oliver
                Senior Member
                • Jan 2010
                • 110

                #8
                The only difference I can see between my setup and yours is that I removed the 'chr' prefixes. There was a reason for this - but I can't remember what it was!

                Comment

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  Well, if chromosome 22 is anything to go by, it looks like the chromosome labels in the fasta file don't include the 'chr' bit.
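A quick way to test the observation above is to compare the sequence names in the FASTA headers with the first column of the GTF. The following is only a minimal sketch: the helper names (`fasta_names`, `gtf_names`, `compare_names`) and the toy data are made up for illustration and are not part of TopHat, Cufflinks, or Bowtie.

```python
import io

def fasta_names(handle):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gtf_names(handle):
    """Return the set of values found in the first (seqname) column of a GTF."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def compare_names(fasta_handle, gtf_handle):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    fa = fasta_names(fasta_handle)
    gtf = gtf_names(gtf_handle)
    return sorted(gtf - fa), sorted(fa - gtf)

# Toy data (hypothetical): FASTA headers without 'chr', GTF with 'chr'.
fasta = io.StringIO(">18 dna:chromosome\nACGT\n>22 dna:chromosome\nACGT\n")
gtf = io.StringIO('chr18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "ENSG00000173213";\n')

only_gtf, only_fasta = compare_names(fasta, gtf)
print("only in GTF:  ", only_gtf)    # ['chr18']
print("only in FASTA:", only_fasta)  # ['18', '22']
```

For real files, pass handles from `open(...)` instead of the `io.StringIO` objects. If both lists come back empty, the names agree and the problem lies elsewhere.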

                  Comment

                  • sdarko
                    Member
                    • Apr 2009
                    • 52

                    #10
                    Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                    I fixed it and all is well.
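An extra tab like the one described above can be caught before running TopHat by checking that every GTF line has exactly nine tab-separated fields and that none of the first eight is empty. This is a minimal sketch; `check_gtf_line` is a hypothetical helper, not something TopHat or Cufflinks provides, and it does not validate anything beyond the column layout.

```python
def check_gtf_line(line):
    """Return a list of layout problems found in one GTF line (empty if none)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        problems.append(f"expected 9 tab-separated fields, found {len(fields)}")
    # The first eight columns must be filled in; '.' is the usual placeholder.
    for i, value in enumerate(fields[:8], start=1):
        if value.strip() == "":
            problems.append(f"column {i} is empty")
    return problems

# A line with an extra tab after the chromosome name (made-up example).
bad = 'chr18\t\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "ENSG00000173213";\n'
good = 'chr18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "ENSG00000173213";\n'

print(check_gtf_line(bad))   # ['expected 9 tab-separated fields, found 10', 'column 2 is empty']
print(check_gtf_line(good))  # []
```

Running this over every non-comment line of the file and printing the line number of the first failure is usually enough to spot a renaming step that went wrong.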

                    Comment

                    • gavin.oliver
                      Senior Member
                      • Jan 2010
                      • 110

                      #11
                      Originally posted by sdarko View Post
                      Well, I figured out the problem. I put an extra tab between the chrom name and the second column in the gtf file when I renamed everything.

                      I fixed it and all is well.
                      Glad to hear it

                      Comment

                      • hbt
                        Member
                        • Jan 2011
                        • 20

                        #12
                        @sdarko Could you explain a little about how you went about "renaming the gtf and fasta header entries so that they matched", please?
                        I'm keen to update the GTF I use with TopHat to the Ensembl version.

                        Many thanks for any advice you may be able to give.
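One way the renaming asked about here could be done is sketched below. It is not sdarko's actual procedure, which the thread does not show. It assumes the Ensembl files use bare names such as `18` and `MT`, and that the target names are `chr18` and `chrM`; the function names are hypothetical, and scaffold or patch sequences would simply get a `chr` prefix too, which may not be what you want.

```python
import io

def ucsc_name(name):
    """Map an Ensembl-style name to a UCSC-style one (assumed convention)."""
    if name == "MT":
        return "chrM"  # assumption: treat Ensembl 'MT' as 'chrM'
    return "chr" + name

def rename_fasta(src, dst):
    """Copy a FASTA stream, rewriting only the name in each '>' header."""
    for line in src:
        if line.startswith(">"):
            name, _, rest = line[1:].rstrip("\n").partition(" ")
            dst.write(">" + ucsc_name(name) + (" " + rest if rest else "") + "\n")
        else:
            dst.write(line)

def rename_gtf(src, dst):
    """Copy a GTF stream, rewriting only the first column of each record."""
    for line in src:
        if line.startswith("#") or not line.strip():
            dst.write(line)
            continue
        # partition keeps exactly one tab, so no extra column is introduced
        name, sep, rest = line.partition("\t")
        dst.write(ucsc_name(name) + sep + rest)

# Toy input (made up) to show the effect.
fa_out, gtf_out = io.StringIO(), io.StringIO()
rename_fasta(io.StringIO(">18 dna:chromosome\nACGT\n>MT\nAC\n"), fa_out)
rename_gtf(io.StringIO('18\tprotein_coding\texon\t49501\t49557\t.\t-\t.\tgene_id "ENSG00000173213";\n'), gtf_out)
print(fa_out.getvalue(), end="")
print(gtf_out.getvalue(), end="")
```

The Bowtie index has to be rebuilt from the renamed FASTA afterwards, since the index stores the sequence names it was built with.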

                        Comment

                        • shurjo
                          Senior Member
                          • Jan 2009
                          • 132

                          #13
                          Look here: http://cufflinks.cbcb.umd.edu/igenomes.html

                          Comment

                          • kopi-o
                            Senior Member
                            • Feb 2008
                            • 319

                            #14
                            If you download both the genome FASTA and the annotation from ENSEMBL, you shouldn't need to rename anything.
                            Last edited by kopi-o; 10-24-2011, 10:31 AM. Reason: clarity

                            Comment

                            • HSV-1
                              Member
                              • Jul 2012
                              • 38

                              #15
                              Hi gavin,
                              I have analysed my RNA-seq data with references (both the genome and the annotation) from UCSC and from Ensembl. I found that I get many more mapping results with the UCSC reference than with the Ensembl reference. What I don't understand is how this is possible.
                              What confuses me even more is that the GTF from UCSC is less than half the size of the one from Ensembl. With a smaller reference I got more results!
                              Do you have any idea why?
                              Thanks!

                              Originally posted by gavin.oliver View Post
                              Ensembl provides GTF files with each build as standard and is much more comprehensive than Refseq alone. I have switched to using the genome and GTF from Ensembl as a result.

                              Comment
