  • Problems creating GTF for Cufflinks annotation

    I have been trying to supply a GTF for annotation with Cufflinks/Cuffcompare and I have been having no success at all.

    I started by only having GFF files. The organism I work with, Arabidopsis, does not have any published GTF annotation files that I have been able to locate and I saw someone else on here was unable to locate any as well. So I attempted to convert the GFFs I had into GTFs by converting the ninth column. I used http://mblab.wustl.edu/GTF22.html as my reference.

    On the first try I simply used the feature column as both the gene_id and the transcript_id. Knowing the names would be nice, but for our purposes just knowing what the reads represent (mRNA, miRNA, siRNA, pseudogene, etc.) is sufficient.

    Code:
    Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene"; transcript_id "gene";
    
    Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA"; transcript_id "mRNA";
    
    Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein"; transcript_id "protein";
    This resulted in an error in Cuffcompare:

    Code:
    cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
    Loading reference transcripts..
    Error: duplicate GFF ID 'mRNA' encountered!
    Based on that error I redid my GFF-to-GTF conversion, numbering each gene_id and transcript_id so that every ID in the file is unique:

    Code:
    Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene2"; transcript_id "gene-2";
    
    Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA3"; transcript_id "mRNA-3";
    
    Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein4"; transcript_id "protein-4";
    Result:

    Code:
    cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
    Loading reference transcripts..
    GList error (GList.hh:592):Invalid list index: -1
    I investigated the error but was unable to find anything, so I guessed that the hyphenated form of the transcript_id (*****-N) might be the cause and altered the GTF again, replacing "-" with "1":

    Code:
    Chr1	TAIR9	gene	3631	5899	.	+	.	gene_id "gene2"; transcript_id "gene12";
    
    Chr1	TAIR9	mRNA	3631	5899	.	+	.	gene_id "mRNA3"; transcript_id "mRNA13";
    
    Chr1	TAIR9	protein	3760	5630	.	+	.	gene_id "protein4"; transcript_id "protein14";
    Result:

    Code:
    cuffcompare -r *.gtf -R -V -o 162.162E -p 4 transcripts1.gtf transcripts2.gtf
    Loading reference transcripts..
    GList error (GList.hh:592):Invalid list index: -1
    I have no idea what the "GList error (GList.hh:592):Invalid list index: -1" error means or how to correct it.
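For anyone retracing these attempts, one quick check is to list which feature types a hand-made GTF contains and which transcript_ids have no exon record at all, in line with the advice in reply #2 that only exon and CDS lines matter to Cufflinks. The Python sketch below is a minimal illustration: `summarize_gtf` is a hypothetical helper, it is not how cuffcompare validates its input, and it is not a confirmed explanation of the GList error.

```python
import re
from collections import Counter, defaultdict

# GTF attributes look like: gene_id "X"; transcript_id "Y";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def summarize_gtf(lines):
    """Count feature types and list transcript_ids that have no exon record."""
    feature_counts = Counter()
    has_exon = defaultdict(bool)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments and rows without nine tab-separated columns
        feature_counts[cols[2]] += 1
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid is not None:
            # Compare case-insensitively so "exon" and "EXON" both count.
            has_exon[tid] = has_exon[tid] or cols[2].lower() == "exon"
    no_exon = sorted(t for t, ok in has_exon.items() if not ok)
    return dict(feature_counts), no_exon


# The three records from the third attempt above.
sample = [
    'Chr1\tTAIR9\tgene\t3631\t5899\t.\t+\t.\tgene_id "gene2"; transcript_id "gene12";',
    'Chr1\tTAIR9\tmRNA\t3631\t5899\t.\t+\t.\tgene_id "mRNA3"; transcript_id "mRNA13";',
    'Chr1\tTAIR9\tprotein\t3760\t5630\t.\t+\t.\tgene_id "protein4"; transcript_id "protein14";',
]

counts, no_exon = summarize_gtf(sample)
print(counts)   # feature types found in the file
print(no_exon)  # transcript_ids with no exon record
```

On these three records every transcript_id ends up in the "no exon" list, since the file contains only gene, mRNA and protein rows.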

    Can anyone recommend a way to change a GFF into a GTF? Tophat accepted GFF files for annotation, but for some reason Cufflinks only accepts GTF files. That is fine for some of the more mainstream organisms, but many others (Arabidopsis in my case) only have annotations in GFF and GFF3, which creates a wall in being able to process the expression data.

    Any and all help/suggestions would be greatly appreciated. I've been hung up on this problem for some time now and I have no more ideas on how to proceed.


    Thanks as always.

  • #2
    Ignore everything except for exons and CDS lines; those are all that matter to cufflinks. Every exon or CDS entry which is part of the same gene must have the same "gene_id". Every exon or CDS which is part of the same transcript must have the same "transcript_id". Here is an example of one gene (AT1G01020) which has two transcripts (AT1G01020.1 and AT1G01020.2).

    The GFF3 (TAIR9 annotation):

    Code:
    Chr1	TAIR9	gene	5928	8737	.	-	.	ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
    Chr1	TAIR9	mRNA	5928	8737	.	-	.	ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
    Chr1	TAIR9	protein	6915	8666	.	-	.	ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
    Chr1	TAIR9	five_prime_UTR	8667	8737	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	8571	8666	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	8571	8737	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	8417	8464	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	8417	8464	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	8236	8325	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	8236	8325	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	7942	7987	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	7942	7987	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	7762	7835	.	-	2	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	7762	7835	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	7564	7649	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	7564	7649	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	7384	7450	.	-	1	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	7384	7450	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	7157	7232	.	-	0	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	exon	7157	7232	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	CDS	6915	7069	.	-	2	Parent=AT1G01020.1,AT1G01020.1-Protein;
    Chr1	TAIR9	three_prime_UTR	6437	6914	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	exon	6437	7069	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	three_prime_UTR	5928	6263	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	exon	5928	6263	.	-	.	Parent=AT1G01020.1
    Chr1	TAIR9	mRNA	6790	8737	.	-	.	ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
    Chr1	TAIR9	protein	7315	8666	.	-	.	ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
    Chr1	TAIR9	five_prime_UTR	8667	8737	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	8571	8666	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	8571	8737	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	8417	8464	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	8417	8464	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	8236	8325	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	8236	8325	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	7942	7987	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	7942	7987	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	7762	7835	.	-	2	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	7762	7835	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	7564	7649	.	-	0	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	exon	7564	7649	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	CDS	7315	7450	.	-	1	Parent=AT1G01020.2,AT1G01020.2-Protein;
    Chr1	TAIR9	three_prime_UTR	7157	7314	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	exon	7157	7450	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	three_prime_UTR	6790	7069	.	-	.	Parent=AT1G01020.2
    Chr1	TAIR9	exon	6790	7069	.	-	.	Parent=AT1G01020.2
    Same information in GTF:

    Code:
    Chr1	TAIR9	CDS	8571	8666	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	8571	8737	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	8417	8464	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	8417	8464	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	8236	8325	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	8236	8325	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	7942	7987	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	7942	7987	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	7762	7835	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	7762	7835	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	7564	7649	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	7564	7649	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	7384	7450	.	-	1	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	7384	7450	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	7157	7232	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	7157	7232	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	6915	7069	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	6437	7069	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	EXON	5928	6263	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.1";
    Chr1	TAIR9	CDS	8571	8666	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	8571	8737	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	8417	8464	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	8417	8464	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	8236	8325	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	8236	8325	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	7942	7987	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	7942	7987	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	7762	7835	.	-	2	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	7762	7835	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	7564	7649	.	-	0	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	7564	7649	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	CDS	7315	7450	.	-	1	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	7157	7450	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
    Chr1	TAIR9	EXON	6790	7069	.	-	.	gene_id "AT1G01020"; transcript_id "AT1G01020.2";
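The rules described in this reply (keep only exon and CDS lines, give every record of a gene the same gene_id and every record of a transcript the same transcript_id) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the converter used to produce the GTF above: `parse_attrs` and `gff3_to_gtf` are hypothetical names, the input is assumed to be tab-separated GFF3 in which each transcript row carries both `ID` and `Parent`, and column 8 (phase) is copied through unchanged as in the example above.

```python
def parse_attrs(col9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in col9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_gtf(lines, keep=("exon", "CDS")):
    """Return GTF lines for the exon/CDS rows of a GFF3 file."""
    rows = [l.rstrip("\n").split("\t") for l in lines
            if l.strip() and not l.startswith("#")]
    rows = [r for r in rows if len(r) == 9]

    # Pass 1: map every feature that has both ID and Parent (e.g. an mRNA)
    # to its parent (the gene). Rows without Parent, such as the TAIR9
    # "protein" rows, are not recorded.
    parent_of = {}
    for r in rows:
        a = parse_attrs(r[8])
        if "ID" in a and "Parent" in a:
            parent_of[a["ID"]] = a["Parent"].split(",")[0]

    # Pass 2: emit one GTF line per (exon/CDS row, transcript parent).
    out = []
    for r in rows:
        if r[2] not in keep:
            continue
        for tid in parse_attrs(r[8]).get("Parent", "").split(","):
            gid = parent_of.get(tid)
            if gid is None:
                continue  # parent is not a known transcript, e.g. "...-Protein"
            attrs = 'gene_id "%s"; transcript_id "%s";' % (gid, tid)
            out.append("\t".join(r[:8] + [attrs]))
    return out


# First six TAIR9 records of AT1G01020 from the GFF3 listing above.
sample = """\
Chr1\tTAIR9\tgene\t5928\t8737\t.\t-\t.\tID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1\tTAIR9\tmRNA\t5928\t8737\t.\t-\t.\tID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1\tTAIR9\tprotein\t6915\t8666\t.\t-\t.\tID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1\tTAIR9\tfive_prime_UTR\t8667\t8737\t.\t-\t.\tParent=AT1G01020.1
Chr1\tTAIR9\tCDS\t8571\t8666\t.\t-\t0\tParent=AT1G01020.1,AT1G01020.1-Protein;
Chr1\tTAIR9\texon\t8571\t8737\t.\t-\t.\tParent=AT1G01020.1
"""

for gtf_line in gff3_to_gtf(sample.splitlines()):
    print(gtf_line)
```

The two-pass layout is there because a CDS row in the TAIR9 file lists two parents (the mRNA and the protein); only parents that were themselves seen with a `Parent` attribute are treated as transcripts.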



    • #3
      Thank you for the reply; that clears some things up for me.

      I do have a few questions though:

      1.) How were you able to convert the TAIR9 GFF3 files into GTF format?

      2.) We are mostly interested in investigating small RNA such as miRNA, siRNA, and other non-coding RNA. We have files for them in GFF. The siRNA data started out as just sequences in supplementary data. From those I aligned them to the genome and created a GFF from that data. How could I supply files such as those to Cufflinks?

      Example:
      Code:
      Chr1	TAIR9	    Jacobsen_siRNA	10002796	10002812	.	.	.	.
      Chr1	TAIR9       Jacobsen_siRNA	10004771	10004794	.	.	.	.
      Chr1	TAIR9       Jacobsen_siRNA	10004925	10004941	.	.	.	.
      Chr1	TAIR9	    Jacobsen_siRNA	10007606	10007626	.	.	.	.



      • #4
        Hi, I'm encountering a similar issue with cuffcompare. While trying to run it with the transcripts.gtf generated from cufflinks, it gave me the following error:

        GList error (GList.hh:592):Invalid list index: 0

        This is very strange because the file was generated by cufflinks, so it should work with cuffcompare. Could someone please help?

        Thanks!

        -EDIT-
        I found out that it could be because of the missing strand information. Sorry about that.
        Last edited by Haneko; 04-07-2010, 07:25 PM. Reason: Problem may be solved
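A quick way to test the missing-strand hypothesis is to list the records whose strand column is neither "+" nor "-". The Python sketch below is only an illustration: `unstranded_records` is a hypothetical helper, the sample rows are made up, and the thread does not confirm that unstranded records are what triggers the GList error.

```python
import re

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def unstranded_records(lines):
    """Return (line_number, feature, transcript_id) for rows whose strand is not '+' or '-'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments and rows without nine tab-separated columns
        if cols[6] not in ("+", "-"):
            match = TRANSCRIPT_ID_RE.search(cols[8])
            hits.append((number, cols[2], match.group(1) if match else None))
    return hits


# Hypothetical rows for illustration only (not taken from a real transcripts.gtf).
sample = [
    'Chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'Chr1\tCufflinks\texon\t500\t650\t.\t.\t.\tgene_id "G2"; transcript_id "G2.1";',
]

print(unstranded_records(sample))
```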



        • #5
          GList.hh:592 error

          Same situation for me. I cannot run cuffcompare because of duplicate errors. I used a perl script to delete all the duplicated exon lines (their exon numbers vary, though) while keeping the transcript lines. Compared to the original gtf file generated by cufflinks, this new "transcript only" gtf file seems to have all the information, including strand.

          However, I still got the error "GList error (GList.hh:592):Invalid list index: 0".

          Haneko, can you share your idea of what is going on?

          cheers



          • #6
            Originally posted by DrD2009
            Thank you for the reply; that clears some things up for me.

            I do have a few questions though:

            1.) How were you able to convert the TAIR9 GFF3 files into GTF format?

            2.) We are mostly interested in investigating small RNA such as miRNA, siRNA, and other non-coding RNA. We have files for them in GFF. The siRNA data started out as just sequences in supplementary data. From those I aligned them to the genome and created a GFF from that data. How could I supply files such as those to Cufflinks?

            Example:
            Code:
            Chr1	TAIR9	    Jacobsen_siRNA	10002796	10002812	.	.	.	.
            Chr1	TAIR9       Jacobsen_siRNA	10004771	10004794	.	.	.	.
            Chr1	TAIR9       Jacobsen_siRNA	10004925	10004941	.	.	.	.
            Chr1	TAIR9	    Jacobsen_siRNA	10007606	10007626	.	.	.	.
            I converted the TAIR9 GFF3 file using the attached perl script. This script uses BioPerl, specifically Bio::FeatureIO. However, there appears to be a bug in Bio::FeatureIO::gff related to the phase/frame value. To get this script to work properly I actually had to hack up Bio/FeatureIO/gff.pm a little. I am properly ashamed for having done this. Since frame/phase is irrelevant to your siRNA annotations, you would not have to worry about this issue. You would need to install BioPerl to run the script, though.

            Note: I was going to post the entire TAIR9 GTF but the gzipped file is too large to attach and I don't have an accessible server. If you desperately need it, send me a PM and I could e-mail it to you.
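For interval-only annotations like the siRNA rows quoted above, which have no gene or transcript hierarchy, one possible approach is to write each interval as a single-exon transcript with its own generated ID. The Python sketch below is an assumption-laden illustration and not the attached perl script: `intervals_to_gtf` and the ID scheme are hypothetical, the rows are split on any whitespace because the quoted example mixes tabs and spaces, and the "." strand is carried through as given. Whether Cufflinks or Cuffcompare accepts unstranded records is not settled in this thread (see post #4).

```python
def intervals_to_gtf(lines):
    """Turn bare GFF interval rows into single-exon GTF records with generated IDs."""
    out = []
    count = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()  # any run of tabs/spaces separates columns
        if len(cols) < 8:
            continue  # not enough columns to be a GFF row
        seqid, source, ftype, start, end, score, strand, frame = cols[:8]
        count += 1
        # Hypothetical ID scheme: original feature type plus a running number.
        ident = "%s_%06d" % (ftype, count)
        attrs = 'gene_id "%s"; transcript_id "%s";' % (ident, ident)
        out.append("\t".join(
            [seqid, source, "exon", start, end, score, strand, frame, attrs]))
    return out


# The first two rows of the quoted siRNA example, whitespace as posted.
sample = [
    "Chr1\tTAIR9\t    Jacobsen_siRNA\t10002796\t10002812\t.\t.\t.\t.",
    "Chr1\tTAIR9       Jacobsen_siRNA\t10004771\t10004794\t.\t.\t.\t.",
]

for gtf_line in intervals_to_gtf(sample):
    print(gtf_line)
```

Keeping the original feature type inside the generated ID means the source of each record (here `Jacobsen_siRNA`) can still be read from the output.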
            Attached Files



            • #7
              Hi kmcarr,

              Would it be possible for you to email me the TAIR9 gtf file?

              thanks



              • #8
                Hi kmcarr,

                I am also interested in your TAIR9 gtf file. Would it be possible to email me this file ([email protected])?
                Thanks!



                • #9
                  gff.pm

                  Hi kmcarr,
                  Could you post your gff.pm hack? I need to do this conversion and need to worry about frame.
                  Thanks,
                  Bob



                  • #10
                    It seems that the GTF file is provided by TAIR now. Has anyone tried it?

                    ftp://ftp.arabidopsis.org/home/tair/...enes_exons.gtf

                    thanks,

                    Originally posted by kpatel
                    Hi kmcarr,

                    Would it be possible for you to email me the TAIR9 gtf file?

                    thanks



                    • #11
                      Hello All,

                      I was having this issue while running "cuffmerge" on assemblies built using cufflinks 2.1.1.

                      It turned out that the problem with duplicated entries was not in the gencode gtf file I was using as the reference, but in the "transcripts.gtf" file created during the cufflinks step.

                      Updating cufflinks to the newer version 2.2.1 and re-running the cufflinks step resolved the issue.

                      Hope that helps.
                      Good luck
