
  • Tab Delimited File Editors? (GFF to GTF)

    Hello everyone,

    I have a bunch of GFFs that I would like to convert into GTF format in order to provide annotation for use in Cufflinks.

    Can anyone recommend a tab delimited file editor I could use to do this? I'm not a programmer so if there is coding necessary it would have to be very basic. I've tried using Galaxy, but it changes the data I enter (mainly: "" ).


  • #2
    How about this?

    Oh, sorry. I read it wrong. You want to go in the other direction.
    Last edited by mgogol; 03-29-2010, 12:34 PM.


    • #3
      Hi Brandon,

      Did you find a simple way to convert GFF to GTF? I want to do exactly the same thing, and I am also not a programmer.

      Thanks, J


      • #4
        You can try my Perl script. I used it with a FlyBase GFF file. Note that if you want to represent ncRNAs, tRNAs, rRNAs, snRNAs, or miRNAs, you'll have to manually change them to "mRNA" in the GFF file or modify the script.

        The script also expects all the mRNA entries to come before the exons. If your GFF file isn't ordered like that, you can grep the mRNAs out, then the exons, and cat them together.

        #!/usr/bin/env perl
        # Parses mRNA and exon lines from a GFF3 file and prints GTF lines (for Cufflinks).
        # 5/2010
        # #############################
        use strict;
        use warnings;
        use Bio::Tools::GFF;
        my $parser = Bio::Tools::GFF->new(-file => $ARGV[0], -gff_version => 3);
        my %hash;    # transcript (mRNA) ID => parent gene ID
        while ( my $result = $parser->next_feature ) {
        	# every feature is expected to carry an ID tag
        	my ($id)   = $result->get_tag_values("ID");
        	my $type   = $result->primary_tag();
        	my $seq_id = $result->seq_id();
        	my $strand = $result->strand();
        	# BioPerl reports strand as -1 or 1; GTF wants - or +
        	$strand =~ s/-1/-/g;
        	$strand =~ s/1/+/g;
        	my $start = $result->start();
        	my $end   = $result->end();
        	if ( $type eq "mRNA" ) {
        		# remember which gene each transcript belongs to
        		my ($parent) = $result->get_tag_values("Parent");
        		$hash{$id} = $parent;
        	}
        	if ( $type eq "exon" ) {
        		# find the transcript (parent) and gene for THIS exon
        		my ($transcript) = $result->get_tag_values("Parent");
        		my $gene = $hash{$transcript};
        		print "$seq_id\tFlyBase\t$type\t$start\t$end\t.\t$strand\t.\tgene_id \"$gene\"; transcript_id \"$transcript\";\n";
        	}
        }
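For readers without BioPerl installed, the same mRNA/exon logic can be sketched with the Python standard library alone. This is only an illustrative sketch, not the script above: the function names `parse_attributes` and `gff3_to_gtf` are hypothetical, the input is assumed to be tab-delimited GFF3, and a transcript with no `Parent` is assumed to act as its own gene. It reads the `ID` only from transcript lines, so exon lines that carry nothing but a `Parent` attribute are accepted, and it makes two passes so the mRNA lines do not have to come first.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string like 'ID=a;Parent=b' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_gtf(lines, transcript_type="mRNA"):
    """Return GTF exon lines built from GFF3 transcript and exon records."""
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            records.append(cols)

    # First pass: map each transcript ID to its parent gene ID.
    gene_of = {}
    for cols in records:
        if cols[2] == transcript_type:
            attrs = parse_attributes(cols[8])
            if "ID" in attrs:
                # Assumption: a transcript without a Parent is its own gene.
                gene_of[attrs["ID"]] = attrs.get("Parent", attrs["ID"])

    # Second pass: emit one GTF line per exon and per parent transcript.
    out = []
    for cols in records:
        if cols[2] != "exon":
            continue
        parents = parse_attributes(cols[8]).get("Parent", "")
        for transcript in filter(None, parents.split(",")):
            gene = gene_of.get(transcript, transcript)
            attributes = 'gene_id "%s"; transcript_id "%s";' % (gene, transcript)
            out.append("\t".join(cols[:8] + [attributes]))
    return out
```

Calling `gff3_to_gtf(open("annotation.gff"))` (the file name is a placeholder) returns a list of GTF lines that can be written out one per line. Unlike the Perl script, the source column is copied from the input rather than set to "FlyBase".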


        • #5
          Thanks for that. Sorry, I am new to this and to Perl. How would I go about changing the script to suit my data?

          This is my data:

          DDB0232428 Sequencing Center mRNA 1890 3287 . + . ID=DDB0216437;Parent=DDB_G0267178;Name=DDB0216437;description=JC1V2_0_00003: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73826.1,Inparanoid V. 5.1:DDB0216437,UniProt:Q55H43,Genome V. 2.0 ID:JC1V2_0_00003,Protein Accession Number:EAL73826.1,Protein GI Number:60475899
          DDB0232428 Sequencing Center mRNA 3848 4855 . + . ID=DDB0216438;Parent=DDB_G0267180;Name=DDB0216438;description=JC1V2_0_00004: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73827.1,Inparanoid V. 5.1:DDB0216438,UniProt:Q55H42,Genome V. 2.0 ID:JC1V2_0_00004,Protein Accession Number:EAL73827.1,Protein GI Number:60475900
          DDB0232428 Sequencing Center mRNA 5505 7769 . + . ID=DDB0216439;Parent=DDB_G0267182;Name=DDB0216439;description=JC1V2_0_00005: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73828.1,Inparanoid V. 5.1:DDB0216439,UniProt:Q55H60,Genome V. 2.0 ID:JC1V2_0_00005,Protein Accession Number:EAL73828.1,Protein GI Number:60475901
          DDB0232428 Sequencing Center mRNA 8308 9522 . - . ID=DDB0216440;Parent=DDB_G0267184;Name=DDB0216440;description=JC1V2_0_00006: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73829.1,Inparanoid V. 5.1:DDB0216440,UniProt:Q55H61,Genome V. 2.0 ID:JC1V2_0_00006,Protein Accession Number:EAL73829.1,Protein GI Number:60475902
          DDB0232428 Sequencing Center mRNA 9635 9889 . - . ID=DDB0216441;Parent=DDB_G0267186;Name=DDB0216441;description=JC1V2_0_00007: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73830.1,Inparanoid V. 5.1:DDB0216441,UniProt:Q55H59,Genome V. 2.0 ID:JC1V2_0_00007,Protein Accession Number:EAL73830.1,Protein GI Number:60475903
          followed by the exons after the mRNAs:

          DDB0232428 Sequencing Center exon 1890 3287 . + . Parent=DDB0216437
          DDB0232428 Sequencing Center exon 3848 4855 . + . Parent=DDB0216438
          DDB0232428 Sequencing Center exon 5505 7769 . + . Parent=DDB0216439
          DDB0232428 Sequencing Center exon 8308 9522 . - . Parent=DDB0216440
          DDB0232428 Sequencing Center exon 9635 9889 . - . Parent=DDB0216441
          I get this error:

          James$ perl chrm1_mRNA_exon.gff > chrm1.gtf

          ------------- EXCEPTION -------------
          MSG: asking for tag value that does not exist ID
          STACK Bio::SeqFeature::Generic::get_tag_values Bio/SeqFeature/
          STACK toplevel

          Thanks a lot, James
          Last edited by James; 07-31-2010, 11:43 AM. Reason: edit details


          • #6
            Oh, DDB0232428 is chrm1. I'll change it to chrm1 with sed.
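The renaming mentioned here can also be done in a way that only touches the first column, so an identifier that happens to appear in the attributes is left alone. A minimal sketch, assuming tab-delimited input; `rename_seqid` is a hypothetical helper name, not part of any tool discussed in the thread.

```python
def rename_seqid(lines, old, new):
    """Replace the sequence name in column 1 only; other columns are untouched."""
    renamed = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Leave comment and directive lines (starting with '#') as they are.
        if not line.startswith("#") and cols[0] == old:
            cols[0] = new
        renamed.append("\t".join(cols))
    return renamed
```

For example, `rename_seqid(open("chrm1_mRNA_exon.gff"), "DDB0232428", "chrm1")` returns the renamed lines without trailing newlines.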


            • #7
              Similar Problem

              I'm experiencing a similar problem. I have a .gff file for my organism (Anabaena sp. strain 7120) and would like to convert it to a .gtf file for use with Cufflinks.

              My current format looks like this:
              ##gff-version 3
              #!gff-spec-version 1.14
              #!source-version NCBI C++ formatter 0.2
              ##Type DNA BA000019.2
              BA000019.2 DDBJ source 1 6413771 . + . organism=Nostoc sp. PCC 7120;mol_type=genomic DNA;strain=PCC 7120;db_xref=taxon:103690;note=synonym: Anabaena sp. PCC 7120
              BA000019.2 DDBJ gene 1 918 . - . ID=BA000019.2:all0001
              BA000019.2 DDBJ gene 6413460 6413771 . - . ID=BA000019.2:all0001
              BA000019.2 DDBJ CDS 1 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1
              BA000019.2 DDBJ CDS 6413463 6413771 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=2
              BA000019.2 DDBJ start_codon 916 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1

              and I need this:
              AB000381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
              AB000381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
              AB000381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
              AB000381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
              AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";

              I tried a couple of GFF-to-GTF Perl converters like this one, but the ninth column never comes out right. Any help would be great.
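One possible way to build the ninth column for a file like this, where the CDS lines carry neither `ID` nor `Parent`, is to take the gene name from the `ORF_ID:` entry inside the percent-encoded `note` attribute. The sketch below rests on assumptions: every CDS, start_codon, and stop_codon line is assumed to have such a note, the input is assumed to be tab-delimited, the `.1` transcript suffix is an arbitrary convention, and `locus_from_note` and `cds_to_gtf` are hypothetical names.

```python
import re
from urllib.parse import unquote


def locus_from_note(column9):
    """Pull the ORF_ID value out of the percent-encoded note attribute, if any."""
    for field in column9.split(";"):
        if field.startswith("note="):
            note = unquote(field[len("note="):])
            match = re.search(r"ORF_ID:([^;\s]+)", note)
            if match:
                return match.group(1)
    return None


def cds_to_gtf(lines, keep=("CDS", "start_codon", "stop_codon")):
    """Rewrite column 9 of the kept feature types as GTF attributes."""
    out = []
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in keep:
            continue
        locus = locus_from_note(cols[8])
        if locus is None:
            continue  # no usable name; skipped rather than guessed
        # Assumption: one transcript per gene, named "<gene>.1".
        cols[8] = 'gene_id "%s"; transcript_id "%s.1";' % (locus, locus)
        out.append("\t".join(cols))
    return out
```

Lines of other types, such as `source` and `gene`, are dropped, which matches the target format shown above.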


              • #8
                bump ?

                any help would be great!


                • #9
                  Originally posted by BrittLF
                  any help would be great!
                  Please do not bump your threads. Give it some more time and some people may answer your questions. Otherwise, keep searching.


                  • #10
                    I am also trying to convert a gff file to gtf, and am using the script. However, I'm getting an error about the length of each line in the file:

                    ------------- EXCEPTION -------------
                    MSG: Each line of the fasta entry must be the same length except the last.
                    Line above #3 'LbrM01_V2_October Ge..' is 87 != 100 chars.
                    STACK Bio::DB::Fasta::calculate_offsets /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/
                    STACK Bio::DB::Fasta::index_file /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/
                    STACK Bio::DB::Fasta::new /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/
                    STACK toplevel

                    indexing was interrupted, so unlinking L_braziliensis.gff.index at /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/ line 1053.
                    The attribute column (the last column) differs for each line:
                    LbrM01_V2_October GeneDB Contig 1 235333 . + . Sequence LbrM01_V2_October ; Alias LbrM01_V2_October
                    LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_1 ; origid "Lbr.chr1" ;
                    LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_2 ; origid "Lbr.chr1" ;
                    LbrM01_V2_October GeneDB CDS_parts 1272 4166 . - . mRNA LbrM01_V2.0010 ; temporary_systematic_id "LbrM01_V2.0010" ; colour "8" ; ortholog "GeneDB_Lmajor:LmjF01.0630 ||| GeneDB_Linfantum:LinJ01_V3.0650;predicted_by_orthomcl ||| GeneDB_Lmajor:LmjF01.0630;predicted_by_orthomcl" ; product "hypothetical protein, unknown function" ;
                    LbrM01_V2_October GeneDB CDS 1272 4166 . - . mRNA LbrM01_V2.0010 ; colour "8" ;
                    but I don't know how to fix this. Is there something I can use to cut down the length of the attributes to an equal number of characters?

                    thank you!
                    Last edited by jbittner; 10-18-2010, 02:32 PM. Reason: :D made a smiley face when posted


                    • #11
                      Maybe you can get rid of some of the irrelevant lines? Grep for mRNA and exon and make a new file containing only those lines. If you put your file up somewhere, maybe I could take a look at it.

                      Same with other people having problems.

                      The errors are from BioPerl, so I'm having trouble figuring out what they mean; I'd have to do more testing with the script.
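The filtering suggested in this reply, keeping only the mRNA and exon lines, can be sketched in Python as an alternative to grep. It matches on the feature-type column rather than anywhere in the line, and it also puts the mRNA lines ahead of the exon lines, which is the order the Perl script expects. `keep_features` is a hypothetical name and the input is assumed to be tab-delimited.

```python
def keep_features(lines, order=("mRNA", "exon")):
    """Keep only the listed feature types, grouped in the given order."""
    buckets = {feature: [] for feature in order}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Column 3 (index 2) holds the feature type.
        if len(cols) >= 3 and cols[2] in buckets:
            buckets[cols[2]].append(line)
    return [line for feature in order for line in buckets[feature]]
```

The returned lines keep their original newlines, so they can be written straight to a new file.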


                      • #12
                        Thank you for the idea, I am sort of new to this so any advice really helps.

                        I got the GFF file off of the Sanger FTP site, and it's for the parasite Leishmania braziliensis. It's too big to upload to the forum even when I compress it. Is there another way I can get it to you?

                        Here is the link for where I got it (I connected as "guest", then found it through the folders Datasets/GFF)


                        • #13
                          That GFF file doesn't have exon entries and the last column doesn't have an ID tag... Do you have a source for exon level information?

                          If you don't, you could try running without a GTF file and letting Cufflinks define its own transcripts.


                          • #14
                            Unfortunately, the only exon-level information we have found is in a .cds file (from the same FTP site). I haven't found any way to convert it to GFF or GTF, and I don't even know what that file extension means.

                            Also, we are ultimately trying to get a refFlat file to use with DEGseq, so converting our GFF file to GTF was just an intermediate step in that process.

                            I really appreciate your help


                            • #15
                              Um. I don't know either. The cds file doesn't seem to have exon information. I've got to get back to my own work now... Good luck.

