**mgogol** · 03-29-2010, 10:14 AM

How about this?

http://www.sequenceontology.org/cgi-bin/converter.cgi

Oh, sorry. I read it wrong. You want to go in the other direction.

**James** · 07-28-2010, 07:04 AM

Hi Brandon,

Did you find a simple way to convert GFF to GTF? I want to do exactly the same thing, and I am also not a programmer.

Thanks, J

**mgogol** · 07-28-2010, 07:26 AM

You can try my perl script. I used this with a flybase gff file. Note that if you want to represent ncRNAs, tRNAs, rRNAs, snRNAs, miRNAs, you'll have to manually change them to "mRNA" in the gff file or modify the script.

This also expects all the mRNA entries to come before the exons. If your gff file isn't ordered like that, you can grep the mRNAs out and then the exons and cat them together.

Code:

#!/usr/bin/env perl
###############################
# gff2gtf.pl
#
# parses mRNA, exon lines from a gff file and prints gtf lines (for cufflinks)
# 5/2010
###############################

use strict;
use warnings;
use Bio::Tools::GFF;

my $parser = Bio::Tools::GFF->new(-file => $ARGV[0], -gff_version => 3);

# maps transcript (mRNA) ID => parent gene ID
my %hash;
while( my $result = $parser->next_feature )
{
	# get_tag_values() throws an exception if the tag is absent,
	# so every feature in the input needs an ID attribute
	my ($id) = $result->get_tag_values("ID");
	my $type = $result->primary_tag();

	my $seq_id = $result->seq_id();
	# BioPerl reports strand as 1, -1 or 0; GTF wants +, - or .
	my $strand = $result->strand();
	$strand = $strand == 1 ? "+" : $strand == -1 ? "-" : ".";
	my $start = $result->start();
	my $end = $result->end();

	if($type eq "mRNA")
	{
		# remember which gene this transcript belongs to
		my ($parent) = $result->get_tag_values("Parent");
		$hash{$id} = $parent;
	}
	if($type eq "exon")
	{
		#find out transcript (parent) and gene for THIS exon
		my ($transcript) = $result->get_tag_values("Parent");
		my $gene = $hash{$transcript};
		print "$seq_id\tFlyBase\t$type\t$start\t$end\t.\t$strand\t.\tgene_id \"$gene\"; transcript_id \"$transcript\";\n";
	}
}
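
The reordering step described above (all mRNA lines first, then the exon lines) can also be done without grep and cat. This is only a minimal sketch in Python rather than Perl, assuming a tab-delimited, nine-column GFF file; the function name `mrna_then_exon` is made up for illustration and is not part of the script above.

```python
def mrna_then_exon(gff_lines):
    """Return the mRNA lines followed by the exon lines, dropping all others."""
    mrnas, exons = [], []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete nine-column GFF record
        if cols[2] == "mRNA":
            mrnas.append(line)
        elif cols[2] == "exon":
            exons.append(line)
    return mrnas + exons


# Tiny made-up example: the exon comes first in the input.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\texon\t1\t5\t.\t+\t.\tID=e1;Parent=t1\n",
    "chr1\tsrc\tgene\t1\t5\t.\t+\t.\tID=g1\n",
    "chr1\tsrc\tmRNA\t1\t5\t.\t+\t.\tID=t1;Parent=g1\n",
]
for record in mrna_then_exon(example):
    print(record, end="")
```

For a real file, the same function can be given an open file handle instead of a list, and the result written to a new file that is then passed to the converter.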

**James** · 07-31-2010, 11:42 AM

Thanks for that. Sorry, I am a newbie to this and to Perl. How would I go about changing the script to suit my data?

This is my data:

DDB0232428 Sequencing Center mRNA 1890 3287 . + . ID=DDB0216437;Parent=DDB_G0267178;Name=DDB0216437;description=JC1V2_0_00003: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73826.1,Inparanoid V. 5.1:DDB0216437,UniProt:Q55H43,Genome V. 2.0 ID:JC1V2_0_00003,Protein Accession Number:EAL73826.1,Protein GI Number:60475899
DDB0232428 Sequencing Center mRNA 3848 4855 . + . ID=DDB0216438;Parent=DDB_G0267180;Name=DDB0216438;description=JC1V2_0_00004: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73827.1,Inparanoid V. 5.1:DDB0216438,UniProt:Q55H42,Genome V. 2.0 ID:JC1V2_0_00004,Protein Accession Number:EAL73827.1,Protein GI Number:60475900
DDB0232428 Sequencing Center mRNA 5505 7769 . + . ID=DDB0216439;Parent=DDB_G0267182;Name=DDB0216439;description=JC1V2_0_00005: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73828.1,Inparanoid V. 5.1:DDB0216439,UniProt:Q55H60,Genome V. 2.0 ID:JC1V2_0_00005,Protein Accession Number:EAL73828.1,Protein GI Number:60475901
DDB0232428 Sequencing Center mRNA 8308 9522 . - . ID=DDB0216440;Parent=DDB_G0267184;Name=DDB0216440;description=JC1V2_0_00006: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73829.1,Inparanoid V. 5.1:DDB0216440,UniProt:Q55H61,Genome V. 2.0 ID:JC1V2_0_00006,Protein Accession Number:EAL73829.1,Protein GI Number:60475902
DDB0232428 Sequencing Center mRNA 9635 9889 . - . ID=DDB0216441;Parent=DDB_G0267186;Name=DDB0216441;description=JC1V2_0_00007: Obtained from the Dictyostelium Genome Consortium at The Wellcome Trust Sanger Institute;translation_start=1;Dbxref=Protein Accession Version:EAL73830.1,Inparanoid V. 5.1:DDB0216441,UniProt:Q55H59,Genome V. 2.0 ID:JC1V2_0_00007,Protein Accession Number:EAL73830.1,Protein GI Number:60475903

followed by exons after the mRNAs

DDB0232428 Sequencing Center exon 1890 3287 . + . Parent=DDB0216437
DDB0232428 Sequencing Center exon 3848 4855 . + . Parent=DDB0216438
DDB0232428 Sequencing Center exon 5505 7769 . + . Parent=DDB0216439
DDB0232428 Sequencing Center exon 8308 9522 . - . Parent=DDB0216440
DDB0232428 Sequencing Center exon 9635 9889 . - . Parent=DDB0216441

I get this error:

James$ perl gff2gtf.pl chrm1_mRNA_exon.gff > chrm1.gtf

------------- EXCEPTION -------------
MSG: asking for tag value that does not exist ID
STACK Bio::SeqFeature::Generic::get_tag_values Bio/SeqFeature/Generic.pm:517
STACK toplevel gff2gtf.pl:16
-------------------------------------

Thanks a lot, James
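
The exception message says a feature has no ID tag, and in the data shown the exon lines carry only `Parent=...`, so those lines are the likely trigger. One way around it is to read the attribute column directly and never ask an exon for an ID. The following is a rough Python sketch rather than a change to the Perl script; it assumes the real file is tab-delimited with nine columns and that each exon has a single Parent. The function names `parse_attributes` and `gff3_to_gtf` are invented for this example.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_gtf(gff_lines):
    """Yield GTF exon lines; mRNA lines are expected before their exons."""
    gene_of = {}  # transcript ID -> gene ID, filled in from the mRNA lines
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "exon" and "Parent" in attrs:
            transcript = attrs["Parent"]
            # Assumption: if no mRNA line was seen for this exon, reuse the
            # transcript ID as the gene ID instead of failing.
            gene = gene_of.get(transcript, transcript)
            cols[8] = 'gene_id "{}"; transcript_id "{}";'.format(gene, transcript)
            yield "\t".join(cols)


# Two lines modelled on the data above, shortened to the tags that matter.
sample = [
    "DDB0232428\tSequencing Center\tmRNA\t1890\t3287\t.\t+\t.\tID=DDB0216437;Parent=DDB_G0267178",
    "DDB0232428\tSequencing Center\texon\t1890\t3287\t.\t+\t.\tParent=DDB0216437",
]
for gtf_line in gff3_to_gtf(sample):
    print(gtf_line)
```

Unlike the Perl script, this keeps the source column from the input instead of writing "FlyBase", and it does not handle exons that list several parents separated by commas.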

**James** · 07-31-2010, 11:48 AM

oh DDB0232428 is chrm1. I'll change that to chrm1 with sed.

**BrittLF** · 09-13-2010, 04:06 PM

Similar Problem

Hi!
I'm experiencing a similar problem. I have a .gff file for my organism (Anabaena sp. strain 7120) and would like to convert it to a .gtf to upload with the software cufflinks.

My current format looks like this:
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA BA000019.2
BA000019.2 DDBJ source 1 6413771 . + . organism=Nostoc sp. PCC 7120;mol_type=genomic DNA;strain=PCC 7120;db_xref=taxon:103690;note=synonym: Anabaena sp. PCC 7120
BA000019.2 DDBJ gene 1 918 . - . ID=BA000019.2:all0001
BA000019.2 DDBJ gene 6413460 6413771 . - . ID=BA000019.2:all0001
BA000019.2 DDBJ CDS 1 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1
BA000019.2 DDBJ CDS 6413463 6413771 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=2
BA000019.2 DDBJ start_codon 916 918 . - 0 note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1

and I need this:
AB000381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";

I tried a couple of gff to gtf perl converters like this one, but the ninth column never comes out right. Any help would be great.
Thanks!
Britt
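
For a file shaped like the one above, the CDS lines have neither ID nor Parent, but their URL-encoded note contains `ORF_ID:all0001`. A possible way to build the ninth column is to pull that locus name out and reuse it. This is a hedged Python sketch, not a tested converter: it assumes tab-delimited input, assumes one transcript per gene, and the `.1` transcript suffix simply mimics the target format shown. `cds_line_to_gtf` is a made-up name.

```python
import re
from urllib.parse import unquote

# Matches the locus name that follows "ORF_ID:" in the decoded note.
ORF_ID = re.compile(r"ORF_ID:([^;\s]+)")


def cds_line_to_gtf(line):
    """Rewrite column 9 as gene_id/transcript_id, or return None if no ORF_ID."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    note = unquote(cols[8])  # turns %3B into ';' and %0A into a newline
    match = ORF_ID.search(note)
    if match is None:
        return None  # e.g. the source and gene lines, which have no ORF_ID
    locus = match.group(1)
    # Assumption: one transcript per gene, named "<locus>.1".
    cols[8] = 'gene_id "{0}"; transcript_id "{0}.1";'.format(locus)
    return "\t".join(cols)


cds = ("BA000019.2\tDDBJ\tCDS\t1\t918\t.\t-\t0\t"
       "note=all0001%3B ORF_ID:all0001%3B%0Aunknown protein;transl_table=11;"
       "protein_id=BAB77525.1;db_xref=GI:55420319;exon_number=1")
print(cds_line_to_gtf(cds))
```

Only the attribute column is changed; the feature type in column 3 is left as it was in the input.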

**BrittLF** · 09-14-2010, 09:03 AM

bump ?

any help would be great!

**nilshomer** · 09-14-2010, 10:00 AM

Originally posted by BrittLF

any help would be great!

Please do not bump your threads. Give it some more time and some people may answer your questions. Otherwise, keep searching.

**jbittner** · 10-18-2010, 02:31 PM

Hi,
I am also trying to convert a gff file to gtf, and am using the gff2gtf.pl script. However, I'm getting an error about the length of each line in the file:

------------- EXCEPTION -------------
MSG: Each line of the fasta entry must be the same length except the last.
Line above #3 'LbrM01_V2_October Ge..' is 87 != 100 chars.
STACK Bio::DB::Fasta::calculate_offsets /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:770
STACK Bio::DB::Fasta::index_file /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:680
STACK Bio::DB::Fasta::new /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm:491
STACK toplevel gff2gtf.pl:20
-------------------------------------

indexing was interrupted, so unlinking L_braziliensis.gff.index at /Users/jaimebittner/BioPerl-1.6.1//Bio/DB/Fasta.pm line 1053.

The attribute column (the last column) differs for each line:

LbrM01_V2_October GeneDB Contig 1 235333 . + . Sequence LbrM01_V2_October ; Alias LbrM01_V2_October
LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_1 ; origid "Lbr.chr1" ;
LbrM01_V2_October GeneDB source 1 235333 . + . source unknown_2 ; origid "Lbr.chr1" ;
LbrM01_V2_October GeneDB CDS_parts 1272 4166 . - . mRNA LbrM01_V2.0010 ; temporary_systematic_id "LbrM01_V2.0010" ; colour "8" ; ortholog "GeneDB_Lmajor:LmjF01.0630 ||| GeneDB_Linfantum:LinJ01_V3.0650;predicted_by_orthomcl ||| GeneDB_Lmajor:LmjF01.0630;predicted_by_orthomcl" ; product "hypothetical protein, unknown function" ;
LbrM01_V2_October GeneDB CDS 1272 4166 . - . mRNA LbrM01_V2.0010 ; colour "8" ;

but I don't know how to fix this. Is there something I can use to cut down the length of the attributes to an equal number of characters?

thank you!

**mgogol** · 10-19-2010, 05:58 AM

Maybe you can get rid of some of the irrelevant lines? grep for mRNA and exon and make a new file only containing those lines? If you put your file up somewhere maybe I could take a look at it.

Same with other people having problems.

The errors are from BioPerl, so I'm having trouble figuring out what they mean. I'd have to do more testing with the script.

**jbittner** · 10-19-2010, 10:12 AM

Thank you for the idea, I am sort of new to this so any advice really helps.

I got the GFF file off of the Sanger FTP site, and it's for the parasite Leishmania braziliensis. It's too big to upload to the forum even when I compress it. Is there another way I can get it to you?

Here is the link for where I got it ftp://ftp.sanger.ac.uk/pub/pathogens/L_braziliensis/ (I connected as "guest", then found it through the folders Datasets/GFF)

**mgogol** · 10-19-2010, 10:24 AM

That GFF file doesn't have exon entries and the last column doesn't have an ID tag... Do you have a source for exon level information?

If you don't, you could try running without a gtf file and just letting cufflinks define its own transcripts.

**jbittner** · 10-19-2010, 10:47 AM

Unfortunately, the only exon-level information that we have found is in a .cds file (I found it on the same FTP site). I haven't found any way to convert this to GFF or GTF, and I don't even know what that file extension means.

Also, we are ultimately trying to get a refFlat file to use with DEGseq, so converting our gff to a gtf file was just an intermediate step in that process.

I really appreciate your help

**mgogol** · 10-19-2010, 11:44 AM

Um. I don't know either. The cds file doesn't seem to have exon information. I've got to get back to my own work now... Good luck.

Tab Delimited File Editors? (GFF to GTF)