Hi All,
I wanted to re-create FlyBase's gtf (FB2015_04) from the gff file. It is a different matter why I want to do that.
So I converted dmel-all-r6.07.gff to a gtf file using my own program, and found a few genes/transcripts that do not match what I expected. Bear with me on that one. For the sake of simplicity I am giving an example for one gene, but there are 23 such cases.
In the gff file, these are the lines for gene FBgn0031926 and transcript FBtr0335486, excluding a few irrelevant ones.
2L FlyBase CDS 7613405 7614199 . + 0 Parent=FBtr0079472,FBtr0335486
2L FlyBase CDS 7614326 7614695 . + 0 Parent=FBtr0079472,FBtr0335486
2L FlyBase CDS 7614843 7615444 . + 2 Parent=FBtr0335486
2L FlyBase CDS 7615576 7615578 . + 0 Parent=FBtr0335486
2L FlyBase three_prime_UTR 7615579 7615967 . + . Parent=FBtr0335486
2L FlyBase three_prime_UTR 7616117 7616533 . + . Parent=FBtr0335486
The start_codon and the first three CDS features are not a problem. The problem comes when one tries to create a stop_codon. The last CDS (7615576-7615578) is basically the stop codon. So from that the stop_codon becomes:
2L FlyBase stop_codon 7615576 7615578
Then one has to delete the last CDS (7615576-7615578), as it is just the stop_codon. This is how I parse it:
2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7613405 7614199 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7614326 7614695 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7614843 7615444 . + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase 3UTR 7615579 7615967 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase 3UTR 7616117 7616533 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
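The rule I am applying can be sketched roughly as follows. This is only a minimal illustration, not my actual program: the function name `split_stop_codon` and the interval representation (1-based inclusive `(start, end)` tuples) are made up for the example, and it assumes the GFF CDS features include the stop codon while GTF2.2 CDS features must not.

```python
def split_stop_codon(cds, strand):
    """Split the last 3 coding bases off as the stop codon.

    cds: list of (start, end) 1-based inclusive CDS intervals for ONE
         transcript, stop codon included (as in the GFF lines above).
    Returns (trimmed_cds, stop_codon), both sorted by coordinate.
    """
    # Order intervals 5'->3' in transcript orientation, so the last
    # element is always the most 3' CDS segment.
    ordered = sorted(cds, reverse=(strand == "-"))
    remaining = 3  # bases of the stop codon still to be assigned
    stop = []
    while remaining and ordered:
        start, end = ordered.pop()
        length = end - start + 1
        take = min(length, remaining)
        if strand == "+":
            stop.append((end - take + 1, end))
            if length > take:  # keep the part that is real CDS
                ordered.append((start, end - take))
        else:
            stop.append((start, start + take - 1))
            if length > take:
                ordered.append((start + take, end))
        remaining -= take
    return sorted(ordered), sorted(stop)


gff_cds = [
    (7613405, 7614199),
    (7614326, 7614695),
    (7614843, 7615444),
    (7615576, 7615578),
]
trimmed, stop = split_stop_codon(gff_cds, "+")
print(trimmed)
print(stop)
```

Under these assumptions, a final CDS segment that consists only of the stop codon disappears entirely rather than being replaced by anything else, and a stop codon split across two segments would yield two stop_codon intervals.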
Everything is as it should be. Nevertheless, the FlyBase gtf file for this transcript has the following:
2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7613405 7614199 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7614326 7614695 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7614843 7615444 7 + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase CDS 7615575 7615575 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase 3UTR 7615579 7615967 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
2L FlyBase 3UTR 7616117 7616533 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
Look at the last CDS (7615575-7615575): it is a single base taken from the intronic region, since the previous CDS ends at 7615444 and the stop codon starts at 7615576. Either I am misreading the GTF specification (http://mblab.wustl.edu/GTF22.html) or FlyBase generates the file differently from how the specification describes.
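One way to list such cases is to flag every GTF CDS interval that is not contained in any GFF CDS interval of the same transcript. The snippet below is only a sketch with a hypothetical helper name (`cds_outside_source`); it assumes the intervals have already been extracted per transcript as 1-based inclusive `(start, end)` tuples.

```python
def cds_outside_source(gtf_cds, gff_cds):
    """Return GTF CDS intervals not contained in any GFF CDS interval."""
    return [
        (s, e)
        for s, e in gtf_cds
        if not any(gs <= s and e <= ge for gs, ge in gff_cds)
    ]


# CDS intervals for FBtr0335486 from the GFF and from the FlyBase GTF.
gff_cds = [
    (7613405, 7614199),
    (7614326, 7614695),
    (7614843, 7615444),
    (7615576, 7615578),
]
flybase_gtf_cds = [
    (7613405, 7614199),
    (7614326, 7614695),
    (7614843, 7615444),
    (7615575, 7615575),
]
suspicious = cds_outside_source(flybase_gtf_cds, gff_cds)
print(suspicious)
```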
I also looked at Ensembl's GTF file. There the stop_codon is removed completely and the 3UTR starts where the stop_codon should start; the last CDS is removed as well. Ensembl's gtf is therefore also a bit suspicious, as there is no stop_codon for that particular gene or for the other 22 cases.
I also looked at UCSC's annotation (dm3), downloaded from tophat, and there the stop_codon is exactly as I calculate it.
My question is: is this an error by FlyBase/Ensembl, and how should this be done correctly?
Many thanks indeed for any insight into this one.