  • saskak
    Member
    • Mar 2011
    • 10

    Something wrong in FlyBase's gtf (gff to gtf conversion)

    Hi All,

    I wanted to re-create FlyBase's gtf (FB2015_04) from the gff file. It is a different matter why I want to do that.

    So I parsed the dmel-all-r6.07.gff to a gtf file using my own program. I found a few genes/transcripts that are not what I expected. Bear with me on that one. For the sake of simplicity I am giving an example for one gene, but there are 23 such cases.

    In the gff file, these are the lines for gene FBgn0031926 and transcript FBtr0335486, excluding a few irrelevant ones.

    2L FlyBase CDS 7613405 7614199 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614326 7614695 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614843 7615444 . + 2 Parent=FBtr0335486
    2L FlyBase CDS 7615576 7615578 . + 0 Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7615579 7615967 . + . Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7616117 7616533 . + . Parent=FBtr0335486
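
    Because one gff CDS line can belong to several transcripts (note the comma-separated Parent values), a converter first has to collect the CDS segments per transcript. A minimal Python sketch of that step follows. It is not the program used for this post: the function name `cds_by_transcript` is hypothetical, and it assumes tab-delimited, nine-column GFF3 lines with a `Parent=` attribute.

```python
from collections import defaultdict


def cds_by_transcript(gff_lines):
    """Group CDS intervals by parent transcript ID.

    Assumes tab-delimited GFF3 lines with nine columns and a Parent=
    attribute that may list several comma-separated transcript IDs.
    """
    cds = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                cds[parent].append((start, end, strand))
    # Sort each transcript's segments by genomic start coordinate.
    return {tid: sorted(segs) for tid, segs in cds.items()}


example = [
    "2L\tFlyBase\tCDS\t7613405\t7614199\t.\t+\t0\tParent=FBtr0079472,FBtr0335486",
    "2L\tFlyBase\tCDS\t7614843\t7615444\t.\t+\t2\tParent=FBtr0335486",
]
grouped = cds_by_transcript(example)
print(grouped["FBtr0335486"])
```

    The shared first CDS ends up under both transcripts, while the third CDS is attached to FBtr0335486 only.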


    The start_codon and the first three CDS features are not a problem. The problem comes when one tries to create the stop_codon. The last CDS (7615576-7615578) is exactly the stop codon, so the stop_codon becomes:
    2L FlyBase stop_codon 7615576 7615578

    Then one has to delete the last CDS (7615576-7615578), as it is just the stop_codon. This is how I parse it:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 . + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";


    Everything is as it should be. Nevertheless, the FlyBase gtf file for this transcript has the following:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 7 + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7615575 7615575 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";


    Look at the last CDS (7615575-7615575): it covers a single base from the intronic region. Either I am misreading the GTF specification (http://mblab.wustl.edu/GTF22.html) or FlyBase is doing something different from what the specification describes.
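
    The conversion described above (the gff CDS includes the stop codon, the GTF CDS does not) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual converter: `split_stop_codon` is a hypothetical name, coordinates are 1-based inclusive, and the coding sequence is assumed to be complete, so its last three bases are the stop codon.

```python
def split_stop_codon(cds_segments, strand):
    """Split GFF3-style CDS segments into GTF CDS and stop_codon segments.

    cds_segments: list of (start, end) tuples, 1-based inclusive, whose
    total span includes the stop codon.
    Returns (gtf_cds, stop_codon), both sorted by genomic start.
    """
    # Walk from the 3' end of the coding sequence towards the 5' end.
    ordered = sorted(cds_segments, reverse=(strand == "+"))
    remaining = 3  # bases of the stop codon still to assign
    gtf_cds, stop_codon = [], []
    for start, end in ordered:
        take = min(remaining, end - start + 1)
        if take:
            if strand == "+":
                stop_codon.append((end - take + 1, end))
                end -= take
            else:
                stop_codon.append((start, start + take - 1))
                start += take
            remaining -= take
        if start <= end:  # drop segments used up entirely by the stop codon
            gtf_cds.append((start, end))
    return sorted(gtf_cds), sorted(stop_codon)


cds = [
    (7613405, 7614199),
    (7614326, 7614695),
    (7614843, 7615444),
    (7615576, 7615578),
]
gtf_cds, stop = split_stop_codon(cds, "+")
print(gtf_cds)
print(stop)
```

    For FBtr0335486 this yields three CDS segments ending at 7615444 and a stop_codon at 7615576-7615578, with no CDS base at 7615575. The loop also covers stop codons that are split across two segments.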

    I also looked at Ensembl's GTF file and there they completely remove the stop_codon and the 3UTR starts from where the stop_codon should start. They have also removed the last CDS. Ensembl's gtf is also a bit suspicious, as there is no stop_codon for that particular gene and the other 22 cases.

    I also looked at UCSC's gtf (dm3), downloaded from tophat, and there the stop_codon matches what I calculate.

    My question is, is this an error by FlyBase/Ensembl and how should this be correctly done?

    Many thanks indeed for any insight into this one.
  • saskak
    Member
    • Mar 2011
    • 10

    #2
    Additional frame inconsistencies

    Unfortunately, no one has suggested a reasonable explanation for my previous problems.

    In addition, I found a few frame inconsistencies, i.e. in column 8 (counting from 1).

    For gene FBgn0033313 and transcript FBtr0305081, something is not quite right with the frame (column 8) of the start_codons.
    The first few CDS lines in the gff for this gene and transcript read:

    2R FlyBase CDS 8616078 8616078 . + 0 Parent=FBtr0305081
    2R FlyBase CDS 8616327 8616516 . + 2 Parent=FBtr0310448,FBtr0310449,FBtr0305081
    2R FlyBase CDS 8616700 8618171 . + 1 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082
    2R FlyBase CDS 8618234 8618461 . + 2 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082


    I parsed to:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    In FlyBase's gtf, however, the second start_codon has a different frame:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 1 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 15 + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 15 + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    Note that the frame is 1 for start_codon 8616327-8616328. This piece of the start_codon holds the remaining two bases of a codon whose first base lies at 8616078, so according to the GTF2.2 guidelines the frame should be 2, i.e. the third base counted from the start of the feature is the start of a codon. This is not the only case of such mis-framing; I count quite a few.
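
    The frame rule used here can be written down as a short Python sketch. The function name `frames` is hypothetical, and the sketch assumes that the first segment begins at the first base of a codon and that column 8 counts the bases to skip before the next codon starts: frame = (3 - bases_so_far % 3) % 3.

```python
def frames(segments, strand):
    """Return ((start, end), frame) for each segment in transcript order.

    segments: (start, end) pieces of one coding feature set (all CDS of a
    transcript, or the pieces of a split start_codon), 1-based inclusive,
    beginning at the first base of a codon.
    """
    ordered = sorted(segments, reverse=(strand == "-"))
    result, consumed = [], 0
    for start, end in ordered:
        # Bases needed to finish the codon left open by earlier segments.
        result.append(((start, end), (3 - consumed % 3) % 3))
        consumed += end - start + 1
    return result


start_codon = frames([(8616078, 8616078), (8616327, 8616328)], "+")
print(start_codon)

cds_frames = [
    frame
    for _, frame in frames(
        [
            (7613405, 7614199),
            (7614326, 7614695),
            (7614843, 7615444),
            (7615576, 7615578),
        ],
        "+",
    )
]
print(cds_frames)
```

    With this rule the second start_codon piece gets frame 2, and the CDS frames of FBtr0335486 come out as 0, 0, 2, 0, which agrees with column 8 of the gff lines in the first post.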

    I checked this in Ensembl's gtf and this appears to be 2 as I parsed it. Do you think I should contact FlyBase to inquire about these?

    Many thanks indeed for any help.


    • westerman
      Rick Westerman
      • Jun 2008
      • 1104

      #3
      Originally posted by saskak
      I checked this in Ensembl's gtf and this appears to be 2 as I parsed it. Do you think I should contact FlyBase to inquire about these?
      Yes. They will know their dataset better than most of us on SeqAnswers. If there is a problem then they will appreciate knowing about it.


      • saskak
        Member
        • Mar 2011
        • 10

        #4
        Solved

        I contacted FlyBase, and it turned out they had a bug (or bugs) in their annotation pipeline. It should be fixed in the 6.08 gtf file.


        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #5
          Thanks for the follow-up and for getting this corrected!
