  • Something wrong in FlyBase's gtf (gff to gtf conversion)

    Hi All,

    I wanted to re-create FlyBase's gtf (FB2015_04) from the gff file. It is a different matter why I want to do that.

    So I parsed the dmel-all-r6.07.gff to a gtf file using my own program. I found a few genes/transcripts that are not what I expected. Bear with me on that one. For the sake of simplicity I am giving an example for one gene, but there are 23 such cases.

    In the gff file, these are the lines for gene FBgn0031926 and transcript FBtr0335486, excluding a few irrelevant ones.

    2L FlyBase CDS 7613405 7614199 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614326 7614695 . + 0 Parent=FBtr0079472,FBtr0335486
    2L FlyBase CDS 7614843 7615444 . + 2 Parent=FBtr0335486
    2L FlyBase CDS 7615576 7615578 . + 0 Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7615579 7615967 . + . Parent=FBtr0335486
    2L FlyBase three_prime_UTR 7616117 7616533 . + . Parent=FBtr0335486
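
A minimal sketch of how CDS coordinates can be collected per transcript from lines like the ones above. It assumes the real file is tab-separated GFF3 with a comma-separated Parent attribute in column 9; the name cds_by_transcript is hypothetical and is not the program used in this post.

```python
from collections import defaultdict


def cds_by_transcript(gff_lines):
    """Group CDS (start, end) pairs by each transcript listed in Parent."""
    cds = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and GFF comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # A CDS shared by several transcripts lists them comma-separated.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                cds[parent].append((int(cols[3]), int(cols[4])))
    return {tx: sorted(segments) for tx, segments in cds.items()}
```

A shared CDS line is copied to every transcript named in Parent, which is why FBtr0079472 and FBtr0335486 both receive the first two segments.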


    The start_codon and the first three CDS segments are not a problem. The problem comes when one tries to create the stop_codon. The last CDS (7615576-7615578) is exactly three bases long, so it is the stop codon itself. From that, the stop_codon becomes:
    2L FlyBase CDS 7615576 7615578

    Then one has to delete the last CDS (7615576-7615578), because it is just the stop_codon and GTF 2.2 does not include the stop codon in the CDS. This is how I parse it:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 . + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 . + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
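
The stop_codon step described above can be sketched as follows. This is only an illustration, assuming 1-based inclusive coordinates and a complete CDS whose last three coding bases are the stop codon; split_stop_codon is a hypothetical helper name, not the program used in this post.

```python
def split_stop_codon(cds, strand):
    """Split the last 3 coding bases (the stop codon) off a list of CDS segments.

    cds: list of (start, end) tuples, 1-based inclusive.
    Returns (trimmed_cds, stop_codon), both sorted by genomic start.
    """
    # Visit segments starting from the 3' end of the transcript:
    # highest coordinates first on '+', lowest first on '-'.
    ordered = sorted(cds, reverse=(strand == "+"))
    remaining = 3  # stop-codon bases still to be assigned
    trimmed, stop = [], []
    for start, end in ordered:
        length = end - start + 1
        if remaining == 0:
            trimmed.append((start, end))
        elif length <= remaining:
            # The whole segment belongs to the stop codon and is dropped from CDS.
            stop.append((start, end))
            remaining -= length
        else:
            # Only the 3' end of this segment belongs to the stop codon.
            if strand == "+":
                stop.append((end - remaining + 1, end))
                trimmed.append((start, end - remaining))
            else:
                stop.append((start, start + remaining - 1))
                trimmed.append((start + remaining, end))
            remaining = 0
    return sorted(trimmed), sorted(stop)
```

For the four CDS segments of FBtr0335486 on the + strand, this returns the first three segments unchanged as CDS and (7615576, 7615578) as the stop_codon, with no CDS segment left in the intron.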


    As far as I can tell, everything is as it should be. However, the FlyBase gtf file for this transcript has the following:

    2L FlyBase start_codon 7613405 7613407 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7613405 7614199 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614326 7614695 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7614843 7615444 7 + 2 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase CDS 7615575 7615575 7 + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase stop_codon 7615576 7615578 . + 0 gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7615579 7615967 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";
    2L FlyBase 3UTR 7616117 7616533 7 + . gene_id "FBgn0031926"; gene_symbol "CG6739"; transcript_id "FBtr0335486"; transcript_symbol "CG6739-RB";


    Look at the last CDS (7615575-7615575): it is a single base taken from the intronic region. Either I am misreading the GTF specification (http://mblab.wustl.edu/GTF22.html) or FlyBase builds this feature differently from how the specification describes it.
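
One way to flag such cases automatically is to check that every CDS segment in the gtf lies inside some CDS segment of the gff for the same transcript. A minimal sketch, assuming both inputs are lists of 1-based inclusive (start, end) tuples; cds_outside_source is a hypothetical name.

```python
def cds_outside_source(gtf_cds, gff_cds):
    """Return gtf CDS segments that are not fully contained in any gff CDS segment."""
    return [
        (start, end)
        for start, end in gtf_cds
        if not any(g_start <= start and end <= g_end for g_start, g_end in gff_cds)
    ]
```

For FBtr0335486, the only segment this reports is (7615575, 7615575), the single intronic base in FlyBase's gtf.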

    I also looked at Ensembl's GTF file and there they completely remove the stop_codon and the 3UTR starts from where the stop_codon should start. They have also removed the last CDS. Ensembl's gtf is also a bit suspicious, as there is no stop_codon for that particular gene and the other 22 cases.

    I also looked at UCSC's gtf (dm3), downloaded from tophat, and there the stop_codon matches what I calculate.

    My question is: is this an error by FlyBase/Ensembl, and how should this be done correctly?

    Many thanks indeed for any insight into this one.

  • #2
    Additional frame inconsistencies

    Unfortunately, no one has suggested a reasonable explanation for my previous problem.

    In addition to that, I also found a few frame inconsistencies, i.e. in column 8 (counting from 1).

    For the gene: FBgn0033313 and transcript: FBtr0305081 there is something not quite right with the frame of the start_codons, i.e. column 8.
    The gff for this gene and transcript reads for the first few CDS:

    2R FlyBase CDS 8616078 8616078 . + 0 Parent=FBtr0305081
    2R FlyBase CDS 8616327 8616516 . + 2 Parent=FBtr0310448,FBtr0310449,FBtr0305081
    2R FlyBase CDS 8616700 8618171 . + 1 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082
    2R FlyBase CDS 8618234 8618461 . + 2 Parent=FBtr0290112,FBtr0301363,FBtr0310448,FBtr0310449,FBtr0305080,FBtr0305081,FBtr0305082


    I parsed to:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 . + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


    However, in FlyBase's gtf the second start_codon has a different frame:

    2R FlyBase start_codon 8616078 8616078 . + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase start_codon 8616327 8616328 . + 1 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616078 8616078 15 + 0 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";
    2R FlyBase CDS 8616327 8616516 15 + 2 gene_id "FBgn0033313"; gene_symbol "Cirl"; transcript_id "FBtr0305081"; transcript_symbol "Cirl-RG";


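    Note that the frame is 1 for start_codon 8616327-8616328. The first start_codon segment (8616078) holds only one base of the codon, so this second segment holds the remaining two, and the next codon begins two bases into the feature. According to the GTF 2.2 guidelines, the frame should therefore be 2, which also matches the frame of the CDS segment that starts at the same position. This is not the only case of such mis-framing; I count quite a few.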
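
The frame values can be recomputed from segment lengths alone. The sketch below follows the GTF 2.2 definition (frame is the number of bases to skip before the first complete codon in the feature) and assumes a complete CDS that begins on the first base of the start codon; segment_frames is a hypothetical helper name.

```python
def segment_frames(segments, strand="+"):
    """Return the frame of each segment, listed in transcription order (5' to 3').

    segments: list of (start, end) tuples, 1-based inclusive.
    """
    # On the '-' strand the 5' end is the segment with the highest coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    frames = []
    consumed = 0  # coding bases seen in earlier segments
    for start, end in ordered:
        # Bases needed to finish the codon left open by the previous segment.
        frames.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return frames
```

Applied to the two start_codon segments above, (8616078, 8616078) and (8616327, 8616328), this gives frames 0 and 2. Applied to the four CDS segments from the gff, it gives 0, 2, 1, 2, which agrees with column 8 of the gff lines.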

    I checked this in Ensembl's gtf, and there the frame is 2, as I parsed it. Do you think I should contact FlyBase to inquire about these?

    Many thanks indeed for any help.

    • #3
      Originally posted by saskak
      I checked this in Ensembl's gtf, and there the frame is 2, as I parsed it. Do you think I should contact FlyBase to inquire about these?
      Yes. They will know their dataset better than most of us on SeqAnswers. If there is a problem then they will appreciate knowing about it.

      • #4
        Solved

        I contacted FlyBase, and it turned out they had one or more bugs in their annotation pipeline. It should be fixed in the 6.08 gtf file.

        • #5
          Thanks for the follow up and getting this corrected!
