  • How to read frames in .gff file format

    Hi,

    I need help to understand the .gff file format.
    I want to develop a database for a microorganism based on the information in a .gff file. The problem is that I don't understand the frame column, which contains either a dot (.), 0, 1 or 2. I need to understand this so that I can proceed with the data processing needed to build the input file for the database. I have read about this on several websites and tried to apply what I understood, but my output is still wrong.


    I need to arrange my CDS accordingly. Can anyone here explain this clearly? Thanks.

  • #2
    What is recorded in a GFF (or GTF) file is not frame information; it is codon phase information and only applies to CDS features. Here is the description from the GFF definition page at the GMOD site:
    Column 8: "phase"

    For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. If there is no phase, put a "." (a period) in this field.

    For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

    The phase is required for all CDS features.
    As the description above says, you can determine the reading frame of a transcript by taking the start position modulo 3 (the remainder after dividing by 3); for features on the minus strand it would be (chromosome size - start) modulo 3.
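To make the column 8 definition concrete, here is a minimal Python sketch. It is not from the GFF documentation or any library: the function names `parse_gff_line` and `first_full_codon_start` are hypothetical, and the sketch assumes a well-formed, tab-separated line with nine columns. It reads the phase and reports the coordinate of the first base of the first complete codon, counting from the start field on the forward strand and from the end field on the reverse strand.

```python
def parse_gff_line(line):
    """Split one tab-separated GFF feature line into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        # A "." in column 8 means the feature has no phase.
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


def first_full_codon_start(feature):
    """1-based coordinate of the first base of the first complete codon."""
    if feature["phase"] is None:
        raise ValueError("feature has no phase")
    if feature["strand"] == "-":
        # Reverse strand: phase is counted from the end field.
        return feature["end"] - feature["phase"]
    # Forward strand: phase is counted from the start field.
    return feature["start"] + feature["phase"]


line = "chrVII\tSGD\tCDS\t727358\t727730\t.\t+\t1\tName=YGR118W_CDS;Parent=58302;ID=58300"
cds = parse_gff_line(line)
print(cds["phase"], first_full_codon_start(cds))  # 1 727359
```

The bases before that coordinate are still part of the CDS; they belong to a codon that started in the preceding CDS segment.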

    • #3
      Hi kmcarr,

      Thanks so much for your prompt response.
      I had read that before but did not really understand it. With your explanation about the modulo, it is clearer now. I will try it and see how it goes. Thanks.

      • #4
        Hello,

        I thought I'd post a reply here instead of starting a new thread. I'm currently trying to extract sequences for CDS features from a GFF file, and I run into problems when a single mRNA contains several CDS lines that together encode one and the same protein. An example would be:

        chrVII SGD mRNA 726974 727730 . + . Name=YGR118W_mRNA;Parent=58298;ID=58302
        chrVII SGD CDS 726974 727038 . + 0 Name=YGR118W_CDS;Parent=58302;ID=58299;orf_classification=Verified
        chrVII SGD CDS 727358 727730 . + 1 Name=YGR118W_CDS;Parent=58302;ID=58300;orf_classification=Verified
        chrVII SGD intron 727039 727357 . + . Name=YGR118W_intron;Parent=58302;ID=58301;orf_classification=Verified

        From the definition of phase I understand that I have to skip the first base of the second CDS and start translating only after it. But this gives (a) incorrect length of the joint CDS, i.e. not divisible by 3; (b) incorrect translation into amino acids.

        In most cases (except ca. 40 sequences in A.thaliana) a correct result is obtained by simply concatenating the two CDSs without using the phase(s) at all.

        I would appreciate any pointers towards a correct understanding of the phase field; right now I don't seem to understand its purpose at all.

        Kind regards,
        Alexey

        • #5
          Alexey,

          When you say 'skip the first base' I get the feeling you mean you are discarding it when performing the translation. Phase doesn't mean you should disregard any of the nucleotides within a CDS; it merely tells you how the reading frame of the overall mRNA relates to that particular coding exon.

          Let's look at your example. The first CDS is 65 nt long, which divided by 3 (into codons) leaves a remainder of 2 nt. You pick up the first nt of the next CDS to complete that codon, then continue the translation in CDS #2 from the second position (phase=1). CDS #2 is 373 nt long; taken together the complete CDS is 438 nt, yielding a protein of 146 amino acids (or 145 + stop codon).

          So another way to think of the phase value is: how many nt from the 5' end of this exon do I have to use to complete the final codon from the previous exon?
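The arithmetic above can be checked with a short Python sketch. It is only an illustration: `expected_phases` is a hypothetical helper, and it assumes the first CDS segment in transcript order begins a codon (phase 0), which may not hold for partial annotations. It walks the CDS segments in transcript order, derives the phase each one should carry from the number of coding bases before it, and returns the total coding length.

```python
def expected_phases(cds_segments, strand="+"):
    """Return (phases, total_length) for a list of (start, end) CDS segments.

    Coordinates are 1-based and inclusive. Assumes the first segment in
    transcript order starts a codon, i.e. has phase 0.
    """
    # Transcript order is ascending on "+", descending on "-".
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    phases = []
    coding_bases = 0
    for start, end in ordered:
        # Bases needed from this segment to finish the codon left open so far.
        phases.append((3 - coding_bases % 3) % 3)
        coding_bases += end - start + 1
    return phases, coding_bases


# The two YGR118W CDS segments from the example above.
segments = [(726974, 727038), (727358, 727730)]
phases, total = expected_phases(segments)
print(phases, total, total // 3)  # [0, 1] 438 146
```

No base is dropped at any point: the segments are simply concatenated, and the phase falls out of the running length.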

          • #6
            kmcarr,

            Thank you for spelling the example out for me. I indeed thought that if phase=n, then I have to skip the first n nucleotides. But your last lines put everything into place.

            The other part of the problem is that I'm still sometimes getting entries like:

            Chr1 TAIR10 CDS 873435 874832 0.0 0 Parent=AT1G03495.1,AT1G03495.1-Protein;

            In this case it's a lone entry that has no neighboring CDS but has a length of 1397 bp. In reality it should correspond to a gene of 1398 bp, but I guess I should blame that on errors in my reference sequences.

            • #7
              Alexey,

              No, those coordinates do define a CDS of 1,398nt. Remember that the coordinates are inclusive, so the formula to calculate feature length is:

              end - (start - 1), or in your example 874,832 - (873,435 - 1) = 1,398
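The same inclusive-coordinate formula as a small Python sketch; `feature_length` is a hypothetical helper name, not part of any GFF library.

```python
def feature_length(start, end):
    """Length of a feature with 1-based, inclusive start and end coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


print(feature_length(873435, 874832))  # 1398
print(feature_length(726974, 727038))  # 65, the first YGR118W CDS
```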

              • #8
                You're right. But in my previous post I accidentally pasted a BioJava representation of a GFF feature instead of an actual line from the GFF file. Sorry for the mix-up. Below is the GFF feature as it appears in the file:

                Chr1 TAIR10 CDS 873436 874832 . + 0 Parent=AT1G03495.1,AT1G03495.1-Protein;

                And the calculation hence:

                In [57]: 874832 - 873436 + 1
                Out[57]: 1397

                Kind regards,
                Alexey.
