**kmcarr** · 02-16-2011, 11:33 AM

What is recorded in a GFF (or GTF) file is not frame information; it is codon phase information and only applies to CDS features. Here is the description from the GFF definition page at the GMOD site:

Column 8: "phase"

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. If there is no phase, put a "." (a period) in this field.

For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

The phase is required for all CDS features.

As the description above says, you can determine the reading frame for a transcript by taking the start position modulo 3 (the remainder after dividing by 3); for features on the minus strand it would be (chromosome size - start) modulo 3.
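To make the quoted definition concrete, here is a minimal Python sketch of how the phase column locates the first complete codon in a CDS feature. The function name `first_codon_start` is hypothetical and not part of any GFF library; it only encodes the rule quoted above (count from the start field on the forward strand, from the end field on the reverse strand), using 1-based inclusive coordinates.

```python
def first_codon_start(start, end, strand, phase):
    """Return the 1-based coordinate of the first base of the first
    complete codon inside a CDS feature, given its phase.

    Hypothetical helper: it only applies the rule from the GFF
    definition quoted above.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    if strand == "+":
        # Forward strand: phase is counted from the start field.
        return start + phase
    if strand == "-":
        # Reverse strand: phase is counted from the end field.
        return end - phase
    raise ValueError("strand must be '+' or '-'")


# A forward-strand CDS starting at 727358 with phase 1:
print(first_codon_start(727358, 727730, "+", 1))  # 727359
```

Note that the bases before that coordinate are still part of the coding sequence; they belong to a codon that began in the previous CDS feature.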

**redse171** · 02-16-2011, 01:51 PM

Hi kmcarr,

Thanks so much for your prompt response.
I read that one before but did not really understand it. With your explanation about the modulo, it is clearer now. I will try it and see how it goes. Thanks.

**AlexeyG** · 02-28-2012, 07:53 AM

Hello,

I thought I'd post a reply here instead of starting a new thread. I'm currently trying to select sequences for CDS features out of a GFF file, and I run into problems when a single mRNA contains several CDS features that together encode one and the same protein. An example would be:

chrVII SGD mRNA 726974 727730 . + . Name=YGR118W_mRNA;Parent=58298;ID=58302
chrVII SGD CDS 726974 727038 . + 0 Name=YGR118W_CDS;Parent=58302;ID=58299;orf_classification=Verified
chrVII SGD CDS 727358 727730 . + 1 Name=YGR118W_CDS;Parent=58302;ID=58300;orf_classification=Verified
chrVII SGD intron 727039 727357 . + . Name=YGR118W_intron;Parent=58302;ID=58301;orf_classification=Verified

From the definition of phase I understand that I have to skip the first base of the second CDS and start translating only after it. But this gives (a) an incorrect length of the joined CDS, i.e. one not divisible by 3, and (b) an incorrect translation into amino acids.

In most cases (except ca. 40 sequences in A.thaliana) a correct result is obtained by simply concatenating the two CDSs without using the phase(s) at all.

I would appreciate any pointers towards a correct understanding of the phase field; right now it seems that I don't understand its purpose at all.

Kind regards,
Alexey

**kmcarr** · 02-28-2012, 08:32 AM

Alexey,

When you say 'skip the first base' I get the feeling you mean you are discarding it when performing the translation. Phase doesn't mean you should disregard any of the nucleotides within a CDS; it merely tells you how the reading frame of the overall mRNA relates to that particular coding exon.

Let's look at your example. The first CDS is 65 nt long, which divided by 3 (into codons) leaves a remainder of 2 nt. You pick up the first nt of the next CDS to complete that codon, then continue the translation in CDS #2 from the second position (phase=1). CDS #2 is 373 nt long; taken together the complete CDS is 438 nt, yielding a protein of 146 amino acids (or 145 + stop codon).

So another way to think of the phase value is: how many nt from the 5' end of this exon do I have to use to complete the final codon from the previous exon?
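The arithmetic in this example can be sketched in a few lines of Python. The helper `expected_phases` is hypothetical; it assumes the CDS intervals are given in 5' to 3' order with 1-based inclusive coordinates, and that the first CDS starts in phase 0.

```python
def expected_phases(cds_intervals):
    """Given CDS (start, end) pairs in transcription order, return the
    phase each CDS should carry and the total coding length.

    Hypothetical helper; assumes the first CDS begins a codon (phase 0).
    """
    phases = []
    consumed = 0  # coding bases seen in earlier CDS features
    for start, end in cds_intervals:
        # Bases still needed to finish the codon left open by the
        # previous CDS features (0 if they ended on a codon boundary).
        phases.append((3 - consumed % 3) % 3)
        consumed += end - start + 1  # coordinates are inclusive
    return phases, consumed


# The two YGR118W CDS features from the example above.
cds = [(726974, 727038), (727358, 727730)]
phases, total = expected_phases(cds)
print(phases)               # [0, 1]
print(total, total // 3)    # 438 146
```

No base is dropped: the joined CDS is simply the concatenation of both features, and the phase only records where codon boundaries fall within each one.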

**AlexeyG** · 02-28-2012, 09:20 AM

kmcarr,

Thank you for spelling the example out for me. I indeed thought that if phase=n, then I have to skip the first n nucleotides. But your last lines put everything into place.

The other part of the problem is that I'm still sometimes getting entries like:

Chr1 TAIR10 CDS 873435 874832 0.0 0 Parent=AT1G03495.1,AT1G03495.1-Protein;

In this case it's a lone entry that does not have any neighboring CDS, but it has a length of 1397 bp. In reality it should correspond to a gene of 1398 bp, but I guess I should blame this on errors in my reference sequences.

**kmcarr** · 02-28-2012, 09:28 AM

Alexey,

No, those coordinates do define a CDS of 1,398nt. Remember that the coordinates are inclusive, so the formula to calculate feature length is:

end - (start - 1), or in your example: 874,832 - (873,435 - 1) = 1,398
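As a quick check of that formula, here is a short Python sketch. The function name `feature_length` is hypothetical; it only encodes the inclusive-coordinate rule stated above.

```python
def feature_length(start, end):
    """Length of a GFF feature with 1-based, inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - (start - 1)


length = feature_length(873435, 874832)
print(length)       # 1398
print(length % 3)   # 0, i.e. a whole number of codons
```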

**AlexeyG** · 02-29-2012, 03:18 AM

You're right. But in the previous post I accidentally pasted a BioJava representation of a GFF feature instead of an actual line from the GFF file. Sorry for the mix-up. Below is the GFF feature as it appears in the file:

Chr1 TAIR10 CDS 873436 874832 . + 0 Parent=AT1G03495.1,AT1G03495.1-Protein;

And the calculation hence:

In [57]: 874832 - 873436 + 1
Out[57]: 1397

Kind regards,
Alexey.

How to read frames in .gff file format
