Hi gang,
I"m working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer with forum searches or looking through the manual.
After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.
For example, here is a transcript with the additional -Protein lines:
... and here is one without...
Sometimes only lines specific to the Protein are printed. My question is why would a transcript occasionally be predicted without a corresponding protein. Or, why would a protein be predicted without a corresponding transcript?
This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:
In this case three lines are used to track two isoforms, and the transcript-level and the protein-level records from the three samples have been mixed, and even associated with the wrong isoform.
I'll also add that the tss_id and p_id don't do much to clear up the picture. Here's the corresponding entries in the combined.gtf file:
Why would TCONS_00..13 not be assigned a tss_id, but the same p_id as the 2nd isoform? Are all -Protein predictions assigned only p_ids and no tss_id ?
I"m now in the process of going through the tracking file with a perl script to make sure situations like this are sorted out, but I"m wondering if there's a reason why this might be expected.
Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.
Cheers,
Jonathan
I"m working through an RNA-seq project with tophat-cufflinks and I've come across a question that I haven't been able to answer with forum searches or looking through the manual.
After running cufflinks using a reference GTF, and looking at the transcripts.gtf file, I see transcripts both with and without an additional -Protein record. If I understand correctly, the -Protein transcript (which almost always has a CUFF.* gene name as opposed to the reference gene name) gives FPKM values for only the coding region of the transcript. However, not all genes have this additional -Protein listing.
For example, here is a transcript with the additional -Protein lines:
Code:
Chr1 Cufflinks transcript 471990 473160 1000 - . gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879"; Chr1 Cufflinks exon 471990 472507 1000 - . gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "1"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879"; Chr1 Cufflinks exon 472668 473160 1000 - . gene_id "AT1G02360.1"; transcript_id "AT1G02360.1"; exon_number "2"; FPKM "1.6172080056"; frac "0.136045"; conf_lo "1.443829"; conf_hi "1.790587"; cov "6.342879"; Chr1 Cufflinks transcript 472138 473116 1000 - . gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685"; Chr1 Cufflinks exon 472138 472507 1000 - . gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "1"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685"; Chr1 Cufflinks exon 472668 473116 1000 - . gene_id "CUFF.490"; transcript_id "AT1G02360.1,AT1G02360.1-Protein"; exon_number "2"; FPKM "13.2205860354"; frac "0.863955"; conf_lo "11.380920"; conf_hi "15.060252"; cov "51.852685";
Code:
Chr1 Cufflinks transcript 61963 63811 1000 - . gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969"; Chr1 Cufflinks exon 61963 62124 1000 - . gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "1"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";Chr1 Cufflinks exon 63431 63811 1000 - . gene_id "AT1G01130.1"; transcript_id "AT1G01130.1"; exon_number "2"; FPKM "0.9643481790"; frac "1.000000"; conf_lo "0.466361"; conf_hi "1.462335"; cov "3.918969";
This becomes a larger issue when comparing samples with cuffcompare and lines like the following show up in a tracking file when comparing 3 samples:
Code:
TCONS_00000013 XLOC_000012 AT1G01260|AT1G01260.1 = q1:AT1G01260.1|AT1G01260.1|100|1.636771|1.424183|1.849358|6.358318|2578 q2:CUFF.112|AT1G01260.2,AT1G01260.2-Protein|100|8.204362|7.299812|9.108912|34.086923|1773 - TCONS_00000014 XLOC_000012 AT1G01260|AT1G01260.2 = q1:AT1G01260.2|AT1G01260.2|100|0.472212|0.422116|0.522308|1.834390|2378 q2:AT1G01260.2|AT1G01260.2|100|0.995585|0.905067|1.086104|4.136391|2378 - TCONS_00000015 XLOC_000012 AT1G01260|AT1G01260.1 = q1:CUFF.90|AT1G01260.2,AT1G01260.2-Protein|100|4.929213|4.241811|5.616615|19.148380|1773 - q3:CUFF.95|AT1G01260.2,AT1G01260.2-Protein|100|8.246326|7.368782|9.123870|34.015752|1773
I'll also add that the tss_id and p_id don't do much to clear up the picture. Here's the corresponding entries in the combined.gtf file:
Code:
Chr1 Cufflinks exon 109032 111609 . + . gene_id "XLOC_000012"; transcript_id "TCONS_00000013"; exon_number "1"; oId "AT1G01260.1"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12"; Chr1 Cufflinks exon 109076 109330 . + . gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "1"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12"; Chr1 Cufflinks exon 109413 111535 . + . gene_id "XLOC_000012"; transcript_id "TCONS_00000014"; exon_number "2"; oId "AT1G01260.2"; nearest_ref "AT1G01260.2"; class_code "="; tss_id "TSS10"; p_id "P12"; Chr1 Cufflinks exon 109595 111367 . + . gene_id "XLOC_000012"; transcript_id "TCONS_00000015"; exon_number "1"; oId "AT1G01260.2,AT1G01260.2-Protein"; nearest_ref "AT1G01260.1"; class_code "="; p_id "P12";
I"m now in the process of going through the tracking file with a perl script to make sure situations like this are sorted out, but I"m wondering if there's a reason why this might be expected.
Or am I just doing something wrong? Thanks for any feedback you might have... that's a lot of questions for one post.
Cheers,
Jonathan
Comment