Hello, I have been lurking here a bit, but haven't posted yet, so bear with me.
I am trying to recreate the calculations of FPKM and the test statistic performed by uffdiff on two sample genes in order to better understand how my RNA-seq data has been tested (and thus better explain my data). I have been reading through the supplemental information from the paper linked on the Cufflinks website (Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation; Nature Biotechnology 28,511–515(2010)doi:10.1038/nbt.1621), and I have been able to get through part of the calculations, but am still stumped.
To calculate FPKM, I am using the following formula:
(10^9)*Xg*(gamma t)
~l(t)*M
Where:
Xg is the number of fragments aligned to the gene locus (g),
gamma t is the maximum likelihood estimate of the probability of selecting a fragment from a transcript (t) coming from that gene locus
~l(t) is the adjusted transcript length
Σ from i=1 to t(i)) [F(i)(l(t)-i+1)]
F(i) = probability that the fragment has a length i
l(t) = the transcript length
M is the total number of fragments mapped in that sample
My sample is run with single-end reads 50 bp long, so I simplified the ~l(t) to the transcript length - 49, and we ran cuffdiff using a GTF file that contains only the loci for one transcript for each gene, so I think gamma should be 1. One of my lab members wrote a program that assigns raw reads to individual genes, so I am using the reads from this program for the raw reads in the FPKM calculation. Using this data I get pretty close to the FPKM calculated:
The difference is most likely from my assumption that the ~l(t) is only looking at the 50 bp read length, or the assumption that gamma is 1.
From this, I get into much more complicated math when trying to recreate the test statistic. The formula for the test statistic, from the supplemental data, is:
[log(Xa)+log(gamma a)+ log(Mb) - log(Xb) - log(gamma b) - log (Ma)]
SQRT[ (psi a*(1+Xa)*(gamma a)^2)/(Xa*(gamma a)^2)+ (psi b*(1+Xb)*(gamma b)^2)/(Xb*(gamma b)^2)]
Where psi is a variance-covariance matrix that estimates the covariance between gamma (tk) and gamma (tl). As far as I can tell, tk and tl come from Tophat, which splits a read of length l into two smaller reads of length k in order to align across splice junctions. (This may be completely off base)
When I run the equations using these assumptions, though, I get a value far off from the test statistic calculated by Cuffdiff:
So, I suppose my question is, is there a way for me to calculate (or even estimate) the psi function? Am I making faulty assumptions?
I apologize if this question has been addressed in a previous thread. I did try to find the answer in the archives before asking here.
Thank you
I am trying to recreate the calculations of FPKM and the test statistic performed by uffdiff on two sample genes in order to better understand how my RNA-seq data has been tested (and thus better explain my data). I have been reading through the supplemental information from the paper linked on the Cufflinks website (Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation; Nature Biotechnology 28,511–515(2010)doi:10.1038/nbt.1621), and I have been able to get through part of the calculations, but am still stumped.
To calculate FPKM, I am using the following formula:
(10^9)*Xg*(gamma t)
~l(t)*M
Where:
Xg is the number of fragments aligned to the gene locus (g),
gamma t is the maximum likelihood estimate of the probability of selecting a fragment from a transcript (t) coming from that gene locus
~l(t) is the adjusted transcript length
Σ from i=1 to t(i)) [F(i)(l(t)-i+1)]
F(i) = probability that the fragment has a length i
l(t) = the transcript length
M is the total number of fragments mapped in that sample
My sample is run with single-end reads 50 bp long, so I simplified the ~l(t) to the transcript length - 49, and we ran cuffdiff using a GTF file that contains only the loci for one transcript for each gene, so I think gamma should be 1. One of my lab members wrote a program that assigns raw reads to individual genes, so I am using the reads from this program for the raw reads in the FPKM calculation. Using this data I get pretty close to the FPKM calculated:
Code:
gene Reads A Reads B Cuff A Cuff B Calc A Calc B A 2 81 0.021956 0.823028 0.02116 0.795515 B 1 40 0.007541 0.279201 0.007374 0.273778
From this, I get into much more complicated math when trying to recreate the test statistic. The formula for the test statistic, from the supplemental data, is:
[log(Xa)+log(gamma a)+ log(Mb) - log(Xb) - log(gamma b) - log (Ma)]
SQRT[ (psi a*(1+Xa)*(gamma a)^2)/(Xa*(gamma a)^2)+ (psi b*(1+Xb)*(gamma b)^2)/(Xb*(gamma b)^2)]
Where psi is a variance-covariance matrix that estimates the covariance between gamma (tk) and gamma (tl). As far as I can tell, tk and tl come from Tophat, which splits a read of length l into two smaller reads of length k in order to align across splice junctions. (This may be completely off base)
When I run the equations using these assumptions, though, I get a value far off from the test statistic calculated by Cuffdiff:
Code:
Gene T stat Cufflinks t-stat A 0.99128963 -3.42295 B 0.898803344 -3.07139
So, I suppose my question is, is there a way for me to calculate (or even estimate) the psi function? Am I making faulty assumptions?
I apologize if this question has been addressed in a previous thread. I did try to find the answer in the archives before asking here.
Thank you
Comment