Hello, I have been lurking here a bit, but haven't posted yet, so bear with me.

I am trying to recreate the calculations of FPKM and the test statistic performed by uffdiff on two sample genes in order to better understand how my RNA-seq data has been tested (and thus better explain my data). I have been reading through the supplemental information from the paper linked on the Cufflinks website (Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation; Nature Biotechnology 28,511–515(2010)doi:10.1038/nbt.1621), and I have been able to get through part of the calculations, but am still stumped.

To calculate FPKM, I am using the following formula:

~l(t)*M

Where:

Xg is the number of fragments aligned to the gene locus (g),

gamma t is the maximum likelihood estimate of the probability of selecting a fragment from a transcript (t) coming from that gene locus

~l(t) is the adjusted transcript length

Σ from i=1 to t(i)) [F(i)(l(t)-i+1)]

F(i) = probability that the fragment has a length i

l(t) = the transcript length

M is the total number of fragments mapped in that sample

My sample is run with single-end reads 50 bp long, so I simplified the ~l(t) to the transcript length - 49, and we ran cuffdiff using a GTF file that contains only the loci for one transcript for each gene, so I think gamma should be 1. One of my lab members wrote a program that assigns raw reads to individual genes, so I am using the reads from this program for the raw reads in the FPKM calculation. Using this data I get pretty close to the FPKM calculated:

The difference is most likely from my assumption that the ~l(t) is only looking at the 50 bp read length, or the assumption that gamma is 1.

From this, I get into much more complicated math when trying to recreate the test statistic. The formula for the test statistic, from the supplemental data, is:

SQRT[ (psi a*(1+Xa)*(gamma a)^2)/(Xa*(gamma a)^2)+ (psi b*(1+Xb)*(gamma b)^2)/(Xb*(gamma b)^2)]

Where psi is a variance-covariance matrix that estimates the covariance between gamma (tk) and gamma (tl). As far as I can tell, tk and tl come from Tophat, which splits a read of length l into two smaller reads of length k in order to align across splice junctions. (This may be completely off base)

When I run the equations using these assumptions, though, I get a value far off from the test statistic calculated by Cuffdiff:

So, I suppose my question is, is there a way for me to calculate (or even estimate) the psi function? Am I making faulty assumptions?

I apologize if this question has been addressed in a previous thread. I did try to find the answer in the archives before asking here.

Thank you

I am trying to recreate the calculations of FPKM and the test statistic performed by uffdiff on two sample genes in order to better understand how my RNA-seq data has been tested (and thus better explain my data). I have been reading through the supplemental information from the paper linked on the Cufflinks website (Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation; Nature Biotechnology 28,511–515(2010)doi:10.1038/nbt.1621), and I have been able to get through part of the calculations, but am still stumped.

To calculate FPKM, I am using the following formula:

__(10^9)*Xg*(gamma t)__~l(t)*M

Where:

Xg is the number of fragments aligned to the gene locus (g),

gamma t is the maximum likelihood estimate of the probability of selecting a fragment from a transcript (t) coming from that gene locus

~l(t) is the adjusted transcript length

Σ from i=1 to t(i)) [F(i)(l(t)-i+1)]

F(i) = probability that the fragment has a length i

l(t) = the transcript length

M is the total number of fragments mapped in that sample

My sample is run with single-end reads 50 bp long, so I simplified the ~l(t) to the transcript length - 49, and we ran cuffdiff using a GTF file that contains only the loci for one transcript for each gene, so I think gamma should be 1. One of my lab members wrote a program that assigns raw reads to individual genes, so I am using the reads from this program for the raw reads in the FPKM calculation. Using this data I get pretty close to the FPKM calculated:

Code:

gene Reads A Reads B Cuff A Cuff B Calc A Calc B A 2 81 0.021956 0.823028 0.02116 0.795515 B 1 40 0.007541 0.279201 0.007374 0.273778

From this, I get into much more complicated math when trying to recreate the test statistic. The formula for the test statistic, from the supplemental data, is:

__[log(Xa)+log(gamma a)+ log(Mb) - log(Xb) - log(gamma b) - log (Ma)]__SQRT[ (psi a*(1+Xa)*(gamma a)^2)/(Xa*(gamma a)^2)+ (psi b*(1+Xb)*(gamma b)^2)/(Xb*(gamma b)^2)]

Where psi is a variance-covariance matrix that estimates the covariance between gamma (tk) and gamma (tl). As far as I can tell, tk and tl come from Tophat, which splits a read of length l into two smaller reads of length k in order to align across splice junctions. (This may be completely off base)

When I run the equations using these assumptions, though, I get a value far off from the test statistic calculated by Cuffdiff:

Code:

Gene T stat Cufflinks t-stat A 0.99128963 -3.42295 B 0.898803344 -3.07139

So, I suppose my question is, is there a way for me to calculate (or even estimate) the psi function? Am I making faulty assumptions?

I apologize if this question has been addressed in a previous thread. I did try to find the answer in the archives before asking here.

Thank you

## Comment