Calculating p-values from FPKM?


  • Calculating p-values from FPKM?

    Oof! I'm on the last leg of my first project here at UCLA in the Pellegrini Lab. I'm supposed to figure out which genes are differentially expressed (say, with a p-value < 0.001). I used Fisher's Exact Test directly on the FPKM values (Fragments Per Kilobase of transcript per Million mapped reads); the calculation took about 5 seconds on my laptop, and I got very significant results. Very disturbing that it was so easy.

    So I started getting this inkling that maaaaybe I can't do the FET on FPKMs, since each gene is normalized by its own length, so the FPKM totals fall out of proportion.

    Question 1: Am I correct in thinking this?

    If answer(Question 1) == yes,
    I tried running the FET on the actual counts of each gene per million reads, and overnight my computer got through about 60 of them. The problem is, I have 27,644 genes to test. I doubt I'd get results in a reasonable amount of time even on a big cluster.

    Question 2: Does anybody have any suggestions or alternatives?

    Also, as a side note, I'm running the p-value calculations in MATLAB using DCFisherextest.m, which runs 2x2 contingency tables using the identity log10(x!) = gammaln(x+1)/log(10), which is significantly faster and more accurate than computing factorials directly (MATLAB's factorial overflows above 170, and my counts are several orders of magnitude higher).
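    As a sketch of that log-gamma trick outside MATLAB (Python's math.lgamma plays the role of gammaln here; note this is an identity up to floating-point error, not an approximation):

```python
import math

def log10_factorial(x):
    """log10(x!) computed as lgamma(x + 1) / ln(10).

    Avoids overflow: the largest factorial representable as a double
    is 170!, while the log-gamma route works for arbitrarily large x.
    """
    return math.lgamma(x + 1) / math.log(10)

# Agrees with the direct computation where that is still feasible:
print(log10_factorial(170), math.log10(math.factorial(170)))

# And keeps working far beyond the overflow limit:
print(log10_factorial(10**6))  # log10 of one million factorial
```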

    Thank you in advance!

  • #2
    Not exactly an update, but I've posted the script to our public server. You can retrieve it at http://pellegrini.mcdb.ucla.edu/Public/DCFisherextest.m



    • #3
      It would help if you could say how many samples you have; then it would be easier to tell whether the test you apply is appropriate. Differential FPKM analysis does not take much time, much like microarray differential analysis.

      Do you calculate the FPKM values with Cufflinks?
      My suggestion would be to apply Student's t-test with some kind of correction, taking into account both the "0" FPKMs of some genes and the fold changes of particular genes.

      This would be OK if you have, let's say, at least 8 versus 8 samples.

      Maybe someone has better ideas? I would gladly consider other suggestions on this topic.
      Tomasz Stokowy
      www.sequencing.io.gliwice.pl
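      A minimal sketch of the suggested t-test on log-scale expression values, in pure Python (the sample values below are invented for illustration; for a p-value you would feed t and df into a t-distribution, e.g. from scipy.stats):

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples (unequal variances allowed)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Invented log2(FPKM) values for one gene, 8 control vs 8 treated samples:
control = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.0, 5.1]
treated = [6.2, 6.0, 6.4, 5.9, 6.1, 6.3, 6.0, 6.2]
t, df = welch_t(control, treated)
print(t, df)  # t strongly negative -> treated mean is higher
```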



      • #4
        Yes, I calculated the FPKM with Cufflinks. It's actually only two samples. I was afraid I might need a very powerful significance test, but my department head actually just suggested doing a Poisson test: taking the lower expression of a single gene, using it as the Poisson parameter, and finding the likelihood of the larger expression of that gene.

        Student's test, though? What kind of correction involving the fold changes are you talking about? I'm curious how that can be applied. Thanks for the reply!
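        One standard way to cast a two-count Poisson comparison is as a conditional binomial test: given the total n = k1 + k2, the larger count is Binomial(n, 1/2) under the null of equal rates. A minimal sketch; the equal-library-size assumption (p = 0.5) is mine, and in practice p should be the ratio of the two library sizes:

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, via log-gamma
    to stay numerically stable for large counts."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def poisson_two_count_pvalue(k1, k2, p=0.5):
    """One-sided p-value for two Poisson counts, conditioning on their total.

    Under H0 (equal rates, equal library sizes), given n = k1 + k2,
    each count is Binomial(n, p). Returns P(X >= max(k1, k2)).
    """
    n = k1 + k2
    hi = max(k1, k2)
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(hi, n + 1))

print(poisson_two_count_pvalue(5, 15))      # ~0.0207
print(poisson_two_count_pvalue(500, 1500))  # far smaller: deeper counts shrink noise
```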



        • #5
          Since you have only 2 samples, I think there is no place for statistics; you cannot draw conclusions about population phenomena from just 2 samples. The best you can do, in my opinion, is calculate fold changes. That will let you conclude that "expression of a gene in one sample is higher than in the other". To start any statistics, I would suggest having at least 5 vs 5 for small-sample tests (FET or Dixon's) and 8 vs 8 after outlier filtering for other statistics.

          Tomasz
          Tomasz Stokowy
          www.sequencing.io.gliwice.pl



          • #6
            Despite the fact that it is done often, Fisher's exact test is unsuitable for testing differential expression, for reasons we've discussed in more than one thread here.

            However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from?

            On the more constructive side, here are some references on how to do it:

            Robinson MD and Smyth GK (2007): Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887.

            Anders S and Huber W (2010): Differential expression analysis for sequence count data. Genome Biology 11, R106.

            Robinson MD, McCarthy DJ and Smyth GK (2010): edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.

            Hardcastle TJ and Kelly KA (2010): baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11, 422.



            • #7
              Since you have only 2 samples, I think there is no place for statistics; you cannot draw conclusions about population phenomena from just 2 samples. The best you can do, in my opinion, is calculate fold changes. That will let you conclude that "expression of a gene in one sample is higher than in the other".
              Well, I'm tempted to disagree with you on this point. Illumina produces so many reads that it's almost impossible not to derive any statistics from two samples. And yes, "expression of a gene in one sample is higher than in the other"... You're right, we need to be very careful about what we do here. Differential expression does not say exactly how large the difference is, but I think we can pretty safely assume it can be taken directly from the fold changes, as long as the genes' expression is not so low as to give a high p-value.

              Despite the fact that it is done often, Fisher's exact test is unsuitable for testing differential expression, for reasons we've discussed in more than one thread here.
              Thank you. I've found the thread and am studying it now.

              However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from?
              Fisher's Exact Test is what we have been using by default in our lab. Ultimately, I don't really see why it's unsuitable. It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?). I do understand, however, that while it will give you a sense of whether something is expressed differently, it doesn't say HOW differently. On the other hand, you can essentially determine that from the p-value, plus you can easily find the fold change.
              I definitely agree with you on using raw counts rather than FPKM values. As long as you use a fitting test, it shouldn't matter whether you use raw counts or counts per million.

              Thank you both for your help!

              Artur



              • #8
                Fisher's Exact Test is what we have been using by default in our lab.
                Your lab is not the only one that does this, but I sincerely hope referees and editors will finally learn to reject all these papers. Most of them do by now, as the news that it is wrong finally seems to be spreading.

                It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?)
                But this is exactly the problem. Fisher's test does not bypass it. It explicitly assumes that the dispersion (the coefficient of extra-Poisson variation) is zero, and this is inappropriate. It's only because this zero does not appear in the formula that people assume the test needs no variance estimation; it does: unless you know the dispersion to be zero, you cannot use the test. The tests mentioned in the papers above become (approximately) equal to Fisher's test once you set the dispersion to zero.
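                The role of the dispersion can be made concrete: for a negative binomial model with mean mu, the variance is mu + alpha * mu^2, and Fisher's/Poisson tests implicitly set alpha = 0. A toy illustration; the values of mu and alpha below are invented, though a dispersion of that order is not unusual for biological replicates:

```python
mu = 1000.0    # mean count for a gene (invented for illustration)
alpha = 0.1    # dispersion, i.e. extra-Poisson variation
               # (invented, but a plausible order of magnitude)

poisson_var = mu                  # what Fisher's / Poisson tests assume
nb_var = mu + alpha * mu ** 2     # negative binomial variance

print(poisson_var)                    # 1000.0
print(nb_var)                         # ~101000: the assumed variance is far off
print((nb_var / poisson_var) ** 0.5)  # standard-deviation ratio, roughly 10x
```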

                See this paper for a more thorough explanation:

                Baggerly KA, Deng L, Morris JS and Aldaz CM (2003): Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19(12), 1477-1483. (doi:10.1093/bioinformatics/btg173)

                ... it doesn't say HOW differently. On the other hand, you can essentially determine that from the p-value, plus you can easily find the fold change.
                A p-value should never be used as an indicator of the magnitude of an effect. Significance and effect size are not the same thing. Use the fold change, or some shrinkage estimate of it.
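                That distinction is easy to demonstrate numerically: with deep enough sequencing, an arbitrarily small fold change becomes arbitrarily "significant". A sketch using a normal approximation to the difference of two Poisson counts (the counts are invented):

```python
import math

# Invented counts for one gene: only a 2% difference, but very deep coverage.
k1, k2 = 1_000_000, 1_020_000

fold_change = k2 / k1
# Under H0 both counts share one rate, so k2 - k1 has mean 0 and
# variance approximately k1 + k2; normal approximation to the difference:
z = (k2 - k1) / math.sqrt(k1 + k2)
p_one_sided = math.erfc(z / math.sqrt(2)) / 2

print(fold_change)   # 1.02 -- a biologically negligible effect
print(z)             # ~14 standard deviations
print(p_one_sided)   # astronomically small p-value despite the tiny effect
```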



                • #9
                  After speaking to my professor...
                  Of course you need to calculate p-values with the FPKMs! I am rather new to this field and did not realize that the RNA strands are 'chopped up' before they're sequenced. Makes total sense.
                  Simon, thank you for all your help. You are totally right about p-values vs. effect (fold change). Do you recommend using a Poisson test instead of the FET?



                  • #10
                    Also, I am a little at a loss as to what to do after finding the differentially expressed genes, p-values, and maybe some pathways. Does anybody have a link to a thread about what to do after those analyses are done?
                    I've been asked to make a heat map. That shouldn't be too hard. What else?



                    • #11
                      Originally posted by Artur Jaroszewicz View Post
                      After speaking to my professor...
                      Of course you need to calculate p-values with the FPKMs! I am rather new to this field and did not realize that the RNA strands are 'chopped up' before they're sequenced. Makes total sense.
                      Simon, thank you for all your help. You are totally right about p-values vs. effect (fold change). Do you recommend using a Poisson test instead of the FET?
                      To repeat:

                      1. Both Fisher's exact test and the Poisson likelihood ratio test are based on the assumption that they are given raw integer counts. Using them with FPKM values is incorrect.

                      2. Both these tests neglect overdispersion (sample-to-sample variation caused not by counting noise but by actual differences between replicate samples). Hence, neither of them should be used for differential expression analysis. (In fact, the Poisson LRT can be seen as an approximation to the FET.)

                      For explanations of these points, please consult a statistics textbook, or, for the second point, see this paper:

                      Baggerly KA, Deng L, Morris JS and Aldaz CM (2003): Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19(12), 1477-1483. (doi:10.1093/bioinformatics/btg173)

                      For correct methods, see the papers listed in post #6.



                      • #12
                        Hello, I created a log2(FPKM tissueA / FPKM tissueB) table for 6 control samples and 7 disease samples and would like to find the significantly differential genes.

                        log2(fpkmA/fpkmB) CTRL 1...6    log2(fpkmA/fpkmB) DISEASE 1...7

                        gene1
                        gene2
                        gene3
                        ...

                        Which statistical test should I apply for my case?

                        Thank you very much.



                        • #13
                          Hi aslihan,

                          I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

                          What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If your pipeline doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson et al.'s edgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say edgeR is better at handling single replicates, but I have not used it.



                          • #14
                            Calculating p-values from FPKM and more...

                            Originally posted by Artur Jaroszewicz View Post
                            Hi aslihan,

                            I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

                            What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If your pipeline doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson et al.'s edgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say edgeR is better at handling single replicates, but I have not used it.

                            Hi Artur,
                            I read your recent conversation about the statistics to be used with interest.
                            I am new to this field and very new to statistics, so I am completely lost!

                            Here is my issue: I have two samples, treated and untreated. I did the Illumina sequencing and then analysed the data via the tophat/cufflinks/cuffcompare/cuffdiff pipeline.

                            My question: how do I find the DE genes? I was told that in the absence of replicates it does not make sense to do a statistical test, but CUFFDIFF DOES. Does that mean the results are not reliable?

                            Shall I rank all my genes by fold change and then take the top ones as the most differentially expressed? But then, how can I know whether the difference is significant?

                            Also, some genes have FPKM = 0... the fold change is sometimes very, very high, but this does not reflect the truth. How do I deal with this?

                            Last question: when you run cuffcompare, do you give as input both Cufflinks .gtf files, from your treated and your untreated sample (+/- a reference genome, as you wish), or just one .gtf file?

                            Thanks in advance for any help (especially with the statistical part!)

                            ib



                            • #15
                              Hi IBseq,

                              Even though CuffDiff may give p-values, these are not really to be trusted. It doesn't make sense to simply compare one replicate of each condition. Here's an analogy: proving cats run faster than dogs by taking one cat and one dog and racing them. Yes, it's overly simplified, but you get the idea. You have no way of knowing whether a difference is due to natural biological variation or to real differences between the conditions. The best you can do is sort by fold change or by top RPKM and look at the differences there.

                              If you decide that you do want to do differential testing, your pipeline should be cufflinks -> cuffcompare -> cuffdiff. Cuffcompare basically creates a single combined .gtf file which you can then use for cuffdiff. Use this only if you are doing a de novo assembly. If you're using a reference .gtf file, use that instead.

                              If you have an RPKM (or 'FPKM') of zero, it means the gene isn't expressed, so you get an infinite fold change. I recommend using a log base 2 ratio, but adding, e.g., 0.001 to each value so you don't take the log of zero or divide by zero.
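                              That pseudocount trick in a short sketch (0.001 is the offset suggested above; any small constant works, but it does compress the ratios of very low-expression genes, so treat those with care):

```python
import math

def log2_fold_change(fpkm_a, fpkm_b, pseudo=0.001):
    """log2 ratio with a small pseudocount so zero FPKMs produce
    neither division by zero nor infinite fold changes."""
    return math.log2((fpkm_a + pseudo) / (fpkm_b + pseudo))

print(log2_fold_change(8.0, 2.0))  # ~2.0: about four-fold higher in A
print(log2_fold_change(0.0, 5.0))  # large negative, but finite
print(log2_fold_change(0.0, 0.0))  # 0.0: both unexpressed, no blow-up
```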

                              Best of luck,

                              Artur
