Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Artur Jaroszewicz
    Member
    • Sep 2011
    • 45

    Calculating p-values from FPKM?

    Oof! I'm on the last leg of my first project here at UCLA in the Pellegrini Lab. I'm supposed to figure out which genes are differentially expressed (say, with a p-value < 0.001). I used Fisher's Exact Test to calculate directly from the FPKM (Fragments Per Kilobase of gene per Million reads), and the calculation took about 5 seconds on my laptop, and I got very significant results. Very disturbing that it was so easy.

    So I started getting this inkling that maaaaybe I can't do the FET on FPKMs, as each gene is normalized by its own length, so the totals of the FPKMs fall out of proportion.

    Question 1: Am I correct in thinking this?

    If answer(Question 1) == yes,
    I tried running the FET on the actual counts of each gene per million reads, and overnight, my computer calculated about 60 of them. The problem is, I have 27,644 genes I need to do the test for. I doubt even on a big cluster that I'll be able to get results in a reasonable amount of time.

    Question 2: Does anybody have any suggestions or alternatives?

    Also, as a side note, I'm running the p-value calculations in MATLAB using DCFisherextest.m, which runs 2x2 contingency tables using the approximation log10(x!) =~ gammaln(x+1)/log(10), which is significantly faster and more accurate than direct factorials (MATLAB does not take anything higher than 170, and my data is several orders of magnitude higher).

    Thank you in advance!
  • Artur Jaroszewicz
    Member
    • Sep 2011
    • 45

    #2
    Not exactly an update, but I've posted the script to our public server. You can retrieve it at http://pellegrini.mcdb.ucla.edu/Public/DCFisherextest.m

    Comment

    • stoker
      Member
      • Oct 2010
      • 17

      #3
      It would be nice if you could write the number of samples you have - then it would be easier to say if the test you apply is proper. Differential fpkm analysis does not take much time - similarly as microarray differential analysis.

      Do you calculate FPKM values with cufflinks?
      My suggestion would be to apply Student's ttest, with some kind of correction - taking into account both - "0" FPKMs for some genes and fold changes of particular genes.

      This would be ok if you have, lets say 8 versus 8 samples at least.

      Maybe someone has better ideas? I would gladly take into consideration some suggestions on this topic.
      Tomasz Stokowy
      www.sequencing.io.gliwice.pl

      Comment

      • Artur Jaroszewicz
        Member
        • Sep 2011
        • 45

        #4
        Yes, I calculated the FPKM with cufflinks. It's actually only two samples. I was afraid that I might need a very powerful significance test, but my department head actually just suggested doing a Poisson Test -- taking the lower expression of a single gene, calculating the Poisson parameter, and finding the likelihood of the larger expression of that gene.

        Student's test, though? What kind of correction are you talking about with the fold changes? I'm curious as to how that can be applied. Thanks for the reply!

        Comment

        • stoker
          Member
          • Oct 2010
          • 17

          #5
          Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one". To start any statistics i would suggest to have at least 5vs5 for "small number of samples tests, FET or Dixon's" and 8vs8 after outliers filtering for other statistics.

          Tomasz
          Tomasz Stokowy
          www.sequencing.io.gliwice.pl

          Comment

          • Simon Anders
            Senior Member
            • Feb 2010
            • 995

            #6
            Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here.

            However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from.

            On a more constructive side, here are some references how to do it:

            Robinson MD and Smyth GK (2007). Moderated statistical tests for
            assessing differences in tag abundance. Bioinformatics 23, 2881-2887

            Simon Anders and Wolfgang Huber (2010): Differential expression
            analysis for sequence count data. Genome Biology 11:R106

            Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor
            package for differential expression analysis of digital gene
            expression data. Bioinformatics 26, 139-140

            Thomas J Hardcastle and Krystyna A Kelly (2010): baySeq: Empirical
            Bayesian methods for identifying differential expression in sequence
            count data. BMC Bioinformatics 11:422

            Comment

            • Artur Jaroszewicz
              Member
              • Sep 2011
              • 45

              #7
              Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one".
              Well, I'm tempted to disagree with you on this point. The Illumina does so many reads, it's almost impossible not to derive and statistics from two samples. And yes, "expression of gene in a sample is higher than in the other one"... You're right, we need to be very careful as to what we do here. Differential expression does not exactly say how much of a difference, but I think we can pretty safely assume that it can be taken directly from the fold changes, as long as the expressions of the genes are not so low as to have a high p-value.

              Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here.
              Thank you. I've found the thread and am studying it now.

              However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from.
              Fisher's Exact Test is what we have been using by default in our lab. Ultimately, I don't really see why it's unsuitable. It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?). I do understand, however, that it while it will give you a sense of whether something is expressed differently, it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
              I definitely agree with you on the using of raw counts rather than FPKM values. As long as you use a fitting test, it shouldn't matter whether you use raw counts or counts per million.

              Thank you both for your help!

              Artur

              Comment

              • Simon Anders
                Senior Member
                • Feb 2010
                • 995

                #8
                Fisher's Exact Test is what we have been using by default in our lab.
                Your lab is not the only one who does this, but I sincerely hope referees and editors will finally learn to reject all these papers. Most of them do, by now, as the news that it is wrong seem to finally spread.

                It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?)
                But this is exactly the problem. Fisher's test does not bypass this problem. It explicitly assumes that the dispersion (the coefficient of extra-Poisson variation) is zero, and this is inappropriate. It's only because this zero does not appear in the formula that people assume that the test does not need a variance estimation; it does: unless you know the dispersion to be zero, you cannot use the test. The tests mentioned in the papers above become (approximately) equal to the Fisher's test once you set the dispersion to zero there.

                See this paper for a more thorough explanation:

                Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. 2003 Differential expres-
                sion in SAGE: accounting for normal between-library variation. Bioinformatics,
                19(12), 1477–1483. (doi:10.1093/bioinformatics/btg173)

                ... it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
                A p value should never be used as in indicator of the magnitude of an effect. Significance and effect size are not the same. Use the fold change, or some shrinkage estimate of it.

                Comment

                • Artur Jaroszewicz
                  Member
                  • Sep 2011
                  • 45

                  #9
                  After speaking to my professor..
                  Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense.
                  Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?

                  Comment

                  • Artur Jaroszewicz
                    Member
                    • Sep 2011
                    • 45

                    #10
                    Also, I am a little bit at a loss as to what to do after finding differential expression of genes, p-values, and maybe some pathways.. Does anybody have a link to a thread about what to do after those analyses are done?
                    I've been asked to do a heat map. That shouldn't be too hard. What else?

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      Originally posted by Artur Jaroszewicz View Post
                      After speaking to my professor..
                      Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense.
                      Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?
                      To repeat:

                      1. Both Fisher's exact test and the Poisson likelihood ratio test are based on the assumption that they are given raw integer counts. Using them with FPKM values is incorrect.

                      2. Both these tests neglect overdispersion (sample-to-sample variation not due to counting noise but due to actual differences between replicate sample). Hence, neither of them should be used for differential expression analysis. (In fact, the Poisson LRT can be seen as an approxmiation to the FET.)

                      For explanations of these points, please consult a stochastics textbooks, or, for the second point, see this paper:

                      Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. (2003):
                      Differential expression in SAGE: accounting for normal between-library variation.
                      Bioinformatics, 19(12), 1477–1483. (doi:10.1093/bioinformatics/btg173)

                      For correct methods, see the papers listed in post #6.

                      Comment

                      • aslihan
                        Member
                        • Jun 2011
                        • 23

                        #12
                        Hello, I created log2(FPKM tissueA/ FPKM tissueB) table for control 6sample and disease 7samples and would like to see significant differential genes.

                        log2(fpkmA/fpkmB) CTRL1.....6 log2(fpkmA/fpkmB) DISEASE1....7

                        gene1
                        gene2
                        gene3

                        ......

                        Which statistical test should I apply for my case?

                        Thank you very much.

                        Comment

                        • Artur Jaroszewicz
                          Member
                          • Sep 2011
                          • 45

                          #13
                          Hi aslihan,

                          I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

                          What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.

                          Comment

                          • IBseq
                            Member
                            • Jul 2012
                            • 56

                            #14
                            Calculating p-values from FPKM and more...

                            Originally posted by Artur Jaroszewicz View Post
                            Hi aslihan,

                            I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

                            What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.

                            Hi Artur,
                            I read with interest your last conversation abt the statistic to be used.
                            I am new to this filed and very new to the statistic field, thus I am completely lost!

                            here my issue: I have two samples, treated and untreated. I have done the illumina sequencing and then analysed the data via the tophat/cufflinks/cuffcompare/cuffdiff pipeline.

                            My question: how to find the de genes?I was suggesrted that in absebce of replicates it does not make sense to do a test statistic, but CUFFDIFF DOES. does it mean that the results are not reliable?

                            Shall I rank all my genes by fold change and then take top one as most differentially expressed?But then, how can i know if this difference is significant?

                            also, some genes have fpkm=0....the fold change is sometimes very very high, but this does not reflect the truth...how to deal with this??

                            Last question: when you run cuffcompare, do you give as input both cufflinks .gtf files from yuur treated and untreated (+/_ a reference genome as you wish) or just one .gtf files?

                            thanks in advance for any help (specially the statistical part!)

                            ib

                            Comment

                            • Artur Jaroszewicz
                              Member
                              • Sep 2011
                              • 45

                              #15
                              Hi IBseq,

                              Even though CuffDiff may give p-values, these are not really to be trusted. It doesn't make sense to compare simply one replicate of each. Here's an analogy: proving cats run faster than dogs by taking one cat and one dog, and racing them. Yes, it is an overly simplified, but you get the idea. You have no way of knowing if a difference is due to natural biological variation or due to differences between the samples. The best you can do is sort by fold change or by top RPKM, and look at differences there.

                              If you decide that you do want to do differential testing, you should be using (in your pipeline) cufflinks -> cuffcompare -> cuffdiff. Cuffcompare basically creates a single combined gtf file which you can use for cuffdiff. Use this only if you are doing a de novo alignment. If you're using a reference gtf file, use that.

                              If you have an RPKM (or 'FPKM') of zero, it means it's not expressed, so you have an infinite fold change. I recommend using a log base 2 ratio, but adding, e.g., 0.001 to each value, to make sure you don't take the log of zero or dividing by zero.

                              Best of luck,

                              Artur

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Pathogen Surveillance with Advanced Genomic Tools
                                by seqadmin




                                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                03-24-2025, 11:48 AM
                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              41 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              51 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              38 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              192 views
                              0 reactions
                              Last Post seqadmin  
                              Working...