Announcement

Collapse
No announcement yet.

cuffdiff p value for 2 conditions without replicates

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cuffdiff p value for 2 conditions without replicates

    Hi guys,
    Can anyone help explain 1) how p value is calculated for the differential expression using cuffdiff for two conditions without replicates? 2) is this p value useful or we can just look at the fold change to select DE genes?
    It is easy to understand p values with replicates, but doesn't make much sense to me to use p or q values for 2 samples without replicates.

    Many thanks!

  • #2
    Hi there,

    I'm not a statistician, so I can't answer you how the software count the p-value. But I can suggest you reads this thread http://seqanswers.com/forums/newrepl...wreply&p=83270 ( look for mbblack comment #10). He did mentioned bout doing differential expression analysis without replicate which I agree on his opinion.

    Originally posted by mbblack View Post
    If you actually have no replicates, then it really is pointless to even bother computing the statistics. In that worst case scenario, you'd do best by simply ranking genes by normalized expression or raw counts, and pick those with the greatest difference in observed values (and then validate them independently).

    So you have to interpret your results in light of your experimental limitations, as well as what your goal from the analysis was, and adjust things as the situation calls for. The stats are just tools to guide you and add some rigor to your analysis.

    Comment


    • #3
      Thanks masterpiece. It makes sense.
      I also read the FAQ on cufflinks website. It says that Cuffdiff compares the log ratio of FPKM in two conditions (i.e. log(FPKMa/FPKMb)) in the following test statistic, which is approximately normally distributed:

      T=E[log(FPKMa/FPKMb)] / Var[log(FPKMa/FPKMb)] ~= log(FPKMa/FPKMb) /sqrt(Var(FPKMa)/(FPKMa)^2 + Var(FPKMb)/(FPKMb)^2)

      So once we get the statistic we can get a corresponding p value. To get a statistic T, we need to get the Var values.
      If there are replicates in each condition, then the Var can be calculated from the replicates;
      If there are no replicates, then the Var is calculated under the assumption that the genes/transcripts are not differentially expressed. I guess here the Var is calculated as the Var between the two samples (one in each condition). But definitely Var would be big if the gene is actually differentially expressed in the two conditions and this will make an underestimation of p value under the assumption of "not differentially expressed". So I think the p values would not be precise for no replicates experiments, as masterpiece and mbblack has commented.

      More discussion is appreciated if you had such no replicate RNAseq data analyzed with Cuffdiff. How do you feel about the p values?

      Comment


      • #4
        hug2001 - you've pretty much hit the nail. Without replicates, you cannot estimate variance, thus you cannot adequately estimate your test statistic in the first place. Thus any p-value derived from that test statistic is meaningless. And in conventional statistical differential gene expression, the central issue is being able to estimate the variance in each of your comparison groups so as to compute robust statistical significance levels for the observed mean differences. Without replicates you simply cannot do that.

        That is why I suggest if you do not have replicates, then forget about the statistics altogether, and just focus on the observed difference in magnitude as your means of differentiating genes.
        Michael Black, Ph.D.
        ScitoVation LLC. RTP, N.C.

        Comment


        • #5
          Agree with mbblack. Magnitude of changes is more meaningful.
          We are doing PCR/WB to validate some genes showing DE.
          Thanks for your discussion, masterpiece and mbblack

          Comment


          • #6
            Hi guys,
            just a question to follow up the conversation: if I do not consider at all the p and/or q values, and only look at the fold change, how this can be relieable considerigtn that there a re genes turned off in oneof the two samples?

            I had the output from cuffdiff, filtered the reusts by significnace (thus I only took those with "yes") and then filtered by fold chnage and the top/bottoem ones were the up pr downregulated genes. Is this approach correct? If I don't dop this, then I have fold chnage values as infinite becuase some genes have fpkm=0 in one of the two conditions,..

            any suggestion is much appreciated!

            thanks,
            ib

            Comment


            • #7
              I would discount any gene where the FPKM was zero in either condition. That simply indicates a lack of data, so one should not make any inference about such genes (you cannot estimate the significance of something that was not measured in the first place).

              Without replicates, the statistical significance values (pValue and FDR) are meaningless. So drop those genes where FPKM is zero, then rank order the ones that remain by fold change and select from those with the greatest apparent fold change.

              In performing a differential gene expression analysis, it's probably wiser to exclude genes with too little data for proper abundance estimation right at the beginning. A common cutoff used in published papers is only working with genes with a raw read count of at least > 10 (or I've also seen a cutoff using an RPKM > 0.1 published as well).
              Michael Black, Ph.D.
              ScitoVation LLC. RTP, N.C.

              Comment


              • #8
                thanks mbblack,
                one thing i don't get...so you suggest to rank by fpkm value, exclude those where this is=0 and then rank the rest by fold change? so this means that we do not consider whether they are significant?

                thanks,
                ib

                Comment


                • #9
                  Without biological replicates, you cannot compute reliable nor robust statistical significance. To do that, you need to sample the within-group biological variation so you can estimate variance around each mean, which is impossible without replicates. Without replicates, you cannot even compute a mean. You just have two numbers, and the difference between them, with no idea of the variation about those two numbers.

                  So yes, I don't see any point in even computing statistics for the situation where you have no replicates - the numbers are meaningless and thus so is the inference of significance or non-significance (by statistical criteria at least).

                  Really, the only interpretable measure you have without replicates is the difference between your pairs of numbers. So, rank order those differences in observed abundance, and ignore those with insufficient data to adequately compute an estimated abundance (ie. the FPKM = 0 ones).
                  Michael Black, Ph.D.
                  ScitoVation LLC. RTP, N.C.

                  Comment


                  • #10
                    Hi,
                    I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed
                    However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls."

                    I would appreciate your opinion abt this..

                    best,
                    ib

                    Comment


                    • #11
                      Originally posted by IBseq View Post
                      Hi,
                      I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed
                      However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls."

                      I would appreciate your opinion abt this..

                      best,
                      ib
                      In the absence of replicates, CuffDiff uses a heuristic approach to basically make the most of the data (or lack of it). In my opinion, it is in no way a reliable replacement for actually having biological replicates for a differential gene expression analysis. The few statisticians I work with also seem of the opinion that there is no real algorithmic substitution for proper biological sampling of variance.

                      So, again I will say, in my opinion and experience, in the absence of biological replicates any statistical significance measure is meaningless. Without replicates, you do not have the appropriate data to compute statistical significance (and no manipulation nor permutation of the limited non-replicated data you do have can truly make up for that lack of appropriate replicate data).

                      That is why I suggest ignoring any statistics altogether. Just use what you actually do have, which is difference (NOT mean difference, just difference) between your two samples.

                      (P.S. you always need to be skeptical of the fact that a mathematical formula or computational algorithm can produce some kind of number you want. Generally speaking, such things will always produce a result because that is what they are designed to do - they churn through the input values and spit out a result. The mere fact that you can run an algorithm to completion does not mean that you should be doing so with any given set of data, or that you should be trusting or relying on the results it produces. For example, most software will run t-tests on simple pairs of numbers without replicates, but the fact the software produces output does not mean the statistical test was a mathematically valid one to perform in the first place. A t-Test needs to be run on differences of means, and if you do not have replicates in your data, you cannot compute means nor reliable estimates of variance, and thus a t-test is an inapporpriate test, despite the fact the algorithm ran properly and produced output values).
                      Last edited by mbblack; 10-17-2012, 05:02 AM.
                      Michael Black, Ph.D.
                      ScitoVation LLC. RTP, N.C.

                      Comment


                      • #12
                        Hi mbblack,
                        thanks. I think I understood your point of view. I shall look at the data by simply looking at the fold change and ignoring the fpkm values equal to zero from both samples.

                        Thanks a lot for your time.
                        irene

                        Comment


                        • #13
                          Hi Folks,
                          ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...

                          Comment


                          • #14
                            Originally posted by bvb1909 View Post
                            Hi Folks,
                            ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...
                            Except the problem is you really do not know if those FPKM=0 values are real or not. At low counts (and by low, that may mean anything with a feature count of less than at least a 100 or more), the variance of the estimated transcript abundance (even with many biological replicates) is very large, making your estimate of low count features far less reliable than for high(er) count features.

                            And an FPKM=0 simply means you failed to detect that transcript, not that it was not actually expressed. Including those values puts your results into, at best, the realm of speculation, not inference based on actual data. No data is not evidence of non-expression. So you cannot make the inference of a true on/off feature, since you literally have no data for the feature in one condition. You would need to independently confirm that the non-detected feature was actually absent because it was not expressed, as opposed to simply being so rare your RNA-Seq experiment failed to detect it.
                            Last edited by mbblack; 01-23-2013, 05:18 AM.
                            Michael Black, Ph.D.
                            ScitoVation LLC. RTP, N.C.

                            Comment


                            • #15
                              I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data

                              Comment

                              Working...
                              X