Announcement

Collapse
No announcement yet.

cuffdiff p value for 2 conditions without replicates

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mbblack
    replied
    Originally posted by bvb1909 View Post
    I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data
    The problem with low count data is that it results in a very high false positive error rate for differentially expressed genes. I suppose it then depends on how tolerant your study is to false positives and what your objective is with the data. However, I will make a final comment that in the published RNA-Seq DGE papers of the last 2 or 3 years, there seems to be a clear and growing consensus that low read count data should be excluded from DGE analyses to avoid the bias of much higher false positive errors especially in low expressors. Published papers seem to vary when using FPKM/RPKM normalization, with cutoffs varying from 0.1 to 0.5 (one paper I seem to recall even using higher, but I remember the read depth was quite low in that work as well). However, I too am becoming a non-fan of RPKM or similar methods, as they can be very misleading for some genes.

    In my own work, we have settled on excluding raw counts less than 11 (so I actually filter on count > 10), and then normalize what remains. Even then, It's simply though to plot and show that genes with raw counts between 11 and about 150 or so, have very high variance in their transcript abundance estimates, while for those with counts > about 150, the variance tightens up dramatically. We also always run 5 biological replicates for all treatments and controls.

    Working in toxicology and particularly with risk assessment type studies, we do not have the option of dismissing statistical significance, and in fact almost always base our DGE assessments on simultaneously filtering results for statistical significance and minimal fold change difference (although for initial exploratory analyses, we may relax those criteria - as you say, it depends on what one's goals for the data are).

    Just as an aside, in the limited qPCR validations series that I've run, we get very poor correspondence with RNA-Seq results base solely on statistical significance or solely on fold change. Correspondence (using ABI TaqMan rtPCR assays) improves dramatically when comparing genes that were both statistically significant and met minimum fold change differences (I usually filter for genes with FDR < 0.05 and FC > +/- 1.5). Nothing novel in that result, and of course, the same applies for microarray data for that matter: combining statistical significance and some minimum magnitude of relative change proves a far more robust estimator of differential gene expression than either cutoff alone. The problem with the original post that started this thread, is that you cannot compute statistical significance in the absence or replicates, so you are left with just raw differences in magnitude based on single measures of abundance.

    Leave a comment:


  • chadn737
    replied
    I don't like FPKM/RPKM because it can mask how many reads there actually are, even inflating values when there are actually very few reads. That being said, I disagree with filtering out genes that have a 0 FPKM or read count in one condition. I have seen clear examples where one conditions will have dozens, even hundreds of reads, while the other has zero. Clearly you can observe expression in one condition and call this as differentially expressed. Where things become murky is when one condition has maybe 2-4 reads and the other 0. A better filter then would take into account this possibility.
    Last edited by chadn737; 01-24-2013, 06:54 AM.

    Leave a comment:


  • Jeremy
    replied
    It might be better to use normalized raw reads. That way you can see how reliable the fold difference actually is, lower read counts are more prone to noise induced errors. For example if sample A has 5 reads and sample B has 0 reads then it may not be so reliable. Whereas if sample A has 3000 reads and sample B has 0 it would be more reliable. FPKM can change the numbers making low read count small fragments look more reliable than they actually are (not sure if that has changed, it's been a while since I tried FPKM).

    Leave a comment:


  • bvb1909
    replied
    I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data

    Leave a comment:


  • mbblack
    replied
    Originally posted by bvb1909 View Post
    Hi Folks,
    ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...
    Except the problem is you really do not know if those FPKM=0 values are real or not. At low counts (and by low, that may mean anything with a feature count of less than at least a 100 or more), the variance of the estimated transcript abundance (even with many biological replicates) is very large, making your estimate of low count features far less reliable than for high(er) count features.

    And an FPKM=0 simply means you failed to detect that transcript, not that it was not actually expressed. Including those values puts your results into, at best, the realm of speculation, not inference based on actual data. No data is not evidence of non-expression. So you cannot make the inference of a true on/off feature, since you literally have no data for the feature in one condition. You would need to independently confirm that the non-detected feature was actually absent because it was not expressed, as opposed to simply being so rare your RNA-Seq experiment failed to detect it.
    Last edited by mbblack; 01-23-2013, 05:18 AM.

    Leave a comment:


  • bvb1909
    replied
    Hi Folks,
    ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...

    Leave a comment:


  • IBseq
    replied
    Hi mbblack,
    thanks. I think I understood your point of view. I shall look at the data by simply looking at the fold change and ignoring the fpkm values equal to zero from both samples.

    Thanks a lot for your time.
    irene

    Leave a comment:


  • mbblack
    replied
    Originally posted by IBseq View Post
    Hi,
    I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed
    However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls."

    I would appreciate your opinion abt this..

    best,
    ib
    In the absence of replicates, CuffDiff uses a heuristic approach to basically make the most of the data (or lack of it). In my opinion, it is in no way a reliable replacement for actually having biological replicates for a differential gene expression analysis. The few statisticians I work with also seem of the opinion that there is no real algorithmic substitution for proper biological sampling of variance.

    So, again I will say, in my opinion and experience, in the absence of biological replicates any statistical significance measure is meaningless. Without replicates, you do not have the appropriate data to compute statistical significance (and no manipulation nor permutation of the limited non-replicated data you do have can truly make up for that lack of appropriate replicate data).

    That is why I suggest ignoring any statistics altogether. Just use what you actually do have, which is difference (NOT mean difference, just difference) between your two samples.

    (P.S. you always need to be skeptical of the fact that a mathematical formula or computational algorithm can produce some kind of number you want. Generally speaking, such things will always produce a result because that is what they are designed to do - they churn through the input values and spit out a result. The mere fact that you can run an algorithm to completion does not mean that you should be doing so with any given set of data, or that you should be trusting or relying on the results it produces. For example, most software will run t-tests on simple pairs of numbers without replicates, but the fact the software produces output does not mean the statistical test was a mathematically valid one to perform in the first place. A t-Test needs to be run on differences of means, and if you do not have replicates in your data, you cannot compute means nor reliable estimates of variance, and thus a t-test is an inapporpriate test, despite the fact the algorithm ran properly and produced output values).
    Last edited by mbblack; 10-17-2012, 05:02 AM.

    Leave a comment:


  • IBseq
    replied
    Hi,
    I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed
    However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls."

    I would appreciate your opinion abt this..

    best,
    ib

    Leave a comment:


  • mbblack
    replied
    Without biological replicates, you cannot compute reliable nor robust statistical significance. To do that, you need to sample the within-group biological variation so you can estimate variance around each mean, which is impossible without replicates. Without replicates, you cannot even compute a mean. You just have two numbers, and the difference between them, with no idea of the variation about those two numbers.

    So yes, I don't see any point in even computing statistics for the situation where you have no replicates - the numbers are meaningless and thus so is the inference of significance or non-significance (by statistical criteria at least).

    Really, the only interpretable measure you have without replicates is the difference between your pairs of numbers. So, rank order those differences in observed abundance, and ignore those with insufficient data to adequately compute an estimated abundance (ie. the FPKM = 0 ones).

    Leave a comment:


  • IBseq
    replied
    thanks mbblack,
    one thing i don't get...so you suggest to rank by fpkm value, exclude those where this is=0 and then rank the rest by fold change? so this means that we do not consider whether they are significant?

    thanks,
    ib

    Leave a comment:


  • mbblack
    replied
    I would discount any gene where the FPKM was zero in either condition. That simply indicates a lack of data, so one should not make any inference about such genes (you cannot estimate the significance of something that was not measured in the first place).

    Without replicates, the statistical significance values (pValue and FDR) are meaningless. So drop those genes where FPKM is zero, then rank order the ones that remain by fold change and select from those with the greatest apparent fold change.

    In performing a differential gene expression analysis, it's probably wiser to exclude genes with too little data for proper abundance estimation right at the beginning. A common cutoff used in published papers is only working with genes with a raw read count of at least > 10 (or I've also seen a cutoff using an RPKM > 0.1 published as well).

    Leave a comment:


  • IBseq
    replied
    Hi guys,
    just a question to follow up the conversation: if I do not consider at all the p and/or q values, and only look at the fold change, how this can be relieable considerigtn that there a re genes turned off in oneof the two samples?

    I had the output from cuffdiff, filtered the reusts by significnace (thus I only took those with "yes") and then filtered by fold chnage and the top/bottoem ones were the up pr downregulated genes. Is this approach correct? If I don't dop this, then I have fold chnage values as infinite becuase some genes have fpkm=0 in one of the two conditions,..

    any suggestion is much appreciated!

    thanks,
    ib

    Leave a comment:


  • hug2001
    replied
    Agree with mbblack. Magnitude of changes is more meaningful.
    We are doing PCR/WB to validate some genes showing DE.
    Thanks for your discussion, masterpiece and mbblack

    Leave a comment:


  • mbblack
    replied
    hug2001 - you've pretty much hit the nail. Without replicates, you cannot estimate variance, thus you cannot adequately estimate your test statistic in the first place. Thus any p-value derived from that test statistic is meaningless. And in conventional statistical differential gene expression, the central issue is being able to estimate the variance in each of your comparison groups so as to compute robust statistical significance levels for the observed mean differences. Without replicates you simply cannot do that.

    That is why I suggest if you do not have replicates, then forget about the statistics altogether, and just focus on the observed difference in magnitude as your means of differentiating genes.

    Leave a comment:

Working...
X