Beginner question for Differential Expression Analysis

  • Beginner question for Differential Expression Analysis


    I am a beginner at analyzing data from an RNA-seq experiment. I did not perform the bioinformatics analysis myself (I am more of a bench scientist), so what I have in my hands is an Excel file of results. I am not sure how to retrieve my DE genes from it.
    I have read about what p- and q-values represent, and I understand that setting an FDR threshold is a 'safe' way to decide whether the differences recorded as significant are truly significant.

    What confuses me is choosing the FDR threshold. If I understand correctly, a level of 0.05 does not apply to all experiments.

    Could you please refer me to some further reading, or perhaps provide me with some tips, so that I proceed correctly with my analysis?

    I apologize if this is a very basic question. I appreciate your help.


  • #2
    The raw p-values in your results are still what they are: at a per-gene level, given the dispersion model fitted to the expression values in each condition, a small p-value means that a difference this large would be unlikely if the gene were NOT differentially expressed. Statistical reality, however, shows us that when we repeatedly run a statistical test between two groups of values that DO come from the same distribution (say, split 20 values drawn with a mean of 10 and stdev of 5 into two random groups), about 5% of those tests will return a p-value below 0.05. So given the large number of genes we test, we expect a sizable number of type I errors (false positives): testing 20,000 genes at p < 0.05 would give roughly 1,000 significant calls even if no gene were truly differentially expressed.
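
    The thought experiment above can be tried directly. What follows is a minimal sketch, not part of any DE pipeline: it assumes normally distributed values and, to stay dependency-free, uses a two-sided z-test with the standard deviation treated as known (a real analysis would use a t-test or the DE tool's own model). The function names `null_pvalue` and `fraction_significant` are made up for this illustration.

```python
import random
from statistics import NormalDist, mean


def null_pvalue(rng, n_total=20, mu=10.0, sigma=5.0):
    """Draw n_total values from ONE normal distribution, split them into two
    random groups of equal size, and return a two-sided z-test p-value."""
    values = [rng.gauss(mu, sigma) for _ in range(n_total)]
    rng.shuffle(values)
    half = n_total // 2
    group_a, group_b = values[:half], values[half:]
    # Standard error of the difference in group means (sigma assumed known).
    se = sigma * (2.0 / half) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def fraction_significant(n_tests=10000, alpha=0.05, seed=1):
    """Repeat the null experiment n_tests times and return the fraction of
    tests whose p-value falls below alpha."""
    rng = random.Random(seed)
    hits = sum(null_pvalue(rng) < alpha for _ in range(n_tests))
    return hits / n_tests


NULL_FRACTION = fraction_significant()
print(f"fraction of null tests with p < 0.05: {NULL_FRACTION:.3f}")
```

    The printed fraction should land close to 0.05, even though every pair of groups came from the same distribution.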

    In practice I think of the p-value and q-value (adjusted p-value, FDR, etc.) differently in different situations. If we are taking a candidate-type approach, meaning we'll run additional experiments to verify the RNA-seq result for each gene, we may use the raw p-values to get a broader list of candidates. If we have a phenotype and want to report the number of genes affected, or the percentage of genes enriched vs. depleted, we'll use the adjusted p-values, since that is a more general claim.
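
    To make the two lists concrete, here is a small sketch of how adjusted p-values are commonly derived from raw ones. It assumes the Benjamini-Hochberg procedure, which many DE tools use by default, but the thread does not say which method produced the values in the spreadsheet. The gene names and p-values are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never decrease as the raw p-value increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical genes and raw p-values, for illustration only.
raw = {"geneA": 0.001, "geneB": 0.008, "geneC": 0.035, "geneD": 0.045, "geneE": 0.6}
names = list(raw)
adjusted = dict(zip(names, benjamini_hochberg([raw[n] for n in names])))

FDR_THRESHOLD = 0.05  # a common convention, not a universal rule
candidates = [n for n in names if raw[n] < 0.05]                # broader list
reportable = [n for n in names if adjusted[n] < FDR_THRESHOLD]  # stricter list

print("raw p < 0.05:     ", candidates)
print("adjusted p < 0.05:", reportable)
```

    With these made-up numbers, four genes pass the raw cutoff but only two survive adjustment. The first list is the kind one might carry into follow-up experiments; the second is the kind one would report.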

    Sometimes an experiment yields zero significant genes by adjusted p-value even though we know there's a phenotype. In those cases we may proceed with the genes significant by raw p-value, keeping in mind that we must proceed cautiously. We wouldn't go straight into a figure with that result; we'd of course first try to confirm by other methods that those genes really differ.

    Finally, keep in mind that a raw p-value cutoff likely gives you many type I errors (false positives), while an adjusted p-value cutoff likely gives you many type II errors (missed genes). A larger sample size eases this trade-off: power goes up, so fewer real differences are missed and a smaller share of your significant calls are false. Of course, with larger and larger sample sizes you'll also get significance calls for features with smaller and smaller effect sizes, and you'll have to start thinking in terms of "what is a significant effect?". I can't answer that one.
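
    The trade-off between the two kinds of error can be explored with a small simulation. This is only a sketch under stated assumptions: 900 simulated "genes" with no real difference and 100 with a true shift of 5 units, normally distributed values with a known standard deviation of 5, a z-test in place of a real DE model, and Benjamini-Hochberg as the adjustment. All names and numbers are invented for illustration.

```python
import random
from statistics import NormalDist, mean


def z_pvalue(a, b, sigma):
    """Two-sided z-test p-value for a difference in means (sigma known)."""
    se = sigma * (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def error_counts(n_per_group, n_null=900, n_real=100, shift=5.0,
                 sigma=5.0, alpha=0.05, seed=7):
    """Simulate one experiment and return, for the raw and the adjusted
    cutoff, a tuple (false positives, false negatives)."""
    rng = random.Random(seed)
    pvals, is_real = [], []
    for g in range(n_null + n_real):
        real = g >= n_null  # the last n_real genes carry a true difference
        delta = shift if real else 0.0
        a = [rng.gauss(10.0, sigma) for _ in range(n_per_group)]
        b = [rng.gauss(10.0 + delta, sigma) for _ in range(n_per_group)]
        pvals.append(z_pvalue(a, b, sigma))
        is_real.append(real)
    qvals = bh_adjust(pvals)
    result = {}
    for label, scores in (("raw", pvals), ("adjusted", qvals)):
        fp = sum(s < alpha and not r for s, r in zip(scores, is_real))
        fn = sum(s >= alpha and r for s, r in zip(scores, is_real))
        result[label] = (fp, fn)
    return result


for n in (3, 10, 30):
    print(f"n per group = {n:2d}: (false pos, false neg) = {error_counts(n)}")
```

    In a setup like this, the raw cutoff is expected to produce around alpha times the number of null genes in false positives (about 45 here) whatever the group size, because that is what the threshold means. What changes with more replicates is how many of the truly different genes are missed, under either cutoff.
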
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */


  • #3
    Many thanks sdriscoll!!