  • DESeq: pval vs padj

    Hello everybody,

    Statistics is not my strong side, so I'm asking a basic question here. I've done some bacterial RNA seq. I was able to extract my reads with HTSeq and I've done some statistics with DESeq. Now I have a list of DE genes with their associated p value and padj value. I know from the DESeq vignette that padj value corresponds to p-value adjusted for multiple testing using Benjamini-Hochberg method. However, for me that's equivalent to Chinese, i.e. I'm not sure what does it actually means.

    Which p value I should consider when treating my data? Number of genes I would include in a qPCR validation study will change notably... And I would also like to understand why I should consider one or the other parameter.

    Thanks in advance!

  • #2
    You'll want to use the adjusted p-value. For the reason why, I would suggest that you review what a p-value is and why you expect to find spurious findings with increasing numbers of tests (the various adjustments are aimed at addressing this).
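To put a rough number on "spurious findings with increasing numbers of tests", here is a minimal sketch. It assumes independent tests and that no gene is truly differentially expressed, both of which are simplifications, and the function name is made up for illustration.

```python
def spurious_hits(n_tests, alpha):
    """Expected false positives and P(at least one) when every null is true.

    Assumes independent tests, which gene-level tests only approximate.
    """
    expected = n_tests * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return expected, p_at_least_one


for n in (1, 20, 4000):
    expected, p_any = spurious_hits(n, 0.05)
    print(f"{n:>5} tests: ~{expected:g} expected false positives, "
          f"P(at least one) = {p_any:.3f}")
```

Under these assumptions, 20 tests at a 0.05 cutoff already give about a 64% chance of at least one false positive, and a few thousand genes give a couple of hundred false positives on average.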

    • #3
      Many thanks. I'll review the subject in some statistics handbook for details.

      • #4
        I always thought this was a pretty good illustration:

        • #5

          I think this is a good starting point.

          PS, I am Chinese, but I don't think statistics is as simple as Chinese :-)

          • #6
            Originally posted by ThePresident
            Many thanks. I'll review the subject in some statistics handbook for details.
            Just do a web search for some simple terms like "p-value versus FDR" and you will find many good summaries, many of them on various stats departments' or professors' web pages.

            A fundamental difference is that a p-value is a statement about a single test: the probability, under the null hypothesis, of seeing a test statistic at least as extreme as the one observed.

            An FDR, by contrast, is a statement about the expected proportion of false discoveries among the tests you call significant, given a certain number of simultaneous tests and their p-value distribution. It is an attempt to control for false discoveries, as the type I error tends to balloon with multiple tests, and that multiplicity of errors is not reflected in the individual tests' p-values.

            So if you base your selection on p-values, you will end up inherently including a large number of false positives. Using the FDR, you are controlling the expected proportion of false positives across all your significant statistical tests.
            Michael Black, Ph.D.
            ScitoVation LLC. RTP, N.C.
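For readers who want to see what the Benjamini-Hochberg adjustment mentioned in the first post does to a list of p-values, here is a minimal sketch of the standard step-up procedure in plain Python. It is an illustration only, not DESeq's own code; the function name and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for eight genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
for p, q in zip(pvals, bh_adjust(pvals)):
    print(f"p = {p:<6} padj = {q:.4f}")
```

In this toy list five genes have a raw p-value below 0.05, but only two have an adjusted value below 0.05, which is the kind of drop in list size the original question describes.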

            • #7
              Originally posted by mbblack

              A fundamental difference is that a p-value is a statement about a single test: the probability, under the null hypothesis, of seeing a test statistic at least as extreme as the one observed.

              An FDR, by contrast, is a statement about the expected proportion of false discoveries among the tests you call significant, given a certain number of simultaneous tests and their p-value distribution. It is an attempt to control for false discoveries, as the type I error tends to balloon with multiple tests, and that multiplicity of errors is not reflected in the individual tests' p-values.

              So if you base your selection on p-values, you will end up inherently including a large number of false positives. Using the FDR, you are controlling the expected proportion of false positives across all your significant statistical tests.
              Well explained, thanks! In fact, that's what I did; now I do have a somewhat better understanding of what those parameters are. Only one thing: how do you interpret the padj value in terms of significance? Do you consider that a null hypothesis is rejected under some threshold (like for a p value) or...? I don't know if I was clear enough...

              • #8
                You'll tend to interpret the FDR value similarly to how you interpret a single p-value. What is your comfort level in terms of false positives? Also, what are you generating gene lists for?

                So, for example:

                scenario one - you wish to pull out significantly differentially expressed genes for some sort of ontology or enrichment analysis. Your primary goal is to identify biological processes, pathways or other ontology categories. So you may be fairly relaxed with your choice of cutoff in order to be sure to have sufficient genes to get a reasonably robust enrichment result. So, you may pick an FDR of < 0.05, or even 0.1 if you need to pad out your gene lists.

                scenario two - you are trying to pick out genes as candidates for bio-assay development, so you'd like to find the smallest number necessary to characterize your system, and you need to be stringent about your risk of false positives (wasted money down the road if those fail to validate for your assay). So you now pick a more stringent FDR, maybe even going to < 0.01 if that gives you enough to continue with. Perhaps you simultaneously throw in a fold change cutoff as well, so you only take genes with both an FDR < 0.05 and a log2 FC > 2 (picking only highly significant, strongly up-regulated genes).

                So, as with any choice of statistical criteria, you pick a cutoff that makes sense in light of your question(s) and your system.

                Since I work in toxicology, we tend to worry more about false negatives than false positives, so I generally need not be overly stringent with my cutoff and usually use an FDR < 0.05 to 0.1 in order to be sure I capture enough genes for my downstream analyses (and I frequently add a fold change filter as well, with linear fold change of +/-1.5 to +/-2) - but it really depends on what you want out of the diff. gene expression analysis in the first place.
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
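The two scenarios above come down to filtering the result table on padj, optionally combined with a fold-change cutoff. Here is a minimal sketch, assuming the results have already been exported as (gene, log2 fold change, padj) rows; the function name, the row layout and the numbers are all made up for illustration.

```python
import math


def select_genes(results, max_padj=0.05, min_abs_log2fc=0.0):
    """Keep genes passing both an FDR cutoff and a fold-change cutoff.

    `results` is a list of (gene, log2_fold_change, padj) tuples; padj may be
    None for genes that received no adjusted p-value.
    """
    return [
        gene
        for gene, log2fc, padj in results
        if padj is not None
        and padj < max_padj
        and abs(log2fc) >= min_abs_log2fc
    ]


# Made-up rows for illustration only.
results = [
    ("geneA", 2.6, 0.004),
    ("geneB", -0.3, 0.020),
    ("geneC", 1.1, 0.080),
    ("geneD", -2.2, 0.030),
    ("geneE", 0.7, None),
]

# Scenario one: relaxed FDR cutoff, no fold-change filter.
relaxed = select_genes(results, max_padj=0.1)
# Scenario two: stricter FDR plus a linear fold change of at least 2 either way.
stringent = select_genes(results, max_padj=0.05, min_abs_log2fc=math.log2(2))
print(relaxed)
print(stringent)
```

A linear fold change of +/-2 corresponds to an absolute log2 fold change of 1, and +/-1.5 to roughly 0.585, so the same function covers the cutoffs mentioned in the post.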

                • #9
                  Honestly, I don't have any certainty about the outcome of my experiment. I have my control and my test condition. Some genes could be upregulated, others downregulated, but I expect that the majority would be unaltered.

                  From there, and considering your examples, I should expect some of the genes to necessarily come out as false positives. So, a prudent way to proceed would be to take a more stringent approach... using FDR < 0.05 I got 15 genes, which is fine, and I'm confident that I see some true changes in expression.

                  Thank you for your help

                  • #10
                    Originally posted by ThePresident
                    Honestly, I don't have any certainty about the outcome of my experiment. I have my control and my test condition. Some genes could be upregulated, others downregulated, but I expect that the majority would be unaltered.

                    From there, and considering your examples, I should expect some of the genes to necessarily come out as false positives. So, a prudent way to proceed would be to take a more stringent approach... using FDR < 0.05 I got 15 genes, which is fine, and I'm confident that I see some true changes in expression.

                    Thank you for your help
                    In any genomic differential gene expression experiment, one expects the considerable majority of genes will not be differentially expressed. That's just biology.

                    The reality also is that you will always have some false positives and some false negatives in any large-scale statistical analysis, since you can neither truly eliminate all errors (type I and type II) nor really know whether you had succeeded if you tried. Using tools like FDR helps to control those types of errors, but it does not eliminate them.

                    Also, replicates come into play too. With too few replicates, your statistical tests have limited power, so you have few statistically significant test results. That in turn means your FDR corrected p-values also will have very few (often none, if you had no replicates or only one or two) significant results, because your p-value distribution did not reflect any discrimination.

                    If you actually have no replicates, then it really is pointless to even bother computing the statistics. In that worst case scenario, you'd do best by simply ranking genes by normalized expression or raw counts, and pick those with the greatest difference in observed values (and then validate them independently).

                    So you have to interpret your results in light of your experimental limitations, as well as what your goal from the analysis was, and adjust things as the situation calls for. The stats are just tools to guide you and add some rigor to your analysis.
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.
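For the no-replicate worst case described above, ranking genes by the size of the observed difference could look like the sketch below. It assumes one normalized count per gene per condition; the function name, the pseudocount of 1 and the counts are arbitrary choices for illustration, and the ranking is only a shortlist for independent validation, not a statistical test.

```python
import math


def rank_by_log2_ratio(control, treated, pseudocount=1.0):
    """Rank genes by |log2(treated / control)| of normalized counts.

    `control` and `treated` map gene name -> normalized count for a single
    sample each. The pseudocount avoids division by zero and damps ratios
    driven by very low counts; its value is an arbitrary choice.
    """
    ratios = {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }
    # Largest absolute change first, regardless of direction.
    return sorted(ratios.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up normalized counts.
control = {"geneA": 120.0, "geneB": 15.0, "geneC": 800.0, "geneD": 0.0}
treated = {"geneA": 950.0, "geneB": 14.0, "geneC": 210.0, "geneD": 3.0}
for gene, lfc in rank_by_log2_ratio(control, treated):
    print(f"{gene}\t{lfc:+.2f}")
```

Note how geneD ranks high on only a handful of reads; low-count genes like that are the ones most likely to fail validation, so a minimum-count filter may be worth adding.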

                    • #11
                      Originally posted by mbblack

                      Also, replicates come into play too. With too few replicates, your statistical tests have limited power, so you have few statistically significant test results. That in turn means your FDR corrected p-values also will have very few (often none, if you had no replicates or only one or two) significant results, because your p-value distribution did not reflect any discrimination.
                      Fair enough. Actually, I have two biological replicates for each of my conditions. It doesn't give me huge confidence in my stats, but I assume it's enough to call the most strongly DE genes. I'll probably miss some of them by being too stringent, but my goal is to get those that could have some biological impact. Even if I have a gene whose expression is statistically different from my reference condition, I would not consider it unless the fold change is worth something.

                      If I could, I would have done 3 or more replicates for each of my conditions. But the budget was limiting so... you know the story. Now I have to work with what I have, and I'm trying to use some statistical tools to help me get through and avoid a huge number of false positives that would cost even more in downstream validation.

                      But I understand that statistics can be misused and that we should always consider it as a tool in light of our experiment. We often apply it mechanically, losing the big picture, and fall for a p value < 0.05 because that is enough for most peer-reviewed journals, right? Not many are going to ask for a detailed statistical analysis, so in many labs (including mine) the main goal is to get a star* (statistical significance) above your histogram.

                      Anyway, my English is not at the top but I hope you understood what I wanted to say.

                      Thanks again

                      • #12
                        About qPCR validation: my DESeq run calls many genes with an expression difference as low as 20% significantly differentially expressed. While I appreciate the power of DESeq, I need to verify this with qPCR.

                        I am just about to embark on this for the first time, obviously using MIQE (and help from a post-doc), but everyone in my lab has told me that unless a gene has at least 2-fold differential expression, I won't be able to determine any differences. Is this the experience of other people on here? If so, do you not do qPCR validation for some of these genes that are closer in expression and simply rely on the sequencing data?

                        • #13
                          In regards to qPCR validation, if you are using the same RNA as you did your RNA-seq with, it is meaningless. Well, not meaningless, it means you have controlled for technical noise but not biological noise.

                          pval vs. padj: This is the perspective of a biologist with very little statistical understanding, but I thought I might be able to add something. Fools all speak the same language. If you have 10,000 genes and you use a pval cutoff of 0.2, each gene that is not really differentially expressed has a 20% chance of passing the cutoff anyway, so you will get on average up to 2,000 false positives in your data set. So, for example, if you got 2,200 differentially expressed genes, on average only about 200 of them would be real.

                          Whereas a padj (FDR) cutoff of 0.2 means that on average 20% of the genes in your list are false positives.

                          If this isn't quite correct someone with a better understanding of statistics please chime in.
                          --------------
                          Ethan
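The arithmetic in this post can be written out as a short sketch. It takes the post's worst-case assumption that essentially all 10,000 genes are not differentially expressed; the function name is made up, and the numbers are the ones from the post.

```python
def expected_false_positives(n_true_nulls, p_cutoff):
    """Average number of non-DE genes that pass a raw p-value cutoff."""
    return n_true_nulls * p_cutoff


n_genes = 10_000
called = 2_200
# Worst case assumed in the post: essentially every gene is non-DE.
false_pos = expected_false_positives(n_genes, 0.2)
print(f"expected false positives: {false_pos:.0f}")
print(f"expected real hits among {called} called: {called - false_pos:.0f}")
print(f"implied false discovery proportion: {false_pos / called:.0%}")
```

So a raw p-value cutoff of 0.2 in this example leaves a gene list that is about 91% false positives, while a padj cutoff of 0.2 aims for about 20%.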

                          • #14
                            Originally posted by billstevens
                            I am just about to embark on this for the first time, obviously using MIQE (and help from a post-doc), but everyone in my lab has told me that unless a gene has at least 2-fold differential expression, I won't be able to determine any differences. Is this the experience of other people on here? If so, do you not do qPCR validation for some of these genes that are closer in expression and simply rely on the sequencing data?
                            My experience seems to parallel that of your lab mates. While it's certainly not impossible to detect <2x changes in qPCR, it's not trivial (you'll probably just need more samples). I would also question whether 20% changes in RNA are biologically meaningful. Given the number of samples that would have to be looked at to significantly discern that sort of RNA level change, I probably wouldn't bother unless the candidates were highly explanatory.

                            • #15
                              I was just about to ask about qPCR validation. What is the routine for RNA-seq, i.e. is it necessary to validate data by qPCR? If I'm about to publish those RNA-seq results, I'm afraid reviewers will ask for qPCR validation even for significantly (<0.01) DE genes (fold change > 2). And, as I see it, it is better to use new RNA samples in order to control for biological differences...?

                              Thanks in advance,

                              TP
