  • gene ontology over-representation of differentially expressed genes

    I am pretty new to the statistical methods used in calculating the probability of over-representation, so bear with me. I've been reading this for an intro on this topic:


    My main question is do we care about the probability of GO over-representation in differentially expressed genes?

    The p-value after an over-representation analysis is the chance that the GO term appearing in my sub-list is due to random chance. But what does "random chance" mean in the context of differential expression lists? The chance that the gene is not differentially expressed? The chance that the GO assignment was wrong?

    Example case:
    -I have a total gene set of 10,000 genes.
    -500 of those genes have "cell cycle" GO term.
    -I have a list of 200 differentially expressed genes and 10 of them are cell cycle.

    If I do a simple hypergeometric test in R (according to the presentation I linked above) with:
    phyper(9, 500, 10000-500, 200, lower.tail=FALSE)
    I get a pretty bad p-value of: 0.55

    So that p-value is telling me that finding 10 cell cycle genes in my differentially expressed list is not very significant. If we randomly draw 200 genes from the pool of 10,000 genes, the chance of getting 10 or more cell cycle genes is 0.55.
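For anyone who wants to check that number without R, here is a minimal Python sketch of the same upper-tail calculation. It assumes the standard hypergeometric setup and copies R's argument order (q, m, n, k); the function name `phyper_upper` is made up for illustration and is not a library call.

```python
from math import comb

def phyper_upper(q, m, n, k):
    """P(X > q): k draws without replacement from m annotated and n other genes.

    Intended to mirror R's phyper(q, m, n, k, lower.tail=FALSE).
    """
    total = comb(m + n, k)
    # Sum the probabilities of seeing q+1, q+2, ... annotated genes in the draw.
    hits = sum(comb(m, i) * comb(n, k - i) for i in range(q + 1, min(m, k) + 1))
    return hits / total

# Same numbers as the R call above: 500 cell cycle genes, 9,500 others, 200 drawn.
p_value = phyper_upper(9, 500, 10_000 - 500, 200)
print(f"P(X >= 10) = {p_value:.2f}")
```

Note that passing 9 gives the probability of 10 or more hits, because the upper tail is strictly greater than q.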

    What does the significance really mean in this context? Does it mean there is a good chance that the 10 cell cycle genes really weren't differentially expressed in the first place? What if the 10 cell cycle genes had a very high significance in my differential expression analysis?

  • #2
    It means that there is no significant enrichment of cell cycle genes in your subset of genes. As you stated, the number of cell cycle genes in your subset is what you would expect if you randomly selected 200 genes from the set of 10000.
    In other words, the proportion of genes with a "cell cycle" annotation in your subset of genes is similar to (actually, in this case it is exactly the same as) the proportion of genes with a "cell cycle" annotation in the whole gene set.

    10/200=0.05
    500/10000=0.05

    Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed. This type of analysis assumes that the genes in your differentially expressed gene list are actually differentially expressed and that the annotations are correct. The results indicate only whether your subset of genes has more of any given annotation than you would expect by chance.



    • #3
      Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed.
      Thanks. So there really is no point in using an over-representation analysis on differentially expressed genes. It doesn't tell you anything in relation to differential expression.



      • #4
        Yup, that's correct.



        • #5
          Although, if you want to know whether you have an enrichment of a group of genes with a specific function in your subset of differentially expressed genes, then it may be useful to you.



          • #6
            Why is there no point? Looking at over-representation can tell you a lot about what is going on in the data. If a particular GO category is over-represented then that process is likely to be particularly important under the conditions you are testing for.

            The fact that cell cycle is not over-represented tells you something as well, that maybe there is not much changing in relation to cell cycle under your conditions.

            Looking for enrichment has been valuable and informative in my own work.



            • #7
              I think damiankao was trying to use enrichment analysis to determine the probability that an individual gene is differentially expressed. But, of course, enrichment analysis has no relation to whether or not an individual gene is differentially expressed. So, in this particular context, there is no point in performing this type of analysis.



              • #8
                I guess I am trying to point out that over-representation analysis gives you significance relative to random chance.

                In the case of differential expression, there is no random chance because we are already assuming the list is correct. We are not getting significance values relative to all possible configurations of the differentially expressed list, because there is only one list.

                Take my example with cell cycle. My differentially expressed gene list shows no over-representation of cell cycle. What does that mean really? Under random conditions, the probability is 0.05 to see a cell cycle gene. Are we assuming that between my two conditions, it is also 0.05 to see a cell cycle gene differentially expressed?



                • #9
                  i would agree with others here who say that functional annotation (enrichment testing) of signature/diff exp gene lists is indeed useful and is actually a basic tool built into almost every exp analysis package out there (e.g. DAVID, Ingenuity, various R packages, etc.).

                  there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff (the alpha) used to define the differentially exp genes, along with whether or not you apply a multiple test correction on top of the raw p-values. note that some people will split the diff exp genes into up-reg'd and down-reg'd sets.

                  the next p-value is the Fisher's/Hypergeometric enrichment test p-value. depending on the gene list set sizes (on average 100~1000 genes) you will get a set of GO category rankings. with this hypothesis test, you will need fairly stringent cutoffs to "trust" the enrichments. this is heavily dependent on having reasonable gene set sizes; otherwise you will get misleading enrichment annotations. multiple-test corrections are also useful for these tests.
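As a rough illustration of the multiple-test correction mentioned above, here is a minimal Python sketch of the Benjamini-Hochberg adjustment applied to a handful of enrichment p-values. The thread does not say which correction to use, so Benjamini-Hochberg is an assumption here, and the GO term labels and p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw enrichment p-values, one per GO term (made-up numbers).
raw = {"GO term A": 0.001, "GO term B": 0.02, "GO term C": 0.03, "GO term D": 0.55}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for term, q in adjusted.items():
    print(f"{term}: raw p = {raw[term]}, BH-adjusted = {q:.3f}")
```

In a real analysis the list would hold one p-value for every GO term tested, not just the ones that look interesting.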

                  bottom line, the first task of getting diff exp genes requires one type of test and the second task of enrichment requires another. these are independent of one another but are often used in this sequence to identify potential pathways or GO lists in expression data. you may need to test various cutoffs to see how robust your enrichments are.

                  in your example, you are asking a basic question about what Fisher's testing is all about. there are obviously plenty of places to read up on what this is. but an intuitive interpretation is that if you randomly grabbed 200 genes out of 10000 and got 10 that were cell cycle, you could easily get that many by random chance, so you end up with an uninteresting p-value. but if you run your R test again with say 100 cell cycle genes instead of 10 (i.e. q = 99 rather than 9), you'll get a much smaller p-value (far below 0.05), indicating that the chance of randomly getting 100 or more cell cycle genes when grabbing 200 at a time is very small, which suggests the enrichment is statistically significant. hope that helps.
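To make that intuition concrete, here is a small Python sketch that recomputes the upper-tail probability for several hit counts, using the same hypothetical background (10,000 genes, 500 of them cell cycle, 200 drawn). The helper `upper_tail` and the choice of hit counts are illustrative assumptions, not part of any package.

```python
from math import comb

def upper_tail(k, K, N, n):
    """P(X >= k) for n draws without replacement from N genes, K of them annotated."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

# How the p-value changes as more of the 200 listed genes are cell cycle genes.
p_by_hits = {k: upper_tail(k, 500, 10_000, 200) for k in (10, 20, 30, 100)}
for k, p in p_by_hits.items():
    print(f"{k:>3} cell cycle genes in the list: P(X >= {k}) = {p:.3g}")
```

The p-value shrinks quickly once the count moves well above the 10 genes expected by chance.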
                  Last edited by anc327; 10-20-2011, 02:48 PM.



                  • #10
                    I understand how the test works. I guess my question is: is the p-value you obtain from this test useful?

                    In my example, there are 500 genes out of 10,000 genes that have cell cycle GO term. So the probability of getting a cell cycle gene from randomly picking a gene is 500 / 10,000 = 0.05.

                    So if I pick 200 genes randomly, I should expect to get about 10 just by chance. A count significantly above or below that would tell me whether the term is over- or under-represented.
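The "pick 200 genes randomly" baseline can also be simulated directly. The Python sketch below draws 200 genes at random many times from a made-up pool of 10,000 genes with 500 flagged as cell cycle, then counts the flagged ones; the seed and the number of trials are arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

N, K, n = 10_000, 500, 200
# True marks a hypothetical "cell cycle" gene, False marks any other gene.
genes = [True] * K + [False] * (N - K)

trials = 2000
counts = [sum(random.sample(genes, n)) for _ in range(trials)]

mean_count = sum(counts) / trials
frac_ge_10 = sum(c >= 10 for c in counts) / trials
print(f"average cell cycle genes per random list: {mean_count:.2f}")
print(f"fraction of random lists with 10 or more: {frac_ge_10:.2f}")
```

The fraction of random lists with 10 or more cell cycle genes is a simulation-based estimate of the same quantity the hypergeometric test computes exactly.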

                    But with differential expression lists, I am not picking 200 genes randomly. I have 200 genes that I've established to be differentially expressed between two conditions by whatever test I've conducted previously. Can we really say the probability of getting a cell cycle gene in this differentially expressed gene list is 0.05 if we are not randomly choosing genes?

                    Let's say I am comparing two samples: normal sample vs irradiated sample. Irradiation usually screws up cell proliferation. So we expect a lot of genes involved in cell cycle to be down-regulated after irradiation.

                    Out of 500 possible cell cycle genes in a pool of 10,000, we picked up 300 in our differentially down-regulated list of 400 genes. The p-value for this hypergeometric test would be pretty good.
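For a sense of scale, here is a hedged Python sketch of the irradiation example (300 cell cycle genes in a list of 400, against 500 in a pool of 10,000). The probability is too small to trust as an ordinary floating-point number, so the sketch reports log10 of the p-value instead; the variable names are illustrative only.

```python
from math import comb, log

N, K, n, k = 10_000, 500, 400, 300

# Number of cell cycle genes expected in a random list of 400.
expected = n * K / N

# Exact integer counts of all draws, and of draws with at least k cell cycle genes.
total = comb(N, n)
tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))

# Work in logs because the ratio tail / total may underflow a float.
log10_p = (log(tail) - log(total)) / log(10)
print(f"expected by chance: {expected:.0f} genes, observed: {k}")
print(f"log10 of P(X >= {k}) is about {log10_p:.0f}")
```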

                    Under the assumption that 0.05 (500 / 10,000) is the probability of getting a cell cycle gene by chance, we get a good p-value. But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, so the true p-value would be worse than the one the test reports.

                    I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.

                    Sorry if it's a naive thought. Perhaps I am just over-thinking it.



                    • #11
                      Originally posted by damiankao
                      But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, so the true p-value would be worse than the one the test reports.

                      I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.
                      I'm not certain this fully addresses your question, but consider more carefully what anc327 said:
                      Originally posted by anc327
                      there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha). and whether or not you use a multiple test correction on top of the raw p-value
                      There is still some error in picking the genes that are differentially expressed. While that error may be low, <5%, let's just assume that 5% of your differentially expressed genes are false positives. I think one thing the p-value for the term enrichment addresses is the error that will be introduced by false positives in your differentially expressed genes.

                      That's me speaking as a non-statistician, so I have no idea if I am right in this or not.
