I am pretty new to the statistical methods used to calculate the probability of over-representation, so bear with me. I've been reading this for an intro on this topic:

My main question is: do we care about the probability of GO over-representation in differentially expressed genes?

The p-value after an over-representation analysis is the chance that the GO term appearing in my sub-list is due to random chance. But what does "random chance" mean in the context of differential expression lists? The chance that the gene is not differentially expressed? The chance that the GO assignment was wrong?

Example case:

- I have a total gene set of 10,000 genes.

- 500 of those genes have the "cell cycle" GO term.

- I have a list of 200 differentially expressed genes, and 10 of them are cell cycle genes.

If I do a simple hypergeometric test in R (according to the presentation I linked above) with:

phyper(9, 500, 10000-500, 200, lower.tail=FALSE)

I get a pretty bad p-value of: 0.55
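
To check that I am reading the call correctly: as far as I understand, with a first argument of 9 and lower.tail=FALSE, phyper returns the upper tail, i.e. the probability of seeing 10 or more cell cycle genes when 200 genes are drawn at random, without replacement, from the 10,000. Written out (the symbol X for the number of cell cycle genes in the random draw is my own notation, not taken from the presentation):

$$
P(X \ge 10) \;=\; \sum_{k=10}^{200} \frac{\binom{500}{k}\,\binom{9500}{200-k}}{\binom{10000}{200}}
$$

Under this random-draw model the expected number of cell cycle genes is 200 × 500 / 10,000 = 10, which is exactly the count I observed, so a p-value close to 0.5 seems plausible.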

So that p-value is telling me that finding 10 cell cycle genes in my differentially expressed list is not very significant: if we randomly draw 200 genes from the pool of 10,000 genes, the chance of getting at least 10 cell cycle genes is 0.55.

What does the significance really mean in this context? Does it mean there is a good chance that the 10 cell cycle genes really weren't differentially expressed in the first place? What if the 10 cell cycle genes had a very high significance in my differential expression analysis?

## Comment