**lmc** · 10-20-2011, 06:40 AM

It means that there is no significant enrichment of cell cycle genes in your subset of genes. As you stated, the number of cell cycle genes in your subset is what you would expect if you randomly selected 200 genes from the set of 10000.
In other words, the proportion of genes with a "cell cycle" annotation in your subset of genes is similar to (actually, in this case it is exactly the same as) the proportion of genes with a "cell cycle" annotation in the whole gene set.

10/200=0.05
500/10000=0.05
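
A minimal sketch of the same comparison in Python, using the counts from this thread. The function name `fold_enrichment` is only an illustrative choice, not part of any package.

```python
def fold_enrichment(hits_in_subset, subset_size, hits_in_background, background_size):
    """Ratio of the annotation's proportion in the subset to its proportion in the background."""
    subset_prop = hits_in_subset / subset_size
    background_prop = hits_in_background / background_size
    return subset_prop / background_prop

# Counts from the example: 10 of 200 subset genes and
# 500 of 10000 background genes are annotated "cell cycle".
expected_hits = 200 * 500 / 10000   # hits expected from a random draw of 200 genes
fold = fold_enrichment(10, 200, 500, 10000)
print(expected_hits, fold)
```

A fold value near 1 means the subset carries the annotation at about the background rate.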

Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed. This type of analysis assumes that the genes in your differentially expressed gene list are actually differentially expressed and that the annotations are correct. The results indicate only whether your subset of genes carries more of a given annotation than you would expect by chance.

**damiankao** · 10-20-2011, 07:28 AM

Any type of over-representation analysis has no relation to whether an individual gene is differentially expressed.

Thanks. So there really is no point in using an over-representation analysis on differentially expressed genes. It doesn't tell you anything about differential expression.

**lmc** · 10-20-2011, 07:35 AM

Yup, that's correct.

**lmc** · 10-20-2011, 07:37 AM

Although, if you want to know whether a group of genes with a specific function is enriched in your subset of differentially expressed genes, then it may be useful to you.

**chadn737** · 10-20-2011, 07:40 AM

Why is there no point? Looking at over-representation can tell you a lot about what is going on in the data. If a particular GO category is over-represented then that process is likely to be particularly important under the conditions you are testing for.

The fact that cell cycle is not over-represented tells you something as well, that maybe there is not much changing in relation to cell cycle under your conditions.

Looking for enrichment has been valuable and informative in my own work.

**lmc** · 10-20-2011, 07:50 AM

I think damiankao was trying to use enrichment analysis to determine the probability that an individual gene is differentially expressed. But, of course, enrichment analysis has no relation to whether or not an individual gene is differentially expressed. So, in this particular context, there is no point in performing this type of analysis.

**damiankao** · 10-20-2011, 08:13 AM

I guess I am trying to point out that over-representation analysis gives you significance relative to random chance.

In the case of differential expression, there is no random chance because we are already assuming the list is correct. We are not getting significance values relative to all possible configurations of the differentially expressed list, because there is only one list.

In my cell cycle example, my differentially expressed gene list shows under-representation of cell cycle. What does that really mean? Under random selection, the probability of seeing a cell cycle gene is 0.05. Are we assuming that, between my two conditions, the probability of seeing a cell cycle gene differentially expressed is also 0.05?

**anc327** · 10-20-2011, 12:41 PM

i would agree with others here who say that functional annotation (enrichment testing) of signature/diff exp gene lists is indeed useful, and it is a basic tool built into almost every expression analysis package out there (e.g. DAVID, Ingenuity, various R packages, etc.).

there are two sets of p-values to consider carefully. the first is the p-value cutoff (the alpha), often from a t-test, used to define the differentially exp genes, along with whether or not you apply a multiple-test correction on top of the raw p-value. note that some people will split the diff exp genes into up-reg'd and down-reg'd sets.

the next p-value is the Fisher's/hypergeometric enrichment test p-value. depending on the gene list set sizes (on average 100~1000 genes) you will get a set of GO category rankings. with this hypothesis test, you will need fairly stringent cutoffs to "trust" the enrichments. this is heavily dependent on having reasonable gene set sizes; otherwise you will get misleading enrichment annotations. multiple-test corrections are also useful for these tests.
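
As one illustration of a multiple-test correction on enrichment p-values, here is a minimal sketch of the Benjamini-Hochberg adjustment. The post does not name a specific method, so Benjamini-Hochberg is an assumption here, and the four raw p-values are hypothetical values made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw enrichment p-values for four GO categories (not from the thread).
raw = [0.001, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
print(adj)
```

A category would then be called enriched only if its adjusted value falls below the chosen cutoff.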

bottom line, the first task of getting diff exp genes requires one type of test and the second task of enrichment requires another. these are independent of one another but are often used in this sequence to identify potential pathways or GO lists in expression data. you may need to test various cutoffs to see how robust your enrichments are.

in your example, you are asking a basic question about what Fisher's testing is all about, and there are plenty of places to read up on it. an intuitive interpretation: if you randomly grabbed 200 genes out of 10000 and got only 9 that were cell cycle, you could easily get that many by random chance, so you end up with an uninteresting p-value. but if you run your R test again with, say, 100 instead of 9, you'll get a much smaller p-value (likely < 0.05), indicating that the chance of randomly getting 100 cell cycle genes when grabbing 200 at a time is very small, which suggests the enrichment is statistically significant. hope that helps.
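
The comparison of 9 versus 100 cell cycle genes can be sketched with the hypergeometric upper tail, which is what a one-sided Fisher's exact test for enrichment computes. This sketch uses only the Python standard library rather than R; the function name `hypergeom_upper_tail` is illustrative, and the background numbers (10000 genes, 500 annotated) are the ones used elsewhere in this thread.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement
    from N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    # Count the draws that contain at least k annotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

p_9 = hypergeom_upper_tail(9, 10000, 500, 200)      # 9 cell cycle genes in the list
p_100 = hypergeom_upper_tail(100, 10000, 500, 200)  # 100 cell cycle genes in the list
print(p_9, p_100)
```

The first value is large because 9 is close to the 10 hits expected by chance; the second is vanishingly small.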

**damiankao** · 10-20-2011, 01:42 PM

I understand how the test works. I guess my question is: is the p-value you obtain from this test useful?

In my example, there are 500 genes out of 10,000 genes that have cell cycle GO term. So the probability of getting a cell cycle gene from randomly picking a gene is 500 / 10,000 = 0.05.

So if I pick 200 genes randomly, I should expect to get about 10 just by chance. Anything significantly above or below that would tell me whether the term is over- or under-represented.
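
A small simulation can make the "by chance" baseline concrete. This is only a sketch: the number of trials, the seed, and the convention that genes 0..499 are the annotated ones are arbitrary assumptions for illustration.

```python
import random

def simulate_hits(N=10000, K=500, n=200, trials=2000, seed=0):
    """Draw n genes at random `trials` times and count how many
    are annotated in each draw."""
    rng = random.Random(seed)
    population = range(N)  # genes 0..K-1 are treated as the annotated ones
    return [sum(1 for g in rng.sample(population, n) if g < K)
            for _ in range(trials)]

counts = simulate_hits()
mean_hits = sum(counts) / len(counts)
print(mean_hits, min(counts), max(counts))
```

The counts scatter around the expected value of 10, and the spread of that scatter is what the enrichment p-value is measured against.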

But with differential expression lists, I am not picking 200 genes randomly. I have 200 genes that I've established to be differentially expressed between two conditions by whatever test I've conducted previously. Can we really say the probability of getting a cell cycle gene in this differentially expressed gene list is 0.05 if we are not randomly choosing genes?

Let's say I am comparing two samples: normal sample vs irradiated sample. Irradiation usually screws up cell proliferation. So we expect a lot of genes involved in cell cycle to be down-regulated after irradiation.

Out of 500 possible cell cycle genes in a pool of 10,000, we picked up 300 in our differentially down-regulated list of 400 genes. The p-value for this hypergeometric test would be pretty good.
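
For these numbers the tail probability may be too small to represent comfortably as a floating-point value, so this sketch reports its base-10 logarithm instead, computed from exact integers. The function name `log10_upper_tail` is illustrative.

```python
from math import comb, log10

def log10_upper_tail(k, N, K, n):
    """log10 of P(X >= k) under the hypergeometric null,
    computed with exact integer arithmetic."""
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return log10(favourable) - log10(comb(N, n))

expected = 400 * 500 / 10000   # cell cycle genes expected in a random list of 400
log_p = log10_upper_tail(300, 10000, 500, 400)
print(expected, log_p)
```

Seeing 300 hits where about 20 are expected gives a strongly negative `log_p`, which is the "pretty good" p-value described above.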

Under the assumption that 0.05 (500 / 10,000) is the probability of getting a cell cycle gene by chance, we get a good p-value. But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, so the true p-value would be worse than the one the test reports.

I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.

Sorry if it's a naive thought. Perhaps I am just over-thinking it.

**chadn737** · 10-20-2011, 01:57 PM

Originally posted by damiankao

But since we are introducing two conditions into our picking of the list, we are not picking by chance. The probability of getting a cell cycle gene under our two conditions should probably be higher, so the true p-value would be worse than the one the test reports.

I have no idea how to adjust the probability based on the conditions picked. I am just not sure whether using a random probability on a non-randomly picked list will tell us anything meaningful.

I'm not certain this fully addresses your question, but consider more carefully what anc327 said:

Originally posted by anc327

there are two sets of p-values to consider carefully. the first is often a t-test p-value cutoff used to define the differentially exp genes (the alpha). and whether or not you use a multiple test correction on top of the raw p-value

There is still some error in picking the genes that are differentially expressed. While that error may be low (<5%), let's just assume that 5% of your differentially expressed genes are false positives. I think one thing the term-enrichment p-value addresses is the error introduced by those false positives.

That's me speaking as a non-statistician, so I have no idea whether I am right about this.

gene ontology over-representation of differentially expressed genes