Hi,
I guess my question would fit better on a statistics board, but it is hard to explain genomics to non-biologists (at least for me).
I use the nice bedtools package to find overlapping intervals of ChIP-seq regions from different factors/chromatin modifications.
1st question:
How can one calculate whether an observed overlap is significantly different from random overlap? I.e., what is the probability that 1000 regions of factor A (let's say 500 bp to 1000 bp each) overlap with factor B (2000 regions) in 800 cases?
Of course, in this case there is significantly more overlap than random. I did this by binning the genome into 1 kb bins. Then I recorded for each bin whether it was bound by A and/or B. This let me easily assign distinct probabilities for each bin to be associated with a single factor or with both factors (as it is easy to determine the "universe"). Then I performed Chi-square or similar tests.
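To make the binning approach concrete, here is a minimal sketch in plain Python. It assumes regions are given as half-open (chrom, start, end) tuples, as in BED files, and that chromosome sizes are known; the function names `bin_ids` and `binned_overlap_test` are made up for this illustration. It builds the 2x2 table (bound by both / only A / only B / neither) and computes a Pearson chi-square test without continuity correction. Note that neighbouring bins are not truly independent, so the p-value should be read as a rough guide only.

```python
import math

def bin_ids(regions, bin_size=1000):
    """Return the set of (chrom, bin_index) pairs touched by half-open regions."""
    bins = set()
    for chrom, start, end in regions:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins.add((chrom, b))
    return bins

def binned_overlap_test(regions_a, regions_b, chrom_sizes, bin_size=1000):
    """Chi-square test (1 degree of freedom) on the 2x2 table of bins bound by A and/or B."""
    # The "universe": every bin of every chromosome.
    total = sum(math.ceil(size / bin_size) for size in chrom_sizes.values())
    a = bin_ids(regions_a, bin_size)
    b = bin_ids(regions_b, bin_size)
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = total - both - only_a - only_b

    # Pearson chi-square statistic for a 2x2 table, no continuity correction.
    num = total * (both * neither - only_a * only_b) ** 2
    den = (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    chi2 = num / den if den else 0.0
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return {
        "table": [[both, only_a], [only_b, neither]],
        "expected_both": len(a) * len(b) / total,  # expected under independence
        "chi2": chi2,
        "p_value": p_value,
    }

# Toy data (invented for illustration): one 10 kb chromosome, i.e. ten 1 kb bins.
toy_a = [("chr1", 0, 500), ("chr1", 2000, 2800), ("chr1", 5000, 5600)]
toy_b = [("chr1", 100, 900), ("chr1", 2500, 3500), ("chr1", 7000, 7500)]
result = binned_overlap_test(toy_a, toy_b, {"chr1": 10000})
print(result["table"], result["p_value"])
```

With real data the universe would be the whole genome, and the bin size changes how often one region spans several bins, so it is worth checking that the conclusion holds for more than one bin size.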
But this approach appears too complicated to me. There must be a way to directly calculate the probability of an overlap between two (or more) factors.
Does anyone have a suggestion how to accomplish this?
2nd question, related to the one above: Is there a script (or can someone explain what I have to think about to write one on my own in Perl/R/Python) that intersects the regions of n factors and reports all the different possible outcomes, i.e. sites with all factors, all but one factor, all but two factors, etc.?
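One possible way to think about this, sketched in plain Python: pool the regions of all factors, merge overlapping regions into "sites", and record which factors contributed to each site. This is only an illustration under stated assumptions: regions are half-open (chrom, start, end) tuples, and the function name `factor_combinations` and the toy data are made up. A caveat of merging is that factors can end up in the same site through a chain (A overlaps B, B overlaps C) even when A and C do not overlap directly.

```python
from collections import Counter

def factor_combinations(factor_regions):
    """Count merged sites per combination of factors.

    factor_regions: dict mapping factor name -> list of (chrom, start, end).
    Returns a Counter mapping frozenset of factor names -> number of sites.
    """
    # Pool all regions, tagged with their factor, and sort by position.
    tagged = sorted(
        (chrom, start, end, name)
        for name, regions in factor_regions.items()
        for chrom, start, end in regions
    )
    counts = Counter()
    cur_chrom, cur_end, cur_factors = None, None, set()
    for chrom, start, end, name in tagged:
        if chrom == cur_chrom and start < cur_end:
            # Region overlaps the current site: extend it and note the factor.
            cur_end = max(cur_end, end)
            cur_factors.add(name)
        else:
            # Region starts a new site: store the finished one first.
            if cur_factors:
                counts[frozenset(cur_factors)] += 1
            cur_chrom, cur_end, cur_factors = chrom, end, {name}
    if cur_factors:
        counts[frozenset(cur_factors)] += 1
    return counts

# Toy data, invented for illustration.
toy = {
    "A": [("chr1", 100, 600), ("chr1", 2000, 2500), ("chr2", 100, 700)],
    "B": [("chr1", 400, 900), ("chr1", 5000, 5500)],
    "C": [("chr1", 800, 1200), ("chr2", 300, 900)],
}
for combo, n_sites in sorted(factor_combinations(toy).items(), key=lambda kv: sorted(kv[0])):
    print("+".join(sorted(combo)), n_sites)
```

Grouping the resulting counts by the number of factors in each combination then gives "all factors", "all but one", "all but two", and so on.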
I'd be glad if someone could point me in the right direction on how to approach these two aspects!
Maxim