I am attempting to carry out GO term enrichment analysis on a set of differentially methylated genomic regions (DMRs). The program I am using for this is DAVID.

My data are derived from reduced-representation bisulfite sequencing (RRBS) of liver tissue. Although I am working on a non-model species, there is a relatively close reference genome available, and I have used this to annotate the DMRs where they overlap with known genes.

I therefore have a gene list that I would like to use for GO term enrichment analysis and to explore other aspects of functional annotation. However, these analyses in DAVID require a background list for statistical comparison. This usually means taking the list of genes known from that particular reference genome (e.g. 30,000 in humans) and then testing whether any particular gene categories are over-represented in my gene list relative to that genomic background. With RRBS data, I am only sequencing a subset of the genome, and the genes in my gene list can only come from this subset. Asking whether gene categories are over-represented in my DMR data set by comparing against all genes known from that organism therefore does not make much sense to me.
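To make the concern concrete, here is a minimal sketch of a plain hypergeometric over-representation test run against two different backgrounds. This is only an illustration of the general idea and is not necessarily the exact statistic DAVID computes. The function name and all of the counts are hypothetical, made up purely for the example.

```python
from math import comb


def enrichment_p_value(hits, list_size, category_size, background_size):
    """Upper-tail hypergeometric probability P(X >= hits).

    hits            -- genes in my list that belong to the category
    list_size       -- total genes in my list
    category_size   -- genes in the background that belong to the category
    background_size -- total genes in the background
    """
    max_hits = min(list_size, category_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(category_size, x)
        * comb(background_size - category_size, list_size - x)
        for x in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: the same 100-gene DMR list with 8 genes in one GO category,
# tested against the whole genome and against an RRBS-covered subset.
p_whole_genome = enrichment_p_value(8, 100, 300, 30000)
p_rrbs_background = enrichment_p_value(8, 100, 150, 6000)

print(f"whole-genome background: p = {p_whole_genome:.3g}")
print(f"RRBS-covered background: p = {p_rrbs_background:.3g}")
```

With these made-up numbers the category makes up a larger share of the RRBS-covered background than of the whole genome, so the same gene list looks less surprising against the restricted background. That is exactly the kind of difference I am worried about.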

One way of creating a background list might be to take all genes that overlap with any aligned methylated C in my data set (not just the significant ones) and use this as the background for the test. Would this be appropriate? One complication is that I am making three different pairwise treatment comparisons, so I guess I would need three independent background lists, each built from the methylated Cs present in at least one individual/replicate in both of the treatment groups being compared. Or is that nonsense?
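As a rough sketch of what I have in mind, assuming I already have, for each sample, the set of genes overlapping at least one covered C, and assuming the three comparisons come from three treatment groups. The sample names, group names, gene names and function names below are all hypothetical toy values.

```python
from itertools import combinations


def genes_covered_in_group(samples, genes_by_sample):
    """Genes covered in at least one individual/replicate of a group."""
    covered = set()
    for sample in samples:
        covered |= genes_by_sample[sample]
    return covered


def pairwise_backgrounds(groups, genes_by_sample):
    """One background per pairwise comparison: genes covered in both groups."""
    per_group = {
        name: genes_covered_in_group(samples, genes_by_sample)
        for name, samples in groups.items()
    }
    return {
        (a, b): per_group[a] & per_group[b]
        for a, b in combinations(sorted(groups), 2)
    }


# Toy input: genes overlapping any covered C, per sample (hypothetical).
genes_by_sample = {
    "ctrl_1": {"geneA", "geneB"},
    "ctrl_2": {"geneB", "geneC"},
    "low_1": {"geneA", "geneC", "geneD"},
    "low_2": {"geneD"},
    "high_1": {"geneB", "geneD"},
    "high_2": {"geneE"},
}
groups = {
    "control": ["ctrl_1", "ctrl_2"],
    "low": ["low_1", "low_2"],
    "high": ["high_1", "high_2"],
}

backgrounds = pairwise_backgrounds(groups, genes_by_sample)
for pair, genes in backgrounds.items():
    print(pair, sorted(genes))
```

Each comparison then gets its own background, and the DMR gene list for that comparison would be tested against only that background.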

More generally, does anyone have any advice on how to generate an appropriate background list for RRBS functional annotation analysis? I guess this might overlap with other reduced-representation sequencing techniques such as RAD-seq. Alternatively, does anyone know of any other methods/programs for carrying out GO term enrichment analysis that take into account the biased sampling of the genome involved in RRBS?

Thanks in advance for any advice you can give,