Hi guys!
I have a question about analysing several types of ChIP-seq data. I want to combine them by clustering, to look for meaningful epigenetic patterns in the dataset. Briefly, I generated a matrix whose rows are 200 bp genomic bins (several million rows) and whose columns are epigenetic marks. I apply PAM clustering (clara from the 'cluster' R package) to the matrix; it works and it is quite fast.

The problem is how to determine the optimal number of clusters. I tried several approaches from different R packages (silhouette, pamk, gap statistic and so on), but none of them worked on the full matrix because they need too much memory in R.

So my idea is to extract a subset of, let's say, 10,000 to 50,000 rows from the full matrix and use it to infer the optimal number of clusters. Do you think this would be a valid approach? In that case, of course, I would need a good criterion for choosing the subset. For the moment I have not found any other way to set the optimal k. I would be very grateful if somebody could help me. Thanks a lot.
fran
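
P.S. To make the subsampling idea concrete, here is a rough sketch of what I have in mind. It is written in plain Python only so that it is self-contained; it is not my actual R code and it does not call clara. The function names (`choose_k`, `k_medoids`, `avg_silhouette`), the farthest-first start and the simple alternating medoid update are all my own assumptions for illustration, not the algorithm used by the 'cluster' package. The idea is to draw a few random subsets of rows, cluster each one for every candidate k, and keep the k with the highest mean silhouette width across subsets.

```python
import math
import random


def assign(points, medoids):
    """Label every point with the index of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, rng, max_iter=20):
    """Simplified k-medoids (alternating update), NOT the full PAM swap search."""
    n = len(points)
    # Farthest-first start (an assumption of this sketch; sensitive to outliers).
    medoids = [rng.randrange(n)]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(
            math.dist(points[i], points[m]) for m in medoids)))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # New medoid = member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda i: sum(
                math.dist(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(points, medoids)


def avg_silhouette(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(rows, k_values, sample_size=1000, n_samples=5, seed=0):
    """Return (best_k, mean silhouette per k) estimated on random row subsets."""
    rng = random.Random(seed)
    scores = {k: [] for k in k_values}
    for _ in range(n_samples):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        for k in k_values:
            scores[k].append(avg_silhouette(sample, k_medoids(sample, k, rng)))
    mean_scores = {k: sum(v) / len(v) for k, v in scores.items()}
    return max(mean_scores, key=mean_scores.get), mean_scores
```

Here `rows` would be a list of numeric tuples, one per bin. The silhouette step needs all pairwise distances within a subset, so memory and time grow with the square of `sample_size`, which is why I would keep each subset small and repeat the draw several times to see whether the chosen k is stable.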