I just found this great website. I would like to say thank you to the administrator(s) for providing a really useful resource for the next-gen sequencing community.
I want to introduce to the community a tool we have developed for ChIP-seq data analysis. The tool is called CisGenome and can be downloaded from http://www.biostat.jhsph.edu/~hji/cisgenome/. The paper describing the tool is published in this month's Nature Biotechnology, Ji et al., 2008, 26:1293 - 1300.
I see that ECO has already included CisGenome in the ChIP-seq software list (thanks!). What I want to do here is highlight several key features of CisGenome.
1. New statistics:
When a ChIP-seq experiment involves only a ChIP'd sample and no control sample, CisGenome uses a truncated negative binomial model that we developed to estimate the false discovery rate (FDR). Most existing algorithms for this type of data use a Poisson model or Monte Carlo simulation as the background model, which assumes that the read (tag) sampling rate is constant across the genome. In our experience this is a poor assumption, and in most cases it leads to overstated statistical significance. The negative binomial model used in CisGenome is a simple but much better description of how the read sampling rate varies across the genome. It also does not require users to supply an ad hoc number for the "fraction of alignable genome".
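To illustrate why the choice of background model matters, here is a minimal Python sketch (not CisGenome code) that compares the upper-tail probability of a Poisson and a negative binomial background with the same mean. The mean, the dispersion parameter `size`, and the read cutoff are made-up numbers, and the sketch does not show how CisGenome fits its truncated model.

```python
from math import exp, lgamma, log


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson background with the given mean count per window."""
    return exp(-mean + k * log(mean) - lgamma(k + 1))


def negbin_pmf(k, mean, size):
    """P(X = k) for a negative binomial with the given mean and dispersion `size`.

    Smaller `size` means more window-to-window variation in the sampling rate.
    """
    p = size / (size + mean)
    return exp(
        lgamma(k + size) - lgamma(k + 1) - lgamma(size)
        + size * log(p) + k * log(1.0 - p)
    )


def upper_tail(pmf, cutoff, *params):
    """P(X >= cutoff), computed as one minus the lower cumulative sum."""
    return 1.0 - sum(pmf(k, *params) for k in range(cutoff))


# Hypothetical numbers: 1 read per window on average, cutoff of 8 reads.
MEAN, SIZE, CUTOFF = 1.0, 0.5, 8
poisson_tail = upper_tail(poisson_pmf, CUTOFF, MEAN)
negbin_tail = upper_tail(negbin_pmf, CUTOFF, MEAN, SIZE)
print(f"Poisson tail: {poisson_tail:.3g}")
print(f"Negative binomial tail: {negbin_tail:.3g}")
```

With these assumed parameters the Poisson tail is far smaller than the negative binomial tail, so a window with 8 reads looks much more significant under the constant-rate assumption than it does once variation in the sampling rate is allowed.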
When the ChIP-seq experiment involves both a ChIP'd sample and a negative control sample, we use a conditional binomial model to detect peaks. The model automatically takes into account the difference between the total number of reads in the ChIP sample and the total number of reads in the control sample. In other words, normalization is done naturally by the statistical model. To estimate the false discovery rate, our model does NOT require that the number of ChIP reads match the number of control reads (e.g., it is fine to have 2 million ChIP reads and 1 million control reads, or 1 million ChIP reads vs. 2 million control reads). By comparison, some previous methods compute FDR by switching the ChIP and control labels; these types of methods usually require approximately the same number of ChIP and control reads. Some other methods, like QuEST, compare two negative controls to get an FDR estimate, but to do so you have to double the control reads in your experiment (e.g., to compute FDR for a comparison between 1 million ChIP reads and 1 million control reads, you need another 1 million control reads, because you estimate FDR by comparing control vs. control).
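The idea of a conditional binomial test can be sketched in a few lines of Python. This is only an illustration, not CisGenome code: it assumes the null ChIP fraction is simply total ChIP reads divided by total reads (the real implementation may estimate it differently), and all function names and counts are hypothetical.

```python
from math import comb


def null_chip_fraction(total_chip_reads, total_control_reads):
    """Expected share of a window's reads that come from ChIP when there is no binding."""
    return total_chip_reads / (total_chip_reads + total_control_reads)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def window_p_value(chip_count, control_count, total_chip_reads, total_control_reads):
    """Conditional on the window's total reads, test for an excess of ChIP reads."""
    p0 = null_chip_fraction(total_chip_reads, total_control_reads)
    n = chip_count + control_count
    return binomial_upper_tail(chip_count, n, p0)


# Same window (20 ChIP reads, 5 control reads), two different sequencing depths.
deep_chip = window_p_value(20, 5, 2_000_000, 1_000_000)
deep_control = window_p_value(20, 5, 1_000_000, 2_000_000)
print(f"2M ChIP vs 1M control: {deep_chip:.3g}")
print(f"1M ChIP vs 2M control: {deep_control:.3g}")
```

Because the library sizes enter only through the null fraction, unequal read totals need no separate normalization step: the same window is less surprising when the ChIP library is twice as deep as the control, and more surprising when it is half as deep.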
Finally, many existing tools provide p-values instead of FDR. It is well known that a p-value is not a good error rate measure in the context of multiple testing. CisGenome provides FDR estimates instead of p-values for both one-sample (only a ChIP'd sample is available) and two-sample (both ChIP'd and control samples are available) ChIP-seq analyses.
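A quick back-of-the-envelope calculation shows why a p-value cutoff alone can mislead when millions of windows are tested. The Python sketch below uses hypothetical numbers and a generic "expected false positives divided by calls" estimate, which is meant only to convey the idea and is not necessarily the exact estimator CisGenome uses.

```python
def expected_false_positives(n_windows, p_cutoff):
    """Expected number of null windows passing the cutoff if every window were null."""
    return n_windows * p_cutoff


def estimated_fdr(n_windows, p_cutoff, n_called):
    """Rough FDR: expected false positives divided by the number of windows called."""
    if n_called == 0:
        return 0.0
    return min(1.0, expected_false_positives(n_windows, p_cutoff) / n_called)


# Hypothetical scan: 30 million windows, p-value cutoff 1e-5, 1000 windows called.
N_WINDOWS, P_CUTOFF, N_CALLED = 30_000_000, 1e-5, 1000
false_positives = expected_false_positives(N_WINDOWS, P_CUTOFF)
fdr = estimated_fdr(N_WINDOWS, P_CUTOFF, N_CALLED)
print(f"expected false positives: {false_positives:.0f}")
print(f"estimated FDR: {fdr:.2f}")
```

Under these assumptions a cutoff that sounds stringent still lets through a few hundred null windows, which is a large share of the calls. The FDR makes that share explicit, while the p-value cutoff by itself does not.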
2. Graphical user interface & visualization
If you don't have programming experience, we have a graphical user interface designed for you. If you are an experienced programmer, you can always use our core functions as command-line programs (i.e., you can easily incorporate them into your shell scripts and prepare batch jobs).
In addition to the GUI, we have a CisGenome browser (much like the UCSC browser, but with fewer functions). The browser runs locally on your computer, and you can use it to visualize raw data and peak signals. In the same browser, you can also visualize gene structures, cross-species conservation, DNA sequences, motif logos, etc., and you can add custom tracks. Remember, this is a lightweight browser running on your own computer, so you don't need to upload anything to a web server (as you would in order to use UCSC). It is designed to save time in large-scale interactive analyses, since it avoids uploading large data sets to web servers.
3. Motif analysis, gene annotation, sequence retrieval, etc.
ChIP-seq peak detection is not the only function of CisGenome. You can also use CisGenome for a range of downstream analyses, including de novo motif discovery, mapping motifs to the genome or to any set of genomic regions, adding gene annotations, retrieving DNA sequences, and getting summary statistics about the distribution of your peaks (e.g., x% are in exons, y% are in 1 kb promoters, etc.). You can also use CisGenome to analyze ChIP-chip data.
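As an illustration of the kind of peak-location summary mentioned above, here is a small Python sketch. It is not CisGenome code: the function names, the single-chromosome half-open intervals, and the toy coordinates are all hypothetical.

```python
from collections import Counter


def classify_peak(peak_mid, exons, promoters):
    """Label a peak by the first annotation class whose interval contains its midpoint."""
    for start, end in exons:
        if start <= peak_mid < end:
            return "exon"
    for start, end in promoters:
        if start <= peak_mid < end:
            return "promoter"
    return "other"


def location_percentages(peak_mids, exons, promoters):
    """Percentage of peaks falling in each annotation class."""
    counts = Counter(classify_peak(mid, exons, promoters) for mid in peak_mids)
    total = len(peak_mids)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy annotation on one chromosome, as half-open [start, end) intervals.
EXONS = [(100, 200)]
PROMOTERS = [(900, 1000)]
PEAK_MIDPOINTS = [150, 950, 5000, 160]

summary = location_percentages(PEAK_MIDPOINTS, EXONS, PROMOTERS)
for label, percent in sorted(summary.items()):
    print(f"{label}: {percent:.0f}%")
```

A real analysis would handle multiple chromosomes, strand-aware promoter definitions, and overlapping annotations. The sketch only shows the shape of the calculation.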
Of course, any software will have bugs, and we would not be surprised if you encounter some in CisGenome. If you find a bug, please let us know and we will try to fix it. We hope that you will find CisGenome useful in your own work.