  • CisGenome -- an integrated tool for ChIP-seq data analysis

    I just found this great website. I would like to say thank you to the administrator(s) for providing a really useful resource for the next-gen sequencing community.

    I want to introduce to the community a tool we have developed for ChIP-seq data analysis. The tool is called CisGenome and can be downloaded from http://www.biostat.jhsph.edu/~hji/cisgenome/. The paper describing the tool is published in this month's Nature Biotechnology, Ji et al., 2008, 26:1293 - 1300.


    I noticed that ECO has already included CisGenome in the ChIP-seq software list (thanks!). What I want to do here is highlight several critical features of CisGenome.

    1. New statistics:

    When a ChIP-seq experiment involves only a ChIP'd sample and no control sample, we use a truncated negative binomial model to estimate the false discovery rate (FDR). Most existing algorithms for handling this type of data use a Poisson model or Monte Carlo simulation as the background model, which carries the underlying assumption that the read (tag) sampling rate is constant across the genome. Our own experience shows that this is a poor assumption and in most cases leads to overstated statistical significance. The negative binomial model we use in CisGenome provides a simple but much better description of how the read sampling rate varies across the genome. Also, it does not require users to provide an ad hoc number for the "fraction of alignable genome".
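To make the point about overstated significance concrete, here is a minimal sketch (not CisGenome's code) that compares the chance of seeing a high read count in a window under a Poisson background and under a negative binomial background with the same mean. The mean, dispersion (`size`), and threshold values are made up for illustration, and the truncated fitting step is not shown.

```python
import math

def poisson_tail(c, mean):
    """P(X >= c) for X ~ Poisson(mean)."""
    cdf = sum(
        math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
        for k in range(c)
    )
    return 1.0 - cdf

def negbin_tail(c, mean, size):
    """P(X >= c) for a negative binomial with the given mean.

    `size` is the dispersion parameter; the variance is
    mean + mean**2 / size, so a small `size` means more variation
    in the sampling rate from window to window.
    """
    p = size / (size + mean)
    cdf = 0.0
    for k in range(c):
        log_pmf = (
            math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1.0 - p)
        )
        cdf += math.exp(log_pmf)
    return 1.0 - cdf

# Hypothetical numbers: 2 reads per window on average, peak call at >= 10 reads.
print("Poisson tail:          ", poisson_tail(10, 2.0))
print("Negative binomial tail:", negbin_tail(10, 2.0, 1.0))
```

With these made-up parameters the Poisson tail is far smaller than the negative binomial tail, so a window that looks highly significant under Poisson can be unremarkable once the extra variation is modeled.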

    When the ChIP-seq experiment involves both a ChIP'd sample and a negative control sample, we use a conditional binomial model to detect peaks. The model automatically takes into account the difference between the total number of reads in the ChIP sample and the total number of reads in the control sample. In other words, normalization is done naturally by the statistical model. To estimate the false discovery rate, our model does NOT require that the number of ChIP reads match the number of control reads (e.g., it is fine to have 2 million ChIP reads and 1 million control reads, or 1 million ChIP reads vs. 2 million control reads). By comparison, some previous methods compute FDR by switching the ChIP and control labels; these methods usually require approximately the same number of ChIP and control reads. Other methods, like QuEST, compare two negative controls to get an FDR estimate, but to do so you have to double the control reads in your experiment (i.e., to compute FDR for a comparison between 1 million ChIP reads and 1 million control reads, you need another 1 million control reads, and you estimate FDR by comparing control vs. control).
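The idea of conditioning on the total count in a window can be sketched as follows. This is an illustration only, not the CisGenome implementation: the function names are hypothetical, and the null proportion `p0` is assumed here to be simply the ChIP library's share of all reads, which may differ from how CisGenome estimates it.

```python
import math

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )

def window_pvalue(chip_count, control_count, chip_total, control_total):
    """Upper-tail p-value for one window, conditional on its total read count.

    Assumption: under the null, each read in the window comes from the
    ChIP sample with probability p0 = chip_total / (chip_total + control_total).
    """
    n = chip_count + control_count
    p0 = chip_total / (chip_total + control_total)
    return binom_upper_tail(chip_count, n, p0)

# Same window counts, different library sizes (all numbers hypothetical).
print(window_pvalue(5, 0, 1_000_000, 1_000_000))
print(window_pvalue(5, 0, 2_000_000, 1_000_000))
```

Because `p0` changes with the library sizes, the same window counts are judged less surprising when the ChIP library is twice as deep, which is the sense in which normalization comes out of the model rather than a separate scaling step.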

    Finally, many existing tools provide p-values instead of FDR. It is well known that the p-value is not a good error rate measure in the context of multiple testing. CisGenome provides FDR estimates instead of p-values for both one-sample (only a ChIP'd sample is available) and two-sample (both ChIP'd and control samples are available) ChIP-seq analyses.

    2. Graphical user interface & visualization

    If you don't have programming experience, we have a graphical user interface designed for you. If you are an experienced programmer, you can always use our core functions as command-line programs (i.e., you can easily incorporate them into your shell scripts and prepare batch jobs).

    In addition to the GUI, we have a CisGenome browser (much like the UCSC browser but with fewer functions). The browser runs locally on your computer, and you can visualize raw data and peak signals in it. In the same browser, you can also visualize gene structures, cross-species conservation, DNA sequences, motif logos, etc. You can also add custom tracks. Remember, this is a lightweight browser running on your own computer, so you don't need to upload anything to a web server (as you would in order to use UCSC). It is designed to save time in large-scale interactive analyses, since it avoids uploading large data sets to web servers.

    3. Motif analysis, gene annotation, sequence retrieval, etc.

    ChIP-seq peak detection is not the only function of CisGenome. You can also use CisGenome for a range of downstream analyses, including de novo motif discovery, mapping motifs to the genome or to any set of genomic regions, adding gene annotations, retrieving DNA sequences, and getting summary statistics about the distribution of your peaks (e.g., x% are in exons, y% are in 1 kb promoters, etc.). You can also use CisGenome to analyze ChIP-chip data.

    Of course, any software will have bugs, and we will not be surprised if you encounter some in CisGenome. When you find bugs, please let us know and we will try to fix them. We hope that you will find CisGenome useful in your own work.

  • #2
    I have tested the CisGenome browser. Though I have not used all of its functions, it looks quite good!



    • #3
      Sounds really promising. I'll check it out.



      • #4
        Hi hji,

        I saw the article in NBT the other day and it certainly looks really useful. I have a few questions, if you don't mind.

        1. Does the peak detection algorithm in ChIP-seq adjust for variable number of potential single mapping sites in different regions? I am assuming that the algorithm only uses uniquely mapping reads. A few tags in a region mostly consisting of repeats can be more significant than many tags in a unique region - is this accounted for?

        2. My understanding is that the GUI is only available for Windows. Is all functionality available in the Linux version, and can analysis results obtained on the Linux platform be transferred to a Windows computer for viewing and further analysis? I guess what I'm asking is how decoupled the GUI is from core functionality, file formats etc.

        Best regards,
        Erik



        • #5
          Erik,

          Re your first question: "Does the peak detection algorithm in ChIP-seq adjust for variable number of potential single mapping sites in different regions? I am assuming that the algorithm only uses uniquely mapping reads. A few tags in a region mostly consisting of repeats can be more significant than many tags in a unique region - is this accounted for?"

          If you are using two-sample analysis, this is automatically adjusted for, since the same bias should apply to both the ChIP'd and the control sample (correct me if I'm wrong).

          If you are using one-sample analysis, the answer is no, we haven't adjusted for it in the current version. You raised a very good point, and we will try to incorporate this into the next version of our peak detection algorithm if it tests well.

          Re your second question: "My understanding is that the GUI is only available for Windows. Is all functionality available in the Linux version, and can analysis results obtained on the Linux platform be transferred to a Windows computer for viewing and further analysis? I guess what I'm asking is how decoupled the GUI is from core functionality, file formats etc."

          You are right, the GUI is currently only for Windows. But all core algorithms can be run on Linux. The Windows GUI uses the same core algorithms as the Linux version and yields the same results in the same formats. So you can transfer results from Linux to a Windows machine and perform further analysis from there.



          • #6
            Any suggestions for using CisGenome for MeDIP-chip without input controls (only treated vs. untreated)? I am still waiting for the normalization of my 63 .cel files to finish, so I have not had a chance to explore the TileMap interface. Any suggestions for starting conditions are appreciated.

            Frank



            • #7
              I'm not quite sure how your data are structured, but it looks like a typical two-sample comparison should work.



              • #8
                Sorry I did not make this clearer. Now that I have done a couple of analyses, I can tell you that I am not getting any peaks using HMM and 2 samples when comparing (treatment > control), and only about 20 peaks for (control > treatment). I have not used the UMS settings yet.
                Since I am looking for single-base events (CpG or MeCpG) and not TF binding, I was wondering what my most relaxed (least stringent) HMM setting for peak detection would be. I can identify 3000+ regions via MA(300) for (treatment > control), but only 5 of these regions have FDR 0.0000000, and the next group of peaks is at 0.10000000.
                I also have no good grasp of why the FDR numbers in the COD files are grouped instead of continuous (e.g. 5 peaks at FDR=0.0000000, next peak group at 0.1000000).

                I greatly appreciate your input. I am just trying to work my way through the 2005 TileMap paper. If only my statistical comprehension were better. But the program so far is very nice, especially since my boss always wanted some sort of FDR calculation incorporated into tiling analysis.

                Thanks again



                • #9
                  In that case, I suggest you look at the raw data first. You can import the fc.bar and ma.bar files into the CisGenome browser and look at the top peaks. Ask yourself the question: do they look like something real? This will help you understand whether the FDR makes sense or not.

                  Regarding why the FDR values are always grouped: it is because the FDR is forced to be monotone. Your peaks are ranked, and the raw FDR is computed as (# peaks in the left tail)/(# peaks in the right tail). Suppose the raw FDR is: 0.01; 0.02; 0.00; 0.06; 0.05; 0.07 ... then the reported FDR will be 0.00; 0.00; 0.00; 0.05; 0.05; 0.07 ... This is somewhat like the Benjamini-Hochberg procedure.
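The numbers in that example are what you get from a running minimum taken from the bottom of the ranked list upward, so each peak reports the smallest raw FDR seen at its rank or any lower rank. A minimal sketch, assuming that is how the monotone constraint is applied (the function name is made up for illustration):

```python
def monotone_fdr(raw_fdr):
    """Force a ranked list of raw FDR values to be non-decreasing.

    Each entry is replaced by the minimum of itself and every entry
    ranked below it, i.e. a running minimum from the end of the list.
    """
    reported = list(raw_fdr)
    for i in range(len(reported) - 2, -1, -1):
        reported[i] = min(reported[i], reported[i + 1])
    return reported

raw = [0.01, 0.02, 0.00, 0.06, 0.05, 0.07]
print(monotone_fdr(raw))  # -> [0.0, 0.0, 0.0, 0.05, 0.05, 0.07]
```

This also explains the grouping: every peak ranked above a dip in the raw FDR inherits the value at the dip, so runs of identical FDR values appear in the output.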



                  • #10
                    Hi HJI,
                    I am trying to analyze my ChIP-seq results, and I am hoping that CisGenome can help me. I have two sets of data, experimental and control, both in WIG and BED formats. I need to know the difference between the two. Being a rookie in the ChIP-seq field, could I ask you to tell me whether CisGenome is the right tool for me, and if so, how I should use it? Thank you!!



                    • #11
                      I just added a function to convert a BED file to an ALN file. You can then use the ALN file to detect peaks and perform subsequent analysis. You are certainly welcome to try CisGenome.

                      BTW, we have also recently added support for C. elegans, yeast and chicken.



                      • #12
                        CisGenome troubleshooting?

                        Hello,

                        I am trying to use CisGenome to "find closest gene" to TF binding sites identified using ChIP-seq. I have downloaded the human genome database (hg18) and have converted the enriched sites into the COD file format. I was able to load the genome database and the COD file into the CisGenome browser. Then I choose “Genome > Annotate with … > Closest Gene”. From here I indicate a save-to location and hit "OK". A new window flashes (too fast for me to read), and then no file is saved and no further COD is added to the project. I don't know what I am doing wrong. I would be EXTREMELY grateful for any advice.

                        Best regards,
                        Sanjay



                        • #13
                          schandri

                          First, check whether you have set the CisGenome.ini file. In that file, you should give the CisGenome installation path.

                          Second, check whether any of your folder or file path/names contains blank characters such as "C:\My Document\". If so, move (or rename) your data to folders that do not contain blank characters. CisGenome should also be installed in a folder that does not contain blank characters.

                          Try and see if this solves your problem.
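If you have many files, the second check can be scripted. A small sketch, assuming you collect the paths yourself; the function name and the example paths are made up for illustration and are not part of CisGenome:

```python
def paths_with_blanks(paths):
    """Return the paths that contain any whitespace character."""
    return [p for p in paths if any(ch.isspace() for ch in p)]

# Hypothetical locations for the installation folder and a COD file.
candidates = [
    r"C:\cisgenome\bin",
    r"C:\My Document\peaks.cod",
]
for bad in paths_with_blanks(candidates):
    print("Contains blank characters, move or rename:", bad)
```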



                          • #14
                            Thanks for your post, hji. Your suggestions fixed the problem! I had installed CisGenome in a path that did not have any spaces, but I had made two other mistakes. First, the path in the .ini file was slightly off, and second, my .COD data file was in a location whose file path contained spaces. Now it seems to be working great!

                            Thanks again.
                            Sanjay



                            • #15
                              convert bar to wig

                              Anyone know of a utility to convert .bar files to .wig files?

                              I'd be happy to write a program to do it - but any pointers for the .bar format would be helpful. I'm sure I'm not the only one who would be interested in seeing cisGenome output in the UCSC genome browser (which doesn't read .bar last I checked).
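For the output half of such a converter, here is a minimal sketch that writes values in the UCSC variableStep WIG layout. It assumes the .bar file has already been parsed into (chromosome, position, value) records sorted by chromosome and position, with 1-based positions; reading the .bar format itself is not shown, and the function name and track name are made up for illustration.

```python
import io

def write_variable_step_wig(records, handle, track_name="cisgenome_signal"):
    """Write (chrom, pos, value) records as a variableStep WIG track.

    Records must be grouped by chromosome and sorted by position.
    A new variableStep declaration is written whenever the chromosome changes.
    """
    handle.write(f'track type=wiggle_0 name="{track_name}"\n')
    current_chrom = None
    for chrom, pos, value in records:
        if chrom != current_chrom:
            handle.write(f"variableStep chrom={chrom}\n")
            current_chrom = chrom
        handle.write(f"{pos} {value}\n")

# Hypothetical records standing in for parsed .bar content.
demo = [("chr1", 100, 1.5), ("chr1", 200, 2.0), ("chr2", 50, 0.5)]
buffer = io.StringIO()
write_variable_step_wig(demo, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass an open text file as `handle`.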
