
  • ChIP-Seq Challenge

    Community ChIP-Seq Challenge 1.0

    Hello Folks,

    We need your help! Yes you!

    Here is an experiment in open community development. We are not sure it will work, but we hope it will help with a growing problem.

    Given the dozen or so ChIP-Seq analysis applications currently available, we would like to know which algorithms are the best with respect to 1) identifying real ChIP-Seq peaks and 2) estimating confidence in them with a false discovery rate.

    We propose a series of tests using spike-in datasets where known truth can be used to objectively measure which methods work well under different conditions.

    Towards this end, we have created a spike-in dataset where simulated ChIP-Seq reads were added to experimentally derived input Illumina Genome Analyzer sequence data. Additional input data without spike-ins is also available for use as an input control.

    We ask that users (and developers) of particular ChIP-Seq packages download the data, analyze it, and post their lists of ChIP-Seq peaks alongside a detailed description of how they processed the data.

    Multiple submissions using the same analysis package from multiple users are encouraged.

    It is our hope that this open community experiment will help clarify which analysis packages work well under different conditions and foster continued development of ChIP-Seq algorithms.

    So download the data, run it through your favorite ChIP-Seq detector, and publicly post and/or privately submit your lists to us by March 2nd.

    Best regards,

    David Nix

    The Huntsman Cancer Institute and
    University of Utah Bioinformatics Shared Resource Center
    http://bioserver.hci.utah.edu [email protected]


    Details:

    1) Mapped sequencing data from human Jurkat T-cell input chromatin DNA from Valouev et al. (Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008 Aug 17. http://mendel.stanford.edu/sidowlab/...ign_25.hg18.gz) and from the Graves lab (Hollenhorst, P and Graves, B, unpublished) were pooled, randomized, and split into 1/3 and 2/3 samples of 10 million and 20 million reads, respectively.

    2) The USeq Simulator application (http://useq.sourceforge.net/cmdLnMenus.html#Simulator) was used to generate simulated spike-ins. The reads were aligned to the genome using the stand-alone ELAND aligner. From spike-in regions where more than ¼ of the reads mapped, a specific number of reads was randomly selected to represent spikes of different concentrations. These were added to the 1/3 input sample and constitute the ChIP-Seq sample.

    3) A few comments: hundreds of spikes have been added, and their size range was selected to closely approximate a real size-selected ChIP-Seq experiment. The strand of the mapped reads has been preserved. The read positions have not been shifted to compensate for the length of the fragments but simply assigned to the center of the 26 bp mapped read. Reads that mapped to multiple locations were mapped following the ELAND aligner’s default parameters.

    4) The key will be made immediately available to anyone who submits lists of ChIP-Seq peaks and promises not to distribute the key until after April 1.

    5) Seven lists should be provided each ranked best to worst and generated by setting FDR thresholds of 20%, 10%, 5%, 1%, 0.1%, 0.05%, and 0.01%. These should be in bed file format (tab delimited: chrom, start, stop, name, score; e.g. ‘chrX 3599643 3599943 peak37 3219’). Additionally (or alternatively if FDRs cannot be estimated), provide three ranked lists containing the top 500, 1000, and 1500 putative ChIP-Seq peaks. Multiple list sets are acceptable (e.g. one set with strand skew filtering, one without).

    6) A description should be provided with the lists, describing how the data were processed in sufficient detail for someone else to replicate your results (e.g. command lines and/or all application parameters).

    7) The key will be publicly released on April 1st.

    8) Submissions should be made to [email protected] or publicly posted by March 2nd for inclusion in a summary report. Multiple submissions are encouraged, both pre and post key.

    9) The data can be downloaded from http://bioserver.hci.utah.edu/ChIPSeqSpikeIns . It is split by sample, strand, and chromosome. Each text file contains a column of base positions (H_sapiens_Mar_2006, hg18) representing the center of each mapped read. See the CCS1.0_Text.zip file. (The data is also available in PointData bar format for direct visualization in the Integrated Genome Browser and for use in USeq applications. See the CCS1.0_PointData_ForUSeq.zip file.)

    10) Let us know if you need help reformatting the data for analysis.
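
For anyone preparing lists in the format requested in item 5, here is a minimal Python sketch. The names (`Peak`, `to_bed_line`, `ranked_bed_lines`) are made up for illustration, and it assumes a higher score means a better peak, which may not hold for every package.

```python
from typing import List, NamedTuple, Optional


class Peak(NamedTuple):
    chrom: str
    start: int
    stop: int
    name: str
    score: float


def to_bed_line(peak: Peak) -> str:
    """Format one peak as a tab-delimited line: chrom, start, stop, name, score."""
    return "\t".join(str(field) for field in peak)


def ranked_bed_lines(peaks: List[Peak], top_n: Optional[int] = None) -> List[str]:
    """Rank peaks best to worst and optionally keep only the top N.

    Assumption: a higher score is better. Flip `reverse` if your
    package reports p-values or another smaller-is-better score.
    """
    ranked = sorted(peaks, key=lambda p: p.score, reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return [to_bed_line(p) for p in ranked]
```

Calling `ranked_bed_lines(peaks, top_n=500)` and likewise with 1000 and 1500 would give the three ranked top-N lists; the FDR-thresholded lists need filtering by your package's own FDR estimate first.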

  • #2
    Cool - I won't have time till February, but this sounds neat.
    The more you know, the more you know you don't know. —Aristotle



    • #3
      Interesting, will definitely try it at some point. I did not understand the positions: should I shift each read +/- 13 bases to get the fragment ends?

      It would also be interesting to study how the different programs score peaks that contain multiple binding sites (or spikes) a short distance apart, that is, how close the called center positions come to the simulated peak centers under different conditions. Is this something you have considered doing as well?



      • #4
        Yes, if you want the coordinates for a particular read, subtract 13 from and add 13 to the given position (interbase coordinates).
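
As a small sketch of that arithmetic in Python (the function name `read_span` is hypothetical; the 26 bp read length comes from the challenge description):

```python
from typing import Tuple

READ_LENGTH = 26          # read length given in the challenge details
HALF = READ_LENGTH // 2   # 13 bases on either side of the center


def read_span(center: int) -> Tuple[int, int]:
    """Return (start, stop) in interbase coordinates for a read centered at `center`."""
    return center - HALF, center + HALF
```

In interbase coordinates the span length is simply `stop - start`, so every read comes out 26 bases long.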

        We do have the exact center position from which the randomized fragments were generated and could calculate how close a particular call comes to that center.

        Chipper, would you mind running this analysis when the lists are in?



        • #5
          Will it be possible to get the aligned reads in some raw aligned form (ELAND, MAQ, exonerate)? I haven't looked at the posted files in any detail yet, but I don't want to have to write an interpreter for whatever format is being used by USeq. (-:
          The more you know, the more you know you don't know. —Aristotle



          • #6
            ELAND Sorted and Export data files

            Yes, there are many different formats. We hoped that by providing the simplest (just a position), folks could parse it into something suitable for their favorite application. (USeq uses a binary bar format.)

            I have added both ELAND xxx_sorted.txt and xxx_export.txt formatted data sets to the http://bioserver.hci.utah.edu/ChIPSeqSpikeIns directory. Only the chromosome, strand, and position columns have any meaning; the others are identical across rows. The alignment score was set to 74 and the quality boolean to Y. Note that the position was derived by subtracting 12 from the middle positions in the original data files to convert their values into the ELAND coordinate system.
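
The conversion between the two file types can be sketched as below. The offset of 12 is taken from the post; the function names are hypothetical. One reading that fits the numbers is that a 26 bp read starts at center - 13 in interbase coordinates and is reported one base higher in a 1-based system, but that is an inference rather than something stated here.

```python
ELAND_OFFSET = 12  # offset stated in the post above


def center_to_eland(center: int) -> int:
    """Convert a read-center position from the text files to the ELAND-style position."""
    return center - ELAND_OFFSET


def eland_to_center(position: int) -> int:
    """Recover the read-center position from an ELAND-style position."""
    return position + ELAND_OFFSET
```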



            • #7
              Peaks or center base? FDRs? Input libraries?

              On 1/20/09 11:35 PM, "XU Han" <[email protected]> wrote:

              Hi, David:

              It’s an interesting challenge. May I ask you two questions regarding the submission?

              1. Should the predicted peak be a single base (i.e., start=end) or a region (start<end)?

              2. Does the FDR refer to the global FDR or the local FDR (q-value)?

              Also, I noticed that you used an input library to generate the spike-in data, and another input as the control for prediction. Are these two libraries biological replicates or technical replicates?

              Han


              A response:

              Just the ChIP regions, not the single base.

              Hmm, as far as the FDRs, it is probably better to tell you what we want to do with the FDR thresholded lists.

              For each FDR thresholded list provided, it will be intersected with the key and the real FDR for the list calculated (#non intersecting false positives/ # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

              For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500 or 33%. The closer your estimated FDR is to the real FDR, the better.
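
A rough Python sketch of that calculation follows. It assumes that any overlap with a key region on the same chromosome counts as an intersection, and the names `Region` and `real_fdr` are hypothetical; the actual scoring may use a different overlap rule.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, stop), interbase coordinates


def real_fdr(called: List[Region], key: Iterable[Region]) -> float:
    """Fraction of called regions that do not intersect any key region."""
    if not called:
        raise ValueError("the submitted list is empty")

    # Group key regions by chromosome so each call is only compared locally.
    key_by_chrom = defaultdict(list)
    for chrom, start, stop in key:
        key_by_chrom[chrom].append((start, stop))

    false_positives = 0
    for chrom, start, stop in called:
        # Assumption: any overlap at all counts as hitting the key.
        hits_key = any(
            start < k_stop and k_start < stop
            for k_start, k_stop in key_by_chrom.get(chrom, [])
        )
        if not hits_key:
            false_positives += 1
    return false_positives / len(called)
```

With 1500 called regions of which 1000 overlap the key, this returns 500/1500, matching the 33% in the example.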

              The input data were pooled and then randomly split into thirds. The simulated ChIP-Seq data were added to one third, and the other two thirds were joined to constitute the input sample. So the two libraries are neither biological nor technical replicates.



              • #8
                Prizes (iPods!) and ChIP-Seq Categories

                Hello Folks,

                Both ABI and Illumina have offered prizes to the winners of the contest, see below. Many thanks to these good folks for supporting the community development of bioinformatics.

                Get your lists in ASAP, 7 days and counting...

                Here are the categories:

                1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

                2) Best confidence estimator derived from the contestant's 10%, 5%, 1%, and 0.1% FDR thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.

                In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

                Prizes! An iPod Shuffle to the winner of each category, with additional items (water bottles, tee shirts, coffee mugs) for the second and third place winners. One prize per person.
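
The post does not define "fold difference", so the following Python sketch of the category 2 score is only one possible reading. It assumes the fold difference is the ratio of the larger FDR to the smaller one, and it uses an arbitrary floor to avoid dividing by zero when a list has no false positives. Both choices, and the function names, are assumptions rather than the organizers' method.

```python
from typing import Iterable, Tuple


def fold_difference(estimated: float, actual: float, floor: float = 1e-4) -> float:
    """Ratio of the larger FDR to the smaller one (1.0 means a perfect estimate).

    Assumption: values below `floor` are raised to `floor` so that an
    actual FDR of zero does not cause a division by zero.
    """
    est = max(estimated, floor)
    act = max(actual, floor)
    return max(est / act, act / est)


def cumulative_fold_difference(pairs: Iterable[Tuple[float, float]]) -> float:
    """Sum the fold differences over (estimated FDR, actual FDR) pairs; lower is better."""
    return sum(fold_difference(est, act) for est, act in pairs)
```

Under this reading, the 5% estimated versus 33% actual case from the earlier example would contribute a fold difference of about 6.7 to a contestant's total.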



                • #9
                  Originally posted by Nix
                  Hello Folks,

                  For each FDR thresholded list provided, it will be intersected with the key and the real FDR for the list calculated (#non intersecting false positives/ # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

                  For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500 or 33%. The closer your estimated FDR is to the real FDR, the better.

                  1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

                  2) Best confidence estimator derived from the contestant's 10%, 5%, 1%, and 0.1% FDR thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.

                  In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

                  Prizes! An iPod Shuffle to the winner of each category, with additional items (water bottles, tee shirts, coffee mugs) for the second and third place winners. One prize per person.

                  If I understand it correctly, I am not 100% confident that the outlined evaluation criteria make for an accurate evaluation of submitted lists. It is mentioned that each submitted list "will be intersected with the key and the real FDR for the list calculated (#non intersecting false positives/ # regions in the provided list)." What if I just submit the following as my list? Would the real FDR be zero?

                  chr1:1-lengthOfChr1
                  chr2:1-lengthOfChr2
                  ...
                  ...

                  How is the resolution of the submitted binding regions taken into account while evaluating the submitted list of binding regions? To make sure the submitted sites intersect with the sites in the key, one can just make the submitted sites longer.

                  How was the key determined? Were experiments conducted to verify each site in the key, to make sure that the sites in the key are indeed true positives? And just because a submitted site does not intersect with any of the key sites, how do we know that it is a false positive? One can argue that maybe the key is not complete.

                  Without proper controls, it may not be right to decide which method is discriminative using the simple evaluation criteria outlined in the challenge.



                  • #10
                    David, You mentioned that you plan to include the submitted lists as a part of a report. By report, do you mean a paper that may be submitted to a journal? If yes, will the participants be included as co-authors?



                    • #11
                      Hello ChipMaster,

                      Yes, submission of huge regions would be one way to cheat. It is also very easy to spot and disqualify. Given the number of spike-ins and their random distribution across the genome, the chance of two spike-ins landing next to one another is very slim, so even if you submitted regions in the 5-10 kb range I doubt it would help. That said let's say that each region should be < 1kb. Much more than that and your list will get flagged.
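
A quick check along those lines could look like the Python sketch below. The 1 kb limit comes from this post; the helper name and the exact cutoff behavior (flagging anything 1000 bp or longer) are assumptions, since the post treats the limit as approximate.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, stop), interbase coordinates

MAX_REGION_BP = 1000  # "each region should be < 1kb"


def flag_oversized(regions: List[Region], max_bp: int = MAX_REGION_BP) -> List[Region]:
    """Return the regions whose length is not below `max_bp`."""
    return [r for r in regions if r[2] - r[1] >= max_bp]
```

Running this over a list before submitting shows which regions, if any, would stand out as unusually wide.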

                      Regarding the key, this is a simulation so we know exactly what was added to the experimentally derived input sequencing data. No need for validation. Anything not added is by definition a false positive (the input data was pooled and randomly split).

                      We're trying to keep the analysis and the criteria for ranking the methods quite simple.

                      Regarding the initial report, yes, anyone who submits a list or makes a substantial contribution would be an author. Whether the report rises to the level of a publication remains to be seen.



                      • #12
                        Hello David,

                        It's hard to judge the performance of a method by FDR, because most methods can identify the top peaks (say 100) with a relatively low FDR, and I could then generate the other FDR lists by replacing some of the peaks at the bottom with junk regions. For example, I could produce a 5% (or 10%) FDR list by replacing 5 (or 10) of the top 100 peaks with junk.
                        Will I get an iPod from this?

                        John



                        • #13
                          Don't know if I can help you here. The FDR estimations I've used are typically tied to a threshold and can be used to filter a list of putative peaks. Relaxing the threshold increases the FDR.

                          Regarding the iPods, ya can't win if ya don't play so be sure to submit some lists! -cheers, D



                          • #14
                            Originally posted by Nix
                            Hello ChipMaster,
                            That said let's say that each region should be < 1kb.
                            The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution. 1 Kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, without having to use any program, one should be able to pinpoint the peaks (enriched regions) with a ~200-500 bp resolution. Any program that improves upon this should at least make sure to narrow down the region (based on the tag directions) to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200bp.



                            • #15
                              Originally posted by chipmaster
                              The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution. 1 Kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, without having to use any program, one should be able to pinpoint the peaks (enriched regions) with a ~200-500 bp resolution. Any program that improves upon this should at least make sure to narrow down the region (based on the tag directions) to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200bp.
                              Yes, the initial fragments are 200-500 bp, but peak calling from the depth of coverage does yield wider peaks, which can be narrowed with parameters such as tapering off the shoulders.
                              --
                              bioinfosm
