  • normalization of ChIP-seq data

    Hi all,

    May I add a more general question about how to normalize ChIP-seq
    data when comparing multiple experiments? Considering the Poisson
    distribution based models for peak finding, my question is the
    following:

    - Assume there are 2 ChIP-seq experiments:

    <> minus treatment: 10 million tags, 10,000 peaks
    <> plus treatment: 4 million tags, 3,000 peaks

    - Although the two data sets are saturated to different degrees, could
    we still compare 10,000 peaks vs 3,000 peaks and say, for instance,
    that 8,000 peaks are lost with the treatment, 2,000 peaks remain
    unchanged and 1,000 peaks are gained?
    - Assuming that the cut-off to call a peak differs between minus
    treatment (let's say 6 tags) and plus treatment (let's say 4 tags),
    would the comparison described above be statistically legitimate?

    Thanks a lot,
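
To make the depth dependence of the tag cut-off concrete, here is a minimal sketch under a plain genome-wide Poisson background model. The 200 bp window, the 2.7e9 bp mappable genome size and the p-value cut-off are assumptions chosen for illustration, not values from this thread, and real peak callers typically use more elaborate local background estimates.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def min_tags_for_peak(total_tags, window_bp=200, genome_bp=2.7e9, p_cutoff=1e-5):
    """Smallest tag count in one window whose Poisson tail probability
    falls below p_cutoff. window_bp, genome_bp and p_cutoff are assumed values."""
    lam = total_tags * window_bp / genome_bp  # expected background tags per window
    k = 0
    while poisson_tail(k, lam) >= p_cutoff:
        k += 1
    return k


cutoff_minus = min_tags_for_peak(10_000_000)  # minus treatment, 10 million tags
cutoff_plus = min_tags_for_peak(4_000_000)    # plus treatment, 4 million tags
print(cutoff_minus, cutoff_plus)
```

Under these assumptions the deeper library needs more tags per window to reach the same p-value, which is why the two experiments end up with different tag cut-offs.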


  • #2
    IMHO the answer is no. There are several issues I can think of.

    First, you cannot compare the number of peaks or get any statistics directly without an estimate of replication noise. If you had replicate experiments for at least one (ideally both) conditions and you ran peak calls on the replicates, you could get an estimate of the variance of the called peaks.

    Secondly, it depends on your p-value/enrichment cutoff. If you restrict yourself to very strong peaks, then you could potentially compare the numbers. The reason is that as you relax your threshold for calling peaks, the different experiments could bleed in noisy peaks at different rates, so a tiny change in the p-value threshold could cause massive differences in the number of peaks called. For example, I have seen ample cases of biological and technical replicates of the same experiment giving quite different numbers of peaks for the same threshold with the same peak caller program. The strongest peaks tend to agree, but as you go down the list the consistency gets worse.

    Also, hopefully the two experiments share a common control experiment; otherwise a head-to-head comparison becomes even harder.

    Ideally, you want to rank your peaks by their enrichment/p-value and compute rank statistics on that to estimate how different the two experiments are.
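
One possible way to compute such rank statistics is sketched below: a Spearman rank correlation of enrichment scores over a common set of regions, plus the fraction of top-ranked regions shared by both experiments. The function names and the example scores are hypothetical, and the sketch assumes both experiments have been scored over the same regions in the same order.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def top_n_overlap(scores_a, scores_b, n):
    """Fraction of the n best-scoring regions that both experiments share."""
    top_a = set(sorted(range(len(scores_a)), key=lambda i: -scores_a[i])[:n])
    top_b = set(sorted(range(len(scores_b)), key=lambda i: -scores_b[i])[:n])
    return len(top_a & top_b) / n


# Hypothetical enrichment scores for the same six regions in each experiment.
minus_scores = [50.0, 30.0, 12.0, 8.0, 5.0, 3.0]
plus_scores = [40.0, 6.0, 15.0, 7.0, 2.0, 4.0]
rho = spearman(minus_scores, plus_scores)
shared_top3 = top_n_overlap(minus_scores, plus_scores, 3)
print(rho, shared_top3)
```

Working on ranks rather than raw peak counts avoids depending on a single p-value threshold, although it still does not replace replicates.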


    • #3
      May I add another question? I have the same scenario: two ChIP-seq data sets from two different experiments (one in brain and one in heart). The brain ChIP-seq has 10 million tags and the heart ChIP-seq has 4 million tags. I want to plot the raw number of tags around promoters, but because of the difference in the number of tags the plot shows no pattern, just one flat line at the top and one at the bottom.

      I tried to normalize as shown here, but it didn't work at all. Any ideas about normalizing ChIP-seq samples with different numbers of tags from 2 different experiments?

      position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

      * position_cDNAnorm = normalised cDNA value for specific position and specific DBP
      * position_cDNA = cDNA value for specific position and specific DBP
      * sum_cDNA = total cDNA count for specific DBP
      * average_sum_cDNA = average of total cDNA counts of all DBPs
      * DBP = DNA-binding protein (transcription factor)
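
For reference, the formula above can be written as a short sketch. The function name, the dictionary layout and the per-position counts are hypothetical; only the library sizes (10 million and 4 million tags) come from the post. Note that this kind of scaling corrects for sequencing depth only, not for differences in signal-to-noise between samples.

```python
def normalize_to_mean_depth(position_counts, library_sizes):
    """Scale per-position tag counts so that every sample is expressed
    at the average library size:
    normalized = count / sample_total * mean_total."""
    mean_total = sum(library_sizes.values()) / len(library_sizes)
    return {
        sample: [c / library_sizes[sample] * mean_total for c in counts]
        for sample, counts in position_counts.items()
    }


# Hypothetical tag counts at three positions around a promoter.
position_counts = {"brain": [50, 100, 50], "heart": [20, 40, 20]}
library_sizes = {"brain": 10_000_000, "heart": 4_000_000}

normalized = normalize_to_mean_depth(position_counts, library_sizes)
print(normalized)
```

In this toy example the heart profile has the same shape as the brain profile at 40% of the depth, so after scaling both land on the same values. If normalized profiles from real data still look like two flat lines, the problem is more likely the plotting scale or the signal itself than the depth correction.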


      • #4
        I completely agree with akundaje. I would like to emphasize his point that even if you have the same number of tags from two biological or even technical replicates and compare them, you will get different peaks called. Replicates will help weed out the borderline peaks. The relationship between read count and the number of peaks called is not linear.

        Since the borderline peaks that get called vary with the arbitrary peak cut-off, I think the thing to do is to run your two ChIP samples through your peak finder as 'treatment' and 'control'. This will identify significant differences between the samples. However, some of these differences may come from differences in chromatin structure and shearing efficiency rather than transcription factor binding, so a second step is required: take your significant differences and intersect them with your list of peaks. You should end up with a list of real differences between the two conditions.
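
The second step, intersecting the significant differences with the peak list, could look roughly like the sketch below. It assumes regions are given as (chromosome, start, end) tuples with half-open coordinates; the function names and the example regions are hypothetical, and for genome-scale lists a dedicated interval tool would be faster than this pairwise scan.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def keep_differences_in_peaks(differential_regions, peaks):
    """Keep only the differential regions that overlap at least one called peak."""
    return [
        region
        for region in differential_regions
        if any(overlaps(region, peak) for peak in peaks)
    ]


# Hypothetical output of the treatment-vs-control run and of ordinary peak calling.
differential_regions = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 500, 700)]
peaks = [("chr1", 250, 400), ("chr2", 800, 900)]

real_differences = keep_differences_in_peaks(differential_regions, peaks)
print(real_differences)
```

Only the first differential region overlaps a called peak here, so it is the only one kept as a candidate binding difference.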

        You should still normalize the read counts and get some replicates.

        I made a blog post on my new blog on this subject. So here is the shameless link to it:
        This is a question people seem to be having some difficulty with, as I’ve seen it asked a few times on SeqAnswers. You have results from two ChIP-seq experiments. For example, you want to know if N…

        This seems like a pretty good way to go about addressing the question at hand, but there may be better ways.
        Last edited by ETHANol; 08-07-2011, 04:12 AM.


        • #5
          In my experience no clear statement can be made without replicates. For example, we had two replicates with about 4K peaks, and the overlap between them was 100 peaks. That already tells you quite a lot about peak calling and its interpretation. I saw the same when looking at ChIP-seq data from others, but people tend to pool their replicates before peak calling to get around that. Anyway, when I see people using peak numbers to answer biological questions, the first thing I do is look at the raw data (if it is available). In all but one case I would say that the peak numbers mean nothing.

          Another case was the analysis of cells with a very low TF protein level upon treatment (like in a KO situation). Peak calling reported twice as many peaks for that situation as for untreated cells with TF binding and normal protein levels.

          I did not find any answers on how to rank my peaks to compare different treatments. What worked quite well for me was to plot the tag enrichments (Input, IgG, Treated, Untreated) ±3 kb around my peaks in a heat map and do k-means clustering. That identified strongly enriched sites I can trust.
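
A minimal sketch of the clustering step is shown below. It assumes each peak has already been turned into a row of binned tag counts around the peak centre (one row of the heat map); the binning, the toy profiles and the deterministic centroid initialisation are illustrative choices, not the procedure used in this post, and an established k-means implementation would normally be used on real data.

```python
def kmeans(rows, k, n_iter=20):
    """Plain k-means on equal-length numeric rows.
    Initial centroids are rows picked at evenly spaced positions after
    sorting by total signal, which keeps the result deterministic."""
    order = sorted(range(len(rows)), key=lambda i: sum(rows[i]))
    centroids = [list(rows[order[(2 * j + 1) * len(rows) // (2 * k)]]) for j in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign each row to the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
            for row in rows
        ]
        # Recompute each centroid as the column-wise mean of its members.
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy profiles: five bins across the window around each peak (hypothetical counts).
profiles = [
    [1, 1, 2, 1, 1],   # flat, background-like
    [1, 2, 1, 1, 1],   # flat, background-like
    [1, 5, 20, 5, 1],  # strong central enrichment
    [2, 6, 18, 6, 2],  # strong central enrichment
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

With real data, each row would concatenate the binned Input, IgG, Treated and Untreated signal for one peak, so that clusters separate sites enriched only in the ChIP tracks from sites that are also high in the controls.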


          • #6

            That is a good solution in some ways. But in my case, I would like to compare the two samples to see whether they are similar or different. I guess that needs some kind of statistical calculation.
            Any suggestions will be highly appreciated.

