  • tanasabogdan
    Junior Member
    • Oct 2009
    • 1

    normalization of ChIP-seq data

    Hi all,

    May I add a more general question about how to normalize ChIP-seq data
    when comparing multiple experiments? Considering the Poisson
    distribution based models for peak finding, my question is the
    following:

    - assuming there are 2 ChIP-seq experiments:

    <> minus treatment: 10 million tags, 10,000 peaks
    <> plus treatment: 4 million tags, 3,000 peaks

    - although the two data sets are saturated to different degrees, could we
    still compare 10,000 peaks vs 3,000 peaks and say, for instance, that
    8,000 peaks are lost with the treatment, 2,000 peaks remain unchanged
    and 1,000 peaks are gained?
    - assuming that the cut-off to call a peak is different for minus
    treatment (let's say 6 tags) vs plus treatment (let's say 4 tags),
    would the comparison described above be statistically legitimate?

    Thanks a lot,

    Bogdan
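To make the depth-dependent cut-off in the question concrete, here is a minimal sketch of how a simple Poisson background model yields different tag thresholds at different sequencing depths. The genome size, window width and p-value below are illustrative assumptions, not values from the thread, and real peak callers use more elaborate local background models.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

def min_tags_for_peak(total_tags, window_bp, genome_bp, p_cutoff):
    """Smallest tag count in a window whose Poisson tail probability
    falls at or below p_cutoff, assuming tags land uniformly at random."""
    lam = total_tags * window_bp / genome_bp  # expected tags per window
    k = 1
    while poisson_sf(k, lam) > p_cutoff:
        k += 1
    return k

# Hypothetical settings: 3 Gb genome, 200 bp windows, p <= 1e-5.
GENOME_BP = 3_000_000_000
WINDOW_BP = 200
P_CUTOFF = 1e-5

minus_cutoff = min_tags_for_peak(10_000_000, WINDOW_BP, GENOME_BP, P_CUTOFF)
plus_cutoff = min_tags_for_peak(4_000_000, WINDOW_BP, GENOME_BP, P_CUTOFF)
print("minus treatment cutoff:", minus_cutoff)
print("plus treatment cutoff:", plus_cutoff)
```

Under this model the deeper library needs at least as many tags per window to reach the same p-value, which is why the two experiments end up with different cut-offs.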
  • akundaje
    Junior Member
    • Sep 2008
    • 5

    #2
    IMHO the answer is no. There are several issues I can think of.

    First, you cannot compare the number of peaks or derive any statistics directly without an estimate of replication noise. If you had replicate experiments for at least one (ideally both) conditions and you ran peak calling on the replicates, you could get an estimate of the variance of the called peaks.

    Secondly, it depends on your p-value/enrichment cutoff. If you are restricting to very strong peaks, then you could potentially compare the numbers. The reason is that as you relax your threshold for calling peaks, the different experiments could bleed in noisy peaks at different rates. So a tiny change in the p-value threshold could cause massive differences in the number of peaks called. For example, I have seen ample cases of biological and technical replicates of the same experiment giving quite different numbers of peaks for the same threshold with the same peak caller program. The strongest peaks tend to agree, but as you go down the list the consistency gets worse.

    Also, hopefully both experiments share a common control; otherwise a head-to-head comparison becomes even harder.

    Ideally, you want to rank your peaks by their enrichment/p-value and compute rank statistics on that to estimate how different the two experiments are.
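One simple rank-based check along these lines is to ask what fraction of the top N peaks the two experiments share as N grows. The sketch below assumes each experiment has already been reduced to a dict mapping a shared region identifier to its enrichment score; the identifiers and scores are made up for illustration, and this is only one of several possible rank statistics.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Fraction of the n highest-scoring regions shared by both experiments.

    scores_a, scores_b: dicts mapping a region ID to an enrichment score
    (hypothetical input format; n should not exceed either dict's size).
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return len(top_a & top_b) / n

# Toy scores, invented for illustration only.
minus_scores = {"r1": 50, "r2": 40, "r3": 10, "r4": 8}
plus_scores = {"r1": 45, "r2": 42, "r4": 9, "r5": 7}

for n in (2, 4):
    print(n, top_n_overlap(minus_scores, plus_scores, n))
```

If the overlap stays high for small N and drops as N increases, that matches the pattern described above: the strongest peaks agree and the borderline ones do not.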


    • repinementer
      Member
      • Dec 2009
      • 80

      #3
      May I add another question? I have the same scenario: 2 ChIP-seq data sets from 2 different experiments (one in brain and one in heart). The brain ChIP-seq has 10 million tags and the heart one 4 million tags. I want to plot the raw number of tags around promoters. But with this difference in the number of tags I do not get any pattern, only a flat line at the top and one at the bottom.

      I tried to normalize with the formula below, but it didn't work at all. Any ideas about normalizing ChIP-seq samples with different numbers of tags from 2 different experiments?

      position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

      * position_cDNAnorm = normalised cDNA value for specific position and specific DBP
      * position_cDNA = cDNA value for specific position and specific DBP
      * sum_cDNA = total cDNA count for specific DBP
      * average_sum_cDNA = average of total cDNA counts of all DBPs
      DBP = DNA Binding Protein (transcription factor)
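For reference, the formula above can be written out as a short function. This is a minimal sketch that assumes each sample is given as a plain list of per-position counts; the sample names and counts are invented for illustration.

```python
def normalize_to_average_depth(counts_by_sample):
    """Scale each sample so its total equals the average total across samples.

    Implements position_norm = (position / sample_total) * average_total.
    counts_by_sample: dict mapping a sample name to a list of per-position counts.
    """
    totals = {name: sum(counts) for name, counts in counts_by_sample.items()}
    average_total = sum(totals.values()) / len(totals)
    return {
        name: [c / totals[name] * average_total for c in counts]
        for name, counts in counts_by_sample.items()
    }

# Toy counts, invented for illustration only.
raw = {"brain": [6, 4], "heart": [3, 1]}
normalized = normalize_to_average_depth(raw)
print(normalized)
```

Note that this multiplies every position in a sample by the same constant, so it changes the height of a profile but not its shape. A profile that is flat before normalization is still flat afterwards.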


      • ETHANol
        Senior Member
        • Feb 2010
        • 308

        #4
        I completely agree with akundaje. I would like to emphasize his point that even if you have the same number of tags from two biological or even technical replicates and compare them, you will get different peaks called. Replicates will help weed out the borderline peaks. The number of peaks called does not scale linearly with read count.

        Because the borderline peaks that get called vary at any arbitrary peak cutoff, I think the thing to do is to run your two ChIP samples through your peak finder as 'treatment' and 'control'. This will identify significant differences between the samples. However, some of these differences may come from differences in chromatin structure and shearing efficiency rather than transcription factor binding. So this requires a second step. Take your significant differences and intersect them with your list of peaks, and you should end up with a list of real differences between the two conditions.
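The second step, intersecting the significant differences with the peak list, can be sketched as follows. This assumes both lists are available as (chromosome, start, end) tuples with half-open coordinates; the coordinates are invented, and for real data a dedicated interval tool would be far more efficient than this quadratic loop.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def peaks_with_differences(peaks, differential_regions):
    """Keep only the peaks that overlap at least one differential region."""
    return [
        peak for peak in peaks
        if any(overlaps(peak, region) for region in differential_regions)
    ]

# Toy intervals, invented for illustration only.
peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 100, 300)]
differences = [("chr1", 250, 400), ("chr2", 500, 700)]

print(peaks_with_differences(peaks, differences))
```

Differential regions that overlap no called peak are dropped, which is what filters out differences caused by chromatin structure or shearing rather than binding.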

        You should still normalize the read counts and get some replicates.

        I made a blog post on my new blog on this subject. So here is the shameless link to it:
        This is a question people seem to be having some difficult with, as I’ve seen it asked a few times on SeqAnswers. You have results from two ChIP-seq experiments.  For example, you want to know if N…


        This seems like a pretty good way to go about addressing the question at hand, but there may be better ways.
        Last edited by ETHANol; 08-07-2011, 04:12 AM.
        --------------
        Ethan


        • howi
          Junior Member
          • Apr 2011
          • 6

          #5
          In my experience no clear statement can be made without replicates. E.g. we had two replicates with about 4K peaks each. The overlap of the peaks was 100. That already tells you quite something about peak calling and its interpretation. Looking at ChIP-seq data from others, I saw the same thing. But folks tend to pool their replicates before peak calling to get around that. Anyway, if I see people using peak numbers to answer biological questions, the first thing I do is look at the raw data (if it is available). In all but one case I would say that the peak numbers mean nothing.

          Another case was the analysis of cells with very low TF protein levels upon treatment (like in a KO situation). Peak calling reported twice as many peaks in that situation as in untreated cells with TF binding and normal protein levels.

          I did not find any answers on how to rank my peaks to compare different treatments. For me it worked quite well to plot the tag enrichments (Input, IgG, Treated, Untreated) ±3 kb around my peaks in a heat map and do k-means clustering. That identified strongly enriched sites I can trust.
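A minimal sketch of the clustering step, assuming each peak has been summarized as one row of tag counts per track (Input, IgG, Treated, Untreated). The row layout, the toy numbers and the plain Lloyd's-algorithm implementation are illustrative assumptions, not the poster's actual pipeline. In practice each row would more likely hold binned counts across the ±3 kb window, and a library implementation of k-means would be used.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(row, centroids[j]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy rows (Input, IgG, Treated, Untreated), invented for illustration only.
profiles = [
    [1, 1, 30, 28],  # strongly enriched in both ChIP tracks
    [1, 2, 29, 31],
    [1, 1, 2, 1],    # close to background everywhere
    [2, 1, 1, 2],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The cluster whose centroid shows high ChIP signal and low Input/IgG signal corresponds to the strongly enriched sites described above.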


          • emilyjia2000
            Member
            • May 2011
            • 59

            #6
            ETHANol,

            That is a good solution in some ways. But in my case, I would like to compare the two samples to see whether they are similar or different. I guess that needs some statistical calculation.
            Any suggestions will be highly appreciated.
            Thanks
