  • biznatch
    Senior Member
    • Nov 2010
    • 124

    Normalizing to input

    I want to make some .wig and/or .bed files for visualising in the UCSC Genome Browser, but first I want to normalise the samples to input. I'm using Perl scripts to do this (I don't need help writing the scripts, I'm just wondering about the methodology; this is my first set of ChIP-seq data, although maybe there are programs out there that can already do this for me?):

    1. I have about 3 times as many reads for input (60 million) as for the experimental sample. Before subtracting input from experimental, should I divide the input coverage at each bp by 3 (or whatever the exact ratio is)? Is there another way to normalise for differences in the number of reads between input and experimental?

    2. Once this is done, should I just subtract input from experimental at each bp?
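For illustration only, the two steps asked about above (scale the input by the read-count ratio, then subtract per bp) could be sketched as follows in Python. The function name, the toy coverage values and the choice to clamp negative results at zero are assumptions made for this sketch, not something stated in the thread.

```python
def scale_and_subtract(chip_cov, input_cov, chip_reads, input_reads):
    """Scale input coverage to the ChIP library size, then subtract per position."""
    # With 20M ChIP reads and 60M input reads the factor is about 1/3.
    scale = chip_reads / input_reads
    # Assumption: clamp at zero so positions where the scaled input
    # exceeds the ChIP signal do not produce negative coverage.
    return [max(0.0, c - i * scale) for c, i in zip(chip_cov, input_cov)]


# Toy per-bp coverage values (hypothetical).
chip_cov = [10, 30, 5]
input_cov = [9, 12, 30]
corrected = scale_and_subtract(chip_cov, input_cov, 20_000_000, 60_000_000)
print(corrected)
```

Whether subtraction is a good idea at all is a separate question, which the replies below address.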
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    Not wishing to evade your question - but are you sure you want to do that?

    When we started out doing ChIP-Seq we used to normalise against input, but after looking at the results we found that in general we were causing more problems than we fixed. The reason was that over any given peak in our ChIP the coverage in the input was much poorer than that in the ChIP, so we were effectively reducing our accuracy of measurement to the poor coverage of the input. In many cases we had only a very small number of reads in the input, and the addition or loss of only a few reads would have a huge effect on the corrected value we would get.

    What we did instead was to use the input as a filter to mask out regions where there were way more reads than we would expect. These regions normally contained mismapped reads and it was better to discard them than to try to correct against mismapped reads in the ChIP sample.
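A minimal sketch of this kind of input-based filter, in Python, might look like the following. The post does not say how "way more reads than we would expect" was defined, so the fixed windows, the use of the median as the expected level and the 5x cutoff are all assumptions here.

```python
from statistics import median


def input_mask(input_counts, fold=5.0):
    """Flag windows whose input read count exceeds fold x the median input count."""
    typical = median(input_counts)
    return [count > fold * typical for count in input_counts]


def filter_chip(chip_counts, mask):
    """Discard (set to None) ChIP windows that the input mask flags."""
    return [None if flagged else count
            for count, flagged in zip(chip_counts, mask)]


# Toy read counts per window (hypothetical); the fifth window is an input pile-up.
input_counts = [10, 12, 9, 11, 250, 10]
chip_counts = [15, 80, 14, 13, 300, 16]

mask = input_mask(input_counts)
filtered = filter_chip(chip_counts, mask)
print(filtered)
```

The ChIP values in unflagged windows are left untouched, so the noisy input counts never enter the quantitation; they only decide which windows are kept.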

    In your case you say you have 3x the coverage in the input so maybe you have enough data to do this correction reliably. Even so it might be worth looking at the general level of variability in your input samples and, excluding extreme outliers, compare this to the levels of enrichment you see in your ChIP. You can then get a good impression of whether the variability in the input levels is going to have a considerable impact on how you judge the strength of the enriched peaks.

    The simplest correction is to work out the log transformed ratio of ChIP to input. You can also get the same effect by doing a log count of reads in each sample and then subtracting the input from the ChIP.
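The equivalence of the two routes can be checked with a short Python sketch. The pseudocount of 1, added to avoid taking the log of zero, is an assumption of this example and is not mentioned in the post.

```python
import math


def log_ratio(chip, inp, pseudo=1.0):
    """log2 of the ChIP/input ratio, with a pseudocount on both counts."""
    return math.log2((chip + pseudo) / (inp + pseudo))


def log_difference(chip, inp, pseudo=1.0):
    """log2 count of each sample, then input subtracted from ChIP."""
    return math.log2(chip + pseudo) - math.log2(inp + pseudo)


# Toy read counts for one region (hypothetical).
chip_count, input_count = 63, 7
ratio_route = log_ratio(chip_count, input_count)
difference_route = log_difference(chip_count, input_count)
print(ratio_route, difference_route)
```

Both functions give the same value up to floating-point rounding, because log(a/b) = log(a) - log(b).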

    In terms of corrections, if you're using multiple ChIP samples then you want to correct the counts in those to account for the differing numbers of total reads in each sample (say by expressing the count as counts per million input reads). You can correct the inputs as well if you like, but given that you will use the same input for each ChIP it doesn't really matter if you do this or not since it will just move all of your results by a constant factor.
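As a rough Python sketch of both points, the following scales each ChIP count to counts per million and then shows that rescaling a shared input shifts every log ratio by the same amount. It reads "counts per million" as scaling by each sample's own total read count, which is an interpretation; the sample names, counts and the scaling factor `k` are hypothetical.

```python
import math


def cpm(count, total_reads):
    """Express a raw read count as counts per million reads in that sample."""
    return count * 1_000_000 / total_reads


# Hypothetical totals and raw counts over the same peak in two ChIP samples.
chip_totals = {"chipA": 20_000_000, "chipB": 35_000_000}
peak_counts = {"chipA": 400, "chipB": 1400}
normalised = {name: cpm(peak_counts[name], chip_totals[name])
              for name in chip_totals}

# One shared input value, used either as it is or rescaled by a factor k.
input_count = 50
k = 0.5
shifts = [math.log2(c / (input_count * k)) - math.log2(c / input_count)
          for c in normalised.values()]
print(normalised, shifts)
```

Every entry in `shifts` equals log2(1/k), so correcting the shared input moves all results by one constant and leaves comparisons between ChIP samples unchanged.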


    • biznatch
      Senior Member
      • Nov 2010
      • 124

      #3
      No I'm not sure, haha. Just figuring things out here. Coverage on this input data looks pretty good and consistent, except for some "peaks" where there's a peak in both the input and ChIP, and it's basically these that I want removed from the ChIP data as I suppose they're artefacts of mismapping or bias. I have other data with far fewer input reads so maybe doing a filter like you suggested would work better for that. Thanks for the reply, it's given me some ideas to try out.


      • yaten2020
        Junior Member
        • Aug 2011
        • 7

        #4
        Hi,
        I think something like that has been done by Li Chen here, though I could not fully understand it. Any comments?

        YK


        • rebrendi
          ng
          • May 2008
          • 78

          #5
          Simon, I completely agree with the arguments, just want to make sure things did not change during these two years: is it still common NOT to normalize by input?


          • simonandrews
            Simon Andrews
            • May 2009
            • 870

            #6
            I don't pretend to speak for the whole of the ChIP-Seq analysis field, but for our analyses we don't directly normalise to input. We use input samples if we do peak calling, to provide a local read density estimate for defining enrichment, but this doesn't normally carry through into our quantitation. We will often use other normalisation techniques to normalise the global distribution of counts to remove effects introduced by differential ChIP efficiency, but these are not position specific. We would still use the input as a filter to remove places showing large levels of enrichment if we were analysing data without using peaks called from an input.

            This all assumes that we're using samples sequenced on the same platform with the same type of run, mapped with the same mapper with the same options. Under those conditions most of the artefacts you're looking at would be constant between samples so you're OK if you're comparing different sample groups. If you really want to compare peak strengths within a sample then you might want to look at input normalisation or filtering more carefully, but this is always going to be tricky.
