  • anandksrao
    Junior Member
    • Jun 2011
    • 9

    choosing & validating RNA-Seq time course data normalization method(s)

    Dear all,

    I seek your help with choosing & validating RNA-Seq time course data normalization method(s) for my work.

    My data set has 9 time points, with 4 replicates per time point.
    I want to extract co-expressed genes based on their shared expression profiles over time. So I am NOT asking you how to perform pair-wise DE gene identification.

    I know there have been multiple posts on the topic of RNA-Seq data normalization. This is my 1st post here, so at the cost of being repetitive with some of my questions, and irking some or all of you, here I go:

    1. For my purposes, I am assuming that the raw mapped count data first needs to be normalized, right?

    2. Should I test different methods of data normalization of my raw, mapped counts? Like TMM, quantile etc.?

    3. Strictly speaking, should the choice of normalization method be justified through some measure or test, or is it the norm to try out different methods?

    4. Do both edgeR and DESeq offer different built-in methods of data normalization applicable for time-course data (NOT pair-wise comparisons)?

    5. Will normalization have to be performed with respect to a reference data point, let's say time point zero (which makes intuitive and biological sense to me),
    OR
    are there variants of normalization that can normalize data across time without explicitly choosing a reference (such a method, if it exists, does not make intuitive or biological sense to me)?

    6. What is the best place for someone like myself, new to bio-statistics and the R environment, to quickly learn tricks of the trade?

    Lots of questions, I know. I am hoping this forum can help out a poor, starving grad student.

    Thanks a ton.
    Wishing you all happy holidays and a fantastic 2012!

    AksR
    -----------------
    CTTATTGTTGAACTTOAATGGTGCTAATGATCCTCGTOTCTCCTGAACGT
    (translate THAT!)
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I start by answering two of your questions:

    4. Do both edgeR and DESeq offer different built-in methods of data normalization applicable for time-course data (NOT pair-wise comparisons)?
    Normalization is independent of the experimental design. The built-in normalisations of DESeq and edgeR simply determine for each sample a scaling factor (or: size factor), such that all samples' counts, when divided by their factor, are on a scale that allows for comparisons. What you want to compare with what is unimportant for this step.
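A minimal sketch of this per-sample scaling, in plain Python rather than R. The toy count matrix and the size factors are made up for illustration; in practice DESeq or edgeR would estimate the factors. The sketch follows DESeq's convention of dividing each count by its sample's size factor (equivalently, multiplying by the reciprocal).

```python
# Hypothetical toy data: rows are genes, columns are samples.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
]

# Assumed size factors, one per sample (column). These are invented
# for the example; a real analysis would estimate them from the data.
size_factors = [0.5, 1.0, 2.0]


def normalize(counts, size_factors):
    """Divide every count by the size factor of its sample (column)."""
    return [[c / s for c, s in zip(row, size_factors)] for row in counts]


normalized = normalize(counts, size_factors)
for row in normalized:
    print(row)
```

Nothing in `normalize` refers to time points, conditions, or any other part of the design: each sample is scaled on its own.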

    5. Will normalization have to be performed with respect to a reference data point, let's say time point zero (which makes intuitive and biological sense to me),
    OR
    are there variants of normalization that can normalize data across time without explicitly choosing a reference (such a method, if it exists, does not make intuitive or biological sense to me)?
    DESeq chooses the size factors such that their product is one, in order to put the common scale somewhere in the middle of all the library sizes. If you multiplied all the factors by a constant, the analysis result would not change. Hence, one could as well declare an arbitrary sample as reference and choose the factors such that this sample gets assigned a one.
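The point about the arbitrary constant can be checked with a small sketch. The size factors below are invented, and `rescale_to_reference` and `relative_scales` are hypothetical helper names, not functions from DESeq or edgeR.

```python
import math

# Assumed size factors whose product is one (made up for illustration).
size_factors = [0.5, 1.0, 2.0]


def rescale_to_reference(factors, ref_index):
    """Divide all factors by the reference sample's factor,
    so that the reference sample is assigned exactly 1."""
    ref = factors[ref_index]
    return [f / ref for f in factors]


def relative_scales(factors):
    """Ratios between every pair of factors. Only these ratios
    affect how samples compare with each other."""
    return [a / b for a in factors for b in factors]


# Declare sample 0 (say, time point zero) as the reference.
rescaled = rescale_to_reference(size_factors, 0)

print(math.prod(size_factors))  # product of the original factors
print(rescaled)                 # the reference sample now has factor 1
print(relative_scales(size_factors) == relative_scales(rescaled))
```

Rescaling changes the absolute scale of the normalized counts but not the relative scale between samples, which is why the choice of reference does not matter for the comparison.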

    3. Strictly speaking, should the choice of normalization method be justified through some measure or test, or is it the norm to try out different methods?
    If the normalization does not work well, replicates will appear less similar than they are. This drives up the variance estimate and reduces the number of hits. Hence, in theory, a bad normalization should only reduce power, i.e., is conservative. I'm not sure, though, whether it would be a good idea to use the number of hits in the downstream test for differential expression as a figure of merit for the quality of the normalization; one might easily fall for outliers that way.
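To illustrate the first point only (poorly normalized replicates look less similar), here is a toy sketch that compares the spread across replicates before and after scaling. The data, the size factors, and the use of the mean coefficient of variation as a summary are all assumptions made for this example. It is not a recommended figure of merit.

```python
from statistics import mean, stdev

# Hypothetical replicates of ONE condition that differ only in
# sequencing depth. Rows are genes, columns are replicates.
raw = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
]

# Assumed size factors for the three replicates (made up).
size_factors = [0.5, 1.0, 2.0]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0 if the mean is 0)."""
    m = mean(values)
    return stdev(values) / m if m > 0 else 0.0


def mean_cv(table):
    """Average per-gene coefficient of variation across replicates."""
    return mean([coefficient_of_variation(row) for row in table])


normalized = [[c / s for c, s in zip(row, size_factors)] for row in raw]

cv_raw = mean_cv(raw)
cv_norm = mean_cv(normalized)
print(cv_raw, cv_norm)
```

In this contrived case the replicates become identical after scaling, so the spread drops to zero. With real data the spread would only shrink, and a single outlier gene could dominate such a summary.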


    • steven
      Senior Member
      • Aug 2009
      • 269

      #3
      Looking for co-expressed genes throughout time points? I haven't seen much of this in NGS papers yet. What about a clustering approach? Maybe this thread could help.


      • anandksrao
        Junior Member
        • Jun 2011
        • 9

        #4
        Originally posted by Simon Anders
        DESeq chooses the size factors such that their product is one, in order to put the common scale somewhere in the middle of all the library sizes. If you multiplied all the factors by a constant, the analysis result would not change. Hence, one could as well declare an arbitrary sample as reference and choose the factors such that this sample gets assigned a one.
        I have some questions regarding the calculation of the geometric mean to normalize individual libraries as implemented by estimateSizeFactors in DESeq.
        I checked out the DESeq package documentation for estimateSizeFactorsForMatrix

        Description:
        Given a matrix or data frame of count data, this function
        estimates the size factors as follows: Each column is divided by
        the geometric means of the rows. The median (or, if requested,
        another location estimator) of these ratios (skipping the genes
        with a geometric mean of zero) is used as the size factor for this
        column.
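The procedure in this description can be sketched in plain Python. This is an illustration written from the quoted text, not DESeq's actual code; the function names and the toy matrix are made up, and only the median is implemented as the location estimator.

```python
import math
from statistics import median


def geometric_mean(values):
    """Geometric mean of one row; zero as soon as any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def estimate_size_factors(counts):
    """counts: list of rows (genes), each a list of per-sample counts.

    For each sample, take the median over genes of count / geometric
    mean, skipping genes whose geometric mean is zero.
    """
    geo_means = [geometric_mean(row) for row in counts]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        ratios = [row[j] / g for row, g in zip(counts, geo_means) if g > 0]
        factors.append(median(ratios))
    return factors


# Hypothetical toy matrix; the last gene has a zero in the first sample.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 30, 60],
]

factors = estimate_size_factors(counts)
print(factors)
```

In this sketch the gene with a zero count is only left out of the median. It is still a row of `counts` and is not removed from the table.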


        My question to the forum / Simon is very specifically about "skipping the genes with a geometric mean of zero"

        Skipping genes with a geometric mean of zero seems to me like it might miss quite a few genes, especially in my time course study. With so many time points, the chance that even a gene highly expressed at time t1 has zero expression at time t2 is probably higher than in a pairwise comparison of just 2 time points. Such a gene would have a geometric mean of 0 and would consequently be discarded. I would not want to discard such a gene from my analysis; quite the contrary, actually.

        So for the purpose of not missing genes I am trying 2 things:
        a. pseudo-replace : substitute any raw count 0 to raw count 1, then perform the analysis,
        OR
        b. pseudo-add: add 1 to all raw counts, then perform the analysis

        Does my option a. or option b. violate the negative binomial model, or suffer from any intrinsic error that precludes correct conclusions?
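For concreteness, the two options can be written as small transformations. This sketch only shows what each option does to a count table; it does not answer whether either is statistically appropriate. The function names and the toy data are made up.

```python
def pseudo_replace(counts):
    """Option a: turn every raw count of 0 into 1, leave the rest alone."""
    return [[1 if c == 0 else c for c in row] for row in counts]


def pseudo_add(counts):
    """Option b: add 1 to every raw count."""
    return [[c + 1 for c in row] for row in counts]


# Hypothetical toy data: rows are genes, columns are samples.
counts = [
    [0, 30, 60],
    [1, 20, 40],
]

replaced = pseudo_replace(counts)
added = pseudo_add(counts)
print(replaced)
print(added)
```

Option a changes only the zero entries, so an observed 0 and an observed 1 become indistinguishable. Option b shifts every count by the same amount and keeps them distinct.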

        I intend to use my slightly modified data from options a and b to
        1. normalize using RLE (nomenclature from edgeR),
        2. perform VST if library variances are heteroskedastic, and
        3. finally perform fuzzy-K clustering to obtain dominant temporal patterns of expression.

        Looking forward to your opinions / comments / criticisms


        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Don't worry. The genes with zero counts are just not used in the calculation of the size factors. They are, of course, not discarded and not excluded from the test for differential expression.


          • anandksrao
            Junior Member
            • Jun 2011
            • 9

            #6
            Originally posted by Simon Anders
            Don't worry. The genes with zero counts are just not used in the calculation of the size factors. They are, of course, not discarded and not excluded from the test for differential expression.
            Thanks Simon!


            • anandksrao
              Junior Member
              • Jun 2011
              • 9

              #7
              For my time-series clustering problem, i.e. finding co-expressed genes with identical temporal expression profiles (which is NOT the same as DE gene identification), I assume that over-dispersion across our multiple biological replicates is still a problem. Can DESeq perform the variance-stabilizing transformation, so that I can then use the transformed data for time series clustering?

