Header Leaderboard Ad


DESeq Multi-factor designs: determining the significance of model terms



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq Multi-factor designs: determining the significance of model terms


    I have performed a two-factor design RNA seq experiment (treatment and sex) and would like to know if these factors have a significant effect on gene expression.

    To do this I followed the instructions as in the DESeq manual and fitted the following models:

    fit0 <- fitNbinomGLMs( cds, count ~ treatment)
    fit1 <- fitNbinomGLMs( cds, count ~ treatment + sex)
    fit2 <- fitNbinomGLMs( cds, count ~ treatment + sex + treatment * sex)

    Is it possible to determine which of these models fits the data best?

    Typically when fitting glms I would use AIC to determine model term significance (i.e. does fitting an interaction term significantly improve the fit of the model?). Is there any way to do this is DESeq?

    Many thanks,


  • #2
    Dear Darren,

    it seems you have successfully mastered the mechanics of DESeq, but from the wording, perhaps it would be good to get an update on statistical theory, namely the topics 'hypothesis testing' and 'model selection' and their (different) associated concepts and tools. People often get confused between these two. It seems that what you actually want are tests against the null hypotheses:
    • no gene is affected by treatment,
    • no gene is affected by sex,
    • no gene is affected by treatment in a sex-dependent manner.
    The DESeq vignette explains how to use the function
    nbinomGLMTest to perform chi-squared tests on a gene-by-gene manner, which you then need to follow by the suitable multing testing computation to arrive at the experiment-wise test p-value.

    Hope this helps
    Best wishes
    Wolfgang Huber


    • #3
      Reading that and the DESeq vignette I am still confuse :

      fit1 = fitNbinomGLMs( cdsFull, count ~ libType + condition )
      fit0 = fitNbinomGLMs( cdsFull, count ~ libType )
      pvalsGLM = nbinomGLMTest( fit1, fit0 )
      Here the p value for a single row (in this case let s say RNA) is testing a null hypothesis.
      I can t formulate it, is it the probability that this miRNA counts are explain better by one model than an other ?

      Let s say taht RNA12 is significant (after FDR) what can I say ?

      Let s say that I want to test for all my RNA if the libType is driving an effect, can I do that ? By doing something like averaging the p values ?


      • #4
        By definition, a likelihood ratio test compares two models. If RNA12 is significant, then including condition in the model produces a significantly better fit than not including this. In a paper, this is written as, "condition significantly affects the expression of RNA12."

        To test for libType, you would compare fit1 to a model of ~condition.


        • #5
          Multi factor DESeq2 help

          Dear all,
          Looking for some advice on a multi-factor DESeq2 analysis please (I’m new to all this so bear with me!)

          The experimental design involves a cross-factorial design in which subjects were exposed to one of three pesticide treatments (Control, Pesticide1, Pesticide2) and challenged with one of three injection treatments (Naïve=no injection, Buffer, Injection), plus sampled at one of two time points.

          The main questions I think I want to address are as follows:
          (For each timepoint)
          - What genes are differentially expressed as a result of an injection (i.e in Buffer vs Naïve, Injection vs Naïve, and Injection vs Buffer) in 1) Control group 2) Pesticide1 group 3) Pesticide2 group?
          - What genes are differentially expressed as a result of exposure to either Pesticide1 or Pesticide2 compared to Control-fed in 1)Naïve 2) Buffer-injected 3) Injected?
          - And, ideally, what genes are differentially expressed as a result of injection in a pesticide-dependent manner (ie interaction between Pesticide and Injection)?

          Having had a read through the DESeq2 vignette and other posts on here, I think the best way to address this might be to subset my samples (e.g. subset to only control-fed 8 hour samples) and run the DESeq2 differential expression pipeline for each subset, and compare the different levels of the (eg injection) treatment using the contrast arguments to results.

          The main thing I wanted to check was if using this subsetting approach, do I need to make any added adjustment of the P-value to account for the fact I’ll be doing multiple tests? As far as I can understand (but believe me, stats is not my strong point) having already adjusted for the FDR, I don’t need to make any further adjustment for the multiple testing due to subsetting, but please correct me if I’m wrong.

          Or is there an alternative way to test my hypotheses without subsetting? Although I can use a GLM formula to include multiple treatments (and their interactions) (eg ~Pesticide*Injection) , my difficulty if I do this is how to address the contrasts. Using the resultsNames argument this gives me interaction combinations as follows

          > resultsNames(dds)
          [1] "Intercept"
          [2] "Pesticide_Pesticide1_vs_Control"
          [3] "Pesticide_Pesticide2_vs_Control"
          [4] "Injection_Naive_vs_Injection"
          [5] "Injection_Buffer_vs_Injection"
          [6] "PesticidePesticide1.InjectionNaive"
          [7] "PesticidePesticide2.InjectionNaive"
          [8] "PesticidePesticide1.InjectionBuffer"
          [9] "PesticidePesticide2.InjectionBuffer"

          The first set of contrasts make sense to me as Pesticide “Control” and Injection “Naïve” were set as the base levels, but could someone clarify what these interaction contrasts mean please. Does this give (for [6]) the genes that are differentially expressed between Pesticide1 vs Control in a manner dependent on whether the injection treatment was Naïve vs Injection?

          Also, for eg [2] this is the genes differentially expressed between Pesticide1 and Control accounting for Injection treatment. By accounting for injection treatment, does this mean that the test accounts for any genes that were either differentially expressed due to injection and/or with pesticide dependent on injection, and only gives those genes that are affected by Pesticide1 across all the injection treatments?

          Apologies for the essay of a post, but any thoughts would be greatly appreciated. Many thanks in advance.


          • #6

            Interaction models are a bit complicated, so I really recommend that users take the time to read a statistics reference on interactions in linear models, or to consult a statistician at your local institution who can answer basic questions. There's no reason to expect that this tricky stuff can be best learned on forums.

            That said, I just posted the following to the Bioc mailing list, for a brief explanation of how to interpret interaction terms:

            Suppose we have a design ~ A + B+ A:B, and A=0/1, B=0/1. This means
            we are modeling a log fold change due to variable A=1 over A=0, and a
            log fold change due to variable B=1 over B=0. The interaction term is
            an additional log fold change when A=1 and B=1 beyond the main effect
            for A and the main effect for B. If the log fold change for the
            interaction term is 0, then we know: the fold change when A=1 and B=1
            is simply the two main effect fold changes multiplied (the log fold
            changes added).
            One quick note: you might want to set the base level of Injection to naive. By default R will choose the base level using alphabetical order. See the DESeq2 vignette for code examples.

            To answer some of your questions:

            What about multiple testing across contrasts. The multiple test correction in DESeq2 is over genes, per results table. Alternative options for correction are described well in the limma vignette, section 13.3 "Multiple Testing Across Contrasts". As we cannot predict which contrasts the user will be interested in, in order to make the results table reproducible, the default correction is over genes. We leave alternate strategies using the pvalue column to the user.

            You can generate all of the contrasts of interest without subsetting. See the 'contrast' argument of ?results, and you can ask more questions here. Please also though post your code and sessionInfo().

            By accounting for injection treatment, does this mean that the test accounts for any genes that were either differentially expressed due to injection and/or with pesticide dependent on injection, and only gives those genes that are affected by Pesticide1 across all the injection treatments?
            In this interaction model, the Pesticide_Pesticide1_vs_Control effect is for the base level of injection. The pest1 vs control effect for other levels of injection is equal to the sum of the interaction term and the effect for the base level of injection. This will make more sense after checking a reference for interactions in linear models.


            • #7
              Thanks Ecolliso,

              I would say that you have the same kind of design that I have, maybe more complex. I am trying to understand what I can do in DESeq1 or 2

              I can answer to
              if using this subsetting approach, do I need to make any added adjustment of the P-value to account for the fact I’ll be doing multiple tests?
              adjustment on p value should theorically be applied on all the p values ever computed on earth .... It s of course not mathematical it s just a widely use way to safely rely on something. So yes you should correct for all your p values but if you don t you can still publish for instance . More that that you should have a p value that is really significant, even 0.001 is still one chance over 1000 to be wrong, it s kind of a lot of chance no ?

              About GLM, here is some questions:
              1. First, from here : One reason for calling the general linear model “general” is that it can handle an X that is not numerical as well as one that is numerical. I cant find in DESeq a way to check continuous covariates like age or time ....

              1. Second I wonder if puting the subject identifier as a covariate will help to make a pair wise test ...

              1. Last, GLM usually give a p value for the intercept and each coefficient, why DESeq does not do it ?


              • #8
                quick answers:

                Search the DESeq2 vignette for "continuous covariate"

                Yes including the subject identifier, i.e. "~ patient + condition" will test the change relative for each patient (pair-wise).

                You can produce results tables for any coefficient using 'contrast' or 'name' argument to the results() function.


                • #9
                  Thanks Mickael, I found the way to use continuous covariate, it was simple !
                  Now to test the change relative for each patient, I should fit a GLM including the subject identifier, i.e. "~ patient + condition", like you say.
                  Then I have to go to DESeq2 since DEseq have no result() function, I guess the p value associated with each coeficient is the usual that you find in the basic glm() funtion of R. So I can use it a as testing for my differential expression like the regular negative binomial test ! That s great ! Do you know why people use GLM only in case of covariate ?

                  Many Thanks!


                  Latest Articles


                  • seqadmin
                    A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
                    by seqadmin

                    ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

                    01-24-2023, 01:19 PM
                  • seqadmin
                    Introduction to Single-Cell Sequencing
                    by seqadmin
                    Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

                    The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
                    01-09-2023, 03:10 PM