Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq Multi-factor designs: determining the significance of model terms


    I have performed a two-factor design RNA seq experiment (treatment and sex) and would like to know if these factors have a significant effect on gene expression.

    To do this I followed the instructions as in the DESeq manual and fitted the following models:

    fit0 <- fitNbinomGLMs( cds, count ~ treatment)
    fit1 <- fitNbinomGLMs( cds, count ~ treatment + sex)
    fit2 <- fitNbinomGLMs( cds, count ~ treatment + sex + treatment * sex)

    Is it possible to determine which of these models fits the data best?

    Typically when fitting glms I would use AIC to determine model term significance (i.e. does fitting an interaction term significantly improve the fit of the model?). Is there any way to do this is DESeq?

    Many thanks,


  • #2
    Dear Darren,

    it seems you have successfully mastered the mechanics of DESeq, but from the wording, perhaps it would be good to get an update on statistical theory, namely the topics 'hypothesis testing' and 'model selection' and their (different) associated concepts and tools. People often get confused between these two. It seems that what you actually want are tests against the null hypotheses:
    • no gene is affected by treatment,
    • no gene is affected by sex,
    • no gene is affected by treatment in a sex-dependent manner.
    The DESeq vignette explains how to use the function
    nbinomGLMTest to perform chi-squared tests on a gene-by-gene manner, which you then need to follow by the suitable multing testing computation to arrive at the experiment-wise test p-value.

    Hope this helps
    Best wishes
    Wolfgang Huber


    • #3
      Reading that and the DESeq vignette I am still confuse :

      fit1 = fitNbinomGLMs( cdsFull, count ~ libType + condition )
      fit0 = fitNbinomGLMs( cdsFull, count ~ libType )
      pvalsGLM = nbinomGLMTest( fit1, fit0 )
      Here the p value for a single row (in this case let s say RNA) is testing a null hypothesis.
      I can t formulate it, is it the probability that this miRNA counts are explain better by one model than an other ?

      Let s say taht RNA12 is significant (after FDR) what can I say ?

      Let s say that I want to test for all my RNA if the libType is driving an effect, can I do that ? By doing something like averaging the p values ?


      • #4
        By definition, a likelihood ratio test compares two models. If RNA12 is significant, then including condition in the model produces a significantly better fit than not including this. In a paper, this is written as, "condition significantly affects the expression of RNA12."

        To test for libType, you would compare fit1 to a model of ~condition.


        • #5
          Multi factor DESeq2 help

          Dear all,
          Looking for some advice on a multi-factor DESeq2 analysis please (I’m new to all this so bear with me!)

          The experimental design involves a cross-factorial design in which subjects were exposed to one of three pesticide treatments (Control, Pesticide1, Pesticide2) and challenged with one of three injection treatments (Naïve=no injection, Buffer, Injection), plus sampled at one of two time points.

          The main questions I think I want to address are as follows:
          (For each timepoint)
          - What genes are differentially expressed as a result of an injection (i.e in Buffer vs Naïve, Injection vs Naïve, and Injection vs Buffer) in 1) Control group 2) Pesticide1 group 3) Pesticide2 group?
          - What genes are differentially expressed as a result of exposure to either Pesticide1 or Pesticide2 compared to Control-fed in 1)Naïve 2) Buffer-injected 3) Injected?
          - And, ideally, what genes are differentially expressed as a result of injection in a pesticide-dependent manner (ie interaction between Pesticide and Injection)?

          Having had a read through the DESeq2 vignette and other posts on here, I think the best way to address this might be to subset my samples (e.g. subset to only control-fed 8 hour samples) and run the DESeq2 differential expression pipeline for each subset, and compare the different levels of the (eg injection) treatment using the contrast arguments to results.

          The main thing I wanted to check was if using this subsetting approach, do I need to make any added adjustment of the P-value to account for the fact I’ll be doing multiple tests? As far as I can understand (but believe me, stats is not my strong point) having already adjusted for the FDR, I don’t need to make any further adjustment for the multiple testing due to subsetting, but please correct me if I’m wrong.

          Or is there an alternative way to test my hypotheses without subsetting? Although I can use a GLM formula to include multiple treatments (and their interactions) (eg ~Pesticide*Injection) , my difficulty if I do this is how to address the contrasts. Using the resultsNames argument this gives me interaction combinations as follows

          > resultsNames(dds)
          [1] "Intercept"
          [2] "Pesticide_Pesticide1_vs_Control"
          [3] "Pesticide_Pesticide2_vs_Control"
          [4] "Injection_Naive_vs_Injection"
          [5] "Injection_Buffer_vs_Injection"
          [6] "PesticidePesticide1.InjectionNaive"
          [7] "PesticidePesticide2.InjectionNaive"
          [8] "PesticidePesticide1.InjectionBuffer"
          [9] "PesticidePesticide2.InjectionBuffer"

          The first set of contrasts make sense to me as Pesticide “Control” and Injection “Naïve” were set as the base levels, but could someone clarify what these interaction contrasts mean please. Does this give (for [6]) the genes that are differentially expressed between Pesticide1 vs Control in a manner dependent on whether the injection treatment was Naïve vs Injection?

          Also, for eg [2] this is the genes differentially expressed between Pesticide1 and Control accounting for Injection treatment. By accounting for injection treatment, does this mean that the test accounts for any genes that were either differentially expressed due to injection and/or with pesticide dependent on injection, and only gives those genes that are affected by Pesticide1 across all the injection treatments?

          Apologies for the essay of a post, but any thoughts would be greatly appreciated. Many thanks in advance.


          • #6

            Interaction models are a bit complicated, so I really recommend that users take the time to read a statistics reference on interactions in linear models, or to consult a statistician at your local institution who can answer basic questions. There's no reason to expect that this tricky stuff can be best learned on forums.

            That said, I just posted the following to the Bioc mailing list, for a brief explanation of how to interpret interaction terms:

            Suppose we have a design ~ A + B+ A:B, and A=0/1, B=0/1. This means
            we are modeling a log fold change due to variable A=1 over A=0, and a
            log fold change due to variable B=1 over B=0. The interaction term is
            an additional log fold change when A=1 and B=1 beyond the main effect
            for A and the main effect for B. If the log fold change for the
            interaction term is 0, then we know: the fold change when A=1 and B=1
            is simply the two main effect fold changes multiplied (the log fold
            changes added).
            One quick note: you might want to set the base level of Injection to naive. By default R will choose the base level using alphabetical order. See the DESeq2 vignette for code examples.

            To answer some of your questions:

            What about multiple testing across contrasts. The multiple test correction in DESeq2 is over genes, per results table. Alternative options for correction are described well in the limma vignette, section 13.3 "Multiple Testing Across Contrasts". As we cannot predict which contrasts the user will be interested in, in order to make the results table reproducible, the default correction is over genes. We leave alternate strategies using the pvalue column to the user.

            You can generate all of the contrasts of interest without subsetting. See the 'contrast' argument of ?results, and you can ask more questions here. Please also though post your code and sessionInfo().

            By accounting for injection treatment, does this mean that the test accounts for any genes that were either differentially expressed due to injection and/or with pesticide dependent on injection, and only gives those genes that are affected by Pesticide1 across all the injection treatments?
            In this interaction model, the Pesticide_Pesticide1_vs_Control effect is for the base level of injection. The pest1 vs control effect for other levels of injection is equal to the sum of the interaction term and the effect for the base level of injection. This will make more sense after checking a reference for interactions in linear models.


            • #7
              Thanks Ecolliso,

              I would say that you have the same kind of design that I have, maybe more complex. I am trying to understand what I can do in DESeq1 or 2

              I can answer to
              if using this subsetting approach, do I need to make any added adjustment of the P-value to account for the fact I’ll be doing multiple tests?
              adjustment on p value should theorically be applied on all the p values ever computed on earth .... It s of course not mathematical it s just a widely use way to safely rely on something. So yes you should correct for all your p values but if you don t you can still publish for instance . More that that you should have a p value that is really significant, even 0.001 is still one chance over 1000 to be wrong, it s kind of a lot of chance no ?

              About GLM, here is some questions:
              1. First, from here : One reason for calling the general linear model “general” is that it can handle an X that is not numerical as well as one that is numerical. I cant find in DESeq a way to check continuous covariates like age or time ....

              1. Second I wonder if puting the subject identifier as a covariate will help to make a pair wise test ...

              1. Last, GLM usually give a p value for the intercept and each coefficient, why DESeq does not do it ?


              • #8
                quick answers:

                Search the DESeq2 vignette for "continuous covariate"

                Yes including the subject identifier, i.e. "~ patient + condition" will test the change relative for each patient (pair-wise).

                You can produce results tables for any coefficient using 'contrast' or 'name' argument to the results() function.


                • #9
                  Thanks Mickael, I found the way to use continuous covariate, it was simple !
                  Now to test the change relative for each patient, I should fit a GLM including the subject identifier, i.e. "~ patient + condition", like you say.
                  Then I have to go to DESeq2 since DEseq have no result() function, I guess the p value associated with each coeficient is the usual that you find in the basic glm() funtion of R. So I can use it a as testing for my differential expression like the regular negative binomial test ! That s great ! Do you know why people use GLM only in case of covariate ?

                  Many Thanks!


                  Latest Articles


                  • seqadmin
                    Exploring the Dynamics of the Tumor Microenvironment
                    by seqadmin

                    The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                    07-08-2024, 03:19 PM
                  • seqadmin
                    Exploring Human Diversity Through Large-Scale Omics
                    by seqadmin

                    In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                    06-25-2024, 06:43 AM





                  Topics Statistics Last Post
                  Started by seqadmin, 07-19-2024, 07:20 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 07-16-2024, 05:49 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 07-15-2024, 06:53 AM
                  0 responses
                  Last Post seqadmin  
                  Started by seqadmin, 07-10-2024, 07:30 AM
                  0 responses
                  Last Post seqadmin