No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unbalanced block design analysis with DESeq2


    I have RNAseq samples from 11 Patients. For each patient
    the samples were taken from two difference locations (Ileum and Caecum)
    and had two different inflammation status at those locations.
    However, the design is not balanced as I do not have 4 samples for each patient.
    Below is a description of the dataset

    1   Caecum     NI
    1    Ileum     NI
    3   Caecum      I
    3    Ileum     NI
    4    Ileum      I
    5   Caecum     NI
    5   Caecum      I
    5   Caecum     NI
    6    Ileum      I
    6   Caecum      I
    7   Caecum     NI
    7    Ileum      I
    8    Ileum     NI
    8    Ileum      I
    9    Ileum     NI
    9    Ileum      I
    9   Caecum      I
    10    Ileum     NI
    10   Caecum     NI
    12    Ileum     NI
    12   Caecum     NI
    14    Ileum     NI
    14    Ileum      I
    14    Ileum      I

    I am interested in testing if there is a difference between the Location in general, the Status in general,
    the Status given the Location and the interaction between status and location.

    Therefore, I used DESeq2 with the following design
    ~ Patient + Location + Status + Location:Status
    From what I understood this will remove the variation due to Patient and test for
    the effect of Location, Status and the interaction between the two.

    To check the difference for Location in general, I extracted the results of the DESeq analysis
    using contrast c("Location", "Ileum", "Caecum"). For Status, I used contrast = c("Status", "NI", "I")

    To test for the effect of Status given the Location, I used
    contrast = list("LocationCaecum.StatusI", "LocationCaecum.StatusNI").

    To check for the effect of Location on Status (i.e. inflamed caecum has additional effect
    than Caecum alone and inflammation alone) I used
    contrast = list(c("StatusI", "LocationCaecum.StatusI"), c("StatusNI", "LocationCaecum.StatusNI"))

    Considering that the high amount of missing samples (or incomplete blocks), does the design
    formula I am using make sense? For instance, when comparing on Location, does the pairing have any value
    as it will only work for the following 8 samples (Or am I wrong?)

    Sample Location Status
    5   Caecum     NI
    5   Caecum      I
    8    Ileum     NI
    8    Ileum      I
    9    Ileum     NI
    9    Ileum      I
    14   Ileum     NI
    14   Ileum      I
    And is it OK to use the whole dataset for these tests or it is better to subset it first
    then do a DESeq2 analysis for each case separately. For example select all NI samples, then compare them between Ileum and Caecum with the design below

    design = formula(~ Patient + Status)

    Thank you very much in advance,

  • #2
    Your model (~ Patient + Location + Status + Location:Status) has full rank, so it should work in general. You can drop at least patient #4, since it adds nothing.

    Edit: BTW, make sure that "patient" is a factor (and reset the levels if you remove patient #4).


    • #3

      I have analyzed the samples according to Devon's recommendation and everything went well without any error messages.

      However, in the results, I do not understand why the interaction effect of Location and Status gives nearly the exact opposite log fold change.
      Is this something normal or it indicates something wrong with the analysis?

      Could it indicate a mislabeling of the samples? I am suspecting this already from the PCA plot of VST values.

      Below is the relevant part of the code I used, starting from a SummarizedExperiment.

      # Remove sample 4
      se <- se[, colnames(se) != "SA4B4"]
      se$SampleID <- factor(se$SampleID)
      se$Patient <- factor(se$Patient)
      # relevel: put NI and Ileum as base level
      se$Status <- relevel(se$Status, "NI")
      se$Location <- relevel(se$Location, "Ileum")
      # create a DEseq dataset
      dds <- DESeqDataSet(se = se, design = formula(~ Patient + Location + Status + Location:Status)) 
      # run analysis
      dds <- DESeq(dds)
      # extract results 
      # Test the hypothesis that here is no gene expression difference between Caecum and Ileum
      results(dds, contrast=c("Location", "Caecum", "Ileum"))
      # Test the hypothesis that here is no gene expression difference between I and NI 
      results(dds, contrast=c("Status", "I", "NI"))
      # Interaction effect between Caecum and Status
      results(dds, contrast=list("LocationCaecum.StatusI", "LocationCaecum.StatusNI"))
      # Interaction effect between Ileum and Status (The result of this contrast if the opposite of the contrast above)
      results(dds, contrast=list("LocationIleum.StatusI", "LocationIleum.StatusNI"))
      # Test the hypothesis that here is no gene expression difference between I and NI in Caecum
      results(dds, contrast=list(c("StatusI", "LocationCaecum.StatusI"), c("StatusNI", "LocationCaecum.StatusNI")))
      # Test the hypothesis that here is no gene expression difference between I and NI in Ileum
      results(dds, contrast = list(c("StatusI", "LocationIleum.StatusI"), c("StatusNI", "LocationIleum.StatusNI")))
      And below is the coefficients for the first 6 genes

      head(coef(dds)[, 12:19])
      DataFrame with 6 rows and 8 columns
                      LocationIleum LocationCaecum    StatusNI     StatusI
                          <numeric>      <numeric>   <numeric>   <numeric>
      ENSG00000000003  -0.137034259   0.1468754138 -0.06881193  0.07865309
      ENSG00000000005  -0.483682079   0.4866318563 -0.08784116  0.09079094
      ENSG00000000419   0.008710795   0.0003770367 -0.02530449  0.03439232
      ENSG00000000457   0.121583429  -0.1132622163 -0.10059046  0.10891168
      ENSG00000000460  -0.080019422   0.0865712307  0.11134169 -0.10478988
      ENSG00000000938  -0.002435431   0.0101981947 -0.49175965  0.49952241
                      LocationIleum.StatusNI LocationCaecum.StatusNI
                                   <numeric>               <numeric>
      ENSG00000000003            -0.11392008              0.11334093
      ENSG00000000005            -0.42676013              0.42602082
      ENSG00000000419            -0.09053654              0.09032357
      ENSG00000000457             0.02507584             -0.02592245
      ENSG00000000460            -0.14335772              0.14429482
      ENSG00000000938             0.17662933             -0.18076818
                      LocationIleum.StatusI LocationCaecum.StatusI
                                  <numeric>              <numeric>
      ENSG00000000003            0.11276674            -0.11210476
      ENSG00000000005            0.42268926            -0.42192512
      ENSG00000000419            0.09060986            -0.09032040
      ENSG00000000457           -0.02405254             0.02496919
      ENSG00000000460            0.14268424            -0.14356620
      ENSG00000000938           -0.17664982             0.18085401
      any help on this would be appreciated
      Thanks in advance,


      • #4
        Below is the failed attachment from previous post
        Click image for larger version

Name:	PCA.png
Views:	1
Size:	51.4 KB
ID:	304825


        • #5
          However, in the results, I do not understand why the interaction effect of Location and Status gives nearly the exact opposite log fold change.
          Is this something normal or it indicates something wrong with the analysis?
          Short answer: Don't worry, it's all fine. The values are nearly the same because they should be.

          Long answer: Well, this will be long...

          This is normal, but a bit unfortunate. In DESeq2, we use a non-standard way of setting up model matrices, which we call "extended model matrix".
          As explained in more detail in our paper, this is necessary to get proper shrinkage of log-fold-change estimates via ridge penality.

          In essence, using your example, and first explaining it without interaction: With standard model matrices, the intercept coefficient would be the expected log expression for a sample from patient level 1, location level 1 (ileum), status level 1 (NI), and then there are coefficients for all other patients, giving the differences in expression if the sample is any of the other patients, and one coefficient for the other location level (caecum) and one the other status level (I). The location coefficient, for example, is the difference between log expression for the other location level (caecum) and the first location level (ileum). For each factor, there is one less coefficient than there are levels.

          For extended model matrices, the intercept is the average over all levels rather then the value for all factors being the first level, and then, there is one coefficient for each level of each factor. These coefficients give the difference to the grand average (not to the first level as before).

          So, LocationIleum is the difference between ileum samples and the average over all samples, and LocationCaecum is the difference between caecum samples and the average. The difference caecum-vs-illeum is hence the difference between the LocationCaecum and the LocationIlleum coefficient -- and this is what 'results' will calculate for you if you ask for this contrast. If you had a balanced design, these two coefficient would be exactly the same value with opposite signs (because the grand average would be in the middle of both), and half of what the coefficient with standard design matrices would be. As your design is not quite balanced, the values are only about the same, but this is not an issue.

          Now, for the interaction: An interaction is a difference of difference: In your case, it is the difference between (a) the difference between I and NI for caecum samples and (b) the difference between I and NI for illeum samples:

          interaction = (caecumI - caecumNI) - (illeumI - illeumNI)

          You could also ask for other interactions, e.g.

          interaction' = (caecumI - illeumI) - (caecumNI - illeumNI)

          and besides these two, there are two more ways to arrange the terms in this way. But all four of these are the same and differ only by sign, as you will notice quickly if you write the equations without parantheses. It is hence fully expected that all four interaction coefficients are the same, up to sign. There is only one interaction value -- it's only due to the need for extended design matrices that the same information appears four times. And the lack of balances causes them to differ, but only slightly.

          Two final remarks:

          - The interaction in my double difference above is twice of what the coefficient says, again because these are the differences to the grand average, not to each other. (Mike: If you read this please double-check whether I'm right.)

          - Strictly speaking, we need extended design matrices only if we have more than two levels. Hence, normally, DESeq2 switches to standard design matrices (with so-called sum contrasts) if all factors have only two levels -- to save users from exactly the confusion we have here. Unfortunately, your patient factor has more than two levels, and our code is not smart enough to realize that we need extended coding only this factor.
          Last edited by Simon Anders; 03-10-2015, 01:07 PM.


          • #6
            Could it indicate a mislabeling of the samples? I am suspecting this already from the PCA plot of VST values.
            Yes, your PCA plot shows that something is amiss, but this has nothing to do with what I discussed above.

            You need to figure out what the difference between the sample groups to the left and to the right is. Maybe there is some mislabelling, but it would have to be a rather catastrophic one, not just a single mislabelled sample. Or there is some dramaticv batch effect, such that something in the processing was done very differently in the left sample group than in the right one.


            • #7
              Thank you for the detailed answer!


              Latest Articles


              • seqadmin
                Advanced Tools Transforming the Field of Cytogenomics
                by seqadmin

                At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
                09-26-2023, 06:26 AM
              • seqadmin
                How RNA-Seq is Transforming Cancer Studies
                by seqadmin

                Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
                09-07-2023, 11:15 PM
              • seqadmin
                Methods for Investigating the Transcriptome
                by seqadmin

                Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.

                Whole Transcriptome RNA-seq
                Whole transcriptome sequencing...
                08-31-2023, 11:07 AM





              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 09:38 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 09-27-2023, 06:57 AM
              0 responses
              Last Post seqadmin  
              Started by seqadmin, 09-26-2023, 07:53 AM
              1 response
              Last Post seed_phrase_metal_storage  
              Started by seqadmin, 09-25-2023, 07:42 AM
              0 responses
              Last Post seqadmin