Header Leaderboard Ad

Collapse

discrepance in DESeq2 results with different design structures

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • discrepance in DESeq2 results with different design structures

    Hi,

    I am doing a DESeq2 analysis using a data set of nine different conditions/time-points.
    I first ran the analysis with the complete data set in the count table and than have used the results function to extract the pair-wise results of the DE analysis.
    Code:
    >count_table <- read.delim2("count_table_complete.txt",row.names=1)
    
    >colData <- read.delim2("metaData.txt")
    >levels(colData$condition)
    [1] "CR4W0h"  "CR4W24h" "CR4W4h"  "CTRL0h"  "CTRL24h" "CTRL4h"  "HP0h"    "HP24h"   "HP4h"   
    
    >cds <- DESeqDataSetFromMatrix (
      countData= count_table[,-1],
      colData    = colData,  
      design     = ~condition )
    
    >fit = DESeq(cds)
    
    >res = results(fit, contrast=c("condition", "CTRL0h" , "HP0h") )
    > res
    log2 fold change (MAP): condition CTRL0h vs HP0h 
    Wald test p-value: condition CTRL0h vs HP0h 
    DataFrame with 39179 rows and 6 columns
                          baseMean log2FoldChange      lfcSE       stat     pvalue      padj
                         <numeric>      <numeric>  <numeric>  <numeric>  <numeric> <numeric>
    ENSMUSG00000000001 [B][COLOR="Red"]3093.215856[/COLOR][/B]    0.002814819 0.08192671 0.03435777 0.97259186  0.998355
    When doing so, I get only 3 genes with an adjusted p-value below 0.05
    Code:
    >table(res$padj<=0.05)
    
    FALSE  TRUE 
     1956     3
    But if i ran the same analysis with the subset of the data which contains only the columns from the count table for "CTRL0h" and "HP0h" I get many more significant genes.

    Code:
    > count_table <- read.delim2("count_table_complete.txt",row.names=1)
    > count_subset <- count_table[,c(2:5,23:26)]
    > head(count_subset)
                        C20  C22  C23  C24 HP10 HP11 HP12 HP14
    ENSMUSG00000000001 2811 2360 3053 3334 2636 2736 2505 3282
    
    > colData_subset <- colData[c(1:4, 23:26),]  
    
    >cds_subset <- DESeqDataSetFromMatrix (
      countData= count_subset,
      colData    = colData_subset,  
      design     = ~condition )
    
    >fit_subset = DESeq(cds_subset)
    
    >res_subset = results(fit_subset, contrast=c("condition", "CTRL0h" , "HP0h") )
    > res_subset
    log2 fold change (MAP): condition CTRL0h vs HP0h 
    Wald test p-value: condition CTRL0h vs HP0h 
    DataFrame with 39179 rows and 6 columns
                           baseMean log2FoldChange      lfcSE       stat      pvalue      padj
                          <numeric>      <numeric>  <numeric>  <numeric>   <numeric> <numeric>
    ENSMUSG00000000001 [B][COLOR="Red"]2810.6322989 [/COLOR][/B]  -0.003504429 0.05504337 -0.0636667 0.949235621 0.9938743
    
    > table(res$padj<= 0.05)
    
    FALSE  TRUE 
    13122   321
    first I see a difference in the baseMean values. I guess this might be because for some of the conditions/TP I have more samples than the others, so that the baseMean is calculated differently (Am I correct in this assumption?)

    But I can't figure out why I get such a big different in the number of DE genes between the two approaches.

    Is it because of the differences in the independent filtering step, DESeq2 is automatically doing?
    I have tried to deactivate the independent filtering in the results function, but it didn't make the results better.

    Any other suggestions?

    thanks,
    Assa

  • #2
    I will answer on your post @ bioc support site today

    https://support.bioconductor.org/p/67202/

    Comment

    Working...
    X