Hello All (and hopefully Simon Anders),
I've been using DESeq for testing for differential expression between samples. It's always worked great, but for some reason, I don't think it's working correctly for this one data set I have right now. Here's what I'm doing.
I have three biological samples (CD90, UtESC-, and UtESC+) with three replicates each. I align these to the mouse genome using Tophat, allowing only unique hits. I then use HTSeq with the Ensembl gtf file to find my gene matrix. I then run this matrix through DESeq to find the size factors, variance, and p-values for each pairwise comparison: CD90 to UtESC-, CD90 to UtESC+, and UtESC- to UtESC+.
Here are the size factors (they vary more than most of my data sets, but I don't think it explains what is happening here):
> sizeFactors(cds)
CD901 CD902 CD903 UtESC.1 UtESC.2 UtESC.3 UtESC.1.1 UtESC.2.1
0.9780371 1.3965205 1.3044467 1.7915295 0.3420241 0.6416566 0.5929506 2.4919984
UtESC.3.1
1.0039902
Here is the plot of variance:

Seems okay so far. However, when I calculate the p-values and fold changes, I get very strange results. The fold changes seem wrong (does not match up with the RPKM values I calculated) and the last comparison (UtESC+ vs UtESC-) gives almost only adjusted p-values of 1, none below 0.05, while the regular p-values give only about 200 p-values of less than 0.01. I've asked around, searched seqanswers, and looked over my code many times, but I can't find what I'm doing wrong (if anything). Does anybody have any troubleshooting suggestions? For reference, here is my code:
library("DESeq")
count_table <- read.table("UtESC_gene_array.txt", header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(count_table), condition = c("CD901", "CD902", "CD903", "UtESC-1", "UtESC-2", "UtESC-3", "UtESC+1", "UtESC+2", "UtESC+3"),
libType = c("single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end") )
conds <- factor( c("CD", "CD", "CD", "UM", "UM", "UM", "UP", "UP", "UP") )
cds <- newCountDataSet( count_table, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds, fitType="local" )
res <- cbind( nbinomTest( cds, "CD", "UM" )[,c(1,6,8)], nbinomTest( cds, "CD", "UP" )[,c(6,8)], nbinomTest( cds, "UM", "UP" )[,c(6,8)] )
write.table(res, file="UtESC_take2_pvals.txt", append=FALSE, quote=FALSE, sep="\t")
If you would like a copy of the data for yourself to try, you can grab it here: http://pellegrini.mcdb.ucla.edu/Artur/UtESC_gene_array.txt
Thank you all in advance for your help!
Best,
Artur
I've been using DESeq for testing for differential expression between samples. It's always worked great, but for some reason, I don't think it's working correctly for this one data set I have right now. Here's what I'm doing.
I have three biological samples (CD90, UtESC-, and UtESC+) with three replicates each. I align these to the mouse genome using Tophat, allowing only unique hits. I then use HTSeq with the Ensembl gtf file to find my gene matrix. I then run this matrix through DESeq to find the size factors, variance, and p-values for each pairwise comparison: CD90 to UtESC-, CD90 to UtESC+, and UtESC- to UtESC+.
Here are the size factors (they vary more than most of my data sets, but I don't think it explains what is happening here):
> sizeFactors(cds)
CD901 CD902 CD903 UtESC.1 UtESC.2 UtESC.3 UtESC.1.1 UtESC.2.1
0.9780371 1.3965205 1.3044467 1.7915295 0.3420241 0.6416566 0.5929506 2.4919984
UtESC.3.1
1.0039902
Here is the plot of variance:

Seems okay so far. However, when I calculate the p-values and fold changes, I get very strange results. The fold changes seem wrong (does not match up with the RPKM values I calculated) and the last comparison (UtESC+ vs UtESC-) gives almost only adjusted p-values of 1, none below 0.05, while the regular p-values give only about 200 p-values of less than 0.01. I've asked around, searched seqanswers, and looked over my code many times, but I can't find what I'm doing wrong (if anything). Does anybody have any troubleshooting suggestions? For reference, here is my code:
library("DESeq")
count_table <- read.table("UtESC_gene_array.txt", header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(count_table), condition = c("CD901", "CD902", "CD903", "UtESC-1", "UtESC-2", "UtESC-3", "UtESC+1", "UtESC+2", "UtESC+3"),
libType = c("single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end", "single-end") )
conds <- factor( c("CD", "CD", "CD", "UM", "UM", "UM", "UP", "UP", "UP") )
cds <- newCountDataSet( count_table, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds, fitType="local" )
res <- cbind( nbinomTest( cds, "CD", "UM" )[,c(1,6,8)], nbinomTest( cds, "CD", "UP" )[,c(6,8)], nbinomTest( cds, "UM", "UP" )[,c(6,8)] )
write.table(res, file="UtESC_take2_pvals.txt", append=FALSE, quote=FALSE, sep="\t")
If you would like a copy of the data for yourself to try, you can grab it here: http://pellegrini.mcdb.ucla.edu/Artur/UtESC_gene_array.txt
Thank you all in advance for your help!
Best,
Artur
Comment