I'm doing some differential expression analysis with DESeq2, limma(voom) and edgeR, and I get some weird results. There is a gene that I know should be highly expressed in one of my samples, but my results tell me it's the other way around. Looking at both the counts and the FPKM values (from Cufflinks) for said gene, they both agree that I should get a log2 fold change around 0.8, but I get around -0.8. Setting aside the fact that eyeballing a fold change from counts or FPKM isn't the right way to do things, it should at the very least be of the correct sign, right?
I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:
Now, comparing this to the DESeq2 vignette, everything seems fine, except the "colnames(7): 1 2 ... 6 7" part - instead of numbers in my data, the vignette has "treated1, treated2", etc instead. I figured it must then be due to some fault in me creating the dds construct, but I can't figure out what. This is (the relevant parts of) my code:
I call my script from the terminal as an Rscript like so:
... mainly so that I can do different analyses on different combinations of cell lines without having to do different sets of code. I am guessing that there's something wrong with part of this that is making the fold change go bonkers. It is my intent that the fold change would here be cell_line_1 / cell_line_2. Given an experiment with 3 and 4 replicates per cell line (as above) the "condition" variable looks like this:
And the "organization" variable like this:
... which looks right to me. I am stumped, and I would VERY much appreciate any help with this!
I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:
Code:
dds class: DESeqDataSet dim: 20477 7 exptData(0): assays(1): counts rownames(20477): ENSG00000000003 ENSG00000000005 ... ENSG00000273431 ENSG00000273457 rowData metadata column names(0): colnames(7): 1 2 ... 6 7 colData names(2): names.data. condition
Code:
# Load data cat('reading data\n') data = read.table("counts/collected_counts.txt", header=TRUE, row.names=1) # Filter for desired samples samples = strsplit(args$input_samples, ",")[[1]] data = data[ , c(grep(samples[1], names(data), value=TRUE), grep(samples[2], names(data), value=TRUE))] # Load data in DESeq2 number_samples_1 = length(grep(samples[1], names(data), value=TRUE)) number_samples_2 = length(grep(samples[2], names(data), value=TRUE)) condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2))) organization = data.frame(names(data), condition=condition) #DESeq2 calculations dds = DESeqDataSetFromMatrix(countData=data, colData=organization, design=~condition) dds dds = DESeq(dds, parallel=TRUE) res_DESeq2 = results(dds, parallel=TRUE)
Code:
de_analysis.R cell_line_1,cell_line_2
Code:
[1] hct hct hct rko rko rko rko Levels: hct rko
Code:
names.data. condition 1 hct_a hct 2 hct_b hct 3 hct_c hct 4 rko_a rko 5 rko_b rko 6 rko_c rko 7 rko_d rko
Comment