Can an updated ENSEMBL database release have an effect on DESeq2 results?
Well, obviously it can, but the changes in the output I see are rather dramatic.
Here is my situation in brief:
Data from an RNA-seq experiment (mouse) was
mapped to GRCm38.p2 in Feb. 2014, using STAR,
counted with HTSeq (latest version),
DEG estimations done with DESeq2 (latest version at that time)
and the Ensembl release e75 GTF file.
Results were interesting and made sense, biologically.
Now we remapped everything to GRCm38.p3
using the Ensembl release e78 GTF and the latest versions of those programs (which have not changed much).
The results are radically different, with some important changes gone (changes which we had validated by qPCR!).
Example: a gene with log2FC=-0.63, padj=0.0688 in the first mapping has log2FC=-0.21, padj=0.22 in the second. However, the counts for each sample and the average (baseMean) are very similar, so I guess it's not the mapping that makes the difference.
Notably, the number of annotated genes in e78 is much higher than in e75, so I expect this to influence the adjusted p-value - but how can it affect the log2FC!? Could it be that the number of genes tested affects the gene models / dispersions that DESeq2 estimates?
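To make the adjusted p-value part concrete, here is a minimal sketch in plain Python of the Benjamini-Hochberg adjustment. It is not DESeq2's own code, and the p-values are made up for illustration (the names `bh_adjust`, `small` and `large` are hypothetical). It only shows that adding extra genes with unremarkable p-values raises padj for a gene whose raw p-value is unchanged.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for a small annotation (5 genes).
small = [0.001, 0.004, 0.02, 0.3, 0.6]
# The same genes plus 5 additional annotated genes (also hypothetical).
large = small + [0.5, 0.7, 0.8, 0.9, 0.95]

padj_small = bh_adjust(small)
padj_large = bh_adjust(large)

# The third gene keeps its raw p-value of 0.02 in both runs.
print(padj_small[2], padj_large[2])
```

In this toy case the third gene moves from below 0.05 to above it purely because more tests are corrected for. This says nothing about why the log2FC changes.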
I would be grateful for any comments.
Thanks.
I know that working without replicates is not appropriate, but in these cases I need to provide some information about the data to the end user. Isn't there some way to obtain a DE gene list (even if the results are to be taken with care)?
I had filtered the object returned by the results() function with a minimum log2FoldChange of 1.5 and a maximum adjusted p-value of 0.01. That was definitely too strict: the LFC in the results is less than 1 in most cases, and the adjusted p-values are close to one...
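For reference, the kind of filter I mean looks roughly like the sketch below. It is written in plain Python rather than R so that it is self-contained; the rows, gene names and the `filter_de` helper are hypothetical and only mimic the `log2FoldChange` and `padj` columns of a results table.

```python
def filter_de(rows, min_abs_lfc=1.5, max_padj=0.01):
    """Keep rows passing both thresholds; rows with a missing padj are dropped."""
    return [
        r for r in rows
        if r["padj"] is not None
        and abs(r["log2FoldChange"]) >= min_abs_lfc
        and r["padj"] <= max_padj
    ]


# Hypothetical results: small fold changes and adjusted p-values near one.
rows = [
    {"gene": "geneA", "log2FoldChange": 0.8, "padj": 0.99},
    {"gene": "geneB", "log2FoldChange": -0.4, "padj": 1.0},
    {"gene": "geneC", "log2FoldChange": 2.1, "padj": None},
]

strict = filter_de(rows)  # thresholds from the post: |LFC| >= 1.5, padj <= 0.01
relaxed = filter_de(rows, min_abs_lfc=0.5, max_padj=1.0)

print(len(strict), [r["gene"] for r in relaxed])
```

With values like these the strict thresholds return nothing. Relaxing them returns genes, but then the list is essentially a ranking by fold change rather than a set of significant calls.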