Can an updated ENSEMBL database release have an effect on DESeq2 results?
Well, obviously it can, but the changes in the output I see are rather dramatic.
Here is my situation in brief:
Data from an RNA-seq experiment (mouse) was
mapped to GRCm38.p2 in Feb. 2014, using STAR,
counted with HTSeq (latest version),
and tested for differential expression with DESeq2 (latest version at that time),
using the Ensembl release e75 GTF file.
Results were interesting and made sense, biologically.
Now we have remapped everything to GRCm38.p3
using the Ensembl release e78 GTF and the latest versions of those programs (which have not changed much).
The results are radically different, and some important changes are gone (changes which we had validated by qPCR!).
Example: a gene with log2FC=-0.63, padj=0.0688 in the first mapping, and log2FC=-0.21, padj=0.22 in the second. However, the counts for each sample and on average (baseMean) are very similar, so I guess it's not the mapping that makes the difference.
Notably, the number of annotated genes in e78 is much higher than in e75, so I expect this to influence the adjusted p-value, but how can it affect the log2FC!? Could it be that the number of genes tested affects the gene models / dispersions that DESeq2 estimates?
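To make the adjusted p-value part of my reasoning concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment, which is the default `pAdjustMethod` in DESeq2's `results()`. It is written in plain Python rather than R, the p-values are made up for illustration (they are not from my data), and it ignores DESeq2's independent filtering. It only shows how adding more tested genes changes padj; it says nothing about the log2FC.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Gene indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes present in both annotations.
p_small = [0.001, 0.01, 0.02, 0.04]
# Same four genes plus four hypothetical newly annotated genes with no effect.
p_large = p_small + [0.5, 0.6, 0.7, 0.9]

padj_small = bh_adjust(p_small)
padj_large = bh_adjust(p_large)

for raw, a, b in zip(p_small, padj_small, padj_large):
    print(f"p={raw:<6} padj(4 genes)={a:.4f} padj(8 genes)={b:.4f}")
```

In this toy case the raw p-values of the four original genes are unchanged, yet their adjusted p-values double once the four extra genes are tested alongside them.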
I would be grateful for any comments.
Thanks.