I have a bacterial timecourse with 2 biological replicates per timepoint. There is a fair bit of variance between my replicates.
I have spent the last few days working with DESeq, doing pairwise comparisons between 0hr and each of the later timepoints, since DESeq does not appear to have an ANOVA-like option.
These are my questions. First: the tRNAs in my samples are (of course) highly expressed and also differ a lot between the replicates. I have them in my samples for unrelated reasons, but since many people simply eliminate these during sample prep, is there a reason why I shouldn't just build my count table without these genes? I have read that with many analysis methods, the most highly expressed, highest-variance genes can skew the statistics.
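To make the first question concrete, here is a minimal sketch of what I mean by excluding the tRNA genes before any analysis. It is written in Python rather than R purely for illustration. The gene ID `tRNA-Ala`, its counts, and the rule that tRNA IDs start with "tRNA" are all made up for the example; the helper names are hypothetical too.

```python
def looks_like_trna(gene_id):
    # Assumption: tRNA genes can be recognised by an ID starting with "tRNA".
    # A real annotation would more likely need a lookup by feature type.
    return gene_id.lower().startswith("trna")


def drop_genes(count_table, is_excluded):
    """Return a new count table without the genes flagged by is_excluded."""
    return {gene: row for gene, row in count_table.items() if not is_excluded(gene)}


def column_totals(count_table):
    """Total counts per sample (column), i.e. the raw library sizes."""
    return [sum(column) for column in zip(*count_table.values())]


# Columns: Hr0_1, Hr0_2, Hr2_1, Hr2_2
example_counts = {
    "A": [32, 28, 278, 244],
    "B": [582, 1550, 13499, 14490],
    "tRNA-Ala": [90000, 40000, 120000, 60000],  # made-up counts, illustration only
}

filtered = drop_genes(example_counts, looks_like_trna)
print("library sizes with tRNA:   ", column_totals(example_counts))
print("library sizes without tRNA:", column_totals(filtered))
```

The point of printing the column totals is that the per-sample totals change a lot once the tRNA rows are gone, which is why I am asking whether the filtering should happen before the count table goes into DESeq.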
Next, if I carry out the dispersion estimation step on a data table containing all timepoints (10 columns), for the first timepoint I get 343 genes that go up with an adjusted p value <0.05. If I carry out the dispersion estimation step on a data table containing only the early timepoints, I get 1004 genes that go up with an adjusted p value <0.05. Is it in some way dishonest of me to consider all of those genes for the early timepoint and not just the more stringent set?
To give you some concrete example genes for the comparison between 0 and 2 hours:
Genes A and B go up significantly (Padj<0.05) in both tests.
Gene C goes up significantly (Padj<0.05) if dispersion is estimated with the limited data set but not with the full data set.
Gene D never shows statistical significance.
gene  Hr0_1  Hr0_2  Hr2_1  Hr2_2  Hr6_1  Hr6_2  Hr12_1  Hr12_2  Hr24_1  Hr24_2
A        32     28    278    244    188     95     240     290     592     264
B       582   1550  13499  14490  11176   6161   20458   27906   41519   22960
C        36     41    282    310    361    309    1166   15600   42665     917
D       107    111     79     81     40     17      40     675    1542      54
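To show why I suspect the late timepoints are driving the difference, here is a rough Python sketch that compares replicate-to-replicate variability for the four genes above, using either the early timepoints only (0 and 2 hr) or all five. It is only a crude stand-in: it uses raw counts without size-factor normalization and a simple squared coefficient of variation averaged over timepoints, which is an assumption of mine and not what DESeq actually computes when it estimates dispersion.

```python
from statistics import mean

# Raw counts from the table above, in column order:
# Hr0_1, Hr0_2, Hr2_1, Hr2_2, Hr6_1, Hr6_2, Hr12_1, Hr12_2, Hr24_1, Hr24_2
counts = {
    "A": [32, 28, 278, 244, 188, 95, 240, 290, 592, 264],
    "B": [582, 1550, 13499, 14490, 11176, 6161, 20458, 27906, 41519, 22960],
    "C": [36, 41, 282, 310, 361, 309, 1166, 15600, 42665, 917],
    "D": [107, 111, 79, 81, 40, 17, 40, 675, 1542, 54],
}


def replicate_cv2(a, b):
    """Squared coefficient of variation for one pair of replicates."""
    m = (a + b) / 2
    variance = (a - b) ** 2 / 2  # unbiased sample variance when n = 2
    return variance / m ** 2


def mean_cv2(row, n_timepoints):
    """Average the per-timepoint CV^2 over the first n_timepoints timepoints."""
    pairs = [(row[2 * i], row[2 * i + 1]) for i in range(n_timepoints)]
    return mean(replicate_cv2(a, b) for a, b in pairs)


for gene, row in counts.items():
    early = mean_cv2(row, 2)  # 0 hr and 2 hr only
    full = mean_cv2(row, 5)   # all five timepoints
    print(f"{gene}: early CV^2 = {early:.4f}, all timepoints CV^2 = {full:.4f}")
```

For gene C the replicates agree closely at 0 and 2 hr but disagree wildly at 12 and 24 hr, so any variability estimate that pools all timepoints is much larger than one based on the early timepoints alone. That seems consistent with C being significant only when the late timepoints are left out.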