We ran three biological replicates each for treatment and control and used RNA-seq to find out which transcripts are differentially expressed. To make sure the changes we see are genuine, we ran a second batch of experiments several months later. We now have:
Batch1: 3 treatments vs. 3 controls
Batch2: 3 treatments vs. 3 controls
The two batches were done under the same conditions (hopefully). However, there is a large difference in total read count: the first batch has ~10 million reads per replicate, while the second batch has ~30 million reads per replicate. This is because Illumina has improved its chemistry and software in the meantime.
I applied several tools (including DESeq, edgeR and limma) to identify differentially expressed genes in each batch separately. The 1st batch yields ~500 genes and the 2nd batch yields ~200 genes. To our disappointment, the two lists overlap very little.
We suspected that one set of treatments or controls had gone wrong, so we decided to swap the treatments and controls between the two batches (that is, compare the treatments of one batch against the controls of the other) to identify the bad set.
To our surprise, both comparisons yield 10-fold more genes after the swap! Each comparison now gives ~5000 differentially expressed genes, and the two lists overlap by 70%! This cannot be biologically real, and I suspect it is related to the unbalanced library sizes (~10 million vs. ~30 million reads) of treatment vs. control.
To my knowledge, both DESeq and edgeR normalize for library size internally before performing their statistical tests. The question is: how well does that normalization work when the libraries differ this much? Any input or suggestions?
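To make the question concrete, here is a minimal sketch of the median-of-ratios idea that, as far as I understand, is behind DESeq's size factors (edgeR uses a different scheme, TMM, which is not shown here). It is plain Python rather than the actual R implementation, the function name `size_factors` is my own, and the toy counts are made up purely for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq's code).

    counts: list of genes, each a list of raw counts, one per sample.
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with a nonzero count in every sample,
    # so that the log geometric mean is finite.
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    if not log_rows:
        raise ValueError("no gene has nonzero counts in all samples")
    # Log of the per-gene geometric mean across samples (the "reference").
    log_ref = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        # Median log ratio of this sample to the reference.
        log_ratios = [row[j] - ref for row, ref in zip(log_rows, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Made-up toy data: 4 genes x 2 samples, sample 2 sequenced 3x deeper.
toy = [[10, 30], [20, 60], [5, 15], [100, 300]]
factors = size_factors(toy)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy]
```

In this toy case the second factor comes out three times the first, and the normalized counts match across the two samples. A pure depth difference of this kind is what such a scaling can absorb, provided most genes are unchanged; it does not by itself remove other differences between batches processed months apart.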