Hello-
First off, I've been using cufflinks/cuffdiff since it was first released, so I'm familiar enough with the software. I only recently discovered this issue because we've had some RNA-Seq runs on an Illumina HiSeq 2000 where we barcoded samples and ran 6 samples per lane, for a total of 48 samples sequenced at a time. We've seen a large amount of variability in the number of reads per sample with this method.
I'm sharing a less dramatic example here, but one that's good enough to illustrate a strange issue.
I've got a "wt" versus "mutant" experiment with 2 biological replicates per condition. The aligned read counts are roughly 40M for three of the samples and 18M for the fourth, so the "wt" condition has 40M + 18M and the "mutant" has 40M + 40M.
I've run cuffdiff including both replicates per condition and created a scatter plot of the FPKM values. I've also run cuffdiff excluding the sample with fewer reads (a single "wt" sample versus two "mutant" samples) and created another scatter plot. In both cases I tried upper-quartile normalization as well as the default. The attached image shows both plots (1 vs 2 on the left, 2 vs 2 on the right). The diagonal blue line is the 1:1 line. If the normalization step is done correctly, the main body of the data should be centered on this line, since we expect most genes to have a fold change near 1. Points in red are genes cuffdiff identified as significantly differentially expressed.
You can see that in the 2 vs 2 plot on the right, the main body is NOT centered on the blue line; in fact it's pulled in the direction of the "wt" condition, and there's a large body of red points within the main body of the scatter. The plot on the left shows proper normalization, and that large body of points is correctly shown as not significant. On the log scale, the points furthest to the upper right have FPKM values thousands higher in the "wt" condition in the right-hand plot than in the left-hand plot. By including the low-read-count sample, cuffdiff artificially inflated all of the gene expression values for those samples.
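For anyone who wants to check their own cuffdiff runs for this shift without eyeballing a plot: if normalization centered the main body on the 1:1 line, the median log2 FPKM ratio across expressed genes should sit near 0. A minimal Python sketch (the `value_1`/`value_2` column names follow cuffdiff's `gene_exp.diff` output; the `min_fpkm` cutoff is an arbitrary noise filter, not anything cuffdiff uses):

```python
import csv
import math
from statistics import median

def median_log2_shift(pairs, min_fpkm=1.0):
    """Median log2(FPKM_2 / FPKM_1) over genes expressed in both
    conditions.  Near 0 means the main body of the scatter is
    centered on the 1:1 line; a large value means the kind of
    systematic offset described above."""
    return median(math.log2(b / a) for a, b in pairs
                  if a >= min_fpkm and b >= min_fpkm)

def load_fpkm_pairs(gene_exp_diff):
    """Pull the per-gene FPKM columns from a cuffdiff gene_exp.diff
    file (tab-delimited, with value_1/value_2 headers)."""
    with open(gene_exp_diff) as fh:
        return [(float(row["value_1"]), float(row["value_2"]))
                for row in csv.DictReader(fh, delimiter="\t")]
```

For example, `median_log2_shift(load_fpkm_pairs("gene_exp.diff"))` returning something like 0.5 would mean the typical gene looks 1.4-fold higher in condition 2 purely from the normalization offset.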
So why does cuffdiff fail to normalize these samples properly? We're talking about a simple division step, and I can see that it correctly identifies the different normalization factors at the start of its run. I haven't read anywhere that it's dangerous to mix samples with different read counts. If I take the same data and run it through a DESeq-based pipeline, I get results that look very much like the plot on the left, even when I include the low-read-count sample.
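For comparison, the normalization DESeq applies is the median-of-ratios estimator from Anders & Huber (2010): each sample's size factor is the median, across genes, of that sample's count divided by the gene's geometric mean across all samples. A minimal numpy sketch of that idea (not DESeq's actual code):

```python
import numpy as np

def deseq_size_factors(counts):
    """Median-of-ratios size factors in the style of DESeq.

    counts: genes x samples array of raw read counts.
    Returns one size factor per sample; dividing each sample's
    counts by its factor puts the samples on a common scale.
    """
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Reference "pseudo-sample": per-gene geometric mean.  Genes with
    # a zero count in any sample get -inf here and are excluded.
    log_geomeans = log_counts.mean(axis=1)
    usable = np.isfinite(log_geomeans)
    # Per-sample median log-ratio to the reference, back on the
    # linear scale.
    return np.exp(np.median(
        log_counts[usable] - log_geomeans[usable, None], axis=0))
```

The point is that this estimator keys off the typical gene rather than total read depth, which is presumably why the DESeq pipeline keeps the scatter centered even with the 18M-read sample included.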
I've seen a much more extreme example where the entire body of points in the scatter was pulled so far to one side of the blue line that cuffdiff called over 13,000 genes significant. Again, running that data through the DESeq pipeline gave a more sensible output, with the normalization correctly centering the scatter on the blue line.
Has anyone else seen this issue? This seems very misleading.