I am looking for a way to quantify and statistically evaluate spliceforms across a set of RNA-Seq experiments.
My current understanding is that the input to DESeq/edgeR/baySeq should simply be counts of reads mapped to a gene locus. Since Cufflinks assigns spliceform abundances one library at a time, any systematic errors inherent in a given sample are confounded with the spliceform quantification for that sample. As I understand it, these systematic errors really should be corrected for with a generalized linear model that takes all the samples of interest as input. (Cufflinks aficionados, please correct me if I am wrong!)
Also, as I understand it, the developers of DESeq/edgeR/baySeq have concluded (along with many others) that RPKM/FPKM is not a sufficient correction for comparing different genes to each other within the same library. There appear to be additional biases (beyond length) that affect the transformation from mRNA to RNA-Seq reads. Therefore, it has been suggested that, for now, it is only reasonable to compare abundances of the same gene between different samples.
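To make concrete what RPKM does and does not correct for, here is a minimal sketch of the standard definition (reads per kilobase of transcript per million mapped reads). The function name, read counts, lengths, and library size are made-up values for illustration, not output from any of the tools above.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical library of 10 million mapped reads.
library_size = 10_000_000

# Two hypothetical transcripts: one 1 kb long, one 4 kb long.
short_rpkm = rpkm(read_count=250, length_bp=1000, total_mapped_reads=library_size)
long_rpkm = rpkm(read_count=1000, length_bp=4000, total_mapped_reads=library_size)

# Both values come out equal even though the raw counts differ fourfold.
print(short_rpkm, long_rpkm)
```

RPKM rescales only by length and sequencing depth, so any other bias in going from mRNA to reads passes through unchanged. The two transcripts above also rest on very different raw counts despite having the same RPKM.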
I find this solution somewhat unsatisfying, though. If I have two spliceforms which, by definition, originate from the same genetic locus, but have different lengths and varying expression levels in the two (or more) conditions I am surveying, then the noise associated with those two expression levels is different. Moreover, given the current model for the mean-variance relationship (negative binomial), noise, unlike expression level, is not additive in a simple linear way. So I would not expect the noise from two genes with the same average expression level, one containing many differently regulated spliceforms and the other containing a single spliceform, to follow the same distribution. Ideally, I would want a generalized linear model that can simultaneously correct for systematic (non-biological) errors in the sample collection process and estimate spliceform abundances.
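A small numeric sketch of that point, using the usual negative binomial parameterization in which the variance is mean + dispersion * mean^2. The dispersion value, the means, and the assumption that the two spliceform counts are independent are all mine, chosen only for illustration.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value, shared by every transcript here

# Gene A: a single spliceform with a mean of 100 reads.
single_isoform_var = nb_variance(100.0, dispersion)

# Gene B: two spliceforms with a mean of 50 reads each (same locus-level mean).
# Assuming the two counts are independent, their variances add.
two_isoform_var = nb_variance(50.0, dispersion) + nb_variance(50.0, dispersion)

print(single_isoform_var, two_isoform_var)
```

With these numbers the single-spliceform gene has a variance of about 1100 and the two-spliceform gene about 600, despite the same locus-level mean of 100. If the spliceforms are correlated or have different dispersions, the gap changes, but the locus-level variance still depends on the spliceform composition.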
Is there a good reason such a model is unnecessary? Is there a good reason to be content with locus-level abundances?
Thanks for your input!
~Rachel