I'm analyzing RNA HiSeq data, and we are trying to compute the enrichment of reads in the IP over reads in our control for a given window. This window could be an entire gene, or a very small 25 bp segment within an exon. My collaborators and I have been discussing exactly how to compute enrichment and whether that calculation should include RPKM. I've now thoroughly confused myself, and I was wondering if anyone had insight into better ways of computing this.
My initial method of computing enrichment was the ratio of reads in the IP to reads in the control, each normalized by the total number of mapped reads in its sample:
Enrichment = (#IPw / Σ IP) / (#CNTLw / Σ CNTL),
where #IPw and #CNTLw are the numbers of reads that mapped to the given window w in the IP and the control, and Σ IP and Σ CNTL are the total numbers of reads that mapped to the genome in each sample (as a normalization factor).
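To make the formula above concrete, here is a minimal Python sketch of it. The function name, argument names, and the counts are made up for illustration; the zero-count check is my own assumption about how to handle an empty control window, not part of the original method.

```python
def enrichment(ip_w, ip_total, cntl_w, cntl_total):
    """Ratio of IP to control reads in one window, each scaled by library size."""
    if cntl_w == 0:
        # Assumption: treat an empty control window as undefined rather than infinite.
        raise ValueError("control window has no reads; the ratio is undefined")
    ip_fraction = ip_w / ip_total        # #IPw / sum(IP)
    cntl_fraction = cntl_w / cntl_total  # #CNTLw / sum(CNTL)
    return ip_fraction / cntl_fraction


# Hypothetical counts: 50 IP reads out of 1M mapped, 10 control reads out of 2M mapped.
example = enrichment(ip_w=50, ip_total=1_000_000, cntl_w=10, cntl_total=2_000_000)
print(example)
```

Note that no length term appears: the window is the same in both samples, so its length would cancel in the ratio anyway.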
However, our collaborators insisted that we incorporate RPKM as a normalization factor (that is, divide by it) to account for differing gene lengths, so our final equation became:
Enrichment = [(#IPw / Σ IP) / (#CNTLw / Σ CNTL)] / [10^9 * #CNTLg / (Σ CNTL * length)],
where #CNTLg is the number of control reads that map to the gene's exons (so excluding introns) and length is the length of the mature transcript (CDS + UTRs, no introns). The second bracket is the RPKM of the gene in the control.
However, our results are very strange, since low RPKM values (< 1) result in a very high enrichment score, and this doesn't make sense for computing enrichment. Furthermore, through answers on this forum, it sounds like RPKM is used more for differential expression between two samples, e.g., two biological replicates, and not necessarily to be used for computing the enrichment of our IP over the control. We're not trying to find DE genes here, but trying to determine an enrichment of our IP over our control for any given window.
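The behaviour with low RPKM values can be reproduced with a small sketch of the RPKM-divided score. All names and counts below are hypothetical; the two "genes" are given the same window ratio and the same transcript length and differ only in how many control reads map to their exons.

```python
def rpkm(reads, total_mapped, length_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * reads / (total_mapped * length_bp)


def window_ratio(ip_w, ip_total, cntl_w, cntl_total):
    """Library-size-normalized IP/control ratio for one window."""
    return (ip_w / ip_total) / (cntl_w / cntl_total)


def enrichment_over_rpkm(ip_w, ip_total, cntl_w, cntl_total, cntl_g, length_bp):
    """The window ratio divided by the control RPKM of the containing gene."""
    ratio = window_ratio(ip_w, ip_total, cntl_w, cntl_total)
    return ratio / rpkm(cntl_g, cntl_total, length_bp)


# Hypothetical libraries of 20M mapped reads each, 2000 bp transcripts.
total = 20_000_000
low_rpkm = rpkm(20, total, 2000)       # weakly expressed gene
high_rpkm = rpkm(2000, total, 2000)    # strongly expressed gene

# Same window counts (40 IP vs 10 control reads) for both genes.
score_low = enrichment_over_rpkm(40, total, 10, total, cntl_g=20, length_bp=2000)
score_high = enrichment_over_rpkm(40, total, 10, total, cntl_g=2000, length_bp=2000)
print(low_rpkm, high_rpkm, score_low, score_high)
```

With identical window ratios, the score differs between the two genes by exactly the ratio of their RPKMs, so any gene with RPKM below 1 has its score inflated relative to the plain ratio.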
Discussing this with my PI, we thought perhaps excluding RPKM and normalizing solely by the transcript length might be better. One odd result of dividing the enrichment by RPKM is that, because length sits in the denominator of RPKM, you're essentially multiplying by the transcript length, which is the opposite of what I'd think we're trying to achieve.
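The point about multiplying by transcript length can be checked by rearranging the RPKM-divided score so that length appears in the numerator. This is only a sketch with hypothetical names and numbers; `ratio` stands for the plain library-size-normalized window ratio.

```python
def score_direct(ratio, cntl_g, cntl_total, length_bp):
    """Window ratio divided by the control RPKM, as in the second equation."""
    control_rpkm = 1e9 * cntl_g / (cntl_total * length_bp)
    return ratio / control_rpkm


def score_rearranged(ratio, cntl_g, cntl_total, length_bp):
    """Same quantity with the division carried through: length moves to the numerator."""
    return ratio * cntl_total * length_bp / (1e9 * cntl_g)


# Hypothetical values: window ratio of 4, 200 exonic control reads, 20M mapped reads.
args = dict(ratio=4.0, cntl_g=200, cntl_total=20_000_000)
short = score_direct(length_bp=2000, **args)
long = score_direct(length_bp=4000, **args)
same = score_rearranged(length_bp=2000, **args)
print(short, long, same)
```

Holding the read counts fixed, doubling the transcript length doubles the score, which is the scaling described above.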
Another possibility I thought of is to compute the RPKM for the control, compute the RPKM in the same way for the IP, and take the ratio of the two. This at least seems consistent with what RPKM was designed for, if I'm understanding RPKM correctly, but I'm still not sure whether that makes any more sense or is better than the other approaches.
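For what it's worth, here is a sketch of that RPKM-ratio idea, under the assumption that both RPKMs are computed over the same window (and therefore use the same length). Names and counts are hypothetical.

```python
def rpkm(reads, total_mapped, length_bp):
    """Reads per kilobase per million mapped reads for one region."""
    return 1e9 * reads / (total_mapped * length_bp)


def rpkm_ratio(ip_w, ip_total, cntl_w, cntl_total, window_bp):
    """RPKM of the window in the IP divided by RPKM of the same window in the control."""
    return rpkm(ip_w, ip_total, window_bp) / rpkm(cntl_w, cntl_total, window_bp)


def simple_enrichment(ip_w, ip_total, cntl_w, cntl_total):
    """The initial formula: library-size-normalized IP/control ratio."""
    return (ip_w / ip_total) / (cntl_w / cntl_total)


# Hypothetical 25 bp window: 50 IP reads of 1M mapped, 10 control reads of 2M mapped.
by_rpkm = rpkm_ratio(50, 1_000_000, 10, 2_000_000, window_bp=25)
by_simple = simple_enrichment(50, 1_000_000, 10, 2_000_000)
print(by_rpkm, by_simple)
```

Under that assumption the 10^9 factor and the window length appear in both the numerator and the denominator and cancel, so the RPKM ratio comes out the same as the initial enrichment formula.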
Thank you very much and I greatly appreciate your help if anyone has any ideas!