Seqanswers Leaderboard Ad

**dpryan** · 01-05-2015, 06:07 AM

Firstly, I would strongly encourage you to not use RPKMs for statistics. Use either unique counts or estimated counts (e.g., from RSEM or eXpress).

The difference is likely due to either (1) reads that map to multiple features that are then being double/triple/whatever count or (2) multimapping reads with different alignments of the same read being counted separately. It's best to avoid both of these possibilities by simply not using bedtools for this.

If you really want to use RPKM then it's better to normalize before converting to RPKM. You can do this in limma/edgeR/DESeq2 from the estimated or unique counts (I think DESeq2 will throw an error if you feed in estimated counts).

Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.

If you don't care about rRNA and tRNA then don't include them. Most common RNAseq library preps for what you're doing will involve either ribo depletion or polyA selection, both of which would make rRNA counts highly questionable. For standard RNAseq, I never include rRNA or tRNA.

**Schelarina** · 01-05-2015, 11:53 AM

but I used unique mapping reads with bedtools multicov

**dpryan** · 01-05-2015, 12:00 PM

They may map uniquely to the genome, but that doesn't mean they map uniquely to features. If you want to count uniquely mapped reads, then use a tool designed for that (namely featureCounts or htseq-count).

**colindaven** · 01-27-2015, 07:33 AM

Originally posted by dpryan View Post

Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.

Interesting point. Is there a comparison paper to back this up somewhere ? Many people are still using RPKM or FPKM directly from Cufflinks or similar.

What are the key advantages of "unique counts" or "estimated reads"?

**dpryan** · 01-27-2015, 07:48 AM

I recall there being at least one paper on the issues with "simple" RPKM/FPKM calculations, but I can't find it at the moment.

The biggest benefit to unique counts over estimated counts is:

It's much faster to get unique counts (no costly expectation maximization needed).
Negative binomial-based methods already exist to make use of this.

Of course, regarding #2, you can always just use limma (or edgeR, since apparently the authors now say it works ok with expected counts) and your results will likely be just as good. Regarding #1, there are methods that can work on a stream and those should be reasonably quick...though they're still going to be slower. The big question is what you really gain with expected counts over unique counts. The answer to this probably varies with what you really want to look at. If you have a bunch of paralogous genes that you're interest in then expected counts are what you want. If not, they may not be worth the extra time.

**colindaven** · 01-27-2015, 09:00 AM

I'll have to look into this in more detail (full papers), but initial results follow:

One paper here claims "similar results" for RPKM and Rsem.

author = {Guo, Yan and Sheng, Quanhu and Li, Jiang and Ye, Fei and Samuels,
David C. and Shyr, Yu},
title = {Large scale comparison of gene expression levels by microarrays and
RNAseq using TCGA data.},
journal = {PLoS One},
year = {2013},

-> "The RNAseq normalization methods
RPKM and RSEM produce similar results on the gene level and reasonably
concordant results on the exon level."

Another is very critical of RPKM, perhaps this is the one you meant?

author = {Wagner, Günter P. and Kin, Koryu and Lynch, Vincent J.},
title = {Measurement of mRNA abundance using RNA-seq data: RPKM measure is
inconsistent among samples.},
journal = {Theory Biosci},
year = {2012},

"
Measures of RNA abundance are important for many areas of biology
and often obtained from high-throughput RNA sequencing methods such
as Illumina sequence data. These measures need to be normalized to
remove technical biases inherent in the sequencing approach, most
notably the length of the RNA species and the sequencing depth of
a sample. These biases are corrected in the widely used reads per
kilobase per million reads (RPKM) measure. Here, we argue that the
intended meaning of RPKM is a measure of relative molar RNA concentration
(rmc) and show that for each set of transcripts the average rmc is
a constant, namely the inverse of the number of transcripts mapped.
Further, we show that RPKM does not respect this invariance property
and thus cannot be an accurate measure of rmc. We propose a slight
modification of RPKM that eliminates this inconsistency and call
it TPM for transcripts per million. TPM respects the average invariance
and eliminates statistical biases inherent in the RPKM measure
"
--> I am interested to see how TPM differs from RPKM

RPKM is biased claim these authors

author = {Zheng, Wei and Chung, Lisa M. and Zhao, Hongyu},
title = {Bias detection and correction in RNA-Sequencing data.},
journal = {BMC Bioinformatics},
year = {2011},
volume = {12},

"we showed that commonly used estimates
of gene expression levels from RNA-Seq data, such as reads per kilobase
of gene length per million reads (RPKM), are biased in terms of gene
length, GC content and dinucleotide frequencies. We directly examined
the biases at the gene-level, and proposed a simple generalized-additive-model
based approach to correct different sources of biases simultaneously.
"

Again critical of RPKM:

author = {Bullard, James H. and Purdom, Elizabeth and Hansen, Kasper D. and
Dudoit, Sandrine},
title = {Evaluation of statistical methods for normalization and differential
expression in mRNA-Seq experiments.},
journal = {BMC Bioinformatics},
year = {2010},

"
We investigate the impact
of the read count normalization method on DE results and show that
the standard approach of scaling by total lane counts (e.g., RPKM)
can bias estimates of DE. We propose more general quantile-based
normalization procedures and demonstrate an improvement in DE detection
"

Generally in the past I've used edgeR but the voom/limma method in Degust provides very good usability and visualisation.

**dpryan** · 01-27-2015, 09:39 AM

It was the Wagner et al. paper that I had in mind.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 37 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

bedtools multicov and RPKM

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News