No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • bedtools multicov and RPKM

    I am calculating RPKM to make differential gene expression between two libraries.
    First, I am extracting read counts for each gene interval using
    bedtools multicov -bams file.bam -bed gene.gff3 > gene_counts.gff &
    To compute RPKM i do (10^9 * C)/(N * L) and as "N" I use the sum of read counts i get for all the features calculated with bedtools multicov.
    First question... I don't understand why the the sum of read counts i get for all the features calculated with bedtools multicov is much higher than the value I get when i calculate the total number of mapped reads with samtools. Why is that difference?
    Second question... the two libraries I am comparing have different size so obviously I have to normalize them before comparison.. I don't know how to do normalization, first or after calculating RPKM?
    Third question.. is it better to remove rRNA and tRNA when doing differential gene expression analysis?
    Thanks everybody

  • #2
    Firstly, I would strongly encourage you to not use RPKMs for statistics. Use either unique counts or estimated counts (e.g., from RSEM or eXpress).

    The difference is likely due to either (1) reads that map to multiple features that are then being double/triple/whatever count or (2) multimapping reads with different alignments of the same read being counted separately. It's best to avoid both of these possibilities by simply not using bedtools for this.

    If you really want to use RPKM then it's better to normalize before converting to RPKM. You can do this in limma/edgeR/DESeq2 from the estimated or unique counts (I think DESeq2 will throw an error if you feed in estimated counts).

    Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.

    If you don't care about rRNA and tRNA then don't include them. Most common RNAseq library preps for what you're doing will involve either ribo depletion or polyA selection, both of which would make rRNA counts highly questionable. For standard RNAseq, I never include rRNA or tRNA.


    • #3
      but I used unique mapping reads with bedtools multicov


      • #4
        They may map uniquely to the genome, but that doesn't mean they map uniquely to features. If you want to count uniquely mapped reads, then use a tool designed for that (namely featureCounts or htseq-count).


        • #5
          Originally posted by dpryan View Post

          Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.
          Interesting point. Is there a comparison paper to back this up somewhere ? Many people are still using RPKM or FPKM directly from Cufflinks or similar.

          What are the key advantages of "unique counts" or "estimated reads"?


          • #6
            I recall there being at least one paper on the issues with "simple" RPKM/FPKM calculations, but I can't find it at the moment.

            The biggest benefit to unique counts over estimated counts is:
            1. It's much faster to get unique counts (no costly expectation maximization needed).
            2. Negative binomial-based methods already exist to make use of this.

            Of course, regarding #2, you can always just use limma (or edgeR, since apparently the authors now say it works ok with expected counts) and your results will likely be just as good. Regarding #1, there are methods that can work on a stream and those should be reasonably quick...though they're still going to be slower. The big question is what you really gain with expected counts over unique counts. The answer to this probably varies with what you really want to look at. If you have a bunch of paralogous genes that you're interest in then expected counts are what you want. If not, they may not be worth the extra time.


            • #7
              I'll have to look into this in more detail (full papers), but initial results follow:

              One paper here claims "similar results" for RPKM and Rsem.

              author = {Guo, Yan and Sheng, Quanhu and Li, Jiang and Ye, Fei and Samuels,
              David C. and Shyr, Yu},
              title = {Large scale comparison of gene expression levels by microarrays and
              RNAseq using TCGA data.},
              journal = {PLoS One},
              year = {2013},

              -> "The RNAseq normalization methods
              RPKM and RSEM produce similar results on the gene level and reasonably
              concordant results on the exon level."

              Another is very critical of RPKM, perhaps this is the one you meant?

              author = {Wagner, G√ľnter P. and Kin, Koryu and Lynch, Vincent J.},
              title = {Measurement of mRNA abundance using RNA-seq data: RPKM measure is
              inconsistent among samples.},
              journal = {Theory Biosci},
              year = {2012},

              Measures of RNA abundance are important for many areas of biology
              and often obtained from high-throughput RNA sequencing methods such
              as Illumina sequence data. These measures need to be normalized to
              remove technical biases inherent in the sequencing approach, most
              notably the length of the RNA species and the sequencing depth of
              a sample. These biases are corrected in the widely used reads per
              kilobase per million reads (RPKM) measure. Here, we argue that the
              intended meaning of RPKM is a measure of relative molar RNA concentration
              (rmc) and show that for each set of transcripts the average rmc is
              a constant, namely the inverse of the number of transcripts mapped.
              Further, we show that RPKM does not respect this invariance property
              and thus cannot be an accurate measure of rmc. We propose a slight
              modification of RPKM that eliminates this inconsistency and call
              it TPM for transcripts per million. TPM respects the average invariance
              and eliminates statistical biases inherent in the RPKM measure
              --> I am interested to see how TPM differs from RPKM

              RPKM is biased claim these authors

              author = {Zheng, Wei and Chung, Lisa M. and Zhao, Hongyu},
              title = {Bias detection and correction in RNA-Sequencing data.},
              journal = {BMC Bioinformatics},
              year = {2011},
              volume = {12},

              "we showed that commonly used estimates
              of gene expression levels from RNA-Seq data, such as reads per kilobase
              of gene length per million reads (RPKM), are biased in terms of gene
              length, GC content and dinucleotide frequencies. We directly examined
              the biases at the gene-level, and proposed a simple generalized-additive-model
              based approach to correct different sources of biases simultaneously.

              Again critical of RPKM:

              author = {Bullard, James H. and Purdom, Elizabeth and Hansen, Kasper D. and
              Dudoit, Sandrine},
              title = {Evaluation of statistical methods for normalization and differential
              expression in mRNA-Seq experiments.},
              journal = {BMC Bioinformatics},
              year = {2010},

              We investigate the impact
              of the read count normalization method on DE results and show that
              the standard approach of scaling by total lane counts (e.g., RPKM)
              can bias estimates of DE. We propose more general quantile-based
              normalization procedures and demonstrate an improvement in DE detection

              Generally in the past I've used edgeR but the voom/limma method in Degust provides very good usability and visualisation.


              • #8
                It was the Wagner et al. paper that I had in mind.


                Latest Articles


                • seqadmin
                  Advanced Methods for the Detection of Infectious Disease
                  by seqadmin

                  The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
                  11-27-2023, 01:15 PM
                • seqadmin
                  Strategies for Investigating the Microbiome
                  by seqadmin

                  Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...
                  11-09-2023, 07:02 AM





                Topics Statistics Last Post
                Started by seqadmin, 12-01-2023, 09:55 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 11-30-2023, 10:48 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 11-29-2023, 08:26 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 11-29-2023, 08:12 AM
                0 responses
                Last Post seqadmin