Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • bedtools multicov and RPKM

    I am calculating RPKM to make differential gene expression between two libraries.
    First, I am extracting read counts for each gene interval using
    bedtools multicov -bams file.bam -bed gene.gff3 > gene_counts.gff &
    To compute RPKM i do (10^9 * C)/(N * L) and as "N" I use the sum of read counts i get for all the features calculated with bedtools multicov.
    First question... I don't understand why the the sum of read counts i get for all the features calculated with bedtools multicov is much higher than the value I get when i calculate the total number of mapped reads with samtools. Why is that difference?
    Second question... the two libraries I am comparing have different size so obviously I have to normalize them before comparison.. I don't know how to do normalization, first or after calculating RPKM?
    Third question.. is it better to remove rRNA and tRNA when doing differential gene expression analysis?
    Thanks everybody

  • #2
    Firstly, I would strongly encourage you to not use RPKMs for statistics. Use either unique counts or estimated counts (e.g., from RSEM or eXpress).

    The difference is likely due to either (1) reads that map to multiple features that are then being double/triple/whatever count or (2) multimapping reads with different alignments of the same read being counted separately. It's best to avoid both of these possibilities by simply not using bedtools for this.

    If you really want to use RPKM then it's better to normalize before converting to RPKM. You can do this in limma/edgeR/DESeq2 from the estimated or unique counts (I think DESeq2 will throw an error if you feed in estimated counts).

    Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.

    If you don't care about rRNA and tRNA then don't include them. Most common RNAseq library preps for what you're doing will involve either ribo depletion or polyA selection, both of which would make rRNA counts highly questionable. For standard RNAseq, I never include rRNA or tRNA.


    • #3
      but I used unique mapping reads with bedtools multicov


      • #4
        They may map uniquely to the genome, but that doesn't mean they map uniquely to features. If you want to count uniquely mapped reads, then use a tool designed for that (namely featureCounts or htseq-count).


        • #5
          Originally posted by dpryan View Post

          Originally it was argued that the library-size normalization used in calculating RPKM was sufficient normalization, but that's been shown to be wrong in most cases.
          Interesting point. Is there a comparison paper to back this up somewhere ? Many people are still using RPKM or FPKM directly from Cufflinks or similar.

          What are the key advantages of "unique counts" or "estimated reads"?


          • #6
            I recall there being at least one paper on the issues with "simple" RPKM/FPKM calculations, but I can't find it at the moment.

            The biggest benefit to unique counts over estimated counts is:
            1. It's much faster to get unique counts (no costly expectation maximization needed).
            2. Negative binomial-based methods already exist to make use of this.

            Of course, regarding #2, you can always just use limma (or edgeR, since apparently the authors now say it works ok with expected counts) and your results will likely be just as good. Regarding #1, there are methods that can work on a stream and those should be reasonably quick...though they're still going to be slower. The big question is what you really gain with expected counts over unique counts. The answer to this probably varies with what you really want to look at. If you have a bunch of paralogous genes that you're interest in then expected counts are what you want. If not, they may not be worth the extra time.


            • #7
              I'll have to look into this in more detail (full papers), but initial results follow:

              One paper here claims "similar results" for RPKM and Rsem.

              author = {Guo, Yan and Sheng, Quanhu and Li, Jiang and Ye, Fei and Samuels,
              David C. and Shyr, Yu},
              title = {Large scale comparison of gene expression levels by microarrays and
              RNAseq using TCGA data.},
              journal = {PLoS One},
              year = {2013},

              -> "The RNAseq normalization methods
              RPKM and RSEM produce similar results on the gene level and reasonably
              concordant results on the exon level."

              Another is very critical of RPKM, perhaps this is the one you meant?

              author = {Wagner, Günter P. and Kin, Koryu and Lynch, Vincent J.},
              title = {Measurement of mRNA abundance using RNA-seq data: RPKM measure is
              inconsistent among samples.},
              journal = {Theory Biosci},
              year = {2012},

              Measures of RNA abundance are important for many areas of biology
              and often obtained from high-throughput RNA sequencing methods such
              as Illumina sequence data. These measures need to be normalized to
              remove technical biases inherent in the sequencing approach, most
              notably the length of the RNA species and the sequencing depth of
              a sample. These biases are corrected in the widely used reads per
              kilobase per million reads (RPKM) measure. Here, we argue that the
              intended meaning of RPKM is a measure of relative molar RNA concentration
              (rmc) and show that for each set of transcripts the average rmc is
              a constant, namely the inverse of the number of transcripts mapped.
              Further, we show that RPKM does not respect this invariance property
              and thus cannot be an accurate measure of rmc. We propose a slight
              modification of RPKM that eliminates this inconsistency and call
              it TPM for transcripts per million. TPM respects the average invariance
              and eliminates statistical biases inherent in the RPKM measure
              --> I am interested to see how TPM differs from RPKM

              RPKM is biased claim these authors

              author = {Zheng, Wei and Chung, Lisa M. and Zhao, Hongyu},
              title = {Bias detection and correction in RNA-Sequencing data.},
              journal = {BMC Bioinformatics},
              year = {2011},
              volume = {12},

              "we showed that commonly used estimates
              of gene expression levels from RNA-Seq data, such as reads per kilobase
              of gene length per million reads (RPKM), are biased in terms of gene
              length, GC content and dinucleotide frequencies. We directly examined
              the biases at the gene-level, and proposed a simple generalized-additive-model
              based approach to correct different sources of biases simultaneously.

              Again critical of RPKM:

              author = {Bullard, James H. and Purdom, Elizabeth and Hansen, Kasper D. and
              Dudoit, Sandrine},
              title = {Evaluation of statistical methods for normalization and differential
              expression in mRNA-Seq experiments.},
              journal = {BMC Bioinformatics},
              year = {2010},

              We investigate the impact
              of the read count normalization method on DE results and show that
              the standard approach of scaling by total lane counts (e.g., RPKM)
              can bias estimates of DE. We propose more general quantile-based
              normalization procedures and demonstrate an improvement in DE detection

              Generally in the past I've used edgeR but the voom/limma method in Degust provides very good usability and visualisation.


              • #8
                It was the Wagner et al. paper that I had in mind.


                Latest Articles


                • seqadmin
                  Exploring the Dynamics of the Tumor Microenvironment
                  by seqadmin

                  The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                  07-08-2024, 03:19 PM
                • seqadmin
                  Exploring Human Diversity Through Large-Scale Omics
                  by seqadmin

                  In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                  06-25-2024, 06:43 AM





                Topics Statistics Last Post
                Started by seqadmin, 07-19-2024, 07:20 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 07-16-2024, 05:49 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 07-15-2024, 06:53 AM
                0 responses
                Last Post seqadmin  
                Started by seqadmin, 07-10-2024, 07:30 AM
                0 responses
                Last Post seqadmin