
DESeq analysis with ERCC RNA spike ins


  • DESeq analysis with ERCC RNA spike ins

    Hi, I want to perform differential expression analyses of multiple RNA-seq samples using DESeq. We have included ERCC spike-in controls and wish to use these to normalise the count data.

    I have seen a few posts suggesting that I run estimateSizeFactors() on a DESeqDataSet consisting of only the ERCC RNAs and then apply those size factors to the DESeqDataSet containing my experimental data.

    We have used the same total amount of RNA and spike-in volume for each sample, so no corrections are applied first. However, we have used Mix 1 in our treatment samples and Mix 2 in our control. Would it make more sense, then, to use only subgroup B of the ERCC spike-ins to estimate size factors, since these are at the same concentration in both mixes?

    Is there perhaps a more accurate way to go about this? I have read the "Synthetic spike-in standards for RNA-seq experiments" paper, which suggests plotting expected FPKM fold change against observed and fitting a curve. However, I would prefer to use DESeq and count-based differential expression so I can compare with previous analyses performed without spike-ins.
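What I have in mind is roughly the following (a minimal median-of-ratios sketch in Python rather than DESeq itself; the subgroup-B ERCC counts below are invented purely for illustration):

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors, the scheme estimateSizeFactors() uses:
    each sample's factor is the median, over genes, of count / geometric mean."""
    n = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n)]
    for row in counts.values():
        if min(row) == 0:  # features with a zero count are skipped, as in DESeq
            continue
        gm = math.exp(sum(math.log(c) for c in row) / n)
        for j, c in enumerate(row):
            ratios[j].append(c / gm)
    return [median(r) for r in ratios]

# hypothetical counts for three subgroup-B ERCCs (same nominal concentration
# in Mix 1 and Mix 2) across 2 treatment + 2 control libraries
ercc_b = {
    "ERCC-00009": [120, 60, 118, 62],
    "ERCC-00035": [400, 210, 390, 205],
    "ERCC-00060": [80, 38, 77, 41],
}
sf = size_factors(ercc_b)
# the experimental counts would then be divided by these per-sample factors
```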

    Thanks in advance for any help with this.

  • #2
    The DESeq size factors assume that most things will be 1:1, so the 1:1 sub pool would be a good fit.

    That said, it's data analysis so there's nothing stopping you from doing it both ways - I'd be very surprised if there was any variation between the normalization factors you generate this way.

    I did a quick spot check on some of my data and 9/10 libraries gave the exact same normalization factor using all of the ERCCs vs using only the 1:1 pool. The 10th was off by a bit but it was actually spiked differently than the others, so it's expected for that difference to be picked up.
    Last edited by jparsons; 01-07-2014, 10:25 AM.


    • #3
      Why exactly did you decide to use spike-ins? In a standard RNA-Seq experiment, I would expect a normalization based on spike-ins to give worse results than one based on the counts from the biological data, but maybe yours is not a "standard" experiment.


      • #4
        We have added RNA spike-ins to give us the ability to check for sequencing bias, look at lower limits of detection and, hopefully, aid normalisation of transcript abundance. Most of our RNA-seq experiments have spike-ins added by default.

        I would say this analysis is fairly standard. We have extracted ribo-depleted RNA from treatment and control cells (with several reps) and want to test for differential expression between the two groups. Can you explain a little more why you would not normalize to the spike-ins?


        • #5
          I would also like to understand why using spike-in controls for normalization is being discouraged here.


          • #6
            Originally posted by friducha View Post
            I would also like to understand why using spike-in controls for normalization is being discouraged here.
            Added noise, a likely different range of expression/presence, the lower number of species used for normalization, etc.

            Spike-ins are useful when you are concerned about transcriptional amplification (or otherwise heavily asymmetrically distributed fold-changes between groups). When that's not the case, using them makes little sense.
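A toy two-sample example of the transcriptional-amplification point (all counts invented, equal sequencing depth assumed):

```python
from statistics import median

# every endogenous gene is genuinely 4x higher in "treated", but both
# libraries were sequenced to the same total depth
control = {"gene%d" % i: 100 for i in range(100)}
treated = {g: c * 4 for g, c in control.items()}
depth = sum(control.values())
treated_obs = {g: c * depth / sum(treated.values()) for g, c in treated.items()}

# median-of-ratios on the genes themselves: every ratio is 1.0, so the
# genuine global 4x shift is silently normalized away
gene_sf = median(treated_obs[g] / control[g] for g in control)  # -> 1.0

# a spike-in added per cell is NOT amplified, so it is diluted 4x in the
# treated library; normalizing to it recovers the true fold change
spike_control = 100.0
spike_treated = 100 * depth / sum(treated.values())
spike_sf = spike_treated / spike_control                        # -> 0.25
fold = (treated_obs["gene0"] / spike_sf) / control["gene0"]     # -> 4.0
```

When fold changes are roughly symmetric between groups, the first estimator is the right one; the spike-ins only earn their keep in the asymmetric case.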


            • #7
              Using ERCC spike-ins only makes sense when you are sure that, before adding them, the ratio between mRNA and totalRNA is equal across your samples.
              In the MAQC/SEQC studies this ratio was disturbed, which allowed us to investigate this issue:

              You can study how ERCCs behave in your samples using the erccdashboard R package ("Assessing technical performance in different gene expression experiments with external spike-in RNA control ratio mixtures" from the link above), and then use the ERCCs for normalization with RUV ("Normalization of RNA-seq data using factor analysis of control genes or samples" from the link above). However, we think that tools like PEER or SVA are better for removing unwanted variation (see "Detecting and correcting systematic variation in large-scale RNA sequencing data" from the link above).
              Last edited by plabaj; 02-27-2015, 12:34 AM.
              Pawel Labaj


              • #8
                In my understanding, using spike-ins helps us gauge the "breadth" of our sequencing, or in other terms the low-abundance transcripts that can potentially be detected in the experiment, but I agree with what dpryan had to say about the use of spike-ins.

                I came across the following paper, in which they propose a methodology to normalize reads using "target genes", which could include housekeeping genes, ERCC spike-ins or any other gene set. They observe that using just the ERCC spike-ins wasn't sufficient to normalize RNA-Seq data, which is something I think was suspected but never really shown before.


                I found this R package very useful just to play with and use some housekeeping genes for normalization instead of the library size and other design factors.


                • #9
                  In my opinion, ERCC spike-ins are uniquely suited to determining the mRNA:totalRNA ratio between samples, and are best used when you expect that ratio NOT to be equal. The Nature Biotech paper referenced above didn't account for the mRNA:totalRNA ratio, even though they were using the MAQC/SEQC samples for half of the work, which is the main reason why they were unable to normalize the data.
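As a toy illustration of that arithmetic (read counts invented; the spikes are assumed to have been added at the same mass per unit of total RNA in both samples):

```python
# sample B has more mRNA per unit total RNA, so the spikes take up a
# smaller fraction of its reads
reads_spike_A, reads_mrna_A = 1000, 9000
reads_spike_B, reads_mrna_B = 500, 9500

# relative mRNA-per-totalRNA content, with sample A as the reference
rel_mrna = (reads_mrna_B / reads_spike_B) / (reads_mrna_A / reads_spike_A)
# -> ~2.1, i.e. sample B carries roughly twice the mRNA per unit total RNA
```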

                  Dpryan's points are worth repeating, though - there are few data points to use, particularly with Ambion's 10^20 dynamic range pool. In cases where other normalization methods don't make sense, they can be a good fallback - but to quote the guidance from the Clinical and Laboratory Standards Institute about the ERCCs, "While it is possible to scale or normalize array data by matching the mean or median of a set of external RNA controls, this approach is problematic for a number of reasons…Third, normalization using hundreds or thousands of genes within the linear range of response of the assay is mathematically more robust than using a small number of external RNA controls." (The rest of the paragraph is centered on microarray-specific issues)

                  I have a preprint that discusses the use of the ERCCs to account for the mRNA:totalRNA ratio, including in the context of the SEQC dataset.


                  • #10
                    I hadn't seen your paper on bioRxiv; that definitely looks to be worth a read!


                    • #11
                      It seems that this bioRxiv paper is a nice complement to Sarah's NatBiotech ERCC paper. Good job!

                      I agree with jparsons that based on ERCCs you can nicely characterize your samples (for example using the erccdashboard R package). For normalization, however, 'broader' approaches seem to work better. We have shown that both PEER and SVA (not yet the RNA-Seq-optimized version) work better than ERCC-based RUV.
                      Pawel Labaj


                      • #12
                        Originally posted by dpryan View Post
                        Added noise, likely different range of expression/presence, lower number of species used for normalization, etc.
                        I can understand the added noise and the differences in expression ranges, but what does the last part mean?
                        How does the number of species influence the normalization, and why is a lower number a problem?

                        thanks for the clarification.



                        • #13
                          That just gets at the robustness. Your robustness increases as the number of rows in the matrix used for normalization increases (to an extent, of course).
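A quick simulation of that point (the multiplicative noise model and its magnitude are assumed purely for illustration): the spread of a median-of-ratios estimate shrinks as the number of features it is computed from grows, which is why ~90 ERCCs are a weaker anchor than thousands of genes.

```python
import random
from statistics import median, pstdev

random.seed(0)

def sf_spread(n_features, n_trials=500):
    """Spread of a median-of-ratios estimate when the true factor is 1.0
    and each feature's ratio carries lognormal multiplicative noise."""
    estimates = []
    for _ in range(n_trials):
        ratios = [random.lognormvariate(0, 0.5) for _ in range(n_features)]
        estimates.append(median(ratios))
    return pstdev(estimates)

few = sf_spread(10)     # a small spike-in panel
many = sf_spread(1000)  # thousands of endogenous genes
# `many` comes out far smaller than `few`: the estimate is much tighter
```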


                          • #14
                            What about cross-batch comparison?

                            We have recently prepared a 300-sample library set with ERCC controls. The idea behind this is that these samples are part of a large-scale study which will happen over time. This means that we cannot be certain that all samples will be generated with the same chemistry, sequencer and prep kit. My hope is that the spike-ins will help us compare between batches produced with different technologies (i.e. Illumina 2000 vs Illumina 4000 vs Illumina 6000?). Does anybody know of a study comparing different RNA-Seq library preps / sequencers and how to normalize between them?


                            • #15
                              Sounds very interesting!

                              In terms of your question, not everything is in one paper, but have a look at the SEQC paper about removing unwanted variation as well as the ABRF consortium paper here:
                              In general, the ABRF consortium might be interested in answering these types of questions.
                              Pawel Labaj