Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

EdgeR parameters and concepts

  • Filter
  • Time
  • Show
Clear All
new posts

  • EdgeR parameters and concepts


    I am using the package EdgeR for calculating differential expression between genes. However I'm a bit of rookie in statistics and have a few questions regarding some concepts. I'd greatly appreciate your help in getting my head around these concepts. My questions are as follows:

    1) What does Common Dispersion really mean?
    Context: I have 2 replicates of my sample and a regular pairwise correlation suggests a coefficient of 0.98 between the replicates. However, when I run the estimateCommonDisp() function in EdgeR it gives me 0.035 as an estimate of common dispersion. The manual and some online posts seem to suggest that low estimate indicates high variability between replicates. Is that true or am I being mistaken?

    2) Is it always better to use moderated tagwise dispersion (as it seems more customized to each tag than a general one-size-fits-all). When is it better to use tagwise over only common dispersion? Also, when you do smoothing, what do you really do?

    3) How does the prior.n value affect the results?
    Context: My prior.n value post smoothing is 0.0002284025. The manual asks prior.n=10 as default.It also says that lower the value, the more highly variable tags will be heavily penalized and there will be less "squeezing" of tagwise dispersions towards a common value. Is that good or bad? Should I use the value obtained with my dataset or some value between 10 and 40 as suggested?

    Thanks in advance!
    Last edited by flobpf; 06-18-2010, 12:40 PM.

  • #2
    edgeR parameters and concepts

    Hi there

    First, thanks for using edgeR and I hope that I can clarify some of the concepts that you're having a little difficulty with.

    1) Common Dispersion
    When assessing differential expression, it is really important to model the variability in the data appropriately. For the count data that we obtain from RNA-seq experiments, we use the negative binomial (NB) model, because there is typically more variation in RNA-seq data than can be accounted for by the Poisson model (called overdispersion). We estimate the mean (mu) of the counts for each gene, which corresponds to the abundance of that gene in the RNA sample. In edgeR we model the mean for a gene as the library size * concentration. This model has a second parameter, called the dispersion parameter (under our parameterization). This parameter is very important, as it determines how we model the variance for each gene. The variance function for each gene is V = mu *( 1 + dispersion * mu ), where each gene has a distinct value for mu.

    Under the "common dispersion model" we use the *same* value for the dispersion when modelling the variance for each gene. Under a tagwise model we allow for a *different* value for the dispersion to be used for each gene. So in brief, the dispersion is vital for modelling the variance of each gene and the common dispersion means that we use the same value of the dispersion for each gene. (NB: this does not mean that the variance is the same for each gene! Remember that the variance is quadratic in the mean, so variance increases with gene mean).

    Now to the interpretation. First, the common dispersion is *completely different* from correlation between two replicates, so it is not surprising that they are very different. I would not recommend using correlation as a way to assess the similarity of samples, especially when there is a nice interpretation of the common dispersion. The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the *true* abundance for each gene between replicates (which we can't measure directly). In your example, you have a common dispersion estimate of 0.035, which (take square root) gives an estimate of the coefficient of variation of 0.19. We interpret this as saying that the true abundance for each gene can vary up or down by 19% between replicates.

    I hope that it is clear from that explanation that the *higher* the estimate of the common dispersion the *higher* the variability between replicates, not what you suggested in your post. I am concerned that you were led to the mistaken understanding that a low estimate indicates high variability between replicates. Can you point me towards places in the manual or forums/online posts that suggest this? I would like to clear up misunderstandings where I can, and if I need to edit the manual to make it clearer, then that would be good to know.

    2) Moderated tagwise dispersion
    The motivation for using moderated tagwise dispersions is that you may not really believe that every gene has the same common value for the dispersion, so you want to allow and estimate a distinct, individual dispersion for each gene. The problem with this is that we usually have very small samples in RNA-seq experiments (e.g. perhaps 3 samples in each of two groups). This means that we might be estimating a dispersion for each gene on 4 or 5 degrees of freedom, which leads to *very* unreliable estimates. On the other hand, we are estimating the common dispersion from many thousands of observations, so we expect this to be very reliable. The idea, then, is to moderate or "squeeze" the tagwise dispersion estimates towards the common value. So estimates larger than the common value are made smaller and estimates smaller than the common value are made larger. This makes the tagwise dispersion estimates more stable and improves inference over using "raw" tagwise dispersion estimates.

    It is not necessarily better to do tagwise over only common dispersion. Moderated tagwise dispersion will tend to rank more highly as DE genes that are more consistent in their counts within groups, whereas using common dispersion is more likely to rank genes with one extremely large count in one library as highly DE. So in that sense, which one you use will depend on your preference for what you consider to be more interesting DE genes. Why not try both approaches and compare the results?

    3) prior.n
    The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion. You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the common likelihood the weight of one observation.

    In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value, or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation (squeezing) of the tagwise dispersions. How many samples do you have in your experiment? What is the experimental design? If you have few samples (<6) then I would suggest a prior.n of at least 10. If you have more samples, then the tagwise dispersion estimates will be more reliable, so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5.

    I hope that I've been able to help your understanding of edgeR and analysing DE in RNA-seq data. Please don't hesitate to get back in touch if there are still things that are unclear or could benefit from further elaboration.

    Best regards


    • #3
      Hi Davis,
      Thanks a lot for the explanation. It was really very helpful.

      About the online links, I was referring to this reply by Dr. Smyth:
      But now that I read it again after your reply, I can understand what he was trying to say.

      Thanks again!


      • #4
        common dispersion

        After I analyzed my dataset and got no DGE genes!
        I was wondering if the problem was with the variation of the counts for the genes and between the two treatments as the common dispersion is 0.2556, and sqrt (common.dispersion) = 0.5055. Also I checked the MDSplot where no obvious separation could be observed between the treatments of the samples either.
        Then my question is, how big is the common.dispersion should be for a good experiment? Thanks!


        • #5
          EdgeR DGE

          Hi, I am new EdgeR user and I am using it to calculate the different expressed genes between 3 conditions, performing the pairwise comparison (exact test).
          I am working with RNA_Seq mapped to reference transcriptome. As explained in the tutorial, I processed my raw counts with:
          -filter low count (1cpm)
          -common dispersion
          -exact test for pairwise

          I got confused when I compared the results obtained from EdgeR with another statistical test. I used a simple ANOVA on the 3 conditions, with a post hoc test Tukey. From the ANOVA results, there was no significance between the 3 conditions, with no pairwise significance too. I performed also a simple test between 2 conditions, and again no significant result.
          However, with Edge R, I was able to get a list of DGE with the method="BH" < 0.05.
          I am a little bit confused on how I can get different expressed genes when there is no significance between the treatments.
          Moreover, when I selected from the pseudo.counts, only the genes that appeared in the DGE list, again there was no significance between them.

          Cam=n someone help me with this problem?
          How are selected the DGE? Is is from the tag counts, or the Pvalue?
          How the FDR affect them?

          Thanks so much for the help



          • #6
            If I have 48 libraries and two GLM coefficient. Is this equal to df = 46?
            As recommended edgeR developers says prior.df = 50.
            To calculate this I did 50/46 = 1.08695652174
            Is this equal to n.prior?

            My question is how I should input my df = 46 and n.prior = 1.08695652174 into estimateGLMTagwiseDisp?
            Last edited by sindrle; 01-28-2014, 11:08 AM. Reason: Read old version of user guide..


            • #7
              pvalue question: Is it common to have very low pvalues (e^-300) through edge R


              I recently started using edgeR for differential RNA-seq analysis. I use htseq to count reads on genic features and filtered out any gene with average reads below a number say (10). Generally, any analysis I perform using edgeR involves two conditions with triplicates. While the DE gene results are mostly consistent with microarray analysis, the pvalues and adjusted pvalues are remarkably significant ranging from 0 to 1e-300 to 0.05 to 1.
              While the top hits are consistent, the concern is related to the over-fitting or overestimating the DE list.
              Below is the code I used and a volcano plot of the results. I am told that a lot of genes are down-regulated upon knock-out of a transcription factor which is the case here.

              x <- read.delim("RNA_seq_counts_combined_filtered.txt",row.names="GENE")
              group <- factor(c("1","1","1","2","2","2"))###1 for WT and 2 for KO
              y <- DGEList(counts=x,group=group)
              y <- calcNormFactors(y)
              design <- model.matrix(~group)
              y <- estimateDisp(y,design)
              fit <- glmFit(y,design)
              lrt <- glmLRT(fit,coef=2)

              res <- read.table("edgeR_results_genes.txt", header=TRUE)

              png("RNA-seq_Volcano.png", width=10, height=14, units="in", res=300)
              with(res, plot(logFC, -log10(FDR), pch=20, main="KO vs WT"))
              with(subset(res, (FDR<.01 & abs(logFC)>1)), points(logFC, -log10(FDR), pch=20, col="red"))
              Attached Files