Seqanswers Leaderboard Ad

**Davis McC** · 06-20-2010, 07:10 PM

edgeR parameters and concepts

Hi there

First, thanks for using edgeR and I hope that I can clarify some of the concepts that you're having a little difficulty with.

1) Common Dispersion
When assessing differential expression, it is really important to model the variability in the data appropriately. For the count data that we obtain from RNA-seq experiments, we use the negative binomial (NB) model, because there is typically more variation in RNA-seq data than can be accounted for by the Poisson model (called overdispersion). We estimate the mean (mu) of the counts for each gene, which corresponds to the abundance of that gene in the RNA sample. In edgeR we model the mean for a gene as the library size * concentration. This model has a second parameter, called the dispersion parameter (under our parameterization). This parameter is very important, as it determines how we model the variance for each gene. The variance function for each gene is V = mu *( 1 + dispersion * mu ), where each gene has a distinct value for mu.

Under the "common dispersion model" we use the *same* value for the dispersion when modelling the variance for each gene. Under a tagwise model we allow for a *different* value for the dispersion to be used for each gene. So in brief, the dispersion is vital for modelling the variance of each gene and the common dispersion means that we use the same value of the dispersion for each gene. (NB: this does not mean that the variance is the same for each gene! Remember that the variance is quadratic in the mean, so variance increases with gene mean).

Now to the interpretation. First, the common dispersion is *completely different* from correlation between two replicates, so it is not surprising that they are very different. I would not recommend using correlation as a way to assess the similarity of samples, especially when there is a nice interpretation of the common dispersion. The common dispersion is the "squared coefficient of variation", where the coefficient of variation gives the amount of variability in the *true* abundance for each gene between replicates (which we can't measure directly). In your example, you have a common dispersion estimate of 0.035, which (take square root) gives an estimate of the coefficient of variation of 0.19. We interpret this as saying that the true abundance for each gene can vary up or down by 19% between replicates.

I hope that it is clear from that explanation that the *higher* the estimate of the common dispersion the *higher* the variability between replicates, not what you suggested in your post. I am concerned that you were led to the mistaken understanding that a low estimate indicates high variability between replicates. Can you point me towards places in the manual or forums/online posts that suggest this? I would like to clear up misunderstandings where I can, and if I need to edit the manual to make it clearer, then that would be good to know.

2) Moderated tagwise dispersion
The motivation for using moderated tagwise dispersions is that you may not really believe that every gene has the same common value for the dispersion, so you want to allow and estimate a distinct, individual dispersion for each gene. The problem with this is that we usually have very small samples in RNA-seq experiments (e.g. perhaps 3 samples in each of two groups). This means that we might be estimating a dispersion for each gene on 4 or 5 degrees of freedom, which leads to *very* unreliable estimates. On the other hand, we are estimating the common dispersion from many thousands of observations, so we expect this to be very reliable. The idea, then, is to moderate or "squeeze" the tagwise dispersion estimates towards the common value. So estimates larger than the common value are made smaller and estimates smaller than the common value are made larger. This makes the tagwise dispersion estimates more stable and improves inference over using "raw" tagwise dispersion estimates.

It is not necessarily better to do tagwise over only common dispersion. Moderated tagwise dispersion will tend to rank more highly as DE genes that are more consistent in their counts within groups, whereas using common dispersion is more likely to rank genes with one extremely large count in one library as highly DE. So in that sense, which one you use will depend on your preference for what you consider to be more interesting DE genes. Why not try both approaches and compare the results?

3) prior.n
The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion. You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the common likelihood the weight of one observation.

In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value, or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation (squeezing) of the tagwise dispersions. How many samples do you have in your experiment? What is the experimental design? If you have few samples (<6) then I would suggest a prior.n of at least 10. If you have more samples, then the tagwise dispersion estimates will be more reliable, so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5.

I hope that I've been able to help your understanding of edgeR and analysing DE in RNA-seq data. Please don't hesitate to get back in touch if there are still things that are unclear or could benefit from further elaboration.

Best regards
Davis

**flobpf** · 06-21-2010, 05:58 AM

Hi Davis,
Thanks a lot for the explanation. It was really very helpful.

About the online links, I was referring to this reply by Dr. Smyth:

http://permalink.gmane.org/gmane.science.biology.informatics.conductor/28513

But now that I read it again after your reply, I can understand what he was trying to say.

Thanks again!

**yifangt** · 02-23-2012, 08:19 AM

common dispersion

After I analyzed my dataset and got no DGE genes!
I was wondering if the problem was with the variation of the counts for the genes and between the two treatments as the common dispersion is 0.2556, and sqrt (common.dispersion) = 0.5055. Also I checked the MDSplot where no obvious separation could be observed between the treatments of the samples either.
Then my question is, how big is the common.dispersion should be for a good experiment? Thanks!
Yifang

**vittoria_1198** · 02-26-2013, 03:12 PM

EdgeR DGE

Hi, I am new EdgeR user and I am using it to calculate the different expressed genes between 3 conditions, performing the pairwise comparison (exact test).
I am working with RNA_Seq mapped to reference transcriptome. As explained in the tutorial, I processed my raw counts with:
-normalization
-filter low count (1cpm)
-common dispersion
-exact test for pairwise

I got confused when I compared the results obtained from EdgeR with another statistical test. I used a simple ANOVA on the 3 conditions, with a post hoc test Tukey. From the ANOVA results, there was no significance between the 3 conditions, with no pairwise significance too. I performed also a simple test between 2 conditions, and again no significant result.
However, with Edge R, I was able to get a list of DGE with the method="BH" < 0.05.
I am a little bit confused on how I can get different expressed genes when there is no significance between the treatments.
Moreover, when I selected from the pseudo.counts, only the genes that appeared in the DGE list, again there was no significance between them.

Cam=n someone help me with this problem?
How are selected the DGE? Is is from the tag counts, or the Pvalue?
How the FDR affect them?

Thanks so much for the help

Vittoria

**sindrle** · 01-28-2014, 10:33 AM

Hi!
If I have 48 libraries and two GLM coefficient. Is this equal to df = 46?
As recommended edgeR developers says prior.df = 50.
To calculate this I did 50/46 = 1.08695652174
Is this equal to n.prior?

My question is how I should input my df = 46 and n.prior = 1.08695652174 into estimateGLMTagwiseDisp?

**dhir_kumar** · 12-21-2016, 08:39 AM

pvalue question: Is it common to have very low pvalues (e^-300) through edge R

Hi,

I recently started using edgeR for differential RNA-seq analysis. I use htseq to count reads on genic features and filtered out any gene with average reads below a number say (10). Generally, any analysis I perform using edgeR involves two conditions with triplicates. While the DE gene results are mostly consistent with microarray analysis, the pvalues and adjusted pvalues are remarkably significant ranging from 0 to 1e-300 to 0.05 to 1.
While the top hits are consistent, the concern is related to the over-fitting or overestimating the DE list.
Below is the code I used and a volcano plot of the results. I am told that a lot of genes are down-regulated upon knock-out of a transcription factor which is the case here.

library("edgeR")
x <- read.delim("RNA_seq_counts_combined_filtered.txt",row.names="GENE")
group <- factor(c("1","1","1","2","2","2"))###1 for WT and 2 for KO
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)
write.table(topTags(lrt,dim(lrt)[1]),"edgeR_results_genes.txt")

res <- read.table("edgeR_results_genes.txt", header=TRUE)

png("RNA-seq_Volcano.png", width=10, height=14, units="in", res=300)
with(res, plot(logFC, -log10(FDR), pch=20, main="KO vs WT"))
with(subset(res, (FDR<.01 & abs(logFC)>1)), points(logFC, -log10(FDR), pch=20, col="red"))
dev.off()

Attached Files

RNA-seq_Volcano.png (26.1 KB, 21 views)

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 58 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

EdgeR parameters and concepts

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News