
**mike123** · 01-18-2013, 11:34 AM

Anybody got an answer?

**Gordon Smyth** · 01-21-2013, 02:01 PM

There isn't any rounding error. Rather, you are making a number of assumptions here about how things are done, and the calculations in edgeR are actually more subtle than you are assuming.

The average cpm for a group is not computed in edgeR simply by taking averages of individual cpm values. Doing so would treat all the cpm as equally reliable, whereas reliability actually depends on the library size, the size of the counts and on the negative binomial dispersion.
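
As a rough illustration of the point about reliability, the sketch below compares a plain average of per-sample cpm values with a pooled estimate. It assumes a zero-dispersion (Poisson) model, in which the maximum-likelihood estimate is total counts divided by total library size; edgeR fits a negative binomial model, so its weights also depend on the dispersion. The counts and library sizes are made up.

```r
# Hypothetical counts for one gene in three libraries from the same group
counts   <- c(2, 30, 400)
lib.size <- c(1e5, 1e6, 1e7)

# Naive approach: average the per-sample cpm values with equal weight
cpm.each  <- counts / lib.size * 1e6
naive.cpm <- mean(cpm.each)

# Poisson (zero dispersion) estimate: pool counts and library sizes,
# so deeper libraries carry more weight
pooled.cpm <- sum(counts) / sum(lib.size) * 1e6

print(c(naive = naive.cpm, pooled = pooled.cpm))
```

The two numbers differ because the shallow library, whose cpm rests on only two reads, counts as much as the deep one in the plain average.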

The log-fold-change between two groups is not simply the log-ratio of the two average cpm values. That would give wildly variable logFC values for small counts, whereas edgeR returns values that are more stable.

To get some feeling for what is done, start by reading the help page for exactTest() and looking at the prior.count argument. Also see the predFC() function.
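
A simplified sketch of the effect of a prior count, not edgeR's actual algorithm: it assumes two samples with equal library sizes and adds the same small constant to both counts before taking the log-ratio. The helper names and the value 0.125 are illustrative choices.

```r
# Plain log2 ratio of two counts (equal library sizes assumed)
naive.logFC <- function(y1, y2) log2(y2 / y1)

# Same ratio after adding a small prior count to both counts
shrunk.logFC <- function(y1, y2, prior.count = 0.125) {
  log2((y2 + prior.count) / (y1 + prior.count))
}

naive.logFC(0, 1)         # infinite: division by a zero count
shrunk.logFC(0, 1)        # finite
naive.logFC(1000, 2000)   # large counts: exactly 1
shrunk.logFC(1000, 2000)  # large counts: barely changed
```

The adjustment matters only when the counts are small, which is where the plain ratio is least trustworthy.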

**mike123** · 01-24-2013, 09:50 AM

That makes sense. Thanks!

**Shanrong** · 02-02-2013, 12:28 PM

edgeR logFC calculation

The best way to understand logFC and logCPM is to look at the source code of exactTest, as Dr. Smyth suggested.

Below is how logFC is calculated.
...
abundance1 <- mglmOneGroup(y1 + matrix(prior.count[j1], ntags,
n1, byrow = TRUE), offset = offset[j1])
...
abundance2 <- mglmOneGroup(y2 + matrix(prior.count[j2], ntags,
n2, byrow = TRUE), offset = offset[j2])
...
logFC <- (abundance2 - abundance1)/log(2)
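
For readers following the excerpt: the division by log(2) in its last line converts a difference of natural-log abundances to the log2 scale. A minimal check with made-up abundance values (not taken from any data set):

```r
# Hypothetical abundances on the natural-log scale
abundance1 <- log(2e-6)
abundance2 <- log(8e-6)

# Dividing a natural-log difference by log(2) gives a log2 fold change
logFC <- (abundance2 - abundance1) / log(2)
print(logFC)  # a four-fold difference
```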

My question: I understand that the raw counts y1 and y2 are slightly adjusted by prior.count, but why are they adjusted this way?

Thanks,
Shanrong

**Gordon Smyth** · 02-02-2013, 03:33 PM

Originally posted by Shanrong

My question is: I understand raw counts y1, y2 were slightly adjusted by prior.count, but wondering why adjusted this way?

To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.

**Shanrong** · 02-04-2013, 02:02 PM

plotSmear and maPlot

I have a deep RNA-Seq dataset (100M reads/sample). After running the preceding steps, I call plotSmear:

...
et <- exactTest(er)
topTags(et)
de <- decideTestsDGE(et)
detags <- rownames(er)[as.logical(de)]
plotSmear(et, de.tags=detags)
...

In my call, plotSmear simply takes et$table$logFC and et$table$logCPM and then calls maPlot. If I drill down into the source code of maPlot, this is what happens:

if (!is.null(logAbundance) && !is.null(logFC)) {
A <- logAbundance
M <- logFC
w <- v <- rep(FALSE, length(A))
w <- A < -25 + log2(1e+06)
......
}

I plotted the histogram of A, and it seems to me that the "-25" above is not a good cutoff; "-28" might work better for my dataset.

Below is the info for the histogram.
> hist(A)$counts
[1] 2 1 0 4 7 10 1183 1177 1520 1569 1721 2846 4197 2361 486 113 9 1
> hist(A)$breaks
[1] -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16

Note: -25 + log2(1e+06) = -5.07
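
To make the comparison concrete, the sketch below recomputes both cutoffs and counts the tags in histogram bins that lie entirely below each one, using the counts and breaks printed above. Bins that straddle a cutoff are left out, so the results are lower bounds; the variable names are illustrative.

```r
# Histogram of A as printed above
breaks <- seq(-20, 16, by = 2)
counts <- c(2, 1, 0, 4, 7, 10, 1183, 1177, 1520, 1569, 1721,
            2846, 4197, 2361, 486, 113, 9, 1)

# Cutoff used in maPlot and the proposed alternative
cutoff.default <- -25 + log2(1e6)
cutoff.alt     <- -28 + log2(1e6)

# Tags in bins whose upper edge is at or below the cutoff
upper <- breaks[-1]
n.default <- sum(counts[upper <= cutoff.default])
n.alt     <- sum(counts[upper <= cutoff.alt])

print(c(default = n.default, alternative = n.alt))
```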

My request: it would be nice if I could pass the cutoff for A as an argument when I call maPlot.

**Shanrong** · 02-12-2013, 05:48 PM

transcripts with ZERO reads and calcNormFactors

Originally posted by Gordon Smyth

To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.

Dr. Smyth,

I am puzzled by the transcripts that have ZERO reads in all conditions. I ran the normalization twice: (1) with all transcripts; (2) excluding transcripts with ZERO reads in all conditions.

nrow(counts)
[1] 36004
sum( rowSums(counts)>0 )
[1] 28867

edger <- DGEList(counts=counts,group=conditions)
edger <- calcNormFactors(edger)

edger1 <- DGEList(counts=counts[rowSums(counts)>0,],group=conditions)
edger1 <- calcNormFactors(edger1)

edger$samples$norm.factors
[1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643
edger1$samples$norm.factors
[1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643

However, the normalization factors are the SAME. Why? Does this indicate that transcripts with ZERO reads are not used at all during calcNormFactors?

**Gordon Smyth** · 02-23-2013, 08:00 PM

Originally posted by Shanrong

Does this indicate such transcripts with ZERO reads are not used at all during calcNormFactors?

Yes.

If a transcript has no reads in any sample, then it is effectively not in the data set at all. We naturally do not want it to affect the results.
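
A small base-R sketch of why an all-zero row carries no information for normalization, using a made-up count matrix rather than the data above: the row adds nothing to any library size, and its between-sample log-ratio is undefined.

```r
# Hypothetical counts: gene2 has zero reads in both samples
counts <- matrix(c(10, 0, 5,
                   20, 0, 8),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("s1", "s2")))

keep <- rowSums(counts) > 0

# Library sizes are identical with and without the all-zero row
lib.all  <- colSums(counts)
lib.kept <- colSums(counts[keep, , drop = FALSE])
print(identical(lib.all, lib.kept))

# The log-ratio for the all-zero row is NaN, so it cannot be used
log.ratio <- log2(counts[, "s2"] / counts[, "s1"])
print(is.finite(log.ratio))
```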

**Gordon Smyth** · 02-23-2013, 08:13 PM

Originally posted by Shanrong

My request: it would be nice if I can pass on the cutoff for A when I call maPlot.

When plotting shrunk fold changes in edgeR, you don't actually need a smear cutoff, because none of the fold changes will be infinite.

Anyway, this isn't the right forum to ask for changes to the edgeR package. It would be better to email the Bioconductor mailing list, because that is the official support list for edgeR and other Bioconductor packages.

edgeR logFC calculation rounding error?
