Multiple DGE libraries comparison. (EdgeR baySeq DESeq)


  • cjp
    replied
    Originally posted by Simon Anders View Post
    Wow, this sounds like an embarrassing bug!

    I've had a look immediately -- and, guess what, the variance adjustment does exactly what it should, but what I have written about it in the vignette (and further up in this thread) was completely the wrong way round.

    ...

    I'll correct the vignette right away.

    Thanks for pointing this out
    Simon
    Hi Simon,

    Thanks for the helpful comments in this thread!

But I was reading the whole thread from the beginning, and maybe you should edit post number 31 to note that you later realised it is the wrong way round, for the benefit of others doing the same and reading the whole thread. I only noticed at post 45, when I could have saved myself some confusion earlier in the thread.

    Chris



  • Gordon Smyth
    replied
    Multiple DGE libraries comparison. (EdgeR baySeq DESeq)

    Dear Roman,

    This is a very late reply to your original post of a year ago in which you described an experiment with four RNA-Seq libraries, Male-Old, Male-Young, Female-Old and Female-Young. You wanted to find transcripts differentially expressed between Male and Female.

    This type of experiment is typically treated in a way that is analogous to a paired t-test, in which you test for Male vs Female differences, adjusting for any differences between Young and Old.

When you originally posted your question, edgeR did not have facilities for this type of analysis, but it has had them since the Bioconductor release of Oct 2010. Section 11 of the edgeR vignette gives a worked example of an analysis with paired comparisons (tumour vs normal for the same patients) that is analogous to your problem. The analysis goes something like this:

Code:
design <- model.matrix(~Age+Sex)           # intercept + Age + Sex: 3 coefficients
y <- estimateGLMCommonDisp(y, design)
disp <- y$common.dispersion
fit <- glmFit(y, design, dispersion=disp)
lrt <- glmLRT(y, fit)                      # tests the last coefficient (Sex) by default
topTags(lrt)

edgeR does not require you to have replicates of any of the four conditions, because the linear model that is fitted has only three coefficients.
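To see why three coefficients suffice for four libraries, one can write out the design matrix by hand. This is a hedged sketch in Python (treatment coding, analogous to what R's model.matrix(~Age+Sex) produces; the sample labels are hypothetical):

```python
# Four libraries, one per Age/Sex combination (hypothetical labels).
samples = [("Old", "Male"), ("Young", "Male"), ("Old", "Female"), ("Young", "Female")]

# Treatment coding, as R's model.matrix(~Age+Sex) would give:
# columns are (Intercept), AgeYoung, SexMale (baselines Old and Female).
design = [
    [1, 1 if age == "Young" else 0, 1 if sex == "Male" else 0]
    for age, sex in samples
]

for (age, sex), row in zip(samples, design):
    print(age, sex, row)

# Four observations, three coefficients: the model is identifiable without
# replicates, leaving one residual degree of freedom.
```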

    Try out the paired comparison analysis and let us know if you have any problems with it.

    Best wishes
    Gordon



  • Roman Bruno
    replied
Hi again Simon, I wonder whether you have since come up with any suggestion for a possible method to compare multiple library conditions.

    Thanks



  • agent99
    replied
    Originally posted by RockChalkJayhawk View Post
I have had some inconsistencies with cufflinks with gene-level expression. I would suggest moving on to using a simple program such as HTSeq or BEDTools to get the actual tag counts within the gene region. Then, you could use DESeq for the normalization.
    Thanks for the advice. Apologies for being a little unclear in this post. I had hoped to get transcript level counts, so I was using cufflinks to determine abundance of different isoforms.



  • RockChalkJayhawk
    replied
    Originally posted by agent99 View Post
    Did anyone respond to this post?

    From the cufflinks output, I tried using the coverage value * (exon length/read length) to get counts, but it doesn't seem to be the right value. My reasoning was that if this is the counts (or fragments) in an exon, I should be able to recover the total number of mapped fragments and that value should be the same over all exons.

fragments / (exon length/1000) = fragments/kb, then

(fragments/kb) / FPKM = X million reads

    However, I'm getting a 2-fold range for how many million reads were sequenced. Could this be because the FPKM is a MLE?

    Thanks for any input/advice.

    Alisha
I have had some inconsistencies with cufflinks with gene-level expression. I would suggest moving on to using a simple program such as HTSeq or BEDTools to get the actual tag counts within the gene region. Then, you could use DESeq for the normalization.



  • agent99
    replied
    recovering counts from cufflinks

    Originally posted by Boel View Post
    Is there a way to output the actual counts?
    Did anyone respond to this post?

    From the cufflinks output, I tried using the coverage value * (exon length/read length) to get counts, but it doesn't seem to be the right value. My reasoning was that if this is the counts (or fragments) in an exon, I should be able to recover the total number of mapped fragments and that value should be the same over all exons.

fragments / (exon length/1000) = fragments/kb, then

(fragments/kb) / FPKM = X million reads

    However, I'm getting a 2-fold range for how many million reads were sequenced. Could this be because the FPKM is a MLE?

    Thanks for any input/advice.

    Alisha
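For what it's worth, the back-of-envelope arithmetic above can be sanity-checked with a short script. This is a hedged sketch with made-up numbers (not real cufflinks output); note that recovering the library size is a division by FPKM, not a multiplication:

```python
# FPKM = fragments / (transcript length in kb * millions of mapped fragments),
# so millions of mapped fragments = (fragments / length_kb) / FPKM.

def fpkm(fragments, length_bp, mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    length_kb = length_bp / 1000.0
    millions_mapped = mapped_fragments / 1e6
    return fragments / (length_kb * millions_mapped)

def implied_library_millions(fragments, length_bp, fpkm_value):
    """Invert the FPKM formula to recover millions of mapped fragments."""
    length_kb = length_bp / 1000.0
    return (fragments / length_kb) / fpkm_value

# Made-up example: 500 fragments on a 2 kb transcript, 10M mapped fragments.
f = fpkm(500, 2000, 10_000_000)
print(f)                                         # 25.0
print(implied_library_millions(500, 2000, f))    # 10.0
```

In practice the recovered library size will not be identical across transcripts, since cufflinks estimates FPKM by maximum likelihood rather than by simple counting, which is one plausible source of the 2-fold spread described above.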



  • Simon Anders
    replied
    Hi Joro,

    "out of vertex space" is an error from locfit, which DESeq uses internally. I've seen the error occuring onl,y once so far, in a quite unusual data set.

    You can try to increase the number of vertices with:
    Code:
    cds <- estimateVarianceFunctions( cds, locfit_extra_args=list(maxk=300) )
If this does not help, could you send me your CountDataSet object (save it with save( cds, file="cds.rda" )) so that I can experiment with it?

    Cheers
    Simon



  • joro
    replied
    Hi Simon,

    When using the command "cds = estimateVarianceFunctions(cds)" I get the following error:

Code:
Error in lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
  newsplit: out of vertex space
In addition: Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
  procv: no points with non-zero weight
(warnings 2 through 10 are identical)

    I'm not sure what's causing the error and was wondering if anyone has seen it before?

    Thank you.



  • Simon Anders
    replied
    Originally posted by diffexpr View Post
I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested, and the ECDF curves moved below the green line. However, the number of hits increased rather than decreased, as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives was actually going up when the ECDF curves were shifted below the green line.
    Wow, this sounds like an embarrassing bug!

    I've had a look immediately -- and, guess what, the variance adjustment does exactly what it should, but what I have written about it in the vignette (and further up in this thread) was completely the wrong way round.

    So, let me try again:

    If the ECDF curve is below the green line, this means that the variance residuals are too high. (If one plots a histogram instead of an ECDF curve, there are too many values towards the right end.) These residuals are the ratios of the per-gene estimates over the fitted values. If they are too high, the fitted values are too low, i.e., DESeq assumes a smaller variance than it should and will hence produce too small p values. To correct, one should set the variance adjustment factor to a value above 1, so that DESeq makes the fitted variance estimate larger before performing the test.
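A toy illustration of that logic (made-up numbers and a plain normal approximation, not DESeq's actual negative-binomial test): under-estimating the variance inflates the test statistic and shrinks the p value, and scaling the fitted variance up by an adjustment factor above 1 undoes it.

```python
import math

def two_sided_p(effect, var):
    """Two-sided p value under a normal approximation."""
    z = abs(effect) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2))

effect = 3.0
true_var = 4.0     # what the variance really is
fitted_var = 2.0   # fitted too low -> p values come out too small

p_anticonservative = two_sided_p(effect, fitted_var)
p_correct = two_sided_p(effect, true_var)
p_adjusted = two_sided_p(effect, 2.0 * fitted_var)  # adjustment factor > 1

print(p_anticonservative < p_correct)  # the too-low variance overstates significance
print(p_adjusted == p_correct)         # here a factor of 2.0 restores the true variance
```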

    If, however, the curves are above the green line, DESeq performs conservatively, i.e., produces p values that may be too high.

Hence, the curves that Abhijit showed further up in this thread were fine all along. They were above the green line, so if he went forward without any variance adjustment, he might have lost a few hits, but DESeq would have maintained type-I error control. This is actually a relief, although I feel quite stupid for noticing this only now, after you pointed me to it.

    I'll correct the vignette right away.

    Thanks for pointing this out
    Simon



  • diffexpr
    replied
    variance adjustment factors

    Hi Simon


I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested, and the ECDF curves moved below the green line. However, the number of hits increased rather than decreased, as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives was actually going up when the ECDF curves were shifted below the green line.



  • Simon Anders
    replied
    Originally posted by gen2prot View Post
Where can I find version 1.1.1 of DESeq? I don't see it on the Bioconductor website.
    The newest version is here:
    http://bioconductor.org/packages/dev...tml/DESeq.html

    This is the Bioconductor devel branch. You probably looked in the release branch.

    See the installation instructions here: http://www-huber.embl.de/users/anders/DESeq/

    Simon



  • gen2prot
    replied
    Hi Simon,

Where can I find version 1.1.1 of DESeq? I don't see it on the Bioconductor website.

    Thanks
    Abhijit



  • Simon Anders
    replied
    Hi Roman

    Originally posted by Roman Bruno View Post
    Hi Simon, when you say that, is it also true for SAGE data ?

    Roman
    Short answer: Yes.

    Long answer:

    The variance that you observe in count data is the sum of two components. First, the shot noise (Poisson noise), whose variance is always equal to the mean. Imagine that you have two samples which contain precisely the same concentration of a transcript and you sequence them to precisely the same depth. Even then, you will not get the same count. This unavoidable variance is the shot noise, and it is always equal to the expected number of counts.

    Second, the biological noise, i.e., the noise between two samples. (Even if you do twice exactly the same, the transcript will not really have the same concentration in your two replicate samples.) Such noise is typically multiplicative, i.e., the variance is proportional to the square of the mean.

Hence, for very low count rates, the shot noise dominates and the biological noise is unimportant. Then the Poisson model, which ignores biological noise, is fine. SAGE count rates are typically so low that the error you make by ignoring biological noise may not be that severe, unless your biological noise is also strong.

    Still, Robinson and Smyth had already proposed a way to do it properly ("edgeR") back when SAGE was new and trendy, so no excuse to be sloppy.

For RNA-Seq data, people sort of got away with it when the Genome Analyzer I produced fewer than a million reads, but now, with the GA IIx and its 20 million reads, you would have to be blind not to notice that testing with a Poisson model always gives the suspicious result that all strongly expressed genes seem to be differentially expressed (unless you cheat by using an extremely small p-value cut-off).
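That two-component decomposition can be checked with a small simulation. A hedged sketch (arbitrary parameters; biological noise modelled as a gamma-distributed multiplicative factor on the Poisson mean, i.e. a gamma-Poisson mixture): the observed variance should land near mu + cv² · mu².

```python
import math, random, statistics

random.seed(42)

def poisson(lam):
    """Knuth's Poisson sampler (fine for small lambda)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < limit:
            return k
        k += 1

def simulate(mu, cv2, n=50_000):
    """Counts with multiplicative biological noise on top of shot noise."""
    if cv2 == 0.0:
        draws = [poisson(mu) for _ in range(n)]
    else:
        shape = 1.0 / cv2  # gamma factor with mean 1, variance cv2
        draws = [poisson(mu * random.gammavariate(shape, cv2)) for _ in range(n)]
    return statistics.mean(draws), statistics.variance(draws)

# Pure shot noise: variance roughly equals the mean (about 20).
print(simulate(mu=20.0, cv2=0.0))
# Added biological noise: variance roughly mu + cv2*mu^2 = 20 + 0.1*400 = 60.
print(simulate(mu=20.0, cv2=0.1))
```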
    Last edited by Simon Anders; 05-05-2010, 12:47 PM. Reason: typos



  • Roman Bruno
    replied
    Originally posted by Simon Anders View Post
    We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.
    Hi Simon, when you say that, is it also true for SAGE data ?

    Roman



  • RSK
    replied
    I get it now. Thanks very much for clarifying!
    I got mixed up.

    -R

    Originally posted by Simon Anders View Post
    Hi RSK,

    we are talking about "DESeq", not "DEGSeq". Both are Bioconductor packages to check for differential expression. Our package (DESeq) estimates the noise from comparing replicates while DEGSeq assumes the so-called Poisson model which does not require noise estimation. We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

    Sorry for the naming clash between the two packages.

    Simon

