Seqanswers Leaderboard Ad

**Simon Anders** · 04-22-2010, 02:10 AM

Hi Abhijit

Originally posted by gen2prot View Post

I performed the residualsEcdfPlot function to see the dependence of the sample variance to the mean. It is heavily deviated for both males and females. Attached are the curves. What can I do to fix this? Also which resvar column should I be looking into while rejecting the gene outliers which have high variance? As of now the value in the resvarA col is consistently <1 while in resvarB its very high (10-30).
Abhijit

Thanks for reporting this issue. Another user recently sent me data that also caused the residual ECDF plots to be off in a similar way, and I investigated them a bit yesterday. My guess is that the cause is heterogeneity of replicates, i.e., some replicate samples are more similar than others. Then, the sample variances deviate from the theoretical chi-square distribution, which misleads the variance fit. (Consequenctly, this problem should only arise if one has more than two replicates.)

Could you look at the heatmap of sample distances (as described in the vignette) to see if your replicates look heterogeneous?

About this fit diagnostic plot: If the ECDF curves are above the diagonal green line (which indicates the ideal chi-square shape of the residuals distribution), variance will be underestimated, and p values will be too low. If the ECDF curves are below the green line, it is the other way round. A simple (maybe clumsy, but still valid) fix is to instruct DESeq to multiply its variance estimates by a user-specified factor in order to account for the misspecification of the residuals distribution. Just try out a couple of values and choose a correction factor that lets the ECDF curves appear just below the green line. When this is the case, correct control of type-I error is reliably restored, even though detection power will suffer, if the curve bulges away (downwards) from the diagonal green line.

I will add (hopefully today) an option to specify such a correction factor to DESeq. As Bioconductor is currently preparing its new release, I cannot submit the change to the server within the next few days, but I can send you the updated package by e-mail.

Cheers
Simon

Note: See also post #45, where I correct a mistake in this bug.

**Simon Anders** · 04-22-2010, 02:38 AM

Hi Abhijit

thanks for your reports on your problems with DESeq, they help me to improve it.

However, looking again at this post of yours, I realized that the root of your problem is a more basic one.

Originally posted by gen2prot View Post

[...] Let me explain the objective which we are trying to achieve. We are conducting an experiment with drosophila flies, to identify genes that are affected by a particular mutation. We have deep sequenced 4 normal male and female larvas and 4 mutant male and female larvas. We now want to compare the gene expression between normal males and normal females in order to remove sex-associated genes from our dataset (13669 genes), since we will not be able to pinpoint the change in expression due to mutation. We then would like to take the remaining genes and examine them for differential gene expression between normal and mutant flies.

Using DESeq and the FDR set at 5%, I would have to ignore almost 6000 genes. That number is too high. Info gathered from Flybase and other sources, suggest that sex-associated genes are less than a 1000. [...]

Your experiment is set up as a randomized block design. (See the Wikipedia article for an explanation of the term.) This is a very suitable design but your plan to analyse it does not seem to be that good.

To explain, let me remind you of a rather trivial but often overlooked fact: When comparing two conditions, in reality, all genes are differentially expressed. Think about all the 2000 genes that you know to differ between males and females. Surely they all have downstream effects on further genes, which have downstream effects on even more genes. All genes are somehow connected and influence each other. If one changes, all other genes will change as well, even though maybe only a tiny little bit.

Hence, if people say that a gene is differentially expressed, what they really mean is this: We know that the gene's expression has changed by an appreciable amount, large enough for our assay to detect it, to make us reasonably sure that we know whether it went up or down and to give us a (maybe only rough) idea of the magnitude of the change.

Hence, if you only looked hard enough and used hundreds of replicate samples, so that you can detect even the tiniest of difference between sexes, you will find that you have to exclude every gene.

But this is not at all necessary: If a gene changes due to the mutation from say, 1000 (somehow normalized) counts to 1500 counts in males, and from 2000 to 3000 in females, than you have a clear sex difference (in both the wild type and the mutation, it is twice as large in females) but also a clear effect of the mutation (in both males and females, the expression increases by 50%).

So, what you should do is this: Calculate the fold-change due to the mutation in the males and in the females separately, and if a gene is significant in both males and in females, and both agree on the direction, then it is significant independent of sex.

A more proper way to do this is to use a (generalized) linear model, and I am currently working on adding this to DESeq, so that DESeq's variance estimates can be used to assess the significance of the effect size in the linear model fit. Might still take a while, though.

Cheers
Simon

**gen2prot** · 04-22-2010, 07:51 AM

Hi Simon,

Thank you for your detailed explanantion. Yes I did overlook the fact that all genes are differentially expressed one way or another. However, are there any known cases of non-sex associated genes that behave differently in males and females due to a mutation? In other words a non-sex associated gene increases its expression in the mutant male, while the same gene reduces its expression in the mutant female. What will be your take on such cases. I do not know of any existing ones.

Thanks
Abhijit

**Simon Anders** · 04-22-2010, 07:57 AM

Hi Abhijit

well, if you should find something as unusual and interesting looking as a gene on which the mutation has opposite effect in the two sexes, throwing the gene away would probably be a dumb idea. I would rather have a very close look at it, in the hope that there is some interesting biology to discover. ;-)

Simon

**gen2prot** · 04-22-2010, 09:57 AM

Yes. I think so too. Thank you so much for your help all this. I think biologists tend to have a very rough idea about viewing data statistically. That is problematic these days since we get sequencing data by the gallon and a lot of interesting things may get overlooked.

-Abhijit

**Simon Anders** · 04-29-2010, 06:26 AM

Hi

a while ago, Abhijit reported that, when using DESeq on his data, the residuals ECDF plot indicated a bad fit, and I promised to add a feature to DESeq to allow for manually adjusting the variance estimates.

Originally posted by Simon Anders View Post

About this fit diagnostic plot: If the ECDF curves are above the diagonal green line (which indicates the ideal chi-square shape of the residuals distribution), variance will be underestimated, and p values will be too low. If the ECDF curves are below the green line, it is the other way round. A simple (maybe clumsy, but still valid) fix is to instruct DESeq to multiply its variance estimates by a user-specified factor in order to account for the misspecification of the residuals distribution. Just try out a couple of values and choose a correction factor that lets the ECDF curves appear just below the green line. When this is the case, correct control of type-I error is reliably restored, even though detection power will suffer, if the curve bulges away (downwards) from the diagonal green line.

I will add (hopefully today) an option to specify such a correction factor to DESeq.

I just want to report, that I've added this now, in version 1.1.1 of DESeq.

Simon

**RSK** · 05-03-2010, 06:48 AM

Dear Simon and Abhijit,

I am trying to figure out which example in the vignette (which is the manual for DEGseq?) includes the following resSig assignment line.

Originally posted by Simon Anders View Post

If you followed the example in the vignette, you have come across the line

resSig <- res[ res$padj < .1, ]

Thanks!!
RSK

**Simon Anders** · 05-03-2010, 07:57 AM

Hi RSK,

we are talking about "DESeq", not "DEGSeq". Both are Bioconductor packages to check for differential expression. Our package (DESeq) estimates the noise from comparing replicates while DEGSeq assumes the so-called Poisson model which does not require noise estimation. We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

Sorry for the naming clash between the two packages.

Simon

**RSK** · 05-03-2010, 06:21 PM

I get it now. Thanks very much for clarifying!
I got mixed up.

-R

Originally posted by Simon Anders View Post

Hi RSK,

we are talking about "DESeq", not "DEGSeq". Both are Bioconductor packages to check for differential expression. Our package (DESeq) estimates the noise from comparing replicates while DEGSeq assumes the so-called Poisson model which does not require noise estimation. We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

Sorry for the naming clash between the two packages.

Simon

**Roman Bruno** · 05-05-2010, 05:33 AM

Originally posted by Simon Anders View Post

We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

Hi Simon, when you say that, is it also true for SAGE data ?

Roman

**Simon Anders** · 05-05-2010, 05:56 AM

Hi Roman

Originally posted by Roman Bruno View Post

Hi Simon, when you say that, is it also true for SAGE data ?

Roman

Short answer: Yes.

Long answer:

The variance that you observe in count data is the sum of two components. First, the shot noise (Poisson noise), whose variance is always equal to the mean. Imagine that you have two samples which contain precisely the same concentration of a transcript and you sequence them to precisely the same depth. Even then, you will not get the same count. This unavoidable variance is the shot noise, and it is always equal to the expected number of counts.

Second, the biological noise, i.e., the noise between two samples. (Even if you do twice exactly the same, the transcript will not really have the same concentration in your two replicate samples.) Such noise is typically multiplicative, i.e., the variance is proportional to the square of the mean.

Hence, for very low count rates, the shot noise dominates, and biological noise is unimportant. Then, the Poisson model, which ignores biological noise, is fine. SAGE typically has so low count rates that the error you make by ignoring biological noise may not be that severe unless you also have strong biological noise.

Still, Robinson and Smyth had already proposed a way to do it properly ("edgeR") back when SAGE was new and trendy, so no excuse to be sloppy.

For RNA-Seq data, people sort of got away with it when the GenomeAnalyzer I only produced less than 1 mio reads, but now, with the GA IIx and its 20 mio reads, you have to be blind to not notice that testing with a Poisson model always gives the suspicious result that all strongly expressed genes seem to be differentially expressed (unless you cheat by using an extremely small p value cut-off).

**gen2prot** · 05-13-2010, 02:41 PM

Hi Simon,

Where can I find the version 1.1.1 of DESeq? I don't see it on the biocunductor website.

Thanks
Abhijit

**Simon Anders** · 05-14-2010, 05:12 AM

Originally posted by gen2prot View Post

Where can I find the version 1.1.1 of DESeq? I don't see it on the biocunductor website.

The newest version is here:

Bioconductor - Help

http://bioconductor.org/packages/devel/bioc/html/DESeq.html

The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.

This is the Bioconductor devel branch. You probably looked in the release branch.

See the installation instructions here: http://www-huber.embl.de/users/anders/DESeq/

Simon

**diffexpr** · 05-14-2010, 01:45 PM

variance adjustment factors

Hi Simon

I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested and the ECDF curves moved below the green line. However the number of hits increased rather than decreased as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives were actually going up when the ECDF curves were shifted below the green line.

**Simon Anders** · 05-15-2010, 08:49 AM

Originally posted by diffexpr View Post

I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested and the ECDF curves moved below the green line. However the number of hits increased rather than decreased as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives were actually going up when the ECDF curves were shifted below the green line.

Wow, this sounds like an embarrassing bug!

I've had a look immediately -- and, guess what, the variance adjustment does exactly what it should, but what I have written about it in the vignette (and further up in this thread) was completely the wrong way round.

So, let me try again:

If the ECDF curve is below the green line, this means that the variance residuals are too high. (If one plots a histogram instead of an ECDF curve, there are too many values towards the right end.) These residuals are the ratios of the per-gene estimates over the fitted values. If they are too high, the fitted values are too low, i.e., DESeq assumes a smaller variance than it should and will hence produce too small p values. To correct, one should set the variance adjustment factor to a value above 1, so that DESeq makes the fitted variance estimate larger before performing the test.

If, however, the curves are above the green line, DESeq performs conservatively, i.e., produces p values that may be too high.

Hence, the curves that Abhijit showed further up in this thread were fine all along. They were above the green line, so if he went forward without any variance adjustment, he might have lost a few hits but DESeq would have maintained type-I error control. This is actually a relief, although I feel now quite stupid of noticing this only now, after you pointed me to it.

I'll correct the vignette right away.

Thanks for pointing this out
Simon

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News