Multiple DGE libraries comparison. (EdgeR baySeq DESeq)

Simon Anders replied

05-03-2010, 07:57 AM
Hi RSK,

we are talking about "DESeq", not "DEGSeq". Both are Bioconductor packages to check for differential expression. Our package (DESeq) estimates the noise from comparing replicates while DEGSeq assumes the so-called Poisson model which does not require noise estimation. We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

Sorry for the naming clash between the two packages.

Simon
Leave a comment:
RSK replied

05-03-2010, 06:48 AM
Dear Simon and Abhijit,

I am trying to figure out which example in the vignette (which is the manual for DEGseq?) includes the following resSig assignment line.

Originally posted by Simon Anders View Post

If you followed the example in the vignette, you have come across the line

resSig <- res[ res$padj < .1, ]

Thanks!!
RSK
Leave a comment:
Simon Anders replied

04-29-2010, 06:26 AM
Hi

a while ago, Abhijit reported that, when using DESeq on his data, the residuals ECDF plot indicated a bad fit, and I promised to add a feature to DESeq to allow for manually adjusting the variance estimates.

Originally posted by Simon Anders View Post

About this fit diagnostic plot: If the ECDF curves are above the diagonal green line (which indicates the ideal chi-square shape of the residuals distribution), variance will be underestimated, and p values will be too low. If the ECDF curves are below the green line, it is the other way round. A simple (maybe clumsy, but still valid) fix is to instruct DESeq to multiply its variance estimates by a user-specified factor in order to account for the misspecification of the residuals distribution. Just try out a couple of values and choose a correction factor that lets the ECDF curves appear just below the green line. When this is the case, correct control of type-I error is reliably restored, even though detection power will suffer, if the curve bulges away (downwards) from the diagonal green line.

I will add (hopefully today) an option to specify such a correction factor to DESeq.

I just want to report, that I've added this now, in version 1.1.1 of DESeq.

Simon
Leave a comment:
gen2prot replied

04-22-2010, 09:57 AM
Yes. I think so too. Thank you so much for your help all this. I think biologists tend to have a very rough idea about viewing data statistically. That is problematic these days since we get sequencing data by the gallon and a lot of interesting things may get overlooked.

-Abhijit
Leave a comment:
Simon Anders replied

04-22-2010, 07:57 AM
Hi Abhijit

well, if you should find something as unusual and interesting looking as a gene on which the mutation has opposite effect in the two sexes, throwing the gene away would probably be a dumb idea. I would rather have a very close look at it, in the hope that there is some interesting biology to discover. ;-)

Simon
Leave a comment:
gen2prot replied

04-22-2010, 07:51 AM
Hi Simon,

Thank you for your detailed explanantion. Yes I did overlook the fact that all genes are differentially expressed one way or another. However, are there any known cases of non-sex associated genes that behave differently in males and females due to a mutation? In other words a non-sex associated gene increases its expression in the mutant male, while the same gene reduces its expression in the mutant female. What will be your take on such cases. I do not know of any existing ones.

Thanks
Abhijit
Leave a comment:
Simon Anders replied

04-22-2010, 02:38 AM
Hi Abhijit

thanks for your reports on your problems with DESeq, they help me to improve it.

However, looking again at this post of yours, I realized that the root of your problem is a more basic one.

Originally posted by gen2prot View Post

[...] Let me explain the objective which we are trying to achieve. We are conducting an experiment with drosophila flies, to identify genes that are affected by a particular mutation. We have deep sequenced 4 normal male and female larvas and 4 mutant male and female larvas. We now want to compare the gene expression between normal males and normal females in order to remove sex-associated genes from our dataset (13669 genes), since we will not be able to pinpoint the change in expression due to mutation. We then would like to take the remaining genes and examine them for differential gene expression between normal and mutant flies.

Using DESeq and the FDR set at 5%, I would have to ignore almost 6000 genes. That number is too high. Info gathered from Flybase and other sources, suggest that sex-associated genes are less than a 1000. [...]

Your experiment is set up as a randomized block design. (See the Wikipedia article for an explanation of the term.) This is a very suitable design but your plan to analyse it does not seem to be that good.

To explain, let me remind you of a rather trivial but often overlooked fact: When comparing two conditions, in reality, all genes are differentially expressed. Think about all the 2000 genes that you know to differ between males and females. Surely they all have downstream effects on further genes, which have downstream effects on even more genes. All genes are somehow connected and influence each other. If one changes, all other genes will change as well, even though maybe only a tiny little bit.

Hence, if people say that a gene is differentially expressed, what they really mean is this: We know that the gene's expression has changed by an appreciable amount, large enough for our assay to detect it, to make us reasonably sure that we know whether it went up or down and to give us a (maybe only rough) idea of the magnitude of the change.

Hence, if you only looked hard enough and used hundreds of replicate samples, so that you can detect even the tiniest of difference between sexes, you will find that you have to exclude every gene.

But this is not at all necessary: If a gene changes due to the mutation from say, 1000 (somehow normalized) counts to 1500 counts in males, and from 2000 to 3000 in females, than you have a clear sex difference (in both the wild type and the mutation, it is twice as large in females) but also a clear effect of the mutation (in both males and females, the expression increases by 50%).

So, what you should do is this: Calculate the fold-change due to the mutation in the males and in the females separately, and if a gene is significant in both males and in females, and both agree on the direction, then it is significant independent of sex.

A more proper way to do this is to use a (generalized) linear model, and I am currently working on adding this to DESeq, so that DESeq's variance estimates can be used to assess the significance of the effect size in the linear model fit. Might still take a while, though.

Cheers
Simon
Leave a comment:
Simon Anders replied

04-22-2010, 02:10 AM
Hi Abhijit

Originally posted by gen2prot View Post

I performed the residualsEcdfPlot function to see the dependence of the sample variance to the mean. It is heavily deviated for both males and females. Attached are the curves. What can I do to fix this? Also which resvar column should I be looking into while rejecting the gene outliers which have high variance? As of now the value in the resvarA col is consistently <1 while in resvarB its very high (10-30).
Abhijit

Thanks for reporting this issue. Another user recently sent me data that also caused the residual ECDF plots to be off in a similar way, and I investigated them a bit yesterday. My guess is that the cause is heterogeneity of replicates, i.e., some replicate samples are more similar than others. Then, the sample variances deviate from the theoretical chi-square distribution, which misleads the variance fit. (Consequenctly, this problem should only arise if one has more than two replicates.)

Could you look at the heatmap of sample distances (as described in the vignette) to see if your replicates look heterogeneous?

About this fit diagnostic plot: If the ECDF curves are above the diagonal green line (which indicates the ideal chi-square shape of the residuals distribution), variance will be underestimated, and p values will be too low. If the ECDF curves are below the green line, it is the other way round. A simple (maybe clumsy, but still valid) fix is to instruct DESeq to multiply its variance estimates by a user-specified factor in order to account for the misspecification of the residuals distribution. Just try out a couple of values and choose a correction factor that lets the ECDF curves appear just below the green line. When this is the case, correct control of type-I error is reliably restored, even though detection power will suffer, if the curve bulges away (downwards) from the diagonal green line.

I will add (hopefully today) an option to specify such a correction factor to DESeq. As Bioconductor is currently preparing its new release, I cannot submit the change to the server within the next few days, but I can send you the updated package by e-mail.

Cheers
Simon

Note: See also post #45, where I correct a mistake in this bug.

Last edited by Simon Anders; 11-29-2011, 07:43 AM.
Leave a comment:
gen2prot replied

04-21-2010, 07:03 AM
Hi Simon,

I performed the residualsEcdfPlot function to see the dependence of the sample variance to the mean. It is heavily deviated for both males and females. Attached are the curves. What can I do to fix this? Also which resvar column should I be looking into while rejecting the gene outliers which have high variance? As of now the value in the resvarA col is consistently <1 while in resvarB its very high (10-30).

Abhijit
Attached Files

residualplot_for_females.pdf (246.0 KB, 162 views)

residualplot_for_males.pdf (242.5 KB, 101 views)
Leave a comment:
Simon Anders replied

04-15-2010, 11:47 PM
Hi Abhijit,

thanks for the table, I can see now what the problem is.

It has little to do with the library sizes (total read counts). I should explain this first. If you divide each column of your count data by the size factor, you get the adjustment for the total read counts that you want. The 'baseMean' columns in the result table are computed that way. The variance estimate used by DESeq is also, conceptually, a variance on this 'common scale' level, so this issue is really fully taken care of.

DESeq's fundamental assumption is that the mean is a good predictor of the variance, i.e., that two genes of similar expression strength have similar variance. This assumption is needed because, with typically only two or maybe three replicates, you cannot estimate the variance for each gene separately. Hence, DESeq fits a curve to the dependence of the variance on the mean, and the reads of variances from it. With the 'residualsEcdfPlot' function, you get a disgnostic plot to see how well that worked: all the red-to-blue lines should lie close to the green diagonal. Please check this.

Even if the plot looks fine, there may still be quite some outliers; genes which simply hjave much stronger variance than other genes of similar expression strength. The columns 'resVarA' and 'resVarB' in the result table help you spot these. These columns contain the quotient of the variance as estimated from the counts for just this gene over the variance as predicted from the mean. Because an estimate from the few values for a single gene is very noisy, this value is expected to fluctuate quite a lot around 1, maybe, for four replicates, from .1 to 5 or 10.

If the value in the resVar columns is very high (way above 1), the gene is likely a variance outlier, and the test was unreliable for it. In your table, the resVar values are indeed quite large, and you might want to exclude all those genes from your analysis.

What is ultimatively needed is a scheme to adjust the variance estimate from the mean by the single-gene estimate, if the two disagree too much. Lönnsted and Speed suggested an algorithm for this, the empirical Bayes adjustment, quite a while ago, and it was widely used for microarrays (as part of Smyth's 'limma' package). Robinson and Smyth's 'edgeR' package attempts to do the same for RNA-Seq data (now called 'tagwise dispersion estimate'). In my experiments with edgeR, switching on the tagwise dispersion mode never had much effect on the results, but as your data seems to have such strong variance outliers (and you have unusually many replicates), it may help, so give it a try and compare the results of DESeq and edgeR.

In the end, we need both within the same tool: a flexible local variance estimation as we offer with DESeq, and a gene-wise variance adjustment as edgeR offers. I'm working on it but it will take a while.

Simon
Leave a comment:
gen2prot replied

04-15-2010, 01:23 PM
Originally posted by Simon Anders View Post

Hi Abhijit

This is precisely the reason why DESeq wants to know the raw counts. if you would just look at ratios, you would not see the differences between the two scenarios.

Now, I'd be surprised if DESeq called case (a) really significant. Is this an actual observation or just a made-up example?

Is is, however, quite conceivable that something like 0 - 10 is significant, especially if four replicates on each side agree.

So, possible sources for your problem are:

- a bug in DESeq. i don't hope so, but if you send me your count table, I can check.

- improper variance estimation. Are you sure you set up the comparison correctly. Maybe send me your R code and I can check.

- confounding. Maybe there is strong differences between your male and your female samples, because they differ systematically in some other experimental condition besides sex, e.g., age, physiology, or whatever.

- there are really so many differential expressed genes, and people had not noticed before because they used microarrays and not RNA-Seq

Simon

Hi Simon,

Yes the example was made up. I am sending you a real example in the pdf file. While I have more confidence in the first four entries, I am not so sure about 5-11. With an FDR of 1e-5 I would have to delete all these entries. There is no doubt a lot of variability in the read count numbers. Which is why i suggested that if I divided them by the total read counts, I would maybe have a more reasonable picture since i am going to be looking at the proportion and not read counts. The R code that I used is identical to what is given in the manual. I did not include column "S4_NF_DUP2" in the analysis since its deviation was very high compared to the rest of the NF's. I have also mentioned the total read counts of all the genes.

Let me know what you think.

Thanks
Abhijit
Attached Files

example.pdf (35.2 KB, 146 views)
Leave a comment:
Simon Anders replied

04-15-2010, 07:51 AM
Hi Abhijit

Originally posted by gen2prot View Post

On closer examination I find that DESeq calls differentially expressed the following scenarios, with 4 reps.

Normal male ||||| Normal female
a) 0 0 0 0 ||||| 2 4 5 6
b) 0 0 0 0 ||||| 100 200 400 600

While I am more confident about 'b' I am not so sure about 'a'.

This is precisely the reason why DESeq wants to know the raw counts. if you would just look at ratios, you would not see the differences between the two scenarios.

Now, I'd be surprised if DESeq called case (a) really significant. Is this an actual observation or just a made-up example?

Is is, however, quite conceivable that something like 0 - 10 is significant, especially if four replicates on each side agree.

So, possible sources for your problem are:

- a bug in DESeq. i don't hope so, but if you send me your count table, I can check.

- improper variance estimation. Are you sure you set up the comparison correctly. Maybe send me your R code and I can check.

- confounding. Maybe there is strong differences between your male and your female samples, because they differ systematically in some other experimental condition besides sex, e.g., age, physiology, or whatever.

- there are really so many differential expressed genes, and people had not noticed before because they used microarrays and not RNA-Seq

Simon
Leave a comment:
gen2prot replied

04-15-2010, 07:29 AM
Hi Simon,

Thank you for your reply. I am in a fix. Let me explain the objective which we are trying to achieve. We are conducting an experiment with drosophila flies, to identify genes that are affected by a particular mutation. We have deep sequenced 4 normal male and female larvas and 4 mutant male and female larvas. We now want to compare the gene expression between normal males and normal females in order to remove sex-associated genes from our dataset (13669 genes), since we will not be able to pinpoint the change in expression due to mutation. We then would like to take the remaining genes and examine them for differential gene expression between normal and mutant flies.

Using DESeq and the FDR set at 5%, I would have to ignore almost 6000 genes. That number is too high. Info gathered from Flybase and other sources, suggest that sex-associated genes are less than a 1000. On closer examination I find that DESeq calls differentially expressed the following scenarios, with 4 reps.

Normal male ||||| Normal female
a) 0 0 0 0 ||||| 2 4 5 6
b) 0 0 0 0 ||||| 100 200 400 600

While I am more confident about 'b' I am not so sure about 'a'. As far as the program is concerned both are differentially expressed. If I play around with the FDR setting, a stringent criteria (1e-50 and below) retains in the dataset valid sex-associated (SA) genes while a relaxed criteria (1e-5 and below) removes from the dataset possible non-SA genes. Thats the problem. I was wondering if there was a way to confidently call differential gene expression from read count variation. I do not know the best way to approach this problem. Typically we would like to retain in our dataset genes which show a 2-5 fold difference in gene expression between normal males and normal females. I do not know what is the best way of doing this.

Thanks
Abhijit

Last edited by gen2prot; 04-15-2010, 07:35 AM. Reason: delimiting with spaces was not efficient for reading and explaining the problem
Leave a comment:
Simon Anders replied

04-14-2010, 11:37 PM
Hi Abhijit,

no, DESeq needs to get the raw counts. The whole idea is that the raw counts allow DESeq to dissect the noise into shot noise (Poisson distributed) and sample noise.

If you take the count values from all your samples for a gene and calculate the variance, then this value will be quite high and will drop if you normalize as described. However, this is not the variance that DESeq works with. It is fully aware of difference in library sizes and adjusts for it when estimating variance. The 'estimateSizeFactors' function takes care of this (and, by the way, uses a better normalization that just dividing by the total counts).

For more details, see the vignette and the paper.

Simon
Leave a comment:
gen2prot replied

04-14-2010, 07:53 PM
Hi Simon,

I was wondering if it is better to normalize the raw gene read counts by the total number of reads before looking for differential gene expression. If I run DESeq with the raw read counts there is some significant variation between biological replicates. However if I normalize it as described, the variation between replicates drops. Can DESeq handle decimal values?

Abhijit
Leave a comment:

Previous 1 2 3 4 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News