Multiple DGE libraries comparison (edgeR, baySeq, DESeq)


  • #16
    Dear Wolfgang,

    Thank you for pointing out my typo- namely that "measurement" should have been "estimate".



    • #17
      Hi Simon,

      Thank you for your help. I got DESeq to work. However, I am trying to figure out a good value to use for the FDR cutoff. This may be a very stupid question, but I'll ask anyway: is an FDR of 10% more stringent than an FDR of 1%? I am thinking in terms of p-values, where smaller numbers mean higher stringency. Should I think of the FDR along the same lines, or the opposite?

      Thanks
      Abhijit



      • #18
        The same lines.

        A false discovery rate (FDR) of 10% means that (in expectation) 10% of your hits are false positives. To control the FDR at 10%, you adjust the raw p values for multiple testing with an FDR-controlling method, such as the one by Benjamini and Hochberg or the one by Storey, and then take all genes with an adjusted p value below the desired FDR. DESeq does this for you, using the Benjamini-Hochberg method.

        Controlling at FDR 10% is a common, but, of course, arbitrary choice. It is up to you to decide how stringent you want to be.

        The concept of the FDR was introduced by Benjamini and Hochberg in 1995 (J Roy Stat Soc B 57:289). Since then, there has been a lot of research on how best to adjust p values to allow for FDR control, and on how well this works in the presence of correlation.
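        To make this concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in Python (illustrative only; the p values are made up, and DESeq performs the equivalent computation for you in R):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (a sketch, not DESeq's code)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Genes with an adjusted p value below 0.10 form the hit list at FDR 10%.
padj = bh_adjust([0.001, 0.008, 0.039, 0.041, 0.27, 0.60])
hits = [i for i, p in enumerate(padj) if p < 0.10]
```

        Note that lowering the cutoff from 0.10 to 0.01 shrinks the hit list, so FDR thresholds behave like p-value thresholds: smaller means more stringent.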



        • #19
          Hi Simon,

          How can I set the FDR at 1% instead of 10%? Is there a way of doing this through the nbinomTest function? I am not an R programmer, so I do not know.

          Thanks
          Abhijit



          • #20
            If you followed the example in the vignette, you have come across the line

            resSig <- res[ res$padj < .1, ]

            This is the line where we select the genes that we want to consider significant. res$padj is the Benjamini-Hochberg-adjusted p value, and selecting all genes with padj below 0.1 controls the FDR at 10%. To get 1%, just put .01 here.



            • #21
              Multi-sample comparison

              Simon,

              Let's say I want to compare 4 samples, each with 2 replicates (an ANOVA, for example). As of now, DESeq doesn't have this functionality, correct?

              Is there a way I can export the entire matrix of normalized values rather than making the 1-vs-1 comparison? For example, would it be correct to multiply all of the values in T1b by 0.5587394, or T2 by 1.5823096, etc.? (These represent the column names and sizeFactors from the DESeq vignette.)



              • #22
                Let's say you have four different conditions, each with two replicates: A1, A2, B1, B2, C1, C2, D1, D2.

                Then, you can use DESeq as it is to do contrasts such as 'A vs B', 'A vs C', 'B vs D', etc.

                What does not yet work is interaction contrasts. So, let's say, you have the conditions wt, A, B, AB; where wt is the wild-type, A and B are some treatments or mutations, and AB is the combination of both. You now might either be interested in simple contrasts, as above, which is fine. Or you want to check for interactions, i.e., you say that the effect of A is given by the difference dA between A and wt (dA = A - wt) and similar for B (dB = B - wt) and AB (dAB = AB - wt), and you now wonder whether dAB = dA + dB. To check this, you need an interaction contrast, and this is something that I am currently working on, but I might still need a few weeks.

                For now, you can variance-stabilize the data (function 'getVarianceStabilizedData') and then use an ordinary linear model or the 'limma' package to check for interaction. That works fine, but the power is reduced.

                Just using the raw count data, scaled by the library sizes, is not a good idea if you intend to feed it into a linear model. Ordinary least squares regression requires homoscedasticity, and count data is heteroscedastic. The point of a variance-stabilizing transformation is precisely to remedy that.
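                As an illustration of what such an interaction test does (made-up numbers, plain least squares in Python rather than limma), the interaction coefficient in a model with an A:B term estimates exactly the quantity dAB - (dA + dB):

```python
import numpy as np

# Hypothetical variance-stabilized values for one gene, two replicates per
# condition, ordered wt, wt, A, A, B, B, AB, AB (all numbers made up).
y = np.array([5.0, 5.2, 6.1, 5.9, 6.4, 6.6, 8.3, 8.1])

a = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # treatment A present?
b = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # treatment B present?

# Design: intercept (wt baseline), dA, dB, and the interaction term.
X = np.column_stack([np.ones(8), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[3] estimates dAB - (dA + dB); near zero means the effects are additive.
```

                In limma the same contrast would come out of a design with an A:B term; this sketch only shows which quantity is being tested.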



                • #23
                  Originally posted by Simon Anders View Post
                  For now, you can variance-stabilize the data (function 'getVarianceStabilizedData') and then use an ordinary linear model or the 'limma' package to check for interaction. That works fine, but the power is reduced.
                  Whoops! I just saw it in your vignette. That's what I need since I'm actually doing a time-course. It'll just make it easier to generate some figures.



                  • #24
                    Hi Simon,

                    I was wondering if it is better to normalize the raw gene read counts by the total number of reads before looking for differential gene expression. If I run DESeq with the raw read counts there is some significant variation between biological replicates. However if I normalize it as described, the variation between replicates drops. Can DESeq handle decimal values?

                    Abhijit



                    • #25
                      Hi Abhijit,

                      no, DESeq needs to get the raw counts. The whole idea is that the raw counts allow DESeq to dissect the noise into shot noise (Poisson distributed) and sample noise.

                      If you take the count values from all your samples for a gene and calculate the variance, this value will be quite high, and it will drop if you normalize as described. However, this is not the variance that DESeq works with. DESeq is fully aware of differences in library sizes and adjusts for them when estimating the variance. The 'estimateSizeFactors' function takes care of this (and, by the way, uses a better normalization than just dividing by the total counts).
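                      For illustration, here is a sketch of the median-of-ratios idea behind 'estimateSizeFactors', in Python rather than R (the real function differs in details):

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of samples,
    each a list of per-gene counts (a sketch of the DESeq idea)."""
    n_genes = len(counts[0])
    ratios = [[] for _ in counts]
    for g in range(n_genes):
        vals = [sample[g] for sample in counts]
        if min(vals) == 0:
            continue  # skip genes not observed in every sample
        geo_mean = math.exp(sum(math.log(v) for v in vals) / len(vals))
        for j, sample in enumerate(counts):
            ratios[j].append(sample[g] / geo_mean)
    # Each sample's size factor is its median ratio to the reference.
    return [median(r) for r in ratios]

# Sample 2 is sequenced twice as deeply as sample 1.
sf = size_factors([[100, 200, 300], [200, 400, 600]])
```

                      Dividing each column by its size factor puts all samples on a common scale, which is what the 'baseMean' columns in the result table use.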

                      For more details, see the vignette and the paper.

                      Simon



                      • #26
                        Hi Simon,

                        Thank you for your reply. I am in a fix. Let me explain the objective which we are trying to achieve. We are conducting an experiment with Drosophila flies to identify genes that are affected by a particular mutation. We have deep-sequenced 4 normal male and female larvae and 4 mutant male and female larvae. We now want to compare gene expression between normal males and normal females in order to remove sex-associated genes from our dataset (13669 genes), since for those we would not be able to pinpoint the change in expression due to the mutation. We would then like to take the remaining genes and examine them for differential expression between normal and mutant flies.

                        Using DESeq with the FDR set at 5%, I would have to ignore almost 6000 genes. That number is too high: information gathered from FlyBase and other sources suggests that there are fewer than 1000 sex-associated genes. On closer examination, I find that DESeq calls both of the following scenarios (with 4 replicates) differentially expressed.

                        Normal male ||||| Normal female
                        a) 0 0 0 0 ||||| 2 4 5 6
                        b) 0 0 0 0 ||||| 100 200 400 600

                        While I am more confident about 'b', I am not so sure about 'a'. As far as the program is concerned, both are differentially expressed. If I play around with the FDR setting, a stringent criterion (1e-50 and below) retains valid sex-associated (SA) genes in the dataset, while a relaxed criterion (1e-5 and below) removes possible non-SA genes from it. That's the problem. I was wondering if there is a way to confidently call differential expression from read-count variation. Typically, we would like to retain in our dataset genes which show a 2-5 fold difference in expression between normal males and normal females, but I do not know the best way of doing this.

                        Thanks
                        Abhijit
                        Last edited by gen2prot; 04-15-2010, 07:35 AM. Reason: delimiting with spaces was not efficient for reading and explaining the problem



                        • #27
                          Hi Abhijit

                          Originally posted by gen2prot View Post
                          On closer examination I find that DESeq calls differentially expressed the following scenarios, with 4 reps.

                          Normal male ||||| Normal female
                          a) 0 0 0 0 ||||| 2 4 5 6
                          b) 0 0 0 0 ||||| 100 200 400 600

                          While I am more confident about 'b' I am not so sure about 'a'.
                          This is precisely the reason why DESeq wants to know the raw counts. If you just looked at ratios, you would not see the difference between the two scenarios.

                          Now, I'd be surprised if DESeq called case (a) really significant. Is this an actual observation or just a made-up example?

                          It is, however, quite conceivable that something like 0 vs 10 is significant, especially if four replicates on each side agree.

                          So, possible sources for your problem are:

                          - A bug in DESeq. I hope not, but if you send me your count table, I can check.

                          - Improper variance estimation. Are you sure you set up the comparison correctly? If you send me your R code, I can check.

                          - Confounding. Maybe there are strong differences between your male and your female samples because they differ systematically in some other experimental condition besides sex, e.g., age, physiology, or whatever.

                          - There really are that many differentially expressed genes, and nobody had noticed before because they used microarrays rather than RNA-Seq.

                          Simon



                          • #28
                            Originally posted by Simon Anders View Post
                            Now, I'd be surprised if DESeq called case (a) really significant. Is this an actual observation or just a made-up example?
                            Hi Simon,

                            Yes, the example was made up. I am sending you a real example in the attached PDF file. While I have more confidence in the first four entries, I am not so sure about entries 5-11. With an FDR of 1e-5, I would have to delete all these entries. There is, no doubt, a lot of variability in the read counts, which is why I suggested dividing them by the total read counts: I would then be looking at proportions rather than raw counts. The R code that I used is identical to what is given in the manual. I did not include the column "S4_NF_DUP2" in the analysis, since its deviation was very high compared to the rest of the NFs. I have also mentioned the total read counts of all the genes.

                            Let me know what you think.

                            Thanks
                            Abhijit
                            Attached Files



                            • #29
                              Hi Abhijit,

                              thanks for the table, I can see now what the problem is.

                              It has little to do with the library sizes (total read counts). I should explain this first. If you divide each column of your count data by the size factor, you get the adjustment for the total read counts that you want. The 'baseMean' columns in the result table are computed that way. The variance estimate used by DESeq is also, conceptually, a variance on this 'common scale' level, so this issue is really fully taken care of.

                              DESeq's fundamental assumption is that the mean is a good predictor of the variance, i.e., that two genes of similar expression strength have similar variance. This assumption is needed because, with typically only two or maybe three replicates, you cannot estimate the variance for each gene separately. Hence, DESeq fits a curve to the dependence of the variance on the mean and then reads the variances off this curve. With the 'residualsEcdfPlot' function, you get a diagnostic plot to see how well that worked: all the red-to-blue lines should lie close to the green diagonal. Please check this.

                              Even if the plot looks fine, there may still be quite a few outliers: genes which simply have much stronger variance than other genes of similar expression strength. The columns 'resVarA' and 'resVarB' in the result table help you spot these. They contain the quotient of the variance as estimated from the counts for just this gene over the variance as predicted from the mean. Because an estimate from the few values for a single gene is very noisy, this value is expected to fluctuate quite a lot around 1, maybe, for four replicates, from .1 to 5 or 10.

                              If the value in the resVar columns is very high (way above 1), the gene is likely a variance outlier, and the test was unreliable for it. In your table, the resVar values are indeed quite large, and you might want to exclude all those genes from your analysis.
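                              As a sketch of that filtering step (the function name, numbers, and 10-fold cutoff are all made up for illustration; they are not part of DESeq):

```python
from statistics import variance

def variance_outliers(count_rows, fitted_vars, cutoff=10.0):
    """Flag genes whose single-gene variance estimate exceeds the
    variance predicted from the mean by more than `cutoff`-fold."""
    flags = []
    for counts, v_fit in zip(count_rows, fitted_vars):
        ratio = variance(counts) / v_fit  # analogous to resVarA / resVarB
        flags.append(ratio > cutoff)
    return flags

# Two genes, four replicates each; the second varies far more than predicted.
flags = variance_outliers([[10, 12, 9, 11], [0, 0, 50, 50]], [2.0, 5.0])
```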

                              What is ultimately needed is a scheme that adjusts the variance estimate obtained from the mean by the single-gene estimate if the two disagree too much. Lönnstedt and Speed suggested an algorithm for this, the empirical Bayes adjustment, quite a while ago, and it was widely used for microarrays (as part of Smyth's 'limma' package). Robinson and Smyth's 'edgeR' package attempts to do the same for RNA-Seq data (now called the 'tagwise' dispersion estimate). In my experiments with edgeR, switching on the tagwise dispersion mode never had much effect on the results, but as your data seems to have such strong variance outliers (and you have unusually many replicates), it may help. Give it a try and compare the results of DESeq and edgeR.

                              In the end, we need both within the same tool: a flexible local variance estimation as we offer with DESeq, and a gene-wise variance adjustment as edgeR offers. I'm working on it but it will take a while.

                              Simon



                              • #30
                                Hi Simon,

                                I ran the residualsEcdfPlot function to see the dependence of the sample variance on the mean. The curves deviate heavily from the diagonal for both males and females; they are attached. What can I do to fix this? Also, which resVar column should I be looking at when rejecting the outlier genes that have high variance? As of now, the value in the resVarA column is consistently <1, while in resVarB it is very high (10-30).

                                Abhijit
                                Attached Files

