
Multiple DGE libraries comparison (edgeR, baySeq, DESeq)


  • #1

    Hi, I'm an undergraduate student in an MSc Biostatistics programme in France, and I'm stuck with DGE data.

    I'd like to compare multiple DGE libraries without replicates. I have 4 libraries that describe 4 different conditions (young males/females, old males/females), and I'd like to know the top differentially expressed tags between all males vs all females. But I can't find a solution, because performing all possible pair-wise tests in several libraries is statistically invalid and would lead to an unacceptable accumulation of false positives.

    I've searched through the literature without success so far, and the 3 packages I'm using don't seem to handle this kind of comparison.

    I would appreciate any form of help (literature, ideas, R packages...)

    Thank you

  • #2
    Hi Roman

    the simple solution is as follows: Consider the samples from the two male subjects as replicates, and the two female ones likewise, and go ahead using DESeq or edgeR.

    You may now say: But an old man is no proper replicate for a young man. However, remember that the point of replicates is not to have two samples which are as similar as possible.

    Rather, you have many experimental conditions, some of them you vary on purpose (here: sex of the subject) and some which you cannot control, either because you have to work with the samples that you got (here: age) or because you don't know them at all (here: medical history, race, accidental differences in preprocessing of samples).

    You use replicates for two reasons: (i) You want to have all the different possibilities for the unwanted conditional variation in equal proportions, so that they average out. If you don't even know what these differences are, your best bet is to take many random samples in the hope that they average out well. (ii) You want to know how much the samples vary due only to the unknown and not-controlled-for condition differences, so that you will call a difference with respect to the condition of interest significant only if it is sufficiently larger than what you observe if you keep the experimental condition fixed and only the other stuff (that you cannot control) varies.

    In your case, you cannot control for age, simply because you do not have enough samples, so you are justified to treat age in the same way as you treat differences that you don't even know about, such as differences in the subject's living conditions, differences in the sample preparation etc.: namely as an unknown covariate that you have to average over, i.e., you ignore the age information.

    Of course, if you had age-matched replicates, you could add age as a factor to your model and promote it to an explained covariate. This would give you additional statistical power, i.e., you will be able to call smaller differences significant. Nevertheless, your result from treating the non-age matched subjects as replicates is statistically valid -- only unfortunate, as you won't get as many hits as you would with better replicates.
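
    A quick way to see the two designs side by side is to compare their model matrices in base R (the four-library layout here is just the one from the original post):

    ```r
    # Four libraries: young male, old male, young female, old female
    sex <- factor(c("M", "M", "F", "F"))
    age <- factor(c("young", "old", "young", "old"))

    # Ignoring age: the two males (and the two females) act as replicates
    design_sex_only <- model.matrix(~ sex)

    # With age-matched replicates, age could be promoted to an explained covariate
    design_age_sex <- model.matrix(~ age + sex)

    ncol(design_sex_only)  # 2 columns: intercept + sex effect
    ncol(design_age_sex)   # 3 columns: intercept + age effect + sex effect
    ```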

    If you want to know how much you miss this way, you can check how much effect age has on expression levels by repeating the analysis with age as main difference and sex as ignored covariate, i.e., consider the two young subjects as replicates, and the two old ones likewise.
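
    For concreteness, here is a minimal sketch of both analyses with the old DESeq API (around version 0.7, current at the time of this thread); `countTable` stands for a hypothetical matrix of tag counts with one column per library:

    ```r
    library(DESeq)  # old API, ca. version 0.7

    # countTable: tags x 4 libraries (youngM, oldM, youngF, oldF) -- hypothetical
    conds <- c("M", "M", "F", "F")        # ignore age, treat it as noise
    cds <- newCountDataSet(countTable, conds)
    cds <- estimateSizeFactors(cds)
    cds <- estimateVarianceFunctions(cds) # pool variance information across tags
    res <- nbinomTest(cds, "M", "F")      # male vs female, age averaged over

    # To gauge how much age matters, swap the roles of the two factors:
    condsAge <- c("young", "old", "young", "old")
    ```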

    Actually, if you now manage to find more samples, so that you have replicates that are simultaneously matched for age and sex, you will find a new problem: to take advantage of this, you have to model both factors simultaneously, and this is something that, in fact, none of the current methods can do well. You could use the VST from our DESeq package and then limma, but this costs a lot of power. I am currently looking into this and hope to figure out soon how to make a proper generalized linear model for the negative binomial distribution, so stay tuned.



    • #3
      It's the first solution I suggested to my professor, but he didn't like the fact that some counts (for the expressed tags) were similar between the young male and old female libraries.

      Anyway, thanks a lot for your quick response; I'm going to follow your advice.




      • #4
        I am a molecular biologist/biochemist, but I can also do bioinformatics and biostatistics (lite). I imagine your professor is from a biological background and doesn't fully appreciate the statistics. Most biologists I know can't grasp the finer details of replicates and multiplicity correction. Often I hear that if the replicates are almost identical then they are a waste of money, and that if the alpha level is 0.05 then doing a multiplicity correction is throwing out too much data. While I sometimes do not use an FDR and instead use the correlation of multiple probes within a gene, I appreciate that millions of chi-squared tests, Fisher's exact tests, t-tests or ANOVAs will give many false positives.

        He/She should have consulted a statistician before they designed the experiment and paid for the sequencing. If they are receptive I would school them in some Stats 101. It might make their future experimental design a little better.

        My colleague has battled for months with his supervisor to run 4 replicates per condition instead of 2 for a microarray experiment. T-tests with 2 reps are yuk! It's a false economy to run 2 reps (if the treatment effect sizes are modest).

        Anyway, my rant aside, Simon has some excellent suggestions to get you out of your tricky situation.


        • #5
          I've just got a private mail about this thread, asking "Why is limma not a good option for doing this?", and I thought I'd better reply in a public post, as I should qualify my statement above a bit. Limma might actually work quite well, see below.

          I write in some detail in the hope to stimulate exchange of experience with other people trying to do similar analyses.

          As I've written above, one may want to examine more complex contrasts than just comparing one treatment or experimental condition with another, and neither edgeR nor DESeq provides for this at the moment. Typically, if one would like to study the interaction of two or more conditions (i.e., check whether the effect of A and B together is different from the sum of the effect of A alone and the effect of B alone), one would fit a linear model with an appropriate design matrix. For microarrays, Smyth's 'limma' software proved very useful for such purposes, as it allows pooling information from many genes in order to get good variance estimates even in the case of few replicates.

          For RNA-Seq, one cannot use ordinary linear models (including 'limma') straight away, because sequence count data is neither homoscedastic nor can the residuals be expected to be normally distributed. A generalized linear model of the Poisson family (as done by a number of authors) is a bad idea, as it underestimates the variances, possibly severely, and so gives p values that are much too low.

          Maybe one could use a form of generalized linear models tailored for the negative binomial case (or a quasi-Poisson GLM with proper variance estimation), and I intend to study this a bit.
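
          The underestimation is easy to see in a small base-R simulation (all numbers are invented; negative-binomial counts are fitted with a quasi-Poisson GLM to show the estimated dispersion):

          ```r
          set.seed(1)
          # Two groups of 20 'libraries'; counts are overdispersed (NB, size = 2),
          # so the true variance is mu + mu^2/size, far above the Poisson's mu
          group  <- gl(2, 20)
          counts <- rnbinom(40, mu = 100, size = 2)

          fit_quasi <- glm(counts ~ group, family = quasipoisson)

          # A Poisson GLM fixes the dispersion at 1; quasi-Poisson estimates it
          # from the data, and here it comes out far above 1
          summary(fit_quasi)$dispersion
          ```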

          Another approach is to transform the data to make it homoscedastic by means of a variance-stabilizing transformation (VST). Taking the logarithm would be a very simple approach to this end. Better results can be achieved by estimating the variance-mean dependence and then obtaining the corresponding VST numerically. The function 'getVarianceStabilizedData' of our DESeq package does that. (Use at least DESeq version 0.7.12; there was a bug in this function until yesterday.)

          Once one has variance stabilized data, one can fit an ordinary linear model. As the replicate number is typically too small to get a proper per-gene fit, one needs to pool variance information, and this is what limma's 'empirical Bayes' functionality does.
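
          Put together, the VST-plus-limma route might look like this sketch (old DESeq API plus limma; `cds` is assumed to be a CountDataSet that has already been through estimateSizeFactors and estimateVarianceFunctions, and the male/female design follows the thread's four-library example):

          ```r
          library(DESeq)   # use at least version 0.7.12 for the VST fix
          library(limma)

          vsd <- getVarianceStabilizedData(cds)  # tags x samples, homoscedastic

          design <- model.matrix(~ factor(c("M", "M", "F", "F")))
          fit <- lmFit(vsd, design)              # ordinary linear model per tag
          fit <- eBayes(fit)                     # pool variance info across tags
          topTable(fit, coef = 2)                # top tags for the sex effect
          ```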

          I've tested this approach on a couple of data sets, by checking for differential expression in a simple comparison of one condition with another, first with DESeq's 'nbinomTest', then with 'limma' acting on the data after VST. The results are very mixed: sometimes limma finds about 2/3 as many hits as DESeq, but often it does not find any at all. This is probably due to the fact that limma is designed for data with normally distributed error terms. The VST, however, only makes the data homoscedastic; the residuals are still far from normal.

          For simulated data with constant dispersion, limma's hits are always a subset of DESeq's hits, while for real data, many limma hits are not shared with DESeq, because limma's eBayes procedure adjusts DESeq's variance estimate on a per-gene basis. I am not sure whether this is good or bad. If one does not want to trust it, one could set the variance to the common estimate from all VST data, i.e., simply resort to a z test instead of using limma.

          If anyone out there also wants to fit interaction contrasts, I'd be glad to exchange experiences to find out what works best.


          • #6
            fitting interaction contrasts

            The design of my current project will need a model such as limma, I will have data from several different cell types, some of them activated, some not etc.

            Would your reasoning change in any way when assessing normalized data, such as RPKM or FPKM? Could I get limma to work using FPKM values?



            • #7
              It should work with both variance-stabilized count data and FPKM data. However, FPKM data needs to be normalized and variance-stabilized, too, before you can give it to limma, and I guess that this is harder there. The reason is that with the normalization by transcript length, you lose the information about the initial count value, i.e., you do not know whether these are many counts of a long transcript or few counts of a short one. Even if both cases give the same FPKM value, they will have different variance, but limma will have a hard time noticing this.
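
              The lost-count-information point can be made with two invented transcripts that end up with the same FPKM (base R; all numbers are made up for illustration):

              ```r
              total_frags <- 1e7            # library size: 10 million fragments

              # Many fragments on a long transcript vs few on a short one
              count_long  <- 1000; len_long  <- 10000   # 10 kb
              count_short <- 100;  len_short <- 1000    # 1 kb

              fpkm <- function(count, len)
                count / ((len / 1000) * (total_frags / 1e6))

              fpkm(count_long, len_long)    # 10
              fpkm(count_short, len_short)  # 10 -- identical FPKM

              # But the Poisson coefficient of variation of the counts differs:
              1 / sqrt(count_long)   # about 0.03
              1 / sqrt(count_short)  # 0.1
              ```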

              Hence, I would go for count data and the VST that we describe in our paper.

              You can try both and compare. The crucial part here is to remember that it is not about "the more hits, the better". You want to be sure that you control type-I error, i.e. that you do not get more false positives than indicated by the nominal false discovery rate.

              A good way to check whether a testing procedure maintains type-I error control is to use biological replicates and test for differential expression between the replicates. As there should be none, the p values have to look uniform. Examine the p values for weakly and for strongly expressed genes separately for uniformity to check for bias.
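
              A minimal sketch of such a null check in base R; `pvals` and `meanExpr` are simulated stand-ins for the p values from a replicate-vs-replicate test and the per-tag mean expression:

              ```r
              set.seed(1)
              # Stand-ins for real data: under the null, p values are uniform
              meanExpr <- rexp(5000, rate = 0.01)
              pvals    <- runif(5000)

              hist(pvals, breaks = 50)    # should look flat
              ks.test(pvals, "punif")     # formal uniformity check

              # Examine weakly and strongly expressed tags separately for bias
              weak   <- pvals[meanExpr <  median(meanExpr)]
              strong <- pvals[meanExpr >= median(meanExpr)]
              hist(weak,   breaks = 50)
              hist(strong, breaks = 50)
              ```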

              I am not sure, though, that this can be done with limma out of the box if you have fewer than four replicates. But it should work with some tweaking.



              • #8
                I'd like to clarify one point here:

                There is no such thing as "FPKM data" (or for that matter "RPKM data"). The only data in an RNA-Seq experiment is the read counts. FPKM is a unit for measuring expression (expected fragments per kilobase of transcript per million of reads sequenced). It is just a constant multiple of the estimated fraction of transcript in the sample.
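
                Spelled out as arithmetic (all numbers invented), the unit conversion is just:

                ```r
                # FPKM = fragments / (transcript length in kb * total fragments in millions)
                count <- 500    # fragments mapped to the transcript
                len   <- 2000   # transcript length in bp
                total <- 2e7    # total fragments sequenced

                count / ((len / 1000) * (total / 1e6))   # = 500 / (2 * 20) = 12.5 FPKM
                ```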

                Now in terms of differential expression analysis, Simon is correct that for a measurement to be useful (whatever units it is in) it is important to know its variance. That is why in Cufflinks, a program that reports transcript expression in FPKM units, the variance is reported as well.


                • #9
                  One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.


                  • #10
                    Originally posted by Cole Trapnell View Post
                    One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.

                    Do you think it would be a good idea to use quantile normalization on tag counts before running cuffdiff, or is this unnecessary given that it didn't actually use RPKM values (which I didn't know, but am glad to hear)?


                    • #11
                      Originally posted by lpachter View Post
                      There is no such thing as "FPKM data" (or for that matter "RPKM data"). The only data in an RNA-Seq experiment is the read counts.
                      I heard from people working on alignment that "the only data is the read sequences", and from people working on base calling that "the only data is the raw images"... At least there is one consensus: for sure, that makes a lot of data.


                      • #12
                        Originally posted by Cole Trapnell View Post
                        One more thing: when cuffdiff reports the results of differential testing, it's being performed on the actual counts (after log transformation) for the reasons that Simon suggests.
                        Is there a way to output the actual counts?


                        • #13
                          Hi Simon,

                          I am trying to install DESeq on Mac OS X Leopard. I get the error message "In getDependencies(pkgs, dependencies, available, lib) : package ‘DESeq’ is not available". Any clue why this is happening? I found that DESeq needs Biobase to be installed. I have done that, and yet I get this error. Any ideas?



                          • #14
                            Hi Abhijit

                            have you followed the instructions on ?

                            If so, please send me the output from the installation. The real error message should be hidden in there.



                            • #15
                              Originally posted by lpachter View Post
                              I'd like to clarify one point here:

                              ...for a measurement to be useful (whatever units it is in) it is important to know its variance. ....
                              While we're at it...: there is also no such thing as "the variance of a measurement". There is the sample variance of a set of measurements; and there is the variance of a statistical ensemble (physicists' language) or distribution in a probabilistic model (mathematicians' language), whose "true" value we often do not know, but which we can aim to estimate from data.

                              The choice of ensemble matters: repeated measurements on the same sample using the same machine will have smaller variance than repeated measurements using different machines, or different biological replicates, e.g. different cell passages.
                              Wolfgang Huber