Multiple DGE libraries comparison. (EdgeR baySeq DESeq)


  • cjp
    replied
    Originally posted by Simon Anders View Post
    Wow, this sounds like an embarrassing bug!

    I've had a look immediately -- and, guess what, the variance adjustment does exactly what it should, but what I have written about it in the vignette (and further up in this thread) was completely the wrong way round.

    ...

    I'll correct the vignette right away.

    Thanks for pointing this out
    Simon
    Hi Simon,

    Thanks for the helpful comments in this thread!

But I was reading the whole thread from the beginning, and maybe you should edit post number 31 to note that you later realised it is the wrong way round, for the benefit of others doing the same and reading the whole thread. I only noticed at post 45, when I could have saved myself some confusion earlier in the thread.

    Chris



  • Gordon Smyth
    replied
    Multiple DGE libraries comparison. (EdgeR baySeq DESeq)

    Dear Roman,

    This is a very late reply to your original post of a year ago in which you described an experiment with four RNA-Seq libraries, Male-Old, Male-Young, Female-Old and Female-Young. You wanted to find transcripts differentially expressed between Male and Female.

    This type of experiment is typically treated in a way that is analogous to a paired t-test, in which you test for Male vs Female differences, adjusting for any differences between Young and Old.

When you originally posted your question, edgeR did not have facilities for this type of analysis, but it has had them since the Bioconductor release of Oct 2010. Section 11 of the edgeR vignette gives a worked example of an analysis with paired comparisons (tumour vs normal for the same patients) that is analogous to your problem. The analysis goes something like this:

Code:
design <- model.matrix(~Age+Sex)           # intercept + Age + Sex: 3 coefficients
y <- estimateGLMCommonDisp(y, design)
disp <- y$common.dispersion
fit <- glmFit(y, design, dispersion=disp)
lrt <- glmLRT(y, fit)                      # tests the last coefficient (Sex) by default
topTags(lrt)

edgeR does not require you to have replicates of any of the four conditions, because the linear model that is fitted has only three coefficients.
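To see why three coefficients suffice for four libraries, one can write out the design matrix by hand. This is a hedged sketch in Python (treatment coding, analogous to what R's model.matrix(~Age+Sex) produces; the sample labels are hypothetical):

```python
# Four libraries, one per Age/Sex combination (hypothetical labels).
samples = [("Old", "Male"), ("Young", "Male"), ("Old", "Female"), ("Young", "Female")]

# Treatment coding, as R's model.matrix(~Age+Sex) would give:
# columns are (Intercept), AgeYoung, SexMale (baselines Old and Female).
design = [
    [1, 1 if age == "Young" else 0, 1 if sex == "Male" else 0]
    for age, sex in samples
]

for (age, sex), row in zip(samples, design):
    print(age, sex, row)

# Four observations, three coefficients: the model is identifiable without
# replicates, leaving one residual degree of freedom.
```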

    Try out the paired comparison analysis and let us know if you have any problems with it.

    Best wishes
    Gordon



  • Roman Bruno
    replied
Hi again Simon, I wonder whether you have since come up with any suggestion for a possible method to compare multiple library conditions.

    Thanks



  • agent99
    replied
    Originally posted by RockChalkJayhawk View Post
I have had some inconsistencies with cufflinks with gene-level expression. I would suggest moving on to using a simple program such as HTSeq or BEDTools to get the actual tag counts within the gene region. Then, you could use DESeq for the normalization.
    Thanks for the advice. Apologies for being a little unclear in this post. I had hoped to get transcript level counts, so I was using cufflinks to determine abundance of different isoforms.



  • RockChalkJayhawk
    replied
    Originally posted by agent99 View Post
    Did anyone respond to this post?

    From the cufflinks output, I tried using the coverage value * (exon length/read length) to get counts, but it doesn't seem to be the right value. My reasoning was that if this is the counts (or fragments) in an exon, I should be able to recover the total number of mapped fragments and that value should be the same over all exons.

fragments / (exon length/1000) = fragments/kb, then

(fragments/kb) / FPKM = X million reads

    However, I'm getting a 2-fold range for how many million reads were sequenced. Could this be because the FPKM is a MLE?

    Thanks for any input/advice.

    Alisha
I have had some inconsistencies with cufflinks with gene-level expression. I would suggest moving on to using a simple program such as HTSeq or BEDTools to get the actual tag counts within the gene region. Then, you could use DESeq for the normalization.



  • agent99
    replied
    recovering counts from cufflinks

    Originally posted by Boel View Post
    Is there a way to output the actual counts?
    Did anyone respond to this post?

    From the cufflinks output, I tried using the coverage value * (exon length/read length) to get counts, but it doesn't seem to be the right value. My reasoning was that if this is the counts (or fragments) in an exon, I should be able to recover the total number of mapped fragments and that value should be the same over all exons.

fragments / (exon length/1000) = fragments/kb, then

(fragments/kb) / FPKM = X million reads

    However, I'm getting a 2-fold range for how many million reads were sequenced. Could this be because the FPKM is a MLE?

    Thanks for any input/advice.

    Alisha
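For what it's worth, the back-of-envelope arithmetic above can be sanity-checked with a short script. This is a hedged sketch with made-up numbers (not real cufflinks output); note that recovering the library size is a division by FPKM, not a multiplication:

```python
# FPKM = fragments / (transcript length in kb * millions of mapped fragments),
# so millions of mapped fragments = (fragments / length_kb) / FPKM.

def fpkm(fragments, length_bp, mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    length_kb = length_bp / 1000.0
    millions_mapped = mapped_fragments / 1e6
    return fragments / (length_kb * millions_mapped)

def implied_library_millions(fragments, length_bp, fpkm_value):
    """Invert the FPKM formula to recover millions of mapped fragments."""
    length_kb = length_bp / 1000.0
    return (fragments / length_kb) / fpkm_value

# Made-up example: 500 fragments on a 2 kb transcript, 10M mapped fragments.
f = fpkm(500, 2000, 10_000_000)
print(f)                                         # 25.0
print(implied_library_millions(500, 2000, f))    # 10.0
```

In practice the recovered library size will not be identical across transcripts, since cufflinks estimates FPKM by maximum likelihood rather than by simple counting, which is one plausible source of the 2-fold spread described above.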



  • Simon Anders
    replied
    Hi Joro,

    "out of vertex space" is an error from locfit, which DESeq uses internally. I've seen the error occuring onl,y once so far, in a quite unusual data set.

    You can try to increase the number of vertices with:
    Code:
    cds <- estimateVarianceFunctions( cds, locfit_extra_args=list(maxk=300) )
If this does not help, could you send me your CountDataSet object (save it with save( cds, file="cds.rda" )) so that I can experiment with it?

    Cheers
    Simon



  • joro
    replied
    Hi Simon,

    When using the command "cds = estimateVarianceFunctions(cds)" I get the following error:

Code:
Error in lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
  newsplit: out of vertex space
In addition: Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
  procv: no points with non-zero weight
(warnings 2 through 10 are identical)

    I'm not sure what's causing the error and was wondering if anyone has seen it before?

    Thank you.



  • Simon Anders
    replied
    Originally posted by diffexpr View Post
I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested, and the ECDF curves moved below the green line. However, the number of hits increased rather than decreased, as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives was actually going up when the ECDF curves were shifted below the green line.
    Wow, this sounds like an embarrassing bug!

    I've had a look immediately -- and, guess what, the variance adjustment does exactly what it should, but what I have written about it in the vignette (and further up in this thread) was completely the wrong way round.

    So, let me try again:

    If the ECDF curve is below the green line, this means that the variance residuals are too high. (If one plots a histogram instead of an ECDF curve, there are too many values towards the right end.) These residuals are the ratios of the per-gene estimates over the fitted values. If they are too high, the fitted values are too low, i.e., DESeq assumes a smaller variance than it should and will hence produce too small p values. To correct, one should set the variance adjustment factor to a value above 1, so that DESeq makes the fitted variance estimate larger before performing the test.
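A toy illustration of that logic (made-up numbers and a plain normal approximation, not DESeq's actual negative-binomial test): under-estimating the variance inflates the test statistic and shrinks the p value, and scaling the fitted variance up by an adjustment factor above 1 undoes it.

```python
import math

def two_sided_p(effect, var):
    """Two-sided p value under a normal approximation."""
    z = abs(effect) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2))

effect = 3.0
true_var = 4.0     # what the variance really is
fitted_var = 2.0   # fitted too low -> p values come out too small

p_anticonservative = two_sided_p(effect, fitted_var)
p_correct = two_sided_p(effect, true_var)
p_adjusted = two_sided_p(effect, 2.0 * fitted_var)  # adjustment factor > 1

print(p_anticonservative < p_correct)  # the too-low variance overstates significance
print(p_adjusted == p_correct)         # here a factor of 2.0 restores the true variance
```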

    If, however, the curves are above the green line, DESeq performs conservatively, i.e., produces p values that may be too high.

Hence, the curves that Abhijit showed further up in this thread were fine all along. They were above the green line, so if he went forward without any variance adjustment, he might have lost a few hits, but DESeq would have maintained type-I error control. This is actually a relief, although I feel quite stupid for noticing this only now, after you pointed me to it.

    I'll correct the vignette right away.

    Thanks for pointing this out
    Simon



  • diffexpr
    replied
    variance adjustment factors

    Hi Simon


I have been trying out DESeq using both simulated and real data, and have downloaded the newer version DESeq 1.1.3 in order to be able to control the variance adjustment factor. Using real data, I tried factors below 1 as suggested, and the ECDF curves moved below the green line. However, the number of hits increased rather than decreased, as would be expected if the number of type I errors were being reduced. By using simulated data I confirmed that the fraction of false positives was actually going up when the ECDF curves were shifted below the green line.



  • Simon Anders
    replied
    Originally posted by gen2prot View Post
Where can I find version 1.1.1 of DESeq? I don't see it on the Bioconductor website.
    The newest version is here:
    http://bioconductor.org/packages/dev...tml/DESeq.html

    This is the Bioconductor devel branch. You probably looked in the release branch.

    See the installation instructions here: http://www-huber.embl.de/users/anders/DESeq/

    Simon



  • gen2prot
    replied
    Hi Simon,

Where can I find version 1.1.1 of DESeq? I don't see it on the Bioconductor website.

    Thanks
    Abhijit



  • Simon Anders
    replied
    Hi Roman

    Originally posted by Roman Bruno View Post
    Hi Simon, when you say that, is it also true for SAGE data ?

    Roman
    Short answer: Yes.

    Long answer:

    The variance that you observe in count data is the sum of two components. First, the shot noise (Poisson noise), whose variance is always equal to the mean. Imagine that you have two samples which contain precisely the same concentration of a transcript and you sequence them to precisely the same depth. Even then, you will not get the same count. This unavoidable variance is the shot noise, and it is always equal to the expected number of counts.

    Second, the biological noise, i.e., the noise between two samples. (Even if you do twice exactly the same, the transcript will not really have the same concentration in your two replicate samples.) Such noise is typically multiplicative, i.e., the variance is proportional to the square of the mean.

Hence, for very low count rates, the shot noise dominates and the biological noise is unimportant. Then the Poisson model, which ignores biological noise, is fine. SAGE count rates are typically so low that the error you make by ignoring biological noise may not be that severe, unless your biological noise is also strong.

    Still, Robinson and Smyth had already proposed a way to do it properly ("edgeR") back when SAGE was new and trendy, so no excuse to be sloppy.

For RNA-Seq data, people sort of got away with it when the Genome Analyzer I produced fewer than a million reads, but now, with the GA IIx and its 20 million reads, you would have to be blind not to notice that testing with a Poisson model always gives the suspicious result that all strongly expressed genes seem to be differentially expressed (unless you cheat by using an extremely small p-value cut-off).
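That two-component decomposition can be checked with a small simulation. A hedged sketch (arbitrary parameters; biological noise modelled as a gamma-distributed multiplicative factor on the Poisson mean, i.e. a gamma-Poisson mixture): the observed variance should land near mu + cv² · mu².

```python
import math, random, statistics

random.seed(42)

def poisson(lam):
    """Knuth's Poisson sampler (fine for small lambda)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < limit:
            return k
        k += 1

def simulate(mu, cv2, n=50_000):
    """Counts with multiplicative biological noise on top of shot noise."""
    if cv2 == 0.0:
        draws = [poisson(mu) for _ in range(n)]
    else:
        shape = 1.0 / cv2  # gamma factor with mean 1, variance cv2
        draws = [poisson(mu * random.gammavariate(shape, cv2)) for _ in range(n)]
    return statistics.mean(draws), statistics.variance(draws)

# Pure shot noise: variance roughly equals the mean (about 20).
print(simulate(mu=20.0, cv2=0.0))
# Added biological noise: variance roughly mu + cv2*mu^2 = 20 + 0.1*400 = 60.
print(simulate(mu=20.0, cv2=0.1))
```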
    Last edited by Simon Anders; 05-05-2010, 12:47 PM. Reason: typos



  • Roman Bruno
    replied
    Originally posted by Simon Anders View Post
    We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.
    Hi Simon, when you say that, is it also true for SAGE data ?

    Roman



  • RSK
    replied
    I get it now. Thanks very much for clarifying!
    I got mixed up.

    -R

    Originally posted by Simon Anders View Post
    Hi RSK,

    we are talking about "DESeq", not "DEGSeq". Both are Bioconductor packages to check for differential expression. Our package (DESeq) estimates the noise from comparing replicates while DEGSeq assumes the so-called Poisson model which does not require noise estimation. We argue (in our preprint) that using the Poisson model for RNA-Seq data is incorrect and leads to wrong p values.

    Sorry for the naming clash between the two packages.

    Simon

