**Cole Trapnell** · 10-04-2010, 08:19 AM

Cuffdiff now supports replicates, so it should handle this sort of setup.

**Simon Anders** · 10-05-2010, 01:02 AM

Cole, if Cufflinks can handle paired samples, I'd be really impressed, but I wonder whether you were simply too fast with your reply and overlooked the fact that the question was about paired samples. If not, please correct me.

For readers unfamiliar with the issue, a quick reminder of text-book knowledge with an example for a paired t test:

Imagine you have 5 samples, for each of which you measure some quantitative trait before and after a certain treatment. Let's say this trait varies according to a normal distribution and the data look like this:

Code:

                    S1    S2    S3    S4    S5
before treatment  4.00  7.30 11.13  8.50  9.50
after treatment   4.72  8.44 12.04  9.66 10.65

The mean is 8.09 before and 9.10 after treatment. The (pooled) standard deviation of the data is 2.7, and hence, the difference of the means (1.01) is not significant. ( t = (9.10-8.09)/2.7=.37, p=.36 )

However, we did not use the sample pairing here. If we want to do this, we first take the differences and then the average, i.e., instead of subtracting the averages, we subtract each untreated value from the corresponding treated value.

The differences are:

Code:

0.72 1.13 0.91 1.15 1.14

The mean is 1.01 as before, but the standard deviation of the difference is only .19, and hence, the difference is clearly significant.
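The comparison above can be reproduced with a short Python sketch using only the standard library. Note one assumption: the unpaired statistic here uses the usual two-sample standard error (pooled SD times sqrt(2/n)) rather than dividing by the pooled SD directly, so its value differs from the quick figure quoted above, while the conclusion stays the same. The variable names are illustrative only.

```python
import math
import statistics

before = [4.00, 7.30, 11.13, 8.50, 9.50]
after = [4.72, 8.44, 12.04, 9.66, 10.65]
n = len(before)

# Unpaired (two-sample, equal-variance) t statistic, 2n - 2 = 8 degrees of freedom.
pooled_var = (statistics.variance(before) + statistics.variance(after)) / 2
pooled_sd = math.sqrt(pooled_var)
se_unpaired = pooled_sd * math.sqrt(2 / n)
t_unpaired = (statistics.mean(after) - statistics.mean(before)) / se_unpaired

# Paired t statistic: work on the per-sample differences, n - 1 = 4 degrees of freedom.
diffs = [a - b for a, b in zip(after, before)]
sd_diff = statistics.stdev(diffs)
t_paired = statistics.mean(diffs) / (sd_diff / math.sqrt(n))

# Two-sided 5% critical values of the t distribution: 2.306 (8 df) and 2.776 (4 df).
print(f"pooled SD      = {pooled_sd:.2f}")
print(f"unpaired t     = {t_unpaired:.2f} (critical value 2.306)")
print(f"SD of diffs    = {sd_diff:.2f}")
print(f"paired t       = {t_paired:.2f} (critical value 2.776)")
```

The unpaired statistic stays well below its critical value, while the paired statistic exceeds its critical value by a wide margin, even though both tests see the same mean difference.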

To come back to DEGs: Whenever samples are paired (i.e., the same sample is measured twice under different conditions) and unless the difference between the samples is typically much smaller than the effect of the treatment, we dramatically lose statistical power if our method is unable to make use of the pairing information.
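One way to see where the power comes from is to write out the variance of the estimated treatment effect. The symbols below are introduced here for illustration and are not from the thread: assume the before and after measurements share a variance $\sigma^2$ and have correlation $\rho$ across the $n$ individuals.

```latex
\begin{align*}
\text{unpaired:}\quad & \operatorname{Var}\!\left(\bar{x}_{\text{after}} - \bar{x}_{\text{before}}\right) = \frac{2\sigma^2}{n} \\
\text{paired:}\quad   & \operatorname{Var}\!\left(\bar{d}\right) = \frac{\operatorname{Var}(x_{\text{after}} - x_{\text{before}})}{n} = \frac{2\sigma^2(1-\rho)}{n}
\end{align*}
```

Under these assumptions the paired analysis reduces the variance by a factor of $1-\rho$, so the gain is large when the between-sample differences dominate (high $\rho$) and vanishes when $\rho$ is close to zero.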

To my knowledge, none of the currently released tools can do this, though.

We have recently expanded our DESeq package to be able to fit generalized linear models (GLMs), and these can be used to model the pairing. Unfortunately, our method to estimate the dispersion (which, for count data, takes the role of the standard deviation of the differences in the example above) does not work for paired designs. We have some ideas how to get around this and are testing them at the moment, but it does not yet work as well as we would like it.
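To make "modelling the pairing" concrete, here is a minimal Python sketch of the design matrix such a GLM would use, with one indicator column per patient (as a blocking factor) plus one treatment column. This is not DESeq or edgeR code; the function name `design_matrix`, the sample labels, and the column names are all hypothetical. It mirrors the layout an additive model with a patient term and a treatment term would typically have.

```python
def design_matrix(patients, treatment, baseline="before"):
    """Build an additive patient + treatment design matrix (hypothetical helper).

    The first patient level is absorbed into the intercept; every other
    patient gets an indicator column, and the last column flags treated samples.
    """
    levels = sorted(set(patients))
    columns = ["intercept"] + [f"patient_{p}" for p in levels[1:]] + ["treated"]
    rows = []
    for p, t in zip(patients, treatment):
        row = [1]                                            # intercept
        row += [1 if p == lvl else 0 for lvl in levels[1:]]  # patient block
        row.append(0 if t == baseline else 1)                # treatment effect
        rows.append(row)
    return columns, rows


# Three patients, each sequenced before and after treatment (made-up labels).
patients = ["P1", "P1", "P2", "P2", "P3", "P3"]
treatment = ["before", "after"] * 3

columns, rows = design_matrix(patients, treatment)
print(columns)
for row in rows:
    print(row)
```

With this layout, the patient columns absorb the baseline differences between individuals, and the test for differential expression is a test of whether the `treated` coefficient is zero.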

As far as I know, the edgeR people seem to pursue similar ideas.

DESeq offers a function for a "variance stabilizing transformation" which translates the count data onto a continuous scale such that it becomes approximately homoscedastic. This then allows the use of tools that worked well for microarrays, such as pairwise t-tests or Smyth's 'limma' package. However, the transformation costs power and introduces bias in case the library sizes are too different. Still, it may be a good way to get started.
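As a rough illustration of the idea behind such a transformation (not DESeq's actual implementation, which works from the mean-variance relationship estimated from the data): by the delta method, a transformation $g$ whose derivative is $1/\sqrt{v(\mu)}$ has approximately constant variance when the counts $K$ have mean $\mu$ and variance $v(\mu)$. Assuming a negative binomial variance with a single constant dispersion $\alpha$ (a simplification and a symbol introduced here), this gives a closed form:

```latex
\begin{align*}
v(\mu) &= \mu + \alpha\mu^2, \\
g(\mu) &= \int \frac{d\mu}{\sqrt{\mu + \alpha\mu^2}}
        = \frac{2}{\sqrt{\alpha}}\,\operatorname{arsinh}\!\left(\sqrt{\alpha\mu}\right), \\
\operatorname{Var}\!\left[g(K)\right] &\approx g'(\mu)^2\, v(\mu) = 1 .
\end{align*}
```

In the limit $\alpha \to 0$ this reduces to $2\sqrt{\mu}$, the familiar square-root transformation for Poisson counts.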

Simon

**Cole Trapnell** · 10-05-2010, 04:47 AM

Cuffdiff doesn't explicitly support sample pairing, hypatia, but I suggest you try the newest version of Cuffdiff (0.9.1) if you haven't already, as it should get you started pretty quickly. You may see fewer differentially expressed genes due to a loss in power, but hopefully being able to find differentially spliced genes or those undergoing shifts in promoter preference will make up for it.

**hypatia** · 10-05-2010, 05:45 AM

My first objective is really differential expression, especially in the lower range of expression. I already know that in my disease model the correlation of each patient's fold change is not high, around 0.4, so the paired model will make a huge difference here.

**Davis McC** · 10-05-2010, 07:54 PM

GLMs in edgeR

Hi hypatia and others

We have recently implemented GLM methods in edgeR, so the package can now deal with paired designs as well as other more complicated designs. One of the PhD students in our division has been working on using Cox-Reid conditional inference to estimate the dispersion parameter for the negative binomial model. This approach does take into account the paired nature of the samples (or indeed works whatever the experimental design) and has been giving us very reasonable results in our testing.

Following Simon's work with DESeq, the edgeR methods can now also add a gene abundance-related trend on the dispersion estimates.

These new methods, the GLMs with CR estimation of the dispersion (plus a whole lot of other improvements), are all implemented in the current development version of edgeR in the Bioconductor repository. They will be rolled out into the release version with the release of Bioconductor 2.7 on 18 October.

We'll be adding to the documentation over the next couple of weeks to get some examples in there of using these methods on paired designs and other more complicated experimental designs.

The new edgeR methods have been developed with exactly this sort of application in mind, so I certainly encourage you to give them a try. We'd be really interested in how they work for you.

Best regards
Davis

**lpachter** · 10-06-2010, 08:49 AM

Hi Hypatia,

From my current reading of the literature, it seems to me that baySeq may be a good solution for you right now:

baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data - BMC Bioinformatics

http://www.biomedcentral.com/1471-2105/11/422/

Background: High throughput sequencing has become an important technology for studying expression levels in many types of genomic, and particularly transcriptomic, data. One key way of analysing such data is to look for elements of the data which display particular patterns of differential expression in order to take these forward for further analysis and validation. Results: We propose a framework for defining patterns of differential expression and develop a novel algorithm, baySeq, which uses an empirical Bayes approach to detect these patterns of differential expression within a set of sequencing samples. The method assumes a negative binomial distribution for the data and derives an empirically determined prior distribution from the entire dataset. We examine the performance of the method on real and simulated data. Conclusions: Our method performs at least as well, and often better, than existing methods for analyses of pairwise differential expression in both real and simulated data. When we compare methods for the analysis of data from experimental designs involving multiple sample groups, our method again shows substantial gains in performance. We believe that this approach thus represents an important step forward for the analysis of count data from sequencing experiments.

Regarding Cufflinks, I would like to correct Simon who is not a developer of the program and therefore may not understand it completely (Simon, please correct me if I am wrong). It is not accurate to say that it does not support sample pairing.
What Cufflinks does is estimate expression values according to a generative model of the sequencing process, one that currently takes into account sequencing bias of various kinds. The value of paired samples is that the experimental bias should be similar in the pairs, and this will be implicitly "learned" by Cufflinks, so that it's not clear to me a priori that it will produce inferior results to methods that learn the distributions of counts.

**Simon Anders** · 10-06-2010, 09:12 AM

Originally posted by lpachter:

From my current reading of the literature, it seems to me that baySeq may be a good solution for you right now:
http://www.biomedcentral.com/1471-2105/11/422/

I am afraid, no. BaySeq's paper abstract advertises its ability to deal with more complex designs but I looked a bit closer at the paper and it seems that it focuses on nested one-way designs and cannot deal with crossed factors (two-way anova) as one would need for paired samples. At least if I have understood its approach correctly.

Regarding Cufflinks, I would like to correct Simon who is not a developer of the program and therefore may not understand it completely (Simon, please correct me if I am wrong). It is not accurate to say that it does not support sample pairing.
What Cufflinks does is estimate expression values according to a generative model of the sequencing process, one that currently takes into account sequencing bias of various kinds. The value of paired samples is that the experimental bias should be similar in the pairs, and this will be implicitly "learned" by Cufflinks, so that it's not clear to me a priori that it will produce inferior results to methods that learn the distributions of counts.

Actually, I did not say that it does not support paired sampling. Cole said so (and he is a developer ;-) ).

I'm a bit puzzled what you might mean by "implicit learning", and by the fact that you talk about sequencing bias. (The value of paired designs, as I understand the term, is not to reduce bias but to reduce variance.) Anyway, I guess, this discussion has to wait until you have written up and published the method behind the new biological replicate functionality.

Simon

**lpachter** · 10-06-2010, 03:06 PM

Regarding baySeq, I am not an author on that software so I cannot speak for the details of it. I just mentioned it because it seemed like they do a lot of things right on a lot of aspects of differential expression analysis. They actually say in their paper that they do not handle paired samples, but as you pointed out no program currently does, and many of the other details matter as well.

The relationship between bias and variance is that bias causes variance. For an explanation of this see

http://genomebiology.com/2010/11/5/R50

**f1boston** · 01-18-2011, 03:52 PM

Hi Davis,

Do you have any updates on the documentation for using edgeR on datasets with a paired-sample design?

You also mentioned some "very reasonable results in our testing"... are these results publicly available now?

All the best!

**Davis McC** · 01-30-2011, 04:29 PM

Hi f1boston

The edgeR functions for analysing differential expression with a paired samples design are documented in the package. We are planning on updating the User's Guide substantially to include better examples of using the GLM methods with paired and other experimental designs, but unfortunately this has taken a back seat while we have been putting a lot of work into the development of the new methods.

We don't have any publicly available results as such - but all of the methods are available through Bioconductor so anyone could test them themselves. I certainly encourage you to do so! Finding a suitable yardstick for comparison with other/previous methods is difficult - hence why I said that the results look reasonable. They do look reasonable, but the actual 'truth' is not known in the datasets we have seen. The important point is that the GLM methods can properly analyse paired designs, whereas our older methods could not.

If you have more specific questions that I may be able to help you with please feel free to get in touch.

Best regards
Davis

**sheng** · 10-27-2011, 09:58 AM

How many pairs are needed to gain statistical power in a paired study using GLM methods?

Hi Davis,

I was wondering, when using the edgeR GLM method for a paired study, how many pairs would be required to reach a given statistical power?

Cheers,
Sheng

**Davis McC** · 01-08-2012, 06:30 PM

Hi Sheng

edgeR can indeed be used to analyse RNA-seq data from paired designs, but your question is far, far too vague for me to be able to give you any sensible answer about how many samples you need.

In general, the answer is "as many as you can afford".

Cheers
Davis

**aggp11** · 01-11-2012, 10:58 AM

Simon,

I am sorry if this sounds stupid, but how does the experimental design of the poster differ from the pasillaGenes example in the DESeq documentation? I am probably not able to get a handle on what the poster might mean by paired samples. Could you help me understand what we mean by a paired-sample experimental design?

Thanks,
Praful

**aggp11** · 01-11-2012, 11:17 AM

Simon,

Never mind, I think I now understand the issue.

Thanks,
Praful

DEG for paired samples, biological replicates
