Seqanswers Leaderboard Ad

**sadiexiaoyu** · 06-03-2013, 04:41 AM

Originally posted by Simon Anders View Post

Cole, if cufflinks can handle paired samples, I'd be really impressed, but I wonder if you simply were to fast with your reply and overlooked the fact that the question was about paired samples. If not, please correct me.

For readers unfamiliar with the issue, a quick reminder of text-book knowledge with an example for a paired t test:

Imagine you have 5 sample, for which you measure some quantitative trait before and after a certain treatment. Let's say this trait varies according to a normal distribution and the data looks like this

Code:

                    S1    S2    S3    S4    S5
before treatment  4.00  7.30 11.13  8.50  9.50
after treatment   4.72  8.44 12.04  9.66 10.65

The mean is 8.09 before and 9.10 after treatment. The (pooled) standard deviation of the data is 2.7, and hence, the difference of the means (1.01) is not significant. ( t = (9.10-8.09)/2.7=.37, p=.36 )

However, we did not use the sample pairing here. If we want to do this, we first take the differences and then the average, i.e., instead of subtracting the averages, we subtract each treated value from the corresponding untreated value.

The differences are:

Code:

0.72 1.13 0.91 1.15 1.14

The mean is 1.01 as before, but the standard deviation of the difference is only .19, and hence, the difference is clearly significant.

To come back to DEGs: Whenever samples are paired (i.e., the same sample is measured twice under different conditions) and unless the difference between the samples is typically much smaller than the effect of the treatment, we dramatically loose statistical power if our method is unable to make use of the pairing information.

To my knowledge, none of the currently released tools can do this, though.

We have recently expanded our DESeq package to be able to fit generalized linear models (GLMs), and these can be used to model the pairing. Unfortunately, our method to estimate the dispersion (which, for count data, takes the role of the standard deviation of the differences in the example above) does not work for paired designs. We have some ideas how to get around this and are testing them at the moment, but it does not yet work as well as we would like it.

As far as I know, the edgeR people seem to pursue similar ideas.

DESeq offers a function for a "variance stabilizing transformation" which translates the count data onto a continuous scale such that it becomes approximately homoscedastic. This allows then to use tools that worked well for microarrays, such as pairwise t-tests or Smyth's 'limma' package. However, the transformation costs power and introduces bias in case the library sizes are too different. Still, it may a good way to get started.

Simon

Dear Simon,

I tried to use edgeR to analyse paired samples, and I also would like to use DESeq, but I just found "multi-factor designs" using GLM in the DESeq manual. This cannot be used for paired samples, right? Could you give me any suggestion to analyse paired samples using DESeq? I noticed this thread was posted long time ago, so maybe there is new method....?
Thank you!

Best,
Sadiexiaoyu

**Simon Anders** · 06-03-2013, 07:40 AM

Originally posted by sadiexiaoyu View Post

I tried to use edgeR to analyse paired samples, and I also would like to use DESeq, but I just found "multi-factor designs" using GLM in the DESeq manual. This cannot be used for paired samples, right? Could you give me any suggestion to analyse paired samples using DESeq? I noticed this thread was posted long time ago, so maybe there is new method....?

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

**sadiexiaoyu** · 06-03-2013, 12:17 PM

Originally posted by Simon Anders View Post

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

Dear Simon,

Thank you for your answer. But I still did not get it clearly.
1) Do you mean using design like this:
> design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
> design
subject treatment
1 sample1 b,n
2 sample2 b,n
3 sample3 b,n
However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",could you tell me how can I combine my count_data with this design?

2) I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
> x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
> head(x)
X1b X2b X3b X1n X2n X3n
gene1 1562 1380 1533 821 799 820
gene2 74 79 83 51 57 45
gene3 997 837 1230 700 513 705
gene4 83 59 81 49 71 67
gene5 517 456 512 301 322 307
gene6 0 0 2 0 0 0
> design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
> design
treatment subject
X1b b sample1
X2b b sample2
X3b b sample3
X1n n sample1
X2n n sample2
X3n n sample3
> cdsFull=newCountDataSet(x,design)
and using
> fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
> fit0=fitNbinomGLMs(cdsFilt,count~subject)
> pvalsGLM=nbinomGLMTest(fit1,fit0)
> padjGLM=p.adjust(pvalsGLM,method="BH")

3)You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
if I set
design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
Am I right?

Actually, I used
design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
> topTags(lrt)
Coefficient: treatment n
genes logFC logCPM LR PValue FDR
15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

> colnames(design)
[1] "(Intercept)" "sample2" "sample3" "treatmentn"

I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

Best,

Sadiexiaoyu

**sadiexiaoyu** · 06-03-2013, 12:19 PM

Originally posted by Simon Anders View Post

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

Dear Simon,

Thank you for your answer. But I still did not get it clearly.
1) Do you mean using design like this:
> design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
> design
subject treatment
1 sample1 b,n
2 sample2 b,n
3 sample3 b,n
However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",could you tell me how can I combine my count_data with this design?

2) I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
> x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
> head(x)
X1b X2b X3b X1n X2n X3n
gene1 1562 1380 1533 821 799 820
gene2 74 79 83 51 57 45
gene3 997 837 1230 700 513 705
gene4 83 59 81 49 71 67
gene5 517 456 512 301 322 307
gene6 0 0 2 0 0 0
> design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
> design
treatment subject
X1b b sample1
X2b b sample2
X3b b sample3
X1n n sample1
X2n n sample2
X3n n sample3
> cdsFull=newCountDataSet(x,design)
and using
> fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
> fit0=fitNbinomGLMs(cdsFilt,count~subject)
> pvalsGLM=nbinomGLMTest(fit1,fit0)
> padjGLM=p.adjust(pvalsGLM,method="BH")

3)You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
if I set
design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
Am I right?

Actually, I used
design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
> topTags(lrt)
Coefficient: treatment n
genes logFC logCPM LR PValue FDR
15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

> colnames(design)
[1] "(Intercept)" "sample2" "sample3" "treatmentn"

I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

Best,

Sadiexiaoyu

**sadiexiaoyu** · 06-03-2013, 12:34 PM

Originally posted by Simon Anders View Post

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

Dear Simon,

Thank you for your answer. But I still did not get it clearly.
1) Do you mean using design like this:
>design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
> design
subject treatment
1 sample1 b,n
2 sample2 b,n
3 sample3 b,n
However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",how can I combine my count_data with this design?

Best,

Sadiexiaoyu

**sadiexiaoyu** · 06-03-2013, 12:35 PM

Originally posted by Simon Anders View Post

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

And my second question is,
I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
> x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
> head(x)
X1b X2b X3b X1n X2n X3n
gene1 1562 1380 1533 821 799 820
gene2 74 79 83 51 57 45
gene3 997 837 1230 700 513 705
gene4 83 59 81 49 71 67
gene5 517 456 512 301 322 307
gene6 0 0 2 0 0 0
> design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
> design
treatment subject
X1b b sample1
X2b b sample2
X3b b sample3
X1n n sample1
X2n n sample2
X3n n sample3
> cdsFull=newCountDataSet(x,design)
and using
> fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
> fit0=fitNbinomGLMs(cdsFilt,count~subject)
> pvalsGLM=nbinomGLMTest(fit1,fit0)
> padjGLM=p.adjust(pvalsGLM,method="BH")

**sadiexiaoyu** · 06-03-2013, 12:36 PM

Originally posted by Simon Anders View Post

Sure, we fixed the issue with the dispersion estimator long ago.

Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). Voilà, a paired test.

AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

And my third question is,
You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
if I set
design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
Am I right?

Actually, I used
design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
> topTags(lrt)
Coefficient: treatment n
genes logFC logCPM LR PValue FDR
15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

> colnames(design)
[1] "(Intercept)" "sample2" "sample3" "treatmentn"

I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

Best,

Sadiexiaoyu

**Simon Anders** · 06-03-2013, 10:20 PM

About your first question (post #18): You answered this yourself in the your next post. You need a dsign fram with one row for each column in your count data set. So each "sample" appears twice. (And hence, one maybe should not call it a "sample" but that's just semantics.)

You second question (post #19): Yes, this code is correct.

And for the third (post #20): I am not an expert on edgeR but I think this is correct, too.

**sadiexiaoyu** · 06-04-2013, 12:48 AM

Originally posted by Simon Anders View Post

About your first question (post #18): You answered this yourself in the your next post. You need a dsign fram with one row for each column in your count data set. So each "sample" appears twice. (And hence, one maybe should not call it a "sample" but that's just semantics.)

You second question (post #19): Yes, this code is correct.

And for the third (post #20): I am not an expert on edgeR but I think this is correct, too.

Dear Simon,

Thank you!

Best,

Sadiexiaoyu

**younko** · 05-06-2014, 06:12 PM

same for me`

hello sadiexiaoyu

did you get your answers?
I also have exactly same question with you! And I totally agree with your point. As far as I understand your points is right, in that way, we should be able to test the effect of treatment by adjusting with subject(for paired sample).

Could you please some one (know better than me

), confirm this question?

Thanks in advance

**Simon Anders** · 05-07-2014, 01:22 AM

Yes, this is correct.

**younko** · 05-07-2014, 01:25 AM

good to know

Dear Simon,

I really appreciate your confirmation!
One more quick question.

If I have multiple disease given design. (e.g. paired sample)
how can I handle ?? could you please give some comments for this ? Please refer the following link!

DEG analysis with paired sample without replication - SEQanswers

http://seqanswers.com/forums/showthread.php?t=42985

Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

with best regards,

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 26 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News