Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by Simon Anders View Post
    Cole, if cufflinks can handle paired samples, I'd be really impressed, but I wonder if you simply were to fast with your reply and overlooked the fact that the question was about paired samples. If not, please correct me.

    For readers unfamiliar with the issue, a quick reminder of text-book knowledge with an example for a paired t test:
    Imagine you have 5 sample, for which you measure some quantitative trait before and after a certain treatment. Let's say this trait varies according to a normal distribution and the data looks like this

    Code:
                        S1    S2    S3    S4    S5
    before treatment  4.00  7.30 11.13  8.50  9.50
    after treatment   4.72  8.44 12.04  9.66 10.65
    The mean is 8.09 before and 9.10 after treatment. The (pooled) standard deviation of the data is 2.7, and hence, the difference of the means (1.01) is not significant. ( t = (9.10-8.09)/2.7=.37, p=.36 )

    However, we did not use the sample pairing here. If we want to do this, we first take the differences and then the average, i.e., instead of subtracting the averages, we subtract each treated value from the corresponding untreated value.

    The differences are:
    Code:
    0.72 1.13 0.91 1.15 1.14
    The mean is 1.01 as before, but the standard deviation of the difference is only .19, and hence, the difference is clearly significant.

    To come back to DEGs: Whenever samples are paired (i.e., the same sample is measured twice under different conditions) and unless the difference between the samples is typically much smaller than the effect of the treatment, we dramatically loose statistical power if our method is unable to make use of the pairing information.

    To my knowledge, none of the currently released tools can do this, though.

    We have recently expanded our DESeq package to be able to fit generalized linear models (GLMs), and these can be used to model the pairing. Unfortunately, our method to estimate the dispersion (which, for count data, takes the role of the standard deviation of the differences in the example above) does not work for paired designs. We have some ideas how to get around this and are testing them at the moment, but it does not yet work as well as we would like it.

    As far as I know, the edgeR people seem to pursue similar ideas.

    DESeq offers a function for a "variance stabilizing transformation" which translates the count data onto a continuous scale such that it becomes approximately homoscedastic. This allows then to use tools that worked well for microarrays, such as pairwise t-tests or Smyth's 'limma' package. However, the transformation costs power and introduces bias in case the library sizes are too different. Still, it may a good way to get started.

    Simon
    Dear Simon,

    I tried to use edgeR to analyse paired samples, and I also would like to use DESeq, but I just found "multi-factor designs" using GLM in the DESeq manual. This cannot be used for paired samples, right? Could you give me any suggestion to analyse paired samples using DESeq? I noticed this thread was posted long time ago, so maybe there is new method....?
    Thank you!

    Best,
    Sadiexiaoyu

    Comment


    • #17
      Originally posted by sadiexiaoyu View Post
      I tried to use edgeR to analyse paired samples, and I also would like to use DESeq, but I just found "multi-factor designs" using GLM in the DESeq manual. This cannot be used for paired samples, right? Could you give me any suggestion to analyse paired samples using DESeq? I noticed this thread was posted long time ago, so maybe there is new method....?
      Sure, we fixed the issue with the dispersion estimator long ago.

      Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

      Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

      AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".

      Comment


      • #18
        Originally posted by Simon Anders View Post
        Sure, we fixed the issue with the dispersion estimator long ago.

        Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

        Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

        AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".
        Dear Simon,

        Thank you for your answer. But I still did not get it clearly.
        1) Do you mean using design like this:
        > design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
        > design
        subject treatment
        1 sample1 b,n
        2 sample2 b,n
        3 sample3 b,n
        However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",could you tell me how can I combine my count_data with this design?

        2) I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
        > x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
        > head(x)
        X1b X2b X3b X1n X2n X3n
        gene1 1562 1380 1533 821 799 820
        gene2 74 79 83 51 57 45
        gene3 997 837 1230 700 513 705
        gene4 83 59 81 49 71 67
        gene5 517 456 512 301 322 307
        gene6 0 0 2 0 0 0
        > design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
        > design
        treatment subject
        X1b b sample1
        X2b b sample2
        X3b b sample3
        X1n n sample1
        X2n n sample2
        X3n n sample3
        > cdsFull=newCountDataSet(x,design)
        and using
        > fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
        > fit0=fitNbinomGLMs(cdsFilt,count~subject)
        > pvalsGLM=nbinomGLMTest(fit1,fit0)
        > padjGLM=p.adjust(pvalsGLM,method="BH")

        3)You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
        design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
        if I set
        design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
        Am I right?

        Actually, I used
        design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
        > topTags(lrt)
        Coefficient: treatment n
        genes logFC logCPM LR PValue FDR
        15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
        20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
        15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
        731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
        17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
        334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
        22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
        8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
        23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
        5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

        > colnames(design)
        [1] "(Intercept)" "sample2" "sample3" "treatmentn"

        I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

        Best,

        Sadiexiaoyu

        Comment


        • #19
          Originally posted by Simon Anders View Post
          Sure, we fixed the issue with the dispersion estimator long ago.

          Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

          Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

          AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".
          Dear Simon,

          Thank you for your answer. But I still did not get it clearly.
          1) Do you mean using design like this:
          > design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
          > design
          subject treatment
          1 sample1 b,n
          2 sample2 b,n
          3 sample3 b,n
          However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",could you tell me how can I combine my count_data with this design?

          2) I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
          > x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
          > head(x)
          X1b X2b X3b X1n X2n X3n
          gene1 1562 1380 1533 821 799 820
          gene2 74 79 83 51 57 45
          gene3 997 837 1230 700 513 705
          gene4 83 59 81 49 71 67
          gene5 517 456 512 301 322 307
          gene6 0 0 2 0 0 0
          > design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
          > design
          treatment subject
          X1b b sample1
          X2b b sample2
          X3b b sample3
          X1n n sample1
          X2n n sample2
          X3n n sample3
          > cdsFull=newCountDataSet(x,design)
          and using
          > fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
          > fit0=fitNbinomGLMs(cdsFilt,count~subject)
          > pvalsGLM=nbinomGLMTest(fit1,fit0)
          > padjGLM=p.adjust(pvalsGLM,method="BH")

          3)You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
          design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
          if I set
          design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
          Am I right?

          Actually, I used
          design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
          > topTags(lrt)
          Coefficient: treatment n
          genes logFC logCPM LR PValue FDR
          15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
          20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
          15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
          731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
          17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
          334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
          22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
          8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
          23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
          5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

          > colnames(design)
          [1] "(Intercept)" "sample2" "sample3" "treatmentn"

          I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

          Best,

          Sadiexiaoyu

          Comment


          • #20
            Originally posted by Simon Anders View Post
            Sure, we fixed the issue with the dispersion estimator long ago.

            Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

            Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

            AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".
            Dear Simon,

            Thank you for your answer. But I still did not get it clearly.
            1) Do you mean using design like this:
            >design=data.frame(subject=c("sample1","sample2","sample3"),treatment=c("b,n","b,n","b,n"))
            > design
            subject treatment
            1 sample1 b,n
            2 sample2 b,n
            3 sample3 b,n
            However, my count_data "sample1b“, ”sample1n","sample2b","sample2n","sample3b","sample3n",how can I combine my count_data with this design?


            Best,

            Sadiexiaoyu

            Comment


            • #21
              Originally posted by Simon Anders View Post
              Sure, we fixed the issue with the dispersion estimator long ago.

              Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

              Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

              AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".
              And my second question is,
              I tried to use DESeq to do this paired analysis in the following (using DESeq manual "multi-factor comparison" as reference), is this right?
              > x<-read.delim("matrix for DESeq_n_b_new.txt",row.names=1,stringsAsFactors=FALSE)
              > head(x)
              X1b X2b X3b X1n X2n X3n
              gene1 1562 1380 1533 821 799 820
              gene2 74 79 83 51 57 45
              gene3 997 837 1230 700 513 705
              gene4 83 59 81 49 71 67
              gene5 517 456 512 301 322 307
              gene6 0 0 2 0 0 0
              > design=data.frame(row.names=colnames(x),treatment=c("b","b","b","n","n","n"),subject=c("sample1","sample2","sample3"))
              > design
              treatment subject
              X1b b sample1
              X2b b sample2
              X3b b sample3
              X1n n sample1
              X2n n sample2
              X3n n sample3
              > cdsFull=newCountDataSet(x,design)
              and using
              > fit1=fitNbinomGLMs(cdsFilt,count~subject+treatment)
              > fit0=fitNbinomGLMs(cdsFilt,count~subject)
              > pvalsGLM=nbinomGLMTest(fit1,fit0)
              > padjGLM=p.adjust(pvalsGLM,method="BH")

              Comment


              • #22
                Originally posted by Simon Anders View Post
                Sure, we fixed the issue with the dispersion estimator long ago.

                Let's say you have 2n samples, which come in n pairs. For example, we have a treatment and a control sample from each of n subjects. Then you proceed as follows:

                Make a design frame, i.e., a table describing the 2n samples, with one row per sample and two columns, say, "subject" (with levels "Subject_1", "Subject_2", "Subject_3", ...) and "treatment" (with levels "no" and "yes") and pass this as second argument to "newCountDataSet" as shown in the vignette's section on GLMs. Then fit two GLMs, namely "count ~ subject" as reduced model and "count ~ subject + treatment" as full model and compare them, again exactly as shown in the vignette (where we have "libType" instead of "subject"). VoilĂ , a paired test.

                AFAIK, it works the same way in edgeR, only that you don't specify the reduced formula explicitely but rather, edgeR simply drops the last term in the full formula to get the reduced one, i.e., "treatment" has to come after "subject".
                And my third question is,
                You said "edgeR simply drops the last term in the full formula to get the reduced one", is that mean, for example, if I set
                design<-model.matrix(~subject+treatment) in edge R, then edgeR just get "treatment" with adjusting "subject" differences;
                if I set
                design<-model.matrix(~treatment+subject) in edgeR, then edgeR will get "subject" with adjusting "treatment" differences.
                Am I right?

                Actually, I used
                design<-model.matrix(~subject+treatment) in edgeR, and I get the following results:
                > topTags(lrt)
                Coefficient: treatment n
                genes logFC logCPM LR PValue FDR
                15498 gene10 -3.460666 8.556419 400.1173 5.192712e-89 9.278857e-85
                20675 gene8 4.608737 7.967542 394.4721 8.796487e-88 7.859221e-84
                15990 gene5 -3.287753 5.762309 377.7188 3.904981e-84 2.325937e-80
                731 gene71 -3.515135 7.259228 376.8333 6.087180e-84 2.719295e-80
                17873 gene889 -3.560003 7.921179 363.5712 4.698794e-81 1.679255e-77
                334 gene34 -3.320911 8.088810 362.5978 7.654637e-81 2.279679e-77
                22052 gene2272 -3.718390 8.846441 356.9188 1.319823e-79 3.369132e-76
                8133 gene89 -3.004279 6.273042 336.3495 3.980106e-75 8.890063e-72
                23453 gene275 -3.576274 7.439596 330.8462 6.287517e-74 1.248352e-70
                5681 gene56 -3.205917 6.550010 330.3990 7.868486e-74 1.406020e-70

                > colnames(design)
                [1] "(Intercept)" "sample2" "sample3" "treatmentn"

                I think the logFC is the foldchange between treatment n vs treatment b, using sample1 as the baseline for adjusting the differences among individuals. Is that right?

                Best,

                Sadiexiaoyu

                Comment


                • #23
                  About your first question (post #18): You answered this yourself in the your next post. You need a dsign fram with one row for each column in your count data set. So each "sample" appears twice. (And hence, one maybe should not call it a "sample" but that's just semantics.)

                  You second question (post #19): Yes, this code is correct.

                  And for the third (post #20): I am not an expert on edgeR but I think this is correct, too.

                  Comment


                  • #24
                    Originally posted by Simon Anders View Post
                    About your first question (post #18): You answered this yourself in the your next post. You need a dsign fram with one row for each column in your count data set. So each "sample" appears twice. (And hence, one maybe should not call it a "sample" but that's just semantics.)

                    You second question (post #19): Yes, this code is correct.

                    And for the third (post #20): I am not an expert on edgeR but I think this is correct, too.
                    Dear Simon,

                    Thank you!

                    Best,

                    Sadiexiaoyu

                    Comment


                    • #25
                      same for me`

                      hello sadiexiaoyu

                      did you get your answers?
                      I also have exactly same question with you! And I totally agree with your point. As far as I understand your points is right, in that way, we should be able to test the effect of treatment by adjusting with subject(for paired sample).

                      Could you please some one (know better than me ), confirm this question?

                      Thanks in advance

                      Comment


                      • #26
                        Yes, this is correct.

                        Comment


                        • #27
                          good to know

                          Dear Simon,

                          I really appreciate your confirmation!
                          One more quick question.

                          If I have multiple disease given design. (e.g. paired sample)
                          how can I handle ?? could you please give some comments for this ? Please refer the following link!

                          Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


                          with best regards,
                          Last edited by younko; 05-07-2014, 01:27 AM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Choosing Between NGS and qPCR
                            by seqadmin



                            Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                            10-18-2024, 07:11 AM
                          • seqadmin
                            Non-Coding RNA Research and Technologies
                            by seqadmin




                            Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                            Nobel Prize for MicroRNA Discovery
                            This week,...
                            10-07-2024, 08:07 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 11-01-2024, 06:09 AM
                          0 responses
                          15 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-30-2024, 05:31 AM
                          0 responses
                          17 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-24-2024, 06:58 AM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-23-2024, 08:43 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X