Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deseq2 design

    Dear all,

    I have one question about the design of the condition: I am doing gene silencing screening with 2 different shRNA (targeted to the same gene). Before I do the RNA-seq, I know both of these two shRNA bring my gene of interest down 50% compare with control (bench data), and I want to see the differential expression genes in the knock down condition. So I have 9 samples in total with 3 biological replicates per condition:

    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_A
    treatedA2 shRNA_A
    treatedA3 shRNA_A
    treatedB1 shRNA_B
    treatedB2 shRNA_B
    treatedB3 shRNA_B

    First, I did the Deseq2 analysis with two groups comparison: control vs shRNA_A and control vs shRNA_B.
    design as follows:
    condition
    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_A
    treatedA2 shRNA_A
    treatedA3 shRNA_A

    condition
    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_B
    treatedA2 shRNA_B
    treatedA3 shRNA_B


    From the two dataset, I can see my gene goes down 50% (consistent with my bench data), but there are different number of regulated genes (with padj < 0.1) in the two dataset:
    control vs shRNA_A 167 genes
    control vs shRNA_B 431 genes

    there are 50 common regulated genes in the two lists, but the non-overlapped genes have similar expression trend in the two shRNA condition (just because of the padj, those genes are not choose in the list).

    Is my analysis design correct? or shall I do another type of design like this?

    condition type
    untreated1 control untreated
    untreated2 control untreated
    untreated3 control untreated
    treatedA1 shRNA_A treated
    treatedA2 shRNA_A treated
    treatedA3 shRNA_A treated
    treatedB1 shRNA_B treated
    treatedB2 shRNA_B treated
    treatedB3 shRNA_B treated

    ideally I would like to see the genes in two shRNA treated condition together (like anova test), but I do not know how to design it?

    Thanks in advance for any advice or comments!

    Best,
    Wei

  • #2
    I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.

    Comment


    • #3
      Yes, as Devon states, if you want to test (control vs shRNA_A) and (control vs shRNA_B) at once you can use the LRT,

      with a condition vector like so: c("control", ..., "shRNA_A", ..., "shRNA_B", ... )

      you can use:

      Code:
      design(dds) = ~ condition
      dds = DESeq(dds, test="LRT", reduced = ~ 1)
      results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).

      If you want to make the two comparisons individually, it's best to run all the samples together and produce results tables like so:

      Code:
      dds = DESeq(dds)
      results(dds, contrast=c("condition","shRNA_A","control"))
      results(dds, contrast=c("condition","shRNA_B","control"))

      Comment


      • #4
        Thanks a lot Devon and Michael!

        Like you recommended I run all the samples together to get better variance estimates. I tried to my design as ~ condition+type. I had one error"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others", I guess this is a wrong design.

        Here is my code:

        > countdata = as.matrix(read.table("data.txt", header=T, row.names=1))
        > head (countdata)
        untreated1 untreated2 untreated3 treatedA1 treatedA2 treatedA3 treatedB1 treatedB2 treatedB3
        ENSMUSG00000000001 1233 1091 1259 1234 1118 1082 1454 1086 1143
        ENSMUSG00000000003 0 0 0 0 0 0 0 0 0
        ENSMUSG00000000028 76 68 61 75 81 82 93 74 88
        ENSMUSG00000000031 2 3 5 2 2 2 12 5 2
        ENSMUSG00000000037 24 31 38 27 26 14 47 43 45
        ENSMUSG00000000049 6 7 10 11 19 16 20 11 15
        > coldata= read.table ("design.txt", header = T, row.names=1)
        >
        > coldata
        condition type
        untreated1 control untreated
        untreated2 control untreated
        untreated3 control untreated
        treatedA1 shRNA_A treated
        treatedA2 shRNA_A treated
        treatedA3 shRNA_A treated
        treatedB1 shRNA_B treated
        treatedB2 shRNA_B treated
        treatedB3 shRNA_B treated
        > dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition + type)
        Error in DESeqDataSet(se, design = design, ignoreRank) :
        the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others



        Then I do it like Michael's code, I delete the type column in coldata file.

        > coldata
        condition
        untreated1 control
        untreated2 control
        untreated3 control
        treatedA1 shRNA_A
        treatedA2 shRNA_A
        treatedA3 shRNA_A
        treatedB1 shRNA_B
        treatedB2 shRNA_B
        treatedB3 shRNA_B

        > dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition)
        > design(dds) = ~ condition
        > dds = DESeq(dds, test="LRT", reduced = ~ 1)
        estimating size factors
        estimating dispersions
        gene-wise dispersion estimates
        mean-dispersion relationship
        final dispersion estimates
        fitting model and testing
        > res <- results(dds)
        > head(res)
        log2 fold change: condition shRNA B vs control
        LRT p-value: '~ condition' vs '~ 1'
        DataFrame with 6 rows and 6 columns
        baseMean log2FoldChange lfcSE stat pvalue
        <numeric> <numeric> <numeric> <numeric> <numeric>
        ENSMUSG00000000001 1183.648139 -0.1000936 0.09353996 1.191909 0.55103647
        ENSMUSG00000000003 0.000000 NA NA NA NA
        ENSMUSG00000000028 77.301980 0.1806468 0.21423655 1.095024 0.57838715
        ENSMUSG00000000031 3.688155 0.7430438 0.92477711 2.323728 0.31290243
        ENSMUSG00000000037 32.345099 0.4137095 0.32571884 6.944625 0.03104515
        ENSMUSG00000000049 12.603642 0.8593232 0.53167437 3.999843 0.13534587
        padj
        <numeric>
        ENSMUSG00000000001 0.9509148
        ENSMUSG00000000003 NA
        ENSMUSG00000000028 0.9573557
        ENSMUSG00000000031 NA
        ENSMUSG00000000037 0.3663522
        ENSMUSG00000000049 NA

        ######question1: res only include one comparison "condition shRNA B vs control", how can I get the result "condition shRNA A vs control"? (sorry I checked the ?results, but did not manage to get it)
        ######question2: LRT p-value: '~ condition' vs '~ 1', I do not quite understand '~ 1'?

        when I do two comparisons individually, then this is nbinomWaldTest test:

        dds = DESeq(dds)
        resA <- results(dds, contrast=c("condition","shRNA_A","control"))
        resB <- results(dds, contrast=c("condition","shRNA_B","control"))

        The result resB is slightly different from the res (with LRT test) in both number of genes and fold change number.
        ######question3: which method shall I use?

        Thanks again for your time and help

        Comment


        • #5
          Q1:

          In my previous response I gave the answer to this one:

          results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).
          When you print 'res', it says:

          Code:
          LRT p-value: '~ condition' vs '~ 1'
          So the p-values are for the test of the counts depending on all levels of condition (control, shRNA_A, shRNA_B), while the LFC shown in this table is only for one of the fold changes. Please read the help for ?results which has a paragraph explaining this.


          Q2:

          The likelihood ratio test compares the explanatory power of two models.

          See http://en.wikipedia.org/wiki/Likelihood-ratio_test for background on this statistical test.

          the full model (~ condition) includes the intercept and the information in 'condition' (note that in R formula, ~ condition is equivalent to ~ 1 + condition)

          the reduced model (~ 1) is just an intercept term

          resB is a comparison of one level vs the control, whereas the LRT tests all levels of condition.

          Q3:

          Which method you use depends on the biological question you want to ask.

          Use the LRT if you want to describe the set of genes changing from any level of condition.

          Use the Wald tests if you want to describe the set of genes from a single pair of levels (e.g. shRNA_B vs control).

          Comment


          • #6
            Michael: Quick (largely) unrelated question for you. Given an experiment with design ~condition, where condition just has two levels, is there much of a benefit to using a Wald test over an LRT other than the ease of computation? My understanding had always been that the Wald test approximated an LRT and was often preferred since you don't have to fit the alternate model.

            Comment


            • #7
              I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

              The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

              The Wald allow for easy testing of |beta| > theta, for theta > 0.

              The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.

              Comment


              • #8
                Thanks a lot Michael for the detail and explanation!

                I think LRT test makes more sense for me (compare all levels). Sorry for the question before to get the comparison with shRNA_A vs control. I got it with the "name".

                > resultsNames(dds)
                [1] "Intercept" "condition_shRNAA_vs_control" "condition_shRNAB_vs_control"
                > resA <- results(dds, name="condition_shRNA_A_vs_control")

                But with "contrast", I had one error:

                > resA <- results(dds, contrast=c("condition","shRNA_A","control"))

                Error in normalizeSingleBracketSubscript(j, x) :
                subscript contains invalid names

                Comment


                • #9
                  Originally posted by dpryan View Post
                  I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.

                  Thanks a lot Devon! just one question about "design of ~1", you mentioned. you mean: dds <- nbinomLRT(dds, reduced = ~ 1) ?

                  Comment


                  • #10
                    hi Wei,

                    Yes, there is a bug in v1.4, that if you've run the LRT pipeline, then you can't use the contrast argument if the LFC is already contained in resultsNames, but you can get it with the 'name' argument. And it's the same result table using 'name'. I've fixed this in the next release (v1.6 released in October).

                    ( and yes to your question for Devon )

                    Comment


                    • #11
                      Thanks again Michael for the swift reply and great work
                      I need to read more to understand the use of "~1" in different function

                      Comment


                      • #12
                        Originally posted by Michael Love View Post
                        I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

                        The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

                        The Wald allow for easy testing of |beta| > theta, for theta > 0.

                        The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.
                        Thanks, that makes sense.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        11 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        68 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X