Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deseq2 design

    Dear all,

    I have one question about the design of the condition: I am doing gene silencing screening with 2 different shRNA (targeted to the same gene). Before I do the RNA-seq, I know both of these two shRNA bring my gene of interest down 50% compare with control (bench data), and I want to see the differential expression genes in the knock down condition. So I have 9 samples in total with 3 biological replicates per condition:

    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_A
    treatedA2 shRNA_A
    treatedA3 shRNA_A
    treatedB1 shRNA_B
    treatedB2 shRNA_B
    treatedB3 shRNA_B

    First, I did the Deseq2 analysis with two groups comparison: control vs shRNA_A and control vs shRNA_B.
    design as follows:
    condition
    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_A
    treatedA2 shRNA_A
    treatedA3 shRNA_A

    condition
    untreated1 control
    untreated2 control
    untreated3 control
    treatedA1 shRNA_B
    treatedA2 shRNA_B
    treatedA3 shRNA_B


    From the two dataset, I can see my gene goes down 50% (consistent with my bench data), but there are different number of regulated genes (with padj < 0.1) in the two dataset:
    control vs shRNA_A 167 genes
    control vs shRNA_B 431 genes

    there are 50 common regulated genes in the two lists, but the non-overlapped genes have similar expression trend in the two shRNA condition (just because of the padj, those genes are not choose in the list).

    Is my analysis design correct? or shall I do another type of design like this?

    condition type
    untreated1 control untreated
    untreated2 control untreated
    untreated3 control untreated
    treatedA1 shRNA_A treated
    treatedA2 shRNA_A treated
    treatedA3 shRNA_A treated
    treatedB1 shRNA_B treated
    treatedB2 shRNA_B treated
    treatedB3 shRNA_B treated

    ideally I would like to see the genes in two shRNA treated condition together (like anova test), but I do not know how to design it?

    Thanks in advance for any advice or comments!

    Best,
    Wei

  • #2
    I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.

    Comment


    • #3
      Yes, as Devon states, if you want to test (control vs shRNA_A) and (control vs shRNA_B) at once you can use the LRT,

      with a condition vector like so: c("control", ..., "shRNA_A", ..., "shRNA_B", ... )

      you can use:

      Code:
      design(dds) = ~ condition
      dds = DESeq(dds, test="LRT", reduced = ~ 1)
      results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).

      If you want to make the two comparisons individually, it's best to run all the samples together and produce results tables like so:

      Code:
      dds = DESeq(dds)
      results(dds, contrast=c("condition","shRNA_A","control"))
      results(dds, contrast=c("condition","shRNA_B","control"))

      Comment


      • #4
        Thanks a lot Devon and Michael!

        Like you recommended I run all the samples together to get better variance estimates. I tried to my design as ~ condition+type. I had one error"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others", I guess this is a wrong design.

        Here is my code:

        > countdata = as.matrix(read.table("data.txt", header=T, row.names=1))
        > head (countdata)
        untreated1 untreated2 untreated3 treatedA1 treatedA2 treatedA3 treatedB1 treatedB2 treatedB3
        ENSMUSG00000000001 1233 1091 1259 1234 1118 1082 1454 1086 1143
        ENSMUSG00000000003 0 0 0 0 0 0 0 0 0
        ENSMUSG00000000028 76 68 61 75 81 82 93 74 88
        ENSMUSG00000000031 2 3 5 2 2 2 12 5 2
        ENSMUSG00000000037 24 31 38 27 26 14 47 43 45
        ENSMUSG00000000049 6 7 10 11 19 16 20 11 15
        > coldata= read.table ("design.txt", header = T, row.names=1)
        >
        > coldata
        condition type
        untreated1 control untreated
        untreated2 control untreated
        untreated3 control untreated
        treatedA1 shRNA_A treated
        treatedA2 shRNA_A treated
        treatedA3 shRNA_A treated
        treatedB1 shRNA_B treated
        treatedB2 shRNA_B treated
        treatedB3 shRNA_B treated
        > dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition + type)
        Error in DESeqDataSet(se, design = design, ignoreRank) :
        the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others



        Then I do it like Michael's code, I delete the type column in coldata file.

        > coldata
        condition
        untreated1 control
        untreated2 control
        untreated3 control
        treatedA1 shRNA_A
        treatedA2 shRNA_A
        treatedA3 shRNA_A
        treatedB1 shRNA_B
        treatedB2 shRNA_B
        treatedB3 shRNA_B

        > dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition)
        > design(dds) = ~ condition
        > dds = DESeq(dds, test="LRT", reduced = ~ 1)
        estimating size factors
        estimating dispersions
        gene-wise dispersion estimates
        mean-dispersion relationship
        final dispersion estimates
        fitting model and testing
        > res <- results(dds)
        > head(res)
        log2 fold change: condition shRNA B vs control
        LRT p-value: '~ condition' vs '~ 1'
        DataFrame with 6 rows and 6 columns
        baseMean log2FoldChange lfcSE stat pvalue
        <numeric> <numeric> <numeric> <numeric> <numeric>
        ENSMUSG00000000001 1183.648139 -0.1000936 0.09353996 1.191909 0.55103647
        ENSMUSG00000000003 0.000000 NA NA NA NA
        ENSMUSG00000000028 77.301980 0.1806468 0.21423655 1.095024 0.57838715
        ENSMUSG00000000031 3.688155 0.7430438 0.92477711 2.323728 0.31290243
        ENSMUSG00000000037 32.345099 0.4137095 0.32571884 6.944625 0.03104515
        ENSMUSG00000000049 12.603642 0.8593232 0.53167437 3.999843 0.13534587
        padj
        <numeric>
        ENSMUSG00000000001 0.9509148
        ENSMUSG00000000003 NA
        ENSMUSG00000000028 0.9573557
        ENSMUSG00000000031 NA
        ENSMUSG00000000037 0.3663522
        ENSMUSG00000000049 NA

        ######question1: res only include one comparison "condition shRNA B vs control", how can I get the result "condition shRNA A vs control"? (sorry I checked the ?results, but did not manage to get it)
        ######question2: LRT p-value: '~ condition' vs '~ 1', I do not quite understand '~ 1'?

        when I do two comparisons individually, then this is nbinomWaldTest test:

        dds = DESeq(dds)
        resA <- results(dds, contrast=c("condition","shRNA_A","control"))
        resB <- results(dds, contrast=c("condition","shRNA_B","control"))

        The result resB is slightly different from the res (with LRT test) in both number of genes and fold change number.
        ######question3: which method shall I use?

        Thanks again for your time and help

        Comment


        • #5
          Q1:

          In my previous response I gave the answer to this one:

          results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).
          When you print 'res', it says:

          Code:
          LRT p-value: '~ condition' vs '~ 1'
          So the p-values are for the test of the counts depending on all levels of condition (control, shRNA_A, shRNA_B), while the LFC shown in this table is only for one of the fold changes. Please read the help for ?results which has a paragraph explaining this.


          Q2:

          The likelihood ratio test compares the explanatory power of two models.

          See http://en.wikipedia.org/wiki/Likelihood-ratio_test for background on this statistical test.

          the full model (~ condition) includes the intercept and the information in 'condition' (note that in R formula, ~ condition is equivalent to ~ 1 + condition)

          the reduced model (~ 1) is just an intercept term

          resB is a comparison of one level vs the control, whereas the LRT tests all levels of condition.

          Q3:

          Which method you use depends on the biological question you want to ask.

          Use the LRT if you want to describe the set of genes changing from any level of condition.

          Use the Wald tests if you want to describe the set of genes from a single pair of levels (e.g. shRNA_B vs control).

          Comment


          • #6
            Michael: Quick (largely) unrelated question for you. Given an experiment with design ~condition, where condition just has two levels, is there much of a benefit to using a Wald test over an LRT other than the ease of computation? My understanding had always been that the Wald test approximated an LRT and was often preferred since you don't have to fit the alternate model.

            Comment


            • #7
              I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

              The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

              The Wald allow for easy testing of |beta| > theta, for theta > 0.

              The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.

              Comment


              • #8
                Thanks a lot Michael for the detail and explanation!

                I think LRT test makes more sense for me (compare all levels). Sorry for the question before to get the comparison with shRNA_A vs control. I got it with the "name".

                > resultsNames(dds)
                [1] "Intercept" "condition_shRNAA_vs_control" "condition_shRNAB_vs_control"
                > resA <- results(dds, name="condition_shRNA_A_vs_control")

                But with "contrast", I had one error:

                > resA <- results(dds, contrast=c("condition","shRNA_A","control"))

                Error in normalizeSingleBracketSubscript(j, x) :
                subscript contains invalid names

                Comment


                • #9
                  Originally posted by dpryan View Post
                  I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.

                  Thanks a lot Devon! just one question about "design of ~1", you mentioned. you mean: dds <- nbinomLRT(dds, reduced = ~ 1) ?

                  Comment


                  • #10
                    hi Wei,

                    Yes, there is a bug in v1.4, that if you've run the LRT pipeline, then you can't use the contrast argument if the LFC is already contained in resultsNames, but you can get it with the 'name' argument. And it's the same result table using 'name'. I've fixed this in the next release (v1.6 released in October).

                    ( and yes to your question for Devon )

                    Comment


                    • #11
                      Thanks again Michael for the swift reply and great work
                      I need to read more to understand the use of "~1" in different function

                      Comment


                      • #12
                        Originally posted by Michael Love View Post
                        I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

                        The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

                        The Wald allow for easy testing of |beta| > theta, for theta > 0.

                        The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.
                        Thanks, that makes sense.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Advanced Tools Transforming the Field of Cytogenomics
                          by seqadmin


                          At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
                          Yesterday, 06:26 AM
                        • seqadmin
                          How RNA-Seq is Transforming Cancer Studies
                          by seqadmin



                          Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
                          09-07-2023, 11:15 PM
                        • seqadmin
                          Methods for Investigating the Transcriptome
                          by seqadmin




                          Ribonucleic acid (RNA) represents a range of diverse molecules that play a crucial role in many cellular processes. From serving as a protein template to regulating genes, the complex processes involving RNA make it a focal point of study for many scientists. This article will spotlight various methods scientists have developed to investigate different RNA subtypes and the broader transcriptome.

                          Whole Transcriptome RNA-seq
                          Whole transcriptome sequencing...
                          08-31-2023, 11:07 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Today, 06:57 AM
                        0 responses
                        6 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 07:53 AM
                        0 responses
                        8 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 09-25-2023, 07:42 AM
                        0 responses
                        14 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 09-22-2023, 09:05 AM
                        0 responses
                        44 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X