Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • marcus1487
    Junior Member
    • Sep 2014
    • 3

    DEXSeq Partially Unreplicated Dataset: Weird EstimateDispersions behavior

    Hi,

    I am having an issue with analyzing a next-gen sequencing data set for differential exon usage using DEXSeq. I have two main questions:

    First, I have a partially unreplicated data set. I have created a reproducible example below where I have 3 cell lines each in two conditions, but only two of the 3 cell lines were completed in biological duplicate. The example is on random data, but my real data set is biologically derived and much larger; 42 cell lines with 37 unreplicated.

    My first issue is that estimateDispersions produces different dispersion estimates when the unreplicated cell line is included or not included in the data set. I was under the impression that the the dispersion estimates should not be effected by unreplicated data, but this does not appear to be the case. I hope that I am missing something here; i.e. are my formulas correct? is there a reason that the dispersion estimates are different? are the unreplicated samples included in the baseMean calculation and that is changing the dispersion fitting?

    Here is my code to produce this outcome:

    Code:
    suppressMessages( library(DEXSeq) )
    
    set.seed(110011)
    
    sampNames <- c(
      'a.untreated.1','a.untreated.2',
      'b.untreated.1','b.untreated.2',
      'c.untreated.1', 'a.treated.1',
      'a.treated.2', 'b.treated.1',
      'b.treated.2', 'c.treated.1')
    
    ## generate random count data set
    nGenes <- 100
    nExons <- 10
    nSamps <- 10
    sampleData <- data.frame(
      row.names=sampNames,
      condition=rep(c("untreated", "treated"), each=5),
      cellLine=rep(c('a','a','b', 'b', 'c'), 2))
    design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
    subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )
    groupID <- unlist(lapply(1:nGenes, function(i) rep(paste('g', i), nExons)))
    featureID <- rep(sapply(1:nExons, function(i) paste('e', i)), nGenes)
    countData <- matrix(rpois(nGenes * nExons * nSamps, 100),
                        nrow=nGenes * nExons,
                        dimnames=list(groupID, sampNames))
    
    ## run DEXSeq with all samples
    dxd <- DEXSeqDataSet(countData, sampleData, design, featureID, groupID)
    dxd <- estimateSizeFactors(dxd)
    ## this throws a warning which I am assuming is b/c
    ## the data is not realistic for the model
    dxd <- estimateDispersions(dxd)
    dxd <- testForDEU(dxd, reducedModel = subDesign)
    dxr1 <- DEXSeqResults(dxd)
    
    ## now use only replicated samples
    sampleData2 <- sampleData[sampleData[,'cellLine'] != 'c',]
    countData2 <- countData[,sampleData[,'cellLine'] != 'c']
    dxd2 <- DEXSeqDataSet(countData2, sampleData2, design, featureID, groupID)
    dxd2 <- estimateSizeFactors(dxd2)
    dxd2 <- estimateDispersions(dxd2)
    dxd2 <- testForDEU(dxd2, reducedModel =  subDesign)
    dxr2 <- DEXSeqResults(dxd2)
    
    ## why is this not true if I have only removed an unreplicated sample
    all.equal(dxr1$dispersion, dxr2$dispersion)

    Second, once I get the first part to work how can I subset the DEXSeqDataSet in order to testForDEU across conditions, but only within a single cell line? This used to be possible in the old version of DEXSeq (https://stat.ethz.ch/pipermail/bioco...ay/052479.html), but I do not see it in the manual or in any forums for the current version of DEXSeq.

    Code:
    ## then can I subset to testForDEU in only one cellLine?
    ## like this, but this throws an error
    R> dxdSubA <- dxd
    R> colData(dxdSubA) <- colData(dxdSubA)[colData(dxdSubA)$cellLine == 'a',]
    Error in `colData<-`(`*tmp*`, value = <S4 object of class "DataFrame">) :
       'colData' nrow differs from 'assays' ncol
    Here is my sessionInfo:
    Code:
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-apple-darwin13.3.0 (64-bit)
    
    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    
    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods  
    [8] base     
    
    other attached packages:
     [1] DEXSeq_1.10.8           BiocParallel_0.6.1      DESeq2_1.4.5           
     [4] RcppArmadillo_0.4.400.0 Rcpp_0.11.2             GenomicRanges_1.16.4   
     [7] GenomeInfoDb_1.0.2      IRanges_1.22.10         Biobase_2.24.0         
    [10] BiocGenerics_0.10.0    
    
    loaded via a namespace (and not attached):
     [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3       
     [4] BBmisc_1.7           biomaRt_2.20.0       Biostrings_2.32.1   
     [7] bitops_1.0-6         brew_1.0-6           checkmate_1.4       
    [10] codetools_0.2-9      DBI_0.3.0            digest_0.6.4        
    [13] fail_1.2             foreach_1.4.2        genefilter_1.46.1   
    [16] geneplotter_1.42.0   grid_3.1.1           hwriter_1.3.2       
    [19] iterators_1.0.7      lattice_0.20-29      locfit_1.5-9.1      
    [22] RColorBrewer_1.0-5   RCurl_1.95-4.3       Rsamtools_1.16.1    
    [25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.1       
    [28] statmod_1.4.20       stats4_3.1.1         stringr_0.6.2       
    [31] survival_2.37-7      tools_3.1.1          XML_3.98-1.1        
    [34] xtable_1.7-3         XVector_0.4.0        zlibbioc_1.10.0
    Answers to either part would be great! (I hope it made sense to combine these posts as I think they are related) Thank you to anyone for the help in advance!
  • marcus1487
    Junior Member
    • Sep 2014
    • 3

    #2
    So I was able to get a work-able solution to the second half of my problem, but I am still not sure if it is producing accurate results. Here is the function I have produced to testForDEU on each cellLine independently using globally fit dispersions. The dxd object is taken from the code in my previous post.

    Code:
    fitOneCellLine <- function(cellLine){
      sampleData1 <- sampleData[sampleData$cellLine == cellLine,]
      countData1 <- countData[,sampleData$cellLine == cellLine]
      dxd1 <- DEXSeqDataSet(
        countData1, sampleData1, formula( ~ sample + exon + condition:exon),
        featureID, groupID)
      dxd1 <- estimateSizeFactors(dxd1)
    
      ## add dispersions from rep samples
      dxdDims <- dim(mcols(rowData(dxd)))[2]
      mcols(rowData(dxd1))[5:dxdDims] <- mcols(rowData(dxd))[5:dxdDims]
      mcols(mcols(dxd1)) <- mcols(mcols(dxd))
      dxd1@dispersionFunction <- dxd@dispersionFunction
      
      dxd1 <- testForDEU(dxd1, reducedModel =  ~ sample + exon)
      return(DEXSeqResults(dxd1))
    }
    
    aRes <- fitOneCellLine('a')
    bRes <- fitOneCellLine('b')
    cRes <- fitOneCellLine('c')
    I note that the three results objects have different p-values, but I am still not sure if I have transferred all of the appropriate information into the cellLine specific object before testing for DEU. I am mostly concerned that this workflow is not producing accurate results. I am not sure what to test against either as this is the only way I could get cellLine specific results.

    Also I was not able to produce the fold change values here, but in my data set generated using the DEXSeqDataSetFromHTSeq function I am able to estimate fold changes. I am not sure what else is different between this randomly generated data set and my biological data set aside from the generator function, but in any case that is not a huge concern of mine.

    Comment

    • areyes
      Senior Member
      • Aug 2010
      • 165

      #3
      Dear Marcus,

      Of course the optimal is that you would have replicates for the third cell line. But what you could also try is to estimate the dispersions using only the cell lines for which you have replicates, and pass this dispersion estimates to an object with all the data. This would assume that the variability between each of the celllines is similar... although I don't know how true this is.

      With regards to the implementation, testForDEU requires the columns "allZero", "dispGeneEst", "dispFit", "dispersion","dispMAP", "dispOutlier". You could pass these columns to the object used for the testing independently of how you estimate the dispersions before this step and this should work.

      Comment

      • marcus1487
        Junior Member
        • Sep 2014
        • 3

        #4
        Hi areyes,

        Thank you for your response.

        In terms of the assumption that variability between each of the cell lines is similar, isn't this just the same assumption that is made when variability is estimated across all samples even when they are all replicated?

        To spell it out a little more, if we are testing for the effect of a condition across cell lines What is the difference between these two situations:

        1) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a single replicated cell line.
        2) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a separate UN-replicated cell line.

        I may be wrong, but I think the only difference is the additional power from an additional experiment in the replicated experiment (assuming common read coverage). I guess the question is: Is there reason to share dispersion estimates between different conditions when comparing differential expression across another condition.

        In terms of the implementation details this is incredibly helpful! Thank you so much for confirming that this works correctly assuming the above assumptions are valid.

        Comment

        • areyes
          Senior Member
          • Aug 2010
          • 165

          #5
          Hi Marcus,

          You are right, so far one dispersion estimate is estimated for all the levels of the factorial designs.

          I think that if you include the un-replicated cell-line when estimating the dispersions, the final dispersion estimates might be underestimated so it might be better to use only the replicated samples for this step.

          Yes, including more samples should give you additional power. But the think that I would keep in mind is that it would be not possible to estimate the variability in the un-replicated cell-line, so its hard to know if the dispersion estimation based on the replicated cell-lines models well the variability of the un-replicated cell-line. I would just try and see how much additional power to get by including the un-replicated cell-line.


          On other topics, I just read the formulas of your code again:

          design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
          subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )

          In this formulas, you don't have to include the cellLine effect alone, you can just do:

          design <- formula( ~ sample + exon + cellLine:exon + condition:exon )
          subDesign <- formula( ~ sample + exon + cellLine:exon )

          to consider the variability in exon usage due to the different cell-lines.

          Comment

          Latest Articles

          Collapse

          • GATTACAT
            Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by GATTACAT
            Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
            07-01-2026, 11:43 AM
          • SEQadmin2
            Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by SEQadmin2


            I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

            Here are nine questions we think about, in roughly the order they matter, before...
            06-18-2026, 07:11 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, 07-02-2026, 11:08 AM
          0 responses
          9 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-30-2026, 05:37 AM
          0 responses
          13 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-26-2026, 11:10 AM
          0 responses
          20 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-17-2026, 06:09 AM
          0 responses
          54 views
          0 reactions
          Last Post SEQadmin2  
          Working...