Seqanswers Leaderboard Ad

**sdriscoll** · 05-02-2012, 11:01 PM

What do your MA plots look like? Those would reveal if the normalization of library sizes went wonky.

Sometimes when I think the output of a tool looks weird I'll try another tool. Give edgeR a shot and see what the results look like.

**Artur Jaroszewicz** · 05-03-2012, 11:16 AM

It seems to look pretty good:

I'll give EdgeR a shot. Unfortunately, I have my entire pipeline set up with DESeq, so I'd rather keep it consistent across analyses.

**Wolfgang Huber** · 05-03-2012, 01:43 PM

Dear Artur,

for sure "unfortunately" is not the right sentiment here

Anyway, can you produce MA-plots like the one you did above for every pair of samples (i.e. for 9 choose 2 = 36 pairs)? The R function

Code:

 pairs(log2(exprs(cds)), pch=".")

does already most of that. If you are comfortable with R, you can add a function as the panel argument to do the 45-degree rotation and add the line at y=0.

To further explore data quality (and perhaps identify samples that went wrong, and are destroying your power to detect differential expression), try

Code:

library("arrayQualityMetrics")
vst = varianceStabilizingTransformation(cds)
arrayQualityMetrics(vst, intgroup="condition")

Please also try method = "per-condition" in the call to estimateDispersions, this could be less conservative if one of the conditions is (for some reason) a lot more noisy than the others.

PS The URL http://pelegrini.mcdb.ucla.edu/Artur...gene_array.txt did unfortunately not work when I tried ("Browsername can't find the server at pelegrini.mcdb.ucla.edu"). Could it be within your intranet?

**sdriscoll** · 05-03-2012, 02:00 PM

I thought I'd try those snippits myself and I get some errors. Here's my session info:

Code:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

I do not have the function "varianceStabilizingTransformation" and "exprs(cds)" returns an error:

> exprs(cds)
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function "exprs", for signature "CountDataSet"

**dietmar13** · 05-03-2012, 09:05 PM

you could try SAMseq and limma-voom()

which are both capable of using multi-group designs.

SAMseq (from SAM R-package):
R: x <- your count matrix
R: y <- c(rep(1,3), rep(2,3), rep(3,3))
R: samfit <- SAMseq(x, y, resp.type = "Multiclass")

limma R package:
R: Counts <- your count matrix
R: nf <- calcNormFactors(Counts) [ from the edgeR package ]

R: f <- factor(targets$Target, levels=c("RNA1","RNA2","RNA3"))
R: design <- model.matrix(~0+f)
R: colnames(design) <- c("RNA1","RNA2","RNA3")
R: y <- voom(Counts,design,plot=TRUE,lib.size=colSums(Counts)*nf)

Pair-wise comparisons between the three groups
R: fit <- lmFit(y, design)
R: contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1,
levels=design)
R: fit2 <- contrasts.fit(fit, contrast.matrix)
R: fit2 <- eBayes(fit2)

e.g. RNA2 versus RNA1
R: topTable(fit2, coef=1, adjust="BH")

**Simon Anders** · 05-03-2012, 11:05 PM

Judging from your dispersion plot, you have many genes with really high variance. Remember that a dispersion of .1 means that the gene typically varies by sqrt(.1)=30% between replicates, and then, to get significance, the change between conditions should be much larger.

Take a few of these genes with strong log2FoldChange and look at the normalized counts. Do you really see something, or is the variation withing group as strong as between groups?

(You know, it is entirely possible, of course, that DESeq did not work well on your data, but it is as likely that it is simply right, and the signal-to-noise ratio of your experiment is too bad to allow inference, i.e., that you simply don't have anything.)

**ETHANol** · 05-04-2012, 04:44 AM

I am not an expert here so take my word with a grain of salt, but I am currently looking at some cell lines which were sorted based on their expression on a single gene. There is a lot of variance between the cell lines, but there are also real differentially expressed genes between the two groups as well.

from the DESeq vignette

As we estimated the dispersion from only two samples, we should expect the estimates to scatter with quite some sampling variance around their true values. Hence, we DESeq should not use the per-gene estimates directly in the test, because using too low dispersion values leads to false positives. Many of the values below the red line are likely to be underestimates of the true dispersions, and hence, it is prudent to instead rather use the fitted value. On the othe hand, not all of he values above the red line are overestimations, and hence, the conservative choice is to keep them instead of replacing them with their fitted values. If you do not like this default behaviour, you can change it with the option sharingMode of estimateDispersions. Note that DESeq orginally (as described in [1]) only used the fitted values (sharingMode="fit-only"). The current default (sharingMode="maximum") is more conservative.

I was getting no differentially expressed genes but when I changed to sharingMode="fit-only" I got some. We have a 8 biological replicates for each condition so I think this is a reasonable thing to do. My guess is the the whole analysis was being thrown off by the high variance of a lot of other genes. The risk is the less stringent variance calculation is giving us false positives.

**Simon Anders** · 05-04-2012, 05:00 AM

Originally posted by ETHANol View Post

I was getting no differentially expressed genes but when I changed to sharingMode="fit-only" I got some. We have a 8 biological replicates for each condition so I think this is a reasonable thing to do.

No, quite the opposite! If you have 8 replicates per condition, you have enough power to get a reliable variance estimate for each gene without having to rely on other genes. So, you switch to sharingMode="gene-est-only", and then, high-variance genes will not reduce your power for any of the other genes.

**ETHANol** · 05-04-2012, 05:19 AM

Thanks Simon for the correction, that's what I was trying to say but my cluelessness resulted in what I wrote.

**Simon Anders** · 05-04-2012, 08:15 AM

Ok, that's good. Now, you have 14 degrees of freedom for a pooled dispersion estimate (16 samples minus 2 conditions), Artur has only 6 DoF (9 samples minus 3 conditions). This is not quite enough to use "per-gene-est-only", I would say. I'm afraid his problem is really that his signal-to-noise ratio is too bad. Artur, you may need to tell us more about your experiment.

**Wolfgang Huber** · 05-05-2012, 08:10 AM

Originally posted by sdriscoll View Post

I thought I'd try those snippits myself and I get some errors. I do not have the function "varianceStabilizingTransformation" and "exprs(cds)" returns an error:

Dear Sdriscoll

thank you for checking and reporting these errors. My apologies. That comes from sending untested code snippets late at night. Here's an example that works:

Code:

library("arrayQualityMetrics")
library("DESeq")
library("pasilla")

data("pasillaGenes")

pairs(log2(counts(pasillaGenes)+1),
  panel=function(x, y, ...) {
    points((x+y)/2, y-x, pch=".")
    abline(h=0, col="blue")
}, xlim=c(0, 16), ylim=c(-5,5))

cds = estimateDispersions(estimateSizeFactors(pasillaGenes), method="blind")
vst = DESeq:::varianceStabilizingTransformation(cds)
arrayQualityMetrics(vst, intgroup="condition",
    reporttitle="Quality metrics report for pasillaGenes", force=TRUE)



> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pasilla_0.2.11             DEXSeq_1.2.0              
[3] DESeq_1.9.5                locfit_1.5-8              
[5] Biobase_2.16.0             BiocGenerics_0.2.0        
[7] arrayQualityMetrics_3.12.0 fortunes_1.5-0            

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.0  BeadDataPackR_1.8.0   BiocInstaller_1.4.4  
 [4] Biostrings_2.24.1     Cairo_1.5-1           DBI_0.2-5            
 [7] Hmisc_3.9-3           IRanges_1.14.2        KernSmooth_2.23-7    
[10] RColorBrewer_1.0-5    RCurl_1.91-1          RSQLite_0.11.1       
[13] SVGAnnotation_0.9-0   XML_3.9-4             affy_1.34.0          
[16] affyPLM_1.32.0        affyio_1.24.0         annotate_1.34.0      
[19] beadarray_2.6.0       biomaRt_2.12.0        cluster_1.14.2       
[22] colorspace_1.1-1      genefilter_1.38.0     geneplotter_1.34.0   
[25] grid_2.15.0           hwriter_1.3           lattice_0.20-6       
[28] latticeExtra_0.6-19   limma_3.12.0          plyr_1.7.1           
[31] preprocessCore_1.18.0 reshape2_1.2.1        setRNG_2011.11-2     
[34] splines_2.15.0        statmod_1.4.14        stats4_2.15.0        
[37] stringr_0.6           survival_2.36-14      tools_2.15.0         
[40] vsn_3.24.0            xtable_1.7-0          zlibbioc_1.2.0

You will need DESeq>=1.9.4 for this (i.e. a version more recent than the last release of Bioconductor from April 2012). If you are using R 2.15, and are not normally subscribing to the devel branch of Bioconductor, then the best way to install DESeq is via

Code:

source("http://bioconductor.org/biocLite.R"); useDevel(TRUE); biocLite("DESeq"); useDevel(old)

The above mentioned prefix DESeq::: in the call to varianceStabilizingTransformation is only needed for DESeq <=1.9.5, in more recent versions this is properly exported.

I hope this helps, best wishes
Wolfgang

**sdriscoll** · 05-06-2012, 08:28 AM

Thanks for the explanation. Late night code snippets...that's funny and it happens to everyone that writes code.

**ETHANol** · 05-06-2012, 08:49 AM

Delivery of code snipits after midnight on SEQanswers. You deserve an award for that or at least to have your next grant approved.

**Artur Jaroszewicz** · 05-07-2012, 12:49 PM

Hi everyone,

Thank you so much for your input so far. Simon, I am beginning to think that the replicates may indeed be too noisy to infer anything. Still, it's strange. I have a question for you about the way you determine fold change. I see that you estimate the total abundance of a given gene by dividing the raw reads by the size factors, then find the unweighted average of the replicates to obtain a mean level for each sample. This is the only way to do the differential testing, from what I understand. However, it leads to some results that I'm not sure I fully grasp or agree with. You calculate fold change, etc., from these numbers. Isn't it more accurate to take the weighted mean of these numbers? In my analysis, it's caused some problems. I calculate RPKM by taking the total counts per gene for each sample and dividing by the total number of reads in the sample (and gene length). This gives replicates with more reads more strength, which I think should be the case. Just FYI, my size factors here are:
CD901 0.9780371
CD902 1.3965205
CD903 1.3044467
UtESC.1 1.7915295
UtESC.2 0.3420241
UtESC.3 0.6416566
UtESC.1.1 0.5929506
UtESC.2.1 2.4919984
UtESC.3.1 1.0039902
So in the last sample, replicates 1 and 2 have a five-fold size difference, yet the same weight in the differential test. Because of this discrepancy, I get strange artifacts in certain differentially expressed genes which have sample 1 higher than sample 2 in one method of estimation and lower in the other method. Do you have any explanations or suggestions for what I should do differently?

Wolfgang, I realized that I had the wrong link. I edited it, and now it should work. And you're right, I should feel very fortunate for having DESeq in my pipeline =]

Here's some background information sent from my colleague:

We study the mouse reproductive system, and have identified a signature for adult mouse endometrial epithelial cells. We have collected FACS sorted fractions of the epithelial stem cell fraction (UtESC+), the epithelial non-stem cell fraction (UtESC-), and the stromal fraction (CD90).

I re-ran the test using
> cds <- estimateDispersions( cds, method="per-condition" )
and looked at the p-values again. There are only 13 genes of ~30,000 with adjusted p-values of less than 0.01, and a more reasonable 309 genes with non-adjusted p-values of less than 0.01. These are much lower numbers than I usually get, but it may be the only thing I can do with this data. Do you agree?

Thank you for your continued help,
Artur

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 50 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

DESeq not giving any differentially expressed genes in RNA seq

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News