Seqanswers Leaderboard Ad

**dpryan** · 04-08-2014, 12:49 AM

Then presumably that should work fine.

**nbahlis** · 04-08-2014, 05:20 AM

Originally posted by dpryan View Post

Then presumably that should work fine.

Hi Ryan,

What is the right way to draw the heat map based on genes with padj < 0.1 instead of selecting "Rowmeans" as in the vignette? The script I have written does NOT select genes based on their padj < 0.1

thank you for your help

**dpryan** · 04-08-2014, 05:27 AM

You would just use a different index, such as:

Code:

select <- which(p.adjust( res$pvalue[keep], method="BH" ) < .1 )

**nbahlis** · 04-08-2014, 05:28 AM

Originally posted by dpryan View Post

Then presumably that should work fine.

here is the script I've written:
> sampleFiles <- list.files(path="~/Desktop/HTseq_counts/", pattern="*.counts")
> Table3 <- data.frame(
+ sampleName = sampleFiles, fileName = sampleFiles,
+ condition = c( "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "pre", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post", "post"),
+ libType = c("pair8", "pair10", "pair9", "pair1", "pair7", "pair11", "pair2", "pair3", "pair4", "pair12", "pair13", "pair6", "pair7", "pair1", "pair3", "pair4", "pair11", "pair2", "pair6", "pair10", "pair9", "pair8", "pair12", "pair13"))
> directory <- c("~/Desktop/HTseq_counts/")
> design <- formula(~ libType + condition)
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable= Table3, directory= directory, design= design)
Usage note: the following factors have 3 or more levels:

libType

For DESeq2 versions < 1.3, if you plan on extracting results for
these factors, we recommend using betaPrior=FALSE as an argument
when calling DESeq().
As currently implemented in version 1.2, the log2 fold changes can
vary if the base level is changed, when extracting results for a
factor with 3 or more levels. A solution will be implemented in
version 1.3 which allows for the use of a beta prior and symmetric
log2 fold change estimates regardless of the selection of base level.
> Table3
sampleName fileName condition libType
1 P110.counts P110.counts pre pair8
2 P124.counts P124.counts pre pair10
3 P149.counts P149.counts pre pair9
4 P185.counts P185.counts pre pair1
5 P189.counts P189.counts pre pair7
6 P192.counts P192.counts pre pair11
7 P218.counts P218.counts pre pair2
8 P227.counts P227.counts pre pair3
9 P235.counts P235.counts pre pair4
10 P274.counts P274.counts pre pair12
11 P321.counts P321.counts pre pair13
12 P351.counts P351.counts pre pair6
13 P357.counts P357.counts post pair7
14 P367.counts P367.counts post pair1
15 P377.counts P377.counts post pair3
16 P384.counts P384.counts post pair4
17 P426.counts P426.counts post pair11
18 P543.counts P543.counts post pair2
19 P584.counts P584.counts post pair6
20 P590.counts P590.counts post pair10
21 P594.counts P594.counts post pair9
22 P610.counts P610.counts post pair8
23 P653.counts P653.counts post pair12
24 P654.counts P654.counts post pair13
> ddsHTSeq
class: DESeqDataSet
dim: 51911 24
exptData(0):
assays(1): counts
rownames(51911): ENSG00000000003 ENSG00000000005 ... ENSG00000259170
ENSG00000259171
rowData metadata column names(0):
colnames(24): P110.counts P124.counts ... P653.counts P654.counts
colData names(2): condition libType
> dds <- DESeq(ddsHTSeq)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
1515 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
> res <- results(dds)
> res <- res[order(res$padj),]
> plotMA(dds, main="DESeq2", ylim=c(-2,2))
> plotDispEsts(dds)
> hist( res$pvalue, breaks=100)
> filterThreshold <- 5
> keep <- rowMeans( counts( dds, normalized=TRUE)) > filterThreshold
> table(p.adjust(res$pvalue, method="BH") < 0.1)

FALSE TRUE
44016 1036
> table(p.adjust(res$pvalue[keep], method="BH") < 0.1)

FALSE TRUE
17543 1684
> hist( res$pvalue[keep], breaks=100)
> reskeep <- results(dds[keep])
> reskeep <- reskeep[order(reskeep$padj),]
> library( "org.Hs.eg.db" )
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DBI

> cols(org.Hs.eg.db)
[1] "ENTREZID" "PFAM" "IPI" "PROSITE" "ACCNUM"
[6] "ALIAS" "CHR" "CHRLOC" "CHRLOCEND" "ENZYME"
[11] "MAP" "PATH" "PMID" "REFSEQ" "SYMBOL"
[16] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
[21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY" "GOALL"
[26] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG"
Warning message:
In cols(org.Hs.eg.db) :
'cols' has been deprecated and replaced by 'columns' for versions of
Bioc that are higher than 2.13. Please use 'columns' anywhere that
you previously used 'cols'
> convertIDs <- function( ids, fromKey, toKey, db, ifMultiple=c( "putNA", "useFirst" ) ) {
+ stopifnot( inherits( db, "AnnotationDb" ) )
+ ifMultiple <- match.arg( ifMultiple )
+ suppressWarnings( selRes <- AnnotationDbi::select(
+ db, keys=ids, keytype=fromKey, cols=c(fromKey,toKey) ) )
+ if( ifMultiple == "putNA" ) {
+ duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ]
+ selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ] }
+ return( selRes[ match( ids, selRes[,1] ), 2 ] ) }
> reskeep$symbol <- convertIDs( row.names(reskeep), "ENSEMBL", "SYMBOL", org.Hs.eg.db )
> head(reskeep)
DataFrame with 6 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000118193 352.79648 -1.3133381 0.1805654 -7.273477 3.503503e-13
ENSG00000090889 486.81411 -1.5000754 0.2177141 -6.890116 5.574692e-12
ENSG00000166851 591.71901 -1.4069394 0.2097238 -6.708535 1.965874e-11
ENSG00000213401 45.86075 -3.6684941 0.5520076 -6.645731 3.017157e-11
ENSG00000144554 1137.21388 -0.8662289 0.1334898 -6.489104 8.634842e-11
ENSG00000204592 27901.79642 0.6212955 0.0959501 6.475194 9.469050e-11
padj symbol
<numeric> <character>
ENSG00000118193 7.130679e-09 KIF14
ENSG00000090889 5.673085e-08 KIF4A
ENSG00000166851 1.333714e-07 PLK1
ENSG00000213401 1.535205e-07 NA
ENSG00000144554 3.212059e-07 FANCD2
ENSG00000204592 3.212059e-07 HLA-E

> select <- order(p.adjust( reskeep$pvalue, method="BH" ) < .0005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))
> select <- order(p.adjust( reskeep$pvalue[keep], method="BH" ) < .00005)
> heatmap.2(assay(rld)[select,],col = redgreen(75),trace="none", cexRow=0.75, cexCol=0.75)
> select <- order(p.adjust( reskeep$pvalue[keep], method="BH" ) < .00000005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))
> select <- order(p.adjust( reskeep$padj[keep], method="BH" ) < .00000005)
> heatmap.2( assay(rld) [ select, ], scale="row", cexRow=0.5, cexCol=0.75,
+ trace="none", dendrogram="column",
+ col = redgreen(75))

**nbahlis** · 04-08-2014, 06:26 AM

thanks Ryan. For whatever reason "select <- which(p.adjust( res$pvalue[keep], method="BH" ) < .1 )" does not select in the heatmap for the genes with padj<0.1. The list of the genes seems to be randomly picked up

**dpryan** · 04-08-2014, 06:28 AM

FYI, the newer versions of DESeq2 do independent filtering for you, so you needn't normally filter things yourself (i.e., you can skip the rowMeans() step).

Code:

select <- order(p.adjust( reskeep$pvalue, method="BH" ) < .0005)

I have no idea why you're trying to do it this way, that simply won't do what you want. You're ordering a set of TRUE and FALSE boolean values, which will really not do what you want.

Code:

select <- which(p.adjust(reskeep$pvalue, method="BH")<0.1)

Also:

Code:

p.adjust( reskeep$pvalue[keep])

Aside from using order() instead of which(), this won't do what you want since you've already subset "res" to get "reskeep" (from which I presume you're making "rld", though you don't show that). You're thing selecting things based on a further subset of reskeep that is no longer meaningful and then applying that to "rld".
The above line will do what you want and use a more typical threshold of significance.

**pecanton** · 04-10-2014, 04:20 PM

Hi,

First off, I'm sorry to hijack this dicussion, but my question is very related to the original post so I didn't want to start a new thread.

I have a time course RNA seq experiment, with a control condition and 4 time points, each with 3 biological replicates. I've done TopHat2-HTseq-DESeq2 analysis to find the differentially expressed genes in each time point vs the control condition. (curiously, with my data DEseq2 predicted all but 2 or 3 of the genes predicted by EdgeR, out of tens to thousands of predictions, so that's why I've been using only DEseq2)

I now want to do some clustering analysis to find genes with more or less similar expression patterns across the time points, and yes, get a pretty heatmap by the end of the process. However, I have some doubts on how to proceed:

1) Should I do the clustering analysis with the variance stabilization of the counts table as in the DEseq2 vignette or should I use the table I've compiled with the log2 fold changes for all genes in all the time points after extracting the results of the differential expression tests? Related to this, I'd like to reduce the number genes used in the clustering. Out of ~18000 genes in my organism, at most ~5% of genes are predicted to have a significant change in expression (P-adjusted value < 0.05), so I'm curious if it's not best to reduce the genes to cluster only those that are significant in at least one of the time points. I feel all the other genes would just contribute a lot of noise and huge, unmanageable clustering.

2) I've been testing the Cluster 3-TreeView tools to get a feel of how to proceed with the heirarchical clustering analysis, but there are many parameters to put into the algorithm. Reading up papers on the subject has left me even more confused as to what is actually proper versus what produces the results I'd like to see. For example, should I center data or not, and use median or mean? Should I use Spearman correlation or euclidian distances between genes? What is better, centroid linkage or average linkage?

Any pointers on this matter would be greatly appreciated.

**dpryan** · 04-11-2014, 01:51 AM

It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.

**super0925** · 04-11-2014, 03:16 AM

Originally posted by dpryan View Post

It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.

Hi Ryan
on you post on http://www.biostars.org/p/91978/#91984 , you mentioned that it is a really really bad idea to use Euclidian distance to draw heatmap on RNA-seq data, so what do command you prefer to draw heatmap clustering?
I have just followed the http://bioconductor.org/packages/dev.../doc/DESeq.pdf
Thank you!

**dpryan** · 04-11-2014, 03:40 AM

@super0925: Lior mentioned the issues with euclidean distance in his post (I was just referring to that in my answer) so I would recommend just reading that and seeing what he recommended there (Mahalanobis distance if I remember correctly, though I recall him not being completely enthralled with any particular method).

**super0925** · 04-11-2014, 04:44 AM

Originally posted by dpryan View Post

It's mostly a question of how aggressively to quality trim after removing adapter sequences. I fall into the the "trim gently" camp, so I trim off adapters and bases with a phred score of 5 or below. While more aggressive trimming can probably increase the mapping rate, it will generally drastically decrease the overall number of mapped reads (and generally not improve accuracy that much). This is at least the case for RNAseq, other experiment types may be different.

Hi Ryan so how about the duplicates?I have seen someone on blog said better not to remove the duplicate?

**dpryan** · 04-11-2014, 04:48 AM

That someone was correct. Since you'll often have highly expressed short genes, marking duplicates would artificially deflate read counts and decrease your ability to detect DE genes. What might make sense is to look for sudden jumps in depth along genes and perhaps marking duplicates according to that, though I've never seen anyone do that and I kind of doubt it'd be that beneficial (though I could be wrong there).

BTW, my name is Devon (I know, Ryan is also a common first name).

**super0925** · 04-11-2014, 05:05 AM

Originally posted by dpryan View Post

That someone was correct. Since you'll often have highly expressed short genes, marking duplicates would artificially deflate read counts and decrease your ability to detect DE genes. What might make sense is to look for sudden jumps in depth along genes and perhaps marking duplicates according to that, though I've never seen anyone do that and I kind of doubt it'd be that beneficial (though I could be wrong there).

BTW, my name is Devon (I know, Ryan is also a common first name).

Thank you Devon!

**pecanton** · 04-11-2014, 07:42 AM

Originally posted by dpryan View Post

It's generally a better idea to use the variance stabilized data when making heatmaps. How you want to filter things is up to you. Randomly subsampling things, using only the DE genes, or filtering based on variance are probably the most common choices. The method to choose depends on what you hope to gain. I personally do clustering mostly for quality control, since it's largely not that informative otherwise.

BTW, you might be interested in using WGCNA, which can be thought of as using clustering to do more rigourous groupings of genes/probes. It's an R package available on CRAN and is probably not the most user-friendly package ever, but the methods might be really appealing for the questions you have in mind.

Regarding centering, usually you'll want to center the values. After that there's no perfect answer to things. Lior Pachter wrote a blog post a while back regarding making heatmaps that you should read. I also wrote a bit about this in an answer a while back over on biostars that you might find helpful. The short answer is that there is usually no single best answer.

That post by lior Pachter was really informative. Thank you for that. I also read your post at Biostars. I think I may settle on filtering the variance stabilized matrix by the DE genes and NOT using the euclidian distance between expression data. However, I'm wondering how it will look. In the matrix I have 15 columns, 3 replicates for each of the 4 time points and the control condition. The table after the DE analysis and the log 2 changes only has 4, since DEseq2 has already evaluated the replicates into one result, and it implicitly has the comparison to the control condition. I'd think longer vectors for each gene would allow better clustering through any algorithm, though I still get the impression that it's for debate which one is more accurate, just that one should shy away from those that cluster mainly on min or max values.

**super0925** · 04-20-2014, 12:38 PM

Originally posted by dpryan View Post

You don't do a DE analysis without replicates. The best you can do is look at the ranked fold-changes (have a look at the GFold package, which tries to do this in a somewhat more useful way). Personally, I wouldn't waste my time without good reason.

Hi Devon I remember that last time you are talking biological or technological replicates.
I think that how many minimum or maximum replicates in library preparation is based on the experiment design, but could we decide it from Deseq/edgeR results (e.g. outliers or expression levels) or we only decide it just based on the experience ? because as you know, the more replicates requires the more money for sequencing.
My understanding, the more replicates is more useful theoretically, am I right?
In library preparation, what is the impact of reads length and depth? are they the more the better ? and if the depth is more important than length?
Thank you!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News