Hi everyone,
I'm a fresh french PhD student in biology who has to struggle since a few weeks with RNAseq datas with a little (or no) background in statistics. I've read a lot but as there is no biostatisticians in my lab, i'm coming here for a advices.
The aim of my project is to make sub groups in a patients cohort and to establish a molecular signature with a minimal list of differentially expressed genes. For that, i got RNAseq results from 40 patients and 2 control samples with no replicates.
So after a few research i started to use DESeq (with the reads, don't panic), and my idea is:
First, making a clustering with a big list of genes (around 7700 after filtering genes with low coverage), and then after having my sub group I'll consider one group as biological replicates and compare each of the group with the rest of the patients, and filter the differentially expressed genes, add them in one list which will allow me to discriminate every group.
So i followed the DESeq workflow like this:
allreads2 <- read.table("allreads2.txt", header=TRUE, row.names=1)
conds <- factor(c("p10019", "p10139", "p10381", "p10398", "p10414", "p10487",
"p10878", "p10975", "p10991", "p11065", "p11176", "p11229", "p11478",
"p11913", "p13465", "p13690", "p14000", "p14246", "p14435", "p14462",
"p14470", "p14696", "p14696", "p14896", "p14903", "p15159", "p15867",
"p15949", "p16070", "p17343", "p17495", "p17838", "p1918", "p459",
"p4659", "p5844", "p5929", "p9506", "p9545", "p9882", "control1", "control2" )) (I have considered each patient as one conditions)
cds <- newCountDataSet( allreads2, conds )
cds <- estimateSizeFactors( cds )
normcds <- counts( cds, normalized=TRUE )
cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
vsd <- varianceStabilizingTransformation( cds )
And then i have made the heatmap of the count table:
library("RColorBrewer")
library("gtools")
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:7716]
colors <- colorRampPalette(c("yellow","red"))(100)
heatmap(exprs(vsd)[select,], col = colors, trace="none", margin=c(10,10))
And the heatmap sample to sample distance
dists <- dist( t( exprs(vsd) ) )
mat = as.matrix( dists )
heatmap(mat, trace="none", col = rev(colors), margin=c(13, 13))
Which gave me different results in clustering.
So here are my questions :
1. First of all, is my strategy to get this molecular signature and my R commands right? Otherwise, what should i change?
2. And then which heatmap seems the more appropriate to do my sub groups according to my experiment and the purpose and why (that's actually what's really blocking me at this point, because everybody in my lab has a different point of view on this, and cannot explain)?
That would be really really great if someone can help a student who dont want to end like most of biologists, bad in statistics ^^.
Thanks a lot.
I'm a fresh french PhD student in biology who has to struggle since a few weeks with RNAseq datas with a little (or no) background in statistics. I've read a lot but as there is no biostatisticians in my lab, i'm coming here for a advices.
The aim of my project is to make sub groups in a patients cohort and to establish a molecular signature with a minimal list of differentially expressed genes. For that, i got RNAseq results from 40 patients and 2 control samples with no replicates.
So after a few research i started to use DESeq (with the reads, don't panic), and my idea is:
First, making a clustering with a big list of genes (around 7700 after filtering genes with low coverage), and then after having my sub group I'll consider one group as biological replicates and compare each of the group with the rest of the patients, and filter the differentially expressed genes, add them in one list which will allow me to discriminate every group.
So i followed the DESeq workflow like this:
allreads2 <- read.table("allreads2.txt", header=TRUE, row.names=1)
conds <- factor(c("p10019", "p10139", "p10381", "p10398", "p10414", "p10487",
"p10878", "p10975", "p10991", "p11065", "p11176", "p11229", "p11478",
"p11913", "p13465", "p13690", "p14000", "p14246", "p14435", "p14462",
"p14470", "p14696", "p14696", "p14896", "p14903", "p15159", "p15867",
"p15949", "p16070", "p17343", "p17495", "p17838", "p1918", "p459",
"p4659", "p5844", "p5929", "p9506", "p9545", "p9882", "control1", "control2" )) (I have considered each patient as one conditions)
cds <- newCountDataSet( allreads2, conds )
cds <- estimateSizeFactors( cds )
normcds <- counts( cds, normalized=TRUE )
cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
vsd <- varianceStabilizingTransformation( cds )
And then i have made the heatmap of the count table:
library("RColorBrewer")
library("gtools")
select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:7716]
colors <- colorRampPalette(c("yellow","red"))(100)
heatmap(exprs(vsd)[select,], col = colors, trace="none", margin=c(10,10))
And the heatmap sample to sample distance
dists <- dist( t( exprs(vsd) ) )
mat = as.matrix( dists )
heatmap(mat, trace="none", col = rev(colors), margin=c(13, 13))
Which gave me different results in clustering.
So here are my questions :
1. First of all, is my strategy to get this molecular signature and my R commands right? Otherwise, what should i change?
2. And then which heatmap seems the more appropriate to do my sub groups according to my experiment and the purpose and why (that's actually what's really blocking me at this point, because everybody in my lab has a different point of view on this, and cannot explain)?
That would be really really great if someone can help a student who dont want to end like most of biologists, bad in statistics ^^.
Thanks a lot.
Comment