I am using edgeR to analyze RNA-Seq data. This is my script:
library("edgeR")
#############################
#read in metadata & DGE
#############################
composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",")
counts <- readDGE(composite_samples$CountFiles)$counts
#############################
#Filter & Library Size Re-set
#############################
noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"))
cpms <- cpm(counts)
keep <- rowSums(cpms>1)>=3 & !noint
counts <- counts[keep,]
colnames(counts) <- composite_samples$SampleName
d <- DGEList(counts=counts, group=composite_samples$Condition)
d$samples$lib.size <- colSums(d$counts)
#############################
#Normalisation
#############################
d <- calcNormFactors(d)
#############################
#Recording the normalized counts
#############################
all_cpm=cpm(d, normalized.lib.size=TRUE)
all_counts <- cbind(rownames(all_cpm), all_cpm)
colnames(all_counts)[1] <- "Ensembl.Gene.ID"
rownames(all_counts) <- NULL
#############################
#Estimate Dispersion
#############################
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
#############################
#Perform a test
#############################
de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N"))
I believe that the variable "all_counts" shall contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation. But I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts" which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Can, please, somebody comment on this?
library("edgeR")
#############################
#read in metadata & DGE
#############################
composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",")
counts <- readDGE(composite_samples$CountFiles)$counts
#############################
#Filter & Library Size Re-set
#############################
noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"))
cpms <- cpm(counts)
keep <- rowSums(cpms>1)>=3 & !noint
counts <- counts[keep,]
colnames(counts) <- composite_samples$SampleName
d <- DGEList(counts=counts, group=composite_samples$Condition)
d$samples$lib.size <- colSums(d$counts)
#############################
#Normalisation
#############################
d <- calcNormFactors(d)
#############################
#Recording the normalized counts
#############################
all_cpm=cpm(d, normalized.lib.size=TRUE)
all_counts <- cbind(rownames(all_cpm), all_cpm)
colnames(all_counts)[1] <- "Ensembl.Gene.ID"
rownames(all_counts) <- NULL
#############################
#Estimate Dispersion
#############################
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
#############################
#Perform a test
#############################
de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N"))
I believe that the variable "all_counts" shall contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation. But I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts" which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Can, please, somebody comment on this?
Comment