Data filters at what stage of NGS data analysis?

anandksrao



10-08-2012, 10:27 AM

Greetings friends!

I am seeking help with my data set: 3 time points x 3 genotypes x 3 replicates each = 27 libraries.

The goal is to find genes whose expression profiles over time differ between two or more genotypes.

After our first round of data analysis (including TMM normalization), the time-course graphs and box plots were so noisy, with a high standard error at each time point, that it was hard to tell whether the expression profile of one genotype overlapped with or was distinct from those of the other genotypes. The R code is at the bottom of this post.

In short, we now need to apply data filters to check and reduce the noise in our data. Some ideas are:
- removing genes that have low expression (count) levels
- removing genes that have high variance across replicates
- removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
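
As a rough illustration of the first idea, a low-expression filter can use counts per million (CPM) only to decide which rows to keep, while leaving the counts themselves untouched. This is a minimal base-R sketch and not code from the original post: the function name `filter_low_counts`, the toy matrix, and the thresholds (`min_cpm = 1`, `min_libs = 3`, chosen here to match three replicates per condition) are all assumptions.

```r
# Hypothetical helper: keep genes with CPM >= min_cpm in at least min_libs libraries.
filter_low_counts <- function(counts, min_cpm = 1, min_libs = 3) {
  lib_size <- colSums(counts)                    # total reads per library
  cpm <- t(t(counts) / lib_size) * 1e6           # scale each column by its library size
  keep <- rowSums(cpm >= min_cpm) >= min_libs    # expressed in enough libraries?
  counts[keep, , drop = FALSE]                   # return the raw counts, not the CPM values
}

# Toy data (made up): 5 genes x 6 libraries of raw counts.
counts <- rbind(
  low      = c(0, 1, 0, 0, 2, 0),
  moderate = c(50, 60, 55, 40, 45, 52),
  high     = c(900, 1100, 1000, 950, 1050, 990),
  sporadic = c(0, 0, 300, 0, 0, 0),
  bulk     = rep(2e6, 6)
)

filtered <- filter_low_counts(counts)
rownames(filtered)
```

Because the function returns the original counts for the retained genes, the result can still be passed on to normalization and count-based modelling.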

So my question is: at what stage of the analysis should I apply these filters?
On the raw data itself, prior to normalization?
Or should I perform the TMM normalization, use the normalization factors to transform my data to non-integer normalized counts, and then filter (in which case I think I can no longer fit them with a negative binomial model, right)?
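
On the worry about non-integer counts: in edgeR the TMM factors are not used to transform the counts. They rescale the library sizes into effective library sizes, which enter the model as an offset, so the response stays as raw integer counts. The sketch below shows that idea for a single gene, using `glm()` with the `negative.binomial()` family from MASS as a stand-in for `glmFit()`. All numbers, including the normalization factors, are made up for illustration; the fixed dispersion of 0.1 mirrors the value used in the code below.

```r
library(MASS)  # ships with R; provides the negative.binomial() family

# Hypothetical raw counts for one gene in six libraries (two groups of three).
y        <- c(12L, 30L, 18L, 55L, 140L, 80L)
group    <- factor(rep(c("A", "B"), each = 3))
lib_size <- c(1.0e6, 2.5e6, 1.5e6, 1.0e6, 2.5e6, 1.5e6)
norm_fac <- c(1.00, 0.95, 1.05, 1.02, 0.98, 1.00)  # assumed TMM-style factors

# Effective library size: normalization changes the offset, not the counts.
eff_lib <- lib_size * norm_fac

# theta = 1 / dispersion in the MASS parameterization.
fit <- glm(y ~ group + offset(log(eff_lib)),
           family = negative.binomial(theta = 1 / 0.1))

coef(fit)
```

Under this view, filtering can be done on the raw count table (using CPM or a similar scale only to choose rows), and the filtered integer counts still fit a negative binomial model.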

Code:

# Read in the raw mapped read counts
count = read.table("Input.txt", sep="\t", header=TRUE)

# Keep only genes whose total read count across all libraries exceeds 27
f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]

# Select only the read-count columns, not the rest of the data frame
f.dat = f.count[,-c(1,ncol(count))]

# Define genotype (9 libraries each)
S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))

# Define time point
Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))

# Define replicate
Time.rep = rep(1:3,9)

# Define genotype_time_replicate labels
Group = paste(S,Time,Time.rep,sep="_")

# Load the edgeR package
library(edgeR)

# Build the sample table for edgeR (library sizes and TMM normalization factors)
f.factor = data.frame(files = names(f.dat), S = S , Time = Time, lib.size = c(apply(f.dat,2,sum)),norm.factors = calcNormFactors(as.matrix(f.dat)))

# Build the DGEList object for edgeR
count.d = new("DGEList", list(samples = f.factor, counts = as.matrix(f.dat)))

# Build the design matrix for edgeR
design = model.matrix(~ Time + S)

# Compute the TMM normalization factors
count.d = calcNormFactors(count.d)

# Fit the negative binomial generalized linear models with a fixed dispersion
glmfit.d = glmFit(count.d, design, dispersion = 0.1)

# Likelihood ratio tests
lrt.count = glmLRT(count.d, glmfit.d)

# Combine the raw data with the edgeR results
result.count = data.frame(f.count, lrt.count$table)

# Calculate the false discovery rate (Benjamini-Hochberg)
result.count$FDR = p.adjust(result.count$p.value,method="BH")

# Save the combined data set
write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=FALSE)
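
A side note on the design matrix, offered as a sketch rather than a correction of the posted analysis: the formula `~ Time + S` is additive, so it assumes every genotype follows the same time profile up to a constant shift. Since the stated goal is to find genes whose time profiles differ between genotypes, one possible alternative is a model with a genotype-by-time interaction. The base-R snippet below only builds the two design matrices and locates the interaction columns; the variable names `design_add`, `design_int`, and `int_cols` are hypothetical.

```r
# Same factor layout as in the post: 3 genotypes x 3 time points x 3 replicates.
S    <- factor(rep(c("gen1", "gen2", "gen3"), each = 9))
Time <- factor(rep(rep(c("0", "10", "20"), each = 3), times = 3))

# Additive design, as posted: one shared time profile plus a genotype shift.
design_add <- model.matrix(~ Time + S)

# Interaction design: each genotype may have its own time profile.
design_int <- model.matrix(~ Time * S)

# Interaction terms are the columns whose names contain ":".
int_cols <- grep(":", colnames(design_int))

colnames(design_int)[int_cols]
```

In edgeR, column indices such as `int_cols` could then be passed to the `coef` argument of `glmLRT()` to test all interaction terms jointly, assuming the model was fitted with the interaction design.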

Tags: data filter, model fitting, negative binomial, time course deseq, tmm normalization

markrobinsonca


10-10-2012, 02:11 AM

See this:

[BioC] Data filtering

https://stat.ethz.ch/pipermail/bioconductor/2012-October/048508.html