  • Data filters at what stage of NGS data analysis?

    Greetings friends!

    I seek help with the data that I have: 3 time points, 3 genotypes, and 3 replicates for each genotype/time point combination = 27 libraries.

    The goal is to find genes whose expression profiles over time differ between 2 or more genotypes.

    After our first round of data analysis (including TMM normalization), the time-course graphs and box plots were so noisy (high standard error at each time point) that it was hard to say whether the expression profile of one genotype overlapped with, or was distinct from, those of the other genotypes. The R code is attached at the bottom of this post.

    So, in short, we now need to employ data filters to reduce the noise in our data. Some ideas are (a rough sketch follows this list):
    removing genes that have low expression (count) levels
    removing genes that have high variance across replicates
    removing genes that have low variance across time (constitutively expressed genes are biologically less interesting)
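
    A minimal sketch of what those three filters might look like on a raw count matrix (the cutoffs of 1 CPM, within-group SD of 0.5, and variance of 0.1 are placeholders, not recommendations; 'raw.counts', 'S' and 'Time' are assumed to match the count matrix and factors defined in the code below):

    Code:
    library(edgeR)
    
    lcpm = cpm(raw.counts, log = TRUE)                          # log2 counts-per-million
    grp  = paste(S, Time, sep = "_")                            # genotype x time groups (3 replicates each)
    
    keep.expr = rowSums(cpm(raw.counts) > 1) >= 3               # (1) expressed at > 1 CPM in at least 3 libraries
    rep.sd    = apply(lcpm, 1, function(g) max(tapply(g, grp, sd)))
    keep.rep  = rep.sd < 0.5                                    # (2) replicates agree within every group
    keep.time = apply(lcpm, 1, var) > 0.1                       # (3) at least some variation across samples/time
    
    f.counts  = raw.counts[keep.expr & keep.rep & keep.time, ]  # filtered matrix, still raw integer counts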

    So my question to you is: at what stage of my analysis do I apply these filters?
    On the raw data itself, prior to normalization?
    Or should I first perform the TMM normalization, use the norm factors to transform my data into non-integer normalized counts, and then filter (in which case I think I can no longer fit them with a negative binomial model, right?)
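
    My current understanding (please correct me) is that one filters the raw counts first, using CPM only to decide which genes to keep, and only then computes the TMM factors, so that the counts fed to the negative binomial model stay as integers. A sketch of that ordering, assuming 'raw.counts' and the 'design' matrix from the code below:

    Code:
    library(edgeR)
    
    y    = DGEList(counts = raw.counts)          # raw integer counts, untouched
    keep = filterByExpr(y, design)               # keep/drop decided from CPM, not from normalized counts
    y    = y[keep, , keep.lib.sizes = FALSE]     # drop filtered genes and recompute library sizes
    y    = calcNormFactors(y)                    # TMM factors computed after filtering
    # y$counts are still integers; TMM enters the GLM as an offset, so the
    # negative binomial fit (glmFit) is not disturbed by filtering in this order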

    Code:
    count = read.table("Input.txt", sep="\t", header=T)                     					
    #$#$ read in raw count mapped data
    
    f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]                               
    #$#$ keep only genes with total read count > 27 across all libraries
    
    f.dat = f.count[,-c(1,ncol(count))]                                                         
    #$#$ select only read count, not rest of data frame
    
    S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))                                           
    #$#$ define group
    
    Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))         								
    #$#$ define time
    
    Time.rep = rep(1:3,9)                                                                        
    #$#$ define replicate
    
    Group = paste(S,Time,Time.rep,sep="_")                                                         
    #$#$ define group_time_replicate
    
    library(edgeR)                                                                              
    #$#$ load edgeR package
    
    f.factor = data.frame(files = names(f.dat), S = S , Time = Time, lib.size = c(apply(f.dat,2,sum)),norm.factors = calcNormFactors(as.matrix(f.dat)))  
    #$#$  make data for edgeR method
    
    count.d = new("DGEList", list(samples = f.factor, counts = as.matrix(f.dat)))               
    #$#$  make data for edgeR method
    
    design = model.matrix(~ Time + S)                                                           
    #$#$  make design data for edgeR method
    
    count.d = calcNormFactors(count.d)                                                          
    #$#$  Normalize TMM
    
    glmfit.d = glmFit(count.d, design, dispersion = 0.1)                                        
    #$#$  Fit the negative binomial GLMs (dispersion fixed at 0.1 rather than estimated)
    
    lrt.count = glmLRT(glmfit.d)                                                                
    #$#$  Likelihood ratio test on the fitted model (by default tests the last design coefficient)
    
    result.count = data.frame(f.count, lrt.count$table)                                         
    #$#$  combining raw data and results from edgeR
    
    result.count$FDR = p.adjust(result.count$PValue,method="BH")                                
    #$#$  calculating the False Discovery Rate (the edgeR results column is PValue)
    
    write.table(result.count, "edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)           
    #$#$  saving the combined data set
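
    A variant of the fitting step that estimates the dispersion from the data instead of fixing it at 0.1 (a sketch using the same count.d and design objects as above; the coef choice is a placeholder and would need to match the contrast of interest):

    Code:
    count.d  = estimateDisp(count.d, design)              #$#$ common/trended/tagwise dispersions estimated from the data
    glmfit.d = glmFit(count.d, design)                    #$#$ NB GLM fit using the estimated dispersions
    lrt      = glmLRT(glmfit.d, coef = 2:ncol(design))    #$#$ ANOVA-like test of all non-intercept coefficients
    topTags(lrt)                                          #$#$ genes ranked by the test, with BH-adjusted FDR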

  • #2
    See this:

