Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • elizabeth000
    Member
    • Apr 2015
    • 21

    Understanding edgeR protocol from Anders et al 2013

    Hello, I am using edgeR for the first time, and relying on the edgeR user's guide and Anders et al 2013 protocol in Nature Protocols to help me along. I would like to fully understand the steps involved but I need some additional explanation.

    I want to determine which features have a significantly higher number of counts in some samples compared to others. Since I will never be able to establish a significantly higher number of counts for features that have very low counts across all samples, it is not useful to analyze these features. Therefore, I will remove them from the DGEList object by filtering.

    In the edgeR user's guide this is done simply by:
    > keep <- rowSums(cpm(d)>100) >= 2
    > d <- d[keep,,keep.lib.sizes=FALSE]

    This is straight-forward and I believe I fully understand the code. But in Anders et al there is an additional step involving the %in% operator:
    > noint = rownames(counts) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
    > cpms = cpm(counts)
    > keep = rowSums(cpms > 1) >=2 & !noint
    > counts = counts[keep,]

    Does the vector noint contain all the rownames of counts with data other than "no_feature" and "ambigous" etc?
    It seems necessary to me to remove the "__no_feature" etc data at the end of the htseq-count output files, but I do not understand how this code does that. It's way too clever for me!
  • bruce01
    Senior Member
    • Mar 2011
    • 160

    #2
    Here is a good overview of what %in% does.

    noint actually contains only those c("no_feature"...) that are found in rownames(counts), then uses a '!' (not) statement to remove them when defining 'keep'.

    There are other ways to do this, for example I use ENSEMBL annotations and so can just grep all lines with 'ENS' in the shell before I read counts into R.

    Hope that helps.

    Comment

    • elizabeth000
      Member
      • Apr 2015
      • 21

      #3
      Thank you, that is what I thought it was supposed to do!
      So if I understand properly, noint contains a FALSE value for every rowname that doesn't include any "no_feature" or "ambiguous" etc and contains a TRUE value for every rowname that includes "no_feature" or "ambiguous" etc.

      But it doesn't give the expected output. I ran the following lines:
      data = readDGE(listfiles)
      noint = rownames(data) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
      cpmd = cpm(data)
      keep = rowSums(cpmd > 1) >=2 & !noint
      data = data[keep,]
      data$samples$lib.size = colSums(data$counts)

      The number of rows of data$counts was reduced from 28031 to 18064, but the no_feature etc rows are still present:
      > tail(rownames(data$counts))
      [1] "CGI_10028935" "CGI_10028939" "__no_feature" "__ambiguous"
      [5] "__too_low_aQual" "__not_aligned"

      I cannot find my error...

      Comment

      • bruce01
        Senior Member
        • Mar 2011
        • 160

        #4
        The problem is you are using the vector 'c("no_feature"...)', which does not contain "__no_feature" etc. If you add them to the previous vector then they will also be removed.

        Comment

        • elizabeth000
          Member
          • Apr 2015
          • 21

          #5
          Yes, I just noticed this and fixed the bug myself! Obviously the string has to match exactly...
          Like a fool I was using the exact syntax from the Nature Protocols paper, which surprisingly does not seem to be correct. The code that works for me is:

          Code:
          data = readDGE(listfiles)
          noint = rownames(data) %in% c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique")
          cpmd = cpm(data)
          keep = rowSums(cpmd > 1) >=2 & !noint
          data = data[keep,]
          data$samples$lib.size = colSums(data$counts)
          > table(noint)
          noint
          FALSE TRUE
          28026 5

          > tail(rownames(data$counts))
          [1] "CGI_10028931" "CGI_10028932" "CGI_10028933" "CGI_10028934" "CGI_10028935"
          [6] "CGI_10028939"

          Also I noticed in the Nature Protocols paper there is no mention of recomputing library sizes, although this is always done in the examples from the edgeR user's guide. Can anyone think of a reason that the library sizes should NOT be recomputed after filtering? I just want to check... Thanks a lot!

          Comment

          Latest Articles

          Collapse

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, 06-05-2026, 10:09 AM
          0 responses
          11 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-04-2026, 08:59 AM
          0 responses
          23 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-02-2026, 12:03 PM
          0 responses
          28 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-02-2026, 11:40 AM
          0 responses
          22 views
          0 reactions
          Last Post SEQadmin2  
          Working...