
**bruce01** · 04-28-2015, 02:23 AM

Here is a good overview of what %in% does.

noint is a logical vector: it is TRUE for each entry of rownames(counts) that matches one of c("no_feature"...) and FALSE otherwise. The '!' (not) operator then flips it, so those rows are removed when defining 'keep'.
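A minimal base-R sketch of that behaviour. The row names and the variable names rn and kept are invented for illustration and are not the real data:

```r
# Toy row names standing in for rownames(counts); values are invented.
rn <- c("CGI_1", "CGI_2", "no_feature", "ambiguous")

# %in% returns one TRUE/FALSE per element of the left-hand side.
noint <- rn %in% c("no_feature", "ambiguous", "not_aligned")
print(noint)   # values: FALSE FALSE TRUE TRUE

# '!' flips the flags, so indexing keeps only the real features.
kept <- rn[!noint]
print(kept)    # values: "CGI_1" "CGI_2"
```

The result always has the same length as the left-hand side, which is why it can be combined with the rowSums() filter using '&'.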

There are other ways to do this, for example I use ENSEMBL annotations and so can just grep all lines with 'ENS' in the shell before I read counts into R.
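For anyone who prefers to stay in R, roughly the same idea can be sketched with grepl(). The IDs below are made up, and the pattern is anchored to the start of the string, which is slightly stricter than a plain grep for 'ENS' in the shell:

```r
# Toy IDs; the ENSEMBL-style names are invented for illustration.
ids <- c("ENSG00000000003", "ENSG00000000005", "__no_feature", "__ambiguous")

# Keep only IDs that begin with "ENS".
is_gene <- grepl("^ENS", ids)
genes <- ids[is_gene]
print(genes)
```

This only works when every real feature shares a known prefix, as ENSEMBL IDs do.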

Hope that helps.

**elizabeth000** · 04-28-2015, 03:29 AM

Thank you, that is what I thought it was supposed to do!
So if I understand properly, noint contains a FALSE value for every rowname that doesn't include any "no_feature" or "ambiguous" etc and contains a TRUE value for every rowname that includes "no_feature" or "ambiguous" etc.

But it doesn't give the expected output. I ran the following lines:
data = readDGE(listfiles)
noint = rownames(data) %in% c("no_feature","ambiguous","too_low_aQual","not_aligned","alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)

The number of rows of data$counts was reduced from 28031 to 18064, but the no_feature etc rows are still present:
> tail(rownames(data$counts))
[1] "CGI_10028935" "CGI_10028939" "__no_feature" "__ambiguous"
[5] "__too_low_aQual" "__not_aligned"

I cannot find my error...

**bruce01** · 04-28-2015, 03:43 AM

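The problem is that your vector 'c("no_feature"...)' does not contain "__no_feature" etc., with the two leading underscores that appear in your row names. %in% only matches whole strings exactly, so nothing was flagged. If you use the underscore-prefixed names in the vector, those rows will be removed as well.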
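A quick way to see the mismatch, sketched in base R with a few row names copied from the tail() output above. The variable names (no_prefix, hits_plain, and so on) are just for illustration:

```r
# Toy row names mimicking the tail() output shown earlier.
rn <- c("CGI_10028935", "CGI_10028939", "__no_feature", "__ambiguous")

# %in% compares whole strings, so names without underscores match nothing.
no_prefix <- c("no_feature", "ambiguous")
hits_plain <- sum(rn %in% no_prefix)

# Prepending "__" makes the strings identical to the row names.
hits_prefixed <- sum(rn %in% paste0("__", no_prefix))

# Sanity check: which of the names you expected are absent from rn?
not_found <- no_prefix[!(no_prefix %in% rn)]
print(c(plain = hits_plain, prefixed = hits_prefixed))
print(not_found)
```

Running table(noint) or sum(noint) right after defining noint is a cheap check that the expected number of rows was actually flagged.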

**elizabeth000** · 04-28-2015, 03:56 AM

Yes, I just noticed this and fixed the bug myself! Obviously the string has to match exactly...
Like a fool I was using the exact syntax from the Nature Protocols paper, which surprisingly does not seem to be correct. The code that works for me is:

Code:

data = readDGE(listfiles)
noint = rownames(data) %in% c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique")
cpmd = cpm(data)
keep = rowSums(cpmd > 1) >=2 & !noint
data = data[keep,]
data$samples$lib.size = colSums(data$counts)

> table(noint)
noint
FALSE TRUE
28026 5

> tail(rownames(data$counts))
[1] "CGI_10028931" "CGI_10028932" "CGI_10028933" "CGI_10028934" "CGI_10028935"
[6] "CGI_10028939"

Also I noticed in the Nature Protocols paper there is no mention of recomputing library sizes, although this is always done in the examples from the edgeR user's guide. Can anyone think of a reason that the library sizes should NOT be recomputed after filtering? I just want to check... Thanks a lot!
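To make the question concrete, here is a toy base-R sketch of what recomputing does numerically. The counts matrix is invented and this does not use edgeR itself. It only shows how the column sums change once rows are dropped, not whether recomputing is the right choice:

```r
# Invented counts: 3 genes plus one "__no_feature" row, 2 samples.
counts <- matrix(c(10, 20, 0, 70,
                   15, 25, 1, 59),
                 nrow = 4,
                 dimnames = list(c("g1", "g2", "g3", "__no_feature"),
                                 c("s1", "s2")))

# Library sizes before filtering: s1 = 100, s2 = 100.
lib_before <- colSums(counts)

# Drop the special row and the low-count gene, as a filter would.
keep <- !(rownames(counts) %in% "__no_feature") & rowSums(counts) > 1
filtered <- counts[keep, , drop = FALSE]

# Library sizes after filtering: s1 = 30, s2 = 40.
lib_after <- colSums(filtered)
print(rbind(lib_before, lib_after))
```

In this made-up case most of the drop comes from the "__no_feature" row, so the two versions of the library size differ a lot.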


Understanding edgeR protocol from Anders et al 2013
