
**Jeremy** · 01-23-2013, 05:48 PM

It might be better to use normalized raw reads. That way you can see how reliable the fold difference actually is, since lower read counts are more prone to noise-induced errors. For example, if sample A has 5 reads and sample B has 0 reads, the difference may not be very reliable, whereas if sample A has 3000 reads and sample B has 0, it would be more reliable. FPKM can change the numbers, making small fragments with low read counts look more reliable than they actually are (not sure if that has changed, it's been a while since I tried FPKM).
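To make the point about short fragments concrete, here is a minimal sketch using the standard FPKM definition (reads × 10^9 / (feature length in bp × total mapped fragments)). The library size and feature lengths are hypothetical numbers chosen only for illustration.

```python
def fpkm(reads: int, length_bp: int, total_mapped: int) -> float:
    """Fragments per kilobase of feature per million mapped fragments."""
    return reads * 1e9 / (length_bp * total_mapped)

TOTAL_MAPPED = 20_000_000  # hypothetical library size

# 5 reads on a short 200 bp feature vs. 50 reads on a 2 kb feature
short_low = fpkm(5, 200, TOTAL_MAPPED)
long_high = fpkm(50, 2000, TOTAL_MAPPED)

print(short_low, long_high)  # 1.25 1.25
```

Both features end up with the same FPKM even though one value rests on 5 reads and the other on 50, which is why looking at the underlying counts alongside FPKM can be informative.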

**chadn737** · 01-23-2013, 05:49 PM

I don't like FPKM/RPKM because it can mask how many reads there actually are, even inflating values when there are very few reads. That being said, I disagree with filtering out genes that have a 0 FPKM or read count in one condition. I have seen clear examples where one condition has dozens, even hundreds of reads, while the other has zero. Clearly you can observe expression in one condition and call this differentially expressed. Where things become murky is when one condition has maybe 2-4 reads and the other 0. A better filter would take this possibility into account.

**mbblack** · 01-24-2013, 05:59 AM

Originally posted by bvb1909

I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well-known differentially expressed gene under our condition would be ruled out by you because the statistics tell you so (because FPKM = 0 under the control condition), while biology tells us it is one of the most important genes.... Guess it is also a matter of what you want to get from the data

The problem with low count data is that it results in a very high false positive rate for differentially expressed genes. I suppose it then depends on how tolerant your study is of false positives and what your objective is with the data. However, I will make a final comment that in the published RNA-Seq DGE papers of the last 2 or 3 years, there seems to be a clear and growing consensus that low read count data should be excluded from DGE analyses to avoid the bias of much higher false positive rates, especially in low expressors. Published papers that use FPKM/RPKM normalization vary in their cutoffs, from 0.1 to 0.5 (one paper I seem to recall used an even higher cutoff, but I remember the read depth was quite low in that work as well). However, I too am becoming a non-fan of RPKM or similar methods, as they can be very misleading for some genes.

In my own work, we have settled on excluding raw counts less than 11 (so I actually filter on count > 10), and then normalizing what remains. Even then, it's simple to plot and show that genes with raw counts between 11 and about 150 have very high variance in their transcript abundance estimates, while for those with counts above about 150, the variance tightens up dramatically. We also always run 5 biological replicates for all treatments and controls.
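A filter-then-normalize step along those lines might look like the sketch below. Two things here are assumptions and not details from the post: a gene is kept only if every sample has a raw count > 10, and the normalization is plain counts-per-million computed over the retained genes. The function name and the example counts are hypothetical.

```python
MIN_COUNT = 10  # keep raw counts > 10, i.e. 11 or more

def filter_then_normalize(counts, min_count=MIN_COUNT):
    """counts maps gene -> list of raw read counts, one per sample."""
    # Assumption: every sample must exceed the cutoff for the gene to stay.
    kept = {g: c for g, c in counts.items() if all(x > min_count for x in c)}
    if not kept:
        return {}
    n_samples = len(next(iter(kept.values())))
    # Assumption: library size is the sum over retained genes only.
    lib_sizes = [sum(c[i] for c in kept.values()) for i in range(n_samples)]
    # Counts-per-million scaling of what remains after filtering.
    return {
        g: [x * 1e6 / lib_sizes[i] for i, x in enumerate(c)]
        for g, c in kept.items()
    }

raw = {"geneA": [5, 0], "geneB": [3000, 2500], "geneC": [11, 40]}
cpm = filter_then_normalize(raw)
print(sorted(cpm))  # ['geneB', 'geneC']
```

Whether the cutoff should apply to every sample, to any sample, or to a per-group mean is a design choice that depends on the study; swapping `all` for `any` gives the more permissive version.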

Working in toxicology and particularly with risk assessment type studies, we do not have the option of dismissing statistical significance, and in fact almost always base our DGE assessments on simultaneously filtering results for statistical significance and minimal fold change difference (although for initial exploratory analyses, we may relax those criteria - as you say, it depends on what one's goals for the data are).

Just as an aside, in the limited qPCR validation series that I've run, we get very poor correspondence with RNA-Seq results based solely on statistical significance or solely on fold change. Correspondence (using ABI TaqMan rtPCR assays) improves dramatically when comparing genes that were both statistically significant and met minimum fold change differences (I usually filter for genes with FDR < 0.05 and FC > +/- 1.5). Nothing novel in that result, and of course the same applies to microarray data for that matter: combining statistical significance and some minimum magnitude of relative change proves a far more robust estimator of differential gene expression than either cutoff alone. The problem with the original post that started this thread is that you cannot compute statistical significance in the absence of replicates, so you are left with just raw differences in magnitude based on single measures of abundance.
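As a rough sketch of that combined filter, assuming the results are available as (gene, FDR, log2 fold change) tuples and reading "FC > +/- 1.5" as an absolute fold change above 1.5 in either direction, one could write something like the following. The gene names and values are made up for illustration.

```python
import math

def is_dge(fdr, log2_fc, fdr_cutoff=0.05, min_fc=1.5):
    """True only if the gene passes BOTH the FDR and fold-change cutoffs."""
    # |log2FC| > log2(1.5) is the same as FC > 1.5 up or down.
    return fdr < fdr_cutoff and abs(log2_fc) > math.log2(min_fc)

# Hypothetical results: (gene, FDR, log2 fold change)
results = [
    ("gene1", 0.01, 1.2),   # significant and large change
    ("gene2", 0.01, 0.3),   # significant but small change
    ("gene3", 0.20, 3.0),   # large change but not significant
    ("gene4", 0.03, -0.9),  # significant, down-regulated
]

hits = [gene for gene, fdr, lfc in results if is_dge(fdr, lfc)]
print(hits)  # ['gene1', 'gene4']
```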
