**mikep** · 08-05-2014, 06:03 PM

Raw counts are not normally distributed, so use Spearman, not Pearson. Also discard any genes with 0 counts. Actually, I'd probably discard genes with fewer than 10 counts.
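For anyone who wants to try this, here is a minimal sketch of the filter-then-Spearman step in plain Python. The function names, the toy counts, and the choice to keep a gene only when both replicates have at least 10 reads are assumptions made for illustration; the post does not say how the threshold should be applied across samples.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_filtered(counts_a, counts_b, min_count=10):
    """Spearman correlation over genes with >= min_count reads in BOTH samples.

    Requiring both samples to pass is an assumption, not something the
    thread specifies.
    """
    kept = [(a, b) for a, b in zip(counts_a, counts_b)
            if a >= min_count and b >= min_count]
    xs = [a for a, _ in kept]
    ys = [b for _, b in kept]
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy counts for two replicates, same gene order (made-up numbers).
rep1 = [0, 5, 12, 40, 150, 900, 33]
rep2 = [1, 3, 15, 35, 170, 850, 60]
rho = spearman_filtered(rep1, rep2)
print(round(rho, 3))
```

With real data you would read the two count columns from your count table instead of the toy lists.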

Secondly, is this human data, or in other words are your biological replicates sampled from different individuals with a heterogeneous genetic background?

**mihuzx** · 08-05-2014, 10:47 PM

Thank you for your advice.
I removed all genes with < 10 counts and calculated the Spearman correlation, but it is still only about 0.93.
I also recalculated the correlation with genes < 1 RPKM removed, and it didn't change.
Now I wonder whether I can use this data to call DE genes, and how much this will affect the result.
Or, if I do use the data, how can I minimize the impact?

**velt** · 08-05-2014, 11:53 PM

The Pearson and Spearman correlation coefficients are not well suited to RNA-seq count data. Indeed, we want to know if expression values are the same between two samples (linearity => Pearson coefficient), not just whether they have an increasing or decreasing trend (Spearman coefficient). But, Pearson’s r is generally ambiguous and highly dependent on sequencing depth and the range of expression levels inherent to the sample (difference between lowest and highest bin count).

I think it is difficult, from these coefficients, to determine if the samples are good replicates or not.

I advise you to read this publication and to use the SERE coefficient, which is well suited to the comparison of RNA-seq samples:

SERE: single-parameter quality control and sample comparison for RNA-Seq - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/23033915

SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter.

A score of 1 indicates faithful replication, and the higher the score, the more the samples differ. I use this coefficient to explore my data.
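As a rough illustration of what the score measures, here is a sketch of SERE in plain Python, written from my reading of the published definition: for each gene, compare the observed counts with the counts expected if the gene's reads were split in proportion to each sample's total depth, then take the square root of the averaged squared error. The function name and the toy inputs are assumptions; check the paper and its reference code before relying on this.

```python
from math import sqrt

def sere(samples):
    """SERE-style score for a list of per-sample count lists.

    All lists must be in the same gene order. This is a sketch of the
    statistic as I understand it, not the authors' reference code.
    """
    n_samples = len(samples)
    sample_totals = [sum(s) for s in samples]
    grand_total = sum(sample_totals)
    squared_error = 0.0
    n_genes = 0
    for gene_counts in zip(*samples):
        gene_total = sum(gene_counts)
        if gene_total == 0:
            continue  # genes never observed carry no information
        n_genes += 1
        for observed, depth in zip(gene_counts, sample_totals):
            # Expected count if this gene's reads follow sequencing depth.
            expected = gene_total * depth / grand_total
            squared_error += (observed - expected) ** 2 / expected
    return sqrt(squared_error / (n_genes * (n_samples - 1)))

# Identical profiles at different depths give a score of 0 (toy data).
same_profile = sere([[10, 20, 30], [20, 40, 60]])
# Completely disjoint profiles give a much larger score.
disjoint = sere([[10, 0], [0, 10]])
print(same_profile, disjoint)
```

Because the expected counts are scaled by each sample's total, the score already accounts for differences in sequencing depth, which is one reason it behaves differently from a correlation on raw counts.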

**mikep** · 08-06-2014, 12:59 AM

You didn't mention your sample source. If it is different people then 0.93 might be as good as it gets. I get around 0.95 on my data.

Another option (for future use) is to use a spike-in such as ERCC; you can then correlate counts independently of biological variability.

As for DE, my advice is suck it and see.

Finally Velt, nice call. Assimilating SERE into our pipeline in 3...2...1...

**mihuzx** · 08-06-2014, 03:17 AM

Hi velt,
thank you very much.
I have tried it with my data; the SERE score is 5.8,
and for another pair of replicates it is about 3.3.
Is this too high? Any suggestions?
By the way, I think this standard is really strict.

**mbblack** · 08-06-2014, 04:01 AM

Well, your single greatest source of variation when it comes to differential expression is biological variation amongst individuals in your population. So if these were two different individuals, then your observed correlations might not be far off, at least when looking only at raw read counts.

Also, did you have equal or near equal read depth for each sample? If you had large differences in read depth across the two samples, then raw counts will also vary a great deal because of that.

Honestly, I would not worry about such differences in raw counts between biological replicates. That sort of variability is the very reason you use biological replication, so you can compute a robust mean population response. Individuals will inherently vary, often a great deal, in raw expression estimates.

How do your normalized read counts compare for these two samples? That is by far a more meaningful comparison than raw counts. Also, basing a comparison on an N of just 2 can be very misleading, as you have no idea how those two biological samples fall out in terms of the range of variation in expression for your population.
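To make the read-depth point concrete, here is a small sketch using counts per million (CPM) as the normalization. CPM is just one simple choice picked for illustration; the post does not name a method, and DE packages generally apply their own normalization. The function name and counts are made up.

```python
def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]

# Same expression profile, but the second sample was sequenced twice as deep.
shallow = [10, 20, 30, 140]
deep = [20, 40, 60, 280]

cpm_shallow = counts_per_million(shallow)
cpm_deep = counts_per_million(deep)

# Raw counts differ by a factor of two per gene; CPM values agree.
largest_gap = max(abs(a - b) for a, b in zip(cpm_shallow, cpm_deep))
print(largest_gap)
```

Any residual difference after a normalization like this reflects something other than total depth, which is the comparison the question above is asking for.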

**bkellman16** · 09-21-2015, 11:11 AM

SERE over log transform

My understanding is that log(Poisson) [log(counts) in this case] will approximate a normal distribution, thereby achieving linearity. Is there a benefit to using SERE over the Pearson correlation of log-transformed counts?
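For comparison, here is a sketch of the log-transform approach in plain Python. The pseudocount of 1 (to avoid log of zero) and the toy counts are assumptions. The example is built so that one very highly expressed gene agrees between samples while the low-count genes disagree, which shows how much the raw Pearson value depends on the range of expression.

```python
from math import log2, sqrt

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def log_pearson(counts_a, counts_b, pseudocount=1):
    """Pearson correlation of log2(count + pseudocount) values."""
    la = [log2(c + pseudocount) for c in counts_a]
    lb = [log2(c + pseudocount) for c in counts_b]
    return pearson(la, lb)

# Low-count genes are anti-correlated; one huge gene matches (toy data).
rep1 = [1, 2, 4, 8, 10000]
rep2 = [8, 4, 2, 1, 10000]

raw_r = pearson(rep1, rep2)      # dominated by the single large gene
log_r = log_pearson(rep1, rep2)  # the large gene has less leverage
print(raw_r, log_r)
```

The log transform reduces the leverage of the top gene, but on this toy data the coefficient is still pulled up by it, so this sketch only shows how to compute the quantity and does not settle whether it is a substitute for SERE.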

RNA-seq bio-replication with low correlation
