**dpryan** · 10-28-2013, 01:25 AM

PCA won't really do what you want (though I suppose it could vaguely hint at it). Why don't you just directly measure the correlation between the samples? That would seem to more directly answer the question.

**Kennels** · 10-28-2013, 02:11 AM

Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.

**mcnelson.phd** · 10-28-2013, 04:05 AM

Originally posted by Kennels:

Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.

PCA will let you group samples that are "similar", but it won't tell you whether the expression patterns are correlated. Correlations are good for telling you whether there's a relationship and whether it's positive or negative, which PCA won't tell you. If you're concerned about linearity, use the Spearman rank correlation instead of Pearson.
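To make the Pearson/Spearman distinction concrete, here is a minimal sketch in plain Python (not R, and not from the thread). The two "samples" are made-up values in which one is a monotonic but non-linear function of the other; `pearson`, `ranks` and `spearman` are illustrative helpers, not library functions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression values: sample_b rises with sample_a, but not linearly.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_b = [a ** 3 for a in sample_a]

r_pearson = pearson(sample_a, sample_b)
r_spearman = spearman(sample_a, sample_b)
print("Pearson: ", round(r_pearson, 3))   # below 1: the relationship is not a straight line
print("Spearman:", round(r_spearman, 3))  # rank order is identical in both samples
```

Because Spearman only looks at rank order, it only needs the relationship to be monotonic, which is why it is the usual fallback when linearity is in doubt.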

**sphil** · 10-28-2013, 05:09 AM

Originally posted by Kennels:

... But this would be assuming some kind of a linear relationship (?), and there is ....

If you use Pearson, that would be the case. Maybe have a look at the Kendall tau correlation, which, afaik, also suits non-linear relationships. What you can use for sure is the information content (mutual information).

cheers...
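As a rough illustration of what Kendall's tau measures, the sketch below (plain Python, made-up values) counts gene pairs whose ordering agrees or disagrees between two samples. It implements the simple tau-a variant, which assumes no tied values; `kendall_tau_a` and the sample lists are hypothetical names for this example only.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. No tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical expression of genes A, B, C, D, E in three samples.
sample_1 = [2.0, 5.0, 40.0, 90.0, 11.0]
sample_2 = [1.0, 3.0, 300.0, 2500.0, 8.0]   # same gene ordering, very different scale
sample_3 = [3.0, 1.0, 300.0, 2500.0, 8.0]   # genes A and B swapped relative to sample_1

tau_same = kendall_tau_a(sample_1, sample_2)
tau_swapped = kendall_tau_a(sample_1, sample_3)
print(tau_same, tau_swapped)
```

Only the ordering of the genes matters here, so the large difference in scale between the first two samples has no effect on the result.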

**rskr** · 10-28-2013, 05:25 AM

Originally posted by Kennels:

Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.

Fwiw, PCA is also based on linear relationships.

**gringer** · 10-28-2013, 11:22 AM

Expression values generally have a log-linear distribution. You might get away with a standard linear Pearson's correlation if you take the log of expression values first.

The best way to be sure is by graphing and eyeballing. With 5 samples and ~30 genes, you could probably get a quicker idea of the most similar profiles with a scatterplot matrix -- I would do one without transformation, and another with log-transformed values:

http://www.statmethods.net/graphs/scatterplot.html
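A small numeric sketch of why it is worth checking both the raw and the log-transformed values (plain Python, invented numbers, not the poster's data): on the raw scale a single highly expressed gene can dominate Pearson's correlation, while on the log scale the disagreement among the remaining genes becomes visible. The values are all positive, so no pseudocount is added; with zeros in real data, some offset would be needed, and that choice is an assumption left open here.

```python
from math import log10, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical expression values for five genes in two samples.
# The last gene is expressed far more strongly than the rest in both.
sample_a = [5.0, 50.0, 20.0, 8.0, 10000.0]
sample_b = [40.0, 6.0, 9.0, 60.0, 9000.0]

r_raw = pearson(sample_a, sample_b)
r_log = pearson([log10(v) for v in sample_a], [log10(v) for v in sample_b])

print("raw scale:", round(r_raw, 4))  # driven almost entirely by the one large gene
print("log scale:", round(r_log, 4))  # lower: the other four genes now carry weight
```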

**Kennels** · 10-28-2013, 04:09 PM

Thanks everyone for the advice and information. I am currently testing out scatterplots and several correlation tests via R.

However... and I do apologize if this doesn't seem to be getting through to me ... I am not really trying to find a positive or negative correlation. I will certainly get some value from my data, but I am concerned that biologically that could be misleading because of the complex interplay in this small set of genes.

For example, genes A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
"Which samples behave most similarly in the interplay of the expression of these genes?"
which is what led me to PCA in the first place.

But perhaps this is exactly what correlation does and I am misunderstanding it? My understanding was that the Pearson/Spearman correlations require the data to be linear or monotonic (somewhat linear), which in my samples I 'believe' they aren't. Of course I have to confirm this with gringer's suggestions.

**gringer** · 10-28-2013, 04:28 PM

For example, genes A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
"Which samples behave most similarly in the interplay of the expression of these genes?"

You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).

Principal Component Analysis is a dimension reduction technique that uses linear transformations of multi-dimensional values to allow them to be reduced to a simpler (lower-dimensional) complexity, commonly down to two dimensions. I believe the usual methods for working out how to do this reduction involve expectations of normally distributed data, and carry out something similar to a correlation analysis to work out how to weight each component [rskr's statement seems to support this] -- someone please correct me if I'm wrong about that. I don't think you can get away completely from linear correlations by trying to hide your data in a PCA.
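For what it's worth, the link between PCA and linear relationships can be seen in a bare-bones sketch (plain Python, invented log-scale values). The first principal component is the leading eigenvector of the covariance matrix between features, and covariance only captures linear association. Power iteration is used here purely to keep the example dependency-free; it is a stand-in, not a description of how any particular PCA routine is implemented, and `first_principal_component` is a hypothetical helper.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-sample scores) of PC1 for rows = samples x features."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance between features: the only input PCA uses from here on.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p  # arbitrary starting vector for power iteration
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical log-scale expression of three genes in five samples.
names = ["S1", "S2", "S3", "S4", "S5"]
data = [
    [2.0, 4.1, 1.0],
    [2.2, 4.5, 1.1],
    [5.0, 9.8, 0.9],
    [3.0, 6.2, 1.2],
    [3.1, 6.0, 1.0],
]

direction, scores = first_principal_component(data)
for name, score in zip(names, scores):
    print(name, round(score, 3))  # samples with close scores lie close together along PC1
```

Samples that sit near each other on the leading components are "similar" in this linear sense, which is why PCA can group samples but does not escape the linearity assumption.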

**Kennels** · 10-28-2013, 05:33 PM

Originally posted by gringer:

You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).

Thanks gringer for your explanation - it is clearer now, and looking at the correlation output it makes more sense.

You were right that the log values seem to show overall a linear relationship (which was a surprise to me for these genes).
Would you be able to comment on my interpretation? I did a scatterplot matrix with log values (image below), and a Spearman's correlation in R:

Code:

          S1        S2        S3        S4        S5
S1 1.0000000 0.8409553 0.6508859 0.7027103 0.7342251
S2 0.8409553 1.0000000 0.6067227 0.7691877 0.8000000
S3 0.6508859 0.6067227 1.0000000 0.5299720 0.5543417
S4 0.7027103 0.7691877 0.5299720 1.0000000 0.7823529
S5 0.7342251 0.8000000 0.5543417 0.7823529 1.0000000

S3 seems to be the 'least' correlated with all the others, and S1 and S2 the 'most'.
I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.
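One way to check that reading of the matrix is to average each sample's correlation with the other four and to look up the highest off-diagonal entry. The sketch below does this in plain Python using the values posted above; the variable names are only for illustration.

```python
from itertools import combinations

names = ["S1", "S2", "S3", "S4", "S5"]
rho = [
    [1.0000000, 0.8409553, 0.6508859, 0.7027103, 0.7342251],
    [0.8409553, 1.0000000, 0.6067227, 0.7691877, 0.8000000],
    [0.6508859, 0.6067227, 1.0000000, 0.5299720, 0.5543417],
    [0.7027103, 0.7691877, 0.5299720, 1.0000000, 0.7823529],
    [0.7342251, 0.8000000, 0.5543417, 0.7823529, 1.0000000],
]

# Mean correlation of each sample with the other four (the diagonal 1.0 is excluded).
mean_with_others = {
    name: (sum(row) - 1.0) / (len(row) - 1) for name, row in zip(names, rho)
}
least_similar = min(mean_with_others, key=mean_with_others.get)

# Highest off-diagonal entry, i.e. the most similar pair of samples.
best_i, best_j = max(combinations(range(len(names)), 2), key=lambda ij: rho[ij[0]][ij[1]])
most_similar_pair = (names[best_i], names[best_j])

for name, value in mean_with_others.items():
    print(name, round(value, 4))
print("least similar overall:", least_similar)
print("most similar pair:", most_similar_pair)
```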

Thanks again!

Attached Files

ScatterPlot.LogtpmsRSEM.png (25.2 KB, 8 views)

**gringer** · 10-28-2013, 07:19 PM

Originally posted by Kennels:

I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.

Indeed, significance depends largely on what feels right, the assumptions that have been made, and occasionally who's paying for the research.

Your original question was to find the two samples that are 'most similar', and as you have found, it's fairly obvious given the correlation statistics and scatter plots. If you're looking for "most similar", the thing that matters is not the significance of the correlation statistic (for that most similar pairing), but how different it is from the next "most similar" pairing. There are various other tests that can be done to find out the chance of confusion in that regard, but they're [currently] out of the scope of answers in this thread.
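The size of that gap can be read directly off the posted Spearman values. A minimal sketch in plain Python (it only ranks the pairs and takes the difference; it does not say whether the gap is statistically meaningful):

```python
# Off-diagonal Spearman correlations as posted earlier in the thread.
pair_rho = {
    ("S1", "S2"): 0.8409553, ("S1", "S3"): 0.6508859, ("S1", "S4"): 0.7027103,
    ("S1", "S5"): 0.7342251, ("S2", "S3"): 0.6067227, ("S2", "S4"): 0.7691877,
    ("S2", "S5"): 0.8000000, ("S3", "S4"): 0.5299720, ("S3", "S5"): 0.5543417,
    ("S4", "S5"): 0.7823529,
}

# Rank the pairs from most to least correlated and compare the top two.
ranked = sorted(pair_rho.items(), key=lambda item: item[1], reverse=True)
(best_pair, best_rho), (runner_up, runner_rho) = ranked[0], ranked[1]
gap = best_rho - runner_rho

print("best:", best_pair, best_rho)
print("runner-up:", runner_up, runner_rho)
print("gap:", round(gap, 4))
```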

FWIW, the cor.test function of R will give you p values for your correlation statistic, and you can play around with parametric and non-parametric methods to see how much it changes things if you drop the assumption of normality (see 'help(cor.test)' for more information).
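To show what a p value for a rank correlation means without any distributional assumption, here is a permutation-test sketch in plain Python. This is a swapped-in technique for illustration, not a reimplementation of cor.test, and the data are made up. It assumes no tied values; `simple_ranks`, `spearman_no_ties` and `permutation_p_value` are hypothetical helpers.

```python
import random

def simple_ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(rank_x, rank_y):
    """Spearman's rho from two tie-free rank vectors: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p value: how often shuffled ranks give |rho| at least as large as observed."""
    rng = random.Random(seed)
    rank_x, rank_y = simple_ranks(x), simple_ranks(y)
    observed = spearman_no_ties(rank_x, rank_y)
    shuffled = rank_y[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_no_ties(rank_x, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical log-expression values for eight genes in two samples.
sample_x = [1.2, 3.4, 2.2, 8.9, 5.1, 7.7, 4.0, 6.3]
sample_y = [0.9, 2.8, 3.0, 9.5, 4.4, 6.9, 3.9, 7.5]

rho_observed, p_value = permutation_p_value(sample_x, sample_y)
print("rho:", round(rho_observed, 4), "permutation p:", round(p_value, 4))
```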

**dietmar13** · 10-28-2013, 11:35 PM

NMF or Isomap

If you want to see whether your samples can be subdivided into groups, you could run a non-negative matrix factorization with 2 to 3 groups and check whether you get a good separation (cophenetic correlation coefficient).

Or, if you especially want to consider non-linear associations, you could use Isomap, a non-linear dimensionality reduction method (similar in purpose to PCA).

But how these methods will perform with such a small data set, I don't know...

Is Principal Component Analysis suited for this analysis?
