  • Kennels
    Senior Member
    • Feb 2011
    • 149

    Is Principal Component Analysis suited for this analysis?

    Hi,

    I have expression values (calculated by RSEM, RNAseq data) for over 30 genes, from 5 samples. Based on these values, I would like to find which two samples display the most 'similar' expression profile. These genes are all from a common pathway for virus defence in plants (RNAi).

    Is PCA suited for this? I realize this is a very small sample set.

    Any advice, comments, recommendations for other tests, greatly appreciated!
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    PCA won't really do what you want (though I suppose it could vaguely hint at it). Why don't you just directly measure the correlation between the samples? That would seem to more directly answer the question.


    • Kennels
      Senior Member
      • Feb 2011
      • 149

      #3
      Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

      What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.


      • mcnelson.phd
        Senior Member
        • Jul 2011
        • 162

        #4
        Originally posted by Kennels
        Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

        What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
        PCA will let you group samples that are "similar", but it won't tell you whether the expression patterns are correlated. Correlations are good for telling you whether there's a relationship and whether it's positive or negative, which PCA won't tell you. If you're concerned about linearity, use the Spearman rank correlation instead of Pearson.
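
A small illustration of the Pearson versus Spearman point, written as a plain-Python sketch (the thread itself works in R). The values are made up, and the helper names `pearson`, `ranks`, and `spearman` are illustrative only, not taken from any post.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up values: sample_b is an exponential (monotonic, non-linear)
# function of sample_a.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_b = [math.exp(v) for v in sample_a]

print("Pearson :", round(pearson(sample_a, sample_b), 3))
print("Spearman:", round(spearman(sample_a, sample_b), 3))
```

Because the relationship is monotonic but curved, the rank-based statistic stays at 1 while Pearson drops below it.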


        • sphil
          Senior Member
          • Apr 2010
          • 192

          #5
          Originally posted by Kennels
          ... But this would be assuming some kind of a linear relationship (?), and there is ....
          If you use Pearson, that would be the case. Maybe have a look at Kendall's tau correlation, which, afaik, also suits non-linear (monotonic) relationships. What you can use for sure is the information content (mutual information).

          cheers...
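
For reference, a minimal Python sketch of Kendall's tau as mentioned above. This is the simple tau-a variant, which counts concordant and discordant pairs and makes no correction for ties; the function name and the expression values are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up expression values for five hypothetical genes in two samples.
sample_1 = [5.0, 40.0, 12.0, 300.0, 90.0]
sample_2 = [2.0, 25.0, 30.0, 900.0, 60.0]

print(kendall_tau(sample_1, sample_2))
```

Only the ordering of the values matters here, so any monotonic transformation (a log, for instance) leaves the result unchanged.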


          • rskr
            Senior Member
            • Oct 2010
            • 249

            #6
            Originally posted by Kennels
            Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

            What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
            Fwiw, PCA is also based on linear relationships.


            • gringer
              David Eccles (gringer)
              • May 2011
              • 845

              #7
              Expression values generally have a log-linear distribution. You might get away with a standard linear Pearson's correlation if you take the log of expression values first.

              The best way to be sure is by graphing and eyeballing. With 5 samples and ~30 genes, you could probably get a quicker idea of the most similar profiles with a scatterplot matrix -- I would do one without transformation, and another with log-transformed values.
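
A rough Python sketch of the log-transform suggestion (not a scatterplot matrix, only the correlation before and after transforming). The expression values are made up, and the pseudocount of 1 added before taking log2 is an assumption to cope with zero values; it is not something the post specifies.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount is an assumed convention."""
    return [math.log2(v + pseudocount) for v in values]

# Made-up expression values; one highly expressed gene dominates the raw scale.
s1 = [0.0, 3.0, 10.0, 45.0, 120.0, 5000.0]
s2 = [1.0, 12.0, 4.0, 20.0, 300.0, 4000.0]

raw_r = pearson(s1, s2)
log_r = pearson(log_transform(s1), log_transform(s2))

print("Pearson on raw values:", round(raw_r, 4))
print("Pearson on log values:", round(log_r, 4))
```

On the raw scale a single very highly expressed gene can account for most of the correlation; after the log transform every gene contributes on a comparable scale.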


              • Kennels
                Senior Member
                • Feb 2011
                • 149

                #8
                Thanks everyone for the advice and information. I am currently testing out scatterplots and several correlation tests via R.

                However... and I do apologize if this doesn't seem to be getting through to me ... I am not really trying to find a positive or negative correlation. I will certainly get some value from my data, but I am concerned that biologically that could be misleading because of the complex interplay in this small set of genes.

                For example, gene A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
                "Which samples behave most similarly in the interplay of the expression of these genes?"
                which is what led me to PCA in the first place.

                But perhaps this is exactly what correlation does and I am misunderstanding it? My understanding was that the Pearson/Spearman correlations require the data to be linear or monotonic (somewhat linear), which in my samples I 'believe' they aren't. Of course I have to confirm this with gringer's suggestions.


                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  For example, gene A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
                  "Which samples behave most similarly in the interplay of the expression of these genes?"
                  You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
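
To make the rank-order example concrete, here is a small Python sketch. The gene names follow the A to E example above, but the expression values and the third sample are invented for illustration.

```python
def gene_order(expression):
    """Gene names sorted by increasing expression within one sample."""
    return sorted(expression, key=expression.get)

# Made-up expression values for five hypothetical genes in three samples.
sample_1 = {"A": 2.0, "B": 7.5, "C": 80.0, "D": 310.0, "E": 15.0}
sample_2 = {"A": 0.5, "B": 3.0, "C": 41.0, "D": 95.0, "E": 9.0}
sample_3 = {"A": 60.0, "B": 3.0, "C": 1.0, "D": 20.0, "E": 9.0}

for name, sample in [("sample 1", sample_1), ("sample 2", sample_2), ("sample 3", sample_3)]:
    print(name, gene_order(sample))
```

Samples 1 and 2 have very different absolute values but the same gene ordering, which is what a rank-based correlation between samples picks up; sample 3 orders the genes differently.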

                  Principal Component Analysis is a dimension reduction technique that uses linear transformations of multi-dimensional values to allow them to be reduced to a simpler (lower-dimensional) complexity, commonly down to two dimensions. I believe the usual methods for working out how to do this reduction involve expectations of normally distributed data, and carry out something similar to a correlation analysis to work out how to weight each component [rskr's statement seems to support this] -- someone please correct me if I'm wrong about that. I don't think you can get away completely from linear correlations by trying to hide your data in a PCA.
                  Last edited by gringer; 10-28-2013, 04:32 PM.
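
A minimal Python sketch of the linear machinery behind PCA, for readers who want to see it spelled out. It computes only the first principal component, using power iteration on the sample-by-sample Gram matrix of the column-centered data; that is one of several equivalent routes and is not claimed to be what any particular package does. The function name, start vector, iteration count, and data are all assumptions for illustration.

```python
import math

def pca_first_component_scores(samples, iterations=200):
    """Score of each sample on the first principal component.

    samples: one row per sample, one column per gene.
    The sign of the returned scores is arbitrary.
    """
    n, p = len(samples), len(samples[0])
    # Center each gene (column) on its mean across samples.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Gram matrix: inner products between centered samples (a linear measure).
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]

    # Power iteration for the top eigenvector; the start vector is arbitrary.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # no variation in the data at all
            break
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n)) for i in range(n))
    return [c * math.sqrt(max(eigenvalue, 0.0)) for c in v]

# Made-up log-expression values: 4 hypothetical samples x 3 hypothetical genes.
data = [
    [2.0, 5.1, 1.0],
    [2.2, 5.0, 1.1],
    [6.0, 1.2, 3.9],
    [5.8, 1.0, 4.2],
]
print([round(s, 3) for s in pca_first_component_scores(data)])
```

Every step is a sum of products of centered values, which is why PCA inherits the same linear assumptions as a Pearson-style correlation.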


                  • Kennels
                    Senior Member
                    • Feb 2011
                    • 149

                    #10
                    Originally posted by gringer
                    You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
                    Thanks gringer for your explanation - it is clearer now, and looking at the correlation output it makes more sense.

                    You were right that the log values seem to show overall a linear relationship (which was a surprise to me for these genes).
                    Would you be able to comment on my interpretation? I did a scatterplot matrix with log values (image below), and a Spearman's correlation in R:
                    Code:
                              S1        S2        S3        S4        S5
                    S1 1.0000000 0.8409553 0.6508859 0.7027103 0.7342251
                    S2 0.8409553 1.0000000 0.6067227 0.7691877 0.8000000
                    S3 0.6508859 0.6067227 1.0000000 0.5299720 0.5543417
                    S4 0.7027103 0.7691877 0.5299720 1.0000000 0.7823529
                    S5 0.7342251 0.8000000 0.5543417 0.7823529 1.0000000
                    S3 seems to be the 'least' correlated with all the others, and S1 and S2 the 'most'.
                    I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.

                    Thanks again!
                    [Attached image: scatterplot matrix of log-transformed expression values]


                    • gringer
                      David Eccles (gringer)
                      • May 2011
                      • 845

                      #11
                      Originally posted by Kennels
                      I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.
                      Indeed, significance depends largely on what feels right, the assumptions that have been made, and occasionally who's paying for the research.

                      Your original question was to find the two samples that are 'most similar', and as you have found, it's fairly obvious given the correlation statistics and scatter plots. If you're looking for "most similar", the thing that matters is not the significance of the correlation statistic (for that most similar pairing), but how different it is from the next "most similar" pairing. There are various other tests that can be done to find out the chance of confusion in that regard, but they're [currently] out of the scope of answers in this thread.
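
The "gap to the next most similar pairing" idea can be read straight off the matrix posted in #10. A small Python sketch follows; the numbers are copied from that post, while the variable names and the "mean correlation with the others" summary are illustrative choices.

```python
from itertools import combinations

labels = ["S1", "S2", "S3", "S4", "S5"]
# Spearman correlation matrix as posted in #10.
rho = [
    [1.0000000, 0.8409553, 0.6508859, 0.7027103, 0.7342251],
    [0.8409553, 1.0000000, 0.6067227, 0.7691877, 0.8000000],
    [0.6508859, 0.6067227, 1.0000000, 0.5299720, 0.5543417],
    [0.7027103, 0.7691877, 0.5299720, 1.0000000, 0.7823529],
    [0.7342251, 0.8000000, 0.5543417, 0.7823529, 1.0000000],
]

# All distinct sample pairs, most correlated first.
pairs = sorted(
    ((rho[i][j], labels[i], labels[j]) for i, j in combinations(range(len(labels)), 2)),
    reverse=True,
)
best, runner_up = pairs[0], pairs[1]
gap = best[0] - runner_up[0]

def mean_with_others(i):
    """Average correlation of sample i with every other sample."""
    return sum(rho[i][j] for j in range(len(labels)) if j != i) / (len(labels) - 1)

least_similar = labels[min(range(len(labels)), key=mean_with_others)]

print("most similar pair:", best[1], best[2], best[0])
print("next pair        :", runner_up[1], runner_up[2], runner_up[0])
print("gap              :", round(gap, 4))
print("least similar to the rest:", least_similar)
```

This only ranks the pairs; whether a gap of that size is meaningful with ~30 genes is the separate question the post refers to.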

                      FWIW, the cor.test function in R will give you p-values for your correlation statistic, and you can play around with parametric and non-parametric methods to see how much it changes things if you drop the assumption of normality (see 'help(cor.test)' for more information).
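
`cor.test` is an R function and is not reproduced here. As a loosely related Python sketch, a permutation test gives a p-value for a Spearman correlation without assuming normality; this is a different method from whatever `cor.test` computes internally. The function names, the 2000 permutations, the fixed seed, and the data are all assumptions for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Return (rho, two-sided permutation p-value) for Spearman's rho."""
    rx, ry = ranks(x), ranks(y)
    observed = pearson(rx, ry)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shuffled = ry[:]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson(rx, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Made-up log-expression values for eight hypothetical genes in two samples.
sample_1 = [0.5, 1.2, 2.0, 2.9, 3.1, 4.4, 5.0, 6.3]
sample_2 = [0.9, 1.0, 2.5, 2.2, 3.8, 4.1, 5.9, 6.0]

rho, p_value = spearman_permutation_pvalue(sample_1, sample_2)
print(f"rho = {rho:.3f}, permutation p = {p_value:.4f}")
```

The p-value answers "is there any association at all?", which, as noted above, is a different question from which pair of samples is most similar.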


                      • dietmar13
                        Senior Member
                        • Mar 2010
                        • 107

                        #12
                        nmf or Isomap

                        if you want to see whether your samples can be subdivided into groups, you could run a non-negative matrix factorization (NMF) with 2 to 3 groups and check whether you get a good separation (cophenetic correlation coefficient).

                        or, if you especially want to consider non-linear associations, you could use Isomap, a non-linear dimensionality reduction method (similar in purpose to PCA).

                        but how these methods will perform with such a small data set, i don't know...

