  • Interpretation of PCA of FPKM values

    (I've already searched the forums, but have yet to find a satisfactory answer, so I figured I'd create a new thread for it.)

    I recently learned about PCA and that I could use it for quality control of my RNA-seq experiments, but I have some stuff that I'm not sure about.

    1) I'm currently performing the PCA (in R) on my FPKM data from Cufflinks - is this "correct", or should I do it on some other part of my data? (I'm also performing DE-analysis with DESeq2). Which part of your data do you guys usually do it with? Do you even do PCA, or do you do something else for QC?

    2) What, exactly, are the dots on the PCA plot?

    As far as (2) is concerned, PCA was explained to me as first plotting the samples on the multidimensional space of the variables (i.e. genes), followed by finding the vector corresponding to the maximum variation in the data (first principal component). The second, third, etc. PC would then follow (orthogonal to previous ones). Plotting was projecting the samples from the n-dimensional space down to the (most often) two largest PCs, which then would show you how the samples cluster. Is this correct?

    If I imagine a plot with 3 variables (rather than n) for simplicity's sake, is the correct interpretation that the dot that corresponds to a sample is simply the point in space whose x, y, z coordinates are that sample's FPKM values for genes x, y, z? (Then follows finding the PCs, projection, etc.)

  • #2
    1) A PCA is a standard part of my analysis and I also do it in R (also using DESeq2, in case you're using its plotPCA function). You can perform the PCA on either FPKM or variance stabilized (or regularized log transformed) counts, whatever's more convenient. BTW, I assume you're not using the FPKMs in DESeq2 (I feel compelled to check). Usually one uses cufflinks/cuffmerge to get a better annotation and then gets counts with htseq-count or featureCounts and feeds those to DESeq2 (otherwise, use cuffdiff).

    2) The dots are your samples.

    That's kinda-sorta-not-really a description of PCA. PCA itself has nothing to do with plotting; rather, it's just a method of dimension reduction that decomposes a high-dimensional dataset into more manageable dimensions (you'll get at most as many principal components as samples). The orthogonal nature of PCA is really important, since it means that the principal components themselves aren't readily biologically interpretable (e.g., component #1 probably doesn't correspond to anything obvious). The output from PCA is a few matrices, of which you end up plotting part of one. You typically plot the score (often loosely called the "load") of each sample in a given principal component. So the actual position in a 3D plot will be determined by a sample's scores in those 3 dimensions. Typically you just plot 2 of the principal components, since that's easier to visualize.

    So no, the position of a sample in a 3D plot of PCA scores (the per-sample coordinates) has little directly to do with genes x, y, and z. Unless you have as many samples as genes (now THAT is an expensive experiment!), you'll have many many many more genes than dimensions, and dimensions will typically not correspond to any single gene (one of the matrices returned by PCA will give you an idea of this, though you can always also make a "biplot"). Rather, the components are sourced from all of the genes. That's about the best I can explain that without showing linear algebra (in which case, the Wikipedia article on PCA is actually pretty good).
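To make the "few matrices" point concrete, here is a minimal sketch in Python/NumPy (not the R code discussed in this thread). It assumes a centered, unscaled PCA computed by SVD, which is also how R's prcomp works with its default settings. The expression matrix is simulated, and `pca_svd`, `expr`, and `group` are hypothetical names made up for the illustration.

```python
import numpy as np

def pca_svd(expr):
    """PCA of a samples x genes matrix via SVD (genes centered, not scaled)."""
    centered = expr - expr.mean(axis=0)           # subtract each gene's mean
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                                # samples x components: the dots you plot
    loadings = vt.T                               # genes x components: gene weights per PC
    explained = s**2 / np.sum(s**2)               # fraction of variance per component
    return scores, loadings, explained

# Simulated data (an assumption, not real FPKMs): 6 samples, 200 genes,
# two groups of 3 samples that differ in the first 50 genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 6, 200
group = np.array([0, 0, 0, 1, 1, 1])
expr = rng.normal(5.0, 0.5, size=(n_samples, n_genes))
expr[group == 1, :50] += 4.0

scores, loadings, explained = pca_svd(expr)

print(scores.shape)     # one row per sample, one column per component
print(loadings.shape)   # one row per gene, one column per component
print(scores[:, :2])    # the coordinates you would put on a PC1 vs PC2 plot
```

The number of components is capped by the number of samples, not the number of genes, and every column of `loadings` holds a weight for all 200 genes, which is why a single component rarely maps onto a single gene.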

    BTW, Lior Pachter has a blog post on PCA that you might find useful. It has some nice images to illustrate what's actually going on in PCA (and an image really is worth a thousand words here).


    • #3
      Thanks a lot, especially for the blog post link! His third interpretation (i.e. maximizing the retained variance for projection of the points onto the PCA subspace) resonates very well with what I was told, and I think I'll stick to that for now.

      I am using the prcomp function and ggbiplot package for my PCA analysis, and I've only used the FPKM values from Cufflinks so far. Of course I'm not using the FPKM for the DE-analysis (good that you check, though!): I'm using featureCounts and DESeq2, but I haven't used cufflinks/cuffmerge for any kind of "getting better annotation", as you put it. Does the plotPCA function in DESeq2 use the counts for the calculations of the PCA, or something else?

      I was probably unclear in my first post. If I try to visualize the process of making a PCA-plot, my current thinking is starting at a point before any analysis has been made, and basically imagining the samples in the n-dimensional space (3-dimensional, in my previous attempt at an example) of the variables (i.e. FPKM for each gene, here). In this space, I find the first principal component by maximizing the retained variance by projection onto the PC, and then do the same for the rest of the principal components. While the PCs do not correspond to anything biological, the n-dimensional space before the analysis does correspond to the samples' various FPKM-values for each gene.

      Is this any closer, or should I give up on the whole "n-dimensional space before the analysis"-part? I'm trying to understand what PCA starts with (i.e. the data before any analysis), what happens with this during the analysis and what the end point is.
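The start-to-end picture described in this post can be sketched numerically. The following is a rough illustration in Python/NumPy under stated assumptions: the 3-gene "FPKM" values are simulated, and `fpkm`, `projected_variance`, and `coords` are hypothetical names. It uses an eigendecomposition of the covariance matrix, which is one standard way to compute PCA and not necessarily what any particular R function does internally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Starting point: 8 hypothetical samples, each one a point in 3-gene space
# (row = sample, columns = simulated FPKM for genes x, y, z).
n_samples = 8
base = rng.normal(0.0, 1.0, size=(n_samples, 1))
fpkm = np.hstack([
    10 + 3.0 * base + rng.normal(0, 0.3, (n_samples, 1)),  # gene x
    20 + 2.0 * base + rng.normal(0, 0.3, (n_samples, 1)),  # gene y, correlated with x
    5 + rng.normal(0, 0.3, (n_samples, 1)),                # gene z, mostly noise
])

# What happens: center the cloud of points, then find orthogonal directions
# ordered by how much variance the projected points retain.
centered = fpkm - fpkm.mean(axis=0)
cov = np.cov(centered, rowvar=False)            # 3 x 3 gene-by-gene covariance
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1 = eigvecs[:, 0]

def projected_variance(direction):
    """Variance of the samples after projecting them onto one direction."""
    direction = direction / np.linalg.norm(direction)
    return np.var(centered @ direction, ddof=1)

# No randomly chosen direction retains more variance than PC1.
random_dirs = rng.normal(size=(1000, 3))
best_random = max(projected_variance(d) for d in random_dirs)
print(projected_variance(pc1), best_random)

# End point: each sample's coordinates on PC1 and PC2, i.e. the dots on the plot.
coords = centered @ eigvecs[:, :2]
print(coords)
```

In this sketch the rows of `fpkm` are the "n-dimensional space before the analysis", and `coords` is the same set of samples seen from the two directions that retain the most variance.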
