**dpryan** · 12-05-2014, 06:13 AM

1) PCA is a standard part of my analysis, and I also do it in R (also using DESeq2, in case you're using its plotPCA function). You can perform the PCA on either FPKMs or variance-stabilized (or regularized-log, i.e. rlog, transformed) counts, whichever is more convenient. BTW, I assume you're not using the FPKMs in DESeq2 (I feel compelled to check). Usually one uses cufflinks/cuffmerge to get a better annotation, then gets counts with htseq-count or featureCounts and feeds those to DESeq2 (otherwise, use cuffdiff).

2) The dots are your samples.

That's kinda-sorta-not-really a description of PCA. PCA itself has nothing to do with plotting; rather, it's just a method of dimension reduction that decomposes a high-dimension dataset into more manageable dimensions (you'll get as many principal components as samples). The orthogonal nature of PCA is really important, since it means that the principal components themselves aren't readily biologically interpretable (e.g., component #1 probably doesn't correspond to anything obvious). The output from PCA is a few matrices, of which you end up plotting part of one. You typically plot the "load" of each sample in a given principal component (strictly speaking, this is the sample's score; "loadings" usually refers to the per-gene weights). So the actual position in a 3D plot will be determined by a sample's value in those 3 dimensions. Typically you just plot 2 of the principal components, since that's easier to visualize.
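To make the "few matrices" concrete, here is a minimal sketch of PCA via a singular value decomposition in Python/NumPy. It is not what prcomp or plotPCA run internally line for line; the function name `pca_via_svd` and the random 6-sample by 200-gene matrix are made up purely for illustration.

```python
import numpy as np

def pca_via_svd(expr):
    """PCA of a samples x genes matrix (hypothetical helper, for illustration only).

    Returns per-sample scores, per-gene loadings, and the fraction of
    variance explained by each component.
    """
    centered = expr - expr.mean(axis=0)            # centre every gene at zero
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                                 # samples x components: what gets plotted
    loadings = vt.T                                # genes x components: per-gene weights
    variances = s ** 2 / (expr.shape[0] - 1)       # variance along each component
    return scores, loadings, variances / variances.sum()

# Made-up data: 6 samples, 200 genes (e.g. log-transformed expression values).
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 200))

scores, loadings, explained = pca_via_svd(expr)
print(scores.shape)    # one row per sample, one column per component
print(loadings.shape)  # one row per gene, one column per component
```

A 2D PCA plot is then just `scores[:, 0]` against `scores[:, 1]`, one dot per sample.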

So no, the position of a sample in a 3D PCA plot has little directly to do with genes x, y, and z. Unless you have as many samples as genes (now THAT is an expensive experiment!), you'll have many, many more genes than dimensions, and dimensions will typically not correspond to any single gene (one of the matrices returned by PCA will give you an idea of this, though you can always also make a "biplot"). Rather, the components are sourced from all of the genes. That's about the best I can explain that without showing linear algebra (in which case, the Wikipedia article on PCA is actually pretty good).
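As a rough illustration of "the components are sourced from all of the genes", the sketch below pulls out the per-gene weights of the first component from an SVD. The data are random and the variable names are hypothetical; real expression data would give different weights, but the shape of the result is the same.

```python
import numpy as np

# Made-up data: 4 samples, 50 genes.
rng = np.random.default_rng(1)
n_samples, n_genes = 4, 50
expr = rng.normal(size=(n_samples, n_genes))

centered = expr - expr.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

pc1_weights = vt[0]                    # one weight per gene for component 1
pc1_scores = centered @ pc1_weights    # each sample's position along component 1

# The weight vector has unit length, so squared weights sum to 1 and can be
# read as each gene's share of the component.
n_contributing = int(np.sum(np.abs(pc1_weights) > 1e-12))
largest_share = float(np.max(pc1_weights ** 2))

print(n_contributing, "genes carry non-zero weight in PC1")
print("largest single-gene share:", round(largest_share, 3))
```

Each sample's coordinate on the component is a weighted sum over every gene, which is why an axis of the PCA plot cannot be read as "gene x".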

BTW, Lior Pachter has a blog post on PCA that you might find useful. It has some nice images to illustrate what's actually going on in PCA (and an image really is worth a thousand words here).

**ErikFas** · 12-05-2014, 07:58 AM

Thanks a lot, especially for the blog post link! His third interpretation (i.e. maximizing the retained variance for projection of the points onto the PCA subspace) resonates very well with what I was told, and I think I'll stick to that for now.

I am using the prcomp function and ggbiplot package for my PCA analysis, and I've only used the FPKM values from Cufflinks so far. Of course I'm not using the FPKM for the DE-analysis (good that you check, though!): I'm using featureCounts and DESeq2, but I haven't used cufflinks/cuffmerge for any kind of "getting better annotation", as you put it. Does the plotPCA function in DESeq2 use the counts for the calculations of the PCA, or something else?

I was probably unclear in my first post. If I try to visualize the process of making a PCA plot, my current thinking is to start at a point before any analysis has been made, and basically imagine the samples in the n-dimensional space (3-dimensional, in my previous attempt at an example) of the variables (i.e. FPKM for each gene, here). In this space, I find the first principal component by maximizing the retained variance by projection onto the PC, and then do the same for the rest of the principal components. While the PCs do not correspond to anything biological, the n-dimensional space before the analysis does correspond to the samples' various FPKM values for each gene.
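The "maximize the retained variance by projection" picture can be checked numerically. The sketch below uses a made-up 3-gene space (8 random samples, with the genes given different spreads) and compares the variance of the samples projected onto the first principal component with the variance along 1000 random directions. The helper `projected_variance` is hypothetical and exists only for this example.

```python
import numpy as np

# Made-up data: 8 samples in a 3-gene space, genes with different spreads.
rng = np.random.default_rng(2)
expr = rng.normal(size=(8, 3)) * np.array([3.0, 1.0, 0.3])
centered = expr - expr.mean(axis=0)

# Rows of vt are the principal directions in gene space.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = vt[0], vt[1]

def projected_variance(direction):
    """Variance of the samples after projecting them onto one direction."""
    unit = direction / np.linalg.norm(direction)
    return float(np.var(centered @ unit, ddof=1))

pc1_var = projected_variance(pc1)

# No randomly chosen direction should retain more variance than PC1.
random_dirs = rng.normal(size=(1000, 3))
best_random = max(projected_variance(d) for d in random_dirs)

print("PC1 beats every random direction:", pc1_var >= best_random)
print("PC1 and PC2 are orthogonal:", abs(float(pc1 @ pc2)) < 1e-9)
```

The second component repeats the same search restricted to directions orthogonal to the first, and so on for the rest.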

Is this any closer, or should I give up on the whole "n-dimensional space before the analysis"-part? I'm trying to understand what PCA starts with (i.e. the data before any analysis), what happens with this during the analysis and what the end point is.


Interpretation of PCA of FPKM values
