(I've already searched the forums, but have yet to find a satisfactory answer, so I figured I'd create a new thread for it.)
I recently learned about PCA and that I could use it for quality control of my RNA-seq experiments, but I have some stuff that I'm not sure about.
1) I'm currently performing the PCA (in R) on my FPKM data from Cufflinks. Is this "correct", or should I do it on some other part of my data? (I'm also performing DE analysis with DESeq2.) Which part of your data do you guys usually do it with? Do you even do PCA, or do you do something else for QC?
2) What, exactly, are the dots on the PCA plot?
As far as (2) is concerned, PCA was explained to me as first plotting the samples in the multidimensional space of the variables (i.e. genes), followed by finding the vector corresponding to the maximum variation in the data (the first principal component). The second, third, etc. PCs would then follow, each orthogonal to the previous ones. Plotting then means projecting the samples from the n-dimensional space down onto (most often) the two PCs with the largest variance, which shows you how the samples cluster. Is this correct?
If I imagine a plot with 3 variables (rather than n) for simplicity's sake, is the correct interpretation that the dot corresponding to a sample is simply the point in space whose x, y, z coordinates are that sample's FPKM values for genes x, y, z? (Finding the PCs, the projection, etc. would then follow.)
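To make the question concrete, here is a minimal sketch of my understanding for the 3-gene case. It is written in plain Python rather than R so that every step is spelled out. The sample names and values are made up, the helper names are my own, and power iteration is just a simple stand-in for finding the PCs (R's `prcomp` uses an SVD instead), so please treat it as an illustration of the idea rather than a recommended workflow.

```python
import math

# Made-up, FPKM-like values for illustration only (not real data).
# Each sample is one point in "gene space": its coordinates are its
# expression values for genes x, y and z.
samples = {
    "ctrl_1": [5.0, 2.0, 8.0],
    "ctrl_2": [5.2, 2.1, 7.9],
    "treat_1": [8.0, 2.0, 4.0],
    "treat_2": [8.1, 1.9, 4.2],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each gene's mean so the point cloud sits on the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Gene-by-gene sample covariance matrix of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_component(cov, iters=500):
    """Power iteration: direction of maximum variance and that variance."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    variance = dot(v, [dot(row, v) for row in cov])
    return v, variance


def deflate(cov, v, variance):
    """Remove the variance along v so the next PC can be found."""
    p = len(cov)
    return [[cov[i][j] - variance * v[i] * v[j] for j in range(p)]
            for i in range(p)]


def pca_2d(rows):
    centered = center(rows)
    cov = covariance(centered)
    pc1, var1 = top_component(cov)
    pc2, var2 = top_component(deflate(cov, pc1, var1))
    # The dots on the PCA plot: each sample's coordinates along PC1 and PC2.
    scores = [(dot(r, pc1), dot(r, pc2)) for r in centered]
    return scores, (pc1, pc2), (var1, var2)


scores, components, variances = pca_2d(list(samples.values()))
for name, (s1, s2) in zip(samples, scores):
    print(f"{name}: PC1 = {s1:.3f}, PC2 = {s2:.3f}")
```

If this matches how PCA actually works, then each printed pair is one dot on the plot: the same sample that started as a point (x, y, z) in gene space, now described by its position along PC1 and PC2.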