  • Golsheed
    Member
    • Oct 2014
    • 49

    PCA for compositional data and relative isoform usage

    Hello,

    I have relative isoform usage data (isoform proportions) for two conditions (Non-Infected and Infected) for some genes. I have studied (using statistical hypothesis testing) the differential isoform usage for this data; i.e., whether the relative isoform usage is statistically different after infection, compared to the control (Non-Infected), for each gene. This is a bit different from differential isoform expression, since I'm dealing with isoform usage proportions and not the actual read counts for each isoform.
    For a target gene g with K isoforms, I have a vector of size K whose ith element is the relative usage of isoform i for gene g.
    I want to perform a principal component analysis on this data to see whether the first (or second) principal component separates the data into two groups based on conditions (Non-Infected and Infected).
    Does anyone know how this can be done in R? Considering that the data points are vectors for "each gene and each sample", and that the vector elements sum to 1 (and thus depend on one another), I can't use the usual PCA method in R.
    I have found that the package robCompositions can do PCA for compositional data, but there's no detailed documentation for it. My main problem is how to draw a PC plot of data (PC2 vs PC1), which shows the clustering of data based on condition. Can this be done the same way as in prcomp(); i.e., using scores?

    I'd appreciate any help.

    Thanks,
    Golsheed
  • Skiaphrene
    Member
    • Aug 2013
    • 18

    #2
    Hi Golsheed,


    I haven't used the robCompositions package, but it looks very interesting. Can't you use the plot.pcaCoDa() function on the output of pcaCoDa() (which I'm assuming is what you're using to do the PCA) to generate sample and variable plots? I tried the example in the documentation, and it generates a plot that overlays the variables on top of the sample plot, so you should be able to see your clustering if it is there. Either way, in the pcaCoDa object returned by pcaCoDa(), the scores element does seem to contain the sample coordinates in PC space, and the loadings element does seem to contain the variable coordinates in PC space, in case you need to plot each separately (which you may if you have many isoforms).
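To illustrate the "plot the scores, coloured by condition" idea without relying on robCompositions, here is a minimal base-R sketch. It swaps in a plain centred log-ratio (clr) transform followed by prcomp(), which is not the same as the method in pcaCoDa(). The toy data, the variable names and the clr() helper are all made up for illustration.

```r
# Hypothetical toy data: 6 samples x 3 isoforms of one gene; each row sums to 1.
set.seed(1)
condition <- rep(c("Non-Infected", "Infected"), each = 3)
raw <- rbind(
  matrix(rep(c(6, 3, 1), 3), ncol = 3, byrow = TRUE),
  matrix(rep(c(2, 3, 5), 3), ncol = 3, byrow = TRUE)
) + matrix(runif(18, 0, 0.5), ncol = 3)
props <- raw / rowSums(raw)

# Centred log-ratio transform: log of each part minus the row mean of the logs.
# Zero proportions would need a small pseudocount first (an assumption to tune).
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)
}

pca <- prcomp(clr(props), center = TRUE, scale. = FALSE)
scores <- pca$x  # sample coordinates in PC space, as with prcomp() in general

# PC2 vs PC1, coloured by condition.
plot(scores[, 1], scores[, 2],
     col = ifelse(condition == "Infected", "red", "blue"),
     pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = unique(condition), col = c("blue", "red"), pch = 19)
```

With a pcaCoDa object the same plot() call should work on its scores element, assuming that element is laid out with samples in rows as described above.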


    On a more speculative note, perhaps it would be possible to arrange your data into a single data frame (with samples in rows and isoform proportions in columns), run robCompositions on each isoform separately, retain the returned PCs, and run a specially weighted PCA on these PCs, in a way similar to what a "Multiple Factor Analysis" does (cf. function MFA() in package FactoMineR). This would ensure that each isoform contributes equally to the overall analysis.


    Hope this helps,

    -- Alex


    • Golsheed
      Member
      • Oct 2014
      • 49

      #3
      Thanks so much for your help, Alex.

      A few questions if you don't mind:

      (1) I have many isoforms for each gene (ranging from 2 isoforms to 20-30). Do you think the PC plot would make more sense if I plot it for each isoform separately, i.e., one PC plot for isoform one, and so on? Is that what you meant in the first paragraph?

      (2) I'm not familiar with multiple factor analysis, do you mind elaborating a bit more about it and also how to do the weighting? or refer me to a paper or something so I can get a better idea of it.

      Thanks so much,
      Golsheed


      • Skiaphrene
        Member
        • Aug 2013
        • 18

        #4
        Originally posted by Golsheed
        (1) I have many isoforms for each gene (ranging from 2 isoforms to 20-30). Do you think the PC plot would make more sense if I plot it for each isoform separately, i.e., one PC plot for isoform one, and so on? Is that what you meant in the first paragraph?
        => This isn't what I had in mind in the first paragraph... My understanding is that you have several samples for which you have isoform proportions for multiple genes. I was assuming you wanted to run the proportions PCA across all isoforms for all genes at once. This should be possible at least in theory for a normal PCA, but I don't know how it would work out with the proportions PCA, as doing it across all isoforms and all genes at once means that the sum of all proportions is not one (rather it is the number of genes). This kind of analysis would not be able to take into account that the various proportions can be grouped by gene, which is where my idea of a "proportions MFA" came in. Anyway...

        => ...Coming back to your question above, you could run a proportions PCA on each gene individually and generate plots for each (note: since a gene's proportions sum to 1, a gene with K isoforms should give at most K-1 informative PCs, so you would need at least 3 isoforms to get 2 PCs). This would highlight things like sample clustering and the dimensions of major variability for each gene separately. I don't know how useful that would be.
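A quick way to check how many PCs a gene with few isoforms can yield is to count the components with non-zero standard deviation. The sketch below uses a plain clr transform plus prcomp() as a stand-in for a proportions PCA (my assumption, not the robCompositions method). The make_props() and n_pcs() helpers and the random data are hypothetical.

```r
# clr transform: log of each part minus the row mean of the logs.
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)
}

# Hypothetical random proportions: n samples x k isoforms, rows sum to 1.
set.seed(2)
make_props <- function(n, k) {
  m <- matrix(runif(n * k, 0.1, 1), n, k)
  m / rowSums(m)
}

sd2 <- prcomp(clr(make_props(5, 2)))$sdev  # gene with 2 isoforms
sd3 <- prcomp(clr(make_props(5, 3)))$sdev  # gene with 3 isoforms

# Count PCs whose standard deviation is not numerically zero.
n_pcs <- function(sdev) sum(sdev > 1e-8)
n_pcs(sd2)  # 1 informative PC
n_pcs(sd3)  # 2 informative PCs
```

So a gene with only 2 isoforms can contribute a single PC, which matters if the same number of PCs is to be kept for every gene.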


        Originally posted by Golsheed
        (2) I'm not familiar with multiple factor analysis, do you mind elaborating a bit more about it and also how to do the weighting? or refer me to a paper or something so I can get a better idea of it.
        => If you're in a purely numeric variable setting, then an MFA is like a PCA of PCAs, and is useful for highlighting variability patterns shared across multiple groups of variables. Here those groups would be your genes, and the variables would be the proportions of each gene's isoforms.

        => I'm sorry but I can't remember the exact weighting scheme.
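For what it's worth, the weighting usually described for MFA with numeric variables is to divide each group of variables by the square root of the first eigenvalue of that group's own PCA, so that no single group dominates the first global axis, and then to run one PCA on the concatenated groups. The sketch below does this in base R. The group names and data are made up, and it only centres the variables, so it is an illustration of the idea and not a reimplementation of FactoMineR's MFA().

```r
# Hypothetical example: two groups of numeric variables on the same 8 samples.
set.seed(3)
n <- 8
groups <- list(
  geneA = matrix(rnorm(n * 3, sd = 5), n, 3),  # high-variance group
  geneB = matrix(rnorm(n * 4, sd = 1), n, 4)   # low-variance group
)

# Centre each group and divide it by the square root of the first eigenvalue
# of its own PCA, so every group ends up with a leading eigenvalue of 1.
weight_group <- function(x) {
  x <- scale(x, center = TRUE, scale = FALSE)
  lambda1 <- prcomp(x)$sdev[1]^2
  x / sqrt(lambda1)
}
weighted <- lapply(groups, weight_group)

# Global PCA on the concatenated, weighted groups.
global <- prcomp(do.call(cbind, weighted))
head(global$x[, 1:2])  # sample coordinates on the first two global axes
```

Without the weighting step, geneA would drive the global PCA simply because its variables have the larger variance.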

        => You can read up more about MFA here:
        - on this page of FactoMineR's website: http://factominer.free.fr/advanced-m...-analysis.html
        - through the references given in the MFA function documentation in the FactoMineR package:
        Escofier, B. and Pages, J. (1994) Multiple Factor Analysis (AFMULT package). Computational Statistics and Data Analysis, 18, 121-140.
        Becue-Bertaut, M. and Pages, J. (2008) Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Computational Statistics and Data Analysis, 52, 3255-3268.

        (the latter one can be found on ResearchGate: http://www.researchgate.net/publicat...frequency_data)

        => With the concept of MFA in mind, I can imagine running a proportions PCA on each gene separately, then taking the sample coordinates for the first 2 or three PCs returned, and doing a normal PCA on that. If you take the same number of PCs for each gene then you shouldn't have to worry about any special weighting.
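The two-stage idea above can be sketched roughly as follows. Again, clr plus prcomp() stands in for the proportions PCA (an assumption; swap in pcaCoDa() scores if that is what you use), and the gene names, sample sizes and random proportions are hypothetical.

```r
# clr transform: log of each part minus the row mean of the logs.
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)
}

# Hypothetical data: 6 samples, three genes with 3, 4 and 5 isoforms.
set.seed(4)
n <- 6
condition <- rep(c("Non-Infected", "Infected"), each = 3)
make_props <- function(k) {
  m <- matrix(runif(n * k, 0.1, 1), n, k)
  m / rowSums(m)
}
gene_props <- list(gene1 = make_props(3), gene2 = make_props(4), gene3 = make_props(5))

# Stage 1: per-gene PCA, keeping the same number of PCs (scores) for each gene.
n_keep <- 2
per_gene <- lapply(gene_props, function(p) prcomp(clr(p))$x[, 1:n_keep, drop = FALSE])
for (g in names(per_gene)) {
  colnames(per_gene[[g]]) <- paste0(g, "_PC", 1:n_keep)
}

# Stage 2: samples in rows, gene1_PC1, gene1_PC2, gene2_PC1, ... in columns.
combined <- do.call(cbind, per_gene)
overall <- prcomp(combined)

plot(overall$x[, 1], overall$x[, 2],
     col = ifelse(condition == "Infected", "red", "blue"),
     pch = 19, xlab = "PC1", ylab = "PC2")
```

One caveat: genes whose retained PCs have larger variance will still weigh more in the second PCA, so passing scale. = TRUE to the final prcomp() call may be worth trying.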


        Let me know what you think!

        Best,

        -- Alex


        • Golsheed
          Member
          • Oct 2014
          • 49

          #5
          Thanks a bunch.

          So here's what you're proposing in short:
          (1) doing a normal PCA on each gene separately
          (2) constructing a matrix where each row corresponds to a sample and the columns are as follows:

          sample, gene1_PC1, gene1_PC2, gene2_PC1, gene2_PC2, gene3_PC1, ...

          and doing a normal PCA on that, right?

          Just to make sure I got things right, by "sample coordinates for the first 2 or three PCs" you mean the scores?

          Thanks,
          Golsheed


          • Skiaphrene
            Member
            • Aug 2013
            • 18

            #6
            You're welcome!

            Originally posted by Golsheed
            (1) doing a normal PCA on each gene separately
            (2) constructing a matrix where each row corresponds to a sample and the columns are as follows:

            sample, gene1_PC1, gene1_PC2, gene2_PC1, gene2_PC2, gene3_PC1, ...

            and doing a normal PCA on that, right?
            => (1) Well, as you pointed out previously, I'm not sure a normal PCA will work on the isoform proportions for each gene, as the variables are related (they sum to 1). The proportions PCA from robCompositions sounds like a method designed to handle this, so maybe you should do a proportions PCA on each gene rather than a normal PCA.

            => (2) yes, this is what I had in mind. Since the PC coordinates are no longer proportions, a normal PCA across this should be fine.

            => If your data hadn't been proportions then a normal MFA would have sufficed!


            Good luck!

            Best,

            -- Alex
