  • PCA for compositional data and relative isoform usage

    Hello,

    I have relative isoform usage data (isoform proportions) for two conditions (Non-Infected and Infected) for some genes. I have studied (using statistical hypothesis testing) the differential isoform usage for this data; i.e., whether the relative isoform usage is statistically different after infection, compared to the control (Non-Infected), for each gene. This is a bit different from differential isoform expression, since I'm dealing with isoform usage proportions and not the actual read counts for each isoform.
    For a target gene g with K isoforms, I have a vector of size K, where the ith element is the relative usage of isoform i of gene g.
    I want to perform a principal component analysis on this data to see whether the first (or second) principal component separates the data into two groups based on conditions (Non-Infected and Infected).
    Does anyone know how this can be done in R? Since each data point is a vector for one gene in one sample, and the vector elements sum to 1 (and are therefore dependent on one another), I can't use the usual PCA method in R.
    I have found that the package robCompositions can do PCA for compositional data, but there's no detailed documentation for it. My main problem is how to draw a PC plot of data (PC2 vs PC1), which shows the clustering of data based on condition. Can this be done the same way as in prcomp(); i.e., using scores?

    I'd appreciate any help.

    Thanks,
    Golsheed

  • #2
    Hi Golsheed,


    I haven't used the robCompositions package but it looks very interesting. Can't you use the plot.pcaCoDa() function on the output of pcaCoDa() (which is what I'm assuming you're using to do the PCA) to generate sample & variable plots? I've tried out the example in the documentation and it generates a plot that overlays the variables on top of the sample plot, so you should be able to see your clustering if it is there. Either way, in the pcaCoDa object returned by the pcaCoDa() function, the scores element does seem to contain the sample coordinates in PC space, and the loadings element does seem to contain the variable coordinates in PC space, in case you need to plot each separately (which you may if you have many isoforms).
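
    To make the scores idea concrete, here is a minimal sketch of a PC2 vs PC1 plot coloured by condition. Note that it does not use pcaCoDa(): it swaps in a classical centred log-ratio (clr) transform followed by prcomp(), which is a standard but non-robust way to handle proportions. The toy data, the clr() helper and all object names are made up for illustration.

    ```r
    set.seed(1)

    # Hypothetical toy data: 6 samples x 4 isoforms of one gene; rows sum to 1.
    counts <- matrix(rgamma(24, shape = 5), nrow = 6,
                     dimnames = list(paste0("S", 1:6), paste0("iso", 1:4)))
    counts[4:6, 1] <- counts[4:6, 1] * 4   # shift isoform 1 usage in "Infected"
    props <- counts / rowSums(counts)
    condition <- factor(rep(c("Non-Infected", "Infected"), each = 3),
                        levels = c("Non-Infected", "Infected"))

    # Centred log-ratio transform: log of each part minus the row mean of the logs.
    # Assumes no zero proportions; zeros would need a pseudocount or imputation first.
    clr <- function(p) {
      lp <- log(p)
      lp - rowMeans(lp)
    }

    pca <- prcomp(clr(props), center = TRUE, scale. = FALSE)
    scores <- pca$x   # sample coordinates in PC space, as with any prcomp() fit

    plot(scores[, 1], scores[, 2], col = as.integer(condition), pch = 19,
         xlab = "PC1", ylab = "PC2")
    legend("topright", legend = levels(condition), col = 1:2, pch = 19)
    ```

    The same plot() call should apply to the scores element of a pcaCoDa object, though that is untested here.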


    On a more speculative note, perhaps it would be possible to arrange your data into a single data frame (with samples in rows and isoform proportions in columns), run robCompositions on each isoform separately, retain the returned PCs, and run a specially-weighted PCA on these PCs, in a way similar to what a "Multiple Factor Analysis" does, cf. function MFA() in package FactoMineR. This would ensure that each isoform would contribute equally to the overall analysis.


    Hope this helps,

    -- Alex



    • #3
      Thanks so much for your help, Alex.

      A few questions if you don't mind:

      (1) I have many isoforms for each gene (ranging from 2 isoforms to 20-30). Do you think the PC plot would make more sense if I plot it for each isoform separately, i.e., one PC plot for isoform one, and so on? Is that what you meant in the first paragraph?

      (2) I'm not familiar with multiple factor analysis. Do you mind elaborating a bit more on it and on how to do the weighting, or referring me to a paper so I can get a better idea of it?

      Thanks so much,
      Golsheed



      • #4
        Originally posted by Golsheed
        (1) I have many isoforms for each gene (ranging from 2 isoforms to 20-30). Do you think the PC plot would make more sense if I plot it for each isoform separately, i.e., one PC plot for isoform one, and so on? Is that what you meant in the first paragraph?
        => This isn't what I had in mind in the first paragraph... My understanding is that you have several samples for which you have isoform proportions for multiple genes. I was assuming you wanted to run the proportions PCA across all isoforms for all genes at once. This should be possible at least in theory for a normal PCA, but I don't know how it would work out with the proportions PCA, as doing it across all isoforms and all genes at once means that the sum of all proportions is not one (rather it is the number of genes). This kind of analysis would not be able to take into account that the various proportions can be grouped by gene, which is where my idea of a "proportions MFA" came in. Anyway...
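
        A tiny worked example of the "sum is not one" point, using made-up proportions for two hypothetical genes (the names geneA and geneB are only for illustration):

        ```r
        # Hypothetical proportions for 3 samples: gene A has 2 isoforms, gene B has 3.
        geneA <- matrix(c(0.7, 0.3,
                          0.6, 0.4,
                          0.2, 0.8), ncol = 2, byrow = TRUE)
        geneB <- matrix(c(0.5, 0.3, 0.2,
                          0.4, 0.4, 0.2,
                          0.1, 0.1, 0.8), ncol = 3, byrow = TRUE)

        # Concatenating genes side by side gives one row per sample ...
        all_isoforms <- cbind(geneA, geneB)

        # ... but each row now totals the number of genes rather than 1.
        row_totals <- rowSums(all_isoforms)
        ```

        So the concatenated matrix is no longer a single composition per sample, which is why a compositional method cannot simply be pointed at it.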

        => ...Coming back to your question above, you could run a proportions PCA on each gene individually and generate plots for each (note: you'll have to check if a proportion PCA needs at least 3 proportion variables to make 2 PCs - a normal PCA does). This would highlight things like sample clustering and dimensions of major variability for each gene separately. I don't know how useful that would be.
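
        On the "how many proportion variables for 2 PCs" check, here is a rough sketch under a stated assumption: it uses a plain centred log-ratio (clr) transform plus prcomp() as a stand-in for the proportions PCA, not pcaCoDa() itself. Because clr-transformed rows sum to zero, K proportions can carry at most K-1 components with non-zero variance. The helper names and simulated data are hypothetical.

        ```r
        # clr transform; assumes strictly positive proportions.
        clr <- function(p) {
          lp <- log(p)
          lp - rowMeans(lp)
        }

        set.seed(2)
        # Count the PCs with non-negligible variance for a simulated K-isoform gene.
        n_informative_pcs <- function(K, n = 10) {
          raw <- matrix(rgamma(n * K, shape = 5), nrow = n)
          props <- raw / rowSums(raw)
          sdev <- prcomp(clr(props))$sdev
          sum(sdev > 1e-8 * max(sdev))
        }

        pcs_by_K <- sapply(2:5, n_informative_pcs)
        names(pcs_by_K) <- paste0("K=", 2:5)
        ```

        Under this assumption a gene with only two isoforms yields a single usable PC, so a PC2 vs PC1 plot needs at least three isoforms.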


        Originally posted by Golsheed
        (2) I'm not familiar with multiple factor analysis, do you mind elaborating a bit more about it and also how to do the weighting? or refer me to a paper or something so I can get a better idea of it.
        => If you're in a purely numeric variable setting, then an MFA is like a PCA of PCAs, and is useful for highlighting variability patterns shared across multiple groups of variables. Here those groups would be your genes, and the variables would be the proportions of each gene's isoforms.

        => I'm sorry but I can't remember the exact weighting scheme.
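
        For what it's worth, the weighting described in the Escofier and Pages reference below is, as far as I understand it, to divide each group of variables by the square root of the first eigenvalue of that group's own PCA, so that no single group dominates the first global axis. A minimal base R sketch of that idea, for ordinary numeric variables (not proportions) and without calling MFA() itself; the blocks and the mfa_weight() helper are hypothetical:

        ```r
        set.seed(3)
        # Two hypothetical groups of numeric variables measured on the same 8 samples.
        block1 <- matrix(rnorm(8 * 3, sd = 5), nrow = 8)
        block2 <- matrix(rnorm(8 * 4, sd = 0.5), nrow = 8)

        # Centre a block and divide it by the standard deviation of its first PC,
        # so every block enters the global PCA with a leading variance of 1.
        mfa_weight <- function(block) {
          centred <- scale(block, center = TRUE, scale = FALSE)
          centred / prcomp(centred)$sdev[1]
        }

        weighted <- cbind(mfa_weight(block1), mfa_weight(block2))
        global_pca <- prcomp(weighted)   # the "PCA of PCAs" step
        ```

        This sketch skips the per-variable standardisation that an MFA can also apply, so treat it as an illustration of the weighting only.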

        => You can read up more about MFA here:
        - on this page of FactoMineR's website: http://factominer.free.fr/advanced-m...-analysis.html
        - through the references given in the MFA function documentation in the FactoMineR package:
        Escofier, B. and Pages, J. (1994) Multiple Factor Analysis (AFMULT package). Computational Statistics and Data Analysis, 18, 121-140.
        Becue-Bertaut, M. and Pages, J. (2008) Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Computational Statistics and Data Analysis, 52, 3255-3268.

        (the latter one can be found on ResearchGate: http://www.researchgate.net/publicat...frequency_data)

        => With the concept of MFA in mind, I can imagine running a proportions PCA on each gene separately, then taking the sample coordinates for the first two or three PCs returned, and doing a normal PCA on that. If you take the same number of PCs for each gene then you shouldn't have to worry about any special weighting.
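
        A minimal sketch of that two-stage idea, with the caveat that the per-gene step here is a clr transform plus prcomp() standing in for the proportions PCA from robCompositions. The simulated genes, the make_gene() helper and the column naming are hypothetical.

        ```r
        # clr transform; assumes strictly positive proportions.
        clr <- function(p) {
          lp <- log(p)
          lp - rowMeans(lp)
        }

        set.seed(4)
        samples <- paste0("S", 1:6)

        # Hypothetical per-gene proportion matrix (samples in rows, isoforms in columns).
        make_gene <- function(K) {
          raw <- matrix(rgamma(6 * K, shape = 5), nrow = 6,
                        dimnames = list(samples, NULL))
          raw / rowSums(raw)
        }
        genes <- list(gene1 = make_gene(3), gene2 = make_gene(5), gene3 = make_gene(4))

        n_keep <- 2   # keep the same number of PCs for every gene

        # Stage 1: one PCA per gene, retaining the sample scores of the first PCs.
        per_gene_scores <- lapply(names(genes), function(g) {
          sc <- prcomp(clr(genes[[g]]))$x[, seq_len(n_keep), drop = FALSE]
          colnames(sc) <- paste0(g, "_PC", seq_len(n_keep))
          sc
        })

        # Stage 2: bind the scores side by side and run an ordinary PCA on them.
        combined <- do.call(cbind, per_gene_scores)
        final_pca <- prcomp(combined)
        final_scores <- final_pca$x[, 1:2]   # coordinates for a PC2 vs PC1 plot
        ```

        Whether to also rescale the combined columns (scale. = TRUE in prcomp()) is a separate choice that this sketch leaves at the default.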


        Let me know what you think!

        Best,

        -- Alex



        • #5
          Thanks a bunch.

          So here's what you're proposing in short:
          (1) doing a normal PCA on each gene separately
          (2) constructing a matrix where each row corresponds to a sample and the columns are as follows:

          sample, gene1_PC1, gene1_PC2, gene2_PC1, gene2_PC2, gene3_PC1, ...

          and doing a normal PCA on that, right?

          Just to make sure I got things right, by "sample coordinates for the first two or three PCs" you mean the scores?

          Thanks,
          Golsheed



          • #6
            You're welcome!

            Originally posted by Golsheed
            (1) doing a normal PCA on each gene separately
            (2) constructing a matrix where each row corresponds to a sample and the columns are as follows:

            sample, gene1_PC1, gene1_PC2, gene2_PC1, gene2_PC2, gene3_PC1, ...

            and doing a normal PCA on that, right?
            => (1) Well, as you pointed out previously, I'm not sure a normal PCA will work on the isoform proportions for each gene, as the variables are related (they sum to 1). The proportions PCA from robCompositions sounds like a method designed to handle this, so maybe you should do a proportions PCA on each gene rather than a normal PCA.

            => (2) Yes, this is what I had in mind. Since the PC coordinates are no longer proportions, a normal PCA across them should be fine.

            => If your data hadn't been proportions then a normal MFA would have sufficed!


            Good luck!

            Best,

            -- Alex
