  • Hockeymac18
    Junior Member
    • Feb 2015
    • 4

    Comparing gene expression of specific genes between samples, datasets, and species

    Our questions might be a bit different than most. As background for our lab, we are not experts in RNA-Seq data and are learning it as we go.

    There have been a number of studies that have sequenced "normal" individuals, and we are interested in using this public data to answer a few simple questions. Specifically, we're interested to find out the gene expression range between "normal" individuals in the population for a very small number of genes (which we are interested in from our wet-lab work). We are not comparing normal against a disease, or anything like that.

    We were wondering if there is a recommended way to normalize this data so that we can compare the gene expression of Gene X in individual A to individual B (letting us ultimately determine the range of expression in Gene X for all individuals).

    We know that just using RPKM values is not the way to go. Housekeeping genes are full of many issues (for instance, there is no guarantee that they are actually stable across individuals). We have looked at quantile normalized values, but will this let you compare between individuals the way we want?
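As a rough illustration of what quantile normalization does to between-individual differences, here is a minimal sketch in Python. The gene names and values are hypothetical, and this is a simplified textbook version of the procedure (ties are not handled), not the exact method used in any of the studies mentioned.

```python
def quantile_normalize(samples):
    """Quantile-normalize samples that all contain the same genes.

    samples maps sample name -> {gene: value}. Ties are not handled specially.
    """
    names = list(samples)
    sorted_values = {name: sorted(samples[name].values()) for name in names}
    n_genes = len(sorted_values[names[0]])
    # The mean of the i-th smallest value across samples becomes the shared target.
    targets = [
        sum(sorted_values[name][i] for name in names) / len(names)
        for i in range(n_genes)
    ]
    normalized = {}
    for name in names:
        genes_by_rank = sorted(samples[name], key=samples[name].get)
        normalized[name] = {gene: targets[i] for i, gene in enumerate(genes_by_rank)}
    return normalized


# Hypothetical values: individual B expresses every gene more highly than A.
raw = {
    "A": {"gene1": 2.0, "gene2": 4.0, "gene3": 6.0},
    "B": {"gene1": 20.0, "gene2": 10.0, "gene3": 60.0},
}
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization every sample has exactly the same set of values, so gene3 ends up identical in A and B even though the raw values differ ten-fold. That is the kind of between-individual difference the question is trying to measure.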

    Naively, we are thinking that percent ranking of gene X against all genes in all individuals might work well. We are thinking this would also let us compare between studies and potentially even between species. But then this would remove the resolution (for instance, the number 2 gene vs. the number 3 gene might actually have a very large difference in expression, even if their "ranks" are nearly identical).
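To make the resolution concern concrete, here is a minimal sketch in Python. The gene names and expression values are made up for illustration. It converts each gene to a percent rank within one individual and compares the rank gap with the fold difference for neighbouring genes.

```python
def percent_ranks(expression):
    """Map each gene to its percent rank (0-100) within one sample.

    Ties are ignored for simplicity; the lowest gene gets 0 and the highest 100.
    """
    ordered = sorted(expression, key=expression.get)
    n = len(ordered)
    return {gene: 100.0 * i / (n - 1) for i, gene in enumerate(ordered)}


# Hypothetical expression values for one individual (arbitrary units).
sample = {"geneA": 5000.0, "geneB": 50.0, "geneC": 40.0, "geneD": 1.0, "geneE": 0.5}

ranks = percent_ranks(sample)
fold_a_vs_b = sample["geneA"] / sample["geneB"]
fold_b_vs_c = sample["geneB"] / sample["geneC"]
gap_a_vs_b = ranks["geneA"] - ranks["geneB"]
gap_b_vs_c = ranks["geneB"] - ranks["geneC"]

print(f"geneA vs geneB: {fold_a_vs_b:.0f}-fold, {gap_a_vs_b:.0f} rank points apart")
print(f"geneB vs geneC: {fold_b_vs_c:.2f}-fold, {gap_b_vs_c:.0f} rank points apart")
```

Both pairs are the same number of rank points apart, although one pair differs 100-fold and the other only 1.25-fold. Percent ranks are robust across studies, but they discard exactly this magnitude information.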

    We've also thought about using "total expression" as the denominator. That is, we would divide expression for gene X / total expression in each individual. I know people have shied away from using this when looking at differential expression analysis, but if we already know what genes we want to know the expression of, we are thinking this method "could work". But like percent rank, we're not sure if there are any limitations that we are missing.
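One limitation of dividing by "total expression" can be sketched in a few lines of Python. This assumes "total expression" means the total number of counted reads in the sample, and the counts below are hypothetical. The result is a share of the sample's reads, so it depends on what every other gene is doing.

```python
def per_million_share(counts, gene):
    """Return one gene's share of all counted reads in a sample, per million."""
    total = sum(counts.values())
    return counts[gene] * 1e6 / total


# Hypothetical read counts; geneX has the same count in both individuals,
# but geneZ is far more abundant in individual B.
individual_a = {"geneX": 100, "geneY": 400, "geneZ": 500}
individual_b = {"geneX": 100, "geneY": 400, "geneZ": 4500}

x_in_a = per_million_share(individual_a, "geneX")
x_in_b = per_million_share(individual_b, "geneX")
print(x_in_a, x_in_b)
```

Here geneX looks five-fold lower in individual B purely because geneZ takes up a larger share of the reads. The value is a relative abundance, not an absolute expression level, and that holds even when only one gene is of interest.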

    Does anyone know of a good way to approach this question? We naively thought this would be a simple analysis (i.e. just grab the expression values for each and compare), but as we learn more it seems more complicated than we initially expected.

    We appreciate any insight.
  • mikep
    Member
    • Feb 2011
    • 45

    #2
    TPM (transcripts per million) would be a good way to go. You haven't mentioned how the public data was processed, but you could use RSEM to generate the values. I would worry about any between-sample normalization throwing away biological variation. Quantile normalization would be a really bad idea, and given that RNA-Seq is a relative measure of abundance, I'm not sure where you are getting "total expression" from. Did you mean total mapped reads? If so, you are halfway to RPKM.


    • Hockeymac18
      Junior Member
      • Feb 2015
      • 4

      #3
      Thank you for your response. After doing a bit more reading, it does seem like TPM is what we're after.

      "Total expression" as a concept is something our P.I. thought would be good to normalize against conceptually. And yes, I believe from an RNA-Seq perspective, this would mean total mapped reads.

      The "public" datasets that we've found that we'd like to use have reported their expression figures as RPKM. We have also seen papers that were reporting expression as quantile-normalized RPKM/FPKM values.

      Is it possible to calculate TPM from RPKM/FPKM? I guess for that it would depend on how they calculated RPKM, correct?

      Am I correct in that the main difference between TPM and RPKM/FPKM is the length normalization for the transcript? Naively, then, I would think you could multiply RPKM/FPKM by the length of the transcript and get TPM, right? But then we would have to assume that each RPKM/FPKM value in each experiment is using the same length for the transcript...

      Or am I missing something fundamental there in the formulas for TPM and RPKM/FPKM (which is quite likely)?


      • mikep
        Member
        • Feb 2011
        • 45

        #4
        As I said, normalizing by "total expression" as you define it is really just the M part of RPKM. There is no simple way to get TPM from RPKM. You could reverse engineer it if you had the original annotation used in the mapping and knew the total mapped reads, I guess.

        You have the "main difference" between TPM and FPKM wrong. You will get a better understanding of it by reading the papers and/or blogs on the subject.

        Practically, and what I would do if in your shoes, is to get my hands on the raw data and remap with RSEM, which will do the work for you.


        • Hockeymac18
          Junior Member
          • Feb 2015
          • 4

          #5
          I think what I was thinking of is CPM (counts per million).


          But I think I am missing something with the relationship between TPM and FPKM...

          If the formulas for each are:

          TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

          RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))
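The two formulas above can be written out directly. This is a minimal sketch in Python with hypothetical counts and transcript lengths. It assumes "total number of reads" is the sum of the counts over the genes listed, which is a simplification.

```python
def tpm(counts, lengths):
    """TPM per gene: length-normalized counts rescaled to sum to one million."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


def fpkm(counts, lengths):
    """RPKM/FPKM per gene: count per kilobase of transcript per million reads."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] / ((lengths[gene] / 1e3) * (total_reads / 1e6))
        for gene in counts
    }


# Hypothetical read counts and transcript lengths (in bases).
counts = {"geneA": 100, "geneB": 400, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}

tpm_values = tpm(counts, lengths)
fpkm_values = fpkm(counts, lengths)
print(tpm_values)
print(fpkm_values)
```

TPM values always sum to one million within a sample. The sum of the FPKM values depends on the counts and lengths in that sample.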

          Shouldn't you be able to go between the two?


          If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

          TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

          This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
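The algebra behind this is short: FPKM for gene i is 10^9 * count_i / (length_i * N), so dividing by the sum of FPKM over all genes cancels both N and the constant, leaving (count_i / length_i) / (sum of count_j / length_j). A minimal Python check with hypothetical counts and lengths follows. It assumes the FPKM table contains every gene in the sample and that the values have not been quantile-normalized or otherwise transformed.

```python
import math


def fpkm_to_tpm(fpkm_values):
    """Rescale a complete set of FPKM values for one sample so they sum to 1e6."""
    total = sum(fpkm_values.values())
    return {gene: value / total * 1e6 for gene, value in fpkm_values.items()}


# Hypothetical read counts and transcript lengths (in bases) for one sample.
counts = {"geneA": 100, "geneB": 400, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}
total_reads = sum(counts.values())

# FPKM computed from the counts, as a published table might report it.
fpkm_values = {
    gene: counts[gene] / ((lengths[gene] / 1e3) * (total_reads / 1e6))
    for gene in counts
}

# TPM computed directly from counts and lengths, for comparison.
rates = {gene: counts[gene] / lengths[gene] for gene in counts}
tpm_direct = {gene: rate / sum(rates.values()) * 1e6 for gene, rate in rates.items()}

tpm_from_fpkm = fpkm_to_tpm(fpkm_values)
agree = all(math.isclose(tpm_direct[g], tpm_from_fpkm[g]) for g in counts)
print(agree)
```

If the published table has been filtered (for example, low-expression genes dropped), the sum in the denominator is incomplete and the converted values will be inflated accordingly.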


          Also, isn't there supposed to be a proportionality constant in the RPKM/FPKM formula? Or is that "cancelled" out in the equation?

          The reason I bring up the proportionality constant is that this has been a main reason that people have recommended not using FPKM/RPKM and have instead recommended using TPM:
          Wagner, Kim, and Lynch: http://lynchlab.uchicago.edu/publica...%282012%29.pdf
          Lior Pachter blog article: https://liorpachter.wordpress.com/20...he-supplement/
          Lior Pachter talk: https://www.youtube.com/watch?v=5NiF...tu.be&t=30m30s


          Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up a point about the usefulness of public data: if you have to download the public raw data and re-analyze it yourself, the value of public processed data is lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).
          Last edited by Hockeymac18; 02-04-2015, 05:02 PM.


          • mikep
            Member
            • Feb 2011
            • 45

            #6
            Originally posted by Hockeymac18:
            But I think I am missing something with the relationship between TPM and FPKM...

            If the formulas for each are:

            TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

            RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))

            Shouldn't you be able to go between the two?


            If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

            TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

            This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
            Well, you learn something new every day. That makes sense. What I meant to say was that there is no constant scaling factor between the two; it differs from sample to sample.
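The sample dependence can be seen in a small Python sketch. The FPKM values are hypothetical, and the factor is the 10^6 / (sum of FPKM) term from the conversion formula quoted above.

```python
def tpm_scale_factor(fpkm_values):
    """Factor that converts this sample's FPKM values to TPM."""
    return 1e6 / sum(fpkm_values.values())


# Hypothetical FPKM tables for two samples; geneA has the same FPKM in both.
sample_1 = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
sample_2 = {"geneA": 10.0, "geneB": 90.0, "geneC": 400.0}

factor_1 = tpm_scale_factor(sample_1)
factor_2 = tpm_scale_factor(sample_2)

gene_a_tpm_1 = sample_1["geneA"] * factor_1
gene_a_tpm_2 = sample_2["geneA"] * factor_2
print(factor_1, factor_2)
print(gene_a_tpm_1, gene_a_tpm_2)
```

The same FPKM for geneA maps to different TPM values in the two samples, so FPKM values cannot be compared across samples as if they were TPM multiplied by one fixed number.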

            Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up a point about the usefulness of public data: if you have to download the public raw data and re-analyze it yourself, the value of public processed data is lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).
            Fair enough.


            • Hockeymac18
              Junior Member
              • Feb 2015
              • 4

              #7
              Thank you for your help on the matter. You helped me work through the issues conceptually, and I learned a great deal about RNA-Seq quantification methods along the way.


              • Zapages
                Member
                • Oct 2012
                • 98

                #8
                I would like to say this was a very insightful topic regarding the TPM vs. RPKM/FPKM question. Thank you for sharing this information.

                As for re-analyzing public data, my suggestion is to create a pipeline and offload the processing to a cloud-based platform, then go from there. This will save you time.

                Personally, I did the following:

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > Cufflinks2 > Cuffmerge2 > Cuffdiff2 > Offline (cummeRbund)

                and

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > edgeR

                and

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > DESeq

                This took about 6 to 8 months for about 40 samples. It's definitely doable, but it takes a bit of time.

                I would suggest trying iPlant Collaborative's Discovery Environment.

                All the best with your project.


                • kopi-o
                  Senior Member
                  • Feb 2008
                  • 319

                  #9
                  This has code for going between RPKM and TPM (and also effective counts)

                  This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood. I’ll try to clear up a bit of the confusion here. The first thing one should remember is t…


                  A very nice post.

                  Also, when comparing public data, I recommend that you try to correct for batch effects using ComBat or a similar program. You might also want to convert to log scale before that. Good luck!
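The log-scale step might look like the following minimal Python sketch. The pseudocount of 1 is an assumption (a common convention to keep zeros finite), the TPM values are hypothetical, and the batch correction itself is not shown here.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each gene in one sample."""
    return {gene: math.log2(value + pseudocount) for gene, value in values.items()}


# Hypothetical TPM values for one sample, including an unexpressed gene.
tpm_values = {"geneA": 0.0, "geneB": 1.0, "geneC": 7.0, "geneD": 1023.0}

log_values = log2_transform(tpm_values)
print(log_values)
```

On this scale a fixed difference corresponds to a fixed fold change, which is usually what batch-correction methods that model additive effects expect.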
                  Last edited by kopi-o; 02-05-2015, 04:09 AM.


                  • mbblack
                    Senior Member
                    • Aug 2009
                    • 245

                    #10
                    One advantage of starting from raw data and re-normalizing and analyzing yourself is that you can investigate the various original data sets for any potential bias in library size and signal distribution. Your processed data may represent radically different original data sets, and so you may be introducing bias into your meta-analysis by starting from processed data only.
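A first look at library size and signal distribution could be as simple as the Python sketch below. The sample names and counts are hypothetical, and the three statistics are just one reasonable choice of starting point, not a complete quality-control procedure.

```python
import statistics


def sample_summary(counts):
    """Summarize one sample's raw counts: depth, detection, and central tendency."""
    values = list(counts.values())
    return {
        "library_size": sum(values),
        "genes_detected": sum(1 for v in values if v > 0),
        "median_count": statistics.median(values),
    }


# Hypothetical raw counts from two different studies.
study_1_sample = {"geneA": 0, "geneB": 10, "geneC": 90}
study_2_sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}

summary_1 = sample_summary(study_1_sample)
summary_2 = sample_summary(study_2_sample)
print(summary_1)
print(summary_2)
```

Large differences between studies in these summaries (here a 100-fold difference in depth and a gene that is undetected in one sample) are invisible once only processed values are available.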

                    That actually, to me, is the whole point of requiring authors to submit raw data - you really need to begin from that if you want to compare across studies. I just would not be comfortable trying to do any real cross-study meta-analysis from processed data.
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.
