  • liuxq
    Member
    • Jun 2010
    • 36

    How to compute RPKM?

    Everyone knows the formula for RPKM computation: RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the transcript, L is the length of the transcript in bases, and N is the total number of reads in the sample.

    However, in my RNA-seq analysis pipeline, I have three candidates for "N":

    1. the total number of reads
    2. the number of reads that can be mapped to the reference genome
    3. the number of mapped reads that remain after filtering with repeat masking

    How should I choose the total read number N for the RPKM computation? I find that the three values of "N" give very different results.

    Thanks very much.
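
A minimal sketch of the formula above, evaluated with each of the three candidate N values. All counts are made-up numbers for illustration, and `rpkm` is a hypothetical helper, not a function from any package.

```python
def rpkm(read_count, transcript_length_bp, total_reads):
    """RPKM = 10^9 * C / (N * L), with L in bases."""
    return 1e9 * read_count / (total_reads * transcript_length_bp)

# Hypothetical sample: one transcript and three candidate values for N.
C = 500      # reads assigned to the transcript
L = 2_000    # transcript length in bases
candidate_n = {
    "total reads": 20_000_000,
    "mapped reads": 16_000_000,
    "mapped reads after repeat filtering": 12_500_000,
}

results = {name: rpkm(C, L, n) for name, n in candidate_n.items()}
for name, value in results.items():
    print(f"N = {name}: RPKM = {value:.3f}")
```

Within a single sample, switching N rescales every transcript by the same factor, so the ranking of transcripts does not change. The choice matters when values are compared across samples or against numbers reported elsewhere.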
  • RockChalkJayhawk
    Senior Member
    • Mar 2009
    • 192

    #2
    If all your experiments use repeat masking, then use option 3. Just make sure to state this definition of N clearly when you report FPKM.


    • liuxq
      Member
      • Jun 2010
      • 36

      #3
      Why is option 3 more reasonable?


      • RockChalkJayhawk
        Senior Member
        • Mar 2009
        • 192

        #4
        Option 3 in this scenario represents the last step in processing - or the final number of mapped reads that you will use in your analysis. It is not as informative to use any N other than what passes through your quality control steps.


        • sameet
          Member
          • Apr 2010
          • 34

          #5
          Hi,
          I am a bit confused. What should I use for N: the total number of reads that mapped, or the number of reads that mapped uniquely? I cannot afford to discard the repeated reads because I have some important data in them.
          Sameet Mehta (Ph.D.),
          Visiting Fellow,
          National Cancer Institute,
          Bethesda,
          US.


          • RockChalkJayhawk
            Senior Member
            • Mar 2009
            • 192

            #6
            In that case I would use 2, but make sure you clearly state that you haven't removed reads from repeat regions.


            • sameet
              Member
              • Apr 2010
              • 34

              #7
              Hi,
              I was thinking along the same lines. But I want to know how to handle situations where the same read maps to multiple locations, because this happens at a pretty high rate in my samples.
              Sameet Mehta (Ph.D.),
              Visiting Fellow,
              National Cancer Institute,
              Bethesda,
              US.


              • severin
                Genome Informatics Facility
                • Sep 2009
                • 105

                #8
                Hi Sameet,
                As far as I have seen, there really is no clear rule on what to do with mappings to multiple locations, which is why many scientists use only uniquely mappable reads for each gene. In the RNA-Seq Atlas for Glycine max, I used the uniquely mappable reads, then used the total mappable count (N), which includes the multiple alignments. Now, of course, there are programs (Cufflinks or ERANGE) that try to account for multiple mappings, but that doesn't help you decide whether to include them in the first place.

                As people above have mentioned, reporting the methodology is very important. Soybean has had two whole-genome duplications, so it has lots of similar genes. Even so, I found that the Atlas paper, which used only the uniquely mappable reads on a non-replicated sample, still provided plenty of interesting data that fit what we would expect from a soybean (genes involved in seed filling were still highly expressed during seed filling, etc.).

                No one method is going to be better than another in every case. It really depends on what you are looking at. Just be aware of the potential biases and include those in your interpretation.
                Last edited by severin; 12-24-2010, 05:39 AM.
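
To make the counting choices above concrete, here is a rough sketch that tallies reads per gene in two ways: using only uniquely mapped reads, or splitting each multi-mapped read equally among its hits. The equal split is just one simple illustrative scheme and is not what Cufflinks or ERANGE actually do. The read-to-gene assignments and the helper names are invented for the example.

```python
from collections import defaultdict

# Hypothetical alignments: read ID -> genes the read aligned to.
alignments = {
    "read1": ["geneA"],
    "read2": ["geneA"],
    "read3": ["geneA", "geneB"],  # multi-mapped, e.g. duplicated genes
    "read4": ["geneB"],
    "read5": ["geneB", "geneC"],  # multi-mapped
}

def count_unique(alignments):
    """Count only reads that align to exactly one gene."""
    counts = defaultdict(float)
    for genes in alignments.values():
        if len(genes) == 1:
            counts[genes[0]] += 1.0
    return dict(counts)

def count_fractional(alignments):
    """Give each of a read's n hits a weight of 1/n."""
    counts = defaultdict(float)
    for genes in alignments.values():
        for gene in genes:
            counts[gene] += 1.0 / len(genes)
    return dict(counts)

unique_counts = count_unique(alignments)
fractional_counts = count_fractional(alignments)
n_mapped = len(alignments)  # an N that includes the multi-mapped reads

print(unique_counts)
print(fractional_counts)
print(n_mapped)
```

Note that with unique-only counting, geneC gets no reads at all in this toy data, which is the kind of bias worth mentioning when the method is reported.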


                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  See Robinson and Oshlack's paper (Genome Biol 2010, 11:R25) for some thoughts on why none of the three 'N' values may be a good option, at least if you want to look at differential expression.
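
One way to see the concern, as a toy sketch: if a single gene takes up a much larger share of the reads in one sample, dividing by a total-count N pushes down the values of every other gene in that sample, even though their abundance has not changed. The numbers and helper names below are invented to show that effect only. This is not an implementation of the normalization method from the paper.

```python
def rpkm(read_count, length_bp, total_reads):
    """RPKM = 10^9 * C / (N * L)."""
    return 1e9 * read_count / (total_reads * length_bp)

def simulate_counts(abundance, depth):
    """Split `depth` reads among genes in proportion to their abundance."""
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}

LENGTH_BP = 1_000  # same hypothetical length for every gene
DEPTH = 8_000      # same hypothetical sequencing depth for both samples

# g1-g3 have identical abundance in both samples; only g4 changes.
abundance_a = {"g1": 1, "g2": 1, "g3": 1, "g4": 1}
abundance_b = {"g1": 1, "g2": 1, "g3": 1, "g4": 5}

counts_a = simulate_counts(abundance_a, DEPTH)
counts_b = simulate_counts(abundance_b, DEPTH)

rpkm_a = {g: rpkm(c, LENGTH_BP, DEPTH) for g, c in counts_a.items()}
rpkm_b = {g: rpkm(c, LENGTH_BP, DEPTH) for g, c in counts_b.items()}

# g1 appears lower in sample B although its abundance did not change.
fold_change_g1 = rpkm_b["g1"] / rpkm_a["g1"]
print(fold_change_g1)
```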


                  • john23
                    Junior Member
                    • Feb 2011
                    • 1

                    #10
                    RPKM/FPKM is a better option than the raw read counts because it takes into account the quantity of RNA that was used for sequencing. In general, RNA samples are sequenced using different amounts of RNA, which gives very different numbers of reads (a larger quantity of RNA gives a larger number of reads).
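
A small sketch of that point, with invented numbers: the same transcript sequenced at two different depths gives different raw counts but the same RPKM. The `rpkm` helper is hypothetical and simply restates the formula from the first post.

```python
def rpkm(read_count, length_bp, total_reads):
    """RPKM = 10^9 * C / (N * L)."""
    return 1e9 * read_count / (total_reads * length_bp)

length_bp = 1_500  # hypothetical transcript length in bases

# (reads on the transcript, total reads N) for two hypothetical runs
shallow = (300, 10_000_000)
deep = (900, 30_000_000)

rpkm_shallow = rpkm(shallow[0], length_bp, shallow[1])
rpkm_deep = rpkm(deep[0], length_bp, deep[1])

print(shallow[0], deep[0])      # raw counts differ three-fold
print(rpkm_shallow, rpkm_deep)  # RPKM values agree
```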
