  • cuekt
    Junior Member
    • Feb 2012
    • 2

    RPKM=0

    Hi everybody,

    I have a question about handling RPKM=0 in RNA-seq analysis.

    Could you tell me how you handle RPKM=0 when calculating fold changes?
    For example, when the RPKM of sample A vs. sample B is 100 vs. 0, respectively, it is difficult to get a fold change value between sample A and sample B, right? (The ratio is going to be zero or undefined, anyway.)

    It could be three ways as follows:
    1) ignore the genes with 0 RPKM from consideration.
    2) genes with 0 RPKM are assigned an inverse value of infinity.
    3) genes with 0 RPKM are assigned a certain small value, e.g., 0.002 (see the following paper: Rowley et al. 2011 Blood 118: e101-11).

    Which way is common (or your preference) and reliable for obtaining real fold changes? Or are there any other ways?
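A minimal sketch of the three options listed above, written as a single Python function. The function name `log2_fold_change` and the `mode` labels are hypothetical, chosen only for illustration; the default small value of 0.002 is the example figure quoted in option 3. Reading option 2 as "report an infinite fold change" is an assumption about what was meant.

```python
import math

def log2_fold_change(rpkm_a, rpkm_b, mode="pseudocount", small_value=0.002):
    """Return log2(A/B) under one of three zero-handling policies.

    Hypothetical helper: names and mode labels are illustrative only.
    """
    if mode == "ignore":
        # Option 1: drop the gene if either sample has RPKM = 0.
        if rpkm_a == 0 or rpkm_b == 0:
            return None
        return math.log2(rpkm_a / rpkm_b)
    if mode == "infinite":
        # Option 2 (assumed reading): treat on/off genes as infinite change.
        if rpkm_a == 0 and rpkm_b == 0:
            return None  # nothing detected in either sample
        if rpkm_b == 0:
            return math.inf
        if rpkm_a == 0:
            return -math.inf
        return math.log2(rpkm_a / rpkm_b)
    if mode == "pseudocount":
        # Option 3: replace zeros with a small positive value.
        a = rpkm_a if rpkm_a > 0 else small_value
        b = rpkm_b if rpkm_b > 0 else small_value
        return math.log2(a / b)
    raise ValueError(f"unknown mode: {mode}")

# The 100 vs. 0 example from the question, under each policy.
for mode in ("ignore", "infinite", "pseudocount"):
    print(mode, log2_fold_change(100, 0, mode=mode))
```

Note that with option 3 the resulting fold change depends entirely on the small value chosen, so it is a convention rather than a measurement.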
    Last edited by cuekt; 02-11-2012, 11:39 AM.
  • Meligethes
    Member
    • Mar 2012
    • 23

    #2
    I have the same "problem"!

    • mbblack
      Senior Member
      • Aug 2009
      • 245

      #3
      You cannot compute a fold change for a gene you did not detect, simple as that. You would not use missing data in other analyses, would you? Had you done an array experiment, would you include genes which did not appear as expressed on the array? So why would you do so in this instance? If you do not have a value for transcript abundance for a gene, you will have to remove that gene from your comparisons.

      Personally, I'm now in the habit of only including genes with a raw mapped count of > 10 in all my normalizations for differential gene expression analysis (I just filter the raw count table and only retain rows with a "count > 10" for all samples, and whatever remains is what I have for normalization and differential gene expression). I've also seen publications which have only used genes with RPKM values of > 0.1 for differential gene expression. The thinking is that samples with very low counts (e.g. < 10) represent estimates of transcript abundance that are too unreliable for inclusion in differential gene expression analysis.

      But the bottom line is, you cannot compute fold change at all for a gene unless it was actually detected in BOTH of your sample groups. No data is no data - ignore those and go with the genes you actually have data for.

      P.S. I've also seen publications using an RPKM cutoff of > 0.5. Regardless, the growing consensus in published work seems to be that a minimum value cutoff should be a best practice for DGE analysis.
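The raw count filter described above (retain only rows with a count > 10 in every sample) could be sketched roughly as follows. This is an illustration in plain Python, not the actual workflow used; the function name `filter_by_min_count` and the toy count table are made up for the example.

```python
def filter_by_min_count(counts, min_count=10):
    """Keep genes whose raw mapped count is > min_count in ALL samples.

    `counts` maps a gene name to a list of raw counts, one per sample.
    Hypothetical helper for illustration only.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if all(c > min_count for c in row)
    }

# Toy raw count table: four genes, four samples (invented numbers).
counts = {
    "geneA": [120, 98, 143, 110],
    "geneB": [0, 3, 15, 22],     # undetected in one sample, dropped
    "geneC": [11, 12, 14, 30],
    "geneD": [10, 25, 40, 18],   # 10 is not > 10, dropped
}

kept = filter_by_min_count(counts)
print(sorted(kept))
```

Whatever remains in `kept` would then go on to normalization and differential expression testing.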
      Last edited by mbblack; 11-13-2012, 05:51 AM.
      Michael Black, Ph.D.
      ScitoVation LLC. RTP, N.C.

      • maize
        Junior Member
        • Apr 2011
        • 9

        #4
        I had the same problem before. Mbblack, thank you for the clear answers!

        I understand differential expression can only be calculated for genes expressed across all sample groups. A sample group with missing data cannot be included.

        How should missing data within biological replicates be handled if each sample group consists of 3 biological replicates? Should only genes expressed in all biological replicates be considered? I have seen many cases where one replicate has missing data. The idea of having 3 replicates is to do a statistical comparison between samples (t-test, each with 3 observations). Missing values in replicates make the test impossible. Any suggestions? Thanks.

        • mbblack
          Senior Member
          • Aug 2009
          • 245

          #5
          As I said, for myself, I am now in the habit of only performing DGE on genes where I have a raw mapped count > 10 for ALL samples (meaning all replicates as well). That is my minimum inclusive cutoff for any gene - all samples/replicates must have a mapped read count of > 10. Any gene(s) with any sample(s) with a count not passing that cutoff are excluded from further DGE analyses.

          Other published results, using RPKM, have used minimum cutoffs of 0.1 or 0.5.

          But, the bottom line is, you need to set some minimum limit for inclusion of any gene in your analyses, and then exclude those genes that fail to meet that minimum detection threshold.
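For the RPKM-based cutoffs mentioned above, a rough sketch is given below. It assumes the standard RPKM definition (reads per kilobase of transcript per million mapped reads); the function names and the example numbers are hypothetical, and the requirement that every replicate pass the cutoff mirrors the "ALL samples" rule described for raw counts.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

def passes_rpkm_cutoff(rpkm_values, cutoff=0.1):
    """True only if every sample/replicate exceeds the minimum RPKM."""
    return all(v > cutoff for v in rpkm_values)

# Invented example: 50 reads on a 2,000 bp gene, 25 million mapped reads.
print(rpkm(50, 2000, 25_000_000))

# One gene measured in three replicates (invented RPKM values).
replicate_rpkms = [1.0, 0.3, 0.6]
print(passes_rpkm_cutoff(replicate_rpkms, cutoff=0.1))
print(passes_rpkm_cutoff(replicate_rpkms, cutoff=0.5))
```

The same gene can pass at 0.1 and fail at 0.5, so the choice of cutoff should be reported along with the results.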
          Michael Black, Ph.D.
          ScitoVation LLC. RTP, N.C.

          • wetSEQer
            Member
            • Dec 2013
            • 15

            #6
            If you have 0 reads in one experimental group and more reads in another, you shouldn't discard those genes; that is the thing you are chasing, right? Some genes are completely on or off at a given sequencing depth...
            I never apply cutoffs based on raw counts, since raw counts are biased by gene length (short genes vs. long genes).
            I always go with RPKM and only apply the cutoff to the sample with the higher RPKM. If you trust all the statistics, you can set a "small" threshold; if you need qPCR to confirm, I guess you need a larger threshold. I used 10.
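A small sketch of that filter, assuming one RPKM value per group and a threshold of 10. The function name `keep_by_higher_group` and the gene table are hypothetical, and using `>=` rather than `>` at the threshold is an assumption.

```python
def keep_by_higher_group(rpkm_a, rpkm_b, threshold=10.0):
    """Keep a gene if the HIGHER of its two group RPKMs reaches the threshold.

    On/off genes (0 in one group) survive as long as the other group is
    expressed strongly enough. Hypothetical helper for illustration.
    """
    return max(rpkm_a, rpkm_b) >= threshold

# Invented RPKM pairs (group A, group B).
genes = {
    "on_off": (100.0, 0.0),   # kept: higher value is 100
    "low":    (3.0, 0.0),     # dropped: higher value is only 3
    "both":   (40.0, 55.0),   # kept
}

kept = [g for g, (a, b) in genes.items() if keep_by_higher_group(a, b)]
print(kept)
```

Genes kept this way can still have RPKM = 0 in one group, so a fold change for them needs one of the zero-handling choices from the first post.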
