
  • Cufflinks large FPKM with -g and -N options

    Perplexing,

    Running Cufflinks (2.0.0 and previous version) with
    the -g/--GTF-guide AND -N/--upper-quartile-norm options
    can result in enormous FPKM values (~e+11), and not only
    for short/novel transcripts---for many many genes.

    Of 16725 genes, 12094 are nonzero (min= 432, max = 5.7e+11).
    For nonzero genes:
    25th %ile = 216,582
    50th %ile = 1e+06
    75th %ile = 3.12e+06

    I appreciate that -N should just be re-scaling FPKMs
    and have seen reasonable results from Cufflinks
    in -G (non-deNovo) mode, but these -g levels
    seem strange to me.

    Removing the -N option brings the FPKM levels
    back to a typical range of values.

    Unfortunately the error bounds appear to be identically
    equal to the mean FPKM, with and without -N, at least
    for Cufflinks 2.0.0. Removing the -b option (which was
    also originally used) restored the error-bars.

    It seems that -g & -N together give an FPKM range that
    is quite different from the range produced by -G & -N.
    I would guess that the reason for this is related to tiling of
    the annotated transcripts with faux reads in RABT.
    Perhaps, somehow, this is making almost all genes
    look like upper-quartile expressors, but this is only a guess.
    Last edited by duffman; 05-18-2012, 09:08 AM. Reason: additional details

  • #2
    Large FPKM with Cufflinks

    This issue has been discussed several times on this forum, and I have also raised it here myself: Cufflinks will give very high FPKMs for some genes irrespective of their size (Cole suggested it was for short ones). I would be interested to know:
    1. The range of FPKM values with and without the -N option.
    2. What exactly do you think may have happened with the -N option to give high FPKMs?
    I have not seen a solution to this high-FPKM problem, but if you have found one it would be interesting to learn more about it.
    Thanks



    • #3
      With Cufflinks you can have three different normalizations: fragments mapped to the genome (in millions), fragments mapped to the transcriptome (in millions; --compatible-hits-norm), or upper quartile (-N). Regardless of the normalization, the same number of reads is quantified at each gene. I've looked into it myself. If you run Cufflinks with each of those three normalizations and then look at the separate isoforms.fpkm_tracking files, you can confirm it. Check the coverage and FPKM columns. You should see different FPKMs but identical coverages across the three quantifications. Furthermore, if you divide the FPKMs from one run by those from another, you should see the same constant ratio at every gene.
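      The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tables are made-up, trimmed-down stand-ins for isoforms.fpkm_tracking output (real files carry more columns), the transcript IDs and coverage values are hypothetical, and the FPKMs are derived from the 34.7 and 27.4 million example totals used later in this post.

```python
import csv
import io

# Hypothetical, trimmed-down stand-ins for two isoforms.fpkm_tracking files.
# Real files have more columns; only the ones needed for the check are kept.
HEADER = ["tracking_id", "length", "coverage", "FPKM"]


def make_table(rows):
    """Build tab-separated text that looks like a tracking file excerpt."""
    return "\n".join("\t".join(str(v) for v in row) for row in [HEADER] + rows)


# Made-up values: same coverage in both runs, different FPKM scale.
genome_norm_text = make_table([
    ["tx1", 2500, 7.1, 1.394813],
    ["tx2", 1000, 73.2, 14.409222],
])
compatible_norm_text = make_table([
    ["tx1", 2500, 7.1, 1.766423],
    ["tx2", 1000, 73.2, 18.248175],
])


def read_tracking(text):
    """Map tracking_id -> (coverage, FPKM) from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["tracking_id"]: (float(row["coverage"]), float(row["FPKM"]))
        for row in reader
    }


def compare_runs(run_a, run_b):
    """Return (all coverages identical?, per-transcript FPKM ratio b/a)."""
    same_coverage = True
    ratios = {}
    for tid, (cov_a, fpkm_a) in run_a.items():
        cov_b, fpkm_b = run_b[tid]
        if cov_a != cov_b:
            same_coverage = False
        if fpkm_a > 0:  # skip unexpressed transcripts to avoid dividing by zero
            ratios[tid] = fpkm_b / fpkm_a
    return same_coverage, ratios


run_a = read_tracking(genome_norm_text)
run_b = read_tracking(compatible_norm_text)
same_coverage, ratios = compare_runs(run_a, run_b)

print("coverages identical:", same_coverage)
for tid, ratio in ratios.items():
    print(tid, round(ratio, 4))
```

      With real output you would read the files from disk instead of building the text in memory. If the ratios agree to within rounding and the coverages match, the runs differ only by a constant scale factor.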

      If you calc FPKMs yourself you can see why the numbers shift around. To be honest the "FPKM" designation is misleading when you're using any normalization other than "mapped reads in millions". Right? Fragments per kilobase per million mapped reads is what you're used to.

      So say we have a gene that's 2500 bases long. We've got 121 fragments that mapped to it, and we've got 34.7 million fragments mapped to the genome. We can get the FPKM like so:

      Code:
      FPKM = 121/(34.7*2.5) = 1.394813

      Say only 27.4 million fragments mapped to the transcriptome. So if you used --compatible-hits-norm then the calculation looks like this:

      Code:
      FPKM = 121/(27.4*2.5) = 1.766423

      Those aren't that different from one another. Now if you use upper quartile, we're talking about the upper quartile value of fragments mapped to genes in the sample. That number might be something like 12,000. Divide this value by 1e6 to put it into "millions", like you do with mapped fragments, and it becomes 0.012. So now the calculation looks like this:

      Code:
      FPKM = 121/(0.012*2.5) = 4033.333

      So maybe it makes sense to scale the upper quartile normalization value by 1000 so that the "FPKM" comes out as 4.033 instead of 4033. That's reasonable. But it really shouldn't be called an FPKM. If you think about it, it's like someone telling you there are 14 cars outside and you assume they mean 14, but they actually told you 14 in base 16, which would be 20 in base 10 (or maybe like expecting a measurement in cm but being given the measurement in inches with a cm designation). It's not fragments per kilobase per million mapped reads, it's fragments per kilobase per upper quartile of read counts @ genes. So FPKPUQRCG. That name sucks.
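      For anyone who wants to play with the numbers, here is a minimal Python sketch of the same arithmetic. The function name and variable names are my own and are not anything from Cufflinks; the 12,000 upper quartile value is the same illustrative guess as above, not a measured number.

```python
def fpkm(fragments, length_bp, normalizer_millions):
    """Fragments per kilobase of transcript per unit of the chosen normalizer.

    normalizer_millions is whatever the run divides by, already expressed
    in millions: total mapped fragments, transcriptome-compatible fragments,
    or the upper quartile of per-gene fragment counts.
    """
    return fragments / (normalizer_millions * (length_bp / 1000.0))


fragments = 121   # fragments mapped to the example gene
length_bp = 2500  # gene length in bases

# Normalizer 1: 34.7 million fragments mapped to the genome.
genome_norm = fpkm(fragments, length_bp, 34.7)

# Normalizer 2: 27.4 million fragments mapped to the transcriptome.
transcriptome_norm = fpkm(fragments, length_bp, 27.4)

# Normalizer 3: illustrative upper quartile of 12,000, put into "millions".
upper_quartile_norm = fpkm(fragments, length_bp, 12000 / 1e6)

# Optional rescaling by 1000 to bring the value back to a familiar range.
rescaled = upper_quartile_norm / 1000

print(f"genome:         {genome_norm:.6f}")
print(f"transcriptome:  {transcriptome_norm:.6f}")
print(f"upper quartile: {upper_quartile_norm:.3f}")
print(f"rescaled:       {rescaled:.3f}")
```

      The fragment count and gene length never change between the three calls; only the denominator does, which is why the values differ by constant factors.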

      The point of these different normalizations only applies when you're comparing samples to each other. So if your goal is to see whether gene X is expressed higher in Sample A versus B, then regardless of the normalization used (as long as you use the same one on both samples) you'll find your answer. The upper quartile normalization has been shown to be more robust, so maybe it's better to use it for comparing samples to one another. Also, obviously, for the expression levels to make sense to other people we all need to be using the same normalization. We should probably all be using upper quartile normalization, but that puts the numbers on a different scale than we're used to seeing.

      Hope that helped.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */



      • #4
        That was crazy helpful, thanks.

        -MW
