
  • Cufflinks large FPKM with -g and -N options


    Running Cufflinks (2.0.0 and earlier versions) with
    the -g/--GTF-guide AND -N/--upper-quartile-norm options
    can produce enormous FPKM values (~1e+11), and not only
    for short/novel transcripts: it affects many genes.

    Of 16725 genes, 12094 are nonzero (min= 432, max = 5.7e+11).
    For nonzero genes:
    25th %ile = 216,582
    50th %ile = 1e+06
    75th %ile = 3.12e+06
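
    A summary like the one above can be reproduced from a Cufflinks
    genes.fpkm_tracking file with a few lines of Python. This is a minimal
    sketch: the sample data below is made up to mimic the file layout, and
    only the tracking_id and FPKM columns of the real (wider) format are used.

    ```python
    import csv, io, statistics

    # Hypothetical snippet mimicking a Cufflinks genes.fpkm_tracking file
    # (real files have more columns; only tracking_id and FPKM are used here).
    sample = """tracking_id\tFPKM
    GENE1\t0.0
    GENE2\t432.0
    GENE3\t216582.0
    GENE4\t1.0e6
    GENE5\t3.12e6
    GENE6\t5.7e11
    """.replace("    ", "")

    def fpkm_summary(handle):
        """Return (total gene count, sorted nonzero FPKMs) from a tracking stream."""
        reader = csv.DictReader(handle, delimiter="\t")
        fpkms = [float(row["FPKM"]) for row in reader]
        nonzero = sorted(f for f in fpkms if f > 0)
        return len(fpkms), nonzero

    n, nz = fpkm_summary(io.StringIO(sample))
    q25, q50, q75 = statistics.quantiles(nz, n=4)
    print(f"{len(nz)}/{n} genes nonzero; min={nz[0]:g}, max={nz[-1]:g}")
    print(f"25th={q25:g}  50th={q50:g}  75th={q75:g}")
    ```

    Point it at a real genes.fpkm_tracking file (open(path) instead of the
    io.StringIO) to check whether your own -g/-N run shows the same inflation.
    
    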

    I appreciate that -N should just be re-scaling FPKMs
    and have seen reasonable results from Cufflinks
    in -G (non-de-novo) mode, but these -g levels
    seem strange to me.

    Removing the -N option brings the FPKM levels
    back to a typical range of values.

    Unfortunately the error bounds appear to be identically
    equal to the mean FPKM, with and without -N, at least
    for Cufflinks 2.0.0. Removing the -b option (which was
    also originally used) restored the error-bars.

    It seems that -g & -N together give an FPKM range that
    is quite different from the range produced by -G & -N.
    I would guess that the reason for this is related to tiling of
    the annotated transcripts with faux reads in RABT.
    Perhaps, somehow, this is making almost all genes
    look like upper-quartile expressors, but this is only a guess.
    Last edited by duffman; 05-18-2012, 09:08 AM. Reason: additional details

  • #2
    Large FPKM with Cufflinks

    This issue has been discussed several times on this forum, and I have also raised it here: Cufflinks can give very high FPKM values for some genes irrespective of their size (Cole suggested it affects short ones). I would be interested to know:
    1. The range of FPKM values with and without the -N option.
    2. What exactly you think may have happened with the -N option to give high FPKMs.
    I have found no solution to this high-FPKM problem, so if you have, it will be interesting to learn more about it.


    • #3
      With Cufflinks you can have three different normalizations: fragments mapped to the genome (in millions), fragments mapped to the transcriptome (in millions: --compatible-hits-norm), or upper quartile (-N). Regardless of the normalization, the same number of reads is quantified at each gene. I've looked into it myself: if you run Cufflinks using all three of those normalizations and then look at each of the separate isoforms.fpkm_tracking files, you can confirm it. Check the coverage and FPKM columns. You should see different FPKMs but identical coverages across the three quantifications. Furthermore, if you divide the FPKMs by each other, you should see that at each gene there's a constant ratio between the FPKMs.
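
      That constant-ratio claim can be checked with a toy calculation. The
      fragment counts and gene lengths below are made up; only the algebra
      matters. Since every normalization computes fragments / (constant ×
      length), dividing two normalizations' FPKMs cancels the per-gene terms.

      ```python
      # Made-up fragment counts and lengths in kb; the point is the algebra.
      genes = {"geneA": (121, 2.5), "geneB": (980, 1.2), "geneC": (14, 4.0)}

      norm_genome = 34.7   # millions of fragments mapped to the genome
      norm_uq = 0.012      # upper-quartile gene count divided by 1e6

      def fpkm(frags, length_kb, norm):
          # Same read count at every gene; only the normalization constant changes.
          return frags / (norm * length_kb)

      # FPKM_genome / FPKM_uq = norm_uq / norm_genome at every gene: a constant
      # independent of the gene's fragments and length.
      ratios = [fpkm(f, l, norm_genome) / fpkm(f, l, norm_uq)
                for f, l in genes.values()]
      print(ratios)
      ```
      
      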

      If you calculate FPKMs yourself you can see why the numbers shift around. To be honest, the "FPKM" designation is misleading when you're using any normalization other than "mapped reads in millions". Right? Fragments per kilobase per million mapped reads is what you're used to.

      So say we have a gene that's 2,500 bases long. We've got 121 fragments that mapped to it, and 34.7 million fragments mapped to the genome. We can get the FPKM like so:

      FPKM = 121/(34.7*2.5) = 1.394813
      Say only 27.4 million fragments mapped to the transcriptome. So if you used --compatible-hits-norm then the calculation looks like this:

      FPKM = 121/(27.4*2.5) = 1.766423
      Those aren't that different from one another. Now if you use upper quartile, we're talking about the upper-quartile value of fragments mapped to genes in the sample. That number might be something like 12,000. Divide that value by 1e6 to put it into "millions", like you do with mapped fragments, and it becomes 0.012. So now the calculation looks like this:

      FPKM = 121/(0.012*2.5) = 4033.333
      So maybe it makes sense to scale the upper-quartile normalization value by 1000 so that the "FPKM" comes out as 4.033 instead of 4033. That's reasonable. But it really shouldn't be called an FPKM, because it's like someone telling you there are 14 cars outside and you assume they mean 14 in base 10, when they actually told you 14 in base 16, which is 20 in base 10 (or like expecting a measurement in cm but being given it in inches with a cm label). It's not fragments per kilobase per million mapped reads; it's fragments per kilobase per upper quartile of read counts at genes. So FPKPUQRCG. That name sucks.
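
      The three calculations above can be reproduced with a one-line helper.
      The function name is mine, but the numbers are the ones in this post:

      ```python
      def fpkm(fragments, length_kb, norm_millions):
          """Fragments per kilobase per (normalization constant, in millions)."""
          return fragments / (norm_millions * length_kb)

      frags, length_kb = 121, 2.5   # the example gene above

      print(round(fpkm(frags, length_kb, 34.7), 6))   # genome-mapped: 1.394813
      print(round(fpkm(frags, length_kb, 27.4), 6))   # --compatible-hits-norm: 1.766423
      print(round(fpkm(frags, length_kb, 0.012), 3))  # upper quartile (-N): 4033.333
      ```

      The formula never changes; only the constant in the denominator does,
      which is why the three "FPKMs" sit on such different scales.
      
      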

      These different normalizations only matter when you're comparing samples to each other. So if your goal is to see whether gene X is expressed higher in Sample A versus Sample B, then regardless of the normalization used (as long as you use the same one on both samples) you'll find your answer. The upper-quartile normalization has been shown to be more robust, so maybe it's better to use it for comparing samples to one another. Also, obviously, for expression levels to make sense to other people we all need to be using the same normalization. We should probably all be using upper-quartile normalization, but that puts the numbers on a different scale than we're used to seeing.

      Hope that helped.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */


      • #4
        That was crazy helpful - thanks!


