  • cufflinks reports extremely high FPKMs for short transcripts

    I'm seeing some odd FPKM values reported by cufflinks and I'm wondering if anyone else has seen this or can suggest an explanation. Essentially, the shorter a transcript is the higher its FPKM. The shortest transcripts reach ridiculous levels. In a typical experiment, I see:

    Code:
    Tscript Length     avg. FPKM
    --------------     ---------
    >1000              20
    200 - 1000         30
    100 - 200          2,500
    < 100              130,000
    If I examine the alignment in IGV or directly in the SAM file I find that the short transcripts do not in fact have ridiculously high coverage. For example a 90bp transcript with an FPKM over 50,000 has just 18 reads (total reads in the experiment is about 20M).

    I see this with cufflinks-1.1.0 and 1.0.3, with and without upper quartile normalization.
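As a sanity check on the 90bp example, here is a minimal sketch of the plain FPKM definition using the actual transcript length. It assumes each of the 18 reads counts as one fragment and that the 20M total is the normalization denominator; `naive_fpkm` is an illustrative helper, not a Cufflinks function.

```python
def naive_fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Numbers from the post: 18 reads on a 90 bp transcript, about 20M reads total.
example = naive_fpkm(fragments=18,
                     transcript_length_bp=90,
                     total_mapped_fragments=20_000_000)
print(f"naive FPKM: {example:.2f}")
```

With the actual length this comes out at roughly 10, several thousand times lower than the reported value of over 50,000.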

  • #2
    It's good that I'm not alone.

    Code:
    79      447370
    86      148939
    100     50999.3
    101     142356
    103     101460
    103     216072
    I see the same thing in my cufflinks output (transcript length, then FPKM, above). As transcript length increases, the FPKM values drop to a more reasonable level. Any help would be appreciated!

    • #3
      I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

      If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.

      I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.
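To illustrate the theory, here is a rough sketch of an effective-length calculation: for each possible fragment length f no longer than the transcript, weight the number of positions where such a fragment could start (L - f + 1) by the probability of that fragment length. The Gaussian fragment-length distribution with mean 200 and standard deviation 80 is an assumption for illustration, and this is a simplified reading of the idea, not Cufflinks' actual code.

```python
import math


def fragment_length_pmf(mean=200.0, sd=80.0, max_len=1000):
    """Discretized Gaussian over fragment lengths 1..max_len, normalized to sum to 1.

    The mean and sd are assumed values for illustration only.
    """
    weights = [math.exp(-0.5 * ((f - mean) / sd) ** 2)
               for f in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_length, pmf):
    """Sum over fragment lengths f <= L of P(f) * (L - f + 1) start positions."""
    longest = min(transcript_length, len(pmf))
    return sum(pmf[f - 1] * (transcript_length - f + 1)
               for f in range(1, longest + 1))


pmf = fragment_length_pmf()
for length in (90, 150, 250, 1000, 2000):
    eff = effective_length(length, pmf)
    # Dividing by eff instead of length inflates FPKM by length / eff.
    print(f"{length:>5} bp  effective {eff:8.1f}  inflation x{length / eff:.1f}")
```

Under these assumptions a 90bp transcript ends up with an effective length of only a few bases, while a 2000bp transcript loses only about one mean fragment length, which matches the pattern of inflation reported in this thread.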

      • #4
        Thanks for the information and for taking the initiative. I did go through that material but never thought of it as a problem. I hope you get a good answer from the developers.

        I am now using RSEM to calculate read counts, which I then feed into DESeq for differential expression analysis. This approach performs well and looks better to me. I'm afraid that removing transcripts shorter than 200 bp might discard some information that is useful for the analysis.

        • #5
          Originally posted by cram
          I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

          If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.

          I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.
          Hi,

          I have been reading the supplemental material of the Cufflinks paper, and I am struggling with the many formulas in it.
          Could you tell me why they use the adjusted length? What does this length mean, mathematically or biologically?

          Thanks,

          • #6
            Hey cram,

            did you find a solution to this in the end? Is it different in newer versions of Cufflinks?

            I'm actually struggling with a connected problem and intra-sample comparisons:

            Is it possible to compare transcript abundance within a certain group of transcripts (e.g. Gene_A, Gene_B, Gene_C) in order to rank them by expression (i.e. Gene_A is more highly expressed than Gene_C)?

            I tried counting reads within exons and normalizing to the exon-summed transcript lengths, but there might still be some bias, since some exons overlap...

            Any ideas?
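One way to keep overlapping exons from being counted twice in the length is to merge them first and normalize by the length of the union. The sketch below assumes exons are given as half-open (start, end) coordinate pairs on one strand of one chromosome; the function names and coordinates are made up for illustration.

```python
def union_length(exons):
    """Total number of bases covered by exons given as half-open (start, end) pairs."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(exons):
        if current_end is None or start > current_end:
            # Gap before this exon: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def reads_per_kb(read_count, exons):
    """Read count normalized by the merged exon length in kilobases."""
    return read_count / (union_length(exons) / 1_000)


# Made-up coordinates: the first two exons overlap by 50 bases.
gene_exons = [(100, 200), (150, 300), (400, 450)]
print(union_length(gene_exons))
```

Within a single sample the library size is the same for every gene, so ranking genes by reads per kilobase of merged exon length gives the same order as ranking by an FPKM computed on that length. It does not address the short-transcript issue discussed above.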

            • #7
              Thanks for pointing this phenomenon out.

              I am using cufflinks extensively and noticed this behavior somewhere around the Cufflinks version 1.0.0 release. Older versions of Cufflinks did not seem to have this issue.

              Currently I circumvent this by removing or ignoring transcripts shorter than 250bp. Plotting the distribution of FPKMs shows this to be a reasonable cutoff value. I agree that the abnormal increase in FPKM may be tied to the fragment length.

              I agree that there is a problem here and hope the developers address it.

              Best regards

              • #8
                this is a common issue - it's in eXpress as well, and in fact in any of these tools that use the "effective length correction" for read counts or expression values. apparently there isn't currently a logical way to fix it. additionally, it's only theoretical that this adjustment improves expression estimates. if you're counting hits in a more general way, like with htseq-count, this adjustment is not made. I don't like it because it says there are reads in my data that don't exist! it should be obvious that counts for features whose length is so close to the expected fragment length may be unreliable or lower than they *should* be - that's good enough information for me.

                if you want, you can disable this adjustment in cufflinks with the '--no-effective-length-correction' option. this fixes it. for example, i've compared read counts back-calculated from the FPKMs cufflinks reports with this option, and they are identical to the counts i get through a normal naive counting method (at the gene locus level).

                by the way you can get those "raw" counts back from cufflinks by keeping the "Raw Map Mass" value it reports during its run and then using the following calculation on the FPKM values in isoforms.fpkm_tracking:

                COUNTS = FPKM*transcript_length/1000000000 * MASS
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */
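The back-calculation above can be sketched as a small helper. The function name is illustrative, and per the post it only recovers raw counts when the FPKMs were produced with the effective length correction disabled.

```python
def counts_from_fpkm(fpkm, transcript_length_bp, raw_map_mass):
    """COUNTS = FPKM * transcript_length / 1e9 * MASS, as given in the post."""
    return fpkm * transcript_length_bp / 1_000_000_000 * raw_map_mass


# Illustrative values: FPKM of 10 on a 90 bp transcript with a map mass of 20M.
recovered = counts_from_fpkm(fpkm=10.0,
                             transcript_length_bp=90,
                             raw_map_mass=20_000_000)
print(f"recovered count: {recovered:.1f}")
```

With these illustrative inputs the result is about 18 fragments, which is the inverse of the usual FPKM definition.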

                • #9
                  Thank you! This is extremely helpful!

                  Unfortunately I have already run many samples without --no-effective-length-correction, so I may have to deal with the bias problem for now. For future experiments I will definitely use this option.

                  • #10
                    sorry to bring an old thread back to life, but this has been bothering me a bit and I wondered the following.

                    Has anyone ever attempted to empirically check the length correction algorithm?

                    I doubt this would even be a good test. But I note that the ERCC standards have transcripts ranging from 1995 to 274 bp in length, and further:

                    ERCC-77 is 275 bp in length and abundance is 3.66
                    ERCC-51 is 274 bp in length and abundance is 58.59

                    Has anyone actually run this through cufflinks with and without the --no-effective-length-correction flag to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) in each situation?

                    • #11
                      Originally posted by rufessor
                      Has anyone actually run this through cufflinks with and without the --no-effective-length-correction flag to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) in each situation?

                      I did a quick test of this today and saw no obvious difference in the reported ERCC FPKM between standard cufflinks and cufflinks with --no-effective-length-correction, except for a uniformly higher FPKM in "standard". I'd put the flat increase down to the fact that using ERCCs in the cufflinks pipeline is itself fraught, because spike-ins break the fundamental assumption of the FPKM calculation. (It's not insane to use ERCC FPKM values, but it's not ideal either, particularly when it comes to comparing or normalizing samples.)
