
  • cufflinks reports extremely high FPKMs for short transcripts

    I'm seeing some odd FPKM values reported by cufflinks and I'm wondering if anyone else has seen this or can suggest an explanation. Essentially, the shorter a transcript is the higher its FPKM. The shortest transcripts reach ridiculous levels. In a typical experiment, I see:

    Code:
    Tscript Length     avg. FPKM
    --------------     ---------
    >1000              20
    200 - 1000         30
    100 - 200          2,500
    < 100              130,000
    If I examine the alignment in IGV or directly in the SAM file I find that the short transcripts do not in fact have ridiculously high coverage. For example a 90bp transcript with an FPKM over 50,000 has just 18 reads (total reads in the experiment is about 20M).

    I see this with cufflinks-1.1.0 and 1.0.3, with and without upper quartile normalization.
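For comparison, the FPKM you would expect from the plain definition (fragments per kilobase of transcript per million mapped fragments) can be worked out directly from the numbers above. This is only a sketch: the function name is made up here, and it treats each of the 18 reads as one fragment and the 20M total reads as the mapped total.

```python
def naive_fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """FPKM from the plain definition, using the real transcript length."""
    # fragments / (length in kb) / (mapped fragments in millions)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# 18 reads on a 90 bp transcript, about 20M reads in the experiment
print(naive_fpkm(18, 90, 20_000_000))  # 10.0
```

So by the plain definition this transcript would come out at roughly 10, several thousand times lower than the value cufflinks reports.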

  • #2
    It's good that I'm not alone.

    Code:
    Length  FPKM
    79      447370
    86      148939
    100     50999.3
    101     142356
    103     101460
    103     216072
    I see the same thing in my cufflinks output. As transcript length increases, the FPKM values drop to more reasonable levels. Any help would be appreciated!



    • #3
      I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

      If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.
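A rough sketch of that effect, under stated assumptions: it takes the adjusted length to be the sum over fragment lengths i of F(i) * (L - i + 1), where F is the fragment length distribution and L the transcript length, and it uses a normal F with mean 200 and standard deviation 80. Those values, the cap of 1000 on fragment length, and the function names are illustrative choices, not cufflinks code and not taken from this thread.

```python
import math

def fragment_pmf(mean=200.0, sd=80.0, max_len=1000):
    """Normal curve evaluated at integer lengths 1..max_len, rescaled to sum to 1."""
    weights = [math.exp(-0.5 * ((i - mean) / sd) ** 2) for i in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]  # index 0 holds F(1)

def adjusted_length(transcript_len, pmf):
    """Sum of F(i) * (L - i + 1) over fragment lengths i that fit in the transcript."""
    longest = min(transcript_len, len(pmf))
    return sum(pmf[i - 1] * (transcript_len - i + 1) for i in range(1, longest + 1))

pmf = fragment_pmf()
for length in (90, 200, 2000):
    adj = adjusted_length(length, pmf)
    # last column: factor by which FPKM grows when adj replaces the real length
    print(length, round(adj, 1), round(length / adj, 1))
```

With these assumptions a 90bp transcript gets an adjusted length of only a few bases, while a 2000bp transcript loses about 10% of its length. That points in the same direction as the inflated FPKMs, though it is not meant to reproduce the exact values cufflinks reports.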

      I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.



      • #4
        Thanks for the information and for taking action. I did go through that but never thought of it as a problem. I hope you get a good answer from the developers.

        I am now using RSEM to calculate read counts, which I then feed into DESeq for differential expression analysis. This works well and looks better to me. I'm afraid that removing transcripts shorter than 200 bp might remove some useful information from the analysis.



        • #5
          Originally posted by cram
          I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

          If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.

          I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.
          Hi,

          I have been reading the supplemental material of the Cufflinks paper, and I'm struggling with the many formulas in it.
          Could you tell me why they use the adjusted length? What does that length mean mathematically or biologically?

          Thanks,



          • #6
            Hey cram,

            did you find a solution to this in the end? Is it different in newer versions of Cufflinks?

            I'm actually struggling with a connected problem and intra-sample comparisons:

            Is there a way to compare transcript abundance within a certain group of transcripts (e.g. Gene_A, Gene_B, Gene_C) and actually rank them by expression (i.e. Gene_A is more highly expressed than Gene_C)?

            I tried counting reads within exons and normalizing to the exon-summed transcript lengths, but there might still be some bias, since some exons overlap...

            Any ideas?



            • #7
              Thanks for pointing this phenomenon out.

              I am using cufflinks extensively and noticed this behavior somewhere around the Cufflinks version 1.0.0 release. Older versions of Cufflinks did not seem to have this issue.

              Currently I circumvent this by removing or ignoring transcripts shorter than 250bp. Plotting the distribution of FPKMs shows this to be a reasonable cutoff value. I agree that the abnormal increase in FPKM may be tied to the fragment length.
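A length cutoff like this can be applied to the tracking file with a few lines of Python. This is a minimal sketch, assuming a tab-separated file with a header row containing `tracking_id`, `length` and `FPKM` columns, as in isoforms.fpkm_tracking; the function name and the two demo rows are made up for illustration.

```python
import csv
import io

def filter_short_transcripts(tsv_handle, min_length=250):
    """Return the rows whose 'length' column is at least min_length bp."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if int(row["length"]) >= min_length]

# Tiny in-memory stand-in for a tracking file (made-up values)
demo = io.StringIO(
    "tracking_id\tlength\tFPKM\n"
    "tx_short\t90\t50000\n"
    "tx_long\t1500\t20\n"
)
kept = filter_short_transcripts(demo)
print([row["tracking_id"] for row in kept])  # ['tx_long']
```

For a real file, pass `open("isoforms.fpkm_tracking", newline="")` instead of the in-memory demo.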

              I agree that there is a problem here and hope the developers address it.

              Best regards



              • #8
                this is a common issue - it's in eXpress as well. in fact it affects any tool that uses the "effective length correction" for read counts or expression values. apparently there isn't currently a logical way to fix it. additionally it's only theoretical that this adjustment improves expression estimates. if you're counting hits in a more general way, like with htseq-count, this adjustment is not made. I don't like it because it says that there are reads in my data that don't exist! it should be obvious that counts for features that are so close to the expected fragment length may be unreliable or lower than they *should* be - that's good enough information for me.

                if you want you can disable this adjustment in cufflinks by using their '--no-effective-length-correction' option. this fixes it. for example, i've compared read counts back-calculated from the FPKMs cufflinks reports with this option, and they are identical to the counts i get through a normal naive counting method (at the gene locus level).

                by the way you can get those "raw" counts back from cufflinks by keeping the "Raw Map Mass" value it reports during its run and then using the following calculation on the FPKM values in isoforms.fpkm_tracking:

                COUNTS = FPKM*transcript_length/1000000000 * MASS
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */
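The COUNTS calculation above, written as a small Python function. This is a sketch: the function and argument names are made up, `raw_map_mass` stands for the "Raw Map Mass" value printed during the cufflinks run, and the result is only meaningful for FPKMs produced with '--no-effective-length-correction', as described in the post.

```python
def fpkm_to_counts(fpkm, transcript_length_bp, raw_map_mass):
    """Invert FPKM = counts * 1e9 / (length * mass) to recover the counts."""
    # Same calculation as COUNTS = FPKM*transcript_length/1000000000 * MASS
    return fpkm * transcript_length_bp * raw_map_mass / 1e9

# Example with round numbers: FPKM 10 on a 90 bp transcript, map mass 20M
print(fpkm_to_counts(10, 90, 20_000_000))
```

With these round numbers the function returns 18 counts.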



                • #9
                  Thank you! This is extremely helpful!

                  Unfortunately I have run many samples without the --no-effective-length-correction enabled so I may have to deal with the bias problem for now. For future experiments I will definitely employ this option.



                  • #10
                    sorry to bring an old thread back to life- but this has been bothering me a bit and I wondered the following.

                    Has anyone ever attempted to empirically check the length correction algorithm?

                    I doubt this would even be a good test. But I note that the ERCC standards have transcripts ranging from 1995 to 274 bp in length, and further:

                    ERCC-77 is 275 bp length and abundance is 3.66
                    ERCC-51 is 274 bp in length and abundance is 58.59

                    Has anyone actually run this through cufflinks with and without --no-effective-length-correction to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) in each case?



                    • #11
                      Originally posted by rufessor
                      Has anyone actually run this through cufflinks with and without --no-effective-length-correction to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) in each case?
                      I did a quick test of this today and saw no obvious difference in the reported ERCC FPKM between standard cufflinks and --no-effective-length-correction, except for a uniformly higher FPKM in "standard". I attribute the flat increase to the fact that using ERCCs in the cufflinks pipeline is itself fraught, because spike-ins break the fundamental assumption of the FPKM calculation. (It's not insane to use ERCC FPKM values, but it's not ideal either, particularly as regards comparing/normalizing samples.)
