  • jparsons
    replied
    Originally posted by rufessor
    Has anyone actually run this through cufflinks with and without --no-effective-length-correction to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) for each situation?
    I did a quick test of this today and saw no obvious difference in the reported ERCC FPKMs between standard cufflinks and --no-effective-length-correction, apart from a uniformly higher FPKM in the "standard" run. The flat increase aside, there is the issue that using ERCCs in the cufflinks pipeline is itself fraught, because spike-ins break the fundamental assumption of the FPKM calculation. (It's not insane to use ERCC FPKM values, but it's not ideal either, particularly for comparing or normalizing samples.)


  • rufessor
    replied
    sorry to bring an old thread back to life- but this has been bothering me a bit and I wondered the following.

    Has anyone ever attempted to empirically check the length correction algorithm?

    I doubt this would even be a good test. But I note that the ERCC standards have transcripts ranging from 1995 to 274 bp in length- and further

    ERCC-77 is 275 bp in length and abundance is 3.66
    ERCC-51 is 274 bp in length and abundance is 58.59

    Has anyone actually run this through cufflinks with and without --no-effective-length-correction to compare how the ERCC standard curve looks in terms of RPKM (I am single ended) for each situation?


  • choy
    replied
    Thank you! This is extremely helpful!

    Unfortunately I have run many samples without the --no-effective-length-correction enabled so I may have to deal with the bias problem for now. For future experiments I will definitely employ this option.


  • sdriscoll
    replied
    this is a common issue - it's in eXpress as well. in fact it affects any of these tools that use the "effective length correction" for read counts or expressions. apparently there isn't currently a logical way to fix it. additionally it's only theoretical that this adjustment improves expressions. if you're counting hits in a more general way, like with htseq-count, this adjustment is not made. I don't like it because it says that there are reads in my data that don't exist! it should be obvious that counts for features that are so close to the expected fragment length may be unreliable or lower than they *should* be - that's good enough information for me.

    if you want you can disable this adjustment in cufflinks by using their '--no-effective-length-correction' option. this fixes it. for example, i've compared read counts back-calculated from the FPKMs cufflinks reports using this option and they are identical to counts i get through a normal naive counting method (at the gene locus level).

    by the way you can get those "raw" counts back from cufflinks by keeping the "Raw Map Mass" value it reports during its run and then using the following calculation on the FPKM values in isoforms.fpkm_tracking:

    COUNTS = FPKM*transcript_length/1000000000 * MASS
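A minimal Python sketch of the back-calculation above. The function name and the example numbers are illustrative assumptions, not anything cufflinks provides; the "Raw Map Mass" has to be copied from the cufflinks run output by hand, and the result should only be expected to match naive counts for runs that used --no-effective-length-correction.

```python
def fpkm_to_counts(fpkm, transcript_length, raw_map_mass):
    """Back-calculate a raw count: COUNTS = FPKM * length / 1e9 * MASS.

    fpkm              -- FPKM value from isoforms.fpkm_tracking
    transcript_length -- transcript length in bp
    raw_map_mass      -- "Raw Map Mass" reported by cufflinks during the run
    """
    return fpkm * transcript_length / 1e9 * raw_map_mass


# Hypothetical values, chosen only to illustrate the arithmetic.
counts = fpkm_to_counts(fpkm=10.0, transcript_length=90, raw_map_mass=20_000_000)
print(round(counts, 2))
```

The division by 1e9 combines the "per kilobase" (1e3) and "per million" (1e6) factors in the FPKM definition.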


  • choy
    replied
    Thanks for pointing this phenomenon out.

    I am using cufflinks extensively and noticed this behavior somewhere around the Cufflinks version 1.0.0 release. Older versions of Cufflinks did not seem to have this issue.

    Currently I circumvent this by removing or ignoring transcripts shorter than 250bp. Plotting the distribution of FPKMs shows this to be a reasonable cutoff value. I agree that the abnormal increase in FPKM may be tied to the fragment length.

    I agree that there is a problem here and hope the developers address it.

    Best regards


  • Neuromancer
    replied
    Hey cram,

    did you find a solution to this in the end? Is it different in newer versions of Cufflinks?

    I'm actually struggling with a connected problem and intra-sample comparisons:

    Is there a way to compare transcript abundance within a certain group of transcripts (e.g. Gene_A, Gene_B, Gene_C) to actually rank them by expression (i.e. Gene_A is more highly expressed than Gene_C)?

    I tried counting reads within exons and normalizing to the exon-summed transcript lengths, but there might still be some bias, since some exons also overlap...

    Any ideas?
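One way to keep overlapping exons from being counted twice in the length is to merge them before summing. This is only a sketch under stated assumptions: exons are (start, end) pairs with 1-based inclusive coordinates (as in GTF), the function names are made up for illustration, and it does nothing about reads shared between isoforms or multi-mapped reads.

```python
def merged_exon_length(exons):
    """Number of bases covered by at least one exon.

    exons -- iterable of (start, end) tuples, 1-based inclusive.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # Previous block is finished; add its length and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps the current block; extend it if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def reads_per_kb(read_count, exons):
    """Read count normalized by the merged (non-redundant) exon length."""
    return read_count / (merged_exon_length(exons) / 1000.0)


# Hypothetical gene with two overlapping exons and one separate exon.
example_exons = [(100, 200), (150, 300), (400, 450)]
print(merged_exon_length(example_exons))
```

Ranking genes by reads_per_kb within one sample still inherits whatever bias the read counting itself has.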


  • Hunny
    replied
    Originally posted by cram
    I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

    If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.

    I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.
    Hi,

    I have been reading the supplemental material of the Cufflinks paper,
    and I am struggling with the many formulas in it.
    Could you tell me why they use the adjusted length? What does this length mean mathematically or biologically?

    Thanks,


  • magick
    replied
    Thanks for the information and for taking action. I had gone through that material but never thought of it as a problem. I hope you get a good answer from the developers.

    I am now using RSEM to calculate read counts, which I then feed into DESeq for differential expression analysis. This approach performs well and looks better to me. I'm afraid that removing transcripts shorter than 200 bp might remove some useful information from the analysis.


  • cram
    replied
    I've been reading through the supplemental methods of the Cufflinks paper and I have a theory about why this is happening. Rather than use the actual transcript length in FPKM calculations, Cufflinks uses what they call an adjusted length. This is intended to account for the fact that the expected fragment length will affect the probability of selecting a fragment from a transcript of a given length.

    If I'm following the math correctly then this formula does not really handle cases where the transcript length is significantly shorter than the expected fragment length. It will produce an extremely low value for the adjusted transcript length, which will then cause the high FPKMs.

    I've sent an email to the cufflinks developers to ask them if this sounds reasonable. In the meantime I think I'll just exclude transcripts shorter than 200bp or at least ignore the FPKM values for intra-sample expression comparisons.
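To illustrate the theory above, here is a rough Python sketch. It assumes the adjusted length is the expected number of positions a fragment can start at, i.e. the sum over fragment lengths i = 1..l of F(i) * (l - i + 1), where l is the transcript length and F is the fragment length distribution. The Gaussian F with mean 200 and standard deviation 80, the cap at 1000 bp, and all function names are my assumptions for illustration; this is not the Cufflinks code.

```python
import math


def gaussian_fragment_pmf(mean=200.0, sd=80.0, max_frag=1000):
    """Discretized Gaussian over fragment lengths 1..max_frag (assumed F)."""
    weights = [math.exp(-0.5 * ((i - mean) / sd) ** 2)
               for i in range(1, max_frag + 1)]
    total = sum(weights)
    return [w / total for w in weights]  # index 0 holds F(1)


def effective_length(transcript_len, pmf):
    """Sum of F(i) * (l - i + 1) over fragment lengths that fit in the transcript."""
    upper = min(transcript_len, len(pmf))
    return sum(pmf[i - 1] * (transcript_len - i + 1)
               for i in range(1, upper + 1))


pmf = gaussian_fragment_pmf()
for length in (90, 200, 1000, 2000):
    eff = effective_length(length, pmf)
    # length / eff is the factor by which FPKM is inflated relative to
    # dividing by the actual transcript length.
    print(length, round(eff, 2), round(length / eff, 2))
```

Under these assumptions only the small tail of F below the transcript length contributes for a 90 bp transcript, so its adjusted length collapses to a few bp and the FPKM is inflated accordingly, while long transcripts are barely affected.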


  • magick
    replied
    It's good that I'm not alone.

    Code:
    79      447370
    86      148939
    100     50999.3
    101     142356
    103     101460
    103     216072
    I see the same result in my cufflinks reports. As transcript length increases, the FPKM values decrease to more reasonable levels. Any help would be appreciated!


  • cufflinks reports extremely high FPKMs for short transcripts

    I'm seeing some odd FPKM values reported by cufflinks and I'm wondering if anyone else has seen this or can suggest an explanation. Essentially, the shorter a transcript is the higher its FPKM. The shortest transcripts reach ridiculous levels. In a typical experiment, I see:

    Code:
    Tscript Length     avg. FPKM
    --------------     ---------
    >1000              20
    200 - 1000         30
    100 - 200          2,500
    < 100              130,000
    If I examine the alignment in IGV or directly in the SAM file I find that the short transcripts do not in fact have ridiculously high coverage. For example a 90bp transcript with an FPKM over 50,000 has just 18 reads (total reads in the experiment is about 20M).

    I see this with cufflinks-1.1.0 and 1.0.3, with and without upper quartile normalization.
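For scale, the 90bp example above can be worked through in a few lines of Python. This is a sketch that treats the 18 reads as 18 fragments and uses 20M and 50,000 as exact figures, although both are approximate in the description above; the function names are illustrative.

```python
def naive_fpkm(fragments, length_bp, total_fragments):
    """FPKM using the actual transcript length, with no length adjustment."""
    return fragments * 1e9 / (length_bp * total_fragments)


def implied_length(fragments, reported_fpkm, total_fragments):
    """Length (bp) that would have to be in the denominator to give reported_fpkm."""
    return fragments * 1e9 / (reported_fpkm * total_fragments)


# 18 reads on a 90bp transcript, about 20M reads in total.
print(naive_fpkm(18, 90, 20_000_000))
# Reported FPKM of about 50,000 for the same transcript.
print(implied_length(18, 50_000, 20_000_000))
```

Comparing the two numbers shows how far the reported value is from what the raw coverage would suggest: the reported FPKM only works out if the length in the denominator is a small fraction of a base pair.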
