Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • tleonardi
    Junior Member
    • Nov 2010
    • 3

    Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2

    Hello everybody,
    I'm writing here because I've noticed that cuffdiff 1.1.0 and 1.0.2 give really different results when it comes to DE testing.

    This is what I get with 1.1.0
    Code:
    cuffdiff -p 4 -o . -N --emit-count-tables --frag-bias-correct /disk1/tl344/rep/iGenome/Mus_musculus/Ensembl/NCBIM37/Sequence/WholeGenomeFasta/genome.fa -M /disk1/tl344/rep/UCSC.chrM.renamed.gtf /disk1/tl344/cuff/illumina/cuffcompare/cuffcompare.combined.gtf /disk1/tl344/mapped_reads/sample1_properlyPaired2.bam /disk1/tl344/mapped_reads/sample2_properlyPaired2.bam/disk1/tl344/mapped_reads/sample3_properlyPaired2.bam &
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff$ awk '$5=="q1" && $6=="q2"' gene_exp.diff|cut -f7 |sort |uniq -c
    434 FAIL
    1 HIDATA
    8338 LOWDATA
    34844 NOTEST
    4474 OK
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff$ awk '$5=="q1" && $6=="q2" && $14=="yes"' gene_exp.diff |wc -l
    175
    While this is what I get with 1.0.2:
    Code:
    tl344@ram--bio:~/cuff/illumina/cuffdiff_1.0.2$ awk '$5=="q1" && $6=="q2"' gene_exp.diff|cut -f7 |sort |uniq -c
    379 FAIL
    32 LOWDATA
    16460 NOTEST
    31220 OK
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff_1.0.2$ awk '$5=="q1" && $6=="q2" && $14=="yes"' gene_exp.diff |wc -l
    1534
    All default options for the two versions appear to be the same, the command line options are exactly the same and the files (gtf, fa, bam) are the same.

    As you can see, the number of significant genes changes quite a lot.
    I think this might be because of the fact that (from the 1.1.0 release notes)
    Cuffdiff now includes a more sophisticated check for sufficient sequencing depth prior to testing for differences, which substantially improves the accuracy of differential expression analysis in loci with low to medium depth.
    but still I wouldn't expect such a big difference.
    Is there anyone who had similar problems? Any idea if this is expected behaviour or if there's something wrong?

    Many thanks in advance for any feedback!
    tom
  • cw11
    Member
    • Sep 2011
    • 12

    #2
    Hmm, I just saw a reply from Cole on another thread that might be relevant:

    Originally posted by Cole Trapnell View Post
    A bit more on what FAIL means, and how it can happen. We use FAIL for genes that actually throw a numerical exception during isoform abundance calculation. In Cufflinks and Cuffdiff, there's a couple of calculations that require us to build matrices with either a row per transcript and a column per read (more or less) or a square matrix with a row and column for each transcript. Some of these matrices need to be invertible or positive definite or have other properties in order for the next steps in the algorithm to succeed. However, sometimes (due to things like round-off error) they aren't. Other times, missing data causes trouble. Oddly enough, this is actually more likely to happen the more reads you get overall, because you can see that isoforms are present, but you don't actually have enough data to calculate those abundances. This is the effect you were observing above. So since we can't be sure about the values (and in fact, were we to go ahead and do the calculation anyways, they could be *wildly* off in theory, or even negative), we set them to zero and move on.

    In order to make differential expression estimates more conservative, version 1.1.0 really ramped up the checks that are done before these steps so we don't end up reporting false positives that are due to numerical exceptions. However, users (like yourselves) have been pretty frustrated by those changes, so I've spent the last few weeks going back and streamlining the overall algorithm to actually eliminate pieces that require the matrices to have some of those properties. The main offender was our "importance sampling" procedure, which tries to give us a sense (for each gene) for the accuracy for the maximum likelihood estimate of isoform abundances. This procedure was originally meant to improve the robustness of the estimate when one or more isoforms were close to zero, but in practice, we found that it actually hurts as often as it helps. Moreover, this procedure would often FAIL genes, so I removed it altogether. I've compensated on the differential expression side with some other statistical improvements and fixes, and the result is globally more accurate differential analysis (both in terms of fewer FAILs and fewer false positives than 1.1.0).

    The upcoming version 1.2.0 should drastically reduce the number of FAIL genes, though there will still be some. If we can't calculate an MLE to begin with, or if for some reason the confidence interval calculation fails, a gene will be marked as FAIL.

    Hope this sheds light on things.

    Comment

    • tleonardi
      Junior Member
      • Nov 2010
      • 3

      #3
      Thank you cw11 for your reply. I've read the post that you linked, and indeed it is relevant. What I see might in fact be a consequence of removing the "importance sampling" procedure in 1.1.0
      I was wondering if this is a problem only for me..

      Comment

      • fangquan
        Member
        • Jul 2011
        • 18

        #4
        I have the same problem. Please read my post here.

        Question on Cuffdiff’s DE results.

        Code:
        Quote:
        cuffdiff -p $NSLOTS -o cuffdiff_v110 Ensembl_59.gtf C_3732_hits.bam,C_6341_hits.bam,C_6356_hits.ba
        m,C_6124_hits.bam,C_60PO_hits.bam HD_4430_hits.bam,HD_4412_hits.bam,HD_4242_hits.bam,HD_4189_hits
        .bam,HD_3584_hits.bam
        Versions
        v091 v101 v102 v103 v110
        OK
        18112 33215 33230 33216 6765
        Fail
        3957 2524 2509 2523 1234
        LOWDATA
        NA 514 514 514 9000
        NOTEST
        29582 15408 15408 15408 34634
        Total
        51651 51661 51661 51661 51633


        We wouldn’t expect such a big difference[1] between v110 and v101(v102,v103)especially in the numbers of NOTEST and OK. Cole Trapnell explained what does “Fail” mean [2], we would like to know what “NOTEST” means here.

        All default options for the two versions appear to be the same; the command line options are exactly the same when we use different versions.
        The cuffdiff v1.1.0 indicated.
        “Cuffdiff now includes a more sophisticated check for sufficient sequencing depth prior to testingfor differences, which substantially improves the accuracy of differential expression analysis in loci with low to medium depth.”


        Question on Cuffdiff’s q-value.

        What is q-value here? Lots of evidence suggests q-value should be less than one [4], [5].


        Reference:
        (1) Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

        (2) Cole Trapnell on “Fail” test

        (3) Cuffdiff producing q-values > 1
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

        (4) Understanding q-values as a More Intuitive Alternative to p-values

        (5) P-values, False Discovery Rate (FDR) and q-values What are p-values?

        (6) Cuffdiff producing q-values > 1
        http://seqanswers.com/forums/showthread.php?t=13764

        Comment

        Latest Articles

        Collapse

        • SEQadmin2
          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
          by SEQadmin2


          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


          Here are nine questions we think about, in roughly the order they matter, before...
          06-18-2026, 07:11 AM
        • SEQadmin2
          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
          by SEQadmin2


          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
          ...
          06-02-2026, 10:05 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by SEQadmin2, 06-17-2026, 06:09 AM
        0 responses
        24 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-09-2026, 11:58 AM
        0 responses
        42 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-05-2026, 10:09 AM
        0 responses
        48 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-04-2026, 08:59 AM
        0 responses
        49 views
        0 reactions
        Last Post SEQadmin2  
        Working...