Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

inconsistancy between cufflinks and cuffdiffs output?

  • Filter
  • Time
  • Show
Clear All
new posts

  • inconsistancy between cufflinks and cuffdiffs output?

    Hi all,

    I'm working on differential expression analysis on bacterial transcriptome using cufflinks packages. However I found these result (table below) that confusing me. Some of the expression profiles reported in the cuffdiffs analysis are not consistent with the cufflinks analysis. Below is part of the combined results from cuffdiff (column 2- 10) and cufflinks (column 11 - 14).

    no	gene	FPKM (sampleA)	FPKM (sampleB)	log2(fold_change)	fold_change	test_stat	p_value	q_value	significant	FPKM (SampleA-r1)	FPKM (SampleA-r2)	FPKM (SampleB-r1)	FPKM (SampleB-r2)
    1	gene0090	0	0.280696	1.79769e+308	#VALUE!	1.79769e+308	0.26143	1	no	11.2027	11.0176	6.73073	14.1696
    2	gene0091	0	0	0	1	0	1	1	no	44.6099	57.6125	27.8334	30.6339
    3	gene1910	0	0.440214	1.79769e+308	#VALUE!	1.79769e+308	0.134216	1	no	11.7053	14.0353	33.6464	45.4033
    4	gene2086	0	0	0	1	0	1	1	no	2.29351	1.15464	0.654272	6.82639
    5	gene2223	0	0.277117	1.79769e+308	#VALUE!	1.79769e+308	0.341546	1	no	0	0	0	0.660894
    6	gene2488	0	0	0	1	0	1	1	no	10.4025	11.889	8.49335	17.6908
    7	gene3116	0	2.19548	1.79769e+308	#VALUE!	1.79769e+308	0.270146	1	no	43.535	54.3667	31.5445	29.4442
    8	gene3375	0	8.83927	1.79769e+308	#VALUE!	1.79769e+308	0.156186	0.371228	no	0	0	8.83593	8.62098
    9	genet08	0	47943.9	1.79769e+308	#VALUE!	1.79769e+308	0.216948	0.441909	no	215745	289492	283647	224841
    10	genet15	0	0	0	1	0	1	1	no	0	32165.7	0	44968.3
    11	genet19	0	0	0	1	0	1	1	no	0	0	0	22484.1
    12	genet30	0	20919.6	1.79769e+308	#VALUE!	1.79769e+308	0.26143	0.486515	no	34511.2	109316	0	49890.9
    13	genet35	0	47943.9	1.79769e+308	#VALUE!	1.79769e+308	0.216948	0.441909	no	61641.3	0	81041.9	44968.3
    14	gene1061	0	4.64699	1.79769e+308	#VALUE!	1.79769e+308	0.0974253	1	no	0	0	239.346	308.699
    15	gene2057	0	0	0	1	0	1	1	no	10.5358	12.5268	7.81085	17.5835
    16	gene1219	279.782	126.536	-1.14476	0.452264919	4.02067	5.80E-05	0.000723236	yes	294.917	272.507	98.3558	163.011
    17	gene0454	281.243	122.751	-1.19608	0.436459592	4.50118	6.76E-06	0.000105364	yes	319.036	254.564	112.823	133.568
    18	gene0696	281.492	119.564	-1.23531	0.424751221	4.65254	3.28E-06	5.51E-05	yes	299.484	271.801	101.908	141.367
    19	gene1908	295.083	175.718	-0.747857	0.595487447	2.83873	0.0045293	0.0283197	yes	293.55	302.417	149.607	207.991
    20	gene3330	298.322	59.7793	-2.31915	0.200385497	8.05916	6.66E-16	4.98E-14	yes	316.381	288.917	47.3887	75.7079
    21	gene2943	304.366	157.484	-0.950603	0.517416153	3.27473	0.00105763	0.00865789	yes	286.907	325.549	166.413	140.802
    22	gene1286a	304.854	1411.33	2.21087	4.629543685	-6.29067	3.16E-10	1.26E-08	yes	368.954	256.093	1544.83	1187.39
    23	gene2931	307.767	74.4498	-2.0475	0.241902905	6.94379	3.82E-12	1.86E-10	yes	398.538	236.189	80.5042	63.9738
    24	gene2526	310.296	73.1319	-2.08507	0.235684698	5.75042	8.90E-09	2.70E-07	yes	293.458	331.069	70.5113	74.949
    25	gene3402	323.668	104.076	-1.63688	0.321551116	5.94465	2.77E-09	9.15E-08	yes	318.393	334.796	101.337	105.236
    26	gene0394	324.037	294.276	-0.138986	0.90815723	0.301882	0.762742	0.881605	no	335.479	320.832	244.124	357.452
    27	gene2014	324.814	336.958	0.0529539	1.037386787	-0.190118	0.849217	0.931596	no	345.453	313.737	348.649	311.732
    28	gene0458	325.072	154.641	-1.07184	0.475711893	4.01458	5.96E-05	0.000732211	yes	364.987	297.467	128.933	186.893
    29	gene3221	326.916	193.231	-0.758593	0.591072498	2.64956	0.00805976	0.0447439	yes	315.978	342.966	138.708	265.134
    30	gene1964	330.623	151.484	-1.12602	0.458177971	3.99545	6.46E-05	0.000786006	yes	322.485	344.348	132.149	174.834
    For example in row no 9 (genet08), in cufflinks analysis FPKM in both sampleA replicates reported as high as 20000, but in cuffdiffs analysis the FPKM reported as 0. There are several rows in the table that show the same inconsistency between cuffdiffs and cufflinks analysis, I don't have the exact figure but i can estimate its around 5% of the total gene list have this kind of inconsistency.

    Below is the command that I used to run cufflinks and cuffdiffs

    cufflinks command (example for SampleA_r1)

    cufflinks -G combined.gtf SampleA_r1-accepted_hits.bam
    cuffdiff command

    cuffdiff -o SampleA-vs-SampleB-cufflinks cuffcmp.combined.gtf SampleA-r1-accepted_hits.bam,SampleA-r2-accepted_hits.bam SampleB-r1-accepted_hits.bam,SampleB-r2-accepted_hits.bam
    Hope someone can share their thought on this.

    Thanks in advance

    - kamal

  • #2
    I got the same problem.


    • #3
      Great!, I thought I was the only person who got this problem. Bad thing is, I haven't solved the problem yet, but at the moment I just use results from cufflinks run as I looking into differential expression analysis


      • #4
        No, please don't use the previous result.
        I believe the problem is already fixed in the new version. Try the new V2.02 version.

        there are also huge difference between V1.3 and V2.02, it is long story, but I belive v2.02 is indeed better.


        • #5
          I meant, the version 2.02 cuffdiff also output the FPKM of all the replicates, so there is no need to run the cufflinks anymore.


          • #6
            Hmm... I've tried rerun the same command using cufflink version 2.02 (previously version 2.01), however the problem still remain . At the moment I'm using the approach that I mentioned above

            On another note, I hope for future cufflinks version (cuffdiff run) they will report out replicate variation as well (Something like DESeq output reporting). Sometime I want to see how much is the gene expression variation between replicate.


            • #7
              Are you using different gene annotation gtf files in cuffdiff and cufflinks?
              combined.gtf and cuffcmp.combined.gtf


              • #8
                Yes, I'm using different gtf files as the cuffdiff required for me to combined the gtf files by using the cuffmerge command. But technically both gtf files (combined.gtf and cuffcmp.combined.gtf ) have the same information except additional columns in the cuffcmp.combined.gtf added by cuffmerge.

                I don't think my way of running the program is the problem, as we have already validate some of the FPKM data, seem that the results are very good. The problem here is only some of the genes which have inconsistent FPKM output from cuffdiff and cufflinks. From my calculation about 50 genes (~1% of total genes) have this kind inconsistency. I attached the simplified version of table I posted on my 1st post (ahh.. should have done it on the 1st place).

                As an example in row no 9 (genet08), in cufflinks analysis FPKM in both sampleA replicates reported as high as 20000, but in cuffdiffs analysis the FPKM reported as 0
                Attached Files


                • #9
                  Here is my experience, hope it may help a little bit.
                  The version 1.3 cuffdiff only give FPKM of a condition, but not the FPKMs of individual replicates. I realized that some condition FPKMs from cuffdiff are significantly different from FPKMs of the biological replicates from cufflinks.

                  The version 2.02 cuffdiff will output both condition FPKM and replicate FPKMs. For me, the problem is solved, since I don't need to use cufflinks to get the replicate FPKM any more.

                  I am not sure whether you tried the same gtf files in cufflinks and cuffdiff. Do you think you will get the same FPKM if you do that way?


                  • #10
                    sqcrft, the FPKM of each replicate that cuffdiff produces (this is the file) does not even match the FPKM that cufflinks produces. And the FPKM of the file that cuffdiff produces does not remotely come close to the "Values" that the gene_expr.diff files produce.

                    I am honestly baffled by all of this. Here is an example for gene GPR89C.

                    My Cufflinks files produce the following FPKM values for
                    Sample 1: 16,16,12 and for Sample 2: 16,21,15.

                    My Cuffdiff produces the genes.readgroup.tracking file that sqcrft is referring to and it produces the following FPKM values for
                    Sample 1: .2,.1,.08 and for Sample 2: .16,.19,.19

                    Lastly, Cuffdiff produces the gene.expr.diff file, the one that groups the values by condition and the one most people are interested in, produces the following values:
                    Sample 1: 0.12 and Sample 2: 3.44

                    I haven't checked other genes yet, but I will. I'm thinking the best thing might be to just use the output files from Cufflinks and use that for analysis.


                    • #11

                      Apology for late reply due to busy with other project.

                      I've tried run according to your suggestion, but the problem still there. No luck I think. Well I think at the moment I just focus on 99% genes which shown to be ok (so far).


                      Looks like we are in the same boat. Do tell me if you find solution for this prob
                      Last edited by masterpiece; 09-06-2012, 03:26 AM.


                      • #12
                        My solution was switching to DESeq. My two cents are that its much preferable to my needs. Very simple to follow, very logical, and the authors are very active on the forum. Additionally, it has a lot of quality control metrics that I did not find with Cufflinks/CummeRbund, plus Cufflinks really messed me up by having a new version that gives such drastically different results. Again, that's just my two cents.


                        • #13
                          I have had similar experience with Cuffdiff and have as well switched to DESeq as Bill Stevens did. Got completely different results compared to Cuffdiff which makes it pretty much impossible for me to say which method produces the most correct results.
                          Last edited by glados; 09-12-2012, 03:31 AM.


                          • #14
                            We've been seeing this for quite a while now. There is another thread about this here:

                            There was some claim that using 2.0.2 fixed this, although I havn't checked this yet.
                            There was also a dicussion on if it was due to unrealsable high FPKMs on short genes for which there is an explaination, although we see it in all genes.

                            The FPKMs from CuffDiff are much more believable than those from Cufflinks if you ask me. Our pipelines run but DESeq and CuffDiff, and I generally use the CuffDiff quantitiations for everything except differentially expression, for that we choose a CuffDIff, edgeR or DESeq based on the problem, often we use two or three methods and look at the overlap. There is generally a pretty good corrolation in the fold changes between CuffDiff and the count based methods, but quite a big difference in the p-values.

                            Of course if you are interested in doing more complex GLM type designs, you'll need to use something other than CuffDIff. On the other hand, if you are working in a system where you think differential splicing is going to be important, then CuffDiff is pretty much your only choice.


                            • #15
                              Bill Steven,

                              Yup, I did tried using the DESeq for analysis and I prefer the way it report the results compared to cuffdiff. They report out "variation between replicates", which is not provided in cuffdiff. IMHO as molecular biologist, I think the stats is important for me to know how different is the expression between replicates,

                              But as mention by Sudders, for my coming project I may need to look into differential splicing, for this I have to go with cuffdiff. I think there is no such things as one software suits all.


                              I've tried to compared both software. Yes, the results are different because both using different algorithm. The way DESeq treat the number of read count is also different. DESeq only count reads that only uniquely mapped, Cufflink/cuffdiff they have their own way to count reads that are not uniquely mapped.

                              From my comparison test, they maybe different in term of FPKM/RPKM, but most of them have same profiles. So I think both of them still good. To be sure, you need to validate them with lab work.


                              Unfortunately the latest version still didn't solve the problem. At first I'm using 1.4.1, but even after I update to 2.01 and 2.02, the problem is still there.

                              I wish I don't have to use several software to get one job done because I hate when people questioning me "So tell me, which one gives the best result and explain me why (answer not more than 50 words and use only layman term)?"
                              Last edited by masterpiece; 09-22-2012, 02:55 PM.