Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • konrad98
    Member
    • Jan 2009
    • 17

    Tophat, Cufflinks and replicates

    Hi all,

    I've analysed a series of 2 pairs of different biological replicates with 3 programs, Cuffdiff, DESeq and edgeR. At the moment I'm just looking at gene-level expression.

    One would generally expect that in an ideal world there would be no significant differential expression between replicate sample and that most packages would overlap significantly in their predictions. For DESeq and edgeR this is indeed the case. Yet Cuffdiff seems to stand out from the crowd.


    Let's call the replicates A1, A2 and B1, B2. So below we're looking at how many predictions agree when each pair of replicates is compared using (in an 'ideal' world 100% would agree when using the same package).


    Cufflinks vs DESeq:

    A1-A2 62% (4262/6931) predictions agree
    B1-B2 54% (3712/6931) predictions agree

    DESeq vs edgeR:

    A1-A2 88% (6360/7220) predictions agree
    B1-B2: 82% (5983/7220) predictions agree

    Cufflinks vs edgeR:

    A1-A2: 59% (5921/9987) predictions agree
    B1-B2: 48% (4753/9987) predictions agree



    Reads were obtained from a single-end Illumina 76bp run and mapped with Tophat. DESeq and edgeR require raw count values so these were extracted from the Tophat SAM file with the HTSeq program and a well annotated GFF3 file from the broad institute. Cufflinks was run on each of the biological replicates and then cuffcompare and cuffdiff on the resulting GTF files.


    Has anyone seen this sort of discrepancy before and (more to the point) am I doing something criminally wrong when performing the Cuffdiff analysis?

    Any help would be much appreciated!
  • Balat
    Member
    • May 2010
    • 36

    #2
    Tophat, Cufflinks and replicates

    Hi,
    Yes I have done a similar comparison to you. However, I found generally a good agreement (I did not quantify) between edgeR and cuffdiff outputs. I had 3 bilogical replicates(A, B, C) and two treatemts (1 and 2). I have combined all the reads from all the libraries in Tophat and got a gtf file with cufflinks. I used this gtf to obtain read read counts with HTSeq and I used these read count in edgeR.
    Similarly I used the same gtf file and individual sam file from each library in the cuffdiff. As there is no option for using biological replicates in cuffdiff, I compared the expression between two treatements separately for each biological replication (A1 vs A2). I observed good correspondance between the replicates i.e., if a gene is differentially expressed between A1 and A2, same gene is generally differentially expressed between other replicates (B1 and B2; C1 and C2) as well . However, I did not get any isoform expression differences which is expected as I combined the reads from all libraries when I used Tophat. When I ran the edgeR, I treated the read counts from each treatment as bilogical replicates.
    Generally, most of the DE genes with egdgeR were also significant with cuff diff. Overall I found a good agreement between the outputs from edgeR and cuffdiff. I have combined all the reads to get a good annotation as the organism that I am working on is not annotated yet. Could this be the reason i.e., using aggregate gtf rather than stdout.combined.gtf for the inconsistancy you see between different programs?

    Comment

    • chrisbala
      Member
      • Jan 2010
      • 82

      #3
      Hi konrad98,

      I've done a fair bit of messing around with DESeq and I'm still trying to get my cufflinks analysis "perfected" (I'm about to post about my problems...)

      But I think overall, the approaches are so different that it is not surprising that you are getting different results. DESeq and edgeR on the other hand are fairly similar, in fact I think DESeq is based in part on the approach used in edgeR.

      One thing that worried me early on is that cuffdiff was sometimes finding significant differences for transcripts of very low read counts (for example 1 in one condition, and 0 in the other (using direct read counting with HT-Seq, for example)).

      Anyway, the point is that I'm curious what others say, but my experience is similar. Am I wrong to be unsettled about sig differences at such low read counts? Disclaimer: I am still trying to make sure I have not messed anything up on my end, and the cuffdiff results I describe were with an earlier version of the software....I'll post followup if I have any new insights...

      Comment

      • konrad98
        Member
        • Jan 2009
        • 17

        #4
        Hi all,

        Many thanks for your speedy response!

        I have supplied cuffdiff with serval GTF files for comparison:

        1. The output from cuffcompare (i.e. the stdout.combined.gtf file) when the transcript.gtf files from both replicate transcript.gtf files are supplied along with an mRNA reference GTF file.

        2. The transcript.gtf file from a single replicate sample when cufflinks is run

        3. The original reference GTF file from the Broad institute (just to see what happens).


        All give a large number of genes as being significantly differentially expressed.

        I am convinced that I must be doing something wrong when I run Cufflinks if you are not seeing the same problem Balat.

        Chris - I agree the significant hits when very low read counts are reported seem implausible. I usually filter these out as being due to sampling errors.

        Comment

        • chrisbala
          Member
          • Jan 2010
          • 82

          #5
          Just curious - Is filtering low-coverage transcripts something most people are doing?

          Comment

          • konrad98
            Member
            • Jan 2009
            • 17

            #6
            I can't say if its common or not, but its only a problem I encounter with Cufflinks. When using DESeq and edgeR it won't classify such genes as being differentially expressed so I don't perform any filtering there.

            Comment

            • xinchen
              Junior Member
              • May 2010
              • 6

              #7
              I've never directly compared Cufflinks/Cuffdiff output to DESeq, but have you extracted the FPKMs for genes in cufflinks and the read counts from your direct alignment from HTSeq?

              It could be interesting to see the difference in counts/FPKMs between replicates in the two methods.

              Comment

              • konrad98
                Member
                • Jan 2009
                • 17

                #8
                Hi xinchen,

                The htseq counts and the FPKM counts have reasonably good agreement. (R^2=0.9), so I don't think this would explain the more than a fraction of the discrepancy.

                Cole - if you have any thoughts, I'd be grateful. No doubt you're busy preparing the next release anyhow!

                Comment

                • thinkRNA
                  Member
                  • Jan 2010
                  • 94

                  #9
                  Originally posted by konrad98 View Post
                  Hi all,

                  I've analysed a series of 2 pairs of different biological replicates with 3 programs, Cuffdiff, DESeq and edgeR. At the moment I'm just looking at gene-level expression.

                  One would generally expect that in an ideal world there would be no significant differential expression between replicate sample and that most packages would overlap significantly in their predictions. For DESeq and edgeR this is indeed the case. Yet Cuffdiff seems to stand out from the crowd.


                  Let's call the replicates A1, A2 and B1, B2. So below we're looking at how many predictions agree when each pair of replicates is compared using (in an 'ideal' world 100% would agree when using the same package).


                  Cufflinks vs DESeq:

                  A1-A2 62% (4262/6931) predictions agree
                  B1-B2 54% (3712/6931) predictions agree

                  DESeq vs edgeR:

                  A1-A2 88% (6360/7220) predictions agree
                  B1-B2: 82% (5983/7220) predictions agree

                  Cufflinks vs edgeR:

                  A1-A2: 59% (5921/9987) predictions agree
                  B1-B2: 48% (4753/9987) predictions agree



                  Reads were obtained from a single-end Illumina 76bp run and mapped with Tophat. DESeq and edgeR require raw count values so these were extracted from the Tophat SAM file with the HTSeq program and a well annotated GFF3 file from the broad institute. Cufflinks was run on each of the biological replicates and then cuffcompare and cuffdiff on the resulting GTF files.


                  Has anyone seen this sort of discrepancy before and (more to the point) am I doing something criminally wrong when performing the Cuffdiff analysis?

                  Any help would be much appreciated!
                  This is an extremely interesting finding.For the genes that only cufflinks detects as differentially expressed only, have you visually looked to see if these are differences in isoforms (splicing or promoter regions), I would also suggest lowering the default FDR. EdgeR and DeSeq are taking the raw count of reads while cufflinks is telling you isoform differences. I would suggest you visually pick a few genes and look at your reads and the cufflinks output to see if indeed there is a difference and possibly post it here.

                  On another note, I am going to try running EdgeR and DeSeq and was wondering what parameters you used for running HTSeq? I have mouse 75 bs SE illumina reads. Unfortunately, when I looked at dot plot of FPKM between my replicates (on the log scale), I detected a problem in the experiment/sequencing that the high coverage regions display too much variation. The treated replicates don't exhibit this problem. I would like to see if HTSEQ detects this as well.

                  Comment

                  • thinkRNA
                    Member
                    • Jan 2010
                    • 94

                    #10
                    Originally posted by Balat View Post
                    Hi,
                    As there is no option for using biological replicates in cuffdiff, I compared the expression between two treatements separately for each biological replication (A1 vs A2). I observed good correspondance between the replicates i.e., if a gene is differentially expressed between A1 and A2, same gene is generally differentially expressed between other replicates (B1 and B2; C1 and C2) as well . However, I did not get any isoform expression differences which is expected as I combined the reads from all libraries when I used Tophat.
                    Hi Balat, I am a little confused and perhaps you can clarify. You say that you did not get any isoform differences as you combined the reads from all the libraries, but I thought you separately analyzed A1 vs A2 and B1 vs B2 and noted that the same gene is differentially expressed in those comparisons. Can you please list exactly what you did?

                    Comment

                    • konrad98
                      Member
                      • Jan 2009
                      • 17

                      #11
                      Hi thinkRNA,

                      In cufflinks I am using a reference annotation which contains only a few splice variants (no more than 100). Of those that are there, there doesn't seem to be any relationship between isoforms and this feature. Also I am looking at the gene level expression (genes.expr) rather than the transcripts. I'll see if I can get some pretty pictures to post.

                      I used the stranded option to specify non-stranded data in HTSeq. All other parameters were left as default.

                      Comment

                      • Balat
                        Member
                        • May 2010
                        • 36

                        #12
                        Hi thinkRNA,
                        I have combined all my reads at tophat stage got one sam which I used in cufflinks and got one gtf file. Then I aligned each sample reads and got separate sam file for each sample. I used the individual sample sam file and the combined gtf file in cuffdiff for testing differential expression between different samples (A1 vs A2). As I have used the combined gtf file from cufflinks rather than stdout.combined gtf file from cuffcompare, cuffdiff reported only gene level expression differences. I think cuffdiff needs stdout.combined gtf file from cuffcompare to report isoform differences.

                        Comment

                        • nkwuji
                          Member
                          • Mar 2010
                          • 19

                          #13
                          Originally posted by Balat View Post
                          Hi thinkRNA,
                          I have combined all my reads at tophat stage got one sam which I used in cufflinks and got one gtf file. Then I aligned each sample reads and got separate sam file for each sample. I used the individual sample sam file and the combined gtf file in cuffdiff for testing differential expression between different samples (A1 vs A2). As I have used the combined gtf file from cufflinks rather than stdout.combined gtf file from cuffcompare, cuffdiff reported only gene level expression differences. I think cuffdiff needs stdout.combined gtf file from cuffcompare to report isoform differences.
                          Hi, Balat,

                          Thx for sharing your experience. That really helps. Actually I used a very similar protocol as you described, that I aligned all the reads from 7 lanes together using cufflinks, in order to build a comprehensive expressed transcriptome. Then, I combined the cufflinks transcripts with UCSC RefSeq tracks, namely combined.gtf.

                          After that, I used this combined.gtf in cuffdiff, edgeR, and DESeq. And, I noticed a similar percentage of overlapping differentially expressed genes as in the first thread. More than 85% of differentially expressed genes from edgeR and DEseq overlap. But for cuffdiff, if I used the significant genes called by cuffdiff, there is only 10% overlap between cuffdiff and edgeR. However, if I just compare p.values, e.g. in cuffdiff p.value<0.01 and in edgeR p.value<0.01, the overlap is around 60-70%. So, I wonder what threshold you used in cuffdiff to call the differentially expressed genes?

                          I also noticed cuffdiff is not very consistent, especially for genes expressed at low levels, and for genes expressed highly in one sample but not in the other sample. I hope this may be improved in the next version of cuffdiff.

                          And, another suggestion (or just my own thought), the read counts actually follow Poisson distribution, but after it is log2 transformed, e.g. log2(counts +0.25), the counts will follow normal distribution. Can we just use the log2 counts in the microarray packages, e.g. limma? Or there are plenty of other suitable solutions, e.g. Between group analysis, ANOVA, Rank product .... Does anyone get successful result applying these microarray protocols?

                          Cheers,
                          Jun

                          Comment

                          • adarob
                            Member
                            • Jul 2010
                            • 71

                            #14
                            Thanks for bringing this to our attention. We will be looking into these concerns with Cuffdiff this week.

                            Comment

                            • adarob
                              Member
                              • Jul 2010
                              • 71

                              #15
                              I double checked the Cuffdiff code, and I didn't see any problems that would be leading to incorrect behavior. Therefore, I am confident that it is reporting the correct results according to our method. Perhaps this is an explanation for the differences:

                              Cufflinks calculates FPKM at the gene level by summing the transcript-level FPKMs. While it might seem at first that this is equivalent to calculating the expression of the gene based on its read counts, this is not the case.

                              Consider a gene of length 100 (sum of all exon lengths) that has 3 isoforms (A, B, and C) with lengths 25, 50, and 100, respectively.

                              If in sample 1, the gene has 100 reads mapping to it, the RPKM would be 1. However, at the isoform level, Cufflinks may find that 50 of these reads come from isoform A, 30 from B, and 20 from C. Therefore, the FPKM would be approximately (since we divide by effective length instead of absolute length) 50/25 + 30/50 + 20/100 = 2.8.

                              If in sample 2, the gene again has 100 reads, the RPKM would still be 1. However, imagine that isoform switching has taken place and now 10 of the reads come from A, 10 from B, and 80 from C. Then, the FPKM would be approximately 10/25 + 10/50 + 80/100 = 1.4.

                              According to Cufflinks, there is differential expression, but not according to a count-based differential expression approach. This is why it's really important to look at differential expression at the isoform level instead of the gene level--the transcript is really what is being expressed, not the gene.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 08:59 AM
                              0 responses
                              4 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...