Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat, Cufflinks and replicates

    Hi all,

    I've analysed a series of 2 pairs of different biological replicates with 3 programs, Cuffdiff, DESeq and edgeR. At the moment I'm just looking at gene-level expression.

    One would generally expect that in an ideal world there would be no significant differential expression between replicate sample and that most packages would overlap significantly in their predictions. For DESeq and edgeR this is indeed the case. Yet Cuffdiff seems to stand out from the crowd.


    Let's call the replicates A1, A2 and B1, B2. So below we're looking at how many predictions agree when each pair of replicates is compared using (in an 'ideal' world 100% would agree when using the same package).


    Cufflinks vs DESeq:

    A1-A2 62% (4262/6931) predictions agree
    B1-B2 54% (3712/6931) predictions agree

    DESeq vs edgeR:

    A1-A2 88% (6360/7220) predictions agree
    B1-B2: 82% (5983/7220) predictions agree

    Cufflinks vs edgeR:

    A1-A2: 59% (5921/9987) predictions agree
    B1-B2: 48% (4753/9987) predictions agree



    Reads were obtained from a single-end Illumina 76bp run and mapped with Tophat. DESeq and edgeR require raw count values so these were extracted from the Tophat SAM file with the HTSeq program and a well annotated GFF3 file from the broad institute. Cufflinks was run on each of the biological replicates and then cuffcompare and cuffdiff on the resulting GTF files.


    Has anyone seen this sort of discrepancy before and (more to the point) am I doing something criminally wrong when performing the Cuffdiff analysis?

    Any help would be much appreciated!

  • #2
    Tophat, Cufflinks and replicates

    Hi,
    Yes I have done a similar comparison to you. However, I found generally a good agreement (I did not quantify) between edgeR and cuffdiff outputs. I had 3 bilogical replicates(A, B, C) and two treatemts (1 and 2). I have combined all the reads from all the libraries in Tophat and got a gtf file with cufflinks. I used this gtf to obtain read read counts with HTSeq and I used these read count in edgeR.
    Similarly I used the same gtf file and individual sam file from each library in the cuffdiff. As there is no option for using biological replicates in cuffdiff, I compared the expression between two treatements separately for each biological replication (A1 vs A2). I observed good correspondance between the replicates i.e., if a gene is differentially expressed between A1 and A2, same gene is generally differentially expressed between other replicates (B1 and B2; C1 and C2) as well . However, I did not get any isoform expression differences which is expected as I combined the reads from all libraries when I used Tophat. When I ran the edgeR, I treated the read counts from each treatment as bilogical replicates.
    Generally, most of the DE genes with egdgeR were also significant with cuff diff. Overall I found a good agreement between the outputs from edgeR and cuffdiff. I have combined all the reads to get a good annotation as the organism that I am working on is not annotated yet. Could this be the reason i.e., using aggregate gtf rather than stdout.combined.gtf for the inconsistancy you see between different programs?

    Comment


    • #3
      Hi konrad98,

      I've done a fair bit of messing around with DESeq and I'm still trying to get my cufflinks analysis "perfected" (I'm about to post about my problems...)

      But I think overall, the approaches are so different that it is not surprising that you are getting different results. DESeq and edgeR on the other hand are fairly similar, in fact I think DESeq is based in part on the approach used in edgeR.

      One thing that worried me early on is that cuffdiff was sometimes finding significant differences for transcripts of very low read counts (for example 1 in one condition, and 0 in the other (using direct read counting with HT-Seq, for example)).

      Anyway, the point is that I'm curious what others say, but my experience is similar. Am I wrong to be unsettled about sig differences at such low read counts? Disclaimer: I am still trying to make sure I have not messed anything up on my end, and the cuffdiff results I describe were with an earlier version of the software....I'll post followup if I have any new insights...

      Comment


      • #4
        Hi all,

        Many thanks for your speedy response!

        I have supplied cuffdiff with serval GTF files for comparison:

        1. The output from cuffcompare (i.e. the stdout.combined.gtf file) when the transcript.gtf files from both replicate transcript.gtf files are supplied along with an mRNA reference GTF file.

        2. The transcript.gtf file from a single replicate sample when cufflinks is run

        3. The original reference GTF file from the Broad institute (just to see what happens).


        All give a large number of genes as being significantly differentially expressed.

        I am convinced that I must be doing something wrong when I run Cufflinks if you are not seeing the same problem Balat.

        Chris - I agree the significant hits when very low read counts are reported seem implausible. I usually filter these out as being due to sampling errors.

        Comment


        • #5
          Just curious - Is filtering low-coverage transcripts something most people are doing?

          Comment


          • #6
            I can't say if its common or not, but its only a problem I encounter with Cufflinks. When using DESeq and edgeR it won't classify such genes as being differentially expressed so I don't perform any filtering there.

            Comment


            • #7
              I've never directly compared Cufflinks/Cuffdiff output to DESeq, but have you extracted the FPKMs for genes in cufflinks and the read counts from your direct alignment from HTSeq?

              It could be interesting to see the difference in counts/FPKMs between replicates in the two methods.

              Comment


              • #8
                Hi xinchen,

                The htseq counts and the FPKM counts have reasonably good agreement. (R^2=0.9), so I don't think this would explain the more than a fraction of the discrepancy.

                Cole - if you have any thoughts, I'd be grateful. No doubt you're busy preparing the next release anyhow!

                Comment


                • #9
                  Originally posted by konrad98 View Post
                  Hi all,

                  I've analysed a series of 2 pairs of different biological replicates with 3 programs, Cuffdiff, DESeq and edgeR. At the moment I'm just looking at gene-level expression.

                  One would generally expect that in an ideal world there would be no significant differential expression between replicate sample and that most packages would overlap significantly in their predictions. For DESeq and edgeR this is indeed the case. Yet Cuffdiff seems to stand out from the crowd.


                  Let's call the replicates A1, A2 and B1, B2. So below we're looking at how many predictions agree when each pair of replicates is compared using (in an 'ideal' world 100% would agree when using the same package).


                  Cufflinks vs DESeq:

                  A1-A2 62% (4262/6931) predictions agree
                  B1-B2 54% (3712/6931) predictions agree

                  DESeq vs edgeR:

                  A1-A2 88% (6360/7220) predictions agree
                  B1-B2: 82% (5983/7220) predictions agree

                  Cufflinks vs edgeR:

                  A1-A2: 59% (5921/9987) predictions agree
                  B1-B2: 48% (4753/9987) predictions agree



                  Reads were obtained from a single-end Illumina 76bp run and mapped with Tophat. DESeq and edgeR require raw count values so these were extracted from the Tophat SAM file with the HTSeq program and a well annotated GFF3 file from the broad institute. Cufflinks was run on each of the biological replicates and then cuffcompare and cuffdiff on the resulting GTF files.


                  Has anyone seen this sort of discrepancy before and (more to the point) am I doing something criminally wrong when performing the Cuffdiff analysis?

                  Any help would be much appreciated!
                  This is an extremely interesting finding.For the genes that only cufflinks detects as differentially expressed only, have you visually looked to see if these are differences in isoforms (splicing or promoter regions), I would also suggest lowering the default FDR. EdgeR and DeSeq are taking the raw count of reads while cufflinks is telling you isoform differences. I would suggest you visually pick a few genes and look at your reads and the cufflinks output to see if indeed there is a difference and possibly post it here.

                  On another note, I am going to try running EdgeR and DeSeq and was wondering what parameters you used for running HTSeq? I have mouse 75 bs SE illumina reads. Unfortunately, when I looked at dot plot of FPKM between my replicates (on the log scale), I detected a problem in the experiment/sequencing that the high coverage regions display too much variation. The treated replicates don't exhibit this problem. I would like to see if HTSEQ detects this as well.

                  Comment


                  • #10
                    Originally posted by Balat View Post
                    Hi,
                    As there is no option for using biological replicates in cuffdiff, I compared the expression between two treatements separately for each biological replication (A1 vs A2). I observed good correspondance between the replicates i.e., if a gene is differentially expressed between A1 and A2, same gene is generally differentially expressed between other replicates (B1 and B2; C1 and C2) as well . However, I did not get any isoform expression differences which is expected as I combined the reads from all libraries when I used Tophat.
                    Hi Balat, I am a little confused and perhaps you can clarify. You say that you did not get any isoform differences as you combined the reads from all the libraries, but I thought you separately analyzed A1 vs A2 and B1 vs B2 and noted that the same gene is differentially expressed in those comparisons. Can you please list exactly what you did?

                    Comment


                    • #11
                      Hi thinkRNA,

                      In cufflinks I am using a reference annotation which contains only a few splice variants (no more than 100). Of those that are there, there doesn't seem to be any relationship between isoforms and this feature. Also I am looking at the gene level expression (genes.expr) rather than the transcripts. I'll see if I can get some pretty pictures to post.

                      I used the stranded option to specify non-stranded data in HTSeq. All other parameters were left as default.

                      Comment


                      • #12
                        Hi thinkRNA,
                        I have combined all my reads at tophat stage got one sam which I used in cufflinks and got one gtf file. Then I aligned each sample reads and got separate sam file for each sample. I used the individual sample sam file and the combined gtf file in cuffdiff for testing differential expression between different samples (A1 vs A2). As I have used the combined gtf file from cufflinks rather than stdout.combined gtf file from cuffcompare, cuffdiff reported only gene level expression differences. I think cuffdiff needs stdout.combined gtf file from cuffcompare to report isoform differences.

                        Comment


                        • #13
                          Originally posted by Balat View Post
                          Hi thinkRNA,
                          I have combined all my reads at tophat stage got one sam which I used in cufflinks and got one gtf file. Then I aligned each sample reads and got separate sam file for each sample. I used the individual sample sam file and the combined gtf file in cuffdiff for testing differential expression between different samples (A1 vs A2). As I have used the combined gtf file from cufflinks rather than stdout.combined gtf file from cuffcompare, cuffdiff reported only gene level expression differences. I think cuffdiff needs stdout.combined gtf file from cuffcompare to report isoform differences.
                          Hi, Balat,

                          Thx for sharing your experience. That really helps. Actually I used a very similar protocol as you described, that I aligned all the reads from 7 lanes together using cufflinks, in order to build a comprehensive expressed transcriptome. Then, I combined the cufflinks transcripts with UCSC RefSeq tracks, namely combined.gtf.

                          After that, I used this combined.gtf in cuffdiff, edgeR, and DESeq. And, I noticed a similar percentage of overlapping differentially expressed genes as in the first thread. More than 85% of differentially expressed genes from edgeR and DEseq overlap. But for cuffdiff, if I used the significant genes called by cuffdiff, there is only 10% overlap between cuffdiff and edgeR. However, if I just compare p.values, e.g. in cuffdiff p.value<0.01 and in edgeR p.value<0.01, the overlap is around 60-70%. So, I wonder what threshold you used in cuffdiff to call the differentially expressed genes?

                          I also noticed cuffdiff is not very consistent, especially for genes expressed at low levels, and for genes expressed highly in one sample but not in the other sample. I hope this may be improved in the next version of cuffdiff.

                          And, another suggestion (or just my own thought), the read counts actually follow Poisson distribution, but after it is log2 transformed, e.g. log2(counts +0.25), the counts will follow normal distribution. Can we just use the log2 counts in the microarray packages, e.g. limma? Or there are plenty of other suitable solutions, e.g. Between group analysis, ANOVA, Rank product .... Does anyone get successful result applying these microarray protocols?

                          Cheers,
                          Jun

                          Comment


                          • #14
                            Thanks for bringing this to our attention. We will be looking into these concerns with Cuffdiff this week.

                            Comment


                            • #15
                              I double checked the Cuffdiff code, and I didn't see any problems that would be leading to incorrect behavior. Therefore, I am confident that it is reporting the correct results according to our method. Perhaps this is an explanation for the differences:

                              Cufflinks calculates FPKM at the gene level by summing the transcript-level FPKMs. While it might seem at first that this is equivalent to calculating the expression of the gene based on its read counts, this is not the case.

                              Consider a gene of length 100 (sum of all exon lengths) that has 3 isoforms (A, B, and C) with lengths 25, 50, and 100, respectively.

                              If in sample 1, the gene has 100 reads mapping to it, the RPKM would be 1. However, at the isoform level, Cufflinks may find that 50 of these reads come from isoform A, 30 from B, and 20 from C. Therefore, the FPKM would be approximately (since we divide by effective length instead of absolute length) 50/25 + 30/50 + 20/100 = 2.8.

                              If in sample 2, the gene again has 100 reads, the RPKM would still be 1. However, imagine that isoform switching has taken place and now 10 of the reads come from A, 10 from B, and 80 from C. Then, the FPKM would be approximately 10/25 + 10/50 + 80/100 = 1.4.

                              According to Cufflinks, there is differential expression, but not according to a count-based differential expression approach. This is why it's really important to look at differential expression at the isoform level instead of the gene level--the transcript is really what is being expressed, not the gene.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X