Differential expression, splicing, and promoter use with Cufflinks

  • Differential expression, splicing, and promoter use with Cufflinks

    We are happy to announce a major update to Cufflinks that introduces some powerful new features and includes a number of performance improvements and bug fixes. Highlights include:
    • Cufflinks now includes a new tool, "Cuffdiff", which performs testing for differential expression, splicing, promoter use, and coding sequence output on two or more RNA-Seq samples. See the greatly expanded manual for details.
    • Cuffcompare now reports a file containing the "union" of all transfrags in the files you give it as input, greatly simplifying downstream validation of novel transcripts.
    • Cufflinks' assembler has been overhauled and optimized, resulting in a speedup of 4-5 times over version 0.7.0, and a greatly reduced memory footprint. Phasing of splicing events has also been improved.
    • Many bugfixes, including for a number of bugs reported by users in these forums.


    We hope you find the new functionality useful, and continue to report bugs and feature requests.

  • #2
    Thanks, Cole.

    I am not sure I fully understood from the manual how to supply a GTF annotation to Cuffdiff.
    First, is matching tss_id and p_id attributes required? If not, how does the program know which TSS corresponds to a transcript?
    Second, if the TSS of a transcript or primary transcript is unknown, the program will skip this transcript and won't look for differences in promoter use, right?
    Moreover, is it possible to infer the TSS from RNA-seq data?

    Many thanks.
    Xi Wang

    • #3
      ABI Solid?

      Hi Cole,

      Thanks for the new release. It looks really comprehensive, and I look forward to trying it on my Illumina datasets. Do you have any plans to include ABI Solid support for TopHat and Cufflinks, especially now that bowtie supports colourspace?

      Many thanks.

      • #4
        Originally posted by Xi Wang
        Thanks, Cole.

        I am not sure I fully understood from the manual how to supply a GTF annotation to Cuffdiff.
        First, is matching tss_id and p_id attributes required? If not, how does the program know which TSS corresponds to a transcript?
        Second, if the TSS of a transcript or primary transcript is unknown, the program will skip this transcript and won't look for differences in promoter use, right?
        Moreover, is it possible to infer the TSS from RNA-seq data?

        Many thanks.
        Without tss_id and p_id attributes, Cufflinks will simply test for differential expression of transcripts and genes. You can attach these attributes to your own GTF file, but for convenience, cuffcompare now outputs a single file containing the "union" of all the assembled transfrags you give it. So the basic workflow we recommend is:

        1) Assemble each sample with cufflinks
        2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
        3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.

        Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.

        As far as how cuffcompare assigns p_id and tss_id attributes:

        * p_id is assigned just using the CDS records in the reference GTF. If there are no CDS records, there will be no p_ids. Similarly, if you run cuffcompare without a reference annotation along with your sample assemblies, there will be no p_id attributes in stdout.combined.gtf
        * tss_id is assigned based on where the 5' ends of the transfrags are: two transcripts on the same strand that share bases have the same TSS iff their 5' ends start within 100bp of each other. This threshold was chosen based on our observation that depth of sequencing doesn't always reach the end of the true transcript on either end. You can change it with the -d option (which I just realized is not listed in the manual - I will update it).
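
        For illustration only, the tss_id rule described above can be sketched in Python roughly as follows. This is not cuffcompare's actual code. The Transcript class, the function names, and the TSS1/TSS2 labels are hypothetical, all transcripts are assumed to lie on the same chromosome, "within 100bp" is read as a distance of at most 100, and chaining of pairwise matches into one group is an assumption the post does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript record (same chromosome assumed)."""
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost base, inclusive
    end: int     # rightmost base, inclusive

    @property
    def five_prime(self) -> int:
        # The 5' end is the leftmost base on "+" and the rightmost on "-".
        return self.start if self.strand == "+" else self.end


def share_tss(a: Transcript, b: Transcript, max_dist: int = 100) -> bool:
    """Same strand, overlapping bases, and 5' ends at most max_dist apart."""
    same_strand = a.strand == b.strand
    overlap = a.start <= b.end and b.start <= a.end
    close = abs(a.five_prime - b.five_prime) <= max_dist
    return same_strand and overlap and close


def assign_tss_ids(transcripts: List[Transcript], max_dist: int = 100) -> Dict[str, str]:
    """Group transcripts that share a TSS and give each group a label.

    Assumption: pairwise matches are chained (union-find), so A~B and B~C
    put A, B and C in one group even if A and C are further apart.
    """
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if share_tss(transcripts[i], transcripts[j], max_dist):
                parent[find(j)] = find(i)

    labels: Dict[int, str] = {}
    result: Dict[str, str] = {}
    for i, t in enumerate(transcripts):
        root = find(i)
        if root not in labels:
            labels[root] = "TSS%d" % (len(labels) + 1)
        result[t.name] = labels[root]
    return result
```

        The max_dist parameter plays the role of the threshold that the -d option changes; passing a smaller value splits groups whose 5' ends are further apart.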

        All this is to say that if you're hoping to just use a reference GTF with cuffdiff, you'll need to add those p_id and tss_id attributes yourself. You can do this with cuffcompare too, using a little hack:

        cuffcompare -r reference.gtf reference.gtf reference.gtf

        This will spit out a version of reference.gtf in stdout.combined.gtf that has the p_id and tss_id attributes attached.
        Last edited by Cole Trapnell; 02-09-2010, 09:59 PM.

        • #5
          Originally posted by chapmandu2
          Hi Cole,

          Thanks for the new release. It looks really comprehensive, and I look forward to trying it on my Illumina datasets. Do you have any plans to include ABI Solid support for TopHat and Cufflinks, especially now that bowtie supports colourspace?

          Many thanks.
          Cufflinks should *in theory* already support Colorspace, since it takes SAM input, and doesn't call expressed SNPs by itself (yet). TopHat will hopefully support Colorspace sometime this spring. I've got a number of other features in TopHat and Cufflinks I need to get to, and I have to finish my thesis and graduate - so I can't give a timeline. However, it's an often requested feature, so I'd like to add support.

          • #6
            Hi Cole,

            Thanks for the new release! I've been trying to use cuffdiff as described above. It runs for a while and then terminates as follows:

            Importance sampling posterior distribution
            isoform TCONS_00000803 has no p_id, no CDS grouping analysis available here
            Quantitating samples in locus [ chr1:152014391-152019257 ]
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here
            terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
            what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
            Aborted





            I don't think this is caused by the missing p_id, since the same message appears earlier in the run without terminating the program. However, I've tried using cuffdiff on cuffcompare stdout.combined.gtf files that were derived with UCSC annotation AND Ensembl annotation, and both runs terminate after a similar message (isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here).

            Would you know why this is happening?

            Regards,

            Karen

            • #7
              One more thing, on a slightly separate issue. According to the manual, the cuffcompare stdout.tracking output should contain:

              Each of the columns after the fifth have the following format:
              qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>


              However, I have four numerical columns after the <FMI>, not three. What does the fourth one relate to?

              Example:
              q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084

              Thanks,

              Karen

              • #8
                Originally posted by Kasycas
                Hi Cole,

                Thanks for the new release! I've been trying to use cuffdiff as described above. It runs for a while and then terminates as follows:

                Importance sampling posterior distribution
                isoform TCONS_00000803 has no p_id, no CDS grouping analysis available here
                Quantitating samples in locus [ chr1:152014391-152019257 ]
                Calculating intial MLE
                Tossing likely garbage isoforms
                Revising MLE
                Importance sampling posterior distribution
                Calculating intial MLE
                Tossing likely garbage isoforms
                Revising MLE
                Importance sampling posterior distribution
                Calculating intial MLE
                Tossing likely garbage isoforms
                Revising MLE
                Importance sampling posterior distribution
                Calculating intial MLE
                Tossing likely garbage isoforms
                Revising MLE
                Importance sampling posterior distribution
                isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here
                terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
                what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
                Aborted





                I don't think this is caused by the missing p_id, since the same message appears earlier in the run without terminating the program. However, I've tried using cuffdiff on cuffcompare stdout.combined.gtf files that were derived with UCSC annotation AND Ensembl annotation, and both runs terminate after a similar message (isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here).

                Would you know why this is happening?

                Regards,

                Karen
                Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.

                • #9
                  Originally posted by Kasycas
                  One more thing, on a slightly separate issue. According to the manual, the cuffcompare stdout.tracking output should contain:

                  Each of the columns after the fifth have the following format:
                  qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>


                  However, I have four numerical columns after the <FMI>, not three. What does the fourth one relate to?

                  Example:
                  q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084

                  Thanks,

                  Karen
                  The last column is the estimated depth of read coverage for that transfrag. Apologies - I will update the manual.
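
                  As a rough sketch of how such an entry could be read with the extra coverage field in mind, here is a small Python parser. The TrackingEntry type and parse_tracking_entry function are hypothetical names, not part of Cufflinks, and the layout is assumed to be exactly the one quoted above plus the trailing coverage value; entries with only the six documented fields are accepted too.

```python
from typing import NamedTuple, Optional


class TrackingEntry(NamedTuple):
    """One per-sample column of a cuffcompare tracking line (hypothetical type)."""
    sample: str
    gene_id: str
    transcript_id: str
    fmi: int
    fpkm: float
    conf_lo: float
    conf_hi: float
    coverage: Optional[float]  # estimated depth of read coverage, if present


def parse_tracking_entry(field: str) -> TrackingEntry:
    """Parse 'qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>[|<cov>]'."""
    sample, sep, rest = field.partition(":")
    if not sep:
        raise ValueError("missing sample prefix in %r" % field)
    parts = rest.split("|")
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 '|'-separated fields in %r" % field)
    gene_id, transcript_id, fmi, fpkm, conf_lo, conf_hi = parts[:6]
    # The seventh field is the undocumented coverage estimate.
    coverage = float(parts[6]) if len(parts) == 7 else None
    return TrackingEntry(sample, gene_id, transcript_id, int(fmi),
                         float(fpkm), float(conf_lo), float(conf_hi), coverage)


entry = parse_tracking_entry(
    "q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084"
)
print(entry.transcript_id, entry.fpkm, entry.coverage)
```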

                  • #10
                    Originally posted by Cole Trapnell
                    Without tss_id and p_id attributes, Cufflinks will simply test for differential expression of transcripts and genes. You can attach these attributes to your own GTF file, but for convenience, cuffcompare now outputs a single file containing the "union" of all the assembled transfrags you give it. So the basic workflow we recommend is:

                    1) Assemble each sample with cufflinks
                    2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
                    3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.

                    Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.

                    As far as how cuffcompare assigns p_id and tss_id attributes:

                    * p_id is assigned just using the CDS records in the reference GTF. If there are no CDS records, there will be no p_ids. Similarly, if you run cuffcompare without a reference annotation along with your sample assemblies, there will be no p_id attributes in stdout.combined.gtf
                    * tss_id is assigned based on where the 5' ends of the transfrags are: two transcripts on the same strand that share bases have the same TSS iff their 5' ends start within 100bp of each other. This threshold was chosen based on our observation that depth of sequencing doesn't always reach the end of the true transcript on either end. You can change it with the -d option (which I just realized is not listed in the manual - I will update it).

                    All this is to say that if you're hoping to just use a reference GTF with cuffdiff, you'll need to add those p_id and tss_id attributes yourself. You can do this with cuffcompare too, using a little hack:

                    cuffcompare -r reference.gtf reference.gtf reference.gtf

                    This will spit out a version of reference.gtf in stdout.combined.gtf that has the p_id and tss_id attributes attached.
                    Thanks for the info on the reference gtf. I downloaded both fasta and gtf from ensembl and ran into the chr problem. However, now when I run cuffcompare on the reference gtf I get tss_ids but no p_ids, even though the original gtf has CDS information.

                    I also got the following warnings when running cuffcompare on the cufflinks output and the fixed gtf file. I guess they have something to do with the cufflinks gtf files, since there are two of them:

                    Warning: found 26695 transcripts with undetermined strand.
                    Warning: found 44851 transcripts with undetermined strand.

                    Cuffcompare then exits.

                    Any help on moving forward with cufflinks will be greatly appreciated.

                    Cheers,
                    Lesley

                    • #11
                      Error messages

                      already reported ...
                      Last edited by seqfast; 03-03-2010, 07:42 AM.

                      • #12
                        cuffdiff considers only X, Y, and MT loci

                        Hi,

                        I ran tophat using the h_sapiens_37_asm index and converted the chromosome accessions in the accepted_hits.sam file to their corresponding number/letter (1,2,X,Y,MT). I wanted the chromosome notation to match the chromosome notation in the ensembl gtf file (Homo_sapiens.GRCh37.56.gtf). Next I ran cufflinks on each sample using the converted sam file output by tophat. Then I ran cuffcompare using the transcripts.gtf files from each sample (output by cufflinks) along with my reference gtf above. Finally, I fed the converted sam files and combined.gtf file into cuffdiff. Cuffdiff runs without error; however, it only considers loci on the X, Y and MT chromosomes. Has anyone else experienced this error?

                        Thank you in advance for any advice.

                        • #13
                          Originally posted by jebe
                          Hi,

                          I ran tophat using the h_sapiens_37_asm index and converted the chromosome accessions in the accepted_hits.sam file to their corresponding number/letter (1,2,X,Y,MT). I wanted the chromosome notation to match the chromosome notation in the ensembl gtf file (Homo_sapiens.GRCh37.56.gtf). Next I ran cufflinks on each sample using the converted sam file output by tophat. Then I ran cuffcompare using the transcripts.gtf files from each sample (output by cufflinks) along with my reference gtf above. Finally, I fed the converted sam files and combined.gtf file into cuffdiff. Cuffdiff runs without error; however, it only considers loci on the X, Y and MT chromosomes. Has anyone else experienced this error?

                          Thank you in advance for any advice.
                          Did you try converting the chromosome notation in the ensembl gtf to chr1, chr2, ... chrX, chrY, and chrM? I think converting in this direction is much better.
                          Xi Wang
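
                          One way to do the conversion suggested above is a short Python script along these lines. It is a minimal sketch: the function names are made up for illustration, it assumes a tab-separated GTF whose first column is the sequence name, and it only renames the primary human chromosomes (1-22, X, Y, and MT to chrM), leaving any other sequence names untouched.

```python
# Ensembl-style names of the primary human chromosomes, excluding MT.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def to_chr_name(seqname: str) -> str:
    """Map an Ensembl-style name (1, 2, X, MT) to chr1, chr2, chrX, chrM."""
    if seqname in PRIMARY:
        return "chr" + seqname
    if seqname == "MT":
        return "chrM"
    # Scaffolds, patches, and already-prefixed names are left as they are.
    return seqname


def convert_gtf_line(line: str) -> str:
    """Rewrite the first (seqname) column of one GTF line."""
    if line.startswith("#") or not line.strip():
        return line  # keep comment and blank lines unchanged
    seqname, sep, rest = line.partition("\t")
    return to_chr_name(seqname) + sep + rest


def convert_gtf(in_path: str, out_path: str) -> None:
    """Stream a GTF file through convert_gtf_line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(convert_gtf_line(line))
```

                          Calling convert_gtf("Homo_sapiens.GRCh37.56.gtf", "converted.gtf") would write the renamed annotation to converted.gtf (the output name is arbitrary). The alignments then need to use the same chr-prefixed names.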

                          • #14
                            This may be a naive question, as I'm only about to start using Cufflinks (Bowtie and Tophat seem great though), but I have not been able to find any documentation about differential expression analysis when groups of samples are involved. Can you, and if so how do you, specify that certain samples are replicates, so that they are treated as a group when running differential expression analysis?

                            • #15
                              Originally posted by Cole Trapnell
                              Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.
                              Cole,

                              First, thanks for an excellent software stack.

                              Was the release you are referring to > 0.8.1? I am using 0.8.1 (the latest available on the web site) and am experiencing this problem. It seems that, since 0.8.1 was released on 2/13/2010 and you wrote the above on 2/22/2010, the fix would be in a version later than 0.8.1. I hate to be a pest; I have no doubt you are very busy and dealing with (L)users is the last thing you need, but I'm a little stymied by this bug.

                              Thanks again.

                              P.S. Yes, I just subscribed to the mailing list.
