Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Cole Trapnell
    Senior Member
    • Nov 2008
    • 213

    Cufflinks: transcript assembly and abundance estimation for RNA-seq

    Hi all,

    I'm pleased to let you know that I've recently released a new tool for analyzing RNA-Seq data that you may find useful. The program, Cufflinks, is designed to assemble the transcripts in an RNA-Seq sample and calculate their abundances (in transcript-level RPKM). Cufflinks is built to work without a reference annotation. You need only a reference genome and an input file of RNA-Seq read alignments (in SAM format), such as the one produced by TopHat.

    Cufflinks currently supports only Mac OS X and Linux, and you'll need a medium-sized machine to run it. The program is fast and fully threaded, but for large (>100 million aligned reads) runs, the memory footprint can be 10GB or more.

    This is only the first release, and there are numerous features planned for the next weeks and months. The project is a collaboration between scientists at the University of Maryland, UC Berkeley, Caltech, and WashU. We would be grateful for any feedback, suggestions, bug reports, or criticism you have. We hope you find Cufflinks useful.
  • apratap
    Member
    • Jan 2009
    • 58

    #2
    Hi Cole

    Thanks for the update. I took a quick glance and just wondering if there are some methods available to visualize the transcript abundances reported in the GTF format. Also any way to simultaneously compare control and sample.

    -Abhi

    Comment

    • Cole Trapnell
      Senior Member
      • Nov 2008
      • 213

      #3
      Hi Abhi,

      We recommend using cuffcompare to compare control and sample, though you may need to tailor significance tests, etc. to your analysis. You might want to check out this paper for more details on control and sample testing with RPKM.

      As for visualization, we don't have anything out for that yet, as supporting a UI on various platforms, etc. is more engineering resources than we can really afford right now. However, we've tried to make the various output files as friendly as possible to environments like R, so that you can use your plotting environment of choice.

      If anyone were to contribute scripts or packages that depended on standard plotting tools, we would happily include them in the Cufflinks distribution (properly credited, of course).

      Comment

      • krobison
        Senior Member
        • Nov 2007
        • 734

        #4
        A tip of the cap to continuing the classy nomenclature theme!

        Comment

        • vkrishna
          Junior Member
          • Oct 2009
          • 1

          #5
          Setting up GBrowse to view transcript abundance

          Hi apratap,
          I've been working on visualizing the transcript abundance data in GBrowse and have had some luck with it. Here is how I went about it,
          1. Loading the Cufflinks identified pseudo-transcripts is very easy (GBrowse seems to accept the GTF as input and is able to display the transcripts)
          2. To display the abundance data, you need to perform the following transformation on the GTF file and create a GFF file for use with GBrowse. As an example, here are a few lines from the GTF file
          Code:
          124 Cufflinks   transcript  1551    1663    1000    .   .   gene_id "CUFF.1"; transcript_id "CUFF.1.0"; RPKM "268.2781599177"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.    000000"; cov "0.637168";
          124 Cufflinks   exon    1551    1663    1000    .   .   gene_id "CUFF.1"; transcript_id "CUFF.1.0"; exon_number "1"; RPKM "268.2781599177"; frac "1.000000"; conf_lo "0.000000";   conf_hi "0.000000"; cov "0.637168";
          124 Cufflinks   transcript  8604    9455    1000    .   .   gene_id "CUFF.4"; transcript_id "CUFF.4.0"; RPKM "1405.4689751085"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.   000000"; cov "3.338028";
          124 Cufflinks   exon    8604    9455    1000    .   .   gene_id "CUFF.4"; transcript_id "CUFF.4.0"; exon_number "1"; RPKM "1405.4689751085"; frac "1.000000"; conf_lo "0.000000";  conf_hi "0.000000"; cov "3.338028";
          124 Cufflinks   transcript  9540    9757    1000    .   .   gene_id "CUFF.7"; transcript_id "CUFF.7.0"; RPKM "903.9004975207"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.    000000"; cov "2.146789";
          124 Cufflinks   exon    9540    9757    1000    .   .   gene_id "CUFF.7"; transcript_id "CUFF.7.0"; exon_number "1"; RPKM "903.9004975207"; frac "1.000000"; conf_lo "0.000000";   conf_hi "0.000000"; cov "2.146789";
          124 Cufflinks   transcript  9825    12624   1000    .   .   gene_id "CUFF.10"; transcript_id "CUFF.10.0"; RPKM "1082.6940025248"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "2.571429";
          124 Cufflinks   exon    9825    12624   1000    .   .   gene_id "CUFF.10"; transcript_id "CUFF.10.0"; exon_number "1"; RPKM "1082.6940025248"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "2.571429";
          124 Cufflinks   transcript  12678   13027   1000    .   .   gene_id "CUFF.13"; transcript_id "CUFF.13.0"; RPKM "1775.6181641407"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "4.217143";
          124 Cufflinks   exon    12678   13027   1000    .   .   gene_id "CUFF.13"; transcript_id "CUFF.13.0"; exon_number "1"; RPKM "1775.6181641407"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "4.217143";
          124 Cufflinks   transcript  14562   14603   1000    .   .   gene_id "CUFF.16"; transcript_id "CUFF.16.0"; RPKM "721.7960016832"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.  000000"; cov "1.714286";
          124 Cufflinks   exon    14562   14603   1000    .   .   gene_id "CUFF.16"; transcript_id "CUFF.16.0"; exon_number "1"; RPKM "721.7960016832"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.714286";
          124 Cufflinks   transcript  17793   17835   1000    .   .   gene_id "CUFF.19"; transcript_id "CUFF.19.0"; RPKM "705.0100481557"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.  000000"; cov "1.674419";
          124 Cufflinks   exon    17793   17835   1000    .   .   gene_id "CUFF.19"; transcript_id "CUFF.19.0"; exon_number "1"; RPKM "705.0100481557"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.674419";
          124 Cufflinks   transcript  27148   27726   1000    .   .   gene_id "CUFF.22"; transcript_id "CUFF.22.0"; RPKM "2853.5251258254"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "6.777202";
          124 Cufflinks   exon    27148   27726   1000    .   .   gene_id "CUFF.22"; transcript_id "CUFF.22.0"; exon_number "1"; RPKM "2853.5251258254"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "6.777202";
          124 Cufflinks   transcript  27829   29109   1000    .   .   gene_id "CUFF.25"; transcript_id "CUFF.25.0"; RPKM "3585.3145657380"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "8.515222";
          124 Cufflinks   exon    27829   29109   1000    .   .   gene_id "CUFF.25"; transcript_id "CUFF.25.0"; exon_number "1"; RPKM "3585.3145657380"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "8.515222";
          124 Cufflinks   transcript  29171   34842   1000    .   .   gene_id "CUFF.28"; transcript_id "CUFF.28.0"; RPKM "2442.5515614083"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "5.801128";
          124 Cufflinks   exon    29171   34842   1000    .   .   gene_id "CUFF.28"; transcript_id "CUFF.28.0"; exon_number "1"; RPKM "2442.5515614083"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "5.801128";
          124 Cufflinks   transcript  34945   36257   1000    .   .   gene_id "CUFF.31"; transcript_id "CUFF.31.0"; RPKM "3751.9099097396"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "8.910891";
          124 Cufflinks   exon    34945   36257   1000    .   .   gene_id "CUFF.31"; transcript_id "CUFF.31.0"; exon_number "1"; RPKM "3751.9099097396"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "8.910891";
          124 Cufflinks   transcript  37803   38013   1000    .   .   gene_id "CUFF.34"; transcript_id "CUFF.34.0"; RPKM "1867.7754356353"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "4.436019";
          124 Cufflinks   exon    37803   38013   1000    .   .   gene_id "CUFF.34"; transcript_id "CUFF.34.0"; exon_number "1"; RPKM "1867.7754356353"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "4.436019";
          The above segment is converted into the following
          Code:
          124 Cufflinks   abundance   1551    1663    268.2781599177  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   8604    9455    1405.4689751085 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   9540    9757    903.9004975207  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   9825    12624   1082.6940025248 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   12678   13027   1775.6181641407 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   14562   14603   721.7960016832  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   17793   17835   705.0100481557  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   27148   27726   2853.5251258254 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   27829   29109   3585.3145657380 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   29171   34842   2442.5515614083 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   34945   36257   3751.9099097396 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   37803   38013   1867.7754356353 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   38163   38302   6604.4334154015 .   .   ID=Cufflinks_transcript_abundance
          And use the following GBrowse configuration for the abundance track:
          Code:
          [abundance]
          feature = abundance
          category = Cufflinks_abundance
          glyph = xyplot
          graph_type = boxes
          fgcolor = brown
          bgcolor = orange
          scale = left
          min_score = 0
          height = 50
          group_on = display_name
          and this is what you see!



          Hope this helps!

          Best,
          ~Vivek
          Last edited by vkrishna; 10-25-2009, 04:10 PM.

          Comment

          • apratap
            Member
            • Jan 2009
            • 58

            #6
            Hey vkrishna

            Thanks a lot for sharing your method. I will try to work on it sometime later this week.

            Best,
            -Abhi

            Comment

            • marvin.j
              Junior Member
              • Oct 2009
              • 5

              #7
              Bug: Cufflinks merges proximal genes, although there is proper annotation

              In our group we have just noticed that proximal genes get merged by cufflinks.
              While this is probably unavoidable for a run without any annotation it is clearly wrong in case the correct annotation is provided and ignored/overridden.

              Here's one example from the mitochondrial chromosome (kindly provided by arthur.yxt):

              cufflinks genes.expr file says:

              Code:
              ENSG00000198899      713201 chrM   8527   9988   5618.54
              But a look at the GTF annotation file reveals there are actually two genes being merged:

              Code:
              chrM    .       exon    8528    9208    .       +       .       gene_id "ENSG00000198899";transcript_id "ENST00000361899"
              chrM    .       exon    9208    9988    .       +       .       gene_id "ENSG00000198938";transcript_id "ENST00000362079"
              While extending an annotated gene by proximal stuff that is otherwise "intergenic" is probably a good idea (TSS or PolyA-site variation, cryptic exons, etc.) in this case it is clearly wrong as the merged part is not intergenic but a complete neighboring gene.

              If needed, a small dataset to reproduce this bug can be provided.

              Best!
              -Marvin

              Comment

              • Cole Trapnell
                Senior Member
                • Nov 2008
                • 213

                #8
                Originally posted by marvin.j View Post
                In our group we have just noticed that proximal genes get merged by cufflinks.
                While this is probably unavoidable for a run without any annotation it is clearly wrong in case the correct annotation is provided and ignored/overridden.

                Here's one example from the mitochondrial chromosome (kindly provided by arthur.yxt):

                cufflinks genes.expr file says:

                Code:
                ENSG00000198899      713201 chrM   8527   9988   5618.54
                But a look at the GTF annotation file reveals there are actually two genes being merged:

                Code:
                chrM    .       exon    8528    9208    .       +       .       gene_id "ENSG00000198899";transcript_id "ENST00000361899"
                chrM    .       exon    9208    9988    .       +       .       gene_id "ENSG00000198938";transcript_id "ENST00000362079"
                While extending an annotated gene by proximal stuff that is otherwise "intergenic" is probably a good idea (TSS or PolyA-site variation, cryptic exons, etc.) in this case it is clearly wrong as the merged part is not intergenic but a complete neighboring gene.

                If needed, a small dataset to reproduce this bug can be provided.

                Best!
                -Marvin
                There are really two issues here:

                1) The reference-based quantitation is only reporting one gene in this example. This is probably a bug, and I would greatly appreciate a small test set via email

                2) The assembly algorithm merges overlapping genes. In strandless prototocols, merging overlapping, unspliced genes or overlapping genes on the same strand is tough to avoid as you suggest (although tracking polyA reads may help here). However, in my experience these are extremely rare in most organisms that have splicing. Yeast is a notable exception. The mitochondrial genome is one place I *would* expect it, given the gene density, lack of splicing, and generally very high read count (with associated polymerase run-on, etc.).

                The example you gave, if I understand correctly, demonstrates 1). Are you also seeing 2)? And If so, can you give us a sense of how often? Fighting this problem via polyA tracking and local assembly is high on the list of planned features, but if users are running into this a lot, I could try to get to it sooner rather than later.

                Comment

                • gard
                  Junior Member
                  • Oct 2009
                  • 1

                  #9
                  Removed, found answer somewhere else in the forum.
                  Last edited by gard; 10-29-2009, 07:31 AM.

                  Comment

                  • sdriscoll
                    I like code
                    • Sep 2009
                    • 436

                    #10
                    This is probably a simple question but I just don't understand the computation of RPKM well enough to answer it for myself.

                    I've run cufflinks on tophat output (both run without a reference annotation) and then finally cuffcompare with a reference annotation to map the cufflinks transcripts to the UCSC transcripts. My question is this: in the stdout.tracking file that's produced from cuffcompare I'll find several cufflinks determined transcripts that have been mapped to a single UCSC transcript. Each of those cufflinks transcripts has an RPKM - how do I translate those RPKMs to the single UCSC transcript that cuffcompare has mapped them to?

                    example lines from my tracking file:

                    Code:
                    uc008bzg.1|uc008bzg.1	e	q1:CUFF.209682|CUFF.209682.0|100|10.337529|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	j	q1:CUFF.209785|CUFF.209785.0|100|259.417274|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209691|CUFF.209691.0|100|34.891979|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209695|CUFF.209695.0|100|135.598390|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209701|CUFF.209701.0|100|133.368072|0.000000|0.000000|uniq
                    I'd like to get a single RPKM for uc008bzg.1 from those five lines. Thanks in advance!
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment

                    • sdriscoll
                      I like code
                      • Sep 2009
                      • 436

                      #11
                      OK, I've thought about this and it's likely that the way to pull off what I want is to consider it a problem of merging multiple things made up of the same materials (in this case bp of exon) all with different densities (RPKMs). That breaks it down to calculating a weighted mean. Of course I'll need total bp worth of exon from each of those transcripts but with that info it should be an easy calculation.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment

                      • Cole Trapnell
                        Senior Member
                        • Nov 2008
                        • 213

                        #12
                        I've had decent success doing that, by performing a transfrag length-weighted average of the RPKM.

                        Comment

                        • apratap
                          Member
                          • Jan 2009
                          • 58

                          #13
                          Hi Cole and Driscoll

                          Good that you have asked this before I actually realized. So I am a bit confused now. Why does cufflinks for the same sample, maps diff cufflinks predicted transcripts to same UCSC transcript. I think I am missing some biology here but may be this question is not trivial.

                          It will me understand better if you guys could explain a bit more.

                          -Abhi

                          Comment

                          • Cole Trapnell
                            Senior Member
                            • Nov 2008
                            • 213

                            #14
                            After normalizing for transcript length, the fraction of the total reads in your experiment that come from any one transcript is roughly proportional to the abundance of that transcript.

                            What we're seeing here is a transcript that is rare enough in the sample that it doesn't get many reads, and thus has gaps in sequencing coverage. When this happens, Cufflinks can't assemble the full thing. This is one of the central differences between Cufflinks and say, an ab initio gene finder like Augustus. So what happens is that Cufflinks assembles the reads that came from this rare transcript as best it can, and the result is a handful of "transfrags", each of which get their own RPKM estimate. What sdriscoll is trying to do here is estimate the RPKM of the whole transcript from the individual fragment RPKMs.

                            Comment

                            • ndaniel
                              Member
                              • Feb 2009
                              • 33

                              #15
                              Hi Cole,

                              does Cufflinks take as input the 'raw' SAM file produced by Bowtie (>version 0.11.1)?

                              Daniel

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 08:59 AM
                              0 responses
                              8 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              15 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...