Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks: transcript assembly and abundance estimation for RNA-seq

    Hi all,

    I'm pleased to let you know that I've recently released a new tool for analyzing RNA-Seq data that you may find useful. The program, Cufflinks, is designed to assemble the transcripts in an RNA-Seq sample and calculate their abundances (in transcript-level RPKM). Cufflinks is built to work without a reference annotation. You need only a reference genome and an input file of RNA-Seq read alignments (in SAM format), such as the one produced by TopHat.

    Cufflinks currently supports only Mac OS X and Linux, and you'll need a medium-sized machine to run it. The program is fast and fully threaded, but for large (>100 million aligned reads) runs, the memory footprint can be 10GB or more.

    This is only the first release, and there are numerous features planned for the next weeks and months. The project is a collaboration between scientists at the University of Maryland, UC Berkeley, Caltech, and WashU. We would be grateful for any feedback, suggestions, bug reports, or criticism you have. We hope you find Cufflinks useful.

  • #2
    Hi Cole

    Thanks for the update. I took a quick glance and just wondering if there are some methods available to visualize the transcript abundances reported in the GTF format. Also any way to simultaneously compare control and sample.

    -Abhi

    Comment


    • #3
      Hi Abhi,

      We recommend using cuffcompare to compare control and sample, though you may need to tailor significance tests, etc. to your analysis. You might want to check out this paper for more details on control and sample testing with RPKM.

      As for visualization, we don't have anything out for that yet, as supporting a UI on various platforms, etc. is more engineering resources than we can really afford right now. However, we've tried to make the various output files as friendly as possible to environments like R, so that you can use your plotting environment of choice.

      If anyone were to contribute scripts or packages that depended on standard plotting tools, we would happily include them in the Cufflinks distribution (properly credited, of course).

      Comment


      • #4
        A tip of the cap to continuing the classy nomenclature theme!

        Comment


        • #5
          Setting up GBrowse to view transcript abundance

          Hi apratap,
          I've been working on visualizing the transcript abundance data in GBrowse and have had some luck with it. Here is how I went about it,
          1. Loading the Cufflinks identified pseudo-transcripts is very easy (GBrowse seems to accept the GTF as input and is able to display the transcripts)
          2. To display the abundance data, you need to perform the following transformation on the GTF file and create a GFF file for use with GBrowse. As an example, here are a few lines from the GTF file
          Code:
          124 Cufflinks   transcript  1551    1663    1000    .   .   gene_id "CUFF.1"; transcript_id "CUFF.1.0"; RPKM "268.2781599177"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.    000000"; cov "0.637168";
          124 Cufflinks   exon    1551    1663    1000    .   .   gene_id "CUFF.1"; transcript_id "CUFF.1.0"; exon_number "1"; RPKM "268.2781599177"; frac "1.000000"; conf_lo "0.000000";   conf_hi "0.000000"; cov "0.637168";
          124 Cufflinks   transcript  8604    9455    1000    .   .   gene_id "CUFF.4"; transcript_id "CUFF.4.0"; RPKM "1405.4689751085"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.   000000"; cov "3.338028";
          124 Cufflinks   exon    8604    9455    1000    .   .   gene_id "CUFF.4"; transcript_id "CUFF.4.0"; exon_number "1"; RPKM "1405.4689751085"; frac "1.000000"; conf_lo "0.000000";  conf_hi "0.000000"; cov "3.338028";
          124 Cufflinks   transcript  9540    9757    1000    .   .   gene_id "CUFF.7"; transcript_id "CUFF.7.0"; RPKM "903.9004975207"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.    000000"; cov "2.146789";
          124 Cufflinks   exon    9540    9757    1000    .   .   gene_id "CUFF.7"; transcript_id "CUFF.7.0"; exon_number "1"; RPKM "903.9004975207"; frac "1.000000"; conf_lo "0.000000";   conf_hi "0.000000"; cov "2.146789";
          124 Cufflinks   transcript  9825    12624   1000    .   .   gene_id "CUFF.10"; transcript_id "CUFF.10.0"; RPKM "1082.6940025248"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "2.571429";
          124 Cufflinks   exon    9825    12624   1000    .   .   gene_id "CUFF.10"; transcript_id "CUFF.10.0"; exon_number "1"; RPKM "1082.6940025248"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "2.571429";
          124 Cufflinks   transcript  12678   13027   1000    .   .   gene_id "CUFF.13"; transcript_id "CUFF.13.0"; RPKM "1775.6181641407"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "4.217143";
          124 Cufflinks   exon    12678   13027   1000    .   .   gene_id "CUFF.13"; transcript_id "CUFF.13.0"; exon_number "1"; RPKM "1775.6181641407"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "4.217143";
          124 Cufflinks   transcript  14562   14603   1000    .   .   gene_id "CUFF.16"; transcript_id "CUFF.16.0"; RPKM "721.7960016832"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.  000000"; cov "1.714286";
          124 Cufflinks   exon    14562   14603   1000    .   .   gene_id "CUFF.16"; transcript_id "CUFF.16.0"; exon_number "1"; RPKM "721.7960016832"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.714286";
          124 Cufflinks   transcript  17793   17835   1000    .   .   gene_id "CUFF.19"; transcript_id "CUFF.19.0"; RPKM "705.0100481557"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.  000000"; cov "1.674419";
          124 Cufflinks   exon    17793   17835   1000    .   .   gene_id "CUFF.19"; transcript_id "CUFF.19.0"; exon_number "1"; RPKM "705.0100481557"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.674419";
          124 Cufflinks   transcript  27148   27726   1000    .   .   gene_id "CUFF.22"; transcript_id "CUFF.22.0"; RPKM "2853.5251258254"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "6.777202";
          124 Cufflinks   exon    27148   27726   1000    .   .   gene_id "CUFF.22"; transcript_id "CUFF.22.0"; exon_number "1"; RPKM "2853.5251258254"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "6.777202";
          124 Cufflinks   transcript  27829   29109   1000    .   .   gene_id "CUFF.25"; transcript_id "CUFF.25.0"; RPKM "3585.3145657380"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "8.515222";
          124 Cufflinks   exon    27829   29109   1000    .   .   gene_id "CUFF.25"; transcript_id "CUFF.25.0"; exon_number "1"; RPKM "3585.3145657380"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "8.515222";
          124 Cufflinks   transcript  29171   34842   1000    .   .   gene_id "CUFF.28"; transcript_id "CUFF.28.0"; RPKM "2442.5515614083"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "5.801128";
          124 Cufflinks   exon    29171   34842   1000    .   .   gene_id "CUFF.28"; transcript_id "CUFF.28.0"; exon_number "1"; RPKM "2442.5515614083"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "5.801128";
          124 Cufflinks   transcript  34945   36257   1000    .   .   gene_id "CUFF.31"; transcript_id "CUFF.31.0"; RPKM "3751.9099097396"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "8.910891";
          124 Cufflinks   exon    34945   36257   1000    .   .   gene_id "CUFF.31"; transcript_id "CUFF.31.0"; exon_number "1"; RPKM "3751.9099097396"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "8.910891";
          124 Cufflinks   transcript  37803   38013   1000    .   .   gene_id "CUFF.34"; transcript_id "CUFF.34.0"; RPKM "1867.7754356353"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0. 000000"; cov "4.436019";
          124 Cufflinks   exon    37803   38013   1000    .   .   gene_id "CUFF.34"; transcript_id "CUFF.34.0"; exon_number "1"; RPKM "1867.7754356353"; frac "1.000000"; conf_lo "0.        000000"; conf_hi "0.000000"; cov "4.436019";
          The above segment is converted into the following
          Code:
          124 Cufflinks   abundance   1551    1663    268.2781599177  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   8604    9455    1405.4689751085 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   9540    9757    903.9004975207  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   9825    12624   1082.6940025248 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   12678   13027   1775.6181641407 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   14562   14603   721.7960016832  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   17793   17835   705.0100481557  .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   27148   27726   2853.5251258254 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   27829   29109   3585.3145657380 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   29171   34842   2442.5515614083 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   34945   36257   3751.9099097396 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   37803   38013   1867.7754356353 .   .   ID=Cufflinks_transcript_abundance
          124 Cufflinks   abundance   38163   38302   6604.4334154015 .   .   ID=Cufflinks_transcript_abundance
          And use the following GBrowse configuration for the abundance track:
          Code:
          [abundance]
          feature = abundance
          category = Cufflinks_abundance
          glyph = xyplot
          graph_type = boxes
          fgcolor = brown
          bgcolor = orange
          scale = left
          min_score = 0
          height = 50
          group_on = display_name
          and this is what you see!



          Hope this helps!

          Best,
          ~Vivek
          Last edited by vkrishna; 10-25-2009, 04:10 PM.

          Comment


          • #6
            Hey vkrishna

            Thanks a lot for sharing your method. I will try to work on it sometime later this week.

            Best,
            -Abhi

            Comment


            • #7
              Bug: Cufflinks merges proximal genes, although there is proper annotation

              In our group we have just noticed that proximal genes get merged by cufflinks.
              While this is probably unavoidable for a run without any annotation it is clearly wrong in case the correct annotation is provided and ignored/overridden.

              Here's one example from the mitochondrial chromosome (kindly provided by arthur.yxt):

              cufflinks genes.expr file says:

              Code:
              ENSG00000198899      713201 chrM   8527   9988   5618.54
              But a look at the GTF annotation file reveals there are actually two genes being merged:

              Code:
              chrM    .       exon    8528    9208    .       +       .       gene_id "ENSG00000198899";transcript_id "ENST00000361899"
              chrM    .       exon    9208    9988    .       +       .       gene_id "ENSG00000198938";transcript_id "ENST00000362079"
              While extending an annotated gene by proximal stuff that is otherwise "intergenic" is probably a good idea (TSS or PolyA-site variation, cryptic exons, etc.) in this case it is clearly wrong as the merged part is not intergenic but a complete neighboring gene.

              If needed, a small dataset to reproduce this bug can be provided.

              Best!
              -Marvin

              Comment


              • #8
                Originally posted by marvin.j View Post
                In our group we have just noticed that proximal genes get merged by cufflinks.
                While this is probably unavoidable for a run without any annotation it is clearly wrong in case the correct annotation is provided and ignored/overridden.

                Here's one example from the mitochondrial chromosome (kindly provided by arthur.yxt):

                cufflinks genes.expr file says:

                Code:
                ENSG00000198899      713201 chrM   8527   9988   5618.54
                But a look at the GTF annotation file reveals there are actually two genes being merged:

                Code:
                chrM    .       exon    8528    9208    .       +       .       gene_id "ENSG00000198899";transcript_id "ENST00000361899"
                chrM    .       exon    9208    9988    .       +       .       gene_id "ENSG00000198938";transcript_id "ENST00000362079"
                While extending an annotated gene by proximal stuff that is otherwise "intergenic" is probably a good idea (TSS or PolyA-site variation, cryptic exons, etc.) in this case it is clearly wrong as the merged part is not intergenic but a complete neighboring gene.

                If needed, a small dataset to reproduce this bug can be provided.

                Best!
                -Marvin
                There are really two issues here:

                1) The reference-based quantitation is only reporting one gene in this example. This is probably a bug, and I would greatly appreciate a small test set via email

                2) The assembly algorithm merges overlapping genes. In strandless prototocols, merging overlapping, unspliced genes or overlapping genes on the same strand is tough to avoid as you suggest (although tracking polyA reads may help here). However, in my experience these are extremely rare in most organisms that have splicing. Yeast is a notable exception. The mitochondrial genome is one place I *would* expect it, given the gene density, lack of splicing, and generally very high read count (with associated polymerase run-on, etc.).

                The example you gave, if I understand correctly, demonstrates 1). Are you also seeing 2)? And If so, can you give us a sense of how often? Fighting this problem via polyA tracking and local assembly is high on the list of planned features, but if users are running into this a lot, I could try to get to it sooner rather than later.

                Comment


                • #9
                  Removed, found answer somewhere else in the forum.
                  Last edited by gard; 10-29-2009, 07:31 AM.

                  Comment


                  • #10
                    This is probably a simple question but I just don't understand the computation of RPKM well enough to answer it for myself.

                    I've run cufflinks on tophat output (both run without a reference annotation) and then finally cuffcompare with a reference annotation to map the cufflinks transcripts to the UCSC transcripts. My question is this: in the stdout.tracking file that's produced from cuffcompare I'll find several cufflinks determined transcripts that have been mapped to a single UCSC transcript. Each of those cufflinks transcripts has an RPKM - how do I translate those RPKMs to the single UCSC transcript that cuffcompare has mapped them to?

                    example lines from my tracking file:

                    Code:
                    uc008bzg.1|uc008bzg.1	e	q1:CUFF.209682|CUFF.209682.0|100|10.337529|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	j	q1:CUFF.209785|CUFF.209785.0|100|259.417274|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209691|CUFF.209691.0|100|34.891979|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209695|CUFF.209695.0|100|135.598390|0.000000|0.000000|uniq
                    uc008bzg.1|uc008bzg.1	c	q1:CUFF.209701|CUFF.209701.0|100|133.368072|0.000000|0.000000|uniq
                    I'd like to get a single RPKM for uc008bzg.1 from those five lines. Thanks in advance!
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #11
                      OK, I've thought about this and it's likely that the way to pull off what I want is to consider it a problem of merging multiple things made up of the same materials (in this case bp of exon) all with different densities (RPKMs). That breaks it down to calculating a weighted mean. Of course I'll need total bp worth of exon from each of those transcripts but with that info it should be an easy calculation.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #12
                        I've had decent success doing that, by performing a transfrag length-weighted average of the RPKM.

                        Comment


                        • #13
                          Hi Cole and Driscoll

                          Good that you have asked this before I actually realized. So I am a bit confused now. Why does cufflinks for the same sample, maps diff cufflinks predicted transcripts to same UCSC transcript. I think I am missing some biology here but may be this question is not trivial.

                          It will me understand better if you guys could explain a bit more.

                          -Abhi

                          Comment


                          • #14
                            After normalizing for transcript length, the fraction of the total reads in your experiment that come from any one transcript is roughly proportional to the abundance of that transcript.

                            What we're seeing here is a transcript that is rare enough in the sample that it doesn't get many reads, and thus has gaps in sequencing coverage. When this happens, Cufflinks can't assemble the full thing. This is one of the central differences between Cufflinks and say, an ab initio gene finder like Augustus. So what happens is that Cufflinks assembles the reads that came from this rare transcript as best it can, and the result is a handful of "transfrags", each of which get their own RPKM estimate. What sdriscoll is trying to do here is estimate the RPKM of the whole transcript from the individual fragment RPKMs.

                            Comment


                            • #15
                              Hi Cole,

                              does Cufflinks take as input the 'raw' SAM file produced by Bowtie (>version 0.11.1)?

                              Daniel

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM
                              • seqadmin
                                Multiomics Techniques Advancing Disease Research
                                by seqadmin


                                New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

                                A major leap in the field has
                                ...
                                02-08-2024, 06:33 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:12 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-23-2024, 04:11 PM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-21-2024, 08:52 AM
                              0 responses
                              73 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-20-2024, 08:57 AM
                              0 responses
                              62 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X