  • IrisZhu
    Member
    • Jul 2010
    • 25

    cufflinks generated >400,000 transcripts??

    I've tried using tophat and cufflinks on 3 different sets of data, at least two of which I know are of very high quality. I was surprised by the output of cufflinks --- it generated > 400,000 transcripts!!

    Could anybody tell me how many transcripts you got from the tophat-cufflinks pipeline?

    Thank you in advance!

    Iris
  • GKM
    Member
    • May 2009
    • 45

    #2
    Cufflinks will report anything it sees; it is your job to filter out the garbage afterwards. Use cuffcompare to compare against the annotation. This will tell you which transcripts are known, which are novel isoforms of known genes, and which are novel intergenic transcripts, as well as which are things you probably don't want to deal with, such as intronic leftovers, polymerase run-on fragments past the 3' end, etc.

    Also, note that Cufflinks will produce three values for each transcript: its best-guess FPKM estimate, and the lower and upper bounds of the 95% confidence interval. For a lot of transcripts those three values will be 0, a very small number, and another very small, although somewhat bigger, number. You probably want those out too if you want to be stringent.

    The above should bring down the number of transcripts significantly.
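A minimal sketch of the confidence-interval filter described above. It assumes the Cufflinks `transcripts.gtf` carries `FPKM`, `conf_lo` and `conf_hi` attributes on its `transcript` records (check your own file), and the cutoff `min_conf_lo` and both example records are made up for illustration.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> string value."""
    return dict(ATTR_RE.findall(field))

def keep_transcript(gtf_line, min_conf_lo=0.0):
    """True for a transcript record whose lower FPKM bound exceeds min_conf_lo."""
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "transcript":
        return False
    attrs = parse_attributes(cols[8])
    # Assumption: a missing conf_lo is treated as 0 and therefore filtered out.
    return float(attrs.get("conf_lo", 0.0)) > min_conf_lo

# Two invented records, for illustration only.
example = [
    'chr1\tCufflinks\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5"; conf_lo "8.1"; conf_hi "16.9";',
    'chr1\tCufflinks\ttranscript\t5000\t5300\t1\t.\t.\t'
    'gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "0.4"; conf_lo "0.0"; conf_hi "1.3";',
]
kept = [parse_attributes(line.split("\t")[8])["transcript_id"]
        for line in example if keep_transcript(line)]
print(kept)  # ['CUFF.1.1']
```

Raising `min_conf_lo` makes the filter more stringent. Where to set it is a judgment call that depends on sequencing depth.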


    • IrisZhu
      Member
      • Jul 2010
      • 25

      #3
      In reply to GKM (#2):
      Thanks a lot for your reply. I did use cuffcompare to check the output against a reference annotation (Homo_sapiens.GRCh37.59.gtf, downloaded from Ensembl) and only got ~7000-9000 transcripts matching the reference (2% of the total output and <20% of the reference transcripts).
      This is from one dataset:
      >>cut -f3 transcripts.tmap |sort|uniq -c
      8646 =
      56845 c
      1 class_code
      26512 e
      230788 i
      10697 j
      8367 o
      20707 p
      212179 u
      This makes me (actually not me, my boss) doubt whether I am using tophat/cufflinks properly. I would expect a much better recovery of the annotated transcripts, since I got very good coverage after mapping the same set of reads directly to the transcriptome with bowtie.
      Last edited by IrisZhu; 09-05-2010, 05:47 AM.
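To put the class-code tally above into proportions, here is a small sketch. `class_code_shares` is a hypothetical helper that does the same job as the `cut | sort | uniq -c` one-liner on a `.tmap` file and skips the header row. The `posted` dictionary simply repeats the counts from this post.

```python
from collections import Counter

def class_code_shares(tmap_lines):
    """Count cuffcompare class codes (3rd column); return {code: (count, percent)}."""
    counts = Counter()
    for line in tmap_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip malformed lines and the header row ("class_code").
        if len(cols) < 3 or cols[2] == "class_code":
            continue
        counts[cols[2]] += 1
    total = sum(counts.values())
    return {code: (n, 100.0 * n / total) for code, n in counts.items()}

# Counts copied from the post above (header row excluded).
posted = {"=": 8646, "c": 56845, "e": 26512, "i": 230788,
          "j": 10697, "o": 8367, "p": 20707, "u": 212179}
total = sum(posted.values())
for code, n in sorted(posted.items(), key=lambda kv: -kv[1]):
    print(f"{code}\t{n}\t{100.0 * n / total:.1f}%")
```

With these counts, `i` (intronic) and `u` (intergenic) together account for roughly 77% of the 574,741 classified transcripts, while `=` is about 1.5%.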


      • chrisbala
        Member
        • Jan 2010
        • 82

        #4
        filtering cufflinks transcripts

        I have a similar situation (and I guess this is common): a large number of predicted transcripts from cufflinks.

        I'm trying to adjust the pre-mRNA fraction (-j) to see if this helps at all, as it seems that some of my transcripts might be from pre-mRNA.

        Does anyone else have any suggestions about how to filter out the junk? After all, it's sort of hard to know what junk is.

        Low-coverage stuff makes some sense. But is there a way in cufflinks to require in advance that a transcript be represented by some number of reads?

        Are there any other filters that come to mind? Many of my dubious transcripts are very short and unspliced. Would it be too risky to filter on such features? Hmmm, probably yes.
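One cautious way to explore the "short and unspliced" idea is to flag such transcripts for inspection rather than delete them. The sketch below assumes standard GTF `exon` records with a `transcript_id` attribute. The 200 bp cutoff and the example records are invented for illustration.

```python
import re
from collections import defaultdict

TID_RE = re.compile(r'transcript_id "([^"]+)"')

def exon_summary(gtf_lines):
    """Return {transcript_id: (exon_count, total_exon_length)} from exon records."""
    exons = defaultdict(list)
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        match = TID_RE.search(cols[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            exons[match.group(1)].append(int(cols[4]) - int(cols[3]) + 1)
    return {tid: (len(lengths), sum(lengths)) for tid, lengths in exons.items()}

def flag_short_unspliced(summary, min_length=200):
    """List single-exon transcripts shorter than min_length (hypothetical cutoff)."""
    return sorted(tid for tid, (n_exons, length) in summary.items()
                  if n_exons == 1 and length < min_length)

# Invented records: one spliced transcript and one short single-exon transcript.
example = [
    'chr1\tCufflinks\texon\t100\t300\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";',
    'chr1\tCufflinks\texon\t500\t700\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";',
    'chr1\tCufflinks\texon\t1000\t1120\t1000\t.\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1";',
]
summary = exon_summary(example)
print(flag_short_unspliced(summary))  # ['CUFF.2.1']
```

Real single-exon genes exist, so a flagged list like this is a starting point for review against the annotation, not a deletion list.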


        • aulyanov
          Junior Member
          • Feb 2010
          • 1

          #5
          In my case it is even worse. My goal is to discover alternatively spliced genes using Cufflinks. I took a gene-by-gene approach and submitted only the fraction of the SAM file that covers the region of interest. So far I have got only 50% recovery of the main RefSeq transcripts and a lot of false-positive two-exon transcripts with a score of 1000. The only explanation I have is that I used bwa rather than TopHat to align the reads.


          • frankyue50
            Member
            • Nov 2008
            • 34

            #6
            In some cases, I have seen cufflinks give more than 1 million transcripts. But I checked the authors' original paper, and they only predicted fewer than 30,000 transcripts ...


            • dnusol
              Senior Member
              • Jul 2009
              • 136

              #7
              Hi,

              I used cufflinks for my Arabidopsis data with the TAIR .gtf file, and I only got

              41590 =

              in my transcripts.tmap file, so I don't know whether there are simply no new isoforms to be found.


              • frankyue50
                Member
                • Nov 2008
                • 34

                #8
                But you provided a gtf file ... We were talking about the transcriptome assembly.


                • plabaj
                  Member
                  • Oct 2010
                  • 95

                  #9
                  Hi,

                  If providing a reference GTF file (-G option) is not "permitted/welcome" in your study, you should think about adjusting the following parameters:
                  --min-frags-per-transfrag <int> - default is 10; increasing it should produce fewer transcripts
                  -A/--small-anchor-fraction <0.0-1.0> - default is 0.12; decreasing it takes more reads that fall on splice junctions into consideration -> fewer single-exon transcripts; this should help without producing false-positive (FP) transcripts, especially for longer reads (>75bp) and paired-end data
                  Pawel Labaj
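As a sketch of how the two options above might be combined in a scripted run, the hypothetical helper below only assembles a command line and does not execute it. The input name `accepted_hits.bam` and the values 20 and 0.08 are placeholders, not recommendations. Option availability and defaults depend on your Cufflinks version.

```python
import shlex

def cufflinks_command(alignments, min_frags=10, small_anchor_fraction=0.12):
    """Assemble (but do not run) a Cufflinks command using the two options above."""
    if not 0.0 <= small_anchor_fraction <= 1.0:
        raise ValueError("small_anchor_fraction must be within 0.0-1.0")
    return [
        "cufflinks",
        "--min-frags-per-transfrag", str(min_frags),
        "--small-anchor-fraction", str(small_anchor_fraction),
        alignments,  # placeholder path to the aligned reads
    ]

cmd = cufflinks_command("accepted_hits.bam", min_frags=20, small_anchor_fraction=0.08)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to run the same data with a few settings and compare the resulting transcript counts.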


                  • dagarfield
                    Member
                    • Aug 2010
                    • 39

                    #10
                    How much coverage do you have?
                    If you have low coverage, and you're predicting transcripts de novo, cufflinks is going to give you a lot of transcripts (because many of your reads don't overlap another one).

                    -DG


                    • oliviera
                      Member
                      • Apr 2010
                      • 31

                      #11
                      Dear all,
                      I have a similar concern. I get 119021 transcript models.
                      In my case we have generated > 200 million paired-end reads, so coverage should be high.
                      Have you managed to tune your tophat/cufflinks pipeline to decrease the number of transcript models? And if yes, how?

                      In addition, how do you evaluate the stats from cuffcompare? Here is an example of what I get.
                      # Query mRNAs : 119021 in 114586 loci (22194 multi-exon transcripts)
                      # (3456 multi-transcript loci, ~1.0 transcripts per locus)
                      # Reference mRNAs : 50278 in 31712 loci (44173 multi-exon)
                      # Corresponding super-loci: 14397
                      #--------------------|   Sn |   Sp |  fSn |  fSp
                                 Base level:  47.9   30.7      -      -
                                 Exon level:  18.4   27.7   25.7   38.5
                               Intron level:  28.7   83.0   29.9   86.5
                         Intron chain level:   7.1   14.2   15.7   31.3
                           Transcript level:   0.0    0.0    0.1    0.0
                                Locus level:   9.7    2.7   14.2    3.9
                               Missed exons:  155159/300668 ( 51.6%)
                                Wrong exons:   77024/200099 ( 38.5%)
                             Missed introns:  162909/243880 ( 66.8%)
                              Wrong introns:     6774/84272 (  8.0%)
                                Missed loci:    15952/31712 ( 50.3%)
                                 Wrong loci:   67846/114586 ( 59.2%)

                      Total union super-loci across all input datasets: 82472

                      Olivier
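One way to start reading these stats is to derive a few ratios from the header lines. The sketch below uses only the numbers posted above. `pct` is a hypothetical helper that rounds to one decimal, as the cuffcompare summary does.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Numbers copied from the cuffcompare summary above.
query_transcripts, query_loci, multi_exon = 119021, 114586, 22194

single_exon = query_transcripts - multi_exon
print("single-exon query transcripts:", single_exon,
      f"({pct(single_exon, query_transcripts)}%)")
print("transcripts per locus:", round(query_transcripts / query_loci, 2))

# Recompute two of the reported percentages as a sanity check.
print("missed exons:", pct(155159, 300668))
print("wrong loci:", pct(67846, 114586))
```

By this arithmetic, 96,827 of the 119,021 query transcripts (about 81%) are single-exon. That fits with the high "Wrong loci" figure and suggests many of the models are unspliced fragments rather than full transcripts.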


                      • polyatail
                        Member
                        • Dec 2010
                        • 25

                        #12
                        * If you're using paired-end reads, check the cufflinks log to be sure they're aligned and recognized as PE. In other threads, that wasn't the case.

                        * Tune the anchor length and multiplicity (-a and -g) at the TopHat step. The defaults assume 36 bp single-end reads. For 72 bp PE, we found -a 16 -g 5 produced the best results.

                        * Provide a tRNA/rRNA mask file to Cufflinks. This will remove some high-coverage single exon transcripts from contamination.

                        * I, personally, have had success tweaking -A and -j, but not -F in Cufflinks

                        119k transcripts is not an unreasonable amount for single-end reads. Are you running with a reference annotation (-g in Cufflinks v1) to guide assembly?
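As a complement to checking the Cufflinks log for the first point, the alignments themselves can be inspected: in the SAM format, bit 0x1 of the FLAG field marks a read as paired. The sketch below counts that bit over SAM text lines. `make_record` is a hypothetical helper that only builds demo records.

```python
def paired_fraction(sam_lines):
    """Fraction of alignment records whose FLAG has the 0x1 (paired) bit set."""
    total = paired = 0
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:  # SAM has 11 mandatory fields
            continue
        total += 1
        if int(cols[1]) & 0x1:
            paired += 1
    return paired / total if total else 0.0

def make_record(name, flag):
    # Hypothetical helper: builds a minimal 11-field SAM record for the demo.
    return "\t".join([name, str(flag), "chr1", "100", "50", "36M",
                      "*", "0", "0", "A" * 36, "I" * 36])

demo = ["@HD\tVN:1.0", make_record("r1", 99), make_record("r1", 147),
        make_record("r2", 0), make_record("r3", 16)]
print(paired_fraction(demo))  # 0.5
```

For a paired-end run this fraction should be close to 1.0. A value near 0 suggests the reads were aligned as single-end.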


                        • 11xinqi
                          Member
                          • Mar 2011
                          • 31

                          #13
                          In reply to chrisbala (#4):
                          Hi, do you have any idea about why cufflinks gives a large number of predicted transcripts and how to filter the result now? Thank you.


                          • 11xinqi
                            Member
                            • Mar 2011
                            • 31

                            #14
                            In reply to IrisZhu (original post):
                            Hi, do you have any idea about why cufflinks gives a large number of predicted transcripts and how to filter the result now? Thank you.
