  • Cufflinks runtime

    Cufflinks 2.2.1 is taking a really long time. I started with 45 million 100 bp paired-end, rRNA-depleted, stranded reads aligned with STAR; 24 million align uniquely and 15 million are multimappers. I'm using a sorted .bam as the input for Cufflinks. The Cufflinks command is:

    Code:
    cufflinks -o outputFolder -p 32 -g gencode.v2.annotation.gtf -M maskFile.gtf -b mm10.fa -u --library-type fr-secondstrand inputSorted.bam
    Things to note:

    -p 32: all 32 CPUs are in use for pretty much the entire time. Here's the usage for the past 24 hours from the 32-CPU node I've been using; you can see it going down as threads complete at the end of the Cufflinks run.

    The mask file is masking out a few very highly expressed genes which make up almost 20% of all reads. When I didn't mask these out it got hung up at these loci.

    The library type is reversed because I'm following these instructions.

    After 3.5 days I think it's just about done (it's at "waiting for 18 threads to complete"). Given that the number of input reads isn't huge (especially once all the masked reads are accounted for) and I'm using 32 CPUs, I'm surprised it's taking so long. It doesn't seem to be getting hung up at any specific spots, but it does seem to slow down as it goes, until it's taking many hours for each of the last few percent.

    Is this runtime normal? Anything I can do to speed it up?
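
For readers who want to try the masking approach described above, here is a minimal sketch of one way to build a mask GTF for the -M option: copy every record for a chosen set of genes out of the annotation GTF. It assumes the annotation carries a gene_name attribute (as GENCODE files do). The gene names, coordinates, and function names below are hypothetical placeholders, not the genes actually masked in this thread.

```python
import re


def gene_name(attributes):
    """Return the gene_name value from a GTF attribute column, or None."""
    match = re.search(r'gene_name "([^"]+)"', attributes)
    return match.group(1) if match else None


def build_mask_gtf(annotation_lines, genes_to_mask):
    """Yield the annotation GTF records that belong to the genes to mask."""
    wanted = set(genes_to_mask)
    for line in annotation_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        if gene_name(fields[8]) in wanted:
            yield line if line.endswith("\n") else line + "\n"


# Toy annotation with made-up coordinates (hypothetical, for illustration only).
example_gtf = [
    "##description: toy annotation\n",
    'chr5\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Alb";\n',
    'chr5\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Alb";\n',
    'chr1\tHAVANA\texon\t300\t400\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; gene_name "Actb";\n',
]

mask_lines = list(build_mask_gtf(example_gtf, {"Alb"}))
print("".join(mask_lines), end="")
```

With a real annotation, the same function could be fed an open file handle and the yielded lines written to the file passed to -M.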

  • #2
    I have not used Cufflinks in several years, but the last time I used it, I was analyzing human liver samples and had to remove all albumin-mapped reads to prevent it from hanging. I recommend against it; it's just not very reliable or predictable. It was faster for me to write my own aligner and analysis toolchain than to wait for the Tuxedo pipeline to finish.

    Currently, DESeq seems to be a better choice than Cufflinks.

    • #3
      Thank you, yes I definitely plan to try other programs besides cufflinks/Tuxedo to compare results.

      • #4
        One "long runtime" problem in cufflinks 2.2.1 related to an inefficient data structure has been reported and fixed recently: https://groups.google.com/forum/#!to...rs/UzLCJhj3lUE

        It will be part of Cufflinks 2.2.2 (not released yet). It would be interesting to know if this fixes your issue.

        The commit in question: https://github.com/cole-trapnell-lab...a0292d507f17b6

        Chris

        • #5
          Actually, I just came across this and I'm currently trying it. If that is the cause of my problem, do you know whether I should expect a large or a small speed improvement from this patch?
          Last edited by biocomputer; 01-07-2015, 02:43 PM.

          • #6
            The patched version is working more than twice as fast (the run actually finished just a little bit faster, but it was using -p 16 instead of -p 32), so that's good.

            But now I'm having a problem that I've encountered with other Tuxedo tools (cuffquant and cuffdiff) with both patched and unpatched 2.2.1: I'm getting a segfault even though I'm not running out of memory. In this case I used patched Cufflinks 2.2.1 on two different .bam files and got a segfault for both files at the same locus near the beginning of "Re-estimating abundances with bias and multi-read correction". I don't have an overabundance of reads at or near this location when I visualize the aligned files. But I don't think it's an issue with a specific locus, since when I tried masking the offending locus in cuffdiff it just segfaulted somewhere else, in both cases near the beginning of "Testing for differential expression and regulation in locus".
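
One way to double-check the "segfault but not out of memory" observation above is to launch the tool from a small wrapper that records which signal ended the process and the peak memory of child processes. The sketch below is only an illustration on a POSIX system: run_and_report is a hypothetical helper, and the crashing child is a stand-in Python one-liner, not Cufflinks. On Linux, a process killed by the kernel for exhausting memory typically shows SIGKILL rather than SIGSEGV.

```python
import resource
import signal
import subprocess
import sys


def run_and_report(command):
    """Run a command; return (how it ended, peak RSS of waited-for children)."""
    proc = subprocess.run(command)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    if proc.returncode < 0:
        # subprocess reports death by signal N as a return code of -N.
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exited with code %d" % proc.returncode
    return status, peak_rss


# Stand-in for a crashing tool: a child process that sends itself SIGSEGV.
crash_command = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]
status, peak_rss = run_and_report(crash_command)
print(status, peak_rss)
```

In practice the command list would be the real cufflinks or cuffdiff invocation; comparing the reported peak against the node's available memory gives a rough indication of whether memory pressure is involved.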
