
  • Galaxy workflow management system customization - avoiding duplication of files

    Hi all,

    My team is trying to set up an instance of the Galaxy workflow management system (http://galaxy.psu.edu/) that will launch jobs on our local cluster. We work on high-throughput sequencing projects, so we have to manage a LOT of large files (several GB each).

    When files are uploaded into Galaxy, they are automatically copied to a folder named "database/files" and renamed sequentially (dataset_1, dataset_2, dataset_3, etc.). This naming convention applies regardless of whether a file is an input, intermediate, or output file.

    Copying and renaming files this way is too time-consuming and makes us lose our file structure.

    Is there a way to avoid this behavior, so that Galaxy just remembers the file path instead of keeping its own copy?

    If someone here has experience with this tool, any help or useful link would be appreciated.

    Cheers,

    tony

  • #2
    We solved this using a filesystem with deduplication capabilities (ZFS). To do this you have to switch to an operating system that supports it (IllumOS, Nexenta, Solaris).

    • #3
      Hi Tony,

      I recommend you post your question to (and subscribe to) the galaxy-dev mailing list (http://lists.bx.psu.edu/listinfo/galaxy-dev). The Galaxy team reads and responds to email on this list, and it also serves as a forum for discussion about the technical aspects of Galaxy.

      FYI, what you want to do is achievable by modifying Galaxy's universe file, but I'm not exactly sure how, as it's not my area of expertise.

      Thanks,
      J.

      Emory University
      Galaxy Team
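
For illustration only, the general idea asked about in this thread (keeping a reference to a file instead of a second copy) can be sketched with symbolic links. This is not Galaxy code and not a Galaxy setting: `register_by_link`, `demo`, and the directory layout below are hypothetical names chosen to mirror the "database/files" and dataset_N convention described in the first post. It assumes a filesystem and user account that allow symlinks.

```python
import os
import tempfile
from pathlib import Path


def register_by_link(source: Path, store: Path, dataset_id: int) -> Path:
    """Expose `source` inside `store` as dataset_<id> via a symlink, without copying.

    Hypothetical helper: it only mimics the naming scheme from the thread.
    """
    store.mkdir(parents=True, exist_ok=True)
    link = store / f"dataset_{dataset_id}"
    # The link stores only the target path, so its cost does not depend
    # on the size of the original file.
    os.symlink(source.resolve(), link)
    return link


def demo() -> dict:
    """Build a tiny throwaway project tree and register one file by link."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "project_a" / "sample1.fastq"
        source.parent.mkdir(parents=True)
        source.write_text("@read1\nACGT\n+\nIIII\n")

        link = register_by_link(source, root / "database" / "files", 1)
        return {
            "is_symlink": link.is_symlink(),
            "same_content": link.read_text() == source.read_text(),
            "points_to_original": link.resolve() == source.resolve(),
            "original_still_in_place": source.exists(),
        }


if __name__ == "__main__":
    print(demo())
```

The original file stays where it is in the project tree, and the store holds only a pointer to it. The trade-off is that the link breaks if the original is moved or deleted, which is one reason a tool might prefer to own its copies.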
