  • bio_boris
    Member
    • Mar 2013
    • 14

    File compression tips

    Hi all,

    I have a large collection of FASTA files and tabular BLAST files (~30-60 TB in total) that I would like to archive. Does anyone have any experience with saving storage space by sorting these files before compression? Or should I skip this and look into compression formats other than gzip?

    Edit: these files will not be saved to an actual archive system.

    Thanks!
    Last edited by bio_boris; 08-25-2014, 09:48 AM.
  • amitm
    Member
    • Feb 2011
    • 52

    #2
    Hi bio_boris,
    I have some experience compressing FASTQ (not FASTA, though) files as tar.gz archives.
    It takes a long time (>1 hr on a Linux workstation for FASTQ files averaging 20 GB), but achieves about 3-fold compression.


    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      Sorting mapped files can save a lot of space. Sorting FASTA probably won't save as much, though it depends on the contents.
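One way to check this on a sample of your own data before committing to a sort is to compare compressed sizes directly. The sketch below is only an illustration: `compare_sorted_size` is a hypothetical helper name, and it assumes a line-oriented file (such as tabular BLAST output) where line order does not matter. A plain line sort would scramble multi-line FASTA records, so it is not suitable for those.

```shell
# Hypothetical helper: report the gzip-compressed size (in bytes) of a
# line-oriented file as-is and after sorting its lines.
# Assumes the order of lines in the file is not meaningful.
compare_sorted_size() (
    file=$1
    # $(( ... )) strips the padding some wc implementations add
    unsorted=$(( $(gzip -c "$file" | wc -c) ))
    sorted=$(( $(LC_ALL=C sort "$file" | gzip -c | wc -c) ))
    printf 'unsorted\t%d\nsorted\t%d\n' "$unsorted" "$sorted"
)
```

Calling `compare_sorted_size sample_hits.tsv` (file name made up here) prints the two sizes, tab-separated. Sorting tends to help most when similar or identical lines are otherwise spread far apart in the file.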

      I recommend that you use pigz, which can create gzip files using all available CPU cores and thus is way faster than gzip for the same compression level. That means you can increase to a higher compression level at the same or better speed. Syntax is just like gzip and the files are still compatible with gzip. There are higher-compression formats, but gzip is more ubiquitous and better supported.
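As a rough sketch of what that looks like in practice, the wrapper below uses pigz when it is installed and falls back to plain gzip otherwise. The function name `compress_gz` and the default level and thread count are illustrative assumptions, not recommendations. `-p` is pigz's option for the number of compression threads.

```shell
# Hypothetical wrapper: compress FILE in place, producing FILE.gz.
# Usage: compress_gz FILE [LEVEL] [THREADS]
compress_gz() (
    file=$1
    level=${2:-9}      # assumed default: highest standard gzip level
    threads=${3:-4}    # assumed default: adjust to your machine
    if command -v pigz >/dev/null 2>&1; then
        pigz "-$level" -p "$threads" "$file"   # parallel; output is readable by gzip
    else
        gzip "-$level" "$file"                 # single-threaded fallback
    fi
)
```

For example, `compress_gz hits.tsv 9 16` would compress a (hypothetical) `hits.tsv` at level 9 using 16 threads if pigz is present. Either way the result can be read back with `gzip -dc hits.tsv.gz`.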


      • Richard Finney
        Senior Member
        • Feb 2009
        • 701

        #4
        bzip2 generally compresses better than gzip, at the cost of slower compression and decompression.

        Many folks have benchmarked the size and speed tradeoffs ...
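If you would rather measure the tradeoff on your own files than rely on published benchmarks, something like the following can be run on a sample. This is a sketch, not a benchmark harness: `compare_compressors` is a made-up name, it only reports compressed sizes (wrap the call in `time` to get a feel for speed), and it silently skips any tool that is not installed.

```shell
# Hypothetical helper: print the compressed size in bytes of FILE for each
# available compressor at its highest level. The input file is left untouched.
compare_compressors() (
    file=$1
    for tool in gzip bzip2; do
        # Skip tools that are not on the PATH
        command -v "$tool" >/dev/null 2>&1 || continue
        size=$(( $("$tool" -9 -c "$file" | wc -c) ))
        printf '%s\t%d\n' "$tool" "$size"
    done
)
```

A sample can be cut from a large file with `head -c`, for example the first gigabyte of a FASTA file, and then passed to `compare_compressors`. The cut may land in the middle of a record, which does not matter for a size estimate.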



        Note: (via Stack Overflow http://stackoverflow.com/questions/4...with-all-cores ) ...
        find /source -type f -print0 | xargs -0 -n 1 -P $CORES gzip -9
        to saturate your CPUs if you don't have pigz (which is a great name for a program ... "pigz" ... cool).
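For completeness, here is one way the one-liner above might be wrapped so that the core count is filled in automatically. It is a sketch under a few assumptions: `gzip_tree` is a hypothetical name, `getconf _NPROCESSORS_ONLN` is used to count CPUs (falling back to 1 if that is unavailable), and files already ending in `.gz` are skipped.

```shell
# Hypothetical wrapper: gzip every regular file under a directory,
# running several gzip processes at once.
gzip_tree() (
    src=$1
    # Number of online CPUs; fall back to 1 if getconf cannot report it
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # One gzip process per file, with up to $cores running at a time
    find "$src" -type f ! -name '*.gz' -print0 |
        xargs -0 -n 1 -P "$cores" gzip -9
)
```

Note that the parallelism here is per file, so this helps with many files rather than with one huge file. pigz, by contrast, splits a single file across cores.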
        Last edited by Richard Finney; 08-25-2014, 09:24 AM.


        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          @bio_boris: If you are going to archive to tape then you should let the backup software take care of the compression.


          • SES
            Senior Member
            • Mar 2010
            • 275

            #6
            Originally posted by Richard Finney:
            bzip2 generally compresses better than gzip, at the cost of slower compression and decompression.
            I personally prefer bzip2 for that reason (the compression part), at least for archiving data. There is a program called pbzip2 that can be installed through your package manager and this speeds things up significantly by using all your CPUs, or any number you specify.


            • manauwer
              Junior Member
              • Dec 2014
              • 3

              #7
              pbzip2 is an efficient way to compress data.
              The command below will do the task:

              tar cf outputfile_name --use-compress-program=pbzip2 inputfile_or_directory
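To round this out, here is a sketch of creating such an archive and checking that it can be read back. Assumptions: `make_tbz2` and `check_tbz2` are hypothetical helper names, and when pbzip2 is not installed the code falls back to tar's built-in `-j` option (single-threaded bzip2). pbzip2 writes standard bzip2 format, so the regular bzip2 tool can read its output.

```shell
# Hypothetical helper: archive INPUT (file or directory) into ARCHIVE,
# compressing with pbzip2 if available, otherwise with tar's own bzip2 support.
# Usage: make_tbz2 ARCHIVE INPUT
make_tbz2() (
    archive=$1
    input=$2
    if command -v pbzip2 >/dev/null 2>&1; then
        tar --use-compress-program=pbzip2 -cf "$archive" "$input"
    else
        tar -cjf "$archive" "$input"
    fi
)

# Hypothetical helper: list the archive without extracting it.
# A non-zero exit status means the archive could not be read.
check_tbz2() (
    tar -tjf "$1" >/dev/null
)
```

For a large archive it may be worth running something like `check_tbz2` before deleting the originals, since listing forces the whole archive to be decompressed and read.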
