  • Fastq compression - proof of concept

    I decided to shove my proof-of-concept code out of the door for people to experiment with, as I do not have time to take it further myself. However, I feel format-specific compression tools are well worth considering, given how staggeringly bioinformatics data has grown in the last few years.

    I have two fastq compression tools (neither is "production quality" or supported, so beware); i.e. these are experimental only.

    Code for both is on the Sanger Institute ftp site at:

    I benchmarked them on a couple of data sets and compared them with other general-purpose tools. The test was a 1 GB file of 54 bp sequences, read from and written to /dev/shm.

    Prog           Size             Encode time     Decode time
    raw            1073741745        (2.0)           (2.0)
    lzopack         499049563        11.7             5.3
    quicklz         497987198         7.7             7.7
    quicklz -3      424803464        65.1             5.5
    gzip -1         375071650        30.1            12.8
    lzopack -9      368383765       469.8             5.2
    xz -1           318229712       134.3            33.5
    gzip -6 (def.)  316890291       108.2            10.9
    szip -o3        277408698       131.6           171.3
    bsc -m0pTcpf    256937105       120.6           141.1
    xz              253249104      1438.5            29.3
    bzip2           249508099       414.9           118.6
    fastq2fqz (-3)  244921604        22.5            12.9
    fastq2fqz (-5)  238350173        27.8            13.0
    bsc -m1pTcpf    233012984       111.3           152.4
    szip -o6        232242005       295.8           233.0
    fqzcomp         229624382        22.3            55.6
    bsc -m2pTcpf    220238875       132.9           166.2
    In the above, "raw" is a UNIX cat command, for comparison. You may not have heard of some of these, but see for a comprehensive list of tools.
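For anyone wanting to reproduce this kind of comparison, here is a minimal Python sketch of such a benchmark harness using the stdlib gzip/bz2/lzma bindings. The payload is a small hypothetical stand-in, not the original 1 GB data set, and the external tools (szip, bsc, quicklz, etc.) are not covered.

```python
import bz2, gzip, lzma, time

# Hypothetical stand-in for a FASTQ file; the real benchmark used a
# 1 GB file staged in /dev/shm so disk I/O would not skew timings.
data = (b"@read1\nACGTACGTACGTN\n+\n" + b"I" * 13 + b"\n") * 10000

for name, compress, decompress in [
    ("gzip -1", lambda d: gzip.compress(d, 1), gzip.decompress),
    ("gzip -6", lambda d: gzip.compress(d, 6), gzip.decompress),
    ("bzip2",   bz2.compress,                  bz2.decompress),
    ("xz",      lzma.compress,                 lzma.decompress),
]:
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    out = decompress(packed)
    t2 = time.perf_counter()
    assert out == data  # sanity-check the round trip
    print(f"{name:8s} {len(packed):9d} bytes  enc {t1-t0:.2f}s  dec {t2-t1:.2f}s")
```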

    On a smaller set of 250000 108bp sequences, allowing me to go to town testing slower tools like paq8, we get this:

    Prog            Size            Encode(s) Decode(s)
    simple_c        34445161        2.637     9.216
    comp(0)         34165620        2.388     4.036
    gzip -3         27822202        3.140     0.822
    gzip            26441159        9.356     0.751
    xz -3           22971956        67.62     2.400
    comp1           22465448        2.335     3.364
    xz              22450796        103.5     2.509
    fastq2fqz       21595974        1.536     0.967
    bzip2           21340457        10.99     5.813
    szip            20540942        14.98     16.55
    comp2           20287020        2.935     4.737
    bsc -m2pTcpf    19365157        8.826     10.95
    fqzcomp(1Mb)    19136330        1.589     2.500
    bsc -m3pTcp     19063073        23.50     17.28
    lpaq -9         18534618        178.6     (~encode)
    paq8 -8         17730550        6043.8    (~encode)
    It's impressive to see just how well the state-of-the-art general-purpose text compression tools can do (paq), albeit at *extreme* cost in CPU time. I tend to think of these as a baseline to try and approach. Although they can be beaten with code for dedicated formats, it's typically going to be very hard to do so while still being faster than, say, bzip2.

    So the tools:


    fastq2fqz / fqz2fastq:

    These use LZ77 and Huffman encoding (via zlib and the interlaced Huffman encoder taken from the Staden Package's io_lib).

    Hence it's particularly fast at decompression, as is usual with LZ+Huffman programs. It can be tweaked to be marginally faster than gzip at decompression if we ditch the interlaced Huffman encoding for quality values and just call zlib again, but zlib's entropy encoder is far slower, so that slows down encoding and also hurts the compression ratio.

    Either way, it's an order of magnitude faster than bzip2 (both encoding and decoding) while giving comparable compression ratios.

    Note that this tool MUST have fixed-size lines and it only supports ACGTN.
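Much of the gain from format-specific tools comes from compressing each FASTQ field as its own stream, so each compressor sees homogeneous data. Here is a toy Python sketch of that idea; it is not fastq2fqz's actual container format, and plain zlib stands in for its interlaced Huffman coder.

```python
import zlib

def split_streams(fastq_text):
    """Split FASTQ records into separate name/sequence/quality streams.

    Compressing homogeneous streams separately usually beats compressing
    the interleaved file, since names, bases, and qualities each have
    very different statistics.
    """
    names, seqs, quals = [], [], []
    lines = fastq_text.strip().split("\n")
    for i in range(0, len(lines), 4):
        names.append(lines[i])
        seqs.append(lines[i + 1])   # fastq2fqz requires fixed-length ACGTN here
        quals.append(lines[i + 3])
    return ("\n".join(names), "\n".join(seqs), "\n".join(quals))

fastq = "@r1\nACGTN\n+\nIIIII\n@r2\nGGCCA\n+\nHHIHH\n"
blocks = [zlib.compress(s.encode(), 9) for s in split_streams(fastq)]
restored = [zlib.decompress(b).decode() for b in blocks]
```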


    fqzcomp:

    For this I experimented with probabilistic modelling and a range coder for entropy encoding. I chose to use Michael Schindler's example source for this from

    The compression performance is very good. Encoding speed is particularly good, even beating gzip -1, but decoding speed is unfortunately about half that of encoding, so it's quite slow compared to many tools. I know there are faster entropy encoders out there, so I'm sure there is room for improvement on speed. Even so, it runs fast compared to tools with comparable compression ratios.

    The fqzcomp program should support variable-length sequences, unlike fastq2fqz. I'm not sure what DNA letters it accepts, but probably anything.
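As a rough illustration of what probabilistic modelling plus a range coder buys you, here is a toy order-0 adaptive model in Python that computes the ideal coded size in bits. This is illustrative only: fqzcomp's real models are higher-order context models feeding an actual range coder.

```python
import math

class AdaptiveModel:
    """Order-0 adaptive frequency model of the kind a range coder consumes."""

    def __init__(self, alphabet):
        self.freq = {s: 1 for s in alphabet}   # start with uniform counts

    def cost_bits(self, symbol):
        # Ideal code length for this symbol under the current model.
        total = sum(self.freq.values())
        return -math.log2(self.freq[symbol] / total)

    def update(self, symbol):
        self.freq[symbol] += 1                 # adapt to the data seen so far

data = "ACGTACGTAAAACCCC"
model = AdaptiveModel("ACGTN")
bits = 0.0
for s in data:
    bits += model.cost_bits(s)                 # what a range coder would emit
    model.update(s)
print(f"{bits:.1f} bits vs {8 * len(data)} raw bits")
```

The skew toward A and C in the tail of the data is learned on the fly, which is why the modelled size comes in well under 8 bits per symbol even with no training pass.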


    edit: fixed link to the new version of Matt Mahoney's chart.
    Last edited by jkbonfield; 08-10-2010, 04:43 AM.

  • #2
    Very interesting, I will give them a try.

    Something to look at: there is a parallel implementation of bzip2


    • #3
      What would be really nice would be for some of these options to be available in the downstream tools themselves -- e.g. bwa & bowtie (as far as I know) need the input FASTQs decompressed. It would certainly be convenient if they could read the compressed formats (though bwa with short reads ends up reading them twice, so the overhead of decompressing twice might not be worth it).
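One simple way downstream tools could accept compressed FASTQ is to dispatch on the file extension when opening input. A hypothetical Python helper sketching the idea (this is not how bwa or bowtie are actually implemented):

```python
import bz2, gzip, lzma

def open_fastq(path):
    """Open a FASTQ file, transparently decompressing by file extension.

    Hypothetical sketch of what an aligner could do internally; real tools
    such as bwa link against zlib directly for gzip'ed input instead.
    """
    openers = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
    for ext, opener in openers.items():
        if path.endswith(ext):
            return opener(path, "rt")
    return open(path, "rt")
```

Used as `with open_fastq("reads.fastq.gz") as fh:`, the caller iterates over records without caring how the file was compressed.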


      • #4
        The bsc tool is also parallel, both multi-threaded and mpi capable. I disabled it for the purposes of benchmarking though, to be fair. See for more details. I've been quite impressed with it so far.



        • #5
          bwa has supported gzip'ed fastq for nearly two years. A minor modification can make it work with bzip'ed or bsc'ed fastq files, although by design bwa cannot support multiple compression algorithms at the same time. Maq's gzip support came later and is only available in SVN. Bowtie accepts piping, so whether it supports compression directly does not matter too much.

          BTW, I did not know bsc before, but it looks very impressive to me, too.

          EDIT: a lot of free compressors (e.g. quicklz, bsc and rangecoder) are licensed under GPL or LGPL. This becomes annoying when we want to release source code under a permissive open source license (e.g. BSD and MIT/X11) such that everyone can use the library/tool freely. Another similar practical issue is the availability of other language bindings. gzip is by far the most widely supported library.
          Last edited by lh3; 08-10-2010, 12:00 PM.


          • #6
            Bfast has also supported gzip and bzip2 for a long time.


            • #7
              Yeah GPL can be a pain like that at times.

              For what it's worth, I'm happy to release fastq2fqz and fqz2fastq under BSD. It's a fairly trivial mix of zlib and Staden io_lib anyway, both of which are already BSD.

              The fqzcomp code was based on GPL code, although the basic design of what it does is trivial enough to rewrite using a more free library. (Hah! "more free" - that'll wind up the GPL crowd). I doubt I'd ever get the time though.


              PS. I'm totally with you on gzip being ubiquitous in language bindings. It's also incredibly fast at decompression compared to most, so it's ideal for a lot of our use cases. It's good to see many tools using at least some sort of on-the-fly compression.

