  • Fastq compression - proof of concept

    I decided to push my proof-of-concept code out of the door for people to experiment with, as I do not have time to take this further myself. However, I feel format-specific compression tools are well worth considering, given how staggeringly bioinformatics data has grown in the last few years.

    I have two fastq compression tools. Neither is "production quality" or supported, so beware: these are experimental only.

    Code for both is on the Sanger Institute ftp site at:

    I benchmarked them on a couple of data sets and compared them with other general-purpose tools. The first test used a 1GB file read from and written to /dev/shm, with 54bp sequences.

    Prog           Size             Encode time     Decode time
    raw            1073741745        (2.0)           (2.0)
    lzopack         499049563        11.7             5.3
    quicklz         497987198         7.7             7.7
    quicklz -3      424803464        65.1             5.5
    gzip -1         375071650        30.1            12.8
    lzopack -9      368383765       469.8             5.2
    xz -1           318229712       134.3            33.5
    gzip -6 (def.)  316890291       108.2            10.9
    szip -o3        277408698       131.6           171.3
    bsc -m0pTcpf    256937105       120.6           141.1
    xz              253249104      1438.5            29.3
    bzip2           249508099       414.9           118.6
    fastq2fqz (-3)  244921604        22.5            12.9
    fastq2fqz (-5)  238350173        27.8            13.0
    bsc -m1pTcpf    233012984       111.3           152.4
    szip -o6        232242005       295.8           233.0
    fqzcomp         229624382        22.3            55.6
    bsc -m2pTcpf    220238875       132.9           166.2
    In the above, raw is a UNIX cat command, for comparison. You may never have heard of some of these, but see Matt Mahoney's chart for a comprehensive list of tools.
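
    As a rough illustration of how a table like this can be produced, here is a minimal Python sketch that times a few general-purpose compressors on an in-memory FASTQ buffer. It is not the harness used for the numbers above: make_fastq, CODECS and benchmark are hypothetical names, the data is synthetic, and only compressors from the Python standard library are covered.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=54, seed=0):
    """Build a small synthetic FASTQ buffer (hypothetical test data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(2, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


# name -> (encode function, decode function); stdlib codecs only.
CODECS = {
    "zlib -1": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "zlib -6": (lambda d: zlib.compress(d, 6), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz -1": (lambda d: lzma.compress(d, preset=1), lzma.decompress),
}


def benchmark(data):
    """Return (name, size, encode_s, decode_s) rows, largest output first."""
    rows = []
    for name, (encode, decode) in CODECS.items():
        t0 = time.perf_counter()
        packed = encode(data)
        t1 = time.perf_counter()
        restored = decode(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise RuntimeError(f"{name} failed to round-trip")
        rows.append((name, len(packed), t1 - t0, t2 - t1))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    data = make_fastq()
    print(f"{'Prog':<10}{'Size':>10}{'Encode(s)':>12}{'Decode(s)':>12}")
    print(f"{'raw':<10}{len(data):>10}")
    for name, size, enc, dec in benchmark(data):
        print(f"{name:<10}{size:>10}{enc:>12.3f}{dec:>12.3f}")
```

    Working on an in-memory buffer plays the same role as /dev/shm in the test above: it keeps disk I/O out of the timings. Random synthetic reads compress differently from real data, so only the shape of the harness carries over, not the numbers.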

    On a smaller set of 250000 108bp sequences, allowing me to go to town testing slower tools like paq8, we get this:

    Prog            Size            Encode(s) Decode(s)
    simple_c        34445161        2.637     9.216
    comp(0)         34165620        2.388     4.036
    gzip -3         27822202        3.140     0.822
    gzip            26441159        9.356     0.751
    xz -3           22971956        67.62     2.400
    comp1           22465448        2.335     3.364
    xz              22450796        103.5     2.509
    fastq2fqz       21595974        1.536     0.967
    bzip2           21340457        10.99     5.813
    szip            20540942        14.98     16.55
    comp2           20287020        2.935     4.737
    bsc -m2pTcpf    19365157        8.826     10.95
    fqzcomp(1Mb)    19136330        1.589     2.500
    bsc -m3pTcp     19063073        23.50     17.28
    lpaq -9         18534618        178.6     (~encode)
    paq8 -8         17730550        6043.8    (~encode)
    It's impressive to see just how well the state-of-the-art general-purpose text compression tools can do (paq), albeit at *extreme* cost in CPU time. I tend to think of these as a baseline to try to approach. Although they can be beaten with code for dedicated formats, it's typically going to be very hard to do so while still being faster than, say, bzip2.
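
    One way to read the size/speed trade-off in the smaller benchmark is to ask which tools are not beaten on both compressed size and encode time by some other tool. The sketch below does that for the second table; the names, sizes and encode times are copied from it, while RESULTS and pareto_front are hypothetical names introduced for illustration.

```python
# (tool, compressed size in bytes, encode time in seconds), from the
# 250000 x 108bp table above.
RESULTS = [
    ("simple_c", 34445161, 2.637),
    ("comp(0)", 34165620, 2.388),
    ("gzip -3", 27822202, 3.140),
    ("gzip", 26441159, 9.356),
    ("xz -3", 22971956, 67.62),
    ("comp1", 22465448, 2.335),
    ("xz", 22450796, 103.5),
    ("fastq2fqz", 21595974, 1.536),
    ("bzip2", 21340457, 10.99),
    ("szip", 20540942, 14.98),
    ("comp2", 20287020, 2.935),
    ("bsc -m2pTcpf", 19365157, 8.826),
    ("fqzcomp(1Mb)", 19136330, 1.589),
    ("bsc -m3pTcp", 19063073, 23.50),
    ("lpaq -9", 18534618, 178.6),
    ("paq8 -8", 17730550, 6043.8),
]


def pareto_front(rows):
    """Return tools that no other tool beats on both size and encode time."""
    front = []
    for name, size, enc in rows:
        dominated = any(
            s <= size and e <= enc and (s < size or e < enc)
            for _, s, e in rows
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    for name in pareto_front(RESULTS):
        print(name)
```

    On these numbers the front comes out as fastq2fqz, fqzcomp(1Mb), bsc -m3pTcp, lpaq -9 and paq8 -8. Decode time is ignored here, so this is only one slice of the comparison.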

    So the tools:


    fastq2fqz and fqz2fastq use LZ77 and Huffman encoding (both via zlib and via the interlaced Huffman encoder taken from the Staden Package's io_lib).

    Hence it is particularly fast at decompression, as is usual with LZ+Huffman programs. It can be tweaked to be marginally faster than gzip at decompression if we ditch the interlaced Huffman encoding for quality values and just call zlib again, but zlib's entropy encoder is far slower, so encoding slows down and the compression ratio suffers too.

    Either way, it's an order of magnitude faster than bzip2 (both encoding and decoding) while giving comparable compression ratios.

    Note that this tool MUST have fixed size lines and it only supports ACGTN.
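
    To make the general idea concrete, here is a minimal Python sketch of a zlib-only FASTQ packer that handles names, bases and qualities as separate streams. It is not the fastq2fqz code or file format: the stream split is an assumption made for illustration, the interlaced Huffman coder from io_lib is not reproduced, and split_streams, compress_streams and decompress_streams are hypothetical names. It enforces the same two restrictions (fixed-length reads, ACGTN only).

```python
import zlib


def split_streams(fastq_text):
    """Split 4-line FASTQ records into name, base and quality streams."""
    lines = fastq_text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty multiple of 4 lines")
    names, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        name, seq, plus, qual = lines[i:i + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if set(seq) - set("ACGTN"):
            raise ValueError("only ACGTN is supported")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        names.append(name[1:])
        seqs.append(seq)
        quals.append(qual)
    read_len = len(seqs[0])
    if any(len(s) != read_len for s in seqs):
        raise ValueError("all reads must have the same length")
    # Fixed-length reads can be concatenated without separators.
    return read_len, "\n".join(names), "".join(seqs), "".join(quals)


def compress_streams(fastq_text, level=6):
    """Compress each stream separately with zlib (LZ77 + Huffman)."""
    read_len, names, seqs, quals = split_streams(fastq_text)
    blobs = [zlib.compress(s.encode("ascii"), level) for s in (names, seqs, quals)]
    return read_len, blobs


def decompress_streams(read_len, blobs):
    """Rebuild the FASTQ text; the '+' line is written back bare."""
    names, seqs, quals = (zlib.decompress(b).decode("ascii") for b in blobs)
    out = []
    for i, name in enumerate(names.split("\n")):
        lo = i * read_len
        hi = lo + read_len
        out.append(f"@{name}\n{seqs[lo:hi]}\n+\n{quals[lo:hi]}\n")
    return "".join(out)
```

    In this sketch the fixed read length is what lets the bases and qualities be stored as one unbroken string each, with record boundaries recovered by arithmetic on decode. A read name repeated on the '+' line would not survive the round trip.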


    For fqzcomp I experimented with probabilistic modelling and a range coder for entropy encoding. I chose to use Michael Schindler's example source for this from

    The compression performance is very good. Encoding speed is particularly good, even beating gzip -1, but decoding unfortunately takes about twice as long as encoding, so it's quite slow compared to many tools. I know there are faster entropy encoders out there, so I'm sure there is room for improvement on the speed. Even so, it runs fast compared to tools with comparable compression ratios.

    The fqzcomp program should support variable-length sequences, unlike fastq2fqz. I'm not sure which DNA letters it accepts, but probably anything.
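
    For readers unfamiliar with probabilistic modelling, here is a minimal Python sketch of an adaptive order-k context model over bases. It is not fqzcomp's model and it contains no range coder: instead of emitting bits it sums -log2(p) for each base, which is the ideal code length that a range coder driven by the same probabilities would approach. The function name model_cost_bits, the default order and the count initialisation are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def model_cost_bits(seq, order=4, alphabet="ACGTN"):
    """Estimate the coded size of seq, in bits, under an adaptive model.

    Each base is predicted from the previous `order` bases. Counts start
    at 1 for every symbol so that nothing ever has zero probability.
    Bases outside `alphabet` raise KeyError.
    """
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    total_bits = 0.0
    context = ""
    for base in seq:
        table = counts[context]
        p = table[base] / sum(table.values())
        total_bits += -math.log2(p)   # ideal cost of coding this base
        table[base] += 1              # adapt: the decoder can do the same
        context = (context + base)[-order:] if order else ""
    return total_bits


if __name__ == "__main__":
    repetitive = "ACGT" * 200
    for k in (0, 2, 4):
        bits = model_cost_bits(repetitive, order=k)
        print(f"order {k}: {bits / len(repetitive):.3f} bits/base")
```

    Because the model is updated only after each base has been costed, a decoder holding the same counts can reproduce every probability, which is what makes this kind of scheme decodable. It also hints at why decoding is not cheaper than encoding: the decoder has to do the same modelling work.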


    edit: fixed link to the new version of Matt Mahoney's chart.
    Last edited by jkbonfield; 08-10-2010, 04:43 AM.

  • #2
    Very interesting, I will give them a try.

    Something to look at: there is a parallel implementation of bzip2


    • #3
      What would be really nice would be for some of these options to be available in the downstream tools themselves -- e.g. bwa & bowtie (as far as I know) need the input FASTQs decompressed. It would certainly be convenient if they could read the compressed formats (though bwa with short reads ends up reading them twice, so the overhead of decompressing twice might not be worth it).


      • #4
        The bsc tool is also parallel, both multi-threaded and mpi capable. I disabled it for the purposes of benchmarking though, to be fair. See for more details. I've been quite impressed with it so far.



        • #5
          bwa has supported gzip'ed fastq for nearly two years. A minor modification can make it work with bzip'ed or bsc'ed fastq files, although by design bwa cannot support multiple compression algorithms at the same time. Maq's gzip support came later and is only available in SVN. Bowtie accepts piping, so whether it supports compression itself does not matter too much.

          BTW, I did not know bsc before, but it looks very impressive to me, too.

          EDIT: a lot of free compressors (e.g. quicklz, bsc and rangecoder) are licensed under GPL or LGPL. This becomes annoying when we want to release source code under a permissive open source license (e.g. BSD and MIT/X11) such that everyone can use the library/tool freely. Another similar practical issue is the availability of other language bindings. gzip is by far the most widely supported library.
          Last edited by lh3; 08-10-2010, 12:00 PM.


          • #6
            Bfast has also supported gzip and bzip2 for a long time.


            • #7
              Yeah GPL can be a pain like that at times.

              For what it's worth, I'm happy to release fastq2fqz and fqz2fastq under BSD. It's a fairly trivial mix of zlib and Staden io_lib anyway, both of which are already BSD.

              The fqzcomp code was based on GPL code, although the basic design of what it does is trivial enough to rewrite using a more free library. (Hah! "more free" - that'll wind up the GPL crowd). I doubt I'd ever get the time though.


              PS. I'm totally with you on gzip being ubiquitous in language bindings. It's also incredibly fast at decompression compared to most, so it's ideal for a lot of our use cases. It's good to see many tools using at least some sort of on-the-fly compression.
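
              As an illustration of on-the-fly decompression in a downstream tool, here is a minimal Python sketch that reads a FASTQ file whether or not it is gzip-compressed, deciding by the two-byte gzip magic number rather than the file name. It does not reflect how bwa, Bowtie or Bfast do it; open_fastq and read_fastq are hypothetical names, and simple 4-line records are assumed.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file for text reading."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":          # gzip magic number
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "rt", encoding="ascii")


def read_fastq(handle):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()             # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield name[1:], seq, qual


if __name__ == "__main__":
    import sys

    with open_fastq(sys.argv[1]) as fh:
        n = sum(1 for _ in read_fastq(fh))
    print(f"{n} records")
```

              The same read_fastq generator works on any text handle, so it could equally be fed from a pipe, which is the route mentioned above for Bowtie.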

