
FastQ/BAM compression


  • #16
    CRAM has both lossy and lossless modes. My own C library currently only supports lossless encoding (but can handle decoding of lossily encoded CRAM files). Vadim's Java provides options for both lossy and lossless encoding.

    As for maturity - I'd say it's pretty close now with CRAM v2.0. I'm biased of course[1], but try the latest Staden io_lib package and run the "scramble" command once built:

Approx 1 GB BAM file:
    jkb[/tmp] ls -l 6714_6#1.bam
    -rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

    Locally specified reference (scramble will use the UR:file: field or access the EBI's MD5 server to pull down the reference automatically; otherwise use -r to specify the .fa location). Redacted slightly because I've no idea if this is public data or not.
    jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
    @SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>
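For anyone wondering where the M5 value in that @SQ line comes from: per the SAM specification, it is the MD5 digest of the reference sequence with the FASTA header line removed, whitespace stripped, and bases uppercased. A minimal sketch with an invented toy.fa (not the redacted reference above):

```shell
# Sketch of the SAM-spec M5 rule: MD5 of the reference sequence with the
# FASTA header dropped, whitespace removed, and bases uppercased.
# "toy.fa" is a made-up two-line example, not real data.
printf '>toy\nacgt\nACGT\n' > toy.fa
m5=$(grep -v '^>' toy.fa | tr -d ' \n' | tr 'a-z' 'A-Z' | md5sum | cut -d ' ' -f1)
echo "$m5"
```

This is what lets tools like scramble match a local FASTA (or the EBI MD5 server) against the header without trusting file paths.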

    Convert to CRAM losslessly, 38% less disk space used:
    jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
    real 2m37.763s
    user 2m31.753s
    sys 0m3.564s
jkb[/tmp] ls -l 6714_6#1.cram
    -rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram
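The 38% figure can be re-derived from the two ls listings above:

```shell
# Re-deriving the space saving from the two file sizes listed above.
# Integer shell arithmetic rounds down, so this prints 37; the post's
# 38% comes from rounding 37.7% up.
bam=977124408
cram=608320844
saving=$(( (bam - cram) * 100 / bam ))
echo "${saving}% smaller"
```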

    Convert back to BAM again. "-m" indicates to generate MD and NM tags:
jkb[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
    real 3m10.728s
    user 3m3.043s
    sys 0m4.652s

    I then compared the differences. There *are* some, but these are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? everyone deals with it differently - leftmost/rightmost vs 5' ends).

There's a comparison script in the io_lib tests subdirectory. It's not intended as an end-user program, so it lacks documentation, but feel free to look at the source for the command-line options. Note that it works on SAM, not BAM.

Edit: running it with -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9899053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs): an unmapped read carrying MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these, as we auto-generate MD and NM on extraction but obviously cannot do so for unmapped reads. The difference was therefore due to a bug in the original aligner's output.
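A self-contained illustration of the "ignore TLEN" idea (not the io_lib script itself): two SAM records that differ only in TLEN, which is field 9, compare equal once that field is blanked. The records below are invented:

```shell
# Two invented SAM records differing only in TLEN (field 9); blanking
# that field before diffing treats them as identical, mirroring the
# TLEN-insensitive comparison described above.
printf 'r1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\n' > a.sam
printf 'r1\t99\tchr1\t100\t60\t5M\t=\t200\t-105\tACGTA\tIIIII\n' > b.sam
awk 'BEGIN{FS=OFS="\t"} !/^@/{$9="."} 1' a.sam > a.norm
awk 'BEGIN{FS=OFS="\t"} !/^@/{$9="."} 1' b.sam > b.norm
if diff -q a.norm b.norm >/dev/null; then result="same"; else result="different"; fi
echo "$result"
```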

    [1] Obviously Vadim's Java (and the original) CRAM implementation is available at
    Last edited by jkbonfield; 05-15-2013, 12:42 AM.


    • #17
Originally posted by narain:
But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM mature enough now to do lossless compression of FASTQ and BAM files with random access, such as BAM files give?
I forgot to add: CRAM supports random access too. I have a cram_index program to create .crai files, and scramble can then use these for random access. In a recent test, the total number of seek and read system calls for random access within a CRAM file turned out to be fewer than for the analogous BAM file.

      This random access code hasn't been extensively tested yet, but it looks to be working in principle and is demonstrably efficient.
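The indexing step can be sketched as below; the exact cram_index invocation is an assumption (the post only says it creates .crai files), and the block is a no-op where io_lib is not installed:

```shell
# Hedged sketch of the .crai indexing workflow described above.
# cram_index is an io_lib program; its invocation here is assumed, and
# the guard makes this a no-op where io_lib is not installed.
if command -v cram_index >/dev/null 2>&1; then
  cram_index 6714_6#1.cram   # assumed to write 6714_6#1.cram.crai
  note="indexed"
else
  note="io_lib not installed; command shown for illustration only"
fi
echo "$note"
```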

      Finally, long term my C CRAM implementation will end up in samtools and/or HTSlib. I already have a fork of samtools that provides CRAM reading and writing support, but only via the samopen() unified interface rather than the SAM specific sam_open() call or BAM specific bam_open() call. Practically speaking this means samtools view works, but samtools pileup does not (as pileup won't work on SAM either). These are the issues that we will be addressing over the summer.
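For readers arriving later: this integration did happen, and current samtools (via HTSlib) reads and writes CRAM natively. A hedged sketch using real samtools view flags (-C writes CRAM, -T supplies the reference) with placeholder file names (ref.fa, aln.bam), guarded so it is a no-op where samtools is absent:

```shell
# -C writes CRAM and -T supplies the reference FASTA; ref.fa and aln.bam
# are placeholders, so the conversion is only attempted, and the block is
# a no-op where samtools is not installed.
if command -v samtools >/dev/null 2>&1; then
  samtools view -C -T ref.fa -o aln.cram aln.bam 2>/dev/null \
    && status="converted" || status="attempted (inputs are placeholders)"
else
  status="samtools not installed; command shown for illustration only"
fi
echo "$status"
```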


      • #18
You might want to try my program Genozip. It is often better than CRAM.


        • #19
          Thanks Andrey for the question. A few points where I think Genozip provides some benefits over CRAM:

          1. Similar to CRAM, Genozip compresses each field of the SAM/BAM data separately, with the best codec for the particular type of data applied to each field. However, Genozip goes beyond that, and also leverages correlations *between* fields to further eliminate information redundancies. As a result, the compressed file is about 20% smaller than CRAM (according to our benchmark in the paper).

          2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

          3. It is able to compress & archive whole directories directly into a tar file, eg: genozip *.bam --tar mydata.tar

          4. It is highly scalable with cores - it has been tested to scale up to 100+ cores.

          5. Genozip can compress BAM with or without a reference file, while CRAM requires a reference file. Compressing with a reference file in Genozip improves the compression ratio, in particular for low-coverage data, but for high-coverage data (eg 30x) Genozip can reach almost the same compression ratio without a reference file.

6. Genozip, through the command genocat, provides some interesting capabilities. Some are similar to samtools, and some are unique; for example, directly filtering contamination out of a BAM file using kraken2 data.
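A toy illustration of the redundancy idea behind point 1 (this is not Genozip's actual codec): in coordinate-sorted data the POS column delta-encodes into mostly small values, which generic codecs compress far better than the raw positions:

```shell
# Toy illustration only, not Genozip's real codec: a sorted POS column
# delta-encodes into mostly small values. The positions are invented.
printf '100\n105\n105\n132\n' > pos.txt
deltas=$(awk '{print $1 - prev; prev = $1}' pos.txt | paste -s -d, -)
echo "$deltas"   # prints 100,5,0,27
```

Exploiting correlations between fields (not just within one field, as here) is what the 20% claim above rests on.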

          See the publication here:

          And the software documentation here:
          Last edited by divon; 07-23-2021, 07:27 PM.


          • #20
Thanks for the prompt answer!
            Last edited by jindalashu434; 08-03-2021, 12:34 AM.


            • #21
              Some new Genozip benchmarks:

