Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    CRAM has both lossy and lossless modes. My own C library currently only supports lossless encoding (but can handle decoding of lossily encoded CRAM files). Vadim's Java provides options for both lossy and lossless encoding.

    As for maturity - I'd say it's pretty close now with CRAM v2.0. I'm biased of course[1], but try the latest Staden io_lib package and run the "scramble" command once built:

    A fully developed set of DNA sequence assembly (Gap4 and Gap5), editing and analysis tools (Spin) for Unix, Linux, MacOSX and MS Windows.


    Approx 1Gb bam file:
    jkb[/tmp] ls -l 6714_6#1.bam
    -rw-r--r-- 1 jkb team117 977124408 Apr 23 10:20 6714_6#1.bam

    Locally specified reference (scramble will use the UR:file: field or access the EBI's MD5 server to pull down the reference automatically; otherwise use -r to specify the .fa location). Redacted slightly because I've no idea if this is public data or not.
    jkb[/tmp] samtools view -H 6714_6#1.bam | egrep '^@SQ'
    @SQ SN:<...> LN:2892523 UR:file:/nfs/srpipe_references/references/<...> M5:76f500<...>
    <...>

    Convert to CRAM losslessly, 38% less disk space used:
    jkb[/tmp] time ./io_lib-1.13.1/progs/scramble 6714_6#1.bam 6714_6#1.cram
    real 2m37.763s
    user 2m31.753s
    sys 0m3.564s
    jkb@deskpro102485[/tmp] ls -l 6714_6#1.cram
    -rw-r--r-- 1 jkb team117 608320844 May 15 09:23 6714_6#1.cram

    Convert back to BAM again. "-m" indicates to generate MD and NM tags:
    jkb@deskpro102485[/tmp] time ./io_lib-1.13.1/progs/scramble -m 6714_6#1.cram 6714_6#1.cram.bam
    real 3m10.728s
    user 3m3.043s
    sys 0m4.652s

    I then compared the differences. There *are* some, but these are restricted to nonsensical things (CIGAR strings for unmapped data) or ambiguities in the SAM specification (what exactly does TLEN really mean? everyone deals with it differently - leftmost/rightmost vs 5' ends).

    There's a compare_sam.pl script in the io_lib tests subdirectory. It's not expected to be an end-user program so lacks documentation, but feel free to look at the source for the command line options. It needs SAM and not BAM.

    Edit: running compare_sam.pl -notemplate 6714_6#1.sam 6714_6#1.cram.sam got 9899053 lines into the SAM files before detecting the first difference (ignoring TLEN diffs), which was an unmapped read having MD:Z:72T2 NM:i:1 tags. The .cram.sam file didn't have these as we auto-generate MD and NM on extraction, but obviously cannot do this for unmapped files. The difference was therefore due to a bug in the original aligner output.

    [1] Obviously Vadim's Java (and the original) CRAM implementation is available at http://www.ebi.ac.uk/ena/about/cram_toolkit
    Last edited by jkbonfield; 05-15-2013, 12:42 AM.

    Comment


    • #17
      Originally posted by narain View Post
      But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM, and introduces false positive and false negatives ? Is CRAM mature now to do a lossless compression from FASTQ and BAM files with random access such as BAM files give ?
      I forgot to add, CRAM supports random access too. I have a cram_index program to create .crai files and then scramble can use these for random access. On a test I did recently it turned out that total number of seek and read system calls from random access within a cram file turned out to be fewer than it was on the analogous bam file.

      This random access code hasn't been extensively tested yet, but it looks to be working in principle and is demonstrably efficient.

      Finally, long term my C CRAM implementation will end up in samtools and/or HTSlib. I already have a fork of samtools that provides CRAM reading and writing support, but only via the samopen() unified interface rather than the SAM specific sam_open() call or BAM specific bam_open() call. Practically speaking this means samtools view works, but samtools pileup does not (as pileup won't work on SAM either). These are the issues that we will be addressing over the summer.

      Comment


      • #18
        You might want to try my program Genozip (www.genozip.com). It is often better than CRAM.

        Comment


        • #19
          Thanks Andrey for the question. A few points where I think Genozip provides some benefits over CRAM:

          1. Similar to CRAM, Genozip compresses each field of the SAM/BAM data separately, with the best codec for the particular type of data applied to each field. However, Genozip goes beyond that, and also leverages correlations *between* fields to further eliminate information redundancies. As a result, the compressed file is about 20% smaller than CRAM (according to our benchmark in the paper).

          2. Genozip is not specific to SAM data - it can compress FASTQ, VCF and other genomic formats.

          3. It is able to compress & archive whole directories directly into a tar file, eg: genozip *.bam --tar mydata.tar

          4. It is highly scalable with cores - it has been tested to scale up to 100+ cores.

          5. Genozip can compress BAM with or without a reference file, while CRAM requires a reference file. Compressing with a reference file in Genozip improves the compression ratio, in particular for low-coverage data, but for high-coverage data (eg 30x) Genozip can reach almost the same compression ratio without a reference file.

          6. Genozip, through the command genocat, provides some interesting capabilities. Some of them similar to samtools, and some unique - for example, directly filtering out contamination from a BAM file using kraken2 data.

          See the publication here: https://www.researchgate.net/publica...ata_Compressor

          And the software documentation here: https://www.genozip.com
          Last edited by divon; 07-23-2021, 07:27 PM.

          Comment


          • #20
            Thanks for the prompt answer! mx player
            Last edited by jindalashu434; 08-03-2021, 12:34 AM.

            Comment


            • #21
              Some new Genozip benchmarks:
              Genozip is the best compressor for FASTQ, BAM and VCF data, for a wide range of cases. See some benchmarks.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-27-2024, 06:37 PM
              0 responses
              13 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-27-2024, 06:07 PM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              69 views
              0 likes
              Last Post seqadmin  
              Working...
              X