FastQ/BAM compression

  • FastQ/BAM compression

    Does anybody know of a more recent comparison of algorithms for FastQ/BAM compression than this thread?
    http://seqanswers.com/forums/showthread.php?t=6349

    Best

  • #2
    CRAM is reference-based compression, so it may or may not be of interest to you. http://www.ebi.ac.uk/ena/about/cram_toolkit/
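
    Typical usage is along these lines (a sketch only; the flag names are as I remember them from the cramtools docs and may differ between versions, and the file names are placeholders):
    Code:
    java -jar cramtools.jar cram \
        --input-bam-file input.bam \
        --reference-fasta-file hg19.fa \
        --output-cram-file output.cram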

  • #3
    Thanks for the prompt answer!

    I just tried CRAM today. The compression ratio is extremely impressive, but it's too slow for my needs.

  • #4
    tir_al,

    What is the compression ratio? Curious to see before I take a dive with my own data.

  • #5
    I tried it on a paired-end BAM file of about 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.
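    That works out to roughly 14:1 (3.6 GB ≈ 3686 MB; 3686 / 257 ≈ 14), so the CRAM file is about 7% of the original BAM.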

  • #6
    Originally posted by tir_al View Post
    I tried it on a paired-end BAM file of about 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.
    Sounds like a promising way to store genomic data in the long run if indexed to hg19?

  • #7
    Yeah. Preferably for storing old projects.

  • #8
    Originally posted by tir_al View Post
    Yeah. Preferably for storing old projects.
    Are you sure the effort is going to be worthwhile compared to the plain old tar/gzip combination?

    If you are looking at thousands of samples a year, then perhaps it is.
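
    (By plain old tar/gzip I mean just something like this, as a baseline:
    Code:
    tar -czf old_projects.tar.gz old_projects/
    )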

  • #9
    I currently have no options, and no room for new disk space.

  • #10
    Hi all, I can't figure out how to specify lossless compression using cramtools (i.e. retain all quality score info); can someone help me out? In the NGC paper they state a few flags which seem to be discontinued in 1.0. Presumably I specify it using --lossy-quality-score-spec, but I can't figure out how to set it to 'any/all'. I appreciate any help/ideas on this. Also, if I am missing the point and the compression inherently removes quality scores, I apologise in advance; I am a n00b to the area =P
    Last edited by bruce01; 01-11-2013, 02:55 AM.

  • #11
    Hi,

    I'm in the same situation as Bruce. I want to compress while keeping the base call qualities, but I can't figure out how...

    Just did a naïve try:
    Code:
    --lossy-quality-score-spec all
    and got:
    Code:
    Exception in thread "main" java.lang.RuntimeException: Uknown read or base category: a
            at net.sf.cram.lossy.QualityScorePreservation.parseSinglePolicy(QualityScorePreservation.java:138)
    So apparently you can specify a read name to keep the quality (not of use in my case, as I want to keep all of them) and a base category. But what is the base category? I also tried a numeric value in case it referred to an index in the read, but with a similar result.

    Thanks!
    Pablo.

  • #12
    Hi Pablo,

    I found it in the archives of the CRAM mailing list; the call includes -L m999 (the -L flag is your --lossy-quality-score-spec above). All reads are retained, but you lose columns 12+ ('info'). This isn't an issue for me, and the compressed CRAM file for a 1 GB BAM is 600 MB, which is pretty good!
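
    For anyone searching later, the full call would then look something like this (a sketch only; the file names are placeholders and the long flag names are from my memory of cramtools 1.0):
    Code:
    java -jar cramtools.jar cram \
        --input-bam-file input.bam \
        --reference-fasta-file hg19.fa \
        --output-cram-file output.cram \
        -L m999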

  • #13
    Thanks Bruce,

    It's running!
    For columns 12+, I guess with the --capture-tags option you can keep tags such as the read group, which is usually important.
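    Something like this, I assume (the exact tag-list syntax is a guess on my part; check the cramtools help):
    Code:
    --capture-tags RG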


    Regards,
    Pablo.

  • #14
    I just noticed this thread, rather late.

    There is CRAM from EBI, which has long-term support and handles random access. It's the most direct competitor to BAM, I would guess.

    Alternatives are Goby (similar ratios, but even slower in my experience), Quip (faster encoding, great compression ratio, but no(?) random access) and SamComp1/2 (faster encoding, great compression ratio, no random access, and not really a full implementation of the SAM spec - more of a FASTQ compressor). Finally, on that topic, there are tools like Quip again, fqzcomp and fastqz for compressing FASTQ data. [All three of these were SequenceSqueeze competition entries.]
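
    For the FASTQ tools the usage is simple; fqzcomp, for example, is along these lines (from memory, so check the README; file names are placeholders):
    Code:
    fqz_comp in.fastq out.fqz      # compress
    fqz_comp -d out.fqz copy.fastq # decompress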

  • #15
    But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM now mature enough to do lossless compression of FASTQ and BAM files, with random access like BAM files give?
