FastQ/BAM compression

  • FastQ/BAM compression

    Does anybody know of a more recent comparison of algorithms for FastQ/BAM compression than this thread?
    http://seqanswers.com/forums/showthread.php?t=6349

    Best

  • #2
    CRAM is reference-based compression, so it may or may not be of interest to you. http://www.ebi.ac.uk/ena/about/cram_toolkit/
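
    Typical usage is along these lines (a sketch only; the flag names are as I remember them from the cramtools docs and may differ between versions, and the file names are placeholders):
    Code:
    java -jar cramtools.jar cram \
        --input-bam-file input.bam \
        --reference-fasta-file hg19.fa \
        --output-cram-file output.cram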

  • #3
    Thanks for the prompt answer!

    I just tried CRAM today. The compression ratio is extremely impressive, but it's too slow for my needs.

  • #4
    tir_al,

    What is the compression ratio? Curious to see before I take a dive with my own data.

  • #5
    I tried it on a paired-end BAM file of about 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.
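    That works out to roughly 14:1 (3.6 GB ≈ 3686 MB; 3686 / 257 ≈ 14), so the CRAM file is about 7% of the original BAM.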

  • #6
    Originally posted by tir_al View Post
    I tried it on a paired-end BAM file of about 80 million 75 bp reads, and it crammed the 3.6 GB file into a 257 MB archive.
    Sounds like a promising way to store genomic data in the long run if indexed to hg19?

  • #7
    Yeah. Preferably for storing old projects.

  • #8
    Originally posted by tir_al View Post
    Yeah. Preferably for storing old projects.
    Are you sure the effort is going to be worthwhile compared to the plain old tar/gzip combination?

    If you are looking at thousands of samples a year, then perhaps it is.
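
    (By plain old tar/gzip I mean just something like this, as a baseline:
    Code:
    tar -czf old_projects.tar.gz old_projects/
    )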

  • #9
    I currently have no options, and no room for new disk space.

  • #10
    Hi all, I can't figure out how to specify lossless compression using cramtools (i.e. retain all quality score info); can someone help me out? In the NGC paper they state a few flags which seem to be discontinued in 1.0. Presumably I specify it using --lossy-quality-score-spec, but I can't figure out how to set it to 'any/all'. I appreciate any help/ideas on this. Also, if I am missing the point and the compression inherently removes quality scores, I apologise in advance; I am a n00b to the area =P
    Last edited by bruce01; 01-11-2013, 02:55 AM.

  • #11
    Hi,

    I'm in the same situation as Bruce. I want to compress while keeping the base call qualities, but I can't figure out how...

    Just did a naïve try:
    Code:
    --lossy-quality-score-spec all
    and got:
    Code:
    Exception in thread "main" java.lang.RuntimeException: Uknown read or base category: a
            at net.sf.cram.lossy.QualityScorePreservation.parseSinglePolicy(QualityScorePreservation.java:138)
    So apparently you can specify a read name to keep the quality (not of use in my case, as I want to keep all of them) and a base category. But what is the base category? I also tried a numeric value in case it referred to an index in the read, but with a similar result.

    Thanks!
    Pablo.

  • #12
    Hi Pablo,

    I found it in the archives of the CRAM mailing list; the call includes -L m999 (the -L flag is your --lossy-quality-score-spec above). All reads are retained, but you lose columns 12+ ('info'). This isn't an issue for me, and the compressed CRAM file for a 1 GB BAM is 600 MB, which is pretty good!
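
    For anyone searching later, the full call would then look something like this (a sketch only; the file names are placeholders and the long flag names are from my memory of cramtools 1.0):
    Code:
    java -jar cramtools.jar cram \
        --input-bam-file input.bam \
        --reference-fasta-file hg19.fa \
        --output-cram-file output.cram \
        -L m999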

  • #13
    Thanks Bruce,

    It's running!
    For columns 12+, I guess with the --capture-tags option you can keep tags such as the read group, which is usually important.
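    Something like this, I assume (the exact tag-list syntax is a guess on my part; check the cramtools help):
    Code:
    --capture-tags RG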


    Regards,
    Pablo.

  • #14
    I just noticed this thread, rather late.

    There is CRAM from EBI, which has long-term support and handles random access. It's the most direct competitor to BAM, I would guess.

    Alternatives are Goby (similar ratios, but even slower in my experience), Quip (faster encoding, great compression ratio, but no(?) random access) and SamComp1/2 (faster encoding, great compression ratio, no random access, and not really a full implementation of the SAM spec - more of a FASTQ compressor). Finally, on that topic, there are tools like Quip again, fqzcomp and fastqz for compressing FASTQ data. [All three of these were SequenceSqueeze competition entries.]
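
    For the FASTQ tools the usage is simple; fqzcomp, for example, is along these lines (from memory, so check the README; file names are placeholders):
    Code:
    fqz_comp in.fastq out.fqz      # compress
    fqz_comp -d out.fqz copy.fastq # decompress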

  • #15
    But as I saw in one of the presentations, it seems CRAM does a lossy conversion from BAM and introduces false positives and false negatives? Is CRAM now mature enough to do lossless compression of FASTQ and BAM files, with random access like BAM files give?
