I guess that benchmark is atypical. It is not common to find a file that can be compressed from 15GB to 600MB. Nonetheless, it does indicate that pigz is not scalable. Nils' pbgzip should be much better. Also, if you want to do a comparison, there is another, more modern variant of bzip2 that is both much faster and achieves a better compression ratio. I forget its name; James Bonfield would know.
-
Originally posted by lh3:
"I guess that benchmark is atypical. It is not common to find a file that can be compressed from 15GB to 600MB. Nonetheless, it does indicate that pigz is not scalable. Nils' pbgzip should be much better. Also, if you want to do a comparison, there is another, more modern variant of bzip2 that is both much faster and achieves a better compression ratio. I forget its name; James Bonfield would know."
You are right. I will give this a try using a SAM file. I wonder if the 15GB file was made by duplicating some content over and over. This would explain the compression.
-
Guys,
Below are compression times for a 6.8GB SAM file, tested on Ubuntu 11.10 with the latest versions of all software. We got the latest source for each tool and compiled it on our node, which has 128GB of RAM and four AMD Opteron processors (64 cores total).
cores | pigz   | pbzip2 | gzip   | bzip2
2     | 10m32s | 9m52s  | xx     | xx
16    | 1m25s  | 1m36s  | xx     | xx
64    | 1m6s   | 0m34s  | xx     | xx
1     | xx     | xx     | 21m18s | 19m16s
The pbzip2 file was 1.7GB and the pigz file was 2GB, so not as big a difference as I thought.
-
It should not be too hard to make a bz2 BAM file using the bzip2 library: BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress. Of course, there are better methods than just using these one-shot functions (see pbzip2).
I am not sure how necessary all the signalling is in the current implementation, but debugging race conditions is a pain.
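For the curious, here is a minimal sketch of round-tripping one BAM-sized block through those two one-shot libbz2 calls. This is my own illustration, not samtools code; the block size and error handling are illustrative only. Compile with -lbz2.

    /* Round-trip one BAM-sized block through libbz2's one-shot buffer API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <bzlib.h>

    #define BAM_BLOCK_SIZE 0xF800  /* 63488 bytes, the BGZF payload size mentioned below */

    int main(void) {
        char src[BAM_BLOCK_SIZE];
        memset(src, 'A', sizeof(src));  /* stand-in for an uncompressed BAM block */

        /* bzip2 docs: output can exceed input by ~1% plus 600 bytes in the worst case */
        unsigned int clen = sizeof(src) + sizeof(src) / 100 + 600;
        char *cbuf = malloc(clen);
        if (cbuf == NULL) return 1;

        /* blockSize100k=1 (100KB blocks), verbosity=0, workFactor=0 (default) */
        if (BZ2_bzBuffToBuffCompress(cbuf, &clen, src, sizeof(src), 1, 0, 0) != BZ_OK) {
            fprintf(stderr, "compression failed\n");
            return 1;
        }

        char dst[BAM_BLOCK_SIZE];
        unsigned int dlen = sizeof(dst);
        /* small=0 (normal memory usage), verbosity=0 */
        if (BZ2_bzBuffToBuffDecompress(dst, &dlen, cbuf, clen, 0, 0) != BZ_OK) {
            fprintf(stderr, "decompression failed\n");
            return 1;
        }

        printf("%u -> %u -> %u bytes\n", (unsigned)sizeof(src), clen, dlen);
        free(cbuf);
        return 0;
    }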
-
BZIP2 compresses in blocks, so it actually fits the model of BAM quite well. The default block size in BAM is 65536 bytes, so upping that to 100K wouldn't be too hard. If it saves 30%, then it could be an alternative to CRAM (i.e. "get rid of all things").
-
Originally posted by genericforms:
"But is it worth it? BZIP2 wins on parallelization with lots of cores, but is this useful? I thought samtools reads and writes in small blocks that are separately compressed and decompressed, so it seems you can just parallelize that, right? Do you really benefit from using BZIP2 over GZIP?"
command           | compress (s) | decompress (s) | size (MB)
pbgzip -t 0 (gz)  | 17.29        | 21.24          | 698
pbgzip -t 1 (bz2) | 18.43        | 21.13          | 804
pbzip2            | 21.36        | 21.23          | 640
Since BAM uses such small block sizes (63488 bytes), the BZ2 compression is not as good as when using larger block sizes, as in pbzip2. While the pbzip2 file is 80% of the size of the pbgzip (bz2) file, the pbgzip (gz) file is a respectable 86% of it. Compression and decompression times were not too different either.
-
I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view:
It'd be interesting to merge his approach with Nils', if feasible... or did you already re-introduce multi-threaded sort, Nils?
-
Originally posted by arvid:
"I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view. It'd be interesting to merge his approach with Nils', if feasible... or did you already re-introduce multi-threaded sort, Nils?"
I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there.
-
Originally posted by nilshomer:
"The way Heng implemented it was to multi-thread the in-memory sort and, when merging multiple BAM files, to multi-thread the compression. There is also support in the new bgzf.c for multi-threaded writing, which is used in the merging above but not elsewhere. I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there."
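To make the pattern under discussion concrete, here is a generic sketch (mine, not Heng's or Nils' actual code) of compressing independent 64KB blocks concurrently with zlib and pthreads. Because each block is compressed independently, workers need no coordination beyond joining in submission order; real BGZF framing (the gzip-compatible header and BSIZE extra field) is omitted. Compile with -lz -lpthread.

    /* Block-parallel compression: one worker deflates one independent block. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define BLOCK_SIZE 0x10000   /* 65536-byte uncompressed blocks, as in BGZF */
    #define NBLOCKS 8

    typedef struct {
        unsigned char in[BLOCK_SIZE];
        unsigned char out[BLOCK_SIZE + 1024]; /* slack for incompressible data */
        uLongf out_len;
    } block_t;

    static void *compress_block(void *arg) {
        block_t *b = (block_t *)arg;
        b->out_len = sizeof(b->out);
        /* compress2 is zlib's one-shot API; BGZF proper uses raw deflate */
        if (compress2(b->out, &b->out_len, b->in, BLOCK_SIZE, 6) != Z_OK)
            b->out_len = 0;
        return NULL;
    }

    int main(void) {
        static block_t blocks[NBLOCKS];
        pthread_t tids[NBLOCKS];
        int i;

        for (i = 0; i < NBLOCKS; i++) {
            memset(blocks[i].in, 'A' + i, BLOCK_SIZE);   /* stand-in data */
            pthread_create(&tids[i], NULL, compress_block, &blocks[i]);
        }
        /* join in submission order so output blocks stay ordered */
        for (i = 0; i < NBLOCKS; i++) {
            pthread_join(tids[i], NULL);
            printf("block %d: %d -> %lu bytes\n", i, BLOCK_SIZE,
                   (unsigned long)blocks[i].out_len);
        }
        return 0;
    }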
-
Originally posted by nilshomer:
"BZIP2 compresses in blocks, so it actually fits the model of BAM quite well. The default block size in BAM is 65536 bytes, so upping that to 100K wouldn't be too hard. If it saves 30%, then it could be an alternative to CRAM (i.e. 'get rid of all things')."
You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size, the better the compression ratio, of course. This would require rejigging the BGZF virtual offset... the current 64-bit trick won't work.
You'd also need to solve the non-byte-aligned block issue, perhaps by extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html
This is assuming that using multiple cores would overcome the inherently higher CPU load of BZIP2 vs GZIP, which sounds viable in principle.
-
Originally posted by maubp:
"You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size, the better the compression ratio, of course. This would require rejigging the BGZF virtual offset... the current 64-bit trick won't work. You'd also need to solve the non-byte-aligned block issue, perhaps by extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html This is assuming that using multiple cores would overcome the inherently higher CPU load of BZIP2 vs GZIP, which sounds viable in principle."
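To spell out the "64-bit trick" being referenced: per the SAM/BAM spec, a BGZF virtual file offset packs the compressed block's start offset into the upper 48 bits and the offset within the uncompressed block into the lower 16 bits. That lower field is exactly what caps uncompressed blocks at 2^16 bytes, so 100K+ bzip2 blocks can't be addressed without a redesign. A minimal sketch (an illustration, not samtools code):

    /* BGZF virtual offsets: coffset (file offset of the compressed block)
     * in the high 48 bits, uoffset (position within the uncompressed block)
     * in the low 16 bits. */
    #include <stdint.h>

    static inline uint64_t bgzf_make_voffset(uint64_t coffset, uint16_t uoffset) {
        return (coffset << 16) | uoffset;  /* uoffset can never exceed 65535 */
    }

    static inline uint64_t bgzf_voffset_coffset(uint64_t v) { return v >> 16; }

    static inline uint16_t bgzf_voffset_uoffset(uint64_t v) { return (uint16_t)(v & 0xFFFF); }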
-
Originally posted by maubp:
"That's the bonus of using compressed files - they are faster to read off disk (as long as the CPU overhead doesn't cost you too much). That is, using more compression can save I/O."
-
Originally posted by nilshomer:
"I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing; just don't use it for production systems yet. Thank you!"
NB: I would be happy to describe the implementation, and to collaborate to get this into Picard too.
Stacia