  • lh3
    Senior Member
    • Feb 2008
    • 686

    #16
    I guess that benchmark is atypical. It is rare to find a file that compresses from 15GB to 600MB. Nonetheless, it does indicate that pigz is not scalable. Nils' pbgzip should be much better. Also, if you want to do a comparison, there is a more modern variant of bzip2 that is both much faster and achieves a better compression ratio. I forgot its name. James Bonfield should know better.


    • adaptivegenome
      Super Moderator
      • Nov 2009
      • 436

      #17
      Originally posted by lh3:
      I guess that benchmark is atypical. It is rare to find a file that compresses from 15GB to 600MB. Nonetheless, it does indicate that pigz is not scalable. Nils' pbgzip should be much better. Also, if you want to do a comparison, there is a more modern variant of bzip2 that is both much faster and achieves a better compression ratio. I forgot its name. James Bonfield should know better.
      Heng,

      You are right. I will give this a try using a SAM file. I wonder if the 15GB file was made by duplicating some content over and over. This would explain the compression.


      • adaptivegenome
        Super Moderator
        • Nov 2009
        • 436

        #18
        Guys,

        Below are compression times for a 6.8GB SAM file. Tested on Ubuntu 11.10 with latest versions of all software. We got the latest source for each tool and compiled it on our node. Our node has 128GB of RAM and 4x AMD Opteron(TM) Processors. Total of 64 cores.


        cores   pigz     pbzip2   gzip     bzip2
        2       10m32s   9m52s    xx       xx
        16      1m25s    1m36s    xx       xx
        64      1m6s     0m34s    xx       xx
        1       xx       xx       21m18s   19m16s

        The pbzip2 file was 1.7GB and the pigz file was 2GB, so not as big a difference as I thought.
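To make the scaling in the timings above easier to read, here is a small Python sketch that converts them to seconds and computes speed-up factors. It assumes single-core gzip is a fair baseline for pigz and single-core bzip2 for pbzip2, which the post does not state explicitly; the function and variable names are made up for illustration.

```python
import re

# Wall-clock times copied from the table above, keyed by core count.
TIMINGS = {
    "pigz": {2: "10m32s", 16: "1m25s", 64: "1m6s"},
    "pbzip2": {2: "9m52s", 16: "1m36s", 64: "0m34s"},
}
# Assumed baselines: the serial tool of the same family on 1 core.
BASELINE = {"pigz": "21m18s", "pbzip2": "19m16s"}


def to_seconds(stamp: str) -> int:
    """Convert a string such as '10m32s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def speedups(tool: str) -> dict[int, float]:
    """Speed-up of `tool` at each core count relative to its baseline."""
    base = to_seconds(BASELINE[tool])
    return {cores: base / to_seconds(t) for cores, t in TIMINGS[tool].items()}


if __name__ == "__main__":
    for tool in TIMINGS:
        for cores, factor in speedups(tool).items():
            print(f"{tool:7s} {cores:3d} cores: {factor:5.1f}x")
```

Under that assumption, pigz gains little between 16 and 64 cores while pbzip2 keeps improving, which matches the impression given in the thread.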


        • nilshomer
          Nils Homer
          • Nov 2008
          • 1283

          #19
          It should not be too hard to make a bz2 BAM file, using the bz2 library: BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress. Of course, there are better methods than just using the aforementioned functions (see pbzip2).

          I am not sure how necessary all the signalling is in the current implementation, but debugging race conditions is a pain.
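As a rough illustration of the buffer-to-buffer idea mentioned above, the sketch below compresses data one block at a time with Python's standard `bz2` module, which wraps libbz2. It is not the C API named in the post (`BZ2_bzBuffToBuffCompress` / `BZ2_bzBuffToBuffDecompress`), only an analogue; the 65536-byte block size and the helper names are assumptions, and no BAM/BGZF framing is attempted.

```python
import bz2

BLOCK_SIZE = 65536  # assumed block size, taken from the figure quoted in the thread


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each fixed-size block as an independent bz2 stream."""
    return [
        bz2.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks: list[bytes]) -> bytes:
    """Decompress independent bz2 streams and join them back together."""
    return b"".join(bz2.decompress(block) for block in blocks)


if __name__ == "__main__":
    payload = b"read1\t0\tchr1\t100\t60\t50M\n" * 10000
    blocks = compress_blocks(payload)
    assert decompress_blocks(blocks) == payload
    print(len(blocks), "blocks,", sum(map(len, blocks)), "compressed bytes")
```

Because every block is a self-contained stream, any block can be decompressed without touching its neighbours, which is the property random access into a BAM-like file relies on.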


          • adaptivegenome
            Super Moderator
            • Nov 2009
            • 436

            #20
            But is it worth it? BZIP wins on parallelization with lots of cores, but is this useful? I thought samtools reads and writes in small blocks that are separately compressed and decompressed. So it seems you can just parallelize that, right? Do you really benefit from using BZIP over GZIP?


            • nilshomer
              Nils Homer
              • Nov 2008
              • 1283

              #21
              BZIP compresses in blocks, so it actually fits the model of BAM quite well. The default block size in BAM is 65536, so upping that to 100K wouldn't be too hard. If it saves 30%, then it could be an alternative to CRAM (i.e. "get rid of all things").
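The "just parallelize the blocks" idea discussed above can be sketched in a few lines of Python. This is not how samtools or pbgzip implement it; it only shows that independently compressed blocks can be handed to a pool of workers and concatenated in order. The block size, worker count, and function names are assumptions, and real BGZF blocks carry an extra header field that this sketch omits.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 65536  # assumed block size, taken from the figure quoted in the thread


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def parallel_gzip(data: bytes, workers: int = 4) -> bytes:
    """Compress each block as its own gzip member, using a thread pool."""
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the output order is stable.
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


if __name__ == "__main__":
    payload = b"read1\t0\tchr1\t100\t60\t50M\n" * 20000
    packed = parallel_gzip(payload)
    # Concatenated gzip members form a valid multi-member gzip stream.
    assert gzip.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

How much wall-clock time this saves depends on the interpreter and the machine, so treat it as a demonstration of the block structure rather than a benchmark.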


              • nilshomer
                Nils Homer
                • Nov 2008
                • 1283

                #22
                Originally posted by genericforms:
                But is it worth it? BZIP wins on parallelization with lots of cores, but is this useful? I thought samtools reads and writes in small blocks that are separately compressed and decompressed. So it seems you can just parallelize that, right? Do you really benefit from using BZIP over GZIP?
                Well the best way to answer the question is to do it (on 24 threads).

                command             | compress (s) | decompress (s) | size (MB)
                pbgzip -t 0 (gz)    | 17.29        | 21.24          | 698
                pbgzip -t 1 (bz2)   | 18.43        | 21.13          | 804
                pbzip2              | 21.36        | 21.23          | 640


                Since BAM uses such small block sizes (63488 bytes), the BZ2 compression is not as good as when using larger block sizes, like in pbzip2. While pbzip2 file size is 80% of pbgzip (gz), the file size of pbgzip (bz2) is a respectable 86%. Compression and decompression times were not too different either.
                Last edited by nilshomer; 03-21-2012, 09:01 PM.
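The block-size effect described above can be tried out with a short Python sketch. It builds synthetic SAM-like text (not real data, and not the file used in the post) and totals the bz2 output when the input is compressed in independent blocks of a given size. The helper names and the sample generator are assumptions for illustration only.

```python
import bz2
import random


def make_sample(n_lines: int = 20000, seed: int = 0) -> bytes:
    """Generate deterministic, SAM-like lines with random 50 bp sequences."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_lines):
        seq = "".join(rng.choices("ACGT", k=50))
        pos = rng.randrange(1, 10**6)
        lines.append(f"read{i}\t0\tchr1\t{pos}\t60\t50M\t*\t0\t0\t{seq}\n")
    return "".join(lines).encode("ascii")


def blockwise_bz2_size(data: bytes, block_size: int) -> int:
    """Total compressed size when each block is compressed independently."""
    return sum(
        len(bz2.compress(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    )


if __name__ == "__main__":
    sample = make_sample()
    for block_size in (63488, 900_000):
        size = blockwise_bz2_size(sample, block_size)
        print(f"block size {block_size:7d}: {size} bytes ({size / len(sample):.1%} of input)")
```

Results on synthetic text will not match the numbers in the table; the point is only to compare small and large blocks on the same input.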


                • arvid
                  Senior Member
                  • Jul 2011
                  • 156

                  #23
                  I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view:



                  It'd be interesting with a merge between his approach and Nils', if feasible... or did you already re-introduce multi-threaded sort, Nils?


                  • nilshomer
                    Nils Homer
                    • Nov 2008
                    • 1283

                    #24
                    Originally posted by arvid:
                    I guess most interested people are on the samtools-devel/help lists; however, for those who aren't, Heng just announced a multi-threaded samtools sort/merge/view:



                    It'd be interesting with a merge between his approach and Nils', if feasible... or did you already re-introduce multi-threaded sort, Nils?
                    The way Heng implemented it was to multi-thread the in-memory sort and, when merging multiple BAM files, to multi-thread the compression. The new bgzf.c also supports multi-threaded writing, which is used in the merging above. The multi-threaded writing is not used elsewhere.

                    I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there.


                    • arvid
                      Senior Member
                      • Jul 2011
                      • 156

                      #25
                      Originally posted by nilshomer:
                      The way Heng implemented it was to multi-thread the in-memory sort and, when merging multiple BAM files, to multi-thread the compression. The new bgzf.c also supports multi-threaded writing, which is used in the merging above. The multi-threaded writing is not used elsewhere.

                      I think the part I would integrate is Heng's multi-threaded sort routine, while the rest is already there.
                      Great, keep up the good work!


                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #26
                        Originally posted by nilshomer:
                        BZIP compresses in blocks, so it actually fits the model of BAM quite well. The default block size in BAM is 65536, so upping that to 100K wouldn't be too hard. If it saves 30%, then it could be an alternative to CRAM (i.e. "get rid of all things").
                        You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size the better the compression rate of course. This would require rejigging the BGZF virtual offset... the current 64bit trick won't work.

                        You'd also need to solve the non-byte aligned block issue, perhaps extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html

                        This is assuming using multiple cores would overcome the inherently higher CPU load of BZIP vs GZIP - which sounds viable in principle
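To show why larger blocks clash with the BGZF virtual offset mentioned above: the 64-bit value packs the compressed file offset of a block into the upper 48 bits and the position inside the uncompressed block into the lower 16 bits. A minimal Python sketch follows; the function names are made up for illustration.

```python
UOFFSET_BITS = 16  # bits reserved for the position inside an uncompressed block
COFFSET_BITS = 48  # bits reserved for the block's offset in the compressed file


def make_virtual_offset(coffset: int, uoffset: int) -> int:
    """Pack a block start and a within-block position into one 64-bit value."""
    if not 0 <= uoffset < (1 << UOFFSET_BITS):
        raise ValueError("within-block offset does not fit in 16 bits")
    if not 0 <= coffset < (1 << COFFSET_BITS):
        raise ValueError("block offset does not fit in 48 bits")
    return (coffset << UOFFSET_BITS) | uoffset


def split_virtual_offset(voffset: int) -> tuple[int, int]:
    """Recover (coffset, uoffset) from a packed virtual offset."""
    return voffset >> UOFFSET_BITS, voffset & ((1 << UOFFSET_BITS) - 1)


if __name__ == "__main__":
    packed = make_virtual_offset(123456, 65535)
    print(split_virtual_offset(packed))
    try:
        make_virtual_offset(123456, 100_000)  # a position inside a 100K block
    except ValueError as err:
        print("100K block:", err)
```

With only 16 bits for the within-block position, any uncompressed block larger than 65536 bytes contains positions that cannot be addressed, so 100K to 900K bzip2 blocks would need a different split of the 64 bits or a different indexing scheme.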


                        • adaptivegenome
                          Super Moderator
                          • Nov 2009
                          • 436

                          #27
                          Originally posted by maubp:
                          You can use 100K, 200K, ..., 900K blocks in BZIP2 - the larger the block size the better the compression rate of course. This would require rejigging the BGZF virtual offset... the current 64bit trick won't work.

                          You'd also need to solve the non-byte aligned block issue, perhaps extending or working around the C library's API: http://blastedbio.blogspot.co.uk/201...-to-bzip2.html

                          This is assuming using multiple cores would overcome the inherently higher CPU load of BZIP vs GZIP - which sounds viable in principle
                          I wonder how important it is to speed up the compression this much. I think for really big files, the I/O probably becomes less of a problem? We have seen this at least for ~20-50GB human BAMs, whereas for fly BAMs that are 1-3GB, the I/O is more of a bottleneck for us.


                          • maubp
                            Peter (Biopython etc)
                            • Jul 2009
                            • 1544

                            #28
                            That's the bonus of using compressed files - they are faster to read off disk (as long as the CPU overhead doesn't cost you too much). i.e. Using more compression can save I/O.
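The trade-off above can be put into a toy model: if disk reads and decompression overlap, the time to read a file is roughly whichever of the two is slower. The Python sketch below encodes that assumption; the throughput figures in the usage example are invented for illustration, not measurements from this thread.

```python
def read_time_seconds(
    uncompressed_mb: float,
    compression_ratio: float,
    disk_mb_per_s: float,
    decompress_mb_per_s: float,
) -> float:
    """Estimated time to read and decompress a file, assuming I/O and CPU overlap.

    compression_ratio is uncompressed size divided by compressed size;
    decompress_mb_per_s is measured in uncompressed megabytes per second.
    """
    io_time = (uncompressed_mb / compression_ratio) / disk_mb_per_s
    cpu_time = uncompressed_mb / decompress_mb_per_s
    return max(io_time, cpu_time)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 MB of data, 100 MB/s disk, 4:1 compression.
    print("raw read        :", 1000 / 100, "s")
    print("fast decompress :", read_time_seconds(1000, 4, 100, 500), "s")
    print("slow decompress :", read_time_seconds(1000, 4, 100, 200), "s")
```

In this model, stronger compression only helps while the disk is the slower side; once decompression dominates, extra compression costs time instead of saving it.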


                            • adaptivegenome
                              Super Moderator
                              • Nov 2009
                              • 436

                              #29
                              Originally posted by maubp:
                              That's the bonus of using compressed files - they are faster to read off disk (as long as the CPU overhead doesn't cost you too much). i.e. Using more compression can save I/O.
                              Yes, I agree. I was just suggesting that perhaps I/O might not be the limiting factor for really big files, or at least that it might not be worth spending too much time trying to speed up compression beyond simply multithreading the existing block method...


                              • StaciaWyman
                                Junior Member
                                • Jun 2010
                                • 1

                                #30
                                Originally posted by nilshomer:
                                I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing, just don't use it for production systems yet. Thank-you!

                                [GitHub link]

                                NB: I would be happy to describe the implementation, and collaborate to get this into Picard too.
                                Good morning--I get page not found error when I go to the above link--is there an updated one? Thanks!
                                Stacia
