  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    Multi-threaded (faster) SAMtools

    I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing; just don't use it for production systems yet. Thank you!


    NB: I would be happy to describe the implementation, and to collaborate to get this into Picard too.
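To illustrate the general idea behind a multi-threaded block gzip writer, here is a minimal Python sketch. It is not the code from this project: the block size, the function names, and the use of plain gzip members (rather than real BGZF blocks, which carry extra header fields) are all assumptions made for illustration. The point is only that independently compressed blocks can be produced in parallel and concatenated in order.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the input into fixed-size blocks that can be compressed independently."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def parallel_block_gzip(data: bytes, threads: int = 4) -> bytes:
    """Compress each block as its own gzip member in a thread pool, then concatenate.

    pool.map yields results in input order, so the block order is preserved
    even though the blocks may finish compressing out of order.
    """
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


if __name__ == "__main__":
    payload = b"@SQ\tSN:chr1\tLN:1000\n" * 50_000
    packed = parallel_block_gzip(payload, threads=4)
    # A concatenation of gzip members is itself a valid gzip stream.
    assert gzip.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

Because every block is a complete gzip member, a reader can also decompress the blocks independently, which is what makes a parallel reader possible in the same way.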
  • Richard Finney
    Senior Member
    • Feb 2009
    • 701

    #2
    Any benchmarks?


    • nilshomer
      Nils Homer
      • Nov 2008
      • 1283

      #3
      Copied here from http://sourceforge.net/mailarchive/m...sg_id=28915492

      I am working on benchmarking the samtools commands today, and will post back.

      A 4GB SAM file was used on a dual-hex-core (12 cores) computer. I
      benchmarked compression then decompression, making sure the resulting files
      were the same. Decompression seems to be limited by IO.

      Name            Compression Time    Decompression Time
      bgzip           485.64              39.93
      pbgzip -n 1     481.57              40.02
      pbgzip -n 2     240.85              41.03
      pbgzip -n 4     122.05              41.79
      pbgzip -n 8     63.17               41.17
      pbgzip -n 12    43.12               41.65
      pbgzip -n 16    39.59               41.48
      pbgzip -n 20    37.03               42.41
      pbgzip -n 24    34.90               47.24
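For anyone who wants to read the compression column as scaling numbers, the short Python sketch below turns the times from the table into speedup (relative to single-threaded bgzip) and per-thread efficiency. The times are copied from the post; the function names are made up for this example, and the post does not state the time units, which does not matter for the ratios.

```python
# Compression times copied from the table above (units are not stated in the
# post; the ratios computed below are unit-free).
BGZIP_TIME = 485.64
PBGZIP_TIMES = {1: 481.57, 2: 240.85, 4: 122.05, 8: 63.17,
                12: 43.12, 16: 39.59, 20: 37.03, 24: 34.90}


def speedup(threads: int) -> float:
    """Speedup of pbgzip with the given thread count over plain bgzip."""
    return BGZIP_TIME / PBGZIP_TIMES[threads]


def efficiency(threads: int) -> float:
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(threads) / threads


if __name__ == "__main__":
    for n in sorted(PBGZIP_TIMES):
        print(f"-n {n:>2}: speedup {speedup(n):5.2f}x, efficiency {efficiency(n):4.0%}")
```

Compression scales close to linearly up to 12 threads (roughly 11x), and the gains flatten once the thread count exceeds the 12 cores of the test machine.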


      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        Updated #s on a few commands:
        Command         samtools    psamtools
        view BAM        29.45       19.2
        view -b BAM     207.51      19.36
        view -S SAM     44.89       44.43
        view -Sb SAM    222.64      32.62
        sort            206.32      25.17
        mpileup         6574.2      7252.08
        depth           17.64       7.47
        index           11.96       1.93
        flagstat        11.73       1.73
        calmd -b        209.25      22.86
        rmdup -s        154.88      22.08
        reheader        0.76        0.74
        cat             1.54        1.37


        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #5
          Looks good!
          question
          1) why is mpileup slower?


          • nilshomer
            Nils Homer
            • Nov 2008
            • 1283

            #6
            Working on it. I am doing this in my free time, so having one perform worse isn't that bad so far.


            • krobison
              Senior Member
              • Nov 2007
              • 734

              #7
              Really cool!!

              Do you have benchmarks for retrieving specific reads for a region? For mpileup of a specific region or a list of targets?

              Any idea if this will work with the Bio::DB::Sam Perl module (which must be linked in to samtools)?

              What are the prospects for merging this with the main samtools development?


              • nilshomer
                Nils Homer
                • Nov 2008
                • 1283

                #8
                The seeks are just as fast, so there is no speedup or slowdown on seeking, but there should be a speedup in reading from that point on, assuming there are at least a basal number of reads in the region (otherwise there is no work to be done). For mpileup, it doesn't process the regions in parallel, if that is what you were implying.

                I posted to the samtools list with no response, so I have no hypothesis as to the inclusion of this (of course, it needs more testing first). It is generally difficult to get things included there. I have more hope for Picard.

                Pysam and the SAM perl module should not notice the difference in the API, though there is no good mechanism yet for determining the # of threads to use (it autodetects the # of cores).


                • krobison
                  Senior Member
                  • Nov 2007
                  • 734

                  #9
                  I also see the sort command now gives an option to pick an algorithm. What a blast from the past!

                  Any heuristics on what algorithm might perform better in what setting?

                  And why no bubble sort option :-)


                  • nilshomer
                    Nils Homer
                    • Nov 2008
                    • 1283

                    #10
                    Originally posted by krobison View Post
                    I also see the sort command now gives an option to pick an algorithm. What a blast from the past!

                    Any heuristics on what algorithm might perform better in what setting?

                    And why no bubble sort option :-)
                    I just used Heng's ksort.h library. I like introsort, but merge sort is the default in the original samtools.

                    I have also been toying with multi-threaded sort, which sort of works in the new version, except I didn't take time to do a proper multi-way merge (one implementation requires the calculation of evenly spaced pivots). Maybe wait a few more weekends.
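As a rough picture of what "sort chunks in parallel, then do a multi-way merge" looks like, here is a small Python sketch. It is an assumption-laden illustration, not the samtools code: it uses a heap-based k-way merge (`heapq.merge`) instead of the evenly spaced pivot approach mentioned above, the helper names are invented, and in CPython the threads mostly show the structure rather than deliver a real speedup for pure-Python comparisons.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked(records, n_chunks):
    """Split records into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def parallel_sort(records, key, threads=4):
    """Sort each chunk in its own task, then k-way merge the sorted chunks."""
    chunks = chunked(records, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sorted_chunks = list(pool.map(lambda chunk: sorted(chunk, key=key), chunks))
    # heapq.merge lazily merges already-sorted inputs using a small heap.
    return list(heapq.merge(*sorted_chunks, key=key))


if __name__ == "__main__":
    # Hypothetical (read name, coordinate) records.
    reads = [("r1", 300), ("r2", 10), ("r3", 120), ("r4", 45), ("r5", 990), ("r6", 7)]
    for name, pos in parallel_sort(reads, key=lambda r: r[1], threads=3):
        print(pos, name)
```

The merge step is the serial part here, which is why a pivot-based scheme that lets each thread merge its own output range is attractive for larger inputs.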


                    • adaptivegenome
                      Super Moderator
                      • Nov 2009
                      • 436

                      #11
                      Originally posted by nilshomer View Post
                      I have been working on speeding up reading and writing within SAMtools by creating a multi-threaded block gzip reader and writer. I have an alpha version working. I would appreciate some feedback and testing; just don't use it for production systems yet. Thank you!

                      NB: I would be happy to describe the implementation, and to collaborate to get this into Picard too.
                      So are you saying you made a parallelized version of BZIP2? We have also been playing around with this. We parallelized the compression and decompression steps in the read/write functions of samtools for a local realignment tool we built.

                      I would love to learn more about what you are doing, as I would hate to duplicate anything you are already planning to do!


                      • colindaven
                        Senior Member
                        • Oct 2008
                        • 417

                        #12
                        As far as parallel (g)zip goes, pigz works wonders: http://zlib.net/pigz/


                        • adaptivegenome
                          Super Moderator
                          • Nov 2009
                          • 436

                          #13
                          PIGZ is very, very fast; however, it produces files that are much larger than those from BZIP2. Is this your experience as well?

                          It would be really nice to be able to simply parallelize BZIP2. We have tried to do this a little bit, but we certainly don't have a completed product yet.


                          • lh3
                            Senior Member
                            • Feb 2008
                            • 686

                            #14
                            Firstly, I greatly appreciate and strongly support Nils's effort in multithreading samtools. The change is likely to be merged into samtools.

                            Re sorting algorithm: samtools sort does stable sorting (i.e. preserving the relative order of records having the same coordinate). In some rare/non-typical use cases, this feature is useful. Merge sort is stable. Introsort is not.
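To make the stability point concrete, here is a tiny Python sketch. It uses Python's built-in `sorted`, which is guaranteed to be stable, as a stand-in for a merge sort; the read names and coordinates are invented for the example.

```python
from operator import itemgetter

# Hypothetical (read name, coordinate) records, listed in their input order.
reads = [
    ("readA", 100),
    ("readB", 50),
    ("readC", 100),
    ("readD", 50),
    ("readE", 100),
]

# A stable sort by coordinate keeps records that share a coordinate
# in the same relative order they had in the input.
by_coordinate = sorted(reads, key=itemgetter(1))
names = [name for name, _ in by_coordinate]

if __name__ == "__main__":
    print(names)
```

Within coordinate 50 the order stays readB, readD, and within coordinate 100 it stays readA, readC, readE. An unstable algorithm such as introsort is free to emit those ties in any order.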

                            Re pigz: someone told me on Biostar that pigz does not scale well with many cores. If this is true (I have not tried), it must be because the gzip format has long-range dependencies. bzip2 and bgzip are much easier to parallelize and probably more scalable. In addition, bzip2 has a parallel version, pbzip2, which the same person told me scales very well with the number of CPU cores.

                            Re bzip2: I have argued a couple of times here (years ago) and also on the samtools list that the key reason samtools uses gzip instead of bzip2 is that gzip is 5-10X faster on decompression. With bzip2, most samtools commands would be 2-10 times slower. I think for huge data sets that need to be read frequently, gzip is always preferred over bzip2.
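Anyone who wants to check the decompression trade-off on their own data could start from a small Python sketch like the one below. It uses the standard `gzip` and `bz2` modules; the helper names and the toy input are assumptions for illustration, and the measured ratio will depend heavily on the data, compression level, and machine, so it should not be read as reproducing the 5-10X figure.

```python
import bz2
import gzip
import time


def time_decompress(decompress, blob: bytes, repeats: int = 3) -> float:
    """Return the best wall-clock time over a few decompression runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


def compare(data: bytes) -> dict:
    """Compress data with gzip and bzip2, then time decompression of each."""
    gz_blob = gzip.compress(data)
    bz_blob = bz2.compress(data)
    # Sanity check: both formats must round-trip to the original bytes.
    assert gzip.decompress(gz_blob) == data
    assert bz2.decompress(bz_blob) == data
    return {
        "gzip": {"bytes": len(gz_blob), "seconds": time_decompress(gzip.decompress, gz_blob)},
        "bzip2": {"bytes": len(bz_blob), "seconds": time_decompress(bz2.decompress, bz_blob)},
    }


if __name__ == "__main__":
    # Toy SAM-like text; replace with a slice of a real file for meaningful numbers.
    sample = b"read1\t0\tchr1\t12345\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n" * 200_000
    for name, stats in compare(sample).items():
        print(f"{name:>5}: {stats['bytes']} bytes, {stats['seconds']:.4f} s to decompress")
```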
                            Last edited by lh3; 03-06-2012, 09:53 AM.


                            • adaptivegenome
                              Super Moderator
                              • Nov 2009
                              • 436

                              #15
                              I think it is worth figuring out the best way to compress/decompress. Our nodes have 64 cores so I will do some tests and see how BZIP2 and GZIP scale. I'll post what I find on this thread.

                              In the meantime a quick internet search turned up this:



