  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    Using BGZF (Blocked GNU Zip Format) for general sequence files

    BAM files are compressed using a variant of GZIP (GNU ZIP), called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specification will have seen the terms BGZF and virtual offsets, but what you may not realise is how general purpose this is for random access to sections of any large compressed file.



    I wrote the above blog post looking at BGZF applied to FASTA, SwissProt and UniProt-XML sequences. In short: BGZF files are bigger than GZIP files, but they are much faster for random access.
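The virtual offsets mentioned above can be illustrated with a minimal sketch, assuming the layout described in the SAM/BAM specification: the upper 48 bits hold the file offset of the start of a BGZF block, and the lower 16 bits hold the position within that block's decompressed data. The function names are chosen for illustration only.

```python
# Sketch of BGZF virtual offset arithmetic (layout per the SAM/BAM specification).
# Function names are illustrative, not taken from any particular library.

def make_virtual_offset(block_start, within_block):
    """Pack a block start (48 bits) and a within-block offset (16 bits)."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(100000, 10)
parts = split_virtual_offset(voffset)
```

To seek, a reader would jump to `block_start` in the compressed file, decompress that single block, and then skip `within_block` bytes of the decompressed data.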

    So, should we all be considering using BGZF in preference to GZIP?
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    I haven't done a proof-of-principle implementation, but I believe efficient random access to BZIP2 files is also possible using their block structure. However, BZIP2 decompression is much more CPU intensive, which would be a concern for fast random access:
    In my last post I looked at how the GZIP variant BGZF (Blocked GNU Zip Format, used in BAM files) allowed efficient random access to large ...


    • arolfe
      Member
      • Jul 2011
      • 29

      #3
      What are the use cases where you see BGZF as useful for fasta or fastq files? It seems to only help in the random access case, eg "database" of genome sequences, but wouldn't make much difference over GZIP for files that are generally processed sequentially, eg sequencing reads.

      Also, OOC, what's your preferred fasta/fastq index format?


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Originally posted by arolfe
        What are the use cases where you see BGZF as useful for fasta or fastq files? It seems to only help in the random access case, eg "database" of genome sequences, but wouldn't make much difference over GZIP for files that are generally processed sequentially, eg sequencing reads.
        Yes, exactly - I see BGZF as being useful for databases of sequences (e.g. FASTA, SwissProt, GenBank, etc) where you need random access.

        Where you just need sequential access, you can treat BGZF like GZIP and pipe the decompressed data to a tool, or otherwise decompress on the fly.
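The sequential case can be sketched with the standard library alone. This assumes that plain concatenated gzip members are an acceptable stand-in for BGZF blocks; real BGZF blocks also carry an extra header field recording the block size, which ordinary gzip readers skip over.

```python
import gzip
import io

# Three small chunks of FASTA data, standing in for the contents of BGZF blocks.
chunks = [b">seq1\nACGT\n", b">seq2\nGGCC\n", b">seq3\nTTAA\n"]

# Compress each chunk as its own gzip member and concatenate the members,
# which is the basic idea behind a blocked gzip file.
blocked = b"".join(gzip.compress(chunk) for chunk in chunks)

# An ordinary gzip reader walks through every member in turn, so the
# blocked file can be read sequentially like any other gzip file.
with gzip.open(io.BytesIO(blocked), "rb") as handle:
    lines = handle.read().splitlines()
```

On the command line the equivalent would be piping the output of `gzip -dc` on the file into the downstream tool.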

        So BGZF works for both, and doesn't take that much more space than traditional GZIP (depending on the file format).

        Originally posted by arolfe
        Also, OOC, what's your preferred fasta/fastq index format?
        In terms of indexing large FASTA/FASTQ, I mainly use an SQLite database mapping identifiers to file offsets (and raw data length), via Biopython.
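The general idea of an SQLite index of offsets and lengths can be sketched as follows. This is a simplified illustration for an uncompressed FASTA file, not Biopython's actual schema; the table name, column names, and helper functions are all hypothetical.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, connection):
    """Record the start offset and raw length of each FASTA record."""
    connection.execute(
        "CREATE TABLE offsets (id TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
    )
    record_id, start, offset = None, 0, 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if record_id is not None:
                    connection.execute(
                        "INSERT INTO offsets VALUES (?, ?, ?)",
                        (record_id, start, offset - start),
                    )
                # The identifier is the first word after the ">" marker.
                record_id = line[1:].split(None, 1)[0].decode()
                start = offset
            offset += len(line)
    if record_id is not None:
        connection.execute(
            "INSERT INTO offsets VALUES (?, ?, ?)", (record_id, start, offset - start)
        )
    connection.commit()


def get_raw(fasta_path, connection, record_id):
    """Fetch the raw bytes of one record with a single seek and read."""
    row = connection.execute(
        "SELECT start, length FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        return handle.read(row[1])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "wb") as out:
        out.write(b">alpha first\nACGT\nAC\n>beta\nGGGG\n")
    db = sqlite3.connect(":memory:")
    build_index(path, db)
    raw_alpha = get_raw(path, db, "alpha")
    raw_beta = get_raw(path, db, "beta")
    db.close()
```

For a BGZF-compressed file, the stored start value would be a virtual offset rather than a plain file offset.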

        For FASTA/FASTQ raw reads, random access by ID is not such a common need, but again BGZF could be used here for random access to a compressed file. Rather I advocate moving to unaligned BAM for raw reads, see http://blastedbio.blogspot.com/2011/...ve-sambam.html and this thread http://seqanswers.com/forums/showthread.php?t=14941


        • salturki
          Member
          • May 2008
          • 12

          #5
          Thanks maubp for the insightful article.

          I am wondering if you have considered implementing tabix in Biopython as well?

          I am looking for a pure Python tabix-like module to be used in a cross-platform solution. I tried to install Pysam and the tabix Python package (shipped with its source code) but couldn't build them on Windows.

          I ended up compiling tabix/bgzip on Windows using Cygwin. The final software is shipped with a few DLL files from Cygwin in order for bgzip and tabix to work.

          If you are not planning to write such a module, I would appreciate any pseudocode suggestions.

          Cheers


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            I'm interested in tabix, and it should be possible to implement in Python building on the BGZF support included in Biopython 1.60 - but I've not had time to look into it.


            • SamH
              Member
              • Sep 2010
              • 15

              #7
              using bgzf for BAM files?

              Hi All,
              I wonder, has anyone tried using the Biopython bgzf support for parsing BAM files?
              Specifically it would be nice to access a BAM line by line, however it doesn't seem to quite work correctly. The data comes out garbled for me:

              i.e.
              Code:
              from Bio import bgzf
              iter = bgzf.BgzfReader("454.local.bowtie2.bam", 'rb')
              
              for i in range(20):
                  print iter.readline()
              
              
              BAM�@HD VN:1.0  SO:unsorted
              @SQ     SN:chrI LN:228539
              @SQ     SN:chrII        LN:813067
              @SQ     SN:chrIII       LN:316396
              @SQ     SN:chrIV        LN:1527223
              @SQ     SN:chrV LN:572028
              @SQ     SN:chrVI        LN:269964
              @SQ     SN:chrVII       LN:1084769
              @SQ     SN:chrVIII      LN:562680
              @SQ     SN:chrIX        LN:440022
              @SQ     SN:chrX LN:744918
              @SQ     SN:chrXI        LN:666321
              @SQ     SN:chrXII       LN:1073453
              @SQ     SN:chrXIII      LN:922770
              @SQ     SN:chrXIV       LN:778492
              @SQ     SN:chrXV        LN:1091177
              @SQ     SN:chrXVI       LN:945865
              @SQ     SN:contig00341  LN:6318
              @PG     ID:bowtie2      PN:bowtie2      VN:2.0.0-beta7
              chrI�|chrII
                         h
                          chrIII��chrIV�MchrV|chrVI�chrVIIachrVIII�chrIXֶchrX�]
                                                                              chrXI�*
              thanks!

              Sam


              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by SamH
                Hi All,
                I wonder, has anyone tried using the Biopython bgzf support for parsing BAM files?
                Specifically it would be nice to access a BAM line by line, however it doesn't seem to quite work correctly. The data comes out garbled for me:
                Hi Sam,

                That behaviour is expected and correct. BGZF is just a variant of gzip; once you decompress it you have a 'naked' BAM file, which is a binary representation of the SAM format, although as you noticed it does contain an embedded plain text SAM header. All the Biopython Bio.bgzf code did for you was decompress it. Biopython doesn't currently have a BAM parser.
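To show why the output looks garbled after the text header, here is a minimal sketch of reading the start of a decompressed BAM stream. It assumes the layout given in the SAM/BAM specification: the magic bytes `BAM\1`, a little-endian 32-bit text length, the SAM header text, then a reference count followed by a NUL-terminated name and a 32-bit length for each reference. The function name and the tiny hand-built input are illustrative only.

```python
import io
import struct


def read_bam_header(handle):
    """Parse the header section of an already decompressed BAM stream."""
    if handle.read(4) != b"BAM\x01":
        raise ValueError("not a decompressed BAM stream")
    (l_text,) = struct.unpack("<i", handle.read(4))
    text = handle.read(l_text).rstrip(b"\x00").decode()
    (n_ref,) = struct.unpack("<i", handle.read(4))
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack("<i", handle.read(4))
        name = handle.read(l_name).rstrip(b"\x00").decode()
        (l_ref,) = struct.unpack("<i", handle.read(4))
        references.append((name, l_ref))
    return text, references


# Build a tiny header by hand so the sketch runs without a real BAM file.
sam_text = b"@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chrI\tLN:228539\n"
raw = b"BAM\x01" + struct.pack("<i", len(sam_text)) + sam_text
raw += struct.pack("<i", 1)  # number of reference sequences
raw += struct.pack("<i", len(b"chrI\x00")) + b"chrI\x00" + struct.pack("<i", 228539)

text, references = read_bam_header(io.BytesIO(raw))
```

The binary reference list is what shows up as names such as chrI and chrII surrounded by stray bytes: those bytes are the packed name lengths and sequence lengths. In principle the same function could be given a `bgzf.BgzfReader` opened in `'rb'` mode instead of the `BytesIO` object, though that is untested here.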

                Have you looked at pysam, which is a Python wrapper for the samtools C API? http://code.google.com/p/pysam/

                Peter
