I have been trying to understand how the BAM index works by reading the SAM specification (v1.4, the newest version) and even samtools' source code. However, I failed to understand it fully. I'm posting here and hopefully somebody may clear my mind.
The BAM index contains two types of indices: bin index and linear index. The bin size goes from 2^29 to 2^14, so the smallest bin represents a 16kb region on the genome. While the linear index contains the file offsets for each 16kb tiling window region on the genome, which coincides with the smallest bin size.
Inside each bin, there is also a smaller unit called chunk whose definition is missing in SAM specification. Can anybody explain to me how the chunk is defined?
It seems the BAM index uses a strategy to cluster short reads whose start coordinates are close to each other. Therefore, it does not have to index each read but rather a 16kb tiling window or a chunk. This results in a very small index file: most of my ".bai" files are smaller than 10MB. Therefore, we can comfortably hold the entire indexing info in primary memory.
Besides that, I really don't understand why the binning strategy is required here and how it compensate the linear index. Given a query region, we can easily figure out the start and end file offsets of the tiling windows, read all the data into primary memory and figure out the rest in memory. This can work because we can always assume the BAM file is sorted according to the begin coordinates. (We need two linear indices for this to work, one for begin coordinate of first and one for end coordinate of the last alignment)
We should note that the BAM index is different from the UCSC genome browser index. In UCSC GB, the index is built on the genomic features such as genes. The index is potentially very large and may reside on secondary storage. Therefore, using the binning strategy can help to shrink the search scope rapidly, resulting in less disk seeks and reads. However, for BAM index, the binning really looks like an overkill.
Can anyone explain?
Thanks!
The BAM index contains two types of indices: bin index and linear index. The bin size goes from 2^29 to 2^14, so the smallest bin represents a 16kb region on the genome. While the linear index contains the file offsets for each 16kb tiling window region on the genome, which coincides with the smallest bin size.
Inside each bin, there is also a smaller unit called chunk whose definition is missing in SAM specification. Can anybody explain to me how the chunk is defined?
It seems the BAM index uses a strategy to cluster short reads whose start coordinates are close to each other. Therefore, it does not have to index each read but rather a 16kb tiling window or a chunk. This results in a very small index file: most of my ".bai" files are smaller than 10MB. Therefore, we can comfortably hold the entire indexing info in primary memory.
Besides that, I really don't understand why the binning strategy is required here and how it compensate the linear index. Given a query region, we can easily figure out the start and end file offsets of the tiling windows, read all the data into primary memory and figure out the rest in memory. This can work because we can always assume the BAM file is sorted according to the begin coordinates. (We need two linear indices for this to work, one for begin coordinate of first and one for end coordinate of the last alignment)
We should note that the BAM index is different from the UCSC genome browser index. In UCSC GB, the index is built on the genomic features such as genes. The index is potentially very large and may reside on secondary storage. Therefore, using the binning strategy can help to shrink the search scope rapidly, resulting in less disk seeks and reads. However, for BAM index, the binning really looks like an overkill.
Can anyone explain?
Thanks!
Comment