**lh3** · 08-23-2013, 10:01 AM

Each "bin" is associated with one or more chunks. A chunk is a contiguous interval of the BAM file. Given a chunk belonging to bin $b, most reads in this interval have bin $b. No read outside the chunks of bin $b has bin $b. By going through the chunks of bin $b, you will get all the reads with bin $b.

UCSC was the first to use the binning strategy, but it has no concept of an "index file". Its index is just a bare MySQL index, which by design sits on disk rather than in memory. If we generated BAM-like index files for UCSC tables, they would be much smaller than the BAM index because UCSC uses fewer bins. I have not read the bigWig/bigBed paper about their index.

The binning index is not overkill. You cannot assume that a BAM always holds short genomic reads. We also put contig and mRNA alignments in BAMs, which sometimes span >100kb. In that case, the binning index plays an important role.
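To make the role of alignment length concrete, here is a small Python port of the `reg2bin` function given in the SAM/BAM specification, which assigns an alignment to the smallest bin that fully contains it. It assumes 0-based, half-open coordinates; the example coordinates are made up for illustration.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin fully containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:   # fits in one 16 kb bin
        return 4681 + (beg >> 14)
    if beg >> 17 == end >> 17:   # fits in one 128 kb bin
        return 585 + (beg >> 17)
    if beg >> 20 == end >> 20:   # fits in one 1 Mb bin
        return 73 + (beg >> 20)
    if beg >> 23 == end >> 23:   # fits in one 8 Mb bin
        return 9 + (beg >> 23)
    if beg >> 26 == end >> 26:   # fits in one 64 Mb bin
        return 1 + (beg >> 26)
    return 0                     # only the 512 Mb root bin contains it


short_read_bin = reg2bin(1000, 1100)       # a 100 bp read
long_aln_bin = reg2bin(1001, 1001 + 100_000)  # a 100 kb alignment
```

A short read lands in a 16 kb leaf bin, while a 100 kb alignment starting nearby is pushed up to a larger bin, so it does not pollute the leaf bins.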

**asiangg** · 08-23-2013, 11:08 AM

Thank you for the explanation! However, the definition of a chunk is still not clear. It seems a chunk is a smaller unit than a bin, but according to your description, a chunk may not necessarily lie within a bin. So what criteria do you use to determine the interval for a chunk? If a chunk crosses the boundary between two bins, do you keep the chunk in the bin that contains its start coordinate?

In my previous post I was assuming that samtools loads the entire BAM index into primary memory. I have not fully read and understood the samtools source code, so maybe I was wrong. But this does seem to be the case, since the ".bai" file is so small. If this assumption is true, then I still do not understand why binning is necessary. With all the file offsets of tiling windows and chunks in primary memory, you can simply create two linear indices over the tiling windows or chunks, one for the beginning of the first alignment and one for the end of the last alignment. Then you can determine the start and end file offsets that enclose any query region and read through it. This has nothing to do with the length of an alignment. Am I right? Can you please offer some insight? Thx!

**lh3** · 08-23-2013, 11:33 AM

Say you have 3 reads: 100bp read1 at pos 1000, 100kb read2 at 1001 and 100bp read3 at 1002. Read1 and read3 are in the same bin $a, but read2 is in the parent bin of $a. The straightforward implementation will put two chunks in bin $a, the first chunk for read1 and the second for read3. Samtools is likely to use one chunk containing all 3 reads. When you pull reads with bin $a, you go through the chunk and exclude read2, which has a different bin. It does not matter if a chunk contains a few more reads with different bins. By merging chunks close to each other, you get a smaller index file.
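The three-read example can be sketched in Python as below. This is only an illustration: `reg2bin` follows the function in the SAM/BAM specification (0-based half-open coordinates), while the `reads` list and the idea of modelling a chunk as a plain Python list are assumptions made for the sketch. A real chunk is a pair of file offsets.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin fully containing [beg, end), as in the SAM/BAM spec."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)
    return 0


# (name, start, length) in file order, i.e. sorted by start position
reads = [("read1", 1000, 100), ("read2", 1001, 100_000), ("read3", 1002, 100)]

# One merged chunk for bin $a that also happens to cover read2
merged_chunk = reads

bin_a = reg2bin(1000, 1100)
bin_read2 = reg2bin(1001, 1001 + 100_000)

# Pulling reads with bin $a: scan the chunk, skip records with another bin
in_bin_a = [name for name, pos, length in merged_chunk
            if reg2bin(pos, pos + length) == bin_a]
```

Scanning the merged chunk costs one extra record (read2), in exchange for storing one chunk instead of two.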

UCSC invented the binning index primarily because genes vary greatly in length, which can make a linear index inefficient. That is exactly the reason why BAM also uses a binning index (and also why I talked about alignment lengths). In the extreme case, suppose you have a false RNA-seq alignment that spans the entire chromosome. Once there is a single alignment like this in your BAM, a linear index will fail completely for that chromosome, as you always need to start from the beginning of the chromosome to seek to a position. With the binning index, such a problematic alignment has only a limited effect.
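A toy simulation of that extreme case, as a rough sketch rather than what samtools actually does: record numbers stand in for file offsets, `linear_index` is a hypothetical helper that stores, per 16 kb window, the first record overlapping that window, and the read layout is invented.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin fully containing [beg, end), as in the SAM/BAM spec."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)
    return 0


def linear_index(records, n_windows):
    """For each 16 kb window, the index of the first record overlapping it.

    records: list of (beg, end) intervals sorted by beg.
    """
    idx = [None] * n_windows
    for i, (beg, end) in enumerate(records):
        last = min(((end - 1) >> 14) + 1, n_windows)
        for w in range(beg >> 14, last):
            if idx[w] is None:
                idx[w] = i
    return idx


# 100 bp reads every 1 kb across a 1 Mb chromosome
short_reads = [(k * 1000, k * 1000 + 100) for k in range(1000)]
n_windows = 61

good = linear_index(short_reads, n_windows)

# Add one bogus alignment spanning the whole chromosome at the front
with_spanning = [(0, 1_000_000)] + short_reads
bad = linear_index(with_spanning, n_windows)

# Bins are unaffected for the short reads; only the long one moves up
short_bin = reg2bin(492_000, 492_100)
spanning_bin = reg2bin(0, 1_000_000)
```

Without the spanning alignment, window 30 points at record 492, right where the query region begins. With it, every window points at record 0, so a purely linear lookup always starts from the beginning. The bins of the short reads do not change; only the spanning alignment sits in a higher-level bin.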

**asiangg** · 08-23-2013, 05:56 PM

Thank you again! I think I am starting to appreciate the use of binning. I forgot to consider alignments that may span a long distance on the genome. And I can also understand how the linear index prevents reading those chunks whose end file offsets are smaller than the offset recorded for rbeg (first alignment).
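My understanding of that filtering step, written as a Python sketch. Everything here is hypothetical: the offsets are plain integers rather than real BAM virtual offsets, `bin_chunks` and `linear` are made-up index contents, and `candidate_chunks` is my own helper name, not a samtools function.

```python
def candidate_chunks(bin_chunks, linear, bins, rbeg):
    """Chunks from the candidate bins that can still contain overlapping reads.

    bin_chunks: dict mapping bin number -> list of (start_off, end_off)
    linear:     per-16 kb-window minimum file offset of overlapping reads
    bins:       bins that may hold reads overlapping the query
    rbeg:       0-based start of the query region
    """
    window = rbeg >> 14
    min_off = linear[window] if window < len(linear) else 0
    kept = []
    for b in bins:
        for start_off, end_off in bin_chunks.get(b, []):
            # A chunk ending at or before min_off cannot reach the query
            if end_off > min_off:
                kept.append((start_off, end_off))
    return sorted(kept)


bin_chunks = {
    4681: [(0, 500)],
    4682: [(500, 900)],
    585: [(100, 200), (450, 520)],
}
linear = [0, 500]  # window 0 starts at offset 0, window 1 at offset 500

# Query starting at position 20000 (window 1); candidate bins 585 and 4682
kept = candidate_chunks(bin_chunks, linear, [585, 4682], 20000)
```

Here the chunk `(100, 200)` of the parent bin is dropped because it ends before offset 500, while `(450, 520)` and `(500, 900)` are kept.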

However, I still have a concern. What about those chunks whose start file offsets are larger than that of rend (last alignment)? You do not need to read those chunks either. But it seems BAM indexing only considers rbeg in the linear index. Can you explain? Thx!

**asiangg** · 08-26-2013, 09:28 AM

Is the last question going to be answered? Or is this simply ignored in BAM indexing? I think a long alignment is as likely to lie before the query start as after the query end. Does that mean quite a few disk seeks and reads are wasted?

Can anyone explain BAM indexing algorithm to me?
