I have just started working on bandwidth selection for KC-smart and similar kernel regression methods, with the aim of identifying recurrent copy number aberrations in NGS libraries from genomic tumor DNA. If anyone here is working on related research, or is a potential user of such a bandwidth selection algorithm, maybe we could share some thoughts.
We use KC-smart for the detection of recurrent copy number aberrations across multiple tumor libraries. In short, it works like this:
- Each library (i.e. sample, i.e. tumor) gives us approximately 5 million reads. Because the fragments are produced by sonication rather than restriction enzymes, the reads tend to be at unique locations. After correction for GC bias and similar effects, the read density is proportional to the copy number.
- The reads are binned into windows of, say, 50 kb, so that each window has a copy number estimate.
- That copy number estimate is then used by KC-smart, which is a kernel regression algorithm originally made for aCGH. It produces locally weighted regression coefficients related to research questions such as "does this region have a higher copy number in library class A than in library class B?".
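To make the binning and smoothing steps above concrete, here is a minimal sketch in Python with NumPy. It is not KC-smart's actual implementation: the function names, the median normalization, the Gaussian kernel, and the example bandwidth are all assumptions chosen for illustration, and GC correction is left out entirely.

```python
import numpy as np

def bin_reads(positions, chrom_length, window=50_000):
    """Count reads per fixed-width window; return (window centers, counts)."""
    n_bins = int(np.ceil(chrom_length / window))
    edges = np.arange(n_bins + 1) * window
    counts, _ = np.histogram(positions, bins=edges)
    centers = edges[:-1] + window / 2.0
    return centers, counts

def copy_number_ratio(counts):
    """Scale counts so that the median window has ratio 1.

    Assumption: most of the chromosome is at the baseline copy number,
    so the median window is a reasonable reference (median must be > 0).
    """
    return counts / np.median(counts)

def kernel_smooth(x, y, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel,
    evaluated at every point in x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ y) / w.sum(axis=1)

# Simulated read start positions on a hypothetical 5 Mb chromosome.
rng = np.random.default_rng(0)
positions = rng.integers(0, 5_000_000, size=50_000)
centers, counts = bin_reads(positions, chrom_length=5_000_000)
smoothed = kernel_smooth(centers, copy_number_ratio(counts), bandwidth=200_000.0)
```

The bandwidth is fixed by hand here (200 kb), which is exactly the choice a selection algorithm would have to make from the data.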
What I want to do is develop a bandwidth selection algorithm that uses the read locations directly, without binning them. Bandwidth selection for kernel regression is a little different from bandwidth selection for kernel density estimation. I might also consider:
- building the GC correction into the bandwidth selection
- dynamic bandwidth selection, i.e. larger bandwidth in low-copy-number regions
- shrinking estimates towards the nearest integer copy number (the sample may be homogeneous with respect to the CN of some regions)
- handling ambiguously mapped reads
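As a baseline to compare against, one standard way to choose a bandwidth for kernel regression is leave-one-out cross-validation: predict each point from all the others and pick the bandwidth with the smallest squared prediction error. The sketch below is a generic illustration of that idea, not the algorithm I have in mind. It works on already binned copy number estimates rather than raw read locations, it uses a single global bandwidth, and the function names, the Gaussian kernel, and the candidate grid are all assumptions.

```python
import numpy as np

def loo_cv_score(x, y, bandwidth):
    """Mean squared leave-one-out prediction error of a Gaussian
    Nadaraya-Watson smoother with the given bandwidth."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    np.fill_diagonal(w, 0.0)  # leave each point out of its own prediction
    denom = w.sum(axis=1)
    if np.any(denom == 0):
        # Bandwidth so small that some point has no usable neighbours.
        return float("inf")
    pred = (w @ y) / denom
    return float(np.mean((y - pred) ** 2))

def select_bandwidth(x, y, candidates):
    """Return the candidate bandwidth with the lowest LOO-CV score,
    together with the scores of all candidates."""
    scores = [loo_cv_score(x, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))], scores

# Simulated example: 100 windows of 50 kb with a single copy number
# gain (2 -> 3) halfway along, plus Gaussian noise.
rng = np.random.default_rng(1)
x = (np.arange(100) + 0.5) * 50_000.0
y = np.where(x < 2_500_000, 2.0, 3.0) + rng.normal(0.0, 0.2, size=x.size)
candidates = [50e3, 100e3, 200e3, 500e3, 5e6]
best_h, scores = select_bandwidth(x, y, candidates)
```

In this simulation the largest candidate (5 Mb) smooths the gain away and scores clearly worse than the smaller ones. A dynamic bandwidth would replace the single `bandwidth` argument with a value per location.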