Hi all:
I'm assessing a callset that aggregates a couple of thousand human germline samples. I'm concerned that the calls contain too many multi-allelic SNPs (something like 15% of SNPs are multi-allelic) and that these sites are enriched for false positives.
Now, I understand that at some point the usual infinite-sites model breaks down and mutations start occurring at sites that are already polymorphic. But ExAC, for example, has only about 7% multi-allelic sites with 60k+ samples, whereas I have less than 1/20 of that.
Are there ways to assess the quality of these sites? Are there any results (empirical or theoretical) on how many multi-allelic SNPs to expect as a function of sample number? Is Ts/Tv meaningful at multi-allelic sites, and if so, which allele should it be computed on?
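To make the Ts/Tv question concrete, here is a minimal sketch of one option: count every ALT allele against REF separately and report the ratio for bi-allelic and multi-allelic sites side by side. The function names and the `(ref, alts)` input layout are my own assumptions for illustration, not part of any existing tool.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_by_site_class(sites):
    """Per-allele Ts/Tv, split by bi-allelic vs multi-allelic SNP sites.

    `sites` is an iterable of (ref, alts) tuples, e.g. ("A", ["G", "C"]).
    This layout is an assumption; adapt it to however the VCF is parsed.
    Each ALT allele is compared to REF and counted once.
    """
    counts = Counter()
    for ref, alts in sites:
        if ref.upper() not in BASES:
            continue  # skip indels and ambiguous reference bases
        # Keep only single-base ALT alleles that differ from REF.
        snp_alts = [a for a in alts
                    if a.upper() in BASES and a.upper() != ref.upper()]
        if not snp_alts:
            continue
        site_class = "multiallelic" if len(snp_alts) > 1 else "biallelic"
        for alt in snp_alts:
            kind = "ts" if is_transition(ref, alt) else "tv"
            counts[(site_class, kind)] += 1

    ratios = {}
    for site_class in ("biallelic", "multiallelic"):
        ts = counts[(site_class, "ts")]
        tv = counts[(site_class, "tv")]
        ratios[site_class] = ts / tv if tv else float("nan")
    return ratios


if __name__ == "__main__":
    toy_sites = [("A", ["G"]), ("C", ["T"]), ("A", ["C"]),
                 ("A", ["G", "C"]), ("C", ["A", "G"])]
    print(titv_by_site_class(toy_sites))
```

One caveat with counting this way: each base has only one possible transition and two possible transversions, so a tri-allelic SNP site contributes at most one transition, and the per-allele ratio over tri-allelic sites cannot exceed 1. It therefore isn't directly comparable to the genome-wide bi-allelic value, which is part of why I'm unsure how to interpret it.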
TIA!