Hi all, I haven't seen anyone else mention this so I was wondering if someone could clear something up for me. The GATK base quality recalibration process basically works as follows: The CountCovariates walker goes through all your reads, looking for bases that mismatch the reference at positions that are not listed as polymorphic in dbSNP. These mismatches are presumed to be mostly sequencing errors.
A table is built up containing, for every possible combination of [dinucleotide, base position in read, quality score], a count of the total number of times that combination was encountered, and the number of those times that the base mismatched the reference at a non-polymorphic site. This ratio (mismatches/total observations) is taken to represent the empirical error rate of those base calls, and the quality score for that set of bases is overwritten with the shiny new empirical quality score.
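Just to make sure I've got the bookkeeping right, here is a minimal Python sketch of how I picture the counting and the Phred conversion. The read fields, the dbsnp_sites set, and the small pseudocount are my own illustrative assumptions, not the actual CountCovariates code:

```python
import math
from collections import defaultdict

# Illustrative sketch only -- field names and helpers are hypothetical,
# not the real GATK implementation.

def empirical_quality(mismatches, observations):
    """Convert an observed error rate into a Phred-scaled quality score."""
    error_rate = (mismatches + 1) / (observations + 1)  # pseudocount to avoid log(0)
    return round(-10 * math.log10(error_rate))

def build_recalibration_table(reads, reference, dbsnp_sites):
    """Count observations and mismatches per covariate bin.

    Each bin key is (dinucleotide, position in read, reported quality).
    Sites listed in dbSNP are skipped so real variation isn't counted as error.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [observations, mismatches]
    for read in reads:
        for i, (base, qual, ref_pos) in enumerate(
            zip(read.bases, read.quals, read.ref_positions)
        ):
            if ref_pos in dbsnp_sites:
                continue  # known polymorphic site: ignore
            dinuc = (read.bases[i - 1] + base) if i > 0 else ("N" + base)
            key = (dinuc, i, qual)
            counts[key][0] += 1
            if base != reference[ref_pos]:
                counts[key][1] += 1
    # Replace the reported quality for each bin with the empirical one
    return {key: empirical_quality(mis, obs) for key, (obs, mis) in counts.items()}
```

(If I've misunderstood any step of that, please correct me.)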
I understand completely why this is helpful in dealing with systematic biases in the assignment of quality scores by the sequencer, but doesn't it possibly introduce a bias of its own? For example, suppose a certain nucleotide or dinucleotide context is more likely to spontaneously mutate (e.g. methylcytosine -> thymine). Wouldn't this create a bias in the empirical quality scores for that nucleotide or dinucleotide? It would be reported as being less accurate, despite the fact that it was being correctly sequenced.
Is this actually likely to be a significant problem, or would the number of reads with mutations like this be tiny compared to the number with mismatches due to sequencing errors?