By Radoje Drmanac | Oct 27, 2011 | 11:34 AM
Today, researchers use two approaches to identifying disease-associated variants in the human genome: exome sequencing, which targets the protein-coding regions that make up approximately 1% of the genome, and whole genome sequencing, which investigates the vast majority of the genome and includes both coding and non-coding regions. Historically, researchers opted for exome sequencing because it cost less, protein-coding variants were more easily interpreted than non-coding ones, and it had been used successfully to identify disease-causing variants in several cases. However, targeted sequencing of only a specified list of protein-coding sequences means that DNA variations outside of those regions are missed. Moreover, exome capture by hybridization can introduce considerable coverage variability, affecting comparative analysis and limiting discovery efforts. Perhaps the greatest disadvantage of exome sequencing is its low sensitivity to copy number and structural variations, whereas whole genome sequencing can detect these variation events as well as many copy neutral events, such as uniparental disomies and inversions or translocations.
Technical considerations also suggest that the most accurate and effective way to sequence the exome may be to sequence the whole human genome. Current commercially available exome targeting kits typically cover an incomplete portion of the exome. Kits from different vendors rely on different definitions of ‘exome’, and even kits from a single vendor are frequently updated. The net effect is that the general term ‘exome sequencing’ refers to the sequencing of different genome regions across experiments using different kits. Even within each kit, not all of the desired exome is captured, as certain exons are excluded during design of the capture probes for reasons including size or hybridization thermodynamics. Furthermore, the addition of the selection step itself may introduce biases that prevent the detection of exonic variants.
While a significant proportion of sequencing variants may have no discernible effect on phenotype, thousands of well-annotated and conserved elements implicated in disease exist outside of protein-coding regions. Haussler’s team at UCSC [1] recently discovered ~3M non-coding evolutionarily conserved sequences in the human genome at 10% detection sensitivity. At an average length of 28 bases, and extrapolating from that 10% detection sensitivity, such regulatory sequences could constitute >20% of the genome, compared with ~1% for coding sequences. I think this and other recent results open a new frontier in human genetics. In addition, recent studies now show that a far larger fraction of the human genome is systematically transcribed than previously thought, resulting in the discovery and characterization of new classes of non-protein-coding genes. Moreover, a sizable fraction of loci identified by genome-wide association studies lie within so-called “gene deserts”, i.e., genomic regions with no known protein-coding genes.
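The “>20% of genome” figure can be checked with rough arithmetic. The sketch below is only a back-of-the-envelope estimate: the element count, detection sensitivity, and average length come from the paragraph above, while the genome size of about 3.1 billion bases is an assumption added here. It also assumes that undetected elements have the same average length as detected ones and that elements do not overlap.

```python
# Back-of-the-envelope estimate of how much of the genome conserved
# non-coding elements might cover. Not the method used in the cited paper.

DETECTED_ELEMENTS = 3_000_000    # ~3M elements detected (from the text)
DETECTION_SENSITIVITY = 0.10     # 10% detection sensitivity (from the text)
MEAN_ELEMENT_LENGTH_BP = 28      # average element length in bases (from the text)
GENOME_SIZE_BP = 3_100_000_000   # ASSUMPTION: ~3.1 Gb haploid human genome


def estimated_genome_fraction(detected, sensitivity, mean_length_bp, genome_size_bp):
    """Extrapolate the detected count to the full set, then convert to a genome fraction."""
    total_elements = detected / sensitivity      # ~30M elements if only 10% were found
    covered_bp = total_elements * mean_length_bp  # assumes no overlap between elements
    return covered_bp / genome_size_bp


fraction = estimated_genome_fraction(
    DETECTED_ELEMENTS, DETECTION_SENSITIVITY, MEAN_ELEMENT_LENGTH_BP, GENOME_SIZE_BP
)
print(f"Estimated fraction of genome: {fraction:.0%}")
```

Under these assumptions the estimate comes out at roughly a quarter of the genome, which is consistent with the “>20%” figure; the detected elements alone (3M × 28 bases) would cover only about 3%.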
The practical utility of whole human genome sequencing for identifying disease-associated variants was recently demonstrated in a study published in Nature [2]. Of the 38 multiple myeloma patients studied, 23 tumor-normal pairs were investigated using whole genome sequencing and 16 using exome sequencing, with one pair sequenced by both methods. The results showed that the mutation frequency in coding regions was significantly lower than in the intronic and intergenic regions, due to negative selection pressure against mutations disrupting the coding sequence. In addition, 18 significantly mutated non-coding regions were identified. While exome sequencing identified most of the significantly mutated genes, half of the total protein-coding mutations occurred in chromosomal aberrations such as translocations, most of which would have been missed by sequencing only the exome. Recurrent point mutations in non-coding regions would also have been missed. The paper concludes that whole genome sequencing offers the most comprehensive analysis of coding, non-coding, and other functional elements of the genome.
Finally, cost has become much less of an issue. For approximately $4,000, just a little more than the cost of sequencing an exome, Complete Genomics offers whole human genome sequencing for projects with 50 samples or more, with all the benefits of this comprehensive genetic test. The data provided for each genome have on average 55x mapped coverage, and both alleles are typically called at greater than 95% of positions within the coding and non-coding regions of the genome. Our resulting genomic data includes files of detected and annotated coding and non-coding sequence variants (SNPs, small indels, CNVs, and SVs), data summary reports, and a full set of supporting data for these results. In addition, using genome variants is as easy as using exome variants, thanks to our included annotation of typical known regulatory elements. And the sequencing can be done quickly, with large studies comprising hundreds of whole genomes completed in just a few months with guaranteed quality. This is why some projects planned for exome sequencing have already switched to whole genome sequencing. It’s mainly inertia and lack of awareness of the progress of whole genome sequencing that have prevented a faster switch.
I strongly believe that whole human genome sequencing now offers researchers a much more informative, cost-effective, easy to use, rapid and comprehensive alternative to exome sequencing for identifying disease-associated variants in the human genome.
References
1. Lowe et al., “Three Periods of Regulatory Innovation During Vertebrate Evolution,” Science 333:1019-1023 (2011)
2. Chapman et al., “Initial Genome Sequencing and Analysis of Multiple Myeloma,” Nature 471:467-472 (2011)