Modern Methods for Phased Genomes


    Continual advancements in genomic and computational technologies have allowed researchers to construct precise pictures of individual genomes and identify variations that distinguish one genome from another. In order to fully understand the impact of these variants, it is essential to determine their chromosomal context in a process known as phasing. In this article, we'll explore some of the current technologies and approaches fundamental to this process.

    What is phasing?
    “We inherit half our DNA from our mother and half from our father,” explained Jonas Korlach, Ph.D., Chief Scientific Officer at Pacific Biosciences (PacBio). “Each set of this DNA we inherit will contain a unique collection of variants, which is often referred to as a haplotype. When you sequence, phasing refers to the process of identifying those unique variants on each DNA sequencing read and then separating (phasing) those reads into their respective parental haplotypes.”

    Accurate phasing matters because it lets researchers link two or more genetic variants to the same parental allele, or gene copy. “This improves the ability to associate genetic differences with disease and disease severity, genetic traits, or to know if someone is a silent carrier of a genetic disease,” noted Korlach. “For example, if you discovered two variants at different locations within a gene that had the potential to disrupt the expression of that gene, it would be important to know if those variants resided on the same copy (one bad and one good copy) or both copies (two bad copies of the gene).”
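    Korlach's two-variant example can be expressed as a small check on phased genotypes. The sketch below assumes VCF-style genotype strings, where `0|1` means the alternate allele sits on the second haplotype and `/` marks an unphased call. The function name `affected_copies` is hypothetical, and the comparison is only meaningful when both variants belong to the same phase block.

```python
from typing import Optional


def affected_copies(gt_a: str, gt_b: str) -> Optional[int]:
    """Count gene copies that carry at least one of two variants.

    gt_a and gt_b are phased, diploid, VCF-style genotypes such as "0|1".
    Returns None when either call is unphased, because the answer is
    then unknowable from the genotypes alone.
    Assumption: both variants fall in the same phase block.
    """
    if "|" not in gt_a or "|" not in gt_b:
        return None
    haplotypes_a = gt_a.split("|")
    haplotypes_b = gt_b.split("|")
    no_variant = ("0", ".")  # reference allele or missing call
    return sum(
        1
        for a, b in zip(haplotypes_a, haplotypes_b)
        if a not in no_variant or b not in no_variant
    )


print(affected_copies("0|1", "0|1"))  # 1: same copy, one good copy remains
print(affected_copies("0|1", "1|0"))  # 2: opposite copies, both disrupted
print(affected_copies("0/1", "0/1"))  # None: unphased calls
```

    With unphased calls the two scenarios are indistinguishable, which is exactly the ambiguity phasing resolves.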

    Alex Hastie, Ph.D., Vice President of Clinical and Scientific Affairs at Bionano, added that in addition to assessing pathogenicity, “Accurate phasing can enable complex haplotype reconstruction, allowing researchers to discriminate different structures in the most variable and functionally interesting regions of the genome (e.g., MHC, 22q, etc.).”

    Techniques used for phasing
    Long-read sequencing
    Earlier phasing techniques frequently depended on short-read sequencing and, in certain instances, imputation; today, one of the most common approaches relies on long-read sequencing. Long reads can be phased directly when a single read spans the interval between two genomic variants. Korlach highlighted that PacBio’s sequencers excel at this process because they generate reads that are over 100 times longer than standard short reads and can easily cover the length of these regions.
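    The idea of phasing from reads that span two variants can be illustrated with a toy tally. This is a minimal sketch, not how any production phaser works: reads are represented as hypothetical dictionaries mapping a genomic position to the base observed there, and the function name `spanning_haplotypes` is an assumption made for illustration.

```python
from collections import Counter


def spanning_haplotypes(reads, site_a, site_b):
    """Tally allele pairs seen on reads that cover both variant sites.

    reads: list of dicts mapping genomic position -> observed base.
    Reads covering only one site carry no phase information for this
    pair and are skipped.
    Returns the two most frequent allele pairs with their read counts.
    """
    pair_counts = Counter()
    for read in reads:
        if site_a in read and site_b in read:
            pair_counts[(read[site_a], read[site_b])] += 1
    return pair_counts.most_common(2)


reads = [
    {100: "A", 5000: "G"},  # spans both sites
    {100: "A", 5000: "G"},  # spans both sites
    {100: "C", 5000: "T"},  # spans both sites
    {100: "C"},             # too short: covers only the first site
    {5000: "T"},            # too short: covers only the second site
]
print(spanning_haplotypes(reads, 100, 5000))
# [(('A', 'G'), 2), (('C', 'T'), 1)]
```

    The two short reads contribute nothing here, which is why read length matters: the longer the reads, the more variant pairs a single read can connect.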

    After the appropriate sequencing reads are generated, the next steps are to assemble the reads, detect the variants, and phase the haplotypes. There are various tools that can assist with this process. The de novo genome assembler Hifiasm is frequently used for phased assemblies with PacBio HiFi datasets [1]. Other tools like HiPhase can be used to enhance the phasing of variant calls from whole-genome datasets [2]. Meanwhile, Paraphase is a tool capable of phasing haplotypes from highly homologous, medically significant genes, such as SMN1/SMN2, within targeted sequencing HiFi datasets [3]. “Additional strategies can be implemented to help improve phasing such as using sequence data from parents to bin long reads into their respective parental haplotypes during the assembly process, known as trio-binning, and/or by including long-range chromosomal contact information from Hi-C sequencing,” Korlach stated.
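    The trio-binning strategy Korlach mentions can be sketched with parent-specific k-mers. The toy below is an illustration only, not the algorithm of Hifiasm or any other tool: the sequences, the tiny `k = 5`, and the function names `kmers` and `bin_read` are all assumptions, and real implementations use much longer k-mers derived from parental sequencing reads and must tolerate sequencing errors.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to a parent by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


K = 5
mother = "ACGTACGTTAGC"  # toy parental sequences differing at one base
father = "ACGTACCTTAGC"
maternal_only = kmers(mother, K) - kmers(father, K)
paternal_only = kmers(father, K) - kmers(mother, K)

print(bin_read("TACGTTAG", maternal_only, paternal_only, K))  # maternal
print(bin_read("TACCTTAG", maternal_only, paternal_only, K))  # paternal
print(bin_read("ACGTAC", maternal_only, paternal_only, K))    # unassigned
```

    Reads that fall entirely within sequence shared by both parents stay unassigned, which is one reason long reads help: they are more likely to overlap at least one parent-specific site.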

    Optical genome mapping
    Bionano’s optical genome mapping (OGM) technique can also be used for phasing, in a manner complementary to long-read sequencing. Describing how it contributes to this process, Hastie explained, “OGM utilizes ultra-high molecular weight DNA, with molecules ≥150 kbp at an average (N50) length of ≈250–400 kbp used in the genome assembly. These molecules have a label pattern introduced at a 6-mer sequence occurring every 5 kbp, on average. OGM measures the physical distance between labels and creates a barcode that can be used to create whole chromosome maps of genomes.”
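    The label "barcode" Hastie describes can be mimicked in silico by locating a motif in a sequence and recording the distances between consecutive occurrences. This is a rough sketch under stated assumptions: the article does not name the 6-mer, so `CTTAAG` is a placeholder, the toy sequence is invented, and the function names are hypothetical. Real OGM measures these distances optically on intact molecules rather than from sequence.

```python
def label_positions(sequence, motif):
    """Return the start position of every occurrence of the motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def label_intervals(sequence, motif):
    """Distances between consecutive labels: the 'barcode' of a molecule."""
    positions = label_positions(sequence, motif)
    return [right - left for left, right in zip(positions, positions[1:])]


MOTIF = "CTTAAG"  # placeholder 6-mer, an assumption for illustration
toy_molecule = "AA" + MOTIF + "A" * 10 + MOTIF + "A" * 4 + MOTIF

print(label_positions(toy_molecule, MOTIF))  # [2, 18, 28]
print(label_intervals(toy_molecule, MOTIF))  # [16, 10]

# In a random sequence with equal base frequencies, a specific 6-mer is
# expected about once every 4**6 bases, the same order of magnitude as
# the ~5 kbp average spacing quoted above.
print(4 ** 6)  # 4096
```

    A structural variant between two labels changes the corresponding interval, and a SNP inside a motif removes or creates a label; those differences are what make two haplotypes distinguishable in the label pattern.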

    Figure: Reference genome map (from hg19/hg38) with genes annotated. The blue bar is the map of an individual with a deletion in the dystrophin gene (DMD); this map was created using long molecules with labeled sequence motifs. (Courtesy of Bionano)

    Because the molecules retain their native length, structural variant breakpoints and SNPs that alter label motifs can be captured within long individual molecules in the assembly and phased, Hastie added. “OGM adds value as a standalone technique to anchor and span complex repeats and can serve as an orthogonal quality check complement to sequencing-based phasing approaches.”

    Phasing with OGM follows a standardized analysis of labeled, linearized, and imaged DNA molecules. “Phasing with OGM is performed directly by interrogating long (150 kbp–2 Mbp) contiguous molecules in the assembly for structures that can be resolved with the label patterns,” explained Hastie. “The molecules are assembled into longer maps by overlap tiling across the chromosome and phasing is done whenever there are heterozygous SVs or SNP-containing label motifs.”

    Phasing in genomic research
    According to Korlach, a crucial benefit of phasing in genomic research is its potential to produce a reference-quality, haplotype-resolved assembly using PacBio HiFi sequencing reads. He believes that this idea was best conveyed by the Human Pangenome Reference Consortium (HPRC) when they wrote, “We no longer consider collapsed 3-Gbp genome assemblies as state of the art (i.e., one representation of an individual where both haplotypes are merged) but instead consider two genomes for every diploid genome assembled (i.e., 6 Gbp vs. 3 Gbp) where parental haplotypes are phased and fully resolved” [4]. Korlach elaborated that this presents the genome in its true diploid state as it exists within the cell. This representation improves the detection of small and structural variations, highlights epigenetic attributes like allele-specific methylation and chromatin-accessible areas, and enriches transcriptomics by uncovering allele-specific gene expression.

    Discussing advancements beyond OGM and long-read sequencing, Hastie pointed to the rise of complementary technologies that facilitate genome phasing, including linked-read sequencing, Hi-C-based conformation capture sequencing, Strand-seq, and trio sequencing. Like the HPRC, he stressed that phasing distinct haplotypes can capture an organism’s full genomic diversity more accurately by avoiding the collapse of two alleles into a single hybrid allele. “At scale, this pangenome view informs population genomics and enables applications from human health to conservation genomics,” he said.

    While there have been many advances in recent years, Korlach emphasized that some of the most exciting have been on the data analysis side. “Fully haplotype-resolved datasets have enabled the construction of large pangenomes that better catalog the genetic diversity within populations. For example, the recent releases of the first draft human pangenome [5] and a first regional Chinese pangenome [6] dataset, alongside a pangenome bioinformatics tool kit [7], have shown improvements in reference-based sequence mapping and variant calling workflows and will hopefully replace current workflows using a single reference genome such as GRCh38 in the future.”

    Looking ahead, Korlach envisions a future where long-read sequencers like the Revio sequencing system play a pivotal role. By letting researchers scale their long-read workflows, these instruments could yield more haplotype-resolved assemblies and accelerate the construction of larger, more comprehensive pangenome datasets.

    1. Yu W, Luo H, Yang J, et al. Comprehensive assessment of eleven de novo HiFi assemblers on complex eukaryotic genomes and metagenomes. bioRxiv. Published online 2023. doi:
    2. Holt JM, Saunders CT, Rowell WJ, Kronenberg Z, Wenger AM, Eberle M. HiPhase: Jointly phasing small and structural variants from HiFi sequencing. bioRxiv. Published online 2023. doi:
    3. Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;110(2):240-250. doi:
    4. Porubsky D, Vollger MR, Harvey WT, et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Research. 2023;33:496-510. doi:
    5. Liao W, Asri M, Ebler J, et al. A draft human pangenome reference. Nature. 2023;617(7960):312-324. doi:
    6. Gao Y, Yang X, Chen H, et al. A pangenome reference of 36 Chinese populations. Nature. 2023;619(7968):112-121. doi:
    7. Chin C, Behera S, Khalak A, et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nature Methods. 2023;20(8):1213-1221. doi:


    About the Author


    Benjamin Atha holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.
