
Metrics that Matter: Important Metrics for Long-Read Sequencing Experiments—Part 2



    Continuing from our last article on important metrics for long-read sequencing technologies, this article focuses on the metrics relevant to evaluating the success of a Pacific Biosciences (PacBio) sequencing run.

    Pacific Biosciences

    PacBio has become synonymous with High Fidelity (HiFi) sequencing. Supported on the Sequel II and IIe instruments, and now on the newer Revio sequencer, HiFi sequencing is built on PacBio’s single-molecule real-time (SMRT) sequencing technology. HiFi reads are generated in circular consensus sequencing (CCS) mode, in which several successive observations of the same DNA molecule (referred to as subreads) are combined into a consensus to increase the accuracy of the HiFi read. This process gives users several key metrics for evaluating a run.

    “The most important primary metrics to evaluate in a PacBio long-read sequencing run are: HiFi yield, HiFi read length, HiFi read quality, and sequencing control performance,” said Aaron Wenger, Director of Product Marketing at PacBio. “The metrics of sequencing yield and quality are common across sequencing technologies, but the nuances of how they are defined is specific to PacBio long-read sequencing.” Wenger also adds that there are many more secondary metrics that vary by application, several of which will also be covered.

    HiFi yield
    The first metric, HiFi yield, is defined as “the number of base pairs in high-quality (>QV20 or 99% accuracy), usable reads,” said Wenger. He also explained that “HiFi yield is the sum of the length of all HiFi reads (those with accuracy >QV20) produced from a run. Each HiFi read provides the sequence of a single DNA molecule and is produced from the numerous passes the polymerase makes across the molecule, which are used to generate a consensus read sequence using the CCS algorithm.”

    When asked about the range of values for the HiFi yield, Wenger noted that all of the important metrics vary by movie time, application, and sequencing platform. However, he said the typical HiFi yield on the Revio system is 90 Gb for its most popular application (whole genome sequencing), with a majority of the runs achieving 75–105 Gb based on library quality.
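
    As a rough illustration of the definition quoted above, the following Python sketch sums the lengths of reads that meet the quality cutoff. The function name, the input format (pairs of read length and predicted read quality), and the example reads are hypothetical, and treating the Q20 cutoff as inclusive is an assumption; in practice these values come from the instrument's run reports.

```python
from typing import Iterable, Tuple

QV_THRESHOLD = 20.0  # Q20 corresponds to 99% predicted accuracy


def hifi_yield(reads: Iterable[Tuple[int, float]],
               min_qv: float = QV_THRESHOLD) -> int:
    """Sum the lengths (in bp) of reads whose predicted quality meets min_qv.

    `reads` is a hypothetical iterable of (length_bp, read_quality) pairs.
    The inclusive comparison (>=) is an assumption of this sketch.
    """
    return sum(length for length, qv in reads if qv >= min_qv)


# Made-up reads for illustration only: the third read falls below Q20
example_reads = [(15_000, 32.0), (18_500, 28.5), (12_000, 17.0), (20_000, 30.1)]
total_bp = hifi_yield(example_reads)
print(f"HiFi yield: {total_bp} bp ({total_bp / 1e9:.6f} Gb)")
```

    Only the three reads at or above Q20 contribute to the total, which is why a run with many low-quality reads can show a lower HiFi yield than its raw output suggests.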

    The HiFi yield is calculated from the product of two other important metrics—sample polymerase read length and productivity (P1). “HiFi yield alone can obscure non-ideal values for those metrics,” stated Wenger. “So, it is worth paying attention to the underlying metrics to see if even better performance can be achieved.” He also emphasized that HiFi yield is affected by the fragment size of the SMRTbell library, and users should pay attention to the shearing conditions to achieve the optimal size. For HiFi work like whole-genome sequencing, Wenger said that size is 15–20 kb.

    HiFi read length and quality
    Our next two important PacBio metrics are HiFi read length (measured in kb) and HiFi read quality (measured on the Phred scale). “HiFi read length, typically a mean, is the length of native DNA molecules in the library,” explained Wenger. “HiFi read quality, typically a median, is the predicted accuracy of the reads.” The read length and quality are both derived from the consensus sequence: “Each consensus sequence has a length (the number of base pairs in the DNA molecule) and a predicted read quality (the average of the base qualities across the read) output by the CCS algorithm.”

    While the range on these two metrics also varies, Wenger stated that for a whole-genome sequencing run on the Revio system, “[the] typical mean HiFi read length is 15–20 kb,” and added, “[the] typical median HiFi read quality is Q30 with a range of Q28–32.”
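
    To make the Phred scale concrete, the sketch below converts between quality values and predicted accuracy using the standard relationship Q = -10 · log10(1 - accuracy), and then summarizes a set of reads as a mean length and a median quality, matching the conventions Wenger describes. The function names and example reads are hypothetical; how the CCS algorithm itself averages base qualities is not covered here.

```python
import math
from statistics import mean, median


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality value to a predicted accuracy (0..1)."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a predicted accuracy (0 <= accuracy < 1) to a Phred value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1.0 - accuracy)


def summarize_reads(reads):
    """Return (mean read length in kb, median read quality).

    `reads` is a hypothetical list of (length_bp, read_quality) pairs.
    """
    lengths = [length for length, _ in reads]
    qualities = [quality for _, quality in reads]
    return mean(lengths) / 1000, median(qualities)


# Made-up reads for illustration only
example_reads = [(15_000, 30.0), (20_000, 28.0), (16_000, 32.0)]
mean_kb, median_q = summarize_reads(example_reads)
print(f"mean length: {mean_kb:.1f} kb, median quality: Q{median_q:.0f}")
print(f"Q30 accuracy: {phred_to_accuracy(30):.4f}")
```

    On this scale, Q20 is 99% accuracy and Q30 is 99.9% accuracy, so each 10-point increase corresponds to a tenfold drop in the predicted error rate.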

    Sequencing control performance
    The performance of the sequencing control (measured as count and read length for internal control fragments) is another useful way to assess a PacBio long-read sequencing run. “The sequencing control evaluates sequencing performance separate from the sample and helps determine if low run performance is due to the instrument/consumables or the sample,” said Wenger. “The control read length indicates whether the sequencing polymerase performed well, and the control read count indicates whether the sequencing SMRT Cell was properly loaded.”

    For those unfamiliar with this sequencing control, Wenger explained that “The sequencing control is a specific 11 kb sequence prebound to the PacBio sequencing polymerase. It is spiked into the customer library right before placing the library on the instrument for sequencing. On-instrument software detects reads of the control sequence with very high specificity and processes these reads independently, but in a similar manner to sample reads.” The control read length is calculated as the mean length of all the reads that match the control sequence. Wenger noted, however: “While the control fragment has a fixed length of 11 kb, HiFi sequencing involves multiple serial passes of the molecule and so can achieve any length.” Additionally, he stated, “The control read count is the number of reads that match the control.”

    The typical sequencing control read length is around 70 kb, with a range of 50–100 kb, Wenger explained. He also said that “shorter read lengths may indicate a poor run. The control counts have some variance from a dilution series, so we do see some variation in this metric. A count above 1,000 is normal; lower values are worthwhile to investigate with PacBio support.”
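
    The two control metrics can be sketched in Python as follows. The function name and input format are hypothetical, and the thresholds simply reuse the figures quoted above (a count above 1,000 and a read length of at least 50 kb); they are illustrative flags for follow-up, not official pass/fail criteria.

```python
from statistics import mean


def check_control(control_read_lengths_bp, min_count=1000, min_mean_kb=50.0):
    """Summarize control reads and flag values outside the quoted ranges.

    `control_read_lengths_bp` is a hypothetical list of lengths (bp) of the
    reads that matched the control sequence. Thresholds are assumptions
    taken from the ranges quoted in the article.
    """
    count = len(control_read_lengths_bp)
    mean_kb = mean(control_read_lengths_bp) / 1000 if count else 0.0

    flags = []
    if count < min_count:
        flags.append("low control read count: check SMRT Cell loading")
    if mean_kb < min_mean_kb:
        flags.append("short control reads: check polymerase performance")

    return {"count": count, "mean_length_kb": mean_kb, "flags": flags}


# Made-up examples: a typical-looking run and a poor-looking run
good_run = check_control([70_000] * 1500)
poor_run = check_control([30_000] * 500)
print(good_run)
print(poor_run)
```

    Reporting the count and the mean length separately mirrors the distinction Wenger draws: the length reflects polymerase performance, while the count reflects loading.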

    The sequencing control has some limitations, as Wenger explained: “The sequencing control is not a fully independent assay of the instrument and consumable performance, as it is mixed in with the sample right before going on the instrument. Sample issues, such as contaminants, may impact the PacBio chemistry and alter internal control performance. The control is also limited to sequencing performance and does not account for preparation upstream of sequencing.”

    Secondary metrics

    Along with the previous HiFi and control metrics, several other important metrics specific to common applications were mentioned by Wenger. “Key secondary analysis metrics for the most popular long-read sequencing application, whole-genome sequencing, include variant calling precision and recall for SNVs, indels, and structural variants; assembly contiguity; and assembly quality.” He added, “HiFi reads excel at providing accurate and complete variant calling and assembly.”

    Variant calling precision and recall
    “Variant calling precision (measurement of false positives) and recall (measure of false negatives) is calculated by comparing a callset for a sample to the known truth from a benchmark,” said Wenger. The commonly used benchmarks are provided by the National Institute of Standards and Technology (NIST) Genome in a Bottle project. Wenger explained that the typical range of values for precision and recall is “99.9% for SNVs, 99.3% for indels, and 95% for structural variants.” These two metrics also have limitations. Wenger specified that “variant calling precision and recall are great metrics for quality but are only available for a small number of benchmark samples.”
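
    The comparison of a callset to a benchmark can be illustrated with a simplified Python sketch. Here variants are represented as hypothetical (chromosome, position, ref, alt) tuples and matched exactly; real benchmarking tools also handle differing variant representations and confident regions, which this sketch does not attempt.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def compare_callset(calls: set, truth: set):
    """Compare a callset to a truth set by exact match (a simplifying assumption)."""
    tp = len(calls & truth)   # called and in the benchmark
    fp = len(calls - truth)   # called but not in the benchmark
    fn = len(truth - calls)   # in the benchmark but missed
    return precision_recall(tp, fp, fn)


# Made-up variants for illustration only
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
calls = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr3", 10, "T", "C")}

precision, recall = compare_callset(calls, truth)
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```

    Precision falls as false positives increase and recall falls as false negatives increase, which is why the two are reported together.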

    Assembly contiguity and quality
    The last two important metrics related to genome assembly are assembly contiguity (measured as contig NG50 in Mb) and assembly quality. Wenger stated that “Genome assembly aims to reconstruct full-length (telomere-to-telomere) chromosomes for a genome.” Since the assembly contiguity is dependent on the contigs, it’s important to understand that a contig is a long, contiguous sequence of DNA made up of overlapping segments. Therefore, Wenger explained, “The contig NG50 is the contig length such that at least 50% of the genome is in contigs that length or longer.” He also stated that the “typical contig N50 is 50 Mb for human [samples] (and >10 Mb for all species), meaning that most chromosomes are directly assembled into a small number of contigs.”
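
    The NG50 definition quoted above can be sketched in Python as follows. The function names and the toy contig lengths are hypothetical. The sketch also shows the closely related N50, which uses the total assembly length in place of the genome size, as the conventional definition has it.

```python
def contig_ng50(contig_lengths, genome_size):
    """Return the contig length L such that contigs of length >= L
    together cover at least 50% of `genome_size`.

    Returns 0 if the assembly covers less than half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def contig_n50(contig_lengths):
    """N50: same calculation, relative to the total assembly length."""
    return contig_ng50(contig_lengths, sum(contig_lengths))


# Toy contig lengths (arbitrary units) for illustration only
contigs = [50, 40, 30, 20, 10]
print("NG50:", contig_ng50(contigs, genome_size=200))
print("N50: ", contig_n50(contigs))
```

    Because the toy assembly totals 150 units against an assumed genome size of 200, the NG50 is smaller than the N50; the two agree only when the assembly length matches the genome size.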

    Lastly, for determining assembly quality, users can evaluate the assembly contiguity as well as the completeness and correctness of the assembly. To measure completeness, it is common to utilize BUSCO (Benchmarking Universal Single-Copy Orthologs) scores [1], which assess the presence or absence of highly conserved genes in an assembly. Correctness is harder to measure but is described as the accuracy of each base pair in the assembly and is typically measured as concordance of an assembly to a benchmark reference.

    1. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Kelley J, ed. Molecular Biology and Evolution. 2021;38(10):4647-4654. doi:

    About the Author


    Benjamin Atha (seqadmin) holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.
