Current NGS technologies easily deliver terabases of data in a single run. At the same time, AI is on the rise, with many touting its potential for handling complex, high-volume information. In this article, we break down how AI models are being implemented in NGS from primary to tertiary analysis, and whether they offer an advantage in accuracy, speed, and genomic insights.
Basecalling: where chemistry determines impact
Basecalling is a foundational step in any NGS technology, as errors here propagate downstream, making accuracy a priority. As a result, AI has entered basecalling, but its impact varies across platforms, depending on how signals are produced.
Short vs. long reads
Short-read platforms such as Illumina NovaSeq X are based on optical detection of well-separated fluorescence signals. This produces errors that are infrequent and mostly random, which fit well with probabilistic methods.
For long-read platforms like Oxford Nanopore, basecalling is more complex. Signals correspond to ionic current changes that depend on the sequence context and translocation speed, leading to non-random error profiles. Earlier Hidden Markov Model basecallers were designed to address these challenges, but they still rely on predefined assumptions. In contrast, neural-network basecallers learn error patterns directly from raw current traces. As a result, the latest basecalling models in Dorado now achieve 99.75% accuracy (Q26), a substantial improvement over previous models.
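For readers less familiar with quality scores, the pairing of 99.75% with Q26 follows from the standard Phred definition, Q = -10 · log10(error probability). The short Python sketch below only illustrates that conversion; the function names are ours and are not part of Dorado or any basecaller.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (between 0 and 1) to a Phred quality score."""
    error = 1.0 - accuracy
    return -10.0 * math.log10(error)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# 99.75% accuracy is an error rate of 0.25%, i.e. roughly Q26.
print(round(accuracy_to_phred(0.9975), 1))  # 26.0
# Q30 corresponds to one error in 1,000 bases.
print(round(phred_to_accuracy(30), 4))  # 0.999
```

Because the scale is logarithmic, each additional 10 points of Q corresponds to a tenfold drop in error rate.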
PacBio sits in the middle. Although raw error rates are higher than in Illumina, signals are produced as single incorporation events, so error profiles are largely random. Further, to obtain long and accurate sequences (HiFi), PacBio reads are generated not from individual observations but using circular consensus sequencing (CCS), where multiple observations of the same sequence are merged.
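To see why merging repeated observations helps when errors are largely random, consider a deliberately simplified model: each pass over a base is wrong independently with the same probability, and the consensus is a plain majority vote. This is a toy illustration, not PacBio's CCS algorithm, and the 10% per-pass error rate is a hypothetical value chosen for the example.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n independent passes are wrong.

    p is the per-pass error probability; n should be odd so ties cannot occur.
    """
    k_min = n // 2 + 1  # smallest number of wrong passes that outvotes the rest
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# Hypothetical 10% per-pass error: consensus error falls quickly with more passes.
for passes in (1, 3, 5, 9):
    print(passes, majority_error(0.10, passes))
```

With these assumptions, three passes already cut the error from 10% to 2.8%. The same logic breaks down when errors are systematic rather than random, because repeated passes then tend to repeat the same mistake.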
Here is where AI comes into play. On Revio, PacBio has integrated DeepConsensus, a deep-learning model developed with Google Health that is “selectively applied where simpler CCS models lack confidence,” explained Aaron Wenger, Sr. Director of Product Management at PacBio. The model reduces HiFi read errors by about 42%. That way, fewer observations are needed to obtain HiFi reads, contributing to throughput and speed. “By achieving high HiFi accuracy with fewer observations, DeepConsensus increases the amount of data per SMRT Cell by up to 25%, and enables a reduction in movie times from 30 to 24 hours,” Wenger noted.
Variant calling and complex genomic regions
Identifying genetic variations is the first step toward gaining clinical insights from genomic data. Accuracy is therefore critical, as missed variants and false positives can lead to wrong diagnoses and loss of biologically relevant information.
Variant calling has long relied on statistical and heuristic approaches. They show robust performance for SNVs and short indels in well-defined regions, but can fall short in more complex contexts, including repetitive sequences and technologies with context-dependent error profiles. As Jehee Suh, CEO at Inocras, notes, “the most significant advantage of AI-based models lies in their ability to unlock underexplored areas without being constrained by human-defined heuristics and existing biases,” particularly when addressing “non-coding genome or large-scale structural events, including CNVs and SVs.”
But these variants also pose a challenge for AI models, as training requires large and accurate truth sets, which are scarce for SVs and CNVs. This limitation explains why leading sequencing platforms have not yet implemented AI-based models for complex variants, although this remains an area of active research.
Platform-specific strategies for AI-based variant calling
PacBio already benefits from AI for small-variant calling, especially for indels. As Wenger notes: “PacBio single-molecule sequencing exhibits different error modes compared to short-read technologies, with indels being more common than mismatches.” Instead of adapting short-read statistical models, “AI-based callers can learn these error modes directly, being particularly effective at modeling indel errors that span multiple bases or happen in homopolymer regions,” explained Wenger. PacBio therefore uses AI in tools such as DeepVariant and DeepSomatic for germline and somatic variant calling, while relying on statistical approaches such as Sawfish for CNVs and SVs. Beyond sequence variants, HiFi also captures epigenetic modifications, and here “AI-based models such as Jasmine are used to detect methylation directly from the raw signal, with classical statistical methods used to summarize these signals across reads and genomic regions,” added Wenger.
For short-read sequences with low and mostly random errors, statistical methods match or outperform machine learning approaches. Platforms such as Illumina DRAGEN opt for a hybrid strategy, using ML mainly as a scoring and filtering tool for SNVs and short indels, while relying on optimized statistical and probabilistic models for more complex variants.
New AI-based tools to reduce manual interpretation
After secondary analysis, the next steps focus on linking biological and clinical meaning to genetic information. As NGS technologies produce increasingly large amounts of data, analysis represents a major bottleneck. Artificial intelligence models are central here, as they “accelerate the overall timeline by eliminating numerous manual, time-consuming steps,” explained Suh. As he stated, “AI does not just process data faster; it creates a more streamlined workflow that transforms the entire analytical lifecycle.” Companies like Inocras are now building platforms to address this bottleneck.
AI for tumor context and disease monitoring
CancerVision is Inocras’ AI-powered genetic test designed to turn paired somatic-germline WGS data into clinically supporting reports. Its deep-learning models estimate tumor purity and ploidy, creating tumor-context information that improves downstream somatic variant calling and clonal interpretation. “Our validation reports achieved 100% sensitivity and 92.3% PPV for CNVs, reflecting the high reliability in identifying such complex variant types,” Suh said.
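Sensitivity and PPV are both simple ratios over true positives, false positives, and false negatives. The sketch below shows how they are computed; the counts are invented purely for illustration (they happen to yield the same percentages) and are not Inocras' validation data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real variants that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Fraction of reported variants that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts, NOT taken from any validation report.
true_pos, false_pos, false_neg = 12, 1, 0

print(f"sensitivity = {sensitivity(true_pos, false_neg):.1%}")  # 100.0%
print(f"PPV         = {ppv(true_pos, false_pos):.1%}")  # 92.3%
```

Read together, the two numbers describe a caller that misses no true events while occasionally reporting one that is not real.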
Inocras also applies an AI-based approach to cancer monitoring, helping detect relapses and treatment failures. MRDVision integrates WGS profiles and ctDNA to track patient-specific tumor signatures in blood samples. According to Suh, this is powered by an XGBoost machine learning classifier that is critical for “distinguishing true-positive somatic variants from the false-positive artifacts typically found in plasma.” The model is trained on extensive annotated SNV features, enabling “a ppm limit-of-detection, transforming low-level WG ctDNA signals into highly reliable and actionable clinical insights for cancer monitoring,” Suh added.
AI for scalable interpretation in germline genomics
Other companies are integrating AI into their interpretation pipelines. Illumina offers Emedgene, AI-based software designed to improve variant interpretation for rare diseases and other germline applications, linking variant types to phenotype. To ensure transparency, Emedgene functions as an explainable AI (XAI) that retrieves results with easy-to-review evidence. According to Illumina, this enables a 2–5× increase in efficiency and a 50–75% reduction in total workflow time compared to manual interpretation.
Illumina's AI models have also expanded into noncoding regions. Its latest AI algorithm, PromoterAI, is a deep neural network that predicts how promoter variants affect gene expression. It follows PrimateAI-3D for protein-coding variants and SpliceAI for noncoding splice mutations. A recent study implementing these tools suggested that promoter variants may account for about 6% of the genetic burden of rare diseases, showing how AI can deepen genomic insight while reducing manual work.
Gains and limits across NGS workflows
AI is becoming a core part of NGS. When adapted to chemistry and platform particularities, it can help resolve complex variant types, stabilize noisy signals, and reduce manual review time. This makes AI a powerful way to extract more genomic information from NGS data and speed up personalized diagnoses. As the field advances, transparent models and rigorous benchmarking are key to ensuring that these benefits translate into trustworthy clinical outcomes.
About the author: Nuria M. Wentzien got her Ph.D. from the University of Granada in 2023. After finishing her Ph.D., she discovered that science communication brings together her two favorite worlds: science and writing. Her career reflects her passion for learning and exploring new topics, from a Bachelor's degree in Biochemistry (with a strong focus on human sciences) to a Ph.D. in sustainable agriculture. She brings this same curiosity to every project, making readers feel as inspired by science as she does. Now dedicated to helping life-science companies communicate rigorous science in an engaging way, she is the co-founder of Helixa Communications.