Analyzing single-cell gene expression data has greatly improved biological research, offering insights at a granular level that were previously unattainable. However, the growing scale of single-cell datasets has challenged existing computational methods, leading to inconsistent results and bottlenecks in data processing. Researchers at St. Jude Children’s Research Hospital have addressed this issue by developing a machine-learning algorithm designed to handle the vast amounts of data generated in single-cell studies. The new approach, published in Cell Genomics, provides a scalable solution that delivers accurate and unbiased results.
Overcoming Limitations in Single-Cell Data Analysis
The shift from bulk gene expression studies to single-cell RNA sequencing (scRNA-seq) marked a significant advance in biomedical research. Instead of averaging gene activity across millions of cells, researchers can now study the molecular behavior of individual cells. This has yielded breakthroughs in understanding diseases and treatments. However, as datasets expand, sometimes encompassing millions of cells, the computational demands of analyzing these data have grown exponentially.
“We’ve implemented a new toolset that can be scaled as these single-cell RNA sequencing datasets continue to grow,” said Paul Geeleher, from the St. Jude Department of Computational Biology. “There has been an exponential explosion in the compute time for single-cell analysis, and our method brings accurate analysis back into a tractable timeframe.”
Harnessing GPU Technology for Scalability
Conventional computational methods for scRNA-seq analysis often require researchers to compromise on the quality of their analyses due to hardware limitations. Geeleher’s team tackled this challenge by leveraging graphics processing units (GPUs), which are optimized for handling large-scale parallel computations.
“We created a method that uses graphics processing units or GPUs,” explained Xueying Liu, the study’s first author. “The GPU integration gave us the processing power to perform the computational load in a scalable way.”
An Unsupervised Learning Approach
A critical feature of the new algorithm is its reliance on unsupervised machine learning. Unlike traditional methods, which require researchers to make assumptions about the data, this approach automatically identifies meaningful patterns.
“Our method uses unsupervised machine learning, which automatically determines more robust and less arbitrary parameters for the analysis,” Liu said. “It learns how to group cells based on their different active biological processes or cell type identities.”
The algorithm, named Consensus and Scalable Inference of Gene Expression Programs (CSI-GEP), processes each dataset independently, minimizing bias. By focusing exclusively on the biological signals within the data, CSI-GEP achieves results that are both accurate and generalizable.
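To make the idea of unsupervised recovery of gene expression programs concrete, here is a minimal sketch. The article does not describe CSI-GEP's internals, so the technique below is an assumption: non-negative matrix factorization (NMF) with multiplicative updates, a common unsupervised way to split a cells-by-genes matrix into per-cell "usage" and per-program gene weights. The function names, matrix sizes, and synthetic data are all hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (cells x genes) as W @ H.

    W (cells x k) holds how strongly each cell uses each program;
    H (k x genes) holds the gene weights of each program.
    Uses Lee-Seung multiplicative updates for the Frobenius loss.
    Illustrative only; not the CSI-GEP algorithm.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    W = rng.random((n_cells, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so it stays non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruction_error(X, W, H):
    """Relative Frobenius error of the factorization."""
    return float(np.linalg.norm(X - W @ H) / np.linalg.norm(X))


# Synthetic stand-in for an expression matrix: 60 cells, 20 genes,
# generated from 3 hidden programs (assumed values, not real data).
rng = np.random.default_rng(42)
true_usage = rng.random((60, 3))
true_programs = rng.random((3, 20))
X = true_usage @ true_programs

W, H = nmf(X, k=3)
err = reconstruction_error(X, W, H)
print(f"relative reconstruction error: {err:.4f}")
```

The work here is dominated by dense matrix products, which is the kind of computation GPUs parallelize well. Note that the number of programs `k` is chosen by hand in this sketch; the article's point is that CSI-GEP determines such parameters automatically.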
Advancing Biological Discovery
When applied to some of the largest scRNA-seq datasets available, CSI-GEP outperformed existing methods. It successfully identified cell types and biological activities that other approaches missed, demonstrating its potential as a robust tool for studying a wide range of diseases.
“We’ve created a tool broadly applicable to studying any disease through single-cell RNA analysis,” Geeleher noted. “The method performed substantially better than all existing approaches we tested, so I hope other scientists consider using it to get better value out of their single-cell data.”
Publication Details
Liu X, Chapple RH, Bennett D, et al. CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data. Cell Genomics. 2025;5(1):100739. doi:10.1016/j.xgen.2024.100739