

From Algorithms to Assemblies: An Interview with Sequencing Analysis Experts—Part 5






    This is part five of our Q&A article series with several leading sequencing analysis providers. We’re asking them important questions to learn how they handle different aspects of the analysis process.

    In this latest segment of our series, we ask our participants about keeping up with the latest analysis trends.

    Look back at the previous questions to review the first installment on quality control, the second installment on assemblies and alignments, the third installment on transcript analysis, and the fourth installment on data visualization.

    What are some of the latest trends in sequencing data analysis and how do you stay up-to-date with these developments?

    Mike Lelivelt, VP of Software Product Management and Marketing, Illumina

    Onboard bioinformatics – DRAGEN onboard offers deep integration with NovaSeq X and the flexibility to rethink cloud usage. Perform BCL-to-FASTQ conversion onboard before uploading to the cloud. Leverage embedded ORA compression to decrease the size of your FASTQ files by up to 80%. Users gain access to run planning and quality assessments through a simple interface.

    Graph genomes – When the first draft human genome was published in June 2000, a composite of libraries from over ten individuals was used. Today, we have easy access to complete genomes from individuals. This changes how we do bioinformatics. A “graph genome” is a collection of genomes from multiple individuals in which the phasing of the genome (that is, the relationship of variants within an individual) is retained. When graph genomes are leveraged in both mapping and variant calling, accuracy improves. This is especially important for population studies, where patterns of allele frequencies can differ.
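    To make the idea concrete, here is a minimal, illustrative sketch of a variation graph in Python (an assumption for explanation only, not Illumina's DRAGEN implementation): nodes hold sequence segments, alternate alleles become branch nodes, and each individual contributes a phased path, so the graph records which alleles travel together on one haplotype.

    ```python
    # Illustrative variation graph: nodes are sequence segments,
    # branches carry alternate alleles, and a path through the
    # graph reconstructs one phased haplotype.
    from collections import defaultdict

    class VariationGraph:
        def __init__(self):
            self.nodes = {}                # node id -> sequence segment
            self.edges = defaultdict(set)  # node id -> successor node ids

        def add_node(self, nid, seq):
            self.nodes[nid] = seq

        def add_edge(self, a, b):
            self.edges[a].add(b)

        def haplotype(self, path):
            """Concatenate segments along one phased path."""
            return "".join(self.nodes[n] for n in path)

    # Two variant sites; the phased path below keeps the two alt
    # alleles together, which a flat reference genome cannot express.
    g = VariationGraph()
    g.add_node("ref0", "ACGT")
    g.add_node("refA", "A"); g.add_node("altG", "G")   # site 1 alleles
    g.add_node("ref1", "TTCA")
    g.add_node("refC", "C"); g.add_node("altT", "T")   # site 2 alleles
    for a, b in [("ref0", "refA"), ("ref0", "altG"),
                 ("refA", "ref1"), ("altG", "ref1"),
                 ("ref1", "refC"), ("ref1", "altT")]:
        g.add_edge(a, b)

    print(g.haplotype(["ref0", "altG", "ref1", "altT"]))  # → ACGTGTTCAT
    ```

    Real graph-genome tools additionally index the graph for fast read mapping, but the key design choice is the same: alleles are stored on paths, not collapsed into a single linear reference.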

    Richard Moir, Director of Product and Technology, Geneious

    The two most important trends we see in this space are long-read and single-cell sequencing technologies.

    The development direction of Geneious Prime is heavily guided by user feedback, and we continuously evaluate new technologies in the field of sequencing to identify where we can better help scientists with their research. For long reads specifically, we have an ongoing effort to evaluate and implement new tools as they become mainstream, such as Flye and Minimap2, two of the leading tools in the space that are already available in Geneious. We continue to watch the space for opportunities to expand our toolkit for long-read analysis.

    Geneious Biologics is a separate cloud-enabled solution that provides advanced analytics and intuitive visualizations for a wide range of TCR and antibody-like molecules. Its capabilities include tools for end-to-end processing and analysis of single-cell data sets, including demultiplexing with UMIs and clustering sequences within a sample.

    MGI (Complete Genomics)
    Dr. Ni Ming, Senior Vice-President, MGI

    In recent years, an increasing number of scientists are approaching the entirety of biology from a trans-omics angle. By integrating DNA, epigenetic, RNA, protein, and other molecular data through trans-omics analysis, a complete cellular picture can be obtained. Not only does this provide new scientific insights that cannot be found via a single-omics approach, it also offers different perspectives for discovery across multiple biological levels. With the continuous reduction of sequencing costs, the scale of trans-omics applications has increased dramatically, resulting in unprecedented challenges for large-scale sample collection, preservation, sequencing data production, and analysis platforms. To address this challenge, we have developed an integrated platform for storage, reading, computing, and usage, which provides multiple layers of hardware and software support, thereby addressing the bottleneck in large-scale trans-omics data analysis and production. The applications of sequencing data analysis are primarily demonstrated in the domains of computing and usage. So far, MGI has introduced a variety of internationally competitive BIT (BioInformatics Technology) products, such as the bioinformatics analysis software MegaBOLT/ZBOLT/ZBOLT Pro. By using software enhancements to improve genetic data management capabilities, we provide effective tools for analyzing massive trans-omics data.

    The following illustrates the three latest trends in sequencing data analysis and the efforts MGI and Complete Genomics make to stay up-to-date with them.

    1. The prevalence of large-scale trans-omics datasets has created a large demand for computing acceleration. Moreover, by using the complete sequence of a human genome (T2T)¹, more accurate and comprehensive analysis results can be obtained. To achieve computation, storage, and management for large-scale population genomics, it is necessary to create highly cost-effective, high-density, and highly scalable technology and products. Therefore, we have independently developed the MegaBOLT series and ZTRON series products. Among them, the MegaBOLT/ZBOLT/ZBOLT Pro bioinformatics analysis accelerators adopt a parallel computing architecture with multiple pipelines and are over 300 times faster than classic analysis algorithms. Their analysis capacity reaches up to 5,000/17,000/70,000 WGS per year, respectively, with a daily throughput of approximately 1 TB/5 TB/20 TB of genetic data. Combined with our sequencer products, they enable highly efficient genetic data management. For example, the DNBSEQ-T7* platform produces 60 sets of WGS data per run, which then take only half a day to be processed by the ZBOLT Pro. Additionally, we provide the ZTRON genetic data center all-in-one machine as a “one-stop shop” for large populations of tens of thousands, hundreds of thousands, or millions of samples, fused with the ZBOLT bioinformatics analysis accelerator (also capable of high-performance data management) to maximize cost reduction and fully accelerate genomic data processing.

    2. The prevalence of large-scale trans-omics datasets also creates increased requirements for data security and management. When running large-scale trans-omics data analysis and management, data security must be considered. Life-science data management and privacy-preserving sharing are major challenges for big data in genomics. The technological bottlenecks include difficulties in tracing personal data, protecting it, and sharing it across information silos. Currently, countries around the world have introduced personal data privacy protection laws and regulations, such as GDPR in Europe, HIPAA in the United States, and China's Data Security Law and Personal Information Protection Law. For data security and privacy protection, genomic data products need to follow the principle of ‘Privacy by Design’, which builds in privacy from the very beginning: from design, through safe and efficient computing, storage, and encryption of data, to secure transmission and management. So far, we have completed the research and development of products based on these principles, fully protecting customer data privacy and security.

    In addition, large-scale trans-omics research not only places great demands on the innovative development of software and hardware for storage, reading, computing, and usage, it also brings new development opportunities for laboratory management, which is gradually moving toward digitization and intelligence. For this reason, we have established the ZLIMS four-layer laboratory management architecture to provide full-process, full-cycle management from sample to experimental results. The four layers are environmental management, equipment management, application management, and data management. ZLIMS has been successfully applied in large-scale sequencing laboratories handling millions of samples.

    3. The breakthrough of AI technology has promoted the integration of AI and bioinformatics. In this regard, we have kept pace with cutting-edge technology. For instance, by utilizing our self-developed MegaBOLT-DV deep-learning variant-calling algorithm, trained on our own datasets, we obtain more accurate results. Further, we look forward to new opportunities brought by the integration of GPT-4 and bioinformatics applications.

    In the future, in response to the characteristics of large-scale trans-omics research, such as numerous samples, lengthy cycles, complex projects, and voluminous data, MGI and Complete Genomics will continue to develop a complete set of digitalized core tools for life sciences to facilitate efficient trans-omics research.

    *Unless otherwise informed, StandardMPS and CoolMPS sequencing reagents, and sequencers for use with such reagents are not available in Germany, Spain, UK, Sweden, Italy, Czech Republic, Switzerland and Hong Kong (CoolMPS is available in Hong Kong).
    *Products are provided for Research Use Only. Not for use in diagnostic procedures (except as specifically noted).

    1. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53. doi:10.1126/science.abj6987.

    Simon Valentine, Chief Commercial Officer, Basepair

    The use of sequencing data is rapidly expanding across many fields and a variety of different applications. Small organizations with limited resources are now able to generate genomic data, but don’t have the computational resources and know-how to analyze it, while larger organizations want to continuously scale up their analysis capabilities. Not only is there a need to run workflows efficiently in the cloud, but also the ability to enable more scientists to perform routine analyses themselves with an easy-to-use, point & click interface. Oftentimes, organizations want to leverage their own cloud account rather than rely on a third-party platform. Basepair is uniquely positioned to accomplish this as we are able to orchestrate the storage and compute resources of their account while still providing the benefit of Basepair’s suite of genomics tools.

    We’re also seeing that kit manufacturers have become more interested in offering their customers easy-to-use analysis tools to keep customers within their brand and ecosystem. Basepair accomplishes this by white-labeling the platform and offering a business operations layer under the hood. Organizations want more than just data analysis from platforms like ours, which is why we’ve recently built upon our advanced enterprise features. This includes the use of sample metadata layers to both enable complete sample provenance and connect the wet and dry labs, as well as automated data archival.

    QIAGEN Digital Insights Team
    There are two key trends: 1) the use of artificial intelligence (AI) and machine learning (ML) for curating data, and 2) the use of accelerated pipelines.

    1) AI/ML are very intriguing for assembling large collections of data against which you can perform analyses. You may be familiar with ChatGPT as an example of this. But what we find is if you ask these AI-derived databases for general knowledge like “What are the key genes involved with the calcium signaling pathway?” you will get a relatively accurate response, albeit quite superficial and without novel players or targets, or any sense of how the genes work together. If you ask these systems to provide a list of all upstream genes implicated in regulating a specific gene like TP53, you will get a list of genes that may or may not actually be linked to TP53.

    Further, AI-based software does not “show its work”: you have no idea where the information came from, whether it is reliable, or whether the algorithm made it up. The criteria these AI/ML systems use for including genes can vary widely, so the incidence of false positives/false negatives can be as high as 30%!

    This is why we use a “four-eyes” approach to curating our QIAGEN Digital Insights knowledge bases. We use AI/ML to suggest certain findings to include in our database, but before any of these findings are introduced into our databases, we validate them with at least 2 of our expert (human) curators. This additional yet crucial step of human certification and manual curation ensures all the findings are accurate and traceable in our databases.
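    The "four-eyes" gate described above can be sketched as a simple filter (a hypothetical illustration of the workflow's logic, not QIAGEN's actual system; the findings, curator names, and threshold below are invented for the example):

    ```python
    # Hypothetical "four-eyes" curation gate: AI/ML proposes findings,
    # and a finding enters the knowledge base only after approval by
    # at least two distinct human curators.
    MIN_APPROVALS = 2

    def accepted(finding, approvals):
        """True only if >= MIN_APPROVALS distinct curators signed off."""
        return len(set(approvals.get(finding, []))) >= MIN_APPROVALS

    approvals = {
        "GeneX activates TP53": ["curator_a", "curator_b"],  # two reviewers
        "GeneY inhibits TP53": ["curator_a"],                # only one
    }
    curated = [f for f in approvals if accepted(f, approvals)]
    print(curated)  # only the doubly-reviewed finding survives
    ```

    Using `set()` on the reviewer list means one curator approving twice still counts once, which is the point of requiring independent eyes.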

    2) The other emerging trend is accelerated analysis pipelines. This year we introduced the QIAGEN CLC LightSpeed Module. LightSpeed is a software-based accelerator that can process a 35x whole genome sequencing (WGS) FASTQ into a VCF in about 20 minutes on the cloud. But what’s more impressive is that it can do this for about $0.50, versus $4 to $7 with hardware-based accelerators. The other key aspect of our LightSpeed technology is that you don’t have to use it on the cloud: you can run it on an HPC cluster, a workstation, or even a laptop.
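    At cohort scale, the quoted per-genome costs compound quickly; a quick back-of-envelope comparison (using only the per-genome figures stated above, with a hypothetical 10,000-genome cohort):

    ```python
    # Compare total secondary-analysis cost for a cohort at the
    # quoted per-genome prices: ~$0.50 (software-accelerated) vs
    # $4-$7 (hardware-accelerated).
    genomes = 10_000
    software_cost = genomes * 0.50
    hardware_low, hardware_high = genomes * 4, genomes * 7

    print(f"Software-accelerated: ${software_cost:,.0f}")
    print(f"Hardware-accelerated: ${hardware_low:,.0f}-${hardware_high:,.0f}")
    print(f"Savings: ${hardware_low - software_cost:,.0f} or more")
    ```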

    Read the final installment of our Q&A series!
