

From Algorithms to Assemblies: An Interview with Sequencing Analysis Experts—Part 1





    As sequencing technologies and data analysis tools continue to advance, it is more important than ever to ensure your sequencing data is handled appropriately. Analyzing sequencing data is a complex process, and current platforms can perform diverse tasks such as assembling genomes, quantifying transcripts, interpreting detailed experiments, and much more.

    In this Q&A article series, we’re interviewing top sequencing analysis providers to understand the workings of their platforms and learn how they handle different aspects of the analysis process.

    This first installment of the series will focus on quality control measures and ensuring subsequent analyses are set up for success.

    What is your approach to quality control of sequencing data, and how do you ensure the data is high enough quality for downstream analyses?

    Richard Moir, Director of Product and Technology, Geneious
    Geneious Prime provides a broad set of tools for pre-processing and quality control that can be used flexibly across a wide range of sequencing use cases. Our approach is to give scientists an intuitive interface for selecting and configuring the right tools for their data, and then, through powerful visualizations, empower them to explore the results and make their own assessment of the data's quality.

    In many cases, Geneious Prime uses trusted open-source solutions for QC and preprocessing, including several from the BBTools package: bbmerge for merging paired reads, bbnorm for error correction and normalization, and bbduk for trimming. Tools are also provided for demultiplexing, sub-sampling, and chimera detection.

    As each preprocessing step is performed, a new result is saved that can be inspected using summary statistics such as base call quality, read length, GC content, and ambiguities. This makes it easy to compare different approaches and to roll back when one doesn't work well.
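The kind of per-read summary statistics described here can be sketched directly from a sequence and its Phred-encoded quality string. This is a minimal, hypothetical illustration of the metrics, not Geneious Prime's implementation; the function name and defaults are assumptions.

```python
# Illustrative per-read summary statistics: read length, GC content,
# ambiguous (non-ACGT) bases, and mean Phred base quality.

def read_stats(seq, qual_string, phred_offset=33):
    """Return (length, gc_fraction, n_ambiguous, mean_quality) for one read."""
    seq = seq.upper()
    length = len(seq)
    gc_fraction = sum(1 for b in seq if b in "GC") / length
    ambiguous = sum(1 for b in seq if b not in "ACGT")
    mean_q = sum(ord(c) - phred_offset for c in qual_string) / length
    return length, gc_fraction, ambiguous, mean_q

print(read_stats("ACGTNGGCC", "IIIIIIIII"))  # 'I' encodes Phred Q40
```

Computing these per result, as described above, is what lets different preprocessing approaches be compared side by side.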

    When the process needs to be more standardized and repeatable, the Workflow system in Geneious Prime can be used to create a pre-configured pipeline using a visual editor that can then be run with one click. These workflows can then be exported or shared using the built-in shared database functionality to create a standard operating procedure for a wider group of scientists.

    Dr. Ni Ming, Senior Vice-President, MGI

    NGS is a bit different from other businesses: all products and analysis services are built on sequencing data, so the quality of that data is vital for most applications. MGI and Complete Genomics apply quality control at almost every step of sequencing.

    First, our sequencing strategy is based on DNA nanoballs, which minimizes the number of PCR cycles and so avoids errors introduced during DNA copying.

    Following that, we use automation, which significantly reduces manual operations and ensures minimal manual intervention during lab work.

    During data analysis, we apply our in-house software SOAPnuke [1], which is publicly available on GitHub and integrated into MegaBOLT for customer use and testing. We use a number of key criteria, such as Q30, GC content, and adapter contamination rate, to evaluate sequencing data quality before and after processing, so that only data of good quality is fed into subsequent analyses. This effectively elevates the accuracy and reliability of the insights derived from the data. In addition, all parameters can be modified to cater to different customer needs.

    Finally, we have in place management systems covering the whole sequencing workflow—from sample submission to sample management, laboratory management, data analysis, and reporting.

    Our fully automated sequencing and analysis process is an integrated, one-stop analysis with:

    a) Comprehensive data filtering and detailed visualization
    Once raw data is generated, it is processed with our self-developed, full-featured SOAPnuke software, which filters out reads containing adapter sequence, reads of low quality, and reads with high N content to obtain high-quality data. At the same time, statistics on the preprocessed data are presented with visualizations—for example, the fraction of bases at Q30 or above (i.e. error rate ≤ 0.1%) and the base and quality status of each sequencing cycle—for a better understanding of the data.
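The filtering criteria described here can be sketched as follows. This is a hedged illustration of the general idea, not SOAPnuke's code: the adapter sequence, threshold values, and function names below are placeholders, not SOAPnuke's actual defaults.

```python
# Hypothetical read filter in the spirit of the criteria above: discard
# reads containing adapter sequence, reads with too high an N fraction,
# or reads with too many low-quality bases; report Q30 for the rest.

ADAPTER = "AGATCGGAAGAG"  # placeholder adapter sequence

def passes_filters(seq, quals, max_n_frac=0.05, low_q=5, max_low_q_frac=0.5):
    seq = seq.upper()
    if ADAPTER in seq:
        return False  # adapter contamination
    if seq.count("N") / len(seq) > max_n_frac:
        return False  # high-N read
    low = sum(1 for q in quals if q < low_q)
    return low / len(quals) <= max_low_q_frac

def q30_fraction(quals):
    """Fraction of bases at or above Q30, i.e. error rate <= 0.1%."""
    return sum(1 for q in quals if q >= 30) / len(quals)

print(passes_filters("ACGTACGT", [40] * 8))  # True
print(q30_fraction([40, 40, 10, 30]))        # 0.75
```

In a real pipeline, the Q30 fraction would be reported both before and after filtering, as described above.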

    b) Flexible parameters adjustment

    Relevant parameters can be flexibly adjusted, and because datasets from the same kind of library and sequencer often share data characteristics, users can apply the same parameters to similar data and build processes around platform characteristics, settling on a suitable parameter scheme for subsequent production work.

    c) Accelerated quality control process
    To optimize the whole data quality control process, we have designed a streamlined pipeline that splits large sequencing datasets and parallelizes processing automatically.
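The split-and-parallelize pattern described here reads, in sketch form, as below. Threads stand in for the processes or cluster jobs a real pipeline would use, and all names are illustrative.

```python
# Illustrative chunk-and-parallelize pattern: split a batch of reads into
# chunks, run a QC pass on each chunk concurrently, and merge the counts.
from concurrent.futures import ThreadPoolExecutor

def qc_chunk(reads):
    # stand-in per-chunk QC: count reads at least 5 bases long
    return sum(1 for r in reads if len(r) >= 5)

def parallel_qc(reads, n_chunks=4):
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return sum(pool.map(qc_chunk, chunks))

reads = ["ACGTACG", "ACG", "GGGGG", "TT", "ACGTA", "CCCCCCC"]
print(parallel_qc(reads))  # 4
```

Because per-read QC is embarrassingly parallel, the merged result is the same regardless of how the data is split.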

    [1] Chen Y, Chen Y, Shi C, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience. 2018;7(1):1-6. doi:10.1093/gigascience/gix120.

    Simon Valentine, Chief Commercial Officer, Basepair

    Quality control (QC) of sequencing data is critical to performing successful genomic analyses. With a few clicks, Basepair allows you to QC raw reads from a variety of datatypes (RNA-seq, ChIP-seq, ATAC-seq, single-cell RNA-seq, WGS/WES, etc.) using industry-standard bioinformatics tools, while also providing helpful visualizations and reports to assess your data. The first step in each of our analysis workflows always includes trimming of low-quality bases as well as detection and removal of adapter contamination.

    We provide summary statistics and visualizations that allow you to quickly compare each sample within a project in order to check for the proper enrichment, signal/noise ratio, and any potential experimental bias before moving on to downstream analyses.

    QIAGEN Digital Insights Team

    QIAGEN CLC Genomics Workbench Premium has all the tools scientists need for success with sequencing data analysis – whether they are analyzing RNA, DNA, microbial data, single-cell data, etc. It even has pre-defined workflows that can be customized depending on specific needs. To ensure high-quality sequencing data for downstream analyses using QIAGEN CLC, here are some recommended steps for quality control, which are generally universal:

    a. Quality control and trimming of raw data: This is the first step in the sequencing data analysis pipeline; you should check the quality of the raw data and trim any low-quality reads
    b. De novo assembly of reads: This will generate an assembled sequence
    c. Mapping of reads: The trimmed reads should then be mapped to a reference genome. Important QC metrics are the number of reads mapped to a reference genome, and for panels, the number of reads mapped to target regions in addition to coverage
    d. Variant calling: The mapped reads will serve to find germline or somatic variants
    e. Post-variant calling quality control: Once variant calling is complete, it is essential to check the quality of the called variants
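The mapping QC metrics mentioned in step (c) can be illustrated with a toy calculation. The counts and names below are made up for the example; a real pipeline would take them from the aligner's statistics reports.

```python
# Toy versions of the mapping QC metrics from step (c): fraction of reads
# mapped, on-target fraction for a panel, and approximate mean coverage.

def mapping_qc(total_reads, mapped_reads, on_target_reads,
               read_length, target_bp):
    mapped_frac = mapped_reads / total_reads
    on_target_frac = on_target_reads / mapped_reads
    mean_coverage = on_target_reads * read_length / target_bp
    return mapped_frac, on_target_frac, mean_coverage

m, t, cov = mapping_qc(total_reads=1_000_000, mapped_reads=980_000,
                       on_target_reads=882_000, read_length=150,
                       target_bp=500_000)
print(f"{m:.1%} mapped, {t:.1%} on target, ~{cov:.0f}x coverage")
```

Low mapped or on-target fractions at this stage flag problems worth fixing before variant calling (steps d and e).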

    To ensure the data is high enough quality for downstream analyses, it is essential to follow best practices for quality control and consider the specific requirements for the downstream analysis. It is also possible to validate the analysis results using additional methods, such as PCR or Sanger sequencing. Additionally, it may be helpful to consult with experts in the field and seek advice from the scientific community to ensure your experimental design is robust for the best possible outcome.

    Mike Lelivelt, VP of Software Product Management and Marketing, Illumina

    Data quality starts with manufacturing at Illumina, where our reagents are created under ISO-certified processes. Once an instrument is in the customer's hands, each run provides extensive performance feedback to the operator, ensuring everything is within normal tolerances. Each base is assigned a quality score that predicts the accuracy of the base call, and data below tolerance is removed. These are the Q scores people are fond of talking about: they measure the quality of the base call, not the variant. And yes, Q score is a critical measure, but not the only one.
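The Q scores mentioned here follow the standard Phred scale, Q = -10 · log10(p), where p is the estimated probability that a base call is wrong; Q30 thus corresponds to a 1-in-1,000 error chance. A quick converter:

```python
# Standard Phred conversions between quality scores and error probability.
import math

def phred_to_error(q):
    return 10 ** (-q / 10)

def error_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error(30))    # 0.001
print(error_to_phred(0.01))  # 20.0
```

This is why a run's fraction of Q30-or-better bases is such a common headline metric: it directly bounds the expected per-base error rate.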

    Quality base-level data from millions of reads is then fed into algorithms that map the reads to the genome and call variants. Each of these steps has its own performance evaluation process. Illumina routinely benchmarks data against standard performance metrics from the PrecisionFDA Challenges, where the Illumina DRAGEN™ pipelines are routinely awarded as the most accurate.

    Check out the second, third, fourth, fifth, and sixth (final) installment of our Q&A series!

    • AndrewO commented:
      It's interesting to see how each company took the question a bit differently. I guess it depends on what kind of data you're talking about (RNA, DNA, etc) and at what stage/step in your workflow.

    • Ben3 commented:
      AndrewO that's true. A lot of these questions could be taken into several different directions so it's interesting to see how each provider answers them.
