Welcome to part six of our Q&A article series with leading sequencing analysis providers. We’re interviewing these experts to gain helpful insights into their complex analysis processes.
In this final installment of our series, we ask our participants about two of the most important aspects of data analysis: accuracy and reproducibility.
If you’re just joining us, we recommend reviewing the first installment on quality control, the second installment covering alignments and assemblies, the third installment on transcript analysis, the fourth installment on data visualization, and the fifth installment on the latest trends in sequencing analysis.
What steps do you take to ensure that your analysis and pipelines are accurate and reproducible?
The Geneious team prioritizes scientific accuracy above all else. We ensure it by combining commercial software engineering best practices, such as automated testing, continuous integration, and peer review, with the scientific knowledge of experienced biologists who fill key roles on the development team, such as product manager and quality advocate. Our helpdesk is also staffed by PhD-qualified molecular biologists who advise users on the accurate use of our tools and the interpretation of results.
Reproducibility is also an important part of the Geneious way of working, and a host of features help in this respect:
- New result documents are saved at each step of an analysis and the settings that were used are stored on each document for future reference, creating an audit trail for your analysis operations.
- Result documents keep a reference to all input documents and vice versa, meaning inputs and outputs can be reliably tracked.
- Analysis options can be saved as a preset and those presets can be shared with colleagues.
- Workflows let you create standard operating procedures so that an analysis pipeline can be reproduced easily.
- Because Geneious is a desktop tool, it is always possible to run previous versions of the software when necessary, and we make all versions available for easy download from our website.
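To make the audit-trail idea in the list above concrete, here is a minimal Python sketch of result documents that store their settings and keep references to their inputs and outputs. It is not Geneious code: `ResultDocument`, `run_step`, and `audit_trail` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


# eq=False: documents reference each other in both directions, so the
# default field-by-field comparison would recurse; identity is enough here.
@dataclass(eq=False)
class ResultDocument:
    """Hypothetical result document that remembers how it was produced."""
    name: str
    data: Any
    operation: str = "import"
    settings: Dict[str, Any] = field(default_factory=dict)
    inputs: List["ResultDocument"] = field(default_factory=list)
    outputs: List["ResultDocument"] = field(default_factory=list)


def run_step(operation: str, func: Callable[..., Any],
             inputs: List[ResultDocument], **settings: Any) -> ResultDocument:
    """Run one analysis step and save a new document with its settings."""
    data = func(*[doc.data for doc in inputs], **settings)
    result = ResultDocument(name=f"{operation} result", data=data,
                            operation=operation, settings=dict(settings),
                            inputs=list(inputs))
    for doc in inputs:
        doc.outputs.append(result)  # inputs also point forward to outputs
    return result


def audit_trail(doc: ResultDocument) -> List[Tuple[str, Dict[str, Any]]]:
    """List (operation, settings) pairs from the earliest input up to doc."""
    trail: List[Tuple[str, Dict[str, Any]]] = []
    for parent in doc.inputs:
        trail.extend(audit_trail(parent))
    trail.append((doc.operation, doc.settings))
    return trail


# Example: trim two bases from each end of an imported read.
raw = ResultDocument(name="raw read", data="NNACGTTGCANN")
trimmed = run_step("trim", lambda seq, n: seq[n:-n], [raw], n=2)
trail = audit_trail(trimmed)
```

Because every step writes a new document rather than modifying its input, the settings used at each stage remain available for later inspection.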
In terms of accuracy, as mentioned in a previous answer, QC preprocessing ensures that the data going into the analysis is as accurate as possible. The use of the T2T genome and of higher-accuracy comparison software further adds to this, while AI algorithms help to continuously improve accuracy.
On the other hand, mutation detection software may randomly downsample high-depth data and call random functions, both of which can produce non-reproducible results. To ensure reproducibility, you can disable downsampling or modify how the random functions are used.
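One common way to "modify the use of random functions" is to fix the random seed, so that downsampling selects the same reads on every run. The sketch below illustrates that idea in plain Python. It is an assumption for illustration only, not the method of any particular mutation detection tool, and `downsample` is a hypothetical helper.

```python
import random
from typing import List, Optional, Sequence


def downsample(reads: Sequence[str], max_depth: int,
               seed: Optional[int] = None) -> List[str]:
    """Randomly keep at most max_depth reads, preserving their order.

    With seed=None the selection differs between runs; with a fixed
    seed the same reads are selected every time.
    """
    if len(reads) <= max_depth:
        return list(reads)  # nothing to discard, so nothing random happens
    rng = random.Random(seed)  # private generator; global state is untouched
    keep = sorted(rng.sample(range(len(reads)), max_depth))
    return [reads[i] for i in keep]


reads = [f"read_{i}" for i in range(1000)]
first = downsample(reads, 100, seed=42)
second = downsample(reads, 100, seed=42)
same_selection = first == second  # True: a fixed seed gives a repeatable subset
```

Recording the seed alongside the other analysis settings lets the run be repeated later. Disabling downsampling altogether removes this source of randomness, at the cost of processing the full depth.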
The Basepair platform offers analysis workflows for a variety of genomic data types (RNA-seq, ChIP-seq, ATAC-seq, scRNA-seq, WGS/WES, etc.). We leverage industry-standard tools available in the public domain that have been cited in hundreds of peer-reviewed publications.
Our workflows are always validated by first processing a series of published datasets from different species and of different overall quality to ensure they consistently reproduce the expected results. In order to trust the final results of an analysis, not only must the data pass various QC checks along the way, but the pipeline itself must also be carefully evaluated.
We at QIAGEN Digital Insights have a team of PhD-level bioinformaticians and developers who pride themselves on developing high-quality tools. Unlike some open-source tools that result from a graduate-student thesis or project, the tools we produce are fully supported and follow strict development methodologies.
We use robust software development processes under the ISO 27001 standard and routinely test our workflows against standard datasets to ensure high quality and reproducibility. QIAGEN CLC Genomics Workbench also provides an audit trail, so you can always look back at the settings and parameters used for an analysis for maximum reproducibility.
In DRAGEN, over 200 smoke tests run automatically every night, and over 3,500 automated test cases run every weekend, so we catch issues early. We use “golden” datasets (such as GIAB data) as the truth against which we evaluate our pipeline accuracy. If the weekend automated test runs show any regression from a previous run, the team takes action to address it. This prevents regressions in our pipeline accuracy and keeps the accuracy trend moving upward.
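The kind of accuracy regression check described above can be sketched as follows. This is a simplified illustration rather than DRAGEN's actual test harness: variants are matched as exact `(chrom, pos, ref, alt)` tuples, whereas real benchmarking against truth sets typically also handles variant normalization and confident regions. The function names and the `tolerance` parameter are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def accuracy_metrics(calls: Iterable[Variant],
                     truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare called variants with a truth set by exact matching."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # called and in the truth set
    fp = len(call_set - truth_set)   # called but not in the truth set
    fn = len(truth_set - call_set)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def find_regressions(current: Dict[str, float], baseline: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Return the metrics that dropped below the previous run's values."""
    return [name for name, previous in baseline.items()
            if current[name] < previous - tolerance]


truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr2", 900, "T", "C")]
calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr3", 10, "A", "T")]  # 3 hits, 1 false call

current = accuracy_metrics(calls, truth)
previous_run = {"precision": 0.75, "recall": 1.0, "f1": 0.85}  # made-up baseline
regressed = find_regressions(current, previous_run)
```

In an automated setup, a non-empty `regressed` list would fail the run and prompt someone to investigate before the change ships.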
Robustness tests run the same test multiple times and ensure that each run generates the same result. We also run the same test across different platforms (AWS, Azure, different DRAGEN servers, BaseSpace Sequence Hub, Illumina Connected Analytics, etc.) and make sure the results are consistent. DRAGEN development follows the Illumina Quality Management System. IVD-compliant DRAGEN NGS analysis applications can be used as a component of Illumina and customer IVD solutions.
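A repeat-run robustness check of the sort described here can be approximated by hashing each run's output and confirming that all digests match. The sketch below assumes the pipeline's result can be serialized as JSON. The helper names and the toy pipelines are hypothetical and do not reflect Illumina's test code.

```python
import hashlib
import itertools
import json
from typing import Any, Callable


def result_digest(result: Any) -> str:
    """Hash a JSON-serializable result; sorted keys give a stable encoding."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(pipeline: Callable[[Any], Any], inputs: Any,
                    runs: int = 3) -> bool:
    """Run the pipeline several times and check every digest is identical."""
    digests = {result_digest(pipeline(inputs)) for _ in range(runs)}
    return len(digests) == 1


def count_bases(reads):
    """Toy deterministic pipeline: count each base across all reads."""
    counts = {}
    for read in reads:
        for base in read:
            counts[base] = counts.get(base, 0) + 1
    return counts


_run_counter = itertools.count()


def unstable_pipeline(reads):
    """Toy pipeline whose output changes on every call."""
    return {"n_reads": len(reads), "run": next(_run_counter)}


reads = ["ACGT", "GGCA", "TTAC"]
stable = is_reproducible(count_bases, reads)          # True
unstable = is_reproducible(unstable_pipeline, reads)  # False
```

The same digests can be recorded on each platform and compared afterwards, which extends the check from repeated runs on one machine to runs across environments.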