From Algorithms to Assemblies: An Interview with Sequencing Analysis Experts—Part 2

Published: 04-07-2023, 07:06 AM

This is part two of our Q&A article series where we ask several leading sequencing analysis providers for their approach to important analysis processes.

In this newest segment of our series, we ask our participants about their assessments of assemblies and alignments.

Take a look back at the first installment to review how each provider handles the quality control process of sequencing data.

How do you evaluate the quality and reliability of generated assemblies or alignments?

Dr. Ni Ming, Senior Vice-President, MGI

Apart from establishing strict quality control in every step, we have also taken measures to ensure data quality and reliability throughout the entire data lifecycle, from acquisition to disposal.

a) Unified task scheduling and data management
Unified task scheduling and data management software is used during automatic analysis, and the integrity of the result files and reports is verified during scheduling to ensure that the results are reliable. In addition, automatic re-analysis of each task improves the system's fault tolerance and success rate.

b) Key indicators and comparison against gold standards
For analysis, the results report uses conventional indicators such as mapping rate and coverage to evaluate alignment quality. Alignments are also evaluated indirectly through the accuracy of variant detection, by comparing the variant calls from a standard sample against gold standards¹ or against results from the commonly used GATK Best Practices pipeline.
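As a rough illustration of the gold-standard comparison described above, the following Python sketch scores a call set against a truth set using precision, recall, and F1. It assumes each variant is represented as a hypothetical `(chrom, pos, ref, alt)` tuple and that matching is exact; the function name `benchmark_calls` and the example variants are invented for illustration. Real benchmarking must also reconcile different representations of the same variant, which this sketch ignores.

```python
def benchmark_calls(truth, calls):
    """Compare called variants with a truth set by exact tuple matching."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example data: (chrom, pos, ref, alt)
truth_set = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
call_set = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
            ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")]

metrics = benchmark_calls(truth_set, call_set)
print(metrics)
```

Precision penalizes false positive calls and recall (sensitivity) penalizes missed variants, so reporting both makes the trade-off between aligners visible.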

c) Multiple alignment tools for users’ selection
We also provide a choice of alignment software, with some tools offering better precision in variant detection and others better sensitivity. Users can select tools based on their specific business needs.

References:
1. Cleary, John G., et al. "Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines." bioRxiv (2015): 023754.

Simon Valentine, Chief Commercial Officer, Basepair

It’s always important to know where your reads are going and the confidence of their reported alignments as they pass through an analysis. However, this is complicated by the fact that aligners calculate mapping quality (MAPQ) scores in different ways and also report them on varying scales (0-42, 0-60, etc.). With the Basepair platform, we follow established best practices from the genomics community to process a variety of data types while using the appropriate genomics tools and settings.
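To make the MAPQ scale issue concrete, here is a minimal Python sketch of the Phred-scaled definition used in the SAM format, where MAPQ is -10·log10 of the probability that the mapping position is wrong. This is only the nominal definition: how a given aligner estimates that probability, and where it caps the score, varies from tool to tool. The function names and the `cap` parameter are hypothetical and not part of any aligner or of the Basepair platform.

```python
import math


def mapq_to_error_probability(mapq):
    """Nominal probability that the reported position is wrong."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong, cap=60):
    """Phred-scale an error probability, capped at an aligner-specific maximum.

    `cap` is an assumed parameter; for example 42 or 60, depending on the scale.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


for q in (0, 10, 20, 30):
    print(q, mapq_to_error_probability(q))
```

Because the top of the scale differs between aligners, a fixed cutoff such as "MAPQ ≥ 30" does not necessarily select the same confidence level across tools.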

After alignment is completed, we provide a summary plot that displays the percentage of reads that were removed during QC, were unaligned, aligned to multiple locations in the genome, or aligned “uniquely” to the genome. Our visualizations and summary statistics allow you to make an informed decision on which additional filtering steps may be necessary during downstream analyses.

QIAGEN Digital Insights Team

Contiguity, completeness, and correctness are paramount to assessing assembly quality using QIAGEN CLC Genomics Workbench.

When using QIAGEN CLC Genomics Workbench, contiguity involves the number of contigs.
A high N50 and a low number of contigs relative to your expected number of chromosomes are ideal. If you aren’t sure what N50 and contig number are reasonable to expect, you could get an idea by looking at existing assemblies of a similar genome, should these exist. For an even better sense of what would be reasonable for your data, you could use the whole genome alignment plugin to compare against an assembly of a similar genome that was built from a similar amount and type of data. If your assembly results include a large number of very small contigs, you may have set the minimum contig length filter too low. Very small contigs, particularly those of low coverage, can generally be ignored.
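For readers unfamiliar with N50, the sketch below computes it from a list of contig lengths using the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the example lengths are illustrative only and are not taken from QIAGEN CLC Genomics Workbench.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


lengths = [100, 80, 60, 40, 20]   # hypothetical contig lengths
print(len(lengths), n50(lengths))  # prints "5 80"
```

Here the total length is 300, and the two longest contigs (100 + 80) are the first to reach half of it, so the N50 is 80. Reporting the contig count next to N50 helps, because N50 alone says nothing about how many small contigs remain.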

Completeness involves how much of the genome is captured in the assembly. If a total genome length of 5 Mb is expected based on existing literature or similar genomes that have already been assembled, but the sum of all contig lengths is only 3.5 Mb, you may wish to reconsider your assembly parameters.
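The arithmetic in that example can be written as a small check. This is a minimal sketch, assuming contig lengths in base pairs and an expected genome size taken from the literature; the function name and the individual contig lengths are hypothetical, chosen only so that they sum to 3.5 Mb.

```python
def assembly_completeness(contig_lengths, expected_genome_length):
    """Fraction of the expected genome length represented by the contigs."""
    if expected_genome_length <= 0:
        raise ValueError("expected_genome_length must be positive")
    return sum(contig_lengths) / expected_genome_length


contigs = [1_500_000, 1_200_000, 800_000]   # sums to 3.5 Mb
fraction = assembly_completeness(contigs, 5_000_000)
print(f"{fraction:.0%} of the expected genome length")
```

Note that a length-based fraction is only a first-pass signal: it can exceed 100% if the assembly contains duplicated or contaminant sequence.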

When using QIAGEN CLC Genomics Workbench, here are two common reasons an assembly output would be shorter than expected:
A Word Size that is higher than optimal for your data: A high Word Size will increase the probability of discarding words because they overlap with sequencing errors. If a word is seen only once, the unique word will be discarded even if there are many other words that are identical except for one base (e.g., a sequencing error). A discarded word will not be considered in constructing the assembly graph and will therefore be excluded from the assembly contig sequences.

A Bubble Size that is higher than optimal for your data: A high Bubble Size will increase the probability that two similar sequences are classified as a repeat and thus collapsed into a single contig. It is sometimes possible to identify collapsed repeats by looking at the mapping of your reads to the assembled contigs. A collapsed repeat will be shown as a high peak of coverage in one location.
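The Word Size effect described above can be illustrated with a toy example. The Python sketch below counts words (k-mers) across a handful of reads and lists those seen only once, which are the words the text says would be discarded. It is a simplified illustration under assumed inputs, not the assembler's actual implementation; the reads and function names are invented.

```python
from collections import Counter


def words(seq, k):
    """All overlapping words (k-mers) of length k in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def singleton_words(reads, k):
    """Words that occur exactly once across all reads."""
    counts = Counter(w for read in reads for w in words(read, k))
    return {w for w, c in counts.items() if c == 1}


true_read = "ACGTTGCAAGGCTTAC"
error_read = "ACGTTGCATGGCTTAC"        # single error: A -> T at index 8
reads = [true_read] * 5 + [error_read]

for k in (4, 8):
    print(k, len(singleton_words(reads, k)))
```

In this example a single erroneous base produces 4 unique words at a word size of 4 and 8 unique words at a word size of 8, so a larger word size means more words are lost per sequencing error.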

Depending on the resources available for the organism you are working on, you might also assess assembly completeness by mapping the assembled contig sequences to a known reference. You can then check for regions of the reference genome that have not been covered by the assembled contigs. Whether this is sensible depends on the sample and reference organisms and what is known about their expected differences.

For QIAGEN CLC Genomics Workbench, correctness refers to whether the assembled contigs accurately represent the genome. One key question in assessing correctness is whether the assembly is contaminated with any foreign organism sequence data. To check this, you could run a BLAST search using your assembled contigs as query sequences against a database containing possible contaminant species data. In addition to BLAST, checking the coverage can help to identify contaminant sequence data. The coverage of a contaminant contig often differs from that of the desired organism, so you can compare the potential contaminant contigs to the rest of the assembled contigs. To check for these types of coverage differences between contigs you can:
Map your reads used as input for the de novo assembly to your contigs (if you do not already have a mapping output);

Create a Detailed Mapping Report;

In the Result handling step of the wizard, check the option to Create a separate table with statistics for each mapping;

Review the average coverage for each contig in this resulting table.
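Once the per-contig average coverage has been exported from that table, the comparison could be automated along these lines. This is a sketch under stated assumptions: the coverage values are supplied as a plain dictionary, the median is used as the baseline, and the two-fold threshold (`fold=2.0`) is an arbitrary, hypothetical cutoff rather than a recommended value.

```python
from statistics import median


def flag_coverage_outliers(coverage_by_contig, fold=2.0):
    """Return contigs whose average coverage differs from the median
    by more than `fold` in either direction, with their coverage ratio."""
    baseline = median(coverage_by_contig.values())
    flagged = {}
    for name, coverage in coverage_by_contig.items():
        ratio = coverage / baseline if baseline else float("inf")
        if ratio > fold or ratio < 1 / fold:
            flagged[name] = ratio
    return flagged


# Hypothetical average coverage per contig
coverage = {"contig_1": 48.0, "contig_2": 52.0, "contig_3": 50.0,
            "contig_4": 9.5, "contig_5": 210.0}

outliers = flag_coverage_outliers(coverage)
print(sorted(outliers))
```

A flagged contig is only a candidate for closer inspection. Unusually high coverage may also indicate a collapsed repeat rather than contamination, so the BLAST results should be considered alongside the coverage.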

If there are contigs that have good matches to a very different organism and there are discernible coverage differences, you could either consider removing those contigs from the assembly or run a new assembly after removing the contaminant reads. One way to remove the contaminant reads would be to run a read mapping against the foreign organism’s genome and to check the option to Collect unmapped reads. The unmapped reads Sequence List should now be clean of contamination. You can then use this set of reads in a new de novo assembly.

Assessing the correctness of an assembly also involves checking for misassemblies, that is, making sure the assembler did not join segments of sequence that should not have been joined. This is more difficult. One option for identifying misassemblies is to try running the InDels and Structural Variants tool. If this tool identifies structural variation within the assembly, that could indicate an issue that you should investigate.

Mike Lelivelt, VP of Software Product Management and Marketing, Illumina

The quality of a DRAGEN read alignment is given by its alignment score, which measures how closely a read matches the reference sequence at the reported mapping position (POS). The confidence in the reported mapping position is given by the mapping quality (MAPQ), which is a function of the difference between the best alignment score and the second-best alignment score(s). DRAGEN does not currently perform read assembly.

Richard Moir, Director of Product and Technology, Geneious

Geneious Prime offers a range of best-in-class algorithms for alignment and assembly paired with highly interactive visualizations that enable scientists to complete their own evaluation of different approaches and decide how to proceed with their analysis. Our algorithms include STAR, Minimap2, SPAdes, Flye, MAFFT and others, each selected for its unique strengths and capabilities. In addition to these industry-standard tools, we've also developed our own algorithm with special features like circular assembly and iterative mapping, which allows for more accurate resolution of complex indels. All of these algorithms are seamlessly integrated into Geneious Prime, making it easy for any scientist to perform complex bioinformatic analyses and generate high-quality data.

To evaluate the quality of the results, scientists have easy access to many tools, regardless of the algorithm used, including:
Textual assembly report containing a summary of the output and read counts, giving a measure of how many reads were successfully aligned.

Dynamic coverage graph with options for highlighting areas of concern.

Coloring of reads based on pair distance, direction and mapping quality.

Highly configurable consensus caller with options for handling low-quality and low-coverage regions.

Check out the third, fourth, fifth, and sixth (final) installments of our Q&A series.