

From Algorithms to Assemblies: An Interview with Sequencing Analysis Experts—Part 2





    This is part two of our Q&A article series where we ask several leading sequencing analysis providers for their approach to important analysis processes.

    In this newest segment of our series, we ask our participants about their assessments of assemblies and alignments.

    Take a look back at the first installment to review how each provider handles the quality control process of sequencing data.

    How do you evaluate the quality and reliability of generated assemblies or alignments?

    MGI (Complete Genomics)
    Dr. Ni Ming, Senior Vice-President, MGI

    Apart from establishing strict quality control in every step, we have also taken measures to ensure data quality and reliability throughout the entire data lifecycle, from acquisition to disposal.

    a) Unified task scheduling and data management
    Unified task scheduling and data management software is used during automatic analysis, while the integrity of the result file and report is verified during the scheduling process to ensure the reliability of results. In addition, automatic re-analysis of each task improves the system's fault tolerance and success rate.

    b) Key indicators and comparison against gold standards
    In terms of analysis, conventional indicators such as mapping rate and coverage are reported to evaluate alignment quality, and results are evaluated indirectly by comparing the accuracy of variant detection. This evaluation can be done by comparing the variant calls for a standard sample against gold standards1, or against results from the commonly used GATK Best Practices pipeline.
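    As a rough illustration of the alignment-level indicators mentioned above, the sketch below computes a mapping rate and mean coverage from minimal SAM-style records. The record layout and numbers are illustrative, not MGI's actual pipeline; only the FLAG 0x4 "unmapped" bit follows the SAM specification.

    ```python
    # Minimal sketch: mapping rate and mean coverage from SAM-style records.
    # Illustrative only -- not MGI's actual QC pipeline.

    def mapping_stats(records, genome_length):
        """records: list of (flag, aligned_length) tuples.
        FLAG bit 0x4 means the read is unmapped (per the SAM spec)."""
        total = len(records)
        mapped = [r for r in records if not (r[0] & 0x4)]
        mapping_rate = len(mapped) / total if total else 0.0
        # Mean coverage: total aligned bases divided by genome length.
        mean_coverage = sum(length for _, length in mapped) / genome_length
        return mapping_rate, mean_coverage

    reads = [(0, 150), (16, 150), (4, 0), (0, 150)]  # three mapped, one unmapped
    rate, cov = mapping_stats(reads, genome_length=1000)
    ```

    In practice these numbers would come from a tool such as samtools flagstat or a coverage calculator rather than hand-rolled parsing, but the arithmetic is the same.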

    c) Multiple alignment tools for users’ selection
    At the same time, we provide different alignment software for selection, some with better precision performance in variant detection and others with better sensitivity. Users can select tools based on their specific business needs.

    1. Cleary, John G., et al. "Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines." bioRxiv (2015): 023754.

    Simon Valentine, Chief Commercial Officer, Basepair

    It’s always important to know where your reads are going and the confidence of their reported alignments as they pass through an analysis. However, this is complicated by the fact that aligners calculate mapping quality (MAPQ) scores in different ways and also report them on varying scales (0-42, 0-60, etc.). With the Basepair platform, we follow established best practices from the genomics community to process a variety of data types while using the appropriate genomics tools and settings.

    After alignment is completed, we provide a summary plot that displays the percentage of reads removed during QC, unaligned, aligning to multiple locations in the genome, and “uniquely” aligning to the genome. Our visualizations and summary statistics allow you to make an informed decision on which additional filtering steps may be necessary during downstream analyses.
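    The read-fate summary described above can be sketched as a simple classifier that bins each read and reports percentages. The categories mirror the plot described in the text, but the field names and the MAPQ cutoff for "multi-mapping" are illustrative assumptions, not Basepair's actual implementation.

    ```python
    # Sketch: classify each read as removed in QC, unaligned, multi-mapped,
    # or uniquely aligned, and report percentages. Illustrative only.

    def summarize_reads(reads, mapq_cutoff=10):
        """reads: list of dicts with 'passed_qc' (bool), 'aligned' (bool),
        'mapq' (int). A low MAPQ is treated here as a multi-mapping read."""
        counts = {"qc_removed": 0, "unaligned": 0, "multi": 0, "unique": 0}
        for r in reads:
            if not r["passed_qc"]:
                counts["qc_removed"] += 1
            elif not r["aligned"]:
                counts["unaligned"] += 1
            elif r["mapq"] < mapq_cutoff:
                counts["multi"] += 1
            else:
                counts["unique"] += 1
        total = len(reads)
        return {k: 100.0 * v / total for k, v in counts.items()}

    reads = [
        {"passed_qc": False, "aligned": False, "mapq": 0},
        {"passed_qc": True, "aligned": False, "mapq": 0},
        {"passed_qc": True, "aligned": True, "mapq": 0},
        {"passed_qc": True, "aligned": True, "mapq": 60},
    ]
    pct = summarize_reads(reads)
    ```

    Note that, as the text points out, the meaning of a given MAPQ value differs between aligners, so a fixed cutoff like this would need to be tuned per tool.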

    QIAGEN Digital Insights Team

    Contiguity, completeness, and correctness are paramount to assessing assembly quality using QIAGEN CLC Genomics Workbench.

    Contiguity involves the number of contigs.
    A high N50 and a low number of contigs relative to your expected number of chromosomes are ideal. If you aren't sure what N50 and contig number might be reasonable to expect, you could get an idea by looking at existing assemblies of a similar genome, should these exist. For an even better sense of what would be reasonable for your data, you could compare against an assembly of a similar genome, built from a similar amount and type of data, using the whole genome alignment plugin. If your assembly results include a large number of very small contigs, you may have set the minimum contig length filter too low. Very small contigs, particularly those of low coverage, can generally be ignored.
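    N50 has a standard definition: the contig length at which contigs of that length or longer contain at least half of the total assembly length. A minimal sketch (the example contig lengths are made up):

    ```python
    # N50 sketch: sort contigs longest-first and walk down until the running
    # total reaches half the assembly length; that contig's length is the N50.

    def n50(contig_lengths):
        total = sum(contig_lengths)
        running = 0
        for length in sorted(contig_lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    lengths = [100, 200, 300, 400, 500]  # total = 1500, half = 750
    # 500 + 400 = 900 >= 750, so the N50 is 400
    ```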

    Completeness involves how much of the genome is captured in the assembly. If a total genome length of 5 Mb is expected based on existing literature or similar genomes that have already been assembled, but the sum of all contig lengths is only 3.5 Mb, you may wish to reconsider your assembly parameters.
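    The completeness check above is simple arithmetic: the sum of contig lengths divided by the expected genome length. A tiny sketch, using the 5 Mb expectation from the example (the contig lengths are made up):

    ```python
    # Completeness sketch: fraction of the expected genome length that the
    # assembled contigs account for. Illustrative numbers only.

    def completeness(contig_lengths, expected_length):
        return sum(contig_lengths) / expected_length

    frac = completeness([2_000_000, 1_000_000, 500_000], expected_length=5_000_000)
    # 3.5 Mb of an expected 5 Mb genome
    ```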

    When using QIAGEN CLC Genomics Workbench, here are two common reasons an assembly output would be shorter than expected:
    1. A Word Size that is higher than optimal for your data: A high Word Size will increase the probability of discarding words because they overlap with sequencing errors. If a word is seen only once, the unique word will be discarded even if there are many other words that are identical except for one base (e.g., a sequencing error). A discarded word will not be considered in constructing the assembly graph and will therefore be excluded from the assembly contig sequences.
    2. A Bubble Size that is higher than optimal for your data: A high Bubble Size will increase the probability that two similar sequences are classified as a repeat and thus collapsed into a single contig. It is sometimes possible to identify collapsed repeats by looking at the mapping of your reads to the assembled contigs. A collapsed repeat will be shown as a high peak of coverage in one location.
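    The collapsed-repeat signature described in point 2 can be sketched as a scan for windows whose mean coverage sits well above the contig-wide median. The window size and the 2x fold threshold here are illustrative assumptions, not CLC Genomics Workbench parameters:

    ```python
    # Sketch: flag coverage peaks that may indicate a collapsed repeat.
    # Window size and fold threshold are illustrative.
    import statistics

    def coverage_peaks(per_base_coverage, window=5, fold=2.0):
        median_cov = statistics.median(per_base_coverage)
        peaks = []
        for start in range(0, len(per_base_coverage) - window + 1, window):
            win = per_base_coverage[start:start + window]
            if statistics.fmean(win) >= fold * median_cov:
                peaks.append(start)  # window start position of the peak
        return peaks

    cov = [10] * 10 + [40] * 5 + [10] * 10  # a 4x spike in the middle
    ```

    A repeat collapsed from n copies into one contig tends to show roughly n-fold coverage at that locus, which is why a fold-over-median test is a reasonable first pass.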
    Depending on the resources available for the organism you are working on, you might also assess assembly completeness by mapping the assembled contig sequences to a known reference. You can then check for regions of the reference genome that have not been covered by the assembled contigs. Whether this is sensible depends on the sample and reference organisms and what is known about their expected differences.

    For QIAGEN CLC Genomics Workbench, correctness means whether the contigs that have been assembled accurately represent the genome. One key question in assessing correctness is whether the assembly is contaminated with any foreign organism sequence data. To check this, you could run a BLAST search using your assembled contigs as query sequences against a database containing possible contaminant species data. In addition to BLAST, checking the coverage can help to identify contaminant sequence data. The coverage of a contaminant contig is often different from the desired organism, so you can compare the potential contaminant contigs to the rest of the assembled contigs. To check for these types of coverage differences between contigs you can:
    1. Map your reads used as input for the de novo assembly to your contigs (if you do not already have a mapping output);
    2. Create a Detailed Mapping Report;
    3. In the Result handling step of the wizard, check the option to Create a separate table with statistics for each mapping;
    4. Review the average coverage for each contig in the resulting table.
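    The per-contig coverage comparison in the steps above amounts to flagging contigs whose average coverage differs markedly from the assembly-wide median. A minimal sketch, with an illustrative 3x fold cutoff (the contig names and values are made up):

    ```python
    # Sketch: flag contigs whose average coverage is far from the
    # assembly-wide median -- a possible contamination signal.
    import statistics

    def coverage_outliers(contig_coverage, fold=3.0):
        """contig_coverage: dict of contig name -> average coverage."""
        median_cov = statistics.median(contig_coverage.values())
        return sorted(
            name for name, cov in contig_coverage.items()
            if cov >= fold * median_cov or cov <= median_cov / fold
        )

    cov = {"contig_1": 30.0, "contig_2": 32.0, "contig_3": 31.0, "contig_4": 150.0}
    suspects = coverage_outliers(cov)
    ```

    Coverage alone is not conclusive, which is why the text pairs this check with BLAST matches against possible contaminant species.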

    If there are contigs that have good matches to a very different organism and there are discernable coverage differences, you could either consider removing those contigs from the assembly or run a new assembly after removing the contaminant reads. One way to remove the contaminant reads would be to run a read mapping against the foreign organism’s genome and to check the option to Collect unmapped reads. The unmapped reads Sequence List should now be clean of contamination. You can then use this set of reads in a new de novo assembly.
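    The "Collect unmapped reads" step above is, in effect, a set-difference filter: given the IDs of reads that mapped to the contaminant genome, keep everything else for the new de novo assembly. A minimal sketch (read IDs and sequences are made up):

    ```python
    # Sketch: drop reads that mapped to a contaminant genome, keeping the
    # rest for re-assembly. Illustrative data structures only.

    def remove_contaminant_reads(reads, contaminant_ids):
        """reads: dict of read id -> sequence; contaminant_ids: set of ids
        that mapped to the foreign organism's genome."""
        return {rid: seq for rid, seq in reads.items() if rid not in contaminant_ids}

    reads = {"r1": "ACGT", "r2": "TTGA", "r3": "GGCC"}
    clean = remove_contaminant_reads(reads, contaminant_ids={"r2"})
    ```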

    Assessing the correctness of an assembly also involves making sure the assembler did not join segments of sequences that should not have been joined; in other words, checking for misassemblies. This is more difficult. One option for identifying misassemblies is to try running the InDels and Structural Variants tool. If this tool identifies structural variation within the assembly, that could indicate an issue that you should investigate.

    Mike Lelivelt, VP of Software Product Management and Marketing, Illumina

    The quality of a DRAGEN read alignment is given by its alignment score, which measures how closely a read matches the reference sequence at the reported mapping position (POS). The confidence in the reported mapping position is given by the mapping quality (MAPQ), which is a function of the difference between the best alignment score and the second-best alignment score(s). DRAGEN does not currently perform read assembly.
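    The relationship described above, where MAPQ reflects the gap between the best and second-best alignment scores, can be sketched as follows. The scaling and the cap at 60 are illustrative, not DRAGEN's actual formula:

    ```python
    # Illustrative sketch: mapping confidence grows with the gap between the
    # best and second-best alignment scores. Not DRAGEN's actual formula.

    def mapq_from_scores(best, second_best, scale=1.0, cap=60):
        gap = max(0, best - second_best)
        return min(cap, int(round(scale * gap)))

    # A read whose best alignment clearly beats the runner-up gets a high
    # MAPQ; a tie (e.g. a read from a perfect repeat) gets MAPQ 0.
    ```

    This is why reads from repetitive regions carry low MAPQ even when their alignment score is high: the second-best placement is just as good.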

    Richard Moir, Director of Product and Technology, Geneious

    Geneious Prime offers a range of best-in-class algorithms for alignment and assembly paired with highly interactive visualizations that enable scientists to complete their own evaluation of different approaches and decide how to proceed with their analysis. Our algorithms include STAR, Minimap2, SPAdes, Flye, MAFFT and others, each selected for its unique strengths and capabilities. In addition to these industry-standard tools, we've also developed our own algorithm with special features like circular assembly and iterative mapping, which allows for more accurate resolution of complex indels. All of these algorithms are seamlessly integrated into Geneious Prime, making it easy for any scientist to perform complex bioinformatic analyses and generate high-quality data.

    To evaluate the quality of the results, scientists have easy access to many tools, regardless of the algorithm used, including:
    • Textual assembly report containing a summary of the output and read counts, giving a measure of how many reads were successfully aligned.
    • Dynamic coverage graph with options for highlighting areas of concern.
    • Coloring of reads based on pair distance, direction and mapping quality.
    • Highly configurable consensus caller with options for handling low-quality and low-coverage regions.
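    In the spirit of the configurable consensus caller in the last bullet, here is a minimal majority-vote sketch with a low-coverage fallback. The threshold and the 'N' fallback are illustrative assumptions, not Geneious Prime's algorithm:

    ```python
    # Sketch: per-column majority-vote consensus with a low-coverage rule.
    # Illustrative only -- not Geneious Prime's consensus algorithm.
    from collections import Counter

    def call_consensus(columns, min_coverage=3):
        """columns: list of per-position base lists from an alignment."""
        consensus = []
        for bases in columns:
            if len(bases) < min_coverage:
                consensus.append("N")  # too few reads to call confidently
            else:
                consensus.append(Counter(bases).most_common(1)[0][0])
        return "".join(consensus)

    cols = [["A", "A", "A"], ["C", "C", "T"], ["G"]]
    ```

    Real callers also weigh base qualities and handle ties and gaps, which is where the configurability mentioned above comes in.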

    Check out the third, fourth, fifth, and sixth (final) installments of our Q&A series.