QC, Alignment, and Mapping: Recommended Tools for Next-Level Sequencing Analysis



    With new tools and computational resources being released regularly, it can be hard to determine which are best suited for the analysis process and which older tools continue to be maintained. In an effort to assist the sequencing community, we interviewed three highly skilled bioinformaticians about their recommended tools for several important analysis applications.

    Quality control and preprocessing tools
    “Garbage in, garbage out” is a popular maxim among the bioinformatics and computer science communities because the beginning of any successful analysis relies on proper quality control (QC) and preprocessing of the data. We began our interviews by asking about preferred tools for these processes from the following experts: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics at Deakin University; Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine; and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.

    General QC tools
    All three of our participants highly recommended FastQC and MultiQC for the QC process. Ziemann notes that these tools give a good overview of many important metrics. Mahmoud expanded on this by explaining that FastQC is a convenient tool for performing quality control checks on raw sequence data from high-throughput sequencing pipelines. It offers a modular suite of analyses so users can quickly spot potential issues with the data before proceeding with further analysis.
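
    As a rough illustration of the kind of metric such a QC report contains, the sketch below computes the mean Phred quality at each read position from FASTQ records. It is not FastQC's implementation; the function names, the Phred+33 offset default, and the toy reads are assumptions made for this example.

```python
from io import StringIO
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_quality_per_position(handle: TextIO, offset: int = 33) -> List[float]:
    """Arithmetic mean of Phred scores at each position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for _, _, qual in read_fastq(handle):
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads (hypothetical data): 'I' = Q40, '5' = Q20, '#' = Q2.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI5#\n")
print(mean_quality_per_position(example))  # [40.0, 30.0, 21.0, 40.0]
```

    A drop in the per-position mean toward the 3' end of reads is one of the patterns that typically prompts trimming before alignment.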

    As mentioned by each of our experts, MultiQC is another essential QC tool, but its purpose is different from that of FastQC. MultiQC collects reports and statistics from the outputs of many different tools and summarizes them in a single report, so it is meant to be run at the end of the analysis pipeline. As stated on the tool’s official website, “MultiQC doesn’t do any analysis for you—it just finds results from other tools that you have already run and generates nice reports.”
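
    The ordering described above (per-sample QC first, aggregation last) can be sketched as two command lines assembled in Python. The file and directory names are hypothetical, and the sketch only prints the commands rather than running them, so neither tool needs to be installed to try it.

```python
import shlex
from typing import List


def fastqc_command(fastq_files: List[str], outdir: str) -> List[str]:
    """Build a FastQC call that writes one report per input file to outdir.

    Note: FastQC expects the output directory to exist already.
    """
    return ["fastqc", "--outdir", outdir, *fastq_files]


def multiqc_command(search_dir: str, outdir: str) -> List[str]:
    """Build a MultiQC call that scans search_dir for outputs of other tools."""
    return ["multiqc", search_dir, "--outdir", outdir]


# Hypothetical inputs, for illustration only.
steps = [
    fastqc_command(["sampleA.fastq.gz", "sampleB.fastq.gz"], "qc"),
    multiqc_command("qc", "qc_summary"),
]
for argv in steps:
    # To execute for real, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```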

    Trimming tools
    Removing adapter sequences and low-quality bases is another critical step for ensuring the accuracy and reliability of downstream data analysis. For these tasks, Ziemann suggests using skewer, as it not only performs quality trimming but can also detect and remove adapters. Ziemann particularly favors skewer for its speed and compatibility with paired-end reads. Tang, however, recommends fastp for read trimming, describing it as “fast and versatile.” This tool offers several functions, including adapter trimming, read filtering, quality trimming, and base correction.
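
    To make the two operations concrete, here is a deliberately simplified sketch of 3' adapter removal followed by trimming of a low-quality tail. It uses exact string matching only, whereas real trimmers such as skewer and fastp tolerate mismatches and use more careful quality algorithms; the function names, thresholds, and toy read are assumptions for this example.

```python
from typing import Tuple


def trim_adapter(seq: str, qual: str, adapter: str,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Cut the read at a full adapter match, or at a partial adapter
    (a prefix of the adapter, at least min_overlap long) at the 3' end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the longest adapter prefix that ends the read.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq: str, qual: str, threshold: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below threshold."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Toy read (hypothetical): 8 genomic bases followed by 10 adapter bases.
adapter = "AGATCGGAAGAGC"
read_seq = "ACGTACGT" + "AGATCGGAAG"
read_qual = "IIIIII##" + "I" * 10

seq, qual = trim_adapter(read_seq, read_qual, adapter)
seq, qual = trim_low_quality_tail(seq, qual)
print(seq, qual)  # ACGTAC IIIIII
```

    Adapter removal runs first here so that high-quality adapter bases at the read end do not shield a low-quality tail from trimming.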

    Long-read QC
    While FastQC and MultiQC are great for short-read quality control, Mahmoud lists several additional tools that are important for long-read QC. He began with PycoQC, a tool that computes metrics and generates interactive QC plots specifically for Oxford Nanopore Technologies (ONT) sequencing data. Another useful ONT-specific tool is Porechop, designed for adapter trimming of long-read sequencing data. It removes adapter sequences from the ends of reads and splits reads that contain an internal adapter into separate reads. Users should be aware that this tool still works but is no longer supported.

    Mahmoud also recommends NanoPack, a set of tools developed for the visualization and processing of long-read sequencing data from both ONT and Pacific Biosciences (PacBio) instruments. Additionally, Filtlong is a versatile tool for QC of long-read sequencing data. It filters reads by length and quality, removing low-quality reads and reads that fall outside a specified length range.
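
    The idea of filtering long reads by length and quality can be sketched in a few lines. This is a minimal illustration, not how Filtlong or NanoPack score reads internally; the thresholds are arbitrary, and the choice to average per-base error probabilities (rather than Phred scores directly) is an assumption of this example.

```python
import math
from typing import Optional


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean read quality, computed by averaging per-base error
    probabilities and converting the result back to the Phred scale."""
    if not qual:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in qual]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def passes_filter(seq: str, qual: str, min_length: int = 1000,
                  max_length: Optional[int] = None,
                  min_mean_q: float = 10.0) -> bool:
    """Keep a read only if its length is in range and its mean quality
    reaches min_mean_q (all thresholds are illustrative defaults)."""
    if len(seq) < min_length:
        return False
    if max_length is not None and len(seq) > max_length:
        return False
    return mean_quality(qual) >= min_mean_q


# Toy reads (hypothetical): '5' encodes Q20 and '#' encodes Q2 in Phred+33.
print(passes_filter("A" * 1500, "5" * 1500))  # long enough, good quality
print(passes_filter("A" * 500, "5" * 500))    # too short
print(passes_filter("A" * 1500, "#" * 1500))  # low quality
```

    Averaging error probabilities makes a few very poor bases pull the mean down more strongly than a plain average of Phred scores would.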

    Finally, Mahmoud supports using LongQC, which is specifically designed for QC of PacBio and ONT long reads and offers two main functions: sample QC and platform QC. With sample QC, users can assess whether their data are ready for analysis by simply providing standard sequence file formats such as FASTQ, FASTA, or subread BAM from PacBio sequencers. Platform QC provides essential statistics, including length and productivity, for a PacBio run, and generates productivity plots for evaluating ONT runs.

    Alignment and mapping tools
    Accurate mapping of sequencing reads to a reference genome or transcriptome relies on the use of high-quality alignment and mapping tools. In this section, our expert bioinformaticians share their preferred mapping tools for the applications used to complete their work.

    “Kallisto is sufficient for 99% of my RNA-seq work, which is quantification for downstream differential expression,” says Ziemann. “Kallisto is preferred for this step as it is more accurate for quantification, in addition to being faster and having lower computational requirements. In a few projects, we still use STAR (Spliced Transcripts Alignment to a Reference) as we are interested in looking for novel transcripts.” STAR is another popular RNA-seq read mapper that can also assist with the detection of splice junctions and fusion reads.
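
    For readers new to transcript quantification, the sketch below shows how estimated counts and effective lengths, two of the columns kallisto reports per transcript, relate to transcripts per million (TPM). The function name and the toy numbers are assumptions for this example, and kallisto computes these values itself, so this is only meant to show the arithmetic.

```python
from typing import List


def tpm(est_counts: List[float], eff_lengths: List[float]) -> List[float]:
    """Convert estimated counts to TPM: normalize each count by the
    transcript's effective length, then scale so the values sum to 1e6."""
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]


# Three hypothetical transcripts: the third has the most reads but is
# three times longer, so its TPM equals that of the first.
print(tpm([1000, 2000, 3000], [1000, 1000, 3000]))
# [250000.0, 500000.0, 250000.0]
```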

    “At this stage, I also perform more QC, mapping a sample of reads with BWA (Burrows-Wheeler Aligner) to a set of rRNA sequences to quantify the depletion of those sequences,” adds Ziemann. BWA is a software package for mapping sequencing reads against a reference genome. It consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
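
    One way to turn such an alignment into a depletion estimate is to count the fraction of reads that mapped to the rRNA reference. The sketch below does this from SAM text using the standard FLAG bits; the function name and the toy records are assumptions, and this is not necessarily how Ziemann computes the figure.

```python
from typing import Iterable


def mapped_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of reads with a primary alignment to the reference.

    Header lines start with '@'. Secondary (0x100) and supplementary
    (0x800) records are skipped so each read is counted once; for
    paired-end data each mate counts as its own read.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records (hypothetical) against a reference named rRNA1.
sam = [
    "@SQ\tSN:rRNA1\tLN:1000",
    "r1\t0\trRNA1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\trRNA1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t2064\trRNA1\t90\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(mapped_fraction(sam))  # 0.5
```

    A low fraction on a subsample of reads suggests that rRNA depletion worked; a high one flags the library before the full analysis is run.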

    For alignment and mapping in standard DNA sequencing applications, Tang specifically cites BWA-MEM. This is one of the algorithms in the BWA package mentioned previously; it performs local alignment and can produce alignments for different parts of the query sequence. For other DNA applications like ChIP-seq, Tang recommends BWA and bowtie2. Bowtie2 is a widely used tool for aligning short reads to a reference genome, enabling various downstream analyses such as variant calling, identification of gene expression levels, and detection of structural variations.

    Long reads
    While Mahmoud agrees with many of the short-read tools mentioned before, he also suggests several other mapping and alignment tools specific to long-read sequencing. His first recommendation is Minimap2, which can be used with short and long reads. This alignment program is designed for a range of applications, including mapping genomic reads from long-read sequencing to the human genome, identifying overlaps between long error-prone reads, splice-aware alignment of cDNA or direct RNA reads, and other important alignment-related applications.
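
    The applications listed above correspond to presets selected with minimap2's -x option. The sketch below maps a few of them to command lines; the task labels and file names are hypothetical, only the preset strings and the -a and -x options come from minimap2 itself, and the commands are printed rather than executed.

```python
import shlex
from typing import List

# Hypothetical task labels mapped to minimap2 preset names (-x values).
PRESETS = {
    "ont_genomic": "map-ont",        # ONT genomic reads to a reference
    "pacbio_clr_genomic": "map-pb",  # PacBio CLR genomic reads
    "spliced_long_rna": "splice",    # splice-aware cDNA / direct RNA
    "short_reads": "sr",             # short genomic reads
    "ont_overlap": "ava-ont",        # all-vs-all overlap of ONT reads
}


def minimap2_command(task: str, target: str, query: str) -> List[str]:
    """Build a minimap2 argument list for the given task label."""
    preset = PRESETS[task]
    argv = ["minimap2", "-x", preset]
    if not preset.startswith("ava-"):
        # -a requests SAM output; overlap mode keeps the default PAF.
        argv.append("-a")
    return argv + [target, query]


# Hypothetical file names, for illustration only.
print(shlex.join(minimap2_command("ont_genomic", "ref.fa", "reads.fq")))
print(shlex.join(minimap2_command("ont_overlap", "reads.fq", "reads.fq")))
```

    Output goes to standard output by default, so in practice the command is usually redirected to a file or piped into a sorting step.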

    LRA (Long Read Aligner) is another useful sequence alignment program that aligns long reads from single-molecule sequencing (SMS) instruments, or large-scale contigs from SMS assemblies. This alignment method increases sensitivity and specificity for structural variant discovery. The final recommendation for mapping tools is NGMLR (CoNvex Gap-cost alignMents for Long Reads). It’s specifically designed for mapping long-read sequencing data, and the resulting output files can be used with other tools (e.g., Sniffles and cuteSV) to detect structural variations.

    Read the next segment on variant analysis and genome assembly tools or the final segment on differential expression and visualization tools.

    As a courtesy to our members, we’ve attached a PDF with a list of these tools and their accompanying publications, host sites, and GitHub pages. If there are any tools for these processes that you recommend but weren’t included above, log in and share them with the community in the comments below.

    About the Author


    Benjamin Atha (seqadmin) holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.
