QC, Alignment, and Mapping: Recommended Tools for Next-Level Sequencing Analysis
With new tools and computational resources being released regularly, it can be hard to determine which are best suited for the analysis process and which older tools continue to be maintained. In an effort to assist the sequencing community, we interviewed three highly skilled bioinformaticians about their recommended tools for several important analysis applications.
Quality control and preprocessing tools
“Garbage in, garbage out” is a popular maxim in the bioinformatics and computer science communities because any successful analysis begins with proper quality control (QC) and preprocessing of the data. We began our interviews by asking the following experts about their preferred tools for these processes: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics at Deakin University; Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine; and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
General QC tools
All three of our participants recommended FastQC and MultiQC for the QC process. Ziemann notes that these tools give a good overview of many important metrics. Mahmoud expanded on this by explaining that FastQC is a convenient tool for running quality control checks on raw sequence data from high-throughput sequencing pipelines. It offers a modular suite of analyses so users can quickly spot potential issues with the data before proceeding with further analysis.
As each of our experts mentioned, MultiQC is another essential QC tool, but it serves a different purpose than FastQC. MultiQC collects the reports and statistics produced by a variety of other tools and summarizes them in a single report, so it is meant to be run at the end of the analysis pipeline. As stated on the tool’s official website, “MultiQC doesn’t do any analysis for you - it just finds results from other tools that you have already run and generates nice reports.”
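The division of labor described above could be scripted roughly as follows. This is a minimal Python sketch, not a recommended pipeline: it assumes FastQC and MultiQC are installed and on the PATH, and the file names and the qc_reports directory are placeholders chosen for illustration.

```python
import subprocess
from pathlib import Path


def build_qc_commands(fastq_files, out_dir="qc_reports", threads=2):
    """Return the FastQC and MultiQC command lines as argument lists.

    out_dir and threads are illustrative defaults, not recommendations.
    """
    # FastQC analyses each raw FASTQ file and writes one report per file.
    fastqc_cmd = ["fastqc", "--outdir", out_dir, "--threads", str(threads)]
    fastqc_cmd += [str(f) for f in fastq_files]
    # MultiQC scans the directory for existing reports and aggregates them.
    multiqc_cmd = ["multiqc", out_dir, "--outdir", out_dir]
    return [fastqc_cmd, multiqc_cmd]


def run_qc(fastq_files, out_dir="qc_reports", threads=2):
    """Run FastQC first, then MultiQC on its output (requires both tools)."""
    # FastQC expects the output directory to exist already.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_qc_commands(fastq_files, out_dir, threads):
        subprocess.run(cmd, check=True)
```

Calling build_qc_commands(["sample_R1.fastq.gz", "sample_R2.fastq.gz"]) only returns the two commands, which makes it easy to inspect them before run_qc executes anything.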
Trimming tools
Removing adapter sequences and trimming low-quality bases are also critical steps for ensuring the accuracy and reliability of downstream data analysis. For these tasks, Ziemann suggests skewer, as it not only performs quality trimming but can also detect and effectively remove adapters. Ziemann particularly favors skewer for its speed and compatibility with paired-end reads. Tang, on the other hand, recommends fastp for read trimming, describing it as “fast and versatile.” This tool offers several functions, including adapter trimming, read filtering, quality trimming, and base correction.
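To make the two trimming steps concrete, here is a deliberately simplified Python sketch. It is not the algorithm used by skewer or fastp: real trimmers tolerate mismatches, handle partial adapters at the read end, and use paired-end information. The function names and the default quality threshold of 20 are assumptions made for illustration.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Simplification: exact matching only, no mismatches or partial adapters.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q.

    qual is a Phred+33 encoded quality string (the usual FASTQ encoding).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

Applied in sequence, the first function removes the adapter and everything after it, and the second then removes any low-quality bases left at the end of the read.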
Long-read QC
While FastQC and MultiQC are great for short-read quality control, Mahmoud lists several more tools that are important for the long-read QC process. He began with PycoQC, a tool that computes metrics and generates interactive QC plots specifically for Oxford Nanopore Technologies (ONT) sequencing data. Another useful ONT-specific tool is Porechop, designed for adapter trimming of long-read sequencing data. It removes adapter sequences from the ends of reads and splits reads that contain an adapter in the middle into separate reads. Users should be aware that this tool still works but is no longer supported.
Mahmoud also recommends NanoPack, an essential set of tools developed for the visualization and processing of long-read sequencing data obtained from both ONT and Pacific Biosciences (PacBio) instruments. Additionally, Filtlong is a versatile tool used for QC of long-read sequencing data. It effectively eliminates low-quality reads, reads contaminated with adapters, and reads outside a specified length range.
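The kind of filtering described above can be illustrated with a short Python sketch. It is not Filtlong's scoring scheme: it simply applies a length window and an arithmetic mean of Phred scores, and the thresholds (1,000 bases, mean quality 10) are arbitrary values chosen for the example.

```python
def mean_phred(qual, offset=33):
    """Arithmetic mean of the Phred scores in a Phred+33 quality string.

    Simplification: averaging Phred scores directly is cruder than
    averaging the underlying error probabilities.
    """
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(seq, qual, min_len=1000, max_len=None, min_mean_q=10.0):
    """Return True if the read passes the length and quality thresholds."""
    if len(seq) < min_len:
        return False  # too short
    if max_len is not None and len(seq) > max_len:
        return False  # too long
    return mean_phred(qual) >= min_mean_q
```

A read is kept only if it falls inside the length range and its mean quality meets the threshold; everything else is discarded before downstream analysis.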
Finally, Mahmoud supports using LongQC, which is specifically designed for QC of PacBio and ONT long reads and offers two main functions: sample QC and platform QC. With sample QC, users can assess whether their data is ready for analysis by simply providing standard sequence file formats such as FASTQ, FASTA, or subread BAM from PacBio sequencers. Platform QC provides essential statistics, including length and productivity, for a PacBio run, and generates productivity plots for evaluating ONT performance.
Alignment and mapping tools
Accurate mapping of sequencing reads to a reference genome or transcriptome relies on the use of high-quality alignment and mapping tools. In this section, our expert bioinformaticians share their preferred mapping tools for the applications used to complete their work.
RNA-seq
“Kallisto is sufficient for 99% of my RNA-seq work, which is quantification for downstream differential expression,” says Ziemann. “Kallisto is preferred for this step as it is more accurate for quantification, in addition to being faster and having lower computational requirements. In a few projects, we still use STAR (Spliced Transcripts Alignment to a Reference) as we are interested in looking for novel transcripts.” STAR is another popular RNA-Seq read mapper that can also assist with splice-junction and fusion read detection.
“At this stage, I also perform more QC, mapping a sample of reads with BWA (Burrows-Wheeler Aligner) to a set of rRNA sequences to quantify the depletion of those sequences,” adds Ziemann. BWA is a software package for mapping sequencing reads against a reference genome, and it comprises three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
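As a rough illustration of the rRNA check Ziemann describes, the Python sketch below computes the fraction of reads that aligned, given SAM output from a run such as `bwa mem rRNA.fa sample.fastq`. The file names are hypothetical, and this is only one possible way to summarize the result, not necessarily how Ziemann does it. For paired-end data each mate is counted as a separate read.

```python
def mapped_fraction(sam_lines):
    """Fraction of reads with a primary alignment in SAM-formatted text.

    Relies only on the FLAG column (field 2) of the SAM format.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0
```

When the reference contains only rRNA sequences, a high mapped fraction suggests that rRNA depletion was incomplete for that sample.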
DNA-seq
For alignment and mapping in standard DNA sequencing applications, Tang specifically cites BWA-MEM. It belongs to the BWA software package mentioned previously and is the algorithm that performs local alignments, producing alignments for different parts of the query sequence. For other DNA applications like ChIP-seq, Tang recommends BWA and bowtie2. Bowtie2 is a widely used tool for aligning short reads to a reference genome, which enables various downstream analyses such as variant calling, identification of gene expression levels, and detection of structural variations.
Long reads
While Mahmoud agrees with many of the short-read tools mentioned before, he also suggests several other mapping and alignment tools specific to long-read sequencing. His first recommendation is Minimap2, which can be used with short and long reads. This alignment program is designed for a range of applications, including mapping genomic reads from long-read sequencing to the human genome, identifying overlaps between long error-prone reads, splice-aware alignment of cDNA or direct RNA reads, and other important alignment-related applications.
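The applications listed above correspond to different minimap2 presets. The Python sketch below builds the matching command lines; the dictionary keys are names invented for this example, the file names are placeholders, and the presets shown are starting points that assume minimap2 is installed, not tuned settings for any particular dataset.

```python
# Keys are hypothetical labels for this sketch; values are minimap2 options.
MINIMAP2_PRESETS = {
    "ont_genomic": ["-ax", "map-ont"],        # ONT genomic reads, SAM output
    "pacbio_clr_genomic": ["-ax", "map-pb"],  # PacBio CLR genomic reads
    "spliced_long_rna": ["-ax", "splice"],    # splice-aware cDNA/RNA alignment
    "short_reads": ["-ax", "sr"],             # short genomic reads
    "ont_overlap": ["-x", "ava-ont"],         # all-vs-all read overlaps (PAF)
}


def build_minimap2_command(application, target, queries):
    """Return a minimap2 command line as an argument list."""
    if application not in MINIMAP2_PRESETS:
        raise ValueError(f"unknown application: {application}")
    return ["minimap2", *MINIMAP2_PRESETS[application], target, *queries]
```

For read overlaps the same read file is passed as both target and query, for example build_minimap2_command("ont_overlap", "reads.fq", ["reads.fq"]). The command can be run with subprocess.run, redirecting standard output to a SAM or PAF file.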
LRA (Long Read Aligner) is another useful sequence alignment program that aligns long reads from single-molecule sequencing (SMS) instruments, or large-scale contigs from SMS assemblies. This alignment method increases sensitivity and specificity for structural variant discovery. The final recommendation for mapping tools is NGMLR (CoNvex Gap-cost alignMents for Long Reads). It’s specifically designed for mapping long-read sequencing data, and the resulting output files can be used with other tools (e.g., Sniffles and cuteSV) to detect structural variations.
Read the next segment on variant analysis and genome assembly tools or the final segment on differential expression and visualization tools.
As a courtesy to our members, we’ve attached a PDF with a list of these tools and their accompanying publications, host sites, and GitHub pages. If there are any tools for these processes that you recommend but weren’t included above, log in and share them with the community in the comments below.
About the Author
Benjamin Atha holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.