Variant Analysis and Genome Assembly: Recommended Tools for Next-Level Sequencing Analysis

Published: 05-19-2023, 10:10 AM

Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.

Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50 bp), which include single nucleotide variants (SNVs) and insertions and deletions (indels); and longer variants (≥50 bp), such as structural variants (SVs). Similarly, he divides variant analysis tools into two categories: those tailored for short-read data and those designed specifically for long-read data.
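
To make the size-based grouping concrete, here is a minimal Python sketch that labels a variant from its reference and alternate alleles using the 50 bp threshold given above. The function name classify_variant and the decision to measure size as the difference in allele lengths are assumptions made for illustration; real callers also handle symbolic alleles and multi-base substitutions, which this sketch deliberately rejects.

```python
SV_MIN_LENGTH = 50  # threshold from the text: variants of 50 bp or more count as structural


def classify_variant(ref: str, alt: str) -> str:
    """Label a simple REF/ALT allele pair as 'SNV', 'indel', or 'SV'.

    Hypothetical helper for illustration only. Size is taken as the
    difference in allele lengths, which is an assumption of this sketch.
    """
    if alt.startswith("<"):
        # Symbolic alleles such as <DEL> carry their size elsewhere in a VCF record.
        raise ValueError("symbolic alleles are outside the scope of this sketch")
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "SNV"
        raise ValueError("multi-base substitutions are outside the scope of this sketch")
    size = abs(len(alt) - len(ref))
    return "SV" if size >= SV_MIN_LENGTH else "indel"


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base substitution
    print(classify_variant("A", "ATG"))           # 2 bp insertion
    print(classify_variant("A" + "C" * 50, "A"))  # 50 bp deletion
```

The single threshold constant keeps the boundary between the two groups explicit, so it is easy to see which category of tool in the sections below would apply to a given call.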

One exception to this separation is PRINCESS, a comprehensive variant analysis tool that takes the reads, aligns them using several available tools, and then calls short and long variants while additionally phasing them. PRINCESS can detect haplotype-resolved SNVs, SVs, and methylation events. Mahmoud is a developer of this powerful tool, which has the framework to perform QC and long-read analysis.

Short variants with short reads
Our next recommendation comes from Tang, who suggests using GATK (Genome Analysis Toolkit) for variant analysis. This toolkit is an industry standard for variant discovery, and it provides a wide range of tools for different variant workflows. In addition, Tang explains that Illumina's analysis platform, DRAGEN (Dynamic Read Analysis for GENomics), is another great tool if one has access to it. The combination of these two resources forms DRAGEN-GATK, which can further streamline and improve the variant analysis process.

Mahmoud recommends two more resources for short variant work using short reads. The first is FreeBayes, a haplotype-based variant detector. It can detect variants in regions with low read coverage and is well-suited for large-scale sequencing projects. The other recommendation is samtools, one of the most widely used toolsets for working with sequence alignments. Rather than a single tool, samtools is a collection of programs for processing alignment data in SAM, BAM, and CRAM formats, supporting operations such as format conversion, sorting, indexing, and filtering. Note that samtools does not align reads itself, and variant calling is performed with its companion project, bcftools, which works on the alignments that samtools prepares.

Short variants with long reads
Beginning with DeepVariant, Mahmoud suggests several tools that can be used with sequencing data generated from long-read instruments. DeepVariant can work with short- and long-read data, and it uses a deep learning-based variant caller that is capable of detecting variants in complex regions. The next tool, Clair, is specifically used for calling variants with single-molecule sequencing data. It is a germline small variant caller that uses pileup data and deep neural networks. The creators of Clair have also more recently released an updated version, Clair3, and a Nanopore-specific variant caller, Clair3-trio, which is designed for trio variant calling.

Two other highly utilized variant callers for long reads are Longshot and Medaka. Longshot uses haplotype information from the long-read data to correctly detect and phase SNVs in diploid genomes. Alternatively, Medaka is an ONT-specific tool designed for creating consensus sequences and variant calls. Users should also note that the diploid variant calling workflow for Medaka has been deprecated and it’s recommended to use Clair3 instead.

Structural variants with short reads
Parliament2 stands as a consensus SV framework that combines multiple top-performing methods to efficiently identify high-quality SVs from short-read DNA sequencing data on a large scale. Another popular tool named DELLY is specifically made for detecting various types of SVs, including deletions, tandem duplications, inversions, and translocations. It utilizes paired-end and split-read data to accurately identify these structural variations.
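
The idea behind a consensus approach can be sketched in a few lines of Python: keep an SV call only when at least two different callers report a matching event. This is an illustrative sketch and not Parliament2's actual implementation. The SVCall record, the caller names, the 50% reciprocal-overlap rule, and the two-caller minimum are all assumptions chosen for the example.

```python
from typing import List, NamedTuple


class SVCall(NamedTuple):
    """Hypothetical minimal record for one structural variant call."""
    chrom: str
    start: int
    end: int
    svtype: str
    caller: str


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Fraction of overlap relative to the longer-penalized call (0.0 if no match)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def consensus_calls(calls: List[SVCall], min_overlap: float = 0.5,
                    min_callers: int = 2) -> List[SVCall]:
    """Return the calls supported by at least `min_callers` distinct callers.

    Matching calls are not merged into one record; each supported call is kept.
    """
    kept = []
    for call in calls:
        supporters = {other.caller for other in calls
                      if reciprocal_overlap(call, other) >= min_overlap}
        if len(supporters) >= min_callers:
            kept.append(call)
    return kept


if __name__ == "__main__":
    calls = [
        SVCall("chr1", 1000, 2000, "DEL", "caller_a"),
        SVCall("chr1", 1100, 2050, "DEL", "caller_b"),
        SVCall("chr2", 500, 900, "DUP", "caller_a"),  # reported by one caller only
    ]
    for call in consensus_calls(calls):
        print(call)
```

Requiring agreement between independent methods trades some sensitivity for precision, which is the general motivation for combining several callers rather than relying on one.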

LUMPY, another commonly used SV caller, combines paired-end and split-read signals and also incorporates read-depth information, enhancing its ability to identify SVs accurately. Finally, Manta is a versatile solution for SV detection that uses both paired-end and split-read data to detect a wide range of structural variants, such as deletions, insertions, inversions, and complex rearrangements.

Structural variants with long reads
The first tool Mahmoud suggests for detecting structural variants from long-read data is Sniffles. There is now a newer version called Sniffles2, a complete redesign with enhanced capabilities for germline SV calling. It also facilitates family and population SV calling on a larger scale and introduces new approaches for identifying mosaic SVs. In addition, cuteSV is a long-read-based approach that enables in-depth analysis of the complex signatures of structural variants inferred from read alignments. Originally developed for constructing the syndip benchmark dataset, Dipcall is a reference-based variant-calling pipeline designed for a pair of phased haplotype assemblies. The last resource, pbsv, is a suite of tools for PacBio long-read sequencing data. These tools call and analyze SVs in diploid genomes and support both single-sample and joint (multi-sample) calling.

Genome assembly and analysis tools
Genome assembly relies on different tools depending on the read lengths involved. True to their name, short-read assemblies use smaller DNA fragments that are generally sequenced at high coverage but have a limited ability to resolve complex genomic regions. Conversely, long-read assemblies use longer DNA fragments, which allow higher resolution of complex genomic regions but are typically sequenced at lower coverage.
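
Coverage here refers to the average number of times each base in the genome is sequenced. As a rough illustration, it can be estimated as total sequenced bases divided by genome size. The function name and the example numbers below are hypothetical and are not taken from any dataset in this article.

```python
def mean_coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Estimate average depth as (number of reads x mean read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


if __name__ == "__main__":
    # Hypothetical short-read run: many short reads on a 3 Gb genome.
    print(mean_coverage(600_000_000, 150, 3_000_000_000))   # 30.0
    # Hypothetical long-read run: far fewer reads, each much longer.
    print(mean_coverage(3_000_000, 15_000, 3_000_000_000))  # 15.0
```

The two calls show how a long-read run can end up with lower depth despite much longer reads, simply because it produces far fewer reads.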

Short-read assemblies
For short-read genome assemblies, Mahmoud recommends SPAdes, ABySS, Velvet, and SOAPdenovo2. SPAdes is known for its ability to handle diverse sequencing data types and produce high-quality assemblies. ABySS employs a de Bruijn graph approach and is particularly adept at handling large and complex genomes. Velvet stands out for its fast and memory-efficient performance, making it suitable for small to medium-sized genomes. Additionally, SOAPdenovo2 is specifically designed to handle large and complex genomes while aiming to minimize errors during the assembly process. Each of these assemblers offers valuable tools for researchers working with different genomic data types and sizes, catering to various assembly needs.
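
For readers unfamiliar with the de Bruijn graph approach mentioned above, the following Python sketch shows the core idea on toy data: reads are broken into k-mers, each k-mer becomes an edge between its (k-1)-base prefix and suffix, and overlapping reads are reconstructed by walking the graph. This is a teaching sketch only, not how ABySS or any other assembler is implemented; the function names are made up, and real assemblers add error correction, k-mer counts, and handling of both strands.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a sequence from `start` while each node has exactly one successor.

    Simplification: only out-degree is checked, so this is not a full unitig walk.
    """
    contig = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    reads = ["ACGTC", "CGTCA"]  # two toy reads that overlap by four bases
    graph = build_de_bruijn(reads, k=3)
    print(graph)
    print(walk_unambiguous(graph, "AC"))  # ACGTCA
```

Because overlaps are found implicitly through shared k-mers rather than by comparing every pair of reads, this representation scales to the very large read counts typical of short-read data.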

Long-read assemblies
Mahmoud recommends several influential tools for long-read assembly. Canu is a popular choice that can effectively handle various types of long-read data and produce high-quality assemblies. Shasta, along with its polishing algorithms MarginPolish and HELEN, is a de novo long-read assembler that offers reliable assembly solutions. Specifically designed for long-read data, Flye is a tool recognized for its ability to generate highly accurate assemblies. For metagenome assembly, metaFlye provides a scalable solution using repeat graphs. Lastly, wtdbg2 is a de novo assembler that employs a fuzzy Bruijn graph approach, making it well-suited for handling long-read data.

Attached is a PDF containing links to the websites, GitHub pages, and original publications for each resource. If you use a tool that wasn’t listed in this article, log in and tell us about the tool in the comments below! And don’t forget to read our final article on tool recommendations.

Attached Files

Sequencing Analysis Tools2.pdf