

Variant Analysis and Genome Assembly: Recommended Tools for Next-Level Sequencing Analysis







    Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.

    Variant detection and analysis tools
    Mahmoud classifies variant detection work into two main groups: short variants (<50 base pairs), which include single nucleotide variants (SNVs) and insertions and deletions (indels); and longer variants (≥50 bp), such as structural variants (SVs). Similarly, he divides variant analysis tools into two categories: one tailored for short-read data and another designed for long-read data.
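
    As an illustration of this size-based split, the minimal Python sketch below classifies a variant from its reference and alternate alleles. The function name `classify_variant` and the `MNV` label for equal-length substitutions are assumptions made for this example and are not part of Mahmoud's scheme. The sketch also assumes explicit allele sequences rather than symbolic alleles such as `<DEL>`.

```python
def classify_variant(ref: str, alt: str, sv_threshold: int = 50) -> str:
    """Classify a variant by allele length, using the 50 bp cutoff from the text.

    Assumes ref and alt are explicit nucleotide strings (no symbolic alleles).
    """
    if len(ref) == len(alt):
        # Equal-length substitution: one base is an SNV. Longer ones are
        # labelled "MNV" here, a category this example adds for completeness.
        return "SNV" if len(ref) == 1 else "MNV"
    size = abs(len(alt) - len(ref))  # net number of inserted or deleted bases
    return "SV" if size >= sv_threshold else "indel"


print(classify_variant("A", "G"))             # SNV
print(classify_variant("A", "ATTG"))          # indel (3 bp insertion)
print(classify_variant("A", "A" + "T" * 50))  # SV (50 bp insertion)
```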

    One exception to this separation is PRINCESS, a comprehensive variant analysis tool that takes the reads, aligns them using one of several available aligners, and then calls both short variants and SVs while also phasing them. PRINCESS can detect haplotype-resolved SNVs, SVs, and methylation events. Mahmoud is a developer of this tool, which also provides a framework for QC and long-read analysis.

    Short variants with short reads
    Our next recommendation comes from Tang, who suggests using GATK (Genome Analysis Toolkit) for variant analysis. This toolkit is an industry standard for variant discovery, and it provides a wide range of tools for different variant workflows. In addition, Tang explains that Illumina's analysis platform, DRAGEN (Dynamic Read Analysis for GENomics), is another great tool if one has access to it. The combination of these two resources forms DRAGEN-GATK, which can further streamline and improve the variant analysis process.

    Mahmoud recommends two more resources for short variant work using short reads. The first is FreeBayes, a haplotype-based variant detector. It can detect variants in regions with low read coverage and is well-suited for large-scale sequencing projects. The other recommendation is samtools, one of the best-known toolsets in the field. Rather than a single tool, samtools is a collection of programs for processing and analyzing sequence alignment data, supporting operations such as format conversion, sorting, indexing, and filtering. Variant calling in this toolset is now handled by its companion project, bcftools.

    Short variants with long reads
    Beginning with DeepVariant, Mahmoud suggests several tools that can be used with sequencing data generated from long-read instruments. DeepVariant is a deep learning-based variant caller that works with both short- and long-read data and is capable of detecting variants in complex regions. The next tool, Clair, is specifically used for calling variants with single-molecule sequencing data. It is a germline small variant caller that uses pileup data and deep neural networks. The creators of Clair have also more recently released an updated version, Clair3, and a Nanopore-specific variant caller, Clair3-trio, which is designed for trio variant calling.

    Two other widely used variant callers for long reads are Longshot and Medaka. Longshot uses haplotype information from the long-read data to detect and phase SNVs in diploid genomes. Medaka, by contrast, is an ONT-specific tool designed for creating consensus sequences and variant calls. Users should note that the diploid variant calling workflow for Medaka has been deprecated, and Clair3 is recommended instead.

    Structural variants with short reads
    Parliament2 stands as a consensus SV framework that combines multiple top-performing methods to efficiently identify high-quality SVs from short-read DNA sequencing data on a large scale. Another popular tool named DELLY is specifically made for detecting various types of SVs, including deletions, tandem duplications, inversions, and translocations. It utilizes paired-end and split-read data to accurately identify these structural variations.

    LUMPY, a commonly used SV caller, combines paired-end and split-read signals and can also incorporate read-depth information, enhancing its ability to identify SVs accurately. Finally, Manta is a versatile solution for SV detection that utilizes both paired-end and split-read data to detect a wide range of structural variants, such as deletions, insertions, inversions, and complex rearrangements.
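
    The consensus idea behind a framework such as Parliament2 can be illustrated with a toy Python sketch: keep only the SV calls that a second caller reports at roughly the same position. This is not Parliament2's actual merging logic. The `SVCall` record, the 50% reciprocal-overlap rule, and the two-caller threshold are all assumptions chosen for illustration, and the coordinates in the usage example are made up.

```python
from typing import Iterable, List, NamedTuple


class SVCall(NamedTuple):
    """Hypothetical minimal SV record used only for this example."""
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Fraction of overlap relative to the longer-penalised of the two calls."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def consensus_calls(calls: Iterable[SVCall], min_callers: int = 2,
                    min_overlap: float = 0.5) -> List[SVCall]:
    """Keep calls that at least `min_callers` distinct callers agree on."""
    calls = list(calls)
    kept = []
    for call in calls:
        # A call always overlaps itself, so its own caller counts as one vote.
        supporters = {other.caller for other in calls
                      if reciprocal_overlap(call, other) >= min_overlap}
        if len(supporters) >= min_callers:
            kept.append(call)
    return kept


calls = [
    SVCall("delly", "chr1", 1000, 2000, "DEL"),
    SVCall("manta", "chr1", 1050, 1990, "DEL"),
    SVCall("lumpy", "chr2", 500, 900, "DUP"),
]
supported = consensus_calls(calls)
print([c.caller for c in supported])  # ['delly', 'manta']
```

    The sketch returns one record per supporting caller rather than a single merged call. A real consensus framework would also merge breakpoints and handle insertions and translocations, which have no simple start-to-end span.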

    Structural variants with long reads
    The first tool Mahmoud suggests for detecting structural variants from long-read data is Sniffles. There is now a newer version called Sniffles2, which offers a complete redesign with enhanced capabilities for germline SV calling. It also facilitates family and population SV calling on a larger scale and introduces new approaches for identifying mosaic SVs. In addition, cuteSV is a long-read-based approach that enables in-depth analysis of the complex signatures of structural variants inferred from read alignments. Dipcall, originally developed for constructing the syndip benchmark dataset, is a reference-based variant-calling pipeline designed for a pair of phased haplotype assemblies. The last resource, PBSV, is a suite of tools for PacBio long-read sequencing data. These tools call and analyze SVs in diploid genomes and support both single-sample and joint (multi-sample) calling.

    Genome assembly and analysis tools
    Assembling genomes involves different tools depending on the read lengths used for the process. True to their name, assemblies from short reads utilize smaller DNA fragments that are generally high in coverage but have a limited ability to resolve complex genomic regions. Conversely, long-read assemblies use longer DNA fragments, which allow higher resolution of complex genomic regions but typically come with lower coverage.
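
    Coverage here refers to the average number of reads spanning each base of the genome. A common back-of-the-envelope estimate is the total number of sequenced bases divided by the genome size. The short Python sketch below computes it; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_coverage(num_reads: int, mean_read_length: float,
                      genome_size: int) -> float:
    """Estimate average depth as (reads x mean read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


# Hypothetical run: one million 150 bp reads against a 5 Mb genome.
print(expected_coverage(1_000_000, 150, 5_000_000))  # 30.0
```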

    Short-read assemblies
    For short-read genome assemblies, Mahmoud recommends SPAdes, ABySS, Velvet, and SOAPdenovo2. SPAdes is known for its ability to handle diverse sequencing data types and produce high-quality assemblies. ABySS employs a de Bruijn graph approach and is particularly adept at handling large and complex genomes. Velvet stands out for its fast and memory-efficient performance, making it suitable for small to medium-sized genomes. Additionally, SOAPdenovo2 is specifically designed to handle large and complex genomes while aiming to minimize errors during the assembly process. Each of these assemblers offers valuable tools for researchers working with different genomic data types and sizes, catering to various assembly needs.
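
    For readers unfamiliar with the de Bruijn graph approach mentioned above, the Python sketch below builds the simplest form of such a graph: each read is cut into overlapping k-mers, and every k-mer becomes an edge from its first k-1 bases to its last k-1 bases. This is a teaching sketch under simplifying assumptions, not the implementation used by ABySS or any other assembler named here. Production assemblers also track k-mer counts, handle reverse complements, and remove sequencing errors. The function name `build_de_bruijn` and the toy reads are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads that together spell ACGTCA.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

    Walking the edges from `AC` through `CG`, `GT`, and `TC` to `CA` recovers the sequence ACGTCA, which is the intuition behind assembling contigs from paths in the graph.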

    Long-read assemblies
    There are several influential tools Mahmoud advocates for long-read assembly. Canu is a popular choice that can effectively handle various types of long-read data and produce high-quality assemblies. Shasta, along with its polishing algorithms MarginPolish and HELEN, is a de novo long-read assembler that offers reliable assembly solutions. Flye, designed specifically for long-read data, is recognized for its ability to generate highly accurate assemblies. For metagenome assembly, metaFlye provides a scalable solution using repeat graphs. Lastly, wtdbg2 is a de novo assembler that employs a fuzzy Bruijn graph approach, making it well-suited for handling long-read data.

    Attached is a PDF containing links to the websites, GitHub pages, and original publications for each resource. If you use a tool that wasn’t listed in this article, log in and tell us about the tool in the comments below! And don’t forget to read our final article on tool recommendations.