VCF: A Guide to Key File Formats for Sequencing Data

    Variant Call Format (VCF) is a text file format for storing genetic variation data, such as single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), and structural variants. Each record in these tab-delimited files gives the genomic location, reference allele, and alternate allele(s) of a variant. Because the format is flexible, VCF files are widely used in genomics research and are a key component of many genetic analysis pipelines.

    Creation and structure

    This format was created out of necessity in 2010 by researchers working on the 1000 Genomes Project [1]. “The data produced by the project was quite unprecedented at the time and there was no format that would offer the same features,” explained Dr. Petr Danecek, Senior Bioinformatician at Wellcome Trust Sanger Institute. “The design of VCF was inspired by the SAM format for storing sequence alignments, created by the group earlier in 2008. As the project progressed to more advanced stages and started producing variant calls, this was a natural thing to do.” Danecek, a junior member of the team during his time on the 1000 Genomes Project, said that he cannot take credit for the development of VCF and noted that numerous individuals have contributed to maintaining the format over time.

    Because they hold so much information, VCF files are structurally more complex than several other commonly used file formats. However, a VCF can be broken down into three main sections: the meta-information lines, the header line, and the data lines. Each meta-information line starts with ‘##’ and records details such as the VCF version number, the software that produced the file, and the reference genome used, along with other information needed to interpret the dataset.
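
    As a rough illustration of the meta-information lines described above, the following Python sketch collects ‘##key=value’ lines into a dictionary. The function name parse_meta_lines, the caller name, and the reference path are hypothetical and used only for this example; structured values such as INFO definitions are kept as raw strings rather than parsed further.

```python
from collections import defaultdict


def parse_meta_lines(lines):
    """Collect '##key=value' meta-information lines into a dict of lists."""
    meta = defaultdict(list)
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines come first; stop at the header line
        # Split on the first '=' only, so structured values stay intact.
        key, _, value = line[2:].rstrip("\n").partition("=")
        meta[key].append(value)
    return dict(meta)


# Hypothetical example content (the caller name and path are made up).
example = [
    "##fileformat=VCFv4.2",
    "##source=hypotheticalCaller-1.0",
    "##reference=file:///path/to/reference.fa",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

meta = parse_meta_lines(example)
print(meta["fileformat"])  # ['VCFv4.2']
```

    Values are stored in lists because keys such as INFO and FORMAT typically appear on many lines of the same file.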

    The header line starts with a single ‘#’ and names eight mandatory columns that describe each variant: CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO. When genotype data are present, these are followed by a FORMAT column and one column per sample. The data section then contains one record per variant, with fields corresponding to the columns named in the header line, such as the chromosome, position, reference allele, alternate allele(s), quality score, and genotype information for each sample. Complete and up-to-date details about the VCF specification can be found at: https://github.com/samtools/hts-specs
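
    The three-part layout can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name split_vcf and the sample name NA00001 are assumptions for the example, and all values are left as strings. For real work, an established tool or library is the safer choice.

```python
# The eight mandatory columns of the header line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def split_vcf(text):
    """Split VCF text into meta-information lines, header columns, and records."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            header = line[1:].split("\t")  # drop the leading '#'
        elif line:
            if header is None:
                raise ValueError("data line found before the header line")
            # Pair each tab-separated field with its column name.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


# Toy input with one hypothetical sample column.
vcf_text = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\tGT\t0|0",
])

meta, header, records = split_vcf(vcf_text)
print(header[:8] == FIXED_COLUMNS)  # True
print(records[0]["POS"], records[0]["REF"], records[0]["ALT"])  # 14370 G A
```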

    Figure 1. An example VCF file showing the meta-information lines, header line, and data lines.


    Benefits and challenges

    Using this format has many advantages for those performing variant analysis. “The main advantage of the format is that it is extensible and allows for the representation of very rich information, in some cases unforeseen by the original specification,” said Danecek. “And although there were many improvements and refinements over the years, most of the changes were backward compatible. For example, the flexibility of the format allowed us to represent all types of genetic variation via symbolic alleles, without having to change the overall structure of VCF.”

    Another valuable feature, Danecek explained, “is the possibility to verify the version of the reference genome build by checking the genomic coordinate and the corresponding reference allele, which is required by VCF. That does not sound like much, but I've seen a lot of confusion among users working with some other formats attempting to determine the reference and alternate allele.” Danecek also noted that there are many other important features valuable to users, but suggested those interested should refer to the official VCF specifications for all the details.
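
    The reference check Danecek describes can be sketched as follows. This is a toy illustration under stated assumptions: the reference is a plain Python dictionary standing in for a real FASTA file, and the function name ref_mismatches is hypothetical. Positions are 1-based, as in VCF.

```python
def ref_mismatches(records, reference):
    """Return records whose REF allele disagrees with the reference sequence.

    records: iterable of (chrom, pos, ref) tuples with 1-based positions.
    reference: dict mapping chromosome name to sequence (toy stand-in for a FASTA).
    """
    bad = []
    for chrom, pos, ref in records:
        seq = reference.get(chrom, "")
        # Convert the 1-based VCF position to a 0-based slice.
        observed = seq[pos - 1 : pos - 1 + len(ref)]
        if observed.upper() != ref.upper():
            bad.append((chrom, pos, ref, observed))
    return bad


toy_reference = {"chr1": "ACGTACGTAC"}
toy_records = [("chr1", 3, "G"), ("chr1", 5, "AC"), ("chr1", 8, "A")]

print(ref_mismatches(toy_records, toy_reference))
# [('chr1', 8, 'A', 'T')]
```

    If many records disagree with the reference, a likely explanation is that the file was called against a different genome build than the one being checked.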

    Despite the many benefits of this format, VCF does come with some challenges. “One of the most pressing problems is the ever-growing sample size that results in huge files and slow parsing speeds,” said Danecek. “We have reached the point where the world is producing VCFs with hundreds of thousands of samples, for example, see the recent release of 470k sample VCFs in UK Biobank. Parsing such big files is prohibitively slow.”

    He noted that there are two reasons for these issues. “First, VCF is a textual format and it is very slow to convert text into a binary form that computers can understand. Second, VCF does not support random access by sample and data type. For example, even when asked for a genotype of just a single sample, the entire row with potentially hundreds of thousands of samples must be parsed. Both of these problems were addressed by the binary counterpart of VCF called BCF, which is what BAM is to SAM. BCF has the full expressive power of VCF and one can convert between the formats without losing information. There is an efficient API for working with BCF and for any serious work, we use BCF instead of VCF.”
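
    To make the random-access point concrete, the following Python sketch looks up one sample's genotype in a text VCF record. The function and sample names are hypothetical. Even though only one value is wanted, the whole row is split into fields first, which is the cost Danecek describes for files with very many samples.

```python
def genotype_for_sample(header_line, data_line, sample):
    """Return the GT value of one sample from a single text VCF record."""
    columns = header_line.lstrip("#").split("\t")
    # Splitting the row tokenizes every sample column, not just the one we want.
    fields = data_line.rstrip("\n").split("\t")
    format_keys = fields[columns.index("FORMAT")].split(":")
    sample_values = fields[columns.index(sample)].split(":")
    return dict(zip(format_keys, sample_values))["GT"]


# Toy record with three hypothetical samples.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
data_line = "20\t14370\t.\tG\tA\t29\tPASS\t.\tGT:DP\t0|0:12\t1|0:8\t1/1:5"

print(genotype_for_sample(header_line, data_line, "S2"))  # 1|0
```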

    Danecek further explained that another problem is the inherent ambiguity of variant representation. “To illustrate on a very simple example, consider two adjacent SNVs—they can be represented either as two SNV rows or a single MNP row. Now throw in phasing, indels, and other variation types into the mix; the algorithmic complexity of handling such cases in full generality is considerable.”
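
    The two-SNV versus one-MNP example can be checked with a small Python sketch. It assumes that both SNVs lie on the same haplotype and that the variants do not overlap; the function name apply_variants and the toy sequence are made up for illustration.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (1-based pos) to a sequence."""
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match reference at position {pos}")
        pieces.append(reference[cursor:start])  # unchanged sequence before the variant
        pieces.append(alt)                      # the alternate allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged tail
    return "".join(pieces)


reference = "ACGTAC"
two_snvs = [(3, "G", "T"), (4, "T", "A")]  # two adjacent SNV rows
one_mnp = [(3, "GT", "TA")]                # the same change as a single MNP row

print(apply_variants(reference, two_snvs))  # ACTAAC
print(apply_variants(reference, one_mnp))   # ACTAAC
```

    Both representations yield the same sequence, yet a naive row-by-row comparison of the two files would report them as different variants.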


    VCF Facts
    • VCF files can be visualized using genome browsers, such as the UCSC Genome Browser or the Ensembl Genome Browser
    • There are many available tools and libraries to manipulate and analyze VCF files, such as bcftools, GATK, and VCFtools
    • Related formats such as BCF and gVCF are also important to understand for variant analysis

    Concluding thoughts

    VCF remains a critical component of modern genetics research and it has played a significant role in advancing our understanding of genetic variation. While the format can be complex, it provides a clear and useful framework for storing important information. “VCF has become ubiquitous,” said Danecek when asked about its broader influence on sequencing analysis.

    Danecek also shared his expectations for the future of the format. “I hope to see better compression schemes and performance improvements. Full haplotype representations, known as pangenome graphs, are a promising direction to deal with complex variation and ambiguities.”



    References
    1. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-2158. doi:10.1093/bioinformatics/btr330

    About the Author

    Benjamin Atha holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.
