VCF: A Guide to Key File Formats for Sequencing Data

Variant Call Format (VCF) is a text file format for storing genetic variation data, such as single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), and structural variants. These tab-delimited files record the genomic location, reference allele, and alternate allele(s) for each variant. Due to their flexibility, VCF files are widely used in genomics research and are a key component of many genetic analysis pipelines.

Creation and structure

This format was created out of necessity in 2010 by researchers working on the 1000 Genomes Project¹. “The data produced by the project was quite unprecedented at the time and there was no format that would offer the same features,” explained Dr. Petr Danecek, Senior Bioinformatician at Wellcome Trust Sanger Institute. “The design of VCF was inspired by the SAM format for storing sequence alignments, created by the group earlier in 2008. As the project progressed to more advanced stages and started producing variant calls, this was a natural thing to do.” As a junior member of the team during his time at the 1000 Genomes Project, Danecek humbly stated that he cannot take credit for the development of VCF, and he recognized that over time, numerous individuals have contributed to the maintenance of the format.

Because of the vast amount of information these files contain, the structure of a VCF is more complex than that of several other commonly used file formats. However, a VCF can be broken down into three main sections: the meta-information lines, the header line, and the data lines. The meta-information lines start with ‘##’ and record details such as the VCF version number, the software and reference genome used, and other information needed to understand the dataset.

The header line starts with a single ‘#’ and names the columns: eight mandatory fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO) that describe each variant, followed, when genotype data is present, by a FORMAT column and one column per sample. The final data section contains one record per variant, with fields that correspond to the header columns, such as the chromosome, position, reference allele, alternate allele(s), quality score, and genotype information for each sample. Complete and up-to-date details about the VCF specification can be found at: https://github.com/samtools/hts-specs

Figure 1. A VCF file example shows the meta-information lines, header, and data lines.
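The three sections described above can be told apart by their line prefixes alone. The following is a minimal sketch in plain Python, not a full parser: the embedded example text, the sample names, and the function name `split_vcf_sections` are made up for illustration, and it assumes uncompressed, well-formed input.

```python
# A tiny, made-up VCF used only for illustration (not real data).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "##reference=example_reference.fa",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1\t1/1",
    "chr1\t250\t.\tCT\tC\t99\tPASS\tDP=31\tGT\t0/0\t0/1",
])


def split_vcf_sections(text):
    """Return (meta_lines, column_names, records) for uncompressed VCF text."""
    meta_lines, column_names, records = [], [], []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            # Meta-information line: keep the text after the '##' prefix.
            meta_lines.append(line[2:])
        elif line.startswith("#"):
            # Header line: column names, tab-separated, after the single '#'.
            column_names = line[1:].split("\t")
        else:
            # Data line: one record per variant, keyed by the header columns.
            records.append(dict(zip(column_names, line.split("\t"))))
    return meta_lines, column_names, records


meta, columns, records = split_vcf_sections(EXAMPLE_VCF)
print(meta)          # meta-information lines without the '##' prefix
print(columns[:8])   # the eight fixed columns
print(columns[9:])   # sample columns (after FORMAT)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["SAMPLE1"])
```

For real files, an established tool such as bcftools is a better choice than hand-written parsing; the sketch only shows how the file is laid out.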

Benefits and challenges

Using this format has many advantages for those performing variant analysis. “The main advantage of the format is that it is extensible and allows for the representation of very rich information, in some cases unforeseen by the original specification,” said Danecek. “And although there were many improvements and refinements over the years, most of the changes were backward compatible. For example, the flexibility of the format allowed us to represent all types of genetic variation via symbolic alleles, without having to change the overall structure of VCF.”

Another valuable feature, Danecek explained, “is the possibility to verify the version of the reference genome build by checking the genomic coordinate and the corresponding reference allele, which is required by VCF. That does not sound like much, but I've seen a lot of confusion among users working with some other formats attempting to determine the reference and alternate allele.” Danecek also noted that there are many other important features valuable to users, but suggested those interested should refer to the official VCF specifications for all the details.
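The reference check Danecek describes can be sketched in a few lines: look up the bases at each record's coordinate and compare them with the REF allele. This is an illustrative sketch only. The toy reference sequence, the record tuples, and the function name `ref_allele_matches` are assumptions made for the example, and a real check would read the sequence from the reference FASTA.

```python
# Hypothetical toy reference: one short sequence per chromosome name.
TOY_REFERENCE = {"chr1": "ACGTACGTAC"}


def ref_allele_matches(reference, chrom, pos, ref_allele):
    """Check that ref_allele equals the reference bases starting at 1-based pos."""
    sequence = reference.get(chrom)
    if sequence is None or pos < 1:
        return False
    start = pos - 1  # VCF positions are 1-based; Python slices are 0-based
    observed = sequence[start:start + len(ref_allele)]
    return observed.upper() == ref_allele.upper()


# Made-up records as (CHROM, POS, REF, ALT).
example_records = [
    ("chr1", 3, "G", "A"),   # base 3 of the toy sequence is G
    ("chr1", 5, "AC", "A"),  # bases 5-6 are AC
    ("chr1", 4, "C", "T"),   # base 4 is T, so REF=C does not match
]

mismatches = [
    rec for rec in example_records
    if not ref_allele_matches(TOY_REFERENCE, rec[0], rec[1], rec[2])
]
print(mismatches)
```

Many mismatches across a file suggest that the variants were called against a different genome build than the one being checked.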

Despite the many benefits of this format, VCF does come with some challenges. “One of the most pressing problems is the ever-growing sample size that results in huge files and slow parsing speeds,” said Danecek. “We have reached the point where the world is producing VCFs with hundreds of thousands of samples, for example, see the recent release of 470k sample VCFs in UK Biobank. Parsing such big files is prohibitively slow.”

He noted that there are two reasons for these issues. “First, VCF is a textual format and it is very slow to convert text into a binary form that computers can understand. Second, VCF does not support random access by sample and data type. For example, even when asked for a genotype of just a single sample, the entire row with potentially hundreds of thousands of samples must be parsed. Both of these problems were addressed by the binary counterpart of VCF called BCF, which is what BAM is to SAM. BCF has the full expressive power of VCF and one can convert between the formats without losing information. There is an efficient API for working with BCF and for any serious work, we use BCF instead of VCF.”
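To make the second point concrete, here is a rough sketch of what it takes to read one sample's genotype from a text VCF line. The header, the data line, and the function name `genotype_for_sample` are invented for illustration. The sketch does not show how BCF-aware tools work; it only shows where the text-parsing cost comes from.

```python
def genotype_for_sample(header_line, data_line, sample_name):
    """Return the GT value of one sample from a single tab-delimited VCF data line."""
    columns = header_line.lstrip("#").split("\t")
    sample_index = columns.index(sample_name)
    # The whole row is split, even though only two columns are needed.
    fields = data_line.split("\t")
    format_keys = fields[columns.index("FORMAT")].split(":")
    sample_values = fields[sample_index].split(":")
    return sample_values[format_keys.index("GT")]


# Made-up header and data line with three samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\t1/1:15"

print(genotype_for_sample(header, line, "S2"))
```

With three samples the cost is negligible, but the same split over hundreds of thousands of sample columns, repeated for every variant, is what makes large text VCFs slow to query.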

Danecek further explained that another problem is the inherent ambiguity of variant representation. “To illustrate on a very simple example, consider two adjacent SNVs—they can be represented either as two SNV rows or a single MNP row. Now throw in phasing, indels, and other variation types into the mix; the algorithmic complexity of handling such cases in full generality is considerable.”
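The simple case in the quote, two adjacent SNVs versus one MNP, can be sketched as a pair of conversions. This is a simplified illustration under strong assumptions: all variants are on the same chromosome and the same haplotype, only equal-length substitutions are handled, and the tuple layout and function names are hypothetical. Phasing, indels, and other variant types are deliberately left out.

```python
def split_mnp(pos, ref, alt):
    """Decompose an equal-length REF/ALT pair into per-base SNVs as (pos, ref, alt)."""
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled in this sketch")
    return [
        (pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a  # bases that do not change are not variants
    ]


def merge_adjacent_snvs(snvs):
    """Merge substitutions at consecutive positions into single MNP rows.

    Assumes all input variants lie on the same chromosome and haplotype.
    """
    merged = []
    for pos, ref, alt in sorted(snvs):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            # This variant starts right after the previous one ends: extend it.
            prev_pos, prev_ref, prev_alt = merged[-1]
            merged[-1] = (prev_pos, prev_ref + ref, prev_alt + alt)
        else:
            merged.append((pos, ref, alt))
    return merged


two_snvs = [(100, "A", "G"), (101, "C", "T")]
one_mnp = merge_adjacent_snvs(two_snvs)
print(one_mnp)                             # [(100, 'AC', 'GT')]
print(split_mnp(*one_mnp[0]) == two_snvs)  # True
```

Both representations describe the same sequence change, so tools that compare call sets need to normalize to one of them before matching records.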

VCF Facts
VCF files can be visualized using genome browsers, such as the UCSC Genome Browser or the Ensembl Genome Browser

There are many available tools and libraries to manipulate and analyze VCF files, such as bcftools, GATK, and VCFtools

Related formats such as BCF and gVCF are also important to understand for variant analysis

Concluding thoughts

VCF remains a critical component of modern genetics research and it has played a significant role in advancing our understanding of genetic variation. While the format can be complex, it provides a clear and useful framework for storing important information. “VCF has become ubiquitous,” said Danecek when asked about its broader influence on sequencing analysis.

Additionally, he has several expectations for the future and emerging trends. “I hope to see better compression schemes and performance improvements. Full haplotype representations, known as pangenome graphs, are a promising direction to deal with complex variation and ambiguities.”

References
1. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-2158. doi:10.1093/bioinformatics/btr330