

VCF: A Guide to Key File Formats for Sequencing Data




    Variant Call Format (VCF) is a tab-delimited text format for storing genetic variation data, such as single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), and structural variations. Each record gives the genomic location, reference allele, and alternate allele(s) of a variant. Due to their flexibility, VCF files are widely used in genomics research and are a key component in many genetic analysis pipelines.

    Creation and structure

    This format was created out of necessity in 2010 by researchers working on the 1000 Genomes Project [1]. “The data produced by the project was quite unprecedented at the time and there was no format that would offer the same features,” explained Dr. Petr Danecek, Senior Bioinformatician at Wellcome Trust Sanger Institute. “The design of VCF was inspired by the SAM format for storing sequence alignments, created by the group earlier in 2008. As the project progressed to more advanced stages and started producing variant calls, this was a natural thing to do.” As a junior member of the team during his time at the 1000 Genomes Project, Danecek humbly stated that he cannot take credit for the development of VCF, and he recognized that over time, numerous individuals have contributed to the maintenance of the format.

    Due to the vast amount of information contained in these files, the structure of a VCF is inherently more complex than that of several other commonly used file formats. However, a VCF can be broken down into three main sections: the meta-information lines, the header line, and the data lines. The meta-information lines start with ‘##’, and each one records details such as the VCF version number, the software and reference genome used, and other information needed to understand the dataset.

    The header line starts with a single ‘#’ and names eight mandatory columns that describe each variant (CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO). When genotype data are present, these are followed by a FORMAT column and one column per sample. In the final data section, there is one record per variant, with fields corresponding to the columns named in the header line. Each record therefore includes the chromosome, position, reference allele, alternate allele(s), quality score, and, where present, genotype information for each sample. Complete and up-to-date details about VCF specifications can be found at: https://github.com/samtools/hts-specs

    Figure 1. A VCF file example shows the meta-information lines, header, and data lines.
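
    The three sections described above can be separated with a few lines of Python. This is a minimal sketch, not a full parser: the example file contents, the `EXAMPLE_VCF` name, and the `split_vcf` function are made up for illustration, and real-world work is better served by established tools such as bcftools.

```python
import io

# Hypothetical two-record VCF, used only for illustration.
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
##reference=example.fa
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0/1
chr1\t200\trs1\tCT\tC\t99\tPASS\tDP=12\tGT\t1/1
"""


def split_vcf(handle):
    """Split VCF text into meta-information lines, header columns, and records."""
    meta, header, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line, e.g. ##fileformat=VCFv4.2
            meta.append(line)
        elif line.startswith("#"):
            # Header line: drop the leading '#' and split on tabs.
            header = line[1:].split("\t")
        else:
            if header is None:
                raise ValueError("data line found before the header line")
            # Data line: pair each field with its column name.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


meta, header, records = split_vcf(io.StringIO(EXAMPLE_VCF))
print(len(meta), "meta-information lines")
print("columns:", header)
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

    Each record is kept as a plain dictionary keyed by column name, which is convenient for a small example but would be too slow and memory-hungry for large cohorts.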

    Benefits and challenges

    Using this format has many advantages for those performing variant analysis. “The main advantage of the format is that it is extensible and allows for the representation of very rich information, in some cases unforeseen by the original specification,” said Danecek. “And although there were many improvements and refinements over the years, most of the changes were backward compatible. For example, the flexibility of the format allowed us to represent all types of genetic variation via symbolic alleles, without having to change the overall structure of VCF.”

    Another valuable feature, Danecek explained, “is the possibility to verify the version of the reference genome build by checking the genomic coordinate and the corresponding reference allele, which is required by VCF. That does not sound like much, but I've seen a lot of confusion among users working with some other formats attempting to determine the reference and alternate allele.” Danecek also noted that there are many other important features valuable to users, but suggested those interested should refer to the official VCF specifications for all the details.
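
    The reference check Danecek describes can be sketched as follows. The sketch assumes the reference sequence is already in memory as a Python dictionary. The `REFERENCE` contents and the `ref_allele_matches` function are hypothetical, and a real check would read the sequence from a FASTA file.

```python
# Hypothetical reference sequences keyed by chromosome name.
REFERENCE = {"chr1": "ACGTACGTAC"}


def ref_allele_matches(reference, chrom, pos, ref):
    """Return True if the REF allele agrees with the reference at POS.

    POS is 1-based in VCF, so it is shifted by one for Python slicing.
    """
    sequence = reference.get(chrom)
    if sequence is None or pos < 1:
        return False
    start = pos - 1
    return sequence[start:start + len(ref)].upper() == ref.upper()


# Positions 3-4 of the example sequence are G and T, so this record is consistent.
print(ref_allele_matches(REFERENCE, "chr1", 3, "GT"))
# Position 3 is G, not A, which hints at a different build or a malformed record.
print(ref_allele_matches(REFERENCE, "chr1", 3, "A"))
```

    If many records fail this check against one genome build but pass against another, the file was most likely produced with the second build.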

    Despite the many benefits of this format, VCF does come with some challenges. “One of the most pressing problems is the ever-growing sample size that results in huge files and slow parsing speeds,” said Danecek. “We have reached the point where the world is producing VCFs with hundreds of thousands of samples, for example, see the recent release of 470k sample VCFs in UK Biobank. Parsing such big files is prohibitively slow.”

    He noted that there are two reasons for these issues. “First, VCF is a textual format and it is very slow to convert text into a binary form that computers can understand. Second, VCF does not support random access by sample and data type. For example, even when asked for a genotype of just a single sample, the entire row with potentially hundreds of thousands of samples must be parsed. Both of these problems were addressed by the binary counterpart of VCF called BCF, which is what BAM is to SAM. BCF has the full expressive power of VCF and one can convert between the formats without losing information. There is an efficient API for working with BCF and for any serious work, we use BCF instead of VCF.”
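
    The row-parsing cost can be illustrated with a small Python sketch. The header, the data row, and the `genotype_for_sample` function are hypothetical. The point is only that a text row has to be read and split in full before the field for one sample can be located.

```python
# Hypothetical header and data row with three samples, for illustration only.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
ROW = "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0|1:10\t1|1:12\t0|0:8"


def genotype_for_sample(header_line, data_line, sample):
    """Return the GT value of one sample from a single text VCF row."""
    columns = header_line.lstrip("#").split("\t")
    # The whole row is tokenized here, however many samples it contains.
    fields = data_line.split("\t")
    sample_index = columns.index(sample)
    # FORMAT is the ninth column; it names the per-sample subfields.
    format_keys = fields[8].split(":")
    sample_values = fields[sample_index].split(":")
    return dict(zip(format_keys, sample_values))["GT"]


print(genotype_for_sample(HEADER, ROW, "S2"))
```

    With three samples the cost is negligible, but the same split applied to a row with hundreds of thousands of sample columns does almost entirely wasted work.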

    Danecek further explained that another problem is the inherent ambiguity of variant representation. “To illustrate on a very simple example, consider two adjacent SNVs—they can be represented either as two SNV rows or a single MNP row. Now throw in phasing, indels, and other variation types into the mix; the algorithmic complexity of handling such cases in full generality is considerable.”
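
    The SNV versus MNP ambiguity can be made concrete with a short Python sketch. It assumes both SNVs lie on the same haplotype, and the reference string and the `apply_variants` function are made up for illustration. Applying either representation to the reference yields the same sequence, even though the rows themselves differ.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    Positions are 1-based, as in VCF.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not handled in this sketch")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # unchanged sequence before the variant
        pieces.append(alt)                      # the alternate allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged sequence after the last variant
    return "".join(pieces)


REFERENCE = "ACGTAC"                           # hypothetical reference sequence
two_snvs = [(3, "G", "A"), (4, "T", "C")]      # two adjacent SNV rows
one_mnp = [(3, "GT", "AC")]                    # the same change as a single MNP row

print(apply_variants(REFERENCE, two_snvs))
print(apply_variants(REFERENCE, one_mnp))
print(apply_variants(REFERENCE, two_snvs) == apply_variants(REFERENCE, one_mnp))
```

    A naive row-by-row comparison would report these two call sets as different, which is why tools that compare or merge VCFs need some form of normalization or haplotype-level comparison.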

    VCF Facts
    • VCF files can be visualized using genome browsers, such as the UCSC Genome Browser or the Ensembl Genome Browser
    • There are many available tools and libraries to manipulate and analyze VCF files, such as bcftools, GATK, and VCFtools
    • Related formats such as BCF and gVCF are also important to understand for variant analysis

    Concluding thoughts

    VCF remains a critical component of modern genetics research and it has played a significant role in advancing our understanding of genetic variation. While the format can be complex, it provides a clear and useful framework for storing important information. “VCF has become ubiquitous,” said Danecek when asked about its broader influence on sequencing analysis.

    Additionally, he has several expectations for the future and emerging trends. “I hope to see better compression schemes and performance improvements. Full haplotype representations, known as pangenome graphs, are a promising direction to deal with complex variation and ambiguities.”

    1. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-2158. doi:10.1093/bioinformatics/btr330