Extracting genome specific SNPs from 1000 genomes


  • Extracting genome specific SNPs from 1000 genomes

    Hello!

    I've been trying to get SNP data from the 1000 Genomes Project and have been looking at the VCF files, but I fail to understand whether these report population, individual, or total variation. I would like to download genotypes from specific genomes. I would appreciate any information.

    Cheers

  • #2
    Documentation of the format can be found here

    http://vcftools.sourceforge.net/specs.html

    The files provided by the 1000 Genomes Project generally represent all the variant sites discovered in the samples analysed. The most recent release contains a list of the samples analysed: ftp://ftp.1000genomes.ebi.ac.uk/vol1...0804.ALL.panel

    vcftools provides utilities for extracting subsets of data from a VCF file.
    The files are also indexed with tabix, which means you can stream variants from a specific part of the genome.



    • #3
      Thanks Laura!!!
      Yes indeed, I found vcftools; the sysadmin will install it soon and I will try it. In the meantime I found a way to do it with awk, and it works quite well!

      M
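An awk approach like the one mentioned above is workable because sample columns in a VCF begin at field 10, after the nine fixed columns. A minimal sketch, using a made-up toy file and sample name (NA12878) rather than anything from the real release:

```shell
# Build a tiny two-sample VCF for illustration (fields are tab-separated).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '#CHROM' POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 NA12891 \
  1 10000 rs1 A G 100 PASS . GT '0|1' '0|0' \
  1 10500 rs2 C T 90 PASS . GT '1|1' '0|1' > toy.vcf

# Find the column for the sample of interest in the #CHROM header line,
# then print CHROM, POS and that sample's genotype for every data row.
awk 'BEGIN { FS = OFS = "\t" }
     /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == "NA12878") col = i }
     !/^#/     { print $1, $2, $col }' toy.vcf
```

The header-line loop extends naturally to any subset of sample names, though for large subsets vcftools' own sample filtering is likely less error-prone.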



      • #4
        Laura-
        Sorry to jump into someone else's thread, but you seem like an expert to whom I could ask this question. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

        If I uncompress the file and run:
        vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

        then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems like "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual? Anyway, I've modified my local copy and it works, but I thought that someone perhaps closer to the 1000 Genomes project would want to know.

        Best wishes,

        Todd
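For anyone hitting the same error before the headers are fixed upstream, the local edit described above can be scripted. A hedged sketch with sed; the INFO definitions below are reconstructed from the error description in this post, not copied from the release file, so check them against your own header first:

```shell
# Reconstructed header lines missing the Number entry (illustrative only).
printf '%s\n' \
  '##INFO=<ID=EUR_R2,Type=Float,Description="R2 in EUR">' \
  '##INFO=<ID=ASN_R2,Type=Float,Description="R2 in ASN">' \
  '##INFO=<ID=AFR_R2,Type=Float,Description="R2 in AFR">' > header.vcf

# Insert "Number=1," between each field ID and its Type tag.
sed -E 's/(ID=(EUR|ASN|AFR)_R2,)/\1Number=1,/' header.vcf > header.fixed.vcf
cat header.fixed.vcf
```

Running the same substitution over the full genotypes file (on a copy) is what makes vcftools accept it.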



        • #5
          That does look to be an error in the headers.

          If you find problems like this, it is best to email [email protected] so the right people can investigate.

          Thanks for letting us know.



          • #6
            Laura-
            Sorry, I actually neglected to look under the "Project Contacts" link on the website. However, I did email Goncalo, since his email is at the bottom of the README file for the latest release. Having not heard anything back from him, I thought I should take the opportunity when I saw your message above. Another thing I noticed, though I don't know if it's expected, is that there are a number of rows that have no genotypes in any of the samples. I would expect many rows to be missing genotypes in one population or another, but not across all samples. I suppose those are variant sites that were found at BC and NCBI but did not have genotypes, since they did not perform LD-aware genotype analysis. It seems to me that those should be in the "sites" file but filtered out of the "genotypes" file. I'll put together an email and forward my thoughts to [email protected].

            Thanks!

            Todd



            • #7
              It was decided it was better for all the sites to be in both files, with variants that don't have genotypes given the ./. notation. The sites file is always meant to contain the same variants as the genotypes file, but it is provided to give those who don't need individual genotypes a smaller download (300MB versus 60GB).

              The only genotypes which should be used for imputation are those which include a prediction by BI, as these are the only sites whose genotypes were assigned in an LD-aware manner. The UMich genotyper isn't LD aware, and imputation accuracy suffers if its calls are used for this purpose.



              • #8
                all individual genotypes = 60 GB data?!

                Are you kidding me?

                60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.



                • #9
                  Tell me about it!

                  The VCF file has so much other information besides just the genotype calls that it seems a bit excessive for a public release. It's sort of like XML embedded in a table format: a header at the top, and key-value pairs embedded within columns.
                  A representative call for a single variant position in one sample looks like this:
                  0|0:3,0:3:.:-0.00,-0.90,-13.33:22.58:./.

                  To understand the format a bit better, take a look at http://www.1000genomes.org/wiki/Anal...mat-version-40

                  If someone wants just genotype calls, you can download files formatted for Beagle, MACH, and Impute, which are much smaller, but it seems to me that each of those formats leaves out some of the information that would be useful for checking allele orientation (i.e., between existing Build 36 Illumina 610k data and the release's Build 37 coordinates):

                  Beagle:
                  http://faculty.washington.edu/browni...le/beagle.html

                  MACH:
                  http://www.sph.umich.edu/csg/abecasis/MaCH

                  Impute:
                  https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
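If only the GT calls are wanted without switching to one of those formats, the GT subfield can be split out of each colon-separated sample entry directly, since the VCF spec requires GT to come first when it is present. A small sketch on a made-up one-line file (the FORMAT string and values are illustrative, not from the release):

```shell
# One toy data row: FORMAT says GT:DP:GL, so the sample entry is colon-separated.
printf '1\t10000\trs1\tA\tG\t.\tPASS\t.\tGT:DP:GL\t0|1:3:-0.10,-0.90,-13.33\n' > calls.vcf

# Split the sample column (field 10) on ":" and keep only the leading GT subfield.
awk 'BEGIN { FS = OFS = "\t" } !/^#/ { split($10, f, ":"); print $1, $2, f[1] }' calls.vcf
```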



                  • #10
                    Originally posted by genesquared View Post
                    all individual genotypes = 60 GB data?!

                    Are you kidding me?

                    60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.
                    Well, it's only 629 individuals in this instance, and it's 60GB compressed (380GB uncompressed), but you should generally be able to stream the file using a combination of tabix and/or zcat, so you never need to uncompress it fully.
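The streaming point can be illustrated without the real 60GB file: zcat decompresses to stdout, so a pipeline can count or filter records without the uncompressed VCF ever touching disk. A toy sketch:

```shell
# Make a tiny gzipped "VCF" (two header lines, two data rows).
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\n1\t10000\n1\t10500\n' | gzip > toy.vcf.gz

# Stream it: count the non-header records without writing the plain text to disk.
zcat toy.vcf.gz | grep -vc '^#'
```

The same pattern scales to the full genotypes file; tabix adds random access by region on top of this.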



                    • #11
                      Originally posted by Todd Johnson View Post
                      Laura-
                      Sorry to jump into someone else's thread, but you seem like an expert to whom I could ask this question. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

                      If I uncompress the file and run:
                      vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

                      then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems like "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual? Anyway, I've modified my local copy and it works, but I thought that someone perhaps closer to the 1000 Genomes project would want to know.

                      Best wishes,

                      Todd
                      This error should now be fixed.

                      Thanks for pointing it out.



                      • #12
                        I would like to inspect 17 individuals and about 300 SNPs in a 500 kb locus.

                        Is there any "short cut"?

                        I know their hg18 positions (but not their rs numbers).

                        Thanks in advance



                        • #13
                          Your best bet for this is to use tabix to extract the data from the released vcf files.

                          The vcf format is described here
                          http://vcftools.sourceforge.net/specs.html

                          The files themselves can be found here
                          ftp://ftp.1000genomes.ebi.ac.uk/vol1...man_variation/

                          You can use tabix http://sourceforge.net/projects/samtools/files/tabix/ to extract subsections of these files

                          e.g.

                          tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:10000-20000

