Can anyone shed some light on the VCF files on the 1000 Genomes site? I downloaded these two files:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz
After decompressing, I counted the lines, expecting that the low-coverage data, which is taken from 60 individuals, would list considerably more SNPs than the trio data, which by definition is taken from 3 individuals. Here's what I found:
low-coverage: 277,123 lines
trio: 3,646,774 lines
Why are there so few SNPs for the low-coverage data?
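For reference, the counting step can be reproduced without decompressing to disk. This is only a minimal Python sketch, not necessarily how the counts above were obtained. The function name `count_vcf_records` is hypothetical, and it assumes the usual VCF convention that header and metadata lines start with `#`, so only data lines (one per variant site) are counted:

```python
import gzip


def count_vcf_records(path):
    """Count data lines in a gzipped VCF, skipping '#' header lines and blank lines."""
    records = 0
    # "rt" opens the gzip stream in text mode so each line is a str
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                records += 1
    return records
```

Calling `count_vcf_records("CEU.low_coverage.2010_07.genotypes.vcf.gz")` on a locally downloaded copy (the local path is an assumption) would give the number of variant records rather than the raw line count, which also includes the header.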