paper, 10 pages, .pdf:
supplementary material, 113 pages, .pdf:
==============================================================
After starting with the hapmap data (which looked easier) in this thread:
I arrived (again) at the 1000 Genomes data, which apparently is more extensive.
1000 Genomes, 2012 paper, supplement, page 60:
10.5 Haplotype estimation from OMNI data
2123 individuals:
327 trios, 42 duos, 1058 singles (327*3 + 42*2 + 1058 = 2123)
2177885 SNPs (so ~4.4 times as much data as hapmap, with 250 individuals at ~4M SNPs)
no X, Y, M chromosomes
Expanded, each (individual, position) entry takes 4 bytes: one of "1|1", "1|0", "0|1", "0|0" plus a tab (ASCII 009).
So only 2 of those 32 bits carry information, and that corresponds to the compression rate.
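To make the 4-bytes-versus-2-bits point concrete, here is a minimal sketch that packs four entries into one byte. The helper names and the bit layout (first allele in the high bit, entries filled from the low end of each byte) are my own assumptions, not anything the data files prescribe.

```python
# Hypothetical 2-bit code for each phased entry (assumed layout, not a standard).
CODES = {"0|0": 0b00, "0|1": 0b01, "1|0": 0b10, "1|1": 0b11}
NAMES = {code: text for text, code in CODES.items()}


def pack_genotypes(fields):
    """Pack phased entries such as "1|0" into bytes, four entries per byte."""
    out = bytearray((len(fields) + 3) // 4)
    for i, gt in enumerate(fields):
        out[i // 4] |= CODES[gt] << (2 * (i % 4))
    return bytes(out)


def unpack_genotypes(packed, n):
    """Recover the first n entries from a packed byte string."""
    return [NAMES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


# Four entries, each followed by a tab: 16 bytes of text.
line = "1|1\t1|0\t0|1\t0|0\t"
fields = line.rstrip("\t").split("\t")
packed = pack_genotypes(fields)
print(len(line), "bytes of text ->", len(packed), "byte packed")
```

Four entries of 4 bytes each shrink to a single byte, so this packing alone gives 16:1 before any real compressor is involved.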
11, 10, 01, 00 could correspond to the 2-letter entries in the hapmap data, where the two letters are given in
columns 2,3 and the 1 or 0 indicates which of them is chosen on which of the two haplotypes.
However, hapmap would have either 01 or 10, never both (the two haplotypes are indistinguishable there).
Anyway, for each entry you choose one of its two values at random and get that binary "diversity-matrix",
here of size 2123x2177885 bits, 578MB,
as compared to 250x4170000 bits, 130MB, for hapmap.
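The sizes above are just rows times columns at one bit per cell. A small sketch to check the arithmetic and to show what "choose one of the values at random" could look like; the function names are hypothetical, and a uniform pick between the two alleles is my assumption.

```python
import random


def matrix_megabytes(individuals, snps):
    """Size of a one-bit-per-cell matrix in MB (1 MB = 10**6 bytes)."""
    return individuals * snps / 8 / 1e6


def pick_allele(gt, rng):
    """Reduce a phased entry such as "1|0" to one bit, chosen uniformly (assumed)."""
    first, second = gt.split("|")
    return int(rng.choice((first, second)))


genomes_mb = matrix_megabytes(2123, 2177885)
hapmap_mb = matrix_megabytes(250, 4170000)
print(round(genomes_mb), round(hapmap_mb), round(genomes_mb / hapmap_mb, 1))

rng = random.Random(0)  # fixed seed only so that a run is repeatable
row = [pick_allele(gt, rng) for gt in ["1|1", "1|0", "0|1", "0|0"]]
```

Homozygous entries ("1|1", "0|0") always give the same bit; only the heterozygous ones actually depend on the random choice.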
Well, it's not really binary, since there are empty positions (~10%).
I think we should somehow fill them in a random, unbiased way so as to preserve the structure
and statistical content.
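One possible reading of "random, unbiased" is to fill each empty position by drawing from the frequency of 1s observed at that same SNP. This is only a sketch under that assumption: the function name is hypothetical, `None` stands in for an empty position, and the approach keeps per-SNP frequencies but ignores correlations between neighbouring SNPs, so it does not preserve all of the structure.

```python
import random


def fill_missing(column, rng):
    """Fill None entries of one SNP column (values 0, 1 or None) at random.

    Each gap is set to 1 with probability equal to the fraction of 1s
    among the observed entries of the same column (assumed scheme).
    """
    observed = [v for v in column if v is not None]
    if not observed:
        # No observed values to estimate a frequency from: leave unchanged.
        return list(column)
    p_one = sum(observed) / len(observed)
    return [v if v is not None else int(rng.random() < p_one) for v in column]


rng = random.Random(0)  # fixed seed only so that a run is repeatable
column = [1, 0, None, 1, None, 0, 1, None]
filled = fill_missing(column, rng)
```

Observed entries are never touched; only the gaps are drawn, so the expected frequency of 1s in the column stays the same.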
page 4, Genotype fields, GT:
> If genotype information is present, then the same types of data must be present for all samples.
> First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric
> String). This is followed by one field per sample, with the colon-separated data in this field
> corresponding to the types specified in the format. The first sub-field must always be the
> genotype (GT) if it is present. There are no required sub-fields.
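The quoted rule can be turned into a few lines of parsing. This sketch assumes the usual VCF layout of tab-separated columns, with FORMAT as the 9th column and one column per sample after it; the function name and the example line are made up for illustration.

```python
def genotypes_from_line(line):
    """Return the GT sub-field of every sample on one tab-separated data line."""
    cols = line.rstrip("\n").split("\t")
    format_keys = cols[8].split(":")  # FORMAT column, e.g. "GT:DP"
    if "GT" not in format_keys:
        return []  # no genotype information on this line
    # Per the quoted rule GT comes first if present; looking it up is just safer.
    gt_index = format_keys.index("GT")
    return [sample.split(":")[gt_index] for sample in cols[9:]]


# Hypothetical data line with three samples (not taken from the real files).
example = "1\t1000\trs0\tA\tG\t.\tPASS\t.\tGT:DP\t0|1:12\t1|1:9\t0|0:15"
gts = genotypes_from_line(example)
print(gts)
```

In the OMNI haplotype data each sample field seems to hold just the 3-character entry, so the split on ":" returns it unchanged.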