**gsgs** · 12-11-2012, 10:01 AM

Anyway, it seems that you always arrive at such a binary "diversity matrix",
individuals over SNP positions (all chromosomes):
of size 2123 x 2177885, 578 MB for 1000 Genomes
of size 250 x 4170000, 130 MB for HapMap
The bit at position (x,y) is set iff sequence x differs at position y from the consensus
(or average) in one of the 2 diploid alleles, chosen at random.
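A minimal sketch of that definition and of the size arithmetic, assuming one already-chosen haplotype string per individual and a consensus string of the same length. The function names and the toy sequences are made up for illustration; only the two matrix shapes come from the figures above.

```python
import numpy as np

def build_diversity_matrix(haplotypes, consensus):
    """Boolean matrix: cell (x, y) is True iff haplotype x differs from the consensus at y."""
    hap = np.asarray([list(h) for h in haplotypes])
    cons = np.asarray(list(consensus))
    return hap != cons  # the consensus row is broadcast across all individuals

def packed_size_bytes(n_rows, n_cols):
    """Bytes needed at one bit per cell, with rows stored back to back."""
    return (n_rows * n_cols + 7) // 8

# Toy example: 3 individuals over 4 SNP positions.
toy = build_diversity_matrix(["ACGT", "ACGA", "TCGT"], "ACGT")
packed = np.packbits(toy, axis=1)  # 8 cells per byte, each row padded to a whole byte

print(packed_size_bytes(2123, 2177885))  # 1000 Genomes shape
print(packed_size_bytes(250, 4170000))   # HapMap shape
```

At one bit per cell the two shapes come out at roughly 578 MB and 130 MB, which matches the sizes quoted above.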

We just need that giant binary matrix. What is it called? Is there already a mathematical theory
about its properties, manipulation, relation to other objects, ...?
Shouldn't they offer that matrix for download directly? (--> easier)

**gsgs** · 12-17-2012, 02:27 AM

There are 445.6 TB (358954 files, 20992 directories) listed in current_tree on the FTP site
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
Most of it (99.6% of the TB) is in the 3 data subdirs and the 3 technical subdirs:

Code:

subdir,files,terabytes
data:
main:153136,114.6
phas:013311,056.7
pilo:035889,013.1
---------------------------
data:202336files,184.4TB
 
technical:
main:144568,218.9
phas:007899,040.1
pilo:000930,000.3
--------------------------
tech:153397files,259.3TB
 
tota:358954files,445.6TB

ftp:/data has 2456 genome names and thus 2456 subdirectories HG00096...NA21144,
each of which has 3 subdirectories: alignment, exome_alignment, sequence_read
Per genome, there are between 2 and 3021 files in total in those 3 subdirectories.
82 genomes have more than 288 files
192 genomes have more than 253 files
322 genomes have more than 188 files
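Counts like these can be reproduced from a file listing along the following lines. This is only a sketch: the layout of current_tree is not described in this thread, so the code takes plain path strings of the hypothetical form `data/<genome>/<subdir>/<file>`, and both function names are invented for the example.

```python
from collections import Counter

def files_per_genome(paths, root="data"):
    """Count files under each genome directory, i.e. root/<genome>/..."""
    counts = Counter()
    for path in paths:
        parts = path.strip("/").split("/")
        # parts[0] is the top-level dir, parts[1] the genome name (assumed layout)
        if len(parts) >= 3 and parts[0] == root:
            counts[parts[1]] += 1
    return counts

def genomes_with_more_than(counts, threshold):
    """Number of genomes whose file count exceeds the threshold."""
    return sum(1 for n in counts.values() if n > threshold)

# Toy listing, not real data.
listing = [
    "data/HG00096/alignment/a.bam",
    "data/HG00096/sequence_read/b.fastq.gz",
    "data/NA21144/alignment/c.bam",
    "technical/other/d.txt",
]
counts = files_per_genome(listing)
print(genomes_with_more_than(counts, 1))
```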

**gsgs** · 12-24-2012, 01:26 PM

I finished downloading and converting the SNP data
from the 22 chromosomes at
/omni/haplotypes:
2123 individuals, 1.1 GB compressed, 18.5 GB expanded

SNPs in the 22 available chromosomes:
+ 174234 + 182881 + 153373 + 142614 + 136754 + 136611
+ 121998 + 118385 + 097989 + 112704 + 109763 + 105824
+ 077675 + 072264 + 068091 + 073797 + 064155 + 064176
+ 047536 + 054241 + 030040 + 032780 = 2177885
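The per-chromosome counts also give the column offsets into the big matrix. A small sketch, assuming the 22 numbers above are listed in chromosome order 1..22 (the post does not say so explicitly); `column_index` is a made-up helper name.

```python
from itertools import accumulate

# SNP counts per chromosome, copied from the list above (assumed order: chr1..chr22).
SNPS_PER_CHROMOSOME = [
    174234, 182881, 153373, 142614, 136754, 136611,
    121998, 118385, 97989, 112704, 109763, 105824,
    77675, 72264, 68091, 73797, 64155, 64176,
    47536, 54241, 30040, 32780,
]

# offsets[c - 1] is the first matrix column belonging to chromosome c.
offsets = [0] + list(accumulate(SNPS_PER_CHROMOSOME))

def column_index(chromosome, snp_rank):
    """Map (chromosome 1..22, 0-based SNP rank within it) to a global column."""
    if not 0 <= snp_rank < SNPS_PER_CHROMOSOME[chromosome - 1]:
        raise IndexError("snp_rank out of range for this chromosome")
    return offsets[chromosome - 1] + snp_rank

print(offsets[-1])  # total number of columns
```

The last offset reproduces the total of 2177885 columns.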

2*4.6 GB for the 2*23 FASTA files uncompressed, ~0.9 GB compressed

(~450000 GB of 1000 Genomes data in total)

**rama** · 01-07-2013, 08:35 AM

Does anyone have an explanation for why the number of variants varies so significantly between release/20100804 and 20110521 for any given sample? I looked at the variants listed for the NA10851 sample in ftp://ftp.1000genomes.ebi.ac.uk/vol1...lease/20110521 and
ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz, and found 39.7 and 14.6 million variants respectively. I expected to see some differences in genome locations, but was surprised to see such a big difference in the number of calls. Does anyone know what factors, other than the difference in genome build, could account for such a huge difference? Thanks in advance for your kind help.
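One thing worth pinning down before comparing such numbers is what exactly is being counted in each file. The sketch below counts only the records where a given sample carries a non-reference allele, rather than all records in the file. It assumes plain-text, tab-separated VCF lines with GT as the first per-sample subfield; the function name and the toy records are invented, and this is not a claim about how the numbers above were obtained.

```python
def count_sample_variants(vcf_lines, sample):
    """Count records where `sample` has at least one non-reference allele."""
    col = None
    count = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            col = fields.index(sample)  # raises ValueError if the sample is absent
            continue
        if col is None:
            raise ValueError("header line with sample names not seen yet")
        genotype = fields[col].split(":")[0]  # GT is assumed to come first
        alleles = genotype.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            count += 1
    return count

# Toy records, not real data.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA10851\tNA00001",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t1|1",
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT:DP\t1/1:12\t./.",
]
print(count_sample_variants(toy_vcf, "NA10851"))
```

For a real `.vcf.gz` the lines could be fed from `gzip.open(path, "rt")`.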

**gsgs** · 08-28-2017, 11:51 PM

Originally posted by gsgs View Post

anyway, it seems that you always arrive at such a binary "diversity-matrix",
individuals over SNP-positions(all chromosomes)
of size 2123x2177885, 578MB for 1000 genomes
of size 250x4170000, 130MB for hapmap
The bit at position (x,y) is set, iff sequence x differs at position y from the consensus
(or average) in one of the 2 diploid alleles, chosen at random

We just need that giant binary matrix, how is it called ? Is there a math-theory
about its properties,manipulation,relation to other objects,... already ?
Shouldn't they offer that matrix for download directly ? (-->easier)

I found it here:

https://www.researchgate.net/figure/230586432_fig1_HapZipper-compression-diagram-provided-a-dbSNP-database-Phased-haplotypes-are-compared

"HapZipper", a paper from 2012, but no follow-up :-(

Their purpose is (was) to compress the data, rather than to analyse, manipulate, characterize, search, order, or share
it efficiently.

Now you can reorder the rows [individuals] and columns [SNP locations] of that matrix
so as to achieve the smallest sum of differences between neighboring rows and columns,
using some "traveling salesman" algorithm, and then offer that matrix as a giant
zoomable .pdf picture of human genetic diversity, to be compared with
other species.
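A rough sketch of the reordering idea, using a greedy nearest-neighbour heuristic as a stand-in for a proper "traveling salesman" solver (it does not guarantee the smallest sum, it only tends to reduce it). The function names and the toy matrix are made up; columns can be handled the same way by passing the transposed matrix.

```python
import numpy as np

def neighbour_cost(matrix, order):
    """Sum of Hamming distances between consecutive rows in the given order."""
    m = np.asarray(matrix, dtype=bool)[order]
    return int(np.count_nonzero(m[1:] != m[:-1]))

def greedy_row_order(matrix, start=0):
    """Visit rows greedily: always move to the closest unvisited row."""
    matrix = np.asarray(matrix, dtype=bool)
    order = [start]
    remaining = set(range(matrix.shape[0])) - {start}
    while remaining:
        last = matrix[order[-1]]
        # Ties are broken by the lower row index to keep the result deterministic.
        nxt = min(remaining, key=lambda r: (np.count_nonzero(matrix[r] != last), r))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy matrix: 4 individuals over 4 SNP positions.
toy = np.array([
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
], dtype=bool)

order = greedy_row_order(toy)
print(order, neighbour_cost(toy, order), neighbour_cost(toy, [0, 1, 2, 3]))
```

This compares every remaining row against the current one at each step, so it is quadratic in the number of rows. That is probably workable for a couple of thousand individuals, but for the 2 million columns one would need something smarter, e.g. reordering within chromosomes or windows only.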

Who will do it (first) ?

[ I tried to download GenBank's dbSNP data, but couldn't figure out
the format, what to download, or how to convert it.
https://en.wikipedia.org/wiki/DbSNP ]

I'd like to have the HapZipper matrix of these 2280 human public domain genomes:

A Collection of 2,280 Public Domain (CC0) Curated Human Genotypes

http://www.biorxiv.org/content/early/2017/04/19/127241

Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks within which individuals donate their data. Having an aggregated genome dataset with unrestricted use, analogous to public domain licensing, is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform, an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allows free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.


1000 genomes data format
