New Resources for 1000 Genomes

This topic is closed.

This is a sticky topic.

rama replied

12-06-2012, 05:37 PM
Laura,

What should I specify if I don't have a particular region to look at and want to get all genome-wide variants?

Thanks so much for your kind help.
laura replied

12-06-2012, 11:17 AM
That is correct
rama replied

12-06-2012, 10:35 AM
Laura,

Thanks very much for your reply. I am guessing this is the example for getting the VCF of a single sample:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000gen...804/ALL.2of4in... 17:1471000-1472000 | perl /nfs/1000g-work/G1K/work/bin/vcftools/perl/vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
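
For anyone wondering what the vcf-subset step of that pipeline does to the stream, here is a minimal Python sketch of the column selection only. It is not the vcftools implementation: the function name `subset_vcf` and the toy two-sample VCF are made up for illustration, and anything the real script does beyond picking columns is not reproduced here.

```python
def subset_vcf(lines, sample):
    """Yield VCF lines keeping the 9 fixed columns plus one sample column.

    Hypothetical helper for illustration; not the vcftools vcf-subset code.
    """
    keep = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Column header: find the requested sample among the genotype
            # columns (index 9 onwards). Raises ValueError if it is absent.
            keep = fields.index(sample, 9)
        elif keep is None:
            raise ValueError("data line seen before the #CHROM header")
        yield "\t".join(fields[:9] + [fields[keep]])


# Toy input (invented records, two samples) standing in for the tabix output.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00098",
    "17\t1471100\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1",
    "17\t1471500\t.\tC\tT\t100\tPASS\t.\tGT\t1|1\t0|0",
]

subset = list(subset_vcf(toy_vcf, "HG00098"))
for row in subset:
    print(row)
```

On real data the lines would come from the tabix output rather than from a list of strings.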
laura replied

12-05-2012, 11:10 PM
You should be able to get this info from our VCF files using a combination of tabix and the vcftools vcf-subset script, as described in our FAQ:

http://www.1000genomes.org/faq/how-do-i-get-slice-your-vcf-files
rama replied

12-05-2012, 03:18 PM
vcf file of specific sample from 1000Genome data

Hi,

Can anyone help me access the VCF file of a specific sample from the 1000 Genomes data? I found the consensus file at (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release) but couldn't find the individual samples.

I am trying to compare the variants found in our sequencing with those in 1000 Genomes. If anyone has done a similar analysis, please let me know; I would like to discuss it with you offline.

Thanks in advance
Rama
laura replied

12-05-2012, 02:29 PM
As for chrY:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf.gz

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chrY.genome_strip_hq.20101123.svs.low_coverage.genotypes.vcf.gz

We provide all our variation data in VCF format, which serves our needs quite well. If you have a better idea for your own needs, you should be able to get all the info you need from these files to do the conversion.

Look at http://www.1000genomes.org/faq/how-d...your-vcf-files for streaming if you want to avoid downloading the entire data set.
gsgs replied

12-05-2012, 11:38 AM
Wait, I have a better idea.
You compute the genetic distance between every pair of samples: 1092^2 integers, ~4MB.
That is just the number of set bits in the logical XOR of the two 37M-bit vectors.
Then you (circularly) sort the 1092 samples so that the sum of the distances between neighbors
is minimal (a traveling salesman problem, typically easy to solve for n=1092).
Then you compute the logical XOR of every two adjacent samples, which presumably has lots of zeros.
That gives 1092 binary vectors of length 37M again, but this time with much better compression
via gzip or similar because of the many zeros.
I can write you the programs for encoding and decoding, if you want:
a self-extracting executable, easy to use, all automatic.
The size of that file would be a measure of the genetic variability of your set of 1092 samples.
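
A rough Python sketch of the idea above, on toy data. Assumptions, none of which come from the post itself: each sample is stored as a Python int used as a bit vector; the ordering uses a greedy nearest-neighbour heuristic in place of a real traveling salesman solver, so the tour is not guaranteed to be minimal; and the tour is left open, with the first vector stored as-is so that decoding is possible. All function names are hypothetical.

```python
def hamming(a, b):
    """Number of set bits in the XOR of two bit vectors stored as ints."""
    return bin(a ^ b).count("1")


def greedy_order(vectors):
    """Nearest-neighbour ordering starting at sample 0 (heuristic, not optimal)."""
    remaining = list(range(1, len(vectors)))
    order = [0]
    while remaining:
        last = vectors[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(last, vectors[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def delta_encode(vectors, order):
    """First vector as-is, then the XOR of each vector with its predecessor."""
    encoded = [vectors[order[0]]]
    for prev, cur in zip(order, order[1:]):
        encoded.append(vectors[prev] ^ vectors[cur])
    return encoded


def delta_decode(encoded, order):
    """Invert delta_encode by accumulating XORs along the same order."""
    vectors = [0] * len(order)
    current = 0
    for idx, delta in zip(order, encoded):
        current ^= delta
        vectors[idx] = current
    return vectors


# Four toy samples with 8 SNP positions each (invented values).
samples = [0b10110010, 0b10110011, 0b01001100, 0b01001110]

order = greedy_order(samples)
encoded = delta_encode(samples, order)
decoded = delta_decode(encoded, order)

print("order:", order)
print("set bits per encoded vector:", [bin(v).count("1") for v in encoded])
```

With similar samples placed next to each other, most encoded vectors have few set bits, which is what would make the later gzip step effective.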
gsgs replied

12-05-2012, 11:07 AM
No Y chromosome?

How would I pack the data?
I want the 1092*36.7M SNPs in 23 binary files, one per chromosome.
Bit i in chromosome j in file(sample) k should be set iff that SNP is present.
Then compress with gzip.
23 files, ~50MB per file, I estimate.
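
For scale, 1092 samples times 36.7M sites is about 4e10 bits, roughly 5 GB before compression. A minimal Python sketch of the packing step follows, assuming one presence bit per SNP with the most significant bit first in each byte. The bit order, the helper names and the random toy data are all assumptions for illustration, not a format defined anywhere in this thread.

```python
import gzip
import random


def pack_bits(bits):
    """Pack an iterable of 0/1 values into bytes, most significant bit first."""
    out = bytearray()
    byte = 0
    count = 0
    for bit in bits:
        byte = (byte << 1) | (1 if bit else 0)
        count += 1
        if count == 8:
            out.append(byte)
            byte, count = 0, 0
    if count:
        out.append(byte << (8 - count))  # zero-pad the final partial byte
    return bytes(out)


def unpack_bits(data, n):
    """Recover the first n bits from bytes written by pack_bits."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]


# Toy data: 80,000 sites, about 2% of them carrying the variant (invented).
random.seed(0)
bits = [1 if random.random() < 0.02 else 0 for _ in range(80000)]

packed = pack_bits(bits)
compressed = gzip.compress(packed)

print("packed bytes:", len(packed))
print("gzip bytes:  ", len(compressed))
```

How well real genotype data compresses depends on how sparse and how correlated the bits are, so the toy ratio says nothing about the ~50MB estimate.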
laura replied

12-05-2012, 10:38 AM
Do feel free to email [email protected] if you have any questions.

We also have a recent set of slides, which were presented in a tutorial at ASHG2012:

http://www.1000genomes.org/announcements/1000-genomes-tutorial-and-poster-slides-ashg2012-2012-11-09
gsgs replied

12-05-2012, 08:37 AM
Thanks.
The paper (PDF) is 10 pages ... printing...
The readme is 2 pages.
That will keep me busy for a while ...
Well, I'll probably only read and understand parts of it.

I know there is also the "hapmap" project; I managed to get
one of their tables onto my computer and analyze it.
laura replied

12-05-2012, 08:19 AM
I would strongly recommend starting with our recent paper and the analysis results associated with it

http://www.nature.com/nature/journal/v491/n7422/full/nature11632.html

ftp://ftp.1000genomes.ebi.ac.uk/vol1...lysis_results/

That is a great starting point
gsgs replied

12-05-2012, 07:55 AM
I don't know yet.
Probably compare them (#mutations, distances),
calculate the consensus and ancestor, plot the distances,
make my cloud graphics (plot amino acid mutations over nucleotide mutations)
and mutation pictures (binary arrays, sequences over positions, with a pixel
at (x,y) iff sequence x differs from the consensus at position y), etc.

Maybe this also works for "STR"s over normal mutations (these are new to me).

Calculate recombination frequency,
estimate mutation rates and what changes them,
statistics of codon usage,
search for retroviruses.

Last edited by gsgs; 12-05-2012, 11:49 AM.
laura replied

12-05-2012, 07:50 AM
What would you like to do with the data? That will very much determine the best way to approach the data set.

1000 Genomes is a large data set with a variety of different data formats, but to answer a single question you rarely need more than one sort of file.
gsgs replied

12-05-2012, 07:46 AM
Currently I estimate (wild guess) that you have ~500 complete human genomes (1500GB)
at ~10-fold coverage, but they are scattered across lots of different formats and directories.
It would take me ~10 hours to figure out how to find, decompress and
convert the data, and another ~5 hours just to download the compressed data.

I'd like to see the estimates of others.

----------new estimates-------
They have all 1092 genomes (people, "samples") sequenced at 2-6 fold coverage
(which I assume means that they have lots of small segments (~500 nucleotides
per segment?) from the genome, and those may have many errors but overlap
the genome ~2-6 fold at each position).
Critical positions, those with expected mutations, are covered more deeply (50-100 fold).
So they have a total of ~2e13 overlapping nucleotides.

The data is in "vcf" files with a complicated format, so I stay with my estimate
of ~10 hours of work to convert them into a workable format.

The data could be only ~700MB; the y-chr came in 2 files of 29MB compressed.
-------------------------------------------------
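
The ~2e13 figure in the new estimates can be checked with simple arithmetic. The sketch below assumes a genome length of about 3e9 bases per sample, which is a round number brought in for illustration and not stated in the post, and it ignores the more deeply covered regions.

```python
GENOME_SIZE = 3.0e9  # assumption: approximate bases per genome, round figure
N_SAMPLES = 1092     # number of samples quoted in the post


def total_bases(n_samples, genome_size, coverage):
    """Total sequenced bases for n_samples genomes at a given fold coverage."""
    return n_samples * genome_size * coverage


low = total_bases(N_SAMPLES, GENOME_SIZE, 2)
high = total_bases(N_SAMPLES, GENOME_SIZE, 6)

print(f"2-fold: {low:.2e} bases")
print(f"6-fold: {high:.2e} bases")
```

Under these assumptions the 2-6 fold range spans roughly 6.6e12 to 2e13 bases, so the ~2e13 estimate corresponds to the upper end of the range.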

Last edited by gsgs; 12-05-2012, 08:52 PM.
laura replied

07-02-2012, 01:51 AM
http://www.1000genomes.org/announcements/phase-1-analysis-results-including-chry-and-chrmt-variant-calls-2012-07-02

A relatively complete set of variant and other files associated with our Phase 1 analysis is now available on the FTP site.