This topic is closed.
This is a sticky topic.

  • rama
    replied
    Laura,

    What should I specify if I don't have a particular region to look at and want to get all genome-wide variants?

    Thanks so much for your kind help.



  • laura
    replied
    That is correct



  • rama
    replied
    Laura,

    Thanks very much for your reply. I am guessing this is the example for getting the VCF of a single sample:

    tabix -h ftp://ftp-trace.ncbi.nih.gov/1000gen...804/ALL.2of4in... 17:1471000-1472000 | perl /nfs/1000g-work/G1K/work/bin/vcftools/perl/vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
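To make the `vcf-subset -c HG00098` step of the pipeline above more concrete, here is a minimal Python sketch of the column selection it performs. It is not the vcftools implementation: the function name `subset_vcf` and the two example records are made up for illustration, and the sketch only keeps the nine fixed VCF columns plus the requested sample column, without dropping rows where that sample matches the reference.

```python
def subset_vcf(lines, sample):
    """Yield VCF lines keeping the 9 fixed columns plus one sample column."""
    keep = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            # Sample names start at column 10; ValueError if the sample is absent.
            keep = 9 + header[9:].index(sample)
            yield "\t".join(header[:9] + [header[keep]])
        elif line:
            if keep is None:
                raise ValueError("data line found before the #CHROM header")
            fields = line.split("\t")
            yield "\t".join(fields[:9] + [fields[keep]])


# Tiny made-up VCF fragment for illustration only (not real 1000 Genomes records).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00098",
    "17\t1471100\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1",
    "17\t1471200\t.\tC\tT\t100\tPASS\t.\tGT\t1|1\t0|0",
]

subset = list(subset_vcf(EXAMPLE_VCF, "HG00098"))
for row in subset:
    print(row)
```

In the real pipeline, tabix does the region lookup against the indexed remote file and bgzip recompresses the result; this sketch covers only the middle step.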



  • laura
    replied
    You should be able to get this info from our VCF files using a combination of tabix and the vcftools vcf-subset script, as described in our FAQ.

    1000genomes.org is your first and best source for all of the information you’re looking for. From general topics to more of what you would expect to find here, 1000genomes.org has it all. We hope you find what you are searching for!



  • rama
    replied
    vcf file of specific sample from 1000Genome data

    Hi,

    Can anyone help me access the VCF file of a specific sample from the 1000 Genomes data? I found the consensus file at (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release) but couldn't find the individual samples.

    I am trying to compare the variants found in our sequencing with those from 1000 Genomes. If anyone has done a similar analysis, please let me know; I would like to discuss it with you offline.

    Thanks in advance
    Rama



  • laura
    replied
    As far as chrY





    We provide all our variation data in VCF format, which serves our needs quite well. If you have a better idea for your own needs, you should be able to get all the info you need from these files to do the conversion.

    Look at http://www.1000genomes.org/faq/how-d...your-vcf-files for streaming if you want to avoid downloading the entire data set



  • gsgs
    replied
    wait, I have a better idea.
    You compute the genetical distance between every pair of samples: 1092^2 integers, about 4MB.
    Just the number of set bits in the logical xor of the two 37M-bit-vectors.
    Then you (circular) sort the 1092 samples so the sum of the distances between two neighbors
    is minimal (traveling salesman problem, typically easy to solve for n=1092)
    Then you compute the logical xors of any two adjacent samples, which presumably has lots of zeros.
    1092 binary vectors of length 37M again, but this time with much better compression
    via gzip or such because of the many zeros.
    I can write you the programs for encoding and decoding, if you want.
    Self-expanding executable, easy to use, all automatic.
    The size of that file would be a measure of the genetical variability of your set of 1092 samples.
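The scheme described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the poster's program: bit vectors are stored as Python ints, the four 8-bit samples are made up, and the ordering uses a simple nearest-neighbour heuristic rather than an exact travelling-salesman solution. One detail is added that the post leaves implicit: the first sample has to be stored as-is so that the XOR chain can be decoded.

```python
def hamming(a, b):
    """Number of positions where two bit vectors (stored as ints) differ."""
    return bin(a ^ b).count("1")


def distance_matrix(samples):
    """All pairwise distances: n^2 small integers."""
    n = len(samples)
    return [[hamming(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def greedy_order(dist):
    """Nearest-neighbour tour: a heuristic, not an optimal TSP solution."""
    order = [0]
    remaining = set(range(1, len(dist)))
    while remaining:
        last = order[-1]
        # Break distance ties by the lower sample index for reproducibility.
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def delta_encode(samples, order):
    """First sample verbatim, then the XOR of each sample with its predecessor."""
    ordered = [samples[i] for i in order]
    return [ordered[0]] + [a ^ b for a, b in zip(ordered, ordered[1:])]


def delta_decode(deltas):
    """Undo delta_encode by accumulating XORs along the chain."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] ^ d)
    return out


# Toy 8-SNP vectors (made up) standing in for the 37M-bit vectors.
samples = [0b10110010, 0b00001111, 0b10110011, 0b00001101]
dist = distance_matrix(samples)
order = greedy_order(dist)
deltas = delta_encode(samples, order)
print(order, [bin(d).count("1") for d in deltas[1:]])
```

The fewer set bits the delta vectors carry, the better a general-purpose compressor such as gzip can be expected to do on them; the sample order would also need to be stored alongside the deltas.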



  • gsgs
    replied
    no Y-chromosome ?

    how would I pack the data ?
    I want the 1092*36.7M SNPs in 23 binary files, one per chromosome.
    Bit i in chromosome j in file(sample) k should be set, iff that SNP is present.
    Then compressed with gzip.
    23 files, ~50MB per file, I estimate
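As a rough sketch of the packing described above, the following Python packs one sample's presence flags for one chromosome into bytes and gzips them. The helper names `pack_bits` and `unpack_bits` and the ten example flags are hypothetical; the bit order (most significant bit first) is an arbitrary choice, and the file layout across samples and chromosomes is left open. The last lines work out the uncompressed size of 1092 samples at 36.7M bits each.

```python
import gzip


def pack_bits(flags):
    """Pack an iterable of 0/1 flags into bytes, most significant bit first."""
    out = bytearray()
    byte = 0
    count = 0
    for flag in flags:
        byte = (byte << 1) | (1 if flag else 0)
        count += 1
        if count == 8:
            out.append(byte)
            byte, count = 0, 0
    if count:
        out.append(byte << (8 - count))  # pad the final byte with zero bits
    return bytes(out)


def unpack_bits(data, n):
    """Inverse of pack_bits: recover the first n flags."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]


# Made-up presence flags for one sample on one chromosome.
flags = [0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
packed = pack_bits(flags)
compressed = gzip.compress(packed)

# Raw size of the full set before compression: 1092 samples x 36.7M bits.
raw_bytes = 1092 * 36_700_000 // 8
print(packed.hex(), raw_bytes)
```

The raw total comes to about 5 GB, so reaching 23 files of roughly 50MB each would require gzip to shrink the data by a factor of about four; whether that holds depends on how sparse the bit vectors are.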



  • laura
    replied
    do feel free to email [email protected] if you have any questions

    We do also have a recent set of slides which were presented in a tutorial at ASHG2012




  • gsgs
    replied
    thanks.
    10 pages the paper (pdf) ... printing...
    2 pages the readme
    that will keep me busy for a while ...
    well, I'll probably only read and understand parts of it

    I know, there is also the "hapmap" project, I managed to get
    one of their tables into computer and analyze



  • laura
    replied
    I would strongly recommend starting with our recent paper and the analysis results associated with it



    ftp://ftp.1000genomes.ebi.ac.uk/vol1...lysis_results/

    That is a great starting point



  • gsgs
    replied
    I don't know yet.
    Probably compare them, #mutations,distances
    calculate the consensus,ancestor, plot the distances,
    make my cloud-graphics(plot amino acid mutations over nucleotide mutations),
    and mutation pictures(binary arrays,sequences over positions,pixel
    at (x,y) iff x differs from consensus at position y) etc.

    maybe this also works for "STR"s over normal mutations (these are new to me)

    calculate recombination frequency
    estimate mutation rates and what changes them
    statistics of codon-usage
    search for retrovirus
    Last edited by gsgs; 12-05-2012, 11:49 AM.
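Two of the ideas in the list above, computing a consensus and drawing a "mutation picture" with a pixel at (x,y) iff sequence x differs from the consensus at position y, can be sketched as follows. This is a minimal illustration that assumes the sequences are already aligned and of equal length; the four short sequences and the function names are made up.

```python
from collections import Counter


def consensus(seqs):
    """Most common character at each position; ties go to the first one seen."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def mutation_picture(seqs, cons):
    """picture[x][y] is 1 iff sequence x differs from the consensus at position y."""
    return [[int(c != k) for c, k in zip(seq, cons)] for seq in seqs]


seqs = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC"]  # made-up aligned sequences
cons = consensus(seqs)
picture = mutation_picture(seqs, cons)
mutations = [sum(row) for row in picture]  # number of differences per sequence

for row in picture:
    print("".join("#" if bit else "." for bit in row))
```

The per-sequence counts in `mutations` are the "#mutations" from the list, and the same 0/1 rows could serve as the bit vectors for pairwise distances.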



  • laura
    replied
    What would you like to do with the data? That will very much determine the best way to approach the data set.

    1000 Genomes is a large data set with a variety of different data formats, but to answer a single question you rarely need more than one sort of file.



  • gsgs
    replied
    currently I estimate (wild guess) you have ~500 complete human genomes (1500GB)
    at ~10fold coverage but they are scattered in lots of different formats and directories
    and it would take me ~10 hours to figure out how to find the data and decompress and
    convert it and another ~5 hours to just download the compressed data

    I'd like to see the estimates of others

    ----------new estimates-------
    they have all 1092 genomes(people,"samples") sequenced at 2-6 fold coverage
    (which I assume means that they have lots of small segments (~500 nucleotides
    per segment ?) from the genome and those may have many errors but overlap
    the genome at ~2-6 fold at each position)
    critical positions, those with expected mutations overlap more often (50-100 fold)
    So they have a total of ~2e13 overlapping nucleotides

    the data is in "vcf" files with complicated format, so I stay with my estimate
    of ~10hours work to convert them into a workable format.

    The data could be ~700MB only, the y-chr came in 2 files of 29MB compressed
    -------------------------------------------------
    Last edited by gsgs; 12-05-2012, 08:52 PM.
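The ~2e13 figure in the new estimates above can be checked with a short calculation. The genome length of about 3.1e9 bases is an assumption brought in here (the post does not state it), and the 50-100 fold regions are ignored.

```python
GENOME_BP = 3.1e9  # assumed haploid human genome length, not stated in the thread
SAMPLES = 1092     # number of genomes quoted in the post


def total_sequenced_bases(samples, genome_bp, coverage):
    """Total bases read = samples x genome length x fold coverage."""
    return samples * genome_bp * coverage


low = total_sequenced_bases(SAMPLES, GENOME_BP, 2)
high = total_sequenced_bases(SAMPLES, GENOME_BP, 6)
print(f"{low:.1e} to {high:.1e} bases")
```

Under these assumptions the ~2e13 estimate corresponds to the upper, 6-fold end of the quoted range.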



  • laura
    replied


    A relatively complete set of variant and other files associated with our Phase 1 analysis is now available on the ftp site.

