Extracting genome specific SNPs from 1000 genomes


  • Extracting genome specific SNPs from 1000 genomes

    Hello!

    I've been trying to get SNP data from the 1000 Genomes Project and have been looking at the VCF files, but I fail to understand whether these report population, individual, or total variation. I would like to download genotypes from specific genomes. I would appreciate any information.

    Cheers

  • #2
    Documentation of the format can be found here

    http://vcftools.sourceforge.net/specs.html

    The files provided by the 1000 Genomes Project generally represent all the variant sites discovered in the samples analysed. The most recent release contains a list of the samples analysed: ftp://ftp.1000genomes.ebi.ac.uk/vol1...0804.ALL.panel

    vcftools provides utilities for extracting subsets of data from a VCF file.
    The files are also indexed with tabix, which means you can stream variants from a specific part of the genome.



    • #3
      Thanks Laura!!!
      Yes indeed, I found vcftools; the sysadmin will install it soon and I will try it. In the meantime I found a way to do it with awk, and it works quite well!

      M
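An awk approach like the one mentioned above is workable because sample columns in a VCF begin at field 10, after the nine fixed columns. A minimal sketch, using a made-up toy file and sample name (NA12878) rather than anything from the real release:

```shell
# Build a tiny two-sample VCF for illustration (fields are tab-separated).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '#CHROM' POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 NA12891 \
  1 10000 rs1 A G 100 PASS . GT '0|1' '0|0' \
  1 10500 rs2 C T 90 PASS . GT '1|1' '0|1' > toy.vcf

# Find the column for the sample of interest in the #CHROM header line,
# then print CHROM, POS and that sample's genotype for every data row.
awk 'BEGIN { FS = OFS = "\t" }
     /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == "NA12878") col = i }
     !/^#/     { print $1, $2, $col }' toy.vcf
```

The header-line loop extends naturally to any subset of sample names, though for large subsets vcftools' own sample filtering is likely less error-prone.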



      • #4
        Laura-
        Sorry to jump into someone else's thread, but you seem like an expert to whom I could ask this question. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

        If I uncompress the file and run:
        vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

        then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems like "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual? Anyway, I've modified my local copy and it works, but I thought that someone perhaps closer to the 1000 Genomes project would want to know.

        Best wishes,

        Todd
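For anyone hitting the same error before the headers are fixed upstream, the local edit described above can be scripted. A hedged sketch with sed; the INFO definitions below are reconstructed from the error description in this post, not copied from the release file, so check them against your own header first:

```shell
# Reconstructed header lines missing the Number entry (illustrative only).
printf '%s\n' \
  '##INFO=<ID=EUR_R2,Type=Float,Description="R2 in EUR">' \
  '##INFO=<ID=ASN_R2,Type=Float,Description="R2 in ASN">' \
  '##INFO=<ID=AFR_R2,Type=Float,Description="R2 in AFR">' > header.vcf

# Insert "Number=1," between each field ID and its Type tag.
sed -E 's/(ID=(EUR|ASN|AFR)_R2,)/\1Number=1,/' header.vcf > header.fixed.vcf
cat header.fixed.vcf
```

Running the same substitution over the full genotypes file (on a copy) is what makes vcftools accept it.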



        • #5
          That does look to be an error in the headers.

          If you find problems like this, it is best to email [email protected] so the right people can investigate.

          Thanks for letting us know.



          • #6
            Laura-
            Sorry, I actually neglected to look under the "Project Contacts" link on the website. However, I did email Goncalo, since his email is at the bottom of the README file for the latest release. Having not heard anything back from him, I thought I should take the opportunity when I saw your message above. Another thing I noticed, though I don't know if it's expected, is that there are a number of rows that have no genotypes in any of the samples. I would expect many rows to be missing genotypes in one population or another, but not across all samples. I suppose those are variant sites that were found at BC and NCBI but did not have genotypes, since they did not perform LD-aware genotype analysis. It seems to me that those should be in the "sites" file but filtered out of the "genotypes" file. I'll put together an email and forward my thoughts to [email protected].

            Thanks!

            Todd



            • #7
              It was decided it was better for all the sites to be in both files, with variants that don't have genotypes given the ./. notation. The sites file is always meant to contain the same variants as the genotypes file, but it is provided to give those who don't need individual genotypes a smaller download (300MB versus 60GB).

              The only genotypes which should be used for imputation are those which include a prediction by BI, as these are the only sites whose genotypes were assigned in an LD-aware manner. The UMich genotyper isn't LD aware, and imputation accuracy suffers if its calls are used for this purpose.



              • #8
                all individual genotypes = 60 GB data?!

                Are you kidding me?

                60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.



                • #9
                  Tell me about it!

                  The VCF file has so much other information besides just the genotype calls that it seems a bit excessive for a public release. It's sort of like XML embedded in a table format: a header at the top, and key-value pairs embedded within columns.
                  A representative call for a single variant position in one sample looks like this:
                  0|0:3,0:3:.:-0.00,-0.90,-13.33:22.58:./.

                  To understand the format a bit better, take a look at http://www.1000genomes.org/wiki/Anal...mat-version-40

                  If someone wants just genotype calls, you can download files formatted for Beagle, MACH, and Impute, which are much smaller, but it seems to me that each of those formats leaves out some of the information that would be useful for checking allele orientation (i.e., between existing Build 36 Illumina 610k data and the release's Build 37 coordinates):

                  Beagle:
                  http://faculty.washington.edu/browni...le/beagle.html

                  MACH:
                  http://www.sph.umich.edu/csg/abecasis/MaCH

                  Impute:
                  https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
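If only the GT calls are wanted without switching to one of those formats, the GT subfield can be split out of each colon-separated sample entry directly, since the VCF spec requires GT to come first when it is present. A small sketch on a made-up one-line file (the FORMAT string and values are illustrative, not from the release):

```shell
# One toy data row: FORMAT says GT:DP:GL, so the sample entry is colon-separated.
printf '1\t10000\trs1\tA\tG\t.\tPASS\t.\tGT:DP:GL\t0|1:3:-0.10,-0.90,-13.33\n' > calls.vcf

# Split the sample column (field 10) on ":" and keep only the leading GT subfield.
awk 'BEGIN { FS = OFS = "\t" } !/^#/ { split($10, f, ":"); print $1, $2, f[1] }' calls.vcf
```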



                  • #10
                    Originally posted by genesquared View Post
                    all individual genotypes = 60 GB data?!

                    Are you kidding me?

                    60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.
                    Well, it's only 629 individuals in this instance, and it's 60GB compressed (380GB uncompressed), but you should generally be able to stream the file using a combination of tabix and/or zcat, so you never need to uncompress it fully.
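The streaming point can be illustrated without the real 60GB file: zcat decompresses to stdout, so a pipeline can count or filter records without the uncompressed VCF ever touching disk. A toy sketch:

```shell
# Make a tiny gzipped "VCF" (two header lines, two data rows).
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\n1\t10000\n1\t10500\n' | gzip > toy.vcf.gz

# Stream it: count the non-header records without writing the plain text to disk.
zcat toy.vcf.gz | grep -vc '^#'
```

The same pattern scales to the full genotypes file; tabix adds random access by region on top of this.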



                    • #11
                      Originally posted by Todd Johnson View Post
                      Laura-
                      Sorry to jump into someone else's thread, but you seem like an expert to whom I could ask this question. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

                      If I uncompress the file and run:
                      vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

                      then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems like "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual? Anyway, I've modified my local copy and it works, but I thought that someone perhaps closer to the 1000 Genomes project would want to know.

                      Best wishes,

                      Todd
                      This error should now be fixed.

                      Thanks for pointing it out.



                      • #12
                        I would like to inspect 17 individuals and about 300 SNPs in a 500 kb locus.

                        Is there any "short cut"?

                        I know their hg18 positions (but not their rs numbers).

                        Thanks in advance



                        • #13
                          Your best bet for this is to use tabix to extract the data from the released vcf files.

                          The vcf format is described here
                          http://vcftools.sourceforge.net/specs.html

                          The files themselves can be found here
                          ftp://ftp.1000genomes.ebi.ac.uk/vol1...man_variation/

                          You can use tabix http://sourceforge.net/projects/samtools/files/tabix/ to extract subsections of these files

                          e.g.

                          tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:10000-20000

