
  • 1000 Genomes Data / Exon targeted


    I have a question concerning the 1000 Genomes data. On the FTP site they have low-coverage and exon-targeted data.
    I assume that the exon-targeted files contain sequence information only for the exons, but at higher coverage. Is that correct?
    But why do the file sizes differ so much between individuals (exon-targeted, same chromosome)?


  • #2

    I think the 1000 Genomes Project has enriched and sequenced only 1000 genes in the pilot data. I am trying to find out which 1000 genes they enriched, but this simple piece of information is frustratingly hard to find.

    Can anyone else help?


    • #3

      There is a BED file of the targeted regions and a gene list. Both are labeled P3.


      • #4
        There are three pilots

        Also, there are three pilot projects.

        P1 is low-coverage whole-genome sequencing
        P2 is sequencing of parent/child trios
        P3 is sequence capture of the coding exons of 1000 genes


        • #5

          Wonderful. That's just what I wanted.

          Thanks Adamdeluca


          • #6
            OK summarising...

            Pilot 1 = 2-4X coverage, 180 samples, whole-genome sequencing
            Pilot 2 = 20-60X coverage, 6 samples (2 trios), whole-genome sequencing
            Pilot 3 = 50X coverage, 900 samples, 1000 genes sequenced
            Main project = 4X coverage, 2000 samples, whole-genome sequencing

            But the FTP data is most unwieldy, with separate VCF files per population listing every genotype for every individual. Which raises a question:

            Is there somewhere that summarises the allele frequencies for SNPs across all the 1KG pilots and combines the populations?
            E.g. in the pilot 3 data for the CEU population, SNP rs61733845 has 122 alleles called, but if you look up that SNP in dbSNP there is no frequency data.
            Last edited by BetterPrimate; 08-03-2010, 11:41 PM.


            • #7

              I was also looking for an overall VCF file, but I could only find genotypes per population per pilot study.

              A single overall file for the whole project would be nice.


              • #8
                At the moment the project FTP site doesn't provide overall files for all the variant calls

                You can get the vcf files for each sub population used in each pilot from

                low coverage represents 180 individuals sequenced to 2-4x
                trios represents 2 family trios sequenced to 30x+
                exon represents ~700 individuals sequenced for 1000 genes

                You could use the vcftools package on SourceForge to get your frequencies for the whole set

                The Perl code that is part of this package will merge VCF files for you

                and the C++ code will provide frequency reports
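
For anyone who wants to see what the pooling step amounts to, here is a minimal Python sketch. It is not the vcftools implementation: the function names `count_alleles` and `pooled_frequencies` are made up for illustration, and the example records are invented. It assumes uncompressed, tab-separated VCF lines with a GT entry in the FORMAT column, keys variants by chromosome, position, REF and ALT, and lumps all non-reference alleles together.

```python
import re
from collections import defaultdict


def count_alleles(vcf_lines, counts=None):
    """Add non-reference and called allele counts from one VCF to `counts`.

    `counts` maps (chrom, pos, ref, alt) -> [non_ref_alleles, called_alleles].
    Pass the same dict again to accumulate over several per-population files.
    """
    if counts is None:
        counts = defaultdict(lambda: [0, 0])
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        key = (chrom, int(pos), ref, alt)
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            for allele in re.split(r"[/|]", genotype):
                if allele == ".":
                    continue  # missing call, not counted
                counts[key][1] += 1
                if allele != "0":
                    counts[key][0] += 1  # assumption: any ALT counts as non-ref
    return counts


def pooled_frequencies(counts):
    """Non-reference allele frequency per variant over everything counted."""
    return {key: alt / called for key, (alt, called) in counts.items() if called}


# Invented example records standing in for two per-population files.
ceu_lines = ["1\t100\trs_example\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1"]
yri_lines = ["1\t100\trs_example\tA\tG\t.\tPASS\t.\tGT:DP\t0|0:3\t./.:0"]

counts = count_alleles(ceu_lines)
counts = count_alleles(yri_lines, counts)
freqs = pooled_frequencies(counts)
```

Here `freqs` holds one pooled frequency per site; calling `count_alleles` on a single population's lines gives the per-population counts instead.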


                • #9

                  How can I access the data from the 2000 individuals sequenced at 4x coverage?



                  • #10
                    Not all 2500 individuals have been sequenced yet.

                    So far we have sequence data for 653 samples; 552 have more than 10 GB of sequence data available in FASTQ format

                    We have alignments for 539 individuals in BAM format

                    You can get all this data from our ftp site

                    Our website explains how our ftp site is structured



                    • #11
                      Did you also call variants from these 653 samples?

                      By the way, I have a question about how you called variants in the pilot 1 study. Did I understand correctly that you pooled all the low-coverage sequence data and called the variants from this combined data set? Don't you lose very rare variants by doing this?


                      • #12
                        There aren't any variants released yet for the main project data.

                        We had a release of variants on the pilot data in July which you can find here


                        As far as the variant calling goes, most of the low-coverage individuals have only between 2 and 4x coverage, so there is insufficient data to call most variants from one individual alone; pooling the data gains us power. The low-coverage approach is less powerful for rare variants
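
To put rough numbers on why pooling helps, here is a toy calculation in Python. It is only a sketch under loud assumptions and not the project's variant caller: read counts are modelled as Poisson, each heterozygous carrier contributes on average half of its depth as reads showing the variant allele, and a site counts as "detectable" once at least 2 such reads are seen across the pool. The function names and the 2-read threshold are hypothetical.

```python
import math


def prob_at_least(k_min, mean):
    """P(X >= k_min) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(k_min))
    return 1.0 - below


def detection_power(depth, carriers, min_alt_reads=2):
    """Chance of seeing at least `min_alt_reads` variant reads in the pool.

    Assumption: every carrier is heterozygous and sequenced to `depth`,
    so the pool contributes carriers * depth / 2 variant reads on average.
    """
    return prob_at_least(min_alt_reads, carriers * depth / 2.0)


single = detection_power(depth=4, carriers=1)  # one 4x individual on its own
pooled = detection_power(depth=4, carriers=5)  # five 4x carriers pooled
```

Under this toy model a single 4x carrier gives a bit under 60% power, while five pooled carriers give well above 99%. A variant carried by only one person in the sample stays at the single-carrier figure, which is the rare-variant weakness mentioned above.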


                        • #13
                          Can you please tell me how many individuals are included in the last release?

                          So with this approach you are only able to call common variants? But isn't it a goal of the project to detect variants with a frequency of less than 1%?


                          • #14
                            If you look at the alignment index and sequence index files on the ftp site you can see how many individuals are in each release.


                            With 2500 individuals we can get 95% of alleles with 1% MAF in the accessible genome. We will find some variants with lower MAF but we won't find all of them.

                            This project is designed to find all shared variation within the population rather than very rare variants

                            Another phase of the project is going to do exome sequencing of the 2500 individuals, and this will hopefully get variants down to 0.1% in these regions, as we will have higher coverage of those regions
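
As a back-of-the-envelope check on these frequencies, the Python sketch below asks only whether an allele is carried by anyone in the sample at all. That is a precondition for finding it, not the detection figure quoted above. It assumes chromosomes are sampled independently at the given allele frequency, and the function names are hypothetical.

```python
def expected_copies(freq, individuals):
    """Expected number of copies of an allele among diploid individuals."""
    return 2 * individuals * freq


def prob_present(freq, individuals):
    """Chance that at least one of the 2 * individuals chromosomes carries it."""
    return 1.0 - (1.0 - freq) ** (2 * individuals)


copies_1pct = expected_copies(0.01, 2500)    # allele at 1% frequency
copies_01pct = expected_copies(0.001, 2500)  # allele at 0.1% frequency
present_01pct = prob_present(0.001, 2500)
```

Under these assumptions a 1% allele is expected about 50 times among 2500 people and a 0.1% allele about 5 times, with roughly a 99% chance of being present at least once. Whether such an allele is actually called then depends mainly on how deeply its few carriers are covered.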


                            • #15
                              Allele frequencies in subpopulations 628 individuals


                              I am aware that there is a VCF file "ALL.2of4intersection.20100804.sites.vcf.gz" on the FTP site from which you can retrieve allele frequencies for SNPs from the low-coverage data of 628 individuals. This is pooled across all subpopulations.

                              Is there a way I can get the allele frequencies for the same SNPs in the individual subpopulations?

