**GenoMax** · 09-20-2014, 12:35 PM

Here are some Complete Genomics genomes from PGP: https://my.pgp-hms.org/public_geneti...&commit=Search

**lh3** · 09-20-2014, 07:40 PM

The best so far is the CHM1 PacBio assembly, but I don't know if it has been publicly released yet. NA12878 also has a PacBio assembly and public Moleculo data (from the 1000g ftp). These will be useful for investigating hard regions.

**GenoMax** · 09-20-2014, 10:21 PM

CHM1 PacBio data has been released: http://blog.pacificbiosciences.com/2...erage-for.html

**ddoopus** · 09-24-2014, 09:47 PM

I can't find the PacBio assembly of NA12878. Do you know where it is
available?

From what I can tell, all of the genomes from PGP are sequenced with
Complete Genomics, which I thought had a relatively short read length. The
personal genome VCFs for hg19 are on UCSC. I don't understand how they are
able to call variants in repeat regions that standard whole-genome
Illumina 100bp reads cannot disambiguate. Are these variants possibly
the result of liftover errors from hg18 to hg19 for segmental duplications
that were collapsed in the older version? Can these variants be trusted
at all?

I found the PacBio assembly for CHM1, but I can only see the raw reads at the link
given, so this doesn't help that much. I found the supplementary material
for the paper on bioRxiv:

http://figshare.com/articles/CHM1_Single_Haplotype_Assembly_Supplementary_Material/1091429

and CHM1_to_GRCh37_lite_snvs.site_filtered.pass.vcf is the only file that
looks relevant, but the het:hom ratio of that VCF is 0.04, which looks
suspect. Is there a different resource available that does not show
this issue?

Any other suggestions would be greatly appreciated. It would be great to
have hg19 VCFs with variants in these regions that can be trusted.

Thanks!

**lh3** · 09-25-2014, 11:27 AM

PacBio assembly of CHM1 is here:

PacBio Corrected Reads (PBcR) Pipeline

http://www.cbcb.umd.edu/software/PBcR/MHAP/

It is different from the version I was looking at, but I believe it should be equally good. The NA12878 PacBio assembly has not been released yet.

CHM1 is a haploid sample, so a very low het:hom ratio is expected.
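For anyone who wants to reproduce the het:hom check on their own file, here is a minimal sketch in plain Python. It assumes the VCF carries a `GT` entry in the FORMAT column and that calls are written as diploid genotypes (for example `0/1` or `1/1`); the function name `het_hom_ratio` and the `sample_index` parameter are made up for illustration and are not from any tool mentioned in this thread.

```python
def het_hom_ratio(vcf_lines, sample_index=0):
    """Count het and hom-alt genotypes for one sample; return (het, hom_alt, ratio)."""
    het = hom_alt = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # sites-only record, no genotype columns
        fmt_keys = fields[8].split(":")
        if "GT" not in fmt_keys:
            continue
        gt = fields[9 + sample_index].split(":")[fmt_keys.index("GT")]
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotype strings
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1  # hom-ref (0/0) is counted in neither bucket
    ratio = het / hom_alt if hom_alt else float("nan")
    return het, hom_alt, ratio


# Toy records (hypothetical data, not taken from the CHM1 VCF).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t1/1",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1",
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/1",
    "1\t400\t.\tT\tC\t50\tPASS\t.\tGT\t1|1",
    "1\t500\t.\tA\tC\t50\tPASS\t.\tGT\t0/0",
]
het, hom_alt, ratio = het_hom_ratio(toy_vcf)
print(het, hom_alt, ratio)
```

On a haploid sample every true site has a single allele, so any remaining het calls are worth inspecting individually rather than being read as a quality problem with the whole file.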

EDIT: I should add that I am extremely impressed by the CHM1 assembly done by Jason Chin.

**ddoopus** · 09-25-2014, 11:59 AM

Ah, thanks for the clarification. It is actually mentioned directly in their bioRxiv paper, but I overlooked it.

Thanks!

**lh3** · 09-25-2014, 05:04 PM

I overlooked it, too... An author told me the link yesterday.

**Brian Bushnell** · 09-25-2014, 05:55 PM

Originally posted by ddoopus:

From what I can tell, all of the genomes from PGP are sequenced with Complete Genomics, which I thought had a relatively short read length. The personal genome VCFs for hg19 are on UCSC. I don't understand how they are able to call variants in repeat regions that standard whole-genome Illumina 100bp reads cannot disambiguate. Are these variants possibly the result of liftover errors from hg18 to hg19 for segmental duplications that were collapsed in the older version? Can these variants be trusted at all?

I have a lot of experience with Complete Genomics data (but a bad memory, so the details are slightly fuzzy). Their reads are super-short. IIRC each "read" consists of 2x10bp fragments and 2x15bp fragments, or something like that, with unknown normally-distributed distances between the pieces but ~50% of the time the distance is one specific value, like 2bp. So you get reads like:
10bp sequenced, 0-2bp unsequenced, 15bp sequenced, ~10bp unsequenced, 15bp sequenced, 0-2bp unsequenced, 10bp sequenced.
...roughly. I think some of the "readlets" were 5bp. Anyway, they are nothing like other platforms.
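To make the layout above concrete, here is a small sketch that cuts one such gapped "read" out of a reference string. The readlet lengths and gap ranges are taken from the rough recollection in this post (the 8-12bp range for the middle gap is my own stand-in for "~10bp"), so treat them as illustrative values and not as Complete Genomics' actual specification. The names `READLETS`, `GAP_RANGES` and `sample_gapped_read` are hypothetical.

```python
import random

# Illustrative values only, following the approximate description in the post.
READLETS = (10, 15, 15, 10)              # sequenced bases per readlet
GAP_RANGES = ((0, 2), (8, 12), (0, 2))   # unsequenced bases between readlets


def sample_gapped_read(reference, start, rng):
    """Return a list of (offset, sequence) readlets drawn from `reference`."""
    pos = start
    readlets = []
    for i, length in enumerate(READLETS):
        readlets.append((pos, reference[pos:pos + length]))
        pos += length
        if i < len(GAP_RANGES):
            lo, hi = GAP_RANGES[i]
            pos += rng.randint(lo, hi)  # skip the unsequenced gap
    return readlets


rng = random.Random(0)
reference = "ACGT" * 25  # 100bp toy reference
read = sample_gapped_read(reference, 5, rng)
for offset, seq in read:
    print(offset, seq)
```

Under these assumed values each read yields only 50 sequenced bases, split into four pieces spread over at most 66bp of reference, which is why nothing longer than that can be spanned directly.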

As a result, you cannot do de novo assembly with them, and I would never trust them in long repetitive regions. In my testing, they are quite accurate for calling SNPs (using CG's calls) but abysmal at indels: there is almost no concordance with indels called from 2x100bp Illumina data, or with indels that could possibly have been inherited when analyzing sequenced parents+child trios. And FYI, they do not call indels directly from the reads; they de novo reassemble the areas around suspected indels using the reads that map across them.
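A concordance check of this kind can be sketched as below. This is not the comparison I actually ran, just a minimal version: it matches indels by exact (chrom, pos, ref, alt) and reports the Jaccard index of the two call sets. It assumes both VCFs have already been normalized the same way, since the same indel can be written at different positions; the function names `indel_keys` and `concordance` are hypothetical.

```python
def indel_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every indel allele in a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):
            if alt.startswith("<") or alt in (".", "*"):
                continue  # skip symbolic, missing, or spanning-deletion alleles
            if len(ref) != len(alt):
                keys.add((chrom, int(pos), ref, alt))
    return keys


def concordance(calls_a, calls_b):
    """Jaccard index: shared calls divided by calls seen in either set."""
    union = calls_a | calls_b
    return len(calls_a & calls_b) / len(union) if union else float("nan")


# Toy call sets (hypothetical data).
cg_vcf = [
    "1\t100\t.\tAT\tA\t.\tPASS\t.",
    "1\t200\t.\tC\tCGG\t.\tPASS\t.",
    "1\t300\t.\tG\tA\t.\tPASS\t.",   # SNP, ignored
]
illumina_vcf = [
    "1\t100\t.\tAT\tA\t.\tPASS\t.",
    "1\t450\t.\tTAA\tT\t.\tPASS\t.",
]
cg_indels = indel_keys(cg_vcf)
illumina_indels = indel_keys(illumina_vcf)
print(concordance(cg_indels, illumina_indels))
```

Exact matching is deliberately strict, so a low number from this sketch should be confirmed after normalization before concluding the call sets truly disagree.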

I would not include CG genomes if you are studying 'difficult' parts of the genome that are low-complexity, repetitive, highly variable, or are interested in indels.

Available Personal Genomes