**HESmith** · 11-17-2016, 05:31 AM

The experiment is feasible but probably not ideal due to technical limitations. The 300-amino-acid gene is 900 bp long, which exceeds the read length of the most common platforms, while the longer-read platforms have high error rates that make them unsuitable for variant analysis. So the best option would be to sequence the gene as three 300 bp amplicons, using paired-end 300 bp sequencing so that the two overlapping reads correct each other's errors (which is needed to detect low-frequency variants). But you would lose connectivity information between the amplicons (which may be important if distal variants are co-dependent) and, given the nearly identical sequences, there's no easy way to resolve that problem. So all of the analyses would be at the amplicon (not full-gene) level, although you'll be able to make inferences based on relative frequencies (which you may decide to validate by limited Sanger sequencing).

For the proportions of unique sequences, a simple string frequency counter would suffice. For amino acid analysis, you'd need to translate the sequences first because of degeneracy in the genetic code (different DNA sequences can encode the same protein). Then it would be trivial to count the frequency of each amino acid at each position. But some changes are likely to be interdependent (even within an amplicon), so it would probably be more useful to discriminate haplotypes (perhaps for only the most abundant subset of variants).
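A minimal Python sketch of those counting steps, assuming the reads have already been error-corrected, trimmed to the same in-frame amplicon, and are all the same length. The function names are illustrative only, and the codon table is the standard genetic code:

```python
from collections import Counter

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def sequence_frequencies(seqs):
    """Fraction of the library represented by each unique sequence."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def residue_counts(proteins):
    """One Counter per position; assumes all proteins have the same length."""
    return [Counter(column) for column in zip(*proteins)]


reads = ["ATGGCT", "ATGGCC", "ATGGCT", "ATGTCT"]
proteins = [translate(r) for r in reads]
print(sequence_frequencies(reads))     # unique DNA sequences
print(sequence_frequencies(proteins))  # unique protein sequences (haplotypes)
print(residue_counts(proteins))        # per-position amino acid counts
```

Counting whole translated sequences, as in the second call, keeps co-occurring changes together, whereas the per-position counts discard that linkage.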

**ErikFas** · 11-17-2016, 05:52 AM

Thank you for the response! What would be the longest gene in base pairs you feel could be sequenced, then? The platform that is being discussed gives 350 bp reads, if I heard them correctly.

**HESmith** · 11-17-2016, 06:01 AM

Current sequencer specs can be found here. But you'll need overlapping paired-end data for error correction, which means 300bp max on the MiSeq. Longer amplicons are possible with partial read overlap, at the cost of increased errors in the non-overlapping ends.
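The trade-off between amplicon length and read overlap is simple arithmetic. A small sketch, assuming both paired-end reads have the same length and start at opposite ends of the amplicon (the function name is made up for illustration):

```python
def paired_end_overlap(amplicon_len, read_len):
    """Return (bases covered by both reads, bases covered by only one read)."""
    if amplicon_len > 2 * read_len:
        raise ValueError("reads do not span the amplicon; the middle is uncovered")
    # Amplicons no longer than one read are covered twice along their full length.
    overlap = min(amplicon_len, 2 * read_len - amplicon_len)
    return overlap, amplicon_len - overlap


print(paired_end_overlap(300, 300))  # fully overlapping reads
print(paired_end_overlap(450, 300))  # partial overlap
```

With 2x300 reads, a 300 bp amplicon is covered twice at every position, while a 450 bp amplicon has only its central 150 bp double-covered and 150 bp at each end read once.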

Since the instrument will produce MUCH more data than you'll need, you may be able to recover some haplotype information from overlapping amplicons (e.g., 1-300bp, 150-450, 300-600, 450-750, and 600-900). The only added expense is library construction, which is minimal (primers for PCR). But my guess is that their utility will be limited, given the sequence similarity.
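The tiling layout can be generated rather than written out by hand. A rough sketch using 1-based inclusive coordinates, so the boundaries come out as 151-450 rather than the rounded 150-450 above; the function and parameter names are assumptions for illustration:

```python
def tile_amplicons(gene_len, amplicon_len, step):
    """List (start, end) amplicon coordinates, 1-based inclusive."""
    if amplicon_len > gene_len:
        raise ValueError("amplicon is longer than the gene")
    tiles = []
    start = 1
    while start + amplicon_len - 1 <= gene_len:
        tiles.append((start, start + amplicon_len - 1))
        start += step
    # If the step does not land exactly on the 3' end, add one final tile there.
    if tiles[-1][1] < gene_len:
        tiles.append((gene_len - amplicon_len + 1, gene_len))
    return tiles


print(tile_amplicons(900, 300, 150))  # half-overlapping tiles
print(tile_amplicons(900, 300, 300))  # three non-overlapping amplicons
```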

**SNPsaurus** · 11-17-2016, 09:30 AM

A guy in my lab space (Jim Stapleton, he is an independent researcher) has a long pseudo-molecule approach that might be what you want:

Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0147229

Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.

He is using it for exactly what you describe, to get full haplotypes of variants too long for existing read lengths with high accuracy. I don't know if he wants his current e-mail posted on a web site, so message me if you want to follow up.

**HESmith** · 11-17-2016, 10:00 AM

The approach recommended by @SNPsaurus is conceptually similar to a low-throughput Moleculo-type library, and is definitely applicable for the in silico assembly of longer (~10^4 bp) fragments. However, it's unclear how useful it would be for the OP's application. The method requires unique 5' and 3' barcodes for each clone to be sequenced, which places a practical limit on the number of clones that can be screened. The scale of that approach is not significantly greater than the existing method of ~100 Sanger-sequenced clones, and the latter is undoubtedly cheaper and easier to analyze computationally.

**SNPsaurus** · 11-17-2016, 11:59 AM

The difference between a low-throughput Moleculo library and the method I linked to is that each long DNA molecule is tagged by a randomer which is then copied onto the short derivative fragments needed for sequencing on Illumina. Jim sequences libraries of >100,000 long DNA molecules and gets the full haplotype of each, so it seems more suitable for assessing the presence of different variants in a complex library when those variants are separated by moderately long distances.
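The core idea of tagging can be illustrated with a toy grouping step. This is only a sketch, not the published pipeline: it assumes, hypothetically, that each short read begins with the randomer copied from its parent molecule, and it skips barcode error handling and the assembly of each group into a synthetic long read.

```python
from collections import defaultdict


def group_by_barcode(reads, barcode_len):
    """Bin short reads by their leading randomer (assumed read layout)."""
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        groups[barcode].append(insert)
    return dict(groups)


reads = ["AAAAGGCT", "AAAATTCA", "CCCCGGCT"]
groups = group_by_barcode(reads, 4)
print(len(groups))     # number of distinct tagged molecules observed
print(groups["AAAA"])  # fragments that came from the same parent molecule
```

Each bin would then be assembled separately, which is what preserves the haplotype of the original long molecule.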

**HESmith** · 11-17-2016, 12:34 PM

By conceptually similar to Moleculo, I meant that the short reads derived from a single long fragment are identified by the presence of a unique barcode/index. But I can see how this method scales much better than Moleculo, in that the 5' and 3' barcodes are randomly ligated and the matching pairs determined by sequencing. I also like the mate-pair-style fragmentation and circularization to randomize the flanking sequences - clever. Thanks for the reference and clarification.

Questions about sequencing a selection library
