  • ErikFas
    Member
    • Jun 2014
    • 86

    Questions about sequencing a selection library

    (Sorry for the long thread; complicated and interesting experiment that needs some explaining. Thanks for reading!)

    A couple of colleagues have recently come to the conclusion that they might have some use for sequencing in their experiments, which got me thinking how that would actually work. The types of experiments they are doing are non-bioinformatic, and I don't really know the details of them (being the only bioinformatician in an otherwise protein technology lab).

    As far as I understand it, they are (most often) interested in selecting the best binder for gene X, the binder being an antibody or an alternative scaffold (for which they know the sequence, and which can be produced in E. coli through transformation with a plasmid carrying the sequence). In order to try to get a better binder than they currently have, they randomly and/or deliberately change a number of amino acid positions in (most often) the binding site of the scaffold, which produces the library. The library is produced in E. coli, followed by several rounds of selection, where only the best binders (using various criteria for what "best" is) are kept. In the end, they have a population of cells that produce, hopefully, at least one better binder than they started with. And, hopefully, most of the binders will have converged into one or several highly similar sequences, indicating that that one is, in fact, the very best they could get. I'm sure I'm getting some of this wrong, but hopefully you get the general gist of it.

    They are interested in knowing the amino acid composition of the different positions as the library goes through the selection process, in order to be able to follow which positions/amino acids are important. For example, position X might start out as 100 % Gly (non-mutated), then change to 75/25 Gly/Asp, 50/50 Gly/Asp and finally 100 % Asp over the various selection rounds. What they have always done is to simply take around 100 E. coli colonies and send them off for Sanger sequencing, and hope that what they sampled is more or less representative (if it's not, it's not the end of the world, seeing as what actually matters is the binder at the end and whether it has better binding, as measured by downstream experiments).

    They are also interested in the proportion of each sequence in the pool of sequences in each selection round. For example, they have a binder that is 300 amino acids long, and want to know how many different variants of this sequence exist in each selection round. The idea is to follow the best binder as it increases in proportion compared to the lesser binders.

    Somebody said, "why don't we send it off for high-throughput sequencing instead?" They talked to some other bioinformatician they are working with, and it seems they're on their way. It got me thinking, though... how would you do this? I have some ideas, but would love to hear what you guys think!

    I'm thinking that you probably wouldn't have to do any kind of alignment, and that you'd only need to count raw reads. For the first part (amino acid composition), you'd need to know where the actual sequence starts, but that should be doable by just looking at the adapters and starting with position 1 straight afterwards. Then it becomes a simple counting problem, iterating over every read and adding up the amino acids and/or nucleotides as desired. Problems would arise if the sequence is longer than you can sequence. Maybe create some custom primers that can start sequencing at a specific part of the binder sequence, thus covering the whole sequence?
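
    A minimal sketch of the counting idea above, assuming every read contains a known constant flanking sequence (here the made-up `ACGT`) immediately before position 1 of the binder. The function names, the flank and the toy reads are all hypothetical:

```python
from collections import Counter

def variable_region(read, flank, length):
    """Return the `length` bases right after `flank`, or None if the
    flank is missing or the read ends before the region is complete."""
    start = read.find(flank)
    if start == -1:
        return None
    begin = start + len(flank)
    region = read[begin:begin + length]
    return region if len(region) == length else None

def position_counts(reads, flank, length):
    """Tally the bases seen at each position of the variable region."""
    counts = [Counter() for _ in range(length)]
    for read in reads:
        region = variable_region(read, flank, length)
        if region is None:
            continue  # skip reads without a usable region
        for i, base in enumerate(region):
            counts[i][base] += 1
    return counts

# Toy reads (assumption): the last one lacks the flank and is skipped.
reads = ["TTACGTGGTGCA", "ACGTGATGCA", "ACGTGGTGCAAA", "GGGGGGGG"]
counts = position_counts(reads, flank="ACGT", length=6)
print(counts[1])  # base counts at the second position of the region
```

    Reads that lack the flank or are too short are simply dropped here; a real pipeline would probably also want to tolerate sequencing errors inside the flank.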

    The second part (proportion of unique sequences) I'm not so sure about. You'd need to do some kind of alignment, but disallowing any kind of mismatches. Again, lengths longer than the reads would mess it up, I'm guessing... You'd need to create full sequences from shorter reads, but it's not like the reads come from different genes; a lot of reads are going to be really, really similar, possibly only differing in 1-3 nucleotides, depending on the design of the binder library. This, I feel, is a more difficult problem, but maybe there's a simple solution I'm just not seeing?
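
    If each read (or merged read pair) does cover the whole region of interest, the "alignment without mismatches" reduces to exact string matching. A rough sketch under that assumption, with hypothetical names and toy data:

```python
from collections import Counter

def variant_proportions(sequences):
    """Map each distinct sequence to its share of the pool,
    most abundant first. Only exact matches are grouped together."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.most_common()}

# Toy selection round (assumption): four reads, two distinct variants.
round_1 = ["GGTGCA", "GGTGCA", "GATGCA", "GGTGCA"]
props = variant_proportions(round_1)
print(props)
```

    Running this once per selection round and comparing the proportions would show a variant rising or falling. Note that exact matching treats every sequencing error as a new variant, so low-count sequences should be read with caution.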
    Last edited by ErikFas; 11-17-2016, 12:27 AM.
  • HESmith
    Senior Member
    • Oct 2009
    • 512

    #2
    The experiment is feasible but probably not ideal due to technical limitations. The 300 amino acid gene is 900bp long, which exceeds the read length of the most common platforms, but the longer-read platforms have high error rates that make them unsuitable for variant analysis. So the best option would be to sequence the gene as three 300bp amplicons, using paired-end 300bp sequencing for error correction (to detect low-frequency variants). But you would lose connectivity information between the amplicons (which may be important if distal variants are co-dependent) and, given the nearly identical sequences, there's no easy way to resolve that problem. So, all of the analyses would be at the amplicon (not full gene) level, although you'll be able to make inferences based on relative frequencies (which you may decide to validate by limited Sanger sequencing).
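
    To illustrate the error-correction idea, here is a simplified sketch that merges a fully overlapping read pair (as with 2x300 bp reads over a 300 bp amplicon). It assumes both reads have equal length, that read 2 is given as sequenced (reverse strand), and that qualities are lists of Phred scores. The function names are made up, and in practice a dedicated read-merging tool would be used:

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_full_overlap(read1, qual1, read2, qual2):
    """Merge a fully overlapping pair into one consensus sequence.
    Agreeing bases are kept; at a disagreement the higher-quality base
    wins, and a tie in quality yields 'N'. Returns None if the lengths
    differ (partial overlap is not handled in this sketch)."""
    if len(read1) != len(read2):
        return None
    r2 = reverse_complement(read2)
    q2 = qual2[::-1]  # qualities follow the read when it is reversed
    merged = []
    for b1, s1, b2, s2 in zip(read1, qual1, r2, q2):
        if b1 == b2:
            merged.append(b1)
        elif s1 == s2:
            merged.append("N")
        else:
            merged.append(b1 if s1 > s2 else b2)
    return "".join(merged)

# Toy pair (assumption): read 1 has a low-quality error at index 3.
read1, qual1 = "ACGTTA", [30, 30, 30, 10, 30, 30]
read2, qual2 = reverse_complement("ACGATA"), [35] * 6
merged = merge_full_overlap(read1, qual1, read2, qual2)
print(merged)
```

    A stricter alternative is to discard any pair that disagrees anywhere, which costs reads but avoids trusting either base.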

    For the proportions of unique sequences, a simple string frequency counter would suffice. For amino acid analysis, you'd need to translate the sequences b/c of degeneracy in the genetic code. Then, it would be trivial to count the frequency of each amino acid at each position. But some changes are likely to be interdependent (even within an amplicon), so it would probably be more useful to discriminate haplotypes (perhaps for only the most abundant subset of variants).
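
    A sketch of the translation and per-position amino acid count, assuming in-frame sequences of equal length and the standard genetic code. The helper names and toy sequences are hypothetical:

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

def translate(seq):
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def amino_acid_frequencies(proteins):
    """Per-position amino acid frequencies over equal-length proteins."""
    if not proteins:
        return []
    length = min(len(p) for p in proteins)
    freqs = []
    for i in range(length):
        column = Counter(p[i] for p in proteins)
        total = sum(column.values())
        freqs.append({aa: n / total for aa, n in column.items()})
    return freqs

# Toy reads (assumption): GGT and GGC both encode Gly, GAT encodes Asp.
reads = ["GGTGCA", "GGCGCA", "GATGCA"]
proteins = [translate(r) for r in reads]
freqs = amino_acid_frequencies(proteins)
haplotypes = Counter(proteins)  # whole-protein variants, not single positions
print(freqs[0], haplotypes)
```

    The three toy reads are distinct at the nucleotide level but collapse to two protein variants, which is the degeneracy point made above. Counting whole translated sequences (`haplotypes`) rather than single positions keeps interdependent changes within an amplicon together.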


    • ErikFas
      Member
      • Jun 2014
      • 86

      #3
      Thank you for the response! What would be the longest gene in base pairs you feel could be sequenced, then? The platform that is being discussed gives 350 bp reads, if I heard them correctly.


      • HESmith
        Senior Member
        • Oct 2009
        • 512

        #4
        Current sequencer specs can be found here. But you'll need overlapping paired-end data for error correction, which means 300bp max on the MiSeq. Longer amplicons are possible with partial read overlap, at the cost of increased errors in the non-overlapping ends.

        Since the instrument will produce MUCH more data than you'll need, you may be able to recover some haplotype information from overlapping amplicons (e.g., 1-300bp, 150-450, 300-600, 450-750, and 600-900). The only added expense is library construction, which is minimal (primers for PCR). But my guess is that their utility will be limited, given the sequence similarity.
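
        The overlapping layout can be written down as a small helper. This is only a sketch with a hypothetical function name; coordinates are 0-based and half-open, so (0, 300) corresponds to bases 1-300:

```python
def tile_amplicons(gene_length, amplicon_length=300, step=150):
    """Return (start, end) windows that cover the gene with overlap.
    A final window is added flush with the gene end if the regular
    steps would leave the last bases uncovered."""
    if gene_length <= amplicon_length:
        return [(0, gene_length)]
    starts = list(range(0, gene_length - amplicon_length + 1, step))
    if starts[-1] + amplicon_length < gene_length:
        starts.append(gene_length - amplicon_length)
    return [(s, s + amplicon_length) for s in starts]

tiles = tile_amplicons(900)
print(tiles)
```

        With the default 300 bp amplicons and a 150 bp step, each interior base is covered by two amplicons, which is what allows some linkage between neighbouring variants to be recovered.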
        Last edited by HESmith; 11-17-2016, 06:11 AM.


        • SNPsaurus
          Registered Vendor
          • May 2013
          • 525

          #5
          A guy in my lab space (Jim Stapleton, he is an independent researcher) has a long pseudo-molecule approach that might be what you want:
          Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.


          Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

          He is using it for exactly what you describe, to get full haplotypes of variants too long for existing read lengths with high accuracy. I don't know if he wants his current e-mail posted on a web site, so message me if you want to follow up.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com


          • HESmith
            Senior Member
            • Oct 2009
            • 512

            #6
            The approach recommended by @SNPsaurus is conceptually similar to a low-throughput Moleculo-type library, and is definitely applicable for the in silico assembly of longer (~10^4 bp) fragments. However, it's unclear how useful it would be for the OP's application. The method requires unique 5' and 3' barcodes for each clone to be sequenced, which is a practical limit on the number of clones to screen. The scale of that approach is not significantly greater than the existing method of ~100 Sanger-sequenced clones, and the latter is undoubtedly cheaper and easier to analyze computationally.


            • SNPsaurus
              Registered Vendor
              • May 2013
              • 525

              #7
              The difference between a low-throughput Moleculo library and the method I linked to is that each long DNA molecule is tagged by a randomer which is then copied onto the short derivative fragments needed for sequencing on Illumina. Jim sequences libraries of >100,000 long DNA molecules and gets the full haplotype of each, so it seems more suitable for assessing the presence of different variants in a complex library when those variants are separated by moderately long distances.
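
              The general barcode-grouping idea can be illustrated as follows. This is not the published pipeline: it assumes, purely for illustration, that each short read starts with a fixed-length randomer copied from its parent molecule, and the function name and toy reads are made up:

```python
from collections import defaultdict

def group_by_barcode(reads, barcode_length):
    """Bin short reads by their leading barcode so that reads derived
    from the same long molecule end up together. The barcode is
    stripped from the stored fragment."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) <= barcode_length:
            continue  # nothing left after removing the barcode
        groups[read[:barcode_length]].append(read[barcode_length:])
    return dict(groups)

# Toy reads (assumption): 4-base barcodes AAAA and CCCC.
reads = ["AAAAGGTGCA", "CCCCGATGCA", "AAAATGCATT", "AAAA"]
groups = group_by_barcode(reads, barcode_length=4)
print(groups)
```

              Each group would then be assembled separately into one long synthetic read, so near-identical variants from different molecules are never mixed during assembly.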


              • HESmith
                Senior Member
                • Oct 2009
                • 512

                #8
                By conceptually similar to Moleculo, I meant that the short reads derived from a single long fragment are identified by the presence of a unique barcode/index. But I can see how this method scales much better than Moleculo, in that the 5' and 3' barcodes are randomly ligated and the matching pairs determined by sequencing. I also like the mate-pair-style fragmentation and circularization to randomize the flanking sequences - clever. Thanks for the reference and clarification.
