  • Pooling Prior to Size-Selection

    Greetings everyone,

    I am a graduate student making his first foray into NGS. Briefly, I am employing a ddRAD protocol to generate my first NGS dataset. I'll be using Illumina MiSeq.

    I would ideally like to include ~48 individuals on a single Illumina MiSeq lane, each with a unique individual barcode/ID.

    I understand that performing 48 separate size-selections would probably be ill-advised and may lead to substantial drop-out at certain loci and a loss of data.

    Is it possible to anneal my Illumina adapters with unique barcodes to each of the 48 individuals separately, and then pool them in equimolar concentrations to perform a single size-selection, either with a Pippin Prep or with SPRI beads? Is this at all advisable?

    Many thanks!

  • #2
    Pooling before size selection is the recommended workflow and one of the most important aspects of GBS in general. I would suggest reviewing this paper again:

    http://journals.plos.org/plosone/art...l.pone.0037135



    • #3
      I seem to have misread the detailed protocol, which prompted this question - my apologies. Thanks for the quick reply!



      • #4
        Ah, now I see the source of my confusion!

        I'm hoping to employ a paired barcode system so that I don't need to purchase, say, 48 individual P1 adapters, each with a unique barcode. Instead I would like to use combinations of identifiers (say, 6 x 8). This way I can buy fewer oligos (I'm on a rather tight budget).

        Peterson et al. (2012) seem to employ a similar system, except the second unique identifier (referred to as an "index") is added via PCR Multiplex Primer 2. This PCR step occurs after size-selection.

        This means that, in order to employ this system, I would have to potentially perform 6 different size-selections (and 6 different PCR reactions at this step).

        What I'm wondering, however, is why not just add a second unique barcode to adapter P2, as they have with adapter P1? That way I could employ a paired barcode system and would only have to use a single PCR Multiplex Primer 2, and would only have to perform a single size-selection.

        Am I missing something crucial here? I feel as though I must be, as this seems like a much simpler method than the one proposed in this paper...surely there must be some pitfall?
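
As a rough illustration of the 6 x 8 idea above, here is a minimal sketch of how a paired-barcode sample sheet could be laid out. The adapter names are hypothetical placeholders, and putting 8 barcodes on P1 and 6 on P2 is an assumption for the example; it says nothing about whether the chemistry works.

```python
from itertools import product

# Hypothetical adapter labels; 8 on P1 and 6 on P2 is an assumed split of "6 x 8".
p1_ids = [f"P1_{i:02d}" for i in range(1, 9)]
p2_ids = [f"P2_{i:02d}" for i in range(1, 7)]

# Every (P1, P2) combination identifies exactly one individual.
pairs = list(product(p1_ids, p2_ids))
sample_sheet = {f"sample_{n:02d}": pair for n, pair in enumerate(pairs, start=1)}

# Barcoded adapters needed with pairing versus one unique P1 adapter per sample.
adapters_paired = len(p1_ids) + len(p2_ids)
adapters_single = len(pairs)

print(len(sample_sheet), "samples from", adapters_paired, "barcoded adapters")
print("versus", adapters_single, "adapters with a single barcode per sample")
```

With these assumed counts, 14 barcoded adapters cover the 48 individuals.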



        • #5
          I think having all the samples together for size selection is crucial for ddRAD. Otherwise the loci that are near the edge of the size range will be in some libraries and not others, and depending on the size range, could significantly add to the amount of missing data.

          One reason why they may have added the index during PCR is that it keeps oligo lengths shorter and would be cheaper. The adapter is a Y-adapter with regions of double-strandedness and single-strandedness; having an extra long single-stranded portion may cause problems... chaining, etc.

          As a side note, how many loci are you planning to sequence? Will a MiSeq have enough reads for your 48 samples?
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



          • #6
            As a side-side note, I noticed a little bit of the text in the ddRAD paper that is an example of a pet peeve of mine in method papers.
            "we use a double restriction enzyme (RE) digest (i.e., a restriction digest with two enzymes simultaneously) that results in at least five-fold reduction in library production cost–complete ddRADseq libraries cost ~$5 per sample, while the necessary enzymatic steps following the initial restriction digest and ligation in random shearing RAD libraries alone introduce a cost of ~$25 per library."
            I could believe that ddRAD libraries are cheaper. However, sheared RAD-Seq libraries are sheared, most typically, on pools of samples. So the proper comparison is not a single sample ddRAD library versus a single sample RAD-Seq library (and that isn't what they did here since they didn't charge the full Pippin Prep run to the single ddRAD library). It is to compare 48 or 96 or whatever samples, in which case the $25 is divided by 12 or 48 or 96.
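
To make the arithmetic concrete, here is a small sketch that spreads the quoted ~$25 per-library cost over a pool of samples. The assumption that the whole cost divides evenly across the pool is mine, for illustration only.

```python
def per_sample_cost(per_library_cost, pool_size):
    """Cost attributable to each sample when one library holds pool_size samples."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return per_library_cost / pool_size

LIBRARY_COST = 25.0  # ~$25 per library, the figure quoted from the paper

for pool_size in (1, 12, 48, 96):
    cost = per_sample_cost(LIBRARY_COST, pool_size)
    print(f"{pool_size:>2} samples per library: ${cost:.2f} per sample")
```

At 48 samples per library the post-ligation steps come to roughly $0.52 per sample under this assumption.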

            I see this all the time in method papers. "Previous method X requires 500 ng of DNA and ours needs just 100 ng" when the previous method just happened to use 500 ng, not "require" it. Properly comparing methods is difficult!



            • #7
              Originally posted by SNPsaurus
              I think having all the samples together for size selection is crucial for ddRAD. Otherwise the loci that are near the edge of the size range will be in some libraries and not others, and depending on the size range, could significantly add to the amount of missing data.

              One reason why they may have added the index during PCR is that it keeps oligo lengths shorter and would be cheaper. The adapter is a Y-adapter with regions of double-strandedness and single-strandedness; having an extra long single-stranded portion may cause problems... chaining, etc.

              As a side note, how many loci are you planning to sequence? Will a MiSeq have enough reads for your 48 samples?
              Thanks for your reply!

              I was thinking of having the second barcode adjacent to the fragment (much like what we see on the P1 adapter). I'm just concerned that I'm missing some aspect of the chemistry and/or process which will prevent this from being feasible.

              If my P2 adapter includes a secondary barcode of three unique bases, there should be six permutations...I'm not sure how much three extra bp will cost, but I'm getting to the point where I can start to figure out how much I'll need and what my expenses will actually be. I'm also not sure whether 3 bp is long enough to be reliably located and identified after sequencing...so there's that.

              With respect to your question:

              I'm working with an organism (a shark) for which no genomic resources exist, and which has a genome of approximately 8 billion bp (rough estimate). I've downloaded the whale shark draft genome and have calculated GC content (which, incidentally, appears to be very, very high...) to get a rough idea of what I'm dealing with and I've run about 300 simulations under various conditions and with various enzyme pairs using SimRAD.

              If these simulations are any indication (I'm skeptical, although I really think SimRAD is a pretty neat tool!), then depending on my GC content I can expect somewhere between 12,000 and 31,000 fragments/loci within my desired size range (140 - 160 bp, which I realize might be a bit narrow depending on the method of size-selection).

              So, according to estimates and specifications for the various Illumina MiSeq kits (v2, v3, and their different variations), depending on which I choose, I might be able to expect something like 10 - 15x coverage per locus per individual on average...and the 10x estimate is on the low end. But this is all guess-work.
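
A back-of-the-envelope version of this coverage estimate can be sketched as below. The read yields per kit (about 15 million for v2, about 25 million for v3) are assumed round numbers and not taken from the thread, so check them against the current kit specifications. The sketch also assumes every read is usable and maps to a target locus, which is optimistic.

```python
def mean_coverage(total_reads, n_individuals, n_loci):
    """Average reads per locus per individual, assuming every read is usable."""
    return total_reads / (n_individuals * n_loci)

N_INDIVIDUALS = 48
LOCI_RANGE = (12_000, 31_000)              # SimRAD estimates from the post
ASSUMED_READS = {"v2": 15e6, "v3": 25e6}   # assumed yields; verify against kit specs

for kit, reads in ASSUMED_READS.items():
    for n_loci in LOCI_RANGE:
        cov = mean_coverage(reads, N_INDIVIDUALS, n_loci)
        print(f"MiSeq {kit}, {n_loci} loci: ~{cov:.1f}x per locus per individual")
```

Under these assumptions the worst case (v2 kit, 31,000 loci) lands near 10x, which matches the low end mentioned above.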

              Sorry for the long answer!



              • #8
                One aspect of genotyping by sequencing design that people often neglect in their calculations is variance. If you look at ddRAD papers where the information is available and the sample DNA is all high quality, you can expect a 2-3-fold range of read count for most samples, and a few outliers on either side. So 10X coverage on average will be 5X for some samples and 15X for others.

                There is also locus variance, and ddRAD papers again show a wide range of coverage depending on the locus. You might get another 5-fold range from that. So now you have some poor performing samples with poor performing loci and just get 1X coverage on that.
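
The compounding of the two sources of variance can be sketched with simple multipliers. The factors below are one illustrative reading of the ranges above (a sample at half or 1.5 times the mean, a poor locus at one fifth of its sample's depth), not measured values.

```python
MEAN_COVERAGE = 10.0

# Illustrative multipliers relative to the mean; assumed, not measured.
sample_factors = {"low sample": 0.5, "typical sample": 1.0, "high sample": 1.5}
locus_factors = {"poor locus": 0.2, "typical locus": 1.0}

def expected_depth(mean_cov, sample_factor, locus_factor):
    """Expected depth when sample-level and locus-level effects multiply."""
    return mean_cov * sample_factor * locus_factor

for s_name, s in sample_factors.items():
    for l_name, l in locus_factors.items():
        depth = expected_depth(MEAN_COVERAGE, s, l)
        print(f"{s_name:>14} / {l_name:<13}: ~{depth:.1f}x")
```

The low sample at a poor locus comes out at about 1x, which is too shallow to call a heterozygote with any confidence.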

                You are in a tough spot. With a large genome it will be hard to get fewer loci amplified reliably. But I guess you will have SNPs to spare anyway. Just expect the number of SNPs that will be assayable across the population at depth to be a lot smaller than anticipated.

                As for the barcodes, you just need to make sure your size range is within the read length so you can sequence both barcodes. I prefer the indexes to be separate reads to preserve the main read for genotyping, but on a MiSeq you will also have nucleotides to spare.



                • #9
                  Re: coverage, is there a particular reason to use the MiSeq instead of HiSeq? You can get 10X the number of reads at only slightly higher cost. Given the anticipated variation in representation, the additional data seem like a worthwhile investment.

                  Re: barcoding, three nucleotides is sufficient only in a universe where sequencing is error-free :-). Minimally, you'll need at least two differences between each barcode, and that still requires discarding any reads with a single mismatch in the barcode to avoid erroneous demultiplexing.
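
A minimal sketch of that check: compute the smallest pairwise Hamming distance in a candidate barcode set, then demultiplex by exact match only. Building the set from permutations of A, C and G is a hypothetical choice that mirrors the "six permutations of three bases" idea; the sample names are placeholders.

```python
from itertools import combinations, permutations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

def demultiplex(observed, barcode_to_sample):
    """Exact match only: a barcode with any mismatch returns None (read discarded)."""
    return barcode_to_sample.get(observed)

# Hypothetical 3 bp set: the six orderings of A, C and G.
barcodes = ["".join(p) for p in permutations("ACG")]
barcode_to_sample = {bc: f"sample_{i}" for i, bc in enumerate(barcodes, start=1)}

print(barcodes, "min distance:", min_pairwise_distance(barcodes))
print(demultiplex("ACG", barcode_to_sample))  # known barcode
print(demultiplex("ACT", barcode_to_sample))  # one sequencing error, discarded
```

This particular set has a minimum distance of 2, so a single error can be detected but never corrected, and every such read is lost. Longer barcodes with a minimum distance of 3 or more would allow single-error correction.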

                  Re: index in the adapter vs. PCR primer, the former method requires longer primers but library prep is more efficient (the main reason that Illumina switched from PCR to adapter indexing when they rolled out the TruSeq system). Note that adapter indexing also allows you to pool your samples prior to size selection, then amplify the entire pool in a single reaction. And I second SNPsaurus's suggestion for separate read indexing, to maximize the genotyping information.



                  • #10
                    One more thought about the large genome. ddRAD (and RAD) use a Y-adapter to prevent amplification of fragments other than the ones that have the "infrequent" cutter. But if you are working with 20,000 fragments in an 8 Gb genome, then you have 20,000 fragments you are trying to amplify and 50M you are not (thinking of the fragments generated by the frequent cutter on both ends). In this situation, any artifact will rear its ugly head. Say there is a very rare mispriming off the Y-adapter end, maybe 0.001% each cycle. You'll have 5,000 artifactual loci being amplified. Or a mild exonuclease activity blunts the overhang of some fragments and some primers... suddenly there are 50,000 artifactual loci. These will move sequencing away from the loci you are trying to genotype.
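
The scale of the problem can be sketched numerically. The fragment counts and the 0.001% rate come from the paragraph above; the number of PCR cycles and the simple additive model (each cycle converts a fresh set of off-target fragments) are assumptions made only for this illustration.

```python
TARGET_FRAGMENTS = 20_000
OFF_TARGET_FRAGMENTS = 50_000_000
MISPRIME_RATE = 0.001 / 100   # 0.001% per cycle, as a fraction
ASSUMED_CYCLES = 10           # assumption; no cycle number is given in the thread

def artifacts_per_cycle(off_target, rate):
    """Expected number of off-target fragments misprimed in a single cycle."""
    return off_target * rate

def cumulative_artifacts(off_target, rate, cycles):
    """Additive approximation: distinct artifactual loci accumulated over cycles."""
    return artifacts_per_cycle(off_target, rate) * cycles

per_cycle = artifacts_per_cycle(OFF_TARGET_FRAGMENTS, MISPRIME_RATE)
total = cumulative_artifacts(OFF_TARGET_FRAGMENTS, MISPRIME_RATE, ASSUMED_CYCLES)

print(f"~{per_cycle:.0f} artifactual loci per cycle")
print(f"~{total:.0f} after {ASSUMED_CYCLES} cycles "
      f"({total / TARGET_FRAGMENTS:.0%} of the target locus count)")
```

Under these assumptions a one-in-100,000 event yields artifacts numbering in the thousands, a sizeable fraction of the 20,000 intended loci.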

                    SNPsaurus was recently asked to genotype a genome nearly 10-fold larger than your shark, and while we use nextRAD which uses priming rather than restriction enzymes, the principles are the same so I've been thinking about this quite a bit!

                    As HESmith asks, why not find a HiSeq? The University of Oregon Sequencing Facility runs a ton of outside users wanting to do genotyping since they know what a genotyping library looks like on a Fragment Analyzer, know how to deal with the low complexity cut sites given different machines and chemistries, etc.



                    • #11
                      Originally posted by SNPsaurus
                      As HESmith asks, why not find a HiSeq? The University of Oregon Sequencing Facility runs a ton of outside users wanting to do genotyping since they know what a genotyping library looks like on a Fragment Analyzer, know how to deal with the low complexity cut sites given different machines and chemistries, etc.
                      Originally posted by HESmith
                      Re: coverage, is there a particular reason to use the MiSeq instead of HiSeq? You can get 10X the number of reads at only slightly higher cost. Given the anticipated variation in representation, the additional data seem like a worthwhile investment.
                      They certainly do! The problem is that I simply can't afford a HiSeq run, given my current budget. I have $2500 to work with - and that includes funds for library prep.

                      I might be able to share a lane with someone - that would be feasible. I would just need to find them.

                      Originally posted by HESmith
                      Re: index in the adapter vs. PCR primer, the former method requires longer primers but library prep is more efficient (the main reason that Illumina switched from PCR to adapter indexing when they rolled out the TruSeq system). Note that adapter indexing also allows you to pool your samples prior to size selection, then amplify the entire pool in a single reaction. And I second SNPsaurus's suggestion for separate read indexing, to maximize the genotyping information.
                      I think this highlights another issue for me: I'm not certain that I understand the functional difference between indices and barcodes, and I've been trying to wrap my mind around the concept for some time now.

                      If indices provide distinct identities, why use barcodes at all? Why should I bother using a barcode adjacent to my insert (as in Peterson et al., 2012) if an integrated index will work just as well?

                      I suppose I can see a potential issue here if I'm doing a single-end sequencing run and I'm using a combination of two identifiers (index, barcode, etc.). Would a two-index system work, in this case? Or does a barcode adjacent to the insert become necessary? Does it depend on read length?

                      Originally posted by SNPsaurus
                      One more thought about the large genome. ddRAD (and RAD) use a Y-adapter to prevent amplification of fragments other than the ones that have the "infrequent" cutter. But if you are working with 20,000 fragments in an 8 Gb genome, then you have 20,000 fragments you are trying to amplify and 50M you are not (thinking of the fragments generated by the frequent cutter on both ends). In this situation, any artifact will rear its ugly head. Say there is a very rare mispriming off the Y-adapter end, maybe 0.001% each cycle. You'll have 5,000 artifactual loci being amplified. Or a mild exonuclease activity blunts the overhang of some fragments and some primers... suddenly there are 50,000 artifactual loci. These will move sequencing away from the loci you are trying to genotype.
                      This is definitely worth thinking about! I hadn't considered this at all. I've been so focused on optimizing enzymes and size selection that this type of error simply hadn't occurred to me.

                      Thank you both for your responses! I would have come back sooner - for some reason I stopped getting notifications via my thread subscription.



                      • #12
                        Originally posted by Carcharodon
                        They certainly do! The problem is that I simply can't afford a HiSeq run, given my current budget. I have $2500 to work with - and that includes funds for library prep.
                        The U of Oregon external rates that might be applicable to you are (these are rounded prices):
                        HiSeq 100 bp $1900
                        NextSeq500 Mid-Output 150 bp $1800

                        Originally posted by Carcharodon
                        I think this highlights another issue for me: I'm not certain that I understand the functional difference between indices and barcodes, and I've been trying to wrap my mind around the concept for some time now.

                        If indices provide distinct identities, why use barcodes at all? Why should I bother using a barcode adjacent to my insert (as in Peterson et al., 2012) if an integrated index will work just as well?

                        I suppose I can see a potential issue here if I'm doing a single-end sequencing run and I'm using a combination of two identifiers (index, barcode, etc.). Would a two-index system work, in this case? Or does a barcode adjacent to the insert become necessary? Does it depend on read length?
                        When we designed original RAD the inline barcode was the only option. I think ddRAD was also in development before true index reads were possible. Only later did Illumina introduce separate reads for indices, so we use a dual-index system for nextRAD.

                        On the one hand, varying from established protocols always has risks. Is the added risk worth it to save a little with a more efficient barcode system? On the other hand, it should be pretty easy to switch ddRAD to dual-index. There are so many little variants of these methods out there it might be worth just reading the recent papers using ddRAD to see if someone has done just that.



                        • #13
                          Originally posted by SNPsaurus
                          When we designed original RAD the inline barcode was the only option. I think ddRAD was also in development before true index reads were possible. Only later did Illumina introduce separate reads for indices, so we use a dual-index system for nextRAD.

                          On the one hand, varying from established protocols always has risks. Is the added risk worth it to save a little with a more efficient barcode system? On the other hand, it should be pretty easy to switch ddRAD to dual-index. There are so many little variants of these methods out there it might be worth just reading the recent papers using ddRAD to see if someone has done just that.
                          I believe folks have (based on some of the reading I've been doing lately), but the examples I've found have used paired-end sequencing. Can dual indexing be achieved with single-end reads?

                          EDIT: After watching this video a second time, I'm thinking the answer to my question is "Yes." https://www.youtube.com/watch?v=womKfikWlxM



                          Again, thank you folks for your help!
                          Last edited by Carcharodon; 07-19-2015, 06:54 PM.
