  • Basic: Why are insert sizes constant?

    Hi,

    I have a basic question (+ 2 follow-up questions) for which there seems to be no answer on the web, or perhaps I'm using the wrong search terms ...

    1)
    I am wondering why the insert size in paired-end (PE) experiments is constant.
    I read everywhere that the typical read length of Illumina paired-end reads is 2x75 or 2x100 or whatever, but nowhere is it explained why the insert size is constant.
    DNA is sheared into pieces, so shouldn't every fragment be of random size?

    To me the answer must be that there is some sort of size-filtering step (picking only fragments of size 300, for example, thus throwing away something like 99% of the DNA, because the percentage of fragments of exactly size 300 should be low after the shearing step). But I do not want to guess, I want to know. And IF there really is a size-selection step, I wonder about 2 more things:

    2.1) How is that done? I know that you could do it with gel electrophoresis, but I thought the NGS machines worked fully automatically, so that no scientist has to transfer samples onto a gel ...
    2.2) Why would ancient DNA reads be shorter than those derived from non-ancient samples? I mean, it's the same procedure, right?
    Last edited by Jonathan87; 06-16-2015, 08:25 PM.

  • #2
    There usually is a size-selection step, typically using a gel. NGS machines are not automatic; a lot of manual labor is involved in library prep.

    But, even if there is no size-selection, most platforms have a strong bias toward small insert sizes, and some (like Illumina) are fundamentally incapable of sequencing molecules with large inserts. So, there's an enrichment toward short inserts.

    Fragmentation is not random. If you use sonication, for example, long molecules will be preferentially broken. A 2bp molecule experiences far less stress and has only 1 possible breakpoint, compared to a 1000bp molecule with 999 breakpoints and much more stress.
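To make the stress intuition concrete, here is a toy Python sketch (an illustration only, not a physical model of sonication; the breakage probability and the constant k are made up): in each round, a molecule breaks with probability proportional to its length, at a uniformly chosen internal bond.

```python
import random

# Toy shearing model: longer molecules are more likely to break,
# and every internal bond is an equally likely breakpoint.
def shear(molecules, rounds=30, k=0.001):
    for _ in range(rounds):
        out = []
        for length in molecules:
            if length > 1 and random.random() < min(1.0, k * length):
                cut = random.randint(1, length - 1)  # one of length-1 bonds
                out.extend([cut, length - cut])
            else:
                out.append(length)
        molecules = out
    return molecules

random.seed(0)
fragments = shear([50_000] * 20)  # twenty 50kb molecules
print(len(fragments), min(fragments), max(fragments))
```

The total length is conserved, but long fragments keep getting hit while short ones mostly survive, so the size distribution piles up at small values rather than staying uniform.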



    • #3
      Originally posted by Jonathan87 View Post
      1)
      I am wondering why the insert size in paired-end (PE) experiments is constant.
      I read everywhere that the typical read length of Illumina paired-end reads is 2x75 or 2x100 or whatever, but nowhere is it explained why the insert size is constant.
      DNA is sheared into pieces, so shouldn't every fragment be of random size?

      To me the answer must be that there is some sort of size-filtering step (picking only fragments of size 300, for example, thus throwing away something like 99% of the DNA, because the percentage of fragments of exactly size 300 should be low after the shearing step). But I do not want to guess, I want to know.
      In addition to avoiding the size bias pointed out by Brian, knowing the insert size more precisely also helps with the alignment process (especially with tricky applications like calling RNA isoforms).

      Originally posted by Jonathan87 View Post
      But IF there really is a size-selection step, I wonder about 2 more things:

      2.1) How is that done? I know that you could do it with gel electrophoresis, but I thought the NGS machines worked fully automatically, so that no scientist has to transfer samples onto a gel ...
      There are a number of methods, both gel-based and bead-based. Hopefully someone still active in the lab can shed more light on the details. But your assumption that NGS is fully automated is generally not true. People still have to use their hands and pipettors (gasp!).

      Originally posted by Jonathan87 View Post
      2.2) Why would ancient DNA reads be shorter than those derived from non-ancient samples? I mean, it's the same procedure, right?
      The reads are shorter because the DNA is shorter. DNA is kind of fragile and breaks down over time (a lot of time in the case of 'ancient' DNA). FFPE samples suffer from a similar problem - highly fragmented DNA.
      AllSeq - The Sequencing Marketplace
      [email protected]
      www.AllSeq.com



      • #4
        Originally posted by Brian Bushnell View Post
        But, even if there is no size-selection, most platforms have a strong bias toward small insert sizes, and some (like Illumina) are fundamentally incapable of sequencing molecules with large inserts. So, there's an enrichment toward short inserts.
        But to me a "strong bias toward small insert sizes" is very different from "the insert size IS 100" ... ? I mean, a "strong bias toward small insert sizes" could mean anything. It could mean that the insert size of my first read pair is 80 and that of my second read pair is 150. I thought we knew the insert size very exactly, +- 1 or 2 bp maybe?


        Originally posted by Brian Bushnell View Post
        Fragmentation is not random. If you use sonication, for example, long molecules will be preferentially broken. A 2bp molecule will have far less stress, and only 1 possible breakpoint, compared to a 1000bp molecule with 999 breakpoints and much more stress.
        Okay, not totally random of course, but knowing that long molecules will be preferentially broken is still not very precise information about the insert size?


        In general: how precise is the information if somebody tells me "the insert size of your library is 100"? What's the standard deviation? I always thought it was very small, like 1 or 2 or 3?


        Originally posted by AllSeq View Post
        The reads are shorter because the DNA is shorter. DNA is kind of fragile and breaks down over time (a lot of time in the case of 'ancient' DNA). FFPE samples suffer from a similar problem - highly fragmented DNA.
        I understand that the DNA molecules in the sample are shorter, but if there is a size-selection step, that should not affect my resulting read length. I mean, if I size-select, then I only get the fragments that are of size, say, 100. And the fragments that are not of size 100 are filtered out, as is done with modern DNA? Or is it done differently for aDNA? Is the size-selection step skipped? Or is the size-selection filter set much lower than for modern samples?
        Last edited by Jonathan87; 06-16-2015, 09:34 PM.



        • #5
          Originally posted by Jonathan87 View Post
          But to me a "strong bias toward small insert sizes" is very different from "the insert size IS 100" ... ? I mean, a "strong bias toward small insert sizes" could mean anything. It could mean that the insert size of my first read pair is 80 and that of my second read pair is 150. I thought we knew the insert size very exactly, +- 1 or 2 bp maybe?
          No, there's a huge variability. Some recent "450bp insert" libraries I have looked at had modes ranging from 420 to 460bp, with roughly bell-shaped distributions, in which the 5th and 95th percentiles were perhaps 300 and 550bp. Which is pretty good.
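To put a number on that variability, here is a back-of-envelope Python sketch. It assumes the distribution were exactly normal (real libraries are only roughly bell-shaped and often skewed) and asks what standard deviation the 5th and 95th percentiles quoted above would imply.

```python
from statistics import NormalDist

# Mode ~450bp, 5th percentile ~300bp, 95th percentile ~550bp
# (the numbers quoted above). Under a normal assumption:
z95 = NormalDist().inv_cdf(0.95)      # ~1.645
sd_low = (450 - 300) / z95            # sd implied by the 5th percentile
sd_high = (550 - 450) / z95           # sd implied by the 95th percentile
print(round(sd_low), round(sd_high))  # tens of bp, not 1-3 bp
```

The implied spread is on the order of 60-90bp, and the asymmetry between the two estimates is one sign the real distribution is skewed.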

          In general: How precise is the information if somebody tells me "The insert size of your library is 100"? What's the standard deviation? I always thought it was very small like 1 or 2 or 3 ?
          No, it's high. You will never get a standard deviation like that except in amplicon libraries, which are not randomly fragmented.

          I understand that the DNA molecules in the sample are shorter, but if there is a size-selection step, that should not affect my resulting read length. I mean, if I size-select, then I only get the fragments that are of size, say, 100. And the fragments that are not of size 100 are filtered out, as is done with modern DNA? Or is it done differently for aDNA? Is the size-selection step skipped? Or is the size-selection filter set much lower than for modern samples?
          I have not worked with ancient DNA but I suppose that one might avoid size-selection (or set the threshold very low) because there is so little DNA, and size-selection removes a lot of it.



          • #6
            I think that Jonathan's questions relate to the sequencing technology rather than (or in addition to) the size of the DNA molecules. Illumina uses sequencing by synthesis. Each cycle consists of the addition of one nucleotide, its detection via imaging, and deblocking to prepare for the next cycle. The researcher selects the exact number of cycles to sequence, which results in every read having the identical length (i.e., standard deviation of the data = 0) even though the insert sizes vary.
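A toy illustration of that point (the function sequence here is a stand-in for the instrument, not a real API): the cycle count fixes the read length, while the inserts vary freely.

```python
# One base is incorporated and imaged per cycle, so a 75-cycle run
# yields 75-base reads regardless of how long the insert is.
def sequence(fragment, cycles=75):
    return fragment[:cycles]

inserts = ["A" * n for n in (200, 300, 451)]  # variable insert sizes
reads = [sequence(f) for f in inserts]
print({len(r) for r in reads})                # one constant read length
```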



            • #7
              Follow up for 2.1 (since I still spend some time at the bench): The most common methods of size selection employ gel electrophoresis (either manual or automated, via Pippin Prep) or differential precipitation onto beads (e.g., AMPure, which can also be manual or automated).



              • #8
                Originally posted by Jonathan87 View Post
                Hi,

                I have a basic question (+ 2 follow-up questions) for which there seems to be no answer on the web, or perhaps I'm using the wrong search terms ...

                1)
                I am wondering why the insert size in paired-end (PE) experiments is constant.
                I read everywhere that the typical read length of Illumina paired-end reads is 2x75 or 2x100 or whatever, but nowhere is it explained why the insert size is constant.
                DNA is sheared into pieces, so shouldn't every fragment be of random size?
                No, that is not what 2x75 means. It says nothing about the insert size, only that two 75-base reads will be collected from the insert; the insert size is simply not specified in the "2x75" notation.
                So you will have 2 reads of the specified length, 75, with a variable number of unsequenced bases between them, or the 2 reads might even overlap.
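A quick sketch of that arithmetic (the name inner_distance is just for illustration; some tools call this the inner mate distance):

```python
def inner_distance(insert_size, read_len=75):
    """Unsequenced bases between the two reads of a 2x75 pair.
    Negative values mean the reads overlap."""
    return insert_size - 2 * read_len

# The read length is fixed at 75; the gap (or overlap) depends
# entirely on the fragment that happened to be sequenced.
for ins in (120, 150, 200, 300):
    print(ins, inner_distance(ins))  # gaps of -30, 0, 50, 150
```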

                --
                Phillip



                • #9
                  @Brian: Thank you, I didn't know that the standard deviation of the insert size is so big. If the std dev were lower, assemblers would probably perform better.

                  @pmiguel I know what 2x75 means; I was only wondering about the insert size. The sentence of mine that you refer to was just to point out that all my searches led me only to superficial knowledge (I know the typical read lengths), while I was looking for information about the insert size and why it is constant
                  [which was described nowhere in my search results, maybe because every wet-lab scientist knows that the answer is the size-selection step. But to me that was not 100% clear, since the only information I had found was that sonication is applied to shear the DNA into pieces. And sonication alone does not let you say, e.g., "the insert size of your library is 100"]

                  @HESmith: I know that the read length is exact. I know that the sequencing is performed in x steps and therefore the read length is exactly x. I was only wondering about the insert size.

                  Ultimately, I was wondering why aDNA fragments are shorter than modern DNA fragments. Because if we size-select for, e.g., length 100 as we do with modern DNA (and throw away everything else), the fragment length should not differ. But as Brian has pointed out, the answer is probably that the procedure is not 100% the same as with modern DNA, i.e. size selection is either avoided completely or the threshold is set very low.



                  • #10
                    Sometimes, a large variation in insert size is nice, and can improve scaffolding and variant-calling (via mapping).

                    When people quote insert sizes they are normally talking about the target, or else the median or mode. For example, at JGI we make a lot of "270bp insert libraries". The median and mode are often pretty close to 270, but they would still be called 270bp insert libraries even if the mode was 300bp or 250bp. The goal is for them to be overlapping 2x150bp reads, and anything above 300bp does not overlap (and thus cannot be merged). Typical merge rates are on the order of 60-70% when the 270bp target is hit, with most unmerged pairs failing simply because they don't overlap; a fully-overlapping library will typically merge at around 97%+.
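As a sanity check on that merge-rate arithmetic, here is a sketch under a normal-distribution assumption (sd = 78bp is an illustrative guess, not a measured JGI value):

```python
from statistics import NormalDist

def overlap_fraction(mode=270, sd=78, read_len=150):
    # A 2x150 pair can only overlap if the insert is <= 2 * 150 = 300bp.
    return NormalDist(mode, sd).cdf(2 * read_len)

print(round(overlap_fraction(), 2))  # inside the quoted 60-70% band
```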



                    • #11
                      What you refer to as "insert size" is, to me, the fragment size ... ?
                      I thought the insert size was the number of bases between the reads. So in your case (fragment size 270, read size 150), the insert size would be -30 to me.

                      But besides that definitional point:
                      What's the motivation for that procedure?
                      Is it that when you merge the paired reads, you get a long, high-quality sequence, which you would not have gotten if you had just sequenced a 270bp fragment as a single-end read (as the quality drops towards ... was it the 3' end?)?



                      • #12
                        Insert size is the fragment length. Sometimes people refer to the distance between reads as "inner length" or similar.

                        The goal of 270bp inserts is to reduce error rates - since the quality declines toward the read end, overlapping reads can salvage it, and create longer reads. With a kmer-based assembler, longer reads allow greater coverage with longer kmers. Without overlapping reads, a kmer-based assembler working with 2x150bp data would be limited to kmers much less than 150; with overlapping reads, k=200+ is possible.
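A minimal sketch of the merging step itself (exact-match overlap only; a real merger such as BBMerge tolerates mismatches and scores candidate overlaps):

```python
def revcomp(s):
    return s[::-1].translate(str.maketrans("ACGT", "TGCA"))

def merge_pair(r1, r2, min_overlap=10):
    """Merge a pair whose insert is shorter than the combined read length.
    r2 is given as sequenced, i.e. reverse-complemented relative to r1."""
    r2 = revcomp(r2)  # put both reads on the same strand
    for olap in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-olap:] == r2[:olap]:
            return r1 + r2[olap:]
    return None  # no overlap: insert longer than the reads can span

frag = "ACGTACGGTTCAGCAAGGTCAGGCTT"  # toy 26bp insert
r1 = frag[:18]                       # forward read
r2 = revcomp(frag[-18:])             # reverse read, as sequenced
merged = merge_pair(r1, r2)
print(merged == frag)                # merged read spans the whole insert
```

The merged sequence is longer than either read, which is what lets a kmer-based assembler use k well above the individual read length.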

