Read lengths, inserts, fragment size...


  • #16
    Hi Heisman,

    I just need a simple clarification, please.
    Suppose you have a read like (AAACGGCGTTTCCC)
    and you want to sequence it using Illumina paired end runs.
    Does paired end mean you will get the sequence of the ends?
    I.e., does it imply we only sequence AAA and CCC in the sequence above?
    If that is true, I assumed that with sequencing by synthesis we would get TTT and GGG.
    I understand that if we mapped the sequences back to the reference, we would anticipate that they are 8 bases apart (given that no indels are present in our DNA at hand). Is this right?
    However, my main concern is the sequence in between.
    What really happens to it?
    I guess my question is:
    What is the benefit of paired end reads if we only sequence the ends and not what's in between?

    I would really appreciate help in clarifying this.

    Originally posted by Heisman
    1. I don't know, as we only receive the unfiltered ones. Maybe "grep -v"?

    2. I think you'll need to unzip them all and then concatenate.

    3. With Illumina paired end runs you have something like this:

    [flowcell adapter][sequencing primer][insert][sequencing primer][flowcell adapter]

    The key is that the insert may be say 300bp, and if you do 2x100 reads, you'll sequence it like this (the dots are only spacers):

    ........--------->...................<---------.....
    ........xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.....

    So when aligning paired end data it is clear that the two mates for one read should align in that orientation fairly close to each other.



    • #17
      Originally posted by modi2020
      Hi Heisman,

      I just need a simple clarification, please.
      Suppose you have a read like (AAACGGCGTTTCCC)
      and you want to sequence it using Illumina paired end runs.
      Does paired end mean you will get the sequence of the ends?
      I.e., does it imply we only sequence AAA and CCC in the sequence above?
      If that is true, I assumed that with sequencing by synthesis we would get TTT and GGG.
      I understand that if we mapped the sequences back to the reference, we would anticipate that they are 8 bases apart (given that no indels are present in our DNA at hand). Is this right?
      However, my main concern is the sequence in between.
      What really happens to it?
      I guess my question is:
      What is the benefit of paired end reads if we only sequence the ends and not what's in between?

      I would really appreciate help in clarifying this.
      I have never thought about this and am honestly not sure if you get TTT or AAA in your read.

      You are correct that you sequence the ends, and if you did 2x3bp reads you would get AAA and CCC (or maybe TTT and GGG). EDIT: You would actually get reverse complements, so AAA and GGG or TTT and CCC.
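To make the 2x3bp example concrete, here is a minimal Python sketch. It assumes the usual paired-end model in which read 1 reports the first bases of one strand and read 2 reports the first bases of the opposite strand, each written 5' to 3'. The function names are illustrative only and do not come from any sequencing tool.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Simulate one read pair from a fragment (assumed model, no errors)."""
    # Read 1: the first read_length bases of the given strand.
    read1 = fragment[:read_length]
    # Read 2: the first read_length bases of the opposite strand.
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


read1, read2 = paired_end_reads("AAACGGCGTTTCCC", 3)
print(read1, read2)  # AAA GGG
```

If the fragment had been loaded in the other orientation, the same pair would come out with the roles swapped (GGG as read 1, AAA as read 2). The eight bases in the middle never appear in either read.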

      You would probably NOT anticipate that they are 8 base pairs apart. When you do a library prep you will almost certainly get some distribution of insert size fragments around a mean. So you would anticipate they will be 8 +/- 2bp apart, for example (more realistically maybe 250 +/- 50bp or something like that). The sequence in between remains unknown to you.

      So, the benefit to paired end sequencing is three-fold, in my opinion. First, it makes it easier to map each fragment. If your read has two ends, A and B, and read A can be mapped almost equivalently to two locations in the genome, but read B can only be mapped to one location, the aligner will put read A at the location close to where read B maps.

      Second, if you are at all interested in detecting larger CNVs/structural variants, PE reads are much more helpful. Two examples: first, if one read maps and the other does not, it's possible the unmapped read spans a breakpoint of a CNV/SV, and you can do a split-read mapping of that read to try to determine the breakpoint. Second, if both reads map but the orientation is abnormal (i.e., both map like "---->" instead of "---->" and "<----"), or if the distance between the mapped reads is abnormal (i.e., you expect 250 +/- 50 but you observe for one PE read that the two reads are mapped 1000bp apart), that gives you a lot of information.
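The orientation and distance checks described above can be sketched as a small classifier. This is only an illustration under stated assumptions: each read is given by its leftmost mapped position and strand, the expected outer fragment size is 250 +/- 50 as in the example, and the function name and labels are made up here. Real SV callers estimate the expected distribution from the data and use more categories.

```python
def classify_pair(start1, strand1, start2, strand2,
                  read_length=100, expected=250, tolerance=50):
    """Label a mapped read pair as concordant or as one of two anomalies.

    start1/start2 are leftmost reference positions; strands are '+' or '-'.
    The expected size and tolerance are assumed values, not tool defaults.
    """
    # Both reads on the same strand: "---->" and "---->".
    if strand1 == strand2:
        return "abnormal orientation"

    # Order the reads by position so the leftmost one comes first.
    (left_start, left_strand), (right_start, _) = sorted(
        [(start1, strand1), (start2, strand2)]
    )

    # A normal pair points inward: forward read on the left.
    if left_strand != "+":
        return "abnormal orientation"

    # Outer span from the start of the left read to the end of the right read.
    span = right_start + read_length - left_start
    if abs(span - expected) > tolerance:
        return "abnormal distance"
    return "concordant"


print(classify_pair(1000, "+", 1150, "-"))  # concordant
print(classify_pair(1000, "+", 1150, "+"))  # abnormal orientation
print(classify_pair(1000, "+", 1900, "-"))  # abnormal distance
```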

      Third, and possibly the most useful (although the first point is quite useful), with PE reads it's much easier to remove duplicate reads and be more confident that they are in fact PCR duplicates as opposed to just being two random reads that align to the same location. If you have 1x100bp reads, you can have at most 100x coverage of any base without duplication (barring indels in the read). If you have 2x100bp reads and the insert size distribution is, say, 250 +/- 50bp, you can potentially have 10,000x coverage or higher after removing all reads that look like duplicates.
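The coverage ceiling in the third point can be checked by brute force. The sketch below makes simplifying assumptions: a single strand, no indels, insert sizes limited to the integers 200 to 300, and a "duplicate" defined as a fragment with the same start position and the same insert size. Actual duplicate-marking tools use their own definitions, so treat the numbers as a rough bound.

```python
def distinct_single_end(read_length):
    """Count distinct read start positions whose read covers base 0."""
    p = 0
    return sum(1 for start in range(p - read_length + 1, p + 1))


def distinct_paired_end(read_length, min_insert, max_insert):
    """Count distinct (start, insert size) fragments where either read covers base 0."""
    p = 0
    count = 0
    for insert in range(min_insert, max_insert + 1):
        # Every fragment of this size that contains base p at all.
        for start in range(p - insert + 1, p + 1):
            end = start + insert - 1
            in_read1 = start <= p <= start + read_length - 1
            in_read2 = end - read_length + 1 <= p <= end
            if in_read1 or in_read2:
                count += 1
    return count


print(distinct_single_end(100))            # 100
print(distinct_paired_end(100, 200, 300))  # 20200
```

Under these assumptions the paired-end ceiling is 200 fragments per insert size times 101 insert sizes, which is consistent with the "10,000x or higher" estimate.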
      Last edited by Heisman; 05-27-2012, 08:15 AM.



      • #18
        Sorry, the two ends would be the reverse complements of each other's strands.



        • #19
          Hi Heisman,

          This clarifies the process a lot for me. Thank you so much for your detailed answer and your help with this.

          Best

          Originally posted by Heisman
          I have never thought about this and am honestly not sure if you get TTT or AAA in your read.

          You are correct that you sequence the ends, and if you did 2x3bp reads you would get AAA and CCC (or maybe TTT and GGG).

          You would probably NOT anticipate that they are 8 base pairs apart. When you do a library prep you will almost certainly get some distribution of insert size fragments around a mean. So you would anticipate they will be 8 +/- 2bp apart, for example (more realistically maybe 250 +/- 50bp or something like that). The sequence in between remains unknown to you.

          So, the benefit to paired end sequencing is three-fold, in my opinion. First, it makes it easier to map each fragment. If your read has two ends, A and B, and read A can be mapped almost equivalently to two locations in the genome, but read B can only be mapped to one location, the aligner will put read A at the location close to where read B maps.

          Second, if you are at all interested in detecting larger CNVs/structural variants, PE reads are much more helpful. Two examples: first, if one read maps and the other does not, it's possible the unmapped read spans a breakpoint of a CNV/SV, and you can do a split-read mapping of that read to try to determine the breakpoint. Second, if both reads map but the orientation is abnormal (i.e., both map like "---->" instead of "---->" and "<----"), or if the distance between the mapped reads is abnormal (i.e., you expect 250 +/- 50 but you observe for one PE read that the two reads are mapped 1000bp apart), that gives you a lot of information.

          Third, and possibly the most useful (although the first point is quite useful), with PE reads it's much easier to remove duplicate reads and be more confident that they are in fact PCR duplicates as opposed to just being two random reads that align to the same location. If you have 1x100bp reads, you can have at most 100x coverage of any base without duplication (barring indels in the read). If you have 2x100bp reads and the insert size distribution is, say, 250 +/- 50bp, you can potentially have 10,000x coverage or higher after removing all reads that look like duplicates.



          • #20
            Dear Senior Member Heisman,
            I read everything above in this thread; thank you very much for helping newcomers, including me, understand this.

            I have a question that I think you may be able to answer very easily. I received the results from the company and do not understand what the insert size is, which I have to specify for the alignment. I called the company but they didn't give me an exact answer. They said the adapter details are in the report. The details, as I received them, are:
            Hiseq2000: PE
            Read Length: 101 x 2
            Insert Size: 80~380 (main 150)
            Adapter 5': (1) TruSeq Universal Adapter, 58bp; (2) TruSeq Adapter Index 1-12, 63bp; (3) TruSeq Adapter Index 13-27, 65bp.

            My question comes with --mate-inner-dist / -r (Tophat2). If I calculate as per the tophat manual then it comes like this:
            Example   Adaptor   PE(x2)   Whole_Insert_Size   Calculated(-r)
            Tophat    50        100      300                 200
            If (1)    58        116      380                 264
            If (2)    63        126      380                 254
            If (3)    65        130      380                 250

            Q1. Which of the -r values calculated above is correct for my sample, if any? Is it 250 (+/-14)?
            Q2. Do I need more information from the sequencing company to calculate these values?
            Q3. What am I missing for calculating the insert size?

            Please do reply, as this is troubling me a lot.
            Thank you.


            Originally posted by Heisman
            Sorry, the two ends would be the reverse complements of each other's strands.



            • #21
              Originally posted by jp.
              Hiseq2000: PE
              Read Length: 101 x 2
              Insert Size: 80~380 (main 150)
              Adapter 5': (1) TruSeq Universal Adapter, 58bp; (2) TruSeq Adapter Index 1-12, 63bp; (3) TruSeq Adapter Index 13-27, 65bp.


              Q1. Which of the -r values calculated above is correct for my sample, if any? Is it 250 (+/-14)?
              Q2. Do I need more information from the sequencing company to calculate these values?
              Q3. What am I missing for calculating the insert size?
              --mate-inner-dist is calculated by (mean insert size) - (total read length) so for your data:

              150 - (101 x 2) = -52

              What this means in biological terms is that, on average, your read pairs overlap by ~52 bp at their 3' ends.
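The calculation above is simple enough to wrap in a helper for checking values before passing them to -r. This is just a sketch of the formula (mean insert size) - (total read length); the function name is made up for illustration, and "insert size" here means the fragment length without adapters.

```python
def mate_inner_dist(mean_insert_size, read_length, reads_per_pair=2):
    """Inner distance between mates: fragment length minus the sequenced bases.

    A negative result means the two reads overlap by that many bases on average.
    """
    return mean_insert_size - reads_per_pair * read_length


# Example from the TopHat manual: 300bp fragments, 50bp reads.
print(mate_inner_dist(300, 50))   # 200

# The data in this thread: 150bp main insert size, 101bp reads.
print(mate_inner_dist(150, 101))  # -52
```

Note that adapter lengths do not enter the calculation at all, which is why the 58/63/65bp adapter figures are not needed here.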



              • #22
                I think I am getting your point, despite my limited understanding. Here is what the TopHat manual says:
                -r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp.
                This possibly means:
                ---50--->|-------200-------|<---50---
                = 300 - (50 x 2) = 200
                As per your example, -r = (mean insert size) - (total read length), so for my data: 150 - (101 x 2) = -52 [meaning that, on average, my read pairs overlap by ~52 bp at their 3' ends].

                Should I give -r -50? I think there is no problem with giving a negative value for -r, or am I missing something?

                What about --mate-std-dev in my case [Read Length: 101 x 2; Insert Size: 80~380 (main 150); Adapter 5': (1) TruSeq Universal Adapter, 58bp; (2) TruSeq Adapter Index 1-12, 63bp; (3) TruSeq Adapter Index 13-27, 65bp]?
                Will it be -120 to 160, if I calculate from the range [80-380, main 150]: 80 - (101 x 2) = -122 | 380 - (101 x 2) = 178?

                Am I doing something wrong?


                Originally posted by kmcarr
                --mate-inner-dist is calculated by (mean insert size) - (total read length) so for your data:

                150 - (101 x 2) = -52

                What this means in biological terms is that, on average, your read pairs overlap by ~52 bp at their 3' ends.
                Last edited by jp.; 08-01-2013, 07:35 PM. Reason: adding info



                • #23
                  My fragment sizes on an agarose gel after restriction digestion and ligation range from 200-400bp. The sequencing technology that I use is customized to sequence 80 bases in 80 cycles of single-read sequencing. Once I trim off the adapter I am left with 75bp, from which my SNP markers are scored.
                  ***************Questions***********
                  Is there any chance that I may be missing some markers, since my fragment size is much longer than my read length?
