Read lengths, inserts, fragment size...


  • Martin Kanyeki
    replied
    My fragment sizes on an agarose gel after restriction digestion and ligation range from 200-400 bp. The sequencing technology that I use is customized to sequence 80 bases in 80 cycles of single-read sequencing. Once I trim off the adapter I am left with 75 bp, from which my SNP markers are scored.
    Question:
    Is there any chance that I may be missing some markers, since my fragment size was much longer than my read length?



  • jp.
    replied
    I think I am getting your point, within my limited understanding. Here is what the TopHat manual says:
    -r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp.
    This possibly means:
    ---50--->|-------200-------|<---50---
    = 300 - (50 x 2) = 200
    As per your example, -r is (mean insert size) - (total read length), so for my data: 150 - (101 x 2) = -52 [what this means in biological terms is that, on average, the read pairs overlap by ~52 bp at their 3' ends].

    Should I give -r -50? I think there is no problem giving a negative value for -r, or is there something missing?

    What about --mate-std-dev in my case [Read Length: 101 x 2; Insert Size: 80~380 (main 150); Adapter 5': (1) TruSeq Universal Adapter, 58bp; (2) TruSeq Adapter Index 1-12, 63bp; (3) TruSeq Adapter Index 13-27, 65bp]?
    Will it be -120 to 160? If I calculate from the insert range [80-380, mean 150]: 80 - (101 x 2) = -122 | 380 - (101 x 2) = 178

    Am I doing something wrong?


    Originally posted by kmcarr View Post
    --mate-inner-dist is calculated by (mean insert size) - (total read length) so for your data:

    150 - (101 x 2) = -52

    What this means in biological terms is that, on average, your read pairs overlap by ~52 bp at their 3' ends.
    Last edited by jp.; 08-01-2013, 07:35 PM. Reason: adding info



  • kmcarr
    replied
    Originally posted by jp. View Post
    Hiseq2000: PE
    Read Length: 101 x 2
    Insert Size: 80~380 (main 150)
    Adapter 5': (1).TruSeq Universal Adapter, 58bp; (2.)TruSeq Adapter Index 1-12, 63bp; (3).TruSeq Adapter Index 13-27, 65bp.


    Q1. Which is the correct -r calculated above of my sample, if any? Is it 250 (+/-14)?
    Q2. Do I need more information from seq-company to calculate these values ?
    Q3. What am I missing for calculating insert size ?

    --mate-inner-dist is calculated as (mean insert size) - (total read length), so for your data:

    150 - (101 x 2) = -52

    What this means in biological terms is that, on average, your read pairs overlap by ~52 bp at their 3' ends.
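
    The arithmetic above can be written as a small helper. This is only a sketch: the function name mate_inner_distance is made up for illustration, and it assumes that "insert size" means the fragment between the adapters (adapters not included) and that both mates have the same length.

```python
def mate_inner_distance(mean_insert_size: int, read_length: int) -> int:
    """Expected gap between the two mates of a pair.

    Assumes mean_insert_size is the fragment length between the adapters
    and that both mates are read_length bases long. A negative result
    means the mates overlap by that many bases.
    """
    return mean_insert_size - 2 * read_length


# TopHat manual example: 300 bp fragments, 2 x 50 bp reads
print(mate_inner_distance(300, 50))    # 200
# Data in this thread: 150 bp mean insert, 2 x 101 bp reads
print(mate_inner_distance(150, 101))   # -52
```

    Note that the adapter lengths (58/63/65 bp) do not enter the calculation at all; only the insert size and the read length do.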



  • jp.
    replied
    Dear Senior Member Heisman,
    I read everything above in this thread; thank you very much for improving the understanding of newcomers, including me.

    I have a question that I think you can answer very easily. I received the results from the company and do not understand what the insert size is, which I need to specify for alignment. I called the company but they didn't give me an exact answer; they said the adapter details are in the report. The details, as I received them, are:
    Hiseq2000: PE
    Read Length: 101 x 2
    Insert Size: 80~380 (main 150)
    Adapter 5': (1).TruSeq Universal Adapter, 58bp; (2.)TruSeq Adapter Index 1-12, 63bp; (3).TruSeq Adapter Index 13-27, 65bp.

    My question is about --mate-inner-dist / -r (TopHat2). If I calculate it as per the TopHat manual, it comes out like this:
    Example    Adaptor   PE(x2)   Whole_Insert_Size   Calculated(-r)
    TopHat     50        100      300                 200
    If (1)     58        116      380                 264
    If (2)     63        126      380                 254
    If (3)     65        130      380                 250

    Q1. Which is the correct -r calculated above of my sample, if any? Is it 250 (+/-14)?
    Q2. Do I need more information from seq-company to calculate these values ?
    Q3. What am I missing for calculating insert size ?

    Please do reply, as I am struggling with this.
    Thank you.


    Originally posted by Heisman View Post
    Sorry, the two ends would be the reverse complements of each other's strands.



  • modi2020
    replied
    Hi Heisman,

    This clarifies the process a lot to me. Thank you so much for your detailed answer and help with this.

    Best




  • Heisman
    replied
    Sorry, the two ends would be the reverse complements of each other's strands.



  • Heisman
    replied
    Originally posted by modi2020 View Post
    Hi Heisman,

    I just need a simple clarification, please.
    Suppose you have a read like (AAACGGCGTTTCCC)
    and you want to sequence it using Illumina paired end runs.
    Does paired end mean you will get the sequence of the ends?
    I.e., does it imply we only sequence AAA and CCC in the sequence above?
    If that is true, I assumed that using sequencing by synthesis we would get TTT and GGG.
    I understand that if we mapped the sequences back to the reference we would anticipate that they are 8 bases apart (given no indels are present in our DNA at hand). Is this right?
    However, I am mainly concerned about the sequence in between.
    What really happens to it?
    I guess my question is:
    What is the benefit of paired end reads if we only sequence the ends and not what's in between?

    I would really appreciate the help on clarifying this thought

    I have never thought about this and am honestly not sure if you get TTT or AAA in your read.

    You are correct that you sequence the ends, and if you did 2x3bp reads you would get AAA and CCC (or maybe TTT and GGG). EDIT: You would actually get reverse complements, so AAA and GGG or TTT and CCC.
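
    To make the reverse-complement point concrete, here is a toy sketch using the 14-base example from the question. The function names are made up for illustration, and it assumes the usual convention that each read is reported 5'->3' from its own end of the fragment, so read 2 comes from the opposite strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Simulate an error-free paired-end read of one fragment.

    Read 1 is the first read_length bases of the given strand; read 2 is
    the first read_length bases of the opposite strand, i.e. the reverse
    complement of the fragment's last bases.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "AAACGGCGTTTCCC"
print(paired_end_reads(fragment, 3))   # ('AAA', 'GGG')
```

    The 8 bases between the two reads are never reported in this 2x3 example; only their approximate number is known from the insert size distribution.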

    You would probably NOT anticipate that they are 8 base pairs apart. When you do a library prep you will almost certainly get some distribution of insert size fragments around a mean. So you would anticipate they will be 8 +/- 2bp apart, for example (more realistically maybe 250 +/- 50bp or something like that). The sequence in between remains unknown to you.

    So, the benefit to paired end sequencing is three-fold, in my opinion. First, it makes it easier to map each fragment. If your read has two ends, A and B, and read A can be mapped almost equivalently to two locations in the genome, but read B can only be mapped to one location, the aligner will put read A at the location close to where read B maps.

    Second, if you are at all interested in detecting larger CNVs/structural variants, PE reads are much more helpful. Two examples: first, if one read maps and the other does not, it's possible the unmapped read spans a breakpoint of a CNV/SV, and you can do a split-read mapping of that read to try to determine the breakpoint. Second, if both reads map but the orientation is abnormal (i.e., both map like "---->" instead of "---->" and "<----"), or if the distance between the mapped reads is abnormal (i.e., you expect 250 +/- 50 but you observe for one PE read that the two reads are mapped 1000bp apart), that gives you a lot of information.

    Third, and possibly the most useful (although the first point is quite useful), with PE reads it's much easier to remove duplicate reads and be more confident that they are in fact PCR duplicates as opposed to just being two random reads that align to the same location. If you have 1x100bp reads, you can have at most 100x coverage of any base without duplication (barring indels in the read). If you have 2x100bp reads and the insert size distribution is, say, 250 +/- 50bp, you can potentially have 10,000x coverage or higher after removing all reads that look like duplicates.
    Last edited by Heisman; 05-27-2012, 08:15 AM.
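
    The 100x versus 10,000x figures in the third point follow from a simple counting argument, sketched below. This is a rough illustration, not how any duplicate-marking tool works internally: it assumes duplicates are defined by identical alignment coordinates, considers one strand only, counts only pairs whose first read covers the base, and treats every insert size from 200 to 300 bp as available. The function names are hypothetical.

```python
def max_unique_coverage_single_end(read_length):
    """Distinct read start positions that cover one given base."""
    return read_length


def max_unique_coverage_paired_end(read_length, min_insert, max_insert):
    """Distinct (read 1 start, insert size) combinations covering one base.

    Two pairs count as duplicates only if both the start and the insert
    size match, so each combination can contribute one unique pair.
    """
    n_starts = read_length
    n_insert_sizes = max_insert - min_insert + 1
    return n_starts * n_insert_sizes


print(max_unique_coverage_single_end(100))            # 100
print(max_unique_coverage_paired_end(100, 200, 300))  # 10100
```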



  • modi2020
    replied
    Hi Heisman,

    I just need a simple clarification, please.
    Suppose you have a read like (AAACGGCGTTTCCC)
    and you want to sequence it using Illumina paired end runs.
    Does paired end mean you will get the sequence of the ends?
    I.e., does it imply we only sequence AAA and CCC in the sequence above?
    If that is true, I assumed that using sequencing by synthesis we would get TTT and GGG.
    I understand that if we mapped the sequences back to the reference we would anticipate that they are 8 bases apart (given no indels are present in our DNA at hand). Is this right?
    However, I am mainly concerned about the sequence in between.
    What really happens to it?
    I guess my question is:
    What is the benefit of paired end reads if we only sequence the ends and not what's in between?

    I would really appreciate the help on clarifying this thought




  • archie.chauhan
    replied
    thanks a lot.



  • Heisman
    replied
    1. I don't know, as we only receive the unfiltered ones. Maybe "grep -v"?

    2. I think you'll need to unzip them all and then concatenate.

    3. With Illumina paired end runs you have something like this:

    [flowcell adapter][sequencing primer][insert][sequencing primer][flowcell adapter]

    The key is that the insert may be say 300bp, and if you do 2x100 reads, you'll sequence it like this (the dots are only spacers):

    ........--------->...................<---------.....
    ........xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.....

    So when aligning paired end data it is clear that the two mates for one read should align in that orientation fairly close to each other.
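
    For points 1 and 2, a possible approach in Python is sketched below. Two assumptions are worth stating. First, gzip files can be joined byte-for-byte without unzipping, because a gzip stream may contain several members. Second, the filter assumes CASAVA 1.8-style headers such as "@name 1:N:0:ACGT", where the second field after the space is "Y" for reads that failed the filter, and it assumes strict 4-line FASTQ records. The function and file names are made up for illustration. A plain "grep -v" on the header line alone would leave the sequence and quality lines behind, which is why whole records are handled here.

```python
import gzip
import shutil


def concatenate_gzip(parts, combined):
    """Join gzip files byte-for-byte; the result is a valid multi-member gzip."""
    with open(combined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def failed_filter(header):
    """True if a CASAVA 1.8-style header carries the 'Y' (filtered) flag."""
    fields = header.strip().split(" ", 1)
    if len(fields) < 2:
        return False  # no comment field: nothing to go on, keep the read
    flags = fields[1].split(":")
    return len(flags) >= 2 and flags[1] == "Y"


def keep_passing_reads(fastq_gz_in, fastq_gz_out):
    """Copy 4-line FASTQ records whose header is not flagged 'Y'."""
    kept = 0
    with gzip.open(fastq_gz_in, "rt") as src, gzip.open(fastq_gz_out, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            if not failed_filter(record[0]):
                dst.writelines(record)
                kept += 1
    return kept


# Hypothetical usage (file names are placeholders):
# concatenate_gzip(["lane1_001.fastq.gz", "lane1_002.fastq.gz"], "lane1.fastq.gz")
# keep_passing_reads("lane1.fastq.gz", "lane1.pass.fastq.gz")
```

    For paired-end data the same files must be concatenated in the same order for R1 and R2, and a pair should be dropped from both files together so the mates stay in step.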



  • archie.chauhan
    replied
    Hi, sorry for the delayed response. I did "less +1000000" and the data looked good.

    I have a few more queries:
    1) I can see sequences flagged with both "N" and "Y", which indicates that the sequences have not been filtered. Are there programs to do that?
    2) Our sequencing provider has given multiple fastq.gz files per lane. What is the protocol to concatenate such files?
    3) I am confused about the Illumina paired end library in comparison to the 454 PE library. The Illumina library has the following setup: adapter-seq-adapter, in comparison to 454, which has seq-linker-seq. If the sequences are 100bp each, then in 454 we end up getting 200bp PE reads, whereas in Illumina we get separate 100bp R1 and R2 reads (for a 2x100 run). This means that in Illumina we are just getting extra 100bp reads from the PE run which do not have any linking information. We could save money by doing unpaired Illumina runs. What is the use of doing a PE Illumina run?

    Sorry for bombarding you with so many questions.

    Regards,
    arc



  • Heisman
    replied
    Oh, right; the first reads of a fastq file for Illumina will be around the edge of the flowcell, I think, making them more likely to be weird. Maybe do "less +1000000" and see what that looks like.

    I've never done any assembly so you'll have to find somebody else.



  • archie.chauhan
    replied
    Just a follow-up to the above: Illumina support gave the following answer to the problem, and I did find that the sequences in the middle are good.

    "The data that you provided looks to be very normal. Generally speaking there will be data at the beginning and end of the FASTQ that is of lower quality than the data in the middle of the file. This is simply due to sorting. This data appears to be of normal quality and appears to be intact. "

    If you have time, I want to discuss my course of action:

    I have 454 unpaired and paired data and Illumina reads. I have assembled the 454 data using Newbler. I plan to assemble the Illumina data using Velvet, then combine the assemblies using Minimus.

    arc



  • archie.chauhan
    replied
    thanks a lot...it helped



  • Heisman
    replied
    No idea if it's a problem with the library prep or the run; I would check with the sequencing core (I'd imagine it's a problem with the run, though).

    There may be different numbers of files if the total numbers of reads differ, not because the read lengths differ.

    From the reads you showed it looks like the indices have already been clipped and put into the headers. You may want to look at the FastX toolkit to find a way to trim adapter sequences. I align with Novoalign which does it during the alignment.
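
    As a rough idea of what 3' adapter trimming does, here is a minimal exact-match sketch. It is not how the FastX toolkit or Novoalign implement it (real trimmers tolerate mismatches and use base qualities); the function name and the min_overlap default are made up for illustration, and the adapter string in the example is the commonly cited start of the TruSeq adapter read-through, which should be checked against the kit documentation.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching only.

    Cuts at the first full adapter match; otherwise cuts a partial
    adapter (at least min_overlap bases) hanging off the 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Longest read suffix that equals a prefix of the adapter
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATCGG", adapter))           # ACGTACGT
print(trim_adapter("ACGTACGT", adapter))                  # ACGTACGT
```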
