**KaiYe** · 08-07-2014, 05:26 AM

DNA fragments are double stranded, and each read will map to one of the two strands. DNA is synthesized from 5' to 3'. Please google "DNA strand"; I put one search result below, although it might not be very clear. You could also search YouTube for videos about Illumina's sequencing technology to learn more.

Paired-end sequencing with Illumina (Solexa) reads:

5'                                 3'
--------------->
____________________________________ DNA
____________________________________
3'                                 5'
                   <-----------------

https://in.answers.yahoo.com/question/index;_ylt=A0LEVzIqfeNTrkwAWtVXNyoA;_ylu=X3oDMTByZHI5MXByBHNlYwNzcgRwb3MDNgRjb2xvA2JmMQR2dGlkAw--?qid=20070614231859AALqIXj

**Jeromek** · 08-08-2014, 07:09 AM

I understand what 3' and 5' are, but am just finding the wording quite vague.

Assuming the unmapped read is the right-hand read, the layout seems to be this:

_________3' (mapped read)------------------___________3'(unmapped read)

But from the subsequent text it is really not clear to me which region you search in. Presumably you must search backwards, towards the mapped read?

Cheers,

**KaiYe** · 08-08-2014, 07:17 AM

Please check my ppt at http://www.ebi.ac.uk/~kye/pindel/pin...009_june28.ppt

There is one animation in the slides about the mapping procedure. Let me know if you have any questions after going through the slides.

**Jeromek** · 08-08-2014, 08:33 AM

Thanks very much! I will check this out early next week.

**Jeromek** · 08-12-2014, 05:28 AM

OK Kai, I have looked through the presentation and understand it better, but I am still not 100% sure about the process.

Here is how I think the geometry is working, and would really appreciate your input on the correctness of this.

3) Basically, run the algorithm to find the substrings on the reference. The bit I am not sure about is the domain on the reference that you are using as the sequence database. I think it runs from the 3' end of the mapped read to 3' + 2 × the average spacing between the paired-end reads (is this what you mean by average insert length?). I am afraid I am not sure why you chose this number. Is it a heuristic?

From this you can obtain the locations of minimum and maximum substrings on the reference. In the case of deletions (with the break point located within the read), you would not expect the maximum substring to span the length of the read, as the read is missing letters.

Now that you have marked the maximum unique substring, you can start looking for the other piece of the read.

4) From this point on the reference, you can then run the pattern growth algorithm again to hopefully find the other matching section. I think this is pretty self-explanatory, as the region of interest is just a user-controlled parameter, which you may want to adjust based on the sensitivity calculations you have done later on, etc.
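To make the geometry in steps 3 and 4 concrete, here is a minimal sketch under stated assumptions. It is not Pindel's implementation: a naive string scan stands in for the pattern growth algorithm, strand and orientation are ignored (the anchored end of the read is simply treated as the start of the string), coordinates are 0-based, and every name and sequence below (`search_window`, `count_occurrences`, `min_max_unique_prefix`, the toy reference) is hypothetical.

```python
def search_window(reference, mapped_3prime, insert_size):
    """Slice the reference from the mapped read's 3' end to 3' + 2 * insert_size."""
    end = min(len(reference), mapped_3prime + 2 * insert_size)
    return reference[mapped_3prime:end]


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def min_max_unique_prefix(window, read):
    """Return (shortest, longest) prefix length of read that occurs exactly once in window.

    Naive stand-in for pattern growth: grow the prefix one base at a time
    and stop as soon as it no longer occurs in the window.
    """
    min_len = max_len = None
    for length in range(1, len(read) + 1):
        hits = count_occurrences(window, read[:length])
        if hits == 0:
            break  # a longer prefix cannot match either
        if hits == 1:
            if min_len is None:
                min_len = length
            max_len = length
    return min_len, max_len


# Toy data (assumed, not from any real genome): a read carrying a 2-base deletion.
reference = "CCCCGCACATATATATGGAACCCC"
read = "GCACATATATGGAAC"
window = search_window(reference, mapped_3prime=4, insert_size=10)
print(window)                               # GCACATATATATGGAACCCC
print(min_max_unique_prefix(window, read))  # (2, 10)
```

In this toy case the prefix becomes unique at length 2 ("GC") and stops matching after length 10, because the read is missing bases that the reference still has.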

As a final question, what is the relevance of finding the minimum substring? I would have thought that finding two maximum substrings would have been sufficient. Maybe this becomes more obvious when you try to implement it, but I feel like I am missing a subtlety here.

Thank you very much for taking the time to read this, Cheers!

**KaiYe** · 08-12-2014, 07:12 AM

Choosing 2 times the insert size is a heuristic. Together, the maximum and minimum substrings define the range within which the read can be split while both fragments still map back to the reference genome correctly. Because of local repeats at and around the breakpoints, there is often more than one way to split the read and align the two fragments to the reference genome.

Please consider the following case

reference seq
GCACATATATATGGAAC

read seq
GCACATATATGGAAC

the split read solution space
GCAC__ATATATGGAAC
GCACA__TATATGGAAC
GCACAT__ATATGGAAC
GCACATA__TATGGAAC
GCACATAT__ATGGAAC
GCACATATA__TGGAAC
GCACATATAT__GGAAC

We often use the first one as the correct solution, to left-align the variant.
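The solution space in the case above can be reproduced with a short sketch. This assumes a simple deletion with no mismatches and a read that spans the whole toy reference; the function names `deletion_splits` and `render_split` are hypothetical and are not part of Pindel.

```python
def deletion_splits(reference, read):
    """Return every left-fragment length k for which read[:k] and read[k:]
    align to the two ends of reference with a single gap in between."""
    gap = len(reference) - len(read)
    if gap <= 0:
        return []
    return [
        k
        for k in range(len(read) + 1)
        if reference[:k] == read[:k] and reference[k + gap:] == read[k:]
    ]


def render_split(read, k, gap):
    """Draw the read with the deleted bases shown as underscores."""
    return read[:k] + "_" * gap + read[k:]


reference = "GCACATATATATGGAAC"
read = "GCACATATATGGAAC"
gap = len(reference) - len(read)

splits = deletion_splits(reference, read)
for k in splits:
    print(render_split(read, k, gap))

# The left-aligned solution is the one with the shortest left fragment.
left_aligned = render_split(read, min(splits), gap)
print("left-aligned:", left_aligned)  # left-aligned: GCAC__ATATATGGAAC
```

The loop prints the same seven alignments listed above, with left-fragment lengths 4 through 10. Taking the smallest length gives the left-aligned placement.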

**Jeromek** · 08-12-2014, 07:53 AM

I see, thank you very much. So by always taking the min for the 5' end and the max for the 3' end you can avoid this problem. And just to check, the average insert length does mean the average distance between paired reads?

Cheers!


Pindel Algorithm Explanation
