Hey,
I am looking at the Kai Ye paper on Pindel:
and am not sure what parts of the algorithm are actually doing. Specifically, the numbered steps in Section 2.3 for finding large deletions:
(1) Read in the location and the direction of the mapped read from the mapping result obtained in the preprocessing step;
(2) Define 3′ end of the mapped read as the anchor point;
(3) Use pattern growth algorithm to search for minimum and maximum unique substrings from the 3′ end of the unmapped read within the range of two times of the insert size from the anchor point;
(4) Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read within the range of read length + Max_D_Size starting from the already mapped 3′ end of the unmapped read obtained in step 3;
(5) Check whether a complete unmapped read can be reconstructed combining the unique substrings from 5′ and 3′ ends found in steps 3 and 4. If yes, store it in the database U. Note that exact matches and complete reconstruction of the unmapped read are required so that neither gap nor substitution is allowed.
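To make the question concrete, here is how I currently picture steps 3 to 5, as a toy Python sketch. It is only my guess at the geometry, not Pindel's code: the function names `grow_unique` and `find_deletion` are made up, a naive exact-match re-scan that grows the substring one base at a time stands in for the paper's pattern growth algorithm, and I assume the unmapped read is already written in reference orientation (strand handling is ignored, so "near" and "far" end here only mean closer to and further from the anchor).

```python
def grow_unique(ref, read, lo, hi, from_left=True):
    """Grow a prefix (from_left) or suffix of `read` one base at a time.

    Returns {length: ref_start} for every length whose exact match
    inside ref[lo:hi] is unique. Naive re-scan, for illustration only.
    """
    window = ref[lo:hi]
    unique = {}
    for length in range(1, len(read) + 1):
        pattern = read[:length] if from_left else read[-length:]
        hits = [i for i in range(len(window) - length + 1)
                if window.startswith(pattern, i)]
        if not hits:
            break  # a longer pattern cannot match either
        if len(hits) == 1:
            unique[length] = lo + hits[0]
    return unique


def find_deletion(ref, read, anchor, insert_size, max_d_size):
    """Return (del_start, del_end) if the read splits exactly around a gap."""
    # Step 3 (as I read it): near end of the read, searched within
    # 2 * insert_size of the anchor point.
    near = grow_unique(ref, read, anchor,
                       min(len(ref), anchor + 2 * insert_size))
    for near_len in sorted(near, reverse=True):
        near_end = near[near_len] + near_len  # first ref base after near part
        far_len = len(read) - near_len
        if far_len == 0:
            continue  # whole read maps contiguously, nothing to split
        # Step 4: far end of the read, searched within
        # read length + max_d_size of the part mapped in step 3.
        far = grow_unique(ref, read, near_end,
                          min(len(ref), near_end + len(read) + max_d_size),
                          from_left=False)
        # Step 5: the two parts must rebuild the whole read exactly.
        if far_len in far and far[far_len] > near_end:
            return near_end, far[far_len]
    return None


# Toy reference: two flanks around a 10-base stretch that the read skips.
ref = "GATTACAGGC" + "TTTTTTTTTT" + "CCGTAAGCTA"
read = "CAGGC" + "CCGTA"  # last 5 bases of flank 1 + first 5 of flank 2
result = find_deletion(ref, read, anchor=0, insert_size=10, max_d_size=20)
print(result)  # prints (10, 20): ref[10:20] is missing from the read
```

In this toy the read only reconstructs if both parts match exactly, which is how I understand the "neither gap nor substitution" note in step 5. Whether the real search windows are placed like this is exactly what I am unsure about.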
First, I am not sure about the geometry of step (3): searching for substrings from the 3′ end of the unmapped read within 2 × insert size of the anchor point.
Specifically:
How does one search for substrings from the 3′ end of a read? Surely this is the end of the sequence?
Also, what does "insert size" refer to here? I read it as the average size of the insertions, but it is not clear that this is what was meant.
Does anyone have any intuition on this paper / the method used?
Cheers!