  • Pindel Algorithm Explanation

    Hey,

    I am looking at the Kai Ye paper on Pindel:



    and am not sure exactly what some parts of the algorithm are doing. Specifically, the numbered steps in section 2.3 for finding large deletions:



    (1) Read in the location and the direction of the mapped read from the mapping result obtained in the preprocessing step;

    (2) Define 3' end of the mapped read as the anchor point;

    (3) Use pattern growth algorithm to search for minimum and maximum unique substrings from the 3′ end of the unmapped read within the range of two times of the insert size from the anchor point;

    (4) Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read within the range of read length + Max_D_Size starting from the already mapped 3′ end of the unmapped read obtained in step 3;

    (5) Check whether a complete unmapped read can be reconstructed combining the unique substrings from 5′ and 3′ ends found in steps 3 and 4. If yes, store it in the database U. Note that exact matches and complete reconstruction of the unmapped read are required so that neither gap nor substitution is allowed.
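For readers following along, steps 2 to 5 can be sketched roughly as below. This is a toy illustration and not Pindel's actual code. It assumes the unmapped read has already been converted to reference orientation, so that the end searched in step 3 is the left part of the string and the end searched in step 4 is the right part. The names `find_all`, `grow_unique`, `find_deletion`, `insert_size` and `max_d_size` are hypothetical, and the pattern growth search is replaced by plain exact substring search.

```python
def find_all(text, pattern):
    """Start index of every exact, possibly overlapping, occurrence."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def grow_unique(ref, lo, hi, read, from_left):
    """Grow one end of `read` a base at a time inside ref[lo:hi].

    Returns (min_len, max_len, pos): the shortest and longest lengths that
    match exactly once, plus the reference start (from_left) or end
    (not from_left) of that match. Returns None if nothing is unique.
    """
    lo = max(lo, 0)
    window = ref[lo:hi]
    min_len = max_len = pos = None
    for k in range(1, len(read) + 1):
        piece = read[:k] if from_left else read[-k:]
        hits = find_all(window, piece)
        if not hits:
            break  # exact matching only: stop at the first mismatch
        if len(hits) == 1:
            if min_len is None:
                min_len = k
            max_len = k
            pos = lo + hits[0] + (0 if from_left else k)
    return None if min_len is None else (min_len, max_len, pos)


def find_deletion(ref, read, anchor, insert_size, max_d_size):
    """Return (breakpoint, deletion_length) or None (toy version of steps 2-5)."""
    # Step 3: near end, searched within 2 * insert size of the anchor.
    near = grow_unique(ref, anchor, anchor + 2 * insert_size, read, True)
    if near is None:
        return None
    near_min, near_max, near_start = near
    # Step 4: far end, searched within read length + max_d_size of that hit.
    far = grow_unique(ref, near_start, near_start + len(read) + max_d_size,
                      read, False)
    if far is None:
        return None
    far_min, far_max, far_end = far
    # Step 5: the two pieces must tile the whole read with no gap or overlap.
    lo_k = max(near_min, len(read) - far_max)
    hi_k = min(near_max, len(read) - far_min)
    del_len = (far_end - near_start) - len(read)
    if lo_k > hi_k or del_len <= 0:
        return None
    return near_start + lo_k, del_len  # leftmost valid split point
```

With a 17-base reference `GCACATATATATGGAAC`, a read `GCACATATATGGAAC` and the anchor at position 0, the sketch reports a 2-base deletion at the leftmost possible breakpoint. Every split length between `lo_k` and `hi_k` is an equally valid way to cut the read.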



    To start with, I am not sure about the geometry of (3): searching for substrings from the 3' end of the read within 2 * insert size of the anchor point.

    Specifically:

    How does one search for substrings from the 3' end of a read? Surely this is the end of the sequence.
    "Insert size" seems to mean the average size of the insertions, but it is not clear that this is what was meant.

    Does anyone have any intuition on this paper / the method used?

    Cheers!

  • #2
    DNA fragments are double-stranded, and each read will map to one of the two strands. DNA is synthesized from 5' to 3'. Try googling "DNA strand"; I have put one search result below, although it might not be clear. You could also search YouTube for videos on Illumina's sequencing technology to learn more about it.

    Paired-end sequencing with Illumina (Solexa) reads:

    5'                                3'
    --------------->
    ____________________________________ DNA
    ____________________________________
    3'                                5'
                      <-----------------



    • #3
      I understand what 3' and 5' are, but am just finding the wording quite vague.

      Assuming the unmapped read is the right-hand read, the layout seems to be this:

      _________3' (mapped read)------------------___________3'(unmapped read)


      but from the text that follows it is really not clear to me which region you search in. Presumably you must search backwards, towards the mapped read?

      Cheers,



      • #4
        Please check my ppt at http://www.ebi.ac.uk/~kye/pindel/pin...009_june28.ppt

        There is one animation in the slides about the mapping procedure. Let me know if you have any questions after going through the slides.



        • #5
          Thanks very much! I will check this out early next week.



          • #6
            OK Kai, I have looked through the presentation and understand it better, but am still not 100% sure about the process.

            Here is how I think the geometry is working, and would really appreciate your input on the correctness of this.

            3) This is basically running the algorithm to find the substrings on the reference. The bit I am not sure about is the region of the reference that you use as the sequence database. I think it runs from the 3' end of the mapped read to 3' + 2 * the average spacing between the paired-end reads (is this what you mean by average insert length?). I am afraid I am not sure why you chose this number; is it a heuristic?

            From this you can obtain the locations of minimum and maximum substrings on the reference. In the case of deletions (with the break point located within the read), you would not expect the maximum substring to span the length of the read, as the read is missing letters.

            Now that you have marked the maximum unique substring, you can start looking for the other piece of the read.

            4) From this point on the reference, you can then run the pattern growth algo again to hopefully find the other matching section. I think this is pretty self-explanatory, as the size of the search region is just a user-controlled parameter, which you may want to adjust based on the sensitivity calculations you do later on.


            As a final q, what is the relevance of finding the minimum substring? I would have thought that finding two maximum substrings would have been sufficient - maybe this becomes more obvious when you try to implement it but I feel like I am missing a subtlety here.


            Thank you very much for taking the time to read this, Cheers!



            • #7
              Choosing 2 times the insert size is a heuristic. Together, the maximum and minimum substrings define the range of positions at which the read can be split while still mapping back to the reference genome correctly. Because of local repeats at and around the breakpoints, there is often more than one way to split the read and align the two fragments to the reference genome.

              Please consider the following case:

              reference seq
              GCACATATATATGGAAC

              read seq
              GCACATATATGGAAC

              the split read solution space
              GCAC__ATATATGGAAC
              GCACA__TATATGGAAC
              GCACAT__ATATGGAAC
              GCACATA__TATGGAAC
              GCACATAT__ATGGAAC
              GCACATATA__TGGAAC
              GCACATATAT__GGAAC

              We often use the first one as the correct solution, to left-align the variant.
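The solution space listed above can be reproduced with a small brute-force check. This is only a sketch for this toy case: it assumes the reference and the read start at the same base and differ by a single deletion, and the function name `split_solutions` is made up for illustration.

```python
def split_solutions(ref, read):
    """List every way to split `read` so both parts match `ref` exactly
    around one deletion of length len(ref) - len(read)."""
    gap = len(ref) - len(read)
    if gap <= 0:
        return []
    solutions = []
    for k in range(1, len(read)):
        left_ok = ref[:k] == read[:k]          # left part aligns at the start
        right_ok = ref[k + gap:] == read[k:]   # right part aligns after the gap
        if left_ok and right_ok:
            solutions.append(read[:k] + "_" * gap + read[k:])
    return solutions


sols = split_solutions("GCACATATATATGGAAC", "GCACATATATGGAAC")
for s in sols:
    print(s)
# The first entry, sols[0], is the left-aligned choice.
```

Running it prints the same seven alignments, from `GCAC__ATATATGGAAC` to `GCACATATAT__GGAAC`, so the shortest and longest valid left pieces are 4 and 10 bases long.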

              Originally posted by Jeromek (post #6 above)


              • #8
                I see, thank you very much. So by always taking the minimum for the 5' end and the maximum for the 3' end, you can avoid this problem. And just to check, does the average insert length mean the average distance between paired reads?

                Cheers!
