  • Microindel Detection: defining the equivalent indel region

    Hi folks,
    we have written a small paper about microindel detection in short sequence reads. There is a little twist in it: the position of an indel is often not unambiguously defined by a single position (think of homopolymers). We used a simple algorithm to define an unambiguous equivalent indel region (eir). Using the eir, you can increase the sensitivity of indel detection. It's very simple to implement. Folks interested in microindel detection should think about analysing the eir in their sequence alignments:

    best,
    peter
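
To make the idea of an equivalent indel region concrete, here is a minimal Python sketch for the deletion case. It is an illustration only, not necessarily the algorithm used in the paper: the function name `deletion_eir`, the 0-based coordinates, and the toy reference are all assumptions. It slides a deletion left and right for as long as the resulting sequence stays identical, and reports the leftmost and rightmost start positions.

```python
def deletion_eir(ref, start, length):
    """Return (leftmost_start, rightmost_start) for a deletion of `length`
    bases beginning at 0-based `start`, such that every start in that range
    yields the same sequence after deletion. Hypothetical helper."""
    left = start
    # Shifting left by one keeps the result unchanged if the base entering
    # the deleted window equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Same argument for shifting right by one.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def apply_deletion(ref, start, length):
    """Delete `length` bases from `ref` beginning at `start`."""
    return ref[:start] + ref[start + length:]


# Toy reference with a homopolymer run of five T's (assumed example data).
ref = "GATTTTTCA"
left, right = deletion_eir(ref, 4, 1)

# Every placement inside the region gives the same observed sequence.
observed = {apply_deletion(ref, s, 1) for s in range(left, right + 1)}
print(left, right, observed)
```

Any caller that reports the whole region, rather than one arbitrary position inside it, describes the same event no matter where the aligner happened to open the gap.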

  • #2
    Originally posted by krawitz
    Hi folks,
    we have written a small paper about microindel detection in short sequence reads. There is a little twist in it: the position of an indel is often not unambiguously defined by a single position (think of homopolymers). We used a simple algorithm to define an unambiguous equivalent indel region (eir). Using the eir, you can increase the sensitivity of indel detection. It's very simple to implement. Folks interested in microindel detection should think about analysing the eir in their sequence alignments:

    best,
    peter
    When positioning an indel, it is important for variant calling to have a "left-justify" or similar rule. This way, the alignment is consistent. Re-alignment or local re-assembly will surely help.

    A few criticisms of the paper coming from an alignment author (BFAST):

    - None of the evaluated aligners report indels in ABI SOLiD data.
    - Aligners such as BFAST and SHRiMP that perform gapped local alignment were not evaluated.
    - Some aligners compared do inherently detect indels and are mysteriously included.
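
As a rough illustration of the "left-justify" rule mentioned above, the following Python sketch moves each reported deletion as far left as it can go without changing the resulting sequence, so that reads which placed the same deletion at different positions end up supporting one call. The function name, the `(read, start, length)` tuples, and the toy reference are assumptions made for this example; this is not how any particular aligner or caller implements it.

```python
from collections import Counter


def left_justify_deletion(ref, start, length):
    """Shift a deletion (0-based `start`, `length` bases) to its leftmost
    equivalent position. Hypothetical helper for illustration."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


# Toy reference with a run of five T's (assumed example data).
ref = "GATTTTTCA"

# Three reads carry the same 1 bp deletion, but each alignment opened the
# gap at a different position inside the homopolymer.
calls = [("read1", 2, 1), ("read2", 4, 1), ("read3", 6, 1)]

raw_support = Counter((start, length) for _, start, length in calls)
justified_support = Counter(
    (left_justify_deletion(ref, start, length), length)
    for _, start, length in calls
)

print(raw_support)        # three different positions, one read each
print(justified_support)  # one position supported by all three reads
```

Without the rule, the evidence is split across three positions; with it, all three reads count toward the same candidate indel.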


    • #3
      Hi Nils,

      you are right, a "left-justify" rule would work. However, isn't that somewhat unsatisfying if a crystal-clear definition is available? Think about biologists annotating indels: one group is a "left justifier group", the other a "right justifier group", and it may take a while until they understand that they are talking about the same thing. This might sound ridiculous to you, but things like that happen all the time! So if possible, one should use clear-cut definitions.

      I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact, BFAST may be as accurate as BWA or Novoalign (and I am looking forward to testing it on our next data sets).

      So instead of comparing a plethora of mapping tools, we focused on a few widely used ones. We tested a fast gapped mapper, BWA, and a very accurate one, Novoalign. The bottom line for the biologist who is actually running the experiment is the following: it's possible to detect microindels in short read data (Harismendy et al. were apparently not aware of this), and your sensitivity and positive predictive value will profit mainly from longer reads and good coverage (the alignment tool can give you only a few extra percent).

      Another important message of the paper is that the microindel frequency in human genomes is probably around 1/10000. For this reason it makes a lot of sense to use gapped alignment tools even if you are screening for SNPs, because this reduces the false positive error in SNP calling.

      I would like to emphasize again that it was not possible for us to benchmark all existing mapping tools. I hope the message to take home for the reader is the following: stop using ungapped alignment tools in resequencing projects for mutation screening; there are plenty of gapped aligners, and BFAST is one of them.

      do you agree?

      cheers,

      Peter


      • #4
        Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.


        • #5
          Originally posted by lh3
          I have also seen ungapped alignment may lead to wrong alignments which may deceive breakdancer into calling false translocations.
          Is this effect due to mismapping alone? That is, does ungapped alignment cause mismapping, which then leads to false positives in breakpoint analysis?

          I ask because it is not obvious to me why sporadic mismapping of PE reads would deceive breakpoint detection.


          • #6
            Say the correct place of a read pair is on chr1. One read can be mapped correctly and uniquely, but the mate has a 3bp indel. Once you have this 3bp indel, the best ungapped position of the mate is on chr2 instead of chr1. All reads containing the indel will be placed on chr2. This looks like a strong signal of translocation, but is in fact due to wrong mapping. You will find no breakpoint. I have seen this on simulated data. It should occur much more often on real data.


            • #7
              On the subject of left or right justifying indels: the position of an insert can be affected by base qualities (at least in Novoalign), as a better score can be had by inserting a low quality base than a high quality one.


              • #8
                Originally posted by lh3
                Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.
                Heng, can you please pass us the link to the paper?
                -drd


                • #9
                  Originally posted by nilshomer
                  - None of the evaluated aligners report indels in ABI SOLiD data.
                  Hi krawitz,

                  Any thoughts on that one? I would love to see some results against ABI SOLiD data.
                  -drd


                  • #10
                    Sequencing of natural strains of Arabidopsis thaliana with short reads by Ossowski et al. (2008). It claims that maq is bad at their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but actually in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

                    SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.

                    EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.
                    Last edited by lh3; 02-18-2010, 09:19 AM.
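
The "row of false SNPs" effect can be reproduced on a toy example. The Python sketch below is an assumption-laden illustration, not a real aligner: the reference, the simulated 3 bp deletion, and both helper functions are made up for this purpose. It compares a read carrying the deletion against the reference twice, once base by base with no gap allowed, and once with the known gap opened at the deletion site.

```python
def ungapped_mismatches(ref, read, ref_start):
    """Read positions that disagree with the reference when the read is
    laid down at `ref_start` with no gaps. Hypothetical helper."""
    return [i for i, base in enumerate(read) if ref[ref_start + i] != base]


def mismatches_with_deletion(ref, read, ref_start, del_start, del_len):
    """Same comparison, but skipping `del_len` reference bases at
    `del_start` (a gapped placement with one known deletion)."""
    mismatches = []
    ref_pos = ref_start
    for i, base in enumerate(read):
        if ref_pos == del_start:
            ref_pos += del_len  # open the gap in the read
        if ref[ref_pos] != base:
            mismatches.append(i)
        ref_pos += 1
    return mismatches


# Toy reference and a sample that lacks reference bases 8..10 (assumed data).
ref = "ACGTACGGTCAATGCCTGAAGTCA"
del_start, del_len = 8, 3
sample = ref[:del_start] + ref[del_start + del_len:]
read = sample[:18]  # an error-free read spanning the deletion

false_snps = ungapped_mismatches(ref, read, 0)
gapped = mismatches_with_deletion(ref, read, 0, del_start, del_len)

print("ungapped mismatch positions:", false_snps)
print("gapped mismatch positions:  ", gapped)
```

Every base before the deletion matches in both placements. In the ungapped placement the bases after it are shifted by three, so they disagree with the reference wherever the shifted sequence differs, and those disagreements are what a SNP caller would pick up as a cluster of false SNPs.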


                    • #11
                      Originally posted by lh3
                      Sequencing of natural strains of Arabidopsis thaliana with short reads by Ossowski et al. (2008). It claims that maq is bad at their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but actually in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

                      SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.

                      EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.
                      Heng is right about indels causing errors with ungapped alignment.

                      Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).

                      In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.


                      • #12
                        Originally posted by krawitz
                        Hi Nils,

                        I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact BFAST may be as accurate as BWA or Novoalign (and I am looking forward to test it on our next data sets).
                        I would also be interested in BFAST/SSAHA2 comparison, but as everyone points out - there are just too many aligners out there (last count I had was at 35 programs!!!). And nobody has the time to compare them all.


                        • #13
                          Originally posted by nilshomer
                          Heng is right about indels causing errors with ungapped alignment.

                          Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).

                          In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.
                          What is your opinion of the Broad's GATK approach? I think they use some multiple sequence alignment to follow up the pairwise alignments done with an aligner and then put the reads in agreement.


                          • #14
                            "The 1000 genomes projects found undetected indel is the leading cause of false SNPs."
                            I arrived quite late to this thread, but I'd like to know if there is a reference for such a statement. Thanks


                            • #15
                              krawitz has given that
