**nilshomer** · 02-15-2010, 05:41 PM

Originally posted by krawitz

Hi folks,
we have written a small paper about microindel detection in short sequence reads. There is a little twist in it: the position of an indel is often not unambiguously defined by a single position (think of homopolymers). We used a simple algorithm to define an unambiguous equivalent indel region (eir). Using the eir, you can increase the sensitivity of indel detection. It's very simple to implement. Folks interested in microindel detection should think about analysing the eir in their sequence alignments:

http://www.ncbi.nlm.nih.gov/pubmed/20144947?itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum&ordinalpos=1

best,
peter

When positioning an indel, it is important for variant calling to have a "left-justify" or similar rule. This way, the alignment is consistent. Re-alignment or local re-assembly will surely help.

A few criticisms of the paper coming from an alignment author (BFAST):

- None of the evaluated aligners report indels in ABI SOLiD data.
- Aligners such as BFAST and SHRiMP that perform gapped local alignment were not evaluated.
- Some aligners compared do inherently detect indels and are mysteriously included.

**krawitz** · 02-16-2010, 11:32 AM

Hi Nils,

you are right, a "left-justify" rule would work. However, isn't that somehow unsatisfying if a crystal-clear definition is available? Think about biologists annotating indels: one group is a "left justifier" group, the other a "right justifier" group. It may take a while until they understand that they are talking about the same thing. This might sound ridiculous to you, but things like that happen all the time! So if possible, one has to use clear-cut definitions.
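To make the ambiguity concrete, here is a minimal sketch of an equivalent region for a simple deletion. It is only an illustration of the idea under stated assumptions, not necessarily the algorithm from the paper: the function names `deletion_eir` and `apply_deletion`, the 0-based coordinates, and the toy reference are all hypothetical, and insertions would need an analogous treatment.

```python
from typing import Tuple


def deletion_eir(ref: str, start: int, length: int) -> Tuple[int, int]:
    """Return the leftmost and rightmost 0-based start positions at which
    deleting `length` bases gives the same sequence as deleting at `start`."""
    left = start
    # Shifting the deleted window one base left leaves the result unchanged
    # when the base entering on the left equals the base leaving on the right.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Same argument for shifting the window one base to the right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the sequence obtained by deleting `length` bases at `start`."""
    return ref[:start] + ref[start + length:]


ref = "GATTTTTCA"  # toy reference with a 5-base T homopolymer
left, right = deletion_eir(ref, 3, 1)
# Every start position in [left, right] describes the same deletion.
alleles = {apply_deletion(ref, s, 1) for s in range(left, right + 1)}
print(left, right, alleles)
```

In this toy case a "left justifier" would report position 2 and a "right justifier" position 6, while the region from 2 to 6 covers both and is the same for everybody.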

I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact BFAST may be as accurate as BWA or Novoalign (and I am looking forward to testing it on our next data sets).

So instead of comparing a plethora of mapping tools, we focused on a few widely used ones. We tested a fast gapped mapper, BWA, and a very accurate one, Novoalign. The bottom line for the biologist who is actually running the experiment is the following: it is possible to detect microindels in short read data (Harismendy et al. were apparently not aware of this), and your sensitivity and positive predictive value will mainly profit from longer reads and good coverage (the alignment tool can give you only a few extra percent).

Another important message of the paper is that the microindel frequency in human genomes is probably around 1/10000. For this reason it makes a lot of sense to use gapped alignment tools even if you are screening for SNPs, because this reduces the false positive rate in SNP calling.

I would like to emphasize again that it was not possible for us to benchmark all existing mapping tools. I hope the take-home message for the reader is the following: stop using ungapped alignment tools in resequencing projects for mutation screening. There are plenty of gapped aligners, and BFAST is one of them.

do you agree?

cheers,

Peter

**lh3** · 02-16-2010, 02:18 PM

Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.
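For a rough feel of what these rates mean, here is a back-of-the-envelope sketch. The independence model and the SNP rate of about 1/1,000 are assumptions that are not given in this thread, so the second number is only one possible reading of the ">10%" figure, not its actual derivation.

```python
def prob_read_spans_indel(rate_per_base: float, read_length: int) -> float:
    """Chance that a read covers at least one indel site, assuming indel
    sites occur independently at `rate_per_base` (a simplifying assumption)."""
    return 1.0 - (1.0 - rate_per_base) ** read_length


def indel_fraction_of_variants(indel_rate: float, snp_rate: float) -> float:
    """Share of all variant sites that are indels, for the given rates."""
    return indel_rate / (indel_rate + snp_rate)


# Rates quoted in the thread: ~1/10,000 and ~1/7,000 per base, 70bp reads.
for rate in (1 / 10_000, 1 / 7_000):
    print(rate, prob_read_spans_indel(rate, 70))

# ASSUMPTION: a SNP rate of ~1/1,000 per base, which is not from the thread.
print(indel_fraction_of_variants(1 / 7_000, 1 / 1_000))
```

Under these assumptions roughly 1% of 70bp reads would overlap an indel, and indels would make up about one eighth of all variant sites.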

**Michael.James.Clark** · 02-17-2010, 01:25 PM

Originally posted by lh3

I have also seen ungapped alignment may lead to wrong alignments which may deceive breakdancer into calling false translocations.

Is this effect due to mismapping alone? That is, does ungapped alignment cause mismapping that leads to false positives in breakpoint analysis?

I ask because it is not obvious to me why sporadic mismapping of PE reads would deceive breakpoint detection.

**lh3** · 02-17-2010, 01:43 PM

Say the correct place of a read pair is on chr1. One read can be mapped correctly and uniquely, but the mate has a 3bp indel. Once you have this 3bp indel, the best position of the mate is chr2 instead of chr1. All reads containing the indel will be placed on chr2. This looks like a strong signal of a translocation, but is in fact due to wrong mapping. You will find no breakpoint. I have seen this on simulated data. It should occur much more often on real data.
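The scenario can be reproduced with a toy example. This is a sketch under stated assumptions, not how any real aligner works: the sequences are made up, `best_ungapped_hit` is a hypothetical brute-force search that only counts mismatches, and chr2 is assumed to carry a near-identical paralog that happens to lack the 3 deleted bases.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hit(read: str, contigs: dict) -> tuple:
    """Return (mismatches, contig, offset) of the best placement without gaps."""
    best = None
    for name, seq in contigs.items():
        for off in range(len(seq) - len(read) + 1):
            mm = hamming(read, seq[off:off + len(read)])
            if best is None or mm < best[0]:
                best = (mm, name, off)
    return best


flank_a = "ACGTTGCAAGGCTA"
flank_b = "CCATGGATTCAGTC"
# True locus on chr1: the sample has lost the 3 bases "GAT" between the flanks.
chr1 = "TTTT" + flank_a + "GAT" + flank_b + "AAAA"
read = flank_a + flank_b
# Assumed paralog on chr2: same flanks without "GAT", plus one substitution.
paralog = flank_a + flank_b[:6] + "G" + flank_b[7:]
chr2 = "GGGG" + paralog + "CCCC"

print(best_ungapped_hit(read, {"chr1": chr1, "chr2": chr2}))
print(best_ungapped_hit(read, {"chr1": chr1}))
```

Without gaps the read fits chr2 with a single mismatch, whereas every placement on chr1 needs many more, so the mate ends up on the wrong chromosome. A gapped aligner could instead place it on chr1 with one 3bp deletion.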

**sparks** · 02-18-2010, 01:15 AM

On the subject of left- or right-justifying indels: the position of an insert can be affected by base qualities (at least in Novoalign), as a better score can be had by inserting a low-quality base than a high-quality one.

**drio** · 02-18-2010, 08:54 AM

Originally posted by lh3

Yes. I second this: gapped alignment is very important to SNP discovery (first emphasized in a 2008 paper and rediscovered several times). In addition, the human indel mutation rate accessible to 70bp reads is higher than 1/10,000, more likely to be ~1/7,000. Not doing gapped alignment throws away >10% of important mutations. I have also seen that ungapped alignment may lead to wrong alignments, which may deceive breakdancer into calling false translocations. I do not know how gapped alignment affects ChIP-seq/RNA-seq, but using a gapped aligner is recommended anyway.

Heng can you please pass us the link to the paper?

**drio** · 02-18-2010, 08:56 AM

Originally posted by nilshomer

- None of the evaluated aligners report indels in ABI SOLiD data.

Hi krawitz,

Any thoughts on that one? I would love to see some results on ABI SOLiD data.

**lh3** · 02-18-2010, 09:07 AM

"Sequencing of natural strains of Arabidopsis thaliana with short reads" by Ossowski et al. (2008). It claims that maq does badly on their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.
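A toy illustration of the clustered false SNPs, as a sketch only: the sequences and the helper `mismatch_positions` are made up for this example, and the read is simply forced onto the reference without a gap.

```python
def mismatch_positions(read: str, ref: str, offset: int = 0) -> list:
    """Read positions that disagree with the reference when the read is
    laid onto `ref` at `offset` without any gap."""
    return [i for i, base in enumerate(read) if base != ref[offset + i]]


ref = "ACGTTGCAAGGCTA" + "GAT" + "CCATGGATTCAGTC"
# The sample lacks the 3 bases "GAT", so the read joins the two flanks.
read = "ACGTTGCAAGGCTA" + "CCATGGATTC"

# Without a gap, everything downstream of the deletion is shifted by 3 bases.
print(mismatch_positions(read, ref))
# With a 3-base gap after read position 14, the rest of the read matches.
print(read[14:] == ref[17:27])
```

Here the ungapped placement turns one 3bp deletion into ten consecutive mismatches, each of which could be taken for a SNP.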

EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.

Attached Files

108v-bw.pdf (9.4 KB, 150 views)

**nilshomer** · 02-18-2010, 10:18 AM

Originally posted by lh3

"Sequencing of natural strains of Arabidopsis thaliana with short reads" by Ossowski et al. (2008). It claims that maq does badly on their single-end data set because maq does not do gapped alignment for such data. It is a fair claim, but in the maq paper we had already noticed the importance of indels and applied the "indel filter". The NA18507 and YanHuang papers both applied the indel filter. The 1000 Genomes Project found that undetected indels are the leading cause of false SNPs.

SOLiD data makes no difference. If you do not do gapped alignment, you will get clustered false SNPs and clustered false alignments.

EDIT: the attached plot (on simulated data) is additional evidence that ungapped alignment leads to more false SNPs. It is true that most of these false SNPs are around a true indel, but we really want to see the indel rather than a row of false SNPs.

Heng is right about indels causing errors with ungapped alignment.

Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).
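A small sketch of why the two rules can disagree. The quality-based rule below is a toy model (treat the lowest-quality base of the run as the inserted one) and an assumption for illustration, not Novoalign's actual scoring. The function names, coordinates, and quality values are hypothetical.

```python
def quality_based_position(run_start: int, run_quals: list) -> int:
    """Toy rule: inside a homopolymer run, call the lowest-quality read base
    the inserted one and report the matching reference position."""
    i = min(range(len(run_quals)), key=run_quals.__getitem__)
    return run_start + i


def left_justified_position(run_start: int, run_quals: list) -> int:
    """Justification rule: always report the leftmost equivalent position."""
    return run_start


# Two reads carrying the same 1-base insertion in a run starting at
# reference position 100, but with different base qualities along the run.
read_a_quals = [30, 30, 8, 30, 30]
read_b_quals = [30, 12, 30, 30, 5]

for quals in (read_a_quals, read_b_quals):
    print(quality_based_position(100, quals), left_justified_position(100, quals))
```

With the toy quality rule the two reads report the insertion at 102 and 104, while the justification rule puts both at 100, so the reads agree.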

In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.

**NGSfan** · 03-12-2010, 06:31 AM

Originally posted by krawitz

Hi Nils,

I would also like to comment on your criticism. The paper was not intended as a benchmarking study of alignment tools. It is just not possible to thoroughly analyze and compare all short read mapping tools. In fact BFAST may be as accurate as BWA or Novoalign (and I am looking forward to testing it on our next data sets).

I would also be interested in a BFAST/SSAHA2 comparison, but as everyone points out, there are just too many aligners out there (my last count was 35 programs!!!). And nobody has the time to compare them all.

**NGSfan** · 03-12-2010, 08:26 AM

Originally posted by nilshomer

Heng is right about indels causing errors with ungapped alignment.

Also, novoalign is correct to use qualities when aligning each read independently. However, what we really want is for all reads to "agree" where the indel occurs to create an accurate assembly. Having a "justification rule" will solve this, whereas aligning an indel based on quality will add noise (some reads will support the right or left positioning based on quality).

In any case, I think that some type of local re-alignment/re-assembly is required since all short read aligners align reads independently.

What is your opinion of the Broad's GATK approach? I think they use some multiple sequence alignment to follow up the pairwise alignments done with an aligner and then put the reads in agreement.

**maricu** · 09-17-2010, 01:08 AM

"The 1000 genomes projects found undetected indel is the leading cause of false SNPs."
I arrived quite late to this thread, but I'd like to know if there is a reference for such a statement. Thanks

**lh3** · 09-17-2010, 09:30 AM

krawitz has given that


Microindel Detection: defining the equivalent indel region
