**lh3** · 11-01-2010, 09:05 AM

Mosaik or GSNAP. Novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how Mosaik and GSNAP cope with ambiguous bases; possibly they have the same issue. I also read a great paper last year. The authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.

**sjackman** · 11-01-2010, 09:22 AM

gsnap can map to a reference enhanced by a list of known SNPs. See

http://research-pub.gene.com/gmap/src/README

Cheers,
Shaun

SNP-tolerant alignment in GSNAP
===============================

GSNAP has the ability to align to a reference space of all possible
major and minor alleles in a set of known SNPs provided by the user.

**lindseyjane** · 11-02-2010, 12:57 AM

Thank you for your advice, I am checking out gsnap.

**sparks** · 11-03-2010, 02:13 AM

Hi Heng,

Novoalign should remove allelic bias.
If you have an ambiguous code such as R in the reference genome, it will match an A or a G in the read with the same alignment score and the same chance that the read will align. In this case a match of A or G in the read will score 3, whereas a mismatch (C or T in the read) will score 30 (depending on base quality). There should be no allelic bias.

The small alignment score of 3 for the match to the ambiguous code comes into play when we have reads that map to multiple locations. Say the read maps to two places in the genome with no mismatches, but one mapped location has an ambiguous code in the reference and the other location is an exact match with no ambiguous code. Then the location with no ambiguous codes gets a slightly lower (that is, better) alignment score. If we then use the -r Random option to report a randomly selected alignment for multiple mappings, the mapping location with no ambiguous code will have a higher chance of being reported than the mapping with the IUB ambiguous code. So there is a bias to report multi-mapping reads to the locations with the fewest ambiguous codes, but I don't think that it's allelic bias. A problem could arise with multi-copy genes if only some copies are marked up with SNPs or if they're marked up differently.

Also we can't give a score of zero for matches to IUB ambiguous codes as N is an ambiguous code and this would mean a read aligning against a block of N's would score the same as a perfect alignment.
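The scoring described above can be sketched in a few lines of Python. This is only an illustration of the numbers quoted in this post (0 for an exact match, 3 for a match to an ambiguous code, a nominal 30 for a mismatch), not Novoalign's actual implementation; the function names are made up for the example, base qualities are ignored, and read bases are assumed to be plain A/C/G/T.

```python
# IUPAC (IUB) nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

EXACT = 0      # read base equals an unambiguous reference base
AMBIGUOUS = 3  # read base is one of the bases allowed by an ambiguity code
MISMATCH = 30  # nominal value; the real penalty depends on base quality


def base_penalty(ref_base, read_base):
    """Penalty for one read base against one reference base (lower is better)."""
    allowed = IUPAC[ref_base]
    if read_base not in allowed:
        return MISMATCH
    return EXACT if len(allowed) == 1 else AMBIGUOUS


def alignment_penalty(ref, read):
    """Sum of per-base penalties for an ungapped alignment of equal length."""
    if len(ref) != len(read):
        raise ValueError("ref and read must have the same length")
    return sum(base_penalty(r, b) for r, b in zip(ref, read))


print(alignment_penalty("ACRT", "ACAT"))  # 3: A matches R
print(alignment_penalty("ACRT", "ACGT"))  # 3: G matches R equally well
print(alignment_penalty("ACRT", "ACCT"))  # 30: C is a mismatch against R
print(alignment_penalty("NNNN", "ACGT"))  # 12: a block of N's is not free
print(alignment_penalty("ACGT", "ACGT"))  # 0: perfect alignment
```

Both alleles of the R site get the same penalty, and the non-zero cost for ambiguous matches is what keeps a block of N's from scoring like a perfect alignment.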

Colin

**lh3** · 11-03-2010, 04:50 AM

I would suggest people read "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", published in Bioinformatics last year. I used to think that masking the genome in whatever way would remove most reference bias, but they have convinced me that this bias is more complicated than my naive thinking.

For their example, giving a score of 30 to an A<->R match will lead to a smaller bias than a score of 3, although this does not remove all the extreme biases. At that time, the authors were giving A<->R a score of 0.

**sparks** · 11-03-2010, 05:46 PM

Hi Heng,
I have a slightly different interpretation of this paper. First, they appear to have masked SNPs by forcing a mismatch, i.e. if the SNP would be A/G (R) they changed the reference to C or T, thus forcing a mismatch at that position and a high penalty of 30, not zero. (I'm not sure of the BWA/MAQ scoring method, but Novoalign will usually score a match as zero and a mismatch as 30, depending on base call quality.) But by forcing a mismatch they effectively bias against alignment at this location and force either unmapped reads or alignment to homologous regions. They even say this: "We find that the strong biases occur at SNPs for which the flanking sequence shares sequence identity with another region of the genome (Figure 3)".
The bias they see isn't really from the aligner but a product of homologous regions and of the fact that they've forced a mismatch at SNPs.
I would expect Novoalign to remove much of the bias that they reported, as SNPs marked up with ambiguous codes will not get penalised as full mismatches and there will be much less bias towards homologous regions.

Colin

**lh3** · 11-03-2010, 06:16 PM

Actually I did not realize that you were talking about "penalty" instead of "score". Let's use a minus score for penalty. The authors were trying Penalty(A<->R)=-30. I am arguing that Penalty(A<->R)=0 is least biased. Now I buy that Penalty(A<->R)=-3 by novoalign should also work.

This bias is rooted in homologous regions; in highly unique regions, any masking strategy works equally well. The hard part is homology. Nonetheless, the authors of this paper have also shown me other data indicating that even Penalty(A<->R)=0 will lead to significant extreme bias. Using novoalign/gsnap/mosaik is a must-have, but we have to apply additional steps to avoid bias. Thanks for the explanation, Colin.
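To make the homology point concrete, here is a toy Python sketch comparing the three schemes discussed above: a forced mismatch at the SNP, an R code with penalty 3, and an R code with penalty 0. The sequences, the paralog and the helper names are all invented for illustration; they come neither from the paper nor from any aligner. Penalties are written as positive numbers, lower being better.

```python
# Subset of IUPAC codes needed for this toy example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def penalty(ref, read, ambiguous_penalty, mismatch_penalty=30):
    """Ungapped alignment penalty; lower is better."""
    total = 0
    for r, b in zip(ref, read):
        allowed = IUPAC[r]
        if b not in allowed:
            total += mismatch_penalty
        elif len(allowed) > 1:
            total += ambiguous_penalty
    return total


def placement(read, true_locus, paralog, ambiguous_penalty):
    """Report which of the two candidate loci the read is assigned to."""
    t = penalty(true_locus, read, ambiguous_penalty)
    p = penalty(paralog, read, ambiguous_penalty)
    if t < p:
        return "true locus"
    if p < t:
        return "paralog"
    return "tie"


# Hypothetical setup: an A/G SNP at the middle base of the true locus, and a
# paralog with identical flanks that carries G at that position.
PARALOG = "TTGCC"
READ_A = "TTACC"  # read from the true locus carrying the A allele
READ_G = "TTGCC"  # read from the true locus carrying the G allele

schemes = {
    "forced mismatch (C)": ("TTCCC", 0),
    "R with penalty 3": ("TTRCC", 3),
    "R with penalty 0": ("TTRCC", 0),
}
for name, (true_locus, amb) in schemes.items():
    a = placement(READ_A, true_locus, PARALOG, amb)
    g = placement(READ_G, true_locus, PARALOG, amb)
    print(f"{name}: A read -> {a}, G read -> {g}")
```

In this constructed case the two alleles are treated differently under every scheme: with penalty 0 the G read ties with the paralog while the A read maps uniquely, which is the kind of residual bias in homologous regions that masking alone cannot fix.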

**sparks** · 11-03-2010, 08:18 PM

I thought there might have been some confusion about scoring/penalty schemes. I'm glad that's sorted out.

**bioinfosm** · 11-16-2010, 02:56 PM

I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

Also, is the reference _with ambiguous nucleotides_ published/available somewhere?

**lh3** · 11-16-2010, 03:37 PM

Reference bias mainly matters for a few bias-critical applications such as allele-specific expression and some popgen studies. Its effect on general SNP/indel/SV calling is negligible IMO. In addition, you have to know those SNPs beforehand, and that information is not easily obtained. Incorporating SNPs does not solve the problem caused by indels, either.

**sparks** · 11-17-2010, 08:34 PM

A single SNP can cause a drop of 5-25% in the number of alignments. If that's significant in your expression level studies, then use an aligner that can handle ambiguous codes in the reference, but you do need to put these codes into the reference at known SNPs.
Where it's most important is when users are looking for allele-specific expression biases, in which case you should always use an aligner that accepts ambiguous codes in the reference.
The problem will be exacerbated if there are two or more SNPs in a read. This probably doesn't happen often, but when it does you should expect a big drop in the number of alignments.
LH3 is also right that you'll still have problems with indels.
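Putting the codes into the reference at known SNPs could look roughly like the Python sketch below. It is a minimal illustration only: `mask_snps` and its input format (a dict of 0-based positions to alternate alleles) are hypothetical, the reference base is assumed to be one of the alleles at each site, and indels are not handled.

```python
# IUPAC (IUB) nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Reverse lookup: set of bases -> single-letter ambiguity code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def mask_snps(seq, snps):
    """Return seq with an IUPAC code written at each known SNP position.

    snps maps a 0-based position to the alternate allele(s) observed there,
    e.g. {2: "A"} or {4: "CT"}. The reference base is always kept as one of
    the alleles (an assumption of this sketch).
    """
    out = list(seq.upper())
    for pos, alleles in snps.items():
        bases = {a.upper() for a in alleles} | {out[pos]}
        out[pos] = CODE_FOR[frozenset(bases)]
    return "".join(out)


print(mask_snps("ACGTAC", {2: "A", 4: "C"}))  # ACRTMC
```

Here the G/A site becomes R and the A/C site becomes M, so a read carrying either allele can match the reference at those positions.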

**epigen** · 11-18-2010, 10:06 AM

reference with ambiguous nucleotides

Also, is the reference _with ambiguous nucleotides_ published/available somewhere?

Index of /goldenPath/hg19/snp131Mask

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/snp131Mask/

**bioinfosm** · 11-18-2010, 11:44 AM

Thanks for the useful notes!

Essentially, what I am hearing is that allele-specific expression is the biggest application of a SNP-masked reference. For general SNP calling it does not gain much, but if it does not hurt, why not move entirely to a SNP-masked reference, unless it is too slow?

Alignment tool for use with ambiguous reference?