Alignment tool for use with ambiguous reference?

bioinfosm replied

11-18-2010, 11:44 AM
Thanks for the useful notes!

Essentially what I am hearing is, allele specific expression is the biggest application of SNP masked reference. For general SNP calling, it does not gain much, but if it does not hurt, why not totally move to SNP-masked-reference, except if it is too slow?
Leave a comment:
epigen replied

11-18-2010, 10:06 AM
reference with ambiguous nucleotides

Also, is the reference _with ambiguous nucleotides_ published/available somewhere?[/QUOTE]

Index of /goldenPath/hg19/snp131Mask

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/snp131Mask/
Leave a comment:
sparks replied

11-17-2010, 08:34 PM
A single SNP can cause drop in number of alignments from 5-25%, if that's significant in your expression level studies then use an aligner that can handle ambiguous codes in the reference but you do need to put these codes into the reference at known SNPs.
Where it's most important is when users are looking for allelic specific expression biases in which case you should always use an aligner that accepts ambiguous codes in the reference.
The problem will be exacerbated if there are two or more SNPs in a read, this probably doesn't happen often but you should expect a big drop in the number of alignments.
LH3 is also right that you'll still have problem with indels.

Originally posted by bioinfosm View Post

I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
Leave a comment:
lh3 replied

11-16-2010, 03:37 PM
Reference bias mainly matters for a few bias-critical applications such as allele specific expressions and some popgen studies. Its effect on general SNP/indel/SV calling is negligible IMO. In addition, you have to know those SNPs before hand, which is not easily obtained. Incorporating SNPs does not solve the problem caused by indels, either.
Leave a comment:
bioinfosm replied

11-16-2010, 02:56 PM
I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
Leave a comment:
sparks replied

11-03-2010, 08:18 PM
I thought there might have been some confusion about scoring/penalty schemes. I'm glad that's sorted out.
Leave a comment:
lh3 replied

11-03-2010, 06:16 PM
Actually i did not realize that you were talking "penalty" instead of "score". Let's use minus score for penalty. The authors were trying Penalty(A<->R)=-30. I am arguing Penalty(A<->R)=0 is least biased. Now I buy that Penalty(A<->R)=-3 by novoalign should also work.

This bias is rooted in homologous regions, but in highly unique regions, any masking strategies work equally well. The hard part is homology. Nonetheless, the authors of this paper have also shown me other data that even Penalty(A<->R)=0 will lead to significant extreme bias. Using novoalign/gsnap/mosaik is a must-have, but we have to apply additional steps to avoid bias. Thanks for the explanation, Colin.
Leave a comment:
sparks replied

11-03-2010, 05:46 PM
Hi Heng,
I have a slightly different interpretation of this paper, first they appear to have masked SNPs by forcing a mismatch. i.e. If SNP would be A/G (R) they changed the reference to C or T thus forcing a mismatch at that position and a high penalty of 30, not zero. (I'm not sure of BWA/MAQ scoring method but Novoalign will usually score a match as zero and a mismatch as 30, depending on base call quality) But by forcing a mismatch they effectively bias against alignment at this location and force either unmapped reads or alignment to homologous regions. They even say this "We find that the strong biases occur at SNPs for which the flanking sequence shares sequence identity with another region of the genome (Figure 3)"
The bias they see isn't really from the aligner but a product of homologous regions and that they've forced a mismatch at SNPs.
I would expect Novoalign to remove much of the bias that they reported as SNPs marked up with ambiguous codes will not get penalised as full mismatches and there will be much less bias to homologous regions.

Colin
Leave a comment:
lh3 replied

11-03-2010, 04:50 AM
I would suggest people read "effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data" published in bioinformatics last year. I used to think masking the genome in whatever way will remove most of reference biases, but they have convinced me that this bias is more complicated than my naive thinking.

For their example, giving a score 30 to A<->R match will lead to a smaller bias than a score 3, although this does not remove all the extreme biases. At that time, the authors were giving A<->R a score 0.
Leave a comment:
sparks replied

11-03-2010, 02:13 AM
Hi Heng,

Novoalign should remove allelic bias.
If you have an ambiguous code such R in the reference genome it will match to an A or G in the read with the same alignment score and the same chance that the read will align. In this case a match of A or G in the read will score 3 vs a mismatch (C or T in read) will score 30 (depending on base quality). There should be no allelic bias.

The small alignment score of 3 for the match to the ambiguous code comes into play when we have reads that map to multiple locations. Say the read maps to two places in the genome with no mismatches but one mapped location has an ambiguous code in the reference and the other location is an exact match with no ambiguous code, then the location with no ambiguous codes gets a slightly lower alignment score. If we then use -r Random option to report a randomly selected alignment for multiple mappings then the mapping location with no ambiguous code will have a higher chance of being reported than the mapping with the IUB ambiguous code. So there is a bias to report multi-mapping reads to locations with least number of ambiguous codes but I don't think that it's allelic bias. A problem could arise with multi-copy genes if only some copies are marked up with SNPs or if they're marked up differently.

Also we can't give a score of zero for matches to IUB ambiguous codes as N is an ambiguous code and this would mean a read aligning against a block of N's would score the same as a perfect alignment.

Colin

Originally posted by lh3 View Post

mosaik or gsnap. novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. possibly they have the same issue. also read a great paper last year. the authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.
Leave a comment:
lindseyjane replied

11-02-2010, 12:57 AM
Thank you for your advice, I am checking out gsnap.
Leave a comment:
sjackman replied

11-01-2010, 09:22 AM
gsnap can map to a reference enhanced by a list of known SNPs. See

http://research-pub.gene.com/gmap/src/README

Cheers,
Shaun

SNP-tolerant alignment in GSNAP
===============================

GSNAP has the ability to align to a reference space of all possible
major and minor alleles in a set of known SNPs provided by the user.
Leave a comment:
lh3 replied

11-01-2010, 09:05 AM
mosaik or gsnap. novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. possibly they have the same issue. also read a great paper last year. the authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.
Leave a comment:
lindseyjane started a topic Alignment tool for use with ambiguous reference?

11-01-2010, 08:45 AM
Alignment tool for use with ambiguous reference?

Does anyone know of an alignment tool other than novoalign that we can use with a reference that contains ambiguity codes as well as A,G,C,Ts? Our aim is to do allele specific expression analysis so really need to eliminate bias for the preferential mapping of the reference allele among the reads.

Thanks
Tags: None

Previous template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: