Thanks for the useful notes!
Essentially, what I am hearing is that allele-specific expression is the biggest application of a SNP-masked reference. For general SNP calling it does not gain much, but if it does not hurt either, why not move to a SNP-masked reference entirely, unless it is too slow?
-
A single SNP can cause a 5-25% drop in the number of alignments. If that is significant for your expression-level studies, use an aligner that can handle ambiguous codes in the reference, but you do need to put these codes into the reference at the known SNPs.
It matters most when you are looking for allele-specific expression biases, in which case you should always use an aligner that accepts ambiguous codes in the reference.
The problem is exacerbated if there are two or more SNPs in a read. This probably doesn't happen often, but when it does you should expect a big drop in the number of alignments.
LH3 is also right that you'll still have problems with indels.
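To make "putting these codes into the reference at known SNPs" concrete, here is a minimal Python sketch. It is only an illustration: the function name `mask_snps` and the `(position, ref, alt)` tuple layout are assumptions made for this example, not the input format of any particular aligner, and it handles two-allele SNPs only.

```python
# Standard IUPAC ambiguity codes for two-allele SNPs.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def mask_snps(reference, snps):
    """Return a copy of `reference` with IUPAC codes at known SNP positions.

    `snps` is a hypothetical list of (0-based position, ref allele, alt allele).
    """
    seq = list(reference.upper())
    for pos, ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Guard against SNP lists built on a different reference build.
        if seq[pos] != ref:
            raise ValueError(f"reference base at {pos} is {seq[pos]}, expected {ref}")
        seq[pos] = IUPAC[frozenset((ref, alt))]
    return "".join(seq)


masked = mask_snps("ACGTACGT", [(2, "G", "A"), (5, "C", "T")])
print(masked)
```

The check against the reference allele is a design choice for the sketch: a SNP list with off-by-one coordinates would otherwise silently mask the wrong bases.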
Originally posted by bioinfosm: I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?
Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
-
Reference bias mainly matters for a few bias-critical applications such as allele-specific expression and some popgen studies. Its effect on general SNP/indel/SV calling is negligible IMO. In addition, you have to know those SNPs beforehand, and that information is not easily obtained. Incorporating SNPs does not solve the problem caused by indels, either.
-
I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?
Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
-
I thought there might have been some confusion about scoring/penalty schemes. I'm glad that's sorted out.
-
Actually I did not realize that you were talking about "penalty" instead of "score". Let's use a negative score for a penalty. The authors were trying Penalty(A<->R)=-30. I am arguing that Penalty(A<->R)=0 is the least biased. Now I buy that Penalty(A<->R)=-3, as used by novoalign, should also work.
This bias is rooted in homologous regions; in highly unique regions, any masking strategy works equally well. The hard part is homology. Nonetheless, the authors of this paper have also shown me other data in which even Penalty(A<->R)=0 leads to significant extreme bias. Using novoalign/gsnap/mosaik is a must-have, but we have to apply additional steps to avoid bias. Thanks for the explanation, Colin.
-
Hi Heng,
I have a slightly different interpretation of this paper. First, they appear to have masked SNPs by forcing a mismatch, i.e. if the SNP was A/G (R), they changed the reference to C or T, thus forcing a mismatch at that position with a high penalty of 30, not zero. (I'm not sure of the BWA/MAQ scoring method, but Novoalign will usually score a match as zero and a mismatch as 30, depending on base call quality.) By forcing a mismatch they effectively bias against alignment at this location and force either unmapped reads or alignment to homologous regions. They even say this: "We find that the strong biases occur at SNPs for which the flanking sequence shares sequence identity with another region of the genome (Figure 3)"
The bias they see isn't really from the aligner; it is a product of homologous regions and of having forced a mismatch at SNPs.
I would expect Novoalign to remove much of the bias they reported, as SNPs marked up with ambiguous codes will not be penalised as full mismatches and there will be much less bias towards homologous regions.
Colin
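The difference between the three reference treatments discussed here can be sketched with a toy penalty function. This is only an illustration under stated assumptions: the penalties (0 for a match, 3 for a match to an ambiguity code, 30 for a mismatch) are the approximate Novoalign values quoted in this thread, base quality is ignored, the sequences are made up, and `penalty` is a hypothetical helper, not any aligner's actual code.

```python
# Bases each reference symbol is allowed to match (subset of the IUPAC codes).
ALLOWED = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "N": "ACGT"}


def penalty(read, ref, match=0, ambiguous=3, mismatch=30):
    """Sum per-base penalties for an ungapped alignment of `read` to `ref`."""
    total = 0
    for base, ref_code in zip(read, ref):
        allowed = ALLOWED[ref_code]
        if base not in allowed:
            total += mismatch      # base not permitted at this position
        elif len(allowed) > 1:
            total += ambiguous     # match to an ambiguity code
        else:
            total += match         # exact match
    return total


# An A/G SNP at index 4 of a made-up locus.
references = {
    "unmasked (A allele)": "ACGTAGCTA",
    "forced mismatch (C)": "ACGTCGCTA",
    "IUPAC code (R)": "ACGTRGCTA",
}
reads = {"A allele": "ACGTAGCTA", "G allele": "ACGTGGCTA"}

for label, ref in references.items():
    scores = {allele: penalty(read, ref) for allele, read in reads.items()}
    print(label, scores)
```

Both the forced mismatch and the R code treat the two alleles equally, but the forced mismatch does so at a penalty of 30, which is what can push reads towards a homologous region, while the R code costs only 3.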
-
I would suggest people read "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", published in Bioinformatics last year. I used to think that masking the genome in whatever way would remove most reference bias, but they have convinced me that this bias is more complicated than my naive thinking.
In their example, giving a score of 30 to an A<->R match leads to a smaller bias than a score of 3, although this does not remove all the extreme biases. At that time, the authors were giving A<->R a score of 0.
-
Hi Heng,
Novoalign should remove allelic bias.
If you have an ambiguous code such as R in the reference genome, it will match an A or a G in the read with the same alignment score, and therefore with the same chance that the read will align. In this case a match of A or G in the read scores 3, whereas a mismatch (C or T in the read) scores 30 (depending on base quality). There should be no allelic bias.
The small alignment score of 3 for a match to the ambiguous code comes into play when reads map to multiple locations. Say the read maps to two places in the genome with no mismatches, but one location has an ambiguous code in the reference and the other is an exact match with no ambiguous code. The location with no ambiguous code then gets a slightly lower alignment score. If we use the -r Random option to report a randomly selected alignment for multiple mappings, the location with no ambiguous code has a higher chance of being reported than the one with the IUB ambiguous code. So there is a bias towards reporting multi-mapping reads at the locations with the fewest ambiguous codes, but I don't think that is allelic bias. A problem could arise with multi-copy genes if only some copies are marked up with SNPs, or if they are marked up differently.
Also we can't give a score of zero for matches to IUB ambiguous codes as N is an ambiguous code and this would mean a read aligning against a block of N's would score the same as a perfect alignment.
Colin
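A toy calculation may help with the multi-mapping and N-block points. It is a sketch only: the penalty values are the approximate ones quoted in this post, base quality and gaps are ignored, the sequences are made up, and `alignment_penalty` is a hypothetical helper rather than Novoalign's implementation.

```python
def alignment_penalty(read, ref, ambiguous_match=3, mismatch=30):
    """Ungapped penalty: 0 per exact match, `ambiguous_match` per match to an
    ambiguity code, `mismatch` per base the reference symbol does not allow."""
    codes = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}
    total = 0
    for base, ref_code in zip(read, ref):
        allowed = codes[ref_code]
        if base not in allowed:
            total += mismatch
        elif len(allowed) > 1:
            total += ambiguous_match
    return total


read = "ACGTAGCT"
targets = {
    "copy without SNP markup": "ACGTAGCT",
    "copy with R at the SNP": "ACGTRGCT",
    "block of N's": "NNNNNNNN",
}

# Compare the scheme described above (3) with a zero-cost ambiguous match.
for ambiguous_match in (3, 0):
    scores = {name: alignment_penalty(read, ref, ambiguous_match)
              for name, ref in targets.items()}
    print(f"ambiguous match = {ambiguous_match}:", scores)
```

With a cost of 3 the copy without markup scores slightly better than the marked-up copy, and the block of N's is clearly worse than either. With a cost of 0 all three tie, which is the situation the nonzero cost is meant to avoid.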
Originally posted by lh3: Mosaik or gsnap. Novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. Possibly they have the same issue. I also read a great paper last year. The authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.
-
gsnap can map to a reference enhanced by a list of known SNPs. See
Cheers,
Shaun
SNP-tolerant alignment in GSNAP
===============================
GSNAP has the ability to align to a reference space of all possible
major and minor alleles in a set of known SNPs provided by the user.
-
Mosaik or gsnap. Novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. Possibly they have the same issue. I also read a great paper last year. The authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.
-
Alignment tool for use with ambiguous reference?
Does anyone know of an alignment tool other than novoalign that we can use with a reference that contains ambiguity codes as well as A, G, C and T? Our aim is to do allele-specific expression analysis, so we really need to eliminate the bias towards preferential mapping of reads carrying the reference allele.
Thanks