  • bioinfosm
    replied
    Thanks for the useful notes!

    Essentially, what I am hearing is that allele-specific expression is the biggest application of a SNP-masked reference. For general SNP calling it does not gain much, but if it does not hurt either, why not move entirely to a SNP-masked reference, unless it is too slow?

  • epigen
    replied
    reference with ambiguous nucleotides

    Also, is the reference _with ambiguous nucleotides_ published/available somewhere?

  • sparks
    replied
    A single SNP can cause a 5-25% drop in the number of alignments. If that is significant for your expression-level studies, use an aligner that can handle ambiguous codes in the reference, but note that you do need to put these codes into the reference at known SNPs.
    It matters most when users are looking for allele-specific expression biases, in which case you should always use an aligner that accepts ambiguous codes in the reference.
    The problem is exacerbated if there are two or more SNPs in a read. This probably doesn't happen often, but when it does you should expect a big drop in the number of alignments.
    LH3 is also right that you'll still have problems with indels.
    Originally posted by bioinfosm:
    I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

    Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
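
The step of putting ambiguity codes into the reference at known SNPs can be sketched as follows. This is a minimal illustration, not any aligner's own tooling: the function name `mask_snps` and the `(position, ref, alt)` tuple layout are assumptions made for the example, and only bi-allelic SNPs are handled.

```python
# Standard IUPAC codes for the six possible two-allele SNPs.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def mask_snps(reference, snps):
    """Return `reference` with each known SNP replaced by its IUPAC code.

    `snps` is a hypothetical list of (0-based position, ref allele, alt allele).
    """
    seq = list(reference.upper())
    for pos, ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Guard against a SNP list built on a different reference build.
        if seq[pos] != ref:
            raise ValueError(f"reference base at {pos} is {seq[pos]}, expected {ref}")
        seq[pos] = IUPAC[frozenset((ref, alt))]
    return "".join(seq)

masked = mask_snps("ACGTACGT", [(0, "A", "G"), (3, "T", "C")])
print(masked)  # RCGYACGT
```

In practice the SNP list would come from a known-variants file and the sequence from a FASTA record; parsing those is left out here.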

  • lh3
    replied
    Reference bias mainly matters for a few bias-critical applications such as allele-specific expression and some popgen studies. Its effect on general SNP/indel/SV calling is negligible IMO. In addition, you have to know those SNPs beforehand, and they are not easy to obtain. Incorporating SNPs does not solve the problem caused by indels, either.

  • bioinfosm
    replied
    I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

    Also, is the reference _with ambiguous nucleotides_ published/available somewhere?

  • sparks
    replied
    I thought there might have been some confusion about scoring/penalty schemes. I'm glad that's sorted out.

  • lh3
    replied
    Actually, I did not realize that you were talking about "penalty" instead of "score". Let's use a negative score for a penalty. The authors were trying Penalty(A<->R)=-30. I am arguing that Penalty(A<->R)=0 is least biased. Now I buy that Penalty(A<->R)=-3, as used by novoalign, should also work.

    This bias is rooted in homologous regions; in highly unique regions, any masking strategy works equally well. The hard part is homology. Nonetheless, the authors of this paper have also shown me other data in which even Penalty(A<->R)=0 leads to significant extreme bias. Using novoalign/gsnap/mosaik is a must-have, but we have to apply additional steps to avoid bias. Thanks for the explanation, Colin.

  • sparks
    replied
    Hi Heng,
    I have a slightly different interpretation of this paper. First, they appear to have masked SNPs by forcing a mismatch, i.e. if the SNP was A/G (R), they changed the reference to C or T, thus forcing a mismatch at that position with a high penalty of 30, not zero. (I'm not sure of the BWA/MAQ scoring method, but Novoalign will usually score a match as zero and a mismatch as 30, depending on base call quality.) By forcing a mismatch they effectively bias against alignment at this location and force either unmapped reads or alignment to homologous regions. They even say this: "We find that the strong biases occur at SNPs for which the flanking sequence shares sequence identity with another region of the genome (Figure 3)".
    The bias they see isn't really from the aligner; it is a product of homologous regions and of having forced a mismatch at SNPs.
    I would expect Novoalign to remove much of the bias that they reported, as SNPs marked up with ambiguous codes will not get penalised as full mismatches and there will be much less bias towards homologous regions.
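
A toy comparison of the two masking strategies, using the penalty values quoted in this thread (match 0, mismatch 30, match to an ambiguity code 3). This is a sketch under stated assumptions, not the scoring code of Novoalign or any other aligner: base qualities and gaps are ignored, and the sequences, including the "homologous" region, are invented for illustration.

```python
# Bases each IUPAC code may stand for (standard nomenclature).
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC", "N": "ACGT"}

MISMATCH = 30        # penalty quoted in the thread for a full mismatch
AMBIGUOUS_MATCH = 3  # penalty quoted for a match to an ambiguity code

def penalty(read, ref):
    """Sum per-base penalties for an ungapped, equal-length alignment."""
    total = 0
    for base, code in zip(read, ref):
        if code in "ACGT":
            total += 0 if base == code else MISMATCH
        else:
            total += AMBIGUOUS_MATCH if base in IUPAC[code] else MISMATCH
    return total

homolog = "ACGGTTGA"    # hypothetical near-identical region elsewhere in the genome
forced = "ACGCTTCA"     # true locus, A/G SNP at index 3 masked with a third base
ambiguous = "ACGRTTCA"  # true locus, same SNP marked with the IUPAC code R
reads = {"A allele": "ACGATTCA", "G allele": "ACGGTTCA"}

for name, read in reads.items():
    print(name,
          "| forced mask:", penalty(read, forced),
          "| ambiguity code:", penalty(read, ambiguous),
          "| homolog:", penalty(read, homolog))
```

With the forced mismatch, the G-allele read scores 30 at the true locus and 30 at the homolog, so its placement is a tie, while the A-allele read scores 30 versus 60 and maps cleanly. With the ambiguity code, both alleles score 3 at the true locus and neither is drawn to the homolog.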

    Colin

  • lh3
    replied
    I would suggest people read "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", published in Bioinformatics last year. I used to think that masking the genome in whatever way would remove most reference bias, but they have convinced me that this bias is more complicated than I naively thought.

    For their example, giving a score of 30 to an A<->R match will lead to a smaller bias than a score of 3, although this does not remove all the extreme biases. At that time, the authors were giving A<->R a score of 0.

  • sparks
    replied
    Hi Heng,

    Novoalign should remove allelic bias.
    If you have an ambiguous code such as R in the reference genome, it will match an A or a G in the read with the same alignment score and the same chance that the read will align. In this case a match of A or G in the read will score 3, whereas a mismatch (C or T in the read) will score 30 (depending on base quality). There should be no allelic bias.

    The small alignment score of 3 for the match to the ambiguous code comes into play when we have reads that map to multiple locations. Say the read maps to two places in the genome with no mismatches, but one mapped location has an ambiguous code in the reference and the other is an exact match with no ambiguous code; then the location with no ambiguous codes gets a slightly lower alignment score. If we then use the -r Random option to report a randomly selected alignment for multiple mappings, the mapping location with no ambiguous code will have a higher chance of being reported than the mapping with the IUB ambiguous code. So there is a bias to report multi-mapping reads at the locations with the fewest ambiguous codes, but I don't think that it's allelic bias. A problem could arise with multi-copy genes if only some copies are marked up with SNPs or if they're marked up differently.

    Also we can't give a score of zero for matches to IUB ambiguous codes as N is an ambiguous code and this would mean a read aligning against a block of N's would score the same as a perfect alignment.
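
The three points above (equal treatment of both alleles, the slight preference for locations without ambiguity codes, and why a zero penalty is ruled out by N) can be checked with a small sketch. It assumes the fixed penalties quoted in this post (3 and 30) and ignores base qualities and gaps; the `penalty` function is illustrative only and is not Novoalign's implementation.

```python
# Bases each IUPAC code may stand for (standard nomenclature).
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC", "N": "ACGT"}

def penalty(read, ref, ambiguous_match=3, mismatch=30):
    """Sum per-base penalties for an ungapped, equal-length alignment."""
    total = 0
    for base, code in zip(read, ref):
        if code in "ACGT":
            total += 0 if base == code else mismatch
        else:
            total += ambiguous_match if base in IUPAC[code] else mismatch
    return total

# 1. Both alleles of an A/G SNP are treated alike; other bases are mismatches.
print(penalty("ACGA", "ACGR"), penalty("ACGG", "ACGR"), penalty("ACGC", "ACGR"))  # 3 3 30

# 2. For a multi-mapping read, the location without an ambiguity code scores lower.
print(penalty("ACGA", "ACGA"), penalty("ACGA", "ACGR"))  # 0 3

# 3. With a zero penalty, a block of N's would score like a perfect match.
print(penalty("ACGA", "NNNN", ambiguous_match=0), penalty("ACGA", "NNNN"))  # 0 12
```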

    Colin
    Originally posted by lh3:
    mosaik or gsnap. novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. possibly they have the same issue. also read a great paper last year. the authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.

  • lindseyjane
    replied
    Thank you for your advice, I am checking out gsnap.

  • sjackman
    replied
    gsnap can map to a reference enhanced by a list of known SNPs. See


    Cheers,
    Shaun

    SNP-tolerant alignment in GSNAP
    ===============================

    GSNAP has the ability to align to a reference space of all possible
    major and minor alleles in a set of known SNPs provided by the user.

  • lh3
    replied
    mosaik or gsnap. novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. possibly they have the same issue. also read a great paper last year. the authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.

  • Alignment tool for use with ambiguous reference?

    Does anyone know of an alignment tool other than novoalign that we can use with a reference that contains ambiguity codes as well as A, G, C and T? Our aim is to do allele-specific expression analysis, so we really need to eliminate the bias towards preferential mapping of reads carrying the reference allele.

    Thanks
