Alignment tool for use with ambiguous reference?


  • Alignment tool for use with ambiguous reference?

    Does anyone know of an alignment tool, other than novoalign, that we can use with a reference containing ambiguity codes as well as A, G, C and T? Our aim is allele-specific expression analysis, so we really need to eliminate the bias toward preferential mapping of reads that carry the reference allele.

    Thanks

  • #2
    mosaik or gsnap. novoalign reduces the scores of alignments involving ambiguous bases, which I think is less desirable for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases; possibly they have the same issue. I also read a great paper last year, in which the authors argue that merely allowing ambiguous bases helps to reduce bias, but not by much.



    • #3
      gsnap can map to a reference enhanced by a list of known SNPs. See
      http://research-pub.gene.com/gmap/src/README

      Cheers,
      Shaun

      SNP-tolerant alignment in GSNAP
      ===============================

      GSNAP has the ability to align to a reference space of all possible
      major and minor alleles in a set of known SNPs provided by the user.



      • #4
        Thank you for your advice, I am checking out gsnap.



        • #5
          Hi Heng,

          Novoalign should remove allelic bias.
          If you have an ambiguous code such as R in the reference genome, it will match an A or a G in the read with the same alignment score, and so with the same chance that the read will align. In this case an A or G in the read will score 3, whereas a mismatch (C or T in the read) will score 30 (depending on base quality). There should be no allelic bias.

          The small alignment score of 3 for a match to the ambiguous code comes into play when we have reads that map to multiple locations. Say the read maps to two places in the genome with no mismatches, but one location has an ambiguous code in the reference and the other is an exact match with no ambiguous code. The location with no ambiguous codes then gets a slightly lower (that is, better) alignment score. If we then use the -r Random option to report a randomly selected alignment for multiple mappings, the location with no ambiguous code will have a higher chance of being reported than the location with the IUB ambiguous code. So there is a bias toward reporting multi-mapping reads at the locations with the fewest ambiguous codes, but I don't think that it's allelic bias. A problem could arise with multi-copy genes if only some copies are marked up with SNPs, or if they're marked up differently.

          Also, we can't give a score of zero for matches to IUB ambiguous codes, because N is an ambiguous code and a zero score would mean that a read aligning against a block of N's scored the same as a perfect alignment.

          Colin
          Originally posted by lh3:
          mosaik or gsnap. novoalign reduces scores involving ambiguous bases, which I think is less preferred for reducing bias, at least on hand-made examples. I actually do not know how mosaik and gsnap cope with ambiguous bases. possibly they have the same issue. also read a great paper last year. the authors argue that merely allowing ambiguous bases helps to reduce bias, but not much.
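
To make the scoring described in post #5 concrete, here is a minimal Python sketch of a per-base penalty for an ungapped alignment against a reference containing IUPAC/IUB codes. It is not Novoalign's implementation: the function name `alignment_penalty` is hypothetical, the values 3 and 30 are simply the figures quoted in the post, and the real mismatch penalty depends on base quality, which this sketch ignores.

```python
# IUPAC/IUB nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def alignment_penalty(read, ref, ambiguous_match=3, mismatch=30):
    """Sum per-base penalties for an ungapped alignment (lower is better).

    Assumption: fixed penalties, no base qualities, no gaps.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    total = 0
    for read_base, ref_code in zip(read.upper(), ref.upper()):
        allowed = IUPAC[ref_code]
        if read_base not in allowed:
            total += mismatch          # e.g. C or T in the read against R
        elif len(allowed) > 1:
            total += ambiguous_match   # e.g. A or G in the read against R
        # an exact match to an unambiguous base adds nothing
    return total


ref_window = "ACGTRCGTAC"  # hypothetical reference window with R at an A/G SNP
for read in ("ACGTACGTAC", "ACGTGCGTAC", "ACGTCCGTAC"):
    print(read, alignment_penalty(read, ref_window))

# Why a zero penalty for ambiguous matches is a problem: a block of N's
# would then look exactly like a perfect alignment.
print(alignment_penalty("ACGTACGTAC", "N" * 10, ambiguous_match=0))
```

With these settings the A and G reads receive the same penalty at the SNP, which is the "no allelic bias" argument, while the last line shows the N-block case that motivates a small non-zero penalty.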



          • #6
            I would suggest people read "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", published in Bioinformatics last year. I used to think that masking the genome, in whatever way, would remove most of the reference bias, but the authors have convinced me that this bias is more complicated than I naively thought.

            For their example, giving a score of 30 to an A<->R match leads to a smaller bias than a score of 3, although this does not remove all the extreme biases. At that time, the authors were giving A<->R a score of 0.



            • #7
              Hi Heng,
              I have a slightly different interpretation of this paper. First, they appear to have masked SNPs by forcing a mismatch: if the SNP was A/G (R), they changed the reference to C or T, thus forcing a mismatch at that position and a high penalty of 30, not zero. (I'm not sure of the BWA/MAQ scoring method, but Novoalign will usually score a match as zero and a mismatch as 30, depending on base call quality.) But by forcing a mismatch they effectively bias against alignment at this location and force either unmapped reads or alignment to homologous regions. They even say this: "We find that the strong biases occur at SNPs for which the flanking sequence shares sequence identity with another region of the genome (Figure 3)".
              The bias they see isn't really from the aligner; it is a product of homologous regions and of the fact that they've forced a mismatch at SNPs.
              I would expect Novoalign to remove much of the bias that they reported, as SNPs marked up with ambiguous codes will not get penalised as full mismatches and there will be much less bias toward homologous regions.

              Colin



              • #8
                Actually, I did not realize that you were talking about "penalty" instead of "score". Let's use a negative score for a penalty. The authors were trying Penalty(A<->R)=-30. I am arguing that Penalty(A<->R)=0 is the least biased. Now I buy that Penalty(A<->R)=-3, as used by novoalign, should also work.

                This bias is rooted in homologous regions; in highly unique regions, any masking strategy works equally well. The hard part is homology. Nonetheless, the authors of this paper have also shown me other data in which even Penalty(A<->R)=0 leads to significant extreme bias. Using novoalign/gsnap/mosaik is a must-have, but we have to apply additional steps to avoid bias. Thanks for the explanation, Colin.
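
The point that homology is the hard part can be illustrated with a toy comparison of the three treatments discussed in posts #7 and #8: an unmasked reference, a forced mismatch at the SNP, and an IUPAC code at the SNP. This is only a sketch under stated assumptions, not a model of any real aligner: the sequences, the paralog, and the helper names `penalty`, `best_hits` and `compare` are all hypothetical, penalties are fixed at 30 per mismatch, and base qualities and gaps are ignored.

```python
# Only the codes this toy example needs.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def penalty(read, ref, ambiguous_match, mismatch=30):
    """Ungapped penalty of a read against a reference window (lower is better)."""
    total = 0
    for read_base, ref_code in zip(read, ref):
        allowed = IUPAC[ref_code]
        if read_base not in allowed:
            total += mismatch
        elif len(allowed) > 1:
            total += ambiguous_match
    return total


def best_hits(read, loci, ambiguous_match):
    """Return the names of all loci that tie for the lowest penalty."""
    scores = {name: penalty(read, seq, ambiguous_match) for name, seq in loci.items()}
    best = min(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


# Hypothetical A/G SNP whose flanks also occur in a paralog carrying G.
FLANK5, FLANK3 = "ACGT", "CGTAC"
PARALOG = FLANK5 + "G" + FLANK3
REFERENCES = {
    "unmasked (A)": FLANK5 + "A" + FLANK3,
    "forced mismatch (C)": FLANK5 + "C" + FLANK3,
    "IUPAC R": FLANK5 + "R" + FLANK3,
}
READS = {"A": FLANK5 + "A" + FLANK3, "G": FLANK5 + "G" + FLANK3}


def compare(ambiguous_match):
    """Best-scoring loci for each (reference treatment, read allele) pair."""
    results = {}
    for label, snp_locus in REFERENCES.items():
        loci = {"snp_locus": snp_locus, "paralog": PARALOG}
        for allele, read in READS.items():
            results[(label, allele)] = best_hits(read, loci, ambiguous_match)
    return results


for amb in (3, 0):
    print(f"penalty for a match to R = {amb}")
    for (label, allele), hits in compare(amb).items():
        print(f"  {label:20s} read allele {allele}: {hits}")
```

In this constructed case the G-allele read is pulled toward the paralog under every treatment, or at best ties with it when the ambiguous match costs 0, while the A-allele read stays at the SNP locus whenever it is not forced into a mismatch. That asymmetry is one way homologous sequence could leave residual bias even with an ambiguity-aware aligner. Real genomes and real aligners are of course more complicated.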



                • #9
                  I thought there might have been some confusion about scoring/penalty schemes. I'm glad that's sorted out.



                  • #10
                    I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

                    Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
                    --
                    bioinfosm



                    • #11
                      Reference bias mainly matters for a few bias-critical applications, such as allele-specific expression and some popgen studies. Its effect on general SNP/indel/SV calling is negligible IMO. In addition, you have to know those SNPs beforehand, and they are not always easy to obtain. Incorporating SNPs does not solve the problem caused by indels, either.



                      • #12
                        A single SNP can cause a 5-25% drop in the number of alignments. If that's significant for your expression-level studies, then use an aligner that can handle ambiguous codes in the reference, but you do need to put these codes into the reference at known SNPs.
                        It matters most when users are looking for allele-specific expression biases, in which case you should always use an aligner that accepts ambiguous codes in the reference.
                        The problem will be exacerbated if there are two or more SNPs in a read. This probably doesn't happen often, but when it does you should expect a big drop in the number of alignments.
                        LH3 is also right that you'll still have problems with indels.
                        Originally posted by bioinfosm:
                        I still see most of the big groups using maq/bwa - gatk type workflows. How much does the ambiguous reference mapping really affect alignments/variant calling?

                        Also, is the reference _with ambiguous nucleotides_ published/available somewhere?
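
For readers who want to see what "putting these codes into the reference at known SNPs" could look like, here is a small Python sketch that rewrites biallelic single-base SNPs as IUPAC codes. The function name `mask_reference`, the 0-based `(position, ref_allele, alt_allele)` tuple format, and the example sequence are assumptions made for illustration. It does not handle indels or multi-allelic sites, and it is not the procedure used to build any published masked reference.

```python
# Map an unordered pair of alleles to its two-base IUPAC code.
PAIR_TO_IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def mask_reference(sequence, snps):
    """Return `sequence` with biallelic SNPs written as IUPAC codes.

    `snps` is an iterable of (position, ref_allele, alt_allele) tuples with
    0-based positions (a hypothetical input format for this sketch).
    """
    bases = list(sequence.upper())
    for pos, ref_allele, alt_allele in snps:
        ref_allele, alt_allele = ref_allele.upper(), alt_allele.upper()
        # Guard against coordinate or strand mix-ups in the SNP list.
        if bases[pos] != ref_allele:
            raise ValueError(
                f"reference base at {pos} is {bases[pos]}, expected {ref_allele}"
            )
        bases[pos] = PAIR_TO_IUPAC[frozenset((ref_allele, alt_allele))]
    return "".join(bases)


masked = mask_reference("ACGTACGTAC", [(4, "A", "G"), (7, "T", "C")])
print(masked)  # ACGTRCGYAC
```

The check against the reference base is deliberate: SNP lists are often on a different coordinate system or strand than the FASTA file, and silently writing the wrong code would reintroduce the bias the masking is meant to remove.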



                        • #13
                          reference with ambiguous nucleotides

                          Also, is the reference _with ambiguous nucleotides_ published/available somewhere?

                          http://hgdownload.cse.ucsc.edu/golde...19/snp131Mask/



                          • #14
                            Thanks for the useful notes!

                            Essentially, what I am hearing is that allele-specific expression is the biggest application of a SNP-masked reference. For general SNP calling it does not gain much, but if it does not hurt, why not move entirely to a SNP-masked reference, unless it is too slow?
                            --
                            bioinfosm
