  • IUPAC coded reference sequences

    I'm just getting started, so hopefully this isn't a silly post.

    Mosaik and some other aligners are able to use IUPAC codes in their alignment process.

    I was wondering if there are any publicly available whole-genome reference sequences that contain the IUPAC codes for known SNPs. If not, are there any applications that can apply these codes using data from dbSNP, etc., or would I be better off coding something to handle this myself?

    Thanks!

  • #2
    Originally posted by data
    I'm just getting started, so hopefully this isn't a silly post.

    Mosaik and some other aligners are able to use IUPAC codes in their alignment process.

    I was wondering if there are any publicly available whole-genome reference sequences that contain the IUPAC codes for known SNPs. If not, are there any applications that can apply these codes using data from dbSNP, etc., or would I be better off coding something to handle this myself?

    Thanks!
    A quick Perl script from your local bioinformatician should be able to generate an IUPAC reference from dbSNP. How will you deal with indels in dbSNP?
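As a rough illustration of what such a script might do (in Python rather than Perl), here is a minimal sketch. It assumes the SNPs have already been parsed into `(zero_based_position, alt_allele)` pairs; that input format and the function name `apply_snps` are hypothetical, not a dbSNP format or an existing tool. Indels and multi-base variants are simply skipped, which sidesteps the indel question rather than answering it.

```python
# Map each set of observed bases to its IUPAC ambiguity code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def apply_snps(reference, snps):
    """Return `reference` with SNP positions replaced by IUPAC codes.

    `snps` is a hypothetical iterable of (zero_based_position, alt_allele)
    pairs. Only single-base A/C/G/T substitutions are applied.
    """
    ref = reference.upper()
    alleles = {}  # position -> set of bases seen there (reference + alts)
    for pos, alt in snps:
        alt = alt.upper()
        if len(alt) != 1 or alt not in "ACGT":
            continue  # skip indels and multi-base variants
        if ref[pos] not in "ACGT":
            continue  # leave N and existing ambiguity codes alone
        alleles.setdefault(pos, {ref[pos]}).add(alt)

    seq = list(ref)
    for pos, bases in alleles.items():
        seq[pos] = IUPAC[frozenset(bases)]
    return "".join(seq)


if __name__ == "__main__":
    # Two alternate alleles at position 3 collapse into a three-base code.
    print(apply_snps("ACGTACGT", [(0, "G"), (3, "C"), (3, "G"), (5, "TT")]))
```

Collecting all alleles per position before writing the code means multi-allelic sites get the correct three- or four-base code instead of whichever SNP happened to be processed last.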



    • #3
      I've been looking into scripting my own solution, but I ran across what appears to be a good place to start:


      I don't have a good answer about how to represent indels in the reference sequence.



      • #4
        Data,

        SNPmask should work fairly well. Obviously this is a moving target, with new dbSNP and 1000 Genomes releases coming out regularly.

        I've been thinking of establishing some modified IUPAC reference sequences and posting them eventually.

        You may want to also see: "An extended IUPAC nomenclature code for polymorphic nucleic acids" Bioinformatics 2010 26(10):1386-1389; doi:10.1093/bioinformatics/btq098



        • #5
          The use of a reference genome that includes IUPAC codes for known relatively frequent variants in an alignment is an elegant idea.

          I am wondering if the currently available aligners (BWA, Bowtie, etc.) can handle this type of genome build as a reference?

          It would be nice to use the 1000 Genomes data for this purpose instead of dbSNP, which contains a large fraction of SNPs without validated allele frequencies.

          Does a human reference build of this type currently exist?

          Also, the question of how to deal with indels still remains.



          • #6
            Try Mosaik for the IUPAC codes.



            • #7
              Originally posted by nilshomer
              Try Mosaik for the IUPAC codes.
              Nils,
              I can't find any clear statement about whether BWA or Bowtie handles IUPAC codes in reference sequences (or about what errors might occur when the reference includes them).

              Can you point me to any documentation or explanation about BWA or Bowtie and IUPAC?

              I'm working on a bacterial genome whose reference contains a small number of IUPAC characters.

              Thanks

              Jim



              • #8
                Neither supports them directly (I had to check the source code for BWA). BWA randomly converts them to a DNA base; I'm not sure what Bowtie does.
                Last edited by nilshomer; 11-29-2011, 08:48 PM.
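To make the consequence of that behavior concrete, here is a minimal sketch of random conversion. It is not BWA's code: the function name `randomize_ambiguous` is hypothetical, and the sketch assumes each non-ACGT character is replaced by a base drawn uniformly from all four, since the post does not say how the base is chosen.

```python
import random


def randomize_ambiguous(reference, seed=None):
    """Replace every non-ACGT character with a random A/C/G/T.

    Assumption: the replacement is uniform over all four bases and is not
    restricted to the bases the IUPAC code actually allows.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    return "".join(
        base if base in "ACGT" else rng.choice("ACGT")
        for base in reference.upper()
    )


if __name__ == "__main__":
    print(randomize_ambiguous("ACRTNNGY", seed=1))
```

Under this assumption the variant information is lost at indexing time, so a read carrying the alternate allele can still be scored as a mismatch at that position.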



                • #9
                  NovoAlign also handles IUPAC reference sequences. The two differ in how they treat them: Mosaik considers any base aligned to an IUPAC position a partial mismatch, while NovoAlign does not penalize it.

                  (Note: Might be the other way around, can't check the documentation right now)
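The two policies can be sketched as a single scoring function with an adjustable penalty. This is only an illustration: the score values and the `ambiguous_penalty` parameter are hypothetical and are not taken from either aligner, and the sketch deliberately does not say which tool uses which policy.

```python
# Bases compatible with each IUPAC code.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

MATCH, MISMATCH = 1.0, -1.0  # hypothetical scores, for illustration only


def score_base(read_base, ref_code, ambiguous_penalty=0.0):
    """Score one read base against one (possibly ambiguous) reference code.

    ambiguous_penalty=0.0 models the "no penalty" policy; a value between
    0 and MATCH - MISMATCH models the "partial mismatch" policy.
    """
    compatible = IUPAC_BASES[ref_code.upper()]
    if read_base.upper() not in compatible:
        return MISMATCH
    if len(compatible) == 1:
        return MATCH  # ordinary unambiguous match
    return MATCH - ambiguous_penalty


def score_alignment(read, ref, ambiguous_penalty=0.0):
    """Sum per-base scores for an ungapped, equal-length alignment."""
    return sum(score_base(r, c, ambiguous_penalty) for r, c in zip(read, ref))


if __name__ == "__main__":
    print(score_alignment("ACGT", "ACRT"))                         # no penalty
    print(score_alignment("ACGT", "ACRT", ambiguous_penalty=0.5))  # partial
```

With a nonzero penalty, reads overlapping many ambiguous positions score lower even when every base is compatible, which can change which of several candidate locations wins.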



                  • #10
                    Originally posted by ohofmann
                    NovoAlign also handles IUPAC reference sequences. The two differ in how they treat them: Mosaik considers any base aligned to an IUPAC position a partial mismatch, while NovoAlign does not penalize it.

                    (Note: Might be the other way around, can't check the documentation right now)
                    TMAP treats them as a mismatch during seeding but as a match in the final alignment. IUPAC positions are difficult to handle because most aligners assume the input is a simple reference string rather than a regex or graph structure.
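To show what treating the reference as a regex rather than a plain string can look like, here is a minimal sketch. It is not how TMAP or any other aligner is implemented; the helper names `iupac_to_regex` and `find_exact_hits` are hypothetical, and the search is a naive ungapped scan with no index.

```python
import re

# Regex fragment for each IUPAC code.
IUPAC_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(reference):
    """Compile an IUPAC-coded sequence into a regex over A/C/G/T."""
    return re.compile("".join(IUPAC_CLASS[c] for c in reference.upper()))


def find_exact_hits(read, reference):
    """Return start offsets where `read` is compatible with `reference`.

    Each reference window is turned into a regex and the read must match
    it in full; there are no gaps and no mismatches in this sketch.
    """
    read = read.upper()
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if iupac_to_regex(window).fullmatch(read):
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(iupac_to_regex("ACRT").pattern)
    print(find_exact_hits("GT", "ACRTGYA"))
```

The scan is linear in the reference length for every read, which illustrates why this representation does not fit index structures built around a single concrete string.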
