Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samse repetitive hits question

    Hello-

    So, BWA samse randomly places reads which map to multiple locations equally well. Is there anyway to force BWA not to align such reads at all? I understand that randomly placing these reads makes sense in most cases since all locations score equally well; however, we are using NGS for medical diagnostics and our molecular geneticists will not sign out cases where reads are randomly placed. Their preference is for these reads to not be aligned at all, even if it means we have to Sanger sequence certain portions of genes. Any help would be appreciated, thanks!

  • #2
    For those cases, MAPQ should be 0, so you can filter them out prior performing your downstream analysis.
    -drd

    Comment


    • #3
      Thank you...

      Comment


      • #4
        Originally posted by SWP View Post
        Thank you...
        You can also try GibbsAM, this should help to uniquely map your alignments. I'm still evaluating it, so I can't tell anything more than the paper.
        If you want to give it a chance, I have some scripts to convert bwa output into GibbsAM input (and back to BAM, then).
        Also, the mapq=0 rule is not always true... I've found a number of tags aligned in multiple places (and with the XT:A:R tag) but with mapq > 0

        d

        Comment


        • #5
          Yeah, I'd love to have the scripts and give it a shot.

          Comment


          • #6
            First of all you have to run bwa samse with -n X option, with X high enough (I've used 255). Then you can use this python script

            Code:
            #!/usr/bin/python
            
            import sys
            import pysam
            
            samfile = pysam.Samfile(sys.argv[1], 'rb')
            sep = "_"
            
            for x in samfile.fetch():
              if x.flag & 4: continue
              buf = []
              readChrom = samfile.references[x.rname]
              if '_random' not in readChrom and '_hap' not in readChrom:
                strand = '+'
                if x.flag & 16:
                  strand = '-'
                buf.append( '_'.join([readChrom, str(x.pos), strand]))
              
              XAList = [Tag for Tag in x.tags if Tag[0] == 'XA']
              if len(XAList):
                for p in XAList[0][1].split(';'):
                  a = p.split(',')
                  if '_random' in p or '_hap' in p: continue
                  if len(a) <= 1: continue 
                  if int(a[1]) >= 0:
                    buf.append('_'.join([a[0], a[1][1:], '+']))
                  else:
                    buf.append('_'.join([a[0], a[1][1:], '-']))
              if len(buf):
                print x.qname + '\t' + ','.join(buf)
            to convert your BAM into a "map" file.

            Code:
            $ python bam2map.py filein.bam > filein.map
            You can use filein.map with get_unique_file.pl from GibbsAM (part C). My code depends on pysam and prints output to stdout (and sorry if it's ugly code).
            Beware, GibbsAM is rather slow.
            After GibbsAM has done and you get your final output (results.txt) you can run this

            Code:
            import sys
            import pysam
            
            samfile = pysam.Samfile(sys.argv[1], 'rb')
            gfile = open(sys.argv[2], 'r')
            outfile = pysam.Samfile("-", 'wh', header = samfile.header)
            
            def cigar2tuple(cigarString):
              cdict = {'M':0, 'I':1, 'D':2, 'N':3, 'S':4, 'H':5, 'P':6}
              outlist = []
              pn = 0
              for n in range(len(cigarString)):
                l = cigarString[n]
                if l in 'MIDNSHP':
                  outlist.append((cdict[l], int(cigarString[pn:n])))
                  pn = n + 1
              return tuple(outlist)
            
            def correctNM(Taglist, newNM):
              newTagList = []
              for x in Taglist:
                if x[0] == 'NM':
                  newTagList.append(('NM', int(newNM)))
                else:
                  newTagList.append(x)
              return newTagList
            
            def fixTags(Taglist):
              newTagList = []
              for x in Taglist:
                if x[0] == 'XT':
                  newTagList.append(('XT', 'U'))
                elif x[0] == 'XA':
                  continue
                else:
                  newTagList.append(x)
              newTagList.append(('GF','M'))
              return newTagList
              
            mappings = {}
            
            for line in gfile:
              t = line.strip().split()
              if t[-1] == 'a':
                mappings[t[0]] = t[1] + ',' + t[3] + t[2]
            
            for x in samfile.fetch():
              try:
                d = mappings[x.qname]
              except KeyError:
                outfile.write(x)
                continue
              XATag = [Tag for Tag in x.tags if Tag[0] == 'XA'][0][1]
              x.tags = fixTags(x.tags)
              try:
                theString = [xa for xa in XATag.split(';') if d in xa][0]
                t = theString.split(',')
                x.rname = samfile.references.index(t[0])
                x.pos = int(t[1][1:])
                x.cigar = cigar2tuple(t[2])
                if  t[1][0] == '-':
                  x.flag = x.flag | 16
                else:
                  x.flag = x.flag & ~16
                x.tags = correctNM(x.tags, t[3])
              except IndexError:
                # the read is already in the correct position
                outfile.write(x)
                continue
            #  temp = [Tag for Tag in x.tags if Tag[0] != 'XA']
            #  x.tags = temp
              if not x.mapq:
                x.mapq = 1
              outfile.write(x)
            outfile.close()
            to transform your results.txt into a SAM file, just issue:

            Code:
            $ python gam2bam.py filein.bam results.txt > results.sam
            where filein.bam is your original BAM file (from bwa).
            Note that these scripts are somehow optimized for human genomes, it excludes _random and _hap chroms (it's a long story, but this happens because GibbsAM doesn't support those chroms...). If you are performing analysis with another genome you may need to modify the first code. I'm in touch with GibbsAM developers, I'm waiting this flaw to be fixed.

            HTH
            d

            Comment


            • #7
              Unless I've misunderstood something, if you want the uniquely aligned hits, you'd want the reads which had just 1 best hit reported. BWA reports that as a custom tag in the SAM output, "X0:i:n" where n is the number of best hits. To get the unique hits you could simply do a grep:
              $ grep -P "X0:i:1\t" hits.sam > unique.hits.sam

              Does anyone else have some inputs?

              EDIT
              You could also just go for the XT:A:U tag which BWA reports for the unique alignments.
              Last edited by Thomas Doktor; 02-17-2011, 04:21 PM.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 10:49 AM
              0 responses
              18 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-25-2024, 11:49 AM
              0 responses
              24 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-24-2024, 08:47 AM
              0 responses
              20 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              62 views
              0 likes
              Last Post seqadmin  
              Working...
              X