Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAM tools

    Hi All,

    I am using BWA to align ChIP-seq data. I want to use MACS for identifying transcript factor binding sites. However, how to convert BWA output (.SAM) format to MACS input (.BED) formal?

    Thanks for your help,

    Hemang

  • #2
    Hi,
    SAM format contains all you need to make a BED entry for each aligned read. Below is a Python script that works for me. You made need to customize depending on your needs. Specifically, you might want to stuff more info into the "name" and "score" fields. Here I made the following decisions:

    a. (line 40) Only create an entry for reads that were aligned (logical methinks).
    b. (lines 50 and 51) Make the BED score field (#5) be the edit distance of the alignment.
    c. the BED lines conform to UCSC 0-based, half-open rules
    d. I chose to ignore if these are paired-end reads. each end is treated separately. if you want to change this, you'd need to make more decisions based on the SAM flag.

    If you copy the code into a text editor, save it, then "chmod +x" it, you should be able to do the following to create a BED file from your data.

    > sam2bed.py -s file.sam > file.bed

    where file.sam is your SAM input file.

    I hope this helps.
    Best, Aaron

    Code:
    #!/usr/bin/env python
    # encoding: utf-8
    """
    sam2bed.py
    
    Created by Aaron Quinlan on 2009-08-27.
    Copyright (c) 2009 Aaron R. Quinlan. All rights reserved.
    """
    
    import sys
    import getopt
    import re
    
    help_message = '''
    	USAGE: sam2bed -s <sam>
    '''
    
    
    class Usage(Exception):
    	def __init__(self, msg):
    		self.msg = msg
    
    
    def processSAM(file):
    	"""
    		Load a SAM file and convert each line to BED format.
    	"""		
    	f = open(file,'r')
    	for line in f.readlines():
    		samLine = splitLine(line.strip())
    		makeBED(samLine)
    	f.close()	
    	
    					
    def makeBED(samFields):
    	
    	samFlag = int(samFields[1])
    	
    	# Only create a BED entry if the read was aligned
    	if (not (samFlag & 0x0004)):
    		
    		chrom = samFields[2]
    		start = str(int(samFields[3])-1)
    		end = str(int(samFields[3]) + len(samFields[9]) - 1)
    		name = samFields[0]	
    		strand = getStrand(samFlag)
    
    		# Let's use the edit distance as the BED score.
    		# Clearly alternatives exist.
    		editPattern = re.compile('NM\:i\:(\d+)')
    		editDistance = editPattern.findall(samFields[12])
    
    		# Write out the BED entry
    		print chrom + "\t" + start + "\t" + end + "\t" + name + "\t" + editDistance[0] + "\t" + strand
    		
    		
    def splitLine(line, delim="\t"):
    	splitline = line.split(delim)
    	return splitline		
    
    
    def getStrand(samFlag):
    	strand = "+"
    	if (samFlag & (0x10)):	# minus strand if true.
    		strand = "-"		
    	return strand
    	
    		
    def main(argv=None):
    	if argv is None:
    		argv = sys.argv
    	try:
    		try:
    			opts, args = getopt.getopt(argv[1:], "hs:", ["help", "sam"])
    		except getopt.error, msg:
    			raise Usage(help_message)
    	
    		# option processing
    		samFile = ""
    		for option, value in opts:
    			if option in ("-h", "--help"):
    				raise Usage(help_message)
    			if option in ("-s", "--sam"):
    				samFile = value
    
    		try:
    		   f = open(samFile, 'r')
    		except IOError, msg:
    			raise Usage(help_message)
    				
    	except Usage, err:
    		print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
    		return 2	
    
    	# make a BED file of the SAM file.
    	processSAM(samFile)
    
    if __name__ == "__main__":
    	sys.exit(main())

    Comment


    • #3
      Originally posted by hemang View Post
      Hi All,

      I am using BWA to align ChIP-seq data. I want to use MACS for identifying transcript factor binding sites. However, how to convert BWA output (.SAM) format to MACS input (.BED) formal?
      Hi,
      I've written a patch to add SAM/BAM support to MACS. I've submitted to MACS developers, hopefully it will be included in the next release. If you are interested in the patch I can send it to you.

      Comment


      • #4
        Originally posted by dawe View Post
        Hi,
        I've written a patch to add SAM/BAM support to MACS. I've submitted to MACS developers, hopefully it will be included in the next release. If you are interested in the patch I can send it to you.
        Thank you so much!
        Can you please send me the MACS patch ([email protected])? I really appreciate your help.
        Hemang

        Comment


        • #5
          Originally posted by quinlana View Post
          Hi,
          SAM format contains all you need to make a BED entry for each aligned read. Below is a Python script that works for me. You made need to customize depending on your needs. Specifically, you might want to stuff more info into the "name" and "score" fields. Here I made the following decisions:

          a. (line 40) Only create an entry for reads that were aligned (logical methinks).
          b. (lines 50 and 51) Make the BED score field (#5) be the edit distance of the alignment.
          c. the BED lines conform to UCSC 0-based, half-open rules
          d. I chose to ignore if these are paired-end reads. each end is treated separately. if you want to change this, you'd need to make more decisions based on the SAM flag.

          If you copy the code into a text editor, save it, then "chmod +x" it, you should be able to do the following to create a BED file from your data.

          > sam2bed.py -s file.sam > file.bed

          where file.sam is your SAM input file.

          I hope this helps.
          Best, Aaron

          Code:
          #!/usr/bin/env python
          # encoding: utf-8
          """
          sam2bed.py
          
          Created by Aaron Quinlan on 2009-08-27.
          Copyright (c) 2009 Aaron R. Quinlan. All rights reserved.
          """
          
          import sys
          import getopt
          import re
          
          help_message = '''
          	USAGE: sam2bed -s <sam>
          '''
          
          
          class Usage(Exception):
          	def __init__(self, msg):
          		self.msg = msg
          
          
          def processSAM(file):
          	"""
          		Load a SAM file and convert each line to BED format.
          	"""		
          	f = open(file,'r')
          	for line in f.readlines():
          		samLine = splitLine(line.strip())
          		makeBED(samLine)
          	f.close()	
          	
          					
          def makeBED(samFields):
          	
          	samFlag = int(samFields[1])
          	
          	# Only create a BED entry if the read was aligned
          	if (not (samFlag & 0x0004)):
          		
          		chrom = samFields[2]
          		start = str(int(samFields[3])-1)
          		end = str(int(samFields[3]) + len(samFields[9]) - 1)
          		name = samFields[0]	
          		strand = getStrand(samFlag)
          
          		# Let's use the edit distance as the BED score.
          		# Clearly alternatives exist.
          		editPattern = re.compile('NM\:i\:(\d+)')
          		editDistance = editPattern.findall(samFields[12])
          
          		# Write out the BED entry
          		print chrom + "\t" + start + "\t" + end + "\t" + name + "\t" + editDistance[0] + "\t" + strand
          		
          		
          def splitLine(line, delim="\t"):
          	splitline = line.split(delim)
          	return splitline		
          
          
          def getStrand(samFlag):
          	strand = "+"
          	if (samFlag & (0x10)):	# minus strand if true.
          		strand = "-"		
          	return strand
          	
          		
          def main(argv=None):
          	if argv is None:
          		argv = sys.argv
          	try:
          		try:
          			opts, args = getopt.getopt(argv[1:], "hs:", ["help", "sam"])
          		except getopt.error, msg:
          			raise Usage(help_message)
          	
          		# option processing
          		samFile = ""
          		for option, value in opts:
          			if option in ("-h", "--help"):
          				raise Usage(help_message)
          			if option in ("-s", "--sam"):
          				samFile = value
          
          		try:
          		   f = open(samFile, 'r')
          		except IOError, msg:
          			raise Usage(help_message)
          				
          	except Usage, err:
          		print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
          		return 2	
          
          	# make a BED file of the SAM file.
          	processSAM(samFile)
          
          if __name__ == "__main__":
          	sys.exit(main())
          Dear Aaron,

          Thank you so much for the python script. It worked fine with MACS.

          Best regards,

          Hemang

          Comment


          • #6
            No problem. Others asked for the changes below, so I posted an improved version here:



            This version allows one to optionally create BED entries for just concordant paired-end reads or just discordant paired-end reads. I imagine that you're dealing with single-end reads...

            The BED output is written to standard out for use with other UNIX utilities (e.g. sort, uniq, awk, etc.) as well as BEDTools.

            Code:
            samToBed -s <sam> -t <alignment type>
            	
            OPTIONS:
            	-s	The SAM file to be converted to BED
            	-t	What types of alignments should be reported?
            			"all"	all aligned reads will be reported (Default)
            			"con"	only concordant pairs will be reported
            			"dis"	only discordant pairs will be reported
            Take care,
            Aaron

            Comment


            • #7
              Excuse me, I am looking for a tool to covert sam format to bed pileup format. It is a format with four fields in each row
              (1)Refname (2)Location Start (3)LocationEnd (4)Coverage
              like http://dw408k-1.cmb.usc.edu:8080/chr...445_884542.bed
              while each row shows the coverage of a segment.
              Does anyone know tools that can get the format from sam or sam pileup?

              Comment


              • #8
                BEDTools has a tool named coverageBed that will compute the coverage of one bed file "A" (e.g. aligned reads) versus another bed file "B" (e.g. the widowns for which you want to create coverage. If your reads are in BAM format, you can use another utility called bamToBed to do this as follows:

                Code:
                $ bamToBed -abam reads.bam | coverageBed -a stdin -b windows.bed > windows.cov.bed
                The first four columns of windows.cov.bed will be in the BEDGRAPH format you requested.

                Aaron

                Comment


                • #9
                  About BEDtools and coverage: do you think it would be feasible (and correct) to pipeline bamToBed, slopeBed (to extend reads to fragment size) and mergeBed (to merge and count) to produce a wig like data format? Do you think it would be better to use coverageBed with a window file which spans all the genome in 10 nt intervals?

                  thanks

                  d

                  Comment


                  • #10
                    When I attempt to run the samToBed.py conversion script I keep receiving this error:
                    File "samtoBed.py", line 138, in <module>
                    sys.exit (main())
                    File "samtoBed.py", line 135, in main
                    processSAM(samFile, aType)
                    File "samtoBed.py", line 47, in processSam
                    makeBed (samLine, alignType)
                    File "samtoBed.py", line 53, in makeBed
                    samFlag = int(samFields[1])
                    ValueError: invalid literal for int() with base 10: 'VN:1.0'

                    I believe it has something to do with the format of my .SAm file which I created using Bowtie and the -S flag. Here is a portion of that file:

                    @SQ SN:gi|224589823|ref|NC_000024.9| LN:59373566
                    @SQ SN:gi|17981852|ref|NC_001807.4| LN:16571
                    @PG ID:Bowtie VN:0.12.1 CL:"./bowtie -S h_sapiens_37_asm test.fq test.sam"
                    Noname 16 gi|224589811|ref|NC_000002.11| 20633060 255 36M * 0 0 CTCTCTTGACTCCTTCCTGGCCTTCTGGACNAGGTG ###############################;@BBA XA:i:1 MD:Z:30C5 NM:i:1
                    Noname 16 gi|224589821|ref|NC_000009.11| 92065173 255 36M * 0 0 AAGGGCAAGACCAAGGATGCAGGAGAGAGANCCACT 887676878676688888888888866888#=?75B XA:i:1 MD:Z:30A5 NM:i:1
                    Noname 0 gi|224589808|ref|NC_000017.10| 2656051 255 36M * 0 0 GATTACAGGTGTGAGCCCCTGTGCCTGGCCTTTTTT @BB@>;5566:7:745:66:445361,,.5588778 XA:i:0 MD:Z:36 NM:i:0
                    Noname 0 gi|224589820|ref|NC_000008.10| 23005941 255 36M * 0 0 CATTCTGTACCTGATTTACAAACTGTACATAGCAGG BBB@B741<96::737:9:9:9999;82/5985523 XA:i:0 MD:Z:36 NM:i:0
                    Noname 0 gi|224589801|ref|NC_000010.10| 1060158 255 36M * 0 0 AAGCTAATATTTCTTTTTATCCTTCTCAGGTTCAGA ?@@@:729;4779282+2=>:7799722210855## XA:i:0 MD:Z:36 NM:i:0
                    Noname 0 gi|224589806|ref|NC_000015.9| 28699955 255 36M * 0 0 GCTGAAGTCCGGCCGGTTCGCGCCCCGGCGGCCGAC ;################################### XA:i:2 MD:Z:18G1G13C0T0 NM:i:4
                    Noname 0 gi|224589817|ref|NC_000005.9| 156279071 255 36M * 0 0 TTAGGAACTCTTTCTGTCACTCTTCTGTCATTTGCT B@<;<525798<7369699:6975998(88699872 XA:i:0 MD:Z:36 NM:i:0

                    If anyone has any insight it would be greatly appreciated.

                    Cheers,
                    Genbio64

                    Comment


                    • #11
                      I receive the same error message when running SAMtoBED, downloaded from this site : http://people.virginia.edu/~arq5x/utilities.html

                      I also created this SAM file from an Bowtie run (-S) (version Bowtie 0.12.3)

                      Traceback (most recent call last):
                      File "/tools/bioinfo/app/galaxy/prod/tools/bioim/samToBed.py", line 138, in ?
                      sys.exit(main())
                      File "/tools/bioinfo/app/galaxy/prod/tools/bioim/samToBed.py", line 135, in main
                      processSAM(samFile, aType)
                      File "/tools/bioinfo/app/galaxy/prod/tools/bioim/samToBed.py", line 47, in processSAM
                      makeBED(samLine, alignType)
                      File "/tools/bioinfo/app/galaxy/prod/tools/bioim/samToBed.py", line 53, in makeBED
                      samFlag = int(samFields[1])
                      ValueError: invalid literal for int(): VN:1.0


                      Can somebody help?

                      Comment


                      • #12
                        That script (written by me) is incorrect. I've taken the link down to prevent future confusion. You could use the bamToBed utility in BEDTools or use the Vancouver Short Read Package for this. Alternatively, you could use the example C program on the SAMTools wiki.

                        Apologies for the confusion---I should have put that link to death months ago.
                        Aaron

                        Comment


                        • #13
                          You can do this by combining samtools and bedtools:

                          samtools view <SAMFILENAME.sam> -Sb | bamToBed -i stdin > BEFILENAME.bed

                          Comment


                          • #14
                            bamToBED gives all reads with "chr-start-end-read-ID-score" fields from a BAM file . HOMER HOMER can calculate tag density per base in whole genome in a bedGraph file.

                            Comment


                            • #15
                              conert only a part of sam file to bam?

                              HI

                              Is there any option by which i can convert only apart of sam file to bam because sam file is having some problem with CIGAR length.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Latest Developments in Precision Medicine
                                by seqadmin



                                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                Somatic Genomics
                                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                05-24-2024, 01:16 PM
                              • seqadmin
                                Recent Advances in Sequencing Analysis Tools
                                by seqadmin


                                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                                05-06-2024, 07:48 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 05-24-2024, 07:15 AM
                              0 responses
                              123 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-23-2024, 10:28 AM
                              0 responses
                              136 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-23-2024, 07:35 AM
                              0 responses
                              135 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-22-2024, 02:06 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X