Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jdrum00
    Member
    • Dec 2009
    • 16

    Extracting one field from a SAM file

    I just became a bioinformatician (is that what we're called?) a few weeks ago, and am surprised that I can't find anything in samtools that will output a subset of fields from a SAM/BAM file. It seems to be entirely geared toward filtering "horizontally", for subsets of alignment records. I know that's a major need, but it's not actually what I need at the moment!

    What I'm looking for is a derivative file with, for example, just one long list of CIGAR strings. I don't care what reads or alignments they're from; I just want to do some statistics on them in the aggregate.

    What I'll do in the meantime is just parse the SAM with python to make my derivative file. But surely samtools would be better at it than I am. Anyone know how to make it do what I want? Thanks in advance!

    (I'm excited to see the new python wrapper for samtools, which I'll probably check out soon.)

    ANSWER from below, thanks to nilshomer: Just operate directly on the SAM-formatted stream that comes out of samtools view. E.g., samtools view <bamname> | awk <whatever field you need> | <further processing>.
    Last edited by jdrum00; 01-04-2010, 09:50 AM. Reason: Added answer to original question, for the benefit of searchers.
  • krobison
    Senior Member
    • Nov 2007
    • 734

    #2
    Doing that sort of slice is trivial from a SAM file with standard UNIX tools - cut, awk, perl, etc. For example "cut -f1" would pull out the first field. Count over (index origin 1) to CIGAR and put that value for the -f argument. Similarly, "perl -n -e '@a=split(/\t/); print $a[0],"\n"; (note that Perl is index origin 0, as most languages -- cut is wierd).

    The binary BAM file would be a bit trickier; if you are more C-fluent than I you could use the decompress routines to decompress and then parse things out.

    With the Perl (and probably Python; I'm just haven't dug into it) wrapper you can certainly iterate over all records & only grab the field(s) you want. Probably a significant overhead penalty, but it will get the job done.

    Comment

    • jdrum00
      Member
      • Dec 2009
      • 16

      #3
      Right; I can do that from a SAM, with awk or whatever. I was hoping someone would say there was an undocumented switch to samtools view or something, because that would also be likely to do it directly from BAM and save me a bunch of file conversion. It seems like the sort of simple direct thing that would be included in a basic tool set (but then I guess everyone says that about the feature they want).

      Thanks for the details and suggestions!

      Comment

      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        Originally posted by jdrum00 View Post
        Right; I can do that from a SAM, with awk or whatever. I was hoping someone would say there was an undocumented switch to samtools view or something, because that would also be likely to do it directly from BAM and save me a bunch of file conversion. It seems like the sort of simple direct thing that would be included in a basic tool set (but then I guess everyone says that about the feature they want).

        Thanks for the details and suggestions!
        Unix programs typically follow the pipe-and-filter model so duplicating the function of one when a pipe would suffice is hearsay.

        Here's how you get the cigar directly from a BAM file:
        Code:
        samtools view my.bam | cut -f6
        Unix utilities will soon become your friends.
        Last edited by nilshomer; 12-07-2009, 07:29 PM.

        Comment

        • jdrum00
          Member
          • Dec 2009
          • 16

          #5
          Originally posted by nilshomer View Post
          Unix utilities will soon become your friends.
          Oh yes; we're already quite well acquainted. Sed, awk, grep, cut, sort and I are all good buddies. Bash sometimes joins us for a movie. I used to hang out with python a lot, but honestly, I haven't seen much of him these last couple of weeks.

          Originally posted by nilshomer View Post
          Unix programs typically follow the pipe-and-filter model so duplicating the function of one when a pipe would suffix is hearsay.
          You're absolutely right, and my confusion was in not understanding that samtools view really is just a bam streamer. Now that that's clicked, I'm much happier!

          This is a great forum; I'm looking forward to rummaging through it and learning a lot. Thanks for the useful and timely responses.

          Comment

          • KevinLam
            Senior Member
            • Nov 2009
            • 204

            #6
            Originally posted by jdrum00 View Post
            Right; I can do that from a SAM, with awk or whatever. I was hoping someone would say there was an undocumented switch to samtools view or something, because that would also be likely to do it directly from BAM and save me a bunch of file conversion. It seems like the sort of simple direct thing that would be included in a basic tool set (but then I guess everyone says that about the feature they want).

            Thanks for the details and suggestions!
            Tell me about it!
            I bet there's a script floating out there already.

            working on python script to tally total mapped reads to a fasta entry in the reference for SAM file.

            will post the script if anyone is interested but I think everyone can do it.
            any suggestions on how I can do it better? not a unix guru so hoping to learn from others.

            I feel there;s gotta be a better way than to awk grep 3000 times a sam file if i have 3000 fasta entries.
            is there a word freq count unix tool?
            http://kevin-gattaca.blogspot.com/

            Comment

            • KevinLam
              Senior Member
              • Nov 2009
              • 204

              #7
              btw here's the regex i got off the sam manual
              HTML Code:
              [CODE]
               Field    Regular expression  Range                                     Description
                                                      Query pair NAME if paired; or Query NAME if unpaired 2
              QNAME [^ \t\n\r]+
                                           [0,216-1] bitwise FLAG (Section 2.2.2)
              FLAG   [0-9]+
                                                      Reference sequence NAME 3
              RNAME [^ \t\n\r@=]+
                                           [0,229-1] 1-based leftmost POSition/coordinate of the clipped sequence
              POS    [0-9]+
                                           [0,28-1]   MAPping Quality (phred-scaled posterior probability that the mapping
              MAPQ   [0-9]+
                                                      position of this read is incorrect) 4
                                                      extended CIGAR string
              CIGAR ([0-9]+[MIDNSHP])+|\*
                                                      Mate Reference sequence NaMe; “=” if the same as <RNAME> 3
              MRNM   [^ \t\n\r@]+
                                           [0,229-1] 1-based leftmost Mate POSition of the clipped sequence
              MPOS   [0-9]+
                                           [-229,229] inferred Insert SIZE 5
              ISIZE -?[0-9]+
                                                      query SEQuence; “=” for a match to the reference; n/N/. for ambiguity;
              SEQ    [acgtnACGTN.=]+|\*
                                                      cases are not maintained 6,7
                                                      query QUALity; ASCII-33 gives the Phred base quality 6,7
                                           [0,93]
              QUAL   [!-~]+|\*
                                                      TAG
              TAG    [A-Z][A-Z0-9]
                                                      Value TYPE
              VTYPE [AifZH]
                                                      match <VTYPE> (space allowed)
              VALUE [^\t\n\r]+[/CODE]
              http://kevin-gattaca.blogspot.com/

              Comment

              • jdrum00
                Member
                • Dec 2009
                • 16

                #8
                Originally posted by KevinLam View Post
                working on python script to tally total mapped reads to a fasta entry in the reference for SAM file.
                ...
                I feel there;s gotta be a better way than to awk grep 3000 times a sam file if i have 3000 fasta entries. is there a word freq count unix tool?
                I'm not totally clear on what you need, but awk will count lines:

                Code:
                awk '{count[$i]++} END {for(x in count) print x,count[x]}'
                The first bit tallies up everything as it goes through the file, and then the bit after END prints out the results.

                So if there's a specific subportion of the SAM line you want to count, filter that out with awk or cut, and pipe that result to the awk above. For example, I needed to look at the custom MD: field, which is the sixteenth field, so:

                Code:
                samtools view <bamname> | awk '{print $16}' | awk '{count[$1]++} END {for(j in count) print count[j], j}' | sort -nr
                This results in output like:

                Code:
                38 MD:Z:50
                17 MD:Z:30
                2 MD:Z:49A0
                2 MD:Z:46G3
                If you can get python or awk (or samtools!) to do the complex filtering you want, then awk is pretty blazingly fast at counting up the results. Hope that helps!
                Last edited by jdrum00; 01-04-2010, 08:03 AM. Reason: Note that samtools is a good filterer too

                Comment

                • KevinLam
                  Senior Member
                  • Nov 2009
                  • 204

                  #9
                  Neat!
                  Thanks jdrum
                  that's pretty much what I was looking for!
                  Was thinking of using a hash to count no. of reads that map to a chr in reference genome.
                  But this works better!
                  http://kevin-gattaca.blogspot.com/

                  Comment

                  Latest Articles

                  Collapse

                  • GATTACAT
                    Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                    by GATTACAT
                    Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                    07-01-2026, 11:43 AM
                  • SEQadmin2
                    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                    by SEQadmin2


                    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                    Here are nine questions we think about, in roughly the order they matter, before...
                    06-18-2026, 07:11 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 11:08 AM
                  0 responses
                  7 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-30-2026, 05:37 AM
                  0 responses
                  11 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-26-2026, 11:10 AM
                  0 responses
                  20 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-17-2026, 06:09 AM
                  0 responses
                  53 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...