Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Bruins
    Member
    • Feb 2010
    • 78

    CIGAR format in SAMtools

    Hi all,

    I am using the SAMtools pileup format (created using 'samtools pileup -c') for SNP calling. I need to extract the frequency of each variation from the read base column of the pileup. However, there is one part of this column that I do not understand:

    ... Also at the read base column, a symbol ‘^’ marks the start of a read segment which is a contiguous subsequence on the read separated by ‘N/S/H’ CIGAR operations. The ASCII of the character following ‘^’ minus 33 gives the mapping quality. A symbol ‘$’ marks the end of a read segment.
    Would someone care to explain this in different words? I find in my pileup many examples of '^~.' (at first glance at loci with a coverage of a few reads) and '$', how do I interpret this?

    Also, I have been trying to find an explanation of the CIGAR format. So far, my search has led me to Exonerate and the remark that apparently different versions of CIGAR exist. I was unable to find any other operations than M, D and I, let alone N, S or H. Is there some documentation on the CIGAR format, other than the exonerate man pages?

    Thank you,
    Wil

    ps thanks for all the hard work on SAMtools and Picard!
  • Justin Brown
    Junior Member
    • Apr 2009
    • 1

    #2
    Hi Wil,

    If you haven't already, I suggest looking at the SAM/BAM format specification, section 1.4.3



    Also, the SAMtools paper gives an explanation of the cigar and extended cigar (section 2.1.2)

    Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in ...


    The standard cigar has three operations:
    M : match or mismatch
    I : insertion
    D: deletion

    The extended cigar adds
    N : skipped bases on reference
    S : soft clipping
    H : hard clipping
    P : padding

    Hope that helps,
    Justin

    Comment

    • Bruins
      Member
      • Feb 2010
      • 78

      #3
      Hi Justin,

      This is very helpful, thanks a lot!

      Wil

      ***edit

      I understand the CIGAR operations now. But I still do not know how to interpret read base clumns that contain '^~.' or '$' - what is a read segment in this context? (sorry if it is a silly question...)

      Some example lines from my unfiltered pileup:
      Code:
      gnl|Nvit1.0|Chr1        1       T       T       30      0       60      1       ^~.     b
      gnl|Nvit1.0|Chr1        51      C       C       147     0       60      40      ......$..................................       bbabaabbbbabbbbbbbbbbbbbbbbbbb_bbbbbbbba
      gnl|Nvit1.0|Chr1        104     A       R       82      82      60      12      G$G$G$,,,,,,,^~,^~,     Bbbc`bbb\ab`
      How do I interpret these?
      Last edited by Bruins; 04-29-2010, 01:23 AM.

      Comment

      • kjngo
        Junior Member
        • Dec 2009
        • 6

        #4
        Hi Bruins,
        I don't know if anyone answered your question about the ^* and the $ in the pileup file.
        ^* symbolizes that a read begins to align at this position and the character after the ^ is the mapping quality of the read. $ marks the position where a read ends on the alignment.

        You can reference this at : http://samtools.sourceforge.net/pileup.shtml
        Hope this helps

        Comment

        • Graham Etherington
          Member
          • Apr 2010
          • 22

          #5
          A quick question about the Pileup format - if $ means a position where a read ends on the alignment, than can the example pileup given on the Samtools website ever be true?

          At the third positions in the pileup at reference positions 272 and 274 there are $ signs, which in my understanding of this statement means that two reads end within two nucleotides of each other at the same alignment position, which is an impossible scenario.
          Is this just a bad example of the pile up format, or have I misunderstood the explanation of the $ sign?

          Comment

          • nilshomer
            Nils Homer
            • Nov 2008
            • 1283

            #6
            Originally posted by Graham Etherington View Post
            ...two reads end within two nucleotides of each other at the same alignment position...
            The above statement does not make sense to me. Could you clarify?

            Maybe this example will help:
            Code:
            REF: AAAACA
            R1:  AAAAC
            R2:    AACAA
            Pileup(ish):
            Code:
            REF 1 A ^k. <
            REF 2 A . <
            REF 3 A .^k. <<
            REF 4 A .. <<
            REF 5 C .$. <<
            REF 6 A .$ <

            Comment

            • Graham Etherington
              Member
              • Apr 2010
              • 22

              #7
              Got it! Thanks (and sorry for being so thick )

              Comment

              • Bruins
                Member
                • Feb 2010
                • 78

                #8
                Hi all,

                thanks for your replies (and related question :P)!!

                So... '^~.' for a chromosome position where one read aligns means: the read starts with a match (.) and that match has quality represented by ~ (ASCII code of ~ minus 33).
                'G$G$G$' means that at that particular position, three reads end with a mismatch (in all three cases a G).

                If I got this correct, I think I know how to interpret lines like the ones in my example. Any comments?

                cheers,

                Comment

                • Jenzo
                  Member
                  • Mar 2011
                  • 31

                  #9
                  Hey all,

                  back to the CIGAR string: Would you consider it possible to calculate "Percent Identity" out of the CIGAR string?

                  My problem is: I use various mappers to align ~807000 454-Sequences against a set of "reference" sequences and need to decide, which of the 454-Sequences are (more or less) contained in my reference sequences. I would like to take only those sequences into account that have minimum of 95% Sequence Identity (number of identical residues (“matches”) in relation to the length of the alignment) in their alignment. For example in SSAHA2 (http://www.sanger.ac.uk/resources/software/ssaha2/) it is possible to set the parameter -identity properly. Other mappers do not provide such an option, so I would filter the SAM-Output via CIGAR string.
                  Do you get my problem? Perhaps it is not necessary and i totally overlooked other possibilities?

                  Thanks!

                  Comment

                  • lh3
                    Senior Member
                    • Feb 2008
                    • 686

                    #10
                    for nearly all existing SAMs: no, not directly. but you can use samtools calmd to generate extra tags given the reference sequence and SAM.

                    Comment

                    • Jenzo
                      Member
                      • Mar 2011
                      • 31

                      #11
                      Dear lh3,
                      thanks for this hint. I'll try to use calmd now!

                      Update: With an own perlscript it was possible to use information in cigar string and calmd-sam to calculate % Sequence Identity. Thanks a lot for help provided!
                      Last edited by Jenzo; 03-30-2011, 04:20 AM.

                      Comment

                      • Irina Pulyakhina
                        Member
                        • Sep 2010
                        • 24

                        #12
                        Hi all,

                        Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

                        Thanks!

                        Comment

                        • arvid
                          Senior Member
                          • Jul 2011
                          • 156

                          #13
                          Originally posted by Irina Pulyakhina View Post
                          Hi all,

                          Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

                          Thanks!
                          N is typically used to define an intron when aligning mRNA-originating reads to a genome; D would be for real deletions. Though the usage of N is not strictly defined in the SAM spec.

                          Comment

                          Latest Articles

                          Collapse

                          • SEQadmin2
                            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                            by SEQadmin2


                            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                            ...
                            06-02-2026, 10:05 AM
                          • SEQadmin2
                            Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                            by SEQadmin2


                            With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                            Introduction

                            Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                            05-22-2026, 06:42 AM
                          • SEQadmin2
                            Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                            by SEQadmin2

                            Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                            Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                            05-06-2026, 09:04 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by SEQadmin2, Yesterday, 08:59 AM
                          0 responses
                          13 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-02-2026, 12:03 PM
                          0 responses
                          22 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-02-2026, 11:40 AM
                          0 responses
                          19 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-28-2026, 11:40 AM
                          0 responses
                          31 views
                          0 reactions
                          Last Post SEQadmin2  
                          Working...