Unconfigured Ad

**Justin Brown** · 04-28-2010, 07:33 AM

Hi Wil,

If you haven't already, I suggest looking at the SAM/BAM format specification, section 1.4.3

http://samtools.sourceforge.net/SAM1.pdf

Also, the SAMtools paper gives an explanation of the cigar and extended cigar (section 2.1.2)

The Sequence Alignment/Map format and SAMtools - PMC

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in ...

The standard cigar has three operations:
M : match or mismatch
I : insertion
D: deletion

The extended cigar adds
N : skipped bases on reference
S : soft clipping
H : hard clipping
P : padding

Hope that helps,
Justin

**Bruins** · 04-28-2010, 11:12 PM

Hi Justin,

This is very helpful, thanks a lot!

Wil

***edit

I understand the CIGAR operations now. But I still do not know how to interpret read base clumns that contain '^~.' or '$' - what is a read segment in this context? (sorry if it is a silly question...)

Some example lines from my unfiltered pileup:

Code:

gnl|Nvit1.0|Chr1        1       T       T       30      0       60      1       ^~.     b
gnl|Nvit1.0|Chr1        51      C       C       147     0       60      40      ......$..................................       bbabaabbbbabbbbbbbbbbbbbbbbbbb_bbbbbbbba
gnl|Nvit1.0|Chr1        104     A       R       82      82      60      12      G$G$G$,,,,,,,^~,^~,     Bbbc`bbb\ab`

How do I interpret these?

**kjngo** · 05-17-2010, 03:30 PM

Hi Bruins,
I don't know if anyone answered your question about the ^* and the $ in the pileup file.
^* symbolizes that a read begins to align at this position and the character after the ^ is the mapping quality of the read. $ marks the position where a read ends on the alignment.

You can reference this at : http://samtools.sourceforge.net/pileup.shtml
Hope this helps

**Graham Etherington** · 05-18-2010, 12:26 AM

A quick question about the Pileup format - if $ means a position where a read ends on the alignment, than can the example pileup given on the Samtools website ever be true?

Pileup Format

http://samtools.sourceforge.net/pileup.shtml

At the third positions in the pileup at reference positions 272 and 274 there are $ signs, which in my understanding of this statement means that two reads end within two nucleotides of each other at the same alignment position, which is an impossible scenario.
Is this just a bad example of the pile up format, or have I misunderstood the explanation of the $ sign?

**nilshomer** · 05-18-2010, 12:36 AM

Originally posted by Graham Etherington View Post

...two reads end within two nucleotides of each other at the same alignment position...

The above statement does not make sense to me. Could you clarify?

Maybe this example will help:

Code:

REF: AAAACA
R1:  AAAAC
R2:    AACAA

Pileup(ish):

Code:

REF 1 A ^k. <
REF 2 A . <
REF 3 A .^k. <<
REF 4 A .. <<
REF 5 C .$. <<
REF 6 A .$ <

**Graham Etherington** · 05-18-2010, 01:52 AM

Got it! Thanks (and sorry for being so thick

)

**Bruins** · 05-18-2010, 04:01 AM

Hi all,

thanks for your replies (and related question :P)!!

So... '^~.' for a chromosome position where one read aligns means: the read starts with a match (.) and that match has quality represented by ~ (ASCII code of ~ minus 33).
'G$G$G$' means that at that particular position, three reads end with a mismatch (in all three cases a G).

If I got this correct, I think I know how to interpret lines like the ones in my example. Any comments?

cheers,

**Jenzo** · 03-29-2011, 04:13 AM

Hey all,

back to the CIGAR string: Would you consider it possible to calculate "Percent Identity" out of the CIGAR string?

My problem is: I use various mappers to align ~807000 454-Sequences against a set of "reference" sequences and need to decide, which of the 454-Sequences are (more or less) contained in my reference sequences. I would like to take only those sequences into account that have minimum of 95% Sequence Identity (number of identical residues (“matches”) in relation to the length of the alignment) in their alignment. For example in SSAHA2 (http://www.sanger.ac.uk/resources/software/ssaha2/) it is possible to set the parameter -identity properly. Other mappers do not provide such an option, so I would filter the SAM-Output via CIGAR string.
Do you get my problem? Perhaps it is not necessary and i totally overlooked other possibilities?

Thanks!

**lh3** · 03-29-2011, 11:38 AM

for nearly all existing SAMs: no, not directly. but you can use samtools calmd to generate extra tags given the reference sequence and SAM.

**Jenzo** · 03-30-2011, 12:06 AM

Dear lh3,
thanks for this hint. I'll try to use calmd now!

Update: With an own perlscript it was possible to use information in cigar string and calmd-sam to calculate % Sequence Identity. Thanks a lot for help provided!

**Irina Pulyakhina** · 02-03-2012, 04:32 AM

Hi all,

Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

Thanks!

**arvid** · 02-03-2012, 05:50 AM

Originally posted by Irina Pulyakhina View Post

Hi all,

Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

Thanks!

N is typically used to define an intron when aligning mRNA-originating reads to a genome; D would be for real deletions. Though the usage of N is not strictly defined in the SAM spec.

Topics	Statistics	Last Post
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, Yesterday, 11:58 AM	0 responses 13 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 36 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 60 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM

Unconfigured Ad

CIGAR format in SAMtools

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News