Hello
I have been given some data in pileup format. Whilst it is beyond the scope of my work to go into the details of how the reads, alignments, etc. are generated, I think it would be remiss of me not to try to understand a bit more about what is going on.
I have read the SAMtools manual and its explanation of the pileup format, but I still don't understand a few things. Here is an example taken from the manual:
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
I don't understand the significance of the $ and ^ characters. The docs say they mark the end and the start of read segments, respectively. Take the first line. Looking at the $, is that saying that there is a read (the third read) whose last base is at position 272, and that the last base in that read is the '.' after the $? Also, looking at the ^ in row 1, is it saying that the 24th read starts at position 272 and that its first base is the '.' after ^+?
It also says 'The ASCII of the character following `^' minus 33 gives the mapping quality.' What is the mapping quality? Is it the quality of the mapping of that read to the reference genome? So for row 1 the mapping quality is 10 (43, the ASCII code for '+', minus 33) and for row 4 it is 75 (108, the ASCII code for 'l', minus 33). I assumed this mapping quality information would be in the SAM file.
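To make the reading above concrete, here is a rough Python sketch of how the read-bases column could be walked. It assumes the layout used by Samtools pileup output: `^` and one mapping-quality character come directly before the base of a read that starts at this position, `$` comes directly after the base of a read that ends here, and `+n`/`-n` followed by n bases is an indel attached to the preceding read. The function name `walk_read_bases` and the dictionary keys are made up for illustration and are not part of Samtools. Under these assumptions, the `$` on the first line belongs to the second read, not the third.

```python
def walk_read_bases(bases):
    """Split one pileup read-bases column into one record per read.

    Hypothetical helper, not a Samtools function.
    """
    reads = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # '^' + mapping-quality character + first base of a new read
            mapq = ord(bases[i + 1]) - 33
            reads.append({"base": bases[i + 2], "start_mapq": mapq, "ends": False})
            i += 3
        elif ch == "$":
            # '$' marks the previous base as the last base of its read
            reads[-1]["ends"] = True
            i += 1
        elif ch in "+-":
            # Indel: sign, decimal length, then that many bases; skipped here
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
        else:
            # Ordinary base: '.', ',', a mismatch letter, '*', etc.
            reads.append({"base": ch, "start_mapq": None, "ends": False})
            i += 1
    return reads


row1 = walk_read_bases(",.$.....,,.,.,...,,,.,..^+.")
print(len(row1))                                    # 24
print([n for n, r in enumerate(row1, 1) if r["ends"]])  # [2]
print(row1[-1]["start_mapq"])                       # 10
```

The indel branch simply skips the inserted or deleted sequence, which is enough for counting reads per position but would need extending if the indels themselves matter.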
If I am looking at variant data, is it safe to discard the $ and ^ data, as I don't need to reconstruct the read sequence from the pileup? I also don't see how I can use the mapping quality of a read segment to assess the quality of my variant data (please correct me if I am wrong!). As far as I can tell, I can only meaningfully use the base qualities in the next column.
Also, what is a reference skip, which is mentioned in the pileup user manual?
Having never used Samtools, I am guessing there will be utilities to extract this data directly from the pileup file rather than my writing a custom parser?
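For reference, this is roughly how I would pull the base qualities out of a line if I did end up writing something myself. It is only a sketch: it assumes the six whitespace-separated columns shown in the example (sequence name, position, reference base, depth, read bases, base qualities) and that base qualities use the same ASCII-minus-33 encoding as the mapping quality. `parse_pileup_line` is a name I made up, not a Samtools utility.

```python
def parse_pileup_line(line):
    """Split one six-column pileup line and decode its base qualities.

    Hypothetical helper; assumes ASCII-minus-33 (Phred+33) qualities.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    chrom, pos, ref, depth, bases, quals = fields
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "bases": bases,  # left undecoded; still contains $ and ^ markers
        "base_quals": [ord(c) - 33 for c in quals],
    }


rec = parse_pileup_line("seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+")
print(rec["depth"], len(rec["base_quals"]))  # 23 23
print(rec["base_quals"][-1])                 # 10
```

The last read at position 273 is the mismatch 'A', and its decoded base quality is 10, which is the kind of per-base number I would be filtering on.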
Thank you for your time