  • mfischer
    Junior Member
    • Mar 2010
    • 9

    SAMtools pileup format - reference base question

    Hi everybody,

    I have a question concerning the SAMtools pileup format. I performed SNP (and Indel) calling for 75 bp PE Illumina data with hg19 by following the protocol explained here.

    I noticed in the filtered output the following SNP call:

    chr3 57627500 t C 241 241 60 71 C$c$c$c$ccccccccccccccCccccccCcccccccccccccccccccccccccccccccccccccccccccc^]c BCCCCCCCCCCCCCCCCCBCCCCCCBCCCCC@CCCCCCCCCC?C=CCCCCCCCC>CC?CBCCCBCC@CCC@

    Why is the reference base (t) written in lower case? I read that in some of MAQ's tools (e.g. cns2fq) "bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called."
    I doubt that this explanation applies here, because the coverage seems fine (71), the SNP appears on both strands, the alignments are reliable (RMS MQ = 60), and, according to UCSC, the position where the SNP is called has quite good mappability.

    Additionally, indel lines have more than 13 columns. Does anybody know what the additional 14th and 15th columns mean?

    Any hint/help will be greatly appreciated!
    Best regards
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    As per the samtools manual page:

    At this column [reference], a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand.
    I believe that since all of your reads are a 'C' either forward (uppercase) or reverse strand (lowercase) then the reference is upper/lowercase depending on the predominance of the reads; i.e., since most of the reads are reverse then your reference is 'reverse'.

    I do not believe that MAQ has anything to do with the sam format.

    • lh3
      Senior Member
      • Feb 2008
      • 686

      #3
      In your reference file, t is in lowercase.
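
A minimal sketch of how one might check this directly, assuming the reference is a plain FASTA file small enough to read into memory. The helper names `base_at` and `is_soft_masked` and the toy sequence are made up for illustration. For a full genome such as hg19, an indexed lookup (for example `samtools faidx` with a region like `chr3:57627500-57627500`) is the more practical route.

```python
def base_at(fasta_text, chrom, pos):
    """Return the base at 1-based position `pos` of sequence `chrom`.

    `fasta_text` is the content of a FASTA file as one string.
    Case is preserved, so soft-masked bases come back in lower case.
    """
    chunks = []
    in_target = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The sequence name is the first token after '>'.
            fields = line[1:].split()
            in_target = bool(fields) and fields[0] == chrom
        elif in_target:
            chunks.append(line.strip())
    sequence = "".join(chunks)
    if not 1 <= pos <= len(sequence):
        raise IndexError(f"{chrom}:{pos} is outside the sequence")
    return sequence[pos - 1]


def is_soft_masked(base):
    """Lower-case bases mark soft-masked (repeat) regions in UCSC references."""
    return base.islower()


# Toy reference (hypothetical): positions 5-9 are soft-masked.
toy = ">chrToy\nACGTaca\naaTTGG\n"
for position in (5, 10):
    base = base_at(toy, "chrToy", position)
    print(position, base, "soft-masked" if is_soft_masked(base) else "unmasked")
```

The pileup reference column simply echoes what is in the FASTA, so a lower-case base there says something about the reference file, not about the reads.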

      • epigen
        Senior Member
        • May 2010
        • 101

        #4
        @westerman: Sorry I have to correct you: The column you refer to is not the reference but the reads column.
        @mfischer: Your reference is "t" because of the so-called soft-masking by UCSC, which writes bases in lower case when they fall inside a repeat. The UCSC browser informs me that chr3:57627500 lies inside a simple repeat, (CAAAA)n. At the same time, that position is a known C/T SNP. So everything is OK: you have a homozygous SNP allele (C) that is supported by reads from both strands, though mostly from the reverse strand (c).
        As to the 14th and 15th columns - do you mean 11th to 13th? The samtools FAQ says that these are indel-specific, see
        http://sourceforge.net/apps/mediawik..._pileup_output.
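
The strand reasoning above can be checked by tallying the read-bases column. The following is a rough sketch, not samtools code: it follows the symbol conventions quoted from the manual page in post #2 (dot and comma for matches, upper and lower case letters for mismatches) and additionally assumes the usual `^` + mapping-quality character at read starts, `$` at read ends, and `+N`/`-N` indel annotations. The function name `count_strands` is made up.

```python
import re


def count_strands(read_bases):
    """Tally pileup read-base symbols by strand and by match/mismatch."""
    counts = {"fwd_match": 0, "rev_match": 0, "fwd_mismatch": 0, "rev_mismatch": 0}
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # '^' marks a read start; the next character encodes mapping quality.
            i += 2
            continue
        if ch in "+-":
            # Indel annotation: sign, length, then that many inserted/deleted bases.
            m = re.match(r"\d+", read_bases[i + 1:])
            i += 1 + (len(m.group()) + int(m.group()) if m else 0)
            continue
        if ch == ".":
            counts["fwd_match"] += 1
        elif ch == ",":
            counts["rev_match"] += 1
        elif ch in "ACGTN":
            counts["fwd_mismatch"] += 1
        elif ch in "acgtn":
            counts["rev_mismatch"] += 1
        # Anything else ('$' read end, '*' deletion placeholder) is skipped.
        i += 1
    return counts


# Read-bases column copied from the SNP line in the first post.
bases = "C$c$c$c$ccccccccccccccCccccccCcccccccccccccccccccccccccccccccccccccccccccc^]c"
print(count_strands(bases))
```

For that line every symbol is a mismatch (`C` or `c`), so the two mismatch tallies show how the reads supporting the SNP split between forward and reverse strand.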

        • mfischer
          Junior Member
          • Mar 2010
          • 9

          #5
          Thanks for the replies. I totally forgot that UCSC repeat-masks the reference.

          @epigen: I expected to see 13 columns in the indel rows, as described in the link you sent, but I actually get 15 columns for every indel. An example would be:

          chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0

          It seems like others have experienced that as well, see http://seqanswers.com/forums/showthread.php?t=4234 post #8.
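
To make the mismatch concrete, here is a small sketch that splits an indel line and labels the fields. The 13 names are my paraphrase of the samtools FAQ description of indel lines and should be treated as an assumption. Fields beyond the 13th are deliberately labelled as undocumented, since nothing in this thread says what they mean. The name `label_indel_line` is made up.

```python
# Names paraphrased from the samtools FAQ for `pileup -c` indel lines (assumption).
INDEL_COLUMNS = [
    "chrom", "pos", "ref", "genotype", "consensus_quality", "snp_quality",
    "rms_mapping_quality", "depth", "allele1", "allele2",
    "reads_supporting_allele1", "reads_supporting_allele2",
    "reads_with_other_indels",
]


def label_indel_line(line):
    """Map each whitespace-separated field of an indel line to a column name."""
    labeled = {}
    for idx, value in enumerate(line.split()):
        if idx < len(INDEL_COLUMNS):
            name = INDEL_COLUMNS[idx]
        else:
            # Not covered by the documented 13 columns; meaning unknown here.
            name = f"undocumented_{idx + 1}"
        labeled[name] = value
    return labeled


example = "chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0"
fields = label_indel_line(example)
print(len(fields), "fields")
for name, value in fields.items():
    print(f"{name}\t{value}")
```

Run on the example, the last two entries come out as `undocumented_14` and `undocumented_15`, which are exactly the columns in question.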

          • epigen
            Senior Member
            • May 2010
            • 101

            #6
            additional columns in samtools pileup output for indels

            Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?

            • nilshomer
              Nils Homer
              • Nov 2008
              • 1283

              #7
              Originally posted by epigen View Post
              Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?
              His seqanswers handle is lh3.

              • mfischer
                Junior Member
                • Mar 2010
                • 9

                #8
                I wrote an email to the samtools mailing list.

                Hi everybody,

                according to the SAM FAQ page the pileup format has 13 columns for indel
                lines (when the pileup is called with -c). I noticed in my pileup files
                that all indel rows have 15 columns. Does anybody know what column 14
                and 15 are?

                Thanks in advance
                Cheers
                Maybe this helps

                • mard
                  Member
                  • Jan 2010
                  • 21

                  #9
                  Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.

                  • mfischer
                    Junior Member
                    • Mar 2010
                    • 9

                    #10
                    So far I haven't received any answers to that question, though I must admit that I haven't dug any deeper into the issue.

                    • nilshomer
                      Nils Homer
                      • Nov 2008
                      • 1283

                      #11
                      Originally posted by mard View Post
                      Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.

                      • mard
                        Member
                        • Jan 2010
                        • 21

                        #12
                        Originally posted by nilshomer View Post
                        Thanks for the link but I can only see explanations for 13 out of the 15 columns there.
                        This issue has also been reported in this thread:
                        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
