  • SAMtools pileup format - reference base question

    Hi everybody,

    I have a question concerning the SAMtools pileup format. I performed SNP (and Indel) calling for 75 bp PE Illumina data with hg19 by following the protocol explained here.

    I noticed in the filtered output the following SNP call:

    chr3 57627500 t C 241 241 60 71 C$c$c$c$ccccccccccccccCccccccCcccccccccccccccccccccccccccccccccccccccccccc^]c BCCCCCCCCCCCCCCCCCBCCCCCCBCCCCC@CCCCCCCCCC?C=CCCCCCCCC>CC?CBCCCBCC@CCC@

    Why is the reference base (t) written in lower case? I read that in some of MAQ's tools (e.g. cns2fq) "bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called."
    I doubt that this applies here, because the coverage seems fine (71), the SNP appears on both strands, the alignments are reliable (RMS MQ = 60), and, according to UCSC, the position where the SNP is called has quite good mappability.

    Additionally, indel lines have more than 13 columns. Does anybody know what the additional 14th and 15th columns mean?

    Any hint/help will be greatly appreciated!
    Best regards

  • #2
    As per the samtools manual page:

    At this column [reference], a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand.
    I believe that since all of your reads are a 'C' either forward (uppercase) or reverse strand (lowercase) then the reference is upper/lowercase depending on the predominance of the reads; i.e., since most of the reads are reverse then your reference is 'reverse'.

    I do not believe that MAQ has anything to do with the sam format.

    • #3
      In your reference file, t is in lowercase.

      • #4
        @westerman: Sorry, I have to correct you: the column you refer to is not the reference but the read bases column.
        @mfischer: Your reference is "t" because of the so-called soft-masking by UCSC, which writes bases in lower case where there is a repeat. The UCSC browser informs me that chr3:57627500 lies inside a simple repeat, (CAAAA)n. At the same time, that position is a C/T SNP. So everything is OK: you have a homozygous SNP allele (C) that is supported by reads from both strands, but mostly from the reverse strand (c).
        As to the 14th and 15th columns: do you mean the 11th to 13th? The samtools FAQ says that these are indel-specific, see
        http://sourceforge.net/apps/mediawik..._pileup_output.
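
The two points in this reply (lower case in the reference column means soft-masking, while case in the read bases column means strand) can be checked with a small script. This is only a minimal sketch, not part of samtools: the function names `count_strand_support` and `is_soft_masked` are made up for illustration, and the read string in the demo is a shortened, invented one rather than the full line from post #1.

```python
import re


def count_strand_support(read_bases: str) -> dict:
    """Tally matches and mismatches per strand in a pileup read-bases string."""
    counts = {"fwd_match": 0, "rev_match": 0, "fwd_mismatch": 0, "rev_mismatch": 0}
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # '^' marks a read start; the next character encodes mapping quality.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, length, then that many inserted/deleted bases.
            m = re.match(r"\d+", read_bases[i + 1:])
            if m:
                i += 1 + len(m.group()) + int(m.group())
                continue
        if ch == ".":
            counts["fwd_match"] += 1
        elif ch == ",":
            counts["rev_match"] += 1
        elif ch in "ACGTN":
            counts["fwd_mismatch"] += 1
        elif ch in "acgtn":
            counts["rev_mismatch"] += 1
        # '$' (read end) and any other symbol are skipped.
        i += 1
    return counts


def is_soft_masked(ref_base: str) -> bool:
    """Lower case in the reference column reflects soft-masking of the FASTA."""
    return ref_base.islower()


# Shortened, invented read string in the style of the line from post #1.
demo = count_strand_support("C$c$cc^]c")
print(demo["fwd_mismatch"], demo["rev_mismatch"], is_soft_masked("t"))
```

For the demo string this counts one forward-strand and four reverse-strand mismatches, and reports the reference base "t" as soft-masked.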

        • #5
          Thanks for the replies. I totally forgot that UCSC repeat-masks the reference.

          @epigen: I expected to see 13 columns in the indel rows, as described in the link you sent, but I actually got 15 columns for every indel. An example would be:

          chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0

          It seems like others have experienced that as well, see http://seqanswers.com/forums/showthread.php?t=4234 post #8.
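
To make the mismatch concrete, here is a rough sketch that splits the indel line above and separates the 13 columns described in the FAQ from whatever follows. The column names are my own labels based on my reading of the FAQ, not an official schema, and `parse_indel_line` is a hypothetical helper.

```python
# Labels for columns 1-13 are assumptions based on the samtools FAQ
# description of `pileup -c` indel lines; they are not official names.
INDEL_COLUMNS = [
    "chrom", "pos", "ref", "genotype", "consensus_quality", "snp_quality",
    "mapping_quality", "depth", "allele1", "allele2",
    "reads_allele1", "reads_allele2", "reads_other_indel",
]


def parse_indel_line(line: str) -> dict:
    """Split a pileup indel line into named fields plus any undocumented extras."""
    fields = line.split()
    record = dict(zip(INDEL_COLUMNS, fields))
    # Everything past column 13 is undocumented in this thread; keep it raw.
    record["extra"] = fields[len(INDEL_COLUMNS):]
    return record


line = "chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0"
record = parse_indel_line(line)
print(len(line.split()), record["extra"])
```

For this one line, columns 11 to 13 (24, 7 and 2) add up to the read count in column 8 (33), which at least fits the FAQ's description of those three columns. The sketch deliberately leaves columns 14 and 15 uninterpreted.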

          • #6
            additional columns in samtools pileup output for indels

            Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?

            • #7
              Originally posted by epigen
              Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?
              His seqanswers handle is lh3.

              • #8
                I wrote an email to the samtools mailing list.

                Hi everybody,

                according to the SAM FAQ page the pileup format has 13 columns for indel
                lines (when the pileup is called with -c). I noticed in my pileup files
                that all indel rows have 15 columns. Does anybody know what column 14
                and 15 are?

                Thanks in advance
                Cheers
                Maybe this helps

                • #9
                  Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.

                  • #10
                    So far, I haven't received any answers to that question. But I have to admit that I didn't dig deeper into the issue.

                    • #11
                      Originally posted by mard
                      Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.

                      • #12
                        Originally posted by nilshomer
                        Thanks for the link but I can only see explanations for 13 out of the 15 columns there.
                        This issue has also been reported in this thread:
                        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
