Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dingxiaofan1
    Member
    • Jul 2010
    • 17

    Default quality encoding system of SAMTools&GATK

    Hi. Everyone. Pls do me a great favor. If anyone knows what is the default encoding system for fastq (Sanger/Illumina etc) uesd in the calculation of SNP scores in SAMtools mpileup & GATK (UnifiedGenotyper) , pls let me know.
    Thank you.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    As samtools don't work directly on FASTQ files, your question doesn't really make sense.

    The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

    Comment

    • dingxiaofan1
      Member
      • Jul 2010
      • 17

      #3
      Thank u. Yeah. I mean after the assemly. I have found some non-ref homo SNP positions with very strange nt distribution after using SAMtools mpileup & GATK to do the SNP call. Such like :
      the ref is A
      the reads at that positions could be 50 A and 10 T, but the caller call it alle:T and genotype 1/1. I have view the quality at that position under samtools tview. Indeed, most of the A is in blue or green color. So I think maybe I should not transfer the quality from Illumina to sanger (the default encoding system is Illumina > 1.3).


      Originally posted by maubp View Post
      As samtools don't work directly on FASTQ files, your question doesn't really make sense.

      The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

      Comment

      • Bruins
        Member
        • Feb 2010
        • 78

        #4
        just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?

        Comment

        • dingxiaofan1
          Member
          • Jul 2010
          • 17

          #5
          Originally posted by Bruins View Post
          just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?
          In my opinion, all of them may be presented in a quite similar way, like Phred score. And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.

          Comment

          • Bruins
            Member
            • Feb 2010
            • 78

            #6
            Originally posted by dingxiaofan1 View Post
            In my opinion, all of them may be presented in a quite similar way, like Phred score.
            They are.
            Originally posted by dingxiaofan1 View Post
            And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.
            True.

            I haven't used samtools tview so I don't know what blue or green means.

            I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

            The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

            Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

            Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

            The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

            So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

            Does that make sense?

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Originally posted by Bruins View Post
              This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.
              Yes, but it is not a default - the SAM spec is explicit that read quality scores are PHRED scores stored in ASCII with an offset of 33 (just like Sanger FASTQ).

              Comment

              • Bruins
                Member
                • Feb 2010
                • 78

                #8
                you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Originally posted by Bruins View Post
                  you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P
                  Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.

                  Comment

                  • Bruins
                    Member
                    • Feb 2010
                    • 78

                    #10
                    Originally posted by maubp View Post
                    Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.
                    Originally posted by Bruins View Post
                    So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.
                    dingxiaofan1, is this what you mean?

                    Comment

                    • dingxiaofan1
                      Member
                      • Jul 2010
                      • 17

                      #11
                      Thank u so much. And very sorry for my late reply. Yes, that is what I mean. But I had checked fastq wiki and my file carefully. quality fields in Illumina fastq format is not the same as Sanger fastq.
                      Sanger can encode a Phred quality score from 0 to 93 using ASCII 33 to 126, while Illumina 1.3+ encode from 0-62 using 64-126.
                      Generally, the quality per base for Illumina is much higher than that of Sanger with the same ASCII character. So transfer or not means a lot if SAMtools and GATK using the Phred 33 as default.


                      Originally posted by Bruins View Post
                      dingxiaofan1, is this what you mean?
                      Originally posted by Bruins View Post
                      They are.

                      True.

                      I haven't used samtools tview so I don't know what blue or green means.

                      I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

                      The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

                      Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

                      Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

                      The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

                      So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

                      Does that make sense?

                      Comment

                      • dingxiaofan1
                        Member
                        • Jul 2010
                        • 17

                        #12
                        Maybe SAMTools and GATK are using Phred 33 as the default system. But I cannot find any options we could choose from the SNP callers. Also I cannot change the threshold score for base calculation (base Phred >=30 pass ? ). I think it is a bit too strict. And by allowing us using different threshold scores, we may find the best threshold suitable for our own data with less errors.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          New Genomics Tools and Methods Shared at AGBT 2025
                          by seqadmin


                          This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                          The Headliner
                          The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                          03-03-2025, 01:39 PM
                        • seqadmin
                          Investigating the Gut Microbiome Through Diet and Spatial Biology
                          by seqadmin




                          The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                          02-24-2025, 06:31 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 03-20-2025, 05:03 AM
                        0 responses
                        17 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-19-2025, 07:27 AM
                        0 responses
                        18 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-18-2025, 12:50 PM
                        0 responses
                        19 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-03-2025, 01:15 PM
                        0 responses
                        185 views
                        0 reactions
                        Last Post seqadmin  
                        Working...