Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Default quality encoding system of SAMTools&GATK

    Hi. Everyone. Pls do me a great favor. If anyone knows what is the default encoding system for fastq (Sanger/Illumina etc) uesd in the calculation of SNP scores in SAMtools mpileup & GATK (UnifiedGenotyper) , pls let me know.
    Thank you.

  • #2
    As samtools don't work directly on FASTQ files, your question doesn't really make sense.

    The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

    Comment


    • #3
      Thank u. Yeah. I mean after the assemly. I have found some non-ref homo SNP positions with very strange nt distribution after using SAMtools mpileup & GATK to do the SNP call. Such like :
      the ref is A
      the reads at that positions could be 50 A and 10 T, but the caller call it alle:T and genotype 1/1. I have view the quality at that position under samtools tview. Indeed, most of the A is in blue or green color. So I think maybe I should not transfer the quality from Illumina to sanger (the default encoding system is Illumina > 1.3).


      Originally posted by maubp View Post
      As samtools don't work directly on FASTQ files, your question doesn't really make sense.

      The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

      Comment


      • #4
        just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?

        Comment


        • #5
          Originally posted by Bruins View Post
          just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?
          In my opinion, all of them may be presented in a quite similar way, like Phred score. And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.

          Comment


          • #6
            Originally posted by dingxiaofan1 View Post
            In my opinion, all of them may be presented in a quite similar way, like Phred score.
            They are.
            Originally posted by dingxiaofan1 View Post
            And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.
            True.

            I haven't used samtools tview so I don't know what blue or green means.

            I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

            The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

            Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

            Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

            The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

            So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

            Does that make sense?

            Comment


            • #7
              Originally posted by Bruins View Post
              This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.
              Yes, but it is not a default - the SAM spec is explicit that read quality scores are PHRED scores stored in ASCII with an offset of 33 (just like Sanger FASTQ).

              Comment


              • #8
                you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P

                Comment


                • #9
                  Originally posted by Bruins View Post
                  you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P
                  Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.

                  Comment


                  • #10
                    Originally posted by maubp View Post
                    Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.
                    Originally posted by Bruins View Post
                    So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.
                    dingxiaofan1, is this what you mean?

                    Comment


                    • #11
                      Thank u so much. And very sorry for my late reply. Yes, that is what I mean. But I had checked fastq wiki and my file carefully. quality fields in Illumina fastq format is not the same as Sanger fastq.
                      Sanger can encode a Phred quality score from 0 to 93 using ASCII 33 to 126, while Illumina 1.3+ encode from 0-62 using 64-126.
                      Generally, the quality per base for Illumina is much higher than that of Sanger with the same ASCII character. So transfer or not means a lot if SAMtools and GATK using the Phred 33 as default.


                      Originally posted by Bruins View Post
                      dingxiaofan1, is this what you mean?
                      Originally posted by Bruins View Post
                      They are.

                      True.

                      I haven't used samtools tview so I don't know what blue or green means.

                      I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

                      The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

                      Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

                      Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

                      The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

                      So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

                      Does that make sense?

                      Comment


                      • #12
                        Maybe SAMTools and GATK are using Phred 33 as the default system. But I cannot find any options we could choose from the SNP callers. Also I cannot change the threshold score for base calculation (base Phred >=30 pass ? ). I think it is a bit too strict. And by allowing us using different threshold scores, we may find the best threshold suitable for our own data with less errors.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Non-Coding RNA Research and Technologies
                          by seqadmin




                          Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                          Nobel Prize for MicroRNA Discovery
                          This week,...
                          10-07-2024, 08:07 AM
                        • seqadmin
                          Recent Developments in Metagenomics
                          by seqadmin





                          Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                          09-23-2024, 06:35 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:55 AM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 10-02-2024, 04:51 AM
                        0 responses
                        109 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 10-01-2024, 07:10 AM
                        0 responses
                        114 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 09-30-2024, 08:33 AM
                        1 response
                        118 views
                        0 likes
                        Last Post EmiTom
                        by EmiTom
                         
                        Working...
                        X