Announcement

Collapse
No announcement yet.

mapping quality score in tophat sam file

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mapping quality score in tophat sam file

    Hello all,


    I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.

    A sample of the sam file from tophat I am looking at:

    HWUSI-EAS1795_0102:5:116:16624:12048#0 256 I 2821 0 16M * 0 0 GCTGGGTCTGAACTCT IIIIIIIIIIBGGGGI NM:i:2 NH:i:14 CC:Z:= CP:i:13533166 HI:i:0
    HWUSI-EAS1795_0102:5:103:16153:10159#0 256 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG HHHHHHHHHHHHEHHHH NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
    HWUSI-EAS1795_0102:5:111:9819:1338#0 0 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG IIIIIIIIIIIIIIIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
    HWUSI-EAS1795_0102:5:24:5025:14685#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC IIIIIIGIIHIIIIIHIHI NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:70:18860:13966#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GGGGHHGHHGGDGGGGFGG NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:85:1866:10941#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GIIIIIIIIIIIIIIGIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:86:8520:16181#0 256 I 2962 0 21M * 0 0 ACCGACAAAAGAAGAGGAACG HHHHHHHHHHDFDDFGFGGFE NM:i:0 NH:i:7 CC:Z:= CP:i:13533307 HI:i:0
    HWUSI-EAS1795_0102:5:5:11293:13799#0 256 I 2970 0 30M * 0 0 AAGAAGAGGAACGCCAACTTTGGATAGACG HHFHHHHHHHEGGGGHHHHHHHGHHHHHGH NM:i:0 NH:i:7 CC:Z:= CP:i:13533315 HI:i:0
    HWUSI-EAS1795_0102:5:120:17838:10446#0 256 I 2975 0 36M * 0 0 GAGGAACGCCAACTTTGGATAGACGCTCTAGGGGCT [email protected] NM:i:0 NH:i:7 CC:Z:= CP:i:13533320 HI:i:0
    HWUSI-EAS1795_0102:5:4:7318:17553#0 256 I 2990 0 36M * 0 0 TGGATAGACGCTCTAGGGGCTGATTTTGGTCGGAAA GHHHHGEHHEHHHHHHGHHGEG>[email protected] NM:i:0 NH:i:7 CC:Z:= CP:i:13533335 HI:i:0

    The fifth column seems to report the mapping quality score. The only scores I am coming across in my sam file is 0, 1 or 255. This seems off to me since I have a match in the 6th column... am I interpreting this incorrectly?

    I did an alignment with novoalign with the exact same input file and here is a sample of the sam file I am looking at:

    HWUSI-EAS1795_0102:5:120:17501:21319#0/1 4 * 0 0 * * 0 0 TGGCGA IIIIII PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:17879:21320#0/1 4 * 0 0 * * 0 0 TGAGATC HGHHHHH PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:2546:21319#0/1 4 * 0 0 * * 0 0 TGAAGATGCGACATACCCGCAGCTAGAC IIIIIIIIIIIIIIIIIIHIIGIHIGGG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:17450:21318#0/1 0 chrI 15064402 77 26M * 0 0 AACTGGGCCTCCAGTTGGTACGTCTG HHHHHHDHHHHHHEHHHHHHHHHHHH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:26
    HWUSI-EAS1795_0102:5:120:18252:21315#0/1 4 * 0 0 * * 0 0 CCCCGG IIIIHI PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:17255:21323#0/1 4 * 0 0 * * 0 0 CGGCGGAGGTCGCGGGTTCG [email protected]@ PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:19739:21305#0/1 4 * 0 0 * * 0 0 GTGAGCGGTCGAGACCTGGATTGACAAG IIHIHIHHHIGGIIIIIIDIIGGHIIDG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:7089:21322#0/1 4 * 0 0 * * 0 0 TGAGCTAAACCGTACTAATGATCGAGGGCTTACCAT IIIIIIIIIIIIIIIIIHIIEIIIIIIIIIIIIIIG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:18560:21325#0/1 0 chrI 15066796 48 21M * 0 0 ATAGAATAATGTAGGTAAGGG FFFFE<GGGGGG?GGGDGGGG PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:21
    HWUSI-EAS1795_0102:5:120:18280:21316#0/1 0 chrI 15065492 43 20M * 0 0 GTCTTGAAACACGGATTGCG GGGGGGDGGGHGHGGHHHDH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:20
    HWUSI-EAS1795_0102:5:120:17770:21316#0/1 0 chrI 15063143 3 36M * 0 0 TATGGTTGCAAAGCTGAAACTTAAAGAAATTGACGG [email protected]@GDGED;F8FEC2 PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:36 ZS:Z:R ZN:i:2

    In this case, looking at column number 5, I clearly have scores that are ranging from 3 to 77... depending on the quality of my alignment...

    My concern is that I am using a program to toss out poor quality alignments which is using a value of 30 or greater for mapping quality as a filter. Clearly, none of my tophat alignments are passing that filter except for the ones with a 255 (unknown) score.
    Am I missing something here? If the tophat mapping quality scores are indeed correct - What can I do to filter out poor quality alignments from my tophat output?

    Thanks for your help!

  • #2
    Originally posted by jjpurwar View Post
    I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.
    Whenever the is something funny going on with quality scores, the first thing I try to check is what kind of FASTQ encoding did you start with (Sanger, old Solexa, or Illumina). And did you tell tophat this, or just trust the default was appropriate?

    Comment


    • #3
      Hi, so I did specify the --solexa1.3-quals as my option. I dont take for granted the default option.
      I did check with tophat regarding this matter - turns out that they dont take mapping qualities into account. This is worrisome - since you might want to throw out poor quality alignments. But I guess thats the trade off since tophat is a really fast aligner....

      Comment


      • #4
        Hi, so I did specify the --solexa1.3-quals as my option. I dont take for granted the default option.
        I did check with tophat regarding this matter - turns out that they dont take mapping qualities into account. This is worrisome - since you might want to throw out poor quality alignments. But I guess thats the trade off since tophat is a really fast aligner....

        Comment


        • #5
          Tophat may not look at the qualities, but it does write them out in the SAM/BAM file. This is important because if Tophat misunderstood the input FASTQ encoding, they will be wrong in the SAM/BAM.

          It would be worth double checking a couple of reads - compare the input quality (which you say used the Solexa/Illumina 1.3 style, meaning PHRED with offset 64) and the SAM/BAM quality (which should be Sanger style, meaning PHRED with offset 33).

          [Given Tophat doesn't look at the read qualities, this is a separate issue to your original query about the mapping qualities]

          Comment


          • #6
            I believe the read quality is the column after the sequence. For example, the quality for the read "GCTGGGTCTGAACTCT" is "IIIIIIIIIIBGGGGI" where
            "B" represents a Phred quality score of 33 (Assuming Sanger scoring)
            "G" represents a Phred quality score of 38
            "I" represents a Phred quality score of 40
            To convert the letter to the Phred, just take the ASCII representation and subtract 33.

            Comment

            Working...
            X