Hello all,
I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.
A sample of the sam file from tophat I am looking at:
HWUSI-EAS1795_0102:5:116:16624:12048#0 256 I 2821 0 16M * 0 0 GCTGGGTCTGAACTCT IIIIIIIIIIBGGGGI NM:i:2 NH:i:14 CC:Z:= CP:i:13533166 HI:i:0
HWUSI-EAS1795_0102:5:103:16153:10159#0 256 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG HHHHHHHHHHHHEHHHH NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
HWUSI-EAS1795_0102:5:111:9819:1338#0 0 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG IIIIIIIIIIIIIIIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
HWUSI-EAS1795_0102:5:24:5025:14685#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC IIIIIIGIIHIIIIIHIHI NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:70:18860:13966#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GGGGHHGHHGGDGGGGFGG NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:85:1866:10941#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GIIIIIIIIIIIIIIGIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:86:8520:16181#0 256 I 2962 0 21M * 0 0 ACCGACAAAAGAAGAGGAACG HHHHHHHHHHDFDDFGFGGFE NM:i:0 NH:i:7 CC:Z:= CP:i:13533307 HI:i:0
HWUSI-EAS1795_0102:5:5:11293:13799#0 256 I 2970 0 30M * 0 0 AAGAAGAGGAACGCCAACTTTGGATAGACG HHFHHHHHHHEGGGGHHHHHHHGHHHHHGH NM:i:0 NH:i:7 CC:Z:= CP:i:13533315 HI:i:0
HWUSI-EAS1795_0102:5:120:17838:10446#0 256 I 2975 0 36M * 0 0 GAGGAACGCCAACTTTGGATAGACGCTCTAGGGGCT IHIIBEIIIIIIIIHIIG@IEHEEIIIFIBGIIIEH NM:i:0 NH:i:7 CC:Z:= CP:i:13533320 HI:i:0
HWUSI-EAS1795_0102:5:4:7318:17553#0 256 I 2990 0 36M * 0 0 TGGATAGACGCTCTAGGGGCTGATTTTGGTCGGAAA GHHHHGEHHEHHHHHHGHHGEG>GEGGGE@EDDFEE NM:i:0 NH:i:7 CC:Z:= CP:i:13533335 HI:i:0
The fifth column seems to report the mapping quality score. The only scores I am coming across in my sam file is 0, 1 or 255. This seems off to me since I have a match in the 6th column... am I interpreting this incorrectly?
I did an alignment with novoalign with the exact same input file and here is a sample of the sam file I am looking at:
HWUSI-EAS1795_0102:5:120:17501:21319#0/1 4 * 0 0 * * 0 0 TGGCGA IIIIII PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:17879:21320#0/1 4 * 0 0 * * 0 0 TGAGATC HGHHHHH PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:2546:21319#0/1 4 * 0 0 * * 0 0 TGAAGATGCGACATACCCGCAGCTAGAC IIIIIIIIIIIIIIIIIIHIIGIHIGGG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:17450:21318#0/1 0 chrI 15064402 77 26M * 0 0 AACTGGGCCTCCAGTTGGTACGTCTG HHHHHHDHHHHHHEHHHHHHHHHHHH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:26
HWUSI-EAS1795_0102:5:120:18252:21315#0/1 4 * 0 0 * * 0 0 CCCCGG IIIIHI PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:17255:21323#0/1 4 * 0 0 * * 0 0 CGGCGGAGGTCGCGGGTTCG GGEGGEGEG@GGGDGHDHH@ PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:19739:21305#0/1 4 * 0 0 * * 0 0 GTGAGCGGTCGAGACCTGGATTGACAAG IIHIHIHHHIGGIIIIIIDIIGGHIIDG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:7089:21322#0/1 4 * 0 0 * * 0 0 TGAGCTAAACCGTACTAATGATCGAGGGCTTACCAT IIIIIIIIIIIIIIIIIHIIEIIIIIIIIIIIIIIG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:18560:21325#0/1 0 chrI 15066796 48 21M * 0 0 ATAGAATAATGTAGGTAAGGG FFFFE<GGGGGG?GGGDGGGG PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:21
HWUSI-EAS1795_0102:5:120:18280:21316#0/1 0 chrI 15065492 43 20M * 0 0 GTCTTGAAACACGGATTGCG GGGGGGDGGGHGHGGHHHDH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:20
HWUSI-EAS1795_0102:5:120:17770:21316#0/1 0 chrI 15063143 3 36M * 0 0 TATGGTTGCAAAGCTGAAACTTAAAGAAATTGACGG BB?BBCFEFEGG@BDF8FEFGGG@GDGED;F8FEC2 PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:36 ZS:Z:R ZN:i:2
In this case, looking at column number 5, I clearly have scores that are ranging from 3 to 77... depending on the quality of my alignment...
My concern is that I am using a program to toss out poor quality alignments which is using a value of 30 or greater for mapping quality as a filter. Clearly, none of my tophat alignments are passing that filter except for the ones with a 255 (unknown) score.
Am I missing something here? If the tophat mapping quality scores are indeed correct - What can I do to filter out poor quality alignments from my tophat output?
Thanks for your help!
I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.
A sample of the sam file from tophat I am looking at:
HWUSI-EAS1795_0102:5:116:16624:12048#0 256 I 2821 0 16M * 0 0 GCTGGGTCTGAACTCT IIIIIIIIIIBGGGGI NM:i:2 NH:i:14 CC:Z:= CP:i:13533166 HI:i:0
HWUSI-EAS1795_0102:5:103:16153:10159#0 256 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG HHHHHHHHHHHHEHHHH NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
HWUSI-EAS1795_0102:5:111:9819:1338#0 0 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG IIIIIIIIIIIIIIIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
HWUSI-EAS1795_0102:5:24:5025:14685#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC IIIIIIGIIHIIIIIHIHI NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:70:18860:13966#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GGGGHHGHHGGDGGGGFGG NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:85:1866:10941#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GIIIIIIIIIIIIIIGIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
HWUSI-EAS1795_0102:5:86:8520:16181#0 256 I 2962 0 21M * 0 0 ACCGACAAAAGAAGAGGAACG HHHHHHHHHHDFDDFGFGGFE NM:i:0 NH:i:7 CC:Z:= CP:i:13533307 HI:i:0
HWUSI-EAS1795_0102:5:5:11293:13799#0 256 I 2970 0 30M * 0 0 AAGAAGAGGAACGCCAACTTTGGATAGACG HHFHHHHHHHEGGGGHHHHHHHGHHHHHGH NM:i:0 NH:i:7 CC:Z:= CP:i:13533315 HI:i:0
HWUSI-EAS1795_0102:5:120:17838:10446#0 256 I 2975 0 36M * 0 0 GAGGAACGCCAACTTTGGATAGACGCTCTAGGGGCT IHIIBEIIIIIIIIHIIG@IEHEEIIIFIBGIIIEH NM:i:0 NH:i:7 CC:Z:= CP:i:13533320 HI:i:0
HWUSI-EAS1795_0102:5:4:7318:17553#0 256 I 2990 0 36M * 0 0 TGGATAGACGCTCTAGGGGCTGATTTTGGTCGGAAA GHHHHGEHHEHHHHHHGHHGEG>GEGGGE@EDDFEE NM:i:0 NH:i:7 CC:Z:= CP:i:13533335 HI:i:0
The fifth column seems to report the mapping quality score. The only scores I am coming across in my sam file is 0, 1 or 255. This seems off to me since I have a match in the 6th column... am I interpreting this incorrectly?
I did an alignment with novoalign with the exact same input file and here is a sample of the sam file I am looking at:
HWUSI-EAS1795_0102:5:120:17501:21319#0/1 4 * 0 0 * * 0 0 TGGCGA IIIIII PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:17879:21320#0/1 4 * 0 0 * * 0 0 TGAGATC HGHHHHH PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:2546:21319#0/1 4 * 0 0 * * 0 0 TGAAGATGCGACATACCCGCAGCTAGAC IIIIIIIIIIIIIIIIIIHIIGIHIGGG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:17450:21318#0/1 0 chrI 15064402 77 26M * 0 0 AACTGGGCCTCCAGTTGGTACGTCTG HHHHHHDHHHHHHEHHHHHHHHHHHH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:26
HWUSI-EAS1795_0102:5:120:18252:21315#0/1 4 * 0 0 * * 0 0 CCCCGG IIIIHI PG:Z:novoalign ZS:Z:QC
HWUSI-EAS1795_0102:5:120:17255:21323#0/1 4 * 0 0 * * 0 0 CGGCGGAGGTCGCGGGTTCG GGEGGEGEG@GGGDGHDHH@ PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:19739:21305#0/1 4 * 0 0 * * 0 0 GTGAGCGGTCGAGACCTGGATTGACAAG IIHIHIHHHIGGIIIIIIDIIGGHIIDG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:7089:21322#0/1 4 * 0 0 * * 0 0 TGAGCTAAACCGTACTAATGATCGAGGGCTTACCAT IIIIIIIIIIIIIIIIIHIIEIIIIIIIIIIIIIIG PG:Z:novoalign ZS:Z:NM
HWUSI-EAS1795_0102:5:120:18560:21325#0/1 0 chrI 15066796 48 21M * 0 0 ATAGAATAATGTAGGTAAGGG FFFFE<GGGGGG?GGGDGGGG PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:21
HWUSI-EAS1795_0102:5:120:18280:21316#0/1 0 chrI 15065492 43 20M * 0 0 GTCTTGAAACACGGATTGCG GGGGGGDGGGHGHGGHHHDH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:20
HWUSI-EAS1795_0102:5:120:17770:21316#0/1 0 chrI 15063143 3 36M * 0 0 TATGGTTGCAAAGCTGAAACTTAAAGAAATTGACGG BB?BBCFEFEGG@BDF8FEFGGG@GDGED;F8FEC2 PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:36 ZS:Z:R ZN:i:2
In this case, looking at column number 5, I clearly have scores that are ranging from 3 to 77... depending on the quality of my alignment...
My concern is that I am using a program to toss out poor quality alignments which is using a value of 30 or greater for mapping quality as a filter. Clearly, none of my tophat alignments are passing that filter except for the ones with a 255 (unknown) score.
Am I missing something here? If the tophat mapping quality scores are indeed correct - What can I do to filter out poor quality alignments from my tophat output?
Thanks for your help!
Comment