Dear all,
I am using the htseq-count tool to summarize gene counts from bam files generated by tophat (v 2.03) based on bowtie2. I've used this pipeline (based on bowtie1) several times with human RNA-Seq data and it has produced good results.
In the most recent project, we are working with the E. coli K12 genome and 100 bp paired-end reads.
I tried htseq-count on the accepted_hits.bam files generated by tophat, but it gave me many warnings of the form "xxx claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)". I then sorted the bam files with samtools before this step, but still had no luck: thousands of the same warnings came out and I got no reads in the output gene_counts.txt file.
I checked the sam file (first 10 lines, converted from the sorted bam file) and they looked like this:
HWI-ST984:1021021ACXX:2:1210:8261:88919 99 chr 1 255 4M14I82M = 57 156 AGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACC @@BFFFDFHHHHHJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJIJIIJJJJJJJHFFFBBCEEEEEEDDDDDDDDDDDDDDDDDDDDC AS:i:-57 XN:i:0 XM:i:2 XO:i:1 XG:i:14 NM:i:16 MD:Z:2C0T82 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1308:13660:65155 99 chr 2 255 6M9I85M = 117 215 TATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGT CCCFFFFFHGHHHJJJJJJJJJJJJJJJJJJJJJJJJJIIIIJIJJHIGIFJJJIJGHIJHHHH?CEFEFFEECD>@BCDDDCDDDDDD@CDDDDDBDD9 AS:i:-42 XN:i:0 XM:i:2 XO:i:1 XG:i:9 NM:i:11 MD:Z:0G0C89 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:2108:14990:23666 99 chr 10 255 100M = 167 257 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT CCCFFFFFHHHHHJJIIIIJJJJIIIJIJIJIIJHIJIJJJJIJJIJEHIJIJJJJJIHHHHHFFCDFFEEECEEDDDDDDDDBDDACCCDDDDDDCDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1214:16246:55224 89 chr 10 255 100M * 0 0 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT A@C>>CCC9?A<BC>::EECACC=>DDDDECCBB@@EFGGHFC===<FC?893F@B9B>EBDBDB9C9EFB3F?1JIEIGGIIGHEGHDHDFFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1108:7813:47825 99 chr 22 255 100M = 113 191 CGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGT CCCFFFFFHHGHHJJJJHJHIJIJJJJJJJJJJJJHJIIIGIJJIJJJJJJJJJJJJJJJHIJJHHHHHFDDDCC>CCEEDDDDEDDDFDDDDDDDDDCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1105:8881:46986 163 chr 23 255 100M = 137 214 GGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTC CCCFFFFFHHHHHJEIHGHGIGHGHIJIJJJIIIIIHIIJIIJIJJJJJJIIJHJJIJI@GIJJJIHHHBDFD>AEEEEDDDDEDDDEDDCCDDDDDDCD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
I then checked the sequence stats with samtools flagstat and found that 82.25% of reads are properly paired.
So what is wrong with my bam file? The majority of reads in the bam file are definitely proper mate pairs. Why can't they be sorted so that mate pairs are assigned to adjacent lines for htseq-count to read?
I used the samtools sort command to do the sorting. Any better ideas?
I'm pretty new to this field, so pardon me if similar questions have been asked before.
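In case it helps anyone reproduce the symptom, here is a minimal Python sketch of the check I have in mind: it scans SAM text and reports reads whose flag claims a mapped mate but whose mate is not on an adjacent line. The function name orphaned_mates, the helper rec, and the toy reads readA/readB are made up for illustration; this is not how htseq-count itself is implemented, only my assumption about what the warning means.

```python
from itertools import groupby

PAIRED = 0x1         # SAM flag bit: read is paired in sequencing
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate is unmapped


def orphaned_mates(sam_lines):
    """Return QNAMEs of records that claim a mapped mate but have no
    neighbouring record with the same read name."""
    records = [line.split("\t") for line in sam_lines
               if line.strip() and not line.startswith("@")]  # skip header lines
    orphans = []
    # groupby only merges *consecutive* records with the same QNAME,
    # so a mate that sits elsewhere in the file ends up in another group.
    for qname, group in groupby(records, key=lambda fields: fields[0]):
        group = list(group)
        for fields in group:
            flag = int(fields[1])
            claims_mate = (flag & PAIRED) and not (flag & MATE_UNMAPPED)
            if claims_mate and len(group) < 2:
                orphans.append(qname)
    return orphans


def rec(name, flag, pos):
    """Build a toy 11-field SAM record (hypothetical data, not from my run)."""
    return "\t".join([name, str(flag), "chr", str(pos), "255", "4M",
                      "=", "1", "0", "ACGT", "IIII"])


# Sorted by position: the two mates of each pair are separated.
coordinate_sorted = [rec("readA", 99, 1), rec("readB", 99, 10),
                     rec("readA", 147, 57), rec("readB", 147, 167)]
# Sorted by read name: the two mates of each pair are adjacent.
name_sorted = [rec("readA", 99, 1), rec("readA", 147, 57),
               rec("readB", 99, 10), rec("readB", 147, 167)]

print(orphaned_mates(coordinate_sorted))
print(orphaned_mates(name_sorted))
```

On the toy data the position-sorted list reports every record as orphaned, while the name-sorted list reports none. If the same pattern shows up on a real file, that would suggest the file is sorted by position rather than by read name; samtools sort has a -n option for sorting by read name, which might be what htseq-count expects here.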