Hi all,
We have been assessing different methods of calling SNPs from our RNA-seq runs ranging from using the GA pipeline itself, to using MAQ to call the SNPs (using MAQ aligner to align the reference genome as well as a set of set of splice junction sequences derived from all known genes). Both these methods gave broadly similar results, which was great!
However I have been interested in using Tophat to align our sequences as it would allow us to investigate novel transcripts or splice junctions not in the currently annotated gene set. The alignment using Tophat appeared to work fine, aside from seeming to output the quality scores for each aligned read in Solexa rather than the phred format I was expecting (using the --solexa1.3-quals option). Using cufflinks to quantitate transcript abundance gave results again broadly similar to the GA analysis pipeline, which again was super.
However using Samtool to call SNPs in the Tophat aligned reads resulted in over four times as many SNPs being called as by the other two methods. Just eyeballing the alignments indicates that most of the new Samtools-called SNPs are called at the beginning and end of exons, where fewer reads have aligned back or where reads have been split between two adjacent exons. Using Samtools to call SNPs on the alignments previously generated by MAQ resulted in only marginally more SNPs called than by the MAQ SNP caller itself, leading us to believe that the problem is with the Tophat alignment rather than the SNP caller.
All this may be a result of something stupid I am doing wrong but I figured there is no harm in finding out if others out there had encountered similar problems and if you have, how you dealt with them? I don't simply want to discard SNPs called at the beginning or end of exons for obvious reasons!
Thanks for any help anyone can give me on this
We have been assessing different methods of calling SNPs from our RNA-seq runs ranging from using the GA pipeline itself, to using MAQ to call the SNPs (using MAQ aligner to align the reference genome as well as a set of set of splice junction sequences derived from all known genes). Both these methods gave broadly similar results, which was great!
However I have been interested in using Tophat to align our sequences as it would allow us to investigate novel transcripts or splice junctions not in the currently annotated gene set. The alignment using Tophat appeared to work fine, aside from seeming to output the quality scores for each aligned read in Solexa rather than the phred format I was expecting (using the --solexa1.3-quals option). Using cufflinks to quantitate transcript abundance gave results again broadly similar to the GA analysis pipeline, which again was super.
However using Samtool to call SNPs in the Tophat aligned reads resulted in over four times as many SNPs being called as by the other two methods. Just eyeballing the alignments indicates that most of the new Samtools-called SNPs are called at the beginning and end of exons, where fewer reads have aligned back or where reads have been split between two adjacent exons. Using Samtools to call SNPs on the alignments previously generated by MAQ resulted in only marginally more SNPs called than by the MAQ SNP caller itself, leading us to believe that the problem is with the Tophat alignment rather than the SNP caller.
All this may be a result of something stupid I am doing wrong but I figured there is no harm in finding out if others out there had encountered similar problems and if you have, how you dealt with them? I don't simply want to discard SNPs called at the beginning or end of exons for obvious reasons!
Thanks for any help anyone can give me on this
Comment