I am running TopHat and Cufflinks on a bacterial genome using Galaxy.
For TopHat, I set the minimum intron length to 15 bp and the maximum intron length to 1500 bp. On visual inspection the results look decent: few splice junctions are identified (I do not expect many introns in my genome), although a few false ones seem to connect two different genes. This is the first thing I would like help with: is it worth simply reducing the maximum intron size to nothing? What is the accepted consensus when using TopHat on bacterial genomes?
When I look at the second TopHat output file, the accepted hits, all hits align nicely with known genes. However, when I run Cufflinks I run into the following issues. When I use a reference genome, I get, in addition to the known transcripts, a number of very long transcripts spanning very large genomic regions. Also, two genes that lie very close to each other but run in opposite directions (which you can see beautifully in the TopHat accepted hits alignments, with a different color for each strand) are merged into a single CUFF identifier. Is there any way I can address this? Am I missing parameters that need to change because I am working on a bacterial genome?
Many thanks
Noa