I am trying to use Tophat to find novel splice junctions in a zebrafish RNA-seq dataset produced with the Illumina CAGE protocol. I am quite new to Tophat and am running several trials to find the best combination of options for my samples, but I don't completely understand the --GTF option (paired with the --transcriptome-index option).
As far as I understand from the Tophat manual entry for the --GTF option (quoted below), Tophat uses a GTF file to build an index (provided the GTF file matches the Bowtie index in terms of position and sequence). This index can be reused with the --transcriptome-index option.
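If I read the manual correctly, the build-once, reuse-later pattern would look roughly like the sketch below. All file and index names (genes.gtf, danRer_index, transcriptome_data/known, sample1.fq, sample2.fq) are placeholders I made up, and the run helper with its DRY_RUN switch is only my own convenience for printing the commands instead of executing them.

```shell
# Placeholder names (assumptions, not real files): genes.gtf, danRer_index,
# transcriptome_data/known, sample1.fq, sample2.fq.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# First run: -G supplies the annotation, and --transcriptome-index names the
# directory/prefix where the transcriptome index files are stored.
first_run() {
    run tophat -o out_sample1 -G genes.gtf \
        --transcriptome-index=transcriptome_data/known \
        danRer_index sample1.fq
}

# Later runs: point --transcriptome-index at the same prefix to reuse the
# stored files; -G is left out here.
later_run() {
    run tophat -o out_sample2 \
        --transcriptome-index=transcriptome_data/known \
        danRer_index sample2.fq
}
```

Calling first_run once and later_run for each further sample (with DRY_RUN=1 to preview the commands) is how I would expect to use it.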
After that, Tophat aligns the reads against this "GTF index" and discards all reads that match it perfectly, focusing on the reads that do not align to the index in order to find new splice sites. Is this correct? If so, two questions arise:
1) Will the reads be aligned against this GTF index without any attempt to splice them? Does the perfect-match step happen before the splicing algorithm?
2) Which reference is better to use with this option: a reference genome or a reference transcriptome? And why?
Thanks for your answers!
Daniele
As stated in the Tophat manual for the --GTF option:
Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat. You can get a list of the sequence names in a Bowtie index by typing:
bowtie-inspect --names your_index
So before using a known annotation file with this option please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command above.
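To check the naming requirement before a run, something like the following sketch could work. It assumes the index names were first saved to a text file (for example with bowtie-inspect --names your_index > index_names.txt); the function name gtf_names_missing_from_index and the file names are my own, not part of Tophat or Bowtie.

```shell
# Hypothetical helper: print sequence names that occur in column 1 of a
# GTF/GFF file but are absent from a list of Bowtie index names.
#   $1 = annotation file (tab-separated GTF/GFF)
#   $2 = text file with one index sequence name per line
gtf_names_missing_from_index() {
    annotation="$1"
    index_names="$2"
    tmp_gtf=$(mktemp)
    tmp_idx=$(mktemp)
    # Column 1 of every non-comment, non-empty line is the chromosome/contig.
    awk -F'\t' '!/^#/ && NF > 0 { print $1 }' "$annotation" \
        | LC_ALL=C sort -u > "$tmp_gtf"
    LC_ALL=C sort -u "$index_names" > "$tmp_idx"
    # comm -23 keeps lines found only in the first file; the comparison is
    # exact, so it is case sensitive.
    LC_ALL=C comm -23 "$tmp_gtf" "$tmp_idx"
    rm -f "$tmp_gtf" "$tmp_idx"
}
```

An empty result from gtf_names_missing_from_index genes.gtf index_names.txt would mean every name in the annotation is present in the index; any printed name would need fixing first.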