**Cole Trapnell** · 05-26-2009, 08:50 AM

Hi,

I've just updated the TopHat manual with a brief explanation of TopHat's RPKM calculation and a simple example GFF file. Please see http://tophat.cbcb.umd.edu/manual.html
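For readers new to the measure, RPKM is commonly defined as reads per kilobase of exon model per million mapped reads. The sketch below only illustrates that standard definition; the function name and the numbers are made up for the example, and the manual linked above remains the reference for how TopHat itself computes the value.

```python
def rpkm(reads_in_gene: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on 2 kb of exons, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```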

**joseph** · 05-27-2009, 01:28 PM

GFF3 file for TopHat

Originally posted by Eugeni

Hi all,
I am working with transcriptomic data generated by Solexa, and I am using TopHat for mapping. I am trying to run the program with a GFF file in order to obtain gene expression indices (-G option), but the program does not seem to recognize any of the GFF files I specify. Does anyone know the GFF file structure that TopHat expects for the mapping?
Regards

Hi,
Can you share the first few lines of your GFF file so I can see how it is structured?
Thanks
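One quick way to produce those first few lines, and to check the basics at the same time, is to print the first feature records and confirm that each has the nine tab-separated columns the GFF3 format defines. This is a minimal sketch: the helper name `head_gff` and the sample records are invented for illustration, and it says nothing about which feature types or attributes TopHat additionally requires.

```python
import io

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def head_gff(handle, n=5):
    """Return up to n feature lines as dicts; raise on a malformed line."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blanks, comments, directives
            continue
        fields = line.split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            raise ValueError(
                f"line {line_no}: expected 9 tab-separated columns, got {len(fields)}"
            )
        records.append(dict(zip(GFF3_COLUMNS, fields)))
        if len(records) == n:
            break
    return records


# Made-up sample; in practice pass open("your_annotation.gff3") instead.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1\n"
)
for rec in head_gff(sample):
    print(rec["seqid"], rec["type"], rec["start"], rec["end"], rec["attributes"])
```

A file whose columns are separated by spaces rather than tabs fails this check immediately, which is worth ruling out before looking at anything more subtle.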

**geriatrics1200** · 06-02-2009, 11:19 AM

Hello, I'm wondering if anybody can direct me to a GFF3 for mouse. I came across GFF3 files for many organisms besides mouse (http://www.sequenceontology.org/reso...databases.html).

**RockChalkJayhawk** · 10-23-2009, 02:28 PM

TopHat GFF3

Please post the GFF3 if you ever get one!

**lmf_bill** · 11-23-2009, 01:24 AM

I believe TopHat needs GFF3-format files. You can download GTF files from Ensembl and then convert them to GFF3 format with a Perl script, GFF2gtf.pl; you can find it with Google.
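To show what such a conversion involves, here is a rough sketch that rewrites only the attribute column of a single GTF exon line into GFF3 `key=value` style, using the `transcript_id` as the `Parent`. It is not the script mentioned above: the function name is hypothetical, and a complete converter also has to create the gene and transcript features that GFF3 parents refer to.

```python
import re

# Matches GTF attribute pairs such as: gene_id "G1";
GTF_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')


def gtf_exon_to_gff3(line: str) -> str:
    """Rewrite one GTF exon line so its attributes use GFF3 key=value syntax."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attrs = dict(GTF_ATTR.findall(fields[8]))
    if "transcript_id" not in attrs:
        raise ValueError("GTF line has no transcript_id")
    # Assumption for this sketch: the exon's parent is its transcript.
    gff3_attrs = [f"Parent={attrs['transcript_id']}"]
    if "gene_id" in attrs:
        gff3_attrs.append(f"gene_id={attrs['gene_id']}")
    fields[8] = ";".join(gff3_attrs)
    return "\t".join(fields)


# Made-up GTF line for illustration.
gtf_line = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
print(gtf_exon_to_gff3(gtf_line))
```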

**Xi Wang** · 11-23-2009, 08:06 AM

I think it would be better if the TopHat website provided GFF3 files for some species, for example human. Converting between file formats is dirty work :-(

**Xi Wang** · 11-23-2009, 09:49 AM

For more details: I downloaded the knownGene annotation in GTF format from the UCSC Table Browser and converted it with the gtf2gff3 tool. However, when I ran TopHat, I got the following warning message:

Warning: TopHat did not find any junctions in GFF file

I don't know what the file should look like for TopHat to use it. I would appreciate your help. Thanks.
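One way to narrow this down is to check, independently of TopHat, whether the converted file implies any junctions at all: group the `exon` features by their `Parent` attribute and pair up consecutive exons. The sketch below does that. It is an assumption about what a usable file should contain, not a description of TopHat's own parser, and the function name and demo records are invented.

```python
from collections import defaultdict


def implied_junctions(gff3_lines):
    """Return sorted (seqid, left_exon_end, right_exon_start) tuples from exon features."""
    exons = defaultdict(list)  # (seqid, parent) -> [(start, end), ...]
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(fields[0], parent)].append((int(fields[3]), int(fields[4])))
    junctions = set()
    for (seqid, _parent), spans in exons.items():
        spans.sort()
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            junctions.add((seqid, left_end, right_start))
    return sorted(junctions)


# Made-up demo: one transcript with two exons implies one junction.
demo = [
    "##gff-version 3",
    "chr1\texample\tmRNA\t100\t400\t.\t+\t.\tID=tx1",
    "chr1\texample\texon\t100\t200\t.\t+\t.\tID=e1;Parent=tx1",
    "chr1\texample\texon\t300\t400\t.\t+\t.\tID=e2;Parent=tx1",
]
print(implied_junctions(demo))
```

If this returns an empty list for the real file, the exons are probably missing `Parent` attributes or are labelled with a different feature type, so the problem lies in the conversion rather than in the mapping.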

**Xi Wang** · 11-24-2009, 08:56 AM

I am wondering whether, when a gene annotation is given, all reported junctions are based on the gene models. Can we infer that the more comprehensive the gene annotation (even if it includes invalid genes), the better?

Thanks,
Xi

**lmf_bill** · 11-24-2009, 07:13 PM

It is complicated, and I am not sure either. I guess TopHat checks junctions against the gene annotation when one is given, but the junctions are mainly built from the Bowtie mapping results. If you compare two TopHat runs, one with annotation and one without, you will find more junctions in the run with annotation, which is reasonable. But you will also find junctions that do not overlap between the two runs, that is, junctions reported in only one of them. You could say that this is the effect of the gene annotation. It seems that the more genes, the better. However, I have not checked whether invalid genes affect the results. Maybe you can give us the answer.

**Xi Wang** · 11-25-2009, 12:32 AM

Yes. From my experiments, I guess that when a gene annotation is given, the results reported by TopHat are a mixture of junctions based on the annotation and junctions discovered de novo. But I can't understand why some junction reads (<1%) were not reported even though the gene annotation provides a clear splice junction and even the de novo method (without gene annotation) reports these junctions. In my experiments, only uniquely mapped (aligned) reads are reported. Could that be the reason, or is this possibly a bug in TopHat?

BTW, has anyone thought about the problem below?
Which mapping is better: (1) a read mapping contiguously to the genome with 2 mismatches, or (2) the same read mapping to a possible splice junction with only 1 mismatch?

Thanks.

**lmf_bill** · 11-29-2009, 07:34 PM

To Xi Wang,
"But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions."

RE:
That is strange. Can you paste an example?
As for the mismatch setting, there is no clearly best choice. In my opinion, allow 2 mismatches when mapping a read as a whole to the genome, but only 0 or 1 mismatches in the overhang when mapping to a junction. That can improve the precision of splice junction prediction.
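The rule suggested here can be written down as a small post-filter. This is only a sketch of the idea as stated in the post: the function and parameter names are hypothetical and do not correspond to TopHat options, and how mismatches in the overhang are counted is left to whatever produces the alignments.

```python
def passes_mismatch_rule(is_spliced: bool, total_mismatches: int,
                         overhang_mismatches: int = 0,
                         max_genome: int = 2, max_overhang: int = 1) -> bool:
    """Looser limit for contiguous genome hits, stricter limit for junction overhangs."""
    if is_spliced:
        # Junction alignment: judge it by the mismatches in the overhang.
        return overhang_mismatches <= max_overhang
    # Contiguous alignment: judge it by the mismatches over the whole read.
    return total_mismatches <= max_genome


# A contiguous hit with 2 mismatches passes; a spliced hit with 2 overhang mismatches does not.
print(passes_mismatch_rule(False, 2))
print(passes_mismatch_rule(True, 2, overhang_mismatches=2))
```

Setting `max_overhang=0` gives the stricter of the two variants mentioned above.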

**Xi Wang** · 11-29-2009, 08:20 PM

Thanks for your reply, Bill.
Here are a few splice junctions identified without and with gene annotation; columns 4 and 5 give the number of junction reads in each case.

Code:

Chr#	left_site	right_site	#juncReadWithoutGeneAnnotation	#juncReadWithGeneAnnotation	GeneName
chrY	21147161	21150881	4	0	EIF1AY
chrY	21147409	21150881	1	1	intragenic
chrY	21150965	21153863	4	4	EIF1AY
chrY	21153967	21155747	14	14	EIF1AY
chrY	21155798	21159297	9	8	EIF1AY
chrY	21159379	21160757	17	17	EIF1AY
chrY	21160849	21163614	8	8	EIF1AY
chrY	2769668	2770205	0	28	RPS4Y1
chrY	2770283	2772117	0	21	RPS4Y1
chrY	2772298	2773686	36	17	RPS4Y1
chrY	2773784	2782640	58	0	RPS4Y1
chrY	2782812	2793128	65	65	RPS4Y1
chrY	2793286	2794833	43	43	RPS4Y1
chrY	7284271	7295396	0	1	PRKY
chrY	8577625	8578201	4	4	intergenic

You can see that in about half of the rows the numbers in columns 4 and 5 are the same, but in the other half they differ. Zeros appear in both columns, which means that some splice junctions cannot be identified without the gene model, and some others are missed even with it. There is also the question of whether all the detected splice junctions are real. I think TopHat tends to suppress false positives, and maybe that is why we clearly see some false negatives. There is a tradeoff here, and it might be better if users could specify it (however, I didn't check whether TopHat already provides such an option).
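The "about half" observation can be checked directly from the table. The sketch below copies the 15 rows from the post and counts how many junctions have equal read counts in both runs and how many are seen in only one run; the variable and function names are just for illustration.

```python
# (chrom, left_site, right_site, reads_without_annotation, reads_with_annotation, gene)
ROWS = [
    ("chrY", 21147161, 21150881, 4, 0, "EIF1AY"),
    ("chrY", 21147409, 21150881, 1, 1, "intragenic"),
    ("chrY", 21150965, 21153863, 4, 4, "EIF1AY"),
    ("chrY", 21153967, 21155747, 14, 14, "EIF1AY"),
    ("chrY", 21155798, 21159297, 9, 8, "EIF1AY"),
    ("chrY", 21159379, 21160757, 17, 17, "EIF1AY"),
    ("chrY", 21160849, 21163614, 8, 8, "EIF1AY"),
    ("chrY", 2769668, 2770205, 0, 28, "RPS4Y1"),
    ("chrY", 2770283, 2772117, 0, 21, "RPS4Y1"),
    ("chrY", 2772298, 2773686, 36, 17, "RPS4Y1"),
    ("chrY", 2773784, 2782640, 58, 0, "RPS4Y1"),
    ("chrY", 2782812, 2793128, 65, 65, "RPS4Y1"),
    ("chrY", 2793286, 2794833, 43, 43, "RPS4Y1"),
    ("chrY", 7284271, 7295396, 0, 1, "PRKY"),
    ("chrY", 8577625, 8578201, 4, 4, "intergenic"),
]


def summarize(rows):
    """Count agreements and run-specific junctions between the two TopHat runs."""
    same = sum(1 for r in rows if r[3] == r[4])
    return {
        "total": len(rows),
        "same": same,
        "differ": len(rows) - same,
        "only_with_annotation": sum(1 for r in rows if r[3] == 0 and r[4] > 0),
        "only_without_annotation": sum(1 for r in rows if r[4] == 0 and r[3] > 0),
    }


print(summarize(ROWS))
```

For these rows, 8 of the 15 junctions have identical counts and 7 differ; 3 junctions appear only with the annotation and 2 only without it.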

As for the mismatch setting, I strongly agree with you: there is no best setting. But intuitively, splice junction reads are fewer than exon reads, and calling a read a splice junction read is riskier than calling it an exon read, especially when there is no other evidence for the corresponding junction.

TopHat questions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News