  • Eugeni
    Junior Member
    • May 2009
    • 4

    TopHat questions

    Hi all
    I am working with transcriptomic data generated by Solexa, and I am using TopHat for mapping. I am trying to run the program with a GFF file in order to obtain gene expression indices (-G option), but the program does not seem to recognize any of the GFF files I specify. Does anyone know the structure of the GFF file appropriate for TopHat in order to do the mapping?
    Regards
  • Cole Trapnell
    Senior Member
    • Nov 2008
    • 213

    #2
    Hi,

    I've just updated the TopHat manual with a brief explanation of TopHat's RPKM calculation and a simple example GFF file. Please see http://tophat.cbcb.umd.edu/manual.html
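
For readers unfamiliar with RPKM, here is a minimal sketch of the standard definition (reads per kilobase of feature per million mapped reads). It is not taken from TopHat's source, and the calculation described in the manual may differ in details; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Standard RPKM: reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Normalizing by both feature length and sequencing depth is what makes values comparable across genes and across runs.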

    • joseph
      Member
      • Feb 2008
      • 39

      #3
      GFF3 file for TopHat

      Originally posted by Eugeni
      Hi all
      I am working with transcriptomic data generated by Solexa, and I am using TopHat for mapping. I am trying to run the program with a GFF file in order to obtain gene expression indices (-G option), but the program does not seem to recognize any of the GFF files I specify. Does anyone know the structure of the GFF file appropriate for TopHat in order to do the mapping?
      Regards
      Hi
      Can you share the first few lines of a GFF file so I can see how it is structured?
      Thanks

      • geriatrics1200
        Junior Member
        • Jun 2009
        • 3

        #4
        Hello, I'm wondering if anybody can direct me to a GFF3 for mouse. I came across GFF3 files for many organisms besides mouse (http://www.sequenceontology.org/reso...databases.html).

        • RockChalkJayhawk
          Senior Member
          • Mar 2009
          • 192

          #5
          TopHat GFF3

          Originally posted by geriatrics1200
          Hello, I'm wondering if anybody can direct me to a GFF3 for mouse. I came across GFF3 files for many organisms besides mouse (http://www.sequenceontology.org/reso...databases.html).

          Please post the GFF3 if you ever get one!!!

          • lmf_bill
            Member
            • Jul 2008
            • 36

            #6
            I believe TopHat needs GFF3-format files. You can download GTF files from Ensembl and then convert them to GFF3 format using a Perl script, GFF2gtf.pl; you can find it with a Google search.

            • Xi Wang
              Senior Member
              • Oct 2009
              • 317

              #7
              I think it would be better if the TopHat website provided GFF3 files for some species, for example human. Converting the file format is dirty work :-(
              Xi Wang

              • Xi Wang
                Senior Member
                • Oct 2009
                • 317

                #8
                Originally posted by Xi Wang
                I think it would be better if the TopHat website provided GFF3 files for some species, for example human. Converting the file format is dirty work :-(
                For more details: I downloaded the knownGene annotation in GTF format from the UCSC Table Browser and converted it using the gtf2gff3 tool. However, when I ran TopHat, I got the following warning message:
                Warning: TopHat did not find any junctions in GFF file

                I don't know how the file should be structured for TopHat. I would appreciate any help. Thanks.
                Xi Wang
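
One way to investigate a "did not find any junctions" warning is to check whether the GFF3 file links exon features to transcripts through `Parent` attributes, since introns can only be inferred from transcripts with two or more exons. The sketch below is a hypothetical sanity check, not TopHat's parser; which feature types and attributes TopHat actually requires is an assumption here and should be confirmed against the example file in the manual. The gene and coordinates in the demo are made up.

```python
from collections import defaultdict


def junctions_from_gff3(lines):
    """Return the set of (seqid, intron_start, intron_end) implied by exon features.

    Exons are grouped by their Parent attribute; each pair of adjacent exons
    in a transcript yields one intron (1-based, inclusive coordinates).
    """
    exons = defaultdict(list)  # (seqid, parent_id) -> list of (start, end)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only exon records contribute to junctions
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(cols[0], parent)].append((int(cols[3]), int(cols[4])))

    junctions = set()
    for (seqid, _parent), spans in exons.items():
        spans.sort()
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            if right_start > left_end + 1:  # a gap between exons is an intron
                junctions.add((seqid, left_end + 1, right_start - 1))
    return junctions


# Made-up two-exon transcript, for illustration only.
demo = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t100\t500\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\texon\t100\t200\t.\t+\t.\tID=ex1;Parent=tx1",
    "chr1\tdemo\texon\t400\t500\t.\t+\t.\tID=ex2;Parent=tx1",
]
print(junctions_from_gff3(demo))
```

If a check like this returns an empty set for a converted file, the exon records are probably not linked to transcripts the way a GFF3 consumer would expect, which is worth ruling out before suspecting the aligner.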

                • Xi Wang
                  Senior Member
                  • Oct 2009
                  • 317

                  #9
                  I am wondering whether, when a gene annotation is given, all reported junctions are based on the gene models. Can it be inferred that the more comprehensive the gene annotation (even if it includes invalid genes), the better?

                  Thanks,
                  Xi
                  Xi Wang

                  • lmf_bill
                    Member
                    • Jul 2008
                    • 36

                    #10
                    It is complicated, and I am not sure either. I guess TopHat checks the annotated junctions when a gene annotation is given, but the junctions are mainly built from the Bowtie mapping results. If you compare two TopHat runs, one with annotation and one without, you will find more junctions in the run with annotation, which is reasonable. But you will also find junctions in each run that do not appear in the other. You could say that is the effect of the gene annotation. It seems that the more genes, the better. However, I have not checked whether invalid genes affect the results. Maybe you can give us the answer.

                    • Xi Wang
                      Senior Member
                      • Oct 2009
                      • 317

                      #11
                      Originally posted by lmf_bill
                      It is complicated, and I am not sure either. I guess TopHat checks the annotated junctions when a gene annotation is given, but the junctions are mainly built from the Bowtie mapping results. If you compare two TopHat runs, one with annotation and one without, you will find more junctions in the run with annotation, which is reasonable. But you will also find junctions in each run that do not appear in the other. You could say that is the effect of the gene annotation. It seems that the more genes, the better. However, I have not checked whether invalid genes affect the results. Maybe you can give us the answer.
                      Yes. From my experiments, I guess that when a gene annotation is given, the results reported by TopHat are a mixture of junctions based on the annotation and junctions discovered de novo. But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions. In my experiments, only uniquely mapped (or aligned) reads are reported. Could this be the reason, or is it a possible bug in TopHat?

                      BTW, has anyone thought about the following problem?
                      Which mapping is better: (1) a read mapping contiguously to the genome with 2 mismatches, or (2) the same read mapping across a possible splice junction with only 1 mismatch?

                      Thanks.
                      Xi Wang

                      • lmf_bill
                        Member
                        • Jul 2008
                        • 36

                        #12
                        To Xi Wang,
                        "But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions."

                        RE:
                        That is strange. Can you paste an example?
                        As for the mismatch setting, there is no single best choice. In my opinion, allow 2 mismatches when a read maps to the genome as a whole, but only 0 or 1 mismatches in the overhang when it maps to a junction. This can improve the precision of splice junction prediction.
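
The policy suggested above can be written as a small filter. This is only a sketch of the suggestion in this post, not of how TopHat or Bowtie score alignments; the `Alignment` record, its field names, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    spliced: bool                 # True if the read spans a candidate junction
    mismatches: int               # mismatches over the whole read
    overhang_mismatches: int = 0  # mismatches in the short side of a spliced read


def passes_filter(aln: Alignment, max_genome_mm: int = 2, max_overhang_mm: int = 1) -> bool:
    """Allow up to max_genome_mm mismatches overall, but be stricter in junction overhangs."""
    if aln.mismatches > max_genome_mm:
        return False
    if aln.spliced and aln.overhang_mismatches > max_overhang_mm:
        return False
    return True


# A contiguous read with 2 mismatches is kept.
print(passes_filter(Alignment(spliced=False, mismatches=2)))
# A spliced read with 2 mismatches in its overhang is rejected.
print(passes_filter(Alignment(spliced=True, mismatches=2, overhang_mismatches=2)))
```

Being stricter in the overhang reflects the idea that a short overhang carries little evidence on its own, so mismatches there make a false junction call more likely.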

                        • Xi Wang
                          Senior Member
                          • Oct 2009
                          • 317

                          #13
                          Originally posted by lmf_bill
                          To Xi Wang,
                          "But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions."

                          RE:
                          That is strange. Can you paste an example?
                          As for the mismatch setting, there is no single best choice. In my opinion, allow 2 mismatches when a read maps to the genome as a whole, but only 0 or 1 mismatches in the overhang when it maps to a junction. This can improve the precision of splice junction prediction.
                          Thanks for your reply, Bill.
                          Here are a few splice junctions identified with and without gene annotation.

                          Code:
                          Chr#	left_site	right_site	#juncReadWithoutGeneAnnotation	#juncReadWithGeneAnnotation	GeneName
                          chrY	21147161	21150881	4	0	EIF1AY
                          chrY	21147409	21150881	1	1	intragenic
                          chrY	21150965	21153863	4	4	EIF1AY
                          chrY	21153967	21155747	14	14	EIF1AY
                          chrY	21155798	21159297	9	8	EIF1AY
                          chrY	21159379	21160757	17	17	EIF1AY
                          chrY	21160849	21163614	8	8	EIF1AY
                          chrY	2769668	2770205	0	28	RPS4Y1
                          chrY	2770283	2772117	0	21	RPS4Y1
                          chrY	2772298	2773686	36	17	RPS4Y1
                          chrY	2773784	2782640	58	0	RPS4Y1
                          chrY	2782812	2793128	65	65	RPS4Y1
                          chrY	2793286	2794833	43	43	RPS4Y1
                          chrY	7284271	7295396	0	1	PRKY
                          chrY	8577625	8578201	4	4	intergenic
                          You can see that the numbers in columns 4 and 5 are the same for about half of the junctions but differ for the other half. Zeros appear in both columns, which means that some splice junctions cannot be identified without the gene model, and others cannot be identified even with it. It is also unclear whether all the detected splice junctions are real. I think TopHat tends to suppress false positives, which may be why we can clearly see some false negatives. There is a tradeoff here, and it might be better if users could specify it themselves (however, I didn't check whether TopHat already provides such an option).

                          As for the mismatch setting, I strongly agree with you: there is no best setting. But intuitively, there are fewer splice junction reads than exon reads, and calling a read a splice junction read carries a higher risk than calling it an exon read, especially when there is no other evidence for the corresponding junction.
                          Last edited by Xi Wang; 11-29-2009, 08:22 PM.
                          Xi Wang
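
The junction table in the post above can be tallied with a few lines of Python. This is a minimal sketch: the rows are copied from that table, while the tuple layout and variable names are choices made here for illustration.

```python
# (chrom, left_site, right_site, reads_without_annotation, reads_with_annotation, gene)
ROWS = [
    ("chrY", 21147161, 21150881, 4, 0, "EIF1AY"),
    ("chrY", 21147409, 21150881, 1, 1, "intragenic"),
    ("chrY", 21150965, 21153863, 4, 4, "EIF1AY"),
    ("chrY", 21153967, 21155747, 14, 14, "EIF1AY"),
    ("chrY", 21155798, 21159297, 9, 8, "EIF1AY"),
    ("chrY", 21159379, 21160757, 17, 17, "EIF1AY"),
    ("chrY", 21160849, 21163614, 8, 8, "EIF1AY"),
    ("chrY", 2769668, 2770205, 0, 28, "RPS4Y1"),
    ("chrY", 2770283, 2772117, 0, 21, "RPS4Y1"),
    ("chrY", 2772298, 2773686, 36, 17, "RPS4Y1"),
    ("chrY", 2773784, 2782640, 58, 0, "RPS4Y1"),
    ("chrY", 2782812, 2793128, 65, 65, "RPS4Y1"),
    ("chrY", 2793286, 2794833, 43, 43, "RPS4Y1"),
    ("chrY", 7284271, 7295396, 0, 1, "PRKY"),
    ("chrY", 8577625, 8578201, 4, 4, "intergenic"),
]

# Junctions with identical read counts in both runs.
same = [r for r in ROWS if r[3] == r[4]]
# Junctions reported only when the annotation was supplied.
only_with_annotation = [r for r in ROWS if r[3] == 0 and r[4] > 0]
# Junctions reported only in the run without annotation.
only_without_annotation = [r for r in ROWS if r[4] == 0 and r[3] > 0]

print(f"{len(same)} of {len(ROWS)} junctions have identical counts")
print(f"{len(only_with_annotation)} found only with annotation")
print(f"{len(only_without_annotation)} found only without annotation")
```

For these 15 rows, 8 have identical counts, 3 junctions are reported only with annotation, and 2 only without, which matches the "about half" description in the post.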
