  • TopHat questions

    Hi all,
    I am working with transcriptomic data generated by Solexa sequencing, and I am using TopHat for mapping. I am trying to run the program with a GFF file (-G option) in order to obtain gene expression indices, but the program does not seem to recognize any of the GFF files I specify. Does anyone know the GFF file structure that TopHat expects for the mapping?
    Regards

  • #2
    Hi,

    I've just updated the TopHat manual with a brief explanation of TopHat's RPKM calculation and a simple example GFF file. Please see http://tophat.cbcb.umd.edu/manual.html
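For readers unfamiliar with RPKM, here is a minimal Python sketch of the standard definition (reads per kilobase of exon model per million mapped reads). It is only an illustration of the general formula, not TopHat's own code; the function name and the example numbers are hypothetical, and the manual remains the reference for how TopHat counts reads and measures exon length.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads on the
    feature, N the total number of mapped reads, and L the exon length in bp.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```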


    • #3
      GFF3 file for TopHat

      Originally posted by Eugeni
      Hi all,
      I am working with transcriptomic data generated by Solexa sequencing, and I am using TopHat for mapping. I am trying to run the program with a GFF file (-G option) in order to obtain gene expression indices, but the program does not seem to recognize any of the GFF files I specify. Does anyone know the GFF file structure that TopHat expects for the mapping?
      Regards

      Hi,
      Can you share the first few lines of a GFF file so I can see how it is structured?
      Thanks


      • #4
        Hello, I'm wondering if anybody can direct me to a GFF3 for mouse. I came across GFF3 files for many organisms besides mouse (http://www.sequenceontology.org/reso...databases.html).


        • #5
          TopHat GFF3

          Originally posted by geriatrics1200
          Hello, I'm wondering if anybody can direct me to a GFF3 for mouse. I came across GFF3 files for many organisms besides mouse (http://www.sequenceontology.org/reso...databases.html).

          Please post the GFF3 file if you ever find one!


          • #6
            I thought TopHat needs files in GFF3 format. You can download GTF files from Ensembl and then convert them to GFF3 format using a Perl script, GFF2gtf.pl, which you can find with a Google search.
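To make the conversion idea concrete, here is a rough Python sketch of what a GTF-to-GFF3 converter does at its core. It is not the Perl script mentioned above: the function name and the exon ID scheme are made up for illustration, it handles only `exon` records, and it assumes each GTF line carries a `transcript_id "…"` attribute. Whether the resulting layout (one mRNA feature per transcript with exon children) is exactly what a given TopHat version expects is an assumption to check against the example file in the manual.

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "t1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_exons_to_gff3(gtf_lines):
    """Turn GTF exon records into GFF3 mRNA + exon records (illustrative only)."""
    transcripts = {}  # transcript_id -> list of exon field lists, in input order
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # this sketch ignores CDS, start_codon, etc.
        attrs = dict(ATTR_RE.findall(fields[8]))
        transcripts.setdefault(attrs["transcript_id"], []).append(fields)

    out = ["##gff-version 3"]
    for tid, exons in transcripts.items():
        exons.sort(key=lambda f: int(f[3]))
        seqid, source, strand = exons[0][0], exons[0][1], exons[0][6]
        start = min(int(f[3]) for f in exons)
        end = max(int(f[4]) for f in exons)
        # Parent feature spanning all exons of the transcript.
        out.append("\t".join(
            [seqid, source, "mRNA", str(start), str(end), ".", strand, ".", f"ID={tid}"]))
        # Child exon features pointing back to the transcript via Parent=.
        for i, f in enumerate(exons, 1):
            out.append("\t".join(f[:8] + [f"ID={tid}.exon{i};Parent={tid}"]))
    return out


example = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
print("\n".join(gtf_exons_to_gff3(example)))
```

A real converter also has to escape reserved characters in IDs and carry over gene-level features, which this sketch leaves out.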


            • #7
              I think it would be better if the TopHat website provided GFF3 files for some species, for example human. Converting the file format is tedious work :-(
              Xi Wang


              • #8
                Originally posted by Xi Wang
                I think it would be better if the TopHat website provided GFF3 files for some species, for example human. Converting the file format is tedious work :-(

                In more detail: I downloaded the knownGene annotation in GTF format from the UCSC Table Browser and converted it using the gtf2gff3 tool. However, when I ran TopHat, I got the following warning message:
                Warning: TopHat did not find any junctions in GFF file

                I don't know how the file should be structured for TopHat. I would appreciate any help. Thanks.
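One way to narrow down a warning like this is to check whether the converted file actually describes any introns. The Python sketch below counts the junctions implied by consecutive `exon` features that share a `Parent` attribute. It is an independent sanity check, not a reimplementation of TopHat's parser, and the function name is hypothetical; a count of zero would suggest that the exons lack `Parent` links or that the feature types differ from what was expected.

```python
from collections import defaultdict


def count_annotated_junctions(gff3_lines):
    """Count distinct introns implied by exon features that share a Parent."""
    exons_by_parent = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        # GFF3 attributes are key=value pairs separated by ';'.
        attrs = dict(kv.strip().split("=", 1)
                     for kv in fields[8].split(";") if "=" in kv)
        # An exon may list several parents separated by ','.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_by_parent[(fields[0], parent)].append(
                    (int(fields[3]), int(fields[4])))

    junctions = set()
    for (seqid, _parent), exons in exons_by_parent.items():
        exons.sort()
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            if right_start > left_end + 1:  # a gap between exons is an intron
                junctions.add((seqid, left_end, right_start))
    return len(junctions)


with_parents = [
    "##gff-version 3",
    "chr1\tsrc\tmRNA\t100\t400\t.\t+\t.\tID=t1",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\tID=e2;Parent=t1",
]
print(count_annotated_junctions(with_parents))
```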
                Xi Wang


                • #9
                  I am wondering whether, when a gene annotation is given, all the reported junctions are based on the gene model. Can it be inferred that the more comprehensive the gene annotation (even if it includes invalid genes), the better?

                  Thanks,
                  Xi
                  Xi Wang


                  • #10
                    It is complicated, and I am not sure either. I guess TopHat checks the junctions when a gene annotation is given. The junctions are mainly built from the Bowtie mapping results. If you compare two TopHat runs, one with annotation and one without, you will find more junctions in the run with annotation, which is reasonable. But when you compare them, you will also find junctions in each run that do not appear in the other. You could attribute this to the gene annotation. It seems that the more genes, the better. However, I have not checked whether invalid genes affect the results. Maybe you can give us the answer.


                    • #11
                      Originally posted by lmf_bill
                      It is complicated, and I am not sure either. I guess TopHat checks the junctions when a gene annotation is given. The junctions are mainly built from the Bowtie mapping results. If you compare two TopHat runs, one with annotation and one without, you will find more junctions in the run with annotation, which is reasonable. But when you compare them, you will also find junctions in each run that do not appear in the other. You could attribute this to the gene annotation. It seems that the more genes, the better. However, I have not checked whether invalid genes affect the results. Maybe you can give us the answer.

                      Yes. From my experiments, I guess that when a gene annotation is given, the results reported by TopHat are a mixture of junctions based on the annotation and junctions discovered de novo. But I can't understand why some junction reads (<1%) were not reported even though the gene annotation provides a clear splice junction, and even though the de novo method (without gene annotation) can report these junctions. In my experiments, only uniquely mapped (aligned) reads are reported. Could that be the reason, or is it a possible bug in TopHat?

                      By the way, has anyone thought about the following problem?
                      Which mapping is better: (1) a read mapping contiguously to the genome with 2 mismatches, or (2) the same read mapping to a possible splice junction with only 1 mismatch?

                      Thanks.
                      Xi Wang


                      • #12
                        To Xi Wang,
                        "But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions."

                        RE:
                        That is strange. Can you paste an example?
                        As for the mismatch setting, there is no single right choice. In my opinion, allow 2 mismatches when the read maps contiguously to the genome, but only 0 or 1 mismatches in the overhang when it maps to a junction. This can improve the precision of splice junction prediction.
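The rule suggested here can be written down as a small filter. The Python sketch below is purely illustrative: the function and parameter names are hypothetical and do not correspond to TopHat options, and it assumes you already know, for each alignment, the total mismatch count and the mismatch count in the shorter side (overhang) of a spliced read. Applying the overall mismatch limit to spliced reads as well is an added assumption.

```python
def accept_alignment(mismatches: int,
                     spliced: bool,
                     overhang_mismatches: int = 0,
                     max_total_mismatches: int = 2,
                     max_overhang_mismatches: int = 1) -> bool:
    """Apply a stricter mismatch limit to the overhang of spliced reads.

    Contiguous (unspliced) alignments pass with up to max_total_mismatches.
    Spliced alignments must also keep the overhang at or below
    max_overhang_mismatches (set it to 0 for the strictest variant).
    """
    if mismatches > max_total_mismatches:
        return False
    if spliced and overhang_mismatches > max_overhang_mismatches:
        return False
    return True


print(accept_alignment(2, spliced=False))                        # contiguous, 2 mismatches
print(accept_alignment(2, spliced=True, overhang_mismatches=2))  # both mismatches in the overhang
```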


                        • #13
                          Originally posted by lmf_bill
                          To Xi Wang,
                          "But I can't understand why some (<1%) junction reads were not reported although there is a clear splice junction provided by the gene annotation, and even the de novo method (without gene annotations) can report these junctions."

                          RE:
                          That is strange. Can you paste an example?
                          As for the mismatch setting, there is no single right choice. In my opinion, allow 2 mismatches when the read maps contiguously to the genome, but only 0 or 1 mismatches in the overhang when it maps to a junction. This can improve the precision of splice junction prediction.

                          Thanks for your reply, Bill.
                          Here are a few splice junctions identified with and without gene annotation.

                          Code:
                          Chr#	left_site	right_site	#juncReadWithoutGeneAnnotation	#juncReadWithGeneAnnotation	GeneName
                          chrY	21147161	21150881	4	0	EIF1AY
                          chrY	21147409	21150881	1	1	intragenic
                          chrY	21150965	21153863	4	4	EIF1AY
                          chrY	21153967	21155747	14	14	EIF1AY
                          chrY	21155798	21159297	9	8	EIF1AY
                          chrY	21159379	21160757	17	17	EIF1AY
                          chrY	21160849	21163614	8	8	EIF1AY
                          chrY	2769668	2770205	0	28	RPS4Y1
                          chrY	2770283	2772117	0	21	RPS4Y1
                          chrY	2772298	2773686	36	17	RPS4Y1
                          chrY	2773784	2782640	58	0	RPS4Y1
                          chrY	2782812	2793128	65	65	RPS4Y1
                          chrY	2793286	2794833	43	43	RPS4Y1
                          chrY	7284271	7295396	0	1	PRKY
                          chrY	8577625	8578201	4	4	intergenic
                          About half of the rows have the same number in columns 4 and 5; the other half do not. Zeros appear in both columns, which means that some splice junctions cannot be identified without the gene model, and some others cannot be identified even with it. There is also the question of whether all the detected splice junctions are real. I think TopHat tends to suppress false positives, which may be why we can clearly see some false negatives. There is a tradeoff here, and it might be better if users could specify it (however, I didn't check whether TopHat already provides such an option).
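For anyone who wants to repeat this comparison on a larger junction list, here is a small Python sketch that classifies each row of the table above by how the two counts relate. The category names and helper functions are made up for illustration; the data are the 15 rows posted above, with whitespace instead of tabs.

```python
from collections import Counter

# chrom, left_site, right_site, reads without annotation, reads with annotation, gene
TABLE = """\
chrY 21147161 21150881 4 0 EIF1AY
chrY 21147409 21150881 1 1 intragenic
chrY 21150965 21153863 4 4 EIF1AY
chrY 21153967 21155747 14 14 EIF1AY
chrY 21155798 21159297 9 8 EIF1AY
chrY 21159379 21160757 17 17 EIF1AY
chrY 21160849 21163614 8 8 EIF1AY
chrY 2769668 2770205 0 28 RPS4Y1
chrY 2770283 2772117 0 21 RPS4Y1
chrY 2772298 2773686 36 17 RPS4Y1
chrY 2773784 2782640 58 0 RPS4Y1
chrY 2782812 2793128 65 65 RPS4Y1
chrY 2793286 2794833 43 43 RPS4Y1
chrY 7284271 7295396 0 1 PRKY
chrY 8577625 8578201 4 4 intergenic
"""


def classify(without_ann: int, with_ann: int) -> str:
    """Label a junction by comparing its read counts in the two runs."""
    if without_ann == with_ann:
        return "same"
    if with_ann == 0:
        return "only_without_annotation"
    if without_ann == 0:
        return "only_with_annotation"
    return "different_counts"


def summarise(table: str) -> Counter:
    """Count how many junctions fall into each category."""
    counts = Counter()
    for line in table.splitlines():
        if not line.strip():
            continue
        _chrom, _left, _right, n_without, n_with, _gene = line.split()
        counts[classify(int(n_without), int(n_with))] += 1
    return counts


for label, n in summarise(TABLE).most_common():
    print(label, n)
```

On these rows, 8 of the 15 junctions have identical counts, 3 are reported only with annotation, 2 only without, and 2 are reported in both runs with different counts.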

                          Regarding the mismatch setting, I strongly agree with you. There is no best setting. But intuitively, there are fewer splice junction reads than exon reads, and calling a read a splice junction read is riskier than calling it an exon read, especially when there is no other evidence for the corresponding junction.
                          Last edited by Xi Wang; 11-29-2009, 08:22 PM.
                          Xi Wang
