  • GTF usage in Tophat

    I am trying to use TopHat to find novel splice junctions in zebrafish RNA-seq data generated with the Illumina CAGE protocol. I am quite new to TopHat, and I am running several trials to find the best combination of options for my samples, but I don't completely understand the --GTF option (paired with the --transcriptome-index option).

    As stated in the TopHat manual for the --GTF option:
    Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
    Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat. You can get a list of the sequence names in a Bowtie index by typing:

    bowtie-inspect --names your_index

    So before using a known annotation file with this option please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command above.
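The name check described in the manual excerpt above can be sketched in Python. This is a minimal illustration, not part of TopHat: the function names and the toy data are hypothetical, and `index_names` is assumed to hold the lines printed by `bowtie-inspect --names your_index`.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct values of column 1 from GTF/GFF lines."""
    names = set()
    for line in gtf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_seqnames(gtf_lines, index_names):
    """Return annotation sequence names absent from the index (case sensitive)."""
    return sorted(gtf_seqnames(gtf_lines) - set(index_names))


# Hypothetical data: names as they might be listed for a Bowtie index.
index_names = ["chr1", "chr2", "chrM"]
gtf_lines = [
    "# toy annotation\n",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'Chr2\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n',
    'MT\tsrc\texon\t10\t90\t.\t+\t.\tgene_id "g3"; transcript_id "t3";\n',
]
missing = unmatched_seqnames(gtf_lines, index_names)
print(missing)
```

Any name reported as missing (here a case mismatch and a different naming scheme) would need to be reconciled before the annotation is passed to TopHat.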
    As far as I understand, TopHat uses the GTF file to build a transcriptome index (provided the GTF file matches the Bowtie index in terms of sequence names and positions). This index can be reused with the --transcriptome-index option.

    After that, TopHat aligns the reads against this "GTF index" and discards all the reads that match it perfectly, focusing on the reads that do not align to the index to find new splice sites. Is this correct? If it is, then two questions arise:

    1) Will the reads be aligned against this GTF index without even trying to splice them? Does the perfect match happen before the splicing algorithm runs?

    2) Which reference is better to use with this option: a reference genome or a reference transcriptome? And why?

    Thanks for your answers!

    Daniele

  • #2
    As I understand it, TopHat first uses the information in the annotation GTF to map all the reads that match known genes. After that you are left with a set of reads that did not match known genes; these are mapped to the genome as usual. Possibly they represent novel genes or something else. You have to supply a reference gene model for this, i.e. the known transcriptome. The genome is what you are supposed to supply to TopHat in the form of a Bowtie index.
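To illustrate what the manual means by converting transcriptome mappings to genomic mappings "spliced as needed", here is a minimal Python sketch. It is not TopHat's code: the function name is hypothetical, it handles only plus-strand transcripts, and it assumes exons are given as 1-based inclusive genomic intervals sorted by start, as in GTF exon records.

```python
def transcript_to_genome(exons, offset, length):
    """Map a read aligned at 0-based `offset` on a plus-strand transcript
    to genomic blocks.

    exons: list of (start, end) genomic intervals, 1-based inclusive,
           sorted by start.
    Returns a list of (start, end) genomic blocks, 1-based inclusive.
    More than one block means the read spans a known junction.
    """
    blocks = []
    remaining = length
    for start, end in exons:
        exon_len = end - start + 1
        if offset >= exon_len:
            # The read starts downstream of this exon.
            offset -= exon_len
            continue
        block_start = start + offset
        block_end = min(end, block_start + remaining - 1)
        blocks.append((block_start, block_end))
        remaining -= block_end - block_start + 1
        offset = 0  # any continuation starts at the next exon's first base
        if remaining == 0:
            break
    if remaining:
        raise ValueError("read extends past the end of the transcript")
    return blocks


# Hypothetical two-exon transcript with an intron at 200..299.
exons = [(100, 199), (300, 399)]
unspliced = transcript_to_genome(exons, offset=10, length=20)
spliced = transcript_to_genome(exons, offset=90, length=20)
print(unspliced, spliced)
```

In this toy model the read is aligned contiguously to the transcript sequence, and the split across the intron only appears when the coordinates are translated back to the genome.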



    • #3
      Originally posted by glados
      As I understand it, TopHat first uses the information in the annotation GTF to map all the reads that match known genes. After that you are left with a set of reads that did not match known genes; these are mapped to the genome as usual. Possibly they represent novel genes or something else. You have to supply a reference gene model for this, i.e. the known transcriptome. The genome is what you are supposed to supply to TopHat in the form of a Bowtie index.
      Ty Glados, I found out what was not working!



      • #4
        So you can supply TopHat with a GTF file of annotated transcripts, which, using the --GTF option, will be the first place where reads are mapped, followed by the whole genome, with or without novel junction discovery in this second stage. As I understand it, this is after TopHat 1.4.
        I'm curious to know how it was before 1.4. I think you could already give TopHat a GTF file, but it used it second. Am I right? If so, what is the difference between using the GTF file first and using it second, after the genome?



        • #5
          Hello everyone

          The TopHat manual says:

          -T/--transcriptome-only Only align the reads to the transcriptome and report only those mappings as genomic mappings.

          How does this differ from -G? (As I read it, -G does the same thing: it extracts the reads that map to the transcripts present in the GTF file.)


          I ran the mapping in two different ways.
          TopHat mapping without -T:

          python tophat.py -p 8 -G jsn.gff -o LIB_SG323_FJSN_Trans refernece.fa 1_fastq_1 1_fastq_2

          and with both -T and -G:

          python tophat.py -p 8 -T -G jsn.gff -o LIB_SG323_FJSN_Trans refernece.fa 1_fastq_1 1_fastq_2

          I got different FPKM values from the two runs. How does running TopHat with the first command differ from running it with the second?
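The two runs can be written side by side with a small Python helper, which makes it easier to see that they differ only in the -T flag. This is a sketch under assumptions: the helper name and the two output directory names are hypothetical, and the file names are copied as posted. Note that both commands as posted write to the same -o directory; separate directories are used here so the two results can be compared.

```python
def tophat_command(annotation, out_dir, index, reads,
                   transcriptome_only=False, threads=8):
    """Build the argument list for one TopHat run (nothing is executed)."""
    cmd = ["python", "tophat.py", "-p", str(threads)]
    if transcriptome_only:
        cmd.append("-T")  # -T/--transcriptome-only, as quoted from the manual
    cmd += ["-G", annotation, "-o", out_dir, index]
    cmd += list(reads)
    return cmd


reads = ["1_fastq_1", "1_fastq_2"]
# Output directory names below are hypothetical.
with_g = tophat_command("jsn.gff", "LIB_SG323_FJSN_Trans_G",
                        "refernece.fa", reads)
with_t_g = tophat_command("jsn.gff", "LIB_SG323_FJSN_Trans_TG",
                          "refernece.fa", reads, transcriptome_only=True)
print(" ".join(with_g))
print(" ".join(with_t_g))
```

Each list could be passed to `subprocess.run` if the runs should be launched from Python rather than from the shell.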



          • #6
            Let's look at what the manual says about the '-G' option:

            Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
            Compare that to '-T':

            Only align the reads to the transcriptome and report only those mappings as genomic mappings.
            I hope that it is obvious that the two map reads in different ways. The first should be a super-set of the second.
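The superset relationship can be made concrete with a toy Python model. It is a deliberate simplification, not TopHat's implementation: the function name and read names are hypothetical, each read is simply looked up in a set of transcriptome hits and a set of genome hits, and multi-mapping, partial matches, and junction discovery are ignored.

```python
def align(reads, transcriptome_hits, genome_hits, transcriptome_only=False):
    """Return {read: stage} for every read that gets reported as mapped."""
    # Stage 1: reads that map to the annotated transcriptome.
    mapped = {r: "transcriptome" for r in reads if r in transcriptome_hits}
    # Stage 2 (skipped in transcriptome-only mode): leftover reads vs. the genome.
    if not transcriptome_only:
        for r in reads:
            if r not in mapped and r in genome_hits:
                mapped[r] = "genome"
    return mapped


reads = ["r1", "r2", "r3", "r4"]
transcriptome_hits = {"r1", "r2"}   # reads matching annotated transcripts
genome_hits = {"r1", "r2", "r3"}    # r3 maps only outside the annotation

with_g = align(reads, transcriptome_hits, genome_hits)
with_t_g = align(reads, transcriptome_hits, genome_hits, transcriptome_only=True)
print(with_g)
print(with_t_g)
```

In this model, the run with -G alone reports every read that the -T run reports, plus reads such as r3 that map only to the genome. Different read sets reaching the quantification step is one plausible reason for FPKM values to differ between the two runs.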
