Header Leaderboard Ad


Typical TopHat/Cufflinks pipeline?



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Typical TopHat/Cufflinks pipeline?

    I am new to RNA-seq analysis, and am trying to understand the high-order logic of the bowtie/tophat/cufflinks/cuffcompare/cuffdiff pipeline. As far as I understand, a typical run would be something like:

    1. Creation of a genome index (needed for Bowtie and TopHat to operate)
    bowtie-build genome.fa bowtie_output/genome

    2. Alignments of RNA-seq reads to the genome with identification of exon-exon splice functions using TopHat
    tophat -o=tophat_output/ bowtie_output/genome reads.fastq

    3. Assembly of RNA-seq reads into transcripts and report of the RPKM/FPKM using Cufflinks
    cufflinks -o=cufflinks_output/ --reference-seq=genome.fa tophat_output/accepted_hits.bam

    4. Comparison of assembled transcripts with a reference annotation, or across experiments, using Cuffcompare
    cuffcompare -o=cuffcompare_output -r=genome.gtf -R cufflinks_output/transcripts.gtf

    5. Finding significant changes in transcript expression, splicing, and promoter use using Cuffdiff
    cuffdiff -o=cuffdiff_output/ cufflinks_output/transcripts.gff tophat_output/accepted_hits1a.bam,tophat_output/accepted_hits1b.bam tophat_output/accepted_hits2.bam

    (in this example, accepted_hits1a.bam and accepted_hits1b.sam are replicates, while accepted_hits2.bam is another experimental condition)

    Now I have the following questions:

    (1) I originally thought step 2 could be performed with all reads files (from different experimental conditions) in the same command line. However, tophat only generates one set of output file. I thus switched to calling tophat for each reads file at a time. Is that the correct approach?

    (2) In step 2, tophat can uses a reference genome annotation (GTF file). I am using this pipeline on a genome which, so far, has only been automatically annotated (Phaeodactylum tricornutum). From people's experience, should I rely on an automatically generated GTF file (in this case, from ENSEMBL) with hypothetical gene models or is it better to 'trust' the gene models that tophat would generate from the RNA-seq reads alone? I am worried about which option would give the most false positives; indeed, the tophat documentation states that 'TopHat will use the exon records in this file to build a set of known splice junctions for each gene, and will attempt to align reads to these junctions even if they would not normally be covered by the initial mapping.'.

    (3) In step 3, cufflinks also can uses a reference genome annotation. The question is the same: should I trust the reads' alignments more than an automatically generated GTF file for gene models? After all, a lot of person are using ChIP-seq and RNA-seq to improve on gene models.

    (4) In step 3, cufflinks is ran with one reads file (i.e., one experimental condition) at a time, producing a transcripts.gtf file with the model transcripts for this condition. However in step 5 cuffdiff is expecting a transcripts GTF file that will be used to compare all SAM/BAM files (from various experimental conditions) provided as input. Should I understand that, in order to compare different experimental conditions, I have to generate a 'meta' transcripts file by running steps 2 and 3 with a concatenation of all my reads files? (ensuring the resulting transcripts.gtf file in step 3 will be the union of all transcripts.gtf files possible?)

    Those are the key questions that prevent me, for now, to really use this pipeline (that, and an 'sort order of reads in BAMs must be the same' error in step 3 which, according to thread #9304 -- http://seqanswers.com/forums/showthread.php?t=9304 -- is due to a bug of cufflinks).


  • #2

    I'm working along the same lines as you, though my first step is simply to generate an annotation of our new genome from several different RNAseq datasests. I'll worrying about differential expression later...

    Other folks have asked similar questions, see the following post and where it links too. Unfortunately, there seems to be a fair bit ambiguity surrounding how best to do annotations with cufflinks (as opposed to diff-express).


    In principle I like the idea of a 'meta' transcripts file and plan to generate one via samtools merge of several different tophat runs. I seem to have some memory issues on our machine that prevents pooling all the samples. Haven't actually done this yet and I wonder if I'm not gonna be stopped by that sort order error...



    • #3
      Thank you for the links; I will definitively look at them. I never heard about diff-express, and couldn't find anything on Google. Could you give some more details about it?



      • #4
        I found an answer for question (4): http://seqanswers.com/forums/showthr...3800#post13800

        The suggestion from Cole Trapnell is indeed to pool samples with cuffcompare in step 4, then run cuffdiff (step 5) with the combined transcripts GTF file.


        • #5
          diff-express was just a lazy way of writing 'differential expression', which I meant in opposition to the goal of 'annotation'. no software here... just a lazy typist.


          • #6
            I didn't run 3) and 4). Instead I used the genes.gtf in the hg19 tar ball I downloaded from here:


            If I am only interested in differential expression but not in transcript assembly, is it good enough?