Greetings,
I am new to RNA-seq analysis, and am trying to understand the high-order logic of the bowtie/tophat/cufflinks/cuffcompare/cuffdiff pipeline. As far as I understand, a typical run would be something like:
1. Creation of a genome index (needed for Bowtie and TopHat to operate)
2. Alignments of RNA-seq reads to the genome with identification of exon-exon splice functions using TopHat
3. Assembly of RNA-seq reads into transcripts and report of the RPKM/FPKM using Cufflinks
4. Comparison of assembled transcripts with a reference annotation, or across experiments, using Cuffcompare
5. Finding significant changes in transcript expression, splicing, and promoter use using Cuffdiff
(in this example, accepted_hits1a.bam and accepted_hits1b.sam are replicates, while accepted_hits2.bam is another experimental condition)
Now I have the following questions:
(1) I originally thought step 2 could be performed with all reads files (from different experimental conditions) in the same command line. However, tophat only generates one set of output file. I thus switched to calling tophat for each reads file at a time. Is that the correct approach?
(2) In step 2, tophat can uses a reference genome annotation (GTF file). I am using this pipeline on a genome which, so far, has only been automatically annotated (Phaeodactylum tricornutum). From people's experience, should I rely on an automatically generated GTF file (in this case, from ENSEMBL) with hypothetical gene models or is it better to 'trust' the gene models that tophat would generate from the RNA-seq reads alone? I am worried about which option would give the most false positives; indeed, the tophat documentation states that 'TopHat will use the exon records in this file to build a set of known splice junctions for each gene, and will attempt to align reads to these junctions even if they would not normally be covered by the initial mapping.'.
(3) In step 3, cufflinks also can uses a reference genome annotation. The question is the same: should I trust the reads' alignments more than an automatically generated GTF file for gene models? After all, a lot of person are using ChIP-seq and RNA-seq to improve on gene models.
(4) In step 3, cufflinks is ran with one reads file (i.e., one experimental condition) at a time, producing a transcripts.gtf file with the model transcripts for this condition. However in step 5 cuffdiff is expecting a transcripts GTF file that will be used to compare all SAM/BAM files (from various experimental conditions) provided as input. Should I understand that, in order to compare different experimental conditions, I have to generate a 'meta' transcripts file by running steps 2 and 3 with a concatenation of all my reads files? (ensuring the resulting transcripts.gtf file in step 3 will be the union of all transcripts.gtf files possible?)
Those are the key questions that prevent me, for now, to really use this pipeline (that, and an 'sort order of reads in BAMs must be the same' error in step 3 which, according to thread #9304 -- http://seqanswers.com/forums/showthread.php?t=9304 -- is due to a bug of cufflinks).
Best,
Aurelien
I am new to RNA-seq analysis, and am trying to understand the high-order logic of the bowtie/tophat/cufflinks/cuffcompare/cuffdiff pipeline. As far as I understand, a typical run would be something like:
1. Creation of a genome index (needed for Bowtie and TopHat to operate)
bowtie-build genome.fa bowtie_output/genome
2. Alignments of RNA-seq reads to the genome with identification of exon-exon splice functions using TopHat
tophat -o=tophat_output/ bowtie_output/genome reads.fastq
3. Assembly of RNA-seq reads into transcripts and report of the RPKM/FPKM using Cufflinks
cufflinks -o=cufflinks_output/ --reference-seq=genome.fa tophat_output/accepted_hits.bam
4. Comparison of assembled transcripts with a reference annotation, or across experiments, using Cuffcompare
cuffcompare -o=cuffcompare_output -r=genome.gtf -R cufflinks_output/transcripts.gtf
5. Finding significant changes in transcript expression, splicing, and promoter use using Cuffdiff
cuffdiff -o=cuffdiff_output/ cufflinks_output/transcripts.gff tophat_output/accepted_hits1a.bam,tophat_output/accepted_hits1b.bam tophat_output/accepted_hits2.bam
(in this example, accepted_hits1a.bam and accepted_hits1b.sam are replicates, while accepted_hits2.bam is another experimental condition)
Now I have the following questions:
(1) I originally thought step 2 could be performed with all reads files (from different experimental conditions) in the same command line. However, tophat only generates one set of output file. I thus switched to calling tophat for each reads file at a time. Is that the correct approach?
(2) In step 2, tophat can uses a reference genome annotation (GTF file). I am using this pipeline on a genome which, so far, has only been automatically annotated (Phaeodactylum tricornutum). From people's experience, should I rely on an automatically generated GTF file (in this case, from ENSEMBL) with hypothetical gene models or is it better to 'trust' the gene models that tophat would generate from the RNA-seq reads alone? I am worried about which option would give the most false positives; indeed, the tophat documentation states that 'TopHat will use the exon records in this file to build a set of known splice junctions for each gene, and will attempt to align reads to these junctions even if they would not normally be covered by the initial mapping.'.
(3) In step 3, cufflinks also can uses a reference genome annotation. The question is the same: should I trust the reads' alignments more than an automatically generated GTF file for gene models? After all, a lot of person are using ChIP-seq and RNA-seq to improve on gene models.
(4) In step 3, cufflinks is ran with one reads file (i.e., one experimental condition) at a time, producing a transcripts.gtf file with the model transcripts for this condition. However in step 5 cuffdiff is expecting a transcripts GTF file that will be used to compare all SAM/BAM files (from various experimental conditions) provided as input. Should I understand that, in order to compare different experimental conditions, I have to generate a 'meta' transcripts file by running steps 2 and 3 with a concatenation of all my reads files? (ensuring the resulting transcripts.gtf file in step 3 will be the union of all transcripts.gtf files possible?)
Those are the key questions that prevent me, for now, to really use this pipeline (that, and an 'sort order of reads in BAMs must be the same' error in step 3 which, according to thread #9304 -- http://seqanswers.com/forums/showthread.php?t=9304 -- is due to a bug of cufflinks).
Best,
Aurelien
Comment