I am a newcomer to the RNAseq field. I have been trying to get (differential) expression values from my experiments (rat and mouse). To achieve this I want to run tophat and cufflinks with annotated transcripts/genes (i.e. -G option in top hat). this requires a GTF or GFF file.
I obtained my rat GTF file from:
I already figured out that the first column of this file does not contain the correct names:
['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '3' '4' '5'
'6' '7' '8' '9' 'AABR06109308.1' 'AABR06109382.1' 'AABR06109386.1'
'AABR06109393.1' 'AABR06109522.1' 'AABR06109730.1' 'AABR06109804.1'
'AABR06109842.1' 'AABR06109980.1' 'AABR06109985.1' 'AABR06110113.1'
'AABR06110137.1' 'AABR06110207.1' 'AABR06110239.1' 'AABR06110493.1'
'AABR06110520.1' 'AABR06110524.1' 'AABR06110528.1' 'AABR06110658.1'
'AABR06110665.1' 'AABR06110667.1' 'AABR06110668.1' 'AABR06110762.1'
'AABR06110835.1' 'AABR06110904.1' 'AABR06110917.1' 'AABR06110957.1'
'AABR06110964.1' 'AABR06111084.1' 'AABR06111092.1' 'AABR06111127.1'
'AABR06111133.1' 'AABR06111236.1' 'AABR06111258.1' 'AABR06111264.1'
'AABR06111284.1' 'AABR06111288.1' 'AABR06111321.1' 'AABR06111326.1'
'AABR06111415.1' 'AABR06111426.1' 'AABR06111443.1' 'AABR06111498.1'
'AABR06111564.1' 'AABR06111611.1' 'AABR06111687.1' 'AABR06111698.1'
'AABR06111722.1' 'AABR06111841.1' 'AABR06111842.1' 'AABR06112158.1'
'AABR06112227.1' 'AABR06112323.1' 'JH620271.1' 'JH620298.1' 'JH620359.1'
'JH620370.1' 'JH620402.1' 'JH620421.1' 'JH620447.1' 'JH620470.1'
'JH620473.1' 'JH620476.1' 'JH620489.1' 'JH620497.1' 'JH620498.1'
'JH620499.1' 'JH620501.1' 'JH620511.1' 'JH620517.1' 'JH620530.1'
'JH620566.1' 'JH620632.1' 'MT' 'X']
they should be:
xxx@xxx:~/RNAseq/Rattus_norvegicus/UCSC/rn4/Sequence/Bowtie2Index$ bowtie2-inspect --names genome
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr20
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
therefore I wrote a python script that replaced the names with the correct ones. The new GTF file first column indices are, which match the ones in my index file:
['chr1' 'chr10' 'chr11' 'chr12' 'chr13' 'chr14' 'chr15' 'chr16' 'chr17'
'chr18' 'chr19' 'chr2' 'chr20' 'chr3' 'chr4' 'chr5' 'chr6' 'chr7' 'chr8'
'chr9' 'chrM' 'chrX']
However, when I run the following command:
xxx@xxx:~/RNAanalysis/data/rat RNAseq$ tophat -p 20 -G ../../genome.gtf -o D09LDACXX_101811-19-07_CAGATC_L003_R1 ../../RatIndex/Bowtie2Index/genome D09LDACXX_101811-19-07_CAGATC_L003_R1.fastq
I get an error:
[2013-06-24 11:14:41] Beginning TopHat run (v2.0.8b)
-----------------------------------------------
[2013-06-24 11:14:41] Checking for Bowtie
Bowtie version: 2.1.0.0
[2013-06-24 11:14:41] Checking for Samtools
Samtools version: 0.1.18.0
[2013-06-24 11:14:41] Checking for Bowtie index files
[2013-06-24 11:14:41] Checking for reference FASTA file
[2013-06-24 11:14:41] Generating SAM header for ../../RatIndex/Bowtie2Index/genome
format: fastq
quality scale: phred33 (default)
[2013-06-24 11:14:47] Reading known junctions from GTF file
[2013-06-24 11:14:50] Preparing reads
left reads: min. length=51, max. length=51, 15745244 kept reads (5954 discarded)
[2013-06-24 11:18:53] Creating transcriptome data files..
[FAILED]
Error: gtf_to_fasta returned an error.
the first few rows of my GTF file look like this:
chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 386110 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 393385 393609 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 394818 394941 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 400266 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312155 312361 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312365 312376 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312378 312383 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 386141 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 393385 393606 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 394818 394940 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "7"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 400259 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "8"; gene_name "Vom2r3"; gene_biotype "protei$
(the $-sign at the end is just because the complete row didn't fit in my shell, an example of a complete row:
chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protein_coding"; transcript_name "Vom2r3-202"; exon_id "ENSRNOE00000494965"; old_row[1] "protein_coding";
)
compared to the original GTF file I also replaced the second column with the word "Ensembl" ,, where before this said "protein_coding". I only kept the exons, so I removed start and stop codons and CDS.
Does anybody have a clue why TopHat is still failing to run?
thanks in advance!
I obtained my rat GTF file from:
I already figured out that the first column of this file does not contain the correct names:
['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '3' '4' '5'
'6' '7' '8' '9' 'AABR06109308.1' 'AABR06109382.1' 'AABR06109386.1'
'AABR06109393.1' 'AABR06109522.1' 'AABR06109730.1' 'AABR06109804.1'
'AABR06109842.1' 'AABR06109980.1' 'AABR06109985.1' 'AABR06110113.1'
'AABR06110137.1' 'AABR06110207.1' 'AABR06110239.1' 'AABR06110493.1'
'AABR06110520.1' 'AABR06110524.1' 'AABR06110528.1' 'AABR06110658.1'
'AABR06110665.1' 'AABR06110667.1' 'AABR06110668.1' 'AABR06110762.1'
'AABR06110835.1' 'AABR06110904.1' 'AABR06110917.1' 'AABR06110957.1'
'AABR06110964.1' 'AABR06111084.1' 'AABR06111092.1' 'AABR06111127.1'
'AABR06111133.1' 'AABR06111236.1' 'AABR06111258.1' 'AABR06111264.1'
'AABR06111284.1' 'AABR06111288.1' 'AABR06111321.1' 'AABR06111326.1'
'AABR06111415.1' 'AABR06111426.1' 'AABR06111443.1' 'AABR06111498.1'
'AABR06111564.1' 'AABR06111611.1' 'AABR06111687.1' 'AABR06111698.1'
'AABR06111722.1' 'AABR06111841.1' 'AABR06111842.1' 'AABR06112158.1'
'AABR06112227.1' 'AABR06112323.1' 'JH620271.1' 'JH620298.1' 'JH620359.1'
'JH620370.1' 'JH620402.1' 'JH620421.1' 'JH620447.1' 'JH620470.1'
'JH620473.1' 'JH620476.1' 'JH620489.1' 'JH620497.1' 'JH620498.1'
'JH620499.1' 'JH620501.1' 'JH620511.1' 'JH620517.1' 'JH620530.1'
'JH620566.1' 'JH620632.1' 'MT' 'X']
they should be:
xxx@xxx:~/RNAseq/Rattus_norvegicus/UCSC/rn4/Sequence/Bowtie2Index$ bowtie2-inspect --names genome
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr20
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
therefore I wrote a python script that replaced the names with the correct ones. The new GTF file first column indices are, which match the ones in my index file:
['chr1' 'chr10' 'chr11' 'chr12' 'chr13' 'chr14' 'chr15' 'chr16' 'chr17'
'chr18' 'chr19' 'chr2' 'chr20' 'chr3' 'chr4' 'chr5' 'chr6' 'chr7' 'chr8'
'chr9' 'chrM' 'chrX']
However, when I run the following command:
xxx@xxx:~/RNAanalysis/data/rat RNAseq$ tophat -p 20 -G ../../genome.gtf -o D09LDACXX_101811-19-07_CAGATC_L003_R1 ../../RatIndex/Bowtie2Index/genome D09LDACXX_101811-19-07_CAGATC_L003_R1.fastq
I get an error:
[2013-06-24 11:14:41] Beginning TopHat run (v2.0.8b)
-----------------------------------------------
[2013-06-24 11:14:41] Checking for Bowtie
Bowtie version: 2.1.0.0
[2013-06-24 11:14:41] Checking for Samtools
Samtools version: 0.1.18.0
[2013-06-24 11:14:41] Checking for Bowtie index files
[2013-06-24 11:14:41] Checking for reference FASTA file
[2013-06-24 11:14:41] Generating SAM header for ../../RatIndex/Bowtie2Index/genome
format: fastq
quality scale: phred33 (default)
[2013-06-24 11:14:47] Reading known junctions from GTF file
[2013-06-24 11:14:50] Preparing reads
left reads: min. length=51, max. length=51, 15745244 kept reads (5954 discarded)
[2013-06-24 11:18:53] Creating transcriptome data files..
[FAILED]
Error: gtf_to_fasta returned an error.
the first few rows of my GTF file look like this:
chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 386110 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 393385 393609 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 394818 394941 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 400266 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312155 312361 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312365 312376 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 312378 312383 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 386141 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 393385 393606 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 394818 394940 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "7"; gene_name "Vom2r3"; gene_biotype "protei$
chr1 Ensembl exon 400259 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "8"; gene_name "Vom2r3"; gene_biotype "protei$
(the $-sign at the end is just because the complete row didn't fit in my shell, an example of a complete row:
chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protein_coding"; transcript_name "Vom2r3-202"; exon_id "ENSRNOE00000494965"; old_row[1] "protein_coding";
)
compared to the original GTF file I also replaced the second column with the word "Ensembl" ,, where before this said "protein_coding". I only kept the exons, so I removed start and stop codons and CDS.
Does anybody have a clue why TopHat is still failing to run?
thanks in advance!
Comment