I'm trying to create a human GTF file for DEXSeq using the hg19.gtf file from the TopHat website. The problem is that DEXSeq's annotation preparation script (dexseq_prepare_annotation.py) is complaining about a couple of different types of errors. The three types shown further down are the most common.
What I have been doing so far is to grep out the genes that cause these errors, but I'm wondering how many more I will have to remove by hand to get this file to work. Does anyone already have a GTF that works?
Here are the errors:
Code:
Traceback (most recent call last):
  File "dexseq_prepare_annotation.py", line 93, in <module>
    raise ValueError, "Same name found on two strands: %s, %s" % ( str(l[i]), str(l[i+1]) )
ValueError: Same name found on two strands: <GenomicFeature: exonic_part 'FAM95B1' at chr9: 42473935 -> 42474238 (strand '+')>, <GenomicFeature: exonic_part 'FAM95B1' at chr9: 43027964 -> 43027663 (strand '-')>
Code:
Traceback (most recent call last):
  File "dexseq_prepare_annotation.py", line 91, in <module>
    raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
ValueError: Same name found on two chromosomes: <GenomicFeature: exonic_part 'YTHDC1' at chr4_ctg9_hap1: 45745 -> 45366 (strand '-')>, <GenomicFeature: exonic_part 'YTHDC1' at chr4: 69180040 -> 69176105 (strand '-')>
Code:
Traceback (most recent call last):
  File "dexseq_prepare_annotation.py", line 89, in <module>
    assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
AssertionError: <GenomicFeature: exonic_part 'LOC399939' at chr11: 89645253 -> 89644640 (strand '-')> starts too early
Here is what I have so far:
Code:
egrep -v "PRAMEF5|LOC728855|LOC646743|YTHDC1|FAM95B1|LOC399939|PRAMEF22|RSPH10B|ANKRD20|chrUn|FLJ20518|RIMBP|LOC100093|GSTT|UGT2A3|SPAG|PMS2L|HIST|LOC440|chr6_|LOC399940|DUX|TRIM|random|chrX|chrY|TMPRSS11E|SNAR|REXO1L|PPP2|LOC727849|AGSK1|FAM25|GOLGA8|SPDYE|RGPD|MIR4283|CDY1|MIR3675|VCY1|GTF2IP|PRY2|FAM41|PPIAL4|SHOX|H2AFB|LOC100288570|LOC440895|TP53TG|DEFB10|LOC100287834|CSAG3|CSF2R|CBWD|LOC728875|GOLGA2P|CTAGE4|NCRNA00230|TISP|LOC642826|RBMY1|OR2A|UGT2B10|XGPY|MIR4650|LOC100133920|MIR3180|PNMA|LOC150527|MIR3179|TTTY|TBC1D|ZNF84|EIF3C|IL3RA|OR4F3|IL9R|LOC100132287|MIR1256|FAM7A2|RNF5P1|CDY2B|MIR1184-1|AGAP9|SSX|CXorf51|LOC100506123|FAM41AY1|RBMY1J|MAGEA2|MIR1244|HSFX|DEFB104B|HIST2H3C|FAM7A|FAM75A|MCART6|chr17_|LIMS3|SPANX|OR4F29|PPIAL4A|ASMTL" hg19.refFlat.gtf >whole_transcriptome.hg19.gtf
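Rather than extending the pattern one gene at a time, a small script could find every gene_id that appears on more than one chromosome or strand and drop all of them in one pass. Below is a minimal sketch in Python. It assumes the GTF has the usual nine tab-separated columns with a gene_id "..." attribute in the last one. The function names (find_ambiguous_genes, filter_gtf, clean_gtf) are made up for illustration and are not part of DEXSeq or HTSeq. It only targets the "two strands" and "two chromosomes" errors; the "starts too early" case would still need separate handling.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_id_of(line):
    """Return (chromosome, strand, gene_id) for a GTF line, or None."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    match = GENE_ID_RE.search(fields[8])
    if match is None:
        return None
    return fields[0], fields[6], match.group(1)


def find_ambiguous_genes(lines):
    """Return gene_ids seen on more than one (chromosome, strand) pair."""
    loci = defaultdict(set)
    for line in lines:
        parsed = gene_id_of(line)
        if parsed is not None:
            chrom, strand, gene = parsed
            loci[gene].add((chrom, strand))
    return set(gene for gene, seen in loci.items() if len(seen) > 1)


def filter_gtf(lines, bad_genes):
    """Yield only the lines whose gene_id is not in bad_genes."""
    for line in lines:
        parsed = gene_id_of(line)
        if parsed is not None and parsed[2] in bad_genes:
            continue
        yield line


def clean_gtf(in_path, out_path):
    """Read the GTF twice: once to find bad genes, once to filter."""
    with open(in_path) as handle:
        bad_genes = find_ambiguous_genes(handle)
    with open(in_path) as handle, open(out_path, "w") as out:
        out.writelines(filter_gtf(handle, bad_genes))
    return bad_genes
```

Calling clean_gtf("hg19.refFlat.gtf", "whole_transcriptome.hg19.gtf") would write the filtered file and return the set of dropped gene names, so the list can be checked before running dexseq_prepare_annotation.py on the result. Note that this removes an affected gene from every location, including its copy on the main chromosome, which is the same trade-off the egrep approach makes.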