**Simon Anders** · 02-18-2011, 12:14 AM

Well, there obviously are mismatched quotes in your attribute strings. In a proper GTF file, the first line should look like this:

Code:

Chr1   CNA2_FINAL_CALLGENES_1   start_codon   11499   11501   .   -   0   gene_id "CNAG_00001"; transcript_id "CNAG_00001T0"

All these extra quotes make little sense and are confusing to HTSeq. It actually looks a bit as if you loaded the file with a spreadsheet program and saved it again. Doing something like this might introduce extra quotes.

Where did you get the GTF file from?
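
For anyone who wants to locate the offending lines before running HTSeq, here is a minimal Python 3 sketch. It is not HTSeq's own check: it only assumes that a well-formed GTF line has nine tab-separated fields and an even number of double quotes in the attribute column. The function name `find_suspect_attribute_lines` is made up for this example.

```python
def find_suspect_attribute_lines(handle):
    """Yield (line_number, reason) for GTF lines that look malformed.

    Assumption: a valid line has 9 tab-separated fields and an even
    number of double quotes in the 9th (attribute) field.
    """
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            yield number, "expected 9 tab-separated fields, found %d" % len(fields)
            continue
        if fields[8].count('"') % 2 != 0:
            yield number, "odd number of double quotes in attribute column"
```

To use it, open the GTF file and loop over the results, e.g. `for number, reason in find_suspect_attribute_lines(open("genes.gtf")): print(number, reason)` (the file name is a placeholder). Doubled quotes that happen to add up to an even count will slip through, so treat this as a first pass only.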

**Artur Jaroszewicz** · 01-16-2013, 09:44 PM

Same problem, different GTF

Hi Simon,

I was wondering if you could possibly help me with my problem. I downloaded the Arabidopsis thaliana Ensembl GTF from plants.ensembl.org. Here's a sample:

1 protein_coding CDS 30424421 30424675 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1"; protein_id "AT1G80990.1";
1 protein_coding start_codon 30424421 30424423 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1";

When I try to run HTSeq, it gives me the same error as above:

Traceback (most recent call last):
  File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
    main()
  File "python_scripts/sam_to_gene_array_2.py", line 41, in main
    for feature in gtf:
  File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.

Any ideas why this could be happening? Thank you in advance, and thank you for all your help in the past.

Best Regards,
Artur Jaroszewicz

**DonDolowy** · 01-17-2013, 03:26 AM

If you download the GTF from the iGenomes, it should work:

http://tophat.cbcb.umd.edu/igenomes.html

**Artur Jaroszewicz** · 01-17-2013, 11:57 PM

Still getting the same error:

Traceback (most recent call last):
  File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 80, in <module>
    main()
  File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 41, in main
    for feature in gtf:
  File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.

Any other suggestions?

**Mahtab** · 02-07-2013, 09:45 PM

I have the same problem with Arabidopsis and RNA-Seq in Galaxy, and I have tried different GTF files from Ensembl and arabidopsis.org.

Any ideas?

Thanks

**Artur Jaroszewicz** · 02-07-2013, 10:30 PM

Hi Mahtab,

Yes, I actually solved the problem. I thought I had posted the solution here, but evidently not; I guess it went into another thread that I started. Anyway, there are maybe 100 lines or so that have semicolons in the gene ID of the attribute field, so I wrote a quick script to take care of it. If you'd like to use my modified GTF, you can download it at:
http://pellegrini.mcdb.ucla.edu/Artu...10.ensembl.gtf

Good luck in your analysis!

Artur
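
Artur's script is not shown in the thread, so the following is only a minimal Python 3 sketch of one way such a cleanup could be done. It assumes that the problem lines carry a `;` inside a double-quoted attribute value and that swapping it for a comma is acceptable; both the replacement character and the function names are choices made for this example.

```python
def replace_semicolons_in_quotes(attribute_string, replacement=","):
    """Replace every ';' that sits between a pair of double quotes."""
    out = []
    in_quotes = False
    for ch in attribute_string:
        if ch == '"':
            in_quotes = not in_quotes
        # Only semicolons inside a quoted value are touched; the ones
        # that separate attributes are left alone.
        if ch == ";" and in_quotes:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


def clean_gtf_line(line):
    """Return the line with its attribute column (9th field) cleaned."""
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    # Leave comments and lines without exactly 9 fields untouched.
    if stripped.startswith("#") or len(fields) != 9:
        return stripped
    fields[8] = replace_semicolons_in_quotes(fields[8])
    return "\t".join(fields)
```

Applied line by line, e.g. `out.write(clean_gtf_line(line) + "\n")`, this rewrites a file without changing any other column. Note that it changes the affected IDs, so any downstream table keyed on the original names would need the same substitution.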

**jparsons** · 02-08-2013, 06:18 AM

Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling to get programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?

**kmcarr** · 02-08-2013, 08:09 AM

Originally posted by jparsons:

Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling to get programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?

There is a standard defined for GTF files. The problem isn't the standard, it's when people create files that do not conform to that standard, e.g. including a semicolon in your gene_id.

**Mahtab** · 02-11-2013, 09:56 PM

Hi Artur,

Thank you very much for your help. It worked!
I had seen the other thread and downloaded the GTF from there, but for some reason I was still getting the same error.

Thanks again
Mahtab

**mslider** · 09-22-2013, 10:33 AM

Hi,

I have a similar problem with a GTF file when using htseq-count (version 0.5.4p3):

samtools view BNV13.sorted.bam | htseq-count -m intersection-nonempty -s no - Rattus_norvegicus.gtf
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
525298 GFF lines processed.
Error: 'itertools.chain' object has no attribute 'get_line_number_string'
[Exception type: AttributeError, raised in count.py:201]

First lines of the GTF file:

AABR06112227.1 pseudogene exon 345 455 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000476932";
AABR06112227.1 pseudogene exon 157 342 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000024118";
AABR06112227.1 pseudogene exon 86 154 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "3"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000470172";
AABR06111321.1 miRNA exon 71 156 . + . gene_id "ENSRNOG00000045547"; transcript_id "ENSRNOT00000070977"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000464516";
AABR06111321.1 pseudogene exon 170 424 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000256162";
AABR06111321.1 pseudogene exon 429 434 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000472450";
AABR06111841.1 miRNA exon 87 210 . - . gene_id "ENSRNOG00000046613"; transcript_id "ENSRNOT00000072639"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000503423";
AABR06110665.1 protein_coding exon 343 613 . - . gene_id "ENSRNOG00000048972"; transcript_id "ENSRNOT00000061381"; exon_number "1"; gene_name "H2-

Is there anything I can do?

Thank you

**Simon Anders** · 09-25-2013, 06:55 AM

It's a problem with your BAM file.

There is a bug in the code that writes the error message which appears only when you read the SAM file from standard input. I'll fix this in the next release. For now, please convert your BAM file to a SAM file, and put the SAM file's name instead of the "-". Then, you should be able to see the actual error message.

**mslider** · 09-25-2013, 10:19 AM

Error with GTF file when using htseq-count

My problem is solved.
I fixed it using samtools view -f 0x2 input.bam | htseq-count .....
With the option -f 0x2, all reads that are not properly paired are discarded.
So in this case the problem was not due to the SAM file being read from standard input. This BAM file was produced by tophat2; maybe it is a bug in TopHat!?

Laurent
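
For readers unfamiliar with the flag filter: bit 0x2 of the SAM FLAG field marks a read as mapped in a proper pair, and `samtools view -f 0x2` keeps only records with that bit set. A minimal Python 3 sketch of the same test on plain SAM text lines follows; the function names are made up for this example and it is not a substitute for samtools.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def is_properly_paired(sam_line):
    """Return True if the FLAG (2nd tab-separated field) has bit 0x2 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & PROPER_PAIR)


def keep_properly_paired(lines):
    """Yield header lines and alignment lines that pass the 0x2 test."""
    for line in lines:
        # Header lines start with '@' and carry no FLAG field.
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

For example, a record with FLAG 99 (1 + 2 + 32 + 64) passes, while one with FLAG 73 (1 + 8 + 64) is dropped because its mate is unmapped.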

**jshaik** · 01-13-2015, 09:29 AM

When I had this error, I removed the FASTA sequences from my GFF file (the sequences at the end of the file) and it worked!
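
A minimal Python 3 sketch of that cleanup, assuming the embedded sequences follow the GFF3 convention of a `##FASTA` directive (or, in older files, simply start at the first `>` header line). The function name `strip_fasta_section` is made up for this example.

```python
def strip_fasta_section(lines):
    """Yield GFF lines up to, but not including, the embedded FASTA part."""
    for line in lines:
        # GFF3 marks embedded sequences with a '##FASTA' directive; some
        # files omit it and go straight to the first '>' header.
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        yield line
```

Writing the result to a new file, e.g. `out.writelines(strip_fasta_section(open("annotation.gff")))` (the file name is a placeholder), leaves the original annotation untouched in case the sequences are needed later.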
