  • Error with GTF file when using htseq-count

    Hi,

    Just finished installing HTSeq on a MacOSX with python 2.6.6 and latest version of Numpy.

    I can execute the first few commands of the HTSeq tour using the yeast example sequence file so the install seems to be working

    I invoked the htseq-count script using the following:
    >python -m HTSeq.scripts.count 45minCt_1.sam cneoh99.gtf

    and I get the following error:
    Error occured in line 1 of file cneoh99.gtf.
    Error: The attribute string seems to contain mismatched quotes.
    [Exception type: ValueError, raised in __init__.py:167]

    The first few lines of my GTF file look like:
    Chr1 CNA2_FINAL_CALLGENES_1 start_codon 11499 11501 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
    Chr1 CNA2_FINAL_CALLGENES_1 stop_codon 11060 11062 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
    Chr1 CNA2_FINAL_CALLGENES_1 exon 11430 11501 . - . "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"

    I've attached an excerpt of the file.
    Do I need headers in this file?

    Thanks for any help.

    Regards,
    Maureen

  • #2
    Well, there obviously are mismatched quotes in your attribute strings. In a proper GTF file, the first line should look like this:

    Code:
    Chr1   CNA2_FINAL_CALLGENES_1   start_codon   11499   11501   .   -   0   gene_id "CNAG_00001"; transcript_id "CNAG_00001T0"
    All these extra quotes make little sense and are confusing to HTSeq. It actually looks a bit as if you loaded the file with a spreadsheet program and saved it again. Doing something like this might introduce extra quotes.

    Where did you get the GTF file from?
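If a spreadsheet export is indeed the cause, the damage follows the usual CSV quoting rule: the whole attribute column is wrapped in quotes and every inner quote is doubled. A minimal Python sketch of undoing that is shown here. It assumes the real file is tab-separated with nine columns (the forum display shows spaces), and the function name `unquote_attribute_column` is made up for illustration; it is not part of HTSeq.

```python
import csv


def unquote_attribute_column(line):
    """Undo CSV-style quoting on the ninth (attribute) column of one GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Not a nine-column feature line (comment, blank line, ...): leave it alone.
        return line.rstrip("\n")
    attr = fields[8]
    if attr.startswith('"') and attr.endswith('"'):
        # csv.reader drops the outer quotes and turns each "" back into ".
        attr = next(csv.reader([attr], delimiter="\t"))[0]
    fields[8] = attr
    return "\t".join(fields)
```

Applied line by line, this should leave an already correct attribute column untouched, because such a column starts with a key like gene_id rather than a quote character.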

    • #3
      Same problem, different GTF

      Hi Simon,

      I was wondering if you could possibly help me with my problem. I downloaded the arabidopsis thaliana ensembl gtf from plants.ensembl.org. Here's a sample:

      1 protein_coding CDS 30424421 30424675 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1"; protein_id "AT1G80990.1";
      1 protein_coding start_codon 30424421 30424423 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1";
      When I try to run HTSeq, it gives me the same error as above:

      Traceback (most recent call last):
      File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
      main()
      File "python_scripts/sam_to_gene_array_2.py", line 41, in main
      for feature in gtf:
      File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
      ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
      File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
      raise ValueError, "The attribute string seems to contain mismatched quotes."
      ValueError: The attribute string seems to contain mismatched quotes.
      Any ideas why this could be happening? Thank you in advance, and thank you for all your help in the past.

      Best Regards,
      Artur Jaroszewicz

      • #4
        If you download the GTF from the iGenomes, it should work:

        • #5
          Still getting the same error:
          Traceback (most recent call last):
          File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 80, in <module>
          main()
          File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 41, in main
          for feature in gtf:
          File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
          ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
          File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
          raise ValueError, "The attribute string seems to contain mismatched quotes."
          ValueError: The attribute string seems to contain mismatched quotes.
          Any other suggestions?

          • #6
            I have the same problem with arabidopsis and RNASeq in Galaxy and I have used different GTF files from ensembl and arabidopsis.org.

            Any ideas?


            Thanks

            • #7
              Hi Mahtab,

              Yes, I actually solved the problem. I thought I had posted the solution to my problem, but evidently not. I guess there was another thread that I started. Anyway, there's maybe 100 lines or so that have semicolons in the gene id of the attribute fields, so I wrote a quick script to take care of it. If you'd like to use my modified gtf, you can download it at:
              http://pellegrini.mcdb.ucla.edu/Artu...10.ensembl.gtf
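For anyone who would rather patch their own file, a clean-up of this kind could look roughly like the sketch below. This is not the script mentioned above, only an illustration of the idea: it replaces any semicolon found inside a double-quoted attribute value. The names `strip_semicolons_in_values` and `fix_gtf_line` and the choice of a comma as the replacement are assumptions, and the sketch presumes a tab-separated file whose quotes are otherwise balanced.

```python
import re

# A double-quoted attribute value, e.g. "AT1G80990".
QUOTED_VALUE = re.compile(r'"[^"]*"')


def strip_semicolons_in_values(attr, replacement=","):
    """Replace semicolons that sit inside quoted values of an attribute string."""
    return QUOTED_VALUE.sub(
        lambda match: match.group(0).replace(";", replacement), attr
    )


def fix_gtf_line(line):
    """Clean the attribute column of one GTF line; other lines pass through."""
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 9:
        return stripped
    fields[8] = strip_semicolons_in_values(fields[8])
    return "\t".join(fields)
```

Semicolons that separate attributes are outside the quotes and are left alone, so only the offending values change.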

              Good luck in your analysis!

              Artur

              • #8
                Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

                I can't say i've spent a great deal of time actually LOOKING at gtf files (although i have spent a great deal of time struggling with getting programs to accept them), but since every data source's gtf format seems to be (eventually) convertible into any type of input, it should be doable, right?

                • #9
                  Originally posted by jparsons
                  Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

                  I can't say i've spent a great deal of time actually LOOKING at gtf files (although i have spent a great deal of time struggling with getting programs to accept them), but since every data source's gtf format seems to be (eventually) convertible into any type of input, it should be doable, right?
                  There is a standard defined for GTF files. The problem isn't the standard, it's when people create files that do not conform to that standard, e.g. including a semicolon in your gene_id.
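A quick conformance check along these lines can be sketched in Python. It only tests a few basics: nine tab-separated fields, attributes written as key "value" pairs separated by semicolons with no semicolon or stray quote inside a value, and the presence of gene_id and transcript_id. It is deliberately lenient about a missing final semicolon, it is far from a full validator, and the name `check_gtf_line` is made up for illustration.

```python
import re

# One attribute: a key, whitespace, then a quoted value (no ';' or '"' inside)
# or a bare token, closed by ';' or by the end of the string.
ATTRIBUTE = re.compile(r'\s*(\w+)\s+("[^";]*"|[^\s";]+)\s*(?:;|$)')


def check_gtf_line(line):
    """Return a list of problems found in one GTF feature line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    attr = fields[8]
    problems = []
    keys = []
    pos = 0
    while pos < len(attr):
        match = ATTRIBUTE.match(attr, pos)
        if match is None:
            if attr[pos:].strip():
                problems.append(f"cannot parse attributes from: {attr[pos:]!r}")
            break
        keys.append(match.group(1))
        pos = match.end()
    for required in ("gene_id", "transcript_id"):
        if required not in keys:
            problems.append(f"missing {required}")
    return problems
```

Running this over a file before handing it to a counting tool should point at the offending lines, such as a value with an embedded semicolon or an attribute column wrapped in extra quotes.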

                  • #10
                    Hi Artur,

                    Thank you very much for your help. It worked!
                    I had seen the other thread and downloaded the GTF from there, but for some reason I was still getting the same error.

                    Thanks again
                    Mahtab

                    • #11
                      --Hi,

                      i have a similar problem with gtf file using htseq-count (version 0.5.4p3):

                      samtools view BNV13.sorted.bam | htseq-count -m intersection-nonempty -s no - Rattus_norvegicus.gtf
                      100000 GFF lines processed.
                      200000 GFF lines processed.
                      300000 GFF lines processed.
                      400000 GFF lines processed.
                      500000 GFF lines processed.
                      525298 GFF lines processed.
                      Error: 'itertools.chain' object has no attribute 'get_line_number_string'
                      [Exception type: AttributeError, raised in count.py:201]

                      first lines of gtf file:

                      AABR06112227.1 pseudogene exon 345 455 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000476932";
                      AABR06112227.1 pseudogene exon 157 342 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000024118";
                      AABR06112227.1 pseudogene exon 86 154 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "3"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000470172";
                      AABR06111321.1 miRNA exon 71 156 . + . gene_id "ENSRNOG00000045547"; transcript_id "ENSRNOT00000070977"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000464516";
                      AABR06111321.1 pseudogene exon 170 424 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000256162";
                      AABR06111321.1 pseudogene exon 429 434 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000472450";
                      AABR06111841.1 miRNA exon 87 210 . - . gene_id "ENSRNOG00000046613"; transcript_id "ENSRNOT00000072639"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000503423";
                      AABR06110665.1 protein_coding exon 343 613 . - . gene_id "ENSRNOG00000048972"; transcript_id "ENSRNOT00000061381"; exon_number "1"; gene_name "H2-

                      is there something to do ?

                      thank you --

                      • #12
                        It's a problem with your BAM file.

                        There is a bug in the code that writes the error message which appears only when you read the SAM file from standard input. I'll fix this in the next release. For now, please convert your BAM file to a SAM file, and put the SAM file's name instead of the "-". Then, you should be able to see the actual error message.
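The workaround can be scripted. The Python sketch below only builds the two command lines and runs them in order. The helper names and the intermediate SAM path are illustrative assumptions; the htseq-count options are the ones from the command quoted earlier in the thread, and `samtools view -o` writes the converted records to the named file.

```python
import subprocess


def bam_then_count_commands(bam_path, sam_path, gtf_path):
    """Build the commands: convert BAM to SAM, then count from the SAM file."""
    convert = ["samtools", "view", "-o", sam_path, bam_path]
    count = ["htseq-count", "-m", "intersection-nonempty", "-s", "no",
             sam_path, gtf_path]
    return convert, count


def run_commands(commands):
    """Run each command in turn and stop at the first one that fails."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, `run_commands(bam_then_count_commands("BNV13.sorted.bam", "BNV13.sorted.sam", "Rattus_norvegicus.gtf"))` would name the SAM file explicitly instead of using "-", so the real error message can surface.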

                        • #13
                          Error with GTF file when using htseq-count

                          --

                          My problem is solved.
                          I fixed it using samtools view -f 0x2 input.bam | htseq-count .....
                          With the option -f 0x2, all reads that are not properly paired are discarded.
                          So in this case the problem was not caused by reading the SAM file from standard input. This BAM file was produced by tophat2, so maybe it is a bug in tophat!?
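To make the effect of -f 0x2 concrete: the second column of every SAM record is a bit field, and bit 0x2 means "each segment properly aligned according to the aligner". A rough Python equivalent of the filter is sketched below. The function names are made up for illustration, and unlike plain `samtools view` this sketch passes header lines through unchanged.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: read mapped in a proper pair


def is_properly_paired(sam_line):
    """True if the FLAG column (second field) of a SAM record has bit 0x2 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & PROPER_PAIR)


def keep_properly_paired(sam_lines):
    """Yield header lines and the alignment lines that pass the 0x2 filter."""
    for line in sam_lines:
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

A record with flag 99 (paired, proper pair, mate on the reverse strand, first in pair) is kept, while flag 73 (paired, mate unmapped, first in pair) is dropped.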

                          Laurent --

                          • #14
                            When i had this error, i removed the fasta sequences from my gff file (the sequences at the end of gff) and it worked!
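A small Python sketch of that clean-up follows. It assumes the sequences are introduced by a `##FASTA` directive, as in GFF3, or failing that by the first line starting with ">", and that nothing after that point is annotation. The function name is made up for illustration.

```python
def strip_fasta_section(lines):
    """Yield annotation lines, stopping where the embedded FASTA section begins."""
    for line in lines:
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        yield line
```

Used as `out.writelines(strip_fasta_section(open("input.gff")))`, with `out` an open output file, this writes a copy of the annotation without the trailing sequences.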
