Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • meher
    Member
    • Jun 2011
    • 54

    HTSeq 'gene_id' attribute error

    I have just started using HTSeq and found problems using GFF file. I have checked some of the threads in seqanswers but couldn't fix.

    Here is the command and error:

    htseq-count sorted.sam ../ref_CanFam3.1_top_level.gff3
    Error occured in line 9 of file ../ref_CanFam3.1_top_level.gff3.
    Error: Feature id1 does not contain a 'gene_id' attribute
    [Exception type: SystemExit, raised in count.py:55]


    Below are a few lines of my GFF file:


    #gff-version 3
    #!gff-spec-version 1.20
    #!processor NCBI annotwriter
    ##sequence-region chr1 1 122678785
    ##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
    chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris
    chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1
    chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 289674 289746 . - . ID=id2;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 287787 287903 . - . ID=id3;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 286842 286967 . - . ID=id4;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 285904 285964 . - . ID=id5;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2



    Could someone help to fix this?
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Either specify a new value for "-i " or fix the GFF so it has a gene_id field.

    Comment

    • meher
      Member
      • Jun 2011
      • 54

      #3
      Originally posted by dpryan View Post
      Either specify a new value for "-i " or fix the GFF so it has a gene_id field.
      Could you give a sample output line of the gff file showing how it should be modified?

      Thanks.

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Code:
         chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
        Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.

        Comment

        • meher
          Member
          • Jun 2011
          • 54

          #5
          Originally posted by dpryan View Post
          Code:
           chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
          Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.
          I have tried using:
          htseq-count -i gene file.sam file.gff > count.txt

          It processed without any error but the counts are zero.The output looks like:

          A1BG 0
          A1CF 0
          A2M 0
          A2ML1 0
          .
          .
          .
          no_feature 38443217
          ambiguous 0
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 9202815

          I also tried by changing the gff file to the format suggested by you:

          ##gff-version 3
          #!gff-spec-version 1.20
          #!processor NCBI annotwriter
          ##sequence-region chr1 1 122678785
          ##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
          chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris,gene_id=9615
          chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1,gene_id=100856150
          chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2,gene_id=100856150

          The output and error:
          htseq-count sorted.sam ../refGene_canFam3.gff > ref_counts.txt
          Error occured in line 6 of file ../refGene_canFam3.gff.
          Error: need more than 1 value to unpack
          [Exception type: ValueError, raised in __init__.py:220]



          Any further tips to fix this?

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).

            Comment

            • meher
              Member
              • Jun 2011
              • 54

              #7
              Originally posted by dpryan View Post
              Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).
              I have placed semicolon before gene_id and tried htseq-count.

              Error occured in line 6 of file ../refGene_canFam3.gff.
              Error: need more than 1 value to unpack
              [Exception type: ValueError, raised in __init__.py:220


              And infact it gave zerou counts with -i gene. so it means that there should be something else wrong. Could you let me know the commands or steps which you mean to check?
              Last edited by meher; 11-14-2013, 07:01 AM.

              Comment

              • dpryan
                Devon Ryan
                • Jul 2011
                • 3478

                #8
                Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

                The chromosome names are the same between your reference and the annotation file, I hope.

                Comment

                • meher
                  Member
                  • Jun 2011
                  • 54

                  #9
                  Originally posted by dpryan View Post
                  Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

                  The chromosome names are the same between your reference and the annotation file, I hope.

                  Do you mean "NH:i:1"? because "NH:i:0" seems to be absent when it tried

                  samtools view -S accepted_hits.sam | grep "NH:i:0".

                  Comment

                  • dpryan
                    Devon Ryan
                    • Jul 2011
                    • 3478

                    #10
                    Mea culpa, yeah NH:i:1 would be correct :P

                    Comment

                    • meher
                      Member
                      • Jun 2011
                      • 54

                      #11
                      Originally posted by dpryan View Post
                      Mea culpa, yeah NH:i:1 would be correct :P
                      I have done what u have suggested:

                      htseq-count -i gene mini.sam gff3 -o mini_annot.sam

                      So what should be observed from mini_annot.sam?

                      $ wc -l mini.sam
                      10000 mini.sam
                      $ wc -l mini_annot.sam
                      145 mini_annot.sam

                      It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.

                      Any clue out of this?

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those to ensure that they actually do map to something in the gff file.
                        Last edited by dpryan; 11-14-2013, 10:51 AM.

                        Comment

                        • meher
                          Member
                          • Jun 2011
                          • 54

                          #13
                          Originally posted by dpryan View Post
                          I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those in the reference GTF to ensure that they actually do map to something in the gff file.
                          There are a few mapping to the features in gff file. It is weird and no clue.

                          Comment

                          • dpryan
                            Devon Ryan
                            • Jul 2011
                            • 3478

                            #14
                            Maybe take a subset of the gff file where some of those should map and test that. In general, htseq-count works well, so there's likely something off in the annotation somewhere. So start with a small chunk of the annotation such that things work and then just increase the size until it breaks. You will have then found the source of the problem.

                            Alternatively you can always try summarizeOverlaps() in GenomicRanges in R or featureCounts from subread. Perhaps one of those will ignore whatever is breaking htseq-count.

                            Comment

                            • Simon Anders
                              Senior Member
                              • Feb 2010
                              • 995

                              #15
                              Originally posted by meher View Post
                              It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.
                              Just post the SAM file, so we can see what's wrong.

                              Comment

                              Latest Articles

                              Collapse

                              • GATTACAT
                                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by GATTACAT
                                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                07-01-2026, 11:43 AM
                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 07-02-2026, 11:08 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-30-2026, 05:37 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              54 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...