Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Perseo
    Junior Member
    • Jul 2023
    • 2

    Help with htseq -count read counts

    Hello

    I am doing a transcriptome analysis on Pseudomonas putida and I have been trying to do a read count using Htseq -count. The program always give an error. I have tried different genome references (fna) and annotation files (gtf ang gff) but it does not work.

    The mapping works fine it seems but the read count doesnt.

    I have run the script:
    htseq-count -i gene_id -t CDS test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count


    I got the error:
    built-in function delete__Pair_long_obj> returned a result with an error set
    [Exception type: SystemError, raised in _HTSeq_internal.py:43]


    If I include other options of htseq-count like "-s no" or "-m intersection-nonempty" Is the same error.

    If I run only:
    htseq-count test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count


    The same error.

    So it must be the files compatibility? is there something wrong with the Pseudomonas putida gtf file?

    This is an example of a couple of lines in the gtf file:
    AE015451.2 Genbank gene 147 1019 . - . gene_id "PP_0001"; transcript_id ""; gbkey "Gene"; gene "parB"; gene_biotype "protein_coding"; locus_tag "PP_0001";

    AE015451.2 Genbank CDS 150 1019 . - 0 gene_id "PP_0001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "parB"; locus_tag "PP_0001"; note "Evidence 2a : Function of homologous gene experimentally demonstrated in an other organism; PubMedId : 16306995, 17462018, 17729285; Product type pf : putative factor; Biologicalprocesses : Replicate"; product "probable chromosome-partitioning protein"; protein_id "AAN65635.1"; transl_table "11"; exon_number "1";


    This is how the fna file (I used for doing the mapping and creating the sam file) looks like:
    >AE015451.2 Pseudomonas putida KT2440 complete genome
    AACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGG
    GGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTCACGCTACAACAGGAATTGTTTCAGCGGATGTGAG
    CGAGCACGCCTTGCAACTCGTCAAGCGAGTTGTAGCGAATAACCAACTGGCCTTTGCCCTTGTTGCCATGACGGATCTGC
    ACGGCCGAGCCCAGGCGCTCTGCGAGCCGCTGTTCAAGGCGTGCGATATCCGGATCAGGTTTGCTCGGTTCGACCGGATC


    This is how the sam file looks like:
    @HD VN:1.0 SO:unsorted
    @SQ SN:AE015451.2 LN:6181873
    @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq"
    A00604:565:HFWKVDSX7:4:1101:4689:1000 83 AE015451.2 179677 1 150M1S = 174221 -5608 AGGAGGTTGGCTTAGAAGCAGCCACCCTTTAAAGAAAGCGTAATAGCTCACTAGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTCAAACCATGCACCGAAGCTACGGGTATCATCTTATGATGATGCGGTAGAGGAGCGTTCTGTAN FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF# AS:i:-1 ZS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:-1 YT:Z:CP NH:i:5


    What I am missing in my code ? is the id of the genomic feature incorrect?? Are the sam and gtf incompatible?

    Thank you very much for your help.
  • GenomicSeq
    Member
    • Oct 2022
    • 39

    #2
    Hello Perseo

    It looks like there might be an issue with the compatibility of the files, as you suggested, but it's also possible that there might be something wrong with the actual GTF or SAM files.

    HTSeq-count expects the GTF file to have entries for "exon" features which then should be grouped by "gene_id". In your GTF file snippet you've shared, the feature type used is 'gene' and 'CDS', but not 'exon'. This is a common cause for errors as HTSeq-count needs 'exon' features to correctly count reads.

    Another potential issue could be related to the structure of the GTF file. HTSeq-count requires a very specific format, and it might be that your GTF file is not perfectly adhering to this. Make sure that your GTF file follows the correct structure.

    Also, the SAM file should ideally have been sorted and indexed, even though it's not always necessary. It's a good practice to have this done before using it with HTSeq-count. Use samtools to sort and index your SAM file.

    Additionally, I think that the '-i gene_id -t CDS' in your script may cause issues. The '-i' option specifies the attribute to be used as feature ID. 'gene_id' is fine if your GTF file uses that, but '-t' specifies the feature type, which should ideally be 'exon'.

    Here's an example of how your command should look like:

    htseq-count -f sam -i gene_id -r pos -t exon test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count


    I would also recommend you try using a different Python environment, with a version compatible with HTSeq (for example Python 3.7 or 3.8). Sometimes the error you're getting can occur due to a conflict in Python environments.

    Let me know if any of these suggestions help, or if you need further assistance. Good luck!

    Comment

    • Perseo
      Junior Member
      • Jul 2023
      • 2

      #3
      Hello

      Thank you very much for your answer
      • "HTSeq-count expects the GTF file to have entries for "exon" features"

      Well the file does contain exon features in the third column but you can find them in some rows further down:

      AE015451.2 Genbank exon 171389 172910 . + . gene_id "PP_16SA"; transcript_id "unassigned_transcript_168"; locus_tag "PP_16SA"; note "Evidence 1b : Function experimentally demonstrated in the studied species"; product "16S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1";

      AE015451.2 Genbank gene 173203 176104 . + . gene_id "PP_23SA"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "PP_23SA";

      AE015451.2 Genbank transcript 173203 176104 . + . gene_id "PP_23SA"; transcript_id "unassigned_transcript_169"; gbkey "rRNA"; locus_tag "PP_23SA"; product "23S ribosomal RNA"; transcript_biotype "rRNA";

      AE015451.2 Genbank exon 173203 176104 . + . gene_id "PP_23SA"; transcript_id "unassigned_transcript_169"; locus_tag "PP_23SA"; product "23S ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1";

      AE015451.2 Genbank gene 176237 176356 . + . gene_id "PP_5SA"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "PP_5SA";

      AE015451.2 Genbank transcript 176237 176356 . + . gene_id "PP_5SA"; transcript_id "unassigned_transcript_170"; gbkey "rRNA"; locus_tag "PP_5SA"; product "5S ribosomal RNA"; transcript_biotype "rRNA";

      AE015451.2 Genbank exon 176237 176356 . + . gene_id "PP_5SA"; transcript_id "unassigned_transcript_170"; locus_tag "PP_5SA"; product "5S
      • "Make sure that your GTF file follows the correct structure."

      The file looks fine... at least it has all the entries from the example described in the web page: http://mblab.wustl.edu/GTF22.html


      At least by looking seems fine. Is there any deeper way to check if the GFT file is in correct form, a command or script?
      • "the SAM file should ideally have been sorted and indexed"

      I sorted my sam:



      samtools sort -o sorted_testv2.sam test_v2.sam

      I got:

      @HD VN:1.0 SO:coordinate @SQ SN:AE015451.2 LN:6181873 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq" A00604:565:HFWKVDSX7:4:1257:6686:25817 99 AE015451.2 1 60 24S126M = 41 213 GTTTGGCGTGCGTATTTCCATCACAACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGGGGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF


      Doesn´t the indexing works on BAM files only?? Do I need to transform my SAM? should I try it and then running a BAM instead??
      • "but '-t' specifies the feature type, which should ideally be 'exon'."

      I though that -t uses that argument (ie: CDS) and all features of other type are ignored.. Thus, I wanted to run only coding DNA sequences...

      Anyway I have run the code as you suggested:

      htseq-count -f sam -i gene_id -r pos -t exon sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count


      I got the same fricking error:

      <built-in function delete__Pair_long_obj> returned a result with an error set

      [Exception type: SystemError, raised in _HTSeq_internal.py:43]



      I checked my version of python:

      python --version

      Python 3.7.2

      the htseq version is:

      htseq-count --version

      2.0.3


      So the version should be compatible right??

      Any other idea?


      Thank you very much for your time and help


      Kind regards






      Comment

      • GenomicSeq
        Member
        • Oct 2022
        • 39

        #4
        Sorry for the delay! I just noticed you messaged me back.

        It's clear that you've done some extensive troubleshooting, but I'll suggest a few more ideas. I'm not sure if they'll work but I'll suggest them anyway.


        First, I would try converting the SAM to BAM. Even though both SAM and BAM formats are compatible with htseq-count, BAM is more memory efficient. Since your error seems to be related to a memory issue in Python, I'm thinking it might help to convert your SAM file to a sorted and indexed BAM file.

        Code:
        samtools view -S -b test_v2.sam > test_v2.bam
        samtools sort test_v2.bam -o sorted_test_v2.bam
        samtools index sorted_test_v2.bam
        ​

        Then you can use the sorted BAM file in your htseq-count command:

        Code:
        htseq-count -f bam -r pos -t exon -i gene_id sorted_test_v2.bam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count
        ​
        The next thing to do in addition to a visual inspection, is verify the integrity of your GTF file.

        One such tool is `gtfToGenePred` from UCSC, which you can use as follows:

        Code:
        gtfToGenePred ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf stdout | genePredCheck stdin
        ​
        If the GTF file has any structural issues, this command should raise an error.


        From here, I would check compatibility between Python and HTSeq. You are using Python 3.7.2 and HTSeq 2.0.3. This should technically be fine, but sometimes there can be incompatibility issues that aren't immediately obvious.

        You could try creating a new Python environment with a different Python version (for example, 3.8 or 3.9) and install HTSeq in this new environment. For managing Python environments, you can use tools like `conda` or `virtualenv`.

        Code:
        conda create -n htseq_env python=3.8
        conda activate htseq_env
        pip install HTSeq
        ​
        And then, use the htseq-count within this new environment.


        The last thing I'll suggest if you are still experiencing issues with htseq-count, you may consider trying alternative tools such as featureCounts from the Subread package, which serves a similar purpose. It might work better for your case.

        Hope at least one of these suggestions helps! Let me know if you need more assistance.​

        Comment

        Latest Articles

        Collapse

        • SEQadmin2
          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
          by SEQadmin2


          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

          Here are nine questions we think about, in roughly the order they matter, before...
          06-18-2026, 07:11 AM
        • SEQadmin2
          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
          by SEQadmin2


          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
          ...
          06-02-2026, 10:05 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by SEQadmin2, 06-26-2026, 11:10 AM
        0 responses
        15 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-17-2026, 06:09 AM
        0 responses
        49 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-09-2026, 11:58 AM
        0 responses
        107 views
        0 reactions
        Last Post SEQadmin2  
        Started by SEQadmin2, 06-05-2026, 10:09 AM
        0 responses
        125 views
        0 reactions
        Last Post SEQadmin2  
        Working...