Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

Strange error when using htseq-count

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strange error when using htseq-count

    Hi everyone,

    I am using HTSeq 0.4.7 on a Mac OS X with Python 2.7 and Numpy 1.5.1.

    It's my first time using the program, and I am getting a strange error.

    After calling the htseq-counts script using:
    >python -m HTSeq.scripts.count -a 1 polyA_sorted.sam Homo_sapiens.GRCh37.61.gtf

    I get the following error:
    Error occured in line 2196 of file ../../data/htseq_count/polyA_sorted.sam.
    Error:
    [Exception type: AssertionError, raised in __init__.py:576]

    "polyA_sorted.sam" contains paired-end reads mapped using bwa and sorted by read name, and "Homo_sapiens.GRCh37.61.gtf" is the latest GTF file from Ensembl.

    Line 2196 and 2197 of polyA_sorted.sam are:
    DBBK90L1:8:1:10101:43720#0 145 22 50502479 37 97M = 50499957 -2619 TCCTGCGGGCCGTTCTGGGTGTCCCAGGATGCACCCTGCAGCCTTGCACTGACCTTGAAGCGCACGCACTGGATGGCGGTGCCCGTGTTGAGGCCGG BBBBBBBBBBBBBBBB`bccX[b^bcecd`debZcddead_gefdedbf_daeZddgbdegbg`eedbdgggcegebbgggeggggggggggggggg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:97

    DBBK90L1:8:1:10101:43720#0 97 22 50499957 37 97M = 50502479 2619 GTTTCCATGCCTGGGGCCAGGCTGGGCGCTGCCACCCGGTTTCCGCGTCTGGGGGTCACTGGGCCATTTGCACCACGACGGCTCTCCAGGCTTTCTC gggeggggggggggggggfggggggfggfggggadgggceeeeegggdedacbe`Gbb^`X^bZcbdddd^ggdgeg`gddeeedbdc_V`M```ca XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10T86

    I'm thoroughly confused about what is causing this error, and I would appreciate any help with this problem. Thank you!

    Cheers,
    Sandy

  • #2
    Hi Sandy

    thanks for the precise bug report. This seems to be an internal error from the function 'pair_SAM_alignment'. I cannot reproduce it from the two lines you posted but as this function looks at several adjacent lines together, it could be one of the earlier lines.

    Could you please send me an excerpt from your SAM file of the part surrounding the reported line, let's say, lines 2000 to 2400 or so, by e-mail to anders(at)embl(dot)de, so that I can investigate. Thanks!

    Simon

    Comment


    • #3
      The bug is now fixed, in version 0.4.7p2.

      The actual error turned out to be caused by the pair of lines one further up; two SAM lines describing a mate pair, with flag values 103 and 147. 147 implies that both read and mate are mapped (bits 2 and 3 cleared), while 103 contradicts this (bit 2 set). HTSeq's pair_SAM_alignment function has code to detect and handle such inconsistencies, but this particular case was handled incorrectly. Now, a warning is issued, without aborting the program.

      Which program produced the SAM file, by the way?

      Comment


      • #4
        Hi Simon,
        I had a problem with HT-seq count too not long ago and I posted to an old thread; http://seqanswers.com/forums/showthr...?t=4805&page=5
        I wrote my own Perl (not so elegant) work around but would prefer to have a HTSEQ-COUNT view of the world too.

        Thanks for any advice on the problem.

        Kind regards,

        John.

        Comment


        • #5
          Hello,

          I get a very similar error in htseq-count (ubuntu 10.10 x64, Python 2.6.6, HTseq-0.5.3p7):

          htseq-count -mintersection-nonempty -s no -t mRNA -i ID accepted_hits.sam Asterochloris_gene_features.gff3 > Alga_T0.txt

          throws an error message shortly after it started:

          100000 GFF lines processed.
          193583 GFF lines processed.
          Error occured in line 2 of file accepted_hits.sam.
          Error:
          [Exception type: AssertionError, raised in __init__.py:599]

          I also tried the run with -t exon -i Parent (or ID) and get the same error.

          My sam file contains single end reads mapped to a genome with tophat, the gff3 file is validated.

          These are the first 3 lines of the sam-file:

          HWI-EAS279_0362:1:55:19120:13207#0 0 scaffold_00001 111 50 34M * 0 0 CTGTTGAGGCACGCTGCAGGGCCTGCTCGCTCTG CEFAGDDDHHHHHFFFFGHH>[email protected] AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:34 NM:i:0 NH:i:1
          HWI-EAS279_0362:1:96:4492:2513#0 0 scaffold_00001 131 50 34M * 0 0 GCCTGCTCGCTCTGCCGCACTGCATAGCCTCATG IIIHIIIIIIGGBIIIHIIIBIIIHIIIIIIFII AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:34 NM:i:0 NH:i:1
          HWI-EAS279_0362:1:90:14208:12646#0 0 scaffold_00001 137 50 34M * 0 0 TCGCTCTGCCGCACTGCATGGCCTCATGCTTGCT IIIG7GGGIIIIIIIIIIG,GGGGGGGGIIIIHI AS:i:-3 XM:i:1 XO:i:0 XG:i:0 MD:Z:19A14 NM:i:1 NH:i:1

          And a few from the gff3:

          ##gff-version 3
          scaffold_00001 . contig 1 261971 . . . ID=scaffold_00001;Name=scaffold_00001
          scaffold_00001 maker gene 4175 14929 . - . ID=Aster-x0418_g;Name=Aster-x0418_g;Alias=genemark-scaffold_00001-abinit-gene-0.18
          scaffold_00001 maker mRNA 4175 14929 . - . ID=Aster-x0418;Parent=Aster-x0418_g;Name=Aster-x0418;Alias=genemark-scaffold_00001-abinit-gene-0.18-mRNA-1
          scaffold_00001 maker exon 4175 4381 . - . ID=genemark-scaffold_00001-abinit-gene-0.18:exon:102734;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:exon:102734
          scaffold_00001 maker CDS 4175 4381 . - 0 ID=genemark-scaffold_00001-abinit-gene-0.18:CDS:102734;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:CDS:102734
          scaffold_00001 maker exon 4553 4785 . - . ID=genemark-scaffold_00001-abinit-gene-0.18:exon:102733;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:exon:102733
          scaffold_00001 maker CDS 4553 4785 . - 2 ID=genemark-scaffold_00001-abinit-gene-0.18:CDS:102733;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:CDS:102733

          I'd appreciate any help.
          Thanks.

          Comment


          • #6
            I am having the same problem. I am using bam files producted with TopHat 2.0.3. When I try to convert to counts with HTSeq v0.5.3p7 I get the following error

            Code:
            600000 GFF lines processed.
            700000 GFF lines processed.
            800000 GFF lines processed.
            803778 GFF lines processed.
            Error occured in line 30 of file accepted_hits.sam.
            Error:
            [Exception type: AssertionError, raised in __init__.py:599]
            Here is the .sam file

            Code:
            @HD     VN:1.0  SO:coordinate
            @RG     ID:NG-6113_R26_lib11656_sequence        SM:NG-6113_R26_lib11656_sequence.fastq
            @SQ     SN:chr1 LN:249250621
            @SQ     SN:chr10        LN:135534747
            @SQ     SN:chr11        LN:135006516
            @SQ     SN:chr12        LN:133851895
            @SQ     SN:chr13        LN:115169878
            @SQ     SN:chr14        LN:107349540
            @SQ     SN:chr15        LN:102531392
            @SQ     SN:chr16        LN:90354753
            @SQ     SN:chr17        LN:81195210
            @SQ     SN:chr18        LN:78077248
            @SQ     SN:chr19        LN:59128983
            @SQ     SN:chr2 LN:243199373
            @SQ     SN:chr20        LN:63025520
            @SQ     SN:chr21        LN:48129895
            @SQ     SN:chr22        LN:51304566
            @SQ     SN:chr3 LN:198022430
            @SQ     SN:chr4 LN:191154276
            @SQ     SN:chr5 LN:180915260
            @SQ     SN:chr6 LN:171115067
            @SQ     SN:chr7 LN:159138663
            @SQ     SN:chr8 LN:146364022
            @SQ     SN:chr9 LN:141213431
            @SQ     SN:chrM LN:16571
            @SQ     SN:chrX LN:155270560
            @SQ     SN:chrY LN:59373566
            @PG     ID:TopHat       VN:2.0.3        CL:/home/support/apps/cports/rhel-6.x86_64/gnu/tophat/2.0.3/bin/tophat -p 8 -g 1 --rg-id NG-6113_R26_lib11656_sequenc
            e --rg-sample NG-6113_R26_lib11656_sequence.fastq -o R26_multihit_genome_1_tophat_2 -G /home/shared/hg19/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf /h
            ome/shared/hg19/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/hg19 R26.fastq
            HWI-ST486:270:C0NKUACXX:1:2106:9562:73782       16      chr1    11902   50      51M     *       0       0       
            TCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCTGTGTCT  
            [email protected]
            AS:i:-10        XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:44C5C0     YT:Z:UU NH:i:1  RG:Z: NG-6113_R26_lib11656_sequence
            HWI-ST486:270:C0NKUACXX:1:1206:2572:60266       0       chr1    12561   50      16M1D3M3D32M    *       0       0       
            AGCAGCTGGTGATGTGGAGACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTG  
            @@[email protected]?     
            AS:i:-27        XN:i:0  XM:i:1  XO:i:2  XG:i:4  NM:i:5  MD:Z:16^T1G1^CCC32      YT:Z:
            UU      NH:i:1  RG:Z:NG-6113_R26_lib11656_sequence
            Has anyone else had these problems? I wonder if it's a problem with the latest version of HTSeq count or could it be my .sam file?

            Comment


            • #7
              I'm also getting the same error as the 2 previous.

              Code:
              Error occured in line 47 of file /galaxy-galaxy-dist-17d57db9a7c0/database/files/000/dataset_139.dat.
              Error:
              [Exception type: AssertionError, raised in __init__.py:599]

              Here is the line it complains about. There doesn't appear to be an obvious error.


              Code:
              chr1:320162-328581W:ENST00000432964:10:575:19:326:      0       chr1    320181  3       100M    *       0       0       TGCCCAGGCTGGCCACAAATTCCTGGGCTCAAGTGATCCTCCCACCTCGTCCTTGTAGAGATGAGATTTAGTTACGTCGTCCAGGCTGATCTCAAACTCC ___eeeeegggggiiiiiiiiiiiiigfZcegg`cH_U^cggiiiiiiiiihihihiifhhfhggggfhiiiiiihhihihhdeeeea_``]b_bcc_^^ CC:Z:chr5       NH:i:2  HI:i:0  NM:i:0  CP:i:180746796

              Comment


              • #8
                A new version of HTSeq (HTSeq-0.5.3p9) has been relased as of today. I will check and see if the above problem occurs with the new version.

                Comment


                • #9
                  Yes, it should. Sorry about the rapid patches of last week -- it seems I need more sleep: it took four tries to get things right. At least I hope it's correct now.

                  Comment


                  • #10
                    HTSeq-count (HTSeq-0.5.3p9) works fine with my .sam files from Tophat 2.0.3

                    Thanks for all your hard work Simon. It's much appreciated.

                    Comment


                    • #11
                      Hello,

                      samtools sort -n accepted_hits.bam accepted_hits_sorted

                      samtools view -h accepted_hits_sorted.bam > accepted_hits_sorted.sam

                      htseq-count -m intersection-strict --stranded=no -t exon -i gene_id accepted_hits_sorted.sam /genomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > sample_counts.txt


                      I used this genes.gtf file for tophat alignment.

                      When I run htseq-count, It gives this error:

                      Warning: Skipping read 'HWI-ST222:159:d120kacxx:4:1101:6254:154403', because chromosome 'chrM', to which it has been aligned, did not appear in the GFF file.
                      Warning: Skipping read 'HWI-ST222:159:d120kacxx:4:1101:6254:165844', because chromosome 'chrM', to which it has been aligned, did not appear in the GFF file.
                      ....so on

                      my hg19 genes.gtf file:

                      [Genes]$ cat genes.gtf | head
                      chr1 unknown exon 14362 14829 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 14970 15038 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 15796 15947 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 16607 16765 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 16858 17055 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";


                      my sorted sam file:

                      [tophat_novel_out]$ tail -n 10 accepted_hits_sorted.sam
                      HWI-ST222:159:d120kacxx:4:2308:21368:162731 89 chr10 45938905 255 46M * 0 0 AGGTGGTGGAGGAGGACCCGGAGCTGCGGGACTTCGTGAACGATGT =CDDCDDCDDDECC?7)IIEDBDDIEEIFFEEEDAEEDA2=DDBD= NM:i:1 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21368:186258 89 chr6 146119264 255 46M * 0 0 GTTGGAATTTGTCAAACAAGTATTAAAATTTTATTTAAATGAACGT IIIIHFD;@IHEHHG?9HIHGIIHIHFF<A<32322IHF>HFDDDD NM:i:0 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21371:39723 89 chrM 2709 255 46M * 0 0 ACACAGCAAGACGAGAAGACACTATGGAGCTTTAATTTATTAATGC F?00*HDEGFFGIGHFA2EAFA,[email protected] NM:i:1 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21373:38424 73 chr11 10529444 3 46M * 0 0 GGACTCTAGAATAG


                      HOW CAN I FIX THIS PROBLEM?

                      Thank you so much

                      Comment


                      • #12
                        So, does 'chrM' appear in your GTF file, or does it not?

                        Comment


                        • #13
                          Running the following command with HTseq-count (v.0.5.3p7) gives me the same error:

                          @@@@@@@@@@@@@@@@
                          home$ htseq-count Spore_hits.sam Downloads/magnaporthe_oryzae_70-15_8_transcripts.gff3 -i Parent
                          100000 GFF lines processed.
                          137630 GFF lines processed.
                          Error occured in line 2 of file Spore_hits.sam.
                          Error:
                          [Exception type: AssertionError, raised in __init__.py:599]
                          @@@@@@@@@@@@@@@@

                          The .sam file was generated using TopHat v2.0.0 and the first two lines are as follows:

                          9VH2F:01972:01968 272 supercont8.1 8 0 172M * 0 0 GGCGGTCCTGCGGCATCTTTGTACTTCTTGTGGAAGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGACGCTAAAACTTTATCGA * AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:172 YT:Z:UU NH:i:11 CC:Z:supercont8.2 CP:i:6378979 HI:i:0
                          9VH2F:01867:01608 272 supercont8.1 10 0 153M * 0 0 CGGTCCTGCGGCATCTTTGTACTTCTTGTGGAGGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGA * AS:i:-4 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32A120 YT:Z:UU NH:i:11 CC:Z:supercont8.2 CP:i:6378981 HI:i:0

                          Any ideas? Thanks.

                          Comment


                          • #14
                            Please use v0.5.3p9, not v0.5.3p7.

                            Comment

                            Working...
                            X