Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strange error when using htseq-count

    Hi everyone,

    I am using HTSeq 0.4.7 on a Mac OS X with Python 2.7 and Numpy 1.5.1.

    It's my first time using the program, and I am getting a strange error.

    After calling the htseq-counts script using:
    >python -m HTSeq.scripts.count -a 1 polyA_sorted.sam Homo_sapiens.GRCh37.61.gtf

    I get the following error:
    Error occured in line 2196 of file ../../data/htseq_count/polyA_sorted.sam.
    Error:
    [Exception type: AssertionError, raised in __init__.py:576]

    "polyA_sorted.sam" contains paired-end reads mapped using bwa and sorted by read name, and "Homo_sapiens.GRCh37.61.gtf" is the latest GTF file from Ensembl.

    Line 2196 and 2197 of polyA_sorted.sam are:
    DBBK90L1:8:1:10101:43720#0 145 22 50502479 37 97M = 50499957 -2619 TCCTGCGGGCCGTTCTGGGTGTCCCAGGATGCACCCTGCAGCCTTGCACTGACCTTGAAGCGCACGCACTGGATGGCGGTGCCCGTGTTGAGGCCGG BBBBBBBBBBBBBBBB`bccX[b^bcecd`debZcddead_gefdedbf_daeZddgbdegbg`eedbdgggcegebbgggeggggggggggggggg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:97

    DBBK90L1:8:1:10101:43720#0 97 22 50499957 37 97M = 50502479 2619 GTTTCCATGCCTGGGGCCAGGCTGGGCGCTGCCACCCGGTTTCCGCGTCTGGGGGTCACTGGGCCATTTGCACCACGACGGCTCTCCAGGCTTTCTC gggeggggggggggggggfggggggfggfggggadgggceeeeegggdedacbe`Gbb^`X^bZcbdddd^ggdgeg`gddeeedbdc_V`M```ca XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10T86

    I'm thoroughly confused about what is causing this error, and I would appreciate any help with this problem. Thank you!

    Cheers,
    Sandy

  • #2
    Hi Sandy

    thanks for the precise bug report. This seems to be an internal error from the function 'pair_SAM_alignment'. I cannot reproduce it from the two lines you posted but as this function looks at several adjacent lines together, it could be one of the earlier lines.

    Could you please send me an excerpt from your SAM file of the part surrounding the reported line, let's say, lines 2000 to 2400 or so, by e-mail to anders(at)embl(dot)de, so that I can investigate. Thanks!

    Simon

    Comment


    • #3
      The bug is now fixed, in version 0.4.7p2.

      The actual error turned out to be caused by the pair of lines one further up; two SAM lines describing a mate pair, with flag values 103 and 147. 147 implies that both read and mate are mapped (bits 2 and 3 cleared), while 103 contradicts this (bit 2 set). HTSeq's pair_SAM_alignment function has code to detect and handle such inconsistencies, but this particular case was handled incorrectly. Now, a warning is issued, without aborting the program.

      Which program produced the SAM file, by the way?

      Comment


      • #4
        Hi Simon,
        I had a problem with HT-seq count too not long ago and I posted to an old thread; http://seqanswers.com/forums/showthr...?t=4805&page=5
        I wrote my own Perl (not so elegant) work around but would prefer to have a HTSEQ-COUNT view of the world too.

        Thanks for any advice on the problem.

        Kind regards,

        John.

        Comment


        • #5
          Hello,

          I get a very similar error in htseq-count (ubuntu 10.10 x64, Python 2.6.6, HTseq-0.5.3p7):

          htseq-count -mintersection-nonempty -s no -t mRNA -i ID accepted_hits.sam Asterochloris_gene_features.gff3 > Alga_T0.txt

          throws an error message shortly after it started:

          100000 GFF lines processed.
          193583 GFF lines processed.
          Error occured in line 2 of file accepted_hits.sam.
          Error:
          [Exception type: AssertionError, raised in __init__.py:599]

          I also tried the run with -t exon -i Parent (or ID) and get the same error.

          My sam file contains single end reads mapped to a genome with tophat, the gff3 file is validated.

          These are the first 3 lines of the sam-file:

          HWI-EAS279_0362:1:55:19120:13207#0 0 scaffold_00001 111 50 34M * 0 0 CTGTTGAGGCACGCTGCAGGGCCTGCTCGCTCTG CEFAGDDDHHHHHFFFFGHH>HFHGHDHDD@BGH AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:34 NM:i:0 NH:i:1
          HWI-EAS279_0362:1:96:4492:2513#0 0 scaffold_00001 131 50 34M * 0 0 GCCTGCTCGCTCTGCCGCACTGCATAGCCTCATG IIIHIIIIIIGGBIIIHIIIBIIIHIIIIIIFII AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:34 NM:i:0 NH:i:1
          HWI-EAS279_0362:1:90:14208:12646#0 0 scaffold_00001 137 50 34M * 0 0 TCGCTCTGCCGCACTGCATGGCCTCATGCTTGCT IIIG7GGGIIIIIIIIIIG,GGGGGGGGIIIIHI AS:i:-3 XM:i:1 XO:i:0 XG:i:0 MD:Z:19A14 NM:i:1 NH:i:1

          And a few from the gff3:

          ##gff-version 3
          scaffold_00001 . contig 1 261971 . . . ID=scaffold_00001;Name=scaffold_00001
          scaffold_00001 maker gene 4175 14929 . - . ID=Aster-x0418_g;Name=Aster-x0418_g;Alias=genemark-scaffold_00001-abinit-gene-0.18
          scaffold_00001 maker mRNA 4175 14929 . - . ID=Aster-x0418;Parent=Aster-x0418_g;Name=Aster-x0418;Alias=genemark-scaffold_00001-abinit-gene-0.18-mRNA-1
          scaffold_00001 maker exon 4175 4381 . - . ID=genemark-scaffold_00001-abinit-gene-0.18:exon:102734;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:exon:102734
          scaffold_00001 maker CDS 4175 4381 . - 0 ID=genemark-scaffold_00001-abinit-gene-0.18:CDS:102734;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:CDS:102734
          scaffold_00001 maker exon 4553 4785 . - . ID=genemark-scaffold_00001-abinit-gene-0.18:exon:102733;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:exon:102733
          scaffold_00001 maker CDS 4553 4785 . - 2 ID=genemark-scaffold_00001-abinit-gene-0.18:CDS:102733;Parent=Aster-x0418;Name=genemark-scaffold_00001-abinit-gene-0.18:CDS:102733

          I'd appreciate any help.
          Thanks.

          Comment


          • #6
            I am having the same problem. I am using bam files producted with TopHat 2.0.3. When I try to convert to counts with HTSeq v0.5.3p7 I get the following error

            Code:
            600000 GFF lines processed.
            700000 GFF lines processed.
            800000 GFF lines processed.
            803778 GFF lines processed.
            Error occured in line 30 of file accepted_hits.sam.
            Error:
            [Exception type: AssertionError, raised in __init__.py:599]
            Here is the .sam file

            Code:
            @HD     VN:1.0  SO:coordinate
            @RG     ID:NG-6113_R26_lib11656_sequence        SM:NG-6113_R26_lib11656_sequence.fastq
            @SQ     SN:chr1 LN:249250621
            @SQ     SN:chr10        LN:135534747
            @SQ     SN:chr11        LN:135006516
            @SQ     SN:chr12        LN:133851895
            @SQ     SN:chr13        LN:115169878
            @SQ     SN:chr14        LN:107349540
            @SQ     SN:chr15        LN:102531392
            @SQ     SN:chr16        LN:90354753
            @SQ     SN:chr17        LN:81195210
            @SQ     SN:chr18        LN:78077248
            @SQ     SN:chr19        LN:59128983
            @SQ     SN:chr2 LN:243199373
            @SQ     SN:chr20        LN:63025520
            @SQ     SN:chr21        LN:48129895
            @SQ     SN:chr22        LN:51304566
            @SQ     SN:chr3 LN:198022430
            @SQ     SN:chr4 LN:191154276
            @SQ     SN:chr5 LN:180915260
            @SQ     SN:chr6 LN:171115067
            @SQ     SN:chr7 LN:159138663
            @SQ     SN:chr8 LN:146364022
            @SQ     SN:chr9 LN:141213431
            @SQ     SN:chrM LN:16571
            @SQ     SN:chrX LN:155270560
            @SQ     SN:chrY LN:59373566
            @PG     ID:TopHat       VN:2.0.3        CL:/home/support/apps/cports/rhel-6.x86_64/gnu/tophat/2.0.3/bin/tophat -p 8 -g 1 --rg-id NG-6113_R26_lib11656_sequenc
            e --rg-sample NG-6113_R26_lib11656_sequence.fastq -o R26_multihit_genome_1_tophat_2 -G /home/shared/hg19/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf /h
            ome/shared/hg19/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/hg19 R26.fastq
            HWI-ST486:270:C0NKUACXX:1:2106:9562:73782       16      chr1    11902   50      51M     *       0       0       
            TCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCTGTGTCT  
            IIIIIIGIIIIIIIIIIIIIIGGIFIIIIIGGIHEHBHHFBAGFFFFFCC@
            AS:i:-10        XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:44C5C0     YT:Z:UU NH:i:1  RG:Z: NG-6113_R26_lib11656_sequence
            HWI-ST486:270:C0NKUACXX:1:1206:2572:60266       0       chr1    12561   50      16M1D3M3D32M    *       0       0       
            AGCAGCTGGTGATGTGGAGACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTG  
            @@CFFFFFHFHBHIEGIGJJIIJJJIJJIGGHGII@FHIIIIIJJHAEE7?     
            AS:i:-27        XN:i:0  XM:i:1  XO:i:2  XG:i:4  NM:i:5  MD:Z:16^T1G1^CCC32      YT:Z:
            UU      NH:i:1  RG:Z:NG-6113_R26_lib11656_sequence
            Has anyone else had these problems? I wonder if it's a problem with the latest version of HTSeq count or could it be my .sam file?

            Comment


            • #7
              I'm also getting the same error as the 2 previous.

              Code:
              Error occured in line 47 of file /galaxy-galaxy-dist-17d57db9a7c0/database/files/000/dataset_139.dat.
              Error:
              [Exception type: AssertionError, raised in __init__.py:599]

              Here is the line it complains about. There doesn't appear to be an obvious error.


              Code:
              chr1:320162-328581W:ENST00000432964:10:575:19:326:      0       chr1    320181  3       100M    *       0       0       TGCCCAGGCTGGCCACAAATTCCTGGGCTCAAGTGATCCTCCCACCTCGTCCTTGTAGAGATGAGATTTAGTTACGTCGTCCAGGCTGATCTCAAACTCC ___eeeeegggggiiiiiiiiiiiiigfZcegg`cH_U^cggiiiiiiiiihihihiifhhfhggggfhiiiiiihhihihhdeeeea_``]b_bcc_^^ CC:Z:chr5       NH:i:2  HI:i:0  NM:i:0  CP:i:180746796

              Comment


              • #8
                A new version of HTSeq (HTSeq-0.5.3p9) has been relased as of today. I will check and see if the above problem occurs with the new version.

                Comment


                • #9
                  Yes, it should. Sorry about the rapid patches of last week -- it seems I need more sleep: it took four tries to get things right. At least I hope it's correct now.

                  Comment


                  • #10
                    HTSeq-count (HTSeq-0.5.3p9) works fine with my .sam files from Tophat 2.0.3

                    Thanks for all your hard work Simon. It's much appreciated.

                    Comment


                    • #11
                      Hello,

                      samtools sort -n accepted_hits.bam accepted_hits_sorted

                      samtools view -h accepted_hits_sorted.bam > accepted_hits_sorted.sam

                      htseq-count -m intersection-strict --stranded=no -t exon -i gene_id accepted_hits_sorted.sam /genomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > sample_counts.txt


                      I used this genes.gtf file for tophat alignment.

                      When I run htseq-count, It gives this error:

                      Warning: Skipping read 'HWI-ST222:159:d120kacxx:4:1101:6254:154403', because chromosome 'chrM', to which it has been aligned, did not appear in the GFF file.
                      Warning: Skipping read 'HWI-ST222:159:d120kacxx:4:1101:6254:165844', because chromosome 'chrM', to which it has been aligned, did not appear in the GFF file.
                      ....so on

                      my hg19 genes.gtf file:

                      [Genes]$ cat genes.gtf | head
                      chr1 unknown exon 14362 14829 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 14970 15038 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 15796 15947 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 16607 16765 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
                      chr1 unknown exon 16858 17055 . - . gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";


                      my sorted sam file:

                      [tophat_novel_out]$ tail -n 10 accepted_hits_sorted.sam
                      HWI-ST222:159:d120kacxx:4:2308:21368:162731 89 chr10 45938905 255 46M * 0 0 AGGTGGTGGAGGAGGACCCGGAGCTGCGGGACTTCGTGAACGATGT =CDDCDDCDDDECC?7)IIEDBDDIEEIFFEEEDAEEDA2=DDBD= NM:i:1 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21368:186258 89 chr6 146119264 255 46M * 0 0 GTTGGAATTTGTCAAACAAGTATTAAAATTTTATTTAAATGAACGT IIIIHFD;@IHEHHG?9HIHGIIHIHFF<A<32322IHF>HFDDDD NM:i:0 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21371:39723 89 chrM 2709 255 46M * 0 0 ACACAGCAAGACGAGAAGACACTATGGAGCTTTAATTTATTAATGC F?00*HDEGFFGIGHFA2EAFA,DGHF@HIGIIIIGHFDADFBBAA NM:i:1 NH:i:1
                      HWI-ST222:159:d120kacxx:4:2308:21373:38424 73 chr11 10529444 3 46M * 0 0 GGACTCTAGAATAG


                      HOW CAN I FIX THIS PROBLEM?

                      Thank you so much

                      Comment


                      • #12
                        So, does 'chrM' appear in your GTF file, or does it not?

                        Comment


                        • #13
                          Running the following command with HTseq-count (v.0.5.3p7) gives me the same error:

                          @@@@@@@@@@@@@@@@
                          home$ htseq-count Spore_hits.sam Downloads/magnaporthe_oryzae_70-15_8_transcripts.gff3 -i Parent
                          100000 GFF lines processed.
                          137630 GFF lines processed.
                          Error occured in line 2 of file Spore_hits.sam.
                          Error:
                          [Exception type: AssertionError, raised in __init__.py:599]
                          @@@@@@@@@@@@@@@@

                          The .sam file was generated using TopHat v2.0.0 and the first two lines are as follows:

                          9VH2F:01972:01968 272 supercont8.1 8 0 172M * 0 0 GGCGGTCCTGCGGCATCTTTGTACTTCTTGTGGAAGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGACGCTAAAACTTTATCGA * AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:172 YT:Z:UU NH:i:11 CC:Z:supercont8.2 CP:i:6378979 HI:i:0
                          9VH2F:01867:01608 272 supercont8.1 10 0 153M * 0 0 CGGTCCTGCGGCATCTTTGTACTTCTTGTGGAGGTCGTCAAGGGCTGTCGCTGAATTCTTGAAGTTTTCAGCCGGGTACCACGTGTCGTCTGGATCACATCCTTGCCATGCGACCTGGTATTGCAATGTCTTACTCCGGCCAAATAATCGGGA * AS:i:-4 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32A120 YT:Z:UU NH:i:11 CC:Z:supercont8.2 CP:i:6378981 HI:i:0

                          Any ideas? Thanks.

                          Comment


                          • #14
                            Please use v0.5.3p9, not v0.5.3p7.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            9 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            51 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            67 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X