No announcement yet.

Htseq-count Vs CountOverlap function in IRanges

  • Filter
  • Time
  • Show
Clear All
new posts

  • Htseq-count Vs CountOverlap function in IRanges

    Hi All,
    I computed the raw read counts for one of my SAM files using htdeq-count program as well as using 'countOverlaps' function in IRanges package in R.

    CountOverlaps function gives me the read counts for 23123 gene ids.
    Output of htseq gives read counts for 35886 refseq ids. (NM as well as NR ids).
    After matching the common genes (from CountOverlaps output), I got 32222 genes total from htseq-counts function. I observed that the gene ids were not unique. (21423 unique gene ids.)

    The computations were comparable if not same for some of the genes. However, for significant portion of the genes, the raw read counts did not match between the two outputs.
    What result is more accurate? Should I sum up the count for same gene id (for htseq-count)?
    Can someone please help? I have been stuck with this issue for days now...
    thanks a lot.

  • #2
    htseq-count gives you one count for each gene ID. I cannot imagine how you could have managed to get more than one count value per ID. (I can, however, see, how one could use countOverlaps in a way that gives multiple count values per ID.) Please give more details on what you did.

    Also, both methods should give you a perfectly accurate result; the difference is because they use different rules for special cases. You will have to read the rules and figure out what is more appropriate for your use case.


    • #3
      thanks for replying.
      I gave the following command.
      python -m HTSeq.scripts.count ./tophat_out_SRR037447/accepted_hits.sam ./genomes/hg19RefGene.gtf -a 1 -i gene_id -o seqres37447filter > seq37447filter

      Oh BTW --minaqual option is not making any difference in the results. Maybe I am specifying it wrong?

      this was the outout:
      NM_000014 2234
      NM_000015 0
      NM_000016 0
      NM_000017 125
      NM_000018 15
      NM_000019 246
      NM_000020 0
      NM_000021 0
      NM_000022 489
      NM_000023 2

      I then mapped the NM numbers to gene IDs and observed multiple NM numbers for the same gene id. e.g.
      NM_005465 34
      NM_181690 2

      I do understand that it is normal. My question is should I add up the counts for that gene ID?
      Could you please help?
      Last edited by biofreak; 06-29-2011, 09:05 AM.


      • #4
        You did not use a proper GTF file in Ensembl GTF format. In a proper GTF file, each line describing an exon has an attribute called 'gene_id', which gives the gene ID. All exons from the same gene (no matter which transcripts) must have the same gene ID.

        The idea of GTF files is that the information is on three levels. A gene (given by its gene ID) has several transcripts (with the same gene ID but different transcript IDs) , each of which has several exon lines. The UCSC table browser, for example, produces GTF file in which the gene ID is always the same as the transcript ID, i.e., it does not show which transcripts belong to the same gene. Obviously, this is not useful, and hence, htseq-count won't work with this.

        What you need to do is to replace, in your GTF file, the NM_number after 'gene ID' by the correct gene ID (which should have been there from the beginning, of course). Or you use a GTF file from Ensembl, which has proper gene IDs.

        You can use HTSeq to do this (if you know some Python).


        • #5
          Thanks a lot Simon.
          I downloaded the gtf file from ensemble and ran the program again. It however gives me the following error for all the reads.
          Skipping read 'SRR037447.3320388', because chromosome 'chr20', to which it has been aligned, did not appear in the GFF file.

          Do I need to make any changes to the SAM file?

          I am trying to replcace NM numbers with gene ids in my previous gtf file.
          thanks a lot.


          • #6
            could be that the chromsome is called 'chr20' in your sam file and just '20' in the GTF file.