Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Htseq-count Vs CountOverlap function in IRanges

    Hi All,
    I computed the raw read counts for one of my SAM files using htdeq-count program as well as using 'countOverlaps' function in IRanges package in R.

    CountOverlaps function gives me the read counts for 23123 gene ids.
    Output of htseq gives read counts for 35886 refseq ids. (NM as well as NR ids).
    After matching the common genes (from CountOverlaps output), I got 32222 genes total from htseq-counts function. I observed that the gene ids were not unique. (21423 unique gene ids.)

    The computations were comparable if not same for some of the genes. However, for significant portion of the genes, the raw read counts did not match between the two outputs.
    What result is more accurate? Should I sum up the count for same gene id (for htseq-count)?
    Can someone please help? I have been stuck with this issue for days now...
    thanks a lot.

  • #2
    htseq-count gives you one count for each gene ID. I cannot imagine how you could have managed to get more than one count value per ID. (I can, however, see, how one could use countOverlaps in a way that gives multiple count values per ID.) Please give more details on what you did.

    Also, both methods should give you a perfectly accurate result; the difference is because they use different rules for special cases. You will have to read the rules and figure out what is more appropriate for your use case.

    Comment


    • #3
      thanks for replying.
      I gave the following command.
      python -m HTSeq.scripts.count ./tophat_out_SRR037447/accepted_hits.sam ./genomes/hg19RefGene.gtf -a 1 -i gene_id -o seqres37447filter > seq37447filter

      Oh BTW --minaqual option is not making any difference in the results. Maybe I am specifying it wrong?

      this was the outout:
      NM_000014 2234
      NM_000015 0
      NM_000016 0
      NM_000017 125
      NM_000018 15
      NM_000019 246
      NM_000020 0
      NM_000021 0
      NM_000022 489
      NM_000023 2

      I then mapped the NM numbers to gene IDs and observed multiple NM numbers for the same gene id. e.g.
      NM_005465 34
      NM_181690 2

      I do understand that it is normal. My question is should I add up the counts for that gene ID?
      Could you please help?
      Last edited by biofreak; 06-29-2011, 09:05 AM.

      Comment


      • #4
        You did not use a proper GTF file in Ensembl GTF format. In a proper GTF file, each line describing an exon has an attribute called 'gene_id', which gives the gene ID. All exons from the same gene (no matter which transcripts) must have the same gene ID.

        The idea of GTF files is that the information is on three levels. A gene (given by its gene ID) has several transcripts (with the same gene ID but different transcript IDs) , each of which has several exon lines. The UCSC table browser, for example, produces GTF file in which the gene ID is always the same as the transcript ID, i.e., it does not show which transcripts belong to the same gene. Obviously, this is not useful, and hence, htseq-count won't work with this.

        What you need to do is to replace, in your GTF file, the NM_number after 'gene ID' by the correct gene ID (which should have been there from the beginning, of course). Or you use a GTF file from Ensembl, which has proper gene IDs.

        You can use HTSeq to do this (if you know some Python).

        Comment


        • #5
          Thanks a lot Simon.
          I downloaded the gtf file from ensemble and ran the program again. It however gives me the following error for all the reads.
          Skipping read 'SRR037447.3320388', because chromosome 'chr20', to which it has been aligned, did not appear in the GFF file.

          Do I need to make any changes to the SAM file?

          I am trying to replcace NM numbers with gene ids in my previous gtf file.
          thanks a lot.

          Comment


          • #6
            could be that the chromsome is called 'chr20' in your sam file and just '20' in the GTF file.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 03-27-2024, 06:37 PM
            0 responses
            12 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-27-2024, 06:07 PM
            0 responses
            11 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            53 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            69 views
            0 likes
            Last Post seqadmin  
            Working...
            X