Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • difference between htseq-count and bedtools multicov

    Hi,

    Anyone have used htseq-count and bedtools multicov, I use both them to get the read counts for ref genes. But I got very different read counts between them. Which one is more common..

    bedtools htseq-count
    uc001aaa.3 8 0
    uc001aac.4 2309 0
    uc001aae.4 753 0
    uc001aah.4 2309 0
    uc001aai.1 586 0
    uc001aak.3 0 0
    uc001aal.1 0 0
    uc001aam.4 1315 6

  • #2
    i'd guess it's due to the fact that htseq-count only reports one hit per aligned read assuming that read is aligned uniquely and does not overlap multiple features listed in your GTF. if an aligned read hits more than one feature in your GTF then it doesn't report that hit. bedtools gives you raw hits which includes every 1 hit for every intersection of every alignment with any features in the GTF no matter how many times it aligned or how many features it hit. you might think, "wow, htseq-count is dropping a lot of information". yes, it is! i've moved to using other tools to count hits to genes (RSEM/eXpress) since they disambiguate those ambiguous alignments and as a result you get counts for all of your aligned reads. in a genome with alternative splicing you lose too much data using htseq-count, in my opinion.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      sdriscoll, te source of your problems is simply that you use annotation files that are unsuitable for htseq-count.

      If a read alignes two several transcripts of the same gene, it is counted for this gene; only if it overlaps with different genes, it is dropped.

      To this end, your GTF file needs to indicate which transcripts belong to the same gene. The GTF standard specifies that exon lines from two transcripts of the same gene should have the same 'gene_id' but different 'transcript_id' values. Unfortunately, the Table Browser function of the UCSC Genome Browser returns faulty GTF files that just copy the transcript ID into the gene ID field, and with these, htseq-count does not work.

      So, please don't use the UCSC Table Browser. (They explained on previous occasion that they are unable to fix this bug, which, in my opinion, is rather serious.)

      Comment


      • #4
        Hi Simon - you're right about that. If you use the UCSC GTFs straight away then your program won't work well at all. I was being a little dramatic in my post however even if you have the gene names inserted in your GTF, which I always do, there's still a large number of multi-gene loci where information can be lost because of the unique gene_id requirement in htseq-count. With a quick loci clustering and a search for those with multiple gene names (using UCSC's knowngene annotation for mouse mm9) there's over 2200 loci with more than one gene_id which is close to 10% of the annotated loci in the mouse genome. I don't work with human much but I suspect there's more overlap there. Maybe not all of those multi-gene loci share all of their exons with each other but some of them do. For example there are loci like CC1/Rb1cc1 or Sgk3/Cisk. One would lose quite a few reads in those loci. So a bit more work would have to be done to one's annotation to ensure that they aren't losing too much information with htseq-count. For example htseq-count works beautifully if you collapse these loci into exon-unions. I've written my own read counters over the years and what seems to work best for me is to allow reads to be assigned to multiple features but to weight those reads by 1/(number of features). I could futher restrict that to only allowing those reads to be ones that are shared across features within the same locus but if I'm working with alignments that don't inlcude multi-aligned reads then that doesn't matter. This way I get as many reads assigned to features as possible and when I collapse the loci down into "genes" I can just add the features' counts. In the end I get the same counts as I would have by using a modifed GTF with overlapped features collapsed into unions without having to generate that extra GTF file.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Sdriscoll, I think that the main issue here is not due to "genuinely" overlapping genes but to the bug Simon mentioned: in a raw GTF from the UCSC table browser all the transcripts have a different "gene_id" whether they come from the same locus or not.
          I count 55,419 different identifiers in the "gene_id" field of the UCSC knownGene GTF file for mm9, not 2,200.

          Comment


          • #6
            That's correct. I however am not working with the raw Gtf to get that number. I swap in the real gene symbols into the gene_id field. I also ran an analysis of the entire gtf collapsing different transcript ids down into loci where different transcripts share some sequence between them. I actually just use gffread (from Cufflinks) for that. Then parsing that result I find >2200 loci with multiple gene names, and obviously multiple transcript ids. If you look at the gene loci I mentioned in the UCSC genome browser for mm9 you can see what i mean.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #7
              Yes, replacing UCSC gene_id by NCBI gene symbols is the good move. Interesting to see that still more than 10% of the historically annotated loci overlap with at least another one in the mouse genome, even when considering "real gene symbols".

              Comment


              • #8
                The UCSC gene name substitution is a problem I've run into before too.
                Here's a blog post I found that offers a solution for anyone who's interested:
                When trying to do gene and transcript-level quantification of RNA-Seq data, you often need what's called a GTF file. This file is a list of ...

                Comment


                • #9
                  maybe it's just me but i've always been annoyed with official gene symbols. from the start i've never thought they were reliable to base any part of my codes around so when i've written analysis tools for my lab i use more checks to make sure my code knows what it's looking at and so i don't confuse anybody. from a programmer's point of view it's just a design constraint and one less assumption that can be made along the way. i end up dealing with a lot of different annotations, sometimes custom made, and different genomes and it's easier for me if my tools don't put too much requirement on the naming conventions of the annotation.

                  i'm only pointing out the fact that even if you're working with official gene symbols as the 'gene_id' field then a counter like htseq-count will end up throwing out some reads from many loci where all that's actually happening is some illogical historical naming issues. that's not the researcher's fault. some of those locations in the genome are overlap of adjacent genes but sometimes it's just what looks like a typical alternatively spliced gene locus but for some reason not all of the isoforms have the same gene name. those are the ones that are the issue. it's a minor point - yes if you have the right annotation it's irrelevant.

                  the human genome is even more complex. for example this SMN2 locus shows what I'm talking about.



                  looks like your typical alternatively spliced locus but it houses SMN1 and SMN2. UCSC's track and the Refseq track both show this naming issue.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • #10
                    I never really ran into these problems myself and I wonder if this might be because I only use Ensembl annotation. Ensembl changes their gene IDs quite often so that you have to remap these ENBG000... IDs to HGNC symbols anew for each Ensembl release but the advantage is that this gives the Ensembl people the flexibility to change easily what they consider one gene. This might explain why I feel that the described issue with overlapping genes is somehow avoided in Ensembl annotation

                    Comment


                    • #11
                      That explains it! Thanks for pointing that out. It's unfortunate that the annotations are so lame. Cufflinks can actually help with this but people still will need to parse the gtf to build a translation table. You can use cuffcompare to compare the gtf to itself and that produces cuffcmp.combined.gtf. In that file gene ids are replaced with XLOC ids which are unique to loci. The gene names are set in the gene_name field, if they were in your gtf to start. You can also use the gffread program that comes packaged with Cufflinks. It has a --cluster-only option that adds a locus attribute to each gtf row grouping the features into unique loci. One could specify that attribute when running htseq-count and then translate the locus ids bak to gene names. I think ideally the counting script should do the clustering or at least provide the option. Then the annotation doesn't have to be perfect.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #12
                        How can I convert the ENSG00000261838 to real name of genes? Or is there any parameter do I need to use prior using htseq-count? Thank you

                        ENSG00000261838 0
                        ENSG00000261839 2
                        ENSG00000261840 3
                        ENSG00000261841 0
                        ENSG00000261842 0
                        no_feature 985727
                        ambiguous 435068
                        too_low_aQual 0
                        not_aligned 0
                        alignment_not_unique 778451

                        Comment


                        • #13
                          usually , I go to ensemble(www.ensembl.org) => bioMart . CHOOSE DATABSE ( which is Ensemble gene 70 currently ) , CHOOSE DATASET (which is Homo sapiens for Human), once these two steps done , the left side of the webpage
                          will show Dataset / filter / Attributes , click Attributes (check Features) , expand "gene" , check "Ensembl gene id" "Associated gene name" "description" or other features you want . Click "Result " which is at the top left side of the webpage. you can "Export all result to " there .

                          Originally posted by wanfahmi View Post
                          How can I convert the ENSG00000261838 to real name of genes? Or is there any parameter do I need to use prior using htseq-count? Thank you

                          Comment


                          • #14
                            Hi members,

                            Reviving an old topic... I am bewildered about htseq-count output. The endpoint is I will be using DEseq for differential expression, but I am confused with the htseq-count results.

                            I compared the htseq-count results with what I view in IGV... and I see very inconsistent things which I can not explain.. I was hoping if anyone can give an insight into this matter..

                            Quick overview on what I did
                            Mapped the paired end .fastq file to a reference transcription using tophat2
                            [CODE]"tophat2 -G <transcript.gtf> -o <outputdir> --no-novel-juncs <bowtie2 indexed genome> <.fastq1> <.fastq2> [CODE]

                            Following this I converted the resulting .bam file to .sam in order to proceed with read counting using htseq-count
                            #Converting .bam to .sam
                            Code:
                            samtools view -h <bam> > <sam>
                            #Sorting the .sam (Would be great if anyone can tell me if this is the correct method or not, as sometimes it returns some error - I also asked here http://seqanswers.com/forums/showthr...t=12376&page=2)
                            Code:
                            sort -k1 <sam> > <sorted.sam>
                            #Running htseq-count
                            Code:
                            htseq-count -m intersection-strict -s no <sorted.sam> <transripts.gtf> > <result_count>
                            Above is the general code I used, I also tried running htseq-count with -m intersection-nonempty.

                            The problem I have now is when I look at the gene count and IGV it gives confusing statement

                            Location 1
                            Result from htseq-count gives - 14 (regardless which htseq-count mode)
                            Image from IGV


                            Location 2
                            Result from htseq-count gives - 7 (regardless which htseq-count mode)
                            Image from IGV


                            As you can see, at Location 2 has more reads, but how come the counts number only is 7??

                            Could it be to do with the .gtf file I am using as mentioned by
                            Originally posted by sdriscoll View Post
                            i'd guess it's due to the fact that htseq-count only reports one hit per aligned read assuming that read is aligned uniquely and does not overlap multiple features listed in your GTF. if an aligned read hits more than one feature in your GTF then it doesn't report that hit. bedtools gives you raw hits which includes every 1 hit for every intersection of every alignment with any features in the GTF no matter how many times it aligned or how many features it hit. you might think, "wow, htseq-count is dropping a lot of information". yes, it is! i've moved to using other tools to count hits to genes (RSEM/eXpress) since they disambiguate those ambiguous alignments and as a result you get counts for all of your aligned reads. in a genome with alternative splicing you lose too much data using htseq-count, in my opinion.
                            Or its because location 2 has tons of non-unique mapping and the read counts are very low??

                            Eventually if I would be doing the differential expression on DESeq, would this give me a false result at location 2 since there is only 7 count but there are much more reads??

                            Appreciate any feedback

                            Can anyone help explain why is there this discrepancy between IGV and count data??

                            Comment


                            • #15
                              Please take a few of these many reads in location 2, hover your mouse over them in IGV to get the read name, then grep for the read name to find the corresponding lines in the SAM file. Maybe they are all non-unique mapping reads.

                              The fact that htseq-count does not count these reads is deliberate, of course. See earlier threads for an explanation.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM
                              • seqadmin
                                Multiomics Techniques Advancing Disease Research
                                by seqadmin


                                New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

                                A major leap in the field has
                                ...
                                02-08-2024, 06:33 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:12 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-23-2024, 04:11 PM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-21-2024, 08:52 AM
                              0 responses
                              73 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-20-2024, 08:57 AM
                              0 responses
                              62 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X