Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • counting reads mapping to exons and HTseq question

    Is there a simple way to count the number of reads that map to known exons (or genes) given a SAM file with aligned reads and an annotation GTF file?

    I am aware of HTSeq, but I am not sure I am getting what I'm looking for.

    For example, if I run the following:

    htseq-count -m intersection-strict accepted_hits.sam mm9_ucsc_exon.gtf

    I get a count of 3 for a gene that by visual inspection has many more reads mapped to it.

    Any suggestion?

    Thanks.

  • #2
    I used Picard (the Java implementation of SamTools) to create my own exon counter. It took a few hours, though I'm out of practice with Java. I know that BioScope from ABI has a known exon counter, but their pipeline is a bit frustrating. It's not particularly complicated to write one if you use SamTools, and the documentation for Picard is good, so I suspect other programs are available to do it, but I don't know what they are.

    Comment


    • #3
      Hi

      Originally posted by PFS View Post
      I get a count of 3 for a gene that by visual inspection has many more reads mapped to it.
      Possible explanations:

      - htseq-counts assumes your RNA-Seq data to be strand-specific, i.e., it will only count those genes which map to the strand that the feature is on. If you want it to count reads on both strands, use the '--straded=no' option.

      - If a read is not fully contained in the exon, it will not get counted. If you want it to be counted even if it only partially overlaps with the feature, use '--mode=intersection-nonempty'.

      - If there are two overlapping exons from different genes, the reads overlapping both might be discarded and counted as "ambiguous".

      If neither of these explains your observations, you have found a bug. If so, please contact me by mail with more details. Thanks.

      Cheers
      Simon

      Comment


      • #4
        I've have any discrepancies in the count of exons due to your third explanation: Some exons are cataloged as ambiguous because they (exons) belong to splice variants of the same gene. For instance,

        htseq-count -texon -iParent -sno -mintersection-nonempty reads.sam hg18_refFlat.gff3 > exoncounts.txt

        By other hands, counting by gene maps all the reads in the intra-genic region (not only exons). This is not our goal.

        For instance,

        htseq-count -t gene -i ID --stranded=no reads.sam hg18_refFlat.gff3 > genecounts.txt


        So, Is there any way (using htseq-count) to count exons grouping by gene ("Parent" attribute of "mRNA" row of refFlat file)?

        Originally posted by Simon Anders View Post
        Hi



        Possible explanations:

        - htseq-counts assumes your RNA-Seq data to be strand-specific, i.e., it will only count those genes which map to the strand that the feature is on. If you want it to count reads on both strands, use the '--straded=no' option.

        - If a read is not fully contained in the exon, it will not get counted. If you want it to be counted even if it only partially overlaps with the feature, use '--mode=intersection-nonempty'.

        - If there are two overlapping exons from different genes, the reads overlapping both might be discarded and counted as "ambiguous".

        If neither of these explains your observations, you have found a bug. If so, please contact me by mail with more details. Thanks.

        Cheers
        Simon

        Comment


        • #5
          counting reads mappint to exons and HTseq question

          Hi jorgebm,
          Were you able to resolve this question? I am interested in accomplishing the same thing and was hoping HTSeq could accommodate.

          Originally posted by jorgebm View Post
          So, Is there any way (using htseq-count) to count exons grouping by gene ("Parent" attribute of "mRNA" row of refFlat file)?

          Comment


          • #6
            I have included a Python scripts to do something like this with DEXSeq. Maybe have a look.

            Apart from that: HTSeq is a Python package intended to facilitate programmin such stiff yourself. htseq-count was originally only intended as a demonstration for HTSeq. So, if you know some Python, just do it yourself (and if you know Java, use Picard, which is somewhat similar to HTSeq). I fully agree with mrawlins: it is not that difficult.

            Comment


            • #7
              Hi,

              I'm sorry for the delay in my reply. I hope it isn't to late to help.....

              My goal was count gene hits to test for differential expresion. So finally I discarded "ht-seq" and use CASAVA (We've a GAIIx) couting output. CASAVA generates RPKM an a "Raw count" (sum of coverages for each base within the feature). Then I've used "Raw counts" and DEseq for differential expression testing.

              However, previously I also tried "intersectBed" (Bed Tools) to overlap alignments with a gene model (in my case RefSeq). If you don't run CASAVA It's a choice.

              Hope it helps.

              Regards

              Comment


              • #8
                I sort of have different question. I tried to use HTseq-count to count exons, I got GFF files from NCBI mm9 and extracted the exon information to create exon.gff, but I don't know why there is 0 count in the exon output, does anyone know what's wrong with my process?

                Thanks in advance!

                Comment


                • #9
                  I don't know if it might be due to a read mapping to different isoforms sharing the same exons... I have a GFF file of annotated transcripts, and there are of course several transcripts to most genes, some of which share some of same exons.

                  For example, gene A may have isoforms 1 and 2, both of which contain exon y. When htseq tries to assign a feature to a read that mapped to exon y, will it discard it as ambiguous since it cannot decide upon assigning it to isoform 1 or 2? Even though both isoforms come from the same gene and have a column in the GFF stating so.

                  I have a feeling it won't actually be entangled in this sort of a problem, but it could be?

                  Carmen

                  Comment


                  • #10
                    Originally posted by carmeyeii View Post
                    I don't know if it might be due to a read mapping to different isoforms sharing the same exons... I have a GFF file of annotated transcripts, and there are of course several transcripts to most genes, some of which share some of same exons.

                    For example, gene A may have isoforms 1 and 2, both of which contain exon y. When htseq tries to assign a feature to a read that mapped to exon y, will it discard it as ambiguous since it cannot decide upon assigning it to isoform 1 or 2? Even though both isoforms come from the same gene and have a column in the GFF stating so.

                    I have a feeling it won't actually be entangled in this sort of a problem, but it could be?

                    Carmen
                    Normally when one uses htseq-count, one tells it to group reads by gene (-i gene_id or similar), so a read will be kept regardless of how many isoforms of a single gene it can be assigned to.

                    Comment


                    • #11
                      Originally posted by emilyjia2000 View Post
                      I sort of have different question. I tried to use HTseq-count to count exons, I got GFF files from NCBI mm9 and extracted the exon information to create exon.gff, but I don't know why there is 0 count in the exon output, does anyone know what's wrong with my process?
                      Maybe if you told us how exactly you extracted the exon information and how you called htseq-count, we might be able to help you without resorting to guessing. In any case, have a look at the Python scripts that we supply with DEXSeq. These are meant to count reads mapping to exons. Using htseq-count for this purpose is not at all straight-forward.

                      Comment


                      • #12
                        Originally posted by dpryan View Post
                        Normally when one uses htseq-count, one tells it to group reads by gene (-i gene_id or similar), so a read will be kept regardless of how many isoforms of a single gene it can be assigned to.
                        Thanks!

                        Comment


                        • #13
                          Hi all,
                          do you have any idea how to separate reads(from a bam file) according to the strand they came from in an RNA-Seq paired end and strand specific data ? I tried both tophat flag and samtools flag , the two results are quite different and many reads are not affected correctly to their specific strand

                          Comment


                          • #14
                            Do you just want pairs that maps to the + strand in one file and those that map to the - strand in another (alternatively, separating paired-end reads by the strand to which read1 aligns)? You can do that with gawk or python (or pretty much any language for that matter). Having said that, if you're trying to count RNAseq reads, then just have htseq-count deal with this for you.

                            Also, please start a new thread next time. This one is over a year old.

                            Comment


                            • #15
                              HTSeq count of single reads

                              Hi all,
                              As far as we come across,it is clear that the HTSeq Count gives the count of paired reads only. Even if we included the single reads in addition to paired reads in mapping, single reads has no meaning.
                              So my question is whether there is any option in HTSeq count to consider the single read also.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              47 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X