Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq not counting reads toward genes the overlap?

    I isolated a number of sam lines from a bigger sam file where I do not understand why the reads are not counted toward the genes they overlap.
    I checked the gtf files by hand to confirm, as well as writing a script to confirm these genes solely map to one specific gene each (each maps to a different gene). Sadly when I try to truncate the gtf file HTseq gives me an error "Error: The attribute string seems to contain mismatched quotes.", making it impossible for me to check with a GTF file containing solely these genes.

    The command I used was: "htseq-count -m intersection-nonempty SamLines.sam Homo_sapiens.GRCh37.73.gtf > res.txt"

    The gtf file was acuired from ensembl: ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens.

    And the SamLines.sam file can be obtained here:


    The GTF file containing only those lines of the genes I think it should map to:


    I was wondering if I am missing some variable that determines that these genes should not be counted?

    Additionally I tested the example mentioned on the website: "http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz"

    I noticed in the GTF file that gene "YEL034W" is completely overlapped by gene "YEL034C-A" yet using the example files the results show 67 counts for gene "YEL034W". I thought HTseq only counted reads toward a gene if it is uniquely mapped to 1 gene, meaning this should be impossible. Does HTseq have some other mechanism by which way it determines it should map this to that gene or is this a mistake?

  • #2
    Appologies for the mistake in the title. *"the" = "they"

    Comment


    • #3
      At a cursory glance, it seems that your annotations don't match. You seem to have aligned against an Ensembl genome (or something like that, since chromosome names are "2", "3", "4", etc.), but then the annotation GTF file is using a more UCSC-like format (where chromosomes are named "chr2", "chr3", etc.). That's going to cause reads to not be counted appropriately.

      Comment


      • #4
        Ah yeah I fixed that before I ran it. I copied the original though. The file I obtained from ensembl does not have "chr" before the number.

        Comment


        • #5
          Ah, good, glad you already fixed that. I downloaded the updated GTF and looked at it (there were a number of problems with it, but presumably these are particular to the version you uploaded, since it can't be parsed). You likely aren't setting "-s no" when that's what you need. Those reads overlap a feature, just on the wrong strand (unless you have a stranded library, in which case perhaps "-s reverse" is correct, depending on how it was made).
          Last edited by dpryan; 09-18-2013, 09:46 AM. Reason: I should really proof read

          Comment


          • #6
            Thank you very much, that solves most of my problem. Now there is just a very small number of lines that map differently than I would expect.

            Here is one example:
            HWI-EAS225:1:10:1402:1333#0/1 16 VII 1026639 255 36M * 0 0 ATCATGCGTGTCTTTAGACATAACAGTATGGTTTAT ###############:7:<999:8.).;;<9<9:<B XA:i:0 MD:Z:36 NM:i:0

            Does not map to:
            VII protein_coding exon 1026642 1026968 . + . gene_id "YGR269W"; transcript_id "YGR269W"; exon_number "1"; gene_name "YGR269W"; transcript_name "YGR269W";
            VII protein_coding CDS 1026642 1026965 . + 0 gene_id "YGR269W"; transcript_id "YGR269W"; exon_number "1"; gene_name "YGR269W"; transcript_name "YGR269W"; protein_id "YGR269W";
            VII protein_coding start_codon 1026642 1026644 . + 0 gene_id "YGR269W"; transcript_id "YGR269W"; exon_number "1"; gene_name "YGR269W"; transcript_name "YGR269W";
            VII protein_coding stop_codon 1026966 1026968 . + 0 gene_id "YGR269W"; transcript_id "YGR269W"; exon_number "1"; gene_name "YGR269W"; transcript_name "YGR269W";

            If I put these lines in a seperate gtf file, the gene gets a count of 1. If I use the entire gtf file obtained from ensembl it shows 0 counts with for the specific sam line reporting "no feature" (using the -o filename.txt argument).
            Last edited by [email protected]; 09-19-2013, 07:08 AM.

            Comment


            • #7
              The same happens with:

              HWI-EAS225:1:10:1427:554#0/1 0 II 606243 255 36M * 0 0 AAGACAAAAATAACTAGCAACAATGGGTAAATCACA B>@=@A@>?@@?>:999><<99999999999<:;;< XA:i:0 MD:Z:33G0T1 NM:i:2
              HWI-EAS225:1:10:1615:484#0/1 0 II 606243 255 36M * 0 0 AAGACAAAAATAACTAGCAACAATGGGTAAATAACA 4>CCB>+>6?8==B6<*B>9<1AA:1<062:##### XA:i:0 MD:Z:32C0G0T1 NM:i:3

              II protein_coding exon 605961 606272 . + . gene_id "YBR190W"; transcript_id "YBR190W"; exon_number "1"; gene_name "YBR190W"; transcript_name "YBR190W";
              II protein_coding CDS 605961 606269 . + 0 gene_id "YBR190W"; transcript_id "YBR190W"; exon_number "1"; gene_name "YBR190W"; transcript_name "YBR190W"; protein_id "YBR190W";
              II protein_coding start_codon 605961 605963 . + 0 gene_id "YBR190W"; transcript_id "YBR190W"; exon_number "1"; gene_name "YBR190W"; transcript_name "YBR190W";
              II protein_coding stop_codon 606270 606272 . + 0 gene_id "YBR190W"; transcript_id "YBR190W"; exon_number "1"; gene_name "YBR190W"; transcript_name "YBR190W";
              Last edited by [email protected]; 09-19-2013, 07:32 AM.

              Comment


              • #8
                I don't have any good explanation for why that's happening when there's an assigned feature in the miniature version. You might try successively larger versions of the annotation to see where things go whonky.

                Comment


                • #9
                  Try again with "--mode union". My guess it that the "-o" output will then show you another feature with which the read but not the gene overlaps.

                  Comment


                  • #10
                    I followed your suggestion, and you are right. This read:
                    HWI-EAS225:1:10:1427:554#0/1 0 II 606243 255 36M * 0 0 AAGACAAAAATAACTAGCAACAATGGGTAAATCACA B>@=@A@>?@@?>:999><<99999999999<:;;< XA:i:0 MD:Z:33G0T1 NM:i:2 XF:Z:ambiguous[YBR191W+YBR190W]

                    now is ambiguously mapped to
                    II protein_coding exon 605961 606272 . + . gene_id "YBR190W"; transcript_id "YBR190W"; exon_number "1"; gene_name "YBR190W"; transcript_name "YBR190W";
                    II protein_coding exon 606265 606275 . + . gene_id "YBR191W"; transcript_id "YBR191W"; exon_number "1"; gene_name "RPL21A"; transcript_name "RPL21A";

                    But the overlap of the read with gene "YBR191W" is only 10 and with gene "YBR190W" it is 29 from what I can see. I do not understand why with the option -mode intersection-nonempty it is considered as a "no feature" read instead of being counted toward YBR190W. Are reads considered to have "no feature" if they overlap more then one gene with the mode intersection-nonempty even if the size of the overlap is different?

                    Comment


                    • #11
                      No, the size of the overlap is not taken into account. Please see the description on the web page.

                      In the end, I now see only very few use cases where one may want to use something else than the default mode "union" and hence recommend to stick to it unless you have some really special problem.

                      Comment


                      • #12
                        Ah I understand it now. Thanks, that explains it.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        31 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        33 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        28 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        53 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X