I isolated a number of SAM lines from a larger SAM file, and I do not understand why these reads are not counted toward the genes they overlap.
I checked the GTF file by hand, and also wrote a script, to confirm that each of these reads maps to exactly one gene (each read to a different gene). Sadly, when I try to truncate the GTF file, HTSeq gives me the error "Error: The attribute string seems to contain mismatched quotes.", so I cannot test with a GTF file containing only these genes.
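To rule out that my truncation itself damaged the file, here is a minimal sketch of how the GTF could be subset by gene_id while keeping whole lines intact. It assumes the error comes from a line whose attribute column was cut off or lost a quote, which I have not confirmed. The function name subset_gtf and the gene IDs in the usage note are made up for illustration.

```python
import re

def subset_gtf(lines, gene_ids):
    """Keep whole GTF lines whose gene_id is in gene_ids.

    Returns (kept_lines, suspicious_line_numbers). A line is flagged as
    suspicious if it does not have 9 tab-separated fields or if its
    attribute column contains an odd number of double quotes.
    """
    wanted = set(gene_ids)
    kept, suspicious = [], []
    for number, line in enumerate(lines, start=1):
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            suspicious.append(number)
            continue
        attributes = fields[8]
        # An odd number of quotes means at least one quote is unmatched.
        if attributes.count('"') % 2 != 0:
            suspicious.append(number)
            continue
        match = re.search(r'gene_id "([^"]+)"', attributes)
        if match and match.group(1) in wanted:
            kept.append(line if line.endswith("\n") else line + "\n")
    return kept, suspicious
```

Usage would be something like reading the full GTF with open(...), calling subset_gtf(handle, ["GENE_A", "GENE_B"]) with the real gene IDs in place of these placeholders, and writing the kept lines to a new file. Any line numbers returned as suspicious would be the first place to look for the mismatched quotes.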
The command I used was: "htseq-count -m intersection-nonempty SamLines.sam Homo_sapiens.GRCh37.73.gtf > res.txt"
The GTF file was acquired from Ensembl: ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens.
And the SamLines.sam file can be obtained here:
The GTF file containing only those lines of the genes I think it should map to:
I was wondering whether I am missing some setting that determines that these reads should not be counted.
Additionally I tested the example mentioned on the website: "http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz"
I noticed in the GTF file that gene "YEL034W" is completely overlapped by gene "YEL034C-A", yet with the example files the results show 67 counts for gene "YEL034W". I thought HTSeq only counted a read toward a gene if the read maps uniquely to that one gene, which would make this impossible. Does HTSeq use some other mechanism to decide that these reads belong to that gene, or is this a mistake?
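One thing I have not ruled out is strand. htseq-count has a -s/--stranded option which, as far as I can tell from the documentation, defaults to "yes", so two genes that overlap in coordinates but lie on opposite strands might not be treated as ambiguous. Below is a rough sketch of how the extent and strand of two genes could be compared from the GTF. It assumes the genes are described by "exon" lines with a gene_id attribute; the function names are my own and not part of HTSeq.

```python
import re

def gene_extent(lines, gene_id):
    """Return (chrom, start, end, strand) spanning all exon lines of gene_id, or None."""
    chrom = strand = None
    start = end = None
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only look at well-formed exon lines.
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if not match or match.group(1) != gene_id:
            continue
        chrom, strand = fields[0], fields[6]
        lo, hi = int(fields[3]), int(fields[4])
        start = lo if start is None else min(start, lo)
        end = hi if end is None else max(end, hi)
    if start is None:
        return None
    return chrom, start, end, strand

def compare_genes(lines, gene_a, gene_b):
    """Report whether two genes overlap in coordinates and share a strand."""
    lines = list(lines)  # allow the input to be scanned twice
    a = gene_extent(lines, gene_a)
    b = gene_extent(lines, gene_b)
    if a is None or b is None:
        return None
    overlap = a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    return {"overlap": overlap, "same_strand": a[3] == b[3]}
```

Calling compare_genes on the lines of the example GTF with "YEL034W" and "YEL034C-A" would show whether the two genes are on the same strand. If they are not, that could explain the 67 counts, though I would still like confirmation that this is how HTSeq decides.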