You are supposed to pass first the SAM file, then the GFF file.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
End location of read in HTSeq-count
Hi Simon,
I am a user of the HTSeq-count.
It's a wonderful tool, and I really love it.
Nevertheless, I have a small question about it.
I am wondering how dose HTSeq-count define the end location (right site end) of a read from RNA-seq sam file?
I am not very familiar with Pathon, so I cannot well understand the source code.
For single-end strand-specific RNA-seq sam file, is the end location defined by the CIGAR?
That is, End= Start + ( No. of Ms) + (No. of Ns) + (No. of Ds) ?
For paired-end strand-specific RNA-seq sam file, how does HTSeq-count handle this?
Is the start location (left site) defined as the smaller one of the two start loci within a pair?
And the end location (right site end) defined as the larger one of the two end sites within a pair?
Many thanks and best regards,
Jerry
Comment
-
Originally posted by sumanexHowever, the following command is not showing any coverage only the coordinates and a blue horizotal line at 0.
>>> import matplotlib
>>> from matplotlib import pyplot
>>> pyplot.plot( list( cvg[ HTSeq.GenomicInterval( "chr2", 2000000, 3000000, "+" ) ] ) )
[<matplotlib.lines.Line2D object at 0x104751ac>]
>>> pyplot.show()
Do have to provide a gtf file also for showing the coverage.
http://www-huber.embl.de/users/ander...-full-coverage
Comment
-
HtSeq TSS plots
Hi,
I am following the TSS plots code in HTSeq in ubunut 12.04 but getting error:
1) in step p in tsspos answer is false. but getting a wincvg plot in the matplot window. the gtf file is for mouse release64 ensembl.
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> bamfile = HTSeq.BAM_Reader( "galaxy58.bam" )
>>> bamfile
<HTSeq.BAM_Reader object at 0xb721afac>
>>> gtffile = HTSeq.GFF_Reader( "Mus_musculus.NCBIM37.64.gtf" )
>>> coverage = HTSeq.GenomicArray( "auto", stranded=False, typecode="i" )
>>> coverage
<HTSeq._HTSeq.GenomicArray object at 0x9065b0c>
>>> for almnt in bamfile:
... if almnt.aligned:
... coverage[ almnt.iv ] += 1
...
>>> import itertools
>>> for feature in itertools.islice( gtffile, 100):
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... print feature.attr["gene_id"], feature.attr["transcript_id"], feature.iv.start_d_as_pos
...
ENSMUSG00000000702 ENSMUST00000105216 NT_166433:11954/+
ENSMUSG00000078423 ENSMUST00000105217 NT_166433:28586/+
ENSMUSG00000078424 ENSMUST00000105218 NT_166433:47744/+
ENSMUSG00000078424 ENSMUST00000105219 NT_166433:47927/+
ENSMUSG00000071964 ENSMUST00000096694 NT_166402:12542/+
ENSMUSG00000091539 ENSMUST00000165255 18:3123464/-
ENSMUSG00000063889 ENSMUST00000151311 18:3327588/-
>>> tsspos = set()
>>> for feature in gtffile:
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... tsspos.add( feature.iv.start_d_as_pos )
...
>>>
>>> p = HTSeq.GenomicPosition( "chr1", 133363801, "+" )
>>> p
<GenomicPosition object 'chr1':133363801, strand '+'>
>>> p in tsspos
False
>>> halfwinwidth = 3000
>>> window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
>>> window
<GenomicInterval object 'chr1', [133360801,133366801), strand '.'>
>>> list( coverage[window] )
>>> import numpy
>>> wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
>>> wincvg
array([0, 0, 0, ..., 0, 0, 0])
>>> from matplotlib import pyplot
>>> pyplot.plot( wincvg )
[<matplotlib.lines.Line2D object at 0x10c8122c>]
>>> pyplot.show()
and 2) Index error start too small
>>> profile = numpy.zeros( 2*halfwinwidth, dtype='i' )
>>> profile
array([0, 0, 0, ..., 0, 0, 0])
^
>>> for p in tsspos:
... window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
... wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
... if p.strand == "+":
... profile += wincvg
... else:
... profile += wincvg[::-1]
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "_HTSeq.pyx", line 520, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:10593)
File "_HTSeq.pyx", line 370, in HTSeq._HTSeq.ChromVector.__getitem__ (src/_HTSeq.c:7538)
IndexError: start too small
any help is appreciated
suman
Comment
-
Originally posted by sumanex View Post>>> tsspos = set()
>>> for feature in gtffile:
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... tsspos.add( feature.iv.start_d_as_pos )
...
>>> p = HTSeq.GenomicPosition( "chr1", 133363801, "+" )
>>> p
<GenomicPosition object 'chr1':133363801, strand '+'>
>>> p in tsspos
False
>>> list( coverage[window] )
>>> import numpy
...
>>> for p in tsspos:
... window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
... wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
... if p.strand == "+":
... profile += wincvg
... else:
... profile += wincvg[::-1]
...
Traceback (most recent call last):
...
IndexError: start too small
Comment
-
tsspos
the lines of tsspos are as follows:
<GenomicPosition object '2':49762399, strand '-'>, <GenomicPosition object '2':26280379, strand '-'>, <GenomicPosition object '6':42236683, strand '+'>, <GenomicPosition object '6':47528008, strand '-'>, <GenomicPosition object '4':154836045, strand '+'>, <GenomicPosition object '2':136717343, strand '+'>, <GenomicPosition object '15':73649839, strand '-'>, <GenomicPosition object '7':148651406, strand '-'>, <GenomicPosition object '3':29912426, strand '-'>, <GenomicPosition object '15':91602775, strand '+'>, <GenomicPosition object '7':133709879, strand '-'>, <GenomicPosition object '4':101516627, strand '-'>, <GenomicPosition object '7':16456750, strand '-'>, <GenomicPosition object '3':90262860, strand '-'>,
however, i am getting plot of wincvg showing all values are not zero.Attached FilesLast edited by sumanex; 02-02-2013, 11:24 PM.
Comment
-
Hi Simon,
I have a question regarding my data and numbers of counts resulting from htseq-count.
I have ~45m SE 80bp Illumina reads aligning to reference. I then remove duplicates (using Picard MarkDuplicates) ending up with ~25m reads. When I run htseq-count on the original aligned SAM and the duplicate-removed SAM I get ~23m, 8.8m counts.
The proportion of 'retained' reads (following duplicate removal) is ~0.53, whereas the proportion of 'retained' counts is ~0.39. I see this as quite a large discrepancy. Another way of looking at this is sum of counts over total reads in original and duplicate-removed. This gives the same proportions, ~0.5 in original SAM, ~0.37 in duplicate-removed.
I cannot see why the duplicate-removed SAM should give less counts than the original: if anything I would think it would give more because the duplicates would only give a single count (for highest quality if I understand htseq-count correctly). Is there an obvious answer or have you seen this before? I am using the 'union' mode with same reference as I aligned to, non-stranded data, all other parameters are default. I looked at the SAM flag and counted 0, 16 which give the same ratios for aligned vs. duplicate-removed reads. I would be interested to know your thoughts.
Thanks,
Bruce.
Comment
-
new cigar operator
--Hi,
i think htseq-count doesn't understand the new CIGAR operators X and =
I have this error when reading a sam file:
Error occured when reading first line of sam file.
Error: ("Unknown CIGAR code 'X' encountered.", 'line 29 of file AB4_AR013.sam'
[Exception type: ValueError, raised in _HTSeq.pyx:1163]
Laurent --
Comment
-
Hi,
I'm having the same issue with one of my SAM files when using htseq-count.
For some reason, one of my SAM files raises the error:
Error occured in line 38586430 of file RNA7_sorted.sam.
Error: ("invalid literal for int() with base 10: ''", 'line 38586430 of file RNA7_sorted.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1166]
I am using HTSeq-0.5.3p9
Can you help please, thanks, alig
Comment
-
library type and stranded parameter
Hello,
I'm trying to figure out the right "stranded" parameter to use for my RNA-seq data which was aligned using TopHat with the "--library-type fr-firststrand" parameter. I'm using paired-end reads.
From what I can see, the results of running "stranded=no" is similar to "stranded=reverse" which gives me about ~50% of the total fragments, the majority have no feature. But if I ran using "stranded=yes", I only get ~2% of total fragments as having a feature.
I'm thinking that the "stranded=reverse" is the way to go if I want to measure sense expression, since for the fr-firststrand protocol, the right most strand is sequenced first which is opposite to the coding strand. Is this correct?
Thanks,
Patrick
Comment
-
Hello,
I've used Tophat 2.0.9 & then HTseq version 0.5.4p3 & just with 3 of my 28 SAM files I get this error.
Error occured in line 63841485 of file RNA8_sorted.sam.
Error: ("'seq' and 'qualstr' do not have the same length.", 'line 63841485 of file RNA8_sorted.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:772]
Can anyone please help as it's holding up my analysis.
Thank you
alig
Comment
-
I'm thinking that the "stranded=reverse" is the way to go if I want to measure sense expression, since for the fr-firststrand protocol, the right most strand is sequenced first which is opposite to the coding strand. Is this correct?
Comment
Latest Articles
Collapse
-
by seqadmin
Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.
Long-Read Sequencing
Long-read sequencing has...-
Channel: Articles
12-02-2024, 01:49 PM -
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben MartÃnez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 12-02-2024, 09:29 AM
|
0 responses
150 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 09:29 AM
|
||
Started by seqadmin, 12-02-2024, 09:06 AM
|
0 responses
51 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 09:06 AM
|
||
Started by seqadmin, 12-02-2024, 08:03 AM
|
0 responses
42 views
0 likes
|
Last Post
by seqadmin
12-02-2024, 08:03 AM
|
||
Started by seqadmin, 11-22-2024, 07:36 AM
|
0 responses
74 views
0 likes
|
Last Post
by seqadmin
11-22-2024, 07:36 AM
|
Comment