HTSeq: A Python framework to work with high-throughput sequencing data

alig replied

09-09-2013, 06:25 PM
Hi,
I'm having the same issue with one of my SAM files when using htseq-count.
For some reason, one of my SAM files raises the error:

Error occured in line 38586430 of file RNA7_sorted.sam.
Error: ("invalid literal for int() with base 10: ''", 'line 38586430 of file RNA7_sorted.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1166]

I am using HTSeq-0.5.3p9

Can you help please, thanks, alig
Leave a comment:
mslider replied

06-05-2013, 03:04 AM
new cigar operator

--Hi,

i think htseq-count doesn't understand the new CIGAR operators X and =
I have this error when reading a sam file:

Error occured when reading first line of sam file.
Error: ("Unknown CIGAR code 'X' encountered.", 'line 29 of file AB4_AR013.sam'
[Exception type: ValueError, raised in _HTSeq.pyx:1163]

Laurent --
Leave a comment:
bruce01 replied

02-25-2013, 07:41 AM
Hi Simon,

I have a question regarding my data and numbers of counts resulting from htseq-count.

I have ~45m SE 80bp Illumina reads aligning to reference. I then remove duplicates (using Picard MarkDuplicates) ending up with ~25m reads. When I run htseq-count on the original aligned SAM and the duplicate-removed SAM I get ~23m, 8.8m counts.

The proportion of 'retained' reads (following duplicate removal) is ~0.53, whereas the proportion of 'retained' counts is ~0.39. I see this as quite a large discrepancy. Another way of looking at this is sum of counts over total reads in original and duplicate-removed. This gives the same proportions, ~0.5 in original SAM, ~0.37 in duplicate-removed.

I cannot see why the duplicate-removed SAM should give less counts than the original: if anything I would think it would give more because the duplicates would only give a single count (for highest quality if I understand htseq-count correctly). Is there an obvious answer or have you seen this before? I am using the 'union' mode with same reference as I aligned to, non-stranded data, all other parameters are default. I looked at the SAM flag and counted 0, 16 which give the same ratios for aligned vs. duplicate-removed reads. I would be interested to know your thoughts.

Thanks,

Bruce.
Leave a comment:
Simon Anders replied

02-25-2013, 04:01 AM
Originally posted by crazyhottommy View Post

It looks like the HTSeq_example_data.tgz link is broken...can you please fix it?

Done.

Thanks for reporting it.
Leave a comment:
crazyhottommy replied

02-22-2013, 09:02 PM
Hi Simon,

It looks like the HTSeq_example_data.tgz link is broken...can you please fix it?

Thanks!
Tommy
Leave a comment:
sumanex replied

02-02-2013, 11:22 PM
tsspos

the lines of tsspos are as follows:

<GenomicPosition object '2':49762399, strand '-'>, <GenomicPosition object '2':26280379, strand '-'>, <GenomicPosition object '6':42236683, strand '+'>, <GenomicPosition object '6':47528008, strand '-'>, <GenomicPosition object '4':154836045, strand '+'>, <GenomicPosition object '2':136717343, strand '+'>, <GenomicPosition object '15':73649839, strand '-'>, <GenomicPosition object '7':148651406, strand '-'>, <GenomicPosition object '3':29912426, strand '-'>, <GenomicPosition object '15':91602775, strand '+'>, <GenomicPosition object '7':133709879, strand '-'>, <GenomicPosition object '4':101516627, strand '-'>, <GenomicPosition object '7':16456750, strand '-'>, <GenomicPosition object '3':90262860, strand '-'>,

however, i am getting plot of wincvg showing all values are not zero.
Attached Files

wincvg tss htseq.png (11.4 KB, 10 views)
Last edited by sumanex; 02-02-2013, 11:24 PM.
Leave a comment:
gringer replied

02-02-2013, 12:10 PM
Originally posted by sumanex View Post

>>> tsspos = set()
>>> for feature in gtffile:
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... tsspos.add( feature.iv.start_d_as_pos )
...
>>> p = HTSeq.GenomicPosition( "chr1", 133363801, "+" )
>>> p
<GenomicPosition object 'chr1':133363801, strand '+'>
>>> p in tsspos
False

You've printed the type of p, but not the type of feature.iv.start_d_as_pos -- I find this omission a bit strange. The 'in' operator checks for membership (I think with a deep equality check), and it may be that the precise GenomicPosition object that you've declared is not in that set. Perhaps there's a position HTSeq.GenomicPosition( "chr1", 133363801, "."), or a position HTSeq.GenomicPosition( "chr1", 133363800, "+"). It's hard to know for sure without this detail.

>>> list( coverage[window] )
>>> import numpy
...
>>> for p in tsspos:
... window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
... wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
... if p.strand == "+":
... profile += wincvg
... else:
... profile += wincvg[::-1]
...
Traceback (most recent call last):
...
IndexError: start too small

I guess it's possibly that wincvg and/or coverage[window] is empty, or the numpy.fromiter count value is too high for the size of coverage[window]. Could you print the first few entries of tsspos?
Leave a comment:
sumanex replied

02-02-2013, 08:43 AM
HtSeq TSS plots

Hi,
I am following the TSS plots code in HTSeq in ubunut 12.04 but getting error:

1) in step p in tsspos answer is false. but getting a wincvg plot in the matplot window. the gtf file is for mouse release64 ensembl.

Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> bamfile = HTSeq.BAM_Reader( "galaxy58.bam" )
>>> bamfile
<HTSeq.BAM_Reader object at 0xb721afac>
>>> gtffile = HTSeq.GFF_Reader( "Mus_musculus.NCBIM37.64.gtf" )
>>> coverage = HTSeq.GenomicArray( "auto", stranded=False, typecode="i" )
>>> coverage
<HTSeq._HTSeq.GenomicArray object at 0x9065b0c>
>>> for almnt in bamfile:
... if almnt.aligned:
... coverage[ almnt.iv ] += 1
...
>>> import itertools
>>> for feature in itertools.islice( gtffile, 100):
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... print feature.attr["gene_id"], feature.attr["transcript_id"], feature.iv.start_d_as_pos
...
ENSMUSG00000000702 ENSMUST00000105216 NT_166433:11954/+
ENSMUSG00000078423 ENSMUST00000105217 NT_166433:28586/+
ENSMUSG00000078424 ENSMUST00000105218 NT_166433:47744/+
ENSMUSG00000078424 ENSMUST00000105219 NT_166433:47927/+
ENSMUSG00000071964 ENSMUST00000096694 NT_166402:12542/+
ENSMUSG00000091539 ENSMUST00000165255 18:3123464/-
ENSMUSG00000063889 ENSMUST00000151311 18:3327588/-
>>> tsspos = set()
>>> for feature in gtffile:
... if feature.type == "exon" and feature.attr["exon_number"] == "1":
... tsspos.add( feature.iv.start_d_as_pos )
...

>>>
>>> p = HTSeq.GenomicPosition( "chr1", 133363801, "+" )
>>> p
<GenomicPosition object 'chr1':133363801, strand '+'>
>>> p in tsspos
False

>>> halfwinwidth = 3000
>>> window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
>>> window
<GenomicInterval object 'chr1', [133360801,133366801), strand '.'>
>>> list( coverage[window] )
>>> import numpy
>>> wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
>>> wincvg
array([0, 0, 0, ..., 0, 0, 0])
>>> from matplotlib import pyplot
>>> pyplot.plot( wincvg )
[<matplotlib.lines.Line2D object at 0x10c8122c>]
>>> pyplot.show()

and 2) Index error start too small

>>> profile = numpy.zeros( 2*halfwinwidth, dtype='i' )
>>> profile
array([0, 0, 0, ..., 0, 0, 0])
^

>>> for p in tsspos:
... window = HTSeq.GenomicInterval( p.chrom, p.pos - halfwinwidth, p.pos + halfwinwidth, "." )
... wincvg = numpy.fromiter( coverage[window], dtype='i', count=2*halfwinwidth )
... if p.strand == "+":
... profile += wincvg
... else:
... profile += wincvg[::-1]
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "_HTSeq.pyx", line 520, in HTSeq._HTSeq.GenomicArray.__getitem__ (src/_HTSeq.c:10593)
File "_HTSeq.pyx", line 370, in HTSeq._HTSeq.ChromVector.__getitem__ (src/_HTSeq.c:7538)
IndexError: start too small

any help is appreciated
suman
Leave a comment:
sumanex replied

02-02-2013, 07:22 AM
matplot coverage

Hi,
I had to change the coordinates to chr1, 133363801,134605170

of cvg vector to get a decent plot.

thanks
suman
Leave a comment:
gringer replied

02-01-2013, 11:51 AM
Originally posted by sumanex

However, the following command is not showing any coverage only the coordinates and a blue horizotal line at 0.
>>> import matplotlib
>>> from matplotlib import pyplot
>>> pyplot.plot( list( cvg[ HTSeq.GenomicInterval( "chr2", 2000000, 3000000, "+" ) ] ) )
[<matplotlib.lines.Line2D object at 0x104751ac>]
>>> pyplot.show()

If this is all you have, it's not surprising that there's no coverage information, because you haven't put any in. The GenomicInterval object defines a region of a chromosome.

Do have to provide a gtf file also for showing the coverage.

GTF files are also region definition objects, rather than coverage objects. You need a GenomicArray in there, together with something that populates the array. The HTSeq documentation has good code examples. You need to get your head around what's going on before diving into code:

http://www-huber.embl.de/users/ander...-full-coverage
Leave a comment:
Jerry_Zhao replied

02-01-2013, 08:33 AM
End location of read in HTSeq-count

Hi Simon,

I am a user of the HTSeq-count.
It's a wonderful tool, and I really love it.

Nevertheless, I have a small question about it.
I am wondering how dose HTSeq-count define the end location (right site end) of a read from RNA-seq sam file?
I am not very familiar with Pathon, so I cannot well understand the source code.

For single-end strand-specific RNA-seq sam file, is the end location defined by the CIGAR?
That is, End= Start + ( No. of Ms) + (No. of Ns) + (No. of Ds) ?

For paired-end strand-specific RNA-seq sam file, how does HTSeq-count handle this?
Is the start location (left site) defined as the smaller one of the two start loci within a pair?
And the end location (right site end) defined as the larger one of the two end sites within a pair?

Many thanks and best regards,
Jerry
Leave a comment:
Simon Anders replied

01-21-2013, 07:33 AM
You are supposed to pass first the SAM file, then the GFF file.
Leave a comment:
ajay87 replied

01-21-2013, 02:54 AM
Hello Dr. Simon,
I tried to use HTseq_count function to find no of reads.
Data of D. melanogaster; ref genome from iGenomes (Ensembl).
Command typed >>
htseq-count -m intersection-nonempty -s no -t exon -i gene_id -o samout -q /home/ajay/bin2/binp/dmel-all-r5.48.gff /home/ajay/bin2/binp/C2_R1_thout/outc2r1.sam

Error occured in line 1 of file /home/ajay/bin2/binp/C2_R1_thout/outc2r1.sam.
Error: need more than 3 values to unpack
[Exception type: ValueError, raised in __init__.py:214]

Please suggest me where I am going wrong.
Ajay
Leave a comment:
Siva replied

01-12-2013, 03:47 PM
Ok I got an answer from another thread; I have now installed it successfully
Check this in case you are grappling with a similar problem

installing htseq on a mac - SEQanswers

http://seqanswers.com/forums/showthread.php?t=23768

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
Leave a comment:
Siva replied

01-12-2013, 02:20 PM
Clang errors when trying to build HTseq

Hi All
I have a mac (OSx10.8.2) and get the following error when trying to build HT seq (0.5.3p9). Looks like a gcc issue...Any suggestions please?

running build
running build_py
running build_ext
building 'HTSeq._StepVector' extension
clang -fno-strict-aliasing -fno-common -dynamic -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/StepVector_wrap.cxx -o build/temp.macosx-10.8-intel-2.7/src/StepVector_wrap.o -w
clang++ -bundle -undefined dynamic_lookup -Wl,-F. -arch i386 -arch x86_64 build/temp.macosx-10.8-intel-2.7/src/StepVector_wrap.o src/step_vector.h -o build/lib.macosx-10.8-intel-2.7/HTSeq/_StepVector.so
clang: warning: treating 'c-header' input as 'c++-header' when in C++ mode, this behavior is deprecated
clang: error: cannot use 'precompiled-header' output with multiple -arch options
clang: error: cannot specify -o when generating multiple output files
error: command 'clang++' failed with exit status 1
Leave a comment:

Previous 1 2 3 4 5 6 7 13 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
- Channel: Articles
Yesterday, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 55 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News