HTSeq: A Python framework to work with high-throughput sequencing data

ssharma replied

09-28-2011, 12:49 PM
Error while using htseq-count

Hi,
I am getting an error while using htseq-count.
I used the following command:
htseq-count CL1_forBWA_left_DSS3.sam NC_003911.gff

This is the error i am getting. I got all the files from ncbi.
Error occured in line 2169 of file NC_003911.gff.
Error: Feature tRNA does not contain a 'gene_id' attribute
[Exception type: SystemExit, raised in count.py:55]

I would really appreciate if anyone can help me out.

Thanks
Shalabh
Leave a comment:

Simon Anders replied

09-28-2011, 01:41 AM

For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

Code:

import sys, re
import HTSeq

insam = HTSeq.SAM_Reader( sys.stdin )

# Go through all reads, with their alignments bundled up:
for bundle in HTSeq.bundle_multiple_alignments( insam ):
   bestAlmt = None
   # Go through all alignments of a given read, looking
   # for the one with the best alignment score
   for almt in bundle:
      if bestAlmt is None:
         bestAlmt = almt
      elif almt.aQual > bestAlmt.aQual:
         bestAlmt = almt
      elif almt.aQual == bestAlmt:
         # If there are more than one best alignment, 
         # better skip the read
         bestAlmt = None
   if bestAlmt is not None:
      # Change the NH field to 1 and print the line
      print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),

Call this with something like

Code:

sort samfile.sam | python chooseBest.py > filtered.sam

.

Be aware hat I haven't really tested this thoroughly.

Leave a comment:

Tobias Kockmann replied

09-27-2011, 05:57 AM
That is easily doable - but this would imply that reads are counted several time as part of different alignments. To my knowledege such a way of counting is not compatible with the statistical model used by DESeq and this is actually what I would like to use the counts for.
Leave a comment:
joro replied

09-27-2011, 12:39 AM
Hi Tobi, could you write a script to change the NH flag to NH:i:1 in the sam file so that htseq-count doesn't ignore them?
Leave a comment:
Tobias Kockmann replied

09-26-2011, 07:42 AM
alignment_not_unique

Hi HTSeq users,

I realized that the htseq-count script is very conservative when counting reads matching to known gene models:

I generated an alignment of my RNA-seq reads to the Drosophila genome using tophat in combination with known Ensembl gene models (provided by gtf file). The output in sam format looks like this:

Code:

[tobiasko@bs-dsvr09 tophat_out]$ head accepted_hits.sam -1350550920:7:9:328:542 0 chr2L 8880 3 35M * 0 0 CTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACA 6IC@/IIIIIFABI2II1III3'E2>@I:,&'(31 NM:i:0 NH:i:2 CC:Z:chrX CP:i:19097273 XS:A:- -1350550920:7:77:1586:563 0 chr2L 8883 3 35M * 0 0 GACTCGCAAAGTGGACTTGTTCAGCAAGGACATAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 NH:i:2 CC:Z:chrX CP:i:19097270 XS:A:- -1350550920:7:24:1131:1918 0 chr2L 9327 3 35M * 0 0 TTTTACAAATCTTACGTAAACACTCCAAGCATGAA IIIIIIIIIIIIIIII?III=ICEI;EI:IIDII; NM:i:0 NH:i:2 CC:Z:chrX CP:i:19096819 XS:A:-

As indicated by the optional NH flag some of the reads are mapped to multiple locations (e.g. first line -> NH:i:2). My problem is, that htseq-counts ignores all reads/alignments that have been mapped to multiple locations by default, which in my case leads to a massive loss of coverage (>50%).

I wonder, if there is a way to either tune tophat to only report the best alignment per read like one can do when using bowtie with -k1 --best, or to modify the couting scheme of htseq-count to be less stringent (e.g. count every read once at the best alignment position).

Greetings,
Tobi
Leave a comment:
usad replied

08-15-2011, 12:32 PM
Install instructions for CentOS (RHEL)

Hi,

for what it's worth I detailed the additional steps necessary to install HTSeq on CentOS (should work on Red Head Enterprise as well) in RNA-Seq howto wiki. (Or rather in the first incomplete alpha version)

SEQanswers

http://seqanswers.com/wiki/How-to/RNASeq_analysis#Summarizing_Reads

Best Wishes,
Björn
Leave a comment:
fabrice replied

08-05-2011, 03:24 AM
I am runing DEXseq and htseq-count and have some problems. Maybe others are interesting to this or have the answer, so I post to here again.

When I run dexseq_count.py in DEXseq on my sam file from bwa, it always has this warning:

~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:543:
UserWarning: Malformed SAM line: RNAME != '*' although flag bit
&0x0004 set
algnt = SAM_Alignment.from_SAM_line( line )
~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:592:
UserWarning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have
an aligned mate which could not be found. (Is the SAM file properly
sorted?)
"which could not be found. (Is the SAM file properly sorted?)" )

but I can found GA2-EAS337_0017_FC:1:8:11202:16260#0 mate in the data,
as show in the attachments a.sam. And the a.sam is sorted by the name
(default by bwa).

In htseq-count, it also have this problem.

htseq-count a.sam Homo_sapiens.GRCh37.63/Homo_sapiens.GRCh37.63.gtf
100000 GFF lines processed.
200000 GFF lines processed.
...
2000000 GFF lines processed.
2064769 GFF lines processed.
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Warning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have an
aligned mate which could not be found. (Is the SAM file properly
sorted?)
4 sam line pairs processed.

Why here report 4 line pairs? In fact, there are 3 sam line pairs in a.sam.

BTW: Do this kind of count using BWA or tophat
mapping, which one is better? (Illumina pair-end human RNA-seq data)

Thank you.
Attached Files

a.sam.tar.gz (996 Bytes, 25 views)
Leave a comment:
joro replied

08-05-2011, 02:48 AM
Oops, I didn't realise that I had the quiet option on! Yes I'm now getting warnings about the chromosomes not appearing in the GFF file. I guess these reads are not included in the output SAM file? Thanks for your help Simon.
Leave a comment:
joro replied

08-05-2011, 01:15 AM
Thanks for your reply. No I didn't get any warnings about reads being skipped. I'll email the files to you.
Leave a comment:
Simon Anders replied

08-04-2011, 08:05 AM
Ho Joro,

did you get any warnings about reads being skipped? Otherwise, could you send me some examples of SAM lines that are in the input but not the output file, togetehr with your GTF file? Thanks.

Simon
Leave a comment:
joro replied

07-28-2011, 06:09 AM
Hi,

I have some reads in the original SAM file that don't seem to be accounted for in the output of htseq-count. To look in more detail I created a new SAM file using --samout. The reads aren't in this SAM file. Presumably they should be in the SAM file but assigned something like "no_feature"? Any suggestions are welcome, thanks.
Leave a comment:
Simon Anders replied

07-03-2011, 04:55 AM
In principle, you could follow the 'Tour' in the HTSeq documentation to learn how to do this yourself.

The htseq-count script falls into two parts: the main part doing the actual counting, and surrounding it a thin wrapper to get the intervals out of the SAM file. This is all that needs to be changed, and once I separate these into two modular parts, it will be even easier. Might take a while until I find the time to so, though.

S
Leave a comment:
sshen replied

06-18-2011, 06:23 PM
Will HTSeq be able to count reads from a bed file

Dear Simon,

My case is somehow unusual. I have a mapped sequence reads file in bed format. I wonder if you could point me out how I can use HTSeq for reads counts on gene basis. I understand that the bed file contains only interval data and doesn't have quality scores for each base. But in my case, I only need read counts. I have limited knowledge about python.

Your help will be greatly appreciated!

Best
Steve
Leave a comment:
Simon Anders replied

04-17-2011, 10:32 PM
The '-i' and '-t' options give you some flexibility what htseq-count is counted.

If this is insufficient, please remember that htseq-count was actually meant not so much as a tool in its own right but rather a demonstration of the possibilities of HTSeq, our Python framework for HTS analysis. As demonstrated in the documentation, it is fairly easy (if you know Python or are willing to learn it) to write your own counting script using HTSeq and then implement whatever custom logic you need for your use case.
Leave a comment:
adeonari replied

04-15-2011, 06:51 AM
Another quick question: is it possible with htseq-counts to exclude the reads from the 3' UTR?
Leave a comment:

Previous 1 3 4 5 6 7 8 9 13 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, Today, 08:06 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 13 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 26 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News