HTSeq: A Python framework to work with high-throughput sequencing data
-
Hi Thomas
Originally posted by Thomas Doktor:
I'm trying to use htseq-count version 0.4.2-p3 on a SAM file produced by TopHat and an hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?
As for the warnings about improper pairs: Have you sorted your SAM file before calling htseq-count? This is necessary to make sure that the read pairs appear in adjacent lines (see man page).
Simon
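The adjacency requirement Simon mentions can be checked before running htseq-count. The following is a minimal sketch, not part of HTSeq: the function name `find_unpaired_reads` is hypothetical, and it assumes that mates share exactly the same read name (no /1 or /2 suffixes) and that each read has a single alignment line.

```python
def find_unpaired_reads(sam_lines):
    """Return names of paired-flagged reads whose mate is not on the adjacent line.

    Assumes mates carry identical names and one alignment line per read.
    """
    unpaired = []
    pending = None  # name of a paired read still waiting for its mate
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & 0x1:  # read is not flagged as paired
            if pending is not None:
                unpaired.append(pending)
                pending = None
            continue
        if pending is None:
            pending = name
        elif pending == name:
            pending = None  # mate found on the adjacent line
        else:
            unpaired.append(pending)
            pending = name
    if pending is not None:
        unpaired.append(pending)
    return unpaired


example = [
    "@HD\tVN:1.0\n",
    "r1\t99\tchr1\t100\t255\t4M\t=\t200\t104\tACGT\tIIII\n",
    "r1\t147\tchr1\t200\t255\t4M\t=\t100\t-104\tACGT\tIIII\n",
    "r2\t73\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t99\tchr1\t400\t255\t4M\t=\t500\t104\tACGT\tIIII\n",
    "r3\t147\tchr1\t500\t255\t4M\t=\t400\t-104\tACGT\tIIII\n",
]
print(find_unpaired_reads(example))
```

In this example only r2 is reported, because it is flagged as paired but no line with the same name follows it. Reads with several alignment lines would need extra handling.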
-
Hi Simon
Thanks for the updated source. I did sort my SAM file prior to analysis, and although most read pairs seem to be in adjacent lines, some reads are lacking a mate. I suspect this is because TopHat does not discard unmated reads. The question is whether I should remove these unmated reads, or whether the script includes them in the read count and merely displays a warning.
As it turns out, my GFF file is lacking the mitochondrial encoded genes.
On another note, the script seems to read the GFF file before checking whether the SAM file exists. As my GFF file is quite large, it takes a couple of minutes for the script to exit when I forget, as sometimes happens, to supply an existing SAM file. It would be nice to have the script check that the files exist first and exit immediately on error.
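The fail-fast check requested here could look roughly like the sketch below. It is an illustration only, not how htseq-count is implemented; `missing_files` and `validate_inputs` are hypothetical helper names.

```python
import os


def missing_files(paths):
    """Return the subset of paths that are not existing regular files."""
    return [p for p in paths if not os.path.isfile(p)]


def validate_inputs(sam_path, gff_path):
    """Raise FileNotFoundError before any expensive parsing starts."""
    missing = missing_files([sam_path, gff_path])
    if missing:
        raise FileNotFoundError("Input file(s) not found: " + ", ".join(missing))
```

A wrapper script would call `validate_inputs(sam_path, gff_path)` as its first step, so a mistyped file name is reported within milliseconds rather than after the large GFF file has been loaded.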
-
Hi Simon,
I am using htseq-count with the -q option. However, I'm still getting warnings telling me that htseq-count encountered a read that has been aligned to a chromosome which did not appear in the GFF file. Any ideas on how to resolve this?
The command I'm using is:
htseq-count -q <sam_file> <gff_file>
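To see which chromosomes trigger this kind of warning, the reference names in the SAM file can be compared with the sequence names in the GFF file. The sketch below is independent of HTSeq; all function names are hypothetical. A frequent cause of such mismatches is a naming difference such as "chr1" versus "1", or annotation that omits some sequences (for example the mitochondrial genome).

```python
def sam_reference_names(sam_lines):
    """Collect the RNAME column (third field) of all aligned SAM records."""
    names = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        if len(fields) > 2 and fields[2] != "*":  # "*" marks unaligned reads
            names.add(fields[2])
    return names


def gff_sequence_names(gff_lines):
    """Collect the seqname column (first field) of all GFF feature lines."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def chromosomes_missing_from_gff(sam_lines, gff_lines):
    """Return sorted reference names used in the SAM file but absent from the GFF."""
    return sorted(sam_reference_names(sam_lines) - gff_sequence_names(gff_lines))


sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchrM\t20\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
gff = [
    "# example annotation\n",
    "chr1\tensembl\texon\t5\t50\t.\t+\t.\tgene_id \"G1\";\n",
]
print(chromosomes_missing_from_gff(sam, gff))
```

For real files, pass open file objects instead of lists, for instance `chromosomes_missing_from_gff(open(sam_path), open(gff_path))`.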
-
Originally posted by joro:
I am using htseq-count with the -q option. However, I'm still getting warnings telling me that htseq-count encountered a read that has been aligned to a chromosome which did not appear in the GFF file.
Simon
-
Dear Simon,
thanx for this package.
So far everything works, except when I run htseq-count with a TopHat output SAM file as input and a RefSeq GFF file that has worked just fine with TopHat.
This is the error I am getting:
Code:
Error: invalid literal for int() with base 10: '0.000000' [Exception type: ValueError, raised in __init__.py:200]
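The message says that a value of '0.000000' was found where an integer was expected. The thread does not state which column was responsible, so the sketch below simply scans a GFF file for fields that the GFF format defines as integers (start, end) or as one of ".", "0", "1", "2" (frame) and reports any that deviate. The function name `suspicious_gff_fields` is hypothetical.

```python
def suspicious_gff_fields(gff_lines):
    """Return (line_number, column_label, value) for fields that look malformed."""
    problems = []
    for number, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            problems.append((number, "columns", str(len(fields))))
            continue
        # start and end must be plain non-negative integers
        for label, value in (("start", fields[3]), ("end", fields[4])):
            if not value.isdigit():
                problems.append((number, label, value))
        # frame must be ".", "0", "1" or "2"
        if fields[7] not in (".", "0", "1", "2"):
            problems.append((number, "frame", fields[7]))
    return problems


example_gff = [
    "chr1\trefseq\texon\t100\t200\t.\t+\t.\tgene_id \"A\";\n",
    "chr1\trefseq\texon\t300\t400\t.\t+\t0.000000\tgene_id \"A\";\n",
]
print(suspicious_gff_fields(example_gff))
```

In this made-up example the second line is reported because its frame column holds '0.000000'. Whether that is what happened in the file discussed here is an assumption.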
-
Hi
Originally posted by marcora:
This is the error I am getting:
Code:
Error: invalid literal for int() with base 10: '0.000000' [Exception type: ValueError, raised in __init__.py:200]
Cheers
Simon
-
Thanx a lot Simon.
One more thing. It is unclear from your comments here and from the online documentation whether HTSeq handles GTF and GFF interchangeably. I am new to this bioinformatics business, and already all these formats are giving me a headache, especially when GTF files are easily available but no standard, robust GTF-to-GFF converter is.
Cheers
-
Hi Marcora
Yes, there is a very robust GTF->GFF converter available: Just don't do anything, because every GTF file is a GFF file as well.
GTF is a tightening of the GFF specification. This means: if your file has tab-separated fields with the contents <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments], it is a GFF file. The GFF specs are a bit lax about how certain columns are to be filled. Should the ID in the attributes field be called "ID" or "gene_ID" or "gene"? Which words should be used in the feature column? If you want a general format, it is hard to give clear rules, but once you have agreed that you want to describe not just any kind of feature but specifically gene models, you can be more explicit. This is what the GTF specification does: it explains precisely what a GFF file should look like if it is used to describe gene models, and if a GFF file follows these rules, it is called a GTF file.
Specifically for htseq-count: If you want to count reads in genes and have a GTF file, you can use it out of the box. If you want to count reads in some other kind of feature, and your GFF file hence cannot follow the GTF specs, you have to tell htseq-count which feature types it should use and how the field with the ID is named. (By default, it takes the lines with feature type "exon" and looks for the ID in the attribute field "gene_id", which is what makes sense for GTF files.)
I hope that clarifies it.
Simon
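The selection rule Simon describes (take the lines of one feature type and read the ID from one named attribute) can be illustrated with a short sketch. This is not HTSeq's parser: `parse_attributes` and `feature_ids` are hypothetical names, and the attribute parsing is simplified (it does not handle semicolons inside quoted values). The defaults "exon" and "gene_id" are the ones mentioned in the post.

```python
import re

# GTF-style attribute: key "value"
_GTF_PAIR = re.compile(r'^(\S+)\s+"(.*)"$')


def parse_attributes(attr_field):
    """Parse the attributes column into a dict (GTF or key=value style)."""
    attrs = {}
    for part in attr_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        match = _GTF_PAIR.match(part)
        if match:
            key, value = match.group(1), match.group(2)
        elif "=" in part:
            key, value = part.split("=", 1)
        else:
            key, _, value = part.partition(" ")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def feature_ids(gff_lines, feature_type="exon", id_attribute="gene_id"):
    """Return the ID of every line whose feature column equals feature_type."""
    ids = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        attrs = parse_attributes(fields[8])
        if id_attribute in attrs:
            ids.append(attrs[id_attribute])
    return ids


gtf = [
    'chr1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tensembl\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
]
gff = ["chr1\tcustom\tpeak\t10\t20\t.\t+\t.\tID=P7;Name=my peak\n"]

print(feature_ids(gtf))
print(feature_ids(gff, feature_type="peak", id_attribute="ID"))
```

The first call uses the GTF defaults; the second shows the case where a non-GTF GFF file needs an explicit feature type and ID attribute name.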
-
Dear Simon,
thanx for the clear explanation. Is it the lax part of GFF that sometimes makes going from GTF to GFF "difficult", for example when a piece of software requires GFF with specific "comments" (tophat?)?
Thanx again for your time and consideration,
Dado
-
Hi Simon,
I was trying to use htseq-qa to assess the technical quality of my aligned SAM file, but I encountered the following error. When I ran the command on my solexa-fastq file, however, I got the quality plot successfully. My SAM file was generated by bwa-0.5.7.
$htseq-qa -t sam q -r 30 s_8.sam
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.6/bin/htseq-qa", line 5, in <module>
pkg_resources.run_script('HTSeq==0.4.3-p4', 'htseq-qa')
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 489, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 1207, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/HTSeq-0.4.3_p4-py2.6-macosx-10.3-fat.egg/EGG-INFO/scripts/htseq-qa", line 5, in <module>
HTSeq.scripts.qa.main()
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/HTSeq-0.4.3_p4-py2.6-macosx-10.3-fat.egg/HTSeq/scripts/qa.py", line 124, in main
r.add_qual_to_count_array( qual_arr_A )
File "_HTSeq.pyx", line 715, in _HTSeq.SequenceWithQualities.add_qual_to_count_array (src/_HTSeq.c:12251)
File "_HTSeq.pyx", line 734, in _HTSeq.SequenceWithQualities.add_qual_to_count_array (src/_HTSeq.c:12169)
ValueError: Too large quality value encountered.
$ htseq-qa -t solexa-fastq -r 30 s_8_sequence.txt
(This time, with the FASTQ file, it works.)
I am not sure whether this is a problem with my BWA alignment or with htseq-qa. Any input would be very much appreciated!
Yuan
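One possible explanation, which the thread does not confirm, is a quality-encoding mismatch: the SAM format stores qualities as Phred+33, whereas Solexa and early Illumina FASTQ files use an offset of 64. If the quality strings were copied into the SAM file unconverted, a reader that subtracts 33 sees values that are too large. The sketch below inspects the quality column and applies a rough heuristic suited to Illumina data of that period; the function names and thresholds are assumptions for illustration.

```python
def quality_char_range(sam_lines):
    """Return (lowest, highest) ASCII code seen in the SAM QUAL column, or None."""
    lo, hi = None, None
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[10] == "*":
            continue  # no quality string stored
        codes = [ord(c) for c in fields[10]]
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return None if lo is None else (lo, hi)


def guess_offset(lo, hi):
    """Heuristic: 33, 64, or None if the range fits both encodings."""
    if lo < 59:   # below the lowest Solexa character (';'), so offset 33
        return 33
    if hi > 74:   # above 'J' (Phred 41 at offset 33), so probably offset 64
        return 64
    return None


records = [
    "r1\t0\tchr1\t10\t37\t4M\t*\t0\t0\tACGT\thhhh\n",
    "r2\t0\tchr1\t20\t37\t4M\t*\t0\t0\tACGT\tgfeh\n",
]
char_range = quality_char_range(records)
print(char_range, guess_offset(*char_range))
```

If the guess is 64 for a SAM file, the qualities were probably never converted, and the fix belongs in the alignment step rather than in htseq-qa.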
-
Hi Yuan
Originally posted by yh253:
ValueError: Too large quality value encountered.
If you did, and the large quality values are legitimate, I'd be interested to see your SAM file.
Simon
-
Hi Simon,
I am having difficulty running the htseq-qa script. I think I have installed HTSeq correctly, since I get no error message for the "import HTSeq" command. But when I give the command "htseq-qa -t sam accepted.sam", I get a syntax error. I have given the following export command in Unix:
export PYTHONPATH=$PYTHONPATH:/Library/Python/2.6/
Is this wrong? On giving the command "whereis python", I get /usr/local/python. I am confused.
Thank you
Abhijit
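When several Python installations coexist, a script can end up being run by a different interpreter than the one in which "import HTSeq" was tested, and a syntax error is one typical symptom of such a version mismatch. Whether that is the cause here is not established in the thread. The following sketch prints which interpreter is running and where HTSeq is loaded from; `describe_environment` is a hypothetical helper name.

```python
import sys


def describe_environment():
    """Report the running interpreter and the location of HTSeq, if importable."""
    info = {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "htseq_file": None,
        "htseq_version": None,
    }
    try:
        import HTSeq  # may be missing or built for another interpreter
        info["htseq_file"] = getattr(HTSeq, "__file__", None)
        info["htseq_version"] = getattr(HTSeq, "__version__", None)
    except Exception as error:  # report any import failure instead of crashing
        info["htseq_error"] = repr(error)
    return info


for key, value in sorted(describe_environment().items()):
    print(key, "=", value)
```

Running this with each interpreter found on the system, and comparing the result with the first line of the htseq-qa script, shows whether the script and the library belong to the same installation.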