Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq: A Python framework to work with high-throughput sequencing data


    I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.

    With the many short-read aligners available now, HTS analysis seems simple. In practice, however, data often needs to be converted, tweaked, filtered, or otherwise pre-processed before they can be given to the aligner, and the results require similar processing to do the statistical analysis one needs.

    HTSeq is meant to render such tasks easy and convenient, and so act as a "glue" between aligners and other existing tools.

    Some examples of typical use cases for HTSeq:
    • Quality assessment of reads: Check the dependence of the proportions of base calls and quality scores on the position in the reads, stratify by alignment status.
    • Counting: How many reads fall onto each exon, or each gene? For such tasks, you may want to design and implement rules on how to deal with overlapping features or ambiguous assignments.
    • Calculating coverage: HTSeq helps you not only to produce a Wiggle file for visualization in a genome browser, but also to do customized statistics on this.
    • Multiple alignments: Many aligners can output multiple alignments for each read, but what to do with this? HTSeq makes it easy to implement post-processing to choose the right alignment according to your criteria.
    • Adapter trimming: In miRNA-Seq, you often sequence into the adapter at the other end and need to cut this off before aligning. In multiplexed sequencing, you may need to cut off and sort by the mutiplex tag.

    Have a look and give it a try:

    To use HTSeq you only need a basic understanding of Python, as can be obtained by reading the first few chapters of a Python book.
    For users without programming knowledge, stand-alone scripts for common tasks are provided: htseq-count to count the overlap of reads with features (such as exons), htseq-qa to get a quick overview of the quality of your sequencing run, and htseq-bedgraph (coming soon) to convert an alignment file into a Bedgraph Wiggle file for visualization with a genome browser.

    For programmers, HTSeq has been designed to keep thing simple:
    • All classes have extensive reference documentation, and a tutorial demonstrates their use.
    • All supported file formats (Fasta, Fastq, SAM, SolexaPipeline files, GFF, GTF, etc.) can be read in a loop, providing an object describing one record at a time to the loop body. This object describes the data in a convenient and consistent way.
    • The 'GenomicArray' class is the Swiss army knife of HTSeq. It is a container that can efficiently store anything that has a position on the genome: integer number to represent coverage, objects with feature data to represent exons, sets of objects to handle overlapping features, etc.

    Please let me know what you think of it.


  • #2
    Nice one, looks interesting! I like the look of the quality plots.

    Do you have any plans to try and integrate with the Biopython project? There's some obvious overlap that I can see.


    • #3
      I'm trying to run htseq-qa, but get the following error:
      Traceback (most recent call last):
      File "/usr/local/bin/htseq-qa", line 5, in <module>
      pkg_resources.run_script('HTSeq==0.4.2-p1', 'htseq-qa')
      File "build/bdist.linux-x86_64/egg/", line 489, in run_script
      if dist.key not in keys2:
      File "build/bdist.linux-x86_64/egg/", line 1207, in run_script

      File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/EGG-INFO/scripts/htseq-qa", line 3, in <module>
      File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/", line 8, in <module>
      from _HTSeq import *
      ImportError: /usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/ undefined symbol: PyUnicodeUCS2_DecodeUTF8
      I've installed the dependencies and I'm running Python 2.6.4 - any idea what I could be doing wrong?


      • #4
        Originally posted by Simon Anders View Post

        I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.
        Nice one! Trying it ASAP!



        • #5
          The error I encountered was resolved by compiling the package from source, thanks for the help and the package Simon.


          • #6

            Side note - pip is an awesome easy_install replacement that works great with virtualenv. To use it to install: `pip install HTseq`
            Senthil Palanisami


            • #7
              Does this package supports reads in SOLiD Color Space?


              • #8
                Hi Simon
                In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute. I feel I must supply a standard GFF/GTF annotation file from the database. I am not sure a comprehensive GTF file exists for maize.


                • #9
                  Hi Siva

                  Originally posted by Siva View Post
                  In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute.
                  Thanks for the report, I've just investigated and corrected. The real issue is that in the cufflinks GTF file, the genes have no strand information, and hence, htseq-count has to be called with the '--stranded=no' information.

                  Sorry for the misleading error message. I've just uploaded a fix, which will now display the more helpful message Feature CUFF.1 at chr1:[1047,1108)/. does not have strand information but you are running htseq-count in stranded mode. Use '--stranded=no'..



                  • #10

                    Originally posted by lvaruzza View Post
                    Does this package supports reads in SOLiD Color Space?
                    Yes and no. ;-)

                    There are no specific facilities for color space yet, but HTSeq should nevertheless be helpful to work with SOLiD data. For example, you can use the FastaReader and FastqReader classes can also be used to read in color-space files, and the GFF_Reader class can deal with the GFF files output by the WTAP aligner.

                    I don't have much experience with SOLiD data, but I'd be interested to collaborate with a SOLiD-using bioinformatician to fill in the gaps in order to make HTSeq really useful for colour-space data. I think not much is missing for this.



                    • #11
                      Cool. Thanks. Will be back with questions.


                      • #12
                        I believe I found a bug or I did not understand the documentation.
                        maybe there is a better place for bug report but I couldn't find a mailing list of bug tracker
                        My uderstanding of the Sequence trimming is ti looks for a match (allowing mismatches) between the leftpart of the read and the right part of the adapter. but it seems to fail finding a match as long as the adapter.

                        Here are a few command that reproduce the issue:
                        >>> from HTSeq import Sequence
                        >>> adapter1=Sequence('ACACGT')
                        >>> adapter2=Sequence('AACACGT')
                        >>> print read.seq.startswith(adapter1.seq)
                        >>> print read.seq.startswith(adapter2.seq)
                        >>> print read.trim_left_end(adapter1)
                        >>> print read.trim_left_end(adapter2)
                        Last edited by tcezard; 04-26-2010, 03:41 AM. Reason: add code output


                        • #13
                          Hi tcezard

                          thanks for the bug report and the nice code example. I've found and fixed the bug and uploaded a new release path, version now 0.4.2-p3.

                          If you find further bugs, just send me an e-mail.

                          Simon (anders at embl dot de )


                          • #14
                            Nice work. I will start using it. I'll report my experience!


                            • #15
                              I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
                              Error: 'tuple' object has no attribute 'read'
                              [Exception type: AttributeError, raised in]
                              Is this an error in my sam file and if so how can I identify the read in question?
                              Last edited by Thomas Doktor; 04-29-2010, 09:32 AM.


                              Latest Articles


                              • seqadmin
                                Latest Developments in Precision Medicine
                                by seqadmin

                                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                Somatic Genomics
                                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                05-24-2024, 01:16 PM
                              • seqadmin
                                Recent Advances in Sequencing Analysis Tools
                                by seqadmin

                                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                                05-06-2024, 07:48 AM





                              Topics Statistics Last Post
                              Started by seqadmin, 05-24-2024, 07:15 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 05-23-2024, 10:28 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 05-23-2024, 07:35 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 05-22-2024, 02:06 PM
                              0 responses
                              Last Post seqadmin