Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq: A Python framework to work with high-throughput sequencing data

    Hi

    I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.

    With the many short-read aligners available now, HTS analysis seems simple. In practice, however, data often needs to be converted, tweaked, filtered, or otherwise pre-processed before they can be given to the aligner, and the results require similar processing to do the statistical analysis one needs.

    HTSeq is meant to render such tasks easy and convenient, and so act as a "glue" between aligners and other existing tools.

    Some examples of typical use cases for HTSeq:
    • Quality assessment of reads: Check the dependence of the proportions of base calls and quality scores on the position in the reads, stratify by alignment status.
    • Counting: How many reads fall onto each exon, or each gene? For such tasks, you may want to design and implement rules on how to deal with overlapping features or ambiguous assignments.
    • Calculating coverage: HTSeq helps you not only to produce a Wiggle file for visualization in a genome browser, but also to do customized statistics on this.
    • Multiple alignments: Many aligners can output multiple alignments for each read, but what to do with this? HTSeq makes it easy to implement post-processing to choose the right alignment according to your criteria.
    • Adapter trimming: In miRNA-Seq, you often sequence into the adapter at the other end and need to cut this off before aligning. In multiplexed sequencing, you may need to cut off and sort by the mutiplex tag.


    Have a look and give it a try: http://www-huber.embl.de/users/anders/HTSeq/

    To use HTSeq you only need a basic understanding of Python, as can be obtained by reading the first few chapters of a Python book.
    For users without programming knowledge, stand-alone scripts for common tasks are provided: htseq-count to count the overlap of reads with features (such as exons), htseq-qa to get a quick overview of the quality of your sequencing run, and htseq-bedgraph (coming soon) to convert an alignment file into a Bedgraph Wiggle file for visualization with a genome browser.

    For programmers, HTSeq has been designed to keep thing simple:
    • All classes have extensive reference documentation, and a tutorial demonstrates their use.
    • All supported file formats (Fasta, Fastq, SAM, SolexaPipeline files, GFF, GTF, etc.) can be read in a loop, providing an object describing one record at a time to the loop body. This object describes the data in a convenient and consistent way.
    • The 'GenomicArray' class is the Swiss army knife of HTSeq. It is a container that can efficiently store anything that has a position on the genome: integer number to represent coverage, objects with feature data to represent exons, sets of objects to handle overlapping features, etc.


    Please let me know what you think of it.

    Cheers
    Simon

  • #2
    Nice one, looks interesting! I like the look of the quality plots.

    Do you have any plans to try and integrate with the Biopython project? There's some obvious overlap that I can see.

    Comment


    • #3
      I'm trying to run htseq-qa, but get the following error:
      Traceback (most recent call last):
      File "/usr/local/bin/htseq-qa", line 5, in <module>
      pkg_resources.run_script('HTSeq==0.4.2-p1', 'htseq-qa')
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
      if dist.key not in keys2:
      File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script

      File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/EGG-INFO/scripts/htseq-qa", line 3, in <module>
      import HTSeq.scripts.qa
      File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 8, in <module>
      from _HTSeq import *
      ImportError: /usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/_HTSeq.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
      I've installed the dependencies and I'm running Python 2.6.4 - any idea what I could be doing wrong?

      Comment


      • #4
        Originally posted by Simon Anders View Post
        Hi

        I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.
        Nice one! Trying it ASAP!

        d

        Comment


        • #5
          The error I encountered was resolved by compiling the package from source, thanks for the help and the package Simon.

          Comment


          • #6
            Perfect!

            Side note - pip is an awesome easy_install replacement that works great with virtualenv. To use it to install: `pip install HTseq`
            --
            Senthil Palanisami

            Comment


            • #7
              Does this package supports reads in SOLiD Color Space?

              Comment


              • #8
                Hi Simon
                In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute. I feel I must supply a standard GFF/GTF annotation file from the database. I am not sure a comprehensive GTF file exists for maize.

                Comment


                • #9
                  Hi Siva

                  Originally posted by Siva View Post
                  In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute.
                  Thanks for the report, I've just investigated and corrected. The real issue is that in the cufflinks GTF file, the genes have no strand information, and hence, htseq-count has to be called with the '--stranded=no' information.

                  Sorry for the misleading error message. I've just uploaded a fix, which will now display the more helpful message Feature CUFF.1 at chr1:[1047,1108)/. does not have strand information but you are running htseq-count in stranded mode. Use '--stranded=no'..

                  Cheers
                  Simon

                  Comment


                  • #10
                    Hi

                    Originally posted by lvaruzza View Post
                    Does this package supports reads in SOLiD Color Space?
                    Yes and no. ;-)

                    There are no specific facilities for color space yet, but HTSeq should nevertheless be helpful to work with SOLiD data. For example, you can use the FastaReader and FastqReader classes can also be used to read in color-space files, and the GFF_Reader class can deal with the GFF files output by the WTAP aligner.

                    I don't have much experience with SOLiD data, but I'd be interested to collaborate with a SOLiD-using bioinformatician to fill in the gaps in order to make HTSeq really useful for colour-space data. I think not much is missing for this.

                    Cheers
                    Simon

                    Comment


                    • #11
                      Cool. Thanks. Will be back with questions.

                      Comment


                      • #12
                        I believe I found a bug or I did not understand the documentation.
                        maybe there is a better place for bug report but I couldn't find a mailing list of bug tracker
                        My uderstanding of the Sequence trimming is ti looks for a match (allowing mismatches) between the leftpart of the read and the right part of the adapter. but it seems to fail finding a match as long as the adapter.

                        Here are a few command that reproduce the issue:
                        Code:
                        >>> from HTSeq import Sequence
                        >>> read=Sequence('ACACGTTCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG')
                        >>> adapter1=Sequence('ACACGT')
                        >>> adapter2=Sequence('AACACGT')
                        >>> print read.seq.startswith(adapter1.seq)
                        True
                        >>> print read.seq.startswith(adapter2.seq)
                        False
                        >>> print read.trim_left_end(adapter1)
                        ACACGTTCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG
                        >>> print read.trim_left_end(adapter2)
                        TCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG
                        Last edited by tcezard; 04-26-2010, 03:41 AM. Reason: add code output

                        Comment


                        • #13
                          Hi tcezard

                          thanks for the bug report and the nice code example. I've found and fixed the bug and uploaded a new release path, version now 0.4.2-p3.

                          If you find further bugs, just send me an e-mail.

                          Cheers
                          Simon (anders at embl dot de )

                          Comment


                          • #14
                            Nice work. I will start using it. I'll report my experience!

                            Comment


                            • #15
                              I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
                              Error: 'tuple' object has no attribute 'read'
                              [Exception type: AttributeError, raised in count.py:100]
                              Is this an error in my sam file and if so how can I identify the read in question?
                              Last edited by Thomas Doktor; 04-29-2010, 09:32 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 02:44 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-11-2024, 06:55 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              110 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              117 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X