  • Running HTSeq in parallel

    Hello,

    I am trying to process mapped reads in parallel. However, when I use a pool of workers (the multiprocessing package), I get the following error:

    Traceback (most recent call last):
    File "testmp.py", line 15, in <module>
    out = pool.map(repr, iter(sa), chunksize=1)
    File "/usr/lib/python2.7/multiprocessing/pool.py", line 228, in map
    return self.map_async(func, iterable, chunksize).get()
    File "/usr/lib/python2.7/multiprocessing/pool.py", line 531, in get
    raise self._value
    AttributeError: 'NoneType' object has no attribute 'name'


    Running serially (using just the built-in 'map' function) works fine.

    Do you know what could be wrong here?

    Thank you.

    ---

    Here is a simple script to reproduce the error. I am using HTSeq 0.6.1 and Python 2.7.3 on 64-bit Ubuntu 12.04.

    import HTSeq
    from multiprocessing import Pool

    # open the SAM file before either call uses it
    sa = HTSeq.SAM_Reader('test.sam')

    # this works for me
    map(repr, sa)

    # this does not work
    pool = Pool(processes=1)
    out = pool.map(repr, sa, chunksize=1)
    print list(out)



    • You asked me this before, but I didn't reply, right? Sorry about that, I was a bit overwhelmed with mails.

      The bad news is: I have no clue why it does not work; I have never worked with the multiprocessing package. But I agree that it would be nice if this worked.

      Maybe somebody else here has some idea?



      • Originally posted by Simon Anders
        You asked me this before, but I didn't reply, right? Sorry about that, I was a bit overwhelmed with mails.

        The bad news is: I have no clue why it does not work; I have never worked with the multiprocessing package. But I agree that it would be nice if this worked.

        Maybe somebody else here has some idea?
        My guess is that it is something trivial, perhaps a missing implementation of some part of the dictionary/list interface in one of the HTSeq data structures. I tried to look into the HTSeq source, but it seems to be machine-generated code, so I quickly gave up.



        • Yes, the C code is machine generated, but if you look at the pyx files which it is generated from, it should be clearer. Have a look at:



          • Originally posted by Simon Anders
            Yes, the C code is machine generated, but if you look at the pyx files which it is generated from, it should be clearer. Have a look at:
            http://www-huber.embl.de/users/ander...c/contrib.html
            Multiprocessing can work only with objects that can be pickled. SAM_Alignment cannot be pickled. I suspect this may be the reason it does not work.

            Objects must implement __getstate__ and __setstate__ functions in order to be pickled/unpickled. Would it be difficult to implement these functions?



            • Originally posted by superpyrin
              Multiprocessing can work only with objects that can be pickled. SAM_Alignment cannot be pickled. I suspect this may be the reason it does not work.
              I suppose you are right. Makes perfect sense.

              Objects must implement __getstate__ and __setstate__ functions in order to be pickled/unpickled. Would it be difficult to implement these functions?
              I don't think so. I would just need to find the time to do it.

              All one needs to do is take all the slots defined for the class in _HTSeq.SAM_Alignment, pack them into a tuple for __getstate__ and write them back for __setstate__.
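
The approach described above can be sketched in plain Python. This is only an illustration under stated assumptions: `Alignment` and its slot names are hypothetical stand-ins, not HTSeq's actual `SAM_Alignment` (which is a Cython extension type and may need additional care), and the sketch only shows the tuple-based `__getstate__`/`__setstate__` pattern surviving a pickle round trip.

```python
import pickle


class Alignment(object):
    """Hypothetical slot-based class; NOT HTSeq's SAM_Alignment."""
    __slots__ = ("read_name", "chrom", "start", "end", "strand")

    def __init__(self, read_name, chrom, start, end, strand):
        self.read_name = read_name
        self.chrom = chrom
        self.start = start
        self.end = end
        self.strand = strand

    def __getstate__(self):
        # Pack every slot into a tuple, in the order given by __slots__.
        return tuple(getattr(self, name) for name in self.__slots__)

    def __setstate__(self, state):
        # Write the values back into the slots in the same order.
        for name, value in zip(self.__slots__, state):
            setattr(self, name, value)


original = Alignment("read1", "chr", 0, 7, "+")
# Protocol 2 is available on both Python 2.7 and Python 3.
restored = pickle.loads(pickle.dumps(original, protocol=2))
print(restored.read_name, restored.chrom, restored.start, restored.end, restored.strand)
```

Since multiprocessing sends work items to workers by pickling them, an object that survives this round trip can in principle be passed through `pool.map`.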



              • Hi,
                I'm having trouble installing HTSeq.
                I pretty much followed the instructions, but when I try to run it, I get the following error:

                .local/lib/python2.7/site-packages/HTSeq-0.6.1-py2.7-linux-x86_64.egg/HTSeq/_HTSeq.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8

                Any help is appreciated!



                • I just got a result from HTSeq. It reports 7 counts for my gene of interest. However, in IGV I can easily see many more than 7 reads on this gene. My alignment is from STAR, and I used stringent parameters to exclude multi-mapping reads, so there should not be any multiply aligned reads in the output. I am really confused about this.

                  Any suggestion?


                  • I don't know where to report bugs, so I am posting here.
                    I think the start_d and end_d attributes of GenomicInterval have a bug.
                    With the SAM file below saved as sample.sam:
                    read1 0 chr 1 40 7M * 0 0 ATGGCGT AAAAAAA
                    read2 16 chr 1 40 7M * 0 0 ATGGCGT AAAAAAA

                    and:
                    >>> read1,read2 = list(itertools.islice(HTSeq.SAM_Reader('sample.sam'),2))

                    >>> read1
                    <SAM_Alignment object: Read 'read1' aligned to chr:[0,7)/+>

                    >>> read2
                    <SAM_Alignment object: Read 'read2' aligned to chr:[0,7)/->

                    >>> read1.iv.start,read1.iv.end,read1.iv.start_d,read1.iv.end_d
                    (0, 7, 0, 7)

                    >>> read2.iv.start,read2.iv.end,read2.iv.start_d,read2.iv.end_d
                    (0, 7, 6, -1)

                    The end_d of read2 is a negative coordinate! This behavior is mentioned in the documentation, but I think it is a bug rather than a feature.



                    • From the looks of it, the read locations are zero-based and open-ended on the right, so they don't include the "end" location in the list of base locations. For an end location of -2, that's a bit more concerning, otherwise it's just business as usual for how these things are done.
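
The values reported above are consistent with a convention in which start_d is the first base in the direction of the strand and end_d is one step past the last base in that direction. A minimal sketch in plain Python, assuming that convention; `directional_coords` is a hypothetical helper written for illustration and is not part of HTSeq.

```python
def directional_coords(start, end, strand):
    """Return (start_d, end_d) for a zero-based, half-open interval [start, end).

    Assumed convention: on the minus strand the interval is walked from
    end - 1 down to start, so the open "end" lies one step below start.
    """
    if strand == "-":
        return end - 1, start - 1
    return start, end


# Intervals from the two reads in the post: chr:[0,7)/+ and chr:[0,7)/-
print(directional_coords(0, 7, "+"))
print(directional_coords(0, 7, "-"))
```

Under this convention, an interval starting at position 0 on the minus strand necessarily gets end_d == -1, because the open end sits one position before the first base of the chromosome.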



                      • Hi Simon,

                        We have been encountering an error with htseq-count (v. 0.6.1p1) on alignment files that have SAM v.1.4 tags.

                        The specific error is

                        Code:
                        Unknown CIGAR code 'X' encountered
                        I found an old request about this error which does not appear to have been implemented in htseq-count yet : http://sourceforge.net/p/htseq/support-requests/22/

                        Are there plans to add support for SAM v.1.4 tags to htseq-count? For now we have been working around this by generating SAM v.1.3 tags.

                        Thanks.
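
As an illustration of the workaround mentioned above, the SAM v1.4 CIGAR operations '=' (sequence match) and 'X' (sequence mismatch) can be collapsed back into 'M' (alignment match) before counting. The sketch below is an assumption about how one might preprocess CIGAR strings; `collapse_cigar` is a hypothetical helper and is not part of htseq-count.

```python
import re

# One CIGAR element: a length followed by a single operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_cigar(cigar):
    """Rewrite '=' and 'X' as 'M' and merge adjacent runs of the same operation."""
    merged = []
    for length, op in CIGAR_ELEMENT.findall(cigar):
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            # Extend the previous run instead of emitting e.g. 3M1M3M.
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join("%d%s" % (length, op) for length, op in merged)


print(collapse_cigar("3=1X3="))
print(collapse_cigar("2S3=1X1I4="))
```

This loses the match/mismatch distinction, which does not matter for counting reads per feature but may matter for other analyses.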



                        • Plans, yes -- but I'm so overwhelmed with other things that it might take a while till I get to that. Sorry.
