  • lstbl
    Junior Member
    • Aug 2015
    • 7

    Biopython stops querying database after ~10 seconds

    Hi everyone,

    I am a novice at Biopython, but have gotten a few things to work so far. Previously, I used Biopython to pull nucleotide and protein sequences for a number of genes that were differentially expressed in my RNA-seq analysis. I am now trying to perform GO analysis on my dataset, and am trying to use Biopython to gather the Entrez gene IDs (needed for the gene-2-go annotations in the GO analysis R package) from the GenBank nucleotide IDs.

    My script seems to be working fine, but the problem comes after about 10-60 s of running. At that point, it appears to stop querying the database and becomes "stuck". I've attempted to wrap the query in a try-except block for when it gets stuck, but this doesn't seem to work. I'll post my code below, along with the traceback I get after pressing Ctrl-C to exit the program.

    NOTE: my output file is correct up to the point where biopython stops querying the database. Every run gets "stuck" at a different point, so I don't think there is anything wrong with my files.


    The file that needs to be parsed looks like this:
    >ABCA2|NM_001606.4
    ATGGGC...TGA
    >ABHD15|NM_198147.2
    ATGCCG...TAG

    etc...

    the output file will be identical, but with the additional Entrez IDs after the genbank IDs, e.g.:
    >ABCA2|NM_001606.4|20
    etc...

    my code:
    Code:
    from Bio import Entrez
    import glob
    
    # NCBI asks for a contact address with every E-utilities request.
    Entrez.email = "[email protected]"
    
    for path in glob.glob("*_cds.fas"):
        print("working on %s" % path)
        n = 0
        # Write an annotated copy of each FASTA file alongside the original.
        with open(path) as ofile, open(path + "_entrez", "w") as wfile:
            for line in ofile:
                if line.startswith(">"):
                    # Header looks like ">ABCA2|NM_001606.4"; field 1 is the accession.
                    line = [x.strip() for x in line.split("|")]
                    handle = Entrez.esearch(db="gene", term=line[1].strip())
                    EntrezID = Entrez.read(handle)
                    handle.close()
                    # Take the first hit; raises IndexError if the search finds nothing.
                    EntrezID = EntrezID["IdList"][0] + "\n"
                    wfile.write("|".join(line + [EntrezID]))
                    n += 1
                    if n % 100 == 0:
                        print("processed %s sequences" % n)
                else:
                    # Sequence lines are copied through unchanged.
                    wfile.write(line)
        print("finished, processed %s entries" % n)
    and the error:

    Code:
    KeyboardInterrupt                         Traceback (most recent call last)
    /Users/XXX/Desktop/XXX/XXX/XXX/XXX/Add_Entrez_IDs.py in <module>()
         21         if line.startswith(">"):
         22             line = [x.strip() for x in line.split("|")]
    ---> 23             handle = Entrez.esearch(db="gene",term=line[1].strip())
         24             EntrezID = Entrez.read(handle)
         25             EntrezID = EntrezID["IdList"][0]+"\n"
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/Entrez/__init__.pyc in esearch(db, term, **keywds)
        187                  'term': term}
        188     variables.update(keywds)
    --> 189     return _open(cgi, variables)
        190 
        191 
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/Entrez/__init__.pyc in _open(cgi, params, post)
        464             # HTTP GET
        465             cgi += "?" + options
    --> 466             handle = _urlopen(cgi)
        467     except _HTTPError as exception:
        468         raise exception
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
        152     else:
        153         opener = _opener
    --> 154     return opener.open(url, data, timeout)
        155 
        156 def install_opener(opener):
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
        429             req = meth(req)
        430 
    --> 431         response = self._open(req, data)
        432 
        433         # post-process response
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in _open(self, req, data)
        447         protocol = req.get_type()
        448         result = self._call_chain(self.handle_open, protocol, protocol +
    --> 449                                   '_open', req)
        450         if result:
        451             return result
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
        407             func = getattr(handler, meth_name)
        408 
    --> 409             result = func(*args)
        410             if result is not None:
        411                 return result
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in http_open(self, req)
       1225 
       1226     def http_open(self, req):
    -> 1227         return self.do_open(httplib.HTTPConnection, req)
       1228 
       1229     http_request = AbstractHTTPHandler.do_request_
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
       1192 
       1193         try:
    -> 1194             h.request(req.get_method(), req.get_selector(), req.data, headers)
       1195         except socket.error, err: # XXX what error?
       1196             h.close()
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in request(self, method, url, body, headers)
       1051     def request(self, method, url, body=None, headers={}):
       1052         """Send a complete request to the server."""
    -> 1053         self._send_request(method, url, body, headers)
       1054 
       1055     def _set_content_length(self, body, method):
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in _send_request(self, method, url, body, headers)
       1091         for hdr, value in headers.iteritems():
       1092             self.putheader(hdr, value)
    -> 1093         self.endheaders(body)
       1094 
       1095     def getresponse(self, buffering=False):
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in endheaders(self, message_body)
       1047         else:
       1048             raise CannotSendHeader()
    -> 1049         self._send_output(message_body)
       1050 
       1051     def request(self, method, url, body=None, headers={}):
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in _send_output(self, message_body)
        891             msg += message_body
        892             message_body = None
    --> 893         self.send(msg)
        894         if message_body is not None:
        895             #message_body was not a string (i.e. it is a file) and
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in send(self, data)
        853         if self.sock is None:
        854             if self.auto_open:
    --> 855                 self.connect()
        856             else:
        857                 raise NotConnected()
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in connect(self)
        830         """Connect to the host and port specified in __init__."""
        831         self.sock = self._create_connection((self.host,self.port),
    --> 832                                            self.timeout, self.source_address)
        833 
        834         if self._tunnel_host:
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in create_connection(address, timeout, source_address)
        564             if source_address:
        565                 sock.bind(source_address)
    --> 566             sock.connect(sa)
        567             return sock
        568 
    
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in meth(name, self, *args)
        226 
        227 def meth(name,self,*args):
    --> 228     return getattr(self._sock,name)(*args)
        229 
        230 for _m in _socketmethods:
    
    KeyboardInterrupt:
    any help would be great!
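
The traceback above ends inside `sock.connect`, which suggests the script is blocked waiting for a connection rather than failing. With no socket timeout set, a blocked connection attempt raises nothing, so a try-except has nothing to catch. The following is a minimal sketch of one way around that, assuming a global socket timeout is acceptable for the whole script. `call_with_retries` is a hypothetical helper (not part of Biopython), and the timeout, retry count, and delay are arbitrary values.

```python
import socket
import time

# Assumption: 30 s is an arbitrary choice. Without a default timeout, a
# blocked connect() can wait indefinitely and never raises an exception.
socket.setdefaulttimeout(30)


def call_with_retries(func, retries=3, delay=5.0, sleep=time.sleep):
    """Call func(); on a network error wait `delay` seconds and retry.

    `call_with_retries` is a hypothetical helper, not part of Biopython.
    The last error is re-raised once `retries` attempts have failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except (socket.timeout, IOError):
            if attempt == retries:
                raise
            sleep(delay)
```

In the loop from the post, the query could then be wrapped as `call_with_retries(lambda: Entrez.read(Entrez.esearch(db="gene", term=line[1])))`, so that a stalled request times out and is retried instead of hanging.
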
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    The slowdown might be the NCBI throttling your searches. Have you looked into using elink rather than esearch? If that is possible, you should be able to submit batches of queries at once.

    I suspect, however, that there is a more appropriate way to do this: you can probably download all the accessions for human genes in one go...
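
The batching idea can be sketched independently of the exact E-utility used. In this rough illustration, `chunked` and `lookup_in_batches` are hypothetical names, and `lookup_batch` stands in for whatever single request resolves many accessions at once (for example, something built on `Entrez.elink`). Whether elink accepts these accessions directly, and what its results look like, is not established in this thread, so that part is left as a placeholder. The batch size of 200 is also an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_in_batches(accessions, lookup_batch, size=200):
    """Resolve accessions with one request per batch instead of one per ID.

    `lookup_batch` is a placeholder: any callable that takes a list of
    accessions and returns a dict mapping accession -> gene ID.
    """
    results = {}
    for batch in chunked(list(accessions), size):
        results.update(lookup_batch(batch))
    return results
```

The point of the design is that the number of HTTP requests drops from one per sequence to one per batch, which leaves far fewer opportunities for a single request to stall.
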


    • lstbl
      Junior Member
      • Aug 2015
      • 7

      #3
      Yeah, I thought that could be it, too. However, I never received an email from NCBI saying that I was pinging them too fast. (According to the Biopython cookbook tutorial, they will send you an email if they are limiting your access.)

      Oh well, I'll figure something else out. It's fairly trivial to parse a .gff file to pull entrez gene IDs. Thanks for your help!
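
For reference, the GFF route could look roughly like the sketch below. It assumes an NCBI-style GFF3 file whose ninth column carries a `Dbxref` attribute with both `GeneID:` and `Genbank:` entries, which is an assumption about the annotation file and not something shown in this thread. `gene_ids_from_gff` is a hypothetical function name.

```python
def gene_ids_from_gff(lines):
    """Map GenBank accession -> Entrez GeneID from GFF3 attribute columns.

    Assumption: column 9 carries an NCBI-style attribute such as
    "Dbxref=GeneID:20,Genbank:NM_001606.4"; other layouts need other parsing.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 feature line
        for attribute in fields[8].split(";"):
            if not attribute.startswith("Dbxref="):
                continue
            gene_id = accession = None
            for xref in attribute[len("Dbxref="):].split(","):
                database, _, value = xref.partition(":")
                if database == "GeneID":
                    gene_id = value
                elif database == "Genbank":
                    accession = value
            if gene_id and accession:
                mapping[accession] = gene_id
    return mapping
```

Because the function only iterates over lines, an open file handle can be passed in directly, and the whole mapping is built in one pass with no network access.
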


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Have you considered the possibility that it may be your institutional firewall that is blocking access (not sure what port you are using)?


        • lstbl
          Junior Member
          • Aug 2015
          • 7

          #5
          I don't think so. I was previously able to pull sequences for all the genes in my analysis using basically the same code.

          Also, and I don't know if this matters, but the program does work, it just stops collecting data after a few seconds. I'm probably mistaken, but if my institution was blocking, wouldn't that mean I couldn't get any data at all?

          I also tried to do this at home with no success...

          Thanks for your reply


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Can you put in a pause after retrieving every 2-3 records to see if that helps? BTW: Which Entrez ID are you referring to? The example above must be a dummy.
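
A pause of this kind can be added without touching the loop body. This is a small sketch, assuming a fixed pause is enough to test the throttling theory. `throttled` is a hypothetical helper, and the defaults (every 3 items, 1 second) are arbitrary.

```python
import time


def throttled(items, every=3, pause=1.0, sleep=time.sleep):
    """Yield items unchanged, sleeping `pause` seconds after every `every` items.

    The values 3 and 1.0 are arbitrary; `sleep` can be swapped out for testing.
    """
    for count, item in enumerate(items, 1):
        yield item
        if count % every == 0:
            sleep(pause)
```

Iterating over `throttled(headers)` instead of a plain list of header lines (where `headers` is whatever sequence of records triggers one query each) inserts the pause between groups of requests. If the hang disappears with the pause in place, throttling becomes a more likely explanation.
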


            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Originally posted by lstbl
              Yeah, I thought that could be it, too. However, I never received an email from NCBI saying that I was pinging them too fast. (According to the Biopython cookbook tutorial, they will send you an email if they are limiting your access.)
              Well in theory, the NCBI says "The value of email will be used only to contact developers if NCBI observes requests that violate our policies, and we will attempt such contact prior to blocking access."

              Originally posted by lstbl
              Oh well, I'll figure something else out. It's fairly trivial to parse a .gff file to pull entrez gene IDs. Thanks for your help!
              If you are dealing with 1000s of IDs, this ought to be far more reliable and faster than making all those online requests.
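
Once a local accession-to-gene-ID table exists (from a GFF file or a bulk download), annotating the FASTA headers needs no online requests at all. This is a minimal sketch under that assumption. `annotate_header` and `annotate_fasta` are hypothetical names, the `"NA"` placeholder for unknown accessions is an arbitrary choice, and headers are assumed to follow the `>SYMBOL|ACCESSION` layout shown in the first post.

```python
def annotate_header(header, gene_ids, missing="NA"):
    """Append the Entrez gene ID to a ">SYMBOL|ACCESSION" FASTA header.

    `gene_ids` is a dict of accession -> gene ID built beforehand;
    `missing` is a placeholder (an assumption) for unknown accessions.
    """
    fields = [x.strip() for x in header.split("|")]
    fields.append(gene_ids.get(fields[1], missing))
    return "|".join(fields) + "\n"


def annotate_fasta(lines, gene_ids):
    """Yield FASTA lines with every header annotated; sequences pass through."""
    for line in lines:
        yield annotate_header(line, gene_ids) if line.startswith(">") else line
```

Each header costs one dictionary lookup, so thousands of sequences are processed in well under a second, and a missing accession is flagged in the output rather than aborting the run.
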
