  • Biopython stops querying database after ~10 seconds

    Hi everyone,

    I am a novice at Biopython, but have gotten a few things to work so far. Previously, I used Biopython to pull nucleotide and protein sequences for a number of genes that were differentially expressed in my RNA-seq analysis. I am now trying to perform GO analysis on my dataset, and am trying to use Biopython to gather the Entrez gene IDs (needed for the gene-2-go annotations in the GO analysis R package) from the GenBank nucleotide IDs.

    My script seems to work fine at first, but the problem comes after about 10-60 s of running. At that point, it appears to stop querying the database and becomes "stuck". I've tried wrapping the query in a try-except block for when it gets stuck, but this doesn't seem to work. I'll post my code below along with the traceback I get after I hit Control-C to exit the program.

    NOTE: my output file is correct up to the point where biopython stops querying the database. Every run gets "stuck" at a different point, so I don't think there is anything wrong with my files.

    what the file looks like that needs to be parsed:


    the output file will be identical, but with the additional Entrez IDs after the genbank IDs, e.g.:

    my code:
    from Bio import Entrez
    import glob
    import re

    Entrez.email = "[email protected]"
    filenames = glob.glob("*_cds.fas")
    for file in filenames:
        print("working on %s" % file)
        n = 0  # header lines processed in this file
        ofile = open(file)
        wfile = open(file + "_entrez", 'w')
        for line in ofile:
            if line.startswith(">"):
                # split the FASTA header on "|"; field 1 holds the GenBank ID
                line = [x.strip() for x in line.split("|")]
                handle = Entrez.esearch(db="gene", term=line[1].strip())
                EntrezID = Entrez.read(handle)
                EntrezID = EntrezID["IdList"][0] + "\n"
                wfile.write('|'.join(x for x in line + [EntrezID]))
                n += 1
                if n % 100 == 0:
                    print("processed %s sequences" % n)
        ofile.close()
        wfile.close()
        print("finished, processed %s entries" % n)
    and the error:

    KeyboardInterrupt                         Traceback (most recent call last)
    /Users/XXX/Desktop/XXX/XXX/XXX/XXX/ in <module>()
         21         if line.startswith(">"):
         22             line = [x.strip() for x in line.split("|")]
    ---> 23             handle = Entrez.esearch(db="gene",term=line[1].strip())
         24             EntrezID = Entrez.read(handle)
         25             EntrezID = EntrezID["IdList"][0]+"\n"
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/Entrez/__init__.pyc in esearch(db, term, **keywds)
        187                  'term': term}
        188     variables.update(keywds)
    --> 189     return _open(cgi, variables)
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/Entrez/__init__.pyc in _open(cgi, params, post)
        464             # HTTP GET
        465             cgi += "?" + options
    --> 466             handle = _urlopen(cgi)
        467     except _HTTPError as exception:
        468         raise exception
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
        152     else:
        153         opener = _opener
    --> 154     return opener.open(url, data, timeout)
        156 def install_opener(opener):
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
        429             req = meth(req)
    --> 431         response = self._open(req, data)
        433         # post-process response
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in _open(self, req, data)
        447         protocol = req.get_type()
        448         result = self._call_chain(self.handle_open, protocol, protocol +
    --> 449                                   '_open', req)
        450         if result:
        451             return result
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
        407             func = getattr(handler, meth_name)
    --> 409             result = func(*args)
        410             if result is not None:
        411                 return result
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in http_open(self, req)
       1226     def http_open(self, req):
    -> 1227         return self.do_open(httplib.HTTPConnection, req)
       1229     http_request = AbstractHTTPHandler.do_request_
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
       1193         try:
    -> 1194             h.request(req.get_method(), req.get_selector(), req.data, headers)
       1195         except socket.error, err: # XXX what error?
       1196             h.close()
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in request(self, method, url, body, headers)
       1051     def request(self, method, url, body=None, headers={}):
       1052         """Send a complete request to the server."""
    -> 1053         self._send_request(method, url, body, headers)
       1055     def _set_content_length(self, body, method):
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in _send_request(self, method, url, body, headers)
       1091         for hdr, value in headers.iteritems():
       1092             self.putheader(hdr, value)
    -> 1093         self.endheaders(body)
       1095     def getresponse(self, buffering=False):
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in endheaders(self, message_body)
       1047         else:
       1048             raise CannotSendHeader()
    -> 1049         self._send_output(message_body)
       1051     def request(self, method, url, body=None, headers={}):
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in _send_output(self, message_body)
        891             msg += message_body
        892             message_body = None
    --> 893         self.send(msg)
        894         if message_body is not None:
        895             #message_body was not a string (i.e. it is a file) and
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in send(self, data)
        853         if self.sock is None:
        854             if self.auto_open:
    --> 855                 self.connect()
        856             else:
        857                 raise NotConnected()
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.pyc in connect(self)
        830         """Connect to the host and port specified in __init__."""
        831         self.sock = self._create_connection((self.host,self.port),
    --> 832                                            self.timeout, self.source_address)
        834         if self._tunnel_host:
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in create_connection(address, timeout, source_address)
        564             if source_address:
        565                 sock.bind(source_address)
    --> 566             sock.connect(sa)
        567             return sock
    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in meth(name, self, *args)
        227 def meth(name,self,*args):
    --> 228     return getattr(self._sock,name)(*args)
        230 for _m in _socketmethods:
    any help would be great!
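
A note on why the try-except may never fire: the traceback ends inside `sock.connect`, and a blocking connect with no timeout can wait indefinitely without raising anything. Below is a minimal sketch, assuming Python 3, of one way to turn a hang into an exception that can be retried. The helper name `call_with_retries` and its parameters are made up for illustration and are not part of Biopython.

```python
import socket
import time


def call_with_retries(func, attempts=3, wait=5.0, timeout=30.0, sleep=time.sleep):
    """Call func() under a global socket timeout, retrying on network errors.

    `func` is any zero-argument callable, for example
    lambda: Entrez.esearch(db="gene", term=some_id).
    """
    previous = socket.getdefaulttimeout()
    # Sockets created from now on give up after `timeout` seconds
    # instead of blocking forever.
    socket.setdefaulttimeout(timeout)
    try:
        for attempt in range(1, attempts + 1):
            try:
                return func()
            except OSError:  # covers socket.timeout and urllib's URLError
                if attempt == attempts:
                    raise
                sleep(wait * attempt)  # wait a little longer after each failure
    finally:
        # Put the previous default back so other code is unaffected.
        socket.setdefaulttimeout(previous)
```

The 30 second timeout and three attempts are arbitrary starting values. If the hang comes from throttling on the server side, retrying alone will probably not be enough and the requests would also need to be slowed down.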

  • #2
    The slowdown might be the NCBI throttling your searches. Have you looked into using elink rather than esearch? If that is possible, you should be able to submit batches of queries at once.

    I suspect, however, there is a more appropriate way to do this; you can probably download all the accessions for human genes in one go...
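
For reference, the elink suggestion could look roughly like the sketch below (Python 3). `Entrez.elink` and `Entrez.read` are real Biopython calls, but the helper names `batches` and `gene_ids_for`, the batch size of 100, and the choice of `dbfrom="nuccore"` are assumptions. It also assumes that passing a list as `id` returns one LinkSet per submitted ID, which I believe is how recent Biopython releases behave but is worth checking on your version.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def gene_ids_for(nucleotide_ids, batch_size=100):
    """Map nucleotide IDs to linked Entrez Gene IDs, one elink call per batch.

    Needs Biopython and network access; set Entrez.email before calling.
    """
    from Bio import Entrez  # imported here so batches() works without Biopython

    mapping = {}
    for chunk in batches(nucleotide_ids, batch_size):
        handle = Entrez.elink(dbfrom="nuccore", db="gene", id=chunk)
        linksets = Entrez.read(handle)
        handle.close()
        for linkset in linksets:
            if not linkset["IdList"]:
                continue  # nothing echoed back for this query
            source = linkset["IdList"][0]  # the ID as NCBI reports it
            mapping[source] = [
                link["Id"]
                for linked_db in linkset["LinkSetDb"]
                for link in linked_db["Link"]
            ]
    return mapping
```

Note that the keys of the returned dictionary are whatever identifiers NCBI echoes back, which may not be in the same form as the ones in the FASTA headers.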


    • #3
      Yeah, I thought that could be it, too. However, I never received an email from NCBI saying that I was pinging them too fast. (According to the Biopython cookbook tutorial, they will send you an email if they are limiting your access.)

      Oh well, I'll figure something else out. It's fairly trivial to parse a .gff file to pull entrez gene IDs. Thanks for your help!


      • #4
        Have you considered the possibility that it may be your institutional firewall that is blocking access (not sure what port you are using)?


        • #5
          I don't think so. I was previously able to pull sequences for all the genes in my analysis using basically the same code.

          Also, and I don't know if this matters, but the program does work, it just stops collecting data after a few seconds. I'm probably mistaken, but if my institution was blocking, wouldn't that mean I couldn't get any data at all?

          I also tried to do this at home with no success...

          Thanks for your reply


          • #6
            Can you put in a pause after retrieving every 2-3 records to see if that helps? BTW: Which Entrez ID are you referring to? The example above must be a dummy.
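
A pause like that can be added without touching the query code much. Here is a minimal Python 3 sketch using a small generator; `throttled` and its parameters are hypothetical names, and the values of 3 records and 1 second are just a starting point (NCBI's usage guidelines ask for no more than 3 requests per second without an API key).

```python
import time


def throttled(items, per_batch=3, pause=1.0, sleep=time.sleep):
    """Yield items one by one, pausing after every `per_batch` of them.

    The pause happens when the caller asks for the next item, so it falls
    between one batch of requests and the next.
    """
    for count, item in enumerate(items, start=1):
        yield item
        if count % per_batch == 0:
            sleep(pause)
```

In the script above this would wrap the list of header lines, e.g. `for header in throttled(headers):`, with the `Entrez.esearch` call left inside the loop body.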


            • #7
              Originally posted by lstbl
              Yeah, I thought that could be it, too. However, I never received an email from NCBI saying that I was pinging them too fast. (According to the Biopython cookbook tutorial, they will send you an email if they are limiting your access.)
              Well in theory, the NCBI says "The value of email will be used only to contact developers if NCBI observes requests that violate our policies, and we will attempt such contact prior to blocking access."

              Originally posted by lstbl
              Oh well, I'll figure something else out. It's fairly trivial to parse a .gff file to pull entrez gene IDs. Thanks for your help!
              If you are dealing with 1000s of IDs, this ought to be far more reliable and faster than making all those online requests.
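
As a rough illustration of the offline route, here is a Python 3 sketch that builds an accession-to-GeneID table from GFF3 lines. It assumes NCBI-style attributes, where column 9 carries something like `Dbxref=GeneID:...,Genbank:...` and sometimes a `transcript_id=` field; other annotation sources lay this out differently, so treat the field names as assumptions. The function name `accession_to_gene_id` is made up.

```python
def accession_to_gene_id(gff_lines):
    """Return {accession: GeneID} parsed from GFF3 feature lines."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        # Column 9 is a ";"-separated list of key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        xrefs = attrs.get("Dbxref", "").split(",")
        gene_id = next(
            (x.split(":", 1)[1] for x in xrefs if x.startswith("GeneID:")), None
        )
        accession = attrs.get("transcript_id") or next(
            (x.split(":", 1)[1] for x in xrefs if x.startswith("Genbank:")), None
        )
        if gene_id and accession:
            mapping[accession] = gene_id
    return mapping
```

Called as `accession_to_gene_id(open("annotation.gff"))` (the file name is a placeholder), this gives a dictionary that can replace the per-record `Entrez.esearch` lookups entirely.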

