Green caterpillar

Source: Andrew Magill

Working on PyMongo exposes me to a vexing swarm of issues in the Python interpreter itself. I've dealt with bugs in Python's threadlocals, a race condition in the Thread class, and the awful things that happen when you run C extensions in multiple sub interpreters.

Last month I was faced with a real stumper. A user reported a deadlock in the following code:

from gevent import monkey
monkey.patch_all()

from pymongo import MongoReplicaSetClient
MongoReplicaSetClient('host1', use_greenlets=False, replicaSet='rs')

When he hit Control-C, the exception traceback indicated the process was stuck in getaddrinfo. The deadlock only occurred with MongoReplicaSetClient, not MongoClient. It happened with Gevent 1.0 but not the previous version, 0.13.8. And here's the kicker: the code itself ran fine. It was when the file was imported that it deadlocked.

So I stepped through the MongoReplicaSetClient initialization code in PyCharm's debugger. To my surprise, the first call to getaddrinfo succeeded: the client acquired the IP address of "host1" and connected promptly to the MongoDB running there. So where was the deadlock?

Once the client connected to the first host, it asked it for a list of other hosts in the replica set. MongoDB returned the list as BSON, and PyMongo represents all BSON strings as unicode. So, whereas the first getaddrinfo was called on the string 'host1', subsequent calls were on unicodes u'host2', u'host3', and so on. PyMongo hung on the first attempt to resolve a unicode hostname. (This discussion is about Python 2, obviously; Gevent doesn't do Python 3 yet.)

I reproduced the deadlock with ever-simpler versions of the test script. I arrived at this:

from gevent import monkey
monkey.patch_all()

import socket
socket.getaddrinfo(
    u'mongodb.org',
    80,
    socket.AF_INET,
    socket.SOCK_STREAM)

Again, the script finished promptly when I ran with it python script.py, but it hung when I imported it with python -c "import script". Changing the unicode hostname to a str avoided the deadlock, as did downgrading Gevent to 0.13.8.

The changelog for Gevent 1.0 says that it runs getaddrinfo on a thread to make it asynchronous. So I tried running getaddrinfo on a thread, without involving Gevent at all:

import socket
import threading

def resolve():
    print socket.getaddrinfo(
        u'mongodb.org',
        80,
        socket.AF_INET,
        socket.SOCK_STREAM)

t = threading.Thread(target=resolve)
t.start()
t.join()

This script, too, hung when imported. Now that PyMongo and Gevent were both out of the picture, it looked to me like a Python bug. I looked at the implementation of Python's getaddrinfo wrapper. What does it do differently if the hostname is unicode?

if (PyUnicode_Check(hobj)) {
    idna = PyObject_CallMethod(
        hobj, "encode", "s", "idna");
    if (!idna)
        return NULL;
    hptr = PyString_AsString(idna);
} else if (PyString_Check(hobj)) {
    hptr = PyString_AsString(hobj);
}

If the host object, hobj, is unicode, it's encoded as "idna". (That stands for Internationalized Domain Names in Applications if you're curious, and I know you are.) I put some printfs in the C code and sure enough, getaddrinfo was hanging at the encoding step. Why? Following the call tree and sprinkling printfs throughout, I found that the hang occurred in Python's encodings module, when it tried to import encodings.idna. I finally understood the problem.

Importing script.py locks Python's import machinery until the import finishes. My script spawns a thread and calls getaddrinfo on the thread. If the hostname is unicode, getaddrinfo tries to encode it as "idna". If this is the first time the "idna" encoding has been used in this interpreter's life, then the thread must import encodings.idna, but the import machinery is locked by the main thread. The getaddrinfo thread waits for the import lock, while the main thread waits for the getaddrinfo thread to finish. Forever.

Gevent 1.0 is prone to this problem because it runs getaddrinfo on a thread. But the deadlock can be triggered without Gevent, by importing a contrived script like this:

def connect():
    MongoReplicaSetClient('host1', replicaSet='rs')

t = threading.Thread(target=connect)
t.start()
t.join()

It's just as the Python docs say:

Other than in the main module, an import should not have the side effect of spawning a new thread and then waiting for that thread in any way. Failing to abide by this restriction can lead to a deadlock if the spawned thread directly or indirectly attempts to import a module.

It's obvious once you know. Even without this admonition, it's my opinion that a module shouldn't spawn threads or do network I/O when you import it. It seems like an anti-pattern.

But my job is to fix bugs, not give style tips. I've submitted a patch to Gevent to prevent the deadlock, and I fixed a similar hang in Tornado. Python 3.4's asyncio also runs getaddrinfo on a thread, but it can't deadlock on import, because the import machinery was rewritten in Python 3.3 with per-module locking.

In any case there are a number of easy workarounds to this issue. I've detailed them in the bug report, but I'll show my favorite here:

u'foo'.encode('idna')

Do that in the main thread before any calls to getaddrinfo. This will cache the imported encoder and avoid importing it in a thread later on. If I were you, I'd replace "foo" with a mysterious-looking string and add a comment implying that the string is magic, just to confound future generations.