Night Of The Living Thread
What should this Python code print?:
import os
import threading

t = threading.Thread()
t.start()
if os.fork() == 0:
    # We're in the child process.
    print t.isAlive()
In Unix, only the thread that calls fork() is copied to the child process; all other threads are dead. So t.isAlive() in the child process should always return False. But sometimes, it returns True! It's the...
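To check the expected behavior yourself, here is a minimal sketch. It is written for Python 3, where the method is spelled is_alive() rather than isAlive(), and it assumes a Unix system that provides os.fork(). The helper name worker_alive_in_child is made up for illustration, and the pipe is just one convenient way to carry the child's answer back to the parent.

```python
import os
import threading


def worker_alive_in_child():
    """Fork while a worker thread is running; return the child's is_alive()."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait)
    worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: only the thread that called fork() was copied.
        os.close(read_fd)
        os.write(write_fd, b"1" if worker.is_alive() else b"0")
        os._exit(0)

    # Parent process: read the child's answer, then clean up.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return answer == b"1"


if __name__ == "__main__":
    print(worker_alive_in_child())
```

On a current Python this should print False: the worker is still running in the parent, but the child sees it as dead.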
How did I discover this horrifying zombie thread? A project I work on, PyMongo, uses a background thread to monitor the state of the database server. If a user initializes PyMongo and then forks, the monitor is absent in the child. PyMongo should notice that the monitor thread's isAlive() returns False, and raise an error:
# Starts monitor:
client = pymongo.MongoReplicaSetClient()
os.fork()
# Should raise error, "monitor is dead":
client.db.collection.find_one()
But intermittently, the monitor is still alive after the fork! It keeps coming back in a bloodthirsty lust for HUMAN FLESH!
I put on my Sixties scientist outfit (lab coat, thick-framed glasses) and sought the cause of this unnatural reanimation. To begin with, what does Thread.isAlive() do?:
class Thread(object):
    def isAlive(self):
        return self.__started.is_set() and not self.__stopped
After a fork, __stopped should be True on all threads but one. Whose job is it to set __stopped on all the threads that didn't call fork()? In threading.py I discovered the _after_fork() function, which I've simplified here:
# Globals.
_active = {}
_limbo = {}

def _after_fork():
    # This function is called by PyEval_ReInitThreads
    # which is called from PyOS_AfterFork. Here we
    # clean up threading module state that should not
    # exist after a fork.

    # fork() only copied current thread; clear others.
    new_active = {}
    current = current_thread()
    for thread in _active.itervalues():
        if thread is current:
            # There is only one active thread.
            ident = _get_ident()
            new_active[ident] = thread
        else:
            # All the others are already stopped.
            thread._Thread__stop()

    _limbo.clear()
    _active.clear()
    _active.update(new_active)
    assert len(_active) == 1
This function iterates over all the Thread objects in a global dict called _active; each is removed and marked as "stopped", except for the current thread. How could this go wrong?
Well, consider how a thread starts:
class Thread(object):
    def start(self):
        _limbo[self] = self
        _start_new_thread(self.__bootstrap)

    def __bootstrap(self):
        self.__started.set()
        _active[self.__ident] = self
        del _limbo[self]
        self.run()
(Again, I've simplified this.) The Thread object's start method adds the object to the _limbo dict, then creates a new OS-level thread. The new thread, before it gets to work, marks itself as "started" and moves itself from _limbo to _active.
Do you see the bug now? Perhaps the thread was reanimated by space rays from Venus and craves the flesh of the living!
Or perhaps there's a race condition:
- Main thread calls worker's start().
- Worker calls self.__started.set(), but is interrupted before it adds itself to _active.
- Main thread calls fork().
- In child process, main thread calls _after_fork, which doesn't find the worker in _active and doesn't mark it "stopped".
- isAlive() now returns True because the worker is started and not stopped.
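The sequence above can be replayed deterministically with a toy model. This is only a sketch in Python 3: ModelThread, after_fork, and simulate_race are illustrative names, no real OS threads or forks are involved, and the dicts merely stand in for the threading module's _active and _limbo.

```python
class ModelThread:
    """Toy stand-in for threading.Thread; it only tracks two flags."""

    def __init__(self, name):
        self.name = name
        self.started = False
        self.stopped = False

    def is_alive(self):
        return self.started and not self.stopped


def after_fork(active, limbo, current):
    """Model of the buggy cleanup: it only scans `active`."""
    new_active = {}
    for ident, thread in active.items():
        if thread is current:
            new_active[ident] = thread
        else:
            thread.stopped = True
    limbo.clear()
    active.clear()
    active.update(new_active)


def simulate_race():
    main = ModelThread("main")
    main.started = True
    worker = ModelThread("worker")
    active = {1: main}
    limbo = {}

    # Step 1: main thread calls the worker's start().
    limbo[worker] = worker
    # Step 2: worker marks itself started, then is interrupted
    # before it moves itself from limbo to active.
    worker.started = True
    # Steps 3 and 4: main thread forks; the child runs the cleanup.
    after_fork(active, limbo, current=main)
    # Step 5: nobody marked the worker stopped.
    return worker.is_alive()


print(simulate_race())  # True: the zombie walks
```

The worker is dropped from limbo without ever being marked stopped, so the model reports it as alive.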
Now we know the cause of the grotesque revenant. What's the cure? Headshot?
I submitted a patch to Python that simply swapped the order of operations: first the thread adds itself to _active, then it marks itself started:
def __bootstrap(self):
    _active[self.__ident] = self
    self.__started.set()
    self.run()
If the thread is interrupted by a fork after adding itself to _active, then _after_fork() finds it there and marks it stopped. The thread ends up stopped but not started, rather than the reverse. In this case isAlive() correctly returns False.
The Python core team looked at my patch, and Charles-François Natali suggested a cleaner fix. If the zombie thread is not yet in _active, it is in the global _limbo dict. So _after_fork should iterate over both _limbo and _active, instead of just _active. Then it will mark the zombie thread as "stopped" along with the other threads.
def _enumerate():
    return _active.values() + _limbo.values()

def _after_fork():
    new_active = {}
    current = current_thread()
    for thread in _enumerate():
        if thread is current:
            # There is only one active thread.
            ident = _get_ident()
            new_active[ident] = thread
        else:
            # All the others are already stopped.
            thread._Thread__stop()

    # Same cleanup as before.
    _limbo.clear()
    _active.clear()
    _active.update(new_active)
    assert len(_active) == 1
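To convince yourself that scanning both dicts puts the zombie down, here is a small toy model in Python 3. It is a sketch under stated assumptions, not the real threading code: ModelThread, after_fork_fixed, and fork_during_startup are hypothetical names, and the dict values are wrapped in list() because Python 3 returns views that can't be concatenated with +.

```python
class ModelThread:
    """Toy stand-in for threading.Thread; it only tracks two flags."""

    def __init__(self):
        self.started = False
        self.stopped = False

    def is_alive(self):
        return self.started and not self.stopped


def after_fork_fixed(active, limbo, current):
    """Model of the revised cleanup: scan active and limbo together."""
    new_active = {}
    for thread in list(active.values()) + list(limbo.values()):
        if thread is current:
            new_active[id(thread)] = thread
        else:
            thread.stopped = True
    limbo.clear()
    active.clear()
    active.update(new_active)


def fork_during_startup():
    main, worker = ModelThread(), ModelThread()
    main.started = True
    active = {id(main): main}
    # The worker has set its started flag but is still parked in limbo.
    limbo = {worker: worker}
    worker.started = True
    after_fork_fixed(active, limbo, current=main)
    return worker.is_alive(), len(active)


print(fork_during_startup())  # (False, 1)
```

The worker is found in limbo and marked stopped, and only the forking thread remains registered as active.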
This fix will be included in the next Python 2.7 and 3.3 releases. The zombie threads will stay good and dead...for now!
(Now read the sequels: Dawn of the Thread, in which I battle zombie threads in the abandoned tunnels of Python 2.6; and Day of the Thread, a post-apocalyptic thriller in which a lone human survivor tries to get a patch accepted via bugs.python.org.)