Motor Internals: How I Asynchronized a Synchronous Library
I'm going to explain why and how I wrote Motor, my asynchronous driver for MongoDB and Tornado. I hope I can justify my ways to you.
The Problem
Here's how you query one document from MongoDB with PyMongo, 10gen's official driver:
from pymongo import Connection

connection = Connection()
document = connection.db.collection.find_one()
print document
As you can see, the official driver is blocking: you call find_one and your code waits for the result.
Deep in the bowels of PyMongo, the driver sends your query over a socket and waits for the database's response:
import struct

class Connection(object):
    def send_and_receive(self, message, socket):
        socket.sendall(message)
        header = socket.recv(16)  # Get 16-byte header
        # The first 4 bytes give the total message length, header included
        length = struct.unpack("<i", header[:4])[0]
        body = socket.recv(length - 16)
        return header + body
That's three blocking operations on the socket in a row. All of PyMongo relies on the assumption that it can use sockets synchronously. How the hell can I make it non-blocking so you can use it with Tornado? Specifically, how can I implement this API?:
def opened(connection, error):
    connection.db.collection.find_one(callback=found)

def found(document, error):
    print document

MotorConnection().open(callback=opened)
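To make the blocking concrete, here's a small sketch you can run without a database. It's only an illustration under a few assumptions: socket.socketpair() stands in for the connection to MongoDB, the first four header bytes are taken to hold the total message size (header included), and recv_exactly and make_message are helpers I made up for the demo, not PyMongo functions.

```python
import socket
import struct

HEADER_SIZE = 16

def recv_exactly(sock, num_bytes):
    """Block until exactly num_bytes have arrived (recv may return fewer)."""
    chunks = []
    while num_bytes > 0:
        chunk = sock.recv(num_bytes)
        if not chunk:
            raise IOError("connection closed mid-message")
        chunks.append(chunk)
        num_bytes -= len(chunk)
    return b"".join(chunks)

def send_and_receive(message, sock):
    sock.sendall(message)                            # blocking call 1
    header = recv_exactly(sock, HEADER_SIZE)         # blocking call 2
    length = struct.unpack("<i", header[:4])[0]      # total size, header included
    body = recv_exactly(sock, length - HEADER_SIZE)  # blocking call 3
    return header + body

def make_message(payload):
    """Hypothetical framing: int32 total length, then 12 bytes of padding."""
    header = struct.pack("<i", HEADER_SIZE + len(payload)) + b"\x00" * 12
    return header + payload

client, server = socket.socketpair()
server.sendall(make_message(b"pong"))   # queue the "database" reply up front
reply = send_and_receive(make_message(b"ping"), client)
request = recv_exactly(server, HEADER_SIZE + len(b"ping"))
client.close()
server.close()
```

The demo only finishes because the reply is queued before send_and_receive runs. If the other end were slow, each of the three calls would park the whole thread, which is exactly what a Tornado application can't afford.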
AsyncMongo's Solution
bit.ly's non-blocking driver, AsyncMongo, took the straightforward approach. It copied and pasted PyMongo as it stood two years ago, and turned it inside-out to use callbacks. PyMongo's send_and_receive became this:
class Connection(object):
    def send_and_receive(self, message, callback):
        self.callback = callback

        # self.stream is a Tornado IOStream
        self.stream.write(message)
        self.stream.read_bytes(16,
            callback=self.parse_header)

    def parse_header(self, data):
        self.header = data
        # The length in the header counts the 16-byte header itself
        length = struct.unpack("<i", data[:4])[0]
        self.stream.read_bytes(length - 16,
            callback=self.parse_response)

    def parse_response(self, data):
        response = self.header + data
        self.callback(response)
(Note that IOStream buffers the output in write, so only the read_bytes calls take callbacks.)
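If you want to watch that callback chain run, here's a self-contained sketch. It is not AsyncMongo's code: ToyStream is a hypothetical in-memory stand-in for a Tornado IOStream, and the pending queue with run_loop is a toy replacement for the IOLoop. I'm also assuming the length in the header counts the 16-byte header itself.

```python
import collections
import struct

pending = collections.deque()   # toy event loop: a queue of ready callbacks

def run_loop():
    while pending:
        pending.popleft()()

class ToyStream(object):
    """Hypothetical in-memory stand-in for a Tornado IOStream."""
    def __init__(self, incoming):
        self.incoming = incoming
        self.written = b""

    def write(self, data):
        self.written += data    # buffered, so no callback is needed

    def read_bytes(self, num_bytes, callback):
        data = self.incoming[:num_bytes]
        self.incoming = self.incoming[num_bytes:]
        pending.append(lambda: callback(data))   # delivered later, by the loop

class Connection(object):
    def __init__(self, stream):
        self.stream = stream

    def send_and_receive(self, message, callback):
        self.callback = callback
        self.stream.write(message)
        self.stream.read_bytes(16, callback=self.parse_header)

    def parse_header(self, data):
        self.header = data
        length = struct.unpack("<i", data[:4])[0]
        self.stream.read_bytes(length - 16, callback=self.parse_response)

    def parse_response(self, data):
        self.callback(self.header + data)

# A fake server reply: 16-byte header followed by a 5-byte body
reply = struct.pack("<i", 16 + 5) + b"\x00" * 12 + b"hello"
responses = []
connection = Connection(ToyStream(reply))
returned = connection.send_and_receive(b"query", responses.append)
before_loop = list(responses)   # nothing has arrived yet
run_loop()                      # now the two reads complete, one after another
```

Notice that send_and_receive returns immediately with nothing, and the response only shows up once the loop has run both read callbacks. One blocking function has become three methods plus state stashed on self.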
This is a solution to the problem of making PyMongo async, but now there's a new problem: how do we maintain code like this? PyMongo is extended and improved every month by 10gen's programmers (like me!). An effort comparable to that devoted to maintaining PyMongo would be required to keep AsyncMongo up to date, because every PyMongo change must be manually ported over. Who has that kind of time?
Motor's Solution
Since I joined 10gen in November last year, I'd been thinking there must be a better way. I wanted to somehow reuse all of PyMongo's existing code—its years of improvements and bugfixes and battle-testing—but make it non-blocking so Tornado programmers could use it. I thought that if Python had something like Scheme's call-with-current-continuation, I could pause PyMongo's execution whenever it would block waiting for a socket, and resume when the socket was ready. From that thought, it surely took me longer, dear reader, than it would have taken you to deduce the solution, but during a particularly distracted meditation session it somehow dawned on me: greenlets. I'd use a Gevent-like technique to wrap PyMongo and asynchronize it, while presenting a classic Tornado callback API to you.
Asynchronizing PyMongo takes two steps. First, I wrap each PyMongo method and run it on a greenlet, like this:
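A simplified sketch of such a wrapper might look like the following. Treat it as an illustration rather than Motor's real code: the name asynchronize is my own label for the demo, it needs the greenlet package installed, and a plain queue with run_loop stands in for Tornado's IOLoop. The numbered comments match the steps described in the text.

```python
import collections
import greenlet

pending = collections.deque()   # toy stand-in for Tornado's IOLoop

def run_loop():
    while pending:
        pending.popleft()()

def asynchronize(sync_method):
    """Hypothetical wrapper: run sync_method on a greenlet, report via callback."""
    def method(*args, **kwargs):
        callback = kwargs.pop("callback")               # (1) grab the callback

        def call_method():
            result, error = None, None
            try:
                result = sync_method(*args, **kwargs)   # (3) run the original
            except Exception as e:
                error = e
            # (8) schedule the callback on the loop with the outcome
            pending.append(lambda: callback(result, error))

        greenlet.greenlet(call_method).switch()         # (2) start a greenlet
        # No return statement: a callback-style method returns None.
    return method

def blocking_find_one(collection_name):
    """Stand-in for PyMongo's synchronous find_one."""
    return {"_id": 1, "from": collection_name}

find_one = asynchronize(blocking_find_one)

results = []
returned = find_one("things", callback=lambda doc, err: results.append((doc, err)))
before_loop = list(results)     # the callback has not run yet
run_loop()                      # now it has
```

In this demo the wrapped function never blocks, so the greenlet runs straight through. The greenlet only earns its keep once the wrapped code hits a socket.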
So when you call collection.find_one(callback=found), Motor (1) grabs the callback argument and (2) starts a greenlet that (3) runs PyMongo's original find_one. That find_one sends a message to the server and calls recv on a socket to get the response.
The second step is to pause the greenlet whenever it would block. I wrote a MotorSocket class which seems to PyMongo like a regular socket, but in fact it wraps a Tornado IOStream:
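Here's a rough sketch of that idea, under stated assumptions: ToyMotorSocket is a simplified stand-in and not Motor's actual class, ToyStream is a hypothetical in-memory replacement for IOStream, the pending queue plays the role of the IOLoop, and the greenlet package must be installed. The numbered comments match the steps described in the text.

```python
import collections
import greenlet

pending = collections.deque()   # toy stand-in for Tornado's IOLoop

def run_loop():
    while pending:
        pending.popleft()()

class ToyStream(object):
    """Hypothetical in-memory stand-in for a Tornado IOStream."""
    def __init__(self, incoming):
        self.incoming = incoming

    def read_bytes(self, num_bytes, callback):
        data = self.incoming[:num_bytes]
        self.incoming = self.incoming[num_bytes:]
        pending.append(lambda: callback(data))

class ToyMotorSocket(object):
    """Looks like a blocking socket to its caller, but never blocks the loop."""
    def __init__(self, stream):
        self.stream = stream

    def recv(self, num_bytes):
        child = greenlet.getcurrent()
        # (4) start the read; (7) its callback resumes this greenlet with data
        self.stream.read_bytes(num_bytes, child.switch)
        # (5) pause by switching to the parent greenlet; the value later passed
        # to child.switch() becomes the return value of this call
        return child.parent.switch()

events = []

def driver_code(sock):
    # Plain synchronous-looking code, the way a blocking driver is written
    header = sock.recv(4)
    body = sock.recv(5)
    events.append(header + body)

sock = ToyMotorSocket(ToyStream(b"HEADhello"))
greenlet.greenlet(driver_code).switch(sock)   # (6) returns at the first recv
events.append("caller resumed")
run_loop()                                    # resumes driver_code twice
```

The point to notice is that driver_code contains no callbacks at all. It calls recv and gets bytes back, yet the caller regains control before any data has arrived.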
MotorSocket.recv (4) starts reading the requested number of bytes and (5) pauses the caller's greenlet. At this point, (6) the original call to find_one returns. Because Motor's API is callback-based, its find_one returns None. The actual MongoDB document will be passed into the callback asynchronously.
Eventually, IOStream's read_bytes call completes and executes the callback, which (7) resumes the paused greenlet. That greenlet then completes PyMongo's processing, parsing the server's response and so on, until PyMongo's original find_one returns. Motor gets a result or an exception from PyMongo's find_one and (8) schedules your callback on the IOLoop.
(The real code is a little more complicated, gory details here.)
If you're a visual learner, here's the same sequence of events diagrammed:
Sorry, it's the best diagram I can think of.
Why?
PyMongo is three and a half years old. The core module is 3000 source lines of code. There are hundreds of improvements and bugfixes, and 7000 lines of unit tests. Anyone who tries to make a non-blocking version of it has their work cut out for them, and will inevitably fall behind development of the official PyMongo. With Motor's technique, I can wrap and reuse PyMongo whole, and when we fix a bug or add a feature to PyMongo, Motor will come along for the ride, for free.