It's my pleasure to announce that Bernie Hackett and I have released PyMongo 2.6, the successor to PyMongo 2.5.2, with new features and bugfixes.

Connection Pooling

The big news in PyMongo 2.6 is that the max_pool_size option actually means what it says now.

PyMongo opens sockets as needed, to support the number of concurrent operations demanded by a multithreaded application. In versions before 2.6, the default max_pool_size was 10, and it did not actually bound the number of open connections; it only determined the number of connections that would be kept open when no longer in use. Consider this code:

client = MongoClient(max_pool_size=5)

PyMongo's old connection pool would open as many sockets as you need, without bound, during a load spike, but once the spike is over it will close all but five of them.

In PyMongo 2.6, max means max. The connection pool will open at most max_pool_size sockets to meet demand. So if your max_pool_size is 5, and five threads currently have MongoDB operations in progress, a sixth thread will block waiting for a socket to be returned to the pool by one of the prior threads. Obviously, this can slow down your app, so we've increased the default max_pool_size from 10 to 100.

If you haven't been specifying max_pool_size when you create a MongoClient or MongoReplicaSetClient, then you'll get the new default when you upgrade and you have nothing to worry about. But if you have been overriding the default, you should ensure your max_pool_size is large enough to match your highest expected number of concurrent operations. Sockets are only opened when needed, so there's no cost to having a max_pool_size larger than necessary. Err towards a larger value.

In addition we've added two more options: waitQueueMultiple caps the number of threads that can wait for sockets before PyMongo starts throwing exceptions. E.g., to keep the number of waiters less than or equal to 500:

client = MongoClient(max_pool_size=50, waitQueueMultiple=10)

When 500 threads are waiting for a socket, additional waiters throw exceptions. Use this option to bound the amount of queueing in your application during a load spike, at the cost of additional exceptions.

Once the pool reaches its max size, additional threads are allowed to wait indefinitely for connections to become available, unless you set waitQueueTimeoutMS:

client = MongoClient(waitQueueTimeoutMS=100)

A thread that waits more than 100ms (in this example) for a connection throws an exception. Use this option if it is more important to bound the duration of operations during a load spike than it is to complete every operation.

All this intricate work on the connection pool was generously contributed by Justin Patrin, a developer at Idle Games.

Justin Patrin

With extraordinary patience Justin worked with us for a month to deal with our critiques and all the edge cases. Adding a feature to PyMongo's connection pool is maximally difficult. We support Python 2.4 through 3.3, PyPy, and Jython, plus Gevent, and each has very different behavior regarding concurrency, I/O, object lifetimes, and threadlocals. Our pool needs special code for each of them. Justin pounded through the bugs and made one of the best contributions PyMongo's received from an outside programmer.

Batched Inserts

PyMongo's maintainer Bernie Hackett added a great new feature: if you pass insert() a huge list of documents, PyMongo automatically splits it into 48MB chunks and passes each chunk to MongoDB. In the past it was up to you to determine the largest message size MongoDB would accept and ensure your batch inserts didn't exceed it, otherwise PyMongo threw an exception. Now, you can stream an arbitrary number of documents to the server in a single method call.

Aggregation Cursors

Starting with MongoDB 2.5.1, you can stream results from the aggregation framework instead of getting them in one batch. This finally lifts the infamous 16MB limit on aggregation results and brings the aggregation framework closer to the standard find operation. In PyMongo 2.6 we added support for this server feature.

The old style gets all the documents at once:

>>> collection.aggregate(pipeline)
{
    'ok': 1.0,
    'result': [
        {'key': 'value1'},
        {'key': 'value2'}
    ]
}

But now you can iterate through the results and there's no limit on the number of them:

>>> for doc in collection.aggregate(pipeline, cursor={}):
...     print doc
{'key': 'value1'}
{'key': 'value2'}

Exhaust Cursors

On the subject of cursors, MongoDB has long had "exhaust cursors": instead of waiting for the client to ask for each batch of results from a query, an exhaust cursor streams batches to the client as fast as possible. This supercharged cursor is used internally for replication and database dumps. Bernie added support for exhaust cursors in PyMongo so you can now take advantage of them, too, for Python tools that need to pull huge data sets from MongoDB over a high-latency network.

`get_default_database`

A single MongoDB server can have multiple databases, but there's no standard way to use a config file to tell your app which database to use. You can put the database's name in a MongoDB URI like 'mongodb://host:port/my_database', but when you pass the URI to MongoClient you just get a connection to the whole server. There's been no way to say, "Give me a connection to only the database I specified in the URI." I added get_default_database method, so now you can do:

uri = 'mongodb://host/my_database'
client = MongoClient(uri)
db = client.get_default_database()

db is a reference the database you specified in the URI. This makes your application more customizable with the URI alone, which should make life easier for developers and operations folk.

Other

You can see all 22 issues resolved in this release in our bug tracker. There was also a very interesting bugfix from a GitHub contributor to our Gevent compatibility. Upgrade, check your max_pool_size, and enjoy!