Deep sea fish

I released libbson and libmongoc 1.1.8 today. The significant change is the defeat of a stubborn crash reported weeks ago. Very rarely, when a mongoc_client_t is connected to a replica set while a member is added, and authentication fails, it leaves the client's data structures in an inconsistent state that makes it seg fault later in mongoc_client_destroy().

I had already gone one round with this bug and given up: I released 1.1.7 with extra checking and logging along this code path, but without a theory about the cause of the crash, much less a fix. The customer who reported the crash could reproduce it a couple times in each of their days-long durability tests, so they sent me core dumps. My colleague Spencer Jackson devoted heroic effort to understanding the core dumps (including one with no debug symbols!), and we finally discovered the sequence that leads to the crash.

The bug was in _mongoc_cluster_reconnect_replica_set(), which has two loops. The first loop tries nodes until it finds a replica set primary. In the second loop, it iterates over the primary's peer list connecting and authenticating with each peer, including the primary itself.

The crash comes when we:

  1. Connect to a 2-node replica set.
  2. The function enters its first loop, connects to the primary and finds two peers.
  3. nodes_len is set to 2 and the nodes list is reallocated, but the second node's struct is uninitialized.
  4. The function enters its second loop.
  5. Auth fails on the first node (the primary) so the driver breaks from the loop with goto CLEANUP.
  6. Now nodes_len is 2 but the second node is still uninitialized!
  7. Later, mongoc_client_destroy iterates the nodes list, destroying them.
  8. Since nodes_len is 2, the client tries to destroy the second, uninitialized node.
  9. If the stream field in the second node happens to be non-NULL, the client calls stream->close on it and segfaults.

This was particularly hard for the customer's test to reproduce, because the driver has to connect while the test framework is reconfiguring auth in the replica set, and the buffer reallocation has to return a non-zero chunk of memory.

The fix is to properly manage nodes_len: don't increment it to N unless N nodes have actually been initialized. Additionally, zero-out all nodes right after reallocating the nodes list to ensure all data structures are NULL.

Details about the bug and the fix are in Jira.

It's satisfying to nail this bug after a long chase, but also painful: that code path is long gone in the 1.2.0 branch, replaced by Samantha Ritter's implementation of the Server Discovery And Monitoring spec. If I could've released 1.2.0 by now we'd have saved all the trouble of debugging the old code. It only redoubles my drive to release a beta of the new driver this quarter and get out of this bind.


Image: The deep sea fish eurypharynx pelecanoides, Popular Science Monthly, 1883.