A. Jesse Jiryu Davis

Goodbye MongoDB World, Hello Open Source Bridge

Talking at MongoDB World

Today I talked to a small audience at MongoDB World in New York City. I described how I made a visualization of global weather data using MongoDB, Python, Monary, and Matplotlib. The visualization looks like this:

Contour map of global temperature using Matplotlib

My talk was one of a three-part series called "The Weather of the Century". My colleague Randall Hunt gave a schema design talk about weather data, and André Spiegel, who was the brains behind the whole idea, analyzed the performance of various MongoDB configurations executing various operations on terabytes of data.

In the next week I plan to write an article on how I generated this visualization. I also want to write an introduction to Monary: it's a little-known, highly specialized MongoDB driver capable of shocking throughput.

Tomorrow I'm flying to Portland, Oregon to catch the second half of Open Source Bridge. I'll exhort the audience to write an excellent programming blog. I hope they take my advice!

Rules of Thumb for Methods and Functions

Le Pouce, sculpture in Paris [Source]

The Python team at MongoDB is partially rewriting PyMongo. The next version, 3.0, aims to be faster, more flexible, and more maintainable than the current 2.x series. There is nothing like the satisfaction of pulling out the weeds and making a fresh patch of ground for new code.

A design flaw in the current PyMongo is that a large number of instance methods both return values and have side effects. For example, MongoClient has a private _check_response_to_last_error method. It takes a binary message from the server and returns a parsed version of it. But depending on what errors it finds in the server message, the method sometimes clears the client's connection pool, or changes all threads' socket affinities, or wipes its cached information about who the primary server is. Just looking at the method's signature doesn't tell me all the things it could do: since it's an instance method of MongoClient, it could change any part of the MongoClient's state.

This gets gnarly, quickly.

In most cases these mixed methods did one thing at first: they only returned a value, or only changed state. And then we had to fix something and the easiest way was to add a side-effect or add a return value. And so the road to hell was paved.

I want to minimize the temptation for these mixed methods in PyMongo 3. My main strategy is to minimize methods, period. My rules of thumb are these:

  • If it accesses private instance variables, it's an instance method. Everything else can and should be a function.
  • When a method is necessary, it should set a private variable, or it should have a return value. Not both.
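Here is a minimal sketch of the two rules, not PyMongo's actual code. Every name in it (parse_response, Client, forget_primary, the shape of the response dict) is hypothetical and chosen only for illustration. The parsing logic touches no private state, so it is a plain function; each method either sets a private variable or returns a value, never both.

```python
def parse_response(response):
    """Function: reads no private state, returns a value, changes nothing.

    `response` is a hypothetical, already-decoded server reply.
    """
    primary_lost = response.get('err') == 'not master'
    return {'ok': not primary_lost, 'primary_lost': primary_lost}


class Client(object):
    """Hypothetical client that caches the primary's address."""

    def __init__(self):
        self._primary = 'host1:27017'

    def forget_primary(self):
        """Method that changes private state; it returns nothing."""
        self._primary = None

    def primary(self):
        """Method that returns a value; it changes nothing."""
        return self._primary


client = Client()
parsed = parse_response({'err': 'not master'})
if parsed['primary_lost']:
    # The side effect is visible at the call site, not hidden in the parser.
    client.forget_primary()
```

The caller decides what to do with the parsed result, so reading the call site tells you everything that can change.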

No rule should be followed without exception, of course, and there will be a handful of exceptions to these rules. But on the whole I think this limits the risk and complexity of methods in PyMongo. What do you think?

Street Retreat 2014 Recap

Tompkins Square Park

Although I did my street retreat a month ago, before my Taipei trip, I'm just now finding time to write about it.

A street retreat is a Zen practice in which a group of people join to live on the streets together for a few days. We leave our money, phones, and everything behind. We sleep on the sidewalk, meditate in parks, and eat at soup kitchens. It's not exactly homelessness, but it's a taste of it, and it's the best way I know to genuinely meet homeless people. It's also a way to raise a few thousand dollars to give to homeless services. And the retreat is a chance to practice like the first Buddhist monks: They were homeless too.

If you want to know more about street retreats, read my invitation for the most recent one, or my recap of last year.

This year's retreat was marked by great abundance. At our opening council in Washington Square Park we had twelve people, the biggest group I've seen. And over the course of the retreat we kept adding more. We had Batman with us—he's a long-time student of Zen teacher Bernie Glassman, and he was homeless for decades. He knows everything about the streets and everyone on the streets. On the morning of our second day, he ran into a friend at the Bowery Mission. Her name was Fatima and she was a dynamo, a master of panhandling. Each day she gave us tips on how to beg, and demonstrated her techniques. "Romeo and Juliet!" she yelled hoarsely at any couple who walked by. If she got them to laugh, then she asked for money.

Normally a group on a street retreat likes to blend in with the homeless population, but Fatima wasn't playing by our rules. She announced us to the Bowery Mission staff and explained what we were doing on the street. I'm not sure how it happened, but she obliged us to work a volunteer shift at the Mission. I and some of the men in our group worked in the Mission's clothing-distribution room. A resident named Lucky was running the show. That day the Mission offered showers to homeless men. As men arrived from the showers they brought a list of items they needed. We fulfilled the orders for shirts, pants, and so on. We saw that the Mission had a pile of sleeping bags and we asked for five of them, which Lucky gave us.

That night when we unrolled the sleeping bags they were revealed to contain street survival kits: shirt, toothbrush, hand sanitizer, scripture.

I ate and slept better than I ever have on a retreat. Fatima found us a place to sleep, under a construction scaffold on Mott Street. Batman picked up a case of discarded juice from Organic Avenue, still cold and fresh, and distributed it among us. "These go for eight dollars each!" A security guard came on duty around midnight; Batman instantly befriended him. They talked half the night. "I'm your security guard," he announced. "This place is your home. Go ahead and sleep, I'll keep an eye out for you." It rained hard at night but the scaffold kept us dry.

Somewhere along the way Fatima ran into a friend, Julius, who joined our retreat too. He was a black man my age, with a monkish vibe. He had a meditation practice of his own, and he was in the habit of sleeping on the A train with an old homeless man who needed someone to watch over him.

One evening we begged our dinner from the Union Square farmer's market. The vendors gave us a quiche, chèvre, feta, a big bag of mustard greens, many loaves of bread. There was a group at the Square called Occupy Kitchen; they gave us a tray full of pasta, a bag of salmon salad, plates and forks. It rained on us in spurts as we sat in the square chanting sutras. A few local panhandlers joined for the ceremony and talked with us as we shared the food.

This retreat our group was porous, welcoming. There were too many of us, and we were too mostly white, to blend in, so street people came up and asked us what we were doing. Some joined us for a bit. Even our security guard joined on the final morning, staying with us long enough to participate in the closing ceremony.

This was my fourth retreat. Each time, it's less of an adventure and more of a practice. I did my first street retreat to have an adventure and survive it. Now, I continue to do it as a practice, to train myself, to refresh the lessons retreat teaches me: to be humble, to be generous, and to gratefully receive generosity.

Refactoring Tornado Coroutines

Tornado [Source]

Sometimes writing callback-style asynchronous code with Tornado is a pain. But the real hurt comes when you want to refactor your async code into reusable subroutines. Tornado's coroutines make refactoring easy. I'll explain the rules.

(This article updates my old "Refactoring Tornado Code With gen.engine". The updated code here demonstrates the current syntax for Tornado 3 and Motor 0.3.)

For Example

I'll use this blog to illustrate. I built it with Motor-Blog, a trivial blog platform on top of Motor, my asynchronous MongoDB driver for Tornado.

When you came here, Motor-Blog did three or four MongoDB queries to render this page.

1: Find the blog post at this URL and show you this content.

2 and 3: Find the next and previous posts to render the navigation links at the bottom.

Maybe 4: If the list of categories on the left has changed since it was last cached, fetch the list.

Let's go through each query and see how Tornado coroutines make life easier.

Fetching One Post

In Tornado, fetching one post takes a little more work than with blocking-style code:

import motor
import tornado.web

db = motor.MotorClient().my_blog_db

class PostHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self, slug):
        db.posts.find_one({'slug': slug}, callback=self._found_post)

    def _found_post(self, post, error):
        if error:
            raise tornado.web.HTTPError(500, str(error))
        elif not post:
            raise tornado.web.HTTPError(404)
        else:
            self.render('post.html', post=post)

Not so bad. But is it better with a coroutine?

from tornado import gen

class PostHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def get(self, slug):
        post = yield db.posts.find_one({'slug': slug})
        if not post:
            raise tornado.web.HTTPError(404)

        self.render('post.html', post=post)

Much better. If you don't pass a callback to find_one, then it returns a Future instance. A Future is nothing special: it's just a little object that represents an unresolved value. Some time hence, Motor will resolve the Future with a value or an exception. To wait for the Future to be resolved, yield it.

The yield statement makes this function a generator. gen.coroutine is a brilliant invention that runs the generator until it's complete. Each time the generator yields a Future, gen.coroutine schedules the generator to be resumed when the Future is resolved. Read the source code of the Runner class for details; it's exhilarating. Or just enjoy the glow of putting all your logic in a single function again, without defining any callbacks.

Even better, you get normal exception handling: if find_one gets a network error or some other failure, it raises an exception. Tornado knows how to turn an exception into an HTTP 500, so we no longer need special code for errors.
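To make the mechanism concrete, here is a toy sketch of a coroutine runner. It is not Tornado's Runner: it uses the standard library's concurrent.futures.Future as a stand-in for Tornado's Future, has no event loop, and assumes Python 3.3 or later so the generator can return a value. The names run, step, and show_title are mine. It shows the two behaviors described above: the generator resumes with the Future's value, or the Future's exception is raised at the yield.

```python
from concurrent.futures import Future


def run(gen):
    """Drive generator `gen` to completion; return a Future for its result."""
    outcome = Future()

    def step(resolved=None):
        try:
            if resolved is None:
                yielded = next(gen)  # Start the generator.
            elif resolved.exception() is not None:
                # Raise the failure inside the generator, at its yield.
                yielded = gen.throw(resolved.exception())
            else:
                # Resume the generator with the resolved value.
                yielded = gen.send(resolved.result())
        except StopIteration as stop:
            outcome.set_result(stop.value)  # The generator returned.
        except Exception as exc:
            outcome.set_exception(exc)  # The generator raised.
        else:
            # Resume again once the newly yielded Future is resolved.
            yielded.add_done_callback(step)

    step()
    return outcome


def show_title(pending):
    post = yield pending  # Pause here until `pending` is resolved.
    return post['title']


pending = Future()
outcome = run(show_title(pending))
started_paused = not outcome.done()  # Nothing has resolved yet.
pending.set_result({'title': 'Hello'})

failed = Future()
failed_outcome = run(show_title(failed))
failed.set_exception(IOError('network down'))
```

After this runs, outcome holds the title and failed_outcome holds the IOError, with no error-handling code in show_title itself.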

This coroutine is much more readable than a callback, but it doesn't look any nicer than multithreaded code. It will start to shine when you need to parallelize some tasks.

Fetching Next And Previous

Once Motor-Blog finds the current post, it gets the next and previous posts so it can display their titles. Since the two queries are independent we can save a few milliseconds by doing them in parallel. How does this look with callbacks?

@tornado.web.asynchronous
def get(self, slug):
    db.posts.find_one({'slug': slug}, callback=self._found_post)

def _found_post(self, post, error):
    if error:
        raise tornado.web.HTTPError(500, str(error))
    elif not post:
        raise tornado.web.HTTPError(404)
    else:
        _id = post['_id']
        self.post = post

        # Two queries in parallel.
        # Find the previously published post.
        db.posts.find_one(
            {'pub_date': {'$lt': post['pub_date']}},
            sort=[('pub_date', -1)],
            callback=self._found_prev)

        # Find subsequently published post.
        db.posts.find_one(
            {'pub_date': {'$gt': post['pub_date']}},
            sort=[('pub_date', 1)],
            callback=self._found_next)

def _found_prev(self, prev_post, error):
    if error:
        raise tornado.web.HTTPError(500, str(error))
    else:
        self.prev_post = prev_post
        if self.next_post:
            # Done
            self._render()

def _found_next(self, next_post, error):
    if error:
        raise tornado.web.HTTPError(500, str(error))
    else:
        self.next_post = next_post
        if self.prev_post:
            # Done
            self._render()

def _render(self):
    self.render(
        'post.html',
        post=self.post,
        prev_post=self.prev_post,
        next_post=self.next_post)

This is completely disgusting and it makes me want to give up on async. We need special logic in each callback to determine if the other callback has already run or not. All that boilerplate can't be factored out. Will a coroutine help?

@gen.coroutine
def get(self, slug):
    post = yield db.posts.find_one({'slug': slug})
    if not post:
        raise tornado.web.HTTPError(404)
    else:
        future_0 = db.posts.find_one(
            {'pub_date': {'$lt': post['pub_date']}},
            sort=[('pub_date', -1)])

        future_1 = db.posts.find_one(
            {'pub_date': {'$gt': post['pub_date']}},
            sort=[('pub_date', 1)])

        prev_post, next_post = yield [future_0, future_1]
        self.render(
            'post.html',
            post=post,
            prev_post=prev_post,
            next_post=next_post)

Yielding a list of Futures tells the coroutine to wait until they are all resolved.
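As a rough illustration of what "wait until they are all resolved" involves, here is a sketch that combines several Futures into one. It is my own toy, not Tornado's implementation: wait_all is a hypothetical helper, and it uses the standard library's concurrent.futures.Future as a stand-in for Tornado's Future. Note that this is the same "has the other one finished yet?" bookkeeping the callback version needed, written once instead of in every handler.

```python
from concurrent.futures import Future


def wait_all(futures):
    """Return one Future that resolves to the list of results, in order."""
    combined = Future()
    if not futures:
        combined.set_result([])
        return combined

    def on_done(_):
        # Runs once per input Future; only the last to finish has any effect.
        if combined.done() or not all(f.done() for f in futures):
            return
        try:
            results = [f.result() for f in futures]
        except Exception as exc:
            combined.set_exception(exc)  # Any failure fails the whole wait.
        else:
            combined.set_result(results)

    for f in futures:
        f.add_done_callback(on_done)
    return combined


prev_future, next_future = Future(), Future()
both = wait_all([prev_future, next_future])

next_future.set_result('next post')  # The second query finishes first.
done_after_one = both.done()
prev_future.set_result('previous post')
prev_post, next_post = both.result()
```

The results come back in the order the Futures were listed, not the order they finished, which is why unpacking into prev_post and next_post is safe.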

Now our single get function is just as nice as it would be with blocking code. In fact, the parallel fetch is far easier than if you were multithreading instead of using Tornado. But what about factoring out a common subroutine that request handlers can share?

Fetching Categories

Every page on my blog needs to show the category list on the left side. Each request handler could just include this in its get method:

categories = yield db.categories.find().sort('name').to_list(10)

But that's terrible engineering. Here's how to factor it into a coroutine:

@gen.coroutine
def get_categories(db):
    categories = yield db.categories.find().sort('name').to_list(10)
    raise gen.Return(categories)

This coroutine does not have to be part of a request handler—it stands on its own at the module scope.

The raise gen.Return() statement is the weirdest syntax in this example. It's an artifact of Python 2, in which generators aren't allowed to return values. To hack around this limitation, Tornado coroutines raise a special kind of exception called a Return. The coroutine catches this exception and treats it like a returned value. In Python 3.3 and later, a simple return categories accomplishes the same result.
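The trick is easier to see in miniature. The sketch below is not Tornado's code: Return here is my own stand-in exception class, and final_value is a hypothetical helper that exhausts a generator. It shows that a value raised in a special exception and a value returned from a generator (Python 3.3 and later) can be collected the same way.

```python
class Return(Exception):
    """Stand-in for a 'return value' exception; carries the result."""

    def __init__(self, value=None):
        super(Return, self).__init__()
        self.value = value


def final_value(gen):
    """Run `gen` to the end and report the value it finished with."""
    try:
        while True:
            next(gen)
    except Return as returned:
        return returned.value  # Python 2 style: value carried by exception.
    except StopIteration as stop:
        return stop.value  # Python 3.3+: value carried by `return`.


def python2_style():
    yield
    raise Return(['python', 'mongodb'])


def python3_style():
    yield
    return ['python', 'mongodb']


old_way = final_value(python2_style())
new_way = final_value(python3_style())
```

Both calls produce the same list, so callers cannot tell which style the subroutine used.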

To call my new coroutine from a request handler, I do:

class PostHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def get(self, slug):
        categories = yield get_categories(db)
        # ... get the current, previous, and
        # next posts as usual, then ...
        self.render(
            'post.html',
            post=post,
            prev_post=prev_post,
            next_post=next_post,
            categories=categories)

Since get_categories is a coroutine now, calling it returns a Future. To wait for get_categories to complete, the caller can yield the Future. Once get_categories completes, the Future it returned is resolved, so the caller resumes. It's almost like a regular function call!

Now that I've factored out get_categories, it's easy to add more logic to it. This is nice because I want to cache the categories between page views. get_categories can be updated very simply to use a cache:

categories = None

@gen.coroutine
def get_categories(db):
    global categories
    if not categories:
        categories = yield db.categories.find().sort('name').to_list(10)

    raise gen.Return(categories)

(Note for nerds: I invalidate the cache whenever a post with a new category is added. The "new category" event is saved to a capped collection in MongoDB, which all the Tornado servers are always tailing. This is a simple way to use MongoDB as an event queue, which the multiple Tornado processes use to communicate with each other.)

Conclusion

Tornado's excellent documentation shows briefly how a method that makes a few async calls can be simplified using gen.coroutine, but the power really comes when you need to factor out a common subroutine. There are only three steps:

  1. Decorate the subroutine with @gen.coroutine.
  2. In Python 2, the subroutine returns its result with raise gen.Return(result).
  3. Call the subroutine from another coroutine like result = yield subroutine().

That's all there is to it. Tornado's coroutines make asynchronous code efficient, clean—even beautiful.

Motor 0.3 Released

Motor

Today I released Motor 0.3. This version has no new features compared to Motor 0.2.1. Here's what I changed:

  • I updated the PyMongo dependency from 2.7 to 2.7.1, therefore inheriting PyMongo 2.7.1’s bug fixes.
  • Motor continues to support Python 2.6, 2.7, 3.3, and 3.4, but now from a single source tree: 2to3 no longer runs during installation with Python 3.
  • nosetests is no longer required for regular Motor tests.
  • I fixed a mistake in the docs for aggregate().

Rewriting Motor to support Python 2 and 3 in the same source code makes life sane for me, and it reflects the current consensus about the best way to write portable Python. It wasn't terribly difficult either.

Now that I've simplified Motor's Python 3 support, I'm ready to tackle the next big challenge: I want to see if Motor can support Twisted and asyncio, in addition to Tornado. Wish me luck.

The Aura of the Live Demo

A live demo is too difficult. Too risky. On speaking.io, Zach Holman tells you that "live demos are like Global Thermonuclear War, the only way to win is to not do a live demo." So why bother doing one? Showing a video is reliable and easy, and just as good. Right?

When you show a video, you lose something vital. There's a reason people still do live demos, even though we all know better. The reason is that a live demo is live.

This liveness is particularly effective if your audience is programmers like me. I have the traits of a scientist and an engineer: Like a scientist, I'm skeptical, and like an engineer I love to make things go.

Because I'm skeptical, I want proof. If you tell me what your code does, I want to see your code actually do it. It's not that I think you're lying, I just want your experiment reproduced in front of me, so I can verify it with the evidence of my senses. Until then, the scientist in me doesn't think I've done my job.

A few years ago I gave my first big talk, an introduction to MongoDB replica sets. It was at a conference in Atlanta, with an audience of a hundred. I was very nervous, but I was determined to do a demo. I must have practiced it fifty times before I did it live: I spun up a three-node replica set, I killed the primary node, and the surviving nodes elected a new primary. Abracadabra! At the end of the talk, someone asked, "I read somewhere that three nodes isn't enough to provide fault tolerance?" To this day I have no idea where he read that. But I was happy I could say, in front of the audience, "A three-node replica set can survive the loss of one node. You don't have to take my word for it—I've shown you."

I want proof, like most programmers, and I also want to make things go. I'm Doctor Frankenstein: I'm obsessed with creating something that is alive. The first time I made a turtle draw on the screen, the first time I made the computer go "beep", I fell in love. So, when I see you make the machine go, I'm entranced. You press a button and the machine is doing something, it is acting in the world. It's alive! A video of something the machine did in the past is no substitute for its activity in the room now.

In "The Work of Art in the Age of Mechanical Reproduction", Walter Benjamin distinguishes between original art and copies:

Even the most perfect reproduction of a work of art is lacking in one element: its presence in time and space, its unique existence at the place where it happens to be.

Benjamin calls this element, this thing that's lost when art is copied, its "aura." He imagines that the first use of art was in ritual. Back then, art was valuable because it was magic. The animals that Stone Age people painted in caves were instruments of magic, he thinks. A copy of a work of art has no magic power. It is separated from its ritual use, and so its only remaining value is aesthetic. "This permits the audience to take the position of a critic."

So, too, when you show me a video of your demo. I can appreciate your video aesthetically, if it's beautiful. But you don't want me to have critical distance: you want to be a magician. You want to perform the ritual in front of me and entrance me. You press the button, and the magic happens.

This is why people like Bill Gates and Steve Jobs have shown live demos instead of canned ones. They want to be magicians. The risk is great: Windows 98 blue-screened when Bill Gates demonstrated it, and Steve Jobs couldn't get his iPhone 4 online. If you're going to do a live demo you need a better backup plan than they had. And you need to practice like crazy. But the experience of a live demo cannot be matched. The magic only happens when the machine is doing something now, in the room. Don't you want to be a magician?

Motor 0.2.1 Released

Motor

Version 0.2.1 of Motor, the asynchronous MongoDB driver for Python and Tornado, has been released. It fixes two bugs:

  • MOTOR-32: The documentation claimed that MotorCursor.close immediately halted execution of MotorCursor.each, but it didn't. MotorCursor.each() is now halted correctly.
  • MOTOR-33: An incompletely iterated cursor's __del__ method sometimes got stuck and consumed 100% CPU forever, even though the application was still responsive.

The manual is on ReadTheDocs. If you find a bug or want a feature, I exhort you to report it.

PyCon APAC 2014 recap

ShiLin [Source]

Thanks to the miracle of satellite Internet, I'm posting from a plane over the Pacific. My cramped schedule prohibited me from visiting Taipei as long as I'd like: this trip comprised three days in the city and two days on planes. But the exuberant city, and the sincerity of the conference organizers' efforts, made it worthwhile.

I delivered a half-length version of my PyCon talk on async in the morning, to an audience slightly overflowing the room. When I was deciding how to cut the talk, I made the painful choice to cut the code and keep the analogies. And once again the analogies were real winners: lots of laughter when I started talking about sandwiches and pizza.

PyCon APAC was held at Academia Sinica, a research institute. Being in an academic setting gave me two big boosts as a speaker: lecture rooms and young people.

The lecture rooms are actually designed to help the speaker and audience stay connected. In contrast, the giant rooms in convention centers are designed to be usable for anything but good for nothing. (The room I last spoke in, at PyCon in Montréal, would serve best for assembling aircraft.) But the Academia Sinica rooms are purpose-built. As Scott Berkun writes, "the ideal room for a lecture is a theater. It's crazy, I know, but we solved most lecture-room problems about 2,000 years ago. The Greek amphitheater gets it all just about right, provided it doesn't rain." What a friendly feeling, to be surrounded by the audience and to see everyone's faces.

The audience was generally university students or professionals early in their careers. They came ready to learn. Plus, they're excited when Western open source programmers make the trip to meet them: there are fewer open source leaders in Asia (except perhaps Japan) and the area isn't saturated with conferences.

In the afternoon I gave my second talk, a new one I wrote this week. I guess people liked my async talk and came back for seconds: we overflowed the room so badly that latecomers could not wedge themselves through the door. I told a story about how a blogger complained that PyMongo was slow, and what tools I used to prove the blogger wrong. Huge laughs, the most fun I've had speaking.

Jesse pycon apac [Source]

My colleague Amalia Hawkins delivered her "Narrowing the Gender Gap at Hackathons" talk to general acclaim. Hackathons aren't yet the phenomenon in Asia that they are in the US, so there's a chance to start things right. Amalia's thesis is that focusing on the experience of all hackathon newcomers benefits everyone, and narrows the gender gap as a side effect.

Fernando Perez and Wes McKinney gave inspiring keynotes about their numerical Python tools, IPython and Pandas respectively. I'm severely ignorant about numerical Python, so I appreciated learning from experts. Jessica McKellar's keynote invited us to expand Python's reach among groups who don't feel welcome in the open source community.

Besides the conference, all I remember about Taipei is a continuous blur of food. The university cafeteria served weird delicious Chinese vegetables, sour eggplant, seitan, pieces of seaweed tied into bows. All for a dollar.

Amalia and I ate strange things for dinner at night markets. Grilled cuttlefish. Enoki mushrooms wrapped in bacon. One of the food stands made hotdogs, except the bun was replaced with a big sausage, so it was like a meta-hotdog. Amalia had two scoops of taro-root ice cream that had peanut brittle shaved onto them with a carpenter's lathe, and wrapped up into a crêpe like an ice-cream burrito.

We found a food cart in a grimy back alley that made ramen for 50 cents, with the freshest, chewiest noodles I've ever tasted.

PyCon in Taipei!

Taipei Rushhour birdseye

I got back from Street Retreat on Sunday, and tomorrow I fly to Taipei. Why in the world do I overschedule myself like this?

Nevertheless I'm excited to visit Taiwan for the first time and to speak at PyCon APAC. I'll give a shorter version of the talk I gave at PyCon in Montréal: "What Is Async, How Does It Work, And When Should I Use It?"

I'll also give a new talk on "Python Profiling: The Guts and The Glory." This isn't your regular old Python profiling talk. The regular old talk shows you cProfile, admits that its output is unreadable, and wishes you the best of luck. My talk will tell a story of drama and intrigue, introduce you to a powerful Python profiler called Yappi, show you how to visualize its output with KCacheGrind, and even delve into how CPython profilers actually work.