Performance Modeling and Design of Computer Systems: Queueing Theory in Action, by Dr. Mor Harchol-Balter.
We are A. Jesse Jiryu Davis, Andrew Helwer, and Murat Demirbas, three enthusiasts of distributed systems and formal methods. We’re looking for rigorous ways to model the performance of distributed systems, and we hoped this book would point the way.
We followed the author’s undergrad syllabus, although we were disappointed at its focus on single-threaded task queues. Most of our work is with multi-threaded servers, aka Processor-Sharing, so we added Chapter 10 section 2 (Aloha protocol) and Chapter 22 (only Jesse read it), skipped Chapter 29, and added Chapter 30. We met roughly every two weeks on Zoom, from December 2022 to August 2023. Between meetings we spent a few hours per week reading; we did many of the homework exercises at the beginning and none of them towards the end.
Andrew asked us five questions about our experience with this book, here are our answers.
Q: Why did you join the reading group for this book?
Jesse: I’m a distributed systems researcher at MongoDB Research, I want to learn about autoscaling for systems with many machines and many services. The book claims on its first page that it’s “written with computer scientists and engineers in mind”, so I hoped it would help me analyze the resource requirements of big complex systems.
Andrew: Jesse sold me on a summary of Marc Brooker’s talk, Formal Methods Only Solve Half My Problems. TLA+ can’t effectively model properties like latency, throughput, or bottlenecks. These are all problems software engineers care about, and I’d been asked how to analyze them by people going through TLA+ tutorials. I was hoping to add a complementary tool to my toolbox, and invest more learning time into distributed systems generally.
Murat: My work at AWS involves modeling distributed systems/protocols for correctness and fault-tolerance validation. It quickly becomes apparent that it is also important to model/estimate the performance improvements we would get from pursuing variants of these protocols. Marc Brooker makes this point nicely in his post titled, Formal Methods Only Solve Half My Problems. I knew that queueing theory provides a basic toolkit for modeling performance and estimating bottlenecks of systems, and we employed some basic queueing theory for our paper on “Dissecting performance bottlenecks of strongly-consistent replication protocols”. I had heard Brooker recommend Mor’s book as an introduction to analyzing performance using queueing theory, and I was happy to find people interested in studying the book together. I knew that when I would feel unmotivated, I would be able to continue to journey out of civic duty to my study group friends. I wouldn’t have tried this alone.
Q: Did you meet your objectives by the end of the book?
Jesse: No, I did not learn practical methods to answer my day-to-day questions. Harchol-Balter writes in the preface, “First, I want to provide enough applications from computer systems to make the book relevant and interesting to computer scientists. Second, I want to make the book mathematically rich enough to give readers the ability to actually develop new queueing analysis, not just apply existing analysis.” I am not interested in the second goal, and I don’t think the book fully accomplished the first goal. I must not be the kind of computer scientist she had in mind. She writes, “systems designers could mathematically model the system, … analytically derive the performance of the system as a function of workload and input parameters.” I have read most of the book, but I can’t do this. For the systems I work with, the math is too hard to be worthwhile compared to simulation, or else a closed-form solution is actually intractable or unknown. Now I’m seeking a book that teaches best practices for simulating the performance of complex computer systems, which I believe is practical.
Andrew: I did not. If you plunked down a work queue network in front of me at this moment and asked me to analyze it using the methods from the book, I couldn’t. Maybe this is on me for not doing many of the exercises, but I don’t think coming up with a bunch of differential equations to model a system is something that will really fly in industry. None of your coworkers would understand what the hell you’re doing. And maybe they should, like maybe we should expect advanced mathematical knowledge from our software engineers, but that norm is at least a decade away even if there were a big push for that education now. Really I’d say the drift of software engineering has been toward simple tools powered by good abstractions, and the constant search for new, better abstractions. I think TLA+ is a really great abstraction of a distributed or concurrent system if you want to analyze consistency properties, for example. In that vein I would want to analyze the queue network by writing a simulation of it. Simulations are easy, and easy to understand. Modeling something with systems of differential equations which you then solve feels like something from another era. Like how a physicist on the Manhattan project might work. Maybe it’s extraordinarily effective if you have the ability, but it isn’t scalable in human terms. Every assignment becomes its own unique thing with its own unique difficulties and risk of getting stumped.
Murat: Not really. I got some familiarity with the queueing theorist style of approaching the problems, but I didn’t get any mastery of the space. I knew I didn’t want to be a citizen of queueing-theoryville, but I don’t think I got enough learning/insight by touring the city under the guidance of this book. Seeing expert theoretical physicists derive lengthy formulas on blackboard did not help me. I needed a high-school physics teacher to explain popularized/simplified versions of the formulas. As I read the book, I kept thinking this could have been explained much better, much shorter, and much nicer.
Q: What was the most valuable thing you took from the book?
Jesse: Since I did not learn how to answer all my questions with equations, I now hope to answer them with off-the-shelf simulation software. (JMT seems promising.) So it was valuable to learn the concepts and terminology to configure such software. These simulation tools have some quirks, such as requiring most/all probability distributions to be Exponential, which make sense to me now that I’ve seen the math. I learned rules of thumb like the “Square-Root Staffing Rule” or “All-Can-Win Theorem”, which I can look up in the book whenever I need. (I briefly understood the justification for some of these rules, too.) I got some intuition about the interactions among job size variability, utilization, and response time variability. Job size variability turns out to be the most important parameter; I’m glad I know that now.
Andrew: I’d always been confused by the difference between poisson, exponential, and Pareto distributions. Queueing theory was an ideal context in which to learn these. The difference between exponential (memoryless) and Pareto (memory…ful?) distributions were particularly interesting, with varying implications of how you should act if you’re operating in one versus the other. Closely related, I enjoyed learning about various Markov chain properties. Ergodic, irreducible, null-recurrent, and so forth. I’ve read a few books by Nassim Nicholas Taleb (The Black Swan, Skin in the Game) and he often uses these concepts so it was nice to retroactively understand the theses at a deeper level. I am also interested in probabilistic model checking so getting acquainted with discrete- & continuous-time Markov chains at a deeper level was welcome. Finally, it was useful to learn the ontology of these systems: what properties do we care about? How are those properties related & derived? It felt like I was in an introductory physics class again.
Murat: I enjoyed derivation of nontrivial results/formulas starting with simple first principles, whenever I caught glimpses. This reminded me of my high-school calculus and math classes where I saw that simple math and reasoning can carry one a long way. I really enjoyed the occasional Socratic method employed in the book. By just adopting a very simple question and answer format, some sections of the book were made much more approachable. It is a simple yet very effective trick that we should employ more.
To mention specific chapters of the book, I liked the Chapter 3 probability review (I needed that), and Chapter 6 where we saw the surprising effectiveness of Little’s Law (I think I needed to be shown more examples on this to internalize it). I also enjoyed Chapter 24 task assignment in server farms, because the book returns to a more practical topic after covering theoretical topics.
Q: What was your least favorite aspect of the book?
Jesse: The book mostly assumes that a “server” does one task at a time. But when I see the word “server” I assume we are referring to a multi-core, multi-threaded machine doing many tasks in parallel, so it took me a long time to realize how inapplicable most of the book’s examples are to my work. If the book had referred to something actually single-threaded like a “CPU pipeline” or “network channel” I would’ve understood sooner how much of the book isn’t for me. Networks of multi-threaded servers are finally discussed in Chapter 22, which is not on the undergrad curriculum we followed. I read the chapter anyway: it’s mostly long proofs, and the practical techniques at the end are still too intricate for me to ever apply. It did teach me why off-the-shelf queue simulation software has certain constraints, e.g. job arrivals must be a Poisson process.
Andrew: Many chapters were largely composed of equations. At some level I appreciate the inclusion of these proofs but I would have to expect to live for 1000 years or more to justify investing the time to follow them line by line. If I were to really do this book justice I would come out the other side a fully-fledged queueing theorist. It wasn’t my objective to become a queueing theorist so large parts of the book were completely wasted on me. I just skipped over them. Maybe they could have been moved to an appendix, I don’t know. Later chapters would only be a couple of pages long then.
Murat: Long winded derivation of formulas with little motivation and connection to practical applications. I also would have liked more examples and context to practice and strengthening what was explained in order to serve my style of learning. To mention a specific chapter of the book, I think Chapter 12 on continuous time Markov chains was hard, theoretical, and disconnected.
Q: Was your time with the book well-invested?
Jesse: Yes, actually. I’m in an exploratory phase of my career, so it’s reasonable to run experiments like, “If I read this math textbook will it teach me to design and analyze systems?” I discovered that the answer is mostly “no”, and I also learned some useful queue theory principles. I enjoyed Dr. Harchol-Balter’s clear, enthusiastic writing. It was fun to discuss the book with Murat and Andrew. I intend to keep reading hard books with friends.
Andrew: Yes, despite my comments above I’m happy I went through the book. I got a taste of queueing theory and a rough estimate of the current outer limits of its power. I acquired some real knowledge of things I care about in other contexts, like exponential/Pareto distributions and Markov chains. Above all it was a nice way to spend time with Jesse and Murat. I look forward to our next book!
Murat: Yes. I don’t regret the journey. This was something completely different than the type of distributed systems work I do, and I welcomed learning about a different world alongside my friends Jesse and Andrew. I really enjoyed their companionship, they carried me to the finish line. I also got confidence that if I invest more time in it, I have a chance to master this strange world if I wanted to. But I don’t think going analytical is the right approach. By following a simulation-based approach to performance modeling/estimation, I think I will be able to learn and retain more with less effort, and will be able to apply that toolkit to real-world practical systems to make some leeway in solving the other half of my problems.
To make things more concrete, consider the SEDA paper from SOSP 2001: “SEDA: An architecture for well-conditioned scalable internet services”. I love this paper, it defines and solves an important problem: “A service is well-conditioned if it behaves like a simple pipeline: as the offered load increases, the delivered throughput increases proportionally until the pipeline is full and the throughput saturates; additional load should not degrade throughput.” With a simulation-based approach, I can explore what would be a suitable way of laying out my service to meet performance goals, and when things needs to change apply the same technique to restructure the service. I don’t think I would come close to doing these with analytical queueing theory techniques.