Doing Redundant Work to Speed Up Distributed Queries
20 Sep 2012
tl;dr: In distributed data stores, redundant operations can dramatically drop tail latency at the expense of increased system load; different Dynamo-style stores handle this trade-off differently, and there’s room for improvement.
Update 10/2013: Cassandra has since added support for “speculative retry”: effectively, Dean’s suggestions applied to Dynamo reads, as described below.
Update 9/2014: Akka 2.3.5 introduced support for this kind of speculative retry.
At scale, tail latencies matter. When serving high volumes of traffic, even a minuscule fraction of requests corresponds to a large number of operations. Latency has a huge impact on service quality, and looking at the average service latency alone is often insufficient. Instead, folks running high-performance systems at places like Amazon and Google look to the tail when measuring their performance.1 High variance often hides in distribution tails: in a talk at Berkeley last spring, Jeff Dean reported a 95th percentile latency of 24ms and a 99.9th percentile latency of 994ms in Google’s BigTable service, a 42x difference!
In distributed systems, there’s a subtle and somewhat underappreciated strategy for reducing tail latencies: doing redundant work. If you send the same request to multiple servers, then (all else being equal) you’ll get an answer back faster than if you’d sent it to just one. Waiting for, say, one of three servers to reply is often faster than waiting for one of one. The basic cause is variance in modern service components: requests take different amounts of time in the network and on different servers at different times.2 In Dean’s experiments, BigTable’s 99.9th percentile latency dropped to 50ms when he sent out a second, redundant request whenever the initial request hadn’t come back within 10ms, a 40x improvement. There’s a cost associated with redundant work, namely increased service load, but the increase may be modest: in this example, Dean recorded only a 5% increase in the total number of requests.3
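To make this concrete, here is a minimal sketch of such a hedged read using Python’s asyncio. It assumes a hypothetical per-replica coroutine `fetch_from_replica`, and the 10ms hedge delay simply mirrors Dean’s example rather than a tuned value; this is not code from any of the systems discussed here.

```python
import asyncio

async def hedged_request(fetch_from_replica, replicas, hedge_delay=0.010):
    """Send to one replica; if it hasn't answered within hedge_delay seconds,
    send a redundant request to a second replica and take the first answer."""
    tasks = [asyncio.create_task(fetch_from_replica(replicas[0]))]
    done, pending = await asyncio.wait(tasks, timeout=hedge_delay)
    if not done and len(replicas) > 1:
        # The first request is slow: hedge with a second, redundant copy.
        tasks.append(asyncio.create_task(fetch_from_replica(replicas[1])))
    if not done:
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # best-effort cancellation of the straggler
    return done.pop().result()
```

Best-effort cancellation of the straggler, as in the last step here, is one way to keep the extra load from redundant requests small.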
Learning about Google’s systems is instructive, but we can also observe the trade-off between tail latency and load in several publicly available distributed data stores patterned on Amazon’s influential Dynamo data store. In the original paper, Dynamo sends a client’s read and write requests to all replicas for a given key. For writes, the system needs to update all replicas anyway. For reads, requests are idempotent, so the system doesn’t necessarily need to contact all replicas, but should it? Sending read requests to all replicas results in a linear increase in load compared to sending to the minimum required number of replicas.4 For read-dominated workloads (like many internet applications), that extra load is significant. When is it worthwhile?
Open-source Dynamo-style stores have different answers. Apache Cassandra originally sent reads to all replicas, but CASSANDRA-930 and CASSANDRA-982 changed this: one commenter argued that “in IO overloaded situations” it was better to send read requests only to the minimum number of replicas. By default, Cassandra now sends reads to the minimum number of replicas 90% of the time and to all replicas 10% of the time, primarily for consistency purposes.5 (Surprisingly, the relevant JIRA issues don’t even mention the latency impact.) LinkedIn’s Voldemort also uses a send-to-minimum strategy (and has evidently done so since it was open-sourced). In contrast, Basho Riak chooses the “true” Dynamo-style send-to-all read policy.
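As a rough illustration of the two policies (not how any of these systems actually structure their read paths), here is a sketch of a read coordinator that either contacts only the R replicas it needs or fans out to all N and returns once the fastest R have answered; `read_replica(node, key)` is a hypothetical async call.

```python
import asyncio

async def quorum_read(read_replica, replicas, key, r, send_to_all=True):
    # send_to_all=True: contact every replica, return once the fastest r reply.
    # send_to_all=False: contact only r replicas and wait for all of them.
    targets = replicas if send_to_all else replicas[:r]
    tasks = [asyncio.create_task(read_replica(node, key)) for node in targets]
    responses = []
    for fut in asyncio.as_completed(tasks):
        responses.append(await fut)
        if len(responses) == r:
            break
    for task in tasks:
        task.cancel()  # drop any stragglers we no longer need
    return responses
```

The send-to-all path trades roughly N/R times the read load for the chance to return as soon as the fastest R replicas respond.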
Who’s right? What do these choices mean for a real NoSQL deployment? We can do a back-of-the-envelope analysis pretty easily. For one of our recent papers on latency-consistency trade-offs in Dynamo-style systems, we obtained latency data from Yammer’s Riak clusters. If we run some simple Monte Carlo analysis (script available here), we see that, perhaps unsurprisingly, redundant work can have a big effect on latencies. For example, at the 99.9th percentile, sending a read request to two servers and waiting for the first response is 17x faster than sending to a single server (maybe worth the 2x load increase). Sending reads to three servers and waiting for one is 30x faster. Pretty good!
99.9th percentile read latency (in ms), by number of requests sent and responses waited for:

| Responses waited for \ Requests sent | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 170.0 | 10.7 | 5.6 | 4.8 | 4.5 |
| 2 | | 200.6 | 33.9 | 6.5 | 5.3 |
| 3 | | | 218.2 | 50.0 | 7.5 |
| 4 | | | | 231.1 | 59.8 |
| 5 | | | | | 242.2 |
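The analysis behind a table like this is simple enough to sketch. The snippet below draws a latency for each request sent, takes the R-th fastest reply as the operation’s latency, and reports the 99.9th percentile; the lognormal `sample_latency_ms` is only a stand-in for the empirical Yammer distribution, so it won’t reproduce the exact numbers above.

```python
import random

def sample_latency_ms():
    # Placeholder for an empirical per-request latency distribution.
    return random.lognormvariate(1.0, 1.5)

def tail_latency_ms(n_sent, r_waited, trials=100_000, pct=0.999):
    latencies = []
    for _ in range(trials):
        replies = sorted(sample_latency_ms() for _ in range(n_sent))
        latencies.append(replies[r_waited - 1])  # R-th fastest reply wins
    latencies.sort()
    return latencies[int(pct * trials) - 1]

# Example: the "wait for 1 response" row, for 1 through 5 requests sent.
for n in range(1, 6):
    print(n, round(tail_latency_ms(n, 1), 1))
```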
The numbers above assume you send all of the requests at the same time, but this need not be the case. For example, sending a second request only if the first hasn’t come back within 8ms results in a modest 4.2% increase in requests sent and a 99.9th percentile read latency of 11.0ms. This works because of the long tail of the latency distributions we see in Yammer’s clusters: we only have to speed up a small fraction of queries to improve the overall performance.
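That delayed-send variant is just as easy to simulate. The sketch below reuses the placeholder distribution from the previous snippet (so it won’t reproduce the 4.2% and 11.0ms figures from the real Yammer data): it sends a second request only when the first hasn’t returned within the hedge delay, and reports both the tail latency and the fraction of extra requests.

```python
# Assumes sample_latency_ms from the previous sketch.
def delayed_hedge(delay_ms=8.0, trials=100_000, pct=0.999):
    latencies, extra = [], 0
    for _ in range(trials):
        first = sample_latency_ms()
        if first <= delay_ms:
            latencies.append(first)
        else:
            extra += 1  # the hedge fires: one additional request is sent
            latencies.append(min(first, delay_ms + sample_latency_ms()))
    latencies.sort()
    return latencies[int(pct * trials) - 1], extra / trials

tail, extra_fraction = delayed_hedge()
print(f"99.9th percentile: {tail:.1f}ms, extra requests: {extra_fraction:.1%}")
```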
To preempt any dissatisfaction, I’ll admit that this analysis is simplistic. First, I’m not considering the increased load on each server due to sending multiple requests. The increased load may in turn increase latencies, which would decrease the benefits we see here; this effect depends on the system and workload. Second, I’m assuming that each request’s latency is independently and identically distributed, meaning that each server behaves the same (according to the Yammer latency distribution we have). This models a system of equally loaded, equally powerful servers, but this too may be different in practice.6 Third, with a different latency distribution, the numbers will change. Real-world benchmarking is the best source of truth, but this analysis is a starting point, and you can easily play with different distributions of your own, either with the provided script or in your browser using an older demo I built.
Ultimately, the balance between redundant work and tail latency depends on the application. However, in latency-sensitive environments (particularly when there are serial dependencies between requests) this redundant work has a massive impact. And we’ve only begun: we don’t need to send to all replicas to see benefits—even one extra request can help—while delay and cancellation mechanisms like the ones that Jeff Dean hints at can further reduce load penalties. There’s a large amount of hard work and research to be done designing, implementing, and battle-testing these strategies, but I suspect that these kinds of techniques will have a substantial impact on future large-scale distributed data systems.
Thanks to Shivaram Venkataraman, Ali Ghodsi, Kay Ousterhout, Patrick Wendell, Sean Cribbs, and Andy Gross for assistance and feedback that contributed to this post.