Kyle Kingsbury

Elle: Finding Isolation Violations in Real-World Databases
Kyle Kingsbury, Jepsen

Talk Abstract

Distributed databases are supposed to manipulate data safely, ensuring properties like serializability even when nodes or networks fail. Unfortunately, they often don’t. We have found consistency violations in 26 systems over the last eight years, ranging from stale reads to catastrophic data loss.

We review the Jepsen testing library [1], which combines automated deployment, fault injection, and property-based testing techniques to uncover safety violations and performance characteristics in a broad range of distributed systems.

We then present Elle [2], a new library for analyzing Jepsen histories and finding consistency violations in linear time. We build on Adya’s formalism [3], which characterizes transactional anomalies as cycles in a dependency graph. Unfortunately, these graphs are invisible to clients. However, by carefully choosing the datatypes and operations we submit to a database, we can generate histories whose client-observable structure constrains the dependency graph of every Adya history the database could have executed.
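As a rough illustration of this idea, the Python sketch below uses append-only lists as the datatype: a read that returns [1, 2] reveals both which writes it observed and the order in which they were applied. This is not Elle's API. The history encoding and the function names (infer_edges, find_cycle) are hypothetical, and the sketch assumes every transaction committed, every appended value is unique per key, and every read is a prefix of the longest read of that key. It uses a plain depth-first search to find one cycle, which is a simplification of what a real checker would do.

```python
from collections import defaultdict


def infer_edges(history):
    """Infer dependency edges from a list-append history.

    history maps a transaction id to a list of operations, each either
    ("append", key, value) or ("r", key, observed_list).
    Returns a dict: txn -> set of (other_txn, edge_kind).
    """
    writer = {}                   # (key, value) -> txn that appended it
    longest = defaultdict(list)   # key -> longest list any read observed
    for txn, ops in history.items():
        for kind, key, arg in ops:
            if kind == "append":
                writer[(key, arg)] = txn
            elif len(arg) > len(longest[key]):
                longest[key] = list(arg)

    edges = defaultdict(set)

    def add(src, dst, kind):
        if src != dst:            # ignore a transaction's edges to itself
            edges[src].add((dst, kind))

    # ww: consecutive elements of the longest list give the write order.
    for key, order in longest.items():
        for a, b in zip(order, order[1:]):
            add(writer[(key, a)], writer[(key, b)], "ww")

    for txn, ops in history.items():
        for kind, key, arg in ops:
            if kind != "r":
                continue
            order = longest[key]
            if list(arg) != order[:len(arg)]:
                raise ValueError(f"read of {key!r} is not a prefix: {arg}")
            if arg:
                # wr: the reader observed the write of the last element.
                add(writer[(key, arg[-1])], txn, "wr")
            if len(arg) < len(order):
                # rw: the reader missed the next append in the write order.
                add(txn, writer[(key, order[len(arg)])], "rw")
    return edges


def find_cycle(edges):
    """Return one cycle as a list of (src, kind, dst) triples, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    path = []                     # edges along the current DFS stack

    def dfs(node):
        color[node] = GREY
        for dst, kind in sorted(edges.get(node, ())):
            path.append((node, kind, dst))
            if color[dst] == GREY:
                # dst is on the stack: the cycle starts at its outgoing edge.
                start = next(i for i, e in enumerate(path) if e[0] == dst)
                return path[start:]
            if color[dst] == WHITE:
                found = dfs(dst)
                if found:
                    return found
            path.pop()
        color[node] = BLACK
        return None

    for node in sorted(edges):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None


# Write skew: T1 and T2 each read a key the other one appends to.
write_skew = {
    "T1": [("r", "x", []), ("append", "y", 1)],
    "T2": [("r", "y", []), ("append", "x", 1)],
    "T3": [("r", "x", [1]), ("r", "y", [1])],
}
print(find_cycle(infer_edges(write_skew)))
```

In this history neither T1 nor T2 observed the other's append, so each has a read-write edge to the other, and the two edges form a cycle that no serial execution could produce.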

Elle can identify every anomaly in the Adya formalism (except those involving predicates) in linear time, allowing us to validate a broad range of isolation properties up to strict serializability. It automatically finds minimal counterexamples, helping an engineer see exactly which transactions were incompatible with, for example, snapshot isolation. It is sound but not complete: every anomaly it reports is real, but because some information may be missing from observed histories, it may not detect every anomaly. We use Jepsen and Elle to find consistency errors in a broad variety of datastores, including PostgreSQL, Dgraph, and Redis-Raft.
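To make the link between cycles and isolation levels concrete, the short Python sketch below names a cycle by the kinds of dependency edges it contains, following Adya's item-based definitions: write-write edges only (G0), write-write and write-read edges (G1c), exactly one read-write edge (G-single), and two or more read-write edges (G2). The function name classify_cycle and the edge labels "ww", "wr", and "rw" are hypothetical choices for this sketch, not Elle's API, and predicate-based anomalies are out of scope.

```python
def classify_cycle(edge_kinds):
    """Name a dependency cycle from the kinds of edges it contains.

    edge_kinds is a list of "ww", "wr", or "rw" labels, one per edge
    in the cycle. Labels and function name are illustrative only.
    """
    kinds = list(edge_kinds)
    unknown = set(kinds) - {"ww", "wr", "rw"}
    if not kinds or unknown:
        raise ValueError(f"not a valid cycle: {kinds}")

    anti_dependencies = kinds.count("rw")
    if anti_dependencies == 0:
        # No read-write edges: a pure write cycle, or one with reads of writes.
        return "G0" if "wr" not in kinds else "G1c"
    if anti_dependencies == 1:
        return "G-single"
    return "G2"


for cycle in (["ww", "ww"], ["ww", "wr"], ["wr", "rw"], ["rw", "rw"]):
    print(cycle, classify_cycle(cycle))
```

In Adya's hierarchy, the weakest anomaly class found bounds the strongest isolation level a history can satisfy: a G1c cycle rules out read committed and everything above it, while a G2 cycle of item dependencies rules out serializability but is permitted under snapshot isolation.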

Keywords
Distributed systems; transactions; consistency; testing

References
[1] Kingsbury, K., et al., Jepsen, https://jepsen.io.
[2] Alvaro, P. and Kingsbury, K., Elle, https://github.com/jepsen-io/elle.
[3] Adya, A., Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions, MIT, 1999.

Speaker Biography

Kyle Kingsbury has worked as a network engineer, sysadmin, and software engineer at a variety of software startups since 2004.

He is the primary author of the Riemann monitoring system and the Clojure From the Ground Up introduction to programming, as well as a slew of open-source libraries. Since 2013, Kingsbury has focused on testing and teaching distributed systems, including work on the Jepsen test library, the Knossos, Gretchen, and Elle safety checkers, and the Maelstrom workbench for learning and prototyping distributed algorithms.