Michael J. Fischer, Nancy A. Lynch and Michael S. Paterson, “Impossibility of Distributed Consensus with One Faulty Process,” Journal of the ACM, April 1985, 32(2):374-382.
Members of the selection committee were Richard Ladner, Yoram Moses, Michael Saks, and Nir Shavit (chair). The following descriptions of the winning paper’s contributions were written by Jennifer Welch and Vassos Hadzilacos (presented here with minor editing by Nir Shavit). Jennifer Welch writes the following:
The result of this paper (commonly known as FLP) is that, surprisingly, it is impossible for a set of processors in an asynchronous distributed system to agree on a binary value, even if only a single processor is subject to an unannounced crash. Although the result was motivated by the problem of committing transactions in distributed database systems, the proof is sufficiently general that it directly implies the impossibility of a number of related problems, including consensus.
This result has had a monumental impact in distributed computing, both theory and practice. Systems designers were motivated to clarify their claims concerning under what circumstances the systems work.
On the theory side, people have attempted to get around the impossibility result by changing the system assumptions or the problem statement. Work on changing the system assumptions includes the study of partially synchronous models and of various kinds of failure detectors. Modified problem statements include randomized algorithms, approximate agreement, k-set agreement, and condition-based approaches.
The proof technique used in FLP, valency arguments, has been used and adapted to show many other impossibility and lower bound results in distributed computing. These include impossibility results for consensus, k-set consensus, and renaming in various models, and lower bounds on contention and on the number of rounds for synchronous consensus.
The FLP result forms the basis of work on the wait-free hierarchy, in which data types are classified and compared according to the maximum number of processes for which they can solve wait-free consensus. The calculation of consensus numbers relies on valency arguments.
Finally, work on applying ideas from topology to fault-tolerant distributed computing were inspired by the posing of the k-set consensus problem, which in turn was inspired by the FLP result.
Vassos Hadzilacos writes the following:
This paper proved the most fundamental result in fault-tolerant distributed computing. The result asserts that a particularly simple and natural problem, consensus, is not solvable in the very attractive asynchronous model of computation, even under strong assumptions about the number and nature of failures that may occur – specifically, even if communication is perfectly reliable, if at most one process can fail, and if it can do so only by crashing.
This result has been extremely influential both because of its implications and because of the proof technique it pioneered. Its importance stems from the fact that consensus lies at the heart of many practical problems, including atomic commit, leader election, atomic broadcast, and the maintenance of consistent replicated data. The unsolvability result has motivated researchers to explore models that retain as many of the asynchronous model’s attractive features as possible, while making consensus (and related practical problems) solvable. These explorations include randomization, partially synchronous models, and unreliable failure detectors. The proof technique introduced in the nominated paper, now known as “the bivalence argument”, is notable for its originality, its elegance, and its wide use to obtain other impossibility results in distributed computing. The paper is written masterfully, and although it contains an unusually deep result, it is accessible even to advanced undergraduate computer science students.
The following is a brief historical perspective of the paper written by the authors themselves:
Nancy Lynch and Mike Fischer began working on the distributed consensus problem in early 1980 and proved a lower bound on the number of rounds needed to reach agreement in a synchronous distributed setting.
During 1981-82, Lynch worked with Danny Dolev and others on an approximate version of the consensus problem. For this problem, it turned out that the algorithms for the synchronous setting extended easily to the asynchronous setting, simply by adding enough extra processes to accommodate the differences in process views due to asynchrony. This observation led her to wonder whether synchronous algorithms for exact agreement could similarly be adapted to the asynchronous setting. She made no progress on such an adaptation, however.
Independently, Fischer was introduced to the problem of asynchronous consensus by Butler Lampson during a visit to Xerox PARC in early summer 1982. Lampson suggested an algorithm to solve the problem, and Fischer was anxious to understand it and to prove it correct. Although he made some progress on the former, he failed utterly on the latter.
Later that summer, Fischer, Lynch, and Mike Paterson tried to resolve this problem during a visit by Fischer to MIT, a subsequent visit by Paterson to Yale, and some phone calls. We worked simultaneously on trying to produce an algorithm and trying to prove that no such algorithm exists. Only gradually did we come to realize that the problem is fundamentally unsolvable.
The intuition behind the impossibility proof is pretty simple: Initially, either decision, 0 or 1, is possible. Assume a correct algorithm. At some point in time the system as a whole must commit to one value or the other. That commitment must result from some action of a single process. Suppose that process fails. Then there is no way for the other processes to know the commitment value; hence, they will sometimes make the wrong decision. Contradiction!
This simple argument turned out to be surprisingly difficult to make precise. The subtlety of the argument became apparent to us when we tried to explain the result to others: many people expressed skepticism and disbelief. It took a lot of time, and careful polishing of the proof on our part, before the correctness of the result became generally accepted.
Little did we imagine how influential the paper would turn out to be! We assumed that the main value of our impossibility result was to close off unproductive lines of research on trying to find fault-tolerant consensus algorithms. But much to our surprise, it opened up entirely new lines of research. There has been analysis of exactly what assumptions about the distributed system model are needed for the impossibility proof. Many related distributed problems to which the proof also applies have been found, together with seemingly similar problems which do have solutions. Eventually a long line of research developed in which primitives were classified based on their ability to implement wait-free fault-tolerant consensus.
We thank the Awards Committee for the recognition and the prize. We especially thank the PODC community for the subsequent research which allowed our paper to be deemed influential.
This page maintained by Gil Neiger.
Last modified: January 23, 2003