PODC ’19 Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing
SESSION: Awards
2019 Edsger W. Dijkstra Prize in Distributed Computing
The committee decided to award the 2019 Edsger W. Dijkstra Prize in Distributed Computing to Alessandro Panconesi and Aravind Srinivasan for their paper Randomized Distributed Edge Coloring via an Extension of the Chernoff-Hoeffding Bounds, SIAM Journal on Computing, volume 26, number 2, 1997, pages 350-368. A preliminary version of this paper appeared as Fast Randomized Algorithms for Distributed Edge Coloring, Proceedings of the Eleventh Annual ACM Symposium on Principles of Distributed Computing (PODC), 1992, pages 251-262.
The paper presents a simple synchronous algorithm in which processes at the nodes of an undirected network color its edges so that the edges adjacent to each node have different colors. It is randomized, using 1.6Δ + O(log^{1+ζ} n) colors and O(log n) rounds with high probability for any constant ζ > 0, where n is the number of nodes and Δ is the maximum degree of the nodes. This was the first nontrivial distributed algorithm for the edge coloring problem and has influenced a great deal of follow-up work. Edge coloring has applications to many other problems in distributed computing such as routing, scheduling, contention resolution, and resource allocation.
In spite of its simplicity, the analysis of their edge coloring algorithm is highly nontrivial. Chernoff-Hoeffding bounds, which assume random variables to be independent, cannot be used. Instead, they develop upper bounds for sums of negatively correlated random variables, which arise, for example, when sampling without replacement. More generally, they extend Chernoff-Hoeffding bounds to certain random variables they call λ-correlated. This has directly inspired more specialized concentration inequalities. The new techniques they introduced have also been applied to the analyses of important randomized algorithms in a variety of areas including optimization, machine learning, cryptography, streaming, quantum computing, and mechanism design.
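As a rough numerical illustration of the phenomenon the analysis exploits, the following Python sketch (our own illustration, not from the paper; the population sizes and threshold are arbitrary choices) compares the upper tail of a sum of draws made without replacement, which are negatively correlated, against the same sum with independent draws:

```python
import random

random.seed(0)

N, K, n = 200, 100, 100          # population size, number of ones, sample size
TRIALS, THRESH = 20000, 62       # illustrative parameters; mean of each sum is 50
population = [1] * K + [0] * (N - K)

def sample_sum(with_replacement):
    if with_replacement:
        return sum(random.choices(population, k=n))   # independent draws
    return sum(random.sample(population, n))          # negatively correlated draws

tail_with = sum(sample_sum(True) >= THRESH for _ in range(TRIALS)) / TRIALS
tail_without = sum(sample_sum(False) >= THRESH for _ in range(TRIALS)) / TRIALS

# both tails are small, and sampling without replacement has the lighter
# upper tail, consistent with Chernoff-Hoeffding-type bounds still applying
print(tail_with, tail_without)
```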
2019 Principles of Distributed Computing Doctoral Dissertation Award
The winner of the 2019 Principles of Distributed Computing Doctoral Dissertation Award is Dr. Sepehr Assadi for his dissertation Combinatorial Optimization on Massive Datasets: Streaming, Distributed, and Massively Parallel Computation, written under the supervision of Prof. Sanjeev Khanna at the University of Pennsylvania.
The thesis resolves a number of long-standing problems in the exciting and still relatively new area of sublinear computation, which focuses on the design of algorithms that use sublinear space, time, or communication to solve global optimization problems on very large datasets. The thesis addresses a wide range of problems, comprising graph optimization (matching, vertex cover, and connectivity), submodular optimization (set cover and maximum coverage), and resource-constrained optimization (combinatorial auctions and learning), and studies them in three different models of computation: streaming algorithms, multiparty communication, and massively parallel computation (MPC). It also reveals interesting relations between these models, including generic algorithmic and analysis techniques that can be applied in all of them.
For many fundamental optimization problems, the thesis gives asymptotically matching algorithmic and intractability results, completely resolving several long-standing problems. This is accomplished by using a broad spectrum of mathematical methods in very detailed and intricate proofs. In addition to a wide variety of classic techniques drawn from graph theory, combinatorics, probability, linear algebra, and calculus, it also makes heavy use of communication complexity and information theory.
Sepehr’s dissertation work has been published in a remarkably large number of top-conference papers. It received multiple best paper awards and multiple special issue invitations, as well as two invitations to the Highlights of Algorithms (HALG) conference. Due to its contributions to the field of distributed computing and all the merits mentioned above, the award committee unanimously selected this thesis as the winner of the 2019 Principles of Distributed Computing Doctoral Dissertation Award.
SESSION: Keynote Lecture 1
Local Computation Algorithms
Consider a setting in which the inputs to and outputs from a computational problem are so large that there is no time to read them in their entirety. However, if one is only interested in small parts of the output at any given time, is it really necessary to solve the entire computational problem? Is it even necessary to view the whole input? We survey recent work in the model of “local computation algorithms” which, for a given input, supports queries by a user for the values of specified bits of a legal output. The goal is to design local computation algorithms in such a way that very little of the input needs to be seen in order to determine the value of any single bit of the output. Though this model describes sequential computations, techniques from local distributed algorithms have been extremely important in designing efficient local computation algorithms.
In this talk, we describe results on a variety of problems for which sublinear time and space local computation algorithms have been developed — we will give special focus to finding maximal independent sets and sparse spanning graphs.
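The flavor of such algorithms can be illustrated with the standard random-ranks construction for maximal independent set: to answer whether a single vertex is in the MIS, one only explores paths of decreasing rank. The sketch below is a simplified sequential illustration on a toy graph of our choosing, not one of the specific algorithms surveyed in the talk:

```python
import random

random.seed(1)

# toy graph (adjacency lists); the local algorithm answers membership
# queries without ever materializing the whole independent set
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
rank = {v: random.random() for v in graph}   # shared random ranks

def in_mis(v):
    # v joins the greedy (rank-order) MIS iff no lower-ranked neighbor
    # joined it; recursion only follows strictly decreasing ranks,
    # so it terminates and typically touches a small neighborhood
    return all(not in_mis(u) for u in graph[v] if rank[u] < rank[v])

mis = {v for v in graph if in_mis(v)}
print(mis)
```

Each query probes only the vertices reachable along decreasing-rank paths, which is the source of the sublinear per-query cost.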
SESSION: Session 1
Symmetry Breaking in the Plane: Rendezvous by Robots with Unknown Attributes
We study a fundamental question related to the feasibility of deterministic symmetry breaking in the infinite Euclidean plane for two robots that have minimal or no knowledge of the respective capabilities and “measuring instruments” of themselves and each other. Assume that two anonymous mobile robots are placed at different locations at unknown distance d from each other on the infinite Euclidean plane. Each robot knows neither its own location nor that of the other robot. The robots cannot communicate wirelessly, but each has a certain nonzero visibility radius r (with r unknown to the robots). By rendezvous we mean that they are brought to within distance at most r of each other by executing symmetric (identical) mobility algorithms. The robots move with constant but unknown, and not necessarily identical, speeds; their clocks and pedometers may be asymmetric; and their chiralities may be inconsistent.
We demonstrate that rendezvous for two robots is feasible under the studied model iff the robots have either: different speeds; or different clocks; or different orientations but equal chiralities. When the rendezvous is feasible, we provide a universal algorithm which always solves rendezvous despite the fact that the robots have no knowledge of which among their respective parameters may be different.
Composable Computation in Discrete Chemical Reaction Networks
We study the composability of discrete chemical reaction networks (CRNs) that stably compute (i.e., with probability 0 of error) integer-valued functions f: N^{d} → N. We consider output-oblivious CRNs, in which the output species is never a reactant (input) to any reaction. The class of output-oblivious CRNs is fundamental, appearing in earlier studies of CRN computation, because it is precisely the class of CRNs that can be composed by simply renaming the output of the upstream CRN to match the input of the downstream CRN.
Our main theorem precisely characterizes the functions f stably computable by output-oblivious CRNs with an initial leader. The key necessary condition is that, for sufficiently large inputs, f is the minimum of a finite number of nondecreasing quilt-affine functions. (An affine function is linear with a constant offset; a quilt-affine function is linear with a periodic offset.)
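A small sketch may make "quilt-affine" concrete. The function ⌊n/2⌋ is linear with a periodic offset, and a minimum of finitely many nondecreasing quilt-affine functions has the shape the theorem requires for large inputs; the particular functions below are our own illustrative choices, not taken from the paper:

```python
# floor(n/2) is quilt-affine: linear rate 1/2 plus an offset that
# depends only on n mod 2 (period p = 2)
offset = {0: 0.0, 1: -0.5}
for n in range(100):
    assert n // 2 == n / 2 + offset[n % 2]

# a minimum of finitely many nondecreasing quilt-affine functions,
# the shape the characterization requires for sufficiently large inputs
def f(n):
    return min(n, 3 + n // 2)

print([f(n) for n in range(8)])
```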
How to Spread a Rumor: Call Your Neighbors or Take a Walk?
We study the problem of randomized information dissemination in networks. We compare the now-standard PUSH-PULL protocol with agent-based alternatives, where information is disseminated by a collection of agents performing independent random walks. In the VISIT-EXCHANGE protocol, both nodes and agents store information, and each time an agent visits a node, the two exchange all the information they have. In the MEET-EXCHANGE protocol, only the agents store information, and two agents exchange their information whenever they meet.
We consider the broadcast time of a single piece of information in an n-node graph for the above three protocols, assuming a linear number of agents that start from the stationary distribution. We observe that there are graphs on which the agent-based protocols are significantly faster than PUSH-PULL, and graphs where the converse is true. We attribute the good performance of agent-based algorithms to their inherently fair bandwidth utilization, and conclude that, in certain settings, agent-based information dissemination, separately or in combination with PUSH-PULL, can significantly improve the broadcast time.
The graphs considered above are highly non-regular. Our main technical result is that on any regular graph of at least logarithmic degree, PUSH-PULL and VISIT-EXCHANGE have the same asymptotic broadcast time. The proof uses a novel coupling argument which relates the random choices of vertices in PUSH-PULL with the random walks in VISIT-EXCHANGE. Further, we show that the broadcast time of MEET-EXCHANGE is asymptotically at least as large as that of the other two on all regular graphs, and strictly larger on some regular graphs.
As far as we know, this is the first systematic and thorough comparison of the running times of these very natural information dissemination protocols.
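For intuition, here is a minimal Python simulation of PUSH-PULL broadcast on the complete graph, one illustrative special case (the paper's analysis covers general regular graphs and the agent-based protocols as well; parameters are our own choices):

```python
import random

random.seed(2)

def push_pull_rounds(n):
    """Rounds until a rumor started at node 0 reaches all n nodes of K_n."""
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        new = set(informed)
        for v in range(n):
            u = random.randrange(n - 1)   # uniform random node other than v
            if u >= v:
                u += 1
            if v in informed or u in informed:   # push if v knows, pull if u knows
                new.update((v, u))
        informed = new
    return rounds

rounds = push_pull_rounds(1000)
print(rounds)   # logarithmic in n with high probability
```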
Efficient Size Estimation and Impossibility of Termination in Uniform Dense Population Protocols
We study uniform population protocols: networks of anonymous agents whose pairwise interactions are chosen at random, where each agent uses an identical transition algorithm that does not depend on the population size n. Many existing polylog(n)-time protocols for leader election and majority computation are nonuniform: to operate correctly, they require all agents to be initialized with an approximate estimate of n (specifically, the value ⌊log n⌋). Our first main result is a uniform protocol for calculating log n ± O(1) with high probability in O(log^2 n) time and O(log^4 n) states (O(log log n) bits of memory). The protocol is not terminating: it does not signal when the estimate is close to the true value of log n. If it could be made terminating with high probability, this would allow composition with protocols requiring a size estimate initially. We do show how our main protocol can be indirectly composed with others in a simple and elegant way, based on leaderless phase clocks, demonstrating that those protocols can in fact be made uniform. However, our second main result implies that the protocol cannot be made terminating. This is a consequence of a much stronger result: a uniform protocol for any task requiring more than constant time cannot be terminating, even with probability bounded above 0, if infinitely many initial configurations are dense: any state present initially occupies Ω(n) agents. (In particular, no leader is allowed.) Crucially, the result holds no matter the memory or time permitted.
On Counting the Population Size
We consider the problem of counting the population size in the population model. In this model, we are given a distributed system of n identical agents which interact in pairs with the goal of solving a common task. In each time step, the two interacting agents are selected uniformly at random. In this paper, we consider so-called uniform protocols, where the actions of two agents upon an interaction may not depend on the population size n. We present two population protocols to count the size of the population: protocol Approximate, which computes with high probability either ⌊log n⌋ or ⌈log n⌉, and protocol CountExact, which computes the exact population size in optimal O(log n) interactions, using Õ(n) states. Both protocols can also be converted to stable protocols that give a correct result with probability 1, at the cost of an additional multiplicative factor of O(log n) states.
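One standard ingredient behind such estimators is that the maximum of n independent geometric random variables concentrates around log_2 n. The sketch below simulates only this ingredient; it is an illustration of the general technique, not of the paper's protocols, and it assumes the maximum has already been spread to all agents (e.g., by epidemic):

```python
import random

random.seed(3)

def geometric():
    """Number of fair-coin flips until the first heads (support 1, 2, 3, ...)."""
    k = 1
    while random.random() < 0.5:
        k += 1
    return k

n = 1 << 16   # true population size, 2**16
# each agent draws a geometric value; spreading the maximum by pairwise
# interactions would give every agent the global max, which concentrates
# around log2(n) = 16 up to an additive constant
estimate = max(geometric() for _ in range(n))
print(estimate)   # typically within a few units of 16
```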
SelfStabilizing Leader Election
In this paper, we study the self-stabilizing leader election (SSLE) problem in population protocols. We construct a nondeterministic population protocol that can solve SSLE on directed rings of all sizes. Our algorithm uses a constant number of states and can be converted to a deterministic population protocol on undirected rings using previous techniques [8]. Furthermore, we extend our algorithm to perform SSLE on directed and undirected tori of arbitrary sizes.
Logarithmic ExpectedTime Leader Election in Population Protocol Model
In this paper, we present a leader election protocol in the population protocol model that stabilizes within O(log n) parallel time in expectation with O(log n) states per agent, where n is the number of agents. Given rough knowledge m of the population size n such that m ≥ log_{2} n and m = O(log n), this protocol guarantees that exactly one leader is elected and that the unique leader is kept forever thereafter.
On Site Fidelity and the Price of Ignorance in Swarm Robotic Central Place Foraging Algorithms
A key factor limiting the performance of central place foraging algorithms is the awareness of the agent(s) about the location of food items around the nest. We study the ratio of how much time an ignorant agent takes relative to an omniscient forager for complete collection of food items in the arena. This effectively quantifies the penalty each algorithm pays for not knowing (or choosing to ignore information gained about) where the resources are located. We model the effect of depletion of food items from the arena on the foraging efficiency over time and analytically verify that returning to the location of the last food item found strongly helps in counteracting this effect. To the best of our knowledge, these results have only been empirically argued so far.
SESSION: Session 2
Improved Distributed Expander Decomposition and Nearly Optimal Triangle Enumeration
An (ε,φ)-expander decomposition of a graph G=(V,E) is a clustering of the vertices V=V_{1}∪…∪V_{x} such that (1) each cluster V_{i} induces a subgraph with conductance at least φ, and (2) the number of inter-cluster edges is at most ε|E|. In this paper, we give an improved distributed expander decomposition, and obtain a nearly optimal distributed triangle enumeration algorithm in the CONGEST model.
Specifically, we construct an (ε,φ)-expander decomposition with φ=(ε/log n)^{2^{O(k)}} in O(n^{2/k} ⋅ poly(1/φ, log n)) rounds for any ε ∈ (0,1) and positive integer k. For example, a (1/n^{o(1)}, 1/n^{o(1)})-expander decomposition requires only O(n^{o(1)}) rounds to compute, which is optimal up to subpolynomial factors, and a (0.01, 1/poly log n)-expander decomposition can be computed in O(n^{γ}) rounds for any arbitrarily small constant γ > 0. Previously, the algorithm by Chang, Pettie, and Zhang could construct a (1/6, 1/poly log n)-expander decomposition using Õ(n^{1-δ}) rounds for any δ > 0, with the caveat that the algorithm is allowed to throw away a set of edges into an extra part which forms a subgraph with arboricity at most n^{δ}. Our algorithm does not have this caveat.
By slightly modifying the distributed algorithm for routing on expanders by Ghaffari, Kuhn, and Su [PODC '17], we obtain a triangle enumeration algorithm using Õ(n^{1/3}) rounds. This matches the Ω̃(n^{1/3}) lower bound of Izumi and Le Gall [PODC '17] and Pandurangan, Robinson, and Scquizzato [SPAA '18], which holds even in the CONGESTED-CLIQUE model. To the best of our knowledge, this provides the first nontrivial example of a distributed problem that has essentially the same complexity (up to a polylogarithmic factor) in both CONGEST and CONGESTED-CLIQUE.
The key technique in our proof is the first distributed approximation algorithm for finding a low conductance cut that is as balanced as possible. Previous distributed sparse cut algorithms do not have this nearly most balanced guarantee.
Fast Approximate Shortest Paths in the Congested Clique
We design fast deterministic algorithms for distance computation in the CONGESTED CLIQUE model. Our key contributions include:
- A (2+ε)-approximation for the all-pairs shortest paths problem in O(log^{2} n / ε) rounds on unweighted undirected graphs. With a small additional additive factor, this also applies to weighted graphs. This is the first constant-factor approximation for APSP in this model that runs in subpolynomial time.
- A (1+ε)-approximation for the multi-source shortest paths problem from O(√n) sources in O(log^{2} n / ε) rounds on weighted undirected graphs. This is the first subpolynomial-time algorithm obtaining this approximation for a set of sources of polynomial size.
Our main techniques are new distance tools that are obtained via improved algorithms for sparse matrix multiplication, which we leverage to construct efficient hopsets and shortest paths. Furthermore, our techniques extend to additional distance problems for which we improve upon the state of the art, including diameter approximation, and an exact single-source shortest paths algorithm for weighted undirected graphs in Õ(n^{1/6}) rounds.
Quantum Distributed Algorithm for the All-Pairs Shortest Path Problem in the CONGEST-CLIQUE Model
The All-Pairs Shortest Path problem (APSP) is one of the most central problems in distributed computation. In the CONGEST-CLIQUE model, in which n nodes communicate with each other over a fully connected network by exchanging messages of O(log n) bits in synchronous rounds, the best known general algorithm for APSP uses Õ(n^{1/3}) rounds. Breaking this barrier is a fundamental challenge in distributed graph algorithms. In this paper we investigate for the first time quantum distributed algorithms in the CONGEST-CLIQUE model, where nodes can exchange messages of O(log n) quantum bits, and show that this barrier can be broken: we construct an Õ(n^{1/4})-round quantum distributed algorithm for APSP over directed graphs with polynomial weights in the CONGEST-CLIQUE model. This speedup in the quantum setting contrasts with the case of the standard CONGEST model, for which Elkin et al. (PODC 2014) showed that quantum communication does not offer significant advantages over classical communication.
Our quantum algorithm is based on a relationship discovered by Vassilevska Williams and Williams (JACM 2018) between APSP and the detection of negative triangles in a graph. The quantum part of our algorithm exploits the framework for quantum distributed search recently developed by Le Gall and Magniez (PODC 2018). Our main technical contribution is a method showing how to implement multiple quantum searches (one for each edge in the graph) in parallel without introducing congestion.
Deterministic Distributed Dominating Set Approximation in the CONGEST Model
We develop deterministic approximation algorithms for the minimum dominating set problem in the CONGEST model with an almost optimal approximation guarantee. For ε > 1/poly log Δ we obtain two algorithms with approximation factor (1 + ε)(1 + ln(Δ + 1)) and with runtimes 2^{O(√(log n log log n))} and O(Δ poly log Δ + poly log Δ log* n), respectively. Further, we show how dominating set approximations can be deterministically transformed into a connected dominating set in the CONGEST model while only increasing the approximation guarantee by a constant factor. This results in a deterministic O(log Δ)-approximation algorithm for the minimum connected dominating set with time complexity 2^{O(√(log n log log n))}.
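For intuition about the (1 + ln(Δ + 1)) factor, the classical sequential greedy algorithm achieves the analogous guarantee; the distributed algorithms above derandomize and parallelize far more refined machinery, but a minimal sequential sketch (with a toy graph chosen for illustration) is:

```python
def greedy_dominating_set(adj):
    """Repeatedly pick the vertex whose closed neighborhood covers the most
    still-uncovered vertices; the classical ln-factor greedy heuristic."""
    uncovered = set(adj)
    dom = set()
    while uncovered:
        v = max(adj, key=lambda u: len(({u} | set(adj[u])) & uncovered))
        dom.add(v)
        uncovered -= {v} | set(adj[v])
    return dom

# small test graph: a star (center 0) plus a pendant path 3-4
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
dom = greedy_dominating_set(adj)
print(dom)
```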
Optimal Distributed Covering Algorithms
We present a time-optimal deterministic distributed algorithm for approximating a minimum weight vertex cover in hypergraphs of rank f. This problem is equivalent to the Minimum Weight Set Cover problem in which the frequency of every element is bounded by f. The approximation factor of our algorithm is (f + ε). Let Δ denote the maximum degree in the hypergraph. Our algorithm runs in the CONGEST model and requires O(log Δ / log log Δ) rounds, for constants ε ∈ (0,1] and f ∈ N^{+}. This is the first distributed algorithm for this problem whose running time depends neither on the vertex weights nor on the number of vertices, thus adding another member to the exclusive family of provably optimal distributed algorithms.
For constant values of f and ε, our algorithm improves over the (f + ε)-approximation algorithm of [16], whose running time is O(log Δ + log W), where W is the ratio between the largest and smallest vertex weights in the graph. Our algorithm also achieves an f-approximation for the problem in O(f log n) rounds, improving over the classical result of [13], which achieves a running time of O(f log^{2} n). Finally, for weighted vertex cover (f = 2) our algorithm achieves a deterministic running time of O(log n), matching the previously best randomized result of [14].
We also show that integer covering programs can be reduced to the Minimum Weight Set Cover problem in the distributed setting. This allows us to achieve an (f + ε)-approximate integral solution in O((1 + f/log n) ⋅ (log Δ / log log Δ + (f ⋅ log M)^{1.01} ⋅ log ε^{-1} ⋅ (log Δ)^{0.01})) rounds, where f bounds the number of variables in a constraint, Δ bounds the number of constraints a variable appears in, and M = max{1, 1/a_{min}}, where a_{min} is the smallest normalized constraint coefficient. This significantly improves over the results of [16] for the integral case, which achieves the same guarantees in O(ε^{-4} ⋅ f^{4} ⋅ log f ⋅ log(M ⋅ Δ)) rounds.
SESSION: Session 3
Secure Distributed Computing Made (Nearly) Optimal
In this paper, we study secure distributed algorithms that are nearly optimal, with respect to running time, for the given input graph G. Roughly speaking, an algorithm is secure if the nodes learn only their final output while gaining no information on the input (or output) of other nodes.
A graph theoretic framework for secure distributed computation was recently introduced by the authors (SODA 2019). This framework is quite general and is based on a new combinatorial structure called private neighborhood trees: a collection of n trees T(u_{1}), …, T(u_{n}) such that each tree T(u_{i}) spans the neighbors of u_{i} without going through u_{i}. Intuitively, each tree T(u_{i}) allows all neighbors of u_{i} to exchange a secret that is hidden from u_{i}. The efficiency of the framework depends on two key parameters of these trees: their depth and the amount of overlap. In (d,c)-private neighborhood trees, each tree T(u_{i}) has depth O(d) and each edge e ∈ G appears in at most O(c) different trees. An existentially optimal construction of private neighborhood trees with d = O(Δ ⋅ D) and c = Õ(D) was presented therein. We make two key contributions:
Universally Optimal Private Trees: We show a combinatorial construction of nearly (universally) optimal (d,c)-private neighborhood trees with d + c = Õ(OPT(G)) for any input graph G. Perhaps surprisingly, we show that OPT(G) is equal to the best depth possible for these trees even without the congestion constraint. We also present efficient distributed constructions of these private trees.
Optimal Secure Computation: Using the optimal constructions above, we get a secure compiler for distributed algorithms where the overhead for each round is Õ(poly(Δ) ⋅ OPT(G)). As our second key contribution, we design an optimal compiler with an overhead of merely Õ(OPT(G)) per round for a class of “simple” algorithms. This class includes many standard distributed algorithms such as Luby's MIS algorithm, the standard logarithmic-round algorithms for matching and (Δ + 1)-coloring, as well as the computation of aggregate functions.
With Great Speed Come Small Buffers: Space-Bandwidth Tradeoffs for Routing
We consider the Adversarial Queuing Theory (AQT) model, where packet arrivals are subject to a maximum average rate 0 ≤ ρ ≤ 1 and burstiness σ ≥ 0. In this model, we analyze the size of buffers required to avoid overflows in the basic case of a path. Our main results characterize the space required in terms of the average rate and the number of distinct destinations: we show that O(ℓ d^{1/ℓ} + σ) space suffices, where d is the number of distinct destinations and ℓ = ⌊1/ρ⌋, and we show that Ω((1/ℓ) d^{1/ℓ} + σ) space is necessary. For directed trees, we describe an algorithm whose buffer space requirement is at most 1 + d' + σ, where d' is the maximum number of destinations on any root-leaf path.
Plain SINR is Enough!
We develop randomized distributed algorithms for many of the most fundamental communication problems in the wireless SINR model, including (multimessage) broadcast, local broadcast, coloring, MIS, and aggregation. The complexity of the algorithms is optimal up to polylogarithmic preprocessing time. It shows — contrary to expectation — that the plain vanilla SINR model is just as powerful and fast (modulo the preprocessing) as various extensions studied, including power control, carrier sense, collision detection, free acknowledgements, and GPS location. A key component of the algorithms is an efficient simulation of CONGEST algorithms on a constantdensity SINR backbone.
Efficient Multiparty Interactive Coding for Insertions, Deletions, and Substitutions
In the field of interactive coding, two or more parties wish to carry out a distributed computation over a communication network that may be noisy. The ultimate goal is to develop efficient coding schemes that can tolerate a high level of noise while increasing the communication by only a constant factor (i.e., constant rate).
In this work we consider synchronous communication networks over an arbitrary topology, in the powerful adversarial insertion-deletion noise model. Namely, the noisy channel may adversarially alter the content of any transmitted symbol, as well as completely remove a transmitted symbol or inject a new symbol into the channel.
We provide efficient, constant-rate schemes that successfully conduct any computation with high probability as long as the adversary corrupts at most an ε/m fraction of the total communication, where m is the number of links in the network and ε is a small constant. This scheme assumes an oblivious adversary which is independent of the parties’ inputs and randomness. We can remove this assumption and resist a worst-case adversary at the price of being resilient only to an ε/(m log m) fraction of errors.
While previous work considered the insertion-deletion noise model in the two-party setting, to the best of our knowledge, our scheme is the first multiparty scheme that is resilient to insertions and deletions. Furthermore, our scheme is the first computationally efficient scheme in the multiparty setting that is resilient to adversarial noise.
Multiparty Interactive Communication with Private Channels
A group of n players wants to run a distributed protocol ℘ over a network where communication occurs via private point-to-point channels. Can we efficiently simulate ℘ in the presence of an adversary who knows ℘ and is able to maliciously flip bits on the channels? We show that this is possible, even when L, the number of bits sent in ℘, the average message size α in ℘, and T, the number of bits flipped by the adversary, are not known in advance. In particular, we show how to create a robust version ℘' of ℘ such that (1) ℘' fails with probability at most δ, for any δ > 0; and (2) ℘' sends O(L(1 + (1/α) log(nL/δ)) + T) bits. We note that if α is Ω(log(nL/δ)), then ℘' sends only O(L + T) bits, and is therefore within a constant factor of optimal. Critically, our result requires that ℘ run correctly in an asynchronous network, while our protocol ℘' must run in a synchronous network.
Coded State Machine — Scaling State Machine Execution under Byzantine Faults
We introduce Coded State Machine (CSM), an informationtheoretic framework to securely and efficiently execute multiple state machines on Byzantine nodes. The standard method of solving this problem is using State Machine Replication, which achieves high security at the cost of low efficiency. CSM simultaneously achieves the optimal linear scaling in storage, throughput, and security with increasing network size. The storage is scaled via the design of Lagrange coded states and coded input commands that require the same storage size as their origins. The computational efficiency is scaled using a novel delegation algorithm, called INTERMIX, which is an informationtheoretically verifiable matrixvector multiplication algorithm of independent interest.
On Termination of a Flooding Process
Flooding is a fundamental technique in distributed algorithms. Consider the following flooding process, for simplicity in a synchronous message-passing network: a distinguished node begins the flooding process by sending the (same) message to all its neighbours in the first round. In subsequent rounds, every node receiving the message relays a copy of the message to all those, and only those, nodes it did not receive the message from in the previous round. However, the nodes do not remember whether they have taken part in the flooding before and therefore repeat the process every time they get a message. In other words, they execute an amnesiac flooding process with memory only of the present round. The flooding process terminates in a particular round when no edge in the network carries the message in that round and, hence, in subsequent rounds. We call this process Amnesiac Flooding (AF).
In this work, the main question we address is whether AF will terminate on an arbitrary network (graph) and in what time? We show that, indeed, AF terminates on any arbitrary graph. Further, AF terminates in at most D rounds in bipartite graphs and at most 2D + 1 rounds in nonbipartite graphs – in this brief announcement, we show this for the bipartite case only.
We also show that in a natural asynchronous variant of AF, an adversary can always ensure nontermination.
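The synchronous process is easy to simulate directly; the sketch below (our own illustration, not taken from the paper) reproduces the claimed bounds on two tiny examples: a 4-cycle (bipartite, terminates within D rounds) and a triangle (non-bipartite, terminates within 2D + 1 rounds):

```python
def amnesiac_flooding_rounds(adj, source, limit=1000):
    """Number of rounds in which some edge carries the message."""
    active = {(source, v) for v in adj[source]}   # round-1 messages (sender, receiver)
    rounds = 0
    while active and rounds < limit:
        rounds += 1
        received = {}
        for u, v in active:
            received.setdefault(v, set()).add(u)
        # each receiver relays to every neighbor except those it heard from
        # this round; no other state is kept, exactly as in AF
        active = {(v, w) for v, froms in received.items()
                  for w in adj[v] if w not in froms}
    return rounds

cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # bipartite, D = 2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}            # non-bipartite, D = 1
print(amnesiac_flooding_rounds(cycle4, 0), amnesiac_flooding_rounds(triangle, 0))
```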
SESSION: Keynote Lecture 2
Towards a Theory of Randomized Shared Memory Algorithms
Randomization has become an invaluable tool to overcome some of the problems associated with asynchrony and faultiness. Allowing processors to use random bits helps to break symmetry, and to reduce the likelihood of undesirable schedules. As a consequence, randomized techniques can lead to simpler and more efficient algorithms, and sometimes to solutions of otherwise unsolvable computational problems. However, the design and the analysis of randomized shared memory algorithms remains challenging. This talk will give an overview of recent progress towards developing a theory of randomized shared memory algorithms.
For many years, linearizability [6] has been the gold standard of distributed correctness conditions, and the cornerstone of modular programming. In deterministic algorithms, implemented linearizable methods can be assumed to be atomic. But when processes can make random choices, the situation is not the same: probability distributions of outcomes of algorithms using linearizable methods may be very different from those using equivalent atomic operations [4]. In general, modular algorithm design is much more difficult for randomized algorithms than for deterministic ones. The first part of the talk will present a correctness condition [2, 5] that is suitable for randomized algorithms in certain settings, and will explain why in other settings no such correctness condition exists [3] and what we can do about that.
To date, almost all randomized shared memory algorithms are Las Vegas, meaning they permit no error. Monte Carlo algorithms, which allow errors to occur with small probability, have been studied thoroughly for sequential systems, but in the shared memory world such algorithms have been neglected. The second part of this talk will discuss recent attempts to devise Monte Carlo algorithms for fundamental shared memory problems (e.g., [1]). It will also present some general techniques that have proved useful in the design of concurrent randomized algorithms.
SESSION: Session 4
Optimal Memory-Anonymous Symmetric Deadlock-Free Mutual Exclusion
The notion of an anonymous shared memory, introduced by Taubenfeld in PODC 2017, considers that processes use different names for the same memory location. As an example, a location name A used by a process p and a location name B ≠ A used by another process q can correspond to the very same memory location X, and similarly for the names B used by p and A used by q, which may (or may not) correspond to the same memory location Y ≠ X. In this context, the PODC paper presented a 2-process symmetric deadlock-free mutual exclusion (mutex) algorithm and a necessary condition on the size m of the anonymous memory for the existence of such an n-process algorithm. This condition states that m must belong to M(n) \ {1}, where M(n) = {m : ∀ ℓ : 1 < ℓ ≤ n : gcd(ℓ, m) = 1}. Symmetric means here that process identities define a specific data type which allows a process to check only whether two identities are equal or not.
The present paper presents two optimal deadlock-free symmetric mutual exclusion algorithms for n-process systems where communication is through m registers. The first algorithm, which considers anonymous read/write registers, works for any m which is ≥ n and belongs to the set M(n). It follows that this condition on m is both necessary and sufficient for symmetric deadlock-free mutual exclusion in this anonymity context, and this algorithm is optimal with respect to m. The second algorithm, which considers anonymous read/modify/write atomic registers, works for any m ∈ M(n), which is shown to be necessary and sufficient for anonymous read/modify/write registers. It follows that, when m > 1, m ∈ M(n) is a tight characterization of the size of the anonymous shared memory needed to solve deadlock-free mutex, whether the registers are read/write or read/modify/write.
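The arithmetic condition on m is easy to experiment with. The following sketch (illustrative only; the function names are ours, not the paper's) checks membership in M(n) and finds the smallest admissible register count m ≥ n for the read/write setting:

```python
from math import gcd

def in_M(m: int, n: int) -> bool:
    """True iff m is in M(n), i.e., gcd(l, m) = 1 for every 1 < l <= n."""
    return all(gcd(l, m) == 1 for l in range(2, n + 1))

def smallest_valid_m(n: int) -> int:
    """Smallest m >= n with m in M(n): a candidate optimal size of the
    anonymous read/write memory for n processes, per the condition above."""
    m = n
    while not in_M(m, n):
        m += 1
    return m
```

For m > 1, membership in M(n) amounts to m having no prime factor ≤ n; for example, for n = 10 the smallest admissible m ≥ n is 11.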
Constant Amortized RMR Abortable Mutex for CC and DSM
The abortable mutual exclusion problem, proposed by Scott and Scherer in response to the needs of real-time systems and databases, is a variant of mutual exclusion that allows processes to abort from their attempt to acquire the lock. Worst-case constant remote memory reference (RMR) algorithms for mutual exclusion using hardware instructions such as Fetch&Add or Fetch&Store have long existed for both Cache Coherent (CC) and Distributed Shared Memory (DSM) multiprocessors, but no such algorithms are known for abortable mutual exclusion. Even relaxing the worst-case requirement to amortized, algorithms are only known for the CC model.
In this paper, we improve this state of the art by designing a deterministic algorithm that uses Fetch&Store (FAS) to achieve amortized O(1) RMR in both the CC and DSM models. Our algorithm supports Fast Abort (a process aborts within six steps of receiving the abort signal), and has the following additional desirable properties: it supports an arbitrary number of processes with arbitrary names, requires only O(1) space per process, and satisfies a novel fairness condition that we call “Airline FCFS”. Our algorithm is short and practical, with fewer than a dozen lines of code.
A Recoverable Mutex Algorithm with Sublogarithmic RMR on Both CC and DSM
In light of recent advances in non-volatile main memory technology, Golab and Ramaraju reformulated the traditional mutex problem into the novel Recoverable Mutual Exclusion (RME) problem. In the best known solution for RME, due to Golab and Hendler from PODC 2017, a process incurs at most O(log n / log log n) remote memory references (RMRs) per passage on a system with n processes, where a passage is an interval from when a process enters the Try section to when it subsequently returns to the Remainder. Their algorithm, however, guarantees this bound only for cache-coherent (CC) multiprocessors, leaving open the question of whether a similar bound is possible for distributed shared memory (DSM) multiprocessors.
We answer this question affirmatively by designing an algorithm for a system with n processes that satisfies the same complexity bound as Golab and Hendler’s for both CC and DSM multiprocessors. Our algorithm has some additional advantages over Golab and Hendler’s: (i) its Exit section is wait-free, (ii) it uses only the Fetch-and-Store instruction, and (iii) on a CC machine our algorithm needs each process to have a cache of only O(1) words, while their algorithm needs a cache whose size is a function of n.
Randomized Concurrent Set Union and Generalized Wake-Up
We consider the disjoint set union problem in the asynchronous shared memory multiprocessor computation model. We design a randomized algorithm that performs at most O(log n) work per operation (with high probability), and performs at most O(m · (α(n, m/(np)) + log(np/m + 1))) total work in expectation for a problem instance with m operations on n elements solved by p processes. Our algorithm is the first to have work bounds that grow sublinearly with p against an adversarial scheduler.
We use Jayanti’s Wake-Up problem and our newly defined Generalized Wake-Up problem to prove several lower bounds on concurrent set union. We show an Ω(log min{n, p}) expected work lower bound on the cost of any single operation of a set union algorithm. This shows that our single-operation upper bound is optimal across all algorithms when p = n^{Ω(1)}. Furthermore, we identify a class of “symmetric algorithms” that captures the complexities of all the known algorithms for the disjoint set union problem, and prove an Ω(m · (α(n, m/(np)) + log(np/m + 1))) expected total work lower bound on algorithms of this class, thereby showing that our algorithm has optimal total work complexity for this class. Finally, we prove that any randomized algorithm, symmetric or not, cannot breach an Ω(m · (α(n, m/n) + log log(np/m + 1))) expected total work lower bound.
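For readers unfamiliar with the data structure underlying these bounds, here is the classical sequential disjoint set union with randomized linking and path halving (a sketch of the textbook structure, not the paper's concurrent algorithm; class and method names are ours):

```python
import random

class DisjointSets:
    """Sequential union-find with link-by-random-priority and path halving."""
    def __init__(self, n, seed=None):
        rng = random.Random(seed)
        self.parent = list(range(n))
        self.prio = [rng.random() for _ in range(n)]  # random link priorities

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            if self.prio[ra] < self.prio[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra  # lower-priority root points to higher
```

The paper's challenge is making such operations correct and efficient when p processes run them concurrently under an adversarial scheduler.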
Strongly Linearizable Implementations of Snapshots and Other Types
Linearizability is the gold standard of correctness conditions for shared memory algorithms, and historically has been considered the practical equivalent of atomicity. However, it has been shown that replacing atomic objects with linearizable implementations can affect the probability distribution of execution outcomes in randomized algorithms. Thus, linearizable objects are not always suitable replacements for atomic objects. A stricter correctness condition called strong linearizability has been developed and shown to be appropriate for randomized algorithms in a strong adaptive adversary model [16].
We devise several new lock-free strongly linearizable implementations from atomic registers. In particular, we give the first strongly linearizable lock-free snapshot implementation that uses bounded space. This improves on the unbounded space solution of Denysyuk and Woelfel [14]. As a building block, our algorithm uses a lock-free strongly linearizable ABA-detecting register. We obtain this object by modifying the wait-free linearizable ABA-detecting register of Aghazadeh and Woelfel [5], which, as we show, is not strongly linearizable.
Aspnes and Herlihy [8] identified a wide class of types that have wait-free linearizable implementations. These types require that any pair of operations either commute, or one overwrites the other. Aspnes and Herlihy gave a general wait-free linearizable implementation of such types, employing an atomic snapshot object. We show that this implementation is strongly linearizable, proving that all types in this class have a lock-free strongly linearizable implementation from atomic registers.
Fast Concurrent Data Sketches
Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability, we prove our algorithm’s correctness and analyse the error it induces in two specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.
Self-Stabilizing Snapshot Objects for Asynchronous Failure-Prone Networked Systems
A snapshot object simulates the behavior of an array of single-writer/multi-reader shared registers that can be read atomically. Delporte-Gallet et al. proposed two fault-tolerant algorithms for snapshot objects in asynchronous crash-prone message-passing systems. Their first algorithm is non-blocking; it allows snapshot operations to terminate once all write operations have ceased. It uses O(n) messages of O(n·v) bits, where n is the number of nodes and v is the number of bits it takes to represent the object. Their second algorithm allows snapshot operations to always terminate, independently of write operations. It incurs O(n^2) messages. The fault model of Delporte-Gallet et al. considers node failures (crashes). We aim at the design of even more robust snapshot objects. We do so through the lens of self-stabilization—a very strong notion of fault-tolerance. In addition to Delporte-Gallet et al.’s fault model, a self-stabilizing algorithm can recover after the occurrence of transient faults; these faults represent arbitrary violations of the assumptions according to which the system was designed to operate (as long as the code stays intact). In particular, in this work, we propose self-stabilizing variations of Delporte-Gallet et al.’s non-blocking algorithm and always-terminating algorithm. Our algorithms have communication costs similar to the ones by Delporte-Gallet et al. and O(1) recovery time (in terms of asynchronous cycles) from transient faults. The main differences are that our proposal considers repeated gossiping of O(v)-bit messages and deals with bounded space, which is a prerequisite for self-stabilization.
The Recoverable Consensus Hierarchy
Herlihy’s consensus hierarchy ranks the power of various synchronization primitives for solving consensus in a model where asynchronous processes communicate through shared memory, and may fail by halting. This paper revisits the consensus hierarchy in a model with crash-recovery failures, where the specification of consensus, called recoverable consensus in this paper, is weakened by allowing non-terminating executions when a process fails infinitely often. Two variations of this model are considered: with independent process failures, and with simultaneous (i.e., system-wide) process failures. We prove two fundamental results: (i) Test-And-Set is at level 2 of the recoverable consensus hierarchy if failures are simultaneous, and similarly for any primitive at level 2 of the traditional consensus hierarchy; and (ii) Test-And-Set drops to level 1 of the hierarchy if failures are independent, unless the number of such failures is bounded. To our knowledge, this is the first separation between the simultaneous and independent crash-recovery failure models with respect to the computability of consensus.
How Fast Reads Affect Multi-Valued Register Simulations
We consider the problem of simulating a k-valued register in a wait-free manner using binary registers as building blocks, where k ≥ 2. We show that for any simulation using atomic binary base registers to simulate a safe k-valued register in which the read algorithm takes the optimal number of steps, ⌈log_2 k⌉, the write algorithm must take at least ⌈log_2 k⌉ steps in the worst case. A fortiori, the same lower bound applies when the simulated register should be regular. Previously known algorithms show that both these lower bounds are tight. We also show that in order to simulate an atomic k-valued register for two readers, the optimal number of steps of the read algorithm must be strictly larger than ⌈log_2 k⌉.
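To make the bounds concrete, here is a naive single-writer simulation in which both read and write touch ⌈log₂ k⌉ binary base registers (a sketch only: it yields a safe register, while the regular and atomic cases are exactly where the paper's lower bounds bite; all names are ours):

```python
import math

class SafeKValuedRegister:
    """Single-writer k-valued register built from ceil(log2 k) binary
    registers: read scans every bit, write rewrites every bit."""
    def __init__(self, k: int):
        assert k >= 2
        self.bits = [0] * math.ceil(math.log2(k))  # binary base registers

    def write(self, v: int) -> None:
        for i in range(len(self.bits)):            # ceil(log2 k) base writes
            self.bits[i] = (v >> i) & 1

    def read(self) -> int:
        return sum(b << i for i, b in enumerate(self.bits))  # ceil(log2 k) base reads
```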
SESSION: Session 5
Topological Characterization of Consensus under General Message Adversaries
In this paper, we provide a rigorous characterization of consensus solvability in synchronous directed dynamic networks controlled by an arbitrary message adversary using point-set topology: we extend the approach introduced by Alpern and Schneider in 1985 by introducing two novel topologies on the space of infinite executions: the process-view topology, induced by a distance function that relies on the local view of a given process in an execution, and the minimum topology, which is induced by a distance function that focuses on the local view of the process that is the last to distinguish two executions. We establish some simple but powerful topological results, which not only lead to a topological explanation of bivalence arguments, but also provide necessary and sufficient topological conditions on the admissible graph sequences of a message adversary for solving consensus. In particular, we characterize consensus solvability in terms of connectivity of the set of admissible graph sequences. For non-compact message adversaries, which are not limit-closed in the sense that there is a convergent sequence of graph sequences whose limit is not permitted, this requires the exclusion of all “fair” and “unfair” limit sequences that coincide with the forever-bivalent runs constructed in bivalence proofs. For both compact and non-compact message adversaries, we also provide tailored characterizations of consensus solvability, i.e., tight conditions for impossibility and existence of algorithms, based on the broadcastability of the connected components of the set of admissible graph sequences.
Can Distributed Uniformity Testing Be Local?
In the distributed uniformity testing problem, k servers draw samples from some unknown distribution, and the goal is to determine whether the unknown distribution is uniform or whether it is ε-far from uniform, where ε is a proximity parameter. Each server decides whether to accept or reject, and these decisions are sent to a referee, who makes a final decision based on the servers’ local decisions. Uniformity testing is a particularly useful building block, because it is complete for the problem of testing identity to any fixed distribution.
It was recently shown that distributing the task of uniformity testing allows each server to draw fewer samples than are needed in the centralized case, but so far the number of samples required for distributed uniformity testing has not been well understood. In this paper we settle this question, and also investigate the cost of using local decision rules, such as rejecting iff at least one server wants to reject (the usual decision rule used in local distributed decision). To answer these questions, we develop a new Fourier-based technique for proving lower bounds on the sample complexity of distribution testing, which lends itself particularly well to the distributed case.
Using our technique, we tightly characterize the number of samples required for uniformity testing when the referee can apply any decision function to the servers’ local decisions. We also show that if the network rejects whenever one server wants to reject, then the cost of uniformity testing is much higher, and in fact we do not gain compared to the centralized case unless the number of servers is exponential in Ω(1/ε). Finally, we apply our lower bound technique to the case where the referee applies a threshold decision rule, and also generalize a lower bound from [1] for learning an unknown input distribution.
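As a point of comparison, the classical centralized approach counts collisions: uniform samples collide with probability 1/n, while a distribution far from uniform has a noticeably higher collision rate. A minimal sketch (our code with an illustrative threshold; not the distributed protocol studied in the paper):

```python
from itertools import combinations

def collision_uniformity_test(samples, domain_size, eps):
    """Accept iff the empirical collision rate is close to the uniform
    baseline 1/domain_size; eps-far distributions push the rate higher."""
    pairs = list(combinations(samples, 2))
    collisions = sum(1 for a, b in pairs if a == b)
    rate = collisions / len(pairs)
    return rate <= (1 + 2 * eps ** 2) / domain_size
```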
Hardness of Distributed Optimization
This paper studies lower bounds for fundamental optimization problems in the CONGEST model. We show that solving problems exactly in this model can be a hard task, by providing Ω̃(n^2) lower bounds for cornerstone problems, such as minimum dominating set (MDS), Hamiltonian path, Steiner tree, and max-cut. These are almost tight, since all of these problems can be solved optimally in O(n^2) rounds. Moreover, we show that even in bounded-degree graphs, and even in simple graphs with maximum degree 5 and logarithmic diameter, various tasks, such as finding a maximum independent set (MaxIS) or a minimum vertex cover, are still difficult, requiring a near-tight number of Ω̃(n) rounds.
Furthermore, we show that in some cases even approximations are difficult, by providing an Ω̃(n^2) lower bound for a (7/8 + ε)-approximation for MaxIS, and a nearly linear lower bound for an O(log n)-approximation for the k-MDS problem for any constant k ≥ 2, as well as for several variants of the Steiner tree problem.
Our lower bounds are based on a rich variety of constructions that leverage novel observations, and reductions among problems that are specialized for the CONGEST model. However, for several additional approximation problems, as well as for exact computation of some central problems in P, such as maximum matching and max flow, we show that such constructions cannot be designed, by which we exemplify some limitations of this framework.
Broadcast Congested Clique: Planted Cliques and Pseudorandom Generators
We develop techniques to prove lower bounds for the BCAST(log n) Broadcast Congested Clique model (a distributed message passing model where in each round, each processor can broadcast an O(log n)-sized message to all other processors). Our techniques are built to prove bounds for natural input distributions. So far, all lower bounds for problems in the model have relied on constructing graph families specifically tailored to the problem at hand, resulting in lower bounds for artificially constructed inputs, instead of natural input distributions.
One of our results is a lower bound for the directed planted clique problem. In this problem, an input graph is either a random directed graph (each directed edge is included with probability 1/2), or a random graph with a planted clique of size k; that is, k randomly chosen vertices have all of the edges between them included, and all other edges in the graph appear with probability 1/2. The goal is to determine whether a clique exists. We show that when k = n^{1/4 − ε}, this problem requires a number of rounds polynomial in n.
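The input distribution is straightforward to sample, which is part of what makes it a natural benchmark. An illustrative generator (our code, not from the paper):

```python
import random

def planted_clique_digraph(n, k, seed=None):
    """Random directed graph: each ordered pair (u, v) with u != v is an
    edge with probability 1/2; then all edges among k random vertices
    are added, planting a clique."""
    rng = random.Random(seed)
    edges = {(u, v) for u in range(n) for v in range(n)
             if u != v and rng.random() < 0.5}
    clique = set(rng.sample(range(n), k))
    edges |= {(u, v) for u in clique for v in clique if u != v}
    return edges, clique
```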
Additionally, we construct a pseudorandom generator which fools the Broadcast Congested Clique. This allows us to show that every k-round randomized algorithm in which each processor uses up to n random bits can be efficiently transformed into an O(k)-round randomized algorithm in which each processor uses only up to O(k log n) random bits, while maintaining a high success probability. The pseudorandom generator is simple to describe, computationally very cheap, and its seed size is optimal up to constant factors. However, the analysis is quite involved, and is based on the new technique for proving lower bounds in the model.
The technique also allows us to prove the first average-case lower bound for the Broadcast Congested Clique, as well as an average-case time hierarchy. We hope our technique will lead to more lower bounds for problems such as triangle counting, APSP, MST, diameter, and more, for natural input distributions.
Connectivity Lower Bounds in Broadcast Congested Clique
We prove three new lower bounds for graph connectivity in the 1-bit broadcast congested clique model, BCC(1). First, in the KT0 version of BCC(1), in which nodes are aware of neighbors only through port numbers, we show an Ω(log n) round lower bound for CONNECTIVITY even for constant-error randomized Monte Carlo algorithms. The deterministic version of this result can be obtained via the well-known “edge-crossing” argument, but the randomized version requires establishing new combinatorial results regarding the indistinguishability graph induced by inputs. In our second result, we show that the Ω(log n) lower bound extends to the KT1 version of the BCC(1) model, in which nodes are aware of the IDs of all neighbors, though our proof works only for deterministic algorithms. Since nodes know the IDs of their neighbors in the KT1 model, it is no longer possible to play “edge-crossing” tricks; instead we present a reduction from the 2-party communication complexity problem PARTITION, in which Alice and Bob are given two set partitions of [n] and are required to determine if the join of these two set partitions equals the trivial one-part set partition. While our KT1 CONNECTIVITY lower bound holds only for deterministic algorithms, in our third result we extend this Ω(log n) KT1 lower bound to constant-error Monte Carlo algorithms for the closely related CONNECTED COMPONENTS problem. We use information-theoretic techniques to obtain this result. All our results hold for the seemingly easy special case of CONNECTIVITY in which an algorithm has to distinguish an instance with one cycle from an instance with multiple cycles. Our results showcase three rather different lower bound techniques and lay the groundwork for further improvements in lower bounds for CONNECTIVITY in the BCC(1) model.
Does Preprocessing Help under Congestion?
This paper investigates the power of preprocessing in the CONGEST model. Schmid and Suomela (ACM HotSDN 2013) introduced the SUPPORTED CONGEST model to study the application of distributed algorithms in Software-Defined Networks (SDNs). In this paper, we show that a large class of lower bounds in the CONGEST model still hold in the SUPPORTED model, highlighting the robustness of these bounds. This also raises the question: how much does preprocessing help in the CONGEST model?
SESSION: Session 6
The Distributed Complexity of Locally Checkable Problems on Paths is Decidable
Consider a computer network that consists of a path with n nodes. The nodes are labeled with inputs from a constant-sized set, and the task is to find output labels from a constant-sized set subject to some local constraints—more formally, we have an LCL (locally checkable labeling) problem. How many communication rounds are needed (in the standard LOCAL model of computing) to solve this problem?
It is well known that the answer is always either O(1) rounds, Θ(log* n) rounds, or Θ(n) rounds. In this work we show that this question is decidable (albeit PSPACE-hard): we present an algorithm that, given any LCL problem defined on a path, outputs the distributed computational complexity of this problem and the corresponding asymptotically optimal algorithm.
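As a concrete (and deliberately trivial) example of an LCL on a path, consider proper 3-coloring: validity is checked by constant-radius windows over a constant-sized label set, which is all the formal definition requires. The verifier below is our illustration, not taken from the paper:

```python
def is_valid_lcl_labeling(labels):
    """Local verifier for proper 3-coloring of a path: every label is in
    {0, 1, 2} and adjacent labels differ. Each check reads only a
    radius-1 window, as LCL problems require."""
    return (all(c in (0, 1, 2) for c in labels) and
            all(a != b for a, b in zip(labels, labels[1:])))
```

This particular problem sits in the Θ(log* n) class of the trichotomy above.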
Hardness of Exact Distance Queries in Sparse Graphs Through Hub Labeling
A distance labeling scheme is an assignment of bit labels to the vertices of an undirected, unweighted graph such that the distance between any pair of vertices can be decoded solely from their labels. An important class of distance labeling schemes is that of hub labelings, where a node v ∈ V stores its distances to the so-called hubs S_v ⊆ V, chosen so that for any u, v ∈ V there is w ∈ S_u ∩ S_v belonging to some shortest u-v path. Notice that for most existing graph classes, the best known distance labeling constructions use a hub labeling scheme at least as a key building block.
Our interest lies in hub labelings of sparse graphs, i.e., those with |E(G)| = O(n), for which we show a lower bound of n / 2^{O(√log n)} for the average size of the hubsets. Additionally, we show a hub labeling construction for sparse graphs of average size O(√(n · RS(n)^c)) for some 0 < c < 1, where RS(n) is the so-called Ruzsa–Szemerédi function, linked to the structure of induced matchings in dense graphs. This implies that further improving the lower bound on hub labeling size to n / 2^{(log n)^{o(1)}} would require a breakthrough in the study of lower bounds on RS(n), which have resisted substantial improvement in the last 70 years.
For general distance labeling of sparse graphs, we show a lower bound of SumIndex(n) / 2^{Θ(√log n)}, where SumIndex(n) is the communication complexity of the Sum-Index problem over Z_n. Our results suggest that the best achievable hub label size and distance label size in sparse graphs may be Θ(n / 2^{(log n)^c}) for some 0 < c < 1.
On the Complexity of Distributed Splitting Problems
One of the fundamental open problems in the area of distributed graph algorithms is whether randomization is needed for efficient symmetry breaking. While there are poly log n-time randomized algorithms for all the classic symmetry breaking problems, for many of them the best deterministic algorithms are almost exponentially slower. The following basic local splitting problem, known as weak splitting, takes a central role in this context: each node of a graph G = (V, E) has to be colored red or blue such that each node of sufficiently large degree has at least one neighbor of each color. Ghaffari, Kuhn, and Maus [STOC ’17] showed that this seemingly simple problem is complete w.r.t. the above fundamental open question in the following sense: if there is an efficient poly log n-time deterministic distributed algorithm for weak splitting, then there is such an algorithm for all locally checkable graph problems for which an efficient randomized algorithm exists. We investigate the distributed complexity of weak splitting and some closely related problems, and in particular obtain the following results:
- We obtain efficient algorithms for special cases of weak splitting in nearly regular graphs. We show that if δ = Ω(log n) and Δ are the minimum and maximum degrees of G, weak splitting can be solved deterministically in time O(√(Δ/δ) · poly(log n)). Further, if δ = Ω(log log n) and Δ ≤ 2^{εδ}, the time complexity is O((Δ/δ) · poly(log log n)).
- We prove that the following two related problems are also complete in the same sense: (I) color the nodes of a graph with C ≤ poly log n colors such that each node with a sufficiently large polylogarithmic degree has at least 2 log n different colors among its neighbors, and (II) color the nodes with a large constant number of colors so that for each node of sufficiently large (at least logarithmic) degree d(v), the number of neighbors of each color is at most (1 − ε)d(v) for some constant ε > 0.
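The contrast with randomization is easy to see in simulation: a uniformly random red/blue coloring leaves a degree-d node without one of the two colors only with probability 2^{1−d}, so weak splitting is trivial with randomness once degrees are logarithmic. A small checker (our sketch, not an algorithm from the paper):

```python
import random

def random_weak_split(adj, degree_threshold, seed=None):
    """Color every node red/blue uniformly at random; return the coloring
    together with the high-degree nodes that violate weak splitting
    (i.e., all of whose neighbors received the same color)."""
    rng = random.Random(seed)
    color = {v: rng.choice(("red", "blue")) for v in adj}
    bad = [v for v, nbrs in adj.items()
           if len(nbrs) >= degree_threshold
           and len({color[u] for u in nbrs}) < 2]
    return color, bad
```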
On the Use of Randomness in Local Distributed Graph Algorithms
We attempt to better understand randomization in local distributed graph algorithms by exploring how randomness is used and what we can gain from it:
- We first ask how much randomness is needed to obtain efficient randomized algorithms. We show that for all locally checkable problems with poly log n-time randomized algorithms, there are such algorithms even if either (I) there is only a single (private) independent random bit in each poly log n-neighborhood of the graph, (II) the (private) bits of randomness of different nodes are only poly log n-wise independent, or (III) there are only poly log n bits of global shared randomness (and no private randomness).
- Second, we study how much we can improve the error probability of randomized algorithms. For all locally checkable problems with poly log n-time randomized algorithms, we show that there are such algorithms that succeed with probability 1 − n^{−2^{ε(log log n)^2}}, and more generally T-round algorithms, for T ≥ poly log n, with success probability 1 − n^{−2^{ε log^2 T}}. We also show that poly log n-time randomized algorithms with success probability 1 − 2^{−2^{log^ε n}} for some ε > 0 can be derandomized to poly log n-time deterministic algorithms.
Both of the directions mentioned above, reducing the amount of randomness and improving the success probability, can be seen as partial derandomization of existing randomized algorithms. In all the above cases, we also show that any significant improvement of our results would lead to a major breakthrough, as it would imply significantly more efficient deterministic distributed algorithms for a wide class of problems.
Message Reduction in the LOCAL Model is a Free Lunch
A new spanner construction algorithm is presented, working under the LOCAL model assuming unique edge IDs. Given an n-node communication graph, a spanner with a constant stretch and Õ(n^{1 + c}) edges (for any small constant c > 0) is constructed efficiently — i.e., in a constant number of rounds and with a message complexity of Õ(n^{1 + 2c}) whp.
One of the many known applications of spanners is reducing the number of messages of various algorithms. Usually, however, one still needs to pay the cost of constructing the spanner. Due to the efficiency of the spanner construction here, we show that every t-round LOCAL algorithm can be transformed into a randomized one with the same asymptotic time complexity and Õ(t^2 n^{1 + O(1/log t)}) message complexity. All previous message-reduction schemes for LOCAL algorithms incur either an O(log n)-multiplicative or an O(polylog(n))-additive blowup of the round complexity.
P-SLOCAL-Completeness of Maximum Independent Set Approximation
We prove that the maximum independent set approximation problem with polylogarithmic approximation factor is P-SLOCAL-complete.
SESSION: Keynote Lecture 3
Engineering Distributed Systems that We Can Trust (and Also Run)
The interest in formal methods and verification of correctness-critical distributed systems has been on the rise in the past few years. But what are the gains from proving statements about software in full mathematical rigour? Do they justify the high cost of verification? And how far can we extend our trust in formal methods when talking about realistic distributed systems and their client programs?
This talk is in three parts. First, I will provide an overview of the state of the art in machine-assisted reasoning about distributed consensus protocols, their implementations, and applications. Next, I will discuss the trade-offs that have to be made in order to enable mechanised proofs about runnable systems code, as well as the implications of the assumptions made to describe real-world execution environments. Lastly, I will focus on the ongoing work propelled by the programming languages community towards engineering modular proofs about distributed protocols: a way to build correct-by-construction composite systems from verified reusable components.
SESSION: Session 7
The Consensus Number of a Cryptocurrency
Many blockchain-based algorithms, such as Bitcoin, implement a decentralized asset transfer system, often referred to as a cryptocurrency. As stated in the original paper by Nakamoto, at the heart of these systems lies the problem of preventing double-spending; this is usually solved by achieving consensus on the order of transfers among the participants. By treating the asset transfer problem as a concurrent object and determining its consensus number, we show that consensus is not necessary to prevent double-spending. We first consider the problem as defined by Nakamoto, where only a single process—the account owner—can withdraw from each account. Safety and liveness need to be ensured for correct account owners, whereas misbehaving account owners might be unable to perform transfers. We show that the consensus number of an asset transfer object is 1. We then consider a more general k-shared asset transfer object where up to k processes can atomically withdraw from the same account, and show that this object has consensus number k. We first establish these results in the context of shared memory with benign faults, in order to properly understand the level of difficulty of the asset transfer problem. We then translate our results to the more practically relevant message passing setting with Byzantine players. We describe an asynchronous Byzantine fault-tolerant asset transfer implementation that is both simpler and more efficient than state-of-the-art consensus-based solutions. Our results are applicable to both the permissioned (private) and permissionless (public) setting, as normally their differentiation is hidden by the abstractions on top of which our algorithms are based.
Communication Complexity of Byzantine Agreement, Revisited
As Byzantine Agreement (BA) protocols find application in large-scale decentralized cryptocurrencies, an increasingly important problem is to design BA protocols with improved communication complexity. A few existing works have shown how to achieve subquadratic BA under an adaptive adversary. Intriguingly, they all make a common relaxation about the adaptivity of the attacker: if an honest node sends a message and then gets corrupted in some round, the adversary cannot erase the message that was already sent – henceforth we say that such an adversary cannot perform “after-the-fact removal”. By contrast, many (super)quadratic BA protocols in the literature can tolerate after-the-fact removal. In this paper, we first prove that disallowing after-the-fact removal is necessary for achieving subquadratic-communication BA.
Next, we show a new subquadratic binary BA construction (of course, assuming no after-the-fact removal) that achieves near-optimal resilience and expected constant rounds under standard cryptographic assumptions and a public-key infrastructure (PKI). In comparison, all known subquadratic protocols make additional strong assumptions such as random oracles or the ability of honest nodes to erase secrets from memory, and even with these strong assumptions, no prior work can achieve the above properties. Lastly, we show that some setup assumption is necessary for achieving subquadratic multicast-based BA.
Exact Byzantine Consensus on Undirected Graphs under Local Broadcast Model
This paper considers the Byzantine consensus problem for nodes with binary inputs. The nodes are interconnected by a network represented as an undirected graph, and the system is assumed to be synchronous. Under the classical point-to-point communication model, it is well-known that the following two conditions are both necessary and sufficient to achieve Byzantine consensus among n nodes in the presence of up to ƒ Byzantine faulty nodes: n ≥ 3ƒ + 1 and vertex connectivity at least 2ƒ + 1. In the classical point-to-point communication model, it is possible for a faulty node to equivocate, i.e., transmit conflicting information to different neighbors. Such equivocation is possible because messages sent by a node to one of its neighbors are not overheard by other neighbors.
This paper considers the local broadcast model. In contrast to the point-to-point communication model, in the local broadcast model, messages sent by a node are received identically by all of its neighbors. Thus, under the local broadcast model, attempts by a node to send conflicting information can be detected by its neighbors. Under this model, we show that the following two conditions are both necessary and sufficient for Byzantine consensus: vertex connectivity at least ⌊3ƒ/2⌋ + 1 and minimum node degree at least 2ƒ. Observe that the local broadcast model results in a lower requirement for connectivity and the number of nodes n, as compared to the point-to-point communication model.
We extend the above results to a hybrid model that allows some of the Byzantine faulty nodes to equivocate. The hybrid model bridges the gap between the point-to-point and local broadcast models, and helps to precisely characterize the trade-off between equivocation and network requirements.
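The two feasibility conditions contrasted in the abstract can be encoded as simple predicates. The sketch below (hypothetical helper names; `vertex_connectivity` and `min_degree` are assumed to be computed elsewhere, e.g. by a graph library) only records the thresholds themselves.

```python
# Feasibility thresholds for synchronous Byzantine consensus with up to
# f faulty nodes, as stated in the abstract for the two models.

def consensus_feasible_point_to_point(n, f, vertex_connectivity):
    # Classical point-to-point model: n >= 3f + 1 and
    # vertex connectivity at least 2f + 1.
    return n >= 3 * f + 1 and vertex_connectivity >= 2 * f + 1

def consensus_feasible_local_broadcast(f, vertex_connectivity, min_degree):
    # Local broadcast model: vertex connectivity at least floor(3f/2) + 1
    # and minimum degree at least 2f; no n >= 3f + 1 requirement.
    return vertex_connectivity >= (3 * f) // 2 + 1 and min_degree >= 2 * f
```

For f = 1, a point-to-point network needs connectivity 3, while local broadcast already suffices with connectivity 2 and minimum degree 2, illustrating the weaker requirements of the latter model.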
Asymptotically Optimal Validated Asynchronous Byzantine Agreement
We provide a new protocol for Validated Asynchronous Byzantine Agreement in the authenticated setting. Validated (multi-valued) Asynchronous Byzantine Agreement is a key building block in constructing Atomic Broadcast and fault-tolerant state machine replication in the asynchronous setting. Our protocol has optimal resilience of ƒ < n/3 Byzantine failures and asymptotically optimal expected O(1) running time to reach agreement. Honest parties in our protocol send only an expected O(n^{2}) messages where each message contains a value and a constant number of signatures. Hence our total expected communication is O(n^{2}) words. The best previous result of Cachin et al. from 2001 solves Validated Byzantine Agreement with optimal resilience and O(1) expected time but with O(n^{3}) expected word communication. Our work addresses an open question of Cachin et al. from 2001 and improves the expected word communication from O(n^{3}) to the asymptotically optimal O(n^{2}).
HotStuff: BFT Consensus with Linearity and Responsiveness
We present HotStuff, a leader-based Byzantine fault-tolerant replication protocol for the partially synchronous model. Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) network delay—a property called responsiveness—and with communication complexity that is linear in the number of replicas. To our knowledge, HotStuff is the first partially synchronous BFT replication protocol exhibiting these combined properties. Its simplicity enables it to be further pipelined and simplified into a practical, concise protocol for building large-scale replication services.
Fault Tolerant Gradient Clock Synchronization
Synchronizing clocks in distributed systems is well-understood, both in terms of fault-tolerance in fully connected systems, and the optimal achievable local skew in general fault-free networks. However, so far nothing nontrivial is known about the local skew that can be achieved in topologies that are not fully connected, even under a single Byzantine fault. In this work, we show that asymptotically optimal local skew can be achieved in the presence of Byzantine faults.
Our approach combines the Lynch-Welch algorithm [19] for synchronizing a clique of n nodes with up to ƒ < n/3 Byzantine faults, and the gradient clock synchronization (GCS) algorithm by Lenzen et al. [15], in order to render the latter resilient to faults. This is not possible on general graphs, so we augment an arbitrary input graph G by replacing each node with a fully connected cluster of 3ƒ + 1 copies, and execute an instance of the Lynch-Welch algorithm within each cluster. We interpret the clusters as supernodes executing the GCS algorithm on G, where each node in the cluster maintains an estimate of the logical clock of its supernode. By also fully connecting clusters corresponding to neighbors in G, supernodes maintain estimates of neighboring clusters’ logical clocks. We achieve asymptotically optimal local skew, assuming that no cluster contains more than ƒ faulty nodes. This construction yields factors of O(ƒ) and O(ƒ^{2}) overheads in terms of nodes and edges, respectively. Since tolerating ƒ faulty neighbors trivially requires degrees larger than ƒ, these overheads are asymptotically optimal.
Bootstrapping Public Blockchains Without a Trusted Setup
We propose a protocol that allows the participants of a permissionless decentralized system to agree on a set of identities in the presence of a computationally bounded Byzantine adversary. Our protocol guarantees that the fraction of identities belonging to the adversary in the set of identities is at most equal to the adversary’s fraction of the total computational hash power.
We significantly improve on the existing state of the art in the following four ways. First, our algorithm runs in expected O(1) rounds, in contrast to previous results which require O(log n/log log n) rounds, where n is the number of initial nodes in the system. Second, we require each node to solve just one computational puzzle, whereas previous algorithms require O(log n/log log n) puzzles per node. Third, our algorithm sends only O(n) bits per node in expectation, whereas previous algorithms send O(log^{2} n/log log n) bits in expectation. Finally, in contrast to past results, our algorithm handles dynamic joining and leaving of nodes.
SESSION: Session 8
Hardness of Minimal Symmetry Breaking in Distributed Computing
A graph is weakly 2-colored if the nodes are labeled with colors black and white such that each black node is adjacent to at least one white node and vice versa. In this work we study the distributed computational complexity of weak 2-coloring in the standard LOCAL model of distributed computing, and how it is related to the distributed computational complexity of other graph problems.
First, we show that weak 2-coloring is a minimal distributed symmetry-breaking problem for regular even-degree trees and high-girth graphs: if there is any nontrivial locally checkable labeling problem that is solvable in o(log* n) rounds with a distributed graph algorithm in the middle of a regular even-degree tree, then weak 2-coloring is also solvable in o(log* n) rounds there.
Second, we prove a tight lower bound of Ω(log* n) for the distributed computational complexity of weak 2-coloring in regular trees; previously only a lower bound of Ω(log log* n) was known. By minimality, the same lower bound holds for any nontrivial locally checkable problem inside regular even-degree trees.
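The weak 2-coloring condition from the abstract is a purely local check, which the following sketch makes concrete (a centralized verifier, with our own function name; the paper of course studies computing such colorings distributedly, not verifying them).

```python
def is_weak_2_coloring(adj, color):
    """Check the definition: every black node has at least one white
    neighbor and vice versa.

    adj:   dict mapping each node to an iterable of its neighbors
    color: dict mapping each node to 0 (black) or 1 (white)
    """
    # A node with no differently colored neighbor violates the condition;
    # in particular an isolated node can never be weakly 2-colored.
    return all(
        any(color[v] != color[u] for v in adj[u]) for u in adj
    )
```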
An Automatic Speedup Theorem for Distributed Problems
Recently, Brandt et al. [STOC’16] proved a lower bound for the distributed Lovász Local Lemma, which has been conjectured to be tight for sufficiently relaxed LLL criteria by Chang and Pettie [FOCS’17]. At the heart of their result lies a speedup technique that, for graphs of girth at least 2t+2, transforms any t-round algorithm for one specific LLL problem into a (t-1)-round algorithm for the same problem. We substantially improve on this technique by showing that such a speedup exists for any locally checkable problem Π, with the difference that the problem Π_1 the inferred (t-1)-round algorithm solves is not (necessarily) the same problem as Π. Our speedup is automatic in the sense that there is a fixed procedure that transforms a description of Π into a description of Π_1, and reversible in the sense that any (t-1)-round algorithm for Π_1 can be transformed into a t-round algorithm for Π. In particular, for any locally checkable problem Π with exact deterministic time complexity T(n, Δ) ≤ t on graphs with n nodes, maximum node degree Δ, and girth at least 2t+2, there is a sequence of problems Π_1, Π_2, … with time complexities T(n, Δ)-1, T(n, Δ)-2, …, that can be inferred from Π. As a first application of our generalized speedup, we solve a long-standing open problem of Naor and Stockmeyer [STOC’93]: we show that weak 2-coloring in odd-degree graphs cannot be solved in o(log* Δ) rounds, thereby providing a matching lower bound to their upper bound.
A Sharp Threshold Phenomenon for the Distributed Complexity of the Lovász Local Lemma
The Lovász Local Lemma (LLL) says that, given a set of bad events that depend on the values of some random variables, where each event happens with probability at most p and depends on at most d other events, there is an assignment of the variables that avoids all bad events if the LLL criterion ep(d+1) < 1 is satisfied. Nowadays, in the area of distributed graph algorithms it has also become a powerful framework for developing—mostly randomized—algorithms. A classic result by Moser and Tardos yields an O(log^2 n) algorithm for the distributed Lovász Local Lemma [JACM’10] if ep(d+1) < 1 is satisfied. Given a stronger criterion, i.e., demanding a smaller error probability, it is conceivable that we can find better algorithms. Indeed, for example, Chung, Pettie and Su [PODC’14] gave an O(log_{1/epd^2} n) algorithm under the epd^2 < 1 criterion. Going further, Ghaffari, Harris and Kuhn introduced a 2^{O(√(log log n))}-time algorithm given d^8 p = O(1) [FOCS’18]. On the negative side, Brandt et al. and Chang et al. showed that we cannot go below Ω(log log n) (randomized) [STOC’16] and Ω(log n) (deterministic) [FOCS’16], respectively, under the criterion p ≤ 2^{-d}. Furthermore, there is a lower bound of Ω(log* n) that holds for any criterion. In this paper, we study the dependency of the distributed complexity of the LLL problem on the chosen LLL criterion. We show that for the fundamental case of each random variable of the considered LLL instance being associated with an edge of the input graph, that is, each random variable influences at most two events, a sharp threshold phenomenon occurs at p = 2^{-d}: we provide a simple deterministic (!) algorithm that matches the Ω(log* n) lower bound in bounded degree graphs if p < 2^{-d}, whereas for p ≥ 2^{-d}, the Ω(log log n) randomized and the Ω(log n) deterministic lower bounds hold.
In many applications variables affect more than two events; our main contribution is to extend our algorithm to the case where random variables influence at most three different bad events. We show that, surprisingly, the sharp threshold occurs at the exact same spot, providing evidence for our conjecture that this phenomenon always occurs at p = 2^{-d}, independent of the number r of events that are affected by a variable. Almost all steps of the proof framework we provide for the case r = 3 extend directly to the case of arbitrary r; consequently, our approach serves as a step towards characterizing the complexity of the LLL under different exponential criteria.
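The three LLL criteria that recur in the two abstracts above are plain numeric inequalities in p and d. The sketch below (our own hypothetical helper, just for orientation) evaluates them side by side.

```python
import math

def lll_criteria(p, d):
    """Evaluate the LLL-style criteria mentioned above for a given
    per-event error probability p and dependency degree d."""
    return {
        "ep(d+1)<1": math.e * p * (d + 1) < 1,  # classic LLL / Moser-Tardos
        "epd^2<1": math.e * p * d * d < 1,      # Chung-Pettie-Su regime
        "p<=2^-d": p <= 2.0 ** (-d),            # exponential threshold criterion
    }
```

For example, p = 0.001 and d = 4 satisfies all three, while p = 0.1 and d = 10 satisfies none of them: the exponential criterion p ≤ 2^{-d} becomes much more demanding as d grows.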
SESSION: Session 9
Reconfigurable Atomic Transaction Commit
Modern data stores achieve scalability by partitioning data into shards and fault-tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require 2f+1 crash-stop replicas per shard to tolerate f failures. In this paper we present atomic commit protocols that require only f+1 replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models—the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter’s CPU. Our protocols are inspired by the recent FaRM system for RDMA-based transaction processing. Our work codifies the core ideas of FaRM as distributed TCS protocols, rigorously proves them correct, and highlights the trade-offs required by the use of RDMA.
The Impact of RDMA on Agreement
Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or the other. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.
Hyaline: Fast and Transparent LockFree Memory Reclamation
We present a new lock-free safe memory reclamation algorithm, Hyaline, which is fast, scalable, and transparent to the underlying data structures. Hyaline easily handles a virtually unbounded number of threads that can be created and deleted dynamically, while retaining O(1) reclamation cost. We also extend Hyaline to avoid situations where stalled threads prevent others from reclaiming newly allocated objects, a common problem with epoch-based reclamation. Our evaluation reveals that Hyaline’s throughput is high — it steadily outperformed other reclamation schemes by over 10% in one test and yielded even higher gains in oversubscribed scenarios.
Layering Data Structures over Skip Graphs for Increased NUMA Locality
We present a lock-free, linearizable, and NUMA-aware data structure that implements sets, maps, and priority queue abstract data types (ADTs), based on using thread-local, sequential maps that are used to “jump” to suitable positions in a lock-free, linearizable variant of a skip graph. Our skip graph is suitably constrained in height and subjected to a data partition scheme that reduces contention and increases NUMA locality. We developed an additional skip graph variant, which we call sparse skip graph, that causes our thread-local maps as well as our shared structure to become more sparse. Compared to using regular skip graphs, sparse skip graphs show increased performance in workloads dominated by “insert” or “remove” operations, and comparable performance in workloads dominated by “contains” operations.
SESSION: Session 10
Partially Replicated Causally Consistent Shared Memory: Lower Bounds and An Algorithm
The focus of this paper is on causal consistency in a partially replicated distributed shared memory (DSM) system that provides the abstraction of shared read/write registers. Maintaining causal consistency in distributed shared memory systems has received significant attention in the past, mostly on full replication wherein each replica stores a copy of all the registers in the shared memory. To ensure causal consistency, all causally preceding updates must be performed before an update is performed at any given replica. Therefore, some mechanism for tracking causal dependencies is required, such as vector timestamps with the number of vector elements being equal to the number of replicas in the context of full replication. In this paper, we investigate causal consistency in partially replicated systems, wherein each replica may store only a subset of the shared registers. Building on the past work, this paper makes three key contributions:
 We present a necessary condition on the metadata (which we refer to as a timestamp) that must be maintained by each replica to be able to track causality accurately. The necessary condition identifies a set of directed edges in a share graph that a replica’s timestamp must keep track of.
 We present an algorithm for achieving causal consistency using a timestamp that matches the above necessary condition, thus showing that the condition is necessary and sufficient.
 We define a measure of timestamp size and present a lower bound (in bits) on the size of the timestamps. The lower bound matches our algorithm in several special cases.
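For the full-replication baseline that the paper generalizes, causality tracking with vector timestamps is the classic mechanism: one vector entry per replica, and an update is applied only after all causally preceding updates. A minimal sketch (our own function names, not the paper's algorithm):

```python
# Vector-timestamp causality for full replication: timestamps are lists
# with one entry per replica.

def happens_before(ts_a, ts_b):
    """True iff ts_a causally precedes ts_b: componentwise <= and
    strictly smaller in at least one entry."""
    return all(a <= b for a, b in zip(ts_a, ts_b)) and ts_a != ts_b

def can_apply(update_ts, sender, local_ts):
    """Classic causal-delivery test at a replica: the update must be the
    next one from `sender`, and it may depend on nothing the local
    replica has not yet seen from the other replicas."""
    return all(
        u == l + 1 if i == sender else u <= l
        for i, (u, l) in enumerate(zip(update_ts, local_ts))
    )
```

The paper's contribution is precisely about partial replication, where a full vector per replica is wasteful; its necessary condition pins down which share-graph edges a (smaller) timestamp must still track.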
Vorpal: Vector Clock Ordering For Large Persistent Memory Systems
In systems with non-volatile main memories (NVMMs), programmers must carefully control the order in which writes become persistent. Otherwise, what remains in persistence after a crash may be unusable upon recovery. Prior art has already explored semantic models for specifying this persist order, but most enforcement algorithms for the order are not scalable to large server machines because they assume that the machine contains only one or two memory controllers. In this paper, we describe a collection of provably correct algorithms for enforcing the persist order across writes that are generated at many different cores and persisted at numerous different memory controllers. Relative to existing solutions, our algorithms improve performance by 48% by reducing both traffic and serialization overheads.
On the Parallels between Paxos and Raft, and how to Port Optimizations
In recent years, Raft has surpassed Paxos to become the more popular consensus protocol in industry. While many researchers have observed the similarities between the two protocols, no one has shown how Raft and Paxos are formally related to each other. In this paper, we present a formal mapping between Raft and Paxos, and use this knowledge to port a certain class of optimizations from Paxos to Raft. In particular, our porting method can automatically generate an optimized protocol specification with guaranteed correctness. As case studies, we port and evaluate two optimizations, Mencius and Paxos Quorum Leases, to Raft.
Linearizable State Machine Replication of State-Based CRDTs without Logs
General solutions of state machine replication have to ensure that all replicas apply the same commands in the same order, even in the presence of failures. Such strict ordering incurs high synchronization costs due to the use of distributed consensus or a leader. This paper presents a protocol for linearizable state machine replication of conflict-free replicated data types (CRDTs) that neither requires consensus nor a leader. By leveraging the properties of state-based CRDTs—in particular the monotonic growth of a join semilattice—synchronization overhead is greatly reduced. In addition, updates just need a single round trip and modify the state ‘in place’ without the need for a log. Furthermore, the message size overhead for coordination consists of a single counter per message. While reads in the presence of concurrent updates are not wait-free without a coordinator, we show that more than 97% of reads can be handled in one or two round trips under highly concurrent accesses. Our protocol achieves high throughput without auxiliary processes such as command log management or leader election. It is well suited for all practical scenarios that need linearizable access to CRDT data on a fine-granular scale.
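The join-semilattice property the abstract leans on is easiest to see in the smallest state-based CRDT, a grow-only counter. The sketch below is our own illustration of that property, not the paper's protocol: merge is componentwise max, which is commutative, associative, and idempotent, so replicas converge regardless of message order or duplication.

```python
# A state-based grow-only counter (G-Counter) CRDT.

class GCounter:
    def __init__(self, n_replicas, replica_id):
        self.id = replica_id
        self.state = [0] * n_replicas  # one slot per replica

    def increment(self):
        # Each replica only grows its own slot: updates are 'in place'
        # and the state only moves up the lattice.
        self.state[self.id] += 1

    def merge(self, other_state):
        # Join = componentwise max. Re-delivering or reordering merges
        # cannot change the result, which is what removes the need for
        # consensus on update order.
        self.state = [max(a, b) for a, b in zip(self.state, other_state)]

    def value(self):
        return sum(self.state)
```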
On Mixing Eventual and Strong Consistency: Bayou Revisited
In this paper we study the properties of eventually consistent distributed systems that feature arbitrarily complex semantics and mix eventual and strong consistency. These systems execute requests in a highly available, weakly consistent fashion, but also enable stronger guarantees through additional inter-replica synchronization mechanisms that require the ability to solve distributed consensus. We use the seminal Bayou system as a case study, and then generalize our findings to a whole class of systems. We show dubious and unintuitive behaviour exhibited by those systems and provide a theoretical framework for reasoning about their correctness. We also state an impossibility result that formally proves the inherent limitation of such systems, namely temporary operation reordering, which admits interim disagreement between replicas on the relative order in which the client requests were executed.
SESSION: Session 11
Massively Parallel Algorithms for Finding Well-Connected Components in Sparse Graphs
Massively parallel computation (MPC) algorithms for graph problems have witnessed a resurgence of interest in recent years. Despite major progress for numerous graph problems however, the complexity of the sparse graph connectivity problem in this model has remained elusive: While classical logarithmic-round PRAM algorithms for finding connected components in any n-vertex graph have been known for more than three decades (and imply the same bounds for the MPC model), no o(log n)-round MPC algorithms are known for this task with truly sublinear-in-n memory per machine (which is the only interesting regime for sparse graphs with O(n) edges). It is conjectured that an o(log n)-round algorithm for connectivity on general sparse graphs with n^{1-Ω(1)} per-machine memory may not exist, a conjecture that also forms the basis for multiple conditional hardness results on the round complexity of other problems in the MPC model.
We take an opportunistic approach towards the sparse graph connectivity problem by designing an algorithm with improved performance in terms of the connectivity structure of the input graph. Formally, we design an MPC algorithm that finds all connected components with spectral gap at least λ in a graph in O(log log n + log(1/λ)) MPC rounds and n^{δ} memory per machine for any constant δ ∈ (0,1). While this algorithm still requires Θ(log n) rounds in the worst case, it achieves an exponential round reduction on “well-connected” components with λ ≥ 1/polylog(n) using only n^{δ} memory per machine and Õ(n) total memory, and still operates in o(log n) rounds even when λ = 1/n^{o(1)}.
En route to our main result, we design a new distributed data structure for performing independent random walks from all vertices simultaneously, as well as a new leader-election algorithm with exponentially faster round complexity on random graphs.
The Complexity of (Δ+1) Coloring in Congested Clique, Massively Parallel Computation, and Centralized Local Computation
In this paper, we present new randomized algorithms that improve the complexity of the classic (Δ+1)-coloring problem, and its generalization (Δ+1)-list-coloring, in three well-studied models of distributed, parallel, and centralized computation: Distributed Congested Clique: We present an O(1)-round randomized algorithm for (Δ+1)-list-coloring in the congested clique model of distributed computing. This settles the asymptotic complexity of this problem. It moreover improves upon the O(log* Δ)-round randomized algorithm of Parter and Su [DISC’18] and the O((log log Δ) ⋅ log* Δ)-round randomized algorithm of Parter [ICALP’18].
Massively Parallel Computation: We present a randomized (Δ+1)-list-coloring algorithm with round complexity O(√(log log n)) in the Massively Parallel Computation (MPC) model with strongly sublinear memory per machine. This algorithm uses a memory of O(n^{α}) per machine, for any desirable constant α > 0, and a total memory of Õ(m), where m is the number of edges in the graph. Notably, this is the first coloring algorithm with sublogarithmic round complexity in the sublinear memory regime of MPC. For the quasilinear memory regime of MPC, an O(1)-round algorithm was given very recently by Assadi et al. [SODA’19].
Centralized Local Computation: We show that (Δ+1)-list-coloring can be solved by a randomized algorithm with query complexity Δ^{O(1)} ⋅ O(log n) in the centralized local computation model. The previous state of the art for (Δ+1)-list-coloring in the centralized local computation model is based on simulation of known LOCAL algorithms. The deterministic O(√Δ poly log Δ + log* n)-round LOCAL algorithm of Fraigniaud et al. [FOCS’16] can be implemented in the centralized local computation model with query complexity Δ^{O(√Δ poly log Δ)} ⋅ O(log* n); the randomized O(log* Δ) + 2^{O(√(log log n))}-round LOCAL algorithm of Chang et al. [STOC’18] can be implemented in the centralized local computation model with query complexity Δ^{O(log* Δ)} ⋅ O(log n).
Massively Parallel Computation of Matching and MIS in Sparse Graphs
The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale parallel computation frameworks and has recently gained a lot of importance, especially in the context of classic graph problems. In this work, we mainly consider the maximal matching and maximal independent set problems in the MPC model.
These problems are known to admit efficient MPC algorithms if the space available per machine is near-linear in the number n of nodes. This is not only often significantly more than what we can afford, but also allows for easy if not trivial solutions for sparse graphs—which are common in real-world large-scale graphs. We are, therefore, interested in the low-memory MPC model, where the space per machine is restricted to be strongly sublinear, that is, n^{δ} for any constant 0 < δ < 1.
We parametrize our algorithms by the arboricity λ of the input graph. Our key ingredient is a degree reduction technique that reduces these problems in graphs with arboricity λ to the corresponding problems in graphs with maximum degree poly(λ, log n) in O(log^{2} log n) rounds, giving rise to O(√(log λ) ⋅ log log λ + log^{2} log n)-round algorithms.
Our result is particularly interesting for graphs with poly log n arboricity, as for such graphs we get O(log^{2} log n)-round algorithms. This covers most natural families of sparse graphs and almost exponentially improves over previous algorithms that all required log^{Ω(1)} n rounds in this regime of MPC.
Finally, our maximal matching algorithm can be employed to obtain a (1+ε)approximate maximum cardinality matching, a (2+ε)approximate maximum weighted matching, as well as a 2approximate minimum vertex cover in essentially the same number of rounds.
Weighted Matchings via Unweighted Augmentations
We design a generic method to reduce the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs. This method enables us to provide efficient implementations for approximating weighted matchings in the massively parallel computation (MPC) model and in the streaming model.
For the MPC and the multi-pass streaming model, we show that any algorithm computing a (1-δ)-approximate unweighted matching in bipartite graphs can be translated into an algorithm that computes a (1-ε(δ))-approximate maximum weighted matching. Furthermore, this translation incurs only a constant-factor (that depends on ε > 0) overhead in the complexity. Instantiating this with the current best MPC algorithm for unweighted matching yields a (1-ε)-approximation algorithm for maximum weighted matching that uses O_{ε}(log log n) rounds, O(m/n) machines per round, and O(n poly(log n)) memory per machine. This improves upon the previous best approximation guarantee of (1/2-ε) for weighted graphs. In the context of single-pass streaming with random edge arrivals, our techniques yield a (1/2+c)-approximation algorithm, thus breaking the natural barrier of 1/2.
SESSION: Session 12
Implementing Mediators with Asynchronous Cheap Talk
A mediator can help non-cooperative agents obtain an equilibrium that may otherwise not be possible. We study the ability of players to obtain the same equilibrium without a mediator, using only cheap talk, that is, non-binding pre-play communication. Previous work has considered this problem in a synchronous setting. Here we consider the effect of asynchrony on the problem, and provide upper bounds for implementing mediators. Considering asynchronous environments introduces new subtleties, including exactly what solution concept is most appropriate and determining what move is played if the cheap talk goes on forever. Different results are obtained depending on whether the move after such “infinite play” is under the control of the players or part of the description of the game.
Distributed Minimum Degree Spanning Trees
The minimum degree spanning tree (MDST) problem requires the construction of a spanning tree T for a graph G such that the maximum degree of T is the smallest among all spanning trees of G. Let d be this MDST degree for a given graph. In this paper, we present a randomized distributed approximation algorithm for the MDST problem that constructs a spanning tree with maximum degree in O(d + log n). With high probability in n, the algorithm runs in O((D + √n) log^{4} n) rounds in the broadcast-CONGEST model, where D is the graph diameter and n is the graph size. We then show how to derandomize this algorithm, obtaining the same asymptotic guarantees on degree and time complexity, but now requiring the standard CONGEST model. Although efficient approximation algorithms for the MDST problem have been known in the sequential setting since the 1990s (finding an exact solution is NP-hard), our algorithms are the first efficient distributed solutions. We conclude by proving a lower bound that establishes that any randomized MDST algorithm that guarantees a maximum degree in Õ(d) requires Ω̃(n^{1/3}) rounds, and any deterministic solution requires Ω̃(n^{1/2}) rounds. These bounds prove our deterministic algorithm to be asymptotically optimal and eliminate the possibility of significantly more efficient randomized solutions.
Improved Distributed Approximations for Minimum-Weight Two-Edge-Connected Spanning Subgraph
The minimum-weight 2-edge-connected spanning subgraph (2-ECSS) problem is a natural generalization of the well-studied minimum-weight spanning tree (MST) problem, and it has received considerable attention in the area of network design. The latter problem asks for a minimum-weight subgraph with an edge connectivity of 1 between each pair of vertices, while the former strengthens this edge-connectivity requirement to 2. Despite this resemblance, the 2-ECSS problem is considerably more complex than MST. While MST admits a linear-time centralized exact algorithm, 2-ECSS is NP-hard and the best known centralized approximation algorithm for it (that runs in polynomial time) gives a 2-approximation.
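The connectivity requirement itself is easy to state operationally: a graph is 2-edge-connected exactly when it is connected and contains no bridge, i.e., no single edge whose removal disconnects it. A minimal sketch of that check (illustrative names, quadratic-time for clarity; real bridge finding would use one DFS):

```python
def connected(n, edges):
    """Is the graph on vertices 0..n-1 connected? Simple DFS from 0."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def two_edge_connected(n, edges):
    """Connected, and removing any single edge keeps it connected."""
    if n < 2 or not connected(n, edges):
        return False
    return all(connected(n, edges[:i] + edges[i + 1:])
               for i in range(len(edges)))

# A 4-cycle survives any single edge failure; a spanning tree does not,
# which is why 2-ECSS strictly strengthens MST.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]
print(two_edge_connected(4, cycle), two_edge_connected(4, path))  # True False
```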
In this paper, we give a deterministic distributed algorithm with round complexity Õ(D + √n) that computes a (9+ε)-approximation of 2-ECSS, for any constant ε > 0. Up to logarithmic factors, this complexity matches the Ω̃(D + √n) lower bound that can be derived from the technique of Das Sarma et al. [STOC’11], as shown by Censor-Hillel and Dory [OPODIS’17]. Our result is the first distributed constant approximation for 2-ECSS in nearly optimal time, and it improves on a recent randomized algorithm of Dory [PODC’18], which achieved an O(log n)-approximation in Õ(D + √n) rounds.
We also present an alternative algorithm for O(log n)-approximation, whose round complexity is linear in the low-congestion shortcut parameter of the network, following a framework introduced by Ghaffari and Haeupler [SODA’16]. This algorithm has round complexity Õ(D + √n) in worst-case networks, but it provably runs much faster in many well-behaved graph families of interest. For instance, it runs in Õ(D) time in planar networks and those with bounded genus, bounded pathwidth, or bounded treewidth.
Near-Additive Spanners In Low Polynomial Deterministic CONGEST Time
Given a pair of parameters α ≥ 1, β ≥ 0, a subgraph G’ = (V,H) of an n-vertex unweighted undirected graph G = (V,E) is called an (α,β)-spanner if for every pair u,v ∈ V of vertices, we have d_{G’}(u,v) ≤ α · d_{G}(u,v) + β. If β = 0 the spanner is called a multiplicative α-spanner, and if α = 1+ε, for an arbitrarily small ε > 0, the spanner is said to be near-additive.
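The definition above can be checked directly on small unweighted graphs by comparing BFS distances in G and in the subgraph; a minimal sketch (the helper names and example graph are illustrative, not from the paper):

```python
from collections import deque

def bfs_dist(n, adj, s):
    """Unweighted single-source distances; None for unreachable vertices."""
    dist = [None] * n
    dist[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if dist[w] is None:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def is_spanner(n, E, H, alpha, beta):
    """Check d_{G'}(u,v) <= alpha * d_G(u,v) + beta for all pairs u,v."""
    def adjacency(edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        return adj
    aG, aH = adjacency(E), adjacency(H)
    for u in range(n):
        dG, dH = bfs_dist(n, aG, u), bfs_dist(n, aH, u)
        for v in range(n):
            if dG[v] is not None and (dH[v] is None
                                      or dH[v] > alpha * dG[v] + beta):
                return False
    return True

E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # K4: all distances 1
H = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # 4-cycle subgraph
# H stretches opposite pairs to distance 2: a (1,1)-spanner and a
# multiplicative 2-spanner of K4, but not purely additive with beta = 0.
print(is_spanner(4, E, H, 1, 1), is_spanner(4, E, H, 1, 0))  # True False
```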
Graph spanners [5,36] are a fundamental and extremely well-studied combinatorial construct, with a multitude of applications in distributed computing and in other areas. Near-additive spanners, introduced in [27], preserve large distances much more faithfully than the more traditional multiplicative spanners. Also, recent lower bounds [1] ruled out the existence of arbitrarily sparse purely additive spanners (i.e., spanners with α = 1), and therefore showed that near-additive spanners essentially provide the best approximation of distances that one can hope for.
Numerous distributed algorithms for constructing sparse near-additive spanners were devised in [17,20,25,28,40]. In particular, there are now known efficient randomized algorithms in the CONGEST model that construct such spanners [25], and also efficient deterministic algorithms in the LOCAL model [17]. However, the only known deterministic CONGEST-model algorithm for the problem [20] requires super-linear time in n. In this paper, we remedy the situation and devise an efficient deterministic CONGEST-model algorithm for constructing arbitrarily sparse near-additive spanners.
The running time of our algorithm is low polynomial, i.e., roughly O(β ⋅ n^{ρ}), where ρ > 0 is an arbitrarily small positive constant that affects the additive term β. In general, the parameters of our new algorithm and of the resulting spanner are in the same ballpark as the respective parameters of the state-of-the-art randomized algorithm of [25].
A Trivial Yet Optimal Solution to Vertex Fault Tolerant Spanners
We give a short and easy upper bound on the worst-case size of fault-tolerant spanners, which improves on all prior work and is fully optimal at least in the setting of vertex faults.
SESSION: Tutorials
From Classical to Blockchain Consensus: What Are the Exact Algorithms?
This tutorial describes well-known algorithms for distributed consensus problems, from classical consensus to blockchain consensus, and discusses exact algorithms that are high-level as in pseudocode and directly executable at the same time. The tutorial consists of five parts:

 (1) An introduction to different distributed consensus problems, from classical consensus and Byzantine consensus to blockchain consensus.

 (2) An overview of well-known algorithms, from Paxos for classical consensus to the Bitcoin algorithm for blockchain consensus, including important variants such as Viewstamped Replication and Virtual Synchrony, as well as Proof-of-Stake vs. Proof-of-Work.

 (3) An overview of a method and language, DistAlgo, for expressing distributed algorithms precisely at a high level as pseudocode and having them be directly executable at the same time.

 (4) A study of exact algorithms expressed at a high level for the most extensively studied algorithm variants, including Lamport’s Paxos for classical consensus and Nakamoto’s Bitcoin algorithm for blockchain consensus.

 (5) A demo of the direct execution of these exact algorithms by distributed processes.
Byzantine Fault Tolerant State Machine Replication in Any Programming Language
State machine replication is a fundamental primitive in fault-tolerant distributed computing, but few production tools exist to support the replication of arbitrary state machines. The tools that do exist, like Apache ZooKeeper, CoreOS’s etcd, and HashiCorp’s Consul, include an implementation of a consensus algorithm (e.g., ZAB or Raft) for replication, and a service-discovery-oriented key-value store as the state machine. While these tools can tolerate crash failures, they cannot tolerate malicious or adversarial (“Byzantine”) faults.
We present Tendermint, a production-grade Byzantine Fault Tolerant State Machine Replication engine written in Go. Tendermint supports replication for state machines written in any language by using a socket protocol to communicate between the state machine and the replication engine. Tendermint is being used on the public internet today to secure upwards of 1 billion USD in value, with deployments supporting hundreds of consensus nodes. In this workshop, we provide an overview of the Tendermint system and demonstrate how to build Byzantine Fault Tolerant applications in Go and JavaScript. We will also introduce the Cosmos Hub, an advanced Proof-of-Stake cryptocurrency system built on Tendermint.
Central Control over Distributed Asynchronous Systems: A Tutorial on Software-Defined Networks and Consistent Network Updates
This tutorial will give an introduction to a topic that lies at the intersection of distributed computing and networking, and combines asynchronous distributed systems with central control, namely consistent updates in Software-Defined Networks (SDNs). We will give an overview of current models and algorithms, but also of selected related topics, in particular those of potential interest to the PODC community, showcasing avenues for further research.
In more detail, SDNs have been an intensively studied topic in networking in recent years, but much of the focus has been on (logical) central control, abstracting away most of the underlying foundation, namely that a network is still a distributed asynchronous system at its core. Put simply, SDNs come with the promise that the network state can be optimized and updated from a global point of view. However, such a simplification becomes especially problematic when consistency guarantees have to be maintained. In asynchronous distributed systems, it is not possible to change the state of all nodes simultaneously; such a naive approach would lead to an inconsistent mix of old and new states, introducing, e.g., forwarding loops. Notwithstanding, most approaches tackle these issues from the viewpoints of the networking/systems communities, and we believe they could greatly benefit from connections to and ideas from the PODC community.
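The transient-loop hazard described above can be illustrated with a two-switch toy simulation; the node names and rule tables below are hypothetical, chosen only to show how applying a consistent global update one node at a time can pass through an inconsistent state:

```python
def forwards_to_dest(rules, start, dest, max_hops=10):
    """Follow forwarding rules from start; a blown hop budget signals a loop."""
    node = start
    for _ in range(max_hops):
        if node == dest:
            return True
        node = rules[node]
    return False  # looped (or path implausibly long)

old = {'u': 'v', 'v': 'dest'}   # old policy: route u -> v -> dest
new = {'u': 'dest', 'v': 'u'}   # new policy: route v -> u -> dest

# Both policies are loop-free on their own, but an asynchronous network
# applies the update one switch at a time. If v installs its new rule
# while u still holds its old one, traffic cycles: v -> u -> v -> ...
mixed = {'u': old['u'], 'v': new['v']}
print(forwards_to_dest(old, 'u', 'dest'))    # True
print(forwards_to_dest(new, 'v', 'dest'))    # True
print(forwards_to_dest(mixed, 'v', 'dest'))  # False: transient loop
```

Consistent-update algorithms avoid such states by ordering or staging rule installations so that every intermediate mix of old and new rules remains loop-free.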
Tutorial: Specifying, Implementing, and Verifying Algorithms for Persistent Memory
High-density byte-addressable non-volatile memory became a reality earlier this year when Intel launched the long-awaited Optane persistent memory module. This tutorial is intended for researchers interested in using persistent memory to construct fault-tolerant data structures that can maintain state consistently across power outages and system crashes without relying on conventional secondary storage. A number of practical and theoretical topics will be covered, including hardware purchasing considerations, operating system and programming language support for persistent memory, definitions of correctness properties for fault-tolerant data structures, techniques for implementing fault-tolerant concurrency control and memory management, as well as verification of correctness.