(* Hello, dear reader. You're brave enough to have opened the actual code. The blog post uses my super hacky literate coq python script. See https://github.com/wilcoxjay/coq-lit. You might be wondering: "why not use coqdoc?" Indeed. *) (** % A Proof of Peterson's Algorithm % James Wilcox % May 8, 2015 *) (*raw *) (** > _In this post, we take a break from distributed systems to look at > shared memory systems. As a case study, we give a proof that > Peterson's algorithm provides mutual exclusion._ [distributed-algorithms]: http://groups.csail.mit.edu/tds/distalgs.html This quarter I've been organizing a reading group on distributed algorithms. We've been reading Lynch's aptly named _[Distributed Algorithms][distributed-algorithms]_. Last week we covered shared-memory mutual exclusion algorithms, and I thought I'd have a go at formalizing some of the proofs in Coq. # Mutual Exclusion Mutual exclusion is the problem of implementing locks. It's a classic in the concurrent programming literature, with just about everybody famous publishing solutions to it. And why not: it's not at all obvious software should be able to provide mutual exclusion on top of hardware that doesn't. These days the algorithms are of mostly theoretical interest because modern multicore chips _do_ have primitives that make it easier to implement locks. But these old-school algorithms don't need any of that. All they require is _sequential consistency_ of the underlying memory system. (To make these algorithms feel even worse about themselves, sequential consistency isn't guaranteed by default on modern chips.) # Sequential Consistency Programmers can reason about a single-threaded program by simulating its execution in their heads, with each line of code executing in order. Such reasoning is an abstraction of what's actually going on, especially on modern hardware and with modern compilers. Compilers move code around, and hardware executes instructions out of order. 
But they all guarantee that the program behaves _as if_ each line of code executed in order. This guarantee ensures that the programmer's line-by-line reasoning is correct. Sequential consistency is an extension of this reasoning technique to multithreaded programs. An execution of a multithreaded program is sequentially consistent if it corresponds to some interleaving of the steps of each thread. For example, consider the following code, which executes in an environment where both `x` and `y` are initially 0. *) (** T1 T2 x = 1 y = 1 print y print x *) (** Thread 1's code is in the left column. It first writes the value 1 to the global variable `x`. Then it prints the value of the global variable `y`. Thread 2's code is in the right column. It writes 1 to `y` and then prints `x`. Now suppose we want to convince ourselves that no execution of this program will print "00". In a sequentially consistent world, we know that any execution of this program is equivalent to the sequential execution of some interleaving of the steps of the threads. So we consider all possible interleavings of the steps of the threads. Here they are. *) (** x = 1; print y; y = 1; print x // "01" x = 1; y = 1; print x; print y // "11" x = 1; y = 1; print y; print x // "11" y = 1; x = 1; print x; print y // "11" y = 1; x = 1; print y; print x // "11" y = 1; print x; x = 1; print y // "01" *) (** Now we imagine running each interleaving as a sequential program, and indeed, "00" is not printed by any of them. At a slightly higher level, we might argue that "00" is impossible because one thread must go first, and each thread begins by writing a 1 to its global variable. So when the other thread reaches its print statement, it must print a "1". 
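This exhaustive check is small enough to mechanize. Here's a quick Python sketch (not part of the Coq development; the names are made up for illustration) that enumerates all six interleavings and collects what each one prints:

```python
from itertools import permutations

def outputs():
    # Each thread is a list of atomic steps, in program order.
    t1 = [("write", "x"), ("read", "y")]
    t2 = [("write", "y"), ("read", "x")]
    results = set()
    # An interleaving is a choice of which thread moves next,
    # with each thread's own steps kept in program order.
    for order in set(permutations([1, 1, 2, 2])):
        state = {"x": 0, "y": 0}
        pcs = {1: 0, 2: 0}
        printed = ""
        for tid in order:
            op, var = (t1 if tid == 1 else t2)[pcs[tid]]
            if op == "write":
                state[var] = 1
            else:
                printed += str(state[var])
            pcs[tid] += 1
        results.add(printed)
    return results

print(sorted(outputs()))  # ['01', '11']; "00" never appears
```

The six distinct orderings of two `1`s and two `2`s are exactly the six interleavings listed above, and running them all confirms the informal argument.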
# Peterson's Algorithm [peterson]: http://cs.nyu.edu/~lerner/spring12/Read03-MutualExclusion.pdf [herlihy-book]: http://booksite.elsevier.com/9780123705914/?ISBN=9780123705914 Now that we understand sequential consistency, let's get back to mutual exclusion. [Herlihy and Shavit][herlihy-book] call Peterson's algorithm "the most succinct and elegant two-thread mutual exclusion algorithm." Well, this better be good. The original paper is [here][peterson]. *) (** T1 T2 flag1 = true flag2 = true victim = 1 victim = 2 while (flag2 && while (flag1 && victim == 1) {} victim == 2) {} // critical section // critical section flag1 = false flag2 = false *) (** Each thread begins by setting its flag to true. It then sets the `victim` variable to its thread id, and waits until either the other thread's flag is false or `victim` is set by the other thread. To see informally why this ensures mutual exclusion, consider the case where `T1` enters the critical section. Then the loop condition must have been false, which means either `flag2` was false or `victim` was set to 2. If `flag2` was false, then `T2` was not even trying to enter the critical section. If `victim` was set to 2, then `T2` _was_ trying to enter the critical section, but `T1` won the race. Let's now prove this formally. # Modeling Shared Memory [code]: SharedMem.v To reason about the algorithm formally, we first have to write down our assumptions about how shared memory works. The model used here is essentially the one used by [Lynch][distributed-algorithms]. For the purposes of presentation, I'm going to leave out some details from the post, but this post is generated from a Coq file that contains all the proofs. The full code is available [here][code]. *) (* Note: this code was tested with Coq 8.5beta2. It should also work in earlier versions with the usual slight tweaks everywhere... *) Require Import JRWTactics. (* available at https://github.com/wilcoxjay/tactics *) Require Import Relations. Require Import List.
Import ListNotations. Require Import Eqdep_dec. (* Decidable equality *) Class Eq (A : Type) := { Eq_dec : forall a b : A, {a = b} + {a <> b} }. (* Dependently typed total maps. Not explained in the blog post, but we'll use them below to model both the global (shared) state (map from register to contents) and the PC (map from thread to that thread's PC). *) Section update. Context {A : Type}. Context {A_eq : Eq A}. Context {B : A -> Type}. Definition update (f : forall a, B a) (x : A) (v : B x) : forall y : A, B y := fun y => match Eq_dec x y with | left pf => eq_rect _ _ v _ pf | right _ => f y end. Lemma update_same : forall f x v, update f x v x = v. Proof. unfold update. intros. break_match. - rewrite <- Eqdep_dec.eq_rect_eq_dec by auto using Eq_dec. auto. - congruence. Qed. Lemma update_diff : forall f x v y, x <> y -> update f x v y = f y. Proof. unfold update. intros. break_match; congruence. Qed. End update. (*begin code*) Section SharedMem. (*end code*) (** Shared memory is a set of registers. *) (*begin code*) (* register names *) Variable register : Type. (* we assume the names have decidable equality *) Context {reg_eq_dec : Eq register}. (* the type of data stored in each register *) Variable contents : register -> Type. (*end code*) (** Each register has a name and a type of data that it stores. As a technical condition, we need to assume that register names have decidable equality. For that we use the `Eq` typeclass (definition not shown; see the code for more details). *) (*begin code*) Definition global_state : Type := forall r, contents r. (*end code*) (** We model the global state of all registers as a (dependent) function from registers to their contents. A shared-memory program is a collection of threads. *) (*begin code*) (* thread names *) Variable thread : Type. (* more decidable equality *) Context {thread_eq_dec : Eq thread}. Variable program_counter : thread -> Type. Definition program_counters : Type := forall t, program_counter t. 
(*end code*) (** Each thread has a name, and we again assume the names have decidable equality. Each thread also has a program counter (PC), which keeps track of what it should do next. The PC is not visible to other threads. The collection of all threads' program counters is modeled as a dependent function from threads to their PC. From the point of view of the shared memory model, a program is just a function for each thread that updates that thread's PC as well as the global state. At each step, a thread is nondeterministically selected, and its function is run with access to its current PC and the current global state. The function returns the new PC and new global state. It's important to note that the function runs atomically, without interference from other threads. To model real shared-memory algorithms, we need to use the PC to break them down into repeated applications of this function, as we'll see below.[^PC-generality] [^PC-generality]: Because the PC is an arbitrary type, we can also use it to model thread-local storage. Peterson's algorithm doesn't need any of that, though. *) (* Author's note: The syntax warning from coq-lit.py here is benign. *) (*begin code*) (* one atomic step. given: - a thread t - thread t's current PC - the current global state returns thread t's new PC and the new global state *) Variable handler : forall t, program_counter t -> global_state -> (program_counter t * global_state). (*end code*) (** Execution proceeds as described above: select a thread; run the handler; update the state; rinse; repeat. We can write this down as a step relation. *) (*begin code*) Inductive step : program_counters * global_state -> program_counters * global_state -> Prop := | step_step : forall t ls gs st' gs', handler t (ls t) gs = (st', gs') -> step (ls, gs) (update ls t st', gs'). (*end code*) (** Here we use the `update` function (definition not shown; see code for details) to update thread `t`'s PC.
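Operationally, the `step` relation is just the relational version of a tiny scheduler loop. Here's a rough Python analogue (hypothetical names, purely to build intuition; the Coq definition is the real model): pick a thread, run its handler atomically against its PC and the global state, and write both results back.

```python
import random

def step(handler, pcs, gs):
    # Nondeterministic choice of which thread moves next.
    t = random.choice(list(pcs))
    # The handler runs atomically: it sees a consistent PC and
    # global state, and no other thread runs until it returns.
    new_pc, new_gs = handler(t, pcs[t], gs)
    return {**pcs, t: new_pc}, new_gs  # only thread t's PC changes

# Toy handler: every step bumps the thread's PC and a shared counter.
def inc_handler(t, pc, gs):
    return pc + 1, {**gs, "counter": gs["counter"] + 1}

pcs, gs = {"T1": 0, "T2": 0}, {"counter": 0}
for _ in range(5):
    pcs, gs = step(inc_handler, pcs, gs)
print(gs["counter"])  # 5, whatever schedule the random choices produced
```

The Coq `step` is a relation rather than a function precisely so that it captures every possible scheduler choice at once, instead of one random run.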
To model multiple steps of execution, we take the reflexive-transitive closure of the relation `step`. It saddens me that I think the name `clos_refl_trans_n1` makes sense. *) (*begin code*) Definition step_star := clos_refl_trans_n1 _ step. End SharedMem. (*end code*) (** # Peterson's Algorithm Formalized Now that we've got a model of shared memory, we can write programs against it. Because each step of the handler function is atomic, we have to be careful to break the algorithm down into its atomic steps. Let's recall Peterson's algorithm: *) (** T1 T2 flag1 = true flag2 = true victim = 1 victim = 2 while (flag2 && while (flag1 && victim == 1) {} victim == 2) {} // critical section // critical section flag1 = false flag2 = false *) (** To write this in our model, we need to define `register`, `contents`, `thread`, `program_counter`, and `handler`. Here we go. Peterson's algorithm works with two threads, so we'll make the type `thread` have two elements, `T1` and `T2`. We give a decidable equality instance as well (see code for proof). *) (*begin code*) Module Peterson. Inductive thread := | T1 | T2. Instance thread_eq_dec : Eq thread. (*end code*) constructor. intros. decide equality. Defined. (** Peterson's algorithm as presented above has three registers: `flag1`, `flag2`, and `victim`. To make it possible to write the code for both threads as a single function, we define `register` to have a `Flag` constructor that takes a `thread` as an argument. Thus `Flag T1` corresponds to `flag1` in the code above. Again, we also give a decidable equality instance. *) (*begin code*) Inductive register := | Flag : thread -> register | Victim. Instance register_eq_dec : Eq register. (*end code*) constructor. intros. decide equality; apply thread_eq_dec. Defined. (** The flags hold booleans while `victim` holds a thread name. *) (*begin code*) Definition contents (r : register) : Type := match r with | Victim => thread | Flag _ => bool end.
(*end code*) (** Initially, both flags should be set to `false`. But Peterson's algorithm will work correctly for any initial value of `victim`. Thus we define the initial state of the registers to be parametrized on a thread `t` that is the initial value of `victim`. The theorems below will show that the algorithm works for both possible values of `t`. *) (*begin code*) Definition register_init (t : thread) (r : register) : contents r := match r with | Victim => t | _ => false end. (*end code*) (** We're going to split up the code into atomic sections. The program counter of each thread will keep track of which atomic section to execute next. Each line of the pseudocode above corresponds to a single atomic section (either reading or writing a single shared variable). So `program_counter` just has six elements, one for each line. *) (*begin code*) Inductive program_counter : Type := | SetFlag | SetVictim | CheckFlag | CheckVictim | Crit | ResetFlag. (*end code*) (** Initially both threads start at `SetFlag`. *) (*begin code*) Definition PC_init (t : thread) : program_counter := SetFlag. (*end code*) (* monad for writing handler functions. it's a state monad over both the PC and the global state *) Definition HM (t : thread) (A : Type) : Type := program_counter -> global_state register contents -> A * program_counter * global_state register contents. Definition ret {t : thread} {A : Type} (x : A) : HM t A := fun l g => (x, l, g). Definition bind {t : thread} {A B : Type} (m : HM t A) (f : A -> HM t B) : HM t B := fun l g => let '(a, l', g') := m l g in f a l' g'. Definition get_PC {t : thread} : HM t program_counter := fun l g => (l, l, g). Definition global_read {t : thread} : HM t (global_state register contents) := fun l g => (g, l, g). Definition goto {t : thread} (l' : program_counter) : HM t unit := fun _ g => (tt, l', g). Definition global_write {t : thread} (g' : global_state register contents) : HM t unit := fun l _ => (tt, l, g'). 
Definition read_register {t : thread} (r : register) : HM t (contents r) := bind global_read (fun g => ret (g r)). Definition write_register {t : thread} (r : register) (x : contents r) : HM t unit := bind global_read (fun g => global_write (update g r x)). (* unwrap the monad *) Definition run_handler (h : forall t, HM t unit) := fun t l g => let '(_, l', g') := h t l g in (l', g'). Ltac monad_unfold := unfold write_register, read_register, global_write, goto, global_read, get_PC, bind, ret in *. Notation "x <- m1 ;; m2" := (bind m1 (fun x => m2)) (right associativity, at level 100, m1 at next level). Notation "m1 ;; m2" := (bind m1 (fun _ => m2)) (right associativity, at level 100). Notation "r ::= x" := (write_register r x) (right associativity, at level 90). Notation "[[ r ]]" := (read_register r) (right associativity, at level 90). (** Now we're ready to write down the handler function. Instead of writing a separate one for each thread, we're going to write it once by using a helper function to refer to the other thread. *) (*begin code*) Definition other (t : thread) : thread := match t with | T1 => T2 | T2 => T1 end. (*end code*) (** And finally, here's Peterson's algorithm. *) (*begin code*) Definition handler (t : thread) : HM t unit := pc <- get_PC ;; match pc with | SetFlag => Flag t ::= true ;; goto SetVictim | SetVictim => Victim ::= t ;; goto CheckFlag | CheckFlag => b <- [[ Flag (other t) ]] ;; if b then goto CheckVictim else goto Crit | CheckVictim => t' <- [[ Victim ]] ;; if Eq_dec t t' then goto CheckFlag else goto Crit | Crit => goto ResetFlag | ResetFlag => Flag t ::= false ;; goto SetFlag end. (*end code*) (** The code is written in a monadic style. `HM` is a state monad over the thread's PC as well as the global state. To write the value `x` to a register `r`, there is the notation `r ::= x`. To read from a register `r`, there's the notation `[[ r ]]`. `get_PC` reads the PC. `goto` writes the PC. 
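Since the whole system is finite-state (six PC values per thread, two boolean flags, one two-valued register), we can also sanity-check the algorithm by brute force before proving anything. Here's a small Python search (outside the Coq development; the names just mirror the definitions above) that enumerates every reachable state under every schedule and both initial values of `victim`:

```python
def handler(t, pc, gs):
    # One atomic section of Peterson's algorithm per PC value,
    # mirroring the Coq handler.
    other = "T2" if t == "T1" else "T1"
    g = dict(gs)
    if pc == "SetFlag":
        g["flag" + t] = True
        return "SetVictim", g
    if pc == "SetVictim":
        g["victim"] = t
        return "CheckFlag", g
    if pc == "CheckFlag":
        return ("CheckVictim" if g["flag" + other] else "Crit"), g
    if pc == "CheckVictim":
        return ("CheckFlag" if g["victim"] == t else "Crit"), g
    if pc == "Crit":
        return "ResetFlag", g
    g["flag" + t] = False  # ResetFlag
    return "SetFlag", g

def reachable():
    # Depth-first search over all schedules, from both initial
    # values of victim (the algorithm must work for either).
    def freeze(pcs, gs):
        return (pcs["T1"], pcs["T2"],
                gs["flagT1"], gs["flagT2"], gs["victim"])

    seen, frontier = set(), []
    for v0 in ("T1", "T2"):
        frontier.append(({"T1": "SetFlag", "T2": "SetFlag"},
                         {"flagT1": False, "flagT2": False, "victim": v0}))
    while frontier:
        pcs, gs = frontier.pop()
        if freeze(pcs, gs) in seen:
            continue
        seen.add(freeze(pcs, gs))
        for t in ("T1", "T2"):  # the scheduler may pick either thread
            pc2, gs2 = handler(t, pcs[t], gs)
            frontier.append(({**pcs, t: pc2}, gs2))
    return seen

assert all(not (s[0] == "Crit" and s[1] == "Crit") for s in reachable())
```

The search visits only a few dozen states and the assertion goes through. Of course, this only checks one finite instance of the state machine; the Coq proof is the real argument.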
The handler reads the current PC and branches on it. In each branch, it executes the corresponding atomic section. You should take a second to convince yourself that this code implements the pseudocode for Peterson's algorithm. # Proof We're going to show that Peterson's algorithm provides mutual exclusion. More precisely, in any reachable state of the system, at most one thread has its PC equal to `Crit`. First, we write down what it means for a state to be reachable. *) (*begin code*) Definition reachable l g : Prop := exists t0, step_star register contents thread (fun _ => program_counter) (run_handler handler) (PC_init, register_init t0) (l, g). (*end code*) (** Recall that `register_init` takes an argument that is the initial value of `Victim`. Since we want the algorithm to work regardless of this initial value, a state is considered reachable if there exists any value that causes the system to reach the state. The function `run_handler` just unwraps the monadic handler function. Next we provide an induction principle for the reachability predicate. This is essentially just a cleaned-up version of the induction principle that you'd get from `clos_refl_trans_n1`. *) (*begin code*) Lemma reachable_ind : forall (P : program_counters thread (fun _ => program_counter) -> global_state register contents -> Prop), (forall t, P PC_init (register_init t)) -> (forall l g t st' g' u, P l g -> reachable l g -> handler t (l t) g = (u, st', g') -> P (update l t st') g') -> forall l g, reachable l g -> P l g. Proof. (* see code for details *) (*end code*) unfold reachable, step_star. intros. break_exists. prep_induction H1. induction H1; intros; subst. - find_inversion. auto. - invc H. unfold run_handler in *. repeat break_let. find_inversion. eapply H4; eauto. eapply IHclos_refl_trans_n1; auto. (*begin code*) Qed.
(*end code*) Ltac update_destruct := match goal with | [ H : context [update _ ?x _ ?y] |- _ ] => destruct (Eq_dec x y) | [ |- context [update _ ?x _ ?y] ] => destruct (Eq_dec x y) end. Ltac find_rewrite_lem lem := match goal with | [ H : _ |- _ ] => rewrite lem in H end. Ltac update_rewrite := repeat rewrite update_same in *; repeat (find_rewrite_lem update_same); repeat rewrite update_diff in * by congruence; repeat (find_rewrite_lem uconstr:(update_diff); [|congruence]). (** First we show that no thread is its own `other`. The proof is by case analysis. *) (*begin code*) Lemma other_neq : forall t, t <> other t. Proof. unfold other. intros. break_match; congruence. Qed. (*end code*) Ltac workhorse := update_destruct; repeat find_inversion; try subst; update_rewrite; auto using other_neq; try congruence. (** We're finally ready to start reasoning about Peterson's algorithm. We start by showing that if `Flag t` is `false`, then thread `t`'s PC must be at `SetFlag`. This matches our intuition that the flag is true whenever the thread is acquiring, holding, or releasing the lock. *) (*begin code*) (* assertion 10.5.1 from Lynch *) Lemma flag_false : forall l g, reachable l g -> forall t, g (Flag t) = false -> l t = SetFlag. (*end code*) (** The proof is by induction over the execution. In the base case, both threads have flags set to `false` and have their PC at `SetFlag`. The inductive case proceeds by case analysis on which thread took the step and where that thread's PC was before the step. The tactic `workhorse` is a helper tactic that automates some of the tedium of the case analysis; its definition is not shown, but you can find it in the code. *) (*begin code*) Proof. induction 1 using reachable_ind; intros. - auto. - unfold handler in *. monad_unfold. repeat break_match; repeat find_inversion; repeat workhorse; repeat find_apply_hyp_hyp; congruence. Qed.
(*end code*) (** Next, we'll show that if thread `t` holds the lock and thread `other t` is trying to acquire the lock, then `Victim` is set to `other t`. *) (*begin code*) (* assertion 10.5.2 from Lynch *) Lemma turn_i : forall l g, reachable l g -> forall t, In (l t) [Crit; ResetFlag] -> In (l (other t)) [CheckFlag; CheckVictim; Crit; ResetFlag] -> g Victim <> t. (*end code*) (** The proof is again by induction over the execution. In the base case, no thread holds the lock, so the property holds trivially. In the inductive case, we do a bunch of case analysis. We use the previous lemma to rule out the case where `other t` enters the critical section by reading `Flag t` as `false`, which can't happen because `t` is in the critical section and thus not at `SetFlag`. The other cases are straightforward. *) (*begin code*) Proof. induction 1 using reachable_ind; intros. - compute in *. intuition; try discriminate. - unfold handler in *. monad_unfold. repeat break_match; repeat find_inversion; workhorse; try workhorse; try solve [simpl in *; intuition discriminate]; try solve [apply IHreachable; auto; find_rewrite; simpl; intuition]; try solve [exfalso; eapply other_neq; eauto]. + find_eapply_lem_hyp flag_false; eauto. find_rewrite. simpl in *. intuition discriminate. + repeat find_rewrite. auto using other_neq. Qed. (*end code*) (** Now we can prove mutual exclusion: two threads never have both their PCs in the critical section. *) (*begin code*) (* lemma 10.10 from Lynch *) Theorem mutex : forall l g, reachable l g -> l T1 = Crit -> l T2 = Crit -> False. (*end code*) (** The proof uses the previous lemma. If both threads are in the critical section, then neither of them can be the victim, a contradiction. *) (*begin code*) Proof. intros. destruct (g Victim) eqn:?; eapply turn_i; eauto; repeat find_rewrite; simpl; auto. Qed. End Peterson. (*end code*) (** # Conclusion This is a horrible way to write down programs. They get longer and harder to read. 
Control flow has to be made explicit. The proofs aren't too bad, honestly. But it would be easy to make a mistake in translating the pseudocode into the state machine. I'd like to find a better way to write these kinds of algorithms down while still being able to reason about them. The key to the proof is attaching a bunch of invariants to the PC, but the PC had to be introduced explicitly. [ewd779]: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD779.PDF I think this is related to the problem of auxiliary variables. For example, in [EWD779][ewd779], Dijkstra proves Peterson's algorithm correct by using an auxiliary variable that captures the essential information about the PC. As always, comments are welcome. Find me on freenode as `jrw`, or send email to `jrw12@cs.washington.edu`. *)