In this post, we take a break from distributed systems to look at shared memory systems. As a case study, we give a proof that Peterson’s algorithm provides mutual exclusion.

This quarter I’ve been organizing a reading group on distributed algorithms. We’ve been reading Lynch’s aptly named *Distributed Algorithms*. Last week we covered shared-memory mutual exclusion algorithms, and I thought I’d have a go at formalizing some of the proofs in Coq.

Mutual exclusion is the problem of implementing locks. It’s a classic in the concurrent programming literature, with just about everybody famous publishing solutions to it. And why not: it’s not at all obvious software should be able to provide mutual exclusion on top of hardware that doesn’t.

These days the algorithms are of mostly theoretical interest because modern multicore chips *do* have primitives that make it easier to implement locks. But these old-school algorithms don’t need any of that. All they require is *sequential consistency* of the underlying memory system. (To make these algorithms feel even worse about themselves, sequential consistency isn’t guaranteed by default on modern chips.)

Programmers can reason about a single-threaded program by simulating its execution in their heads, with each line of code executing in order. Such reasoning is an abstraction of what’s actually going on, especially on modern hardware and with modern compilers. Compilers move code around, and hardware executes instructions out of order. But they all guarantee that the program behaves *as if* each line of code executed in order. This guarantee ensures that the programmer’s line-by-line reasoning is correct.

Sequential consistency is an extension of this reasoning technique to multithreaded programs. An execution of a multithreaded program is sequentially consistent if it corresponds to some interleaving of the steps of each thread. For example, consider the following code, which executes in an environment where both `x`

and `y`

are initially 0.

```
T1 T2
x = 1 y = 1
print y print x
```

Thread 1’s code is in the left column. It first writes the value 1 to the global variable `x`

. Then it prints the value of the global variable `y`

. Thread 2’s code is in the right column. It writes 1 to `y`

and then prints `x`

.

Now suppose we want to convince ourselves that no execution of this program will print “00”. In a sequentially consistent world, we know that any execution of this program is equivalent to the sequential execution of some interleaving of the steps of the threads. So we consider all possible interleavings of the steps of the threads. Here they are.

```
x = 1; print y; y = 1; print x // "01"
x = 1; y = 1; print x; print y // "11"
x = 1; y = 1; print y; print x // "11"
y = 1; x = 1; print x; print y // "11"
y = 1; x = 1; print y; print x // "11"
y = 1; print x; x = 1; print y // "01"
```

Now we imagine running each interleaving as a sequential program, and indeed, “00” is not printed by any of them. At a slightly higher level, we might argue that “00” is impossible because one thread must go first, and each thread begins by writing a 1 to its global variable. So when the other thread reaches its print statement, it must print a “1”.

Now that we understand sequential consistency, let’s get back to mutual exclusion. Herlihy and Shavit calls Peterson’s algorithm “the most succinct and elegant two-thread mutual exclusion algorithm.” Well, this better be good. The original paper is here.

```
T1 T2
flag1 = true flag2 = true
victim = 1 victim = 2
while (flag2 && while (flag1 &&
victim == 1) {} victim == 2) {}
// critical section // critical section
flag1 = false flag2 = false
```

Each thread begins by setting its flag to true. It then sets the `victim`

variable to its thread id, and waits until either the other thread’s flag is false or `victim`

is set by the other thread.

To see informally why this ensures mutual exclusion, consider the case where `T1`

enters the critical section. Then the loop condition must have been false, which means either `flag2`

was false or `victim`

was set to 2. If `flag2`

was false, then `T2`

was not even trying to enter the critical section. If `victim`

was set to 2, then `T2`

*was* trying to enter the critical section, but `T1`

won the race.

Let’s now prove this formally.

To reason about the algorithm formally, we first have to write down our assumptions about how shared memory works. The model used here is essentially the one used by Lynch.

For the purposes of presentation, I’m going to leave out some details from the post, but this post is generated from a Coq file that contains all the proofs. The full code is available here.

```
Section SharedMem.
```

Shared memory is a set of registers.

```
(* register names *)
Variable register : Type.
(* we assume the names have decidable equality *)
Context {reg_eq_dec : Eq register}.
(* the type of data stored in each register *)
Variable contents : register -> Type.
```

Each register has a name and a type of data that it stores. As a technical condition, we need to assume that register names have decidable equality. For that we use the `Eq`

typeclass (definition not shown; see the code for more details).

```
Definition global_state : Type := forall r, contents r.
```

We model the global state of all registers as a (dependent) function from registers to their contents.

A shared-memory program is a collection of threads.

```
(* thread names *)
Variable thread : Type.
(* more decidable equality *)
Context {thread_eq_dec : Eq thread}.
Variable program_counter : thread -> Type.
Definition program_counters : Type := forall t, program_counter t.
```

Each thread has a name, and we again assume the names have decidable equality. Each thread also has a program counter (PC), which keeps tracks of what it should do next. The PC is not visible to other threads. The collection of all threads’ program counters is modeled as a dependent function from threads to their PC.

From the point of view of the shared memory model, a program is just a function for each thread that updates that thread’s PC as well as the global state. At each step, a thread is nondeterministically selected, and its function is run with access to its current PC and the current global state. The function returns the new PC and new global state. It’s important to note that the function runs atomically, without interference from other threads. To model real shared-memory algorithms, we need to use the PC to break them down into repeated applications of this function, as we’ll see below.^{1}

```
(* one atomic step. given:
- a thread t
- thread t's current PC
- the current global state
returns thread t's new PC and the new global state
*)
Variable handler :
forall t, program_counter t ->
global_state ->
(program_counter t * global_state).
```

Execution proceeds as described above: select a thread; run the handler; update the state; rinse; repeat. We can write this down as a step relation.

```
Inductive step : program_counters * global_state ->
program_counters * global_state -> Prop :=
| step_step :
forall t ls gs st' gs',
handler t (ls t) gs = (st', gs') ->
step (ls, gs) (update ls t st', gs').
```

Here we use the `update`

function (definition not shown; see code for details) to update thread `t`

’s PC.

To model multiple steps of execution, we take the transitive closure of the relation `step`

. It saddens me that I think the name `clos_refl_trans_n1`

makes sense.

```
Definition step_star := clos_refl_trans_n1 _ step.
End SharedMem.
```

Now that we’ve got a model of shared memory, we can write programs against it. Because each step of the handler function is atomic, we have to be careful to break the algorithm down into its atomic steps. Let’s recall Peterson’s algorithm:

```
T1 T2
flag1 = true flag2 = true
victim = 1 victim = 2
while (flag2 && while (flag1 &&
victim == 1) {} victim == 2) {}
// critical section // critical section
flag1 = false flag2 = false
```

To write this in our model, we need to define `register`

, `contents`

, `thread`

, `program_counter`

, and `handler`

. Here we go.

Peterson’s algorithm works with two threads, so we’ll make the type `thread`

have two elements, `T1`

and `T2`

. We give a decidable equality instance as well (see code for proof).

```
Module Peterson.
Inductive thread :=
| T1
| T2.
Instance thread_eq_dec : Eq thread.
```

Peterson’s algorithm as presented above has three registers: `flag1`

, `flag2`

, and `victim`

. To make it possible to write the code for both threads as a single function, we define `register`

to have a `Flag`

constructor that takes a `thread`

as an argument. Thus `Flag T1`

corresponds to `flag1`

in the code above. Again, we also give a decidable equality instance.

```
Inductive register :=
| Flag : thread -> register
| Victim.
Instance register_eq_dec : Eq register.
```

The flags hold booleans while `victim`

holds a thread name.

```
Definition contents (r : register) : Type :=
match r with
| Victim => thread
| Flag _ => bool
end.
```

Initially, both flags should be set to `false`

. But Peterson’s algorithm will work correctly for any initial value of `victim`

. Thus we define the initial state of the registers to be parametrized on a thread `t`

that is the initial value of `victim`

. The theorems below will show that the algorithm works for both possible values of `t`

.

```
Definition register_init (t : thread) (r : register) : contents r :=
match r with
| Victim => t
| _ => false
end.
```

We’re going to split up the code into atomic sections. The program counter of each thread will keep track of which atomic section to execute next. Each line of the pseudocode above corresponds to a single atomic section (either reading or writing a single shared variable). So `program_counter`

just has six elements, one for each line.

```
Inductive program_counter : Type :=
| SetFlag
| SetVictim
| CheckFlag
| CheckVictim
| Crit
| ResetFlag.
```

Initially both threads start at `SetFlag`

.

```
Definition PC_init (t : thread) : program_counter := SetFlag.
```

Now we’re ready to write down the handler function. Instead of writing a separate one for each thread, we’re going to write it once by using a helper function to refer to the other thread.

```
Definition other (t : thread) : thread :=
match t with
| T1 => T2
| T2 => T1
end.
```

And finally, here’s Peterson’s algorithm.

```
Definition handler (t : thread) : HM t unit :=
pc <- get_PC ;;
match pc with
| SetFlag => Flag t ::= true ;;
goto SetVictim
| SetVictim => Victim ::= t ;;
goto CheckFlag
| CheckFlag => b <- [[ Flag (other t) ]] ;;
if b
then goto CheckVictim
else goto Crit
| CheckVictim => t' <- [[ Victim ]] ;;
if Eq_dec t t'
then goto CheckFlag
else goto Crit
| Crit => goto ResetFlag
| ResetFlag => Flag t ::= false ;;
goto SetFlag
end.
```

The code is written in a monadic style. `HM`

is a state monad over the thread’s PC as well as the global state. To write the value `x`

to a register `r`

, there is the notation `r ::= x`

. To read from a register `r`

, there’s the notation `[[ r ]]`

. `get_PC`

reads the PC. `goto`

writes the PC.

The handler reads the current PC and branches on it. In each branch, it executes the corresponding atomic section. You should take a second to convince yourself that this code implements the pseudocode for Peterson’s algorithm.

We’re going to show that Peterson’s algorithm provides mutual exclusion. More precisely, in any reachable state of the system, at most one thread has their PC equal to `Crit`

.

First, we write down what it means for a state to be reachable.

```
Definition reachable l g : Prop :=
exists t0,
step_star register contents thread
(fun _ => program_counter) (run_handler handler)
(PC_init, register_init t0) (l, g).
```

Recall that the `register_init`

takes an argument that is the initial value of `Victim`

. Since we want the algorithm to work regardless of this initial value, a state is considered reachable if there exists any value that causes the system to reach the state. The function `run_handler`

just unwraps the monadic handler function.

Next we provide an induction principle for the reachability predicate. This is essentially just a cleaned up version of the induction principle that you’d get from `clos_refl_trans_n1`

.

```
Lemma reachable_ind :
forall (P : program_counters thread (fun _ => program_counter) ->
global_state register contents -> Prop),
(forall t, P PC_init (register_init t)) ->
(forall l g t st' g' u,
P l g ->
reachable l g ->
handler t (l t) g = (u, st', g') ->
P (update l t st') g') ->
forall l g,
reachable l g ->
P l g.
Proof.
(* see code for details *)
```

```
Qed.
```

First we show that the `other`

function is correct. The proof is by case analysis.

```
Lemma other_neq :
forall t,
t <> other t.
Proof.
unfold other.
intros.
break_match; congruence.
Qed.
```

We’re finally ready to start reasoning about Peterson’s algorithm. We start by showing that if `Flag t`

is `false`

, then the PC must be at `SetFlag`

. This matches our intuition that the flag is true whenever the thread is acquiring, holding, or releasing the lock.

```
(* assertion 10.5.1 from Lynch *)
Lemma flag_false :
forall l g,
reachable l g ->
forall t,
g (Flag t) = false ->
l t = SetFlag.
```

The proof is by induction over the execution. In the base case, both threads have flags set to `false`

and have their PC at `SetFlag`

. The inductive case proceeds by case analysis on which thread took the step and where that thread’s PC was before the step. The tactic `workhorse`

is a helper tactic that automates some of the tedium of the case analysis; its definition is not shown, but you can find it in the code.

```
Proof.
induction 1 using reachable_ind; intros.
- auto.
- unfold handler in *.
monad_unfold.
repeat break_match; repeat find_inversion;
repeat workhorse;
repeat find_apply_hyp_hyp; congruence.
Qed.
```

Next, we’ll show that if thread `t`

holds the lock and thread `other t`

is trying to acquire the lock, then `Victim`

is set to `other t`

.

```
(* assertion 10.5.2 from Lynch *)
Lemma turn_i :
forall l g,
reachable l g ->
forall t,
In (l t) [Crit; ResetFlag] ->
In (l (other t)) [CheckFlag; CheckVictim; Crit; ResetFlag] ->
g Victim <> t.
```

The proof is again by induction over the execution. In the base case, no thread holds the lock, so the property holds trivially. In the inductive case, we do a bunch of case analysis. We use the previous lemma to rule out the case where `other t`

enters the critical section by reading `Flag t`

as `false`

, which can’t happen because `t`

is in the critical section and thus not at `SetFlag`

. The other cases are straightforward.

```
Proof.
induction 1 using reachable_ind; intros.
- compute in *. intuition; try discriminate.
- unfold handler in *.
monad_unfold.
repeat break_match; repeat find_inversion;
workhorse; try workhorse;
try solve [simpl in *; intuition discriminate];
try solve [apply IHreachable; auto; find_rewrite; simpl; intuition];
try solve [exfalso; eapply other_neq; eauto].
+ find_eapply_lem_hyp flag_false; eauto.
find_rewrite. simpl in *. intuition discriminate.
+ repeat find_rewrite. auto using other_neq.
Qed.
```

Now we can prove mutual exclusion: two threads never have both their PCs in the critical section.

```
(* lemma 10.10 from Lynch *)
Theorem mutex :
forall l g,
reachable l g ->
l T1 = Crit ->
l T2 = Crit ->
False.
```

The proof uses the previous lemma. If both threads are in the critical section, then neither of them can be the victim, a contradiction.

```
Proof.
intros.
destruct (g Victim) eqn:?;
eapply turn_i; eauto;
repeat find_rewrite; simpl; auto.
Qed.
End Peterson.
```

This is a horrible way to write down programs. They get longer and harder to read. Control flow has to be made explicit. The proofs aren’t too bad, honestly. But it would be easy to make a mistake in translating the pseudocode into the state machine. I’d like to find a better way to write these kinds of algorithms down while still being able to reason about them. The key to the proof is attaching a bunch of invariants to the PC, but the PC had to be introduced explicitly.

I think this is related to the problem of auxiliary variables. For example, in EWD779, Dijkstra proves Peterson’s algorithm correct by using an auxiliary variable that captures the essential information about the PC.

As always, comments are welcome. Find me on freenode as `jrw`

, or send email to `jrw12@cs.washington.edu`

.

Because the PC is an arbitrary type, we can also use it to model thread-local storage. Peterson’s algorithm doesn’t need any of that, though.↩