(* Hello, dear reader. You're brave enough to have opened the actual code. The blog post uses my super hacky literate coq python script. See https://github.com/wilcoxjay/coq-lit. You might be wondering: "why not use coqdoc?" Indeed. *) (** % A Proof of Peterson's Algorithm % James Wilcox % May 8, 2015 *) (*raw *) (** > _In this post, we take a break from distributed systems to look at > shared memory systems. As a case study, we give a proof that > Peterson's algorithm provides mutual exclusion._ [distributed-algorithms]: http://groups.csail.mit.edu/tds/distalgs.html This quarter I've been organizing a reading group on distributed algorithms. We've been reading Lynch's aptly named _[Distributed Algorithms][distributed-algorithms]_. Last week we covered shared-memory mutual exclusion algorithms, and I thought I'd have a go at formalizing some of the proofs in Coq. # Mutual Exclusion Mutual exclusion is the problem of implementing locks. It's a classic in the concurrent programming literature, with just about everybody famous publishing solutions to it. And why not: it's not at all obvious software should be able to provide mutual exclusion on top of hardware that doesn't. These days the algorithms are of mostly theoretical interest because modern multicore chips _do_ have primitives that make it easier to implement locks. But these old-school algorithms don't need any of that. All they require is _sequential consistency_ of the underlying memory system. (To make these algorithms feel even worse about themselves, sequential consistency isn't guaranteed by default on modern chips.) # Sequential Consistency Programmers can reason about a single-threaded program by simulating its execution in their heads, with each line of code executing in order. Such reasoning is an abstraction of what's actually going on, especially on modern hardware and with modern compilers. Compilers move code around, and hardware executes instructions out of order. 
But they all guarantee that the program behaves _as if_ each line of code executed in order. This guarantee ensures that the programmer's line-by-line reasoning is correct. Sequential consistency is an extension of this reasoning technique to multithreaded programs. An execution of a multithreaded program is sequentially consistent if it corresponds to some interleaving of the steps of each thread. For example, consider the following code, which executes in an environment where both `x` and `y` are initially 0. *) (** T1 T2 x = 1 y = 1 print y print x *) (** Thread 1's code is in the left column. It first writes the value 1 to the global variable `x`. Then it prints the value of the global variable `y`. Thread 2's code is in the right column. It writes 1 to `y` and then prints `x`. Now suppose we want to convince ourselves that no execution of this program will print "00". In a sequentially consistent world, we know that any execution of this program is equivalent to the sequential execution of some interleaving of the steps of the threads. So we consider all possible interleavings of the steps of the threads. Here they are. *) (** x = 1; print y; y = 1; print x // "01" x = 1; y = 1; print x; print y // "11" x = 1; y = 1; print y; print x // "11" y = 1; x = 1; print x; print y // "11" y = 1; x = 1; print y; print x // "11" y = 1; print x; x = 1; print y // "01" *) (** Now we imagine running each interleaving as a sequential program, and indeed, "00" is not printed by any of them. At a slightly higher level, we might argue that "00" is impossible because one thread must go first, and each thread begins by writing a 1 to its global variable. So when the other thread reaches its print statement, it must print a "1". 
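This exhaustive check is small enough to mechanize. Here's a quick Python sketch (not part of the Coq development; the names are made up for illustration) that enumerates all six interleavings and collects what each one prints:

```python
from itertools import permutations

def outputs():
    # Each thread is a list of atomic steps, in program order.
    t1 = [("write", "x"), ("read", "y")]
    t2 = [("write", "y"), ("read", "x")]
    results = set()
    # An interleaving is a choice of which thread moves next,
    # with each thread's own steps kept in program order.
    for order in set(permutations([1, 1, 2, 2])):
        state = {"x": 0, "y": 0}
        pcs = {1: 0, 2: 0}
        printed = ""
        for tid in order:
            op, var = (t1 if tid == 1 else t2)[pcs[tid]]
            if op == "write":
                state[var] = 1
            else:
                printed += str(state[var])
            pcs[tid] += 1
        results.add(printed)
    return results

print(sorted(outputs()))  # ['01', '11']; "00" never appears
```

The six distinct orderings of two `1`s and two `2`s are exactly the six interleavings listed above, and running them all confirms the informal argument.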
# Peterson's Algorithm [peterson]: http://cs.nyu.edu/~lerner/spring12/Read03-MutualExclusion.pdf [herlihy-book]: http://booksite.elsevier.com/9780123705914/?ISBN=9780123705914 Now that we understand sequential consistency, let's get back to mutual exclusion. [Herlihy and Shavit][herlihy-book] call Peterson's algorithm "the most succinct and elegant two-thread mutual exclusion algorithm." Well, this better be good. The original paper is [here][peterson]. *) (** T1 T2 flag1 = true flag2 = true victim = 1 victim = 2 while (flag2 && while (flag1 && victim == 1) {} victim == 2) {} // critical section // critical section flag1 = false flag2 = false *) (** Each thread begins by setting its flag to true. It then sets the `victim` variable to its thread id, and waits until either the other thread's flag is false or `victim` is set by the other thread. To see informally why this ensures mutual exclusion, consider the case where `T1` enters the critical section. Then the loop condition must have been false, which means either `flag2` was false or `victim` was set to 2. If `flag2` was false, then `T2` was not even trying to enter the critical section. If `victim` was set to 2, then `T2` _was_ trying to enter the critical section, but `T1` won the race. Let's now prove this formally. # Modeling Shared Memory [code]: SharedMem.v To reason about the algorithm formally, we first have to write down our assumptions about how shared memory works. The model used here is essentially the one used by [Lynch][distributed-algorithms]. For the purposes of presentation, I'm going to leave out some details from the post, but this post is generated from a Coq file that contains all the proofs. The full code is available [here][code]. *) (* Note: this code was tested with Coq 8.5beta2. It should also work in earlier versions with the usual slight tweaks everywhere... *) Require Import JRWTactics. (* available at https://github.com/wilcoxjay/tactics *) Require Import Relations. Require Import List.
Import ListNotations. Require Import Eqdep_dec. (* Decidable equality *) Class Eq (A : Type) := { Eq_dec : forall a b : A, {a = b} + {a <> b} }. (* Dependently typed total maps. Not explained in the blog post, but we'll use them below to model both the global (shared) state (map from register to contents) and the PC (map from thread to that thread's PC). *) Section update. Context {A : Type}. Context {A_eq : Eq A}. Context {B : A -> Type}. Definition update (f : forall a, B a) (x : A) (v : B x) : forall y : A, B y := fun y => match Eq_dec x y with | left pf => eq_rect _ _ v _ pf | right _ => f y end. Lemma update_same : forall f x v, update f x v x = v. Proof. unfold update. intros. break_match. - rewrite <- Eqdep_dec.eq_rect_eq_dec by auto using Eq_dec. auto. - congruence. Qed. Lemma update_diff : forall f x v y, x <> y -> update f x v y = f y. Proof. unfold update. intros. break_match; congruence. Qed. End update. (*begin code*) Section SharedMem. (*end code*) (** Shared memory is a set of registers. *) (*begin code*) (* register names *) Variable register : Type. (* we assume the names have decidable equality *) Context {reg_eq_dec : Eq register}. (* the type of data stored in each register *) Variable contents : register -> Type. (*end code*) (** Each register has a name and a type of data that it stores. As a technical condition, we need to assume that register names have decidable equality. For that we use the `Eq` typeclass (definition not shown; see the code for more details). *) (*begin code*) Definition global_state : Type := forall r, contents r. (*end code*) (** We model the global state of all registers as a (dependent) function from registers to their contents. A shared-memory program is a collection of threads. *) (*begin code*) (* thread names *) Variable thread : Type. (* more decidable equality *) Context {thread_eq_dec : Eq thread}. Variable program_counter : thread -> Type. Definition program_counters : Type := forall t, program_counter t. 
(*end code*) (** Each thread has a name, and we again assume the names have decidable equality. Each thread also has a program counter (PC), which keeps track of what it should do next. The PC is not visible to other threads. The collection of all threads' program counters is modeled as a dependent function from threads to their PC. From the point of view of the shared memory model, a program is just a function for each thread that updates that thread's PC as well as the global state. At each step, a thread is nondeterministically selected, and its function is run with access to its current PC and the current global state. The function returns the new PC and new global state. It's important to note that the function runs atomically, without interference from other threads. To model real shared-memory algorithms, we need to use the PC to break them down into repeated applications of this function, as we'll see below.[^PC-generality] [^PC-generality]: Because the PC is an arbitrary type, we can also use it to model thread-local storage. Peterson's algorithm doesn't need any of that, though. *) (* Author's note: The syntax warning from coq-lit.py here is benign. *) (*begin code*) (* one atomic step. given: - a thread t - thread t's current PC - the current global state returns thread t's new PC and the new global state *) Variable handler : forall t, program_counter t -> global_state -> (program_counter t * global_state). (*end code*) (** Execution proceeds as described above: select a thread; run the handler; update the state; rinse; repeat. We can write this down as a step relation. *) (*begin code*) Inductive step : program_counters * global_state -> program_counters * global_state -> Prop := | step_step : forall t ls gs st' gs', handler t (ls t) gs = (st', gs') -> step (ls, gs) (update ls t st', gs'). (*end code*) (** Here we use the `update` function (definition not shown; see code for details) to update thread `t`'s PC.
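Operationally, the `step` relation is just the relational version of a tiny scheduler loop. Here's a rough Python analogue (hypothetical names, purely to build intuition; the Coq definition is the real model): pick a thread, run its handler atomically against its PC and the global state, and write both results back.

```python
import random

def step(handler, pcs, gs):
    # Nondeterministic choice of which thread moves next.
    t = random.choice(list(pcs))
    # The handler runs atomically: it sees a consistent PC and
    # global state, and no other thread runs until it returns.
    new_pc, new_gs = handler(t, pcs[t], gs)
    return {**pcs, t: new_pc}, new_gs  # only thread t's PC changes

# Toy handler: every step bumps the thread's PC and a shared counter.
def inc_handler(t, pc, gs):
    return pc + 1, {**gs, "counter": gs["counter"] + 1}

pcs, gs = {"T1": 0, "T2": 0}, {"counter": 0}
for _ in range(5):
    pcs, gs = step(inc_handler, pcs, gs)
print(gs["counter"])  # 5, whatever schedule the random choices produced
```

The Coq `step` is a relation rather than a function precisely so that it captures every possible scheduler choice at once, instead of one random run.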
To model multiple steps of execution, we take the reflexive-transitive closure of the relation `step`. It saddens me that I think the name `clos_refl_trans_n1` makes sense. *) (*begin code*) Definition step_star := clos_refl_trans_n1 _ step. End SharedMem. (*end code*) (** # Peterson's Algorithm Formalized Now that we've got a model of shared memory, we can write programs against it. Because each step of the handler function is atomic, we have to be careful to break the algorithm down into its atomic steps. Let's recall Peterson's algorithm: *) (** T1 T2 flag1 = true flag2 = true victim = 1 victim = 2 while (flag2 && while (flag1 && victim == 1) {} victim == 2) {} // critical section // critical section flag1 = false flag2 = false *) (** To write this in our model, we need to define `register`, `contents`, `thread`, `program_counter`, and `handler`. Here we go. Peterson's algorithm works with two threads, so we'll make the type `thread` have two elements, `T1` and `T2`. We give a decidable equality instance as well (see code for proof). *) (*begin code*) Module Peterson. Inductive thread := | T1 | T2. Instance thread_eq_dec : Eq thread. (*end code*) constructor. intros. decide equality. Defined. (** Peterson's algorithm as presented above has three registers: `flag1`, `flag2`, and `victim`. To make it possible to write the code for both threads as a single function, we define `register` to have a `Flag` constructor that takes a `thread` as an argument. Thus `Flag T1` corresponds to `flag1` in the code above. Again, we also give a decidable equality instance. *) (*begin code*) Inductive register := | Flag : thread -> register | Victim. Instance register_eq_dec : Eq register. (*end code*) constructor. intros. decide equality; apply thread_eq_dec. Defined. (** The flags hold booleans while `victim` holds a thread name. *) (*begin code*) Definition contents (r : register) : Type := match r with | Victim => thread | Flag _ => bool end.
(*end code*) (** Initially, both flags should be set to `false`. But Peterson's algorithm will work correctly for any initial value of `victim`. Thus we define the initial state of the registers to be parametrized on a thread `t` that is the initial value of `victim`. The theorems below will show that the algorithm works for both possible values of `t`. *) (*begin code*) Definition register_init (t : thread) (r : register) : contents r := match r with | Victim => t | _ => false end. (*end code*) (** We're going to split up the code into atomic sections. The program counter of each thread will keep track of which atomic section to execute next. Each line of the pseudocode above corresponds to a single atomic section (either reading or writing a single shared variable). So `program_counter` just has six elements, one for each line. *) (*begin code*) Inductive program_counter : Type := | SetFlag | SetVictim | CheckFlag | CheckVictim | Crit | ResetFlag. (*end code*) (** Initially both threads start at `SetFlag`. *) (*begin code*) Definition PC_init (t : thread) : program_counter := SetFlag. (*end code*) (* monad for writing handler functions. it's a state monad over both the PC and the global state *) Definition HM (t : thread) (A : Type) : Type := program_counter -> global_state register contents -> A * program_counter * global_state register contents. Definition ret {t : thread} {A : Type} (x : A) : HM t A := fun l g => (x, l, g). Definition bind {t : thread} {A B : Type} (m : HM t A) (f : A -> HM t B) : HM t B := fun l g => let '(a, l', g') := m l g in f a l' g'. Definition get_PC {t : thread} : HM t program_counter := fun l g => (l, l, g). Definition global_read {t : thread} : HM t (global_state register contents) := fun l g => (g, l, g). Definition goto {t : thread} (l' : program_counter) : HM t unit := fun _ g => (tt, l', g). Definition global_write {t : thread} (g' : global_state register contents) : HM t unit := fun l _ => (tt, l, g'). 
Definition read_register {t : thread} (r : register) : HM t (contents r) := bind global_read (fun g => ret (g r)). Definition write_register {t : thread} (r : register) (x : contents r) : HM t unit := bind global_read (fun g => global_write (update g r x)). (* unwrap the monad *) Definition run_handler (h : forall t, HM t unit) := fun t l g => let '(_, l', g') := h t l g in (l', g'). Ltac monad_unfold := unfold write_register, read_register, global_write, goto, global_read, get_PC, bind, ret in *. Notation "x <- m1 ;; m2" := (bind m1 (fun x => m2)) (right associativity, at level 100, m1 at next level). Notation "m1 ;; m2" := (bind m1 (fun _ => m2)) (right associativity, at level 100). Notation "r ::= x" := (write_register r x) (right associativity, at level 90). Notation "[[ r ]]" := (read_register r) (right associativity, at level 90). (** Now we're ready to write down the handler function. Instead of writing a separate one for each thread, we're going to write it once by using a helper function to refer to the other thread. *) (*begin code*) Definition other (t : thread) : thread := match t with | T1 => T2 | T2 => T1 end. (*end code*) (** And finally, here's Peterson's algorithm. *) (*begin code*) Definition handler (t : thread) : HM t unit := pc <- get_PC ;; match pc with | SetFlag => Flag t ::= true ;; goto SetVictim | SetVictim => Victim ::= t ;; goto CheckFlag | CheckFlag => b <- [[ Flag (other t) ]] ;; if b then goto CheckVictim else goto Crit | CheckVictim => t' <- [[ Victim ]] ;; if Eq_dec t t' then goto CheckFlag else goto Crit | Crit => goto ResetFlag | ResetFlag => Flag t ::= false ;; goto SetFlag end. (*end code*) (** The code is written in a monadic style. `HM` is a state monad over the thread's PC as well as the global state. To write the value `x` to a register `r`, there is the notation `r ::= x`. To read from a register `r`, there's the notation `[[ r ]]`. `get_PC` reads the PC. `goto` writes the PC. 
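Since the whole system is finite-state (six PC values per thread, two boolean flags, one two-valued register), we can also sanity-check the algorithm by brute force before proving anything. Here's a small Python search (outside the Coq development; the names just mirror the definitions above) that enumerates every reachable state under every schedule and both initial values of `victim`:

```python
def handler(t, pc, gs):
    # One atomic section of Peterson's algorithm per PC value,
    # mirroring the Coq handler.
    other = "T2" if t == "T1" else "T1"
    g = dict(gs)
    if pc == "SetFlag":
        g["flag" + t] = True
        return "SetVictim", g
    if pc == "SetVictim":
        g["victim"] = t
        return "CheckFlag", g
    if pc == "CheckFlag":
        return ("CheckVictim" if g["flag" + other] else "Crit"), g
    if pc == "CheckVictim":
        return ("CheckFlag" if g["victim"] == t else "Crit"), g
    if pc == "Crit":
        return "ResetFlag", g
    g["flag" + t] = False  # ResetFlag
    return "SetFlag", g

def reachable():
    # Depth-first search over all schedules, from both initial
    # values of victim (the algorithm must work for either).
    def freeze(pcs, gs):
        return (pcs["T1"], pcs["T2"],
                gs["flagT1"], gs["flagT2"], gs["victim"])

    seen, frontier = set(), []
    for v0 in ("T1", "T2"):
        frontier.append(({"T1": "SetFlag", "T2": "SetFlag"},
                         {"flagT1": False, "flagT2": False, "victim": v0}))
    while frontier:
        pcs, gs = frontier.pop()
        if freeze(pcs, gs) in seen:
            continue
        seen.add(freeze(pcs, gs))
        for t in ("T1", "T2"):  # the scheduler may pick either thread
            pc2, gs2 = handler(t, pcs[t], gs)
            frontier.append(({**pcs, t: pc2}, gs2))
    return seen

assert all(not (s[0] == "Crit" and s[1] == "Crit") for s in reachable())
```

The search visits only a few dozen states and the assertion goes through. Of course, this only checks one finite instance of the state machine; the Coq proof is the real argument.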
The handler reads the current PC and branches on it. In each branch, it executes the corresponding atomic section. You should take a second to convince yourself that this code implements the pseudocode for Peterson's algorithm. # Proof We're going to show that Peterson's algorithm provides mutual exclusion. More precisely, in any reachable state of the system, at most one thread has its PC equal to `Crit`. First, we write down what it means for a state to be reachable. *) (*begin code*) Definition reachable l g : Prop := exists t0, step_star register contents thread (fun _ => program_counter) (run_handler handler) (PC_init, register_init t0) (l, g). (*end code*) (** Recall that `register_init` takes an argument that is the initial value of `Victim`. Since we want the algorithm to work regardless of this initial value, a state is considered reachable if there exists any value that causes the system to reach the state. The function `run_handler` just unwraps the monadic handler function. Next we provide an induction principle for the reachability predicate. This is essentially just a cleaned-up version of the induction principle that you'd get from `clos_refl_trans_n1`. *) (*begin code*) Lemma reachable_ind : forall (P : program_counters thread (fun _ => program_counter) -> global_state register contents -> Prop), (forall t, P PC_init (register_init t)) -> (forall l g t st' g' u, P l g -> reachable l g -> handler t (l t) g = (u, st', g') -> P (update l t st') g') -> forall l g, reachable l g -> P l g. Proof. (* see code for details *) (*end code*) unfold reachable, step_star. intros. break_exists. prep_induction H1. induction H1; intros; subst. - find_inversion. auto. - invc H. unfold run_handler in *. repeat break_let. find_inversion. eapply H4; eauto. eapply IHclos_refl_trans_n1; auto. (*begin code*) Qed.
(*end code*) Ltac update_destruct := match goal with | [ H : context [update _ ?x _ ?y] |- _ ] => destruct (Eq_dec x y) | [ |- context [update _ ?x _ ?y] ] => destruct (Eq_dec x y) end. Ltac find_rewrite_lem lem := match goal with | [ H : _ |- _ ] => rewrite lem in H end. Ltac update_rewrite := repeat rewrite update_same in *; repeat (find_rewrite_lem update_same); repeat rewrite update_diff in * by congruence; repeat (find_rewrite_lem uconstr:(update_diff); [|congruence]). (** First we show that no thread is its own `other`. The proof is by case analysis. *) (*begin code*) Lemma other_neq : forall t, t <> other t. Proof. unfold other. intros. break_match; congruence. Qed. (*end code*) Ltac workhorse := update_destruct; repeat find_inversion; try subst; update_rewrite; auto using other_neq; try congruence. (** We're finally ready to start reasoning about Peterson's algorithm. We start by showing that if `Flag t` is `false`, then thread `t`'s PC must be at `SetFlag`. This matches our intuition that the flag is true whenever the thread is acquiring, holding, or releasing the lock. *) (*begin code*) (* assertion 10.5.1 from Lynch *) Lemma flag_false : forall l g, reachable l g -> forall t, g (Flag t) = false -> l t = SetFlag. (*end code*) (** The proof is by induction over the execution. In the base case, both threads have flags set to `false` and have their PC at `SetFlag`. The inductive case proceeds by case analysis on which thread took the step and where that thread's PC was before the step. The tactic `workhorse` is a helper tactic that automates some of the tedium of the case analysis; its definition is not shown, but you can find it in the code. *) (*begin code*) Proof. induction 1 using reachable_ind; intros. - auto. - unfold handler in *. monad_unfold. repeat break_match; repeat find_inversion; repeat workhorse; repeat find_apply_hyp_hyp; congruence. Qed.
(*end code*) (** Next, we'll show that if thread `t` holds the lock and thread `other t` is trying to acquire the lock, then `Victim` is set to `other t`. *) (*begin code*) (* assertion 10.5.2 from Lynch *) Lemma turn_i : forall l g, reachable l g -> forall t, In (l t) [Crit; ResetFlag] -> In (l (other t)) [CheckFlag; CheckVictim; Crit; ResetFlag] -> g Victim <> t. (*end code*) (** The proof is again by induction over the execution. In the base case, no thread holds the lock, so the property holds trivially. In the inductive case, we do a bunch of case analysis. We use the previous lemma to rule out the case where `other t` enters the critical section by reading `Flag t` as `false`, which can't happen because `t` is in the critical section and thus not at `SetFlag`. The other cases are straightforward. *) (*begin code*) Proof. induction 1 using reachable_ind; intros. - compute in *. intuition; try discriminate. - unfold handler in *. monad_unfold. repeat break_match; repeat find_inversion; workhorse; try workhorse; try solve [simpl in *; intuition discriminate]; try solve [apply IHreachable; auto; find_rewrite; simpl; intuition]; try solve [exfalso; eapply other_neq; eauto]. + find_eapply_lem_hyp flag_false; eauto. find_rewrite. simpl in *. intuition discriminate. + repeat find_rewrite. auto using other_neq. Qed. (*end code*) (** Now we can prove mutual exclusion: two threads never have both their PCs in the critical section. *) (*begin code*) (* lemma 10.10 from Lynch *) Theorem mutex : forall l g, reachable l g -> l T1 = Crit -> l T2 = Crit -> False. (*end code*) (** The proof uses the previous lemma. If both threads are in the critical section, then neither of them can be the victim, a contradiction. *) (*begin code*) Proof. intros. destruct (g Victim) eqn:?; eapply turn_i; eauto; repeat find_rewrite; simpl; auto. Qed. End Peterson. (*end code*) (** # Conclusion This is a horrible way to write down programs. They get longer and harder to read. 
Control flow has to be made explicit. The proofs aren't too bad, honestly. But it would be easy to make a mistake in translating the pseudocode into the state machine. I'd like to find a better way to write these kinds of algorithms down while still being able to reason about them. The key to the proof is attaching a bunch of invariants to the PC, but the PC had to be introduced explicitly. [ewd779]: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD779.PDF I think this is related to the problem of auxiliary variables. For example, in [EWD779][ewd779], Dijkstra proves Peterson's algorithm correct by using an auxiliary variable that captures the essential information about the PC. As always, comments are welcome. Find me on freenode as `jrw`, or send email to `jrw12@cs.washington.edu`. *)