Computation of optimal policy
Given the optimal value function V*(s), do one
Bellman backup for each state: the action that
maximises the inner product term is the optimal
action for that state.
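The extraction step above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the backed-up quantity for action a in state s is R(s,a) + gamma * sum over s' of P(s'|s,a) V*(s'). The names `greedy_policy`, `P`, `R`, `V_star` and the two-state toy MDP are hypothetical and chosen only for this example.

```python
def greedy_policy(P, R, V, gamma):
    """Return the greedy action for each state given a value function V.

    Assumed layout (hypothetical, for illustration only):
      P[a][s][t] : probability of moving from state s to state t under action a
      R[s][a]    : expected immediate reward for taking action a in state s
      V[t]       : value of state t
    """
    n_actions = len(P)
    n_states = len(V)
    policy = []
    for s in range(n_states):
        # One Bellman backup per action: immediate reward plus the discounted
        # inner product of the transition row P[a][s] with V.
        q = [
            R[s][a] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
            for a in range(n_actions)
        ]
        # The maximising action is the optimal action for state s.
        policy.append(max(range(n_actions), key=lambda a: q[a]))
    return policy


# Toy MDP (an assumption for this sketch): action 0 = stay, action 1 = move.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # stay: remain in the current state
    [[0.0, 1.0], [1.0, 0.0]],  # move: switch to the other state
]
R = [
    [0.0, 0.0],  # state 0: no reward for either action
    [1.0, 0.0],  # state 1: reward 1 for staying
]
gamma = 0.9
# V* for this toy MDP: V*(1) = 1 / (1 - 0.9) = 10 and V*(0) = 0.9 * 10 = 9.
V_star = [9.0, 10.0]

policy = greedy_policy(P, R, V_star, gamma)
print(policy)  # [1, 0]: move out of state 0, stay in state 1
```

The returned policy depends only on the state and not on the time step, so a single lookup table covers the whole horizon.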
The optimal policy is stationary (time
independent), which is intuitive for the
infinite horizon case.