Expectation Maximization
EM is a broad topic. I hope to just give a brief but understandable introduction to it here. This material is borrowed heavily from a class I took, taught by Mathias Drton.
The setup
First, let's assume we have some complete data $X$, for which we know how to do Maximum Likelihood Estimation for some parameters $\theta$. The complete data likelihood is then:

$$L(\theta; x) = p_\theta(x).$$
Now assume we don't observe this wonderful complete data, but only some transformation $Y = T(X)$. Then, the observed data likelihood is computed by summing (or integrating) the complete data likelihood over all $x$ that map to the observed $y$:

$$L(\theta; y) = \int_{\{x \,:\, T(x) = y\}} p_\theta(x)\, dx \qquad \text{(a sum in the discrete case)}.$$
Let's take a very simple running example. Let's say we have some complete data $X = (X_1, \dots, X_k)$ such that:

$$X \sim \text{Multinomial}\big(n,\, (p_1(\theta), \dots, p_k(\theta))\big).$$

That is, $X$ is a vector of category counts distributed according to a multinomial distribution whose cell probabilities $p_1(\theta), \dots, p_k(\theta)$ depend on a parameter $\theta$.

We know that estimating $\theta$ in this case is simple. We just take the derivative of the log likelihood and set it equal to 0:

$$\ell(\theta; x) = \log L(\theta; x) = \sum_{i=1}^{k} x_i \log p_i(\theta) + \text{const}, \qquad \frac{d\,\ell(\theta; x)}{d\theta} = \sum_{i=1}^{k} x_i\, \frac{p_i'(\theta)}{p_i(\theta)} = 0.$$

Now, let's say that instead of observing $X$, we observe $Y$, such that some of the categories of $X$ are collapsed together:

$$Y_j = \sum_{i \in A_j} X_i, \qquad j = 1, \dots, m,$$

where the sets $A_1, \dots, A_m$ partition $\{1, \dots, k\}$ (so $m < k$). Now, the likelihood is not so easy to optimize in closed form (try it):

$$L(\theta; y) \;\propto\; \prod_{j=1}^{m} \Big( \sum_{i \in A_j} p_i(\theta) \Big)^{y_j}.$$
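To make this concrete, here is a minimal sketch in Python using one classic parameterization. This particular choice is mine, for illustration only, and is not necessarily the example used in the class: five complete-data cells with probabilities $\big(\tfrac12, \tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4}\big)$, with the first two cells merged in the observed data. The complete-data MLE then has a closed form, while the observed-data log-likelihood is maximized numerically below; the counts are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical cell probabilities (an illustrative choice, not necessarily the one from the class):
# complete data has five cells with p(theta) = (1/2, theta/4, (1-theta)/4, (1-theta)/4, theta/4),
# and the observed data merges the first two cells.
def observed_log_lik(theta, y):
    q = np.array([0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4])
    return np.sum(y * np.log(q))

x = np.array([60, 20, 15, 10, 25])             # made-up complete-data counts
y = np.array([x[0] + x[1], x[2], x[3], x[4]])  # the coarser counts we actually observe

# Complete-data MLE: setting the score to zero gives a closed form,
# theta_hat = (x2 + x5) / (x2 + x3 + x4 + x5).
theta_complete = (x[1] + x[4]) / (x[1] + x[2] + x[3] + x[4])

# Observed-data MLE: the score equation is no longer tidy, so maximize numerically.
res = minimize_scalar(lambda t: -observed_log_lik(t, y),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")

print("complete-data MLE:", theta_complete)
print("observed-data MLE (numerical):", res.x)
```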
The algorithm
The EM algorithm works as follows. Starting from an initial guess $\theta^{(0)}$, for $t = 0, 1, 2, \dots$:

- E-step - Evaluate $Q(\theta \mid \theta^{(t)}) = E_{\theta^{(t)}}\!\left[\log p_\theta(X) \mid Y = y\right]$, the expected complete data log likelihood, where the expectation is over $X \mid Y = y$ under the current estimate $\theta^{(t)}$.
- M-step - Set $\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$.
- Repeat until some convergence criterion is satisfied (e.g. $\lVert\theta^{(t+1)} - \theta^{(t)}\rVert < \epsilon$); a generic sketch of this loop follows the list.
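Here is a generic sketch of that loop, assuming the E-step and M-step are supplied as functions; this is my own schematic and is not tied to any particular model.

```python
import numpy as np

def em(y, theta0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM loop (schematic).

    e_step(theta, y) should return whatever sufficient statistics the M-step
    needs (e.g. expected complete-data counts), and m_step(stats) should return
    the theta maximizing the expected complete-data log likelihood.
    """
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta, y)      # E-step: expectations under the current theta
        theta_new = m_step(stats)     # M-step: maximize the expected complete-data log likelihood
        if np.max(np.abs(np.atleast_1d(theta_new) - np.atleast_1d(theta))) < tol:
            return theta_new          # convergence criterion: the parameters stop moving
        theta = theta_new
    return theta
```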
Now, what we want to prove is that the log-likelihood does not decrease at each iteration:

$$\ell(\theta^{(t+1)}; y) \;\geq\; \ell(\theta^{(t)}; y), \qquad \text{where } \ell(\theta; y) = \log p_\theta(y).$$
Note that since $Y = T(X)$ is a deterministic function of $X$, for any $x$ with $T(x) = y$ we have $p_\theta(x, y) = p_\theta(x)$, and so $p_\theta(x) = p_\theta(x \mid y)\, p_\theta(y)$.

Or equivalently, we have:

$$\log p_\theta(y) = \log p_\theta(x) - \log p_\theta(x \mid y).$$

Thus, we can rewrite the claim as:

$$\log p_{\theta^{(t+1)}}(x) - \log p_{\theta^{(t+1)}}(x \mid y) \;\geq\; \log p_{\theta^{(t)}}(x) - \log p_{\theta^{(t)}}(x \mid y).$$

Taking expectations over $X \mid Y = y$ under $\theta^{(t)}$ on both sides (this changes nothing, since each side equals the corresponding $\log p_\theta(y)$ and so does not depend on $x$):

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) - E_{\theta^{(t)}}\!\left[\log p_{\theta^{(t+1)}}(X \mid y) \,\middle|\, y\right] \;\geq\; Q(\theta^{(t)} \mid \theta^{(t)}) - E_{\theta^{(t)}}\!\left[\log p_{\theta^{(t)}}(X \mid y) \,\middle|\, y\right].$$

The two $Q$ terms are exactly the quantity evaluated in the E-step, and due to the M-step the left one has to be greater than or equal to the right one: $Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$. It therefore suffices to show the following inequality for the remaining terms:

$$E_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; 0.$$
Now we use Jensen's inequality, which states that for a convex function $\varphi$ and a random variable $Z$:

$$\varphi\big(E[Z]\big) \;\leq\; E\big[\varphi(Z)\big].$$

Since $\log$ is a concave function, $-\log$ is a convex function, and thus we can write:

$$E_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; \log E_{\theta^{(t)}}\!\left[\frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] = \log \int p_{\theta^{(t+1)}}(x \mid y)\, dx = \log 1 = 0.$$
Thus, we have proved that the log likelihood does not decrease at each iteration.
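As a quick numerical sanity check of the Jensen step, here is a tiny sketch with a toy positive random variable of my own choosing (nothing from the original material): the expectation of the log should never exceed the log of the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=100_000)  # any positive random variable works here

# Jensen with the concave log: E[log Z] <= log E[Z].
print(f"E[log Z] = {np.mean(np.log(z)):.4f}  <=  log E[Z] = {np.log(np.mean(z)):.4f}")
```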
Going back to the example
In the E step, we would evaluate the expected value of $\log p_\theta(X)$ given $\theta^{(t)}$ (a current estimate of $\theta$) and $Y = y$. Since $\log p_\theta(x) = \sum_i x_i \log p_i(\theta) + \text{const}$, this amounts to computing the expected counts $E_{\theta^{(t)}}[X_i \mid Y = y]$. Let's say each observation in the complete data has a category in $\{1, \dots, k\}$ and each observation in the observed data has a category in $\{1, \dots, m\}$, where observed category $j$ corresponds to the merged set $A_j$. Since we assume the data are i.i.d.,

$$E_{\theta^{(t)}}[X_i \mid Y = y] \;=\; y_j \cdot \frac{p_i(\theta^{(t)})}{\sum_{i' \in A_j} p_{i'}(\theta^{(t)})} \qquad \text{for } i \in A_j.$$

For this formula, note that each of the $y_j$ observations falling in observed category $j$ lands in complete-data category $i \in A_j$ with probability $p_i(\theta^{(t)}) \big/ \sum_{i' \in A_j} p_{i'}(\theta^{(t)})$, and that $X_i \mid Y = y$ therefore follows a binomial distribution with this success probability.
In the M step, we would maximize the expected log likelihood of the complete data using the values we found in the E step:

$$Q(\theta \mid \theta^{(t)}) \;=\; \sum_{i=1}^{k} E_{\theta^{(t)}}[X_i \mid y]\, \log p_i(\theta) + \text{const}.$$

Taking the derivative and setting it to 0 gives us:

$$\frac{d}{d\theta}\, Q(\theta \mid \theta^{(t)}) \;=\; \sum_{i=1}^{k} E_{\theta^{(t)}}[X_i \mid y]\, \frac{p_i'(\theta)}{p_i(\theta)} \;=\; 0.$$

Thus, we would set $\theta^{(t+1)}$ to the solution of this equation and iterate. Note that the equation for $\theta^{(t+1)}$ is exactly what we had for the complete likelihood, but taking expectations given the current parameter instead of using the $x_i$ values: the counts $x_i$ are simply replaced by the expected counts $E_{\theta^{(t)}}[X_i \mid y]$.
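Putting the two steps together for the same hypothetical parameterization as in the earlier sketch (first two cells merged, with probabilities $\tfrac12$ and $\tfrac{\theta}{4}$), here is a sketch of the full iteration with the same made-up observed counts; the printed observed-data log-likelihoods should be non-decreasing, as the argument above promises.

```python
import numpy as np

def e_step(theta, y):
    # Split the merged count y[0] between the two hidden cells in proportion
    # to their current probabilities 1/2 and theta/4.
    x2 = y[0] * (theta / 4) / (0.5 + theta / 4)
    # Expected complete-data counts under the current theta.
    return np.array([y[0] - x2, x2, y[1], y[2], y[3]])

def m_step(ex):
    # Same formula as the complete-data MLE, with counts replaced by expected counts.
    return (ex[1] + ex[4]) / (ex[1] + ex[2] + ex[3] + ex[4])

def observed_log_lik(theta, y):
    q = np.array([0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4])
    return np.sum(y * np.log(q))

y = np.array([80, 15, 10, 25])   # made-up observed counts, as before
theta = 0.5                       # arbitrary starting value
for t in range(10):
    theta = m_step(e_step(theta, y))
    print(f"iteration {t + 1}: theta = {theta:.6f}, log-lik = {observed_log_lik(theta, y):.6f}")
```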
This is an illustration of when EM is used: when the observed data likelihood is annoying, but the complete data likelihood is nice.