Expectation Maximization

EM is a broad topic. I hope to just give a brief but understandable introduction to it here. This material is borrowed heavily from a class I took, taught by Mathias Drton.

The setup

First, let's assume we have some complete data $X$, for which we know how to do Maximum Likelihood Estimation for some parameters $\theta$. The complete data likelihood is then:

$$L(\theta; x) = p_\theta(x)$$

Now assume we don't observe this wonderful complete data, but only some transformation $Y = T(X)$. Then, the observed data likelihood is computed by summing (or integrating) over all complete data consistent with the observation:

$$L(\theta; y) = p_\theta(y) = \sum_{x \,:\, T(x) = y} p_\theta(x)$$

Let's take a very simple running example. Let's say we have some data $X = (X_1, X_2, X_3, X_4, X_5)$ such that:

$$X \sim \text{Multinomial}\left(n,\ \left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)\right)$$

That is, $X$ is a vector of counts over five categories, distributed according to a multinomial distribution with parameters $n$ (the total number of observations) and cell probabilities $\left(\tfrac{1}{2}, \tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4}\right)$, where $\theta \in (0, 1)$ is the parameter we want to estimate.

We know that estimating $\theta$ in this case is simple. Up to an additive constant, the complete data log likelihood is

$$\ell(\theta; x) = (x_2 + x_5)\log\theta + (x_3 + x_4)\log(1 - \theta)$$

We just take the derivative of the log likelihood and set it equal to 0:

$$\frac{x_2 + x_5}{\theta} - \frac{x_3 + x_4}{1 - \theta} = 0 \quad\Longrightarrow\quad \hat\theta = \frac{x_2 + x_5}{x_2 + x_3 + x_4 + x_5}$$
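As a quick sanity check of this closed form, here is a minimal Python sketch comparing it against a brute-force maximization of the complete data log likelihood over a grid; the complete data counts are made up purely for illustration.

```python
import numpy as np

# Hypothetical complete-data counts x = (x1, x2, x3, x4, x5), for illustration only.
x = np.array([50.0, 10.0, 12.0, 13.0, 15.0])

def complete_loglik(theta, x):
    """Complete-data log-likelihood, up to an additive constant."""
    return (x[1] + x[4]) * np.log(theta) + (x[2] + x[3]) * np.log(1 - theta)

grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
print(grid[np.argmax(complete_loglik(grid, x))])    # brute-force maximizer
print((x[1] + x[4]) / (x[1] + x[2] + x[3] + x[4]))  # closed form from above
```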

Now, let's say that instead of observing $X$, we observe $Y = (Y_1, Y_2, Y_3, Y_4)$, such that:

$$Y = (X_1 + X_2,\ X_3,\ X_4,\ X_5)$$

That is, the first two categories get merged into a single observed category, so

$$Y \sim \text{Multinomial}\left(n,\ \left(\tfrac{1}{2} + \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)\right)$$

Now, the likelihood is not so easy to optimize in closed form (try it):

$$\ell(\theta; y) = y_1 \log\left(\tfrac{1}{2} + \tfrac{\theta}{4}\right) + (y_2 + y_3)\log\tfrac{1-\theta}{4} + y_4 \log\tfrac{\theta}{4}$$

Setting the derivative to 0 now means solving a quadratic in $\theta$ instead of just reading off a ratio of counts.
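To make the "(try it)" concrete, here is a minimal Python sketch that simply maximizes this observed data log likelihood numerically over a grid; the observed counts are hypothetical, just for illustration.

```python
import numpy as np

# Hypothetical observed counts y = (y1, y2, y3, y4), for illustration only.
y = np.array([60.0, 15.0, 10.0, 15.0])

def observed_loglik(theta, y):
    """Observed-data log-likelihood, up to an additive constant."""
    return (y[0] * np.log(0.5 + theta / 4)
            + (y[1] + y[2]) * np.log((1 - theta) / 4)
            + y[3] * np.log(theta / 4))

grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
theta_hat = grid[np.argmax(observed_loglik(grid, y))]
print(theta_hat)  # numerical MLE of theta from the observed data
```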

The algorithm

Starting from an initial guess $\theta^{(0)}$, the EM algorithm works as follows (a minimal code sketch of this loop is given right after the list):

  • E-step - Evaluate $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\!\left[\log p_\theta(X) \mid Y = y\right]$
  • M-step - Set $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$
  • repeat until some convergence criterion is satisfied
    (e.g. $|\theta^{(t+1)} - \theta^{(t)}| < \epsilon$)
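In code, the whole loop is only a few lines. Here is a minimal sketch of it, assuming a scalar parameter and user-supplied `e_step` and `m_step` functions (the signatures below are just one possible choice):

```python
def em(y, theta0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM loop: alternate E and M steps until theta stops moving.

    e_step(theta, y) returns whatever expected sufficient statistics the
    M step needs; m_step(stats) returns the theta that maximizes the
    expected complete-data log-likelihood given those statistics.
    """
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta, y)           # E-step: expectations under the current theta
        new_theta = m_step(stats)          # M-step: maximize the expected complete log-likelihood
        if abs(new_theta - theta) < tol:   # convergence criterion from the list above
            return new_theta
        theta = new_theta
    return theta
```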

Now, what we want to prove is that the log-likelihood does not decrease at each iteration:

$$\log p_{\theta^{(t+1)}}(y) \;\geq\; \log p_{\theta^{(t)}}(y)$$

Note that since $Y$ is a deterministic function of $X$, whenever $y = T(x)$,

$$p_\theta(x) = p_\theta(x, y) = p_\theta(y)\, p_\theta(x \mid y)$$

Or equivalently, we have:

$$\log p_\theta(y) = \log p_\theta(x) - \log p_\theta(x \mid y)$$

Thus, we can rewrite the claim as: for any $x$ with $T(x) = y$,

$$\log p_{\theta^{(t+1)}}(x) - \log p_{\theta^{(t+1)}}(x \mid y) \;\geq\; \log p_{\theta^{(t)}}(x) - \log p_{\theta^{(t)}}(x \mid y)$$

Taking expectations over $X \mid Y = y,\ \theta^{(t)}$ on both sides (this changes nothing, since each side equals $\log p_\theta(y)$ for the corresponding $\theta$, which does not depend on $x$), and recognizing the $Q$ function from the E-step:

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) - \mathbb{E}_{\theta^{(t)}}\!\left[\log p_{\theta^{(t+1)}}(X \mid y) \,\middle|\, y\right] \;\geq\; Q(\theta^{(t)} \mid \theta^{(t)}) - \mathbb{E}_{\theta^{(t)}}\!\left[\log p_{\theta^{(t)}}(X \mid y) \,\middle|\, y\right]$$

Note that due to the M-step, the left $Q$ term has to be greater than or equal to the right one: $\theta^{(t+1)}$ is chosen to maximize $Q(\cdot \mid \theta^{(t)})$, so $Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$. It therefore suffices to prove the following inequality about the remaining terms:

$$\mathbb{E}_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; 0$$

Now we use Jensen's inequality, which states that for a convex function $f$ and a random variable $Z$:

$$\mathbb{E}[f(Z)] \;\geq\; f(\mathbb{E}[Z])$$

Since log is a concave function, $-\log$ is a convex function, and thus we can write:

$$\mathbb{E}_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; \log \mathbb{E}_{\theta^{(t)}}\!\left[\frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] = \log \sum_{x \,:\, T(x) = y} p_{\theta^{(t+1)}}(x \mid y) = \log 1 = 0$$

Thus, we have proved that the log likelihood does not decrease at each iteration.
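As a quick numerical sanity check of the Jensen step, for any positive random variable the average of the logs should never exceed the log of the average:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(size=100_000)                # any positive random variable
print(np.mean(np.log(z)) <= np.log(np.mean(z)))  # True: E[log Z] <= log E[Z]
```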

Going back to the example

In the E step, we would evaluate the expected value of $\log p_\theta(X)$ given $\theta^{(t)}$ (a current estimate of $\theta$) and $y$. Let's say each observation in the complete data has a category in $\{1, 2, 3, 4, 5\}$ and each observation in the observed data has a category in $\{1, 2, 3, 4\}$, with observed category 1 merging complete categories 1 and 2. Since we assume the data is i.i.d., given $y$ the only remaining randomness in the complete data is how the $y_1$ observations in observed category 1 split between complete categories 1 and 2, and

$$
\begin{aligned}
Q(\theta \mid \theta^{(t)}) &= \mathbb{E}_{\theta^{(t)}}\!\left[(X_2 + X_5)\log\theta + (X_3 + X_4)\log(1 - \theta) \mid y\right] + c \\
&= \left(\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + y_4\right)\log\theta + \left(y_2 + y_3\right)\log(1 - \theta) + c \\
&= \left(y_1\,\frac{\theta^{(t)}}{2 + \theta^{(t)}} + y_4\right)\log\theta + \left(y_2 + y_3\right)\log(1 - \theta) + c
\end{aligned}
$$

where $c$ collects terms that do not depend on $\theta$.

For the second and third lines, note that $X_3 = y_2$, $X_4 = y_3$, and $X_5 = y_4$ are fully determined by $y$, and that each of the $y_1$ observations in observed category 1 came from complete category 2 independently with probability $\frac{\theta^{(t)}/4}{1/2 + \theta^{(t)}/4} = \frac{\theta^{(t)}}{2 + \theta^{(t)}}$, so that

$$X_2 \mid y,\ \theta^{(t)} \sim \text{Binomial}\!\left(y_1,\ \frac{\theta^{(t)}}{2 + \theta^{(t)}}\right) \quad\Longrightarrow\quad \mathbb{E}_{\theta^{(t)}}[X_2 \mid y] = y_1\,\frac{\theta^{(t)}}{2 + \theta^{(t)}}$$

In the M step, we would maximize the expected log likelihood of the complete data using the values we found in the E step:

$$\theta^{(t+1)} = \arg\max_{\theta}\ \left(\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + y_4\right)\log\theta + \left(y_2 + y_3\right)\log(1 - \theta)$$

Taking the derivative and setting it to 0 gives us:

$$\theta^{(t+1)} = \frac{\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + y_4}{\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + y_2 + y_3 + y_4}$$

Thus, we would set $\theta^{(t+1)}$ to this value and iterate. Note that the equation for $\theta^{(t+1)}$ is exactly what we had for the complete data MLE, but taking expectations of the counts given the current parameter $\theta^{(t)}$ instead of using the unobserved values:

$$\hat\theta = \frac{x_2 + x_5}{x_2 + x_3 + x_4 + x_5} \qquad\longleftrightarrow\qquad \theta^{(t+1)} = \frac{\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + \mathbb{E}_{\theta^{(t)}}[X_5 \mid y]}{\mathbb{E}_{\theta^{(t)}}[X_2 \mid y] + \mathbb{E}_{\theta^{(t)}}[X_3 \mid y] + \mathbb{E}_{\theta^{(t)}}[X_4 \mid y] + \mathbb{E}_{\theta^{(t)}}[X_5 \mid y]}$$
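Putting the E and M steps for the running example together gives a very short program, in the same spirit as the generic loop above but inlined. This is a minimal sketch with hypothetical observed counts; since the update is exact, the observed data log likelihood printed at each iteration should never decrease, matching the proof in the previous section.

```python
import numpy as np

# Hypothetical observed counts y = (y1, y2, y3, y4), for illustration only.
y = np.array([60.0, 15.0, 10.0, 15.0])

def observed_loglik(theta, y):
    """Observed-data log-likelihood, up to an additive constant."""
    return (y[0] * np.log(0.5 + theta / 4)
            + (y[1] + y[2]) * np.log((1 - theta) / 4)
            + y[3] * np.log(theta / 4))

def e_step(theta, y):
    """Expected count of the hidden cell x2 given y and the current theta."""
    return y[0] * theta / (2 + theta)

def m_step(x2, y):
    """Closed-form maximizer of the expected complete-data log-likelihood."""
    return (x2 + y[3]) / (x2 + y[1] + y[2] + y[3])

theta = 0.5  # arbitrary starting value
for t in range(10):
    theta = m_step(e_step(theta, y), y)
    print(t, round(theta, 6), round(observed_loglik(theta, y), 4))
```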

This is an illustration of when EM is used: when the observed data likelihood is annoying, but the complete data likelihood is nice.

Written on May 4, 2015