Expectation Maximization
EM is a broad topic. I hope to just give a brief but understandable introduction to it here. This material is borrowed heavily from a class I took, taught by Mathias Drton.
The setup
First, let's assume we have some complete data $X$, for which we know how to do Maximum Likelihood Estimation for some parameters $\theta$. The complete data likelihood is then:

$$L(\theta; x) = p_\theta(x).$$
Now assume we don't observe this wonderful complete data, but only some transformation $Y = T(X)$. Then, the observed data likelihood is computed by summing (or integrating) the complete data likelihood over all $x$ that map to the observed $y$:

$$L(\theta; y) = \int_{\{x \,:\, T(x) = y\}} p_\theta(x)\, dx \qquad \text{(a sum in the discrete case)}.$$
Let's take a very simple running example. Let's say we have some complete data $X = (X_1, \dots, X_k)$ such that:

$$X \sim \text{Multinomial}\big(n,\, (p_1(\theta), \dots, p_k(\theta))\big).$$

That is, $X$ is a vector of category counts distributed according to a multinomial distribution whose cell probabilities $p_1(\theta), \dots, p_k(\theta)$ depend on a parameter $\theta$.

We know that estimating $\theta$ in this case is simple. We just take the derivative of the log likelihood and set it equal to 0:

$$\ell(\theta; x) = \log L(\theta; x) = \sum_{i=1}^{k} x_i \log p_i(\theta) + \text{const}, \qquad \frac{d\,\ell(\theta; x)}{d\theta} = \sum_{i=1}^{k} x_i\, \frac{p_i'(\theta)}{p_i(\theta)} = 0.$$

Now, let's say that instead of observing $X$, we observe $Y$, such that some of the categories of $X$ are collapsed together:

$$Y_j = \sum_{i \in A_j} X_i, \qquad j = 1, \dots, m,$$

where the sets $A_1, \dots, A_m$ partition $\{1, \dots, k\}$ (so $m < k$). Now, the likelihood is not so easy to optimize in closed form (try it):

$$L(\theta; y) \;\propto\; \prod_{j=1}^{m} \Big( \sum_{i \in A_j} p_i(\theta) \Big)^{y_j}.$$
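To make this concrete, here is a minimal sketch in Python using one classic parameterization. This particular choice is mine, for illustration only, and is not necessarily the example used in the class: five complete-data cells with probabilities $\big(\tfrac12, \tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4}\big)$, with the first two cells merged in the observed data. The complete-data MLE then has a closed form, while the observed-data log-likelihood is maximized numerically below; the counts are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical cell probabilities (an illustrative choice, not necessarily the one from the class):
# complete data has five cells with p(theta) = (1/2, theta/4, (1-theta)/4, (1-theta)/4, theta/4),
# and the observed data merges the first two cells.
def observed_log_lik(theta, y):
    q = np.array([0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4])
    return np.sum(y * np.log(q))

x = np.array([60, 20, 15, 10, 25])             # made-up complete-data counts
y = np.array([x[0] + x[1], x[2], x[3], x[4]])  # the coarser counts we actually observe

# Complete-data MLE: setting the score to zero gives a closed form,
# theta_hat = (x2 + x5) / (x2 + x3 + x4 + x5).
theta_complete = (x[1] + x[4]) / (x[1] + x[2] + x[3] + x[4])

# Observed-data MLE: the score equation is no longer tidy, so maximize numerically.
res = minimize_scalar(lambda t: -observed_log_lik(t, y),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")

print("complete-data MLE:", theta_complete)
print("observed-data MLE (numerical):", res.x)
```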
The algorithm
The EM algorithm works as follows. Starting from an initial guess $\theta^{(0)}$, for $t = 0, 1, 2, \dots$:

- E-step - Evaluate $Q(\theta \mid \theta^{(t)}) = E_{\theta^{(t)}}\!\left[\log p_\theta(X) \mid Y = y\right]$, the expected complete data log likelihood, where the expectation is over $X \mid Y = y$ under the current estimate $\theta^{(t)}$.
- M-step - Set $\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$.
- Repeat until some convergence criterion is satisfied (e.g. $\lVert\theta^{(t+1)} - \theta^{(t)}\rVert < \epsilon$); a generic sketch of this loop follows the list.
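Here is a generic sketch of that loop, assuming the E-step and M-step are supplied as functions; this is my own schematic and is not tied to any particular model.

```python
import numpy as np

def em(y, theta0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM loop (schematic).

    e_step(theta, y) should return whatever sufficient statistics the M-step
    needs (e.g. expected complete-data counts), and m_step(stats) should return
    the theta maximizing the expected complete-data log likelihood.
    """
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(theta, y)      # E-step: expectations under the current theta
        theta_new = m_step(stats)     # M-step: maximize the expected complete-data log likelihood
        if np.max(np.abs(np.atleast_1d(theta_new) - np.atleast_1d(theta))) < tol:
            return theta_new          # convergence criterion: the parameters stop moving
        theta = theta_new
    return theta
```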
Now, what we want to prove is that the log-likelihood does not decrease at each iteration:

$$\ell(\theta^{(t+1)}; y) \;\geq\; \ell(\theta^{(t)}; y), \qquad \text{where } \ell(\theta; y) = \log p_\theta(y).$$
Note that since $Y = T(X)$ is a deterministic function of $X$, for any $x$ with $T(x) = y$ we have $p_\theta(x, y) = p_\theta(x)$, and so $p_\theta(x) = p_\theta(x \mid y)\, p_\theta(y)$.

Or equivalently, we have:

$$\log p_\theta(y) = \log p_\theta(x) - \log p_\theta(x \mid y).$$

Thus, we can rewrite the claim as:

$$\log p_{\theta^{(t+1)}}(x) - \log p_{\theta^{(t+1)}}(x \mid y) \;\geq\; \log p_{\theta^{(t)}}(x) - \log p_{\theta^{(t)}}(x \mid y).$$

Taking expectations over $X \mid Y = y$ under $\theta^{(t)}$ on both sides (this changes nothing, since each side equals the corresponding $\log p_\theta(y)$ and so does not depend on $x$):

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) - E_{\theta^{(t)}}\!\left[\log p_{\theta^{(t+1)}}(X \mid y) \,\middle|\, y\right] \;\geq\; Q(\theta^{(t)} \mid \theta^{(t)}) - E_{\theta^{(t)}}\!\left[\log p_{\theta^{(t)}}(X \mid y) \,\middle|\, y\right].$$

The two $Q$ terms are exactly the quantity evaluated in the E-step, and due to the M-step the left one has to be greater than or equal to the right one: $Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$. It therefore suffices to show the following inequality for the remaining terms:

$$E_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; 0.$$
Now we use Jensen's inequality, which states that for a convex function $\varphi$ and a random variable $Z$:

$$\varphi\big(E[Z]\big) \;\leq\; E\big[\varphi(Z)\big].$$

Since $\log$ is a concave function, $-\log$ is a convex function, and thus we can write:

$$E_{\theta^{(t)}}\!\left[\log \frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] \;\leq\; \log E_{\theta^{(t)}}\!\left[\frac{p_{\theta^{(t+1)}}(X \mid y)}{p_{\theta^{(t)}}(X \mid y)} \,\middle|\, y\right] = \log \int p_{\theta^{(t+1)}}(x \mid y)\, dx = \log 1 = 0.$$
Thus, we have proved that the log likelihood does not decrease at each iteration.
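As a quick numerical sanity check of the Jensen step, here is a tiny sketch with a toy positive random variable of my own choosing (nothing from the original material): the expectation of the log should never exceed the log of the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=100_000)  # any positive random variable works here

# Jensen with the concave log: E[log Z] <= log E[Z].
print(f"E[log Z] = {np.mean(np.log(z)):.4f}  <=  log E[Z] = {np.log(np.mean(z)):.4f}")
```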
Going back to the example
In the E step, we would evaluate the expected value of $\log p_\theta(X)$ given $\theta^{(t)}$ (a current estimate of $\theta$) and $Y = y$. Since $\log p_\theta(x) = \sum_i x_i \log p_i(\theta) + \text{const}$, this amounts to computing the expected counts $E_{\theta^{(t)}}[X_i \mid Y = y]$. Let's say each observation in the complete data has a category in $\{1, \dots, k\}$ and each observation in the observed data has a category in $\{1, \dots, m\}$, where observed category $j$ corresponds to the merged set $A_j$. Since we assume the data are i.i.d.,

$$E_{\theta^{(t)}}[X_i \mid Y = y] \;=\; y_j \cdot \frac{p_i(\theta^{(t)})}{\sum_{i' \in A_j} p_{i'}(\theta^{(t)})} \qquad \text{for } i \in A_j.$$

For this formula, note that each of the $y_j$ observations falling in observed category $j$ lands in complete-data category $i \in A_j$ with probability $p_i(\theta^{(t)}) \big/ \sum_{i' \in A_j} p_{i'}(\theta^{(t)})$, and that $X_i \mid Y = y$ therefore follows a binomial distribution with this success probability.
In the M step, we would maximize the expected log likelihood of the complete data using the values we found in the E step:

$$Q(\theta \mid \theta^{(t)}) \;=\; \sum_{i=1}^{k} E_{\theta^{(t)}}[X_i \mid y]\, \log p_i(\theta) + \text{const}.$$

Taking the derivative and setting it to 0 gives us:

$$\frac{d}{d\theta}\, Q(\theta \mid \theta^{(t)}) \;=\; \sum_{i=1}^{k} E_{\theta^{(t)}}[X_i \mid y]\, \frac{p_i'(\theta)}{p_i(\theta)} \;=\; 0.$$

Thus, we would set $\theta^{(t+1)}$ to the solution of this equation and iterate. Note that the equation for $\theta^{(t+1)}$ is exactly what we had for the complete likelihood, but taking expectations given the current parameter instead of using the $x_i$ values: the counts $x_i$ are simply replaced by the expected counts $E_{\theta^{(t)}}[X_i \mid y]$.
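Putting the two steps together for the same hypothetical parameterization as in the earlier sketch (first two cells merged, with probabilities $\tfrac12$ and $\tfrac{\theta}{4}$), here is a sketch of the full iteration with the same made-up observed counts; the printed observed-data log-likelihoods should be non-decreasing, as the argument above promises.

```python
import numpy as np

def e_step(theta, y):
    # Split the merged count y[0] between the two hidden cells in proportion
    # to their current probabilities 1/2 and theta/4.
    x2 = y[0] * (theta / 4) / (0.5 + theta / 4)
    # Expected complete-data counts under the current theta.
    return np.array([y[0] - x2, x2, y[1], y[2], y[3]])

def m_step(ex):
    # Same formula as the complete-data MLE, with counts replaced by expected counts.
    return (ex[1] + ex[4]) / (ex[1] + ex[2] + ex[3] + ex[4])

def observed_log_lik(theta, y):
    q = np.array([0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4])
    return np.sum(y * np.log(q))

y = np.array([80, 15, 10, 25])   # made-up observed counts, as before
theta = 0.5                       # arbitrary starting value
for t in range(10):
    theta = m_step(e_step(theta, y))
    print(f"iteration {t + 1}: theta = {theta:.6f}, log-lik = {observed_log_lik(theta, y):.6f}")
```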
This is an illustration of when EM is used: when the observed data likelihood is annoying, but the complete data likelihood is nice.