Introduction |
Thu, Sep 30
Intro to convexity and optimization
LV Book: 1.1-1.5
Homework #0
|
- Convex sets, convex functions, convex optimization
- If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then local optimality
implies global optimality: if $\nabla f(x) = 0$, then $f(y) \geq f(x)$ for all $y \in \mathbb{R}^n$.
-
This holds because the following are equivalent:
- $f : \mathbb{R}^n \to \mathbb{R}$ convex
- $f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle$ for all $x,y \in \mathbb{R}^n$.
-
If $f$ is $\mathcal{C}^2$ (i.e., it has continuous second-order partial derivatives), then the
Hessian matrix $\nabla^2 f(x)$ is defined for every $x \in \mathbb{R}^n$: $(\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$.
-
To check convexity, it is often useful to employ the following:
Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}^2$, then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is
positive semidefinite for all $x \in \mathbb{R}^n$.
-
As a simple example, consider the two-variate function $f(u,v) = u \log(u/v)$,
defined on the positive orthant $u,v > 0$.
Then we have
$$\nabla^2 f(u,v) = \begin{pmatrix} 1/u & -1/v \\ -1/v & u/v^2 \end{pmatrix}.$$
The trace of this matrix is $1/u + u/v^2 > 0$ for $u,v > 0$.
The determinant of this matrix is $1/v^2 - 1/v^2 = 0$ for $u,v > 0$.
Therefore if $\lambda_1,\lambda_2$ are the eigenvalues of $\nabla^2 f(u,v)$,
it holds that $\lambda_1 + \lambda_2 > 0$ and $\lambda_1 \lambda_2 = 0$.
The only possibility is that both eigenvalues are nonnegative,
meaning that $\nabla^2 f(u,v)$ is positive semidefinite on the positive orthant.
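As a quick numerical sanity check, here is a small numpy sketch (our own illustration; the Hessian entries are the ones computed above) that samples points from the positive orthant and verifies that the Hessian is PSD:

```python
import numpy as np

# Hessian of f(u, v) = u * log(u / v), using f_uu = 1/u, f_uv = -1/v, f_vv = u/v^2.
def hessian(u, v):
    return np.array([[1.0 / u, -1.0 / v],
                     [-1.0 / v, u / v**2]])

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.uniform(0.1, 10.0, size=2)
    # both eigenvalues should be nonnegative (up to roundoff)
    assert np.linalg.eigvalsh(hessian(u, v)).min() >= -1e-9
print("Hessian numerically PSD at 1000 random points")
```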
|
Tue, Oct 05
Gradient descent I
LV Book: 2.1-2.3
|
-
Gradient descent: $f : \mathbb{R}^n \to \mathbb{R}$, parameters $\eta, \epsilon > 0$, initial point $x^{(0)} \in \mathbb{R}^n$.
- If $\|\nabla f(x^{(k)})\| \leq \epsilon$, stop and output $x^{(k)}$.
- Else update $x^{(k+1)} = x^{(k)} - \eta\, \nabla f(x^{(k)})$.
-
A function $f$ has $L$-Lipschitz gradients if $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y \in \mathbb{R}^n$.
-
The following are equivalent for convex $f$:
- $f$ has $L$-Lipschitz gradients.
- $\langle \nabla f(x) - \nabla f(y), x - y \rangle \leq L \|x - y\|^2$ for all $x, y \in \mathbb{R}^n$.
-
For all $x, y \in \mathbb{R}^n$, $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2$.
-
Theorem: If $f$ has $L$-Lipschitz gradients, then gradient descent with $\eta = 1/L$ stops within $O\big(\tfrac{L\,(f(x^{(0)}) - \min f)}{\epsilon^2}\big)$ steps.
Note that in this case we have no guarantee of optimality, only that $\|\nabla f(x^{(k)})\| \leq \epsilon$.
-
Theorem: If $f$ is convex with $L$-Lipschitz gradients, then gradient descent with $\eta = 1/L$ satisfies $f(x^{(T)}) - \min f \leq \frac{L \|x^{(0)} - x^\star\|^2}{2T}$, where $x^\star$ is any minimizer of $f$.
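A minimal numpy sketch of the gradient descent loop above (the least-squares instance and the choice $\eta = 1/L$ are our own illustration, matching the theorem's setting):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, eps, max_iters=100_000):
    """Run x_{k+1} = x_k - eta * grad_f(x_k) until ||grad_f(x_k)|| <= eps."""
    x = x0
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - eta * g
    return x

# Example: f(x) = 0.5 * ||Ax - b||^2 has L-Lipschitz gradients with L = ||A^T A||.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(5), eta=1.0 / L, eps=1e-8)
print(np.linalg.norm(A.T @ (A @ x - b)))  # <= 1e-8
```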
|
Thu, Oct 07
Gradient descent II
LV Book: 2.3-2.7
Homework #1
|
-
A function $f$ is $\alpha$-strongly convex if for any $x, y \in \mathbb{R}^n$, $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2$.
-
For $f \in \mathcal{C}^2$ and $\alpha > 0$, the following are equivalent:
-
$f$ is $\alpha$-strongly convex
-
For all $x, y \in \mathbb{R}^n$ and $\lambda \in [0,1]$, it holds that $f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) - \frac{\alpha}{2} \lambda (1-\lambda) \|x - y\|^2$.
-
$\nabla^2 f(x) \succeq \alpha I$ for all $x \in \mathbb{R}^n$.
-
Theorem: Suppose that $f$ is $\alpha$-strongly convex and has $L$-Lipschitz gradients. Then gradient descent with $\eta = 1/L$
satisfies $\|x^{(T)} - x^\star\|^2 \leq \big(1 - \tfrac{\alpha}{L}\big)^T \|x^{(0)} - x^\star\|^2$.
-
Gradient descent is a discretization of the continuous algorithm $\dot{x}(t) = -\nabla f(x(t))$ (gradient flow).
Note that: $\frac{d}{dt} f(x(t)) = \langle \nabla f(x(t)), \dot{x}(t) \rangle = -\|\nabla f(x(t))\|^2$.
And if $f$ is $\alpha$-strongly convex, then $\|\nabla f(x)\|^2 \geq 2\alpha \big(f(x) - \min f\big)$,
hence $f(x(t)) - \min f \leq e^{-2\alpha t} \big(f(x(0)) - \min f\big)$.
-
Line search and the Wolfe conditions (Sec 2.5)
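A minimal sketch of backtracking line search (the constants $c_1, c_2$ are conventional choices; the helper name is ours). It enforces the Armijo (sufficient decrease) condition and reports whether the curvature condition, the second Wolfe condition, holds at the accepted step; a full Wolfe search would also enlarge $t$ when the curvature condition fails:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, c1=1e-4, c2=0.9, t=1.0, beta=0.5):
    """Shrink t until f(x + t d) <= f(x) + c1 * t * <grad f(x), d> (Armijo),
    then check the curvature condition <grad f(x + t d), d> >= c2 * <grad f(x), d>."""
    fx = f(x)
    slope = grad_f(x) @ d          # d should be a descent direction (slope < 0)
    while f(x + t * d) > fx + c1 * t * slope:
        t *= beta
    curvature_ok = grad_f(x + t * d) @ d >= c2 * slope
    return t, curvature_ok
```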
|
Elimination |
Tue, Oct 12
Cutting plane methods, the Ellipsoid algorithm
LV Book: 3.1-3.2
|
-
The hyperplane separation theorem: If $K \subseteq \mathbb{R}^n$ is a closed convex set
and $x_0 \notin K$, then there is a direction $v \in \mathbb{R}^n$ such that
$\langle v, x_0 \rangle > \sup_{y \in K} \langle v, y \rangle$.
Said differently, there is an $(n-1)$-dimensional hyperplane
that separates $x_0$ from $K$.
- Cutting plane framework
- Affine geometry of ellipsoids
- Ellipsoid algorithm and analysis
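One central-cut step of the ellipsoid update, using the standard formulas for the new center and shape matrix (a sketch; assumes dimension $n \geq 2$, and the function name is ours):

```python
import numpy as np

def ellipsoid_step(c, Q, g):
    """One central-cut update of E = {x : (x - c)^T Q^{-1} (x - c) <= 1},
    given a direction g with the feasible set contained in {x : <g, x> <= <g, c>}."""
    n = len(c)                       # requires n >= 2
    Qg = Q @ g
    denom = np.sqrt(g @ Qg)
    c_new = c - (1.0 / (n + 1)) * Qg / denom
    Q_new = (n**2 / (n**2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(Qg, Qg) / denom**2)
    return c_new, Q_new              # vol shrinks by a factor ~ exp(-1/(2(n+1)))
```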
|
Thu, Oct 14
NO CLASS
|
|
Tue, Oct 19
From volume to function value
LV Book: 3.2-3.3
|
-
Using ellipsoid to minimize non-smooth convex functions
-
Volume shrinkage yields small function value
The main result here is that if we are given a convex function $f : \mathbb{R}^n \to \mathbb{R}$
and a bounding box $B$ containing some minimizer $x^\star$,
then after $T$ iterations of the ellipsoid algorithm, we obtain a point $\hat{x}$ satisfying
$$f(\hat{x}) - f(x^\star) \leq e^{-T/(2n(n+1))} \Big( \max_{x \in B} f(x) - f(x^\star) \Big).$$
-
Application to Linear Programming
A linear program is an optimization of the form: Given inputs $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$,
$$\max \{ \langle c, x \rangle : Ax \leq b \}.$$
The feasible region is the set $P = \{ x \in \mathbb{R}^n : Ax \leq b \}$. Note that $Ax \leq b$ is shorthand
for the linear inequalities $\langle a_i, x \rangle \leq b_i$ for $i = 1, \ldots, m$, where
$a_1, \ldots, a_m$ are the rows of $A$.
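A separation oracle for the feasible region $P = \{x : Ax \leq b\}$, as needed by cutting-plane methods, just scans for a violated inequality (a small sketch; the function name is ours):

```python
import numpy as np

def polytope_separation_oracle(A, b, x):
    """Return None if Ax <= b; otherwise a violated row a_i, which gives the
    separating hyperplane <a_i, y> <= b_i < <a_i, x> for all y in P."""
    viol = A @ x - b
    i = int(np.argmax(viol))
    return None if viol[i] <= 0 else A[i]
```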
|
Thu, Oct 21
Center of gravity
LV Book: 3.4
Homework #2
|
-
Exact center of gravity converges to an $\varepsilon$-approximate minimum
in $O(n \log(1/\varepsilon))$ iterations.
Compare this to the $O(n^2 \log(1/\varepsilon))$ iterations used by the Ellipsoid Algorithm.
-
Analysis using Grünbaum's Theorem: If $K \subseteq \mathbb{R}^n$ is a compact, convex set and $z$ is
the center of gravity of $K$:
$$z = \frac{1}{\mathrm{vol}(K)} \int_K x \, dx,$$
then for any halfspace $H$ containing $z$, it holds that
$$\mathrm{vol}(K \cap H) \geq \frac{1}{e}\, \mathrm{vol}(K).$$
-
Approximate center of gravity suffices: Robust Grünbaum Theorem
-
Logconcave distributions
A measure $\mu$ on $\mathbb{R}^n$ is log-concave if it holds that for all measurable sets $A, B \subseteq \mathbb{R}^n$
and $\lambda \in [0,1]$, we have
$$\mu(\lambda A + (1-\lambda) B) \geq \mu(A)^{\lambda}\, \mu(B)^{1-\lambda}.$$
Every log-concave measure with full-dimensional support arises from a density $p : \mathbb{R}^n \to [0,\infty)$ for which
$d\mu(x) = p(x)\,dx$, and such that the function $\log p$ is concave.
Equivalently, $p(x) = e^{-V(x)}$ for some convex function $V : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$.
-
Isotropic distributions
A probability distribution $\mu$ on $\mathbb{R}^n$ is isotropic if $\mathbb{E}[X] = 0$ and $\mathbb{E}[X X^\top] = I$,
where $X$ is a random vector with law $\mu$.
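A toy sketch of the center-of-gravity method in two dimensions, estimating each centroid by rejection sampling from the bounding box rather than exact integration (the objective, sample sizes, and iteration count are all our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f_grad = lambda z: 2 * np.array([z[0] - 0.3, z[1] + 0.2])  # gradient of our toy f
cuts = []  # kept halfspaces {x : <g, x - z> <= 0}, each containing the minimizer

def in_region(x):
    return all(g @ (x - z) <= 0 for g, z in cuts)

z = np.zeros(2)
for _ in range(10):
    pts = rng.uniform(-1, 1, size=(50_000, 2))
    pts = pts[np.array([in_region(p) for p in pts])]
    if len(pts) < 50:            # too few surviving samples to trust the estimate
        break
    z = pts.mean(axis=0)         # approximate center of gravity of the region
    cuts.append((f_grad(z), z))  # gradient cut through z
print(z)  # roughly (0.3, -0.2), the minimizer
```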
|
Reductions |
Tue, Oct 26
Oracles and equivalences
LV Book: 4.1, 4.6
|
-
Different oracles for accessing a convex set
-
Oracles for accessing a convex function $f$: Evaluation oracle, subgradient oracle
-
The convex conjugate of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is defined by
$$f^*(y) = \sup_{x \in \mathbb{R}^n} \, \langle x, y \rangle - f(x).$$
This function is convex, as it is the maximum of the family of affine functions $y \mapsto \langle x, y \rangle - f(x)$,
one for every fixed $x$.
-
We can compute a subgradient of $f^*$ at $y$ via a maximizer $x^\star \in \operatorname{argmax}_x \big( \langle x, y \rangle - f(x) \big)$: any such $x^\star$ satisfies $x^\star \in \partial f^*(y)$.
-
The epigraph of a function $f$ is the set
$$\mathrm{epi}(f) = \{ (x, t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x) \}.$$
This is the set of points lying above the graph of $f$. The function $f$
is convex if and only if $\mathrm{epi}(f)$ is a convex set.
-
The involution property: For any convex function $f$ whose epigraph is closed,
it holds that $(f^*)^* = f$.
-
We can consider the set of all linear lower bounds on a function $f$: All those pairs
$(y, c)$ with $y \in \mathbb{R}^n$ and $c \in \mathbb{R}$ such that
$$f(x) \geq \langle y, x \rangle - c \quad \text{for all } x \in \mathbb{R}^n.$$
For a fixed $y$, the value of $c$ giving the best lower bound is precisely $c = f^*(y)$.
Similarly, at a fixed point $x$, we can consider all the lower bounds $\langle y, x \rangle - f^*(y)$.
By definition, the best such lower bound is given by $\sup_y \langle y, x \rangle - f^*(y) = f^{**}(x)$, hence $f(x) \geq f^{**}(x)$.
The involution property ensures that when the epigraph of $f$ is closed, this family of lower bounds is
tight in the sense that $f^{**}(x) = f(x)$ for all $x$.
-
Computational equivalences of the subgradient oracles for $f$ and $f^*$ using cutting planes and the involution property
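The involution property can be checked numerically with a brute-force discrete Legendre transform; here is a sketch for $f(x) = |x|$ on a grid (the grids and the helper name are our own choices):

```python
import numpy as np

# Numerically check the involution f** = f for the convex function f(x) = |x|.
xs = np.linspace(-2, 2, 401)
f = np.abs(xs)

def conjugate(vals, grid, dual_grid):
    # f*(y) = sup_x <x, y> - f(x), by brute force over the grid
    return np.max(np.outer(dual_grid, grid) - vals[None, :], axis=1)

ys = np.linspace(-3, 3, 601)
f_star = conjugate(f, xs, ys)            # equals 0 on [-1,1], grows linearly outside
f_star_star = conjugate(f_star, ys, xs)  # should recover f on the grid
print(np.max(np.abs(f_star_star - f)))   # small, up to grid discretization
```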
|
Thu, Oct 28
Duality
LV Book: 4.2-4.4
|
-
Sion’s Minimax Theorem
Definitions of quasi-convex, quasi-concave, and lower/upper semicontinuous functions.
-
Application to a function decomposed as $f(x) = \max_{y \in Y} g(x, y)$, where it gives $\min_{x \in X} \max_{y \in Y} g(x, y) = \max_{y \in Y} \min_{x \in X} g(x, y)$.
- Linear programming duality
-
SDP duality: Equivalence of the two optimizations
$$\max \Big\{ \langle C, X \rangle : \langle A_i, X \rangle = b_i \ \forall i,\ X \succeq 0 \Big\}
\quad \text{and} \quad
\min \Big\{ \langle b, y \rangle : \textstyle\sum_i y_i A_i - C \succeq 0 \Big\}.$$
Here we take
$$g(X, y) = \langle C, X \rangle + \sum_i y_i \big( b_i - \langle A_i, X \rangle \big) - \chi_S(X),$$
where we recall that for a subset $S$, we define $\chi_S(x) = 0$ if $x \in S$ and $\chi_S(x) = +\infty$ if $x \notin S$.
And then we take:
$$S = \{ X : X \succeq 0 \},$$
and applying the minimax theorem, $\max_X \min_y g(X, y)$ recovers the primal while $\min_y \max_X g(X, y)$ recovers the dual.
-
Separation oracle for the PSD constraint $\sum_i y_i A_i - C \succeq 0$.
Define $M(y) = \sum_i y_i A_i - C$.
Note that $y$ is the underlying variable, so the feasible region
corresponding to this constraint is $\{ y : M(y) \succeq 0 \}$.
Compute the minimum eigenvector $v$ of $M(y)$. If the constraint is violated
at $y$, it must be that
$$\langle v, M(y)\, v \rangle < 0.$$
On the other hand, for any feasible $y'$ and any vector $v$, it holds that
$$\langle v, M(y')\, v \rangle \geq 0.$$
Hence if we define the vector $w$ by $w_i = \langle v, A_i v \rangle$, then $w$ gives a hyperplane that separates $y$ from the feasible region, in the sense that
$$\langle w, y \rangle < \langle v, C v \rangle \leq \langle w, y' \rangle \quad \text{for all feasible } y'.$$
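A sketch of this separation oracle in numpy (the function name is ours; `As` is the list of matrices $A_i$):

```python
import numpy as np

def psd_separation_oracle(y, As, C):
    """Separation oracle for {y : sum_i y_i A_i - C is PSD}. Returns None if
    feasible; else (w, t) with <w, y> < t <= <w, y'> for every feasible y'."""
    M = sum(yi * Ai for yi, Ai in zip(y, As)) - C
    lam, V = np.linalg.eigh(M)
    if lam[0] >= 0:
        return None
    v = V[:, 0]                           # eigenvector of the minimum eigenvalue
    w = np.array([v @ Ai @ v for Ai in As])
    return w, v @ C @ v                   # note <w, y> = v^T M v + v^T C v < v^T C v
```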
|
Geometry and Optimization |
Tue, Nov 02
Projected subgradient descent
LV Book: 5.1
|
-
The collection of subgradients: $\partial f(x) = \{ g \in \mathbb{R}^n : f(y) \geq f(x) + \langle g, y - x \rangle \ \ \forall y \in \mathbb{R}^n \}$
-
If $f$ has continuous partial derivatives at $x$, then $\partial f(x) = \{ \nabla f(x) \}$.
-
Example: Show that if $f : \mathbb{R} \to \mathbb{R}$ is defined by $f(x) = |x|$, then $\partial f(0) = [-1, 1]$.
If you graph $f$, this corresponds to the fact that
all the lines through the origin of slope in the interval $[-1,1]$ lie below the graph of $f$.
-
Projected Subgradient Descent:
- $y^{(k+1)} = x^{(k)} - \eta\, g^{(k)}$ for some $g^{(k)} \in \partial f(x^{(k)})$ (subgradient step)
- $x^{(k+1)} = \mathrm{proj}_K(y^{(k+1)})$ (projection step)
After $T$ iterations, we output $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}$.
-
Pythagorean property for the Euclidean projection $\mathrm{proj}_K(y) = \operatorname{argmin}_{x \in K} \|x - y\|$: for every $x \in K$, $\|\mathrm{proj}_K(y) - x\| \leq \|y - x\|$.
-
Theorem: If $K \subseteq \mathbb{R}^n$ is closed and convex, and $f : K \to \mathbb{R}$
is a convex, $L$-Lipschitz function, then after $T$
iterations with an appropriate choice of step size $\eta$, the output $\bar{x}$ of the subgradient descent algorithm satisfies,
for any $x^\star \in K$, $f(\bar{x}) - f(x^\star) \leq \frac{L \|x^{(1)} - x^\star\|}{\sqrt{T}}$.
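A minimal sketch of projected subgradient descent; the example minimizes the 1-Lipschitz function $f(x) = \|x - p\|_2$ over the unit ball, for which the Euclidean projection is just rescaling (the instance is our own illustration):

```python
import numpy as np

def projected_subgradient(subgrad, proj, x0, eta, T):
    """Projected subgradient descent; returns the average iterate."""
    x, total = x0, np.zeros_like(x0)
    for _ in range(T):
        x = proj(x - eta * subgrad(x))
        total += x
    return total / T

p = np.array([2.0, 0.0])
subgrad = lambda x: (x - p) / np.linalg.norm(x - p)   # gradient of ||x - p||
proj = lambda y: y / max(1.0, np.linalg.norm(y))      # projection onto unit ball
xbar = projected_subgradient(subgrad, proj, np.zeros(2), eta=0.05, T=2000)
print(xbar)  # ~ (1, 0), the closest point of the ball to p
```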
|
Thu, Nov 04
Mirror maps and Bregman divergences
LV Book: 5.1.3-5.1.4
Homework #3
|
- Setup: $K \subseteq \mathbb{R}^n$ is closed and convex, $f : K \to \mathbb{R}$ is convex.
- Mirror map: A strongly convex function $\Phi : K \to \mathbb{R}$.
-
Associated Bregman divergence: $D_\Phi(x \,\|\, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y \rangle$
-
Bregman projection onto $K$: $\Pi^\Phi_K(y) = \operatorname{argmin}_{x \in K} D_\Phi(x \,\|\, y)$
-
Mirror descent iteration:
Find $y^{(k+1)}$ such that $\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta\, \nabla f(x^{(k)})$,
and then $x^{(k+1)} = \Pi^\Phi_K(y^{(k+1)})$.
-
Equivalently: $x^{(k+1)} = \operatorname{argmin}_{x \in K} \Big( \eta \langle \nabla f(x^{(k)}), x \rangle + D_\Phi(x \,\|\, x^{(k)}) \Big)$.
Recall that if $f$ is non-smooth, $\nabla f(x^{(k)})$ can be replaced by any subgradient of $f$ at $x^{(k)}$.
-
The map $\Phi$ is said to be $\alpha$-strongly convex on $K$ wrt a norm $\|\cdot\|$ if $\Phi(y) \geq \Phi(x) + \langle \nabla \Phi(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2$ for all $x, y \in K$.
-
Theorem: If $f$ is convex and $L$-Lipschitz wrt a norm $\|\cdot\|$,
and $\Phi$ is $\alpha$-strongly convex on $K$ wrt $\|\cdot\|$, then after $T$ iterations
mirror descent outputs the point $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}$,
and it satisfies
$$f(\bar{x}) - \min_{x \in K} f(x) \leq \frac{R^2}{\eta T} + \frac{\eta L^2}{2\alpha},$$
where $R^2 = \sup_{x \in K} D_\Phi(x \,\|\, x^{(1)})$
and $x^{(1)} \in K$ is the starting point.
-
If we choose $\eta = \frac{R}{L} \sqrt{\frac{2\alpha}{T}}$, then this bound is $R L \sqrt{\frac{2}{\alpha T}}$.
- Examples include the Euclidean ball with $\Phi(x) = \frac{1}{2} \|x\|_2^2$ and the probability simplex with the entropy map $\Phi(x) = \sum_i x_i \log x_i$.
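For the simplex with the entropy mirror map, the iteration becomes a multiplicative-weights update and the Bregman projection is plain normalization; a small sketch (the linear objective is our own example):

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta, T):
    """Mirror descent on the probability simplex with Phi(x) = sum_i x_i log x_i:
    the update is multiplicative, and the Bregman projection is normalization."""
    x, total = x0.copy(), np.zeros_like(x0)
    for _ in range(T):
        x = x * np.exp(-eta * grad(x))
        x = x / x.sum()                  # Bregman projection onto the simplex
        total += x
    return total / T

# Example: minimize <c, x> over the simplex; mass concentrates on argmin(c).
c = np.array([0.9, 0.1, 0.5])
x_bar = entropic_mirror_descent(lambda x: c, np.ones(3) / 3, eta=0.1, T=3000)
print(x_bar)  # ~ (0, 1, 0), up to the early iterates in the average
```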
|
Tue, Nov 09
Newton's Method
LV Book: 5.3
|
-
For a function $F : \mathbb{R}^n \to \mathbb{R}^n$, Newton’s method attempts to find a point $x$ where $F(x) = 0$.
-
This is done using the best linear approximation to $F$ at the current iterate $x^{(k)}$:
$F(x) \approx F(x^{(k)}) + DF(x^{(k)})\,(x - x^{(k)})$, where $DF(x^{(k)})$ is the Jacobian of $F$ at $x^{(k)}$.
-
We then set this approximation to zero to calculate the next iterate: $x^{(k+1)} = x^{(k)} - DF(x^{(k)})^{-1} F(x^{(k)})$.
-
We can (try to) find local minima of a function $f : \mathbb{R}^n \to \mathbb{R}$ by taking $F = \nabla f$.
In this case, the update rule takes the form
$$x^{(k+1)} = x^{(k)} - \big( \nabla^2 f(x^{(k)}) \big)^{-1} \nabla f(x^{(k)}).$$
Note that the Newton iterate has the nice feature that the algorithm is invariant under
affine transformations. Note also one of the major drawbacks: A single iterate requires
inverting the Hessian of $f$.
-
Suppose $p$ is a univariate polynomial of degree $n$ with real roots $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$
and $x^{(0)} > \lambda_1$. Then Newton’s method converges: $x^{(k)} \to \lambda_1$ and, moreover, one has the rate
$$x^{(k)} - \lambda_1 \leq \Big(1 - \frac{1}{n}\Big)^k \big( x^{(0)} - \lambda_1 \big).$$
-
In general, Newton’s method need not converge to a zero of $F$, but when it does, the rate of convergence is
(eventually) quadratic.
(Local quadratic convergence)
If $F$ is $\mathcal{C}^2$, and $F(x^\star) = 0$ for some $x^\star$ satisfying
$x^{(k)} \to x^\star$, and moreover $DF(x^\star)$ is invertible, then there is a number $C > 0$ such that for $k$
sufficiently large, $\|x^{(k+1)} - x^\star\| \leq C\, \|x^{(k)} - x^\star\|^2$.
-
Newton’s formula will always converge to a zero $x^\star$ as long as $x^{(0)}$ is sufficiently close to $x^\star$, provided $F$
and $DF$ are suitably nondegenerate around $x^\star$.
One can find here a proof that
Newton’s method converges quadratically
to a zero $x^\star$ as long as $\|x^{(0)} - x^\star\|$ is sufficiently small.
One can see the preceding link for a discussion of basins of attraction and failure cases.
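A minimal sketch of the Newton iteration for $F(x) = 0$; the example takes $F = \nabla f$ for the smooth convex function $f(x) = \sum_i (e^{x_i} - x_i)$ (our own illustration):

```python
import numpy as np

def newton(F, DF, x0, iters=20):
    """Newton's method for F(x) = 0: x <- x - DF(x)^{-1} F(x)."""
    x = x0
    for _ in range(iters):
        x = x - np.linalg.solve(DF(x), F(x))
    return x

# Minimize f(x) = sum_i exp(x_i) - x_i by solving grad f(x) = 0.
F = lambda x: np.exp(x) - 1.0       # gradient of f
DF = lambda x: np.diag(np.exp(x))   # Hessian of f
print(newton(F, DF, np.array([2.0, -1.5])))  # ~ (0, 0)
```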
|
Tue, Nov 16
Interior point methods I
LV Book: 5.4
|
Consider $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^n$, and
the corresponding linear program in slack form:
$$\min \big\{ \langle c, x \rangle : Ax = b,\ x \geq 0 \big\} \tag{P}$$
and its dual
$$\max \big\{ \langle b, y \rangle : A^\top y + s = c,\ s \geq 0 \big\}. \tag{D}$$
For a primal-feasible $x$ and dual-feasible $(y, s)$, we always have
$$\langle c, x \rangle - \langle b, y \rangle = \langle A^\top y + s, x \rangle - \langle Ax, y \rangle = \langle x, s \rangle \geq 0.$$
The duality gap for such a pair is the quantity $\langle x, s \rangle$.
Linear programming duality tells us that $\langle x, s \rangle = 0$ if and only if $(x, y, s)$ are an optimal primal-dual pair.
In the optimization (P), the difficult set of constraints is given by $x \geq 0$. We can seek to soften these
constraints via the use of a barrier. Consider the log-barrier: For some parameter $\mu > 0$,
$$\min \Big\{ \langle c, x \rangle - \mu \sum_{i=1}^n \log x_i \;:\; Ax = b \Big\}. \tag{P$_\mu$}$$
Note that any solution to (P$_\mu$) with bounded objective must satisfy $x > 0$ (because otherwise the objective
is $+\infty$).
Introducing a vector $y \in \mathbb{R}^m$ of Lagrangian multipliers for the constraints $Ax = b$, the (unique) optimum of (P$_\mu$) satisfies
$$c - \mu X^{-1} \mathbf{1} = A^\top y,$$
where $X = \mathrm{diag}(x_1, \ldots, x_n)$. Setting $s = \mu X^{-1} \mathbf{1}$, we can therefore characterize the optimizer as given by a triple $(x_\mu, y_\mu, s_\mu)$ satisfying the constraints
$$A x_\mu = b, \qquad A^\top y_\mu + s_\mu = c, \qquad (x_\mu)_i (s_\mu)_i = \mu \ \ \forall i, \qquad x_\mu, s_\mu > 0.$$
Note that $x_\mu$ and $(y_\mu, s_\mu)$ are primal- and dual-feasible, respectively, and
$$\langle x_\mu, s_\mu \rangle = n \mu.$$
Hence as $\mu \to 0$, we have $(x_\mu, y_\mu, s_\mu)$ converging to an optimal primal-dual pair.
The sequence $\{ (x_\mu, y_\mu, s_\mu) : \mu > 0 \}$ is called the central path, and interior point methods
solve (P) by approximately following this path.
The idea is to compute $(x_{\mu'}, y_{\mu'}, s_{\mu'})$ from $(x_\mu, y_\mu, s_\mu)$ for $\mu' = (1 - \delta)\mu$ with $\delta$ sufficiently small. (It turns out we can take $\delta \asymp 1/\sqrt{n}$.)
This computation is done via Newton’s method, and the key insight is that for $\delta$ small, we will be in a “basin of attraction”
and achieve local quadratic convergence. The end result is that we can solve a linear program within accuracy $\varepsilon$ by
solving about $\sqrt{n} \log(n/\varepsilon)$ linear systems.
Animation of the central path for some optimizations.
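For a toy LP, the central path can be computed in closed form from the conditions above; here min $x_1$ subject to $x_1 + x_2 = 1$, $x \geq 0$, where the conditions reduce to a scalar quadratic in $y$ (our own example, not from the book):

```python
import numpy as np

# Toy LP: min x1  s.t.  x1 + x2 = 1, x >= 0   (optimum x* = (0, 1), value 0).
# Central path: x_i s_i = mu,  y + s1 = 1,  y + s2 = 0,  x1 + x2 = 1.
# Substituting x_i = mu / s_i gives  y^2 + (2*mu - 1)*y - mu = 0.
for mu in [1.0, 0.1, 0.01, 0.001]:
    y = ((1 - 2 * mu) - np.sqrt((2 * mu - 1) ** 2 + 4 * mu)) / 2  # root with y < 0
    s = np.array([1.0 - y, -y])
    x = mu / s
    print(mu, x, y, x @ s)  # duality gap <x, s> = 2*mu -> 0, x -> (0, 1)
```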
|
Thu, Nov 18
Interior point methods II
LV Book: 5.4
|
Continuing as in the last lecture, our goal is to approximately follow the central path:
$$Ax = b, \qquad A^\top y + s = c, \qquad x_i s_i = \mu \ \ \forall i, \qquad x, s > 0.$$
Define the vector $w \in \mathbb{R}^n$ by $w_i = x_i s_i$. Our notion of distance to the
central path will be measured by $\|w - \mu \mathbf{1}\|$.
In order to decrease $\mu$ to $\mu' = (1 - \delta)\mu$, we will try to improve the approximation by finding
a vector $(\Delta x, \Delta y, \Delta s)$ such that
$$(x_i + \Delta x_i)(s_i + \Delta s_i) = \mu' \ \forall i, \qquad A(x + \Delta x) = b, \qquad A^\top (y + \Delta y) + (s + \Delta s) = c, \qquad x + \Delta x,\ s + \Delta s \geq 0.$$
To approximate this by a linear system, we will drop the quadratic term $\Delta x_i \Delta s_i$ from the first equality and ignore
the nonnegativity constraints, yielding
$$s_i \Delta x_i + x_i \Delta s_i = \mu' - w_i \ \forall i, \qquad A \Delta x = 0, \qquad A^\top \Delta y + \Delta s = 0.$$
Now we can see $(\Delta x, \Delta y, \Delta s)$ as the solution to the linear system
$$S \Delta x + X \Delta s = \mu' \mathbf{1} - w, \qquad A \Delta x = 0, \qquad A^\top \Delta y + \Delta s = 0,$$
where $X = \mathrm{diag}(x)$ and $S = \mathrm{diag}(s)$.
Solving this system shows that there is a projection matrix $P$ (the orthogonal projection onto the range of $D A^\top$, where $D = X^{1/2} S^{-1/2}$) such that
$$D^{-1} \Delta x = (I - P)\,(XS)^{-1/2} (\mu' \mathbf{1} - w) \qquad \text{and} \qquad D\, \Delta s = P\,(XS)^{-1/2} (\mu' \mathbf{1} - w).$$
If we now assume that $\|w - \mu \mathbf{1}\| \leq \mu/4$, then $\min_i w_i \geq \frac{3\mu}{4}$, hence
$$\|D^{-1} \Delta x\| \leq \|(XS)^{-1/2} (\mu' \mathbf{1} - w)\| \leq \frac{2}{\sqrt{3\mu}}\, \|\mu' \mathbf{1} - w\|,$$
and the same bound holds for $\|D \Delta s\|$. Therefore for $\|\mu' \mathbf{1} - w\| \leq \frac{3\mu}{8}$, we have $|\Delta x_i| < x_i$ and
$|\Delta s_i| < s_i$ for every $i$, meaning that the nonnegativity constraints are satisfied for $x + \Delta x$ and $s + \Delta s$.
Moreover, if we again assume $\|w - \mu \mathbf{1}\| \leq \mu/4$, then for $w'_i = (x_i + \Delta x_i)(s_i + \Delta s_i)$ we have
$$\|w' - \mu' \mathbf{1}\| = \|\Delta x \circ \Delta s\| = \|(D^{-1} \Delta x) \circ (D \Delta s)\|.$$
By Cauchy-Schwarz, we can bound this by
$$\|w' - \mu' \mathbf{1}\| \leq \|D^{-1} \Delta x\| \cdot \|D \Delta s\| \leq \frac{4}{3\mu}\, \|\mu' \mathbf{1} - w\|^2,$$
where in the last line we used the bound coming from $\min_i w_i \geq \frac{3\mu}{4}$.
One can see here the local quadratic convergence of the Newton step: We have decreased the error from $\|\mu' \mathbf{1} - w\|$ to $O\big(\|\mu' \mathbf{1} - w\|^2 / \mu\big)$.
For $\mu' = (1 - \delta)\mu$, this gives
$$\|\mu' \mathbf{1} - w\| \leq \|w - \mu \mathbf{1}\| + \delta \mu \sqrt{n} \leq \frac{\mu}{4} + \delta \mu \sqrt{n}.$$
One can now check that if $\delta = \frac{1}{8\sqrt{n}}$, then we can take
$$\|\mu' \mathbf{1} - w\| \leq \frac{3\mu}{8},$$
and then we have
$$\|w' - \mu' \mathbf{1}\| \leq \frac{4}{3\mu} \Big( \frac{3\mu}{8} \Big)^2 = \frac{3\mu}{16} \leq \frac{\mu'}{4},$$
meaning that we have decreased $\mu$ by a factor of $1 - \frac{1}{8\sqrt{n}}$ and maintained the invariant $\|w - \mu \mathbf{1}\| \leq \frac{\mu}{4}$.
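A sketch of a single path-following step assembled from the linear system above, using the standard elimination down to a normal-equations solve in $\Delta y$ (dense linear algebra; the function name is ours):

```python
import numpy as np

def path_following_step(A, x, y, s, mu, delta):
    """One short-step update: set mu' = (1 - delta) * mu and solve
       S dx + X ds = mu' 1 - x*s,   A dx = 0,   A^T dy + ds = 0."""
    mu_new = (1.0 - delta) * mu
    r = mu_new - x * s
    # Eliminate ds = -A^T dy, then dx = (r - x*ds)/s; A dx = 0 gives the
    # normal equations  A diag(x/s) A^T dy = -A (r/s).
    dy = np.linalg.solve(A @ np.diag(x / s) @ A.T, -A @ (r / s))
    ds = -A.T @ dy
    dx = (r - x * ds) / s
    # Starting near the central path with delta ~ 1/(8 sqrt(n)) maintains
    # the invariant ||x*s - mu 1|| <= mu/4, per the analysis above.
    return x + dx, y + dy, s + ds, mu_new
```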
|
Sparsification |
Tue, Nov 23
Subspace embeddings and sparse linear regression
LV Book: 6.1
Homework #4
|
-
Consider a linear regression problem specified by $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ with $m \gg n$.
Our goal is to minimize the classical least-squares error:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2.$$
-
We can calculate the optimizer exactly by examining the critical points of the gradient:
$$\nabla_x \|Ax - b\|_2^2 = 2 A^\top (Ax - b) = 0,$$
which has solution $x^\star = (A^\top A)^{-1} A^\top b$ (assuming $A$ has full column rank).
-
Note that $A^\top A$ is an $n \times n$ matrix, so we can compute its inverse in time $O(n^{\omega})$, where
$\omega \approx 2.37$ is the matrix multiplication exponent. But computing the product $A^\top A$ might take time $O(m n^{\omega - 1})$.
If $m \gg n$, this may dominate the running time.
-
We could also try to calculate $x^\star$ via gradient descent, leading to the “Richardson iteration:”
$$x^{(k+1)} = x^{(k)} - A^\top (A x^{(k)} - b).$$
As we will see in a moment, it holds that if we scale so that $\|A\|_{op} \leq 1$, then $x^{(k)} \to x^\star$.
-
Preconditioning. Note that for an invertible, symmetric matrix $M$, we can write $x^\star = M^{-1} \operatorname{argmin}_y \|A M^{-1} y - b\|_2^2$.
The corresponding Richardson iteration looks like
$$x^{(k+1)} = x^{(k)} - M^{-1} A^\top (A x^{(k)} - b),$$
and if we know that $\frac{1}{2} M \preceq A^\top A \preceq M$, then we obtain the exponential convergence:
$$\|x^{(k)} - x^\star\|_M \leq 2^{-k} \|x^{(0)} - x^\star\|_M.$$
-
Let us think of $W = \{ Ax : x \in \mathbb{R}^n \} \subseteq \mathbb{R}^m$, where $\dim(W) \leq n$. Then $\frac{1}{2} M \preceq A^\top A \preceq M$ for $M = (SA)^\top (SA)$
is the same as $\|w\|^2 \leq \|S w\|^2 \leq 2 \|w\|^2$ for all $w \in W$.
So to find a good preconditioner $M$, we can take $M = (SA)^\top (SA)$ for some $S \in \mathbb{R}^{d \times m}$ with $d \ll m$, where
$S$ is a linear map that approximately preserves the norms of
vectors in the subspace $W$. Note that $\dim(W) \leq n \ll m$.
-
In order for this to be efficient, it is useful if we can take the map $S$ to be “oblivious” to the matrix $A$.
While there will not be a single $S$ (with $d \ll m$) that preserves the norms in every $n$-dimensional subspace,
it turns out that we can choose $S$ at random with this property.
-
Oblivious subspace embeddings: A random linear map $S \in \mathbb{R}^{d \times m}$ is said to be a $(n, \varepsilon, \delta)$-OSE if it
holds that for every subspace $W \subseteq \mathbb{R}^m$ with $\dim(W) \leq n$,
$$\Pr\big[ (1-\varepsilon) \|w\| \leq \|S w\| \leq (1+\varepsilon) \|w\| \ \ \forall w \in W \big] \geq 1 - \delta,$$
where the probability is taken only over the choice of $S$.
-
It is known that if $S \in \mathbb{R}^{d \times m}$ has entries that are i.i.d. $N(0, 1/d)$ or i.i.d. uniform $\pm 1/\sqrt{d}$, then
$S$ is a $(n, \varepsilon, \delta)$-OSE for some choice $d \lesssim \frac{n + \log(1/\delta)}{\varepsilon^2}$.
This is not so useful for our application: Since this is a dense matrix, computing $SA$ is still expensive.
-
Research on sparse JL transforms shows that if $S \in \mathbb{R}^{d \times m}$ is chosen uniformly at random so that every column contains
exactly $s$ non-zero entries that are $\pm 1/\sqrt{s}$, then $S$ is a $(n, \varepsilon, \delta)$-OSE when we choose
$s \gtrsim \frac{\log(n/\delta)}{\varepsilon}$ and $d \gtrsim \frac{n \log(n/\delta)}{\varepsilon^2}$.
-
Using this in the Richardson iteration (see 6.1.3 in the LV book),
one obtains an algorithm for linear regression that runs in time $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega})$,
where $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$.
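A sketch of the sketch-and-precondition approach (simplified: a dense Gaussian $S$ instead of a sparse OSE, and $M = (SA)^\top (SA)$ applied via a QR factorization of $SA$; the dimensions are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 20
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

d = 20 * n                                    # sketch dimension, O(n) rows
S = rng.standard_normal((d, m)) / np.sqrt(d)  # dense Gaussian OSE (slow but simple)
_, R = np.linalg.qr(S @ A)                    # M = (SA)^T (SA) = R^T R

x = np.zeros(n)
for _ in range(100):                          # preconditioned Richardson iteration
    g = A.T @ (A @ x - b)
    x = x - np.linalg.solve(R.T @ R, g)       # in practice: two triangular solves
print(np.linalg.norm(A.T @ (A @ x - b)))      # ~ 0
```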
|
Tue, Nov 30
Stochastic gradient descent
LV Book 6.3
|
-
Let $\mathcal{D}$ be a distribution on convex functions $f : \mathbb{R}^n \to \mathbb{R}$, and suppose
that $F$ is the convex function given by $F(x) = \mathbb{E}_{f \sim \mathcal{D}}[f(x)]$.
As usual, our goal is to find a minimizer $x^\star \in \operatorname{argmin}_{x \in \mathbb{R}^n} F(x)$.
-
One example is where $F = \frac{1}{m} \sum_{i=1}^m f_i$. If we tried to solve this with
gradient descent, computing $\nabla F$ at some point $x$ would require $m$ gradient
computations $\nabla f_1(x), \ldots, \nabla f_m(x)$.
-
Stochastic gradient descent instead evaluates a single gradient at every step: sample $f_k \sim \mathcal{D}$ independently and update $x^{(k+1)} = x^{(k)} - \eta\, \nabla f_k(x^{(k)})$.
Analysis of SGD
-
We will require the following fact: If $f$ is convex with $L$-Lipschitz gradients, then
$$\|\nabla f(x) - \nabla f(y)\|^2 \leq 2L \big( f(x) - f(y) - \langle \nabla f(y), x - y \rangle \big).$$
In other words, when $f(x)$ is close to the first-order approximation of $f$ at $y$, it holds that
$\nabla f(x)$ and $\nabla f(y)$ are also close.
Taking $y = x^\star$ and $f \sim \mathcal{D}$ and then taking expectations over $f$ gives
\begin{equation}\label{eq:Fgrad}
\mathbb{E}\, \|\nabla f(x) - \nabla f(x^\star)\|^2 \leq 2L \big( F(x) - F(x^\star) \big),
\end{equation}
where we have used $\nabla F(x^\star) = 0$.
-
Theorem: Suppose that every $f$ in the support of $\mathcal{D}$ is convex with $L$-Lipschitz gradients, and that $F$ is $\alpha$-strongly-convex.
Define $\sigma^2 = \mathbb{E}\, \|\nabla f(x^\star)\|^2$.
Then for any step size $\eta \leq \frac{1}{4L}$, the SGD iterates satisfy, for every $T \geq 1$,
the following guarantees:
\begin{equation}\label{eq:sgd1}
\mathbb{E}\, \|x^{(T)} - x^\star\|^2 \leq (1 - \eta \alpha)^T \|x^{(0)} - x^\star\|^2 + \frac{2 \eta \sigma^2}{\alpha},
\end{equation}
\begin{equation}\label{eq:sgd2}
\frac{1}{T} \sum_{k=0}^{T-1} \mathbb{E}\big[ F(x^{(k)}) - F(x^\star) \big] \leq \frac{\|x^{(0)} - x^\star\|^2}{\eta T} + 2 \eta \sigma^2.
\end{equation}
-
Comparison with the empirical risk minimizer. To understand the quality of the SGD guarantees, suppose we sample
$f_1, \ldots, f_T \sim \mathcal{D}$ independently and then define
$$\hat{x} = \operatorname{argmin}_{x} \frac{1}{T} \sum_{i=1}^{T} f_i(x).$$
The Cramér-Rao bound asserts that under the conditions of the preceding theorem, we have
$$\mathbb{E}\, \|\hat{x} - x^\star\|^2 \gtrsim \frac{\sigma^2}{\alpha^2 T},$$
thus we can think of error $\frac{\sigma^2}{\alpha^2 T}$ as the gold standard.
By combining \eqref{eq:sgd1} and \eqref{eq:sgd2}, one obtains that for the average point $\bar{x} = \frac{1}{T} \sum_{k=0}^{T-1} x^{(k)}$,
SGD gives
$$\mathbb{E}\big[ F(\bar{x}) \big] - F(x^\star) \leq \frac{\|x^{(0)} - x^\star\|^2}{\eta T} + 2 \eta \sigma^2.$$
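A minimal SGD sketch on a least-squares instance, where each $f_i(x) = \frac{1}{2} (\langle a_i, x \rangle - b_i)^2$ (the instance and step size are our own illustration; note the iterates hover at a noise floor governed by $\eta \sigma^2 / \alpha$, per \eqref{eq:sgd1}):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)

x, eta = np.zeros(n), 0.01
for k in range(20000):
    i = rng.integers(m)
    g = (A[i] @ x - b[i]) * A[i]    # gradient of a single sampled f_i
    x = x - eta * g
# distance to the minimizer of F; small, at the eta-dependent noise floor
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))
```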
|
Thu, Dec 02
Variance reduction and coordinate descent
LV Book 6.3
|
Let us now assume that $F = \frac{1}{m} \sum_{i=1}^m f_i$ (so the distribution $\mathcal{D}$ is uniform on $\{f_1, \ldots, f_m\}$).
Notice the dependence of \eqref{eq:sgd1} and \eqref{eq:sgd2} on the variance $\sigma^2 = \mathbb{E}\, \|\nabla f(x^\star)\|^2$. To remove this dependence,
we consider a variance reduction technique. For any $f$ and reference point $\tilde{x}$, define $\tilde{f}$ by
\begin{equation}\label{eq:tildef}
\nabla \tilde{f}(x) = \nabla f(x) - \nabla f(\tilde{x}) + \nabla F(\tilde{x}).
\end{equation}
Note that $\mathbb{E}[\nabla \tilde{f}(x)] = \nabla F(x)$. From \eqref{eq:Fgrad}, one can also conclude that
$$\mathbb{E}\, \|\nabla \tilde{f}(x^\star)\|^2 \leq 2L \big( F(\tilde{x}) - F(x^\star) \big).$$
This suggests a natural algorithm that runs SGD in phases, where in each phase we replace $\tilde{x}$ by progressively
better approximations to $x^\star$ (so that the variance keeps reducing).
-
SGD phase($x^{(0)}$, $\tilde{x}$, $\eta$, $T$): Given points $x^{(0)}$ and $\tilde{x}$, we run SGD for $T$ steps
using the functions $\tilde{f}$, where $\tilde{f}$ is defined in \eqref{eq:tildef}.
Output $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}$.
Note that the SGD phase requires $T + m$ gradient computations, where $m$ gradient computations are needed for $\nabla F(\tilde{x})$.
-
SVRG (Stochastic Variance Reduced Gradient): run SGD phases, taking the output of each phase as the reference point $\tilde{x}$ for the next.
-
Theorem (SVRG): Suppose $f_1, \ldots, f_m$ have $L$-Lipschitz gradients and $F$ is $\alpha$-strongly convex. Define $\kappa = L/\alpha$, and take $\eta \asymp 1/L$ and $T \asymp \kappa$.
Then for every $t \geq 1$, the output $\tilde{x}^{(t)}$ of SVRG after $t$ phases satisfies
$$\mathbb{E}\big[ F(\tilde{x}^{(t)}) \big] - F(x^\star) \leq 2^{-t} \big( F(\tilde{x}^{(0)}) - F(x^\star) \big).$$
-
Note that to get a solution with error $\varepsilon$ requires $O\big((m + \kappa) \log(1/\varepsilon)\big)$ gradient computations.
Our analysis of regular gradient descent required $O(\kappa \log(1/\varepsilon))$ iterations, which requires
$O(m \kappa \log(1/\varepsilon))$ gradient computations.
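A sketch of the SVRG phases, using the corrected gradient \eqref{eq:tildef} (the function signature is ours):

```python
import numpy as np

def svrg(grads, full_grad, x0, eta, T, phases):
    """SVRG: each phase anchors at x_tilde, computes grad F(x_tilde) once,
    then runs T SGD steps with the corrected (variance-reduced) gradient."""
    rng = np.random.default_rng(0)
    m, x_tilde = len(grads), x0
    for _ in range(phases):
        mu = full_grad(x_tilde)           # m gradient computations
        x = x_tilde.copy()
        for _ in range(T):
            i = rng.integers(m)
            g = grads[i](x) - grads[i](x_tilde) + mu  # unbiased, shrinking variance
            x = x - eta * g
        x_tilde = x                       # re-anchor at the phase output
    return x_tilde

# Hypothetical usage on least squares (A, b, m, n as in the SGD example above):
#   grads = [lambda x, a=A[i], t=b[i]: (a @ x - t) * a for i in range(m)]
#   full_grad = lambda x: A.T @ (A @ x - b) / m
#   x = svrg(grads, full_grad, np.zeros(n), eta=0.01, T=2 * m, phases=10)
```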
|
Acceleration |
Tue, Dec 07
Chebyshev polynomials
LV Book 7.1
|
- Chebyshev polynomials (of the first kind): $T_d(\cos \theta) = \cos(d \theta)$
-
Existence of a degree $d \leq \lceil \sqrt{2 s \log(2/\delta)} \rceil$ polynomial $p_{s,\delta}$ satisfying
$$\sup_{x \in [-1,1]} |p_{s,\delta}(x) - x^s| \leq \delta,$$
where $s \geq 1$ is an integer and $\delta > 0$.
-
For every $\kappa \geq 1$, existence of a polynomial $q$ of degree $O(\sqrt{\kappa} \log(1/\delta))$ satisfying $q(0) = 1$ and $|q(x)| \leq \delta$ for all $x \in [1/\kappa, 1]$.
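One can build the second polynomial explicitly as $q(x) = T_d(t(x)) / T_d(t(0))$, where $t$ affinely maps $[1/\kappa, 1]$ onto $[-1, 1]$; a numpy sketch verifying the decay (the parameters are our own illustration):

```python
import numpy as np
from numpy.polynomial import Chebyshev

# q(x) = T_d(t(x)) / T_d(t(0)), with t affine from [1/kappa, 1] onto [-1, 1].
# Then q(0) = 1 while |q| is uniformly small on [1/kappa, 1].
kappa, d = 100.0, 25                # degree d ~ sqrt(kappa) already suffices
a = 1.0 / kappa
t = lambda x: (2 * x - (1 + a)) / (1 - a)
Td = Chebyshev.basis(d)
q = lambda x: Td(t(x)) / Td(t(0))

xs = np.linspace(a, 1, 2001)
print(q(0.0), np.abs(q(xs)).max())  # 1.0 and roughly 2*exp(-2*d/sqrt(kappa))
```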
|
Thu, Dec 09
Conjugate gradient
LV Book 7.2
|
-
Consider the function $f(x) = \frac{1}{2} \langle x, A x \rangle - \langle b, x \rangle$, where $A$ is a PSD matrix. We have $\nabla f(x) = A x - b$, so minimizing $f$ amounts to solving the linear system $Ax = b$.
-
Any sequence $x^{(0)} = 0, x^{(1)}, x^{(2)}, \ldots$ satisfying the invariant
$$x^{(k+1)} \in \operatorname{span}\big\{ x^{(1)}, \ldots, x^{(k)}, \nabla f(x^{(k)}) \big\}$$
has the property that
$$x^{(k)} \in \operatorname{span}\big\{ b, A b, A^2 b, \ldots, A^{k-1} b \big\}.$$
Define this subspace to be $\mathcal{K}_k = \operatorname{span}\{ b, Ab, \ldots, A^{k-1} b \}$. This is called the $k$th Krylov subspace.
-
We can consider the best such sequence given by $x^{(k)} = \operatorname{argmin} \{ f(x) : x \in \mathcal{K}_k \}$.
-
Using the fact that every $x \in \mathcal{K}_k$ is of the form $x = p(A) b$ for a polynomial $p$ of degree at most $k-1$, one can use the existence of Chebyshev polynomials to show that
$$\|x^{(k)} - x^\star\|_A \leq 2\, e^{-2k/\sqrt{\kappa}}\, \|x^{(0)} - x^\star\|_A,$$
where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$ is the condition number of $A$.
Note the $e^{-k/\sqrt{\kappa}}$ factor decay in the error, vs. $e^{-k/\kappa}$
for the Richardson iteration.
-
We can compute the sequence $\{x^{(k)}\}$ efficiently using the conjugate gradient method.
-
For an introduction to accelerated gradient descent for general convex functions, see
Aaron Sidford’s lecture notes.
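A standard conjugate gradient implementation (a sketch; the random positive definite instance is our own illustration):

```python
import numpy as np

def conjugate_gradient(A, b, iters):
    """Conjugate gradient for PSD A; the k-th iterate minimizes f over the
    k-th Krylov subspace span{b, Ab, ..., A^{k-1} b}."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + np.eye(50)      # positive definite
b = rng.standard_normal(50)
x = conjugate_gradient(A, b, iters=60)
print(np.linalg.norm(A @ x - b))  # ~ 0
```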
|