Introduction 
Thu, Sep 30
Intro to convexity and optimization
LV Book: 1.1–1.5
Homework #0

 Convex sets, convex functions, convex optimization
 If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then local optimality
implies global optimality.

This holds because the following are equivalent:
 $f : \mathbb{R}^n \to \mathbb{R}$ convex
 $f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle$ for all $x,y \in \mathbb{R}^n$.

If $f$ is $\mathcal{C}^2$ (if it has continuous second-order partial derivatives), then the
Hessian matrix $\nabla^2 f(x)$, with entries $(\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$, is defined for every $x \in \mathbb{R}^n$.

To check convexity, often it is useful to employ the following:
Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}^2$, then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is
positive semidefinite for all $x \in \mathbb{R}^n$.

As a simple example, consider the bivariate function $f(u,v) = u \log(u/v)$
defined on the positive orthant $u,v > 0$.
Then we have
$\nabla^2 f(u,v) = \begin{pmatrix} 1/u & -1/v \\ -1/v & u/v^2 \end{pmatrix}.$
The trace of this matrix is $1/u + u/v^2 > 0$ for $u,v > 0$.
The determinant of this matrix is $1/v^2 - 1/v^2 = 0$ for $u,v > 0$.
Therefore if $\lambda_1,\lambda_2$ are the eigenvalues of $\nabla^2 f(u,v)$,
it holds that $\lambda_1 + \lambda_2 > 0$ and $\lambda_1 \lambda_2 = 0$.
The only possibility is that both eigenvalues are nonnegative,
meaning that $\nabla^2 f(u,v)$ is positive semidefinite on the positive orthant.
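As a quick numeric sanity check, the claims above can be tested with numpy; the explicit function $f(u,v) = u\log(u/v)$ and the sampled points are illustrative assumptions:

```python
import numpy as np

def hessian(u, v):
    # Hessian of f(u, v) = u * log(u / v) on the positive orthant
    return np.array([[1.0 / u, -1.0 / v],
                     [-1.0 / v, u / v**2]])

# Sample points in the positive orthant and check positive semidefiniteness.
rng = np.random.default_rng(0)
for _ in range(100):
    u, v = rng.uniform(0.1, 10.0, size=2)
    eigs = np.linalg.eigvalsh(hessian(u, v))
    assert eigs.min() >= -1e-8                        # both eigenvalues nonnegative
    assert abs(np.linalg.det(hessian(u, v))) < 1e-8   # determinant vanishes
```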

Tue, Oct 05
Gradient descent I
LV Book: 2.1–2.3


Gradient descent : $f : \mathbb{R}^n \to \mathbb{R}$, parameters $\eta, \epsilon > 0$, initial point $x^{(0)} \in \mathbb{R}^n$.
 If $\|\nabla f(x^{(k)})\| \leq \epsilon$, stop and output $x^{(k)}$.
 Else update $x^{(k+1)} = x^{(k)} - \eta \nabla f(x^{(k)})$.

A function $f$ has $L$-Lipschitz gradients if
$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.$

The following are equivalent for a convex function $f$:
 $f$ has $L$-Lipschitz gradients.
 $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2$ for all $x, y \in \mathbb{R}^n$.

For all $k \geq 0$, taking $\eta = 1/L$ gives
$f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2L} \|\nabla f(x^{(k)})\|^2.$

Theorem: If $f$ has $L$-Lipschitz gradients, then gradient descent with $\eta = 1/L$ stops within $O\!\left(\frac{L \left(f(x^{(0)}) - \min f\right)}{\epsilon^2}\right)$ steps.
Note that in this case we have no guarantee of optimality, only that $\|\nabla f(x^{(k)})\| \leq \epsilon$.

Theorem: If $f$ is convex with $L$-Lipschitz gradients, then gradient descent with $\eta = 1/L$ satisfies
$f(x^{(k)}) - \min f \leq \frac{L \|x^{(0)} - x^*\|^2}{2k},$
where $x^*$ is any minimizer of $f$.
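The gradient descent iteration can be sketched in a few lines of numpy; the quadratic test function and the choice $\eta = 1/L$ are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad, x0, eta, eps, max_iters=100000):
    """x_{k+1} = x_k - eta * grad(x_k), stopping once ||grad(x_k)|| <= eps."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - eta * g
    return x

# f(x) = 0.5 * x^T A x with A = diag(1, 10): gradients are 10-Lipschitz, so eta = 1/L = 0.1.
A = np.diag([1.0, 10.0])
x = gradient_descent(lambda z: A @ z, x0=[5.0, 5.0], eta=0.1, eps=1e-6)
```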

Thu, Oct 07
Gradient descent II
LV Book: 2.3–2.7
Homework #1


A function $f$ is $\alpha$-strongly convex if for any $x, y \in \mathbb{R}^n$ and $\lambda \in [0,1]$,
$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) - \frac{\alpha}{2} \lambda (1-\lambda) \|x - y\|^2.$

For $f : \mathbb{R}^n \to \mathbb{R}$ and $\alpha > 0$, the following are equivalent:

$f$ is $\alpha$-strongly convex.

For all $x, y \in \mathbb{R}^n$, it holds that
$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2.$

$\nabla^2 f(x) \succeq \alpha I$ for all $x \in \mathbb{R}^n$ (when $f$ is $\mathcal{C}^2$).

Theorem: Suppose that $f$ is $\alpha$-strongly convex and has $L$-Lipschitz gradients. Then gradient descent with $\eta = 1/L$
satisfies
$\|x^{(k)} - x^*\|^2 \leq \left(1 - \frac{\alpha}{L}\right)^k \|x^{(0)} - x^*\|^2.$

Gradient descent is a discretization of the continuous algorithm (gradient flow): $\dot{x}(t) = -\nabla f(x(t))$.
Note that:
$\frac{d}{dt} f(x(t)) = \langle \nabla f(x(t)), \dot{x}(t) \rangle = -\|\nabla f(x(t))\|^2.$
And if $f$ is $\alpha$-strongly convex, then
$\|\nabla f(x)\|^2 \geq 2 \alpha \left( f(x) - \min f \right),$
hence
$f(x(t)) - \min f \leq e^{-2 \alpha t} \left( f(x(0)) - \min f \right).$

Line search and the Wolfe conditions (Sec 2.5)

Elimination 
Tue, Oct 12
Cutting plane methods, the Ellipsoid algorithm
LV Book: 3.1–3.2


The hyperplane separation theorem: If $K \subseteq \mathbb{R}^n$ is a closed convex set
and $x_0 \notin K$, then there is a direction $v \in \mathbb{R}^n$ such that
$\langle v, x_0 \rangle > \sup_{y \in K} \langle v, y \rangle.$
Said differently, there is an $(n-1)$-dimensional hyperplane
that separates $x_0$ from $K$.
 Cutting plane framework
 Affine geometry of ellipsoids
 Ellipsoid algorithm and analysis
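A minimal sketch of the central-cut ellipsoid method applied to minimizing a convex function; the toy objective, radius, and iteration count are illustrative assumptions:

```python
import numpy as np

def ellipsoid_minimize(grad, f, x0, R, iters):
    """Central-cut ellipsoid method: maintain E = {z : (z-x)^T P^{-1} (z-x) <= 1}
    containing a minimizer, and halve it through the subgradient at the center."""
    n = len(x0)
    x = np.array(x0, dtype=float)
    P = R**2 * np.eye(n)                   # start from the ball of radius R
    best_x, best_val = x.copy(), f(x)
    for _ in range(iters):
        g = grad(x)
        Pg = P @ g
        gPg = g @ Pg
        # standard central-cut update (valid for n >= 2)
        x = x - Pg / ((n + 1) * np.sqrt(gPg))
        P = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(Pg, Pg) / gPg)
        P = (P + P.T) / 2                  # keep P symmetric numerically
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val

# Toy instance: f(z) = ||z - c||^2 with minimizer c inside the radius-3 ball.
c = np.array([1.0, -0.5])
f = lambda z: float(np.sum((z - c) ** 2))
grad = lambda z: 2.0 * (z - c)
xk, val = ellipsoid_minimize(grad, f, x0=np.zeros(2), R=3.0, iters=200)
```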

Thu, Oct 14
NO CLASS


Tue, Oct 19
From volume to function value
LV Book: 3.2–3.3


Using ellipsoid to minimize nonsmooth convex functions

Volume shrinkage yields small function value
The main result here is that if we are given a convex function $f$
and a bounding box $K$ containing some minimizer $x^*$,
then after $O(n^2 \log(1/\varepsilon))$ iterations of the ellipsoid algorithm, we obtain a point $x^{(k)}$ satisfying
$f(x^{(k)}) - f(x^*) \leq \varepsilon \left( \max_{x \in K} f(x) - f(x^*) \right).$

Application to Linear Programming
A linear program is an optimization of the form: Given inputs $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$,
$\min \left\{ \langle c, x \rangle : x \in \mathbb{R}^n, \ Ax \leq b \right\}.$
The feasible region is the set $K = \{ x \in \mathbb{R}^n : Ax \leq b \}$. Note that $Ax \leq b$ is shorthand
for the linear inequalities $\langle a_i, x \rangle \leq b_i$ for $i = 1, \ldots, m$, where
$a_1, \ldots, a_m$ are the rows of $A$.

Thu, Oct 21
Center of gravity
LV Book: 3.4
Homework #2


Exact center of gravity converges to an $\varepsilon$-approximate minimum
in $O(n \log(1/\varepsilon))$ iterations.
Compare this to the $O(n^2 \log(1/\varepsilon))$ iterations used by the Ellipsoid Algorithm.

Analysis using Grünbaum's Theorem: If $K \subseteq \mathbb{R}^n$ is a compact, convex set and $z$ is
the center of gravity of $K$:
$z = \frac{1}{\operatorname{vol}(K)} \int_K x \, dx,$
then for any halfspace $H$ containing $z$, it holds that
$\operatorname{vol}(K \cap H) \geq \frac{1}{e} \operatorname{vol}(K).$

Approximate center of gravity suffices: Robust Grünbaum Theorem

Log-concave distributions
A measure $\mu$ on $\mathbb{R}^n$ is log-concave if it holds that for all measurable sets $A, B \subseteq \mathbb{R}^n$
and $\lambda \in [0,1]$, we have
$\mu(\lambda A + (1-\lambda) B) \geq \mu(A)^{\lambda} \, \mu(B)^{1-\lambda}.$
Every log-concave measure arises from a density $p$ for which
$d\mu(x) = p(x) \, dx$, and such that the function $\log p$ is concave.
Equivalently, $p(x) = e^{-V(x)}$ for some convex function $V$.

Isotropic distributions
A probability distribution $\mu$ on $\mathbb{R}^n$ is isotropic if $\mathbb{E}[X] = 0$ and $\mathbb{E}[X X^\top] = I$,
where $X$ is a random vector with law $\mu$.

Reductions 
Tue, Oct 26
Oracles and equivalences
LV Book: 4.1, 4.6


Different oracles for accessing a convex set

Oracles for accessing a convex function $f$: Evaluation oracle, subgradient oracle

The convex conjugate of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is defined by
$f^*(y) = \sup_{x \in \mathbb{R}^n} \left( \langle x, y \rangle - f(x) \right).$
This function is convex, as it is the maximum of the family of affine functions $y \mapsto \langle x, y \rangle - f(x)$,
one for every fixed $x$.

We can compute a subgradient of $f^*$ at $y$ via a maximizer: if $x^*$ achieves the supremum defining $f^*(y)$, then $x^* \in \partial f^*(y)$.

The epigraph of a function $f$ is the set
$\operatorname{epi}(f) = \left\{ (x, t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x) \right\}.$
This is the set of points lying above the graph of $f$. The function $f$
is convex if and only if $\operatorname{epi}(f)$ is a convex set.

The involution property: For any convex function $f$ whose epigraph is closed,
it holds that $(f^*)^* = f$.

We can consider the set of all linear lower bounds on a function $f$: All those pairs
$(y, c)$ with $y \in \mathbb{R}^n$ and $c \in \mathbb{R}$ such that
$f(x) \geq \langle x, y \rangle - c \quad \text{for all } x \in \mathbb{R}^n.$
For a fixed $y$, the value of $c$ giving the best lower bound is precisely $c = f^*(y)$.
Similarly, at a fixed point $x$, we can consider all the lower bounds
$f(x) \geq \langle x, y \rangle - f^*(y).$
By definition, the best such lower bound is given by $\sup_y \left( \langle x, y \rangle - f^*(y) \right) = f^{**}(x)$, hence $f(x) \geq f^{**}(x)$.
The involution property ensures that when the epigraph of $f$ is closed, this family of lower bounds is
tight in the sense that $f^{**}(x) = f(x)$ for all $x \in \mathbb{R}^n$.
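The conjugate and the involution property can be checked numerically on a grid; the test function $f(x) = x^2/2$ (which happens to be its own conjugate) is an illustrative assumption:

```python
import numpy as np

# Check the involution (f*)* = f on a grid for the closed convex function
# f(x) = x^2 / 2, whose conjugate is again y^2 / 2.
xs = np.linspace(-5, 5, 2001)
f = 0.5 * xs**2

def conjugate(vals, grid):
    # f*(y) = sup_x ( x*y - f(x) ), computed over the grid
    return np.array([np.max(grid * y - vals) for y in grid])

f_star = conjugate(f, xs)
f_star_star = conjugate(f_star, xs)
```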

Computational equivalences of the subgradient oracles for $f$ and $f^*$ using cutting planes and the involution property

Thu, Oct 28
Duality
LV Book: 4.2–4.4


Sion’s Minimax Theorem
Definitions of quasiconvex, quasiconcave, and lower/upper semicontinuous functions.

Application to a function decomposed as $f(x) = \max_{y \in Y} \phi(x, y)$, where Sion's theorem gives
$\min_{x \in X} \max_{y \in Y} \phi(x, y) = \max_{y \in Y} \min_{x \in X} \phi(x, y)$:
 Linear programming duality

SDP duality: Equivalence of the two optimizations.
Here we take
$\phi(X, y) = \langle C, X \rangle + \iota_{\{X \succeq 0\}}(X) + \sum_{i=1}^m y_i \left( b_i - \langle A_i, X \rangle \right),$
where we recall that for a subset $S$, we define $\iota_S(x) = 0$ if $x \in S$ and $\iota_S(x) = +\infty$ if $x \notin S$.
And then we take:
$\min_X \max_y \phi(X, y) = \min \left\{ \langle C, X \rangle : \langle A_i, X \rangle = b_i \text{ for } i = 1, \ldots, m, \ X \succeq 0 \right\}$
and
$\max_y \min_X \phi(X, y) = \max \left\{ \langle b, y \rangle : C - \textstyle\sum_{i=1}^m y_i A_i \succeq 0 \right\}.$

Separation oracle for the PSD constraint $X \succeq 0$.
Define $\lambda_{\min}(X) = \min \{ u^\top X u : \|u\| = 1 \}$.
Note that $X$ is the underlying variable, so the feasible region
corresponding to this constraint is $\{ X : X \succeq 0 \}$.
Compute the minimum eigenvector $u$ of $X$. If the constraint is violated
at $X$, it must be that
$u^\top X u = \lambda_{\min}(X) < 0.$
On the other hand, for any feasible $Y$ and any vector $u$, it holds that
$u^\top Y u \geq 0.$
Hence if we define the matrix $W$ by $W = u u^\top$, then $W$ gives a hyperplane that separates $X$ from $\{ Y : Y \succeq 0 \}$, in the sense that
$\langle W, X \rangle < 0 \leq \langle W, Y \rangle \quad \text{for every } Y \succeq 0.$
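A sketch of this separation oracle in numpy; the test matrices are illustrative assumptions:

```python
import numpy as np

def psd_separation_oracle(X, tol=1e-9):
    """Return None if X is (numerically) PSD; otherwise a matrix W = u u^T with
    <W, X> < 0 <= <W, Y> for every PSD Y, i.e. a separating hyperplane."""
    eigvals, eigvecs = np.linalg.eigh((X + X.T) / 2)
    if eigvals[0] >= -tol:
        return None                # the constraint X >= 0 is satisfied
    u = eigvecs[:, 0]              # minimum eigenvector
    return np.outer(u, u)

X_bad = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1: not PSD
W = psd_separation_oracle(X_bad)
```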

Geometry and Optimization 
Tue, Nov 02
Projected subgradient descent
LV Book: 5.1


The collection of subgradients:
$\partial f(x) = \left\{ g \in \mathbb{R}^n : f(y) \geq f(x) + \langle g, y - x \rangle \ \text{ for all } y \in \mathbb{R}^n \right\}.$

If $f$ has continuous partial derivatives at $x$, then $\partial f(x) = \{ \nabla f(x) \}$.

Example: Show that if $f : \mathbb{R} \to \mathbb{R}$ is defined by $f(x) = |x|$, then $\partial f(0) = [-1, 1]$.
If you graph $f$, this corresponds to the fact that
all the lines through the origin of slope in the interval $[-1, 1]$ lie below the graph of $f$.

Projected Subgradient Descent:
 $y^{(k+1)} = x^{(k)} - \eta g^{(k)}$ for some $g^{(k)} \in \partial f(x^{(k)})$ (subgradient step)
 $x^{(k+1)} = \operatorname{proj}_K\left(y^{(k+1)}\right)$ (projection step)
After $T$ iterations, we output the average $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}$.

Pythagorean property for the Euclidean projection $\operatorname{proj}_K(x) = \arg\min_{y \in K} \|x - y\|$: for all $z \in K$,
$\|\operatorname{proj}_K(x) - z\| \leq \|x - z\|.$

Theorem: If $K \subseteq \mathbb{R}^n$ is closed and convex, and $f : \mathbb{R}^n \to \mathbb{R}$
is a convex, $L$-Lipschitz function, then after $T$
iterations with $\eta = \frac{\|x^{(0)} - x^*\|}{L \sqrt{T}}$, the output $\bar{x}$ of the subgradient descent algorithm satisfies,
for any minimizer $x^* \in K$,
$f(\bar{x}) - f(x^*) \leq \frac{L \|x^{(0)} - x^*\|}{\sqrt{T}}.$
Thu, Nov 04
Mirror maps and Bregman divergences
LV Book: 5.1.3–5.1.4
Homework #3

 Setup: $K \subseteq \mathbb{R}^n$ is closed and convex, $f : K \to \mathbb{R}$ is convex.
 Mirror map: A strongly convex function $\Phi : K \to \mathbb{R}$.

Associated Bregman divergence:
$D_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y \rangle.$

Bregman projection onto $K$:
$\Pi_K^\Phi(y) = \arg\min_{x \in K} D_\Phi(x, y).$

Mirror descent iteration:
Find $y^{(k+1)}$ such that
$\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta \nabla f(x^{(k)}),$
and then
$x^{(k+1)} = \Pi_K^\Phi\left(y^{(k+1)}\right).$

Equivalently:
$x^{(k+1)} = \arg\min_{x \in K} \left( \eta \langle \nabla f(x^{(k)}), x \rangle + D_\Phi(x, x^{(k)}) \right).$
Recall that if $f$ is nonsmooth, $\nabla f(x^{(k)})$ can be replaced by any subgradient of $f$ at $x^{(k)}$.

The map $\Phi$ is said to be $\alpha$-strongly convex on $K$ wrt a norm $\|\cdot\|$ if
$\Phi(y) \geq \Phi(x) + \langle \nabla \Phi(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2 \quad \text{for all } x, y \in K.$

Theorem: If $f$ is convex and $L$-Lipschitz wrt a norm $\|\cdot\|$,
and $\Phi$ is $\alpha$-strongly convex on $K$ wrt $\|\cdot\|$, then after $T$ iterations
mirror descent outputs the point
$\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)},$
and it satisfies
$f(\bar{x}) - \min_{x \in K} f(x) \leq \frac{R^2}{\eta T} + \frac{\eta L^2}{2 \alpha},$
where
$R^2 = \sup_{x \in K} \Phi(x) - \inf_{x \in K} \Phi(x)$
and $\eta$ is the step size of the iteration.

If we choose $\eta = \frac{R}{L} \sqrt{\frac{2\alpha}{T}}$, then this bound is $R L \sqrt{\frac{2}{\alpha T}}$.
 Examples include the Euclidean ball with $\Phi(x) = \frac{1}{2} \|x\|_2^2$ and the probability simplex with the entropy map $\Phi(x) = \sum_{i} x_i \log x_i$.
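For the simplex with the entropy mirror map, the mirror descent update becomes a multiplicative-weights step; a minimal sketch (the linear objective is an illustrative assumption):

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta, T):
    """Mirror descent with the entropy map Phi(x) = sum_i x_i log x_i on the
    probability simplex; the update is multiplicative."""
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        x = x * np.exp(-eta * grad(x))   # inverts nabla Phi after the gradient step
        x /= x.sum()                     # Bregman projection back onto the simplex
        avg += x
    return avg / T

# Minimize the linear function f(x) = <c, x> over the simplex; the optimum puts
# all mass on the smallest coordinate of c.
c = np.array([0.5, 0.2, 0.9])
xbar = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, eta=0.1, T=2000)
```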

Tue, Nov 09
Newton's Method
LV Book: 5.3


For a function $F : \mathbb{R}^n \to \mathbb{R}^n$, Newton’s method attempts to find a point $x$ where $F(x) = 0$.

This is done using the best linear approximation to $F$ at the current iterate $x^{(k)}$:
$F(x) \approx F(x^{(k)}) + J_F(x^{(k)}) \left( x - x^{(k)} \right),$
where $J_F(x^{(k)})$ is the Jacobian of $F$ at $x^{(k)}$.

We then set the linear approximation to zero to calculate the next iterate:
$x^{(k+1)} = x^{(k)} - J_F(x^{(k)})^{-1} F(x^{(k)}).$

We can (try to) find local minima of a function $f : \mathbb{R}^n \to \mathbb{R}$ by taking $F = \nabla f$.
In this case, the update rule takes the form
$x^{(k+1)} = x^{(k)} - \left( \nabla^2 f(x^{(k)}) \right)^{-1} \nabla f(x^{(k)}).$
Note that the Newton iterate has the nice feature that the algorithm is invariant under
affine transformations. Note also one of the major drawbacks: A single iterate requires
inverting the Hessian of $f$.
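A minimal numpy sketch of the Newton iteration for $F = \nabla f$; the test function $f(x,y) = x^4 + y^4 - 4xy$ is an illustrative assumption:

```python
import numpy as np

def newton(F, J, x0, iters=20):
    """Newton's method for F(x) = 0: x_{k+1} = x_k - J(x_k)^{-1} F(x_k)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(J(x), F(x))
    return x

# Find a stationary point of f(x, y) = x^4 + y^4 - 4xy by solving grad f = 0.
F = lambda v: np.array([4 * v[0]**3 - 4 * v[1], 4 * v[1]**3 - 4 * v[0]])
J = lambda v: np.array([[12 * v[0]**2, -4.0], [-4.0, 12 * v[1]**2]])
x = newton(F, J, x0=[1.5, 0.8])          # converges to the minimizer (1, 1)
```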

Suppose $p$ is a univariate polynomial with real roots $r_1 \leq r_2 \leq \cdots \leq r_m$
and $x^{(0)} > r_m$. Then Newton’s method converges: $x^{(k)} \to r_m$ and, moreover, one has the rate
$x^{(k)} - r_m \leq \left(1 - \frac{1}{m}\right)^k \left( x^{(0)} - r_m \right).$

In general, Newton’s method need not converge to a zero of $F$, but when it does, the rate of convergence is
(eventually) quadratic.
(Local quadratic convergence)
If $F$ is $\mathcal{C}^2$, and $x^{(k)} \to x^*$ for some $x^*$ satisfying
$F(x^*) = 0$, and moreover $J_F(x^*)$ is invertible, then there is a number $c > 0$ such that for $k$
sufficiently large,
$\|x^{(k+1)} - x^*\| \leq c \, \|x^{(k)} - x^*\|^2.$

Newton’s formula will always converge to a zero $x^*$ as long as $x^{(0)}$ is close enough to $x^*$, where $F$ and
$J_F$ are suitably nondegenerate around $x^*$.
One can find here a proof that
Newton’s method converges quadratically
to a zero $x^*$ as long as $\|x^{(0)} - x^*\|$ is sufficiently small.
One can see the preceding link for a discussion of basins of attraction and failure cases.

Tue, Nov 16
Interior point methods I
LV Book: 5.4

Consider $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$, and
the corresponding linear program in slack form:
$\min \left\{ \langle c, x \rangle : Ax = b, \ x \geq 0 \right\} \quad (P)$
and its dual
$\max \left\{ \langle b, y \rangle : A^\top y + s = c, \ s \geq 0 \right\}. \quad (D)$
For a primal-feasible $x$ and dual-feasible $(y, s)$, we always have
$\langle c, x \rangle - \langle b, y \rangle = \langle x, s \rangle \geq 0.$
The duality gap for such a pair is the quantity $\langle x, s \rangle$.
Linear programming duality tells us that $\langle x, s \rangle = 0$ if and only if $x$ and $(y, s)$ are an optimal primal-dual pair.
In the optimization $(P)$, the difficult set of constraints is given by $x \geq 0$. We can seek to soften these
constraints via the use of a barrier. Consider the log-barrier: For some parameter $\mu > 0$,
$\min \left\{ \langle c, x \rangle - \mu \sum_{i=1}^n \log x_i : Ax = b \right\}. \quad (P_\mu)$
Note that any solution to $(P_\mu)$ with bounded objective must satisfy that $x > 0$ (because otherwise the objective
is $+\infty$).
Introducing a vector $y \in \mathbb{R}^m$ of Lagrangian multipliers, the (unique) optimum $x_\mu$ of $(P_\mu)$ satisfies
$c - \mu X^{-1} \mathbf{1} = A^\top y,$
where $X = \operatorname{diag}(x)$. We can therefore characterize the optimizer as given by a triple $(x, y, s)$ satisfying the constraints
$Ax = b, \qquad A^\top y + s = c, \qquad x_i s_i = \mu \ \text{ for all } i, \qquad x, s > 0.$
Note that $x_\mu$ and $(y_\mu, s_\mu)$ are primal and dual-feasible, respectively, and $\langle x_\mu, s_\mu \rangle = n \mu$.
Hence as $\mu \to 0$, we have $(x_\mu, y_\mu, s_\mu)$ converging to an optimal primal-dual pair.
The sequence $\{(x_\mu, y_\mu, s_\mu)\}_{\mu > 0}$ is called the central path, and interior point methods
solve $(P)$ by approximately following this path.
The idea is to compute $(x_{\mu'}, y_{\mu'}, s_{\mu'})$ from $(x_\mu, y_\mu, s_\mu)$ for $\mu' = (1 - h)\mu$ with $h$ sufficiently small. (It turns out we can take $h \asymp 1/\sqrt{n}$.)
This computation is done via Newton’s method, and the key insight is that for $h$ small, we will be in a “basin of attraction”
and achieve local quadratic convergence. The end result is that we can solve a linear program within accuracy $\epsilon$ by
solving about $\sqrt{n} \log(n/\epsilon)$ linear systems.
Animation of the central path for some optimizations.

Thu, Nov 18
Interior point methods II
LV Book: 5.4

Continuing as in the last lecture, our goal is to approximately follow the central path:
$Ax = b, \qquad A^\top y + s = c, \qquad x_i s_i = \mu \ \text{ for all } i, \qquad x, s > 0.$
Define the vector $w \in \mathbb{R}^n$ by $w_i = x_i s_i / \mu$. Our notion of distance to the
central path will be measured by
$\delta = \| w - \mathbf{1} \|_2.$
In order to decrease $\mu$, we will try to improve the approximation by finding
a vector $(\Delta x, \Delta y, \Delta s)$ such that
$(x_i + \Delta x_i)(s_i + \Delta s_i) = \mu \ \text{ for all } i, \qquad A (x + \Delta x) = b, \qquad A^\top (y + \Delta y) + (s + \Delta s) = c, \qquad x + \Delta x, \ s + \Delta s > 0.$
To approximate this by a linear system, we will drop the quadratic term $\Delta x_i \Delta s_i$ from the first equality and ignore
the nonnegativity constraints, yielding
$s_i \Delta x_i + x_i \Delta s_i = \mu - x_i s_i \ \text{ for all } i, \qquad A \Delta x = 0, \qquad A^\top \Delta y + \Delta s = 0.$
Now we can see $(\Delta x, \Delta s)$ as the solution to the linear system
$D^{-1} \Delta x + D \Delta s = r, \qquad A \Delta x = 0, \qquad \Delta s \in \operatorname{range}(A^\top),$
where $D = \operatorname{diag}\left(\sqrt{x_i/s_i}\right)$ and $r_i = (\mu - x_i s_i)/\sqrt{x_i s_i}$.
Solving this system shows that there is a projection matrix $P$ (the orthogonal projection onto $\operatorname{range}(D A^\top)$) such that
$\Delta x = D (I - P) r, \qquad \Delta s = D^{-1} P r.$
If we now assume that $\delta \leq 1/4$, then
$\left| \frac{\Delta x_i}{x_i} \right| = \frac{|((I - P) r)_i|}{\sqrt{\mu \, w_i}} \leq \frac{\|r\|_2}{\sqrt{\mu \min_i w_i}} \leq \frac{4 \delta}{3} \leq \frac{1}{3},$
and the same bound holds for $|\Delta s_i / s_i|$. Therefore for $\delta \leq 1/4$, we have $x_i + \Delta x_i > 0$ and
$s_i + \Delta s_i > 0$ for all $i$, meaning that the nonnegativity constraints are satisfied for $x + \Delta x$ and $s + \Delta s$.
Moreover, if we again assume $\delta \leq 1/4$, then we have
$(x_i + \Delta x_i)(s_i + \Delta s_i) - \mu = \Delta x_i \Delta s_i = ((I - P) r)_i \, (P r)_i.$
By Cauchy–Schwarz, we can bound this by
$\sum_i |\Delta x_i \Delta s_i| \leq \|(I - P) r\|_2 \, \|P r\|_2 \leq \frac{\|r\|_2^2}{2} \leq \frac{2 \mu \delta^2}{3},$
where in the last line we used the bound $\|r\|_2^2 = \mu \sum_i \frac{(1 - w_i)^2}{w_i} \leq \frac{4 \mu \delta^2}{3}$ coming from $\min_i w_i \geq 3/4$.
One can see here the local quadratic convergence of the Newton step: We have decreased the error from $\delta$ to $O(\delta^2)$.
For $\delta \leq 1/4$, this gives
$\delta' := \left\| \frac{(x + \Delta x)(s + \Delta s)}{\mu} - \mathbf{1} \right\|_2 \leq \frac{2 \delta^2}{3} \leq \frac{\delta}{6}.$
One can now check that if $\delta \leq 1/4$, then we can take
$\mu' = \left( 1 - \frac{1}{8 \sqrt{n}} \right) \mu,$
then we have
$\left\| \frac{(x + \Delta x)(s + \Delta s)}{\mu'} - \mathbf{1} \right\|_2 \leq \frac{\delta' + \frac{1}{8}}{1 - \frac{1}{8\sqrt{n}}} \leq \frac{1}{4},$
meaning that we have decreased $\mu$ by a factor of $1 - \frac{1}{8\sqrt{n}}$ and maintained the invariant $\delta \leq 1/4$.

Sparsification 
Tue, Nov 23
Subspace embeddings and sparse linear regression
LV Book: 6.1
Homework #4


Consider a linear regression problem specified by $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ with $m \gg n$.
Our goal is to minimize the classical least-squares error:
$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2.$

We can calculate the optimizer exactly by examining the critical points of the gradient:
$\nabla_x \|Ax - b\|_2^2 = 2 A^\top (Ax - b) = 0,$
which has solution
$x^* = (A^\top A)^{-1} A^\top b.$

Note that $A^\top A$ is an $n \times n$ matrix, so we can compute its inverse in time $O(n^{\omega})$, where
$\omega < 2.38$ is the matrix multiplication exponent. But computing the product $A^\top A$ might take time $O(m n^{\omega - 1})$.
If $m \gg n$, this may dominate the running time.

We could also try to calculate $x^*$ via gradient descent, leading to the “Richardson iteration:”
$x^{(k+1)} = x^{(k)} - A^\top \left( A x^{(k)} - b \right).$
As we will see in a moment, it holds that if we scale $A$ so that $\|A^\top A\| \leq 1$, then $x^{(k)} \to x^*$.
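A quick numpy sketch of the Richardson iteration on random data; the instance and the spectral-norm scaling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A /= np.linalg.norm(A, ord=2)            # scale so that ||A^T A|| <= 1
b = rng.standard_normal(100)

# Richardson iteration: gradient descent on 0.5 * ||Ax - b||^2 with unit step
x = np.zeros(5)
for _ in range(300):
    x = x - A.T @ (A @ x - b)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```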

Preconditioning. Note that for an invertible, symmetric matrix $M$, we can write $x^* = (M A^\top A)^{-1} M A^\top b$.
The corresponding Richardson iteration looks like
$x^{(k+1)} = x^{(k)} - M \left( A^\top A x^{(k)} - A^\top b \right),$
and if we know that $\|I - M A^\top A\| \leq 1/2$, then we obtain the exponential convergence:
$\|x^{(k)} - x^*\| \leq 2^{-k} \|x^{(0)} - x^*\|.$

Let us think of $M = (B^\top B)^{-1}$, where $B \in \mathbb{R}^{s \times n}$. Then the eigenvalues of $M A^\top A$ lying in $[1/2, 3/2]$
is the same as $\frac{1}{2} \|Bx\|^2 \leq \|Ax\|^2 \leq \frac{3}{2} \|Bx\|^2$ for all $x \in \mathbb{R}^n$.
So to find a good preconditioner $M$, we can take $B = \Pi A$ for some $s \ll m$, where
$\Pi : \mathbb{R}^m \to \mathbb{R}^s$ is a linear map that approximately preserves the norms of
vectors in the subspace $\operatorname{range}(A) \subseteq \mathbb{R}^m$. Note that $\dim(\operatorname{range}(A)) \leq n$.

In order for this to be efficient, it is useful if we can take the map $\Pi$ to be “oblivious” to the matrix $A$.
While there will not be a single $\Pi$ (with $s \ll m$) that preserves the norms in every $n$-dimensional subspace,
it turns out that we can choose $\Pi$ at random with this property.

Oblivious subspace embeddings: A random linear map $\Pi \in \mathbb{R}^{s \times m}$ is said to be a $(d, \epsilon, \delta)$-OSE if it
holds that for every subspace $V \subseteq \mathbb{R}^m$ with $\dim(V) \leq d$,
$\Pr\left[ (1 - \epsilon) \|x\|^2 \leq \|\Pi x\|^2 \leq (1 + \epsilon) \|x\|^2 \ \text{ for all } x \in V \right] \geq 1 - \delta,$
where the probability is taken only over the choice of $\Pi$.

It is known that if $\Pi \in \mathbb{R}^{s \times m}$ has entries that are i.i.d. $N(0, 1/s)$ or i.i.d. uniform $\pm 1/\sqrt{s}$, then
$\Pi$ is a $(d, \epsilon, \delta)$-OSE for some choice
$s = O\!\left( \frac{d + \log(1/\delta)}{\epsilon^2} \right).$
This is not so useful for our application: Since this is a dense matrix, computing $\Pi A$ is still expensive.

Research on sparse JL transforms shows that if $\Pi \in \mathbb{R}^{s \times m}$ is chosen uniformly at random so that every column contains
exactly $k$ nonzero entries that are $\pm 1/\sqrt{k}$, then $\Pi$ is a $(d, \epsilon, \delta)$-OSE when we choose
$k = O\!\left( \frac{\log(d/\delta)}{\epsilon} \right) \quad \text{and} \quad s = O\!\left( \frac{d \log(d/\delta)}{\epsilon^2} \right).$

Using this in the Richardson iteration (see 6.1.3 in the LV book),
one obtains an algorithm for linear regression that runs in time $\tilde{O}\!\left( \operatorname{nnz}(A) + n^{\omega} \right)$,
where $\operatorname{nnz}(A)$ is the number of nonzero entries in $A$.
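A sketch of the sparse-embedding idea in its simplest form, with a single $\pm 1$ per column (a CountSketch-style map), used in the sketch-and-solve regime; the instance sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, s = 20000, 10, 2000
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

# Sparse embedding: each of the m coordinates is hashed to one of s rows with a
# random sign, so applying Pi to A costs O(nnz(A)).
rows = rng.integers(0, s, size=m)
signs = rng.choice([-1.0, 1.0], size=m)
SA = np.zeros((s, n))
Sb = np.zeros(s)
np.add.at(SA, rows, signs[:, None] * A)
np.add.at(Sb, rows, signs * b)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]       # exact least squares
x_sketch = np.linalg.lstsq(SA, Sb, rcond=None)[0]   # solve the sketched problem
```

The sketched solution attains a residual close to the optimal one while solving a system of $s \ll m$ rows.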

Tue, Nov 30
Stochastic gradient descent
LV Book 6.3


Let $\mathcal{D}$ be a distribution on convex functions $f : \mathbb{R}^n \to \mathbb{R}$, and suppose
that $F$ is the convex function given by
$F(x) = \mathbb{E}_{f \sim \mathcal{D}}\left[ f(x) \right].$
As usual, our goal is to find a minimizer
$x^* \in \arg\min_{x \in \mathbb{R}^n} F(x).$

One example is empirical risk minimization, where $F(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)$. If we tried to solve this with
gradient descent, computing $\nabla F$ at some point $x$ would require the $m$ gradient
computations $\nabla f_1(x), \ldots, \nabla f_m(x)$.

Stochastic gradient descent instead evaluates a single gradient at every step:
$x^{(k+1)} = x^{(k)} - \eta \, \nabla f_k(x^{(k)}), \qquad f_k \sim \mathcal{D} \ \text{independently}.$
Analysis of SGD

We will require the following fact: If $g$ is convex with $L$-Lipschitz gradients, then
$\|\nabla g(x) - \nabla g(y)\|^2 \leq 2L \left( g(x) - g(y) - \langle \nabla g(y), x - y \rangle \right).$
In other words, when $g(x)$ is close to the first-order approximation at $y$, it holds that
$\nabla g(x)$ and $\nabla g(y)$ are also close.
Taking $x = x^{(k)}$ and $y = x^*$ and then taking expectations over $f \sim \mathcal{D}$ gives
\begin{equation}\label{eq:Fgrad}
\mathbb{E}\left[ \|\nabla f(x^{(k)}) - \nabla f(x^*)\|^2 \right] \leq 2L \left( F(x^{(k)}) - F(x^*) \right),
\end{equation}
where we have used $\nabla F(x^*) = 0$.

Theorem: Suppose that every $f$ in the support of $\mathcal{D}$ is convex with $L$-Lipschitz gradients, and that $F$ is $\alpha$-strongly convex.
Define $\sigma^2 = \mathbb{E}\left[ \|\nabla f(x^*)\|^2 \right]$.
Then for any step size $\eta \leq \frac{1}{2L}$, the SGD iterates satisfy, for every $k \geq 0$,
the following guarantees:
\begin{equation}\label{eq:sgd1}
\mathbb{E}\left[ \|x^{(k)} - x^*\|^2 \right] \leq (1 - \eta \alpha)^k \, \|x^{(0)} - x^*\|^2 + \frac{2 \eta \sigma^2}{\alpha},
\end{equation}
\begin{equation}\label{eq:sgd2}
\frac{1}{k} \sum_{j=0}^{k-1} \mathbb{E}\left[ F(x^{(j)}) - F(x^*) \right] \leq \frac{\|x^{(0)} - x^*\|^2}{\eta k} + 2 \eta \sigma^2.
\end{equation}

Comparison with the empirical risk minimizer. To understand the quality of the SGD guarantees, suppose we sample
$f_1, \ldots, f_k \sim \mathcal{D}$ independently and then define
$\hat{x}_k = \arg\min_{x} \frac{1}{k} \sum_{j=1}^k f_j(x).$
The Cramér–Rao bound asserts that under the conditions of the preceding theorem, we have
$\mathbb{E}\left[ F(\hat{x}_k) - F(x^*) \right] \gtrsim \frac{\sigma^2}{\alpha k},$
thus we can think of error $\sigma^2/(\alpha k)$ as the gold standard.
By combining \eqref{eq:sgd1} and \eqref{eq:sgd2}, one obtains that for the average point $\bar{x} = \frac{1}{k} \sum_{j=0}^{k-1} x^{(j)}$,
SGD with a suitable choice of step size gives
$\mathbb{E}\left[ F(\bar{x}) - F(x^*) \right] \lesssim \frac{\sigma^2}{\alpha k},$
up to a term that decays exponentially in $k$.
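A minimal SGD sketch on least squares, where $f_i(x) = \frac{1}{2}(\langle a_i, x \rangle - b_i)^2$; the data, seed, and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 1000
A = rng.standard_normal((m, n))
b = A @ np.ones(n) + 0.5 * rng.standard_normal(m)

# F(x) = (1/m) sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2; each step
# uses the single stochastic gradient (a_i^T x - b_i) a_i for a random index i.
x = np.zeros(n)
eta = 0.005
for _ in range(40000):
    i = rng.integers(m)
    x = x - eta * (A[i] @ x - b[i]) * A[i]

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

With a constant step size the iterates settle into a noise ball of radius proportional to $\sqrt{\eta \sigma^2 / \alpha}$ around $x^*$, matching \eqref{eq:sgd1}.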

Thu, Dec 02
Variance reduction and coordinate descent
LV Book 6.3

Let us now assume that $F(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)$ (so the distribution $\mathcal{D}$ is uniform on $\{f_1, \ldots, f_m\}$).
Notice the dependence of \eqref{eq:sgd1} and \eqref{eq:sgd2} on the variance $\sigma^2$. To remove this dependence,
we consider a variance reduction technique. For any $f$ and $\tilde{x} \in \mathbb{R}^n$, define $\tilde{f}$ by
\begin{equation}\label{eq:tildef}
\nabla \tilde{f}(x) = \nabla f(x)  \nabla f(\tilde{x}) + \nabla F(\tilde{x}).
\end{equation}
Note that $\mathbb{E}\left[ \nabla \tilde{f}(x) \right] = \nabla F(x)$. From \eqref{eq:Fgrad}, one can also conclude that
$\mathbb{E}\left[ \|\nabla \tilde{f}(x^*)\|^2 \right] \leq 2L \left( F(\tilde{x}) - F(x^*) \right).$
This suggests a natural algorithm that runs SGD in phases, where in each phase we replace $\tilde{x}$ by progressively
better approximations to $x^*$ (so that the variance keeps reducing).

SGD phase($\tilde{x}$, $x^{(0)}$, $\eta$, $T$): Given points $\tilde{x}$ and $x^{(0)}$, we run for $T$ steps
$x^{(k+1)} = x^{(k)} - \eta \, \nabla \tilde{f}_k(x^{(k)}), \qquad f_k \sim \mathcal{D},$
using the functions $\tilde{f}_k$, where $\tilde{f}$ is defined in \eqref{eq:tildef}.
Output $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}$.
Note that the SGD phase requires $m + 2T$ gradient computations, where $m$ gradient computations are needed for $\nabla F(\tilde{x})$.

SVRG (Stochastic Variance Reduced Gradient): Starting from some $\tilde{x}_0$, set $\tilde{x}_{s+1} = \text{SGD phase}(\tilde{x}_s, \tilde{x}_s, \eta, T)$ for $s = 0, 1, 2, \ldots$

Theorem (SVRG): Suppose $f_1, \ldots, f_m$ have $L$-Lipschitz gradients and $F$ is $\alpha$-strongly convex. Define $\kappa = L/\alpha$ and take $\eta = \Theta(1/L)$ and $T = \Theta(\kappa)$.
Then for every $s \geq 0$, the output of SVRG satisfies
$\mathbb{E}\left[ F(\tilde{x}_s) - F(x^*) \right] \leq 2^{-s} \left( F(\tilde{x}_0) - F(x^*) \right).$

Note that to get a solution with error $\epsilon$ requires $O\left( (m + \kappa) \log(1/\epsilon) \right)$ gradient computations.
Our analysis of regular gradient descent required $O\left( \kappa \log(1/\epsilon) \right)$ iterations, which requires
$O\left( m \kappa \log(1/\epsilon) \right)$ gradient computations.
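A minimal sketch of SVRG on a realizable least-squares instance; the data, step size, and phase lengths are illustrative assumptions (and the phase output is taken to be the last iterate rather than the average):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 200
A = rng.standard_normal((m, n))
b = A @ np.ones(n)                       # realizable instance: F(x*) = 0

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of f_i
grad_F = lambda x: A.T @ (A @ x - b) / m         # full gradient of F

eta, T = 0.01, 2000
x_tilde = np.zeros(n)
for _ in range(40):                      # outer phases
    mu = grad_F(x_tilde)                 # m gradients, computed at the snapshot
    x = x_tilde.copy()
    for _ in range(T):
        i = rng.integers(m)
        # variance-reduced stochastic gradient: grad f_i(x) - grad f_i(snapshot) + grad F(snapshot)
        x = x - eta * (grad_i(x, i) - grad_i(x_tilde, i) + mu)
    x_tilde = x

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```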

Acceleration 
Tue, Dec 07
Chebyshev polynomials
LV Book 7.1

 Chebyshev polynomials (of the first kind): $T_n(\cos \theta) = \cos(n \theta)$

Existence of a degree $d$ polynomial $p$ satisfying $p(0) = 1$ and $|p(x)| \leq \varepsilon$ for all $x \in [\mu, L]$,
where $d = O\left( \sqrt{L/\mu} \, \log(1/\varepsilon) \right)$ and $0 < \mu \leq L$.

For every $s \geq 1$, existence of a polynomial $p$ of degree $O\left( \sqrt{s \log(1/\varepsilon)} \right)$ satisfying
$\max_{x \in [-1,1]} \left| p(x) - x^s \right| \leq \varepsilon.$
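The mechanism behind both constructions is that $T_n$ stays bounded by 1 on $[-1,1]$ while growing rapidly outside; this can be explored numerically via the three-term recurrence $T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)$:

```python
import numpy as np

def cheb(n, x):
    # T_n via the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)
    t_prev, t_cur = np.ones_like(x), x
    for _ in range(n):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_prev                       # after n steps this is T_n(x)

xs = np.linspace(-1.0, 1.0, 1001)
on_interval = np.max(np.abs(cheb(8, xs)))     # stays bounded by 1 on [-1, 1]
outside = cheb(8, np.array([1.5]))[0]         # grows exponentially in the degree
```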

Thu, Dec 09
Conjugate gradient
LV Book 7.2


Consider the function $f(x) = \frac{1}{2} \langle x, A x \rangle - \langle b, x \rangle$ where $A$ is a PSD matrix. We have
$\nabla f(x) = A x - b.$

Any sequence $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$ satisfying the invariant
$x^{(k+1)} \in x^{(0)} + \operatorname{span}\left\{ \nabla f(x^{(0)}), \nabla f(x^{(1)}), \ldots, \nabla f(x^{(k)}) \right\}$
has the property that
$x^{(k)} \in x^{(0)} + \operatorname{span}\left\{ r_0, A r_0, A^2 r_0, \ldots, A^{k-1} r_0 \right\}, \qquad r_0 = b - A x^{(0)}.$
Define this subspace to be $\mathcal{K}_k$. This is called the $k$th Krylov subspace.

We can consider the best such sequence given by
$x^{(k)} = \arg\min \left\{ f(x) : x \in x^{(0)} + \mathcal{K}_k \right\}.$
Using the fact that $x^{(k)} - x^* = p_k(A) \left( x^{(0)} - x^* \right)$ for some degree-$k$ polynomial $p_k$ with $p_k(0) = 1$, one can use the existence of Chebyshev polynomials to show that
$\|x^{(k)} - x^*\|_A \leq 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x^{(0)} - x^*\|_A,$
where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$ is the condition number of $A$.
Note the $e^{-\Theta(k/\sqrt{\kappa})}$ factor decay in the error, vs. $e^{-\Theta(k/\kappa)}$
for the Richardson iteration.

We can compute the sequence using the conjugate gradient method.
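A minimal numpy implementation of conjugate gradient; the random positive definite instance is an illustrative assumption:

```python
import numpy as np

def conjugate_gradient(A, b, iters):
    """Standard CG for positive definite A; step k minimizes f over x0 + K_k."""
    x = np.zeros_like(b)
    r = b - A @ x                        # residual = -grad f(x)
    p = r.copy()
    for _ in range(iters):
        rr = r @ r
        if rr < 1e-30:                   # already converged
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)            # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        p = r + ((r @ r) / rr) * p       # new A-conjugate search direction
    return x

rng = np.random.default_rng(5)
M = rng.standard_normal((30, 30))
A = M @ M.T + np.eye(30)                 # positive definite
b = rng.standard_normal(30)
x = conjugate_gradient(A, b, iters=100)
```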

For an introduction to accelerated gradient descent for general convex functions, see
Aaron Sidford’s lecture notes.
