CSE 535: Theory of Optimization and Continuous Algorithms

Autumn 2021

Tues, Thurs 11:30am-12:50pm in CSE2 G04

Instructor: James R. Lee

Office hours: Tues 3-4pm in CSE 574
(or Zoom by appointment)

Teaching assistant(s):

Course email list [archives]

Class discussion: CSE 535 Piazza

Course evaluation: 100% Homework

  • Homework #0 [Due: DO NOT TURN IN]
  • Homework #1 (gradescope link) [Due: Fri, Oct 22]
  • Homework #2 (gradescope link) [Due: Fri, Nov 5]
  • Homework #3 (gradescope link) [Due: Fri, Nov 19]
  • Homework #4 (gradescope link) [Due: Sun, Dec 12]
Course description:

    The design of algorithms is traditionally a discrete endeavor. However, many advances have come from a continuous viewpoint. Typically, a continuous process, deterministic or randomized, is designed (or shown) to have desirable properties, such as approaching an optimal solution or a desired distribution, and an algorithm is derived from this by appropriate discretization. In interesting and general settings, the current fastest methods are a consequence of this perspective. We will discuss several examples of algorithms for high-dimensional optimization and sampling.

    Lectures

    Introduction
    Thu, Sep 30
    Intro to convexity and optimization
    LV Book: 1.1-1.5
    Homework #0
    • Convex sets, convex functions, convex optimization
    • If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then local optimality implies global optimality.
    • This holds because the following are equivalent:

      • $f : \mathbb{R}^n \to \mathbb{R}$ convex
      • $f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle$ for all $x,y \in \mathbb{R}^n$.
    • If $f$ is $\mathcal{C}^2$ (i.e., it has continuous second-order partial derivatives), then the Hessian matrix $\nabla^2 f(x)$ is defined for every $x \in \mathbb{R}^n$:

      $\big[\nabla^2 f(x)\big]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x).$

    • To check convexity, often it is useful to employ the following:

      Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}^2$, then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is positive semidefinite for all $x \in \mathbb{R}^n$.

    • As a simple example, consider the bivariate function $f(u,v) = u \log(u/v)$ defined on the positive orthant $u,v > 0$. Then we have

      $\nabla^2 f(u,v) = \begin{pmatrix} 1/u & -1/v \\ -1/v & u/v^2 \end{pmatrix}.$

      The trace of this matrix is $1/u + u/v^2 > 0$ for $u,v > 0$. The determinant of this matrix is $1/v^2 - 1/v^2 = 0$ for $u,v > 0$. Therefore if $\lambda_1,\lambda_2$ are the eigenvalues of $\nabla^2 f(u,v)$, it holds that $\lambda_1 + \lambda_2 > 0$ and $\lambda_1 \lambda_2 = 0$. The only possibility is that both eigenvalues are nonnegative, meaning that $\nabla^2 f(u,v)$ is positive semidefinite on the positive orthant.
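    A quick numerical check of this example (a minimal sketch; the test points are arbitrary):

```python
import numpy as np

def hessian(u, v):
    # Hessian of f(u, v) = u * log(u / v) on the positive orthant.
    return np.array([[1.0 / u, -1.0 / v],
                     [-1.0 / v, u / v**2]])

# Check positive semidefiniteness at a few random points u, v > 0.
rng = np.random.default_rng(0)
for _ in range(5):
    u, v = rng.uniform(0.1, 10.0, size=2)
    eigs = np.linalg.eigvalsh(hessian(u, v))
    print(f"u={u:.2f}, v={v:.2f}, eigenvalues={eigs}")  # both >= 0 (up to float error)
```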

    Tue, Oct 05
    Gradient descent I
    LV Book: 2.1-2.3
    • Gradient descent: $f : \mathbb{R}^n \to \mathbb{R}$, parameters $\eta, \epsilon > 0$, initial point $x^{(0)} \in \mathbb{R}^n$. For $k = 0, 1, 2, \ldots$:

      • If $\|\nabla f(x^{(k)})\| \leq \epsilon$, stop and output $x^{(k)}$.
      • Else update $x^{(k+1)} = x^{(k)} - \eta\, \nabla f(x^{(k)})$.
    • A function $f$ has $L$-Lipschitz gradients if

      $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \textrm{for all } x, y \in \mathbb{R}^n.$

    • The following are equivalent:

      • $f$ has $L$-Lipschitz gradients.
      • $\|\nabla^2 f(x)\|_{op} \leq L$ for all $x \in \mathbb{R}^n$ (when $f$ is $\mathcal{C}^2$).
      • For all $x, y \in \mathbb{R}^n$,

        $\big|f(y) - f(x) - \langle \nabla f(x), y - x\rangle\big| \leq \frac{L}{2} \|y - x\|^2.$

    • Theorem: If $f$ has $L$-Lipschitz gradients, then gradient descent with step size $\eta = 1/L$ stops within $\frac{2L\,(f(x^{(0)}) - \min f)}{\epsilon^2}$ steps.

      Note that in this case we have no guarantee of optimality, only that $\|\nabla f(x^{(k)})\| \leq \epsilon$ at the output point.

    • Theorem: If $f$ is convex with $L$-Lipschitz gradients, then gradient descent with step size $\eta = 1/L$ satisfies

      $f\big(x^{(k)}\big) - f(x^{\star}) \leq \frac{L\, \|x^{(0)} - x^{\star}\|^2}{2k} \quad \textrm{for any minimizer } x^{\star}.$
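    A minimal sketch of the gradient descent loop above (the quadratic test function and parameters are illustrative, assuming the standard step size $\eta = 1/L$):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, eps, max_iters=10_000):
    """Run x <- x - eta * grad_f(x) until ||grad_f(x)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - eta * g
    return x

# Example: f(x) = 1/2 ||Ax - b||^2 has L-Lipschitz gradients with L = ||A^T A||.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)
x = gradient_descent(grad, np.zeros(2), eta=1.0 / L, eps=1e-8)
print(x)  # close to the minimizer [0.5, 1.0]
```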

    Thu, Oct 07
    Gradient descent II
    LV Book: 2.3-2.7
    Homework #1
    • A function $f$ is $\alpha$-strongly convex if for any $x, y \in \mathbb{R}^n$,

      $f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2} \|y - x\|^2.$

    • For $\alpha > 0$ and differentiable $f$, the following are equivalent:

      • $f$ is $\alpha$-strongly convex

      • For all $x, y \in \mathbb{R}^n$, it holds that

        $\langle \nabla f(x) - \nabla f(y), x - y\rangle \geq \alpha\, \|x - y\|^2.$

      • $\nabla^2 f(x) \succeq \alpha I$ for all $x \in \mathbb{R}^n$ (when $f$ is $\mathcal{C}^2$).

    • Theorem: Suppose that $f$ is $\alpha$-strongly convex and has $L$-Lipschitz gradients. Then gradient descent with step size $\eta = 1/L$ satisfies

      $\big\|x^{(k)} - x^{\star}\big\|^2 \leq \Big(1 - \frac{\alpha}{L}\Big)^{k}\, \big\|x^{(0)} - x^{\star}\big\|^2.$

    • Gradient descent is a discretization of the continuous algorithm (gradient flow):

      $\frac{d}{dt}\, x(t) = -\nabla f(x(t)).$

      Note that:

      $\frac{d}{dt}\, f(x(t)) = \langle \nabla f(x(t)), \dot{x}(t)\rangle = -\|\nabla f(x(t))\|^2.$

      And if $f$ is $\alpha$-strongly convex, then

      $\|\nabla f(x)\|^2 \geq 2\alpha\, \big(f(x) - f(x^{\star})\big),$

      hence

      $f(x(t)) - f(x^{\star}) \leq e^{-2\alpha t}\, \big(f(x(0)) - f(x^{\star})\big).$

      (A numerical sketch of this discretization appears below.)
    • Line search and the Wolfe conditions (Sec 2.5)
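    To see the discretization point concretely, here is a sketch comparing Euler integration of gradient flow with the decay bound above, on a strongly convex quadratic (all choices illustrative):

```python
import numpy as np

# f(x) = 1/2 <x, Qx> is alpha-strongly convex with alpha = lambda_min(Q) = 1.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x0 = np.array([1.0, 1.0])
x, dt = x0.copy(), 1e-3
for _ in range(int(1.0 / dt)):   # Euler discretization of dx/dt = -grad f(x), up to t = 1
    x = x - dt * grad(x)

alpha, t = 1.0, 1.0
print(f(x), np.exp(-2 * alpha * t) * f(x0))   # f(x(t)) <= exp(-2*alpha*t) * f(x(0))
```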

    Elimination
    Tue, Oct 12
    Cutting plane methods, the Ellipsoid algorithm
    LV Book: 3.1-3.2
    • The hyperplane separation theorem: If $K \subseteq \mathbb{R}^n$ is a closed convex set and $x \notin K$, then there is a direction $v \in \mathbb{R}^n$ such that

      $\langle v, x\rangle > \sup_{y \in K}\, \langle v, y\rangle.$

      Said differently, there is an $(n-1)$-dimensional hyperplane that separates $x$ from $K$.

    • Cutting plane framework
    • Affine geometry of ellipsoids
    • Ellipsoid algorithm and analysis
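    A minimal sketch of one ellipsoid update step, for $E = \{x : (x-c)^{\top} Q^{-1} (x-c) \leq 1\}$ cut by the halfspace $\{x : \langle a, x\rangle \leq \langle a, c\rangle\}$ (standard formulas; the variable names are mine):

```python
import numpy as np

def ellipsoid_step(c, Q, a):
    """Replace E = {(x-c)^T Q^{-1} (x-c) <= 1} by the smallest ellipsoid
    containing E intersected with {x : <a, x> <= <a, c>}."""
    n = len(c)
    Qa = Q @ a
    g = Qa / np.sqrt(a @ Qa)                  # normalized direction
    c_new = c - g / (n + 1)
    Q_new = (n**2 / (n**2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(g, g))
    return c_new, Q_new

# The volume shrinks by a factor exp(-1/(2(n+1))) per step, roughly e^{-1/(2n)}.
c, Q = np.zeros(2), np.eye(2)
c, Q = ellipsoid_step(c, Q, np.array([1.0, 0.0]))
print(c, np.linalg.det(Q))
```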
    Thu, Oct 14
    NO CLASS
    Tue, Oct 19
    From volume to function value
    LV Book: 3.2-3.3
    • Using ellipsoid to minimize non-smooth convex functions

    • Volume shrinkage yields small function value

      The main result here is that if we are given a convex function $f$ and a bounding box $B$ containing some minimizer $x^{\star}$, then after roughly $2n^2 \ln \frac{1}{\epsilon}$ iterations of the ellipsoid algorithm, we obtain a point $\bar{x}$ with

      $f(\bar{x}) - f(x^{\star}) \leq \epsilon\, \big(\max_{x \in B} f(x) - f(x^{\star})\big).$

    • Application to Linear Programming

      A linear program is an optimization of the form: Given inputs $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$,

      $\min\,\{\langle c, x\rangle : x \in \mathbb{R}^n,\ Ax \leq b\}.$

      The feasible region is the set $P = \{x \in \mathbb{R}^n : Ax \leq b\}$. Note that $Ax \leq b$ is shorthand for the $m$ linear inequalities $\langle a_i, x\rangle \leq b_i$ for $i = 1, \ldots, m$, where $a_1, \ldots, a_m$ are the rows of $A$.

    Thu, Oct 21
    Center of gravity
    LV Book: 3.4
    Homework #2
    • Exact center of gravity converges to an $\varepsilon$-approximate minimum in $O\big(n \log \frac{1}{\varepsilon}\big)$ iterations

      Compare this to the $O\big(n^2 \log \frac{1}{\varepsilon}\big)$ iterations used by the Ellipsoid Algorithm.

    • Analysis using Grünbaum's Theorem: If $K \subseteq \mathbb{R}^n$ is a compact, convex set and $z$ is the center of gravity of $K$:

      $z = \frac{1}{\mathrm{vol}(K)} \int_K x \, dx,$

      then for any halfspace $H$ containing $z$, it holds that

      $\mathrm{vol}(K \cap H) \geq \frac{1}{e}\, \mathrm{vol}(K).$

    • Approximate center of gravity suffices: Robust Grünbaum Theorem

    • Logconcave distributions

      A measure $\mu$ on $\mathbb{R}^n$ is log-concave if it holds that for all measurable sets $A, B \subseteq \mathbb{R}^n$ and $\lambda \in [0,1]$, we have

      $\mu\big(\lambda A + (1-\lambda) B\big) \geq \mu(A)^{\lambda}\, \mu(B)^{1-\lambda}.$

      Every (full-dimensional) log-concave measure arises from a density $p : \mathbb{R}^n \to [0,\infty)$ for which $\mu(A) = \int_A p(x)\,dx$, and such that the function $\log p$ is concave. Equivalently, $p = e^{-f}$ for some convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$.

    • Isotropic distributions

      A probability distribution $\mu$ on $\mathbb{R}^n$ is isotropic if $\mathbb{E}[X] = 0$ and $\mathbb{E}\big[X X^{\top}\big] = I$, where $X$ is a random vector with law $\mu$.

    Reductions
    Tue, Oct 26
    Oracles and equivalences
    LV Book: 4.1, 4.6
    • Different oracles for accessing a convex set $K \subseteq \mathbb{R}^n$ (e.g., membership and separation oracles)

    • Oracles for accessing a convex function $f : \mathbb{R}^n \to \mathbb{R}$: Evaluation oracle, subgradient oracle

    • The convex conjugate of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is defined by

      $f^*(y) = \sup_{x \in \mathbb{R}^n} \big(\langle x, y\rangle - f(x)\big).$

      This function is convex, as it is the maximum of the family of affine functions $y \mapsto \langle x, y\rangle - f(x)$, one for every fixed $x$.

    • We can compute a subgradient of $f^*$ at $y$ via any maximizer: if $x^{\star} \in \operatorname*{argmax}_{x} \big(\langle x, y\rangle - f(x)\big)$, then $x^{\star} \in \partial f^*(y)$.

    • The epigraph of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is the set

      $\mathrm{epi}(f) = \big\{(x, t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x)\big\}.$

      This is the set of points lying above the graph of $f$. The function $f$ is convex if and only if $\mathrm{epi}(f)$ is a convex set.

    • The involution property: For any convex function $f$ whose epigraph is closed, it holds that $(f^*)^* = f$.

    • We can consider the set of all linear lower bounds on a function $f$: All those pairs $(y, c)$ with $y \in \mathbb{R}^n$ and $c \in \mathbb{R}$ such that

      $f(x) \geq \langle y, x\rangle - c \quad \textrm{for all } x \in \mathbb{R}^n.$

      For a fixed $y$, the value of $c$ giving the best lower bound is precisely $c = f^*(y)$.

      Similarly, at a fixed point $x$, we can consider all the lower bounds

      $f(x) \geq \langle y, x\rangle - f^*(y), \qquad y \in \mathbb{R}^n.$

      By definition, the best such lower bound is given by $\sup_{y} \big(\langle y, x\rangle - f^*(y)\big) = (f^*)^*(x)$, hence $f(x) \geq (f^*)^*(x)$. The involution property ensures that when the epigraph of $f$ is closed, this family of lower bounds is tight in the sense that $f(x) = (f^*)^*(x)$ for all $x$.

    • Computational equivalences of the subgradient oracles for $f$ and $f^*$ using cutting planes and the involution property
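    A small numeric illustration of the conjugate and the involution property, approximating the suprema over a grid (the grid and test function are illustrative):

```python
import numpy as np

xs = np.linspace(-5, 5, 2001)        # grid standing in for R
f = xs**2 / 2.0                      # f(x) = x^2 / 2 satisfies f* = f

def conjugate(vals, grid, ys):
    # f*(y) = sup_x (x*y - f(x)), approximated over the grid.
    return np.array([np.max(grid * y - vals) for y in ys])

ys = np.linspace(-3, 3, 7)
print(conjugate(f, xs, ys))                 # approx y^2 / 2
# Involution: (f*)* = f for closed convex f.
print(conjugate(conjugate(f, xs, xs), xs, ys))  # approx y^2 / 2 again
```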

    Thu, Oct 28
    Duality
    LV Book: 4.2-4.4
    • Sion’s Minimax Theorem: If $X$ is a compact convex set, $Y$ is a convex set, and $g(x,y)$ is quasi-convex and lower semicontinuous in $x$ and quasi-concave and upper semicontinuous in $y$, then

      $\min_{x \in X} \sup_{y \in Y} g(x,y) = \sup_{y \in Y} \min_{x \in X} g(x,y).$

      Definitions of quasi-convex, quasi-concave, and lower/upper semicontinuous functions.

    • Application to a function decomposed as $f(x) = \sup_{y \in Y} g(x, y)$, where it gives $\min_{x \in X} f(x) = \sup_{y \in Y} \min_{x \in X} g(x, y)$

    • Linear programming duality
    • SDP duality: Equivalence of the two optimizations

      $\min\,\big\{\langle C, X\rangle : \langle A_i, X\rangle = b_i \ \textrm{ for } i = 1, \ldots, m,\ X \succeq 0\big\}$

      and

      $\max\,\big\{\langle b, y\rangle : y \in \mathbb{R}^m,\ \textstyle\sum_{i=1}^m y_i A_i \preceq C\big\}.$

      Here we take

      $g(X, y) = \langle C, X\rangle + \sum_{i=1}^m y_i \big(b_i - \langle A_i, X\rangle\big) + \iota_{\mathcal{S}}(X),$

      where we recall that for a subset $\mathcal{S}$, we define $\iota_{\mathcal{S}}(X) = 0$ if $X \in \mathcal{S}$ and $\iota_{\mathcal{S}}(X) = +\infty$ if $X \notin \mathcal{S}$.

      And then we take $\mathcal{S} = \{X : X \succeq 0\}$ and apply the minimax theorem to $g$.

    • Separation oracle for the PSD constraint $X \succeq 0$.

      Define $g(X) = \lambda_{\min}(X)$. Note that $X$ is the underlying variable, so the feasible region corresponding to this constraint is $\{X : \lambda_{\min}(X) \geq 0\}$.

      Compute a unit eigenvector $u$ for the minimum eigenvalue of $X_0$. If the constraint is violated at $X_0$, it must be that

      $\langle u u^{\top}, X_0\rangle = u^{\top} X_0\, u = \lambda_{\min}(X_0) < 0.$

      On the other hand, for any feasible $X$ and any vector $u$, it holds that

      $\langle u u^{\top}, X\rangle = u^{\top} X u \geq 0.$

      Hence if we define the matrix $H$ by $H = u u^{\top}$, then $\langle H, \cdot\rangle$ gives a hyperplane that separates $X_0$ from $\{X : X \succeq 0\}$, in the sense that

      $\langle H, X\rangle \geq 0 > \langle H, X_0\rangle \quad \textrm{for all } X \succeq 0.$
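    A sketch of this separation oracle using NumPy's symmetric eigensolver (a minimal illustration of the argument above):

```python
import numpy as np

def psd_separation_oracle(X0):
    """If X0 is not PSD, return H = u u^T with <H, X0> < 0 while <H, X> >= 0
    for every PSD X.  Returns None if X0 is PSD (feasible)."""
    eigvals, eigvecs = np.linalg.eigh(X0)    # ascending eigenvalues
    if eigvals[0] >= 0:
        return None
    u = eigvecs[:, 0]                        # eigenvector of minimum eigenvalue
    return np.outer(u, u)                    # separating hyperplane <uu^T, .>

X0 = np.array([[1.0, 2.0], [2.0, 1.0]])      # eigenvalues 3 and -1: not PSD
H = psd_separation_oracle(X0)
print(np.sum(H * X0))                        # = lambda_min(X0) = -1 < 0
```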

    Geometry and Optimization
    Tue, Nov 02
    Projected subgradient descent
    LV Book: 5.1
    • The collection of subgradients of $f$ at $x$:

      $\partial f(x) = \big\{g \in \mathbb{R}^n : f(y) \geq f(x) + \langle g, y - x\rangle \ \textrm{ for all } y \in \mathbb{R}^n\big\}$

    • If $f$ has continuous partial derivatives at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

    • Example: Show that if $f : \mathbb{R} \to \mathbb{R}$ is defined by $f(x) = |x|$, then $\partial f(0) = [-1, 1]$. If you graph $f$, this corresponds to the fact that all the lines through the origin of slope in the interval $[-1,1]$ lie below the graph of $f$.

    • Projected Subgradient Descent:

      • $y^{(k+1)} = x^{(k)} - \eta\, g^{(k)}$ for some $g^{(k)} \in \partial f(x^{(k)})$ (subgradient step)
      • $x^{(k+1)} = \Pi_K\big(y^{(k+1)}\big)$ (projection step)

      After $T$ iterations, we output the average point

      $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)}.$

    • Pythagorean property for the Euclidean projection $\Pi_K(y) = \operatorname*{argmin}_{x \in K} \|x - y\|$: For all $x \in K$, $\|\Pi_K(y) - x\| \leq \|y - x\|$.

    • Theorem: If $K$ is closed and convex, and $f : K \to \mathbb{R}$ is a convex, $L$-Lipschitz function, then after $T$ iterations with step size $\eta = \frac{\|x^{(1)} - x^{\star}\|}{L \sqrt{T}}$, the output $\bar{x}$ of the subgradient descent algorithm satisfies, for any minimizer $x^{\star} \in K$,

      $f(\bar{x}) - f(x^{\star}) \leq \frac{L\, \|x^{(1)} - x^{\star}\|}{\sqrt{T}}.$
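    A minimal sketch of projected subgradient descent; the objective, the ball constraint, and the parameters are illustrative:

```python
import numpy as np

def projected_subgradient(subgrad, proj, x0, eta, T):
    """Projected subgradient descent; returns the average iterate."""
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        x = proj(x - eta * subgrad(x))   # subgradient step + projection
        avg += x / T
    return avg

# Example: minimize f(x) = ||x - p||_1 over the unit Euclidean ball.
p = np.array([2.0, -2.0])
subgrad = lambda x: np.sign(x - p)                 # a subgradient of f
proj = lambda y: y / max(1.0, np.linalg.norm(y))   # projection onto the ball
xbar = projected_subgradient(subgrad, proj, np.zeros(2), eta=0.01, T=5000)
print(xbar)   # close to the minimizer (1, -1)/sqrt(2)
```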

    Thu, Nov 04
    Mirror maps and Bregman divergences
    LV Book: 5.1.3-5.1.4
    Homework #3
    • Setup: $K \subseteq \mathbb{R}^n$ is closed and convex, $f : K \to \mathbb{R}$ is convex.
    • Mirror map: A strongly convex function $\Phi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$.
    • Associated Bregman divergence:

      $D_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y\rangle.$

    • Bregman projection onto $K$:

      $\Pi_K^{\Phi}(y) = \operatorname*{argmin}_{x \in K}\, D_{\Phi}(x, y).$

    • Mirror descent iteration:

      Find $y^{(k+1)}$ such that $\nabla \Phi\big(y^{(k+1)}\big) = \nabla \Phi\big(x^{(k)}\big) - \eta\, \nabla f\big(x^{(k)}\big)$ and then $x^{(k+1)} = \Pi_K^{\Phi}\big(y^{(k+1)}\big)$.

    • Equivalently:

      $x^{(k+1)} = \operatorname*{argmin}_{x \in K} \Big(\eta\, \big\langle \nabla f\big(x^{(k)}\big), x\big\rangle + D_{\Phi}\big(x, x^{(k)}\big)\Big).$

      Recall that if $f$ is non-smooth, $\nabla f(x^{(k)})$ can be replaced by any subgradient of $f$ at $x^{(k)}$.

    • The map $\Phi$ is said to be $\alpha$-strongly convex on $K$ wrt a norm $\|\cdot\|$ if

      $\Phi(y) \geq \Phi(x) + \langle \nabla \Phi(x), y - x\rangle + \frac{\alpha}{2} \|y - x\|^2 \quad \textrm{for all } x, y \in K.$

    • Theorem: If $f$ is convex and $L$-Lipschitz wrt a norm $\|\cdot\|$, and $\Phi$ is $\alpha$-strongly convex on $K$ wrt $\|\cdot\|$, then after $T$ iterations mirror descent outputs the point

      $\bar{x} = \frac{1}{T} \sum_{k=1}^{T} x^{(k)},$

      and it satisfies

      $f(\bar{x}) - \min_{x \in K} f(x) \leq \frac{R^2}{\eta\, T} + \frac{\eta\, L^2}{2\alpha},$

      where

      $R^2 = \sup_{x \in K} D_{\Phi}\big(x, x^{(1)}\big)$

      and $x^{(1)}$ is the initial point.

    • If we choose $\eta = \frac{R}{L} \sqrt{\frac{2\alpha}{T}}$, then this bound is

      $f(\bar{x}) - \min_{x \in K} f(x) \leq R L \sqrt{\frac{2}{\alpha T}}.$

    • Examples include the Euclidean ball with $\Phi(x) = \frac{1}{2} \|x\|_2^2$ and the probability simplex with $\Phi(x) = \sum_{i=1}^n x_i \log x_i$.
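    For the simplex example, mirror descent with the entropy mirror map becomes a multiplicative (exponentiated-gradient) update. A minimal sketch; the linear objective is illustrative:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta, T):
    """Mirror descent on the probability simplex with the entropy mirror map
    Phi(x) = sum_i x_i log x_i; the resulting update is multiplicative."""
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        x = x * np.exp(-eta * grad(x))   # gradient step in the mirror (dual) space
        x /= x.sum()                     # Bregman projection onto the simplex
        avg += x / T
    return avg

# Example: minimize <c, x> over the simplex; mass concentrates on argmin(c).
c = np.array([0.9, 0.1, 0.5])
xbar = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, eta=0.1, T=2000)
print(xbar)   # most mass on coordinate 1 (the smallest entry of c)
```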
    Tue, Nov 09
    Newton's Method
    LV Book: 5.3
    • For a function $F : \mathbb{R}^n \to \mathbb{R}^n$, Newton’s method attempts to find a point $x \in \mathbb{R}^n$ where $F(x) = 0$.

    • This is done using the best linear approximation to $F$ at the current iterate $x^{(k)}$:

      $F(x) \approx F\big(x^{(k)}\big) + DF\big(x^{(k)}\big)\, \big(x - x^{(k)}\big),$

      where $DF(x^{(k)})$ is the Jacobian of $F$ at $x^{(k)}$.

    • We then set this approximation to zero to calculate the next iterate:

      $x^{(k+1)} = x^{(k)} - DF\big(x^{(k)}\big)^{-1} F\big(x^{(k)}\big).$

    • We can (try to) find local minima of a function $f : \mathbb{R}^n \to \mathbb{R}$ by taking $F = \nabla f$. In this case, the update rule takes the form

      $x^{(k+1)} = x^{(k)} - \big(\nabla^2 f(x^{(k)})\big)^{-1} \nabla f\big(x^{(k)}\big).$

      Note that the Newton iterate has the nice feature that the algorithm is invariant under affine transformations. Note also one of the major drawbacks: A single iterate requires inverting the Hessian of $f$.

    • Suppose $p$ is a univariate polynomial with all real roots $r_1 \leq r_2 \leq \cdots \leq r_n$ and $x^{(0)} > r_n$. Then Newton’s method converges: $x^{(k)} \to r_n$ and, moreover, one has the rate

      $x^{(k+1)} - r_n \leq \Big(1 - \frac{1}{n}\Big)\, \big(x^{(k)} - r_n\big).$

    • In general, Newton’s method need not converge to a zero of $F$, but when it does, the rate of convergence is (eventually) quadratic.

      (Local quadratic convergence) If $F$ is $\mathcal{C}^2$, and $x^{(k)} \to x^*$ for some $x^*$ satisfying $F(x^*) = 0$, and moreover $DF(x^*)$ is invertible, then there is a number $C > 0$ such that for $k$ sufficiently large,

      $\big\|x^{(k+1)} - x^*\big\| \leq C\, \big\|x^{(k)} - x^*\big\|^2.$

    • Newton’s formula will always converge to a zero $x^*$ as long as $\|x^{(0)} - x^*\|$ is sufficiently small, where $F$ and $DF$ are suitably nondegenerate around $x^*$. One can find here a proof that Newton’s method converges quadratically to a zero as long as $\|x^{(0)} - x^*\|$ is sufficiently small. One can see the preceding link for a discussion of basins of attraction and failure cases.
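    A minimal sketch of the Newton iteration for root finding; the two-dimensional test system is illustrative:

```python
import numpy as np

def newton(F, JF, x0, iters=20):
    """Newton's method for F(x) = 0:  x <- x - JF(x)^{-1} F(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(JF(x), F(x))
    return x

# Example: solve x^2 + y^2 = 4 and x*y = 1 simultaneously.
F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0] * v[1] - 1.0])
JF = lambda v: np.array([[2 * v[0], 2 * v[1]], [v[1], v[0]]])  # Jacobian of F
root = newton(F, JF, [2.0, 0.5])
print(root, F(root))   # F(root) is essentially 0; convergence is quadratic
```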

    Tue, Nov 16
    Interior point methods I
    LV Book: 5.4

    Consider $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$, and the corresponding linear program in slack form:

    $\min\,\{\langle c, x\rangle : Ax = b,\ x \geq 0\},$

    and its dual

    $\max\,\{\langle b, y\rangle : A^{\top} y + s = c,\ s \geq 0\}.$

    For a primal-feasible $x$ and dual-feasible $(y, s)$, we always have

    $\langle c, x\rangle - \langle b, y\rangle = \langle x, s\rangle \geq 0.$

    The duality gap for such a pair is the quantity $\langle x, s\rangle$. Linear programming duality tells us that $\langle x, s\rangle = 0$ if and only if $x$ and $(y, s)$ are an optimal primal-dual pair.

    In the optimization $\min\{\langle c, x\rangle : Ax = b,\ x \geq 0\}$, the difficult set of constraints is given by $x \geq 0$. We can seek to soften these constraints via the use of a barrier. Consider the log-barrier: For some parameter $\mu > 0$,

    $\mathrm{LP}_{\mu}: \qquad \min\,\Big\{\langle c, x\rangle - \mu \sum_{i=1}^n \log x_i \;:\; Ax = b\Big\}.$

    Note that any solution to $\mathrm{LP}_{\mu}$ with bounded objective must satisfy that $x > 0$ (because otherwise the objective is $+\infty$).

    Introducing a vector $y \in \mathbb{R}^m$ of Lagrangian multipliers, the (unique) optimum $x_{\mu}$ of $\mathrm{LP}_{\mu}$ satisfies

    $c - \mu X^{-1} \mathbf{1} = A^{\top} y,$

    where $X = \mathrm{diag}(x_{\mu})$. Setting $s := \mu X^{-1} \mathbf{1}$, we can therefore characterize the optimizer as given by a triple $(x, y, s)$ satisfying the constraints

    $Ax = b,\ x \geq 0, \qquad A^{\top} y + s = c,\ s \geq 0, \qquad x_i s_i = \mu \ \textrm{ for } i = 1, \ldots, n.$

    Note that $x$ and $(y, s)$ are primal- and dual-feasible, respectively, and

    $\langle x, s\rangle = n \mu.$

    Hence as $\mu \to 0$, we have $(x_{\mu}, y_{\mu}, s_{\mu})$ converging to an optimal primal-dual pair. The sequence $\{(x_{\mu}, y_{\mu}, s_{\mu}) : \mu > 0\}$ is called the central path, and interior point methods solve the LP by approximately following this path.

    The idea is to compute $x_{\mu'}$ from $x_{\mu}$ for $\mu' = (1 - \delta)\mu$ with $\delta$ sufficiently small. (It turns out we can take $\delta \approx \frac{1}{\sqrt{n}}$.) This computation is done via Newton’s method, and the key insight is that for $\delta$ small, we will be in a “basin of attraction” and achieve local quadratic convergence. The end result is that we can solve a linear program to within accuracy $\epsilon$ by solving about $\sqrt{n} \log \frac{1}{\epsilon}$ linear systems.


    Animation of the central path for some optimizations.

    Thu, Nov 18
    Interior point methods II
    LV Book: 5.4

    Continuing as in the last lecture, our goal is to approximately follow the central path:

    $\big\{(x_{\mu}, y_{\mu}, s_{\mu}) : \mu > 0\big\}.$

    Define the vector $w \in \mathbb{R}^n$ by $w_i = x_i s_i$. Our notion of distance to the central path will be measured by

    $\Big\| \frac{w}{\mu} - \mathbf{1} \Big\|_2.$

    In order to decrease $\mu$ to $\mu' = (1-\delta)\mu$, we will try to improve the approximation by finding a vector $(\Delta x, \Delta y, \Delta s)$ such that

    $(x + \Delta x)(s + \Delta s) = \mu' \mathbf{1}, \qquad A(x + \Delta x) = b, \qquad A^{\top}(y + \Delta y) + (s + \Delta s) = c, \qquad x + \Delta x,\ s + \Delta s \geq 0,$

    where products of vectors are coordinatewise. To approximate this by a linear system, we will drop the quadratic term $\Delta x\, \Delta s$ from the first equality and ignore the nonnegativity constraints, yielding

    $s\, \Delta x + x\, \Delta s = \mu' \mathbf{1} - x s, \qquad A \Delta x = 0, \qquad A^{\top} \Delta y + \Delta s = 0.$

    Now we can see $(\Delta x, \Delta y, \Delta s)$ as the solution to the linear system

    $\begin{pmatrix} S & 0 & X \\ A & 0 & 0 \\ 0 & A^{\top} & I \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} = \begin{pmatrix} \mu' \mathbf{1} - x s \\ 0 \\ 0 \end{pmatrix},$

    where $X = \mathrm{diag}(x)$ and $S = \mathrm{diag}(s)$.

    Solving this system shows that there is a projection matrix $P$ (the orthogonal projection onto the row space of $A X^{1/2} S^{-1/2}$) such that

    $u := X^{-1/2} S^{1/2}\, \Delta x = (I - P)\, g \qquad \textrm{and} \qquad v := X^{1/2} S^{-1/2}\, \Delta s = P\, g, \qquad \textrm{where } g := (XS)^{-1/2} \big(\mu' \mathbf{1} - x s\big).$

    If we now assume that $\big\|\frac{xs}{\mu} - \mathbf{1}\big\|_2 \leq \theta$, then

    $\|u\|_2 \leq \|g\|_2,$

    and the same bound holds for $\|v\|_2$. Therefore for $\theta$ and $\delta$ sufficiently small, we have $x + \Delta x > 0$ and $s + \Delta s > 0$, meaning that the nonnegativity constraints are satisfied for $x' := x + \Delta x$ and $s' := s + \Delta s$.

    Moreover, if we again assume $\big\|\frac{xs}{\mu} - \mathbf{1}\big\|_2 \leq \theta$, then we have

    $x' s' - \mu' \mathbf{1} = \Delta x\, \Delta s = u\, v.$

    By Cauchy-Schwarz, we can bound this by

    $\|u\, v\|_2 \leq \|u\|_2\, \|v\|_2 \leq \frac{\|g\|_2^2}{2},$

    where in the last line we used the bound $\|u\|_2^2 + \|v\|_2^2 = \|g\|_2^2$ coming from $u \perp v$. One can see here the local quadratic convergence of the Newton step: We have decreased the error from $\theta$ to $O(\theta^2)$. For $\mu' = \mu$, this gives

    $\Big\| \frac{x' s'}{\mu} - \mathbf{1} \Big\|_2 \leq \frac{\theta^2}{2(1-\theta)}.$

    One can now check that if $\theta \leq 1/4$, then we can take

    $\mu' = \Big(1 - \frac{\theta}{2\sqrt{n}}\Big)\, \mu,$

    and then we have

    $\Big\| \frac{x' s'}{\mu'} - \mathbf{1} \Big\|_2 \leq \theta,$

    meaning that we have decreased $\mu$ by a factor of $1 - \Omega(1/\sqrt{n})$ and maintained the invariant $\big\|\frac{xs}{\mu} - \mathbf{1}\big\|_2 \leq \theta$.
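    A minimal sketch of one such Newton step and the path-following loop; the tiny LP and the decrement factor 0.9 are illustrative choices, not from the lecture:

```python
import numpy as np

def newton_step(A, x, y, s, mu_new):
    """One Newton step toward (x_mu, y_mu, s_mu) for mu = mu_new, via the
    linearized system  s*dx + x*ds = mu_new*1 - x*s,  A dx = 0,  A^T dy + ds = 0."""
    m, n = A.shape
    K = np.block([
        [np.diag(s),       np.zeros((n, m)), np.diag(x)      ],
        [A,                np.zeros((m, m)), np.zeros((m, n))],
        [np.zeros((n, n)), A.T,              np.eye(n)       ],
    ])
    rhs = np.concatenate([mu_new * np.ones(n) - x * s, np.zeros(m), np.zeros(n)])
    d = np.linalg.solve(K, rhs)
    return x + d[:n], y + d[n:n + m], s + d[n + m:]

# Tiny LP: min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0  (optimum at x = (1, 0)).
A = np.array([[1.0, 1.0]])
x, y, s = np.array([0.5, 0.5]), np.zeros(1), np.array([1.0, 2.0])  # strictly feasible
mu = (x @ s) / 2
for _ in range(50):
    mu *= 0.9                       # decrease mu, then re-center with one Newton step
    x, y, s = newton_step(A, x, y, s, mu)
print(x, x @ s)                     # x near (1, 0); duality gap <x, s> near 0
```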

    Sparsification
    Tue, Nov 23
    Subspace embeddings and sparse linear regression
    LV Book: 6.1
    Homework #4
    • Consider a linear regression problem specified by $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ with $n \gg d$. Our goal is to minimize the classical least-squares error:

      $\|A x - b\|_2^2.$

    • We can calculate the optimizer exactly by examining the critical points of the gradient:

      $\nabla_x\, \|Ax - b\|_2^2 = 2 A^{\top} (A x - b) = 0,$

      which has solution

      $x^{\star} = \big(A^{\top} A\big)^{-1} A^{\top} b.$

    • Note that $A^{\top} A$ is a $d \times d$ matrix, so we can compute its inverse in time $O(d^{\omega})$, where $\omega \approx 2.37$ is the matrix multiplication exponent. But computing the product $A^{\top} A$ might take time $O(n d^{\omega - 1})$. If $n \gg d$, this may dominate the running time.

    • We could also try to calculate $x^{\star}$ via gradient descent, leading to the “Richardson iteration:”

      $x^{(k+1)} = x^{(k)} + A^{\top}\big(b - A x^{(k)}\big).$

      As we will see in a moment, it holds that if we scale $A$ so that $\|A^{\top} A\| \leq 1$, then $x^{(k)} \to x^{\star}$.

    • Preconditioning. Note that for an invertible, symmetric matrix $M$, we can write $x^{\star} = M z^{\star}$, where $z^{\star}$ minimizes $\|A M z - b\|_2^2$. The corresponding Richardson iteration looks like

      $z^{(k+1)} = z^{(k)} + (AM)^{\top}\big(b - A M z^{(k)}\big),$

      and if we know that $\frac{1}{2} I \preceq (AM)^{\top}(AM) \preceq I$, then we obtain the exponential convergence:

      $\big\|z^{(k)} - z^{\star}\big\| \leq 2^{-k}\, \big\|z^{(0)} - z^{\star}\big\|.$

    • Let us think of $B = AM$, where $M$ is invertible. Then $\frac{1}{2}\|z\|^2 \leq \|A M z\|^2 \leq \|z\|^2$ for all $z$ is the same as $\frac{1}{2}\|M^{-1} x\|^2 \leq \|A x\|^2 \leq \|M^{-1} x\|^2$ for all $x$. So to find a good preconditioner $M$, we can take $M = \big((\Pi A)^{\top} (\Pi A)\big)^{-1/2}$ for some $\Pi \in \mathbb{R}^{m \times n}$, where $\Pi$ is a linear map that approximately preserves the norms of vectors in the subspace $W = \mathrm{span}(\textrm{columns of } A)$. Note that $\dim(W) \leq d$.

    • In order for this to be efficient, it is useful if we can take the map $\Pi$ to be “oblivious” to the matrix $A$. While there will not be a single $\Pi$ (with $m \ll n$) that preserves the norms in every $d$-dimensional subspace, it turns out that we can choose $\Pi$ at random with this property.

    • Oblivious subspace embeddings: A random linear map $\Pi \in \mathbb{R}^{m \times n}$ is said to be a $(d, \epsilon, \delta)$-OSE if it holds that for every subspace $W \subseteq \mathbb{R}^n$ with $\dim(W) \leq d$,

      $\Pr\big[(1-\epsilon)\|x\| \leq \|\Pi x\| \leq (1+\epsilon)\|x\| \ \textrm{ for all } x \in W\big] \geq 1 - \delta,$

      where the probability is taken only over the choice of $\Pi$.

    • It is known that if $\Pi \in \mathbb{R}^{m \times n}$ has entries that are i.i.d. $N(0, 1/m)$ or i.i.d. $\pm 1/\sqrt{m}$, then $\Pi$ is a $(d, \epsilon, \delta)$-OSE for some choice

      $m = O\Big(\frac{d + \log(1/\delta)}{\epsilon^2}\Big).$

      This is not so useful for our application: Since this is a dense matrix, computing $\Pi A$ is still expensive.

    • Research on sparse JL transforms shows that if $\Pi$ is chosen uniformly at random so that every column contains exactly $s$ non-zero entries that are $\pm 1/\sqrt{s}$, then $\Pi$ is a $(d, \epsilon, \delta)$-OSE when we choose

      $m = O\Big(\frac{d \log(d/\delta)}{\epsilon^2}\Big) \qquad \textrm{and} \qquad s = O\Big(\frac{\log(d/\delta)}{\epsilon}\Big).$

    • Using this in the Richardson iteration (see 6.1.3 in the LV book), one obtains an algorithm for linear regression that runs in time $\tilde{O}\big(\mathrm{nnz}(A) + d^{\omega}\big)$, where $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$.
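    A sketch of the simpler “sketch-and-solve” variant of this idea (not the preconditioned Richardson route above): compress the regression with a sparse random $\Pi$ and solve the small problem. The sizes and sparsity $s$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, s = 10_000, 20, 400, 4      # illustrative sizes for the sketch

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Sparse embedding: each column of Pi has s nonzero entries equal to +-1/sqrt(s).
Pi = np.zeros((m, n))
for j in range(n):
    rows = rng.choice(m, size=s, replace=False)
    Pi[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
x_sketch = np.linalg.lstsq(Pi @ A, Pi @ b, rcond=None)[0]
# Residual ratio close to 1: the sketched solution is a near-optimal fit.
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```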

    Tue, Nov 30
    Stochastic gradient descent
    LV Book 6.3
    • Let $\mathcal{D}$ be a distribution on convex functions $f : \mathbb{R}^n \to \mathbb{R}$, and suppose that $F$ is the convex function given by

      $F(x) = \operatorname*{\mathbb{E}}_{f \sim \mathcal{D}}\, [f(x)].$

      As usual, our goal is to find a minimizer

      $x^{\star} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} F(x).$

    • One example is $F(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)$, where $\mathcal{D}$ is uniform on $\{f_1, \ldots, f_m\}$. If we tried to solve this with gradient descent, computing $\nabla F$ at some point $x$ would require $m$ gradient computations $\nabla f_1(x), \ldots, \nabla f_m(x)$.

    • Stochastic gradient descent instead evaluates a single gradient at every step:

      • Choose $f_k \sim \mathcal{D}$ independently.
      • Define $x^{(k+1)} = x^{(k)} - \eta\, \nabla f_k\big(x^{(k)}\big)$.

    Analysis of SGD

    • We will require the following fact: If $f$ is convex with $L$-Lipschitz gradients, then

      $\|\nabla f(x) - \nabla f(y)\|^2 \leq 2L\, \big(f(x) - f(y) - \langle \nabla f(y), x - y\rangle\big).$

      In other words, when $f(x)$ is close to the first-order approximation at $y$, it holds that $\nabla f(x)$ and $\nabla f(y)$ are also close.

      Taking $y = x^{\star}$ and $f \sim \mathcal{D}$ and then taking expectations over $\mathcal{D}$ gives

      \begin{equation}\label{eq:Fgrad} \operatorname*{\mathbb{E}}_{f \sim \mathcal{D}}\, \big\|\nabla f(x) - \nabla f(x^{\star})\big\|^2 \leq 2L\, \big(F(x) - F(x^{\star})\big), \end{equation}

      where we have used $\nabla F(x^{\star}) = 0$.

    • Theorem: Suppose that every $f$ in the support of $\mathcal{D}$ is convex with $L$-Lipschitz gradients, and that $F$ is $\alpha$-strongly-convex. Define $\sigma^2 = \operatorname*{\mathbb{E}}_{f \sim \mathcal{D}} \|\nabla f(x^{\star})\|^2$. Then for any step size $\eta \leq \frac{1}{4L}$, the SGD iterates satisfy, for every $k \geq 1$, the following guarantees:

      \begin{equation}\label{eq:sgd1} \operatorname{\mathbb{E}}\, \big\|x^{(k)} - x^{\star}\big\|^2 \leq (1 - \eta \alpha)^k\, \big\|x^{(0)} - x^{\star}\big\|^2 + \frac{2 \eta \sigma^2}{\alpha}, \end{equation}

      \begin{equation}\label{eq:sgd2} \operatorname{\mathbb{E}}\big[F\big(x^{(k-1)}\big)\big] - F(x^{\star}) \leq \frac{\operatorname{\mathbb{E}} \|x^{(k-1)} - x^{\star}\|^2 - \operatorname{\mathbb{E}} \|x^{(k)} - x^{\star}\|^2}{\eta} + 2 \eta \sigma^2. \end{equation}

    • Comparison with the empirical risk minimizer. To understand the quality of the SGD guarantees, suppose we sample $f_1, \ldots, f_m \sim \mathcal{D}$ independently and then define

      $\hat{x} \in \operatorname*{argmin}_{x}\, \frac{1}{m} \sum_{i=1}^m f_i(x).$

      The Cramer-Rao bound asserts that under the conditions of the preceding theorem, we have (up to constant factors)

      $\operatorname{\mathbb{E}}\big[F(\hat{x})\big] - F(x^{\star}) \gtrsim \frac{\sigma^2}{\alpha\, m},$

      thus we can think of error $\frac{\sigma^2}{\alpha m}$ as the gold standard.

      By combining \eqref{eq:sgd1} and \eqref{eq:sgd2}, one obtains that for the average point

      $\bar{x} = \frac{1}{k} \sum_{j=1}^{k} x^{(j)},$

      SGD gives

      $\operatorname{\mathbb{E}}\big[F(\bar{x})\big] - F(x^{\star}) \leq \frac{\big\|x^{(0)} - x^{\star}\big\|^2}{\eta\, k} + 2 \eta \sigma^2.$
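    A minimal sketch of SGD on a finite-sum least-squares problem; the data, noise level, and parameters are illustrative:

```python
import numpy as np

def sgd(grads, x0, eta, T, rng):
    """SGD for F(x) = (1/m) * sum_i f_i(x): sample i, step along grad f_i."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        i = rng.integers(len(grads))
        x = x - eta * grads[i](x)
    return x

# Least squares as a finite sum: f_i(x) = (1/2) * (<a_i, x> - b_i)^2.
rng = np.random.default_rng(0)
m, d = 500, 5
A, x_true = rng.standard_normal((m, d)), rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(m)
# Default arguments bind a_i, b_i at definition time (avoiding the closure pitfall).
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(m)]
x = sgd(grads, np.zeros(d), eta=0.01, T=20_000, rng=rng)
print(np.linalg.norm(x - x_true))  # small, up to a noise floor of order eta * sigma^2
```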

    Thu, Dec 02
    Variance reduction and coordinate descent
    LV Book 6.3

    Let us now assume that $\mathcal{D}$ is uniform on $\{f_1, \ldots, f_n\}$ (so $F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$).

    Notice the dependence of \eqref{eq:sgd1} and \eqref{eq:sgd2} on the variance $\sigma^2 = \operatorname{\mathbb{E}} \|\nabla f(x^{\star})\|^2$. To remove this dependence, we consider a variance reduction technique. For any $f$ and $\tilde{x} \in \mathbb{R}^n$, define $\tilde{f}$ (via its gradient) by

    \begin{equation}\label{eq:tildef} \nabla \tilde{f}(x) = \nabla f(x) - \nabla f(\tilde{x}) + \nabla F(\tilde{x}). \end{equation}

    Note that $\operatorname{\mathbb{E}}\, \nabla \tilde{f}(x) = \nabla F(x)$. From \eqref{eq:Fgrad}, one can also conclude that

    $\operatorname{\mathbb{E}}\, \big\|\nabla \tilde{f}(x)\big\|^2 \leq 4L\, \big(F(x) - F(x^{\star})\big) + 4L\, \big(F(\tilde{x}) - F(x^{\star})\big).$

    This suggests a natural algorithm that runs SGD in phases, where in each phase we replace $\tilde{x}$ by progressively better approximations to $x^{\star}$ (so that the variance keeps reducing).

    • SGD phase($x^{(0)}$, $\tilde{x}$, $\eta$, $m$): Given points $x^{(0)}$ and $\tilde{x}$, we run SGD for $m$ steps using the functions $\tilde{f}$, where $\tilde{f}$ is defined in \eqref{eq:tildef}. Output $\big(x^{(0)}, x^{(1)}, \ldots, x^{(m-1)}\big)$.

      Note that the SGD phase requires $n + 2m$ gradient computations, where $n$ gradient computations are needed for $\nabla F(\tilde{x})$.

    • SVRG (Stochastic Variance Reduced Gradient):

      • Define $\tilde{x}^{(0)} = x^{(0)}$.

      • For $t = 1, \ldots, T$:

        • Let $\big(x^{(0)}, \ldots, x^{(m-1)}\big)$ be the output of SGD phase($\tilde{x}^{(t-1)}$, $\tilde{x}^{(t-1)}$, $\eta$, $m$)

          • $\tilde{x} \leftarrow \frac{1}{m} \left(x^{(0)}+x^{(1)}+\cdots+x^{(m-1)}\right)$

          • $\tilde{x}^{(t)} \leftarrow \tilde{x}$.

      • Output $\tilde{x}^{(T)}$.

    • Theorem (SVRG): Suppose $f_1, \ldots, f_n$ have $L$-Lipschitz gradients and $F$ is $\alpha$-strongly convex. Define $\kappa = L/\alpha$, and choose $\eta = \Theta(1/L)$ and $m = \Theta(\kappa)$ appropriately. Then for every $T \geq 1$, the output of SVRG satisfies

      $\operatorname{\mathbb{E}}\big[F\big(\tilde{x}^{(T)}\big)\big] - F(x^{\star}) \leq 2^{-T}\, \big(F\big(x^{(0)}\big) - F(x^{\star})\big).$

    • Note that to get a solution with error $\epsilon$ requires $O\big((n + \kappa) \log \frac{1}{\epsilon}\big)$ gradient computations. Our analysis of regular gradient descent required $O\big(\kappa \log \frac{1}{\epsilon}\big)$ iterations, which requires $O\big(n \kappa \log \frac{1}{\epsilon}\big)$ gradient computations.
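    A minimal sketch of the SVRG phases described above; the finite-sum least-squares instance and parameters are illustrative:

```python
import numpy as np

def svrg(grads, full_grad, x0, eta, m_phase, T, rng):
    """SVRG: phases of SGD with the variance-reduced gradient
    grad f_i(x) - grad f_i(x~) + grad F(x~), with x~ the phase snapshot."""
    xt = np.asarray(x0, dtype=float)
    for _ in range(T):
        mu = full_grad(xt)               # one full gradient per phase
        x, avg = xt.copy(), np.zeros_like(xt)
        for _ in range(m_phase):
            i = rng.integers(len(grads))
            x = x - eta * (grads[i](x) - grads[i](xt) + mu)
            avg += x / m_phase
        xt = avg                          # next snapshot is the phase average
    return xt

rng = np.random.default_rng(0)
n_f, d = 200, 5
A, x_star = rng.standard_normal((n_f, d)), rng.standard_normal(d)
b = A @ x_star
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(n_f)]
full_grad = lambda x: A.T @ (A @ x - b) / n_f
x = svrg(grads, full_grad, np.zeros(d), eta=0.05, m_phase=400, T=10, rng=rng)
print(np.linalg.norm(x - x_star))   # decreases linearly across phases
```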

    Acceleration
    Tue, Dec 07
    Chebyshev polynomials
    LV Book 7.1
    • Chebyshev polynomials (of the first kind): $T_0(x) = 1$, $T_1(x) = x$, and $T_{k+1}(x) = 2x\, T_k(x) - T_{k-1}(x)$; equivalently, $T_k(\cos\theta) = \cos(k\theta)$.
    • Existence of a degree $q$ polynomial $p$ with $p(0) = 1$ satisfying

      $\max_{x \in [\alpha, L]} |p(x)| \leq 2\, \Big(1 - \frac{2}{\sqrt{\kappa}+1}\Big)^{q} \leq 2\, e^{-2q/(\sqrt{\kappa}+1)},$

      where $\kappa = L/\alpha$ and $p$ is a suitably shifted and rescaled Chebyshev polynomial.

    • For every $\epsilon > 0$, existence of a polynomial of degree $O\big(\sqrt{\kappa} \log \frac{1}{\epsilon}\big)$ with $p(0) = 1$ satisfying $\max_{x \in [\alpha, L]} |p(x)| \leq \epsilon$
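    A quick numerical check of the recurrence against the trigonometric identity (a minimal sketch):

```python
import numpy as np

def cheb(n, x):
    """Chebyshev polynomial T_n via T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)."""
    t_prev, t_cur = np.ones_like(x), x
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur if n >= 1 else t_prev

x = np.linspace(-1, 1, 5)
print(cheb(6, x))
print(np.cos(6 * np.arccos(x)))   # matches: T_n(cos t) = cos(n t)
```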

    Thu, Dec 09
    Conjugate gradient
    LV Book 7.2
    • Consider the function $f(x) = \frac{1}{2} \langle x, A x\rangle - \langle b, x\rangle$, where $A$ is a PSD matrix. We have $\nabla f(x) = A x - b$.

    • Any sequence $\{x^{(k)}\}$ satisfying the invariant

      $x^{(k+1)} \in x^{(0)} + \mathrm{span}\big\{\nabla f(x^{(0)}), \ldots, \nabla f(x^{(k)})\big\}$

      has the property that

      $x^{(k)} \in x^{(0)} + \mathrm{span}\big\{r_0, A r_0, A^2 r_0, \ldots, A^{k-1} r_0\big\}, \qquad r_0 := A x^{(0)} - b.$

      Define this subspace to be $\mathcal{K}_k$. This is called the $k$th Krylov subspace.

    • We can consider the best such sequence given by

      $x^{(k)} = \operatorname*{argmin}\big\{f(x) : x \in x^{(0)} + \mathcal{K}_k\big\}.$

    • Using the fact that $x^{\star} = A^{-1} b$ (when $A$ is positive definite), one can use the existence of Chebyshev polynomials to show that

      $\big\|x^{(k)} - x^{\star}\big\|_A \leq 2\, e^{-2k/\sqrt{\kappa}}\, \big\|x^{(0)} - x^{\star}\big\|_A,$

      where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$ is the condition number of $A$ and $\|z\|_A := \sqrt{\langle z, A z\rangle}$. Note the $e^{-k/\sqrt{\kappa}}$ factor decay in the error, vs. $e^{-k/\kappa}$ for the Richardson iteration.

    • We can compute the sequence $\{x^{(k)}\}$ using the conjugate gradient method.

    • For an introduction to accelerated gradient descent for general convex functions, see Aaron Sidford’s lecture notes.
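    A minimal implementation of the conjugate gradient method; the random positive definite test matrix is illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, iters):
    """CG for f(x) = 1/2 <x, Ax> - <b, x> with A positive definite;
    the k-th iterate minimizes f over x^(0) + K_k (the Krylov subspace)."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = -grad f(x)
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + np.eye(50)           # positive definite
b = rng.standard_normal(50)
x = conjugate_gradient(A, b, iters=50)
print(np.linalg.norm(A @ x - b))   # tiny: in exact arithmetic, CG is exact after n steps
```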

    Homeworks

    • You may discuss problems with your classmates, but when you write down the solutions, you should do so by yourself. You can use the internet and books for reference material but you should cite every source that you consulted (the name of the book or web page suffices). You should also cite any classmates with whom you discussed solutions.
    • Homework should be typeset.
    • In general, late submissions are not accepted.
    • We follow the standard UW CSE policy for academic integrity.
    • The office hours are for general questions about material from the lecture and homework problems.
• Please refer to university policies regarding disability accommodations or religious accommodations.