# CSE 535: Theory of Optimization and Continuous Algorithms

#### Instructor: James R. Lee

Office hours: Tues 3-4pm in CSE 574
(or Zoom by appointment)

#### Course description:

The design of algorithms is traditionally a discrete endeavor. However, many advances have come from a continuous viewpoint. Typically, a continuous process, deterministic or randomized, is designed (or shown) to have desirable properties, such as approaching an optimal solution or a desired distribution, and an algorithm is derived from it by an appropriate discretization. In interesting and general settings, the fastest currently known methods are a consequence of this perspective. We will discuss several examples of algorithms for high-dimensional optimization and sampling, and use them to understand the following concepts in detail:

• Elimination
• Reduction
• Geometrization
• Acceleration
• Sparsification
• Decomposition
• Discretization

## Lectures

#### Introduction

**Thu, Sep 30: Intro to convexity and optimization** (LV Book: 1.1-1.5; Homework #0)

Convex sets, convex functions, convex optimization. If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then local optimality implies global optimality:
$$\nabla f(x) = 0 \iff f(x) = \min \{ f(y) : y \in \mathbb{R}^n \}.$$
This holds because the following are equivalent:

• $f : \mathbb{R}^n \to \mathbb{R}$ is convex.
• $f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle$ for all $x,y \in \mathbb{R}^n$.

If $f$ is $\mathcal{C}^2$ (i.e., it has continuous second-order partial derivatives), then the Hessian matrix $\nabla^2 f(x)$ is defined for every $x \in \mathbb{R}^n$:
$$\left(\nabla^2 f(x)\right)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x).$$
To check convexity, it is often useful to employ the following:

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}^2$, then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is positive semidefinite for all $x \in \mathbb{R}^n$.

As a simple example, consider the bivariate function $f(u,v) = u \log \frac{u}{v}$ defined on the positive orthant $u,v > 0$. Then we have
$$\nabla^2 f(u,v) = \begin{pmatrix} \frac{1}{u} & -\frac{1}{v} \\ -\frac{1}{v} & \frac{u}{v^2} \end{pmatrix}.$$
The trace of this matrix is $\frac{1}{u} + \frac{u}{v^2} > 0$ for $u,v > 0$. The determinant of this matrix is $\frac{1}{v^2} - \frac{1}{v^2} = 0$ for $u,v > 0$. Therefore if $\lambda_1,\lambda_2$ are the eigenvalues of $\nabla^2 f(u,v)$, it holds that $\lambda_1 + \lambda_2 > 0$ and $\lambda_1 \lambda_2 = 0$. The only possibility is that both eigenvalues are nonnegative, meaning that $\nabla^2 f(u,v)$ is positive semidefinite on the positive orthant.

**Tue, Oct 05: Gradient descent, I** (LV Book: 2.1-2.3)

Gradient descent $\mathrm{GD}(\eta)$: given $f : \mathbb{R}^n \to \mathbb{R}$, parameters $\eta, \epsilon > 0$, and an initial point $x^{(0)} \in \mathbb{R}^n$. For $k = 0, 1, 2, \ldots$: if $\|\nabla f(x^{(k)})\|_2 \leq \epsilon$, stop and output $x^{(k)}$; else update $x^{(k+1)} \leftarrow x^{(k)} - \eta \nabla f(x^{(k)})$.

A function $f \in \mathcal{C}^2(\mathbb{R}^n)$ has $L$-Lipschitz gradients if
$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2 \quad \text{for all } x,y \in \mathbb{R}^n.$$
The following are equivalent:

• $f \in \mathcal{C}^2(\mathbb{R}^n)$ has $L$-Lipschitz gradients.
• $-L \preceq \nabla^2 f(x) \preceq L$ for all $x \in \mathbb{R}^n$.
• For all $x,y \in \mathbb{R}^n$, $\left| f(y) - f(x) - \langle \nabla f(x), y - x\rangle \right| \leq \frac{L}{2} \|y - x\|_2^2$.

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ has $L$-Lipschitz gradients, then $\mathrm{GD}(\eta=\frac{1}{L})$ stops within $\frac{2L}{\epsilon^2} (f(x^{(0)})-f(x^*))$ steps. Note that in this case we have no guarantee of optimality, only that $\|\nabla f(x^{(k)})\|_2 \leq \epsilon$.

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is convex with $L$-Lipschitz gradients, then $\mathrm{GD}(\eta=\frac{1}{L})$ satisfies
$$f(x^{(k)}) - f(x^*) \leq \frac{L \|x^{(0)} - x^*\|_2^2}{2k}.$$
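To make the loop above concrete, here is a minimal Python sketch of $\mathrm{GD}(\eta)$ with the stopping rule $\|\nabla f(x^{(k)})\|_2 \leq \epsilon$; the function name, the iteration cap, and the quadratic test problem are illustrative assumptions, not taken from the lectures or the LV book.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, eps, max_iters=10_000):
    """Run GD(eta): repeat x <- x - eta * grad_f(x) until ||grad_f(x)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping rule from the lecture
            break
        x = x - eta * g                # gradient step with fixed step size eta
    return x

# Illustrative example: f(x) = (1/2) x^T A x - b^T x, so grad f(x) = A x - b,
# and L is the largest eigenvalue of A, so eta = 1/L matches the theorems above.
A = np.array([[2.0, 0.0], [0.0, 10.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()
print(gradient_descent(grad_f, x0=np.zeros(2), eta=1.0 / L, eps=1e-8))
# approaches the minimizer A^{-1} b = [0.5, 0.1]
```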
**Thu, Oct 07: Gradient descent, II** (LV Book: 2.3-2.7; Homework #1)

A function $f \in \mathcal{C}^1(\mathbb{R}^n)$ is $\mu$-strongly convex if for any $x,y \in \mathbb{R}^n$,
$$f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle + \frac{\mu}{2} \|y-x\|_2^2.$$
For $f \in \mathcal{C}^2(\mathbb{R}^n)$ and $\mu \geq 0$, the following are equivalent:

• $f$ is $\mu$-strongly convex.
• For all $x,y \in \mathbb{R}^n$, it holds that $\langle \nabla f(x) - \nabla f(y), x - y\rangle \geq \mu \|x-y\|_2^2$.
• $\nabla^2 f(x) \succeq \mu$ for all $x \in \mathbb{R}^n$.

Theorem: Suppose that $f \in \mathcal{C}^2(\mathbb{R}^n)$ is $\mu$-strongly convex and has $L$-Lipschitz gradients. Then $\mathrm{GD}(\eta = \frac{1}{L})$ satisfies
$$f(x^{(k)}) - f(x^*) \leq \left(1 - \tfrac{\mu}{L}\right)^k \left(f(x^{(0)}) - f(x^*)\right).$$

Gradient descent is a discretization of the continuous algorithm (gradient flow)
$$\frac{d x(t)}{dt} = -\nabla f(x(t)).$$
Note that:
$$\frac{d}{dt} f(x(t)) = \left\langle \nabla f(x(t)), \frac{dx(t)}{dt} \right\rangle = -\|\nabla f(x(t))\|_2^2.$$
And if $f$ is $\mu$-strongly convex, then
$$\|\nabla f(x)\|_2^2 \geq 2\mu \left(f(x) - f(x^*)\right),$$
hence
$$f(x(t)) - f(x^*) \leq e^{-2\mu t} \left(f(x(0)) - f(x^*)\right).$$
Line search and the Wolfe conditions (Sec 2.5).

#### Elimination

**Tue, Oct 12: Cutting plane methods, the Ellipsoid algorithm** (LV Book: 3.1-3.2)

• The hyperplane separation theorem
• Cutting plane framework
• Affine geometry of ellipsoids
• Ellipsoid algorithm and analysis (a minimal sketch of the update step appears after the schedule below)

**Thu, Oct 14: NO CLASS**

**Tue, Oct 19: From volume to function value** (LV Book: 3.2-3.3)

• Using the ellipsoid method to minimize non-smooth convex functions
• Volume shrinkage yields small function value
• Application to Linear Programming

**Thu, Oct 21: Center of gravity** (LV Book: 3.4; Homework #2)

• The exact center-of-gravity method converges to an $\varepsilon$-approximate minimum in $O(n \log (1/\varepsilon))$ iterations
• Analysis using Grünbaum's Theorem
• Approximate center of gravity suffices: Robust Grünbaum Theorem
• Logconcave distributions
• Isotropic distributions

Upcoming lectures: Tue, Oct 26 · Thu, Oct 28 · Tue, Nov 02 · Thu, Nov 04 · Tue, Nov 09 · Tue, Nov 16 · Thu, Nov 18 · Tue, Nov 23 · Thu, Nov 25 · Tue, Nov 30 · Thu, Dec 02 · Tue, Dec 07 · Thu, Dec 09
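As a companion to the cutting-plane lectures above (Oct 12 through Oct 21), the following is a minimal Python sketch of the central-cut ellipsoid update together with a subgradient-cut minimization loop; the function names, iteration count, and quadratic test problem are illustrative assumptions, not taken from the course materials.

```python
import numpy as np

def ellipsoid_update(c, A, g):
    """One central-cut ellipsoid step.

    The current ellipsoid is {x : (x - c)^T A^{-1} (x - c) <= 1}, and the cut
    keeps the half-ellipsoid {x : <g, x - c> <= 0}.  Returns the center and
    matrix of the smallest ellipsoid containing that half (valid for n >= 2)."""
    n = len(c)
    b = A @ g / np.sqrt(g @ A @ g)                 # cut direction in ellipsoid geometry
    c_new = c - b / (n + 1)                        # shift the center into the kept half
    A_new = (n * n / (n * n - 1.0)) * (A - (2.0 / (n + 1)) * np.outer(b, b))
    return c_new, A_new

def ellipsoid_minimize(f, grad_f, c0, R, iters=200):
    """Minimize a differentiable convex f over the ball B(c0, R), using the
    gradient at the current center as a separating hyperplane (subgradient cut).
    Returns the best center seen, measured by function value."""
    c = np.asarray(c0, dtype=float)
    A = (R ** 2) * np.eye(len(c))                  # start from the ball B(c0, R)
    best = c.copy()
    for _ in range(iters):
        g = grad_f(c)
        if np.linalg.norm(g) == 0.0:               # already at a minimizer
            return c
        c, A = ellipsoid_update(c, A, g)
        if f(c) < f(best):
            best = c.copy()
    return best

# Illustrative test (not from the lectures): f(x) = ||x - (1, 2)||^2.
target = np.array([1.0, 2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad_f = lambda x: 2.0 * (x - target)
print(ellipsoid_minimize(f, grad_f, c0=np.zeros(2), R=10.0))  # close to (1, 2)
```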

## Homeworks

• You may discuss problems with your classmates, but when you write down the solutions, you should do so by yourself. You can use the internet and books for reference material but you should cite every source that you consulted (the name of the book or web page suffices). You should also cite any classmates with whom you discussed solutions.
• Homework should be typeset.
• In general, late submissions are not accepted.