# CSE 535: Theory of Optimization and Continuous Algorithms

#### Instructor: James R. Lee

Office hours: Tues 3-4pm in CSE 574
(or Zoom by appointment)

#### Course evaluation: 100% Homework

• Homework #0 [Due: DO NOT TURN IN]

#### Course description:

The design of algorithms is traditionally a discrete endeavor. However, many advances have come from a continuous viewpoint. Typically, a continuous process, deterministic or randomized, is designed (or shown) to have desirable properties, such as approaching an optimal solution or a desired distribution, and an algorithm is derived from it by appropriate discretization. In interesting and general settings, the current fastest methods are a consequence of this perspective. We will discuss several examples of algorithms for high-dimensional optimization and sampling.

## Lectures

### Introduction

#### Thu, Sep 30: Intro to convexity and optimization
*LV Book: 1.1–1.5. Homework #0.*

Convex sets, convex functions, convex optimization.

If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then local optimality implies global optimality:
$$\nabla f(x) = 0 \iff f(x) = \min \{ f(y) : y \in \mathbb{R}^n \}.$$
This holds because the following are equivalent for differentiable $f$:
• $f : \mathbb{R}^n \to \mathbb{R}$ is convex.
• $f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle$ for all $x,y \in \mathbb{R}^n$.

If $f$ is $\mathcal{C}^2$ (i.e., it has continuous second-order partial derivatives), then the Hessian matrix $\nabla^2 f(x)$ is defined for every $x \in \mathbb{R}^n$:
$$\left(\nabla^2 f(x)\right)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x).$$
To check convexity, it is often useful to employ the following:

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}^2$, then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is positive semidefinite for all $x \in \mathbb{R}^n$.

As a simple example, consider the bivariate function $f(u,v) = u \log \frac{u}{v}$ defined on the positive orthant $u,v > 0$. Then we have
$$\nabla^2 f(u,v) = \begin{pmatrix} 1/u & -1/v \\ -1/v & u/v^2 \end{pmatrix}.$$
The trace of this matrix is $1/u + u/v^2 > 0$ for $u,v > 0$. The determinant is $\frac{1}{u} \cdot \frac{u}{v^2} - \frac{1}{v^2} = 0$. Therefore if $\lambda_1,\lambda_2$ are the eigenvalues of $\nabla^2 f(u,v)$, it holds that $\lambda_1 + \lambda_2 > 0$ and $\lambda_1 \lambda_2 = 0$. The only possibility is that both eigenvalues are nonnegative, meaning that $\nabla^2 f(u,v)$ is positive semidefinite on the positive orthant.

#### Tue, Oct 05: Gradient descent I
*LV Book: 2.1–2.3.*

Gradient descent $\mathrm{GD}(\eta)$: Given $f : \mathbb{R}^n \to \mathbb{R}$, parameters $\eta, \epsilon > 0$, and an initial point $x^{(0)} \in \mathbb{R}^n$:
• If $\|\nabla f(x^{(k)})\|_2 \leq \epsilon$, stop and output $x^{(k)}$.
• Else, update $x^{(k+1)} \leftarrow x^{(k)} - \eta \nabla f(x^{(k)})$.

A function $f \in \mathcal{C}^2(\mathbb{R}^n)$ has $L$-Lipschitz gradients if
$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x-y\|_2 \quad \text{for all } x,y \in \mathbb{R}^n.$$
The following are equivalent:
• $f \in \mathcal{C}^2(\mathbb{R}^n)$ has $L$-Lipschitz gradients.
• $-L \cdot \mathbf{I} \preceq \nabla^2 f(x) \preceq L \cdot \mathbf{I}$ for all $x \in \mathbb{R}^n$.
• For all $x,y \in \mathbb{R}^n$,
$$\left| f(y) - f(x) - \langle \nabla f(x), y-x\rangle \right| \leq \frac{L}{2} \|y-x\|_2^2.$$

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ has $L$-Lipschitz gradients, then $\mathrm{GD}(\eta=\frac{1}{L})$ stops within $2 \frac{L}{\epsilon^2} (f(x^{(0)})-f(x^*))$ steps. Note that in this case we have no guarantee of optimality, only that $\|\nabla f(x^{(k)})\|_2 \leq \epsilon$.

Theorem: If $f : \mathbb{R}^n \to \mathbb{R}$ is convex with $L$-Lipschitz gradients, then $\mathrm{GD}(\eta=\frac{1}{L})$ satisfies
$$f(x^{(k)}) - f(x^*) \leq \frac{2L \, \|x^{(0)}-x^*\|_2^2}{k}.$$

#### Thu, Oct 07: Gradient descent II
*LV Book: 2.3–2.7. Homework #1.*

A function $f \in \mathcal{C}^1(\mathbb{R}^n)$ is $\mu$-strongly convex if for any $x,y \in \mathbb{R}^n$,
$$f(y) \geq f(x) + \langle \nabla f(x), y-x\rangle + \frac{\mu}{2} \|y-x\|_2^2.$$
For $f \in \mathcal{C}^2(\mathbb{R}^n)$ and $\mu \geq 0$, the following are equivalent:
• $f$ is $\mu$-strongly convex.
• For all $x,y \in \mathbb{R}^n$, it holds that $\langle \nabla f(x) - \nabla f(y), x-y\rangle \geq \mu \|x-y\|_2^2$.
• $\nabla^2 f(x) \succeq \mu \cdot \mathbf{I}$ for all $x \in \mathbb{R}^n$.

Theorem: Suppose that $f \in \mathcal{C}^2(\mathbb{R}^n)$ is $\mu$-strongly convex and has $L$-Lipschitz gradients. Then $\mathrm{GD}(\eta = \frac{1}{L})$ satisfies
$$\|x^{(k)} - x^*\|_2^2 \leq \left(1 - \frac{\mu}{L}\right)^k \|x^{(0)} - x^*\|_2^2.$$

Gradient descent is a discretization of the continuous algorithm (gradient flow):
$$\frac{d}{dt} x(t) = -\nabla f(x(t)).$$
Note that
$$\frac{d}{dt} f(x(t)) = -\|\nabla f(x(t))\|_2^2.$$
And if $f$ is $\mu$-strongly convex, then $\|\nabla f(x)\|_2^2 \geq 2\mu \left(f(x) - f(x^*)\right)$, hence
$$f(x(t)) - f(x^*) \leq e^{-2\mu t} \left(f(x(0)) - f(x^*)\right).$$

Line search and the Wolfe conditions (Sec 2.5).

### Elimination

#### Tue, Oct 12: Cutting plane methods, the Ellipsoid algorithm
*LV Book: 3.1–3.2.*

The hyperplane separation theorem: If $K \subseteq \mathbb{R}^n$ is a closed convex set and $x \notin K$, then there is a direction $\theta \in \mathbb{R}^n$ such that
$$\langle y, \theta \rangle < \langle x, \theta \rangle \quad \text{for all } y \in K.$$
Said differently, there is an $(n-1)$-dimensional hyperplane $H = \{ z \in \mathbb{R}^n : \langle z, \theta\rangle = \langle x,\theta\rangle \}$ that separates $x$ from $K$.
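As a concrete illustration of the gradient descent guarantees above, here is a minimal sketch of $\mathrm{GD}(\eta = 1/L)$ on the quadratic $f(x_1,x_2) = 2x_1^2 + \frac12 x_2^2$ (so $L = 4$ and $\mu = 1$), checking the linear rate $(1-\mu/L)^k$ numerically. The test function and all names are illustrative, not from the course notes.

```python
# Minimal sketch of GD(eta) on an illustrative strongly convex quadratic.

def grad_f(x):
    # f(x1, x2) = 2*x1^2 + 0.5*x2^2, so the Hessian is diag(4, 1):
    # L = 4 (largest eigenvalue), mu = 1 (smallest eigenvalue).
    return [4.0 * x[0], 1.0 * x[1]]

def gd(grad, x0, eta, eps=1e-12, max_steps=1000):
    x = list(x0)
    for _ in range(max_steps):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= eps:   # stop when ||grad f|| <= eps
            break
        x = [xi - eta * gi for xi, gi in zip(x, g)]  # x <- x - eta * grad f(x)
    return x

L, mu = 4.0, 1.0
x0 = [1.0, 1.0]
x50 = gd(grad_f, x0, eta=1.0 / L, max_steps=50)
dist2 = sum(xi * xi for xi in x50)   # ||x50 - x*||^2, with x* = 0 here
print(dist2)                         # bounded by (1 - mu/L)^50 * ||x0 - x*||^2
```

Note that the iterates contract exactly as the theorem for strongly convex $f$ predicts: the distance squared to the minimizer decays at least as fast as $(1-\mu/L)^k$.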
• Cutting plane framework
• Affine geometry of ellipsoids
• Ellipsoid algorithm and analysis

#### Thu, Oct 14: NO CLASS

#### Tue, Oct 19: From volume to function value
*LV Book: 3.2–3.3.*

• Using the ellipsoid algorithm to minimize non-smooth convex functions
• Volume shrinkage yields small function value

The main result here is that if we are given a convex function $f : \mathbb{R}^n \to \mathbb{R}$ and a bounding ellipsoid $E^{(0)} \subseteq \mathbb{R}^n$ containing some minimizer $x^*$, then after $k$ iterations of the ellipsoid algorithm (with centers $x^{(0)}, \ldots, x^{(k)}$), we obtain
$$\min_{j \leq k} f(x^{(j)}) - f(x^*) \leq \left(\frac{\mathrm{vol}(E^{(k)})}{\mathrm{vol}(E^{(0)})}\right)^{1/n} \left(\max_{x \in E^{(0)}} f(x) - f(x^*)\right).$$

Application to Linear Programming: A linear program is an optimization of the form: given inputs $c \in \mathbb{R}^n, A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m$,
$$\min \ \langle c, x \rangle \quad \text{subject to} \quad Ax \leq b.$$
The feasible region is the set $K = \{ x \in \mathbb{R}^n : Ax \leq b\}$. Note that $Ax \leq b$ is shorthand for the $m$ linear inequalities $\langle a_i, x\rangle \leq b_i$ for $i=1,2,\ldots,m$, where $a_1,\ldots,a_m$ are the rows of $A$.

#### Thu, Oct 21: Center of gravity
*LV Book: 3.4. Homework #2.*

The exact center-of-gravity method converges to an $\varepsilon$-approximate minimum in $O(n \log (1/\varepsilon))$ iterations. Compare this to the $O(n^2 \log (1/\varepsilon))$ iterations used by the Ellipsoid Algorithm.

Analysis using Grunbaum's Theorem: If $K \subseteq \mathbb{R}^n$ is a compact, convex set and $z \in K$ is the center of gravity of $K$:
$$z = \frac{1}{\mathrm{vol}(K)} \int_K x \, dx,$$
then for any halfspace $H$ containing $z$, it holds that
$$\mathrm{vol}(K \cap H) \geq \frac{1}{e} \, \mathrm{vol}(K).$$
Approximate center of gravity suffices: Robust Grunbaum Theorem.

Logconcave distributions: A measure $\mu$ on $\mathbb{R}^n$ is log-concave if for all measurable sets $A,B\subseteq \mathbb{R}^n$ and $\lambda \in [0,1]$, we have
$$\mu\left(\lambda A + (1-\lambda) B\right) \geq \mu(A)^{\lambda} \, \mu(B)^{1-\lambda}.$$
Every log-concave measure arises from a density $p : \mathbb{R}^n \to \mathbb{R}_+$ for which $\mu(S) = \int_S p(x)\,dx$, and such that the function $x \mapsto \log p(x)$ is concave. Equivalently, $p(x) = e^{-f(x)}$ for some convex function $f : \mathbb{R}^n \to \mathbb{R}$.
Isotropic distributions: A probability distribution $\mu$ on $\mathbb{R}^n$ is isotropic if $\mathbb{E}_{\mu} X = 0$ and $\mathbb{E}_{\mu} XX^{\top} = \mathbf{I}$, where $X$ is a random vector with law $\mu$.

### Reductions

#### Tue, Oct 26: Oracles and equivalences
*LV Book: 4.1, 4.6.*

• Different oracles for accessing a convex set $K \subseteq \mathbb{R}^n$
• Oracles for accessing a convex function $f : \mathbb{R}^n \to \mathbb{R}$: evaluation oracle, subgradient oracle

The convex conjugate $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{\pm \infty\}$ of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\pm \infty\}$ is defined by
$$f^*(\theta) = \sup_{x \in \mathbb{R}^n} \left( \langle x, \theta\rangle - f(x) \right).$$
This function is convex, as it is the maximum of the family of affine functions $\{ \theta \mapsto \langle x,\theta\rangle - f(x) : x \in \mathbb{R}^n \}$, one for every fixed $x$. We can compute a subgradient of $f^*$ at $\theta$ via $\nabla f^*(\theta) = \mathrm{argmax}_x \left(\langle x,\theta\rangle - f(x)\right)$.

The epigraph of a function $f$ is the set
$$\mathrm{Epi}(f) = \{ (x,t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x) \}.$$
This is the set of points lying on or above the graph of $f$. The function $f$ is convex if and only if $\mathrm{Epi}(f)$ is a convex set.

The involution property: For any convex function $f : \mathbb{R}^n \to \mathbb{R}$ whose epigraph is closed, it holds that $f = f^{**}$.

We can consider the set of all affine lower bounds on a function: all those pairs $\mathcal{F} = \{ (\theta,b) \}$ with $\theta \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
$$f(x) \geq \langle \theta, x\rangle - b \quad \text{for all } x \in \mathbb{R}^n.$$
For a fixed $\theta$, the value of $b$ giving the best lower bound is precisely $f^*(\theta)$. Similarly, at a fixed point $x \in \mathbb{R}^n$, we can consider all the lower bounds
$$f(x) \geq \langle \theta, x\rangle - f^*(\theta), \quad \theta \in \mathbb{R}^n.$$
By definition, the best such lower bound is given by $f^{**}(x)$, hence $f(x) \geq f^{**}(x)$. The involution property ensures that when the epigraph of $f$ is closed, this family of lower bounds is tight in the sense that $f(x) = f^{**}(x)$ for all $x \in \mathbb{R}^n$.
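The conjugate and the involution property are easy to sanity-check numerically. The sketch below approximates $f^*(\theta) = \sup_x (\langle x,\theta\rangle - f(x))$ by a maximum over a finite grid for the illustrative choice $f(x) = x^2$ (so $f^*(\theta) = \theta^2/4$), and then checks $f^{**} \approx f$; the grid and test points are assumptions.

```python
# Numeric sketch of the convex conjugate and the involution f = f**,
# using f(x) = x^2 (closed, convex) and a finite grid in place of sup_x.

def conjugate(h, grid):
    # h*(theta) = sup_x (x * theta - h(x)), approximated over the grid.
    return lambda theta: max(x * theta - h(x) for x in grid)

f = lambda x: x * x
grid = [i / 100.0 for i in range(-500, 501)]   # x in [-5, 5], step 0.01

f_star = conjugate(f, grid)
f_star_star = conjugate(f_star, grid)

e1 = abs(f_star(2.0) - 1.0)          # f*(theta) = theta^2 / 4, so f*(2) = 1
e2 = abs(f_star_star(1.5) - f(1.5))  # involution: f**(1.5) should equal f(1.5)
print(e1, e2)                        # both near 0
```

The same grid trick gives an (approximate) subgradient oracle for $f^*$: the maximizing $x$ in the definition of $f^*(\theta)$ is a subgradient of $f^*$ at $\theta$.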
Computational equivalence of the subgradient oracles for $f$ and $f^*$, using cutting planes and the involution property.

#### Thu, Oct 28: Duality
*LV Book: 4.2–4.4.*

• Sion's Minimax Theorem
• Definitions of quasi-convex, quasi-concave, and lower/upper semicontinuous functions
• Application to a function decomposed as $f(x) = g(x) + h(Ax)$, where it gives the Fenchel duality
$$\inf_x \left( g(x) + h(Ax) \right) = \sup_y \left( -g^*(-A^{\top} y) - h^*(y) \right).$$
• Linear programming duality

SDP duality: Equivalence of the two optimizations
$$\max \left\{ \mathrm{tr}(CX) : \mathrm{tr}(A_i X) = b_i \ \forall i, \ X \succeq 0 \right\} \quad \text{and} \quad \min \Big\{ \langle b, \theta \rangle : \sum_{i=1}^m \theta_i A_i \succeq C \Big\}.$$
Here we take $\mathcal{A} : X \mapsto (\mathrm{tr}(A_1 X), \ldots, \mathrm{tr}(A_m X))$, where we recall that for a subset $K \subseteq \mathbb{R}^n$, we define $\delta_K(x) = 0$ if $x \in K$ and $+\infty$ if $x \notin K$. And then we take $g(X) = -\mathrm{tr}(CX) + \delta_{\{X \succeq 0\}}(X)$ and $h(y) = \delta_{\{b\}}(y)$.

Separation oracle for the PSD constraint $\sum_{i=1}^m \theta_i A_i \succeq C$: Define $A(\theta) = \sum_{i=1}^m \theta_i A_i - C$. Note that $\theta$ is the underlying variable, so the feasible region corresponding to this constraint is $K = \{ \theta \in \mathbb{R}^m : A(\theta) \succeq 0 \}$. Compute the minimum eigenvector $v_{\min}$ of $A(\theta)$. If the constraint is violated at $\theta$, it must be that
$$\langle v_{\min}, A(\theta)\, v_{\min} \rangle < 0.$$
On the other hand, for any feasible $\theta' \in K$ and any vector $v \in \mathbb{R}^n$, it holds that
$$\langle v, A(\theta')\, v \rangle \geq 0.$$
Hence if we define the vector $x \in \mathbb{R}^m$ by $x_i = - \langle v_{\min}, A_i v_{\min} \rangle$, then $x$ gives a hyperplane that separates $\theta$ from $K$, in the sense that
$$\langle x, \theta' \rangle < \langle x, \theta \rangle \quad \text{for all } \theta' \in K.$$

### Geometry and Optimization

#### Tue, Nov 02: Projected subgradient descent
*LV Book: 5.1.*

The collection of subgradients of $f$ at $x$:
$$\partial f(x) = \left\{ g \in \mathbb{R}^n : f(y) \geq f(x) + \langle g, y-x\rangle \ \text{ for all } y \in \mathbb{R}^n \right\}.$$
If $f$ has continuous partial derivatives at $x \in \mathbb{R}^n$, then $\partial f(x) = \{ \nabla f(x) \}$.

Example: Show that if $f : \mathbb{R} \to \mathbb{R}$ is defined by $f(x) = |x|$, then $\partial f(0) = [-1,1]$. If you graph $f$, this corresponds to the fact that all the lines through the origin with slope in the interval $[-1,1]$ lie below the graph of $f$.
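The PSD separation oracle described in the duality lecture can be sketched concretely. The example below uses a tiny $m = 2$, $2 \times 2$ instance with illustrative matrices (the constraint reduces to $\theta_1 \geq 1$, $\theta_2 \geq 1$), and a closed-form minimum eigenvector for symmetric $2 \times 2$ matrices; all data are assumptions for the demo.

```python
# Sketch of a separation oracle for the constraint sum_i theta_i A_i >= C.

import math

A1 = [[1.0, 0.0], [0.0, 0.0]]
A2 = [[0.0, 0.0], [0.0, 1.0]]
C  = [[1.0, 0.0], [0.0, 1.0]]   # feasible region: theta_1 >= 1, theta_2 >= 1

def min_eigvec(M):
    # Minimum eigenvalue / unit eigenvector of a symmetric 2x2 matrix.
    a, b, c = M[0][0], M[0][1], M[1][1]
    lam = (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b * b)
    if abs(b) > 1e-12:
        v = [b, lam - a]
    else:
        v = [1.0, 0.0] if a <= c else [0.0, 1.0]
    n = math.hypot(v[0], v[1])
    return lam, [v[0] / n, v[1] / n]

def quad(M, v):   # <v, M v>
    return sum(v[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

def oracle(theta):
    # A(theta) = theta_1 A1 + theta_2 A2 - C
    At = [[theta[0] * A1[i][j] + theta[1] * A2[i][j] - C[i][j]
           for j in range(2)] for i in range(2)]
    lam, v = min_eigvec(At)
    if lam >= 0:
        return None                       # theta is feasible
    return [-quad(A1, v), -quad(A2, v)]   # separating direction x

theta = [0.0, 2.0]       # infeasible, since theta_1 < 1
x = oracle(theta)
feasible = [1.0, 1.0]
lhs = sum(a * b for a, b in zip(x, feasible))   # <x, theta'> for feasible theta'
rhs = sum(a * b for a, b in zip(x, theta))      # <x, theta>
print(lhs, rhs)   # lhs < rhs: x separates theta from K
```

Only the minimum-eigenvector computation depends on the matrix dimension; in higher dimensions one would replace `min_eigvec` with an iterative eigensolver.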
Projected Subgradient Descent: Given $g^{(k)} \in \partial f(x^{(k)})$,
• (subgradient step) $\quad y^{(k+1)} \leftarrow x^{(k)} - \eta g^{(k)}$
• (projection step) $\quad x^{(k+1)} \leftarrow \pi_{\mathcal{D}}(y^{(k+1)}) = \mathrm{argmin}_{x \in \mathcal{D}} \|x-y^{(k+1)}\|_2^2$

After $T$ iterations, we output the average point
$$\bar{x} = \frac{1}{T} \sum_{k=0}^{T-1} x^{(k)}.$$

Pythagorean property for the Euclidean projection $\pi_{\mathcal{D}}$: for every $x \in \mathcal{D}$, $\|\pi_{\mathcal{D}}(y) - x\|_2 \leq \|y - x\|_2$.

Theorem: If $\mathcal{D} \subseteq \mathbb{R}^n$ is closed and convex, and $f : \mathcal{D} \to \mathbb{R}$ is a convex, $G$-Lipschitz function, then after $T$ iterations (with an appropriate choice of $\eta$), the output of the subgradient descent algorithm satisfies, for any $x^* \in \mathrm{argmin}_{x\in \mathcal{D}} f(x)$,
$$f(\bar{x}) - f(x^*) \leq \frac{G \, \|x^{(0)} - x^*\|_2}{\sqrt{T}}.$$

#### Thu, Nov 04: Mirror maps and Bregman divergences
*LV Book: 5.1.3–5.1.4. Homework #3.*

Setup: $\mathcal{D} \subseteq \mathbb{R}^n$ is closed and convex, $f : \mathcal{D} \to \mathbb{R}$ is convex.

Mirror map: A strongly convex function $\Phi : \mathcal{D} \to \mathbb{R}$. Associated Bregman divergence:
$$D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x-y\rangle.$$
Bregman projection onto $\mathcal{D}$:
$$\pi_{\mathcal{D}}^{\Phi}(y) = \mathrm{argmin}_{x \in \mathcal{D}} D_{\Phi}(x,y).$$
Mirror descent iteration: Find $y^{(k+1)}$ such that
$$\nabla \Phi(y^{(k+1)}) = \nabla \Phi(x^{(k)}) - \eta \nabla f(x^{(k)}),$$
and then $x^{(k+1)} = \pi_{\mathcal{D}}^{\Phi}(y^{(k+1)})$. Recall that if $f$ is non-smooth, $\nabla f(x^{(k)})$ can be replaced by any subgradient of $f$ at $x^{(k)}$.

The map $\Phi$ is said to be $\rho$-strongly convex on $\mathcal{D}$ with respect to a norm $\|\cdot\|$ if
$$\Phi(y) \geq \Phi(x) + \langle \nabla \Phi(x), y-x\rangle + \frac{\rho}{2} \|y-x\|^2 \quad \text{for all } x,y \in \mathcal{D}.$$

Theorem: If $f : \mathcal{D} \to \mathbb{R}$ is convex and $G$-Lipschitz with respect to a norm $\|\cdot\|$, and $\Phi$ is $\rho$-strongly convex on $\mathcal{D}$ with respect to $\|\cdot\|$, then after $T$ iterations mirror descent outputs the point
$$\bar{x} = \frac{1}{T} \sum_{k=0}^{T-1} x^{(k)},$$
and it satisfies
$$f(\bar{x}) - f(x^*) \leq \frac{R^2}{\eta T} + \frac{\eta G^2}{2\rho},$$
where $R^2 = D_{\Phi}(x^*, x^{(0)})$. If we choose $\eta = \frac{R}{G} \sqrt{\frac{2\rho}{T}}$, then this bound is
$$f(\bar{x}) - f(x^*) \leq R\, G \sqrt{\frac{2}{\rho T}}.$$
Examples include the Euclidean ball with $\Phi(x) = \frac12 \|x\|_2^2$ and the probability simplex with $\Phi(x) = \sum_{i=1}^n x_i \log x_i$.
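For the simplex example with the entropy mirror map, the mirror descent update has a closed form: a multiplicative-weights step followed by normalization (the Bregman projection onto the simplex). Here is a minimal sketch for an illustrative linear objective; the cost vector, step size, and horizon are assumptions.

```python
# Sketch of mirror descent over the probability simplex with the entropy
# mirror map Phi(x) = sum_i x_i log x_i. Since grad Phi(x)_i = 1 + log x_i,
# the update log y_i = log x_i - eta * c_i is a multiplicative-weights step,
# and the Bregman projection onto the simplex is normalization.

import math

c = [0.8, 0.1, 0.5]      # f(x) = <c, x>, minimized at the second vertex
eta = 0.1
x = [1.0 / 3] * 3        # start at the uniform distribution
avg = [0.0] * 3
T = 500
for _ in range(T):
    y = [xi * math.exp(-eta * ci) for xi, ci in zip(x, c)]
    s = sum(y)
    x = [yi / s for yi in y]                     # Bregman projection
    avg = [a + xi / T for a, xi in zip(avg, x)]  # running average iterate

fbar = sum(ci * ai for ci, ai in zip(c, avg))
print(fbar)   # approaches min_i c_i = 0.1
```

The average iterate's value approaches $\min_i c_i$, consistent with the $RG\sqrt{2/(\rho T)}$ guarantee (here $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_1$ on the simplex).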
#### Tue, Nov 09: Newton's Method
*LV Book: 5.3.*

For a $\mathcal{C}^1$ function $g : \mathbb{R}^n \to \mathbb{R}^n$, Newton's method attempts to find a point $x^* \in \mathbb{R}^n$ where $g(x^*)=0$. This is done using the best linear approximation to $g$ at the current iterate $x^{(k)}$:
$$g(x) \approx g(x^{(k)}) + D g(x^{(k)}) \left(x - x^{(k)}\right),$$
where $D g(y)$ is the Jacobian of $g$ at $y$. We then set this approximation to $0$ and solve for the next iterate:
$$x^{(k+1)} = x^{(k)} - \left(D g(x^{(k)})\right)^{-1} g(x^{(k)}).$$
We can try to find local minima of a $\mathcal{C}^2$ function $f : \mathbb{R}^n \to \mathbb{R}$ by taking $g(x) = \nabla f(x)$. In this case, the update rule takes the form
$$x^{(k+1)} = x^{(k)} - \left(\nabla^2 f(x^{(k)})\right)^{-1} \nabla f(x^{(k)}).$$
Note that the Newton iteration has the nice feature that the algorithm is invariant under affine transformations. Note also one of the major drawbacks: a single iteration requires inverting the Hessian of $f$.

Suppose $g(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ is a univariate polynomial with real roots $\lambda_n \leq \lambda_{n-1} \leq \cdots \leq \lambda_1$ and $\lambda_1 \leq x^{(0)}$. Then Newton's method converges: $x^{(k)} \to \lambda_1$ and, moreover, one has the rate
$$x^{(k)} - \lambda_1 \leq \left(1 - \frac{1}{n}\right)^k \left(x^{(0)} - \lambda_1\right).$$
In general, Newton's method need not converge to a zero of $g$, but when it does, the rate of convergence is (eventually) quadratic.

(Local quadratic convergence) If $g : \mathbb{R}^n \to \mathbb{R}^n$ is $\mathcal{C}^2$, and $x^{(k)} \to x^*$ for some $x^*$ satisfying $g(x^*)=0$, and moreover $D g(x^*)$ is invertible, then there is a number $C=C(g,x^*)$ such that for $k$ sufficiently large,
$$\|x^{(k+1)} - x^*\| \leq C \, \|x^{(k)} - x^*\|^2.$$
Newton's method will always converge to a zero as long as $\|x^{(0)}-x^*\| \leq R(x^*, g)$, where $R(x^*,g) > 0$ if $D g, D^2 g$ are suitably nondegenerate around $x^*$. One can find in the linked notes a proof that Newton's method converges quadratically to a zero as long as $\|x^{(0)}-x^*\|$ is sufficiently small, along with a discussion of basins of attraction and failure cases.
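In one dimension the Newton update is just $x \leftarrow x - g(x)/g'(x)$. The sketch below applies it to the illustrative polynomial $g(x) = x^2 - 2$ starting from $x^{(0)} = 2$, to the right of the largest root $\lambda_1 = \sqrt{2}$, so both the monotone convergence for polynomials and the local quadratic convergence are visible.

```python
# Newton's method in 1-D: x <- x - g(x)/g'(x), applied to g(x) = x^2 - 2.
# Starting at x0 = 2 >= lambda_1 = sqrt(2), the iterates decrease
# monotonically to the largest root, and the error squares each step.

def newton(g, dg, x, steps):
    for _ in range(steps):
        x = x - g(x) / dg(x)   # zero of the linearization of g at x
    return x

g  = lambda x: x * x - 2.0
dg = lambda x: 2.0 * x

x = newton(g, dg, 2.0, steps=6)
err = abs(x - 2.0 ** 0.5)
print(err)   # near machine precision after only 6 iterations
```

Six iterations already drive the error to roughly machine precision, reflecting the doubling of correct digits per step once the iterate is in the basin of attraction.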
#### Tue, Nov 16: Interior point methods I
*LV Book: 5.4.*

Consider $c \in \mathbb{R}^n, b \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n}$, and the corresponding linear program in slack form:
$$(\mathcal{P}) \qquad \min \ \langle c,x \rangle \quad \text{subject to} \quad Ax = b, \ x \geq 0,$$
and its dual
$$(\mathcal{D}) \qquad \max \ \langle b,y \rangle \quad \text{subject to} \quad A^{\top} y + s = c, \ s \geq 0.$$
For a primal-feasible $x$ and dual-feasible $y$, we always have $\langle c,x\rangle \geq \langle b,y\rangle$ (weak duality). The duality gap for such a pair $(x,y)$ is the quantity $\mathrm{gap}(x,y) = \langle c,x\rangle - \langle b,y\rangle$. Linear programming duality tells us that $\mathrm{gap}(x^*,y^*)=0$ if and only if $(x^*,y^*)$ is an optimal primal-dual pair.

In the optimization $(\mathcal{P})$, the difficult set of constraints is given by $x \geq 0$. We can seek to soften these constraints via the use of a barrier. Consider the log-barrier: for some parameter $t > 0$,
$$(\mathcal{P}_t) \qquad \min \ \langle c,x \rangle - t \sum_{i=1}^n \log x_i \quad \text{subject to} \quad Ax = b.$$
Note that any solution to $(\mathcal{P}_t)$ with bounded objective must satisfy $x > 0$ (because otherwise the objective is $+\infty$). Introducing a vector $y \in \mathbb{R}^m$ of Lagrange multipliers, the (unique) optimum of $(\mathcal{P}_t)$ satisfies
$$A^{\top} y + s = c, \qquad \text{where } s_i = t/x_i.$$
We can therefore characterize the optimizer as given by a triple $C_t = (x^{(t)},y^{(t)},s^{(t)})$ satisfying the constraints
$$Ax = b, \qquad A^{\top} y + s = c, \qquad x_i s_i = t \ \ \forall i, \qquad x, s > 0.$$
Note that $x^{(t)}$ and $y^{(t)}$ are primal- and dual-feasible, respectively, and
$$\mathrm{gap}\left(x^{(t)}, y^{(t)}\right) = \langle x^{(t)}, s^{(t)} \rangle = nt.$$
Hence as $t \to 0$, we have $(x^{(t)},y^{(t)}) \to (x^*,y^*)$ converging to an optimal primal-dual pair. The sequence $\{ C_t = (x^{(t)},y^{(t)},s^{(t)}) : t \geq 0 \}$ is called the central path, and interior point methods solve $(\mathcal{P})$ by approximately following this path. The idea is to compute $C_{(1-\eta) t}$ from $C_t$ for $\eta > 0$ sufficiently small. (It turns out we can take $\eta \approx 1/\sqrt{n}$.) This computation is done via Newton's method, and the key insight is that for $\eta > 0$ small, we will be in a "basin of attraction" and achieve local quadratic convergence. The end result is that we can solve a linear program to accuracy $\delta$ by solving about $O(\sqrt{n} \log \frac{n}{\delta})$ linear systems.
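On a tiny LP the central path can be computed in closed form, which makes the picture above concrete. For the illustrative program $\min\, x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \geq 0$ (optimum $x^* = (1,0)$), eliminating $x_2 = 1 - x_1$ reduces the barrier problem $(\mathcal{P}_t)$ to a quadratic equation in $x_1$; all constants are assumptions for the demo.

```python
# Central-path sketch for  min x1 + 2*x2  s.t.  x1 + x2 = 1,  x >= 0.
# Eliminating x2 = 1 - x1, stationarity of the barrier objective
# (2 - x1) - t*log(x1) - t*log(1 - x1) gives x1^2 + (2t - 1)x1 - t = 0.

def central_path_point(t):
    x1 = ((1 - 2 * t) + ((2 * t - 1) ** 2 + 4 * t) ** 0.5) / 2
    return x1, 1 - x1

for t in [1.0, 0.1, 0.01, 0.001]:
    x1, x2 = central_path_point(t)
    gap = 2 * t        # duality gap along the path is n*t (n = 2 here)
    print(t, x1, gap)  # x1 -> 1 (the optimal vertex) as t -> 0
```

As $t \to 0$, the path point drifts from the analytic center of the feasible segment toward the optimal vertex, and the duality gap $nt$ shrinks linearly with $t$, exactly as in the general discussion.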
Animation of the central path for some optimizations.

#### Thu, Nov 18: Interior point methods II
*LV Book: 5.4.*

Continuing as in the last lecture, our goal is to approximately follow the central path:
$$Ax = b, \qquad A^{\top} y + s = c, \qquad x_i s_i = t \ \ \forall i, \qquad x, s > 0.$$
Define the vector $r \in \mathbb{R}^n$ by $r_i = t - x_i s_i$. Our notion of distance to the central path will be measured by
$$\Phi_t = \|r\|_2^2 = \sum_{i=1}^n \left(t - x_i s_i\right)^2.$$
In order to decrease $t$, we will try to improve the approximation $x_i s_i \approx t$ by finding a vector $(\delta_x,\delta_y,\delta_s) \in \mathbb{R}^{n + m + n}$ such that
$$(x_i + \delta_{x,i})(s_i + \delta_{s,i}) = t \ \ \forall i, \qquad A(x + \delta_x) = b, \qquad A^{\top}(y+\delta_y) + (s + \delta_s) = c, \qquad x + \delta_x, \ s + \delta_s > 0.$$
To approximate this by a linear system, we will drop the quadratic term from the first equality and ignore the nonnegativity constraints, yielding
$$s_i \delta_{x,i} + x_i \delta_{s,i} = r_i \ \ \forall i, \qquad A \delta_x = 0, \qquad A^{\top} \delta_y + \delta_s = 0.$$
Now we can see $(\delta_x,\delta_y,\delta_s)$ as the solution to the linear system
$$S \delta_x + X \delta_s = r, \qquad A \delta_x = 0, \qquad A^{\top} \delta_y + \delta_s = 0,$$
where $X = \mathrm{diag}(x_i)$ and $S = \mathrm{diag}(s_i)$. Solving this system shows that there is a projection matrix $P$ such that
$$X^{-1/2} S^{1/2} \delta_x = (I - P)\,(XS)^{-1/2} r \qquad \text{and} \qquad X^{1/2} S^{-1/2} \delta_s = P\,(XS)^{-1/2} r.$$
If we now assume that $\Phi_t = \|r\|_2^2 \leq \varepsilon^2 t^2$ (so every $x_i s_i$ is within $\varepsilon t$ of $t$), then
$$\|S^{-1} \delta_s\|_2 \leq (1+o_{\varepsilon}(1)) \frac{\|r\|_2}{t} \leq (1+o_{\varepsilon}(1))\, \varepsilon, \tag{$*$}$$
and the same bound holds for $\|X^{-1} \delta_x\|_2$. Therefore for $\varepsilon \leq 1/4$, we have $|\delta_{x,i}| \leq x_i$ and $|\delta_{s,i}| \leq s_i$ for $i=1,2,\ldots,n$, meaning that the nonnegativity constraints are satisfied for $x+\delta_x$ and $s+\delta_s$. Moreover, if we again assume $\Phi_t \leq \varepsilon^2 t^2$, then we have
$$t - (x_i + \delta_{x,i})(s_i + \delta_{s,i}) = -\delta_{x,i}\,\delta_{s,i},$$
so the new potential is $\sum_{i=1}^n (\delta_{x,i}\delta_{s,i})^2$. By Cauchy-Schwarz, we can bound this by
$$\sum_{i=1}^n \left(\delta_{x,i}\,\delta_{s,i}\right)^2 \leq \max_i (x_i s_i)^2 \cdot \|X^{-1}\delta_x\|_2^2 \, \|S^{-1}\delta_s\|_2^2 \leq \varepsilon^4 \left(1+o_{\varepsilon}(1)\right) t^2,$$
where in the last line we used the bound coming from $(*)$. One can see here the local quadratic convergence of the Newton step: we have decreased the error from $\varepsilon^2 t^2$ to $\varepsilon^4 (1+o_{\varepsilon}(1)) t^2$. For $\varepsilon = 1/4$, this gives a new potential bounded by roughly $t^2/150$. One can now check that if $\eta = \frac{1}{10 \sqrt{n}}$ and we set $t' = (1-\eta)t$, then
$$\Phi_{t'} \leq \left(\sqrt{\Phi'_t} + \eta t \sqrt{n}\right)^2 \leq \frac{(t')^2}{16},$$
meaning that we have decreased $t$ by a factor of $1-\eta$ and maintained the invariant $\Phi_t \leq t^2/16$.

### Sparsification

#### Tue, Nov 23: Subspace embeddings and sparse linear regression
*LV Book: 6.1. Homework #4.*

Consider a linear regression problem specified by $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ with $n \gg d$.
Our goal is to minimize the classical least-squares error:
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2.$$
We can calculate the optimizer exactly by setting the gradient to zero:
$$\nabla_x \|Ax - b\|_2^2 = 2 A^{\top} (Ax - b) = 0,$$
which has solution $x^* = (A^{\top} A)^{-1} (A^{\top} b)$ (assuming $A$ has full column rank). Note that $A^{\top} A$ is a $d \times d$ matrix, so we can compute its inverse in time $O(d^{\omega})$, where $\omega$ is the matrix multiplication exponent. But computing the product $A^{\top} A$ might take time $n d^2$. If $n \gg d$, this may dominate the running time.

We could also try to calculate $x^*$ via gradient descent, leading to the "Richardson iteration":
$$x^{(k+1)} = x^{(k)} + A^{\top}\left(b - A x^{(k)}\right).$$
As we will see in a moment, if we scale so that $A^{\top} A \prec I$, then $x^{(k)} \to x^*$.

Preconditioning. Note that for an invertible, symmetric matrix $M$, we can write $x^* = (M^{-1} A^{\top} A)^{-1} M^{-1} A^{\top} b$. The corresponding Richardson iteration looks like
$$x^{(k+1)} = x^{(k)} + M^{-1} A^{\top}\left(b - A x^{(k)}\right),$$
and if we know that $A^{\top} A \preceq M \preceq \kappa \, A^{\top} A$, then we obtain exponential convergence:
$$\|x^{(k)} - x^*\|_M \leq \left(1 - \frac{1}{\kappa}\right)^{k} \|x^{(0)} - x^*\|_M.$$
Let us think of $M = B^{\top} B$, where $B \in \mathbb{R}^{m \times d}$. Then the condition $\langle x, M x\rangle \approx \langle x, A^{\top} A x\rangle$ for all $x \in \mathbb{R}^d$ is the same as $\|Bx\|_2^2 \approx \|Ax\|_2^2$ for all $x \in \mathbb{R}^d$. So to find a good preconditioner $M$, we can take $B = \Pi A$ for some $\Pi \in \mathbb{R}^{m \times n}$, where $\Pi : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map that approximately preserves the norms of vectors in the subspace $S = \{Ax : x \in\mathbb{R}^d \}$. Note that $\dim(S) = d$ when $A$ has full column rank. In order for this to be efficient, it is useful if we can take the map $\Pi$ to be "oblivious" to the matrix $A$. While there is no single $\Pi$ (with $m \ll n$) that preserves the norms in every $d$-dimensional subspace, it turns out that we can choose $\Pi$ at random with this property.
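The Richardson iteration above is simple enough to sketch directly. The example below runs it on an illustrative $3 \times 2$ instance scaled so that $A^{\top} A \prec I$, and verifies that the limit satisfies the normal equations $A^{\top} A x = A^{\top} b$; the data are assumptions for the demo.

```python
# Sketch of the (unpreconditioned) Richardson iteration
#   x <- x + A^T (b - A x)
# for least squares, on a small instance with A^T A < I.

A = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
b = [1.0, 2.0, 3.0]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose_matvec(M, v):
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

x = [0.0, 0.0]
for _ in range(200):
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # residual b - A x
    g = transpose_matvec(A, r)                          # A^T (b - A x)
    x = [xi + gi for xi, gi in zip(x, g)]

# The limit satisfies the normal equations A^T A x = A^T b.
lhs = transpose_matvec(A, matvec(A, x))
rhs = transpose_matvec(A, b)
err = max(abs(l - r) for l, r in zip(lhs, rhs))
print(err)   # tiny: the iteration has converged to the least-squares solution
```

Here the eigenvalues of $A^{\top}A$ lie in $[1/4, 3/4]$, so the error contracts by a factor of at least $3/4$ per step; a good preconditioner $M$ would make the contraction factor independent of the conditioning of $A$.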
Oblivious subspace embeddings: A random linear map $\Pi \in \mathbb{R}^{m \times n}$ is said to be a $(d,\epsilon,\delta)$-OSE if for every subspace $S \subseteq \mathbb{R}^n$ with $\dim(S)=d$,
$$\mathbb{P}\left[\forall x \in S : (1-\epsilon)\|x\|_2 \leq \|\Pi x\|_2 \leq (1+\epsilon)\|x\|_2\right] \geq 1 - \delta,$$
where the probability is taken only over the choice of $\Pi$. It is known that if $\Pi \in \mathbb{R}^{m \times n}$ has entries that are i.i.d. $N(0,1/m)$ or i.i.d. $\pm 1/\sqrt{m}$, then $\Pi$ is a $(d,\epsilon,\delta)$-OSE for some choice
$$m = O\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right).$$
This is not so useful for our application: since this $\Pi$ is a dense matrix, computing $\Pi A$ is still expensive. Research on sparse JL transforms shows that if $\Pi$ is chosen uniformly at random so that every column contains exactly $s$ non-zero entries that are $\pm 1/\sqrt{s}$, then $\Pi$ is a $(d,\epsilon,\delta)$-OSE when we choose, roughly, $m = \tilde{O}(d/\epsilon^2)$ and $s = \tilde{O}(1/\epsilon)$. Using this in the Richardson iteration (see 6.1.3 in the LV book), one obtains an algorithm for linear regression that runs in time $\tilde{O}(\mathrm{nnz}(A) + n + d^{\omega})$, where $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$.

#### Tue, Nov 30: Stochastic gradient descent
*LV Book: 6.3.*

Let $\mathcal{D}$ be a distribution on convex functions $f : \mathbb{R}^n \to \mathbb{R}$, and suppose that $F : \mathbb{R}^n \to \mathbb{R}$ is the convex function given by
$$F(x) = \mathbb{E}_{f \sim \mathcal{D}}\left[f(x)\right].$$
As usual, our goal is to find a minimizer $x^* \in \mathrm{argmin} \{ F(x) : x \in \mathbb{R}^n \}$. One example is where $\mathcal{D}$ is uniform on $\{f_1,\ldots,f_N\}$, so that $F(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)$. If we tried to solve this with gradient descent, computing $\nabla F(x)$ at some $x$ would require $N$ gradient computations $\nabla f_1(x), \ldots, \nabla f_N(x)$. Stochastic gradient descent instead evaluates a single gradient at every step:
• Choose $f^{(k)} \sim \mathcal{D}$.
• Define $x^{(k+1)} \leftarrow x^{(k)} - \eta \nabla f^{(k)}(x^{(k)})$.
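The SGD loop just described can be sketched in a few lines. The example below takes $F(x) = \frac{1}{N}\sum_i (x - a_i)^2$, whose minimizer is the mean of the $a_i$, and samples one $f_i$ per step; the data, seed, and step size are illustrative assumptions.

```python
# Sketch of SGD for F(x) = (1/N) sum_i (x - a_i)^2, minimized at the mean.
# Each step draws one f_i and takes x <- x - eta * grad f_i(x).

import random

random.seed(0)
a = [0.0, 1.0, 2.0, 3.0]     # F is minimized at x* = 1.5
eta = 0.05
x = 0.0
iterates = []
for _ in range(2000):
    ai = random.choice(a)            # draw f ~ D
    x = x - eta * 2 * (x - ai)       # grad f_i(x) = 2 (x - a_i)
    iterates.append(x)

x_avg = sum(iterates[1000:]) / 1000  # average late iterates to reduce noise
print(x_avg)                         # close to 1.5
```

The single iterates hover in a noise ball of radius proportional to $\sqrt{\eta}\,\sigma$ around $x^*$, matching the $\eta\sigma^2/\mu$ term in the SGD guarantees; averaging the iterates shrinks this noise.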
Analysis of SGD. We will require the following fact: If $f : \mathbb{R}^n \to \mathbb{R}$ is convex with $L$-Lipschitz gradients, then
$$\|\nabla f(y) - \nabla f(x)\|_2^2 \leq 2L \left(f(y) - f(x) - \langle \nabla f(x), y-x\rangle\right).$$
In other words, when $f(y)$ is close to the first-order approximation at $x$, it holds that $\nabla f(x)$ and $\nabla f(y)$ are also close. Taking $x = x^*$ and $y = x$ and then taking expectations over $f \sim \mathcal{D}$ gives
$$\label{eq:Fgrad} \mathbb{E}_{f \sim \mathcal{D}} \|\nabla f(x) - \nabla f(x^*)\|_2^2 \leq 2L \left(F(x) - F(x^*)\right),$$
where we have used $\mathbb{E}_{f \sim \mathcal{D}} [\nabla f(x^*)] = \nabla F(x^*) = 0$.

Theorem: Suppose that every $f$ in the support of $\mathcal{D}$ is convex with $L$-Lipschitz gradients, and that $F$ is $\mu$-strongly convex. Define $\sigma^2 = \frac12 \mathbb{E}[\|\nabla f(x^*)\|^2]$. Then for any step size $\eta \leq \frac{1}{4L}$, the SGD iterates $\{ x^{(k)} \}$ satisfy, for every $T \geq 1$, the following guarantees:
$$\label{eq:sgd1} \mathbb{E}\,\|x^{(T)} - x^*\|_2^2 \leq (1 - \eta\mu)^T \|x^{(0)} - x^*\|_2^2 + \frac{4\eta\sigma^2}{\mu},$$
$$\label{eq:sgd2} \frac{1}{T} \sum_{k=0}^{T-1} \mathbb{E}\left[F(x^{(k)}) - F(x^*)\right] \leq \frac{\|x^{(0)} - x^*\|_2^2}{\eta T} + 4\eta\sigma^2.$$

Comparison with the empirical risk minimizer. To understand the quality of the SGD guarantees, suppose we sample $f_1,f_2,\ldots,f_T \sim \mathcal{D}$ independently and then define the empirical risk minimizer
$$\hat{x}_T \in \mathrm{argmin}_{x \in \mathbb{R}^n} \frac{1}{T} \sum_{i=1}^T f_i(x).$$
The Cramer-Rao bound asserts that under the conditions of the preceding theorem, we have
$$\mathbb{E}\left[F(\hat{x}_T)\right] - F(x^*) \gtrsim \frac{\sigma^2}{T},$$
thus we can think of $\sigma^2/T$ error as the gold standard. By combining \eqref{eq:sgd1} and \eqref{eq:sgd2} with the convexity of $F$, one obtains that for the average point $\bar{x} = \frac{1}{T}\sum_{k=0}^{T-1} x^{(k)}$, SGD gives
$$\mathbb{E}\left[F(\bar{x})\right] - F(x^*) \leq \frac{\|x^{(0)} - x^*\|_2^2}{\eta T} + 4\eta\sigma^2.$$

#### Thu, Dec 02: Variance reduction and coordinate descent
*LV Book: 6.3.*

Let us now assume that $F(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)$ (so the distribution $\mathcal{D}$ is uniform on $\{f_1,\ldots,f_N\}$). Notice the dependence of \eqref{eq:sgd1} and \eqref{eq:sgd2} on the variance $\sigma^2$. To remove this dependence, we consider a variance reduction technique. For any $\tilde{x} \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$, define $\tilde f$ by
$$\label{eq:tildef} \nabla \tilde{f}(x) = \nabla f(x) - \nabla f(\tilde{x}) + \nabla F(\tilde{x}).$$
Note that $\mathbb{E}_{i \in [N]} \nabla \tilde{f_i}(x) = \nabla F(x) = \mathbb{E}_{i \in [N]} \nabla f_i(x)$.
From \eqref{eq:Fgrad}, one can also conclude that
$$\mathbb{E}_{i \in [N]} \|\nabla \tilde{f_i}(x)\|_2^2 \leq 4L \left(F(x) - F(x^*)\right) + 4L \left(F(\tilde{x}) - F(x^*)\right).$$
This suggests a natural algorithm that runs SGD in phases, where in each phase we replace $\tilde{x}$ by progressively better approximations to $x^*$ (so that the variance keeps decreasing).

SGD phase($\tilde{x}$, $x^{(0)}$, $m$, $\eta$): Given points $\tilde{x},x^{(0)} \in \mathbb{R}^n$ and $m \geq 1$, run $\mathrm{SGD}(\eta)$ for $m$ steps using the functions $\{\tilde{f_1},\ldots,\tilde{f_N}\}$, where $\tilde{f}$ is defined in \eqref{eq:tildef}. Output $\langle x^{(0)},x^{(1)}, \ldots, x^{(m-1)}\rangle$. Note that an SGD phase requires $m + N$ gradient computations, where the $N$ gradient computations are needed for $\nabla F(\tilde{x})$.

SVRG$(x^{(0)}, m, T, \eta)$ (Stochastic Variance Reduced Gradient):
• Define $\tilde{x} = x^{(0)}$.
• For $k=0,1,\ldots,T-1$:
  • Let $\langle x^{(0)},x^{(1)}, \ldots, x^{(m-1)}\rangle$ be the output of SGD phase$(\tilde{x}, x^{(0)}, m, \eta)$.
  • $\tilde{x} \leftarrow \frac{1}{m} \left(x^{(0)}+x^{(1)}+\cdots+x^{(m-1)}\right)$.
  • $x^{(0)} \leftarrow \tilde{x}$.
• Output $x^{(T-1)}$.

Theorem (SVRG): Suppose $f_1,\ldots,f_N$ have $L$-Lipschitz gradients and $F$ is $\mu$-strongly convex. Define $\eta = \frac{1}{64 L}$ and $m = 256 \frac{L}{\mu}$. Then for every $T \geq 1$, the output $x_{T} =$ SVRG$(x^{(0)},m,T,\eta)$ satisfies
$$\mathbb{E}\left[F(x_T) - F(x^*)\right] \leq 2^{-T} \left(F(x^{(0)}) - F(x^*)\right).$$
Note that to get a solution with error $\epsilon\, (F(x^{(0)})-F(x^*))$ requires $O((N+\frac{L}{\mu}) \log \frac{1}{\epsilon})$ gradient computations. Our analysis of regular gradient descent required $O(\frac{L}{\mu} \log \frac{1}{\epsilon})$ iterations, i.e., $O(\frac{L}{\mu} N \log \frac{1}{\epsilon})$ gradient computations.

### Acceleration

#### Tue, Dec 07: Chebyshev polynomials
*LV Book: 7.1.*

• Chebyshev polynomials (of the first kind)
• Existence of a degree-$d$ polynomial $p_d$ satisfying $p_d(0) = 1$ and $|p_d(x)| \leq \varepsilon$ for all $x \in [\lambda_{\min}(A), \lambda_{\max}(A)]$, where $d \leq O(\sqrt{\kappa} \log (\frac{1}{\varepsilon}))$ and $\kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$.
• For every $s,d > 0$, existence of a polynomial $q(x)$ of degree $d$ satisfying
$$\max_{x \in [-1,1]} \left|q(x) - x^s\right| \leq 2 e^{-d^2/(2s)}.$$

#### Thu, Dec 09: Conjugate gradient
*LV Book: 7.2.*

Consider the function $f(x) = \langle x,Ax\rangle - \langle b,x\rangle$, where $A \succ 0$ is positive definite. We have
$$\nabla f(x) = 2Ax - b,$$
so the unique minimizer is $x^* = \frac12 A^{-1} b$. Any sequence $x^{(k)}$ satisfying the invariant
$$x^{(k+1)} \in x^{(0)} + \mathrm{span}\left\{\nabla f(x^{(0)}), \ldots, \nabla f(x^{(k)})\right\}$$
has the property that (taking $x^{(0)} = 0$)
$$x^{(k)} \in \mathrm{span}\left\{b, Ab, A^2 b, \ldots, A^{k-1} b\right\}.$$
Define this subspace to be $\mathcal{K}_k$. This is called the $k$th Krylov subspace. We can consider the best such sequence, given by
$$x^{(k)} = \mathrm{argmin}\left\{ f(x) : x \in \mathcal{K}_k \right\}.$$
Using the fact that $f(x) - f(x^*) = \langle x - x^*, A(x - x^*)\rangle$, one can use the existence of Chebyshev polynomials to show that
$$f(x^{(k)}) - f(x^*) \leq 4 \left(1 - \frac{1}{\sqrt{\kappa}}\right)^{2k} \left(f(x^{(0)}) - f(x^*)\right),$$
where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$ is the condition number of $A$. Note the $1-\frac{1}{\sqrt{\kappa}}$ factor decay in the error, vs. $1-\frac{1}{\kappa}$ for the Richardson iteration. We can compute the sequence $x^{(k)}$ using the conjugate gradient method. For an introduction to accelerated gradient descent for general convex functions, see Aaron Sidford's lecture notes.
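The conjugate gradient method itself fits in a short sketch. The implementation below is the standard CG recurrence for solving $Ax = b$ with $A$ symmetric positive definite (equivalently, minimizing $\frac12\langle x,Ax\rangle - \langle b,x\rangle$); in exact arithmetic it terminates within $n$ iterations. The $2 \times 2$ system is illustrative.

```python
# Sketch of the conjugate gradient method for A x = b, A symmetric PD.
# Exact termination within n iterations; in floating point the residual
# after n steps is near machine precision for well-conditioned A.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def cg(A, b, steps):
    x = [0.0] * len(b)
    r = list(b)        # residual b - A x (since x = 0 initially)
    p = list(r)        # search direction, A-conjugate to previous directions
    for _ in range(steps):
        rr = dot(r, r)
        if rr < 1e-30:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                            # exact line search
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]       # new direction
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b, steps=2)    # n = 2 iterations suffice
res = max(abs(bi - yi) for bi, yi in zip(b, matvec(A, x)))
print(res)               # residual ~0 after n steps
```

Each iteration costs one matrix-vector product, and the Chebyshev-based bound above explains why far fewer than $n$ iterations already give high accuracy when $\kappa$ is small.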

## Homeworks

• You may discuss problems with your classmates, but when you write down the solutions, you should do so by yourself. You can use the internet and books for reference material but you should cite every source that you consulted (the name of the book or web page suffices). You should also cite any classmates with whom you discussed solutions.
• Homework should be typeset.
• In general, late submissions are not accepted.