CSE 599: Analytic and geometric methods in TCS

Winter 2023

MW 11:30am-12:50pm in CSE2 387

Instructor: James R. Lee

Office hours: By appointment

Course email list [archives]

Course description:

This is an advanced graduate-level theory course about tools from analysis and geometry that have interesting applications in algorithms, complexity theory, and ML theory. Evaluation will be based on scribe notes and (possibly) student presentations. One topic per week; the choice of topics will be guided by class interest.

I taught a version of this course back in prehistory. The set of topics will be mostly disjoint.

Lectures

  • Generic chaining and extrema of random processes
  • The KLS conjecture and stochastic localization
  • Topological methods
  • High-dimensional expanders
  • Noise stability on the hypercube via Brownian motion
  • Recent topics in ML theory

Wed, Jan 04
NO CLASS

Generic chaining and extrema of random processes
Mon, Jan 09
Examples, applications, Gaussian processes

  • A random process is a collection \(\{X_t : t \in T\}\) of random variables. For the next few lectures, we will consider symmetric processes, i.e., those where \(X_t\) and \(-X_t\) have the same law for every \(t \in T\).

  • We equip the index set \(T\) of such a process with the natural \(\ell_2\) metric

    \[d(s,t) \seteq \sqrt{\E(X_s-X_t)^2}.\]
  • A fundamental example is that of a Gaussian process, where the family \(\{X_t : t \in T\}\) is a jointly normal family of random variables. When \(T\) is finite, any such family can be described as follows: \(T \subseteq \R^n\) for some \(n \geq 1\), and \(X_t = \langle g,t\rangle\), where \(g=(g_1,\ldots,g_n)\) is a standard \(n\)-dimensional gaussian, i.e., the coordinates \(g_1,\ldots,g_n\) are i.i.d. \(N(0,1)\) random variables.

    In this case the geometry of the metric space \((T,d)\) is Euclidean, since \(d(s,t) = \|s-t\|_2\).

    Also note that the metric \(d(s,t)\) completely describes the covariance structure of the process, and therefore characterizes it (this is a special property of jointly normal families of random variables).

  • We will be concerned with extrema of such a process, and to that end we study the quantity

    \[\E \max_{t \in T} X_t\,.\]
  • A random process \(\{ X_t : t \in T \}\) is called sub-gaussian if there is some constant \(c > 0\) such that

    \[\begin{equation}\label{eq:sg-tail} \P[X_s - X_t > \lambda] \leq \exp\left(- c\lambda^2/d(s,t)^2\right) \end{equation}\]

    for all \(s,t \in T\) and \(\lambda > 0\).

    In particular, Gaussian processes are sub-gaussian with \(c=1/2\).

    Another example is a Bernoulli process, i.e. a family of random variables defined thusly: For some subset \(T \subseteq \R^n\),

    \[\left\{ \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t = (t_1,\ldots,t_n) \in T \right\},\]

    where \(\e_1,\ldots,\e_n \in \{-1,1\}\) are i.i.d. uniformly random signs.

  • Diameter bound: When the index set \(T\) is finite, we can use the tail bound \eqref{eq:sg-tail} to give a simple upper bound on \(\E \max_{t \in T} X_t\).

    Fix some \(t_0 \in T\) and note that \(\E \max_{t \in T} X_t = \E \max_{t \in T} (X_t-X_{t_0})\). Now applying a union bound gives

    \[\P\left[\max_{t \in T} (X_t-X_{t_0}) > \lambda\right] \leq |T| \exp\left(- c\lambda^2/\Delta^2\right)\,,\]

    where \(\Delta \seteq \max \{ d(t,t_0) : t \in T \}\).

    A simple calculation gives the conclusion

    \[\E \max_{t \in T} X_t \lesssim \Delta \sqrt{\log |T|}\,.\]
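
As a quick sanity check of the diameter bound above, here is a minimal Python sketch (all parameter choices are arbitrary illustrations, not taken from the lecture). It builds a finite Gaussian process \(X_t = \langle g,t\rangle\) over random points \(T \subseteq \R^n\), estimates \(\E \max_{t \in T} X_t\) by Monte Carlo, and compares it to \(\Delta \sqrt{\log |T|}\) (recall the bound holds only up to a universal constant).

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite Gaussian process: T is a set of points in R^n and X_t = <g, t>.
n, num_points, trials = 50, 200, 2000
T = rng.normal(size=(num_points, n))             # index set T ⊂ R^n (arbitrary)
t0 = T[0]                                        # reference point t_0
Delta = np.max(np.linalg.norm(T - t0, axis=1))   # Δ = max_t d(t, t_0)

# Monte Carlo estimate of E max_{t in T} X_t.
g = rng.normal(size=(trials, n))                 # standard Gaussian vectors
emp = np.mean(np.max(g @ T.T, axis=1))

# Union-bound ("diameter") upper bound, up to the universal constant.
bound = Delta * np.sqrt(np.log(num_points))

print(f"empirical E max X_t ≈ {emp:.2f}")
print(f"Δ·sqrt(log|T|)      ≈ {bound:.2f}")
```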

DETAILED PRELIMINARY NOTES

Supplementary reading:

Wed, Jan 11
Generic chaining and Dudley's entropy bound

Recall that we consider a subgaussian process \(\{X_t : t \in T\}\) and equip \(T\) with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\).

  • Chaining: Instead of applying a naive union bound to control \(X_t - X_{t_0}\) for every \(t \in T\), we can break every such difference into parts and then bound each part. This potentially reduces the number of events we need to take a union bound over.

    For instance, given random variables \(X_0,X_1,X_2,X_3\), we could write

    \[\begin{eqnarray*} X_3 - X_0 &=& (X_3 - X_1) + (X_1 - X_0) \\ X_2 - X_0 &=& (X_2 - X_1) + (X_1 - X_0) \\ X_1 - X_0 &=& X_1 - X_0\,. \end{eqnarray*}\]
  • Generic chaining upper bound: For sub-gaussian processes, we can control \(\E \max_{t \in T} X_t\) using chaining.

    Fix some \(t_0 \in T\) and a sequence of finite sets

    \[\{t_0\} = T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T_n \subseteq \cdots \subseteq T\,,\]

    with cardinalities bounded by \(\abs{T_n} \leq 2^{2^n}\) for every \(n \geq 1\).

    Then we have:

    \[\begin{equation}\label{eq:gc-ub} \E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t,T_n)\,, \end{equation}\]

    where \(d(t,T_n) = \min_{s \in T_n} d(s,t)\).

  • Dudley’s entropy bound: If \(\{X_t : t \in T\}\) is a subgaussian process, then

    \[\begin{equation}\label{eq:dudley} \E \max_{t \in T} X_t \lesssim \sum_{n \geq 1} 2^{n/2} e_n(T)\,, \end{equation}\]

    where \(e_n(T)\) is the largest distance \(d\) such that there are \(2^{2^n}\) points \(y_1,y_2,\ldots,y_{2^{2^n}}\) in \(T\) with \(d(y_i,y_j) \geq d\) for all \(i \neq j\).

    Easy exercise: Show that Dudley’s entropy bound follows from the generic chaining upper bound \eqref{eq:gc-ub}.

  • Covering numbers: Let \(N(T,d,\e)\) denote the smallest number \(N\) such that the metric space \((T,d)\) can be covered by \(N\) balls of radius \(\e\), and consider the quantity

    \[C_T \seteq \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^{h})}\,.\]

    Up to a factor of \(2\), this is equivalent to the more elegant expression \(\int_0^{\infty} \sqrt{\log N(T,d,\e)} \,d\e\).

    Exercise: Show that for any metric space \((T,d)\), it holds that

    \[\begin{equation}\label{eq:CT} \sum_{n \geq 1} 2^{n/2} e_n(T) \asymp \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^h)}\,. \end{equation}\]
  • Analysis of the probability simplex

    Let \(T = \{ x \in \R^n : x_1,\ldots,x_n \geq 0, x_1+\cdots+x_n = 1 \}\) denote the probability simplex.

    Lemma 1: For this \(T\), let \(X_t = \langle g,t\rangle\), where \(g=(g_1,\ldots,g_n)\) is a standard \(n\)-dimensional Gaussian. Then \(\E \max_{t \in T} X_t \asymp \sqrt{\log n}\).

    This is because the maximum of the linear functional \(t \mapsto \langle g,t\rangle\) is achieved at a vertex of the convex body \(T\) which is equal to the convex hull of the standard basis vectors: \(T = \mathrm{conv}(e_1,\ldots,e_n)\). Thus \(\max_{t \in T} \langle g,t\rangle = \max \{ g_i : i =1,\ldots,n\}\), and therefore

    \[\E \max_{t \in T} \langle g,t\rangle = \E \max(g_1,\ldots,g_n) \asymp \sqrt{\log n}.\]

    Let us now see that the entropy bound \eqref{eq:dudley} is far from tight in this simple case. We will use the formulation via covering numbers coming from the equivalence \eqref{eq:CT}.

    Consider vectors \(t \in T\) with \(t_i \in \{0,1/k\}\) for \(i=1,\ldots,n\) (so exactly \(k\) coordinates are equal to \(1/k\)). There are \({n \choose k} \geq \left(\frac{n}{k}\right)^k\) such vectors, so for \(k \leq \sqrt{n}\) we have \(\log {n \choose k} \geq k \log(n/k) \gtrsim k \log n\). Moreover, it is possible to choose a constant fraction of these vectors (e.g., a random subset will suffice) such that

    \[d(t_i,t_j) \gtrsim \left(k \cdot \frac{1}{k^2} \right)^{1/2} \asymp k^{-1/2}, \qquad i \neq j\,.\]

    It follows that for \(k \leq \sqrt{n}\), we have \(\log N(T,d,k^{-1/2}) \gtrsim k \log n\). Taking \(k = 2^{2h}\), consider the \(\approx \log n\) values of \(h\) for which \(1 \leq 2^{h} \leq n^{1/4}\), so that \(k \leq \sqrt{n}\).

    This gives us \(\approx \log n\) values of \(h\) for which

    \[2^{-h} \sqrt{\log N(T,d,2^{-h})} \gtrsim 2^{-h} \cdot 2^h \sqrt{\log n} = \sqrt{\log n}\,.\]

    Therefore Dudley’s bound \eqref{eq:CT} is \(\gtrsim (\log n) \cdot \sqrt{\log n} = (\log n)^{3/2}\). (Recall that the correct bound, given by Lemma 1, is \(\sqrt{\log n}\).)
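
A minimal numerical illustration of Lemma 1: by the discussion above, \(\E \max_{t \in T} \langle g,t\rangle = \E\max_i g_i\), which grows like \(\sqrt{\log n}\) (with the familiar constant \(\sqrt{2}\) for i.i.d. standard Gaussians), while the computation above shows that the entropy bound is off by roughly a factor of \(\log n\). The sketch below (dimensions chosen arbitrarily) only checks the first statement by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lemma 1: E max_{t in simplex} <g, t> = E max_i g_i, which is of order sqrt(log n).
trials = 5000
for n in (10, 100, 1000, 10000):
    g = rng.normal(size=(trials, n))
    emp = np.mean(np.max(g, axis=1))      # Monte Carlo estimate of E max(g_1, ..., g_n)
    print(f"n={n:6d}   E max g_i ≈ {emp:.3f}   sqrt(2 log n) ≈ {np.sqrt(2*np.log(n)):.3f}")
```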

Mon, Jan 16
NO CLASS (MLK day)

Wed, Jan 18
Graph and hypergraph sparsification

  • We have seen that if \(\{X_t : t \in T \}\) is a symmetric sub-Gaussian process equipped with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\), then for any sequence \(\{t_0\} = T_0 \subseteq T_1 \subseteq \cdots \subseteq T_h \subseteq \cdots \subseteq T\) with \(|T_h| \leq 2^{2^h}\), we have a chaining upper bound

    \[\E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{h \geq 0} 2^{h/2} d(t,T_h)\,.\]

    Let us define \(\gamma_2(T,d)\) as the best possible chaining upper bound, i.e., the infimum of the RHS over all compliant sequences \(\{T_h\}\). This is Talagrand’s famous \(\gamma_2\) functional, and it may be surprising that it characterizes (up to constants) the expected supremum of a Gaussian process.

    Majorizing-measures theorem (Fernique-Talagrand): If \(\{X_t : t\in T \}\) is a centered Gaussian process, then

    \[\E \max_{t \in T} X_t \asymp \gamma_2(T,d)\,.\]
  • Random signed sums: Suppose that \(\e_1,\e_2,\ldots,\e_n\) are i.i.d. uniformly random signs (independent, symmetric \(\{-1,1\}\)-valued random variables) and we define the random process:

    \[\left\{ X_t = \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t \in T \right\}\,,\]

    where \(T \subseteq \R^n\). Then classical concentration bounds show that \(\{X_t : t\in T\}\) is subgaussian, and it is straightforward to calculate the distance (by expanding the square):

    \[d(s,t) = \|s-t\|_2\,.\]

    The chaining upper bound gives \(\E \max_{t \in T} X_t \lesssim \gamma_2(T,d)\). Let us now use this to understand the Hypergraph sparsification problem.

  • Hypergraph sparsification. Consider a weighted hypergraph \(H=(V,E,w)\) where \(\{w_e : e \in E\}\) are nonnegative edge weights. We associate to \(H\) the quadratic expression

    \[Q_H(x) = \sum_{e \in E} w_e \max_{u,v \in e} (x_u-x_v)^2\,.\]

    If \(H\) were a graph (where \(|e|=2\) for every \(e \in E\)), this would correspond to the quadratic form of the graph Laplacian.

    Our goal is to find another hypergraph \(\tilde{H} = (V,\tilde{E},\tilde{w})\) with \(\tilde{E} \subseteq E\) and such that

    \[\begin{equation}\label{eq:sparse-approx} |Q_H(x)-Q_{\tilde{H}}(x)| \leq \e Q_H(x),\qquad \forall x \in \R^n\,, \end{equation}\]

    where \(\e > 0\) is some accuracy parameter. Moreover, we would like \(\abs{\tilde{E}}\) to be small, ideally near-linear in \(\abs{V}\) (while \(\abs{E}\) could be as large as \(2^{\abs{V}}\)).

  • Independent random sampling. Suppose we have a probability distribution \(\{\mu_e : e \in E\}\). Let us form \(\tilde{H}\) by sampling \(M\) edges \(e_1,e_2,\ldots,e_M\) independently from \(\mu\) and setting the edge weights in \(\tilde{H}\) so that

    \[Q_{\tilde{H}}(x) = \frac{1}{M} \sum_{k=1}^M \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\,,\]

    where we define \(Q_e(x) = \max_{u,v \in e} (x_u-x_v)^2\).

    Note that \(\E[Q_{\tilde{H}}(x)] = Q_H(x)\) for all \(x \in \R^n\), and to establish a bound like \eqref{eq:sparse-approx}, it makes sense to study the quantity

    \[\E \max_{Q_H(x) \leq 1} |Q_H(x) - Q_{\tilde{H}}(x)|.\]
  • Passing to a sub-Gaussian process. Let \(\hat{H}\) be an independent copy of \(\tilde{H}\) and write

    \[\begin{eqnarray*} \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |Q_H(x) - Q_{\tilde{H}}(x)| &=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |\E_{\hat{H}}[Q_{\hat{H}}(x)]-Q_{\tilde{H}}(x)| \\ &=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |\E_{\hat{H}} [Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)]| \\ &\leq& \E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} |Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)|\,, \end{eqnarray*}\]

    where we have used convexity to pull out the expectation: \(\abs{\E X} \leq \E \abs{X}\) and \(\max(\E X_1,\ldots,\E X_k) \leq \E \max(X_1,\ldots,X_k)\).

    Note that for any fixed choice of signs \(\e_1,\ldots,\e_M \in \{-1,1\}\), the random function

    \[Q_{\hat{H}}(x)-Q_{\tilde{H}}(x) = \frac{1}{M} \sum_{k=1}^M \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x)- \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right)\]

    has the same law as

    \[\frac{1}{M} \sum_{k=1}^M \e_k \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x)- \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right),\]

    because \(\tilde{e}_k\) and \(\hat{e}_k\) are i.i.d., so swapping them does not change the joint law. Define \(t_k(x) \seteq \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\) for a hyperedge \(e_k\); then we have

    \[\E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} |Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)| \leq \frac{2}{M}\, \E\,\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \left|\sum_{k=1}^M \e_k t_k(x)\right| \leq \frac{4}{M}\, \E\,\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\,,\]

    where the inner expectation is over i.i.d. uniformly random signs, the outer expectation is over the i.i.d. hyperedges \(e_1,\ldots,e_M\) sampled from \(\mu\), and we removed the absolute values using symmetry of the process (together with the triangle inequality to pass to a single copy).

    So now to bound the quantity we cared about initially, it suffices to bound \(\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\) for every fixed choice of hyperedges \(e_1,\ldots,e_M\). Of course, this is a sub-Gaussian process, and so we are led to the study of \(\gamma_2(T,d)\), where \(T = \{ x \in \R^n : Q_H(x) \leq 1\}\), and

    \[d(x,y) = \left(\sum_{k=1}^M \left(\frac{w_{e_k}}{\mu_{e_k}} (Q_{e_k}(x)-Q_{e_k}(y))\right)^2\right)^{1/2}\]
  • Abstracting out the norm sparsification problem. Note that each \(\sqrt{Q_e(x)}\) for \(e \in E\) is actually a norm on \(\R^n\) (technically it is only a seminorm since \(Q_e(x)=0\) does not entail \(x=0\)).

    Thus we can consider the following related problem: Suppose that \(N_1,N_2,\ldots,N_m\) are (semi)norms on \(\R^n\) and we want to bound

    \[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2\,,\]

    where \(T \subseteq \R^n\) and \(\e_1,\ldots,\e_m\) are i.i.d. random signs. Again, this is a sub-Gaussian process equipped with the distance

    \[d(x,y) = \left(\sum_{k=1}^m (N_k(x)^2 - N_k(y)^2)^2\right)^{1/2}.\]

    In the next lecture, we will find upper bounds on \(\gamma_2(T,d)\) for this family of problems.
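
To make the independent random sampling scheme described above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption on my part: the random hypergraph, the sampling distribution \(\mu\) (taken proportional to the weights), and the test points; the notes do not prescribe these choices. The sketch only checks empirically that \(Q_{\tilde{H}}\) is an unbiased and reasonably accurate estimate of \(Q_H\).

```python
import numpy as np

rng = np.random.default_rng(0)

def Q_e(x, e):
    """Q_e(x) = max_{u,v in e} (x_u - x_v)^2 for a hyperedge e (a tuple of vertices)."""
    vals = x[list(e)]
    return (vals.max() - vals.min()) ** 2

def Q_H(x, edges, w):
    """Q_H(x) = sum_e w_e * Q_e(x)."""
    return sum(w_e * Q_e(x, e) for e, w_e in zip(edges, w))

# A small random weighted hypergraph on n vertices (illustrative only).
n, m = 30, 400
edges = [tuple(rng.choice(n, size=rng.integers(2, 5), replace=False)) for _ in range(m)]
w = rng.uniform(0.5, 2.0, size=m)

# Importance sampling: mu_e proportional to w_e (an arbitrary illustrative choice).
mu = w / w.sum()
M = 150
sample = rng.choice(m, size=M, p=mu)

def Q_tilde(x):
    """The estimator (1/M) * sum_k (w_{e_k} / mu_{e_k}) * Q_{e_k}(x); E[Q_tilde(x)] = Q_H(x)."""
    return np.mean([(w[k] / mu[k]) * Q_e(x, edges[k]) for k in sample])

for _ in range(3):
    x = rng.normal(size=n)
    qh, qt = Q_H(x, edges, w), Q_tilde(x)
    print(f"Q_H(x) = {qh:8.2f}   Q_tilde(x) = {qt:8.2f}   rel. err = {abs(qt - qh) / qh:.3f}")
```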

Supplementary reading:

Mon, Jan 23
Sparsifying sums of norms

  • Norms on \(\R^n\). Recall that a norm \(N\) on \(\R^n\) is a nonnegative mapping \(N : \R^n \to \R_+\) such that

    1. \(N(\lambda x) = \abs{\lambda} N(x)\) for all \(x \in \R^n\) and \(\lambda \in \R\).
    2. \(N(x+y) \leq N(x) + N(y)\) for all \(x,y \in \R^n\).
    3. \(N(x) = 0 \iff x = 0\) for all \(x \in \R^n\).

    \(N\) is a semi-norm if it only satisfies properties (1) and (2). We will consider semi-norms throughout the lecture and say “norm” even when considering semi-norms.

  • An upper bound. Suppose that \(N_1,N_2,\ldots,N_m\) are norms on \(\R^n\) and \(T \subseteq B_2^n\), where \(B_2^n = \{ x \in \R^n : \|x\|_2 \leq 1 \}\) is the Euclidean unit ball.

    Define \(\kappa \seteq \E \max_{k=1,\ldots,m} N_k(g)\), where \(g\) is a standard \(n\)-dimensional Gaussian.

    Our goal is to prove the following: If \(\e_1,\e_2,\ldots,\e_m\) are i.i.d. random signs, then

    \[\begin{equation}\label{eq:goal11} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \kappa \log(n) \max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2}\,. \end{equation}\]
  • Example: Operator norm for sums of random matrices. Another natural situation where a bound like this is useful is for sums of random matrices. Suppose that

    \[A = \sum_{k=1}^m \e_k A_k^T A_k\,,\]

    where each \(A_k^T A_k\) is a PSD matrix. Then the operator norm of \(A\) can be written as

    \[\|A\|_{\mathrm{op}} = \max_{\|x\|_2 \leq 1} \langle x, A x\rangle = \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \langle x, A_k^T A_k x\rangle = \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \|A_k x\|_2^2\,,\]

    which is the special case of our problem where \(N_k(x) = \|A_k x\|_2\) and \(T = B_2^n\).

  • Applying Dudley’s entropy inequality. Recall that \(\left\{ \sum_{k=1}^m \e_k N_k(x)^2 : x \in T \right\}\) is a subgaussian process when equipped with the distance

    \[d(x,y) \seteq \left(\sum_{k=1}^m \left(N_k(x)^2 - N_k(y)^2\right)^2\right)^{1/2}.\]

    Therefore Dudley’s entropy bound shows that

    \[\begin{equation}\label{eq:dudley3} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,d)\,. \end{equation}\]
  • Controlling the subgaussian distance via a norm. Notice that both sides of the inequality \eqref{eq:goal11} are quadratic in the norms \(N_k\). Therefore by rescaling we may assume that

    \[\begin{equation}\label{eq:L2N} \max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2} = 1\,. \end{equation}\]

    Define the norm

    \[\|x\|_N \seteq \max_{k=1,\ldots,m} N_k(x)\,.\]

    Note that using \(a^2-b^2 = (a-b)(a+b)\), for any \(x,y \in T\), we have

    \[\begin{eqnarray*} d(x,y) &=& \left(\sum_{k=1}^m (N_k(x)-N_k(y))^2 (N_k(x)+N_k(y))^2 \right)^{1/2} \\ &\leq& \|x-y\|_N \left(\sum_{k=1}^m (N_k(x)+N_k(y))^2\right)^{1/2} \\ &\leq& 2 \|x-y\|_N\,, \end{eqnarray*}\]

    where the first inequality uses \(|N_k(x)-N_k(y)| \leq N_k(x-y)\) (true for any norm) and the second inequality uses \eqref{eq:L2N} and \(x,y \in T\).

  • Splitting the Dudley sum. Therefore we have \(e_h(T,d) \leq 2 e_h(T, \|\cdot\|_N)\) and combining this with Dudley’s bound \eqref{eq:dudley3} gives

    \[\begin{equation}\label{eq:dudley2} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,\|\cdot\|_N)\,. \end{equation}\]

    We will split this sum into two parts depending on whether \(h \leq 4 \log (n)\) or \(h > 4 \log (n)\).

  • Controlling the large \(h\) terms with a volume bound. Note that for \(x \in T\), by \eqref{eq:L2N} we have

    \[\|x\|_N \leq \sqrt{\sum_{k=1}^m N_k(x)^2} \leq 1\,.\]

    In other words, \(T \subseteq B_N \seteq \{ x \in \R^n : \|x\|_N \leq 1\}\).

    Therefore \(e_h(T,\|\cdot\|_N) \leq e_h(B_N,\|\cdot\|_N)\).

    Claim: It holds that \(e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}\).

    This proof works for any norm on \(\R^n\).

    Assume that \(x_1,\ldots,x_{s} \in B_N\) are a maximal set of points satisfying \(\|x_i-x_j\|_N \geq \delta\) for all \(i \neq j\). In that case, by maximality we have

    \[B_N \subseteq \bigcup_{j=1}^s (x_j + \delta B_N).\]

    In other words, \(B_N\) is covered by \(s\) balls of radius \(\delta\).

    On the other hand, the balls \(x_j + \frac{\delta}{2} B_N\) are pairwise disjoint, and (for \(\delta \leq 2\)) they are all contained in \(2 B_N\), hence

    \[\mathrm{vol}_n(2 B_N) \geq s \,\mathrm{vol}_n(\tfrac{\delta}{2} B_N) = s \,(\delta/4)^n\, \mathrm{vol}_n(2 B_N),\]

    where \(\mathrm{vol}_n\) is the standard \(n\)-dimensional volume on \(\R^n\). It follows that \(s \leq (4/\delta)^{n}\), i.e., \(\delta \leq 4 s^{-1/n}\). Taking \(s = 2^{2^h}\) yields

    \[e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}.\]
  • Recalling \eqref{eq:dudley2}, we can use the preceding claim to evaluate

    \[\sum_{h > 4 \log n} 2^{h/2} e_h(T, \|\cdot\|_N) \leq 4 \sum_{h > 4 \log n} 2^{h/2} 2^{-2^h/n} \leq O(1)\,.\]

    Thus we have

    \[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim O(1) + \sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N)\,.\]
  • Controlling the interesting \(h\) terms. To bound the second sum, let us use \(T \subseteq B_2^n\) to write \(e_h(T,\|\cdot\|_N) \leq e_h(B_2^n,\|\cdot\|_N)\).

    (Dual) Sudakov Lemma: For any norm \(\|\cdot\|\) on \(\R^n\) and \(h \geq 0\), it holds that

    \[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2} \E \|g\|\,.\]

    Note that \(\E\, \|g\|_N = \E \max_k N_k(g) = \kappa\) (by definition), and therefore the second sum above is bounded by

    \[\sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N) \lesssim \kappa \log n\,.\]

    This completes the proof of \eqref{eq:goal11} once we have established the Dual Sudakov Lemma.

  • Gaussian Shift Lemma. We need the following

    Lemma: Suppose that \(K\) is a symmetric convex set and \(\gamma_n\) is the \(n\)-dimensional Gaussian measure on \(\R^n\). Then for any \(x \in \R^n\),

    \[\gamma_n(K+x) \geq e^{-\|x\|_2^2/2} \gamma_n(K)\,.\]

    Proof: Write

    \[\begin{eqnarray*} \gamma_n(K+x) &=& (2\pi)^{-n/2} \int_K e^{-\|x+z\|_2^2/2}\,dz \\ &=& (2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz\,, \end{eqnarray*}\]

    where the second equality follows because \(K\) is symmetric. Now note that \(\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2 = \|x\|_2^2 + \|z\|_2^2\) and use convexity of \(y \mapsto e^y\) (Jensen's inequality) to write

    \[\begin{eqnarray*} (2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz &\geq & (2\pi)^{-n/2} \int_K e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x+z\|_2^2/2}\,dz \\ &=& (2\pi)^{-n/2} \int_K e^{-(\|x\|_2^2+\|z\|_2^2)/2}\,dz = e^{-\|x\|_2^2/2} \gamma_n(K)\,. \end{eqnarray*}\]

    (A quick Monte Carlo sanity check of this inequality appears below.)
  • Proof of Dual Sudakov. Let \(\mathcal{B} \seteq \{ x \in \R^n : \|x\| \leq 1 \}\) denote the unit ball of the \(\|\cdot\|\) norm and suppose \(x_1,\ldots,x_s \in B_2^n\) is a maximal set of points such that the shifted balls \(x_j + \delta \mathcal{B}\) are all pairwise disjoint. In that case, we have

    \[\begin{equation}\label{eq:Bcover} B_2^n \subseteq \bigcup_{j=1}^s (x_j + 2 \delta \mathcal{B})\,, \end{equation}\]

    i.e., we have covered \(B_2^n\) by \(s\) balls (in \(\|\cdot\|\)) of radius at most \(2\delta\).

    Let \(\lambda > 0\) be a parameter we will choose later and note that the sets \(\lambda(x_j+\delta \mathcal{B})\) are also pairwise disjoint, and therefore

    \[\begin{eqnarray*} 1 &\geq& \gamma_n\left(\bigcup_{j=1}^s \lambda (x_j + \delta \mathcal{B}) \right) \\ &=& \sum_{j=1}^s \gamma_n\left(\lambda (x_j + \delta \mathcal{B}) \right) \\ &\geq& \sum_{j=1}^s e^{-\lambda^2 \|x_j\|_2^2/2} \gamma_n(\lambda \delta \mathcal{B}) \geq s \cdot e^{-\lambda^2/2} \gamma_n(\lambda \delta \mathcal{B})\,, \end{eqnarray*}\]

    where the first equality uses disjointness, the second inequality uses the Gaussian Shift Lemma, and the final inequality uses \(x_1,\ldots,x_s \in B_2^n\).

    If we define \(\lambda \seteq \frac{2}{\delta} \E\, \|g\|\), then

    \[\gamma_n(\lambda \delta \mathcal{B}) = \P\left(g \in \lambda \delta \mathcal{B}\right) = \P\left(\|g\| \leq \lambda \delta\right) = \P\left(\|g\| \leq 2 \E\, \|g\|\right) \geq \frac12\,,\]

    where the last step is simply Markov’s inequality.

    Combining this with our previous calculation yields

    \[1 \geq s \cdot \exp\left(-2 (\E\, \|g\|)^2/\delta^2\right) \cdot \frac12\,,\]

    i.e.

    \[\sqrt{\log (s/2)} \leq \frac{2}{\delta} \E\, \|g\|\,.\]

    Taking \(s=2^{2^h}\) and recalling \eqref{eq:Bcover} shows that we can cover \(B_2^n\) by at most \(s\) \(\|\cdot\|\)-balls of radius \(2 \delta\) with

    \[\delta \leq \frac{2\, \E\, \|g\|}{\sqrt{\log (s/2)}} \lesssim 2^{-h/2} \E\,\|g\|\,.\]

    By definition, this yields our desired bound

    \[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2} \E\,\|g\|\,.\]
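
Here is the promised quick Monte Carlo sanity check of the Gaussian Shift Lemma. The symmetric convex set \(K\) (a cube), the dimension, and the shift \(x\) below are arbitrary choices for illustration; this is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check of the Gaussian Shift Lemma:
#   gamma_n(K + x) >= exp(-||x||^2 / 2) * gamma_n(K)
# for a symmetric convex set K.  Here K = [-1, 1]^n (an arbitrary choice).
n, samples = 5, 200000
g = rng.normal(size=(samples, n))
x = 0.5 * rng.normal(size=n)                      # an arbitrary shift vector

in_K = np.all(np.abs(g) <= 1.0, axis=1)
in_K_plus_x = np.all(np.abs(g - x) <= 1.0, axis=1)   # g in K + x  <=>  g - x in K

lhs = in_K_plus_x.mean()                          # estimate of gamma_n(K + x)
rhs = np.exp(-np.dot(x, x) / 2) * in_K.mean()     # exp(-||x||^2/2) * gamma_n(K)
print(f"gamma_n(K+x) ≈ {lhs:.4f}   >=   exp(-|x|^2/2) gamma_n(K) ≈ {rhs:.4f}")
```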

Supplementary reading:

The KLS conjecture and stochastic localization
Wed, Jan 25
A Cheeger inequality for isotropic convex sets

  • We studied the \(\ell_1\) norm sparsification problem:

    Given norms \(N_1,\ldots,N_m\) on \(\R^n\), define \(N(x) \seteq N_1(x) + \cdots + N_m(x)\). Then we seek a weight vector \(c \in \R_+^m\) with \(\# \{ i : c_i > 0 \}\) small such that

    \[\max_{N(x) \leq 1} \left|N(x) - \sum_{j=1}^m c_j N_j(x)\right| \leq \e\,.\]
  • The role of concentration was essential: If \(\mathbf{Z}\) is a log-concave random vector on \(\R^n\) and \(\|\cdot\|\) is any norm, then

    \[\Pr\left(\|\mathbf{Z}\| > t \E[\|\mathbf{Z}\|]\right) \leq e^{-c t \psi_n}\,,\]

    where \(c > 0\) is a universal constant and \(\psi_n\) is the KLS constant.

  • This sort of concentration is essentially equivalent to a uniform lower bound on the Cheeger constant of isotropic log-concave measures on \(\R^n\). (Next lecture.)
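
As a toy illustration of the concentration statement above, the sketch below takes \(\mathbf{Z}\) to be a standard Gaussian vector (which is log-concave, and for which the relevant isoperimetric constant is a universal constant) and \(\|\cdot\|\) to be the \(\ell_\infty\) norm, and estimates the tail \(\Pr(\|\mathbf{Z}\| > t\,\E\|\mathbf{Z}\|)\) for a few values of \(t\). The choices of norm and dimension are mine and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of concentration of a norm of a log-concave vector:
# Z is a standard Gaussian in R^n and ||.|| is the sup-norm.
n, samples = 100, 200000
Z = rng.normal(size=(samples, n))
norms = np.abs(Z).max(axis=1)          # ||Z||_infty
mean_norm = norms.mean()

for t in (1.0, 1.2, 1.5, 2.0):
    p = np.mean(norms > t * mean_norm)
    print(f"t = {t:.1f}   P(||Z|| > t E||Z||) ≈ {p:.6f}")
```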

Reading:

Videos:

Mon, Jan 30
The KLS conjecture, slicing, thin shells, and concentration

Localization

  • Consider a log-concave density \(F : \R^n \to \R_+\) with corresponding measure \(\mu\), and suppose that \(D \seteq \mathrm{diam}(\mathrm{supp}(F)) < \infty\) (where \(\mathrm{supp}(F) \seteq \{ x \in \R^n : F(x) > 0 \}\)).

    Let \(S_1,S_2,S_3\) be three measurable sets partitioning \(\R^n\) and denote \(d(S_1,S_2) = \min \{ \|x-y\|_2 : x \in S_1, y \in S_2 \}\).

    KK Lemma (Karzanov and Khachiyan): It holds that \(\mu(S_3) \geq \frac{d(S_1,S_2)}{D} \min \{ \mu(S_1), \mu(S_2) \}.\)

    Our goal is to reduce this to the \(1\)-dimensional case. Define \(C \seteq \frac{d(S_1,S_2)}{D}\). The idea is to define two functions $g_1,g_2 : \R^n \to \R$ by

    \[g_1(x) \seteq \begin{cases} C F(x) & x \in S_1 \\ 0 & x \in S_2 \\ -F(x) & x \in S_3 \end{cases} \qquad g_2(x) \seteq \begin{cases} C F(x) & x \in S_2 \\ 0 & x \in S_1 \\ -F(x) & x \in S_3\,. \end{cases}\]

    Suppose, for the sake of contradiction, that \(C \mu(S_1) > \mu(S_3)\) and \(C \mu(S_2) > \mu(S_3)\). This gives \(\int g_1(x)\,dx = C \mu(S_1) - \mu(S_3) > 0\) and \(\int g_2(x)\,dx = C \mu(S_2) - \mu(S_3) > 0\).

    Now we are going to define a decreasing sequence \(K_0 \supseteq K_1 \supseteq K_2 \supseteq \cdots\) of convex bodies so that \(K \seteq \bigcap_{i=1}^{\infty} K_i\) is either a point or a segment and such that \(\int_{K_i} g_1(x)\,dx > 0\) and \(\int_{K_i} g_2(x)\,dx > 0\) for each \(i=1,2,\ldots\).

  • For \(K_0\), simply choose a ball large enough to contain \(\mathrm{supp}(F)\).

    Consider any halfspace \(H\) such that

    \[\int_{K_i \cap H} g_1(x)\,dx = \frac12 \int_{K_i} g_1(x)\,dx\,.\]

    Call such a halfspace \(H\) bisecting.

    Since \(\int_{K_i} g_2(x)\,dx > 0\), and since the complement of a bisecting halfspace is also bisecting, we may replace \(H\) by \(\R^n \setminus H\) if necessary to ensure that

    \[\int_{K_i \cap H} g_2(x)\,dx > 0\,.\]

    We then take \(K_{i+1} \seteq K_i \cap H\), so that both \(\int_{K_{i+1}} g_1(x)\,dx = \frac12 \int_{K_i} g_1(x)\,dx > 0\) and \(\int_{K_{i+1}} g_2(x)\,dx > 0\). Let’s see how to find a bisecting halfspace \(H\).

  • Claim: For any \((n-2)\)-dimensional affine subspace \(A\), there is at least one bisecting halfspace whose bounding hyperplane contains \(A\).

    Note that the boundary of a halfspace \(H\) is an \((n-1)\)-dimensional hyperplane. So consider a halfspace \(H\) that contains \(A\) in its boundary (think of a plane containing a line). We can rotate \(H\) about \(A\) until it becomes its complementary halfspace \(H'\).

    If \(\int_{K_i \cap H} g_1 - \int_{K_i \setminus H} g_1 = 0\) initially, we are done. If not, then this difference changes sign as we rotate from \(H\) to \(H'\) (its value at \(H'\) is the negative of its value at \(H\)). By continuity, we find a bisecting halfspace whose bounding hyperplane contains \(A\).

  • Let \(A_0,A_1,\ldots,\) be a sequence of all \((n-2)\)-dimensional affine spaces of the form \(v_0 + \mathrm{span}(v_1,\ldots,v_{n-2})\) where the vectors \(\{v_j\}\) have rational coordinates (there is a countable number of such subspaces). Let \(K_{i+1} \seteq K_i \cap H_i\) where \(H_i\) is a bisecting halfspace whose boundary contains \(A_i\), chosen (as above) so that the integral of \(g_2\) also remains positive.

    Claim: The intersection \(K \seteq \bigcap_{i\geq 1} K_i\) is a point or an interval.

    Assume to the contrary that \(K\) is at least \(2\)-dimensional. Then its projection onto some pair of coordinate axes is still \(2\)-dimensional; without loss of generality, the first two axes. Then there are rational numbers \(r_1,r_2\) and a point \((r_1,r_2,x_3, \ldots,x_n)\) in the interior of \(K\). But the \((n-2)\)-dimensional affine subspace \(\{ x \in \R^n : x_1 = r_1,\ x_2 = r_2 \}\) occurs as some \(A_i\).

    Recall that the halfspace \(H_i\) has its boundary containing \(A_i\). Since \((r_1,r_2,x_3,\cdots,x_n)\) is in the interior of \(K\), it is also in the interior of \(K_{i+1}\), so the bounding hyperplane of \(H_i\) passes through an interior point of \(K_{i+1}\), i.e., it properly cuts \(K_{i+1}\). But this is a contradiction, as \(H_i\) should fully contain \(K_{i+1} = K_i \cap H_i\).

  • We conclude that \(\int_K g_1 \geq 0, \int_K g_2 \geq 0\) on the intersection \(K\).

    Note that we can easily get \(\int_K g_1 > 0, \int_K g_2 > 0\) for the original functions by starting with functions \(\hat{g}_1 = g_1 - \delta \mathbf{1}_{K_0}\) and \(\hat{g}_2 = g_2 - \delta \mathbf{1}_{K_0}\) for \(\delta > 0\) sufficiently small and then finding \(K\) where \(\int_K \hat{g}_1 \geq 0, \int_K \hat{g}_2 \geq 0\).

    Since the restriction of a log-concave function to an interval is still log-concave, this is indeed a reduction of the KK Lemma to the \(1\)-dimensional case.
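
Since the argument above reduces the KK Lemma to dimension one, here is a small numerical check of the one-dimensional statement for one particular log-concave density (proportional to \(e^{-x}\) on \([0,D]\), an arbitrary choice) and partitions of \([0,D]\) into three intervals. This is only an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional check of the KK Lemma for the log-concave density
# F(x) ∝ exp(-x) on [0, D]:  mu(S3) >= (d(S1,S2)/D) * min(mu(S1), mu(S2)),
# with S1 = [0, a), S3 = [a, b), S2 = [b, D] and d(S1, S2) = b - a.
D = 5.0
Z = 1.0 - np.exp(-D)                      # normalization constant

def cdf(x):
    return (1.0 - np.exp(-x)) / Z

worst = np.inf
for _ in range(10000):
    a, b = np.sort(rng.uniform(0, D, size=2))
    mu1, mu3 = cdf(a), cdf(b) - cdf(a)
    mu2 = 1.0 - cdf(b)
    rhs = ((b - a) / D) * min(mu1, mu2)
    if rhs > 0:
        worst = min(worst, mu3 / rhs)

print(f"min ratio mu(S3) / [(d/D) min(mu(S1), mu(S2))] = {worst:.3f}  (the lemma says >= 1)")
```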

Finding a better needle

For other applications, it helps to replace \(g_1\) and \(g_2\) by nicer functions at every step. One can consult Lovasz-Simonovits Lemma 2.5 for a (rather technical) argument proving the following.

  • Localization Lemma: Given \(\int_{\R^n} g_1 > 0\) and \(\int_{\R^n} g_2 > 0\), there are two points \(a,b \in \R^n\) and a linear function \(\ell : [0,1] \to \R_+\) such that

    \[\begin{align*} \int_0^1 \ell(t)^{n-1} g_1((1-t) a + tb)\,dt &> 0 \\ \int_0^1 \ell(t)^{n-1} g_2((1-t) a + tb)\,dt &> 0 \end{align*}\]
  • One can instead derive the conclusion that there is a number \(\gamma \in \R\) such that

    \[\begin{align*} \int_0^1 e^{\gamma t} g_1((1-t) a + tb)\,dt &> 0 \\ \int_0^1 e^{\gamma t} g_2((1-t) a + tb)\,dt &> 0\,, \end{align*}\]

    which I find a bit more natural.

Reading:

Videos:

Wed, Feb 01
Stochastic localization

  • Recall the KLS constant of a probability measure \(\mu\) on \(\R^n\):

    \[\psi_{\mu} \seteq \inf_{S \subseteq \R^n : \mu(S) \leq 1/2} \frac{\mu(\partial S)}{\mu(S)}\,,\]

    where the infimum is over measurable subsets and

    \[\mu(\partial S) \seteq \lim_{\e \to 0+} \frac{\mu(S_{\e} \setminus S)}{\e}\,,\]

    where \(S_{\e} \seteq \{ x \in \R^n : d(x,S) \leq \e \}\) is the \(\e\)-neighborhood of \(S\).

  • The KLS conjecture asserts that when \(\mu\) is a log-concave distribution, it should hold that the infimum can be restricted to halfspaces:

    \[\psi_{\mu} \gtrsim \inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)}\,,\]

    where a halfspace is defined by \(H = \{ x \in \R^n : \langle x,v \rangle \leq c \}\) for some \(v \in \R^n\) and \(c \in \R\).

  • Suppose that \(\mu\) is a measure on \(\R^n\). The covariance matrix of \(\mu\) is defined as

    \[A \seteq \E[(X-\E[X])(X-\E[X])^{\top}]\,,\]

    where \(X\) is a random variable with law \(\mu\).

    This describes the variance of \(\mu\) along every projection: If \(v \in \R^n\), then

    \[\E[\langle v, X - \E[X]\rangle^2] = \langle v, A v\rangle\,.\]
  • Lemma: If \(\mu\) is log-concave, then

    \[\inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)} \asymp \frac{1}{\sqrt{\|A\|_{op}}}\,,\]

    where \(\|A\|_{op}\) denotes the maximum eigenvalue of \(A\).

    This lemma is not too difficult to prove given that if \(\mu\) is log-concave and \(X\) has law \(\mu\), then \(\langle v,X\rangle\) is also log-concave for any \(v \in \R^n\).

  • For distributions that are more log-concave than a Gaussian, one can prove that an analogous bound holds.

    Gaussian Factor Lemma: Suppose that the density of \(\mu\) is proportional to \(f(x) e^{-\langle x, B x\rangle}\) for some PSD matrix \(B\), and a log-concave function \(f : \R^n \to \R_+\). Then,

    \[\psi_{\mu} \gtrsim \frac{1}{\sqrt{\|B^{-1}\|_{op}}}\,.\]

    Since the product of log-concave functions is log-concave, the measure \(\mu\) is log-concave. This lemma can be proved using localization (the approach we sketched last lecture) to reduce it to the \(1\)-dimensional case. Note that the covariance matrix of the Gaussian with density proportional to \(e^{- \langle x,Bx\rangle}\) is a constant multiple of \(B^{-1}\):

    \[\int e^{-\langle x,Bx\rangle} xx^{\top}\,dx = c_B \int e^{-\|x\|^2} (B^{-1/2} x) (B^{-1/2} x)^{\top}\,dx = c_B B^{-1/2} \left(\int e^{-\|x\|^2} xx^{\top} \,dx\right) B^{-1/2}\,,\]

    where \(c_B\) is some constant resulting from the substitution \(x \mapsto B^{-1/2} x\). The quantity in parentheses is a multiple of the identity.

    Therefore if a measure is log-concave with respect to a Gaussian distribution, its KLS constant is at least (up to a universal constant factor) as good as that for the Gaussian.

  • Brownian motion: An \(n\)-dimensional Brownian motion is a stochastic process \((B_t : t \geq 0)\) such that \(B_s = \int_0^s dB_t\) and \(dB_t\) can be thought of as a random variable independent of \((B_s : 0 \leq s \leq t)\) with law \(N(0, dt \cdot I)\) (the law of a spherical \(n\)-dimensional Gaussian with covariance \(dt \cdot I\)).

  • Stochastic localization: Suppose that a log-concave measure \(\mu\) has density \(f : \R^n \to \R_+\). We will define a stochastic process on densities: Take \(f_0(x)=f(x)\) and

    \[d f_t(x) = f_t(x) \langle x-a_t, dB_t\rangle\,,\]

    where \((B_t : t \geq 0)\) is an \(n\)-dimensional Brownian motion, and

    \[a_t \seteq \int_{\R^n} x f_t(x)\,dx\]

    is the center of mass of the \(f_t\).

    Note that if \(f_0(x)=f(x)\) is a density, then so is \(f_t\) since

    \[\int_{\R^n} d f_t(x)\,dx = \left\langle \int_{\R^n} f_t(x) (x-a_t)\,dx,\ dB_t\right\rangle = 0\,.\]

    Let \(\mu_t\) be the corresponding measure. Note that \(f_t(x)\) is a martingale: \(\E[d f_t(x)] = 0\).

    Martingale property: \(\E[\mu_t(S)] = \mu(S)\) for all \(S \subseteq \R^n\).

  • For a sufficiently nice function \(f : \R \to \R\), we can approximate the change in \(f\) by its Taylor expansion

    \[df(x) = f(x+dx) - f(x) = f'(x) \,dx + \frac12 f''(x) \,dx^2 + \cdots\,.\]

    Dividing both sides by \(dx\) and taking \(dx \to 0\), all terms except the first-order one vanish.

    But now suppose $B_t$ is \(1\)-dimensional Brownian motion and we compute

    \[d f(B_t) = f(B_{t+dt}) - f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dB_t^2 + \cdots\,.\]

    Then \(dB_t\) has law \(N(0,dt)\), hence \(\E[dB_t^2]=dt\). In other words, \((B_t)\) has non-trivial quadratic variation. That means we cannot neglect the 2nd order terms. Ito’s lemma tells us that we need only go to second order so that

    \[d f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dt\,.\]

    Let’s consider a more general version for stochastic processes in \(n\) dimensions.

  • Stochastic processes and Ito derivatives: Consider a process \((X_t : t \geq 0)\) defined by the stochastic differential equation

    \[dX_t = u_t \,dt + \Sigma_t dB_t\,,\]

    where \(u_t \in \R^m\) and \(\Sigma_t \in \R^{m \times n}\).

    Let \(f : \R^m \to \R\) be a sufficiently nice (e.g., twice continuously differentiable) function. Then Ito’s lemma tells us how to compute the time derivative of \(f(X_t)\):

    \[df(X_t) = \sum_{i=1}^m \partial_i f(X_t) dX_t^i + \frac12 \sum_{i,j} \partial_{ij} f(X_t) d[X_t^i,X_t^j]_t\,,\]

    where \([X_t^i, X_t^j]_t\) is the quadratic variation

    \[[X_t^i, X_t^j]_t = \int_0^t \left(\Sigma_s \Sigma_s^{\top}\right)_{i,j} \,ds\]

    so that

    \[d [X_t^i, X_t^j]_t = \left(\Sigma_t \Sigma_t^{\top}\right)_{i,j}\,dt\,.\]

    Note that \(\Sigma_t \Sigma_t^{\top}\) is the covariance matrix of the random vector \(\Sigma_t dB_t\).

  • We can use this to figure out what \(f_t\) looks like:

    Here for a fixed \(x \in \R^n\), we take \(X_t = f_t(x)\) so that \(\Sigma_t = f_t(x) (x-a_t)^{\top}\). Then using the Ito lemma:

    \[\begin{align*} d \log f_t(x) = d \log (X_t) &= \frac{d X_t}{X_t} - \frac12 \frac{d[X_t,X_t]_t}{X_t^2} \\ &= \frac{d f_t(x)}{f_t(x)} - \frac12 \frac{d[f_t(x),f_t(x)]_t}{f_t(x)^2} \\ &=\langle x-a_t,dB_t\rangle - \frac12 \|x-a_t\|^2\,dt\,, \end{align*}\]

    where we used

    \[d[X_t,X_t]_t = f_t(x)^2\, (x-a_t)^{\top} (x-a_t)\,dt = f_t(x)^2\, \|x-a_t\|^2 \,dt\,.\]

    Let us simplify:

    \[d \log f_t(x) = \langle x, a_t\,dt + dB_t\rangle - \frac12 \|x\|^2\,dt - \left[\langle a_t, dB_t\rangle + \tfrac12 \|a_t\|^2\,dt \right]\]

    Note that the terms in brackets don’t depend on \(x\). Thus integrating the logarithm gives (up to a normalization factor that does not depend on \(x\)):

    \[f_t(x) \propto \exp\left(- \frac{t}{2} \|x\|^2+ \int_0^t \langle x,a_s\,ds + dB_s\rangle \right) f(x)\,.\]

    If we let \(\mu_t\) denote the corresponding measure, then the Gaussian Factor Lemma says that (with probability one):

    \[\psi_{\mu_t} \gtrsim \sqrt{t}\,.\]
  • Now consider \(S \subseteq \R^n\) with \(\mu(S)=1/2\) and suppose there is a time \(t_0\) such that

    \[\begin{equation}\label{eq:good-meas} \P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)] \geq \frac{1}{10}\,. \end{equation}\]

    Use the Martingale Property of \(\mu_t\) to write

    \[\mu(\partial S) = \E[\mu_{t_0}(\partial S)] \geq \frac18\, \E\left[\psi_{\mu_{t_0}}\, \mathbf{1}_{\{\frac14 \mu(S) \leq \mu_{t_0}(S) \leq \frac34 \mu(S)\}}\right] \gtrsim \sqrt{t_0}\; \P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)] \gtrsim \sqrt{t_0}\,.\]
  • Thus if we can find a lower bound on \(t_0\), we have proved a lower bound on \(\psi_{\mu}\).

    In the next lecture, we will show that \eqref{eq:good-meas} holds for some \(t_0 \asymp (\tr(A^2))^{-1/2}\).

    This yields

    \[\psi_{\mu} \gtrsim \frac{1}{\tr(A^2)^{1/4}}\,.\]

    If \(\mu\) is isotropic, this value is \(\gtrsim n^{-1/4}\), implying that \(\psi_n \gtrsim n^{-1/4}\).
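
To get a feel for stochastic localization, here is a minimal discretized sketch in Python. It uses the explicit form of \(f_t\) derived above (a Gaussian tilt of \(f\), with \(\theta_t \seteq \int_0^t (a_s\,ds + dB_s)\) simulated by an Euler step), and it runs the dynamics on a finite point cloud standing in for a density; the cloud, the step size, and the test set \(S\) are arbitrary choices of mine. The sketch checks the martingale property \(\E[\mu_t(S)] = \mu(S)\) and reports how often \(\mu_t(S)\) stays in \([\tfrac14,\tfrac34]\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized stochastic localization on a finite point cloud X, using
#     f_t(x)  ∝  f(x) * exp(<theta_t, x> - (t/2) |x|^2),   d theta_t = a_t dt + dB_t,
# where a_t is the current center of mass of mu_t.
n, N = 2, 2000                       # dimension and number of support points
X = rng.normal(size=(N, n))          # support points; mu_0 is uniform over them
S = X[:, 0] > 0                      # a test set with mu_0(S) ≈ 1/2

def run(T=0.5, dt=1e-3):
    theta, t = np.zeros(n), 0.0
    while t < T:
        logw = X @ theta - 0.5 * t * (X ** 2).sum(axis=1)
        w = np.exp(logw - logw.max())
        mu_t = w / w.sum()                        # current measure on the cloud
        a_t = mu_t @ X                            # center of mass
        theta += a_t * dt + np.sqrt(dt) * rng.normal(size=n)
        t += dt
    return mu_t[S].sum()                          # mu_T(S)

runs = np.array([run() for _ in range(200)])
print(f"mu_0(S) = {S.mean():.3f},  average of mu_T(S) over runs = {runs.mean():.3f}")
print(f"fraction of runs with 1/4 <= mu_T(S) <= 3/4: {np.mean((runs >= 0.25) & (runs <= 0.75)):.2f}")
```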

Supplementary reading:

Videos:

Two talks by Ronen @ IAS:

Mon, Feb 06
Tracking the covariance

Reducing to control of the covariance

  • Recall from last time that \(f : \R^n \to \R_+\) is a log-concave density on \(\R^n\) and we use \(\mu\) to denote the associated probability measure.

  • We evolve \(f\) according to a stochastic differential equation (SDE) given by \(f_0(x)=f(x)\) and

    \[d f_t(x) = f_t(x) \langle x - a_t, dB_t \rangle\,,\]

    where \((B_t : t \geq 0)\) is an \(n\)-dimensional Brownian motion and

    \[a_t = \int_{\R^n} x f_t(x)\,dx\]

    is the center of mass of \(f_t\). Let \(\mu_t\) be the measure corresponding to the density \(f_t\).

  • We used Itô’s Lemma to calculate

    \[\begin{equation}\label{eq:gfactor} f_t(x) \propto \exp\left(-\frac{t}{2} \|x\|^2 + \int_0^t \langle x,a_s \,ds + dB_s\rangle\right) f(x)\,. \end{equation}\]

    The Gaussian factor here is important for us because, according to the Gaussian Factor Lemma from last lecture, this yields \(\psi_{\mu_t} \gtrsim \sqrt{t}\).

  • Note that \(f_t(x)\) is a martingale for every \(x \in \R^n\): If $\mathcal{F}_t$ denotes the filtration generated by \((B_s : 0 \leq s \leq t)\), then

    \[\E[df_t(x) \mid \mathcal{F}_t] = f_t(x) \left\langle x-a_t, \E[dB_t \mid \mathcal{F}_t]\right\rangle = 0.\]

    This means that for every set \(S \subseteq \R^n\), we have \(\E[\mu_t(S)]=\mu(S)\).

  • So now suppose we start with \(S \subseteq \R^n\) with \(\mu(S) = 1/2\). We want to prove a lower bound on \(\mu(\partial S)\) as follows: At any time \(t \geq 0\),

    \[\mu(\partial S) = \E[\mu_t(\partial S)] \gtrsim \E\left[\psi_{\mu_t}\, \mathbf{1}_{\{\frac14 \leq \mu_t(S) \leq \frac34\}}\right] \gtrsim \sqrt{t} \cdot \P[\tfrac14 \leq \mu_t(S) \leq \tfrac34]\,.\]

    So now we are left to solve the following problem: For how long does \(\mu_t(S)\) stay bounded between \(1/4\) and \(3/4\)?

  • So let’s analyze how \(\mu_t(S)\) evolves using Itô’s lemma. Define

    \[m_t \seteq \mu_t(S) = \int_S f_t(x) \,dx\,.\]

    Using the SDE for $f_t$, we have

    \[d m_t = \int_S df_t(x)\,dx = \left\langle \int_S (x-a_t) f_t(x)\,dx,\ dB_t\right\rangle.\]
  • Dubins-Schwartz: While we don’t need this technically, it’s useful to keep in mind. Roughly speaking, all reasonable martingales are a reparameterized Brownian motion.

    Given a \(1\)-dimensional stochastic process \(dm_t = \Sigma_t\, dB_t\) with no drift term (so \(\Sigma_t \in \R^{1 \times n}\) and \(m_t\) is a continuous local martingale), recall the quadratic variation

    \[[m_t]_t = \int_0^t \Sigma_s \Sigma_s^{\top} \,ds\,.\]

    Define the stopping time \(\tau_s \seteq \inf \{ t : [m_t]_t > s \}\), i.e., the first time at which we have collected quadratic variation \(s\). Then \(m_{\tau_s}\) and \(B_s\) (with \(B_0 = m_0\)) have the same law.

  • Controlling \(\mu_t(S)\) by controlling the operator norm of the covariance. This means that in order to control \(\P(\tfrac14 \leq m_t \leq \frac34)\), we just need to control the quadratic variation:

    \[[m_t]_t = \int_0^t \left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 \,ds\,.\]

    Note that

    \[\begin{align*} \left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 &= \max_{\|v\| \leq 1} \left(\int_S \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\ &=\max_{\|v\| \leq 1} \left(\int_{\R^n} \mathbf{1}_S(x) \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\ &\leq \max_{\|v\| \leq 1} \left(\int_{\R^n} \langle v, x-a_s \rangle^2 f_s(x)\,dx\right) \left(\int_{\R^n} \mathbf{1}_S(x)^2 f_s(x)\,dx\right), \end{align*}\]

    where the last line is Cauchy-Schwarz. The second integral is precisely \(\mu_s(S) \leq 1\), and the first integral is \(\langle v, A_s v\rangle\), where \(A_s\) is the covariance matrix of \(\mu_s\). So we have

    \[[m_t]_t \leq \int_0^t \max_{\|v\| \leq 1} \langle v, A_s v\rangle\,ds = \int_0^t \|A_s\|_{op}\,ds\]
  • So we need to see how large \(t\) can be so that \(\int_0^t \|A_s\|_{op}\,ds \leq 0.1\). The operator norm \(\|A_s\|_{op}\) (the maximum eigenvalue of \(A_s\)) is not easy to argue about as time evolves because it’s defined as a maximum over directions.

    Thus we will use a crude upper bound:

    \[\|A_t\|_{op} \leq \tr(A_t^2)^{1/2}\,.\]

    Key Lemma: Let \(A\) be the covariance matrix of \(\mu\). Then there is a constant \(c > 0\) such that, with constant probability, for all \(0 \leq t \leq c\, (\tr(A^2))^{-1/2}\) we have

    \[\tr(A_t^2) \lesssim \tr(A^2)\,.\]

    On this event, the Key Lemma implies that \([m_t]_t \lesssim \int_0^t \tr(A_s^2)^{1/2}\,ds \leq t \cdot \max_{0 \leq s \leq t} \tr(A_s^2)^{1/2} \lesssim t \cdot \tr(A^2)^{1/2}\,.\)

    And therefore for some choice of \(T \gtrsim (\tr(A^2))^{-1/2}\), we have \(\Pr([m_{T}]_T \ll 1) \gtrsim 1\), and therefore \(\Pr(\tfrac14 \leq m_T \leq \tfrac34) \gtrsim 1\).

  • Now \eqref{eq:gfactor} and the Gaussian Factor Lemma together tell us that \(\psi_{\mu} \gtrsim \sqrt{T} \gtrsim \tr(A^2)^{-1/4}\). If \(\mu\) is in isotropic position, then \(A=I\) and we get \(\psi_{\mu} \gtrsim n^{-1/4}\).

Controlling the covariance

  • We first need to compute the Itô derivative of the covariance

    \[A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} f_t(x)\,dx\,.\]
  • For this it helps to extend Itô’s lemma notationally to handle a pair of stochastic processes: Suppose that \(dX_t = u_t \,dt + M_t\,dB_t\) and \(dY_t = v_t\,dt + N_t\,dB_t\), then

    \[\begin{align*} d f(X_t, Y_t) &= \sum_i \partial_{x^i} f(X_t,Y_t)\, dX_t^i + \sum_j \partial_{y^j} f(X_t,Y_t)\, dY_t^j \\ &+ \frac12 \sum_{i,i'} \partial_{x^i,x^{i'}} f(X_t,Y_t)\, d[X_t^i,X_t^{i'}]_t + \frac12 \sum_{j,j'} \partial_{y^j,y^{j'}} f(X_t,Y_t)\, d[Y_t^{j},Y_t^{j'}]_t + \sum_{i,j} \partial_{x^i,y^j} f(X_t,Y_t)\, d[X_t^i, Y_t^j]_t\,, \end{align*}\]

    where

    \[[X_t,Y_t]_t = \int_0^t M_s N_s^{\top}\,ds\,.\]

    Note that this is the same formula as before (by lifting to a single process \(Z_t = X_t \oplus Y_t\)), but it gives guidance to our next calculation.

  • Write:

    \[d A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} d f_t(x)\,dx - \int_{\R^n} \left((d a_t)(x-a_t)^{\top}\right) f_t(x)\,dx - \int_{\R^n} \left((x - a_t)(d a_t)^{\top}\right) f_t(x)\,dx + \cdots\,,\]

    where \(\cdots\) represents the second order derivatives.

    The first term is equal to \(\int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx.\)

    The second two terms both evaluate to zero since \(\int (x-a_t) f_t(x)\,dx = 0\) by definition of \(a_t\).

    Now let’s calculate the second-order derivatives:

    \[\frac12 \cdot 2 \cdot d[a_t,a_t]_t \int f_t(x) \,dx - \frac{1}{2} \cdot 2 \cdot \int (x-a_t) \left(d[a_t,f_t(x)]_t\right)^{\top}\,dx - \frac{1}{2} \cdot 2 \cdot \int d[a_t,f_t(x)]_t\, (x-a_t)^{\top}\,dx\]
  • Note that \(d a_t = \int_{\R^n} x\, d f_t(x)\,dx = \left(\int_{\R^n} x (x-a_t)^{\top} f_t(x)\,dx\right) dB_t = A_t\, dB_t\), which gives \(d[a_t,a_t]_t = A_t^2\,dt\).

  • We also have

    \[d[a_t, f_t(x)]_t = A_t (x-a_t) f_t(x) \,dt\,,\]

    hence the second order terms evaluate to

    \[A_t^2\,dt - 2 A_t \left(\int f_t(x) (x-a_t) (x-a_t)^{\top}\,dx\right) dt = - A_t^2\,dt\,.\]

    Altogether this gives

    \[dA_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx -A_t^2\,dt\]
  • Now we wish to analyze the evolution of \(\Phi_t = \tr(A_t^2)\), and another application of Itô’s lemma gives

    \[\begin{align*} d \Phi_t = -2 \tr(A_t^3)\,dt &+ \left(\int \int \left((x-a_t)^{\top} (y-a_t)\right)^3 f_t(x) f_t(y)\,dx\,dy\right) dt \\ &+ 2 \int (x-a_t)^{\top} A_t (x-a_t) \langle x-a_t,dB_t\rangle f_t(x)\,dx \end{align*}\]

    We want to ensure that \(\Phi_t\) stays small for \(t \in [0,T]\), so we can ignore the term \(-2 \tr(A_t^3)\,dt\) (since \(A_t\) is PSD, this term is nonpositive).

    If \(\hat{\mathbf{X}},\hat{\mathbf{Y}}\) are independent random variables with law \(\mu_t\) and \(\mathbf{X} \seteq \hat{\mathbf{X}} - \E[\hat{\mathbf{X}}]\), \(\mathbf{Y} \seteq \hat{\mathbf{Y}} - \E[\hat{\mathbf{Y}}]\), then dropping that term gives

    \[d \Phi_t \leq \E \left[\langle \mathbf{X}, \mathbf{Y}\rangle^3\right] dt + 2\,\E \left[\langle \mathbf{X}, A_t \mathbf{X}\rangle \mathbf{X}^{\top} \right] dB_t\,,\]

    where the expectations are over \(\mathbf{X},\mathbf{Y}\) only (i.e., conditionally on the process up to time \(t\)).
  • Reverse Holder inequality: For any log-concave random vector \(\mathbf{X}\) in \(\R^n\) and \(k \geq 0\),

    \[(\E \|\mathbf{X}\|^k)^{1/k} \leq 2k\ \left(\E \|\mathbf{X}\|^2\right)^{1/2}\,.\]
  • Using this inequality one obtains

    \[d \Phi_t \lesssim \Phi_t^{3/2}\,dt + \Phi_t^{5/4} dB_t\,.\]
  • To analyze this, let us assume that we run the process until the first time \(\tau_0\) such that \(\Phi_{\tau_0} = (1+c) \tr(A^2)\). Then for \(t \in [0,\tau_0]\),

    \[d \Phi_t \lesssim \tr(A^2)^{3/2}\,dt + \tr(A^2)^{5/4} dB_t\,,\]

    and therefore

    \[c\, \tr(A^2) = \Phi_{\tau_0} - \Phi_{0} = \int_{0}^{\tau_0} d\Phi_t \lesssim \tau_0 \tr(A^2)^{3/2} + W_{\tau}\,,\]

    where \(\tau \leq \tau_0 \tr(A^2)^{5/2}\), and \(W_{\tau}\) has the law of a Brownian motion (recall Dubins-Schwartz). Thus with high probability, this upper bound is

    \[\lesssim \tau_0\,\tr(A^2)^{3/2} + \tau_0^{1/2}\,\tr(A^2)^{5/4}\,.\]

    We therefore conclude that, with constant probability, \(\tau_0 \gtrsim \tr(A^2)^{-1/2}\), as desired.
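
The final step can be illustrated with a toy simulation of the scalar relation \(d\Phi_t \lesssim \Phi_t^{3/2}\,dt + \Phi_t^{5/4}\,dB_t\). The sketch below treats it as an SDE with equality and unit constants (both choices are mine, purely for illustration) and estimates the hitting time \(\tau_0\) of the level \((1+c)\Phi_0\), which should scale like \(\Phi_0^{-1/2}\) by the scale invariance of the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation of  d Phi = Phi^{3/2} dt + Phi^{5/4} dB  (all constants set to 1),
# measuring the hitting time tau_0 of the level (1 + c) * Phi_0.
def hitting_time(phi0, c=0.5, t_cap_factor=20.0):
    dt = 1e-3 * phi0 ** -0.5           # step size scaled with the natural time scale
    phi, t, level = phi0, 0.0, (1 + c) * phi0
    while phi < level and t < t_cap_factor * phi0 ** -0.5:
        phi += phi ** 1.5 * dt + phi ** 1.25 * np.sqrt(dt) * rng.normal()
        phi = max(phi, 1e-12)          # keep the state positive after a large step
        t += dt
    return t

for phi0 in (1.0, 4.0, 16.0, 64.0):
    taus = [hitting_time(phi0) for _ in range(50)]
    print(f"Phi_0 = {phi0:5.1f}   median tau_0 ≈ {np.median(taus):.3f}   "
          f"Phi_0^(-1/2) = {phi0 ** -0.5:.3f}")
```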

Supplementary material:

Videos:

Topological methods
Wed, Feb 08
Borsuk-Ulam and the Kneser conjecture

Reading:

Mon, Feb 13
The evasiveness conjecture

Reading:

Wed, Feb 15
Simplicial complexes, fixed points, and symmetry

Reading:

Mon, Feb 20
NO CLASS (President's day)

High-dimensional expanders
Wed, Feb 22
Spectral gaps on simplicial complexes

Spectral expansion in graphs

  • Consider an undirected graph $G=(V,E)$ and a probability distribution \(\mu\) on \(E\). Let \(\mu^{(0)}\) be the probability measure induced on \(V\) as follows: We sample a random edge \(uv \in E\) according to \(\mu\), and then choose one of the two endpoints uniformly at random. Equivalently, \(\mu^{(0)}(u) = \frac12 \sum_{v : uv \in E} \mu(uv)\).

  • Random walk. We can define a random walk \(\{X_t\}\) on \(V\) as follows. For \(v \in V\), let \(\mu_v\) denote the distribution \(\mu\) conditioned on the event that the sampled edge contains \(v\), i.e.,

    \[\mu_v(e) = \frac{\mu(e) \mathsf{1}_e(v)}{\sum_{e' \in E} \mu(e') \mathsf{1}_{e'}(v)}\,.\]

    Now given \(X_t = v\), we sample an edge \(e \in E\) according to \(\mu_v\) and let \(X_{t+1}\) be the other endpoint of \(e\), i.e., the unique vertex in \(e \setminus \{v\}\).

    Define the associated random walk transition matrix by

    \[P_{uv} \seteq \frac{\mu(uv)}{\sum_{w : uw \in E} \mu(uw)} = \frac12 \frac{\mu(uv)}{\mu^{(0)}(u)}\,.\]

    We will think of \(P\) as a linear operator on the space of functions \(f : V \to \R\), where \(P f(u) = \sum_{v \in V} P_{uv} f(v)\).

  • Note that the matrix \(P\) is not necessarily symmetric, i.e., it is not self-adjoint with respect to the standard inner product on $\R^{V}$. For \(f,g : V \to \R\), define the inner product

    \[\langle f,g \rangle_{\mu^{(0)}} \seteq \sum_{u \in V} \mu^{(0)}(u) f(u) g(u)\,.\]

    Let us use \(\ell_2(\mu^{(0)})\) to denote the Hilbert space of functions equipped with \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\). Now let us verify that

    \[\langle f, P g\rangle_{\mu^{(0)}} = \sum_{u \in V} \mu^{(0)}(u) f(u) \sum_{v \in V} P_{uv} g(v) = \frac12 \sum_{u,v \in V} \mu(uv) f(u) g(v) = \sum_{uv \in E} \mu(uv)\, \frac{f(u) g(v) + f(v) g(u)}{2}\,.\]

    Evidently, this is also equal to \(\langle Pf, g\rangle_{\mu^{(0)}}\) so that \(P\) is indeed self-adjoint with respect to \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\). In particular, this implies that \(P\) has a complete orthonormal basis of eigenfunctions and associated real eigenvalues.

  • If we denote by \(\mathbf{1}\) the constant function, then evidently \(P \mathbf{1} = \mathbf{1}\), hence \(P\) has \(1\) as an eigenvalue. Moreover, \(P\) is an averaging operator (it has nonnegative entries and is row-stochastic), so by Jensen’s inequality,

    \[(P f(u))^2 = \left(\sum_{v \in V} P_{uv} f(v)\right)^2 \leq \sum_{v \in V} P_{uv} f(v)^2\,,\]

    and therefore

    \[\|P f\|_{\mu^{(0)}}^2 = \sum_{u \in V} \mu^{(0)}(u) (P f(u))^2 \leq \sum_{u \in V} \mu^{(0)}(u) \sum_{v \in V} P_{uv} f(v)^2 = \frac12 \sum_{u,v \in V} \mu(uv) f(u)^2 = \sum_{u \in V} \mu^{(0)}(u) f(u)^2 = \|f\|_{\mu^{(0)}}^2\,.\]

    This is usually summarized by saying: “\(P\) is a contraction on \(\ell_2(\mu^{(0)})\).” It implies that the spectrum of \(P\) lies in \([-1,1]\).

  • The spectral gap. Recall that \(1\) is an eigenvalue of \(P\) with eigenvector \(\mathbf{1}\). The spectral gap of \(P\) is the largest value \(\varepsilon\) such that all the other eigenvalues of \(P\) lie in the interval \([-(1-\e),1-\e]\).

    Equivalently, this is the largest value \(\e\) such that \(\|P f\|_{\mu^{(0)}} \leq (1-\e) \|f\|_{\mu^{(0)}}\) for all \(f\) with \(\langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0\).

    Let us denote this value, the “spectral gap of \(\mu\),” by \(\e(\mu)\).

  • The Laplacian. It is sometimes more convenient to work with the corresponding Laplacian operator \(L \seteq I - P\). Note that

    \[\begin{align} \langle f, L g\rangle_{\mu^{(0)}} &= \sum_{u \in V} \mu^{(0)}(u) f(u) \left(g(u) - \sum_{v \in V} P_{uv} g(v)\right) \nonumber \\ &= \frac12 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) - f(u) g(v)\right) \nonumber\\ &= \frac14 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) + f(v) g(v) - f(u) g(v) - f(v) g(u)\right) \nonumber\\ &= \frac14 \sum_{u,v \in V} \mu(uv) (f(u)-f(v)) (g(u)-g(v)) \nonumber\\ &= \frac12 \sum_{uv \in E} \mu(uv) \left(f(u)-f(v)\right) \left(g(u)-g(v)\right)\,. \label{eq:laplacian} \end{align}\]

    The eigenvalue \(1\) in \(P\) corresponding to the constant function \(\mathbf{1}\) becomes an eigenvalue \(0\) in \(L\). We have an equivalent characterization

    \[\e(\mu) = \min \left\{ \frac{\langle f, L f\rangle_{\mu^{(0)}}}{\langle f,f\rangle_{\mu^{(0)}}} : \langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0 \right\}.\]
  • Poincare inequalities. In order to eliminate the algebraic constraint \(\langle f,\mathbf{1}\rangle_{\mu^{(0)}}=0\), we can simply project \(f\) onto the orthogonal complement of the span of \(\mathbf{1}\), i.e., we can replace \(f\) by \(f - \E_{\mu^{(0)}}[f]\).

    Define

    \[\mathsf{Var}_{\mu^{(0)}}(f) \seteq \left\|f - \E_{\mu^{(0)}}[f]\right\|_{\mu^{(0)}}^2 = \E_{\mu^{(0)}} \left(f - \E_{\mu^{(0)}}[f]\right)^2 = \E_{\mu^{(0)}}(f^2) - (\E_{\mu^{(0)}} f)^2\,.\]

    Then we can write

    \[\e(\mu) = \min_f \frac{\langle f-\E_{\mu^{(0)}}[f], L(f-\E_{\mu^{(0)}}[f])\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)} = \min_f \frac{\langle f,L f\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)}\,,\]

    where the second equality uses the fact that \(L\) annihilates constant functions.

    This is often written in the form of a Poincare inequality: For all \(f : V \to \R\), it holds that

    \[\mathsf{Var}_{\mu^{(0)}}(f) \leq K(\mu) \langle f,L f\rangle_{\mu^{(0)}}\,,\]

    where \(K(\mu) \seteq 1/\e(\mu)\).

  • Let \(X\) be a simplicial complex. The dimension of \(X\) is \(\dim(X) \seteq \max \{ \abs{\sigma} - 1 : \sigma \in X \}\). Say that a \(d\)-dimensional complex \(X\) is pure if every face of \(X\) is contained in some face of dimension \(d\).

    Note that a pure \(d\)-dimensional simplicial complex is specified by a \((d+1)\)-uniform hypergraph (the set of its top-dimensional faces).

    Write \(X \seteq X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(d)}\) where \(X^{(i)}\) contains the faces of \(X\) of dimension \(i\). Note that \(X^{(-1)} = \{ \emptyset \}\).

  • Induced measures. Suppose now that \(\pi^{(d)}\) is a probability measure on \(X^{(d)}\). In analogy with the graph case, we can define probability measures \(\pi^{(0)}, \pi^{(1)},\ldots, \pi^{(d-1)}\) as follows: To sample from \(\pi^{(i)}\), choose \(\sigma \in X^{(d)}\) according to \(\pi^{(d)}\) and then remove \(d-i\) uniformly random vertices from \(\sigma\).

    Equivalently, we can define the measures inductively by

    \[\pi^{(i-1)}(\tau) \seteq \sum_{\sigma \in X^{(i)} : \tau \subseteq \sigma} \frac{\pi^{(i)}(\sigma)}{i+1}\,.\]
  • Links. Recall that for \(\sigma \in X\), the link of \(\sigma\) in \(X\) is defined by

    \[X_{\sigma} \seteq \{ \tau \in X : \sigma \cap \tau = \emptyset \textrm{ and } \sigma \cup \tau \in X \}\,.\]

    Note that if \(X\) is a graph, i.e., a \(1\)-dimensional simplicial complex, then for a vertex \(v \in X^{(0)}\), \(X_v\) is precisely the set of neighbors of \(v\) in \(X\).

  • Induced measures on links. Given a face \(\sigma \in X^{(i)}\), the link \(X_{\sigma}\) is a pure \((d-i-1)\)-dimensional complex. We can define the induced measure \(\pi_{\sigma}^{(d-i-1)}\) in the straightforward way (as a conditional measure): For any top-dimensional face \(\tau \in X_{\sigma}^{(d-i-1)}\),

    \[\pi_{\sigma}^{(d-i-1)}(\tau) \seteq \frac{\pi^{(d)}(\sigma \cup \tau)}{\sum_{\tau' \in X_{\sigma}^{(d-i-1)}} \pi^{(d)}(\sigma \cup \tau')} = \P_{\rho \sim \pi^{(d)}}\left[\rho = \sigma \cup \tau \mid \sigma \subseteq \rho\right].\]

    Repeating the earlier construction, we obtain measures \(\pi_{\sigma}^{(0)}, \pi_{\sigma}^{(1)}, \ldots, \pi_{\sigma}^{(d-i-1)}\).

  • Skeletons. The \(k\)-skeleton of a simplicial complex \(X\) is precisely the complex \(X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(k)}\).

  • Link expansion. Say that a pure \(d\)-dimensional complex \(X\) with measure \(\pi^{(d)}\) is an \(\e\)-spectral link expander if, for every \(\sigma \in X^{(d-2)} \cup X^{(d-1)} \cup \cdots \cup X^{(-1)}\), it holds that the \(1\)-skeleton of \(X_\sigma\) with the induced measure \(\pi_{\sigma}^{(1)}\) satisfies \(\e(\pi_{\sigma}^{(1)}) \geq \e\).
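
To make these definitions concrete, here is a small Python sketch for an arbitrary random weighted graph (all choices illustrative). It builds \(\mu^{(0)}\) and the walk matrix \(P\) from an edge measure \(\mu\), checks that \(P\) is self-adjoint with respect to \(\langle\cdot,\cdot\rangle_{\mu^{(0)}}\), and computes the spectral gap \(\e(\mu)\) from the spectrum of \(P\).

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_gap(W):
    """W is a symmetric nonnegative weight matrix with mu(uv) proportional to W[u, v].
    Returns (mu0, P, eps(mu))."""
    mu0 = W.sum(axis=1) / W.sum()                  # mu0(u) = (1/2) sum_v mu(uv)
    P = W / W.sum(axis=1)[:, None]                 # P_{uv} = mu(uv) / sum_w mu(uw)
    # Self-adjointness w.r.t. <.,.>_{mu0}: diag(mu0) @ P is a symmetric matrix.
    assert np.allclose(mu0[:, None] * P, (mu0[:, None] * P).T)
    # The eigenvalues of P equal those of the symmetric matrix D^{1/2} P D^{-1/2}.
    d = np.sqrt(mu0)
    evals = np.sort(np.linalg.eigvalsh((d[:, None] * P) / d[None, :]))
    eps = 1.0 - np.max(np.abs(evals[:-1]))         # drop the trivial eigenvalue 1
    return mu0, P, eps

# A random connected weighted graph (illustrative choice).
n = 8
A = np.triu(rng.uniform(size=(n, n)) * (rng.uniform(size=(n, n)) < 0.6), 1)
W = A + A.T + np.eye(n, k=1) + np.eye(n, k=-1)     # a path is added to ensure connectivity

mu0, P, eps = spectral_gap(W)
print("rows of P sum to 1:", np.allclose(P.sum(axis=1), 1.0))
print(f"spectral gap eps(mu) ≈ {eps:.3f}")
```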

Theorem

  • Trickle step: Suppose that \(X\) is a pure \(2\)-dimensional simplicial complex with a probability measure \(\pi^{(2)}\) on \(X^{(2)}\) and that

    • (a) The \(1\)-skeleton \((X^{(0)}, X^{(1)})\) is connected. More precisely, \(\e(\pi^{(1)}) > 0\).

    • (b) For every vertex \(v \in X^{(0)}\), the (\(1\)-dimensional) link \(X_v\) satisfies \(\e(\pi_v^{(1)}) \geq \e\).

    Then \(\e(\pi^{(1)}) \geq 2 - 1/\e\).

  • Note that the theorem is only nontrivial for \(\e \geq 1/2\), i.e., the links have to have a substantial spectral gap.

Proof:

  • Let \(L\) denote the Laplace operator on the graph \((X^{(0)}, X^{(1)})\) equipped with the measure \(\mu = \pi^{(1)}\).

    Use \eqref{eq:laplacian} to write

    \[\begin{align*} \langle f, L f \rangle_{\pi^{(0)}} &= \frac12 \E_{uw \sim \pi^{(1)}} (f(u)-f(w))^2 \\ &= \frac12 \E_{v \sim \pi^{(0)}} \E_{uw \sim \pi_v^{(1)}} (f(u)-f(w))^2 \\ &= \E_{v \sim \pi^{(0)}} \langle f, L_v f\rangle_{\pi_v^{(0)}}\,, \end{align*}\]

    where \(L_v\) is the Laplacian on \(X_v\) associated to the edge measure \(\mu = \pi_v^{(1)}\).

    Now using assumption (b) that \(\e(\pi_v^{(1)}) \geq \e\) gives

    \[\langle f, L f\rangle_{\pi^{(0)}} \geq \e\ \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}}(f)\,.\]
  • The law of total variance: If \(X\) and \(Y\) are random variables, then

    \[\mathsf{Var}(Y) = \E[\mathsf{Var}(Y \mid X)] + \mathsf{Var}(\E[Y \mid X]).\]

    Note that \(\E_{u \sim \pi_v^{(0)}} f(u) = P f(v)\). Thus by the law of total variance, we have

    \[\mathsf{Var}_{\pi^{(0)}}(f) = \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}} (f) + \mathsf{Var}_{\pi^{(0)}} (P f)\,.\]

    We conclude that

    \[\begin{equation}\label{eq:penult} \langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(\mathsf{Var}_{\pi^{(0)}}(f) - \mathsf{Var}_{\pi^{(0)}}(Pf) \right). \end{equation}\]
  • We are happy as long as \(\mathsf{Var}_{\pi^{(0)}}(Pf) \ll \mathsf{Var}_{\pi^{(0)}}(f)\). But in some sense this is what we are trying to prove! Indeed, the spectral gap can be characterized in terms of the rate of decay of the variance of functions acted on by \(P\). Convergence to the stationary measure \(\pi^{(0)}\) is precisely the assertion that \(\mathsf{Var}_{\pi^{(0)}}(P^t f) \to 0\) as \(t \to \infty\).

    The key reason we can still say something interesting is that \(\mathsf{Var}_{\pi^{(0)}}(Pf)\) involves terms of the form \(\langle f, P^2 f\rangle_{\pi^{(0)}}\). To check what happens, let us write everything in terms of the Laplacian: Observe that \(I - P^2 = 2 (I-P) - (I-P)^2 = 2 L - L^2\), and therefore \eqref{eq:penult} can be equivalently written as

    \[\langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(2 \langle f,L f\rangle_{\pi^{(0)}} - \langle f,L^2 f\rangle_{\pi^{(0)}}\right).\]

    We immediately conclude that any eigenvalue \(\lambda\) of \(L\) satisfies

    \[\lambda \geq \e (2-\lambda) \lambda\,.\]

    If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\), implying that the connected components of \((X^{(0)},X^{(1)})\) all have spectral gap at least \(2-1/\e\). Since we assume this graph to be connected, the desired result follows.
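
Here is a small numerical illustration of the Trickle step theorem. The complex below is the complete pure \(2\)-dimensional complex on a few vertices with random weights \(\pi^{(2)}\) (an arbitrary choice of mine, made so that all links are connected). The sketch computes the induced measures \(\pi^{(1)}\) and the link measures \(\pi_v^{(1)}\), and checks that \(\e(\pi^{(1)}) \geq 2 - 1/\e\) with \(\e = \min_v \e(\pi_v^{(1)})\).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def gap(W):
    """Spectral gap eps(mu) for the edge measure mu(uv) ∝ W[u, v] (W symmetric, >= 0)."""
    deg = W.sum(axis=1)
    keep = deg > 0
    W, deg = W[np.ix_(keep, keep)], deg[keep]
    d = np.sqrt(deg)
    evals = np.sort(np.linalg.eigvalsh(W / np.outer(d, d)))   # D^{-1/2} W D^{-1/2}
    return 1.0 - np.max(np.abs(evals[:-1]))                   # drop the trivial eigenvalue 1

# A pure 2-dimensional complex: all triangles on n vertices, with random weights pi2.
n = 7
triangles = list(combinations(range(n), 3))
pi2 = rng.uniform(0.2, 1.0, size=len(triangles))
pi2 /= pi2.sum()

W1 = np.zeros((n, n))                              # edge measure pi^{(1)} (up to symmetry)
links = [np.zeros((n, n)) for _ in range(n)]       # link edge measures pi_v^{(1)}
for (a, b, c), p in zip(triangles, pi2):
    for u, v in ((a, b), (a, c), (b, c)):
        W1[u, v] += p / 3
        W1[v, u] += p / 3
    for u in (a, b, c):
        x, y = [z for z in (a, b, c) if z != u]
        links[u][x, y] += p
        links[u][y, x] += p

eps_skeleton = gap(W1)
eps_links = min(gap(L) for L in links)
print(f"eps(pi^(1))          ≈ {eps_skeleton:.3f}")
print(f"min_v eps(pi_v^(1))  ≈ {eps_links:.3f}")
print(f"trickle-down bound 2 - 1/eps = {2 - 1/eps_links:.3f}  (should be <= eps(pi^(1)))")
```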

Reading:

Mon, Feb 27
High-dimensional random walks

Trickle Down Theorem

  • Suppose that \(X\) is a pure \(d\)-dimensional simplicial complex with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\).

    Consider \(0 \leq k \leq d-2\) and assume furthermore that:

    • (a) For every \(\sigma \in X^{(k-1)}\), the \(1\)-skeleton of \(X_{\sigma}\) is connected. More precisely, that \(\e(\pi^{(1)}_{\sigma}) > 0\).

    • (b) For every \(\sigma \in X^{(k)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).

    Then for every \(\rho \in X^{(k-1)}\), we have \(\e(\pi_{\rho}^{(1)}) \geq 2 - 1/\e\).

Proof:

  • Fix some \(\rho \in X^{(k-1)}\). Let \(L\) denote the Laplace operator on the \(1\)-skeleton of \(X_{\rho}\) equipped with the measure \(\mu = \pi^{(1)}_{\rho}\).

    Write:

    \[\begin{align*} \langle f, L f \rangle_{\pi_{\rho}^{(0)}} &= \frac12 \E_{uw \sim \pi_{\rho}^{(1)}} (f(u)-f(w))^2 \\ &= \frac12 \E_{v \sim \pi_{\rho}^{(0)}} \E_{uw \sim \pi_{\rho \cup v}^{(1)}} (f(u)-f(w))^2 \\ &= \E_{v \sim \pi_{\rho}^{(0)}} \langle f, L_v f\rangle_{\pi_{\rho \cup v}^{(0)}}\,, \end{align*}\]

    where \(L_v\) is the Laplacian on the \(1\)-skeleton of \(X_{\rho \cup v}\) associated to the edge measure \(\mu = \pi_{\rho \cup v}^{(1)}\).

    Now assumption (b) gives \(\e(\pi_{\rho \cup v}^{(1)}) \geq \e\), and therefore

    \[\langle f, L f\rangle_{\pi_{\rho}^{(0)}} \geq \e\ \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}}(f)\,.\]
  • Note that \(\E_{u \sim \pi_{\rho \cup v}^{(0)}} f(u) = (I-L) f(v)\). Thus by the law of total variance, we have

    \[\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) = \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}} (f) + \mathsf{Var}_{\pi_{\rho}^{(0)}} ((I-L) f)\,.\]

    We conclude that

    \[\begin{align*} \langle f, L f\rangle_{\pi_{\rho}^{(0)}} &\geq \e \left(\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) - \mathsf{Var}_{\pi_{\rho}^{(0)}}((I-L)f) \right) \\ &= \e \left(2 \langle f,L f\rangle_{\pi_{\rho}^{(0)}} - \langle f,L^2 f\rangle_{\pi_{\rho}^{(0)}}\right). \end{align*}\]
  • Therefore if \(\lambda\) is an eigenvalue of \(L\), then

    \[\lambda \geq \e (2-\lambda) \lambda\,.\]

    If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\). Now assumption (a) asserts that only the trivial eigenvalue of \(L\) is \(0\), and the desired result follows.

Corollary

  • Suppose that \(X\) is a pure \(d\)-dimensional simplicial complex equipped with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\), and that

    • (a) For every face \(\sigma \in X\) with \(\dim(\sigma) \leq d-2\), it holds that the \(1\)-skeleton of \(X_{\sigma}\) is connected. More precisely, \(\e(\pi_{\sigma}^{(1)}) > 0\).

    • (b) For every face \(\sigma \in X^{(d-2)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).

    Then for every \(-1 \leq k \leq d-2\) and \(\sigma \in X^{(k)}\), it holds that

    \[\e(\pi_{\sigma}^{(1)}) \geq 1 - \left(d-(k+1)\right) (1-\e) \geq 1 - d(1-\e) \,.\]
  • Note that to derive a non-trivial bound for the \(1\)-skeleton of \(X\), we need a very strong spectral gap for the \((d-2)\)-links: \(\e \geq 1-1/d\).

Proof:

  • The proof is by induction using the Trickle Down Theorem. Throughout, write \(\delta \seteq 1-\e\).

    Consider \(\sigma \in X^{(k)}\):

    • \(k=d-2\). By assumption, \(\e(\pi_{\sigma}^{(1)}) \geq \e = 1-\delta\).

    • \(k=d-3\). The theorem yields \(\e(\pi_{\sigma}^{(1)}) \geq 2 - \frac{1}{1-\delta} = \frac{1-2\delta}{1-\delta} \geq 1 - 2\delta\).

    • \(k=d-4\). Applying the theorem with the bound \(\frac{1-2\delta}{1-\delta}\) from the previous level gives \(\e(\pi_{\sigma}^{(1)}) \geq 2 - \frac{1-\delta}{1-2\delta} = \frac{1-3\delta}{1-2\delta} \geq 1 - 3\delta\).

    • Etc.: for \(\sigma \in X^{(d-1-j)}\), apply the theorem with the bound \(\frac{1-(j-1)\delta}{1-(j-2)\delta}\) from the previous level and use the identity

      \[2 - \frac{1-(j-2)\delta}{1-(j-1)\delta} = \frac{1-j\delta}{1-(j-1)\delta} \geq 1 - j\delta\,,\]

      where the last inequality holds whenever \(1-(j-1)\delta > 0\) and \(1-j\delta \geq 0\).

    Finally, if \(1-j\delta < 0\) at some level, then the claimed bound \(1-j\delta\) for that level (and for every level below it) is negative, and it holds trivially since the spectral gap is always at least \(0\).
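For completeness, here is a short symbolic check (an illustration, not part of the notes) of the telescoping identity used in the inductive step above.

```python
import sympy as sp

# Check that 2 - (1-(j-2)*d)/(1-(j-1)*d) = (1-j*d)/(1-(j-1)*d) as rational functions.
j, d = sp.symbols('j delta')
lhs = 2 - (1 - (j - 2) * d) / (1 - (j - 1) * d)
rhs = (1 - j * d) / (1 - (j - 1) * d)
assert sp.simplify(lhs - rhs) == 0
```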

Random walks

  • Example: Sampling random spanning trees. Consider an undirected graph \(G=(V,E)\) and let \(\mathcal{T}\) denote the set of spanning trees in \(G\) (where we think of a spanning tree as a subset \(T \subseteq E\)). The downward closure \(X\) of \(\mathcal{T}\) is an \((n-2)\)-dimensional simplicial complex, where \(n = |V|\).

    The edge exchange walk on \(\mathcal{T}\) is the random walk defined as follows: If \(T\) is a spanning tree, we remove a uniformly random edge \(e \in E(T)\) and then define \(T' \seteq (T \setminus \{e\}) \cup \{e'\}\), where \(e' \notin E(T)\) is a uniformly random edge such that \(T' \in \mathcal{T}\).

    Our goal now is to define high-dimensional random walks so that the associated spectral gaps allow us to analyze the mixing time of such a random process. The edge exchange walk will be a particular example of the “down-up” walk analyzed next; a small simulation sketch of this walk appears after the proof of \eqref{eq:al-thm} below.

  • Let \(X\) denote a pure \(d\)-dimensional simplicial complex equipped with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\).

    Let us define up and down walk operators: \(P_{\nearrow k} : \ell_2(\pi^{(k-1)}) \to \ell_2(\pi^{(k)})\) and \(P_{\searrow k} : \ell_2(\pi^{(k+1)}) \to \ell_2(\pi^{(k)})\) by

    \[\begin{align*} P_{\nearrow k} f(\sigma) &\seteq \E_{v \in \sigma} f(\sigma \setminus v) \\ P_{\searrow k} f(\sigma) &\seteq \E_{v \sim \pi_{\sigma}^{(0)}} f(\sigma \cup v)\,. \end{align*}\]

    Note that these are “co-”operators: They operate on functions. This means that the adjoint of the “up” operator \(P_{\nearrow k}^{\top}\) acts in the opposite direction on distributions:

    \[\begin{align*} \pi^{(k)} &= P_{\searrow k-1}^{\top} \pi^{(k-1)} \\ \pi^{(k-1)} &= P_{\nearrow k}^{\top} \pi^{(k)}\,. \end{align*}\]

    Correspondingly, we obtain two different walk operators on \(X^{(k)}\): The up-down walk and the down-up walk, respectively

    \[\begin{align*} P_k^{\triangle} &\seteq P_{\searrow k} P_{\nearrow k+1} \\ P_k^{\triangledown} &\seteq P_{\nearrow k} P_{\searrow k-1}\,. \end{align*}\]

    Let us also define the non-lazy up-down walk operator by

    \[P_k^{\wedge} \seteq P_k^{\triangle} + \frac{1}{k+1} (P_k^{\triangle} - I)\,.\]
  • It is not difficult to check that all three walks have \(\pi^{(k)}\) as the stationary measure; a small numerical check is sketched after these definitions.

  • Let us define the corresponding Laplacians

    \[\begin{align*} L_k^{\triangle} &\seteq I - P_k^{\triangle} \\ L_k^{\triangledown} &\seteq I - P_k^{\triangledown} \\ L_k^{\wedge} &\seteq I - P_k^{\wedge} = \frac{k+2}{k+1} L_k^{\triangle}\,. \end{align*}\]
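The definitions above are easy to realize concretely. The sketch below (an illustration, not part of the notes) builds the up and down co-operators as matrices for a small pure \(2\)-dimensional complex with a random measure on its top faces, assembles \(P_k^{\triangle}\), \(P_k^{\triangledown}\), and \(P_k^{\wedge}\) exactly as defined, and checks numerically that \(\pi^{(k)}\) is stationary for each walk. The choice of complex, the random measure, and the helper names `up`/`down` are arbitrary choices for the example; the link vertex measure is computed via the consistency identity \(\pi^{(k)}(\sigma)\,\pi_{\sigma}^{(0)}(v) = \pi^{(k+1)}(\sigma \cup \{v\})/(k+2)\), which follows from the definitions.

```python
import itertools
import numpy as np

# Illustrative sketch (not from the notes): the up/down co-operators and the
# three walks on the complete 2-dimensional complex on 4 vertices, with a
# random measure pi^(2) on the triangles.

d = 2
rng = np.random.default_rng(0)
top = [frozenset(t) for t in itertools.combinations(range(4), d + 1)]
w = rng.random(len(top))
pi = {d: dict(zip(top, w / w.sum()))}             # pi^(d) on the top faces

# Push the measure down one level at a time: delete a uniformly random vertex.
for k in range(d, 0, -1):
    lower = {}
    for sigma, p in pi[k].items():
        for v in sigma:
            lower[sigma - {v}] = lower.get(sigma - {v}, 0.0) + p / (k + 1)
    pi[k - 1] = lower

faces = {k: list(pi[k]) for k in pi}

def up(k):
    """P_{up,k}: functions on X^(k-1) -> functions on X^(k), averaging over sub-faces."""
    M = np.zeros((len(faces[k]), len(faces[k - 1])))
    for i, sigma in enumerate(faces[k]):
        for v in sigma:
            M[i, faces[k - 1].index(sigma - {v})] = 1.0 / (k + 1)
    return M

def down(k):
    """P_{down,k}: functions on X^(k+1) -> functions on X^(k), averaging over the
    link vertex measure pi_sigma^(0)(v) = pi^(k+1)(sigma+v) / ((k+2) pi^(k)(sigma))."""
    M = np.zeros((len(faces[k]), len(faces[k + 1])))
    for j, tau in enumerate(faces[k + 1]):
        for v in tau:
            sigma = tau - {v}
            M[faces[k].index(sigma), j] = pi[k + 1][tau] / ((k + 2) * pi[k][sigma])
    return M

k = 1
P_up_down = down(k) @ up(k + 1)                                        # P_k^triangle
P_down_up = up(k) @ down(k - 1)                                        # P_k^triangledown
P_nonlazy = P_up_down + (P_up_down - np.eye(len(faces[k]))) / (k + 1)  # P_k^wedge

pik = np.array([pi[k][s] for s in faces[k]])
for P in (P_up_down, P_down_up, P_nonlazy):
    assert np.allclose(P.sum(axis=1), 1.0)     # each row is a probability distribution
    assert np.allclose(P.T @ pik, pik)         # pi^(k) is stationary
```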

Theorem

Define \(\e_k \seteq \min_{\sigma \in X^{(k)}} \e(\pi^{(1)}_{\sigma})\) as the minimum spectral gap over all \(1\)-skeleta of links of \(k\)-dimensional faces.

Then for every \(0 \leq k \leq d-1\), and \(f \in \ell_2(\pi^{(k)})\), it holds that

\[\begin{equation}\label{eq:al-thm} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} \geq \e_{k-1} \langle f, L_k^{\triangledown} f\rangle_{\pi^{(k)}}\,. \end{equation}\]

Corollary

  • Using the relationship between \(L_k^{\wedge}\) and \(L_k^{\triangle}\), this gives

    \[\frac{k+2}{k+1} \e(L_k^{\triangle}) \geq \e_{k-1} \e(L_k^{\triangledown})\,.\]

    But recall that \(P_k^{\triangle} = P_{\searrow k} P_{\nearrow k+1}\) and \(P_{k+1}^{\triangledown} = P_{\nearrow k+1} P_{\searrow k}\).

    It is straightforward to check that for rectangular matrices \(A,B\), the two matrices \(AB\) and \(BA\) have the same nonzero eigenvalues, and therefore \(P_k^{\triangle}\) and \(P_{k+1}^{\triangledown}\) have the same spectrum apart from the multiplicity of the eigenvalue \(0\). In particular, \(\e(L_k^{\triangle}) = \e(L_{k+1}^{\triangledown})\), and therefore

    \[\begin{equation}\label{eq:rec-formula} \e(L_{k+1}^{\triangledown}) \geq \frac{k+1}{k+2} \e_{k-1} \e(L_k^{\triangledown})\,. \end{equation}\]
  • Applying \eqref{eq:rec-formula} iteratively gives, for \(1 \leq k \leq d\),

    \[\e(L_{k}^{\triangledown}) \geq \frac{1}{k+1} \prod_{j=-1}^{k-2} \e_j\,.\]

    In particular, we have

    \[\e(L_d^{\triangledown}) \geq \frac{1}{d+1} \prod_{j=-1}^{d-2} \e_j\,,\]

    yielding a spectral gap for the down-up walk whenever there are non-trivial spectral gaps for the \(1\)-skeleta of all the links in \(X\).

Proof of \eqref{eq:al-thm}:

  • For \(j \leq k\), \(\sigma \in X^{(j)}\), and \(f : X^{(k)} \to \R\), let us define \(f_{\sigma}^{(k-j-1)} : X_{\sigma}^{(k-j-1)} \to \R\) by \(f^{(k-j-1)}_{\sigma}(\tau) \seteq f(\sigma \cup \tau)\).

  • Let \(L_{\sigma}\) denote the Laplacian on the \(1\)-skeleton of \(X_{\sigma}\) with respect to the edge measure \(\mu = \pi_{\sigma}^{(1)}\). Then we have

    \[\begin{align} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \langle f^{(0)}_{\sigma}, L_{\sigma} f^{(0)}_{\sigma} \rangle_{\pi_{\sigma}^{(0)}} \label{eq:down1} \\ \langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi_{\sigma}^{(0)}}(f_{\sigma}^{(0)})\,, \label{eq:down2} \end{align}\]

    and \eqref{eq:al-thm} follows immediately from the definition of the spectral gap.

  • We are left to verify \eqref{eq:down1} and \eqref{eq:down2}.

    Use \eqref{eq:laplacian} to write

    \[\begin{align*} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha}} (f(\alpha)-f(\alpha + u - v))^2 \\ &=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f(\sigma + u)-f(\sigma + v))^2 \\ &=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f_{\sigma}^{(0)}(u) - f_{\sigma}^{(0)}(v))^2 \\ &= \E_{\sigma \sim \pi^{(k-1)}} \langle f_{\sigma}^{(0)}, L_{\sigma} f_{\sigma}^{(0)}\rangle_{\pi_{\sigma}^{(0)}}\,. \end{align*}\]
  • Next write

    \[\begin{align*} \langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha \setminus v}} (f(\alpha) - f(\alpha + u - v))^2 \\ &= \frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f(\sigma + u) - f(\sigma + v))^2 \\ &= \frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f^{(0)}_{\sigma}(u) - f^{(0)}_{\sigma}(v))^2 \\ &= \E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi^{(0)}_{\sigma}}(f_{\sigma}^{(0)})\,, \end{align*}\]

    where the last line follows from the general fact

    \[\mathsf{Var}(X) = \frac12 \E |X-X'|^2\,,\]

    when \(X\) and \(X'\) are two independent copies of \(X\).
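To connect this back to the example of sampling random spanning trees, here is a self-contained sketch (an illustration, not part of the notes) of the down-up walk \(P_d^{\triangledown}\) on the spanning trees of \(K_4\) with the uniform measure; the edge exchange walk described earlier is its non-lazy variant, since here the removed edge may be re-inserted. The brute-force enumeration and the helper `is_spanning_tree` are choices made for the example.

```python
import itertools
import numpy as np

# Illustrative sketch (not from the notes): the down-up walk on the spanning
# trees of K_4 (the edge exchange walk, with a lazy step allowed).

n = 4
edges = list(itertools.combinations(range(n), 2))

def is_spanning_tree(es):
    """n-1 edges form a spanning tree iff they connect all n vertices."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in es:
        parent[find(u)] = find(v)
    return len({find(v) for v in range(n)}) == 1

trees = [frozenset(es) for es in itertools.combinations(edges, n - 1)
         if is_spanning_tree(es)]
idx = {T: i for i, T in enumerate(trees)}

# One step from T: drop a uniformly random edge e of T, then resample a
# spanning tree uniformly among those containing T \ {e}.
P = np.zeros((len(trees), len(trees)))
for T in trees:
    for e in T:
        compatible = [S for S in trees if T - {e} <= S]
        for S in compatible:
            P[idx[T], idx[S]] += 1.0 / (len(T) * len(compatible))

pi = np.full(len(trees), 1.0 / len(trees))
assert np.allclose(P.T @ pi, pi)               # the uniform measure is stationary

eigs = np.sort(np.linalg.eigvalsh((P + P.T) / 2))
print("spectral gap of the down-up walk on spanning trees of K_4:", 1.0 - eigs[-2])
```

The corollary above then bounds this gap below by \(\frac{1}{d+1} \prod_{j=-1}^{d-2} \e_j\) whenever the \(1\)-skeleta of all the links have non-trivial spectral gaps.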

Reading:

Noise stability on the hypercube via Brownian motion
Wed, Mar 01
Leaking entropy at constant rate

Reading:

Mon, Mar 06
Majority is stablest

Recent topics in ML theory
Wed, Mar 08
The edge of stability

Notes

Self-stabilization dynamics

  • Define the set \(\mathcal{M} \seteq \{ \theta \in \R^n : S(\theta) \leq 2/\eta \textrm{ and } \langle \nabla L(\theta), u(\theta)\rangle = 0 \}\), where we recall that \(u(\theta)\) is the top eigenvector of \(\nabla^2 L(\theta)\).

    Given the gradient descent dynamics \(\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\), we define \(\hat{\theta}_t \seteq \mathrm{proj}_{\mathcal{M}}(\theta_t)\) as the (Euclidean) projection onto the set \(\mathcal{M}\).

  • Now define the quantities

    \[\begin{align*} x_t &\seteq \langle u, \theta_t - \hat{\theta}_0\rangle \\ y_t &\seteq \langle \nabla S(\hat{\theta}_0), \theta_t - \hat{\theta}_0\rangle, \end{align*}\]

    where \(u \seteq u(\hat{\theta}_0)\) is the top eigenvector of \(\nabla^2 L(\hat{\theta}_0)\) and \(S(\theta) \seteq \lambda_{\max}(\nabla^2 L(\theta))\) is the sharpness.

    If we assume that \(S(\hat{\theta}_0)=2/\eta\) and \(\nabla S(\theta_t) \approx \nabla S(\hat{\theta}_0)\), then

    \[\begin{equation}\label{eq:sharp} S(\theta_t) \approx 2/\eta + y_t\,. \end{equation}\]
  • Linear approximation. For \(x_t,y_t\) small, we can approximate \(\nabla L(\theta_t) \approx \nabla L(\hat{\theta}_0)\).

  • Quadratic approximation. When \(\abs{x_t}\) blows up, we use a quadratic approximation:

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) \\ &\approx \langle u, \theta_t - \hat{\theta}_0\rangle \nabla^2 L(\theta_t) u\,. \end{align*}\]

    In the last line, we have used the fact that \(\theta_t-\hat{\theta}_0\) only blows up in the \(u\) direction, so we only use the quadratic approximation there.

    This quantity is equal to \(x_t \nabla^2 L(\theta_t) u \approx x_t S(\theta_t) u\).

    In conjunction with \(\langle u, \nabla L(\hat{\theta}_0)\rangle = 0\), this gives

    \[\begin{equation}\label{eq:xpart} x_{t+1} = x_t - \eta \langle u, \nabla L(\theta_t)\rangle = x_t(1-\eta S(\theta_t)) \approx -(1+\eta y_t) x_t\,, \end{equation}\]

    where the last approximation uses \eqref{eq:sharp}. Here we see the blowup phase where \(\abs{x_t}\) grows multiplicatively as soon as \(y_t > 0\), i.e., as soon as the sharpness exceeds \(2/\eta\), and the sign of \(x_t\) oscillates.

  • Cubic terms. For \(\abs{x_t}\) sufficiently large, we need to incorporate the 3rd-order term in the Taylor expansion:

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\theta_t)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\ &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\ &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[u,u] \langle u, \theta_t - \hat{\theta}_0\rangle^2\,, \end{align*}\]

    where as in the quadratic setting, we only consider the component in direction \(u\). Using the approximation we derived for the quadratic part, this gives

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla^3 L(\hat{\theta}_0)[u,u] \\ &= x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla S(\hat{\theta}_0)\,, \end{align*}\]

    where we used the fact that if the top eigenvalue of \(\nabla^2 L(\theta)\) has multiplicity one, then \(\nabla S(\theta) = \nabla^3 L(\theta)[u(\theta),u(\theta)]\).

  • Then our update to \(y_t\) looks like

    \[\begin{align*} y_{t+1} - y_t &= \langle \nabla S(\hat{\theta}_0), - \eta \nabla L(\theta_t)\rangle \\ &= -\eta \langle \nabla S(\hat{\theta}_0), \nabla L(\hat{\theta}_0)\rangle - \eta \langle \nabla S(\hat{\theta}_0), \nabla L(\theta_t)-\nabla L(\hat{\theta}_0)\rangle \\ &\approx \eta \alpha - \eta x_t S(\theta_t) \langle \nabla S(\hat{\theta}_0), u \rangle - \eta \frac{x_t^2}{2} \beta\,, \end{align*}\]

    where \(\alpha \seteq \langle \nabla S(\hat{\theta}_0), - \nabla L(\hat{\theta}_0)\rangle\) is the progressive sharpening parameter and \(\beta \seteq \|\nabla S(\hat{\theta}_0)\|^2\) is the “stabilization” parameter.

  • If we assume for simplicity that \(\langle \nabla S(\hat{\theta}_0), u\rangle = 0\), then we obtain

    \[\begin{equation}\label{eq:ypart} y_{t+1} - y_t = \eta \left(\alpha - \beta \frac{x_t^2}{2}\right)\,, \end{equation}\]

    meaning that the sharpness is decaying whenever \(\abs{x_t} > \sqrt{2\alpha/\beta}\).
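The reduced dynamics \eqref{eq:xpart} and \eqref{eq:ypart} are easy to simulate directly. The sketch below (an illustration, not part of the notes, with arbitrarily chosen parameter values) exhibits the resulting behavior: \(y_t\) (the sharpness minus \(2/\eta\)) rises at rate \(\eta\alpha\) until \(\abs{x_t}\) blows up past \(\sqrt{2\alpha/\beta}\), at which point the sharpness is driven back down and the cycle repeats.

```python
import numpy as np

# Illustrative simulation (not from the notes) of the reduced dynamics:
#   x_{t+1} = -(1 + eta * y_t) * x_t
#   y_{t+1} = y_t + eta * (alpha - beta * x_t**2 / 2)
# The parameter values are arbitrary choices for the illustration.

eta, alpha, beta = 0.01, 1.0, 1.0
x, y = 1e-3, 0.0                      # small displacement along u; sharpness = 2/eta
xs, ys = [], []
for _ in range(4000):
    xs.append(x)
    ys.append(y)
    x, y = -(1 + eta * y) * x, y + eta * (alpha - beta * x ** 2 / 2)

xs, ys = np.array(xs), np.array(ys)
# |x_t| repeatedly grows past sqrt(2*alpha/beta) and is pushed back, while the
# sharpness 2/eta + y_t oscillates around 2/eta.
print("max |x_t|:", np.abs(xs).max())
print("range of y_t:", ys.min(), ys.max())
```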

Reading: