CSE 599: Analytic and geometric methods in TCS

Winter 2023

MW 11:30am-12:50pm in CSE2 387

Instructor: James R. Lee

Office hours: By appointment

Course email list [archives]

Course description:

This is an advanced graduate-level theory course about tools from analysis and geometry that have interesting applications in algorithms, complexity theory, and ML theory. Evaluation will be based on scribe notes and (possibly) student presentations. One topic per week; the choice of topics will be guided by class interest.

I taught a version of this course back in prehistory. The set of topics will be mostly disjoint.

Lectures

  • Generic chaining and extrema of random processes
  • The KLS conjecture and stochastic localization
  • Topological methods
  • High-dimensional expanders
  • Noise stability on the hypercube via Brownian motion
  • Recent topics in ML theory

Wed, Jan 04
NO CLASS

Generic chaining and extrema of random processes
Mon, Jan 09
Examples, applications, Gaussian processes

  • A random process is a collection \(\{X_t : t \in T\}\) of random variables. For the next few lectures, we will consider symmetric processes, i.e., those where \(X_t\) and \(-X_t\) have the same law for every \(t \in T\).

  • We equip the index set \(T\) of such a process with the natural \(\ell_2\) metric

    \[d(s,t) \seteq \sqrt{\E(X_s-X_t)^2}.\]
  • A fundamental example is that of a Gaussian process, where the family \(\{X_t : t \in T\}\) is a jointly normal family of random variables. When \(T\) is finite, any such family can be described as follows: \(T \subseteq \R^n\) for some \(n \geq 1\), and \(X_t = \langle g,t\rangle\), where \(g=(g_1,\ldots,g_n)\) is a standard \(n\)-dimensional gaussian, i.e., the coordinates \(g_1,\ldots,g_n\) are i.i.d. \(N(0,1)\) random variables.

    In this case the geometry of the metric space \((T,d)\) is Euclidean, since \(d(s,t) = \|s-t\|_2\).

    Also note that the metric \(d(s,t)\) completely describes the covariance structure of the process, and therefore characterizes it (this is a special property of jointly normal families of random variables).

  • We will be concerned with extrema of such a process, and to that end we study the quantity

    \[\E \max_{t \in T} X_t\,.\]
  • A random process \(\{ X_t : t \in T \}\) is called sub-gaussian if there is some constant \(c > 0\) such that

    \[\begin{equation}\label{eq:sg-tail} \P[X_s - X_t > \lambda] \leq \exp\left(- c\lambda^2/d(s,t)^2\right) \end{equation}\]

    for all \(s,t \in T\) and \(\lambda > 0\).

    In particular, Gaussian processes are sub-gaussian with \(c=1/2\).

    Another example is a Bernoulli process, i.e. a family of random variables defined thusly: For some subset \(T \subseteq \R^n\),

    \[\left\{ \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t = (t_1,\ldots,t_n) \in T \right\},\]

    where \(\e_1,\ldots,\e_n \in \{-1,1\}\) are i.i.d. uniformly random signs.

  • Diameter bound: When the index set \(T\) is finite, we can use the tail bound \eqref{eq:sg-tail} to give a simple upper bound on \(\E \max_{t \in T} X_t\).

    Fix some \(t_0 \in T\) and note that \(\E \max_{t \in T} X_t = \E \max_{t \in T} (X_t-X_{t_0})\). Now applying a union bound gives

    \[\P\left[\max_{t \in T} (X_t-X_{t_0}) > \lambda\right] \leq |T| \exp\left(- c\lambda^2/\Delta^2\right)\,,\]

    where \(\Delta \seteq \max \{ d(t,t_0) : t \in T \}\).

    A simple calculation gives the conclusion

    \[\E \max_{t \in T} X_t \lesssim \Delta \sqrt{\log |T|}\,.\]
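
As a quick sanity check of the diameter bound above, here is a minimal Python sketch (all parameter choices are arbitrary illustrations, not taken from the lecture). It builds a finite Gaussian process \(X_t = \langle g,t\rangle\) over random points \(T \subseteq \R^n\), estimates \(\E \max_{t \in T} X_t\) by Monte Carlo, and compares it to \(\Delta \sqrt{\log |T|}\) (recall the bound holds only up to a universal constant).

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite Gaussian process: T is a set of points in R^n and X_t = <g, t>.
n, num_points, trials = 50, 200, 2000
T = rng.normal(size=(num_points, n))             # index set T ⊂ R^n (arbitrary)
t0 = T[0]                                        # reference point t_0
Delta = np.max(np.linalg.norm(T - t0, axis=1))   # Δ = max_t d(t, t_0)

# Monte Carlo estimate of E max_{t in T} X_t.
g = rng.normal(size=(trials, n))                 # standard Gaussian vectors
emp = np.mean(np.max(g @ T.T, axis=1))

# Union-bound ("diameter") upper bound, up to the universal constant.
bound = Delta * np.sqrt(np.log(num_points))

print(f"empirical E max X_t ≈ {emp:.2f}")
print(f"Δ·sqrt(log|T|)      ≈ {bound:.2f}")
```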

DETAILED PRELIMINARY NOTES

Supplementary reading:

Wed, Jan 11
Generic chaining and Dudley's entropy bound

Recall that we consider a subgaussian process \(\{X_t : t \in T\}\) and equip \(T\) with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\).

  • Chaining: Instead of applying a naive union bound to control \(X_t - X_{t_0}\) for every \(t \in T\), we can break every such difference into parts and then bound each part. This potentially reduces the number of events we need to take a union bound over.

    For instance, given random variables \(X_0,X_1,X_2,X_3\), we could write

    \[\begin{eqnarray*} X_3 - X_0 &=& (X_3 - X_1) + (X_1 - X_0) \\ X_2 - X_0 &=& (X_2 - X_1) + (X_1 - X_0) \\ X_1 - X_0 &=& X_1 - X_0\,. \end{eqnarray*}\]
  • Generic chaining upper bound: For sub-gaussian processes, we can control \(\E \max_{t \in T} X_t\) using chaining.

    Fix some \(t_0 \in T\) and a sequence of finite sets

    \[\{t_0\} = T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T_n \subseteq \cdots \subseteq T\,,\]

    with cardinalities bounded by \(\abs{T_n} \leq 2^{2^n}\) for every \(n \geq 1\).

    Then we have:

    \[\begin{equation}\label{eq:gc-ub} \E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t,T_n)\,, \end{equation}\]

    where \(d(t,T_n) = \min_{s \in T_n} d(s,t)\).

  • Dudley’s entropy bound: If \(\{X_t : t \in T\}\) is a subgaussian process, then

    \[\begin{equation}\label{eq:dudley} \E \max_{t \in T} X_t \lesssim \sum_{n \geq 1} 2^{n/2} e_n(T)\,, \end{equation}\]

    where \(e_n(T)\) is the largest distance \(d\) such that there are \(2^{2^n}\) points \(y_1,y_2,\ldots,y_{2^{2^n}}\) in \(T\) with \(d(y_i,y_j) \geq d\) for all \(i \neq j\).

    Easy exercise: Show that Dudley’s entropy bound follows from the generic chaining upper bound \eqref{eq:gc-ub}.

  • Covering numbers: Let \(N(T,d,\e)\) denote the smallest number \(N\) such that the metric space \((T,d)\) can be covered by \(N\) balls of radius \(\e\), and consider the quantity

    \[C_T \seteq \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^{h})}\,.\]

    Up to a factor of \(2\), this is equivalent to the more elegant expression \(\int_0^{\infty} \sqrt{\log N(T,d,\e)} \,d\e\).

    Exercise: Show that for any metric space \((T,d)\), it holds that

    \[\begin{equation}\label{eq:CT} \sum_{n \geq 1} 2^{n/2} e_n(T) \asymp \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^h)}\,. \end{equation}\]
  • Analysis of the probability simplex

    Let \(T = \{ x \in \R^n : x_1,\ldots,x_n \geq 0, x_1+\cdots+x_n = 1 \}\) denote the probability simplex.

    Lemma 1: For this \(T\), let \(X_t = \langle g,t\rangle\), where \(g=(g_1,\ldots,g_n)\) is a standard \(n\)-dimensional Gaussian. Then \(\E \max_{t \in T} X_t \asymp \sqrt{\log n}\).

    This is because the maximum of the linear functional \(t \mapsto \langle g,t\rangle\) is achieved at a vertex of the convex body \(T\) which is equal to the convex hull of the standard basis vectors: \(T = \mathrm{conv}(e_1,\ldots,e_n)\). Thus \(\max_{t \in T} \langle g,t\rangle = \max \{ g_i : i =1,\ldots,n\}\), and therefore

    \[\E \max_{t \in T} \langle g,t\rangle = \E \max(g_1,\ldots,g_n) \asymp \sqrt{\log n}.\]

    Let us now see that the entropy bound \eqref{eq:dudley} is far from tight in this simple case. We will use the formulation via covering numbers coming from the equivalence \eqref{eq:CT}.

    Consider vectors \(t \in T\) with \(t_i \in \{0,1/k\}\) for \(i=1,\ldots,n\) (so exactly \(k\) coordinates are equal to \(1/k\)). There are \({n \choose k} \geq \left(\frac{n}{k}\right)^k\) such vectors, so for \(k \leq \sqrt{n}\) we have \(\log {n \choose k} \geq k \log(n/k) \gtrsim k \log n\). Moreover, it is possible to choose a constant fraction of these vectors (e.g., a random subset will suffice) such that

    \[d(t_i,t_j) \gtrsim \left(k \cdot \frac{1}{k^2} \right)^{1/2} \asymp k^{-1/2}, \qquad i \neq j\,.\]

    It follows that for \(k \leq \sqrt{n}\), we have \(\log N(T,d,k^{-1/2}) \gtrsim k \log n\). Taking \(k = 2^{2h}\), consider the \(\approx \log n\) values of \(h\) for which \(1 \leq 2^{h} \leq n^{1/4}\), so that \(k \leq \sqrt{n}\).

    This gives us \(\approx \log n\) values of \(h\) for which

    \[2^{-h} \sqrt{\log N(T,d,2^{-h})} \gtrsim 2^{-h} \cdot 2^h \sqrt{\log n} = \sqrt{\log n}\,.\]

    Therefore Dudley’s bound \eqref{eq:CT} is \(\gtrsim (\log n) \cdot \sqrt{\log n} = (\log n)^{3/2}\). (Recall that the correct bound, given by Lemma 1, is \(\sqrt{\log n}\).)
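
A minimal numerical illustration of Lemma 1: by the discussion above, \(\E \max_{t \in T} \langle g,t\rangle = \E\max_i g_i\), which grows like \(\sqrt{\log n}\) (with the familiar constant \(\sqrt{2}\) for i.i.d. standard Gaussians), while the computation above shows that the entropy bound is off by roughly a factor of \(\log n\). The sketch below (dimensions chosen arbitrarily) only checks the first statement by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lemma 1: E max_{t in simplex} <g, t> = E max_i g_i, which is of order sqrt(log n).
trials = 5000
for n in (10, 100, 1000, 10000):
    g = rng.normal(size=(trials, n))
    emp = np.mean(np.max(g, axis=1))      # Monte Carlo estimate of E max(g_1, ..., g_n)
    print(f"n={n:6d}   E max g_i ≈ {emp:.3f}   sqrt(2 log n) ≈ {np.sqrt(2*np.log(n)):.3f}")
```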

Mon, Jan 16
NO CLASS (MLK day)

Wed, Jan 18
Graph and hypergraph sparsification

  • We have seen that if \(\{X_t : t \in T \}\) is a symmetric sub-Gaussian process equipped with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\), then for any sequence \(\{t_0\} = T_0 \subseteq T_1 \subseteq \cdots \subseteq T_h \subseteq \cdots \subseteq T\) with \(|T_h| \leq 2^{2^h}\), we have a chaining upper bound

    \[\E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{h \geq 0} 2^{h/2} d(t,T_h)\,.\]

    Let us define \(\gamma_2(T,d)\) as the best possible chaining upper bound, i.e., the infimum of the RHS over all compliant sequences \(\{T_h\}\). This is Talagrand’s famous \(\gamma_2\) functional, and it may be surprising that it characterizes (up to constants) the expected supremum of a Gaussian process.

    Majorizing-measures theorem (Fernique-Talagrand): If \(\{X_t : t\in T \}\) is a centered Gaussian process, then

    \[\E \max_{t \in T} X_t \asymp \gamma_2(T,d)\,.\]
  • Random signed sums: Suppose that \(\e_1,\e_2,\ldots,\e_n\) are i.i.d. uniformly random signs (independent, symmetric \(\{-1,1\}\)-valued random variables) and we define the random process:

    \[\left\{ X_t = \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t \in T \right\}\,,\]

    where \(T \subseteq \R^n\). Then classical concentration bounds show that \(\{X_t : t\in T\}\) is subgaussian, and it is straightforward to calculate the distance (by expanding the square):

    \[d(s,t) = \|s-t\|_2\,.\]

    The chaining upper bound gives \(\E \max_{t \in T} X_t \lesssim \gamma_2(T,d)\). Let us now use this to understand the Hypergraph sparsification problem.

  • Hypergraph sparsification. Consider a weighted hypergraph \(H=(V,E,w)\) where \(\{w_e : e \in E\}\) are nonnegative edge weights. We associate to \(H\) the quadratic expression

    \[Q_H(x) = \sum_{e \in E} w_e \max_{u,v \in e} (x_u-x_v)^2\,.\]

    If \(H\) were a graph (where \(|e|=2\) for every \(e \in E\)), this would correspond to the quadratic form of the graph Laplacian.

    Our goal is to find another hypergraph \(\tilde{H} = (V,\tilde{E},\tilde{w})\) with \(\tilde{E} \subseteq E\) and such that

    \[\begin{equation}\label{eq:sparse-approx} |Q_H(x)-Q_{\tilde{H}}(x)| \leq \e Q_H(x),\qquad \forall x \in \R^n\,, \end{equation}\]

    where \(\e > 0\) is some accuracy parameter. Moreover, we would like \(\abs{\tilde{E}}\) to be small, ideally near-linear in \(\abs{V}\) (while \(\abs{E}\) could be as large as \(2^{\abs{V}}\)).

  • Independent random sampling. Suppose we have a probability distribution \(\{\mu_e : e \in E\}\). Let us form \(\tilde{H}\) by sampling \(M\) edges \(e_1,e_2,\ldots,e_M\) independently from \(\mu\) and setting the edge weights in \(\tilde{H}\) so that

    \[Q_{\tilde{H}}(x) = \frac{1}{M} \sum_{k=1}^M \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\,,\]

    where we define \(Q_e(x) = \max_{u,v \in e} (x_u-x_v)^2\).

    Note that \(\E[Q_{\tilde{H}}(x)] = Q_H(x)\) for all \(x \in \R^n\), and to establish a bound like \eqref{eq:sparse-approx}, it makes sense to study the quantity

    \[\E \max_{Q_H(x) \leq 1} |Q_H(x) - Q_{\tilde{H}}(x)|.\]
  • Passing to a sub-Gaussian process. Let \(\hat{H}\) be an independent copy of \(\tilde{H}\) and write

    \[\begin{eqnarray*} \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |Q_H(x) - Q_{\tilde{H}}(x)| &=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |\E_{\hat{H}}[Q_{\hat{H}}(x)]-Q_{\tilde{H}}(x)| \\ &=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} |\E_{\hat{H}} [Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)]| \\ &\leq& \E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} |Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)|\,, \end{eqnarray*}\]

    where we have used convexity to pull out the expectation: \(\abs{\E X} \leq \E \abs{X}\) and \(\max(\E X_1,\ldots,\E X_k) \leq \E \max(X_1,\ldots,X_k)\).

    Note that for any fixed choice of signs \(\e_1,\ldots,\e_M \in \{-1,1\}\), the random function

    \[Q_{\hat{H}}(x)-Q_{\tilde{H}}(x) = \frac{1}{M} \sum_{k=1}^M \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x)- \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right)\]

    has the same law as

    \[\frac{1}{M} \sum_{k=1}^M \e_k \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x)- \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right),\]

    because \(\tilde{e}_k\) and \(\hat{e}_k\) are i.i.d., so swapping them does not change the joint law. Define \(t_k(x) \seteq \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\) for a hyperedge \(e_k\); then we have

    \[\E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} |Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)| \leq \frac{2}{M}\, \E\,\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \left|\sum_{k=1}^M \e_k t_k(x)\right| \leq \frac{4}{M}\, \E\,\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\,,\]

    where the inner expectation is over i.i.d. uniformly random signs, the outer expectation is over the i.i.d. hyperedges \(e_1,\ldots,e_M\) sampled from \(\mu\), and we removed the absolute values using symmetry of the process (together with the triangle inequality to pass to a single copy).

    So now to bound the quantity we cared about initially, it suffices to bound \(\E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\) for every fixed choice of hyperedges \(e_1,\ldots,e_M\). Of course, this is a sub-Gaussian process, and so we are led to the study of \(\gamma_2(T,d)\), where \(T = \{ x \in \R^n : Q_H(x) \leq 1\}\), and

    \[d(x,y) = \left(\sum_{k=1}^M \left(\frac{w_{e_k}}{\mu_{e_k}} (Q_{e_k}(x)-Q_{e_k}(y))\right)^2\right)^{1/2}\]
  • Abstracting out the norm sparsification problem. Note that each \(\sqrt{Q_e(x)}\) for \(e \in E\) is actually a norm on \(\R^n\) (technically it is only a seminorm since \(Q_e(x)=0\) does not entail \(x=0\)).

    Thus we can consider the following related problem: Suppose that \(N_1,N_2,\ldots,N_m\) are (semi)norms on \(\R^n\) and we want to bound

    \[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2\,,\]

    where \(T \subseteq \R^n\) and \(\e_1,\ldots,\e_m\) are i.i.d. random signs. Again, this is a sub-Gaussian process equipped with the distance

    \[d(x,y) = \left(\sum_{k=1}^m (N_k(x)^2 - N_k(y)^2)^2\right)^{1/2}.\]

    In the next lecture, we will find upper bounds on \(\gamma_2(T,d)\) for this family of problems.
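
To make the independent random sampling scheme described above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption on my part: the random hypergraph, the sampling distribution \(\mu\) (taken proportional to the weights), and the test points; the notes do not prescribe these choices. The sketch only checks empirically that \(Q_{\tilde{H}}\) is an unbiased and reasonably accurate estimate of \(Q_H\).

```python
import numpy as np

rng = np.random.default_rng(0)

def Q_e(x, e):
    """Q_e(x) = max_{u,v in e} (x_u - x_v)^2 for a hyperedge e (a tuple of vertices)."""
    vals = x[list(e)]
    return (vals.max() - vals.min()) ** 2

def Q_H(x, edges, w):
    """Q_H(x) = sum_e w_e * Q_e(x)."""
    return sum(w_e * Q_e(x, e) for e, w_e in zip(edges, w))

# A small random weighted hypergraph on n vertices (illustrative only).
n, m = 30, 400
edges = [tuple(rng.choice(n, size=rng.integers(2, 5), replace=False)) for _ in range(m)]
w = rng.uniform(0.5, 2.0, size=m)

# Importance sampling: mu_e proportional to w_e (an arbitrary illustrative choice).
mu = w / w.sum()
M = 150
sample = rng.choice(m, size=M, p=mu)

def Q_tilde(x):
    """The estimator (1/M) * sum_k (w_{e_k} / mu_{e_k}) * Q_{e_k}(x); E[Q_tilde(x)] = Q_H(x)."""
    return np.mean([(w[k] / mu[k]) * Q_e(x, edges[k]) for k in sample])

for _ in range(3):
    x = rng.normal(size=n)
    qh, qt = Q_H(x, edges, w), Q_tilde(x)
    print(f"Q_H(x) = {qh:8.2f}   Q_tilde(x) = {qt:8.2f}   rel. err = {abs(qt - qh) / qh:.3f}")
```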

Supplementary reading:

Mon, Jan 23
Sparsifying sums of norms

  • Norms on \(\R^n\). Recall that a norm \(N\) on \(\R^n\) is a nonnegative mapping \(N : \R^n \to \R_+\) such that

    1. \(N(\lambda x) = \abs{\lambda} N(x)\) for all \(x \in \R^n\) and \(\lambda \in \R\).
    2. \(N(x+y) \leq N(x) + N(y)\) for all \(x,y \in \R^n\).
    3. \(N(x) = 0 \iff x = 0\) for all \(x \in \R^n\).

    \(N\) is a semi-norm if it only satisfies properties (1) and (2). We will consider semi-norms throughout the lecture and say “norm” even when considering semi-norms.

  • An upper bound. Suppose that \(N_1,N_2,\ldots,N_m\) are norms on \(\R^n\) and \(T \subseteq B_2^n\), where \(B_2^n = \{ x \in \R^n : \|x\|_2 \leq 1 \}\) is the Euclidean unit ball.

    Define \(\kappa \seteq \E \max_{k=1,\ldots,m} N_k(g)\), where \(g\) is a standard \(n\)-dimensional Gaussian.

    Our goal is to prove the following: If \(\e_1,\e_2,\ldots,\e_m\) are i.i.d. random signs, then

    \[\begin{equation}\label{eq:goal11} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \kappa \log(n) \max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2}\,. \end{equation}\]
  • Example: Operator norm for sums of random matrices. Another natural situation where a bound like this is useful is for sums of random matrices. Suppose that

    \[A = \sum_{k=1}^m \e_k A_k^T A_k\,,\]

    where each \(A_k^T A_k\) is a PSD matrix. Then the operator norm of \(A\) can be written as

    \[\|A\|_{\mathrm{op}} = \max_{\|x\|_2 \leq 1} \langle x, A x\rangle = \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \langle x, A_k^T A_k x\rangle = \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \|A_k x\|_2^2\,,\]

    which is the special case of our problem where \(N_k(x) = \|A_k x\|_2\) and \(T = B_2^n\).

  • Applying Dudley’s entropy inequality. Recall that \(\left\{ \sum_{k=1}^m \e_k N_k(x)^2 : x \in T \right\}\) is a subgaussian process when equipped with the distance

    \[d(x,y) \seteq \left(\sum_{k=1}^m \left(N_k(x)^2 - N_k(y)^2\right)^2\right)^{1/2}.\]

    Therefore Dudley’s entropy bound shows that

    \[\begin{equation}\label{eq:dudley3} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,d)\,. \end{equation}\]
  • Controlling the subgaussian distance via a norm. Notice that both sides of the inequality \eqref{eq:goal11} are quadratic in the norms \(N_k\). Therefore by rescaling we may assume that

    \[\begin{equation}\label{eq:L2N} \max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2} = 1\,. \end{equation}\]

    Define the norm

    \[\|x\|_N \seteq \max_{k=1,\ldots,m} N_k(x)\,.\]

    Note that using \(a^2-b^2 = (a-b)(a+b)\), for any \(x,y \in T\), we have

    \[\begin{eqnarray*} d(x,y) &=& \left(\sum_{k=1}^m (N_k(x)-N_k(y))^2 (N_k(x)+N_k(y))^2 \right)^{1/2} \\ &\leq& \|x-y\|_N \left(\sum_{k=1}^m (N_k(x)+N_k(y))^2\right)^{1/2} \\ &\leq& 2 \|x-y\|_N\,, \end{eqnarray*}\]

    where the first inequality uses \(|N_k(x)-N_k(y)| \leq N_k(x-y)\) (true for any norm) and the second inequality uses \eqref{eq:L2N} and \(x,y \in T\).

  • Splitting the Dudley sum. Therefore we have \(e_h(T,d) \leq 2 e_h(T, \|\cdot\|_N)\) and combining this with Dudley’s bound \eqref{eq:dudley3} gives

    \[\begin{equation}\label{eq:dudley2} \E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,\|\cdot\|_N)\,. \end{equation}\]

    We will split this sum into two parts depending on whether \(h \leq 4 \log (n)\) or \(h > 4 \log (n)\).

  • Controlling the large \(h\) terms with a volume bound. Note that for \(x \in T\), by \eqref{eq:L2N} we have

    \[\|x\|_N \leq \sqrt{\sum_{k=1}^m N_k(x)^2} \leq 1\,.\]

    In other words, \(T \subseteq B_N \seteq \{ x \in \R^n : \|x\|_N \leq 1\}\).

    Therefore \(e_h(T,\|\cdot\|_N) \leq e_h(B_N,\|\cdot\|_N)\).

    Claim: It holds that \(e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}\).

    This proof works for any norm on \(\R^n\).

    Assume that \(x_1,\ldots,x_{s} \in B_N\) are a maximal set of points satisfying \(\|x_i-x_j\|_N \geq \delta\) for all \(i \neq j\). In that case, by maximality we have

    \[B_N \subseteq \bigcup_{j=1}^s (x_j + \delta B_N).\]

    In other words, \(B_N\) is covered by \(s\) balls of radius \(\delta\).

    On the other hand, the balls \(x_j + \frac{\delta}{2} B_N\) are pairwise disjoint, and (for \(\delta \leq 2\)) they are all contained in \(2 B_N\), hence

    \[\mathrm{vol}_n(2 B_N) \geq s \,\mathrm{vol}_n(\tfrac{\delta}{2} B_N) = s \,(\delta/4)^n\, \mathrm{vol}_n(2 B_N),\]

    where \(\mathrm{vol}_n\) is the standard \(n\)-dimensional volume on \(\R^n\). It follows that \(s \leq (4/\delta)^{n}\), i.e., \(\delta \leq 4 s^{-1/n}\). Taking \(s = 2^{2^h}\) yields

    \[e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}.\]
  • Recalling \eqref{eq:dudley2}, we can use the preceding claim to evaluate

    \[\sum_{h > 4 \log n} 2^{h/2} e_h(T, \|\cdot\|_N) \leq 4 \sum_{h > 4 \log n} 2^{h/2} 2^{-2^h/n} \leq O(1)\,.\]

    Thus we have

    \[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim O(1) + \sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N)\,.\]
  • Controlling the interesting \(h\) terms. To bound the second sum, let us use \(T \subseteq B_2^n\) to write \(e_h(T,\|\cdot\|_N) \leq e_h(B_2^n,\|\cdot\|_N)\).

    (Dual) Sudakov Lemma: For any norm \(\|\cdot\|\) on \(\R^n\) and \(h \geq 0\), it holds that

    \[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2} \E \|g\|\,.\]

    Note that \(\E\, \|g\|_N = \E \max_k N_k(g) = \kappa\) (by definition), and therefore the second sum above is bounded by

    \[\sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N) \lesssim \kappa \log n\,.\]

    This completes the proof of \eqref{eq:goal11} once we have established the Dual Sudakov Lemma.

  • Gaussian Shift Lemma. We need the following

    Lemma: Suppose that \(K\) is a symmetric convex set and \(\gamma_n\) is the \(n\)-dimensional Gaussian measure on \(\R^n\). Then for any \(x \in \R^n\),

    \[\gamma_n(K+x) \geq e^{-\|x\|_2^2/2} \gamma_n(K)\,.\]

    Proof: Write

    \[\begin{eqnarray*} \gamma_n(K+x) &=& (2\pi)^{-n/2} \int_K e^{-\|x+z\|_2^2/2}\,dz \\ &=& (2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz\,, \end{eqnarray*}\]

    where the second equality follows because \(K\) is symmetric. Now note that \(\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2 = \|x\|_2^2 + \|z\|_2^2\) and use convexity of \(y \mapsto e^y\) (Jensen's inequality) to write

    \[\begin{eqnarray*} (2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz &\geq & (2\pi)^{-n/2} \int_K e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x+z\|_2^2/2}\,dz \\ &=& (2\pi)^{-n/2} \int_K e^{-(\|x\|_2^2+\|z\|_2^2)/2}\,dz = e^{-\|x\|_2^2/2} \gamma_n(K)\,. \end{eqnarray*}\]

    (A quick Monte Carlo sanity check of this inequality appears below.)
  • Proof of Dual Sudakov. Let \(\mathcal{B} \seteq \{ x \in \R^n : \|x\| \leq 1 \}\) denote the unit ball of the \(\|\cdot\|\) norm and suppose \(x_1,\ldots,x_s \in B_2^n\) is a maximal set of points such that the shifted balls \(x_j + \delta \mathcal{B}\) are all pairwise disjoint. In that case, we have

    \[\begin{equation}\label{eq:Bcover} B_2^n \subseteq \bigcup_{j=1}^s (x_j + 2 \delta \mathcal{B})\,, \end{equation}\]

    i.e., we have covered \(B_2^n\) by \(s\) balls (in \(\|\cdot\|\)) of radius at most \(2\delta\).

    Let \(\lambda > 0\) be a parameter we will choose later and note that the sets \(\lambda(x_j+\delta \mathcal{B})\) are also pairwise disjoint, and therefore

    \[\begin{eqnarray*} 1 &\geq& \gamma_n\left(\bigcup_{j=1}^s \lambda (x_j + \delta \mathcal{B}) \right) \\ &=& \sum_{j=1}^s \gamma_n\left(\lambda (x_j + \delta \mathcal{B}) \right) \\ &\geq& \sum_{j=1}^s e^{-\lambda^2 \|x_j\|_2^2/2} \gamma_n(\lambda \delta \mathcal{B}) \geq s \cdot e^{-\lambda^2/2} \gamma_n(\lambda \delta \mathcal{B})\,, \end{eqnarray*}\]

    where the first equality uses disjointness, the second inequality uses the Gaussian Shift Lemma, and the final inequality uses \(x_1,\ldots,x_s \in B_2^n\).

    If we define \(\lambda \seteq \frac{2}{\delta} \E\, \|g\|\), then

    \[\gamma_n(\lambda \delta \mathcal{B}) = \P\left(g \in \lambda \delta \mathcal{B}\right) = \P\left(\|g\| \leq \lambda \delta\right) = \P\left(\|g\| \leq 2 \E\, \|g\|\right) \geq \frac12\,,\]

    where the last step is simply Markov’s inequality.

    Combining this with our previous calculation yields

    \[1 \geq s \cdot \exp\left(-2 (\E\, \|g\|)^2/\delta^2\right) \cdot \frac12\,,\]

    i.e.

    \[\sqrt{\log (s/2)} \leq \frac{2}{\delta} \E\, \|g\|\,.\]

    Taking \(s=2^{2^h}\) and recalling \eqref{eq:Bcover} shows that we can cover \(B_2^n\) by at most \(s\) \(\|\cdot\|\)-balls of radius \(2 \delta\) with

    \[\delta \leq \frac{2\, \E\, \|g\|}{\sqrt{\log (s/2)}} \lesssim 2^{-h/2} \E\,\|g\|\,.\]

    By definition, this yields our desired bound

    \[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2} \E\,\|g\|\,.\]
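
Here is the promised quick Monte Carlo sanity check of the Gaussian Shift Lemma. The symmetric convex set \(K\) (a cube), the dimension, and the shift \(x\) below are arbitrary choices for illustration; this is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check of the Gaussian Shift Lemma:
#   gamma_n(K + x) >= exp(-||x||^2 / 2) * gamma_n(K)
# for a symmetric convex set K.  Here K = [-1, 1]^n (an arbitrary choice).
n, samples = 5, 200000
g = rng.normal(size=(samples, n))
x = 0.5 * rng.normal(size=n)                      # an arbitrary shift vector

in_K = np.all(np.abs(g) <= 1.0, axis=1)
in_K_plus_x = np.all(np.abs(g - x) <= 1.0, axis=1)   # g in K + x  <=>  g - x in K

lhs = in_K_plus_x.mean()                          # estimate of gamma_n(K + x)
rhs = np.exp(-np.dot(x, x) / 2) * in_K.mean()     # exp(-||x||^2/2) * gamma_n(K)
print(f"gamma_n(K+x) ≈ {lhs:.4f}   >=   exp(-|x|^2/2) gamma_n(K) ≈ {rhs:.4f}")
```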

Supplementary reading:

The KLS conjecture and stochastic localization
Wed, Jan 25
A Cheeger inequality for isotropic convex sets

  • We studied the \(\ell_1\) norm sparsification problem:

    Given norms \(N_1,\ldots,N_m\) on \(\R^n\), define \(N(x) \seteq N_1(x) + \cdots + N_m(x)\). Then we seek a weight vector \(c \in \R_+^m\) with \(\# \{ i : c_i > 0 \}\) small such that

    \[\max_{N(x) \leq 1} \left|N(x) - \sum_{j=1}^m c_j N_j(x)\right| \leq \e\,.\]
  • The role of concentration was essential: If \(\mathbf{Z}\) is a log-concave random vector on \(\R^n\) and \(\|\cdot\|\) is any norm, then

    \[\Pr\left(\|\mathbf{Z}\| > t \E[\|\mathbf{Z}\|]\right) \leq e^{-c t \psi_n}\,,\]

    where \(c > 0\) is a universal constant and \(\psi_n\) is the KLS constant.

  • This sort of concentration is essentially equivalent to a uniform lower bound on the Cheeger constant of isotropic log-concave measures on \(\R^n\). (Next lecture.)
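
As a toy illustration of the concentration statement above, the sketch below takes \(\mathbf{Z}\) to be a standard Gaussian vector (which is log-concave, and for which the relevant isoperimetric constant is a universal constant) and \(\|\cdot\|\) to be the \(\ell_\infty\) norm, and estimates the tail \(\Pr(\|\mathbf{Z}\| > t\,\E\|\mathbf{Z}\|)\) for a few values of \(t\). The choices of norm and dimension are mine and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of concentration of a norm of a log-concave vector:
# Z is a standard Gaussian in R^n and ||.|| is the sup-norm.
n, samples = 100, 200000
Z = rng.normal(size=(samples, n))
norms = np.abs(Z).max(axis=1)          # ||Z||_infty
mean_norm = norms.mean()

for t in (1.0, 1.2, 1.5, 2.0):
    p = np.mean(norms > t * mean_norm)
    print(f"t = {t:.1f}   P(||Z|| > t E||Z||) ≈ {p:.6f}")
```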

Reading:

Videos:

Mon, Jan 30
The KLS conjecture, slicing, thin shells, and concentration

Localization

  • Consider a log-concave density \(F : \R^n \to \R_+\) with corresponding measure \(\mu\), and suppose that \(D \seteq \mathrm{diam}(\mathrm{supp}(F)) < \infty\) (where \(\mathrm{supp}(F) \seteq \{ x \in \R^n : F(x) > 0 \}\)).

    Let \(S_1,S_2,S_3\) be three measurable sets partitioning \(\R^n\) and denote \(d(S_1,S_2) = \min \{ \|x-y\|_2 : x \in S_1, y \in S_2 \}\).

    KK Lemma (Karzanov and Khachiyan): It holds that \(\mu(S_3) \geq \frac{d(S_1,S_2)}{D} \min \{ \mu(S_1), \mu(S_2) \}.\)

    Our goal is to reduce this to the \(1\)-dimensional case. Define \(C \seteq \frac{d(S_1,S_2)}{D}\). The idea is to define two functions $g_1,g_2 : \R^n \to \R$ by

    \[g_1(x) \seteq \begin{cases} C F(x) & x \in S_1 \\ 0 & x \in S_2 \\ -F(x) & x \in S_3 \end{cases} \qquad g_2(x) \seteq \begin{cases} C F(x) & x \in S_2 \\ 0 & x \in S_1 \\ -F(x) & x \in S_3\,. \end{cases}\]

    Suppose, for the sake of contradiction, that \(C \mu(S_1) > \mu(S_3)\) and \(C \mu(S_2) > \mu(S_3)\). This gives \(\int g_1(x)\,dx = C \mu(S_1) - \mu(S_3) > 0\) and \(\int g_2(x)\,dx = C \mu(S_2) - \mu(S_3) > 0\).

    Now we are going to define a decreasing sequence \(K_0 \supseteq K_1 \supseteq K_2 \supseteq \cdots\) of convex bodies so that \(K \seteq \bigcap_{i=1}^{\infty} K_i\) is either a point or a segment and such that \(\int_{K_i} g_1(x)\,dx > 0\) and \(\int_{K_i} g_2(x)\,dx > 0\) for each \(i=1,2,\ldots\).

  • For \(K_0\), simply choose a ball large enough to contain \(\mathrm{supp}(F)\).

    Consider any halfspace \(H\) such that

    \[\int_{K_i \cap H} g_1(x)\,dx = \frac12 \int_{K_i} g_1(x)\,dx\,.\]

    Call such a halfspace \(H\) bisecting.

    Since \(\int_{K_i} g_2(x)\,dx > 0\), and since the complement of a bisecting halfspace is also bisecting, we may replace \(H\) by \(\R^n \setminus H\) if necessary to ensure that

    \[\int_{K_i \cap H} g_2(x)\,dx > 0\,.\]

    We then take \(K_{i+1} \seteq K_i \cap H\), so that both \(\int_{K_{i+1}} g_1(x)\,dx = \frac12 \int_{K_i} g_1(x)\,dx > 0\) and \(\int_{K_{i+1}} g_2(x)\,dx > 0\). Let’s see how to find a bisecting halfspace \(H\).

  • Claim: For any \((n-2)\)-dimensional affine subspace \(A\), there is at least one bisecting halfspace whose bounding hyperplane contains \(A\).

    Note that the boundary of a halfspace \(H\) is an \((n-1)\)-dimensional hyperplane. So consider a halfspace \(H\) that contains \(A\) in its boundary (think of a plane containing a line). We can rotate \(H\) about \(A\) until it becomes its complementary halfspace \(H'\).

    If \(\int_{K_i \cap H} g_1 - \int_{K_i \setminus H} g_1 = 0\) initially, we are done. If not, then this difference changes sign as we rotate from \(H\) to \(H'\) (its value at \(H'\) is the negative of its value at \(H\)). By continuity, we find a bisecting halfspace whose bounding hyperplane contains \(A\).

  • Let \(A_0,A_1,\ldots,\) be a sequence of all \((n-2)\)-dimensional affine spaces of the form \(v_0 + \mathrm{span}(v_1,\ldots,v_{n-2})\) where the vectors \(\{v_j\}\) have rational coordinates (there is a countable number of such subspaces). Let \(K_{i+1} \seteq K_i \cap H_i\) where \(H_i\) is a bisecting halfspace whose boundary contains \(A_i\), chosen (as above) so that the integral of \(g_2\) also remains positive.

    Claim: The intersection \(K \seteq \bigcap_{i\geq 1} K_i\) is a point or an interval.

    Assume to the contrary that \(K\) is at least \(2\)-dimensional. Then its projection onto some pair of coordinate axes is still \(2\)-dimensional; without loss of generality, the first two axes. Then there are rational numbers \(r_1,r_2\) and a point \((r_1,r_2,x_3, \ldots,x_n)\) in the interior of \(K\). But the \((n-2)\)-dimensional affine subspace \(\{ x \in \R^n : x_1 = r_1,\ x_2 = r_2 \}\) occurs as some \(A_i\).

    Recall that the halfspace \(H_i\) has its boundary containing \(A_i\). Since \((r_1,r_2,x_3,\cdots,x_n)\) is in the interior of \(K\), it is also in the interior of \(K_{i+1}\), so the bounding hyperplane of \(H_i\) passes through an interior point of \(K_{i+1}\), i.e., it properly cuts \(K_{i+1}\). But this is a contradiction, as \(H_i\) should fully contain \(K_{i+1} = K_i \cap H_i\).

  • We conclude that \(\int_K g_1 \geq 0, \int_K g_2 \geq 0\) on the intersection \(K\).

    Note that we can easily get \(\int_K g_1 > 0, \int_K g_2 > 0\) for the original functions by starting with functions \(\hat{g}_1 = g_1 - \delta \mathbf{1}_{K_0}\) and \(\hat{g}_2 = g_2 - \delta \mathbf{1}_{K_0}\) for \(\delta > 0\) sufficiently small and then finding \(K\) where \(\int_K \hat{g}_1 \geq 0, \int_K \hat{g}_2 \geq 0\).

    Since the restriction of a log-concave function to an interval is still log-concave, this is indeed a reduction of the KK Lemma to the \(1\)-dimensional case.
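
Since the argument above reduces the KK Lemma to dimension one, here is a small numerical check of the one-dimensional statement for one particular log-concave density (proportional to \(e^{-x}\) on \([0,D]\), an arbitrary choice) and partitions of \([0,D]\) into three intervals. This is only an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional check of the KK Lemma for the log-concave density
# F(x) ∝ exp(-x) on [0, D]:  mu(S3) >= (d(S1,S2)/D) * min(mu(S1), mu(S2)),
# with S1 = [0, a), S3 = [a, b), S2 = [b, D] and d(S1, S2) = b - a.
D = 5.0
Z = 1.0 - np.exp(-D)                      # normalization constant

def cdf(x):
    return (1.0 - np.exp(-x)) / Z

worst = np.inf
for _ in range(10000):
    a, b = np.sort(rng.uniform(0, D, size=2))
    mu1, mu3 = cdf(a), cdf(b) - cdf(a)
    mu2 = 1.0 - cdf(b)
    rhs = ((b - a) / D) * min(mu1, mu2)
    if rhs > 0:
        worst = min(worst, mu3 / rhs)

print(f"min ratio mu(S3) / [(d/D) min(mu(S1), mu(S2))] = {worst:.3f}  (the lemma says >= 1)")
```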

Finding a better needle

For other applications, it helps to replace \(g_1\) and \(g_2\) by nicer functions at every step. One can consult Lovasz-Simonovits Lemma 2.5 for a (rather technical) argument proving the following.

  • Localization Lemma: Given \(\int_{\R^n} g_1 > 0\) and \(\int_{\R^n} g_2 > 0\), there are two points \(a,b \in \R^n\) and a linear function \(\ell : [0,1] \to \R_+\) such that

    \[\begin{align*} \int_0^1 \ell(t)^{n-1} g_1((1-t) a + tb)\,dt &> 0 \\ \int_0^1 \ell(t)^{n-1} g_2((1-t) a + tb)\,dt &> 0 \end{align*}\]
  • One can instead derive the conclusion that there is a number \(\gamma \in \R\) such that

    \[\begin{align*} \int_0^1 e^{\gamma t} g_1((1-t) a + tb)\,dt &> 0 \\ \int_0^1 e^{\gamma t} g_2((1-t) a + tb)\,dt &> 0\,, \end{align*}\]

    which I find a bit more natural.

Reading:

Videos:

Wed, Feb 01
Stochastic localization

  • Recall the KLS constant of a probability measure \(\mu\) on \(\R^n\):

    \[\psi_{\mu} \seteq \inf_{S \subseteq \R^n : \mu(S) \leq 1/2} \frac{\mu(\partial S)}{\mu(S)}\,,\]

    where the infimum is over measurable subsets and

    \[\mu(\partial S) \seteq \lim_{\e \to 0+} \frac{\mu(S_{\e} \setminus S)}{\e}\,,\]

    where \(S_{\e} \seteq \{ x \in \R^n : d(x,S) \leq \e \}\) is the \(\e\)-neighborhood of \(S\).

  • The KLS conjecture asserts that when \(\mu\) is a log-concave distribution, it should hold that the infimum can be restricted to halfspaces:

    \[\psi_{\mu} \gtrsim \inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)}\,,\]

    where a halfspace is defined by \(H = \{ x \in \R^n : \langle x,v \rangle \leq c \}\) for some \(v \in \R^n\) and \(c \in \R\).

  • Suppose that \(\mu\) is a measure on \(\R^n\). The covariance matrix of \(\mu\) is defined as

    \[A \seteq \E[(X-\E[X])(X-\E[X])^{\top}]\,,\]

    where \(X\) is a random variable with law \(\mu\).

    This describes the variance of \(\mu\) along every projection: If \(v \in \R^n\), then

    \[\E[\langle v, X - \E[X]\rangle^2] = \langle v, A v\rangle\,.\]
  • Lemma: If \(\mu\) is log-concave, then

    \[\inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)} \asymp \frac{1}{\sqrt{\|A\|_{op}}}\,,\]

    where \(\|A\|_{op}\) denotes the maximum eigenvalue of \(A\).

    This lemma is not too difficult to prove given that if \(\mu\) is log-concave and \(X\) has law \(\mu\), then \(\langle v,X\rangle\) is also log-concave for any \(v \in \R^n\).

  • For distributions that are more log-concave than a Gaussian, one can prove that an analogous bound holds.

    Gaussian Factor Lemma: Suppose that the density of \(\mu\) is proportional to \(f(x) e^{-\langle x, B x\rangle}\) for some PSD matrix \(B\), and a log-concave function \(f : \R^n \to \R_+\). Then,

    \[\psi_{\mu} \gtrsim \frac{1}{\sqrt{\|B^{-1}\|_{op}}}\,.\]

    Since the product of log-concave functions is log-concave, the measure \(\mu\) is log-concave. This lemma can be proved using localization (the approach we sketched last lecture) to reduce it to the \(1\)-dimensional case. Note that the covariance matrix of the Gaussian with density proportional to \(e^{- \langle x,Bx\rangle}\) is a constant multiple of \(B^{-1}\):

    \[\int e^{-\langle x,Bx\rangle} xx^{\top}\,dx = c_B \int e^{-\|x\|^2} (B^{-1/2} x) (B^{-1/2} x)^{\top}\,dx = c_B B^{-1/2} \left(\int e^{-\|x\|^2} xx^{\top} \,dx\right) B^{-1/2}\,,\]

    where \(c_B\) is some constant resulting from the substitution \(x \mapsto B^{-1/2} x\). The quantity in parentheses is a multiple of the identity.

    Therefore if a measure is log-concave with respect to a Gaussian distribution, its KLS constant is at least (up to a universal constant factor) as good as that for the Gaussian.

  • Brownian motion: An \(n\)-dimensional Brownian motion is a stochastic process \((B_t : t \geq 0)\) such that \(B_s = \int_0^s dB_t\) and \(dB_t\) can be thought of as a random variable independent of \((B_s : 0 \leq s \leq t)\) with law \(N(0, dt \cdot I)\) (the law of a spherical \(n\)-dimensional Gaussian with covariance \(dt \cdot I\)).

  • Stochastic localization: Suppose that a log-concave measure \(\mu\) has density \(f : \R^n \to \R_+\). We will define a stochastic process on densities: Take \(f_0(x)=f(x)\) and

    \[d f_t(x) = f_t(x) \langle x-a_t, dB_t\rangle\,,\]

    where \((B_t : t \geq 0)\) is an \(n\)-dimensional Brownian motion, and

    \[a_t \seteq \int_{\R^n} x f_t(x)\,dx\]

    is the center of mass of the \(f_t\).

    Note that if \(f_0(x)=f(x)\) is a density, then so is \(f_t\) since

    \[\int_{\R^n} d f_t(x)\,dx = \left\langle \int_{\R^n} f_t(x) (x-a_t)\,dx,\ dB_t\right\rangle = 0\,.\]

    Let \(\mu_t\) be the corresponding measure. Note that \(f_t(x)\) is a martingale: \(\E[d f_t(x)] = 0\).

    Martingale property: \(\E[\mu_t(S)] = \mu(S)\) for all \(S \subseteq \R^n\).

  • For a sufficiently nice function \(f : \R \to \R\), we can approximate the change in \(f\) by its Taylor expansion

    \[df(x) = f(x+dx) - f(x) = f'(x) \,dx + \frac12 f''(x) \,dx^2 + \cdots\,.\]

    Dividing both sides by \(dx\) and taking \(dx \to 0\), all terms except the first-order one vanish.

    But now suppose $B_t$ is \(1\)-dimensional Brownian motion and we compute

    \[d f(B_t) = f(B_{t+dt}) - f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dB_t^2 + \cdots\,.\]

    Then \(dB_t\) has law \(N(0,dt)\), hence \(\E[dB_t^2]=dt\). In other words, \((B_t)\) has non-trivial quadratic variation. That means we cannot neglect the 2nd order terms. Ito’s lemma tells us that we need only go to second order so that

    \[d f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dt\,.\]

    Let’s consider a more general version for stochastic processes in \(n\) dimensions.

  • Stochastic processes and Ito derivatives: Consider a process \((X_t : t \geq 0)\) defined by the stochastic differential equation

    \[dX_t = u_t \,dt + \Sigma_t dB_t\,,\]

    where \(u_t \in \R^m\) and \(\Sigma_t \in \R^{m \times n}\).

    Let \(f : \R^m \to \R\) be a sufficiently nice (e.g., twice continuously differentiable) function. Then Ito’s lemma tells us how to compute the time derivative of \(f(X_t)\):

    \[df(X_t) = \sum_{i=1}^m \partial_i f(X_t) dX_t^i + \frac12 \sum_{i,j} \partial_{ij} f(X_t) d[X_t^i,X_t^j]_t\,,\]

    where \([X_t^i, X_t^j]_t\) is the quadratic variation

    \[[X_t^i, X_t^j]_t = \int_0^t \left(\Sigma_s \Sigma_s^{\top}\right)_{i,j} \,ds\]

    so that

    \[d [X_t^i, X_t^j]_t = \left(\Sigma_t \Sigma_t^{\top}\right)_{i,j}\,dt\,.\]

    Note that \(\Sigma_t \Sigma_t^{\top}\) is the covariance matrix of the random vector \(\Sigma_t dB_t\).

  • We can use this to figure out what \(f_t\) looks like:

    Here for a fixed \(x \in \R^n\), we take \(X_t = f_t(x)\) so that \(\Sigma_t = f_t(x) (x-a_t)^{\top}\). Then using the Ito lemma:

    \[\begin{align*} d \log f_t(x) = d \log (X_t) &= \frac{d X_t}{X_t} - \frac12 \frac{d[X_t,X_t]_t}{X_t^2} \\ &= \frac{d f_t(x)}{f_t(x)} - \frac12 \frac{d[f_t(x),f_t(x)]_t}{f_t(x)^2} \\ &=\langle x-a_t,dB_t\rangle - \frac12 \|x-a_t\|^2\,dt\,, \end{align*}\]

    where we used

    \[d[X_t,X_t]_t = f_t(x)^2\, (x-a_t)^{\top} (x-a_t)\,dt = f_t(x)^2\, \|x-a_t\|^2 \,dt\,.\]

    Let us simplify:

    \[d \log f_t(x) = \langle x, a_t\,dt + dB_t\rangle - \frac12 \|x\|^2\,dt - \left[\langle a_t, dB_t\rangle + \tfrac12 \|a_t\|^2\,dt \right]\]

    Note that the terms in brackets don’t depend on \(x\). Thus integrating the logarithm gives (up to a normalization factor that does not depend on \(x\)):

    \[f_t(x) \propto \exp\left(- \frac{t}{2} \|x\|^2+ \int_0^t \langle x,a_s\,ds + dB_s\rangle \right) f(x)\,.\]

    If we let \(\mu_t\) denote the corresponding measure, then the Gaussian Factor Lemma says that (with probability one):

    \[\psi_{\mu_t} \gtrsim \sqrt{t}\,.\]
  • Now consider \(S \subseteq \R^n\) with \(\mu(S)=1/2\) and suppose there is a time \(t_0\) such that

    \[\begin{equation}\label{eq:good-meas} \P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)] \geq \frac{1}{10}\,. \end{equation}\]

    Use the Martingale Property of \(\mu_t\) to write

    \[\mu(\partial S) = \E[\mu_{t_0}(\partial S)] \geq \frac18\, \E\left[\psi_{\mu_{t_0}}\, \mathbf{1}_{\{\frac14 \mu(S) \leq \mu_{t_0}(S) \leq \frac34 \mu(S)\}}\right] \gtrsim \sqrt{t_0}\; \P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)] \gtrsim \sqrt{t_0}\,.\]
  • Thus if we can find a lower bound on \(t_0\), we have proved a lower bound on \(\psi_{\mu}\).

    In the next lecture, we will show that \eqref{eq:good-meas} holds for some \(t_0 \asymp (\tr(A^2))^{-1/2}\).

    This yields

    \[\psi_{\mu} \gtrsim \frac{1}{\tr(A^2)^{1/4}}\,.\]

    If \(\mu\) is isotropic, this value is \(\gtrsim n^{-1/4}\), implying that \(\psi_n \gtrsim n^{-1/4}\).
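
To get a feel for stochastic localization, here is a minimal discretized sketch in Python. It uses the explicit form of \(f_t\) derived above (a Gaussian tilt of \(f\), with \(\theta_t \seteq \int_0^t (a_s\,ds + dB_s)\) simulated by an Euler step), and it runs the dynamics on a finite point cloud standing in for a density; the cloud, the step size, and the test set \(S\) are arbitrary choices of mine. The sketch checks the martingale property \(\E[\mu_t(S)] = \mu(S)\) and reports how often \(\mu_t(S)\) stays in \([\tfrac14,\tfrac34]\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized stochastic localization on a finite point cloud X, using
#     f_t(x)  ∝  f(x) * exp(<theta_t, x> - (t/2) |x|^2),   d theta_t = a_t dt + dB_t,
# where a_t is the current center of mass of mu_t.
n, N = 2, 2000                       # dimension and number of support points
X = rng.normal(size=(N, n))          # support points; mu_0 is uniform over them
S = X[:, 0] > 0                      # a test set with mu_0(S) ≈ 1/2

def run(T=0.5, dt=1e-3):
    theta, t = np.zeros(n), 0.0
    while t < T:
        logw = X @ theta - 0.5 * t * (X ** 2).sum(axis=1)
        w = np.exp(logw - logw.max())
        mu_t = w / w.sum()                        # current measure on the cloud
        a_t = mu_t @ X                            # center of mass
        theta += a_t * dt + np.sqrt(dt) * rng.normal(size=n)
        t += dt
    return mu_t[S].sum()                          # mu_T(S)

runs = np.array([run() for _ in range(200)])
print(f"mu_0(S) = {S.mean():.3f},  average of mu_T(S) over runs = {runs.mean():.3f}")
print(f"fraction of runs with 1/4 <= mu_T(S) <= 3/4: {np.mean((runs >= 0.25) & (runs <= 0.75)):.2f}")
```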

Supplementary reading:

Videos:

Two talks by Ronen @ IAS:

Mon, Feb 06
Tracking the covariance

Reducing to control of the covariance

  • Recall from last time that \(f : \R^n \to \R_+\) is a log-concave density on \(\R^n\) and we use \(\mu\) to denote the associated probability measure.

  • We evolve \(f\) according to a stochastic differential equation (SDE) given by \(f_0(x)=f(x)\) and

    \[d f_t(x) = f_t(x) \langle x - a_t, dB_t \rangle\,,\]

    where \((B_t : t \geq 0)\) is an \(n\)-dimensional Brownian motion and

    \[a_t = \int_{\R^n} x f_t(x)\,dx\]

    is the center of mass of \(f_t\). Let \(\mu_t\) be the measure corresponding to the density \(f_t\).

  • We used Itô’s Lemma to calculate

    \[\begin{equation}\label{eq:gfactor} f_t(x) \propto \exp\left(-\frac{t}{2} \|x\|^2 + \int_0^t \langle x,a_s \,ds + dB_s\rangle\right) f(x)\,. \end{equation}\]

    The Gaussian factor here is important for us because, according to the Gaussian Factor Lemma from last lecture, this yields \(\psi_{\mu_t} \gtrsim \sqrt{t}\).

  • Note that \(f_t(x)\) is a martingale for every \(x \in \R^n\): If $\mathcal{F}_t$ denotes the filtration generated by \((B_s : 0 \leq s \leq t)\), then

    \[\E[df_t(x) \mid \mathcal{F}_t] = f_t(x) \left\langle x-a_t, \E[dB_t \mid \mathcal{F}_t]\right\rangle = 0.\]

    This means that for every set \(S \subseteq \R^n\), we have \(\E[\mu_t(S)]=\mu(S)\).

  • So now suppose we start with \(S \subseteq \R^n\) with \(\mu(S) = 1/2\). We want to prove a lower bound on \(\mu(\partial S)\) as follows: At any time \(t \geq 0\),

    \[\mu(\partial S) = \E[\mu_t(\partial S)] \gtrsim \E\left[\psi_{\mu_t}\, \mathbf{1}_{\{\frac14 \leq \mu_t(S) \leq \frac34\}}\right] \gtrsim \sqrt{t} \cdot \P[\tfrac14 \leq \mu_t(S) \leq \tfrac34]\,.\]

    So now we are left to solve the following problem: For how long does \(\mu_t(S)\) stay bounded between \(1/4\) and \(3/4\)?

  • So let’s analyze how \(\mu_t(S)\) evolves using Itô’s lemma. Define

    \[m_t \seteq \mu_t(S) = \int_S f_t(x) \,dx\,.\]

    Using the SDE for $f_t$, we have

    \[d m_t = \int_S df_t(x)\,dx = \left\langle \int_S (x-a_t) f_t(x)\,dx,\ dB_t\right\rangle.\]
  • Dubins-Schwartz: While we don’t need this technically, it’s useful to keep in mind. Roughly speaking, all reasonable martingales are a reparameterized Brownian motion.

    Given a \(1\)-dimensional stochastic process \(dm_t = \Sigma_t\, dB_t\) with no drift term (so \(\Sigma_t \in \R^{1 \times n}\) and \(m_t\) is a continuous local martingale), recall the quadratic variation

    \[[m_t]_t = \int_0^t \Sigma_s \Sigma_s^{\top} \,ds\,.\]

    Define the stopping time \(\tau_s \seteq \inf \{ t : [m_t]_t > s \}\), i.e., the first time at which we have collected quadratic variation \(s\). Then \(m_{\tau_s}\) and \(B_s\) (with \(B_0 = m_0\)) have the same law.

  • Controlling \(\mu_t(S)\) by controlling the operator norm of the covariance. This means that in order to control \(\P(\tfrac14 \leq m_t \leq \frac34)\), we just need to control the quadratic variation:

    \[[m_t]_t = \int_0^t \left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 \,ds\,.\]

    Note that

    \[\begin{align*} \left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 &= \max_{\|v\| \leq 1} \left(\int_S \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\ &=\max_{\|v\| \leq 1} \left(\int_{\R^n} \mathbf{1}_S(x) \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\ &\leq \max_{\|v\| \leq 1} \left(\int_{\R^n} \langle v, x-a_s \rangle^2 f_s(x)\,dx\right) \left(\int_{\R^n} \mathbf{1}_S(x)^2 f_s(x)\,dx\right), \end{align*}\]

    where the last line is Cauchy-Schwarz. The second integral is precisely \(\mu_s(S) \leq 1\), and the first integral is \(\langle v, A_s v\rangle\), where \(A_s\) is the covariance matrix of \(\mu_s\). So we have

    \[[m_t]_t \leq \int_0^t \max_{\|v\| \leq 1} \langle v, A_s v\rangle\,ds = \int_0^t \|A_s\|_{op}\,ds\]
  • So we need to see how large \(t\) can be so that \(\int_0^t \|A_s\|_{op}\,ds \leq 0.1\). The operator norm \(\|A_s\|_{op}\) (the maximum eigenvalue of \(A_s\)) is not easy to argue about as time evolves because it’s defined as a maximum over directions.

    Thus we will use a crude upper bound:

    \[\|A_t\|_{op} \leq \tr(A_t^2)^{1/2}\,.\]

    Key Lemma: Let \(A\) be the covariance matrix of \(\mu\). Then there is a constant \(c > 0\) such that, with constant probability, for all \(0 \leq t \leq c\, (\tr(A^2))^{-1/2}\) we have

    \[\tr(A_t^2) \lesssim \tr(A^2)\,.\]

    On this event, the Key Lemma implies that \([m_t]_t \lesssim \int_0^t \tr(A_s^2)^{1/2}\,ds \leq t \cdot \max_{0 \leq s \leq t} \tr(A_s^2)^{1/2} \lesssim t \cdot \tr(A^2)^{1/2}\,.\)

    And therefore for some choice of \(T \gtrsim (\tr(A^2))^{-1/2}\), we have \(\Pr([m_{T}]_T \ll 1) \gtrsim 1\), and therefore \(\Pr(\tfrac14 \leq m_T \leq \tfrac34) \gtrsim 1\).

  • Now \eqref{eq:gfactor} and the Gaussian Factor Lemma together tell us that \(\psi_{\mu} \gtrsim \sqrt{T} \gtrsim \tr(A^2)^{-1/4}\). If \(\mu\) is in isotropic position, then \(A=I\) and we get \(\psi_{\mu} \gtrsim n^{-1/4}\).

Controlling the covariance

  • We first need to compute the Itô derivative of the covariance

    \[A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} f_t(x)\,dx\,.\]
  • For this it helps to extend Itô’s lemma notationally to handle a pair of stochastic processes: Suppose that \(dX_t = u_t \,dt + M_t\,dB_t\) and \(dY_t = v_t\,dt + N_t\,dB_t\), then

    \[\begin{align*} d f(X_t, Y_t) &= \sum_i \partial_{x^i} f(X_t,Y_t)\, dX_t^i + \sum_j \partial_{y^j} f(X_t,Y_t)\, dY_t^j \\ &+ \frac12 \sum_{i,i'} \partial_{x^i,x^{i'}} f(X_t,Y_t)\, d[X_t^i,X_t^{i'}]_t + \frac12 \sum_{j,j'} \partial_{y^j,y^{j'}} f(X_t,Y_t)\, d[Y_t^{j},Y_t^{j'}]_t + \sum_{i,j} \partial_{x^i,y^j} f(X_t,Y_t)\, d[X_t^i, Y_t^j]_t\,, \end{align*}\]

    where

    \[[X_t,Y_t]_t = \int_0^t M_s N_s^{\top}\,ds\,.\]

    Note that this is the same formula as before (by lifting to a single process \(Z_t = X_t \oplus Y_t\)), but it gives guidance to our next calculation.

  • Write:

    \[d A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} d f_t(x)\,dx - \int_{\R^n} \left((d a_t)(x-a_t)^{\top}\right) f_t(x)\,dx - \int_{\R^n} \left((x - a_t)(d a_t)^{\top}\right) f_t(x)\,dx + \cdots\,,\]

    where \(\cdots\) represents the second order derivatives.

    The first term is equal to \(\int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx.\)

    The second two terms both evaluate to zero since \(\int (x-a_t) f_t(x)\,dx = 0\) by definition of \(a_t\).

    Now let’s calculate the second-order derivatives:

    \[\frac12 \cdot 2 \cdot d[a_t,a_t]_t \int f_t(x) \,dx - \frac{1}{2} \cdot 2 \cdot \int (x-a_t) \left(d[a_t,f_t(x)]_t\right)^{\top}\,dx - \frac{1}{2} \cdot 2 \cdot \int d[a_t,f_t(x)]_t\, (x-a_t)^{\top}\,dx\]
  • Note that \(d a_t = \int_{\R^n} x\, d f_t(x)\,dx = \left(\int_{\R^n} x (x-a_t)^{\top} f_t(x)\,dx\right) dB_t = A_t\, dB_t\), which gives \(d[a_t,a_t]_t = A_t^2\,dt\).

  • We also have

    \[d[a_t, f_t(x)]_t = A_t (x-a_t) f_t(x) \,dt\,,\]

    hence the second order terms evaluate to

    \[A_t^2\,dt - 2 A_t \left(\int f_t(x) (x-a_t) (x-a_t)^{\top}\,dx\right) dt = - A_t^2\,dt\,.\]

    Altogether this gives

    \[dA_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx -A_t^2\,dt\]
  • Now we wish to analyze the evolution of \(\Phi_t = \tr(A_t^2)\), and another application of Itô’s lemma gives

    \[\begin{align*} d \Phi_t = -2 \tr(A_t^3)\,dt &+ \left(\int \int \left((x-a_t)^{\top} (y-a_t)\right)^3 f_t(x) f_t(y)\,dx\,dy\right) dt \\ &+ 2 \int (x-a_t)^{\top} A_t (x-a_t) \langle x-a_t,dB_t\rangle f_t(x)\,dx \end{align*}\]

    We want to ensure that \(\Phi_t\) stays small for \(t \in [0,T]\), so we can ignore the term \(-2 \tr(A_t^3)\,dt\) (since \(A_t\) is PSD, this term is nonpositive).

    If \(\hat{\mathbf{X}},\hat{\mathbf{Y}}\) are independent random variables with law \(\mu_t\) and \(\mathbf{X} \seteq \hat{\mathbf{X}} - \E[\hat{\mathbf{X}}]\), \(\mathbf{Y} \seteq \hat{\mathbf{Y}} - \E[\hat{\mathbf{Y}}]\), then dropping that term gives

    \[d \Phi_t \leq \E \left[\langle \mathbf{X}, \mathbf{Y}\rangle^3\right] dt + 2\,\E \left[\langle \mathbf{X}, A_t \mathbf{X}\rangle \mathbf{X}^{\top} \right] dB_t\,,\]

    where the expectations are over \(\mathbf{X},\mathbf{Y}\) only (i.e., conditionally on the process up to time \(t\)).
  • Reverse Holder inequality: For any log-concave random vector \(\mathbf{X}\) in \(\R^n\) and \(k \geq 0\),

    \[(\E \|\mathbf{X}\|^k)^{1/k} \leq 2k\ \left(\E \|\mathbf{X}\|^2\right)^{1/2}\,.\]
  • Using this inequality one obtains

    \[d \Phi_t \lesssim \Phi_t^{3/2}\,dt + \Phi_t^{5/4} dB_t\,.\]
  • To analyze this, let us assume that we run the process until the first time \(\tau_0\) such that \(\Phi_{\tau_0} = (1+c) \tr(A^2)\). Then for \(t \in [0,\tau_0]\),

    \[d \Phi_t \lesssim \tr(A^2)^{3/2}\,dt + \tr(A^2)^{5/4} dB_t\,,\]

    and therefore

    \[c\, \tr(A^2) = \Phi_{\tau_0} - \Phi_{0} = \int_{0}^{\tau_0} d\Phi_t \lesssim \tau_0 \tr(A^2)^{3/2} + W_{\tau}\,,\]

    where \(\tau \leq \tau_0 \tr(A^2)^{5/2}\), and \(W_{\tau}\) has the law of a Brownian motion (recall Dubins-Schwartz). Thus with high probability, this upper bound is

    \[\lesssim \tau_0\,\tr(A^2)^{3/2} + \tau_0^{1/2}\,\tr(A^2)^{5/4}\,.\]

    We therefore conclude that, with constant probability, \(\tau_0 \gtrsim \tr(A^2)^{-1/2}\), as desired.
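
The final step can be illustrated with a toy simulation of the scalar relation \(d\Phi_t \lesssim \Phi_t^{3/2}\,dt + \Phi_t^{5/4}\,dB_t\). The sketch below treats it as an SDE with equality and unit constants (both choices are mine, purely for illustration) and estimates the hitting time \(\tau_0\) of the level \((1+c)\Phi_0\), which should scale like \(\Phi_0^{-1/2}\) by the scale invariance of the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation of  d Phi = Phi^{3/2} dt + Phi^{5/4} dB  (all constants set to 1),
# measuring the hitting time tau_0 of the level (1 + c) * Phi_0.
def hitting_time(phi0, c=0.5, t_cap_factor=20.0):
    dt = 1e-3 * phi0 ** -0.5           # step size scaled with the natural time scale
    phi, t, level = phi0, 0.0, (1 + c) * phi0
    while phi < level and t < t_cap_factor * phi0 ** -0.5:
        phi += phi ** 1.5 * dt + phi ** 1.25 * np.sqrt(dt) * rng.normal()
        phi = max(phi, 1e-12)          # keep the state positive after a large step
        t += dt
    return t

for phi0 in (1.0, 4.0, 16.0, 64.0):
    taus = [hitting_time(phi0) for _ in range(50)]
    print(f"Phi_0 = {phi0:5.1f}   median tau_0 ≈ {np.median(taus):.3f}   "
          f"Phi_0^(-1/2) = {phi0 ** -0.5:.3f}")
```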

Supplementary material:

Videos:

Topological methods
Wed, Feb 08
Borsuk-Ulam and the Kneser conjecture

Reading:

Mon, Feb 13
The evasiveness conjecture

Reading:

Wed, Feb 15
Simplicial complexes, fixed points, and symmetry

Reading:

Mon, Feb 20
NO CLASS (President's day)

High-dimensional expanders
Wed, Feb 22
Spectral gaps on simplicial complexes

Spectral expansion in graphs

  • Consider an undirected graph $G=(V,E)$ and a probability distribution \(\mu\) on \(E\). Let \(\mu^{(0)}\) be the probability measure induced on \(V\) as follows: We sample a random edge \(uv \in E\) according to \(\mu\), and then choose one of the two endpoints uniformly at random. Equivalently, \(\mu^{(0)}(u) = \frac12 \sum_{v : uv \in E} \mu(uv)\).

  • Random walk. We can define a random walk \(\{X_t\}\) on \(V\) as follows. For \(v \in V\), let \(\mu_v\) denote the distribution \(\mu\) conditioned on the event that the sampled edge contains \(v\), i.e.,

    \[\mu_v(e) = \frac{\mu(e) \mathsf{1}_e(v)}{\sum_{e' \in E} \mu(e') \mathsf{1}_{e'}(v)}\,.\]

    Now given \(X_t = v\), we sample an edge \(e \in E\) according to \(\mu_v\) and let \(X_{t+1}\) be the other endpoint of \(e\), i.e., the unique vertex in \(e \setminus \{v\}\).

    Define the associated random walk transition matrix by

    \[P_{uv} \seteq \frac{\mu(uv)}{\sum_{w : uw \in E} \mu(uw)} = \frac12 \frac{\mu(uv)}{\mu^{(0)}(u)}\,.\]

    We will think of \(P\) as a linear operator on the space of functions \(f : V \to \R\), where \(P f(u) = \sum_{v \in V} P_{uv} f(v)\).

  • Note that the matrix \(P\) is not necessarily symmetric, i.e., it is not self-adjoint with respect to the standard inner product on $\R^{V}$. For \(f,g : V \to \R\), define the inner product

    \[\langle f,g \rangle_{\mu^{(0)}} \seteq \sum_{u \in V} \mu^{(0)}(u) f(u) g(u)\,.\]

    Let us use \(\ell_2(\mu^{(0)})\) to denote the Hilbert space of functions equipped with \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\). Now let us verify that

    \[\langle f, P g\rangle_{\mu^{(0)}} = \sum_{u \in V} \mu^{(0)}(u) f(u) \sum_{v \in V} P_{uv} g(v) = \frac12 \sum_{u,v \in V} \mu(uv) f(u) g(v) = \sum_{uv \in E} \mu(uv)\, \frac{f(u) g(v) + f(v) g(u)}{2}\,.\]

    Evidently, this is also equal to \(\langle Pf, g\rangle_{\mu^{(0)}}\) so that \(P\) is indeed self-adjoint with respect to \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\). In particular, this implies that \(P\) has a complete orthonormal basis of eigenfunctions and associated real eigenvalues.

  • If we denote by \(\mathbf{1}\) the constant function, then evidently \(P \mathbf{1} = \mathbf{1}\), hence \(P\) has \(1\) as an eigenvalue. Moreover, \(P\) is an averaging operator (it has nonnegative entries and is row-stochastic), so by Jensen’s inequality,

    \[(P f(u))^2 = \left(\sum_{v \in V} P_{uv} f(v)\right)^2 \leq \sum_{v \in V} P_{uv} f(v)^2\,,\]

    and therefore

    \[\|P f\|_{\mu^{(0)}}^2 = \sum_{u \in V} \mu^{(0)}(u) (P f(u))^2 \leq \sum_{u \in V} \mu^{(0)}(u) \sum_{v \in V} P_{uv} f(v)^2 = \frac12 \sum_{u,v \in V} \mu(uv) f(u)^2 = \sum_{u \in V} \mu^{(0)}(u) f(u)^2 = \|f\|_{\mu^{(0)}}^2\,.\]

    This is usually summarized by saying: “\(P\) is a contraction on \(\ell_2(\mu^{(0)})\).” It implies that the spectrum of \(P\) lies in \([-1,1]\).

  • The spectral gap. Recall that \(1\) is an eigenvalue of \(P\) with eigenvector \(\mathbf{1}\). The spectral gap of \(P\) is the largest value \(\varepsilon\) such that all the other eigenvalues of \(P\) lie in the interval \([-(1-\e),1-\e]\).

    Equivalently, this is the largest value \(\e\) such that \(\|P f\|_{\mu^{(0)}} \leq (1-\e) \|f\|_{\mu^{(0)}}\) for all \(f\) with \(\langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0\).

    Let us denote this value, the “spectral gap of \(\mu\),” by \(\e(\mu)\).

  • The Laplacian. It is sometimes more convenient to work with the corresponding Laplacian operator \(L \seteq I - P\). Note that

    \[\begin{align} \langle f, L g\rangle_{\mu^{(0)}} &= \sum_{u \in V} \mu^{(0)}(u) f(u) \left(g(u) - \sum_{v \in V} P_{uv} g(v)\right) \nonumber \\ &= \frac12 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) - f(u) g(v)\right) \nonumber\\ &= \frac14 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) + f(v) g(v) - f(u) g(v) - f(v) g(u)\right) \nonumber\\ &= \frac14 \sum_{u,v \in V} \mu(uv) (f(u)-f(v)) (g(u)-g(v)) \nonumber\\ &= \frac12 \sum_{uv \in E} \mu(uv) \left(f(u)-f(v)\right) \left(g(u)-g(v)\right)\,. \label{eq:laplacian} \end{align}\]

    The eigenvalue \(1\) in \(P\) corresponding to the constant function \(\mathbf{1}\) becomes an eigenvalue \(0\) in \(L\). We have an equivalent characterization

    \[\e(\mu) = \min \left\{ \frac{\langle f, L f\rangle_{\mu^{(0)}}}{\langle f,f\rangle_{\mu^{(0)}}} : \langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0 \right\}.\]
  • Poincare inequalities. In order to eliminate the algebraic constraint \(\langle f,\mathbf{1}\rangle_{\mu^{(0)}}=0\), we can simply project \(f\) onto the orthogonal complement of the span of \(\mathbf{1}\), i.e., we can replace \(f\) by \(f - \E_{\mu^{(0)}}[f]\).

    Define

    \[\mathsf{Var}_{\mu^{(0)}}(f) \seteq \left\|f - \E_{\mu^{(0)}}[f]\right\|_{\mu^{(0)}}^2 = \E_{\mu^{(0)}} \left(f - \E_{\mu^{(0)}}[f]\right)^2 = \E_{\mu^{(0)}}(f^2) - (\E_{\mu^{(0)}} f)^2\,.\]

    Then we can write

    \[\e(\mu) = \min_f \frac{\langle f-\E_{\mu^{(0)}}[f], L(f-\E_{\mu^{(0)}}[f])\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)} = \min_f \frac{\langle f,L f\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)}\,,\]

    where the second equality uses the fact that \(L\) annihilates constant functions.

    This is often written in the form of a Poincare inequality: For all \(f : V \to \R\), it holds that

    \[\mathsf{Var}_{\mu^{(0)}}(f) \leq K(\mu) \langle f,L f\rangle_{\mu^{(0)}}\,,\]

    where \(K(\mu) \seteq 1/\e(\mu)\).

  • Let \(X\) be a simplicial complex. The dimension of \(X\) is \(\dim(X) \seteq \max \{ \abs{\sigma} - 1 : \sigma \in X \}\). Say that a \(d\)-dimensional complex \(X\) is pure if every face of \(X\) is contained in some face of dimension \(d\).

    Note that a pure \(d\)-dimensional simplicial complex is specified by a \((d+1)\)-uniform hypergraph (the set of its top-dimensional faces).

    Write \(X \seteq X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(d)}\) where \(X^{(i)}\) contains the faces of \(X\) of dimension \(i\). Note that \(X^{(-1)} = \{ \emptyset \}\).

  • Induced measures. Suppose now that \(\pi^{(d)}\) is a probability measure on \(X^{(d)}\). In analogy with the graph case, we can define probability measures \(\pi^{(0)}, \pi^{(1)},\ldots, \pi^{(d-1)}\) as follows: To sample from \(\pi^{(i)}\), choose \(\sigma \in X^{(d)}\) according to \(\pi^{(d)}\) and then remove \(d-i\) uniformly random vertices from \(\sigma\).

    Equivalently, we can define the measures inductively by

    \[\pi^{(i-1)}(\tau) \seteq \sum_{\sigma \in X^{(i)} : \tau \subseteq \sigma} \frac{\pi^{(i)}(\sigma)}{i+1}\,.\]
  • Links. Recall that for \(\sigma \in X\), the link of \(\sigma\) in \(X\) is defined by

    \[X_{\sigma} \seteq \{ \tau \in X : \sigma \cap \tau = \emptyset \textrm{ and } \sigma \cup \tau \in X \}\,.\]

    Note that if \(X\) is a graph, i.e., a \(1\)-dimensional simplicial complex, then for a vertex \(v \in X^{(0)}\), \(X_v\) is precisely the set of neighbors of \(v\) in \(X\).

  • Induced measures on links. Given a face \(\sigma \in X^{(i)}\), the link \(X_{\sigma}\) is a pure \((d-i-1)\)-dimensional complex. We can define the induced measure \(\pi_{\sigma}^{(d-i-1)}\) in the straightforward way (as a conditional measure): For any top-dimensional face \(\tau \in X_{\sigma}^{(d-i-1)}\),

    \[\pi_{\sigma}^{(d-i-1)}(\tau) \seteq \frac{\pi^{(d)}(\sigma \cup \tau)}{\sum_{\tau' \in X_{\sigma}^{(d-i-1)}} \pi^{(d)}(\sigma \cup \tau')} = \P_{\rho \sim \pi^{(d)}}\left[\rho = \sigma \cup \tau \mid \sigma \subseteq \rho\right].\]

    Repeating the earlier construction, we obtain measures \(\pi_{\sigma}^{(0)}, \pi_{\sigma}^{(1)}, \ldots, \pi_{\sigma}^{(d-i-1)}\).

  • Skeletons. The \(k\)-skeleton of a simplicial complex \(X\) is precisely the complex \(X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(k)}\).

  • Link expansion. Say that a pure \(d\)-dimensional complex \(X\) with measure \(\pi^{(d)}\) is an \(\e\)-spectral link expander if, for every \(\sigma \in X^{(d-2)} \cup X^{(d-1)} \cup \cdots \cup X^{(-1)}\), it holds that the \(1\)-skeleton of \(X_\sigma\) with the induced measure \(\pi_{\sigma}^{(1)}\) satisfies \(\e(\pi_{\sigma}^{(1)}) \geq \e\).
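
To make these definitions concrete, here is a small Python sketch for an arbitrary random weighted graph (all choices illustrative). It builds \(\mu^{(0)}\) and the walk matrix \(P\) from an edge measure \(\mu\), checks that \(P\) is self-adjoint with respect to \(\langle\cdot,\cdot\rangle_{\mu^{(0)}}\), and computes the spectral gap \(\e(\mu)\) from the spectrum of \(P\).

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_gap(W):
    """W is a symmetric nonnegative weight matrix with mu(uv) proportional to W[u, v].
    Returns (mu0, P, eps(mu))."""
    mu0 = W.sum(axis=1) / W.sum()                  # mu0(u) = (1/2) sum_v mu(uv)
    P = W / W.sum(axis=1)[:, None]                 # P_{uv} = mu(uv) / sum_w mu(uw)
    # Self-adjointness w.r.t. <.,.>_{mu0}: diag(mu0) @ P is a symmetric matrix.
    assert np.allclose(mu0[:, None] * P, (mu0[:, None] * P).T)
    # The eigenvalues of P equal those of the symmetric matrix D^{1/2} P D^{-1/2}.
    d = np.sqrt(mu0)
    evals = np.sort(np.linalg.eigvalsh((d[:, None] * P) / d[None, :]))
    eps = 1.0 - np.max(np.abs(evals[:-1]))         # drop the trivial eigenvalue 1
    return mu0, P, eps

# A random connected weighted graph (illustrative choice).
n = 8
A = np.triu(rng.uniform(size=(n, n)) * (rng.uniform(size=(n, n)) < 0.6), 1)
W = A + A.T + np.eye(n, k=1) + np.eye(n, k=-1)     # a path is added to ensure connectivity

mu0, P, eps = spectral_gap(W)
print("rows of P sum to 1:", np.allclose(P.sum(axis=1), 1.0))
print(f"spectral gap eps(mu) ≈ {eps:.3f}")
```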

Theorem

  • Trickle step: Suppose that \(X\) is a pure \(2\)-dimensional simplicial complex with a probability measure \(\pi^{(2)}\) on \(X^{(2)}\) and that

    • (a) The \(1\)-skeleton \((X^{(0)}, X^{(1)})\) is connected. More precisely, \(\e(\pi^{(1)}) > 0\).

    • (b) For every vertex \(v \in X^{(0)}\), the (\(1\)-dimensional) link \(X_v\) satisfies \(\e(\pi_v^{(1)}) \geq \e\).

    Then \(\e(\pi^{(1)}) \geq 2 - 1/\e\).

  • Note that the theorem is only nontrivial for \(\e \geq 1/2\), i.e., the links have to have a substantial spectral gap.

Proof:

  • Let \(L\) denote the Laplace operator on the graph \((X^{(0)}, X^{(1)})\) equipped with the measure \(\mu = \pi^{(1)}\).

    Use \eqref{eq:laplacian} to write

    \[\begin{align*} \langle f, L f \rangle_{\pi^{(0)}} &= \frac12 \E_{uw \sim \pi^{(1)}} (f(u)-f(w))^2 \\ &= \frac12 \E_{v \sim \pi^{(0)}} \E_{uw \sim \pi_v^{(1)}} (f(u)-f(w))^2 \\ &= \E_{v \sim \pi^{(0)}} \langle f, L_v f\rangle_{\pi_v^{(0)}}\,, \end{align*}\]

    where \(L_v\) is the Laplacian on \(X_v\) associated to the edge measure \(\mu = \pi_v^{(1)}\).

    Now using assumption (b) that \(\e(\pi_v^{(1)}) \geq \e\) gives

    \[\langle f, L f\rangle_{\pi^{(0)}} \geq \e\ \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}}(f)\,.\]
  • The law of total variance: If \(X\) and \(Y\) are random variables, then

    \[\mathsf{Var}(Y) = \E[\mathsf{Var}(Y \mid X)] + \mathsf{Var}(\E[Y \mid X]).\]

    Note that \(\E_{u \sim \pi_v^{(0)}} f(u) = P f(v)\). Thus by the law of total variance, we have

    \[\mathsf{Var}_{\pi^{(0)}}(f) = \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}} (f) + \mathsf{Var}_{\pi^{(0)}} (P f)\,.\]

    We conclude that

    \[\begin{equation}\label{eq:penult} \langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(\mathsf{Var}_{\pi^{(0)}}(f) - \mathsf{Var}_{\pi^{(0)}}(Pf) \right). \end{equation}\]
  • We are happy as long as \(\mathsf{Var}_{\pi^{(0)}}(Pf) \ll \mathsf{Var}_{\pi^{(0)}}(f)\). But in some sense this is what we are trying to prove! Indeed, the spectral gap can be characterized in terms of the rate of decay of the variance of functions acted on by \(P\). Convergence to the stationary measure \(\pi^{(0)}\) is precisely the assertion that \(\mathsf{Var}_{\pi^{(0)}}(P^t f) \to 0\) as \(t \to \infty\).

    The key reason we can still say something interesting is that \(\mathsf{Var}_{\pi^{(0)}}(Pf)\) involves terms of the form \(\langle f, P^2 f\rangle_{\pi^{(0)}}\). To check what happens, let us write everything in terms of the Laplacian: Observe that \(I - P^2 = 2 (I-P) - (I-P)^2 = 2 L - L^2\), and therefore \eqref{eq:penult} can be equivalently written as

    \[\langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(2 \langle f,L f\rangle_{\pi^{(0)}} - \langle f,L^2 f\rangle_{\pi^{(0)}}\right).\]

    We immediately conclude that any eigenvalue \(\lambda\) of \(L\) satisfies

    \[\lambda \geq \e (2-\lambda) \lambda\,.\]

    If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\), implying that the connected components of \((X^{(0)},X^{(1)})\) all have spectral gap at least \(2-1/\e\). Since we assume this graph to be connected, the desired result follows.
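
Here is a small numerical illustration of the Trickle step theorem. The complex below is the complete pure \(2\)-dimensional complex on a few vertices with random weights \(\pi^{(2)}\) (an arbitrary choice of mine, made so that all links are connected). The sketch computes the induced measures \(\pi^{(1)}\) and the link measures \(\pi_v^{(1)}\), and checks that \(\e(\pi^{(1)}) \geq 2 - 1/\e\) with \(\e = \min_v \e(\pi_v^{(1)})\).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def gap(W):
    """Spectral gap eps(mu) for the edge measure mu(uv) ∝ W[u, v] (W symmetric, >= 0)."""
    deg = W.sum(axis=1)
    keep = deg > 0
    W, deg = W[np.ix_(keep, keep)], deg[keep]
    d = np.sqrt(deg)
    evals = np.sort(np.linalg.eigvalsh(W / np.outer(d, d)))   # D^{-1/2} W D^{-1/2}
    return 1.0 - np.max(np.abs(evals[:-1]))                   # drop the trivial eigenvalue 1

# A pure 2-dimensional complex: all triangles on n vertices, with random weights pi2.
n = 7
triangles = list(combinations(range(n), 3))
pi2 = rng.uniform(0.2, 1.0, size=len(triangles))
pi2 /= pi2.sum()

W1 = np.zeros((n, n))                              # edge measure pi^{(1)} (up to symmetry)
links = [np.zeros((n, n)) for _ in range(n)]       # link edge measures pi_v^{(1)}
for (a, b, c), p in zip(triangles, pi2):
    for u, v in ((a, b), (a, c), (b, c)):
        W1[u, v] += p / 3
        W1[v, u] += p / 3
    for u in (a, b, c):
        x, y = [z for z in (a, b, c) if z != u]
        links[u][x, y] += p
        links[u][y, x] += p

eps_skeleton = gap(W1)
eps_links = min(gap(L) for L in links)
print(f"eps(pi^(1))          ≈ {eps_skeleton:.3f}")
print(f"min_v eps(pi_v^(1))  ≈ {eps_links:.3f}")
print(f"trickle-down bound 2 - 1/eps = {2 - 1/eps_links:.3f}  (should be <= eps(pi^(1)))")
```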

Reading:

Mon, Feb 27
High-dimensional random walks

Trickle Down Theorem

  • Suppose that \(X\) is a pure \(d\)-dimensional simplicial complex with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\).

    Consider \(0 \leq k \leq d-2\) and assume furthermore that:

    • (a) For every \(\sigma \in X^{(k-1)}\), the \(1\)-skeleton of \(X_{\sigma}\) is connected. More precisely, that \(\e(\pi^{(1)}_{\sigma}) > 0\).

    • (b) For every \(\sigma \in X^{(k)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).

    Then for every \(\rho \in X^{(k-1)}\), we have \(\e(\pi_{\rho}^{(1)}) \geq 2 - 1/\e\).

Proof:

  • Fix some \(\rho \in X^{(k-1)}\). Let \(L\) denote the Laplace operator on the \(1\)-skeleton of \(X_{\rho}\) equipped with the measure \(\mu = \pi^{(1)}_{\rho}\).

    Write:

    \[\begin{align*} \langle f, L f \rangle_{\pi_{\rho}^{(0)}} &= \frac12 \E_{uw \sim \pi_{\rho}^{(1)}} (f(u)-f(w))^2 \\ &= \frac12 \E_{v \sim \pi_{\rho}^{(0)}} \E_{uw \sim \pi_{\rho \cup v}^{(1)}} (f(u)-f(w))^2 \\ &= \E_{v \sim \pi_{\rho}^{(0)}} \langle f, L_v f\rangle_{\pi_{\rho \cup v}^{(0)}}\,, \end{align*}\]

    where \(L_v\) is the Laplacian on the \(1\)-skeleton of \(X_{\rho \cup v}\) associated to the edge measure \(\mu = \pi_{\rho \cup v}^{(1)}\).

    Now assumption (b) gives \(\e(\pi_{\rho \cup v}^{(1)}) \geq \e\), and therefore

    \[\langle f, L f\rangle_{\pi_{\rho}^{(0)}} \geq \e\ \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}}(f)\,.\]
  • Note that \(\E_{u \sim \pi_{\rho \cup v}^{(0)}} f(u) = (I-L) f(v)\). Thus by the law of total variance, we have

    \[\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) = \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}} (f) + \mathsf{Var}_{\pi_{\rho}^{(0)}} ((I-L) f)\,.\]

    We conclude that

    \[\begin{align*} \langle f, L f\rangle_{\pi_{\rho}^{(0)}} &\geq \e \left(\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) - \mathsf{Var}_{\pi_{\rho}^{(0)}}((I-L)f) \right) \\ &= \e \left(2 \langle f,L f\rangle_{\pi_{\rho}^{(0)}} - \langle f,L^2 f\rangle_{\pi_{\rho}^{(0)}}\right). \end{align*}\]
  • Therefore if \(\lambda\) is an eigenvalue of \(L\), then

    \[\lambda \geq \e (2-\lambda) \lambda\,.\]

    If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\). Now assumption (a) asserts that only the trivial eigenvalue of \(L\) is \(0\), and the desired result follows.

Corollary

  • Suppose that \(X\) is a pure \(d\)-dimensional simplicial complex equipped with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\), and that

    • (a) For every face \(\sigma \in X\) with \(\dim(\sigma) \leq d-2\), it holds that the \(1\)-skeleton of \(X_{\sigma}\) is connected. More precisely, \(\e(\pi_{\sigma}^{(1)}) > 0\).

    • (b) For every face \(\sigma \in X^{(d-2)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).

    Then for every \(-1 \leq k \leq d-2\) and \(\sigma \in X^{(k)}\), it holds that

    \[\e(\pi_{\sigma}^{(1)}) \geq 1 - \left(d-(k+1)\right) (1-\e) \geq 1 - d(1-\e) \,.\]
  • Note that to derive a non-trivial bound for the \(1\)-skeleton of \(X\), we need a very strong spectral gap for the \((d-2)\)-links: \(\e \geq 1-1/d\).

Proof:

  • The proof is by induction using the Trickle Down Theorem. Throughout, write \(\delta \seteq 1-\e\).

    Consider \(\sigma \in X^{(k)}\):

    • \(k=d-2\). By assumption, \(\e(\pi_{\sigma}^{(1)}) \geq \e = 1-\delta\).

    • \(k=d-3\). The theorem yields \(\e(\pi_{\sigma}^{(1)}) \geq 2 - \frac{1}{1-\delta} = \frac{1-2\delta}{1-\delta} \geq 1 - 2\delta\).

    • \(k=d-4\). Applying the theorem with the bound \(\frac{1-2\delta}{1-\delta}\) from the previous level gives \(\e(\pi_{\sigma}^{(1)}) \geq 2 - \frac{1-\delta}{1-2\delta} = \frac{1-3\delta}{1-2\delta} \geq 1 - 3\delta\).

    • Etc.: for \(\sigma \in X^{(d-1-j)}\), apply the theorem with the bound \(\frac{1-(j-1)\delta}{1-(j-2)\delta}\) from the previous level and use the identity

      \[2 - \frac{1-(j-2)\delta}{1-(j-1)\delta} = \frac{1-j\delta}{1-(j-1)\delta} \geq 1 - j\delta\,,\]

      where the last inequality holds whenever \(1-(j-1)\delta > 0\) and \(1-j\delta \geq 0\).

    Finally, if \(1-j\delta < 0\) at some level, then the claimed bound \(1-j\delta\) for that level (and for every level below it) is negative, and it holds trivially since the spectral gap is always at least \(0\).
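For completeness, here is a short symbolic check (an illustration, not part of the notes) of the telescoping identity used in the inductive step above.

```python
import sympy as sp

# Check that 2 - (1-(j-2)*d)/(1-(j-1)*d) = (1-j*d)/(1-(j-1)*d) as rational functions.
j, d = sp.symbols('j delta')
lhs = 2 - (1 - (j - 2) * d) / (1 - (j - 1) * d)
rhs = (1 - j * d) / (1 - (j - 1) * d)
assert sp.simplify(lhs - rhs) == 0
```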

Random walks

  • Example: Sampling random spanning trees. Consider an undirected graph \(G=(V,E)\) and let \(\mathcal{T}\) denote the set of spanning trees in \(G\) (where we think of a spanning tree as a subset \(T \subseteq E\)). The downward closure \(X\) of \(\mathcal{T}\) is an \((n-2)\)-dimensional simplicial complex, where \(n = |V|\).

    The edge exchange walk on \(\mathcal{T}\) is the random walk defined as follows: If \(T\) is a spanning tree, we remove a uniformly random edge \(e \in E(T)\) and then define \(T' \seteq (T \setminus \{e\}) \cup \{e'\}\), where \(e' \notin E(T)\) is a uniformly random edge such that \(T' \in \mathcal{T}\).

    Our goal now is to define high-dimensional random walks so that the associated spectral gaps allow us to analyze the mixing time of such a random process. The edge exchange walk will be a particular example of the “down-up” walk analyzed next; a small simulation sketch of this walk appears after the proof of \eqref{eq:al-thm} below.

  • Let \(X\) denote a pure \(d\)-dimensional simplicial complex equipped with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\).

    Let us define up and down walk operators: \(P_{\nearrow k} : \ell_2(\pi^{(k-1)}) \to \ell_2(\pi^{(k)})\) and \(P_{\searrow k} : \ell_2(\pi^{(k+1)}) \to \ell_2(\pi^{(k)})\) by

    \[\begin{align*} P_{\nearrow k} f(\sigma) &\seteq \E_{v \in \sigma} f(\sigma \setminus v) \\ P_{\searrow k} f(\sigma) &\seteq \E_{v \sim \pi_{\sigma}^{(0)}} f(\sigma \cup v)\,. \end{align*}\]

    Note that these are “co-”operators: They operate on functions. This means that the adjoint of the “up” operator \(P_{\nearrow k}^{\top}\) acts in the opposite direction on distributions:

    \[\begin{align*} \pi^{(k)} &= P_{\searrow k-1}^{\top} \pi^{(k-1)} \\ \pi^{(k-1)} &= P_{\nearrow k}^{\top} \pi^{(k)}\,. \end{align*}\]

    Correspondingly, we obtain two different walk operators on \(X^{(k)}\): The up-down walk and the down-up walk, respectively

    \[\begin{align*} P_k^{\triangle} &\seteq P_{\searrow k} P_{\nearrow k+1} \\ P_k^{\triangledown} &\seteq P_{\nearrow k} P_{\searrow k-1}\,. \end{align*}\]

    Let us also define the non-lazy up-down walk operator by

    \[P_k^{\wedge} \seteq P_k^{\triangle} + \frac{1}{k+1} (P_k^{\triangle} - I)\,.\]
  • It is not difficult to check that all three walks have \(\pi^{(k)}\) as the stationary measure; a small numerical check is sketched after these definitions.

  • Let us define the corresponding Laplacians

    \[\begin{align*} L_k^{\triangle} &\seteq I - P_k^{\triangle} \\ L_k^{\triangledown} &\seteq I - P_k^{\triangledown} \\ L_k^{\wedge} &\seteq I - P_k^{\wedge} = \frac{k+2}{k+1} L_k^{\triangle}\,. \end{align*}\]
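The definitions above are easy to realize concretely. The sketch below (an illustration, not part of the notes) builds the up and down co-operators as matrices for a small pure \(2\)-dimensional complex with a random measure on its top faces, assembles \(P_k^{\triangle}\), \(P_k^{\triangledown}\), and \(P_k^{\wedge}\) exactly as defined, and checks numerically that \(\pi^{(k)}\) is stationary for each walk. The choice of complex, the random measure, and the helper names `up`/`down` are arbitrary choices for the example; the link vertex measure is computed via the consistency identity \(\pi^{(k)}(\sigma)\,\pi_{\sigma}^{(0)}(v) = \pi^{(k+1)}(\sigma \cup \{v\})/(k+2)\), which follows from the definitions.

```python
import itertools
import numpy as np

# Illustrative sketch (not from the notes): the up/down co-operators and the
# three walks on the complete 2-dimensional complex on 4 vertices, with a
# random measure pi^(2) on the triangles.

d = 2
rng = np.random.default_rng(0)
top = [frozenset(t) for t in itertools.combinations(range(4), d + 1)]
w = rng.random(len(top))
pi = {d: dict(zip(top, w / w.sum()))}             # pi^(d) on the top faces

# Push the measure down one level at a time: delete a uniformly random vertex.
for k in range(d, 0, -1):
    lower = {}
    for sigma, p in pi[k].items():
        for v in sigma:
            lower[sigma - {v}] = lower.get(sigma - {v}, 0.0) + p / (k + 1)
    pi[k - 1] = lower

faces = {k: list(pi[k]) for k in pi}

def up(k):
    """P_{up,k}: functions on X^(k-1) -> functions on X^(k), averaging over sub-faces."""
    M = np.zeros((len(faces[k]), len(faces[k - 1])))
    for i, sigma in enumerate(faces[k]):
        for v in sigma:
            M[i, faces[k - 1].index(sigma - {v})] = 1.0 / (k + 1)
    return M

def down(k):
    """P_{down,k}: functions on X^(k+1) -> functions on X^(k), averaging over the
    link vertex measure pi_sigma^(0)(v) = pi^(k+1)(sigma+v) / ((k+2) pi^(k)(sigma))."""
    M = np.zeros((len(faces[k]), len(faces[k + 1])))
    for j, tau in enumerate(faces[k + 1]):
        for v in tau:
            sigma = tau - {v}
            M[faces[k].index(sigma), j] = pi[k + 1][tau] / ((k + 2) * pi[k][sigma])
    return M

k = 1
P_up_down = down(k) @ up(k + 1)                                        # P_k^triangle
P_down_up = up(k) @ down(k - 1)                                        # P_k^triangledown
P_nonlazy = P_up_down + (P_up_down - np.eye(len(faces[k]))) / (k + 1)  # P_k^wedge

pik = np.array([pi[k][s] for s in faces[k]])
for P in (P_up_down, P_down_up, P_nonlazy):
    assert np.allclose(P.sum(axis=1), 1.0)     # each row is a probability distribution
    assert np.allclose(P.T @ pik, pik)         # pi^(k) is stationary
```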

Theorem

Define \(\e_k \seteq \min_{\sigma \in X^{(k)}} \e(\pi^{(1)}_{\sigma})\) as the minimum spectral gap over all \(1\)-skeleta of links of \(k\)-dimensional faces.

Then for every \(0 \leq k \leq d-1\), and \(f \in \ell_2(\pi^{(k)})\), it holds that

\[\begin{equation}\label{eq:al-thm} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} \geq \e_{k-1} \langle f, L_k^{\triangledown} f\rangle_{\pi^{(k)}}\,. \end{equation}\]

Corollary

  • Using the relationship between \(L_k^{\wedge}\) and \(L_k^{\triangle}\), this gives

    \[\frac{k+2}{k+1} \e(L_k^{\triangle}) \geq \e_{k-1} \e(L_k^{\triangledown})\,.\]

    But recall that \(P_k^{\triangle} = P_{\searrow k} P_{\nearrow k+1}\) and \(P_{k+1}^{\triangledown} = P_{\nearrow k+1} P_{\searrow k}\).

    It is straightforward to check that for rectangular matrices \(A,B\), the two matrices \(AB\) and \(BA\) have the same nonzero eigenvalues, and therefore \(P_k^{\triangle}\) and \(P_{k+1}^{\triangledown}\) have the same spectrum apart from the multiplicity of the eigenvalue \(0\). In particular, \(\e(L_k^{\triangle}) = \e(L_{k+1}^{\triangledown})\), and therefore

    \[\begin{equation}\label{eq:rec-formula} \e(L_{k+1}^{\triangledown}) \geq \frac{k+1}{k+2} \e_{k-1} \e(L_k^{\triangledown})\,. \end{equation}\]
  • Applying \eqref{eq:rec-formula} iteratively gives, for \(1 \leq k \leq d\),

    \[\e(L_{k}^{\triangledown}) \geq \frac{1}{k+1} \prod_{j=-1}^{k-2} \e_j\,.\]

    In particular, we have

    \[\e(L_d^{\triangledown}) \geq \frac{1}{d+1} \prod_{j=-1}^{d-2} \e_j\,,\]

    yielding a spectral gap for the down-up walk whenever there are non-trivial spectral gaps for the \(1\)-skeleta of all the links in \(X\).

Proof of \eqref{eq:al-thm}:

  • For \(j \leq k\), \(\sigma \in X^{(j)}\), and \(f : X^{(k)} \to \R\), let us define \(f_{\sigma}^{(k-j-1)} : X_{\sigma}^{(k-j-1)} \to \R\) by \(f^{(k-j-1)}_{\sigma}(\tau) \seteq f(\sigma \cup \tau)\).

  • Let \(L_{\sigma}\) denote the Laplacian on the \(1\)-skeleton of \(X_{\sigma}\) with respect to the edge measure \(\mu = \pi_{\sigma}^{(1)}\). Then we have

    \[\begin{align} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \langle f^{(0)}_{\sigma}, L_{\sigma} f^{(0)}_{\sigma} \rangle_{\pi_{\sigma}^{(0)}} \label{eq:down1} \\ \langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi_{\sigma}^{(0)}}(f_{\sigma}^{(0)})\,, \label{eq:down2} \end{align}\]

    and \eqref{eq:al-thm} follows immediately from the definition of the spectral gap.

  • We are left to verify \eqref{eq:down1} and \eqref{eq:down2}.

    Use \eqref{eq:laplacian} to write

    \[\begin{align*} \langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha}} (f(\alpha)-f(\alpha + u - v))^2 \\ &=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f(\sigma + u)-f(\sigma + v))^2 \\ &=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f_{\sigma}^{(0)}(u) - f_{\sigma}^{(0)}(v))^2 \\ &= \E_{\sigma \sim \pi^{(k-1)}} \langle f_{\sigma}^{(0)}, L_{\sigma} f_{\sigma}^{(0)}\rangle_{\pi_{\sigma}^{(0)}}\,. \end{align*}\]
  • Next write

    \[\begin{align*} \langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha \setminus v}} (f(\alpha) - f(\alpha + u - v))^2 \\ &= \frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f(\sigma + u) - f(\sigma + v))^2 \\ &= \frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f^{(0)}_{\sigma}(u) - f^{(0)}_{\sigma}(v))^2 \\ &= \E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi^{(0)}_{\sigma}}(f_{\sigma}^{(0)})\,, \end{align*}\]

    where the last line follows from the general fact

    \[\mathsf{Var}(X) = \frac12 \E |X-X'|^2\,,\]

    when \(X\) and \(X'\) are two independent copies of \(X\).
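To connect this back to the example of sampling random spanning trees, here is a self-contained sketch (an illustration, not part of the notes) of the down-up walk \(P_d^{\triangledown}\) on the spanning trees of \(K_4\) with the uniform measure; the edge exchange walk described earlier is its non-lazy variant, since here the removed edge may be re-inserted. The brute-force enumeration and the helper `is_spanning_tree` are choices made for the example.

```python
import itertools
import numpy as np

# Illustrative sketch (not from the notes): the down-up walk on the spanning
# trees of K_4 (the edge exchange walk, with a lazy step allowed).

n = 4
edges = list(itertools.combinations(range(n), 2))

def is_spanning_tree(es):
    """n-1 edges form a spanning tree iff they connect all n vertices."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in es:
        parent[find(u)] = find(v)
    return len({find(v) for v in range(n)}) == 1

trees = [frozenset(es) for es in itertools.combinations(edges, n - 1)
         if is_spanning_tree(es)]
idx = {T: i for i, T in enumerate(trees)}

# One step from T: drop a uniformly random edge e of T, then resample a
# spanning tree uniformly among those containing T \ {e}.
P = np.zeros((len(trees), len(trees)))
for T in trees:
    for e in T:
        compatible = [S for S in trees if T - {e} <= S]
        for S in compatible:
            P[idx[T], idx[S]] += 1.0 / (len(T) * len(compatible))

pi = np.full(len(trees), 1.0 / len(trees))
assert np.allclose(P.T @ pi, pi)               # the uniform measure is stationary

eigs = np.sort(np.linalg.eigvalsh((P + P.T) / 2))
print("spectral gap of the down-up walk on spanning trees of K_4:", 1.0 - eigs[-2])
```

The corollary above then bounds this gap below by \(\frac{1}{d+1} \prod_{j=-1}^{d-2} \e_j\) whenever the \(1\)-skeleta of all the links have non-trivial spectral gaps.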

Reading:

Noise stability on the hypercube via Brownian motion
Wed, Mar 01
Leaking entropy at constant rate

Reading:

Mon, Mar 06
Majority is stablest

Recent topics in ML theory
Wed, Mar 08
The edge of stability

Notes

Self-stabilization dynamics

  • Define the set \(\mathcal{M} \seteq \{ \theta \in \R^n : S(\theta) \leq 2/\eta \textrm{ and } \langle \nabla L(\theta), u(\theta)\rangle = 0 \}\), where we recall that \(u(\theta)\) is the top eigenvector of \(\nabla^2 L(\theta)\).

    Given the gradient descent dynamics \(\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\), we define \(\hat{\theta}_t \seteq \mathrm{proj}_{\mathcal{M}}(\theta_t)\) as the (Euclidean) projection onto the set \(\mathcal{M}\).

  • Now define the quantities

    \[\begin{align*} x_t &\seteq \langle u, \theta_t - \hat{\theta}_0\rangle \\ y_t &\seteq \langle \nabla S(\hat{\theta}_0), \theta_t - \hat{\theta}_0\rangle, \end{align*}\]

    where \(u \seteq u(\hat{\theta}_0)\) is the top eigenvector of \(\nabla^2 L(\hat{\theta}_0)\) and \(S(\theta) \seteq \lambda_{\max}(\nabla^2 L(\theta))\) is the sharpness.

    If we assume that \(S(\hat{\theta}_0)=2/\eta\) and \(\nabla S(\theta_t) \approx \nabla S(\hat{\theta}_0)\), then

    \[\begin{equation}\label{eq:sharp} S(\theta_t) \approx 2/\eta + y_t\,. \end{equation}\]
  • Linear approximation. For \(x_t,y_t\) small, we can approximate \(\nabla L(\theta_t) \approx \nabla L(\hat{\theta}_0)\).

  • Quadratic approximation. When \(\abs{x_t}\) blows up, we use a quadratic approximation:

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) \\ &\approx \langle u, \theta_t - \hat{\theta}_0\rangle \nabla^2 L(\theta_t) u\,. \end{align*}\]

    In the last line, we have used the fact that \(\theta_t-\hat{\theta}_0\) only blows up in the \(u\) direction, so we only use the quadratic approximation there.

    This quantity is equal to \(x_t \nabla^2 L(\theta_t) u \approx x_t S(\theta_t) u\).

    In conjunction with \(\langle u, \nabla L(\hat{\theta}_0)\rangle = 0\), this gives

    \[\begin{equation}\label{eq:xpart} x_{t+1} = x_t - \eta \langle u, \nabla L(\theta_t)\rangle = x_t(1-\eta S(\theta_t)) \approx -(1+\eta y_t) x_t\,, \end{equation}\]

    where the last approximation uses \eqref{eq:sharp}. Here we see the blowup phase where \(\abs{x_t}\) grows multiplicatively as soon as \(y_t > 0\), i.e., as soon as the sharpness exceeds \(2/\eta\), and the sign of \(x_t\) oscillates.

  • Cubic terms. For \(\abs{x_t}\) sufficiently large, we need to incorporate the 3rd-order term in the Taylor expansion:

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\theta_t)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\ &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\ &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[u,u] \langle u, \theta_t - \hat{\theta}_0\rangle^2\,, \end{align*}\]

    where as in the quadratic setting, we only consider the component in direction \(u\). Using the approximation we derived for the quadratic part, this gives

    \[\begin{align*} \nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla^3 L(\hat{\theta}_0)[u,u] \\ &= x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla S(\hat{\theta}_0)\,, \end{align*}\]

    where we used the fact that if the top eigenvalue of \(\nabla^2 L(\theta)\) has multiplicity one, then \(\nabla S(\theta) = \nabla^3 L(\theta)[u(\theta),u(\theta)]\).

  • Then our update to \(y_t\) looks like

    \[\begin{align*} y_{t+1} - y_t &= \langle \nabla S(\hat{\theta}_0), - \eta \nabla L(\theta_t)\rangle \\ &= -\eta \langle \nabla S(\hat{\theta}_0), \nabla L(\hat{\theta}_0)\rangle - \eta \langle \nabla S(\hat{\theta}_0), \nabla L(\theta_t)-\nabla L(\hat{\theta}_0)\rangle \\ &\approx \eta \alpha - \eta x_t S(\theta_t) \langle \nabla S(\hat{\theta}_0), u \rangle - \eta \frac{x_t^2}{2} \beta\,, \end{align*}\]

    where \(\alpha \seteq \langle \nabla S(\hat{\theta}_0), - \nabla L(\hat{\theta}_0)\rangle\) is the progressive sharpening parameter and \(\beta \seteq \|\nabla S(\hat{\theta}_0)\|^2\) is the “stabilization” parameter.

  • If we assume for simplicity that \(\langle \nabla S(\hat{\theta}_0), u\rangle = 0\), then we obtain

    \[\begin{equation}\label{eq:ypart} y_{t+1} - y_t = \eta \left(\alpha - \beta \frac{x_t^2}{2}\right)\,, \end{equation}\]

    meaning that the sharpness is decaying whenever \(\abs{x_t} > \sqrt{2\alpha/\beta}\).
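The reduced dynamics \eqref{eq:xpart} and \eqref{eq:ypart} are easy to simulate directly. The sketch below (an illustration, not part of the notes, with arbitrarily chosen parameter values) exhibits the resulting behavior: \(y_t\) (the sharpness minus \(2/\eta\)) rises at rate \(\eta\alpha\) until \(\abs{x_t}\) blows up past \(\sqrt{2\alpha/\beta}\), at which point the sharpness is driven back down and the cycle repeats.

```python
import numpy as np

# Illustrative simulation (not from the notes) of the reduced dynamics:
#   x_{t+1} = -(1 + eta * y_t) * x_t
#   y_{t+1} = y_t + eta * (alpha - beta * x_t**2 / 2)
# The parameter values are arbitrary choices for the illustration.

eta, alpha, beta = 0.01, 1.0, 1.0
x, y = 1e-3, 0.0                      # small displacement along u; sharpness = 2/eta
xs, ys = [], []
for _ in range(4000):
    xs.append(x)
    ys.append(y)
    x, y = -(1 + eta * y) * x, y + eta * (alpha - beta * x ** 2 / 2)

xs, ys = np.array(xs), np.array(ys)
# |x_t| repeatedly grows past sqrt(2*alpha/beta) and is pushed back, while the
# sharpness 2/eta + y_t oscillates around 2/eta.
print("max |x_t|:", np.abs(xs).max())
print("range of y_t:", ys.min(), ys.max())
```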

Reading: