Wed, Jan 04
NO CLASS


Generic chaining and extrema of random processes 
Mon, Jan 09
Examples, applications, Gaussian processes

Notes

A random process is a collection \(\{X_t : t \in T\}\) of random variables.
For the next few lectures, we will consider symmetric processes, i.e.,
those where \(X_t\) and \(-X_t\) have the same law for every \(t \in T\).

We equip the index set \(T\) of such a process with the natural \(\ell_2\) metric
\[d(s,t) \seteq \sqrt{\E (X_s-X_t)^2}.\]

A fundamental example is that of a Gaussian process, where the family \(\{X_t : t \in T\}\)
is a jointly normal family of random variables. When \(T\) is finite, any such family can be
described as follows: \(T \subseteq \R^n\) for some \(n \geq 1\), and \(X_t = \langle g,t\rangle\),
where \(g=(g_1,\ldots,g_n)\) is a standard \(n\)-dimensional Gaussian, i.e., the coordinates
\(g_1,\ldots,g_n\) are i.i.d. \(N(0,1)\) random variables.
In this case the geometry of the metric space \((T,d)\) is Euclidean, since \(d(s,t) = \|s-t\|_2\).
Also note that the metric \(d(s,t)\) completely describes the covariance structure of the process,
and therefore characterizes it (this is a special property of jointly normal families of random variables).

We will be concerned with extrema of such a process, and to that end we study the quantity
\[\E \max_{t \in T} X_t\,.\]

A random process \(\{ X_t : t \in T \}\) is called subgaussian if there is some constant \(c > 0\) such that
\[\begin{equation}\label{eq:sgtail}
\P[X_s - X_t > \lambda] \leq \exp\left(-c\lambda^2/d(s,t)^2\right).
\end{equation}\]
In particular, Gaussian processes are subgaussian with \(c=1/2\).
Another example is a Bernoulli process, i.e., a family of random variables defined as follows:
For some subset \(T \subseteq \R^n\),
\[\left\{ X_t = \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t = (t_1,\ldots,t_n) \in T \right\},\]
where \(\e_1,\ldots,\e_n \in \{-1,1\}\) are i.i.d. uniformly random signs.

Diameter bound: When the index set \(T\) is finite, we can use the tail bound \eqref{eq:sgtail} to give a simple
upper bound on \(\E \max_{t \in T} X_t\).
Fix some \(t_0 \in T\) and note that \(\E \max_{t \in T} X_t = \E \max_{t \in T} (X_t-X_{t_0})\).
Now applying a union bound gives
\[\P\left[\max_{t \in T} (X_t-X_{t_0}) > \lambda\right] \leq \abs{T} \exp\left(-c\lambda^2/\Delta^2\right)\,,\]
where \(\Delta \seteq \max \{ d(t,t_0) : t \in T \}\).
A simple calculation gives the conclusion
\[\E \max_{t \in T} X_t \lesssim \Delta \sqrt{\log \abs{T}}\,.\]
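As a sanity check on the diameter bound, the following minimal Python sketch (standard library only; the instance \(T = \{e_1,\ldots,e_n\}\) and all numerical parameters are illustrative assumptions) estimates \(\E \max_{t} X_t\) for the Gaussian process indexed by the standard basis vectors, for which \(X_t = \langle g,t\rangle\) gives \(n\) i.i.d. \(N(0,1)\) variables and \(\Delta \leq \sqrt{2}\):

```python
import math
import random

# Monte Carlo check of the diameter bound for the Gaussian process
# indexed by T = {e_1,...,e_n}: then X_t = <g,t> are n i.i.d. N(0,1)
# variables and Delta = max_t d(t, t_0) <= sqrt(2).
random.seed(0)

def expected_max_gaussians(n, trials=2000):
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

n = 1000
est = expected_max_gaussians(n)
bound = math.sqrt(2.0) * math.sqrt(math.log(n))   # Delta * sqrt(log |T|)
print(est, bound)
```

For \(n=1000\) the Monte Carlo estimate lands visibly below the bound, reflecting that the union bound is tight only up to constants even for this nearly equidistant index set.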
Supplementary reading:

Wed, Jan 11
Generic chaining and Dudley's entropy bound

Notes
Recall that we consider a subgaussian process \(\{X_t : t \in T\}\) and equip \(T\)
with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\).

Chaining: Instead of applying a naive union bound to control \(X_t - X_{t_0}\) for every \(t \in T\),
we can break every such difference into parts and then bound each part. This potentially reduces
the number of events we need to take a union bound over.
For instance, given four random variables \(X_0,X_1,X_2,X_3\), we could write
\[\begin{eqnarray*}
X_3 - X_0 &=& (X_3 - X_1) + (X_1 - X_0) \\
X_2 - X_0 &=& (X_2 - X_1) + (X_1 - X_0) \\
X_1 - X_0 &=& X_1 - X_0\,.
\end{eqnarray*}\]

Generic chaining upper bound: For subgaussian processes, we can control \(\E \max_{t \in T} X_t\) using chaining.
Fix some \(t_0 \in T\) and a sequence of finite sets
\[\{t_0\} = T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T_n \subseteq \cdots \subseteq T\,,\]
with cardinalities bounded by \(\abs{T_n} \leq 2^{2^n}\) for every \(n \geq 1\).
Then we have:
\[\begin{equation}\label{eq:gcub}
\E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t,T_n)\,,
\end{equation}\]
where \(d(t,T_n) = \min_{s \in T_n} d(s,t)\).

Dudley’s entropy bound: If \(\{X_t : t \in T\}\) is a subgaussian process, then
\[\begin{equation}\label{eq:dudley}
\E \max_{t \in T} X_t \lesssim \sum_{n \geq 1} 2^{n/2} e_n(T)\,,
\end{equation}\]
where \(e_n(T)\) is the largest distance \(d\) such that there are \(2^{2^n}\)
points \(y_1,y_2,\ldots,y_{2^{2^n}}\) in \(T\) with \(d(y_i,y_j) \geq d\) for all \(i \neq j\).
Easy exercise: Show that Dudley’s entropy bound follows from the generic chaining upper bound \eqref{eq:gcub}.
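To make the quantities \(e_n(T)\) concrete, here is a small Python sketch (standard library only; the random planar point set and the greedy farthest-point heuristic are illustrative assumptions, and greedy only lower-bounds the optimal packing distance):

```python
import math
import random

# Numerical sketch of the packing quantities e_h(T): greedily pick
# 2^(2^h) points, each as far as possible from those already chosen;
# the smallest pairwise gap of the chosen set lower-bounds e_h(T).
random.seed(1)

def dist(s, t):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def greedy_packing_gap(T, s):
    """Min pairwise distance of s greedily chosen far-apart points."""
    chosen = [T[0]]
    mindist = [dist(T[0], t) for t in T]
    while len(chosen) < min(s, len(T)):
        i = max(range(len(T)), key=lambda j: mindist[j])
        chosen.append(T[i])
        mindist = [min(mindist[j], dist(T[i], t)) for j, t in enumerate(T)]
    return min(dist(a, b) for a in chosen for b in chosen if a is not b)

# Random points in the unit square and the resulting Dudley-type sum.
T = [(random.random(), random.random()) for _ in range(200)]
e = [greedy_packing_gap(T, 2 ** (2 ** h)) for h in range(4)]   # h = 0..3
dudley = sum(2 ** (h / 2) * e[h] for h in range(1, 4))
print(e, dudley)
```

Since the greedy runs are nested, the estimated \(e_h\) values are non-increasing in \(h\), mirroring the fact that packing a doubly exponential number of points forces smaller separation.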

Covering numbers: Let \(N(T,d,\e)\) denote the smallest number \(N\) such that the metric space \((T,d)\)
can be covered by \(N\) balls of radius \(\e\), and consider the quantity
\[C_T \seteq \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^{h})}\,.\]
Up to a factor of \(2\), this is equivalent to the more elegant expression \(\int_0^{\infty} \sqrt{\log N(T,d,\e)} \,d\e\).
Exercise: Show that for any metric space \((T,d)\), it holds that
\[\begin{equation}\label{eq:CT}
\sum_{n \geq 1} 2^{n/2} e_n(T) \asymp \sum_{h=-\infty}^{\infty} 2^h \sqrt{\log N(T,d,2^h)}\,,
\end{equation}\]
and hence \(C_T \asymp \sum_{n \geq 1} 2^{n/2} e_n(T)\).

Analysis of the probability simplex
Let \(T = \{ x \in \R^n : x_1,\ldots,x_n \geq 0, x_1+\cdots+x_n = 1 \}\) denote the
probability simplex.
Lemma 1: For this \(T\), let \(X_t = \langle g,t\rangle\), where \(g=(g_1,\ldots,g_n)\) is a standard
\(n\)-dimensional Gaussian. Then \(\E \max_{t \in T} X_t \asymp \sqrt{\log n}\).
This is because the maximum of the linear functional \(t \mapsto \langle g,t\rangle\) is achieved at a vertex of the convex body \(T\)
which is equal to the convex hull of the standard basis vectors: \(T = \mathrm{conv}(e_1,\ldots,e_n)\).
Thus \(\max_{t \in T} \langle g,t\rangle = \max \{ g_i : i =1,\ldots,n\}\), and therefore
\[\E \max_{t \in T} \langle g,t\rangle = \E \max(g_1,\ldots,g_n) \asymp \sqrt{\log n}.\]
Let us now see that the entropy bound \eqref{eq:dudley} is far from tight in this simple case.
We will use the formulation via covering numbers coming from the equivalence \eqref{eq:CT}.
Consider vectors \(t \in T\) with \(t_i \in \{0,1/k\}\) for \(i=1,\ldots,n\). There are \({n \choose k} \geq \left(\frac{n}{k}\right)^k\) such vectors.
For \(k \leq \sqrt{n}\), we have \(\log \binom{n}{k} \geq k \log(n/k) \gtrsim k \log n\),
and it is possible to choose a proportional subset of those vectors (e.g., a random subset will suffice) such that
\[d(t_i,t_j) \gtrsim \left(k \cdot \frac{1}{k^2} \right)^{1/2} \asymp k^{-1/2}, \qquad i \neq j\,.\]
It follows that for \(k \leq \sqrt{n}\), we have \(\log N(T,d,k^{-1/2}) \gtrsim k \log n\).
Consider then the \(\approx \log n\) values of \(h \leq 0\) for which \(k = 2^{-2h}\) satisfies \(1 \leq k \leq \sqrt{n}\).
This gives us \(\approx \log n\) values of \(h\) for which
\[2^{h} \sqrt{\log N(T,d,2^{h})} \gtrsim 2^{h} \cdot 2^{-h} \sqrt{\log n} = \sqrt{\log n}\,.\]
Therefore Dudley’s bound \eqref{eq:CT} is \(\gtrsim (\log n) \cdot \sqrt{\log n} = (\log n)^{3/2}\).
(Recall that the correct bound, given by Lemma 1, is \(\sqrt{\log n}\).)
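The vertex argument in Lemma 1 is easy to test numerically; the sketch below (Python standard library; the dimension and the exponential-normalization sampler for the simplex are illustrative choices) checks that no sampled point of the simplex beats the best coordinate of \(g\):

```python
import random

# Lemma 1's vertex argument: over the probability simplex, the linear
# functional t -> <g,t> is maximized at a standard basis vector, so
# max_{t in T} <g,t> = max_i g_i.
random.seed(2)

n = 50
g = [random.gauss(0.0, 1.0) for _ in range(n)]

def random_simplex_point(n):
    # Normalized exponentials are uniform on the simplex (Dirichlet(1,...,1)).
    e = [random.expovariate(1.0) for _ in range(n)]
    s = sum(e)
    return [x / s for x in e]

best = max(g)   # value at the best vertex e_i
for _ in range(1000):
    t = random_simplex_point(n)
    inner = sum(gi * ti for gi, ti in zip(g, t))
    assert inner <= best + 1e-12   # no interior point does better
print(best)
```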

Mon, Jan 16
NO CLASS (MLK day)


Wed, Jan 18
Graph and hypergraph sparsification

Notes

We have seen that if \(\{X_t : t \in T \}\) is a symmetric subgaussian process equipped
with the distance \(d(s,t) = \sqrt{\E(X_s-X_t)^2}\), then for any sequence
\(\{t_0\} = T_0 \subseteq T_1 \subseteq \cdots \subseteq T_h \subseteq \cdots \subseteq T\)
with \(\abs{T_h} \leq 2^{2^h}\), we have a chaining upper bound
\[\E \max_{t \in T} X_t \lesssim \max_{t \in T} \sum_{h \geq 0} 2^{h/2} d(t,T_h)\,.\]
Let us define \(\gamma_2(T,d)\) as the best possible chaining upper bound, i.e., the infimum of the RHS
over all compliant sequences \(\{T_h\}\). This is Talagrand’s famous \(\gamma_2\) functional, and it may be
surprising that it characterizes (up to constants) the expected supremum of a Gaussian process.
Majorizing measures theorem (Fernique-Talagrand): If \(\{X_t : t\in T \}\) is a centered Gaussian process, then
\[\E \max_{t \in T} X_t \asymp \gamma_2(T,d)\,.\]

Random signed sums: Suppose that \(\e_1,\e_2,\ldots,\e_n\) is a sequence of independent uniformly random signs and
we define the random process:
\[\left\{ X_t = \e_1 t_1 + \e_2 t_2 + \cdots + \e_n t_n : t \in T \right\}\,,\]
where \(T \subseteq \R^n\). Then classical concentration bounds show that \(\{X_t : t\in T\}\) is subgaussian,
and it is straightforward to calculate the distance (by expanding the square):
\[d(s,t) = \|s-t\|_2\,.\]
The chaining upper bound gives \(\E \max_{t \in T} X_t \lesssim \gamma_2(T,d)\). Let us now use this to
understand the hypergraph sparsification problem.

Hypergraph sparsification. Consider a weighted hypergraph \(H=(V,E,w)\) where \(\{w_e : e \in E\}\) are nonnegative edge weights.
We associate to \(H\) the quadratic expression
\[Q_H(x) = \sum_{e \in E} w_e \max_{u,v \in e} (x_u-x_v)^2\,.\]
If \(H\) were a graph (where \(\abs{e}=2\) for every \(e \in E\)), this would correspond to the quadratic form
of the graph Laplacian.
Our goal is to find another hypergraph \(\tilde{H} = (V,\tilde{E},\tilde{w})\) with \(\tilde{E} \subseteq E\) and such that
\[\begin{equation}\label{eq:sparseapprox}
\abs{Q_H(x)-Q_{\tilde{H}}(x)} \leq \e Q_H(x),\qquad \forall x \in \R^n\,,
\end{equation}\]
where \(\e > 0\) is some accuracy parameter. Moreover, we would like \(\abs{\tilde{E}}\) to be small, ideally near-linear in \(\abs{V}\)
(while \(\abs{E}\) could be as large as \(2^{\abs{V}}\)).

Independent random sampling. Suppose we have a probability distribution \(\{\mu_e : e \in E\}\). Let us form \(\tilde{H}\) by
sampling \(M\) edges \(e_1,e_2,\ldots,e_M\) independently from \(\mu\) and setting the edge weights in \(\tilde{H}\) so that
\[Q_{\tilde{H}}(x) = \frac{1}{M} \sum_{k=1}^M \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\,,\]
where we define \(Q_e(x) = \max_{u,v \in e} (x_u-x_v)^2\).
Note that \(\E[Q_{\tilde{H}}(x)] = Q_H(x)\) for all \(x \in \R^n\), and to establish a bound like \eqref{eq:sparseapprox},
it makes sense to study the quantity
\[\E \max_{Q_H(x) \leq 1} \left|Q_H(x) - Q_{\tilde{H}}(x)\right|.\]
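The unbiasedness \(\E[Q_{\tilde{H}}(x)] = Q_H(x)\) is just importance sampling, which the following Python sketch checks on a toy hypergraph (standard library only; the edges, weights, and the weight-proportional choice of \(\mu\) are hypothetical):

```python
import random

# Unbiasedness of the importance-sampling estimator: for any fixed x,
# E_{e ~ mu}[ (w_e / mu_e) Q_e(x) ] = sum_e w_e Q_e(x) = Q_H(x).
random.seed(3)

edges = [(0, 1), (0, 1, 2), (1, 2, 3)]   # toy hypergraph on 4 vertices
w = [1.0, 2.0, 0.5]                      # hypothetical edge weights

def Q_e(e, x):
    return max((x[u] - x[v]) ** 2 for u in e for v in e)

def Q_H(x):
    return sum(we * Q_e(e, x) for we, e in zip(w, edges))

x = [0.3, -1.0, 0.7, 0.2]
mu = [we / sum(w) for we in w]           # sample proportionally to weight

# Exact expectation of a single sample of the estimator:
expectation = sum(mu_e * (we / mu_e) * Q_e(e, x)
                  for mu_e, we, e in zip(mu, w, edges))
assert abs(expectation - Q_H(x)) < 1e-12

# The Monte Carlo average over M samples concentrates around Q_H(x).
M = 20000
ks = random.choices(range(len(edges)), weights=mu, k=M)
estimate = sum(w[k] / mu[k] * Q_e(edges[k], x) for k in ks) / M
print(Q_H(x), estimate)
```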

Passing to a subGaussian process.
Let \(\hat{H}\) be an independent copy of \(\tilde{H}\) and write
\[\begin{eqnarray*}
\E_{\tilde{H}} \max_{Q_H(x) \leq 1} \left|Q_H(x) - Q_{\tilde{H}}(x)\right| &=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} \left|\E_{\hat{H}}[Q_{\hat{H}}(x)]-Q_{\tilde{H}}(x)\right| \\
&=& \E_{\tilde{H}} \max_{Q_H(x) \leq 1} \left|\E_{\hat{H}} [Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)]\right| \\
&\leq& \E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} \left|Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)\right|\,,
\end{eqnarray*}\]
where we have used convexity to pull out the expectation: \(\abs{\E X} \leq \E \abs{X}\) and \(\max(\E X_1,\ldots,\E X_k) \leq \E \max(X_1,\ldots,X_k)\).
Note that for any fixed signs \(\e_1,\ldots,\e_M \in \{-1,1\}\), the two random variables
\[\frac{1}{M} \sum_{k=1}^M \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x) - \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right)
\quad \text{and} \quad
\frac{1}{M} \sum_{k=1}^M \e_k \left(\frac{w_{\hat{e}_k}}{\mu_{\hat{e}_k}} Q_{\hat{e}_k}(x) - \frac{w_{\tilde{e}_k}}{\mu_{\tilde{e}_k}} Q_{\tilde{e}_k}(x)\right)\]
have the same law, because \(\tilde{e}_k\) and \(\hat{e}_k\) have the same law, and the first expression is exactly \(Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)\).
Define \(t_k(x) \seteq \frac{w_{e_k}}{\mu_{e_k}} Q_{e_k}(x)\). Averaging over i.i.d. uniformly random signs and applying the triangle inequality to the two copies gives
\[\E_{\hat{H},\tilde{H}} \max_{Q_H(x) \leq 1} \left|Q_{\hat{H}}(x)-Q_{\tilde{H}}(x)\right| \leq \frac{2}{M}\, \E\, \E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \left|\sum_{k=1}^M \e_k t_k(x)\right|
\leq \frac{4}{M}\, \E\, \E_{(\e_1,\ldots,\e_M)} \max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\,,\]
where the outer expectation is over the random hyperedges \(e_1,\ldots,e_M\), and the last inequality removes the absolute values using symmetry of the process.
So now to bound the quantity we cared about initially, it suffices to bound \(\max_{Q_H(x) \leq 1} \sum_{k=1}^M \e_k t_k(x)\) for every choice
of hyperedges \(e_1,\ldots,e_M\). Of course, this is a subGaussian process, and so we are led to the study of \(\gamma_2(T,d)\),
where \(T = \{ x \in \R^n : Q_H(x) \leq 1\}\), and
\[d(x,y) = \left(\sum_{k=1}^M \left(\frac{w_{e_k}}{\mu_{e_k}} (Q_{e_k}(x)-Q_{e_k}(y))\right)^2\right)^{1/2}.\]

Abstracting out the norm sparsification problem.
Note that each \(\sqrt{Q_e(x)}\) for \(e \in E\) is actually a norm on \(\R^n\) (technically it is only a seminorm since \(Q_e(x)=0\) does not entail \(x=0\)).
Thus we can consider the following related problem: Suppose that \(N_1,N_2,\ldots,N_m\) are (semi)norms on \(\R^n\) and we want to bound
\[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2\,,\]
where \(T \subseteq \R^n\) and \(\e_1,\ldots,\e_m\) are i.i.d. random signs. Again, this is a subGaussian process equipped with the distance
\[d(x,y) = \left(\sum_{k=1}^m (N_k(x)^2 - N_k(y)^2)^2\right)^{1/2}.\]
In the next lecture, we will find upper bounds on \(\gamma_2(T,d)\) for this family of problems.
Supplementary reading:

Mon, Jan 23
Sparsifying sums of norms

Notes

Norms on \(\R^n\). Recall that a norm \(N\) on \(\R^n\) is a nonnegative mapping \(N : \R^n \to \R_+\) such that
1. \(N(\lambda x) = \abs{\lambda} N(x)\) for all \(x \in \R^n\) and \(\lambda \in \R\).
2. \(N(x+y) \leq N(x) + N(y)\) for all \(x,y \in \R^n\).
3. \(N(x) = 0 \iff x = 0\).
\(N\) is a seminorm if it only satisfies properties (1) and (2). We will consider seminorms throughout the lecture and say “norm”
even when considering seminorms.

An upper bound. Suppose that \(N_1,N_2,\ldots,N_m\) are norms on \(\R^n\) and \(T \subseteq B_2^n\), where \(B_2^n = \{ x \in \R^n : \|x\|_2 \leq 1 \}\)
is the Euclidean unit ball.
Define \(\kappa \seteq \E \max_{k=1,\ldots,m} N_k(g)\), where \(g\) is a standard \(n\)-dimensional Gaussian.
Our goal is to prove the following: If \(\e_1,\e_2,\ldots,\e_m\) are i.i.d. random signs, then
\[\begin{equation}\label{eq:goal11}
\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \kappa \log(n) \max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2}\,.
\end{equation}\]

Example: Operator norm for sums of random matrices. Another natural situation where a bound like this is useful is for sums of random matrices. Suppose that
\[A = \sum_{k=1}^m \e_k A_k^T A_k\,,\]
where each \(A_k^T A_k\) is a PSD matrix. Then the operator norm of \(A\) can be written as
\[\|A\|_{\mathrm{op}} = \max_{\|x\|_2 \leq 1} \langle x, A x\rangle = \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \langle x, A_k^T A_k x\rangle
= \max_{\|x\|_2 \leq 1} \sum_{k=1}^m \e_k \|A_k x\|_2^2\,,\]
which is the special case of our problem where \(N_k(x) = \|A_k x\|_2\) and \(T = B_2^n\).

Applying Dudley’s entropy inequality. Recall that \(\left\{ \sum_{k=1}^m \e_k N_k(x)^2 : x \in T \right\}\) is a subgaussian process when equipped with the distance
\[d(x,y) \seteq \left(\sum_{k=1}^m \left(N_k(x)^2 - N_k(y)^2\right)^2\right)^{1/2}.\]
Therefore Dudley’s entropy bound shows that
\[\begin{equation}\label{eq:dudley3}
\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,d)\,.
\end{equation}\]

Controlling the subgaussian distance via a norm. Notice that both sides of the inequality \eqref{eq:goal11} are quadratic in the norms \(N_k\). Therefore by rescaling we may assume that
\[\begin{equation}\label{eq:L2N}
\max_{x \in T} \sqrt{\sum_{k=1}^m N_k(x)^2} = 1\,.
\end{equation}\]
Define the norm
\[\|x\|_N \seteq \max_{k=1,\ldots,m} N_k(x)\,.\]
Note that using \(a^2-b^2 = (a-b)(a+b)\), for any \(x,y \in T\), we have
\[\begin{eqnarray*}
d(x,y) &=& \left(\sum_{k=1}^m (N_k(x)-N_k(y))^2 (N_k(x)+N_k(y))^2 \right)^{1/2} \\
&\leq& \|x-y\|_N \left(\sum_{k=1}^m (N_k(x)+N_k(y))^2\right)^{1/2} \\
&\leq& 2 \|x-y\|_N\,,
\end{eqnarray*}\]
where the first inequality uses \(\abs{N_k(x)-N_k(y)} \leq N_k(x-y) \leq \|x-y\|_N\) (true for any norm) and the second inequality uses
\eqref{eq:L2N} and \(x,y \in T\).
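The inequality \(d(x,y) \leq 2\|x-y\|_N\) under the normalization \eqref{eq:L2N} can be spot-checked numerically; in this Python sketch (standard library only) the seminorms \(N_k(x) = |\langle a_k, x\rangle|\) for random vectors \(a_k\) are an illustrative assumption:

```python
import math
import random

# Check d(x,y) <= 2 ||x-y||_N for points normalized to the boundary of
# T = { x : sum_k N_k(x)^2 <= 1 }, with seminorms N_k(x) = |<a_k, x>|.
random.seed(4)

n, m = 6, 10
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

def N(k, x):
    return abs(sum(A[k][i] * x[i] for i in range(n)))

def l2N(x):
    return math.sqrt(sum(N(k, x) ** 2 for k in range(m)))

def norm_N(x):   # ||x||_N = max_k N_k(x)
    return max(N(k, x) for k in range(m))

def d(x, y):     # the subgaussian distance of the process
    return math.sqrt(sum((N(k, x) ** 2 - N(k, y) ** 2) ** 2 for k in range(m)))

for _ in range(200):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    sx, sy = l2N(x), l2N(y)
    x = [xi / sx for xi in x]   # now sum_k N_k(x)^2 = 1
    y = [yi / sy for yi in y]
    diff = [xi - yi for xi, yi in zip(x, y)]
    assert d(x, y) <= 2 * norm_N(diff) + 1e-9
print("distance bound verified")
```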

Splitting the Dudley sum. Therefore we have \(e_h(T,d) \leq 2 e_h(T, \|\cdot\|_N)\) and combining this with Dudley’s bound \eqref{eq:dudley3} gives
\[\begin{equation}\label{eq:dudley2}
\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim \sum_{h \geq 0} 2^{h/2} e_h(T,\|\cdot\|_N)\,.
\end{equation}\]
\end{equation}\]
We will split this sum into two parts depending on whether \(h \leq 4 \log (n)\) or \(h > 4 \log (n)\).

Controlling the large \(h\) terms with a volume bound. Note that for \(x \in T\), by \eqref{eq:L2N} we have
\[\|x\|_N \leq \sqrt{\sum_{k=1}^m N_k(x)^2} \leq 1\,.\]
In other words, \(T \subseteq B_N \seteq \{ x \in \R^n : \|x\|_N \leq 1\}\).
Therefore \(e_h(T,\|\cdot\|_N) \leq e_h(B_N,\|\cdot\|_N)\).
Claim: It holds that \(e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}\).
This proof works for any norm on \(\R^n\).
Assume that \(x_1,\ldots,x_{s} \in B_N\) satisfy \(\|x_i-x_j\|_N \geq \delta\) for all \(i \neq j\). Then the balls \(x_j + \frac{\delta}{2} B_N\) are pairwise disjoint, and for \(\delta \leq 2\) they are all contained in \(2 B_N\), hence
\[\mathrm{vol}_n(2 B_N) \geq s \,\mathrm{vol}_n\left(\tfrac{\delta}{2} B_N\right) = s \,(\delta/4)^n\, \mathrm{vol}_n(2 B_N)\,,\]
where \(\mathrm{vol}_n\) is the standard \(n\)-dimensional volume on \(\R^n\). It follows that \(s \leq (4/\delta)^{n}\), i.e.,
\(\delta \leq 4 s^{-1/n}\). Taking \(s = 2^{2^h}\) yields
\[e_h(B_N,\|\cdot\|_N) \leq 4 \cdot 2^{-2^h/n}.\]
(Note also that if \(x_1,\ldots,x_s\) form a maximal \(\delta\)-separated set, then by maximality \(B_N \subseteq \bigcup_{j=1}^s (x_j + \delta B_N)\), so the same argument bounds covering numbers.)

Recalling \eqref{eq:dudley2}, we can use the preceding claim to evaluate
\[\sum_{h > 4 \log n} 2^{h/2} e_h(T, \|\cdot\|_N) \leq 4 \sum_{h > 4 \log n} 2^{h/2}\, 2^{-2^h/n} \leq O(1)\,.\]
Thus we have
\[\E \max_{x \in T} \sum_{k=1}^m \e_k N_k(x)^2 \lesssim O(1) + \sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N)\,.\]

Controlling the interesting \(h\) terms. To bound the second sum, let us use \(T \subseteq B_2^n\) to write \(e_h(T,\|\cdot\|_N) \leq e_h(B_2^n,\|\cdot\|_N)\).
(Dual) Sudakov Lemma: For any norm \(\|\cdot\|\) on \(\R^n\) and \(h \geq 0\), it holds that
\[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2}\, \E\, \|g\|\,.\]
Note that \(\E\, \|g\|_N = \E \max_k N_k(g) = \kappa\) (by definition), and therefore the second sum above is bounded by
\[\sum_{0 \leq h \leq 4 \log n} 2^{h/2} e_h(T,\|\cdot\|_N) \lesssim \kappa \log n\,.\]
This completes the proof of \eqref{eq:goal11} once we have established the Dual Sudakov Lemma.

Gaussian Shift Lemma. We need the following
Lemma: Suppose that \(K\) is a symmetric convex set and \(\gamma_n\) is the \(n\)-dimensional Gaussian measure on \(\R^n\). Then for any \(x \in \R^n\),
\[\gamma_n(K+x) \geq e^{-\|x\|_2^2/2} \gamma_n(K)\,.\]
Proof: Write
\[\begin{eqnarray*}
\gamma_n(K+x) &=& (2\pi)^{-n/2} \int_K e^{-\|x+z\|_2^2/2}\,dz \\
&=&
(2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz\,,
\end{eqnarray*}\]
where the second equality follows because \(K\) is symmetric. Now note that \(\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2 = \|x\|_2^2 + \|z\|_2^2\)
and use convexity of \(y \mapsto e^y\) (Jensen’s inequality) to write
\[\begin{eqnarray*}
(2\pi)^{-n/2} \int_K \E_{\sigma \in \{-1,1\}} e^{-\|\sigma x+z\|_2^2/2}\,dz &\geq &
(2\pi)^{-n/2} \int_K e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x+z\|_2^2/2}\,dz \\
&=&
(2\pi)^{-n/2} \int_K e^{-(\|x\|_2^2+\|z\|_2^2)/2}\,dz = e^{-\|x\|_2^2/2} \gamma_n(K)\,.
\end{eqnarray*}\]
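Since the Gaussian measure of an axis-aligned box factors into one-dimensional measures, the shift lemma can be verified essentially exactly for a box using the error function (Python standard library; the box \(K = [-1,1]^2\) and shift \(x = (1,0)\) are an arbitrary illustrative choice):

```python
import math

# Exact check of the Gaussian shift lemma for the symmetric convex box
# K = [-1,1]^2 and shift x = (1,0): the Gaussian measure of a box is a
# product of one-dimensional measures computable from erf.
def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gauss_interval(a, b):
    return Phi(b) - Phi(a)

gamma_K = gauss_interval(-1, 1) ** 2
gamma_K_shift = gauss_interval(0, 2) * gauss_interval(-1, 1)   # K + (1,0)
lower = math.exp(-0.5) * gamma_K   # e^{-||x||^2/2} gamma(K), ||x|| = 1
print(gamma_K_shift, lower)
```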

Proof of Dual Sudakov. Let \(\mathcal{B} \seteq \{ x \in \R^n : \|x\| \leq 1 \}\) denote the unit ball of the norm \(\|\cdot\|\) and suppose
\(x_1,\ldots,x_s \in B_2^n\) is a maximal set of points such that the shifted balls \(x_j + \delta \mathcal{B}\) are all pairwise disjoint.
In that case, we have
\[\begin{equation}\label{eq:Bcover}
B_2^n \subseteq \bigcup_{j=1}^s (x_j + 2 \delta \mathcal{B})\,,
\end{equation}\]
i.e., we have covered \(B_2^n\) by \(s\) balls (in \(\|\cdot\|\)) of radius at most \(2\delta\).
Let \(\lambda > 0\) be a parameter we will choose later and note that the sets \(\lambda(x_j+\delta \mathcal{B})\) are also pairwise disjoint, and therefore
\[\begin{eqnarray*}
1 &\geq& \gamma_n\left(\bigcup_{j=1}^s \lambda (x_j + \delta \mathcal{B}) \right) \\
&=& \sum_{j=1}^s \gamma_n\left(\lambda (x_j + \delta \mathcal{B}) \right) \\
&\geq& \sum_{j=1}^s e^{-\lambda^2 \|x_j\|_2^2/2}\, \gamma_n(\lambda \delta \mathcal{B})
\geq s \cdot e^{-\lambda^2/2}\, \gamma_n(\lambda \delta \mathcal{B})\,,
\end{eqnarray*}\]
where the first equality uses disjointness, the second inequality uses the Gaussian Shift Lemma, and the final inequality uses
\(x_1,\ldots,x_s \in B_2^n\).
If we define \(\lambda \seteq \frac{2}{\delta} \E\, \|g\|\), then
\[\gamma_n(\lambda \delta \mathcal{B}) = \P\left(g \in \lambda \delta \mathcal{B}\right) = \P\left(\|g\| \leq \lambda \delta\right) = \P\left(\|g\| \leq 2\, \E\, \|g\|\right) \geq \frac12\,,\]
where the last step is simply Markov’s inequality.
Combining this with our previous calculation yields
\[1 \geq s \cdot \exp\left(-\lambda^2/2\right) \cdot \frac12 = \frac{s}{2} \exp\left(-2 (\E\, \|g\|)^2/\delta^2\right)\,,\]
i.e.,
\[\sqrt{\log (s/2)} \leq \frac{2}{\delta}\, \E\, \|g\|\,.\]
Taking \(s=2^{2^h}\) and recalling \eqref{eq:Bcover} shows that we can cover \(B_2^n\) by at most \(s\) \(\|\cdot\|\)-balls of radius \(2 \delta\) with
\[\delta \leq \frac{2\, \E\, \|g\|}{\sqrt{\log (s/2)}} \lesssim 2^{-h/2}\, \E\,\|g\|\,.\]
By definition, this yields our desired bound
\[e_h(B_2^n, \|\cdot\|) \lesssim 2^{-h/2}\, \E\,\|g\|\,.\]
Supplementary reading:

The KLS conjecture and stochastic localization 
Wed, Jan 25
A Cheeger inequality for isotropic convex sets

Notes

We studied the \(\ell_1\) norm sparsification problem:
Given norms \(N_1,\ldots,N_m\) on \(\R^n\),
define \(N(x) \seteq N_1(x) + \cdots + N_m(x)\). Then we seek a weight vector \(c \in \R_+^m\) with
\(\# \{ i : c_i > 0 \}\) small such that
\[\max_{N(x) \leq 1} \left|N(x) - \sum_{j=1}^m c_j N_j(x)\right| \leq \e\,.\]

The role of concentration was essential: If \(\mathbf{Z}\) is a logconcave random vector on \(\R^n\) and
\(\\cdot\\) is any norm, then
\[\Pr\left(\|\mathbf{Z}\| > t\, \E[\|\mathbf{Z}\|]\right) \leq e^{-c t \psi_n}\,,\]
where \(c > 0\) is a universal constant and \(\psi_n\) is the KLS constant.

This sort of concentration is essentially equivalent to a uniform lower bound on the Cheeger constant
of isotropic logconcave measures on \(\R^n\). (Next lecture.)
Reading:
Videos:

Mon, Jan 30
The KLS conjecture, slicing, thin shells, and concentration

Notes
Localization

Consider a logconcave density \(F : \R^n \to \R_+\) with corresponding measure \(\mu\), and suppose
that \(D \seteq \mathrm{diam}(\mathrm{supp}(F)) < \infty\) (where \(\mathrm{supp}(F) \seteq \{ x \in \R^n : F(x) > 0 \}\)).
Let \(S_1,S_2,S_3\) be three measurable sets partitioning \(\R^n\) and denote \(d(S_1,S_2) = \min \{ \|x-y\|_2 : x \in S_1, y \in S_2 \}\).
KK Lemma (Karzanov and Khachiyan): It holds that
\(\mu(S_3) \geq \frac{d(S_1,S_2)}{D} \min \{ \mu(S_1), \mu(S_2) \}.\)
Our goal is to reduce this to the \(1\)-dimensional case.
Define \(C \seteq \frac{d(S_1,S_2)}{D}\).
The idea is to define two functions $g_1,g_2 : \R^n \to \R$ by
\[g_1(x) \seteq \begin{cases}
C F(x) & x \in S_1 \\
0 & x \in S_2 \\
-F(x) & x \in S_3
\end{cases} \qquad
g_2(x) \seteq \begin{cases}
C F(x) & x \in S_2 \\
0 & x \in S_1 \\
-F(x) & x \in S_3\,.
\end{cases}\]
Suppose, for the sake of contradiction, that \(C \mu(S_1) > \mu(S_3)\) and \(C \mu(S_2) > \mu(S_3)\).
This gives \(\int g_1(x)\,dx = C \mu(S_1) - \mu(S_3) > 0\) and \(\int g_2(x)\,dx = C \mu(S_2) - \mu(S_3) > 0\).
Now we are going to define a decreasing sequence \(K_0 \supseteq K_1 \supseteq K_2 \supseteq \cdots\) of convex bodies so that
\(K \seteq \bigcap_{i \geq 0} K_i\) is either a point or a segment and such that
\(\int_{K_i} g_1(x)\,dx > 0\) and \(\int_{K_i} g_2(x)\,dx > 0\) for each \(i=0,1,2,\ldots\).

For \(K_0\), simply choose a ball large enough to contain \(\mathrm{supp}(F)\).
Consider any halfspace \(H\) such that
\[\int_{K_i \cap H} g_1(x)\,dx = \frac12 \int_{K_i} g_1(x)\,dx\,.\]
Call such a halfspace \(H\) bisecting.
Then \(H\) or its complement \(\R^n \setminus H\) (which is also bisecting) will satisfy
\[\int_{K_i \cap H} g_2(x)\,dx > 0\,.\]
We choose such an \(H\) and take \(K_{i+1} \seteq K_i \cap H\). Let’s see how such an \(H\) can be found.

Claim: For any \((n2)\)dimensional affine subspace \(A\), there is at least one bisecting hyperplane that contains \(A\).
Note that the boundary of a halfspace \(H\) is an \((n-1)\)-dimensional hyperplane.
So consider a halfspace \(H\) that contains \(A\) in its boundary (think of a plane containing a line).
We can rotate \(H\) about \(A\) so that \(H\) becomes its complementary halfspace \(H'\).
If \(\int_H g_1 - \int_{\R^n \setminus H} g_1 = 0\) initially, we are done. If not, then \(\int_H g_1 - \int_{\R^n \setminus H} g_1\) changes sign as we rotate from \(H\) to \(H'\). By continuity, we find a bisecting halfspace whose bounding hyperplane contains \(A\).
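The continuity argument can be imitated numerically in the plane, where \(A\) is a single point and halfspaces through it are parametrized by an angle. The Python sketch below (standard library only; the empirical Gaussian sample standing in for the measure and the chosen point are illustrative assumptions) locates an approximately bisecting halfspace by sign change and bisection:

```python
import math
import random

# Rotating-halfspace argument in the plane (n = 2): the affine subspace
# A is a point p, and halfspaces whose boundary passes through p are
# parametrized by the angle of the normal.  F(theta) is the signed mass
# difference; F(theta + pi) = -F(theta), so a sign change exists and
# bisection locates an approximately bisecting halfspace.
random.seed(5)

pts = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(20000)]
p = (0.3, -0.2)   # the point the bounding line must pass through

def F(theta):
    nx, ny = math.cos(theta), math.sin(theta)
    s = sum(1 if (q[0] - p[0]) * nx + (q[1] - p[1]) * ny <= 0 else -1
            for q in pts)
    return s / len(pts)

lo, hi = 0.0, math.pi
flo = F(lo)
for _ in range(40):
    mid = 0.5 * (lo + hi)
    fmid = F(mid)
    if (fmid > 0) == (flo > 0):
        lo, flo = mid, fmid
    else:
        hi = mid
theta = 0.5 * (lo + hi)
print(theta, F(theta))
```

With point masses, \(F\) is only piecewise constant, so "bisecting" is achieved up to a small tolerance rather than exactly.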

Let \(A_0,A_1,\ldots\) be a sequence of all \((n-2)\)-dimensional affine subspaces of the form \(v_0 + \mathrm{span}(v_1,\ldots,v_{n-2})\) where the vectors \(\{v_j\}\) have rational coordinates (there are countably many such subspaces). Let \(K_{i+1} \seteq K_i \cap H_i\), where \(H_i\) is a bisecting halfspace whose boundary
contains \(A_i\).
Claim: The intersection \(K \seteq \bigcap_{i\geq 1} K_i\) is a point or an interval.
Assume to the contrary that \(K\) is at least \(2\)-dimensional. Then its projection onto some pair of coordinate
axes is still \(2\)-dimensional. Without loss of generality, we can assume it’s the first two axes.
Then there must be rational numbers \(r_1\) and \(r_2\) such that some point \((r_1,r_2,x_3, \ldots,x_n)\) lies in the interior of \(K\).
But the \((n-2)\)-dimensional affine subspace \(\{(r_1,r_2)\} \times \R^{n-2}\) occurs as some \(A_i\).
Recall that the halfspace \(H_i\) has its boundary containing \(A_i\). Since \((r_1,r_2,x_3,\ldots,x_n)\) is in the interior of \(K\),
it must also be in the interior of \(K_{i+1} = K_i \cap H_i\), and hence in the interior of \(H_i\). But this is a contradiction,
as the point lies on the boundary hyperplane of \(H_i\).

We conclude that \(\int_K g_1 \geq 0\) and \(\int_K g_2 \geq 0\) on the limiting set \(K\).
Note that we can easily get \(\int_K g_1 > 0, \int_K g_2 > 0\) for the original functions
by starting with the functions \(\hat{g}_1 = g_1 - \delta \mathbf{1}_{K_0}\) and \(\hat{g}_2 = g_2 - \delta \mathbf{1}_{K_0}\) for \(\delta > 0\) sufficiently
small and then finding \(K\)
where \(\int_K \hat{g}_1 \geq 0, \int_K \hat{g}_2 \geq 0\).
Since the restriction of a logconcave function to an interval is still logconcave, this is indeed a reduction
of the KK Lemma to the \(1\)-dimensional case.
Finding a better needle
For other applications, it helps
to replace \(g_1\) and \(g_2\) by nicer functions at every step. One can consult Lovász-Simonovits
Lemma 2.5 for a (rather technical) argument proving the following.

Localization Lemma: Given \(\int_{\R^n} g_1 > 0\) and \(\int_{\R^n} g_2 > 0\), there are two points \(a,b \in \R^n\)
and a linear function \(\ell : [0,1] \to \R_+\) such that
\[\begin{align*}
\int_0^1 \ell(t)^{n-1} g_1((1-t) a + tb)\,dt &> 0 \\
\int_0^1 \ell(t)^{n-1} g_2((1-t) a + tb)\,dt &> 0\,.
\end{align*}\]

One can instead derive the conclusion that there is a number \(\gamma \in \R\) such that
\[\begin{align*}
\int_0^1 e^{\gamma t} g_1((1-t) a + tb)\,dt &> 0 \\
\int_0^1 e^{\gamma t} g_2((1-t) a + tb)\,dt &> 0\,,
\end{align*}\]
which I find a bit more natural.
Reading:
Videos:

Wed, Feb 01
Stochastic localization

Notes

Recall the KLS constant of a probability measure \(\mu\) on \(\R^n\):
\[\psi_{\mu} \seteq \inf_{S \subseteq \R^n : \mu(S) \leq 1/2} \frac{\mu(\partial S)}{\mu(S)}\,,\]
where the infimum is over measurable subsets and
\[\mu(\partial S) \seteq \lim_{\e \to 0+} \frac{\mu(S_{\e} \setminus S)}{\e}\,,\]
where \(S_{\e} \seteq \{ x \in \R^n : d(x,S) \leq \e \}\) is the \(\e\)-neighborhood of \(S\).

The KLS conjecture asserts that when \(\mu\) is a logconcave distribution, it should hold that
the infimum can be restricted to halfspaces:
\[\psi_{\mu} \gtrsim \inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)}\,,\]
where a halfspace is defined by \(H = \{ x \in \R^n : \langle x,v \rangle \leq c \}\) for some \(v \in \R^n\) and \(c \in \R\).

Suppose that \(\mu\) is a probability measure on \(\R^n\). The covariance matrix of \(\mu\) is defined as
\[A \seteq \E[(X-\E[X])(X-\E[X])^{\top}]\,,\]
where \(X\) is a random variable with law \(\mu\).
This describes the variance of \(\mu\) along every direction: If \(v \in \R^n\), then
\[\E[\langle v, X - \E[X]\rangle^2] = \langle v, A v\rangle\,.\]
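The identity \(\E[\langle v, X-\E X\rangle^2] = \langle v, Av\rangle\) holds exactly for the empirical covariance of any sample, as this short Python sketch checks (standard library only; the data and the direction \(v\) are arbitrary illustrative choices):

```python
import random

# Empirical covariance matrix A of a sample, and the exact identity
# E <v, X - E X>^2 = <v, A v> checked on that sample.
random.seed(6)

n, N_pts = 3, 500
X = [[random.gauss(0.0, 1.0) + 0.5 * j for j in range(n)]
     for _ in range(N_pts)]

mean = [sum(x[i] for x in X) / N_pts for i in range(n)]
A = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in X) / N_pts
      for j in range(n)] for i in range(n)]

v = [1.0, -2.0, 0.5]
lhs = sum(sum(v[i] * (x[i] - mean[i]) for i in range(n)) ** 2
          for x in X) / N_pts
rhs = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
print(lhs, rhs)
```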

Lemma: If \(\mu\) is logconcave, then
\[\inf_{H \subseteq \R^n : \mu(H) \leq 1/2} \frac{\mu(\partial H)}{\mu(H)} \asymp \frac{1}{\sqrt{\|A\|_{op}}}\,,\]
where \(\|A\|_{op}\) denotes the maximum eigenvalue of \(A\).
This lemma is not too difficult to prove given that if \(\mu\) is logconcave and \(X\) has law \(\mu\),
then \(\langle v,X\rangle\) is also logconcave for any \(v \in \R^n\).

For distributions that are more logconcave than a Gaussian, one can prove that an analogous bound holds.
Gaussian Factor Lemma: Suppose that the density of \(\mu\) is proportional to \(f(x)\, e^{-\langle x, B x\rangle/2}\) for some PSD matrix \(B\)
and a logconcave function \(f : \R^n \to \R_+\). Then,
\[\psi_{\mu} \gtrsim \frac{1}{\sqrt{\|B^{-1}\|_{op}}}\,.\]
Since the product of logconcave functions is logconcave, the measure \(\mu\) is logconcave.
This lemma can be proved using localization (the approach we sketched last lecture) to reduce it to the \(1\)-dimensional case.
Note that \(B^{-1}\) is precisely the covariance matrix of the Gaussian with density proportional to \(e^{-\langle x,Bx\rangle/2}\):
\[\int e^{-\langle x,Bx\rangle/2}\, xx^{\top}\,dx = c_B \int e^{-\|x\|^2/2} (B^{-1/2} x) (B^{-1/2} x)^{\top}\,dx = c_B\, B^{-1/2} \left(\int e^{-\|x\|^2/2}\, xx^{\top} \,dx\right) B^{-1/2}\,,\]
where \(c_B\) is the constant resulting from the substitution \(x \mapsto B^{-1/2} x\). The quantity in parentheses is a multiple of the identity.
Therefore if a measure is logconcave with respect to a Gaussian distribution, its KLS constant is at least (up to a universal constant factor)
as good as that for the Gaussian.

Brownian motion: An \(n\)-dimensional Brownian motion is a stochastic process \((B_t : t \geq 0)\) such that
\(B_s = \int_0^s dB_t\), where \(dB_t\) can be thought of as a random variable independent of \((B_u : 0 \leq u \leq t)\)
with law \(N(0, dt \cdot I)\) (the law of a spherical \(n\)-dimensional Gaussian with covariance \(dt \cdot I\)).

Stochastic localization: Suppose that a logconcave measure \(\mu\) has density \(f : \R^n \to \R_+\).
We will define a stochastic process on densities: Take \(f_0(x)=f(x)\) and
\[d f_t(x) = f_t(x) \langle x-a_t, dB_t\rangle\,,\]
where \((B_t : t \geq 0)\) is an \(n\)dimensional Brownian motion, and
\[a_t \seteq \int_{\R^n} x f_t(x)\,dx\]
is the center of mass of \(f_t\).
Note that if \(f_0(x)=f(x)\) is a density, then so is \(f_t\) since
\[\int_{\R^n} d f_t(x)\,dx = \left\langle \int_{\R^n} f_t(x) (x-a_t)\,dx,\; dB_t\right\rangle = 0\,.\]
Let \(\mu_t\) be the corresponding measure. Note that \(f_t(x)\) is a martingale: \(\E[d f_t(x)] = 0\).
Martingale property: \(\E[\mu_t(S)] = \mu(S)\) for all \(S \subseteq \R^n\).
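For a measure with finite support, the stochastic localization dynamics reduce to an SDE on the weight vector, and the martingale property can be observed directly. Below is a minimal Euler-discretization sketch (Python standard library; the support points, step size, and horizon are illustrative, and Euler discretization only approximates the SDE):

```python
import random

# Stochastic localization for a finitely supported measure: the weights
# evolve by dw_i = w_i <x_i - a_t, dB_t>, where a_t is the center of
# mass.  Since sum_i w_i (x_i - a_t) = 0, total mass is conserved, and
# mu_t(S) is a martingale: its average over runs stays near mu(S).
random.seed(7)

pts = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.5), (0.5, -0.5)]   # support
w0 = [0.25, 0.25, 0.25, 0.25]                              # mu
S = {0, 1}                                                 # mu(S) = 0.5
dt = 1e-3

def run(steps):
    w = list(w0)
    for _ in range(steps):
        a = [sum(wi * p[d] for wi, p in zip(w, pts)) for d in range(2)]
        dB = [random.gauss(0.0, dt ** 0.5) for _ in range(2)]
        w = [wi * (1.0 + sum((p[d] - a[d]) * dB[d] for d in range(2)))
             for wi, p in zip(w, pts)]
        w = [max(wi, 0.0) for wi in w]
        total = sum(w)            # = 1 up to discretization error
        w = [wi / total for wi in w]
    return sum(w[i] for i in S)

avg = sum(run(200) for _ in range(300)) / 300
print(avg)   # close to mu(S) = 0.5
```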

For a sufficiently nice function \(f : \R \to \R\), we can approximate the change in \(f\) by its Taylor expansion
\[df(x) = f(x+dx) - f(x) = f'(x) \,dx + \frac12 f''(x) \,dx^2 + \cdots\,.\]
Dividing both sides by \(dx\) and taking \(dx \to 0\), all terms but the first vanish.
But now suppose $B_t$ is a \(1\)-dimensional Brownian motion and we compute
\[d f(B_t) = f(B_{t+dt}) - f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dB_t^2 + \cdots\,.\]
Then \(dB_t\) has law \(N(0,dt)\), hence \(\E[dB_t^2]=dt\). In other words, \((B_t)\) has nontrivial quadratic variation.
That means we cannot neglect the 2nd order terms. Ito’s lemma tells us that we need only go to second order so that
\[d f(B_t) = f'(B_t) \,dB_t + \frac12 f''(B_t) \,dt\,.\]
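The one-dimensional formula can be sanity-checked by simulation: for \(f(x) = x^2\) it gives \(d(B_t^2) = 2B_t\,dB_t + dt\), so only the drift survives in expectation and \(\E[B_t^2] = t\). A minimal Python sketch (standard library only; step and trial counts are arbitrary):

```python
import random

# Ito's formula for f(x) = x^2: d(B_t^2) = 2 B_t dB_t + dt, so
# E[B_t^2] = t.  Direct simulation of discretized Brownian motion
# reproduces this (here t = 1).
random.seed(8)

def simulate_B_squared(t=1.0, steps=100, trials=4000):
    total = 0.0
    for _ in range(trials):
        B, dt = 0.0, t / steps
        for _ in range(steps):
            B += random.gauss(0.0, dt ** 0.5)
        total += B * B
    return total / trials

est = simulate_B_squared()
print(est)
```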
Let’s consider a more general version for stochastic processes in \(n\) dimensions.

Stochastic processes and Ito derivatives: Consider a process \((X_t : t \geq 0)\) defined by the stochastic differential equation
\[dX_t = u_t \,dt + \Sigma_t dB_t\,,\]
where \(u_t \in \R^m\) and \(\Sigma_t \in \R^{m \times n}\).
Let \(f : \R^m \to \R\) be a sufficiently nice (e.g., continuously differentiable) function. Then Ito’s lemma tells us
how to compute the time derivative of \(f(X_t)\):
\[df(X_t) = \sum_{i=1}^m \partial_i f(X_t) dX_t^i + \frac12 \sum_{i,j} \partial_{ij} f(X_t) d[X_t^i,X_t^j]_t\,,\]
where \([X_t^i, X_t^j]_t\) is the quadratic variation
\[[X_t^i, X_t^j]_t = \int_0^t \left(\Sigma_s \Sigma_s^{\top}\right)_{i,j} \,ds\]
so that
\[d [X_t^i, X_t^j]_t = \left(\Sigma_t \Sigma_t^{\top}\right)_{i,j}\,dt\,.\]
Note that \(\Sigma_t \Sigma_t^{\top}\) is the covariance matrix of the random vector \(\Sigma_t dB_t\).

We can use this to figure out what \(f_t\) looks like:
Here for a fixed \(x \in \R^n\), we take \(X_t = f_t(x)\) so that \(\Sigma_t = f_t(x) (x-a_t)^{\top}\).
Then using the Ito lemma:
\[\begin{align*}
d \log f_t(x) = d \log (X_t) &= \frac{d X_t}{X_t} - \frac12 \frac{d[X_t,X_t]_t}{X_t^2} \\
&= \frac{d f_t(x)}{f_t(x)} - \frac12 \frac{d[f_t(x),f_t(x)]_t}{f_t(x)^2} \\
&=\langle x-a_t,dB_t\rangle - \frac12 \|x-a_t\|^2\,dt\,,
\end{align*}\]
where we used
\[d[X_t,X_t]_t = f_t(x)^2\, (x-a_t)^{\top} (x-a_t)\,dt = f_t(x)^2 \|x-a_t\|^2 \,dt\,.\]
Let us simplify:
\[d \log f_t(x) = \langle x, a_t\,dt + dB_t\rangle - \frac12 \|x\|^2\,dt - \left[\langle a_t, dB_t\rangle + \tfrac12 \|a_t\|^2\,dt \right]\]
Note that the terms in brackets don’t depend on \(x\). Thus integrating the logarithm gives
(up to scaling by a constant):
\[f_t(x) = \exp\left( -\frac{t}{2} \|x\|^2 + \int_0^t \langle x,a_s\,ds + dB_s\rangle \right) f(x)\,.\]
If we let \(\mu_t\) denote the corresponding measure, then the Gaussian Factor Lemma says that
(with probability one):
\[\psi_{\mu_t} \gtrsim \sqrt{t}\,.\]

Now consider \(S \subseteq \R^n\) with \(\mu(S)=1/2\) and suppose there is a time \(t_0\) such that
\[\begin{equation}\label{eq:goodmeas}
\P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)] \geq \frac{1}{10}\,.
\end{equation}\]
Use the Martingale Property of \(\mu_t\) to write
\[\mu(\partial S) = \E[\mu_{t_0}(\partial S)] \geq \frac18 \psi_{\mu_{t_0}} \P[\tfrac14 \mu(S) \leq \mu_{t_0}(S) \leq \tfrac34 \mu(S)]
\gtrsim \psi_{\mu_{t_0}} \gtrsim \sqrt{t_0}\,.\]

Thus if we can find a lower bound on \(t_0\), we have proved a lower bound on \(\psi_{\mu}\).
In the next lecture, we will show that \eqref{eq:goodmeas} holds for some \(t_0 \asymp (\tr(A^2))^{-1/2}\).
This yields
\[\psi_{\mu} \gtrsim \frac{1}{\tr(A^2)^{1/4}}\,.\]
If \(\mu\) is isotropic, this value is \(n^{-1/4}\), implying that \(\psi_n \gtrsim n^{-1/4}\).
Supplementary reading:
Videos:
Two talks by Ronen @ IAS:

Mon, Feb 06
Tracking the covariance

Notes
Reducing to control of the covariance

Recall from last time that \(f : \R^n \to \R_+\) is a logconcave density on \(\R^n\) and we use \(\mu\)
to denote the associated probability measure.

We evolve \(f\) according to a stochastic differential equation (SDE) given by \(f_0(x)=f(x)\) and
\[d f_t(x) = f_t(x) \langle x - a_t, dB_t \rangle\,,\]
where \((B_t : t \geq 0)\) is an $n$-dimensional Brownian motion and
\[a_t = \int_{\R^n} x f_t(x)\,dx\]
is the center of mass of \(f_t\). Let \(\mu_t\) be the measure corresponding to the density \(f_t\).

We used Itô’s Lemma to calculate
\[\begin{equation}\label{eq:gfactor}
f_t(x) \propto \exp\left(-\frac{t}{2} \|x\|^2 + \int_0^t \langle x,a_s \,ds + dB_s\rangle\right) f(x)\,.
\end{equation}\]
The Gaussian factor here is important for us because, according to the Gaussian Factor Lemma from last lecture,
this yields \(\psi_{\mu_t} \gtrsim \sqrt{t}\).

Note that \(f_t(x)\) is a martingale for every \(x \in \R^n\): If $\mathcal{F}_t$ denotes the filtration
generated by \((B_s : 0 \leq s \leq t)\), then
\[\E[df_t(x) \mid \mathcal{F}_t] = f_t(x) \left\langle x-a_t, \E[dB_t \mid \mathcal{F}_t]\right\rangle = 0.\]
This means that for every set \(S \subseteq \R^n\), we have \(\E[\mu_t(S)]=\mu(S)\).

So now suppose we start with \(S \subseteq \R^n\) with \(\mu(S) = 1/2\). We want to prove a lower bound on \(\mu(\partial S)\) as follows:
At any time \(t \geq 0\),
\[\mu(\partial S) = \E[\mu_t(\partial S)] \gtrsim \psi_{\mu_t} \P[\tfrac14 \leq \mu_t(S) \leq \tfrac34] \gtrsim \sqrt{t} \cdot \P[\tfrac14 \leq \mu_t(S) \leq \tfrac34]\,.\]
So now we are left to solve the following problem: For how long does \(\mu_t(S)\) stay bounded between \(1/4\) and \(3/4\)?

So let’s analyze how \(\mu_t(S)\) evolves using Itô’s lemma.
Define
\[m_t \seteq \mu_t(S) = \int_S f_t(x) \,dx\,.\]
Using the SDE for $f_t$, we have
\[d m_t = \int_S df_t(x)\,dx = \left\langle \int_S (x-a_t) f_t(x)\,dx, dB_t\right\rangle.\]

Dubins–Schwarz: While we don’t need this technically, it’s useful to keep in mind. Roughly speaking, all reasonable martingales are a reparameterized Brownian motion.
Given a \(1\)dimensional stochastic process \(dm_t = u_t \,dt + \Sigma_t dB_t\) (so \(\Sigma_t \in \R^{1 \times n}\)), recall the quadratic variation
\[[m_t]_t = \int_0^t \Sigma_s \Sigma_s^{\top} \,ds\,.\]
Define the stopping time \(\tau_s \seteq \inf \{ t : [m_t]_t > s \}\), i.e., the first time at which we have collected quadratic variation \(s\).
Then (taking \(u_t \equiv 0\), so that \(m_t\) is a martingale) $m_{\tau_s}$ and $B_s$ (with \(B_0 = m_0\)) have the same law.
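To illustrate (my own example, not from the notes): for the martingale \(m_t = \int_0^t s\,dB_s\), the quadratic variation is \([m]_T = \int_0^T s^2\,ds = T^3/3\), so the Dubins–Schwarz picture predicts \(m_T \sim N(0, T^3/3)\). A crude Monte Carlo check of the variance:

```python
import math
import random

def martingale_variance(T=1.0, n_steps=300, n_paths=5000, seed=2):
    """Estimate Var(m_T) for m_t = int_0^t s dB_s; should be T^3/3."""
    rng = random.Random(seed)
    dt = T / n_steps
    samples = []
    for _ in range(n_paths):
        m, t = 0.0, 0.0
        for _ in range(n_steps):
            m += t * rng.gauss(0.0, math.sqrt(dt))  # Sigma_s = s, deterministic
            t += dt
        samples.append(m)
    mean = sum(samples) / n_paths
    return sum((x - mean) ** 2 for x in samples) / n_paths
```

With \(T = 1\) the estimate should be close to \(1/3\), matching the quadratic variation collected by time \(T\).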

Controlling \(\mu_t(S)\) by controlling the operator norm of the covariance.
This means that in order to control \(\P(\tfrac14 \leq m_t \leq \frac34)\), we just need to control the quadratic variation:
\[[m_t]_t = \int_0^t \left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 \,ds\,.\]
Note that
\[\begin{align*}
\left\|\int_S (x-a_s) f_s(x)\,dx\right\|^2 &= \max_{\|v\| \leq 1} \left(\int_S \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\
&=\max_{\|v\| \leq 1} \left(\int_{\R^n} \mathbf{1}_S(x) \langle v, x-a_s \rangle f_s(x)\,dx\right)^2 \\
&\leq
\max_{\|v\| \leq 1} \left(\int_{\R^n} \langle v, x-a_s \rangle^2 f_s(x)\,dx\right) \left(\int_{\R^n} \mathbf{1}_S(x)^2 f_s(x)\,dx\right),
\end{align*}\]
where the last line is Cauchy–Schwarz. The second integral is precisely \(\mu_s(S) \leq 1\), and the first integral is
\(\langle v, A_s v\rangle\), where \(A_s\) is the covariance matrix of \(\mu_s\). So we have
\[[m_t]_t \leq \int_0^t \max_{\|v\| \leq 1} \langle v, A_s v\rangle\,ds = \int_0^t \|A_s\|_{op}\,ds\]

So we need to see how large \(t\) can be so that \(\int_0^t \|A_s\|_{op}\,ds \leq 0.1\).
The operator norm \(\|A_s\|_{op}\) (the maximum eigenvalue of \(A_s\)) is not easy
to track as time evolves because it is defined as a maximum over directions.
Thus we will use a crude upper bound:
\[\|A_t\|_{op} \leq \tr(A_t^2)^{1/2}\,.\]
Key Lemma: Let \(A\) be the covariance matrix of \(\mu\). Then for some constant $c > 0$ and \(0 \leq t \leq c (\tr(A^2))^{-1/2}\), we have
\[\tr(A_t^2) \lesssim \tr(A^2)\,.\]
The Key Lemma implies that \([m_t]_t \lesssim \int_0^t \tr(A_s^2)^{1/2}\,ds \leq t \cdot \max_{0 \leq s \leq t} \tr(A_s^2)^{1/2} \lesssim t \cdot \tr(A^2)^{1/2}\,.\)
And therefore for some choice of \(T \gtrsim (\tr(A^2))^{-1/2}\), we have \(\Pr([m_{T}]_T \ll 1) \gtrsim 1\), and therefore \(\Pr(\tfrac14 \leq m_T \leq \tfrac34) \gtrsim 1\).

Now \eqref{eq:gfactor} and the Gaussian Factor Lemma together tell us that \(\psi_{\mu} \gtrsim \sqrt{T} \gtrsim \tr(A^2)^{-1/4}\).
If \(\mu\) is in isotropic position, then \(A=I\) and we get \(\psi_{\mu} \gtrsim n^{-1/4}\).
Controlling the covariance

We first need to compute the Itô derivative of the covariance
\[A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} f_t(x)\,dx\,.\]

For this it helps to extend Itô’s lemma notationally to handle a pair of stochastic processes:
Suppose that \(dX_t = u_t \,dt + M_t\,dB_t\) and \(dY_t = v_t\,dt + N_t\,dB_t\), then
\[\begin{align*}
d f(X_t, Y_t) &= \sum_i \partial_{x^i} f(X_t,Y_t) dX_t^i + \sum_j \partial_{y^j} f(X_t,Y_t) dY_t^j \\
&+ \frac12 \sum_{i,i'} \partial_{x^i,x^{i'}} f(X_t,Y_t) d[X_t^i,X_t^{i'}]_t
+ \frac12 \sum_{j,j'} \partial_{y^j,y^{j'}} f(X_t,Y_t) d[Y_t^{j},Y_t^{j'}]_t + \sum_{i,j} \partial_{x^i,y^j} f(X_t,Y_t) d[X_t^i, Y_t^j]_t\,,
\end{align*}\]
where
\[[X_t,Y_t]_t = \int_0^t M_s N_s^{\top}\,ds\,.\]
Note that this is the same formula as before (by lifting to a single process \(Z_t = X_t \oplus Y_t\)), but it gives
guidance to our next calculation.

Write:
\[d A_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} d f_t(x)\,dx - \int_{\R^n} \left((d a_t)(x-a_t)^{\top}\right) f_t(x)\,dx -
\int_{\R^n} \left((x - a_t)(d a_t)^{\top}\right) f_t(x)\,dx + \cdots\,,\]
where \(\cdots\) represents the second order derivatives.
The first term is equal to \(\int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx.\)
The second two terms both evaluate to zero since \(\int (x-a_t) f_t(x)\,dx = 0\) by definition of \(a_t\).
Now let’s calculate the secondorder derivatives:
\[\frac12 \cdot 2 \cdot d[a_t,a_t]_t \int f_t(x) \,dx - \frac{1}{2} \cdot 2 \cdot \int (x-a_t) \left(d[a_t,f_t(x)]_t\right)^{\top}\,dx - \frac{1}{2} \cdot 2 \cdot \int d[a_t,f_t(x)]_t\, (x-a_t)^{\top}\,dx\]

Noting that \(d a_t = A_t\, dB_t\) gives \(d[a_t,a_t]_t = A_t^2\,dt\).

We also have
\[d[a_t, f_t(x)]_t = A_t (x-a_t) f_t(x) \,dt\,,\]
hence the second order terms evaluate to
\[A_t^2\,dt - 2 A_t \left(\int f_t(x) (x-a_t) (x-a_t)^{\top}\,dx\right) dt = - A_t^2\,dt\,.\]
Altogether this gives
\[dA_t = \int_{\R^n} (x-a_t)(x-a_t)^{\top} \langle x-a_t, dB_t\rangle f_t(x)\,dx - A_t^2\,dt\]

Now we wish to analyze the evolution of \(\Phi_t = \tr(A_t^2)\), and another application of Itô’s lemma gives
\[\begin{align*}
d \Phi_t = -2 \tr(A_t^3)\,dt &+ \int \int \left((x-a_t)^{\top} (y-a_t)\right)^3 f_t(x) f_t(y)\,dx\,dy\,dt \\ &+ 2 \int (x-a_t)^{\top} A_t (x-a_t) \langle x-a_t,dB_t\rangle f_t(x)\,dx
\end{align*}\]
We want to ensure that \(\Phi_t\) is small for \(t \in [0,T]\), so we can ignore the term \(-2 \tr(A_t^3)\,dt\) (since \(A_t\) is PSD, this is always a negative term).
If \(\hat{\mathbf{X}},\hat{\mathbf{Y}}\) are independent random variables with law \(\mu_t\)
and \(\mathbf{X} \seteq \hat{\mathbf{X}} - \E[\hat{\mathbf{X}}]\),
\(\mathbf{Y} \seteq \hat{\mathbf{Y}} - \E[\hat{\mathbf{Y}}]\),
then discarding the negative term gives
\[d \Phi_t \leq \E \langle \mathbf{X}, \mathbf{Y}\rangle^3\,dt + 2\,\E \left[\langle \mathbf{X}, A_t \mathbf{X}\rangle \mathbf{X}^{\top} \right] dB_t\]

Reverse Hölder inequality: For any logconcave random vector \(\mathbf{X}\) in \(\R^n\) and \(k \geq 0\),
\[(\E \|\mathbf{X}\|^k)^{1/k} \leq 2k \left(\E \|\mathbf{X}\|^2\right)^{1/2}\,.\]
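A quick empirical check (my addition): take \(n=1\) and \(\mathbf{X}\) standard exponential, a logconcave distribution with \(\E \abs{\mathbf{X}}^k = k!\), so for \(k=4\) the left side is \(24^{1/4} \approx 2.21\) while the right side is \(8\sqrt{2} \approx 11.3\):

```python
import random

def reverse_holder_check(k=4, n=50000, seed=3):
    """Empirical moments of X ~ Exp(1): compare (E X^k)^(1/k) with 2k (E X^2)^(1/2)."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n)]
    lhs = (sum(x ** k for x in xs) / n) ** (1.0 / k)
    rhs = 2 * k * (sum(x * x for x in xs) / n) ** 0.5
    return lhs, rhs
```

The inequality is far from tight here; its value is that the constant \(2k\) is uniform over all logconcave laws.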

Using this inequality one obtains
\[d \Phi_t \lesssim \Phi_t^{3/2}\,dt + \Phi_t^{5/4} dB_t\,.\]

To analyze this, let us assume that we run the process until the first time \(\tau_0\) such that \(\Phi_{\tau_0} = (1+c) \tr(A^2)\).
Then for \(t \in [0,\tau_0]\),
\[d \Phi_t \lesssim \tr(A^2)^{3/2}\,dt + \tr(A^2)^{5/4} dB_t\,,\]
and therefore
\[c\, \tr(A^2) = \Phi_{\tau_0} - \Phi_{0} = \int_{0}^{\tau_0} d\Phi_t \lesssim \tau_0 \tr(A^2)^{3/2} + W_{\tau}\,,\]
where \(\tau \leq \tau_0 \tr(A^2)^{5/2}\), and \(W_{\tau}\) has the law of a Brownian motion (recall Dubins–Schwarz).
Thus with high probability, this upper bound is
\[\lesssim \tau_0\,\tr(A^2)^{3/2} + \tau_0^{1/2}\,\tr(A^2)^{5/4}\,.\]
We therefore conclude that, with constant probability, \(\tau_0 \gtrsim \tr(A^2)^{-1/2}\), as desired.
Supplementary material:
Videos:

Topological methods 
Wed, Feb 08
BorsukUlam and the Kneser conjecture

Reading:

Mon, Feb 13
The evasiveness conjecture

Reading:

Wed, Feb 15
Simplicial complexes, fixed points, and symmetry

Reading:

Mon, Feb 20
NO CLASS (President's day)


Highdimensional expanders 
Wed, Feb 22
Spectral gaps on simplicial complexes

Notes
Spectral expansion in graphs

Consider an undirected graph $G=(V,E)$ and a probability distribution \(\mu\) on \(E\).
Let \(\mu^{(0)}\) be the probability measure induced on \(V\) as follows: We sample a random edge \(uv \in E\) according
to \(\mu\), and then choose one of the two endpoints uniformly at random.
Equivalently, \(\mu^{(0)}(u) = \frac12 \sum_{v : uv \in E} \mu(uv)\).

Random walk.
We can define a random walk \(\{X_t\}\) on \(V\) as follows. For \(v \in V\), let \(\mu_v\) denote
the distribution \(\mu\) conditioned on the event \(\{ v \in e \}\), i.e.,
\[\mu_v(e) = \frac{\mu(e) \mathsf{1}_e(v)}{\sum_{e' \in E} \mu(e') \mathsf{1}_{e'}(v)}\,.\]
Now given \(X_t = v\), we sample an edge \(e \in E\) according to \(\mu_v\) and let \(X_{t+1}\) be the other endpoint of \(e\).
Define the associated random walk transition matrix by
\[P_{uv} \seteq \frac{\mu(uv)}{\sum_{w : uw \in E} \mu(uw)} = \frac12 \frac{\mu(uv)}{\mu^{(0)}(u)}\,.\]
We will think of \(P\) as a linear operator
on the space of functions \(f : V \to \R\), where \(P f(u) = \sum_{v \in V} P_{uv} f(v)\).
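In code, the construction looks as follows (a minimal sketch; the helper `walk_matrix` and the 3-vertex path are my own illustration). Exact arithmetic makes the detailed-balance identity \(\mu^{(0)}(u) P_{uv} = \tfrac12 \mu(uv)\) visible:

```python
from fractions import Fraction

def walk_matrix(mu):
    """mu maps frozenset edges {u, v} to probabilities summing to 1.
    Returns mu0(u) = (1/2) * sum_{v : uv in E} mu(uv) and the transition
    kernel P[u][v] = mu(uv) / (2 * mu0(u))."""
    mu0 = {}
    for e, p in mu.items():
        for u in e:
            mu0[u] = mu0.get(u, Fraction(0)) + p / 2
    P = {u: {} for u in mu0}
    for e, p in mu.items():
        u, v = tuple(e)
        P[u][v] = p / (2 * mu0[u])
        P[v][u] = p / (2 * mu0[v])
    return mu0, P

# Path a - b - c with mu(ab) = mu(bc) = 1/2:
mu = {frozenset("ab"): Fraction(1, 2), frozenset("bc"): Fraction(1, 2)}
mu0, P = walk_matrix(mu)
# Each row of P sums to 1, and mu0(u) P[u][v] = mu(uv)/2 = mu0(v) P[v][u].
```

Detailed balance is exactly the self-adjointness of \(P\) on \(\ell_2(\mu^{(0)})\) verified below.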

Note that the matrix \(P\) is not necessarily symmetric, i.e., it is not selfadjoint with respect to the standard inner product on $\R^{V}$.
For \(f,g : V \to \R\), define the inner product
\[\langle f,g \rangle_{\mu^{(0)}} \seteq \sum_{u \in V} \mu^{(0)}(u) f(u) g(u)\,.\]
Let us use \(\ell_2(\mu^{(0)})\) to denote the Hilbert space of functions equipped with \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\).
Now let us verify that
\[\langle f, P g\rangle_{\mu^{(0)}} = \sum_{u \in V} \mu^{(0)}(u) f(u) \sum_{v \in V} P_{uv} g(v) = \frac12 \sum_{u,v \in V} \mu(uv) f(u) g(v) =
\sum_{uv \in E} \mu(uv)\, \frac{f(u) g(v) + f(v) g(u)}{2}\,.\]
Evidently, this is also equal to \(\langle Pf, g\rangle_{\mu^{(0)}}\)
so that \(P\) is indeed selfadjoint with respect to \(\langle \cdot,\cdot\rangle_{\mu^{(0)}}\).
In particular, this implies that \(P\) has a complete orthonormal basis of eigenfunctions and associated real eigenvalues.

If we denote by \(\mathbf{1}\) the constant function, then evidently \(P \mathbf{1} = \mathbf{1}\), hence \(P\) has \(1\) as an eigenvalue.
Moreover, \(P\) is an averaging operator (it has nonnegative entries and is row-stochastic), so by Jensen’s inequality,
\[(P f(u))^2 = \left(\sum_{v \in V} P_{uv} f(v)\right)^2 \leq \sum_{v \in V} P_{uv} f(v)^2\,,\]
and therefore
\[\|P f\|_{\mu^{(0)}}^2 = \sum_{u \in V} \mu^{(0)}(u) (P f(u))^2 \leq \sum_{u \in V} \mu^{(0)}(u) \sum_{v \in V} P_{uv} f(v)^2 = \frac12 \sum_{u,v \in V} \mu(uv) f(v)^2 = \sum_{v \in V} \mu^{(0)}(v) f(v)^2 = \|f\|_{\mu^{(0)}}^2\,.\]
This is usually summarized by saying: “\(P\) is a contraction on \(\ell_2(\mu^{(0)})\).” It implies that the spectrum of \(P\) lies in \([-1,1]\).

The spectral gap. Recall that \(1\) is an eigenvalue of \(P\) with eigenvector \(\mathbf{1}\).
The spectral gap of \(P\) is the largest value \(\e\) such that all the other eigenvalues of \(P\) lie in the interval \([-(1-\e),1-\e]\).
Equivalently, this is the largest value \(\e\) such that \(\|P f\|_{\mu^{(0)}} \leq (1-\e) \|f\|_{\mu^{(0)}}\) for all \(f\) with \(\langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0\).
Let us denote this value, the “spectral gap of \(\mu\),” by \(\e(\mu)\).

The Laplacian. It is sometimes more convenient to work with the corresponding Laplacian operator \(L \seteq I - P\).
Note that
\[\begin{align}
\langle f, L g\rangle_{\mu^{(0)}} &= \sum_{u \in V} \mu^{(0)}(u) f(u) \left(g(u) - \sum_{v \in V} P_{uv} g(v)\right) \nonumber \\
&= \frac12 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) - f(u) g(v)\right) \nonumber\\
&= \frac14 \sum_{u,v \in V} \mu(uv) \left(f(u) g(u) + f(v) g(v) - f(u) g(v) - f(v) g(u)\right) \nonumber\\
&= \frac14 \sum_{u,v \in V} \mu(uv) (f(u)-f(v)) (g(u)-g(v)) \nonumber\\
&= \frac12 \sum_{uv \in E} \mu(uv) \left(f(u)-f(v)\right) \left(g(u)-g(v)\right)\,. \label{eq:laplacian}
\end{align}\]
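The identity \eqref{eq:laplacian} can be checked numerically on a small example (the triangle graph and helper below are my own illustration):

```python
import random
from fractions import Fraction

def laplacian_identity_gap(seed=4):
    """Compare <f, Lg> computed from the definition L = I - P with the
    edge-sum (1/2) sum_{uv} mu(uv) (f(u)-f(v)) (g(u)-g(v)) on a triangle."""
    mu = {frozenset(e): Fraction(1, 3) for e in [("a", "b"), ("b", "c"), ("a", "c")]}
    mu0 = {}
    for e, p in mu.items():
        for u in e:
            mu0[u] = mu0.get(u, Fraction(0)) + p / 2
    rng = random.Random(seed)
    f = {u: rng.random() for u in mu0}
    g = {u: rng.random() for u in mu0}

    def Pg(u):  # (Pg)(u) = sum_v P_{uv} g(v), with P_{uv} = mu(uv) / (2 mu0(u))
        s = sum(float(p) * g[(set(e) - {u}).pop()]
                for e, p in mu.items() if u in e)
        return s / (2.0 * float(mu0[u]))

    lhs = sum(float(mu0[u]) * f[u] * (g[u] - Pg(u)) for u in mu0)
    rhs = 0.5 * sum(float(p) * (f[u] - f[v]) * (g[u] - g[v])
                    for e, p in mu.items() for u, v in [tuple(e)])
    return abs(lhs - rhs)
```

The two expressions agree to machine precision for any choice of \(f, g\).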
The eigenvalue \(1\) in \(P\) corresponding to the constant function \(\mathbf{1}\) becomes an eigenvalue \(0\) in \(L\).
We have an equivalent characterization
\[\e(\mu) = \min \left\{ \frac{\langle f, L f\rangle_{\mu^{(0)}}}{\langle f,f\rangle_{\mu^{(0)}}} : \langle f, \mathbf{1}\rangle_{\mu^{(0)}} = 0 \right\}.\]

Poincaré inequalities.
In order to eliminate the algebraic constraint \(\langle f,\mathbf{1}\rangle_{\mu^{(0)}}=0\), we can simply project \(f\) onto the orthogonal complement of
the span of \(\mathbf{1}\), i.e., we can replace \(f\) by \(f  \E_{\mu^{(0)}}[f]\).
Define
\[\mathsf{Var}_{\mu^{(0)}}(f) \seteq \left\|f - \E_{\mu^{(0)}}[f]\right\|_{\mu^{(0)}}^2 = \E_{\mu^{(0)}} \left(f - \E_{\mu^{(0)}}[f]\right)^2 = \E_{\mu^{(0)}}(f^2) - (\E_{\mu^{(0)}} f)^2\,.\]
Then we can write
\[\e(\mu) = \min_f \frac{\langle f-\E_{\mu^{(0)}}[f], L(f-\E_{\mu^{(0)}}[f])\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)}
= \min_f \frac{\langle f,L f\rangle_{\mu^{(0)}}}{\mathsf{Var}_{\mu^{(0)}}(f)}\,,\]
where the second equality uses the fact that \(L\) annihilates constant functions.
This is often written in the form of a Poincaré inequality: For all \(f : V \to \R\), it holds that
\[\mathsf{Var}_{\mu^{(0)}}(f) \leq K(\mu) \langle f,L f\rangle_{\mu^{(0)}}\,,\]
where \(K(\mu) \seteq 1/\e(\mu)\).
Link expansion in simplicial complexes

Let \(X\) be a simplicial complex. The dimension of \(X\) is \(\dim(X) \seteq \max \{ \abs{\sigma} - 1 : \sigma \in X \}\).
Say that a \(d\)dimensional complex \(X\) is pure if every face of \(X\) is contained in some face of dimension \(d\).
Note that a pure \(d\)-dimensional simplicial complex is specified by a \((d+1)\)-uniform hypergraph.
Write \(X \seteq X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(d)}\) where \(X^{(i)}\) contains the faces of \(X\) of dimension \(i\).
Note that \(X^{(-1)} = \{ \emptyset \}\).

Induced measures.
Suppose now that \(\pi^{(d)}\) is a probability measure on \(X^{(d)}\). In analogy with the graph case, we can define
probability measures \(\pi^{(0)}, \pi^{(1)},\ldots, \pi^{(d-1)}\) as follows:
To sample from \(\pi^{(i)}\), choose \(\sigma \in X^{(d)}\) according to \(\pi^{(d)}\) and then remove \(d-i\) uniformly
random vertices from \(\sigma\).
Equivalently, we can define the measures inductively by
\[\pi^{(i-1)}(\tau) \seteq \sum_{\sigma \in X^{(i)} : \tau \subseteq \sigma} \frac{\pi^{(i)}(\sigma)}{i+1}\,.\]
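A small sketch of this construction (the function and the complete-complex example are my own):

```python
from fractions import Fraction
from itertools import combinations

def induced_measures(top_faces):
    """top_faces: dict from frozenset faces (all of size d+1) to probabilities.
    Returns [pi^(d), pi^(d-1), ..., pi^(0)] using the inductive rule
    pi^(i-1)(tau) = sum_{sigma in X^(i), tau subset sigma} pi^(i)(sigma)/(i+1)."""
    levels = [dict(top_faces)]
    d = len(next(iter(top_faces))) - 1  # dimension of top faces
    for i in range(d, 0, -1):
        nxt = {}
        for sigma, p in levels[-1].items():
            for tau in combinations(sorted(sigma), i):  # the i+1 faces of dim i-1
                key = frozenset(tau)
                nxt[key] = nxt.get(key, Fraction(0)) + p / (i + 1)
        levels.append(nxt)
    return levels

# Uniform measure on the four triangles of the complete complex on 4 vertices;
# by symmetry every induced level is uniform as well.
tops = {frozenset(t): Fraction(1, 4) for t in combinations(range(4), 3)}
levels = induced_measures(tops)
```

Each level is a probability measure, matching the description of sampling a top face and deleting uniformly random vertices.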

Links. Recall that for \(\sigma \in X\), the link of \(\sigma\) in \(X\) is defined by
\[X_{\sigma} \seteq \{ \tau \in X : \sigma \cap \tau = \emptyset \textrm{ and } \sigma \cup \tau \in X \}\,.\]
Note that if \(X\) is a graph, i.e., a \(1\)-dimensional simplicial complex, then for a vertex \(v \in X^{(0)}\),
\(X_v\) is precisely the set of neighbors of \(v\) in \(X\).

Induced measures on links. Given a face \(\sigma \in X^{(i)}\), the link \(X_{\sigma}\) is a pure \((d-i-1)\)-dimensional complex.
We can define the induced measure \(\pi_{\sigma}^{(d-i-1)}\) in the straightforward way (as a conditional measure):
For any top-dimensional face \(\tau \in X_{\sigma}^{(d-i-1)}\),
\[\pi_{\sigma}^{(d-i-1)}(\tau) \seteq \frac{\pi^{(d)}(\sigma \cup \tau)}{\sum_{\tau' \in X_{\sigma}^{(d-i-1)}} \pi^{(d)}(\sigma \cup \tau')}
= \P_{\rho \sim \pi^{(d)}}\left[\rho = \sigma \cup \tau \mid \sigma \subseteq \rho\right].\]
Repeating the earlier construction, we obtain measures \(\pi_{\sigma}^{(0)}, \pi_{\sigma}^{(1)}, \ldots, \pi_{\sigma}^{(d-i-1)}\).

Skeletons. The \(k\)-skeleton of a simplicial complex \(X\) is precisely the complex \(X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(k)}\).

Link expansion. Say that a pure \(d\)-dimensional complex
\(X\) with measure \(\pi^{(d)}\) is an \(\e\)-spectral link expander if, for every \(\sigma \in X^{(-1)} \cup X^{(0)} \cup \cdots \cup X^{(d-2)}\), it holds that
the \(1\)-skeleton of \(X_\sigma\) with the induced measure \(\pi_{\sigma}^{(1)}\) satisfies \(\e(\pi_{\sigma}^{(1)}) \geq \e\).
Theorem

Trickle step: Suppose that \(X\) is a pure \(2\)-dimensional simplicial complex with a probability measure \(\pi^{(2)}\) on \(X^{(2)}\) and that

(a) The \(1\)-skeleton \((X^{(0)}, X^{(1)})\) is connected. More precisely, \(\e(\pi^{(1)}) > 0\).

(b) For every vertex \(v \in X^{(0)}\), the (\(1\)-dimensional) link \(X_v\) satisfies \(\e(\pi_v^{(1)}) \geq \e\).
Then \(\e(\pi^{(1)}) \geq 2 - 1/\e\).

Note that the theorem is only nontrivial for \(\e \geq 1/2\), i.e., the links have to have a substantial spectral gap.
Proof:

Let \(L\) denote the Laplace operator on the graph \((X^{(0)}, X^{(1)})\) equipped with the measure \(\mu = \pi^{(1)}\).
Use \eqref{eq:laplacian} to write
\[\begin{align*}
\langle f, L f \rangle_{\pi^{(0)}}
&= \frac12 \E_{uw \sim \pi^{(1)}} (f(u)-f(w))^2 \\
&= \frac12 \E_{v \sim \pi^{(0)}} \E_{uw \sim \pi_v^{(1)}} (f(u)-f(w))^2 \\
&= \E_{v \sim \pi^{(0)}} \langle f, L_v f\rangle_{\pi_v^{(0)}}\,,
\end{align*}\]
where \(L_v\) is the Laplacian on \(X_v\) associated to the edge measure \(\mu = \pi_v^{(1)}\).
Now using assumption (b) that \(\e(\pi_v^{(1)}) \geq \e\) gives
\[\langle f, L f\rangle_{\pi^{(0)}} \geq \e\ \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}}(f)\,.\]

The law of total variance:
If \(X\) and \(Y\) are random variables, then
\[\mathsf{Var}(Y) = \E[\mathsf{Var}(Y \mid X)] + \mathsf{Var}(\E[Y \mid X]).\]
Note that \(\E_{u \sim \pi_v^{(0)}} f(u) = P f(v)\). Thus by the law of total variance, we have
\[\mathsf{Var}_{\pi^{(0)}}(f) = \E_{v \sim \pi^{(0)}} \mathsf{Var}_{\pi_v^{(0)}} (f) + \mathsf{Var}_{\pi^{(0)}} (P f)\,.\]
We conclude that
\[\begin{equation}\label{eq:penult}
\langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(\mathsf{Var}_{\pi^{(0)}}(f) - \mathsf{Var}_{\pi^{(0)}}(Pf) \right).
\end{equation}\]

We are happy as long as \(\mathsf{Var}_{\pi^{(0)}}(Pf) \ll \mathsf{Var}_{\pi^{(0)}}(f)\).
But in some sense this is what we are trying to prove!
Indeed, the spectral gap can be characterized in terms of the rate of decay of the variance
of functions acted on by \(P\). Convergence to the stationary measure \(\pi^{(0)}\) is
precisely the assertion that \(\mathsf{Var}_{\pi^{(0)}}(P^t f) \to 0\) as \(t \to \infty\).
The key reason we can still say something interesting is that \(\mathsf{Var}_{\pi^{(0)}}(Pf)\) involves
terms of the form \(\langle f, P^2 f\rangle_{\pi^{(0)}}\).
To check what happens, let us write everything in terms of the Laplacian:
Observe that \(I - P^2 = 2 (I-P) - (I-P)^2 = 2 L - L^2\), and therefore \eqref{eq:penult} can be equivalently written as
\[\langle f, L f\rangle_{\pi^{(0)}} \geq \e \left(2 \langle f,L f\rangle_{\pi^{(0)}} - \langle f,L^2 f\rangle_{\pi^{(0)}}\right).\]
We immediately conclude that any eigenvalue \(\lambda\) of \(L\) satisfies
\[\lambda \geq \e (2-\lambda) \lambda\,.\]
If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\), implying that the connected components of \((X^{(0)},X^{(1)})\)
all have spectral gap at least \(2-1/\e\). Since we assume this graph to be connected, the desired result follows.
Reading:

Mon, Feb 27
Highdimensional random walks

Notes
Trickle Down Theorem

Suppose that \(X\) is a pure \(d\)-dimensional simplicial complex with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\).
Consider \(0 \leq k \leq d-2\) and assume furthermore that:

(a) For every \(\sigma \in X^{(k-1)}\), the \(1\)-skeleton of \(X_{\sigma}\) is connected. More precisely, that \(\e(\pi^{(1)}_{\sigma}) > 0\).

(b) For every \(\sigma \in X^{(k)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).
Then for every \(\rho \in X^{(k-1)}\), we have \(\e(\pi_{\rho}^{(1)}) \geq 2 - 1/\e\).
Proof:

Fix some \(\rho \in X^{(k-1)}\). Let \(L\) denote the Laplace operator on the \(1\)-skeleton of \(X_{\rho}\)
equipped with the measure \(\mu = \pi^{(1)}_{\rho}\).
Write:
\[\begin{align*}
\langle f, L f \rangle_{\pi_{\rho}^{(0)}}
&= \frac12 \E_{uw \sim \pi_{\rho}^{(1)}} (f(u)-f(w))^2 \\
&= \frac12 \E_{v \sim \pi_{\rho}^{(0)}} \E_{uw \sim \pi_{\rho \cup v}^{(1)}} (f(u)-f(w))^2 \\
&= \E_{v \sim \pi_{\rho}^{(0)}} \langle f, L_v f\rangle_{\pi_{\rho \cup v}^{(0)}}\,,
\end{align*}\]
where \(L_v\) is the Laplacian on the \(1\)-skeleton of \(X_{\rho \cup v}\) associated to the edge measure \(\mu = \pi_{\rho \cup v}^{(1)}\).
Now assumption (b) gives \(\e(\pi_{\rho \cup v}^{(1)}) \geq \e\), and therefore
\[\langle f, L f\rangle_{\pi_{\rho}^{(0)}} \geq \e\ \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}}(f)\,.\]

Note that \(\E_{u \sim \pi_{\rho \cup v}^{(0)}} f(u) = (I-L) f(v)\). Thus by the law of total variance, we have
\[\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) = \E_{v \sim \pi_{\rho}^{(0)}} \mathsf{Var}_{\pi_{\rho \cup v}^{(0)}} (f) + \mathsf{Var}_{\pi_{\rho}^{(0)}} ((I-L) f)\,.\]
We conclude that
\[\begin{align*}
\langle f, L f\rangle_{\pi_{\rho}^{(0)}} &\geq \e \left(\mathsf{Var}_{\pi_{\rho}^{(0)}}(f) - \mathsf{Var}_{\pi_{\rho}^{(0)}}((I-L)f) \right) \\
&= \e \left(2 \langle f,L f\rangle_{\pi_{\rho}^{(0)}} - \langle f,L^2 f\rangle_{\pi_{\rho}^{(0)}}\right).
\end{align*}\]

Therefore if \(\lambda\) is an eigenvalue of \(L\), then
\[\lambda \geq \e (2-\lambda) \lambda\,.\]
If \(\lambda > 0\), this gives \(\lambda \geq 2-1/\e\).
Now assumption (a) asserts that only the trivial eigenvalue of \(L\) is \(0\), and the desired result follows.
Corollary

Suppose that \(X\) is a pure \(d\)dimensional simplicial complex equipped with a probability measure \(\pi^{(d)}\) on \(X^{(d)}\), and that

(a) For every face \(\sigma \in X\) with \(\dim(\sigma) \leq d-2\), it holds that the \(1\)-skeleton of \(X_{\sigma}\) is connected.
More precisely, \(\e(\pi_{\sigma}^{(1)}) > 0\).

(b) For every face \(\sigma \in X^{(d-2)}\), there is a uniform spectral gap \(\e(\pi_{\sigma}^{(1)}) \geq \e\).
Then for every \(-1 \leq k \leq d-2\) and \(\sigma \in X^{(k)}\), it holds that
\[\e(\pi_{\sigma}^{(1)}) \geq 1 - \left(d-(k+1)\right) (1-\e) \geq 1 - d(1-\e) \,.\]

Note that to derive a nontrivial bound for the \(1\)-skeleton of \(X\), we need a very strong
spectral gap for the \((d-2)\)-links: \(\e \geq 1-1/d\).
Proof:

The proof is by induction using the Trickle Down Theorem:
Consider \(\sigma \in X^{(k)}\):

\(k=d-2\). By assumption, \(\e(\pi_{\sigma}^{(1)}) \geq \e\).

\(k=d-3\). The theorem yields \(\e(\pi_{\sigma}^{(1)}) \geq 2 - 1/\e \geq 1 - 2(1-\e)\), where the last inequality holds for \(\e \geq 1/2\).

\(k=d-4\). The theorem yields \(\e(\pi_{\sigma}^{(1)}) \geq 2 - \frac{1}{1-2(1-\e)} \geq 1 - 3(1-\e)\), where the last inequality holds for \(\e \geq 2/3\).

Etc. using the inequality
\[2 - \frac{1}{1-(j-1)(1-\e)} \geq 1 - j (1-\e),\qquad \e \geq 1 - \frac{1}{j}\,.\]
Finally, note that the constraint \(\e \geq 1 - 1/j\) is unnecessary since the spectral gap is always at least \(0\).
Random walks

Example: Sampling random spanning trees.
Consider an undirected graph \(G=(V,E)\) and let \(\mathcal{T}\) denote the set of spanning trees in \(G\) (where
we think of a spanning tree as a subset \(T \subseteq E\)).
The downward closure \(X\) of \(\mathcal{T}\) is an \((n-2)\)-dimensional simplicial complex, where \(n = \abs{V}\).
The edge exchange walk on \(\mathcal{T}\) is the random walk defined as follows: If \(T\) is a spanning tree,
we remove a uniformly random edge \(e \in E(T)\) and then define \(T' \seteq (T \setminus \{e\}) \cup \{e'\}\), where
\(e' \notin E(T)\) is a uniformly random edge such that \(T' \in \mathcal{T}\).
Our goal now is to define highdimensional random walks so that the associated spectral gaps
allow us to analyze the mixing time of such a random process. The edge exchange walk will be
a particular example of the “down-up” walk analyzed next.
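To make the example concrete, here is a brute-force construction of the edge-exchange walk for a small graph (my own sketch; `spanning_trees` enumerates trees naively, so it only suits tiny graphs). On the 4-cycle, every row and column of the resulting kernel sums to 1, so the uniform measure on trees is stationary:

```python
from itertools import combinations

def spanning_trees(vertices, edges):
    """All spanning trees of a small graph, as frozensets of edges (brute force)."""
    trees = []
    for sub in combinations(edges, len(vertices) - 1):
        parent = {v: v for v in vertices}
        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v
        acyclic = True
        for u, v in sub:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:  # n-1 acyclic edges => spanning tree
            trees.append(frozenset(sub))
    return trees

def exchange_walk(vertices, edges):
    """Transition kernel of the edge-exchange walk on spanning trees."""
    trees = spanning_trees(vertices, edges)
    tset = set(trees)
    P = {}
    for T in trees:
        row = {}
        for e in T:  # remove a uniformly random tree edge
            swaps = [e2 for e2 in edges
                     if e2 not in T and (T - {e}) | {e2} in tset]
            for e2 in swaps:  # then add a uniformly random valid replacement
                T2 = (T - {e}) | {e2}
                row[T2] = row.get(T2, 0.0) + 1.0 / (len(T) * len(swaps))
        P[T] = row
    return trees, P

cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
trees, P = exchange_walk([0, 1, 2, 3], cycle)
```

On the 4-cycle each spanning tree omits exactly one edge, so the walk is simply a uniform move among the other three trees.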

Let \(X\) denote a pure \(d\)dimensional simplicial complex equipped with a probability
measure \(\pi^{(d)}\) on \(X^{(d)}\).
Let us define up and down walk operators \(P_{\nearrow k} : \ell_2(\pi^{(k-1)}) \to \ell_2(\pi^{(k)})\) and
\(P_{\searrow k} : \ell_2(\pi^{(k+1)}) \to \ell_2(\pi^{(k)})\) by, for \(\sigma \in X^{(k)}\),
\[\begin{align*}
P_{\nearrow k} f(\sigma) &\seteq \E_{v \in \sigma} f(\sigma \setminus v) \\
P_{\searrow k} f(\sigma) &\seteq \E_{v \sim \pi_{\sigma}^{(0)}} f(\sigma \cup v)\,.
\end{align*}\]
Note that these are “co”-operators: They operate on functions. This means that the adjoint of the “up” operator \(P_{\nearrow k}^{\top}\)
acts in the opposite direction on distributions:
\[\begin{align*}
\pi^{(k)} &= P_{\searrow k-1}^{\top} \pi^{(k-1)} \\
\pi^{(k-1)} &= P_{\nearrow k}^{\top} \pi^{(k)}\,.
\end{align*}\]
Correspondingly, we obtain two different walk operators on \(X^{(k)}\): The up-down walk and the down-up walk, respectively
\[\begin{align*}
P_k^{\triangle} &\seteq P_{\searrow k} P_{\nearrow k+1} \\
P_k^{\triangledown} &\seteq P_{\nearrow k} P_{\searrow k1}\,.
\end{align*}\]
Let us also define the non-lazy up-down walk operator by
\[P_k^{\wedge} \seteq P_k^{\triangle} + \frac{1}{k+1} (P_k^{\triangle} - I)\,.\]

It is not difficult to check that all three walks have \(\pi^{(k)}\) as the stationary measure.
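Here is a small sketch checking stationarity of the down-up walk on top faces (the helper and the complete-complex example are mine); exact arithmetic makes \(\pi^{(d)}\)-stationarity an identity of Fractions:

```python
from fractions import Fraction
from itertools import combinations

def down_up_kernel(top_faces):
    """Down-up walk on top faces: drop a uniform vertex v of sigma, then add
    u with probability proportional to pi^(d)((sigma minus v) + u)."""
    P = {}
    for sigma in top_faces:
        row = {}
        for v in sigma:
            tau = sigma - {v}
            ext = {s: q for s, q in top_faces.items() if tau < s}
            z = sum(ext.values())
            for s, q in ext.items():
                row[s] = row.get(s, Fraction(0)) + Fraction(1, len(sigma)) * q / z
        P[sigma] = row
    return P

tops = {frozenset(t): Fraction(1, 4) for t in combinations(range(4), 3)}
P = down_up_kernel(tops)
# Stationarity: sum_sigma pi(sigma) P[sigma][s] == pi(s) for every top face s.
```

The walk is lazy (dropping and re-adding the same vertex is allowed), which is exactly the laziness removed by the non-lazy operator above.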

Let us define the corresponding Laplacians
\[\begin{align*}
L_k^{\triangle} &\seteq I - P_k^{\triangle} \\
L_k^{\triangledown} &\seteq I - P_k^{\triangledown} \\
L_k^{\wedge} &\seteq I - P_k^{\wedge} = \frac{k+2}{k+1} L_k^{\triangle}\,.
\end{align*}\]
Theorem: Link expansion \(\implies\) Spectral gaps
Define \(\e_k \seteq \min_{\sigma \in X^{(k)}} \e(\pi^{(1)}_{\sigma})\) as the minimum spectral gap
over all \(1\)skeleta of links of \(k\)dimensional faces.
Then for every \(0 \leq k \leq d-1\), and \(f \in \ell_2(\pi^{(k)})\), it holds that
\[\begin{equation}\label{eq:althm}
\langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} \geq \e_{k-1} \langle f, L_k^{\triangledown} f\rangle_{\pi^{(k)}}\,.
\end{equation}\]
Corollary

Using the relationship between \(L_k^{\wedge}\) and \(L_k^{\triangle}\), this gives
\[\frac{k+2}{k+1} \e(L_k^{\triangle}) \geq \e_{k-1} \e(L_k^{\triangledown})\,.\]
But recall that \(P_k^{\triangle} = P_{\searrow k} P_{\nearrow k+1}\) and \(P_{k+1}^{\triangledown} = P_{\nearrow k+1} P_{\searrow k}\).
It is straightforward to check that for rectangular matrices \(A,B\), the two matrices \(AB\) and \(BA\) have the same nonzero spectrum, and
therefore \(P_k^{\triangle}\) and \(P_{k+1}^{\triangledown}\) have the same nonzero spectrum.
In particular, \(\e(L_k^{\triangle}) = \e(L_{k+1}^{\triangledown})\), and therefore
\[\begin{equation}\label{eq:recformula}
\e(L_{k+1}^{\triangledown}) \geq \frac{k+1}{k+2} \e_{k-1} \e(L_k^{\triangledown})\,.
\end{equation}\]

Applying \eqref{eq:recformula} iteratively gives, for \(1 \leq k \leq d\),
\[\e(L_{k}^{\triangledown}) \geq \frac{1}{k+1} \prod_{j=-1}^{k-2} \e_j\,.\]
In particular, we have
\[\e(L_d^{\triangledown}) \geq \frac{1}{d+1} \prod_{j=-1}^{d-2} \e_j\,,\]
yielding a spectral gap for the down-up walk whenever there are nontrivial spectral gaps for the \(1\)-skeleta of all the links in \(X\).
Proof of \eqref{eq:althm}:

For \(j \leq k\), \(\sigma \in X^{(j)}\), and \(f : X^{(k)} \to \R\), let us define \(f_{\sigma}^{(k-j-1)} : X_{\sigma}^{(k-j-1)} \to \R\) by
\(f^{(k-j-1)}_{\sigma}(\tau) \seteq f(\sigma \cup \tau)\).

Let \(L_{\sigma}\) denote the Laplacian on the \(1\)-skeleton of \(X_{\sigma}\) with respect to the edge measure \(\mu = \pi_{\sigma}^{(1)}\).
Then we have
\[\begin{align}
\langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \langle f^{(0)}_{\sigma}, L_{\sigma} f^{(0)}_{\sigma} \rangle_{\pi_{\sigma}^{(0)}} \label{eq:down1} \\
\langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi_{\sigma}^{(0)}}(f_{\sigma}^{(0)})\,, \label{eq:down2}
\end{align}\]
and \eqref{eq:althm} follows immediately from the definition of the spectral gap.

We are left to verify \eqref{eq:down1} and \eqref{eq:down2}.
Use \eqref{eq:laplacian} to write
\[\begin{align*}
\langle f, L_k^{\wedge} f\rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha}} (f(\alpha)-f(\alpha + u - v))^2 \\
&=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f(\sigma + u)-f(\sigma + v))^2 \\
&=\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{uv \sim \pi^{(1)}_{\sigma}} (f_{\sigma}^{(0)}(u) - f_{\sigma}^{(0)}(v))^2 \\
&= \E_{\sigma \sim \pi^{(k-1)}} \langle f_{\sigma}^{(0)}, L_{\sigma} f_{\sigma}^{(0)}\rangle_{\pi_{\sigma}^{(0)}}\,.
\end{align*}\]

Next write
\[\begin{align*}
\langle f, L_k^{\triangledown} f \rangle_{\pi^{(k)}} &= \frac12 \E_{\alpha \sim \pi^{(k)}} \E_{v \in \alpha} \E_{u \sim \pi^{(0)}_{\alpha \setminus v}}
(f(\alpha) - f(\alpha + u - v))^2 \\
&=
\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f(\sigma + u) - f(\sigma + v))^2 \\
&=
\frac12 \E_{\sigma \sim \pi^{(k-1)}} \E_{u,v \sim \pi^{(0)}_{\sigma}} (f^{(0)}_{\sigma}(u) - f^{(0)}_{\sigma}(v))^2 \\
&=
\E_{\sigma \sim \pi^{(k-1)}} \mathsf{Var}_{\pi^{(0)}_{\sigma}}(f_{\sigma}^{(0)})\,,
\end{align*}\]
where the last line follows from the general fact
\[\mathsf{Var}(X) = \frac12 \E (X-X')^2\,,\]
where \(X'\) is an independent copy of \(X\).
Reading:

Noise stability on the hypercube via Brownian motion 
Wed, Mar 01
Leaking entropy at constant rate

Reading:

Mon, Mar 06
Majority is stablest


Recent topics in ML theory 
Wed, Mar 08
The edge of stability

Notes
Selfstabilization dynamics

Define the set \(\mathcal{M} \seteq \{ \theta \in \R^n : S(\theta) \leq 2/\eta \textrm{ and } \langle \nabla L(\theta), u(\theta)\rangle = 0 \}\),
where we recall that \(u(\theta)\) is the top eigenvector of \(\nabla^2 L(\theta)\).
Given the gradient descent dynamics \(\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\), we define \(\hat{\theta}_t \seteq \mathrm{proj}_{\mathcal{M}}(\theta_t)\)
as the (Euclidean) projection to the set \(\mathcal{M}\).

Now define the quantities
\[\begin{align*}
x_t &\seteq \langle u, \theta_t - \hat{\theta}_0\rangle \\
y_t &\seteq \langle \nabla S(\hat{\theta}_0), \theta_t - \hat{\theta}_0\rangle,
\end{align*}\]
where \(u \seteq u(\hat{\theta}_0)\) is the top eigenvector of \(\nabla^2 L(\hat{\theta}_0)\) and \(S(\theta) \seteq \lambda_{\max}(\nabla^2 L(\theta))\) is the sharpness.
If we assume that \(S(\hat{\theta}_0)=2/\eta\) and \(\nabla S(\theta_t) \approx \nabla S(\hat{\theta}_0)\), then
\[\begin{equation}\label{eq:sharp}
S(\theta_t) \approx 2/\eta + y_t\,.
\end{equation}\]
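The approximation \eqref{eq:sharp} is just a first-order Taylor expansion of the sharpness around \(\hat{\theta}_0\):
\[S(\theta_t) \approx S(\hat{\theta}_0) + \langle \nabla S(\hat{\theta}_0), \theta_t - \hat{\theta}_0\rangle = \frac{2}{\eta} + y_t\,.\]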

Linear approximation. For \(x_t,y_t\) small, we can approximate \(\nabla L(\theta_t) \approx \nabla L(\hat{\theta}_0)\).

Quadratic approximation. When \(\abs{x_t}\) blows up, we use a quadratic approximation:
\[\begin{align*}
\nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) \\
&\approx \langle u, \theta_t - \hat{\theta}_0\rangle \nabla^2 L(\theta_t) u\,.
\end{align*}\]
In the last line, we have used the fact that \(\theta_t\hat{\theta}_0\) only blows up in the \(u\) direction,
so we only use the quadratic approximation there.
This quantity is equal to \(x_t \nabla^2 L(\theta_t) u \approx x_t S(\theta_t) u\).
In conjunction with \(\langle u, \nabla L(\hat{\theta}_0)\rangle = 0\), this gives
\[\begin{equation}\label{eq:xpart}
x_{t+1} = x_t - \eta \langle u, \nabla L(\theta_t)\rangle = x_t(1-\eta S(\theta_t)) \approx -(1+\eta y_t)\,x_t\,,
\end{equation}\]
where the last approximation uses \eqref{eq:sharp}. Here we see the blowup phase where \(\abs{x_t}\) grows multiplicatively
as soon as \(y_t > 0\), i.e., as soon as the sharpness exceeds \(2/\eta\),
and the sign of \(x_t\) oscillates.

Cubic terms. For \(\abs{x_t}\) sufficiently large, we need to incorporate the 3rdorder term in the Taylor expansion:
\[\begin{align*}
\nabla L(\theta_t) - \nabla L(\hat{\theta}_0)
&\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\theta_t)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\
&\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[\theta_t-\hat{\theta}_0,\theta_t-\hat{\theta}_0] \\
&\approx \nabla^2 L(\theta_t) (\theta_t-\hat{\theta}_0) + \frac12 \nabla^3 L(\hat{\theta}_0)[u,u] \langle u, \theta_t - \hat{\theta}_0\rangle^2\,,
\end{align*}\]
where as in the quadratic setting, we only consider the component in direction \(u\). Using the approximation we derived for the quadratic part,
this gives
\[\begin{align*}
\nabla L(\theta_t) - \nabla L(\hat{\theta}_0) &\approx x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla^3 L(\hat{\theta}_0)[u,u] \\
&=x_t S(\theta_t) u + \frac{x_t^2}{2} \nabla S(\hat{\theta}_0)\,,
\end{align*}\]
where we used the fact that if the top eigenvalue of \(\nabla^2 L(\theta)\) has multiplicity one, then \(\nabla S(\theta) = \nabla^3 L(\theta)[u(\theta),u(\theta)]\).
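This last fact is standard first-order eigenvalue perturbation: writing \(S(\theta) = \langle u(\theta), \nabla^2 L(\theta)\, u(\theta)\rangle\) with \(\|u(\theta)\| = 1\), the terms involving \(\partial_i u(\theta)\) vanish since \(\nabla^2 L(\theta)\, u(\theta) = S(\theta)\, u(\theta)\) and \(\langle \partial_i u(\theta), u(\theta)\rangle = 0\), leaving
\[\partial_i S(\theta) = \langle u(\theta), \left(\partial_i \nabla^2 L(\theta)\right) u(\theta)\rangle = \left(\nabla^3 L(\theta)[u(\theta),u(\theta)]\right)_i\,.\]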

Then our update to \(y_t\) looks like
\[\begin{align*}
y_{t+1} - y_t &= \langle \nabla S(\hat{\theta}_0), -\eta \nabla L(\theta_t)\rangle \\
&= -\eta \langle \nabla S(\hat{\theta}_0), \nabla L(\hat{\theta}_0)\rangle - \eta \langle \nabla S(\hat{\theta}_0), \nabla L(\theta_t)-\nabla L(\hat{\theta}_0)\rangle \\
&\approx \eta \alpha - \eta x_t S(\theta_t) \langle \nabla S(\hat{\theta}_0), u \rangle - \eta \frac{x_t^2}{2} \beta\,,
\end{align*}\]
where \(\alpha \seteq \langle \nabla S(\hat{\theta}_0), -\nabla L(\hat{\theta}_0)\rangle\) is the progressive sharpening parameter and
\(\beta \seteq \|\nabla S(\hat{\theta}_0)\|^2\) is the “stabilization” parameter.

If we assume for simplicity that \(\langle \nabla S(\hat{\theta}_0), u\rangle = 0\), then we obtain
\[\begin{equation}\label{eq:ypart}
y_{t+1} - y_t = \eta \left(\alpha - \beta \frac{x_t^2}{2}\right)\,,
\end{equation}\]
meaning that the sharpness is decaying whenever \(\abs{x_t} > \sqrt{2\alpha/\beta}\).
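One can watch the resulting self-stabilization cycle numerically. The sketch below iterates the two reduced update rules \(x_{t+1} = -(1+\eta y_t)x_t\) and \(y_{t+1} = y_t + \eta(\alpha - \beta x_t^2/2)\) derived above (the simplified case \(\langle \nabla S(\hat{\theta}_0), u\rangle = 0\)); the parameter values are illustrative choices, not from the notes.

```python
import math

# Reduced (x_t, y_t) dynamics at the edge of stability:
#   x_{t+1} = -(1 + eta * y_t) * x_t                    (oscillation/blowup in the top eigendirection)
#   y_{t+1} = y_t + eta * (alpha - beta * x_t**2 / 2)   (progressive sharpening vs. stabilization)
# Parameter values here are illustrative, not taken from the notes.
eta, alpha, beta = 0.05, 1.0, 1.0
x, y = 0.01, 0.0          # start near the manifold, at sharpness exactly 2/eta
xs, ys = [x], [y]
for _ in range(500):
    # both updates use the current (x, y), matching the coupled recursions
    x, y = -(1 + eta * y) * x, y + eta * (alpha - beta * x**2 / 2)
    xs.append(x)
    ys.append(y)

# Expected qualitative behavior: while y_t > 0 the magnitude |x_t| grows
# multiplicatively with alternating sign; once |x_t| exceeds sqrt(2*alpha/beta),
# the excess sharpness y_t is driven back down below zero, and |x_t| shrinks again.
print(max(abs(v) for v in xs), min(ys), max(ys))
```

Up to discretization drift, the continuous-time version of these dynamics conserves a Hamiltonian-type quantity in \((\log\abs{x}, y)\), which is why the trajectory settles into a bounded cycle rather than diverging.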
Reading:
