Suppose that \(F(x) = f_1(x) + \cdots + f_m(x)\), \(\rho \in \R_+^m\) is a probability vector, and define our potential sparsifier as
\[\tilde{F}_{\nu}(x) \seteq \sum_{j=1}^M \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}}\,.\]Define the related distance
\[d_{\nu}(x,y) \seteq \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2 \right)^{1/2}.\]We have seen that if we can bound
\[\begin{equation}\label{eq:delta-bound} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \leq \delta \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}\quad \forall \nu \in [m]^M\,, \end{equation}\]where \(\e_1,\ldots,\e_M \in \{-1,1\}\) are uniformly random signs, then
\[\begin{equation}\label{eq:Fapprox} \E_{\nu} \max_{F(x) \leq 1} \left|F(x)-\tilde{F}_{\nu}(x)\right| \lesssim \delta\,, \end{equation}\]where \(\nu_1,\ldots,\nu_M\) are indices sampled i.i.d. from \(\rho\).
Finally, we have seen, using Dudley’s entropy bound, that if \(B_F \seteq \{ x \in \R^n : F(x) \leq 1 \}\), then
\[\begin{equation}\label{eq:dudley} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \lesssim \sum_{h \geq 0} 2^{h/2} e_h(B_F, d_{\nu})\,. \end{equation}\]Recall the setting for \(\ell_2\) regression: \(a_1,\ldots,a_m \in \R^n\) and \(f_i(x) \seteq |\langle a_i,x\rangle|^2\).
Control by \(\ell_{\infty}\). In this case, for \(F(x),F(y) \leq 1\), we can write
\[\begin{align*} d_{\nu}(x,y) &= \left(\sum_{j=1}^M \left(\frac{|\langle a_{\nu_j},x\rangle|^2 - |\langle a_{\nu_j},y\rangle|^2}{M \rho_{\nu_j}}\right)^2\right)^{1/2} \\ &= \left(\sum_{j=1}^M \frac{\left(|\langle a_{\nu_j},x\rangle| - |\langle a_{\nu_j},y\rangle|\right)^2\left(|\langle a_{\nu_j},x\rangle| + |\langle a_{\nu_j},y\rangle|\right)^2}{(M \rho_{\nu_j})^2}\right)^{1/2} \\ &\leq 2 \max_{j \in [M]} \frac{\left||\langle a_{\nu_j},x\rangle| - |\langle a_{\nu_j},y\rangle|\right|}{\sqrt{M \rho_{\nu_j}}} \left(\max_{F(x) \leq 1} \sum_{j=1}^M \frac{|\langle a_{\nu_j},x\rangle|^2}{M \rho_{\nu_j}}\right)^{1/2} \\ &\leq 2 \max_{j \in [M]} \frac{|\langle a_{\nu_j},x-y\rangle|}{\sqrt{M \rho_{\nu_j}}} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}, \end{align*}\]where the third line uses \((s+t)^2 \leq 2(s^2+t^2)\) together with \(F(x),F(y) \leq 1\), and the last line uses \(\big||s|-|t|\big| \leq |s-t|\). If we have two distances \(d_1\) and \(d_2\) such that \(d_1 \leq C d_2\), then \(e_h(B_F, d_1) \leq C e_h(B_F, d_2)\). Thus if we define
\[d_{\nu,\infty}(x,y) \seteq \max_{j \in [M]} \frac{|\langle a_{\nu_j},x-y\rangle|}{\sqrt{M \rho_{\nu_j}}}\,,\]then we have
\[\begin{equation}\label{eq:nucomp} e_h(B_F, d_{\nu}) \lesssim e_h(B_F, d_{\nu,\infty}) \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}, \end{equation}\]which combines nicely with our goal of bounding \eqref{eq:delta-bound} via \eqref{eq:dudley}.
Bounding the entropy numbers.
For a metric space \((T,d)\), define \(\mathcal{N}(T,d,\e)\) as the minimal number of \(\e\)-balls in \((T,d)\) needed to cover \(T\). As a check of one’s understanding of the definitions, it is worth verifying that if
\[\log \mathcal{N}(T,d,\e) \leq \left(\frac{L}{\e}\right)^p\,, \quad \forall \e > 0\,,\]then \(e_h(T,d) \leq 2^{-h/p} L\) for every \(h \geq 0\).
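Indeed, recalling that \(e_h(T,d)\) denotes the infimum of those \(\e > 0\) for which \(\mathcal{N}(T,d,\e) \leq 2^{2^h}\), taking \(\e \seteq 2^{-h/p} L\) in the assumption gives
\[\log \mathcal{N}\left(T,d, 2^{-h/p} L\right) \leq \left(\frac{L}{2^{-h/p} L}\right)^p = 2^h\,,\]so \(\mathcal{N}(T,d,2^{-h/p} L) \leq 2^{2^h}\) (reading \(\log\) as base \(2\); other bases change only the constant), and hence \(e_h(T,d) \leq 2^{-h/p} L\).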
Covering Lemma: Suppose that \(u_1,\ldots,u_M \in \R^n\), let \(U\) be the matrix with \(u_1,\ldots,u_M\) as rows, and define
\[d_{U}(x,y) \seteq \|U(x-y)\|_{\infty} = \max_{j \in [M]} |\langle u_j,x-y\rangle|\,.\]Then,
\[\left(\log \mathcal{N}(B_2^n, d_{U}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \max(\|u_1\|_2,\ldots,\|u_M\|_2) \sqrt{\log M}\,,\]where \(B_2^n\) is the unit ball in the Euclidean norm.
Sparsifiers for \(\ell_2\) regression.
Let us first use the Covering Lemma to finish our first construction of sparsifiers for \(\ell_2\) regression.
Recall that we choose \(\rho_{i} \seteq \|(A^{\top} A)^{-1/2} a_i\|_2^2/n\), where \(A\) is the matrix with rows \(a_1,\ldots,a_m\) and, in this case, \(F(x) = \|Ax\|_2^2\).
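Note that \(\rho\) is indeed a probability vector, since
\[\sum_{i=1}^m \rho_i = \frac{1}{n} \sum_{i=1}^m a_i^{\top} (A^{\top}A)^{-1} a_i = \frac{1}{n} \mathrm{tr}\left((A^{\top}A)^{-1} \sum_{i=1}^m a_i a_i^{\top}\right) = \frac{\mathrm{tr}(I_n)}{n} = 1\,.\]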
Using the linear transformation \(x \mapsto (A^{\top} A)^{-1/2} x\), we can write
\[\mathcal{N}(B_F, d_{\nu,\infty}, \e) = \mathcal{N}(B_2^n, d_{U}, \e)\,,\]where \(U\) is the matrix with rows \(u_j \seteq \frac{(A^{\top} A)^{-1/2} a_{\nu_j}}{\sqrt{M \rho_{\nu_j}}}\). Note that, by our choice of \(\rho_1,\ldots,\rho_m\), we have
\[\|u_j\|_2 = \frac{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2}{\sqrt{M \rho_{\nu_j}}} = \sqrt{\frac{n}{M}} \frac{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2}{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2} = \sqrt{\frac{n}{M}}\,.\]Terms with \(h \lesssim \log n\). So the Covering Lemma gives us
\[\left(\log \mathcal{N}(B_F, d_{\nu,\infty}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{n \frac{\log M}{M}}\,.\]By our preceding remarks, this yields
\[\begin{equation}\label{eq:ent1} e_h(B_F, d_{\nu,\infty}) \lesssim 2^{-h/2} \sqrt{n \frac{\log M}{M}}\,. \end{equation}\]Terms with \(h \gg \log n\). If we denote \(B_{\nu,\infty} \seteq \{ x \in \R^n : d_{\nu,\infty}(x,0) \leq 1 \}\), then we have
\[B_F \subseteq \left(\max_{x \in B_F} d_{\nu,\infty}(x,0)\right) B_{\nu,\infty}\,.\]A simple calculation using Cauchy–Schwarz gives
\[|\langle a_{\nu_j},x-y\rangle| = |\langle (A^{\top}A)^{-1/2} a_{\nu_j}, (A^{\top}A)^{1/2}(x-y)\rangle| \leq \|(A^{\top}A)^{-1/2} a_{\nu_j}\|_2 \|A(x-y)\|_2 = \sqrt{n \rho_{\nu_j}}\, \|A(x-y)\|_2\,,\]using \(\|(A^{\top}A)^{1/2} v\|_2 = \|Av\|_2\), and therefore
\[d_{\nu,\infty}(x,y) \leq \sqrt{\frac{n}{M}}\, \|A(x-y)\|_2\,.\]In particular, \(\max_{x \in B_F} d_{\nu,\infty}(x,0) \leq \sqrt{n/M}\), therefore
\[\begin{equation}\label{eq:ent2} e_h(B_F, d_{\nu,\infty}) \leq \sqrt{\frac{n}{M}} e_h(B_{\nu,\infty}, d_{\nu,\infty}) \leq \sqrt{\frac{n}{M}} 2^{-2^h/n}\,, \end{equation}\]where the last inequality holds for the unit ball of any norm on \(\R^n\) (and, indeed, \(d_{\nu,\infty}\) is a distance induced by a norm on \(\R^n\)), as you show in Homework #1.
Putting everything together. Now \eqref{eq:ent1} and \eqref{eq:ent2} in conjunction give
\[\sum_{h \geq 0} 2^{h/2} e_h(B_F, d_{\nu,\infty}) \lesssim \log(n) \sqrt{n \frac{\log M}{M}} + \sqrt{\frac{n}{M}} \sum_{h > 4 \log_2 n} 2^{h/2}\, 2^{-2^h/n} \lesssim \sqrt{n (\log n)^2 \frac{\log M}{M}}\,,\]where we apply \eqref{eq:ent1} for \(h \leq 4 \log_2 n\) and \eqref{eq:ent2} for \(h > 4 \log_2 n\); the second sum is \(O(1)\) since \(2^h/n \geq n^3\) in that range. Combining this with \eqref{eq:nucomp} and \eqref{eq:dudley} gives
\[\E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \lesssim \sqrt{n (\log n)^2 \frac{\log M}{M}} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}.\]Therefore, setting \(M \seteq C \frac{n}{\e^2} (\log n)^2 \log(n/\e)\) for an appropriate constant \(C > 1\), \eqref{eq:Fapprox} gives us our desired \(\e\)-approximation.
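To make the construction concrete, here is a minimal numerical sketch (an illustration, not part of the notes) of the sampling scheme in numpy. It uses the fact that for \(F(x) = \|Ax\|_2^2\), the error \(\max_{F(x) \leq 1} |F(x) - \tilde{F}_{\nu}(x)|\) equals the operator norm \(\|I - L^{\top} \tilde{A}^{\top} \tilde{A} L\|\), where \(\tilde{A}\) has rows \(a_{\nu_j}/\sqrt{M \rho_{\nu_j}}\) and \(L L^{\top} = (A^{\top}A)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 5000, 20, 2000
A = rng.standard_normal((m, n))

# Sampling probabilities rho_i = ||(A^T A)^{-1/2} a_i||_2^2 / n; these
# sum to 1 since the leverage scores of A sum to n.
P = np.linalg.inv(A.T @ A)
rho = np.einsum('ij,jk,ik->i', A, P, A) / n
rho /= rho.sum()  # guard against floating-point drift

# Sample nu_1, ..., nu_M i.i.d. from rho and reweight each row, so that
# F_tilde(x) = ||A_tilde x||_2^2 has expectation F(x) = ||A x||_2^2.
nu = rng.choice(m, size=M, p=rho)
A_tilde = A[nu] / np.sqrt(M * rho[nu])[:, None]

# With L L^T = (A^T A)^{-1} (Cholesky), we have L^T (A^T A) L = I, so
# max_{F(x) <= 1} |F(x) - F_tilde(x)| = ||I - L^T A_tilde^T A_tilde L||.
L = np.linalg.cholesky(P)
err = np.linalg.norm(np.eye(n) - L.T @ (A_tilde.T @ A_tilde) @ L, ord=2)
print(f"approximation error: {err:.3f}")
```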
Before proving the lemma, it helps to consider the more basic problem of covering the Euclidean ball \(B_2^n\) by translates of \(\e B_{\infty}^n\), i.e., by translates of small cubes.
Suppose \(\|x\|_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2 = 1\). Since we only care about approximation up to \(\e\) in the \(\ell_{\infty}\) distance, we could discretize this vector to lie in, say, \(\e \mathbb{Z}^n \cap B_2^n\).
The most basic kind of vector we need to cover is of the form \((0, \pm \e, 0, 0, \pm \e, 0, \pm \e, 0, \ldots, 0)\). Because \(\|x\|_2^2 = 1\), there are only \(n^{O(1/\e^2)}\) choices for such a vector. But we also need to handle vectors of the form \((0, \pm 2\e, 0, 0, \pm \e, 0, \pm 2\e, 0, \ldots, 0)\), and so on.
It is not hard to convince oneself that there are asymptotically fewer vectors of this form; see the count made explicit below. Indeed, if some entry is \(2\e\) then there are \(n\) choices for where it goes, but there are \(n(n-1)/2\) choices for where two copies of \(\e\) go. Thus the total number of centers one needs is only \(n^{O(1/\e^2)}\).
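To make the basic count explicit: a center whose nonzero entries consist of \(k\) coordinates equal to \(\pm\e\) satisfies \(k \e^2 \leq 1\), so the number of such centers is at most
\[\binom{n}{k} 2^k \leq (2n)^k \leq (2n)^{1/\e^2} = n^{O(1/\e^2)}\,.\]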
In other words,
\[\left(\log \mathcal{N}(B_2^n, \|\cdot\|_{\infty}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{\log n}\,.\]Now suppose we wanted to cover \(B_2^n\) instead with cubes of different side lengths, or with parallelepipeds (where the sides are no longer perpendicular), etc. There is a beautiful approach that gives surprisingly good bounds for covering \(B_2^n\) by translates of an arbitrary symmetric convex body. (I have heard it credited to Talagrand, or to Pajor and Talagrand.)
The dual-Sudakov approach.
We will prove the following substantial strengthening of this bound. Suppose that \(d(x,y) = \|x-y\|\) is a distance induced by some norm \(\|\cdot\|\) on \(\R^n\). Then for every \(\e > 0\),
\[\begin{equation}\label{eq:dual-sudakov} \left(\log \mathcal{N}(B_2^n, d, \e)\right)^{1/2} \lesssim \frac{1}{\e} \E \|\mathbf{g}\|\,, \end{equation}\]where \(\mathbf{g}\) is a standard \(n\)-dimensional gaussian.
Note that this proves the Covering Lemma since
\[\E \|U \mathbf{g}\|_{\infty} = \E \max_{j \in [M]} |\langle u_j, \mathbf{g}\rangle| \lesssim \max(\|u_1\|_2,\ldots,\|u_M\|_2) \sqrt{\log M}\,.\]The last bound is a basic exercise with the gaussian tail, taking a union bound over the \(M\) gaussian variables \(\{ \langle u_j,\mathbf{g} \rangle : j \in [M] \}\).
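To spell out that exercise (a standard gaussian computation, sketched here): set \(\sigma \seteq \max(\|u_1\|_2,\ldots,\|u_M\|_2)\), so each \(\langle u_j,\mathbf{g}\rangle\) is a centered gaussian with variance at most \(\sigma^2\), and the union bound gives \(\Pr\left[\max_{j \in [M]} |\langle u_j,\mathbf{g}\rangle| > t\right] \leq 2 M e^{-t^2/2\sigma^2}\). Integrating the tail with \(t_0 \seteq \sigma \sqrt{2 \log(2M)}\),
\[\E \max_{j \in [M]} |\langle u_j,\mathbf{g}\rangle| \leq t_0 + \int_{t_0}^{\infty} 2M e^{-t^2/2\sigma^2}\,dt \leq t_0 + \frac{2M \sigma^2}{t_0} e^{-t_0^2/2\sigma^2} = t_0 + \frac{\sigma^2}{t_0} \lesssim \sigma \sqrt{\log M}\,.\]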
Shift Lemma: Suppose that \(\gamma_n\) is the standard gaussian measure on \(\R^n\) and \(W \subseteq \R^n\) is any symmetric set (\(W = -W\)) and \(z \in \R^n\). Then,
\[\gamma_n(W+z) \geq e^{-\|z\|_2^2/2} \gamma_n(W)\,.\]Proof: Write
\[\begin{align*} \gamma_n(W+z) &\propto \int_W e^{-\|x+z\|_2^2/2}\,dx \\ &= \E_{\sigma \in \{-1,1\}} \int_W e^{-\|\sigma x + z\|_2^2/2}\,dx \\ &\geq \int_W e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2/2}\,dx\,, \end{align*}\]where the equality uses symmetry of \(W\) and the inequality uses convexity of \(y \mapsto e^{-y}\) (Jensen's inequality).
Now note that from Pythagoras, \(\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2 = \|x\|_2^2 + \|z\|_2^2\), therefore
\[\int_W e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2/2}\,dx = e^{-\|z\|_2^2/2} \int_W e^{-\|x\|_2^2/2}\,dx \propto e^{-\|z\|_2^2/2} \gamma_n(W)\,,\]completing the proof of the Shift Lemma.
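Before moving on, here is a quick numerical sanity check of the Shift Lemma (an illustration, not part of the notes; it assumes scipy is available for the gaussian CDF). For an axis-aligned symmetric box, the gaussian measure of a shifted box factors into one-dimensional CDF differences:

```python
import numpy as np
from scipy.stats import norm

# Sanity check of the Shift Lemma for a symmetric box
# W = [-w_1, w_1] x ... x [-w_n, w_n]: the gaussian measure of W + z
# factors as a product of one-dimensional gaussian CDF differences.
rng = np.random.default_rng(1)
n = 4
w = rng.uniform(0.5, 2.0, size=n)   # box half-widths, so W = -W
z = rng.standard_normal(n)          # an arbitrary shift

gamma_W = np.prod(norm.cdf(w) - norm.cdf(-w))
gamma_W_z = np.prod(norm.cdf(z + w) - norm.cdf(z - w))   # gamma_n(W + z)

lower = np.exp(-np.linalg.norm(z) ** 2 / 2) * gamma_W
assert gamma_W_z >= lower
print(f"gamma(W+z) = {gamma_W_z:.4f} >= {lower:.4f}")
```

Proof of \eqref{eq:dual-sudakov}: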
Let \(B \seteq \{ x \in \R^n : \|x\| \leq 1 \}\) denote the unit ball in our given norm. By scaling, we can assume that \(\e = 2\).
If we cannot cover \(B_2^n\) by \(N\) translates of \(2B\), then there must exist \(x_1,\ldots,x_N \in B_2^n\) such that \(x_1 + B, \ldots, x_N + B\) are all pairwise disjoint: take a maximal set of points in \(B_2^n\) whose pairwise \(\|\cdot\|\)-distances exceed \(2\); by maximality, the translates of \(2B\) centered at these points cover \(B_2^n\), so there are more than \(N\) of them. Let \(\lambda > 0\) be a parameter we will choose momentarily, and note that \(\lambda (x_1+B),\ldots, \lambda(x_N+B)\) are pairwise disjoint as well, therefore
\[\begin{align*} 1 \geq \gamma_n\left(\bigcup_{i \leq N} \lambda (x_i+B)\right) &= \sum_{i \leq N} \gamma_n\left(\lambda (x_i+B)\right) \\ &\geq \gamma_n(\lambda B) \sum_{i \leq N} e^{-\lambda^2 \|x_i\|_2^2/2} \\ &\geq \gamma_n(\lambda B) N e^{-\lambda^2/2}\,, \end{align*}\]where the first inequality uses the Shift Lemma and the second inequality uses \(x_1,\ldots,x_N \in B_2^n\).
Now note that, by Markov’s inequality, \(\E \|\mathbf{g}\| \geq \lambda \gamma_n(\R^n \setminus (\lambda B))\), thus taking \(\lambda \seteq 2 \E \|\mathbf{g}\|\) gives \(\gamma_n(\lambda B) \geq 1/2\), and we conclude that
\[N \leq 2 e^{\lambda^2/2} = 2 e^{2 (\E \|\mathbf{g}\|)^2}\,,\]completing the proof.
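This proves \eqref{eq:dual-sudakov} at \(\e = 2\); the general case follows by the scaling mentioned above. Explicitly, applying the bound to the rescaled norm \(\|x\|' \seteq \frac{2}{\e} \|x\|\), whose unit ball is \(\frac{\e}{2} B\), covering by translates of \(2 \cdot \frac{\e}{2} B = \e B\) is exactly covering at scale \(\e\) in \(d\), and since \(\E \|\mathbf{g}\|' = \frac{2}{\e} \E \|\mathbf{g}\|\), we conclude
\[\left(\log \mathcal{N}(B_2^n, d, \e)\right)^{1/2} \lesssim \E \|\mathbf{g}\|' = \frac{2}{\e}\, \E \|\mathbf{g}\|\,.\]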