Suppose that \(F(x) = f_1(x) + \cdots + f_m(x)\), \(\rho \in \R_+^m\) is a probability vector, and define our potential sparsifier as
\[\tilde{F}_{\nu}(x) \seteq \sum_{j=1}^M \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}}\,.\]Define the related distance
\[d_{\nu}(x,y) \seteq \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2 \right)^{1/2}.\]We have seen that if we can bound
\[\begin{equation}\label{eq:delta-bound} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \leq \delta \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}\quad \forall \nu \in [m]^M\,, \end{equation}\]where \(\e_1,\ldots,\e_M \in \{-1,1\}\) are uniformly random signs, then
\[\begin{equation}\label{eq:Fapprox} \E_{\nu} \max_{F(x) \leq 1} \left|F(x)-\tilde{F}_{\nu}(x)\right| \lesssim \delta\,, \end{equation}\]where \(\nu_1,\ldots,\nu_M\) are indices sampled i.i.d. from \(\rho\).
Finally, we have seen, using Dudley’s entropy bound, that if \(B_F \seteq \{ x \in \R^n : F(x) \leq 1 \}\), then
\[\begin{equation}\label{eq:dudley} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \lesssim \sum_{h \geq 0} 2^{h/2} e_h(B_F, d_{\nu})\,. \end{equation}\]Recall the setting for \(\ell_2\) regression: \(a_1,\ldots,a_m \in \R^n\) and \(f_i(x) \seteq |\langle a_i,x\rangle|^2\).
Control by \(\ell_{\infty}\). In this case, for \(F(x),F(y) \leq 1\), we can write
\[\begin{align*} d_{\nu}(x,y) &= \left(\sum_{j=1}^M \left(\frac{|\langle a_{\nu_j},x\rangle|^2 - |\langle a_{\nu_j},y\rangle|^2}{M \rho_{\nu_j}}\right)^2\right)^{1/2} \\ &= \left(\sum_{j=1}^M \frac{\left(|\langle a_{\nu_j},x\rangle| - |\langle a_{\nu_j},y\rangle|\right)^2\left(|\langle a_{\nu_j},x\rangle| + |\langle a_{\nu_j},y\rangle|\right)^2}{(M \rho_{\nu_j})^2}\right)^{1/2} \\ &\leq 2 \max_{j \in [M]} \frac{\left||\langle a_{\nu_j},x\rangle| - |\langle a_{\nu_j},y\rangle|\right|}{\sqrt{M \rho_{\nu_j}}} \left(\max_{F(x) \leq 1} \sum_{j=1}^M \frac{|\langle a_{\nu_j},x\rangle|^2}{M \rho_{\nu_j}}\right)^{1/2} \\ &\leq 2 \max_{j \in [M]} \frac{|\langle a_{\nu_j},x-y\rangle|}{\sqrt{M \rho_{\nu_j}}} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}, \end{align*}\]where the third line uses \((s+t)^2 \leq 2(s^2+t^2)\) together with \(F(x),F(y) \leq 1\), and the last line uses \(\big||s|-|t|\big| \leq |s-t|\). If we have two distances \(d_1\) and \(d_2\) such that \(d_1 \leq C d_2\), then \(e_h(B_F, d_1) \leq C e_h(B_F, d_2)\). Thus if we define
\[d_{\nu,\infty}(x,y) \seteq \max_{j \in [M]} \frac{|\langle a_{\nu_j},x-y\rangle|}{\sqrt{M \rho_{\nu_j}}}\,,\]then we have
\[\begin{equation}\label{eq:nucomp} e_h(B_F, d_{\nu}) \lesssim e_h(B_F, d_{\nu,\infty}) \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}, \end{equation}\]which combines nicely with our goal of bounding \eqref{eq:delta-bound} via \eqref{eq:dudley}.
Bounding the entropy numbers.
For a metric space \((T,d)\), define \(\mathcal{N}(T,d,\e)\) as the minimal number of \(\e\)-balls in \((T,d)\) needed to cover \(T\). As a check of one’s understanding of the definitions, it is worth verifying that if
\[\log \mathcal{N}(T,d,\e) \leq \left(\frac{L}{\e}\right)^p\,, \quad \forall \e > 0\,,\]then \(e_h(T,d) \leq 2^{-h/p} L\) for every \(h \geq 0\).
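Indeed, recalling that \(e_h(T,d)\) denotes the infimum of those \(\e > 0\) for which \(\mathcal{N}(T,d,\e) \leq 2^{2^h}\), taking \(\e \seteq 2^{-h/p} L\) in the assumption gives
\[\log \mathcal{N}\left(T,d, 2^{-h/p} L\right) \leq \left(\frac{L}{2^{-h/p} L}\right)^p = 2^h\,,\]so \(\mathcal{N}(T,d,2^{-h/p} L) \leq 2^{2^h}\) (reading \(\log\) as base \(2\); other bases change only the constant), and hence \(e_h(T,d) \leq 2^{-h/p} L\).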
Covering Lemma: Suppose that \(u_1,\ldots,u_M \in \R^n\), let \(U\) be the matrix with \(u_1,\ldots,u_M\) as rows, and define
\[d_{U}(x,y) \seteq \|U(x-y)\|_{\infty} = \max_{j \in [M]} |\langle u_j,x-y\rangle|\,.\]Then,
\[\left(\log \mathcal{N}(B_2^n, d_{U}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \max(\|u_1\|_2,\ldots,\|u_M\|_2) \sqrt{\log M}\,,\]where \(B_2^n\) is the unit ball in the Euclidean norm.
Sparsifiers for \(\ell_2\) regression.
Let us first use the Covering Lemma to finish our first construction of sparsifiers for \(\ell_2\) regression.
Recall that we choose \(\rho_{i} \seteq \|(A^{\top} A)^{-1/2} a_i\|_2^2/n\), where \(A\) is the matrix with rows \(a_1,\ldots,a_m\) and, in this case, \(F(x) = \|Ax\|_2^2\).
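Note that \(\rho\) is indeed a probability vector, since
\[\sum_{i=1}^m \rho_i = \frac{1}{n} \sum_{i=1}^m a_i^{\top} (A^{\top}A)^{-1} a_i = \frac{1}{n} \mathrm{tr}\left((A^{\top}A)^{-1} \sum_{i=1}^m a_i a_i^{\top}\right) = \frac{\mathrm{tr}(I_n)}{n} = 1\,.\]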
Using the linear transformation \(x \mapsto (A^{\top} A)^{-1/2} x\), we can write
\[\mathcal{N}(B_F, d_{\nu,\infty}, \e) = \mathcal{N}(B_2^n, d_{U}, \e)\,,\]where \(U\) is the matrix with rows \(u_j \seteq \frac{(A^{\top} A)^{-1/2} a_{\nu_j}}{\sqrt{M \rho_{\nu_j}}}\). Note that, by our choice of \(\rho_1,\ldots,\rho_m\), we have
\[\|u_j\|_2 = \frac{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2}{\sqrt{M \rho_{\nu_j}}} = \sqrt{\frac{n}{M}} \frac{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2}{\|(A^{\top} A)^{-1/2} a_{\nu_j}\|_2} = \sqrt{\frac{n}{M}}\,.\]Terms with \(h \lesssim \log n\). So the Covering Lemma gives us
\[\left(\log \mathcal{N}(B_F, d_{\nu,\infty}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{n \frac{\log M}{M}}\,.\]By our preceding remarks, this yields
\[\begin{equation}\label{eq:ent1} e_h(B_F, d_{\nu,\infty}) \lesssim 2^{-h/2} \sqrt{n \frac{\log M}{M}}\,. \end{equation}\]Terms with \(h \gg \log n\). If we denote \(B_{\nu,\infty} \seteq \{ x \in \R^n : d_{\nu,\infty}(x,0) \leq 1 \}\), then we have
\[B_F \subseteq \left(\max_{x \in B_F} d_{\nu,\infty}(x,0)\right) B_{\nu,\infty}\,.\]A simple calculation using Cauchy–Schwarz gives
\[|\langle a_{\nu_j},x-y\rangle| = |\langle (A^{\top}A)^{-1/2} a_{\nu_j}, (A^{\top}A)^{1/2}(x-y)\rangle| \leq \|(A^{\top}A)^{-1/2} a_{\nu_j}\|_2 \|A(x-y)\|_2 = \sqrt{n \rho_{\nu_j}}\, \|A(x-y)\|_2\,,\]using \(\|(A^{\top}A)^{1/2} v\|_2 = \|Av\|_2\), and therefore
\[d_{\nu,\infty}(x,y) \leq \sqrt{\frac{n}{M}}\, \|A(x-y)\|_2\,.\]In particular, \(\max_{x \in B_F} d_{\nu,\infty}(x,0) \leq \sqrt{n/M}\), therefore
\[\begin{equation}\label{eq:ent2} e_h(B_F, d_{\nu,\infty}) \leq \sqrt{\frac{n}{M}} e_h(B_{\nu,\infty}, d_{\nu,\infty}) \leq \sqrt{\frac{n}{M}} 2^{-2^h/n}\,, \end{equation}\]where the last inequality holds for the unit ball of any norm on \(\R^n\) (and, indeed, \(d_{\nu,\infty}\) is a distance induced by a norm on \(\R^n\)), as you show in Homework #1.
Putting everything together. Now \eqref{eq:ent1} and \eqref{eq:ent2} in conjunction give
\[\sum_{h \geq 0} 2^{h/2} e_h(B_F, d_{\nu,\infty}) \lesssim \log(n) \sqrt{n \frac{\log M}{M}} + \sqrt{\frac{n}{M}} \sum_{h > 4 \log_2 n} 2^{h/2}\, 2^{-2^h/n} \lesssim \sqrt{n (\log n)^2 \frac{\log M}{M}}\,,\]where we apply \eqref{eq:ent1} for \(h \leq 4 \log_2 n\) and \eqref{eq:ent2} for \(h > 4 \log_2 n\); the second sum is \(O(1)\) since \(2^h/n \geq n^3\) in that range. Combining this with \eqref{eq:nucomp} and \eqref{eq:dudley} gives
\[\E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \lesssim \sqrt{n (\log n)^2 \frac{\log M}{M}} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}.\]Therefore, setting \(M \seteq C \frac{n}{\e^2} (\log n)^2 \log(n/\e)\) for an appropriate constant \(C > 1\), \eqref{eq:Fapprox} gives us our desired \(\e\)-approximation.
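To make the construction concrete, here is a minimal numerical sketch (an illustration, not part of the notes) of the sampling scheme in numpy. It uses the fact that for \(F(x) = \|Ax\|_2^2\), the error \(\max_{F(x) \leq 1} |F(x) - \tilde{F}_{\nu}(x)|\) equals the operator norm \(\|I - L^{\top} \tilde{A}^{\top} \tilde{A} L\|\), where \(\tilde{A}\) has rows \(a_{\nu_j}/\sqrt{M \rho_{\nu_j}}\) and \(L L^{\top} = (A^{\top}A)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 5000, 20, 2000
A = rng.standard_normal((m, n))

# Sampling probabilities rho_i = ||(A^T A)^{-1/2} a_i||_2^2 / n; these
# sum to 1 since the leverage scores of A sum to n.
P = np.linalg.inv(A.T @ A)
rho = np.einsum('ij,jk,ik->i', A, P, A) / n
rho /= rho.sum()  # guard against floating-point drift

# Sample nu_1, ..., nu_M i.i.d. from rho and reweight each row, so that
# F_tilde(x) = ||A_tilde x||_2^2 has expectation F(x) = ||A x||_2^2.
nu = rng.choice(m, size=M, p=rho)
A_tilde = A[nu] / np.sqrt(M * rho[nu])[:, None]

# With L L^T = (A^T A)^{-1} (Cholesky), we have L^T (A^T A) L = I, so
# max_{F(x) <= 1} |F(x) - F_tilde(x)| = ||I - L^T A_tilde^T A_tilde L||.
L = np.linalg.cholesky(P)
err = np.linalg.norm(np.eye(n) - L.T @ (A_tilde.T @ A_tilde) @ L, ord=2)
print(f"approximation error: {err:.3f}")
```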
Before proving the lemma, it helps to consider the more basic problem of covering the Euclidean ball \(B_2^n\) by translates of \(\e B_{\infty}^n\), i.e., by translates of small cubes.
Suppose \(\|x\|_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2 = 1\). Since we only care about approximation up to \(\e\) in the \(\ell_{\infty}\) distance, we could discretize this vector to lie in, say, \(\e \mathbb{Z}^n \cap B_2^n\).
The most basic kind of vector we need to cover is of the form \((0, \pm \e, 0, 0, \pm \e, 0, \pm \e, 0, \ldots, 0)\). Because \(\|x\|_2^2 = 1\), there are only \(n^{O(1/\e^2)}\) choices for such a vector. But we also need to handle vectors of the form \((0, \pm 2\e, 0, 0, \pm \e, 0, \pm 2\e, 0, \ldots, 0)\), and so on.
It is not hard to convince oneself that there are asymptotically fewer vectors of this form; see the count made explicit below. Indeed, if some entry is \(2\e\) then there are \(n\) choices for where it goes, but there are \(n(n-1)/2\) choices for where two copies of \(\e\) go. Thus the total number of centers one needs is only \(n^{O(1/\e^2)}\).
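To make the basic count explicit: a center whose nonzero entries consist of \(k\) coordinates equal to \(\pm\e\) satisfies \(k \e^2 \leq 1\), so the number of such centers is at most
\[\binom{n}{k} 2^k \leq (2n)^k \leq (2n)^{1/\e^2} = n^{O(1/\e^2)}\,.\]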
In other words,
\[\left(\log \mathcal{N}(B_2^n, \|\cdot\|_{\infty}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{\log n}\,.\]Now suppose we wanted to cover \(B_2^n\) instead with cubes of different side lengths, or with parallelepipeds (where the sides are no longer perpendicular), etc. There is a beautiful approach that gives surprisingly good bounds for covering \(B_2^n\) by translates of an arbitrary symmetric convex body. (I have heard it credited to Talagrand, or to Pajor and Talagrand.)
The dual-Sudakov approach.
We will prove the following substantial strengthening of this bound. Suppose that \(d(x,y) = \|x-y\|\) is a distance induced by some norm \(\|\cdot\|\) on \(\R^n\). Then for every \(\e > 0\),
\[\begin{equation}\label{eq:dual-sudakov} \left(\log \mathcal{N}(B_2^n, d, \e)\right)^{1/2} \lesssim \frac{1}{\e} \E \|\mathbf{g}\|\,, \end{equation}\]where \(\mathbf{g}\) is a standard \(n\)-dimensional gaussian.
Note that this proves the Covering Lemma since
\[\E \|U \mathbf{g}\|_{\infty} = \E \max_{j \in [M]} |\langle u_j, \mathbf{g}\rangle| \lesssim \max(\|u_1\|_2,\ldots,\|u_M\|_2) \sqrt{\log M}\,.\]The last bound is a basic exercise with the gaussian tail, taking a union bound over the \(M\) gaussian variables \(\{ \langle u_j,\mathbf{g} \rangle : j \in [M] \}\).
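To spell out that exercise (a standard gaussian computation, sketched here): set \(\sigma \seteq \max(\|u_1\|_2,\ldots,\|u_M\|_2)\), so each \(\langle u_j,\mathbf{g}\rangle\) is a centered gaussian with variance at most \(\sigma^2\), and the union bound gives \(\Pr\left[\max_{j \in [M]} |\langle u_j,\mathbf{g}\rangle| > t\right] \leq 2 M e^{-t^2/2\sigma^2}\). Integrating the tail with \(t_0 \seteq \sigma \sqrt{2 \log(2M)}\),
\[\E \max_{j \in [M]} |\langle u_j,\mathbf{g}\rangle| \leq t_0 + \int_{t_0}^{\infty} 2M e^{-t^2/2\sigma^2}\,dt \leq t_0 + \frac{2M \sigma^2}{t_0} e^{-t_0^2/2\sigma^2} = t_0 + \frac{\sigma^2}{t_0} \lesssim \sigma \sqrt{\log M}\,.\]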
Shift Lemma: Suppose that \(\gamma_n\) is the standard gaussian measure on \(\R^n\) and \(W \subseteq \R^n\) is any symmetric set (\(W = -W\)) and \(z \in \R^n\). Then,
\[\gamma_n(W+z) \geq e^{-\|z\|_2^2/2} \gamma_n(W)\,.\]Proof: Write
\[\begin{align*} \gamma_n(W+z) &\propto \int_W e^{-\|x+z\|_2^2/2}\,dx \\ &= \E_{\sigma \in \{-1,1\}} \int_W e^{-\|\sigma x + z\|_2^2/2}\,dx \\ &\geq \int_W e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2/2}\,dx\,, \end{align*}\]where the equality uses symmetry of \(W\) and the inequality uses convexity of \(y \mapsto e^{-y}\) (Jensen's inequality).
Now note that from Pythagoras, \(\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2 = \|x\|_2^2 + \|z\|_2^2\), therefore
\[\int_W e^{-\E_{\sigma \in \{-1,1\}} \|\sigma x + z\|_2^2/2}\,dx = e^{-\|z\|_2^2/2} \int_W e^{-\|x\|_2^2/2}\,dx \propto e^{-\|z\|_2^2/2} \gamma_n(W)\,,\]completing the proof of the Shift Lemma.
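Before moving on, here is a quick numerical sanity check of the Shift Lemma (an illustration, not part of the notes; it assumes scipy is available for the gaussian CDF). For an axis-aligned symmetric box, the gaussian measure of a shifted box factors into one-dimensional CDF differences:

```python
import numpy as np
from scipy.stats import norm

# Sanity check of the Shift Lemma for a symmetric box
# W = [-w_1, w_1] x ... x [-w_n, w_n]: the gaussian measure of W + z
# factors as a product of one-dimensional gaussian CDF differences.
rng = np.random.default_rng(1)
n = 4
w = rng.uniform(0.5, 2.0, size=n)   # box half-widths, so W = -W
z = rng.standard_normal(n)          # an arbitrary shift

gamma_W = np.prod(norm.cdf(w) - norm.cdf(-w))
gamma_W_z = np.prod(norm.cdf(z + w) - norm.cdf(z - w))   # gamma_n(W + z)

lower = np.exp(-np.linalg.norm(z) ** 2 / 2) * gamma_W
assert gamma_W_z >= lower
print(f"gamma(W+z) = {gamma_W_z:.4f} >= {lower:.4f}")
```

Proof of \eqref{eq:dual-sudakov}: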
Let \(B \seteq \{ x \in \R^n : \|x\| \leq 1 \}\) denote the unit ball in our given norm. By scaling, we can assume that \(\e = 2\).
If we cannot cover \(B_2^n\) by \(N\) translates of \(2B\), then there must exist \(x_1,\ldots,x_N \in B_2^n\) such that \(x_1 + B, \ldots, x_N + B\) are all pairwise disjoint: take a maximal set of points in \(B_2^n\) whose pairwise \(\|\cdot\|\)-distances exceed \(2\); by maximality, the translates of \(2B\) centered at these points cover \(B_2^n\), so there are more than \(N\) of them. Let \(\lambda > 0\) be a parameter we will choose momentarily, and note that \(\lambda (x_1+B),\ldots, \lambda(x_N+B)\) are pairwise disjoint as well, therefore
\[\begin{align*} 1 \geq \gamma_n\left(\bigcup_{i \leq N} \lambda (x_i+B)\right) &= \sum_{i \leq N} \gamma_n\left(\lambda (x_i+B)\right) \\ &\geq \gamma_n(\lambda B) \sum_{i \leq N} e^{-\lambda^2 \|x_i\|_2^2/2} \\ &\geq \gamma_n(\lambda B) N e^{-\lambda^2/2}\,, \end{align*}\]where the first inequality uses the Shift Lemma and the second inequality uses \(x_1,\ldots,x_N \in B_2^n\).
Now note that, by Markov’s inequality, \(\E \|\mathbf{g}\| \geq \lambda \gamma_n(\R^n \setminus (\lambda B))\), thus taking \(\lambda \seteq 2 \E \|\mathbf{g}\|\) gives \(\gamma_n(\lambda B) \geq 1/2\), and we conclude that
\[N \leq 2 e^{\lambda^2/2} = 2 e^{2 (\E \|\mathbf{g}\|)^2}\,,\]completing the proof.
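This proves \eqref{eq:dual-sudakov} at \(\e = 2\); the general case follows by the scaling mentioned above. Explicitly, applying the bound to the rescaled norm \(\|x\|' \seteq \frac{2}{\e} \|x\|\), whose unit ball is \(\frac{\e}{2} B\), covering by translates of \(2 \cdot \frac{\e}{2} B = \e B\) is exactly covering at scale \(\e\) in \(d\), and since \(\E \|\mathbf{g}\|' = \frac{2}{\e} \E \|\mathbf{g}\|\), we conclude
\[\left(\log \mathcal{N}(B_2^n, d, \e)\right)^{1/2} \lesssim \E \|\mathbf{g}\|' = \frac{2}{\e}\, \E \|\mathbf{g}\|\,.\]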