Linear regression
Consider the prototypical problem of least-squares linear regression: Suppose we are given input pairs \((a_1,b_1), (a_2,b_2), \ldots, (a_m,b_m)\) with \(a_1,\ldots,a_m \in \R^n\) and \(b_1,\ldots,b_m \in \R\), and we want to find \(x \in \R^n\) such that \(\langle a_i,x\rangle \approx b_i\) for all \(i=1,\ldots,m\).
Let \(A\) be the \(m \times n\) matrix whose rows are \(a_1,\ldots,a_m\), and let \(b = (b_1,\ldots,b_m)\). We will assume that the data is noisy and the system is overdetermined (\(m \gg n\)), so that there is no exact solution to \(Ax=b\).
In this case, a natural thing to try is to minimize the least-squares error:
\[\textrm{minimize}\ \left\{ \|Ax-b\|_2^2 : x \in \R^n \right\}.\]One might now ask the question of whether we really need to work with all \(m\) examples given that the dimensionality is only \(n \ll m\). Here’s one possible formulation of this problem: Is there a matrix \(\tilde{A}\) with \(\tilde{m} \approx n\) rows, together with a correspondingly shortened vector \(\tilde{b} \in \R^{\tilde{m}}\), such that the solution to any least-squares regression problem with \((\tilde{A},\tilde{b})\) is close to the solution with \((A,b)\)?
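To make the setup concrete, here is a minimal NumPy sketch of the least-squares problem above; the dimensions, the noise level, and names like `x_true` are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5                     # overdetermined: m >> n
A = rng.standard_normal((m, n))   # rows are the inputs a_1, ..., a_m
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)  # noisy labels b_i

# Minimize ||Ax - b||_2^2 over x in R^n
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
```

With \(m \gg n\) and small noise, `x_hat` recovers `x_true` up to a small error.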
Even more ambitiously, we could ask that every row of \(\tilde{A}\) be a weighted scaling of some row of \(A\). Thus \(\tilde{A}\) would correspond to the weighted input \(c_{i_1} (a_{i_1}, b_{i_1}), c_{i_2} (a_{i_2}, b_{i_2}), \ldots, c_{i_M} (a_{i_M}, b_{i_M})\), with \(c_{i_1}, c_{i_2}, \ldots \geq 0\) and \(\tilde{b} \seteq (c_{i_1} b_{i_1}, \ldots, c_{i_M} b_{i_M})\), and the property we’d like to have is that \(\|A x - b\|_2 \approx \|\tilde{A} x - \tilde{b}\|_2\) for every \(x \in \R^n\).
Sparsification
This is a special case of a general type of sparsification problem: Suppose that $F : \R^n \to \R_+$ is given by
\[\begin{equation}\label{eq:F} F(x) = f_1(x) + f_2(x) + \cdots + f_m(x) \end{equation}\]for some functions \(f_1,\ldots,f_m : \R^n \to \R_+\). Consider a function \(\tilde{F} : \R^n \to \R_+\) given by
\[\tilde{F}(x) = c_1 f_1(x) + c_2 f_2(x) + \cdots + c_m f_m(x)\]for some nonnegative weights \(c_1,c_2,\ldots,c_m \geq 0\). Say that the representation is \(s\)-sparse if at most \(s\) of the weights are nonzero. Say that \(\tilde{F}\) is an \(\e\)-approximation to \(F\) if it holds that
\[|F(x)-\tilde{F}(x)| \leq \e F(x),\quad \forall x \in \R^n\,.\]
General linear models
Our least-squares regression problem corresponds to the case where \(f_i(x) = \abs{\langle a_i,x\rangle - b_i}^2\) for \(i=1,\ldots,m\).
$\ell_p$ loss. But one can imagine many other loss functions. If, instead, we took \(f_i(x) = \abs{\langle a_i,x\rangle - b_i}^p\), the corresponding problem is called $\ell_p$ regression.
In general, one can ask the question: Suppose that \(f_i(x) \seteq \varphi(\langle a_i,x\rangle - b_i)\). For what loss functions \(\varphi : \R \to \R_+\) is such sparsification possible?
Huber and \(\gamma_p\) losses. An interesting example is the Huber loss where \(\varphi(y) = \min(y^2, |y|)\). (The Huber loss is more general, but this special case captures its essence.) The loss function grows quadratically near \(0\), but is much less sensitive to outliers than the \(\ell_2\) loss. In general, functionals that grow like \(\varphi(y) = \min(y^2, |y|^p)\) for \(0 < p \leq 2\) have been studied, especially in the context of fast algorithms for solving \(\ell_p\) regression to high accuracy (see, e.g., [BCLL17]).
The best published work achieves \(s \leq O(n^{4-2\sqrt{2}})\) [MMWY21], but we will see, using methods in this course (from upcoming joint work with Jambulapati, Liu, and Sidford) that \(s \leq O(\frac{n}{\e^2} (\log n)^{O(1)})\) is possible.
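As a quick illustration of the clipped Huber loss \(\varphi(y) = \min(y^2, |y|)\) discussed above, the following sketch (the residual values are made up) shows its insensitivity to an outlier compared with the squared loss:

```python
import numpy as np

def huber(y):
    # Simplified Huber loss: quadratic near 0, linear for large |y|
    return np.minimum(y**2, np.abs(y))

residuals = np.array([0.1, -0.2, 0.05, 50.0])  # last entry is an outlier

sq_loss = np.sum(residuals**2)      # the outlier contributes 2500
hu_loss = np.sum(huber(residuals))  # the outlier contributes only 50
```

The small residuals are penalized quadratically under both losses, but the Huber total stays on the order of the outlier's magnitude rather than its square.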
What about the function \(\varphi(y) = \min(y^2, 1)\)? This is sometimes called the Tukey loss, and for it we know only partial answers to the sparsification question.
ReLUs. For an example of a loss function that doesn’t allow sparsification, consider \(\varphi(y) = \max(0, y - 0.5)\). The functional \(f_i(x) = \max(0, \langle a_i,x\rangle - 0.5)\) “lights up” only when \(\langle a_i,x\rangle\) is substantial. But one can easily choose exponentially many unit vectors \(\{a_i\}\) in \(\R^n\) where these functionals have disjoint support, meaning that there is no way to sparsify down to fewer than a number of terms that’s exponential in the dimension.
If we choose \(f_i(x) = \max(0, |\langle a_i,x\rangle| - 0.5)\), then \(f_1,\ldots,f_m\) are symmetric and convex, so those two conditions alone are not enough to give strong sparsification.
In general, we don’t yet have a complete theory of sparsification, even for 1-dimensional functions.
Graph sparsification
Consider an undirected graph \(G=(V,E,w)\) with nonnegative weights \(w : E \to \R_+\) on the edges. Let us enumerate the edges of \(G\) as \(u_1 v_1, u_2 v_2, \ldots, u_m v_m\), and define \(f_i(x) = w_{u_i v_i} |x_{u_i} - x_{v_i}|\). Then,
\[F(x) = f_1(x) + \cdots + f_m(x) = \sum_{uv \in E} w_{uv} |x_u-x_v|\,,\]and for \(x \in \{0,1\}^V\), the latter sum is precisely the total weight of the edges crossing the cut defined by \(x\).
Now the question of whether \(F\) admits an \(s\)-sparse \(\e\)-approximation is precisely the question of whether the weighted graph \(G\) admits a cut sparsifier in the sense of Benczur and Karger.
If we instead define \(f_i(x) = w_{u_i v_i} |x_{u_i}-x_{v_i}|^2\), then the question becomes one about spectral graph sparsifiers in the sense of Spielman and Teng.
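As a sanity check on the cut interpretation above, here is a small sketch (the example graph is an arbitrary choice) evaluating \(F(x) = \sum_{uv \in E} w_{uv} |x_u - x_v|\) on a cut indicator:

```python
import numpy as np

# Weighted triangle on vertices {0, 1, 2}: entries are (u, v, w_uv)
edges = [(0, 1, 2.0), (1, 2, 3.0), (0, 2, 5.0)]

def F(x):
    # F(x) = sum over edges of w_uv * |x_u - x_v|
    return sum(w * abs(x[u] - x[v]) for u, v, w in edges)

x = np.array([1, 0, 0])  # cut separating vertex 0 from {1, 2}
# Edges (0,1) and (0,2) cross the cut, so F(x) = 2.0 + 5.0 = 7.0
```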
High-dimensional sparsification
Hypergraph cut sparsifiers. Suppose now that \(H=(V,E,w)\) is a weighted hypergraph in the sense that each hyperedge \(e \in E\) is a subset of vertices. If we define, for \(e \in E\),
\[f_e(x) \seteq \max_{\{u,v\} \in {e \choose 2}} |x_u-x_v|\,,\]and \(F(x) = \sum_{e \in E} f_e(x)\), then the corresponding notion is that of a hypergraph cut sparsifier.
If we replaced \(\abs{x_u-x_v}\) by \(\abs{x_u-x_v}^2\), we would get the notion of a spectral hypergraph sparsifier.
See [Lee23] and the references therein.
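A small sketch of the hyperedge functional \(f_e\) above (the hyperedge and test vectors are arbitrary): for a \(0/1\) vector, \(f_e(x) = 1\) exactly when the hyperedge is cut.

```python
import numpy as np

def f_e(x, e):
    # f_e(x) = max over pairs {u, v} in e of |x_u - x_v|
    return max(abs(x[u] - x[v]) for u in e for v in e)

e = [0, 1, 2, 3]                  # one hyperedge on four vertices
x_cut = np.array([1, 0, 0, 1])    # e has vertices on both sides of the cut
x_uncut = np.array([1, 1, 1, 1])  # e lies entirely on one side
```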
Symmetric submodular functions. More generally, we can ask about the setting where \(f_1,\ldots,f_m : \R^n \to \R_+\) are symmetric submodular functions and
\[F(x) = f_1(x)^p + f_2(x)^p + \cdots + f_m(x)^p\,.\](More precisely, one should take each \(f_i\) to be the Lovász extension of a symmetric submodular set function that assigns the value \(0\) to \(\emptyset\).) For \(p=1\), this generalizes the setting of hypergraph cut sparsifiers, and one can achieve \(s \leq \frac{n}{\e^2} (\log n)^{O(1)}\).
For \(p=2\), the best-known bound is \(s \leq \frac{n^{3/2}}{\e^2} (\log n)^{O(1)}\), and it’s unknown whether one can do better.
See [JLLS23] and the references therein.
General norms. A further generalization considers general norms \(N_1,\ldots,N_m\) on \(\R^n\) and whether the norm \(N\) defined by
\[N(x)^p = N_1(x)^p + N_2(x)^p + \cdots + N_m(x)^p\]admits an approximate sparsifier.
See [JLLS23] and the references therein.
General submodular functions. For more general (non-symmetric) submodular functions, there are many classes of functions that allow varying degrees of sparsification. See, e.g., [KK23].
Unbiased estimators. A general method to construct sparsifiers is by independent random sampling. Suppose we are in the setting \eqref{eq:F} and let \(\rho = (\rho_1,\ldots,\rho_m)\) be a probability vector, i.e., a nonnegative vector with \(\rho_1 + \cdots + \rho_m = 1\).
Let us sample indices \(i_1,i_2,\ldots,i_M\) i.i.d. from \(\rho\) and define the random sparsifier \(\tilde{F}\) by
\[\tilde{F}(x) \seteq \frac{1}{M} \left(\frac{f_{i_1}(x)}{\rho_{i_1}} + \frac{f_{i_2}(x)}{\rho_{i_2}} + \cdots + \frac{f_{i_M}(x)}{\rho_{i_M}}\right).\]Note that, because we have rescaled by the probabilities, it holds that for every \(x \in \R^n\),
\[\E\left[\frac{f_{i_1}(x)}{\rho_{i_1}}\right] = \sum_{i=1}^m \rho_i \frac{f_i(x)}{\rho_i} = F(x)\,,\]and therefore \(\E[\tilde{F}(x)] = F(x)\) as well.
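The unbiasedness claim can be checked empirically; here is a sketch for the \(\ell_2\) case with uniform \(\rho\) (all sizes and seeds are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

vals = (A @ x) ** 2        # vals[i] = f_i(x) = <a_i, x>^2
F_x = vals.sum()           # F(x)
rho = np.full(m, 1.0 / m)  # uniform probabilities (any rho > 0 works here)

def F_tilde(M):
    # One draw of the M-term importance-sampling estimator
    idx = rng.choice(m, size=M, p=rho)
    return np.mean(vals[idx] / rho[idx])

# Averaging many independent copies recovers F(x), since E[F_tilde(x)] = F(x)
est = np.mean([F_tilde(10) for _ in range(5000)])
```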
Importance. As we just argued, \(\tilde{F}(x)\) is an unbiased estimator for any \(\rho\), but crucially in order to choose \(M\) small and still hope that \(\tilde{F}\) will be a good approximation to \(F\), we need to choose \(\rho\) carefully.
A simple example comes from the \(\ell_2\) regression problem. Let us assume that \(b=0\) and consider \(a_1,\ldots,a_m \in \R^n\) such that \(a_1\) is linearly independent from \(a_2,\ldots,a_m\). Then \(f_1(x) = |\langle a_1,x\rangle|^2\) must appear in a sparse approximation to \(F\): for any \(x\) orthogonal to \(a_2,\ldots,a_m\) but not to \(a_1\), it is the only nonzero term. Therefore we will need to sample on the order of \(1/\rho_1\) terms before we expect \(f_1\) to appear. If we are looking to have \(M \leq n (\log n)^{O(1)}\), it’s clear that choosing \(\rho\) to be uniform, for instance, will be drastically insufficient.
Concentration. In general, once we choose \(\rho\), we are left with the non-trivial problem of getting \(F(x) \approx \tilde{F}(x)\) for every \(x \in \R^n\). This can be a daunting task that is often framed as bounding an expected maximum:
\[\E \sup_{0 \neq x \in \R^n} \frac{|F(x)-\tilde{F}(x)|}{F(x)} \leq \e\,.\]If we focus on the case where \(F\) is sufficiently homogeneous (like the \(\ell_2\) regression problem), this simplifies a bit to
\[\E \sup_{F(x) \leq 1} |F(x) - \tilde{F}(x)| \leq \e\,,\]but the fundamental difficulty remains.
Considering \(F(x) = f_1(x) + \cdots + f_m(x)\), one way to choose an importance for each \(f_i\) is to obtain a uniform bound on each term. In other words, we consider the quantity
\[\alpha_i \seteq \max_{0 \neq x \in \R^n} \frac{f_i(x)}{F(x)}\,.\]If we assume that \(F\) and \(f_i\) scale homogeneously (e.g., in the case of \(\ell_2\) regression, \(F(\lambda x) = |\lambda|^2 F(x)\) and similarly for each \(f_i\)), then we can write this as
\[\begin{equation}\label{eq:uniform-con} \alpha_i = \max_{F(x) \leq 1} f_i(x)\,. \end{equation}\]Let’s try to derive good sampling probabilities for the \(\ell_2\) regression problem where
\[F(x) = |\langle a_1,x\rangle|^2 + |\langle a_2,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2\,,\]for \(a_1,\ldots,a_m \in \R^n\), and we define \(A\) as the matrix with \(a_1,\ldots,a_m\) as rows. We’ll derive our importance scores in a few different ways. Each of these perspectives will be useful for moving later to more general settings.
Uniform control.
Consideration of \eqref{eq:uniform-con} in this setting is easy:
\[\alpha_i = \max_{\|A x\|^2 \leq 1} |\langle a_i,x\rangle|^2\,.\]Observe that \(\|A x\|^2 = \langle Ax, Ax\rangle = \langle (A^{\top} A)^{1/2} x, (A^{\top} A)^{1/2} x\rangle\), so
\[\begin{align*} \alpha_i &= \max_{\|x\|^2 \leq 1} |\langle a_i, (A^{\top} A)^{-1/2} x\rangle|^2 \\ &= \max_{\|x\|^2 \leq 1} |\langle (A^{\top} A)^{-1/2} a_i, x\rangle|^2 \\ &= \|(A^{\top} A)^{-1/2} a_i\|_2^2\,, \end{align*}\]where in the last line we took \(x = (A^{\top} A)^{-1/2} a_i/\|(A^{\top} A)^{-1/2} a_i\|_2\) as the maximizer. The value
\[\sigma_i(A) \seteq \|(A^{\top} A)^{-1/2} a_i\|_2^2 = \langle a_i, (A^{\top} A)^{-1} a_i\rangle\]is called the \(i\)th leverage score of \(A\).
Note that
\[\sum_{i=1}^m \sigma_i(A) = \sum_{i=1}^m \langle a_i, (A^{\top} A)^{-1} a_i\rangle = \tr\left((A^{\top} A)^{-1} \sum_{i=1}^m a_i a_i^{\top}\right) = \mathrm{rank}(A)\,,\]therefore our corresponding sample probabilities would be \(\rho_i \seteq \sigma_i(A)/\mathrm{rank}(A)\). We will assume in the rest of this lecture that \(\mathrm{rank}(A) = n\), but this is only for convenience.
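The definition and the trace identity above can be checked in a few lines; a sketch with random Gaussian rows (so \(A\) has full column rank almost surely):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 4
A = rng.standard_normal((m, n))

G_inv = np.linalg.inv(A.T @ A)
# sigma_i(A) = <a_i, (A^T A)^{-1} a_i>, computed for all rows at once
sigma = np.einsum('ij,jk,ik->i', A, G_inv, A)

rho = sigma / n  # sampling probabilities, since sum_i sigma_i = rank(A) = n
```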
Symmetry considerations.
We have already seen that if \(a_1\) is linearly independent from the rest of the vectors, then we should sample it with a pretty large probability (considering the case where \(a_1,\ldots,a_n\) forms an orthonormal basis, “large” will have to mean probability \(\approx 1/n\)). Another indication comes from how \(F\) scales: If we replace \(a_1\) by two copies of \(a_1/\sqrt{2}\), then the function \(F\) doesn’t change. This suggests that \(\rho_i\) should scale proportional to \(\|a_i\|_2^2\).
But notice also that \(F\) is a sum of squares; as there are no cancellations, it is natural to consider the matrix \(a_1 a_1^{\top} + \cdots + a_m a_m^{\top} = A^{\top} A\). Let’s suppose, for simplicity, that the vectors form a decomposition of the identity:
\[\begin{equation}\label{eq:isotropic} a_1 a_1^{\top} + \cdots + a_m a_m^{\top} = I\,. \end{equation}\]Then we have \(\|a_1\|_2^2 + \cdots + \|a_m\|_2^2 = \tr\left(a_1 a_1^{\top} + \cdots + a_m a_m^{\top}\right) = \tr(I) = n\), so symmetry suggests we take \(\rho_i \seteq \|a_i\|_2^2/n\).
For the general case, let us assume that \(A\) has full rank, i.e., \(\mathrm{rank}(A) = n\). We can left-multiply the vectors by \((A^{\top} A)^{-1/2}\) to achieve \eqref{eq:isotropic}, again suggesting the probabilities \(\rho_i = \sigma_i(A)/n\).
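The reduction to isotropic position can also be verified numerically; a sketch that computes \((A^{\top} A)^{-1/2}\) via an eigendecomposition (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 4
A = rng.standard_normal((m, n))

# (A^T A)^{-1/2} via eigendecomposition of the positive definite Gram matrix
w, V = np.linalg.eigh(A.T @ A)
W = V @ np.diag(w ** -0.5) @ V.T

B = A @ W  # rows b_i = (A^T A)^{-1/2} a_i (W is symmetric)
# Now sum_i b_i b_i^T = B^T B = I, and ||b_i||_2^2 = sigma_i(A)
```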
We can think about leverage scores in a couple of other ways.
Expected contribution to \(F(x)\).
One way to judge the importance of \(a_i\) is to switch from a “worst case” perspective (approximating \(F(x)\) for every \(x\)) to an “average case” one: Suppose that \(\mathbf{X}\) is a random vector in \(\R^n\) and define
\[\rho_i \propto \E \left[|\langle a_i, \mathbf{X}\rangle|^2\right].\]What distribution should we use? In the case that \eqref{eq:isotropic} holds, it makes sense for \(\mathbf{X}\) to be rotationally invariant, and indeed any such choice works. Using the Gaussian measure on $\R^n$, this would give
\[\rho_i \propto \int |\langle a_i,x\rangle|^2 e^{-\|x\|_2^2/2}\,dx\,,\]and it is straightforward to check that \(\rho_i \propto \|a_i\|_2^2\). And if we apply the transformation \((A^{\top} A)^{-1/2}\) to a general set of vectors to achieve \eqref{eq:isotropic}, then the corresponding form is
\[\rho_i \propto \int |\langle a_i,x\rangle|^2 e^{-\|(A^{\top} A)^{1/2} x\|_2^2/2}\,dx = \int |\langle a_i,x\rangle|^2 e^{-\|A x\|_2^2/2}\,dx\,,\]and again one has \(\rho_i \propto \|(A^{\top}A)^{-1/2} a_i\|_2^2\).
Singular value decomposition.
Suppose we write \(A = U \Sigma V^{\top}\). Recall that we consider \(n \leq m\), so \(\Sigma = \begin{bmatrix} D \\ 0\end{bmatrix}\), where \(D\) is an \(n \times n\) diagonal matrix containing the singular values \(\sigma_1,\ldots,\sigma_n\). Note that \(U\) is an \(m \times m\) matrix, but \(\Sigma\) makes all but the first \(n\) columns of \(U\) meaningless (indeed, one can obtain an SVD from any completion of the first \(n\) columns of \(U\) to an orthonormal basis).
If we consider the “compact SVD” \(A = \hat{U} \Sigma V^{\top}\) where we cut the last \(m-n\) columns off of \(U\), then \(\sigma_i(A) = \|\hat{u}_i\|_2^2\), where \(\hat{u}_1,\ldots,\hat{u}_m\) are the rows of \(\hat{U}\).
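A sketch verifying this characterization: the squared row norms of \(\hat{U}\) from the compact SVD agree with the leverage scores computed directly from the definition.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 100, 4
A = rng.standard_normal((m, n))

# Compact SVD: U_hat is m x n with orthonormal columns
U_hat, S, Vt = np.linalg.svd(A, full_matrices=False)

sigma_svd = (U_hat ** 2).sum(axis=1)  # squared row norms of U_hat
sigma_direct = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A)
# sigma_svd and sigma_direct coincide
```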
Once we’ve chosen our sampling probabilities \(\rho_i = \sigma_i(A)/n\), we can independently sample \(M\) terms from the distribution, and we are left to analyze the random sum
\[\tilde{A} = \frac{1}{M}\left(\frac{a_{i_1} a_{i_1}^{\top}}{\rho_{i_1}} + \cdots + \frac{a_{i_M} a_{i_M}^{\top}}{\rho_{i_M}}\right).\]We have already argued that \(\E[\tilde{A}] = A^{\top} A\), so the real question is about concentration, which we will begin to cover in the next lecture.
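Putting the pieces together, here is a sketch of leverage-score sampling that measures the relative spectral error of the sampled matrix as an approximation to \(A^{\top} A\); the sizes and the sample count `M` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 2000, 10
A = rng.standard_normal((m, n))

G = A.T @ A
sigma = np.einsum('ij,jk,ik->i', A, np.linalg.inv(G), A)
rho = sigma / sigma.sum()  # leverage-score sampling probabilities

M = 4000
idx = rng.choice(m, size=M, p=rho)
# tilde_G = (1/M) sum_j a_{i_j} a_{i_j}^T / rho_{i_j}, so E[tilde_G] = G
scaled = A[idx] / np.sqrt(M * rho[idx])[:, None]
tilde_G = scaled.T @ scaled

# Relative spectral error  || G^{-1/2} tilde_G G^{-1/2} - I ||_op
w, V = np.linalg.eigh(G)
G_inv_half = V @ np.diag(w ** -0.5) @ V.T
err = np.linalg.norm(G_inv_half @ tilde_G @ G_inv_half - np.eye(n), ord=2)
```

For these sizes, `err` is small with high probability, consistent with the concentration phenomena discussed next.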
There are actually two ways to approach this question. One is to think about proving concentration of the sum along every direction \(x \in \R^n\) simultaneously. This leads naturally to entropy bounds, covering numbers, and the generic chaining theory. The second approach is to think about \(\tilde{A}\) as a random operator and try to prove concentration of sums of independent random matrices. The former will be far more general, while the latter appears substantially more powerful in certain special cases.