Empirical risk minimization (ERM) is a widely studied problem in learning theory and statistics. A prominent special case is the problem of optimizing a generalized linear model (GLM), i.e.,
\[\min_{x \in \mathbb R^n} F(x) \mathrel{\mathop:}=\sum_{i=1}^m \varphi_i(\langle a_i, x\rangle - b_i)\,,\]where the total loss $F : \mathbb R^n \to \mathbb R$ is defined by vectors $a_1,\ldots,a_m \in \mathbb R^n$, $b \in \mathbb R^m$, and loss functions $\varphi_1, \dots, \varphi_m: \mathbb R\to \mathbb R$. Different choices of the loss functions $\{\varphi_i\}$ capture a variety of important problems, including linear regression, logistic regression, and $\ell_p$ regression.
When $m \gg n$, a natural approach for fast algorithms is to apply sparsification techniques to reduce the value of $m$, while maintaining a good multiplicative approximation of the objective value. This corresponds to our usual sparsification model where $f_i(x) = \varphi_i(\langle a_i,x\rangle - b_i)$. As mentioned previously, we can assume that $b_1=\cdots=b_m=0$ by lifting to one dimension higher. In previous lectures, we solved the sparsification problem when $\varphi_i(z)=|z|^p$ for some $1 \leqslant p \leqslant 2$.
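The lifting can be made explicit. The identity below is a sketch of one standard convention (appending the shift as an extra coordinate); the names $\tilde a_i$, $\tilde x$ are ours:

```latex
% One way to absorb the shifts b_i: append -b_i as an (n+1)-st coordinate.
\[\tilde a_i \mathrel{\mathop:}=(a_i, -b_i) \in \mathbb R^{n+1}\,, \qquad
  \tilde x \mathrel{\mathop:}=(x, 1) \in \mathbb R^{n+1}
  \quad\Longrightarrow\quad
  \langle \tilde a_i, \tilde x\rangle = \langle a_i, x\rangle - b_i\,,\]
```

so a sparsifier for $\tilde x \mapsto \sum_i \varphi_i(\langle \tilde a_i,\tilde x\rangle)$ over $\mathbb R^{n+1}$ restricts, on the slice $\tilde x = (x,1)$, to a sparsifier for the shifted objective.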
However, $\ell_p$ losses are essentially the only class of natural loss functions for which linear-size sparsification results are known for GLMs. For instance, for the widely-studied class of Huber-type loss functions, e.g., $\varphi_i(z) = \min(|z|, |z|^2)$, the best known sparsity bounds obtained by previous methods were of the form $\tilde{O}(n^{4-2\sqrt{2}} \varepsilon^{-2})$.
We will often think of the case $\varphi_i(z) = h_i(z)^2$ for some $h_i : \mathbb R\to \mathbb R_+$, as the assumptions we need are stated more naturally in terms of $\sqrt{\varphi_i}$. To that end, consider a function $h : \mathbb R\to \mathbb R_+$ and the following two properties, where $L, c, \theta> 0$ are positive constants.
Assumptions:
($L$-auto-Lipschitz) $|h(z)-h(z')| \leqslant L\,h(z-z')$ for all $z,z' \in \mathbb R$.
(Lower $\theta$-homogeneous) $h(\lambda z) \geqslant c \lambda^{\theta} h(z)$ for all $z \in \mathbb R$ and $\lambda \geqslant 1$.
Theorem [Jambulapati-Lee-Liu-Sidford 2023]: Consider $\varphi_1, \dots, \varphi_m: \mathbb R\to \mathbb R_{+}$, and suppose there are numbers $L,c,\theta> 0$ such that each $\sqrt{\varphi_i}$ is $L$-auto-Lipschitz and lower $\theta$-homogeneous (with constant $c$). Then for any $a_1,\ldots,a_m \in \mathbb R^n$, and numbers $0 < \varepsilon< \tfrac12$ and $s_{\max}> s_{\min}> 0$, there are nonnegative weights $w_1,\ldots,w_m \geqslant 0$ such that
\[\left| F(x) - \sum_{i=1}^m w_i \varphi_i(\langle a_i,x\rangle) \right| \leqslant\varepsilon F(x)\,,\quad \forall x \in \mathbb R^n \ \textrm{ s.t. }\ s_{\min}\leqslant F(x) \leqslant s_{\max}\,,\]where $F(x) \mathrel{\mathop:}=\varphi_1(\langle a_1,x\rangle) + \cdots + \varphi_m(\langle a_m,x\rangle)$, and
\[\left|\left\{ i \in \{1,\ldots,m\} : w_i > 0 \right\}\right| \lesssim_{L,c,\theta} \frac{n}{\varepsilon^2} \log \left(\frac{n}{\varepsilon} \frac{s_{\max}}{s_{\min}}\right) \left(\log \frac{n}{\varepsilon}\right)^3\,.\]Moreover, the weights $\{w_i\}$ can be computed in time \(\tilde{O}_{L,c,\theta}\!\left((\mathsf{nnz}(a_1,\ldots,a_m) + n^{\omega} + m \mathcal{T}_{\mathrm{eval}}) \log(ms_{\max}/s_{\min})\right)\), with high probability.
Here, $\mathcal{T}_{\mathrm{eval}}$ is the maximum time needed to evaluate each $\varphi_i$, $\mathsf{nnz}(a_1,\ldots,a_m)$ is the total number of non-zero entries in the vectors $a_1,\ldots,a_m$, and $\omega$ is the matrix multiplication exponent. ``High probability'' means that the failure probability can be made less than $n^{-\ell}$ for any $\ell > 1$ by increasing the running time by an $O(\ell)$ factor.
It is not difficult to see that for $0 < p \leqslant 2$, the function $\varphi_i(z) = |z|^p$ satisfies the required hypotheses of the theorem. Additionally, the $\gamma_p$ functions, defined as
\[\gamma_p(z) \mathrel{\mathop:}=\min\left(|z|^p, z^2\right) = \begin{cases} z^2\,, & |z| \leqslant 1\,,\\ |z|^p\,, & |z| > 1\,,\end{cases}\]for $p \in (0, 2]$, also satisfy the conditions. The special case of $\gamma_1$ is known as the Huber loss.
To show that $\sqrt{\gamma_p}$ is $1$-auto-Lipschitz, it suffices to prove that $\sqrt{\gamma_p}$ is concave on $[0,\infty)$: a nonnegative concave function vanishing at the origin is subadditive, and subadditivity of the even function $\sqrt{\gamma_p}$ yields the auto-Lipschitz property. The lower growth bound (with $\theta = p/2$) is immediate for $p > 0$.
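As a numerical sanity check, the script below verifies both hypotheses for $\sqrt{\gamma_p}$ on a grid, assuming the definition $\gamma_p(z) = \min(|z|^p, z^2)$ with constants $L = 1$, $c = 1$, $\theta = p/2$; the grid, tolerance, and the choice $p = 1$ are ours:

```python
import numpy as np

def gamma(z, p):
    """gamma_p(z) = min(|z|^p, z^2): quadratic near 0, |z|^p in the tails (p <= 2)."""
    return np.minimum(np.abs(z)**p, z**2)

def h(z, p):
    """h = sqrt(gamma_p), the function the hypotheses are stated for."""
    return np.sqrt(gamma(z, p))

p = 1.0                       # gamma_1 is the Huber-type loss min(|z|, z^2)
zs = np.linspace(-5, 5, 201)

# 1-auto-Lipschitz: |h(z) - h(z')| <= h(z - z') for all grid pairs.
Z, Zp = np.meshgrid(zs, zs)
lip_ok = bool(np.all(np.abs(h(Z, p) - h(Zp, p)) <= h(Z - Zp, p) + 1e-12))

# Lower homogeneity with c = 1, theta = p/2: h(lam*z) >= lam^(p/2) * h(z), lam >= 1.
lams = np.linspace(1.0, 10.0, 50)
hom_ok = all(bool(np.all(h(lam * zs, p) >= lam**(p / 2) * h(zs, p) - 1e-12))
             for lam in lams)

print(lip_ok, hom_ok)
```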
Given $F(x) = \varphi_1(\langle a_1,x\rangle) + \cdots + \varphi_m(\langle a_m,x\rangle)$, our approach to sparsification is via importance sampling. Given a probability vector $\rho \in \mathbb R^m$ with $\rho_1,\ldots,\rho_m > 0$ and $\rho_1 + \cdots + \rho_m = 1$, we sample $M \geqslant 1$ coordinates $\nu_1,\ldots,\nu_M$ i.i.d. from $\rho$, and define our potential approximator by
\[\tilde{F}(x) \mathrel{\mathop:}=\frac{1}{M} \sum_{j=1}^{M} \frac{\varphi_{\nu_j}(\langle a_{\nu_j}, x\rangle)}{\rho_{\nu_j}}\,.\]Since we want an approximation guarantee to hold simultaneously for many $x \in \mathbb R^n$, it is natural to analyze expressions of the form
\[\mathop{\mathrm{\mathbb E}} \sup_{x \,:\, F(x) \leqslant s} \left| F(x) - \tilde{F}(x) \right|\,.\]
As we have seen, analysis of this expression involves the size of discretizations of the set \(B_F(s) \mathrel{\mathop:}=\{ x \in \mathbb R^n : F(x) \leqslant s \}\) at various granularities. The key consideration (via Dudley’s entropy inequality) is how well $B_F(s)$ can be covered by cells on which we have uniform control on how much the terms $\varphi_i(\langle a_i,x\rangle)/\rho_i$ vary within each cell.
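Here is a minimal sketch of the importance-sampling estimator in code; the data, the loss $\min(|z|, z^2)$, and the row-norm choice of $\rho$ are all illustrative placeholders (in particular, $\rho$ is not the leverage-score choice discussed below). The exact computation confirms that the estimator is unbiased for any positive $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))

def phi(z):
    """Illustrative Huber-type loss."""
    return np.minimum(np.abs(z), z**2)

def F(x):
    return phi(A @ x).sum()

# Any positive probability vector rho gives an unbiased estimator; here we use
# squared row norms (an illustrative choice only).
rho = np.linalg.norm(A, axis=1)**2
rho /= rho.sum()

x = rng.standard_normal(n)

# Exact expectation of one sampled term: sum_i rho_i * phi(<a_i,x>)/rho_i = F(x).
expectation = sum(rho[i] * phi(A[i] @ x) / rho[i] for i in range(m))

# Monte Carlo estimator with M i.i.d. samples from rho.
M = 5000
idx = rng.choice(m, size=M, p=rho)
F_tilde = np.mean(phi(A[idx] @ x) / rho[idx])

print(F(x), F_tilde)
```

Unbiasedness holds for any $\rho > 0$; the choice of $\rho$ only affects the variance, which is what the covering arguments below control.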
Let’s consider the case $\varphi_i(z) = |z|^2$ so that $F(x) = |\langle a_1,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2$. Here, \(B_F(s) = \{ x \in \mathbb R^n : \norm{Ax}_2^2 \leqslant s \}\), where $A$ is the matrix with $a_1,\ldots,a_m$ as rows.
A cell at scale $2^{j}$ looks like
\[\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{1 \leqslant i \leqslant m} \frac{|\langle a_i,x\rangle|^2}{\rho_i} \leqslant 2^{j} \right\},\]
and the pertinent question is how many translates of $\mathsf K_j$ it takes to cover $B_F(s)$. In the $\ell_2$ case, this is the well-studied problem of covering Euclidean balls by $\ell_{\infty}$ balls.
If $N_j$ denotes the minimum number of such cells required, then the dual-Sudakov inequality tells us that
\[\log N_j \lesssim \frac{s}{2^{j}} \left(\mathop{\mathrm{\mathbb E}}\max_{1 \leqslant i \leqslant m} \frac{|\langle (A^{\top} A)^{-1/2} a_i, g\rangle|}{\sqrt{\rho_i}}\right)^{\!2},\]where $g$ is a standard Gaussian vector in $\mathbb R^n$.
Choosing $\rho_i \mathrel{\mathop:}=\frac{1}{n} \norm{(A^\top A)^{-1/2} a_i}_2^2$ to be the normalized leverage scores yields uniform control on the size of the coverings:
\[\log N_j \lesssim \frac{s}{2^{j}}\, n \log m\,.\]
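In code, the normalized leverage scores and the two facts behind this uniform control (they sum to $1$, and by Cauchy-Schwarz each term $|\langle a_i,x\rangle|^2/\rho_i$ is at most $n\,\norm{Ax}_2^2$) can be checked directly; the random data is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 300, 4
A = rng.standard_normal((m, n))

# Normalized leverage scores: rho_i = ||(A^T A)^{-1/2} a_i||_2^2 / n.
H = np.linalg.inv(A.T @ A)
rho = np.einsum('ij,jk,ik->i', A, H, A) / n

# They sum to 1 since tr((A^T A)^{-1} A^T A) = n.
assert np.isclose(rho.sum(), 1.0)

# Uniform control: |<a_i, x>|^2 / rho_i <= n * ||A x||_2^2 by Cauchy-Schwarz,
# since <a_i, x> = <(A^T A)^{-1/2} a_i, (A^T A)^{1/2} x>.
x = rng.standard_normal(n)
ratios = (A @ x)**2 / rho
assert np.all(ratios <= n * np.sum((A @ x)**2) * (1 + 1e-9))
```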
Now we take $\varphi_i(z) = |z|^p$ so that $F(x) = \norm{Ax}_p^p$. A cell at scale $2^j$ now looks like
\[\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{1 \leqslant i \leqslant m} \frac{|\langle a_i,x\rangle|^p}{\rho_i} \leqslant 2^{j} \right\}.\]
To cover $B_F(s)$ by translates of $\mathsf K_j$, we will again employ Euclidean balls, and one can use $\ell_p$ Lewis weights to relate the $\ell_p$ structure to an $\ell_2$ structure.
A classical result of Lewis establishes that there are nonnegative weights $w_1,\ldots,w_m \geqslant 0$ such that if $W = \mathrm{diag}(w_1,\ldots,w_m)$ and $U \mathrel{\mathop:}=(A^{\top} W A)^{1/2}$, then
\[\label{eq:wis} w_i = \norm{U^{-1} a_i}_2^{p-2}\,,\quad i = 1,\ldots,m\,.\]
Assuming that $A$ has full rank, a straightforward calculation gives $\sum_{i=1}^m w_i \norm{U^{-1} a_i}_2^2 = \mathop{\mathrm{tr}}(U^2 U^{-2}) = n$.
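The weights can be computed by a simple fixed-point iteration (a Cohen-Peng-style scheme, shown here only as an illustrative sketch, not the fast algorithm from the theorem); the code checks the fixed-point equation $w_i = \norm{U^{-1} a_i}_2^{p-2}$ and the trace identity, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 100, 3, 1.5
A = rng.standard_normal((m, n))

def inv_quad(w):
    """tau_i = a_i^T (A^T W A)^{-1} a_i = ||U^{-1} a_i||_2^2 for U = (A^T W A)^{1/2}."""
    Minv = np.linalg.inv(A.T @ (w[:, None] * A))
    return np.einsum('ij,jk,ik->i', A, Minv, A)

# Fixed-point iteration for w_i = ||U^{-1} a_i||_2^{p-2} (Cohen-Peng-style;
# known to converge for p < 4; the iteration count is chosen generously).
w = np.ones(m)
for _ in range(200):
    w = inv_quad(w) ** ((p - 2) / 2)

tau = inv_quad(w)
assert np.allclose(w, tau ** ((p - 2) / 2), rtol=1e-6)

# Trace identity: sum_i w_i ||U^{-1} a_i||_2^2 = tr(U^2 U^{-2}) = n (any weights).
assert np.isclose((w * tau).sum(), n)
```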
Therefore we can take $\rho_i \mathrel{\mathop:}=\frac{1}{n} w_i \norm{U^{-1} a_i}_2^2$ for $i=1,\ldots,m$, and our cells become
\[\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{1 \leqslant i \leqslant m} \frac{|\langle a_i,x\rangle|^p}{\rho_i} \leqslant 2^{j} \right\}.\]
If we are trying to use $\ell_2$-$\ell_{\infty}$ covering bounds, we face an immediate problem: Unlike in the $\ell_2$ case, we don’t have prior control on $\norm{U x}_2$ for $x \in B_F(s)$. One can obtain an initial bound using the structure of $U = (A^{\top} W A)^{1/2}$:
\[\label{eq:intro-nc0} \norm{Ux}_2^2 = \sum_{i=1}^m w_i \langle a_i,x\rangle^2 = \sum_{i=1}^m w_i\, |\langle a_i,x\rangle|^{2-p}\, |\langle a_i,x\rangle|^{p} \leqslant \norm{Ux}_2^{2-p}\, \norm{Ax}_p^p\,,\]
where the last inequality is Cauchy-Schwarz: $|\langle a_i, x\rangle| = |\langle U^{-1} a_i, U x\rangle| \leqslant\norm{U^{-1} a_i}_2 \norm{U x}_2$, which together with the choice of Lewis weights gives $w_i |\langle a_i,x\rangle|^{2-p} \leqslant w_i \norm{U^{-1} a_i}_2^{2-p} \norm{U x}_2^{2-p} = \norm{U x}_2^{2-p}$. This gives the bound $\norm{U x}_2 \leqslant\norm{Ax}_p \leqslant s^{1/p}$ for $x \in B_F(s)$.
Problematically, this uniform $\ell_2$ bound is too weak, but there is a straightforward solution: Suppose we cover $B_F(s)$ by translates of $\mathsf K_{j_0}$. This gives an $\ell_{\infty}$ bound on the elements of each cell, meaning that we can apply \eqref{eq:intro-nc0} and obtain a better upper bound on \(\norm{Ux}_2\) for $x \in \mathsf K_{j_0}$. Thus to cover $B_F(s)$ by translates of $\mathsf K_j$ with $j < j_0$, we will cover first by translates of $\mathsf K_{j_0}$, then cover each translate $(x + \mathsf K_{j_0}) \cap B_F(s)$ by translates of $\mathsf K_{j_0-1}$, and so on.
The standard approach in this setting is to use interpolation inequalities and duality of covering numbers to obtain a cleaner analytic version of such an iterated covering bound. Those tools, however, lean on homogeneity; the iterative covering argument itself can be adapted to the non-homogeneous setting, as we discuss next.
When we move to more general loss functions $\varphi_i : \mathbb R\to \mathbb R$, we lose the homogeneity property $\varphi_i(\lambda x) = \lambda^p \varphi_i(x)$, $\lambda > 0$ that holds for $\ell_p$ losses. Because of this, we need to replace the single Euclidean structure present in \eqref{eq:wis} (given by the linear operator $U$) with a family of structures, one for every relevant scale.
Definition (Approximate weight at scale $s$): Fix $a_1,\ldots,a_m \in \mathbb R^n$ and loss functions $\varphi_1,\ldots,\varphi_m~:~\mathbb R\to \mathbb R_+$. Say that a vector $w \in \mathbb R_+^m$ is an $\alpha$-approximate weight at scale $s$ if
\[\label{eq:approx-weight} \frac{s}{\alpha} \leqslant\frac{\varphi_i(\norm{M_w^{-1/2} a_i}_2)}{w_i \norm{M_w^{-1/2} a_i}_2^2} \leqslant\alpha s\,,\quad i=1,\ldots,m\,,\text{ where }M_w \mathrel{\mathop:}=\sum_{j=1}^m w_j a_j a_j^{\top}\,.\]Moreover, in order to generalize the iterated covering argument for $\ell_p$ losses, we need there to be a relationship between weights at different scales.
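As a sanity check of this definition, consider the special case $\varphi_i(z) = z^2$:

```latex
% For phi_i(z) = z^2 the numerator is \|M_w^{-1/2} a_i\|_2^2, so the ratio collapses:
\[\frac{\varphi_i(\norm{M_w^{-1/2} a_i}_2)}{w_i \norm{M_w^{-1/2} a_i}_2^2} = \frac{1}{w_i}\,,\]
```

so the constant weights $w_i \mathrel{\mathop:}=1/s$ form a $1$-approximate weight at scale $s$, recovering the single quadratic structure $M_w = s^{-1} A^{\top} A$ of the $\ell_2$ case. For general losses, no single weight vector works at every scale, which motivates the next definition.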
Definition (Approximate weight scheme): Let $\mathcal J\subseteq \mathbb Z$ be a contiguous interval. A family $\{w^{(j)} \in \mathbb R_+^m : j \in \mathcal J\}$ is an $\alpha$-approximate weight scheme if each $w^{(j)}$ is an $\alpha$-approximate weight at scale $2^j$ and, furthermore, for every pair $j,j+1 \in \mathcal J$ and $i \in \{1,\ldots,m\}$,
\[\label{eq:smooth-weights} w_i^{(j+1)} \leqslant\alpha w_i^{(j)}\,.\]Given a weight scheme, we choose sampling probabilities
\[\rho_i \mathrel{\mathop:}=\frac{1}{|\mathcal J|} \sum_{j \in \mathcal J} \frac{w_i^{(j)} \norm{(A^{\top} W_j A)^{-1/2} a_i}_2^2}{n}\,,\]where \(W_j = \mathrm{diag}(w_1^{(j)},\ldots,w_m^{(j)})\).
In our setting, $|\mathcal J| \leqslant O(\log(ms_{\max}/s_{\min}))$, which results in the sparsity increasing by a corresponding factor.