Lecture notes: Sparsification, sampling, and optimization

$\ell_2$ sparsification

Sparsifying $F(x) = |\langle a_1,x\rangle|^2 + |\langle a_2,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2$

Importance sampling and leverage scores

Intro to sparsification

Unbiased estimators and importance sampling

\(\ell_2\) regression and statistical leverage scores

Considering \(F(x) = f_1(x) + \cdots + f_m(x)\), one way to choose an importance for each \(f_i\) is to obtain a uniform bound on each term. In other words, we consider the quantity

\[\alpha_i \seteq \max_{0 \neq x \in \R^n} \frac{f_i(x)}{F(x)}\,.\]

If we assume that \(F\) and \(f_i\) scale homogeneously (e.g., in the case of \(\ell_2\) regression, \(F(\lambda x) = |\lambda|^2 F(x)\) and similarly for each \(f_i\)), then we can write this as

\[\begin{equation}\label{eq:uniform-con} \alpha_i = \max_{F(x) \leq 1} f_i(x)\,. \end{equation}\]

Let’s try to derive good sampling probabilities for the \(\ell_2\) regression problem where

\[F(x) = |\langle a_1,x\rangle|^2 + |\langle a_2,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2\,,\]

for \(a_1,\ldots,a_m \in \R^n\), and we define \(A\) as the matrix with \(a_1,\ldots,a_m\) as rows. We’ll derive our importance scores in a few different ways. Each of these perspectives will be useful for moving later to more general settings.
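Before deriving them abstractly, it may help to see the scores concretely. The following sketch (Python/NumPy on a random instance of our own choosing, not from the notes) computes the leverage scores \(\sigma_i(A) = \langle a_i, (A^{\top}A)^{-1} a_i\rangle\) that will emerge below, and checks two facts we will use: they sum to \(n\), and each one bounds the ratio \(f_i(x)/F(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
A = rng.standard_normal((m, n))          # rows a_1, ..., a_m

# Leverage scores: sigma_i(A) = <a_i, (A^T A)^{-1} a_i>.
G_inv = np.linalg.inv(A.T @ A)
sigma = np.einsum("ij,jk,ik->i", A, G_inv, A)

# Fact 1: the leverage scores sum to n (the rank of A here).
assert abs(sigma.sum() - n) < 1e-8

# Fact 2: sigma_i = alpha_i = max_{F(x) <= 1} f_i(x), so in particular
# f_i(x) <= sigma_i * F(x) for every x; spot-check on random points.
for _ in range(100):
    x = rng.standard_normal(n)
    f = (A @ x) ** 2                     # f_i(x) = <a_i, x>^2
    assert np.all(f <= sigma * f.sum() + 1e-9)

print("leverage scores sum:", round(sigma.sum(), 6))
```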


Once we’ve chosen our sampling probabilities \(\rho_i = \sigma_i(A)/n\), we can independently sample \(M\) terms from the distribution, and we are left to analyze the random sum

\[\tilde{A} = \frac{1}{M}\left(\frac{a_{i_1} a_{i_1}^{\top}}{\rho_{i_1}} + \cdots + \frac{a_{i_M} a_{i_M}^{\top}}{\rho_{i_M}}\right).\]

We have already argued that \(\E[\tilde{A}] = A^{\top} A\), so the real question is about concentration, which we will begin to cover in the next lecture.
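One can also simulate the sampling experiment directly; since \(F(x) = \|Ax\|_2^2 = x^{\top} (A^{\top} A) x\), the natural comparison is between \(\tilde{A}\) and \(A^{\top} A\) in spectral norm. A NumPy sketch under our own random setup:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, M = 500, 8, 400
A = rng.standard_normal((m, n))
G = A.T @ A                               # the matrix with x^T G x = F(x)

# Sampling probabilities proportional to leverage scores.
sigma = np.einsum("ij,jk,ik->i", A, np.linalg.inv(G), A)
rho = sigma / sigma.sum()                 # = sigma_i / n

# Draw M indices i.i.d. from rho and average the reweighted outer products.
idx = rng.choice(m, size=M, p=rho)
A_tilde = sum(np.outer(A[i], A[i]) / rho[i] for i in idx) / M

# Each term has expectation G, so A_tilde is unbiased; with M a modest
# multiple of n log n it should also be close to G in spectral norm.
err = np.linalg.norm(A_tilde - G, 2) / np.linalg.norm(G, 2)
print(f"relative spectral error with M={M}: {err:.3f}")
```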

There are actually two ways to approach this question. One is to think about proving concentration of the sum along every direction \(x \in \R^n\) simultaneously. This leads naturally to entropy bounds, covering numbers, and the generic chaining theory. The second approach is to think about \(\tilde{A}\) as a random operator and try to prove concentration of sums of independent random matrices. The former will be far more general, while the latter appears substantially more powerful in certain special cases.

Symmetrization and subgaussian processes

Symmetrization, take I

Recall that we have a function \(F : \R^n \to \R\) written as \(F(x) = f_1(x) + \cdots + f_m(x)\) and we want to sparsify it by sampling the terms in proportion to a probability vector \(\rho \in \R_+^m\) (the “importance scores”).

One way to analyze this is to first make sure all the probabilities are uniformly small, say \(\rho_1,\ldots,\rho_m \leq 2/m\). Since we are trying to sparsify down to a number of terms that is independent of \(m\), we can always achieve this by simply splitting a term

\[f_i(x) \to f_i(x)/s + f_i(x)/s + \cdots + f_i(x)/s\,,\]

where there are \(s\) summands.

Then one can sparsify in phases by deleting every term independently with probability \(1/2\) and doubling the surviving terms, so that the sum remains unbiased. This experiment can be modeled with the \(\{0,2\}\)-valued multipliers \(1+\e_j\) in the natural way:

\[\E_{\e} \max_{x \in \Omega} \left|F(x) - \sum_{j=1}^m (1+\e_j) f_j(x)\right| = \E_{\e} \max_{x \in \Omega} \left|\sum_{j=1}^m \e_j f_j(x)\right|,\]

where \(\e_1,\ldots,\e_m\) are i.i.d. uniform random $\pm 1$ variables, and \(\Omega \subseteq \R^n\) is our domain of interest.

It turns out that it is possible to do something similar and reduce the analysis of one-shot sampling to the analysis of randomly signed sums of the functions. This makes for both a simpler algorithm, and a more robust method of analysis that will be very useful later.

Symmetrization, take II

Let’s consider a more general situation. Consider a family of independent random functions \(\mathbf{g}_1,\ldots,\mathbf{g}_M : \R^n \to \R\) that are sampled from some distribution. (Recall that we are thinking about each one as being chosen from a finite family of functions and, in our case, it even holds that they are i.i.d.).

Define \(\mathbf{G}(x) \seteq \mathbf{g}_1(x) + \cdots + \mathbf{g}_M(x)\) and \(F(x) \seteq \E[\mathbf{G}(x)]\). We are interested in the quantity

\[\begin{equation}\label{eq:Esup} \mathcal{S} \seteq \E \max_{x \in \Omega} \left|F(x)-\mathbf{G}(x)\right| = \E \max_{x \in \Omega} \left|F(x) - \left(\mathbf{g}_1(x)+\cdots+\mathbf{g}_M(x)\right)\right|\,, \end{equation}\]

where \(\Omega \subseteq \R^n\).

For our particular case of interest, let us take \(\Omega \seteq \{ x \in \R^n : F(x) \leq 1 \}\).

This completes the symmetrization argument, allowing us to focus on proving bounds like \eqref{eq:rad}.

The advantage here lies in the fact that \(\left \{ \e_1 g_1(x) + \e_2 g_2(x) + \cdots + \e_M g_M(x) : x \in \R^n \right\}\) is an example of a subgaussian process, and bounding the expected maximum of a subgaussian process has a long and rich history, along with a correspondingly powerful framework.

The (mild) disadvantage is that we require the inequality \eqref{eq:rad} to hold for every choice of functions \(g_1,\ldots,g_M\) in the support of our distribution.

(Note: The confusing terminology “process” is a historical artifact. Originally, one was indeed interested in \(\e_1 g_1(x)\), \(\e_1 g_1(x) + \e_2 g_2(x)\), etc. as a sequence “evolving in time.” If one switches from discrete to continuous time, e.g., a process like Brownian motion, then control of the expected maximum is closely related to almost sure continuity of the sample paths.)

Subgaussian processes

Consider a collection of random variables \(\{ X_t : t \in T \}\) where the index set \(T\) is equipped with a distance \(d\). The family is called subgaussian (with respect to \(d\)) if there is a constant \(c > 0\) such that

\[\begin{equation}\label{eq:subgaussian} \P\left[|X_s-X_t| > \lambda\,d(s,t)\right] \leq e^{- c\lambda^2}\,,\quad \forall \lambda > 0, s,t \in T\,. \end{equation}\]

The quantity we care about after our symmetrization argument is

\[\begin{equation}\label{eq:symout} \E \max_{x \in \Omega} \left(\e_1 g_1(x) + \cdots + \e_M g_M(x)\right), \end{equation}\]

and as we have just seen this is the expected maximum of a centered subgaussian process. (Centered means that \(\E X_t = 0\) for every \(t \in T\).) There is a rich history and theory for bounding such expected maxima, with M. Talagrand as the primary architect (see his comprehensive book on the topic).

Entropy numbers and Dudley's bound

Dudley’s entropy inequality

Slepian's lemma and dual-Sudakov bounds

Sparsification and entropy numbers

Suppose that \(F(x) = f_1(x) + \cdots + f_m(x)\), \(\rho \in \R_+^m\) is a probability vector, and define our potential sparsifier as

\[\tilde{F}_{\nu}(x) \seteq \sum_{j=1}^M \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}}\,.\]

Define the related distance

\[d_{\nu}(x,y) \seteq \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2 \right)^{1/2}.\]

We have seen that if we can bound

\[\begin{equation}\label{eq:delta-bound} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \leq \delta \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}\quad \forall \nu \in [m]^M\,, \end{equation}\]

where \(\e_1,\ldots,\e_M \in \{-1,1\}\) are uniformly random signs, then

\[\begin{equation}\label{eq:Fapprox} \E_{\nu} \max_{F(x) \leq 1} \left|F(x)-\tilde{F}_{\nu}(x)\right| \lesssim \delta\,, \end{equation}\]

where \(\nu_1,\ldots,\nu_M\) are indices sampled i.i.d. from \(\rho\).

Finally, we have seen, using Dudley’s entropy bound, that if \(B_F \seteq \{ x \in \R^n : F(x) \leq 1 \}\), then

\[\begin{equation}\label{eq:dudley} \E_{\e} \max_{F(x) \leq 1} \sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \lesssim \sum_{h \geq 0} 2^{h/2} e_h(B_F, d_{\nu})\,. \end{equation}\]

\(\ell_2\) regression

Recall the setting for \(\ell_2\) regression: \(a_1,\ldots,a_m \in \R^n\) and \(f_i(x) \seteq \abs{\langle a_i,x\rangle}^2\).

The covering lemma

Before proving the lemma, it helps to consider the more basic problem of covering the Euclidean ball \(B_2^n\) by translates of \(\e B_{\infty}^n\), i.e., by translates of small cubes.

Suppose \(\|x\|_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2 = 1\). Since we only care about approximation up to \(\e\) in the \(\ell_{\infty}\) distance, we could discretize this vector to lie in, say, \(\e \mathbb{Z}^n \cap B_2^n\).

The most basic kind of vector we need to cover is of the form \((0, \pm \e, 0, 0, \pm \e, 0, \pm \e, 0, \ldots, 0)\). Because \(\|x\|_2^2 = 1\), there are only \(n^{O(1/\e^2)}\) choices for such a vector. But we also need to handle vectors of the form \((0, \pm 2\e, 0, 0, \pm \e, 0, \pm 2\e, 0, \ldots, 0)\), and so on.

It is not hard to convince oneself that there are asymptotically fewer vectors of this form. Indeed, if some entry is \(2\e\) then there are \(n\) choices for where it goes, but there are \(n(n-1)/2\) choices for where two copies of \(\e\) go. Thus the total number of centers one needs is only \(n^{O(1/\e^2)}\).

In other words,

\[\left(\log \mathcal{N}(B_2^n, \|\cdot\|_{\infty}, \e)\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{\log n}\,.\]

Now suppose we wanted to cover \(B_2^n\) instead with cubes of different side lengths, or with parallelepipeds (where the sides are no longer perpendicular), etc. There is a beautiful approach that gives surprisingly good bounds for covering \(B_2^n\) by translates of an arbitrary symmetric convex body. (I have heard it credited to Talagrand, or to Pajor and Talagrand.)

$\ell_p$ sparsification

Sparsifying $F(x) = |\langle a_1,x\rangle|^p + |\langle a_2,x\rangle|^p + \cdots + |\langle a_m,x\rangle|^p$, $1 < p < 2$.

$\ell_p$ sparsification for $1 < p < 2$ and uniform smoothness

Fix \(1 \leq p \leq 2\). We will now prove that if \(a_1,\ldots,a_m \in \R^n\) and \(f_i(x) = |\langle a_i,x\rangle|^p\), then the sum \(F(x) = f_1(x) + \cdots + f_m(x)\) admits \(s\)-sparse \(\e\)-approximations for

$$ s \lesssim \frac{n}{\e^2} \left(\log \frac{n}{\e}\right)^{O(1)}, $$

generalizing what we saw last lecture for \(p=2\).

As we have already seen, this will follow from an estimate of the form

$$\begin{equation}\label{eq:lp_ent_goal} e_h(B_F, d_{\nu}) \lesssim 2^{-h/2} \sqrt{\frac{n}{M}} \left(\log(M) \log(n)\right)^{O(1)} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}\,, \end{equation} $$


$$\begin{align*} B_F &= \{ x \in \R^n : \|Ax\|_p \leq 1 \} \\ \tilde{F}_{\nu}(x) &= \sum_{j=1}^M \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}} \\ d_{\nu}(x,y) &= \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2\right)^{1/2}\,, \end{align*} $$

for some choice of sampling probabilities \((\rho_1,\ldots,\rho_m) \in \R_+^m\).

Reduction to the consideration of \(d_{\nu,\infty}\)

$\ell_p$ importances from averages

Consider \(A : \R^n \to \R^m\) with rows \(a_1,\ldots,a_m \in \R^n\), and let us define importance scores

$$\begin{equation}\label{eq:lp_scores} \rho_i \seteq \frac{\int e^{-\norm{Ax}_p^p} \abs{\langle a_i,x\rangle}^p\,dx}{\int e^{-\norm{Ax}_p^p} \norm{Ax}_p^p\,dx}\,. \end{equation} $$

The shift lemma and uniform smoothness

In light of our choice of importance scores in \eqref{eq:lp_scores}, we can think of the following setting. Let \(U\) be a matrix with rows \(u_1,\ldots,u_M \in \R^n\). Let \(\mathbf{Z}\) be a random variable whose distribution has density proportional to \(e^{-\|A x\|_p^p}\), and assume that

$$\begin{equation}\label{eq:pth-moments} \left(\E\left[\abs{\langle u_j, \mathbf{Z}\rangle}^p\right]\right)^{1/p} \leq K\,,\quad j=1,\ldots,M\,. \end{equation} $$

Note that if \(u_j = a_{\nu_j}/(M \rho_{\nu_j})^{1/p}\), then \(K \leq (\frac{n}{p M})^{1/p}\) using \eqref{eq:rho-alt}.

Our goal now is to bound the covering numbers of the set \(B_A \seteq \{x \in \R^n : \|A x\|_p \leq 1 \}\) in the metric \(d_{U}(x,y) \seteq \|U(x-y)\|_{\infty}\). Recall that we did this in the last lecture in the case \(p=2\) using the Shift Lemma for the gaussian measure. There we used the inequality

$$ \frac{\|x + z\|_2^2 +\|x-z\|_2^2}{2} \leq \|x\|_2^2 + \|z\|_2^2\,, $$

which actually holds with equality (by the Pythagorean theorem) for the Euclidean norm. We will need a generalization of this property.

\(\psi_1\) concentration for one-dimensional marginals of log-concave measures

$\ell_p$ Lewis weights

A variational formulation

Properties of Lewis weights

Entropy bounds

Computation of the Lewis weights for $p < 4$: A contraction argument

The following algorithm is taken from $\ell_p$ row sampling by Lewis weights by Cohen and Peng.

Norm sparsification

Sparsifying $F(x) = N_1(x) + \cdots + N_m(x)$, $N_1,\ldots,N_m$ norms on $\R^n$.

Importance scores for general norms

The material in this lecture is taken from Sparsifying sums of norms.

Suppose that $N_1,\ldots,N_m : \R^n \to \R_+$ are semi-norms on $\R^n$, and define the semi-norm given by their $\ell_p$ sum, for $p > 1$:

$$\begin{equation}\label{eq:ns} N(x) \mathrel{\mathop:}=\left(N_1(x)^p + \cdots + N_m(x)^p\right)^{1/p}\,.\end{equation} $$

Our goal will be sparsification where the terms are $f_i(x) \mathrel{\mathop:}=N_i(x)^p$ in our usual notation, and $F(x) = N(x)^p$.

A semi-norm $N$ is nonnegative and satisfies $N(\lambda x) = |\lambda| N(x)$ and $N(x+y) \leqslant N(x)+N(y)$ for all $\lambda\in \R$, $x,y \in \R^n$, though possibly $N(x)=0$ for $x \neq 0$.

Note that, equivalently, a semi-norm is a nonnegative homogeneous function ($N(\lambda x) = |\lambda| N(x)$ for all $\lambda$) that is also convex.


Importance scores

The KLS conjecture, covering, and concentration

$$d_{\nu}(x,y) \mathrel{\mathop:}=\left(\sum_{j=1}^M \left(\frac{N_{\nu_j}(x)-N_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2 \right)^{1/2}.$$

Concentration of Lipschitz functions and KLS

Further reading on KLS

The homotopy method

Computing the importance scores

One thing we haven’t addressed yet is how to actually compute the values

$$\rho_i = \frac{\mathbb{E}N_i(\mathbf{Z})}{\mathbb{E}N(\mathbf{Z})}\,,$$

when $\mathbf{Z}$ is chosen uniformly (according to the volume measure on $\mathbb{R}^n$) from the unit ball $B_N$.
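In very low dimension one can at least approximate these scores by brute force. The following rejection-sampling sketch (our own illustrative setup, with $N_i(x) = |\langle a_i,x\rangle|$ and $p=1$, so that $N = N_1 + \cdots + N_m$) estimates the $\rho_i$ by Monte Carlo. It is purely illustrative: the acceptance rate decays exponentially in $n$, which is exactly why a more clever sampling scheme is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 5
A = rng.standard_normal((m, n))           # N_i(x) = |<a_i, x>|, N = sum_i N_i

# B_N = {x : N(x) <= 1} lies in the Euclidean ball of radius 1/sigma_min(A),
# since N(x) = ||Ax||_1 >= ||Ax||_2 >= sigma_min(A) * ||x||_2.
R = 1.0 / np.linalg.svd(A, compute_uv=False)[-1]

# Rejection sampling of Z uniform on B_N -- viable only in tiny dimension.
Z = np.empty((0, n))
while len(Z) < 20000:
    X = rng.uniform(-R, R, size=(100000, n))
    Z = np.vstack([Z, X[np.abs(X @ A.T).sum(axis=1) <= 1]])

Ni_mean = np.abs(Z @ A.T).mean(axis=0)     # estimates E[N_i(Z)]
rho = Ni_mean / Ni_mean.sum()              # rho_i = E[N_i(Z)] / E[N(Z)]
print("estimated scores:", np.round(rho, 3), " sum:", rho.sum())
```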

The homotopy method

General linear models

Sparsifying $F(x) = \varphi(\langle a_1,x\rangle - b_1) + \cdots + \varphi(\langle a_m, x\rangle - b_m)$, $\varphi : \R \to \R$.

Multiscale importance scores

Empirical risk minimization (ERM) is a widely studied problem in learning theory and statistics. A prominent special case is the problem of optimizing a generalized linear model (GLM), i.e.,

$$\begin{equation}\label{eq:glm} \min_{x \in \mathbb R^n} F(x) \quad \textrm{ for } \quad F(x) := \sum_{i=1}^m \varphi_i(\langle a_i, x \rangle - b_i)\,, \end{equation}$$

where the total loss $F : \mathbb R^n \to \mathbb R$ is defined by vectors $a_1,\ldots,a_m \in \mathbb R^n$, $b \in \mathbb R^m$, and loss functions $\varphi_1, \dots, \varphi_m: \mathbb R\to \mathbb R$. Different choices of the loss functions $\{\varphi_i\}$ capture a variety of important problems, including linear regression, logistic regression, and $\ell_p$ regression.

When $m \gg n$, a natural approach for fast algorithms is to apply sparsification techniques to reduce the value of $m$, while maintaining a good multiplicative approximation of the objective value. This corresponds to our usual sparsification model where $f_i(x) = \varphi_i(\langle a_i,x\rangle - b_i)$. As mentioned previously, we can assume that $b_1=\cdots=b_m=0$ by lifting to one dimension higher. In previous lectures, we solved the sparsification problem when $\varphi_i(z)=|z|^p$ for some $1 \leqslant p \leqslant 2$.

However, $\ell_p$ losses are essentially the only class of natural loss functions for which linear-size sparsification results are known for GLMs. For instance, for the widely-studied class of Huber-type loss functions, e.g., $\varphi_i(z) = \min(|z|, |z|^2)$, the best sparsity bounds obtained by previous methods were of the form $\tilde{O}(n^{4-2\sqrt{2}} \varepsilon^{-2})$.

We will often think of the case $\varphi_i(z) = h_i(z)^2$ for some $h_i : \mathbb R\to \mathbb R_+$, as the assumptions we need are stated more naturally in terms of $\sqrt{\varphi_i}$. To that end, consider a function $h : \mathbb R\to \mathbb R_+$ and the following two properties, where $L, c, \theta > 0$ are some positive constants.

It is not difficult to see that for $0 < p \leqslant 2$, the function $\varphi_i(z) = |z|^p$ satisfies the required hypotheses. Additionally, the $\gamma_p$ functions, defined as

$$\begin{equation}\label{eq:gammap} \gamma_p(z) := \begin{cases} \frac{p}{2}z^2 & \enspace \text{ for } \enspace |z| \leqslant 1 \\ |z|^p - (1-\frac{p}{2}) & \enspace \text{ for } \enspace |z| \geqslant 1, \end{cases}\end{equation}$$

for $p \in (0, 2]$, also satisfy the conditions. The special case of $\gamma_1$ is known as the Huber loss.

To show that $\sqrt{\gamma_p}$ is $1$-auto-Lipschitz, it suffices to prove that $\sqrt{\gamma_p}$ is concave. The lower growth bound is immediate for $p > 0$.
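A quick numerical check of the concavity claim for the Huber loss $\gamma_1$ (a sketch under our own test setup):

```python
import numpy as np

rng = np.random.default_rng(4)

def gamma_p(z, p):
    z = np.abs(z)
    return np.where(z <= 1, (p / 2) * z**2, z**p - (1 - p / 2))

# Midpoint concavity of sqrt(gamma_1) on [0, inf): for all a, b >= 0,
#   sqrt(gamma_1((a+b)/2)) >= (sqrt(gamma_1(a)) + sqrt(gamma_1(b))) / 2.
g = lambda z: np.sqrt(gamma_p(z, 1.0))
a, b = rng.uniform(0, 10, size=(2, 100000))
assert np.all(g((a + b) / 2) >= (g(a) + g(b)) / 2 - 1e-12)

print("sqrt(gamma_1) passed the midpoint-concavity check")
```

Note that the two pieces of $\sqrt{\gamma_1}$ ($z/\sqrt{2}$ for $z \leq 1$ and $\sqrt{z - 1/2}$ for $z \geq 1$) have matching value and derivative at $z=1$, which is why the glued function stays concave.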

Importance sampling and multiscale weights

Given $F(x) = \varphi_1(\langle a_1,x\rangle) + \cdots + \varphi_m(\langle a_m,x\rangle)$, our approach to sparsification is via importance sampling. Given a probability vector $\rho \in \mathbb R^m$ with $\rho_1,\ldots,\rho_m > 0$ and $\rho_1 + \cdots + \rho_m = 1$, we sample $M \geqslant 1$ coordinates $\nu_1,\ldots,\nu_M$ i.i.d. from $\rho$, and define our potential approximator by

$$\tilde{F}(x) \seteq \frac{1}{M} \sum_{j=1}^M \frac{\varphi_{\nu_j}(\langle a_{\nu_j},x\rangle)}{\rho_{\nu_j}}\,.$$

Since we want an approximation guarantee to hold simultaneously for many $x \in \mathbb R^n$, it is natural to analyze expressions of the form

$$\mathop{\mathrm{\mathbb{E}}}\max_{F(x) \leqslant s} \left|F(x)-\tilde{F}(x)\right|.$$

As we have seen, analysis of this expression involves the size of discretizations of the set \(B_F(s) \mathrel{\mathop:}=\{ x \in \mathbb R^n : F(x) \leqslant s \}\) at various granularities. The key consideration (via Dudley’s entropy inequality) is how well $B_F(s)$ can be covered by cells on which we have uniform control on how much the terms $\varphi_i(\langle a_i,x\rangle)/\rho_i$ vary within each cell.

The $\ell_2$ case

Let’s consider the case $\varphi_i(z) = |z|^2$ so that $F(x) = |\langle a_1,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2$. Here, \(B_F(s) = \{ x \in \mathbb R^n : \norm{Ax}_2^2 \leqslant s \}\), where $A$ is the matrix with $a_1,\ldots,a_m$ as rows.

A cell at scale $2^{j}$ looks like

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} \frac{|\langle a_i,x\rangle|^2}{\rho_i} \leqslant 2^{j} \right\},$$

and the pertinent question is how many translates of $\mathsf K_j$ it takes to cover $B_F(s)$. In the $\ell_2$ case, this is the well-studied problem of covering Euclidean balls by $\ell_{\infty}$ balls.

If $N_j$ denotes the minimum number of such cells required, then the dual-Sudakov inequality tells us that

$$\log N_j \lesssim \frac{s}{2^j} \log (m) \max_{i \in [m]} \frac{\|(A^{\top} A)^{-1/2} a_i\|^2_2}{\rho_i}\,.$$

Choosing $\rho_i \mathrel{\mathop:}=\frac{1}{n} \|(A^\top A)^{-1/2} a_i\|_2^2$ to be the normalized leverage scores yields uniform control on the size of the coverings:

$$\log N_j \lesssim \frac{s}{2^j} n \log m\,.$$

The $\ell_p$ case

Now we take $\varphi_i(z) = |z|^p$ so that $F(x) = \norm{Ax}_p^p$. A cell at scale $2^j$ now looks like

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} \frac{|\langle a_i,x\rangle|^p}{\rho_i} \leqslant 2^j \right\}.$$

To cover $B_F(s)$ by translates of $\mathsf K_j$, we will again employ Euclidean balls, and one can use $\ell_p$ Lewis weights to relate the $\ell_p$ structure to an $\ell_2$ structure.

A classical result of Lewis establishes that there are nonnegative weights $w_1,\ldots,w_m \geqslant 0$ such that if $W = \mathrm{diag}(w_1,\ldots,w_m)$ and $U \mathrel{\mathop:}=(A^{\top} W A)^{1/2}$, then

$$\begin{equation}\label{eq:wis} w_i = \frac{\norm{U^{-1} a_i}_2^p}{\norm{U^{-1} a_i}_2^2} = \frac{\varphi_i(\norm{U^{-1} a_i}_2)}{\norm{U^{-1} a_i}_2^2}\,. \end{equation}$$

Assuming that $A$ has full rank, a straightforward calculation gives $\sum_{i=1}^m w_i \norm{U^{-1} a_i}_2^2 = \mathop{\mathrm{tr}}(U^2 U^{-2}) = n$.

Therefore we can take $\rho_i \mathrel{\mathop:}=\frac{1}{n} w_i \norm{U^{-1} a_i}_2^2$ for $i=1,\ldots,m$, and our cells become

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} |\langle a_i,x\rangle|^p \leqslant\frac{2^j}{n} w_i \norm{U^{-1} a_i}_2^2 \right\}.$$

If we are trying to use $\ell_2$-$\ell_{\infty}$ covering bounds, we face an immediate problem: Unlike in the $\ell_2$ case, we don’t have prior control on $\norm{U x}_2$ for $x \in B_F(s)$. One can obtain an initial bound using the structure of $U = (A^{\top} W A)^{1/2}$:

$$ \begin{align} \norm{U x}_2^2 = \sum_{i=1}^m w_i \langle a_i,x\rangle^2 &= \sum_{i=1}^m \norm{U^{-1} a_i}_2^{p-2} \langle a_i,x\rangle^2 \nonumber \\ &= \sum_{i=1}^m \left(\frac{|\langle a_i,x\rangle|}{\norm{U^{-1} a_i}_2}\right)^{2-p} |\langle a_i,x\rangle|^p \leqslant\norm{U x}_2^{2-p} \sum_{i=1}^m |\langle a_i,x\rangle|^p\,, \label{eq:intro-nc0} \end{align}$$

where the last inequality is Cauchy-Schwarz: $|\langle a_i, x\rangle| = |\langle U^{-1} a_i, U x\rangle| \leqslant\norm{U^{-1} a_i}_2 \norm{U x}_2$. This gives the bound $\norm{U x}_2 \leqslant\norm{Ax}_p \leqslant s^{1/p}$ for $x \in B_F(s)$.

Problematically, this uniform $\ell_2$ bound is too weak, but there is a straightforward solution: Suppose we cover $B_F(s)$ by translates of $\mathsf K_{j_0}$. This gives an $\ell_{\infty}$ bound on the elements of each cell, meaning that we can apply \eqref{eq:intro-nc0} and obtain a better upper bound on \(\norm{Ux}_2\) for $x \in \mathsf K_{j_0}$. Thus to cover $B_F(s)$ by translates of $\mathsf K_j$ with $j < j_0$, we will cover first by translates of $\mathsf K_{j_0}$, then cover each translate $(x + \mathsf K_{j_0}) \cap B_F(s)$ by translates of $\mathsf K_{j_0-1}$, and so on.

The standard approach in this setting is to use interpolation inequalities and duality of covering numbers for a cleaner analytic version of such an iterated covering bound. But the iterative covering argument can be adapted to the non-homogeneous setting, as we discuss next.

Non-homogeneous loss functions

When we move to more general loss functions $\varphi_i : \mathbb R\to \mathbb R$, we lose the homogeneity property $\varphi_i(\lambda x) = \lambda^p \varphi_i(x)$, $\lambda > 0$ that holds for $\ell_p$ losses. Because of this, we need to replace the single Euclidean structure present in \eqref{eq:wis} (given by the linear operator $U$) with a family of structures, one for every relevant scale.

Moreover, in order to generalize the iterated covering argument for $\ell_p$ losses, we need there to be a relationship between weights at different scales.

Given a weight scheme, we choose sampling probabilities

$$\rho_i \propto \max_{j \in \mathcal J} w_i^{(j)} \norm{M_{w^{(j)}}^{-1/2} a_i}_2^2 = \max_{j \in \mathcal{J}} \sigma_i(W_j^{1/2} A)\,,\quad i=1,\ldots,m\,,$$

where \(W_j = \mathrm{diag}(w_1^{(j)},\ldots,w_m^{(j)})\).

In our setting, $|\mathcal J| \leqslant O(\log(ms_{\max}/s_{\min}))$, which results in the sparsity increasing by a corresponding factor.

Finding multiscale weights

Recall the definitions of approximate weights and weight schemes from the previous lecture.

This can be proved by considering critical points of the functional $U \mapsto \det(U)$ subject to the constraint $G(U) \leqslant s$, where

$$G(U) \mathrel{\mathop:}=\varphi_1(\|U a_1\|_2) + \cdots + \varphi_m(\|U a_m\|_2)\,,$$

analogous to how we showed the existence of Lewis weights.

Contractive algorithm

For a full-rank matrix $V$ with rows $v_1,\ldots,v_m \in \mathbb R^n$, define the $i$th leverage score of $V$ as

$$\begin{equation}\label{eq:ls-def} \sigma_i(V) \mathrel{\mathop:}=\langle v_i, (V^{\top} V)^{-1} v_i\rangle\,. \end{equation}$$

For a weight $w \in \mathbb R_{+}^m$ and \(i \in \{1,\ldots,m\}\), define \(\tau_i(w) \mathrel{\mathop:}=\frac{\sigma_i(W^{1/2} A)}{w_i} = \langle a_i, (A^{\top} W A)^{-1} a_i\rangle\,, \quad W \mathrel{\mathop:}=\mathrm{diag}(w_1,\ldots,w_m)\,,\) and denote $\tau(w) \mathrel{\mathop:}=(\tau_1(w),\ldots,\tau_m(w))$.

Fix a scale parameter $s > 0$ and define the iteration $\Lambda_s : \mathbb R_+^m \to \mathbb R_+^m$ by

$$\begin{equation}\label{eq:phi-iter} (\Lambda_s(w))_i \mathrel{\mathop:}=\frac{1}{s} \frac{\varphi_i(\sqrt{\tau_i(w)})}{\tau_i(w)}\,.\end{equation}$$

Write $\Lambda_s^k \mathrel{\mathop:}=\Lambda_s \circ \cdots \circ \Lambda_s$ for the $k$-fold composition of $\Lambda_s$. In the case where $\varphi_i(z)=|z|^p$ and $1 \leqslant p \leqslant 2$, it is known that, for $s = 1$, starting from any $w_0 \in \mathbb R_{+}^m$, the sequence \(\{\Lambda_1^k(w_0) : k \geqslant 1\}\) converges to the unique fixed point of $\Lambda_1$, which is the corresponding vector of $\ell_p$ Lewis weights.
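Specializing \eqref{eq:phi-iter} to $\varphi_i(z) = |z|^p$ and $s=1$ gives the iteration $w_i \leftarrow \tau_i(w)^{p/2-1}$, which one can run directly. A NumPy sketch on a random instance (the instance and tolerances are our own):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, p = 100, 6, 1.5
A = rng.standard_normal((m, n))

def tau(w):
    # tau_i(w) = <a_i, (A^T W A)^{-1} a_i>
    G_inv = np.linalg.inv(A.T @ (w[:, None] * A))
    return np.einsum("ij,jk,ik->i", A, G_inv, A)

# Iterate Lambda_1: w_i <- tau_i(w)^{p/2 - 1}.
w = np.ones(m)
for _ in range(100):
    w = tau(w) ** (p / 2 - 1)

# Fixed-point residual measured as a max log-ratio of coordinates.
residual = np.max(np.abs(np.log(w) - np.log(tau(w) ** (p / 2 - 1))))

# Sampling probabilities rho_i = w_i * tau_i(w) / n sum to 1; this identity
# holds for any w, since sum_i w_i tau_i(w) = tr((A^T W A)^{-1} A^T W A) = n.
rho = w * tau(w) / n
print("residual:", residual, " sum of rho:", rho.sum())
```

For $p = 1.5$ the contraction factor is $|p/2 - 1| = 1/4$, so the residual decays geometrically until it hits floating-point precision.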

Define now a metric $d$ on $\mathbb R_+^m$ by

$$\begin{aligned} d(u,w) &\mathrel{\mathop:}=\max \left\{ \left|\log \frac{u_i}{w_i}\right| : i = 1,\ldots,m \right\}.\end{aligned}$$

We note the following characterization.

First, we observe that $\tau$ is $1$-Lipschitz on $(\mathbb R_+^m, d)$. In the next proof, $\preceq$ denotes the ordering of two real, symmetric matrices in the Loewner order, i.e., $A \preceq B$ if and only if $B-A$ is positive semi-definite.

Proof of Theorem 2

Consider the map $\psi: \mathbb R_+^m \to \mathbb R_+^m$ whose $i$-th coordinate is defined as

$$\psi_i(x) \mathrel{\mathop:}=\frac{\varphi_i(\sqrt{x_i})}{x_i}\,.$$

Our assumptions on lower and upper-homogeneity give, for all $y_i \geqslant x_i$,

$$c \left(\frac{y_i}{x_i}\right)^{\theta/2-1} \leqslant\frac{\varphi_i(\sqrt{y_i})/y_i}{\varphi_i(\sqrt{x_i})/x_i} \leqslant C \left(\frac{y_i}{x_i}\right)^{u/2-1}\,,$$

yielding, for \(C_1 \mathrel{\mathop:}=\max\{C, 1/c\}\),

$$\begin{equation}\label{eq:psi-contract} d(\psi(x),\psi(y)) \leqslant\max \left(\left|\frac{\theta}{2}-1\right|, \left|\frac{u}{2}-1\right|\right) d(x,y) + \log(C_1)\,.\end{equation}$$

Fix $s > 0$ and consider the mapping $\Lambda_s: \mathbb R_+^m \to \mathbb R_+^m$ defined in \eqref{eq:phi-iter}. Then for $u < 4$ and $\delta \mathrel{\mathop:}=\max \left(\left|\frac{\theta}{2}-1\right|, \left|\frac{u}{2}-1\right|\right) < 1$, the $1$-Lipschitz property of $\tau$, in conjunction with \eqref{eq:psi-contract}, shows that

$$\begin{aligned} d(\Lambda_s(w), \Lambda_s(w')) \leqslant\delta\,d(w,w') + \log(C_1)\,. \label{eq:onestep}\end{aligned}$$

Applying this bound inductively, for any weight $w \in \mathbb R_{+}^m$ and $k \geqslant 1$, we have

$$\begin{equation}\label{eq:iter-contract0} d\left(\Lambda_s^{k}(w), \Lambda_s^{k+1}(w)\right) \leqslant\frac{\delta^k d(\Lambda_s(w),w) + \log C_1}{1-\delta}\,.\end{equation}$$

Now define \(w^{(0)} \mathrel{\mathop:}=\Lambda^{k}_{1}(1,\ldots,1)\,,\) where $k \geqslant 1$ is chosen large enough so that $d(w^{(0)}, \Lambda_{1}(w^{(0)})) \leqslant\frac{2 \log C_1}{1-\delta}$; such a $k$ exists by \eqref{eq:iter-contract0}. One then sees that $w^{(0)}$ is an $\alpha$-approximate weight at scale $1$ for $\alpha = C_1^{2/(1-\delta)}$.

Define inductively, for $j =1,2,\ldots$,

$$\begin{aligned} w^{(j)} &\mathrel{\mathop:}=\Lambda_{2^j}(w^{(j-1)}) \\ w^{(-j)} &\mathrel{\mathop:}=\Lambda_{2^{-j}}(w^{(1-j)})\,.\end{aligned}$$

Note that

$$\begin{aligned} d(\Lambda_{2^j}(w^{(j)}), w^{(j)}) &= d(\Lambda_{2^j}^2(w^{(j-1)}), \Lambda_{2^j}(w^{(j-1)})) \\ &\leqslant\delta d(\Lambda_{2^j}(w^{(j-1)}), w^{(j-1)}) + \log(C_1) \\ &\leqslant\delta d(\Lambda_{2^{j-1}}(w^{(j-1)}), w^{(j-1)}) + \delta \log(2) + \log(C_1)\,,\end{aligned}$$

where the last inequality uses the identity $\Lambda_{2s}(w) = \tfrac{1}{2} \Lambda_s(w)$ for all $w \in \mathbb R_+^m$: multiplying every coordinate by $\tfrac12$ moves a point by exactly $\log 2$ in the metric $d$.

Therefore, by induction, $d(\Lambda_{2^j}(w^{(j)}), w^{(j)}) \leqslant\frac{2 \log(C_1) + \log 2}{1-\delta}$ for all $j > 0$. To see that the family of weights \(\{ w^{(j)} : j \in \mathbb Z\}\) forms a weight scheme, note that

$$d(w^{(j)}, w^{(j-1)}) = d(\Lambda_{2^j}(w^{(j-1)}), w^{(j-1)}) \leqslant d(\Lambda_{2^{j-1}}(w^{(j-1)}), w^{(j-1)}) + \log 2\,,$$

thus \(\{w^{(j)} : j \in \mathbb Z\}\) is an $\alpha$-approximate weight scheme for $\alpha = \frac{2 \log(2C_1)}{1-\delta}$, completing the proof.

Analyzing the potential change

These are preliminary notes for analyzing the Ito process.

Reis and Rothvoss gave an algorithm for finding BSS sparsifiers that uses a discretized random walk. It appears conceptually simpler to use a modified Brownian motion directly.

Consider symmetric matrices $A_1,\ldots,A_m$ with $|A_1|+\cdots+|A_m| \preceq \mathbb{I}_n$, where $|A_i| \mathrel{\mathop:}=(A_i^2)^{1/2}$ denotes the matrix absolute value. Define a stochastic process $X_t = (X_t^1,\ldots,X_t^m)$ by $X_0 = 0$, and

$$dX_t = \mathbb{I}_{\mathcal J_t}\, dB_t\,,$$

where \(\{B_t\}\) is a standard Brownian motion and \(\mathcal J_t \subseteq [m]\) is the set of good coordinates at time \(t\).

Define the stopping times \(T_i \mathrel{\mathop:}=\inf \{ t > 0 : |X_t^i| \geqslant 1 \}\) and the set of stopped coordinates

$$\mathcal S_t \mathrel{\mathop:}=\{ i \in [m] : T_i \leqslant t \}\,.$$

We will ensure that \(\mathcal J_t \cap \mathcal S_t = \emptyset\) so that \(|X_t^1|,\ldots,|X_t^m| \leqslant 1\) holds for all times \(t \geqslant 0\).

Denote \(T \mathrel{\mathop:}=\sup \{ t > 0 : |\mathcal S_t| \leqslant m/2 \}\). Our goal is to show that with good probability,

$$\label{eq:goal} \sum_{i=1}^m X_T^i A_i \preceq C \left(\frac{n}{m}\right)^{1/2} \mathbb{I}_n\,.$$

Since the stopping time \(T\) entails that \(m/2\) coordinates have been fixed to \(\pm 1\), this completes the argument since at least one of the two matrices \(\sum_{i=1}^m (1+X_T^i) A_i\) or \(\sum_{i=1}^m (1-X_T^i) A_i\) has at most \(m/2\) non-zero coefficients, and

$$\sum_{i=1}^m (1 \pm X_T^i) A_i \preceq \left(1 + C \left(\frac{n}{m}\right)^{1/2}\right) \mathbb{I}_n\,.$$

Repeatedly halving \(m\) until \(m \asymp n/\varepsilon^2\) yields a \(1 \pm \varepsilon\) approximation with \(O(n/\varepsilon^2)\) sparsity. To get a two-sided bound, replace \(A_i\) by \(\begin{pmatrix} A_i & 0 \\ 0 & - A_i\end{pmatrix}\).

To this end, let us define, for parameters $\theta,\lambda > 0$ to be chosen later,

$$\begin{aligned} U(x) &\mathrel{\mathop:}=\theta \mathbb{I}_n + \lambda \|x\|^2 \mathbb{I}_n - \sum_{i=1}^m x_i A_i \\ \Phi(x) &\mathrel{\mathop:}=\mathop{\mathrm{tr}}(U(x)^{-1})\,.\end{aligned}$$

Define $U_t \mathrel{\mathop:}=U(X_t)$. Note that $\Phi(X_0)=\Phi(0) = n/\theta$ and $U_0 \succ 0$. The idea is that if \(\{ \Phi(X_t) : t \in [0,T] \}\) remains bounded, then $U_t \succ 0$ holds for $t \in [0,T]$, and therefore at time $T$,

$$\sum_{i=1}^m X_T^i A_i \preceq (\theta + \lambda \|X_T\|^2) \mathbb{I}_n \preceq (\theta + \lambda m)\mathbb{I}_n\,.$$

From Ito’s lemma, we know that if $dX_t = \Sigma_t dB_t$ and $f : \mathbb R^m \to \mathbb R$, then

$$d f(X_t) = \langle \nabla f(X_t), dX_t\rangle + \frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) dt\,.$$

Since $\Sigma_t = \mathbb{I}_{\mathcal J_t}$,

$$\frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) dt = \frac12 \sum_{i \in \mathcal J_t} \partial_{x_i}^2 f(X_t)\,dt\,.$$

Now denote $f(x) \mathrel{\mathop:}=\mathop{\mathrm{tr}}(U(x)^{-1})$ and calculate:

$$\begin{aligned} \partial_{x_i} f(x) &= - 2 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-2}) + \mathop{\mathrm{tr}}(U(x)^{-2} A_i) \\ \partial_{x_j} \partial_{x_i} f(x) &= - 2 \lambda \mathbf{1}_{\{i=j\}} \mathop{\mathrm{tr}}(U(x)^{-2}) + 8 \lambda^2 x_i x_j \mathop{\mathrm{tr}}(U(x)^{-3}) - 4 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-3} A_j) + \partial_{x_j} \mathop{\mathrm{tr}}(U(x)^{-2} A_i) \\ &= - 2 \lambda \mathbf{1}_{\{i=j\}} \mathop{\mathrm{tr}}(U(x)^{-2}) + 8 \lambda^2 x_i x_j \mathop{\mathrm{tr}}(U(x)^{-3}) - 4 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-3} A_j) \\ &\quad - 4 \lambda x_j \mathop{\mathrm{tr}}(U(x)^{-3} A_i) + 2 \mathop{\mathrm{tr}}\left(U(x)^{-1} A_i U(x)^{-1} A_j U(x)^{-1}\right).\end{aligned}$$

This gives

$$\begin{aligned} d \Phi(X_t) = d f(X_t) &= {\color{Blue}{- 2 \lambda \langle X_t, d X_t\rangle \mathop{\mathrm{tr}}(U_t^{-2}) + \sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-2} A_i) dX_t^i}} + \frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right)\,dt \\ \frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) &= {\color{Maroon}{- \lambda \mathop{\mathrm{tr}}(U_t^{-2}) |\mathcal J_t|}} - {\color{ForestGreen}{4 \lambda \sum_{i \in \mathcal J_t} X_t^i \mathop{\mathrm{tr}}\left(U_t^{-3} A_i\right)}} + {\color{ForestGreen}{4 \lambda^2 \|\mathbb{I}_{\mathcal J_t} X_t\|^2 \mathop{\mathrm{tr}}(U_t^{-3})}} \\ &\quad {\color{Maroon}{+ \sum_{i \in \mathcal J_t} \mathop{\mathrm{tr}}\left(U_t^{-1} A_i U_t^{-1} A_i U_t^{-1}\right)}}\,.\end{aligned}$$

The blue terms are the martingales, which we ignore for the moment. The green terms are error terms. The two red terms need to be balanced by choosing $\theta,\lambda$ appropriately.
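These derivative computations are easy to get wrong, so it is worth sanity-checking the gradient formula for $f$ against finite differences. A NumPy sketch on a small random instance of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 5, 7
theta, lam = 3.0, 0.1

# Small random symmetric A_i, scaled so that U(x) stays positive definite.
As = [(B + B.T) / 20 for B in rng.standard_normal((m, n, n))]

def U(x):
    return (theta + lam * (x @ x)) * np.eye(n) - sum(xi * Ai for xi, Ai in zip(x, As))

f = lambda x: np.trace(np.linalg.inv(U(x)))

x = 0.3 * rng.standard_normal(m)
U_inv2 = np.linalg.matrix_power(np.linalg.inv(U(x)), 2)

# Claimed gradient: d_i f(x) = -2*lam*x_i*tr(U^{-2}) + tr(U^{-2} A_i).
grad = np.array([-2 * lam * x[i] * np.trace(U_inv2) + np.trace(U_inv2 @ As[i])
                 for i in range(m)])

# Compare against central finite differences.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(m)])
print("max |grad - fd|:", np.max(np.abs(grad - fd)))
```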

For any $A,B$ with singular values $\sigma_1(A) \geqslant\cdots \geqslant\sigma_n(A)$ and $\sigma_1(B) \geqslant\cdots \geqslant\sigma_n(B)$, we have

$$\mathop{\mathrm{tr}}(A B) \leqslant\sum_{i=1}^n \sigma_i(A) \sigma_i(B)\,.$$

In particular, if $A,B \succeq 0$, then $\mathop{\mathrm{tr}}(AB) \leqslant\mathop{\mathrm{tr}}(A) \mathop{\mathrm{tr}}(B)$.
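A quick numerical check of this trace inequality (von Neumann's trace inequality; our own sketch with random symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
for _ in range(1000):
    A = rng.standard_normal((n, n)); A = A + A.T
    B = rng.standard_normal((n, n)); B = B + B.T
    # tr(AB) <= sum_i sigma_i(A) sigma_i(B), singular values sorted descending.
    sa = np.linalg.svd(A, compute_uv=False)
    sb = np.linalg.svd(B, compute_uv=False)
    assert np.trace(A @ B) <= sa @ sb + 1e-9

print("trace inequality verified on 1000 random pairs")
```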

Let us use this to bound the error terms

$$\begin{aligned} {\color{ForestGreen}{4 \lambda^2 \|\mathbb{I}_{\mathcal J_t} X_t\|^2 \mathop{\mathrm{tr}}(U_t^{-3})}} &\leqslant 4 \lambda^2 m \mathop{\mathrm{tr}}(U_t^{-2}) \mathop{\mathrm{tr}}(U_t^{-1}) \\ {\color{ForestGreen}{4 \lambda \sum_{i \in \mathcal J_t} X_t^i \mathop{\mathrm{tr}}(U_t^{-3} A_i)}} &\leqslant 4 \lambda (\theta+\lambda m) \mathop{\mathrm{tr}}(U_t^{-2}) \mathop{\mathrm{tr}}(U_t^{-1})\,.\end{aligned}$$

This fact also bounds, for any $U \succ 0$ and symmetric $V$,

$$\mathop{\mathrm{tr}}\left(U^{-1} V U^{-1} V U^{-1}\right) \leqslant\mathop{\mathrm{tr}}(U^{-2} |V|) \mathop{\mathrm{tr}}(U^{-1} |V|)\,.$$

Accordingly, we define the set of bad coordinates:

$$\mathcal I_t \mathrel{\mathop:}=\left\{ i \in [m] : \mathop{\mathrm{tr}}(U_t^{-1} |A_i|) > \frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \right\}\,,$$

and finally $\mathcal J_t \mathrel{\mathop:}=[m] \setminus (\mathcal I_t \cup \mathcal S_t)$.

Observe that

$$\sum_{i \in \mathcal J_t} \mathop{\mathrm{tr}}\left(U_t^{-1} A_i U_t^{-1} A_i U_t^{-1}\right) \leqslant\frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-2} |A_i|) \leqslant\frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \mathop{\mathrm{tr}}(U_t^{-2})\,,$$

where the last inequality uses $|A_1|+\cdots+|A_m| \preceq \mathbb{I}_n$.

The point is to maintain $\Phi(X_t) \leqslant 2 \Phi(X_0) = 2 n/\theta$ by proving that $d \Phi(X_t) \leqslant 0$ whenever $\Phi(X_t) \leqslant 2n/\theta$. Calculate:

$$\begin{aligned} d \Phi(X_t) &\leqslant\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1})\right) \\ &=\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{\gamma}{m} \Phi(X_t)\right) \\ &\leqslant\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{2 \gamma n}{\theta m}\right).\end{aligned}$$

For $\gamma \mathrel{\mathop:}=8$, we have $|\mathcal I_t| \leqslant m/\gamma$ by Markov's inequality (since $\sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-1} |A_i|) \leqslant\mathop{\mathrm{tr}}(U_t^{-1})$ by the assumption $|A_1|+\cdots+|A_m| \preceq \mathbb{I}_n$), hence $|\mathcal J_t| \geqslant m - |\mathcal I_t| - |\mathcal S_t| \geqslant\frac{m}{2} - \frac{m}{\gamma} \geqslant\frac{m}{4}$. So let us take

$$\theta \mathrel{\mathop:}=4 \left(\frac{n}{m}\right)^{1/2},\ \lambda \mathrel{\mathop:}=\frac{4}{m} \theta\,.$$

If we maintain $\max_{t \in [0,T]} \Phi(X_t) \leqslant 2n/\theta$, it means that $U_T \succ 0$, so

$$\sum_{i=1}^m X_T^i A_i \preceq \left(\theta + \lambda \|X_T\|^2\right) \mathbb{I}_n \preceq 5 \theta \mathbb{I}_n \asymp \left(\frac{n}{m}\right)^{1/2} \mathbb{I}_n\,,$$

completing the proof.