Fix \(1 \leq p \leq 2\). We will now prove that if \(a_1,\ldots,a_m \in \R^n\) and \(f_i(x) = |\langle a_i,x\rangle|^p\), then the sum \(F(x) = f_1(x) + \cdots + f_m(x)\) admits \(s\)-sparse \(\e\)-approximations for
\[s \lesssim \frac{n}{\e^2} \left(\log (n/\e)\right)^p (\log n)^2\,,\]
generalizing what we saw last lecture for \(p=2\).
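Here an \(s\)-sparse \(\e\)-approximation means, as in the last lecture, a reweighted sum with at most \(s\) terms that approximates \(F\) multiplicatively: there is a \(c \in \R_+^m\) with at most \(s\) nonzero entries such that
\[(1-\e)\, F(x) \leq \sum_{i=1}^m c_i f_i(x) \leq (1+\e)\, F(x)\,,\quad \forall x \in \R^n\,.\]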
As we have already seen, this will follow from an estimate of the form
\[\begin{equation}\label{eq:lp_ent_goal} e_h(B_F, d_{\nu}) \lesssim 2^{-h/2} L\,,\quad h \geq 0\,, \end{equation}\]where \(B_F \seteq \{x \in \R^n : F(x) \leq 1\}\), the indices \(\nu_1,\ldots,\nu_M\) are chosen i.i.d. with \(\P[\nu_j = i] = \rho_i\), and
\[d_{\nu}(x,y) \seteq \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2\right)^{1/2}, \qquad \tilde{F}_{\nu}(x) \seteq \sum_{j=1}^M \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}}\,,\]
for some choice of sampling probabilities \((\rho_1,\ldots,\rho_m) \in \R_+^m\).
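Note that \(\tilde{F}_{\nu}\) is an unbiased estimator of \(F\): for every fixed \(x \in \R^n\),
\[\E\left[\tilde{F}_{\nu}(x)\right] = \sum_{j=1}^M \frac{1}{M} \sum_{i=1}^m \rho_i\, \frac{f_i(x)}{\rho_i} = \sum_{i=1}^m f_i(x) = F(x)\,.\]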
Let’s introduce a distance analogous to the one from our argument for \(p=2\):
\[d_{\nu,\infty}(x,y) \seteq \max_{j \in [M]} \frac{|\langle a_{\nu_j}, x-y\rangle|}{(M \rho_{\nu_j})^{1/p}}\,.\]Using the elementary inequality, valid for \(1 \leq p \leq 2\),
\[|a^p - b^p|^2 \leq 2 |a-b|^p \left(a^p + b^p\right)\,,\quad \forall a,b \geq 0\,,\]which follows from the factorization \(a^p - b^p = (a^{p/2}-b^{p/2})(a^{p/2}+b^{p/2})\), the subadditivity bound \(|a^{p/2}-b^{p/2}| \leq |a-b|^{p/2}\), and \((a^{p/2}+b^{p/2})^2 \leq 2(a^p+b^p)\), it holds that
\[d_{\nu}(x,y)^2 \leq 2 \sum_{j=1}^M \frac{|\langle a_{\nu_j}, x-y\rangle|^p \left(f_{\nu_j}(x)+f_{\nu_j}(y)\right)}{(M \rho_{\nu_j})^2} \leq 2\, d_{\nu,\infty}(x,y)^{p} \left(\tilde{F}_{\nu}(x) + \tilde{F}_{\nu}(y)\right),\]where we applied the inequality with \(a = |\langle a_{\nu_j},x\rangle|\) and \(b = |\langle a_{\nu_j},y\rangle|\) in each term, together with \(\big| |\langle a_{\nu_j},x\rangle| - |\langle a_{\nu_j},y\rangle| \big| \leq |\langle a_{\nu_j}, x-y\rangle|\). Hence for \(x,y \in B_F\),
\[d_{\nu}(x,y) \lesssim d_{\nu,\infty}(x,y)^{p/2} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}.\]Thus we need to prove a bound of the form
\[e_h(B_F, d_{\nu,\infty}^{p/2}) \lesssim 2^{-h/2} L\,,\quad h \geq 0\,.\]As we discussed last time, this is equivalent to the bound
\[\begin{equation}\label{eq:lp_cov_goal} \left(\log \mathcal{N}(B_F, d_{\nu,\infty}, \e^{2/p})\right)^{1/2} \lesssim \frac{L}{\e}\,,\quad \e > 0\,. \end{equation}\]Consider \(A : \R^n \to \R^m\) with rows \(a_1,\ldots,a_m \in \R^n\), and let us define importance scores
\[\begin{equation}\label{eq:lp_scores} \rho_i \seteq \frac{\int |\langle a_i,x\rangle|^p\, e^{-\norm{Ax}_p^p}\,dx}{\int \norm{Ax}_p^p\, e^{-\norm{Ax}_p^p}\,dx}\,,\quad i=1,\ldots,m\,, \end{equation}\]so that \(\rho_1 + \cdots + \rho_m = 1\).
Lemma: For any norm \(N\) on \(\R^n\) and \(1 \leq p \leq 2\), it holds that
\[\int e^{-N(x)^p} N(x)^p \,dx = \frac{n}{p} \int e^{-N(x)^p}\,dx\,.\]Proof:
Let \(B_N\) denote the unit ball of \(N\), and let \(\mu\) be the probability measure on \(\R^n\) with density proportional to \(e^{-N(x)^p}\). Define \(V(r) \seteq \mathrm{vol}_n(r B_N) = r^n \mathrm{vol}_n(B_N)\) so that \(\frac{d}{dr} V(r) = n r^{n-1} \mathrm{vol}_n(B_N)\). Therefore,
\[\begin{align*} \int_{\R^n} N(x)^p\,d\mu(x) &= \frac{\int_{\R^n} N(x)^p e^{-N(x)^p}\,dx}{\int_{\R^n} e^{-N(x)^p}\,dx} \\ &= \frac{\int_{0}^{\infty} r^p e^{-r^p}\, dV(r)}{\int_{0}^{\infty} e^{-r^p}\, dV(r)} = \frac{\int_{0}^{\infty} e^{-r^p} r^{n-1+p}\,dr}{\int_{0}^{\infty} e^{-r^p} r^{n-1}\,dr}\,. \end{align*}\]Make the substitution $u = r^p$, yielding
\[\int_{\R^n} N(x)^p\,d\mu(x) = \frac{\int_{0}^{\infty} e^{-u} u^{n/p}\,du}{\int_{0}^{\infty} e^{-u} u^{n/p-1}\,du} = \frac{n}{p}\,,\]where the latter equality follows from integration by parts.
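As a sanity check, take \(p=2\) and \(N = \norm{\cdot}_2\): then the probability measure with density proportional to \(e^{-\norm{x}_2^2}\) is the centered gaussian with covariance \(\frac{1}{2} I_n\), and indeed
\[\frac{\int_{\R^n} \norm{x}_2^2\, e^{-\norm{x}_2^2}\,dx}{\int_{\R^n} e^{-\norm{x}_2^2}\,dx} = \frac{n}{2} = \frac{n}{p}\,.\]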
Let \(\mathbf{Z}\) be a random variable with law \(\mu\), where \(\mu\) has density proportional to \(e^{-\norm{Ax}_p^p}\). Then the lemma tells us that
\[\begin{equation}\label{eq:rho-alt} \rho_i = \frac{p}{n} \E \left[\abs{\langle a_i,\mathbf{Z}\rangle}^p\right]\,, \quad i=1,\ldots,m\,. \end{equation}\]Check that in the case \(p=2\), it holds that \(\rho_i = \norm{(A^{\top} A)^{-1/2} a_i}_2^2/n\), i.e., the corresponding weights are precisely the statistical leverage scores.
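To verify this, note that for \(p=2\) the law of \(\mathbf{Z}\) is gaussian with covariance \(\frac{1}{2}(A^{\top} A)^{-1}\), so \eqref{eq:rho-alt} gives
\[\rho_i = \frac{2}{n}\, \E\left[\langle a_i,\mathbf{Z}\rangle^2\right] = \frac{2}{n} \cdot \frac{1}{2}\, a_i^{\top} (A^{\top} A)^{-1} a_i = \frac{1}{n} \norm{(A^{\top} A)^{-1/2} a_i}_2^2\,.\]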
In light of our choice of importance scores in \eqref{eq:lp_scores}, we can think of the following setting. Let \(U\) be a matrix with rows \(u_1,\ldots,u_M \in \R^n\). Let \(\mathbf{Z}\) be a random variable whose distribution has density proportional to \(e^{-\|A x\|_p^p}\), and assume that
\[\begin{equation}\label{eq:pth-moments} \left(\E \abs{\langle u_j, \mathbf{Z}\rangle}^p\right)^{1/p} \leq K\,,\quad j=1,\ldots,M\,. \end{equation}\]
Note that if \(u_j = a_{\nu_j}/(M \rho_{\nu_j})^{1/p}\), then \(K \leq (\frac{n}{p M})^{1/p}\) using \eqref{eq:rho-alt}.
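Indeed, in that case \eqref{eq:rho-alt} gives, for every \(j \in [M]\),
\[\E \abs{\langle u_j, \mathbf{Z}\rangle}^p = \frac{\E \abs{\langle a_{\nu_j}, \mathbf{Z}\rangle}^p}{M \rho_{\nu_j}} = \frac{(n/p)\, \rho_{\nu_j}}{M \rho_{\nu_j}} = \frac{n}{p M}\,.\]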
Our goal now is to bound the covering numbers of the set \(B_A \seteq \{x \in \R^n : \|A x\|_p \leq 1 \}\) in the metric \(d_{U}(x,y) \seteq \|U(x-y)\|_{\infty}\). Recall that we did this in the last lecture in the case \(p=2\) using the Shift Lemma for the gaussian measure. There we used the inequality
\[\frac{\norm{x+y}_2^2 + \norm{x-y}_2^2}{2} \leq \norm{x}_2^2 + \norm{y}_2^2\,,\]which actually holds with equality (the parallelogram law) for the Euclidean norm. We will need a generalization of this property.
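For the Euclidean norm, the equality is immediate from expanding the squares:
\[\norm{x+y}_2^2 + \norm{x-y}_2^2 = \left(\norm{x}_2^2 + 2\langle x,y\rangle + \norm{y}_2^2\right) + \left(\norm{x}_2^2 - 2\langle x,y\rangle + \norm{y}_2^2\right) = 2\norm{x}_2^2 + 2\norm{y}_2^2\,.\]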
Uniform smoothness.
A norm $N$ on $\R^n$ is said to be $p$-uniformly smooth with constant $S$ if it holds that
\[\begin{equation}\label{eq:p-smooth} \frac{N(x+y)^p + N(x-y)^p}{2} \leq N(x)^p + N(S y)^p,\qquad x,y \in \R^n\,. \end{equation}\]For \(1 \leq p \leq 2\), the \(\ell_p\) norm is \(p\)-uniformly smooth with constant \(1\). This reduces to the \(1\)-dimensional inequality (due to Clarkson):
\[\begin{equation}\label{eq:clarkson} \frac{|a+b|^p + |a-b|^p}{2} \leq |a|^p + |b|^p\,,\quad \forall a,b \in \R\,. \end{equation}\]To prove \eqref{eq:clarkson}, write
\[\frac{|a+b|^p + |a-b|^p}{2} \leq \left(\frac{|a+b|^2 + |a-b|^2}{2}\right)^{p/2} = \left(a^2+b^2\right)^{p/2} \leq |a|^p + |b|^p\,,\]where the first inequality is concavity of \(y \mapsto y^{q}\) for \(0 \leq q \leq 1\), and the second inequality uses $(u+v)^q \leq u^q + v^q$ for \(0 \leq q \leq 1\) and \(u,v \geq 0\), both applied with \(q = p/2\).
Uniform smoothness is a quantitative measure of how much the unit ball of a norm lacks “corners.” Consider the modulus of smoothness
\[\rho_N(t) \seteq \sup \left\{ \frac{N(x+y) + N(x-y)}{2} - 1 : N(x)=1,\ N(y) = t\right\}.\]Then \(N\) is \(p\)-uniformly smooth if \(\rho_N(t) \leq O(t^p)\) for all \(t > 0\).
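For example, for the Euclidean norm, if \(\norm{x}_2 = 1\) and \(\norm{y}_2 = t\), then
\[\frac{\norm{x+y}_2 + \norm{x-y}_2}{2} \leq \left(\frac{\norm{x+y}_2^2 + \norm{x-y}_2^2}{2}\right)^{1/2} = \sqrt{1+t^2}\,,\]with equality when \(x \perp y\), hence \(\rho_{\ell_2}(t) = \sqrt{1+t^2} - 1 \leq t^2/2\), consistent with \(2\)-uniform smoothness.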
To get some intuition for this notion, see the figure to the right. The blue arrow corresponds to \(x\) and the red arrows correspond to \(\pm y\). One of the directions is bad for the \(\ell_1\) ball, and the other for the \(\ell_{\infty}\) ball.
Note how following the red arrows takes one far outside the \(\ell_1\) (and \(\ell_{\infty}\)) ball, but almost stays inside the \(\ell_2\) ball. It turns out that \(\ell_p\) is \(p\)-uniformly smooth with constant \(1\) for \(1 \leq p \leq 2\), and \(2\)-uniformly smooth with constant \(O(\sqrt{p})\) for \(p > 2\).
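To see how a corner forces a large modulus, take the \(\ell_{\infty}\) norm on \(\R^2\) with \(x = (1,1)\) (a corner of the unit ball) and \(y = (t,-t)\), so that \(\norm{y}_{\infty} = t\). Then
\[\frac{\norm{x+y}_{\infty} + \norm{x-y}_{\infty}}{2} = \frac{(1+t) + (1+t)}{2} = 1+t\,,\]hence \(\rho_{\ell_{\infty}}(t) \geq t\), ruling out \(\rho_{\ell_{\infty}}(t) \leq O(t^p)\) for every \(p > 1\). A similar computation applies at the corners of the \(\ell_1\) ball.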
Shift Lemma for \(p\)-uniformly smooth norms:
Suppose $N$ is a norm on $\R^n$ that is $p$-uniformly smooth with constant $S$. Let \(\mu\) be the probability measure on \(\R^n\) with density proportional to \(e^{-N(x)^p}\). Then for any symmetric convex body \(W \subseteq \R^n\) and \(z\in \R^n\),
\[\begin{equation}\label{eq:shift-measure} \mu(W+z) \geq e^{- S^p N (z)^p}\mu(W)\,. \end{equation}\]Proof:
For any $z \in \R^n$, it holds that
\[\mu({W}+z) = \frac{\int_{W} \exp(-N(x+z)^p)\,dx}{\int_{W} \exp(-N(x)^p)\,dx} \mu({W})\,.\]Now we bound
\[\begin{align*} \int_{W} \exp(-N(x+z)^p)\,dx &= \int_{W} \E_{\sigma \in \{-1,1\}} \exp(-N(\sigma x+z)^p)\,dx \\ &\geq \int_{W} \exp\left(-\E_{\sigma \in \{-1,1\}} N(\sigma x+z)^p\right)\,dx \\ &\geq \int_{W} \exp\left(-(N(x)^p + S^p N(z)^p)\right)\,dx \\ &= \exp(-S^p N(z)^p) \int_{W} \exp(-N(x)^p)\,dx\,, \end{align*}\]where the equality uses symmetry of ${W}$, the first inequality uses convexity of $e^y$, and the second inequality uses $p$-uniform smoothness of $N$ (recall \eqref{eq:p-smooth}).
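Note that for \(p = 2\) and \(N = \norm{\cdot}_2\) (which is \(2\)-uniformly smooth with constant \(S = 1\)), this recovers the Shift Lemma for the gaussian measure from last lecture: if \(\mu\) has density proportional to \(e^{-\norm{x}_2^2}\), then
\[\mu(W+z) \geq e^{-\norm{z}_2^2} \mu(W)\,.\]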
Dual-Sudakov Lemma for \(p\)-smooth norms.
Let \(N\) and \(\hat{N}\) be norms on \(\R^n\). Suppose that $N$ is $p$-uniformly smooth with constant $S$, and let \(\mu\) be the probability measure on $\R^n$ with density proportional to \(e^{-N(x)^p}\). Then for any $\e > 0$,
\[\left(\log \mathcal{N}(B_{N}, \hat{N}, \e)\right)^{1/p} \lesssim \frac{S}{\e} \int \hat{N}(x)\,d\mu(x)\,.\]Proof: By scaling $\hat{N}$, we may assume that $\e = 1$. Suppose now that $x_1,\ldots,x_M \in B_{N}$ and $x_1+B_{\hat{N}},\ldots,x_M+B_{\hat{N}}$ are pairwise disjoint. To establish an upper bound on $M$, let $\lambda > 0$ be a number we will choose later and write
\[\begin{align*} 1 \geq \mu\left(\bigcup_{j=1}^M \lambda (x_j + B_{\hat{N}})\right) &= \sum_{j=1}^M \mu\left(\lambda x_j + \lambda B_{\hat{N}}\right) \\ &\geq \sum_{j=1}^M e^{-\lambda^p S^p N(x_j)^p} \mu(\lambda B_{\hat{N}}) \geq M e^{-S^p \lambda^p} \mu(\lambda B_{\hat{N}})\,, \end{align*}\]where the second inequality used the Shift Lemma \eqref{eq:shift-measure} with \(W = \lambda B_{\hat{N}}\) and \(z = \lambda x_j\), and the last inequality used $x_1,\ldots,x_M \in B_N$.
Now choose $\lambda \seteq 2 \int \hat{N}(x)\, d\mu(x)$ so that Markov’s inequality gives
\[\mu(\lambda B_{\hat{N}}) = \mu\left(\{x : \hat{N}(x) \leq \lambda \} \right) \geq 1/2\,.\]Combining with the preceding inequality yields the upper bound
\[\left(\log (M/2)\right)^{1/p} \leq S \lambda\,.\]
Control on the covering numbers.
Using the dual-Sudakov Lemma, we have
\[\left(\log \mathcal{N}(B_A, d_U, \e)\right)^{1/p} \lesssim \frac{1}{\e}\, \E \|U \mathbf{Z}\|_{\infty} = \frac{1}{\e}\, \E \max_{j \in [M]} \abs{\langle u_j, \mathbf{Z}\rangle}\,,\]where we applied the lemma with \(N(x) \seteq \norm{Ax}_p\), which is \(p\)-uniformly smooth with constant \(1\) since \(\norm{\cdot}_p\) is and \eqref{eq:p-smooth} is preserved under composition with a linear map. Since it holds that \(\E[\abs{\langle u_j,\mathbf{Z}\rangle}] \leq \left(\E \abs{\langle u_j,\mathbf{Z}\rangle}^p\right)^{1/p} \leq K\) by \eqref{eq:pth-moments}, we are left with the final piece, a concentration estimate for the random variables \(\{ \langle u_j, \mathbf{Z} \rangle \}\).
Exponential concentration.
Let’s first summarize the final piece: the density of \(\mathbf{Z}\) is proportional to \(e^{-\|Ax\|_p^p}\), which is a log-concave function, so \(\mathbf{Z}\) is a log-concave random variable. You will prove in HW#2 that the \(1\)-dimensional marginals of such a random variable always satisfy an exponential concentration inequality: For some constant \(c > 0\),
\[\P\left[\abs{\langle u_j, \mathbf{Z}\rangle} \geq t\, \E\abs{\langle u_j, \mathbf{Z}\rangle}\right] \leq e^{-c t}\,,\quad t \geq 1\,.\]
Finishing the argument.
In particular, this implies that
\[\E \max_{j \in [M]} \abs{\langle u_j, \mathbf{Z}\rangle} \lesssim \left(\max_{j \in [M]} \E \abs{\langle u_j, \mathbf{Z}\rangle}\right) \log M\,,\]which gives
\[\left(\log \mathcal{N}(B_A, d_U, \e)\right)^{1/p} \lesssim \frac{1}{\e} K \log M\,.\]Now for
\[u_j \seteq \frac{a_{\nu_j}}{M^{1/p} \rho_{\nu_j}^{1/p}}\,,\]we satisfy this with \(K \leq (n/M)^{1/p}\). Recalling \eqref{eq:lp_cov_goal}, let us write this as
\[\left(\log \mathcal{N}(B_A, d_U, \e^{2/p})\right)^{1/2} \lesssim \frac{1}{\e} \sqrt{\frac{n (\log M)^p}{M}}\,.\]Translating all the way back to our initial \eqref{eq:lp_ent_goal} gives
\[e_h(B_F, d_{\nu}) \lesssim 2^{-h/2} \sqrt{\frac{n (\log M)^p}{M}} \left(\max_{F(x) \leq 1} \tilde{F}_{\nu}(x)\right)^{1/2}\,.\]Now applying Dudley’s entropy bound as in the previous lecture gives
\[\E_{\e} \max_{F(x) \leq 1} \mag{\sum_{j=1}^M \e_j \frac{f_{\nu_j}(x)}{M \rho_{\nu_j}}} \lesssim \sqrt{\frac{n (\log M)^p}{M}} \log n\,.\]And using our symmetrization reduction, we should take \(M \asymp \frac{n}{\e^2} (\log (n/\e))^p (\log n)^2\) to obtain an \(\e\)-approximation with precisely \(M\) terms.
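As a quick check on this choice: substituting \(M \asymp \frac{n}{\e^2} (\log (n/\e))^p (\log n)^2\) into the preceding bound gives
\[\sqrt{\frac{n (\log M)^p}{M}}\, \log n \asymp \e \left(\frac{\log M}{\log (n/\e)}\right)^{p/2} \lesssim \e\,,\]since \(\log M \lesssim \log (n/\e)\).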