## Computing the importance scores

One thing we haven’t addressed yet is how to actually compute the values

$$\rho_i = \frac{\mathbb{E}N_i(\mathbf{Z})}{\mathbb{E}N(\mathbf{Z})}\,,$$

when $\mathbf{Z}$ is chosen uniformly (according to the volume measure on $\mathbb{R}^n$) from the unit ball $B_N = \{ x \in \R^n : N(x) \leq 1 \}$.

• A first important observation is that it suffices to sample according to any distribution $\tilde{\rho}$ that satisfies $\tilde{\rho}_i \geq c \rho_i$ for all $i=1,\ldots,m$ and some constant $c > 0$. This follows from the way the importances $\rho_i$ are used in our analyses.

Recall the distance

$d_{\nu,\rho}(x,y) = \left(\sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \rho_{\nu_j}}\right)^2\right)^{1/2},$

and that our sparsification bounds are in terms of entropy numbers $e_{h}(B_F, d_{\nu,\rho})$ with $B_F \seteq \{ x \in \R^n : F(x) \leq 1 \}$. Note that for an approximation $\tilde{\rho}$ as above, we have

$d_{\nu,\tilde{\rho}}(x,y) \leq \frac{1}{c} d_{\nu,\rho}(x,y)\,,\quad \forall x,y \in \R^n\,,$

which implies $e_h(B_F, d_{\nu,\tilde{\rho}}) \leq \frac{1}{c} e_h(B_F, d_{\nu,\rho})$.
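Indeed, this follows termwise: since $\tilde{\rho}_{\nu_j} \geq c \rho_{\nu_j}$ for every $j$,

$$d_{\nu,\tilde{\rho}}(x,y)^2 = \sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M \tilde{\rho}_{\nu_j}}\right)^2 \leq \sum_{j=1}^M \left(\frac{f_{\nu_j}(x)-f_{\nu_j}(y)}{M c \rho_{\nu_j}}\right)^2 = \frac{1}{c^2}\, d_{\nu,\rho}(x,y)^2\,.$$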

• It is now well-understood how to (approximately) sample from the uniform measure on a convex body. (See [JLLS21] and [CV18].)

Theorem: There is an algorithm that, given a convex body $A \subseteq \mathbb{R}^n$ satisfying $r \cdot B_2^n \subseteq A \subseteq R \cdot B_2^n$ and $\varepsilon> 0$, samples from a distribution that is within TV distance $\varepsilon$ from the uniform measure on $A$ using $O(n^3 (\log \frac{nR}{\varepsilon r})^{O(1)})$ membership oracle queries to $A$, and $(n (\log \frac{nR}{\varepsilon r}))^{O(1)}$ additional time.

• Say that a seminorm $N$ on $\mathbb{R}^n$ is $(r,R)$-rounded if it holds that

$r \|x\|_2 \leqslant N(x) \leqslant R \|x\|_2\,,\quad \forall x \in \ker(N)^{\perp}\,.$

If $N$ is a genuine norm (so that $\ker(N)=0$) and $N$ is $(r,R)$-rounded, then $r B_2^n \subseteq B_N \subseteq R B_2^n$, hence we can use the theorem to sample and estimate. But even in this case, the running time for computing $\rho_1,\ldots,\rho_m$ would be $\tilde{O}(m n^{O(1)} \log \frac{R}{r})$, since evaluating $N$ requires $\Omega(m)$ time, assuming each $N_i$ can be evaluated in $O(1)$ time.
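As a toy illustration of the sampling-based estimator (not the geometric random walks of the theorem above), the following sketch estimates the scores $\rho_i$ for a small hypothetical instance $N(x) = \sum_{i} |\langle a_i, x\rangle|$, where rejection sampling from an enclosing Euclidean ball is feasible because the dimension is tiny. All names here (`A`, `sample_B_N`, the conservative factor in the enclosing radius) are illustrative choices, not part of the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: N(x) = sum_i N_i(x) with N_i(x) = |<a_i, x>|.
A = rng.standard_normal((6, 3))  # m = 6 terms in dimension n = 3

def sample_B_N(num):
    """Rejection-sample uniform points from B_N = {x : N(x) <= 1}.

    Since N(x) >= r ||x||_2 for r = min_{||u||_2 = 1} N(u), the body B_N
    lies inside the Euclidean ball of radius 1/r.  We estimate r over
    random directions (with a conservative factor of 1/2).  This is only
    feasible in low dimension; in general one uses a geometric random
    walk as in the sampling theorem above.
    """
    n = A.shape[1]
    dirs = rng.standard_normal((2000, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    r = 0.5 * np.abs(dirs @ A.T).sum(axis=1).min()
    out = []
    while len(out) < num:
        g = rng.standard_normal((4096, n))
        radii = rng.random(4096) ** (1.0 / n) / r  # uniform in ball of radius 1/r
        X = g * (radii / np.linalg.norm(g, axis=1))[:, None]
        accept = np.abs(X @ A.T).sum(axis=1) <= 1.0  # keep points with N(x) <= 1
        out.extend(X[accept])
    return np.array(out[:num])

# Monte Carlo estimate of rho_i = E[N_i(Z)] / E[N(Z)].
Z = sample_B_N(20000)
term_means = np.abs(Z @ A.T).mean(axis=0)  # estimates E[N_i(Z)]
rho = term_means / term_means.sum()
```

Note that $\sum_i \rho_i = 1$ holds automatically, since $N = \sum_i N_i$ and expectation is linear.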

## The homotopy method

• Sparsification from sampling, and sampling from sparsification. We would like a better running time of the form $\tilde{O}\left((m + n^{O(1)}) \log \frac{R}{r}\right)$.

Note that if we could compute approximations $\tilde{\rho}_i$ to the importance scores, then we could sample from $\tilde{\rho}$, which would allow us to sparsify and obtain an approximation $\tilde{N}$ of $N$ with only $\tilde{O}(n)$ non-zero terms. This approximation could then be evaluated in time $n^{O(1)}$, allowing us to obtain samples in only $n^{O(1)}$ time. With these samples, we could then sparsify.

• To solve this “chicken and egg” problem, we use the following simple (but elegant) method. Define, for $t > 0$,

$N_t(x) \mathrel{\mathop:}=N(x) + t \norm{x}_2\,.$

Assuming $N$ is an $(r,R)$-rounded seminorm, it follows that $N_R(x) \asymp R \norm{x}_2$.
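Indeed, since $N(x) \leq R \norm{x}_2$ for all $x$ (the component of $x$ in $\ker(N)$ contributes nothing to $N$), we have

$$R \norm{x}_2 \leq N_R(x) = N(x) + R \norm{x}_2 \leq 2 R \norm{x}_2\,.$$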

As far as I know, in the context of $\ell_p$ regression and sparsification vs. sampling, this “homotopy method” originates in the work of [Li-Miller-Peng 2013]. See also the work of [Bubeck-Cohen-Lee-Li 2017].

• Thus we can compute an $\tilde{O}(n)$-sparse $1/2$-approximation $\tilde{N}_R$ to $N_R$ by sampling from the Euclidean unit ball. But since $N_{R/2}(x) \leq N_R(x) \leq 2 N_{R/2}(x)$, the sparse approximation $\tilde{N}_R$ is also a constant-factor approximation to $N_{R/2}$, and sampling with it allows us to obtain an $\tilde{O}(n)$-sparse $1/2$-approximation $\tilde{N}_{R/2}$ of $N_{R/2}$.

• Doing this $O(\log \frac{R}{\varepsilon r})$ times, we eventually obtain an $\tilde{O}(n)$-sparse $1/2$-approximation $\tilde{N}_{\varepsilon r}(x)$ of $N_{\varepsilon r}(x)$.

• We can use this to obtain an $\tilde{O}(n)$-sparse $\varepsilon$-approximation to $N_{\varepsilon r}(x)$, which is itself an $\varepsilon$-approximation to $N(x)$ on $\ker(N)^{\perp}$, using the assumption that $N$ is $(r,R)$-rounded.
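The last claim is immediate from the roundedness assumption: for $x \in \ker(N)^{\perp}$ we have $r \norm{x}_2 \leq N(x)$, hence

$$N(x) \leq N_{\varepsilon r}(x) = N(x) + \varepsilon r \norm{x}_2 \leq (1+\varepsilon) N(x)\,,\quad \forall x \in \ker(N)^{\perp}\,.$$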

• Finally, we can easily remove the Euclidean shift from $\tilde{N}_{\varepsilon r}(x)$, yielding a genuine $\varepsilon$-approximation to $N(x)$, since only the Euclidean part of $\tilde{N}_{\varepsilon r}(x)$ has support on $\ker(N)$.