$\ell_p$ Lewis weights

CSE 599, Autumn 2023

A variational formulation

Fix $p \geqslant 1$ and vectors $a_1,\ldots,a_m \in \R^n$. Consider the optimization problem, over all $n \times n$ matrices,
\[\begin{equation}\label{eq:lewis-opt} \textrm{maximize} \quad \left\{ \det(U) : \|U a_1\|_2^p + \cdots + \|U a_m\|_2^p = n \right \}. \end{equation}\]
Claim: Suppose that $U_0$ is a maximizer of \eqref{eq:lewis-opt}. Then, $(U_0^{\top} U_0)^{-1} = A^{\top} W A\,,$ where $W$ is a nonnegative diagonal matrix with $W_{ii} = \norm{U_0 a_i}_2^{p-2}$ for $i=1,\ldots,m$.
Proof: The method of Langrange multipliers tells us that there is a nonnegative $\lambda > 0$ such that the maximum is given by a critical point of the functional
\[G(U) \mathrel{\mathop:}=\det(U) + \lambda \left(n - \sum_{i=1}^m \|U a_i\|_2^p\right).\]
Assuming that $\mathrm{span}(a_1,\ldots,a_m) = \R^n$, it is straightforward to check that the optimization has a finite, non-zero maximum, and therefore any maximizer $U_0$ is invertible.

Let us compute the gradient of $G(U)$ with respect to the matrix entries ${ U_{ij} : 1 \leqslant i,j \leqslant n }$. The rank-one update rule for the determinant says that for invertible $U$,
\[\det(U + u v^{\top}) = \det(U) (1+\langle v, U^{-1} u\rangle)\,,\]
This follows from $U + uv^{\top} = U (I + U^{-1} uv^{\top})$, $\det(AB)=\det(A)\det(B)$, and $\det(I + uv^{\top}) = 1 + \langle u,v\rangle$.

Therefore
\[\partial_{U_{ij}} \det(U) = \left.\frac{d}{dt}\right|_{t=0} \det(U + t e_i e_j^{\top}) = \det(U) \langle e_j, U^{-1} e_i\rangle = \det(U) U^{-1}_{ji}\,.\]
(From this one can see the more general formula: $\mathsf{d}\det(U) = \det(U) \mathop{\mathrm{tr}}(U^{-1} \mathsf{d} U)$.)

So, denoting the inverse transpose by $U^{-\top}$, we have
\[\nabla_U \det(U) = \det(U) U^{-\top}\,.\]
Let’s also compute
\[\mathsf{d} \norm{U a_k}_2^p = p \norm{U a_k}_2^{p-1} \frac{\mathsf{d} \!\left(\|Ua_k\|_2^{2}\right)}{2 \|U a_k\|_2}\,,\]
Now, observe that $\norm{U a_k}_2^2 = \mathop{\mathrm{tr}}\left(U^{\top} U a_k a_k^{\top}\right)$, thus using linearity of the gradient and the trace,
\[\begin{aligned} \mathsf{d} \left(\sum_{k=1}^m \|U a_k\|_2^{p}\right) &= \mathop{\mathrm{tr}}\left(\mathsf{d} \left(U^{\top} U\right) \sum_{k=1}^m \left(p \|U a_k\|_2^{p-2}\right) a_k a_k^{\top}\right). \end{aligned}\]
Now a simple application of the product rule gives
\[\mathsf{d} (U^{\top} U) = (\mathsf{d} U^{\top}) U + U^{\top} (\mathsf{d} U) = (\mathsf{d} U)^{\top} U + U^{\top} \mathsf{d} U\,.\]
Plugging this in above and using $\mathop{\mathrm{tr}}(M)=\mathop{\mathrm{tr}}(M^{\top})$ gives
\[\mathsf{d} \left(\sum_{k=1}^m \|U a_k\|_2^{p}\right) = \mathop{\mathrm{tr}}\left((\mathsf{d} U)^{\top} U A^{\top} D_U A\right),\]
where $D_U$ is the diagonal matrix with $(D_U)_{kk} = p \norm{U a_k}_2^{p-2}$.

Therefore
\[\nabla_U \left(\sum_{k=1}^m \norm{U a_k}_2^{p}\right) = U A^{\top} D_U A\,,\]
which implies
\[\nabla_U G(U) = \det(U) U^{-\top} - \lambda U A^{\top} D_U A\,.\]
So if $U_0$ is a critical point of $G$, we have
\[\det(U_0) U_0^{-\top} = \lambda U_0 A^{\top} D_{U_0} A\,,\]
or equivalently $(U_0^{\top} U_0)^{-1} = \lambda_0 A^{\top} D_{U_0} A\,,$ with $\lambda_0 \mathrel{\mathop:}=\lambda/\det(U_0)$.

To compute the value of $\lambda_0$, we can use $A^{\top} D_{U_0} A = \sum_{i=1}^m (D_{U_0})_{ii} a_i a_i^{\top}$ to calculate
\[\begin{aligned} n = \mathop{\mathrm{tr}}\left((U_0^{\top} U_0)^{-1} (U_0^{\top} U_0)\right) &= \lambda_0 \sum_{i=1}^m \mathop{\mathrm{tr}}\left((D_{U_0})_{ii} (U_0^{\top} U_0) a_i a_i^{\top}\right) \\ &= \lambda_0 p \sum_{i=1}^m (D_{U_0})_{ii} \|U_0 a_i\|_2^2 \\ &= \lambda_0 p \sum_{i=1}^m \|U_0 a_i\|_2^p = \lambda_0 p n\,,\end{aligned}\]
therefore $\lambda_0 = 1/p$.

Properties of Lewis weights

Let $U_0$ be a maximizer of \eqref{eq:lewis-opt} and define $U \mathrel{\mathop:}=(U_0^{\top} U_0)^{1/2}$. Then,
\[U^{-1} = (A^{\top} W A)^{1/2}\,,\]
with $W_{ii} = \norm{U a_i}_2^{p-2}$ for $i=1,\ldots,m$, and $\norm{U a_1}_2^p + \cdots + \norm{U a_m}_2^p = n$.

Let us define $w_i \mathrel{\mathop:}=W_{ii}$ for convenience.
First, note the consequence of Cauchy-Schwarz:
\[\begin{equation}\label{eq:lewis-cs} |\langle a_i,x\rangle| \leqslant\|U a_i\|_2 \|U^{-1} x\|_2\,.\end{equation}\]
Fix $1 \leqslant p \leqslant 2$ and note the formula
\[\begin{equation}\label{eq:lewis-exp} \|U^{-1} x\|_2^2 = \langle x, (A^{\top} W A) x\rangle = \sum_{i=1}^m w_i \langle a_i, x\rangle^2 =\sum_{i=1}^m w_i |\langle a_i, x\rangle|^p |\langle a_i,x\rangle|^{2-p}\,.\end{equation}\]
Now use \eqref{eq:lewis-cs} to arrive at
\[\begin{aligned} \|U^{-1} x\|_2^2 &\leqslant\sum_{i=1}^m |\langle a_i,x\rangle|^p w_i \|U a_i\|_2^{2-p} \|U^{-1} x\|_2^{2-p} \\ &= \|U^{-1} x\|_2^{2-p} \sum_{i=1}^m |\langle a_i,x\rangle|^p \\ &= \|U^{-1} x\|_2^{2-p} \|Ax\|_p^p\,.\end{aligned}\]
Simplifying yields
\[\begin{equation}\label{eq:l2-lp} \|U^{-1} x\|^p_2 \leqslant\|Ax\|_p^p\,.\end{equation}\]
Consulting \eqref{eq:lewis-cs} again, this yields
\[|\langle a_i,x\rangle|^p \leqslant\|U a_i\|_2^p \|Ax\|_p^p\,.\]
In particular, we get a bound on the sensitivity of the $i$th term:
\[\begin{equation}\label{eq:lewis-sens} s_i \mathrel{\mathop:}=\max_{x \neq 0} \frac{|\langle a_i,x\rangle|^p}{\|Ax\|_p^p} \leqslant\|U a_i\|_2^p\,,\end{equation}\]
and $s_1+\cdots+s_m \leqslant n (\norm{U a_1}_2^p + \cdots + \norm{U a_m}_2^p) = n$.

Thus Lewis weights give us sampling probabilities:
\[\rho_i \mathrel{\mathop:}=\frac{\|U a_i\|_2^p}{n}\,,\quad i=1,\ldots,m\,.\]

Entropy bounds

As before, the analysis of concentration for importance sampling comes down to analyzing covering numbers of the set $B_{A,p} \seteq \{ x \in \R^n : \|Ax\|_p \leqslant 1 \}$ in the metric
\[d_{\rho,\infty}(x,y) \mathrel{\mathop:}=\max_{i=1,\ldots,m} \frac{|\langle a_i,x-y\rangle|}{\rho_i^{1/p}} = \max_{i=1,\ldots,m} \frac{|\langle a_i,x-y\rangle|}{\|U a_i\|_2}\,.\]
The quantity we need to bound in this case is
\[\begin{equation}\label{eq:ldb} \left(\log \mathcal{N}(B_{A,p}, d_{\rho,\infty}, r)\right)^{1/p} \lesssim \frac{L n^{1/2}}{r}\,, \end{equation}\]
where $L$ should be $(\log m)^{O(1)}$.

This metric is clearly inducd by the (semi-)norm
\[\|x\|_{\rho,\infty} \mathrel{\mathop:}=\max_{i=1,\ldots,m} \frac{|\langle a_i,x\rangle|}{\|U a_i\|_2} = \max_{i=1,\ldots,m} \frac{|\langle U a_i, U^{-1} x\rangle|}{\|U a_i\|_2}\,.\]
From \eqref{eq:l2-lp}, we see that the Euclidean ball $B_{\rho,2} \seteq \{ x \in \R^n : \norm{U^{-1} x}_2 \leq 1 \}$ satisfies $B_{A,p} \subseteq B_{\rho,2}$.
Denote the unit ball $B_{\rho,\infty} \seteq \{ x \in \R^n : \norm{x}_{\rho,\infty} \leqslant 1 \}$. Then we can bound
\[\mathcal{N}(B_{A,p}, r B_{\rho,\infty}) \leqslant\mathcal{N}(B_{\rho,2}, r B_{\rho,\infty})\,.\]
If we apply the linear transformation $U^{-1}$ that takes $B_{\rho,2}$ to the standard Euclidean ball $B_2$, then the dual-Sudakov inequality tells us how to bound the latter quantity:
\[\begin{equation}\label{eq:lewis-ds} (\log \mathcal{N}(B_{\rho,2}, r B_{\rho,\infty}))^{1/2} \lesssim \frac{n^{1/p}}{r} \max_{i=1,\ldots,m} \frac{|\langle U a_i, \mathbf{g}\rangle|}{\|U a_i\|_2} \lesssim \frac{n^{1/p}}{r} \sqrt{\log m}\,.\end{equation}\]
Problematically, this bound is too weak to recover \eqref{eq:ldb}. Taking both sides to the $p/2$ power gives
\[\begin{equation}\label{eq:badcover} (\log \mathcal{N}(n^{1/p} B_{\rho,2}, r B_{\rho,\infty}))^{1/p} \lesssim \frac{n^{1/2}}{r^{2/p}} (\log m)^{p/4}\,.\end{equation}\]
The issue is that the scaling in $r$ is bad unless $p=2$ (compare \eqref{eq:ldb}).
The problem is our use of the worst-case bound $B_{A,p} \subseteq n^{1/p} B_{\rho,2}$. There is a more sophisticated way to bound covering numbers in this setting using norm interpolation and duality of covering numbers (see Ledoux-Talagrand, Prop 15.19).

But the sort of improvement one needs can be summarized as follows: \eqref{eq:badcover} is bad for $r \to 0$, but it gives a good bound for $r$ small, but constant. Having covered $B_{\rho,2}$ by translates of $B_{\rho,\infty}$, we get an improved bound on the inner products $|\langle a_i,x\rangle|^{2-p}$ in \eqref{eq:lewis-exp} This, in turn, gives us a better bound on $\norm{U^{-1} x}_2$, which means that when we want to reduce the radius from $r$ to $r/2$ (say), we get an improved estimate for the Euclidean diameter, making the application of \eqref{eq:lewis-ds} stronger.

Computation of the Lewis weights for $p < 4$: A contraction argument

The following algorithm is taken from $\ell_p$ row sampling by Lewis weights by Cohen and Peng.

For $p < 4$, it is possible to compute the Lewis weights using a simple iterative scheme, assuming that one can compute leverage scores. For a matrix $A$, define the $i$th leverage score by $\sigma_i(A) \mathrel{\mathop:}=\langle a_i, (A^{\top} A)^{-1} a_i\rangle$. In this section, if $W$ is a nonnegative $m \times m$ diagonal matrix, we will use $w \in \R_+^m$ for the vector with $w_i = W_{ii}$ for $i=1,\ldots,m$.
Define $\tau_i(w) \mathrel{\mathop:}=\langle a_i, (A^{\top} W A)^{-1} a_i\rangle$. The idea is to recall that for the Lewis weights, we have $w_i = \norm{U a_i}_2^{p-2}$, hence
\[\begin{equation}\label{eq:lw-desc1} w_i^{2/(p-2)} = \|Ua_i\|_2^2 = \langle a_i, (A^{\top} W A)^{-1} a_i\rangle = \tau_i(w) = \frac{\sigma_i(W^{1/2} A)}{w_i}\,,\end{equation}\]
since $\sigma_i(W^{1/2} A) = w_i \langle a_i, (A^{\top} W A)^{-1} a_i\rangle$.
This suggests the simple update rule
\[\hat{w}_i \mathrel{\mathop:}=\tau_i(w)^{(p-2)/2} = \left(\frac{\sigma_i(W^{1/2} A)}{w_i}\right)^{(p-2)/2}\,,\]
where we use the latter equality for computing $\tau_i(w)$ given a subroutine that computes leverage scores of $W^{1/2} A$.
To see that this converges quickly to the unique weights satisfying \eqref{eq:lw-desc1}, let us define the metric $d$ on weights:
\[d(w,w') \mathrel{\mathop:}=\max \left\{\left|\log \frac{w_i}{w'_i}\right| : i =1,\ldots,m\right\}\,,\]
as well as the iteration map $\varphi: \R_+^m \to \R_+^m$ by
\[\varphi(w)_i \mathrel{\mathop:}=\left(\tau_i\left(w\right)\right)^{(p-2)/p}\,.\]
Finally, define the vector $\tau(w) \in \R_+^m$ by $\tau(w) \mathrel{\mathop:}=(\tau_1(w),\ldots,\tau_m(w))$.
Claim: For any $w,w’ \in \R_+^m$, it holds that $d(\tau(w),\tau(w')) \leqslant d(w,w')\,.$
Proof: If $w_i \leqslant\alpha w’_i$ for every $i=1,\ldots,m$, then $W \preceq \alpha W’$ in the PSD (Loewner) ordering. Therefore $A^{\top} W A \preceq \alpha A^{\top} W’ A$, and finally $(A^{\top} W’ A)^{-1} \preceq \alpha (A^{\top} W A)^{-1}$, since matrix inversion inverts the PSD order.

Finally, by definition this yields $\langle u, (A^{\top} W’ A)^{-1} u\rangle \leqslant\langle u, (A^{\top} W A)^{-1} u \rangle$ for every $u \in \R^m$. In particular, $\tau_i(w’) \leqslant\alpha \tau_i(w)$ for each $i=1,\ldots,m$.
As an immediate consequence of the lemma, we have
\[d\!\left(\varphi(w), \varphi(w')\right) \leqslant\left|\frac{p-2}{p}\right| d(w,w')\,.\]
Note that if $1 \leqslant p < 4$, then $|(p-2)/p| < 1$, and therefore $\varphi$ is a strict contraction on the space of weights. In particular, it holds that for $\delta > 0$ such that $1-\delta = |(p-2)/p|$, and any $w_0 \in \R_+^m$,
\[d(\varphi^{k+1}(w_0), \varphi^{k}(w_0)) \leqslant(1-\delta)^{k} d(\varphi(w_0),w_0)\,,\]
so that ${\varphi^k(w_0)}$ is a Cauchy sequence and therefore ${\varphi^k(w_0)}$ converges to a limit point.

Moreover, one can see that the limit point is unique, since for any $w_0,w_1 \in \R_+^m$,
\[d(\varphi^k(w_0), \varphi^k(w_1)) \leqslant(1-\delta)^k d(w_0, w_1)\,.\]
Since the Lewis weights \eqref{eq:lw-desc1} exist, they must be the unique limit of this procedure starting from any initial weights.