# A variational formulation

• Fix $p \geqslant 1$ and vectors $a_1,\ldots,a_m \in \R^n$. Consider the optimization problem, over all $n \times n$ matrices,

$$$\label{eq:lewis-opt} \textrm{maximize} \quad \left\{ \det(U) : \|U a_1\|_2^p + \cdots + \|U a_m\|_2^p = n \right \}.$$$
• Claim: Suppose that $U_0$ is a maximizer of \eqref{eq:lewis-opt}. Then, $(U_0^{\top} U_0)^{-1} = A^{\top} W A\,,$ where $W$ is a nonnegative diagonal matrix with $W_{ii} = \norm{U_0 a_i}_2^{p-2}$ for $i=1,\ldots,m$.

• Proof: The method of Lagrange multipliers tells us that there is a multiplier $\lambda$ such that the maximum is given by a critical point of the functional

$G(U) \mathrel{\mathop:}=\det(U) + \lambda \left(n - \sum_{i=1}^m \|U a_i\|_2^p\right).$

Assuming that $\mathrm{span}(a_1,\ldots,a_m) = \R^n$, it is straightforward to check that the optimization has a finite, non-zero maximum, and therefore any maximizer $U_0$ is invertible.

Let us compute the gradient of $G(U)$ with respect to the matrix entries $\{ U_{ij} : 1 \leqslant i,j \leqslant n \}$. The rank-one update rule for the determinant says that for invertible $U$,

$\det(U + u v^{\top}) = \det(U) (1+\langle v, U^{-1} u\rangle)\,.$

This follows from $U + uv^{\top} = U (I + U^{-1} uv^{\top})$, $\det(AB)=\det(A)\det(B)$, and $\det(I + uv^{\top}) = 1 + \langle u,v\rangle$.

Therefore

$\partial_{U_{ij}} \det(U) = \left.\frac{d}{dt}\right|_{t=0} \det(U + t e_i e_j^{\top}) = \det(U) \langle e_j, U^{-1} e_i\rangle = \det(U) U^{-1}_{ji}\,.$

(From this one can see the more general formula: $\mathsf{d}\det(U) = \det(U) \mathop{\mathrm{tr}}(U^{-1} \mathsf{d} U)$.)

So, denoting the inverse transpose by $U^{-\top}$, we have

$\nabla_U \det(U) = \det(U) U^{-\top}\,.$
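Both the rank-one update rule and the resulting gradient formula are easy to sanity-check numerically. The following throwaway sketch (all variable names are ours) compares $\det(U) U^{-\top}$ against central finite differences on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
U = rng.standard_normal((n, n))
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# rank-one update rule: det(U + u v^T) = det(U) (1 + <v, U^{-1} u>)
lhs = np.linalg.det(U + np.outer(u, v))
rhs = np.linalg.det(U) * (1 + v @ np.linalg.inv(U) @ u)

# gradient formula: compare det(U) U^{-T} against central finite differences
grad = np.linalg.det(U) * np.linalg.inv(U).T
eps = 1e-6
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        fd[i, j] = (np.linalg.det(U + E) - np.linalg.det(U - E)) / (2 * eps)
```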

Let’s also compute

$\mathsf{d} \norm{U a_k}_2^p = p \norm{U a_k}_2^{p-1} \frac{\mathsf{d} \!\left(\|Ua_k\|_2^{2}\right)}{2 \|U a_k\|_2}\,.$

Now, observe that $\norm{U a_k}_2^2 = \mathop{\mathrm{tr}}\left(U^{\top} U a_k a_k^{\top}\right)$, thus using linearity of the gradient and the trace,

\begin{aligned} \mathsf{d} \left(\sum_{k=1}^m \|U a_k\|_2^{p}\right) &= \mathop{\mathrm{tr}}\left(\mathsf{d} \left(U^{\top} U\right) \sum_{k=1}^m \left(\frac{p}{2} \|U a_k\|_2^{p-2}\right) a_k a_k^{\top}\right). \end{aligned}

Now a simple application of the product rule gives

$\mathsf{d} (U^{\top} U) = (\mathsf{d} U^{\top}) U + U^{\top} (\mathsf{d} U) = (\mathsf{d} U)^{\top} U + U^{\top} \mathsf{d} U\,.$

Plugging this in above and using $\mathop{\mathrm{tr}}(M)=\mathop{\mathrm{tr}}(M^{\top})$ gives

$\mathsf{d} \left(\sum_{k=1}^m \|U a_k\|_2^{p}\right) = \mathop{\mathrm{tr}}\left((\mathsf{d} U)^{\top} U A^{\top} D_U A\right),$

where $D_U$ is the diagonal matrix with $(D_U)_{kk} = p \norm{U a_k}_2^{p-2}$.

Therefore

$\nabla_U \left(\sum_{k=1}^m \norm{U a_k}_2^{p}\right) = U A^{\top} D_U A\,,$

which implies

$\nabla_U G(U) = \det(U) U^{-\top} - \lambda U A^{\top} D_U A\,.$

So if $U_0$ is a critical point of $G$, we have

$\det(U_0) U_0^{-\top} = \lambda U_0 A^{\top} D_{U_0} A\,,$

or equivalently $(U_0^{\top} U_0)^{-1} = \lambda_0 A^{\top} D_{U_0} A\,,$ with $\lambda_0 \mathrel{\mathop:}=\lambda/\det(U_0)$.

To compute the value of $\lambda_0$, we can use $A^{\top} D_{U_0} A = \sum_{i=1}^m (D_{U_0})_{ii} a_i a_i^{\top}$ to calculate

\begin{aligned} n = \mathop{\mathrm{tr}}\left((U_0^{\top} U_0)^{-1} (U_0^{\top} U_0)\right) &= \lambda_0 \sum_{i=1}^m \mathop{\mathrm{tr}}\left((D_{U_0})_{ii} (U_0^{\top} U_0) a_i a_i^{\top}\right) \\ &= \lambda_0 \sum_{i=1}^m (D_{U_0})_{ii} \|U_0 a_i\|_2^2 \\ &= \lambda_0 p \sum_{i=1}^m \|U_0 a_i\|_2^p = \lambda_0 p n\,,\end{aligned}

therefore $\lambda_0 = 1/p$.
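The gradient formula $\nabla_U \sum_k \norm{U a_k}_2^p = U A^{\top} D_U A$ used above admits the same kind of finite-difference sanity check (a sketch; the dimensions and names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 3, 7, 1.5
U = rng.standard_normal((n, n))
A = rng.standard_normal((m, n))              # rows a_1, ..., a_m

def f(U):
    # sum_k ||U a_k||_2^p ; row k of A @ U.T is (U a_k)^T
    return (np.linalg.norm(A @ U.T, axis=1) ** p).sum()

# gradient formula U A^T D_U A with (D_U)_kk = p ||U a_k||_2^{p-2}
D = p * np.linalg.norm(A @ U.T, axis=1) ** (p - 2)
grad = U @ A.T @ (D[:, None] * A)

# central finite differences, entry by entry
eps = 1e-6
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        fd[i, j] = (f(U + E) - f(U - E)) / (2 * eps)
```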

## Properties of Lewis weights

• Let $U_0$ be a maximizer of \eqref{eq:lewis-opt} and define $U \mathrel{\mathop:}=(U_0^{\top} U_0)^{1/2}$. Then,

$U^{-1} = (A^{\top} W A)^{1/2}\,,$

with $W_{ii} = \norm{U a_i}_2^{p-2}$ for $i=1,\ldots,m$, and $\norm{U a_1}_2^p + \cdots + \norm{U a_m}_2^p = n$.

Let us define $w_i \mathrel{\mathop:}=W_{ii}$ for convenience.

• First, note the following consequence of Cauchy-Schwarz (using that $U$ is symmetric, so $\langle a_i,x\rangle = \langle U a_i, U^{-1} x\rangle$):

$$$\label{eq:lewis-cs} |\langle a_i,x\rangle| \leqslant\|U a_i\|_2 \|U^{-1} x\|_2\,.$$$
• Fix $1 \leqslant p \leqslant 2$ and note the formula

$$$\label{eq:lewis-exp} \|U^{-1} x\|_2^2 = \langle x, (A^{\top} W A) x\rangle = \sum_{i=1}^m w_i \langle a_i, x\rangle^2 =\sum_{i=1}^m w_i |\langle a_i, x\rangle|^p |\langle a_i,x\rangle|^{2-p}\,.$$$

Now use \eqref{eq:lewis-cs} to arrive at

\begin{aligned} \|U^{-1} x\|_2^2 &\leqslant\sum_{i=1}^m |\langle a_i,x\rangle|^p w_i \|U a_i\|_2^{2-p} \|U^{-1} x\|_2^{2-p} \\ &= \|U^{-1} x\|_2^{2-p} \sum_{i=1}^m |\langle a_i,x\rangle|^p \\ &= \|U^{-1} x\|_2^{2-p} \|Ax\|_p^p\,.\end{aligned}

Simplifying yields

$$$\label{eq:l2-lp} \|U^{-1} x\|^p_2 \leqslant\|Ax\|_p^p\,.$$$
• Consulting \eqref{eq:lewis-cs} again and combining it with \eqref{eq:l2-lp} yields

$|\langle a_i,x\rangle|^p \leqslant\|U a_i\|_2^p \|Ax\|_p^p\,.$

In particular, we get a bound on the sensitivity of the $i$th term:

$$$\label{eq:lewis-sens} s_i \mathrel{\mathop:}=\max_{x \neq 0} \frac{|\langle a_i,x\rangle|^p}{\|Ax\|_p^p} \leqslant\|U a_i\|_2^p\,,$$$

and $s_1+\cdots+s_m \leqslant \norm{U a_1}_2^p + \cdots + \norm{U a_m}_2^p = n$.

Thus Lewis weights give us sampling probabilities:

$\rho_i \mathrel{\mathop:}=\frac{\|U a_i\|_2^p}{n}\,,\quad i=1,\ldots,m\,.$
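For $p = 2$ the weights are trivial ($w_i = 1$), so $\|U a_i\|_2^2 = \langle a_i, (A^{\top} A)^{-1} a_i\rangle$ is the $i$-th leverage score of $A$ and the $\rho_i$ are normalized leverage scores. A minimal sketch of this special case (names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 4
A = rng.standard_normal((m, n))       # rows a_1, ..., a_m span R^n (a.s.)

# p = 2: U^{-2} = A^T A, so ||U a_i||_2^2 = <a_i, (A^T A)^{-1} a_i>
M = np.linalg.inv(A.T @ A)
sigma = np.einsum('ij,jk,ik->i', A, M, A)   # leverage scores, sum to n

rho = sigma / n                              # sampling probabilities
```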

## Entropy bounds

• As before, the analysis of concentration for importance sampling comes down to analyzing covering numbers of the set $B_{A,p} \mathrel{\mathop:}=\{ x \in \R^n : \|Ax\|_p \leqslant 1 \}$ in the metric

$d_{\rho,\infty}(x,y) \mathrel{\mathop:}=\max_{i=1,\ldots,m} \frac{|\langle a_i,x-y\rangle|}{\rho_i^{1/p}} = \max_{i=1,\ldots,m} \frac{|\langle a_i,x-y\rangle|}{\|U a_i\|_2}\,.$

The quantity we need to bound in this case is

$$$\label{eq:ldb} \left(\log \mathcal{N}(B_{A,p}, d_{\rho,\infty}, r)\right)^{1/p} \lesssim \frac{L n^{1/2}}{r}\,,$$$

where $L$ should be $(\log m)^{O(1)}$.

This metric is clearly induced by the (semi-)norm

$\|x\|_{\rho,\infty} \mathrel{\mathop:}=\max_{i=1,\ldots,m} \frac{|\langle a_i,x\rangle|}{\|U a_i\|_2} = \max_{i=1,\ldots,m} \frac{|\langle U a_i, U^{-1} x\rangle|}{\|U a_i\|_2}\,.$

From \eqref{eq:l2-lp}, we see that the Euclidean ball $B_{\rho,2} \mathrel{\mathop:}=\{ x \in \R^n : \norm{U^{-1} x}_2 \leqslant 1 \}$ satisfies $B_{A,p} \subseteq B_{\rho,2}$.

• Denote the unit ball $B_{\rho,\infty} \mathrel{\mathop:}=\{ x \in \R^n : \norm{x}_{\rho,\infty} \leqslant 1 \}$. Then we can bound

$\mathcal{N}(B_{A,p}, r B_{\rho,\infty}) \leqslant\mathcal{N}(B_{\rho,2}, r B_{\rho,\infty})\,.$

If we apply the linear transformation $U^{-1}$ that takes $B_{\rho,2}$ to the standard Euclidean ball $B_2$, then the dual-Sudakov inequality tells us how to bound the latter quantity:

$$$\label{eq:lewis-ds} (\log \mathcal{N}(B_{\rho,2}, r B_{\rho,\infty}))^{1/2} \lesssim \frac{n^{1/p}}{r} \max_{i=1,\ldots,m} \frac{|\langle U a_i, \mathbf{g}\rangle|}{\|U a_i\|_2} \lesssim \frac{n^{1/p}}{r} \sqrt{\log m}\,.$$$

Problematically, this bound is too weak to recover \eqref{eq:ldb}. Taking both sides to the $p/2$ power gives

$$$\label{eq:badcover} (\log \mathcal{N}(n^{1/p} B_{\rho,2}, r B_{\rho,\infty}))^{1/p} \lesssim \frac{n^{1/2}}{r^{2/p}} (\log m)^{p/4}\,.$$$

The issue is that the scaling in $r$ is bad unless $p=2$ (compare \eqref{eq:ldb}).

• The problem is our use of the worst-case bound $B_{A,p} \subseteq n^{1/p} B_{\rho,2}$. There is a more sophisticated way to bound covering numbers in this setting using norm interpolation and duality of covering numbers (see Ledoux-Talagrand, Prop 15.19).

But the sort of improvement one needs can be summarized as follows: \eqref{eq:badcover} is bad as $r \to 0$, but it gives a good bound when $r$ is a small constant. Having covered $B_{\rho,2}$ by translates of $B_{\rho,\infty}$, we get an improved bound on the inner products $|\langle a_i,x\rangle|^{2-p}$ appearing in \eqref{eq:lewis-exp}. This, in turn, gives a better bound on $\norm{U^{-1} x}_2$, which means that when we want to reduce the radius from $r$ to $r/2$ (say), we get an improved estimate for the Euclidean diameter, making the application of \eqref{eq:lewis-ds} stronger.

## Computation of the Lewis weights for $p < 4$: A contraction argument

The following algorithm is taken from the paper *$\ell_p$ row sampling by Lewis weights* by Cohen and Peng.

• For $p < 4$, it is possible to compute the Lewis weights using a simple iterative scheme, assuming that one can compute leverage scores. For a matrix $A$, define the $i$th leverage score by $\sigma_i(A) \mathrel{\mathop:}=\langle a_i, (A^{\top} A)^{-1} a_i\rangle$. In this section, if $W$ is a nonnegative $m \times m$ diagonal matrix, we will use $w \in \R_+^m$ for the vector with $w_i = W_{ii}$ for $i=1,\ldots,m$.

• Define $\tau_i(w) \mathrel{\mathop:}=\langle a_i, (A^{\top} W A)^{-1} a_i\rangle$. The idea is to recall that for the Lewis weights, we have $w_i = \norm{U a_i}_2^{p-2}$, hence

$$$\label{eq:lw-desc1} w_i^{2/(p-2)} = \|Ua_i\|_2^2 = \langle a_i, (A^{\top} W A)^{-1} a_i\rangle = \tau_i(w) = \frac{\sigma_i(W^{1/2} A)}{w_i}\,,$$$

since $\sigma_i(W^{1/2} A) = w_i \langle a_i, (A^{\top} W A)^{-1} a_i\rangle$.
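This last identity, $\sigma_i(W^{1/2} A) = w_i \tau_i(w)$, holds for any positive weights and is easy to check numerically (a throwaway sketch; names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 3))
w = np.exp(rng.standard_normal(10))          # arbitrary positive weights

M = np.linalg.inv(A.T @ (w[:, None] * A))    # (A^T W A)^{-1}
tau = np.einsum('ij,jk,ik->i', A, M, A)      # tau_i(w) = <a_i, M a_i>

B = np.sqrt(w)[:, None] * A                  # W^{1/2} A
sigma = np.einsum('ij,jk,ik->i', B, np.linalg.inv(B.T @ B), B)
```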

• This suggests the simple update rule

$\hat{w}_i \mathrel{\mathop:}=\tau_i(w)^{(p-2)/2} = \left(\frac{\sigma_i(W^{1/2} A)}{w_i}\right)^{(p-2)/2}\,,$

where we use the latter equality for computing $\tau_i(w)$ given a subroutine that computes leverage scores of $W^{1/2} A$.
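In code, the update rule is only a few lines. The sketch below (with dense linear algebra standing in for a fast leverage-score subroutine; all names are ours) iterates $w_i \mapsto \tau_i(w)^{(p-2)/2}$ to a fixed point:

```python
import numpy as np

def tau(A, w):
    """tau_i(w) = <a_i, (A^T W A)^{-1} a_i>."""
    M = np.linalg.inv(A.T @ (w[:, None] * A))
    return np.einsum('ij,jk,ik->i', A, M, A)

def lewis_weights(A, p, iters=100):
    """Fixed-point iteration w_i <- tau_i(w)^{(p-2)/2}."""
    w = np.ones(A.shape[0])
    for _ in range(iters):
        w = tau(A, w) ** ((p - 2) / 2)
    return w

# example: p = 1 Lewis weights of a random 20 x 5 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
w = lewis_weights(A, p=1.0)
t = tau(A, w)          # at the fixed point, t_i = ||U a_i||_2^2
```

At the fixed point, $\tau_i(w)^{p/2} = \norm{U a_i}_2^p$, so $\sum_i \tau_i(w)^{p/2} = n$ serves as a convergence check.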

• To see that this converges quickly to the unique weights satisfying \eqref{eq:lw-desc1}, let us define the metric $d$ on weights:

$d(w,w') \mathrel{\mathop:}=\max \left\{\left|\log \frac{w_i}{w'_i}\right| : i =1,\ldots,m\right\}\,,$

as well as the iteration map $\varphi: \R_+^m \to \R_+^m$ by

$\varphi(w)_i \mathrel{\mathop:}=\left(\tau_i\left(w\right)\right)^{(p-2)/2}\,.$

Finally, define the vector $\tau(w) \in \R_+^m$ by $\tau(w) \mathrel{\mathop:}=(\tau_1(w),\ldots,\tau_m(w))$.

• Claim: For any $w,w' \in \R_+^m$, it holds that $d(\tau(w),\tau(w')) \leqslant d(w,w')\,.$

• Proof: If $w_i \leqslant\alpha w'_i$ for every $i=1,\ldots,m$, then $W \preceq \alpha W'$ in the PSD (Loewner) ordering. Therefore $A^{\top} W A \preceq \alpha A^{\top} W' A$, and finally $(A^{\top} W' A)^{-1} \preceq \alpha (A^{\top} W A)^{-1}$, since matrix inversion inverts the PSD order.

Finally, by definition this yields $\langle u, (A^{\top} W' A)^{-1} u\rangle \leqslant\alpha \langle u, (A^{\top} W A)^{-1} u \rangle$ for every $u \in \R^n$. In particular, $\tau_i(w') \leqslant\alpha \tau_i(w)$ for each $i=1,\ldots,m$. Applying this with $\alpha = e^{d(w,w')}$, and symmetrically with the roles of $w$ and $w'$ exchanged, gives the claim.
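Numerically, the claim can be sanity-checked on random data (a sketch; names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 12, 3
A = rng.standard_normal((m, n))

def tau(w):
    M = np.linalg.inv(A.T @ (w[:, None] * A))
    return np.einsum('ij,jk,ik->i', A, M, A)

def d(w, wp):
    # d(w, w') = max_i |log(w_i / w'_i)|
    return np.max(np.abs(np.log(w / wp)))

w  = np.exp(rng.standard_normal(m))   # random positive weight vectors
wp = np.exp(rng.standard_normal(m))
lhs = d(tau(w), tau(wp))
rhs = d(w, wp)
```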

• As an immediate consequence of the lemma, we have

$d\!\left(\varphi(w), \varphi(w')\right) \leqslant\left|\frac{p-2}{2}\right| d(w,w')\,.$

Note that if $1 \leqslant p < 4$, then $|(p-2)/2| < 1$, and therefore $\varphi$ is a strict contraction on the space of weights. In particular, if $\delta > 0$ is such that $1-\delta = |(p-2)/2|$, then for any $w_0 \in \R_+^m$,

$d(\varphi^{k+1}(w_0), \varphi^{k}(w_0)) \leqslant(1-\delta)^{k} d(\varphi(w_0),w_0)\,,$

so that $\{\varphi^k(w_0)\}$ is a Cauchy sequence and therefore converges to a limit point.

Moreover, one can see that the limit point is unique, since for any $w_0,w_1 \in \R_+^m$,

$d(\varphi^k(w_0), \varphi^k(w_1)) \leqslant(1-\delta)^k d(w_0, w_1)\,.$

Since the Lewis weights \eqref{eq:lw-desc1} exist, they must be the unique limit of this procedure starting from any initial weights.
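To illustrate the uniqueness, one can run the iteration from two different starting points and observe that the limits agree (a sketch under the same conventions as above; names are ours):

```python
import numpy as np

def tau(A, w):
    M = np.linalg.inv(A.T @ (w[:, None] * A))
    return np.einsum('ij,jk,ik->i', A, M, A)

def iterate(A, p, w, iters=200):
    for _ in range(iters):
        w = tau(A, w) ** ((p - 2) / 2)   # contraction with factor |p-2|/2
    return w

rng = np.random.default_rng(5)
A = rng.standard_normal((15, 4))
p = 3.0                                  # any 1 <= p < 4 works
w0 = iterate(A, p, np.ones(15))
w1 = iterate(A, p, np.exp(rng.standard_normal(15)))
```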