For these problems, we consider our standard setup of sparsification via importance sampling: We have loss functions $f_1,\ldots,f_m : \mathbb R^n \to \mathbb R$, and define $F(x) \mathrel{\mathop:}=f_1(x) + \cdots + f_m(x)$.
Given a probability distribution $\rho \in \mathbb R^m$ with $\rho_1,\ldots,\rho_m > 0$ and a parameter $M \geqslant 1$, we sample $\nu_1,\ldots,\nu_M$ i.i.d. from $\rho$ and define the potential approximator
Define the sensitivity of the $i$th term by
and the total sensitivity as $\bar{\xi} \mathrel{\mathop:}=\xi_1 + \cdots + \xi_m$.
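As a concrete illustration of this setup, here is a minimal numerical sketch in Python. It assumes the standard unbiased estimator $\tilde{F}(x) = \frac{1}{M}\sum_{j=1}^{M} f_{\nu_j}(x)/\rho_{\nu_j}$ (the approximator above may be normalized differently), and the function names are purely illustrative.

\begin{verbatim}
import numpy as np

def approximate_F(fs, rho, M, x, rng=None):
    """Importance-sampling approximation of F(x) = f_1(x) + ... + f_m(x).

    fs  : list of callables f_1, ..., f_m
    rho : sampling probabilities (all positive, summing to 1)
    M   : number of i.i.d. samples nu_1, ..., nu_M drawn from rho
    """
    rng = np.random.default_rng() if rng is None else rng
    nu = rng.choice(len(fs), size=M, p=rho)   # nu_1, ..., nu_M ~ rho, i.i.d.
    # Each term f_{nu_j}(x) / rho_{nu_j} has expectation F(x), so their
    # average over the M samples is an unbiased estimator of F(x).
    return np.mean([fs[j](x) / rho[j] for j in nu])

# E.g., sampling proportionally to sensitivity upper bounds xi (a numpy array):
#   rho = xi / xi.sum()
\end{verbatim}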
Sensitivities and concentration.
Let us recall Bernstein’s inequality: If $X_1,\ldots,X_m$ are independent random variables such that with probability one, $\max_i |X_i| \leqslant B$, then for any $t > 0$,
Show that if we choose our probabilities so that $\rho_i \geqslant C \xi_i/\bar{\xi}$ for each $i=1,\ldots,m$, then for every fixed $x \in \mathbb R^n$ and every parameter $M$ satisfying
it holds that
where $c > 0$ is some universal constant.
Sampling by sensitivities is not enough.
Suppose that $A$ is the $2n \times n$ matrix
where $I_n$ is the $n \times n$ identity matrix, and $H_n$ is an $n \times n$ Hadamard matrix, which is a matrix \(H_n \in \{-1/\sqrt{n},1/\sqrt{n}\}^{n \times n}\) all of whose rows are orthogonal.
Show that $\norm{Ax}_1 \geqslant n^{1/4} \norm{x}_2$ for all $x \in \mathbb R^n$.
Use the Cauchy-Schwarz inequality \(\abs{\langle a_i, x\rangle} \leqslant\norm{a_i}_2 \norm{x}_2\) to show that if $f_i(x) = \abs{\langle a_i,x\rangle}$, where $a_1,\ldots,a_{2n}$ are the rows of $A$, then
Suppose we define $\rho_i \mathrel{\mathop:}=\xi_i/\bar{\xi}$ for $i=1,\ldots,m$. Argue that choosing $M \asymp \frac{\bar{\xi}}{\varepsilon^2} (\log n)^{O(1)}$ cannot possibly produce a sparsifier (even though part (a) shows there is enough concentration to get a guarantee for any single fixed $x \in \mathbb R^n$).
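As a sanity check on this construction, here is a short numerical sketch; it assumes $n$ is a power of two (so that a Sylvester-type Hadamard matrix exists) and uses \texttt{scipy.linalg.hadamard}.

\begin{verbatim}
import numpy as np
from scipy.linalg import hadamard

n = 64                                  # a power of two, for scipy's construction
H = hadamard(n) / np.sqrt(n)            # entries +-1/sqrt(n), orthonormal rows
A = np.vstack([np.eye(n), H])           # the 2n x n matrix with blocks I_n and H_n

# Check ||Ax||_1 >= n^{1/4} ||x||_2 on a few test vectors.
rng = np.random.default_rng(0)
for x in [np.eye(n)[0], np.ones(n), rng.standard_normal(n)]:
    print(np.linalg.norm(A @ x, 1) / np.linalg.norm(x), n ** 0.25)

# The Cauchy-Schwarz sensitivity bounds ||a_i||_2 / n^{1/4} sum to about
# 2 n^{3/4}, far smaller than the number of rows 2n.
print((np.linalg.norm(A, axis=1) / n ** 0.25).sum(), 2 * n ** 0.75)
\end{verbatim}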
Approximate weights for generalized linear models.
Suppose now that $f_i(x) \mathrel{\mathop:}=\varphi_i(\langle a_i,x\rangle)$ for some loss functions $\varphi_1,\ldots,\varphi_m : \mathbb R \to \mathbb R_+$ and vectors $a_1,\ldots,a_m \in \mathbb R^n$. Define the sensitivity of the $i$th term at scale $s$ by
and the total sensitivity at scale $s$ by $\bar{\xi}(s) \mathrel{\mathop:}=\xi_1(s) + \cdots + \xi_m(s)$.
Recall the definition of an $\alpha$-approximate weight at scale $s > 0$: This is a weight function $w \in \mathbb R_+^m$ such that
where $M_w \mathrel{\mathop:}=\sum_{i=1}^m w_i a_i a_i^{\top} = A^{\top} W A$ if $A$ is the matrix with rows $a_1,\ldots,a_m$ and $W \mathrel{\mathop:}=\mathrm{diag}(w_1,\ldots,w_m)$. You may assume that $A$ has full rank, and thus $A^{\top} W A$ has full rank whenever $w_1,\ldots,w_m > 0$.
You will now show that an approximate weight at scale $s$ gives an upper bound on the total sensitivity $\bar{\xi}(s)$.
Suppose that $w$ is an $\alpha$-approximate weight at scale $s$. Also suppose that for some $\theta > 0$, each $\varphi_i$ satisfies the following two conditions:
(Symmetry) $\varphi_i(z)=\varphi_i(-z)$ for all $z \in \mathbb R$ and $\varphi_i(0)=0$.
(Growth bounds) $\lambda^{\theta} \varphi_i(z) \leqslant\varphi_i(\lambda z) \leqslant\lambda^2 \varphi_i(z)$ for all $\lambda \geqslant 1$ and $z \in \mathbb R$.
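For example, the power losses $\varphi_i(z) = \abs{z}^p$ with $\theta \leqslant p \leqslant 2$ satisfy both conditions: symmetry is immediate, and for every $\lambda \geqslant 1$,
\[
\varphi_i(\lambda z) = \lambda^p \varphi_i(z), \qquad \lambda^{\theta} \varphi_i(z) \leqslant \lambda^{p} \varphi_i(z) \leqslant \lambda^{2} \varphi_i(z).
\]
One can also check that the Huber-type loss $\varphi_i(z) = \min\{\abs{z}, z^2\}$ satisfies the growth bounds with $\theta = 1$.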
Prove that for all $x \in \mathbb R^n$ and $i=1,\ldots,m$,
Prove that for all $x \in \mathbb R^n$,
Prove that for $x \in \mathbb R^n$ with $F(x) \leqslant s$, it holds that
where the implicit constant may depend on the parameters $\alpha$ and $\theta$.
Prove that for all $x \in \mathbb R^n$ and $i=1,\ldots,m$,
where the implicit constant may depend on the parameters $\alpha$ and $\theta$.
Prove that
[It might help to keep in mind that if $V$ is a matrix with rows $v_1,\ldots,v_m \in \mathbb R^n$, then the leverage scores satisfy $\sigma_1(V) + \cdots + \sigma_m(V) \leqslant\mathop{\mathrm{rank}}(V)$.]
Prove that $\bar{\xi}(s) = \xi_1(s) + \cdots + \xi_m(s) \lesssim n$, where the implicit constant will depend on $\alpha$ and $\theta$.
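For intuition about the hint above, here is a minimal Python sketch of leverage scores (the names are illustrative); note that for $V = W^{1/2} A$ the $i$th score is $\sigma_i(V) = w_i \langle a_i, M_w^{-1} a_i\rangle$.

\begin{verbatim}
import numpy as np

def leverage_scores(V):
    """sigma_i(V) = <v_i, (V^T V)^+ v_i> for the rows v_i of V."""
    G = np.linalg.pinv(V.T @ V)              # pseudoinverse of the Gram matrix
    return np.einsum('ij,jk,ik->i', V, G, V)

rng = np.random.default_rng(0)
V = rng.standard_normal((20, 5))
# The scores sum to rank(V) here (and never exceed it).
print(leverage_scores(V).sum(), np.linalg.matrix_rank(V))
\end{verbatim}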
Computing approximate weight schemes.
Consider the same setup as the preceding problem. An $\alpha$-approximate weight scheme is a family of weights \(\{ w^{(j)} \in \mathbb R_+^m : j=j_{\min},\ldots,j_{\max} \}\) such that
$w^{(j)}$ is an $\alpha$-approximate weight at scale $2^j$ for each $j=j_{\min}, \ldots, j_{\max}$.
For all $j_{\min} \leqslant j \leqslant j_{\max} - 1$,
Define the functions
In lecture, we argued that (under the growth bounds in Problem 3) for some $0 < \delta < 1$ and any $w,w' \in \mathbb R_+^m$, we have
where \(d(w,w') \mathrel{\mathop:}= \max \left\{ \abs{\log \frac{w_i}{w'_i}} : i=1,\ldots,m\right\}\).
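In particular, if the displayed bound asserts that each $\Lambda_j$ is a $\delta$-contraction with respect to $d$, then iterating it shrinks distances geometrically: for every $k \geqslant 1$,
\[
d\big(\Lambda_j^k(w), \Lambda_j^k(w')\big) \leqslant \delta^k \, d(w,w').
\]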
Define $w^{(j_{\min})} \mathrel{\mathop:}=\Lambda_{j_{\min}}^{k}(1,\ldots,1)$, where $\Lambda_j^k = \Lambda_j \circ \Lambda_j \circ \cdots \circ \Lambda_j$ is the $k$-fold composition of $\Lambda_j$.
Show that we can choose $k$ large enough so that
Choose $k$ large enough so that \eqref{eq:jmin} holds and take this as the definition of $w^{(j_{\min})}$.
Show that $w^{(j_{\min})}$ is an $\alpha$-approximate weight at scale $2^{j_{\min}}$ for some $\alpha = \alpha(\delta)$.
For $j = j_{\min}+1,j_{\min}+2,\ldots,j_{\max}$, define inductively the weights
Prove that \(\{ w^{(j)} : j_{\min} \leqslant j \leqslant j_{\max} \}\) is an $\alpha$-approximate weight scheme for some $\alpha > 0$ (possibly different from the $\alpha$ you computed in the previous part).
Note that you must show both that each $w^{(j)}$ is an $\alpha$-approximate weight and also that \eqref{eq:weight-scheme} holds. It will help to note that $\Lambda_{j}(w) = \frac12 \Lambda_{j-1}(w)$ for all $j$.
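Purely as an illustration of how such a scheme might be computed (a sketch, not a solution): the maps \texttt{Lam} below are abstract callables standing in for the $\Lambda_j$, and the inductive step $w^{(j)} = \Lambda_j(w^{(j-1)})$ is an assumption standing in for the displayed definition above.

\begin{verbatim}
import math

def log_distance(w, v):
    """d(w, v) = max_i |log(w_i / v_i)|."""
    return max(abs(math.log(a / b)) for a, b in zip(w, v))

def iterate(Lam, w, k):
    """k-fold composition Lam^k applied to w.  If Lam is a delta-contraction
    in d, then k ~ log(d_0 / tol) / log(1 / delta) steps bring the iterate
    within distance tol of Lam's fixed point."""
    for _ in range(k):
        w = Lam(w)
    return w

def weight_scheme(Lambdas, m, k):
    """Lambdas = [Lambda_{j_min}, ..., Lambda_{j_max}] given as callables."""
    w = iterate(Lambdas[0], [1.0] * m, k)   # w^(j_min) = Lambda_{j_min}^k(1, ..., 1)
    scheme = [w]
    for Lam in Lambdas[1:]:
        w = Lam(w)          # assumed inductive step w^(j) = Lambda_j(w^(j-1))
        scheme.append(w)
    return scheme
\end{verbatim}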