CSE 599I: Homework #3

Due: Fri, Dec 15 @ 11:59pm

Multi-scale importance sampling

For these problems, we consider our standard setup of sparsification via importance sampling: We have loss functions $f_1,\ldots,f_m : \mathbb R^n \to \mathbb R_+$, and define $F(x) \mathrel{\mathop:}=f_1(x) + \cdots + f_m(x)$.

Given a probability distribution $\rho \in \mathbb R^m$ with $\rho_1,\ldots,\rho_m > 0$ and a parameter $M \geqslant 1$, we sample $\nu_1,\ldots,\nu_M$ i.i.d. from $\rho$ and define the potential approximator

$$\tilde{F}_{\rho,\nu}(x) \mathrel{\mathop:}=\sum_{j=1}^M \frac{f_{\nu_j}(x)}{\rho_{\nu_j} M}\,.$$

Define the sensitivity of the $i$th term by

$$\xi_i \mathrel{\mathop:}=\max_{x : F(x) \neq 0} \frac{f_i(x)}{F(x)}\,,$$

and the total sensitivity as $\bar{\xi} \mathrel{\mathop:}=\xi_1 + \cdots + \xi_m$.
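
For intuition (this is not part of the problem set), here is a small Python sketch of the estimator $\tilde{F}_{\rho,\nu}$ for the illustrative choice $f_i(x) = \langle a_i,x\rangle^2$ with random rows $a_i$; for that quadratic loss the sensitivities are exactly the leverage scores of the matrix with rows $a_1,\ldots,a_m$. The instance, names, and choice of loss are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) instance: f_i(x) = <a_i, x>^2 for random rows a_i,
# so F(x) = ||Ax||_2^2 and the sensitivities are the leverage scores of A.
m, n = 500, 20
A = rng.standard_normal((m, n))

def F(x):
    return np.sum((A @ x) ** 2)

def F_tilde(x, rho, M):
    """The estimator above: the average of f_{nu_j}(x) / rho_{nu_j} over M i.i.d. draws nu_j ~ rho."""
    nu = rng.choice(m, size=M, p=rho)
    return np.mean((A[nu] @ x) ** 2 / rho[nu])

# Sampling proportional to the sensitivities: for quadratic losses,
# xi_i = <a_i, (A^T A)^{-1} a_i>, the i-th leverage score of A.
xi = np.einsum('ij,ij->i', A @ np.linalg.inv(A.T @ A), A)
rho = xi / xi.sum()

x = rng.standard_normal(n)
print(F(x), F_tilde(x, rho, M=2000))  # the two values should agree up to a few percent
```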

  1. Sensitivities and concentration.

    Let us recall Bernstein’s inequality: If $X_1,\ldots,X_m$ are independent mean-zero random variables such that, with probability one, $\max_i |X_i| \leqslant B$, then for any $t > 0$,

    $$\mathop{\mathrm{\mathbb{P}}}\left(\sum_{i=1}^m X_i > t\right) \leqslant\exp\left(\frac{-t^2}{2 \left(\sum_{i=1}^m \mathop{\mathrm{\mathbb{E}}}[X_i^2] + (B/3) t\right)}\right).$$

    Show that if we choose our probabilities so that $\rho_i \geqslant C \xi_i/\bar{\xi}$ for each $i=1,\ldots,m$, then for every fixed $x \in \mathbb R^n$ and every parameter $M$ satisfying

    $$M \geqslant\frac{\lambda}{C} \cdot \frac{\bar{\xi}}{\varepsilon^2}\,,$$

    it holds that

    $$\mathop{\mathrm{\mathbb{P}}}\left(|F(x)-\tilde{F}_{\rho,\nu}(x)| \leqslant\varepsilon F(x)\right) \geqslant 1 - e^{-c \lambda}\,,$$

    where $c > 0$ is some universal constant.

  2. Sampling by sensitivities is not enough.

    Suppose that $A$ is the $2n \times n$ matrix

    $$A = \begin{pmatrix} I_n \\ H_n \end{pmatrix}$$

    where $I_n$ is the $n \times n$ identity matrix, and $H_n$ is an $n \times n$ Hadamard matrix, which is a matrix \(H_n \in \{-1/\sqrt{n},1/\sqrt{n}\}^{n \times n}\) all of whose rows are orthogonal.

    1. Show that $\norm{Ax}_1 \geqslant n^{1/4} \norm{x}_2$ for all $x \in \mathbb R^n$.

    2. Use the Cauchy-Schwarz inequality \(\abs{\langle a_i, x\rangle} \leqslant\norm{a_i}_2 \norm{x}_2\) to show that if $f_i(x) = \abs{\langle a_i,x\rangle}$, where $a_1,\ldots,a_{2n}$ are the rows of $A$, then

      $$\bar{\xi} \leqslant 2 n^{3/4}\,.$$
    3. Suppose we define $\rho_i \mathrel{\mathop:}=\xi_i/\bar{\xi}$ for $i=1,\ldots,2n$. Argue that choosing $M \asymp \frac{\bar{\xi}}{\varepsilon^2} (\log n)^{O(1)}$ cannot possibly produce a sparsifier, i.e., cannot guarantee $|F(x)-\tilde{F}_{\rho,\nu}(x)| \leqslant \varepsilon F(x)$ simultaneously for all $x \in \mathbb R^n$ (even though Problem 1 shows there is enough concentration to get a guarantee for any single fixed $x \in \mathbb R^n$).
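
    The following Python snippet is a numerical sanity check (not a proof) of parts 1 and 2 above: it builds $A = \binom{I_n}{H_n}$ from a Sylvester Hadamard matrix, tests the inequality $\norm{Ax}_1 \geqslant n^{1/4}\norm{x}_2$ on random directions, and sums the Cauchy-Schwarz upper bounds on the sensitivities. The helper names are ad hoc.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def hadamard(n):
        """Sylvester construction of an n x n {+1, -1} Hadamard matrix (n a power of two)."""
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    n = 64
    H = hadamard(n) / np.sqrt(n)      # rows are orthonormal
    A = np.vstack([np.eye(n), H])     # the 2n x n matrix from this problem

    # Empirical check of part 1 on random directions (a sanity check, not a proof).
    X = rng.standard_normal((n, 1000))
    ratios = np.abs(A @ X).sum(axis=0) / np.linalg.norm(X, axis=0)
    print(ratios.min(), n ** 0.25)    # every ratio should exceed n^{1/4}

    # Part 2: Cauchy-Schwarz gives xi_i <= ||a_i||_2 / n^{1/4}, and every row has norm 1.
    xi_upper = np.linalg.norm(A, axis=1) / n ** 0.25
    print(xi_upper.sum(), 2 * n ** 0.75)   # the sum of the upper bounds is exactly 2 n^{3/4}
    ```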

  3. Approximate weights for generalized linear models.

    Suppose now that $f_i(x) \mathrel{\mathop:}=\varphi_i(\langle a_i,x\rangle)$ for some $a_1,\ldots,a_m \in \mathbb R^n$ and loss functions $\varphi_1,\ldots,\varphi_m : \mathbb R \to \mathbb R_+$. Define the sensitivity of the $i$th term at scale $s$ by

    $$\xi_i(s) \mathrel{\mathop:}=\max_{x : F(x) \in [s/2, s]} \frac{f_i(x)}{F(x)}\,,$$

    and the total sensitivity at scale $s$ by $\bar{\xi}(s) \mathrel{\mathop:}=\xi_1(s) + \cdots + \xi_m(s)$.

    Recall the definition of an $\alpha$-approximate weight at scale $s > 0$: This is a weight function $w \in \mathbb R_+^m$ such that

    $$\frac{s}{\alpha} \leqslant\frac{\varphi_i(\norm{M_w^{-1/2} a_i}_2)}{w_i \norm{M_w^{-1/2} a_i}_2^2} \leqslant \alpha s\,,\quad \forall i=1,\ldots,m\,,$$

    where $M_w \mathrel{\mathop:}=\sum_{i=1}^m w_i a_i a_i^{\top} = A^{\top} W A$ if $A$ is the matrix with rows $a_1,\ldots,a_m$ and $W \mathrel{\mathop:}=\mathrm{diag}(w_1,\ldots,w_m)$. You may assume that $A$ has full rank, and thus $A^{\top} W A$ has full rank whenever $w_1,\ldots,w_m > 0$.
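
    As an illustration of this definition (with an assumed loss $\varphi_i(z)=\min(z^2,|z|)$, which satisfies the growth bounds stated later in this problem with $\theta = 1$, and an arbitrary random instance), the following Python sketch computes the quantities $\norm{M_w^{-1/2} a_i}_2$ appearing above and reports the range of the ratios, from which one can read off a scale $s$ and factor $\alpha$ for which a given $w$ is an $\alpha$-approximate weight.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed instance: random rows a_i and the loss phi_i(z) = min(z^2, |z|).
    m, n = 200, 10
    A = rng.standard_normal((m, n))

    def phi(z):
        return np.minimum(z ** 2, np.abs(z))

    def weight_ratios(w):
        """Return the ratios phi_i(||M_w^{-1/2} a_i||_2) / (w_i ||M_w^{-1/2} a_i||_2^2)."""
        M_w = A.T @ (w[:, None] * A)                            # M_w = sum_i w_i a_i a_i^T
        tau = np.einsum('ij,ij->i', A @ np.linalg.inv(M_w), A)  # ||M_w^{-1/2} a_i||_2^2
        return phi(np.sqrt(tau)) / (w * tau)

    # w is an alpha-approximate weight at scale s exactly when all ratios lie in [s/alpha, alpha*s].
    r = weight_ratios(np.full(m, 1.0))
    s = np.sqrt(r.min() * r.max())          # best scale for the all-ones weight
    alpha = np.sqrt(r.max() / r.min())      # smallest admissible alpha at that scale
    print(s, alpha)
    ```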

    You will now show that an approximate weight at scale $s$ gives an upper bound on the total sensitivity $\bar{\xi}(s)$.

    Suppose that $w$ is an $\alpha$-approximate weight at scale $s$. Also suppose that for some $\theta > 0$, each $\varphi_i$ satisfies the following two conditions:

    • (Symmetry) $\varphi_i(z)=\varphi_i(-z)$ for all $z \in \mathbb R$ and $\varphi_i(0)=0$.

    • (Growth bounds) $\lambda^{\theta} \varphi_i(z) \leqslant\varphi_i(\lambda z) \leqslant\lambda^2 \varphi_i(z)$ for all $\lambda \geqslant 1$ and $z \in \mathbb R$.

    1. Prove that for all $x \in \mathbb R^n$ and $i=1,\ldots,m$,

      $$|\langle a_i,x\rangle| \leqslant\norm{M_w^{-1/2} a_i}_2 \norm{M_w^{1/2} x}_2\,.$$
    2. Prove that for all $x \in \mathbb R^n$,

      $$\norm{M_w^{1/2} x}_2^2 = \sum_{i=1}^m w_i \langle a_i, x\rangle^2\,.$$
    3. Prove that for $x \in \mathbb R^n$ with $F(x) \leqslant s$, it holds that

      $$\norm{M_w^{1/2} x}_2 \lesssim 1\,,$$

      where the implicit constant may depend on the parameters $\alpha$ and $\theta$.

    4. Prove that for all $x \in \mathbb R^n$ and $i=1,\ldots,m$,

      $$\varphi_i(\langle a_i,x\rangle) \lesssim s w_i \norm{M_w^{-1/2} a_i}_2^2\,,$$

      where the implicit constant may depend on the parameters $\alpha$ and $\theta$.

    5. Prove that

      $$\sum_{i=1}^m w_i \norm{M_w^{-1/2} a_i}_2^2 \leqslant n\,.$$

      [It might help to keep in mind that if $V$ is a matrix with rows $v_1,\ldots,v_m \in \mathbb R^n$, then the leverage scores satisfy $\sigma_1(V) + \cdots + \sigma_m(V) \leqslant\mathop{\mathrm{rank}}(V)$.]

    6. Prove that $\bar{\xi}(s) = \xi_1(s) + \cdots + \xi_m(s) \lesssim n$, where the implicit constant will depend on $\alpha$ and $\theta$.
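
    As a quick numerical check (not a substitute for the proofs above) of the identity behind part 5: the quantities $w_i \norm{M_w^{-1/2} a_i}_2^2$ are the leverage scores of $W^{1/2} A$, and they sum to $\mathop{\mathrm{rank}}(A) = n$. The instance below is an arbitrary random one, chosen only for illustration.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 10
    A = rng.standard_normal((m, n))        # assumed full-rank instance
    w = rng.uniform(0.5, 2.0, size=m)      # arbitrary positive weights

    M_w = A.T @ (w[:, None] * A)
    scores = w * np.einsum('ij,ij->i', A @ np.linalg.inv(M_w), A)  # w_i ||M_w^{-1/2} a_i||_2^2
    print(scores.sum(), np.linalg.matrix_rank(A))                  # both are n = 10 (up to rounding)
    ```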

  4. Computing approximate weight schemes.

    Consider the same setup as the preceding problem. An $\alpha$-approximate weight scheme is a family of weights \(\{ w^{(j)} \in \mathbb R_+^m : j=j_{\min},\ldots,j_{\max} \}\) such that

    • $w^{(j)}$ is an $\alpha$-approximate weight at scale $2^j$ for each $j=j_{\min}, \ldots, j_{\max}$.

    • For all $j_{\min} \leqslant j \leqslant j_{\max} - 1$,

      $$\begin{equation}\label{eq:weight-scheme} w_i^{(j)} \leqslant\alpha w_i^{(j+1)}\,,\quad i=1,\ldots,m\,.\end{equation}$$

    Define the functions

    $$\begin{aligned} \tau(w)_i &\mathrel{\mathop:}=\langle a_i, (A^{\top} W A)^{-1} a_i\rangle \\ \psi_i(z) &\mathrel{\mathop:}=\frac{\varphi_i(\sqrt{z})}{z} \\ (\Lambda_j(w))_i &\mathrel{\mathop:}=2^{-j} \psi_i(\tau(w)_i)\,. \end{aligned}$$

    In lecture, we argued that (under the growth bounds in Problem 3) for some $0 < \delta < 1$ and any $w,w' \in \mathbb R_+^m$, we have

    $$d(\Lambda_j(w),\Lambda_j(w')) \leqslant\delta\, d(w,w')\,,$$

    where \(d(w,w') \mathrel{\mathop:}=\max \left\{ \abs{\log \frac{w_i}{w'_i}} : i=1,\ldots,m\right\}\).

    1. Define $w^{(j_{\min})} \mathrel{\mathop:}=\Lambda_{j_{\min}}^{k}(1,\ldots,1)$, where $\Lambda_j^k = \Lambda_j \circ \Lambda_j \circ \cdots \circ \Lambda_j$ is the $k$-fold composition of $\Lambda_j$.

      Show that we can choose $k$ large enough so that

      $$\begin{equation}\label{eq:jmin} d(w^{(j_{\min})}, \Lambda_{j_{\min}}(w^{(j_{\min})})) \leqslant\frac{2}{1-\delta}\,. \end{equation}$$
    2. Choose $k$ large enough so that \eqref{eq:jmin} holds and take this as the definition of $w^{(j_{\min})}$.

      Show that $w^{(j_{\min})}$ is an $\alpha$-approximate weight at scale $2^{j_{\min}}$ for some $\alpha = \alpha(\delta)$.

    3. For $j = j_{\min}+1,j_{\min}+2,\ldots,j_{\max}$, define inductively the weights

      $$w^{(j)} \mathrel{\mathop:}=\Lambda_{j}(w^{(j-1)})\,.$$

      Prove that \(\{ w^{(j)} : j_{\min} \leqslant j \leqslant j_{\max} \}\) is an $\alpha$-approximate weight scheme for some $\alpha > 0$ (possibly different from the $\alpha$ you computed in the previous part).

      Note that you must show both that each $w^{(j)}$ is an $\alpha$-approximate weight at scale $2^j$ and also that \eqref{eq:weight-scheme} holds. It will help to note that $\Lambda_{j}(w) = \frac12 \Lambda_{j-1}(w)$ for all $j$ and all $w \in \mathbb R_+^m$.
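
    To make the construction in parts 1-3 concrete, here is a Python sketch of the whole scheme on an assumed instance (random rows and $\varphi_i(z)=\min(z^2,|z|)$): it iterates $\Lambda_{j_{\min}}$ from the all-ones weight, then applies $\Lambda_j$ once per scale, and finally prints the ratios from the definition of an approximate weight as a sanity check. It illustrates the procedure only; the problem asks for the accompanying proofs.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed instance: random rows and phi_i(z) = min(z^2, |z|) (growth bounds hold with theta = 1).
    m, n = 200, 10
    A = rng.standard_normal((m, n))

    def phi(z):
        return np.minimum(z ** 2, np.abs(z))

    def tau(w):
        """tau(w)_i = <a_i, (A^T W A)^{-1} a_i>."""
        M_w = A.T @ (w[:, None] * A)
        return np.einsum('ij,ij->i', A @ np.linalg.inv(M_w), A)

    def Lam(j, w):
        """(Lambda_j(w))_i = 2^{-j} phi_i(sqrt(tau(w)_i)) / tau(w)_i."""
        t = tau(w)
        return 2.0 ** (-j) * phi(np.sqrt(t)) / t

    j_min, j_max, k = -5, 5, 50

    # Part 1: iterate Lambda_{j_min} from the all-ones weight; the contraction makes it stabilize.
    w = np.ones(m)
    for _ in range(k):
        w = Lam(j_min, w)
    weights = {j_min: w}

    # Part 3: one application of Lambda_j per scale.
    for j in range(j_min + 1, j_max + 1):
        weights[j] = Lam(j, weights[j - 1])

    # Sanity check: at each scale, the ratios from the approximate-weight definition,
    # divided by 2^j, should lie in an interval [1/alpha, alpha] around 1.
    for j in (j_min, 0, j_max):
        t = tau(weights[j])
        r = phi(np.sqrt(t)) / (weights[j] * t * 2.0 ** j)
        print(j, float(r.min()), float(r.max()))
    ```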