Spectral sparsification via Itô
These are preliminary notes for analyzing the Itô process.
Reis and Rothvoss gave an algorithm
for finding BSS sparsifiers that uses
a discretized random walk. It appears conceptually simpler to
use a modified Brownian motion directly.
Consider symmetric matrices $A_1,\ldots,A_m \succeq 0$ with
$|A_1|+\cdots+|A_m| \preceq \mathbb{I}_n$. Define a stochastic
process $X_t = (X_t^1,\ldots,X_t^m)$ by $X_0 = 0$, and
$$dX_t = \mathbb{I}_{\mathcal J_t}\, dB_t\,,$$
where \(\{B_t\}\) is a standard Brownian motion in \(\mathbb R^m\),
\(\mathbb{I}_{\mathcal J_t}\) denotes the diagonal projection onto the coordinates
in \(\mathcal J_t\), and \(\mathcal J_t \subseteq [m]\) is the set of good
coordinates at time \(t\), to be defined below.
Define the stopping times
\(T_i \mathrel{\mathop:}=\inf \{ t > 0 : |X_t^i| \geqslant 1 \}\) and the
set of stopped coordinates
$$\mathcal S_t \mathrel{\mathop:}=\{ i \in [m] : T_i \leqslant t \}\,.$$
We will ensure that \(\mathcal J_t \cap \mathcal S_t = \emptyset\) so that
\(|X_t^1|,\ldots,|X_t^m| \leqslant 1\) holds for all times
\(t \geqslant 0\).
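In discrete time, one step of this process is just a Gaussian increment on the driven coordinates. The following is a minimal numerical sketch (NumPy and the step size `dt` are my own illustrative choices); here the driven set is simply the complement of the frozen coordinates, whereas the actual process also excludes the bad coordinates \(\mathcal I_t\) defined later.

```python
import numpy as np

def step(X, frozen, dt, rng):
    """One Euler-Maruyama step of dX_t = I_{J_t} dB_t: add a Gaussian
    increment to every coordinate that is not frozen, then freeze (and
    clamp to +-1) any coordinate that has reached |X^i| >= 1."""
    dB = rng.normal(scale=np.sqrt(dt), size=len(X))
    X = X + np.where(frozen, 0.0, dB)
    hit = np.abs(X) >= 1.0
    X = np.where(hit, np.sign(X), X)
    return X, frozen | hit
```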
Denote
\(T \mathrel{\mathop:}=\sup \{ t > 0 : |\mathcal S_t| \leqslant m/2 \}\).
Our goal is to show that with good probability,
$$\label{eq:goal}
\sum_{i=1}^m X_T^i A_i \preceq C \left(\frac{n}{m}\right)^{1/2} \mathbb{I}_n\,.$$
Since the stopping time \(T\) entails that \(m/2\) coordinates have been
fixed to \(\pm 1\), this completes the argument: at least \(m/4\) of the
fixed coordinates share the same sign, so at least one of the two matrices
\(\sum_{i=1}^m (1+X_T^i) A_i\) or \(\sum_{i=1}^m (1-X_T^i) A_i\) has at most
\(3m/4\) non-zero coefficients, and
$$\sum_{i=1}^m (1 \pm X_T^i) A_i \preceq \left(1 + C \left(\frac{n}{m}\right)^{1/2}\right) \mathbb{I}_n\,.$$
Repeatedly reducing \(m\) by a constant factor until \(m \asymp n/\varepsilon^2\)
yields a \(1 \pm \varepsilon\) approximation with \(O(n/\varepsilon^2)\) sparsity. To
get a two-sided bound, replace \(A_i\) by
\(\begin{pmatrix} A_i & 0 \\ 0 & - A_i\end{pmatrix}\).
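For concreteness, the block embedding used for the two-sided bound can be written as a small helper; this is only an illustrative sketch (NumPy assumed, helper name mine).

```python
import numpy as np

def two_sided(As):
    """Replace each symmetric A_i by diag(A_i, -A_i), so a one-sided bound
    on sum_i x_i diag(A_i, -A_i) controls sum_i x_i A_i from both sides."""
    n = As[0].shape[0]
    out = []
    for A in As:
        B = np.zeros((2 * n, 2 * n))
        B[:n, :n], B[n:, n:] = A, -A
        out.append(B)
    return out
```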
To this end, let us define, for parameters $\theta,\lambda > 0$ to be
chosen later,
$$\begin{aligned}
U(x) &\mathrel{\mathop:}=\theta \mathbb{I}_n + \lambda \|x\|^2 \mathbb{I}_n - \sum_{i=1}^m x_i A_i \\
\Phi(x) &\mathrel{\mathop:}=\mathop{\mathrm{tr}}(U(x)^{-1})\,.\end{aligned}$$
Define $U_t \mathrel{\mathop:}=U(X_t)$. Note that
$\Phi(X_0)=\Phi(0) = n/\theta$ and $U_0 \succ 0$. The idea is that if
$\{ \Phi(X_t) : t \in [0,T] \}$ remains bounded, then $U_t \succ 0$ holds
for $t \in [0,T]$ (for the smallest eigenvalue of $U_t$ to reach $0$ along a
continuous path, $\Phi(X_t)$ would first have to blow up), and therefore at
time $T$,
$$\sum_{i=1}^m X_T^i A_i \preceq (\theta + \lambda \|X_T\|^2) \mathbb{I}_n \preceq (\theta + \lambda m)\mathbb{I}_n\,,$$
where the last inequality uses $|X_T^i| \leqslant 1$ for every $i$.
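As a concrete reference point, here is a small NumPy sketch of the barrier matrix $U(x)$ and the potential $\Phi(x)$ (the helper names `U_of` and `Phi` are mine); note that `Phi(np.zeros(m), As, theta, lam)` equals $n/\theta$, matching $\Phi(X_0) = n/\theta$.

```python
import numpy as np

def U_of(x, As, theta, lam):
    """U(x) = theta*I + lam*||x||^2*I - sum_i x_i A_i."""
    n = As[0].shape[0]
    return (theta + lam * np.dot(x, x)) * np.eye(n) - sum(
        xi * Ai for xi, Ai in zip(x, As))

def Phi(x, As, theta, lam):
    """Phi(x) = tr(U(x)^{-1}), taken to be +inf once U(x) loses positivity."""
    U = U_of(x, As, theta, lam)
    if np.linalg.eigvalsh(U)[0] <= 0:
        return np.inf
    return np.trace(np.linalg.inv(U))
```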
From Itô’s lemma, we know that if $dX_t = \Sigma_t dB_t$ and
$f : \mathbb R^m \to \mathbb R$, then
$$d f(X_t) = \langle \nabla f(X_t), dX_t\rangle + \frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) dt\,.$$
Since $\Sigma_t = \mathbb{I}_{\mathcal J_t}$,
$$\frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) dt
= \frac12 \sum_{i \in \mathcal J_t} \partial_{x_i}^2 f(X_t)\, dt\,.$$
Now denote $f(x) \mathrel{\mathop:}=\mathop{\mathrm{tr}}(U(x)^{-1})$ and
calculate:
$$\begin{aligned}
\partial_{x_i} f(x) &= - 2 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-2}) + \mathop{\mathrm{tr}}(U(x)^{-2} A_i) \\
\partial_{x_j} \partial_{x_i} f(x) &= - 2 \lambda \mathbf{1}_{\{i=j\}} \mathop{\mathrm{tr}}(U(x)^{-2}) + 8 \lambda^2 x_i x_j \mathop{\mathrm{tr}}(U(x)^{-3}) - 4 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-3} A_j) \\
&\quad + \partial_{x_j} \mathop{\mathrm{tr}}(U(x)^{-2} A_i) \\
&= - 2 \lambda \mathbf{1}_{\{i=j\}} \mathop{\mathrm{tr}}(U(x)^{-2}) + 8 \lambda^2 x_i x_j \mathop{\mathrm{tr}}(U(x)^{-3}) - 4 \lambda x_i \mathop{\mathrm{tr}}(U(x)^{-3} A_j) \\
&\quad - 4 \lambda x_j \mathop{\mathrm{tr}}(U(x)^{-3} A_i) + 2 \mathop{\mathrm{tr}}\left(U(x)^{-1} A_i U(x)^{-1} A_j U(x)^{-1}\right).\end{aligned}$$
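These formulas are easy to sanity-check against finite differences of $\Phi$; the sketch below does this for the gradient, reusing the hypothetical `U_of` and `Phi` helpers from above on a small random instance (all parameter values are arbitrary). The Hessian formula can be checked the same way with second differences.

```python
import numpy as np

def grad_formula(x, As, theta, lam):
    """Closed-form gradient: d_i f(x) = -2*lam*x_i*tr(U^-2) + tr(U^-2 A_i)."""
    Ui = np.linalg.inv(U_of(x, As, theta, lam))
    U2 = Ui @ Ui
    return np.array([-2 * lam * xi * np.trace(U2) + np.trace(U2 @ Ai)
                     for xi, Ai in zip(x, As)])

rng = np.random.default_rng(0)
n, m, theta, lam, eps = 3, 5, 1.0, 0.1, 1e-6
As = [(G + G.T) / (10 * m) for G in rng.normal(size=(m, n, n))]  # small symmetric A_i
x = rng.uniform(-0.5, 0.5, size=m)
# Central finite differences of Phi in each coordinate direction.
num = np.array([(Phi(x + eps * e, As, theta, lam) - Phi(x - eps * e, As, theta, lam))
                / (2 * eps) for e in np.eye(m)])
assert np.allclose(num, grad_formula(x, As, theta, lam), atol=1e-4)
```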
Substituting these derivatives into Itô’s formula gives
$$\begin{aligned}
d \Phi(X_t) = d f(X_t) &= {\color{Blue}{- 2 \lambda \langle X_t, d X_t\rangle \mathop{\mathrm{tr}}(U_t^{-2}) + \sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-2} A_i) dX_t^i}}
+ \frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right)\,dt \\
\frac12 \mathop{\mathrm{tr}}\left(\nabla^2 f(X_t) \Sigma_t \Sigma_t^{\top}\right) &=
{\color{Maroon}{- \lambda \mathop{\mathrm{tr}}(U_t^{-2}) |\mathcal J_t|}} - {\color{ForestGreen}{4 \lambda \sum_{i \in \mathcal J_t} X_t^i \mathop{\mathrm{tr}}\left(U_t^{-3} A_i\right)}}
+ {\color{ForestGreen}{4 \lambda^2 \|\mathbb{I}_{\mathcal J_t} X_t\|^2 \mathop{\mathrm{tr}}(U_t^{-3})}} \\
&\quad {\color{Maroon}{+ \sum_{i \in \mathcal J_t} \mathop{\mathrm{tr}}\left(U_t^{-1} A_i U_t^{-1} A_i U_t^{-1}\right)}}\,.\end{aligned}$$
The blue terms are the martingale part, which we ignore for the moment. The
green terms are error terms. The two maroon terms need to be balanced by
choosing $\theta,\lambda$ appropriately.
By von Neumann's trace inequality, for any $A,B$ with singular values
$\sigma_1(A) \geqslant\cdots \geqslant\sigma_n(A)$ and
$\sigma_1(B) \geqslant\cdots \geqslant\sigma_n(B)$, we have
$$\mathop{\mathrm{tr}}(A B) \leqslant\sum_{i=1}^n \sigma_i(A) \sigma_i(B)\,.$$
In particular, if $A,B \succeq 0$, then
$\mathop{\mathrm{tr}}(AB) \leqslant\mathop{\mathrm{tr}}(A) \mathop{\mathrm{tr}}(B)$.
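A quick numerical check of this inequality on random symmetric matrices (purely illustrative):

```python
import numpy as np

# Check tr(AB) <= sum_i sigma_i(A) sigma_i(B) on random symmetric matrices.
rng = np.random.default_rng(1)
for _ in range(1000):
    A = rng.normal(size=(4, 4)); A = (A + A.T) / 2
    B = rng.normal(size=(4, 4)); B = (B + B.T) / 2
    sA = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    sB = np.linalg.svd(B, compute_uv=False)
    assert np.trace(A @ B) <= sA @ sB + 1e-9
```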
Let us use this inequality to bound the green error terms:
$$\begin{aligned}
{\color{ForestGreen}{4 \lambda^2 \|\mathbb{I}_{\mathcal J_t} X_t\|^2 \mathop{\mathrm{tr}}(U_t^{-3})}} &\leqslant 4 \lambda^2 m \mathop{\mathrm{tr}}(U_t^{-2}) \mathop{\mathrm{tr}}(U_t^{-1}) \\
{\color{ForestGreen}{4 \lambda \sum_{i \in \mathcal J_t} X_t^i \mathop{\mathrm{tr}}(U_t^{-3} A_i)}} &\leqslant 4 \lambda (\theta+\lambda m) \mathop{\mathrm{tr}}(U_t^{-2}) \mathop{\mathrm{tr}}(U_t^{-1})\,.\end{aligned}$$
The same fact also yields, for any $U \succ 0$ and symmetric $V$,
$$\mathop{\mathrm{tr}}\left(U^{-1} V U^{-1} V U^{-1}\right) \leqslant\mathop{\mathrm{tr}}(U^{-2} |V|) \mathop{\mathrm{tr}}(U^{-1} |V|)\,.$$
Accordingly, we define the set of bad coordinates:
$$\mathcal I_t \mathrel{\mathop:}=\left\{ i \in [m] : \mathop{\mathrm{tr}}(U_t^{-1} |A_i|) > \frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \right\}\,,$$
and finally
$\mathcal J_t \mathrel{\mathop:}=[m] \setminus (\mathcal I_t \cup \mathcal S_t)$.
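In a discretized implementation, the bad set \(\mathcal I_t\) and the good set \(\mathcal J_t\) can be computed directly from \(U_t\); a hedged sketch (helper names mine, NumPy assumed), where `mat_abs` computes the matrix absolute value \(|A_i|\):

```python
import numpy as np

def mat_abs(V):
    """Matrix absolute value |V|, via the eigendecomposition of symmetric V."""
    w, Q = np.linalg.eigh(V)
    return (Q * np.abs(w)) @ Q.T

def good_coordinates(U, As, stopped, gamma):
    """Boolean mask of the good set J_t: keep coordinate i unless it is
    already stopped or tr(U^-1 |A_i|) > (gamma/m) tr(U^-1)."""
    Ui = np.linalg.inv(U)
    thresh = gamma / len(As) * np.trace(Ui)
    bad = np.array([np.trace(Ui @ mat_abs(A)) > thresh for A in As])
    return ~(bad | stopped)
```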
Observe that
$$\sum_{i \in \mathcal J_t} \mathop{\mathrm{tr}}\left(U_t^{-1} A_i U_t^{-1} A_i U_t^{-1}\right) \leqslant\frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-2} |A_i|) \leqslant\frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1}) \mathop{\mathrm{tr}}(U_t^{-2})\,,$$
where the last inequality uses
$|A_1|+\cdots+|A_m| \preceq \mathbb{I}_n$.
The point is to maintain
$\Phi(X_t) \leqslant 2 \Phi(X_0) = 2 n/\theta$ by showing that the drift of
$\Phi(X_t)$ is non-positive whenever $\Phi(X_t) \leqslant 2n/\theta$. Ignoring
the martingale terms and the green error terms (which one expects to be of
lower order in the regime $m \gg n$), we calculate:
$$\begin{aligned}
d \Phi(X_t) &\leqslant\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{\gamma}{m} \mathop{\mathrm{tr}}(U_t^{-1})\right) dt \\
&=\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{\gamma}{m} \Phi(X_t)\right) dt \\
&\leqslant\mathop{\mathrm{tr}}(U_t^{-2}) \left(- \lambda |\mathcal J_t| + \frac{2 \gamma n}{\theta m}\right) dt\,.\end{aligned}$$
For $\gamma \mathrel{\mathop:}=8$, we have $|\mathcal I_t| \leqslant m/\gamma$ (since
$\sum_{i=1}^m \mathop{\mathrm{tr}}(U_t^{-1} |A_i|) \leqslant\mathop{\mathrm{tr}}(U_t^{-1})$), hence for $t \leqslant T$,
$|\mathcal J_t| \geqslant m - |\mathcal I_t| - |\mathcal S_t| \geqslant\frac{m}{2} - \frac{m}{\gamma} \geqslant\frac{m}{4}$.
So let us take
$$\theta \mathrel{\mathop:}=4 \left(\frac{n}{m}\right)^{1/2},\ \lambda \mathrel{\mathop:}=\frac{4}{m} \theta\,,$$
so that $\lambda |\mathcal J_t| \geqslant\lambda \frac{m}{4} = \theta = \frac{16 n}{\theta m} = \frac{2\gamma n}{\theta m}$, and the drift above is non-positive.
If we maintain $\max_{t \in [0,T]} \Phi(X_t) \leqslant 2n/\theta$, then
$U_T \succ 0$, so
$$\sum_{i=1}^m X_T^i A_i \preceq \left(\theta + \lambda \|X_T\|^2\right) \mathbb{I}_n \preceq 5 \theta \mathbb{I}_n \asymp \left(\frac{n}{m}\right)^{1/2} \mathbb{I}_n\,,$$
completing the proof.
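Finally, the pieces above can be assembled into an end-to-end toy simulation of the discretized process. This is only a sanity-check sketch (it reuses the hypothetical `U_of` and `good_coordinates` helpers from earlier, and the step size, instance, and seed are arbitrary), not the algorithm of Reis and Rothvoss.

```python
import numpy as np

def simulate(As, dt=1e-3, gamma=8.0, seed=0):
    """Run the discretized process until at least m/2 coordinates have been
    frozen at +-1; returns X_T and theta."""
    rng = np.random.default_rng(seed)
    m, n = len(As), As[0].shape[0]
    theta = 4.0 * np.sqrt(n / m)
    lam = 4.0 * theta / m
    X = np.zeros(m)
    stopped = np.zeros(m, dtype=bool)
    while stopped.sum() < m / 2:
        U = U_of(X, As, theta, lam)
        drive = good_coordinates(U, As, stopped, gamma)   # the set J_t
        dB = rng.normal(scale=np.sqrt(dt), size=m)
        X = X + np.where(drive, dB, 0.0)
        hit = np.abs(X) >= 1.0
        X, stopped = np.where(hit, np.sign(X), X), stopped | hit
    return X, theta

# Toy instance: A_i = v_i v_i^T / m for random unit vectors, so sum_i |A_i| <= I.
rng = np.random.default_rng(3)
n, m = 4, 200
V = rng.normal(size=(m, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
As = [np.outer(v, v) / m for v in V]
X_T, theta = simulate(As)
S = sum(x * A for x, A in zip(X_T, As))
print(np.linalg.eigvalsh(S).max(), "should be at most", 5 * theta, "with good probability")
```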