Recall that we have a function \(F : \R^n \to \R\) written as \(F(x) = f_1(x) + \cdots + f_m(x)\), and we want to sparsify it by sampling the terms in proportion to a probability vector \(\rho \in \R_+^m\) (the “importance scores”).
One way to analyze this is to first make sure all the probabilities are uniformly small, say \(\rho_1,\ldots,\rho_m \leq 2/m\). Since we are trying to sparsify down to a number of terms that is independent of \(m\), we can always achieve this by simply splitting a term
\[f_i(x) \to f_i(x)/s + f_i(x)/s + \cdots + f_i(x)/s\,,\]where there are \(s\) summands.
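For instance, here is one concrete way (not the only one) to do the splitting: divide each \(f_i\) into \(s_i \seteq \lceil m \rho_i \rceil\) equal pieces, assigning each piece the importance score \(\rho_i/s_i \leq 1/m\). Using \(\rho_1 + \cdots + \rho_m = 1\), the total number of terms becomes
\[m' = s_1 + \cdots + s_m \leq \sum_{i=1}^m \left(m \rho_i + 1\right) = 2m\,,\]so every new score is at most \(1/m \leq 2/m'\).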
Then one can sparsify in phases by deleting every term independently with probability \(1/2\) and doubling the weight of each surviving term. This experiment can be modeled with random \(\pm 1\) signs in the natural way:
\[\E_{\e} \max_{x \in \Omega} \left|F(x) - \sum_{j=1}^m (1+\e_j) f_j(x)\right| = \E_{\e} \max_{x \in \Omega} \left|\sum_{j=1}^m \e_j f_j(x)\right|,\]where \(\e_1,\ldots,\e_m\) are i.i.d. uniform random \(\pm 1\) variables, and \(\Omega \subseteq \R^n\) is our domain of interest.
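As a quick sanity check, here is a minimal numerical sketch of one halving phase (the quadratic family \(f_j(x) = \langle a_j, x\rangle^2\) is made up purely for illustration), verifying the pointwise identity behind the display above:

```python
import numpy as np

# Minimal sketch (hypothetical family f_j(x) = <a_j, x>^2): delete each
# term with probability 1/2, double the survivors, and check that
#   F(x) - sum_j (1 + eps_j) f_j(x) = -sum_j eps_j f_j(x)
# holds pointwise.
rng = np.random.default_rng(0)
n, m = 5, 40
A = rng.normal(size=(m, n))          # rows a_j define f_j(x) = <a_j, x>^2

def terms(x):
    return (A @ x) ** 2              # the vector (f_1(x), ..., f_m(x))

eps = rng.choice([-1, 1], size=m)    # eps_j = -1: delete f_j;  +1: keep with weight 2

for _ in range(3):                   # a few random test points
    x = rng.normal(size=n)
    f = terms(x)
    F = f.sum()                                  # F(x)
    sparsified = ((1 + eps) * f).sum()           # result of one halving phase
    assert np.isclose(F - sparsified, -(eps * f).sum())
```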
It turns out that it is possible to do something similar and reduce the analysis of one-shot sampling to the analysis of randomly signed sums of the functions. This makes for both a simpler algorithm, and a more robust method of analysis that will be very useful later.
Let’s consider a more general situation. Consider a family of independent random functions \(\mathbf{g}_1,\ldots,\mathbf{g}_M : \R^n \to \R\) that are sampled from some distribution. (Recall that we are thinking about each one as being chosen from a finite family of functions and, in our case, it even holds that they are i.i.d.).
Define \(\mathbf{G}(x) \seteq \mathbf{g}_1(x) + \cdots + \mathbf{g}_M(x)\) and \(F(x) \seteq \E[\mathbf{G}(x)]\). We are interested in the quantity
\[\begin{equation}\label{eq:Esup} \mathcal{S} \seteq \E \max_{x \in \Omega} \left|F(x)-\mathbf{G}(x)\right| = \E \max_{x \in \Omega} \left|F(x) - \left(\mathbf{g}_1(x)+\cdots+\mathbf{g}_M(x)\right)\right|\,, \end{equation}\]where \(\Omega \subseteq \R^n\).
For our particular case of interest, let us take \(\Omega \seteq \{ x \in \R^n : F(x) \leq 1 \}\).
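For example, spelling out the instantiation we have in mind: sampling terms in proportion to \(\rho\) corresponds to drawing, independently for each \(j\), an index \(\mathbf{i}_j \in \{1,\ldots,m\}\) with \(\P[\mathbf{i}_j = i] = \rho_i\), and setting \(\mathbf{g}_j \seteq f_{\mathbf{i}_j}/(M \rho_{\mathbf{i}_j})\). Then
\[\E[\mathbf{g}_j(x)] = \sum_{i=1}^m \rho_i \,\frac{f_i(x)}{M \rho_i} = \frac{F(x)}{M}\,,\]so indeed \(F(x) = \E[\mathbf{G}(x)]\).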
Lemma: Suppose it holds that for every choice \(g_1,\ldots,g_M\) of functions from the support of \(\mathbf{g}_1,\ldots,\mathbf{g}_M\), we have
\[\begin{equation}\label{eq:rad} \E_{\mathbf{\e}} \max_{x \in \Omega} \left|\sum_{j=1}^M \e_j g_j(x)\right| \leq \delta \left(\max_{x \in \Omega} G(x)\right)^{1/2}, \end{equation}\]where \(G(x) \seteq g_1(x) + \cdots + g_M(x)\), the signs \(\e_1,\ldots,\e_M\) are i.i.d. uniform \(\pm 1\) random variables, and \(\delta < 1\). Then we have \(\mathcal{S} \lesssim \delta\).
Proof: Note that if \(\tilde{\mathbf{g}}_1, \ldots, \tilde{\mathbf{g}}_M\) is an independent family of functions from the same distribution, then clearly \(F(x) = \E\left[\tilde{\mathbf{g}}_1(x) + \cdots + \tilde{\mathbf{g}}_M(x)\right]\) as well, so we can write
\[\begin{align*} \E \max_{x \in \Omega} & \left|F(x) - \left(\mathbf{g}_1(x)+\cdots+\mathbf{g}_M(x)\right)\right| \\ &= \E \max_{x \in \Omega} \left|\E \left[\tilde{\mathbf{g}}_1(x)+\cdots+\tilde{\mathbf{g}}_M(x)\right] - \left(\mathbf{g}_1(x)+\cdots+\mathbf{g}_M(x)\right)\right| \\ &\leq \E \max_{x \in \Omega} \left|\left(\tilde{\mathbf{g}}_1(x)+\cdots+\tilde{\mathbf{g}}_M(x)\right) - \left(\mathbf{g}_1(x)+\cdots+\mathbf{g}_M(x)\right)\right|, \end{align*}\]where we have pulled out the expectation using convexity of the absolute value and the maximum.
Write the latter quantity as
\[\begin{align*} \E \max_{x \in \Omega} \left|\left(\tilde{\mathbf{g}}_1(x)-\mathbf{g}_1(x)\right)+\cdots+\left(\tilde{\mathbf{g}}_M(x) - \mathbf{g}_M(x)\right)\right|, \end{align*}\]and note that since \(\mathbf{g}_i(x) - \tilde{\mathbf{g}}_i(x)\) and \(\tilde{\mathbf{g}}_i(x)-\mathbf{g}_i(x)\) have the same distribution, we can write this as
\[\begin{align*} \E \max_{x \in \Omega} \left|\e_1 \left(\tilde{\mathbf{g}}_1(x)-\mathbf{g}_1(x)\right)+\cdots+\e_M \left(\tilde{\mathbf{g}}_M(x) - \mathbf{g}_M(x)\right)\right|, \end{align*}\]for any fixed choice of signs \(\e_1,\ldots,\e_M \in \{-1,1\}\).
Since this equality holds for every fixed choice of signs, we may average over a uniformly random choice \(\mathbf{\e}_1,\ldots,\mathbf{\e}_M\), and the latter expression equals
\[\begin{align*} \E \E_{\mathbf{\e}} \max_{x \in \Omega} \left|\e_1 \left(\tilde{\mathbf{g}}_1(x)-\mathbf{g}_1(x)\right)+\cdots+\e_M \left(\tilde{\mathbf{g}}_M(x) - \mathbf{g}_M(x)\right)\right|, \end{align*}\]where we use \(\E_{\e}\) to denote expectation over only the random signs.
Now we ignore the outer expectation for a moment and use the triangle inequality to write
\[\begin{align*} \E_{\mathbf{\e}} &\max_{x \in \Omega} \left|\e_1 \left(\tilde{g}_1(x)-{g}_1(x)\right)+\cdots+\e_M \left(\tilde{g}_M(x) - {g}_M(x)\right)\right| \\ &\leq \E_{\mathbf{\e}} \max_{x \in \Omega} \left|\e_1 {g}_1(x) + \cdots + \e_M {g}_M(x)\right| + \E_{\mathbf{\e}} \max_{x \in \Omega} \left|\e_1 \tilde{g}_1(x) + \cdots + \e_M \tilde{g}_M(x)\right| \\ &\leq \delta \left(\max_{x \in \Omega} G(x)\right)^{1/2} + \delta \left(\max_{x \in \Omega} \tilde{G}(x)\right)^{1/2}\,, \end{align*}\]where \(\tilde{G}(x) \seteq \tilde{g}_1(x) + \cdots + \tilde{g}_M(x)\), and the last inequality applies the assumption \eqref{eq:rad} to each of the two sums.
Taking expectations over the random functions, and noting that the random sums \(\mathbf{G} = \mathbf{g}_1 + \cdots + \mathbf{g}_M\) and \(\tilde{\mathbf{g}}_1 + \cdots + \tilde{\mathbf{g}}_M\) have the same distribution, we have therefore proved that
\[\mathcal{S} \leq 2 \delta \E\left[\left(\max_{x \in \Omega} \mathbf{G}(x)\right)^{1/2}\right].\]Since \(F(x) \leq 1\) for \(x \in \Omega\), we have
\[\max_{x \in \Omega} \mathbf{G}(x) \leq 1 + \max_{x \in \Omega} \left|F(x) - \mathbf{G}(x)\right|.\]Given the above discussion and recalling the definition in \eqref{eq:Esup}, if \eqref{eq:rad} holds for every choice of \(g_1,\ldots,g_M\), this yields the bound
\[\begin{align*} \mathcal{S} \leq 2\delta\, \E \left(1 + \max_{x \in \Omega} \left|F(x)-\mathbf{G}(x)\right|\right)^{1/2} \leq 2\delta \left(1+\mathcal{S}\right)^{1/2}\,, \end{align*}\]where the last inequality uses concavity of the square root to move the expectation inside. For \(\delta < 1\), this tells us that \(\mathcal{S} \leq O(\delta)\).
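In case the last step is unclear: squaring both sides of \(\mathcal{S} \leq 2\delta (1+\mathcal{S})^{1/2}\) gives
\[\mathcal{S}^2 \leq 4\delta^2 (1+\mathcal{S}) \quad\implies\quad \mathcal{S} \leq 2\delta^2 + 2\delta \sqrt{1+\delta^2} \leq (2+2\sqrt{2})\,\delta\,,\]where the last inequality uses \(\delta < 1\).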
This completes the symmetrization argument, allowing us to focus on proving bounds like \eqref{eq:rad}.
The advantage here lies in the fact that \(\left \{ \e_1 g_1(x) + \e_2 g_2(x) + \cdots + \e_M g_M(x) : x \in \R^n \right\}\) is an example of a subgaussian process, and bounding the expected maximum of a subgaussian process has a long and rich history, along with a correspondingly powerful framework.
The (mild) disadvantage is that we require the inequality \eqref{eq:rad} to hold for every choice of functions \(g_1,\ldots,g_M\) in the support of our distribution.
(Note: The confusing terminology “process” is a historical artifact. Originally, one was indeed interested in \(\e_1 g_1(x)\), \(\e_1 g_1(x) + \e_2 g_2(x)\), etc. as a sequence “evolving in time.” If one switches from discrete to continuous time, e.g., a process like Brownian motion, then control of the expected maximum is closely related to almost sure continuity of the sample paths.)
Consider a collection of random variables \(\{ X_t : t \in T \}\) where the index set \(T\) is equipped with a distance \(d\). The family is called subgaussian (with respect to \(d\)) if there is a constant \(c > 0\) such that
\[\begin{equation}\label{eq:subgaussian} \P\left[|X_s-X_t| > \lambda\,d(s,t)\right] \leq 2 e^{- c\lambda^2}\,,\quad \forall \lambda > 0, s,t \in T\,. \end{equation}\]

Gaussian processes. The canonical example occurs when \(\{X_t : t \in T\}\) is a jointly Gaussian family of random variables, in which case \eqref{eq:subgaussian} holds with \(c=1/2\) and \(d(s,t) = (\E |X_s-X_t|^2)^{1/2}\).
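Indeed, in the centered case (the one relevant for us), \(X_s - X_t\) is a mean-zero Gaussian with standard deviation \(d(s,t)\), so writing \(X_s - X_t = d(s,t)\, Z\) with \(Z \sim N(0,1)\) and applying the standard Gaussian tail bound \(\P[|Z| > \lambda] \leq 2e^{-\lambda^2/2}\) gives \eqref{eq:subgaussian} with \(c = 1/2\).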
Note that if the index set \(T\) is finite, then such a family (in the centered case) can always be realized in the following way: Let \(g_1,g_2,\ldots,g_m\) be i.i.d. \(N(0,1)\) random variables and define \(X_t = t_1 g_1 + t_2 g_2 + \cdots + t_m g_m\) for \(t \in T \subseteq \R^m\).
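Here is a small numerical sketch of this realization (the index vectors are made up for illustration), checking that the distance \(d(s,t) = (\E|X_s-X_t|^2)^{1/2}\) is just the Euclidean distance \(\|s-t\|_2\):

```python
import numpy as np

# Sketch: realize a finite centered Gaussian family as X_t = <t, g> with
# g = (g_1, ..., g_m) i.i.d. N(0,1), and check empirically that
# E|X_s - X_t|^2 = ||s - t||_2^2.  The indices s, t are hypothetical.
rng = np.random.default_rng(1)
m, trials = 6, 200_000
s, t = rng.normal(size=m), rng.normal(size=m)   # two index vectors in R^m
G = rng.normal(size=(trials, m))                # each row is one draw of (g_1, ..., g_m)

empirical = np.mean((G @ s - G @ t) ** 2)       # Monte Carlo estimate of E|X_s - X_t|^2
exact = np.sum((s - t) ** 2)                    # ||s - t||_2^2
print(empirical, exact)                         # agree up to sampling error
```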
Bernoulli processes. Suppose that \(\e_1,\ldots,\e_m\) are i.i.d. uniform \(\pm 1\) random variables and for \(t \in T \subseteq \R^m\), we define \(X_t \seteq \e_1 t_1 + \e_2 t_2 + \cdots + \e_m t_m\). Then \(\{ X_t : t \in T \}\) is a family of subgaussian random variables with the same parameters: \(c=1/2\) and \(d(s,t) = (\E |X_s-X_t|^2)^{1/2}\).
Many proofs are known. Even proofs of significant generalizations can be made quite short. See, e.g., the proof of Theorem 1.3 here.
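A similar sketch for the Bernoulli case (again with a made-up coefficient vector) compares the empirical tail of \(X_t = \e_1 t_1 + \cdots + \e_m t_m\) against the bound \(2e^{-\lambda^2/2}\), taking \(s = 0\) so that \(d(s,t) = \|t\|_2\):

```python
import numpy as np

# Sketch: empirical tail of a Bernoulli linear form X_t = sum_i eps_i t_i
# (hypothetical coefficient vector t) versus the subgaussian bound
# 2*exp(-lambda^2/2) with distance d = ||t||_2 (comparing X_t to X_0 = 0).
rng = np.random.default_rng(2)
m, trials = 50, 100_000
t = rng.normal(size=m)
d = np.linalg.norm(t)

eps = rng.choice([-1.0, 1.0], size=(trials, m))   # i.i.d. uniform signs
X = eps @ t                                       # samples of X_t

for lam in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X) > lam * d)
    bound = 2 * np.exp(-lam ** 2 / 2)
    print(f"lambda={lam}: empirical tail {empirical:.4f}, bound {bound:.4f}")
```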
The quantity we care about after our symmetrization argument is
\[\begin{equation}\label{eq:symout} \E \max_{x \in \Omega} \left(\e_1 g_1(x) + \cdots + \e_M g_M(x)\right), \end{equation}\]and, as we have just seen, this is the expected maximum of a centered subgaussian process. (Centered means that \(\E X_t = 0\) for every \(t \in T\).) There is a rich history and theory for bounding such expected maxima, with M. Talagrand as the primary architect (see his comprehensive book on the topic).
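To make the connection explicit: for each \(x \in \Omega\), set \(t(x) \seteq (g_1(x),\ldots,g_M(x)) \in \R^M\). Then the quantity in \eqref{eq:symout} is the expected maximum of the Bernoulli process \(X_{t(x)} = \e_1 g_1(x) + \cdots + \e_M g_M(x)\) over the index set \(T = \{ t(x) : x \in \Omega \}\), equipped with the distance
\[d(x,y) = \left(\E\,|X_{t(x)} - X_{t(y)}|^2\right)^{1/2} = \Big(\sum_{j=1}^M \left(g_j(x) - g_j(y)\right)^2\Big)^{1/2} = \|t(x)-t(y)\|_2\,.\]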