CSE 525 (Spring 2019): HW #5

Due: Thurs, May 16th, 11:59pm

Note that you have two weeks to complete this homework.

1. High-dimensional nearest-neighbor search

In this homework, you will design a data structure for nearest-neighbor search in high-dimensional spaces that overcomes the “curse of dimensionality.” $\newcommand{\diam}{\mathrm{diam}} \newcommand{\e}{\varepsilon}$

Recall the approximate NNS problem: For a metric space $(X,d)$ and a database of points $D \subseteq X$ , the goal is to preprocess $D$ so that given a query $q \in X$ and an accuracy parameter $\e > 0$ , one can return an answer $a \in D$ such that $d(q,a) \leq (1+\e) d(q,D)$ , where $d(q,D)$ denotes the distance from $q$ to the closest point of the database.

In this problem, we will have $X = \mathbb R^k$ equipped with the $\ell_2$ norm or $X = \{0,1\}^k$ equipped with the Hamming metric. If $n=|D|$ , then the goal will be algorithms that use space $(kn)^{O(1)}$ and which have query time at most $(k \log n)^{O(1)}$ .

(A) Locality sensitive hashing

As a first step, we want to have a simple random experiment that gives us a little information about $q$ vs. a point $x \in D$ . Fix a scale $\tau > 0$ . Our goal is to define a random map $F : X \to \mathbb R$ that can distinguish between the cases $d(q,x) \leq \tau$ and $d(q,\tau) > (1+\e) \tau$ with some constant probability (depending on $\e$ ).

(i) Consider $X=\{0,1\}^k$ and now $d(x,y)$ is the Hamming distance between $x$ and $y$ . Let $\beta=\frac{1}{2\tau}$ . (You may assume that for this part $\tau \in \{1,2,\ldots,k\}$ .) Let $A \subseteq \{1,2,\ldots,k\}$ be a subset where every element is chosen independent with probability $\beta$ , and let $z \in \{0,1\}^k$ be a uniformly random string.

Define $F_{\beta} : X \to \{0,1\}$ by
$F_{\beta}(x) = \sum_{i \in A} x_i z_i \quad (\!\bmod\,2)$
Show that there are constants $\delta_{\mathrm{far}} > \delta_{\mathrm{close}} > 0$ such that
$\begin{align*} d(x,q) \leq \tau &\implies \Pr[F_{\beta}(x) \neq F_{\beta}(q)] < \delta_{\mathrm{close}}, \\ d(x,q) > (1+\e) \tau &\implies \Pr[F_{\beta}(x) \neq F_{\beta}(q)] > \delta_{\mathrm{far}} \end{align*}$

(ii) Now consider $X = \mathbb R^d$ and $d(x,y) = \|x-y\|_2$ . Let $g = (g_1, g_2, \ldots, g_k)$ be a $k$ -dimensional Gaussian random variable, i.e. each coordinate is independent with distribution $N(0,1)$ . Let $P$ be a random partition of the real line $\mathbb R$ into continguous intervals of length $\tau$ . More precisely, if $\alpha \in [0,\tau]$ is chosen uniformly at random, then
$P = \left\{ [\alpha + \tau k, \alpha + \tau (k+1)) : k \in \mathbb Z \right\}$
For a point $t \in \mathbb R$ , let us use the notation $P(t)$ for the interval of $P$ that contains the point $t$ . We choose for every interval $I \in P$ an independent uniform random variable $z_I \in \{0,1\}$ . Now we define the map $G_{\beta} : \mathbb R^d \to \{0,1\}$ by
$G_{\beta}(x) = z_{P(\langle g,x\rangle)}$
Here, $\langle x,y\rangle = x_1 y_1 + \cdots + x_k y_k$ denotes the inner product between $x$ and $y$ .

Again, show that there are constants $\delta_{\mathrm{far}} > \delta_{\mathrm{close}} > 0$ such that
$\begin{align*} \|x-q\|_2 \leq \tau &\implies \Pr[G_{\beta}(x) \neq G_{\beta}(q)] < \delta_{\mathrm{close}}, \\ \|x-q\|_2 > (1+\e) \tau &\implies \Pr[G_{\beta}(x) \neq G_{\beta}(q)] > \delta_{\mathrm{far}} \end{align*}$
You will want to remember the $2$ -stability property of independent $N(0,1)$ random variables: For any $x_1, x_2, \ldots, x_k$ , the random variable $g_1 x_1 + g_2 x_2 + \cdots + g_k x_k$ has distribution $N(0, \sum_{i=1}^k x_i^2)$ .

(B) Handling a single scale

Suppose we are in the setting of Problem 1(a). The numbers $\e$ and $\tau$ are still fixed. Let $\delta = \delta_{\mathrm{far}} - \delta_{\mathrm{close}} > 0$ be the difference in probabilities you computed in 1(a).

Let $F_1, F_2, \ldots, F_h$ denote $h$ independent samples from the random map $F_{\beta}$ (recall that $\beta = \frac{1}{2\tau}$ ). We define the fingerprint of a point $x \in \{0,1\}^d$ by

$F(x) = (F_1(x), F_2(x), \ldots, F_h(x)) \in \{0,1\}^h.$

We can use $F$ to build a random table $T$ indexed by points in $\{0,1\}^h$ . For every $z \in \{0,1\}^h$ , the table will have $T[z] \in D \cup \{\mathsf{null}\}$ , i.e. the entry will either contain a database point $D$ or an empty cell.

We put $T[z] = x$ if $d_H(F(x),z) \leq (\delta_{\mathrm{close}} + \frac13 \delta) h$ (and any such point if there are more than one), else the table contains $\mathsf{null}$ in that entry. Note here that $d_H$ is the Hamming distance on the hypercube $\{0,1\}^h$ .

(i) Prove the following statement: For any query $q \in \{0,1\}^k$ , the following holds
$\begin{align*} d(q,x) \leq \tau &\implies \Pr\left[d_H(F(x),F(q)) > \left(\delta_{\mathrm{close}} + \frac13 \delta\right) h\right] \leq e^{-2 \delta^2 h/9} \\ d(q,x) > (1+\e) \tau &\implies \Pr\left[d_H(F(x),F(q)) < \left(\delta_{\mathrm{far}} - \frac13 \delta\right) h\right] \leq e^{-2 \delta^2 h/9} \\ \end{align*}$

(ii) Now suppose finally that we sample independent $m$ random tables $T_1, T_2, \ldots, T_m$ as above. This brings the total size of our data structure for scale $\tau$ up to $O(m 2^h |D|)$ . Let $\mathcal{S}_{\tau} = \{ T_1, T_2, \ldots, T_m \}$ be our collection of tables. Say that $T_i$ fails on $q$ if there is a point $x \in D$ such that either of the bad events in part B(i) happen. Say that $\mathcal{S}_{\tau}$ fails on $q$ if $T_i$ fails on $q$ for more than $\frac{m}{10 \log_2 k}$ of the tables $T_i$ .

Prove that you can choose $h = O(\frac{1}{\delta^2} \log (n \log k))$ and $m = O(k \log k)$ so that for any $q \in \{0,1\}^k$ ,
$\Pr[\mathcal{S}_{\tau} \textrm{ fails on } q] \leq \frac{2^{-k}}{10 k}.$
Conclude that with probability at least $1-1/(10 k)$ , $\mathcal{S}_{\tau}$ does not fail on any $q$ .

(C) Answering a query with high probability

Our final data structure is $\mathcal{S} = \{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k\}$ . Using part (b), observe that with probability at least $9/10$ , it will the the case that no $\mathcal{S}_i$ fails on any query. Let us suppose now that this property holds (i.e., suppose that our data structure construction was successful).

You should show that there’s a randomized algorithm (with access to the data structure $\mathcal{S}$ ) that, given a query $q \in \{0,1\}^k$ , outputs a point $a \in D$ such that $d(q,a) \leq (1+\e) \,d(q,D)$ with probability at least $1/2$ . (Then you could achieve error probability $2^{-a}$ by repeating the query algorithm $a$ times and outputting the best point.) The running time should be $\approx O(h k \log k)$ .

[Hint: For a given distance $\ell \in \{1,2,\ldots,k\}$ , you will choose a random table $T_i \in \mathcal{S}_{\ell}$ . You’ll compute the corresponding fingerprint $F(q)$ and check whether $T_i[F(q)] = \mathsf{null}$ . Using binary search, you should find the smallest index $\ell$ that gives a non- $\mathsf{null}$ answer and output $T_i[F(q)]$ .]

2. Extra (make up) problem

Let $T_n$ be a complete binary tree of height $n$ with root $r$ . Consider the process of random walk on $T_n$ . At each time step, a walker sitting at a vertex $v$ moves to a uniformly random neighbor of $v$ . Every vertex has three neighbors except for the root (two neighbors) and the leaves (one neighbor).

Now suppose that we have $m$ independent random walkers and each does $n$ steps of random walk starting at the root $r$ . Let the heights of those walkers after $n$ steps be $H_1, H_2, \ldots, H_m$ . We would like to say that all of the $m$ random walkers are contained in a narrow band of heights.

Prove carefully that for some constant $C$ (you get to choose $C$ , but it should not depend on $m$ or $n$ ), we have
$\Pr\left[\max_i \left|H_i - \frac{n}{3}\right| \leq C \sqrt{n \log m}\right] \geq 1-\frac{1}{m}\,.$