# CSE 525 (Spring 2019): HW #5

Due: Thurs, May 16th, 11:59pm

Note that you have two weeks to complete this homework.

In this homework, you will design a data structure for nearest-neighbor search in high-dimensional spaces that overcomes the “curse of dimensionality.” $\newcommand{\diam}{\mathrm{diam}} \newcommand{\e}{\varepsilon}$

Recall the approximate NNS problem: For a metric space $(X,d)$ and a database of points $D \subseteq X$, the goal is to preprocess $D$ so that given a query $q \in X$ and an accuracy parameter $\e > 0$, one can return an answer $a \in D$ such that $d(q,a) \leq (1+\e) d(q,D)$, where $d(q,D)$ denotes the distance from $q$ to the closest point of the database.

In this problem, we will have $X = \mathbb R^k$ equipped with the $\ell_2$ norm or $X = \{0,1\}^k$ equipped with the Hamming metric. If $n=|D|$, then the goal will be algorithms that use space $(kn)^{O(1)}$ and which have query time at most $(k \log n)^{O(1)}$.
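To make the target concrete: the trivial exact algorithm is a linear scan over $D$, costing $O(nk)$ per query, and parts (A)–(C) below build toward $(k \log n)^{O(1)}$ query time at the cost of a $(1+\e)$-approximation. A minimal Python sketch of the baseline for the Hamming case:

```python
# Brute-force exact nearest neighbor in the Hamming metric.
# O(n * k) per query -- the baseline that the LSH-based data
# structure in parts (A)-(C) is designed to beat.

def hamming(x, y):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

def exact_nn(D, q):
    """Linear scan: return the database point closest to q."""
    return min(D, key=lambda x: hamming(x, q))
```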

### (A) Locality sensitive hashing

As a first step, we want to have a simple random experiment that gives us a little information about $q$ vs. a point $x \in D$. Fix a scale $\tau > 0$. Our goal is to define a random map $F : X \to \mathbb R$ that can distinguish between the cases $d(q,x) \leq \tau$ and $d(q,x) > (1+\e) \tau$ with some constant probability (depending on $\e$).

(i) Consider $X=\{0,1\}^k$ and now $d(x,y)$ is the Hamming distance between $x$ and $y$. Let $\beta=\frac{1}{2\tau}$. (You may assume for this part that $\tau \in \{1,2,\ldots,k\}$.) Let $A \subseteq \{1,2,\ldots,k\}$ be a subset where every element is chosen independently with probability $\beta$, and let $z \in \{0,1\}^k$ be a uniformly random string.

Define $F_{\beta} : X \to \{0,1\}$ by

$$F_{\beta}(x) = \bigoplus_{i \in A} (x_i \oplus z_i)\,.$$
Show that there are constants $\delta_{\mathrm{far}} > \delta_{\mathrm{close}} > 0$ (depending only on $\e$) such that

$$\Pr\left[F_{\beta}(q) \neq F_{\beta}(x)\right] \leq \delta_{\mathrm{close}} \ \textrm{ if } d(q,x) \leq \tau\,, \qquad \Pr\left[F_{\beta}(q) \neq F_{\beta}(x)\right] \geq \delta_{\mathrm{far}} \ \textrm{ if } d(q,x) > (1+\e)\tau\,.$$

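The sampling experiment can be sketched in a few lines. One concrete choice of $F_\beta$ consistent with the ingredients above (the set $A$ and the string $z$) is to output the parity of $x \oplus z$ restricted to $A$; this is an illustrative sketch, not necessarily the exact definition intended in the problem. Two points then disagree iff an odd number of their differing coordinates land in $A$.

```python
import random

def sample_F(k, tau, rng=random):
    """Sample one random map F: {0,1}^k -> {0,1} at scale tau.

    Assumed form (one standard choice): keep each coordinate in A
    independently with probability beta = 1/(2*tau), and output the
    parity of x XOR z restricted to A.  For d = d_H(x, y), the two
    outputs disagree with probability (1 - (1 - 2*beta)**d) / 2,
    which is increasing in d -- the source of the close/far gap.
    """
    beta = 1.0 / (2 * tau)
    A = [i for i in range(k) if rng.random() < beta]
    z = [rng.randrange(2) for _ in range(k)]

    def F(x):
        out = 0
        for i in A:
            out ^= x[i] ^ z[i]
        return out

    return F
```

Averaging the disagreement rate over many independent samples of $F$ exhibits the gap between close and far pairs empirically.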
(ii) Now consider $X = \mathbb R^k$ and $d(x,y) = \|x-y\|_2$. Let $g = (g_1, g_2, \ldots, g_k)$ be a $k$-dimensional Gaussian random variable, i.e. each coordinate is independent with distribution $N(0,1)$. Let $P$ be a random partition of the real line $\mathbb R$ into contiguous intervals of length $\tau$. More precisely, if $\alpha \in [0,\tau]$ is chosen uniformly at random, then

$$P = \big\{ [\alpha + j\tau,\, \alpha + (j+1)\tau) : j \in \mathbb Z \big\}\,.$$
For a point $t \in \mathbb R$, let us use the notation $P(t)$ for the interval of $P$ that contains the point $t$. We choose for every interval $I \in P$ an independent uniform random variable $z_I \in \{0,1\}$. Now we define the map $G_{\beta} : \mathbb R^k \to \{0,1\}$ by

$$G_{\beta}(x) = z_{P(\langle g,x\rangle)}\,.$$

Here, $\langle x,y\rangle = x_1 y_1 + \cdots + x_k y_k$ denotes the inner product between $x$ and $y$.

Again, show that there are constants $\delta_{\mathrm{far}} > \delta_{\mathrm{close}} > 0$ such that

$$\Pr\left[G_{\beta}(q) \neq G_{\beta}(x)\right] \leq \delta_{\mathrm{close}} \ \textrm{ if } \|q-x\|_2 \leq \tau\,, \qquad \Pr\left[G_{\beta}(q) \neq G_{\beta}(x)\right] \geq \delta_{\mathrm{far}} \ \textrm{ if } \|q-x\|_2 > (1+\e)\tau\,.$$

You will want to remember the $2$-stability property of independent $N(0,1)$ random variables: For any $x_1, x_2, \ldots, x_k$, the random variable $g_1 x_1 + g_2 x_2 + \cdots + g_k x_k$ has distribution $N(0, \sum_{i=1}^k x_i^2)$.
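The projection-then-bucket construction above translates directly into code. A minimal sketch, assuming the map is "project onto $g$, locate the interval of $P$, output that interval's random bit"; the bits $z_I$ are drawn lazily and memoized so the infinite partition never needs to be materialized:

```python
import math
import random

def sample_G(k, tau, rng=random):
    """Sample one random map G: R^k -> {0,1} at scale tau.

    Project x onto a random Gaussian direction g (by 2-stability, the
    projection of x - y has distribution N(0, ||x - y||_2^2)), shift by
    a uniform offset alpha in [0, tau], and output the random bit z_I
    of the length-tau interval I = P(<g, x>).  Bits are drawn lazily
    and memoized, so the (infinite) partition P is never built.
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    alpha = rng.uniform(0.0, tau)
    bits = {}

    def G(x):
        t = sum(gi * xi for gi, xi in zip(g, x))
        j = math.floor((t - alpha) / tau)  # index of the interval P(t)
        if j not in bits:
            bits[j] = rng.randrange(2)
        return bits[j]

    return G
```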

### (B) Handling a single scale

Suppose we are in the setting of part (A)(i). The numbers $\e$ and $\tau$ are still fixed. Let $\delta = \delta_{\mathrm{far}} - \delta_{\mathrm{close}} > 0$ be the gap between the probabilities you computed in part (A).

Let $F_1, F_2, \ldots, F_h$ denote $h$ independent samples from the random map $F_{\beta}$ (recall that $\beta = \frac{1}{2\tau}$). We define the fingerprint of a point $x \in \{0,1\}^k$ by

$$F(x) = \big(F_1(x), F_2(x), \ldots, F_h(x)\big) \in \{0,1\}^h\,.$$
We can use $F$ to build a random table $T$ indexed by points in $\{0,1\}^h$. For every $z \in \{0,1\}^h$, the table will have $T[z] \in D \cup \{\mathsf{null}\}$, i.e. the entry will either contain a database point from $D$ or be empty.

We put $T[z] = x$ if $d_H(F(x),z) \leq (\delta_{\mathrm{close}} + \frac13 \delta) h$ (choosing an arbitrary such point if there is more than one); otherwise, the table contains $\mathsf{null}$ in that entry. Note here that $d_H$ is the Hamming distance on the hypercube $\{0,1\}^h$.
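A sketch of one table at scale $\tau$. For illustration, the table is represented by a lookup function that scans the $n$ stored fingerprints at query time; the actual data structure precomputes $T[z]$ for all $2^h$ strings $z$ so a lookup is $O(1)$, which is where the $2^h$ factor in the space bound of part (B)(ii) comes from. The helper names (`fingerprint`, `build_table`) are illustrative, not part of the problem statement.

```python
def fingerprint(Fs, x):
    """F(x) = (F_1(x), ..., F_h(x))."""
    return tuple(F(x) for F in Fs)

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def build_table(D, Fs, threshold):
    """Return a lookup function z -> T[z] for one table.

    T[z] is some database point x with d_H(F(x), z) <= threshold
    (i.e. threshold = (delta_close + delta/3) * h), or None (null)
    if no such point exists.  Scanning the stored fingerprints here
    is only for illustration; the real structure fills in T[z] for
    every z in {0,1}^h at preprocessing time.
    """
    prints = [(fingerprint(Fs, x), x) for x in D]

    def lookup(z):
        for fx, x in prints:
            if hamming(fx, z) <= threshold:
                return x
        return None

    return lookup
```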

(i) Prove the following statement: For any query $q \in \{0,1\}^k$ and any point $x \in D$, each of the following two "bad events" has probability at most $e^{-\Omega(\delta^2 h)}$:

$$d(q,x) \leq \tau \quad\textrm{but}\quad d_H\big(F(q),F(x)\big) > \left(\delta_{\mathrm{close}} + \tfrac13 \delta\right) h\,,$$

$$d(q,x) > (1+\e)\tau \quad\textrm{but}\quad d_H\big(F(q),F(x)\big) \leq \left(\delta_{\mathrm{close}} + \tfrac13 \delta\right) h\,.$$

(ii) Now suppose finally that we sample $m$ independent random tables $T_1, T_2, \ldots, T_m$ as above. This brings the total size of our data structure for scale $\tau$ up to $O(m 2^h |D|)$. Let $\mathcal{S}_{\tau} = \{ T_1, T_2, \ldots, T_m \}$ be our collection of tables. Say that $T_i$ fails on $q$ if there is a point $x \in D$ such that either of the bad events in part (B)(i) happens. Say that $\mathcal{S}_{\tau}$ fails on $q$ if $T_i$ fails on $q$ for more than $\frac{m}{10 \log_2 k}$ of the tables $T_i$.

Prove that you can choose $h = O(\frac{1}{\delta^2} \log (n \log k))$ and $m = O(k \log k)$ so that for any $q \in \{0,1\}^k$,

$$\Pr\left[\mathcal{S}_{\tau} \textrm{ fails on } q\right] \leq \frac{1}{10\,k\, 2^k}\,.$$

Conclude that, by a union bound over all $2^k$ possible queries, with probability at least $1-1/(10 k)$, $\mathcal{S}_{\tau}$ does not fail on any $q$.

### (C) Answering a query with high probability

Our final data structure is $\mathcal{S} = \{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k\}$, one collection of tables for each scale $\tau \in \{1,2,\ldots,k\}$. Using part (B), observe that with probability at least $9/10$, it will be the case that no $\mathcal{S}_i$ fails on any query. Let us suppose now that this property holds (i.e., suppose that our data structure construction was successful).

You should show that there's a randomized algorithm (with access to the data structure $\mathcal{S}$) that, given a query $q \in \{0,1\}^k$, outputs a point $a \in D$ such that $d(q,a) \leq (1+\e) \,d(q,D)$ with probability at least $1/2$. (You could then achieve error probability $2^{-s}$ by repeating the query algorithm $s$ times and outputting the best point found.) The running time should be $\approx O(h k \log k)$.

[Hint: For a given scale $\ell \in \{1,2,\ldots,k\}$, you will choose a random table $T_i \in \mathcal{S}_{\ell}$. You'll compute the corresponding fingerprint $F(q)$ and check whether $T_i[F(q)] = \mathsf{null}$. Using binary search, you should find the smallest scale $\ell$ that gives a non-$\mathsf{null}$ answer and output $T_i[F(q)]$ for that scale.]
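The hint can be sketched as follows. Here `S[j]` is assumed to be the list of tables at the $j$-th scale (smallest scale first) and `fingerprint_of(T, q)` a helper computing $F(q)$ with that table's own maps $F_1,\ldots,F_h$; both names are illustrative, not given in the problem.

```python
import random

def answer_query(S, q, fingerprint_of, rng=random):
    """Binary-search the scales for the smallest one whose randomly
    sampled table returns a non-null answer, and return that answer.

    Each probe costs O(h*k) to fingerprint plus one table lookup, and
    the search makes O(log k) probes, matching the ~O(h k log k) bound.
    """
    lo, hi = 0, len(S) - 1
    answer = None
    while lo <= hi:
        mid = (lo + hi) // 2
        T = rng.choice(S[mid])          # a fresh random table each probe
        a = T.lookup(fingerprint_of(T, q))
        if a is not None:
            answer = a
            hi = mid - 1                # non-null: try smaller scales
        else:
            lo = mid + 1
    return answer
```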

## 2. Extra (make-up) problem

Let $T_n$ be a complete binary tree of height $n$ with root $r$. Consider a random walk on $T_n$: at each time step, a walker sitting at a vertex $v$ moves to a uniformly random neighbor of $v$. Every vertex has three neighbors except for the root (two neighbors) and the leaves (one neighbor).

Now suppose that we have $m$ independent random walkers, each performing $n$ steps of the random walk starting at the root $r$. Let $H_1, H_2, \ldots, H_m$ be the heights of the walkers after $n$ steps. We would like to say that all $m$ random walkers end up contained in a narrow band of heights.

Prove carefully that for some constant $C$ (you get to choose $C$, but it should not depend on $m$ or $n$), we have

$$\Pr\left[\max_{i} H_i - \min_{i} H_i \leq C \sqrt{n \log m}\right] \geq 1 - \frac{1}{m}\,.$$
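The drift is easy to check empirically: from any internal non-root vertex, two of the three neighbors are children, so the depth performs a biased $\pm 1$ walk. A minimal simulation tracking only the depth (the tree itself is never built; "height" here is taken to mean distance from the root, an assumption since the problem does not pin down the convention):

```python
import random

def walk_height(n, rng=random):
    """Simulate n steps of the walk on the height-n complete binary
    tree, tracking only the distance h from the root (0 = root,
    n = leaves).

    At an internal non-root vertex, 2 of the 3 neighbors are children,
    so h increases with probability 2/3 and decreases with probability
    1/3; the root forces a step down and a leaf forces a step up.
    """
    h = 0
    for _ in range(n):
        if h == 0:
            h = 1
        elif h == n:
            h = n - 1
        elif rng.random() < 2.0 / 3.0:
            h += 1
        else:
            h -= 1
    return h
```

Averaging many runs shows the walkers concentrating around depth $\approx n/3$ with fluctuations of order $\sqrt{n}$, consistent with the band in the statement above.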