CSE 525 (Spring 2019): HW #5

Due: Thurs, May 16th, 11:59pm

Note that you have two weeks to complete this homework.

In this homework, you will design a data structure for nearest-neighbor search in high-dimensional spaces that overcomes the “curse of dimensionality.”

Recall the approximate NNS problem: For a metric space $(X, \mathsf{d})$ and a database of points $P \subseteq X$, the goal is to preprocess $P$ so that given a query $q \in X$ and an accuracy parameter $\varepsilon > 0$, one can return an answer $a \in P$ such that $\mathsf{d}(q, a) \leq (1+\varepsilon)\,\mathsf{d}(q, P)$, where $\mathsf{d}(q, P)$ denotes the distance from $q$ to the closest point of the database.

In this problem, we will have $X = \mathbb{R}^d$ equipped with the $\ell_2$ norm or $X = \{0,1\}^d$ equipped with the Hamming metric. If $|P| = n$, then the goal will be algorithms that use space $\mathrm{poly}(n, d)$ and which have query time at most $\mathrm{poly}(d, \log n)$.

(A) Locality sensitive hashing

As a first step, we want to have a simple random experiment that gives us a little information about a point $x$ vs. a point $y$. Fix a scale $R > 0$. Our goal is to define a random map $h : X \to \{0,1\}$ that can distinguish between the cases $\mathsf{d}(x,y) \leq R$ and $\mathsf{d}(x,y) \geq (1+\varepsilon)R$ with some constant probability (depending on $\varepsilon$).

(i) Consider $X = \{0,1\}^d$, and now $\mathsf{d}(x,y)$ is the Hamming distance between $x$ and $y$. Let $R \geq 1$. (You may assume that for this part $R \leq d$.) Let $S \subseteq \{1, \ldots, d\}$ be a subset where every element is chosen independently with probability $1/R$, and let $\sigma \in \{0,1\}^d$ be a uniformly random string.

Define $h : \{0,1\}^d \to \{0,1\}$ by

$$h(x) = \sum_{i \in S} \sigma_i x_i \bmod 2.$$

Show that there are constants $0 < c_1 < c_2 < 1$ (depending only on $\varepsilon$) such that

$$\mathsf{d}(x,y) \leq R \;\Longrightarrow\; \Pr[h(x) \neq h(y)] \leq c_1, \qquad \mathsf{d}(x,y) \geq (1+\varepsilon)R \;\Longrightarrow\; \Pr[h(x) \neq h(y)] \geq c_2.$$

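For concreteness, here is a small simulation sketch of this experiment (Python with NumPy; the name `sample_hamming_lsh`, the sampling probability $1/R$, and the parity form of $h$ follow our reading of the definition above and should be treated as assumptions, not part of the assignment):

```python
import numpy as np

def sample_hamming_lsh(d, R, rng):
    """Sample one random map h: {0,1}^d -> {0,1} as in part (i).

    S keeps each coordinate independently with probability 1/R, sigma is a
    uniformly random string, and h(x) is the parity of x over the
    coordinates selected by both S and sigma.
    """
    S = rng.random(d) < 1.0 / R           # indicator of the random subset S
    sigma = rng.integers(0, 2, size=d)    # uniformly random string in {0,1}^d
    mask = S * sigma
    return lambda x: int(np.dot(mask, x) % 2)

# Empirically compare Pr[h(x) != h(y)] at distance R vs. distance > (1+eps)R.
rng = np.random.default_rng(0)
d, R, eps, trials = 200, 10, 0.5, 5000
x = rng.integers(0, 2, size=d)
y_near = x.copy(); y_near[:R] ^= 1                      # Hamming distance exactly R
y_far = x.copy(); y_far[: int((1 + eps) * R) + 1] ^= 1  # Hamming distance > (1+eps)R
hs = [sample_hamming_lsh(d, R, rng) for _ in range(trials)]
print(sum(h(x) != h(y_near) for h in hs) / trials,
      sum(h(x) != h(y_far) for h in hs) / trials)       # second should be larger
```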
(ii) Now consider $X = \mathbb{R}^d$ and $\mathsf{d}(x,y) = \|x - y\|_2$. Let $g$ be a $d$-dimensional Gaussian random variable, i.e. each coordinate $g_i$ is independent with distribution $N(0,1)$. Let $\mathcal{I}$ be a random partition of the real line into contiguous intervals of length $R$. More precisely, if $\beta \in [0, R)$ is chosen uniformly at random, then

$$\mathcal{I} = \big\{ [\beta + jR,\, \beta + (j+1)R) : j \in \mathbb{Z} \big\}.$$

For a point $t \in \mathbb{R}$, let us use the notation $I(t)$ for the interval of $\mathcal{I}$ that contains the point $t$. We choose for every interval $I \in \mathcal{I}$ an independent uniform random variable $\sigma_I \in \{0,1\}$. Now we define the map $h : \mathbb{R}^d \to \{0,1\}$ by

$$h(x) = \sigma_{I(\langle g, x \rangle)}.$$

Here, $\langle g, x \rangle = \sum_{i=1}^d g_i x_i$ denotes the inner product between $g$ and $x$.

Again, show that there are constants $0 < c_1 < c_2 < 1$ (depending only on $\varepsilon$) such that

$$\mathsf{d}(x,y) \leq R \;\Longrightarrow\; \Pr[h(x) \neq h(y)] \leq c_1, \qquad \mathsf{d}(x,y) \geq (1+\varepsilon)R \;\Longrightarrow\; \Pr[h(x) \neq h(y)] \geq c_2.$$

You will want to remember the $2$-stability property of independent $N(0,1)$ random variables: For any $x \in \mathbb{R}^d$, the random variable $\langle g, x \rangle$ has distribution $N(0, \|x\|_2^2)$, i.e. the same distribution as $\|x\|_2 \cdot N(0,1)$.

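As with part (i), a brief simulation sketch may help fix the definition (again Python/NumPy; the lazy sampling of the bits $\sigma_I$ is an implementation convenience we chose, since the partition $\mathcal{I}$ has infinitely many intervals):

```python
import numpy as np

def sample_gaussian_lsh(d, R, rng):
    """Sample one random map h: R^d -> {0,1} as in part (ii).

    g is a standard d-dimensional Gaussian, beta is the random shift of the
    length-R interval partition, and each interval gets an independent
    uniform bit sigma_I, drawn lazily on first use.
    """
    g = rng.standard_normal(d)
    beta = rng.uniform(0.0, R)
    sigma = {}  # interval index j -> random bit sigma_I

    def h(x):
        j = int(np.floor((np.dot(g, x) - beta) / R))  # index of I(<g, x>)
        if j not in sigma:
            sigma[j] = int(rng.integers(0, 2))
        return sigma[j]

    return h
```

By $2$-stability, $\langle g, x - y \rangle$ has the same distribution as $\|x - y\|_2 \cdot N(0,1)$, so the probability that $x$ and $y$ fall into the same interval of $\mathcal{I}$ depends only on the ratio $\|x - y\|_2 / R$.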
(B) Handling a single scale

Suppose we are in the setting of part (A)(i). The numbers $R$ and $\varepsilon$ are still fixed. Let $\delta = c_2 - c_1$ be the difference in probabilities you computed in part (A)(i).

Let $h_1, \ldots, h_k$ denote independent samples from the random map $h$ (recall that $h : X \to \{0,1\}$). We define the fingerprint of a point $x \in X$ by

$$F(x) = (h_1(x), h_2(x), \ldots, h_k(x)) \in \{0,1\}^k.$$

We can use $F$ to build a random table $T$ indexed by points in $\{0,1\}^k$. For every $v \in \{0,1\}^k$, the table will have an entry $T(v) \in P \cup \{\perp\}$, i.e. the entry will either contain a database point or an empty cell ($\perp$).

We put $T(v) = x$ if some $x \in P$ satisfies $\|F(x) - v\|_1 \leq (c_1 + \delta/2)\,k$ (choosing any such point if there is more than one); else the table contains $\perp$ in that entry. Note here that $\|\cdot\|_1$ is the Hamming distance on the hypercube $\{0,1\}^k$.

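To make the table construction concrete, here is a minimal sketch (Python/NumPy; the names `fingerprint` and `build_table` are ours, `hs` is a list of $k$ sampled maps as in part (A), and `radius` stands in for the threshold $(c_1 + \delta/2)k$ from our reconstruction; enumerating all $2^k$ entries is only meant for small $k$):

```python
from itertools import product
import numpy as np

def fingerprint(x, hs):
    """F(x) = (h_1(x), ..., h_k(x)) in {0,1}^k."""
    return np.array([h(x) for h in hs])

def build_table(P, hs, radius):
    """Build one random table T: {0,1}^k -> database point or None (= ⊥).

    T[v] stores some x in P with ||F(x) - v||_1 <= radius if one exists
    (an arbitrary such point if there is more than one), else None.
    """
    fps = [(x, fingerprint(x, hs)) for x in P]
    T = {}
    for v in product((0, 1), repeat=len(hs)):
        v_arr = np.array(v)
        T[v] = next((x for x, f in fps if int(np.sum(f != v_arr)) <= radius), None)
    return T
```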
(i) Prove the following statement: For any query $q \in X$, the following holds with probability at least $1 - n\, e^{-\delta^2 k/2}$ over the choice of $T$: there is no point $x \in P$ for which either of the following bad events happens:

$$\text{(1)}\;\; \mathsf{d}(q, x) \leq R \;\text{ but }\; T(F(q)) = \perp, \qquad \text{(2)}\;\; \mathsf{d}(q, x) > (1+\varepsilon)R \;\text{ but }\; T(F(q)) = x.$$

(ii) Now suppose finally that we sample $m$ independent random tables $T_1, \ldots, T_m$ as above. This brings the total size of our data structure for scale $R$ up to $O(m \cdot 2^k)$. Let $\mathcal{T}_R = (T_1, \ldots, T_m)$ be our collection of tables. Say that $T_j$ fails on $q$ if there is a point $x \in P$ such that either of the bad events in part B(i) happens. Say that $\mathcal{T}_R$ fails on $q$ if $T_j$ fails on $q$ for more than $m/4$ of the tables $T_j$.

Prove that you can choose $k = O(\log n)$ and $m = O(d)$ so that for any $q \in \{0,1\}^d$,

$$\Pr\big[\mathcal{T}_R \text{ fails on } q\big] \leq 4^{-d}.$$

Conclude that with probability at least $1 - 2^{-d}$, $\mathcal{T}_R$ does not fail on any $q \in \{0,1\}^d$.

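For orientation, the concluding step is a union bound over all $2^d$ possible queries on the hypercube; with the parameters as we have reconstructed them, the arithmetic is

$$\Pr\big[\mathcal{T}_R \text{ fails on some } q \in \{0,1\}^d\big] \;\leq\; \sum_{q \in \{0,1\}^d} \Pr\big[\mathcal{T}_R \text{ fails on } q\big] \;\leq\; 2^d \cdot 4^{-d} \;=\; 2^{-d}.$$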
(C) Answering a query with high probability

Our final data structure is $\mathcal{D} = (\mathcal{T}_{R_0}, \mathcal{T}_{R_1}, \ldots, \mathcal{T}_{R_L})$, where $R_i = (1+\varepsilon)^i$ and $L = O(\log_{1+\varepsilon} d)$. Using part (B), observe that with probability at least $1 - (L+1)\,2^{-d}$, it will be the case that no $\mathcal{T}_{R_i}$ fails on any query. Let us suppose now that this property holds (i.e., suppose that our data structure construction was successful).

You should show that there's a randomized algorithm (with access to the data structure $\mathcal{D}$) that, given a query $q \in \{0,1\}^d$, outputs a point $x \in P$ such that $\mathsf{d}(q, x) \leq (1+\varepsilon)^2\, \mathsf{d}(q, P)$ with probability at least $3/4$. (Then you could achieve error probability $2^{-s}$ by repeating the query algorithm $O(s)$ times and outputting the best point.) The running time should be $\mathrm{poly}(d, \log n, 1/\varepsilon)$.

[Hint: For a given distance scale $R_i$, you will choose a random table $T \in \mathcal{T}_{R_i}$. You'll compute the corresponding fingerprint $F(q)$ and check whether $T(F(q)) \neq \perp$. Using binary search, you should find the smallest index $i$ that gives a non-$\perp$ answer and output $T(F(q))$.]

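The hint translates into roughly the following query procedure (a sketch only; `tables_by_scale`, `hs_by_scale`, and the dictionary representation of tables are our assumptions, matching the sketch after part (B), and the noise in the probes still has to be handled in your analysis, since up to $m/4$ of the tables at each scale may fail):

```python
import numpy as np

def query(q, tables_by_scale, hs_by_scale, rng):
    """Binary-search the scales R_0 < R_1 < ... for the smallest index
    whose (randomly chosen) table returns a non-⊥ answer on q."""

    def probe(i):
        tables = tables_by_scale[i]
        T = tables[rng.integers(len(tables))]          # one random table at scale R_i
        v = tuple(int(h(q)) for h in hs_by_scale[i])   # fingerprint F(q) at scale R_i
        return T.get(v)                                # a database point, or None (= ⊥)

    lo, hi, answer = 0, len(tables_by_scale) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = probe(mid)
        if hit is not None:
            answer, hi = hit, mid - 1                  # got an answer; try smaller scales
        else:
            lo = mid + 1
    return answer
```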
2. Extra (make-up) problem

Let $T$ be a complete binary tree of height $H$ with root $r$. Consider the process of random walk on $T$: at each time step, a walker sitting at a vertex $v$ moves to a uniformly random neighbor of $v$. Every vertex has three neighbors except for the root (two neighbors) and the leaves (one neighbor).

Now suppose that we have $n$ independent random walkers and each does $t$ steps of random walk starting at the root $r$. Let the heights of those walkers after $t$ steps be $h_1, h_2, \ldots, h_n$, where the height of a vertex is its distance from the root. We would like to say that all of the random walkers are contained in a narrow band of heights.

Prove carefully that for some constant $C > 0$ (you get to choose $C$, but it should not depend on $n$ or $t$), we have

$$\Pr\Big[\max_{1 \leq i \leq n} h_i - \min_{1 \leq i \leq n} h_i \leq C\sqrt{t \log n}\,\Big] \geq 1 - \frac{1}{n}.$$
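If you want to sanity-check the statement before proving it, here is a small simulation sketch (Python/NumPy; we project the walk onto the height coordinate, the interface `walker_heights` is ours, and we take the tree tall enough that the leaf boundary is never reached):

```python
import numpy as np

def walker_heights(n, t, H, rng):
    """Heights of n independent t-step random walks started at the root.

    Projected to the height coordinate: from the root (height 0) the walk
    must step down; at a leaf (height H) it must step up; at an internal
    vertex it steps up with probability 1/3 and down with probability 2/3.
    ("Down" means away from the root, i.e. the height increases.)
    """
    h = np.zeros(n, dtype=int)
    for _ in range(t):
        up = rng.random(n) < 1.0 / 3.0
        h = np.where(h == 0, 1,
            np.where(h == H, h - 1,
                     np.where(up, h - 1, h + 1)))
    return h

rng = np.random.default_rng(0)
n, t = 1000, 400
heights = walker_heights(n, t, H=10**6, rng=rng)
print(heights.max() - heights.min(), np.sqrt(t * np.log(n)))  # band width vs. sqrt(t log n)
```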