CSE 422: Assignment #1

Hashing and load balancing

Due: Wed, Jan 17, 11:59pm
Gradescope Link
Dropbox for code

Problem 1: The power of two (or more) choices [32 pts]

In this problem, you will explore the idea that very simple randomized load-balancing schemes are surprisingly effective.

Consider the following load-balancing setup: We have $N$ jobs to assign to $N$ servers, which we will model using “balls and bins.” There are $N$ balls and each ball is thrown into one of $N$ bins. Let $\mathsf{M}$ denote the number of balls in the most populated bin. A good low-overhead load-balancing scheme should have $\mathsf{M}$ small (with high probability), while at the same time using a simple rule to assign balls to bins.

You will compare four different rules for choosing the bin in which to place a ball. To place all $N$ balls, we iteratively apply one of the rules $N$ times.

Select one of the $N$ bins uniformly at random.
Select two of the $N$ bins $B_1$ and $B_2$ uniformly at random and place the ball in whichever of $B_1$ or $B_2$ has the fewest balls so far. (You can break ties arbitrarily).
Same as the previous strategy, but now choose three bins $B_1,B_2,B_3$ uniformly at random.
Select $B_1$ uniformly at random from the first $\lfloor N/2\rfloor$ bins and $B_2$ uniformly at random from the last $\lceil N/2\rceil$ bins. Place the ball in whichever of $B_1$ or $B_2$ has the fewest balls so far. (You can break ties arbitrarily).

(a) [5 points] Write code to simulate strategies 1-4. For each strategy, there should be a function that takes the number $N$ of balls and bins as input, simulates a run of the corresponding random process, and outputs the number of balls in the most populated bin (the value $\mathsf{M}$).

Paste your code into the typeset assignment.

(b) [6 points] Let $N=$ 300,000 and simulate each of the four strategies 50 times. For each strategy, plot a histogram of the 50 values of $\mathsf{M}$ (the maximum load). Discuss the pros and cons of each of the strategies based on the histograms.

Submit the histograms you generated and your discussion.

Here is some python code to help you generate histograms.

(c) [5 points] Write code to simulate the following variation for strategies 1-3: First select $N$ “servers” $s_1,s_2,\ldots,s_N \in \{0,1,\ldots,2^{32}-1\}$ uniformly at random. We will consider each server as a bin, where any ball thrown into the interval preceding $s_i$ falls in bin $B_i$, as in consistent hashing.

To select a bin at random, choose a uniformly random $x \in \{0,1,\ldots,2^{32}-1\}$ and let $s_i$ be the server whose interval contains $x$. Write code to simulate this variation for strategies 1-3.

Paste your code into the typeset assignment.

(d) [6 points] Let $N=$ 150,000 and simulate each of the three strategies 50 times. For each strategy, plot a histogram of the 50 values of $\mathsf{M}$ (the maximum load). Discuss the pros and cons of these variations of the strategies based on the histograms.

Submit the histograms you generated and your discussion.

(e) [5 points] Consider hashing $N$ elements into a hash table with $N$ buckets, and resolving collisions via chaining (for a refresher, see the discussion of separate chaining with linked lists). Discuss how the value of $\mathsf{M}$ for strategy (1) relates to search times in the hash table.

Submit your answer and discussion.

(f) [5 points] Do strategies (2)-(4) suggest an alternate way of implementing hash tables with chaining? Discuss the possible trade-offs between the different hash table implementations you propose (in particular, with respect to insertion time vs. search time).

Submit your answer and discussion.

ExtraCredit 1EC(a): Analysis of $\mathsf{M}$ via Chernoff bounds [math, 6 pts]

Let $X_1,\ldots,X_n$ be independent $\{0,1\}$ random variables with $p_i \seteq \mathbb{E}[X_i]$ for $i=1,\ldots,n$ and define the sum $X \seteq X_1 + \cdots + X_n$ and the expectation $\mu \seteq \E[X] = p_1 + \cdots + p_n$.

The multiplicative form of the Chernoff bound asserts that for every $\beta \geq 1$, it holds that

\[\begin{align*} \P\left(X \geq \beta \mu\right) &\leq \left(\frac{e^{\beta-1}}{\beta^{\beta}} \right)^{\mu}\,. \end{align*}\]

Let $\mathsf{M}$ denote the maximum load when $N$ balls are thrown into $N$ bins uniformly at random. Use the multiplicative Chernoff bound to prove that $\E[\mathsf{M}] \leq O(\frac{\log n}{\log \log n})$.

ExtraCredit 1EC(b): Two choices on a graph [research-level, 0-1000 pts]

Consider the following generalization of the two-choice model. There are now $m$ balls and $n$ bins, where we generally think about $m \geq n$. There is also an undirected graph $G=(V,E)$ with vertex set $V = \{1,2,\ldots,n\}$ so that the vertices correspond naturally to the bins $B_1,\ldots,B_n$.

Now to throw the $i$th ball into a bin, we proceed as follows: First choose a uniformly random edge $\{j,k\} \in E$. Then place the ball into whichever bin $B_j$ or $B_k$ has the smallest load.

Warmup question: What graph $G$ corresponds to the two-choice process used in the problem?

Note that in any setup, the average load of a bin is $m/n$. Let us define the gap as the random variable

\[Z \seteq \max\left\{ \mathrm{load}(B_1) - \frac{m}{n}, \mathrm{load}(B_2) -\frac{m}{n}, \ldots, \mathrm{load}(B_n)-\frac{m}{n}\right\},\]

which is the maximum amount by which the load of a bin exceeds the average load. (As a point of reference, it is expected that $Z$ typically behaves in a way that is independent of $m$ for $m \geq n$.)

Explore what happens when $G$ is the path graph on $n$ vertices. An interesting thing to do here is to make a plot where the $x$-axis has the labels $1,2,\ldots,n$ and on the $y$-axis, you plot $\mathrm{load}(B_j) - \mathrm{load}(B_1)$ for $j=1,2,\ldots,n$.
Can you estimate the asymptotics of the typical maximum gap $Z$ as $n \to \infty$?
Explore what happens when $G$ is the $\sqrt{n} \times \sqrt{n}$ square grid graph. You might use matplotlib to make a 3D plot that is analogous to the plot suggested in part (1).
Compare the asymptotics of the typical maximum gap $Z$ to what happens for the path graph.
Can you offer some intuitive or empirical explanation for the disrepancy between the magnitude of $Z$ in the two cases? [Note that it is an open problem to rigorously bound the expected value $\E[Z]$ for the grid graph!]

Problem 2: The Count-Min sketch [27 pts]

You will implement and experiment with a count-min sketch using $\ell=5$ hash tables and $b=256$ buckets.

The universe is is the set of strings $\Sigma^*$ where $\Sigma = \{a,b,c,\ldots,x,y,z\}$. Your code should take an integer seed $s$ as input. The five hash functions $h_1,h_2,h_3,h_4,h_5 : \Sigma^* \to \{0,1,\ldots,255\}$ are computed as follows: For $x \in \Sigma^*$, define $h_i(x)$ to be the $i$th byte of the MD5 hash of $x \circ \mathrm{str}(s)$, where $\mathrm{str}(\cdot)$ is the string representation of an integer and $\circ$ corresponds to concatenation of strings.

(Do not implement MD5 yourself; most modern programming languages have packages to compute MD5 hashes.)

(a) [5 points] Implement the count-min sketch as described above. You should have functions Inc(x,s) and Count(x,s).

Paste your code into the typeset assignment.

The data stream is contained in the file stream.txt, which contains a list of hashtags.

(b) [2 points] Call an integer a heavy hitter if the number of times it appears in the data stream is at least $1\%$ of the total length of the stream. How many heavy hitters are there in stream.txt?

Now consider three different data streams corresponding to the entries in stream.txt, but with the hashtags in different orders:

Heavy last: Sort the elements of stream.txt in order of increasing frequency and then calculate heavy hitters.
Heavy first: Sort the elements in order of decreasing frequency
Random: Sort the elements randomly

(c) [6 points] For each of the three data streams, feed it into your count-min sketch, and compute the values of the following quantities averaged over 20 trials. In the $i$th trial, you should use the seed value $s=i$.

The sketch’s estimate for the frequency of the hashtag workout in the stream.
The sketch’s estimate for the number of heavy hitters (i.e., where the count is at least $1\%$ of the stream length).

[Note: When you use the count-min sketch to compute heavy hitters, the whole idea is to use a small amount of space. So you can’t store all the elements you’ve seen so far and then check their counts at the end.

The way to keep track of heavy hitters with small space (see Sec 2.7 in the notes) is that every time you see an element, you call Inc(x,s) and then you call Count(x,s) to check if the count has exceeded your heavy hitter threshold.]

Does the order of the stream affect the estimated counts? Explain what the data says, and propose a plausible explanation.

(d) [5 points] Implement the following conservative update optimization. When you are updating the five counters, only increment the subset of counters for which the current count is smallest (if a few are tied for smallest, increment all of those).

Paste your code into the typeset assignment.

(e) [3 points] Argue that, even with the conservative update rule, the count-min sketch never underestimates the count.

(f) [6 points] Repeat part (c) with conservative updates (including the discussion). If the outcome seems different, offer a plausible explanation.