CSE P521: HW #3

Out: Thursday, Jan 26 at midnight

Due: Thursday, Feb 2 at midnight (in the dropbox)

See intro slides from Lecture 4 for specific instructions.


  1. Measures of similarity

    Goal: To get some familiarity with similarity measures.

    Turn in: All code. Three heat maps for part (b) and one for part (d). Explanations for (c), (e), (f).

    You will use a well-known data set of articles from various newsgroups. Each article is represented by a "bag of words," which is a vector indexed by words with each component indicating the number of times the word occurs in a given article.

    The data is stored in three files: data50.csv, label.csv, and groups.csv.

    The data50.csv file is a sparse representation of each "bag of words." Every row contains three fields: articleId, wordId, and count. To find out which group an article belongs to, use the file label.csv. If articleId=n, then line n of label.csv contains the corresponding groupId. Line m of groups.csv contains the name of the mth group.
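    One possible way to load the triples, sketched with SciPy's sparse matrices (the function name and the assumption that the IDs are 1-based are illustrative, not prescribed by the assignment):

```python
import numpy as np
from scipy.sparse import csr_matrix

def load_bags_of_words(path):
    """Read (articleId, wordId, count) triples into a sparse matrix.

    Rows are articles, columns are words; IDs are assumed 1-based.
    """
    triples = np.loadtxt(path, delimiter=",", dtype=int)
    rows = triples[:, 0] - 1   # articleId -> row index
    cols = triples[:, 1] - 1   # wordId -> column index
    counts = triples[:, 2]
    return csr_matrix((counts, (rows, cols)))
```

    The CSR format keeps only the nonzero counts, which matters here since the vocabulary is large but each article uses few words.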

    You will examine three measures of similarity for bag-of-words vectors \(x\) and \(y\); higher numbers mean the vectors are more similar:

    • Jaccard similarity: \[J(x,y) = \frac{\sum_{i=1}^N \min(x_i,y_i)}{\sum_{i=1}^N \max(x_i,y_i)}\]
    • Cosine similarity: \[C(x,y) = \frac{\langle x,y\rangle}{\|x\|_2 \|y\|_2}\]
    • \(L^2\) similarity: \[H(x,y) = - \|x-y\|_2 = - \sqrt{\sum_{i=1}^N |x_i-y_i|^2}\]
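    For dense NumPy vectors, the three measures might be sketched as follows (sparse rows can be densified per pair with `.toarray()`; treat this as a reference sketch, not the required implementation):

```python
import numpy as np

def jaccard(x, y):
    # Weighted Jaccard: sum of elementwise minima over sum of elementwise maxima.
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def cosine(x, y):
    # Inner product normalized by the L2 norms of both vectors.
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def l2_similarity(x, y):
    # Negated Euclidean distance, so larger values mean "more similar".
    return -np.linalg.norm(x - y)
```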
    1. Import the data sets into the language you're using. Remember the total number of words is huge, so you probably want to use a sparse representation.

    2. Implement the three similarity measures. For each measure, prepare the following plot: Rows and columns are indexed by newsgroups (in the same order). For each entry \((A,B)\) of this \(20\times 20\) matrix, compute the average similarity over all ways of pairing one article from \(A\) with one article from \(B\).

      After computing these 400 numbers (for each of the three similarity measures), plot your results using a heat map. Label your axes with the group names. Here is starter code for heat maps in Python.
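      A brute-force way to build one such matrix of average similarities; the function and argument names are illustrative, `labels` holds 0-based group indices, and `sim` is any of the three measures:

```python
import numpy as np

def average_similarity_matrix(vectors, labels, sim, n_groups=20):
    """Average pairwise similarity between articles, grouped by newsgroup.

    vectors: list of article vectors; labels: 0-based group index per article;
    sim: a similarity function taking two vectors.
    Self-pairs are included in the diagonal averages.
    """
    sums = np.zeros((n_groups, n_groups))
    counts = np.zeros((n_groups, n_groups))
    for i, x in enumerate(vectors):
        for j, y in enumerate(vectors):
            sums[labels[i], labels[j]] += sim(x, y)
            counts[labels[i], labels[j]] += 1
    return sums / counts
```

      The resulting matrix can be handed directly to `matplotlib`'s `imshow`, with the group names as tick labels.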

    3. Based on the heat maps you generated, which of the measures seems the most reasonable? Are there any pairs of newsgroups that are very similar? Would you have expected this? Explain.

    4. Now compute another \(20 \times 20\) matrix as follows.

      For each article \(a_1\), find the article \(a_2\) from a different newsgroup that has the largest Jaccard similarity with \(a_1\). (You can do this with brute-force search.)

      Now for each pair of groups \(A,B\), count how many articles in group \(A\) have their nearest neighbor in group \(B\). Plot these results in a heat map.

    5. Your plots for part (b) were symmetric, but your plot for part (d) was asymmetric. Explain why.

    6. Which groups seem similar? Compare the plots from parts (b) and (d). Which method seems more suited to comparing newsgroups?

  2. Dimension reduction

    Goal: Explore the tradeoff in dimension reduction between quality and number of dimensions.

    Turn in: Code, figures, times, and discussion for parts (a)-(c).

    Given parameters \(k\) and \(d\), define a \(d \times k\) matrix \(M\) by drawing each entry of \(M\) randomly and independently from a normal distribution with mean 0 and variance 1.

    Given a \(k\)-dimensional vector \(v \in \mathbb{R}^k\), you compute its image in \(d\) dimensions using the matrix-vector product \(M v\).
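    A minimal sketch of this Gaussian projection, applied to a whole data matrix at once (rows are articles; the function name and seed argument are illustrative):

```python
import numpy as np

def random_projection(X, d, seed=None):
    """Map each k-dimensional row v of X to M v, where M is d-by-k
    with i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((d, X.shape[1]))
    return X @ M.T  # row i becomes M @ X[i]
```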

    1. Implement the random projection dimension reduction method and plot the nearest-neighbor visualization as in Problem 1(d) for cosine similarity and \(d=10,25,50,100\). Also record the wall-clock time for each run. For which values of the target dimension are the results comparable to the original embedding?

    2. Consider article 3 (from alt.atheism). For each dimension \(d=10,25,50,100\), create a scatter plot in which each point represents an article: the \(x\)-coordinate is the cosine similarity to article 3 in the original (\(k\)-dimensional) space, and the \(y\)-coordinate is the cosine similarity to article 3 in the \(d\)-dimensional space. Discuss the plots and how the error depends on the target dimension \(d\).

    3. Implement random projection with the matrix \(M\) generated in a different way: choose each entry to be an independent, uniformly random sign, \(-1\) or \(+1\). Do part (b) again with this family of dimension reduction matrices (for \(d=10,25,50,100\)). Compare your results from the two methods.
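    The sign-matrix variant differs only in how \(M\) is sampled; a sketch reusing the same illustrative conventions as the Gaussian version:

```python
import numpy as np

def sign_projection(X, d, seed=None):
    """Like the Gaussian projection, but each entry of M is an
    independent, uniformly random sign (-1 or +1)."""
    rng = np.random.default_rng(seed)
    M = rng.choice([-1.0, 1.0], size=(d, X.shape[1]))
    return X @ M.T
```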