For the problems in this homework, turn in all your plots, answers, and discussion. The corresponding code should be placed at the end of each part.
Download the image westworld.png
. It is a 1069x535 grayscale PNG
which you can think of as a \(535 \times 1069\) matrix.
Run SVD and recover the rank \(k\) approximation for \(k \in \{535,100,75,50,20,10,5,2,1\}\). In your assignment, include the recovered image for \(k=50,10,1\). Note that your recovered image may have pixel values outside the interval \([0,1]\). You can truncate those values to lie in the interval.
Why did we stop at \(535\)?
Can you propose an interpretation for the first (left and right) singular vectors?
How much memory is required to efficiently store the rank \(50\) approximation? Assume each floating point number takes one unit of memory, and make sure not to store unnecessary blocks of zeros. How much better is this than naively saving the image as a matrix of pixel values?
A word embedding is a mapping from words, to vector representations of the words. Ideally, the geometry of the vectors will capture the semantic and syntactic meaning of the words. For example, words similar in meaning should have representations which are close to each other in the vector space. A good word embedding provides a means for mapping text into vectors, from which you can then apply all the usual learning algorithms that take, as input, a set of vectors.
Word embeddings have taken NLP (natural language processing) by storm in the last few years, and have become the backbone for numerous NLP tasks such as question answering and machine translation. There are neural-network approaches to learning word embeddings, but in this question we will study a simple SVD based scheme that does a surprisingly good job at learning word embeddings.
Your input data is a word co-occurrence matrix M of the 10,000 most frequent words from a Wikipedia corpus with 1.5 billion words. Entry \(M_{ij}\) of the matrix denotes the number of
times in the corpus that the \(i\)th and \(j\)th words occur within 5 words of each other.
The file co_occur.csv.gz
contains the symmetric co-occurence matrix \(M\).
The file dictionary.txt
contains the
dictionary for interpreting this matrix: The \(i\)th row of
the dictionary is the word corresponding to the \(i\)th row and column of \(M\). The
dictionary is sorted according to the word frequencies. Therefore the first
word in the dictionary “the” is the most common word in the
corpus and the first row and column of \(M\) contain
the co-occurrence counts of “the” with every other word in the dictionary.
Before you start, make sure you can import the dataset and do a few trials to make sure you can interpret the entries of \(M\) using the dictionary.
For the rest of the problem, let us use \(\widehat{M}\) to denote the normalized matrix defined by
\[\widehat{M}_{ij} = \ln(1+M_{ij})\]Compute the rank-100 approximation to \(\widehat{M}\) according to the singular value decomposition \(\widehat{M} = U D V^T\) and plot the top 100 singular values of \(\widehat{M}\). Does \(\widehat{M}\) look close to being low rank?
[Note that computing the full SVD will be computationally intensive; it is better to compute the top 100 singular values and singular vectors, and keep those around as you will be using them later.]
Note that as the matrix \(\widehat{M}\) is symmetric, the left and right singular vectors are the same, up to flipping signs of some columns. You will now interpret the singular vectors (columns of \(U\) or \(V\)). For any \(i\), denote by \(v_i\) as the singular vector corresponding to the \(i\)th largest singular value. Note that the coordinates of this vector correspond to the 10,000 words in our dictionary.
For a given vector \(v_i\), you can see which words are most/least relevant for that vector by looking at the words corresponding to the coordinates of \(v_i\) that have the largest or smallest values. This allows you to get a rough sense for the semantic information that is captured in each of the singular vectors. Find six interesting/interpretable singular vectors, and describe what semantic or syntactic structures they capture.
For each of the six vectors you choose, provide a list of the 10 words corresponding to the coordinates with the largest values and the 10 words corresponding to the coordinates of that vector with the smallest values. Not all of the singular vectors have easy-to-interpret semantics; why would you expect this to be the case?
Normalize the rows of \(U\) so that each one has unit \(\ell_2\) norm. You can consider the \(i\)th row as giving an embedding of the \(i\)th word into \(\mathbb{R}^{100}\) (using only the columns of \(U\) corresponding to the 100 largest singular values). You will now explore a curious property of these word embedding: certain directions in the embedded space correspond to specific syntactic or semantic concepts. Let \(u_1\) be the word embedding for “woman” and let \(u_2\) be the word embedding for “man.” Denote \(u = u_1 - u_2\).
Project the embedding of the following words onto \(u\):
boy, girl, brother, sister, king, queen, he, she, john, mary, all, tree
.
Present a plot of the projections of the embeddings
of these words marked on a line. For example, if the projection of the embedding
for “girl” onto \(u\) is \(0.1\), when you should label \(0.1\) on the
line with “girl.” What do you observe?
Present a similar plot of the projections of the embeddings of the following words
onto \(u\): math, matrix, history, nurse, doctor, pilot, teacher, engineer, science, arts, literature, bob, alice
.
What do you observe? Why do you think this is the case? Do you see a potential problem with
this? (For example, suppose LinkedIn used such word embeddings to extract information from
candidates’ resumes to improve their “search for qualified job candidates” option. What might be the result of this?)
Now you will explore in more depth the property that directions in the embedded space correspond to semantic or syntactic concepts.
First, we will define a similarity metric for words: The similarity of two words is defined as the cosine-similarity between their embeddings. As all the word embedding vectors have unit \(\ell_2\) norm, the cosine similarity between two words \(i\) and \(j\) with embeddings \(w_i\) and \(w_j\) is equal to their inner product \(\langle w_i, w_j\rangle\).
Now that we have a similarity metric defined, we can explore these embeddings by querying for the closest word to any word we like. Using this definition of similarity, what are the most similar words to “washington”?
Because word embeddings capture semantic and syntactic concepts, they can be used to solve word analogy tasks. For example, consider an analogy question like
boat is to plane as captain is to ________
where the goal is to fill in the blank space. This can be solved by finding the word whose embedding that is most similar to the vector
\[w_{\mathrm{plane}} - w_{\mathrm{boat}} + w_{\mathrm{captain}}\]You can do this by nearest neighbor search across the dictionary, excluding the three words “boat, plane, captain” which already appear in the analogy as they cannot be valid answers to the question.
There is a dataset in the file analogy_task.txt
, which tests the ability of word embeddings to answer analogy questions.
Using the cosine similarity metric, find and report the accuracy of the
word embedding approach for solving the analogy task. Comment on the results and
which analogies seemed especially difficult for this approach.