CSE 422: Assignment #4

Principal component analysis

Due: Wed, Feb 7, 11:59pm
Gradescope Link
Dropbox for code

Problem 1: PCA on genetic data

The file pca-data.txt contains data from the 1000 Genomes Project. Each line in the file represents an individual. The first three columns contain an individual’s unique identifier, their biological sex (1=male, 2=female), and the population to which they belong. (See the encodings here.) The subsequent columns of each line are a sample of the nucleobases from that individual’s genome.

Recall that if we are going to perform PCA on the vectors \(x_1,x_2,\ldots,x_n \in \mathbb{R}^d\), then we want to demean them, i.e., first replace them by \(x_1-\bar{x},x_2-\bar{x},\ldots,x_n-\bar{x}\) where

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]

By the first \(k\) PCA components, we mean the orthonormal vectors \(v_1,v_2,\ldots,v_k\) that maximize the objective

\[\sum_{j=1}^k \sum_{i=1}^n \langle v_j, x_i\rangle^2\]

As discussed in lecture, these vectors need not be unique, but they will be unique for this data set. You may use MATLAB’s built-in pca function or Python’s sklearn package to compute the components.

For each column \(j\) of nucleobases (A, C, G, T), compute the nucleobase that occurs most frequently in that column. Call that the column-\(j\) mode.

Convert the genetic data into a binary matrix \(X\) such that \(X_{ij}=0\) if person \(i\) has the column-\(j\) mode in position \(j\) and \(X_{ij}=1\) otherwise. The \(1\) entries correspond to mutations.
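For example, one possible way to build \(X\) in Python is sketched below; the array name bases is illustrative and stands for the nucleobase columns of pca-data.txt after the three metadata columns have been stripped:

```python
import numpy as np

def mutation_matrix(bases):
    # bases: (n_individuals, n_sites) array of single-character strings 'A','C','G','T'
    # (a hypothetical name for the nucleobase columns read from pca-data.txt).
    n, d = bases.shape
    X = np.zeros((n, d), dtype=int)
    for j in range(d):
        col = bases[:, j]
        # column-j mode: the most frequently occurring nucleobase in column j
        values, counts = np.unique(col, return_counts=True)
        mode = values[np.argmax(counts)]
        # 1 marks a mutation (any base other than the column mode)
        X[:, j] = (col != mode).astype(int)
    return X
```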

(a) [0 points]

Say we ran PCA on the binary matrix \(X\) above. What would be the dimension of the returned vectors?

(b) [6 points]

We will examine the first two principal components of \(X\). These components contain a lot of information about our data set. Create a scatter plot with each of the 995 rows of \(X\) projected onto the first two principal components. In other words, the horizontal axis should correspond to \(v_1\) and the vertical axis to \(v_2\), and each individual should be projected onto the subspace spanned by \(v_1\) and \(v_2\). Your plot should use a different color for each population and include a legend. (Recall that the population label is in the third column.)
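One possible way to compute and plot the projections with sklearn is sketched below; the names X and populations are illustrative (the binary matrix from above and the population labels from the third column, respectively):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# X: binary mutation matrix (995 x d); populations: length-995 array of population labels
pca = PCA(n_components=3)        # sklearn demeans the data internally
proj = pca.fit_transform(X)      # row i of proj = projections of individual i onto v1, v2, v3

for pop in np.unique(populations):
    mask = populations == pop
    plt.scatter(proj[mask, 0], proj[mask, 1], s=5, label=pop)
plt.xlabel('projection onto first principal component')
plt.ylabel('projection onto second principal component')
plt.legend()
plt.show()
```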

(c) [7 points]

In two sentences, state one or two basic facts about the plot created in part (b). Can you interpret the first two principal components? What aspects of the data do they capture? Hint: think about history and geography.

(d) [5 points]

We will now examine the third principal component of \(X\). Create another scatter plot with each individual projected onto the subspace spanned by the first and third principal components. After plotting, play with different labeling schemes (with labels derived from the meta-data) to explain the clusters that you see. Your plot must include a legend.

(e) [5 points]

Something should have popped out at you in the plot above. In one sentence, what information does the third principal component capture?

(f) [4 points]

In this part, you will inspect the third principal component itself. Plot the nucleobase index (horizontal axis) against the absolute value of the corresponding entry of the third principal component (vertical axis). What do you notice? What’s a possible explanation? Hint: think about chromosomes.
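Continuing the sketch from part (b), the entries of the third principal component can be plotted as follows (pca and the imports are assumed from that sketch):

```python
# pca.components_[2] is the third principal component, a length-d vector.
v3 = pca.components_[2]
plt.plot(np.abs(v3), marker='.', linestyle='none')
plt.xlabel('nucleobase index')
plt.ylabel('|entry of third principal component|')
plt.show()
```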

Problem 2: PCA vs. least-squares

Both PCA and least-squares regression can be viewed as algorithms for inferring (linear) relationships among data variables. In this problem, you will develop some intuition for the differences between these two approaches, and an understanding of which settings are better suited to PCA and which are better suited to the least-squares fit.

The high level bit is that PCA is useful when there is a set of latent (hidden/underlying) variables, and all the coordinates of your data are linear combinations (plus noise) of those variables. The least-squares fit is useful when you have direct access to the independent variables, and the dependent variable is a linear combination of those known variables, plus noise.

We will consider a simple example with two variables, \(x\) and \(y\), where the true relationship between the variables is \(y = 5x\). Our goal is to recover this relationship, i.e., to recover the coefficient “5”. In part (c), we consider the setting where our data consists of the actual values of \(x\), and noisy estimates of \(y\). In part (d), we consider the case where our data consists of noisy measurements of both \(x\) and \(y\). For each part, we will evaluate the quality of the relationship recovered by PCA, and that recovered by standard least squares regression. As a reminder, least squares regression minimizes the squared error of the dependent variable from its prediction. Namely, given \((x_i,y_i)\) pairs, least squares returns the line \(\ell(x)\) that minimizes

\[\sum_i (y_i - \ell(x_i))^2\]

(a) [0 points]

Initialization: define the two routines pca-recover and ls-recover used in parts (c) and (d). Each takes the data vectors \(X\) and \(Y\) and returns a single recovered coefficient (the estimated slope relating \(y\) to \(x\)), the first via PCA and the second via least squares.
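The exact implementation is up to you; the following is one possible sketch in Python (the names pca_recover and ls_recover mirror pca-recover and ls-recover; fitting the least-squares line through the origin is an assumption, made because the true relationship \(y = 5x\) passes through the origin):

```python
import numpy as np

def pca_recover(x, y):
    # Stack the (x_i, y_i) pairs as rows, demean, and take the top principal
    # component; the recovered slope is the ratio of its two coordinates.
    data = np.column_stack([x, y])
    data = data - data.mean(axis=0)
    # top right-singular vector of the centered data = first principal component
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    v = vt[0]
    return v[1] / v[0]

def ls_recover(x, y):
    # Ordinary least squares for a line through the origin:
    # minimize sum_i (y_i - a*x_i)^2, whose solution is <x,y>/<x,x>.
    return np.dot(x, y) / np.dot(x, x)
```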

(b) [4 points]

Say the elements of \(X\) and \(Y\) were chosen independently and identically at random (e.g., each pair \((x_i, y_i)\) is uniformly distributed in the unit square \([0,1]^2\)). What would PCA recover, and what would LS recover?

(c) [5 points]

We first consider the case where \(x\) is an independent variable, and we get noisy measurements of \(y\). Fix \(X = [x_1, x_2, \ldots, x_{1000}] = [0.001, 0.002, 0.003, \ldots, 1]\). For a given noise level \(c\), let

\[\begin{align*} \hat{y}_i &\sim 5 x_i + \mathcal{N}(0,c) = 5i/1000 + \mathcal{N}(0,c) \\ \widehat{Y} &= [\hat{y}_1,\hat{y}_2,\ldots,\hat{y}_{1000}] \end{align*}\]

Make a scatter plot with \(c\) on the horizontal axis, and the output of pca-recover and ls-recover on the vertical axis. For each \(c \in [0, 0.05, 0.1, \ldots, 0.45, 0.5]\), take a sample \(\widehat{Y}\), plot the output of pca-recover as a red dot, and the output of ls-recover as a blue dot. Repeat 30 times. You should end up with a plot of 660 dots, in 11 columns of 60, half red and half blue.

Note that in numpy, the code numpy.random.randn(1000)*s generates an array of 1,000 independent samples from a \(\mathcal{N}(0,s^2)\) distribution.
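For concreteness, here is one possible shape for this experiment, reusing the pca_recover and ls_recover sketches from part (a); treating \(c\) as the standard deviation of the noise is an assumption (scale by np.sqrt(c) instead if you read \(\mathcal{N}(0,c)\) as having variance \(c\)):

```python
import numpy as np
import matplotlib.pyplot as plt
# assumes pca_recover and ls_recover from the part (a) sketch are in scope

x = np.arange(1, 1001) / 1000.0     # [0.001, 0.002, ..., 1]
cs = np.arange(0, 0.51, 0.05)       # [0, 0.05, ..., 0.5]

for c in cs:
    for _ in range(30):
        # treat c as the noise standard deviation (assumption; see lead-in)
        y_hat = 5 * x + np.random.randn(1000) * c
        plt.scatter(c, pca_recover(x, y_hat), color='red', s=5)
        plt.scatter(c, ls_recover(x, y_hat), color='blue', s=5)

plt.xlabel('noise level c')
plt.ylabel('recovered coefficient')
plt.show()
```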

(d) [5 points]

We now examine the case where our data consists of noisy estimates of both \(x\) and \(y\). For a given noise level \(c\), let

\[\begin{align*} \hat{x}_i &\sim x_i + \mathcal{N}(0,c) = i/1000 + \mathcal{N}(0,c) \\ \hat{y}_i &\sim y_i + \mathcal{N}(0,c) = 5i/1000 + \mathcal{N}(0,c) \end{align*}\]

Similar to part (c), for each \(c \in [0, 0.05, 0.1, \ldots, 0.45, 0.5]\), take a sample \(\widehat{X}\) and \(\widehat{Y}\), plot the output of pca-recover as a red dot, and the output of ls-recover as a blue dot. Repeat 40 times. You should have a plot with 440 red dots and 440 blue dots.
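This experiment is a small variant of the part (c) loop; the sketch below again assumes the pca_recover and ls_recover helpers from part (a) and treats \(c\) as a standard deviation:

```python
import numpy as np
import matplotlib.pyplot as plt
# assumes pca_recover and ls_recover from the part (a) sketch are in scope

x = np.arange(1, 1001) / 1000.0
cs = np.arange(0, 0.51, 0.05)

for c in cs:
    for _ in range(40):
        # noise is now added to both coordinates (c as standard deviation; assumption)
        x_hat = x + np.random.randn(1000) * c
        y_hat = 5 * x + np.random.randn(1000) * c
        plt.scatter(c, pca_recover(x_hat, y_hat), color='red', s=5)
        plt.scatter(c, ls_recover(x_hat, y_hat), color='blue', s=5)

plt.xlabel('noise level c')
plt.ylabel('recovered coefficient')
plt.show()
```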

(e) [9 points]

Answer the following:

  1. Why does PCA do poorly with noise in only \(Y\)?

  2. Why does PCA do well with noise in \(X\) and \(Y\)?

  3. Why does LS do poorly with noise in \(X\) and \(Y\)?