Your typeset assignment should include all the answers, discussion, and plots specified. In addition, include (nicely formatted) code snippets as described at the end of each problem.
Consider the following problem of fitting a linear model. Given \(n\) points \(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \in \mathbb{R}^d\) along with real-valued labels \(y^{(1)}, \ldots, y^{(n)} \in \mathbb{R}\), the goal is to find a coefficient vector \(\mathbf{a}\) that minimizes the sum of the squared errors:
\begin{equation}\label{eq:fone} f(\mathbf{a}) = \sum_{i=1}^n \left(\langle \mathbf{a},\mathbf{x}^{(i)}\rangle - y^{(i)}\right)^2, \end{equation}
where \(\langle \mathbf{u},\mathbf{v}\rangle = u_1 v_1 + \cdots + u_d v_d\) is the standard inner product for vectors $\mathbf{u},\mathbf{v} \in \mathbb{R}^d$.
The data in this problem will be drawn from the following synthetic linear model. For the training data, we select \(n\) points \(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)}\) independently from a \(d\)-dimensional standard Gaussian distribution. We will then choose a “true” coefficient vector \(\mathbf{a}^* \in \mathbb{R}^d\) from the same distribution, and the label of the training point \(\mathbf{x}^{(i)}\) will be given by
\[y^{(i)} = \langle \mathbf{a}^*, \mathbf{x}^{(i)}\rangle + \eta_i\,,\]where \(\eta_i\) is random noise drawn from an \(N(0, (0.5)^2)\) distribution. (We write \(\eta_i\) to avoid a clash with the number of points \(n\).)
Note that a standard \(d\)-dimensional Gaussian is a random vector \(\mathbf{g} = (g_1,\ldots,g_d) \in \mathbb{R}^d\) with the property that the coordinates \(\{g_i\}\) are independent \(N(0,1)\) random variables.
The following Python code generates such a training set:
import numpy as np

d = 20 # dimensions of data
n = 2000 # number of data points
X = np.random.normal(0, 1, size=(n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(a_true) + np.random.normal(0, 0.5, size=(n, 1))
Least-squares regression has a simple closed-form solution given by
\begin{equation}\label{eq:sanity} \mathbf{a} = (X^T X)^{-1} X^T \mathbf{y}\,, \end{equation}
where \(X\) is the \(n \times d\) matrix with rows \(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}\) and \(\mathbf{y} = (y^{(1)},\ldots,y^{(n)})\) is the \(n\)-dimensional vector of their labels. This is the vector \(\mathbf{a} \in \mathbb{R}^d\) that minimizes the total squared error, i.e., it is the argmin of the least-squares objective \eqref{eq:fone}.
Solve for \(\mathbf{a}\) and report the value of the objective. For comparison, what is the total squared error for \(\mathbf{a} = (0,0,\ldots,0)\)?
Note that computing the closed form solution is quite slow (since it requires matrix inversion), and we will consider a faster method below that gives a good approximation. But since we will deal with small data sets, you can use the computation \eqref{eq:sanity} as a sanity check.
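A minimal sketch of this sanity check (the data generation mirrors the snippet above; the seed is an arbitrary choice for reproducibility, and `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for reproducibility
d, n = 20, 2000
X = np.random.normal(0, 1, size=(n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(a_true) + np.random.normal(0, 0.5, size=(n, 1))

def objective(a, X, y):
    """Sum of squared errors f(a)."""
    return float(np.sum((X.dot(a) - y) ** 2))

# Closed-form solution: solve (X^T X) a = X^T y instead of inverting X^T X
a_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))

print(objective(a_hat, X, y))             # near the noise floor, roughly 0.25 * n
print(objective(np.zeros((d, 1)), X, y))  # much larger baseline at a = 0
```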
Draw \(n=2000\) new (independent) samples $\hat{\mathbf{x}}^{(1)}, \ldots, \hat{\mathbf{x}}^{(n)} \in \mathbb{R}^d$ from the distribution on training points and report the average value of
\begin{equation}\label{eq:ge} \left(\langle \mathbf{a}^*, \hat{\mathbf{x}}^{(i)} \rangle - \langle \mathbf{a}, \hat{\mathbf{x}}^{(i)}\rangle\right)^2 \end{equation}
as $i$ ranges from \(1\) to \(n\). Is this value small? What does this say about how well your solution \(\mathbf{a}\) generalizes to new data?
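One way to estimate this average on fresh samples (a sketch; `a_hat` is assumed to be the least-squares solution from part (a), regenerated here with an arbitrary seed):

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, for reproducibility
d, n = 20, 2000
X = np.random.normal(0, 1, size=(n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(a_true) + np.random.normal(0, 0.5, size=(n, 1))
a_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Draw n fresh (independent) points from the same input distribution
X_new = np.random.normal(0, 1, size=(n, d))

# Average squared gap between the true and fitted predictions
gap = X_new.dot(a_true) - X_new.dot(a_hat)
gen_err = float(np.mean(gap ** 2))
print(gen_err)  # small when a_hat is close to a_true
```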
Let us refer to the quantity \eqref{eq:ge} as the generalization error when \(\hat{\mathbf{x}}^{(i)}\) is drawn from the input distribution. Write the generalization error as a (deterministic) expression involving $\mathbf{a}$ and $\mathbf{a}^*$, and give a mathematical justification for your expression. (It will help to recall Lecture #4.)
First, review this primer on gradient descent.
You will solve the same regression problem as in part (a) using gradient descent on the objective function \(f(\mathbf{a})\). Recall that the gradient is a linear operator, so:
\begin{equation} \label{eq:sep} \nabla f(\mathbf{a}) = \sum_{i=1}^n \nabla f_i(\mathbf{a})\,, \end{equation}
where \(f_i(\mathbf{a}) = \left(\langle \mathbf{a},\mathbf{x}^{(i)}\rangle-y^{(i)}\right)^2\).
Write down the expression for \(\nabla f(\mathbf{a})\).
Now use gradient descent to find a coefficient vector \(\mathbf{a}\) that approximately minimizes \(f(\mathbf{a})\) over the data. Run gradient descent three times with step sizes \(0.00007\), \(0.00035\), and \(0.0007\). For all three runs, you should start with the initial value \(\mathbf{a}_0 = (0,\ldots,0)\).
Plot the objective function value for \(20\) iterations of gradient descent for all three step sizes on the same graph. Discuss how the step size seems to affect the convergence rate of gradient descent, and feel free to experiment with other step sizes to support your argument. Also report the step size that had the best final objective value and the corresponding value of \(f\).
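The loop itself is short. Below is a minimal sketch (assuming the data from part (a), regenerated with an arbitrary seed; the gradient used here, \(2X^T(X\mathbf{a}-\mathbf{y})\), is the standard one for the sum-of-squares objective and should match your answer to the previous question):

```python
import numpy as np

np.random.seed(2)  # arbitrary seed, for reproducibility
d, n = 20, 2000
X = np.random.normal(0, 1, size=(n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(a_true) + np.random.normal(0, 0.5, size=(n, 1))

def gradient_descent(X, y, step, iters=20):
    """Full-batch gradient descent from a_0 = 0; returns the iterate and objective history."""
    a = np.zeros((X.shape[1], 1))
    history = []
    for _ in range(iters):
        grad = 2 * X.T.dot(X.dot(a) - y)  # gradient of the sum-of-squares objective
        a = a - step * grad
        history.append(float(np.sum((X.dot(a) - y) ** 2)))
    return a, history

for step in (0.00007, 0.00035, 0.0007):
    _, hist = gradient_descent(X, y, step)
    print(step, hist[-1])
```

Plotting the three `history` lists on one graph gives the comparison the problem asks for.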
Often we can decompose an objective function as \(f = f_1+\cdots+f_n\), where the gradients $\nabla f_i$ can be evaluated much faster than \(\nabla f\). This is a setting where stochastic gradient descent shines.
Instead of computing \(\nabla f(\mathbf{a}_t)\) at our current point \(\mathbf{a}_t\), we will compute a random value \(Z_t\) that gives the gradient in expectation:
\begin{equation}\label{eq:inexp} \mathbb{E}[Z_t] = \nabla f(\mathbf{a}_t)\,. \end{equation}
We do this by choosing an index \(i \in \{1,2,\ldots,n\}\) uniformly at random and then computing \(\nabla f_i(\mathbf{a}_t)\). Write down a corresponding expression for \(Z_t\) so that \eqref{eq:inexp} holds.
Use this expression to define an algorithm (called stochastic gradient descent (SGD)) with step size \(\alpha\). The key point is that your (randomized) algorithm should only compute the gradient of a single \(f_i\) in every iteration.
Remark: In general, one step of SGD should look like \(a_{t+1} = a_t - \alpha \nabla f_i(a_t)\) (in other words, don’t multiply the step size \(\alpha\) by \(n\)).
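A minimal sketch of this update rule (assuming the data from part (a), regenerated with an arbitrary seed; the step size and iteration count here are illustrative choices):

```python
import numpy as np

np.random.seed(3)  # arbitrary seed, for reproducibility
d, n = 20, 2000
X = np.random.normal(0, 1, size=(n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(a_true) + np.random.normal(0, 0.5, size=(n, 1))

def sgd(X, y, step, iters):
    """SGD from a_0 = 0, computing the gradient of one randomly chosen f_i per iteration."""
    n, d = X.shape
    a = np.zeros((d, 1))
    for _ in range(iters):
        i = np.random.randint(n)                  # uniform random index
        x_i = X[i:i + 1, :]                       # row i as a (1, d) matrix
        grad_i = 2 * (x_i.dot(a) - y[i]) * x_i.T  # gradient of f_i only
        a = a - step * grad_i                     # note: no factor of n
    return a

a_sgd = sgd(X, y, 0.0005, 20000)
final_obj = float(np.sum((X.dot(a_sgd) - y) ** 2))
print(final_obj)
```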
Run SGD using step sizes \(\{0.0005, 0.005, 0.01\}\), each for 20,000 iterations. Plot the objective function value vs. the iteration number for all three step sizes on the same graph.
Discuss how the step size appears to affect the convergence of stochastic gradient descent, and compare/contrast this to vanilla gradient descent.
Compare the performance of the two methods. How do the best final objective function values compare? How many times does each algorithm use each data point? Also report the step size that had the best final objective function value and the corresponding objective function value.
In addition to your answers, discussion, and plots, make sure to include your code for parts (c) and (e).
In the previous problem, the number of data points was much larger than the number of dimensions and hence the least-squares solution generalized quite easily (recall part (b)).
We will now consider the setting where \(d = n\), and examine the test error along with the training error. Use the following Python code for generating the training data and test data.
train_n = 200
test_n = 2000
d = 200
X_train = np.random.normal(0,1, size=(train_n,d))
a_true = np.random.normal(0,1, size=(d,1))
y_train = X_train.dot(a_true) + np.random.normal(0,0.5,size=(train_n,1))
X_test = np.random.normal(0,1, size=(test_n,d))
y_test = X_test.dot(a_true) + np.random.normal(0,0.5,size=(test_n,1))
Let us define the normalized error
\[\hat{f}(\mathbf{a}) = \frac{\| X \mathbf{a} - \mathbf{y}\|_2}{\|\mathbf{y}\|_2}\,,\]where \(X = X_{\mathrm{test}}\) or \(X = X_{\mathrm{train}}\).
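A small helper for this quantity might look as follows (a sketch; the example data mirrors the snippet above, regenerated with an arbitrary seed):

```python
import numpy as np

def normalized_error(a, X, y):
    """||X a - y||_2 / ||y||_2."""
    return float(np.linalg.norm(X.dot(a) - y) / np.linalg.norm(y))

# Example: even the true coefficients have nonzero error due to the label noise
np.random.seed(4)  # arbitrary seed, for reproducibility
train_n, d = 200, 200
X_train = np.random.normal(0, 1, size=(train_n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y_train = X_train.dot(a_true) + np.random.normal(0, 0.5, size=(train_n, 1))
print(normalized_error(a_true, X_train, y_train))
```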
As a baseline for comparison, compute the linear regression solution \(\mathbf{a} = X^{-1} \mathbf{y}\) (with \(X = X_{\mathrm{train}}\) and \(\mathbf{y} = \mathbf{y}_{\mathrm{train}}\)) without any regularization. Note that this is simpler than \eqref{eq:sanity} because \(X\) is now a square matrix.
Report the normalized training error and test error averaged over \(10\) trials.
The \(\ell_2\)-regularized objective function is given by
\begin{equation}\label{eq:reg} \sum_{i=1}^n \left(\langle \mathbf{a}, \mathbf{x}^{(i)}\rangle - y^{(i)}\right)^2 + \lambda \|\mathbf{a}\|_2^2\,, \end{equation}
where \(\lambda \geq 0\) is a regularization parameter. The argmin of \eqref{eq:reg} has a closed-form solution
\[\mathbf{a} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}\,.\]Present a plot of the normalized training and test error for \(\lambda \in \{0.0005, 0.005, 0.05, 0.5, 5, 50, 500\}\). As before, you should average over \(10\) trials.
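One way to sketch this sweep (a single trial shown; the data generation mirrors the snippet above, regenerated with an arbitrary seed):

```python
import numpy as np

np.random.seed(5)  # arbitrary seed, for reproducibility
train_n, test_n, d = 200, 2000, 200
X_train = np.random.normal(0, 1, size=(train_n, d))
a_true = np.random.normal(0, 1, size=(d, 1))
y_train = X_train.dot(a_true) + np.random.normal(0, 0.5, size=(train_n, 1))
X_test = np.random.normal(0, 1, size=(test_n, d))
y_test = X_test.dot(a_true) + np.random.normal(0, 0.5, size=(test_n, 1))

def ridge(X, y, lam):
    """Closed-form minimizer of the l2-regularized objective."""
    return np.linalg.solve(X.T.dot(X) + lam * np.eye(X.shape[1]), X.T.dot(y))

def normalized_error(a, X, y):
    return float(np.linalg.norm(X.dot(a) - y) / np.linalg.norm(y))

for lam in (0.0005, 0.005, 0.05, 0.5, 5, 50, 500):
    a = ridge(X_train, y_train, lam)
    print(lam, normalized_error(a, X_train, y_train),
          normalized_error(a, X_test, y_test))
```

For the plot, repeat this over 10 freshly drawn datasets and average the errors for each \(\lambda\) before plotting.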
Discuss the characteristics of your plot and compare your answers to part (a).
Run stochastic gradient descent (SGD) on the original objective function \(f(\mathbf{a})\), with the initial guess \(\mathbf{a}_0 = (0,\ldots,0)\).
Run SGD for 1,000,000 iterations for each different choice of the step size \(\alpha \in \{0.00005, 0.0005, 0.005\}\). Report the normalized training error and the normalized test error for each of these three settings, averaged over 10 trials.
How does the SGD solution compare with the solutions obtained using $\ell_2$ regularization? Note that SGD is minimizing the original objective function, which does not have any regularization.
In part (a) of this problem, we found the optimal solution to the original objective function with respect to the training data. How do the training and test errors of the SGD solutions compare with those of the solution in (a)? Can you explain your observations? (For this comparison, it may be helpful to also compute the normalized training and test error of the true coefficient vector, \(\hat{f}(\mathbf{a}^*)\).)
Examine the behavior of SGD in more detail for step sizes \(\alpha \in \{0.00005, 0.005\}\) and 1,000,000 iterations of SGD.
Plot the normalized training error vs. the iteration number. On the plot, draw a line parallel to the \(x\)-axis indicating the normalized error of the true model \(\mathbf{a}^*\).
Plot the normalized test error vs. the iteration number. You can compute the test error every 100 iterations of SGD to produce the plot more efficiently (so you will only need to compute the test error 10,000 times instead of 1,000,000).
Plot the \(\ell_2\) norm of the SGD solution vs. the iteration number.
Discuss the plots. What can you say about the generalization ability of SGD with different step sizes? Does the plot correspond to the intuition that a learning algorithm starts to overfit when the training error becomes too small, i.e., smaller than the noise level of the true model? How does the generalization ability of the final solution depend on the $\ell_2$ norm of the final solution?
You will now examine the effect of the starting point on the SGD solution. Fixing the step size at \(\alpha = 0.00005\) and the maximum number of iterations at 1,000,000, choose the initial point uniformly at random from the $d$-dimensional sphere of radius \(r\), for each \(r \in \{0, 0.1, 0.5, 1, 10, 20, 30\}\), and plot the average normalized training error and the average normalized test error over 10 trials as a function of \(r\).
(Note you can choose a random point on the unit sphere by sampling \(\mathbf{g} \in \mathbb{R}^d\) according to the standard $d$-dimensional Gaussian distribution, and then outputting \(\mathbf{g}/\|\mathbf{g}\|_2\).)
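Scaling the unit-sphere point from the note above by \(r\) gives a point on the sphere of radius \(r\); a sketch:

```python
import numpy as np

def random_on_sphere(d, r):
    """Uniform random point on the sphere of radius r in R^d."""
    g = np.random.normal(0, 1, size=(d, 1))  # standard d-dimensional Gaussian
    return r * g / np.linalg.norm(g)         # normalize, then scale to radius r

a0 = random_on_sphere(200, 10.0)
print(np.linalg.norm(a0))  # -> 10.0 (up to floating-point error)
```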
Comment on the results, in relation to the results from part (b) where you explored different \(\ell_2\) regularization coefficients. Can you provide an explanation for the behavior seen in this plot?
Make sure to include your code for all the parts.
Consider now the setting where \(d > n\) and we suspect that generalization should be more difficult. Take \(d = 300\) and \(n = 200\). Use the following Python code for generating the training and test data:
train_n = 200
test_n = 2000
d = 300
X_train = np.random.normal(0,1, size=(train_n,d))
a_true = np.random.normal(0,1, size=(d,1))
y_train = X_train.dot(a_true) + np.random.normal(0,0.5,size=(train_n,1))
X_test = np.random.normal(0,1, size=(test_n,d))
y_test = X_test.dot(a_true) + np.random.normal(0,0.5,size=(test_n,1))
Try to achieve the best test error you can using the techniques from the previous two problems, or by any other method. (Of course, your learning algorithm can only use the training data for this purpose, and cannot refer to a_true.)
You will receive credit based on your accuracy. Report the average test error you obtain, averaged over 1000 trials (where you re-pick a_true and the data in each trial). Feel free to use regularization, SGD, gradient descent, or any other algorithm you want to try, but clearly describe the algorithm you use in human-readable pseudocode.
Briefly discuss the approach you used, the thought process that informed your decisions, and the extent to which you believe a better test error is achievable. Your score will be based on a combination of the short discussion and the average test error you obtain. A paragraph of analysis is enough to earn full credit—you don’t need to go too crazy.
Include your code, your average normalized test error, and your analysis and discussion.