{
"author": "Jacob Schreiber",
"contact": "jmschr@cs.washington.edu",
"title": "Bioinformatics",
"date": "4/23/15"
}
where the variables in black were known constants and the variables in red were the quantities he was trying to estimate. To make these estimates, he made several observations of a specific crater on the moon over time and wanted to infer the unknown values from those observations. A simple way to solve this problem would be to take three observations and use simple linear algebra to solve for the three constants from three equations. The problem, however, was that Mayer had many more than three observations and was unsure how to make use of them all. He ended up clustering his observations into groups chosen to maximize their contrast, and averaging within each group to reduce the problem down to a set of three equations. This was the first step towards an algorithm some of you know as least squares regression, or linear regression, one of the most basic machine learning algorithms in existence.

## What is Machine Learning?

## Linear Regression

As stated before, linear regression can handle an arbitrary number of variables, whether categorical, discrete, or continuous, with ease. There are also many extensions which do things such as feature selection (automatically assigning weights of 0 to variables that do not really contribute predictive power) while learning the weights. One of its limitations, however, is learning nonlinear contributions of features. Consider an extension to our house example: let's say the seller was going to include a single corgi (the best dogs) with the house. This would obviously be a big plus. Perhaps a second corgi would be an equally big plus. But a third corgi would have diminishing returns, by the time you get to five corgis you're probably zeroing out your marginal reward, and 80 corgis would be a wildly inappropriate number of corgis for this person to be leaving you. However, since linear regression simply adds up $w_{i}x_{i}$, it would assume the more corgis the merrier.

## Logistic Regression

Would linear regression work if you were trying to predict distinct class labels, such as 0 or 1? It might, but it would likely predict values in the midrange, and it certainly would not be bounded by 0 and 1 if we saw extreme values for our features. One might suggest rounding the prediction to whichever label is closer, but that seems like a hack. Seems like it's time for a new prediction function!

Now, if you had a super image classifier like this, what would you do with it? Identify cats on the internet, obviously.

## Model Evaluation

Let's train our model on some of the data and see what decision boundary we get. A decision boundary is the transition between the classifier predicting one class and predicting the other. Let's also run the training data through the model again, and see what accuracy we get in predicting the labels for those points.

Wow! Great performance. Just for the sake of argument, let's take some other portion of the data, which the model hasn't seen before, and run it through as well.

In [6]:

```
from IPython.display import YouTubeVideo
YouTubeVideo('3f_21N3wcX8')
```

Out[6]:

In the mid-18th century, an astronomer by the name of Tobias Mayer was interested in studying the moon. Specifically, he was interested in how, even though the moon presents the same face to the Earth in general, it wobbles slightly over the course of a month in a phenomenon known as the libration of the moon. Mayer was attempting to mathematically model this libration based on his understanding of both its elliptical orbit around the Earth and its non-perpendicular tilt with respect to the axis between the Earth and the sun. In this venture, he derived an equation describing this motion, which looked like the following:

\begin{equation} {\color{red}{\beta}} - x = y {\color{red}{\alpha}} - z {\color{red}{\alpha}} \sin {\color{red}{\theta}} \end{equation}

The general definition of machine learning is:

improve performance P at task T with regard to experience E.

At face value, this seems like a very vague definition, possibly too general to mean anything. However, the more I have delved into machine learning, the more I have found that this statement is about as precisely as you can describe the field, given its breadth. We can, however, try to incorporate more information to make the problem statement more meaningful. Let's take the words of the great 21st century philosopher Ashley Tisdale:

How do you love someone without getting hurt?

In her case the answer is likely good parenting, but we can rework this question into something more meaningful for our goal:

How do you pursue something you want while minimizing negative outcomes?

This is still a bit abstract. Computers don't necessarily know what concepts like 'goals' or 'pain' are. Let's try one more time.

In what manner should you act such that you minimize loss?

We can now make better sense of the abstract definition above. Machine learning is about building 'intelligent' systems, which take in data from the real world and make predictions as to how they should act so as to minimize a *loss function*. This loss function describes the "pain" associated with making a wrong decision, and this framing is extremely useful when it is unclear how you should act.
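As a concrete (hypothetical) example of a loss function, squared error is the canonical choice for regression: the further a prediction is from the truth, the more "pain" it incurs. A minimal sketch:

```python
import numpy as np

# Mean squared error: a canonical loss function for regression.
def squared_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# A prediction off by 2 hurts four times as much as one off by 1.
print(squared_loss(np.array([1.0]), np.array([2.0])))  # -> 1.0
print(squared_loss(np.array([1.0]), np.array([3.0])))  # -> 4.0
```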

To delve in, let's start with the model which Tobias Mayer was taking his first steps towards: linear regression. The general problem statement is that you have a set of $d$ features as your input, which can be either continuous or discrete, and you want to predict an output, where the output is some continuous number assumed to be a linear combination of the features. Your goal is to learn weights on each of these features such that you minimize your loss when making a prediction. More formally, your prediction takes the following form:

\begin{equation} \hat{y} = w_{0} + \sum\limits_{i=1}^{d} w_{i}x_{i} \end{equation}

Let's consider a simple example: predicting the price of a house. You have features about the house, such as square footage, number of bedrooms, and presence of a garage, which would likely all have positive weights of varying magnitude. However, you also have things such as local crime rate, distance to the nearest bus, and dead people in closets, which would have negative weights associated with them, as they all cause the price to go down.
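To make this concrete, here is a toy numerical version of the house example; every weight and feature value below is invented purely for illustration, not taken from any real housing data.

```python
import numpy as np

# Hypothetical weights: intercept (base price), $/sqft, $/bedroom,
# and $ lost per unit of local crime rate (all values made up).
w0 = 50000.0
w = np.array([120.0, 10000.0, -500.0])

# A hypothetical house: 2000 sqft, 3 bedrooms, crime rate of 12.
x = np.array([2000.0, 3.0, 12.0])

# The linear regression prediction: y_hat = w0 + sum(w_i * x_i)
y_hat = w0 + w.dot(x)
print(y_hat)  # -> 314000.0
```

Note how the negative weight on crime rate pulls the predicted price down, exactly as described above.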

Since it is difficult to visualize more than one input dimension, let's look at some random data I generated in one dimension, with the understanding that this trivially generalizes to as many dimensions as you want.

In [7]:

```
import numpy as np
%pylab inline
import seaborn as sns
# Generate noisy 1-D data: y = x + 25 + Gaussian noise
x = np.arange( 1, 100 )
X = x[:, np.newaxis]  # sklearn expects a 2-D feature matrix
y = 25 + np.random.randn( 99 ) * 10 + x
plt.figure( figsize=(16, 8))
plt.xlabel( "Input" )
plt.ylabel( "Output" )
plt.scatter( x, y, c='c', alpha=0.6, s=75, linewidth=0 )
```

Out[7]:

How can we determine the weights, in this case $w_{0}$ and $w_{1}$? After all, our linear regression is going to be $\hat{y} = w_{0} + w_{1}x_{1}$, since we only have one variable. We can learn them through *gradient descent*, where we take the gradient of the loss function with respect to the weights and take a step along it. Despite the complicated words, the updates end up being relatively simple. We initialize:

\begin{align} w_{0} &= 0 \\ w_{i} &= 0 \end{align}

then we iterate:

\begin{align} w_{0} &\leftarrow w_{0} + \eta ( y - \hat{y} ) \\ w_{i} &\leftarrow w_{i} + \eta x_{i} ( y - \hat{y} ) \end{align}

If we keep updating this until we converge (our predictions on two adjacent steps differ by less than some threshold), we will get a line which looks like the following!

In [8]:

```
from sklearn.linear_model import LinearRegression
clf = LinearRegression( fit_intercept=True )
clf.fit( X, y )
y_hat = clf.predict( X )
plt.figure( figsize=(16, 8))
plt.xlabel( "Input" )
plt.ylabel( "Output" )
plt.scatter( range(1,100), y, c='c', alpha=0.6, s=75, linewidth=0 )
plt.plot( range(1,100), y_hat, c='m', linewidth=2, alpha=0.66 )
```

Out[8]:
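The gradient descent updates described above can also be coded directly, rather than relying on sklearn's closed-form solver. A minimal from-scratch sketch, where the learning rate and iteration count are illustrative choices of mine:

```python
import numpy as np

# Batch gradient descent for 1-D linear regression on the same kind of
# noisy data as above (fixed seed so the run is reproducible).
rng = np.random.RandomState(0)
x = np.arange(1, 100, dtype=float)
y = 25 + rng.randn(99) * 10 + x

w0, w1 = 0.0, 0.0
eta = 5e-6
for _ in range(150000):
    y_hat = w0 + w1 * x            # current predictions
    error = y - y_hat
    w0 += eta * error.sum()        # step along the negative gradient
    w1 += eta * (x * error).sum()

print(w0, w1)  # approaches the least squares intercept (~25) and slope (~1)
```

The many small steps converge to the same answer a least squares solver would give in one shot; gradient descent earns its keep on models where no closed-form solution exists.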

In [9]:

```
x = np.concatenate( [ np.random.randn( 100 )*3+15, np.random.randn( 100 )*2.5+23 ] )
X = np.array([x]).T
y = [0]*100 + [1]*100
plt.figure( figsize=(16, 8) )
plt.xlabel("Input")
plt.ylabel("Output")
plt.scatter( x, y, linewidth=0, alpha=0.66, c='c', s=75 )
```

Out[9]:

Since we're trying to bound our predictions by 0 and 1, and are hoping for something similar to a thresholding situation, we should use the logistic function!

\begin{equation} \hat{y} = \frac{ \exp( w_{0} + \sum\limits_{i=1}^{d} w_{i}x_{i} ) }{ 1 + \exp( w_{0} + \sum\limits_{i=1}^{d} w_{i}x_{i} ) } \end{equation}
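A quick numerical sketch of this function, just to see the squashing behavior: any real-valued linear score gets mapped into the open interval (0, 1).

```python
import numpy as np

# The logistic (sigmoid) function from the equation above.
def logistic(z):
    return np.exp(z) / (1.0 + np.exp(z))

scores = np.array([-10.0, 0.0, 10.0])
print(logistic(scores))  # near 0, exactly 0.5, near 1
```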

In [10]:

```
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit( X, y )
plt.figure( figsize=(16, 8) )
plt.xlabel("Input")
plt.ylabel("Output")
plt.scatter( x, y, linewidth=0, alpha=0.66, c='c', s=75 )
r = np.arange( 5, 30, .01 )
plt.plot( r, clf.predict_proba( np.array([r]).T )[:,1], linewidth=2, alpha=0.6, c='m' )
plt.plot( r, clf.predict( np.array([r]).T ), linewidth=2, alpha=0.6, c='#FF6600' )
```

Out[10]:

Our update rules will be the same, as both of these functions are part of what is called the 'exponential family.'
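We can check that claim with a from-scratch sketch: the logistic regression update below has exactly the same form as the linear regression one. The data mimics the two clusters above; the standardization of the feature, the learning rate, and the iteration count are illustrative choices of mine.

```python
import numpy as np

# Two 1-D Gaussian clusters, labeled 0 and 1 (fixed seed for reproducibility).
rng = np.random.RandomState(0)
x = np.concatenate([rng.randn(100) * 3 + 15, rng.randn(100) * 2.5 + 23])
y = np.array([0] * 100 + [1] * 100)
xs = (x - x.mean()) / x.std()      # standardize so one learning rate behaves well

w0, w1 = 0.0, 0.0
eta = 0.01
for _ in range(2000):
    y_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * xs)))  # logistic prediction
    error = y - y_hat
    w0 += eta * error.sum()        # identical form to the linear updates
    w1 += eta * (xs * error).sum()

# Training accuracy of the learned threshold.
acc = ((1.0 / (1.0 + np.exp(-(w0 + w1 * xs))) > 0.5) == y).mean()
print(acc)
```

The only change from the linear regression loop is the prediction function; the weight updates are untouched.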

Now you may be looking at this synthetic data and seeing that the purple model doesn't seem to fit terribly well. You'd be absolutely correct: this model is very simplistic, and there may be better ones out there which perform well. But before we continue on to other models, let's consider it some more.

If we represent the logistic regression graphically, we get something like the perceptron:

Each input is a node, and all of them connect to a single logistic regression node which takes them in as inputs and runs logistic regression on them. It then spits out the value it expects to get. This generalizes naturally to neural networks, which are just multiple perceptrons stacked in layers.

This is a tiny neural network. In practice, neural networks are a lot bigger, such as in this case:

However, this is still a shallow neural network, in that it has only two hidden layers (in reality it's only one hidden layer, but we're going to skip over the details; for those interested, the first three layers make up an autoencoder, and the predictions are based on the recovery from the sparse representation layer). In practice we get models like GoogLeNet, which was Google's winning ImageNet submission in 2014 and still the state of the art in image classification.

In [11]:

```
from IPython.display import IFrame
IFrame('http://www.wired.com/2012/06/google-x-neural-network/', width='100%', height=500)
```

Out[11]:

So far we have glossed significantly over the evaluation of these models. We can make them, we can train them, but now we want to see how well they actually perform! Even a bad model trained on garbage data can be made to look good if you evaluate it the wrong way, and that is exactly the trap we need to avoid.

Your first instinct may be to take the data you trained on and see how well your model performs on it. An example would be fitting a linear regression to your data and then evaluating it based on the $r^{2}$ you get for that same data. **THIS IS NOT STATISTICALLY VALID.** Despite warnings from statisticians, testing on your training set is a faux pas which is extremely common among social scientists and physicists, and somewhat common amongst biologists. Do not do this. You will get overly optimistic results, which may be what you want if you're trying to get a paper out, but it is not robust.

Let's motivate this with pictures, and a simple machine learning algorithm.

The k-nearest neighbors algorithm is one where 'training' involves simply saving all points to memory, and testing involves finding the k nearest stored points and assigning a label based on the mode of their labels. Often it's just 1-nearest neighbor, which makes intuitive sense: if you don't want to think too hard, you store a bunch of data and give any new point the label of the closest point you've seen before.
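The 1-nearest-neighbor case is simple enough to sketch in a few lines; the helper name and the toy points here are made up for illustration.

```python
import numpy as np

# A minimal 1-nearest-neighbor classifier: "training" just stores the
# data; prediction returns the label of the closest stored point by
# Euclidean distance.
def knn1_predict(X_train, y_train, x):
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [5.0, 5.0]])
y_train = np.array([0, 1])
print(knn1_predict(X_train, y_train, np.array([1.0, 1.0])))  # -> 0
print(knn1_predict(X_train, y_train, np.array([4.0, 6.0])))  # -> 1
```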

Let's make some blob data.

In [12]:

```
from sklearn.datasets import make_blobs
X, y = make_blobs( n_samples=1000, n_features=2, cluster_std = 4.5, centers=2, random_state=42 )
o, p = X, y
cm='Spectral'
plt.figure( figsize=(15,10) )
plt.scatter( X[:, 0], X[:, 1], c=y, cmap=cm, s=75, alpha=0.7 )
```

Out[12]:

In [13]:

```
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has since been removed
# Build a grid over the feature space to visualize the decision boundary
h = 0.1
xx, yy = np.meshgrid( np.arange( X[:,0].min()-1, X[:,0].max()+1, h ),
                      np.arange( X[:,1].min()-1, X[:,1].max()+1, h ) )
X_explore, X_future, y_explore, y_future = train_test_split( X, y, test_size=0.2 )
clf = knn( n_neighbors=1, weights='distance' )
clf.fit( X_explore, y_explore )
Z = clf.predict( np.c_[xx.ravel(), yy.ravel()] ).reshape( xx.shape )
plt.figure( figsize=(15, 15) )
plt.contourf( xx, yy, Z, cmap=cm, alpha=0.4 )
plt.scatter( X_explore[:, 0], X_explore[:, 1], s=50, c=y_explore, cmap=cm, alpha=0.8 )
plt.text( -7, -10, "Accuracy = {:2.2}".format( clf.score( X_explore, y_explore ) ), fontsize=28 )
```

Out[13]:

In [14]:

```
plt.figure( figsize=(12, 12) )
plt.contourf( xx, yy, Z, cmap=cm, alpha=0.4 )
plt.scatter( X_future[:, 0], X_future[:, 1], s=80, c=y_future, cmap=cm, alpha=0.8 )
plt.text( -7, -10, "Accuracy = {:2.2}".format( clf.score( X_future, y_future ) ), fontsize=28 )
```

Out[14]:

Only 80% accuracy? That sucks, what happened? Well, in this algorithm, 'training' your model meant saving the points to memory. When you then tested on those same points, each point's nearest neighbor was itself, so of course the label predictions were correct. When you tried new data, you got worse results: the model learned a little, but not enough to generalize.

This leads to a more formal approach to model evaluation. Split your data into a 'training' set and a 'testing' set, train your model on the training data, and evaluate its performance on the testing data. If you'd like to use your entire dataset, consider a cross-validation scheme: split your data into k folds, train on k-1 of them, and test on the remainder, repeating k times so each fold is tested on exactly once and averaging the results (this can still have bias problems).
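The cross-validation scheme just described can be sketched with scikit-learn's helpers on the same kind of blob data as above; note this uses the modern `sklearn.model_selection` module rather than the older `sklearn.cross_validation` module imported earlier in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold serves as the test set once, and
# the reported score is the average held-out accuracy.
X, y = make_blobs(n_samples=1000, n_features=2, cluster_std=4.5,
                  centers=2, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=9), X, y, cv=5)
print(scores.mean())
```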

How can we shrink this divergence between the training and the testing performance, though? In general, by regularizing: trying not to learn ~as much~ from the data, so that we don't overfit to the specific data we have. For k-nearest neighbors, using more neighbors smooths the decision boundary, which acts as a form of regularization.

In [15]:

```
X_explore, X_future, y_explore, y_future = train_test_split( X, y, test_size=0.2 )
clf = knn( n_neighbors=9, weights='distance' )
clf.fit( X_explore, y_explore )
Z = clf.predict( np.c_[xx.ravel(), yy.ravel()] ).reshape( xx.shape )
plt.figure( figsize=(15, 15) )
plt.contourf( xx, yy, Z, cmap=cm, alpha=0.4 )
plt.scatter( X_explore[:, 0], X_explore[:, 1], s=50, c=y_explore, cmap=cm, alpha=0.8 )
plt.text( -7, -9, "Training Accuracy = {:2.2}".format( clf.score( X_explore, y_explore ) ), fontsize=28 )
plt.text( -7, -11, "Testing Accuracy = {:2.2}".format( clf.score( X_future, y_future ) ), fontsize=28 )
```

Out[15]: