Last week, we were talking about how to conduct neural net experiments.
After discussing how to conduct and write up such experiments, we said a little about the statistics and graphs people commonly use to present how good neural nets are.
These included confusion matrices, ROC curves, etc.
We begin today by exploring a technique to maximize the use of our training data, known as cross-validation, and then discuss its statistical properties.
Maximizing Your Test Data - Cross Validation
Often one starts training a network with a limited amount of data. For example, one has only 10,000 crazy cat versus normal cat images.
If you divide the data set into test and training data, you might end up with a small test set, which in turn makes the error
in the accuracy of your test measurements larger.
If the errors in your tests are large, it might be hard to claim algorithm A works better than algorithm B.
One way to solve this problem is to use cross-validation: you split the data set into training and test data several times and
in several different ways, perform your measurements for each of these trials, and combine these measurements by computing an appropriate
aggregation such as an average (Larson 1931).
Cross-validation can be either exhaustive or non-exhaustive.
An exhaustive cross-validation is one in which all possible ways to divide the data into training and test data are considered. For example,
cycle over all possible p-subsets of the data sets, train on the data excluding the p-subset, test on the p-subset. This is called Leave-p-out
cross validation.
If the cross-validation doesn't consider all possible ways to divide the data, it is called non-exhaustive.
Two common non-exhaustive techniques are:
Repeated Random Sub-sampling Validation: Repeat `m` times: randomly choose a subset of size `p` as the test data set, train on the remaining data, and test on the chosen subset.
k-fold Cross Validation: Split the data into `k` equal-sized subsets `S_1, ..., S_k`. For `i = 1, ..., k`, train on all but subset `S_i`, then test on `S_i`; aggregate the results. (Often one chooses `k=5`.)
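The k-fold procedure above can be sketched in a few lines. Here `train` and `test_accuracy` are hypothetical stand-ins for whatever training and evaluation routines you actually use; the toy usage at the bottom is made up purely so the sketch runs.

```python
import random

def k_fold_scores(data, k, train, test_accuracy):
    """Split `data` into k folds; train on k-1 folds, test on the held-out one."""
    data = data[:]                      # copy so shuffling doesn't touch the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)
        scores.append(test_accuracy(model, test_set))
    return sum(scores) / k              # aggregate by averaging

# Toy usage: the "model" is just the rounded mean label, and "accuracy"
# is the fraction of test labels it matches.
data = [(x, x % 2) for x in range(100)]
avg = k_fold_scores(data, 5,
                    train=lambda s: round(sum(y for _, y in s) / len(s)),
                    test_accuracy=lambda m, s: sum(y == m for _, y in s) / len(s))
```

The `data[i::k]` slicing trick is one simple way to get nearly equal-sized folds after shuffling.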
Estimators
When conducting experiments we are often interested in trying to come up with a "best" prediction of some quantity of interest, say the accuracy of a neural net architecture, or the temperature today.
For example, suppose we want to determine the actual temperature. We take a series of measurements, `T_1`, `T_2`, ... `T_m`.
One possible way to estimate the temperature is to compute the average of these measurements `1/m cdot sum_i T_i`. However, it is not the only way. For example, maybe you and I were reading the thermometer, and you decide I am blind and can't read it correctly, so you throw out all my measurements and take the average of the remaining ones. We call any way of going from the raw data to an estimate of some quantity an estimator.
More precisely, a point estimator (aka statistic) is any function of the data `hat{theta}_m = g(vec{x}^{(1)}, ..., vec{x}^{(m)})`.
A "good estimator" is an estimator whose output is close to the underlying `theta` that generated the training data.
For example, assume some distribution `Pr` on real world data (the sample space, `S`). Let `X subseteq S` be some set we are trying to build a classifier for. Let `X(x)` be 1 if `x in X` and 0 otherwise, let `M(x)` denote the value our trained model gives on `x`.
The function `Correct: x -> 1 - |X(x) - M(x)|`, which is 1 when the model agrees with the true label and 0 otherwise, is a random variable.
If we took `m` samples from our data and computed the accuracy of `M(x)` on those samples, `Acc_m`, then `Acc_m` would be an estimator of `E(Correct)`.
Since `Acc_m` is a function of the data, which is assumed to be randomly drawn according to some distribution, it too is a random variable.
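To see this concretely, here is a small simulation (the correctness probability 0.8 is made up for illustration): each fresh draw of `m` samples gives a different value of `Acc_m`, with a spread that shrinks as `m` grows.

```python
import math
import random

random.seed(0)

def acc_m(m, p_correct=0.8):
    """One draw of the estimator Acc_m: fraction correct on m random samples."""
    return sum(random.random() < p_correct for _ in range(m)) / m

draws = [acc_m(100) for _ in range(1000)]          # 1000 independent draws of Acc_100
mean_est = sum(draws) / len(draws)                 # close to the true 0.8
spread = math.sqrt(sum((d - mean_est) ** 2 for d in draws) / len(draws))
# spread is close to sqrt(0.8 * 0.2 / 100) = 0.04; a bigger m would shrink it
```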
When we do an operation like `k`-fold validation, we can ask whether the result is actually a better estimator of the underlying quantity or random variable than the one we would get without it.
We call a point estimator a function estimator if the thing we are trying to approximate from the data is an underlying random variable (a function) rather than a single quantity.
Estimator Bias
The bias of an estimator is defined as: `bias(hat{theta}_m) = E(hat{theta}_m) - theta`.
As an example suppose we have a coin which is heads (value 1) with probability `theta` and tails (value 0) with probability `1-theta`. Flipping coins in this way is an example of a Bernoulli Trial and gives rise to
the Bernoulli Distribution.
We want to perform trials to estimate `theta`. Let `{x^{(1)},..., x^{(m)}}` be the result of `m` trials.
A common estimator for `theta` is `hat{theta}_m = 1/m sum_{i=1}^m x^{(i)}`.
Notice `E(hat{theta}_m) = E(1/m sum_{i=1}^m x^{(i)}) = 1/m sum_{i=1}^m E(x^{(i)}) = (1/m) cdot m cdot theta = theta`. So in this case, our estimator is unbiased, that is, its bias is 0.
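We can also check this numerically: averaging many independent runs of the estimator should land very close to the true `theta` (a made-up 0.3 here).

```python
import random

random.seed(1)
theta = 0.3                      # true heads probability (chosen for illustration)
m = 50                           # flips per trial

def theta_hat(m, theta):
    """The estimator hat{theta}_m: sample mean of m Bernoulli(theta) flips."""
    return sum(random.random() < theta for _ in range(m)) / m

# Approximate E(hat{theta}_m) by averaging many independent runs of the estimator.
runs = [theta_hat(m, theta) for _ in range(20000)]
approx_expectation = sum(runs) / len(runs)   # should be close to theta = 0.3
```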
The Gaussian Distribution and an Example Biased Estimator
The normal distribution, aka Gaussian distribution, is the distribution with sample space `RR` and probabilities given by:
`Pr(x; mu, sigma^2) = sqrt(1/(2\pi\sigma^2))exp(-1/(2\sigma^2)(x - mu)^2).`
Here `mu` and `sigma` are fixed constants (hyperparameters) and so we list them after a `;` on the left-hand side of the `Pr` equation above.
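As a sanity check, the density can be written out directly from the formula above; a standard normal (`mu = 0`, `sigma^2 = 1`) should peak at `x = 0` with density `1/sqrt(2 pi) approx 0.3989`.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Pr(x; mu, sigma^2): the normal density from the formula above."""
    return math.sqrt(1 / (2 * math.pi * sigma2)) * math.exp(-((x - mu) ** 2) / (2 * sigma2))

peak = gaussian_pdf(0.0, 0.0, 1.0)   # 1/sqrt(2*pi), about 0.3989
```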
Let `X_1, ..., X_n` be `n` independent, identically distributed random variables on the reals. Since they are identically distributed, they all have the same mean `mu = E(X_i)` and variance `sigma^2 = Var(X_i)`. The central limit theorem (de Moivre 1733) of probability theory says `sqrt(n)((sum_{i=1}^n X_i/n) - mu)` converges in distribution to a Gaussian with mean 0 and variance `sigma^2` as `n` gets large.
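A quick simulation of the theorem, using uniform random variables on `(0,1)` (so `mu = 1/2` and `sigma^2 = 1/12`): the rescaled averages should have standard deviation close to `sqrt(1/12) approx 0.2887`.

```python
import math
import random

random.seed(2)
n, trials = 400, 5000
sigma = math.sqrt(1 / 12)        # std of a single Uniform(0,1) variable

samples = []
for _ in range(trials):
    mean_n = sum(random.random() for _ in range(n)) / n
    samples.append(math.sqrt(n) * (mean_n - 0.5))   # sqrt(n)(mean - mu)

sample_std = math.sqrt(sum(s * s for s in samples) / trials)
# sample_std should be close to sigma, about 0.2887
```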
One estimator for the mean `mu` of the Gaussian distribution is the sample mean `\hat{mu}_m = 1/m(sum_{i=1}^m x^{(i)})`. One can show by the same kind of argument as on the last slide that this is an unbiased estimator.
One estimator for the variance `sigma^2` of the Gaussian distribution is the sample variance `\hat{sigma}_m^2 = 1/m sum_{i=1}^m (x^{(i)} - hat{mu}_m)^2`.
Notice `E(\hat{sigma}_m^2) = E(1/m sum_{i=1}^m (x^{(i)} - hat{mu}_m)^2)`. Writing `x^{(i)} - hat{mu}_m = (x^{(i)} - mu) - (hat{mu}_m - mu)` and using `E((x^{(i)} - mu)^2) = sigma^2`, `E((hat{mu}_m - mu)^2) = sigma^2/m`, and `E((x^{(i)} - mu)(hat{mu}_m - mu)) = sigma^2/m`, each term in the sum has expectation `sigma^2 - 2 sigma^2/m + sigma^2/m = (m-1)/m sigma^2`.
So `E(\hat{sigma}_m^2) = (m-1)/m sigma^2`, which differs from `sigma^2` by its bias `-sigma^2/m`.
To correct for this we can use the unbiased sample variance estimator `\tilde{sigma}_m^2 = 1/(m-1) sum_{i=1}^m (x^{(i)} - hat{mu}_m)^2`. (Its square root is the sample standard deviation.)
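A small simulation makes the bias visible: with tiny samples (`m = 5` here) the `1/m` estimator averages out noticeably below the true variance, while the `1/(m-1)` version averages out to `sigma^2`. Uniform data is used so the true variance is known (`1/12`).

```python
import random

random.seed(3)
m, runs = 5, 40000
true_var = 1 / 12                  # variance of Uniform(0,1)

biased, unbiased = [], []
for _ in range(runs):
    xs = [random.random() for _ in range(m)]
    mu_hat = sum(xs) / m
    ss = sum((x - mu_hat) ** 2 for x in xs)
    biased.append(ss / m)          # 1/m version: expectation (m-1)/m * sigma^2
    unbiased.append(ss / (m - 1))  # 1/(m-1) version: expectation sigma^2

mean_biased = sum(biased) / runs      # close to (4/5) * 1/12, about 0.0667
mean_unbiased = sum(unbiased) / runs  # close to 1/12, about 0.0833
```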
Cross-Validation and Bias
The accuracy measure one gets using cross-validation turns out to be a biased estimator of the underlying expected correctness, because each model is trained on a strict subset of all of the data.
Unfortunately, it is hard to prove confidence intervals around the accuracy one gets via cross validation.
In most situations, the effect of the cross-validation bias will tend to be conservative, i.e., the reported accuracy will be less than the actual accuracy. One has to be careful, though: one can sometimes end up tuning one's hyperparameters to falsely improve the cross-validation accuracy.
Quiz
Which of the following is true?
The kernel version of the S-K algorithm requires that the kernel be an RBF.
We showed how to rotate an image using numpy.
An ROC curve plots the TPR versus the FPR for a binary classifier.
Recurrent, MLP, and Feedforward Networks
We are now going to start looking at networks of more than one neuron in more detail.
Such a network is typically described by a directed graph where each of the internal nodes computes some neuron-like operation.
To start we will assume that each node computes a perceptron. This kind of network is called a multi-layer perceptron (MLP) network.
If we allow cycles in the graph, the network is called recurrent.
For now, we will assume our network is not recurrent, so the underlying graph is a DAG. Such a network is called a feedforward network.
As much as possible, we would like to use the mathematical notation of linear algebra to describe how our network computes things.
To do this, we will require our network to be split into layers.
We imagine our network receives its inputs as layer 0. The nodes in this layer are all in-degree 0.
For layer `i >0 `, the only edges into nodes at layer `i` come from layer `i-1` and the only edges out of layer `i` go to layer `i+1`.
To further simplify the notation, we will assume each node at layer `i` is connected to all the nodes at layer `i+1`. If we want to break a connection, we will multiply by a 0 weight on that input into the perceptron.
We call the number of nodes in a given layer, the width of the layer, and the number of layers in the network, the depth of the network.
Computing a function from a feedforward MLP
Let `g` be an activation function for a neuron, let `vec{w}` be its weights, and let `b` be its bias. Let `vec{x}` be a vector of inputs. Then a single neuron computes the function `g(vec{w} \cdot vec{x} + b)`.
Given an activation function `g`, let `vec{g}_m` be the vector-valued function that takes `vec{c} in RR^m` and computes the vector `[[g(c_1)],[...],[g(c_m)]]`.
We will be interested in `vec{ReLU}_m`, `vec{tanh}_m`, `vec{step}_m`.
To compute the value of two neurons, let `vec{w}_1` and `b_1` and `vec{w}_2` and `b_2` be the weights and biases of each neuron.
Rather than take the dot product, we arrange the `vec{w}_i` as the rows of a matrix `W` and we use the two values `b_i` as the entries of a column vector `vec{b}`.
Then the two perceptron network computes `vec{g}_2(W vec{x} + vec{b})`.
In general, a layer of `m` neurons can be computed as `vec{g}_m(W vec{x} + vec{b})` where `W` and `vec{b}` each have `m` rows.
If a feedforward neural network has layers `L_1, ..., L_n`, then it computes the function that results from composing the functions computed by the layers. I.e.,
`L_n(L_{n-1}(...L_2(L_1(vec{x}))...)).`
If we had to code this, we could imagine storing all the weights in a 3D array with elements `W_{ij}^{(k)}`, where `k` gives the layer, and `ij` give the weight of the `j`th input to the `i`th neuron. Similarly, the biases become a 2D array with elements `b_{i}^{(k)}` where `k` gives the layer and `i` says which neuron the bias belongs to. To keep things simple for now, let's assume all the neurons use the same activation.
Given an input `vec{x}` we could then have a for loop that cycles over layers from `1` to `n`. For each layer, we look up its weights and biases, then use our equation above to compute the output of this layer. We then use this output as the input on the next pass through the for loop.
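That loop might look like the following sketch, using plain lists of lists for `W` and `b` so the example stays dependency-free; the layer sizes and weights at the bottom are made up for illustration.

```python
def relu(c):
    """ReLU activation: max(0, c)."""
    return max(0.0, c)

def forward(x, W, b, g=relu):
    """Compute L_n(...L_1(x)...) where layer k maps v to g(W[k] v + b[k])."""
    v = x
    for Wk, bk in zip(W, b):                       # cycle over layers 1..n
        v = [g(sum(wij * vj for wij, vj in zip(row, v)) + bi)
             for row, bi in zip(Wk, bk)]           # g(W v + b), one row per neuron
    return v

# Tiny 2-layer example: 2 inputs -> 2 hidden neurons -> 1 output neuron.
W = [[[1.0, -1.0], [0.5, 0.5]],    # layer 1 weights: 2x2
     [[1.0, 1.0]]]                 # layer 2 weights: 1x2
b = [[0.0, 0.0], [-0.25]]          # biases per layer
y = forward([2.0, 1.0], W, b)      # -> [2.25]
```

Each pass through the loop replaces the current vector `v` with the next layer's output, exactly as in the description above.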
Cost Functions
We would now like to describe how to train `N(x)` in the presence of labeled data.
Recall that when we proved the perceptron convergence theorem, we knew that we were learning some target threshold function with weights `vec{w}^t` and we could define a function of the current weights `vec{w}` to say how far we were from the target.
This function was `||vec{w}^t - vec{w}||^2` and we showed the update rule reduced this.
For our PAC-learning result, we used the potential function `N_t(alpha) = ||alpha w - w_t||^2 + (alpha theta - theta_t)^2`, and for the S-K algorithm we used `||vec{w}^{+} - vec{w}^{-}|| - m(vec{x}'_t)`, but the idea was the same.
These functions are often called loss or cost, or potential functions.
In each case, we are trying to adjust the weights to minimize the potential function.
One class of algorithms to figure out how to adjust the weights are known as gradient descent algorithms.
Before we look at these, let's look at a couple of different cost functions.
Mean Square Error
Assume we train using a sequence of examples `(vec{x}^{(1)},y^{(1)}), (vec{x}^{(2)},y^{(2)}), ..., (vec{x}^{(m)},y^{(m)})` and let `N(vec{x})` be the output of our network.
One common choice of cost function is mean squared error (MSE) `1/m sum_i||y^{(i)} - N(vec{x}^{(i)})||^2`.
This measures the average squared error the network makes over the training set.
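As a quick numeric example, here is the MSE for a toy one-dimensional "network" (a hypothetical stand-in that just doubles its input) on three hand-picked examples:

```python
def N(x):
    """Toy stand-in for a trained network: just doubles its input."""
    return 2.0 * x

# (input, label) pairs; labels are 2x plus a little fixed "noise"
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# MSE = (1/m) sum_i (y_i - N(x_i))^2
mse = sum((y - N(x)) ** 2 for x, y in examples) / len(examples)
# (0.1^2 + 0.1^2 + 0.2^2) / 3 = 0.06 / 3 = 0.02
```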
It turns out this choice of cost function can be justified if the labels `y^{(i)}` are generated by applying an underlying function to the
`vec{x}^{(i)}`'s and adding Gaussian noise.
To get a more robust cost function for a larger variety of underlying target functions, we make use of the theory of estimators from earlier today.
Maximum Likelihood Estimation - unlabeled data
Let `mathbb{X}` be a set of examples `vec{x}^{(1)}, ..., vec{x}^{(m)}` drawn independently from sample space `S` according to some unknown data generating distribution `p_{data}(vec{x})`.
Let `p_{model}(\vec{x};\vec{theta})` be a parametric family of probability distributions over the same space `S` indexed by `vec{theta}`.
Our goal is to find a `vec{theta}` so that `p_{model}(\vec{x};\vec{theta})` is as close to `p_{data}(vec{x})` as possible.
The maximum likelihood estimator (MLE) for `vec{theta}` is defined as:
\begin{eqnarray*}
\vec{\theta}_{ML} &=& argmax_{\vec{\theta}} p_{model}(\mathbb{X}; \vec{\theta})\\
&=& argmax_{\vec{\theta}}\prod_{i=1}^m p_{model}(\vec{x}^{(i)}; \vec{\theta})
\end{eqnarray*}
I.e., we want to find the value of `vec{theta}` that maximizes the product of probabilities of the sequence of items that were observed.
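In the simplest Bernoulli setting this maximization can be done by brute force: scan a grid of `theta` values and maximize the log of the product of probabilities (the log-likelihood, which has the same argmax). For Bernoulli data the MLE works out to the sample mean, so the scan should land on the fraction of heads.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # 7 heads out of 10 observed flips

def log_likelihood(theta, data):
    """log prod_i p(x_i; theta) = sum_i log p(x_i; theta) for Bernoulli(theta)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]       # candidate theta values in (0, 1)
theta_ml = max(grid, key=lambda t: log_likelihood(t, data))
# For Bernoulli data the MLE equals the sample mean, so theta_ml is 0.7 here.
```

Taking logs turns the product over examples into a sum, which is both numerically stabler and easier to differentiate; real systems maximize the log-likelihood by gradient methods rather than grid search.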