Outline
- Sparse Representation
- In-Class Exercise
- Bagging
- Boosting
- Dropout
- Adversarial Training
- Start Optimization
Introduction
- We have been talking about regularization techniques -- techniques that make the trained model generalize better
- We have looked at parameter norm penalties; data augmentation; semi-supervised learning; adding noise to
inputs, outputs, and weights; multi-task learning; early stopping; and parameter tying.
- Today, we continue with several more such techniques and then switch to talking about optimizing neural networks.
Sparse Representation
- A representation of the data where many of the elements are 0 (or close to 0) is called a sparse representation.
- A sparse single neural net layer might be written as `g(W cdot f(\vec{x}))`, where `f:RR^n -> RR^n` (assume biases are folded into `W`) is such that most coordinates of `f`'s output are `0`.
- One simple choice of `f` would be a linear map `W'\vec{x}`.
- We have already discussed that `L_1`-regularization tends to lead to a sparse parameterization: most of the weights tend to be `0`.
- For neural nets, a sparse parameterization effectively means that most of the neurons are only connected to a few neurons at a given layer.
- On the other hand, a sparse representation means that, for a given input, only a few neurons fire. I.e., we have some sort of activation sparsity.
- For online training, a sparse representation will tend to mean only a few weights in a given layer update for a given data item.
- Let `vec{h} = f(\vec{x})`. We can make a norm penalty that tends to favor this by setting `\tilde{J}(\vec{theta}; X, \vec{y}) = J(\vec{theta}; X, \vec{y}) + \alpha Omega(vec{h})`.
- An example choice for `Omega(vec{h})` is the `L_1`-norm, `||h||_1 = \sum_i|h_i|`.
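As a rough sketch of how such a penalty enters the objective (the layer, the scalar readout, and the value of `alpha` below are our own illustrative choices, not from the notes):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def penalized_loss(W, x, y, alpha=0.1):
    """Task loss (squared error against a scalar target) plus an L1 penalty
    on the hidden activations, which pushes many entries of h toward 0."""
    h = relu(W @ x)            # hidden representation vec{h} = f(x)
    y_hat = h.sum()            # stand-in readout; any g applied to h works
    task_loss = (y_hat - y) ** 2
    return task_loss + alpha * np.abs(h).sum()

W = np.array([[1.0, -2.0], [0.5, 0.5]])
x = np.array([1.0, 1.0])
loss = penalized_loss(W, x, y=1.0)
```

During training, the gradient of the `alpha ||h||_1` term pushes small activations toward exactly zero, in the same way the `L_1` weight penalty sparsifies weights.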
- Another approach to representational sparsity, rather than a norm penalty, is orthogonal matching pursuit-`k` (OMP-`k`) (Pati et al., 1993), which encodes `vec{x}` as a vector `vec{h}` that solves
`arg min_{vec{h}, ||vec{h}||_{0} < k }||vec{x} - W' vec{h}||^2`
where `||vec{h}||_0` is the number of nonzero entries in `vec{h}` and where `W'` is constrained to be orthogonal.
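A minimal greedy OMP-`k` sketch (the select-then-refit steps are the standard ones; the implementation details are our own, and this version does not enforce the orthogonality constraint on `W`):

```python
import numpy as np

def omp(W, x, k):
    """Greedily build a k-sparse h approximately minimizing ||x - W h||^2."""
    n = W.shape[1]
    support = []
    residual = x.copy()
    for _ in range(k):
        # pick the column most correlated with the current residual
        scores = np.abs(W.T @ residual)
        scores[support] = -np.inf          # never re-pick a chosen column
        j = int(np.argmax(scores))
        support.append(j)
        # least-squares refit on the chosen support, then update the residual
        Ws = W[:, support]
        coef, *_ = np.linalg.lstsq(Ws, x, rcond=None)
        residual = x - Ws @ coef
    h = np.zeros(n)
    h[support] = coef
    return h

h = omp(np.eye(3), np.array([3.0, -1.0, 2.0]), k=2)
```

When the columns of `W` are orthogonal, this reduces to keeping the `k` coordinates of the least-squares solution that contribute most to reconstructing `vec{x}`.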
In-Class Exercise
- Suppose `W' = [[2, 0, 0], [0,0,-1],[0,3,0]]`, `x=[[-1],[3],[-3]]`, and OMP-2 was being used.
- What would `vec{h}` be?
- Post your solution and work to the Nov 8 In-Class Exercise Thread.
Bagging
- Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994).
- One trains several models separately and then has all models vote on the output for test examples.
- It is an example of a model averaging or ensemble method.
- The hope is that the different models are committing different kinds of errors and that on average these errors will cancel. (A little like the Wisdom of the Crowd.)
- Suppose we have `k` regression models, the `i`th making a zero-mean, normally distributed error `epsilon_i` with variance `v = E[epsilon_i^2]`. Suppose the covariances are all `c = E[epsilon_i \cdot epsilon_j]` for `i ne j`.
- Then the expected squared error of the averaged predictor is:
`E[(1/k \sum_i epsilon_i)^2] = 1/k^2 E[\sum_i (epsilon_i^2 + \sum_{j ne i} epsilon_i \cdot epsilon_j)] = 1/k v + (k-1)/k c`
- Notice when `c=0`, we've reduced the error by a factor of `1/k`.
- To do bagging with a dataset of `n` items, we make `k` new datasets. To make dataset `i`, we sample `n` items from our original dataset with replacement. Once we have these `k` datasets, we train a model on each separately and do the voting described above.
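The procedure above can be sketched as follows (the toy "model", which just predicts the mean of its training targets, is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_mean(ys):
    # toy "model": predicts the mean of its training targets
    return ys.mean()

def bagged_predict(ys, k):
    n = len(ys)
    preds = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # sample n items with replacement
        preds.append(fit_mean(ys[idx]))
    return np.mean(preds)                  # averaging: the regression analogue of voting

ys = np.array([1.0, 2.0, 3.0, 4.0])
```

For classification one would replace the final averaging with a majority vote over the `k` models' predicted labels.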
Boosting
- Boosting is another ensemble technique, which can be used to improve generalization error as well as model capacity.
- Model capacity measures the degree to which the model could be trained to fit the data. I.e., how prone to under or overfitting it is.
- Usually in boosting, rather than making a new classifier from existing ones by looking at what each classifier votes and taking the majority, we instead take a weighted sum of the outputs of the classifiers and check whether it is positive or negative.
- To determine the weights, we can use a greedy approach such as AdaBoost. We maintain a weight on each data item, initially uniform. At each round, we pick the weak classifier `C_j` with the smallest weighted error, add a term `+alpha_j C_j(x)` to our combined classifier (with `alpha_j` determined by that error), and then increase the weights of the items `C_j` misclassified so that later rounds focus on them.
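A compact AdaBoost sketch for labels in `{-1, +1}` using single-feature threshold stumps as the weak classifiers (the stump family, thresholds, and round count are our own illustrative choices):

```python
import numpy as np

def stump_predictions(x, thresholds):
    # each candidate weak classifier: sign(x - t)
    return [np.where(x > t, 1, -1) for t in thresholds]

def adaboost(x, y, thresholds, rounds=5):
    n = len(y)
    d = np.full(n, 1.0 / n)                  # data-item weights
    ensemble = []                            # (alpha, predictions) pairs
    cands = stump_predictions(x, thresholds)
    for _ in range(rounds):
        errs = [np.sum(d * (p != y)) for p in cands]
        j = int(np.argmin(errs))             # greedily pick the best weak classifier
        err = max(errs[j], 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, cands[j]))
        d *= np.exp(-alpha * y * cands[j])   # up-weight misclassified items
        d /= d.sum()
    # final classifier: sign of the weighted sum of weak-classifier outputs
    total = sum(a * p for a, p in ensemble)
    return np.sign(total)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1])
pred = adaboost(x, y, thresholds=[0.5, 1.5, 2.5])
```

Note how the final prediction is exactly the "weighted sum, then check the sign" rule described above.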
Dropout
- What do we do when the models we want to bag are too large? I.e., it's impractical to train and store `k` complete neural networks.
- Rather than sampling over different data items to generate new datasets, we use a single dataset but sample over masks `vec{\mu}` of which neurons in our model to use for a given training mini-batch (`vec{\mu}` always includes the output neurons, but may exclude any input or hidden neuron).
- When we update the weights in a mini-batch, the updated weights are written back into the shared weights of the whole model before sampling the next mask and mini-batch.
- Let `d` be the number of units that may be dropped by a mask. To make predictions when dropout is used, we typically take the geometric mean of the outputs of the masked networks:
`\tilde{P}_{ensemble}(y|vec{x}) = (\prod_{vec{\mu}}P(y|vec{x}, vec{\mu}))^{1/(2^d)}.`
Here the product is over all `2^d` possible masks.
- To make this a probability distribution we renormalize it as:
`P_{ensemble}(y|vec{x}) = (\tilde{P}_{ensemble}(y|vec{x}))/(sum_{y'} \tilde{P}_{ensemble}(y'|vec{x})).`
- Hinton et al. (2012) show that the calculation of `P_{ensemble}` can be approximated cheaply: use the final model without masks, but multiply the weights going out of each unit by the probability of including that unit in a masked network. This is called the weight scaling inference rule.
- Hinton et al.'s approximation is not exactly equivalent to `P_{ensemble}(y|vec{x})`, but it works well in practice.
- Under some settings, dropout can be shown equivalent to other regularization techniques. For example, Wager et al. (2013) show that linear regression with dropout is equivalent to linear regression with `L^2` weight decay (with a penalty scaled per input feature).
Adversarial Training
- Szegedy et al. (2014) demonstrated that even neural networks that perform at human-level accuracy (GoogLeNet on ImageNet data) have a nearly 100% error rate on examples intentionally constructed by searching for inputs `vec{x}'` near a data point `vec{x}` such that the model output on `vec{x}'` is very different from the output on `vec{x}`.
- I.e., we search for an `vec{x}'` with `||vec{x}'- vec{x}||_1` as small as possible such that the output of the model differs.
- The book gives an example where a picture of a panda, blended with an imperceptibly small multiple of the signed gradient of the loss function (a perturbation the network classifies as a nematode), changes the net's classification from panda to gibbon.
- This is called an adversarial example.
- Linear models tend to be prone to adversarial examples: perturbing each input coordinate by `epsilon` can change the output by as much as `epsilon ||vec{w}||_1`, which can be a lot in the high-dimensional spaces the examples live in.
- By training using adversarial examples, we can push MLP models to be more non-linear, and hopefully generalize better.
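The signed-gradient construction above can be sketched on a toy logistic model (our own toy setup, not GoogLeNet; the weights and input are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # negative log likelihood of label y in {0, 1}
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def adversarial_perturb(w, x, y, epsilon):
    """Move x by epsilon in the sign of the input-gradient to raise the loss."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w          # gradient of the loss w.r.t. the input x
    return x + epsilon * np.sign(grad_x)

w = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 0.1, 0.2])
x_adv = adversarial_perturb(w, x, y=1, epsilon=0.25)
```

Even though each coordinate moves by only `epsilon`, the loss on `x_adv` is strictly larger than on `x`; adversarial training adds such perturbed points (with their correct labels) to the training set.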
Optimization
- Optimization is the process by which we try to find inputs which minimize or maximize some objective function.
- Neural network training involves optimization.
- Neural network training can be very time intensive: Models may take days or weeks to train.
- So it is useful to come up with techniques that speed up the optimizations that need to be carried out in neural network training.
- While doing this, however, we should realize that neural network training is not exactly a pure optimization problem.
- We are not necessarily trying to get the weights that minimize the training cost function on the observed data; we are trying to get the weights that minimize the cost function on data generated from some underlying distribution.
More on differences between Optimization and Training
- Let `L` be the per-example loss function, `f(\vec{x}; theta)` be the predicted output on `vec{x}`, and `p_{data}` and `\hat{p}_{data}` be, respectively, the actual and empirical distributions.
- In machine learning we are trying to optimize:
`J^\star(theta) = E_{(\vec{x},y) ~ p_{data}}[L(f(\vec{x}; theta), y)]`
not
`J(theta) = E_{(\vec{x},y) ~ \hat{p}_{data}}[L(f(\vec{x}; theta), y)] = 1/m \sum_{i=1}^m L(f(\vec{x}^{(i)}; theta), y^{(i)}).`
- The quantity `J^\star(theta)` is called the risk and `J(theta)` is called the empirical risk.
- In order to get an optimization problem, we often approximate risk using the empirical risk.
- Training done using this approximation is called empirical risk minimization.
- Sometimes it is not practical to use the actual loss function `L`.
- For example, classifiers often have a 0-1 loss, whose exact minimization amounts to an integer programming problem (and hence is NP-hard).
- So we use a surrogate loss function, such as the negative log-likelihood, instead.
- Unlike pure optimization, NN training might not halt at a minimum, but when some convergence criterion, such as early stopping, is met.
- Furthermore, as it might not be practical to scan the whole dataset to do updates even under empirical risk minimization, we might use mini-batch techniques instead.
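A sketch of mini-batch SGD on an empirical-risk objective (least squares here; the batch size, step size, and synthetic data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, steps=500, batch=8, lr=0.05):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)             # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch) * Xb.T @ (Xb @ theta - yb)  # gradient of the batch's empirical risk
        theta -= lr * grad
    return theta

# data from a known linear model, so SGD should approximately recover it
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ true_theta
theta = minibatch_sgd(X, y)
```

Each step uses only `batch` items rather than all `m`, giving a cheap, noisy estimate of the full empirical-risk gradient.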
Challenges of Optimization
- Ill-Conditioning - If the Hessian is too large compared to the gradient, then we need to shrink the step size `epsilon` so that the cost function is well approximated by the first-order term of the Taylor series (so that the gradient descent step makes sense). However, doing this slows the rate of convergence of SGD even when one has a big gradient. Letting `epsilon vec{g}` be the gradient step, the ill-conditioning problem occurs when `1/2 epsilon^2 vec{g}^T H vec{g}` exceeds `epsilon vec{g}^T vec{g}`.
- Local Minima - optimization of neural nets is often nonconvex and so the cost can have many local minima. These cause a problem if their cost is too high compared to the global minimum. So we want to avoid the situation where we have a loss function with too many high-cost local minima.
- Plateaus and Saddle Points - If we use a Newton's-method-based technique like limited-memory BFGS (L-BFGS), then it is possible the technique converges to a critical point that is not a local minimum, such as a plateau or saddle point. So we hope our loss function doesn't have a lot of these.
- Cliffs and Exploding Gradients - A cliff is a region where the gradient changes rapidly, so much so that the weight update step can move one far from a good solution. One technique to avoid this is gradient clipping, where we rescale any gradient whose norm exceeds some threshold.
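A minimal sketch of the norm-clipping variant (the threshold value is an illustrative hyperparameter):

```python
import numpy as np

def clip_gradient(g, max_norm):
    """If ||g|| exceeds max_norm, rescale g to have norm exactly max_norm.
    The direction of the step is preserved; only its length is capped."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```

Near a cliff the raw gradient may be enormous; clipping keeps the update in the direction of descent while bounding how far a single step can jump.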
- Long Term Dependencies - When training especially deep networks that share weights between layers (as in a recurrent neural net), the effect of the repeated part becomes like powering a weight matrix, `W^t`. Under mild conditions, this powering is dominated by the eigenvalue of `W` with the largest magnitude. So if this magnitude is `<1`, the value and its gradient will vanish; if it is `>1`, they will explode.
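The vanishing/exploding behavior is easy to see numerically with a diagonal matrix (the eigenvalues 0.5 and 1.1 and the step count are illustrative):

```python
import numpy as np

# Repeatedly applying the same weight matrix behaves like matrix powering:
# components along eigenvalues with |lambda| < 1 vanish, |lambda| > 1 explode.
W = np.diag([0.5, 1.1])
v = np.array([1.0, 1.0])
for _ in range(50):
    v = W @ v
# after 50 steps: v[0] = 0.5**50 (vanished), v[1] = 1.1**50 (exploded)
```

Backpropagated gradients through such a chain pick up the same factors, which is why recurrent nets struggle with long time horizons.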
- Inexact Gradients - Optimization algorithms are often designed assuming we know the gradient exactly, but in practice we only have a noisy estimate of it (e.g., from mini-batch sampling) along with finite-precision issues to worry about.
- Poor Correspondence between Local and Global Structure. We might start at a point from which it is hard to train a useful model. For example, at a place where locally the gradient vanishes, but where it is not a good local minimum.