Introduction

On Monday, we began going over Ng (2015)'s design methodology for Neural Nets.
This consisted of the following steps:
1. Determine the problem to solve and the metrics by which to measure success. We talked about this in class and introduced the notions of precision, recall, and `F_1`-score.
2. Develop an architecture, including a training algorithm. In class, we gave common baseline architectures to choose from based on the demands of the learning problem.
3. Instrument the system to determine bottlenecks and performance issues, and their causes.
4. Incrementally make changes to the model, collect new data, and adjust hyperparameters.
(3) and (4) are connected, and we begin today by discussing when to collect new data.
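As a reminder of the metrics from step 1, here is a minimal sketch of computing precision, recall, and the `F_1`-score for a binary classifier. The function name and the toy label lists are made up for illustration; they are not from the lecture.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F_1) for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no positive predictions/labels.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only (not real data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and `F_1` all come out to 0.75.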

Determining Whether To Gather More Data -- Training Data

Often simply having more data will improve a NN more than tweaking its underlying model.
To see if more data might be needed, one should first ask whether performance on the existing training data is acceptable.
If it is poor (and has plateaued over the training epochs), then this suggests the learning algorithm isn't fully exploiting the data that is available.
One thing to try is to tweak the learning rate schedule or switch the optimizer you are using.
If this doesn't work, maybe you should increase the size of the model.
If this still doesn't work, then maybe the data is too noisy, or your data set contains too few examples of what you are trying to detect.
Either case suggests trying to gather cleaner, more representative data.

Determining Whether To Gather More Data -- Test Data

If the performance on the training data is good, then you should see how well it works on the test data.
If this is good, you're done. Pat yourself on the back.
If the performance on the test data is much worse than on the training data, then gather more training data (if you have deep pockets, are a big web company, etc.); it is likely you haven't trained on enough examples to generalize well.
If getting more data is not feasible, then it might make sense to adjust the kind of regularization you are using, or to try some other strategy to make the model smaller.
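The checks on training and test performance from the last two sections can be summarized as a small decision procedure. The sketch below is only one way to encode them: the function name, the single `target_error` threshold, and the `can_get_more_data` flag are assumptions made for illustration.

```python
def next_steps(train_error, test_error, target_error, can_get_more_data=True):
    """Suggest what to try next, in order, given training and test error rates.

    Assumption: "acceptable" performance means error <= target_error.
    """
    if train_error > target_error:
        # Training performance is poor: the available data is not being exploited.
        return [
            "tune the learning rate schedule or switch optimizers",
            "increase the size of the model",
            "gather cleaner, more representative data",
        ]
    if test_error <= target_error:
        # Good on both training and test data.
        return ["done"]
    # Good on training data but poor on test data: a generalization problem.
    if can_get_more_data:
        return ["gather more training data"]
    return ["adjust regularization", "make the model smaller"]


print(next_steps(0.30, 0.35, target_error=0.10))
print(next_steps(0.05, 0.25, target_error=0.10, can_get_more_data=False))
```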

How Much New Data to Gather?

To figure out how much new data you would need to bring your model to an acceptable error, one can plot training set size versus generalization error, training on different amounts of the existing data.
Fit a curve to this and try to extrapolate to where you would like your generalization error to be.
As small tweaks to the training set size usually don't have much of an effect, you probably want some multiple of the current data set size.
So it makes sense when you are plotting the graphs mentioned above to use a logarithmic scale.
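The curve-fitting step might look like the sketch below. It assumes the generalization error roughly follows a power law, `error = a * size^(-b)`, which becomes a straight line on a log-log plot; that functional form is an assumption, not something the lecture prescribes. The measurements are synthetic, generated from a known power law purely to show the mechanics.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def size_for_target(a, b, target_error):
    """Invert error = a * size**(-b) to get the size needed for target_error."""
    return (a / target_error) ** (1.0 / b)


# Training set sizes on a logarithmic scale (each one doubles the previous).
sizes = [1000, 2000, 4000, 8000]
# Synthetic errors from a made-up power law; real values would come from
# training on subsets of the existing data.
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
needed = size_for_target(a, b, target_error=0.01)
print(a, b, needed)
```

With these synthetic numbers the fit recovers `a = 2` and `b = 0.5`, and the extrapolated size for an error of 0.01 is about 40000, which is five times the largest size tried. Real learning curves are noisier, so such an extrapolation should be treated as a rough guide.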

Selecting Hyperparameters

Most neural net models come with several tweakable magic numbers (aka hyperparameters).
One can try to improve the performance of one's model by adjusting these.
The two common approaches are to adjust the hyperparameters manually, or to adjust them automatically via some algorithm that finds the values that work best.
The former approach requires understanding what the hyperparameters are doing. The latter approach doesn't require this, but often requires much more training time.

Manual Hyperparameter Tuning

The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget.
The effective capacity of a model (how easy it is to find a good enough set of weights in the space of all weights) is constrained by:
1. The representational capacity of the model (how well the space of weights captures functions related to the problem at hand)
2. The ability of the learning algorithm to minimize the cost function used to train the model.
3. The degree to which the cost function and training procedure regularize the model (i.e., allow one to find weights which will work in general on the test data).
Usually, when one plots the value of a hyperparameter versus test cost, it traces out a U-shaped curve; the goal is to find the value at the bottom of the U.
Not all of this U might be traced out by a given hyperparameter. For example, for a discrete parameter like the number of units in a hidden layer, one can only visit a few points along the curve. Other parameters, like weight decay, have a minimum value (0 in this case) or a maximum value.

Common Hyperparameter Tuning Rules of Thumb

Hyperparameter	Increases Capacity when...	Reason	Caveat
Number of Hidden Units	increased	Increasing the number of hidden units increases the representational capacity of the model.	Increasing the number of hidden units increases both the time and memory cost of every operation in the model.
Learning Rate	tuned optimally	An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure.
Convolutional Kernel Width	increased	Increasing the kernel width increases the number of parameters in the model.	A wide kernel results in a narrower output dimension, reducing the model capacity unless you use implicit zero padding to reduce this effect. Wider kernels require more memory and runtime, usually, but narrower outputs can sometimes reduce memory cost.
Implicit Zero Padding	increased	Adding implicit zeros before convolution keeps the representation size large.	Increases the time and memory cost of most operations.
Weight Decay Coefficient	decreased	Decreasing the weight decay coefficient (for `L_2`, `L_1`, etc. regularizers) frees the model parameters to become larger.
Dropout Rate	decreased	Dropping units less often gives the units more opportunities to "conspire" with each other to fit the training set.
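To make the kernel width and zero padding rows concrete, the sketch below computes the output width of a one-dimensional convolution with stride 1. The function name and the example sizes are illustrative assumptions; `padding` counts the zeros added on each side of the input.

```python
def conv_output_width(input_width, kernel_width, padding=0):
    """Output width of a stride-1 convolution with `padding` zeros on each side."""
    return input_width + 2 * padding - kernel_width + 1


# Without padding, a wider kernel gives a narrower output.
print(conv_output_width(32, 3))             # 30
print(conv_output_width(32, 7))             # 26
# Padding with (kernel_width - 1) / 2 zeros per side keeps the width unchanged.
print(conv_output_width(32, 7, padding=3))  # 32
```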

Automatic Hyperparameter Algorithms

Manual hyperparameter tuning is a kind of optimization problem.
So it makes sense that one could develop a hyperparameter optimization algorithm that wraps a learning algorithm and chooses its hyperparameters.
Often these algorithms have their own hyperparameters, such as the range of values that should be explored for each hyperparameter, but these are hopefully easier to choose.

Grid Search Hyperparameter Selection

Here the user chooses a range of values `n` (meta-hyperparameter 1) to initially search over based on prior experience with similar models. Say 0 to 10000.
This range is then split into some fixed number of discrete points `k` (meta-hyperparameter 2), using either a uniform or a logarithmic spacing (meta-hyperparameter 3). For example, if `k = 5` was chosen, then the range from above might be split as {0, 2500, 5000, 7500, 10000} or {0, 10, 100, 1000, 10000} respectively.
Training is done for each value to find the one that works the best.
The interval about this best value is then appropriately split. For example, suppose the best value was 5000. Then we might split this to {3333, 4166, 5000, 5833, 6666}.
If the best value was an end point, then the interval to its left or right is also explored. For example, if 3333 is the best value, we split the interval from 2500 to 4166 into 5 equal-sized regions.
This splitting is continued to some fixed depth `d`, (meta-hyperparameter 4).
If one has `m` hyperparameters to search, then the cost of performing grid search is `O(k^{dm})`.
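A minimal sketch of this refinement for a single hyperparameter with uniform spacing is given below. The text leaves the exact re-splitting rule open, so the sketch assumes one particular rule: the next interval extends one grid step to either side of the best point, clipped to the original range. This does not reproduce the {3333, ..., 6666} split used in the example above. The `cost` argument is a hypothetical stand-in for "train the model with this value and return its validation cost".

```python
def refine_grid_search(cost, lo, hi, k, depth):
    """Search [lo, hi] with k uniformly spaced points, refining `depth` times.

    Returns (best_value, best_cost, number_of_cost_evaluations).
    """
    range_lo, range_hi = lo, hi  # remember the original range for clipping
    best_x, best_cost = None, float("inf")
    evaluations = 0
    for _ in range(depth):
        step = (hi - lo) / (k - 1)
        for i in range(k):
            x = lo + i * step
            c = cost(x)  # in practice: train and evaluate a model
            evaluations += 1
            if c < best_cost:
                best_x, best_cost = x, c
        # Assumed refinement rule: one grid step on each side of the best point.
        lo = max(range_lo, best_x - step)
        hi = min(range_hi, best_x + step)
    return best_x, best_cost, evaluations


def toy_cost(x):
    """Hypothetical cost with its minimum at 3000; not a real training run."""
    return (x - 3000) ** 2


best_x, best_cost, evaluations = refine_grid_search(toy_cost, 0, 10000, k=5, depth=3)
print(best_x, best_cost, evaluations)
```

With this toy cost the three levels search [0, 10000], then [0, 5000], then [1250, 3750], ending at 3125 after 15 evaluations. The sketch re-evaluates points that survive from one level to the next; caching those results would save training runs.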

In-Class Exercise

Suppose we do a grid search for the hyperparameter `lambda`, using the range 0 to 100, `k=3`, and `d=3`. The optimal value of `lambda` is `21`. How would the search proceed?
Post your solutions to the Nov 29 In-Class Exercise Thread.

Random Search Hyperparameter Selection

Here the user chooses a marginal distribution for each hyperparameter (meta-hyperparameter 1): for example, a Bernoulli or multinoulli distribution for discrete hyperparameters, or a uniform distribution on a log scale for positive real-valued hyperparameters.
We then repeatedly sample all the hyperparameters according to these distributions, train, and evaluate the cost.
We keep track of the best performing setting of the hyperparameters and use that.
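The procedure can be sketched as follows. The hyperparameter names, their distributions, the number of trials, and the toy cost function are all assumptions made for illustration; in practice `cost` would train a model with the sampled setting and return its validation cost.

```python
import math
import random


def random_search(cost, samplers, trials, seed=0):
    """Sample each hyperparameter independently and keep the best setting found."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best_setting, best_cost = None, float("inf")
    for _ in range(trials):
        setting = {name: sample(rng) for name, sample in samplers.items()}
        c = cost(setting)
        if c < best_cost:
            best_setting, best_cost = setting, c
    return best_setting, best_cost


# One marginal distribution per (hypothetical) hyperparameter.
samplers = {
    # Uniform on a log scale between 1e-5 and 1e-1.
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),
    # Bernoulli with probability 0.5.
    "use_dropout": lambda rng: rng.random() < 0.5,
    # Multinoulli, here uniform over four choices.
    "hidden_units": lambda rng: rng.choice([64, 128, 256, 512]),
}


def toy_cost(setting):
    """Hypothetical cost, lowest near learning_rate=1e-3, dropout on, 256 units."""
    return (abs(math.log10(setting["learning_rate"]) + 3)
            + (0.0 if setting["use_dropout"] else 0.5)
            + abs(setting["hidden_units"] - 256) / 256)


best_setting, best_cost = random_search(toy_cost, samplers, trials=50)
print(best_setting, best_cost)
```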

NN Design Methodology
