CS256
Chris Pollett
Nov 3, 2021
RMSProp with Nesterov momentum:

epsilon := starting learning rate
rho := decay rate
alpha := momentum coefficient
theta := initial weights
v := initial velocity
r := 0 // gradient accumulation variable
while stopping criteria not met:
    Sample minibatch {x[1], ..., x[m]} from training set with targets {y[1], ..., y[m]}.
    Compute interim update: tmpTheta := theta + alpha*v
    Compute gradient: g := 1/m * grad_{tmpTheta} sum_i L(f(x[i]; tmpTheta), y[i])
    Accumulate squared gradient: r := rho*r + (1 - rho)*hadamard(g, g) // pointwise product
    Update velocity: v := alpha*v - hadamard(vec(epsilon/sqrt(r_j)), g) // build the vector whose j-th component is epsilon/sqrt(r_j), then take its Hadamard product with g
    Apply update: theta := theta + v
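The loop above can be sketched in NumPy as follows. This is a minimal illustration, not a production optimizer: the function name `rmsprop_nesterov`, the toy quadratic objective, the hyperparameter values, and the small stability constant `delta` added inside the square root (to avoid dividing by zero before `r` accumulates) are all assumptions not present in the pseudocode.

```python
import numpy as np

def rmsprop_nesterov(grad_fn, theta, epsilon=0.01, rho=0.9, alpha=0.9,
                     delta=1e-8, steps=500):
    """Sketch of RMSProp with Nesterov momentum.

    grad_fn(tmp_theta) stands in for the minibatch gradient
    (1/m) * grad sum_i L(f(x[i]; tmp_theta), y[i]).
    delta is an assumed stability constant, not in the pseudocode.
    """
    v = np.zeros_like(theta)   # initial velocity
    r = np.zeros_like(theta)   # gradient accumulation variable
    for _ in range(steps):
        tmp_theta = theta + alpha * v        # interim (lookahead) update
        g = grad_fn(tmp_theta)               # gradient at the interim point
        r = rho * r + (1.0 - rho) * g * g    # accumulate squared gradient (pointwise)
        # Pointwise-scale g by epsilon/sqrt(r_j), then update the velocity.
        v = alpha * v - (epsilon / np.sqrt(r + delta)) * g
        theta = theta + v                    # apply update
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta_min = rmsprop_nesterov(lambda t: 2.0 * t, np.array([3.0, -2.0]))
```

Because RMSProp divides each coordinate of the gradient by sqrt(r_j), steps have roughly constant per-coordinate magnitude near epsilon, so the iterates jitter around the minimum rather than settling exactly on it.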