Introduction

Last week, we were talking about design processes for neural nets.
We introduced Ng (2015)'s model for this:
1. Determine your goals
2. Establish a working end-to-end pipeline as early as possible
3. Instrument your system well so you can determine bottlenecks in performance
4. Repeatedly make incremental changes
We then spent some time talking about performance metrics for the first step above, and about how to establish an end-to-end model based on the category of problem at hand.
We also looked at when to gather more data for our neural nets and at techniques to manually and automatically adjust hyperparameters.
Today, we start by looking at debugging strategies for neural nets, which will help towards the third step above.

Debugging NNs

Neural Nets are hard to debug because it is often hard to tell whether poor performance is intrinsic to the algorithm itself or whether there is a bug in the implementation.
In many cases, we don't know before trying the program what its intended behavior will be. If we train an NN on a new classification problem and it gets a 5% test error, we cannot easily tell whether this is the expected behavior or is suboptimal.
Since NN algorithms usually have multiple parts, each of which is adaptive, if one part of the program has a bug, the other parts might adapt and still give acceptable performance.
NN debugging strategies try to get around these difficulties by either:
1. Designing cases which are so simple that the correct behavior can be predicted.
2. Testing more easily understood parts of the NN implementation in isolation.

Common Debugging Strategies

Visualize the model in action: This varies by setting. For example, for object detection one can superimpose NN detections on the original images and see how accurate they look. For generative speech models, one can listen to the sounds produced by the net. This allows one to get a qualitative feel for how good the model is instead of just relying on quantitative metrics which might not fully capture what you want.
Visualize the worst mistake: If you are in a classification setting where your model outputs probabilities, look at the examples the model handles worst, for instance those where the classification was made with the smallest probability margin. This can often uncover problems with preprocessing. In our running Street View example, this kind of visualization revealed that the initial preprocessing step was cropping images too tightly, omitting some digits and so making them impossible to transcribe.
Reason about the software using training and test error: Some clues about your implementation can be found by looking at training and test errors. For example, if training error is low but test error is high, your training algorithm probably works, but you are overfitting. Alternatively, there might be a bug in saving and loading the model, so that the model used for testing is not the one that was trained. If both training and test error are large, you need to continue performing tests, such as:
- Fit a tiny dataset: If you have a high error on a training set, determine whether it is underfitting or a software defect by trying to fit a very small data set. Even a 1 item data set might be helpful to figure this out.
- Compare back-propagated derivatives to numerical derivatives: If you need to implement a gradient computation for a new operation, it is quite possible to implement it incorrectly. You should compare your result with the output of a finite difference calculation. For example, if you were implementing the derivative of `f(x) = x^n` as `f'(x) = n x^{n - 1}`, you could compare your derivative's output with the centered difference numerical estimate: `(f(x + epsilon/2) - f(x - epsilon/2))/epsilon`. If your function has a lot of variables, rather than test all the components of the gradient or Jacobian, you can try testing randomly chosen directions.
- Monitor histograms of activations and derivatives: One can plot the gradients and activations of each layer of an NN as a function of training epochs. This lets one detect if layers are saturating or if one has a vanishing gradient problem.
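
The derivative comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`centered_difference`, `relative_error`, `directional_error`), the step size `epsilon = 1e-5`, and the multi-variable test function `g(x) = sum_i x_i^n` are not from the lecture and were chosen for the example.

```python
import random

def f(x, n):
    """Function under test: f(x) = x^n."""
    return x ** n

def analytic_grad(x, n):
    """Hand-written derivative we want to verify: f'(x) = n x^(n-1)."""
    return n * x ** (n - 1)

def centered_difference(func, x, epsilon=1e-5):
    """Numerical estimate (f(x + eps/2) - f(x - eps/2)) / eps."""
    return (func(x + epsilon / 2) - func(x - epsilon / 2)) / epsilon

def relative_error(a, b):
    """Scale-free discrepancy between two derivative estimates."""
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

def g(xs, n):
    """Illustrative multi-variable function: g(x) = sum_i x_i^n."""
    return sum(x ** n for x in xs)

def g_grad(xs, n):
    """Hand-written gradient of g."""
    return [n * x ** (n - 1) for x in xs]

def directional_error(xs, n, rng, epsilon=1e-5):
    """Check the gradient along one random unit direction d instead of
    testing every component: compare grad . d with a centered difference
    of g along d, normalised by the gradient norm."""
    d = [rng.gauss(0.0, 1.0) for _ in xs]
    scale = sum(v * v for v in d) ** 0.5
    d = [v / scale for v in d]
    plus = [x + 0.5 * epsilon * v for x, v in zip(xs, d)]
    minus = [x - 0.5 * epsilon * v for x, v in zip(xs, d)]
    numeric = (g(plus, n) - g(minus, n)) / epsilon
    grad = g_grad(xs, n)
    analytic = sum(gi * v for gi, v in zip(grad, d))
    grad_norm = sum(gi * gi for gi in grad) ** 0.5
    return abs(analytic - numeric) / max(grad_norm, 1e-12)

if __name__ == "__main__":
    x = 1.7
    numeric = centered_difference(lambda v: f(v, 3), x)
    print("1-D relative error:", relative_error(analytic_grad(x, 3), numeric))
    print("directional error:", directional_error([0.5, 1.2, 2.0], 3, random.Random(0)))
```

A correct derivative should give errors many orders of magnitude below 1, while a typical bug (say, writing `n * x**n`) gives an error of order 1. The acceptance threshold is a judgment call that depends on the floating-point precision and on `epsilon`.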

Quiz

Which of the following is true?

Clipping the norm of the gradient is the easiest to compute form of gradient clipping under all performance measures.
LSTMs have forget components/gates.
Increasing the learning rate of your neural net always increases its effective capacity.

Street Map Example Revisited

We now revisit the Street View example we mentioned at the outset of our discussion of the NN design process.
This machine learning task began with cars collecting the Street View data.
A data pre-processing step was then done using other ML techniques and human input to detect where the house numbers were prior to transcribing them.
The choices for performance metrics were then made. Because maps are only useful if they are highly accurate, a goal of 98 percent accuracy (human-level accuracy) was chosen. To reach this level, coverage (the fraction of inputs the system is willing to transcribe) was initially sacrificed. Eventually, with more data, coverage was brought back to 95%.
A baseline network architecture was then chosen. This consisted of a CNN with ReLU. This was novel for the time, as sequence prediction was not usually done with CNNs. The first implementation had an output layer consisting of `n` different softmax units, one for each of the `n` characters that needed to be predicted. Each softmax unit was trained independently.
The baseline was then iteratively refined and tested. Initially, the network refused to classify an input `vec{x}` if the output sequence had `p(vec{y}|vec{x}) < t` for some threshold `t`, where `p(vec{y}|vec{x})` was defined by just multiplying all the softmax outputs together. This was changed to a specialized output layer and cost function that actually computed a correct log-likelihood, which made the rejection mechanism perform more effectively.
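
As a rough illustration of the initial rejection rule (the product of softmax outputs, not the later specialized output layer), the sketch below refuses to transcribe when `p(vec{y}|vec{x}) < t` and then measures coverage and accuracy. The function names, the three-class toy probabilities, and the threshold `t = 0.5` are assumptions made for the example; they are not taken from the Street View system.

```python
import math

def sequence_log_prob(per_char_probs, transcription):
    """log p(y|x) when p(y|x) is taken to be the product of the
    per-character softmax outputs (assumes all probabilities are > 0)."""
    return sum(math.log(probs[c]) for probs, c in zip(per_char_probs, transcription))

def transcribe_or_reject(per_char_probs, t):
    """Return the most likely character sequence, or None if p(y|x) < t."""
    best = [max(range(len(p)), key=p.__getitem__) for p in per_char_probs]
    if sequence_log_prob(per_char_probs, best) < math.log(t):
        return None  # the network refuses to classify this input
    return best

def coverage_and_accuracy(predictions, targets):
    """Coverage: fraction of inputs answered. Accuracy: fraction of the
    answered inputs that were transcribed correctly."""
    answered = [(p, y) for p, y in zip(predictions, targets) if p is not None]
    coverage = len(answered) / len(predictions)
    accuracy = (sum(p == y for p, y in answered) / len(answered)
                if answered else float("nan"))
    return coverage, accuracy

if __name__ == "__main__":
    # Two toy inputs, each with two characters over three classes.
    confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]    # product 0.72
    uncertain = [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]    # product 0.20
    preds = [transcribe_or_reject(x, t=0.5) for x in (confident, uncertain)]
    print(preds, coverage_and_accuracy(preds, [[0, 1], [0, 2]]))
```

Raising `t` trades coverage for accuracy, which is the trade-off the project tuned while holding accuracy at its target.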
As coverage was still below 90%, performance on the training and test sets was more carefully instrumented to figure out whether underfitting or overfitting was happening. Training and test error turned out to be nearly identical. This suggested underfitting or a problem with the training data.
Visualizing the worst errors led to the discovery of the cropping issue mentioned earlier. Fixing this led to a ten percentage point improvement in transcription coverage.
Finally, hyperparameter adjustments were done to gain a few additional percentage points of improvement. This mainly came from making the model larger as, after the crop tweak, the training and test error were still equal, suggesting underfitting.

Applications

We next look at how to solve applications in computer vision, speech recognition, natural language processing, etc. using NNs.
For many serious AI applications one needs a large scale network, so we begin by looking at some implementation challenges this causes.

Infrastructure for Deep Learning

One of the key factors responsible for the improvements in NN accuracy and the increase in the complexity of the tasks they perform is the dramatic increase in the size of the networks in use.
The biggest networks in the 1950s had fewer than 10 neurons and those of the 80s had fewer than 10,000 neurons, whereas those today might have more than a million.
To handle this increase in size, we need better hardware, i.e., either faster CPU implementations or GPU implementations.
Vanhoucke (2011) showed that careful tuning of a CPU implementation matters: a carefully tuned fixed-point implementation gave a three-fold speedup over a strong floating-point baseline.
So if one is trying to build a serious network it is important to take advantage of these implementation details.

GPU Implementations

GPUs have specialized hardware that makes it fast to do different kinds of matrix operations in parallel.
For example, a 3D model might be specified as a collection of points, triangles these points belong to, a skeleton these triangles are attached to, a color model for the material at different places in the model, etc.
To do 3D translations, rotations, etc, one can apply a matrix to each vertex in parallel and let this induce a change in the whole model.
Similarly, 3D to 2D mappings can be computed on vertices in parallel.
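
To make the per-vertex independence concrete, here is a small CPU-side sketch in Python. It only illustrates the structure of the computation; a GPU would run the body of the loop for many vertices at once. The rotation about the z-axis, the orthographic projection, and all function names are assumptions chosen for the example.

```python
import math

def rotation_z(theta):
    """3x3 matrix rotating points by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply(matrix, vertex):
    """Matrix-vector product for a single vertex."""
    return tuple(sum(m * v for m, v in zip(row, vertex)) for row in matrix)

def transform_all(matrix, vertices):
    """Each vertex is transformed without reference to any other vertex,
    which is what makes this workload easy to parallelize."""
    return [apply(matrix, v) for v in vertices]

def project_orthographic(vertices):
    """A very simple 3D-to-2D mapping: drop the z coordinate."""
    return [(x, y) for x, y, _ in vertices]

if __name__ == "__main__":
    triangle = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    rotated = transform_all(rotation_z(math.pi / 2), triangle)
    print(project_orthographic(rotated))
```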
To facilitate this, high end GPUs have many (5000-11000) generalized shader units, texture mapping units, and render output units.
These are programmable nowadays using relatively high-level, C-like languages and APIs such as CUDA, GLSL, Vulkan, and Metal.
The programmable units usually have a shared memory as well as a small amount of independent local memory.
You can write C functions and make use of vector and matrix built-in functions but you can't typically do recursion.
As GPUs can be used to speed up many parallelizable tasks, the OpenCL framework was developed to facilitate this.
Since NN neurons can often be updated independently of each other on a layer-by-layer basis, it was natural to try to create GPU implementations of NNs.
Steinkraus et al (2005) first showed a three-fold speedup from doing this over a CPU baseline. Shortly thereafter, Chellapilla et al (2006) showed a big speedup for supervised CNNs.
Typically, code optimization strategies for CPUs and GPUs are different. For example, on a CPU it might make sense to read values from a cache as often as possible, whereas on a GPU most writable memory locations are not cached, so it can be faster to compute the same value twice.
As it is often hard to write highly efficient GPU code, it often makes sense to use a library such as TensorFlow, Theano, or Torch that knows how to run an NN on either a CPU or a GPU. You still need to make sure that your library is compiled to take advantage of the available GPU, and test that it actually does.

Distributed Implementations

For very large problems, we might want to distribute the workload of training and inference across several machines.
Distributing inference is easy: copy the neural net to different machines and, on a given inference request, assign the request to one of these machines and have it compute the class. This is called data parallelism.
On the other hand, if multiple machines work on the same input at the same time, each running different parts of the model, one has model parallelism.
To perform training using data parallelism, one can use a technique like asynchronous SGD (Bengio et al 2001; Recht et al 2011).
In this approach, the machines share the memory representing the model parameters. Each machine/core reads the parameters without a lock for its mini-batch, computes a gradient, then updates the parameters without a lock. Some of these writes might overwrite each other, so that an update is lost. Since this algorithm increases the number of batches processed per unit time, the hope is that this will compensate for the occasional lost update.
Sometimes these shared parameters are stored on their own machine, called a parameter server (Dean et al 2012).
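
The lock-free read, compute, update pattern can be sketched with Python threads sharing a single parameter. This is a toy illustration, not a performance demonstration: in CPython the global interpreter lock prevents real parallel speedup, and the one-parameter regression problem `y = 3x`, the learning rate, and all function names are assumptions made for the example.

```python
import random
import threading

def make_data(n, true_w, seed):
    """Noise-free 1-D regression data y = true_w * x."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x) for x in xs]

def worker(params, shard, steps, batch_size, lr, seed):
    """One asynchronous SGD worker operating on its own data shard."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = [rng.choice(shard) for _ in range(batch_size)]
        w = params[0]  # read the shared parameter without a lock
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        # Unlocked read-modify-write: another worker may write in between,
        # in which case one of the two updates is lost.
        params[0] -= lr * grad

def async_sgd(num_workers=4, steps=500, batch_size=8, lr=0.1, true_w=3.0):
    data = make_data(400, true_w, seed=0)
    shards = [data[i::num_workers] for i in range(num_workers)]
    params = [0.0]  # shared model parameters (here a single weight)
    threads = [
        threading.Thread(target=worker,
                         args=(params, shards[i], steps, batch_size, lr, i + 1))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params[0]

if __name__ == "__main__":
    print("learned weight:", async_sgd())
```

Because the thread interleaving is not deterministic, the learned weight can differ slightly from run to run, but it should end up close to the true value of 3 despite any lost updates.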

Model Compression

In a commercial setting it is often more important that the cost of inference (time and memory) be low, rather than the cost of training be low.
If you don't need personalization, you can spend a lot to train a model once (say for speech recognition) and then deploy it to millions or billions of users.
To reduce the cost of inference one can use model compression:
1. Using a large model, train a function `f(vec{x})` with the limited data you have.
2. Use the large model `f(vec{x})` to generate a lot of training data, drawn according to the test distribution, to fit a smaller model which is what is deployed.
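
The second step above can be sketched as follows. To keep the example short, the "large model" is a hypothetical stand-in: an average over many small linear predictors that is assumed to have been trained already, so step 1 is not shown. The "small model" is a single line fitted by least squares to labels produced by the large model. The input distribution (uniform on [-1, 1]) and all names are assumptions for illustration.

```python
import random

def make_ensemble(num_members, seed=0):
    """Stand-in for an already-trained large model: many (slope, intercept) pairs."""
    rng = random.Random(seed)
    return [(2.0 + rng.gauss(0.0, 0.3), 1.0 + rng.gauss(0.0, 0.3))
            for _ in range(num_members)]

def large_model(ensemble, x):
    """Expensive inference: evaluate every member and average the outputs."""
    return sum(a * x + b for a, b in ensemble) / len(ensemble)

def fit_small_model(xs, ys):
    """Closed-form least-squares fit of a single line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def compress(ensemble, num_samples=1000, seed=1):
    """Label fresh inputs with the large model, then fit the small model to them."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]  # assumed test distribution
    ys = [large_model(ensemble, x) for x in xs]
    return fit_small_model(xs, ys)

if __name__ == "__main__":
    ensemble = make_ensemble(50)
    a, b = compress(ensemble)
    print("small model:", a, b, "large model at 0.5:", large_model(ensemble, 0.5))
```

In this toy case the small model can match the large one exactly, because an average of lines is itself a line. With real networks the small model only approximates the large one, and the benefit comes from the essentially unlimited training data the large model can label.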

Computer Vision

Our first application of neural nets that we will consider in detail is computer vision.
Computer vision is a broad field with applications such as:
- Recognizing faces
- Recognizing sound waves from the vibrations they induce in objects visible in a video
- General object recognition, detection, annotation.
- Transcribing sequences of symbols from an image
- Labeling each pixel with the object it belongs to
- Image restoration, removing defects

Preprocessing

Preprocessing might be required because the original input comes in a form that is difficult for a deep learning architecture to represent.
For computer vision, this might involve standardizing images so that all pixels lie in a common range, say [0,1] or [-1, 1].
Another common manipulation is to make all images the same size. This might involve cropping or scaling.
If cropping is used, then the dataset might be augmented by including crops taken at slightly different locations. This might help reduce generalization error.
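
A minimal sketch of the two manipulations above, using nested Python lists as grayscale images. The assumption that raw pixels are 8-bit values in 0..255, the crop size, the maximum shift, and the function names are all illustrative choices, not part of the lecture.

```python
import random

def scale_pixels(image, low=0.0, high=1.0):
    """Map 8-bit pixel values 0..255 linearly onto the range [low, high]."""
    return [[low + (high - low) * p / 255.0 for p in row] for row in image]

def crop(image, top, left, height, width):
    """Extract a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def jittered_crops(image, height, width, count, max_shift, rng):
    """Fixed-size crops taken at slightly different offsets around the
    centre, clamped so that every crop stays inside the image."""
    rows, cols = len(image), len(image[0])
    top0, left0 = (rows - height) // 2, (cols - width) // 2
    crops = []
    for _ in range(count):
        top = min(max(top0 + rng.randint(-max_shift, max_shift), 0), rows - height)
        left = min(max(left0 + rng.randint(-max_shift, max_shift), 0), cols - width)
        crops.append(crop(image, top, left, height, width))
    return crops

if __name__ == "__main__":
    image = [[(r * 6 + c) * 7 for c in range(6)] for r in range(6)]  # toy 6x6 image
    unit = scale_pixels(image)               # pixels in [0, 1]
    signed = scale_pixels(image, -1.0, 1.0)  # pixels in [-1, 1]
    augmented = jittered_crops(unit, 4, 4, count=5, max_shift=1, rng=random.Random(0))
    print(len(augmented), len(augmented[0]), len(augmented[0][0]))
```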
Next day, we will begin by looking at contrast normalization, a kind of preprocessing that is useful for reducing the amount of variation in a data set when one doesn't have much data.

Finish NN Design, NN Applications
