More Keras - CNNs and RNNs




CS256

Chris Pollett

Nov 8, 2021

Outline

Introduction

Creating Custom Layers in Keras

Example Custom Layer

The code below implements, as a custom Keras layer, the Parametric ReLU activation we discussed in the Oct 14 Lecture:

import tensorflow as tf
class ParametricRelu(tf.keras.layers.Layer):
    # the constructor here just calls the parent constructor
    def __init__(self, **kwargs):
        super(ParametricRelu, self).__init__(**kwargs)
    # in build below we add one new trainable weight, alpha,
    # per input feature, each with initial value 0
    def build(self, input_shape):
        self.alpha = self.add_weight(
            name = 'alpha', shape=(input_shape[1],),
            initializer = 'zeros',
            trainable = True
        )
        super(ParametricRelu, self).build(input_shape)
    # for an input x, the output is x when x > 0
    # and alpha * x when x <= 0
    def call(self, x):
        return tf.maximum(0., x) + self.alpha * tf.minimum(0., x)

Given a Sequential model, we could then add this layer using:

model.add(ParametricRelu())
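
For instance, a small end-to-end sketch using this layer might look like the following, where the layer sizes and random training data are purely illustrative:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,)))
model.add(ParametricRelu())                 # one trainable alpha per Dense output
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam', loss='mse')

# fit briefly on random data just to check that the alpha weights get trained
X = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
print(model.layers[1].get_weights())        # the learned alpha values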

Creating Custom Layer Connections in Keras

Custom Connection Example 2 - Mini ResNet

inputs = keras.Input(shape=(32, 32, 3), name="img")
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.Conv2D(64, 3, activation="relu")(x)
block_1_output = layers.MaxPooling2D(3)(x)

x = layers.Conv2D(64, 3, activation="relu", padding="same")(block_1_output)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
block_2_output = layers.add([x, block_1_output])  # this implements a skip connection

x = layers.Conv2D(64, 3, activation="relu", padding="same")(block_2_output)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
block_3_output = layers.add([x, block_2_output])  # this implements a skip connection

x = layers.Conv2D(64, 3, activation="relu")(block_3_output)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10)(x)

model = keras.Model(inputs, outputs, name="toy_resnet")
model.summary()
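
To train this toy ResNet, we could compile it and fit it on CIFAR-10 in the usual Keras way. The optimizer, loss, and training settings below are illustrative choices rather than part of the example above; note from_logits=True since the final Dense(10) layer has no softmax:

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

model.compile(
    optimizer=keras.optimizers.RMSprop(1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_split=0.2)
model.evaluate(x_test, y_test)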

Quiz

Which of the following is true?

  1. L1-regularization always leads to a sparse parameterization of weights.
  2. Optimization is the process by which we try to find inputs which minimize or maximize some objective function.
  3. We used the Cramer Rao theorem to prove Nesterov Momentum is the correct approach to SGD.

Convolutional Neural Nets

Convolutional Neural Net Layers

Example Feature Map

CNN Motivation Biology

Figures: Eye Anatomy with Fovea; Eye Movement Example; V Areas of the Brain

Why are CNNs good for image processing?

Stages of a CNN Layer and Pooling

Variants on a Basic CNN Layer

What is the effect of having multiple CNN layers?

CNN Architecture

CNN Architecture - LeNet-5 Example

Due to Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 1998.

  1. Inputs: 32 x 32 pixel digit images
  2. Hidden Layer 1: Uses 5x5 kernel (with bias this means 26 weights), outputs 6, 28 x 28 feature maps.
  3. Hidden Layer 2: Uses 2x2 pixel maxpool layer, outputs 6, 14 x 14 feature maps.
  4. Hidden Layer 3: Uses 5x5 pixel kernel, outputs 16, 10 x 10 feature maps.
  5. Hidden Layer 4: Uses 2x2 pixel maxpool layer, outputs 16, 5 x 5 feature maps.
  6. Hidden Layer 5: Uses 5x5 pixel kernel, outputs 120, 1 x 1 feature maps.
  7. Hidden Layer 6: Fully connected layer with 120 inputs and 84 outputs.
  8. Output: 10 radial basis functions each getting 84 inputs, and whose outputs correspond to the different digits.
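
To make these layer shapes concrete, here is a rough Keras sketch of the LeNet-5 architecture as described in the list above. It is only an approximation of the 1998 network: the tanh activations and the plain Dense output layer below stand in for the original's squashing functions and radial basis function outputs.

from tensorflow import keras
from tensorflow.keras import layers

lenet5 = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),            # 32 x 32 grayscale digit images
    layers.Conv2D(6, 5, activation="tanh"),    # 6 feature maps, 28 x 28
    layers.MaxPooling2D(2),                    # 6 feature maps, 14 x 14
    layers.Conv2D(16, 5, activation="tanh"),   # 16 feature maps, 10 x 10
    layers.MaxPooling2D(2),                    # 16 feature maps, 5 x 5
    layers.Conv2D(120, 5, activation="tanh"),  # 120 feature maps, 1 x 1
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),       # 84 fully connected units
    layers.Dense(10),                          # one output per digit
])
lenet5.summary()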

Intro to Recurrent Neural Nets