Introduction

Recall a `k`-layer feedforward network `N` can be viewed as a composition of layer functions, `L_1,..., L_k`. On an input `vec{x}`, it computes the function `N(vec{x}) := L_k(L_{k-1}(...L_2(L_1(vec{x})) ...))`.
Motivated by maximum likelihood estimator (MLE) considerations, we said the cross entropy, `C`,
`-E_{vec{y}~hat{p}_{data}}[log p_{model}(y|vec{x};vec{theta})] = - 1/m \sum_{i=1}^m[y^{(i)} \log g(vec{theta}^Tvec{x}^{(i)}) + (1 - y^{(i)}) \log (1 - g(vec{theta}^Tvec{x}^{(i)}))]`
is a good choice of error function.
Also motivated by MLE considerations, we said that the last layer, `L_k`, usually computes a softmax function of its weighted inputs, `softmax(vec{z})_i = exp(z_i)/(sum_j(exp(z_j)))`.
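As a minimal sketch of the two formulas above, the following pure-Python functions compute a softmax and the two-class cross-entropy. The function names, the max-shift used for numerical stability, and the sample numbers are illustrative assumptions, not part of the lecture.

```python
import math

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j).

    Subtracting max(z) first does not change the result but avoids overflow.
    """
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(ys, zs):
    """C = -1/m * sum_i [y_i log z_i + (1 - y_i) log(1 - z_i)]."""
    m = len(ys)
    return -sum(y * math.log(z) + (1 - y) * math.log(1 - z)
                for y, z in zip(ys, zs)) / m

# Made-up inputs: three weighted inputs, and two labelled predictions.
probs = softmax([1.0, 2.0, 3.0])
cost = cross_entropy([1, 0], [0.9, 0.2])
```

The softmax outputs are positive and sum to 1, and the cross-entropy shrinks as each predicted value moves toward its label.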
We described two weight-update techniques to search for weights which minimize cross-entropy:
1. Newton-like methods, which seek to find zeros of C' using a weight update rule based on multi-variable versions of computing `(C')/(C'')` (i.e., involving the gradient and the Hessian). For example, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
2. Gradient descent methods which directly update the weight based on `\grad C`.
Cross-entropy calculations needed for either method involve cycling over the complete training set. To avoid this, we said we could instead use a random sample of fixed size drawn from the training set. This use of "mini-batches" gives rise to the stochastic gradient descent algorithm.
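One way to picture mini-batches is the sketch below, which draws fixed-size random samples from a training set. The helper name `mini_batches`, the seed, and the toy data are hypothetical; sampling without replacement inside each batch is an assumption of this sketch.

```python
import random

def mini_batches(xs, ys, batch_size, num_batches, seed=0):
    """Yield num_batches random (inputs, outputs) samples of size batch_size."""
    rng = random.Random(seed)
    indices = range(len(xs))
    for _ in range(num_batches):
        # Sample distinct positions so inputs stay paired with their outputs.
        chosen = rng.sample(indices, batch_size)
        yield [xs[i] for i in chosen], [ys[i] for i in chosen]

# Toy training set: the label is the parity of the input.
xs = list(range(10))
ys = [v % 2 for v in xs]
batches = list(mini_batches(xs, ys, batch_size=4, num_batches=3))
```

Each update step would then evaluate the cross-entropy on one such batch rather than on all of `xs`.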
Using training considerations, we determined that ReLU or leaky ReLU hidden units often work better than those with logistic activations.
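The sketch below compares the three activations and their slopes. It assumes that one of the training considerations in question is saturation (the logistic slope shrinking toward zero for inputs of large magnitude); the leaky slope `alpha=0.01` is likewise an illustrative choice.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    return max(0.0, a)

def leaky_relu(a, alpha=0.01):
    # alpha is a hypothetical small slope used for negative inputs
    return a if a > 0 else alpha * a

def logistic_slope(a):
    s = logistic(a)
    return s * (1.0 - s)

def relu_slope(a):
    return 1.0 if a > 0 else 0.0

def leaky_relu_slope(a, alpha=0.01):
    return 1.0 if a > 0 else alpha

# Slopes (logistic, ReLU, leaky ReLU) far from the origin on either side.
slopes_at_10 = (logistic_slope(10.0), relu_slope(10.0), leaky_relu_slope(10.0))
slopes_at_minus_10 = (logistic_slope(-10.0), relu_slope(-10.0), leaky_relu_slope(-10.0))
```

At an input of 10 the logistic slope is nearly zero while both ReLU variants keep a slope of 1; at -10 only the leaky ReLU keeps a nonzero slope.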
The cross-entropy serves as a good error function for the last layer as we can directly compare the output of a unit of the last layer with the value it is supposed to give.
To give a complete training algorithm using stochastic gradient descent, we need to say how to compute updates for hidden units.

Defining our Symbols

Recall a layer `L` in our network computes a function `vec{g}(W vec{z} + vec{b})` where `vec{g}` computes a vector-valued activation, `W` is a weight matrix, `vec{z}` is the output of the previous layer, and `vec{b}` is a bias vector.
To keep things simple we will assume there is a dummy output from the previous layer that always has value 1, and we move the bias into the weight matrix, giving the form `vec{g}(W vec{z})`.
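The bias-folding trick can be checked on a small example. The sketch below uses made-up numbers and a hypothetical helper `matvec`, and omits the activation since it is applied identically in both forms.

```python
def matvec(W, z):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
z = [2.0, 1.0]

# Original form: W z + b
with_bias = [a + c for a, c in zip(matvec(W, z), b)]

# Folded form: append b as an extra column of W and a dummy 1 to z
W_aug = [row + [bias] for row, bias in zip(W, b)]
z_aug = z + [1.0]
folded = matvec(W_aug, z_aug)
```

Both `with_bias` and `folded` come out as `[4.5, 9.5]`, so the two forms compute the same weighted input.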
Assume our network has `n` layers.
We write `vec{g}_i` for the vector-valued activation function used in the `i`th layer.
We write `\mathbf{W}` for the tensor representing the weights for all layers in our network. Let `W_i` denote the weights in `\mathbf{W}` for the `i`th layer, let `W_{ij}` denote the weights of the `j`th neuron in this layer, and let `W_{ijk}` denote the weights of its `k`th component.
Recall we defined a tensor to mean a multi-linear map. `\mathbf{W}` acts linearly for each `W_i` so is a multi-linear map.
Let `vec{N}_{i}` denote the outputs of the `i`th layer of our network. We let `N(vec{x},\mathbf{W})` denote the matrix of outputs from all layers computed by our network.
If we have a set of training inputs `vec{mathbb{x}} = {vec{x}^{(1)}, ..., vec{x}^{(m)}}` and outputs `vec{mathbb{y}} = {y^{(1)}, ..., y^{(m)}}`, write `vec{mathbb{N}_i}(vec{mathbb{x}}, \mathbf{W})` for `{N_i(vec{x}^{(1)},\mathbf{W}), ..., N_i(vec{x}^{(m)},\mathbf{W})}` and `vec{mathbb{N}}(vec{mathbb{x}}, \mathbf{W})` for `{N(vec{x}^{(1)},\mathbf{W}), ..., N(vec{x}^{(m)},\mathbf{W})}`.
Let `C(mathbb(y), mathbb(z)) = - 1/m \sum_{i=1}^m[y^{(i)} \log z^{(i)} + (1 - y^{(i)}) \log (1 - z^{(i)})]` be the cross-entropy function.
Define `Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}) = C(mathbb(y), vec{\mathbb{N}_n}(vec{mathbb{x}},\mathbf{W}))`.
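To make the notation concrete, here is a small sketch of a two-layer network that returns the outputs of every layer and evaluates `Loss` as the cross-entropy of the last layer's output. The weights, the choice of a ReLU hidden layer with a logistic output, and all helper names are assumptions made for illustration.

```python
import math

def matvec(W, z):
    return [sum(w * v for w, v in zip(row, z)) for row in W]

def relu_vec(a):
    return [max(0.0, v) for v in a]

def logistic_vec(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def forward(x, weights, activations):
    """Return [N_1, ..., N_n], the outputs of every layer on input x."""
    outputs = []
    z = x
    for W, g in zip(weights, activations):
        z = g(matvec(W, z + [1.0]))  # dummy 1 appended: bias lives in W
        outputs.append(z)
    return outputs

def loss(ys, xs, weights, activations):
    """Cross-entropy between the labels and the last layer's single output."""
    total = 0.0
    for x, y in zip(xs, ys):
        out = forward(x, weights, activations)[-1][0]
        total += y * math.log(out) + (1 - y) * math.log(1 - out)
    return -total / len(xs)

weights = [
    [[1.0, -1.0, 0.0], [0.5, 0.5, 0.0]],  # W_1: 2 neurons, 2 inputs + bias
    [[1.0, -1.0, 0.0]],                   # W_2: 1 neuron, 2 inputs + bias
]
activations = [relu_vec, logistic_vec]
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1, 0]
layer_outputs = forward(xs[0], weights, activations)
value = loss(ys, xs, weights, activations)
```

Here `weights[i][j][k]` plays the role of `W_{ijk}` (with zero-based indices), and `layer_outputs` is the list of `vec{N}_i` for one input.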

Using the Chain Rule to Compute an Update

Let `epsilon` be our learning rate. Let `\mathbb{x}_t`, `\mathbb{y}_t` denote the items in the `t`th training mini-batch. Let `\mathbf{W}^{(t)}` denote the weights after `t` updates.
Using stochastic gradient descent we update the weights in the `i`th layer according to:
`W_i^{(t+1)} = W_i^{(t)} - \epsilon (\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_i)|_{{:(vec{mathbb{x}} = vec{\mathbb{x}}_t),(vec{mathbb{y}} = vec{\mathbb{y}}_t), (\mathbf{W} = \mathbf{W}^{(t)}):}}`
Here `(\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_i)` is a matrix of partials `(\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_{ijk})`.
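A single update of this kind can be sketched as follows: each weight moves against its partial derivative, scaled by the learning rate. The stand-in loss (the sum of the squared weights, whose gradient is `2 W`) and the helper name `sgd_step` are hypothetical and chosen only so the gradient is easy to write down.

```python
def sgd_step(W, grad, epsilon):
    """Return W - epsilon * grad for a weight matrix stored as a list of rows."""
    return [[w - epsilon * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

def toy_loss(W):
    # Hypothetical stand-in for Loss: sum of squares of all weights.
    return sum(w * w for row in W for w in row)

W = [[1.0, -2.0], [0.5, 0.0]]
grad = [[2 * w for w in row] for row in W]  # matrix of partials of toy_loss
W_next = sgd_step(W, grad, epsilon=0.1)
```

With this learning rate every weight shrinks by a factor of 0.8, so `toy_loss(W_next)` is smaller than `toy_loss(W)`.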
To calculate `(\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_i)` we use the chain rule:
`(\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_i)|_{{:(vec{mathbb{x}} = vec{\mathbb{x}}_t),(vec{mathbb{y}} = vec{\mathbb{y}}_t), (\mathbf{W} = \mathbf{W}^{(t)}):}} = `` (\partial C)/(\partial \mathbb{N}_i)|_{{:(vec{mathbb{N}_i} = vec{\mathbb{N}_i}^{(t)}),(vec{mathbb{y}} = vec{\mathbb{y}}_t), (\mathbf{W} = \mathbf{W}^{(t)}):}}`` \cdot (\partial vec{mathbb{N}_i})/(\partial W_i)|_{{:(vec{mathbb{N}_{i-1}} = vec{\mathbb{N}_{i-1}}^{(t)}), (\mathbf{W} = \mathbf{W}^{(t)}):}}`
We will abuse notation and depending on the two things on the right hand side of the chain rule, we will treat '`cdot`' as either multiplication, dot product, matrix product, or tensor composition as appropriate.
In the above, it is not hard to see how we could slightly perturb the outputs of layer `i-1` to approximate `(\partial vec{mathbb{N}_i})/(\partial W_i)`.
It is harder to see how to calculate `(\partial C)/(\partial \mathbb{N}_i)`.

Backpropagation

If our network has `n` layers, then we can slightly tweak the output of the last layer to compute `(\partial C)/(\partial \mathbb{N}_n)`.
We can then use the chain rule to compute the partials `(\partial C)/(\partial \mathbb{N}_i)` via the recursion: `(\partial C)/(\partial \mathbb{N}_{i-1})|_{{:(vec{mathbb{N}_{i-1}} = vec{\mathbb{N}_{i-1}}^{(t)}),(vec{mathbb{y}} = vec{\mathbb{y}}_t), (\mathbf{W} = \mathbf{W}^{(t)}):}}``= (\partial C)/(\partial \mathbb{N}_i)|_{{:(vec{mathbb{N}_i} = vec{\mathbb{N}_i}^{(t)}),(vec{mathbb{y}} = vec{\mathbb{y}}_t), (\mathbf{W} = \mathbf{W}^{(t)}):}}``cdot (\partial \mathbb{N}_i)/(\partial \mathbb{N}_{i-1})|_{{:(vec{mathbb{N}_{i-1}} = vec{\mathbb{N}_{i-1}}^{(t)}), (\mathbf{W} = \mathbf{W}^{(t)}):}}`
The whole procedure for computing the gradient `(\partial Loss(mathbb(y), vec{mathbb{x}}, \mathbf{W}))/(\partial W_i)` described on the last two slides is called backpropagation (Rumelhart et al., 1986); it gets its name from the computation above, in which the partials for layer `i-1` are obtained from those for layer `i`, working backward from the last layer.
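The recursion can be sketched on a deliberately tiny network: a chain of one-unit logistic layers with `N_i = g(w_i N_{i-1})` and a single training pair. This is an illustrative simplification, not the general tensor case, and it uses the closed-form derivative of the cross-entropy at the last layer rather than a numerical tweak. A finite-difference estimate is included only as a check.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, ws):
    """Outputs N_0 = x, N_1, ..., N_n of a chain of one-unit logistic layers."""
    outs = [x]
    for w in ws:
        outs.append(logistic(w * outs[-1]))
    return outs

def cost(y, out):
    return -(y * math.log(out) + (1 - y) * math.log(1 - out))

def backprop(x, y, ws):
    outs = forward(x, ws)
    n = len(ws)
    # dC/dN_n: differentiate the cross-entropy with respect to the last output
    dC_dN = -(y / outs[n]) + (1 - y) / (1 - outs[n])
    grads = [0.0] * n
    for i in range(n, 0, -1):
        g_prime = outs[i] * (1 - outs[i])             # logistic slope at layer i
        grads[i - 1] = dC_dN * g_prime * outs[i - 1]  # dC/dw_i = dC/dN_i * dN_i/dw_i
        dC_dN = dC_dN * g_prime * ws[i - 1]           # dC/dN_{i-1} = dC/dN_i * dN_i/dN_{i-1}
    return grads

def numeric_grads(x, y, ws, h=1e-6):
    """Finite-difference estimate of each dC/dw_i, used only as a sanity check."""
    base = cost(y, forward(x, ws)[-1])
    result = []
    for i in range(len(ws)):
        bumped = list(ws)
        bumped[i] += h
        result.append((cost(y, forward(x, bumped)[-1]) - base) / h)
    return result

ws = [0.5, -1.2, 0.8]
analytic = backprop(1.5, 1, ws)
numeric = numeric_grads(1.5, 1, ws)
```

The loop runs from the last layer down to the first, reusing `dC_dN` at each step, which is the backward propagation the name refers to. The two gradient lists agree to within the finite-difference error.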

Quiz

Which of the following is true?

Mean squared error is an example of a cost function that might be used to train a neural net.
We called a function `f:RR^m->RR` convex if its Hessian was negative definite.
All gradient descent algorithms make use of mini-batches.

Computing Partial Derivatives

The preceding description of the partials needed to compute the update for `W_i^{(t+1)}` is slightly simplified.
For example, we could imagine computing `(\partial \mathbb{N}_i)/(\partial \mathbb{N}_{i-1})` as `(\partial \mathbb{N}_i)/(\partial vec{g}_i) cdot (\partial vec{g}_i)/(\partial \mathbb{N}_{i-1})`.
Individual components of `(\partial \mathbb{N}_i)/(\partial vec{g}_i)` might be useful in computing other partial derivatives in the whole algorithm.
To avoid recomputing partial derivatives, most software systems break `N` down into its set of operations (one such operation might, say, compute a component of `vec{g}_i`), and define a computation graph of what needs to be computed before what for both the operations and their associated derivatives.
A computation graph is symbolic: you can think of the nodes in this graph as holding references to the functions to call to evaluate that node.
Given a numerical input to such a graph we can evaluate the nodes in the graph.
To get values for derivatives, there are two general approaches:
1. Use symbol-to-number differentiation (what Torch and Caffe do). Here when we need to compute `(\partial f)/(\partial x_i)` at a point `vec{x}`, we just use `(f(vec{x} + h\cdot hat{e}_i) - f(vec{x}))/h` where `hat{e}_i` is the `i`th unit basis vector and `h` is a small real value like `0.01`. In this case the derivative's value is returned and memoized if need be.
2. Use symbol-to-symbol differentiation (what Theano and TensorFlow do). Here we have a reference to a function computing `(f(vec{x} + h\cdot hat{e}_i) - f(vec{x}))/h` as part of our computation graph. Each node has an eval function for when we need to compute an actual value. This allows higher-order derivatives to also be computed, say if we want to use limited-memory BFGS as our update mechanism.

TensorFlow

TensorFlow is a neural network package developed by the Google Brain team at Google.
It was initially released under an Apache Licence in November, 2015, but was based on an earlier system called DistBelief that had been used internally at Google since 2011.
To install TensorFlow for use with python, we can do:
```
pip install tensorflow
```
This will get a version of TensorFlow suitable for the experiments in this class, although it probably won't make use of your computer's GPU.
To use the TensorFlow module, you then put something like:
```
import tensorflow as tf
```
at the start of your program.
For a more production-oriented installation, you can check out the TensorFlow Install Instructions.
One thing to be aware of: if you use the simpler instructions above, some TensorFlow operations, such as creating a session, might produce a warning like:
```
...
TensorFlow library wasn't compiled to use SSE4.2 instructions, 
but these are available on your machine and could speed up CPU 
computations.
...
```
To suppress these (and other) warnings, you can add:
```
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
```
to the start of your code.

What TensorFlow Lets Us Do

TensorFlow allows us to do the following main operations:
1. Define computational graphs of neural networks.
2. Train these graphs.
3. Run the trained model on some inputs.

Computation Graph Nodes

A TensorFlow computation graph consists of nodes which are Tensor objects.
There are three basic kinds of leaf nodes:
1. constants - these are values that are known before the computation begins and do not change during the computation or during training.
2. placeholders - these are slots where input values to the graph can be placed. These values do not change during training.
3. Variables - these are values that will change during training. I.e., they will serve as our neural net weights.
Graphs can be built from leaf nodes by combining one or more existing graphs using some operation, such as +, *, sin, maximum, log_sigmoid, etc.
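The three kinds of leaf nodes and the way operations combine them can be mimicked with a toy graph in plain Python. The classes below are hypothetical stand-ins written for illustration and are not TensorFlow's own classes or API.

```python
class Constant:
    def __init__(self, value):
        self.value = value          # fixed before the computation begins

    def eval(self, feed):
        return self.value

class Placeholder:
    def eval(self, feed):
        return feed[self]           # value must be supplied when the graph is run

class Variable:
    def __init__(self, initial):
        self.value = initial        # may be overwritten during training

    def eval(self, feed):
        return self.value

class Op:
    """Interior node: applies fn to the values of its input nodes."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def eval(self, feed):
        return self.fn(*(node.eval(feed) for node in self.inputs))

c = Constant(20.0)
x = Placeholder()
w = Variable(0.4)
graph = Op(lambda a, b: a + b, Op(lambda a, b: a * b, x, w), c)

before = graph.eval({x: 5.0})       # uses w = 0.4
w.value = 3.0                       # a training step would change w like this
after = graph.eval({x: 5.0})        # same graph, new Variable value
```

Building `graph` performs no arithmetic; values only appear when `eval` is called with a feed for the placeholder.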

Example Building and Running Computation Graph

Here is an example of creating a small computation graph:

```
node1 = tf.constant(5.0, dtype=tf.float32)
node2 = tf.constant(4.0) # defaults to float32
node3 = tf.multiply(node1, node2)
   # could also have done node1 * node2
   # similarly tf.add(node1, node2) is equivalent to node1 + node2
node4 = tf.placeholder(tf.float32)
node5 = tf.Variable([.4], dtype=tf.float32)
   # .4 is the initial value of the Variable
   # which might change with training
node6 = node4 * node5 + node3
   # this implicitly creates a node for node4 * node5
   # that doesn't have a name
```

To compute things with a computation graph, we use a session:

```
session = tf.Session()
session.run(node3) # outputs 20.0
session.run(node4, {node4: 5}) # fills in 5 for the placeholder
    # outputs array(5.0, dtype=float32)
```

If we tried to call session.run(node6, {node4: 5}), we'd get an error as Variables are uninitialized until we call:

```
init = tf.global_variables_initializer()
session.run(init)
# now it is okay to call
session.run(node6, {node4: 5})
# outputs array([ 22.], dtype=float32) as .4*5 + 4.0*5. = 22.0
```

To change the value of a Variable, we can use tf.assign():

```
update = tf.assign(node5, [3.])
session.run([update])
session.run(node6, {node4: 5})
# outputs array([ 35.], dtype=float32) as 3.*5. + 4.*5. = 35.
```

Using Reduce Operations

For training it is often useful to feed in a sequence of values to a placeholder. This will get a sequence of outputs:
```
session.run(node6, {node4: [1,2,3,4,5]})
#outputs array([ 23.,  26.,  29.,  32.,  35.], dtype=float32)
```
We could imagine each component in the above array as being used to compute the summand of an error term for a loss function.
To get a single tensor object that computes the sum of these summands we can use reduce_sum:
```
node7 = tf.reduce_sum(node6)
session.run(node7, {node4: [1,2,3,4,5]})
#outputs 145.0
```
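The arithmetic behind these outputs can be re-derived in plain Python. The sketch assumes, as in the preceding slides, that node5 currently holds 3.0 after the assign and that node3 is 5.0 * 4.0.

```python
w = 3.0                 # value of node5 after the assign
node3 = 5.0 * 4.0       # the constant product, 20.0

# node6 is evaluated once per fed value, giving one summand each
outputs = [x * w + node3 for x in [1, 2, 3, 4, 5]]

# reduce_sum collapses the summands to a single number
total = sum(outputs)
```

This gives `[23.0, 26.0, 29.0, 32.0, 35.0]` and a total of 145.0, matching the session output.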

TensorFlow Optimizers

TensorFlow has a train submodule which contains a collection of classes used to train your computation graph.
In particular, it has a variety of subclasses of the Optimizer class, which can control how the Variables in your graph get updated according to training.
Some subclasses of Optimizer are GradientDescentOptimizer, AdagradOptimizer, and MomentumOptimizer.

We can initialize a GradientDescentOptimizer object using the command:

```
epsilon = 0.01 # the learning rate
optimizer = tf.train.GradientDescentOptimizer(epsilon)
```

As part of our computation graph, we'd typically define a node which computes our cost function.
If this node were called loss, then we could set the optimizer to try to minimize the value of this node via:
```
train = optimizer.minimize(loss)
```
Here we can think of train as an object which has a reference to a function which symbolically computes the derivatives for the computation graph and then applies them, so as to perform one update according to gradient descent.
Calling session.run(train, training_data) then trains the model according to training_data, one update per call.

Example Training a Computation Graph in TensorFlow

Here is a short example program, taken from the TensorFlow site, that trains a linear model:

```
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # set W and b to their (deliberately wrong) initial values
for i in range(1000):
  sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
```
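To see what the optimizer is doing in this program, the same training run can be sketched without TensorFlow by writing out the gradient of the sum-of-squares loss by hand. This is an illustrative re-derivation under the same starting values, data, learning rate, and step count; it uses Python floats rather than float32, so the final digits will differ slightly from TensorFlow's.

```python
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
W, b = 0.3, -0.3        # same initial values as the Variables above
epsilon = 0.01          # same learning rate

for _ in range(1000):
    # residual of the linear model on each training pair
    residuals = [W * x + b - y for x, y in zip(x_train, y_train)]
    # partials of sum(residual^2) with respect to W and b
    grad_W = sum(2 * r * x for r, x in zip(residuals, x_train))
    grad_b = sum(2 * r for r in residuals)
    # one gradient descent update
    W -= epsilon * grad_W
    b -= epsilon * grad_b

loss = sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))
```

The training data lie exactly on the line `y = -x + 1`, so `W` approaches -1, `b` approaches 1, and the loss approaches 0.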

Backpropagation, Tensorflow
