Introduction

Last day, we were considering the problem of designing perceptron networks that had some hope of computing a given class of functions.
This is important because if we are trying to train a network `N` to learn a function `f`, but no network of the architecture that `N` has can compute `f`, then we will live in a world of failure.
We proved the Minsky and Papert (1969) result that certain kinds of 2-layer perceptron networks can't compute the parity of an `n`-bit number.
We then showed how to compute an arbitrary boolean function of `n` input bits by a 3-layer threshold gate network (and hence, as threshold gates are a particular kind of perceptron, a 3-layer perceptron network) of size at most `n2^{n-1}` threshold gates.
This particular network isn't very practical because it is exponential size.
If we imagine trying to train a perceptron network that operates on `n`-bits but that requires exponentially many perceptrons, we would expect to need at least one training example/gate to train the weights of that gate, so training would likely be at least exponential time in the number of inputs.
Rather than try to consider arbitrary boolean functions, suppose we knew `f` was computing the `n`-bit input `x_1, ..., x_n` version of some polynomial time algorithm ...?

`p`-time algorithms

A photo of a Turing Machine implemented in real life

When people formally try to write down polynomial time algorithms they usually give a Turing Machine (TM) implementation.
This is because TMs are a relatively simple mathematical model of computation which is capable of simulating, with less than a cubic slow-down, computations carried out on modern day computer architectures using modern day instruction sets (This result was probably shown in your undergrad automata class when presenting the Church-Turing thesis).
For the purpose of the next result, we assume a TM with 1 tape.
The tape is made up of cells each containing one of the symbols `0`, `1`, blank, $ (start-of-tape).
Only one square has the $ symbol, and the machine has one tape head, which is over exactly one cell at a given time.
The tape has arbitrarily many squares to the right of the $ symbol.
In one time step, a TM can read 1 symbol from the tape head location, write one symbol back to that location (provided that the symbol was not $), and move the tape head to the left, right, or stay put.
The action performed by the TM at a given time step is uniquely determined by its current state `q` from a finite set of states and the current symbol being read.
A polynomial time algorithm is specified by giving a runtime bounding polynomial `p(n)` and the complete transition table which maps pairs `(q, mbox(in)_{symbol})` to triples `(q', mbox(out)_{symbol}, mbox(left-right-stay-put))`.
A computation of such an algorithm on an `n`-bit input `x_1, ..., x_n` consists of starting the machine with the tape cells given as `\$x_1x_2...x_n` and all the other cells blank, running the machine for `p(n)` steps, and defining the output to be the string given by the non-blank, non-$ tape cells.
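The machine model just described can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the names `run_tm`, `FLIP`, and `p`, the use of `_` for blank, and the rule that the head cannot move left of $ are all assumptions made for illustration.

```python
from typing import Dict, Tuple

BLANK, START = "_", "$"  # hypothetical characters for blank and start-of-tape

# (state, symbol read) -> (new state, symbol to write, head move in {-1, 0, +1})
Transition = Dict[Tuple[str, str], Tuple[str, str, int]]


def run_tm(delta: Transition, start_state: str, bits: str, steps: int) -> str:
    """Run a 1-tape TM for exactly `steps` steps; return the non-blank, non-$ cells."""
    # In `steps` steps the head can reach at most cell `steps`, so this tape is long enough.
    tape = [START] + list(bits) + [BLANK] * (steps + 1)
    state, head = start_state, 0
    for _ in range(steps):
        key = (state, tape[head])
        if key not in delta:  # assumption: a missing table entry means "do nothing"
            continue
        state, out, move = delta[key]
        if tape[head] != START:  # the $ cell is never overwritten
            tape[head] = out
        head = max(0, head + move)  # assumption: the head cannot move left of $
    return "".join(c for c in tape if c not in (BLANK, START))


# Example machine: flip every input bit, then idle on the first blank.
FLIP: Transition = {
    ("s", START): ("s", START, +1),
    ("s", "0"): ("s", "1", +1),
    ("s", "1"): ("s", "0", +1),
    ("s", BLANK): ("s", BLANK, 0),
}


def p(n: int) -> int:
    """Run-time bound for FLIP: one step over $, n steps over the input, one spare."""
    return n + 2


print(run_tm(FLIP, "s", "0110", p(4)))
```

Here the transition table is a dictionary from `(q, mbox(in)_{symbol})` pairs to triples, exactly as in the definition, and the run-time polynomial is supplied separately.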

Simulating `p`-time algorithms with Threshold circuits

Theorem. A `p(n)`-time algorithm for a 1-tape TM can be simulated by an `O(p(n))`-layer threshold network of size `O(p(n)^2)`. Moreover, the `O(p(n))` layers are built out of 1 layer which maps the input to an encoding of the input, followed by `p(n)`-many `O(1)`-layer networks each of which computes the same function `L`, followed by a layer which maps the encoding of the output to the final output. An `L` layer can be further split into `p(n)` many threshold networks, each of size `O(1)` with `O(1)` inputs, computing the same function `U` in parallel.

Remark. Imagine we were trying to learn a `p`-time algorithm. The result above says that we just need to learn the `O(1)` many weights in the repeated `U`, not polynomially many weights as you might initially guess. We can think of `U` as roughly corresponding to the neural net's finite control. The idea of using repeated sub-networks, each of which uses the same weights, is essentially the idea behind convolutional neural layers.

Proof of Simulation Result

Let `|Q|` be the number of states of our polynomial time TM algorithm and let `p(n)` be the run time bound.
In `p(n)` time steps, a TM can affect at most `p(n)` many tape cells.
We can encode the contents of a tape cell with a `3 + |~log|Q| ~|` bit number: the first `|~log|Q| ~| + 1` bits are all `0` if the tape head is not over the square, and otherwise they give the machine state. The last two bits are 00 if the tape cell has a $ in it; 01 if it has a 1 in it; 10 if it has a 0 in it; and 11 if it is blank.
Given our result about any boolean function being computable by a depth 3 threshold network, we know we can find two fixed finite threshold networks which take a 1-bit input, either 0 or 1, and output the high and low bits of the encoding of 0 or 1.
The first layer of our network consists of a fixed finite size hard-coded circuit outputting the encoding of the pair (initial state, $), saying the tape head is over $ and the machine is in the initial state, followed by circuits which map each `x_i` to its encoding as `|~log|Q| ~| + 1` zero bits (tape head not there), followed by either 01 or 10 depending on the value of `x_i`. The remainder of the layer outputs encodings of `p(n) - n` many blank squares.
The units `U` of the intermediate layers will each take `3(3 + |~log|Q| ~|)` input bits and output `3 + |~log|Q| ~|` output bits. The `3 + |~log|Q| ~|` output bits will correspond to the encoding of a particular tape cell at one time step further into the future.
The value of this cell depends only on the current encodings of the tape square itself and of the squares to its left and right (which include the current state if the head is over one of them), so it can be computed by some fixed finite size threshold circuit.
A layer will consist of `p(n)` many such `U` circuits, each having as inputs the encodings of the appropriate tape squares at layer (can be viewed as time step) `t` and having as output the appropriate encoding of a tape cell at layer (can be viewed as time step) `t+1`.
After `p(n)` many such layers the output will be an encoding of the output tape cells of the TM after `p(n)` many steps.
The final layer of our network then computes a mapping back from tape cells to a binary string. Q.E.D.
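The structure of the proof can be sketched in Python as follows. This is only an illustration under stated assumptions: each cell is a Python pair `(state or None, symbol)` rather than the `3 + |~log|Q| ~|` bit encoding, `U` is an ordinary function rather than a threshold circuit, the transition table is assumed to be total on every (state, symbol) pair that occurs, and the machine is assumed never to move left from $. The names `U`, `layer`, `simulate`, and `FLIP` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BLANK, START = "_", "$"  # hypothetical characters for blank and start-of-tape
Cell = Tuple[Optional[str], str]  # (state if the head is here, else None; tape symbol)
Transition = Dict[Tuple[str, str], Tuple[str, str, int]]


def U(delta: Transition, left: Cell, mid: Cell, right: Cell) -> Cell:
    """Contents of the middle cell one time step later, from it and its two neighbours."""
    state, sym = mid
    new_state, new_sym = None, sym
    if state is not None:  # the head is on this cell: write, and maybe stay
        q, out, move = delta[(state, sym)]
        if sym != START:  # $ is never overwritten
            new_sym = out
        if move == 0:
            new_state = q
    if left[0] is not None and delta[left][2] == +1:  # head arrives from the left
        new_state = delta[left][0]
    if right[0] is not None and delta[right][2] == -1:  # head arrives from the right
        new_state = delta[right][0]
    return (new_state, new_sym)


def layer(delta: Transition, cells: List[Cell]) -> List[Cell]:
    """One layer: the same U applied at every tape position in parallel."""
    pad: Cell = (None, BLANK)
    padded = [pad] + cells + [pad]
    return [U(delta, padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def simulate(delta: Transition, start_state: str, bits: str, steps: int) -> str:
    """Input encoding layer, then `steps` identical layers, then output decoding."""
    cells: List[Cell] = [(start_state, START)] + [(None, b) for b in bits]
    cells += [(None, BLANK)] * (steps + 1)
    for _ in range(steps):
        cells = layer(delta, cells)
    return "".join(sym for _, sym in cells if sym not in (BLANK, START))


# Example machine (an assumption for illustration): flip every bit, then idle.
FLIP: Transition = {
    ("s", START): ("s", START, +1),
    ("s", "0"): ("s", "1", +1),
    ("s", "1"): ("s", "0", +1),
    ("s", BLANK): ("s", BLANK, 0),
}

print(simulate(FLIP, "s", "0110", 6))
```

Note that `layer` never looks at the time step or the position: the only thing that would need to be learned is the single function `U`, which is the point of the remark about shared weights.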

Quiz

Which of the following is true?

Python does not support classes.
A single perceptron cannot compute the parity function.
A majority gate cannot be implemented with a single perceptron.

Support Vector Machines

Before we talk exclusively about deep neural networks, I want to explore one more single layer neural network architecture which is often used in machine learning: Support Vector Machines (Cortes and Vapnik 1995).
An SVM computes a function `|~\sum_{j=1}^mw_j f_j(\vec{x}) ge theta~|` where `vec{f}:RR^n->RR^m`. That is, we take an input `x_1, ..., x_n`, we then map it to some different space of potentially higher dimension, and then ask if the result is above or below some hyperplane.
Often rather than taking the output to be 1 for above the hyperplane and 0 for below, we will use instead 1 for above the hyperplane and -1 for below. Both settings are of equivalent power and we will indicate as needed which one we are in.
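As a small sketch of the function an SVM computes, the following Python evaluates `sum_{j=1}^m w_j f_j(vec{x}) ge theta` for a given mapping and weights. The feature map `(x_1, x_2, x_1x_2)`, the weights, and the threshold are hypothetical values chosen by hand to compute two-bit parity; they are not the output of any training algorithm.

```python
from typing import Callable, Sequence


def svm_output(w: Sequence[float],
               f: Callable[[Sequence[float]], Sequence[float]],
               x: Sequence[float],
               theta: float,
               signed: bool = False) -> int:
    """Map x with f, then test on which side of the hyperplane (w, theta) it lies."""
    total = sum(w_j * f_j for w_j, f_j in zip(w, f(x)))
    above = total >= theta
    if signed:  # the {-1, 1} output convention
        return 1 if above else -1
    return 1 if above else 0  # the {0, 1} output convention


def feature_map(x: Sequence[float]) -> Sequence[float]:
    """Hypothetical map from RR^2 to RR^3 that adds the product x_1 * x_2."""
    return (x[0], x[1], x[0] * x[1])


W = (1.0, 1.0, -2.0)  # hand-picked weights (an assumption for illustration)
THETA = 0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, svm_output(W, feature_map, x, THETA), svm_output(W, feature_map, x, THETA, signed=True))
```

Parity of two bits is not separable by a line in the original space, but after adding the product coordinate a single hyperplane suffices.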
Given a training set `X`, the SVM training algorithm computes the weights `w_i` so that the separating hyperplane maximizes the distance to the nearest positive and negative training set examples. See image above from Wikipedia.
So we can see this separator is computed geometrically, rather than by an update rule.
The separating hyperplane only depends on the points closest to it in the training data. When solving the optimization problem to get the hyperplane, this often allows one to avoid computing the inner product of points `vec{f}(vec{x_j}) cdot vec{f}(vec{x_k})` and instead compute a simpler function `k(vec{x_j}, vec{x_k})` called a kernel. Using a kernel in this way is called using the kernel trick.
The Cortes and Vapnik paper extended earlier work of Vapnik and Chervonenkis (1963) and others. One of their contributions was to define a good hyperplane based on soft margins that behaves well even when the positive and negative values cannot be separated by a hyperplane.

Mapping to a Separable Space

Theorem. It is always possible to map a dataset of examples `(vec{x_i}, y_i)` where `vec{x_i} in RR^n` and `y_i in {0,1}` to some higher dimensional `vec{f}(vec{x_i}) in RR^m` such that the negative examples `vec{f}(vec{x_j})` can be separated from the positive examples `vec{f}(vec{x_k})` by a hyperplane in `RR^m`.

Proof. We give the proof for inputs that are bit vectors in `{0,1}^n`, the case needed for the corollary below. Let `f_{vec{z}}(vec{x})` be the function which is `0` if `vec{x} ne vec{z}` or if `vec{z}` was not a positive training example and is `1` otherwise. Let `vec{f}` be the mapping into `RR^{2^n}` given by `vec{x} mapsto (f_{vec{0}}(vec{x}), ...,f_{vec{z}}(vec{x}), ..., f_{vec{1}}(vec{x}))`. All of the negative examples will map to the `0` vector of length `2^n` and a positive example will map to a vector with exactly one coordinate 1. Hence, the hyperplane which cuts each axis in the target space at `1/2` will separate the positive from the negative examples. Q.E.D.
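The construction in the proof can be checked on a small case. The sketch below assumes the inputs are bit vectors and uses 3-bit parity as the example dataset; the names `make_feature_map` and `classify` are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, List, Sequence, Tuple


def make_feature_map(n: int, positives: Iterable[Tuple[int, ...]]) -> Callable[[Sequence[int]], List[int]]:
    """Build the map into RR^(2^n): one coordinate f_z per bit vector z."""
    points = list(product((0, 1), repeat=n))  # all 2^n candidate vectors z
    pos = set(positives)

    def f(x: Sequence[int]) -> List[int]:
        x = tuple(x)
        # f_z(x) = 1 exactly when x = z and z is a positive training example
        return [1 if (z == x and z in pos) else 0 for z in points]

    return f


def classify(f: Callable[[Sequence[int]], List[int]], x: Sequence[int]) -> int:
    """Hyperplane cutting every axis at 1/2: sum of coordinates compared with 1/2."""
    return 1 if sum(f(x)) >= 0.5 else 0


N = 3
POSITIVES = [x for x in product((0, 1), repeat=N) if sum(x) % 2 == 1]  # 3-bit parity
f = make_feature_map(N, POSITIVES)

for x in product((0, 1), repeat=N):
    print(x, f(x), classify(f, x))
```

Every negative example prints the all-zero vector, and every positive example prints a vector with a single 1, which is why the one hyperplane separates them. The target dimension `2^n` is also visible here, which is the practical drawback of the construction.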

Corollary. Any boolean function can be computed by an SVM (maybe slightly relaxing the definition of maximally separate to allow for unbounded support vectors).

Remark. The above construction gives an SVM which does not generalize very well, so it is not really a practical construction.

Remark. For SVMs, usually `y_i`'s are chosen from `{-1, 1}`, but a similar theorem to the above could still be obtained.

Other Ways to try to Map to a Separable Space

Suppose we want to compute `PAR_2(x_1, x_2)`.
Define a mapping from `RR^2` to `RR^2` by:
`f_1(x_1,x_2) := exp(-||((x_1),(x_2)) - ((1),(0))||^2)`
`f_2(x_1,x_2) := exp(-||((x_1),(x_2)) - ((0),(1))||^2).`
We call functions `f_{\vec{t}, sigma}(vec{x}) = exp(-||vec{x} - vec{t}||^2/(2sigma^2))` where `sigma in RR`, Gaussian functions.
A function `f_{vec{t}}(vec{x}) := phi(||vec{x} - vec{t}||)` for some function `phi: RR -> RR` is called a radial basis function.
Notice `f_1` is small except close to the vector `((1),(0))`, where it approaches 1, and similarly `f_2` is small except close to the vector `((0),(1))`, where it approaches 1.
Hence, the inputs `(0,0)` and `(1,1)` to `PAR_2` both map to `((e^{-1}),(e^{-1})) ~~ ((0.37),(0.37))`, while `(1,0)` maps to `((1),(e^{-2})) ~~ ((1),(0.14))` and `(0,1)` maps to `((e^{-2}),(1)) ~~ ((0.14),(1))`.
So we can separate them with the line `1/2x + 1/2 y = 1/2`.
A function which computes an activation function on a weighted sum of radial basis function inputs is called a radial basis network.
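The `PAR_2` example can be checked numerically. The sketch below evaluates `f_1` and `f_2` as defined above on the four boolean inputs and applies the separating line `1/2x + 1/2 y = 1/2`; the function name `par2_rbf` is hypothetical.

```python
import math


def f1(x1: float, x2: float) -> float:
    """Gaussian bump centred at (1, 0)."""
    return math.exp(-((x1 - 1) ** 2 + (x2 - 0) ** 2))


def f2(x1: float, x2: float) -> float:
    """Gaussian bump centred at (0, 1)."""
    return math.exp(-((x1 - 0) ** 2 + (x2 - 1) ** 2))


def par2_rbf(x1: float, x2: float) -> int:
    """Threshold the mapped point against the line (1/2)x + (1/2)y = 1/2."""
    return 1 if 0.5 * f1(x1, x2) + 0.5 * f2(x1, x2) >= 0.5 else 0


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(f1(x1, x2), 3), round(f2(x1, x2), 3), par2_rbf(x1, x2))
```

The two even-parity inputs land on the same point, with both coordinates equal to `e^{-1}`, which is below the line; the two odd-parity inputs have one coordinate equal to 1 and so land above it. This is a one-unit radial basis network in the sense just defined.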

Computing Separators

Let's consider the problem of computing the maximal separator assuming we didn't do any mapping `vec{f}`.
The standard way to do this reduces the problem to solving a quadratic program: find
`mbox(argmax)_(vec alpha) (sum_j alpha_j - 1/2 sum_{j,k} alpha_j alpha_k y_j y_k (vec x_j cdot vec x_k))`
subject to `alpha_j ge 0` and `sum_j alpha_jy_j = 0`.
Here the weights of the output SVM are related to the `alpha_j`'s above via the equation `vec w = sum_j alpha_j y_j vec x_j`, which is the form consistent with the classifier below.
A classifier can be built out of `vec alpha` directly using the equation:
`h(vec x) = sign(sum_j alpha_j y_j(vec x cdot vec x_j) - b)`
which also only uses the data in a dot product.
An important property of the `alpha_j`'s is that they are usually `0` except at the support vectors, the vectors closest to the separator (typically in each dimension).
If we want to do a mapping `vec{f}` before computing the separator, the process is almost identical, except now the classifier becomes: `h(vec x) = sign(sum_j alpha_j y_j(vec{f}(vec x) cdot vec{f}(vec x_j)) - b)`.
For some `vec{f}` we can replace the `(vec{f}(vec x) cdot vec{f}(vec x_j))` in the classifier and `(vec{f}(vec x_j) cdot vec{f}(vec x_k))` in the quadratic program with a simple kernel function `k(vec{x}, vec{x}_j)` or `k(vec{x_j}, vec{x}_k)`.
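To make the quadratic program concrete, here is a minimal sketch on a toy dataset with one positive point `x_1 = 1` and one negative point `x_2 = -1` on the real line. With only two points, the constraint `sum_j alpha_jy_j = 0` forces `alpha_1 = alpha_2`, so a one-dimensional grid search stands in for a real quadratic programming solver (an assumption made to keep the example short). The weights are recovered as `vec w = sum_j alpha_j y_j vec x_j`, the form that matches the classifier `h`, and `b` is set by placing the positive point on the margin; all names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """sum_j alpha_j - 1/2 sum_{j,k} alpha_j alpha_k y_j y_k (x_j . x_k)."""
    m = len(xs)
    quad = sum(alpha[j] * alpha[k] * ys[j] * ys[k] * dot(xs[j], xs[k])
               for j in range(m) for k in range(m))
    return sum(alpha) - 0.5 * quad


XS = [(1.0,), (-1.0,)]  # toy training inputs (an assumption for illustration)
YS = [1, -1]            # labels in the {-1, 1} setting

# The equality constraint gives alpha_1 = alpha_2 = a, so search over a >= 0 only.
grid = [i / 1000 for i in range(0, 2001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], XS, YS))
alpha = [best_a, best_a]

# Recover the primal weights and the offset from the dual solution.
w: List[float] = [sum(alpha[j] * YS[j] * XS[j][d] for j in range(len(XS)))
                  for d in range(len(XS[0]))]
b = dot(w, XS[0]) - YS[0]  # the support vector x_1 satisfies w . x_1 - b = y_1


def h(x: Sequence[float]) -> int:
    """Classifier built from alpha directly, using the data only in dot products."""
    s = sum(alpha[j] * YS[j] * dot(x, XS[j]) for j in range(len(XS))) - b
    return 1 if s >= 0 else -1


print(best_a, w, b, h((2.0,)), h((-3.0,)))
```

On this dataset the objective reduces to `2a - 2a^2`, so both points are support vectors and the separator sits halfway between them at the origin.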
A result known as Mercer's theorem says when this is okay to do.

Popular Choices of Kernels

A Polynomial Kernel: `(vec{x}^Tvec{x_i} +1)^p`. Might use when we imagine in the original space it is likely we could separate the positive from the negative examples by a polynomial.
An RBF or Gaussian Kernel: `exp(-||vec{x} - vec{x_i}||^2/(2sigma^2))`. We might do clustering on the data and choose `sigma` based on the closest clusters. The idea of why this kernel might be useful is based on the parity example I gave.
A Two Layer Perceptron Kernel: `tanh(beta_0vec{x}^Tvec{x_i} + beta_1)`.
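The three kernels above are short enough to write out directly. The sketch below also checks the kernel trick for the polynomial kernel with `p = 2` on inputs in `RR^2`: the kernel value equals an ordinary dot product after an explicit mapping into `RR^6`. The function names, the default parameter values, and the particular feature map `phi2` are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x: Sequence[float], z: Sequence[float], p: int = 2) -> float:
    """Polynomial kernel (x^T z + 1)^p."""
    return (dot(x, z) + 1) ** p


def rbf_kernel(x: Sequence[float], z: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def tanh_kernel(x: Sequence[float], z: Sequence[float], beta0: float = 1.0, beta1: float = 0.0) -> float:
    """Two layer perceptron kernel tanh(beta_0 x^T z + beta_1)."""
    return math.tanh(beta0 * dot(x, z) + beta1)


def phi2(x: Sequence[float]) -> Tuple[float, ...]:
    """Explicit map RR^2 -> RR^6 whose dot product reproduces poly_kernel with p = 2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)


x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(phi2(x), phi2(z)))  # same value, two ways
print(rbf_kernel(x, z), tanh_kernel(x, z))
```

Computing `poly_kernel` takes one dot product in `RR^2`, while the explicit route needs six coordinates per point; for larger `p` and `n` the gap grows quickly, which is the saving the kernel trick provides.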
Next day, we will discuss a simpler, more recent technique to compute the separator that is not based on quadratic programming.

Perceptron Networks and p-time algorithms, SVMs