Perceptron Lower and Upper Bounds




CS256

Chris Pollett

Sep 20, 2017

Outline

Introduction

What Single Perceptrons Cannot Do

Single Perceptrons Don't Do Equality - Proofyized Version

Two-Layer Networks

In-Class Exercise

Explicitly work out the perceptrons needed for a two-layer perceptron network computing `PAR_3(x_1, x_2, x_3)`.

Post your solutions to the Sep 20 In-Class Exercise Thread.

Answer. Since a lot of the solutions posted seemed to be poorly drawn figures with no description of how to get a perceptron circuit out of them, I decided to give an answer. First, notice `PAR_3` is 1 for the boolean values `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)`, and `(1,1,1)`. It is 0 for `(0,0,0)`, `(1, 1, 0)`, `(0,1,1)`, and `(1,0,1)`. We will build a perceptron circuit with four perceptron gates `G_1, ..., G_4`, which computes the function `G_4(G_1(x_1, x_2, x_3), G_2(x_1, x_2, x_3), G_3(x_1, x_2, x_3))`.

`G_1` checks if `x+y+z ge 1/2`. The equation `x+y+z = 1/2` determines the plane that cuts the `x`, `y`, and `z` axes at 1/2, so it lies above the point `(0,0,0)`. So if `G_1` is `1`, then we know the input isn't `(0,0,0)`, although it might be any of the other elements of the boolean cube. `G_2` checks if `x+y+z le 3/2`. The points `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)` all satisfy this inequality, but `(1, 1, 0)`, `(0,1,1)`, and `(1,0,1)` do not. So `G_1` and `G_2` can both be 1 on the boolean cube only for the points `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)`. We let `G_3` check if `x+y+z ge 2.5`. The only point on the boolean cube which satisfies this inequality is `(1,1,1)`. Notice if either `G_2` or `G_3` is 1, the other is 0.

Finally, we let `G_4` check `Out_{G_1} + Out_{G_2} + 2Out_{G_3} ge 2`. Since `G_2` and `G_3` can't simultaneously be `1`, the only way for `G_4` to output `1` is if either `Out_{G_1}` and `Out_{G_2}` are both 1, or `Out_{G_1}` and `Out_{G_3}` are both 1. The first case corresponds only to `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)` on the boolean cube; the latter to `(1,1,1)`.
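As a quick sanity check of the circuit above, here is a short Python sketch (my own code, not part of the original exercise; the helper `gate` and the function names are illustrative) that verifies the four gates compute `PAR_3` on all eight points of the boolean cube:

```python
# Sketch: verify the G_1..G_4 perceptron circuit for PAR_3 by brute force.
from itertools import product

def gate(weights, theta, inputs):
    """Perceptron gate: output 1 if the weighted sum is >= theta, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, inputs)) >= theta else 0

def par3_circuit(x, y, z):
    g1 = gate((1, 1, 1), 0.5, (x, y, z))      # rules out (0,0,0)
    g2 = gate((-1, -1, -1), -1.5, (x, y, z))  # x+y+z <= 3/2, as -x-y-z >= -3/2
    g3 = gate((1, 1, 1), 2.5, (x, y, z))      # only (1,1,1) passes
    return gate((1, 1, 2), 2, (g1, g2, g3))   # Out_G1 + Out_G2 + 2*Out_G3 >= 2

for x, y, z in product((0, 1), repeat=3):
    assert par3_circuit(x, y, z) == (x + y + z) % 2
print("PAR_3 circuit agrees with parity on all 8 inputs.")
```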

Simplifying Symmetric Perceptrons

Given a boolean vector `vec{x}`, let `#(vec{x})` be the function which returns the number of 1 bits in `vec{x}`.

Lemma. Suppose a symmetric function `f(vec{x})` is computed as `|~ vec{w}\cdot vec{x} \geq theta ~|`, where `vec{w} in RR^n` and `theta in RR` with `theta > 0`. Then there exists a `gamma in RR` such that `f(vec{x}) = |~ gamma cdot #(vec{x}) \geq 1 ~|` for all `vec{x} in {0,1}^n`.

Proof. Since `f` is symmetric, either all of the following `n` inequalities hold, or none of them do:
\begin{eqnarray*} w_1x_1 + w_2x_2 + w_3x_3 + \cdots + w_{n-1} x_{n-1} + w_n x_n &\geq& \theta\\ w_1x_2 + w_2x_3 + w_3x_4 + \cdots + w_{n-1} x_{n} + w_n x_1 &\geq& \theta\\ &\vdots&\\ w_1x_{n-1} + w_2x_n + w_3x_1 + \cdots + w_{n-1} x_{n-3} + w_n x_{n-2} &\geq& \theta\\ w_1x_{n} + w_2x_1 + w_3x_2 + \cdots + w_{n-1} x_{n-2} + w_n x_{n-1} &\geq& \theta \end{eqnarray*} Adding these inequalities and grouping by the `x_i`'s implies (note if none of them hold, each left side is strictly less than `theta`, so the sum is strictly less than `n \cdot theta`) that `f(vec{x}) = 1` iff
`(sum_{i=1}^n w_i) x_1 + (sum_{i=1}^n w_i) x_2 + cdots + (sum_{i=1}^n w_i) x_n \geq n \cdot theta`.
Thus, `f(vec{x}) = 1` iff `(sum_{i=1}^n w_i)(sum_{i=1}^n x_i) \geq n\cdot theta`. As `#(vec{x}) = (sum_{i=1}^n x_i)` and `n \cdot theta > 0`, dividing both sides by `n \cdot theta` preserves the direction of the inequality, so defining `gamma = 1/(n\cdot theta) (sum_{i=1}^n w_i)` gives the result. Q.E.D.
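For a concrete sanity check of the lemma, here is a short Python sketch; the weights `(0.9, 1.0, 1.1)` and threshold `1.5` are an illustrative example of mine (this perceptron happens to compute the symmetric function `T_2^3` despite its unequal weights):

```python
# Sketch: verify f(x) = [gamma * #(x) >= 1] matches the original perceptron.
from itertools import product

w, theta, n = (0.9, 1.0, 1.1), 1.5, 3
gamma = sum(w) / (n * theta)   # gamma = (sum_i w_i) / (n * theta); theta > 0 here

for x in product((0, 1), repeat=n):
    f = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
    g = 1 if gamma * sum(x) >= 1 else 0
    assert f == g
print("f(x) = [gamma * #(x) >= 1] on all of {0,1}^3, gamma =", gamma)
```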

Minsky and Papert (1969)

Theorem. Let `Psi_k^n` be the collection of boolean functions on `n` inputs, each of which outputs 1 exactly when some fixed set of `k` of these inputs are all 1. Then if `K < n`, `PAR_n` cannot be computed as `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(vec{x})] \geq 1 ~|`.

Proof. Suppose `PAR_n` could be computed as above. Notice if `psi_{jk}` and `psi_{jk'}` depend on the same set of inputs being 1, we could combine their contribution to the top-level perceptron as `(alpha_{jk} + alpha_{jk'}) psi_{jk}(vec{x})`, so we may assume each set appears at most once. Notice given two bit vectors `vec{x} ne vec{x'}` which each have exactly `j` many `1` bits on, `PAR_n(vec{x}) = PAR_n(vec{x'})`. On input `vec{x}`, at most one level-`j` summand of `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(vec{x})] \geq 1 ~|` is nonzero, namely `alpha_{jk} psi_{jk}(vec{x})` for the `psi_{jk}` whose set is exactly the support of `vec{x}`; similarly, on `vec{x'}` it is `alpha_{jk'} psi_{jk'}(vec{x'})`, and the two outputs have to agree. This suggests, and Minsky and Papert's group invariance theorem (averaging the coefficients over all permutations of the inputs) makes precise, that we can take the coefficients for a given level `j` all to be the same, i.e., `alpha_{jk} = alpha_{jk'}`. Hence, our formula above can be simplified to `|~sum_{j=0}^K alpha_{j} [sum_{psi_{jk} in Psi_j^n} psi_{jk}(vec{x})] \geq 1 ~|`.

Now `sum_{psi_{jk} in Psi_j^n} psi_{jk}(vec{x}) = N_j(vec{x})`, where `N_j(vec{x})` is the number of size-`j` sets of inputs on which `vec{x}` is all 1s. So `N_j(vec{x}) = ((#(vec{x})),(j))`, a polynomial of degree `j` in `#(vec{x})`. Hence, our formula for `PAR_n(vec{x})` reduces to `|~sum_{j=0}^K alpha_{j} ((#(vec{x})),(j)) \geq 1 ~|`. Pulling the 1 to the other side of the inequality, determining the output can be viewed as checking whether a polynomial of degree at most `K` in `#(vec{x})` is `\geq 0`. Notice when `#(vec{x}) = 0`, this polynomial needs to be less than `0`; when `#(vec{x}) = 1`, it must be at least `0`; and so on, alternating as `#(vec{x})` runs from `0` to `n`. So it needs to cross a line just below the `x`-axis at least `n` times. Therefore, by the fundamental theorem of algebra (a nonzero polynomial of degree `K` has at most `K` real roots), `K` must be at least `n`, contradicting `K < n`. Q.E.D.
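The counting step `N_j(vec{x}) = ((#(vec{x})),(j))` is easy to check exhaustively for small `n`. Below is a short Python sketch (my own check, not part of the original notes):

```python
# Sketch: the number of size-j subsets of inputs on which x is all 1s
# equals C(#(x), j), for every x in {0,1}^n and every j.
from itertools import combinations, product
from math import comb

n = 5
for x in product((0, 1), repeat=n):
    ones = sum(x)
    for j in range(n + 1):
        # psi_{jk}(x) = 1 iff the k-th size-j subset of inputs is all 1s in x
        n_j = sum(1 for S in combinations(range(n), j) if all(x[i] for i in S))
        assert n_j == comb(ones, j)
print("N_j(x) = C(#(x), j) verified for all x in {0,1}^5 and all j.")
```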

On the Power of Perceptron Networks

Facts About Threshold Functions

  1. `T_{k}^n(x_1, ..., x_n) = T_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` and `bar{T}_{k}^n(x_1, ..., x_n) = bar{T}_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` for any permutation `pi`.
  2. `vv_{i=1}^n x_i = T_{1}^n(x_1, ..., x_n)`
  3. `^^_{i=1}^n x_i = T_{n}^n(x_1, ..., x_n)`
  4. `neg x = bar{T}_{0}^1(x)`
  5. `^^_{i=1}^n (neg x_i) = bar{T}_{0}^n(x_1, ..., x_n)`
  6. By repeating inputs, arbitrary positive integer weighted inputs can be computed using a threshold function.
  7. `^^_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{0})` where `vec{0}` is `n-1` zeros.
  8. `vv_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{1})` where `vec{1}` is `n-1` ones.
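A quick brute-force check of facts 2, 3, 7, and 8 above (a sketch; the helpers `T` and `MAJ` are my implementations taken directly from the definitions):

```python
# Sketch: verify OR/AND as threshold functions and via padded majorities.
from itertools import product

def T(k, xs):
    """Threshold T_k: 1 iff at least k of the inputs are 1."""
    return 1 if sum(xs) >= k else 0

def MAJ(xs):
    """Majority on an odd number of inputs: 1 iff more than half are 1."""
    return T(len(xs) // 2 + 1, xs)

n = 4
for x in product((0, 1), repeat=n):
    assert T(1, x) == max(x)                  # fact 2: OR = T_1^n
    assert T(n, x) == min(x)                  # fact 3: AND = T_n^n
    assert MAJ(x + (0,) * (n - 1)) == min(x)  # fact 7: AND via MAJ_{2n-1}, pad 0s
    assert MAJ(x + (1,) * (n - 1)) == max(x)  # fact 8: OR via MAJ_{2n-1}, pad 1s
print("Facts 2, 3, 7, 8 hold on all of {0,1}^4.")
```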

Conjunctive and Disjunctive Normal Form

CNFs and DNFs are universal

Claim. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is an `n`-variable CNF formula `phi` and there is an `n`-variable DNF formula `psi`, each of size at most `n2^n`, such that `phi(u) = psi(u) = f(u)` for every truth assignment `u in {0, 1}^n`. Here the size of a CNF or DNF formula is defined to be the number of `^^`'s/`vv`'s appearing in it.

Proof. For each `v in {0,1}^n` we make a clause `C_v(u_1, ..., u_n)` where `u_i` appears negated in the clause if bit `i` of `v` is `1`; otherwise, it appears un-negated. Notice this clause is an OR of `n` literals, so it has `n-1` `vv`'s. Also notice `C_v(v) = 0` and `C_v(u) = 1` for `u ne v`. Using these `C_v`'s we can define a CNF formula for `f` as:
`phi = ^^_(v:f(v) = 0) C_v(u_1, ..., u_n)`.
I.e., we take an AND, over the rows of the truth table on which `f` is false, of ORs of literals, each of which rules out its false row. As there are at most `2^n` strings `v` with `f(v) = 0`, the total size of this CNF is at most `n 2^n`.

For the DNF formula, let `A_v(u_1, ..., u_n)` be an AND based on the `v`th row of the truth table, where `u_i` appears negated in the AND if bit `i` of `v` is `0`; otherwise, it appears un-negated. Notice `A_v(v) = 1` and `A_v(u) = 0` for `u ne v`. Then we define
`psi = vv_(v:f(v) = 1) A_v(u_1, ..., u_n)`. Q.E.D.
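Here is a short Python sketch (my rendering of the construction in the proof; the function names are illustrative) that builds the CNF and DNF from a truth table and checks they agree with `f`:

```python
# Sketch: build phi (CNF) and psi (DNF) as in the proof and test them.
from itertools import product

def make_cnf(f, n):
    # one clause C_v per false row v; literal i is negated iff v[i] == 1
    return [v for v in product((0, 1), repeat=n) if f(v) == 0]

def eval_cnf(clauses, u):
    # C_v(u) = 1 iff u differs from v somewhere; AND over all clauses
    return all(any(u[i] != v[i] for i in range(len(u))) for v in clauses)

def make_dnf(f, n):
    # one term A_v per true row v; literal i is negated iff v[i] == 0
    return [v for v in product((0, 1), repeat=n) if f(v) == 1]

def eval_dnf(terms, u):
    # A_v(u) = 1 iff u equals v everywhere; OR over all terms
    return any(all(u[i] == v[i] for i in range(len(u))) for v in terms)

n = 3
f = lambda v: sum(v) % 2            # parity, as a sample truth table
cnf, dnf = make_cnf(f, n), make_dnf(f, n)
for u in product((0, 1), repeat=n):
    assert eval_cnf(cnf, u) == eval_dnf(dnf, u) == bool(f(u))
print("CNF and DNF from the proof agree with f on all inputs.")
```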

Notice either at least half of the rows in the truth table for `f` are 1 or at least half of the rows are `0`. By picking the CNF in the former case and the DNF in the latter, and using our earlier results expressing ANDs and ORs with threshold gates, we thus have:

Corollary. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is a network of threshold functions with at most three layers and size `O(n 2^{n-1})` computing it.

Remark. The top threshold gate computes an AND or an OR of up to `2^{n-1}` terms, so the network still has exponential size.
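To make the Corollary concrete, here is a sketch (my own construction, following facts 2-4 above, not code from the lecture) of the three-layer threshold network: layer 1 computes negations via `bar{T}_0^1`, layer 2 computes the DNF's ANDs via `T_n^n`, and layer 3 ORs the results via `T_1^m`:

```python
# Sketch: a three-layer network of threshold gates computing an arbitrary f.
from itertools import product

def T(k, xs):
    """T_k: 1 iff at least k of the inputs are 1."""
    return 1 if sum(xs) >= k else 0

def barT(k, xs):
    """bar{T}_k: 1 iff at most k of the inputs are 1."""
    return 1 if sum(xs) <= k else 0

def threshold_net(f, n, u):
    neg = [barT(0, (u[i],)) for i in range(n)]   # layer 1: NOT u_i (fact 4)
    terms = [v for v in product((0, 1), repeat=n) if f(v) == 1]
    ands = [T(n, [u[i] if v[i] else neg[i] for i in range(n)])  # layer 2 (fact 3)
            for v in terms]
    return T(1, ands)                            # layer 3: OR of the ANDs (fact 2)

n = 3
f = lambda v: sum(v) % 2   # parity as a sample f
for u in product((0, 1), repeat=n):
    assert threshold_net(f, n, u) == f(u)
print("Three-layer threshold network agrees with f on all of {0,1}^3.")
```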