Perceptron Lower and Upper Bounds




CS256

Chris Pollett

Sep 20, 2017

Outline

Introduction

What Single Perceptrons Cannot Do

Single Perceptrons Don't Do Equality - Proofyized Version

Two-Layer Networks

In-Class Exercise

Explicitly work out the perceptrons needed for a two-layer perceptron network computing `PAR_3(x_1, x_2, x_3)`.

Post your solutions to the Sep 20 In-Class Exercise Thread.

Answer. Since a lot of the solutions posted seemed to be poorly drawn figures with no description of how to get a perceptron circuit out of them, I decided to give an answer. First, notice `PAR_3` is 1 for the boolean values `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)`, and `(1,1,1)`. It is 0 for `(0,0,0)`, `(1, 1, 0)`, `(0,1,1)`, and `(1,0,1)`. We will build a perceptron circuit with four perceptron gates `G_1, ..., G_4`, which computes the function `G_4(G_1(x_1, x_2, x_3), G_2(x_1, x_2, x_3), G_3(x_1, x_2, x_3))`.

`G_1` checks if `x+y+z ge 1/2`. The equation `x+y+z = 1/2` determines the plane that cuts the `x`, `y`, and `z` axes at 1/2, so it lies above the point `(0,0,0)`. So if `G_1` is `1`, then we know the input isn't `(0,0,0)`, although it might be any of the other elements of the boolean cube. `G_2` checks if `x+y+z le 3/2`. The points `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)` all satisfy this inequality, but `(1, 1, 0)`, `(0,1,1)`, and `(1,0,1)` do not. So `G_1` and `G_2` can both be 1 on the boolean cube only for the points `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)`. We let `G_3` check if `x+y+z ge 2.5`. The only point on the boolean cube which satisfies this inequality is `(1,1,1)`. Notice if either `G_2` or `G_3` is 1, the other is 0.

Finally, we let `G_4` check `Out_{G_1} + Out_{G_2} + 2Out_{G_3} ge 2`. Since `G_2` and `G_3` can't simultaneously be `1`, the only way for `G_4` to output `1` is if either `Out_{G_1}` and `Out_{G_2}` are both 1, or `Out_{G_1}` and `Out_{G_3}` are both 1. The first case corresponds only to `(1, 0, 0)`, `(0,1,0)`, `(0,0,1)` on the boolean cube; the latter to `(1,1,1)`.
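As a quick sanity check of the circuit above, here is a short Python sketch (my own code, not part of the original exercise; the helper `gate` and the function names are illustrative) that verifies the four gates compute `PAR_3` on all eight points of the boolean cube:

```python
# Sketch: verify the G_1..G_4 perceptron circuit for PAR_3 by brute force.
from itertools import product

def gate(weights, theta, inputs):
    """Perceptron gate: output 1 if the weighted sum is >= theta, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, inputs)) >= theta else 0

def par3_circuit(x, y, z):
    g1 = gate((1, 1, 1), 0.5, (x, y, z))      # rules out (0,0,0)
    g2 = gate((-1, -1, -1), -1.5, (x, y, z))  # x+y+z <= 3/2, as -x-y-z >= -3/2
    g3 = gate((1, 1, 1), 2.5, (x, y, z))      # only (1,1,1) passes
    return gate((1, 1, 2), 2, (g1, g2, g3))   # Out_G1 + Out_G2 + 2*Out_G3 >= 2

for x, y, z in product((0, 1), repeat=3):
    assert par3_circuit(x, y, z) == (x + y + z) % 2
print("PAR_3 circuit agrees with parity on all 8 inputs.")
```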

Simplifying Symmetric Perceptrons

Given a boolean vector `vec{x}`, let `#(vec{x})` be the function which returns the number of 1 bits in `vec{x}`.

Lemma. Suppose a symmetric function `f(vec{x})` is computed as `|~ vec{w}\cdot vec{x} \geq theta ~|`, where `vec{w} in RR^n` and `theta in RR` with `theta > 0`. Then there exists a `gamma in RR` such that `f(vec{x}) = |~ gamma cdot #(vec{x}) \geq 1 ~|` for all `vec{x} in {0,1}^n`.

Proof. Since `f` is symmetric, either all of the following `n` inequalities hold, or none of them do:
\begin{eqnarray*} w_1x_1 + w_2x_2 + w_3x_3 + \cdots + w_{n-1} x_{n-1} + w_n x_n &\geq& \theta\\ w_1x_2 + w_2x_3 + w_3x_4 + \cdots + w_{n-1} x_{n} + w_n x_1 &\geq& \theta\\ &\vdots&\\ w_1x_{n-1} + w_2x_n + w_3x_1 + \cdots + w_{n-1} x_{n-3} + w_n x_{n-2} &\geq& \theta\\ w_1x_{n} + w_2x_1 + w_3x_2 + \cdots + w_{n-1} x_{n-2} + w_n x_{n-1} &\geq& \theta \end{eqnarray*} Adding these inequalities and grouping by the `x_i`'s implies (note if none of them hold, each left side is strictly less than `theta`, so the sum is strictly less than `n \cdot theta`) that `f(vec{x}) = 1` iff
`(sum_{i=1}^n w_i) x_1 + (sum_{i=1}^n w_i) x_2 + cdots + (sum_{i=1}^n w_i) x_n \geq n \cdot theta`.
Thus, `f(vec{x}) = 1` iff `(sum_{i=1}^n w_i)(sum_{i=1}^n x_i) \geq n\cdot theta`. As `#(vec{x}) = (sum_{i=1}^n x_i)` and `n \cdot theta > 0`, dividing both sides by `n \cdot theta` preserves the direction of the inequality, so defining `gamma = 1/(n\cdot theta) (sum_{i=1}^n w_i)` gives the result. Q.E.D.
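For a concrete sanity check of the lemma, here is a short Python sketch; the weights `(0.9, 1.0, 1.1)` and threshold `1.5` are an illustrative example of mine (this perceptron happens to compute the symmetric function `T_2^3` despite its unequal weights):

```python
# Sketch: verify f(x) = [gamma * #(x) >= 1] matches the original perceptron.
from itertools import product

w, theta, n = (0.9, 1.0, 1.1), 1.5, 3
gamma = sum(w) / (n * theta)   # gamma = (sum_i w_i) / (n * theta); theta > 0 here

for x in product((0, 1), repeat=n):
    f = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
    g = 1 if gamma * sum(x) >= 1 else 0
    assert f == g
print("f(x) = [gamma * #(x) >= 1] on all of {0,1}^3, gamma =", gamma)
```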

Minsky and Papert (1969)

Theorem. Let `Psi_k^n` be the collection of boolean functions on `n` inputs, each of which outputs 1 exactly when some fixed set of `k` of these inputs are all 1. Then if `K < n`, `PAR_n` cannot be computed as `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(vec{x})] \geq 1 ~|`.

Proof. Suppose `PAR_n` could be computed as above. Notice if `psi_{jk}` and `psi_{jk'}` depend on the same set of inputs being 1, we could combine their contribution to the top-level perceptron as `(alpha_{jk} + alpha_{jk'}) psi_{jk}(vec{x})`, so we may assume each set appears at most once. Notice given two bit vectors `vec{x} ne vec{x'}` which each have exactly `j` many `1` bits on, `PAR_n(vec{x}) = PAR_n(vec{x'})`. On input `vec{x}`, at most one level-`j` summand of `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(vec{x})] \geq 1 ~|` is nonzero, namely `alpha_{jk} psi_{jk}(vec{x})` for the `psi_{jk}` whose set is exactly the support of `vec{x}`; similarly, on `vec{x'}` it is `alpha_{jk'} psi_{jk'}(vec{x'})`, and the two outputs have to agree. This suggests, and Minsky and Papert's group invariance theorem (averaging the coefficients over all permutations of the inputs) makes precise, that we can take the coefficients for a given level `j` all to be the same, i.e., `alpha_{jk} = alpha_{jk'}`. Hence, our formula above can be simplified to `|~sum_{j=0}^K alpha_{j} [sum_{psi_{jk} in Psi_j^n} psi_{jk}(vec{x})] \geq 1 ~|`.

Now `sum_{psi_{jk} in Psi_j^n} psi_{jk}(vec{x}) = N_j(vec{x})`, where `N_j(vec{x})` is the number of size-`j` sets of inputs on which `vec{x}` is all 1s. So `N_j(vec{x}) = ((#(vec{x})),(j))`, a polynomial of degree `j` in `#(vec{x})`. Hence, our formula for `PAR_n(vec{x})` reduces to `|~sum_{j=0}^K alpha_{j} ((#(vec{x})),(j)) \geq 1 ~|`. Pulling the 1 to the other side of the inequality, determining the output can be viewed as checking whether a polynomial of degree at most `K` in `#(vec{x})` is `\geq 0`. Notice when `#(vec{x}) = 0`, this polynomial needs to be less than `0`; when `#(vec{x}) = 1`, it must be at least `0`; and so on, alternating as `#(vec{x})` runs from `0` to `n`. So it needs to cross a line just below the `x`-axis at least `n` times. Therefore, by the fundamental theorem of algebra (a nonzero polynomial of degree `K` has at most `K` real roots), `K` must be at least `n`, contradicting `K < n`. Q.E.D.
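The counting step `N_j(vec{x}) = ((#(vec{x})),(j))` is easy to check exhaustively for small `n`. Below is a short Python sketch (my own check, not part of the original notes):

```python
# Sketch: the number of size-j subsets of inputs on which x is all 1s
# equals C(#(x), j), for every x in {0,1}^n and every j.
from itertools import combinations, product
from math import comb

n = 5
for x in product((0, 1), repeat=n):
    ones = sum(x)
    for j in range(n + 1):
        # psi_{jk}(x) = 1 iff the k-th size-j subset of inputs is all 1s in x
        n_j = sum(1 for S in combinations(range(n), j) if all(x[i] for i in S))
        assert n_j == comb(ones, j)
print("N_j(x) = C(#(x), j) verified for all x in {0,1}^5 and all j.")
```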

On the Power of Perceptron Networks

Facts About Threshold Functions

  1. `T_{k}^n(x_1, ..., x_n) = T_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` and `bar{T}_{k}^n(x_1, ..., x_n) = bar{T}_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` for any permutation `pi`.
  2. `vv_{i=1}^n x_i = T_{1}^n(x_1, ..., x_n)`
  3. `^^_{i=1}^n x_i = T_{n}^n(x_1, ..., x_n)`
  4. `neg x = bar{T}_{0}^1(x)`
  5. `^^_{i=1}^n (neg x_i) = bar{T}_{0}^n(x_1, ..., x_n)`
  6. By repeating inputs, arbitrary positive integer weighted inputs can be computed using a threshold function.
  7. `^^_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{0})` where `vec{0}` is `n-1` zeros.
  8. `vv_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{1})` where `vec{1}` is `n-1` ones.
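A quick brute-force check of facts 2, 3, 7, and 8 above (a sketch; the helpers `T` and `MAJ` are my implementations taken directly from the definitions):

```python
# Sketch: verify OR/AND as threshold functions and via padded majorities.
from itertools import product

def T(k, xs):
    """Threshold T_k: 1 iff at least k of the inputs are 1."""
    return 1 if sum(xs) >= k else 0

def MAJ(xs):
    """Majority on an odd number of inputs: 1 iff more than half are 1."""
    return T(len(xs) // 2 + 1, xs)

n = 4
for x in product((0, 1), repeat=n):
    assert T(1, x) == max(x)                  # fact 2: OR = T_1^n
    assert T(n, x) == min(x)                  # fact 3: AND = T_n^n
    assert MAJ(x + (0,) * (n - 1)) == min(x)  # fact 7: AND via MAJ_{2n-1}, pad 0s
    assert MAJ(x + (1,) * (n - 1)) == max(x)  # fact 8: OR via MAJ_{2n-1}, pad 1s
print("Facts 2, 3, 7, 8 hold on all of {0,1}^4.")
```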

Conjunctive and Disjunctive Normal Form

CNFs and DNFs are universal

Claim. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is an `n`-variable CNF formula `phi` and there is an `n`-variable DNF formula `psi`, each of size at most `n2^n`, such that `phi(u) = psi(u) = f(u)` for every truth assignment `u in {0, 1}^n`. Here the size of a CNF or DNF formula is defined to be the number of `^^`'s/`vv`'s appearing in it.

Proof. For each `v in {0,1}^n` we make a clause `C_v(u_1, ..., u_n)` where `u_i` appears negated in the clause if bit `i` of `v` is `1`; otherwise, it appears un-negated. Notice this clause is an OR of `n` literals, so it has `n-1` `vv`'s. Also notice `C_v(v) = 0` and `C_v(u) = 1` for `u ne v`. Using these `C_v`'s we can define a CNF formula for `f` as:
`phi = ^^_(v:f(v) = 0) C_v(u_1, ..., u_n)`.
I.e., we take an AND, over the rows of the truth table on which `f` is false, of ORs of literals, each of which rules out its false row. As there are at most `2^n` strings `v` with `f(v) = 0`, the total size of this CNF is at most `n 2^n`.

For the DNF formula, let `A_v(u_1, ..., u_n)` be an AND based on the `v`th row of the truth table, where `u_i` appears negated in the AND if bit `i` of `v` is `0`; otherwise, it appears un-negated. Notice `A_v(v) = 1` and `A_v(u) = 0` for `u ne v`. Then we define
`psi = vv_(v:f(v) = 1) A_v(u_1, ..., u_n)`. Q.E.D.
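Here is a short Python sketch (my rendering of the construction in the proof; the function names are illustrative) that builds the CNF and DNF from a truth table and checks they agree with `f`:

```python
# Sketch: build phi (CNF) and psi (DNF) as in the proof and test them.
from itertools import product

def make_cnf(f, n):
    # one clause C_v per false row v; literal i is negated iff v[i] == 1
    return [v for v in product((0, 1), repeat=n) if f(v) == 0]

def eval_cnf(clauses, u):
    # C_v(u) = 1 iff u differs from v somewhere; AND over all clauses
    return all(any(u[i] != v[i] for i in range(len(u))) for v in clauses)

def make_dnf(f, n):
    # one term A_v per true row v; literal i is negated iff v[i] == 0
    return [v for v in product((0, 1), repeat=n) if f(v) == 1]

def eval_dnf(terms, u):
    # A_v(u) = 1 iff u equals v everywhere; OR over all terms
    return any(all(u[i] == v[i] for i in range(len(u))) for v in terms)

n = 3
f = lambda v: sum(v) % 2            # parity, as a sample truth table
cnf, dnf = make_cnf(f, n), make_dnf(f, n)
for u in product((0, 1), repeat=n):
    assert eval_cnf(cnf, u) == eval_dnf(dnf, u) == bool(f(u))
print("CNF and DNF from the proof agree with f on all inputs.")
```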

Notice either at least half of the rows in the truth table for `f` are 1 or at least half of the rows are `0`. By picking the CNF in the former case and the DNF in the latter, and using our earlier results expressing ANDs and ORs with threshold gates, we thus have:

Corollary. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is a network of threshold functions with at most three layers and size `O(n 2^{n-1})` computing it.

Remark. The top threshold gate computes an AND or an OR of up to `2^{n-1}` terms, so the network still has exponential size.
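To make the Corollary concrete, here is a sketch (my own construction, following facts 2-4 above, not code from the lecture) of the three-layer threshold network: layer 1 computes negations via `bar{T}_0^1`, layer 2 computes the DNF's ANDs via `T_n^n`, and layer 3 ORs the results via `T_1^m`:

```python
# Sketch: a three-layer network of threshold gates computing an arbitrary f.
from itertools import product

def T(k, xs):
    """T_k: 1 iff at least k of the inputs are 1."""
    return 1 if sum(xs) >= k else 0

def barT(k, xs):
    """bar{T}_k: 1 iff at most k of the inputs are 1."""
    return 1 if sum(xs) <= k else 0

def threshold_net(f, n, u):
    neg = [barT(0, (u[i],)) for i in range(n)]   # layer 1: NOT u_i (fact 4)
    terms = [v for v in product((0, 1), repeat=n) if f(v) == 1]
    ands = [T(n, [u[i] if v[i] else neg[i] for i in range(n)])  # layer 2 (fact 3)
            for v in terms]
    return T(1, ands)                            # layer 3: OR of the ANDs (fact 2)

n = 3
f = lambda v: sum(v) % 2   # parity as a sample f
for u in product((0, 1), repeat=n):
    assert threshold_net(f, n, u) == f(u)
print("Three-layer threshold network agrees with f on all of {0,1}^3.")
```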