Finish Go/No-Go Perceptron Results - Start SVMs




CS256

Chris Pollett

Sep 20, 2021

Outline

Introduction

Minsky Papert (1969)

Theorem. Let `Psi_k^n` be the collection of all boolean functions computable by perceptrons on `n` inputs which depend on exactly `k` of these inputs being 1. Then if `K < n`, `PAR_n` cannot be computed as `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(\vec{x})] \geq 1 ~|`

Remark: There are potentially a lot of functions `psi_{jk} in Psi_j^n`, so the inner sum is potentially of size exponential in `n`.

Proof. Suppose `PAR_n` could be computed as above. Notice if `psi_{jk}` and `psi_{jk'}` depend on the same inputs being 1, we could combine their contribution to the top level perceptron as `(alpha_{jk} + alpha_{jk'}) psi_{jk}(\vec{x})`. Notice given two bit vectors `vec{x} ne vec{x'}` which have exactly `j` many `1` bits on, `PAR_n(vec{x}) = PAR_n(vec{x'})`. In the first case, `|~sum_{j=0}^K sum_{psi_{jk} in Psi_j^n}[alpha_{jk} psi_{jk}(\vec{x})] \geq 1 ~|` reduces to at most one `j`th level summand `alpha_{jk} psi_{jk}(\vec{x})`, and in the second case to `alpha_{jk'} psi_{jk'}(\vec{x'})`; since the predicate must output the same value on both, these can be taken to agree. This shows we can take the coefficients for a given `j` level all to be the same, i.e., `alpha_{jk} = alpha_{jk'}`. Hence, our formula above can be simplified to `|~sum_{j=0}^K alpha_{j} [sum_{psi_{jk} in Psi_j^n} psi_{jk}(\vec{x})] \geq 1 ~|`.

`sum_{psi_{jk} in Psi_j^n} psi_{jk}(\vec{x}) = N_j(vec{x})` where `N_j(\vec{x})` is the number of size-`j` subsets of the inputs that are 1 in `vec{x}`. So `N_j(\vec{x}) = ((#(\vec{x})),(j))`, a polynomial of degree `j` in `#(\vec{x})`. Hence, our formula for `PAR_n(vec{x})` reduces to `|~sum_{j=0}^K alpha_{j} ((#(\vec{x})),(j)) \geq 1 ~|`. Pulling the 1 to the other side of the inequality, determining the output of this inequality can be viewed as checking whether a polynomial of degree at most `K` in `#(\vec{x})` is `\geq 0`. Notice when `#(\vec{x}) = 0`, this polynomial needs to be less than `0`; when `#(\vec{x}) = 1`, it must be at least `0`; and so on. So it needs to cross a line just below the `x`-axis at least `n` times. Therefore, by the fundamental theorem of algebra, its degree `K` must be at least `n`. Q.E.D.
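As a sanity check on this argument, here is a small sketch (not from the lecture; it assumes `numpy` and `scipy` are installed, and the helper name `best_margin` is just for illustration). After symmetrizing, computing `PAR_n` this way amounts to finding coefficients `alpha_0, ..., alpha_K` so that `sum_j alpha_j ((m),(j))` is at least `1` when `m = #(vec{x})` is odd and strictly below `1` when `m` is even. The linear program below maximizes the margin by which the even values can stay below `1`; the margin is positive when `K = n` but only `0` when `K = n - 1`, matching the theorem.

```python
# Sketch: after symmetrizing, a depth-K perceptron for PAR_n needs coefficients
# alpha_0..alpha_K with
#   sum_j alpha_j * C(m, j) >= 1   when m = #(x) is odd,
#   sum_j alpha_j * C(m, j) <  1   when m = #(x) is even.
# We ask a linear program for the largest margin eps by which the even rows can
# stay below 1; a strictly positive eps is achievable only when K >= n.
from math import comb
import numpy as np
from scipy.optimize import linprog

def best_margin(n, K):
    # variables: alpha_0, ..., alpha_K, eps ; objective: maximize eps
    c = np.zeros(K + 2); c[-1] = -1.0              # linprog minimizes, so use -eps
    A_ub, b_ub = [], []
    for m in range(n + 1):
        row = [comb(m, j) for j in range(K + 1)]
        if m % 2 == 1:                             # odd m: sum >= 1
            A_ub.append([-r for r in row] + [0.0]); b_ub.append(-1.0)
        else:                                      # even m: sum + eps <= 1
            A_ub.append(row + [1.0]); b_ub.append(1.0)
    bounds = [(None, None)] * (K + 1) + [(None, 1.0)]   # cap eps so the LP is bounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun if res.success else None

for n in (3, 4, 5):
    # margin is ~0 when K = n - 1 (parity impossible), positive when K = n
    print(n, best_margin(n, n - 1), best_margin(n, n))
```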

Quiz

Which of the following is true?

  1. `PAR_3` can be computed by a four neuron Perceptron network.
  2. Python does not support multiple inheritance.
  3. xrange is a Python3 function.

On the Power of Perceptron Networks

Facts About Threshold Functions

  1. `T_{k}^n(x_1, ..., x_n) = T_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` and `bar{T}_{k}^n(x_1, ..., x_n) = bar{T}_{k}^n(x_{pi(1)}, ..., x_{pi(n)})` for any permutation `pi`.
  2. `vv_{i=1}^n x_i = T_{1}^n(x_1, ..., x_n)`
  3. `^^_{i=1}^n x_i = T_{n}^n(x_1, ..., x_n)`
  4. `neg x = bar{T}_{0}^1(x)`
  5. `^^_{i=1}^n (neg x_i) = bar{T}_{0}^n(x_1, ..., x_n)`
  6. By repeating inputs, arbitrary positive integer weights on inputs can be simulated using a threshold function.
  7. `^^_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{0})` where `vec{0}` is `n-1` zeros.
  8. `vv_{i=1}^n x_i = MAJ_(2n-1)(x_1, ..., x_n, vec{1})` where `vec{1}` is `n-1` ones.
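
These facts are easy to confirm by brute force for small `n`. The following is a minimal sketch (not from the lecture) checking facts 2, 3, 7, and 8 over all inputs of length up to 5.

```python
# Brute-force check of facts 2, 3, 7, and 8 above for small n.
from itertools import product

def T(k, xs):            # T_k^n: 1 iff at least k inputs are 1
    return int(sum(xs) >= k)

def MAJ(xs):             # majority on an odd number of inputs
    return int(sum(xs) > len(xs) // 2)

for n in range(1, 6):
    for xs in product((0, 1), repeat=n):
        assert T(1, xs) == int(any(xs))                      # fact 2: OR
        assert T(n, xs) == int(all(xs))                      # fact 3: AND
        assert MAJ(xs + (0,) * (n - 1)) == int(all(xs))      # fact 7
        assert MAJ(xs + (1,) * (n - 1)) == int(any(xs))      # fact 8
print("facts 2, 3, 7, 8 verified for n <= 5")
```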

Conjunctive and Disjunctive Normal Form

CNFs and DNFs are universal

Claim. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is an `n`-variable CNF formula `phi` and there is an `n`-variable DNF formula `psi`, each of size at most `n2^n`, such that `phi(u) = psi(u) = f(u)` for every truth assignment `u in {0, 1}^n`. Here the size of a CNF or DNF formula is defined to be the number of `^^`'s/`vv`'s appearing in it.

Proof. For each `v in {0,1}^n` we make a clause `C_v(u_1, ..., u_n)` in which `u_i` appears negated if bit `i` of `v` is `1` and un-negated otherwise. Notice this clause is an OR of `n` literals. Also notice `C_v(v) = 0` and `C_v(u) = 1` for `u ne v`. Using these `C_v`'s we can define a CNF formula for `f` as:
`phi = ^^_(v:f(v) = 0) C_v(u_1, ..., u_n)`.
I.e., we AND together, one clause per row of the truth table on which `f` is false, an OR of literals which rules out that false row. As there are at most `2^n` strings `v` with `f(v) = 0`, the total size of this CNF is at most `n2^n`.

For the DNF formula let `A_v(u_1, ..., u_n)` be an AND based on the `v`th row of the truth table, where `u_i` appears negated in the AND if bit `i` of `v` is `0`, otherwise, it appears un-negated. Then we define
`psi = vv_(v:f(v) = 1) A_v(u_1, ..., u_n)`. Q.E.D.
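
The construction is easy to carry out mechanically. Below is a short sketch (my own illustration, not from the lecture) that builds the clauses `C_v` and terms `A_v` from a truth table and verifies that the resulting CNF and DNF both agree with `f`, using parity as the running example.

```python
# Build a clause C_v for every 0-row and a term A_v for every 1-row of the
# truth table of f, then check both formulas agree with f everywhere.
from itertools import product

def clause(v, u):   # C_v(u): OR of literals, u_i negated exactly when v_i == 1
    return int(any((1 - u[i]) if v[i] == 1 else u[i] for i in range(len(v))))

def term(v, u):     # A_v(u): AND of literals, u_i negated exactly when v_i == 0
    return int(all((1 - u[i]) if v[i] == 0 else u[i] for i in range(len(v))))

def check(f, n):
    rows = list(product((0, 1), repeat=n))
    cnf = lambda u: int(all(clause(v, u) for v in rows if f(v) == 0))
    dnf = lambda u: int(any(term(v, u) for v in rows if f(v) == 1))
    return all(cnf(u) == dnf(u) == f(u) for u in rows)

parity = lambda u: sum(u) % 2                      # PAR_n as a running example
print(all(check(parity, n) for n in range(1, 6)))  # True
```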

Notice either at least half of the rows in the truth table for `f` are 1 or at least half of the rows are `0`. By picking the CNF in the former case and the DNF in the latter, and from our earlier results expressing ANDs and ORs using threshold gates, we thus have:

Corollary. For every Boolean function `f:{0, 1}^n -> {0,1}`, there is an at most three layer network of threshold functions of size `O(n2^{n-1})` computing it.

Remark. The top threshold computes an AND or an OR of up to `2^{n-1}` clauses or terms, so the network still has exponential size.
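
To make the corollary concrete, here is a sketch (my own, following the corollary's construction rather than any code from the lecture) of a three-layer network of threshold gates computing `PAR_3`: layer one computes the negated inputs with `bar{T}_0^1`, layer two computes each DNF term with `T_n^n`, and the top layer ORs the terms with `T_1^m`.

```python
# Three layers of threshold gates computing PAR_3 via its DNF.
from itertools import product

T    = lambda k, xs: int(sum(xs) >= k)      # T_k^n: at least k inputs are 1
barT = lambda k, xs: int(sum(xs) <= k)      # bar{T}_k^n: at most k inputs are 1

def parity_threshold_net(u):
    n = len(u)
    neg = [barT(0, (u_i,)) for u_i in u]                         # layer 1: NOTs
    one_rows = [v for v in product((0, 1), repeat=n) if sum(v) % 2 == 1]
    terms = [T(n, [u[i] if v[i] else neg[i] for i in range(n)])  # layer 2: DNF terms
             for v in one_rows]
    return T(1, terms)                                           # layer 3: OR

print(all(parity_threshold_net(u) == sum(u) % 2
          for u in product((0, 1), repeat=3)))                   # True
```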

`p`-time algorithms

A photo of a Turing Machine implemented in real life

Simulating `p`-time algorithms with Threshold circuits

Theorem. A `p(n)`-time algorithm for a 1-tape TM can be simulated by an `O(p(n))`-layer threshold network of size `O(p(n)^2)`. Moreover, the `O(p(n))` layers are built out of one layer which maps the input to an encoding of the input, followed by `p(n)`-many `O(1)`-layer networks each of which computes the same function `L`, followed by a layer which maps the encoding of the output to the final output. An `L` layer can be further split into `p(n)` many threshold networks, each of size `O(1)` with `O(1)` inputs, computing the same function `U` in parallel.

Remark. Imagine we were trying to learn a `p`-time algorithm. The result above says that we just need to learn the `O(1)`-many weights in the repeated `U`, not polynomially many weights as you might initially guess. We can think of `U` as the network's analog of the Turing machine's finite control. The idea of using repeated sub-networks, each of which uses the same weights, is essentially the idea behind convolutional neural layers.
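
Here is a toy illustration of that weight-sharing idea (not the actual simulation from the proof; the shapes, the cell-encoding width of 4, and the names `U` and `L` are made up for the example): a single set of weights is applied in parallel to every window of an encoded tape, much the way a 1-D convolutional layer slides one filter across its input.

```python
# Toy weight sharing: one small threshold map U, with a single set of weights,
# is applied to every width-3 window of an encoded tape; stacking copies of the
# resulting layer L mimics running several steps with the same "finite control".
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 12)), rng.normal(size=4)     # the O(1)-size weights of "U"

def U(window):                        # window: 3 tape cells, each a length-4 vector
    return (W @ window.reshape(-1) + b > 0).astype(float)

def L(tape):                          # one layer: the same U at every tape position
    padded = np.vstack([np.zeros((1, 4)), tape, np.zeros((1, 4))])
    return np.stack([U(padded[i:i + 3]) for i in range(len(tape))])

tape = rng.integers(0, 2, size=(10, 4)).astype(float)   # encoded tape of length 10
config = tape
for _ in range(5):                    # 5 "steps": 5 copies of the same layer L
    config = L(config)
print(config.shape)                   # (10, 4): one cell encoding per tape position
```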

Proof of Simulation Result

Support Vector Machines

Maximum-margin separating hyperplane example

Mapping to a Separable Space

Theorem. It is always possible to map a dataset of examples `(vec{x_i}, y_i)`, where `vec{x_i} in RR^n` and `y_i in {0,1}`, to a higher dimensional space via a map `vec{f}: RR^n -> RR^m` such that the negative examples `vec{f}(vec{x_j})` can be separated from the positive examples `vec{f}(vec{x_k})` by a hyperplane in `RR^m`.

Proof. For each `vec{z} in {0,1}^n`, let `f_(vec{z})(vec{x})` be the function which is `1` if `vec{x} = vec{z}` and `vec{z}` is a positive training example, and `0` otherwise. Let `vec{f}` be the mapping from `RR^n -> RR^{2^n}` given by `vec{x} mapsto (f_{vec{0}}(vec{x}), ..., f_(vec{z})(vec{x}), ..., f_{vec{1}}(vec{x}))`. All of the negative examples will map to the `0` vector of length `2^n` and each positive example will map to a vector with exactly one coordinate 1. Hence, the hyperplane which cuts each axis in the target space at `1/2` will separate the positive from the negative examples. Q.E.D.
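
The construction can be checked directly for small `n`. Here is a sketch (my own, not from the lecture) using `PAR_3` as the labeling function: positives map to distinct one-hot vectors in `RR^{2^n}`, negatives map to the zero vector, and the hyperplane where the coordinates sum to `1/2` separates them.

```python
# Check the separable-space construction for n = 3 with parity as the labels.
from itertools import product

n = 3
X = list(product((0, 1), repeat=n))
y = [sum(x) % 2 for x in X]                      # labels in {0, 1}
positives = {x for x, label in zip(X, y) if label == 1}

def feature_map(x):                              # vec{f}: RR^n -> RR^{2^n}
    return [int(x == z and z in positives) for z in product((0, 1), repeat=n)]

for x, label in zip(X, y):
    side = int(sum(feature_map(x)) >= 0.5)       # which side of the hyperplane
    assert side == label
print("the hyperplane sum(f(x)) = 1/2 separates the mapped examples")
```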

Corollary. Any boolean function can be computed by an SVM (maybe slightly relaxing the definition of maximally separate to allow for unbounded support vectors).

Remark. The above construction gives an SVM which does not generalize very well, so it is not really a practical construction.

Remark. For SVMs, usually `y_i`'s are chosen from `{-1, 1}`, but a similar theorem to the above could still be obtained.

Other Ways to try to Map to a Separable Space

Computing Separators

Popular Choices of Kernels