Perceptron Learning, Python




CS256

Chris Pollett

Sep 11, 2017

Outline

Introduction

Perceptron Convergence Theorem

Lemma. Let `X` be a finite training set. Given `t`, a boolean threshold 1 perceptron, there is a weight vector `vec{w}^t` and a constant `c_t >0` such that `t(vec{x}) = 0 => vec{w}^t \cdot vec{x} \leq 1 - c_t` and `t(vec{x}) = 1 => vec{w}^t \cdot vec{x} \geq 1 + c_t` for all `vec{x} in X`.
Proof. Since `t` is a boolean threshold 1 perceptron, there is some weight vector `vec{w}` such that for `vec{x} in X`, `t(vec{x}) = 1` iff `vec{w} \cdot vec{x} geq 1`. Let `X^-` be the negative examples; as `X` is finite, so is `X^-`. Define `c_t = (1 - \max_{x in X^-}vec{w}\cdot vec{x})/2` (replacing `c_t` by `min(c_t, 1/2)` if necessary, so that `0 < c_t < 1`) and define `vec{w}^t := vec{w}/(1-c_t)`. Then
`t(vec{x}) = 1 => vec{w}^t \cdot vec{x} = 1/(1-c_t)vec{w}cdot vec{x} \geq 1/(1-c_t) geq 1 +c_t`,
`t(vec{x}) = 0 => vec{w}^t \cdot vec{x} = 1/(1-c_t)vec{w}cdot vec{x} \leq (1 -2 c_t)/(1-c_t) leq 1 - c_t`. Q.E.D.
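
To see the construction concretely, here is a small Python sketch of my own (not from the lecture): the weight vector and training set below are made up, and `c_t` is capped at `1/2` as noted in the proof.

    # Toy check of the lemma: rescale a separating weight vector so that
    # positive examples clear 1 + c_t and negative examples stay below 1 - c_t.
    import numpy as np

    w = np.array([0.7, 0.9])                       # made-up weights; t(x) = 1 iff w.x >= 1
    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
    labels = (X @ w >= 1).astype(int)              # t(x) on the training set

    neg = X[labels == 0]                           # the negative examples X^-
    c_t = min((1 - max(neg @ w)) / 2, 0.5)         # c_t from the proof, kept below 1
    w_t = w / (1 - c_t)                            # the rescaled vector vec{w}^t

    for x, y in zip(X, labels):
        margin = w_t @ x
        assert (margin >= 1 + c_t) if y == 1 else (margin <= 1 - c_t)
    print("c_t =", c_t, "; rescaled weights =", w_t)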

Theorem. Let `X` be a training set on `RR^n`, let `t` be a boolean threshold 1 perceptron, and let `nu leq c_t/n` (if `X` is infinite we assume there is still a `c_t` as in the lemma). Suppose `L_{\nu}` is applied with initial weights all zero. Then after at most `I(t, nu) = \lfloor(||vec{w}^t||^2)/(nu c_t) rfloor` steps that change the weights, the algorithm will converge on a final set of weights.

Proof of Perceptron Convergence Theorem

Let `g_{vec{w}}(\vec{x})` denote the boolean threshold 1 perceptron with weights `vec{w}`. Let `eta = t(vec{x}) - g_{vec{w}}(\vec{x})`. Notice `eta` is `\pm 1` or `0`; when `eta = 0` the weights are left unchanged, so we only consider the nonzero case. In the nonzero case, when `L_{nu}` is applied the weight vector changes as
`vec{w}' = vec{w} + nu eta vec{x}`.
When `eta = +1`, `t(vec{x}) = 1` and `g_{vec{w}}(\vec{x}) = 0`, so
`vec{w}^t cdot \vec{x} geq 1 + c_t` and `vec{w} cdot \vec{x} leq 1`.
This gives `(vec{w}^t - vec{w})\cdot \vec{x} geq c_t`.
Similarly, when `eta = -1` we have
`(vec{w} - vec{w}^t )\cdot \vec{x} geq c_t`.
We can combine these two statements to say that whenever `L_{nu}` changes the weights we have
`eta(vec{w}^t - vec{w})\cdot \vec{x} geq c_t`.
Calculating the change in `||vec{w}^t - vec{w}||^2` after an invocation of `L_{nu}` which changes something gives:
\begin{eqnarray*} ||\vec{w}^t - \vec{w}'||^2 & = & ||\vec{w}^t - \vec{w} - \eta \nu \vec{x}||^2\\ &= & ||\vec{w}^t - \vec{w}||^2 - 2 \eta \nu (\vec{w}^t - \vec{w}) \cdot \vec{x} + \eta^2 \nu^2||\vec{x}||^2\\ &\leq & ||\vec{w}^t - \vec{w}||^2 + \nu^2||\vec{x}||^2 - 2 \nu c_t \end{eqnarray*} Since `vec{x}` has at most `n` on bits, `||vec{x}||^2 \leq n`, and since we assume `nu leq c_t/n`, we have `\nu^2||\vec{x}||^2 leq nu c_t`. So we get
`||\vec{w}^t - \vec{w}'||^2 leq ||\vec{w}^t - \vec{w}||^2 - nu c_t`.
So each training step that changes the weights reduces the value of `||vec{w}^t - vec{w}||^2` by at least `nu c_t`. This quantity starts at `||vec{w}^t||^2` when the initial weights are 0, and it can never be negative. Hence, there are at most `I(t, nu) = \lfloor(||vec{w}^t||^2)/(nu c_t) rfloor` weight-changing steps before convergence. Q.E.D.
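
As an illustration (my own sketch, not code from the lecture), the following Python runs the update rule from the proof, `vec{w}' = vec{w} + nu eta vec{x}`, with all-zero initial weights on a made-up target threshold 1 perceptron, and checks that the number of weight-changing steps stays within `I(t, nu)`.

    # Sketch: run the threshold 1 perceptron rule on a toy target and compare
    # the number of weight-changing updates with the bound I(t, nu).
    import numpy as np

    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)   # training set in {0,1}^2
    w_star = np.array([0.7, 0.9])                            # made-up target weights
    t = (X @ w_star >= 1).astype(int)                        # target labels t(x)

    # c_t and vec{w}^t as in the lemma
    c_t = min((1 - max(X[t == 0] @ w_star)) / 2, 0.5)
    w_t = w_star / (1 - c_t)

    n = X.shape[1]
    nu = c_t / n                                   # learning rate with nu <= c_t / n
    w = np.zeros(n)                                # initial weights all zero
    changes = 0
    while True:
        changed = False
        for x, target in zip(X, t):
            eta = target - int(x @ w >= 1)         # eta = t(x) - g_w(x)
            if eta != 0:
                w = w + nu * eta * x               # the update rule
                changes += 1
                changed = True
        if not changed:
            break

    bound = int((w_t @ w_t) / (nu * c_t))          # I(t, nu)
    print(changes, "weight changes; the theorem's bound is", bound)
    assert changes <= bound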

Learning Results

Nested Boolean Functions

Definition. The class of nested functions over `x_1, ..., x_n`, `NF_n`, is defined as follows:

  1. For `n=1`, `x_1` and `bar{x_1}` (the negation of `x_1`) are nested.
  2. For `n > 1`, `f(x_1, ..., x_n)` is nested if `f = g \star l_n`, where `g` is a nested function over `x_1, ..., x_{n-1}`, `star` is either `vv` or `^^`, and `l_n` is either `x_n` or `bar{x_n}`.

For example, `(x_1 ^^ bar{x_2}) vv x_3` is a nested formula; `(x_1 ^^ bar{x_2}) vv (bar{x}_3 ^^ bar{x}_4)` is not.
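
One way to make the definition concrete in Python (my own encoding, not from the lecture) is to store a nested function as a signed first literal plus a list of `(operator, signed literal)` pairs, mirroring the recursive definition:

    # Evaluate a nested function given as (first, rest):
    #   first = +1 for x_1, -1 for bar{x_1};
    #   rest  = [(op, lit), ...], op in {'and', 'or'}, lit = +-i for x_i / bar{x_i}.
    def eval_nested(first, rest, x):
        val = x[0] if first > 0 else 1 - x[0]
        for op, lit in rest:
            bit = x[abs(lit) - 1] if lit > 0 else 1 - x[abs(lit) - 1]
            val = (val & bit) if op == 'and' else (val | bit)
        return val

    # The example above: (x_1 ^^ bar{x_2}) vv x_3
    f = (+1, [('and', -2), ('or', +3)])
    print([eval_nested(f[0], f[1], (x1, x2, x3))
           for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)])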

Gradual Boolean Threshold Functions

Remark. There is a slide on hyperplanes in the last lecture's notes.

In what follows, we will assume that `theta` is no longer a coordinate of `vec{w}`.

Definition. Let `G = {G_n}` be a collection of hyperplanes, where `G_n` is a hyperplane in `RR^n`. Then `G` is gradual if there is a constant `c > 0` such that for every `tau geq 0` and every `n geq 1`, at most `c tau 2^n` many `x in {0,1}^n` lie within distance `tau` of `G_n`.

Definition. A class of boolean threshold functions `F` is said to be gradual if there is a mapping `phi: F -> G`, where `G` is a gradual family of hyperplanes, such that for all `f in F`, if `phi(f)` is the hyperplane `vec{w} \cdot vec{x} = theta`, then `(vec{w}, theta)` represents the boolean threshold function `f`.

Representations of Nested Boolean Functions

Lemma. Any `f in NF_n` can be represented by a boolean threshold function
`w_1x_1 + cdots + w_nx_n geq theta_n`,
with `theta_n = k +1/2` for some integer `k`, `w_i = \pm 2^{i-1}`, and `sum_{w_i<0}w_i < theta_n < sum_{w_i>0}w_i`.

Proof. By induction on `n`. When `n=1`, we can take `x_1 \geq 1/2` for `x_1` and `-x_1 geq -1/2` for `bar{x_1}`. For `n > 1`, `f` is of the form `g star l_n`, and by the induction hypothesis `g` can be written in the form `w_1x_1 + cdots + w_{n-1}x_{n-1} geq theta_{n-1}` with the `w_i`'s satisfying the conditions of the lemma. The rest of the proof splits into four cases depending on the value of `star` and the sign of the literal. We only show one case; the rest are similar. Suppose `f = g ^^ x_n`. Then we can use:
`w_1x_1 + cdots w_{n-1}x_{n-1} + 2^{n-1}x_n \geq theta_n = theta_{n-1} + 2^{n-1}`.
Notice if both `g(\vec{x})` and `x_n` hold, then by the induction hypothesis the first `n-1` terms sum to at least `theta_{n-1}` and the last term is `2^{n-1}`, so the threshold `theta_n = theta_{n-1} + 2^{n-1}` is met. Conversely, if `x_n = 0`, the sum is at most `sum_{w_i>0, i<n} w_i = 2^{n-1} - 1 + sum_{w_i<0, i<n} w_i < 2^{n-1} + theta_{n-1} = theta_n`, using the lemma's condition `sum_{w_i<0} w_i < theta_{n-1}`; and if `g(\vec{x})` fails, the first `n-1` terms sum to less than `theta_{n-1}`, so even with the last term the total is below `theta_n`. In either case the threshold cannot be met. Q.E.D.
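
Here is a Python sketch of the construction, using the same encoding as the earlier sketch. The lemma's proof only shows the `g ^^ x_n` case; the threshold updates for the other three cases below are my own reading of "the rest are similar", and the brute-force check at the end verifies the resulting representation on the example formula.

    from itertools import product

    def lit(signed_index, x):
        i = abs(signed_index) - 1
        return x[i] if signed_index > 0 else 1 - x[i]

    def eval_nested(first, rest, x):
        val = lit(first, x)
        for op, l in rest:
            val = (val & lit(l, x)) if op == 'and' else (val | lit(l, x))
        return val

    def threshold_rep(first, rest):
        """Return (weights, theta) with w_i = +-2^(i-1) and theta = k + 1/2."""
        w = [1 if first > 0 else -1]
        theta = 0.5 if first > 0 else -0.5
        for i, (op, l) in enumerate(rest, start=2):
            w.append(2**(i-1) if l > 0 else -2**(i-1))
            if op == 'and' and l > 0:        # the case shown in the proof
                theta += 2**(i-1)
            elif op == 'or' and l < 0:       # my guess for the symmetric case
                theta -= 2**(i-1)
            # 'and' with a negated literal and 'or' with a plain literal
            # leave theta unchanged (again my reading of the omitted cases)
        return w, theta

    # (x_1 ^^ bar{x_2}) vv x_3 again
    first, rest = +1, [('and', -2), ('or', +3)]
    w, theta = threshold_rep(first, rest)
    for x in product((0, 1), repeat=1 + len(rest)):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        assert eval_nested(first, rest, x) == int(dot >= theta)
    print("weights", w, "theta", theta)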

Lemma. Nested Boolean Functions can be represented by gradual boolean threshold functions.

Proof. By the above lemma, given `f in NF_n` there is a linear threshold function `vec{w}\cdot vec{x} \geq theta` which represents `f`, with `w_i = \pm 2^{i-1}` and `theta = k + 1/2` for some integer `k`. For `x in {0,1}^n`, the value `vec{w}\cdot vec{x}` is an integer; since `theta` is a half-integer, no `x in {0,1}^n` can have `|vec{w}\cdot vec{x} - theta| < 1/2`. Moreover, since the `|w_i|` are distinct powers of 2, the bits of `vec{x}` can be recovered from the value `vec{w}\cdot vec{x}`, so at most one `vec{x}` can satisfy `vec{w}\cdot vec{x} = t` for a given integer `t`. Thus,
`|{x in {0,1}^n: |vec{w}\cdot vec{x} - theta| leq m}| leq 2m +1`
for `m geq 1/2`. The distance from a point `vec{x}'` to the hyperplane `vec{w}\cdot vec{x} = theta` is `||vec{w}||^{-1}\cdot |vec{w} \cdot vec{x}' - theta|`; the `||vec{w}||^{-1}` factor is there because the weight vector might not be normalized. The lemma then follows by noting, using the formula for a geometric series, that `||vec{w}|| = (\sum_{i=1}^n w_i^2)^{1/2} = (\sum_{i=1}^n (2^{i-1})^2)^{1/2} = (\sum_{i=1}^n 4^{i-1})^{1/2} = ((4^n-1)/3)^{1/2} < 2^n/sqrt(3) < 2^n`.
So
`|{x in {0,1}^n: ||vec{w}||^{-1}|vec{w}\cdot vec{x} - theta| leq tau}| = |{x in {0,1}^n: |vec{w}\cdot vec{x} - theta| leq tau ||vec{w}||}| \leq 2tau||vec{w}|| + 1 < (2/sqrt(3))tau 2^n + 1 leq 3 tau 2^n`
whenever `tau ||vec{w}|| geq 1/2`. Here the second inequality uses `||vec{w}|| < 2^n/sqrt(3)`, and the last one holds since `tau||vec{w}|| geq 1/2` forces `tau 2^n geq sqrt(3)/2`, so `1 leq (3 - 2/sqrt(3))tau 2^n`. When `tau ||vec{w}|| < 1/2` the set is empty (every `x in {0,1}^n` has `|vec{w}\cdot vec{x} - theta| geq 1/2`), so the bound `3 tau 2^n` holds trivially. Hence we can take `c` in the definition of gradual to be 3.
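
The counting bound can also be checked numerically. The sketch below (my own) takes made-up weights of the form `+-2^{i-1}` with a half-integer threshold, counts the points of `{0,1}^n` within distance `tau` of the hyperplane, and compares against `3 tau 2^n`.

    from itertools import product
    from math import sqrt

    w, theta = [1, -2, 4, -8, 16], 2.5     # made-up weights +-2^(i-1), half-integer theta
    n = len(w)
    norm = sqrt(sum(wi * wi for wi in w))  # ||w||, used to convert distances

    for tau in (0.01, 0.05, 0.1, 0.25, 0.5):
        close = sum(1 for x in product((0, 1), repeat=n)
                    if abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) <= tau * norm)
        assert close <= 3 * tau * 2**n
        print(f"tau={tau}: {close} points within tau (bound {3 * tau * 2**n:.1f})")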

PAC-Learning of Gradual Threshold Functions

Theorem. If `C` is a gradual class of boolean threshold functions, then the perceptron rule is a PAC Learning algorithm for `C` under the uniform distribution on `{0, 1}^n`.

Proof. Let `vec{w} \cdot vec{x} \ge theta` be an `n`-bit linear threshold representation of a function in the class `C`, given by the map `phi` from the definition of gradual. Without loss of generality, we can take `vec{w}, theta` to be normalized, so that `|vec{w}\cdot vec{x} - theta|` is the distance of the point `vec{x}` to the hyperplane. By the definition of gradual, there is some constant `k` such that for all `tau > 0`, the probability that a uniformly chosen element of `{0,1}^n` is within distance `tau` of the hyperplane `vec{w}\cdot vec{x} = theta` is at most `tau/(2k)` (for instance, `k = 1/(2c)` works, where `c` is the constant from the definition of gradual). If we set `tau = k epsilon`, then with probability at most `epsilon/2`, a random example drawn from `{0,1}^n` is within `k epsilon` of the hyperplane. So if we let `B \subseteq {0,1}^n` be the set of examples `x` which lie within `k epsilon` of the hyperplane, then `Pr[x in B] le epsilon/2`.

Let `(vec{w}_t, theta_t)` be the perceptron algorithm's hypothesis after `t` updates have been made. If `epsilon ge 1`, the definition of PAC-learnability is trivially satisfied, so assume `epsilon < 1`. Also, from the definition of gradual, if a collection of hyperplanes is gradual with constant `c`, then it is gradual with any constant `c' > c`; so, enlarging `c` if necessary, we can assume the `k` above is at most 1. Suppose `(vec{w}_t, theta_t)` is not yet `epsilon`-accurate. Then it misclassifies a random example with probability more than `epsilon`, while it misclassifies an example in `B` with probability at most `Pr[x in B] le epsilon/2`; so, given that the next example causes an update, the probability that this example is in `B` is at most `(epsilon/2)/epsilon = 1/2`. Define the potential function
`N_t(alpha) = ||alpha vec{w} - vec{w}_t||^2 + (alpha theta - theta_t)^2`.

The perceptron update rule tells us `vec{w}_{t+1} = vec{w}_t \pm vec{x}` and `theta_{t+1} = theta_{t} \mp 1`, so `N_{t+1}(alpha) - N_t(alpha)` is
\begin{eqnarray*} \Delta N(\alpha) &=& ||\alpha \vec{w} - \vec{w}_{t+1}||^2 + (\alpha \theta - \theta_{t+1})^2\\ &&\quad - ||\alpha \vec{w} - \vec{w}_t||^2 - (\alpha \theta - \theta_t)^2\\ &=&\mp 2\alpha \vec{w} \cdot \vec{x} \pm 2\alpha\theta \pm 2\vec{w}_t\cdot \vec{x} \mp 2 \theta_t + ||\vec{x}||^2 +1\\ &\leq& 2 \alpha A \pm 2 (\vec{w}_t \cdot \vec{x} - \theta_t) + n+1. \end{eqnarray*} Here `A = \mp (vec{w}\cdot vec{x} - theta)`, and we are again using that `||vec{x}||^2 le n`.
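
As a consistency check (my own, not part of the lecture), the sympy sketch below re-derives the expansion of `Delta N(alpha)` with `n = 3`, writing the `\pm` sign of the update as a symbol `s` that is later set to `+1` and `-1`; the symbols `u1, u2, u3` stand for the current weights `vec{w}_t`.

    import sympy as sp

    alpha, theta, theta_t, s = sp.symbols('alpha theta theta_t s')
    w   = sp.Matrix(sp.symbols('w1:4'))    # target weights vec{w}
    w_t = sp.Matrix(sp.symbols('u1:4'))    # current weights vec{w}_t
    x   = sp.Matrix(sp.symbols('x1:4'))    # the misclassified example

    w_next, theta_next = w_t + s * x, theta_t - s      # the update rule
    delta_N = ((alpha * w - w_next).dot(alpha * w - w_next)
               + (alpha * theta - theta_next)**2
               - (alpha * w - w_t).dot(alpha * w - w_t)
               - (alpha * theta - theta_t)**2)
    claimed = (-2 * s * alpha * w.dot(x) + 2 * s * alpha * theta
               + 2 * s * w_t.dot(x) - 2 * s * theta_t + x.dot(x) + 1)
    diff = sp.expand(delta_N - claimed)
    print(diff.subs(s, 1), diff.subs(s, -1))           # both print 0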

Proof of PAC-Learning cont'd

Since we are assuming `vec{x}` was misclassified, we know `\pm(vec{w}_t \cdot vec{x} - theta_t) \le 0`, so `\Delta N(\alpha) \le 2alpha A + n + 1`. If `x in B` then `A \le 0`; if `x !in B`, then `A leq -k epsilon`. So `\Delta N(\alpha) \le n + 1` for `x in B` and `\Delta N(\alpha) \le n + 1 - 2k \epsilon\alpha` for `x !in B`. Suppose the perceptron algorithm has made `r` updates with examples in `B`, and `s` updates with examples outside `B`. Since `(vec{w}, theta)` was normalized, `|theta| \leq sqrt(n)`. Recall that at the start of the perceptron algorithm the initial weights are all `0`. Hence, `N_0(\alpha) \leq alpha^2(n+1)`. Since `N_t(\alpha) ge 0` for all `t`, it follows that
`0 le r(n+1) + s(n+1 - 2k\epsilon\alpha) + alpha^2(n+1)`.
Setting `alpha = (12(n+1))/(5 k epsilon)` and dividing through by `n+1`, the above simplifies to
`0 \leq r - 19/5s + (144(n+1)^2)/(25(k epsilon)^2)`.
Suppose `m_1 = (144(n+1)^2)/(25(k epsilon)^2)` updates have been made, so that `r + s = m_1` and `r = m_1 - s`. Substituting this into the above inequality, we get
\begin{eqnarray*} 0 &\leq& m_1 - s - \frac{19}{5}s + m_1\\ 0 &\leq& 2m_1 - \frac{24}{5} s\\ \frac{24}{5}s &\leq & 2m_1\\ s &\leq& \frac{10}{24} m_1\\ s &\leq & \frac{5}{12}m_1. \end{eqnarray*} So at least a `7/12` fraction of the updates must have been made on examples in `B`.
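
The algebra in the last two displays can be re-checked mechanically; the sympy sketch below (my own) substitutes the choice of `alpha` into the potential inequality, divides by `n+1`, and then sets `r = m_1 - s`.

    import sympy as sp

    n, k, eps, r, s = sp.symbols('n k epsilon r s', positive=True)
    alpha = 12 * (n + 1) / (5 * k * eps)
    m1 = 144 * (n + 1)**2 / (25 * (k * eps)**2)

    rhs = r * (n + 1) + s * (n + 1 - 2 * k * eps * alpha) + alpha**2 * (n + 1)
    # after dividing by n + 1 this should equal r - (19/5) s + m_1 ...
    print(sp.simplify(rhs / (n + 1) - (r - sp.Rational(19, 5) * s + m1)))
    # ... and with r = m_1 - s it should equal 2 m_1 - (24/5) s; both print 0
    print(sp.simplify((rhs / (n + 1)).subs(r, m1 - s) - (2 * m1 - sp.Rational(24, 5) * s)))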

Proof of PAC-Learning Conclusion

If the perceptron's hypothesis has never been `\epsilon`-accurate, then from our discussion at the start of the proof, at each update the probability of that update occurring on a point in `B` is at most `1/2`. So by a Chernoff bound, the probability that more than `7/12`ths of `m = max(-144ln(\delta/2), m_1)` updates occur in `B` is at most `delta/2`. That is, the Chernoff bound for seeing `(7/12)m` trials in `B` when we expect only `(1/2)m` is governed by `p = 1/2`, `m`, and the `c` satisfying `(1+c)(1/2)m = (7/12)m`; solving for `c` gives `c = 1/6`. The bound given by Chernoff's inequality is then at most `e^{(-c^2p m)/2} = e^{-((1/6)^2(1/2)m)/2} = e^(-m/144) leq e^(ln(delta/2)) = delta/2`, using `m geq -144ln(delta/2)`. So this means that with probability at least `1 - \delta/2`, after `m` updates the perceptron algorithm will have found an `epsilon`-accurate hypothesis. Also, with probability at least `1-\delta/2`, using `(2m)/epsilon` examples will ensure that `m` updates occur. Thus, with probability at least `1 - delta/2 - delta/2 = 1 - delta`, using `(2m)/epsilon` examples ensures that `m` updates occur and that the result of these updates is `epsilon`-accurate. Q.E.D.
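
A quick numeric check (my own) of the constants in this Chernoff step:

    from math import exp, log

    p, c = 1/2, 1/6                      # (1 + c) * (1/2) m = (7/12) m gives c = 1/6
    assert abs((1 + c) * p - 7/12) < 1e-12

    delta = 0.05                         # any confidence parameter in (0, 1)
    m = -144 * log(delta / 2)            # the choice m >= -144 ln(delta/2)
    bound = exp(-c**2 * p * m / 2)       # Chernoff bound e^(-c^2 p m / 2) = e^(-m/144)
    print(bound, delta / 2)              # the two agree
    assert abs(bound - delta / 2) < 1e-12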

Quiz

Which of the following is true?

  1. For a probability distribution `P` and events `A`, `B`, it is always the case that `P(A cup B) = P(A) + P(B)`.
  2. If a function `g` is PAC-learnable on `D` to within `epsilon`, then the training algorithm always succeeds in outputting a `w`, such that on the test data drawn according to `D`, we have `E(m(d(w,x_t), g(x_t))) leq epsilon`.
  3. There exists an `n times n` matrix `A` over `RR` that does not have an inverse.

Getting Started with Python

Running Python