Perceptron Learning, Python




CS256

Chris Pollett

Sep 11, 2017

Outline

Introduction

Perceptron Convergence Theorem

Lemma. Let `X` be a finite training set. Given `t`, a boolean threshold 1 perceptron, there is a weight vector `vec{w}^t` and a constant `c_t >0` such that `t(vec{x}) = 0 => vec{w}^t \cdot vec{x} \leq 1 - c_t` and `t(vec{x}) = 1 => vec{w}^t \cdot vec{x} \geq 1 + c_t` for all `vec{x} in X`.
Proof. Since `t` is a boolean threshold 1 perceptron, there is some weight vector `vec{w}` such that for `vec{x} in X`, `t(vec{x}) = 1` iff `vec{w} \cdot vec{x} geq 1`. Let `X^-` be the negative examples; as `X` is finite, so is `X^-`. Define `c_t = (1 - \max_{x in X^-}vec{w}\cdot vec{x})/2` (replacing `c_t` by `min(c_t, 1/2)` if necessary, so that `0 < c_t < 1`) and define `vec{w}^t := vec{w}/(1-c_t)`. Then
`t(vec{x}) = 1 => vec{w}^t \cdot vec{x} = 1/(1-c_t)vec{w}cdot vec{x} \geq 1/(1-c_t) geq 1 +c_t`,
`t(vec{x}) = 0 => vec{w}^t \cdot vec{x} = 1/(1-c_t)vec{w}cdot vec{x} \leq (1 -2 c_t)/(1-c_t) leq 1 - c_t`. Q.E.D.
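
To see the construction concretely, here is a small Python sketch of my own (not from the lecture): the weight vector and training set below are made up, and `c_t` is capped at `1/2` as noted in the proof.

    # Toy check of the lemma: rescale a separating weight vector so that
    # positive examples clear 1 + c_t and negative examples stay below 1 - c_t.
    import numpy as np

    w = np.array([0.7, 0.9])                       # made-up weights; t(x) = 1 iff w.x >= 1
    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
    labels = (X @ w >= 1).astype(int)              # t(x) on the training set

    neg = X[labels == 0]                           # the negative examples X^-
    c_t = min((1 - max(neg @ w)) / 2, 0.5)         # c_t from the proof, kept below 1
    w_t = w / (1 - c_t)                            # the rescaled vector vec{w}^t

    for x, y in zip(X, labels):
        margin = w_t @ x
        assert (margin >= 1 + c_t) if y == 1 else (margin <= 1 - c_t)
    print("c_t =", c_t, "; rescaled weights =", w_t)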

Theorem. Let `X` be a training set on `RR^n`, let `t` be a boolean threshold 1 perceptron, and let `nu leq c_t/n` (if `X` is infinite we assume there is still a `c_t` as in the lemma). Suppose `L_{\nu}` is applied with initial weights all zero. Then after at most `I(t, nu) = \lfloor(||vec{w}^t||^2)/(nu c_t) rfloor` steps that change the weights, the algorithm will converge on a final set of weights.

Proof of Perceptron Convergence Theorem

Let `g_{vec{w}}(\vec{x})` denote the boolean threshold 1 perceptron with weights `vec{w}`. Let `eta = t(vec{x}) - g_{vec{w}}(\vec{x})`. Notice `eta` is `\pm 1` or `0`; when `eta = 0` the weights are left unchanged, so we only consider the nonzero case. In the nonzero case, when `L_{nu}` is applied the weight vector changes as
`vec{w}' = vec{w} + nu eta vec{x}`.
When `eta = +1`, `t(vec{x}) = 1` and `g_{vec{w}}(\vec{x}) = 0`, so
`vec{w}^t cdot \vec{x} geq 1 + c_t` and `vec{w} cdot \vec{x} leq 1`.
This gives `(vec{w}^t - vec{w})\cdot \vec{x} geq c_t`.
Similarly, when `eta = -1` we have
`(vec{w} - vec{w}^t )\cdot \vec{x} geq c_t`.
We can combine these two statements to say that whenever `L_{nu}` changes the weights we have
`eta(vec{w}^t - vec{w})\cdot \vec{x} geq c_t`.
Calculating the change in `||vec{w}^t - vec{w}||^2` after an invocation of `L_{nu}` which changes something gives:
\begin{eqnarray*} ||\vec{w}^t - \vec{w}'||^2 & = & ||\vec{w}^t - \vec{w} - \eta \nu \vec{x}||^2\\ &= & ||\vec{w}^t - \vec{w}||^2 - 2 \eta \nu (\vec{w}^t - \vec{w}) \cdot \vec{x} + \eta^2 \nu^2||\vec{x}||^2\\ &\leq & ||\vec{w}^t - \vec{w}||^2 + \nu^2||\vec{x}||^2 - 2 \nu c_t \end{eqnarray*} Since `vec{x}` has at most `n` on bits, `||vec{x}||^2 \leq n`, and since we assume `nu leq c_t/n`, we have `\nu^2||\vec{x}||^2 leq nu c_t`. So we get
`||\vec{w}^t - \vec{w}'||^2 leq ||\vec{w}^t - \vec{w}||^2 - nu c_t`.
So each training step that changes the weights reduces the value of `||vec{w}^t - vec{w}||^2` by at least `nu c_t`. This quantity starts at `||vec{w}^t||^2` when the initial weights are 0, and it can never be negative. Hence, there are at most `I(t, nu) = \lfloor(||vec{w}^t||^2)/(nu c_t) rfloor` weight-changing steps before convergence. Q.E.D.
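
As an illustration (my own sketch, not code from the lecture), the following Python runs the update rule from the proof, `vec{w}' = vec{w} + nu eta vec{x}`, with all-zero initial weights on a made-up target threshold 1 perceptron, and checks that the number of weight-changing steps stays within `I(t, nu)`.

    # Sketch: run the threshold 1 perceptron rule on a toy target and compare
    # the number of weight-changing updates with the bound I(t, nu).
    import numpy as np

    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)   # training set in {0,1}^2
    w_star = np.array([0.7, 0.9])                            # made-up target weights
    t = (X @ w_star >= 1).astype(int)                        # target labels t(x)

    # c_t and vec{w}^t as in the lemma
    c_t = min((1 - max(X[t == 0] @ w_star)) / 2, 0.5)
    w_t = w_star / (1 - c_t)

    n = X.shape[1]
    nu = c_t / n                                   # learning rate with nu <= c_t / n
    w = np.zeros(n)                                # initial weights all zero
    changes = 0
    while True:
        changed = False
        for x, target in zip(X, t):
            eta = target - int(x @ w >= 1)         # eta = t(x) - g_w(x)
            if eta != 0:
                w = w + nu * eta * x               # the update rule
                changes += 1
                changed = True
        if not changed:
            break

    bound = int((w_t @ w_t) / (nu * c_t))          # I(t, nu)
    print(changes, "weight changes; the theorem's bound is", bound)
    assert changes <= bound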

Learning Results

Nested Boolean Functions

Definition. The class of nested functions over `x_1, ..., x_n`, `NF_n`, is defined as follows:

  1. For `n=1`, `x_1` and `bar{x_1}` (the negation of `x_1`) are nested.
  2. For `n > 1`, `f(x_1, ..., x_n)` is nested if `f = g \star l_n`, where `g` is a nested function over `x_1, ..., x_{n-1}`, `star` is either `vv` or `^^`, and `l_n` is either `x_n` or `bar{x_n}`.

For example, `(x_1 ^^ bar{x_2}) vv x_3` is a nested formula; `(x_1 ^^ bar{x_2}) vv (bar{x}_3 ^^ bar{x}_4)` is not.
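
One way to make the definition concrete in Python (my own encoding, not from the lecture) is to store a nested function as a signed first literal plus a list of `(operator, signed literal)` pairs, mirroring the recursive definition:

    # Evaluate a nested function given as (first, rest):
    #   first = +1 for x_1, -1 for bar{x_1};
    #   rest  = [(op, lit), ...], op in {'and', 'or'}, lit = +-i for x_i / bar{x_i}.
    def eval_nested(first, rest, x):
        val = x[0] if first > 0 else 1 - x[0]
        for op, lit in rest:
            bit = x[abs(lit) - 1] if lit > 0 else 1 - x[abs(lit) - 1]
            val = (val & bit) if op == 'and' else (val | bit)
        return val

    # The example above: (x_1 ^^ bar{x_2}) vv x_3
    f = (+1, [('and', -2), ('or', +3)])
    print([eval_nested(f[0], f[1], (x1, x2, x3))
           for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)])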

Gradual Boolean Threshold Functions

Remark. There is a slide on hyperplanes in the last lecture's notes.

In what follows, we will assume that `theta` is no longer a coordinate of `vec{w}`.

Definition. Let `G = {G_n}` be a collection of hyperplanes, where `G_n` is a hyperplane in `RR^n`. Then `G` is gradual if there is a constant `c > 0` such that for every `tau geq 0` and every `n geq 1`, at most `c tau 2^n` many `x in {0,1}^n` lie within distance `tau` of `G_n`.

Definition. A class of boolean threshold functions `F` is said to be gradual if there is a mapping `phi: F -> G`, where `G` is a gradual family of hyperplanes, such that for all `f in F`, if `phi(f)` is the hyperplane `vec{w} \cdot vec{x} = theta`, then `(vec{w}, theta)` represents the boolean threshold function `f`.

Representations of Nested Boolean Functions

Lemma. Any `f in NF_n` can be represented by a boolean threshold function
`w_1x_1 + cdots + w_nx_n geq theta_n`,
with `theta_n = k +1/2` for some integer `k`, `w_i = \pm 2^{i-1}`, and `sum_{w_i<0}w_i < theta_n < sum_{w_i>0}w_i`.

Proof. By induction on `n`. When `n=1`, we can take `x_1 \geq 1/2` for `x_1` and `-x_1 geq -1/2` for `bar{x_1}`. For `n > 1`, `f` is of the form `g star l_n`, and by the induction hypothesis `g` can be written in the form `w_1x_1 + cdots + w_{n-1}x_{n-1} geq theta_{n-1}` with the `w_i`'s satisfying the conditions of the lemma. The rest of the proof splits into four cases depending on the value of `star` and the sign of the literal. We only show one case; the rest are similar. Suppose `f = g ^^ x_n`. Then we can use:
`w_1x_1 + cdots w_{n-1}x_{n-1} + 2^{n-1}x_n \geq theta_n = theta_{n-1} + 2^{n-1}`.
Notice if both `g(\vec{x})` and `x_n` hold, then by the induction hypothesis the first `n-1` terms sum to at least `theta_{n-1}` and the last term is `2^{n-1}`, so the threshold `theta_n = theta_{n-1} + 2^{n-1}` is met. Conversely, if `x_n = 0`, the sum is at most `sum_{w_i>0, i<n} w_i = 2^{n-1} - 1 + sum_{w_i<0, i<n} w_i < 2^{n-1} + theta_{n-1} = theta_n`, using the lemma's condition `sum_{w_i<0} w_i < theta_{n-1}`; and if `g(\vec{x})` fails, the first `n-1` terms sum to less than `theta_{n-1}`, so even with the last term the total is below `theta_n`. In either case the threshold cannot be met. Q.E.D.
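
Here is a Python sketch of the construction, using the same encoding as the earlier sketch. The lemma's proof only shows the `g ^^ x_n` case; the threshold updates for the other three cases below are my own reading of "the rest are similar", and the brute-force check at the end verifies the resulting representation on the example formula.

    from itertools import product

    def lit(signed_index, x):
        i = abs(signed_index) - 1
        return x[i] if signed_index > 0 else 1 - x[i]

    def eval_nested(first, rest, x):
        val = lit(first, x)
        for op, l in rest:
            val = (val & lit(l, x)) if op == 'and' else (val | lit(l, x))
        return val

    def threshold_rep(first, rest):
        """Return (weights, theta) with w_i = +-2^(i-1) and theta = k + 1/2."""
        w = [1 if first > 0 else -1]
        theta = 0.5 if first > 0 else -0.5
        for i, (op, l) in enumerate(rest, start=2):
            w.append(2**(i-1) if l > 0 else -2**(i-1))
            if op == 'and' and l > 0:        # the case shown in the proof
                theta += 2**(i-1)
            elif op == 'or' and l < 0:       # my guess for the symmetric case
                theta -= 2**(i-1)
            # 'and' with a negated literal and 'or' with a plain literal
            # leave theta unchanged (again my reading of the omitted cases)
        return w, theta

    # (x_1 ^^ bar{x_2}) vv x_3 again
    first, rest = +1, [('and', -2), ('or', +3)]
    w, theta = threshold_rep(first, rest)
    for x in product((0, 1), repeat=1 + len(rest)):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        assert eval_nested(first, rest, x) == int(dot >= theta)
    print("weights", w, "theta", theta)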

Lemma. Nested Boolean Functions can be represented by gradual boolean threshold functions.

Proof. By the above lemma, given `f in NF_n` there is a linear threshold function `vec{w}\cdot vec{x} \geq theta` which represents `f`, with `w_i = \pm 2^{i-1}` and `theta = k + 1/2` for some integer `k`. For `x in {0,1}^n`, the value `vec{w}\cdot vec{x}` is an integer; since `theta` is a half-integer, no `x in {0,1}^n` can have `|vec{w}\cdot vec{x} - theta| < 1/2`. Moreover, since the `|w_i|` are distinct powers of 2, the bits of `vec{x}` can be recovered from the value `vec{w}\cdot vec{x}`, so at most one `vec{x}` can satisfy `vec{w}\cdot vec{x} = t` for a given integer `t`. Thus,
`|{x in {0,1}^n: |vec{w}\cdot vec{x} - theta| leq m}| leq 2m +1`
for `m geq 1/2`. The distance from a point `vec{x}'` to the hyperplane `vec{w}\cdot vec{x} = theta` is `||vec{w}||^{-1}\cdot |vec{w} \cdot vec{x}' - theta|`; the `||vec{w}||^{-1}` factor is there because the weight vector might not be normalized. The lemma then follows by noting, using the formula for a geometric series, that `||vec{w}|| = (\sum_{i=1}^n w_i^2)^{1/2} = (\sum_{i=1}^n (2^{i-1})^2)^{1/2} = (\sum_{i=1}^n 4^{i-1})^{1/2} = ((4^n-1)/3)^{1/2} < 2^n/sqrt(3) < 2^n`.
So
`|{x in {0,1}^n: ||vec{w}||^{-1}|vec{w}\cdot vec{x} - theta| leq tau}| = |{x in {0,1}^n: |vec{w}\cdot vec{x} - theta| leq tau ||vec{w}||}| \leq 2tau||vec{w}|| + 1 < (2/sqrt(3))tau 2^n + 1 leq 3 tau 2^n`
whenever `tau ||vec{w}|| geq 1/2`. Here the second inequality uses `||vec{w}|| < 2^n/sqrt(3)`, and the last one holds since `tau||vec{w}|| geq 1/2` forces `tau 2^n geq sqrt(3)/2`, so `1 leq (3 - 2/sqrt(3))tau 2^n`. When `tau ||vec{w}|| < 1/2` the set is empty (every `x in {0,1}^n` has `|vec{w}\cdot vec{x} - theta| geq 1/2`), so the bound `3 tau 2^n` holds trivially. Hence we can take `c` in the definition of gradual to be 3.
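
The counting bound can also be checked numerically. The sketch below (my own) takes made-up weights of the form `+-2^{i-1}` with a half-integer threshold, counts the points of `{0,1}^n` within distance `tau` of the hyperplane, and compares against `3 tau 2^n`.

    from itertools import product
    from math import sqrt

    w, theta = [1, -2, 4, -8, 16], 2.5     # made-up weights +-2^(i-1), half-integer theta
    n = len(w)
    norm = sqrt(sum(wi * wi for wi in w))  # ||w||, used to convert distances

    for tau in (0.01, 0.05, 0.1, 0.25, 0.5):
        close = sum(1 for x in product((0, 1), repeat=n)
                    if abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) <= tau * norm)
        assert close <= 3 * tau * 2**n
        print(f"tau={tau}: {close} points within tau (bound {3 * tau * 2**n:.1f})")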

PAC-Learning of Gradual Threshold Functions

Theorem. If `C` is a gradual class of boolean threshold functions, then the perceptron rule is a PAC Learning algorithm for `C` under the uniform distribution on `{0, 1}^n`.

Proof. Let `vec{w} \cdot vec{x} \ge theta` be an `n`-bit linear threshold representation of a function in the class `C`, given by the map `phi` from the definition of gradual. Without loss of generality, we can take `vec{w}, theta` to be normalized, so that `|vec{w}\cdot vec{x} - theta|` is the distance of the point `vec{x}` to the hyperplane. By the definition of gradual, there is some constant `k` such that for all `tau > 0`, the probability that a uniformly chosen element of `{0,1}^n` is within distance `tau` of the hyperplane `vec{w}\cdot vec{x} = theta` is at most `tau/(2k)` (for instance, `k = 1/(2c)` works, where `c` is the constant from the definition of gradual). If we set `tau = k epsilon`, then with probability at most `epsilon/2`, a random example drawn from `{0,1}^n` is within `k epsilon` of the hyperplane. So if we let `B \subseteq {0,1}^n` be the set of examples `x` which lie within `k epsilon` of the hyperplane, then `Pr[x in B] le epsilon/2`.

Let `(vec{w}_t, theta_t)` be the perceptron algorithm's hypothesis after `t` updates have been made. If `epsilon ge 1`, the definition of PAC-learnability is trivially satisfied, so assume `epsilon < 1`. Also, from the definition of gradual, if a collection of hyperplanes is gradual with constant `c`, then it is gradual with any constant `c' > c`; so, enlarging `c` if necessary, we can assume the `k` above is at most 1. Suppose `(vec{w}_t, theta_t)` is not yet `epsilon`-accurate. Then it misclassifies a random example with probability more than `epsilon`, while it misclassifies an example in `B` with probability at most `Pr[x in B] le epsilon/2`; so, given that the next example causes an update, the probability that this example is in `B` is at most `(epsilon/2)/epsilon = 1/2`. Define the potential function
`N_t(alpha) = ||alpha vec{w} - vec{w}_t||^2 + (alpha theta - theta_t)^2`.

The perceptron update rule tells us `vec{w}_{t+1} = vec{w}_t \pm vec{x}` and `theta_{t+1} = theta_{t} \mp 1`, so `N_{t+1}(alpha) - N_t(alpha)` is
\begin{eqnarray*} \Delta N(\alpha) &=& ||\alpha \vec{w} - \vec{w}_{t+1}||^2 + (\alpha \theta - \theta_{t+1})^2\\ &&\quad - ||\alpha \vec{w} - \vec{w}_t||^2 - (\alpha \theta - \theta_t)^2\\ &=&\mp 2\alpha \vec{w} \cdot \vec{x} \pm 2\alpha\theta \pm 2\vec{w}_t\cdot \vec{x} \mp 2 \theta_t + ||\vec{x}||^2 +1\\ &\leq& 2 \alpha A \pm 2 (\vec{w}_t \cdot \vec{x} - \theta_t) + n+1. \end{eqnarray*} Here `A = \mp (vec{w}\cdot vec{x} - theta)`, and we are again using that `||vec{x}||^2 le n`.
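
As a consistency check (my own, not part of the lecture), the sympy sketch below re-derives the expansion of `Delta N(alpha)` with `n = 3`, writing the `\pm` sign of the update as a symbol `s` that is later set to `+1` and `-1`; the symbols `u1, u2, u3` stand for the current weights `vec{w}_t`.

    import sympy as sp

    alpha, theta, theta_t, s = sp.symbols('alpha theta theta_t s')
    w   = sp.Matrix(sp.symbols('w1:4'))    # target weights vec{w}
    w_t = sp.Matrix(sp.symbols('u1:4'))    # current weights vec{w}_t
    x   = sp.Matrix(sp.symbols('x1:4'))    # the misclassified example

    w_next, theta_next = w_t + s * x, theta_t - s      # the update rule
    delta_N = ((alpha * w - w_next).dot(alpha * w - w_next)
               + (alpha * theta - theta_next)**2
               - (alpha * w - w_t).dot(alpha * w - w_t)
               - (alpha * theta - theta_t)**2)
    claimed = (-2 * s * alpha * w.dot(x) + 2 * s * alpha * theta
               + 2 * s * w_t.dot(x) - 2 * s * theta_t + x.dot(x) + 1)
    diff = sp.expand(delta_N - claimed)
    print(diff.subs(s, 1), diff.subs(s, -1))           # both print 0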

Proof of PAC-Learning cont'd

Since we are assuming `vec{x}` was misclassified, we know `\pm(vec{w}_t \cdot vec{x} - theta_t) \le 0`, so `\Delta N(\alpha) \le 2alpha A + n + 1`. If `x in B` then `A \le 0`; if `x !in B`, then `A leq -k epsilon`. So `\Delta N(\alpha) \le n + 1` for `x in B` and `\Delta N(\alpha) \le n + 1 - 2k \epsilon\alpha` for `x !in B`. Suppose the perceptron algorithm has made `r` updates with examples in `B`, and `s` updates with examples outside `B`. Since `(vec{w}, theta)` was normalized, `|theta| \leq sqrt(n)`. Recall that at the start of the perceptron algorithm the initial weights are all `0`. Hence, `N_0(\alpha) \leq alpha^2(n+1)`. Since `N_t(\alpha) ge 0` for all `t`, it follows that
`0 le r(n+1) + s(n+1 - 2k\epsilon\alpha) + alpha^2(n+1)`.
Setting `alpha = (12(n+1))/(5 k epsilon)` and dividing through by `n+1`, the above simplifies to
`0 \leq r - 19/5s + (144(n+1)^2)/(25(k epsilon)^2)`.
Suppose `m_1 = (144(n+1)^2)/(25(k epsilon)^2)` updates have been made, so that `r + s = m_1` and `r = m_1 - s`. Substituting this into the above inequality, we get
\begin{eqnarray*} 0 &\leq& m_1 - s - \frac{19}{5}s + m_1\\ 0 &\leq& 2m_1 - \frac{24}{5} s\\ \frac{24}{5}s &\leq & 2m_1\\ s &\leq& \frac{10}{24} m_1\\ s &\leq & \frac{5}{12}m_1. \end{eqnarray*} So at least a `7/12` fraction of the updates must have been made on examples in `B`.
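
The algebra in the last two displays can be re-checked mechanically; the sympy sketch below (my own) substitutes the choice of `alpha` into the potential inequality, divides by `n+1`, and then sets `r = m_1 - s`.

    import sympy as sp

    n, k, eps, r, s = sp.symbols('n k epsilon r s', positive=True)
    alpha = 12 * (n + 1) / (5 * k * eps)
    m1 = 144 * (n + 1)**2 / (25 * (k * eps)**2)

    rhs = r * (n + 1) + s * (n + 1 - 2 * k * eps * alpha) + alpha**2 * (n + 1)
    # after dividing by n + 1 this should equal r - (19/5) s + m_1 ...
    print(sp.simplify(rhs / (n + 1) - (r - sp.Rational(19, 5) * s + m1)))
    # ... and with r = m_1 - s it should equal 2 m_1 - (24/5) s; both print 0
    print(sp.simplify((rhs / (n + 1)).subs(r, m1 - s) - (2 * m1 - sp.Rational(24, 5) * s)))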

Proof of PAC-Learning Conclusion

If the perceptron's hypothesis has never been `\epsilon`-accurate, then from our discussion at the start of the proof, at each update the probability of that update occurring on a point in `B` is at most `1/2`. So by a Chernoff bound, the probability that more than `7/12`ths of `m = max(-144ln(\delta/2), m_1)` updates occur in `B` is at most `delta/2`. That is, the Chernoff bound for seeing `(7/12)m` trials in `B` when we expect only `(1/2)m` is governed by `p = 1/2`, `m`, and the `c` satisfying `(1+c)(1/2)m = (7/12)m`; solving for `c` gives `c = 1/6`. The bound given by Chernoff's inequality is then at most `e^{(-c^2p m)/2} = e^{-((1/6)^2(1/2)m)/2} = e^(-m/144) leq e^(ln(delta/2)) = delta/2`, using `m geq -144ln(delta/2)`. So this means that with probability at least `1 - \delta/2`, after `m` updates the perceptron algorithm will have found an `epsilon`-accurate hypothesis. Also, with probability at least `1-\delta/2`, using `(2m)/epsilon` examples will ensure that `m` updates occur. Thus, with probability at least `1 - delta/2 - delta/2 = 1 - delta`, using `(2m)/epsilon` examples ensures that `m` updates occur and that the result of these updates is `epsilon`-accurate. Q.E.D.
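
A quick numeric check (my own) of the constants in this Chernoff step:

    from math import exp, log

    p, c = 1/2, 1/6                      # (1 + c) * (1/2) m = (7/12) m gives c = 1/6
    assert abs((1 + c) * p - 7/12) < 1e-12

    delta = 0.05                         # any confidence parameter in (0, 1)
    m = -144 * log(delta / 2)            # the choice m >= -144 ln(delta/2)
    bound = exp(-c**2 * p * m / 2)       # Chernoff bound e^(-c^2 p m / 2) = e^(-m/144)
    print(bound, delta / 2)              # the two agree
    assert abs(bound - delta / 2) < 1e-12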

Quiz

Which of the following is true?

  1. For a probability distribution `P` and events `A`, `B`, it is always the case that `P(A cup B) = P(A) + P(B)`.
  2. If a function `g` is PAC-learnable on `D` to within `epsilon`, then the training algorithm always succeeds in outputting a `w`, such that on the test data drawn according to `D`, we have `E(m(d(w,x_t), g(x_t))) leq epsilon`.
  3. There exists an `n times n` matrix `A` over `RR` that does not have an inverse.

Getting Started with Python

Running Python