PAC-Learning Gradual Thresholds - Python




CS256

Chris Pollett

Sep 13, 2021

Outline

Introduction

PAC-Learning of Gradual Threshold Functions

Theorem. If `C` is a gradual class of boolean threshold functions, then the perceptron rule is a PAC-learning algorithm for `C` under the uniform distribution on `{0, 1}^n`.

Proof. Let `vec{w} \cdot vec{x} \ge theta` be an `n`-bit linear threshold function from the class `C`. Without loss of generality, we can take `vec{w}, theta` to be normalized (so `||vec{w}|| = 1`), so that `|vec{w}\cdot vec{x} - theta|` is the distance from the point `vec{x}` to the hyperplane. By the definition of gradual, there is some constant `k` such that for all `tau>0`, the probability that a uniformly chosen element of `{0,1}^n` is within distance `tau` of the hyperplane `vec{w}\cdot vec{x} = theta` is at most `tau/(2k)`. If we set `tau = k epsilon`, then with probability at most `epsilon/2`, a random example drawn from `{0,1}^n` is within distance `k epsilon` of the hyperplane. From this, if we let `B \subseteq {0,1}^n` be the set of examples `x` which lie within distance `k epsilon` of the hyperplane, then `Pr[x in B] le epsilon/2`.
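Not part of the proof, but a quick sanity check of what the gradualness condition measures: the sketch below (the function name and the 9-bit example are hypothetical) estimates by sampling the probability that a uniform `x in {0,1}^n` lands within distance `tau` of a given normalized hyperplane, i.e., in the bad set `B` when `tau = k epsilon`.

import random

def prob_near_hyperplane(w, theta, tau, trials=100000):
    # Estimate Pr[ |w . x - theta| < tau ] for x drawn uniformly from {0,1}^n,
    # assuming (w, theta) is normalized so this quantity is the distance to the hyperplane.
    hits = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in w]
        if abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) < tau:
            hits += 1
    return hits / trials

# Hypothetical example: a majority-style threshold on n = 9 bits with ||w|| = 1
n = 9
w = [1 / n ** 0.5] * n
theta = (n / 2) / n ** 0.5
print(prob_near_hyperplane(w, theta, tau=0.2))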

Let `(vec{w}_t, theta_t)` be the perceptron algorithm's hypothesis after `t` updates have been made. If `epsilon ge 1`, the definition of PAC-learnability is trivially satisfied, so assume `epsilon < 1`. Also, from the definition of gradual, if a collection of hyperplanes is gradual with constant `c` then it will be gradual for constant `c' > c`, so we can assume the `k` above is at least 1. Suppose `(vec{w}_t, theta_t)` is not yet `epsilon`-accurate. Then it misclassifies a uniformly random example with probability at least `epsilon`, while `Pr[x in B] le epsilon/2`, so the next example which causes an update lies in `B` with probability at most `(epsilon/2)/epsilon = 1/2`. Define the potential function
`N_t(alpha) = ||alpha vec{w} - vec{w}_t||^2 + (alpha theta - theta_t)^2`.

The perceptron update rule tells us `vec{w}_{t+1} = vec{w}_t \pm vec{x}` and `theta_{t+1} = theta_t \mp 1`, so `N_{t+1}(alpha) - N_t(alpha)` is
\begin{eqnarray*} \Delta N(\alpha) &=& ||\alpha \vec{w} - \vec{w}_{t+1}||^2 + (\alpha \theta - \theta_{t+1})^2\\ &&\quad - ||\alpha \vec{w} - \vec{w}_t||^2 - (\alpha \theta - \theta_t)^2\\ &=&\mp 2\alpha \vec{w} \cdot \vec{x} \pm 2\alpha\theta \pm 2\vec{w}_t\cdot \vec{x} \mp 2 \theta_t + ||\vec{x}||^2 +1\\ &\leq& 2 \alpha A \pm 2 (\vec{w}_t \cdot \vec{x} - \theta_t) + n+1. \end{eqnarray*} with `A = \mp(vec{w}\cdot vec{x} - theta)`. We are again using that `||vec{x}||^2 le n`, since `vec{x} in {0,1}^n`.
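As a concrete reference for the rule being analyzed, here is a minimal sketch (function names hypothetical, using a `+1/-1` label convention) of the perceptron update and of the potential `N_t(alpha)`:

def perceptron_update(w_t, theta_t, x, label):
    # label is +1 if the target function says w.x >= theta, -1 otherwise
    predicted = 1 if sum(wi * xi for wi, xi in zip(w_t, x)) >= theta_t else -1
    if predicted == label:
        return w_t, theta_t                # correctly classified: no update
    if label == 1:                         # false negative: w_{t+1} = w_t + x, theta_{t+1} = theta_t - 1
        return [wi + xi for wi, xi in zip(w_t, x)], theta_t - 1
    else:                                  # false positive: w_{t+1} = w_t - x, theta_{t+1} = theta_t + 1
        return [wi - xi for wi, xi in zip(w_t, x)], theta_t + 1

def potential(alpha, w, theta, w_t, theta_t):
    # N_t(alpha) = ||alpha*w - w_t||^2 + (alpha*theta - theta_t)^2
    return sum((alpha * wi - wti) ** 2 for wi, wti in zip(w, w_t)) + (alpha * theta - theta_t) ** 2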

Proof of PAC-Learning cont'd

Since we are assuming `vec{x}` was misclassified, we know `\pm(vec{w}_t \cdot vec{x} - theta_t) < 0`, so `\Delta N(\alpha) < 2alpha A + n +1`. If `x in B` then `A \le 0`; if `x notin B`, then `A \le -k epsilon`. So `\Delta N(\alpha) < n + 1` for `x in B` and `\Delta N(\alpha) < n + 1 - 2k \epsilon\alpha` for `x notin B`. Suppose the perceptron algorithm has made `r` updates on examples in `B` and `s` updates on examples outside `B`. Since `(vec{w}, theta)` was normalized, `|theta| \le sqrt(n)`. Recall that at the start of the perceptron algorithm the weights and the threshold are all `0`. Hence, `N_0(\alpha) = alpha^2(||vec{w}||^2 + theta^2) \le alpha^2(n+1)`. Since for all `t`, `N_t(\alpha) ge 0`, it follows that
`0 le r(n+1) + s(n+1 - 2k\epsilon\alpha) + alpha^2(n+1)`.
Setting `alpha = (12(n+1))/(5 k epsilon)` and dividing through by `n+1`, the above simplifies to
`0 \le r - (19/5)s + (144(n+1)^2)/(25(k epsilon)^2)`.
Suppose `m_1 = (144(n+1)^2)/(25(k epsilon)^2)` updates have been made in total, so `r + s = m_1`. Then `r = m_1 - s`, and substituting this into the above inequality gives
\begin{eqnarray*} 0 &\leq& m_1 - s - \frac{19}{5}s + m_1\\ 0 &\leq& 2m_1 - \frac{24}{5} s\\ \frac{24}{5}s &\leq & 2m_1\\ s &\leq& \frac{10}{24} m_1\\ s &\leq & \frac{5}{12}m_1. \end{eqnarray*} So at least a `7/12` fraction of the updates must have been made on examples in `B`.

Proof of PAC-Learning Conclusion

If the perceptron's hypothesis has never been `\epsilon`-accurate, then from our discussion at the start of the proof, at each update the probability of that update occurring on a point in `B` is at most `1/2`. So by Chernoff bounds, the probability that more than `7/12`ths of `m = max(-144ln(\delta/2), m_1)` updates occur in `B` is at most `delta/2`. I.e., the Chernoff bound says the probability that we see `(7/12)m` trials in `B` when we expect only `(1/2)m` trials in `B` is governed by `p=1/2`, `c`, and `m`, where `(1+c)(1/2)m = (7/12)m`. Solving for `c` gives `c=1/6`. Then the bound given by Chernoff's inequality is at most `e^{(-c^2 p m)/2} = e^{-((1/6)^2(1/2)m)/2} = e^(-m/144) le e^(ln(delta/2)) = delta/2`. So this means that with probability at least `1 - \delta/2` the perceptron algorithm will have found an `epsilon`-accurate hypothesis within `m` updates. With probability at least `1-\delta/2`, using `(2m)/epsilon` examples will ensure that `m` updates occur. Thus, with probability at least `(1-delta/2)^2 = 1 - delta + delta^2/4 > 1 - delta`, using `(2m)/epsilon` examples will ensure `m` updates occur and that the result of these updates is `epsilon`-accurate. Q.E.D.
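A rough sketch of the bookkeeping in this conclusion, with the constants taken directly from the proof (the function name is hypothetical): it computes the update bound `m_1`, the number of updates `m` needed for the Chernoff step, and the `2m/epsilon` example budget.

import math

def pac_perceptron_budget(n, epsilon, delta, k):
    m1 = 144 * (n + 1) ** 2 / (25 * (k * epsilon) ** 2)  # update bound from the counting argument
    m = max(-144 * math.log(delta / 2), m1)              # enough updates for the Chernoff step
    examples = 2 * m / epsilon                           # examples so that m updates occur w.h.p.
    return math.ceil(m), math.ceil(examples)

print(pac_perceptron_budget(n=20, epsilon=0.1, delta=0.05, k=1))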

Getting Started with Python

Running Python
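A minimal sketch of the workflow presumably demoed here (file name hypothetical): save a one-line script and run it with the interpreter.

# hello.py
print("Hello, CS256!")

# From a shell:  python3 hello.py
# Or start an interactive interpreter by running python3 with no arguments.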

Quiz

Which of the following is true?

  1. Perceptrons with logistic activation functions are tensors.
  2. The perceptron convergence theorem shows perceptrons can learn an arbitrary function.
  3. The learning update rule used for the perceptron convergence theorem and our PAC learning of nested boolean functions was the same.

Strings
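A minimal sketch of basic string operations of the sort presumably covered on this slide (values are only illustrations):

s = "machine learning"
print(len(s))                    # 16: length of the string
print(s.upper())                 # MACHINE LEARNING
print(s.split(" "))              # ['machine', 'learning']
print(s[0:7])                    # machine  -- slicing
print("PAC" + "-" + "learning")  # concatenation with +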

Lists
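A minimal sketch of basic list operations (values are only illustrations); a longer example using lists follows on the next slide:

nums = [3, 1, 4, 1, 5]
nums.append(9)                   # [3, 1, 4, 1, 5, 9]
print(nums[0], nums[-1])         # first and last elements
print(sorted(nums))              # [1, 1, 3, 4, 5, 9]
squares = [x * x for x in nums]  # list comprehension
print(squares)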

Example Using Lists and Command Line

import sys
if len(sys.argv) != 2:
    # notice sys.argv is a list of the command-line arguments; here we check its length
    print("Please supply a filename")
    raise SystemExit(1)  # exit with a nonzero status code
f = open(sys.argv[1])  # the program name is argv[0]; the filename is argv[1]
lines = f.readlines()  # reads all lines into a list in one go
f.close()

# convert the input lines to a list of ints
ivalues = [int(line) for line in lines]

# print the min and max
print("The min is", min(ivalues))
print("The max is", max(ivalues))

Example Files in a Folder as a List

import glob
path = './*'
files = glob.glob(path)  # glob.glob returns a list of path names matching the shell-style pattern
for name in files:
    print(name)

Tuples
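A minimal sketch of tuple basics (values are only illustrations):

point = (3, 4)
x, y = point                # tuple unpacking
print(x, y)
pairs = [(1, "a"), (2, "b")]
for num, letter in pairs:   # unpacking inside a for loop
    print(num, letter)
# point[0] = 5  would raise a TypeError: tuples are immutable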

Sets
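A minimal sketch of set basics (values are only illustrations):

a = {1, 2, 3, 3}
print(a)                    # {1, 2, 3} -- duplicates are removed
b = set([2, 3, 4])
print(a | b)                # union: {1, 2, 3, 4}
print(a & b)                # intersection: {2, 3}
print(1 in a)               # fast membership test: True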

Dictionaries
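A minimal sketch of dictionary basics (values are only illustrations):

grades = {"alice": 93, "bob": 87}
grades["carol"] = 78          # add a new key/value pair
print(grades["alice"])        # look up a value by key
print(grades.get("dan", 0))   # default value when the key is missing
for name, score in grades.items():
    print(name, score)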

Iteration and Looping
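A minimal sketch of while and for loops (values are only illustrations):

i = 0
while i < 3:
    print("while iteration", i)
    i += 1

for i in range(3):           # 0, 1, 2
    print("for iteration", i)

for i in range(10, 0, -2):   # counts down 10, 8, 6, 4, 2
    print(i)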

Examples of things we can iterate over with for
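A minimal sketch of some of the iterables for can loop over (values and the commented-out file name are only illustrations):

for ch in "abc":                       # the characters of a string
    print(ch)
for item in [10, 20, 30]:              # the elements of a list
    print(item)
for key in {"x": 1, "y": 2}:           # the keys of a dictionary
    print(key)
for i, val in enumerate(["a", "b"]):   # (index, value) pairs via enumerate
    print(i, val)
# A file object yields its lines:
# for line in open("data.txt"):        # file name hypothetical
#     print(line.strip())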