Probability, PAC Learning, Linear Algebra




CS256

Chris Pollett

Aug 30, 2017

Outline

Introduction

Generality versus Learning

How good is a learning algorithm?

Probability (Take 1) - Distributions

Conditional Probability and Independence

Discrete Random Variables

Expectation and Variance

In-Class Exercise

Markov Inequality

Error Reduction -- Chernoff Bounds

Proof of Chernoff Bounds

Let `X = sum_(i=1)^n X_i`, where the `X_i` are independent 0-1 random variables with `Pr[X_i = 1] = p`. If `t` is a positive real number, then
`Pr[X ge (1+c)pn]= Pr[e^(tX) ge e^(t(1+c)pn)]` (*)
By Markov's Inequality,
`Pr[e^(tX) ge k cdot E(e^(tX))] le 1/k` for any real `k > 0`.
Taking `k=e^(t(1+c)pn)/[E(e^(tX))]` and using (*) gives
`Pr[X ge (1+c)pn] le [E(e^(tX))]cdot e^(-t(1+c)pn)`. (**)
Since `X = sum_(i=1)^n X_i` and the `X_i` are independent, `E(e^(tX)) = [E(e^(tX_1))]^n`, and `E(e^(tX_1)) = p e^t + (1-p) = 1 + p(e^t-1)`. Substituting this into (**) gives:
`Pr[X ge (1+c)pn] le (1 + p(e^t-1))^n cdot e^(-t(1+c)pn)`
`le e^(-t(1+c)pn) cdot e^(pn(e^t-1))`, since `(1+a)^n le e^(an)`. Take `t = ln(1+c) > 0` to get `Pr[X ge (1+c)pn] le e^(pn(c-(1+c)ln(1+c)))`. Taylor expanding `ln(1+c) = c - c^2/2 + c^3/3 - ...` and multiplying out, the exponent becomes `pn(-c^2/2 + c^3/6 - c^4/12 + ...)`, an alternating series whose terms decrease in magnitude when `0 < c le 1`. Hence
`e^(pn(c-(1+c)ln(1+c))) le e^(pn(-c^2/2 + c^3/6)) le e^(-(c^2 pn)/3)` for `0 < c le 1`.
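
As a sanity check on the derivation, here is a small self-contained Python sketch (not from the slides; the parameter values `n`, `p`, `c` are illustrative assumptions) comparing the exact binomial tail `Pr[X ge (1+c)pn]` to the bound `e^(-(c^2 pn)/3)`:

```python
import math

def binom_tail(n, p, k):
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Illustrative parameters -- assumptions, not values from the slides.
n, p, c = 200, 0.3, 0.5
k = math.ceil((1 + c) * p * n)       # smallest integer >= (1+c)pn
exact = binom_tail(n, p, k)
bound = math.exp(-c**2 * p * n / 3)  # Chernoff upper-tail bound

print(f"exact tail Pr[X >= (1+c)pn]  = {exact:.3e}")
print(f"Chernoff bound e^(-c^2 pn/3) = {bound:.3e}")  # dominates the exact tail
```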

Corollary. If `p = 1/2 + epsilon` for some `epsilon > 0`, then the probability that `sum_(i=1)^n X_i le n/2` is at most `e^(-epsilon^2 n/4)`.

Proof. Running the same argument with `t < 0` gives the lower-tail bound `Pr[X le (1-c)pn] le e^(-(c^2 pn)/2)`; here the expansion `-c - (1-c)ln(1-c) = -c^2/2 - c^3/6 - ...` has all negative terms, so the constant `1/2` needs no adjustment. Take `c = epsilon/(1/2+epsilon)`, so that `(1-c)pn = n/2` and `(c^2 pn)/2 = epsilon^2 n/(1+2 epsilon) ge epsilon^2 n/4`, since `epsilon le 1/2`. Q.E.D.
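
This corollary is the error-reduction fact behind the slide title above: repeat a procedure that is correct with probability `1/2 + epsilon` independently `n` times and take a majority vote, and the failure probability drops exponentially in `n`. A minimal simulation sketch (the values of `eps`, `n`, and `trials` are illustrative assumptions):

```python
import math
import random

random.seed(0)

# Illustrative parameters -- assumptions, not values from the slides.
eps, n, trials = 0.1, 101, 20000   # n odd, so a majority vote is never tied
p = 0.5 + eps                      # each independent run is correct w.p. p

failures = 0
for _ in range(trials):
    correct_runs = sum(random.random() < p for _ in range(n))
    if correct_runs <= n / 2:      # majority vote comes out wrong
        failures += 1

print(f"empirical Pr[majority wrong]   = {failures / trials:.4f}")
print(f"corollary bound e^(-eps^2 n/4) = {math.exp(-eps**2 * n / 4):.4f}")
```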

PAC Learning

Is anything PAC-learnable?

Linear Algebra (Take 1)

Matrix Operations

More Matrix Operations

Norms