Outline
- Kinds of Learning
- Learning: Compression versus Generality
- In-Class Exercise
- Introduction to Probability
What is Learning? Kinds of Learning Algorithms
- A learning algorithm is an algorithm that uses observations of the world either to improve its own performance on future tasks or to improve the performance of some other program on future tasks.
- Learning algorithms can be grouped according to the kind of observations they get as inputs:
- A supervised learning algorithm observes example input-output pairs and learns a function that maps from input to output. I.e., given points `(\vec{x}_1, y_1), \ldots, (\vec{x}_k, y_k)`, we want to learn a simple function `f` such that for all `i`, `f(\vec{x}_i) \approx y_i`, and such that `f` makes reasonable predictions on new data (a toy sketch appears after this list). If the `y_i` can take values only in a fixed set `1, \ldots, c` of values (classes), we call this kind of learning classification.
- An unsupervised learning algorithm learns patterns in the input even though no explicit feedback is supplied. A common task in this area is clustering: detecting potentially useful clusters of input examples.
- A reinforcement learning algorithm learns from a series of reinforcements -- rewards or punishments. For example, the lack of a tip at the end of the journey gives a taxi agent an indication that it did something wrong. Winning a chess game gives a chess-playing agent an indication that it did something right.
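To make the supervised setting concrete, here is a minimal sketch, assuming made-up toy data and a hypothetical one-dimensional threshold model (neither is from the notes themselves), of learning a simple `f` with `f(x_i) \approx y_i`:

```python
# A minimal supervised-learning sketch: toy 1-D data and a
# hypothetical threshold classifier f(x) = 1 if x >= t else 0.

# Training pairs (x_i, y_i): inputs with class labels 0 or 1.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]

def fit_threshold(pairs):
    """Pick the threshold t minimizing the number of training errors."""
    candidates = [x for x, _ in pairs]
    def error(t):
        return sum((1 if x >= t else 0) != y for x, y in pairs)
    return min(candidates, key=error)

t = fit_threshold(train)
f = lambda x: 1 if x >= t else 0

print(t, [f(x) for x, _ in train])  # f matches the y_i on the training data
```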
Learning Algorithm Tasks
- Some tasks we might use a learning algorithm to solve are:
- Classification Assign a label (for example, spam/no spam) to an input vector `\vec{x}`.
- Classification with Missing Inputs Same task as above, but some of the components of `\vec{x}` might be missing. For example, `\vec{x}` might be a vector of medical test results, one of which is missing.
- Regression Find the best curve of some type (for example, a line) through a set of data points, then use this curve to predict previously unseen values. For example, we might train on the claim amounts insured persons have made under various circumstances, then use the curve to predict expected claim amounts for new cases (see the sketch after this list).
- Transcription Observe non-textual data and convert it to text. For example, images of street signs.
- Machine Translation Convert a sequence of symbols from one language to a sequence of symbols in another. For example, English to Chinese.
- Structured Output Take unstructured input data and convert it to data with annotated relationships. For example, label the parts of speech of an English sentence.
- Anomaly Detection Find atypical events in an event sequence. For example, misuses of a credit card.
- Synthesis and Sampling Generate new examples similar to the training data. This might be used for automatic scene generation in video games.
- Imputation of Missing Values This is similar to classification with missing inputs. The task is: given a `\vec{x}` where some of the coordinates are missing, predict those coordinates. For example, predict someone's rating of a movie they didn't rate based on the ones they did.
- Denoising Given a corrupted input `\tilde{\vec{x}}`, compute a clean version `\vec{x}`.
- Density Estimation Learn a probability density function `p_{\mathrm{model}}: \mathbb{R}^n \to \mathbb{R}` based on training examples. This can be used for things like the imputation of missing values problem.
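As a concrete instance of the regression task above, here is a minimal sketch, assuming toy data, that uses numpy's least-squares line fit to stand in for "find the best curve of some type":

```python
# A minimal regression sketch: fit a line to toy data, then use it
# to predict a previously unseen value.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # training inputs
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])    # observed values (a noisy line)

slope, intercept = np.polyfit(x, y, deg=1)  # best line through the data

x_new = 5.0                                 # previously unseen input
print(slope * x_new + intercept)            # predicted value at x_new
```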
Learning Algorithm Performance
- To evaluate the abilities of a machine learning algorithm, we must come up with a way to measure its performance.
- For tasks like classification, accuracy of the model is often measured. This is the proportion of the examples on which the model produced the correct output.
- Error Rate is an equivalent measure: it is the proportion of the inputs on which the model got the answer wrong, i.e., one minus the accuracy.
- For tasks where it doesn't make sense to measure accuracy, for example, density estimation, a continuous valued score might be used.
- A common approach is to report the average log-probability the model assigns to some examples.
- Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before.
- We call this data, which is kept separate from the training data, the test set.
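A minimal sketch of measuring accuracy and error rate on a held-out test set; the model and data here are hypothetical placeholders:

```python
# Evaluate a trained model on held-out (x, y) pairs.
test = [(1.0, 0), (2.5, 1), (3.0, 1), (0.5, 0)]   # the test set
f = lambda x: 1 if x >= 2.0 else 0                # some trained model

correct = sum(f(x) == y for x, y in test)
accuracy = correct / len(test)      # proportion of correct outputs
error_rate = 1 - accuracy           # the equivalent measure

print(accuracy, error_rate)
```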
Learning versus Compression
- Typically, there are two main ingredients that one trades off when designing a learning algorithm: compression and generality.
- A compression algorithm `c` takes as input a long string `y` (for example, the 1's and 0's of an image) and tries to come up with a short string `c(y) = x` such that when a decompression algorithm `d` is applied to `x`, we have `d(x) \approx y`.
- For supervised learning, we can imagine `c` as a training algorithm that tries to find a short description of all the data. I.e., `c` takes as input a string which is the concatenation of `(\vec{x}_1, y_1), \ldots, (\vec{x}_k, y_k)` and outputs some `w` which is at most the size of this concatenation. The function `d` now takes two arguments `(w, x)`, and we want `d(w, x_i) \approx y_i`. So `d` is an evaluator on the trained model `w` and on inputs `x`.
- If the length of `w` is always the same fixed size, regardless of the amount of training data, we call the learning algorithm parametric; otherwise, we call it non-parametric.
- As we will see, neural network training is an example of a parametric learning algorithm. On the other hand, `k`-nearest neighbors, which assigns a point `x` a value `y` computed by a simple function (such as the majority or the average) of the values of the `k` training points `\vec{x}_i` nearest to `x`, is an example of a non-parametric learning algorithm (a sketch follows below).
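A minimal `k`-nearest-neighbors sketch; the toy data, the choice `k = 3`, and the majority rule are assumptions for illustration. Note that the "model" is the training data itself, so its size grows with the data, which is what makes the method non-parametric:

```python
# k-nearest neighbors on toy 2-D points with 0/1 labels.
from collections import Counter

train = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((5.0, 5.0), 1), ((6.0, 5.0), 1)]

def knn_predict(x, k=3):
    # Sort training points by squared distance to x; keep the k nearest.
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - x[0])**2 + (p[0][1] - x[1])**2)[:k]
    # Majority vote over the labels of the k nearest neighbors.
    return Counter(y for _, y in nearest).most_common(1)[0][0]

print(knn_predict((0.5, 0.5)))  # -> 0 (the point is near the first cluster)
```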
- If `y` is `n` bits long, it is one of `2^n` many `n`-bit strings. On the other hand, counting the empty string, there are only
`1+2+2^2 + \cdots +2^{n-1} = (2-1)(1+2+2^2 + \cdots + 2^{n-1}) = 2^n -1`
binary strings of length less than `n`. Since `2^n - 1 < 2^n`, not all strings of length `n` can be mapped to distinct shorter strings, so not all strings can be compressed (a quick numeric check appears at the end of this list).
- So we cannot expect that all data sets can be parametrically learned.
- Data sets that cannot be learned can be viewed as (cryptographically/pseudo) random with respect to our learning algorithm.
- If we allow larger `w`'s, it is more likely we can learn, but then we run into our second trade-off...
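A quick numeric check of the counting argument above: for each `n`, there are `2^n` strings of length `n` but only `2^n - 1` binary strings of length strictly less than `n` (including the empty string):

```python
# Verify 1 + 2 + 2^2 + ... + 2^(n-1) == 2^n - 1 for small n.
for n in range(1, 10):
    shorter = sum(2**i for i in range(n))   # strings of lengths 0 .. n-1
    assert shorter == 2**n - 1
    print(n, 2**n, shorter)                  # always one fewer shorter string
```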
Generality versus Learning
- If the only data items `x` we ever expected to see were the training examples `\vec{x}_i`, then learning would really reduce to a kind of compression problem.
- Usually, we want our evaluator `d(w,x)` to output something reasonable for never-before-seen `x`'s. I.e., the trained model should generalize.
- On training data items `x_i`, we have been assuming that `d(w, x_i) \approx y_i`. This means `m(d(w, x_i), y_i) < \epsilon` for some choice of real number `\epsilon`, where `m` is a metric measuring the distance between `d(w, x_i)` and `y_i` as a nonnegative real number.
- For example, if `d` outputs a real number and the `y_i`'s are reals, we could choose `m` so that `m(d(w, x_i), y_i) = |d(w, x_i) - y_i|`.
- A pair `(X, m)` where `X` is a set and `m: X \times X \to [0, \infty)`, such that for all `x, y, z`: `m(x,y) = m(y,x)`, `m(x,z) \leq m(x,y) + m(y,z)`, and `m(x,y) = 0` if and only if `x = y`, is called a metric space. We assume our learning algorithms are formulated in metric spaces (a small numeric sketch of the training-error check follows below).
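A sketch of checking `m(d(w, x_i), y_i) < \epsilon` on training data, using the absolute-difference metric from the example above; the model `w`, evaluator `d`, data, and `\epsilon` are all hypothetical placeholders:

```python
# Check the training fit d(w, x_i) ~ y_i under a metric m.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x_i, y_i) pairs
w = (2.0, 0.0)                                  # model parameters: slope, intercept
d = lambda w, x: w[0] * x + w[1]                # evaluator d(w, x)
m = lambda a, b: abs(a - b)                     # metric on the reals

epsilon = 0.5
print(all(m(d(w, x), y) < epsilon for x, y in train))  # True: training error < epsilon
```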
- By generality, we mean we want models so that:
- We can make the training error, `\epsilon`, small.
- We can make the gap between the training and test error small.
- If the model is not able to obtain a sufficiently low error value on the training set, we say underfitting is occurring.
- If the gap between the training and test error is too large, we say overfitting is occurring (the sketch below illustrates both).
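A sketch of underfitting versus overfitting, assuming synthetic data (a noisy sine curve, a choice not from the notes): fit polynomials of increasing degree and compare training and test error:

```python
# Compare train/test error as model capacity (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0.05, 0.95, 10)
f_true = lambda x: np.sin(2 * np.pi * x)
y_train = f_true(x_train) + rng.normal(0, 0.2, x_train.shape)
y_test = f_true(x_test) + rng.normal(0, 0.2, x_test.shape)

for deg in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 1 tends to underfit (high training error); degree 9 tends to
    # overfit (near-zero training error, larger train/test gap).
    print(deg, round(train_err, 3), round(test_err, 3))
```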
In-Class Exercise
- Suppose we are trying to learn a set which specifies a 2D half-space (all points on one side, either above or below, of a line).
- A training item might look like `((x, y), 1)` if the point is in the half-space, and `((x, y), 0)` if it is not.
- Suggest a non-parametric way this space could be learned.
- Give an example where your approach produces a correct answer on a test set and one where it produces an incorrect answer.
- If we were parametrically learning this data set, what things influence the smallest model for a given error?
- Please post your solution to the Aug 25 In-Class Exercise Thread.