Outline
- MLE-based Cost Functions
- Cross-Entropy
- In-Class Exercise
Introduction
- On Monday, we introduced the notion of an estimator from statistics.
- A point estimator is any function of the data `hat{theta}_m = g(vec{x}^{(1)}, ..., vec{x}^{(m)})` used to approximate some value of interest, for example, neural network accuracy, temperature, etc.
- We defined the bias of an estimator as: `bias(hat{theta}_m) = E(hat{theta}_m) - theta`.
- Then we showed that `hat{theta}_m = 1/m sum_{i=1}^m x^{(i)}`, where the `x^{(i)}` are coin flips, is an unbiased estimator for the probability `theta` of the coin coming up heads. I.e., `bias(hat{theta}_m) = 0`.
- On the other hand, we showed `hat{sigma}_m^2 = 1/m sum_{i=1}^m (x^{(i)} - hat{mu}_m)^2` is a biased estimator for the variance of a Gaussian distribution. Further, we stated that the technique of cross-validation can be biased.
- We then began talking about multi-layer perceptron networks and training cost functions for them.
- We looked at mean square error as a cost function.
- Today, we begin by looking at Maximum Likelihood Estimation as a means to develop cost functions.
Maximum Likelihood Estimation - unlabeled data
- Let `mathbb{X}` be a set of examples `vec{x}^{(1)}, ..., vec{x}^{(m)}` drawn independently from sample space `S` according to some unknown data generating distribution `p_{data}(vec{x})`.
- Let `p_{model}(\vec{x}; \vec{theta})` be a parametric family of probability distributions over the same space `S`, indexed by `vec{theta}`.
- Our goal is to find a `vec{theta}` so that `p_{model}(\vec{x}; \vec{theta})` is as close to `p_{data}(vec{x})` as possible.
- The maximum likelihood estimator (MLE) for `vec{theta}` is defined as:
\begin{eqnarray*}
\vec{\theta}_{ML} &=& argmax_{\vec{\theta}} p_{model}(\mathbb{X}; \vec{\theta})\\
&=& argmax_{\vec{\theta}}\prod_{i=1}^m p_{model}(\vec{x}^{(i)}; \vec{\theta})
\end{eqnarray*}
- I.e., we want to find the value of `vec{theta}` that maximizes the product of the probabilities of the observed examples.
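- As a concrete example (a standard one, not worked out on the slides): for the coin flips from the introduction, with `p_{model}(x; theta) = theta^x (1 - theta)^{1-x}`, setting the derivative of the log of the product to zero (taking logs is justified on the next slide) recovers the sample-mean estimator:
\begin{eqnarray*}
0 &=& \frac{d}{d\theta} \sum_{i=1}^m \log p_{model}(x^{(i)}; \theta) = \frac{\sum_{i=1}^m x^{(i)}}{\theta} - \frac{m - \sum_{i=1}^m x^{(i)}}{1 - \theta}\\
\Rightarrow \theta_{ML} &=& \frac{1}{m}\sum_{i=1}^m x^{(i)}
\end{eqnarray*}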
MLE for unlabeled data as an Expectation
- The product on the previous slide can be extremely small and so is prone to floating-point underflow.
- Since the log of this product is maximized exactly when the product is maximized, we can instead use the numerically more stable definition:
`\vec{\theta}_{ML} = argmax_{\vec{\theta}} sum_{i=1}^m log p_{model}(\vec{x}^{(i)}; \vec{\theta}).`
- Since multiplying the above sum by `1/m` doesn't affect the argmax either, we can do that and then view the result as an expectation:
`\vec{\theta}_{ML} = argmax_{\vec{\theta}} E_{vec{x} ~ hat{p}_{data}} [log p_{model}(\vec{x}; \vec{\theta})].`
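A minimal NumPy sketch (not from the lecture; the Bernoulli coin-flip model and grid search are illustrative assumptions) showing why the sum of logs is preferred: the raw likelihood product underflows, while the average log-likelihood behaves well and its argmax matches the sample mean.
```python
# Sketch: raw likelihood product vs. average log-likelihood for coin flips.
import numpy as np

rng = np.random.default_rng(0)
m = 2000
theta_true = 0.7
x = rng.random(m) < theta_true          # m Bernoulli(theta_true) coin flips

def likelihood(theta):
    # prod_i p_model(x^(i); theta) -- underflows to 0.0 for large m
    p = np.where(x, theta, 1.0 - theta)
    return np.prod(p)

def avg_log_likelihood(theta):
    # (1/m) sum_i log p_model(x^(i); theta) -- numerically stable
    p = np.where(x, theta, 1.0 - theta)
    return np.mean(np.log(p))

thetas = np.linspace(0.01, 0.99, 99)
print(likelihood(0.7))                  # 0.0 due to underflow
best = thetas[np.argmax([avg_log_likelihood(t) for t in thetas])]
print(best, x.mean())                   # argmax matches the sample mean
```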
From MLE for unlabeled data to a Cost Function
- We can interpret the above as trying to minimize the dissimilarity between the empirical distribution `hat{p}_{data}`, defined by the training set, and the model distribution. Ideally, we would like to match the actual distribution `p_{data}`, but we don't have direct access to it.
- One measure of this dissimilarity between two distributions is Kullback-Leibler (KL) divergence:
`D_{KL}(P||Q) = E_{x~P}[log(P(x)/Q(x))] = E_{x~P}[log P(x) - log Q(x)]`
- Since `-log_2` of a probability is the length in bits of an optimal code for that outcome, the above (with base-2 logs) roughly measures the expected number of extra bits needed to encode samples from `P` using a code designed for `Q`.
- In the case of MLE, the dissimilarity between `hat{p}_{data}` and `p_{model}` is
`D_{KL}(hat{p}_{data}||p_{model}) = E_{vec{x}~hat{p}_{data}}[log hat{p}_{data}(\vec{x}) - log p_{model}(\vec{x}; \vec{theta})].`
- So `\vec{\theta}_{ML}` is the value of `vec{theta}` that minimizes the above.
- Since the first term does not depend on `vec{theta}`, to minimize `D_{KL}(hat{p}_{data}||p_{model})` it suffices to minimize:
`-E_{vec{x}~hat{p}_{data}}[log p_{model}(\vec{x}; vec{theta})]`
- So we can take this as our cost function.
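A small sketch (the three-outcome empirical distribution and the one-parameter toy model family are made-up illustrations) checking numerically that minimizing `D_{KL}(hat{p}_{data}||p_{model})` over `theta` picks the same `theta` as minimizing the negative expected log-likelihood, since the entropy term is constant in `theta`.
```python
# Sketch: KL minimization vs. negative expected log-likelihood minimization.
import numpy as np

p_hat = np.array([0.5, 0.3, 0.2])        # hypothetical empirical distribution

def p_model(theta):
    # a toy one-parameter family of distributions over 3 outcomes
    return np.array([theta, (1 - theta) / 2, (1 - theta) / 2])

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

def neg_expected_log_lik(theta):
    return -np.sum(p_hat * np.log(p_model(theta)))

thetas = np.linspace(0.05, 0.95, 91)
kl_best = thetas[np.argmin([kl(p_hat, p_model(t)) for t in thetas])]
nll_best = thetas[np.argmin([neg_expected_log_lik(t) for t in thetas])]
print(kl_best, nll_best)                 # identical minimizers
```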
In-Class Exercise
- Suppose our data comes from pairs of flips of two unbiased coins.
- Make a data set consisting of 5 pairs of flips.
- From the training data, our model computes the fraction `p_0` of times coin 0 was heads and the fraction `p_1` of times coin 1 was heads.
- Our model assigns the probability of a pair `(x_0, x_1)` as `m_0\cdot m_1` where `m_i` is `p_i` if `x_i` is heads and `(1-p_i)` otherwise.
- Compute `D_{KL}(hat{p}_{data}||p_{model})` (one way to organize the computation is sketched after this exercise).
- Please post your solution to the Oct 6 In-Class Exercise Thread.
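A possible template for setting up the computation; the five flip pairs below are made up, so substitute your own data set.
```python
# Sketch: D_KL between the empirical distribution over flip pairs and the
# product model. H = heads, T = tails; the pairs below are hypothetical.
from collections import Counter
import numpy as np

pairs = [("H", "H"), ("H", "T"), ("T", "H"), ("H", "H"), ("T", "T")]
m = len(pairs)

# Model: p_i is the fraction of times coin i came up heads in the data.
p0 = sum(x0 == "H" for x0, _ in pairs) / m
p1 = sum(x1 == "H" for _, x1 in pairs) / m

def p_model(x0, x1):
    m0 = p0 if x0 == "H" else 1 - p0
    m1 = p1 if x1 == "H" else 1 - p1
    return m0 * m1

# D_KL(p_hat || p_model), summing over the pairs that actually occurred
# (pairs with empirical probability 0 contribute nothing to the sum).
counts = Counter(pairs)
d_kl = sum((c / m) * np.log((c / m) / p_model(*pair))
           for pair, c in counts.items())
print(d_kl)
```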
From MLE for labeled data to a Cost Function
- In the case of labeled data, we need to use conditional probabilities:
`\vec{\theta}_{ML} = argmax_{\vec{\theta}} P(\mathbb{Y}|\mathbb{X}; \vec{\theta})`
- In this case, the cost function we want to minimize is:
`-E_{vec{x},y~hat{p}_{data}}[log p_{model}(y|\vec{x}; vec{theta})]`
- This cost function is known as the cross entropy.
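A minimal sketch of evaluating this cost for a hypothetical binary classifier; the predicted probabilities and labels below are made up for illustration.
```python
# Sketch: cross-entropy = average negative log-probability the model
# assigns to the observed labels.
import numpy as np

p_class1 = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical P(y=1|x) per example
y = np.array([1, 0, 1, 1])                  # observed labels

# p_model(y|x) is p_class1 when y = 1 and 1 - p_class1 when y = 0.
p_of_observed = np.where(y == 1, p_class1, 1.0 - p_class1)
cross_entropy = -np.mean(np.log(p_of_observed))
print(cross_entropy)
```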
Cross-Entropy Cost Function - Linear Gaussian Case
- Notice for the Gaussian distribution we have:
`p_{model}(y|\vec{x}; vec{theta}) = N(y; hat{y}(\vec{x}; vec{theta}), sigma^2) = sqrt(1/(2\pi\sigma^2)) exp(-1/(2\sigma^2)(y - hat{y}(\vec{x}; vec{theta}))^2)`.
- Let `m` again be the number of training examples. Assuming the data items are drawn independently and identically from the underlying distribution, substituting this into the cross-entropy gives
`-1/m sum_{i=1}^m log p_{model}(y^{(i)}|\vec{x}^{(i)}; vec{theta}) = log sigma + 1/2 log(2\pi) + 1/m sum_{i=1}^m (||y^{(i)} - hat{y}^{(i)}||^2)/(2\sigma^2).`
- The first two terms on the right do not depend on `vec{theta}`, and dropping the constant factor `1/(2\sigma^2)` does not change the minimizer.
- So minimizing the above is equivalent to minimizing the mean square error (MSE): `1/m sum_i||y^{(i)} - hat{y}^{(i)}||^2`.
- Thus, if we had an output neuron whose activation function was the identity function, using MSE might make sense.
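A quick numerical check of this equivalence (synthetic targets and a fixed, assumed `sigma`): the Gaussian negative log-likelihood and the MSE differ only by constants, so they pick the same `hat{y}`.
```python
# Sketch: Gaussian NLL and MSE have the same minimizer in y_hat.
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 100, 1.5
y = rng.normal(3.0, sigma, size=m)       # synthetic targets

def gaussian_nll(y_hat):
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - y_hat) ** 2 / (2 * sigma**2))

def mse(y_hat):
    return np.mean((y - y_hat) ** 2)

y_hats = np.linspace(0.0, 6.0, 601)
print(y_hats[np.argmin([gaussian_nll(v) for v in y_hats])])
print(y_hats[np.argmin([mse(v) for v in y_hats])])   # same minimizer
```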