Neural networks

Idea: use the brain as a model for knowledge and learning.

For tasks that the brain is good at (like recognizing faces and words), we know that each step of the computation takes several milliseconds, while the entire computation takes several tenths of a second. Thus the entire computation takes only about 100 steps. The only hope for matching this seems to be to use massively parallel algorithms. This seems consistent with the structure of the brain (i.e., we have a working model!).

We may model the brain's neurons and axons with the nodes (units) and edges (connections, or links) of a graph (network). The resulting neural nets have the desired properties of
- tolerating noise
- allowing graceful degradation
- allowing convenient generalization and stereotyping
- allowing easy learning
- allowing associative memory

Units have numerical activations that change during a computation. These values may be in {0,1}, [0,1], R, etc. Some units are input units, some are output units, and some may be neither (hidden units). Connections have weights; learning involves changing these weights.

A unit is activated based on the activations of neighboring units and the weights of the corresponding connections. More precisely, each unit first takes a weighted sum of the activations of its neighboring units. It then determines its own activation by applying an activation function to the result.

Models differ in their underlying topology. For example, in feedforward networks, all connections are directed from the input layer toward the output layer. Other networks allow cycles in the underlying graph; we will have little to say about these networks. Feedforward networks may have discrete layers of units, with all connections running between one layer and the next.

The simplest feedforward model is the linear model, with just one layer of connections. So every unit is an input unit or an output unit. This means that the output units are independent, and we can assume there is just one.
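The weighted-sum-then-activation-function step can be sketched as follows; this is a minimal illustration, and the function names and default arguments are assumptions, not part of the notes.

```python
def unit_activation(inputs, weights, activation_fn=lambda s: s):
    """Compute one unit's activation: a weighted sum of the
    activations of its neighboring units, passed through an
    activation function.  The default is the identity function,
    as in the linear model."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return activation_fn(s)
```

For example, `unit_activation([1.0, 0.5], [0.2, 0.4])` gives 0.4, and passing a step function as `activation_fn` turns the same unit into a linear threshold unit.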
Activations are in R, and the activation function is the identity function. So activations in this model are sums of the activations of neighboring units, weighted by the connection weights. Only linear functions can be modeled. This is true even if extra layers are added, since linear functions compose to give other linear functions.

Constant terms in linear functions correspond to a bias. A bias may be implemented by having an extra input unit that is always on (or always off). This way of implementing a bias is common in other types of networks as well.

There is a simple learning algorithm in the linear case: a weight is increased when the units at either end of its connection are both on (or both off). This turns out to be equivalent to gradient descent for the problem of minimizing least squares error. The corresponding error surface is well behaved (it is quadratic), so local minima are not a problem.

In the linear threshold model, activations are in {0,1}. The activation function is a threshold function (cf. Figure 20.16(a) of R&N). That is, the activation of a unit is 1 iff the weighted sum of the neighboring activations exceeds the unit's threshold. Thresholds are just negative biases, so they can be learned.

Representations of the functions OR, AND, NOR, and NAND exist with no hidden units. One layer of hidden units suffices to represent any Boolean function; however, exponentially many hidden units may be needed in the worst case. In general, linear threshold units without hidden units can represent only linearly separable functions; XOR, for example, is not linearly separable, and so cannot be represented.

The learning algorithm for the linear case generalizes naturally to the linear threshold model without hidden units. With hidden units, credit assignment appears difficult. Also, a direct gradient descent analog appears to be impossible, since the step function defined by the threshold isn't differentiable. However, this function can be approximated by a sigmoid function, which is differentiable. Then the chain rule for differentiation gives a recursive form of gradient descent. The resulting algorithm is called back propagation.
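The chain-rule recursion behind back propagation can be sketched for a tiny network of sigmoid units. This is a hedged illustration only: the 2-2-1 network shape, the squared-error loss, and all variable names are assumptions, not from the notes.

```python
import math

def sigmoid(s):
    # Differentiable approximation to the step function.
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, W1, b1, W2, b2):
    """Forward pass: hidden activations h, then the single output y."""
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2)) + b1[j])
         for j in range(2)]
    y = sigmoid(sum(W2[j] * h[j] for j in range(2)) + b2)
    return h, y

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of the error E = (y - t)^2 / 2 with respect to all
    weights and biases, computed recursively by the chain rule:
    the output unit's error signal is passed back to the hidden units."""
    h, y = forward(x, W1, b1, W2, b2)
    delta_out = (y - t) * y * (1 - y)             # dE/d(output pre-activation)
    gW2 = [delta_out * h[j] for j in range(2)]
    gb2 = delta_out
    delta_h = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
    gW1 = [[delta_h[j] * x[i] for i in range(2)] for j in range(2)]
    gb1 = [delta_h[j] for j in range(2)]
    return gW1, gb1, gW2, gb2

# Example with illustrative weights: gradients for one training pair.
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.1]
W2 = [0.5, -0.5]
b2 = 0.2
gW1, gb1, gW2, gb2 = backprop([1.0, 0.0], 1.0, W1, b1, W2, b2)
```

Taking a small step against each gradient (e.g., `W2[j] -= lr * gW2[j]`) is one iteration of the resulting gradient descent.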
Linear threshold units are sometimes called perceptrons.
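The learning rule for a single linear threshold unit (the perceptron rule) can be sketched as follows. This is an illustrative sketch, not the notes' exact algorithm: the function names and the choice of AND as the target function are assumptions. Note the bias input that is always on, which implements the threshold as a learnable negative bias.

```python
def perceptron_train(samples, n_epochs=25, lr=1.0):
    """Perceptron rule for one linear threshold unit.
    samples: list of (inputs, target) pairs with targets in {0, 1}.
    A constant bias input of 1 is prepended to every input vector,
    so the threshold is learned along with the other weights."""
    w = [0.0, 0.0, 0.0]  # bias weight plus one weight per input
    for _ in range(n_epochs):
        for x, t in samples:
            xb = [1.0] + list(x)
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            # Weights change only on a mistake, in the direction
            # that reduces the error on this sample.
            w = [wi + lr * (t - y) * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    xb = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0

# AND is linearly separable, so the rule converges to a correct separator.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(and_samples)
```

Running the same loop on XOR would never converge, since no single linear threshold unit can represent a function that is not linearly separable.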