Decision trees

A decision tree is a tree in which every nonleaf node corresponds to a set of choices, and each of its children to one of those choices. The leaves typically correspond to classes in a classification (which needn't be binary). For us, instances are represented as n-tuples of attribute values -- each instance over the same n attributes -- and every node represents a choice based on the value of a single attribute.

Here, a decision tree can be constructed recursively once an attribute is selected for its root: each subtree is constructed from those instances that share one particular value for the root's attribute. Values with no instances can be given a classification by, say, majority vote over the parent's instances. The recursion terminates when all remaining instances have the same classification. Note that this algorithm is greedy -- it never reconsiders the choice of root for any subtree.

So the question becomes which attribute to select for a given node. The ideal choice would give a tree that is small in height and perhaps in size, which allows for reasonable generalization. A good choice of root minimizes the amount of remaining uncertainty, and information theory lets us compute uncertainty, and hence information gain, precisely. A greedy strategy may therefore choose the attribute with the largest information gain.

It is not ideal to split every node whose instances have inconsistent classifications. Doing so can lead to overfitting, since some attributes may be spurious. A statistical test can be used to determine when an attribute appears to be irrelevant -- that is, when not to split a node.

Other problems: Information gain is biased toward attributes with a large number of values; one can fix this by normalizing the gain, i.e., dividing by a suitable expression. Attribute values can be missing. Input or output values can be continuous. Decision tree learning is supervised learning.
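The recursive, greedy construction and the information-gain criterion described above can be sketched in Python. This is a minimal sketch, not code from any of the texts cited below: the names entropy, information_gain, and id3, and the dict-based tree representation, are my own choices. Instances are assumed to be tuples of discrete attribute values, indexed by attribute position.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(instances, labels, attr):
    """Expected reduction in entropy from splitting on attribute index attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(inst[attr] for inst in instances):
        subset = [lab for inst, lab in zip(instances, labels)
                  if inst[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(instances, labels, attributes):
    """Greedily build a decision tree; leaves are class labels."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: make a leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority vote
    # Greedy step: pick the attribute with the largest information gain.
    best = max(attributes,
               key=lambda a: information_gain(instances, labels, a))
    tree = {"attr": best, "children": {}}
    for value in set(inst[best] for inst in instances):
        sub = [(inst, lab) for inst, lab in zip(instances, labels)
               if inst[best] == value]
        sub_inst, sub_lab = zip(*sub)
        tree["children"][value] = id3(list(sub_inst), list(sub_lab),
                                      [a for a in attributes if a != best])
    return tree
```

For instance, on four instances over two attributes where only attribute 0 determines the class, the root splits on attribute 0 and both children are pure leaves.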
The algorithm as presented in Coppin (or Russell & Norvig), based on Quinlan's ID3, is not incremental; however, an incremental version is possible. The original algorithm behaves badly in the presence of noise. For example, noise can lead to inconsistent classifications. Here we might settle the issue by majority vote, or we could live with the inaccuracy, as we do when some nodes with several classifications aren't split.
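The majority-vote resolution can be sketched as follows. The function name resolve_noise is my own, and the sketch assumes noise shows up as duplicate instances carrying conflicting labels.

```python
from collections import Counter, defaultdict

def resolve_noise(instances, labels):
    """Collapse duplicate instances with conflicting labels via majority vote.

    Returns a dict mapping each distinct instance to its most common label;
    Counter.most_common(1) picks the single most frequent label.
    """
    groups = defaultdict(list)
    for inst, lab in zip(instances, labels):
        groups[inst].append(lab)
    return {inst: Counter(labs).most_common(1)[0][0]
            for inst, labs in groups.items()}
```

Here two copies of an instance labeled "yes" outvote one copy labeled "no", so the conflicting group is classified "yes".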