100 points, due May 13, 1997
Implement the ID3 algorithm, using the input representation described below. The input is a list of data points; you are to take all of the data points given as your training set -- that is, you needn't reserve any of them for the test set. The output to your ID3 function should be a tree.
Each data point consists of a label, a value (or classification), and a list of attributes. In your output tree, only the label should appear from the data point. For nonleaf nodes, the numerical gain should be represented. You should also have some indication of what attribute is used at each level. I suggest keeping track of the unused attribute names (or numbers) at each level, and representing and printing this along with the gain for each internal node.
You may assume that there are no missing or unknown attribute values. You may assume that no collection of attribute values receives more than one classification.
At each node N, use the attribute A with the maximum gain ratio [I(N)-E(A)]/IV(A), where
1) if N has B data points divided into classes of sizes
(B1, B2, ..., Br), then
I(N) = the sum as i goes from 1 to r of -[Bi/B][lg (Bi/B)]
2) if A divides N into v child nodes (S1, S2, ..., Sv) and node Si
has Mi data points, then
E(A) = the sum as i goes from 1 to v of [Mi/B][I(Si)], and
3) IV(A) = the sum as i goes from 1 to v of -[Mi/B][lg (Mi/B)]
If IV(A)=0, then A should be given an impossibly small gain ratio (e.g.,
a negative one) so that it is not selected.
You may use either Lisp or Prolog for this assigment. If you use Lisp, you
may use PP to print the output tree. Feel free to give a handwritten
explanation of your tree on the hard copy of your output if you think it's
necessary.I suggest at least 2 data types -- one for nodes and one for data sets (i.e., lists of data points). The data type for nodes could have components for gain, for the attributes so far unused, and for the list of children (or labels, if the node is a leaf). It might also be convenient if the data type for data sets had the same first two components as well, in addition to a component for the list of data points.
Use at least the test data publicized under the file name
a4.dat
Contrary to my normal practice, I will accept for partial credit solutions turned in by May 15 to this last assignment of the semester.