CSE 151 Programming Assignment 3
Introduction to Artificial Intelligence

Due Thursday, June 6, at the start of class (last day!).

The Decision Tree Learning Algorithm

Your mission, should you decide to accept it, is to implement and test the DECISION-TREE-LEARNING algorithm given on page 537 of AI: A Modern Approach by Russell and Norvig. Use the information gain criterion of page 541 for the Choose-Attribute function.

Your function shouldn't print anything out (except, perhaps, error messages). It should return a decision tree in the format described below. (You may, of course, write a separate "driver" function that calls your function and prints out the decision tree. You will need this during development, but you do not need to turn it in since we will use our own driver to do testing.)

To help with development and testing, we will provide two data sets of labeled training examples in the format described below. You may use these to test your function. We will also provide a function that will take your function and a data set as input and perform leave-one-out cross validation to verify that it is learning the pattern in the data. This function will print the predicted percentage error of the decision tree (CV-error) which you will turn in with your written report.

As always, if any of your programming group are caught or killed, the CSE department will disavow all knowledge of your actions. Good luck!

Coding Specifications

In order to facilitate your job (and to make our job of grading easier), you must use the following format for trees and sets of examples.

The name of your main function should be "DECISION-TREE-LEARNING". This function should take three required arguments in the following order: "EXAMPLES", "ATTRIBUTES" and "DEFAULT". The function should return a decision tree.

"EXAMPLES" should be a list of labeled instances. The label on each instance will be a symbol, either "YES" or "NO". A labeled instance is a list whose CAR is a symbolic category label and whose CDR is a Lisp association list, with attribute names associated with attribute values. For example:

	(yes (big . yes) 
             (charming . no) 
             (hairy . yes) 
             (cute . no))
... might be a labeled instance describing a true example of the concept "wumpus".

"ATTRIBUTES" should be a Lisp association list which associates all valid attribute names to lists of corresponding valid values for those attributes. For example, one such "ATTRIBUTES" list might be:

  ((Alternate    . (Yes No))
   (Bar          . (Yes No))
   (Fri-Sat      . (Yes No))
   (Hungry       . (Yes No))
   (Patrons      . (None Some Full))
   (Price        . (1 2 3))
   (Raining      . (Yes No))
   (Reservation  . (Yes No))
   (Type         . (French Italian Thai Burger))
   (WaitEstimate . (0-10 10-30 30-60 >60)))
This list communicates both the allowable attributes and all allowable nominal values that those attributes can take on.

"DEFAULT" is either the symbol "YES" or the symbol "NO". It should be used as a default category when there are insufficient examples to decide on the correct classification.

The output decision tree should be a tree of CONSes. Leaves should be atoms which indicate the category label assigned to a given instance object. They should be one of the symbols "YES" or "NO". Internal nodes are lists, each containing the name of the nominal attribute to be examined at the node and a sequence of subtrees tagged with their corresponding attribute values. For example, this is the decision tree for Figure 18.8 of Russell & Norvig (1995):

   (Patrons
     (None . No)
     (Some . Yes)
     (Full . (Hungry
               (Yes . No)
               (No . (Type
                       (French . Yes)
                       (Italian . No)
                       (Burger . Yes)
                       (Thai . (Fri-Sat
                                 (No . No)
                                 (Yes. Yes))))))))

Software Tools

Some software tools for dealing with decision trees and performing leave-one-out cross validation have been prepared for your use. A simple function for reading the training set files, described below, is also available. These functions are in the files:

No guarantee is offered concerning the correctness or utility of this code. You are free to use all or part of it in your assignment, as long as you provide proper credit to its author. Do not assume, however, that this code will be available to the grader if you do not explicitly provide it.

Training Sets

We will provide two training sets on which to test your function. One is a database of edible and poisonous mushrooms where "YES" means edible. The other is a collection of morality examples where "YES" means that doing the action described in the example should make you feel "guilty". These data sets are available in the "public" directory of the "cs151s" umbrella directory.

What To Hand In

You may work alone or in groups of up to three people on the programming assignments (including this one).

Turn in the code using the turnin utility by the date and time given above. This is done by placing all of your Lisp code into a single source code file called "dtl.lisp", and then executing:

% turnin -c cs151s dtl.lisp

The turnin program will print out some messages, including the size of the file submitted. Make sure that this size is reasonable (and that you haven't turned in something else by mistake). Make sure that the names and login IDs of all group members appear clearly in a comment at the top of your source file.

When working as part of a group, only one group member needs to submit the group's source code using turnin. Just make absolutely sure that all group members have their names and login IDs in a header comment in that file.

Each group should also bring to class and hand in a full listing of the code. In addition, each member of the group must turn in their own written document. This document should:

Make sure you clearly identify all the members of your group on everything you hand in.

Software Engineering Guidelines

All the code that you write should be high-quality. This means that you should pay attention to making it concise, elegant, and completely correct. Do not develop your software just by trial and error. Instead, follow good software engineering practice. Before you start coding, think carefully about the requirements that your code should satisfy, and about what algorithms you will use.

Find errors as much as possible by rereading your code, not just by testing. Do not be satisfied as soon as your code seems to work. Continue looking for errors and making your code more elegant and concise until you are satisfied that it is likely to be completely correct.

Your code should be well-documented. This means that there should be comments explaining the purpose, inputs, and outputs of every non-trivial function or procedure that you write, and comments explaining any tricky parts of your code. You should write comments at the same time that you write the corresponding code. Comments added afterwards are rarely as useful.


Grade Statistics

------------------------------------------------------------
 Under Range    In Range  Over Range         Sum
           0          14           0    1276.000
------------------------------------------------------------
        Mean      Median    Midpoint   Geometric    Harmonic
      91.143      95.000      80.000      90.407      89.529
------------------------------------------------------------
          SD   Quart Dev       Range     SE mean
      11.016       5.000      40.000       2.944
------------------------------------------------------------
     Minimum  Quartile 1  Quartile 2  Quartile 3     Maximum
      60.000      90.000      95.000     100.000     100.000
------------------------------------------------------------
        Skew    Kurtosis
      -1.639       4.940
------------------------------------------------------------
   Null Mean           t    prob (t)           F    prob (F)
       0.000      30.956       0.000     958.271       0.000
------------------------------------------------------------


Histogram:

       Midpt    Freq
      63.500       1 *
      70.500       0 
      77.500       1 *
      84.500       0 
      91.500       8 ********
      98.500       4 ****


[--------------------------------------]
Thursday, 13 June 1996, 00:00:00 GMT
Tim Bailey & David Noelle / tbailey@sdsc.edu & dnoelle@cs.ucsd.edu