http://onlamp.com/lpt/a/6464
Published on ONLamp.com (http://www.onlamp.com/)
A decision tree is essentially a series of if-then statements that, when applied to a record in a data set, results in the classification of that record. Therefore, once you've created your decision tree, you will be able to run a data set through the program and get a classification for each individual record within the data set. What this means to you, as a manufacturer of quality widgets, is that the program you create from this article will be able to predict the likelihood of each user, within a data set, purchasing your finely crafted product.

Though the classification of data is the driving force behind creating a decision tree program, it is not what makes a decision tree special. The beauty of decision trees lies in their ability to learn. When finished, you will be able to feed your program a test set of data, and it will essentially learn how to classify future sets of data from those examples.

Hopefully, if I've done my job well enough, you're champing at the bit to start coding. However, before you go any further with your code monkey endeavors, you need a good idea of what a decision tree looks like. It's one of those data structures that is easiest to understand with a good visual representation. Figure 1 contains a graphical depiction of a decision tree.
Figure 1. A decision tree

Notice that it actually is a true tree structure. Because of this, you can use recursive techniques to both create and traverse the decision tree. For that reason, you could use just about any tree representation that you remember from your Data Structures course to represent the decision tree. In this article, though, I'm going to keep everything very simple and create my decision tree out of only Python's built-in dictionary object.
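For instance, a tree like the one in Figure 1 can be stored as nested dictionaries: each inner node is a one-key dictionary mapping an attribute name to its branches, and each leaf is a plain classification string. (The exact attributes and branches below are illustrative, since the figure itself isn't reproduced here.)

```python
# Each inner node: {attribute: {value: subtree, ...}}; each leaf: a string.
tree = {
    "Age": {
        "< 18": "won't buy",
        "18-35": {"Education": {"high school": "won't buy",
                                "bachelor's": "will buy"}},
        "36-55": {"Marital Status": {"single": "will buy",
                                     "married": "won't buy"}},
    }
}

print(tree["Age"]["36-55"]["Marital Status"]["single"])  # -> will buy
```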
demographic. To do this, you need a set of test data with which to train the decision tree program. I assume that, being the concerned entrepreneur you undoubtedly are, you've already gathered plenty of demographic information through, perhaps, anonymous email surveys. Now, what you need to do is organize all this data into a large set of user records. Here is a table containing a sampling of the information you collected during your email survey:

[Table: 20 survey records with the columns Age, Education, Income, Marital Status, and Purchase?. Of the 20 records, 12 are classified "will buy" and 8 "won't buy"; the Age values range over < 18, 18-35, and 36-55, Education over high school, bachelor's, and master's, Income over low and high, and Marital Status over single and married.]
Given the decision tree in Figure 1 and the set of data, it should be somewhat easy to see just how a decision tree can classify records in a data set. Starting with the top node (Age), check the value of the first record in the field matching that of the top node (in this case, 36-55). Then follow the link to the next node in the tree (Marital Status) and repeat the process until you finally reach a leaf node (a node with no children). This leaf node holds the answer to the question of whether the user will buy your product. (In this example, the user will buy, because his marital status is single.) It's also quite easy to see that this type of operation lends itself to a recursive process (not necessarily the most efficient way to program, I know--but a very elegant one).

The decision tree in the figure is just one of many decision tree structures you could create to solve the marketing problem. The task of finding the optimal decision tree is an intractable problem. For those of you who have taken an analysis of algorithms course, you no doubt recognize this term. For those of you who haven't had this pleasure (he says, gritting his teeth), essentially what this means is that as the amount of test data used to train the decision tree grows, the time needed to find the optimal tree grows as well--exponentially.

While it may be nearly impossible to find the smallest (or, more fittingly, the shallowest) decision tree in a respectable amount of time, it is possible to find a decision tree that is "small enough" using special heuristics. The job of the heuristic you choose is to pick the "next best" attribute by which to divide the data set, according to some predefined criteria. There are many such heuristics (C4.5, C5.0, gain ratio, GINI, and others). For this article, however, I've used one of the more popular heuristics for choosing "next best" attributes, based on ideas from information theory. The ID3 (information theoretic) heuristic uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.
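The record-walking traversal described at the top of this section takes only a few lines under the nested-dictionary representation. (The classify name and this particular tree are illustrative sketches, not code from the article's source.)

```python
def classify(tree, record):
    """Recursively walk the decision tree until a leaf (a string) is reached."""
    if not isinstance(tree, dict):
        return tree  # leaf node: the classification itself
    attr = next(iter(tree))  # the attribute this node tests
    return classify(tree[attr][record[attr]], record)

tree = {"Age": {"36-55": {"Marital Status": {"single": "will buy",
                                             "married": "won't buy"}}}}
record = {"Age": "36-55", "Marital Status": "single"}
print(classify(tree, record))  # -> will buy
```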
The next section quickly covers the basic idea behind how this heuristic works. Don't worry; it's not too much math. Following this discussion, you'll finally get a chance to get your hands dirty by writing the code that will create the decision tree and classify the users in your data set as a "will buy" or "won't buy," thereby making your company instantly more profitable.
The algorithm works by trying to find the attribute that best reduces the amount of information you need to classify your data. The first step in this process is getting the "next best" attribute from the set of available attributes. The call to choose_attribute takes care of this step. The choose_attribute function uses the heuristic you've chosen for selecting "next best" attributes--in this case, the ID3 heuristic. In fact, the fitness_func parameter you see in the call to choose_attribute is a pointer to the gain function from the ID3 algorithm described in the next section. By passing in a pointer to the gain function, you've effectively separated the code for choosing the next attribute from the code for assembling the decision tree. This makes it possible, and extremely easy, to swap out the ID3 heuristic for another heuristic you may prefer, with only a minimal amount of change to the code.

The next step is to create a new decision tree containing the attribute returned from the choose_attribute function as its root node. The only task left after this is to create the subtrees for each of the values in the best attribute. The get_values function cycles through each of the records in the data set and returns a list containing the unique values for the chosen attribute. Finally, the code loops through each of these unique values and creates a subtree for it by making a recursive call to the create_decision_tree function. The call to get_examples just returns a list of all the records in the data set that have the value val for the attribute defined by the best variable. This list of examples is passed into the create_decision_tree function along with the list of remaining attributes (minus the currently selected "next best" attribute). The call to create_decision_tree returns the subtree for the remaining list of attributes and the subset of data passed into it.
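The two helpers just described are small enough to sketch here. These bodies are my reconstruction from the description above, not the article's own code (the real versions ship in the downloadable source).

```python
def get_values(data, attr):
    """Return the unique values the chosen attribute takes in the data set."""
    values = []
    for record in data:
        if record[attr] not in values:
            values.append(record[attr])
    return values

def get_examples(data, attr, value):
    """Return all records whose attr field equals the given value."""
    return [record for record in data if record[attr] == value]

data = [{"Age": "18-35"}, {"Age": "36-55"}, {"Age": "18-35"}]
print(get_values(data, "Age"))             # -> ['18-35', '36-55']
print(get_examples(data, "Age", "36-55"))  # -> [{'Age': '36-55'}]
```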
All that's left is to add each of these subtrees to the current decision tree and return it.

The first step in finding the entropy for a data set is to calculate the probability of each value of the target attribute occurring: in the table above, 12 of the 20 records are "will buy" and 8 are "won't buy," giving probabilities of 0.6 and 0.4. The next step is to find the number of bits needed to represent each of these probabilities. This is where you use the logarithm function. For the example above, the number of bits needed to represent the probability of each value occurring in the target attribute is log2 0.6 = -0.737 for "will buy" and log2 0.4 = -1.322 for "won't buy."

Now that you have the number of bits needed to represent the probability of each value occurring in the data set, all that's left to do is sum these up and, voilà, you have the entropy for the data set! Right? Not exactly. There is one more step first: you need to weight each of these numbers before summing them. To do so, multiply each amount found in the previous step by the probability of that value occurring, and then multiply the outcome by -1 to make the number positive. Once you've done this, the summation looks like (-0.6 * -0.737) + (-0.4 * -1.322) = 0.97. Thus, 0.97 is the entropy for the data set in the table above.

That's all there is to finding the entropy for a set of data. You use the same equation to calculate the entropy for each subset of data in the gain equation; the only difference is that each calculation uses a smaller subset of the records--the records sharing one value of the chosen attribute--rather than the whole data set.

The next step in the ID3 heuristic is to calculate the information gain that each attribute affords if it is chosen as the next decision criterion in the decision tree. If you understood the first step on calculating the entropy, then this step should be a breeze. Essentially, all you need to do to find the gain for a specific attribute is find the entropy measurement for that attribute using the process described in the last few paragraphs (find the entropy for the subset of data for each value in the chosen attribute and sum them all), and subtract this value from the entropy for the entire data set. The decision tree algorithm follows this process for each attribute in the data set, and the attribute with the highest gain is chosen as the next node in the decision tree. That's the prose explanation. For those of you who are a bit more mathematically minded, the equations in Figure 2 and Figure 3 are the entropy and information gain for the data set.
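Since the equation images don't reproduce in text, here are the two equations in standard ID3 notation, where S is the data set, C is the set of values of the target attribute, and S_v is the subset of S whose records take value v for attribute A:

```latex
% Figure 2: entropy of the data set S
Entropy(S) = -\sum_{c \in C} p(c) \, \log_2 p(c)

% Figure 3: information gain from splitting S on attribute A
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)
```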
Figure 3. The information gain equation

Entropy and gain are the only two methods you need in the ID3 module. If you understood the concepts of entropy and information gain, then you understand the final pieces of the puzzle. Just as a quick note: if you didn't totally understand the section on the ID3 heuristic, don't worry--several good web sites go over the ID3 heuristic in more detail. (One very good site in particular is decisiontrees.net, created by Michael Nashvili.) Also, keep in mind that ID3 is just one of several heuristics that you can use to decide the "next best" node in the decision tree. The most important thing is to understand the inner workings of the decision tree algorithm itself. In the end, if you don't understand ID3, you can always just plug in another heuristic or create your own.
def create_decision_tree(data, attributes, target_attr, fitness_func):
    """
    Returns a new decision tree based on the examples given.
    """
    data = data[:]
    vals = [record[target_attr] for record in data]
    # default: the most frequent value of the target attribute--the best
    # guess when the tree is unable to classify a record
    default = max(vals, key=vals.count) if vals else None

    # If the data set is empty or there are no predictive attributes left
    # (the attributes list still contains the target attribute), return
    # the default classification.
    if not data or (len(attributes) - 1) <= 0:
        return default
    # If all the records in the data set have the same classification,
    # return that classification.
    elif vals.count(vals[0]) == len(vals):
        return vals[0]
    else:
        # Choose the next best attribute to best classify our data
        best = choose_attribute(data, attributes, target_attr, fitness_func)

        # Create a new decision tree/node with the best attribute and an empty
        # dictionary object--we'll fill that up next.
        tree = {best: {}}

        # Create a new decision tree/sub-node for each of the values in the
        # best attribute field
        for val in get_values(data, best):
            # Create a subtree for the current value under the "best" field
            subtree = create_decision_tree(
                get_examples(data, best, val),
                [attr for attr in attributes if attr != best],
                target_attr,
                fitness_func)

            # Add the new subtree to the empty dictionary object in our new
            # tree/node we just created.
            tree[best][val] = subtree

        return tree
The create_decision_tree function starts off by declaring three variables: data, vals, and default. The first, data, is just a copy of the data list being passed into the function. The reason I do this is that Python passes all mutable data types, such as dictionaries and lists, by reference. It's a good rule of thumb to make a copy of any of these in order to keep from accidentally altering the original data. vals is a list of all the values in the target attribute for each record in the data set, and default holds the default value that is returned from the function when the data set is empty. That is simply the value in the target attribute with the highest frequency and, thus, the best guess for when the decision tree is unable to classify a record.

The next lines are the real nitty-gritty of the algorithm. The algorithm makes use of recursion to create the decision tree, and as such it needs a base case (or, in this case, two base cases) to keep it from entering an infinite recursive loop. What are the base cases for this algorithm? For starters, if either the data or attributes list is empty, then the algorithm has reached a stopping point. The first if-then statement takes care of this case: if either list is empty, the algorithm returns the default value--the value with the highest frequency in the data set for the target attribute. (Actually, for the attributes list, the check is whether it has only one attribute left in it, because the attributes list also contains the target attribute--the one the tree should predict--which the decision tree never splits on.) The only other case to worry about is when the remaining records in the data list all have the same value for the target attribute, in which case the algorithm returns that value. Those are the base cases. What about the recursive case?
Well, when everything else is normal (that is, the data and attributes lists are not empty, and the records in the data list still have multiple values for the target attribute), the algorithm needs to choose the "next best" attribute for classifying the test data and add it to the decision tree. The choose_attribute function is responsible for picking the "next best" attribute for classifying the records in the test data set. After this, the code creates a new decision tree containing only the newly selected "best" attribute. Then the recursion takes place: each of the subtrees is created by making a recursive call to the create_decision_tree function, and each returned tree is added to the tree created in the previous step.

If you're not used to recursion, this process can seem a bit strange. Take some time to look over the code and make sure that you understand what is happening here. Create a little script to run the function and print out the tree (or just alter test.py to do so), so you can get a better idea of how it's functioning.
It's really a good idea to take your time and make sure you understand what's happening, because many programming problems lend themselves to a recursive solution--you just may be adding a very important tool to your programming arsenal. That's about all there is to the main algorithm; everything else is just helper functions. Most of them should be fairly self-explanatory, with the exception of the ID3 heuristic.
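The first half of the ID3 heuristic is an entropy function. Here is a sketch of what it can look like, reconstructed from the description that follows (the article's own version ships in id3.py; the val_freq and data_entropy names come from that description):

```python
import math

def entropy(data, target_attr):
    """Calculate the entropy of the data set for the target attribute."""
    val_freq = {}
    data_entropy = 0.0

    # Tally the frequency of each value found in the target attribute.
    for record in data:
        val_freq[record[target_attr]] = val_freq.get(record[target_attr], 0.0) + 1.0

    # Entropy is the sum of -p * log2(p) over each value's probability p.
    for freq in val_freq.values():
        prob = freq / len(data)
        data_entropy += -prob * math.log(prob, 2)

    return data_entropy

# The 12 "will buy" / 8 "won't buy" split from the survey table:
records = [{"Purchase?": "will buy"}] * 12 + [{"Purchase?": "won't buy"}] * 8
print(round(entropy(records, "Purchase?"), 2))  # -> 0.97
```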
Just like the create_decision_tree function, the first thing the entropy function does is create the variables it uses throughout the algorithm. The first is a dictionary object called val_freq, which holds all the values found in the data set passed into the function and the frequency at which each value appears. The other variable is data_entropy, which holds the ongoing calculation of the data's entropy value.

The next section of code adds each of the values in the data set to the val_freq dictionary and calculates the corresponding frequency for each value. It does so by looping through each of the records in the data set and checking the val_freq dictionary object to see if the current value already resides within it. If it does, it increments the frequency for the current value; otherwise, it adds the current value to the dictionary object and initializes its frequency to 1. The final portion of the code is responsible for actually calculating the entropy measurement (using the equation in Figure 2) with the frequencies stored in the val_freq dictionary object.

That was easy, wasn't it? That's only the first half of the ID3 heuristic, though. Now that you know how to calculate the amount of disorder in a set of data, you need to take those calculations and use them to find the amount of information gain you get by using an attribute in the decision tree. The information gain function is very similar to the entropy function. Here's the code that calculates this measurement:
def gain(data, attr, target_attr):
    """
    Calculates the information gain (reduction in entropy) that would
    result by splitting the data on the chosen attribute (attr).
    """
    val_freq = {}
    subset_entropy = 0.0

    # Calculate the frequency of each of the values in the chosen attribute
    for record in data:
        if record[attr] in val_freq:
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]] = 1.0

    # Calculate the sum of the entropy for each subset of records, weighted
    # by their probability of occurring in the training set.
    for val in val_freq.keys():
        val_prob = val_freq[val] / sum(val_freq.values())
        data_subset = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)

    # Subtract the entropy of the chosen attribute from the entropy of the
    # whole data set with respect to the target attribute (and return it)
    return entropy(data, target_attr) - subset_entropy
Once again, the code starts by calculating the frequency of each of the values in the data set. Following this, it calculates the entropy for the data set under the new division of data derived by using the chosen attribute, attr, to classify the records. Subtracting that from the original entropy of the current subset of data yields the gain in information (or reduction in disorder, if you prefer to think of it in those terms) that you get by choosing that attribute as the next node in the decision tree.

That is essentially all there is to it. You still need some code that cycles through each attribute, calculates its information gain measure, and chooses the best one, but that part of the code should be somewhat obvious; it's just a matter of repeatedly calling the gain function on each attribute and keeping track of the attribute with the best score. That said, I leave it as a challenge for you to look over the rest of the helper functions in the accompanying source code and figure out what each one does.
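That loop might look something like the following. The choose_attribute and fitness_func names come from the article, but this body is my own sketch of the helper, not the version in the accompanying source:

```python
def choose_attribute(data, attributes, target_attr, fitness_func):
    """Return the attribute with the highest heuristic score.

    fitness_func is any heuristic with the signature
    (data, attr, target_attr), such as the ID3 gain function.
    """
    best_attr, best_score = None, None
    for attr in attributes:
        if attr == target_attr:
            continue  # never split on the attribute we're trying to predict
        score = fitness_func(data, attr, target_attr)
        if best_score is None or score > best_score:
            best_attr, best_score = attr, score
    return best_attr

# Toy check with a stand-in heuristic that scores attributes from a table:
scores = {"Age": 0.25, "Income": 0.50, "Education": 0.10}
fake_gain = lambda data, attr, target: scores[attr]
print(choose_attribute([], ["Age", "Income", "Education", "Purchase?"],
                       "Purchase?", fake_gain))  # -> Income
```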
Conclusion
As I stated earlier, the rest of the code is basically just helper functions for the decision tree algorithm, and I hope they are fairly self-explanatory. Download the decision tree source code to see the rest of the functions that help create the decision tree. The tarball contains three separate source code files. If you want to try out the algorithm and see how it works, just uncompress the source and run the test.py file. All it does is create a set of test data (the data you saw earlier in this article) that it uses to create a decision tree. Then it creates another set of sample data whose records it classifies using the decision tree built from the test data in the first step.

The other two source files contain the code for the decision tree algorithm and the ID3 heuristic. The first file, d_tree.py, contains the create_decision_tree function and all the helper functions associated with it. The second file contains all the code that implements the ID3 heuristic and is called, appropriately enough, id3.py. The reason for this division is that the decision tree learning algorithm is well established and has little need for change, whereas many different heuristics can be used to choose the "next best" attribute. By placing the heuristic code in its own file, you can try out other heuristics simply by adding another file and importing it in place of id3.py in the file that calls create_decision_tree. (In this case, that file is test.py.)

I've had fun running through this first foray into the world of artificial intelligence with you. I hope you've enjoyed this tutorial and had plenty of success in getting your decision tree up and running. If so, and you find yourself thirsting for more AI-related topics, such as genetic algorithms, neural networks, and swarm intelligence, then keep your eyes peeled for my next installment in this series on Python AI programming. Until next time ... I wish you all the best in your programming endeavors.

Christopher Roach recently graduated with a master's in computer science and currently works in Florida as a software engineer at a government communications corporation.