
Predictive Modeling

Decision Trees: The CART Algorithm

The CART Algorithm


Classification and Regression Trees (sometimes designated C&RT)
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.

CART Algorithm - Overview


Binary decision tree algorithm
Recursively partitions data into 2 subsets so that cases within each subset are more homogeneous
Allows consideration of misclassification costs, prior distributions, and cost-complexity pruning

CART Algorithm
1. The basic idea is to choose a split at each node so that the data in each subset (child node) is purer than the data in the parent node. CART measures the impurity of the data in the nodes of a split with an impurity measure i(t).

CART Algorithm
2. If a split s at node t sends a proportion pL of the data to its left child node tL and a proportion pR to its right child node tR, the decrease in impurity of split s at node t is defined as

Δi(s, t) = i(t) − pL · i(tL) − pR · i(tR)

= impurity in node t − weighted average of the impurities in nodes tL and tR

CART Algorithm
3. A CART tree is grown, starting from its root node (i.e., the entire training data set) t = 1, by searching for the split s* among the set of all possible candidates S that gives the largest decrease in impurity:

Δi(s*, 1) = max_{s ∈ S} Δi(s, 1)

Node t = 1 is then split into two nodes t = 2 and t = 3 using split s*.

CART Algorithm
4. The above split-searching process is repeated for each child node.
5. The tree-growing process stops when the stopping criteria are met. (A minimal code sketch of this split-search loop follows.)
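A minimal sketch of steps 1-5, assuming numeric inputs, integer class labels, and the Gini index (defined on the following slides) as the impurity measure i(t); the function names here are illustrative, not part of the original CART software.

```python
import numpy as np

def gini(y):
    """Gini impurity i(t) of the integer class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search all (feature, threshold) candidates s in S and return the one with
    the largest decrease in impurity: di(s,t) = i(t) - pL*i(tL) - pR*i(tR)."""
    parent = gini(y)
    best = (None, None, 0.0)                     # (feature, threshold, decrease)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:      # candidate thresholds for feature j
            left = X[:, j] <= thr
            pL = left.mean()
            decrease = parent - pL * gini(y[left]) - (1 - pL) * gini(y[~left])
            if decrease > best[2]:
                best = (j, thr, decrease)
    return best

def grow(X, y, depth=0, max_depth=3, min_size=5):
    """Recursively apply the best split to each child node (steps 4-5)."""
    feature, thr, gain = best_split(X, y)
    if feature is None or gain <= 0 or depth >= max_depth or len(y) < min_size:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    left = X[:, feature] <= thr
    return {"leaf": False, "feature": feature, "threshold": thr,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}
```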

Measures of Impurity
Categorical (Nominal) Targets
The Gini Index
The Twoing Index

Categorical (Ordinal) Targets


Ordered Twoing

Continuous Targets
Least-Squared Deviation

Gini Index
If a data set T contains examples from n classes, the Gini Index is defined as

gini(T) = 1 − Σ_{j=1}^{n} p_j²

where p_j(t) = n_j(t) / n(t) is the relative proportion of category j in node t.

Gini Index
If a data set T is split into 2 subsets T1 and T2, with sizes N1 and N2 respectively, the Gini index of the split is defined as

gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)

This is a weighted average of the Gini indexes of the 2 subsets.

Gini Index
The Gini Index takes its maximum value of 1 − 1/k, where k is the number of categories, when the cases are evenly distributed across the k categories. The Gini Index takes its minimum value of 0 when all the cases occur in a single category.

Gini Index
A successful input variable tends to drive the cases into a single category. The more an input variable does this, the greater the reduction in the Gini Index from the parent node to the child nodes.

Improvement in Gini Index


For a split s at node t, the improvement in the Gini Index is given by:

I(s, t) = gini(t) − pL · gini(tL) − pR · gini(tR)

The split s (i.e., the input variable corresponding to split s) that accomplishes the greatest improvement will be selected.

An illustration of impurity
The next 3 slides are based on slides created by Dr. Hyunjoong Kim, formerly with the Dept. of Statistics, University of Tennessee.

A Measure of Impurity: The Gini Index


gini(t) = 1 − Σ_j p²(j|t)

Impurity of all the data (node t):
p(red|t) = 11/25, p(green|t) = 14/25
gini(t) = 1 − 0.1936 − 0.3136 = 0.4928

One partition (split s1) sends the cases to child nodes t1 and t2:
p(red|t1) = 2/13, p(green|t1) = 11/13, gini(t1) = 1 − 0.0237 − 0.7160 = 0.2604
p(red|t2) = 9/12, p(green|t2) = 3/12, gini(t2) = 1 − 0.5625 − 0.0625 = 0.375

Weighted average = 0.2604 × (13/25) + 0.375 × (12/25) ≈ 0.3154

Goodness of Split
Improvement in Gini = gini(t) − (n1/n) · gini(t1) − (n2/n) · gini(t2)

Improvement(s1) = 0.4928 − 0.3154 ≈ 0.177

gini(t) = 1 − Σ_j p²(j|t)

Impurity of all the data (node t):
p(red|t) = 11/25, p(green|t) = 14/25
gini(t) = 1 − 0.1936 − 0.3136 = 0.4928

Another partition (split s2) sends the cases to child nodes t1 and t2:
p(red|t1) = 0/11, p(green|t1) = 11/11, gini(t1) = 1 − 0.0 − 1.0 = 0.0
p(red|t2) = 11/14, p(green|t2) = 3/14, gini(t2) = 1 − 0.6173 − 0.0459 = 0.3367

Weighted average = 0.0 × (11/25) + 0.3367 × (14/25) ≈ 0.1886

Improvement(s2) = 0.4928 − 0.1886 ≈ 0.304
Improvement(s1) = 0.4928 − 0.3154 ≈ 0.177

s2 is the better split.
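The numbers above can be checked with a short script (a sketch; gini_from_counts is an illustrative helper, and the red/green counts are read off the slide's figure):

```python
def gini_from_counts(counts):
    """Gini index of a node given the number of cases in each category."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini_from_counts([11, 14])                       # all 25 cases: 11 red, 14 green
# partition s1: t1 = {2 red, 11 green}, t2 = {9 red, 3 green}
s1 = root - (13/25) * gini_from_counts([2, 11]) - (12/25) * gini_from_counts([9, 3])
# partition s2: t1 = {0 red, 11 green}, t2 = {11 red, 3 green}
s2 = root - (11/25) * gini_from_counts([0, 11]) - (14/25) * gini_from_counts([11, 3])
print(round(root, 3), round(s1, 3), round(s2, 3))       # 0.493 0.177 0.304
```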

1-Level Tree from AnswerTree (CART Algorithm)


Node 0 (root): Hypertension
Category    %         n
Normal      60.28     217
High        21.39      77
Low         18.33      66
Total      (100.00)   360

Split on AGE, Improvement = 0.0391

Node 1 (AGE 63-72)
Category    %         n
Normal      45.65      42
High        48.91      45
Low          5.43       5
Total      (25.56)     92

Node 2 (AGE 51-62; 32-50)
Category    %         n
Normal      65.30     175
High        11.94      32
Low         22.76      61
Total      (74.44)    268

Blood Pressure Example


g(node 0) = 1 − [(217/360)² + (77/360)² + (66/360)²] = 1 − 0.4427 = 0.5573

g(node 1) = 1 − [(42/92)² + (45/92)² + (5/92)²] = 1 − 0.4506 = 0.5494

g(node 2) = 1 − [(175/268)² + (32/268)² + (61/268)²] = 1 − 0.4925 = 0.5075

Blood Pressure Example

Improvement = g(0) − p1 · g(1) − p2 · g(2)
= 0.5573 − (92/360)(0.5494) − (268/360)(0.5075)
= 0.0391
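The same computation expressed in code (a sketch using the category counts from the tree above; the helper is the same illustrative one used earlier):

```python
def gini_from_counts(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

g0 = gini_from_counts([217, 77, 66])    # root: Normal / High / Low  -> 0.5573
g1 = gini_from_counts([42, 45, 5])      # node 1 (AGE 63-72)         -> 0.5494
g2 = gini_from_counts([175, 32, 61])    # node 2 (AGE 32-62)         -> 0.5075
improvement = g0 - (92/360) * g1 - (268/360) * g2
print(round(improvement, 4))            # 0.0391
```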


Gini: The General Formula


g(t) = Σ_{i≠j} C(i|j) · p(j|t) · p(i|t)

where

p(j|t) = p(j, t) / p(t),   p(j, t) = π(j) · N_j(t) / N_j,   p(t) = Σ_{j=1}^{k} p(j, t)

C(i|j) = cost of misclassifying a category j case as category i
π(j) = prior probability for category j
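A sketch of the general formula in code, assuming the class probabilities p(j|t) for a node have already been computed and the costs are supplied as a matrix C[i][j]; the names are illustrative.

```python
def gini_with_costs(p, C):
    """g(t) = sum over i != j of C(i|j) * p(j|t) * p(i|t)."""
    k = len(p)
    return sum(C[i][j] * p[j] * p[i]
               for i in range(k) for j in range(k) if i != j)

# With unit costs this reduces to the standard Gini index 1 - sum_j p_j^2.
p = [0.6, 0.3, 0.1]
unit_costs = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(gini_with_costs(p, unit_costs))   # ~0.54 == 1 - (0.36 + 0.09 + 0.01)
```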

What the Gini index does


Gini looks at the largest class in the target, and tries to find a split to isolate it from the other categories. A perfect series of splits would end up with k pure child nodes, one for each of the k categories in the target. If costs are used, Gini will attempt to isolate the most costly class.


Another splitting criterion: Twoing


Twoing first segments the categories of the target into two supercategories (or groups), attempting to find groups that each account for roughly 50% of the data. Twoing then searches for a split that best separates these two supercategories into separate child nodes.

Improvement in Twoing Index


For a split s at node t, the improvement in the Twoing Index is given by:

I(s, t) = pL · pR · [ Σ_j | p(j|tL) − p(j|tR) | ]²

The split s (i.e., the input variable corresponding to split s) that results in the greatest difference between the two child nodes will be selected.
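A sketch of evaluating a candidate split under the Twoing criterion, assuming the class distributions p(j|tL) and p(j|tR) of the two child nodes are already known (names illustrative). Some presentations include an extra constant factor of 1/4, which does not change which split is chosen.

```python
def twoing(p_left, p_right, pL, pR):
    """Twoing improvement: pL * pR * ( sum_j |p(j|tL) - p(j|tR)| ) ** 2."""
    diff = sum(abs(l - r) for l, r in zip(p_left, p_right))
    return pL * pR * diff ** 2

# Split s2 from the earlier illustration: tL is all green, tR is mostly red.
print(twoing([0.0, 1.0], [11/14, 3/14], 11/25, 14/25))   # ~0.61
```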


Gini vs. Twoing


Use Gini when the target has a small number of categories, 2 to 4. Use Twoing when the target has a large number of categories, 4 or more. Note that costs cannot be taken into account when splitting nodes using the Twoing criterion.

Splitting criterion for targets measured on a continuous scale


The measure of impurity used for a continuous target is the Least-Squared Deviation (LSD).
The LSD measure is the weighted within-node variance for node t.
It is also equal to the resubstitution estimate of risk for that node.


The Least-Squared Deviation Splitting Criterion


R(t) = (1 / N_W(t)) Σ_{i∈t} w_i f_i [ y_i − ȳ(t) ]²

where
N_W(t) = (weighted) number of cases in node t
w_i = value of the weighting variable for case i (if any)
f_i = value of the frequency variable for case i (if any)
y_i = value of the target variable for case i
ȳ(t) = weighted mean of the target for node t

Improvement in LSD Index


For a split s at node t, the improvement in the LSD Index is given by:

I(s, t) = R(t) − pL · R(tL) − pR · R(tR)

The split s (i.e., the input variable corresponding to split s) that accomplishes the greatest improvement will be selected.
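A sketch of the LSD impurity and its improvement for a numeric target, with the frequency variable folded into the case weights for brevity (names illustrative):

```python
import numpy as np

def lsd(y, w=None):
    """R(t): weighted within-node variance about the weighted node mean."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    mean = np.average(y, weights=w)                       # weighted mean ybar(t)
    return np.sum(w * (y - mean) ** 2) / w.sum()

def lsd_improvement(y, left_mask):
    """I(s,t) = R(t) - pL*R(tL) - pR*R(tR) for the split defined by left_mask."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    pL = left_mask.mean()
    return lsd(y) - pL * lsd(y[left_mask]) - (1 - pL) * lsd(y[~left_mask])
```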


CART Stopping Criteria


All cases in a node have identical values for all predictors.
The depth of the tree has reached its pre-specified maximum value.
The size of the node is less than a pre-specified minimum node size.
The split at a node would produce a child node whose size is less than a pre-specified minimum node size.
The node has become pure, i.e., all cases have the same value of the target variable.
The maximum decrease in impurity is less than a pre-specified value.
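For orientation, most of these stopping rules have direct counterparts in scikit-learn's CART-style decision trees (a sketch; the parameter values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=5,                  # maximum tree depth
    min_samples_split=20,         # minimum node size eligible for splitting
    min_samples_leaf=10,          # minimum size allowed for a child node
    min_impurity_decrease=0.001,  # minimum decrease in impurity for a split
)
```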

Pruning
After splitting stops because one of the stopping criteria has been met, the next step is pruning:
Prune the tree by cutting off weak branches. A weak branch is one with a high misclassification rate, as measured on validation data. Pruning the full tree will increase the overall error rate for the training set, but the reduced tree will generally provide better predictive power for unseen records. Prune away the branches that provide the least additional predictive power.
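Cost-complexity pruning, as in CART, is also available in scikit-learn. A minimal sketch of choosing the pruning strength on validation data, assuming arrays X_train, y_train, X_valid, y_valid already exist:

```python
from sklearn.tree import DecisionTreeClassifier

# Candidate pruning strengths from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_valid, y_valid)        # accuracy on validation data
    if score >= best_score:                       # on ties, keep the more heavily pruned tree
        best_alpha, best_score = alpha, score
```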


Missing Values in CART


A surrogate split can be used to handle missing values in predictor (input) variables. Suppose that X* is the input variable that defines the best split s* at node t. The purpose of a surrogate split is to find another split s, called the surrogate split, which uses a different input variable X and is most similar to s* at node t.

Missing Values in CART


If a new case is to be predicted and it has a missing value on X* at node t, the case is routed using the surrogate split s, so long as the case doesn't have a missing value on X.
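A minimal sketch of choosing a surrogate split: among splits on other input variables, pick the one that most often sends cases to the same side as the primary split s*. The names and the agreement measure are simplified relative to the CART book.

```python
import numpy as np

def find_surrogate(X, primary_goes_left, skip_feature):
    """Return (feature, threshold, agreement) of the split that best mimics the primary split."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        if j == skip_feature:                      # skip the primary split's own variable X*
            continue
        for thr in np.unique(X[:, j])[:-1]:
            goes_left = X[:, j] <= thr
            agreement = max(np.mean(goes_left == primary_goes_left),
                            np.mean(goes_left != primary_goes_left))  # allow a flipped direction
            if agreement > best[2]:
                best = (j, thr, agreement)
    return best
```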


Missing Values
CHAID and Exhaustive CHAID handle missing values by treating them as another valid category. CART uses the surrogate split technique to handle cases with missing values.

Validation and Risk Estimation


Once a tree has been built, its predictive value can be assessed
Nominal & Ordinal Targets:
Each node assigns a predicted category to all cases belonging to it. The risk estimate is the proportion of all cases incorrectly classified.


Validation and Risk Estimation


Once a tree has been built, its predictive value can be assessed
Continuous Targets:
Each node predicts the value as the mean of the cases in the node. The risk estimate is the within-node variance about each node's mean, averaged over all nodes (i.e., the mean squared error within nodes).
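A sketch of both risk estimates, assuming arrays of observed targets, per-case predictions, and per-case leaf-node assignments (names illustrative):

```python
import numpy as np

def risk_classification(y_true, y_pred):
    """Nominal/ordinal targets: proportion of cases classified incorrectly."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def risk_regression(y_true, node_ids):
    """Continuous targets: within-node variance about each node's mean,
    pooled over all cases (the within-node mean squared error)."""
    y = np.asarray(y_true, dtype=float)
    node_ids = np.asarray(node_ids)
    sse = 0.0
    for node in np.unique(node_ids):
        in_node = node_ids == node
        sse += np.sum((y[in_node] - y[in_node].mean()) ** 2)
    return sse / len(y)
```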

