
Predictive Modeling

Decision Trees: The CART Algorithm

The CART Algorithm


Classification and Regression Trees (sometimes designated C&RT)
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone, 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.

CART Algorithm - Overview


Binary decision tree algorithm
Recursively partitions data into 2 subsets so that cases within each subset are more homogeneous
Allows consideration of misclassification costs, prior distributions, and cost-complexity pruning

CART Algorithm
1. The basic idea is to choose a split at each node so that the data in each subset (child node) is purer than the data in the parent node. CART measures the impurity of the data in the nodes of a split with an impurity measure i(t).

CART Algorithm
2. If a split s at node t sends a proportion pL of the data to its left child node tL and a proportion pR to its right child node tR, the decrease in impurity of split s at node t is defined as

Δi(s, t) = i(t) − pL · i(tL) − pR · i(tR)

= impurity in node t − weighted average of the impurities in nodes tL and tR

CART Algorithm
3. A CART tree is grown, starting from its root node (i.e., the entire training data set) t = 1, by searching for the split s* among the set of all possible candidates S that gives the largest decrease in impurity:

Δi(s*, 1) = max_{s ∈ S} Δi(s, 1)

Node t = 1 is then split into two nodes t = 2 and t = 3 using split s*.

CART Algorithm
4. The above split-searching process is repeated for each child node.
5. The tree-growing process stops when the stopping criteria are met. (A minimal code sketch of this split-search loop follows.)
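A minimal sketch of steps 1-5, assuming numeric inputs, integer class labels, and the Gini index (defined on the following slides) as the impurity measure i(t); the function names here are illustrative, not part of the original CART software.

```python
import numpy as np

def gini(y):
    """Gini impurity i(t) of the integer class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search all (feature, threshold) candidates s in S and return the one with
    the largest decrease in impurity: di(s,t) = i(t) - pL*i(tL) - pR*i(tR)."""
    parent = gini(y)
    best = (None, None, 0.0)                     # (feature, threshold, decrease)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:      # candidate thresholds for feature j
            left = X[:, j] <= thr
            pL = left.mean()
            decrease = parent - pL * gini(y[left]) - (1 - pL) * gini(y[~left])
            if decrease > best[2]:
                best = (j, thr, decrease)
    return best

def grow(X, y, depth=0, max_depth=3, min_size=5):
    """Recursively apply the best split to each child node (steps 4-5)."""
    feature, thr, gain = best_split(X, y)
    if feature is None or gain <= 0 or depth >= max_depth or len(y) < min_size:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    left = X[:, feature] <= thr
    return {"leaf": False, "feature": feature, "threshold": thr,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}
```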

Measures of Impurity
Categorical (Nominal) Targets
The Gini Index
The Twoing Index

Categorical (Ordinal) Targets


Ordered Twoing

Continuous Targets
Least-Squared Deviation

Gini Index
If a data set T contains examples from n classes, the Gini Index is defined as

gini(T) = 1 − Σ_{j=1}^{n} p_j²

where p_j(t) = n_j(t) / n(t) is the relative proportion of category j in node t.

Gini Index
If a data set T is split into 2 subsets T1 and T2, with sizes N1 and N2 respectively, the Gini index of the split is defined as

gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)

This is a weighted average of the Gini indexes of the 2 subsets.

Gini Index
The Gini Index takes its maximum value of 1 − 1/k, where k is the number of categories, when the cases are evenly distributed across the k categories. The Gini Index takes its minimum value of 0 when all the cases occur in a single category.

Gini Index
A successful input variable tends to drive the cases into a single category. The more an input variable does this, the greater the reduction in the Gini Index from the parent node to the child nodes.

Improvement in Gini Index


For a split s at node t, the improvement in the Gini Index is given by:

I(s, t) = gini(t) − pL · gini(tL) − pR · gini(tR)

The split s (i.e., the input variable corresponding to split s) that accomplishes the greatest improvement will be selected.

An illustration of impurity
The next 3 slides are based on slides created by Dr. Hyunjoong Kim, formerly with the Dept. of Statistics, University of Tennessee.

A Measure of Impurity: The Gini Index


gini(t) = 1 − Σ_j p²(j|t)

Impurity of all the data (node t):
p(red|t) = 11/25, p(green|t) = 14/25
gini(t) = 1 − 0.1936 − 0.3136 = 0.4928

One partition (split s1) sends the cases to child nodes t1 and t2:
p(red|t1) = 2/13, p(green|t1) = 11/13, gini(t1) = 1 − 0.0237 − 0.7160 = 0.2604
p(red|t2) = 9/12, p(green|t2) = 3/12, gini(t2) = 1 − 0.5625 − 0.0625 = 0.375

Weighted average = 0.2604 × (13/25) + 0.375 × (12/25) ≈ 0.3154

Goodness of Split
Improvement in Gini = gini(t) − (n1/n) · gini(t1) − (n2/n) · gini(t2)

Improvement(s1) = 0.4928 − 0.3154 ≈ 0.177

gini(t) = 1 − Σ_j p²(j|t)

Impurity of all the data (node t):
p(red|t) = 11/25, p(green|t) = 14/25
gini(t) = 1 − 0.1936 − 0.3136 = 0.4928

Another partition (split s2) sends the cases to child nodes t1 and t2:
p(red|t1) = 0/11, p(green|t1) = 11/11, gini(t1) = 1 − 0.0 − 1.0 = 0.0
p(red|t2) = 11/14, p(green|t2) = 3/14, gini(t2) = 1 − 0.6173 − 0.0459 = 0.3367

Weighted average = 0.0 × (11/25) + 0.3367 × (14/25) ≈ 0.1886

Improvement(s2) = 0.4928 − 0.1886 ≈ 0.304
Improvement(s1) = 0.4928 − 0.3154 ≈ 0.177

s2 is the better split.
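The numbers above can be checked with a short script (a sketch; gini_from_counts is an illustrative helper, and the red/green counts are read off the slide's figure):

```python
def gini_from_counts(counts):
    """Gini index of a node given the number of cases in each category."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini_from_counts([11, 14])                       # all 25 cases: 11 red, 14 green
# partition s1: t1 = {2 red, 11 green}, t2 = {9 red, 3 green}
s1 = root - (13/25) * gini_from_counts([2, 11]) - (12/25) * gini_from_counts([9, 3])
# partition s2: t1 = {0 red, 11 green}, t2 = {11 red, 3 green}
s2 = root - (11/25) * gini_from_counts([0, 11]) - (14/25) * gini_from_counts([11, 3])
print(round(root, 3), round(s1, 3), round(s2, 3))       # 0.493 0.177 0.304
```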

1-Level Tree from AnswerTree (CART Algorithm)


Node 0 (root): Hypertension
Category    %         n
Normal      60.28     217
High        21.39      77
Low         18.33      66
Total      (100.00)   360

Split on AGE, Improvement = 0.0391

Node 1 (AGE 63-72)
Category    %         n
Normal      45.65      42
High        48.91      45
Low          5.43       5
Total      (25.56)     92

Node 2 (AGE 51-62; 32-50)
Category    %         n
Normal      65.30     175
High        11.94      32
Low         22.76      61
Total      (74.44)    268

Blood Pressure Example


g(node 0) = 1 − [(217/360)² + (77/360)² + (66/360)²] = 1 − 0.4427 = 0.5573

g(node 1) = 1 − [(42/92)² + (45/92)² + (5/92)²] = 1 − 0.4506 = 0.5494

g(node 2) = 1 − [(175/268)² + (32/268)² + (61/268)²] = 1 − 0.4925 = 0.5075

Blood Pressure Example

Improvement = g(0) − p1 · g(1) − p2 · g(2)
= 0.5573 − (92/360)(0.5494) − (268/360)(0.5075)
= 0.0391
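The same computation expressed in code (a sketch using the category counts from the tree above; the helper is the same illustrative one used earlier):

```python
def gini_from_counts(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

g0 = gini_from_counts([217, 77, 66])    # root: Normal / High / Low  -> 0.5573
g1 = gini_from_counts([42, 45, 5])      # node 1 (AGE 63-72)         -> 0.5494
g2 = gini_from_counts([175, 32, 61])    # node 2 (AGE 32-62)         -> 0.5075
improvement = g0 - (92/360) * g1 - (268/360) * g2
print(round(improvement, 4))            # 0.0391
```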


Gini: The General Formula


g(t) = Σ_{i≠j} C(i|j) · p(j|t) · p(i|t)

where

p(j|t) = p(j, t) / p(t),   p(j, t) = π(j) · N_j(t) / N_j,   p(t) = Σ_{j=1}^{k} p(j, t)

C(i|j) = cost of misclassifying a category j case as category i
π(j) = prior probability for category j
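A sketch of the general formula in code, assuming the class probabilities p(j|t) for a node have already been computed and the costs are supplied as a matrix C[i][j]; the names are illustrative.

```python
def gini_with_costs(p, C):
    """g(t) = sum over i != j of C(i|j) * p(j|t) * p(i|t)."""
    k = len(p)
    return sum(C[i][j] * p[j] * p[i]
               for i in range(k) for j in range(k) if i != j)

# With unit costs this reduces to the standard Gini index 1 - sum_j p_j^2.
p = [0.6, 0.3, 0.1]
unit_costs = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(gini_with_costs(p, unit_costs))   # ~0.54 == 1 - (0.36 + 0.09 + 0.01)
```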

What the Gini index does


Gini looks at the largest class in the target, and tries to find a split to isolate it from the other categories. A perfect series of splits would end up with k pure child nodes, one for each of the k categories in the target. If costs are used, Gini will attempt to isolate the most costly class.


Another splitting criterion: Twoing


Twoing first segments the categories of the target into two supercategories (or groups), attempting to find groups that each account for roughly 50% of the data. Twoing then searches for a split that best separates these two supercategories into separate child nodes.

Improvement in Twoing Index


For a split s at node t, the improvement in the Twoing Index is given by:

I(s, t) = pL · pR · [ Σ_j | p(j|tL) − p(j|tR) | ]²

The split s (i.e., the input variable corresponding to split s) that results in the greatest difference between the two child nodes will be selected.
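A sketch of evaluating a candidate split under the Twoing criterion, assuming the class distributions p(j|tL) and p(j|tR) of the two child nodes are already known (names illustrative). Some presentations include an extra constant factor of 1/4, which does not change which split is chosen.

```python
def twoing(p_left, p_right, pL, pR):
    """Twoing improvement: pL * pR * ( sum_j |p(j|tL) - p(j|tR)| ) ** 2."""
    diff = sum(abs(l - r) for l, r in zip(p_left, p_right))
    return pL * pR * diff ** 2

# Split s2 from the earlier illustration: tL is all green, tR is mostly red.
print(twoing([0.0, 1.0], [11/14, 3/14], 11/25, 14/25))   # ~0.61
```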


Gini vs. Twoing


Use Gini when the target has a small number of categories, 2 to 4. Use Twoing when the target has a large number of categories, 4 or more. Note that costs cannot be taken into account when splitting nodes using the Twoing criterion.

Splitting criterion for targets measured on a continuous scale


The measure of impurity used for a continuous target is the Least-Squared Deviation (LSD).
The LSD measure is the weighted within-node variance for node t.
It is also equal to the resubstitution estimate of risk for that node.


The Least-Squared Deviation Splitting Criterion


R(t) = (1 / N_W(t)) Σ_{i∈t} w_i f_i [ y_i − ȳ(t) ]²

where
N_W(t) = (weighted) number of cases in node t
w_i = value of the weighting variable for case i (if any)
f_i = value of the frequency variable for case i (if any)
y_i = value of the target variable for case i
ȳ(t) = weighted mean of the target for node t

Improvement in LSD Index


For a split s at node t, the improvement in the LSD Index is given by:

I(s, t) = R(t) − pL · R(tL) − pR · R(tR)

The split s (i.e., the input variable corresponding to split s) that accomplishes the greatest improvement will be selected.
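A sketch of the LSD impurity and its improvement for a numeric target, with the frequency variable folded into the case weights for brevity (names illustrative):

```python
import numpy as np

def lsd(y, w=None):
    """R(t): weighted within-node variance about the weighted node mean."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    mean = np.average(y, weights=w)                       # weighted mean ybar(t)
    return np.sum(w * (y - mean) ** 2) / w.sum()

def lsd_improvement(y, left_mask):
    """I(s,t) = R(t) - pL*R(tL) - pR*R(tR) for the split defined by left_mask."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    pL = left_mask.mean()
    return lsd(y) - pL * lsd(y[left_mask]) - (1 - pL) * lsd(y[~left_mask])
```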


CART Stopping Criteria


All cases in a node have identical values for all predictors.
The depth of the tree has reached its pre-specified maximum value.
The size of the node is less than a pre-specified minimum node size.
The split at a node would produce a child node whose size is less than a pre-specified minimum node size.
The node has become pure, i.e., all cases have the same value of the target variable.
The maximum decrease in impurity is less than a pre-specified value.
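For orientation, most of these stopping rules have direct counterparts in scikit-learn's CART-style decision trees (a sketch; the parameter values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=5,                  # maximum tree depth
    min_samples_split=20,         # minimum node size eligible for splitting
    min_samples_leaf=10,          # minimum size allowed for a child node
    min_impurity_decrease=0.001,  # minimum decrease in impurity for a split
)
```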

Pruning
After splitting stops because one of the stopping criteria has been met, the next step is pruning:
Prune the tree by cutting off weak branches. A weak branch is one with a high misclassification rate, as measured on validation data. Pruning the full tree will increase the overall error rate for the training set, but the reduced tree will generally provide better predictive power for unseen records. Prune away the branches that provide the least additional predictive power.
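Cost-complexity pruning, as in CART, is also available in scikit-learn. A minimal sketch of choosing the pruning strength on validation data, assuming arrays X_train, y_train, X_valid, y_valid already exist:

```python
from sklearn.tree import DecisionTreeClassifier

# Candidate pruning strengths from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_valid, y_valid)        # accuracy on validation data
    if score >= best_score:                       # on ties, keep the more heavily pruned tree
        best_alpha, best_score = alpha, score
```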


Missing Values in CART


A surrogate split can be used to handle missing values in predictor (input) variables. Suppose that X* is the input variable that defines the best split s* at node t. The purpose of a surrogate split is to find another split s, called the surrogate split, which uses a different input variable X and is most similar to s* at node t.

Missing Values in CART


If a new case is to be predicted and it has a missing value on X* at node t, the case is routed using the surrogate split s, so long as the case doesn't have a missing value on X.
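A minimal sketch of choosing a surrogate split: among splits on other input variables, pick the one that most often sends cases to the same side as the primary split s*. The names and the agreement measure are simplified relative to the CART book.

```python
import numpy as np

def find_surrogate(X, primary_goes_left, skip_feature):
    """Return (feature, threshold, agreement) of the split that best mimics the primary split."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        if j == skip_feature:                      # skip the primary split's own variable X*
            continue
        for thr in np.unique(X[:, j])[:-1]:
            goes_left = X[:, j] <= thr
            agreement = max(np.mean(goes_left == primary_goes_left),
                            np.mean(goes_left != primary_goes_left))  # allow a flipped direction
            if agreement > best[2]:
                best = (j, thr, agreement)
    return best
```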


Missing Values
CHAID and Exhaustive CHAID handle missing values by treating them as another valid category. CART uses the surrogate split technique to handle cases with missing values.

Validation and Risk Estimation


Once a tree has been built, its predictive value can be assessed
Nominal & Ordinal Targets:
Each node assigns a predicted category to all cases belonging to it. The risk estimate is the proportion of all cases incorrectly classified.


Validation and Risk Estimation


Once a tree has been built, its predictive value can be assessed
Continuous Targets:
Each node predicts the value as the mean of the cases in the node. The risk estimate is the within-node variance about each node's mean, averaged over all nodes (i.e., the mean squared error within nodes).
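A sketch of both risk estimates, assuming arrays of observed targets, per-case predictions, and per-case leaf-node assignments (names illustrative):

```python
import numpy as np

def risk_classification(y_true, y_pred):
    """Nominal/ordinal targets: proportion of cases classified incorrectly."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def risk_regression(y_true, node_ids):
    """Continuous targets: within-node variance about each node's mean,
    pooled over all cases (the within-node mean squared error)."""
    y = np.asarray(y_true, dtype=float)
    node_ids = np.asarray(node_ids)
    sse = 0.0
    for node in np.unique(node_ids):
        in_node = node_ids == node
        sse += np.sum((y[in_node] - y[in_node].mean()) ** 2)
    return sse / len(y)
```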

