CART Algorithm
1. The basic idea is to choose a split at each node so that the data in each subset (child node) is purer than the data in the parent node. CART measures the impurity of the data in the nodes of a split with an impurity measure i(t).
2. If a split s at node t sends a proportion pL of the data to its left child node tL and a proportion pR to its right child node tR, the decrease in impurity of split s at node t is defined as

Δi(s, t) = i(t) − pL·i(tL) − pR·i(tR)
         = (impurity in node t) − (weighted average of the impurities in nodes tL and tR)
3. A CART tree is grown, starting from its root node (i.e., the entire training data set), t = 1, by searching for the split s* among the set of all possible candidates S that gives the largest decrease in impurity:

s* = arg max_{s ∈ S} Δi(s, t)

Node t = 1 is then split into two nodes, t = 2 and t = 3, using split s*.
4. The above split-searching process is repeated for each child node.
5. The tree-growing process is stopped when all the stopping criteria are met. A minimal sketch of the whole loop follows below.
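As a rough illustration of steps 1-5, here is a minimal sketch in Python. It assumes numeric inputs, binary splits of the form x_f ≤ threshold, and a generic impurity function i(·) passed in; the names (decrease, best_split, grow, min_size, min_decrease) are illustrative, not CART's reference implementation.

from collections import Counter

def decrease(i, y, left_mask):
    # Delta-i(s, t) = i(t) - pL * i(tL) - pR * i(tR)
    y_left = [v for v, m in zip(y, left_mask) if m]
    y_right = [v for v, m in zip(y, left_mask) if not m]
    if not y_left or not y_right:
        return 0.0
    p_left, p_right = len(y_left) / len(y), len(y_right) / len(y)
    return i(y) - p_left * i(y_left) - p_right * i(y_right)

def best_split(i, X, y):
    # Search the set S of all candidate splits for the s* with the
    # largest decrease in impurity.
    best_d, best_s = 0.0, None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            d = decrease(i, y, [row[f] <= thr for row in X])
            if d > best_d:
                best_d, best_s = d, (f, thr)
    return best_d, best_s

def grow(i, X, y, min_size=5, min_decrease=1e-6):
    # Steps 3-5: split recursively until the stopping criteria are met.
    d, s = best_split(i, X, y)
    if s is None or d < min_decrease or len(y) < min_size:
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    f, thr = s
    left = [k for k, row in enumerate(X) if row[f] <= thr]
    right = [k for k, row in enumerate(X) if row[f] > thr]
    return {"leaf": False, "split": s,
            "left": grow(i, [X[k] for k in left], [y[k] for k in left],
                         min_size, min_decrease),
            "right": grow(i, [X[k] for k in right], [y[k] for k in right],
                          min_size, min_decrease)}

Any impurity measure with the signature i(labels) can be plugged in; the Gini function defined later in this section is one example.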
Measures of Impurity

Categorical (Nominal) Targets:
- The Gini Index
- The Twoing Index

Continuous Targets:
- Least-Squared Deviation
Gini Index
If a data set T contains examples from n classes, the Gini Index is defined as

gini(T) = 1 − Σ_{j=1}^{n} p_j²

where p_j(t) = n_j(t) / n(t), the proportion of class-j cases among the n(t) cases in node t.
If a data set T is split into 2 subsets T1 and T2, with sizes N1 and N2 respectively (N = N1 + N2), the Gini Index of the split is defined as

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
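A direct transcription of these two formulas, as a sketch in plain Python (class labels may be any hashable values):

from collections import Counter

def gini(labels):
    # gini(T) = 1 - sum over the classes j of p_j squared
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

For example, gini(['a', 'a', 'b', 'b']) returns 0.5, and gini can serve as the impurity function i(·) in the growing sketch above.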
The Gini Index takes its maximum value of 1 − 1/k, where k is the number of categories, when the cases are evenly distributed across the k categories. It takes its minimum value of 0 when all the cases occur in a single category.
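A quick check of both extremes for k = 3 categories: evenly spread cases give gini = 1 − 3·(1/3)² = 1 − 1/3 = 2/3 = 1 − 1/k, while all cases in a single category give gini = 1 − 1² = 0.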
A good input variable tends to drive the cases into a single category; the more an input variable does this, the greater the decrease in the Gini Index from the parent node to its child nodes.
An illustration of impurity
The next 3 slides are based on slides created by Dr. Hyunjoong Kim, formerly with the Dept. of Statistics, University of Tennessee.
[Figure: two candidate partitions, s1 and s2, of the same node t, each dividing its cases into child nodes t1 and t2.]

Goodness of Split

Improvement in Gini = gini(t) − (n1/n)·gini(t1) − (n2/n)·gini(t2)

where gini(t) = 1 − Σ_j p²(j|t).

For the node being split: p(red|t) = 11/25 and p(green|t) = 14/25, so gini(t) = 1 − 0.194 − 0.314 = 0.492.

S2 is the better split.
Example: splitting on AGE (Improvement = 0.0391)

Node 1 (AGE 63-72)
Category    %       n
Normal     45.65    42
High       48.91    45
Low         5.43     5
Total     (25.56)   92

Node 2 (AGE 51-62; 32-50)
Category    %       n
Normal     65.30   175
High       11.94    32
Low        22.76    61
Total     (74.44)  268

Improvement = g(0) − p1·g(1) − p2·g(2)
            = 0.5573 − (92/360)·(0.5494) − (268/360)·(0.5075)
            = 0.0391
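The figures in this example can be reproduced from the raw counts; a self-contained check in Python (counts read from the table above, in Normal/High/Low order):

def gini(counts):
    # Gini index computed from raw category counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

node1 = [42, 45, 5]       # AGE 63-72
node2 = [175, 32, 61]     # AGE 51-62; 32-50
root = [a + b for a, b in zip(node1, node2)]   # 217, 77, 66 (n = 360)

improvement = gini(root) - 92 / 360 * gini(node1) - 268 / 360 * gini(node2)
print(round(gini(root), 4), round(gini(node1), 4), round(gini(node2), 4))
# -> 0.5573 0.5494 0.5075
print(round(improvement, 4))   # -> 0.0391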
p(t) = Σ_{j=1}^{n} p(j, t)

where:
C(i|j) = the cost of misclassifying a category j case as category i
π(j) = the prior probability of category j
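As a hedged aside, based on the standard CART formulation in Breiman et al. rather than anything preserved on this slide: with these quantities the Gini Index generalizes to a cost-weighted form, gini(t) = Σ_{i≠j} C(i|j)·p(i|t)·p(j|t), which reduces to the earlier 1 − Σ_j p(j|t)² when every misclassification cost equals 1.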
The Twoing Index

The split s (i.e., the input variable corresponding to split s) that produces the greatest difference between the two child nodes will be selected.
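For reference, the Twoing criterion itself, taken from the standard CART formulation in Breiman et al. (supplied here as an assumption, since the slide only states the selection rule): choose the split s maximizing Φ(s, t) = (pL·pR/4)·[Σ_j |p(j|tL) − p(j|tR)|]², which is largest when the class proportions in the two child nodes differ the most.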
Least-Squared Deviation

The split s (i.e., the input variable corresponding to split s) that accomplishes the greatest improvement will be selected.
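A short sketch of this measure for a continuous target, assuming the standard least-squared-deviation form R(t) = (1/N_t)·Σ_i (y_i − ȳ(t))² and the same improvement pattern as the Gini case (the function names are illustrative):

def lsd(y):
    # R(t): mean squared deviation of the target around the node mean.
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)

def lsd_improvement(y, y_left, y_right):
    # Improvement of a split = R(t) - pL * R(tL) - pR * R(tR)
    p_left, p_right = len(y_left) / len(y), len(y_right) / len(y)
    return lsd(y) - p_left * lsd(y_left) - p_right * lsd(y_right)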
Pruning
After splitting stops because one of the stopping criteria has been met, the next step is pruning:

Prune the tree by cutting off weak branches. A weak branch is one with a high misclassification rate, as measured on validation data. Pruning the full tree will increase the overall error rate on the training set, but the reduced tree will generally provide better predictive power for unseen records. Prune first the branches that provide the least additional predictive power.
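A minimal sketch of this weak-branch pruning in Python, assuming the nested-dict tree built by the earlier growing sketch and a held-out validation set. This is reduced-error style pruning, not CART's exact cost-complexity procedure, and prune/predict are illustrative names.

from collections import Counter

def predict(node, row):
    # Walk from this node down to a leaf and return the leaf's label.
    while not node["leaf"]:
        f, thr = node["split"]
        node = node["left"] if row[f] <= thr else node["right"]
    return node["label"]

def prune(node, X_val, y_val):
    # Bottom-up: collapse a subtree into a leaf whenever the leaf would
    # misclassify no more validation cases than the subtree does.
    if node["leaf"] or not y_val:
        return node
    f, thr = node["split"]
    left = [k for k, row in enumerate(X_val) if row[f] <= thr]
    right = [k for k, row in enumerate(X_val) if row[f] > thr]
    node["left"] = prune(node["left"],
                         [X_val[k] for k in left], [y_val[k] for k in left])
    node["right"] = prune(node["right"],
                          [X_val[k] for k in right], [y_val[k] for k in right])
    subtree_errors = sum(predict(node, row) != lab
                         for row, lab in zip(X_val, y_val))
    # For simplicity, label the collapsed leaf by the validation majority.
    leaf_label = Counter(y_val).most_common(1)[0][0]
    leaf_errors = sum(lab != leaf_label for lab in y_val)
    if leaf_errors <= subtree_errors:      # weak branch: cut it off
        return {"leaf": True, "label": leaf_label}
    return node

Working bottom-up means each subtree has already been shrunk before its parent is judged, so the weakest structure is removed first.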
Missing Values
CHAID and Exhaustive CHAID handle missing values by treating them as another valid category. CART instead uses the surrogate-split technique: a case whose value for the primary split variable is missing is routed by a surrogate split, a split on another input variable that closely mimics the primary one.
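A minimal sketch of the surrogate idea, assuming numeric inputs with None marking a missing value; surrogate_split and its arguments are illustrative names. Among the other input variables, it picks the split that agrees most often with the primary split on cases where both variables are observed, so that cases missing the primary variable can be routed by the surrogate.

def surrogate_split(X, goes_left, primary_feature):
    # goes_left[k] is True/False as assigned by the primary split,
    # or None when the primary variable is missing for case k.
    best_agree, best = 0, None
    for f in range(len(X[0])):
        if f == primary_feature:
            continue
        for thr in {row[f] for row in X if row[f] is not None}:
            pairs = [(row[f] <= thr, gl) for row, gl in zip(X, goes_left)
                     if row[f] is not None and gl is not None]
            # Count agreement; also consider the direction-flipped surrogate.
            same = sum(a == b for a, b in pairs)
            flipped = len(pairs) - same
            if max(same, flipped) > best_agree:
                best_agree = max(same, flipped)
                best = (f, thr, flipped > same)  # last field: flip direction?
    return best

CART actually ranks several surrogates by this kind of agreement and falls back through them in order, so a case missing both the primary variable and the first surrogate can still be routed.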