Artificial Intelligence: Decision Trees
Outline
Introduction: Data Flood
Top-Down Decision Tree Construction
Choosing the Splitting Attribute
Information Gain and Gain Ratio
Pruning
Data Growth
Large database examples as of 2003:
France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
Alexa internet archive: 7 years of data, 500 TB
Google searches 3.3 billion pages, ? TB
Example application areas:
Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care
Web: search engines, bots
Government: law enforcement, profiling tax cheaters, anti-terror(?)
Related Fields
[Diagram of related fields: Machine Learning, Visualization, Statistics, Databases.]
See www.crisp-dm.org for more information.
Decision Tree
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
A new case is classified by following the matching path to a leaf node.
Weather data:

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Note: Outlook here is the weather forecast; no relation to the Microsoft email program.
The tree for the weather data:

Outlook?
  sunny    -> Humidity?
                high   -> No
                normal -> Yes
  overcast -> Yes
  rain     -> Windy?
                true  -> No
                false -> Yes
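The tree above is easy to encode directly. A minimal Python sketch (the nested-dict encoding and the classify helper are our own illustration, not from any particular library):

    # Internal nodes test an attribute; branches are attribute values;
    # leaves are plain class labels.
    tree = {"Outlook": {
        "sunny":    {"Humidity": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain":     {"Windy": {"true": "No", "false": "Yes"}},
    }}

    def classify(node, case):
        # Follow the matching path from the root until we reach a leaf.
        while isinstance(node, dict):
            attribute = next(iter(node))             # attribute tested at this node
            node = node[attribute][case[attribute]]  # take the matching branch
        return node

    print(classify(tree, {"Outlook": "sunny", "Humidity": "high"}))  # -> No

The loop walks exactly one root-to-leaf path, as described above.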
Computing Information
Information is measured in bits.
Given a probability distribution, the information required to predict an event is the distribution's entropy.
Entropy gives the information required in bits (this can involve fractions of bits!).
Entropy
Formula for computing the entropy (logs are base 2):

entropy(p1, p2, ..., pn) = - p1 log(p1) - p2 log(p2) - ... - pn log(pn)
entropy(0.5, 0.5) = - 0.5 log(0.5) - 0.5 log(0.5) = 0.5 * 1 * 2 = 1 bit
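The formula is a one-liner in code. A minimal Python sketch (the entropy function name is ours) that reproduces the value above and the coin-flip table below:

    import math

    def entropy(*probs):
        # Entropy in bits; zero-probability terms are skipped (0 * log 0 -> 0).
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy(0.5, 0.5))    # 1.0 bit
    print(entropy(0.25, 0.75))  # ~0.81 bits
    print(entropy(0.1, 0.9))    # ~0.47 bits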
Entropy of a coin flip:

P(head)  P(tail)  Entropy
0.5      0.5      1.00
0.25     0.75     0.81
0.1      0.9      0.47
Entropy of a split
Information in a split with x items of one class and y items of the second class:

info([x, y]) = entropy(x/(x+y), y/(x+y))
             = - (x/(x+y)) log(x/(x+y)) - (y/(x+y)) log(y/(x+y))
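The same idea over raw class counts, as a short Python sketch (the info helper mirrors the slide's notation and also handles more than two classes); the printed values reappear on the next slides:

    import math

    def info(counts):
        # info([x, y]) = entropy(x/(x+y), y/(x+y)), in bits; zero counts skipped.
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(info([4, 0]))  # 0.0 bits (a pure split)
    print(info([3, 2]))  # ~0.971 bits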
Outlook = Overcast: a 4/0 split

info([4,0]) = entropy(1, 0) = - 1 log(1) - 0 log(0) = 0 bits

Note: log(0) is not defined, but we evaluate 0 * log(0) as zero.
Outlook = Rainy: a 3/2 split

info([3,2]) = entropy(3/5, 2/5) = - (3/5) log(3/5) - (2/5) log(2/5) = 0.971 bits
Expected Information
Expected information for the attribute (the weighted average over its branches):

info([3,2], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits
gain("gain
Outlook"
0.247 bits from
Information
for) attributes
Temperatur e" ) 0.029 bits
weather gain("
data:
gain(" Humidity") 0.152 bits
gain(" Windy" ) 0.048 bits
Continuing to split: the same procedure is applied recursively to each child node.
Highly-branching attributes
Problematic: attributes with a large number of values (extreme case: ID code).
Subsets are more likely to be pure if there is a large number of values.
Information gain is biased towards choosing attributes with a large number of values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).
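To see the bias concretely, consider adding a unique ID code to the 14-row weather data (9 Yes / 5 No). A small Python sketch (helper name ours):

    import math

    def info(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    # A unique ID splits the 14 rows into 14 pure one-row subsets, so the
    # expected info after the split is 0 and the gain is maximal:
    gain_id = info([9, 5]) - 0.0
    print(round(gain_id, 3))  # 0.940 bits: the highest of any attribute, yet useless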
Gain ratio
Gain ratio: a modification of the information gain that reduces its bias toward highly-branching attributes.
It normalizes the gain by the intrinsic information of the split, which is large when the data is evenly spread across the branches and small when all the data belong to one branch.
Gain ratio (Quinlan '86) normalizes the information gain by:

IntrinsicInfo(S, A) = - Σi (|Si| / |S|) log2(|Si| / |S|), summing over the subsets Si produced by splitting on A

GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
Gain ratios for the weather data:

Attribute    Info   Gain                   Split info             Gain ratio
Outlook      0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577  0.247 / 1.577 = 0.156
Temperature  0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.557  0.029 / 1.557 = 0.019
Humidity     0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000    0.152 / 1.000 = 0.152
Windy        0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985    0.048 / 0.985 = 0.049
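The table's ratios follow directly from the two formulas. A short Python sketch (helper names ours; last digits may differ slightly because the inputs are rounded):

    import math

    def info(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def gain_ratio(gain, branch_sizes):
        # GainRatio = Gain / IntrinsicInfo, where the intrinsic info is the
        # entropy of the distribution of instances over the branches.
        return gain / info(branch_sizes)

    print(round(gain_ratio(0.247, [5, 4, 5]), 3))  # Outlook:     ~0.157
    print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # Temperature: ~0.019
    print(round(gain_ratio(0.152, [7, 7]), 3))     # Humidity:     0.152
    print(round(gain_ratio(0.048, [8, 6]), 3))     # Windy:       ~0.049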
Industrial-strength algorithms
An industrial-strength algorithm must handle numeric attributes, missing values, and noisy data; the following slides address each in turn.
Numeric attributes
Evaluate the info gain of every candidate split point; the info gain of the best split point is taken as the info gain of the attribute.
Weather data with numeric temperature and humidity:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
Example: splitting on the temperature attribute

value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

E.g., the split at temperature < 71.5 gives 4 yes / 2 no on the left and 5 yes / 3 no on the right:

info([4,2], [5,3]) = (6/14) * info([4,2]) + (8/14) * info([5,3]) = 0.939 bits
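The search for the best split point is a few lines of Python (a sketch with our own names; every midpoint between distinct adjacent values is scored, and the lowest expected information corresponds to the highest gain):

    import math
    from collections import Counter

    def info(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # Already sorted by value.
    values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    labels = ["Yes","No","Yes","Yes","Yes","No","No","Yes","Yes","Yes","No","Yes","Yes","No"]

    best = None
    for i in range(1, len(values)):
        if values[i - 1] == values[i]:
            continue  # identical values cannot be separated
        t = (values[i - 1] + values[i]) / 2
        left, right = labels[:i], labels[i:]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(values)
        if best is None or expected < best[0]:
            best = (expected, t)

    print(best)  # -> (~0.827, 84.0); the 71.5 split above costs 0.939 bits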
Speeding up
Entropy only needs to be evaluated between points of different classes (Fayyad & Irani, 1992):

value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Only the boundaries where the class changes are candidate cut points.
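Applied to the sequence above, this observation sharply cuts the number of candidate thresholds. A simplified Python sketch (runs of equal values with mixed classes need slightly more care than this):

    values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    labels = ["Yes","No","Yes","Yes","Yes","No","No","Yes","Yes","Yes","No","Yes","Yes","No"]

    # Keep a candidate threshold only where adjacent, distinct values
    # carry different class labels.
    boundaries = [
        (values[i - 1] + values[i]) / 2
        for i in range(1, len(values))
        if labels[i - 1] != labels[i] and values[i - 1] != values[i]
    ]
    print(boundaries)  # [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]: 6 of 11 midpoints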
Missing values
Missing values are denoted by ? in C4.X.
Simple idea: treat missing as a separate value.
Q: When is this not appropriate?
A: When values are missing for different reasons.
Example: IsPregnant = missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown).
Alternative: split the instance into fractional pieces, one per branch, weighted by the fraction of training instances going down each branch; the weights sum to 1.
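A minimal Python sketch of the weighting, using the Outlook branch sizes from the weather data (C4.5's actual bookkeeping is more involved):

    # An instance with Outlook == '?' is sent down every branch, weighted
    # by each branch's share of the training instances.
    branch_counts = {"sunny": 5, "overcast": 4, "rain": 5}
    total = sum(branch_counts.values())

    weights = {v: n / total for v, n in branch_counts.items()}
    print(weights)                # {'sunny': 0.357..., 'overcast': 0.285..., 'rain': 0.357...}
    print(sum(weights.values()))  # 1.0: the weights sum to 1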
Pruning
Goal: simplify the tree to avoid overfitting the training data.
Prepruning
Early stopping: stop growing a branch when no split shows a statistically significant association with the class.
Post-pruning
First, build the full tree; then, prune it.
Two operations:
1. Subtree replacement
2. Subtree raising
Possible strategies: error estimation, significance testing, the MDL principle.
Subtree replacement
Bottom-up: a subtree is considered for replacement only after all of its own subtrees have been considered.
[Figures: subtree replacement, steps 1-3.]
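A minimal bottom-up sketch of subtree replacement in Python (our own simplification: it compares raw training-error counts, where C4.5 compares pessimistic error estimates):

    class Node:
        def __init__(self, children=None, majority=None, leaf_errors=0):
            self.children = children or {}  # branch value -> child Node
            self.majority = majority        # majority class at this node
            self.leaf_errors = leaf_errors  # errors if this node became a leaf

    def subtree_errors(node):
        # Training errors made by the subtree as it stands.
        if not node.children:
            return node.leaf_errors
        return sum(subtree_errors(c) for c in node.children.values())

    def prune(node):
        # Bottom-up: prune the children first, then consider this node.
        for child in node.children.values():
            prune(child)
        if node.children and node.leaf_errors <= subtree_errors(node):
            node.children = {}  # replace the subtree by a majority-class leaf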
C4.5's method
Error estimates for subtrees are derived from the training data using pessimistic (confidence-based) estimates.
Rules from trees
First, derive a rule set from the tree (one rule per leaf). Then:
look at each class in turn
consider the rules for that class
find a good subset (guided by MDL)
Summary
Decision trees:
splits: binary or multi-way
split criteria: information gain, gain ratio
missing value treatment
pruning
rule extraction from trees