Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Chapter 3:
Supervised Learning
An example application
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about the applicant:
age
marital status
annual salary
outstanding debts
credit rating
etc.
Decision: approved or not.
An example
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after learning, its performance on T improves as measured by M.
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Introduction
Information theory
Information theory provides a mathematical
basis for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
If one already has a good guess about the answer,
then the actual answer is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
$\mathrm{entropy}(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\,\log_2 \Pr(c_j)$

where $\sum_{j=1}^{|C|} \Pr(c_j) = 1$.
Information gain
Given a set of examples D, we first compute its
entropy:
An example
$\mathrm{entropy}(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$

Partitioning D on an attribute with two values into D1 (6 examples, all of one class) and D2 (9 examples):

$\frac{6}{15}\,\mathrm{entropy}(D_1) + \frac{9}{15}\,\mathrm{entropy}(D_2) = \frac{6}{15}\times 0 + \frac{9}{15}\times 0.918 = 0.551$

Partitioning D on Age into D1 (young), D2 (middle) and D3 (old):

Age      Yes   No   entropy(Di)
young     2     3     0.971
middle    3     2     0.971
old       4     1     0.722

$\mathrm{entropy}_{Age}(D) = \frac{5}{15}\,\mathrm{entropy}(D_1) + \frac{5}{15}\,\mathrm{entropy}(D_2) + \frac{5}{15}\,\mathrm{entropy}(D_3) = \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.722 = 0.888$
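A minimal Python sketch of these calculations, using the counts from the table above (the helper names are ours):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Whole data set D: 6 examples of one class and 9 of the other.
H_D = entropy([6, 9])                      # 0.971

# Partition induced by Age: (Yes, No) counts per attribute value.
age_partition = [(2, 3), (3, 2), (4, 1)]   # young, middle, old
n = sum(a + b for a, b in age_partition)   # 15

H_age = sum((a + b) / n * entropy([a, b]) for a, b in age_partition)
gain_age = H_D - H_age                     # 0.971 - 0.888 = 0.083
print(round(H_D, 3), round(H_age, 3), round(gain_age, 3))
```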
An example
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Evaluation methods
Holdout set: the available data set D is divided into two disjoint subsets, a training set used to learn the model and a test set used to evaluate it.
Criteria for evaluating classification methods include:
Predictive accuracy
Efficiency
time to construct the model
time to use the model
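A minimal sketch of holdout evaluation (the scikit-learn calls are real library functions; the 70/30 split and the decision-tree learner are just assumed choices for illustration):

```python
# Holdout evaluation: split D into disjoint training and test sets,
# learn a model on the training part, measure accuracy on the held-out part.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def holdout_accuracy(X, y, test_size=0.3, seed=0):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```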
Classification measures
$p = \dfrac{TP}{TP + FP}, \qquad r = \dfrac{TP}{TP + FN}$

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.
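A small Python sketch of these two measures (the function name is ours):

```python
def precision_recall(tp, fp, fn):
    """Precision p = TP/(TP+FP) and recall r = TP/(TP+FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Example: 80 true positives, 20 false positives, 40 false negatives.
print(precision_recall(80, 20, 40))   # (0.8, 0.666...)
```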
An example
Lift curve
The test cases are ranked by the classifier's predicted probability of being positive and divided into 10 equal-sized bins. For each bin, the table shows the number of positive cases it contains, its percentage of all positives, and the cumulative percentage:

Bin   Positive cases   % of positives   Cumulative %
 1        210              42%              42%
 2        120              24%              66%
 3         60              12%              78%
 4         40               8%              86%
 5         22               4.4%            90.4%
 6         18               3.6%            94%
 7         12               2.4%            96.4%
 8          7               1.4%            97.8%
 9          6               1.2%            99%
10          5               1%             100%

(Figure: lift curve plotting the cumulative percentage of positive cases (0-100%) against the percentage of test cases (0-100%), with the "lift" curve compared against the "random" diagonal.)
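A small sketch of how the per-bin and cumulative percentages in the table can be computed from the positive counts per bin (the counts are the ones above; the function name is ours):

```python
def cumulative_lift(bin_positives):
    """Per-bin % of positives and cumulative % given positives per bin."""
    total = sum(bin_positives)
    cum, rows = 0, []
    for b, p in enumerate(bin_positives, start=1):
        cum += p
        rows.append((b, p, 100 * p / total, 100 * cum / total))
    return rows

bins = [210, 120, 60, 40, 22, 18, 12, 7, 6, 5]   # 500 positives in total
for b, p, pct, cum_pct in cumulative_lift(bins):
    print(f"bin {b:2d}: {p:3d} positives, {pct:5.2f}%, cumulative {cum_pct:6.2f}%")
```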
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary
Introduction
Sequential covering
Learn one rule at a time, sequentially.
After a rule is learned, the training examples
covered by the rule are removed.
Only the remaining data are used to find
subsequent rules.
The process repeats until some stopping
criteria are met.
Note: a rule covers an example if the example
satisfies the conditions of the rule.
We introduce two specific algorithms.
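A minimal sketch of the generic sequential-covering loop described above (learn_one_rule, covers and the stopping test are placeholders for the algorithm-specific parts):

```python
def sequential_covering(data, learn_one_rule, covers, good_enough):
    """Generic sequential covering: learn rules one at a time and remove
    the training examples each new rule covers."""
    rules = []
    remaining = list(data)
    while remaining:
        rule = learn_one_rule(remaining)          # find one good rule
        if rule is None or not good_enough(rule, remaining):
            break                                 # stopping criterion met
        rules.append(rule)
        # keep only the examples NOT covered by the new rule
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules
```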
Learn-one-rule-1 function
Differences:
Algorithm 2: Rules of the same class are found
together. The classes are ordered. Normally,
minority class rules are found first.
Algorithm 1: In each iteration, a rule of any class
may be found. Rules are ordered according to the
sequence they are found.
Learn-one-rule-1 algorithm
Learn-one-rule-2 function
Learn-one-rule-2 algorithm
A rule that covers p positive and n negative examples has precision p/(p+n).
Let R be the current rule, covering p0 positive and n0 negative examples, and R+ the candidate rule obtained by adding one condition, covering p1 positive and n1 negative examples. The gain of the added condition is

$\mathrm{gain}(R, R^{+}) = p_1 \left( \log_2 \dfrac{p_1}{p_1 + n_1} - \log_2 \dfrac{p_0}{p_0 + n_0} \right)$
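A tiny numeric sketch of this gain computation (the counts are made-up illustration values):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """Gain of extending rule R (covering p0 pos / n0 neg examples)
    to R+ (covering p1 pos / n1 neg examples)."""
    if p1 == 0:
        return 0.0
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Illustration: R covers 20 pos / 20 neg; adding a condition leaves 15 pos / 5 neg.
print(foil_gain(20, 20, 15, 5))   # 15 * (log2(0.75) - log2(0.5)) ~= 8.77
```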
Discussions
Road Map
Three approaches
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Building classifiers
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Bayesian classification
$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \dfrac{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j)\,\Pr(C=c_j)}{\sum_{r=1}^{|C|}\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_r)\,\Pr(C=c_r)}$

We predict the class $c_j$ for which this posterior probability is maximal.
Computing probabilities
Assume all attributes are conditionally independent given the class (the naïve Bayes assumption):

$\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j) = \prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$

Substituting this into the Bayes formula gives

$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \dfrac{\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)}{\sum_{r=1}^{|C|}\Pr(C=c_r)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_r)}$

We are done! How do we estimate $\Pr(A_i=a_i \mid C=c_j)$? Easy! It is the fraction of the class-$c_j$ training examples that have $A_i=a_i$. Since the denominator is the same for every class, a test example is simply assigned the class $c_j$ that maximizes

$\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$
An example
An example (cont.)
For C = t, we have
$\Pr(C=t)\prod_{j}\Pr(A_j=a_j \mid C=t) = \frac{1}{2}\times\frac{2}{5}\times\frac{2}{5} = \frac{2}{25}$
For C = f, we have
$\Pr(C=f)\prod_{j}\Pr(A_j=a_j \mid C=f) = \frac{1}{2}\times\frac{1}{5}\times\frac{2}{5} = \frac{1}{25}$
C = t is more probable, so t is the predicted class.
Additional issues
Zero counts are a problem: a single zero probability wipes out the whole product. Smoothing the estimates avoids this, e.g.

$\Pr(A_i=a_i \mid C=c_j) = \dfrac{n_{ij} + \lambda}{n_j + \lambda m_i}$

where $n_j$ is the number of class-$c_j$ training examples, $n_{ij}$ the number of them with $A_i=a_i$, $m_i$ the number of values of attribute $A_i$, and $\lambda$ a small constant ($\lambda=1$ gives Laplace smoothing).
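A compact sketch of naïve Bayes training and prediction with this smoothing, written for categorical attributes (all names are ours; λ defaults to 1):

```python
from collections import Counter, defaultdict

def train_nb(examples, lam=1.0):
    """examples: list of (attribute_tuple, class_label)."""
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)          # (attr index, class) -> value counts
    values = defaultdict(set)                    # attr index -> set of seen values
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            value_counts[(i, c)][a] += 1
            values[i].add(a)
    return class_counts, value_counts, values, lam, len(examples)

def predict_nb(model, attrs):
    class_counts, value_counts, values, lam, n = model
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                            # Pr(C = c)
        for i, a in enumerate(attrs):             # smoothed Pr(A_i = a | C = c)
            score *= (value_counts[(i, c)][a] + lam) / (nc + lam * len(values[i]))
        if score > best_score:
            best, best_score = c, score
    return best
```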
Advantages:
Easy to implement
Very efficient
Good results obtained in many applications
Disadvantages:
It assumes class-conditional independence, so accuracy suffers when the assumption is seriously violated (i.e., on data sets with highly correlated attributes).
Road Map
Text classification/categorization
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Probabilistic framework
Mixture model
An example
Document generation
$\Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|} \Pr(c_j \mid \Theta)\,\Pr(d_i \mid c_j;\Theta)$   (23)
Multinomial distribution
With the assumptions, each document can be
regarded as generated by a multinomial
distribution.
In other words, each document is drawn from
a multinomial distribution of words with as
many independent trials as the length of the
document.
The words are from a given vocabulary V = {w1, w2, ..., w|V|}.
Parameter estimation
The parameters are estimated from the training documents D. The word probabilities are estimated by counting word occurrences in the documents of each class:

$\Pr(w_t \mid c_j;\hat\Theta) = \dfrac{\sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}$   (24)

where $N_{ti}$ is the number of times word $w_t$ occurs in document $d_i$, and

$\sum_{t=1}^{|V|}\Pr(w_t \mid c_j;\Theta) = 1, \qquad \sum_{t=1}^{|V|} N_{ti} = |d_i|.$   (25)

To avoid zero probabilities for words that never occur with a class, Lidstone (Laplace) smoothing with a constant $\lambda$ is applied:

$\Pr(w_t \mid c_j;\hat\Theta) = \dfrac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}$   (27)

The class prior probabilities are estimated by

$\Pr(c_j \mid \hat\Theta) = \dfrac{\sum_{i=1}^{|D|}\Pr(c_j \mid d_i)}{|D|}$   (28)

For labeled training data, $\Pr(c_j \mid d_i)$ is 1 if $d_i$ belongs to class $c_j$ and 0 otherwise.
Classification
Given a test document $d_i$, from Eqs. (23), (27) and (28):

$\Pr(c_j \mid d_i;\hat\Theta) = \dfrac{\Pr(c_j \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\hat\Theta)}{\sum_{r=1}^{|C|}\Pr(c_r \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_r;\hat\Theta)}$

where $w_{d_i,k}$ is the word in position $k$ of $d_i$. The class with the highest $\Pr(c_j \mid d_i;\hat\Theta)$ is assigned to the document.
Discussions
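A short sketch of multinomial naïve Bayes text classification using Eqs. (27) and (28) with λ = 1 (log probabilities are used for numerical stability; the data structures and names are ours):

```python
import math
from collections import Counter, defaultdict

def train_text_nb(docs, lam=1.0):
    """docs: list of (list_of_words, class_label)."""
    vocab = {w for words, _ in docs for w in words}
    class_docs = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)             # class -> word -> count
    total_words = Counter()                        # class -> total word count
    for words, c in docs:
        word_counts[c].update(words)
        total_words[c] += len(words)
    prior = {c: n / len(docs) for c, n in class_docs.items()}           # Eq. (28)
    def word_prob(w, c):                                                # Eq. (27)
        return (lam + word_counts[c][w]) / (lam * len(vocab) + total_words[c])
    return prior, word_prob

def classify(model, words):
    prior, word_prob = model
    scores = {c: math.log(p) + sum(math.log(word_prob(w, c)) for w in words)
              for c, p in prior.items()}
    return max(scores, key=scores.get)
```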
Road Map
Introduction
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
The hyperplane
Basic concepts
$y_i = \begin{cases} 1 & \text{if } \langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \ge 0 \\ -1 & \text{if } \langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b < 0 \end{cases}$
The distance from an example $\mathbf{x}_i$ to the hyperplane $\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = 0$ is

$\dfrac{|\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b|}{\|\mathbf{w}\|}$   (36)

The separating hyperplane is rescaled so that

$\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \ge 1$ for $y_i = 1$, and $\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \le -1$ for $y_i = -1$.   (37)

For a support vector $\mathbf{x}_s$ on the margin boundary,

$d_{+} = \dfrac{|\langle\mathbf{w}\cdot\mathbf{x}_s\rangle + b|}{\|\mathbf{w}\|} = \dfrac{1}{\|\mathbf{w}\|}$   (38)

so the margin is

$\mathrm{margin} = d_{+} + d_{-} = \dfrac{2}{\|\mathbf{w}\|}$   (39)

An optimization problem! Maximizing the margin is equivalent to:

Minimize: $\dfrac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2}$
Subject to: $y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1,\; i = 1, 2, \ldots, r$   (40)

where the constraint in (40) summarizes the two constraints in (37). The Lagrangian of this problem is

$L_P = \dfrac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle - \sum_{i=1}^{r}\alpha_i\,[\,y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1\,]$   (41)

where the $\alpha_i \ge 0$ are Lagrange multipliers.
Kuhn-Tucker conditions
Dual formulation
$L_D = \sum_{i=1}^{r}\alpha_i - \dfrac{1}{2}\sum_{i,j=1}^{r} y_i y_j \alpha_i \alpha_j \langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle$   (55)
The decision boundary can be written in terms of the support vectors:

$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0$   (57)

and a test example $\mathbf{z}$ is classified by

$\mathrm{sign}(\langle\mathbf{w}\cdot\mathbf{z}\rangle + b) = \mathrm{sign}\Big(\sum_{i\in sv}\alpha_i y_i\langle\mathbf{x}_i\cdot\mathbf{z}\rangle + b\Big)$   (58)
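A tiny sketch of the decision rule in Eq. (58), assuming the multipliers alpha, labels y, support vectors and bias b have already been obtained from a solver (all variable names are ours):

```python
def svm_decision(z, support_vectors, y, alpha, b):
    """Classify z with Eq. (58): sign of sum_i alpha_i * y_i * <x_i, z> + b."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    score = sum(a * yi * dot(xi, z)
                for a, yi, xi in zip(alpha, y, support_vectors)) + b
    return 1 if score >= 0 else -1
```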
Minimize: $\dfrac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2}$
Subject to: $y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1,\; i = 1, 2, \ldots, r$
Geometric interpretation
Two error data points $\mathbf{x}_a$ and $\mathbf{x}_b$ (circled) lie in the wrong regions.
Kuhn-Tucker conditions
With slack variables $\xi_i \ge 0$, the soft-margin formulation is

Minimize: $\dfrac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} + C\sum_{i=1}^{r}\xi_i$   (61)
Subject to: $y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1 - \xi_i,\; i = 1, 2, \ldots, r$
$\xi_i \ge 0,\; i = 1, 2, \ldots, r$   (62)

and its Lagrangian is

$L_P = \dfrac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle + C\sum_{i=1}^{r}\xi_i - \sum_{i=1}^{r}\alpha_i\,[\,y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1 + \xi_i\,] - \sum_{i=1}^{r}\mu_i\xi_i$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are Lagrange multipliers.
Dual
For a support vector $\mathbf{x}_j$ with $0 < \alpha_j < C$ (so that $\xi_j = 0$), the bias follows from $y_j(\langle\mathbf{w}\cdot\mathbf{x}_j\rangle + b) = 1$:

$b = \dfrac{1}{y_j} - \sum_{i=1}^{r} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle$   (73)
$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i=1}^{r} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0$   (75)

(74) shows a very important property of SVM.
Space transformation
The basic idea is to map the data in the input space X to a feature space F via a nonlinear mapping $\phi$,

$\phi: X \to F$
$\mathbf{x} \mapsto \phi(\mathbf{x})$   (76)

After the mapping, the original training data set {(x1, y1), (x2, y2), ..., (xr, yr)} becomes:

{($\phi$(x1), y1), ($\phi$(x2), y2), ..., ($\phi$(xr), yr)}   (77)
Geometric interpretation
Suppose the mapping is
$(x_1, x_2) \mapsto (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2)$
The training example ((2, 3), -1) in the input space is transformed to the following in the feature space:
((4, 9, 8.5), -1)
Kernel functions
Polynomial kernel:

$K(\mathbf{x}, \mathbf{z}) = \langle\mathbf{x}\cdot\mathbf{z}\rangle^{d}$   (83)

Let us compute the kernel with degree d = 2 in a 2-dimensional space: x = (x1, x2) and z = (z1, z2).

$\langle\mathbf{x}\cdot\mathbf{z}\rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \langle(x_1^2, x_2^2, \sqrt{2}x_1x_2)\cdot(z_1^2, z_2^2, \sqrt{2}z_1z_2)\rangle = \langle\phi(\mathbf{x})\cdot\phi(\mathbf{z})\rangle$   (84)

This shows that the kernel $\langle\mathbf{x}\cdot\mathbf{z}\rangle^2$ is a dot product in a transformed feature space.
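A quick numeric sketch checking Eq. (84): the degree-2 polynomial kernel equals the dot product after the mapping φ above (the function names and test vectors are ours):

```python
from math import sqrt

def phi(x1, x2):
    """Feature mapping (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return (x1 * x1, x2 * x2, sqrt(2) * x1 * x2)

def poly_kernel(x, z, d=2):
    return (x[0] * z[0] + x[1] * z[1]) ** d

x, z = (2.0, 3.0), (1.0, -1.0)
lhs = poly_kernel(x, z)                                    # <x . z>^2
rhs = sum(a * b for a, b in zip(phi(*x), phi(*z)))         # <phi(x) . phi(z)>
print(lhs, rhs)    # both equal 1.0 here
```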
Kernel trick
Is it a kernel function?
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
kNN algorithm
A new point: what is Pr(science | new point)?
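A compact sketch of the kNN idea on this question: estimate Pr(class | new point) as the fraction of the k nearest training points that belong to that class (the Euclidean distance and the default k are our own choices):

```python
import math
from collections import Counter

def knn_class_prob(train, new_point, target_class, k=5):
    """train: list of (vector, label). Returns the fraction of the k nearest
    neighbours of new_point whose label is target_class."""
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    neighbours = sorted(train, key=lambda ex: dist(ex[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes[target_class] / k
```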
Discussions
Road Map
Combining classifiers
Bagging (Breiman, 1996)
Boosting
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Bagging (cont.)
Training
Create k bootstrap samples S[1], S[2], ..., S[k] by sampling from the original training set with replacement.
Build a distinct classifier on each S[i] to produce k classifiers, using the same learning algorithm.
Testing
Classify a test instance by letting the k classifiers vote; the majority class is the final prediction.
Bagging example
(Figure: an original training set and four bootstrap training sets, Training set 1 to Training set 4, sampled from it.)
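A minimal bagging sketch matching the description above (bootstrap sampling plus majority vote; base_learner is assumed to be any function that fits a classifier on a data set and returns a predict function — all names are ours):

```python
import random
from collections import Counter

def bagging_train(data, base_learner, k=10, seed=0):
    """data: list of (x, y). Returns k classifiers trained on bootstrap samples."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(len(data))]   # sample with replacement
        classifiers.append(base_learner(sample))
    return classifiers

def bagging_predict(classifiers, x):
    votes = Counter(clf(x) for clf in classifiers)              # majority vote
    return votes.most_common(1)[0][0]
```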
Boosting
A family of methods:
We only study AdaBoost (Freund & Schapire, 1996)
Training
Produce a sequence of classifiers (the same base
learner)
Each classifier is dependent on the previous one and focuses on the previous one's errors.
Examples that are incorrectly predicted by previous classifiers are given higher weights.
Testing
For a test case, the results of the series of
classifiers are combined to determine the final
class of the test case.
AdaBoost
AdaBoost algorithm
The base learner is called a weak classifier.
At each round t:
Maintain a weighted training set (non-negative weights that sum to 1).
Build a classifier ht whose accuracy on the weighted training set is better than chance.
Change the weights: increase the weights of the examples that ht misclassifies.
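A sketch of the AdaBoost loop just described (AdaBoost.M1-style weight updates for ±1 labels; weak_learner(data, weights) is assumed to return a classifier function — all names are ours):

```python
import math

def adaboost_train(data, weak_learner, rounds=10):
    """data: list of (x, y) with y in {-1, +1}. Returns [(alpha_t, h_t), ...]."""
    n = len(data)
    w = [1.0 / n] * n                       # non-negative weights summing to 1
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(data, w)           # train on the weighted set
        err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
        if err >= 0.5 or err == 0.0:        # weak learner must beat chance
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # increase weights of misclassified examples, decrease the rest, renormalize
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```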
(Figures: Bagged C4.5 vs. C4.5; Boosted C4.5 vs. C4.5; Boosting vs. Bagging.)
Road Map
Summary
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary