Advanced Topics in Pattern Recognition
http://engineering.rowan.edu/~polikar/CLASSES/ECE555
Week 9: AdaBoost & Learn++
AdaBoost
AdaBoost.M1
AdaBoost.M2
AdaBoost.R
Learn++
Graphical drawings from: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001.
© 2005, Robi Polikar, Rowan University, Glassboro, NJ
This Week in PR
AdaBoost
Algorithm AdaBoost
- Create a discrete distribution over the training data by assigning a weight to each instance. Initially, the distribution is uniform, hence all weights are the same.
- Draw a subset from this distribution and train a weak classifier with this dataset.
- Compute the error, εt, of this classifier on its own training dataset. Make sure that this error is less than 1/2.
- Test the entire training data on this classifier:
  - If an instance x is correctly classified, reduce its weight in proportion to βt = εt / (1 − εt).
  - If it is misclassified, increase its relative weight in proportion to 1/βt.
- Normalize the weights such that they constitute a distribution.
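As a quick worked instance of this update rule (the multiplicative factor βt = εt / (1 − εt) follows the standard AdaBoost.M1 formulation):

$$\epsilon_t = 0.2 \quad\Rightarrow\quad \beta_t = \frac{0.2}{1 - 0.2} = 0.25$$

so every correctly classified instance has its weight multiplied by 0.25, misclassified instances keep their weights, and the subsequent normalization raises the relative weight of the misclassified ones.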
AdaBoost.M1

[Figure: AdaBoost.M1 block diagram. At each iteration t, a training subset St is drawn from the training data according to the current distribution Dt; classifier Ct is trained on St and produces hypothesis ht. The normalized error βt of ht drives the distribution update, and the final ensemble decision combines h1, ..., hT by weighted majority voting with voting weights log(1/βt).]
AdaBoost.M1

Algorithm AdaBoost.M1
Input: sequence of N examples S = {(x1, y1), ..., (xN, yN)}; weak learning algorithm WeakLearn; integer T specifying the number of iterations.
Initialize D1(i) = 1/N for all i.
Do for t = 1, 2, ..., T:
1. Draw a training subset St according to the distribution Dt and call WeakLearn, obtaining the hypothesis ht.
2. Compute the error of ht:
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
   If εt > 1/2, abort.
3. Set βt = εt / (1 − εt).
4. Update the distribution:
   $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t, & \text{if } h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$
   where Zt is a normalization constant chosen so that Dt+1 is a distribution.
5. Output the final hypothesis:
   $$h_{final}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}$$
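A minimal Python sketch of this loop, assuming scikit-learn decision stumps as the weak learner and weighted fitting in place of explicit resampling (both are illustrative choices, not part of the original algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T=25):
    """Minimal AdaBoost.M1 sketch with depth-1 trees as weak learners."""
    N = len(y)
    D = np.full(N, 1.0 / N)                  # D1: uniform distribution
    hyps, votes = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()                  # weighted training error
        if eps <= 0 or eps >= 0.5:           # M1 requires 0 < eps < 1/2
            break
        beta = eps / (1.0 - eps)
        D[~miss] *= beta                     # shrink correctly classified
        D /= D.sum()                         # renormalize (Z_t)
        hyps.append(h)
        votes.append(np.log(1.0 / beta))     # voting weight log(1/beta_t)
    return hyps, votes

def h_final(hyps, votes, X, classes):
    """Weighted majority: argmax_y of sum over {t: h_t(x)=y} of log(1/beta_t)."""
    tally = np.zeros((len(X), len(classes)))
    for h, v in zip(hyps, votes):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            tally[pred == c, j] += v
    return np.asarray(classes)[tally.argmax(axis=1)]
```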
Weighted Majority Voting

Problem: How many hours a day should our students spend working on homework?

[Figure: five experts each give an estimate (7 and 3 among them); a weight assigner gives them voting weights 0.3, 0.2, 0.25, 0.15 and 0.1, and the weighted majority combination of their answers is 5.96 ≈ 6 hours.]
You Betcha!

The training error of AdaBoost can be shown to satisfy

$$E < \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}$$

where εt is the error of the t-th hypothesis. Since each εt < 1/2, every factor is less than 1, so this product gets smaller and smaller with each added classifier.

But wait: isn't this against Occam's razor?
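To see how fast this bound shrinks, suppose every weak learner achieves εt = 0.3 (a hypothetical but representative value). Then each factor is

$$2\sqrt{0.3 \times 0.7} \approx 0.917$$

and after T = 50 classifiers the bound is 0.917^50 ≈ 0.013: the training error is driven toward zero exponentially fast.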
What to expect:
- training error to decrease with the number of classifiers
- generalization error to increase after a while (overfitting)

What's observed:
- generalization error does not increase, even after many, many iterations
- in fact, it even decreases after the training error reaches zero!

[Figure: training and test error curves on the letters database; from R. Schapire, http://www.cs.princeton.edu/~schapire/]

Is Occam's razor of "simpler is better" wrong? Violated?
The margin of an instance x roughly describes the confidence of the ensemble in its decision. Loosely speaking, the margin of an instance is simply the difference between the total (or fraction of) vote(s) it receives from correctly identifying classifiers and the maximum (or fraction of) vote(s) it receives for any incorrect class:

$$m(x) = \mu_k(x) - \max_{j \neq k} \mu_j(x)$$

where the k-th class is the true class, and μj(x) is the total support (vote) class j receives from all classifiers, normalized such that

$$\sum_{j} \mu_j(x) = 1$$

The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
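A small sketch of this computation (the class supports are assumed to arrive as normalized vote fractions per instance):

```python
import numpy as np

def margins(support, true_class):
    """support: (N, C) array of normalized votes (rows sum to 1);
    true_class: (N,) array of true class indices."""
    N = len(true_class)
    true_votes = support[np.arange(N), true_class]
    rivals = support.copy()
    rivals[np.arange(N), true_class] = -np.inf   # exclude the true class
    return true_votes - rivals.max(axis=1)

# Example: 3 classes; instance 0 is confidently correct, instance 1 is wrong
support = np.array([[0.7, 0.2, 0.1],
                    [0.2, 0.5, 0.3]])
print(margins(support, np.array([0, 2])))        # [ 0.5  -0.2]
```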
Margin Theory

[Figure: cumulative margin distributions on the letters database; as classifiers are added, boosting shifts the training-example margins toward larger values.]
Margin Theory

Schapire et al. show that boosting tends to increase margins on the training data examples, and argue that an ensemble classifier with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble.

More specifically: let H be a finite space of base classifiers. For any θ > 0 and δ > 0, with probability 1 − δ over the random choice of the training data set S, any ensemble E = {h1, ..., hT} ⊆ H combined through weighted majority satisfies

$$P[\text{error}] \le P_S[\, m(x) \le \theta \,] + O\!\left(\sqrt{\frac{\log N \, \log |H|}{N\,\theta^{2}}}\right)$$

N: number of instances
|H|: cardinality of the classifier space; the weaker the classifiers, the smaller |H|

P(error) is independent of the number of classifiers!
AdaBoost.M2
Algorithm AdaBoost.M2
Input: sequence of N examples; weak learning algorithm WeakLearn; integer T specifying the number of iterations.
1. Initialize the distribution over the mislabel pairs: D1(i, y) = 1/(N(|Y| − 1)) for each instance i and each incorrect label y ≠ yi.
2. Call WeakLearn with distribution Dt, obtaining a hypothesis ht : X × Y → [0, 1].
3. Compute the pseudo-loss of ht:
   $$\epsilon_t = \frac{1}{2} \sum_{(i, y)} D_t(i, y)\,\big(1 - h_t(x_i, y_i) + h_t(x_i, y)\big)$$
4. Set βt = εt / (1 − εt).
5. Update distribution Dt:
   $$D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\,\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}$$
   where Zt is a normalization constant chosen so that Dt+1 becomes a distribution function.
Output the final hypothesis:
$$h_{final}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \log\!\left(\frac{1}{\beta_t}\right) h_t(x, y)$$
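A compact numpy sketch of the pseudo-loss and distribution update above; the (N × C) array layout and the convention that entries for the true label carry zero mass are assumptions made for the illustration:

```python
import numpy as np

def m2_update(D, h, y_true):
    """One AdaBoost.M2 round, vectorized.
    D: (N, C) mislabel distribution, zero mass on the true label;
    h: (N, C) hypothesis outputs h_t(x_i, y) in [0, 1];
    y_true: (N,) integer true labels."""
    idx = np.arange(len(y_true))
    h_true = h[idx, y_true][:, None]              # h_t(x_i, y_i) as a column
    eps = 0.5 * (D * (1.0 - h_true + h)).sum()    # pseudo-loss
    beta = eps / (1.0 - eps)
    D_new = D * beta ** (0.5 * (1.0 + h_true - h))
    D_new[idx, y_true] = 0.0                      # true label keeps zero mass
    return D_new / D_new.sum(), beta
```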
Incremental Learning
[Figure: a two-dimensional feature space (Feature 1 vs. Feature 2) in which classifiers C1-C4, trained on Data1 and Data2, carve out different regions of the decision boundary.]
Learn++

Learn++ modifies the distribution update rule to base the update on the decision of the ensemble, not just that of the previous classifier.
Why should this make any difference?
Learn++

[Figure: Learn++ on two data distributions, D1 and D2. Classifiers h1-h8 (C1-C8) are combined by weighted majority voting with their voting weights to produce the learned decision boundary.]
Learn++

[Figure: Learn++ data flow. From each database (Database 1, Database 2, ...), training subsets D1, D2, ..., Dn are drawn and classifiers C1, C2, ..., Cn are trained. Each classifier has an error Et and a performance Pt, and earns a voting weight Wt. After each classifier is added, the composite hypotheses H1, ..., Hn-1 are evaluated; the instances they misclassify drive the distribution update, so later subsets concentrate on the (mis)classified instances.]
Learn++

Algorithm Learn++ (for each new database Sk, k = 1, ..., K):
1. Initialize the instance weights and set the distribution Dt(i) = wt(i) / Σi wt(i) on Sk.
2. Draw a training subset according to Dt, train a weak classifier to obtain hypothesis ht, and compute its error on Sk:
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
   If εt > 1/2, discard ht and redraw; otherwise set βt = εt / (1 − εt).
3. Call weighted majority voting on h1, ..., ht, with voting weights log(1/βt), to obtain the composite hypothesis Ht, and compute the composite error
   $$E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i), \qquad B_t = \frac{E_t}{1 - E_t}$$
4. Update the instance weights based on the ensemble decision:
   $$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases} \;=\; w_t(i)\, B_t^{\,1 - [\, H_t(x_i) \neq y_i \,]}$$
5. Call weighted majority voting on the combined hypotheses Ht and output the final hypothesis:
   $$H_{final}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t:\, H_t(x) = y} \log\frac{1}{B_t}$$
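A sketch of one Learn++ training session in Python, highlighting the key difference from AdaBoost: the instance weights are updated according to the composite hypothesis Ht rather than the latest ht. Depth-3 trees and weighted fitting are illustrative choices, not part of the original algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_majority(ensemble, weights, X):
    """Weighted majority vote over all classifiers generated so far."""
    classes = np.unique(np.concatenate([h.classes_ for h in ensemble]))
    votes = np.zeros((len(X), len(classes)))
    for h, wt in zip(ensemble, weights):
        preds = h.predict(X)
        votes[np.arange(len(X)), np.searchsorted(classes, preds)] += wt
    return classes[votes.argmax(axis=1)]

def learnpp_session(X, y, ensemble, log_inv_betas, T=6):
    """One Learn++ session on a new database (X, y); the ensemble and
    voting weights carry over from earlier sessions."""
    m = len(y)
    w = np.full(m, 1.0 / m)
    for t in range(T):
        D = w / w.sum()                          # current distribution
        h = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=D)
        eps = D[h.predict(X) != y].sum()
        if eps > 0.5:
            continue                             # discard this hypothesis
        ensemble.append(h)
        log_inv_betas.append(np.log((1 - eps) / max(eps, 1e-12)))
        # the composite (ensemble) decision drives the weight update
        H = weighted_majority(ensemble, log_inv_betas, X)
        E = D[H != y].sum()
        if E > 0.5:
            ensemble.pop()
            log_inv_betas.pop()
            continue
        B = E / (1 - E + 1e-12)
        w[H == y] *= B                           # focus on ensemble errors
    return ensemble, log_inv_betas
```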
Simulation Results
Odorant Identification
[Figure: normalized responses (0-1) of six polymer-coated sensors (1-6) to five volatile organic compounds: Ethanol, Toluene, Xylene, TCE, and Octane.]
1. APZ: Apiezon
2. PIB: Polyisobutylene
3. DEGA: Poly(diethylene glycol adipate)
4. SG: Solgel
5. OV275: Poly(siloxane)
Data Distribution

Class   S1   S2   S3   TEST
ET      20    5    5     34
OC      20    5    5     34
TL      40    5    5     62
TCE      0   25    5     34
XL       0    0   40     40
Odorant Identification Results

PNN
Dataset     S1       S2       S3       TEST
TS1 (2)     93.70%   --       --       58.80%
TS2 (3)     73.70%   92.50%   --       67.70%
TS3 (3)     76.30%   82.50%   85.00%   83.80%

RBF
Dataset     S1       S2       S3       TEST
TS1 (5)     97.50%   --       --       59.30%
TS2 (6)     81.20%   97.00%   --       67.60%
TS3 (7)     76.20%   95.00%   90.00%   86.30%

MLP
Dataset     S1       S2       S3       TEST
TS1 (10)    87.50%   --       --       58.80%
TS2 (5)     75.00%   82.50%   --       71.10%
TS3 (9)     71.30%   85.00%   86.70%   87.20%

(Numbers in parentheses indicate the number of classifiers generated in each training session.)
Ultrasonic Weld Inspection Database

[Figure: weld cross-section showing the defect types to be identified: slag, porosity, crack, lack of fusion, and incomplete penetration (North / South scan directions).]

Data Distribution

Class       S1    S2    S3    TEST
LOF        300   200   150    200
SLAG       300   200   150    200
CRACK        0   200   137    150
POROSITY     0     0    99    125
[Figure: example ultrasonic signals for crack, lack of fusion, slag, and porosity defects.]
UWI Results

PNN
Dataset     S1       S2       S3       TEST
TS1 (7)     99.50%   --       --       48.90%
TS2 (7)     83.60%   93.00%   --       62.40%
TS3 (14)    71.50%   75.00%   99.40%   76.10%

RBF
Dataset     S1       S2       S3       TEST
TS1 (9)     90.80%   --       --       47.70%
TS2 (5)     78.30%   89.50%   --       60.90%
TS3 (14)    69.80%   75.80%   94.70%   76.40%

MLP
Dataset     S1       S2       S3       TEST
TS1 (8)     98.20%   --       --       48.90%
TS2 (3)     84.70%   98.70%   --       66.80%
TS3 (17)    80.30%   86.50%   97.00%   78.40%
Optical Character Recognition (OCR)

Handwritten character recognition problem:
- 2997 instances, 62 attributes, 10 classes
- Divided into four training sets, S1-S4 (2150 instances), and a TEST set (1797 instances) for testing
- Different datasets contain different subsets of the classes
OCR Database

Data Distribution

Class   0     1     2     3     4     5     6     7     8     9
S1      100   0     100   0     100   0     100   0     100   0
S2      50    150   50    150   50    150   50    0     0     50
S3      50    50    50    50    50    50    0     150   0     100
S4      25    0     25    25    0     25    100   50    150   50
TEST    178   182   177   183   181   182   181   179   174   180
OCR Results

PNN
Dataset     S1       S2       S3       S4       TEST
TS1 (2)     99.80%   --       --       --       48.70%
TS2 (5)     80.80%   96.30%   --       --       69.10%
TS3 (6)     78.20%   89.00%   98.20%   --       82.80%
TS4 (3)     92.40%   88.70%   94.00%   88.00%   86.70%

RBF
Dataset     S1       S2       S3       S4       TEST
TS1 (4)     98.00%   --       --       --       47.80%
TS2 (6)     81.00%   96.50%   --       --       73.20%
TS3 (8)     77.00%   88.40%   93.60%   --       79.80%
TS4 (15)    93.40%   80.60%   92.70%   90.00%   85.90%

MLP
Dataset     S1       S2       S3       S4       TEST
TS1 (18)    96.60%   --       --       --       46.60%
TS2 (30)    89.80%   87.10%   --       --       68.90%
TS3 (23)    86.00%   89.40%   92.00%   --       82.00%
TS4 (3)     94.80%   87.90%   92.20%   87.30%   87.00%
Learn++ Implementation Issues

Learn++.MT: The Concept

[Figure: when instances of a previously unseen class ("4?") are presented, classifiers from earlier training sessions that have never seen that class can outvote the newer classifiers that have, causing the ensemble to misclassify the new class.]
Learn++.MT

Learn++.MT computes, for each class c, a preliminary confidence: the weight of the votes class c receives, relative to the total weight of the classifiers trained on class c,

$$P_c(x_i) = \frac{\sum_{t:\, h_t(x_i) = c} W_t}{\sum_{t:\, c \in CTr_t} W_t}$$

where CTr_t denotes the set of classes used to train classifier t. It then updates (lowers) the weights of the classifiers that have not been trained with the new class:

$$W_{t:\, c \notin CTr_t}(x_i) = W_{t:\, c \notin CTr_t} \left(1 - P_c(x_i)\right)$$
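A sketch of this dynamic weight adjustment for a single instance; the per-classifier trained-class sets are assumed available, and classifiers are assumed never to vote for a class they have not seen:

```python
import numpy as np

def mt_adjust_weights(votes, W, trained_on, n_classes):
    """votes: (T,) predicted class per classifier for one instance x;
    W: (T,) voting weights; trained_on: list of sets of classes each
    classifier was trained on. Lowers the weights of classifiers that
    never saw a class currently drawing confident votes."""
    W = W.astype(float).copy()
    for c in range(n_classes):
        saw_c = np.array([c in s for s in trained_on])
        denom = W[saw_c].sum()
        if denom == 0:
            continue
        P_c = W[votes == c].sum() / denom     # preliminary confidence
        W[~saw_c] *= (1.0 - min(P_c, 1.0))    # lower uninformed weights
    return W
```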
Learn++.MT

Algorithm Learn++.MT (sketch; for each database Sk):
Let $eT_k = \sum_{j=1}^{k-1} T_j$ denote the number of classifiers generated in the previous k − 1 training sessions.
1. Initialize and normalize the instance weights w(i) so that D is a distribution.
2. Train a classifier on a subset drawn according to Dt and compute its error
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
3. Assign the voting weights Wt, adjusting them per instance with the preliminary-confidence rule of the previous slide; the relevant normalization is over the classifiers trained on class c:
   $$Z_c = \sum_{t:\, c \in CTr_t} W_t$$
4. Obtain the composite hypothesis Ht and update the instance weights based on the ensemble decision:
   $$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$
Learn++.MT Simulation: VOC Data

Class       1    2    3    4    5
Dataset 1   20   20   40    -    -
Dataset 2   10   10   10   25    -
Dataset 3   10   10   10   15   40
Test        24   24   52   24   40

(Classes 4 and 5 are introduced only in the later datasets; dashes mark classes absent from a dataset.)
Test procedure:
Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize each algorithm's performance. Learn++ generated its best results with 6 classifiers on the first dataset, 12 on the next, and 18 on the last; Learn++.MT performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.
Learn++.MT Simulation: VOC Data

Training    # Classifiers Added        Performance
Session     Learn++    Learn++.MT      Learn++    Learn++.MT
TS1         6          6               55%        54%
TS2         12         4               64%        63%
TS3         18         6               69%        81%
Learn++.MT2

If one dataset has substantially more data than the other(s), and a comparable number of classifiers is generated on each, the ensemble decision might be unfairly biased towards the data with the lower cardinality.

Under the generally valid assumptions that
- no instance is repeated in any dataset, and
- the noise distribution remains relatively unchanged among datasets,
it is reasonable to believe that the dataset with more instances carries more information. Classifiers generated with such data should therefore be weighted more heavily.
Learn++.MT2 Algorithm

Each classifier t receives a class-conditional voting weight

$$w_{t,c} = p_t \, \frac{n_c}{N_c}$$

For each classifier, this ratio relates the number of instances from a particular class used for training that classifier (nc) to the number of instances from that class used for training all classifiers thus far within the ensemble (Nc); pt is the classifier's performance.

The final decision is made similarly to Learn++, but using the class-conditional weights:

$$H_{final}(x_i) = \arg\max_{c \in Y_k} \sum_{t:\, h_t(x_i) = c} w_{t,c}$$
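A short sketch of the class-conditional weighting and the final decision; the array shapes and per-classifier training counts are assumptions made for the illustration:

```python
import numpy as np

def mt2_weights(perf, class_counts):
    """perf: (T,) performance p_t of each classifier;
    class_counts: (T, C) number of class-c instances used to train
    classifier t. Returns (T, C) class-conditional voting weights."""
    N_c = class_counts.sum(axis=0, keepdims=True)   # per-class totals
    return perf[:, None] * class_counts / np.maximum(N_c, 1)

def mt2_decide(votes, w):
    """votes: (T,) class predicted by each classifier for one instance;
    w: (T, C) class-conditional weights. Returns the winning class."""
    support = np.zeros(w.shape[1])
    for t, c in enumerate(votes):
        support[c] += w[t, c]       # each classifier contributes its
    return support.argmax()         # weight for the class it votes for
```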
Experimental Setup

Base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12-40 nodes and an error goal of 0.05-0.025.
In each case the data distributions were designed to simulate unbalanced data.
To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, such that the tests give an accurate and unbiased comparison of the algorithms.
Wine Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        88%    84%    2.9%
Learn++.MT2    88%    89%    1.6%

Observations:
- After the initial training session (TS1), the performances of both algorithms are the same.
- After TS2, Learn++.MT2 outperforms Learn++.
- No performance degradation is seen with Learn++.MT2 after the second dataset is introduced: Learn++.MT2 is more stable.
VOC Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        89%    86%    2.1%
Learn++.MT2    88%    89%    1.9%
OCR Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        94%    92%    0.9%
Learn++.MT2    94%    95%    0.6%

Observations:
- Drastically unbalanced data, with the majority of the information contained in Dataset 1, explains the high TS1 performance.
- Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.
OCR Reverse Presentation

Algorithm      TS1    TS2    Std. Dev.
Learn++        85%    91%    0.7%
Learn++.MT2    88%    94%    0.6%

Observations:
- Reversed scenario: little information is initially provided, followed by more substantial data.
- The final performances remain essentially unchanged: the algorithm is immune to the order of presentation.
- The momentary dip in Learn++.MT2 performance as a new dataset is introduced ironically justifies the approach taken. Why?
Other Ensemble Techniques
Stacked Generalization

[Figure: stacked generalization architecture. The input x is fed to N first-layer classifiers C1, ..., CN with parameters θ1, ..., θN, producing outputs h1(x, θ1), ..., hN(x, θN). These outputs form the input to a second-layer classifier CN+1 with parameters θN+1, which makes the final decision.]
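As a concrete illustration, scikit-learn ships a stacking meta-estimator; the particular base learners chosen here are arbitrary placeholders, not anything prescribed by the slide:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First-layer (Tier-1) classifiers
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Second-layer classifier trained on the first layer's outputs
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression())
print(cross_val_score(stack, X, y, cv=5).mean())
```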
Mixture of Experts

[Figure: mixture-of-experts architecture. The input x is fed to T expert classifiers C1, ..., CT with parameters θ1, ..., θT, producing outputs h1(x, θ1), ..., hT(x, θT). A gating network assigns the weights w1, ..., wT, and a pooling/combining system forms the final decision, e.g., by stochastic selection, winner-takes-all, or a weighted average.]
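A minimal sketch of the combination step, assuming the experts and a softmax gating network are already trained (all names here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_combine(x, experts, gate_W, mode="weighted_average"):
    """experts: list of callables, each mapping x -> class-probability vector;
    gate_W: (T, d) gating-network weights mapping x to one score per expert."""
    outputs = np.stack([h(x) for h in experts])     # (T, C) expert outputs
    w = softmax(gate_W @ x)                         # (T,) gating weights
    if mode == "weighted_average":
        return (w[:, None] * outputs).sum(axis=0)   # blend expert opinions
    if mode == "winner_takes_all":
        return outputs[w.argmax()]                  # trust the top expert
    if mode == "stochastic":
        return outputs[np.random.choice(len(experts), p=w)]
```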