
PR

Dept. of Electrical and Computer Engineering


0909.555.01

Advanced Topics in

Pattern Recognition
http://engineering.rowan.edu/~polikar/CLASSES/ECE555


Week 9
AdaBoost & Learn++
AdaBoost
AdaBoost.M1
AdaBoost.M2
AdaBoost.R
Learn++

Graphical drawings: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001.

Advanced Topics in Pattern Recognition
2005, Robi Polikar, Rowan University, Glassboro, NJ

This Week in PR

AdaBoost and Variations


AdaBoost.M1
AdaBoost.M2
AdaBoost.R (independent research)
Learn++

Bias Variance Analysis


AdaBoost

Arguably the most popular and successful of all ensemble generation algorithms, AdaBoost (Adaptive Boosting) extends the original boosting algorithm to multi-class problems.
Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

AdaBoost solves the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb:
AdaBoost generates an ensemble of classifiers, the training data of each drawn from a distribution that starts out uniform and iteratively shifts weight toward the instances that are misclassified.
Each classifier in AdaBoost therefore focuses increasingly on the instances that are harder to classify.
The classifiers are then combined through weighted majority voting.
AdaBoost

Algorithm AdaBoost
Create a discrete distribution over the training data by assigning a weight to each instance. Initially the distribution is uniform, hence all weights are the same.
Draw a subset from this distribution and train a weak classifier with this dataset.
Compute the error ε of this classifier on its own training dataset. Make sure that this error is less than 1/2.
Test the entire training data on this classifier:
  If an instance x is correctly classified, reduce its weight by the factor β = ε / (1 − ε) < 1.
  If it is misclassified, keep its weight, so that after normalization its relative weight increases (in proportion to 1/β).
Normalize the weights so that they constitute a distribution.

Repeat until T classifiers are generated.

Combine the classifiers using weighted majority voting.

AdaBoost.M1

[Block diagram: AdaBoost.M1. Training data subsets S1, ..., ST are drawn from an iteratively updated distribution D1, ..., DT; each subset trains a classifier C1, ..., CT producing hypotheses h1, ..., hT. The normalized error βt drives the distribution update, and the ensemble decision combines the hypotheses by weighted majority voting with voting weights log(1/βt).]

AdaBoost.M1

Algorithm AdaBoost.M1
Input:
  Sequence of m examples S = [(x1, y1), (x2, y2), ..., (xm, ym)]
    with labels yi ∈ Y = {1, ..., C} drawn from a distribution D,
  Weak learning algorithm WeakLearn,
  Integer T specifying number of iterations.

Initialize D1(i) = 1/m for all i.
Do for t = 1, 2, ..., T:
  1. Call WeakLearn, providing it with the distribution Dt.
  2. Get back a hypothesis ht : X → Y.
  3. Calculate the error of ht:  εt = Σ_{i: ht(xi) ≠ yi} Dt(i).
     If εt > 1/2, then set T = t − 1 and abort loop.
  4. Set βt = εt / (1 − εt).
  5. Update distribution Dt:
       Dt+1(i) = (Dt(i) / Zt) × { βt, if ht(xi) = yi; 1, otherwise }
     where Zt is a normalization constant chosen so that Dt+1 becomes a distribution.

Output the final hypothesis:
  h_final(x) = arg max_{y∈Y} Σ_{t: ht(x) = y} log(1/βt)
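The loop above is compact enough to sketch in code. The following is a minimal illustrative sketch, not the authors' implementation; the `weak_learner` callable, its signature, and all other names are assumptions made for this example.

```python
import numpy as np

def adaboost_m1(X, y, weak_learner, T):
    """Minimal AdaBoost.M1 sketch.

    weak_learner(X, y, D) must return a trained predictor, i.e. a
    callable mapping an array of instances to predicted labels,
    fitted to favor high-weight instances under distribution D.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)               # D1(i) = 1/m
    hypotheses, betas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()          # weighted error on S
        if eps > 0.5:                     # weak-learning condition violated
            break
        eps = max(eps, 1e-12)             # avoid log(0) for a perfect learner
        beta = eps / (1.0 - eps)
        D = np.where(pred == y, D * beta, D)   # shrink correct instances
        D /= D.sum()                      # renormalize to a distribution
        hypotheses.append(h)
        betas.append(beta)

    def predict(Xq):
        """Weighted majority vote with weights log(1/beta_t)."""
        classes = np.unique(y)
        votes = np.zeros((len(Xq), len(classes)))
        for h, b in zip(hypotheses, betas):
            p = h(Xq)
            for ci, c in enumerate(classes):
                votes[p == c, ci] += np.log(1.0 / b)
        return classes[votes.argmax(axis=1)]
    return predict
```

Any learner that can bias its training toward high-weight instances (for example, by resampling according to Dt, as the slides describe) can be plugged in as `weak_learner`.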

Weighted Majority Voting Demystified!

Problem: How many hours a day should our students spend working on homework?

[Figure: five experts each propose an answer (the visible ones are 7 and 3 hours); a weight assigner gives the experts weights 0.3, 0.2, 0.25, 0.15 and 0.1. Weighted majority voting over the answers yields 5.96 ≈ 6 hours.]
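The combination in the example takes only a few lines of code. The expert answers below are made up for illustration (the slide shows only two of the five answers), so the result here is 6.0 rather than the slide's 5.96.

```python
def weighted_average(answers, weights):
    """Continuous weighted-majority combination (a weighted mean).

    Weights are normalized so they form a distribution."""
    total = sum(weights)
    return sum(a * w for a, w in zip(answers, weights)) / total

def weighted_majority(labels, weights):
    """Discrete weighted majority voting: each expert's vote counts
    with its weight; the label with the largest total weight wins."""
    tally = {}
    for lab, w in zip(labels, weights):
        tally[lab] = tally.get(lab, 0.0) + w
    return max(tally, key=tally.get)
```

For real-valued answers (as in the homework example) the weighted mean is the natural combination; for class labels, `weighted_majority` is the rule AdaBoost uses.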
Does AdaBoost Work?

You Betcha!
The training error of AdaBoost can be shown to be bounded:

  E < 2^T ∏_{t=1}^{T} √( εt (1 − εt) )

where εt is the error of the tth hypothesis. Since each factor 2√(εt(1 − εt)) < 1 whenever εt < 1/2, this product gets smaller and smaller with each added classifier.
But wait, isn't this against Occam's razor?

For an explanation, see Freund and Schapire's paper as well as Schapire's tutorial on boosting and margin theory. More about this later.
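To see the bound shrink, one can multiply out the per-round factors for a hypothetical sequence of weak-learner errors (the εt values used below are made up for illustration):

```python
import math

def adaboost_error_bound(errors):
    """Upper bound on AdaBoost's training error:

        E < prod_t  2 * sqrt(eps_t * (1 - eps_t))

    Each factor is < 1 whenever eps_t < 1/2, so the bound
    decays geometrically as classifiers are added."""
    bound = 1.0
    for eps in errors:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound
```

With εt = 0.3 every round, each factor is about 0.917, so ten classifiers already push the bound below 0.42; a learner at exactly εt = 0.5 (random guessing) contributes a factor of 1 and makes no progress.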

Occam's Razor vs. AdaBoost

What to expect:
  training error to decrease with the number of classifiers
  generalization error to increase after a while (overfitting)

What's observed:
  generalization error does not increase, even after many, many iterations
  in fact, it even decreases after the training error reaches zero!
Is Occam's razor of "simpler is better" wrong? Violated?

(From R. Schapire, http://www.cs.princeton.edu/~schapire/, letters database)

The Margin Theory

The margin of an instance x roughly describes the confidence of the ensemble in its decision:
Loosely speaking, the margin of an instance is simply the difference between the total (or fraction of) vote(s) it receives from correctly identifying classifiers and the maximum (or fraction of) vote(s) it receives for any incorrect class:

  m(x) = μk(x) − max_{j ≠ k} { μj(x) }

where the kth class is the true class, and μj(x) is the total support (vote) class j receives from all classifiers, normalized such that

  Σ_j μj(x) = 1

The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
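With normalized class supports, the margin is one line of arithmetic. A minimal sketch (the function and argument names are assumptions for illustration):

```python
def margin(supports, true_class):
    """Margin m(x) = mu_k(x) - max_{j != k} mu_j(x).

    supports maps each class label to its normalized total
    vote mu_j(x); true_class is the correct label k.
    A negative result means the ensemble decides incorrectly."""
    rivals = [v for c, v in supports.items() if c != true_class]
    return supports[true_class] - max(rivals)
```

For example, supports of 0.6 / 0.3 / 0.1 with the first class correct give a margin of 0.3, while a correct-class support of only 0.2 against a rival at 0.7 gives a negative margin.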

Margin Theory

[Figure: cumulative margin distributions on the letters database. From R. Schapire - http://www.cs.princeton.edu/~schapire/]

Margin Theory

Large margins imply a smaller (upper) bound on the generalization error.
  If all margins are large, the final decision boundary can be obtained using a simpler classifier (similar to how polls can predict the outcomes of not-so-close races very early on).

They show that boosting tends to increase the margins on the training data examples, and argue that an ensemble classifier with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble.
More specifically: let H be a finite space of base classifiers. For any δ > 0 and θ > 0, with probability 1 − δ over the random choice of the training data set S, any ensemble E = {h1, ..., hT} ⊆ H combined through weighted majority satisfies

  P(error) ≤ P(training margin ≤ θ) + O( (1/√N) · ( (log N · log|H|) / θ² + log(1/δ) )^{1/2} )

N: number of instances
|H|: cardinality of the classifier space (the weaker the classifiers, the smaller the |H|)
P(error) is independent of the number of classifiers!

AdaBoost.M2

AdaBoost.M1 requires that all classifiers have a weighted error no greater than 1/2.
  This is the least that can be asked of a classifier in a two-class problem, since an error of 1/2 is equivalent to random guessing.
  The probability of error for random guessing is much higher for multi-class problems (specifically, (k − 1)/k for a k-class problem). Therefore, achieving an error below 1/2 becomes increasingly difficult for larger numbers of classes, particularly if the weak classifiers are really weak.

AdaBoost.M2 addresses this problem by removing the weighted error restriction, and instead defines the pseudo-error, which itself is then required to be no larger than 1/2.
  The pseudo-error recognizes that there is information in the outputs of classifiers for the non-selected / non-winning classes.
  On an OCR problem, 1 and 7 may look alike, and a classifier may give high plausibility outputs to these two classes and low outputs to all others when faced with a 1 or a 7.
AdaBoost.M2

Algorithm AdaBoost.M2
Input:
  Sequence of m examples S = [(x1, y1), (x2, y2), ..., (xm, ym)]
    with labels yi ∈ Y = {1, ..., C} drawn from a distribution D,
  Weak learning algorithm WeakLearn,
  Integer T specifying number of iterations.

Let B = {(i, y) : i ∈ {1, 2, ..., m}, y ≠ yi}
Initialize D1(i, y) = 1/|B| for all (i, y) ∈ B.
Do for t = 1, 2, ..., T:
  1. Call WeakLearn, providing it with the distribution Dt.
  2. Get back a hypothesis ht : X × Y → [0, 1].
  3. Calculate the pseudo-error of ht:
       εt = (1/2) Σ_{(i,y)∈B} Dt(i, y) · (1 − ht(xi, yi) + ht(xi, y))
  4. Set βt = εt / (1 − εt).
  5. Update distribution Dt:
       Dt+1(i, y) = (Dt(i, y) / Zt) · βt^{(1/2)(1 + ht(xi, yi) − ht(xi, y))}
     where Zt is a normalization constant chosen so that Dt+1 becomes a distribution.

Output the final hypothesis:
  h_final(x) = arg max_{y∈Y} Σ_{t=1}^{T} log(1/βt) · ht(x, y)
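The pseudo-error can be computed directly from its definition. A sketch under assumed names: `h` stands in for a hypothetical confidence-rated hypothesis returning plausibilities in [0, 1].

```python
def pseudo_error(D, h, X, y):
    """AdaBoost.M2 pseudo-error:

        eps_t = 1/2 * sum over (i, y') in B of
                D(i, y') * (1 - h(x_i, y_i) + h(x_i, y'))

    B pairs each instance i with every incorrect label y', so
    D is a dict keyed by (instance_index, wrong_label).  A term
    is small when the hypothesis gives the true label high
    plausibility and the wrong label low plausibility."""
    total = 0.0
    for (i, wrong), d in D.items():
        total += d * (1.0 - h(X[i], y[i]) + h(X[i], wrong))
    return 0.5 * total
```

A perfectly confident hypothesis has pseudo-error 0, and one that outputs 0.5 for every (instance, label) pair has pseudo-error exactly 1/2, mirroring random guessing in the two-class case.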

Incremental Learning

We now pose the following question:
  If, after training an algorithm, we receive additional data, how can we update the trained classifier to learn the new data?
  None of the classic algorithms we have seen so far, including MLP, RBF, PNN, KNN, RCE, etc., is capable of incrementally updating its knowledge base to learn new information.
  The typical procedure is to discard the previously trained classifier, combine the old and new data, and start over from scratch.
  This causes all the information learned so far to be lost: catastrophic forgetting.
  Furthermore, what if the old data is no longer available, or what if the new data introduces new classes?

The ensemble-of-classifiers approach, generally used for improving the generalization accuracy of a classifier, can also be used to address the issue of incremental learning.
  One such implementation of ensemble classifiers for incremental learning is Learn++.

Incremental Learning

[Figure: feature space (Feature 1 vs. Feature 2). Classifier C1 is trained on Data1; classifiers C2, C3 and C4 are added as Data2 arrives, incrementally refining the decision boundary.]

Learn++

So, how do we achieve incremental learning?
  What, if anything, in the AdaBoost formulation prevents us from learning new data, even when instances of previously unseen classes are introduced?
  Actually, nothing!
  AdaBoost should work for incremental learning, but it can be made more efficient.

Learn++ modifies the distribution update rule so that the update is based on the ensemble decision, not just the previous classifier.
  Why should this make any difference?

Learn++

[Figure: Learn++ generates hypotheses h1, ..., h8 from training distributions D1 and D2; their decisions (C1, C3, C4, C5, C8, ...) are combined by weighted majority voting with the corresponding voting weights, yielding the learned decision boundary.]

Learn++

[Flow diagram: each incoming database (Database 1, Database 2, ...) is divided into subsets D1, ..., Dn, each training a classifier C1, ..., Cn with error E1, ..., En. The misclassified instances drive the distribution for the next subset. Composite hypotheses H1, ..., Hn−1 with performances P1, ..., Pn and voting weights W1, ..., Wn are combined by weighted majority voting to produce the final classification.]

Algorithm Learn++ (major differences from AdaBoost.M1 are indicated)
Input: For each database Dk, k = 1, 2, ..., K:
  Sequence of mk training examples Sk = [(x1, y1), (x2, y2), ..., (xmk, ymk)].
  Weak learning algorithm WeakLearn.
  Integer Tk, specifying the number of iterations.
Do for k = 1, 2, ..., K:
  Initialize w1(i) = D(i) = 1/mk for all i, unless there is prior knowledge to select otherwise.
  Do for t = 1, 2, ..., Tk:
    1. Set Dt = wt / Σ_{i=1}^{mk} wt(i) so that Dt is a distribution.
    2. Randomly choose a training data subset TRt according to Dt.
    3. Call WeakLearn, providing it with TRt.
    4. Get back a hypothesis ht : X → Y, and calculate its error εt = Σ_{i: ht(xi) ≠ yi} Dt(i) on Sk.
       If εt > 1/2, set t = t − 1, discard ht and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
    5. Call weighted majority voting, obtain the overall hypothesis
         Ht = arg max_{y∈Y} Σ_{t: ht(x) = y} log(1/βt),
       and compute the overall error
         Et = Σ_{i: Ht(xi) ≠ yi} Dt(i) = Σ_{i=1}^{mk} Dt(i) · [|Ht(xi) ≠ yi|].
       If Et > 1/2, set t = t − 1, discard Ht and go to step 2.
    6. Set Bt = Et / (1 − Et), and update the weights of the instances:
         wt+1(i) = wt(i) × { Bt, if Ht(xi) = yi; 1, otherwise } = wt(i) · Bt^{1 − [|Ht(xi) ≠ yi|]}
Call weighted majority voting on the combined hypotheses Ht and output the final hypothesis:
  H_final = arg max_{y∈Y} Σ_{k=1}^{K} Σ_{t: Ht(x) = y} log(1/Bt)
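The key difference from AdaBoost.M1, updating the instance weights from the composite hypothesis Ht rather than the latest ht, fits in a few lines. This is a sketch under assumed names, not the authors' code:

```python
import numpy as np

def learnpp_weight_update(w, composite_pred, y):
    """Learn++ step 6: instance weights are scaled by

        B_t = E_t / (1 - E_t)

    computed from the COMPOSITE hypothesis H_t, so instances the
    whole ensemble already gets right are down-weighted, not just
    those the latest classifier gets right."""
    D = w / w.sum()
    correct = composite_pred == y
    E = D[~correct].sum()                 # composite error E_t
    if E >= 0.5:
        raise ValueError("composite hypothesis too weak; discard H_t")
    B = E / (1.0 - E)
    return np.where(correct, w * B, w)    # w_{t+1}(i)
```

When a new dataset arrives with unseen classes, the existing ensemble misclassifies exactly those instances, so this update immediately concentrates the distribution on the novel content, which is why the ensemble-based rule learns new data more efficiently than the AdaBoost rule.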

Simulation Results: Odorant Identification

[Figure: normalized response patterns of six polymer-coated sensors (x-axis: sensor 1-6, y-axis: 0-0.8) for five odorants: Ethanol, Toluene, Xylene, TCE, and Octane.]

Sensor coatings:
1. APZ: Apiezon
2. PIB: Polyisobutylene
3. DEGA: Poly(diethyleneglycol adipate)
4. SG: Solgel
5. OV275: Poly(siloxane)
6. PDPP: Poly(diphenoxylphosphorazene)

Data Distribution

Dataset   ET   OC   TL   TCE   XL
S1        20   20   40    0     0
S2         5    5    5   25     0
S3         5    5    5    5    40
TEST      34   34   62   34    40

Odorant Identification Results

PNN
Dataset   TS1 (2)   TS2 (3)   TS3 (3)
S1        93.70%    73.70%    76.30%
S2        --        92.50%    82.50%
S3        --        --        85.00%
TEST      58.80%    67.70%    83.80%

RBF
Dataset   TS1 (5)   TS2 (6)   TS3 (7)
S1        97.50%    81.20%    76.20%
S2        --        97.00%    95.00%
S3        --        --        90.00%
TEST      59.30%    67.60%    86.30%

MLP
Dataset   TS1 (10)  TS2 (5)   TS3 (9)
S1        87.50%    75.00%    71.30%
S2        --        82.50%    85.00%
S3        --        --        86.70%
TEST      58.80%    71.10%    87.20%

Ultrasonic Weld Inspection (UWI)

[Figure: weld cross-section (North and South sides) illustrating the defect types: slag, porosity, crack, lack of fusion, and incomplete penetration.]

Data Distribution

Dataset   LOF   SLAG   CRACK   POROSITY
S1        300   300      0       0
S2        200   200    200       0
S3        150   150    137      99
TEST      200   200    150     125

Ultrasonic Weld Inspection Database

[Figure: example ultrasonic signals for each flaw type: crack, lack of fusion, slag, and porosity.]

UWI Results

PNN
Dataset   TS1 (7)   TS2 (7)   TS3 (14)
S1        99.50%    83.60%    71.50%
S2        --        93.00%    75.00%
S3        --        --        99.40%
TEST      48.90%    62.40%    76.10%

RBF
Dataset   TS1 (9)   TS2 (5)   TS3 (14)
S1        90.80%    78.30%    69.80%
S2        --        89.50%    75.80%
S3        --        --        94.70%
TEST      47.70%    60.90%    76.40%

MLP
Dataset   TS1 (8)   TS2 (3)   TS3 (17)
S1        98.20%    84.70%    80.30%
S2        --        98.70%    86.50%
S3        --        --        97.00%
TEST      48.90%    66.80%    78.40%

Optical Character Recognition (OCR)

Handwritten character recognition problem
2997 instances, 62 attributes, 10 classes
Divided into four subsets: S1~S4 (2150) for training, TEST (1797) for testing
Different sets of classes in different datasets

OCR Database

Data Distribution

Class   S1    S2    S3    S4    TEST
0       100   50    50    25    178
1       0     150   50    0     182
2       100   50    50    25    177
3       0     150   50    25    183
4       100   50    50    0     181
5       0     150   50    25    182
6       100   50    0     100   181
7       0     0     150   50    179
8       100   0     0     150   174
9       0     50    100   50    180

OCR Results

PNN
Dataset   TS1 (2)   TS2 (5)   TS3 (6)   TS4 (3)
S1        99.80%    80.80%    78.20%    92.40%
S2        --        96.30%    89.00%    88.70%
S3        --        --        98.20%    94.00%
S4        --        --        --        88.00%
TEST      48.70%    69.10%    82.80%    86.70%

RBF
Dataset   TS1 (4)   TS2 (6)   TS3 (8)   TS4 (15)
S1        98.00%    81.00%    77.00%    93.40%
S2        --        96.50%    88.40%    80.60%
S3        --        --        93.60%    92.70%
S4        --        --        --        90.00%
TEST      47.80%    73.20%    79.80%    85.90%

MLP
Dataset   TS1 (18)  TS2 (30)  TS3 (23)  TS4 (3)
S1        96.60%    89.80%    86.00%    94.80%
S2        --        87.10%    89.40%    87.90%
S3        --        --        92.00%    92.20%
S4        --        --        --        87.30%
TEST      46.60%    68.90%    82.00%    87.00%

Learn++ Implementation Issues

Distribution initialization when a new dataset becomes available:
  Solution: start with a uniform distribution, and update that distribution based on the performance of the existing ensemble on the new data.

When to stop training for each dataset?
  Solution: use a validation dataset, if one is available.
  Or, keep training until performance on the test data peaks (mild cheating).

Classifier proliferation when new classes are added: sufficient additional classifiers need to be generated to out-vote the existing classifiers, which cannot correctly predict the new class.
  Solution: Learn++.MT (after Muhlbaier & Topalis)

Learn++.MT: The Concept

[Figure: an unfamiliar instance ("4?") is presented to the ensemble in three successive snapshots. © 2004 All Rights Reserved, Muhlbaier and Topalis]
Learn++.MT

Learn++.MT creates a preliminary class confidence on each instance and updates the weights of classifiers that have not seen a particular class.
  Each classifier is assigned a weight based on its performance on the training data.
  The preliminary class confidence is obtained by summing the weights of all classifiers that picked a given class and dividing by the sum of the weights of all classifiers that have been trained on that class:

    Pc(xi) = ( Σ_{t: ht(xi) = c} Wt ) / ( Σ_{t: c ∈ CTrt} Wt ),  for c = 1, 2, ..., C

  where the numerator sums over the set of classifiers that have picked class c, and the denominator over the set of classifiers that have seen class c; Pc(xi) is the preliminary confidence, among the classifiers that have seen class c, of xi belonging to class c.
  Learn++.MT then updates (lowers) the voting weights of the classifiers that have not been trained with the new class:

    W_{t: c ∉ CTrt}(xi) = W_{t: c ∉ CTrt} · (1 − Pc(xi))
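The confidence and discount steps can be sketched as follows; the function and argument names are assumptions for illustration, not the authors' code:

```python
def mt_discount(voting_weights, picks, trained_on, c):
    """Learn++.MT: preliminary confidence for class c, then discount
    the voting weights of classifiers never trained on class c.

        P_c = (sum of weights of classifiers that picked c)
              / (sum of weights of classifiers trained on c)
        W_t <- W_t * (1 - P_c)   for every t with c not in CTr_t

    voting_weights: W_t per classifier; picks: the class each
    classifier picked for this instance; trained_on: set CTr_t of
    classes each classifier was trained on."""
    num = sum(w for w, p in zip(voting_weights, picks) if p == c)
    den = sum(w for w, s in zip(voting_weights, trained_on) if c in s)
    Pc = num / den
    discounted = [w if c in s else w * (1.0 - Pc)
                  for w, s in zip(voting_weights, trained_on)]
    return Pc, discounted
```

In the extreme case where every classifier that knows class c votes for it (Pc = 1), the classifiers that have never seen c are silenced entirely, which is exactly the out-voting behavior the concept slide motivates.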

Learn++.MT

Algorithm Learn++.MT
Input: For each dataset Dk, k = 1, 2, ..., K:
  Sequence of mk instances xi with labels yi ∈ Yk = {1, ..., c}
  Weak learning algorithm BaseClassifier.
  Integer Tk, specifying the number of iterations.
Do for k = 1, 2, ..., K:
  If k = 1, initialize w1 = D1(i) = 1/m, eT1 = 0 for all i.
  Else, go to Step 5: evaluate the current ensemble on the new dataset Dk, update the instance weights, and recall the current number of classifiers eTk = Σ_{j=1}^{k−1} Tj.
  Do for t = eTk + 1, eTk + 2, ..., eTk + Tk:
    1. Set Dt = wt / Σi wt(i) so that Dt is a distribution.
    2. Call BaseClassifier with a subset of Dk chosen using Dt.
    3. Obtain ht : X → Y, and calculate its error εt = Σ_{i: ht(xi) ≠ yi} Dt(i).
       If εt > 1/2, discard ht and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
    4. Set CTrt = Yk, to save the labels of the classes used in training ht.
    5. Call DWV to obtain the composite hypothesis Ht.
    6. Compute the error of the composite hypothesis Et = Σ_{i: Ht(xi) ≠ yi} Dt(i).
    7. Set Bt = Et / (1 − Et), and update the instance weights:
         wt+1(i) = wt(i) × { Bt, if Ht(xi) = yi; 1, otherwise }
Call DWV to obtain the final hypothesis, Hfinal.

Algorithm Dynamically Weighted Voting (DWV)
Input:
  Sequence of i = 1, ..., n training instances, or any test instance xi,
  Classifiers ht, with corresponding error values βt, and classes CTrt used in training ht.
For t = 1, 2, ..., T, where T is the total number of classifiers:
  1. Initialize the classifier weights Wt = log(1/βt).
  2. Create a normalization factor for each class: Zc = Σ_{t: c ∈ CTrt} Wt, for c = 1, 2, ..., C classes.
  3. Obtain the preliminary decision Pc = ( Σ_{t: ht(xi) = c} Wt ) / Zc.
  4. Update the voting weights: W_{t: c ∉ CTrt} = W_{t: c ∉ CTrt} · (1 − Pc).
  5. Compute the final / composite hypothesis:
       H_final(xi) = arg max_c Σ_{t: ht(xi) = c} Wt

Learn++.MT Simulation: VOC Data

Data Distribution (instances per class)

            Class 1   Class 2   Class 3   Class 4   Class 5
Dataset 1     20        20        40        --        --
Dataset 2     10        10        10        25        --
Dataset 3     10        10        10        15        40
Test          24        24        52        24        40

Test procedure:
Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize each algorithm's performance. Learn++ appeared to generate the best results when 6 classifiers were generated on the first dataset, 12 on the next, and 18 on the last. Learn++.MT, on the other hand, performed optimally using 6 classifiers in the first training session, 4 on the second, and 6 on the last.

Learn++.MT Simulation: VOC Data

This test was executed twenty times on Learn++ and Learn++.MT to acquire a well-represented generalization performance on the test data.

Training   # Classifiers Added          Performance
Session    Learn++      Learn++.MT      Learn++      Learn++.MT
TS1        6            6               55%          54%
TS2        12           4               64%          63%
TS3        18           6               69%          81%

© 2004 All Rights Reserved, Muhlbaier and Topalis

Learning from Unbalanced Data: Learn++.MT2

Learn++.MT2 was created to account for the unbalanced data problem.
  We define unbalanced data as any discrepancy in the cardinality of the datasets used in incremental learning.

If one dataset has substantially more data than the other(s), the ensemble decision might be unfairly biased towards the data with the lower cardinality.
Under the generally valid assumptions that
  no instance is repeated in any dataset, and
  the noise distribution remains relatively unchanged among datasets,
it is reasonable to believe that the dataset that has more instances carries more information. Classifiers generated with such data should therefore be weighted more heavily.

It is not unusual to see major discrepancies in the cardinalities of datasets that subsequently become available.
The cardinality of each dataset, including the relative cardinalities of individual classes within a dataset, should be taken into consideration in any ensemble-based learning algorithm that employs a classifier combination scheme.
Learn++.MT2 Algorithm

The primary novelty in Learn++.MT2 is the way the voting weights are determined:
  Learn++.MT2 attempts to address the unbalanced data problem by keeping track of the number of instances from each class with which each classifier is trained.
  Each classifier is first given a weight based on its performance on its own training data.
  This weight is later adjusted according to its class-conditional weight factor wt,c:

    wt,c = pt · (nc / Nc)

  pt: training performance of the tth classifier
  nc: number of class-c instances in the current dataset
  Nc: number of all class-c instances seen so far

  For each classifier, this ratio is the proportion of the instances from a particular class used for training that classifier to the number of instances from that class used for training all classifiers thus far within the ensemble.

The final decision is made similarly to Learn++, but using the class-conditional weights:

    H_final(xi) = arg max_{c ∈ Yk} Σ_{t: ht(xi) = c} wt,c
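The class-conditional weights and the final decision rule can be sketched in a few lines. This is an illustrative sketch with assumed names, not the authors' implementation:

```python
def mt2_weights(perf, class_counts, totals):
    """Learn++.MT2 class-conditional voting weights:

        w_{t,c} = p_t * n_c / N_c

    perf: p_t per classifier; class_counts: one dict {c: n_c} per
    classifier, counting class-c instances in its training dataset;
    totals: {c: N_c}, class-c instances seen so far over all datasets."""
    return [{c: p * n / totals[c] for c, n in counts.items()}
            for p, counts in zip(perf, class_counts)]

def mt2_decide(picks, wtc):
    """Final decision: H_final = arg max_c of the sum of w_{t,c}
    over the classifiers t that picked class c."""
    tally = {}
    for p, w in zip(picks, wtc):
        tally[p] = tally.get(p, 0.0) + w.get(p, 0.0)
    return max(tally, key=tally.get)
```

A classifier trained on most of the class-c data seen so far keeps nearly its full performance-based weight for class c, while one trained on a small fraction of it is proportionally discounted, which is the intended bias toward information-rich datasets.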

Experimental Setup

Learn++.MT2 has been tested on three databases:
  Wine database from UCI (3 classes, 13 features);
  Optical Character Recognition database from UCI (10 classes, 64 features);
  A real-world gas identification problem for determining one of five volatile organic compounds (VOC) based on chemical sensor data (5 classes, 6 features).

Base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12~40 nodes and an error goal of 0.05~0.025.
In each case the data distributions were designed to simulate unbalanced data.
To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively.
  This number was selected as a result of experimental testing, such that the tests show an accurate and unbiased comparison of the algorithms.

Wine Recognition Database

Algorithm       TS1    TS2    Std. Dev.
Learn++         88%    84%    2.9%
Learn++.MT2     88%    89%    1.6%

Observations:
  After the initial training session (TS1), the performances of both algorithms are the same;
  After TS2, Learn++.MT2 outperforms Learn++;
  No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable.
VOC Recognition Database

Algorithm       TS1    TS2    Std. Dev.
Learn++         89%    86%    2.1%
Learn++.MT2     88%    89%    1.9%

Similar observations as with the Wine data:
  Again, performances are virtually identical after TS1.
  After TS2, Learn++.MT2 outperforms Learn++;
  No performance degradation is seen with Learn++.MT2 after the second dataset is introduced. Learn++.MT2 is more stable. A precise termination point is not required.
OCR Database

Algorithm       TS1    TS2    Std. Dev.
Learn++         94%    92%    0.9%
Learn++.MT2     94%    95%    0.6%

Observations:
  Drastically unbalanced data, with the majority of the information contained in Dataset 1, explains the high TS1 performance.
  Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.

OCR Reverse Presentation

Algorithm       TS1    TS2    Std. Dev.
Learn++         85%    91%    0.7%
Learn++.MT2     88%    94%    0.6%

Observations:
  Reversed scenario: little information is initially provided, followed by more substantial data.
  Final performances remain unchanged: the algorithm is immune to the order of presentation.
  The momentary dip in Learn++.MT2 performance as a new dataset is introduced ironically justifies the approach taken. Why?

Some Open Problems

Is the distribution update rule used in Learn++ optimal? Could a weighted combination of the AdaBoost and Learn++ update rules do better?
Is there a better initialization scheme?
Can Learn++ be used in a non-stationary learning environment, where the data distribution changes (in which case it may be necessary to forget some of the previously learned information, i.e., throw away some classifiers)?
How can Learn++ be updated / initialized if the training data is known to be very unbalanced, with new classes being introduced?
Can the performance of Learn++ on incremental learning be theoretically justified?
Does Learn++ create more or less diverse classifiers? An analysis of the algorithm on several diversity measures is needed.
Can Learn++ be used on function approximation problems?
How does Learn++ behave under different combination scenarios?
Other Ensemble Techniques

There are several other ensemble techniques, including


Stacked generalization
Hierarchical mixture of experts
Random forests

Stacked Generalization

[Figure: stacked generalization. First-layer classifiers C1, ..., CN, each with its own parameters, map the input x to outputs h1(x, θ1), ..., hN(x, θN); a second-layer classifier CN+1 with parameters θN+1 combines these outputs to produce the final decision.]

Mixture of Experts

[Figure: mixture of experts. Expert classifiers C1, ..., CT, each with its own parameters, produce outputs h1(x, θ1), ..., hT(x, θT); a gating network CT+1 assigns weights w1, ..., wT, and a pooling / combining system (stochastic selection, winner-takes-all, or weighted average) produces the final decision. The architecture is usually trained with Expectation Maximization.]
