Advanced Topics in Pattern Recognition
http://engineering.rowan.edu/~polikar/CLASSES/ECE555
Week 9: AdaBoost & Learn++
AdaBoost
AdaBoost.M1
AdaBoost.M2
AdaBoost.R
Learn++
Graphical drawings from: Pattern Classification, Duda, Hart and Stork, Copyright John Wiley and Sons, 2001.
© 2005, Robi Polikar, Rowan University, Glassboro, NJ
This Week in PR
AdaBoost
Algorithm AdaBoost
- Create a discrete distribution over the training data by assigning a weight to each instance. Initially, the distribution is uniform, hence all weights are the same.
- Draw a subset from this distribution and train a weak classifier with this dataset.
- Compute the error, εt, of this classifier on its own training dataset. Make sure that this error is less than 1/2.
- Test the entire training data on this classifier:
  - If an instance x is correctly classified, reduce its weight in proportion to βt = εt / (1 − εt).
  - If it is misclassified, increase its relative weight in proportion to 1/βt.
- Normalize the weights such that they constitute a distribution.
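As a quick worked instance of this update rule (the multiplicative factor βt = εt / (1 − εt) follows the standard AdaBoost.M1 formulation):

$$\epsilon_t = 0.2 \quad\Rightarrow\quad \beta_t = \frac{0.2}{1 - 0.2} = 0.25$$

so every correctly classified instance has its weight multiplied by 0.25, misclassified instances keep their weights, and the subsequent normalization raises the relative weight of the misclassified ones.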
AdaBoost.M1

[Figure: AdaBoost.M1 block diagram. At each iteration t, a training subset St is drawn from the training data according to the current distribution Dt; classifier Ct is trained on St and produces hypothesis ht. The normalized error βt of ht drives the distribution update, and the final ensemble decision combines h1, ..., hT by weighted majority voting with voting weights log(1/βt).]
AdaBoost.M1

Algorithm AdaBoost.M1
Input: sequence of N examples S = {(x1, y1), ..., (xN, yN)}; weak learning algorithm WeakLearn; integer T specifying the number of iterations.
Initialize D1(i) = 1/N for all i.
Do for t = 1, 2, ..., T:
1. Draw a training subset St according to the distribution Dt and call WeakLearn, obtaining the hypothesis ht.
2. Compute the error of ht:
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
   If εt > 1/2, abort.
3. Set βt = εt / (1 − εt).
4. Update the distribution:
   $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t, & \text{if } h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$
   where Zt is a normalization constant chosen so that Dt+1 is a distribution.
5. Output the final hypothesis:
   $$h_{final}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}$$
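A minimal Python sketch of this loop, assuming scikit-learn decision stumps as the weak learner and weighted fitting in place of explicit resampling (both are illustrative choices, not part of the original algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T=25):
    """Minimal AdaBoost.M1 sketch with depth-1 trees as weak learners."""
    N = len(y)
    D = np.full(N, 1.0 / N)                  # D1: uniform distribution
    hyps, votes = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()                  # weighted training error
        if eps <= 0 or eps >= 0.5:           # M1 requires 0 < eps < 1/2
            break
        beta = eps / (1.0 - eps)
        D[~miss] *= beta                     # shrink correctly classified
        D /= D.sum()                         # renormalize (Z_t)
        hyps.append(h)
        votes.append(np.log(1.0 / beta))     # voting weight log(1/beta_t)
    return hyps, votes

def h_final(hyps, votes, X, classes):
    """Weighted majority: argmax_y of sum over {t: h_t(x)=y} of log(1/beta_t)."""
    tally = np.zeros((len(X), len(classes)))
    for h, v in zip(hyps, votes):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            tally[pred == c, j] += v
    return np.asarray(classes)[tally.argmax(axis=1)]
```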
Weighted Majority Voting

Problem: How many hours a day should our students spend working on homework?

[Figure: five experts each give an estimate (7 and 3 among them); a weight assigner gives them voting weights 0.3, 0.2, 0.25, 0.15 and 0.1, and the weighted majority combination of their answers is 5.96 ≈ 6 hours.]
You Betcha!

The training error of AdaBoost can be shown to satisfy

$$E < \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}$$

where εt is the error of the t-th hypothesis. Since each εt < 1/2, every factor is less than 1, so this product gets smaller and smaller with each added classifier.

But wait: isn't this against Occam's razor?
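To see how fast this bound shrinks, suppose every weak learner achieves εt = 0.3 (a hypothetical but representative value). Then each factor is

$$2\sqrt{0.3 \times 0.7} \approx 0.917$$

and after T = 50 classifiers the bound is 0.917^50 ≈ 0.013: the training error is driven toward zero exponentially fast.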
What to expect:
- training error to decrease with the number of classifiers
- generalization error to increase after a while (overfitting)

What's observed:
- generalization error does not increase, even after many, many iterations
- in fact, it even decreases after the training error reaches zero!

[Figure: training and test error curves on the letters database; from R. Schapire, http://www.cs.princeton.edu/~schapire/]

Is Occam's razor of "simpler is better" wrong? Violated?
The margin of an instance x roughly describes the confidence of the ensemble in its decision. Loosely speaking, the margin of an instance is simply the difference between the total (or fraction of) vote(s) it receives from correctly identifying classifiers and the maximum (or fraction of) vote(s) it receives for any incorrect class:

$$m(x) = \mu_k(x) - \max_{j \neq k} \mu_j(x)$$

where the k-th class is the true class, and μj(x) is the total support (vote) class j receives from all classifiers, normalized such that

$$\sum_{j} \mu_j(x) = 1$$

The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
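A small sketch of this computation (the class supports are assumed to arrive as normalized vote fractions per instance):

```python
import numpy as np

def margins(support, true_class):
    """support: (N, C) array of normalized votes (rows sum to 1);
    true_class: (N,) array of true class indices."""
    N = len(true_class)
    true_votes = support[np.arange(N), true_class]
    rivals = support.copy()
    rivals[np.arange(N), true_class] = -np.inf   # exclude the true class
    return true_votes - rivals.max(axis=1)

# Example: 3 classes; instance 0 is confidently correct, instance 1 is wrong
support = np.array([[0.7, 0.2, 0.1],
                    [0.2, 0.5, 0.3]])
print(margins(support, np.array([0, 2])))        # [ 0.5  -0.2]
```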
Margin Theory

[Figure: cumulative margin distributions on the letters database; as classifiers are added, boosting shifts the training-example margins toward larger values.]
Margin Theory

Schapire et al. show that boosting tends to increase margins on the training data examples, and argue that an ensemble classifier with larger margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble.

More specifically: let H be a finite space of base classifiers. For any θ > 0 and δ > 0, with probability 1 − δ over the random choice of the training data set S, any ensemble E = {h1, ..., hT} ⊆ H combined through weighted majority satisfies

$$P[\text{error}] \le P_S[\, m(x) \le \theta \,] + O\!\left(\sqrt{\frac{\log N \, \log |H|}{N\,\theta^{2}}}\right)$$

N: number of instances
|H|: cardinality of the classifier space; the weaker the classifiers, the smaller |H|

P(error) is independent of the number of classifiers!
AdaBoost.M2
Algorithm AdaBoost.M2
Input: sequence of N examples; weak learning algorithm WeakLearn; integer T specifying the number of iterations.
1. Initialize the distribution over the mislabel pairs: D1(i, y) = 1/(N(|Y| − 1)) for each instance i and each incorrect label y ≠ yi.
2. Call WeakLearn with distribution Dt, obtaining a hypothesis ht : X × Y → [0, 1].
3. Compute the pseudo-loss of ht:
   $$\epsilon_t = \frac{1}{2} \sum_{(i, y)} D_t(i, y)\,\big(1 - h_t(x_i, y_i) + h_t(x_i, y)\big)$$
4. Set βt = εt / (1 − εt).
5. Update distribution Dt:
   $$D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\,\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}$$
   where Zt is a normalization constant chosen so that Dt+1 becomes a distribution function.
Output the final hypothesis:
$$h_{final}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \log\!\left(\frac{1}{\beta_t}\right) h_t(x, y)$$
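A compact numpy sketch of the pseudo-loss and distribution update above; the (N × C) array layout and the convention that entries for the true label carry zero mass are assumptions made for the illustration:

```python
import numpy as np

def m2_update(D, h, y_true):
    """One AdaBoost.M2 round, vectorized.
    D: (N, C) mislabel distribution, zero mass on the true label;
    h: (N, C) hypothesis outputs h_t(x_i, y) in [0, 1];
    y_true: (N,) integer true labels."""
    idx = np.arange(len(y_true))
    h_true = h[idx, y_true][:, None]              # h_t(x_i, y_i) as a column
    eps = 0.5 * (D * (1.0 - h_true + h)).sum()    # pseudo-loss
    beta = eps / (1.0 - eps)
    D_new = D * beta ** (0.5 * (1.0 + h_true - h))
    D_new[idx, y_true] = 0.0                      # true label keeps zero mass
    return D_new / D_new.sum(), beta
```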
Incremental Learning
[Figure: a two-dimensional feature space (Feature 1 vs. Feature 2) in which classifiers C1-C4, trained on Data1 and Data2, carve out different regions of the decision boundary.]
Learn++

Learn++ modifies the distribution update rule to base the update on the decision of the ensemble, not just that of the previous classifier.
Why should this make any difference?
Learn++

[Figure: Learn++ on two data distributions, D1 and D2. Classifiers h1-h8 (C1-C8) are combined by weighted majority voting with their voting weights to produce the learned decision boundary.]
Learn++

[Figure: Learn++ data flow. From each database (Database 1, Database 2, ...), training subsets D1, D2, ..., Dn are drawn and classifiers C1, C2, ..., Cn are trained. Each classifier has an error Et and a performance Pt, and earns a voting weight Wt. After each classifier is added, the composite hypotheses H1, ..., Hn-1 are evaluated; the instances they misclassify drive the distribution update, so later subsets concentrate on the (mis)classified instances.]
Learn++

Algorithm Learn++ (for each new database Sk, k = 1, ..., K):
1. Initialize the instance weights and set the distribution Dt(i) = wt(i) / Σi wt(i) on Sk.
2. Draw a training subset according to Dt, train a weak classifier to obtain hypothesis ht, and compute its error on Sk:
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
   If εt > 1/2, discard ht and redraw; otherwise set βt = εt / (1 − εt).
3. Call weighted majority voting on h1, ..., ht, with voting weights log(1/βt), to obtain the composite hypothesis Ht, and compute the composite error
   $$E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i), \qquad B_t = \frac{E_t}{1 - E_t}$$
4. Update the instance weights based on the ensemble decision:
   $$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases} \;=\; w_t(i)\, B_t^{\,1 - [\, H_t(x_i) \neq y_i \,]}$$
5. Call weighted majority voting on the combined hypotheses Ht and output the final hypothesis:
   $$H_{final}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t:\, H_t(x) = y} \log\frac{1}{B_t}$$
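A sketch of one Learn++ training session in Python, highlighting the key difference from AdaBoost: the instance weights are updated according to the composite hypothesis Ht rather than the latest ht. Depth-3 trees and weighted fitting are illustrative choices, not part of the original algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_majority(ensemble, weights, X):
    """Weighted majority vote over all classifiers generated so far."""
    classes = np.unique(np.concatenate([h.classes_ for h in ensemble]))
    votes = np.zeros((len(X), len(classes)))
    for h, wt in zip(ensemble, weights):
        preds = h.predict(X)
        votes[np.arange(len(X)), np.searchsorted(classes, preds)] += wt
    return classes[votes.argmax(axis=1)]

def learnpp_session(X, y, ensemble, log_inv_betas, T=6):
    """One Learn++ session on a new database (X, y); the ensemble and
    voting weights carry over from earlier sessions."""
    m = len(y)
    w = np.full(m, 1.0 / m)
    for t in range(T):
        D = w / w.sum()                          # current distribution
        h = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=D)
        eps = D[h.predict(X) != y].sum()
        if eps > 0.5:
            continue                             # discard this hypothesis
        ensemble.append(h)
        log_inv_betas.append(np.log((1 - eps) / max(eps, 1e-12)))
        # the composite (ensemble) decision drives the weight update
        H = weighted_majority(ensemble, log_inv_betas, X)
        E = D[H != y].sum()
        if E > 0.5:
            ensemble.pop()
            log_inv_betas.pop()
            continue
        B = E / (1 - E + 1e-12)
        w[H == y] *= B                           # focus on ensemble errors
    return ensemble, log_inv_betas
```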
Simulation Results
Odorant Identification
[Figure: normalized responses (0-1) of six polymer-coated sensors (1-6) to five volatile organic compounds: Ethanol, Toluene, Xylene, TCE, and Octane.]
1. APZ: Apiezon
2. PIB: Polyisobutylene
3. DEGA: Poly(diethylene glycol adipate)
4. SG: Solgel
5. OV275: Poly(siloxane)
Data Distribution

Class   S1   S2   S3   TEST
ET      20    5    5     34
OC      20    5    5     34
TL      40    5    5     62
TCE      0   25    5     34
XL       0    0   40     40
Odorant Identification Results

PNN
Dataset     S1       S2       S3       TEST
TS1 (2)     93.70%   --       --       58.80%
TS2 (3)     73.70%   92.50%   --       67.70%
TS3 (3)     76.30%   82.50%   85.00%   83.80%

RBF
Dataset     S1       S2       S3       TEST
TS1 (5)     97.50%   --       --       59.30%
TS2 (6)     81.20%   97.00%   --       67.60%
TS3 (7)     76.20%   95.00%   90.00%   86.30%

MLP
Dataset     S1       S2       S3       TEST
TS1 (10)    87.50%   --       --       58.80%
TS2 (5)     75.00%   82.50%   --       71.10%
TS3 (9)     71.30%   85.00%   86.70%   87.20%

(Numbers in parentheses indicate the number of classifiers generated in each training session.)
Ultrasonic Weld Inspection Database

[Figure: weld cross-section showing the defect types to be identified: slag, porosity, crack, lack of fusion, and incomplete penetration (North / South scan directions).]

Data Distribution

Class       S1    S2    S3    TEST
LOF        300   200   150    200
SLAG       300   200   150    200
CRACK        0   200   137    150
POROSITY     0     0    99    125
[Figure: example ultrasonic signals for crack, lack of fusion, slag, and porosity defects.]
UWI Results

PNN
Dataset     S1       S2       S3       TEST
TS1 (7)     99.50%   --       --       48.90%
TS2 (7)     83.60%   93.00%   --       62.40%
TS3 (14)    71.50%   75.00%   99.40%   76.10%

RBF
Dataset     S1       S2       S3       TEST
TS1 (9)     90.80%   --       --       47.70%
TS2 (5)     78.30%   89.50%   --       60.90%
TS3 (14)    69.80%   75.80%   94.70%   76.40%

MLP
Dataset     S1       S2       S3       TEST
TS1 (8)     98.20%   --       --       48.90%
TS2 (3)     84.70%   98.70%   --       66.80%
TS3 (17)    80.30%   86.50%   97.00%   78.40%
Optical Character Recognition (OCR)

Handwritten character recognition problem:
- 2997 instances, 62 attributes, 10 classes
- Divided into four training sets, S1-S4 (2150 instances), and a TEST set (1797 instances) for testing
- Different datasets contain different subsets of the classes
OCR Database

Data Distribution

Class   0     1     2     3     4     5     6     7     8     9
S1      100   0     100   0     100   0     100   0     100   0
S2      50    150   50    150   50    150   50    0     0     50
S3      50    50    50    50    50    50    0     150   0     100
S4      25    0     25    25    0     25    100   50    150   50
TEST    178   182   177   183   181   182   181   179   174   180
OCR Results

PNN
Dataset     S1       S2       S3       S4       TEST
TS1 (2)     99.80%   --       --       --       48.70%
TS2 (5)     80.80%   96.30%   --       --       69.10%
TS3 (6)     78.20%   89.00%   98.20%   --       82.80%
TS4 (3)     92.40%   88.70%   94.00%   88.00%   86.70%

RBF
Dataset     S1       S2       S3       S4       TEST
TS1 (4)     98.00%   --       --       --       47.80%
TS2 (6)     81.00%   96.50%   --       --       73.20%
TS3 (8)     77.00%   88.40%   93.60%   --       79.80%
TS4 (15)    93.40%   80.60%   92.70%   90.00%   85.90%

MLP
Dataset     S1       S2       S3       S4       TEST
TS1 (18)    96.60%   --       --       --       46.60%
TS2 (30)    89.80%   87.10%   --       --       68.90%
TS3 (23)    86.00%   89.40%   92.00%   --       82.00%
TS4 (3)     94.80%   87.90%   92.20%   87.30%   87.00%
Learn++ Implementation Issues

Learn++.MT: The Concept

[Figure: when instances of a previously unseen class ("4?") are presented, classifiers from earlier training sessions that have never seen that class can outvote the newer classifiers that have, causing the ensemble to misclassify the new class.]
Learn++.MT

Learn++.MT computes, for each class c, a preliminary confidence: the weight of the votes class c receives, relative to the total weight of the classifiers trained on class c,

$$P_c(x_i) = \frac{\sum_{t:\, h_t(x_i) = c} W_t}{\sum_{t:\, c \in CTr_t} W_t}$$

where CTr_t denotes the set of classes used to train classifier t. It then updates (lowers) the weights of the classifiers that have not been trained with the new class:

$$W_{t:\, c \notin CTr_t}(x_i) = W_{t:\, c \notin CTr_t} \left(1 - P_c(x_i)\right)$$
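A sketch of this dynamic weight adjustment for a single instance; the per-classifier trained-class sets are assumed available, and classifiers are assumed never to vote for a class they have not seen:

```python
import numpy as np

def mt_adjust_weights(votes, W, trained_on, n_classes):
    """votes: (T,) predicted class per classifier for one instance x;
    W: (T,) voting weights; trained_on: list of sets of classes each
    classifier was trained on. Lowers the weights of classifiers that
    never saw a class currently drawing confident votes."""
    W = W.astype(float).copy()
    for c in range(n_classes):
        saw_c = np.array([c in s for s in trained_on])
        denom = W[saw_c].sum()
        if denom == 0:
            continue
        P_c = W[votes == c].sum() / denom     # preliminary confidence
        W[~saw_c] *= (1.0 - min(P_c, 1.0))    # lower uninformed weights
    return W
```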
Learn++.MT

Algorithm Learn++.MT (sketch; for each database Sk):
Let $eT_k = \sum_{j=1}^{k-1} T_j$ denote the number of classifiers generated in the previous k − 1 training sessions.
1. Initialize and normalize the instance weights w(i) so that D is a distribution.
2. Train a classifier on a subset drawn according to Dt and compute its error
   $$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
3. Assign the voting weights Wt, adjusting them per instance with the preliminary-confidence rule of the previous slide; the relevant normalization is over the classifiers trained on class c:
   $$Z_c = \sum_{t:\, c \in CTr_t} W_t$$
4. Obtain the composite hypothesis Ht and update the instance weights based on the ensemble decision:
   $$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$$
Learn++.MT Simulation: VOC Data

Class       1    2    3    4    5
Dataset 1   20   20   40    -    -
Dataset 2   10   10   10   25    -
Dataset 3   10   10   10   15   40
Test        24   24   52   24   40

(Classes 4 and 5 are introduced only in the later datasets; dashes mark classes absent from a dataset.)
Test procedure:
Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize each algorithm's performance. Learn++ generated its best results with 6 classifiers on the first dataset, 12 on the next, and 18 on the last; Learn++.MT performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.
Learn++.MT Simulation: VOC Data

Training    # Classifiers Added        Performance
Session     Learn++    Learn++.MT      Learn++    Learn++.MT
TS1         6          6               55%        54%
TS2         12         4               64%        63%
TS3         18         6               69%        81%
Learn++.MT2

If one dataset has substantially more data than the other(s), and a comparable number of classifiers is generated on each, the ensemble decision might be unfairly biased towards the data with the lower cardinality.

Under the generally valid assumptions that
- no instance is repeated in any dataset, and
- the noise distribution remains relatively unchanged among datasets,
it is reasonable to believe that the dataset with more instances carries more information. Classifiers generated with such data should therefore be weighted more heavily.
Learn++.MT2 Algorithm

Each classifier t receives a class-conditional voting weight

$$w_{t,c} = p_t \, \frac{n_c}{N_c}$$

For each classifier, this ratio relates the number of instances from a particular class used for training that classifier (nc) to the number of instances from that class used for training all classifiers thus far within the ensemble (Nc); pt is the classifier's performance.

The final decision is made similarly to Learn++, but using the class-conditional weights:

$$H_{final}(x_i) = \arg\max_{c \in Y_k} \sum_{t:\, h_t(x_i) = c} w_{t,c}$$
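A short sketch of the class-conditional weighting and the final decision; the array shapes and per-classifier training counts are assumptions made for the illustration:

```python
import numpy as np

def mt2_weights(perf, class_counts):
    """perf: (T,) performance p_t of each classifier;
    class_counts: (T, C) number of class-c instances used to train
    classifier t. Returns (T, C) class-conditional voting weights."""
    N_c = class_counts.sum(axis=0, keepdims=True)   # per-class totals
    return perf[:, None] * class_counts / np.maximum(N_c, 1)

def mt2_decide(votes, w):
    """votes: (T,) class predicted by each classifier for one instance;
    w: (T, C) class-conditional weights. Returns the winning class."""
    support = np.zeros(w.shape[1])
    for t, c in enumerate(votes):
        support[c] += w[t, c]       # each classifier contributes its
    return support.argmax()         # weight for the class it votes for
```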
Experimental Setup

Base classifiers were all single-layer MLPs, normally incapable of learning incrementally, with 12-40 nodes and an error goal of 0.05-0.025.
In each case the data distributions were designed to simulate unbalanced data.
To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, such that the tests give an accurate and unbiased comparison of the algorithms.
Wine Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        88%    84%    2.9%
Learn++.MT2    88%    89%    1.6%

Observations:
- After the initial training session (TS1), the performances of both algorithms are the same.
- After TS2, Learn++.MT2 outperforms Learn++.
- No performance degradation is seen with Learn++.MT2 after the second dataset is introduced: Learn++.MT2 is more stable.
VOC Recognition Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        89%    86%    2.1%
Learn++.MT2    88%    89%    1.9%
OCR Database

Algorithm      TS1    TS2    Std. Dev.
Learn++        94%    92%    0.9%
Learn++.MT2    94%    95%    0.6%

Observations:
- Drastically unbalanced data, with the majority of the information contained in Dataset 1, explains the high TS1 performance.
- Due to the imbalance, Learn++ performance declines with Dataset 2, whereas Learn++.MT2 provides a modest gain.
OCR Reverse Presentation

Algorithm      TS1    TS2    Std. Dev.
Learn++        85%    91%    0.7%
Learn++.MT2    88%    94%    0.6%

Observations:
- Reversed scenario: little information is initially provided, followed by more substantial data.
- The final performances remain essentially unchanged: the algorithm is immune to the order of presentation.
- The momentary dip in Learn++.MT2 performance as a new dataset is introduced ironically justifies the approach taken. Why?
Other Ensemble Techniques
Stacked Generalization

[Figure: stacked generalization architecture. The input x is fed to N first-layer classifiers C1, ..., CN with parameters θ1, ..., θN, producing outputs h1(x, θ1), ..., hN(x, θN). These outputs form the input to a second-layer classifier CN+1 with parameters θN+1, which makes the final decision.]
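As a concrete illustration, scikit-learn ships a stacking meta-estimator; the particular base learners chosen here are arbitrary placeholders, not anything prescribed by the slide:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First-layer (Tier-1) classifiers
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Second-layer classifier trained on the first layer's outputs
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression())
print(cross_val_score(stack, X, y, cv=5).mean())
```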
Mixture of Experts

[Figure: mixture-of-experts architecture. The input x is fed to T expert classifiers C1, ..., CT with parameters θ1, ..., θT, producing outputs h1(x, θ1), ..., hT(x, θT). A gating network assigns the weights w1, ..., wT, and a pooling/combining system forms the final decision, e.g., by stochastic selection, winner-takes-all, or a weighted average.]
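A minimal sketch of the combination step, assuming the experts and a softmax gating network are already trained (all names here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_combine(x, experts, gate_W, mode="weighted_average"):
    """experts: list of callables, each mapping x -> class-probability vector;
    gate_W: (T, d) gating-network weights mapping x to one score per expert."""
    outputs = np.stack([h(x) for h in experts])     # (T, C) expert outputs
    w = softmax(gate_W @ x)                         # (T,) gating weights
    if mode == "weighted_average":
        return (w[:, None] * outputs).sum(axis=0)   # blend expert opinions
    if mode == "winner_takes_all":
        return outputs[w.argmax()]                  # trust the top expert
    if mode == "stochastic":
        return outputs[np.random.choice(len(experts), p=w)]
```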