Evaluation Techniques
Created by Shubham Pachori
Overview
Motivation
Metrics for Classifier Evaluation
Methods for Classifier Evaluation
Comparing the Performance of Two Classifiers
Costs in Classification
Ensemble Methods To Improve Accuracy
Motivation
It is important to evaluate a classifier's generalization
performance in order to:
Determine whether to employ the classifier.
[Diagram: the KDD process — selection of target data, preprocessing & cleaning, transformation & feature selection, data mining, and interpretation/evaluation, turning raw data into processed data, patterns, and knowledge.]
Metrics for Classifier Evaluation
Accuracy = (TP + TN) / (P + N)
Error = (FP + FN) / (P + N)
Precision = TP / (TP + FP)
Recall / TP rate = TP / P
FP rate = FP / N
               Predicted class
Actual class   Pos   Neg
Pos            TP    FN
Neg            FP    TN
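As a quick illustration (not from the original slides), the metrics above can be computed directly from the four confusion-matrix counts; the function name is our own:

def classification_metrics(tp, fn, fp, tn):
    """Compute the evaluation metrics above from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives and negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,         # TP rate
        "fp_rate":   fp / n,
    }

# Example: Classifier 1 from the ROC section below (TP=40, FN=60, FP=30, TN=70)
print(classification_metrics(40, 60, 30, 70))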
[Diagram: a classifier is built on the training set and then evaluated on that same training set.]
Q: Why is evaluating on the training set not enough?
A: Because new data will probably not be exactly the same as the training data!
[Diagram: instead, the classifier built on the training set is evaluated on a separate test set.]
Hold-out Method
The hold-out method splits the data into training data and test
data (usually 2/3 for training, 1/3 for testing). Then we build a classifier
using the training data and test it on the test data.
The hold-out method is usually used when we have thousands
of instances, including several hundred instances from each
class.
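A minimal hold-out sketch in Python, assuming scikit-learn and a stand-in dataset generated by make_classification:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, random_state=0)  # stand-in dataset

# Hold out 1/3 for testing (2/3 train, 1/3 test, as in the text);
# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))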
[Diagram: the data is split three ways — a training set feeds the model builder, a validation set is used to evaluate and tune the classifier, and a held-out set is used for the final evaluation of its predictions.]
Stratification
The holdout method reserves a certain amount of the data for testing
and uses the remainder for training (usually one third for testing, the rest for training).
The split should be stratified: each class should be represented in the training
and test sets with approximately the same proportion as in the full dataset.
k-Fold Cross-Validation
k-fold cross-validation avoids overlapping test sets:
First step: split the data into k subsets of equal size.
Second step: use each subset in turn for testing and the remainder for training.
[Diagram: with k = 3, each of the three folds serves once as the test set while the other two are used for training.]
More on Cross-Validation
Standard method for evaluation: stratified 10-fold cross-validation.
Why 10? Extensive experiments have shown that this is
the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation:
E.g. ten-fold cross-validation is repeated ten times and
the results are averaged (this reduces the variance).
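A sketch of repeated stratified 10-fold cross-validation with scikit-learn; the dataset and classifier are stand-ins:

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset

# Stratified 10-fold cross-validation, repeated 10 times; the mean of the
# 100 fold scores is the (variance-reduced) accuracy estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))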
Leave-One-Out Cross-Validation
Leave-one-out is k-fold cross-validation with k set to the number of instances:
each instance in turn forms a single-element test set, and the classifier is
trained on all remaining instances.
Bootstrap Method
Cross-validation uses sampling without replacement:
The same instance, once selected, cannot be selected again
for a particular training/test set.
The bootstrap, in contrast, samples the dataset with replacement to form the training set.
Bootstrap Method
The bootstrap method is also called the 0.632
bootstrap:
A particular instance has a probability of 1 − 1/n of not being
picked in a single draw;
Thus its probability of ending up in the test data (never being picked in n draws) is:

(1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
This means the training data will contain approximately
63.2% of the instances and the test data will contain
approximately 36.8% of the instances.
err = 0.632 × e(test instances) + 0.368 × e(training instances)
The training error gets less weight than the error on the test
data.
Repeat process several times with different replacement
samples; average the results.
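A sketch of the 0.632 bootstrap estimate described above, assuming scikit-learn and a stand-in dataset; the instances never picked (out-of-bag) play the role of the test data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # stand-in dataset
n = len(X)
rng = np.random.default_rng(0)

errors = []
for _ in range(50):  # repeat with different bootstrap samples
    idx = rng.integers(0, n, size=n)          # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # ~36.8% of instances are out-of-bag
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    e_test = 1 - clf.score(X[oob], y[oob])    # error on the test (out-of-bag) data
    e_train = 1 - clf.score(X[idx], y[idx])   # error on the training data
    errors.append(0.632 * e_test + 0.368 * e_train)  # .632 weighting from the text

print(".632 bootstrap error estimate:", np.mean(errors))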
Comparing the Performance of Two Classifiers
These mean error rates are just estimates of the error on the true
population of future data cases. What if the difference between
the two error rates is just due to chance?
The standard deviation of the difference between the two classifiers' error rates over n folds is estimated by:

sd² = (1/n) · Σ_{i=1..n} [ (e_1i − e_2i) − (ē_1 − ē_2) ]²

where e_1i and e_2i are the error rates of classifiers 1 and 2 on fold i, and ē_1, ē_2 are their mean error rates.
Since we only approximate the true standard deviation, we
introduce the T statistic:

T = (d̄ − D) / (sd / √n)

where d̄ = ē_1 − ē_2 is the observed mean difference and D is the true difference; T follows a t distribution with n − 1 degrees of freedom.
[Figure: t distribution with area 1 − α between −t_{α/2} and t_{1−α/2}.]

The confidence interval for the true difference D is:

D = d̄ ± t_{α/2} · sd / √n
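A sketch of the resulting paired t-test on two classifiers' per-fold error rates; the numbers are made up for illustration, and sd uses the 1/n variance from the formula above:

import numpy as np
from scipy import stats

# Per-fold error rates of two classifiers on the same 10 folds (made-up numbers).
e1 = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.14, 0.13])
e2 = np.array([0.10, 0.14, 0.11, 0.12, 0.09, 0.12, 0.13, 0.11, 0.12, 0.11])

d = e1 - e2                        # per-fold differences
n = len(d)
sd = d.std(ddof=0)                 # 1/n variance, matching the formula above
t = d.mean() / (sd / np.sqrt(n))   # T statistic under H0: true difference D = 0
p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value, n-1 degrees of freedom
print("T = %.3f, p = %.4f" % (t, p))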
Costs in Classification
In many applications the different types of classification error have different costs, for example:
Loan decisions
Fault diagnosis
Promotional mailing
Cost Matrices

             Hypothesized class
True class   Pos        Neg
Pos          TP Cost    FN Cost
Neg          FP Cost    TN Cost
Cost-Sensitive Classification
If the classifier outputs a probability for each class, its predictions can be
adjusted to minimize the expected cost.
The expected cost of predicting a class is computed as the dot product of the
vector of class probabilities and the corresponding column of the cost matrix.
[Example: cost matrices assigning a cost of 10 to one type of misclassification while the other outcomes cost little or nothing.]
In Weka, cost-sensitive classification and learning can be applied to any classifier
using the meta-scheme CostSensitiveClassifier.
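A minimal sketch (independent of Weka) of expected-cost minimization as described above; the cost values are illustrative:

import numpy as np

# Cost matrix: rows = true class (pos, neg), columns = hypothesized class.
# Values are illustrative, e.g. a false negative costing 10.
cost = np.array([[0.0, 10.0],   # true pos: TP cost, FN cost
                 [1.0,  0.0]])  # true neg: FP cost, TN cost

def min_cost_prediction(class_probs):
    """Pick the class whose expected cost (probs . cost column) is lowest."""
    expected = class_probs @ cost   # one expected cost per hypothesized class
    return np.argmin(expected), expected

# A classifier that is only 30% sure of 'pos' still predicts pos (index 0),
# because missing a positive is ten times as costly as a false alarm.
print(min_cost_prediction(np.array([0.3, 0.7])))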
Classifier 1 (TPr = 0.4, FPr = 0.3):

         Predicted
True     pos   neg
pos      40    60
neg      30    70

Classifier 2 (TPr = 0.7, FPr = 0.5):

         Predicted
True     pos   neg
pos      70    30
neg      50    50

Classifier 3 (TPr = 0.6, FPr = 0.2):

         Predicted
True     pos   neg
pos      60    40
neg      20    80
ROC Space
[Figure: ROC space plots the TP rate against the FP rate. The ideal classifier sits at the top-left corner (TPr = 1, FPr = 0); 'always positive' lies at the top right, 'always negative' at the bottom left, and the diagonal corresponds to chance.]
Classifier A dominates classifier B if and only if TPrA > TPrB and FPrA < FPrB.
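A small sketch checking the dominance relation on the three classifiers above:

# TP and FP rates of the three classifiers from the confusion matrices above.
classifiers = {1: (0.4, 0.3), 2: (0.7, 0.5), 3: (0.6, 0.2)}

def dominates(a, b):
    """a dominates b iff TPr_a > TPr_b and FPr_a < FPr_b."""
    return a[0] > b[0] and a[1] < b[1]

for i, a in classifiers.items():
    for j, b in classifiers.items():
        if i != j and dominates(a, b):
            print(f"classifier {i} dominates classifier {j}")
# -> only classifier 3 dominates classifier 1; no other pair is comparable.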
Ensemble Methods
Combine multiple models into one!
[Diagram: the data is used to build model 1, model 2, …, model k, which are combined into a single ensemble model.]
Motivations
An ensemble model improves accuracy and robustness over single-model methods.
Applications:
distributed computing
privacy-preserving applications
large-scale data with reusable models
multiple sources of data
[Diagram: an ensemble built from several decision tree models (Model 2 through Model 6).]
Model Averaging
[Diagram: classifier 1, classifier 2, …, classifier k are trained on labeled data; their predictions on unlabeled test data are averaged by the ensemble model to produce the final predictions.]
Ensemble of Classifiers: Consensus
[Diagram: classifier 1, classifier 2, …, classifier k are trained on labeled data; their predictions on unlabeled test data are combined by majority voting to produce the final predictions, as sketched below.]
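A minimal consensus sketch, assuming scikit-learn, three stand-in base classifiers, and a generated binary dataset:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)   # stand-in dataset
X_lab, X_test, y_lab, y_test = train_test_split(X, y, random_state=0)

# k = 3 base classifiers, each trained on the labeled data.
models = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier(), GaussianNB()]
preds = np.array([m.fit(X_lab, y_lab).predict(X_test) for m in models])

# Combine the predictions by majority voting (k is odd, so no ties).
final = (preds.sum(axis=0) > len(models) / 2).astype(int)
print("consensus accuracy:", (final == y_test).mean())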
Bagging
Each round draws a bootstrap sample of the original data, with replacement:

Original Data:      1   2   3   4   5   6   7   8   9   10
Bagging (Round 1):  7   8   10  8   2   5   10  10  5   9
Bagging (Round 2):  1   4   9   1   2   3   2   7   3   2
Bagging (Round 3):  1   8   5   10  5   5   9   6   3   7
Bagging
Consider a one-dimensional example with 10 instances:

x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y:  1    1    1    -1   -1   -1   -1   1    1    1
Bagging
Decision stump: a single-level binary decision tree.
The best entropy-based split for this data is at x <= 0.35 or x <= 0.75.
Either way, a single stump classifies at most 7 of the 10 instances correctly,
so its accuracy is at most 70%.
Bagging
Bagging decision stumps over the bootstrap rounds and taking a majority vote
yields an ensemble accuracy of 100% on this data (see the sketch below).
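A sketch of this bagging example, assuming scikit-learn decision stumps; with enough bootstrap rounds the majority vote typically reaches the 100% accuracy claimed above (the round count and seed are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The 10-instance example above; a single stump is at most 70% accurate here.
x = np.linspace(0.1, 1.0, 10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

rng = np.random.default_rng(1)
votes = np.zeros(len(y))
for _ in range(25):                              # 25 bagging rounds
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample, with replacement
    stump = DecisionTreeClassifier(max_depth=1)  # single-level binary tree
    votes += stump.fit(x[idx], y[idx]).predict(x)  # accumulate +/-1 votes

ensemble_pred = np.where(votes >= 0, 1, -1)      # majority vote
print("ensemble accuracy:", (ensemble_pred == y).mean())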
Boosting
Principles:
Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.
Original Data:       1   2   3   4   5   6   7   8   9   10
Boosting (Round 1):  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2):  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):  4   4   8   10  4   5   4   6   3   4

Instance 4 is hard to classify: its weight keeps increasing, so it is sampled more and more often in successive rounds.
Boosting
Equal weights are assigned to each training tuple (1/d for round 1).
After a classifier Mi is learned, the weights are adjusted to allow
the subsequent classifier, Mi+1, to pay more attention to the tuples
that were misclassified by Mi.
The final boosted classifier, M*, combines the votes of each
individual classifier, and the weight of each classifier's vote is a
function of its accuracy.
AdaBoost is a popular boosting algorithm.
AdaBoost
Input:
Training set D containing d tuples
k rounds
A classification learning scheme
Output:
A composite model
AdaBoost
Data set D contains d class-labeled tuples (X1,y1), (X2,y2), …, (Xd,yd).
Initially, assign an equal weight of 1/d to each tuple.
To generate k base classifiers, we need k rounds or iterations.
In round i, tuples from D are sampled with replacement to form Di (of size d).
Each tuple's chance of being selected depends on its weight.
AdaBoost
Base classifier Mi is derived from the training tuples of Di.
The error of Mi is then estimated using Di.
The weights of the training tuples are adjusted depending on how
they were classified:
Correctly classified: decrease weight
Incorrectly classified: increase weight
AdaBoost
Some classifiers may be better at classifying some
hard tuples than others.
We finally have a series of classifiers that
complement each other!
Error rate of model Mi: error(Mi) = Σ_{j=1..d} w_j × err(Xj),
where err(Xj) = 1 if tuple Xj is misclassified and 0 otherwise.
AdaBoost
error(Mi) affects how the weights of the training tuples are
updated.
If a tuple is correctly classified in round i, its weight is
multiplied by error(Mi) / (1 − error(Mi)); all weights are then normalized.
AdaBoost
The lower a classifier's error rate, the more accurate it is, and
therefore the higher its weight for voting should be.
Weight of classifier Mi's vote: log[(1 − error(Mi)) / error(Mi)]
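A compact AdaBoost sketch following the sampling and weighting scheme described above, using scikit-learn decision stumps on the earlier 10-instance example; the round count and seed are arbitrary:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.linspace(0.1, 1.0, 10).reshape(-1, 1)     # the earlier 10-instance example
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

d = len(y)
w = np.full(d, 1.0 / d)                          # initial weight 1/d per tuple
models, alphas = [], []
rng = np.random.default_rng(0)

for i in range(5):                               # k = 5 rounds
    idx = rng.choice(d, size=d, replace=True, p=w)   # sample D_i by weight
    m = DecisionTreeClassifier(max_depth=1).fit(x[idx], y[idx])
    err_j = (m.predict(x) != y).astype(float)    # err(X_j): 1 if misclassified
    error = np.sum(w * err_j)                    # error(M_i) = sum_j w_j * err(X_j)
    if error == 0 or error >= 0.5:               # skip degenerate rounds
        continue
    w = np.where(err_j == 0, w * error / (1 - error), w)  # shrink correct tuples
    w /= w.sum()                                 # re-normalize the weights
    models.append(m)
    alphas.append(np.log((1 - error) / error))   # weight of this classifier's vote

# Final boosted classifier M*: weighted vote of the base classifiers.
votes = sum(a * m.predict(x) for a, m in zip(alphas, models))
print("boosted training accuracy:", (np.where(votes >= 0, 1, -1) == y).mean())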
Summary
In this seminar report we have considered: metrics and methods for classifier
evaluation (hold-out, cross-validation, bootstrap), comparing the performance
of two classifiers, costs in classification, and ensemble methods (bagging and
boosting) to improve accuracy.
Thank You