
Data Mining Models &

Evaluation Techniques

Mentored By: Prof. K.P. Agarwal

Created By
Shubham Pachori

Overview
Motivation
Metrics for Classifier Evaluation
Methods for Classifier Evaluation
Comparing the Performance of Two Classifiers
Costs in Classification
Ensemble Methods to Improve Accuracy

Motivation
It is important to evaluate a classifier's generalization
performance in order to:
Determine whether to employ the classifier
(for example, when learning the effectiveness of medical treatments
from limited-size data, it is important to estimate the accuracy of
the classifier);
Optimize the classifier
(for example, when post-pruning decision trees we must evaluate the
accuracy of the decision tree at each pruning step).

Model Evaluation in the KDD Process

(Figure: the KDD process. Data -> Selection -> Target data ->
Preprocessing & cleaning -> Processed data -> Transformation &
feature selection -> Transformed data -> Data Mining -> Patterns ->
Interpretation / Evaluation -> Knowledge. Model evaluation corresponds
to the Interpretation / Evaluation step.)

Metrics for Classifier Evaluation

Accuracy = (TP+TN)/(P+N)
Error = (FP+FN)/(P+N)
Precision = TP/(TP+FP)
Recall / TP rate = TP/P
FP rate = FP/N

Confusion matrix (rows = actual class, columns = predicted class):

                 Predicted Pos   Predicted Neg
  Actual Pos          TP              FN
  Actual Neg          FP              TN
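As a quick illustration, the sketch below (Python, with hypothetical
confusion-matrix counts chosen for the example) computes the metrics above:

```python
# Minimal sketch: compute the metrics above from confusion-matrix counts.
def classification_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # actual positives and negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,         # also the TP rate
        "fp_rate":   fp / n,
    }

# Hypothetical counts
print(classification_metrics(tp=40, fn=60, fp=30, tn=70))
```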

How to Estimate the Metrics?


We can use:
Training data;
Independent test data;
Hold-out method;
k-fold cross-validation method;
Leave-one-out method;
Bootstrap method;
Ensemble methods

Estimation with Training Data

(Figure: a classifier is trained and tested on the same training set.)

The accuracy/error estimates on the training data are not good indicators
of performance on future data.

Q: Why?
A: Because new data will probably not be exactly the same as the training data!

The accuracy/error estimates on the training data measure the degree of
the classifier's overfitting.

Estimation with Independent Test Data

(Figure: a classifier is trained on the training set and evaluated on a
separate test set.)

Estimation with independent test data is used when we have plenty of data
and there is a natural way of forming training and test data.

For example, Quinlan in 1987 reported experiments in a medical domain for
which the classifiers were trained on data from 1985 and tested on data
from 1986.

Hold-out Method
The hold-out method splits the data into training data and test data
(usually 2/3 for training, 1/3 for testing). We then build a classifier
using the training data and test it using the test data.

(Figure: the data is split into a training set used to build the
classifier and a test set used to evaluate it.)

The hold-out method is usually used when we have thousands of instances,
including several hundred instances from each class.
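A minimal hold-out sketch in Python, assuming scikit-learn is available
and using its built-in iris data purely as a stand-in for "Data":

```python
# Hold-out method sketch: 2/3 train, 1/3 test (stratified by class).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```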

Classification: Train, Validation, Test Split

(Figure: data with known results is split into a training set, a
validation set, and a final test set. The model builder learns a
classifier from the training set, the classifier is tuned by evaluating
its predictions on the validation set, and the final evaluation is done
once on the final test set.)

The test data can't be used for parameter tuning!

Making the Most of the Data

Once evaluation is complete, all the data can be used to build the
final classifier.
Generally, the larger the training data, the better the classifier
(but returns diminish).
The larger the test data, the more accurate the error estimate.

Stratification
The hold-out method reserves a certain amount of the data for testing
and uses the remainder for training.
Usually: one third for testing, the rest for training.

For unbalanced datasets, the samples might not be representative:
few or no instances of some classes, e.g. fraudulent transaction
detection or medical diagnostic tests.

Stratified sampling is an advanced version of balancing the data:
make sure that each class is represented with approximately equal
proportions in both subsets.

Repeated Holdout Method


Holdout estimate can be made more reliable by
repeating the process with different subsamples.
In each iteration, a certain proportion is randomly
selected for training (possibly with stratification).
The error rates on the different iterations are averaged
to yield an overall error rate.

This is called the repeated holdout method.

Repeated Holdout Method, 2

Still not optimal: the different test sets overlap, but we would like
all instances from the data to be tested at least once.

Can we prevent overlapping?

(Witten & Eibe)

k-Fold Cross-Validation
k-fold cross-validation avoids overlapping test sets:
First step: the data is split into k subsets of equal size;
Second step: each subset in turn is used for testing and the remainder
for training.
The subsets are stratified before the cross-validation.
The estimates are averaged to yield an overall estimate.

(Figure: the data is divided into k folds; in each run one fold is the
test set and the remaining folds form the training set.)
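A sketch of stratified k-fold cross-validation in Python, assuming
scikit-learn and again using its iris data as a placeholder:

```python
# Stratified k-fold cross-validation sketch (k = 10).
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracy:", scores)
print("overall estimate :", scores.mean())
```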

More on Cross-Validation
Standard method for evaluation: stratified 10-fold cross-validation.
Why 10? Extensive experiments have shown that this is the best choice
to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation:
e.g. ten-fold cross-validation is repeated ten times and the results
are averaged (reduces the variance).

Leave-One-Out Cross-Validation

Leave-one-out is a particular form of cross-validation:
Set the number of folds to the number of training instances;
i.e., for n training instances, build the classifier n times.

Makes best use of the data.
Involves no random sub-sampling.
Very computationally expensive.

Leave-One-Out Cross-Validation and Stratification

A disadvantage of leave-one-out CV is that stratification is not
possible: it guarantees a non-stratified sample because there is only
one instance in the test set!

Extreme example: a random dataset split equally into two classes:
The best inducer predicts the majority class;
50% accuracy on fresh data;
The leave-one-out CV estimate is 100% error!

Bootstrap Method
Cross-validation uses sampling without replacement:
The same instance, once selected, cannot be selected again for a
particular training/test set.

The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to form a new
dataset of n instances;
Use this data as the training set;
Use the instances from the original dataset that don't occur in the
new training set for testing.

Bootstrap Method
The bootstrap method is also called the 0.632 bootstrap:
A particular instance has a probability of 1 - 1/n of not being picked
in one draw;
Thus its probability of ending up in the test data (never being picked
in n draws) is:

(1 - 1/n)^n ≈ e^(-1) ≈ 0.368

This means the training data will contain approximately 63.2% of the
instances and the test data will contain approximately 36.8% of the
instances.

Estimating Error with the Bootstrap Method

The error estimate on the test data will be very pessimistic because
the classifier is trained on just ~63% of the instances.
Therefore, combine it with the training error:

err = 0.632 × e_test instances + 0.368 × e_training instances

The training error gets less weight than the error on the test data.
Repeat the process several times with different replacement samples;
average the results.
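A sketch of the 0.632 bootstrap estimate under these assumptions
(numpy for resampling, scikit-learn for a placeholder base classifier;
the weights 0.632/0.368 follow the formula above):

```python
# 0.632 bootstrap sketch: resample with replacement, train on the bootstrap
# sample, test on the out-of-bag instances, and combine the two error rates.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
estimates = []
for _ in range(50):                               # repeat with different resamples
    idx = rng.integers(0, n, size=n)              # sample n instances with replacement
    oob = np.setdiff1d(np.arange(n), idx)         # out-of-bag (test) instances
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    e_test = 1 - clf.score(X[oob], y[oob])        # error on test instances
    e_train = 1 - clf.score(X[idx], y[idx])       # error on training instances
    estimates.append(0.632 * e_test + 0.368 * e_train)
print("bootstrap error estimate:", np.mean(estimates))
```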

Comparing Two Classifier Models

Assume that we have two classifiers, M1 and M2, and we would like to
know which one is better for a classification problem.
We test the classifiers on n test data sets D1, D2, ..., Dn, and we
receive error rate estimates e11, e12, ..., e1n for classifier M1 and
error rate estimates e21, e22, ..., e2n for classifier M2.
Using these estimates we can compute the mean error rate e1 for
classifier M1 and the mean error rate e2 for classifier M2.

These mean error rates are just estimates of the error on the true
population of future data cases. What if the difference between the two
error rates is just attributed to chance?

Comparing Two Classifier Models

We note that the error rate estimates e11, e12, ..., e1n for classifier
M1 and the error rate estimates e21, e22, ..., e2n for classifier M2 are
paired. Thus, we consider the differences d1, d2, ..., dn where
dj = e1j - e2j.
The differences d1, d2, ..., dn are instantiations of n random variables
with mean μ_D and standard deviation σ_D.
We need to establish confidence intervals for μ_D in order to decide
whether the difference in the generalization performance of the
classifiers M1 and M2 is statistically significant or not.

Comparing Two Classifier Models

Since the standard deviation σ_D is unknown, we approximate it using
the sample standard deviation s_d:

s_d = sqrt( (1/n) Σ_{i=1..n} [ (e1i - e2i) - (ē1 - ē2) ]² )

Since we approximate the true standard deviation σ_D, we introduce the
T statistic:

T = (d̄ - μ_D) / (s_d / √n)

Comparing Two Classifier Models

The T statistic is governed by the t-distribution with n - 1 degrees
of freedom.

(Figure: t-distribution density; the area between the critical values
t_{α/2} and t_{1-α/2} equals 1 - α.)
Comparing Two Classifier Models

If d̄ and s_d are the mean and standard deviation of the normally
distributed differences of n random pairs of errors, a (1 - α)100%
confidence interval for μ_D = μ1 - μ2 is:

d̄ - t_{α/2} · s_d/√n  ≤  μ_D  ≤  d̄ + t_{α/2} · s_d/√n

where t_{α/2} is the t-value with v = n - 1 degrees of freedom, leaving
an area of α/2 to the right.

Thus, if the interval contains 0.0, we conclude at significance level α
that the difference is not statistically significant.
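A sketch of this paired comparison in Python, using hypothetical per-set
error rates for M1 and M2 (scipy supplies the t critical value):

```python
# Paired comparison sketch: confidence interval for the mean difference in error.
import numpy as np
from scipy import stats

# Hypothetical error rates of M1 and M2 on the same n = 10 test sets.
e1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13])
e2 = np.array([0.14, 0.16, 0.13, 0.15, 0.12, 0.18, 0.15, 0.16, 0.15, 0.14])

d = e1 - e2
n = len(d)
d_bar, s_d = d.mean(), d.std(ddof=1)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)       # alpha = 0.05
half_width = t_crit * s_d / np.sqrt(n)
lo, hi = d_bar - half_width, d_bar + half_width
print(f"95% CI for the mean difference: [{lo:.4f}, {hi:.4f}]")
print("significant" if not (lo <= 0.0 <= hi) else "not significant at alpha = 0.05")
```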

Counting the Costs

In practice, different types of classification errors often incur
different costs.
Examples:
Terrorist profiling ("not a terrorist" is correct 99.99% of the time)
Loan decisions
Fault diagnosis
Promotional mailing

Cost Matrices

                 Hypothesized Pos   Hypothesized Neg
  True Pos            TP Cost            FN Cost
  True Neg            FP Cost            TN Cost

Usually, TP Cost and TN Cost are set equal to 0!

Cost-Sensitive Classification
If the classifier outputs a probability for each class, it can be
adjusted to minimize the expected cost of its predictions.
The expected cost is computed as the dot product of the vector of class
probabilities and the appropriate column of the cost matrix.

                 Hypothesized Pos   Hypothesized Neg
  True Pos            TP Cost            FN Cost
  True Neg            FP Cost            TN Cost

Cost-Sensitive Classification

Assume the cost matrix below (FN cost = 5, FP cost = 10, TP and TN
costs = 0) and that the classifier returns for an instance p_pos = 0.6
and p_neg = 0.4.
Then the expected cost if the instance is classified as positive is
0.6 * 0 + 0.4 * 10 = 4. The expected cost if the instance is classified
as negative is 0.6 * 5 + 0.4 * 0 = 3. To minimize the cost, the instance
is classified as negative.

                 Hypothesized Pos   Hypothesized Neg
  True Pos              0                  5
  True Neg             10                  0
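A sketch of this expected-cost rule in Python, using the example
probabilities and cost matrix above:

```python
# Cost-sensitive prediction sketch: pick the class with the lowest expected cost.
import numpy as np

# Rows = true class (pos, neg), columns = hypothesized class (pos, neg).
cost = np.array([[0, 5],     # true pos: TP cost = 0, FN cost = 5
                 [10, 0]])   # true neg: FP cost = 10, TN cost = 0
p = np.array([0.6, 0.4])     # classifier's estimated P(pos), P(neg)

expected = p @ cost          # expected cost of predicting each class
print("expected cost (pos, neg):", expected)          # -> [4.0, 3.0]
print("predict:", ["pos", "neg"][int(np.argmin(expected))])
```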

Cost-Sensitive Learning

Simple methods for cost-sensitive learning:
Resampling of instances according to costs;
Weighting of instances according to costs.

(Figure: the same example cost matrix as on the previous slide.)

In Weka, cost-sensitive classification and learning can be applied to
any classifier using the meta scheme CostSensitiveClassifier.

ROC Curves and Analysis

Three classifiers evaluated on the same data (rows = true class,
columns = predicted class):

  Classifier 1          Classifier 2          Classifier 3
        pos  neg              pos  neg              pos  neg
  pos    40   60        pos    70   30        pos    60   40
  neg    30   70        neg    50   50        neg    20   80

Classifier 1: TPr = 0.4, FPr = 0.3
Classifier 2: TPr = 0.7, FPr = 0.5
Classifier 3: TPr = 0.6, FPr = 0.2

ROC Space

(Figure: the ROC space, plotting TP rate against FP rate. The ideal
classifier sits in the upper-left corner, the "always positive" and
"always negative" classifiers sit in the opposite corners, and the
diagonal is the "chance" line; the complementary axes show the false
negative rate and true negative rate.)

Dominance in the ROC Space

Classifier A dominates classifier B if and only if TPrA > TPrB and FPrA < FPrB.
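The sketch below recomputes the three TPr/FPr pairs from the confusion
matrices above and checks the dominance condition for one pair:

```python
# ROC point and dominance sketch for the three confusion matrices above.
def roc_point(tp, fn, fp, tn):
    return tp / (tp + fn), fp / (fp + tn)          # (TP rate, FP rate)

points = {
    "Classifier 1": roc_point(40, 60, 30, 70),
    "Classifier 2": roc_point(70, 30, 50, 50),
    "Classifier 3": roc_point(60, 40, 20, 80),
}
for name, (tpr, fpr) in points.items():
    print(f"{name}: TPr = {tpr:.1f}, FPr = {fpr:.1f}")

def dominates(a, b):                               # a, b are (TPr, FPr) pairs
    return a[0] > b[0] and a[1] < b[1]

print("Classifier 3 dominates Classifier 1:",
      dominates(points["Classifier 3"], points["Classifier 1"]))
```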

Ensemble Methods

Combine multiple models into one!

(Figure: the data is used to train models 1 through k, whose outputs
are combined into a single ensemble model.)

Applications: classification, clustering, collaborative filtering,
anomaly detection.

Motivations
An ensemble model improves accuracy and robustness over single-model
methods.
Applications:
distributed computing
privacy-preserving applications
large-scale data with reusable models
multiple sources of data

Efficiency: a complex problem can be decomposed into multiple
sub-problems that are easier to understand and solve (divide-and-conquer
approach).

Why Ensembles Work (1)

Intuition: combining diverse, independent opinions in human
decision-making acts as a protective mechanism (e.g. a stock portfolio).

Uncorrelated error reduction:
Suppose we have 5 completely independent classifiers for majority voting.
If the accuracy of each is 70%, the majority vote is correct with
probability 10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + 0.7^5 ≈ 83.7%.
With 101 such classifiers, the majority vote accuracy rises to ~99.9%.
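The numbers above can be reproduced with the binomial distribution; a
minimal sketch assuming scipy is available:

```python
# Majority-vote accuracy sketch for k independent classifiers, each with accuracy p.
from scipy.stats import binom

def majority_vote_accuracy(k, p):
    # Probability that more than half of the k independent votes are correct.
    return 1 - binom.cdf(k // 2, k, p)

print(majority_vote_accuracy(5, 0.7))     # ~0.837
print(majority_vote_accuracy(101, 0.7))   # ~0.999
```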

Why Ensembles Work (2)

(Figure: several models, each capturing a different part of some
unknown distribution.)

The ensemble gives the global picture!

Why Ensembles Work (3)

Overcome the limitations of a single hypothesis:
The target function may not be implementable with individual
classifiers, but may be approximated by model averaging.

(Figure: the decision boundary of a single decision tree versus the
boundary obtained by model averaging.)

Ensemble of Classifiers: Learn to Combine

(Figure: at training time, classifiers 1 through k are built from
labeled data and an ensemble model learns how to combine their outputs;
at test time, the ensemble model produces the final predictions for
unlabeled data.)

Learn the combination from labeled data.
Algorithms: boosting, stacked generalization, rule ensembles, Bayesian
model averaging.

Ensemble of Classifiers: Consensus

(Figure: classifiers 1 through k are trained on labeled data; at test
time their predictions on unlabeled data are combined by majority
voting to produce the final predictions.)

Algorithms: bagging, random forest, random decision trees, model
averaging of probabilities.

Pros and Cons

Combine by learning
  Pros: gets useful feedback from labeled data; can potentially improve
  accuracy.
  Cons: needs to keep the labeled data to train the ensemble; may
  overfit the labeled data; cannot work when no labels are available.

Combine by consensus
  Pros: does not need labeled data; can improve the generalization
  performance.
  Cons: no feedback from the labeled data; requires the assumption that
  consensus is better.

Bagging

Also known as bootstrap aggregation:
Sample uniformly with replacement;
Build a classifier on each bootstrap sample.
0.632 bootstrap: each bootstrap sample Di contains approx. 63.2% of the
original training data; the remaining (36.8%) instances are used as the
test set.

Example of three bootstrap rounds over 10 training instances:

  Original data:     1  2  3  4  5  6  7  8  9  10
  Bagging (Round 1): 7  8 10  8  2  5 10 10  5   9
  Bagging (Round 2): 1  4  9  1  2  3  2  7  3   2
  Bagging (Round 3): 1  8  5 10  5  5  9  6  3   7
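A bagging sketch in Python, assuming scikit-learn 1.2 or later (where the
base learner parameter is named estimator) and a synthetic dataset in
place of the slide's example; decision stumps serve as the weak base
learners:

```python
# Bagging sketch: bootstrap-sample the training data and aggregate decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)                 # weak base learner
bag = BaggingClassifier(estimator=stump, n_estimators=50,
                        bootstrap=True, random_state=0)
print("single stump:", cross_val_score(stump, X, y, cv=10).mean())
print("bagged      :", cross_val_score(bag, X, y, cv=10).mean())
```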

Bagging
Accuracy of bagging:

Acc(M) = Σ_{i=1..k} ( 0.632 * Acc(Mi)_test_set + 0.368 * Acc(Mi)_train_set )

Works well for small data sets.
Example: a one-dimensional data set with
X = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
and class labels y ∈ {+1, -1}.

Bagging
Decision stump:
A single-level binary decision tree;
Entropy-based splits such as x <= 0.35 or x <= 0.75;
Accuracy at most 70% on this data.

Bagging:
Accuracy of the ensemble classifier: 100%.

Bagging: Final Points

Works well if the base classifiers are unstable.
Increases accuracy because it reduces the variance of the individual
classifiers.
Does not focus on any particular instance of the training data;
therefore, it is less susceptible to model over-fitting when applied
to noisy data.
What if we want to focus on particular instances of the training data?

Boosting
Principles:
Boost a set of weak learners to a strong learner.
An iterative procedure that adaptively changes the distribution of the
training data by focusing more on previously misclassified records.
Initially, all N records are assigned equal weights.
Unlike bagging, the weights may change at the end of a boosting round.

Boosting
Records that are wrongly classified will have their weights increased;
records that are classified correctly will have their weights decreased.

  Original data:      1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1): 7  3  2  8  7  9  4 10  6   3
  Boosting (Round 2): 5  4  9  4  2  5  1  7  4   2
  Boosting (Round 3): 4  4  8 10  4  5  4  6  3   4

Example 4 is hard to classify: its weight is increased, therefore it is
more likely to be chosen again in subsequent rounds.

Boosting
Equal weights are assigned to each training tuple (1/d for round 1).
After a classifier Mi is learned, the weights are adjusted to allow the
subsequent classifier Mi+1 to pay more attention to the tuples that were
misclassified by Mi.
The final boosted classifier M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a function of
its accuracy.
AdaBoost is a popular boosting algorithm.

AdaBoost
Input:
A training set D containing d tuples;
k, the number of rounds;
A classification learning scheme.

Output:
A composite model.

AdaBoost
Data set D contains d class-labeled tuples (X1, y1), (X2, y2), ..., (Xd, yd).
Initially assign an equal weight of 1/d to each tuple.
To generate k base classifiers, we need k rounds (iterations).
In round i, tuples from D are sampled with replacement to form Di (of
size d).
Each tuple's chance of being selected depends on its weight.

AdaBoost
The base classifier Mi is derived from the training tuples of Di.
The error of Mi is tested using Di.
The weights of the training tuples are adjusted depending on how they
were classified:
Correctly classified: decrease weight;
Incorrectly classified: increase weight.

The weight of a tuple indicates how hard it is to classify (directly
proportional).

AdaBoost
Some classifiers may be better at classifying some hard tuples than
others.
We finally have a series of classifiers that complement each other!

Error rate of model Mi:  error(Mi) = Σ_{j=1..d} w_j * err(Xj)

where err(Xj) is the misclassification error of Xj (1 if misclassified,
0 otherwise).

If the classifier's error exceeds 0.5, we abandon it and try again with
a new Di and a new Mi derived from it.

AdaBoost
error(Mi) affects how the weights of the training tuples are updated.
If a tuple is correctly classified in round i, its weight is multiplied
by:

error(Mi) / (1 - error(Mi))

Adjust the weights of all correctly classified tuples.
Then the weights of all tuples (including the misclassified ones) are
normalized:

Normalization factor = sum_of_old_weights / sum_of_new_weights

The weight of classifier Mi's vote is:

log( (1 - error(Mi)) / error(Mi) )

AdaBoost
The lower a classifier's error rate, the more accurate it is, and
therefore the higher its weight for voting should be.
The weight of classifier Mi's vote is:

log( (1 - error(Mi)) / error(Mi) )

To classify an unseen tuple X, for each class c sum the weights of every
classifier that assigned class c to X.
The class with the highest sum is the WINNER!
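A compact sketch of the boosting loop described above (numpy plus
scikit-learn decision stumps as hypothetical base learners; the weight
updates follow the formulas on these slides, except that the weighted
error is computed over the full training set as a simplification):

```python
# AdaBoost sketch following the weight-update rules above (binary labels in {0, 1}).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
n, k = len(X), 10
w = np.full(n, 1.0 / n)                      # start with equal weights 1/d
models, votes = [], []
rng = np.random.default_rng(0)

for _ in range(k):
    idx = rng.choice(n, size=n, replace=True, p=w)   # sample D_i according to weights
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    miss = stump.predict(X) != y                     # err(X_j) = 1 if misclassified
    error = np.sum(w[miss])                          # weighted error of M_i
    if error > 0.5 or error == 0:                    # abandon degenerate rounds
        continue
    w[~miss] *= error / (1 - error)                  # shrink weights of correct tuples
    w /= w.sum()                                     # normalize all weights
    models.append(stump)
    votes.append(np.log((1 - error) / error))        # classifier's vote weight

# Weighted majority vote over the retained classifiers.
scores = np.zeros((n, 2))
for m, v in zip(models, votes):
    scores[np.arange(n), m.predict(X)] += v
print("training accuracy of boosted ensemble:", np.mean(scores.argmax(axis=1) == y))
```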

Metric Evaluation Summary:


Use test sets and the hold-out method for large data;
Use the cross-validation method for middle-sized data;
Use the leave-one-out and bootstrap methods for small
data;
Don't use test data for parameter tuning - use separate validation data.

Summary
In this seminar report we have considered:

Metrics for Classifier Evaluation
Methods for Classifier Evaluation
Comparing Data Mining Schemes
Costs in Data Mining
Ensemble Methods

Thank You
