Lecture 22: Evaluation
April 24, 2010

Last Time
- Spectral Clustering
Today
- Evaluation Measures
  - Accuracy
  - Significance Testing
  - F-Measure
  - Error Types
  - ROC Curves
  - Equal Error Rate
  - AIC/BIC
- External Evaluation: measure the performance on a downstream task
Accuracy
- Easily the most common and intuitive measure of classification performance.

  Accuracy = #correct / N
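As a quick sketch of the definition above, accuracy is just the fraction of matching predictions; the label lists here are made up for illustration:

```python
def accuracy(predicted, true):
    """#correct / N over paired label sequences."""
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)

# Hypothetical labels: 3 of 4 predictions are correct.
predicted = ["spam", "spam", "ham", "ham"]
true      = ["spam", "ham",  "ham", "ham"]
print(accuracy(predicted, true))  # 0.75
```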
Significance Testing
- Say I have two classifiers.
  - A = 50% accuracy
  - B = 75% accuracy
- B is better, right?
Significance Testing
- Say I have another two classifiers.
  - A = 50% accuracy
  - B = 50.5% accuracy
- Is B better?
Basic Evaluation
- Training data: used to identify model parameters.
- Testing data: used for evaluation.
- Optionally: Development / tuning data, used to identify model hyperparameters.
- Difficult to get significance or confidence values this way.
Cross Validation
- Identify n folds of the available data.
- Train on n-1 folds; test on the remaining fold.
- In the extreme (n = N) this is known as leave-one-out cross validation.
- n-fold cross validation (xval) gives n samples of the performance of the classifier.
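A minimal sketch of the fold bookkeeping described above, working only with index lists (no actual classifier; the split helper is hypothetical, not from the lecture):

```python
import random

def n_fold_indices(N, n, seed=0):
    """Split indices 0..N-1 into n folds; each fold is held out once."""
    idx = list(range(N))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        # Train on the other n-1 folds.
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

# Each of the n iterations yields one train/test split,
# giving n samples of classifier performance.
splits = list(n_fold_indices(10, 5))
print(len(splits))  # 5
```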
Significance Testing
- Is the performance of two classifiers different with statistical significance?
- Means testing: if we have two samples of classifier performance (accuracy), we want to determine whether they are drawn from the same distribution (no difference) or from two different distributions.
T-Test
- One-sample t-test: t = (x̄ - μ₀) / (s / √n)
- Once you have a t-value, look up the significance level in a table, keyed on the t-value and the degrees of freedom.
- Independent t-test: compares the means of two independent samples.
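The one-sample t statistic can be computed directly from the sample mean and standard deviation; the fold accuracies and the 70% published baseline below are hypothetical:

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)); df = n - 1."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical accuracies from 5 cross-validation folds
# compared against a published accuracy of 70%:
accs = [0.74, 0.78, 0.72, 0.76, 0.75]
t, df = one_sample_t(accs, 0.70)
# Look up t against a t-table at df degrees of freedom.
```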
Significance Testing
- Run cross-validation to get n samples of the classifier mean.
- Use this distribution to compare against either:
  - a known (published) level of performance: one-sample t-test
  - another classifier's cross-validation samples: independent t-test
- If at all possible, results should include information about the variance of classifier performance.
Significance Testing Caveat
- Including more samples of the classifier performance can artificially inflate the significance measure.
- If x̄ and s are constant (the sample represents the population mean and variance), then raising n will increase t.
- If these samples are real, then this is fine.
- Often cross-validation fold assignment is not truly random; thus subsequent xval runs only resample the same information.
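The inflation effect is easy to see numerically: holding x̄ and s fixed (hypothetical values below), t grows like √n:

```python
import math

# Hypothetical fixed sample mean, baseline, and standard deviation:
xbar, mu0, s = 0.52, 0.50, 0.05
for n in (10, 100, 1000):
    t = (xbar - mu0) / (s / math.sqrt(n))
    print(n, round(t, 2))  # t increases with n even though x_bar and s are unchanged
```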
Confidence Bars
- Variance information can be included in plots of classifier performance to ease visualization.
- For a sample of size n = 10 with standard deviation σ = 1:

  SD = σ
  SE = σ / √n
  CI95% = 1.96 · σ / √n
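The three bar widths can be computed directly from a sample; the data here are made up for illustration:

```python
import math
import statistics

# Hypothetical sample of n = 10 performance measurements:
samples = [9.1, 10.2, 9.8, 10.5, 9.9, 10.1, 9.6, 10.4, 10.0, 9.4]
n = len(samples)
sd = statistics.stdev(samples)  # standard deviation
se = sd / math.sqrt(n)          # standard error of the mean
ci95 = 1.96 * se                # half-width of the 95% confidence interval
```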
Confidence Bars
- Most important to be clear about what is plotted.
- The 95% confidence interval has the clearest interpretation.

[Figure: the same data plotted with SD, SE, and 95% CI error bars]
Baseline Classifiers
- Majority class baseline: every data point is classified as the class that is most frequently represented in the training data.
- Random baseline: randomly assign one of the classes to each data point,
  - with an even distribution, or
  - with the training class distribution.
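Both baselines are a few lines of code; the training labels below are hypothetical:

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_size):
    """Predict the most frequent training class for every test point."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * test_size

def random_baseline(train_labels, test_size, seed=0):
    """Sample predictions from the training class distribution."""
    rng = random.Random(seed)
    return [rng.choice(train_labels) for _ in range(test_size)]

train = ["neg", "neg", "neg", "pos"]
print(majority_baseline(train, 3))  # ['neg', 'neg', 'neg']
```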
  Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Accuracy = 90%
F-Measure
- F-measure can be weighted to favor Precision or Recall:
  - β > 1 favors recall
  - β < 1 favors precision

  F_β = (1 + β²) P R / (β² P + R)

                    True Values
                 Positive  Negative
  Hyp Positive       0         0
  Hyp Negative      10       100

  P = 0, R = 0, F1 = 0
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive       1         0
  Hyp Negative       9       100

  P = 1, R = 1/10, F1 = .18
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive      10        50
  Hyp Negative       0        50

  P = 10/60, R = 1, F1 = .29
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive       9         1
  Hyp Negative       1        99

  P = .9, R = .9, F1 = .9
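A small function reproduces the four worked examples above (cell counts taken from the confusion matrices; F is defined as 0 in the degenerate P = R = 0 case):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) P R / (beta^2 P + R); 0 when P = R = 0."""
    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# The four confusion matrices above:
print(round(f_measure(0, 0, 10), 2))   # 0.0
print(round(f_measure(1, 0, 9), 2))    # 0.18
print(round(f_measure(10, 50, 0), 2))  # 0.29
```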
F-Measure
- Accuracy is weighted towards majority class performance.
- F-measure is useful for measuring the performance on minority classes.
Types of Errors
- False Positives: the system predicted TRUE but the value was FALSE. Also known as False Alarms, or Type I errors.
- False Negatives: the system predicted FALSE but the value was TRUE. Also known as Misses, or Type II errors.
ROC Curves
- It is common to plot classifier performance at a variety of settings or thresholds.
- Receiver Operating Characteristic (ROC) curves plot true positives against false positives.
- The overall performance is calculated by the Area Under the Curve (AUC).
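A sketch of building a ROC curve by sweeping the decision threshold over hypothetical scores, then integrating AUC with the trapezoid rule (this simple version assumes distinct scores; ties would need grouping):

```python
def roc_points(scores, labels):
    """Sweep the threshold over all scores; return (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label:
            tp += 1  # true positive gained: curve moves up
        else:
            fp += 1  # false positive gained: curve moves right
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels:
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(auc(roc_points(scores, labels)))
```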
ROC Curves
- Equal Error Rate (EER) is commonly reported. EER is the operating point where the false positive rate equals the false negative rate.
- Curves provide more detail about performance.

[Figure: ROC curves, from Gauvain et al. 1995]
Goodness of Fit
- Another view of model performance: measure the model likelihood of the unseen data, l(x; θ).
- However, we've seen that model likelihood is likely to improve by adding parameters.
- Two information criteria measures include a cost term for the number of parameters, k, in the model:

  AIC = 2k - 2 ln(l(x; θ))

  The 2k term measures the information in the parameters; the -2 ln(l(x; θ)) term measures the information lost by the modeling.
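Both criteria are one-liners given k and the log-likelihood. The slide shows only AIC; the BIC formula below is the standard form (k ln n - 2 ln L), added here as an assumption since it is not on the slide, and the numbers are hypothetical:

```python
import math

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln L: parameter cost plus information lost by the model."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln L (standard form; not shown on the slide)."""
    return k * math.log(n) - 2 * log_likelihood

# A bigger model must raise the likelihood enough to pay for its
# extra parameters; otherwise the criterion prefers the smaller model.
print(aic(k=3, log_likelihood=-120.0))   # 246.0
print(aic(k=10, log_likelihood=-118.0))  # 256.0 -> smaller model wins
```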
Today
- Accuracy
- Significance Testing
- F-Measure
- AIC/BIC

Next Time
- Regression Evaluation
- Cluster Evaluation