Lecture 22: Evaluation
April 24, 2010

Last Time
- Spectral Clustering
Today
- Evaluation Measures
  - Accuracy
  - Significance Testing
  - F-Measure
  - Error Types
  - ROC Curves
  - Equal Error Rate
  - AIC/BIC
- External Evaluation: measure the performance on a downstream task
Accuracy
- Easily the most common and intuitive measure of classification performance.

  Accuracy = #correct / N
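As a quick sketch of the definition above, accuracy is just the fraction of matching predictions; the label lists here are made up for illustration:

```python
def accuracy(predicted, true):
    """#correct / N over paired label sequences."""
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)

# Hypothetical labels: 3 of 4 predictions are correct.
predicted = ["spam", "spam", "ham", "ham"]
true      = ["spam", "ham",  "ham", "ham"]
print(accuracy(predicted, true))  # 0.75
```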
Significance Testing
- Say I have two classifiers.
  - A = 50% accuracy
  - B = 75% accuracy
- B is better, right?
Significance Testing
- Say I have another two classifiers.
  - A = 50% accuracy
  - B = 50.5% accuracy
- Is B better?
Basic Evaluation
- Training data: used to identify model parameters.
- Testing data: used for evaluation.
- Optionally: Development / tuning data, used to identify model hyperparameters.
- Difficult to get significance or confidence values this way.
Cross Validation
- Identify n folds of the available data.
- Train on n-1 folds; test on the remaining fold.
- In the extreme (n = N) this is known as leave-one-out cross validation.
- n-fold cross validation (xval) gives n samples of the performance of the classifier.
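A minimal sketch of the fold bookkeeping described above, working only with index lists (no actual classifier; the split helper is hypothetical, not from the lecture):

```python
import random

def n_fold_indices(N, n, seed=0):
    """Split indices 0..N-1 into n folds; each fold is held out once."""
    idx = list(range(N))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        # Train on the other n-1 folds.
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

# Each of the n iterations yields one train/test split,
# giving n samples of classifier performance.
splits = list(n_fold_indices(10, 5))
print(len(splits))  # 5
```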
Significance Testing
- Is the performance of two classifiers different with statistical significance?
- Means testing: if we have two samples of classifier performance (accuracy), we want to determine whether they are drawn from the same distribution (no difference) or from two different distributions.
T-Test
- One-sample t-test: t = (x̄ - μ₀) / (s / √n)
- Once you have a t-value, look up the significance level in a table, keyed on the t-value and the degrees of freedom.
- Independent t-test: compares the means of two independent samples.
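The one-sample t statistic can be computed directly from the sample mean and standard deviation; the fold accuracies and the 70% published baseline below are hypothetical:

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)); df = n - 1."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical accuracies from 5 cross-validation folds
# compared against a published accuracy of 70%:
accs = [0.74, 0.78, 0.72, 0.76, 0.75]
t, df = one_sample_t(accs, 0.70)
# Look up t against a t-table at df degrees of freedom.
```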
Significance Testing
- Run cross-validation to get n samples of the classifier mean.
- Use this distribution to compare against either:
  - a known (published) level of performance: one-sample t-test
  - another classifier's cross-validation samples: independent t-test
- If at all possible, results should include information about the variance of classifier performance.
Significance Testing Caveat
- Including more samples of the classifier performance can artificially inflate the significance measure.
- If x̄ and s are constant (the sample represents the population mean and variance), then raising n will increase t.
- If these samples are real, then this is fine.
- Often cross-validation fold assignment is not truly random; thus subsequent xval runs only resample the same information.
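The inflation effect is easy to see numerically: holding x̄ and s fixed (hypothetical values below), t grows like √n:

```python
import math

# Hypothetical fixed sample mean, baseline, and standard deviation:
xbar, mu0, s = 0.52, 0.50, 0.05
for n in (10, 100, 1000):
    t = (xbar - mu0) / (s / math.sqrt(n))
    print(n, round(t, 2))  # t increases with n even though x_bar and s are unchanged
```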
Confidence Bars
- Variance information can be included in plots of classifier performance to ease visualization.
- For a sample of size n = 10 with standard deviation σ = 1:

  SD = σ
  SE = σ / √n
  CI95% = 1.96 · σ / √n
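The three bar widths can be computed directly from a sample; the data here are made up for illustration:

```python
import math
import statistics

# Hypothetical sample of n = 10 performance measurements:
samples = [9.1, 10.2, 9.8, 10.5, 9.9, 10.1, 9.6, 10.4, 10.0, 9.4]
n = len(samples)
sd = statistics.stdev(samples)  # standard deviation
se = sd / math.sqrt(n)          # standard error of the mean
ci95 = 1.96 * se                # half-width of the 95% confidence interval
```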
Confidence Bars
- Most important to be clear about what is plotted.
- The 95% confidence interval has the clearest interpretation.

[Figure: the same data plotted with SD, SE, and 95% CI error bars]
Baseline Classifiers
- Majority class baseline: every data point is classified as the class that is most frequently represented in the training data.
- Random baseline: randomly assign one of the classes to each data point,
  - with an even distribution, or
  - with the training class distribution.
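Both baselines are a few lines of code; the training labels below are hypothetical:

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_size):
    """Predict the most frequent training class for every test point."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * test_size

def random_baseline(train_labels, test_size, seed=0):
    """Sample predictions from the training class distribution."""
    rng = random.Random(seed)
    return [rng.choice(train_labels) for _ in range(test_size)]

train = ["neg", "neg", "neg", "pos"]
print(majority_baseline(train, 3))  # ['neg', 'neg', 'neg']
```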
  Accuracy = (TP + TN) / (TP + FP + TN + FN)
  Accuracy = 90%
F-Measure
- F-measure can be weighted to favor Precision or Recall:
  - β > 1 favors recall
  - β < 1 favors precision

  F_β = (1 + β²) P R / (β² P + R)

                    True Values
                 Positive  Negative
  Hyp Positive       0         0
  Hyp Negative      10       100

  P = 0, R = 0, F1 = 0
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive       1         0
  Hyp Negative       9       100

  P = 1, R = 1/10, F1 = .18
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive      10        50
  Hyp Negative       0        50

  P = 10/60, R = 1, F1 = .29
F-Measure
                    True Values
                 Positive  Negative
  Hyp Positive       9         1
  Hyp Negative       1        99

  P = .9, R = .9, F1 = .9
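A small function reproduces the four worked examples above (cell counts taken from the confusion matrices; F is defined as 0 in the degenerate P = R = 0 case):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) P R / (beta^2 P + R); 0 when P = R = 0."""
    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# The four confusion matrices above:
print(round(f_measure(0, 0, 10), 2))   # 0.0
print(round(f_measure(1, 0, 9), 2))    # 0.18
print(round(f_measure(10, 50, 0), 2))  # 0.29
```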
F-Measure
- Accuracy is weighted towards majority class performance.
- F-measure is useful for measuring the performance on minority classes.
Types of Errors
- False Positives: the system predicted TRUE but the value was FALSE. Also known as False Alarms, or Type I errors.
- False Negatives: the system predicted FALSE but the value was TRUE. Also known as Misses, or Type II errors.
ROC Curves
- It is common to plot classifier performance at a variety of settings or thresholds.
- Receiver Operating Characteristic (ROC) curves plot true positives against false positives.
- The overall performance is calculated by the Area Under the Curve (AUC).
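A sketch of building a ROC curve by sweeping the decision threshold over hypothetical scores, then integrating AUC with the trapezoid rule (this simple version assumes distinct scores; ties would need grouping):

```python
def roc_points(scores, labels):
    """Sweep the threshold over all scores; return (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label:
            tp += 1  # true positive gained: curve moves up
        else:
            fp += 1  # false positive gained: curve moves right
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels:
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(auc(roc_points(scores, labels)))
```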
ROC Curves
- Equal Error Rate (EER) is commonly reported. EER is the operating point where the false positive rate equals the false negative rate.
- Curves provide more detail about performance.

[Figure: ROC curves, from Gauvain et al. 1995]
Goodness of Fit
- Another view of model performance: measure the model likelihood of the unseen data, l(x; θ).
- However, we've seen that model likelihood is likely to improve by adding parameters.
- Two information criteria measures include a cost term for the number of parameters, k, in the model:

  AIC = 2k - 2 ln(l(x; θ))

  The 2k term measures the information in the parameters; the -2 ln(l(x; θ)) term measures the information lost by the modeling.
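Both criteria are one-liners given k and the log-likelihood. The slide shows only AIC; the BIC formula below is the standard form (k ln n - 2 ln L), added here as an assumption since it is not on the slide, and the numbers are hypothetical:

```python
import math

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln L: parameter cost plus information lost by the model."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln L (standard form; not shown on the slide)."""
    return k * math.log(n) - 2 * log_likelihood

# A bigger model must raise the likelihood enough to pay for its
# extra parameters; otherwise the criterion prefers the smaller model.
print(aic(k=3, log_likelihood=-120.0))   # 246.0
print(aic(k=10, log_likelihood=-118.0))  # 256.0 -> smaller model wins
```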
Today
- Accuracy
- Significance Testing
- F-Measure
- AIC/BIC

Next Time
- Regression Evaluation
- Cluster Evaluation