07 Handout PDF

Lecture 7: The VC Dimension
Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
   Lecture 6: Theory of Generalization
      $E_{out} \approx E_{in}$ possible if $m_{\mathcal{H}}(N)$ breaks somewhere and $N$ large enough
   Lecture 7: The VC Dimension
      Definition of VC Dimension
      VC Dimension of Perceptrons
      Physical Intuition of VC Dimension
      Interpreting VC Dimension
3. How Can Machines Learn?
4. How Can Machines Learn Better?

Theory of Generalization
Bounding Function: The Theorem

$$B(N, k) \le \underbrace{\sum_{i=0}^{k-1} \binom{N}{i}}_{\text{highest term } N^{k-1}}$$

simple induction using boundary and inductive formula
for fixed $k$, $B(N, k)$ upper bounded by poly($N$)
$\Rightarrow$ $m_{\mathcal{H}}(N)$ is poly($N$) if break point exists
'$\le$' can be '$=$' actually, go play and prove it if math lover! :-)
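
The induction mentioned above can be checked numerically. The sketch below is only an illustration: it assumes the boundary cases recalled from Lecture 6 ($B(N,1) = 1$, $B(N,k) = 2^N$ for $N < k$, $B(k,k) = 2^k - 1$) and treats the inductive inequality $B(N,k) \le B(N-1,k) + B(N-1,k-1)$ as an equality. The function names are hypothetical.

```python
from math import comb


def binomial_bound(n: int, k: int) -> int:
    """Right-hand side of the theorem: sum_{i=0}^{k-1} C(n, i)."""
    return sum(comb(n, i) for i in range(k))


def recurrence_table(n_max: int, k_max: int) -> dict:
    """Fill a table from the boundary cases and the inductive formula.

    Assumption: the inductive inequality is taken with equality.
    """
    table = {}
    for n in range(1, n_max + 1):
        for k in range(1, k_max + 1):
            if k == 1:
                table[n, k] = 1            # break point 1: one dichotomy only
            elif n < k:
                table[n, k] = 2 ** n       # fewer points than k: no restriction
            elif n == k:
                table[n, k] = 2 ** n - 1   # at least one dichotomy is missing
            else:
                table[n, k] = table[n - 1, k] + table[n - 1, k - 1]
    return table


table = recurrence_table(10, 6)
matches = all(value == binomial_bound(n, k) for (n, k), value in table.items())
print(matches)
```

If the table and the binomial sum agree on every entry, that is consistent with the remark that '$\le$' can be '$=$'; it is a sanity check on small cases, not a proof.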

The Three Break Points

$$B(N, k) \le \underbrace{\sum_{i=0}^{k-1} \binom{N}{i}}_{\text{highest term } N^{k-1}}$$

positive rays: $m_{\mathcal{H}}(N) = N + 1 \le N + 1$
   $m_{\mathcal{H}}(2) = 3 < 2^2$: break point at 2
positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2} N^2 + \frac{1}{2} N + 1 \le \frac{1}{2} N^2 + \frac{1}{2} N + 1$
2D perceptrons: $m_{\mathcal{H}}(N) = ? \le \frac{1}{6} N^3 + \frac{5}{6} N + 1$
can bound $m_{\mathcal{H}}(N)$ by only one break point
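
As a rough check of the three right-hand sides above, the sketch below evaluates the binomial sum for break points $k = 2, 3, 4$ and compares it with the closed-form polynomials. The helper names are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction
from math import comb


def binomial_bound(n, k):
    """sum_{i=0}^{k-1} C(n, i), the bound implied by break point k."""
    return sum(comb(n, i) for i in range(k))


def rays(n):
    return n + 1


def intervals(n):
    return Fraction(1, 2) * n * n + Fraction(1, 2) * n + 1


def perceptron_2d_bound(n):
    return Fraction(1, 6) * n ** 3 + Fraction(5, 6) * n + 1


def first_break_point(growth, n_max=20):
    """Smallest n with growth(n) < 2^n, or None if none is found up to n_max."""
    for n in range(1, n_max + 1):
        if growth(n) < 2 ** n:
            return n
    return None


agree = all(
    binomial_bound(n, 2) == rays(n)
    and binomial_bound(n, 3) == intervals(n)
    and binomial_bound(n, 4) == perceptron_2d_bound(n)
    for n in range(1, 31)
)
print(agree, first_break_point(rays), first_break_point(intervals))
```

For positive rays and positive intervals the bound coincides with the growth function itself, which is why a single break point is enough to pin down the polynomial.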
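
BAD Bound for General H

want:
$$P\Big[\exists h \in \mathcal{H} \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\Big] \le 2\, m_{\mathcal{H}}(N) \exp\left(-2 \epsilon^2 N\right)$$

actually, when $N$ large enough,
$$P\Big[\exists h \in \mathcal{H} \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\Big] \le 2 \cdot 2\, m_{\mathcal{H}}(2N) \cdot \exp\left(-2 \cdot \tfrac{1}{16} \epsilon^2 N\right)$$
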
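Vapnik-Chervonenkis (VC) Bound

For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and 'statistical' large $\mathcal{D}$

$$P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big]$$
$$\le P_{\mathcal{D}}\Big[\exists h \in \mathcal{H} \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\Big]$$
$$\le 4\, m_{\mathcal{H}}(2N) \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)$$
$$\le 4 (2N)^{k-1} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right) \quad \text{(if } k \text{ exists)}$$

if (1) $m_{\mathcal{H}}(N)$ breaks at $k$ (good $\mathcal{H}$)
   (2) $N$ large enough (good $\mathcal{D}$)
$\Rightarrow$ probably generalized '$E_{out} \approx E_{in}$', and
if (3) $\mathcal{A}$ picks a $g$ with small $E_{in}$ (good $\mathcal{A}$)
$\Rightarrow$ probably learned!
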
VC Dimension

the formal name of maximum non-break point

Definition
VC dimension of $\mathcal{H}$, denoted $d_{VC}(\mathcal{H})$, is
largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$
   the most inputs that $\mathcal{H}$ can shatter
   $d_{VC}$ = 'minimum $k$' $- 1$

$N \le d_{VC}$ $\Rightarrow$ $\mathcal{H}$ can shatter some $N$ inputs
$k > d_{VC}$ $\Rightarrow$ $k$ is a break point for $\mathcal{H}$
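
The definition can be tried out by brute force on small 1D hypothesis sets. The sketch below is a minimal illustration under stated assumptions: for positive rays and positive intervals the dichotomies depend only on the ordering of distinct inputs, so the points $0, 1, \ldots, N-1$ are taken as representative, and the search stops at a hypothetical cap `n_max` (so it cannot detect an infinite $d_{VC}$).

```python
def ray_dichotomies(n):
    """Dichotomies of positive rays h(x) = sign(x - a) on the points 0..n-1."""
    thresholds = [t - 0.5 for t in range(n + 1)]  # one threshold per gap
    return {tuple(+1 if x > a else -1 for x in range(n)) for a in thresholds}


def interval_dichotomies(n):
    """Dichotomies of positive intervals: +1 inside (l, r), -1 elsewhere."""
    cuts = [t - 0.5 for t in range(n + 1)]
    return {
        tuple(+1 if l < x < r else -1 for x in range(n))
        for l in cuts
        for r in cuts
        if l <= r
    }


def vc_dimension(dichotomies, n_max=8):
    """Largest n <= n_max with 2^n dichotomies (0 if none)."""
    best = 0
    for n in range(1, n_max + 1):
        if len(dichotomies(n)) == 2 ** n:
            best = n
    return best


print(len(ray_dichotomies(5)), len(interval_dichotomies(5)))
print(vc_dimension(ray_dichotomies), vc_dimension(interval_dichotomies))
```

The counts should match $N + 1$ and $\frac{1}{2}N^2 + \frac{1}{2}N + 1$, and the reported dimensions are one less than the corresponding break points.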

The Four VC Dimensions

positive rays: $m_{\mathcal{H}}(N) = N + 1$, $d_{VC} = 1$
positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2} N^2 + \frac{1}{2} N + 1$, $d_{VC} = 2$
convex sets: $m_{\mathcal{H}}(N) = 2^N$, $d_{VC} = \infty$
2D perceptrons: $m_{\mathcal{H}}(N) \le N^3$ for $N \ge 2$, $d_{VC} = 3$

good: finite $d_{VC}$

VC Dimension and Learning

finite $d_{VC}$ $\Rightarrow$ $g$ 'will' generalize ($E_{out}(g) \approx E_{in}(g)$)
   regardless of learning algorithm $\mathcal{A}$
   regardless of input distribution $P$
   regardless of target function $f$

[Figure: learning flow. Unknown target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula) and unknown $P$ on $\mathcal{X}$ generate $x_1, x_2, \ldots, x_N$; training examples $\mathcal{D}: (x_1, y_1), \ldots, (x_N, y_N)$ (historical records in bank) and hypothesis set $\mathcal{H}$ feed the learning algorithm $\mathcal{A}$, which outputs the final hypothesis $g \approx f$ (learned formula to be used).]

'worst case' guarantee on generalization

Fun Time

Suppose there is a set of $N$ inputs that cannot be shattered by $\mathcal{H}$. Based
only on this information, what can we conclude about $d_{VC}(\mathcal{H})$?
1. $d_{VC}(\mathcal{H}) > N$
2. $d_{VC}(\mathcal{H}) = N$
3. $d_{VC}(\mathcal{H}) < N$
4. no conclusion can be made

Reference Answer: 4
It is possible that there is another set of $N$
inputs that can be shattered, which means
$d_{VC} \ge N$. It is also possible that no set of $N$
inputs can be shattered, which means $d_{VC} < N$.
Neither case can be ruled out by one
non-shattering set.

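2D PLA Revisited

linearly separable $\mathcal{D}$ $\Rightarrow$ PLA can converge $\Rightarrow$ ($T$ large) $E_{in}(g) = 0$
with $x_n \sim P$ and $y_n = f(x_n)$ $\Rightarrow$ $P[|E_{in}(g) - E_{out}(g)| > \epsilon] \le \ldots$ by $d_{VC} = 3$ $\Rightarrow$ ($N$ large) $E_{out}(g) \approx E_{in}(g)$
$\Rightarrow$ $E_{out}(g) \approx 0$ :-)

general PLA for $x$ with more than 2 features?
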
VC Dimension of Perceptrons

1D perceptron (pos/neg rays): $d_{VC} = 2$
2D perceptrons: $d_{VC} = 3$
   $d_{VC} \ge 3$: [figure]
   $d_{VC} \le 3$: [figure]
d-D perceptrons: $d_{VC} = d + 1$

two steps:
   $d_{VC} \ge d + 1$
   $d_{VC} \le d + 1$

Extra Fun Time

What statement below shows that $d_{VC} \ge d + 1$?
1. There are some $d + 1$ inputs we can shatter.
2. We can shatter any set of $d + 1$ inputs.
3. There are some $d + 2$ inputs we cannot shatter.
4. We cannot shatter any set of $d + 2$ inputs.

Reference Answer: 1
$d_{VC}$ is the maximum $N$ for which $m_{\mathcal{H}}(N) = 2^N$, and
$m_{\mathcal{H}}(N)$ is the maximum number of dichotomies of $N$
inputs. So if we can find $2^{d+1}$ dichotomies on
some $d + 1$ inputs, $m_{\mathcal{H}}(d + 1) = 2^{d+1}$ and
hence $d_{VC} \ge d + 1$.

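$d_{VC} \ge d + 1$

There are some $d + 1$ inputs we can shatter.

some 'trivial' inputs:
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ x_3^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & \vdots & & \ddots & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix}$$

visually in 2D: [figure]

note: $X$ invertible!
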
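Can We Shatter X?

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_{d+1}^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \ddots & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} \quad \text{invertible}$$

to shatter . . .
for any $y = \begin{bmatrix} y_1 \\ \vdots \\ y_{d+1} \end{bmatrix}$, find $w$ such that

$$\text{sign}(Xw) = y \;\Leftarrow\; (Xw) = y \;\overset{X \text{ invertible!}}{\Longleftarrow}\; w = X^{-1} y$$

'special' $X$ can be shattered $\Rightarrow$ $d_{VC} \ge d + 1$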
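
The argument above can be sketched in a few lines. This is an illustration only: it assumes the special $X$ shown above, for which $Xw = y$ has the explicit solution $w_0 = y_1$, $w_i = y_{i+1} - y_1$ (so no general matrix inversion is needed), and all function names are hypothetical.

```python
from itertools import product


def special_inputs(d):
    """Rows x_1..x_{d+1}: x_1 = (1, 0, ..., 0) and x_{i+1} = (1, e_i)."""
    rows = [[1] + [0] * d]
    for i in range(d):
        row = [1] + [0] * d
        row[i + 1] = 1
        rows.append(row)
    return rows


def solve_weights(y):
    """Solve X w = y for the special X: w_0 = y_1, w_i = y_{i+1} - y_1."""
    return [y[0]] + [yi - y[0] for yi in y[1:]]


def sign(value):
    return 1 if value > 0 else -1


def shatters(d):
    """Check that every y in {-1, +1}^(d+1) is realized as sign(X w)."""
    X = special_inputs(d)
    for y in product((-1, 1), repeat=d + 1):
        w = solve_weights(y)
        predicted = tuple(sign(sum(xi * wi for xi, wi in zip(x, w))) for x in X)
        if predicted != y:
            return False
    return True


print(all(shatters(d) for d in range(1, 7)))
```

Because $Xw$ equals $y$ exactly, taking the sign changes nothing, which is the whole point of choosing an invertible $X$.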

Degrees of Freedom

[Figure: illustration of degrees of freedom (modified from the work of Hugues Vermeiren on http://www.texample.net)]

hypothesis parameters $w = (w_0, w_1, \cdots, w_d)$:
   creates degrees of freedom
hypothesis quantity $M = |\mathcal{H}|$:
   'analog' degrees of freedom
hypothesis power $d_{VC} = d + 1$:
   effective 'binary' degrees of freedom

$d_{VC}(\mathcal{H})$: powerfulness of $\mathcal{H}$

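Physical Intuition of VC Dimension
Two Old Friends

Positive Rays ($d_{VC} = 1$)
[Figure: points $x_1, x_2, x_3, \ldots, x_N$ on a line, with $h(x) = -1$ to the left of the threshold $a$ and $h(x) = +1$ to the right]
free parameters: $a$

Positive Intervals ($d_{VC} = 2$)
[Figure: points $x_1, x_2, x_3, \ldots, x_N$ on a line, with $h(x) = +1$ inside the interval and $h(x) = -1$ on both sides]
free parameters: $\ell$, $r$

practical rule of thumb:
$d_{VC} \approx$ #free parameters (but not always)
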
$M$ and $d_{VC}$

copied from Lecture 5 :-)
1. can we make sure that $E_{out}(g)$ is close enough to $E_{in}(g)$?
2. can we make $E_{in}(g)$ small enough?

small $M$
   1. Yes!, $P[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
   2. No!, too few choices
large $M$
   1. No!, $P[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
   2. Yes!, many choices
small $d_{VC}$
   1. Yes!, $P[\text{BAD}] \le 4 \cdot (2N)^{d_{VC}} \cdot \exp(\ldots)$
   2. No!, too limited power
large $d_{VC}$
   1. No!, $P[\text{BAD}] \le 4 \cdot (2N)^{d_{VC}} \cdot \exp(\ldots)$
   2. Yes!, lots of power

using the right $d_{VC}$ (or $\mathcal{H}$) is important

Fun Time

Origin-crossing hyperplanes are essentially perceptrons with $w_0$
fixed at 0. Make a guess about the $d_{VC}$ of origin-crossing
hyperplanes in $\mathbb{R}^d$.
1
$d + 1$

Reference Answer: 2
The proof is almost the same as proving the
$d_{VC}$ for usual perceptrons, but it is the intuition
($d_{VC} \approx$ #free parameters) that you should use to
answer this quiz.

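VC Bound Rephrase: Penalty for Model Complexity

$$\underbrace{P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big]}_{\text{BAD}} \le \underbrace{4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)}_{\delta}$$

Rephrase
. . ., with probability $\ge 1 - \delta$, GOOD: $|E_{in}(g) - E_{out}(g)| \le \epsilon$

set
$$\delta = 4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)$$
$$\frac{\delta}{4 (2N)^{d_{VC}}} = \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)$$
$$\ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right) = \tfrac{1}{8} \epsilon^2 N$$
$$\sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right)} = \epsilon$$
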
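VC Bound Rephrase: Penalty for Model Complexity

For any $g = \mathcal{A}(\mathcal{D}) \in \mathcal{H}$ and 'statistical' large $\mathcal{D}$, for $N \ge 2$, $d_{VC} \ge 2$

$$\underbrace{P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big]}_{\text{BAD}} \le \underbrace{4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)}_{\delta}$$

Rephrase
. . ., with probability $\ge 1 - \delta$, GOOD!

gen. error
$$|E_{in}(g) - E_{out}(g)| \le \sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right)}$$

$$E_{in}(g) - \sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right)} \le E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right)}$$

$$\underbrace{\sqrt{\ldots}}_{\Omega(N, \mathcal{H}, \delta)}: \text{penalty for model complexity}$$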
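
To get a feel for the size of the penalty, the sketch below evaluates $\Omega(N, \mathcal{H}, \delta)$ directly from the formula above and confirms that plugging $\epsilon = \Omega$ back into the bound returns $\delta$. The function names and the sample values of $N$, $d_{VC}$, and $\delta$ are assumptions chosen for illustration.

```python
from math import exp, log, sqrt


def vc_bound(n, d_vc, epsilon):
    """4 (2N)^{d_VC} exp(-epsilon^2 N / 8), the bound on P[BAD]."""
    return 4 * (2 * n) ** d_vc * exp(-epsilon ** 2 * n / 8)


def omega(n, d_vc, delta):
    """Penalty for model complexity: sqrt(8/N * ln(4 (2N)^{d_VC} / delta))."""
    return sqrt(8 / n * log(4 * (2 * n) ** d_vc / delta))


# Illustrative values only: d_VC = 3 and delta = 0.1.
for n in (1_000, 10_000, 100_000):
    print(n, round(omega(n, 3, 0.1), 4))

# Round trip: the epsilon obtained from delta reproduces delta in the bound.
epsilon = omega(10_000, 3, 0.1)
print(abs(vc_bound(10_000, 3, epsilon) - 0.1) < 1e-9)
```

The penalty shrinks as $N$ grows and grows with $d_{VC}$, which is the trade-off the rephrased bound makes explicit.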
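
THE VC Message

with a high probability,
$$E_{out}(g) \le E_{in}(g) + \underbrace{\sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{VC}}}{\delta}\right)}}_{\Omega(N, \mathcal{H}, \delta)}$$

[Figure: Error versus VC dimension $d_{VC}$, with curves for out-of-sample error, model complexity, and in-sample error; the best $d^*_{VC}$ is marked in the middle.]

$d_{VC} \uparrow$: $E_{in} \downarrow$ but $\Omega \uparrow$
$d_{VC} \downarrow$: $\Omega \downarrow$ but $E_{in} \uparrow$
best $d^*_{VC}$ in the middle

powerful $\mathcal{H}$ not always good!
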
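VC Bound Rephrase: Sample Complexity

$$\underbrace{P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big]}_{\text{BAD}} \le \underbrace{4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)}_{\delta}$$

given specs $\epsilon = 0.1$, $\delta = 0.1$, $d_{VC} = 3$, want $4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right) \le \delta$

   N          bound
   100        $2.82 \times 10^{7}$
   1,000      $9.17 \times 10^{9}$
   10,000     $1.19 \times 10^{8}$
   100,000    $1.65 \times 10^{-38}$
   29,300     $9.99 \times 10^{-2}$

sample complexity:
need $N \approx 10{,}000\, d_{VC}$ in theory

practical rule of thumb:
$N \approx 10\, d_{VC}$ often enough!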
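
The table above can be reproduced by evaluating the bound directly. The sketch below uses the slide's specs ($\epsilon = 0.1$, $\delta = 0.1$, $d_{VC} = 3$); the function names and the linear scan for the smallest adequate $N$ are assumptions made for illustration, not part of the lecture.

```python
from math import exp


def vc_bound(n, d_vc=3, epsilon=0.1):
    """4 (2N)^{d_VC} exp(-epsilon^2 N / 8)."""
    return 4 * (2 * n) ** d_vc * exp(-epsilon ** 2 * n / 8)


for n in (100, 1_000, 10_000, 100_000, 29_300):
    print(f"{n:>7,d}  {vc_bound(n):.2e}")


def smallest_n(delta=0.1, d_vc=3, epsilon=0.1):
    """Scan upward for the first N whose bound is at most delta.

    The bound first rises with N and peaks near N = 8 d_VC / epsilon^2,
    so the scan starts there, where the bound is already decreasing.
    """
    n = int(8 * d_vc / epsilon ** 2) + 1
    while vc_bound(n, d_vc, epsilon) > delta:
        n += 1
    return n


n_min = smallest_n()
print(n_min)
```

The scan should land close to the 29,300 quoted in the table, which is roughly $10{,}000\, d_{VC}$ for $d_{VC} = 3$.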

Looseness of VC Bound

$$P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big] \le 4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)$$

theory: $N \approx 10{,}000\, d_{VC}$; practice: $N \approx 10\, d_{VC}$

Why?
   Hoeffding for unknown $E_{out}$: any distribution, any target
   $m_{\mathcal{H}}(N)$ instead of $|\mathcal{H}(x_1, \ldots, x_N)|$: 'any' data
   $N^{d_{VC}}$ instead of $m_{\mathcal{H}}(N)$: 'any' $\mathcal{H}$ of same $d_{VC}$
   union bound on worst cases: any choice made by $\mathcal{A}$

but hardly better, and 'similarly loose for all models'

philosophical message of VC bound
important for improving ML

Fun Time

Consider the VC Bound below. How can we decrease the
probability of getting BAD data?

$$P_{\mathcal{D}}\Big[|E_{in}(g) - E_{out}(g)| > \epsilon\Big] \le 4 (2N)^{d_{VC}} \exp\left(-\tfrac{1}{8} \epsilon^2 N\right)$$

1. decrease model complexity $d_{VC}$
2. increase data size $N$ a lot
3. increase generalization error tolerance $\epsilon$
4. all of the above

Reference Answer: 4
Congratulations on being
Master of VC bound! :-)

Summary

1. When Can Machines Learn?
2. Why Can Machines Learn?
   Lecture 6: Theory of Generalization
   Lecture 7: The VC Dimension
      Definition of VC Dimension: maximum non-break point
      VC Dimension of Perceptrons: $d_{VC}(\mathcal{H}) = d + 1$
      Physical Intuition of VC Dimension: $d_{VC} \approx$ #free parameters
      Interpreting VC Dimension: loosely, model complexity & sample complexity
   next: more than noiseless binary classification?
3. How Can Machines Learn?
4. How Can Machines Learn Better?
