
CS281B/Stat241B (Spring 2003) Statistical Learning Theory Lecture: 21

Bernstein's inequality, and generalizations

Lecturer: Peter L. Bartlett Scribe: XuanLong Nguyen

Let $F \subseteq [-1,1]^X$. Recall that with probability $1-\delta$ we have
$$ \sup_{f \in F} \left( Ef - \hat{E}f \right) \le 2R_n(F) + c\sqrt{\frac{\log(1/\delta)}{n}}, $$
where $R_n(F)$ is the Rademacher average of $F$, and $\hat{E}$ denotes the empirical expectation. Hence, if $\hat{f} \in F$ minimizes $\hat{E}f$, we have
$$ E\hat{f} \le \inf_{f \in F} Ef + 2R_n(F) + c\sqrt{\frac{\log(1/\delta)}{n}}. $$
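To spell out the second step: on the event where the uniform bound holds, write $\Delta = 2R_n(F) + c\sqrt{\log(1/\delta)/n}$. Then for every $f \in F$,
$$ E\hat{f} \le \hat{E}\hat{f} + \Delta \le \hat{E}f + \Delta, $$
since $\hat{f}$ minimizes the empirical expectation; the deviation $\hat{E}f - Ef$ for the fixed comparator $f$ is controlled by Hoeffding's inequality and can be absorbed into the constant $c$. Taking the infimum over $f \in F$ gives the stated bound.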


Since $R_n(F)$ is generally in $O(1/\sqrt{n})$, the right-hand side of the above inequality is in $O(1/\sqrt{n})$. Applying this to classes of loss functions $\{\ell_f : f \in F\}$, we obtain a generalization bound for our learning algorithms. In some cases, we can obtain an error rate better than $O(1/\sqrt{n})$. For instance, for the class of perceptrons studied earlier in the course, the error rate is $O(1/n)$. On the other hand, when the variance $\sup_{f \in F} \mathrm{Var}(\ell_f)$ is not small, the bound is tight, as illustrated by the following theorem:

Theorem 0.1. For $F \subseteq [-1,1]^X$, let $\sigma_F^2 = \sup_{f \in F} \mathrm{Var}(f)$. If $\sigma_F^2 \ge 1/n$, then for some constant $c$,
$$ E \sup_{f \in F} \left| Ef - \hat{E}f \right| \ge \frac{c\,\sigma_F}{\sqrt{n}}. $$

To prove this result, we will need a concentration inequality that is stronger than Hoeffding's inequality, which is loose in the case of small variance.

Theorem 0.2. (Bernstein's inequality) Suppose that $X_1, \ldots, X_n$ are independent random variables with mean 0 and $|X_i| \le c$ almost surely. Let $\sigma_i^2 = \mathrm{Var}(X_i)$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$. Then,
$$ P\left( \frac{1}{n}\sum_{i=1}^n X_i \ge \epsilon \right) \le \exp\left( -\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \right). $$
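To get a feel for the gain over Hoeffding's inequality when the variance is small, here is a minimal numerical sketch (our own illustration; the parameter values are arbitrary, not from the lecture) comparing the two tail bounds for the empirical mean of bounded, mean-zero variables:

```python
import math

# Illustrative parameters (assumed): n observations, range bound |X_i| <= c,
# per-variable variance sigma^2, and deviation epsilon.
n, c, sigma2, eps = 1000, 1.0, 0.01, 0.05

# Hoeffding for variables in [-c, c]: P(mean >= eps) <= exp(-n eps^2 / (2 c^2))
hoeffding = math.exp(-n * eps**2 / (2 * c**2))

# Bernstein: P(mean >= eps) <= exp(-n eps^2 / (2 sigma^2 + 2 c eps / 3))
bernstein = math.exp(-n * eps**2 / (2 * sigma2 + 2 * c * eps / 3))

print(f"Hoeffding bound: {hoeffding:.2e}")  # roughly 2.9e-01
print(f"Bernstein bound: {bernstein:.2e}")  # roughly 4.4e-21
```

With $\sigma^2 \ll c^2$, the Bernstein exponent is driven by the variance rather than the range, which is exactly the small-variance regime behind Theorem 0.1.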

Proof. The proof is similar to that of Chernoff's bound, but we need a slightly different bound for the moment generating function, one that accounts for variance information. We have, as before, for any $s > 0$:
$$ P\left( \sum_{i=1}^n X_i \ge t \right) \le e^{-st} \prod_{i=1}^n E e^{sX_i}. $$


Now,
$$ \begin{aligned} E e^{sX_i} &= 1 + sEX_i + \sum_{j=2}^{\infty} \frac{s^j E X_i^j}{j!} \\ &\le 1 + \frac{\sigma_i^2}{c^2} \sum_{j=2}^{\infty} \frac{s^j c^j}{j!} \qquad \text{since } E X_i^j \le c^{j-2}\sigma_i^2 \\ &= 1 + \frac{\sigma_i^2}{c^2}\left( e^{sc} - 1 - sc \right) \\ &\le \exp\left( \frac{\sigma_i^2}{c^2}\left( e^{sc} - 1 - sc \right) \right). \end{aligned} $$
Hence,
$$ P\left( \sum_{i=1}^n X_i \ge t \right) \le \inf_{s>0} \exp\left( -st + \frac{n\sigma^2}{c^2}\left( e^{sc} - 1 - sc \right) \right) \le \exp\left( -\frac{t^2}{2n\sigma^2 + 2ct/3} \right), $$
by choosing $s = \frac{1}{c}\log\left(1 + \frac{tc}{n\sigma^2}\right)$.
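The choice of $s$ comes from minimizing the exponent, and the last step hides a computation worth recording. Setting the derivative $-t + \frac{n\sigma^2}{c}(e^{sc} - 1)$ to zero gives the stated $s$. Substituting back, with $u = tc/(n\sigma^2)$ and $h(u) = (1+u)\log(1+u) - u$,
$$ -st + \frac{n\sigma^2}{c^2}\left( e^{sc} - 1 - sc \right) = -\frac{n\sigma^2}{c^2}\, h\!\left( \frac{tc}{n\sigma^2} \right) \le -\frac{n\sigma^2}{c^2} \cdot \frac{u^2}{2 + 2u/3} = -\frac{t^2}{2n\sigma^2 + 2ct/3}, $$
using the elementary inequality $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$.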

Now we prove the first theorem using Bernstein's inequality.


Proof. Assume that $\sigma_F^2 = \sup_{f \in F} \mathrm{Var}(f)$ is attained at some $f^* \in F$ (otherwise, an extra limiting argument will suffice). Set $Y = \sum_{i=1}^n (f^*(X_i) - Ef^*)$ and $v = EY^2 = n\sigma_F^2 \ge 1$.

We will show that $|Y| \le k\sqrt{v}$ with high probability. Indeed, for any $k \ge 1$,
$$ EY^2 1_{[|Y| \ge k\sqrt{v}]} = \sum_{m=k}^{\infty} EY^2 1_{[m\sqrt{v} \le |Y| < (m+1)\sqrt{v}]} \le v \sum_{m=k}^{\infty} (m+1)^2 P\left[ |Y| \ge m\sqrt{v} \right]. $$

Note that, applying Bernstein's inequality to both $Y$ and $-Y$ (here $|f^*(X_i) - Ef^*| \le 2$),
$$ \begin{aligned} P\left[ |Y| \ge m\sqrt{v} \right] &\le 2\exp\left( -\frac{m^2 v}{2v + 4m\sqrt{v}/3} \right) \\ &\le 2\exp\left( -\frac{m^2 v}{(10/3)\, m v} \right) = 2e^{-3m/10}, \end{aligned} $$
using $\sqrt{v} \le v$ (since $v \ge 1$) and $m \ge 1$ in the second step.

Hence, with some sufficiently large constant $k$, we have $EY^2 1_{[|Y| \ge k\sqrt{v}]} \le v/4$. So,
$$ \begin{aligned} v = EY^2 &= EY^2 1_{[|Y| \le \sqrt{v}/2]} + EY^2 1_{[\sqrt{v}/2 < |Y| < k\sqrt{v}]} + EY^2 1_{[|Y| \ge k\sqrt{v}]} \\ &\le v/4 + k^2 v\, P\left( |Y| \ge \sqrt{v}/2 \right) + v/4. \end{aligned} $$
So $P(|Y| \ge \sqrt{v}/2) \ge 1/(2k^2)$, and therefore $P\left( \sup_{f \in F} |\hat{E}f - Ef| \ge \sqrt{v}/(2n) \right) \ge 1/(2k^2)$. Since $EW \ge a\,P(W \ge a)$ for any nonnegative random variable $W$ (Markov's inequality), and $\sqrt{v}/(2n) = \sigma_F/(2\sqrt{n})$, this implies that
$$ E \sup_{f \in F} \left| \hat{E}f - Ef \right| \ge \frac{\sigma_F}{4k^2 \sqrt{n}}. $$
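The "sufficiently large constant $k$" can be made concrete: by the tail bound above it suffices that $2\sum_{m \ge k} (m+1)^2 e^{-3m/10} \le 1/4$. A minimal sketch (our own numerical check, not part of the lecture) locating the smallest such $k$:

```python
import math

def tail(k: int, terms: int = 5000) -> float:
    """2 * sum over m >= k of (m+1)^2 * exp(-3m/10); the series decays
    geometrically, so truncating after a few thousand terms is plenty."""
    return 2 * sum((m + 1) ** 2 * math.exp(-3 * m / 10) for m in range(k, k + terms))

k = 1
while tail(k) > 0.25:
    k += 1
print(k)  # smallest k for which the bound on E Y^2 1[|Y| >= k sqrt(v)] is at most v/4
```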

The proof above uses Bernstein's inequality, which is stronger than Hoeffding's inequality when variance information is known. For the rest of the lecture, we will present a number of Bernstein-style inequalities and their applications. To draw the parallel, we first list the familiar Hoeffding's inequality and its consequences, and then Bernstein's inequality and its consequences.

Lemma 0.3. (Hoeffding's inequality) Let $f: X \to [a,b]$, $Z = \sum_{i=1}^n f(X_i)$. Then,
$$ P\left[ Z - EZ \ge t \right] \le \exp\left( -\frac{2t^2}{n(b-a)^2} \right). $$

Lemma 0.4. (Bounded difference inequality) If
$$ \left| g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{k-1}, x_k', x_{k+1}, \ldots, x_n) \right| \le c_k $$
for all $x_1, \ldots, x_n, x_k'$ and all $k$, then:
$$ P\left[ g(X_1, \ldots, X_n) - E g(X_1, \ldots, X_n) \ge t \right] \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right). $$

Consequence for empirical processes:

Lemma 0.5. Consider $f: X \to [a,b]$, $Z = \sup_{f \in F} \sum_{i=1}^n (Ef - f(X_i))$. Then:
$$ P\left[ Z - EZ \ge t \right] \le \exp\left( -\frac{2t^2}{n(b-a)^2} \right). $$
This follows from Lemma 0.4, since changing a single $X_k$ changes $Z$ by at most $c_k = b-a$.
Lemma 0.6. (Bernstein's inequality) Consider $f: X \to [-c,c]$, $Z = \sum_{i=1}^n f(X_i)$, $\sigma_i^2 = \mathrm{Var}\, f(X_i)$, and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$. Then:
$$ P\left[ Z - EZ \ge t \right] \le \exp\left( -\frac{t^2}{2n\sigma^2 + 2ct/3} \right). $$

The counterpart to the bounded difference inequality is a concentration inequality for functions with the self-bounding property (due to Boucheron, Lugosi and Massart).

A function $g: X^n \to \mathbb{R}$ has the self-bounding property if there exist functions $g_k: X^{n-1} \to \mathbb{R}$ for $k = 1, \ldots, n$ such that for all $x_i \in X$,
$$ 0 \le g(x) - g_k(x^{(k)}) \le 1 \qquad \text{and} \qquad \sum_{k=1}^n \left( g(x) - g_k(x^{(k)}) \right) \le g(x), $$
where $x^{(k)}$ denotes $x$ with the $k$-th coordinate removed.

Lemma 0.7. (Inequality for self-bounding functions) If $g$ has the self-bounding property, then $Z = g(X_1, \ldots, X_n)$ (with $X_i$ i.i.d.) satisfies:
$$ P\left[ Z - EZ \ge t \right] \le \exp\left( -\frac{t^2}{2EZ + 2t/3} \right). $$

The following inequality for empirical processes is called Talagrand's inequality. This version, however, is due to Bousquet:

Lemma 0.8. Consider $f: X \to [-1,1]$. Let $Z = \sup_{f \in F} \sum_{i=1}^n f(X_i)$. Assume that $Ef = 0$ and $\mathrm{Var}\, f(X_i) \le \sigma^2$ for all $f \in F$. Then:
$$ P\left[ Z - EZ \ge t \right] \le \exp\left( -\frac{t^2}{2v + 2t/3} \right), $$
with $v = n\sigma^2 + 2EZ$.

If $EZ \ge n\sigma^2$ and we take $t = \epsilon EZ + \epsilon n\sigma^2$ (with $0 < \epsilon < 1$), then an application of the above inequality gives:
$$ P\left[ \frac{Z}{n} \ge (1+\epsilon)\frac{EZ}{n} + \epsilon\sigma^2 \right] \le \exp\left( -\frac{3\epsilon^2 n\sigma^2}{7} \right). $$
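To verify this, note that $v = n\sigma^2 + 2EZ \le 2(EZ + n\sigma^2)$, so with $t = \epsilon(EZ + n\sigma^2)$ and $\epsilon < 1$,
$$ 2v + 2t/3 \le 4(EZ + n\sigma^2) + \tfrac{2}{3}(EZ + n\sigma^2) = \tfrac{14}{3}(EZ + n\sigma^2), $$
hence
$$ \frac{t^2}{2v + 2t/3} \ge \frac{3}{14}\,\epsilon^2 (EZ + n\sigma^2) \ge \frac{3}{7}\,\epsilon^2 n\sigma^2, $$
using $EZ \ge n\sigma^2$ in the last step. The event $Z - EZ \ge t$ is exactly $\frac{Z}{n} \ge (1+\epsilon)\frac{EZ}{n} + \epsilon\sigma^2$.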

Before ending the lecture, we give an application of Boucheron, Lugosi and Massart's inequality for self-bounding functions. Consider $g(X_1, \ldots, X_n) = \sup_{f \in F} \sum_{i=1}^n f(X_i)$, where each $f \in F$ satisfies $f: X \to [0,1]$. Then $g$ has the self-bounding property.

Indeed, define $g_k(x^{(k)}) = \sup_{f \in F} \sum_{i \ne k} f(x_i)$. Then we have:
$$ g(x) - g_k(x^{(k)}) = \sup_{f \in F} \sum_{i=1}^n f(x_i) - \sup_{f \in F} \sum_{i \ne k} f(x_i). $$
Suppose the first supremum is attained at $f^*$ and the second at $f_k^*$. Then
$$ g(x) - g_k(x^{(k)}) = \sum_{i=1}^n f^*(x_i) - \sum_{i \ne k} f_k^*(x_i) \ge \sum_{i=1}^n f_k^*(x_i) - \sum_{i \ne k} f_k^*(x_i) = f_k^*(x_k) \ge 0. $$
Similarly,
$$ g(x) - g_k(x^{(k)}) = \sum_{i=1}^n f^*(x_i) - \sup_{f \in F} \sum_{i \ne k} f(x_i) \le \sum_{i=1}^n f^*(x_i) - \sum_{i \ne k} f^*(x_i) = f^*(x_k) \le 1. $$
Also,
$$ (n-1)\, g(x) = (n-1) \sum_{i=1}^n f^*(x_i) = \sum_{k=1}^n \sum_{i \ne k} f^*(x_i) \le \sum_{k=1}^n \sup_{f \in F} \sum_{i \ne k} f(x_i) = \sum_{k=1}^n g_k(x^{(k)}). $$

Rearranging gives $\sum_{k=1}^n \left( g(x) - g_k(x^{(k)}) \right) \le g(x)$. Therefore, we can apply Boucheron, Lugosi and Massart's inequality: if $Z = \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n f(X_i)$, then:
$$ P\left[ Z - EZ \ge \epsilon \right] \le \exp\left( -\frac{n\epsilon^2}{2EZ + 2\epsilon/3} \right). $$
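As a closing sanity check, one can compare this bound against simulation. The following sketch is purely illustrative: the three functions in the class, the sample size, and $\epsilon$ are our own arbitrary choices, and the true $EZ$ is replaced by its Monte Carlo estimate.

```python
import math
import random

random.seed(0)

# A toy finite class of functions X -> [0, 1] on X = [0, 1] (arbitrary choices).
F = [lambda x: x, lambda x: 1 - x, lambda x: 4 * x * (1 - x)]

n, trials, eps = 50, 20000, 0.1

def Z() -> float:
    """One draw of sup_{f in F} (1/n) sum_i f(X_i), with X_i uniform on [0, 1]."""
    xs = [random.random() for _ in range(n)]
    return max(sum(f(x) for x in xs) for f in F) / n

samples = [Z() for _ in range(trials)]
EZ = sum(samples) / trials  # Monte Carlo stand-in for the true EZ

empirical = sum(z - EZ >= eps for z in samples) / trials
bound = math.exp(-n * eps**2 / (2 * EZ + 2 * eps / 3))

print(f"empirical P[Z - EZ >= {eps}]: {empirical:.4f}")
print(f"self-bounding tail bound:    {bound:.4f}")
```

The bound is far from tight for such an easy class, but it holds, and it depends on the class only through $EZ$, which is the point of the self-bounding approach.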
