where $R_n(\mathcal{F})$ is the Rademacher average of $\mathcal{F}$, and $\hat{E}$ denotes the empirical expectation. Hence, if $\hat{f} \in \mathcal{F}$ minimizes $\hat{E}f$, we have
$$E\hat{f} \le \inf_{f \in \mathcal{F}} Ef + 2R_n(\mathcal{F}) + c\sqrt{\frac{\log(1/\delta)}{n}}.$$
Since $R_n(\mathcal{F})$ is generally $O(1/\sqrt{n})$, the right-hand side of the above inequality is $O(1/\sqrt{n})$. Applying this to the class of loss functions $\{l_f \mid f \in \mathcal{F}\}$, we obtain a generalization bound for our learning algorithms. In some cases we can obtain an error rate better than $O(1/\sqrt{n})$; for instance, for the class of perceptrons studied earlier in the course, the error rate is $O(1/n)$. On the other hand, in terms of the worst-case variance $\sup_{f \in \mathcal{F}} \mathrm{Var}(l_f)$, the bound is tight, as illustrated by the following theorem:
Theorem 0.1. For $\mathcal{F} \subseteq [-1,1]^X$, let $\sigma_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}(f)$. If $\sigma_{\mathcal{F}}^2 \ge 1/n$, then for some constant $c$,
$$E \sup_{f \in \mathcal{F}} \left|Ef - \hat{E}f\right| \ge \frac{c\,\sigma_{\mathcal{F}}}{\sqrt{n}}.$$
To prove this result, we will need a concentration inequality that is stronger than Hoeffding's inequality, which is loose in the case of small variance.
Theorem 0.2. (Bernstein's inequality) Suppose that $X_1, \ldots, X_n$ are independent random variables with mean 0 and $|X_i| < c$ almost surely. Let $\sigma_i^2 = \mathrm{Var}(X_i)$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$. Then,
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i \ge \epsilon\right) \le \exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3}\right).$$
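The gain from variance information can be seen numerically. The sketch below (an illustration added here, not part of the notes; the parameter values are hypothetical) compares the Hoeffding tail bound for variables bounded in $[-c, c]$ with the Bernstein bound when $\sigma^2$ is small:

```python
import math

def hoeffding_bound(n, eps, c=1.0):
    # Hoeffding for |X_i| <= c (range b - a = 2c):
    # P(mean >= eps) <= exp(-2 n eps^2 / (2c)^2) = exp(-n eps^2 / (2 c^2))
    return math.exp(-n * eps**2 / (2 * c**2))

def bernstein_bound(n, eps, sigma2, c=1.0):
    # Bernstein: P(mean >= eps) <= exp(-n eps^2 / (2 sigma^2 + 2 c eps / 3))
    return math.exp(-n * eps**2 / (2 * sigma2 + 2 * c * eps / 3))

n, eps, sigma2 = 1000, 0.05, 0.01
print(hoeffding_bound(n, eps))          # ignores the small variance
print(bernstein_bound(n, eps, sigma2))  # much smaller when sigma^2 << eps
```

When $\sigma^2 \ll \epsilon$, the denominator $2\sigma^2 + 2c\epsilon/3$ is far smaller than the Hoeffding denominator $2c^2$, so the Bernstein exponent is correspondingly larger.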
Proof. The proof is similar to that of Chernoff's bound, but we need a slightly different bound on the moment generating function that accounts for variance information. We have, as before, for any $s > 0$:
$$P\left(\sum_{i=1}^n X_i \ge t\right) \le e^{-st} \prod_{i=1}^n E e^{sX_i}.$$
Bernstein's inequality, and generalizations
Now,
$$E e^{sX_i} = 1 + sEX_i + \sum_{j=2}^{\infty} \frac{s^j E X_i^j}{j!} \le 1 + \frac{\sigma_i^2}{c^2} \sum_{j=2}^{\infty} \frac{s^j c^j}{j!} \qquad \text{since } EX_i^j \le c^{j-2}\sigma_i^2$$
$$= 1 + \frac{\sigma_i^2}{c^2}\left(e^{sc} - 1 - sc\right) \le \exp\left(\frac{\sigma_i^2}{c^2}\left(e^{sc} - 1 - sc\right)\right)$$
Hence,
$$P\left(\sum_{i=1}^n X_i \ge t\right) \le \inf_{s>0} \exp\left(-st + \frac{n\sigma^2}{c^2}\left(e^{sc} - 1 - sc\right)\right) \le \exp\left(-\frac{t^2}{2n\sigma^2 + 2ct/3}\right)$$
by choosing $s = \frac{1}{c}\log\left(1 + \frac{tc}{n\sigma^2}\right)$.
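The choice of $s$ can be checked as follows (a reconstruction of the omitted calculus, added here; $h$ is the standard Bennett function, which the notes do not define):

```latex
% Optimizing over s: set the derivative to zero.
\frac{d}{ds}\Bigl[-st + \tfrac{n\sigma^2}{c^2}\bigl(e^{sc}-1-sc\bigr)\Bigr]
   = -t + \tfrac{n\sigma^2}{c}\bigl(e^{sc}-1\bigr) = 0
   \;\Longrightarrow\;
   s = \tfrac{1}{c}\log\Bigl(1 + \tfrac{tc}{n\sigma^2}\Bigr).
% Substituting back, with u = tc/(n\sigma^2), so that sc = \log(1+u):
-st + \tfrac{n\sigma^2}{c^2}\bigl(e^{sc}-1-sc\bigr)
   = -\tfrac{n\sigma^2}{c^2}\,h(u),
   \qquad h(u) = (1+u)\log(1+u) - u.
% Bernstein's form follows from the elementary bound h(u) \ge u^2/(2 + 2u/3):
%   \tfrac{n\sigma^2}{c^2}\cdot\tfrac{u^2}{2+2u/3} = \tfrac{t^2}{2n\sigma^2 + 2ct/3}.
```

The intermediate expression $\exp\bigl(-\tfrac{n\sigma^2}{c^2}h(u)\bigr)$ is Bennett's inequality, of which Bernstein's inequality is the more convenient corollary.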
The proof above uses Bernstein's inequality, which is stronger than Hoeffding's inequality when variance information is known. For the rest of the lecture, we will present a number of Bernstein-style inequalities and their applications. To draw the parallel, we first list the familiar Hoeffding's inequality and its consequences, and then Bernstein's inequality and its consequences.
Lemma 0.3. (Hoeffding's inequality) Let $f : X \to [a, b]$, $Z = \sum_{i=1}^n f(X_i)$. Then,
$$P[Z - EZ \ge t] \le \exp\left(-\frac{2t^2}{n(b-a)^2}\right).$$
Lemma 0.4. (Bounded difference inequality) If $|g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{k-1}, x_k', x_{k+1}, \ldots, x_n)| \le c_k$ for all $x_1, \ldots, x_n, x_k'$, then:
$$P\left[g(X_1, \ldots, X_n) - E g(X_1, \ldots, X_n) \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$
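To make the bounded-difference condition concrete, here is a small numerical sanity check (illustrative code with a hypothetical three-function class, not from the notes): for $Z = \sup_f \sum_i f(X_i)$ with each $f : [0,1] \to [0,1]$, resampling any single coordinate changes $Z$ by at most $b - a = 1$.

```python
import random

# A small, hypothetical finite class of functions [0,1] -> [a,b] = [0,1]
fs = [lambda x: x, lambda x: 1 - x, lambda x: x * x]

def Z(xs):
    # Z = sup over the class of the sum of f(X_i)
    return max(sum(f(x) for x in xs) for f in fs)

random.seed(0)
n = 5
xs = [random.random() for _ in range(n)]

# Resample each coordinate many times; the change in Z never exceeds b - a = 1,
# so Lemma 0.4 applies with c_k = b - a for every k.
worst = 0.0
for k in range(n):
    for _ in range(200):
        ys = list(xs)
        ys[k] = random.random()
        worst = max(worst, abs(Z(xs) - Z(ys)))
print(worst)  # never exceeds 1
```

This is exactly the observation that turns the bounded difference inequality into a Hoeffding-type bound for suprema of empirical processes.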
Lemma 0.5. For $Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)$ with each $f : X \to [a, b]$, changing any single $X_k$ changes $Z$ by at most $b - a$, so the bounded difference inequality gives:
$$P[Z - EZ \ge t] \le \exp\left(-\frac{2t^2}{n(b-a)^2}\right).$$
Lemma 0.6. (Bernstein's inequality) Consider $f : X \to [-c, c]$, $Z = \sum_{i=1}^n f(X_i)$, $\sigma_i^2 = \mathrm{Var}\, f(X_i)$, and $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$; then:
$$P[Z - EZ \ge t] \le \exp\left(-\frac{t^2}{2n\sigma^2 + 2ct/3}\right).$$
The counterpart to the bounded difference inequality is a concentration inequality for functions with the self-bounding property (due to Boucheron, Lugosi, and Massart).
A function $g : X^n \to \mathbb{R}$ has the self-bounding property if there exist $g_k : X^{n-1} \to \mathbb{R}$ for $k = 1, \ldots, n$ such that for all $x_i \in X$, $0 \le g(x) - g_k(x^{(k)}) \le 1$, and $\sum_{k=1}^n \left(g(x) - g_k(x^{(k)})\right) \le g(x)$, where $x^{(k)}$ denotes $x$ with the $k$-th coordinate removed.
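A classical example of a self-bounding function (added here for illustration) is the number of distinct values among $X_1, \ldots, X_n$: removing one coordinate lowers the count by at most 1, and the total decrease over all coordinates is at most the count itself. A quick check:

```python
import random

def g(xs):
    # number of distinct values among the coordinates
    return len(set(xs))

def g_k(xs, k):
    # the same count with coordinate k removed
    return len(set(xs[:k] + xs[k+1:]))

random.seed(1)
xs = [random.randint(0, 5) for _ in range(10)]
diffs = [g(xs) - g_k(xs, k) for k in range(len(xs))]

# 0 <= g(x) - g_k(x^{(k)}) <= 1: dropping one point removes at most one value
assert all(0 <= d <= 1 for d in diffs)
# sum_k (g - g_k) <= g: only values occurring exactly once contribute
assert sum(diffs) <= g(xs)
```

Both conditions hold for any sample, not just this seeded one, since $g - g_k$ is 1 exactly when $x_k$ is the unique occurrence of its value.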
Lemma 0.7. (Inequality for self-bounding functions) If $g$ has the self-bounding property, then $Z = g(X_1, \ldots, X_n)$ (with the $X_i$ i.i.d.) satisfies:
$$P[Z - EZ \ge t] \le \exp\left(-\frac{t^2}{2EZ + 2t/3}\right).$$
The following inequality for empirical processes is called Talagrand's inequality. This version is, however, due to Bousquet:
Lemma 0.8. Consider $f : X \to [-1, 1]$. Let $Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)$. Assume that $Ef = 0$ and $\sup_{f \in \mathcal{F}} \mathrm{Var}\, f(X_i) \le \sigma^2$; then:
$$P[Z - EZ \ge t] \le \exp\left(-\frac{t^2}{2v + 2t/3}\right),$$
with $v = n\sigma^2 + 2EZ$.
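To see why the variance-aware quantity $v$ matters, the sketch below (hypothetical numbers, added for illustration) compares the bounded-difference bound for $Z$ with Talagrand's bound when $\sigma^2$ is small and $EZ$ is of order $\sqrt{n}$:

```python
import math

def hoeffding_sup_bound(n, t, a=-1.0, b=1.0):
    # Bounded-difference bound for Z = sup_f sum_i f(X_i), f in [a, b]
    return math.exp(-2 * t**2 / (n * (b - a)**2))

def talagrand_bound(n, t, sigma2, EZ):
    # Bousquet's version with v = n sigma^2 + 2 EZ
    v = n * sigma2 + 2 * EZ
    return math.exp(-t**2 / (2 * v + 2 * t / 3))

# hypothetical numbers: small variance, EZ of order sqrt(n)
n, sigma2, EZ, t = 10000, 0.01, 100.0, 50.0
print(hoeffding_sup_bound(n, t))          # denominator n(b-a)^2: nearly useless here
print(talagrand_bound(n, t, sigma2, EZ))  # denominator ~ v: much sharper
```

The bounded-difference denominator grows like $n$, while $v = n\sigma^2 + 2EZ$ can be far smaller when the variance is small, which is the whole point of the Bernstein-style refinement.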
In particular, since $\sigma^2 \le 1$ here, taking $t = \epsilon EZ + n\epsilon$ yields, for $\epsilon \le 1$:
$$P\left[\frac{1}{n} Z \ge (1+\epsilon)\frac{EZ}{n} + \epsilon\right] \le \exp\left(-\frac{n\epsilon^2}{7}\right).$$
Similarly, substituting $t = n\epsilon$ into Lemma 0.7 gives the rate form
$$P[Z - EZ \ge n\epsilon] \le \exp\left(-\frac{n\epsilon^2}{2EZ/n + 2\epsilon/3}\right).$$