http://hunch.net/~vw/
John Langford
Yahoo! Research
{With help from Nikos Karampatziakis & Daniel Hsu}
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Why VW?
1. There should exist an open source online learning system.
2. Online learning is fast: a full training pass takes about 10 minutes [caching].
Start with w_i = 0 for all i. Repeatedly:
1. Get example x ∈ (−∞, ∞)*.
2. Make prediction ŷ = Σ_i w_i x_i, clipped to the interval [0, 1].
3. Learn truth y ∈ [0, 1] with importance I, or go to (1).
4. Update w_i ← w_i + η·2(y − ŷ)·I·x_i and go to (1).
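A minimal Python sketch of this loop (illustrative only: fixed learning rate eta, dense weight list, no hashing; VW's actual rate schedule is covered below):

def train(examples, num_features, eta=0.5):
    """examples: iterable of (x, y, imp) with x a {index: value} dict."""
    w = [0.0] * num_features                  # start with w_i = 0 for all i
    for x, y, imp in examples:
        y_hat = sum(w[i] * v for i, v in x.items())
        y_hat = min(1.0, max(0.0, y_hat))     # clip prediction to [0, 1]
        for i, v in x.items():                # w_i += eta * 2(y - y_hat) * I * x_i
            w[i] += eta * 2.0 * (y - y_hat) * imp * v
    return w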
Input Format
Label [Importance] [Tag]|Namespace Feature ...
|Namespace Feature ... ... \n
Namespace = String[:Float]
Feature = String[:Float]
Feature and Label are what you expect.
Importance is multiplier on learning rate.
Tag is an identifier for an example, echoed on example output.
Namespace is a mechanism for feature manipulation
and grouping.
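For example, this line (feature and namespace names invented for illustration) encodes a positive example with importance 2 and tag ex1, with features in two namespaces:

1 2.0 ex1|user age:25 region_us |ad clicks:3

Features without an explicit :Float get value 1.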
--cache_file <fc>: Use the cache file fc. Multiple ⇒ use all. Missing ⇒ create. Multiple + missing ⇒ concatenate.
--compressed <f>: Read a gzip compressed file.
Update Rule Options
--decay_learning_rate <d> [= 1]
--initial_t <i> [= 1]
--power_t <p> [= 0.5]
-l [ --learning_rate ] <l> [= 10]
The learning rate on example e is
η_e = l · d^(n_e) · ( i / (i + Σ_{e′<e} i_{e′}) )^p
where n_e is the number of completed passes and i_{e′} is the importance weight of example e′.
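A one-line Python rendering of this schedule (the names and the exact pass-decay form follow the reconstruction above):

def learning_rate(l=10.0, d=1.0, i=1.0, p=0.5, n_passes=0, sum_importance=0.0):
    # eta_e = l * d^n_passes * (i / (i + sum of earlier importance weights))^p
    return l * (d ** n_passes) * (i / (i + sum_importance)) ** p

# With the defaults, eta decays like (1/t)^0.5 in the importance-weighted count t.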
Weight Options
-b [ --bit_precision ] <b> [= 18]: Number of weights (2^b). Too many features in the example set ⇒ collisions occur.
-i [ --initial_regressor ] <ri>: Initial weight values. Multiple ⇒ average.
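A sketch of why collisions happen under hashing (illustrative: Python's built-in hash stands in for VW's actual hash function):

def feature_index(name, b=18):
    # b bits of precision => 2**b weight slots; distinct features can collide
    return hash(name) % (1 << b)

# e.g. two rare features may land in the same slot:
# feature_index("user^age") == feature_index("ad^clicks")  # possible collision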
A units argument for why the obvious update is fragile: ŷ = ⟨w, x⟩ has units of "label" and x_i has units of "feature i", so w_i has units of label / feature i. But ∂L(x)/∂w_i has units of label · feature i. Thus the update w ← w − η ∂L/∂w does not make dimensional sense on its own.
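Spelling out the units for squared loss (my arrangement of the argument above; L is the squared loss):

\[
[L] = [\mathrm{label}]^2, \qquad
[w_i] = \frac{[\mathrm{label}]}{[x_i]}, \qquad
\left[\frac{\partial L}{\partial w_i}\right] = \frac{[\mathrm{label}]^2}{[w_i]} = [\mathrm{label}]\,[x_i],
\]

so w_i and η ∂L/∂w_i agree in units only if η carries units of 1/[x_i]².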
Implication: choose x_i on a similar scale to x_j, so that units (and hence updates) are comparable across features.
The loss function determines the semantics of the prediction f(x):
Squared (default): f(x) = E[y | x].
Quantile τ: Pr(y > f(x) | x) = τ.
Logistic: Pr(y = 1 | x) = f(x), with y ∈ {−1, 1}.
For the learning-rate schedule, power_t values in [0.5, 1] are sensible; decay values < 1 protect against overfitting.
How do I debug?
1. Is your progressive validation loss going down as
you train? (no => malordered examples or bad
choice of learning rate)
2. If you test on the train set, does it work? (no
=> something crazy)
3. Are the predictions sensible?
4. Do you see the right number of features coming
up?
To see which features matter:
1. Save state.
2. Create a super-example with all features.
3. Start vw with the --audit option.
4. Save the printout.
(Seems whacky: but this works with hashing.)
Principle
Having an example with importance weight h should be equivalent to having the example h times in the dataset.
(Karampatziakis & Langford, http://arxiv.org/abs/1011.1576 for details.)
[Diagrams: the prediction w_t⊤x before and after an update. A standard gradient step η(ℓ′)⊤x scaled by a large importance weight can overshoot the label y (w_{t+1}⊤x lands past y, marked "??"), while the importance-aware update moves w_t⊤x toward y by s(h)·‖x‖² without crossing it.]
What is s(·)?
Take the limit as the update size goes to 0 but the number of updates goes to ∞.
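Concretely (my rendering of the paper's framing), the limit is the solution of an ODE along the direction x:

\[
\frac{dp(h)}{dh} = -\eta\, x^\top x\, \ell'(p(h)), \qquad p(0) = w_t^\top x,
\]

and the closed-form update is \(w_{t+1} = w_t - s(h)\,x\) with \(s(h) = (p(0) - p(h)) / x^\top x\), so that \(w_{t+1}^\top x = p(h)\).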
Loss         ℓ(p, y)                            Update s(h)
Squared      (y − p)²                           ((p − y)/x⊤x) · (1 − e^{−2hη x⊤x})
Logistic     log(1 + e^{−yp}), y ∈ {−1, 1}      (y/x⊤x) · (W(e^{hη x⊤x + yp + e^{yp}}) − e^{yp} − hη x⊤x)
Hinge        max(0, 1 − yp), y ∈ {−1, 1}        −y · min(hη, (1 − yp)/x⊤x)   for yp < 1
τ-Quantile   τ(y − p)        if y > p           −min(hητ, (y − p)/x⊤x)       if y > p
             (1 − τ)(p − y)  if y ≤ p           min(hη(1 − τ), (p − y)/x⊤x)  if y ≤ p

+ many others worked out. Similar in effect to implicit gradient, but closed form.
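A quick numerical check of the squared-loss row (the closed form as reconstructed above should match many tiny standard updates):

import math

def many_small_updates(p, y, h, eta, xx, n=100_000):
    # apply the standard squared-loss step n times, each carrying weight h/n
    for _ in range(n):
        p -= (h / n) * eta * xx * 2.0 * (p - y)   # dp = -eta * x'x * l'(p) * dh
    return p

def closed_form(p, y, h, eta, xx):
    # p(h) = y + (p - y) * exp(-2 h eta x'x), i.e. p - s(h) * x'x
    return y + (p - y) * math.exp(-2.0 * h * eta * xx)

print(many_small_updates(0.0, 1.0, 5.0, 0.5, 2.0))  # ~0.99995
print(closed_form(0.0, 1.0, 5.0, 0.5, 2.0))         # ~0.99995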
[Figures: scatter plots of held-out accuracy with standard updates vs. importance-aware updates across problems; both axes run roughly 0.9 to 0.99.]
Adaptive Updates
I Adaptive, individual learning rates in VW.
I It's really gradient descent separately on each coordinate i, with
η_{t,i} = 1 / sqrt( Σ_{s=1}^{t} ( ∂ℓ(w_s⊤x_s, y_s) / ∂w_{s,i} )² )
I A saved regressor is compatible with these updates.
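A sketch of the per-coordinate rule (illustrative, with a toy squared-loss gradient; the names are mine):

import math

def adaptive_step(w, g_sq, x, y):
    """w: weights, g_sq: accumulated squared gradients, both {index: float}."""
    p = sum(w.get(i, 0.0) * v for i, v in x.items())
    for i, v in x.items():
        g = 2.0 * (p - y) * v                 # gradient of (y - p)^2 in w_i
        g_sq[i] = g_sq.get(i, 0.0) + g * g
        if g_sq[i] > 0.0:                     # eta_{t,i} = 1/sqrt(sum of g^2)
            w[i] = w.get(i, 0.0) - g / math.sqrt(g_sq[i])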
Experiments
I Raw data.
I For squared loss at z = w⊤x_t, the curvature along a direction d is d⊤(∇²ℓ(z))d = 2⟨x_t, d⟩².
Active learning simulation: maintain an importance-weighted labeled set S. For each unlabeled example x_t:
1. Let h_t = Learn(S).
2. Compute a query probability p_t = min{1, ·}, which shrinks on the order of sqrt(log t / t) and grows as the margin max{h(x_t), 1 − h(x_t)} − 0.5 shrinks.
3. With probability p_t, query the label y_t and add (x_t, y_t, 1/p_t) to S.
4. Otherwise skip x_t and continue.
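An illustrative query rule in this spirit (the exact p_t used by VW follows Beygelzimer, Hsu, Langford & Zhang; the constants and functional form here are placeholders):

import math, random

def query_probability(margin, t, C=1.0):
    # placeholder rule for t >= 1: shrink like sqrt(log t / t), scaled by
    # mellowness C, and query more when the current margin is small
    return min(1.0, C * math.sqrt(math.log(t + 1.0) / t) / max(margin, 1e-6))

def maybe_query(margin, t):
    p = query_probability(margin, t)
    if random.random() < p:
        return ("query", 1.0 / p)   # request the label, weight it by 1/p
    return ("no query", None)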
C: tuning parameter (C > 0)
vw --active_simulation --active_mellowness C
(increasing C = supervised learning)
Deployed with a live labeler:
vw --active_learning --active_mellowness C --daemon
I vw interacts with an active_interactor (ai)
I for each unlabeled data point, vw sends back a
query decision (+importance weight)
I ai sends labeled importance-weighted examples as requested
I vw trains online on what comes back
Protocol sketch: ai sends x_1; vw replies (query, 1/p_1); ai sends (x_1, y_1, 1/p_1); vw makes a gradient update. ai sends x_2; vw replies (no query); ...
training:
vw --active_simulation --active_mellowness 0.000001 -d rcv1-train -f active.reg -l 10 --initial_t 10
number of examples = 781265
testing:
vw -t -d rcv1-test -i active.reg
average loss = 0.04872 (better than supervised)
[Figures: test error vs. fraction of labels queried on the spam and webspam datasets, comparing importance-aware updates, gradient multiplication, and passive learning.]