Common types:
Global effects
Nearest neighbor
Matrix factorization
Restricted Boltzmann machine
Clustering
Etc.
w = (X^T X)^{-1} X^T y
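This is the closed-form least-squares (normal equation) solution for the weights of a linear model. As a minimal sketch (the data below is made up for illustration, not taken from the lecture), it can be evaluated directly with NumPy; np.linalg.solve is used rather than forming the inverse explicitly:

```python
import numpy as np

# Minimal sketch: closed-form least-squares weights w = (X^T X)^{-1} X^T y.
# X and y are made-up example data, not from the lecture.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])   # each row: [bias term, feature]
y = np.array([1.0, 2.0, 4.0])

w = np.linalg.solve(X.T @ X, X.T @ y)   # more stable than computing the inverse
print(w)
```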
f(x) = x^2 + 2x - 3
df(x)/dx = 2x + 2
2x + 2 = 0
x = -1 is the value of x where f(x) is at its minimum
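As a small cross-check (not part of the original slides), the same calculation can be done symbolically; SymPy is assumed to be available:

```python
import sympy as sp

# Minimal sketch: differentiate f(x) = x^2 + 2x - 3 and solve df/dx = 0.
x = sp.symbols('x')
f = x**2 + 2*x - 3

df = sp.diff(f, x)           # 2*x + 2
critical = sp.solve(df, x)   # [-1]
print(df, critical)
print(sp.diff(f, x, 2))      # second derivative is 2 > 0, so x = -1 is a minimum
```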
The gradient is a vector
Each element is the slope of the function along the direction of one of the variables
Each element is the partial derivative of the function with respect to one of the variables
\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_d} \right), where f(x) = f(x_1, x_2, \ldots, x_d)

Example:
f(x) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2
\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2} \right) = (2 x_1 + x_2, \ x_1 + 6 x_2)
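A hedged sketch of this example in Python (not from the lecture): the analytic gradient of f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2, compared against a finite-difference estimate at an arbitrary point:

```python
import numpy as np

# Minimal sketch: the example gradient, checked numerically by central differences.
def f(x1, x2):
    return x1**2 + x1 * x2 + 3 * x2**2

def grad_f(x1, x2):
    return np.array([2 * x1 + x2, x1 + 6 * x2])

x1, x2, h = 1.0, -2.0, 1e-6
numeric = np.array([(f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
                    (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)])
print(grad_f(x1, x2), numeric)   # both close to [0., -11.]
```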
Optimization
[Figure: surface/contour plot of f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2, annotated with its partial derivatives; \nabla f(x) = (2 x_1 + x_2, \ x_1 + 6 x_2)]
http://www.ce.berkeley.edu/~bayen/ce191www/lecturenotes/lecture10v01_descent2.pdf
Example in MATLAB
http://www.youtube.com/watch?v=cY1YGQQbrpQ
Problems:
Choosing the step size (illustrated in the sketch below)
too small: convergence is slow and inefficient
too large: may not converge
Can get stuck on flat areas of the function
Easily trapped in local minima
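A minimal sketch of the step-size problem on the earlier one-dimensional example f(x) = x^2 + 2x - 3 (the step sizes and iteration count are illustrative choices, not values from the lecture):

```python
# Minimal sketch: effect of step size in gradient descent on f(x) = x^2 + 2x - 3,
# whose minimum is at x = -1 and whose derivative is df/dx = 2x + 2.
def gradient_descent(step, x=4.0, iters=25):
    for _ in range(iters):
        x = x - step * (2 * x + 2)
    return x

print(gradient_descent(0.01))   # too small: still far from -1 after 25 steps
print(gradient_descent(0.4))    # reasonable: close to -1
print(gradient_descent(1.1))    # too large: the iterates diverge
```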
Stochastic (definition):
1. involving a random variable
2. involving chance or probability; probabilistic
[Figure: the sparse training-data ratings matrix (user 1 ... user 480189 by movie 1 ... movie 17770, with ratings 1-5) is factorized by the training process into a user-by-feature matrix and a feature-by-movie matrix (feature 1 ... feature 5), each filled with learned numbers.]
[Figure: the ratings matrix (users by movies) shown together with the two factor matrices from the factorization: a feature-by-movie matrix and a user-by-feature matrix (feature 1 ... feature 5).]
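As a hedged sketch of the shapes involved (the matrices below are random placeholders, not trained factors; F = 5 follows the figure), the factorization and a single predicted rating look like this in NumPy:

```python
import numpy as np

# Minimal sketch of the factorization shapes using the Netflix-scale numbers
# from the slides; the factor matrices here are random, not learned.
I, J, F = 480_189, 17_770, 5

U = np.random.rand(I, F)   # user-by-feature matrix ("a bunch of numbers")
V = np.random.rand(F, J)   # feature-by-movie matrix

i, j = 2, 9                # e.g. user 3, movie 10 (0-based indices)
r_hat = U[i, :] @ V[:, j]  # predicted rating: sum over features of U[i,f] * V[f,j]
print(r_hat)
```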
Notation
Number of users = I
Number of items = J
Number of factors per user / item = F
User of interest = i
Item of interest = j
Factor index = f
Predicted rating: \hat{r}_{ij} = \sum_{f=1}^{F} U_{if} V_{jf}, error: e = r_{ij} - \hat{r}_{ij}
Squared-error loss for a single rating: L(r_{ij}, \hat{r}_{ij}) = e^2

Partial derivatives of the loss with respect to the factors:
\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial U_{if}} = -2 e V_{jf}
\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial V_{jf}} = -2 e U_{if}

Stochastic gradient descent update for each training rating r_{ij} (with learning rate \eta and regularization \lambda):
for f = 1 to F
    U_{if} \leftarrow U_{if} + 2 \eta (e V_{jf} - \lambda U_{if})
    V_{jf} \leftarrow V_{jf} + 2 \eta (e U_{if} - \lambda V_{jf})
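A minimal Python sketch of these updates (not the lecture's code; the learning rate eta, regularization lam, and the toy data are assumed values for illustration):

```python
import numpy as np

# Minimal sketch: one SGD pass over the observed ratings using the per-factor
# updates above. U is I x F, V is J x F (so V[j, f] matches V_jf in the notation).
def sgd_epoch(ratings, U, V, eta=0.01, lam=0.02):
    """ratings: iterable of (i, j, r_ij) triples."""
    for i, j, r in ratings:
        e = r - U[i] @ V[j]              # e = r_ij - sum_f U_if * V_jf
        for f in range(U.shape[1]):      # for f = 1 to F
            u, v = U[i, f], V[j, f]
            U[i, f] += 2 * eta * (e * v - lam * u)
            V[j, f] += 2 * eta * (e * u - lam * v)
    return U, V

# Tiny made-up example: 4 users, 3 items, 2 factors, 5 observed ratings.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((4, 2))
V = 0.1 * rng.standard_normal((3, 2))
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (3, 2, 2.0)]
for _ in range(100):
    U, V = sgd_epoch(ratings, U, V)
print(U @ V.T)   # reconstructed ratings approach the observed training values
```

Note that the loop only visits observed ratings, which is what makes this practical for a matrix as sparse as the Netflix training data.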
Random thoughts
Samples can be processed in small batches instead of one at a time → batch (mini-batch) gradient descent (see the sketch after this list)
We'll see stochastic / batch gradient descent again when we learn about neural networks (as back-propagation)
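A hedged sketch of the mini-batch idea on an unrelated least-squares problem (all data and parameter values below are made up for illustration): gradients are averaged over a small batch of samples before each update, instead of updating after every single sample:

```python
import numpy as np

# Minimal sketch: mini-batch gradient descent for least squares.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

w = np.zeros(3)
eta, batch_size = 0.1, 20
for epoch in range(50):
    order = rng.permutation(len(y))          # shuffle samples each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # average gradient over the batch
        w -= eta * grad
print(w)   # close to [1.0, -2.0, 0.5]
```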