
Machine Learning Srihari

The Hessian Matrix

Sargur Srihari


Definitions of Gradient and Hessian


•  First derivative of a scalar function E(w) with respect to a
   vector w = [w1, w2]T is a vector called the Gradient of E(w)

       ∇E(w) = (d/dw) E(w) = [ ∂E/∂w1 , ∂E/∂w2 ]T

   If there are M elements in the vector w, the Gradient is an M x 1 vector

•  Second derivative of E(w) is a matrix called the Hessian of E(w)

       H = ∇∇E(w) = (d²/dw²) E(w) = [ ∂²E/∂w1²      ∂²E/∂w1∂w2
                                       ∂²E/∂w2∂w1    ∂²E/∂w2²   ]

   The Hessian is a matrix with M2 elements
•  The Jacobian is a matrix of first derivatives of a vector-valued function
   with respect to a vector
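
To make the definitions concrete, here is a minimal sketch (not from the slides) that estimates the gradient and Hessian of a scalar E(w) by central differences; the quadratic E_example and all names are illustrative.

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-5):
    """Gradient of the scalar function E at w: an M x 1 vector of partial
    derivatives, each estimated by a central difference."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (E(w + e) - E(w - e)) / (2 * eps)
    return g

def numerical_hessian(E, w, eps=1e-4):
    """Hessian of E at w: an M x M matrix of second partial derivatives,
    each estimated by a second-order central difference."""
    M = w.size
    H = np.zeros((M, M))
    I = np.eye(M)
    for i in range(M):
        for j in range(M):
            H[i, j] = (E(w + eps * (I[i] + I[j])) - E(w + eps * (I[i] - I[j]))
                       - E(w - eps * (I[i] - I[j])) + E(w - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    return H

# Hypothetical example: E(w) = w1^2 + 3 w1 w2 + 2 w2^2, whose Hessian is [[2, 3], [3, 4]]
E_example = lambda w: w[0] ** 2 + 3 * w[0] * w[1] + 2 * w[1] ** 2
w0 = np.array([1.0, -2.0])
print(numerical_gradient(E_example, w0))   # ~ [-4, -5]
print(numerical_hessian(E_example, w0))    # ~ [[2, 3], [3, 4]]
```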

Need for Gradient and Hessian in ML

Training samples: n = 1,..., N
Inputs: M x 1 feature vectors
    φ(xn) = ( φ0(xn), φ1(xn), ..., φM−1(xn) )T,   with dummy feature φ0(x) = 1
Outputs: t = (t1,..., tN)T

Design matrix (N x M), with row n equal to φ(xn)T:

    Φ = [ φ0(x1)   φ1(x1)   ...   φM−1(x1)
          φ0(x2)   φ1(x2)   ...   φM−1(x2)
            ...
          φ0(xN)   φ1(xN)   ...   φM−1(xN) ]

Error surface for M = 2: a paraboloid with a single global minimum.

For Logistic Regression (cross-entropy error):

    E(w) = −ln p(t | w) = −∑n=1..N { tn ln yn + (1 − tn) ln(1 − yn) },   where yn = σ( wT φ(xn) )

For Stochastic Gradient Descent we need ∇En(w), where E(w) = ∑n En(w):

    w(τ+1) = w(τ) − η ∇En(w)

For the Newton-Raphson update we need both ∇E(w) and H = ∇∇E(w):

    w(new) = w(old) − H⁻¹ ∇E(w)
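
As a small illustration (not part of the original slides), the two update rules can be written as short helpers; grad_En, grad_E and hess_E are assumed to be user-supplied functions returning ∇En(w), ∇E(w) and H respectively.

```python
import numpy as np

def sgd_step(w, grad_En, n, eta=0.1):
    """Stochastic gradient descent update: w(tau+1) = w(tau) - eta * grad E_n(w),
    using the gradient of a single training sample n."""
    return w - eta * grad_En(w, n)

def newton_step(w, grad_E, hess_E):
    """Newton-Raphson update: w(new) = w(old) - H^{-1} grad E(w).
    Solving the linear system avoids forming H^{-1} explicitly."""
    return w - np.linalg.solve(hess_E(w), grad_E(w))
```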

Gradient and Hessian for Two-class Logistic Regression


•  Cross-Entropy Error

       E(w) = −ln p(t | w) = −∑n=1..N { tn ln yn + (1 − tn) ln(1 − yn) },   where yn = σ( wT φ(xn) )

•  Gradient of E, with y = (y1,..., yN)T and t = (t1,..., tN)T:

       ∇E(w) = ∑n=1..N (yn − tn) φ(xn) = ΦT (y − t)

•  Hessian of E

       H = ∇∇E(w) = ∑n=1..N yn (1 − yn) φ(xn) φ(xn)T = ΦT R Φ

   where Φ is the N x M design matrix with rows φ(xn)T = ( φ0(xn), φ1(xn), ..., φM−1(xn) ),
   and R is the N x N diagonal matrix with elements Rnn = yn (1 − yn), yn = σ( wT φ(xn) )

   The Hessian is not constant: it depends on w through R.
   Since H is positive definite (i.e., uT H u > 0 for arbitrary u ≠ 0), the error function is a
   convex function of w and so has a unique minimum.
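
A minimal sketch of these formulas (assuming a design matrix Phi of shape N x M, a 0/1 target vector t of length N, and current weights w; all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad_hessian(Phi, t, w):
    """Gradient  Phi^T (y - t)  and Hessian  Phi^T R Phi  of the cross-entropy error."""
    y = sigmoid(Phi @ w)              # y_n = sigma(w^T phi(x_n)), length N
    grad = Phi.T @ (y - t)            # sum_n (y_n - t_n) phi(x_n), length M
    R = np.diag(y * (1.0 - y))        # R_nn = y_n (1 - y_n); depends on w
    H = Phi.T @ R @ Phi               # M x M, positive definite
    return grad, H

def newton_update(Phi, t, w):
    """One Newton-Raphson step: w_new = w - H^{-1} grad E(w)."""
    grad, H = logistic_grad_hessian(Phi, t, w)
    return w - np.linalg.solve(H, grad)
```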

Gradient and Hessian for Multi-class Logistic Regression


•  Cross-Entropy Error, where T is the N x K matrix of target variables with elements tnk:

       E(w1,..., wK) = −ln p(T | w1,..., wK) = −∑n=1..N ∑k=1..K tnk ln ynk

   with softmax outputs

       ynk = yk( φ(xn) ) = exp( wkT φ(xn) ) / ∑j exp( wjT φ(xn) )

•  Gradient of E with respect to wj

       ∇wj E(w1,..., wK) = ∑n=1..N (ynj − tnj) φ(xn)

•  Hessian of E, a matrix of blocks indexed by k and j

       ∇wk ∇wj E(w1,..., wK) = ∑n=1..N ynk (Ikj − ynj) φ(xn) φ(xn)T

   (Φ and φ(xn) are as defined for the two-class case)

Each element of the Hessian needs M multiplications and additions; since there are M2 elements
in the matrix, the computation is O(M3).
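
A sketch of the multi-class formulas (assuming Phi of shape N x M, a one-hot target matrix T of shape N x K, and a weight matrix W of shape M x K whose column k is wk; names are illustrative):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of the activation matrix A, where A[n, k] = w_k^T phi(x_n)."""
    A = A - A.max(axis=1, keepdims=True)        # subtract the row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_grad_hessian(Phi, T, W):
    """Gradient columns  sum_n (y_nj - t_nj) phi(x_n)  and Hessian blocks
    sum_n y_nk (I_kj - y_nj) phi(x_n) phi(x_n)^T  for multi-class logistic regression."""
    Y = softmax(Phi @ W)                        # N x K matrix of y_nk
    grad = Phi.T @ (Y - T)                      # M x K; column j is the gradient wrt w_j
    N, M = Phi.shape
    K = T.shape[1]
    H = np.zeros((K, K, M, M))                  # H[k, j] is the (k, j) block of the Hessian
    for k in range(K):
        for j in range(K):
            r = Y[:, k] * ((k == j) - Y[:, j])  # y_nk (I_kj - y_nj) for each n
            H[k, j] = Phi.T @ (r[:, None] * Phi)
    return grad, H
```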

Hessian of Neural Network Error Function


•  Backpropagation can be used to obtain first derivatives of the error
   function wrt the weights in the network
•  It can also be used to derive the second derivatives

       ∂²E / ∂wji ∂wlk

•  If all weights and bias parameters are elements wi of a single vector w,
   then the second derivatives form the elements Hij of the Hessian matrix H,
   where i, j ∈ {1,..., W}

Role of Hessian in Neural Computing


1.  Several nonlinear optimization algorithms for neural networks are based
    on second-order derivatives of the error surface
2.  Basis for a fast procedure for re-training a network after a small change
    in the training data
3.  Identifying the least significant weights for network pruning requires the
    inverse of the Hessian
4.  Bayesian neural networks: the Hessian plays a central role in the Laplace
    approximation

Evaluating the Hessian Matrix


•  The full Hessian matrix can be difficult to compute in practice
   •  quasi-Newton algorithms have been developed that use approximations to the Hessian
•  Various approximation techniques have been used to evaluate the Hessian for
   a neural network
   •  it can also be calculated exactly, using an extension of backpropagation
•  An important consideration is efficiency
   •  with W parameters (weights and biases) the matrix has dimension W x W
   •  efficient methods have O(W2) cost


Methods for evaluating the Hessian Matrix


•  Diagonal Approximation
•  Outer Product Approximation
•  Inverse Hessian
•  Finite Differences
•  Exact Evaluation using Backpropagation
•  Fast multiplication by the Hessian


Diagonal Approximation

•  In many cases the inverse of the Hessian is needed
•  If the Hessian approximation is diagonal, its inverse is trivially computed
•  Complexity is O(W) rather than the O(W2) of the full Hessian
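
A trivial sketch of why the diagonal case is cheap (the diagonal vector and gradient below are hypothetical values, purely for illustration):

```python
import numpy as np

def apply_inverse_diag_hessian(h_diag, v):
    """If H is approximated by diag(h_diag), then H^{-1} v is just an elementwise
    division: O(W), rather than the O(W2) needed to work with the full Hessian."""
    return v / h_diag

# Illustrative use with a hypothetical diagonal and gradient vector
h_diag = np.array([4.0, 1.0, 0.5])
grad = np.array([2.0, -1.0, 0.25])
print(apply_inverse_diag_hessian(h_diag, grad))   # Newton-like step direction in O(W)
```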


Outer product approximation


•  Neural networks commonly use the sum-of-squares error function

       E = (1/2) ∑n=1..N (yn − tn)²

•  The Hessian matrix can then be written in the form

       H ≈ ∑n=1..N bn bnT

•  where bn = ∇yn = ∇an, the gradient of the output-unit activation an
   (equal to ∇yn when the output unit is linear)
•  Elements can be found in O(W2) steps
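
A sketch (not from the slides): if the vectors bn are stacked as the rows of a matrix B, the approximation is a single matrix product.

```python
import numpy as np

def outer_product_hessian(B):
    """Outer-product approximation  H ~= sum_n b_n b_n^T,
    where row n of B (shape N x W) is b_n^T, the gradient of the n-th output activation.
    Each pattern contributes its W x W outer product, so the matrix is built in O(N W^2)."""
    return B.T @ B
```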

Inverse Hessian

•  Use the outer product approximation to obtain a computationally efficient
   procedure for approximating the inverse of the Hessian
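
One way to sketch such a procedure: apply the Sherman-Morrison identity to the outer-product form H = αI + ∑n bn bnT, updating the inverse one data point at a time. The small initial term αI is an assumption needed to make the first inverse well defined; names are illustrative.

```python
import numpy as np

def inverse_hessian_outer_product(B, alpha=1e-3):
    """Build an approximate H^{-1} incrementally for H = alpha*I + sum_n b_n b_n^T,
    adding one outer product b_n b_n^T at a time via the Sherman-Morrison identity."""
    W = B.shape[1]
    H_inv = np.eye(W) / alpha        # inverse of the initial alpha*I term
    for b in B:                      # b is b_n, a vector of length W
        Hb = H_inv @ b
        H_inv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv
```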


Finite Differences

•  Applying finite differences to the first derivatives obtained by backprop
   reduces the complexity from O(W3) to O(W2)
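
A sketch of that idea, assuming grad_E is a (backprop) function returning ∇E(w); the name is illustrative.

```python
import numpy as np

def hessian_by_differencing_gradients(grad_E, w, eps=1e-4):
    """Column j of H ~= ( grad E(w + eps e_j) - grad E(w - eps e_j) ) / (2 eps).
    With a backprop gradient costing O(W), the 2W gradient calls give the
    full Hessian in O(W^2) operations."""
    Wdim = w.size
    H = np.zeros((Wdim, Wdim))
    for j in range(Wdim):
        e = np.zeros_like(w)
        e[j] = eps
        H[:, j] = (grad_E(w + e) - grad_E(w - e)) / (2 * eps)
    return H
```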


Exact Evaluation of the Hessian

•  Using an extension of backprop


•  Complexity is O(W2)


Fast Multiplication by the Hessian

•  Many applications of the Hessian involve multiplying it by a vector v
•  The vector vTH has only W elements, whereas H itself has W2
•  So instead of computing H as an intermediate step, find an efficient
   method to compute the product directly
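
The exact technique in the literature applies an R-operator to the backpropagation equations; as a simpler stand-in, here is a finite-difference sketch of the same idea (grad_E is assumed to be a function returning ∇E(w)):

```python
import numpy as np

def hessian_vector_product(grad_E, w, v, eps=1e-5):
    """Approximate H v (a vector of W elements) without ever forming H:
       H v ~= ( grad E(w + eps*v) - grad E(w - eps*v) ) / (2 eps).
    Only two gradient evaluations are needed; v^T H is just (H v)^T since H is symmetric."""
    return (grad_E(w + eps * v) - grad_E(w - eps * v)) / (2 * eps)
```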

