
Machine Learning Srihari

The Hessian Matrix

Sargur Srihari


Definitions of Gradient and Hessian


•  First derivative of a scalar function E(w) with respect to a
   vector w = [w1, w2]T is a vector called the Gradient of E(w)

       ∇E(w) = (d/dw) E(w) = [ ∂E/∂w1 , ∂E/∂w2 ]T

   If there are M elements in the vector w, the Gradient is an M x 1 vector

•  Second derivative of E(w) is a matrix called the Hessian of E(w)

       H = ∇∇E(w) = (d²/dw²) E(w) = [ ∂²E/∂w1²      ∂²E/∂w1∂w2
                                       ∂²E/∂w2∂w1    ∂²E/∂w2²   ]

   The Hessian is a matrix with M2 elements
•  The Jacobian is a matrix of first derivatives of a vector-valued function
   with respect to a vector
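
To make the definitions concrete, here is a minimal sketch (not from the slides) that estimates the gradient and Hessian of a scalar E(w) by central differences; the quadratic E_example and all names are illustrative.

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-5):
    """Gradient of the scalar function E at w: an M x 1 vector of partial
    derivatives, each estimated by a central difference."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (E(w + e) - E(w - e)) / (2 * eps)
    return g

def numerical_hessian(E, w, eps=1e-4):
    """Hessian of E at w: an M x M matrix of second partial derivatives,
    each estimated by a second-order central difference."""
    M = w.size
    H = np.zeros((M, M))
    I = np.eye(M)
    for i in range(M):
        for j in range(M):
            H[i, j] = (E(w + eps * (I[i] + I[j])) - E(w + eps * (I[i] - I[j]))
                       - E(w - eps * (I[i] - I[j])) + E(w - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    return H

# Hypothetical example: E(w) = w1^2 + 3 w1 w2 + 2 w2^2, whose Hessian is [[2, 3], [3, 4]]
E_example = lambda w: w[0] ** 2 + 3 * w[0] * w[1] + 2 * w[1] ** 2
w0 = np.array([1.0, -2.0])
print(numerical_gradient(E_example, w0))   # ~ [-4, -5]
print(numerical_hessian(E_example, w0))    # ~ [[2, 3], [3, 4]]
```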

Need for Gradient and Hessian in ML

Training samples: n = 1,..., N
Inputs: M x 1 feature vectors
    φ(xn) = ( φ0(xn), φ1(xn), ..., φM−1(xn) )T,   with dummy feature φ0(x) = 1
Outputs: t = (t1,..., tN)T

Design matrix (N x M), with row n equal to φ(xn)T:

    Φ = [ φ0(x1)   φ1(x1)   ...   φM−1(x1)
          φ0(x2)   φ1(x2)   ...   φM−1(x2)
            ...
          φ0(xN)   φ1(xN)   ...   φM−1(xN) ]

Error surface for M = 2: a paraboloid with a single global minimum.

For Logistic Regression (cross-entropy error):

    E(w) = −ln p(t | w) = −∑n=1..N { tn ln yn + (1 − tn) ln(1 − yn) },   where yn = σ( wT φ(xn) )

For Stochastic Gradient Descent we need ∇En(w), where E(w) = ∑n En(w):

    w(τ+1) = w(τ) − η ∇En(w)

For the Newton-Raphson update we need both ∇E(w) and H = ∇∇E(w):

    w(new) = w(old) − H⁻¹ ∇E(w)
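
As a small illustration (not part of the original slides), the two update rules can be written as short helpers; grad_En, grad_E and hess_E are assumed to be user-supplied functions returning ∇En(w), ∇E(w) and H respectively.

```python
import numpy as np

def sgd_step(w, grad_En, n, eta=0.1):
    """Stochastic gradient descent update: w(tau+1) = w(tau) - eta * grad E_n(w),
    using the gradient of a single training sample n."""
    return w - eta * grad_En(w, n)

def newton_step(w, grad_E, hess_E):
    """Newton-Raphson update: w(new) = w(old) - H^{-1} grad E(w).
    Solving the linear system avoids forming H^{-1} explicitly."""
    return w - np.linalg.solve(hess_E(w), grad_E(w))
```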

Gradient and Hessian for Two-class Logistic Regression


•  Cross-Entropy Error

       E(w) = −ln p(t | w) = −∑n=1..N { tn ln yn + (1 − tn) ln(1 − yn) },   where yn = σ( wT φ(xn) )

•  Gradient of E, with y = (y1,..., yN)T and t = (t1,..., tN)T:

       ∇E(w) = ∑n=1..N (yn − tn) φ(xn) = ΦT (y − t)

•  Hessian of E

       H = ∇∇E(w) = ∑n=1..N yn (1 − yn) φ(xn) φ(xn)T = ΦT R Φ

   where Φ is the N x M design matrix with rows φ(xn)T = ( φ0(xn), φ1(xn), ..., φM−1(xn) ),
   and R is the N x N diagonal matrix with elements Rnn = yn (1 − yn), yn = σ( wT φ(xn) )

   The Hessian is not constant: it depends on w through R.
   Since H is positive definite (i.e., uT H u > 0 for arbitrary u ≠ 0), the error function is a
   convex function of w and so has a unique minimum.
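
A minimal sketch of these formulas (assuming a design matrix Phi of shape N x M, a 0/1 target vector t of length N, and current weights w; all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad_hessian(Phi, t, w):
    """Gradient  Phi^T (y - t)  and Hessian  Phi^T R Phi  of the cross-entropy error."""
    y = sigmoid(Phi @ w)              # y_n = sigma(w^T phi(x_n)), length N
    grad = Phi.T @ (y - t)            # sum_n (y_n - t_n) phi(x_n), length M
    R = np.diag(y * (1.0 - y))        # R_nn = y_n (1 - y_n); depends on w
    H = Phi.T @ R @ Phi               # M x M, positive definite
    return grad, H

def newton_update(Phi, t, w):
    """One Newton-Raphson step: w_new = w - H^{-1} grad E(w)."""
    grad, H = logistic_grad_hessian(Phi, t, w)
    return w - np.linalg.solve(H, grad)
```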

Gradient and Hessian for Multi-class Logistic Regression


•  Cross-Entropy Error, where T is the N x K matrix of target variables with elements tnk:

       E(w1,..., wK) = −ln p(T | w1,..., wK) = −∑n=1..N ∑k=1..K tnk ln ynk

   with softmax outputs

       ynk = yk( φ(xn) ) = exp( wkT φ(xn) ) / ∑j exp( wjT φ(xn) )

•  Gradient of E with respect to wj

       ∇wj E(w1,..., wK) = ∑n=1..N (ynj − tnj) φ(xn)

•  Hessian of E, a matrix of blocks indexed by k and j

       ∇wk ∇wj E(w1,..., wK) = ∑n=1..N ynk (Ikj − ynj) φ(xn) φ(xn)T

   (Φ and φ(xn) are as defined for the two-class case)

Each element of the Hessian needs M multiplications and additions; since there are M2 elements
in the matrix, the computation is O(M3).
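
A sketch of the multi-class formulas (assuming Phi of shape N x M, a one-hot target matrix T of shape N x K, and a weight matrix W of shape M x K whose column k is wk; names are illustrative):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of the activation matrix A, where A[n, k] = w_k^T phi(x_n)."""
    A = A - A.max(axis=1, keepdims=True)        # subtract the row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_grad_hessian(Phi, T, W):
    """Gradient columns  sum_n (y_nj - t_nj) phi(x_n)  and Hessian blocks
    sum_n y_nk (I_kj - y_nj) phi(x_n) phi(x_n)^T  for multi-class logistic regression."""
    Y = softmax(Phi @ W)                        # N x K matrix of y_nk
    grad = Phi.T @ (Y - T)                      # M x K; column j is the gradient wrt w_j
    N, M = Phi.shape
    K = T.shape[1]
    H = np.zeros((K, K, M, M))                  # H[k, j] is the (k, j) block of the Hessian
    for k in range(K):
        for j in range(K):
            r = Y[:, k] * ((k == j) - Y[:, j])  # y_nk (I_kj - y_nj) for each n
            H[k, j] = Phi.T @ (r[:, None] * Phi)
    return grad, H
```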

Hessian of Neural Network Error Function


•  Backpropagation can be used to obtain first derivatives of the error
   function wrt the weights in the network
•  It can also be used to derive the second derivatives

       ∂²E / ∂wji ∂wlk

•  If all weights and bias parameters are elements wi of a single vector w,
   then the second derivatives form the elements Hij of the Hessian matrix H,
   where i, j ∈ {1,..., W}

Role of Hessian in Neural Computing


1.  Several nonlinear optimization algorithms for neural networks are based
    on second-order derivatives of the error surface
2.  Basis for a fast procedure for re-training a network after a small change
    in the training data
3.  Identifying the least significant weights for network pruning requires the
    inverse of the Hessian
4.  Bayesian neural networks: the Hessian plays a central role in the Laplace
    approximation

Evaluating the Hessian Matrix


•  The full Hessian matrix can be difficult to compute in practice
   •  quasi-Newton algorithms have been developed that use approximations to the Hessian
•  Various approximation techniques have been used to evaluate the Hessian for
   a neural network
   •  it can also be calculated exactly, using an extension of backpropagation
•  An important consideration is efficiency
   •  with W parameters (weights and biases) the matrix has dimension W x W
   •  efficient methods have O(W2) cost


Methods for evaluating the Hessian Matrix


•  Diagonal Approximation
•  Outer Product Approximation
•  Inverse Hessian
•  Finite Differences
•  Exact Evaluation using Backpropagation
•  Fast multiplication by the Hessian


Diagonal Approximation

•  In many cases the inverse of the Hessian is needed
•  If the Hessian approximation is diagonal, its inverse is trivially computed
•  Complexity is O(W) rather than the O(W2) of the full Hessian
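
A trivial sketch of why the diagonal case is cheap (the diagonal vector and gradient below are hypothetical values, purely for illustration):

```python
import numpy as np

def apply_inverse_diag_hessian(h_diag, v):
    """If H is approximated by diag(h_diag), then H^{-1} v is just an elementwise
    division: O(W), rather than the O(W2) needed to work with the full Hessian."""
    return v / h_diag

# Illustrative use with a hypothetical diagonal and gradient vector
h_diag = np.array([4.0, 1.0, 0.5])
grad = np.array([2.0, -1.0, 0.25])
print(apply_inverse_diag_hessian(h_diag, grad))   # Newton-like step direction in O(W)
```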


Outer product approximation


•  Neural networks commonly use the sum-of-squares error function

       E = (1/2) ∑n=1..N (yn − tn)²

•  The Hessian matrix can then be written in the form

       H ≈ ∑n=1..N bn bnT

•  where bn = ∇yn = ∇an, the gradient of the output-unit activation an
   (equal to ∇yn when the output unit is linear)
•  Elements can be found in O(W2) steps
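
A sketch (not from the slides): if the vectors bn are stacked as the rows of a matrix B, the approximation is a single matrix product.

```python
import numpy as np

def outer_product_hessian(B):
    """Outer-product approximation  H ~= sum_n b_n b_n^T,
    where row n of B (shape N x W) is b_n^T, the gradient of the n-th output activation.
    Each pattern contributes its W x W outer product, so the matrix is built in O(N W^2)."""
    return B.T @ B
```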

Inverse Hessian

•  Use the outer product approximation to obtain a computationally efficient
   procedure for approximating the inverse of the Hessian
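
One way to sketch such a procedure: apply the Sherman-Morrison identity to the outer-product form H = αI + ∑n bn bnT, updating the inverse one data point at a time. The small initial term αI is an assumption needed to make the first inverse well defined; names are illustrative.

```python
import numpy as np

def inverse_hessian_outer_product(B, alpha=1e-3):
    """Build an approximate H^{-1} incrementally for H = alpha*I + sum_n b_n b_n^T,
    adding one outer product b_n b_n^T at a time via the Sherman-Morrison identity."""
    W = B.shape[1]
    H_inv = np.eye(W) / alpha        # inverse of the initial alpha*I term
    for b in B:                      # b is b_n, a vector of length W
        Hb = H_inv @ b
        H_inv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv
```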


Finite Differences

•  Applying finite differences to the first derivatives obtained by backprop
   reduces the complexity from O(W3) to O(W2)
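
A sketch of that idea, assuming grad_E is a (backprop) function returning ∇E(w); the name is illustrative.

```python
import numpy as np

def hessian_by_differencing_gradients(grad_E, w, eps=1e-4):
    """Column j of H ~= ( grad E(w + eps e_j) - grad E(w - eps e_j) ) / (2 eps).
    With a backprop gradient costing O(W), the 2W gradient calls give the
    full Hessian in O(W^2) operations."""
    Wdim = w.size
    H = np.zeros((Wdim, Wdim))
    for j in range(Wdim):
        e = np.zeros_like(w)
        e[j] = eps
        H[:, j] = (grad_E(w + e) - grad_E(w - e)) / (2 * eps)
    return H
```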


Exact Evaluation of the Hessian

•  Using an extension of backprop


•  Complexity is O(W2)


Fast Multiplication by the Hessian

•  Many applications of the Hessian involve multiplying it by a vector v
•  The vector vTH has only W elements, whereas H itself has W2
•  So instead of computing H as an intermediate step, find an efficient
   method to compute the product directly
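
The exact technique in the literature applies an R-operator to the backpropagation equations; as a simpler stand-in, here is a finite-difference sketch of the same idea (grad_E is assumed to be a function returning ∇E(w)):

```python
import numpy as np

def hessian_vector_product(grad_E, w, v, eps=1e-5):
    """Approximate H v (a vector of W elements) without ever forming H:
       H v ~= ( grad E(w + eps*v) - grad E(w - eps*v) ) / (2 eps).
    Only two gradient evaluations are needed; v^T H is just (H v)^T since H is symmetric."""
    return (grad_E(w + eps * v) - grad_E(w - eps * v)) / (2 * eps)
```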

