
With(out) A Trace

Matrix Derivatives the Easy Way


Steven W. Nydick
University of Minnesota

May 16, 2012

Outline

Introduction
    Notation
    History of Paper

Traces
    Algebraic Trace Properties
    Calculus Trace Properties

Trace Derivatives
    Directional Derivatives
    Example 1: tr(AX)
    Example 2: tr(X^T AXB)
    Example 3: tr(Y^{-1})
    Example 4: |Y|

Trace Derivative Applications
    Application 1: Least Squares
    Application 2: Restricted Least Squares (X = X^T)
    Application 3: MLE Factor Analysis (LRC)

References

Introduction

Notation
A: A matrix
A_c: A matrix held constant
x: A vector
y: A scalar (or a scalar function)
x^T or X^T: The transpose of x or X
x_ij: The element in the ith row and jth column of X
(x^T)_ij: The element in the ith row and jth column of X^T
∂Y/∂x: A matrix with elements ∂y_ij/∂x
∂y/∂X: A matrix with elements ∂y/∂x_ij
⟨x⟩_i or ⟨X⟩_ij: The ith or ijth place of x or X



Gradient, Jacobian, Hessian


A Gradient is the derivative of a scalar with respect to a vector:

    ∂f(x)/∂x = [ ∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n ]^T

If we have the function f(x) = 2x_1 x_2 + x_2^2 + x_1 x_3^2, then the Gradient is

    ∂f(x)/∂x = [ ∂f(x)/∂x_1, ∂f(x)/∂x_2, ∂f(x)/∂x_3 ]^T
             = [ 2x_2 + x_3^2, 2x_1 + 2x_2, 2x_1 x_3 ]^T



A Jacobian is the derivative of a vector with respect to a transposed vector:

    ∂f(x)/∂x^T = [ ∂f_1(x)/∂x_1   ...   ∂f_1(x)/∂x_n ]
                 [      ...       ...        ...      ]
                 [ ∂f_k(x)/∂x_1   ...   ∂f_k(x)/∂x_n ]

If we have the function f(x) = [ 3x_1^2 + x_2, ln(x_1), sin(x_2) ]^T, then the Jacobian is

    ∂f(x)/∂x^T = [ 6x_1    1        ]
                 [ 1/x_1   0        ]
                 [ 0       cos(x_2) ]



The Hessian is the derivative of a Gradient with respect to a transposed vector:

    ∂^2 f(x)/∂x∂x^T = [ ∂^2 f(x)/∂x_1^2     ...   ∂^2 f(x)/∂x_1∂x_n ]
                      [        ...          ...          ...        ]
                      [ ∂^2 f(x)/∂x_n∂x_1   ...   ∂^2 f(x)/∂x_n^2   ]

Because our above Gradient is

    ∂f(x)/∂x = [ 2x_2 + x_3^2, 2x_1 + 2x_2, 2x_1 x_3 ]^T

the Hessian would be

    ∂^2 f(x)/∂x∂x^T = [ 0      2    2x_3 ]
                      [ 2      2    0    ]
                      [ 2x_3   0    2x_1 ]
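The gradient, Jacobian, and Hessian above can be checked symbolically. The following is a small illustrative sketch (not part of the original slides), assuming SymPy is available:

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3')

    # Gradient and Hessian of f(x) = 2*x1*x2 + x2^2 + x1*x3^2
    f = 2*x1*x2 + x2**2 + x1*x3**2
    grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2, x3)])
    print(grad.T)                       # [2*x2 + x3**2, 2*x1 + 2*x2, 2*x1*x3]
    print(sp.hessian(f, (x1, x2, x3)))  # [[0, 2, 2*x3], [2, 2, 0], [2*x3, 0, 2*x1]]

    # Jacobian of f(x) = [3*x1^2 + x2, ln(x1), sin(x2)]^T
    g = sp.Matrix([3*x1**2 + x2, sp.log(x1), sp.sin(x2)])
    print(g.jacobian([x1, x2]))         # [[6*x1, 1], [1/x1, 0], [0, cos(x2)]]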


History of Paper

Simplifying Classes of Matrix Derivatives


History of Schönemann's paper:

1. Wrote it while a post doc at UNC.
2. Originally submitted it to Psychometrika in 1965.
3. Editor mildly criticized the paper.
     - Compliment: it reformulates certain problems (Lagrange multipliers) into an interesting form (traces).
     - Complaint: why would we want to do that?
4. Revised and resubmitted the paper, but the editorship changed hands, and it took them almost a year to respond (asking for another revision).
     - The new editor told him that a reviewer said: nothing wrong with the paper, but not too important.



5. Later learned that the original delay was caused by a statistician with expertise in matrix derivatives who thought that the paper would be published eventually.
6. The paper was published eventually ... 20 years later in MBR.
7. Wrote the article "Better Never than Late: Peer Review and the Preservation of Prejudice" in 2001.



There are two beneficial properties of Schönemann's paper:

  - Derivatives are always in matrix form.
  - No need for Dummy Matrices.

But it uses traces, and thus trace properties.

So ... a review of traces and trace properties:


Traces

Algebraic Trace Properties

What is a Trace?
Definition:

    tr(Y) = Σ_i (y_ii),    Y is square

OK - that's simple, but what does that mean?

Well, take a square matrix and add up the diagonal elements:

    A = [ a_11  ...  a_1n ]
        [  ...  ...  ...  ]        tr(A) = a_11 + a_22 + ... + a_nn
        [ a_n1  ...  a_nn ]


Linearity of Traces
The MOST important aspect of traces (for our later derivations):

    tr : M_n(R) → R^1 is linear

Thus:

    tr(A + B) = tr(A) + tr(B)                                                        (1)

and

    tr(cA) = c tr(A)                                                                 (2)


Transposition of Dependent Variable


Traces have SEVERAL OTHER important properties.

Property 1: Transposition of Dependent Variable

We have:

    tr(Y) = tr(Y^T)                                                                  (3)

Thus:

    ∂tr(Y)/∂X = ∂tr(Y^T)/∂X


Cyclic Permutation
Property 2: Cyclic Permutation

We have:

    tr(AB) = tr(BA)                                                                  (4)

Why? Well, start from the left of Equation (4).

    tr(AB) = tr( [ a_11 ... a_1m ]   [ b_11 ... b_1n ] )
                 [  ...      ... ]   [  ...      ... ]
                 [ a_n1 ... a_nm ]   [ b_m1 ... b_mn ]

           = (a_11 b_11 + ... + a_1m b_m1) + Σ_{i=1}^m (a_2i b_i2) + ... + Σ_{i=1}^m (a_ni b_in)

           = Σ_{j=1}^n Σ_{i=1}^m (a_ji b_ij)
And also start from the right of Equation (4).

    tr(BA) = tr( [ b_11 ... b_1n ]   [ a_11 ... a_1m ] )
                 [  ...      ... ]   [  ...      ... ]
                 [ b_m1 ... b_mn ]   [ a_n1 ... a_nm ]

           = Σ_{i=1}^m Σ_{j=1}^n (b_ij a_ji) = Σ_{j=1}^n Σ_{i=1}^m (b_ij a_ji)

           = Σ_{j=1}^n Σ_{i=1}^m (a_ji b_ij)

           = tr(AB)

So:

    ∂tr(AB)/∂X = ∂tr(BA)/∂X

Rotating the order does not change the trace of a (square) matrix product.

A consequence of the last derivation:

Let U and H have the same dimensions. If you want to multiply paired entries (e.g. u_ij h_ij) and add all the multiplications: flip one of the matrices, multiply, and take the trace of that multiplication.

    Σ_{j=1}^n Σ_{i=1}^m (u_ij h_ij) = Σ_{j=1}^n Σ_{i=1}^m ((u^T)_ji h_ij) = tr(U^T H)        (5)
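The algebraic trace properties (1)-(5) are easy to spot-check numerically. A minimal sketch (not from the original slides), assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3))
    B = rng.normal(size=(3, 4))
    C = rng.normal(size=(4, 4))
    D = rng.normal(size=(4, 4))
    U = rng.normal(size=(4, 3))
    H = rng.normal(size=(4, 3))

    assert np.isclose(np.trace(C + D), np.trace(C) + np.trace(D))   # (1) additivity
    assert np.isclose(np.trace(2.5 * C), 2.5 * np.trace(C))         # (2) homogeneity
    assert np.isclose(np.trace(C), np.trace(C.T))                   # (3) transposition
    assert np.isclose(np.trace(A @ B), np.trace(B @ A))             # (4) cyclic permutation
    assert np.isclose(np.sum(U * H), np.trace(U.T @ H))             # (5) paired products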

Calculus Trace Properties

Transposition of Independent Variable


Calculus Property 1: Transposition of Independent Variable

By definition:

    ∂tr(Y)/∂X = [ ∂tr(Y)/∂x_ij ],    i = 1, ..., n;  j = 1, ..., m

where ∂tr(Y)/∂x_ij is what we put in the ijth place in our derivative matrix.

Thus:

    ∂tr(Y)/∂(X^T) = [ ∂tr(Y)/∂x_ji ],  (j = 1, ..., m;  i = 1, ..., n)  =  [ ∂tr(Y)/∂X ]^T      (6)

because ∂tr(Y)/∂x_ji is what we put in the ijth place in our derivative matrix.


Deriving with respect to a transposed variable replaces each entry in the new matrix with the derivative of the corresponding transposed component.

Replacing every entry with the derivative of the transposed component is the same as transposing the entire matrix of partial derivatives.


Product Rule
Calculus Property 2: Product Rule

An illustration of the (scalar) product rule gives:

    d(uv)/dx = (du/dx) v + u (dv/dx)

In this case, u and v are scalar functions of x.

Now, we want to translate this to matrices and traces of matrices: pick any row u_i. from U and any column v_.j from V.

If we take the derivative of the matrix product with respect to a scalar,

    ∂(UV)/∂x

we find that the i,jth place in our new derivative matrix is

    ∂(u_i.^T v_.j)/∂x = ∂(u_i1 v_1j + ... + u_in v_nj)/∂x
                      = ∂(u_i1 v_1j)/∂x + ... + ∂(u_in v_nj)/∂x
                      = (∂u_i1/∂x) v_1j + u_i1 (∂v_1j/∂x) + ... + (∂u_in/∂x) v_nj + u_in (∂v_nj/∂x)

So, now we want to collect terms:

    ∂(u_i.^T v_.j)/∂x = (∂u_i1/∂x) v_1j + u_i1 (∂v_1j/∂x) + ... + (∂u_in/∂x) v_nj + u_in (∂v_nj/∂x)

                      = [ (∂u_i1/∂x) v_1j + ... + (∂u_in/∂x) v_nj ] + [ u_i1 (∂v_1j/∂x) + ... + u_in (∂v_nj/∂x) ]

                      = (∂u_i.^T/∂x) v_.j + u_i.^T (∂v_.j/∂x)

And because our element is arbitrary, we can generalize:

    ∂(UV)/∂x = (∂U/∂x) V + U (∂V/∂x)
             = ∂(U V_c)/∂x + ∂(U_c V)/∂x                                             (7)

There are two notes on the product rule.

Note 1: For the product rule to make sense, both U and V should be functions of X.

For the univariate case, let u = x^2 + 2 and v = 2x + sin(x). Then:

    d(uv)/dx = (du/dx) v + u (dv/dx)
             = (2x)(2x + sin(x)) + (x^2 + 2)(2 + cos(x))
             = 4x^2 + 2x sin(x) + 2x^2 + 4 + x^2 cos(x) + 2 cos(x)
             = x^2 (6 + cos(x)) + 2 (x sin(x) + cos(x)) + 4

We will discuss the multivariate case later.
Note 2: We can put V_c and U_c inside the derivative function, but they are now constants with respect to X, even if they are functions of X.

For the univariate case, let u = x^3 + ln(x) and v = 3x^2. Then:

    d(u v_c)/dx = d[ (x^3 + ln(x)) (3x^2)_c ] / dx
                = (3x^2 + 1/x) (3x^2)
                = 9x^4 + 3x

We will discuss the multivariate case later.
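Both univariate notes can be verified symbolically. A small sketch (my addition, assuming SymPy), where a "constant" piece is implemented simply by not differentiating it:

    import sympy as sp

    x = sp.symbols('x', positive=True)

    # Note 1: full product rule, u and v both functions of x
    u = x**2 + 2
    v = 2*x + sp.sin(x)
    assert sp.simplify(sp.diff(u*v, x) - (sp.diff(u, x)*v + u*sp.diff(v, x))) == 0

    # Note 2: the "held constant" trick; the two frozen pieces sum to d(uv)/dx
    u2 = x**3 + sp.log(x)
    v2 = 3*x**2
    du_vc = sp.diff(u2, x) * v2     # d(u v_c)/dx = (3x^2 + 1/x)(3x^2) = 9x^4 + 3x
    uc_dv = u2 * sp.diff(v2, x)     # d(u_c v)/dx
    assert sp.simplify(du_vc - (9*x**4 + 3*x)) == 0
    assert sp.simplify(du_vc + uc_dv - sp.diff(u2*v2, x)) == 0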

Multidimensional Chain Rule


Calculus Property 3: Chain Rule

Let:

    z = 2x_1^2 + x_1 cos(x_2)

Then, by definition:

    ∂z/∂x = [ ∂z/∂x_1 ]  =  [ 4x_1 + cos(x_2) ]
            [ ∂z/∂x_2 ]     [ -x_1 sin(x_2)   ]

Our partial derivative with respect to x_1 is 4x_1 + cos(x_2), and our partial derivative with respect to x_2 is -x_1 sin(x_2). Furthermore, these go in the respective parts of our derivative matrix (replacing x_1 and x_2).


Now if:

    x_1 = 3t    and    x_2 = t

then dz/dt is the derivative with respect to t accounted for by x_1 plus the derivative with respect to t accounted for by x_2.

And we account for:

    [ 4(3t) + cos(t) ] (∂x_1/∂t)    by x_1
    [ -(3t) sin(t)   ] (∂x_2/∂t)    by x_2



Thus:

    dz/dt = (∂z/∂x)^T (∂x/∂t) = Σ_{i=1}^2 (∂z/∂x_i)(∂x_i/∂t)
          = [12t + cos(t)](3) + [-(3t) sin(t)](1)
          = 36t + 3 cos(t) - 3t sin(t)

But: z = 2x_1^2 + x_1 cos(x_2) = 2(3t)^2 + (3t) cos(t) = 18t^2 + 3t cos(t)

So, another way of getting the same result:

    dz/dt = d[18t^2 + 3t cos(t)]/dt
          = 36t + 3t[-sin(t)] + 3 cos(t) = 36t + 3 cos(t) - 3t sin(t)
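The same answer can be confirmed symbolically; a quick sketch (my addition), assuming SymPy:

    import sympy as sp

    t = sp.symbols('t')
    x1, x2 = 3*t, t                       # substitute x1 = 3t and x2 = t directly
    z = 2*x1**2 + x1*sp.cos(x2)

    dz_dt = sp.diff(z, t)                 # differentiate the composed function
    expected = 36*t + 3*sp.cos(t) - 3*t*sp.sin(t)
    assert sp.simplify(dz_dt - expected) == 0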


An Easy Way to Remember the Chain Rule


Because effects are slopes and slopes are derivatives, writing out a path diagram from t to z would have the derivatives along the paths:

    t --(∂x_1/∂t)--> x_1 --(∂z/∂x_1)--> z
    t --(∂x_2/∂t)--> x_2 --(∂z/∂x_2)--> z

The total effect of t on z is found by multiplying the effects down each path and summing the total effects across paths.


For example, let's find the total effect of a 1 unit change in t on z.

Well, if t changes 1 unit, then x_1 changes dx_1/dt units (because the derivative is the slope of t on x_1). Moreover, if x_1 changes 1 unit, then z changes dz/dx_1 units (because the derivative is the slope of x_1 on z).


Therefore, if t changes 1 unit, then its effect on z through x_1 would be the distance it travels in the x_1 direction:

    x_1 distance = (dx_1/dt) * 1 = dx_1/dt

multiplied by how much a unit change in the x_1 direction changes z:

    z distance through x_1 = (dz/dx_1)(x_1 distance) = (dz/dx_1)(dx_1/dt)

And if t changes 1 unit, then its effect on z through x_2 would be:

    z distance through x_2 = (dz/dx_2)(x_2 distance) = (dz/dx_2)(dx_2/dt)

Thus, if t moves 1 unit, it moves z by (dz/dx_1)(dx_1/dt) through x_1 and by (dz/dx_2)(dx_2/dt) through x_2, so in total it moves z:

    z total distance = (dz/dx_1)(dx_1/dt) + (dz/dx_2)(dx_2/dt)



Or, as written before, to find the effect of t on z:

    dz/dt = (∂z/∂x_1)(∂x_1/∂t) + (∂z/∂x_2)(∂x_2/∂t) = Σ_{i=1}^2 (∂z/∂x_i)(∂x_i/∂t)


Matrices Chain Rule


Because our vector chain rule is:

    d[f(y)]/dx = Σ_i (d[f(y)]/d[y_i(x)]) (d[y_i(x)]/dx)                              (8)

we can expand upon it to obtain a chain rule for matrices:

    ∂[f(Y)]/∂x_pq = Σ_i Σ_j (∂[f(Y)]/∂[y_ij(x_pq)]) (∂[y_ij(x_pq)]/∂x_pq)            (9)

In Equation (9), y_ij is a function of x_pq, and we have to take the derivative with respect to each of the elements in Y.


Trace Derivatives

Directional Derivatives

Derivatives
Here is the standard derivative definition:

    Df(x) = lim_{t→0} [f(x + t) - f(x)] / t

The equation is an infinitesimal form of m = Δy/Δx; it is finding the slope or linear approximation to this function as the distance between the points on the x-axis goes to 0.

If there is a large distance between points on the x-axis, and if the function is not linear, then the slope will not be a good representation of how the function is changing. However, as the distance between points on the x-axis goes to 0, the mini function becomes more linear.


Directional Derivatives: Vectors


In vector calculus, there is a similar equation:

    D_w f(x) = lim_{t→0} [f(x + tw) - f(x)] / t

Now, our function is a surface (a scalar function of as many dimensional inputs as there are elements in x), so the derivative will change at a multidimensional point x based on the direction we travel from that point.

Think of what happens if you were to stand on a mountain and turn around in a circle: in some directions, the slope will be very steep (and you might fall off the mountain), but in other directions, there will barely be any slope at all.


Now w tells us which direction we want to be facing when we calculate the derivative at a specific point x:

    D_w f(x) = lim_{t→0} [f(x + tw) - f(x)] / t

The directional derivative is basically telling us what the best linear approximation of this function is at a particular point if we are facing up the mountain, down the mountain, at a 45 degree angle up the mountain, etc.



If our x vector is two dimensional, then the function would form a mountain in three dimensional space.

[Figures: one direction at a given point, and a second direction at the same point.]

Notice how the steepness of the slope changes between the two directions.




First -- pick an arbitrary unit-length w:

    w^T w = 1

Second -- set up the standard directional derivative definition:

    D_w f(x) = lim_{t→0} [f(x + tw) - f(x)] / t

If w_(i) = (0, 0, 0, 0, 1, 0, ..., 0), where the 1 is in ⟨w⟩_i, then

    D_{w_(i)} f(x) = lim_{t→0} [f(x + t w_(i)) - f(x)] / t

will reduce to the regular partial derivative in the ith place.



Now, if we can find a u such that:

    D_w f(x) = lim_{t→0} [f(x + tw) - f(x)] / t = w^T u

then, for an arbitrary place i, in an arbitrary direction ⟨w⟩_i:

    D_{w_(i)} f(x) = w_(i)^T u    reduces to    D_{w_(i)} f(x) = u_i

where u_i is the partial derivative in the ith place.

Because the ith place is arbitrary:

    ∂f(x)/∂x = u

Directional Derivatives: Matrices


Now let Y_(ij) be a matrix such that y_ij = 1 in ⟨Y⟩_ij and 0 elsewhere. Then, extending our directional derivative definition to matrices:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t                                      (10)

We can conclude that

    D_{Y_(ij)} f(X) = lim_{t→0} [f(X + t Y_(ij)) - f(X)] / t

will pick off the partial derivative in the ijth place.



Now, if we can find a U such that

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t = tr(Y^T U)

then, for an arbitrary place ij, in an arbitrary direction ⟨Y⟩_ij:

    D_{Y_(ij)} f(X) = tr(Y_(ij)^T U) = Σ_i Σ_j (y_ij u_ij)              by (5)       (11)
                    = u_ij

Because the ijth place is arbitrary:

    ∂f(X)/∂X = U                                                                     (12)



First -- put in the form of the definition:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t

Second -- simplify until you can find the equality:

    D_Y f(X) = tr(Y^T U)

Third -- read off your U, and note that:

    ∂f(X)/∂X = U
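This three-step recipe can also be mimicked numerically: probe f in each elementary direction Y_(ij) and collect the results into U. A hedged sketch (not from the slides), assuming NumPy and illustrated on f(X) = tr(X), whose derivative should be the identity matrix:

    import numpy as np

    def numeric_matrix_gradient(f, X, t=1e-6):
        # Approximate dF/dX by probing each direction Y_(ij): 1 in the ijth place, 0 elsewhere
        U = np.zeros_like(X)
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                Y_ij = np.zeros_like(X)
                Y_ij[i, j] = 1.0
                U[i, j] = (f(X + t * Y_ij) - f(X)) / t    # ~ D_{Y_(ij)} f(X)
        return U

    X = np.random.default_rng(1).normal(size=(3, 3))
    U = numeric_matrix_gradient(np.trace, X)
    assert np.allclose(U, np.eye(3), atol=1e-6)           # d tr(X)/dX = I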

Example 1: tr(AX)

Definition
Our 1st function:

    f(X) = tr(AX)

Our objective is to find

    ∂f(X)/∂X = ∂tr(AX)/∂X

only by simplifying the definition:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t


Calculation
    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t                                      by (10)
             = lim_{t→0} [tr(A[X + tY]) - tr(AX)] / t
             = lim_{t→0} [tr(AX + tAY) - tr(AX)] / t                                 by (1)
             = lim_{t→0} tr(tAY) / t
             = lim_{t→0} tr(AY)                                                      by (2)
             = tr(AY)
             = tr([AY]^T)                                                            by (3)
             = tr(Y^T A^T)


Result
So, we found that:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t = tr(Y^T A^T) = tr(Y^T U)            by (11)

And we can spot that in this case:

    U = A^T

And thus, by Equation (12):

    U = ∂f(X)/∂X = ∂tr(AX)/∂X = A^T                                                  (13)
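Equation (13) is easy to confirm with the same finite-difference idea; a small sketch (my addition), assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(4, 4))
    X = rng.normal(size=(4, 4))
    t = 1e-6

    U = np.zeros_like(X)
    for i in range(4):
        for j in range(4):
            Y = np.zeros_like(X)
            Y[i, j] = 1.0
            U[i, j] = (np.trace(A @ (X + t*Y)) - np.trace(A @ X)) / t

    assert np.allclose(U, A.T, atol=1e-4)    # matches Equation (13)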


Example 2: tr(X^T AXB)

Definition
Our 2nd function:

    f(X) = tr(X^T AXB)

Our objective is to find

    ∂f(X)/∂X = ∂tr(X^T AXB)/∂X

only by simplifying the definition:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t


Calculation

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t                                      by (10)
             = lim_{t→0} [tr([X + tY]^T A[X + tY]B) - tr(X^T AXB)] / t
             = lim_{t→0} tr([X + tY]^T A[X + tY]B - X^T AXB) / t                     by (1)
             = lim_{t→0} tr(X^T AtYB + tY^T AXB + tY^T AtYB) / t
             = lim_{t→0} tr(t[X^T AYB + Y^T AXB + tY^T AYB]) / t
             = lim_{t→0} [tr(X^T AYB + Y^T AXB + tY^T AYB)]                          by (2)

Continuing:

    D_Y f(X) = lim_{t→0} [tr(X^T AYB + Y^T AXB + tY^T AYB)]
             = lim_{t→0} [tr(X^T AYB + Y^T AXB)] + lim_{t→0} [t tr(Y^T AYB)]         by (1) & (2)
             = lim_{t→0} [tr(X^T AYB + Y^T AXB)]
             = tr(X^T AYB + Y^T AXB)
             = tr(X^T AYB) + tr(Y^T AXB)                                             by (1)
             = tr(BX^T AY) + tr(Y^T AXB)                                             by (4)


And finally:

    D_Y f(X) = tr(BX^T AY) + tr(Y^T AXB)
             = tr[(BX^T AY)^T] + tr(Y^T AXB)                                         by (3)
             = tr(Y^T A^T XB^T) + tr(Y^T AXB)
             = tr(Y^T A^T XB^T + Y^T AXB)                                            by (1)
             = tr(Y^T [A^T XB^T + AXB])


Result
So, we found that:

    D_Y f(X) = lim_{t→0} [f(X + tY) - f(X)] / t = tr(Y^T [A^T XB^T + AXB]) = tr(Y^T U)    by (11)

And we can spot that in this case:

    U = A^T XB^T + AXB

And thus, by Equation (12):

    U = ∂f(X)/∂X = ∂tr(X^T AXB)/∂X = A^T XB^T + AXB                                  (14)
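Equation (14) can be checked the same way; an illustrative sketch (my addition), assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 4
    A = rng.normal(size=(n, n))
    B = rng.normal(size=(n, n))
    X = rng.normal(size=(n, n))
    t = 1e-6

    def f(X):
        return np.trace(X.T @ A @ X @ B)

    U = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            Y = np.zeros_like(X)
            Y[i, j] = 1.0
            U[i, j] = (f(X + t*Y) - f(X)) / t

    assert np.allclose(U, A.T @ X @ B.T + A @ X @ B, atol=1e-3)   # matches Equation (14)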


Example 3: tr(Y^{-1})

Definition

Our 3rd function, assuming that Y is non-singular and depends on X:

    f(X) = tr(Y^{-1})

Our objective is to find a better expression for

    ∂f(X)/∂X = ∂tr(Y^{-1})/∂X

by working with previous trace derivative rules.


Calculation
We have:

    ∂tr(Y^{-1})/∂X = ∂tr(Y^{-2}Y)/∂X
                   = ∂tr(Y^{-2}Y_c)/∂X + ∂tr(Y_c^{-2}Y)/∂X                           by (7)

Zoom in on the first term:

    ∂tr(Y^{-2}Y_c)/∂X = ∂tr(Y^{-1}Y^{-1}Y_c)/∂X
                      = ∂tr(Y_c Y_c^{-1} Y^{-1})/∂X + ∂tr(Y^{-1} Y_c^{-1} Y_c)/∂X    by (7) & (4)
                      = ∂tr(Y^{-1})/∂X + ∂tr(Y^{-1})/∂X   (the constants multiply to I)
                      = 2 ∂tr(Y^{-1})/∂X                                             (15)


Result
Therefore:

    ∂tr(Y^{-1})/∂X = ∂tr(Y^{-2}Y_c)/∂X + ∂tr(Y_c^{-2}Y)/∂X
                   = 2 ∂tr(Y^{-1})/∂X + ∂tr(Y_c^{-2}Y)/∂X                            by (15)

    -∂tr(Y^{-1})/∂X = ∂tr(Y_c^{-2}Y)/∂X

And, finally, after multiplying by (-1) on both sides:

    ∂tr(Y^{-1})/∂X = -∂tr(Y_c^{-2}Y)/∂X                                              (16)

We have turned a derivative of the trace-inverse problem into a standard trace derivative problem.
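As a numerical sanity check (not in the slides), take the special case Y = X; then (16) together with (13) gives ∂tr(X^{-1})/∂X = -(X^{-2})^T. Assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 4
    X = rng.normal(size=(n, n)) + n * np.eye(n)    # keep X comfortably non-singular
    t = 1e-6

    G = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            Y = np.zeros_like(X)
            Y[i, j] = 1.0
            G[i, j] = (np.trace(np.linalg.inv(X + t*Y)) - np.trace(np.linalg.inv(X))) / t

    X_inv = np.linalg.inv(X)
    assert np.allclose(G, -(X_inv @ X_inv).T, atol=1e-4)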

Example 4: |Y|

Definition

Our 4th function, assuming that Y is non-singular and depends on X:

    f(X) = |Y|

Our objective is to find a better expression for

    ∂f(X)/∂X = ∂|Y|/∂X

by working with previous trace derivative rules and determinant rules.


Determinant Review
We know from linear algebra:

    Y^{-1} = (1/|Y|) Q,    where Q is the adjoint matrix

Process for calculating (q^T)_ij:

  1. Cross out row i and column j of Y.
  2. Take the determinant of the smaller [(n-1) x (n-1)] matrix.
  3. If i + j is odd, then negate the previous step.

Thus:

    |Y| I = QY                                                                       (17)

Adjoint Review
Note: (q^T)_ij = q_ji does not depend on any of the elements in row i or column j of Y. Thus, q_ji does not depend on y_ij. So:

    ∂(q_ji y_ij)/∂y_ij = q_ji

Based on the previous slide we have |Y| I = QY, by (17):

    [ |Y|   0   ...   0  ]     [ Σ_p (q_1p y_p1)       ...          O          ]
    [  .    .   ...   .  ]  =  [        ...            ...         ...         ]
    [  0    0   ...  |Y| ]     [         O             ...   Σ_p (q_np y_pn)   ]


Determinant Derivative, Part 1


Thus, given any j such that 1 ≤ j ≤ n:

    |Y| = Σ_p (q_jp y_pj)                                                            (18)

And, for an arbitrary y_ij, pick the jth row of Q and column of Y:

    ∂|Y|/∂y_ij = ∂[ Σ_p (q_jp y_pj) ]/∂y_ij                                          by (18)
               = Σ_p q_jp (∂y_pj/∂y_ij) = q_ji (∂y_ij/∂y_ij) = q_ji = (q^T)_ij       (19)

Because y_ij was arbitrary, for an entire matrix:

    ∂|Y|/∂Y = Q^T                                                                    (20)
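Equation (20) can also be verified numerically, using (17) to compute the adjoint as Q = |Y| Y^{-1}; a sketch (my addition), assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 4
    Y = rng.normal(size=(n, n)) + n * np.eye(n)    # comfortably non-singular
    t = 1e-6

    G = np.zeros_like(Y)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(Y)
            E[i, j] = 1.0
            G[i, j] = (np.linalg.det(Y + t*E) - np.linalg.det(Y)) / t

    Q = np.linalg.det(Y) * np.linalg.inv(Y)        # adjoint matrix, from (17)
    assert np.allclose(G, Q.T, atol=1e-3)          # matches Equation (20)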

Determinant Derivative, Part 2


Now, for an arbitrary pqth element of X (where Y depends on X):

    ∂|Y|/∂x_pq = Σ_i Σ_j (∂|Y|/∂y_ij)(∂y_ij/∂x_pq)                                   by (9)
               = Σ_i Σ_j q_ji (∂y_ij/∂x_pq)                                          by (19)
               = ∂[ Σ_i Σ_j (q_c,ji y_ij) ]/∂x_pq
               = ∂tr(Q_c Y)/∂x_pq                                                    by (5)

Because x_pq was arbitrary, for an entire matrix:

    ∂|Y|/∂X = ∂tr(Q_c Y)/∂X                                                          (21)

Trace Derivative Applications

Application 1: Least Squares

Definition
Let's say that we have

    A = X + E

where A is the observation matrix, E is a matrix of stochastic fluctuations with a mean of 0, and X is our approximation to A.

In Least Squares, our objective is to minimize the sum of squared errors:

    SSE = e_11^2 + e_12^2 + ... + e_1n^2 + e_21^2 + ... + e_2n^2 + ... + e_mn^2
        = Σ_i Σ_j e_ij^2
        = Σ_i Σ_j (e_ij e_ij)
        = tr(E^T E)                                                                  by (5)

If we have no constraints on X, then we are, equivalently, minimizing:

    SSE = tr(E^T E) = tr[(A - X)^T (A - X)]


Calculation
A minimization:

    ∂(SSE)/∂X = ∂tr(E^T E)/∂X = ∂tr[(A - X)^T (A - X)]/∂X
              = ∂tr(A^T A - A^T X - X^T A + X^T X)/∂X
              = ∂[tr(A^T A) - tr(A^T X) - tr(X^T A) + tr(X^T X)]/∂X                  by (1)
              = ∂tr(A^T A)/∂X - ∂tr(A^T X)/∂X - ∂tr(X^T A)/∂X + ∂tr(X^T X)/∂X
              = 0 - ∂tr(A^T X)/∂X - ∂tr(X^T A)/∂X + ∂tr(X^T X)/∂X

Continuing:

    ∂(SSE)/∂X = -∂tr(A^T X)/∂X - ∂tr(X^T A)/∂X + ∂tr(X^T X)/∂X
              = -A - ∂tr(A^T X)/∂X + ∂tr(X^T X)/∂X                                   by (13) & (3)
              = -A - A + ∂tr(X^T X)/∂X                                               by (13)
              = -A - A + ∂tr(X^T IXI)/∂X
              = -A - A + [I^T X I^T + IXI]                                           by (14)
              = -A - A + [X + X] = -2A + 2X                                          (22)
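A quick numerical check of (22) (not from the slides), again by finite differences; assumes NumPy:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(5, 3))
    X = rng.normal(size=(5, 3))
    t = 1e-6

    def sse(X):
        E = A - X
        return np.trace(E.T @ E)

    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y = np.zeros_like(X)
            Y[i, j] = 1.0
            G[i, j] = (sse(X + t*Y) - sse(X)) / t

    assert np.allclose(G, -2*A + 2*X, atol=1e-4)   # matches Equation (22)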


Result
As in any Least Squares problem, we should set our derivative equal to 0 in order to find the minimum of the function.

    ∂(SSE)/∂X = -2A + 2X = 0
    2X = 2A
    Â = X = A

Surprisingly, without any constraints on X, the best approximation of A is A itself.

Oh, the things you learn in calculus!


Application 2: Restricted Least Squares (X = X^T)

LaGrange Multipliers
Pretend you have a function:

    f(X)

To maximize or minimize it subject to (fewer than mn) restraints, each equivalent to

    h(x_11, ..., x_mn)_ij = 0

use LaGrange Multipliers u_ij (one for each restraint), and set

    g(X) = f(X) + Σ_i Σ_j (u_ij h_ij)                                                (23)

Finally, take the derivative with respect to X, set it equal to 0, and solve.


Definition
We still want to find an X that minimizes the SSE to best approximate A; however, we are now subject to the constraint that X is a Symmetric Matrix.

X Symmetric means:

    X = X^T
    X - X^T = 0

The Recipe:

  1. We have our equation to minimize: tr(E^T E).
  2. We have our constraint: H = X - X^T = 0.
  3. Put in LaGrange multiplier form.
  4. Take the derivative.
  5. Set the derivative equal to 0.
  6. Solve for X.


Calculation

First -- set up the problem:

    g(X) = f(X) + Σ_i Σ_j (u_ij h_ij)                                                by (23)
         = tr(E^T E) + tr(U^T H)                                                     by (5)
         = tr(E^T E) + tr[U^T (X - X^T)]
         = tr(E^T E) + tr(U^T X) - tr(U^T X^T)                                       by (1)
         = tr(E^T E) + tr(U^T X) - tr(UX)                                            by (3) & (4)


Second -- take the derivative:

    ∂g(X)/∂X = ∂[tr(E^T E) + tr(U^T X) - tr(UX)]/∂X
             = ∂tr(E^T E)/∂X + ∂tr(U^T X)/∂X - ∂tr(UX)/∂X
             = -2A + 2X + ∂tr(U^T X)/∂X - ∂tr(UX)/∂X                                 by (22)
             = -2A + 2X + U - U^T                                                    by (13)

Third -- set the derivative equal to 0:

    ∂g(X)/∂X = -2A + 2X + U - U^T = 0
    2X = 2A + U^T - U
    X = A + (U^T - U)/2

However, now note that X = X^T:

    X^T = [ A + (U^T - U)/2 ]^T = A^T + (U - U^T)/2

Result
Fourth -- add X^T to both sides and solve for X:

    X + X^T = A + (U^T - U)/2 + A^T + (U - U^T)/2
    X + X^T = A + A^T + (U^T - U)/2 + (U - U^T)/2
    2X = A + A^T
    Â = X = (A + A^T)/2

Therefore, to approximate A with a Symmetric Matrix, the best matrix (according to the Least Squares Criterion) is the average of the elements of A and the elements of A^T.
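A small numerical illustration (my addition, assuming NumPy) that the symmetric average (A + A^T)/2 does at least as well as any other symmetric candidate under the least squares criterion:

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.normal(size=(5, 5))
    X_hat = (A + A.T) / 2                 # the restricted least squares solution

    def sse(X):
        E = A - X
        return np.trace(E.T @ E)

    for _ in range(1000):
        S = rng.normal(size=(5, 5))
        S = (S + S.T) / 2                 # a random symmetric competitor
        assert sse(X_hat) <= sse(S) + 1e-12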


Application 3: MLE Factor Analysis (LRC)

Definition

The function we want to maximize:

    u = |U^{-1}(R - FF^T)U^{-1}|

  - R is a correlation matrix: diag(R) = I.
  - F is a factor pattern matrix of uncorrelated common factors.
  - U^2 is a covariance matrix of uncorrelated unique factors:
        U^2 = diag(U^2) = I - diag(FF^T)

The function u is a likelihood ratio criterion for a test of independence after the common factors have been partialed out of the covariance matrix.

U^{-1}(R - FF^T)U^{-1} should be close to I, so u should be close to 1. We want to find the F (and consequently the U^2) that results in a determinant as close to 1 as possible.

To make the derivatives simpler, let:

    u_1 = |U^{-2}|    and    u_2 = |R - FF^T|

Note that:

    u_1 u_2 = |U^{-2}||R - FF^T| = |U^{-1}||R - FF^T||U^{-1}| = |U^{-1}(R - FF^T)U^{-1}|

Thus, we can use the product rule to find the derivative:

    ∂u/∂F = ∂(u_1 u_2)/∂F = (∂u_1/∂F) u_2 + u_1 (∂u_2/∂F)                            by (7)

Calculation: ∂u_1/∂F

Let's find the derivative of the first part.

    ∂u_1/∂F = ∂|U^{-2}|/∂F = ∂|U^2|^{-1}/∂F                    (by determinant rules)
            = ∂tr(|U^2|^{-1})/∂F                               (since tr(|X|) = |X|)
            = -∂tr(|U^2|_c^{-2} |U^2|)/∂F                      by (16)
            = -|U^2|_c^{-2} ∂|U^2|/∂F                                                (24)

Our next objective is to find the derivative of the highlighted part, ∂|U^2|/∂F.


Continuing:

    ∂|U^2|/∂F = ∂|I - diag(FF^T)|/∂F                           (by definition)
              = ∂tr(Q_c [I - diag(FF^T)])/∂F                   by (21)
              = ∂tr(Q_c)/∂F - ∂tr[Q_c diag(FF^T)]/∂F           by (1) & (2)
              = 0 - ∂tr[Q_c diag(FF^T)]/∂F
              = -∂tr[Q_c diag(FF^T)]/∂F


Based on (21), Q_c is the Adjoint of [I - diag(FF^T)] = U^2.

Therefore:

    |U^2|^{-1} Q_c = (U^2)^{-1}                                by (17)
    Q_c = |U^2| (U^2)^{-1}

Because U^2 is a diagonal matrix, Q_c = |U^2|(U^2)^{-1} is a diagonal matrix.

And:

    ∂|U^2|/∂F = -∂tr[Q_c diag(FF^T)]/∂F = -∂tr[diag(Q_c FF^T)]/∂F = -∂tr(Q_c FF^T)/∂F

Because the trace only operates on the diagonal, the trace of the diagonal of a matrix is the same as the trace of the original matrix.

Continuing:

    ∂tr(Q_c FF^T)/∂F = ∂tr(F^T Q_c FI)/∂F                      by (4)
                     = (Q^T FI^T + QFI)                        by (14)
                     = (Q + Q)F                                (Q is symmetric)
                     = 2QF
                     = 2|U^2| U^{-2} F                         by (17)

And, thus:

    ∂u_1/∂F = -|U^2|^{-2} ∂|U^2|/∂F                            by (24)
            = -|U^2|^{-2} (-2|U^2| U^{-2} F)
            = 2|U^2|^{-1} U^{-2} F
            = 2|U^{-2}| U^{-2} F                                                     (25)
Calculation: ∂u_2/∂F

Now, let's find the derivative of the second part.

    ∂u_2/∂F = ∂|R - FF^T|/∂F
            = ∂tr[Q_c (R - FF^T)]/∂F                           by (21)
            = ∂tr(Q_c R)/∂F - ∂tr(Q_c FF^T)/∂F                 by (1) & (2)
            = 0 - ∂tr(F^T Q_c FI)/∂F                           by (4)
            = -(Q^T FI^T + QFI)                                by (14)
            = -2QF                                             (Q is symmetric)
            = -2|R - FF^T| (R - FF^T)^{-1} F                                         (26)


Calculation: Entire Thing

Putting the pieces together:

    ∂u/∂F = (∂u_1/∂F) u_2 + u_1 (∂u_2/∂F)

Which implies that

    ∂u/∂F = (2|U^{-2}| U^{-2} F)|R - FF^T| + |U^{-2}|(-2|R - FF^T|(R - FF^T)^{-1} F)


Calculation: Finding Maximum

To find the maximum of this function, we must set it equal to 0 and solve.

    0 = 2|U^{-2}|(U^{-2} F)|R - FF^T| - 2|U^{-2}||R - FF^T|(R - FF^T)^{-1} F
    0 = 2|U^{-2}||R - FF^T|(U^{-2} F - (R - FF^T)^{-1} F)
    0 = U^{-2} F - (R - FF^T)^{-1} F


Finishing the calculation:

    0 = U^{-2} F - (R - FF^T)^{-1} F
    (R - FF^T)^{-1} F = U^{-2} F
    F = (R - FF^T) U^{-2} F
    F = R U^{-2} F - FF^T U^{-2} F
    R U^{-2} F - F = FF^T U^{-2} F
    (R U^{-2} - I) F = F (F^T U^{-2} F)                                              (27)


Result
Based on the previous slide, we have

    (R U^{-2} - I) F = F (F^T U^{-2} F)

Let Λ = (F^T U^{-2} F) be diagonal. Then

    (R U^{-2} - I)(f_1, f_2, ..., f_n) = (f_1, f_2, ..., f_n) [ λ_1   0   ...   0  ]
                                                              [  0   λ_2  ...   .  ]
                                                              [  .    .   ...   0  ]
                                                              [  0    0   ...  λ_n ]

    (R U^{-2} - I)(f_1, f_2, ..., f_n) = (f_1 λ_1, f_2 λ_2, ..., f_n λ_n)
    (R U^{-2} - I)(f_1, f_2, ..., f_n) = (λ_1 f_1, λ_2 f_2, ..., λ_n f_n)

This is an implicit eigenproblem.

References

Schönemann, P. H. (1985). On the formal differentiation of traces and determinants. Multivariate Behavioral Research, 20, 113-139.

Schönemann, P. H. (2001). Better never than late: Peer review and the preservation of prejudice. Ethical Human Sciences and Services, 3, 7-21.
