sample mean is the vector $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$, with components $\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_{in}$

sample covariance matrix is a $d \times d$ matrix Z with entries
$$Z_{ij} = \frac{1}{N-1}\sum_{n=1}^{N} (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$$

sample correlation matrix is a $d \times d$ matrix C with entries
$$C_{ij} = \frac{\frac{1}{N-1}\sum_{n=1}^{N} (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_i \, \sigma_j},$$
where $\sigma_i$ and $\sigma_j$ are the sample standard deviations of dimensions $i$ and $j$
[Worked example: a two-dimensional sample with mean vector $\bar{x} = (2.15, 2.75)^T$ and its sample correlation matrix C.]
Observe: if the sample is z-normalized, $x_{in}^{\text{new}} = \frac{x_{in} - \bar{x}_i}{\sigma_i}$ (mean 0, standard deviation 1), then C equals Z. See cov(), cor(), scale() in R.
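These relations are easy to check in R; a minimal sketch on made-up data:

    # Made-up sample: N = 100 observations of d = 2 correlated variables
    set.seed(1)
    X <- cbind(rnorm(100), rnorm(100))
    X[, 2] <- X[, 1] + 0.5 * X[, 2]

    Z <- cov(X)    # sample covariance matrix (1/(N-1) normalization)
    C <- cor(X)    # sample correlation matrix

    # After z-normalization (mean 0, sd 1 per column), covariance equals correlation
    Xz <- scale(X)
    all.equal(cov(Xz), C, check.attributes = FALSE)   # TRUE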
Consider a data set $\{x_n\}$, $n = 1, \ldots, N$, with sample mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$.
Projecting each data point onto a unit vector $u_1$ gives the projected mean $u_1^T \bar{x}$ and the projected variance
$$\frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1,$$
where S is the sample covariance matrix.
Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $\|u_1\|$ from growing to infinity, impose the constraint $u_1^T u_1 = 1$, which gives the optimization problem:

maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$
Introducing a Lagrange multiplier $\lambda_1$ for the constraint and setting the derivative with respect to $u_1$ to zero gives
$$S u_1 = \lambda_1 u_1.$$
This says that $u_1$ must be an eigenvector of S. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one can see that the variance is given by
$$u_1^T S u_1 = \lambda_1.$$
Observe that the variance is maximized when $u_1$ equals the eigenvector having the largest eigenvalue $\lambda_1$.
A corresponding argument for a second direction $u_2$, constrained to be orthogonal to $u_1$, implies that $u_2$ should be the eigenvector of S with the second largest eigenvalue $\lambda_2$. Further dimensions are given by the eigenvectors in order of decreasing eigenvalues.
PCA Example
[Figure: left, the data (data.xy) with the first and second eigenvectors; right, the projection onto the first eigenvector (data.x.eig).]
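A minimal R sketch of the whole construction on made-up two-dimensional data (the names data.xy and data.x.eig echo the axis labels in the figure):

    set.seed(2)
    data.xy <- cbind(rnorm(200, sd = 2), rnorm(200))
    data.xy[, 2] <- 0.6 * data.xy[, 1] + data.xy[, 2]

    S  <- cov(data.xy)                 # sample covariance matrix
    eg <- eigen(S)                     # eigenvalues sorted in decreasing order
    U  <- eg$vectors                   # u1, u2 as columns

    Xc <- scale(data.xy, center = TRUE, scale = FALSE)   # centered data
    data.x.eig <- Xc %*% U             # coordinates in the eigenvector basis

    var(data.x.eig[, 1])               # equals eg$values[1], the projected variance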
Proportion of Variance
In image and speech processing problems the inputs are usually highly correlated.
If dimensions are highly correlated, then there will be a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.
Proportion of variance explained by the first m eigenvectors:
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_d}$$

[Figure: proportion of variance explained (0.5 to 1.0) as a function of the number of eigenvectors (up to 250), digit class 1, USPS database.]
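The curve in the figure is just the cumulative sum of the eigenvalue spectrum. A sketch in R, with a made-up data matrix X standing in for the USPS digits:

    set.seed(7)
    X <- matrix(rnorm(500 * 20), 500, 20) %*% matrix(rnorm(400), 20, 20)  # correlated data
    lambda <- eigen(cov(X))$values          # eigenvalues in decreasing order
    pov <- cumsum(lambda) / sum(lambda)     # proportion of variance for m = 1..d
    which(pov >= 0.95)[1]                   # smallest m explaining 95% of the variance
    plot(pov, type = "l", xlab = "Eigenvectors", ylab = "Proportion of variance")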
Take a 256 × 256 image.
Segment the image into 32 · 32 = 1024 pieces of size 8 × 8 = 64 pixels: $x_1, x_2, \ldots, x_{1024} \in \mathbb{R}^{64}$
Determine the mean: $\bar{x} = \frac{1}{1024}\sum_{n=1}^{1024} x_n$
Determine the covariance matrix S and the m eigenvectors $u_1, u_2, \ldots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$
Create the eigenvector matrix U, where $u_1, u_2, \ldots, u_m$ are the column vectors
Project the image pieces $x_i$ into the subspace as follows: $z_i = U^T (x_i - \bar{x})$
Reconstruct the image pieces by back-projecting into the original space as $\hat{x}_i = U z_i + \bar{x}$. Note, the mean is added back (it was subtracted in the projection step) because the data is not normalized.
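A sketch of these steps in R (the random img is a hypothetical stand-in for a real 256 × 256 image, and the block extraction is kept naive for clarity):

    img <- matrix(runif(256 * 256), 256, 256)   # stand-in for a real 256 x 256 image
    grid <- expand.grid(r = seq(1, 256, 8), c = seq(1, 256, 8))
    X <- t(apply(grid, 1, function(b)           # 1024 x 64, one 8 x 8 block per row
      as.vector(img[b[1]:(b[1] + 7), b[2]:(b[2] + 7)])))

    m <- 16                                     # number of eigenvectors kept
    xbar <- colMeans(X)                         # mean block
    U <- eigen(cov(X))$vectors[, 1:m]           # u_1, ..., u_m as columns

    Z <- sweep(X, 2, xbar) %*% U                # rows are z_i = U^T (x_i - xbar)
    Xrec <- Z %*% t(U) + rep(1, nrow(X)) %o% xbar  # rows are U z_i + xbar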
[Figure: proportion of variance explained in the image (0.6 to 1.0) as a function of the number of eigenvectors (up to 60), next to the original image.]
[Diagram: a single linear unit with inputs $x_1, x_2, \ldots, x_d$, weights $w_1, w_2, \ldots, w_d$, and output V.]

The unit computes
$$V = w^T x = \sum_{j=1}^{d} w_j x_j$$
and is trained with the Hebbian learning rule $\Delta w_i = \eta V x_i$, such that after some update steps the weight vector w should point in the direction of maximum variance.
Averaged over the input distribution, the Hebbian update becomes
$$\langle \Delta w_i \rangle = \eta \langle V x_i \rangle = \eta \Big\langle \sum_j w_j x_j x_i \Big\rangle = \eta \sum_j C_{ij} w_j, \qquad \text{i.e.} \quad \langle \Delta w \rangle = \eta \, C w.$$
Angle brackets indicate an average over the input distribution P(x), and C denotes the correlation matrix with
$$C_{ij} \equiv \langle x_i x_j \rangle, \qquad \text{or} \qquad C \equiv \langle x x^T \rangle.$$
Note, C is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and its eigenvectors can be taken as orthogonal.
At our hypothetical equilibrium point, w would be an eigenvector of C with eigenvalue 0. This is never stable: because C has some positive eigenvalues, the corresponding eigenvector components would grow exponentially. One can constrain the growth of w, e.g. by renormalization ($\|w\| = 1$) after each update step. A more elegant idea is to add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's rule):
$$\Delta w_i = \eta V (x_i - V w_i)$$
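A minimal R sketch of Oja's 1-unit rule on made-up zero-mean data (the learning rate eta and the number of steps are arbitrary choices):

    set.seed(3)
    X <- cbind(rnorm(1000, sd = 2), rnorm(1000))   # zero-mean data, more variance along x1
    w <- rnorm(2); w <- w / sqrt(sum(w^2))         # random initial unit weight vector
    eta <- 0.01

    for (n in sample(nrow(X), 5000, replace = TRUE)) {
      x <- X[n, ]
      V <- sum(w * x)                  # V = w^T x
      w <- w + eta * V * (x - V * w)   # Oja's rule
    }

    w                                           # approx. +/- first eigenvector of C
    eigen(crossprod(X) / nrow(X))$vectors[, 1]  # C = <x x^T> for zero-mean data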
[Figure: the two-dimensional data set (data.xy) and the direction learned by Oja's rule.]
Single-layer network with the i-th output $V_i$ given by $V_i = \sum_{j=1}^{d} w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the i-th output.

Oja's m-unit learning rule:
$$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{m} V_k w_{kj} \Big)$$

Sanger's learning rule:
$$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{i} V_k w_{kj} \Big)$$

Both rules reduce to Oja's 1-unit rule in the m = 1, i = 1 case.
In both cases the $w_i$ vectors converge to orthogonal unit vectors. With Sanger's rule the weight vectors become exactly the first m principal components, in order: $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix C belonging to the i-th largest eigenvalue $\lambda_i$. Oja's m-unit rule converges to m weight vectors that span the same subspace as the first m eigenvectors, but does not find the eigenvector directions themselves.
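A sketch of Sanger's rule in R on made-up zero-mean data, with the $w_i$ stored as rows of W (learning rate and step count again arbitrary):

    set.seed(4)
    X <- matrix(rnorm(3000), ncol = 3) %*% matrix(runif(9), 3, 3)  # correlated 3-D data
    X <- scale(X, center = TRUE, scale = FALSE)                    # zero mean
    m <- 2; W <- 0.1 * matrix(rnorm(m * 3), m, 3); eta <- 0.005

    for (n in sample(nrow(X), 10000, replace = TRUE)) {
      x <- X[n, ]
      V <- as.vector(W %*% x)                                 # V_i = w_i^T x
      for (i in 1:m) {
        xres <- x - colSums(W[1:i, , drop = FALSE] * V[1:i])  # sum over k = 1..i
        W[i, ] <- W[i, ] + eta * V[i] * xres                  # Sanger's update for w_i
      }
    }

    W                                                # rows approx. +/- leading eigenvectors
    t(eigen(crossprod(X) / nrow(X))$vectors[, 1:m])  # compare against C = <x x^T>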
[Diagram: autoassociative network. The original features $x_1, \ldots, x_d$ are mapped (extraction) to the m extracted features $z_1, \ldots, z_m$ in a bottleneck layer, which are mapped (reconstruction) back to the reconstructed features $\hat{x}_1, \ldots, \hat{x}_d$.]
The network is trained to perform the identity mapping. Idea: the bottleneck units represent significant features of the input data. Train the network by minimizing the sum-of-squares error
$$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{d} \left( y_k(x^{(n)}) - x_k^{(n)} \right)^2$$
As with Oja's/Sanger's update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided. The error function has a unique global minimum when the hidden units have linear activation functions. At this minimum the network performs a projection onto the m-dimensional subspace which is spanned by the first m principal components of the data. Note, however, that these vectors need not be orthogonal or normalized.
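A minimal sketch of such a linear autoassociative network in R, trained by batch gradient descent on the sum-of-squares error (made-up data; the names A for the extraction weights and B for the reconstruction weights are my own):

    set.seed(5)
    X <- scale(cbind(rnorm(500, sd = 3), rnorm(500), rnorm(500)),
               center = TRUE, scale = FALSE)    # N x d data, zero mean
    d <- 3; m <- 1; eta <- 1e-4
    A <- 0.1 * matrix(rnorm(m * d), m, d)       # extraction:     z = A x
    B <- 0.1 * matrix(rnorm(d * m), d, m)       # reconstruction: y = B z

    for (step in 1:2000) {
      Z <- X %*% t(A)                # hidden activations, N x m
      E <- Z %*% t(B) - X            # reconstruction errors, N x d
      gB <- t(E) %*% Z               # gradient of 0.5 * sum(E^2) w.r.t. B
      gA <- t(B) %*% t(E) %*% X      # ... and w.r.t. A
      B <- B - eta * gB
      A <- A - eta * gA
    }

    B / sqrt(sum(B^2))               # approx. +/- first principal direction
    eigen(cov(X))$vectors[, 1]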
[Diagram: a deeper autoassociative network. The original features $x_1, \ldots, x_d$ pass through a non-linear hidden layer to the m extracted features $z_1, \ldots, z_m$ (linear), then through a second non-linear hidden layer to the linear output layer.]