Some people distinguish these two carefully as feature extraction and feature selection. We can also look at them as dimensionality reduction. Dimensionality reduction makes the downstream computations efficient. Sometimes storage/memory is also made more efficient through the dimensionality reduction. If your original x is "raw data" (say an image represented as a vector), then such techniques are treated as principled ways to define and extract features.

• A representation/compression from which one can reconstruct the original data with minimal error.

11.33.1 Minimizing Variance

Let u be a direction onto which we want to project the data x_i so that we obtain a new representation z_i = u^T x_i that has maximum variance.

It is easy to see that the mean of the original representation gets projected to the mean of the new representation:

$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i = \frac{1}{N}\sum_{i=1}^{N} u^T x_i = u^T \left( \frac{1}{N}\sum_{i=1}^{N} x_i \right) = u^T \mu$$
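As a quick numerical check of this identity, one can project a synthetic data set onto an arbitrary unit vector and compare the mean of the projections with u^T μ. The following is a minimal NumPy sketch (the array sizes and names are arbitrary, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 100))            # d x N data matrix with d = 5, N = 100
    u = rng.normal(size=5)
    u = u / np.linalg.norm(u)                # unit-norm projection direction

    mu = X.mean(axis=1)                      # mean of the original data
    z = u @ X                                # z_i = u^T x_i for every sample
    print(np.allclose(z.mean(), u @ mu))     # True: mean of z equals u^T mu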
We are interested in finding a u that maximizes the variance after the projection:

$$\arg\max_{u} \; \frac{1}{N}\sum_{i=1}^{N} (z_i - \bar{z})^2 = \arg\max_{u} \; \frac{1}{N}\sum_{i=1}^{N} (u^T x_i - u^T \mu)^2$$

$$= \arg\max_{u} \; u^T \left[ \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T \right] u$$

$$= \arg\max_{u} \; u^T \Sigma u$$

With the constraint ||u|| = 1, the solution to this problem can be seen to be the eigenvector corresponding to the largest eigenvalue of the covariance matrix Σ (see the derivation elsewhere in the notes). When the data is centered, i.e., the mean is subtracted, one can see that Σ = (1/N) XX^T, where X is the d × N data matrix. (Note: we may have used this notation in an earlier lecture for the transpose of this matrix.)

The best direction to project onto so as to maximize the variance is therefore the eigenvector corresponding to the largest eigenvalue; the second best is the eigenvector corresponding to the second largest eigenvalue, and so on.
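To see this concretely, the following sketch (assuming NumPy and a synthetic Gaussian data set; none of the specifics come from the notes) compares the variance of projections onto the top eigenvector of Σ with the variance obtained along random unit directions, and checks that the maximal projected variance equals the largest eigenvalue:

    import numpy as np

    rng = np.random.default_rng(1)
    d, N = 10, 2000
    # Synthetic data with a dominant direction of variation.
    X = rng.normal(size=(d, N)) * np.linspace(3.0, 0.5, d)[:, None]

    Xc = X - X.mean(axis=1, keepdims=True)       # centered data, d x N
    Sigma = (Xc @ Xc.T) / N                      # covariance matrix, d x d

    eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
    u_best = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

    def proj_var(u):
        return (u @ Xc).var()                    # variance of z_i = u^T x_i

    dirs = rng.normal(size=(d, 100))
    dirs = dirs / np.linalg.norm(dirs, axis=0)   # 100 random unit directions
    print(proj_var(u_best) >= max(proj_var(dirs[:, j]) for j in range(100)))  # True
    print(np.isclose(proj_var(u_best), eigvals[-1]))                          # True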
11.33.2 Minimizing Reconstruction Loss

Let u_1, ..., u_d be d orthonormal vectors. We can represent a vector x as Σ_{i=1}^d α_i u_i, where the scalar α_i is x^T u_i. However, if we use fewer than d basis vectors, there is some loss, or reconstruction error. Let us consider the loss when we use only one vector u, i.e., when x is approximated by u u^T x and the error is

$$x - u u^T x$$

The sum of the reconstruction loss over all N data samples is then

$$\sum_{i=1}^{N} \| x_i - u u^T x_i \|^2 = \sum_{i=1}^{N} \left( x_i^T x_i + (u u^T x_i)^T (u u^T x_i) - 2\, x_i^T u u^T x_i \right)$$

We would like to minimize this. The first term is non-negative and independent of u. Therefore, we only need to minimize

$$\sum_{i=1}^{N} \left( x_i^T u u^T u u^T x_i - 2\, x_i^T u u^T x_i \right)$$

We also know that u^T u = 1, so this reduces to

$$-\sum_{i=1}^{N} x_i^T u u^T x_i = -\sum_{i=1}^{N} u^T x_i x_i^T u = -\, u^T \left( \sum_{i=1}^{N} x_i x_i^T \right) u$$

For mean-subtracted data, Σ_{i=1}^N x_i x_i^T = N Σ, so minimizing the reconstruction error now becomes that of maximizing u^T Σ u, with our familiar constraint u^T u = 1. The solution is again the eigenvectors corresponding to the largest eigenvalues.
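The same conclusion can be checked from the reconstruction side: the one-dimensional reconstruction error Σ_i ||x_i - u u^T x_i||^2 is smallest when u is the top eigenvector, and it equals N times the sum of the discarded eigenvalues. A small sketch under the same synthetic-data assumptions as before:

    import numpy as np

    rng = np.random.default_rng(2)
    d, N = 8, 500
    Xc = rng.normal(size=(d, N)) * np.linspace(2.0, 0.3, d)[:, None]
    Xc = Xc - Xc.mean(axis=1, keepdims=True)     # mean-subtracted data

    Sigma = (Xc @ Xc.T) / N
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    u_top = eigvecs[:, -1]

    def recon_error(u):
        # sum_i || x_i - u u^T x_i ||^2
        recon = np.outer(u, u @ Xc)              # column i is u (u^T x_i)
        return np.sum((Xc - recon) ** 2)

    u_rand = rng.normal(size=d)
    u_rand = u_rand / np.linalg.norm(u_rand)
    print(recon_error(u_top) <= recon_error(u_rand))               # True
    print(np.isclose(recon_error(u_top), N * eigvals[:-1].sum()))  # error = N * (discarded eigenvalues)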
11.33.3 PCA: Algorithm

• Center the data by subtracting the mean μ.
• Compute the covariance matrix $\Sigma = \frac{1}{N} XX^T = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$.
• Compute the eigenvalues and eigenvectors of Σ.
• Take the k eigenvectors corresponding to the largest k eigenvalues. Keep the eigenvectors as the rows and create a k × d matrix U.
• Compute the reduced dimensional vectors as z_i = U x_i (see the sketch below).
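A minimal NumPy sketch of these steps; the function name pca and the random input are illustrative assumptions, not part of the notes:

    import numpy as np

    def pca(X, k):
        # X is a d x N data matrix; returns the k x d projection matrix U,
        # the mean mu, and the k x N reduced representation Z.
        mu = X.mean(axis=1, keepdims=True)        # mean of the data
        Xc = X - mu                               # step 1: center the data
        N = X.shape[1]
        Sigma = (Xc @ Xc.T) / N                   # step 2: covariance matrix, d x d
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # step 3: eigenvalues/eigenvectors (ascending)
        idx = np.argsort(eigvals)[::-1][:k]       # step 4: indices of the k largest eigenvalues
        U = eigvecs[:, idx].T                     # eigenvectors as rows, k x d
        Z = U @ Xc                                # step 5: z_i = U x_i
        return U, mu, Z

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 200))                # d = 50 dimensions, N = 200 samples
    U, mu, Z = pca(X, k=5)
    print(U.shape, Z.shape)                       # (5, 50) (5, 200)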
11.33.3.1 How many eigenvectors?

In practice, what should the value of k be? This is often decided by looking at how much information is lost in doing the dimensionality reduction. An estimate of this is

$$\frac{\sum_{i=k+1}^{d} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$

Often k is picked such that the above ratio is less than 5% or 10%.

Q: Why is this ratio useful? Can you find an explanation?
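Assuming the eigenvalues of Σ have already been computed (for instance with np.linalg.eigh as in the sketches above), picking k from this ratio might look as follows; the 5% threshold is just the rule of thumb mentioned in the text:

    import numpy as np

    def choose_k(eigvals, max_lost=0.05):
        # Smallest k for which the discarded eigenvalue mass is at most max_lost.
        lam = np.sort(eigvals)[::-1]              # eigenvalues in descending order
        total = lam.sum()
        for k in range(1, len(lam) + 1):
            if lam[k:].sum() / total <= max_lost:
                return k
        return len(lam)

    eigvals = np.array([5.0, 2.0, 1.0, 0.3, 0.1, 0.05])
    print(choose_k(eigvals))                      # smallest k losing at most 5% of the eigenvalue mass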
11.34 Example: Eigen Face

A classical example/application of PCA is in face recognition and face representation. Let us assume that we are given N face images, each of size √d × √d. We can view each image as a d-dimensional vector. Assume that our input is all the faces from a passport office: all the faces are approximately of the same size, they are all frontal, and they have neutral expressions.

• Let us assume that the mean μ is subtracted from each of the faces. If all of the inputs were faces, then the mean also looks very much like a face.
• Let x_1, ..., x_N be the mean-subtracted faces and X be the d × N matrix of all these faces. (Q: Practically, which would be larger here, N or d?)
• We are interested in finding the eigenvectors of the matrix XX^T. Say they are u_1, ..., u_k. Note that k ≤ min(N, d).
• We then project each of the inputs x onto these new eigenvectors and obtain a new feature z_i = u_i^T x, i = 1, ..., k.

When d >> N, computing the eigenvectors of the d × d matrix XX^T directly is expensive. Consider the following two eigenvalue problems (the notation changes slightly here: v now denotes an eigenvector of XX^T and u an eigenvector of X^T X):

$$X^T X u = \lambda u \qquad (11.21)$$

$$X X^T v = \lambda v \qquad (11.22)$$

Note that u is N × 1 and v is d × 1. Assume d >> N. We are interested in computing v from u.

Let us premultiply both sides of (11.22) by X^T:

$$X^T X X^T v = \lambda X^T v$$

Let us replace X^T v by u:

$$X^T X u = \lambda u$$

which is exactly (11.21). So every eigenvector v of XX^T gives an eigenvector u = X^T v of X^T X with the same eigenvalue; conversely, premultiplying (11.21) by X shows that v ∝ X u. The computational procedure can now be:

• Compute the m principal eigenvectors of the N × N matrix X^T X, say u_1, ..., u_m.
• Compute the eigenvectors of our interest as v_i = X u_i, normalized to unit length (see the sketch below).
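The following sketch runs this procedure on random stand-in "images" with d >> N (there is no real face data here; the sizes are illustrative). It computes the eigenvectors of the small N × N matrix X^T X, maps them to eigenvectors of XX^T via v = X u, and verifies the eigenvalue equation without ever forming the d × d matrix:

    import numpy as np

    rng = np.random.default_rng(4)
    d, N, m = 4096, 50, 10                        # e.g. 64 x 64 "images", 50 of them
    X = rng.normal(size=(d, N))
    X = X - X.mean(axis=1, keepdims=True)         # mean-subtracted faces, d x N

    # Small N x N eigenproblem: X^T X u = lambda u.
    lam, U_small = np.linalg.eigh(X.T @ X)
    idx = np.argsort(lam)[::-1][:m]               # keep the m largest eigenvalues
    lam, U_small = lam[idx], U_small[:, idx]

    # Map to eigenvectors of the d x d matrix XX^T: v = X u, then normalize.
    V = X @ U_small
    V = V / np.linalg.norm(V, axis=0)

    # Check XX^T v_i = lambda_i v_i for each i, without forming XX^T.
    for i in range(m):
        lhs = X @ (X.T @ V[:, i])
        print(np.allclose(lhs, lam[i] * V[:, i]))  # True for every i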