
PCA Tutorial
CS898
Instructor: Forbes Burkowski

PCA Overview
Suppose we are given a set of n column vectors

$$x^{(k)}, \qquad k = 1, 2, \ldots, n,$$

each with m components, that is

$$x^{(k)} = \left[\, x_1^{(k)}, \; x_2^{(k)}, \; x_3^{(k)}, \; \ldots, \; x_m^{(k)} \,\right]^{T}.$$

Our description of principal component analysis (PCA) will be much more meaningful
if we regard each vector as representing a single point in an m-dimensional Euclidean
space. The typical diagram used to motivate PCA considers several points in a Euclidean
plane, so n is some large integer and m = 2. As shown in Figure 1, the points are situated,
in a noisy fashion, along an invisible straight line, so that individual points may lie on
either side of that line.


FIGURE 1: Data points in a Euclidean plane

From the figure it is visually apparent that the distribution of points is not completely
random and there is some interrelationship between the coordinates $x_1$ and $x_2$. We see
that, roughly speaking, $x_2$ is large when $x_1$ is large and $x_2$ is small when $x_1$ is small.
More precisely, we might suspect that $x_1$ and $x_2$ are related in a strict linear fashion but
the exact nature of that relationship is obscured by noise that is inherent in some physical
measurement procedure that produced the data. In PCA the first objective of the analysis
is to specify the equation of that invisible line and then consider it to be the so-called
principal axis $y_1$ of a new coordinate system for the points. To complete the new
coordinate system we would next derive another axis $y_2$ that is perpendicular to the first
axis and, since m is only 2, we would be done. Figure 2 illustrates the new axes that we
want to derive. We see that the new coordinates for the points are such that the $y_1$
coordinate varies over a wide range while the $y_2$ coordinate has a more restricted range.


FIGURE 2: PCA derives a new coordinate system

In a real world application of PCA, m will usually be much larger than 2 and it will be
impossible to visualize the data. However, the goal of the analysis will be the same: Find
a new coordinate system for the data such that the first axis corresponds to the largest
variation in the data points, the second axis is perpendicular to the first axis and
corresponds to the next largest variation in the data. The third axis will be perpendicular
to both the first and second axes and will correspond to the third largest variation in the
data, and so on for the remaining m-3 axes.
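
The following is a minimal sketch, assuming Python with NumPy, that generates the kind of noisy,
nearly collinear 2D data pictured in Figure 1. The slope 0.8, the noise level, and the variable
names are illustrative assumptions, not part of the tutorial; the points $x^{(k)}$ are stored as the
columns of a 2 x n array, matching the column-vector convention used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                  # number of data points
t = rng.uniform(-5.0, 5.0, size=n)       # position along the "invisible" line
noise = rng.normal(scale=0.4, size=n)    # measurement noise

x1 = t
x2 = 0.8 * t + noise                     # x2 is large when x1 is large
X = np.vstack([x1, x2])                  # shape (m, n) with m = 2
print(X.shape)                           # (2, 200)
```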

Multivariate Data
The data points introduced in the previous section were considered to be vectors:
mathematical abstractions without any further explanation. We now consider how such
data are derived and along the way, we introduce more terminology.

PCA is typically applied to data that are generated by some experimental procedure.
Each vector is a set of observations that corresponds to some single event or entity that is
a member of a large set of such events or entities. For example, we might be studying the
different structural conformations of a single protein with N amino acids. To simplify the
problem we could assume that a conformation is specified by the 3D coordinates of all
the α-carbon atoms of the protein. In this case, our vector of observations has m = 3N
components, a sequence of N triplets each giving the 3D coordinates of a single
α-carbon atom. Each change in the conformation of the protein produces a new
set of α-carbon coordinates and hence a new vector of observations. While each
observation in this example was part of a 3D coordinate, it is not necessary that all
observations have the same data type. In some applications of PCA, the vector of
observations may be a diverse collection of data types measuring physical quantities such
as atomic charge, hydrophobicity, dihedral angles, etc.
Data types are also called inputs, attributes, or features. It should be clear that, for all
vectors in the data set, each row position in a vector always corresponds to the same type
of observation. We will now collect the n vectors to form an m by n matrix:

$$X = \left[\, x^{(1)} \; x^{(2)} \; x^{(3)} \; \cdots \; x^{(n)} \,\right] =
\begin{bmatrix}
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(n)} \\
x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(n)} \\
\vdots & \vdots & \ddots & \vdots \\
x_m^{(1)} & x_m^{(2)} & \cdots & x_m^{(n)}
\end{bmatrix}.$$

Because of the data generation process, we will assume that each of the m rows contains
values that are independent and identically distributed observations.
While the observations along a single row are independent, we are assuming that some
values within a column vector are correlated and it is the objective of PCA to make these
correlations evident. If there is no such correlation then a principal component analysis
is of no use. Later in the chapter we will see that PCA can provide a simplification of the
data, essentially reducing its dimensionality by discarding dimensions that correspond to
small variations in the data. In the trivial example described earlier, the $y_1$ axis would
expose the most significant variation in the data set while the $y_2$ axis would show the
amount of noise in the data.
We should also mention that in many applications, the principal axes corresponding to
smaller variations in the data may be present for other reasons that are not due to noise or
measurement errors. These axes may correspond to underlying physical properties that
naturally produce small changes within the system under study. Imagine bond length
changes in a benzene ring situated within a larger molecule. If we collect bond length
data for the entire molecule as it goes through various conformational changes, we will
see some bond length changes in the benzene ring but they will have a very small
magnitude. In this situation, a reasonable objective of PCA is to do a dimensionality
reduction. That is, it tries to find a set of linear combinations of the given attributes that
is most important in explaining variations in the data. The cardinality of this set would
be less than m and, in some cases, much smaller than m.
To reiterate our goals: The first principal component corresponds to the linear
combination of the data rows that takes into account the largest variation in the data, the
next principal component is another linear combination of the data rows that accounts for
the next largest variation in the data set, etc.

Data Analysis
Data evaluated relative to a mean
In the previous section data points were distributed along or near a straight line with
deviations from a linear relationship being present due to noise in the data gathering. The
standard approach in the analysis of noisy data is to use data averaging strategies to
expose the hidden but intrinsic nature of the data. We hope that errors due to noise will
cancel out when a data mean is calculated. In this approach, the mean of the data has a
value that is more significant than the position of any single point in the data collection.
The first step of the analysis would be to express each observed attribute as the difference
between its original given input value and the mean of all such values. Going back to our
matrix of data we compute the mean of the attributes in each row to get m mean values,

$$\bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_i^{(k)}, \qquad i = 1, 2, \ldots, m \,.$$

Then each entry in the data matrix is replaced by its difference with the mean,

$$x_i^{(j)} \;\leftarrow\; x_i^{(j)} - \bar{x}_i \,.$$

In the remainder of our discussion, we will assume that the data has already been
subjected to this computation, and so Figure 2 can be redrawn as Figure 3 with the data
translated by subtracting the means so that they are centered on the origin.


FIGURE 3: Data translated to be centered on the origin

Because the data is now centered each row of the matrix X will sum to 0. Specifically,

$$\sum_{k=1}^{n} x_i^{(k)} = 0 \quad\text{and}\quad \bar{x}_i = 0 \qquad \text{for all } i = 1, 2, \ldots, m.$$
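
Below is a minimal sketch, assuming Python with NumPy, of this centering step. The helper name
center_rows is a hypothetical choice for illustration; after centering, every row of the data
matrix sums to (numerically) zero, as stated above.

```python
import numpy as np

def center_rows(X):
    """Return the row-centered data matrix and the vector of row means."""
    row_means = X.mean(axis=1, keepdims=True)   # shape (m, 1)
    return X - row_means, row_means

X = np.random.default_rng(1).normal(size=(3, 10))
Xc, mu = center_rows(X)
print(np.allclose(Xc.sum(axis=1), 0.0))         # True: each centered row sums to 0
```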



Consider a single row of attribute values in X. We can use the usual definition from
statistics to compute the variance of this data:

$$\text{variance\_row}(i) = \frac{1}{n-1} \sum_{k=1}^{n} \left( x_i^{(k)} - \bar{x}_i \right)^{2}
= \frac{1}{n-1} \sum_{k=1}^{n} \left( x_i^{(k)} \right)^{2},$$

the last equality holding because each $\bar{x}_i = 0$.
Recall the earlier statement about Figure 1: $x_2$ is large when $x_1$ is large and $x_2$ is small
when $x_1$ is small. To further our analysis, we will now present a formal definition that
quantitatively characterizes the nature of this covariance between $x_1$ and $x_2$. The
covariance between two rows of data:

$$x_i^{(k)} \; (k = 1, 2, \ldots, n) \qquad \text{and} \qquad x_j^{(k)} \; (k = 1, 2, \ldots, n)$$

is defined to be:

$$s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} x_i^{(k)} x_j^{(k)} \,.$$

Note how this is a natural generalization of the definition given earlier for the variance.
With our translated data centered on the origin, we see that a large positive value of $s_{ij}$
indicates that the two data attributes increase together and, in an attempt to characterize
the data, we might assume that they have a relationship that is either linear or almost
linear. A large negative value of $s_{ij}$ indicates that the data attributes are such that as one
increases in a positive direction, the other increases in the opposite or negative direction.
Again, if we are confident that data dependencies are linear then these particular
attributes can be assumed to be related in a linear or almost linear fashion. If the
covariance $s_{ij}$ is close to zero, then the two rows represent attributes that are uncorrelated
and there is no discernible linear relationship between them.
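
A minimal sketch of this definition, assuming Python with NumPy, is given below. The function
name covariance and the simulated rows are illustrative assumptions; the built-in np.cov is used
only as a cross-check, since it applies the same 1/(n-1) normalization.

```python
import numpy as np

def covariance(row_i, row_j):
    """s_ij = (1 / (n - 1)) * sum_k x_i^(k) x_j^(k) for two centered data rows."""
    n = row_i.shape[0]
    return np.dot(row_i, row_j) / (n - 1)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=100)
x1 -= x1.mean()
x2 -= x2.mean()
print(covariance(x1, x2))        # large positive value: x1 and x2 increase together
print(np.cov(x1, x2)[0, 1])      # NumPy's built-in covariance agrees
```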

The Covariance Matrix
It is clear that $s_{ij} = s_{ji}$ and since we have m different attributes we can perform
$m(m-1)/2$ different covariance computations. We will retain these calculations in a
symmetric m by m covariance matrix S with the entry at row i and column j containing $s_{ij}$
while the main diagonal contains the variance for row(i). Recalling that the given data
can be stored as a matrix of column vectors,

$$X = \left[\, x^{(1)} \; x^{(2)} \; x^{(3)} \; \cdots \; x^{(n)} \,\right]$$

where each $x^{(k)}$ is a column vector, it is clear that the covariance matrix is given by:

$$S(X) = \frac{1}{n-1}\, X X^{T} .$$


We are using the notation S(X) instead of S to emphasize the dependency of the covariance
matrix on the data matrix X. It is very important to understand the structure of S(X). The
diagonal entries are the variances of the row data while off-diagonal entries are the
covariance values between rows, that is, between different types of data.
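
The following is a minimal sketch, assuming Python with NumPy, of building S(X) = XX^T/(n-1)
from a centered m x n data matrix; the helper name covariance_matrix is a hypothetical choice.
np.cov (which treats each row as one attribute by default) is used only as a consistency check.

```python
import numpy as np

def covariance_matrix(Xc):
    """S(X) = X X^T / (n - 1) for a centered m x n data matrix Xc."""
    n = Xc.shape[1]
    return (Xc @ Xc.T) / (n - 1)

Xc = np.random.default_rng(2).normal(size=(4, 50))
Xc -= Xc.mean(axis=1, keepdims=True)        # center the rows first
S = covariance_matrix(Xc)
print(np.allclose(S, np.cov(Xc)))           # True
```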
It is worthwhile to note the appearance of S(X) when the covariance values are
computed from the centered raw data as in Figure 3, and to compare this with the
appearance of

$$S(Y) = \frac{1}{n-1}\, Y Y^{T}$$

when the frame of reference for the data is a coordinate system based on the principal
components as in Figure 4. In the first case we have attributes that show a covariant
relationship and so there will be off-diagonal entries of S(X) that are large in absolute
value. However, for S(Y) the off-diagonal entries tend to be much closer to zero because
of the orientation of the new axes among the data points. In our trivial example illustrated
in Figure 4, the data points are such that large values of $y_1$ are multiplied by small values
of $y_2$. Given the shape of the data distribution and the fact that $y_2$ is perpendicular to $y_1$,
this is expected.



FIGURE 4: The data with axes determined by PCA

To motivate PCA, we see that our best new coordinate system

$$Y = \left( y^{(1)}, y^{(2)}, \ldots, y^{(m)} \right)$$

would be such that S(Y) has nonzero diagonal entries and all off-diagonal elements are
zero. These observations lead us to a formal statement defining the requirements of a
principal component analysis as follows:

Given a matrix X, with columns representing origin-centered data points, find a
transformation matrix P such that the transformed data is represented as

$$Y = PX$$

and the covariance matrix for Y,

$$S(Y) = \frac{1}{n-1}\, Y Y^{T},$$

is a diagonal matrix. We can focus on $YY^{T}$ and observe that

$$YY^{T} = (PX)(PX)^{T} = P X X^{T} P^{T} .$$

The symmetric matrix $XX^{T}$ is diagonalizable and so there exists an invertible orthogonal
matrix Q such that

$$XX^{T} = Q D Q^{T}$$

where D is purely diagonal and Q has columns taken from the eigenvectors of $XX^{T}$. So,
with this decomposition,

$$(n-1)\, S(Y) = P X X^{T} P^{T} = P Q D Q^{T} P^{T} = (PQ)\, D\, (PQ)^{T}$$

and it is easily seen that S(Y) is purely diagonal if P is the inverse of Q. Since Q is
orthogonal we could equivalently require that

$$P = Q^{T} .$$

In other words, P has rows that are the eigenvectors of $XX^{T}$. So, for any column in Y,
say $y^{(k)}$, we have

$$y^{(k)} = Q^{T} x^{(k)}$$

where $x^{(k)}$ is a column of X.

Considering the change of basis topic covered in Appendix A we can understand that
$Q^{T} x$ really has the form

$$\left[\, q^{(1)} \; q^{(2)} \; \cdots \; q^{(m)} \,\right]^{T} [x]_{E_m} = [x]_{B_Q} .$$

In other words, the eigenvectors are really a new basis and the left multiply of X by $Q^{T}$ is
really a change of basis that takes vectors x in the standard basis (i.e. $[x]_{E_m}$) over to $[x]_{B_Q}$,
our notation indicating that x is in a coordinate representation that uses the eigenvectors
of $XX^{T}$ as a new basis.
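
Below is a minimal sketch, assuming Python with NumPy, of the construction just derived:
diagonalize $XX^{T}$, take P = Q^T (rows are eigenvectors of $XX^{T}$), and check that the covariance
matrix of Y = PX comes out (numerically) diagonal. The simulated 2D data mirror Figure 1 and are
an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2, 500
t = rng.uniform(-5, 5, size=n)
X = np.vstack([t, 0.8 * t + rng.normal(scale=0.4, size=n)])   # correlated rows
X -= X.mean(axis=1, keepdims=True)                            # center the rows

eigvals, Q = np.linalg.eigh(X @ X.T)     # columns of Q: eigenvectors of XX^T
order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
eigvals, Q = eigvals[order], Q[:, order]

P = Q.T                                  # rows of P are the eigenvectors
Y = P @ X                                # transformed data
S_Y = (Y @ Y.T) / (n - 1)
print(np.round(S_Y, 6))                  # off-diagonal entries are ~0
```
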
PCA and SVD
Recall the singular value decomposition of a matrix. For an $m \times n$ matrix X with
$\mathrm{rank}(X) = r$ there exists an orthogonal $m \times m$ matrix U, an orthogonal $n \times n$ matrix V, and a
diagonal matrix S such that

$$X = U S V^{T} .$$

The covariance matrix is then determined by

$$XX^{T} = U S V^{T} V S U^{T} = U S^{2} U^{T} .$$

Consequently, the availability of a singular value decomposition for X will allow us to
readily compute the eigenvectors and eigenvalues of the covariance matrix. Using the
SVD has some advantages: For example, the rank of X is immediately clear by counting
the nonzero entries in the diagonal matrix S.
There is also the possibility of using a variant of SVD called the thin SVD. In this
scenario, for $X = USV^{T}$ with $m \ge n$ we can write

$$X = U S V^{T}$$

with

$$U = \left[\, u^{(1)} \; u^{(2)} \; \cdots \; u^{(n)} \,\right] \in \mathbb{R}^{m \times n}$$

where $u^{(k)}, k = 1, 2, \ldots, n$, are the first n columns of U, corresponding to the nonzero
eigenvalues, and the diagonal matrix $S \in \mathbb{R}^{n \times n}$ now contains only the first n singular
values. If m is substantially larger than n then this approach yields the same results with
less memory storage. A straightforward $XX^{T}$ computation requires $O(m^{2})$ space, which is
typically more than the $O(mn)$ space needed by the thin SVD.
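
A minimal sketch, assuming Python with NumPy, of the connection between the SVD of a centered
data matrix and the eigendecomposition of $XX^{T}$ is shown below. Passing full_matrices=False to
np.linalg.svd gives the thin factorization; the sizes m and n are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 4
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)                 # center the rows

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is m x n
eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]

print(np.allclose(s**2, eigvals[:n]))              # squared singular values match
```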

Ordering of eigenvalues
Let us review what we have accomplished up to this point. We now know how to find a
transformation matrix P such that when the data matrix X is multiplied on the left by P
we get a new data matrix Y for which the covariance matrix $YY^{T}$ is diagonal. The rows
of P are the eigenvectors of $XX^{T}$ and these rows establish a new set of axes for the data.
In many applications of PCA there is the notion that some axes are more important than
others. It is the main assumption of PCA that this measure of importance is taken to be
the magnitude of variation of the data along an axis. Since each axis will typically show a
different amount of variation, we need to formulate a strategy to measure this variation.
If we consider a point $x^{(k)} \in \mathbb{R}^{m}$, then its image under $Q^{T}$ is $y^{(k)} = Q^{T} x^{(k)}$. The mean of
all such $y^{(k)}$ points is given by the vector

$$u = \frac{1}{n} \sum_{k=1}^{n} y^{(k)} = \frac{1}{n} \sum_{k=1}^{n} Q^{T} x^{(k)} = 0$$

since the $x^{(k)}, k = 1, 2, \ldots, n$, represent centered data. This makes the computation of
variance somewhat less complicated. If we compute the variance of the projections of
the X data onto some particular axis, say $q^{(j)}$, we get

$$\mathrm{var}\left( q^{(j)} \right) = \frac{1}{n-1} \sum_{k=1}^{n} \left( q^{(j)T} x^{(k)} \right)^{2}
= \frac{1}{n-1} \sum_{k=1}^{n} x^{(k)T} q^{(j)}\, q^{(j)T} x^{(k)} \,.$$

Now since both $x^{(k)T} q^{(j)}$ and $q^{(j)T} x^{(k)}$ are scalar quantities we can commute them and
continue as follows:

$$\mathrm{var}\left( q^{(j)} \right)
= \frac{1}{n-1} \sum_{k=1}^{n} q^{(j)T} x^{(k)}\, x^{(k)T} q^{(j)}
= \frac{1}{n-1}\, q^{(j)T} X X^{T} q^{(j)}
= \frac{1}{n-1}\, q^{(j)T} \lambda_j\, q^{(j)}
= \frac{\lambda_j}{n-1} \,.$$

The last two equalities make use of two facts: $q^{(j)}$ is an eigenvector of $XX^{T}$ with
eigenvalue $\lambda_j$, and $q^{(j)}$ has unit length.
The idea behind PCA is that we can assign an importance to each new axis that is
proportional to the amount of variance in the projection of the data on that axis. The last
derivation essentially says that the most important axis is the one corresponding to the
largest eigenvalue. It is typical that the new axes are ordered with respect to their
corresponding eigenvalues.
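
The following is a minimal sketch, assuming Python with NumPy, that verifies this result
numerically: the variance of the data projected onto an eigenvector $q^{(j)}$ of $XX^{T}$ equals
$\lambda_j / (n-1)$. The random data and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 400
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)            # centered data

eigvals, Q = np.linalg.eigh(X @ X.T)          # columns of Q are the q^(j)
j = np.argmax(eigvals)                        # axis with the largest eigenvalue
proj = Q[:, j] @ X                            # the projections q^(j)^T x^(k)

print(np.isclose(proj.var(ddof=1), eigvals[j] / (n - 1)))   # True
```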

Summary of the PCA algorithm

Input:
A set of data vectors $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$, each consisting of m observations.

Procedure:
1. Set up the data vectors as the columns of an $m \times n$ data matrix X.
2. Define the data average vector

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x^{(k)} .$$

3. Compute the covariance matrix

$$\frac{1}{n-1}\, X X^{T} = \frac{1}{n-1} \sum_{k=1}^{n} \left( x^{(k)} \right) \left( x^{(k)} \right)^{T}$$

(with each $x^{(k)}$ first replaced by $x^{(k)} - \bar{x}$ so that the data are centered, as described
earlier).
4. Derive the (eigenvector, eigenvalue) pairs of $X X^{T}$ and retain them as the set
$\left\{ \left( q^{(j)}, \lambda_j \right) \right\}_{j=1}^{r}$ where r is the rank of $X X^{T}$. The $q^{(j)}$ eigenvectors are set up as
columns in a matrix Q. Their order in Q should correspond to a sort of their
eigenvalues in descending order.
5. For each $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ compute $y^{(k)} = Q^{T} x^{(k)}$.

Output:
The n vectors $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$.
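
A minimal sketch of this procedure, assuming Python with NumPy, is given below. The function
name pca and the example data are illustrative assumptions; the eigenpairs of the covariance
matrix S are used, which have the same eigenvectors as $XX^{T}$ (the eigenvalues differ only by the
factor 1/(n-1)).

```python
import numpy as np

def pca(X):
    """X: m x n array whose columns are the data vectors x^(1), ..., x^(n).

    Returns (Y, Q, eigvals): transformed data, eigenvectors (columns of Q),
    and the eigenvalues of the covariance matrix in descending order.
    """
    n = X.shape[1]
    x_bar = X.mean(axis=1, keepdims=True)   # step 2: data average vector
    Xc = X - x_bar                          # center the data
    S = (Xc @ Xc.T) / (n - 1)               # step 3: covariance matrix
    eigvals, Q = np.linalg.eigh(S)          # step 4: eigenpairs of S
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
    eigvals, Q = eigvals[order], Q[:, order]
    Y = Q.T @ Xc                            # step 5: y^(k) = Q^T x^(k)
    return Y, Q, eigvals

# The covariance matrix of the transformed data is (numerically) diagonal.
rng = np.random.default_rng(6)
X = rng.normal(size=(3, 200))
X[2] = 0.5 * X[0] + 0.1 * rng.normal(size=200)   # introduce a correlation
Y, Q, lam = pca(X)
print(np.round((Y @ Y.T) / (X.shape[1] - 1), 4))
```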


Uses of PCA
The computation of a PCA gives us the following information:
1. The correlations among the observations become clear after diagonalization of the
covariance matrix. The linear combinations of the observations (row data)
essentially specify dependencies among the entries within a vector of
observations.
2. The specification of a new set of axes (defining a coordinate system) for the data
effectively removes these pairwise dependencies.
3. The magnitude of the eigenvalue corresponding to each new axis essentially
specifies the amount of variance in the data projected along that axis. Depending
on the application, this may or may not be important and caution should be
exercised. It is possible that the data might be used as input to some nonlinear
process that exhibits drastic changes in its behavior for even small changes in the
data along some axis. In such a setting, the significance of the principal
component analysis must be supplemented with extra information that specifies
how each axis is to be properly used. We would be assuming that PCA is still an
appropriate analysis even though we are working with a nonlinear process.
4. Principal component analysis can be used to provide a simplification of the data
by means of a dimension reduction. This is discussed in the next section.

Dimension Reduction
It is instructive to go back to our trivial example introduced at the start of the chapter.
Although the data set is given as a sequence of vectors $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$, each with two
components $x_1^{(k)}$ and $x_2^{(k)}$ corresponding to the data axes $x_1$ and $x_2$, it is visually
apparent that these components are possibly correlated, perhaps due to some type of
redundancy in the data generation process. In this example, we might further suppose
that the lack of strict collinearity in the data arises due to noise that is inherent in the data
generation or in the data collection (or possibly both). A principal component analysis
would lead to the generation of two eigenvectors that could be used as the two new axes
$y_1$ and $y_2$ for the data, as shown in Figure 4. In such a scenario it would be clear that
each data point is really one dimensional and not two dimensional. The $y_1$ axis would be
the single principal component describing the most significant data component while the
$y_2$ axis would specify the noise in the data. Note that the eigenvalue corresponding to $y_1$
would be significantly larger than that for $y_2$.
In a more complicated data set that is subject to both redundancy (correlation of
observations) and noise we would see an analogous eigenvalue spectrum: after the
eigenvalues are sorted in descending order, we would likely observe fairly large
eigenvalues followed by a sudden decrease in the magnitude of the eigenvalues. This would
allow us to build a transformation matrix $Q^{T}$ that has fewer rows, the result being to
generate $y^{(k)}$ vectors that retain the axes with the most variation while rejecting the rest.
In our trivial example $Q^{T}$ would be a single $1 \times m$ row matrix and the resulting $y^{(k)}$
vectors would be projections onto the $y_1$ axis with the noise component removed.
In such a strategy we simplify the data by retaining only the most meaningful
coordinates and we also reduce the appearance of noise in the new data representation.
Any noise component that is projected along a retained axis will still remain but it is
hopefully small.
Before moving to the next topic it should be mentioned that dimension reduction is
often done for reasons other than noise removal. The principal component analysis may
be followed by a dimension reduction to simply eliminate observations in the data that
show small variation (corresponding to small eigenvalues). Because of their limited
variability these observations may be considered to have an insignificant effect on the
system under study and their elimination would lead to less storage and computational
cost while also reducing the complexity of the model.
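
Below is a minimal sketch, assuming Python with NumPy, of this dimension reduction: keep only
the top d rows of $Q^{T}$ (the eigenvectors with the largest eigenvalues) and project the centered
data onto them. The helper name reduce_dimension and the 2D example data are illustrative
assumptions; choosing d = 1 mirrors the trivial example, where the $y_1$ axis carries the signal.

```python
import numpy as np

def reduce_dimension(Xc, d):
    """Project a centered m x n data matrix onto its top d principal axes."""
    n = Xc.shape[1]
    eigvals, Q = np.linalg.eigh((Xc @ Xc.T) / (n - 1))
    order = np.argsort(eigvals)[::-1]
    Q_top = Q[:, order[:d]]              # m x d: the d most important axes
    return Q_top.T @ Xc                  # d x n: reduced representation

rng = np.random.default_rng(8)
t = rng.uniform(-5, 5, size=300)
Xc = np.vstack([t, 0.8 * t + rng.normal(scale=0.3, size=300)])
Xc -= Xc.mean(axis=1, keepdims=True)
Y1 = reduce_dimension(Xc, d=1)           # keep only the y_1 axis
print(Y1.shape)                          # (1, 300)
```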
