1 Introduction
Under many circumstances the data we encounter can be viewed as a cloud of n points in a d-dimensional
space. For example, consider the expression matrix E, of Ng genes (rows) and Ns samples (columns).
Element Egs is the expression level of gene g in sample s.
Sample i is represented by $N_g$ numbers, the components of a vector:

$$\vec e^{\,(i)} = \begin{pmatrix} e_1^{(i)} \\ e_2^{(i)} \\ \vdots \\ e_{N_g}^{(i)} \end{pmatrix} = \begin{pmatrix} E_{1i} \\ E_{2i} \\ \vdots \\ E_{N_g i} \end{pmatrix}$$

Similarly, gene g is represented by a vector in $N_s$-dimensional space:

$$\vec e^{\,(g)} = \begin{pmatrix} e_1^{(g)} \\ e_2^{(g)} \\ \vdots \\ e_{N_s}^{(g)} \end{pmatrix} = \begin{pmatrix} E_{g1} \\ E_{g2} \\ \vdots \\ E_{g N_s} \end{pmatrix}$$
Example: the expression levels of $N_g = 2$ genes measured for $N_s = 4$ patients:

$$\begin{array}{l|rrrr}
\text{patient No:} & 5 & 19 & 27 & 37 \\ \hline
e_1 & 1 & 9 & 11 & 3 \\
e_2 & 8 & 2 & 4 & 6
\end{array}$$
Each patient is represented by a 2-component vector, i.e. a point in d = 2 dimensions (see Fig 1):

$$\vec e^{\,(5)} = \begin{pmatrix} 1 \\ 8 \end{pmatrix}, \qquad
\vec e^{\,(19)} = \begin{pmatrix} 9 \\ 2 \end{pmatrix}, \qquad
\vec e^{\,(27)} = \begin{pmatrix} 11 \\ 4 \end{pmatrix}, \qquad
\vec e^{\,(37)} = \begin{pmatrix} 3 \\ 6 \end{pmatrix}$$
Example - two projections

Usually we wish to project the $N_s$ data points down from the high-dimensional space ($d = N_g$) in which they are embedded. In our example (Fig 1) we can project the 4 points down from d = 2 to d = 1. The two simplest projections, onto the x and y axes, are shown in Fig 2. On the y axis, the projections of our samples form 4 equidistant points with a maximal spread of 8 - 2 = 6. On the x axis, the four points split into two groups of 2 points each, with a maximal spread of 11 - 1 = 10. The expression levels of gene 1 are more spread out and separate the samples into two groups.
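The two projections can be checked with a few lines of NumPy (a sketch; the matrix `E` below is just the example data):

```python
import numpy as np

# The 4 patients of the example, one column per patient (rows: genes e1, e2)
E = np.array([[1.0, 9.0, 11.0, 3.0],   # gene 1 (x axis)
              [8.0, 2.0,  4.0, 6.0]])  # gene 2 (y axis)

# Projecting onto the x axis keeps the first coordinate; onto the y axis, the second
x_proj, y_proj = E[0], E[1]

# Maximal spread (range) of each 1-d projection
print(x_proj.max() - x_proj.min())  # 10.0 on the x axis
print(y_proj.max() - y_proj.min())  # 6.0  on the y axis
```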
2 PCA
It is evident from the two projections shown for the example, that what we see depends on the direction onto
which we project.
Question: How does one find the direction whose projections are the most spread out?
A better measure of spread is $\sigma^2$, the variance of the projections: for n numbers $x_i$, $i = 1, 2, \ldots, n$, the mean and variance are

$$\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2$$
Example: the projections of our 4 samples on the $e_1$ (x) axis are 1, 9, 11, 3; the mean is $\bar e_1 = 6$. For the projections on $e_2$, i.e. 8, 2, 4, 6, the mean is $\bar e_2 = 5$. To calculate the variance, construct the gene-centered expression matrix

$$\begin{array}{l|rrrr}
\text{patient No:} & 5 & 19 & 27 & 37 \\ \hline
e_1^{(k)} - \bar e_1 & -5 & 3 & 5 & -3 \\
e_2^{(k)} - \bar e_2 & 3 & -3 & -1 & 1
\end{array}$$

so that

$$\sigma_1^2 = \frac{1}{4}\sum_{k=1}^{4}\left(e_1^{(k)} - \bar e_1\right)^2 = \frac{1}{4}\left[(-5)^2 + 3^2 + 5^2 + (-3)^2\right] = 17 \tag{1}$$

Similarly, $\sigma_2^2 = 5$.
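The centering and the variances of eq. (1) are quick to verify numerically (a sketch; note NumPy's default population variance, dividing by n, matches the convention used here):

```python
import numpy as np

# Expression matrix of the example: rows are genes e1, e2; columns are patients 5, 19, 27, 37
E = np.array([[1.0, 9.0, 11.0, 3.0],
              [8.0, 2.0,  4.0, 6.0]])

# Gene-centered matrix: subtract each gene's mean across samples
centered = E - E.mean(axis=1, keepdims=True)
print(centered)  # rows: [-5, 3, 5, -3] and [3, -3, -1, 1]

# Population variances (divide by n, not n-1), matching eq. (1)
print(E.var(axis=1))  # [17.  5.]
```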
2.1 First Principal Component

We can now formulate the following

PROBLEM: find the direction $\hat u$ such that projecting the n points in d dimensions, $\vec e^{\,(i)}$, $i = 1, \ldots, n$, onto $\hat u$ gives the largest variance.

RECIPE:

1. Construct the $d \times d$ covariance matrix

$$C_{ij} = \frac{1}{n}\sum_{k=1}^{n}\left[e_i^{(k)} - \bar e_i\right]\left[e_j^{(k)} - \bar e_j\right] \tag{2}$$

2. Find $\hat u$, the eigenvector of C that corresponds to its largest eigenvalue $\lambda_1$:

$$C\hat u = \lambda_1\hat u \tag{3}$$

3. The value of the (maximal) variance of the projections onto $\hat u$ is the eigenvalue $\lambda_1$.

A proof (that this procedure indeed maximizes the variance of the projections) is given below; first I show how this recipe is applied to the example.
For our example, these steps take the following form:

1. The $2 \times 2$ covariance matrix is

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} 17 & -8 \\ -8 & 5 \end{pmatrix} \tag{4}$$

The matrix elements were calculated by using in eq. (2) the entries of the gene-centered expression matrix given above, e.g.

$$C_{11} = \frac{1}{4}\sum_{k=1}^{4}\left[e_1^{(k)} - \bar e_1\right]^2 = \frac{1}{4}\left[(-5)^2 + 3^2 + 5^2 + (-3)^2\right] = \sigma_1^2 = 17 \tag{5}$$

$$C_{12} = C_{21} = \frac{1}{4}\sum_{k=1}^{4}\left[e_1^{(k)} - \bar e_1\right]\left[e_2^{(k)} - \bar e_2\right] = \frac{1}{4}\left[(-5)\cdot 3 + 3\cdot(-3) + 5\cdot(-1) + (-3)\cdot 1\right] = -8 \tag{6}$$
2. The eigenvalue equation (3) reads

$$\begin{pmatrix} 17 & -8 \\ -8 & 5 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \lambda_1\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \tag{7}$$

It is easy to check that the (normalized) vector

$$\hat u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \frac{1}{\sqrt 5}\begin{pmatrix} 2 \\ -1 \end{pmatrix} \tag{8}$$

satisfies (7) with the eigenvalue $\lambda_1 = 21$, which is the largest (the other, smaller one is $\lambda_2 = 1$).
3. To show that the variance of the projections is indeed $\lambda_1$, we first calculate for our 4 sample vectors their projections onto $\hat u$:

$$e_u^{(k)} = \vec e^{\,(k)}\cdot\hat u = \sum_{i=1}^{d} e_i^{(k)}\hat u_i \tag{9}$$

which yields for sample 5: $e_u^{(5)} = 1\cdot 2/\sqrt 5 + 8\cdot(-1)/\sqrt 5 = -6/\sqrt 5$, and for the others $e_u^{(19)} = 16/\sqrt 5$, $e_u^{(27)} = 18/\sqrt 5$ and $e_u^{(37)} = 0$. The mean of these 4 numbers is $\bar e_u = 7/\sqrt 5$ and their variance is given by

$$\sigma_u^2 = \frac{1}{4}\left[\left(\frac{-6-7}{\sqrt 5}\right)^2 + \left(\frac{16-7}{\sqrt 5}\right)^2 + \left(\frac{18-7}{\sqrt 5}\right)^2 + \left(\frac{0-7}{\sqrt 5}\right)^2\right] = 21 = \lambda_1 \tag{10}$$
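The three steps of the recipe can be reproduced for the example with NumPy (a sketch; `np.cov` with `bias=True` uses the 1/n convention of eq. (2)):

```python
import numpy as np

# Expression matrix of the example (rows: genes; columns: patients 5, 19, 27, 37)
E = np.array([[1.0, 9.0, 11.0, 3.0],
              [8.0, 2.0,  4.0, 6.0]])

# Step 1: covariance matrix, eq. (2)
C = np.cov(E, bias=True)
print(C)  # [[17. -8.] [-8.  5.]]

# Step 2: eigendecomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
lam1 = eigvals[-1]          # largest eigenvalue, 21
u = eigvecs[:, -1]          # corresponding unit eigenvector, +-(2, -1)/sqrt(5)

# Step 3: the variance of the projections equals lambda_1, eq. (10)
proj = u @ E
print(lam1, proj.var())     # both 21.0
```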
2.2 Higher Principal Components

The eigenvector $\hat u^{(2)}$, that corresponds to $\lambda_2$, the second largest eigenvalue of the covariance matrix, is the second Principal Component; the next is the third, and so on. Since these are eigenvectors of a symmetric matrix, they are orthogonal to each other. Representing the data points by their projections onto the first two gives a projection onto the plane of largest variance, and similarly the first three give the d = 3 dimensional space in which the projected data exhibit the largest variation. The total variance of the data,

$$\sigma^2 = \frac{1}{n}\sum_{k}\left(\vec e^{\,(k)} - \bar{\vec e}\right)^2$$

is also given by the sum of the variances of the projections on the principal components:

$$\sigma^2 = \sum_{\alpha=1}^{d}\sigma_\alpha^2 = \sum_{\alpha=1}^{d}\lambda_\alpha$$

and one often gives, together with the projection onto the first 2 or 3 principal components, the fraction of the total variance that has been captured, e.g. for a d = 3 representation

$$\text{captured} = \sum_{\alpha=1}^{3}\lambda_\alpha \,\big/\, \sigma^2$$
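The identity between the total variance and the sum of the eigenvalues can be checked numerically; the data below are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 points in d = 5 dimensions (columns are samples),
# with deliberately unequal spread along the coordinate axes
X = rng.normal(size=(5, 200)) * np.array([[3.0], [2.0], [1.0], [0.5], [0.2]])

C = np.cov(X, bias=True)
eigvals = np.linalg.eigvalsh(C)[::-1]   # sort descending

# Total variance (1/n) sum_k |e^(k) - e_bar|^2 equals the sum of all eigenvalues ...
total = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=0).mean()
print(np.isclose(total, eigvals.sum()))  # True

# ... and the fraction captured by the first 3 principal components is
print(eigvals[:3].sum() / eigvals.sum())
```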
2.3 Warning
PCA is no more than what was presented; it should be borne in mind that for many important questions PCA does not give the answer. There may be situations in which the data break into two distinct clouds, but in order to see this one has to project onto a subspace in which the variation is not maximal (see fig with two pitas).
2.4 Proof of Recipe
Let us calculate the variance of the projections of the $k = 1, 2, \ldots, n$ vectors $\vec e^{\,(k)}$ onto a direction $\hat u$. The n projections are $e_u^{(k)} = \vec e^{\,(k)}\cdot\hat u$. The average of these n projections is the projection of the average vector:

$$\bar e_u = \frac{1}{n}\sum_{k=1}^{n}\vec e^{\,(k)}\cdot\hat u = \left(\frac{1}{n}\sum_{k=1}^{n}\vec e^{\,(k)}\right)\cdot\hat u = \bar{\vec e}\cdot\hat u$$

The variance of the projections is therefore

$$\sigma_u^2 = \frac{1}{n}\sum_{k=1}^{n}\left[\vec e^{\,(k)}\cdot\hat u - \bar{\vec e}\cdot\hat u\right]^2
= \sum_{i=1}^{d}\sum_{j=1}^{d}\hat u_i\hat u_j\left[\frac{1}{n}\sum_{k=1}^{n}\left(e_i^{(k)} - \bar e_i\right)\left(e_j^{(k)} - \bar e_j\right)\right]
= \sum_{i=1}^{d}\sum_{j=1}^{d}\hat u_i\hat u_j C_{ij} \tag{11}$$

where in the last equality we recognized the object in square brackets as the covariance matrix $C_{ij}$. This result is written in concise vector notation as

$$\sigma_u^2 = \hat u^T C\hat u \tag{12}$$

To maximize $\sigma_u^2$ over the directions $\hat u$, subject to the normalization constraint $\hat u\cdot\hat u = 1$, introduce a Lagrange multiplier $\lambda$ and set the derivative of $\hat u^T C\hat u - \lambda(\hat u^T\hat u - 1)$ with respect to $\hat u$ to zero, which gives

$$C\hat u = \lambda\hat u \tag{13}$$

Hence the extrema are eigenvectors of C; at an eigenvector $\sigma_u^2 = \hat u^T C\hat u = \lambda$, so the variance is maximized by the eigenvector with the largest eigenvalue $\lambda_1$, as stated in the recipe.
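Note that eq. (12) holds for any unit vector $\hat u$, not only the principal direction; a minimal numerical check on random data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random cloud of n = 500 points in d = 4 dimensions (columns are points)
X = rng.normal(size=(4, 500))
C = np.cov(X, bias=True)

# An arbitrary unit vector u_hat
u = rng.normal(size=4)
u /= np.linalg.norm(u)

# Variance of the projections vs. the quadratic form of eq. (12)
proj_var = (u @ X).var()
print(np.isclose(proj_var, u @ C @ u))  # True
```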
For the 4 patients of our example, the matrix of pairwise distances $D_{ij} = |\vec e^{\,(i)} - \vec e^{\,(j)}|$ is

$$D = \begin{pmatrix}
0 & 10 & 10.77 & 2.82 \\
10 & 0 & 2.82 & 7.21 \\
10.77 & 2.82 & 0 & 8.24 \\
2.82 & 7.21 & 8.24 & 0
\end{pmatrix} \tag{17}$$

and the pairwise distances of the embedding points $\vec y^{\,(i)}$ are

$$\tilde D_{ij} = |\vec y^{\,(i)} - \vec y^{\,(j)}| \tag{18}$$
How does one define the distance between two distance matrices, $\tilde D$ and $D$? There are $N(N-1)/2$ independent matrix elements; usually we cannot satisfy equality of all, i.e. $\tilde D_{ij} = D_{ij}$ for all pairs i, j. Hence one defines a cost function that measures the deviation from this perfect solution. The obvious choice is

$$J_0 = \sum_{i<j}\left(D_{ij} - \tilde D_{ij}\right)^2 \tag{19}$$

This choice is, however, problematic since it measures absolute error. Say for some points x we found a good solution y; if we scale all coordinates x by a factor $c > 1$, we would like the similarly scaled positions $cy$ to be a good solution. The new error is, however, $c^2$ times larger. Hence one defines scaled cost functions. Three choices are

$$J_{ee} = \frac{\sum_{i<j}\left(D_{ij} - \tilde D_{ij}\right)^2}{\sum_{i<j} D_{ij}^2} \tag{20}$$

$$J_{ff} = \sum_{i<j}\left(\frac{D_{ij} - \tilde D_{ij}}{D_{ij}}\right)^2 \tag{21}$$

$$J_{ef} = \frac{1}{\sum_{i<j} D_{ij}}\sum_{i<j}\frac{\left(D_{ij} - \tilde D_{ij}\right)^2}{D_{ij}} \tag{22}$$

Remember that the coordinates of the data points $x^{(i)}$ are given, and we should minimize the cost J over the unknown coordinates $y_k^{(i)}$. There are $\tilde d N$ unknowns (3N or 2N), and we minimize J over them by gradient descent. The solution one obtains is not unique; it depends on the initial point.

$J_{ee}$ emphasizes large errors: a difference of 10 between 1000 and 1010 has the same contribution to the cost as the difference between 10 and 20. $J_{ff}$ measures fractional error.
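The notes leave the minimization method open; as a sketch, one can embed the four example points of eq. (17) on a line by minimizing $J_{ee}$ with SciPy's L-BFGS-B minimizer (standing in for plain gradient descent; the random initial point determines which local minimum is found):

```python
import numpy as np
from scipy.optimize import minimize

# Pairwise distance matrix of the 4 example patients, eq. (17)
D = np.array([[0.0,  10.0, 10.77, 2.82],
              [10.0,  0.0,  2.82, 7.21],
              [10.77, 2.82, 0.0,  8.24],
              [2.82,  7.21, 8.24, 0.0]])
N, d_embed = D.shape[0], 1          # embed the 4 points on a line
iu = np.triu_indices(N, k=1)        # the N(N-1)/2 independent pairs i < j

def J_ee(y_flat):
    """Scaled cost of eq. (20): sum (D - D~)^2 / sum D^2 over pairs i < j."""
    y = y_flat.reshape(N, d_embed)
    Dt = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    return ((D[iu] - Dt[iu]) ** 2).sum() / (D[iu] ** 2).sum()

# Minimize from a random start; the minimum found depends on this initial point
rng = np.random.default_rng(0)
res = minimize(J_ee, rng.normal(size=N * d_embed), method="L-BFGS-B")
print(res.fun)   # residual cost of the best 1-d embedding found
```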