
Chris Bishop's PRML
Ch. XII: Continuous Latent Variables

Caroline Bernard-Michel & Hervé Jégou

June 12, 2008


Introduction

- Aim of this chapter: dimensionality reduction.
- Useful for lossy data compression, feature extraction and data visualization.
- Example: synthetic data set
  - Choose one of the off-line digit images.
  - Create multiple copies of it, each with a random displacement and rotation.
  - Individuals = images (28 × 28 = 784 pixels).
  - Variables = pixel grey levels.
  - Only two latent variables: the translation and the rotation.


Chapter content

- Principal Component Analysis
  - Maximum variance formulation
  - Minimum-error formulation
  - Applications of PCA
  - PCA for high-dimensional data
- Probabilistic PCA
  - Problem setup
  - Maximum likelihood PCA
  - EM algorithm for PCA
  - Bayesian PCA
  - Factor analysis
- Kernel PCA
- Nonlinear Latent Variable Models
  - Independent component analysis
  - Autoassociative neural networks
  - Modelling nonlinear manifolds

Maximum variance formulation

- Consider a data set of observations {x_n}, n = 1, ..., N, where each x_n has dimensionality D.
- Idea of PCA: project the data onto a space of lower dimensionality M < D, called the principal subspace, while maximizing the variance of the projected data.

[Figure: illustration of projecting a two-dimensional data set onto a one-dimensional principal subspace (axis values not reproduced).]


Notations

We will denote by:
- D the dimensionality of the data
- M the fixed dimension of the principal subspace
- {u_i}, i = 1, ..., M, the basis vectors of the principal subspace (each a (D × 1) vector)
- the sample mean (a (D × 1) vector):

  \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n    (1.90)

- the sample covariance matrix (a (D × D) matrix):

  S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T    (1)


Idea of PCA with one-dimensional principal subspace

- Consider a normalized D-dimensional vector u_1 (u_1^T u_1 = 1).
- Each point x_n is projected onto the scalar u_1^T x_n.
- The mean of the projected data is:

  u_1^T \bar{x}    (2)

- The variance of the projected data is:

  \frac{1}{N} \sum_{n=1}^{N} (u_1^T x_n - u_1^T \bar{x})^2 = u_1^T S u_1    (3)

Idea of PCA: maximize the projected variance u_1^T S u_1 with respect to u_1, under the normalization constraint u_1^T u_1 = 1.

Idea of PCA with one-dimensional principal subspace

- Trick: introduce a Lagrange multiplier \lambda_1.
- Perform an unconstrained maximization of u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1).
- The solution must satisfy:

  S u_1 = \lambda_1 u_1    (4)

- So u_1 must be an eigenvector of S with eigenvalue \lambda_1.
- The variance of the projected data is \lambda_1 (since u_1^T S u_1 = \lambda_1), so \lambda_1 has to be the largest eigenvalue!
- Additional principal components are obtained by maximizing the projected variance amongst all possible directions orthogonal to those already considered.
- PCA = computing the eigenvectors of the data covariance matrix corresponding to the largest eigenvalues.

Note: \sum_{i=1}^{D} \lambda_i is generally called the total inertia (or total variance). The percentage of inertia explained by component u_i is then \lambda_i / \sum_{j=1}^{D} \lambda_j. A minimal numerical sketch of this eigendecomposition view of PCA is given below.
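The following NumPy sketch is illustrative and not part of the original slides; the data, variable names and the choice M = 2 are assumptions. It forms the sample covariance matrix S, extracts its leading eigenvectors, and reports the fraction of total inertia explained by each component.

```python
import numpy as np

# Minimal sketch: PCA by eigendecomposition of the sample covariance matrix S,
# keeping the M directions of largest variance (eqs. (1) and (4)).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[3.0, 1.0, 0.0],
                                 [1.0, 2.0, 0.0],
                                 [0.0, 0.0, 0.1]],
                            size=500)                      # (N, D) data matrix

x_bar = X.mean(axis=0)                                     # sample mean, eq. (1.90)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]               # covariance, eq. (1)

eigvals, eigvecs = np.linalg.eigh(S)                       # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                          # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2
U_M = eigvecs[:, :M]                                       # principal subspace basis {u_i}
explained = eigvals / eigvals.sum()                        # fraction of total inertia per component
Z = (X - x_bar) @ U_M                                      # projected (latent) coordinates

print("explained variance ratios:", np.round(explained, 3))
```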



Minimum-error formulation

- Based on projection error minimization.
- Consider a complete D-dimensional orthonormal basis {u_i}, i = 1, ..., D, satisfying u_i^T u_j = \delta_{ij}.
- Each data point x_n can be represented exactly by:

  x_n = \sum_{i=1}^{D} \alpha_{ni} u_i, where \alpha_{ni} = x_n^T u_i    (5)

- x_n can be approximated by:

  \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i    (6)

- Idea of PCA: minimize the distortion J introduced by the reduction in dimensionality:

  J = \frac{1}{N} \sum_{n=1}^{N} ||x_n - \tilde{x}_n||^2    (7)


Minimum-error formulation (2)

- Setting the derivatives with respect to z_{nj} and b_j to zero, one obtains:

  z_{nj} = x_n^T u_j and b_j = \bar{x}^T u_j    (8)

- J can then be expressed as:

  J = \sum_{i=M+1}^{D} u_i^T S u_i    (9)

- The minimum is obtained when {u_i}, i = M+1, ..., D, are the eigenvectors of S associated with the smallest eigenvalues.
- The distortion is then given by J = \sum_{i=M+1}^{D} \lambda_i.
- x_n is approximated by:

  \tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i = \bar{x} + \sum_{i=1}^{M} (x_n^T u_i - \bar{x}^T u_i) u_i    (10)


Application of PCA: data compression

- Individuals = images
- Variables = grey levels of each pixel (784)

[Figure: the mean digit image, together with plots (a) and (b) showing the eigenvalue spectrum \lambda_i and the cumulative distortion as functions of the eigenvalue index (axis values not reproduced).]

Application of PCA: data compression (2)

- Compression using the PCA approximation of x_n:

  \tilde{x}_n = \bar{x} + \sum_{i=1}^{M} (x_n^T u_i - \bar{x}^T u_i) u_i    (11)

- For each data point we have replaced the D-dimensional vector x_n with an M-dimensional vector whose components are (x_n^T u_i - \bar{x}^T u_i).

[Figure: an original digit image and its reconstructions from increasing numbers of principal components.]

A possible implementation of this compression/reconstruction step is sketched below.
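This NumPy sketch is an assumption-laden illustration (random stand-in data, illustrative function names); it compresses each data point to its M projection coefficients and reconstructs it with equation (11).

```python
import numpy as np

# Minimal sketch of PCA compression / reconstruction, eq. (11).
def pca_fit(X, M):
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)           # (1/N) sum (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)
    U_M = eigvecs[:, np.argsort(eigvals)[::-1][:M]]   # top-M eigenvectors as columns
    return x_bar, U_M

def compress(X, x_bar, U_M):
    # M coefficients per point: (x_n^T u_i - x_bar^T u_i)
    return (X - x_bar) @ U_M

def reconstruct(Z, x_bar, U_M):
    # eq. (11): x_tilde_n = x_bar + sum_i z_ni u_i
    return x_bar + Z @ U_M.T

# usage on random stand-in data (a real case would use the 28x28 digit images)
X = np.random.default_rng(1).normal(size=(100, 784))
x_bar, U_M = pca_fit(X, M=50)
Z = compress(X, x_bar, U_M)
X_tilde = reconstruct(Z, x_bar, U_M)
print("mean squared distortion J:", np.mean(np.sum((X - X_tilde) ** 2, axis=1)))
```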

Application of PCA: data pre-processing

- Usual standardization treats each variable individually: every variable gets zero mean and unit variance, but the variables remain correlated.
- PCA can be used for a more complete standardization (whitening):
  - Write the eigenvector equation as SU = UL, where L is a D × D diagonal matrix with elements \lambda_i and U is a D × D orthogonal matrix with columns u_i.
  - Define y_n = L^{-1/2} U^T (x_n - \bar{x}).
  - Then y_n has zero mean and identity covariance matrix (the new variables are decorrelated).

[Figure: a two-dimensional data set shown in its original units and after PCA whitening (axis values not reproduced).]

A minimal sketch of this whitening transform is given below.
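The following sketch is illustrative (data and the small eps regularizer are assumptions); it applies the transform y_n = L^{-1/2} U^T (x_n - x̄) and checks that the result has identity covariance.

```python
import numpy as np

# Minimal sketch: PCA whitening, giving zero mean and identity covariance.
def whiten(X, eps=1e-10):
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    L, U = np.linalg.eigh(S)                    # S U = U L
    Y = (X - x_bar) @ U / np.sqrt(L + eps)      # divide each projected coordinate by sqrt(lambda_i)
    return Y

X = np.random.default_rng(2).normal(size=(300, 5)) @ np.diag([5.0, 2.0, 1.0, 1.0, 0.5])
Y = whiten(X)
print(np.round(np.cov(Y, rowvar=False, bias=True), 2))   # approximately the identity matrix
```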


Application of PCA: data pre-processing

- Comparison: PCA chooses the direction of maximum variance, whereas Fisher's linear discriminant takes account of the class labels (see Chap. 4).

[Figure: comparison of the PCA direction and the Fisher direction on a two-class data set (axis values not reproduced).]

- Visualization: projection of the oil flow data onto the first two principal components. The three classes correspond to the geometrical configurations of the oil, water and gas phases: stratified, annular and homogeneous.


Chapter content
I

Principal Component Analysis


I
I
I
I

Probabilistic PCA
I
I
I
I
I

I
I

Maximum variance formulation


Minimum-error formulation
Applications of PCA
PCA for high-dimensional data
Problem setup
Maximum likelihood PCA
EM algorithm for PCA
Bayesian PCA
Factor analysis

Kernel PCA
Nonlinear Latent Variable Models
I
I
I

Independent component analysis


Autoassociative neural networks
Modelling nonlinear manifolds
Caroline Bernard-Michel & Herv
e Jegou

Chris Bishops PRML Ch. XII: Continuous Latent Variables

PCA for high-dimensional data

- Case where the number of data points N is smaller than the dimensionality D.
- At least D - N + 1 of the eigenvalues are then equal to zero.
- Direct eigendecomposition of the D × D covariance matrix is generally computationally infeasible.
- Let X denote the (N × D) centred data matrix. The covariance matrix can be written as S = N^{-1} X^T X.
- It can be shown that S has D - N + 1 zero eigenvalues, and that its remaining eigenvalues are the same as those of the N × N matrix N^{-1} X X^T.
- If we denote the eigenvectors of N^{-1} X X^T by v_i, the normalized eigenvectors u_i of S can be recovered as:

  u_i = \frac{1}{(N \lambda_i)^{1/2}} X^T v_i    (12)

A numerical sketch of this trick is given below.
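The following sketch is illustrative (random data, assumed sizes N = 50 and D = 784); it diagonalizes the small N × N matrix and recovers the high-dimensional eigenvectors via equation (12).

```python
import numpy as np

# Minimal sketch of PCA when N < D, eq. (12): work with the small N x N matrix
# (1/N) X X^T instead of the D x D covariance matrix.
rng = np.random.default_rng(3)
N, D, M = 50, 784, 5
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                               # centred data matrix

K = X @ X.T / N                                      # N x N matrix, cheap to diagonalize
lam, V = np.linalg.eigh(K)
order = np.argsort(lam)[::-1][:M]
lam, V = lam[order], V[:, order]

U = X.T @ V / np.sqrt(N * lam)                       # u_i = X^T v_i / sqrt(N lambda_i)

# sanity check: these u_i are unit-norm eigenvectors of S = X^T X / N
S = X.T @ X / N
print(np.allclose(S @ U, U * lam, atol=1e-8), np.round(np.linalg.norm(U, axis=0), 6))
```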


Probabilistic PCA

Advantages:
- Leads to an EM algorithm for PCA that is computationally efficient.
- Allows missing values in the data set to be handled.
- Mixtures of probabilistic PCA models can be formulated.
- Forms the basis for the Bayesian treatment of PCA, in which the dimensionality of the principal subspace can be found automatically.
- ...


Probabilistic PCA (2)

Related to factor analysis:
- A latent variable model seeks to relate a D-dimensional observation vector x to a corresponding M-dimensional Gaussian latent variable z:

  x = W z + \mu + \epsilon    (13)

  where
  - z is an M-dimensional Gaussian latent variable,
  - W is a (D × M) matrix (whose columns span the latent space),
  - \epsilon is D-dimensional Gaussian noise,
  - \epsilon and z are independent,
  - \mu is a parameter vector that permits the model to have non-zero mean.
- Factor analysis: \epsilon ~ N(0, \Psi), with \Psi diagonal.
- Probabilistic PCA: \epsilon ~ N(0, \sigma^2 I).



Probabilistic PCA (3)

- The isotropic Gaussian noise model for \epsilon implies that the conditional distribution over x given z is:

  x | z ~ N(W z + \mu, \sigma^2 I)    (14)

- Taking the prior z ~ N(0, I), the marginal distribution of x is obtained by integrating out the latent variables and is likewise Gaussian:

  x ~ N(\mu, C)    (15)

  with C = W W^T + \sigma^2 I.
- To do: estimate the parameters \mu, W and \sigma^2.

As a sanity check, the generative model can be simulated directly; a sketch follows.
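This sketch is an illustration with assumed sizes and parameters; it samples from the generative model (13) and verifies that the sample covariance approaches C = W W^T + sigma^2 I from equation (15).

```python
import numpy as np

# Minimal sketch: sample from the probabilistic PCA generative model
# x = W z + mu + eps and compare the empirical covariance with C.
rng = np.random.default_rng(4)
D, M, N, sigma2 = 5, 2, 20000, 0.1
W = rng.normal(size=(D, M))
mu = rng.normal(size=D)

Z = rng.normal(size=(N, M))                          # z ~ N(0, I)
E = rng.normal(scale=np.sqrt(sigma2), size=(N, D))   # eps ~ N(0, sigma^2 I)
X = Z @ W.T + mu + E                                 # eq. (13)

C = W @ W.T + sigma2 * np.eye(D)                     # model covariance, eq. (15)
print(np.max(np.abs(np.cov(X, rowvar=False) - C)))   # small for large N
```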



Maximum likelihood PCA

- Given a data set X = {x_n} of observed data points, the log likelihood is:

  L = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln|C| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T C^{-1} (x_n - \mu)    (16)

- Setting the derivative with respect to \mu to zero gives:

  \mu = \bar{x}    (17)

- Back-substituting, we can write:

  L = -\frac{N}{2} \{ D \ln(2\pi) + \ln|C| + \mathrm{Tr}(C^{-1} S) \}    (18)

- This solution for \mu represents the unique maximum.


Maximum likelihood PCA (2)

Maximization with respect to W and \sigma^2 is more complex but has an exact closed-form solution:

  W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R    (19)

where
- U_M is a (D × M) matrix whose columns are the eigenvectors of S with the M largest eigenvalues,
- L_M is the (M × M) diagonal matrix of the corresponding eigenvalues \lambda_i,
- R is an arbitrary (M × M) orthogonal matrix.

  \sigma^2_{ML} = \frac{1}{D - M} \sum_{i=M+1}^{D} \lambda_i    (20)

i.e. the average variance of the discarded dimensions. A numerical sketch of this closed-form solution is given below.
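The sketch below is illustrative (synthetic data, R fixed to the identity); it implements the closed-form estimates (19) and (20).

```python
import numpy as np

# Minimal sketch of the closed-form ML solution for probabilistic PCA,
# eqs. (19)-(20), taking R = I.
def ppca_ml(X, M):
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]

    sigma2 = lam[M:].mean()                            # eq. (20): average discarded variance
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))  # eq. (19) with R = I
    return x_bar, W, sigma2

X = np.random.default_rng(5).normal(size=(1000, 6)) @ np.diag([4.0, 3.0, 1.0, 0.3, 0.3, 0.3])
mu, W, sigma2 = ppca_ml(X, M=2)
print("sigma^2_ML:", round(sigma2, 3))
```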



Maximum likelihood PCA (3)

- R can be interpreted as a rotation matrix in the M-dimensional latent space.
- The predictive density is unchanged by such rotations.
- If R = I, the columns of W are the principal component eigenvectors scaled by (\lambda_i - \sigma^2)^{1/2}.
- The model correctly captures the variance of the data along the principal axes and approximates the variance in all remaining directions with a single average value \sigma^2; the rest of the variance is "lost" in the projection.


Maximum likelihood PCA (4)

- PCA is generally expressed as a projection of points from the D-dimensional data space onto an M-dimensional subspace.
- In probabilistic PCA this role is played by the posterior distribution:

  z | x ~ N(M^{-1} W^T (x - \mu), \sigma^2 M^{-1})    (21)

  where M = W^T W + \sigma^2 I.
- The posterior mean is given by:

  E[z | x] = M^{-1} W_{ML}^T (x - \bar{x})    (22)

  Note: this takes the same form as the solution of a regularized linear regression!
- It projects to a point in data space given by:

  W E[z | x] + \mu    (23)


EM algorithm for PCA

- In spaces of high dimensionality, there are computational advantages in using EM.
- It can be extended to factor analysis, for which there is no closed-form solution.
- It can be used when values are missing, for mixture models, ...
- It requires the complete-data log likelihood function, which takes the form:

  L_c = \sum_{n=1}^{N} \{ \ln p(x_n | z_n) + \ln p(z_n) \}    (24)

- In the following, \mu is set to the sample mean \bar{x}.


EM algorithm for PCA

- Initialize the parameters W and \sigma^2.
- E-step: evaluate the expected complete-data log likelihood

  E[L_c] = -\sum_{n=1}^{N} \{ \frac{D}{2} \ln(2\pi\sigma^2) + \frac{1}{2} \mathrm{Tr}(E[z_n z_n^T]) + \frac{1}{2\sigma^2} ||x_n - \bar{x}||^2 - \frac{1}{\sigma^2} E[z_n]^T W^T (x_n - \bar{x}) + \frac{1}{2\sigma^2} \mathrm{Tr}(E[z_n z_n^T] W^T W) \}

  using the sufficient statistics

  E[z_n] = M^{-1} W^T (x_n - \bar{x})
  E[z_n z_n^T] = \sigma^2 M^{-1} + E[z_n] E[z_n]^T

- M-step: re-estimate the parameters

  W_{new} = [ \sum_{n=1}^{N} (x_n - \bar{x}) E[z_n]^T ] [ \sum_{n=1}^{N} E[z_n z_n^T] ]^{-1}    (25)

  \sigma^2_{new} = \frac{1}{ND} \sum_{n=1}^{N} \{ ||x_n - \bar{x}||^2 - 2 E[z_n]^T W_{new}^T (x_n - \bar{x}) + \mathrm{Tr}(E[z_n z_n^T] W_{new}^T W_{new}) \}

- Check for convergence.

A runnable sketch of these E and M steps is given after this slide.
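The following sketch is illustrative (synthetic data, assumed iteration count); it implements the E-step statistics and the M-step updates (25) with mu fixed to the sample mean.

```python
import numpy as np

# Minimal sketch of the EM algorithm for probabilistic PCA.
def ppca_em(X, M, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)                          # centre the data (mu = x_bar)
    W = rng.normal(size=(D, M))
    sigma2 = 1.0

    for _ in range(n_iter):
        # E-step: posterior statistics of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                           # rows are E[z_n]^T
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez       # sum_n E[z_n z_n^T]

        # M-step: update W and sigma^2
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(sumEzz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

X = np.random.default_rng(6).normal(size=(500, 4)) @ np.diag([3.0, 2.0, 0.5, 0.5])
W, sigma2 = ppca_em(X, M=2)
print("sigma^2 from EM:", round(sigma2, 3))
```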



EM algorithm for PCA

- When \sigma^2 \to 0, the EM approach corresponds to standard PCA.
- Define \tilde{X}, a matrix of size N × D whose n-th row is given by x_n - \bar{x}.
- Define \Omega, a matrix of size M × N whose n-th column is given by the vector E[z_n].
- The E-step becomes:

  \Omega = (W_{old}^T W_{old})^{-1} W_{old}^T \tilde{X}^T    (26)

  i.e. an orthogonal projection of the data onto the current estimate of the principal subspace.
- The M-step takes the form:

  W_{new} = \tilde{X}^T \Omega^T (\Omega \Omega^T)^{-1}    (27)

  i.e. a re-estimation of the principal subspace minimizing the squared reconstruction error with the projections \Omega held fixed.

EM algorithm for PCA

[Figure: panels (a)-(f) illustrating successive E and M steps of the EM algorithm for PCA on a synthetic two-dimensional data set.]


Idea of Bayesian PCA

- Useful to choose the dimensionality M of the principal subspace.
- Cross-validation with a validation data set is computationally costly!
- Instead, define an independent Gaussian prior over each column w_i of W, whose variance is governed by a precision parameter \alpha_i:

  p(W | \alpha) = \prod_{i=1}^{M} \left( \frac{\alpha_i}{2\pi} \right)^{D/2} \exp\{ -\frac{\alpha_i}{2} w_i^T w_i \}    (28)

- The values of \alpha_i are estimated iteratively by maximizing the logarithm of the marginal likelihood function:

  p(X | \alpha, \mu, \sigma^2) = \int p(X | W, \mu, \sigma^2) p(W | \alpha) dW    (29)

- The effective dimensionality of the principal subspace is determined by the number of finite \alpha_i values; the principal subspace is spanned by the corresponding w_i.

Idea of Bayesian PCA

- Maximization with respect to \alpha_i gives:

  \alpha_i^{new} = \frac{D}{w_i^T w_i}    (30)

- These re-estimations are interleaved with the EM algorithm, in which the update for W is modified to:

  W_{new} = [ \sum_{n=1}^{N} (x_n - \bar{x}) E[z_n]^T ] [ \sum_{n=1}^{N} E[z_n z_n^T] + \sigma^2 A ]^{-1}    (31)

  with A = diag(\alpha_i).
- Example: 300 points in dimension D sampled from a Gaussian distribution having M = 3 directions with larger variance; the \alpha_i of the superfluous directions grow large, switching the corresponding w_i off.

A sketch of these modified updates, reusing the EM machinery above, is given below.
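The sketch below is an assumption-heavy illustration: sigma^2 is held fixed for simplicity (a full treatment would update it too), the data are synthetic, and the precision cap is only there to avoid numerical overflow.

```python
import numpy as np

# Minimal sketch of the Bayesian PCA updates, eqs. (30)-(31), wrapped around
# the same E-step statistics as the EM algorithm for probabilistic PCA.
def bayesian_ppca_step(Xc, W, sigma2, alpha):
    N, D = Xc.shape
    M = W.shape[1]
    # E-step (as in probabilistic PCA)
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
    Ez = Xc @ W @ Minv                                    # rows are E[z_n]^T
    sumEzz = N * sigma2 * Minv + Ez.T @ Ez                # sum_n E[z_n z_n^T]
    # Modified M-step for W, eq. (31), with A = diag(alpha_i)
    W_new = (Xc.T @ Ez) @ np.linalg.inv(sumEzz + sigma2 * np.diag(alpha))
    # Re-estimate the precisions, eq. (30), capped to avoid overflow
    alpha_new = np.minimum(D / np.maximum((W_new ** 2).sum(axis=0), 1e-12), 1e12)
    return W_new, alpha_new

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10)) * np.array([3.0, 2.5, 2.0] + [0.3] * 7)   # 3 informative directions
Xc = X - X.mean(axis=0)
W, alpha, sigma2 = rng.normal(size=(10, 9)), np.ones(9), 0.1
for _ in range(100):
    W, alpha = bayesian_ppca_step(Xc, W, sigma2, alpha)
print("log10(alpha_i); small values mark retained directions:", np.round(np.log10(alpha), 1))
```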



Factor analysis

- Closely related to probabilistic PCA, but the covariance of p(x|z) is diagonal instead of isotropic:

  x | z ~ N(W z + \mu, \Psi)    (64)

  where \Psi is a D × D diagonal matrix.
- The variance specific to each observed variable (along its natural axis) is explained by \Psi.
- The covariance structure observed between variables is captured by W.
- Consequences:
  - For PCA: a rotation of the data space gives the same fit, with W rotated by the same matrix.
  - For factor analysis, the analogous property is that a component-wise re-scaling of the data is absorbed into a corresponding re-scaling of the elements of \Psi.


Factor analysis

- The marginal density of the observed variable is:

  x ~ N(\mu, W W^T + \Psi)    (65)

- As in probabilistic PCA, the model is invariant with respect to rotations of the latent space.
- \mu, W and \Psi can be determined by maximum likelihood.
- \mu = \bar{x}, as in probabilistic PCA.
- But there is no closed-form ML solution for W: it is estimated iteratively using EM.


Parameters estimation using EM

- E-step:

  E[z_n] = G W^T \Psi^{-1} (x_n - \bar{x})    (66)
  E[z_n z_n^T] = G + E[z_n] E[z_n]^T    (67)

  where G = (I + W^T \Psi^{-1} W)^{-1}
- M-step:

  W_{new} = [ \sum_{n=1}^{N} (x_n - \bar{x}) E[z_n]^T ] [ \sum_{n=1}^{N} E[z_n z_n^T] ]^{-1}    (69)

  \Psi_{new} = diag\{ S - W_{new} \frac{1}{N} \sum_{n=1}^{N} E[z_n] (x_n - \bar{x})^T \}    (70)

  where the diag operator zeros all non-diagonal elements.

A sketch of one EM iteration for factor analysis is given below.
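This sketch is illustrative (synthetic data, assumed iteration count, a small floor on the variances added for numerical robustness); it implements equations (66)-(70) for one EM iteration and loops it.

```python
import numpy as np

# Minimal sketch of one EM iteration for factor analysis, eqs. (66)-(70).
def factor_analysis_em_step(Xc, W, Psi):
    """Xc: (N, D) centred data, W: (D, M) loadings, Psi: (D,) diagonal noise variances."""
    N, D = Xc.shape
    M = W.shape[1]
    Psi_inv = 1.0 / Psi

    # E-step: G = (I + W^T Psi^{-1} W)^{-1}, E[z_n] = G W^T Psi^{-1} (x_n - x_bar)
    G = np.linalg.inv(np.eye(M) + (W * Psi_inv[:, None]).T @ W)
    Ez = Xc @ (Psi_inv[:, None] * W) @ G          # rows are E[z_n]^T
    sumEzz = N * G + Ez.T @ Ez                    # sum_n E[z_n z_n^T], from eq. (67)

    # M-step, eqs. (69)-(70)
    W_new = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
    S = Xc.T @ Xc / N
    Psi_new = np.diag(S - W_new @ (Ez.T @ Xc) / N)
    return W_new, np.maximum(Psi_new, 1e-6)       # floor keeps the variances positive

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 6)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.5, 0.5])
Xc = X - X.mean(axis=0)
W, Psi = rng.normal(size=(6, 2)), np.ones(6)
for _ in range(100):
    W, Psi = factor_analysis_em_step(Xc, W, Psi)
print("estimated noise variances Psi:", np.round(Psi, 2))
```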




Kernel PCA

Applying the ideas of kernel substitution (see Chapter 6) to PCA.

[Figure: schematic of kernel PCA: a nonlinear feature mapping takes the data space (x_1, x_2) to a feature space in which linear PCA finds the principal eigenvector v_1, whose projections correspond to nonlinear curves in the original data space.]

Kernel PCA: preliminaries

Kernel substitution: express each step of PCA in terms of inner products x^T x' between data vectors, so that the inner product can be generalized to a kernel.
- Recall that the principal components are defined by:

  S u_i = \lambda_i u_i    (71)

  with ||u_i||^2 = u_i^T u_i = 1 and the covariance matrix S defined (for zero-mean data) as:

  S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T    (72)

- Consider a nonlinear mapping \phi into an M-dimensional feature space: \phi maps any data point x_n onto \phi(x_n).


Kernel PCA

- Assume for now that \sum_n \phi(x_n) = 0.
- The M × M sample covariance matrix C in feature space is given by:

  C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T    (73)

  with eigenvector expansion:

  C v_i = \lambda_i v_i, i = 1, ..., M    (74)

- Goal: solve this eigenvalue problem without working explicitly in the feature space.
- Each eigenvector v_i can be written as a linear combination of the \phi(x_n), of the form:

  v_i = \sum_{n=1}^{N} a_{in} \phi(x_n)    (76)

Kernel PCA

Note: there is a typo in the book's equation (12.78).
- The eigenvector equation can be expressed in terms of the kernel function as:

  K^2 a_i = \lambda_i N K a_i    (79)

  where a_i = (a_{1i}, ..., a_{Ni})^T is unknown at this point.
- The a_i can be found by solving the eigenvalue problem:

  K a_i = \lambda_i N a_i    (80)

- The normalization condition for the a_i is obtained by requiring that the eigenvectors in feature space be normalized:

  1 = v_i^T v_i = a_i^T K a_i = \lambda_i N a_i^T a_i    (81)

Kernel PCA

- The resulting principal component projections can also be cast in terms of the kernel function.
- A point x is projected onto eigenvector i as:

  y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} k(x, x_n)    (82)

- Remarks:
  - Standard PCA gives at most D linear principal components.
  - The number of nonlinear principal components can exceed D.
  - But the number of nonzero eigenvalues cannot exceed the number of data points N.


Kernel PCA

- Up to now we assumed that the projected data has zero mean in feature space:

  \sum_{n=1}^{N} \phi(x_n) = 0

- This mean cannot in general be computed and subtracted explicitly.
- However, the centred feature vectors can be expressed as:

  \tilde{\phi}(x_n) = \phi(x_n) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)    (83)


Kernel PCA

- The corresponding elements of the centred Gram matrix are given by:

  \tilde{K}_{nm} = \tilde{\phi}(x_n)^T \tilde{\phi}(x_m)
               = k(x_n, x_m) - \frac{1}{N} \sum_{l=1}^{N} k(x_l, x_m) - \frac{1}{N} \sum_{l=1}^{N} k(x_n, x_l) + \frac{1}{N^2} \sum_{j=1}^{N} \sum_{l=1}^{N} k(x_j, x_l)    (84)

  i.e.,

  \tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N    (85)

  where 1_N denotes the N × N matrix in which every element equals 1/N.

A minimal kernel PCA sketch implementing this centring is given below.
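The sketch below is an illustration with assumed choices (Gaussian kernel, gamma value, a two-rings toy data set); it builds the Gram matrix, centres it via equation (85), solves the eigenvalue problem (80), applies the normalization (81) and computes the projections (82).

```python
import numpy as np

# Minimal sketch of kernel PCA with a Gaussian (RBF) kernel.
def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, M, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N      # eq. (85)

    eigvals, A = np.linalg.eigh(K_tilde)                          # K a_i = (lambda_i N) a_i, eq. (80)
    order = np.argsort(eigvals)[::-1][:M]
    eigvals, A = eigvals[order], A[:, order]
    A = A / np.sqrt(eigvals)                                      # normalization, eq. (81)
    return K_tilde @ A                                            # projections y_i(x_n), eq. (82)

# two concentric rings: nonlinear structure that linear PCA cannot separate
rng = np.random.default_rng(9)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.ones(100), 3 * np.ones(100)])
X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
print(kernel_pca(X, M=2, gamma=0.5).shape)
```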

Kernel PCA: example and remark

- PCA is often used to reconstruct a sample x_n with good accuracy from its projections onto the first principal components.
- In kernel PCA this is not possible in general, as we cannot map points explicitly from the feature space back to the data space.


Independent component analysis

- Consider models in which
  - the observed variables are related linearly to the latent variables,
  - but the latent distribution is non-Gaussian.
- An important class of such models is independent component analysis, for which the distribution of the latent variables factorizes:

  p(z) = \prod_{j=1}^{M} p(z_j)    (86)


Application case: blind source separation

Setup:
- Two people talking at the same time, their voices recorded using two microphones.
- Objective: reconstruct the two signals separately.
- "Blind" because we are given only the mixed data: we have observed neither the original sources nor the mixing coefficients.
- Under some assumptions (no time delays and no echoes):
  - the signals received by the microphones are linear combinations of the voice amplitudes,
  - the coefficients of this linear combination are constant.


Application case: blind source separation

Hereafter, a possible approach (see MacKay 2003) that does not consider the temporal aspect of the problem.
- Consider a generative model with
  - latent variables: the unobserved speech signal amplitudes z = [z_1 z_2]^T,
  - observed variables: the two signal values o = [o_1 o_2]^T at the microphones.
- The distribution of the latent variables factorizes as p(z) = p(z_1) p(z_2).
- There is no need to include noise: the observed variables are deterministic linear combinations of the latent variables,

  o = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} z


Application case: blind source separation

- Given a set of observations, the likelihood function is a function of the coefficients a_{ij}; the log likelihood is maximized using gradient-based optimization.
- This is a particular case of independent component analysis.
- It requires the latent variables to have non-Gaussian distributions:
  - in probabilistic PCA, the latent-space distribution is a zero-mean isotropic Gaussian;
  - there is then no way to distinguish between two choices of the latent variables that differ by a rotation in latent space.
- A common choice for the latent-variable distribution is:

  p(z_j) = \frac{1}{\pi \cosh(z_j)} = \frac{2}{\pi (e^{z_j} + e^{-z_j})}    (90)

[Figure: plot of this heavy-tailed latent-variable density on a logarithmic scale.]

A natural-gradient maximum-likelihood sketch for this two-source case is given below.
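The sketch below is not from the slides: it assumes synthetic heavy-tailed sources as stand-ins for speech, an illustrative mixing matrix, and a standard natural-gradient update for maximum-likelihood ICA (with z = W o, the prior of eq. (90) gives d/dz ln p(z) = -tanh(z)). Learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Minimal sketch of maximum-likelihood ICA for the two-microphone problem,
# using the natural-gradient update W <- W + eta * (I - tanh(z) z^T) W.
rng = np.random.default_rng(10)
T = 5000
S = rng.laplace(size=(2, T))                     # two heavy-tailed sources (stand-ins for speech)
A = np.array([[1.0, 0.6], [0.4, 1.0]])           # unknown mixing matrix a_ij
O = A @ S                                        # observed microphone signals o = A z

W = np.eye(2)                                    # unmixing matrix estimate
eta = 0.1                                        # learning rate (illustrative)
for _ in range(500):
    Z = W @ O                                    # current source estimates
    W += eta * (np.eye(2) - np.tanh(Z) @ Z.T / T) @ W   # batch-averaged natural gradient

print("W @ A (should be close to a scaled permutation):")
print(np.round(W @ A, 2))
```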


Autoassociative neural networks

- Chapter 5: neural networks for predicting outputs given inputs.
- They can also be used for dimensionality reduction.

[Figure: an autoassociative network with D inputs, M hidden units (z_1, ..., z_M) and D outputs, trained to reproduce its input at its output.]

- #outputs = #inputs > number of hidden units, so a perfect reconstruction is impossible.
- Find the network parameters w minimizing a given error function, for instance the sum-of-squares error:

  E(w) = \frac{1}{2} \sum_{n=1}^{N} ||y(x_n, w) - x_n||^2    (91)


Autoassociative neural networks

- With linear activation functions:
  - the error function has a unique global minimum;
  - at this minimum the network performs a projection onto the M-dimensional principal component subspace;
  - this subspace is spanned by the vectors of weights.
- Even with nonlinear hidden units, the minimum error is obtained by the principal component subspace:
  - so there is no advantage in using two-layer neural networks to perform dimensionality reduction; use standard PCA techniques instead.

A small numerical sketch of this equivalence is given below.
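This sketch is an assumption-laden illustration (synthetic data, arbitrary learning rate and iteration count): it trains a two-layer linear autoassociative network by gradient descent on the error (91) and compares the subspace spanned by its decoder weights with the top-M PCA subspace.

```python
import numpy as np

# Minimal sketch: a linear autoassociative network recovers the principal subspace.
rng = np.random.default_rng(11)
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ np.diag([3.0, 2.0, 0.5, 0.3, 0.2])
Xc = X - X.mean(axis=0)

W1 = 0.1 * rng.normal(size=(D, M))               # encoder weights (inputs -> hidden)
W2 = 0.1 * rng.normal(size=(M, D))               # decoder weights (hidden -> outputs)
eta, n_iter = 0.05, 20000                        # illustrative settings
for _ in range(n_iter):
    Z = Xc @ W1                                  # hidden unit activations (linear)
    R = Z @ W2 - Xc                              # reconstruction residuals
    gW2 = Z.T @ R / N                            # gradients of the mean squared error
    gW1 = Xc.T @ (R @ W2.T) / N
    W1 -= eta * gW1
    W2 -= eta * gW2

# compare the decoder subspace with the top-M PCA subspace
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P_pca = Vt[:M].T @ Vt[:M]                        # projector onto the PCA subspace
Q, _ = np.linalg.qr(W2.T)
P_net = Q @ Q.T                                  # projector onto the network's subspace
print("subspace difference (small if training has converged):",
      round(float(np.linalg.norm(P_pca - P_net)), 3))
```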


Autoassociative neural networks

[Figure: a four-layer autoassociative network whose first two layers implement a nonlinear mapping F_1 from data space to latent space and whose last two layers implement a mapping F_2 back to data space; their composition defines a nonlinear manifold S in data space.]

- With additional hidden layers (four here) and nonlinear units, the approach becomes worthwhile: the network can perform a nonlinear dimensionality reduction.
- Training the network involves nonlinear optimization techniques (with the risk of ending in a suboptimal local minimum).


Modelling nonlinear manifolds

- Data may lie on a manifold of lower dimensionality than the observed data space.
- Capturing this property explicitly may improve the density modelling.
- Possible approach: model the nonlinear manifold by a piece-wise linear approximation, e.g.
  - k-means clustering followed by PCA within each cluster;
  - better: use the reconstruction error for cluster assignment.
- These approaches are limited by not having an overall density model.
- Tipping and Bishop: a full probabilistic model using a mixture distribution whose components are probabilistic PCA models, with both discrete latent variables (the mixture component) and continuous ones.


Modelling nonlinear manifolds

Alternative approach: use a single nonlinear model.
- Principal curves:
  - An extension of PCA (which finds a linear subspace).
  - A curve is described by a vector-valued function f(\lambda).
  - Natural parametrization: the arc length \lambda along the curve.
  - Given a point \hat{x}, we can find the closest point \lambda = g_f(\hat{x}) on the curve in terms of the Euclidean distance.
  - A principal curve is a curve for which every point on the curve is the mean of all the points in data space that project to it, so that:

    E[x | g_f(x) = \lambda] = f(\lambda)    (92)

  - There may be many principal curves for a given continuous distribution.
  - Hastie et al.: a two-stage iterative procedure for finding a principal curve.


Modelling nonlinear manifolds: MDS

- PCA is often used for the purpose of visualization.
- Another technique with a similar aim: multidimensional scaling (MDS, Cox and Cox 2000).
  - It preserves as closely as possible the pairwise distances between data points.
  - It involves finding the eigenvectors of the distance matrix.
  - It gives results equivalent to PCA when the distance is Euclidean.
  - But it can be extended to a wide variety of data types specified in terms of a similarity matrix.

A sketch of classical MDS from a Euclidean distance matrix is given below.
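The following sketch is illustrative (random data, assumed embedding dimension M = 2); it implements classical MDS by double-centring the squared-distance matrix into a Gram matrix and embedding with its top eigenvectors, which for Euclidean distances reproduces PCA coordinates up to sign.

```python
import numpy as np

# Minimal sketch of classical (metric) MDS from squared Euclidean distances.
def classical_mds(D2, M=2):
    """D2: (N, N) matrix of squared pairwise distances."""
    N = D2.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)        # centring matrix
    B = -0.5 * J @ D2 @ J                           # Gram matrix of the centred points
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:M]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

X = np.random.default_rng(12).normal(size=(100, 5))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Y = classical_mds(D2, M=2)                          # 2-D embedding coordinates
print(Y.shape)
```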


Modelling nonlinear manifolds: LLE

- Locally linear embedding (LLE, Roweis and Saul 2000):
  - Compute the set of coefficients that best reconstructs each data point from its neighbours.
  - These coefficients are arranged to be invariant to rotations, translations and scalings, so they characterize the local geometrical properties of the neighbourhood.
  - LLE then maps the high-dimensional data to a lower-dimensional space while preserving these coefficients: the same weights are used to reconstruct the data points in the low-dimensional space as in the high-dimensional space.
  - Although nonlinear, LLE does not exhibit local minima.


Modelling nonlinear manifolds: ISOMAP

- Isometric feature mapping (ISOMAP, Tenenbaum et al. 2000).
- Goal: project the data to a lower-dimensional space using MDS, but with dissimilarities defined in terms of geodesic distances on the manifold.
- Algorithm:
  - First define the neighbourhood of each point, using K nearest neighbours or an \epsilon-ball search.
  - Construct a neighbourhood graph whose edge weights are the Euclidean distances.
  - Approximate each geodesic distance by the sum of Euclidean distances along the shortest path connecting the two points.
  - Apply MDS to the geodesic distance matrix.


Modelling nonlinear manifolds: other techniques

- Latent traits: models having continuous latent variables together with discrete observed variables.
  - Can be used to visualize binary vectors, analogously to PCA for continuous variables.
- Density networks: the nonlinear function is governed by a multilayered neural network.
  - Flexible models, but computationally intensive to train.
- Generative topographic mapping (GTM): uses a restricted form for the nonlinear function, so the model is nonlinear yet efficient to train.
  - The latent distribution is defined by a finite regular grid over the latent space (of dimensionality 2, typically).
  - Can be seen as a probabilistic version of the self-organizing map (SOM, Kohonen).

