Functional PCA
Outline
In functional data analysis, our data are curves: x_1(t), x_2(t), ..., x_n(t). For the sake of simplicity, all of the functions in this talk will be indexed by time, t.

Well, to be precise, our raw data still come in the form of discrete data sets; the methods of functional data analysis assume we have functional observations (more on this later).

Each functional observation is usually assumed to be a smooth curve underlying longitudinal data.
Weather data is a classic example in functional data analysis (FDA). One might be interested in seeing how the temperature changes over the course of a year in various geographical locations. Similarly, you might also record precipitation levels over time (Ramsay, 2006).

Growth data is another big topic: how does a child's height change over time?

Psychometrics supplies a third example, via item response analysis. These examples highlight the idea that t need not stand for time. One could examine how the probability of answering a test question correctly varies with IQ (or some other latent variable); each test question gives rise to a different functional observation (Ramsay, 2006).
Classical longitudinal methods already take into account how measurements on an individual change over time. What do we gain by assuming these measurements to be part of smooth trajectories over time and analyzing the data as such?

Conceptually, if you really believe that measurements are coming from smooth functions, you might want your statistical analysis to reflect this fact.

Functional data analysis offers an intuitively natural approach to analyzing highly correlated data. Smoothness can be thought of as saying that, as you look at two separate observations taken closer and closer in time, one observation comes closer and closer to perfectly determining the other (Ferraty and Vieu, 2006).
Some questions can only be asked once you assume you can talk about things like continuity and differentiation.

For example, in our weather data, we would expect temperature and precipitation to exhibit some sort of cyclical or sinusoidal pattern over the course of the year.

It's natural to explore the extent to which these weather curves really are sinusoidal. A sinusoid with mean level c_1 satisfies the following differential equation: f''(t) = -ω²(f(t) - c_1). We could answer our question by estimating f and f'' and seeing how linearly related they are (Ramsay, 2006).
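As a rough sketch of this check (a synthetic, noise-free curve; the constants and grid are made up for illustration), we can estimate f'' by finite differences and verify the linear relationship:

```python
import numpy as np

# A sinusoid with mean level c1 satisfies f'' = -omega^2 * (f - c1),
# so f'' should be (negatively) linearly related to f - c1.
t = np.linspace(0.0, 1.0, 365)            # one "year", rescaled to [0, 1]
c1 = 10.0
f = c1 + 5.0 * np.sin(2 * np.pi * t)      # idealized annual temperature curve

f1 = np.gradient(f, t)                    # numerical first derivative
f2 = np.gradient(f1, t)                   # numerical second derivative

# Check linearity away from the grid edges, where np.gradient falls
# back to one-sided differences and is less accurate.
interior = slice(5, -5)
r = np.corrcoef(f2[interior], (f - c1)[interior])[0, 1]
slope = np.polyfit((f - c1)[interior], f2[interior], 1)[0]
# r is close to -1 and slope close to -(2*pi)**2, as the ODE predicts.
```

With noisy real curves, f and f'' would first have to be estimated by smoothing, and the relationship would hold only approximately.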
The mean curve amongst a population of curves is

    x̄(t) = (1/n) Σ_{i=1}^{n} x_i(t)

(Ramsay, 2006).

In practice, estimating a reasonable mean curve often involves more than naive application of the above formula. For example, if we have a lot of measurements on people with very high curves and few measurements on people with low curves, x̄(t) will be skewed too high.

The mean curve might not be representative of any of the individual curves.
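A minimal illustration of the skew problem, with made-up synthetic curves: when subjects with high curves dominate the sample, the cross-sectional mean curve sits well above the midpoint between the groups.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)            # common evaluation grid

# Hypothetical imbalanced sample: many "high" curves, few "low" curves.
high = [np.sin(2 * np.pi * t) + 2.0 for _ in range(8)]   # 8 high curves
low = [np.sin(2 * np.pi * t) - 2.0 for _ in range(2)]    # 2 low curves
curves = np.array(high + low)            # shape (10, 50)

# Cross-sectional mean curve: x_bar(t) = (1/n) * sum_i x_i(t)
x_bar = curves.mean(axis=0)

# The midpoint between the two groups is 0, but the mean curve's
# average level is (8*2 + 2*(-2)) / 10 = 1.2 -- skewed too high.
print(round(float(x_bar.mean()), 2))     # prints 1.2
```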
Covariance Function
Y(t) = α(t) + ∫ X(s) β(s,t) ds + ε(t)

Note the two time-scales: s is the time-scale for the predictor curve; t is the time-scale for the response curve. For our weather example they are the same; in general, they can be different.

The effect of the predictor curve on the response curve can also be restricted: if the response at time t should depend only on the predictor's past, we require β(s,t) = 0 for s > t.
E(Y_i | X) = g⁻¹(α + ∫ β(t) X_i(t) dt)

Y_i is a scalar: we're relating curves to scalars. Note the scalar response and the functional predictor variable. Again, when making sense of the functional model it helps to keep the ordinary generalized linear model in mind, with g a link function and Y_i drawn from an exponential-family distribution.
Practicalities
This whole time we've been talking about how to analyze smooth curves but, in reality, we never actually get to observe continuous functions through time.

In reality, we only get measurements on individuals at a finite set of time points. Before analyzing our data set of smooth functions x_1(t), x_2(t), ..., x_n(t), we need estimates of these curves.

Curse of dimensionality: functional data is infinite dimensional. Functional PCA is one dimension-reduction method, based on multivariate PCA.
We observe pairs (t_ij, y_ij) on the ith individual. We need to use these observations to estimate x_i(t). Our general method consists of three steps.

Step 1: assume that the function x_i(t) takes the form

    x_i(t) = Σ_{k=1}^{∞} c_k b_k(t)

for known basis functions b_k.

Step 2: assume that the truncated sum Σ_{k=1}^{K} c_k b_k(t) is acceptably close to x_i(t).

Step 3: use the data to find estimates of c_k for k = 1, 2, ..., K. Our estimate of x_i(t) is then Σ_{k=1}^{K} ĉ_k b_k(t).
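A sketch of the three steps for one densely observed individual, under the simplifying assumption of a small Fourier basis (B-splines would be the more common choice in practice); the true curve and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy observations (t_ij, y_ij) on one individual whose true curve
# we take, for illustration, to be sin(2*pi*t).
t_obs = np.sort(rng.uniform(0.0, 1.0, 40))
y_obs = np.sin(2 * np.pi * t_obs) + rng.normal(0.0, 0.1, t_obs.size)

def basis(t, K=5):
    """First K functions of a Fourier basis: 1, sin, cos, sin2, cos2."""
    cols = [np.ones_like(t)]
    for k in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols[:K])

# Estimate the coefficients c_k by ordinary least squares.
B = basis(t_obs)
c_hat, *_ = np.linalg.lstsq(B, y_obs, rcond=None)

# Reconstructed curve estimate: x_hat(t) = sum_k c_hat_k b_k(t)
t_grid = np.linspace(0.0, 1.0, 101)
x_hat = basis(t_grid) @ c_hat
```

Here `c_hat[1]`, the coefficient on sin(2πt), comes out close to the true value 1.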
Further Practicalities
In the above algorithm there are three places where we must make choices.

We must choose the basis functions. Different basis functions give rise to more or less satisfactory results; they also have different computational consequences.

We have to choose a number K at which to truncate the infinite series expansion of x_i(t). If K is too small we won't have a good approximation; if K is too big we'll be fitting noise.

We need a method of estimating the c_k. If we have enough data on each curve, we can estimate the c_k with simple methods like least squares.

Essentially, this presentation is about choosing a sensible basis.
There are many possible choices of basis functions. Fourier bases are natural for periodic data; B-splines are very popular; wavelets are newer and are popular for fitting non-smooth functions.

In functional data analysis, one is also often interested in derivatives of the curves, which further constrains the choice of basis.
We can see functional principal components analysis as a method for estimating the optimal set of orthonormal basis functions with which to represent our data.
It's optimal in the sense that it lets us represent functional data with a (hopefully) small number of basis functions without losing too much information.

In fact, regular multivariate principal components analysis can be described in exactly the same way, so let's review it first.
Multivariate PCA is a method for dimension reduction which takes in a large set of correlated covariates (high-dimensional data) and spits out a new set of covariates.

Each new covariate is just a linear combination of the old covariates. The coefficients in these linear combinations can be seen as weight vectors of length one (the sum of the squared weights is one).

The new set of covariates is better than the old set for two main reasons.

First of all, our new covariates are uncorrelated. Second, our new covariates can be ordered in such a way that the first new covariate explains the most variation in the data set, the second explains the second most variation, and so on.
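Both properties are easy to verify numerically. A sketch with made-up correlated data, computing the principal components from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated toy data: 200 observations of a 3-dimensional vector,
# with the first two coordinates strongly related.
z = rng.normal(size=200)
X = np.column_stack([z,
                     0.8 * z + 0.2 * rng.normal(size=200),
                     rng.normal(size=200)])

# PCA via the eigendecomposition of the sample covariance matrix.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
order = np.argsort(eigvals)[::-1]              # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# New covariates: linear combinations with unit-length weight vectors.
scores = (X - X.mean(axis=0)) @ eigvecs

# Weight vectors have length one ...
print(np.allclose(np.linalg.norm(eigvecs, axis=0), 1.0))             # True
# ... and the new covariates are uncorrelated, with ordered variances.
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))   # True
```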
We can measure the amount of information in a random vector X by the sum of the variances of each component: trace(Σ). Then we can measure the loss of information incurred by dimension reduction by comparing the amount of information in the new, reduced random vector to the amount of information in the original random vector.

Let Y be the reduced random vector. The fraction of information retained by using Y as a proxy for X is quantified by trace(Σ_Y)/trace(Σ_X). Thus, minimizing information loss is equivalent to maximizing trace(Σ_Y) subject to the previously mentioned constraints.

It's easy to show that the first k principal components maximize trace(Σ_Y): the principal components are obtained by taking the inner products of X with the first k eigenvectors of Σ, and trace(Σ_Y) = Σ_{i=1}^{k} λ_i.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)
In multivariate PCA, everything came down to the eigenvectors and eigenvalues of Σ.

Every n×n matrix gives rise to a linear function on Rⁿ. The covariance surface gives rise to a special type of linear function on L², called an integral operator. Call this integral operator Γ: it maps a function x in L² to another function in L²,

    Γ(x)(t) = ∫ x(s) c(s,t) ds.

Integral operators have eigenfunctions and corresponding eigenvalues, so once again we can measure variation by summing up eigenvalues.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)

The ξ_k are the functional principal component scores, and the φ_k are the functional principal components.

The functional principal components are an orthonormal set of eigenfunctions of the covariance operator. The ratio of the sum of the first k eigenvalues to the sum of all the eigenvalues can be interpreted as the percent of variation explained by the first k eigenfunctions.
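A sketch of functional PCA on densely observed synthetic curves: discretize the covariance surface on a grid, so that eigenfunctions of the integral operator correspond (after rescaling by the grid spacing) to eigenvectors of the covariance matrix. The two-component model below is an illustrative assumption, not any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate 300 curves from a two-term Karhunen-Loeve expansion with
# eigenvalues 4 and 1 (orthonormal eigenfunctions on L2[0, 1]).
t = np.linspace(0.0, 1.0, 100)
dt = t[1] - t[0]
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)
phi2 = np.sqrt(2) * np.cos(2 * np.pi * t)
xi = rng.normal(size=(300, 2)) * np.sqrt([4.0, 1.0])   # scores
curves = xi[:, :1] * phi1 + xi[:, 1:] * phi2

# Discretized covariance operator: the integral (x -> int x(s)c(s,t)ds)
# becomes the matrix C * dt acting on grid values.
C = np.cov(curves, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C * dt)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigfuns = eigvecs[:, order] / np.sqrt(dt)   # rescale to unit L2 norm

# Percent of variation explained by the first eigenfunction
# (true value here would be 4 / (4 + 1) = 0.8).
pve = eigvals[0] / eigvals[eigvals > 1e-10].sum()
```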
Even if we have only a few measurements on each individual curve in the data set, we might have a large number of individuals in the study.

If we had a way of fitting smooth curves to such sparse longitudinal data, we could extend the methods of functional data analysis to this sparse setting.

The question is how to fit smooth curves to sparse trajectories.

One approach for smoothing sparse longitudinal data is given in a paper by Hans-Georg Muller called "Functional Modelling and Classification of Longitudinal Data".
With dense measurements, we could estimate each individual's curve separately via smoothing methods. With maybe only 4-5 measurements on each individual, these methods won't work.

Even though we don't have a lot of information on each individual, we have information on a lot of different individuals. As a result, we should have enough information to estimate μ(t) and cov(s,t) by using standard univariate and bivariate smoothing methods.

It would be nice if we could use this population-level information to recover the individual trajectories.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)

Each sample curve is fully determined by the mean curve, the eigenfunctions, and the principal component scores:

    ξ_k = ∫ (X(t) - μ(t)) φ_k(t) dt.

With sparse data we cannot evaluate this integral, so we need a way of estimating the principal component scores for individuals without needing to compute the above integral.
U_ij = X_i(T_ij) + ε_ij = μ(T_ij) + Σ_{k=1}^{∞} ξ_ik φ_k(T_ij) + ε_ij

U_ij is the jth measurement on the ith person, taken at time T_ij. For fixed i, the T_ij are i.i.d.

Assume X is a Gaussian process and the ε_ij are normal. If we make the above assumptions, we can get estimates of the principal component scores via conditional expectation.
X̂_i(t) = μ̂(t) + Σ_{k=1}^{K} ξ̂_ik φ̂_k(t)

Once we have estimates of the scores, we can just plug them into the Karhunen-Loève expansion to get estimates of the individual trajectories.
The Data
The data consists of initial bilirubin measurements taken
on 258 primary biliary cirrhosis patients. Primary biliary cirrhosis is a type of liver disease.
From these bilirubin measurements we wish to predict
whether or not a patient will survive for longer than 10 years (categorical outcome).
High bilirubin levels are generally a bad thing. We'd expect patients with higher bilirubin measures to be less likely to survive for over 10 years.
The study population is restricted to only those patients for whom we can determine whether or not they were long-lived.
For each individual i, we leave that individual out and compute a new mean function, new eigenfunctions, and new predicted trajectories: μ̂^(-i), φ̂_k^(-i), and Ŷ_i^(-i). We then choose the number of eigenfunctions to retain, K, so that K minimizes the leave-one-out prediction error

    Σ_{i=1}^{n} Σ_{j=1}^{N_i} (U_ij - Ŷ_i^(-i)(T_ij))².

Mueller also uses analogues of AIC and BIC. Both methods suggested that the first two eigenfunctions should be retained.
We can plot each patient's first two principal component scores. Short-lived patients in general seemed to be in the upper right corner (i.e. high PCA scores). The overall misclassification rate, based on leave-one-out prediction error, was 26.54%.
References
Horvath, Lajos, and Piotr Kokoszka. Inference for Functional Data with Applications. New York: Springer, 2012. Springer Series in Statistics.

Ramsay, J. O., and B. W. Silverman. Functional Data Analysis. 2nd ed. New York: Springer, 2005. Springer Series in Statistics.

Yao, Fang, Hans-Georg Mueller, and Jane-Ling Wang. "Functional Data Analysis for Sparse Longitudinal Data." Journal of the American Statistical Association 100.470 (2005): 577-590.

Mueller, Hans-Georg. "Functional Modelling and Classification of Longitudinal Data." Scandinavian Journal of Statistics 32.2 (2005): 223-240.

Ferraty, Frederic, and Philippe Vieu. Nonparametric Functional Data Analysis: Theory and Practice. New York: Springer, 2006.
Suppose we want to investigate the relationship between weather in Los Angeles and weather in San Diego. We'd expect some correlation between weather in the two regions, but how much? Are there more complex relationships to look for? Could it be that there is a time lag between weather in the two locations? High precipitation in LA could mean high precipitation in San Diego next week.

We could go into the records and gather data on precipitation over multiple years in LA and SD. Each year's weather data would give rise to one pair of functional observations, so we'd end up with n pairs of functional observations. We might be able to investigate the relationship between these curves with the technique of simple functional regression.
Discretizing the integral,

    Y(t_0) ≈ α(t_0) + Σ_{i=1}^{m} X(s_i) β(s_i, t_0) Δs,

we see that functional linear regression is like multiple regression with a continuum of covariates: each point s gets its own covariate X(s).
Functional regression can also relate functional predictors to scalar outcomes.

For example, we might want to relate a runner's velocity at the beginning of a race to his finishing time. Maybe good runners save their energy, or maybe good runners run fast the whole way. Maybe both strategies work, and the effects of initial velocity will balance out.
We can measure the total variation in the vector X by simply adding up the variances of the X_i. This is just trace(Σ).

Suppose p is an impractically large number. Suppose we constrain Y so that each element of Y is a linear combination of the X_i and the coefficient vector of each linear combination has length 1.
The dimension of Y is supposed to be much less than p. By using Y as a proxy for X, it is clear that we are losing some amount of information.

Naturally, we want to calculate Y in such a way that we lose as little information as possible. First we need to define what we mean by information loss. Recall that trace(Σ) represents the amount of variation in a random vector. We can measure information loss by comparing the amount of variation in Y with the amount of variation in X. In particular, we take the ratio of the two: trace(Σ_Y)/trace(Σ_X).
The Y that minimizes the loss of information is given by Y = (⟨X, C_1⟩, ..., ⟨X, C_k⟩), where C_1, ..., C_k are the first k principal components.

Let C be a matrix whose columns are the n principal components. When we use principal components to reduce the dimensionality of a data set, we are using the data to choose an orthonormal basis (the C_i) and coefficients (the ⟨X, C_i⟩), and then re-expressing our data in terms of that basis and those coefficients. We are using the data to choose the basis that minimizes the loss of information.
If we expand β(t) = Σ_{j=1}^{∞} β_j φ_j(t) in the eigenfunction basis, then the linear predictor becomes

    α + Σ_{j=1}^{∞} β_j ξ_ij,

where β_j = ∫ β(t) φ_j(t) dt and ξ_ij = ∫ (X_i(t) - μ(t)) φ_j(t) dt. We can approximate by simply truncating the infinite sum at some number K, depending on sample size and the particulars of the data. We are left with a regular GLM whose predictors are the scores ξ_ik. The parameters we're trying to estimate are the projections of β(t) onto the basis functions.
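Once each curve is reduced to a few scores ξ_ik, the fit really is an ordinary GLM. A sketch with simulated scores and a logistic link, fit by Newton-Raphson (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated principal component scores for n subjects (K = 2),
# with score variances 4 and 1, and a binary outcome.
n, K = 200, 2
xi = rng.normal(size=(n, K)) * np.sqrt([4.0, 1.0])
beta_true = np.array([0.8, -0.5])
p = 1.0 / (1.0 + np.exp(-(0.3 + xi @ beta_true)))
y = rng.binomial(1, p)

# Ordinary logistic GLM on the scores, via Newton-Raphson.
Z = np.column_stack([np.ones(n), xi])       # intercept + scores
coef = np.zeros(K + 1)
for _ in range(25):
    eta = Z @ coef
    mu_hat = 1.0 / (1.0 + np.exp(-eta))
    W = mu_hat * (1.0 - mu_hat)             # GLM working weights
    grad = Z.T @ (y - mu_hat)
    H = (Z * W[:, None]).T @ Z              # Fisher information
    coef = coef + np.linalg.solve(H, grad)
```

The fitted `coef` recovers (α, β_1, β_2) up to sampling error; with real data the ξ_ik would themselves be estimated from the curves.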
Ê[X_i(t) | U_i] = μ(t) + Σ_{k=1}^{K} ξ̂_ik φ_k(t), where

    ξ̂_ik = λ_k φ_ik^T Σ_{U_i}⁻¹ (U_i - μ_i),

with φ_ik = (φ_k(T_i1), ..., φ_k(T_iN_i))^T, μ_i = (μ(T_i1), ..., μ(T_iN_i))^T, and Σ_{U_i} the covariance matrix of U_i, whose entries are c(T_ij, T_il) + σ² δ_jl.
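A sketch of this conditional-expectation (best linear predictor) score estimate for one sparsely observed subject. In practice the mean, eigenfunctions, eigenvalues, and noise variance would be estimated from the pooled data; here they are assumed known for illustration, and all values are made up.

```python
import numpy as np

def mu(t):
    # Mean function, assumed known (zero) for this sketch.
    return np.zeros_like(t)

phi = [lambda t: np.sqrt(2) * np.sin(2 * np.pi * t),
       lambda t: np.sqrt(2) * np.cos(2 * np.pi * t)]
lam = np.array([4.0, 1.0])       # eigenvalues (score variances)
sigma2 = 0.25                    # measurement-error variance

rng = np.random.default_rng(4)
T_i = np.array([0.1, 0.4, 0.7, 0.9])     # 4 sparse observation times
xi_true = np.array([2.0, -1.0])          # true scores for this subject
U_i = (mu(T_i) + sum(x * f(T_i) for x, f in zip(xi_true, phi))
       + rng.normal(0.0, np.sqrt(sigma2), T_i.size))

# Covariance of the observed vector: c(T_ij, T_il) + sigma2 * I
Phi = np.column_stack([f(T_i) for f in phi])      # N_i x K matrix
Sigma_Ui = Phi @ np.diag(lam) @ Phi.T + sigma2 * np.eye(T_i.size)

# Score estimate: xi_hat_k = lam_k * phi_ik' Sigma_Ui^{-1} (U_i - mu_i)
xi_hat = lam * (Phi.T @ np.linalg.solve(Sigma_Ui, U_i - mu(T_i)))
```

Note the shrinkage built into this estimate: components with small λ_k are pulled toward zero, which is exactly what makes it stable with only a handful of observations per curve.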
Average Curves
Covariance Surface