Functional PCA
Outline
In functional data analysis, our data are curves: x_1(t), x_2(t), ..., x_n(t). For the sake of simplicity, all of the functions in this talk will be indexed by time, t.

Well, to be precise, our raw data still come in the form of discrete data sets; the methods of functional data analysis assume we have functional observations (more on this later).

Each functional observation is usually assumed to be a smooth curve underlying longitudinal data.
Weather data is a classic example in functional data analysis (FDA). One might be interested in seeing how the temperature changes over the course of a year in various geographical locations. Similarly, you might also record precipitation levels over time (Ramsay, 2006).

Growth data is another big topic: how does a child's height change over time?

Psychometrics supplies a third example, via item response analysis. These examples highlight the idea that t need not stand for time. One could examine how the probability of answering a test question correctly varies with IQ (or some other latent variable); each test question gives rise to a different functional observation (Ramsay, 2006).
Classical longitudinal methods already take into account how measurements on an individual change over time. What do we gain by assuming these measurements to be part of smooth trajectories over time and analyzing the data as such?

Conceptually, if you really believe that measurements are coming from smooth functions, you might want your statistical analysis to reflect this fact.

Functional data analysis offers an intuitively natural approach to analyzing highly correlated data. Smoothness can be thought of as saying that, as you look at two separate observations taken closer and closer in time, one observation comes closer and closer to perfectly determining the other (Ferraty and Vieu, 2006).
Some questions can only be asked once you assume you can talk about things like continuity and differentiation.

For example, in our weather data, we would expect temperature and precipitation to exhibit some sort of cyclical or sinusoidal pattern over the course of the year.

It's natural to explore the extent to which these weather curves really are sinusoidal. A sinusoid with mean level c_1 satisfies the following differential equation: f''(t) = -ω²(f(t) - c_1). We could answer our question by estimating f and f'' and seeing how linearly related they are (Ramsay, 2006).
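As a rough sketch of this check (a synthetic, noise-free curve; the constants and grid are made up for illustration), we can estimate f'' by finite differences and verify the linear relationship:

```python
import numpy as np

# A sinusoid with mean level c1 satisfies f'' = -omega^2 * (f - c1),
# so f'' should be (negatively) linearly related to f - c1.
t = np.linspace(0.0, 1.0, 365)            # one "year", rescaled to [0, 1]
c1 = 10.0
f = c1 + 5.0 * np.sin(2 * np.pi * t)      # idealized annual temperature curve

f1 = np.gradient(f, t)                    # numerical first derivative
f2 = np.gradient(f1, t)                   # numerical second derivative

# Check linearity away from the grid edges, where np.gradient falls
# back to one-sided differences and is less accurate.
interior = slice(5, -5)
r = np.corrcoef(f2[interior], (f - c1)[interior])[0, 1]
slope = np.polyfit((f - c1)[interior], f2[interior], 1)[0]
# r is close to -1 and slope close to -(2*pi)**2, as the ODE predicts.
```

With noisy real curves, f and f'' would first have to be estimated by smoothing, and the relationship would hold only approximately.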
The mean curve amongst a population of curves is

    x̄(t) = (1/n) Σ_{i=1}^{n} x_i(t)

(Ramsay, 2006).

In practice, estimating a reasonable mean curve often involves more than naive application of the above formula. For example, if we have a lot of measurements on people with very high curves and few measurements on people with low curves, x̄(t) will be skewed too high.

The mean curve might not be representative of any of the individual curves.
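A minimal illustration of the skew problem, with made-up synthetic curves: when subjects with high curves dominate the sample, the cross-sectional mean curve sits well above the midpoint between the groups.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)            # common evaluation grid

# Hypothetical imbalanced sample: many "high" curves, few "low" curves.
high = [np.sin(2 * np.pi * t) + 2.0 for _ in range(8)]   # 8 high curves
low = [np.sin(2 * np.pi * t) - 2.0 for _ in range(2)]    # 2 low curves
curves = np.array(high + low)            # shape (10, 50)

# Cross-sectional mean curve: x_bar(t) = (1/n) * sum_i x_i(t)
x_bar = curves.mean(axis=0)

# The midpoint between the two groups is 0, but the mean curve's
# average level is (8*2 + 2*(-2)) / 10 = 1.2 -- skewed too high.
print(round(float(x_bar.mean()), 2))     # prints 1.2
```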
Covariance Function
Y(t) = α(t) + ∫ X(s) β(s,t) ds + ε(t)

Note the two time-scales: s is the time-scale for the predictor curve; t is the time-scale for the response curve. For our weather example they are the same; in general, they can be different.

The effect of the predictor curve on the response curve can also be restricted: if the response at time t should depend only on the predictor's past, we require β(s,t) = 0 for s > t.
E(Y_i | X) = g⁻¹(α + ∫ β(t) X_i(t) dt)

Y_i is a scalar: we're relating curves to scalars. Note the scalar response and the functional predictor variable. Again, when making sense of the functional model it helps to keep the ordinary generalized linear model in mind, with g a link function and Y_i drawn from an exponential-family distribution.
Practicalities
This whole time we've been talking about how to analyze smooth curves but, in reality, we never actually get to observe continuous functions through time.

In reality, we only get measurements on individuals at a finite set of time points. Before analyzing our data set of smooth functions x_1(t), x_2(t), ..., x_n(t), we need estimates of these curves.

Curse of dimensionality: functional data is infinite dimensional. Functional PCA is one dimension-reduction method, based on multivariate PCA.
We observe pairs (t_ij, y_ij) on the ith individual. We need to use these observations to estimate x_i(t). Our general method consists of three steps.

Step 1: assume that the function x_i(t) takes the form

    x_i(t) = Σ_{k=1}^{∞} c_k b_k(t)

for known basis functions b_k.

Step 2: assume that the truncated sum Σ_{k=1}^{K} c_k b_k(t) is acceptably close to x_i(t).

Step 3: use the data to find estimates of c_k for k = 1, 2, ..., K. Our estimate of x_i(t) is then Σ_{k=1}^{K} ĉ_k b_k(t).
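A sketch of the three steps for one densely observed individual, under the simplifying assumption of a small Fourier basis (B-splines would be the more common choice in practice); the true curve and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy observations (t_ij, y_ij) on one individual whose true curve
# we take, for illustration, to be sin(2*pi*t).
t_obs = np.sort(rng.uniform(0.0, 1.0, 40))
y_obs = np.sin(2 * np.pi * t_obs) + rng.normal(0.0, 0.1, t_obs.size)

def basis(t, K=5):
    """First K functions of a Fourier basis: 1, sin, cos, sin2, cos2."""
    cols = [np.ones_like(t)]
    for k in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols[:K])

# Estimate the coefficients c_k by ordinary least squares.
B = basis(t_obs)
c_hat, *_ = np.linalg.lstsq(B, y_obs, rcond=None)

# Reconstructed curve estimate: x_hat(t) = sum_k c_hat_k b_k(t)
t_grid = np.linspace(0.0, 1.0, 101)
x_hat = basis(t_grid) @ c_hat
```

Here `c_hat[1]`, the coefficient on sin(2πt), comes out close to the true value 1.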
Further Practicalities
In the above algorithm there are three places where we must make choices.

We must choose the basis functions. Different basis functions give rise to more or less satisfactory results; they also have different computational consequences.

We have to choose a number K at which to truncate the infinite series expansion of x_i(t). If K is too small we won't have a good approximation; if K is too big we'll be fitting noise.

We need a method of estimating the c_k. If we have enough data on each curve, we can estimate the c_k with simple methods like least squares.

Essentially, this presentation is about choosing a sensible basis.
There are many possible choices of basis functions. Fourier bases are natural for periodic data; B-splines are very popular; wavelets are newer and are popular for fitting non-smooth functions.

In functional data analysis, one is also often interested in derivatives of the curves, which further constrains the choice of basis.
We can see functional principal components analysis as a method for estimating the optimal set of orthonormal basis functions with which to represent our data.
It's optimal in the sense that it lets us represent functional data with a (hopefully) small number of basis functions without losing too much information.

In fact, regular multivariate principal components analysis can be described in exactly the same way, so let's review it first.
Multivariate PCA is a method for dimension reduction which takes in a large set of correlated covariates (high-dimensional data) and spits out a new set of covariates.

Each new covariate is just a linear combination of the old covariates. The coefficients in these linear combinations can be seen as weight vectors of length one (the sum of the squared weights is one).

The new set of covariates is better than the old set for two main reasons.

First of all, our new covariates are uncorrelated. Second, our new covariates can be ordered in such a way that the first new covariate explains the most variation in the data set, the second explains the second most variation, and so on.
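Both properties are easy to verify numerically. A sketch with made-up correlated data, computing the principal components from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated toy data: 200 observations of a 3-dimensional vector,
# with the first two coordinates strongly related.
z = rng.normal(size=200)
X = np.column_stack([z,
                     0.8 * z + 0.2 * rng.normal(size=200),
                     rng.normal(size=200)])

# PCA via the eigendecomposition of the sample covariance matrix.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
order = np.argsort(eigvals)[::-1]              # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# New covariates: linear combinations with unit-length weight vectors.
scores = (X - X.mean(axis=0)) @ eigvecs

# Weight vectors have length one ...
print(np.allclose(np.linalg.norm(eigvecs, axis=0), 1.0))             # True
# ... and the new covariates are uncorrelated, with ordered variances.
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))   # True
```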
We can measure the amount of information in a random vector X by the sum of the variances of each component: trace(Σ). Then we can measure the loss of information incurred by dimension reduction by comparing the amount of information in the new, reduced random vector to the amount of information in the original random vector.

Let Y be the reduced random vector. The fraction of information retained by using Y as a proxy for X is quantified by trace(Σ_Y)/trace(Σ_X). Thus, minimizing information loss is equivalent to maximizing trace(Σ_Y) subject to the previously mentioned constraints.

It's easy to show that the first k principal components maximize trace(Σ_Y): the principal components are obtained by taking the inner products of X with the first k eigenvectors of Σ, and trace(Σ_Y) = Σ_{i=1}^{k} λ_i.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)
In multivariate PCA, everything came down to the eigenvectors and eigenvalues of Σ.

Every n×n matrix gives rise to a linear function on Rⁿ. The covariance surface gives rise to a special type of linear function on L², called an integral operator. Call this integral operator Γ: it maps a function x in L² to another function in L²,

    Γ(x)(t) = ∫ x(s) c(s,t) ds.

Integral operators have eigenfunctions and corresponding eigenvalues, so once again we can measure variation by summing up eigenvalues.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)

The ξ_k are the functional principal component scores, and the φ_k are the functional principal components.

The functional principal components are an orthonormal set of eigenfunctions of the covariance operator. The ratio of the sum of the first k eigenvalues to the sum of all the eigenvalues can be interpreted as the percent of variation explained by the first k eigenfunctions.
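A sketch of functional PCA on densely observed synthetic curves: discretize the covariance surface on a grid, so that eigenfunctions of the integral operator correspond (after rescaling by the grid spacing) to eigenvectors of the covariance matrix. The two-component model below is an illustrative assumption, not any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate 300 curves from a two-term Karhunen-Loeve expansion with
# eigenvalues 4 and 1 (orthonormal eigenfunctions on L2[0, 1]).
t = np.linspace(0.0, 1.0, 100)
dt = t[1] - t[0]
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)
phi2 = np.sqrt(2) * np.cos(2 * np.pi * t)
xi = rng.normal(size=(300, 2)) * np.sqrt([4.0, 1.0])   # scores
curves = xi[:, :1] * phi1 + xi[:, 1:] * phi2

# Discretized covariance operator: the integral (x -> int x(s)c(s,t)ds)
# becomes the matrix C * dt acting on grid values.
C = np.cov(curves, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C * dt)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigfuns = eigvecs[:, order] / np.sqrt(dt)   # rescale to unit L2 norm

# Percent of variation explained by the first eigenfunction
# (true value here would be 4 / (4 + 1) = 0.8).
pve = eigvals[0] / eigvals[eigvals > 1e-10].sum()
```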
Even if we have only a few measurements on each individual curve in the data set, we might have a large number of individuals in the study.

If we had a way of fitting smooth curves to such sparse longitudinal data, we could extend the methods of functional data analysis to this sparse setting.

The question is how to fit smooth curves to sparse trajectories.

One approach for smoothing sparse longitudinal data is given in a paper by Hans-Georg Muller called "Functional Modelling and Classification of Longitudinal Data".
With dense measurements, we could estimate each individual's curve separately via smoothing methods. With maybe only 4-5 measurements on each individual, these methods won't work.

Even though we don't have a lot of information on each individual, we have information on a lot of different individuals. As a result, we should have enough information to estimate μ(t) and cov(s,t) by using standard univariate and bivariate smoothing methods.

It would be nice if we could use this population-level information to recover the individual trajectories.
X(t) = μ(t) + Σ_{k=1}^{∞} ξ_k φ_k(t)

Each sample curve is fully determined by the mean curve, the eigenfunctions, and the principal component scores:

    ξ_k = ∫ (X(t) - μ(t)) φ_k(t) dt.

With sparse data we cannot evaluate this integral, so we need a way of estimating the principal component scores for individuals without needing to compute the above integral.
U_ij = X_i(T_ij) + ε_ij = μ(T_ij) + Σ_{k=1}^{∞} ξ_ik φ_k(T_ij) + ε_ij

U_ij is the jth measurement on the ith person, taken at time T_ij. For fixed i, the T_ij are i.i.d.

Assume X is a Gaussian process and the ε_ij are normal. If we make the above assumptions, we can get estimates of the principal component scores via conditional expectation.
X̂_i(t) = μ̂(t) + Σ_{k=1}^{K} ξ̂_ik φ̂_k(t)

Once we have estimates of the scores, we can just plug them into the Karhunen-Loève expansion to get estimates of the individual trajectories.
The Data
The data consists of initial bilirubin measurements taken
on 258 primary biliary cirrhosis patients. Primary biliary cirrhosis is a type of liver disease.
From these bilirubin measurements we wish to predict
whether or not a patient will survive for longer than 10 years (categorical outcome).
High bilirubin levels are generally a bad thing. We'd expect patients with higher bilirubin measures to be less likely to survive for over 10 years.
The study population is restricted to only those patients for whom we can determine whether or not they were long-lived.
For each individual i, we leave that individual out and compute a new mean function, new eigenfunctions, and new predicted trajectories: μ̂^(-i), φ̂_k^(-i), and Ŷ_i^(-i). We then choose the number of eigenfunctions to retain, K, so that K minimizes the leave-one-out prediction error

    Σ_{i=1}^{n} Σ_{j=1}^{N_i} (U_ij - Ŷ_i^(-i)(T_ij))².

Mueller also uses analogues of AIC and BIC. Both methods suggested that the first two eigenfunctions should be retained.
We can plot each patient's first two principal component scores. Short-lived patients in general seemed to be in the upper right corner (i.e. high PCA scores). The overall misclassification rate, based on leave-one-out prediction error, was 26.54%.
References
Horvath, Lajos, and Piotr Kokoszka. Inference for Functional Data with Applications. New York: Springer, 2012. Springer Series in Statistics.

Ramsay, J. O., and B. W. Silverman. Functional Data Analysis. 2nd ed. New York: Springer, 2005. Springer Series in Statistics.

Yao, Fang, Hans-Georg Mueller, and Jane-Ling Wang. "Functional Data Analysis for Sparse Longitudinal Data." Journal of the American Statistical Association 100.470 (2005): 577-590.

Mueller, Hans-Georg. "Functional Modelling and Classification of Longitudinal Data." Scandinavian Journal of Statistics 32.2 (2005): 223-240.

Ferraty, Frederic, and Philippe Vieu. Nonparametric Functional Data Analysis: Theory and Practice. New York: Springer, 2006.
Suppose we want to investigate the relationship between weather in Los Angeles and weather in San Diego. We'd expect some correlation between weather in the two regions, but how much? Are there more complex relationships to look for? Could it be that there is a time lag between weather in the two locations? High precipitation in LA could mean high precipitation in San Diego next week.

We could go into the records and gather data on precipitation over multiple years in LA and SD. Each year's weather data would give rise to one pair of functional observations, so we'd end up with n pairs of functional observations. We might be able to investigate the relationship between these curves with the technique of simple functional regression.
Discretizing the integral,

    Y(t_0) ≈ α(t_0) + Σ_{i=1}^{m} X(s_i) β(s_i, t_0) Δs,

we see that functional linear regression is like multiple regression with a continuum of covariates: each point s gets its own covariate X(s).
Functional regression can also relate functional predictors to scalar outcomes.

For example, we might want to relate a runner's velocity at the beginning of a race to his finishing time. Maybe good runners save their energy, or maybe good runners run fast the whole way. Maybe both strategies work, and the effects of initial velocity will balance out.
We can measure the total variation in the vector X by simply adding up the variances of the X_i. This is just trace(Σ).

Suppose p is an impractically large number. Suppose we constrain Y so that each element of Y is a linear combination of the X_i and the coefficient vector of each linear combination has length 1.
The dimension of Y is supposed to be much less than p. By using Y as a proxy for X, it is clear that we are losing some amount of information.

Naturally, we want to calculate Y in such a way that we lose as little information as possible. First we need to define what we mean by information loss. Recall that trace(Σ) represents the amount of variation in a random vector. We can measure information loss by comparing the amount of variation in Y with the amount of variation in X. In particular, we take the ratio of the two: trace(Σ_Y)/trace(Σ_X).
The Y that minimizes the loss of information is given by Y = (⟨X, C_1⟩, ..., ⟨X, C_k⟩), where C_1, ..., C_k are the first k principal components.

Let C be a matrix whose columns are the n principal components. When we use principal components to reduce the dimensionality of a data set, we are using the data to choose an orthonormal basis (the C_i) and coefficients (the ⟨X, C_i⟩), and then re-expressing our data in terms of that basis and those coefficients. We are using the data to choose the basis that minimizes the loss of information.
If we expand β(t) = Σ_{j=1}^{∞} β_j φ_j(t) in the eigenfunction basis, then the linear predictor becomes

    α + Σ_{j=1}^{∞} β_j ξ_ij,

where β_j = ∫ β(t) φ_j(t) dt and ξ_ij = ∫ (X_i(t) - μ(t)) φ_j(t) dt. We can approximate by simply truncating the infinite sum at some number K, depending on sample size and the particulars of the data. We are left with a regular GLM whose predictors are the scores ξ_ik. The parameters we're trying to estimate are the projections of β(t) onto the basis functions.
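Once each curve is reduced to a few scores ξ_ik, the fit really is an ordinary GLM. A sketch with simulated scores and a logistic link, fit by Newton-Raphson (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated principal component scores for n subjects (K = 2),
# with score variances 4 and 1, and a binary outcome.
n, K = 200, 2
xi = rng.normal(size=(n, K)) * np.sqrt([4.0, 1.0])
beta_true = np.array([0.8, -0.5])
p = 1.0 / (1.0 + np.exp(-(0.3 + xi @ beta_true)))
y = rng.binomial(1, p)

# Ordinary logistic GLM on the scores, via Newton-Raphson.
Z = np.column_stack([np.ones(n), xi])       # intercept + scores
coef = np.zeros(K + 1)
for _ in range(25):
    eta = Z @ coef
    mu_hat = 1.0 / (1.0 + np.exp(-eta))
    W = mu_hat * (1.0 - mu_hat)             # GLM working weights
    grad = Z.T @ (y - mu_hat)
    H = (Z * W[:, None]).T @ Z              # Fisher information
    coef = coef + np.linalg.solve(H, grad)
```

The fitted `coef` recovers (α, β_1, β_2) up to sampling error; with real data the ξ_ik would themselves be estimated from the curves.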
Ê[X_i(t) | U_i] = μ(t) + Σ_{k=1}^{K} ξ̂_ik φ_k(t), where

    ξ̂_ik = λ_k φ_ik^T Σ_{U_i}⁻¹ (U_i - μ_i),

with φ_ik = (φ_k(T_i1), ..., φ_k(T_iN_i))^T, μ_i = (μ(T_i1), ..., μ(T_iN_i))^T, and Σ_{U_i} the covariance matrix of U_i, whose entries are c(T_ij, T_il) + σ² δ_jl.
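A sketch of this conditional-expectation (best linear predictor) score estimate for one sparsely observed subject. In practice the mean, eigenfunctions, eigenvalues, and noise variance would be estimated from the pooled data; here they are assumed known for illustration, and all values are made up.

```python
import numpy as np

def mu(t):
    # Mean function, assumed known (zero) for this sketch.
    return np.zeros_like(t)

phi = [lambda t: np.sqrt(2) * np.sin(2 * np.pi * t),
       lambda t: np.sqrt(2) * np.cos(2 * np.pi * t)]
lam = np.array([4.0, 1.0])       # eigenvalues (score variances)
sigma2 = 0.25                    # measurement-error variance

rng = np.random.default_rng(4)
T_i = np.array([0.1, 0.4, 0.7, 0.9])     # 4 sparse observation times
xi_true = np.array([2.0, -1.0])          # true scores for this subject
U_i = (mu(T_i) + sum(x * f(T_i) for x, f in zip(xi_true, phi))
       + rng.normal(0.0, np.sqrt(sigma2), T_i.size))

# Covariance of the observed vector: c(T_ij, T_il) + sigma2 * I
Phi = np.column_stack([f(T_i) for f in phi])      # N_i x K matrix
Sigma_Ui = Phi @ np.diag(lam) @ Phi.T + sigma2 * np.eye(T_i.size)

# Score estimate: xi_hat_k = lam_k * phi_ik' Sigma_Ui^{-1} (U_i - mu_i)
xi_hat = lam * (Phi.T @ np.linalg.solve(Sigma_Ui, U_i - mu(T_i)))
```

Note the shrinkage built into this estimate: components with small λ_k are pulled toward zero, which is exactly what makes it stable with only a handful of observations per curve.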
Average Curves
Covariance Surface