CPA 200 COmponents

POST GRADUATION IN AGRONOMY - CPGA-CS
Multivariate Analysis Applied to Agrarian Sciences
Principal component analysis
Carlos Alberto Alves Varella
Seropédica - RJ
12/14/2008
Contents
Introduction
Data matrix X
Covariance matrix S
Standardization with zero mean and variance 1
Standardization with variance 1 and any mean
Determination of the principal components
Contribution of each principal component
Interpretation of each component
Principal component scores
Table 1. Organization of a data set with n treatments, p variables, and k principal components
Application example
Table 2. Original and standardized values of two variables for five treatments
Obtaining the principal components
Table 3. Information that can be obtained from principal component analysis
Table 4. Scores of the two principal components for the five treatments, obtained from the correlation matrix R
Scatter plot
Figure 2. Dispersion of the treatments as a function of the principal component scores
SAS program for obtaining the principal components
Bibliography
PRINCIPAL COMPONENT ANALYSIS

Carlos Alberto Alves Varella 1
Introduction
Principal component analysis is a multivariate statistical technique that transforms a set of
original variables into another set of variables of the same dimension, called principal
components. The principal components have important properties: each principal component is a
linear combination of all the original variables, the components are uncorrelated with each
other, and they are estimated so as to retain, in order of estimation, the maximum information
in terms of the total variation contained in the data. Principal component analysis is
associated with the idea of reducing the mass of data with the least possible loss of
information. The aim is to redistribute the variation observed on the original axes so as to
obtain a set of uncorrelated orthogonal axes. This technique can be used for index generation
and for grouping individuals. The analysis groups the individuals of a population according to
the variation of the characteristics that define them, that is, according to their behavior
within the population. According to REGAZZI (2000), although the techniques of multivariate
analysis were developed to solve specific problems, especially in Biology and Psychology, they
can also be used to solve other types of problems in several areas of knowledge.
1 Professor, Federal Rural University of Rio de Janeiro, IT-Department of Engineering, BR 465 km 7,
CEP 23890-000, Seropédica, RJ. E-mail: varella@ufrrj.br.
Data matrix X
Consider the situation in which we observe p characteristics of n individuals of a population
π. The observed characteristics are represented by the variables $X_1, X_2, X_3, \ldots, X_p$.
The data matrix is of order n × p and is usually called the matrix X.
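$$
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{np}
\end{bmatrix}
$$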
The interdependence structure between the variables of the data matrix is represented by the
covariance matrix S or by the correlation matrix R. Understanding this structure through the
variables $X_1, X_2, X_3, \ldots, X_p$ can be complicated in practice. The objective of principal
component analysis is therefore to transform this complicated structure, represented by the
variables $X_1, X_2, X_3, \ldots, X_p$, into another structure represented by the variables
$Y_1, Y_2, Y_3, \ldots, Y_p$, which are uncorrelated and have ordered variances, so that the
individuals can be compared using only the variables $Y_i$ with the greatest variance. The
solution is obtained from the covariance matrix S or the correlation matrix R.
Covariance matrix S
From the data matrix X of order n × p we can estimate the covariance matrix Σ of the
population π; this estimate is denoted by S. The matrix S is symmetric and of order p × p.
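$$
S = \begin{bmatrix}
\widehat{\mathrm{Var}}(x_1) & \widehat{\mathrm{Cov}}(x_1 x_2) & \widehat{\mathrm{Cov}}(x_1 x_3) & \cdots & \widehat{\mathrm{Cov}}(x_1 x_p) \\
\widehat{\mathrm{Cov}}(x_2 x_1) & \widehat{\mathrm{Var}}(x_2) & \widehat{\mathrm{Cov}}(x_2 x_3) & \cdots & \widehat{\mathrm{Cov}}(x_2 x_p) \\
\widehat{\mathrm{Cov}}(x_3 x_1) & \widehat{\mathrm{Cov}}(x_3 x_2) & \widehat{\mathrm{Var}}(x_3) & \cdots & \widehat{\mathrm{Cov}}(x_3 x_p) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\widehat{\mathrm{Cov}}(x_p x_1) & \widehat{\mathrm{Cov}}(x_p x_2) & \widehat{\mathrm{Cov}}(x_p x_3) & \cdots & \widehat{\mathrm{Var}}(x_p)
\end{bmatrix}
$$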
In this case, according to REGAZZI (2000), it is convenient to standardize the variables
$X_j$ ($j = 1, 2, 3, \ldots, p$). Standardization can be done with mean zero and variance 1, or
with variance 1 and any mean.
Standardization with zero mean and variance 1
$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s(x_j)}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, p
$$
Standardization with variance 1 and any mean
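$$
z_{ij} = \frac{x_{ij}}{s(x_j)}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, p
$$
in which $\bar{x}_j$ and $s(x_j)$ are, respectively, the mean and the standard deviation of
characteristic j:
$$
\bar{x}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n}
\quad \text{and} \quad
s(x_j) = \sqrt{\widehat{\mathrm{Var}}(x_j)}, \quad j = 1, 2, \ldots, p
$$
$$
\widehat{\mathrm{Var}}(x_j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1}
\quad \text{or} \quad
\widehat{\mathrm{Var}}(x_j) = \frac{\sum_{i=1}^{n} x_{ij}^2 - \dfrac{\left( \sum_{i=1}^{n} x_{ij} \right)^2}{n}}{n - 1}
$$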
After the standardization we obtain a new data matrix Z:
$$
Z = \begin{bmatrix}
z_{11} & z_{12} & z_{13} & \cdots & z_{1p} \\
z_{21} & z_{22} & z_{23} & \cdots & z_{2p} \\
z_{31} & z_{32} & z_{33} & \cdots & z_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
z_{n1} & z_{n2} & z_{n3} & \cdots & z_{np}
\end{bmatrix}
$$
The covariance matrix of the standardized variables $z_j$ (the columns of Z) is equal to the
correlation matrix R of the data matrix X. To determine the principal components we usually
start from the correlation matrix R. It is important to note that the result obtained from the
matrix S can differ from the result obtained from the matrix R. The recommendation is that
standardization should only be done when the units of measurement of the observed
characteristics are not the same.
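The relation between the standardized variables and the correlation matrix can be checked numerically. The sketch below is only an illustration in plain Python (these notes themselves refer to SAS); the helper names `sample_cov`, `sample_corr` and `standardize` are assumptions introduced here, and the five observations are the X1 and X2 values of the application example in these notes.

```python
from math import sqrt

def sample_cov(u, v):
    """Sample covariance with divisor n - 1 (the estimator used in these notes)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def sample_corr(u, v):
    """Sample correlation coefficient r(u, v)."""
    return sample_cov(u, v) / sqrt(sample_cov(u, u) * sample_cov(v, v))

def standardize(v, center=True):
    """Scale to variance 1; subtract the mean only when center is True."""
    s = sqrt(sample_cov(v, v))
    m = sum(v) / len(v) if center else 0.0
    return [(a - m) / s for a in v]

# X1 and X2 from the application example (five treatments)
x1 = [102, 104, 101, 93, 100]
x2 = [96, 87, 62, 68, 77]

r_x = sample_corr(x1, x2)                               # off-diagonal element of R
cov_centered = sample_cov(standardize(x1), standardize(x2))
cov_uncentered = sample_cov(standardize(x1, center=False),
                            standardize(x2, center=False))

print(round(r_x, 4), round(cov_centered, 4), round(cov_uncentered, 4))
```

Both forms of standardization give the same covariance between the standardized variables, because subtracting the mean does not change a covariance.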
Determination of the principal components
The principal components are determined by solving the characteristic equation of the matrix
S or R, that is:
$$
\det[R - \lambda I] = 0 \quad \text{or} \quad |R - \lambda I| = 0
$$
$$
R = \begin{bmatrix}
1 & r(x_1 x_2) & r(x_1 x_3) & \cdots & r(x_1 x_p) \\
r(x_2 x_1) & 1 & r(x_2 x_3) & \cdots & r(x_2 x_p) \\
r(x_3 x_1) & r(x_3 x_2) & 1 & \cdots & r(x_3 x_p) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r(x_p x_1) & r(x_p x_2) & r(x_p x_3) & \cdots & 1
\end{bmatrix}
$$
If the matrix R has full rank p, that is, no column is a linear combination of the others, the
equation $|R - \lambda I| = 0$ has p roots, called eigenvalues or characteristic roots of the
matrix R. In the data matrix X it is important to note that the value of n (individuals,
treatments, genotypes, etc.) must be at least p + 1; that is, if we want to set up an
experiment to analyze the behavior of p characteristics of individuals from a population, it is
recommended that the statistical design have at least p + 1 treatments.
Let $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_p$ be the roots of the characteristic
equation of the matrix R or S, ordered so that:
$$
\lambda_1 > \lambda_2 > \lambda_3 > \cdots > \lambda_p
$$
For each eigenvalue $\lambda_i$ there exists an eigenvector $\tilde{a}_i$:
$$
\tilde{a}_i = \begin{bmatrix} a_{i1} \\ a_{i2} \\ \vdots \\ a_{ip} \end{bmatrix}
$$
The eigenvectors $\tilde{a}_i$ are normalized, that is, the sum of the squares of their
coefficients is equal to 1, and they are orthogonal to each other. They therefore have the
following properties:
$$
\sum_{j=1}^{p} a_{ij}^2 = 1 \quad \left( \tilde{a}_i' \cdot \tilde{a}_i = 1 \right)
$$
and
$$
\sum_{j=1}^{p} a_{ij} \cdot a_{kj} = 0 \quad \left( \tilde{a}_i' \cdot \tilde{a}_k = 0 \text{ for } i \neq k \right)
$$
If $\tilde{a}_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$, then the i-th
principal component is given by:
$$
Y_i = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p
$$
The principal components have the following properties:

1) The variance of the principal component $Y_i$ is equal to the eigenvalue $\lambda_i$:
$$
\widehat{\mathrm{Var}}(Y_i) = \lambda_i
$$
2) The first component is the one with the greatest variance, and so on:
$$
\widehat{\mathrm{Var}}(Y_1) > \widehat{\mathrm{Var}}(Y_2) > \cdots > \widehat{\mathrm{Var}}(Y_p)
$$
3) The total variance of the original variables is equal to the sum of the eigenvalues, which
is equal to the total variance of the principal components:
$$
\sum \widehat{\mathrm{Var}}(X_i) = \sum \lambda_i = \sum \widehat{\mathrm{Var}}(Y_i)
$$
4) The principal components are not correlated with each other:
$$
\widehat{\mathrm{Cov}}(Y_i, Y_j) = 0 \quad \text{for } i \neq j
$$
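The four properties above can be checked numerically for p = 2. The following is a minimal sketch in plain Python, not the SAS procedure these notes refer to; `sample_cov` and `eigen_sym_2x2` are hypothetical helpers written for this illustration, the closed-form eigen-solution only applies to a 2 × 2 symmetric matrix with a nonzero off-diagonal element, and the data are the X1 and X2 values of the application example, analysed here through the covariance matrix S.

```python
from math import sqrt

def sample_cov(u, v):
    """Sample covariance with divisor n - 1."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def eigen_sym_2x2(a, b, d):
    """Eigenpairs of [[a, b], [b, d]] with b != 0, largest eigenvalue first."""
    half_trace = (a + d) / 2
    root = sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (half_trace + root, half_trace - root):
        v1, v2 = b, lam - a                 # solves (a - lam) * v1 + b * v2 = 0
        norm = sqrt(v1 ** 2 + v2 ** 2)      # normalize to unit length
        pairs.append((lam, (v1 / norm, v2 / norm)))
    return pairs

x1 = [102, 104, 101, 93, 100]
x2 = [96, 87, 62, 68, 77]

s11, s12, s22 = sample_cov(x1, x1), sample_cov(x1, x2), sample_cov(x2, x2)
(lam1, a1), (lam2, a2) = eigen_sym_2x2(s11, s12, s22)

# Y_i = a_i1 * X_1 + a_i2 * X_2 for every individual
y1 = [a1[0] * u + a1[1] * v for u, v in zip(x1, x2)]
y2 = [a2[0] * u + a2[1] * v for u, v in zip(x1, x2)]

print("property 1:", sample_cov(y1, y1), lam1, sample_cov(y2, y2), lam2)
print("property 2:", lam1 > lam2)
print("property 3:", s11 + s22, lam1 + lam2)
print("property 4:", sample_cov(y1, y2))
```

Up to floating-point rounding, the variances of the scores match the eigenvalues, the eigenvalues add up to the trace of S, and the covariance between the two components is zero.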
Contribution of each principal component
The contribution $C_i$ of each principal component $Y_i$ is expressed as a percentage. It is
calculated by dividing the variance of $Y_i$ by the total variance, and it represents the
proportion of the total variance explained by the principal component $Y_i$.
$$
C_i = \frac{\widehat{\mathrm{Var}}(Y_i)}{\sum_{i=1}^{p} \widehat{\mathrm{Var}}(Y_i)} \times 100
    = \frac{\lambda_i}{\sum_{i=1}^{p} \lambda_i} \times 100
    = \frac{\lambda_i}{\mathrm{trace}(S)} \times 100
$$
The importance of a principal component is assessed by its contribution, that is, by the
proportion of the total variance explained by the component. The sum of the first k
eigenvalues, relative to the sum of all p eigenvalues, represents the proportion of information
retained in the reduction from p to k dimensions. With this information we can decide how many
components to use in the analysis, that is, how many components will be used to differentiate
the individuals. There is no statistical model that supports this decision. According to
REGAZZI (2000), in applications in several areas of knowledge the number of components used has
been the one that accumulates 70% or more of the total variance.
$$
\frac{\widehat{\mathrm{Var}}(Y_1) + \cdots + \widehat{\mathrm{Var}}(Y_k)}{\sum_{i=1}^{p} \widehat{\mathrm{Var}}(Y_i)} \times 100 \geq 70\%, \quad \text{where } k < p
$$
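The 70% rule can be written as a short routine. This is a sketch in plain Python under stated assumptions: the function name `components_to_keep` and the four eigenvalues in the usage line are hypothetical and serve only to illustrate the cumulative-percentage criterion.

```python
def components_to_keep(eigenvalues, threshold=70.0):
    """Return (k, cumulative %) for the smallest k whose first k eigenvalues,
    taken in decreasing order, reach the threshold percentage of the total."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, lam in enumerate(ordered, start=1):
        cumulative += 100.0 * lam / total   # contribution C_i of component i
        if cumulative >= threshold:
            return k, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues of a 4 x 4 correlation matrix (their sum is p = 4)
k, retained = components_to_keep([2.0, 1.0, 0.5, 0.5])
print(k, retained)
```

With these hypothetical eigenvalues the first component explains 50% of the total variance and the first two explain 75%, so two components would be retained.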
Interpretation of each component

This analysis is done by verifying the degree of influence that each variable $X_j$ has on the
component $Y_i$. The degree of influence is given by the correlation between each $X_j$ and the
component $Y_i$ being interpreted. For example, the correlation between $X_j$ and $Y_1$ is:
$$
\mathrm{Corr}(X_j, Y_1) = r_{X_j Y_1}
  = a_{1j} \cdot \frac{\sqrt{\widehat{\mathrm{Var}}(Y_1)}}{\sqrt{\widehat{\mathrm{Var}}(X_j)}}
  = \frac{\sqrt{\lambda_1} \cdot a_{1j}}{\sqrt{\widehat{\mathrm{Var}}(X_j)}}
$$
To compare the influence of $X_1, X_2, \ldots, X_p$ on $Y_1$ we analyze the weight, or loading,
of each variable on the component $Y_1$. The weight of each variable on a given component is
given by:
$$
w_1 = \frac{a_{11}}{\sqrt{\widehat{\mathrm{Var}}(X_1)}}, \quad
w_2 = \frac{a_{12}}{\sqrt{\widehat{\mathrm{Var}}(X_2)}}, \quad \ldots, \quad
w_p = \frac{a_{1p}}{\sqrt{\widehat{\mathrm{Var}}(X_p)}}
$$
where $w_j$ is the weight of the variable $X_j$ on the component $Y_1$.
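The correlation formula can be compared with a correlation computed directly from the scores. The sketch below is an illustration in plain Python, not the author's program; it assumes p = 2 standardized variables with positive correlation r, for which the first eigenvalue of R is 1 + r and the first eigenvector is (1/√2, 1/√2), as in the application example of these notes. The helper names are hypothetical.

```python
from math import sqrt

def sample_cov(u, v):
    """Sample covariance with divisor n - 1."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def sample_corr(u, v):
    return sample_cov(u, v) / sqrt(sample_cov(u, u) * sample_cov(v, v))

x1 = [102, 104, 101, 93, 100]
x2 = [96, 87, 62, 68, 77]

# Standardization with variance 1 and any mean: z_ij = x_ij / s(x_j)
z1 = [a / sqrt(sample_cov(x1, x1)) for a in x1]
z2 = [a / sqrt(sample_cov(x2, x2)) for a in x2]

r = sample_corr(x1, x2)
lam1 = 1 + r                      # first eigenvalue of a 2 x 2 correlation matrix, r > 0
a11 = a12 = 1 / sqrt(2)           # normalized first eigenvector

y1 = [a11 * u + a12 * v for u, v in zip(z1, z2)]   # scores of the first component

by_formula = a11 * sqrt(lam1) / sqrt(sample_cov(z1, z1))
directly = sample_corr(z1, y1)
print(round(by_formula, 3), round(directly, 3))
```

Both routes give the same correlation between Z1 and Y1, which rounds to the 0.879 reported in Table 3.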
If the purpose of the analysis is to obtain indices, a very common practice in economics, the
analysis ends here.
If the purpose of the analysis is to compare or group individuals, the analysis continues and
it is necessary to calculate the scores of each principal component that will be used in the
analysis.
Principal component scores

Scores are the values of the principal components. After the reduction from p to k dimensions,
the k principal components become the new variables that describe the individuals, and all
further analysis is done using the scores of these components. Table 1 shows the organization
of a data set composed of n treatments, p variables and k principal components.
Table 1. Organization of a data set with n treatments, p variables and k principal components

Treatments      Variables                      Principal component scores
(individuals)   X1     X2     ...    Xp        Y1     Y2     ...    Yk
1               X11    X12    ...    X1p       Y11    Y12    ...    Y1k
2               X21    X22    ...    X2p       Y21    Y22    ...    Y2k
...             ...    ...    ...    ...       ...    ...    ...    ...
n               Xn1    Xn2    ...    Xnp       Yn1    Yn2    ...    Ynk
Thus the scores of the first component for the n treatments are:

Treatment   First principal component
1           $Y_{11} = a_{11} X_{11} + a_{12} X_{12} + \cdots + a_{1p} X_{1p}$
2           $Y_{21} = a_{11} X_{21} + a_{12} X_{22} + \cdots + a_{1p} X_{2p}$
...         ...
n           $Y_{n1} = a_{11} X_{n1} + a_{12} X_{n2} + \cdots + a_{1p} X_{np}$
Application example
Table 2 shows the original values (X1 and X2) and the standardized values (Z1 and Z2) of two
variables observed for five treatments (n = 5).

Table 2. Original and standardized values of two variables for five treatments

              Original variables      Standardized variables
Treatments    X1        X2            Z1          Z2
1             102       96            24.3827     6.9554
2             104       87            24.8608     6.3033
3             101       62            24.1436     4.4920
4             93        68            22.2313     4.9268
5             100       77            23.9046     5.5788
Variance      17.50     190.50        1           1
Mean          100.00    78.00         23.9046     5.6513
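The data are standardized to variance 1, without subtracting the mean. For example, for
treatment 2 and variable 1:

$$
Z_{ij} = \frac{X_{ij}}{s(X_j)}, \qquad Z_{21} = \frac{104}{\sqrt{17.5}} = 24.8608
$$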
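The standardized columns of Table 2 can be recomputed from the original values. The sketch below is plain Python written for illustration (it is not the SAS program mentioned in these notes); `sample_variance` and `scale_to_unit_variance` are hypothetical helper names.

```python
from math import sqrt

def sample_variance(v):
    """Sample variance with divisor n - 1."""
    n = len(v)
    m = sum(v) / n
    return sum((a - m) ** 2 for a in v) / (n - 1)

def scale_to_unit_variance(v):
    """Standardization with variance 1 and any mean: z_ij = x_ij / s(x_j)."""
    s = sqrt(sample_variance(v))
    return [a / s for a in v]

x1 = [102, 104, 101, 93, 100]
x2 = [96, 87, 62, 68, 77]
z1 = scale_to_unit_variance(x1)
z2 = scale_to_unit_variance(x2)

for treatment, row in enumerate(zip(x1, x2, z1, z2), start=1):
    print(treatment, row[0], row[1], f"{row[2]:.4f}", f"{row[3]:.4f}")
print("variances:", sample_variance(x1), sample_variance(x2))
```

The printed values agree with Table 2 to four decimal places, and the variances of X1 and X2 are 17.5 and 190.5.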
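The correlation matrix is:
$$
R = \begin{bmatrix} 1 & 0.5456 \\ 0.5456 & 1 \end{bmatrix}
$$
The characteristic equation is $|R - \lambda I| = 0$:
$$
\begin{vmatrix} 1 - \lambda & 0.5456 \\ 0.5456 & 1 - \lambda \end{vmatrix} = 0
$$
$$
\lambda^2 - 2\lambda + 0.7023 = 0
$$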
The eigenvalues of the correlation matrix R are:

$$
\lambda_1 = 1.5456 \quad \text{and} \quad \lambda_2 = 0.4544
$$
The sum of $\lambda_1$ and $\lambda_2$ is equal to the trace of the matrix R. The trace of a
matrix is the sum of the elements of its principal diagonal.
$$
\mathrm{trace}(R) = 1 + 1 = 2
$$
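The step from the characteristic equation to the two eigenvalues can be written out explicitly. The following is a worked derivation using only the value r = 0.5456 given above, with results rounded to four decimals:

$$
(1 - \lambda)^2 - 0.5456^2 = 0
\;\Longrightarrow\;
\lambda^2 - 2\lambda + (1 - 0.2977) = 0
\;\Longrightarrow\;
\lambda = \frac{2 \pm \sqrt{4 - 4 \times 0.7023}}{2} = 1 \pm \sqrt{0.2977} = 1 \pm 0.5456
$$

so that $\lambda_1 = 1.5456$ and $\lambda_2 = 0.4544$. As a check, $\lambda_1 + \lambda_2 = 2 = \mathrm{trace}(R)$ and $\lambda_1 \lambda_2 = 0.7023 = \det(R)$.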
Obtaining the principal components

The normalized eigenvector for the first principal component is:
$$
\tilde{a}_1 = \begin{bmatrix} a_{11} \\ a_{12} \end{bmatrix}
            = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}
            = \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}
$$
and the first principal component is:
$$
Y_1 = 0.7071 Z_1 + 0.7071 Z_2
$$
In the same way, for the second principal component we have:
$$
\tilde{a}_2 = \begin{bmatrix} a_{21} \\ a_{22} \end{bmatrix}
            = \frac{1}{\sqrt{2}} \begin{bmatrix} -1 \\ 1 \end{bmatrix}
            = \begin{bmatrix} -0.7071 \\ 0.7071 \end{bmatrix}
$$
$$
Y_2 = -0.7071 Z_1 + 0.7071 Z_2
$$
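The eigenvectors follow from substituting each eigenvalue into $(R - \lambda_i I)\,\tilde{a}_i = 0$ and applying the normalization condition. A worked version of this step, using the rounded values of the example:

$$
(R - \lambda_1 I)\,\tilde{a}_1 =
\begin{bmatrix} -0.5456 & 0.5456 \\ 0.5456 & -0.5456 \end{bmatrix}
\begin{bmatrix} a_{11} \\ a_{12} \end{bmatrix} = 0
\;\Longrightarrow\; a_{11} = a_{12},
\qquad a_{11}^2 + a_{12}^2 = 1
\;\Longrightarrow\; a_{11} = a_{12} = \frac{1}{\sqrt{2}} \approx 0.7071
$$

$$
(R - \lambda_2 I)\,\tilde{a}_2 =
\begin{bmatrix} 0.5456 & 0.5456 \\ 0.5456 & 0.5456 \end{bmatrix}
\begin{bmatrix} a_{21} \\ a_{22} \end{bmatrix} = 0
\;\Longrightarrow\; a_{21} = -a_{22},
\qquad a_{21}^2 + a_{22}^2 = 1
\;\Longrightarrow\; a_{21} = -\frac{1}{\sqrt{2}}, \; a_{22} = \frac{1}{\sqrt{2}}
$$

The overall sign of each eigenvector is arbitrary; the signs shown are the ones used in this example. The two vectors are orthogonal, since $a_{11} a_{21} + a_{12} a_{22} = 0$.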
Table 3. Information that can be obtained from principal component analysis

Principal   Variance          Weighting               Correlation between    Percentage of     Cumulative percentage
component   (eigenvalue λi)   coefficients            Zj and Yi              total variance    of variance
                              Z1         Z2           Z1         Z2
Y1          1.5456            0.7071     0.7071       0.879      0.879       77.28             77.28
Y2          0.4544            -0.7071    0.7071       -0.476     0.476       22.72             100.00
Table 4. Scores of the two principal components for the five treatments, obtained from the
correlation matrix R

              Principal component scores
Treatments    Y1        Y2
1             22.16     -12.32
2             22.04     -13.12
3             20.25     -13.90
4             19.20     -12.24
5             20.85     -12.96
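The scores in Table 4 are the two linear combinations Y1 and Y2 evaluated at the standardized values of Table 2. The sketch below recomputes them in plain Python for illustration; the function name `score` is hypothetical, and the inputs are the rounded values printed in these notes.

```python
# Standardized values (Z1, Z2) of the five treatments, from Table 2
z = [
    (24.3827, 6.9554),
    (24.8608, 6.3033),
    (24.1436, 4.4920),
    (22.2313, 4.9268),
    (23.9046, 5.5788),
]

a1 = (0.7071, 0.7071)    # coefficients of Y1
a2 = (-0.7071, 0.7071)   # coefficients of Y2

def score(coefficients, row):
    """Y_i = a_i1 * Z_1 + a_i2 * Z_2 for one treatment."""
    return sum(c * v for c, v in zip(coefficients, row))

scores = [(score(a1, row), score(a2, row)) for row in z]
for treatment, (y1, y2) in enumerate(scores, start=1):
    print(f"{treatment}  {y1:6.2f}  {y2:7.2f}")
```

Rounded to two decimals, the printed scores coincide with the values in Table 4.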
Scatter plot
Scatter plots are used to visualize the dispersion of the treatments as a function of the
principal components in two- or three-dimensional space. The dispersion of the treatments for
this example is illustrated in Figure 2.
[Figure 2: scatter plot of the five treatments; horizontal axis: second component (Y2); vertical axis: first component (Y1)]
Figure 2. Dispersion of the treatments as a function of the principal component scores.
SAS program for obtaining the main components
BIBLIOGRAPHY
REGAZZI, A. J. Multivariate analysis. Lecture notes, INF 766, Department of Informatics,
Federal University of Viçosa, v. 2, 2000.
KHATTREE, R.; NAIK, D. N. Multivariate data reduction and discrimination with SAS software.
Cary, NC, USA: SAS Institute Inc., 2000. 558 p.
JOHNSON, R. A.; WICHERN, D. W. Applied multivariate statistical analysis. 4th ed. Upper
Saddle River, New Jersey: Prentice-Hall, 1999. 815 p.