
Principal Component Analysis implementation in Java

Sebastian Wójtowicz1, Radosław Belka1, Tomasz Sławiński2, Mahnaz Parian3.


1 Faculty of Electrical Engineering, Automatics and Computer Science,
Kielce University of Technology, Al. Tysiąclecia Państwa Polskiego 7, 25-314 Kielce, Poland
2 Consultant, Szewna, ul. Armii Ludowej 82, 27-400 Ostrowiec Swietokrzyski, Poland
3 Electrical and Electronic Engineering Department, Islamic Azad University, South Tehran Branch

ABSTRACT
In this paper we show how the PCA (Principal Component Analysis) method can be implemented in the Java programming
language. We consider applying the PCA algorithm especially to data obtained from Raman spectroscopy
measurements, but other applications of the developed software should also be possible. Our goal is to create a general-purpose
PCA application, ready to run on every platform supported by Java.
Keywords: PCA, statistical procedures, Raman spectroscopy, Java.

1. INTRODUCTION
Nowadays, large amounts of data are used in many applications. Unfortunately, it is sometimes difficult to notice
differences in data presented in tables or even in plots. A possible aid in solving this problem is to use statistical
methods of factor analysis, in particular the relatively simple PCA (Principal Component Analysis) method. PCA has
found many applications in different fields, including image compression [1], various approaches to face recognition [2, 3],
and more advanced applications such as communication between the human brain and external devices (the BCI method)
[4]. As these examples show, PCA is a dominant tool for finding patterns in analysed data. In this paper the basics of principal
component analysis are explained, with some computations and one example using multivariate data. The purpose is to
explain how PCA works and to give a better understanding of its implementation for real-world measurement data. Our goal is to
create a general-purpose application ready to work on every platform supported by Java. The developed software has been
tested on data obtained from Raman spectroscopy.

2. DESCRIPTION OF PCA METHOD


Principal component analysis is one of the statistical techniques used to reduce the dimensionality of multivariate data
with minimal information loss. Built on linear algebra, PCA has found many applications (face recognition, image compression)
and is a powerful technique for finding patterns in high-dimensional data [5].
The main objectives of PCA are:
- to reduce the original variables to a smaller number of variables called principal components,
- to visualize correlations among the original variables and between these variables and the factors,
- to visualize proximities among statistical units.

2.1. Computing PCA


To find the principal components it is necessary to calculate the eigenvectors and eigenvalues of the data covariance matrix;
this process amounts to finding an axis system in which the covariance matrix is diagonal [6]. The direction of greatest
variation is the eigenvector with the largest eigenvalue, and the orthogonal direction corresponds to the eigenvector with
the second-largest eigenvalue. To better understand this computation, a brief review of eigenvectors and eigenvalues is given below.
Let det denote the determinant, M an m × m matrix, and I the m × m identity matrix. The eigenvalues of M are defined
by:
det(M - λI) = 0 (1)
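The computation in equation (1) can be sketched in Java for the simplest non-trivial case. The class and method names below are illustrative assumptions, and the quadratic-formula shortcut applies only to 2 × 2 matrices, where det(M - λI) = λ² - trace(M)·λ + det(M) = 0:

```java
// Sketch: eigenvalues of a 2x2 matrix from det(M - lambda*I) = 0.
// For 2x2 the characteristic polynomial is
//   lambda^2 - trace(M)*lambda + det(M) = 0,
// solved here with the quadratic formula.
public class Eigen2x2 {
    /** Returns the two eigenvalues (larger first) of the matrix {{a, b}, {c, d}}. */
    public static double[] eigenvalues(double a, double b, double c, double d) {
        double trace = a + d;
        double det = a * d - b * c;
        // Discriminant is non-negative for symmetric matrices (real eigenvalues).
        double disc = Math.sqrt(trace * trace - 4 * det);
        return new double[] { (trace + disc) / 2, (trace - disc) / 2 };
    }
}
```

For example, the symmetric matrix {{2, 1}, {1, 2}} yields the eigenvalues 3 and 1.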

Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2015,
edited by Ryszard S. Romaniuk, Proc. of SPIE Vol. 9662, 96623P · © 2015 SPIE
CCC code: 0277-786X/15/$18 · doi: 10.1117/12.2205857



where λ represents an eigenvalue of M. There exists a vector v such that:
Mv = λv (2)
Equation (2) is known as the eigenvalue equation. The vector v is an eigenvector of the matrix M associated with the eigenvalue λ.
If the matrix M is symmetric, its eigenvalues are real numbers and its eigenvectors are orthogonal to each other. An
eigenvector is a direction vector and can be rescaled to any magnitude. To find a numerical solution, one of its elements
is chosen and set to an arbitrary value; if no solution is found, the process is repeated with another
element. The final values are normalized so that v has unit length, i.e. v·vᵀ = 1.
For a 3 × 3 matrix M we have 3 eigenvectors v1, v2, v3 and 3 eigenvalues λ1, λ2, λ3, which means:

Mv1 = λ1v1,  Mv2 = λ2v2,  Mv3 = λ3v3 (3)

Putting the eigenvectors as columns of a matrix:

                             [λ1  0   0]
M [v1 v2 v3] = [v1 v2 v3]    [ 0  λ2  0]   (4)
                             [ 0   0  λ3]

And writing:

                          [λ1  0   0]
A = [v1 v2 v3],       B = [ 0  λ2  0]      (5)
                          [ 0   0  λ3]
where A is an orthogonal matrix, which means A⁻¹ = Aᵀ.
Equation (4) can then be expressed as:
MA = AB (6)

Since A is an orthogonal matrix, we can write:

AᵀMA = B   or   M = ABAᵀ (7)

Now, for a covariance matrix (let C be an m × m covariance matrix), there is an orthogonal m × m matrix A whose columns
are the eigenvectors of C, and a second, diagonal matrix B whose diagonal elements are the eigenvalues of C, such that:

AᵀCA = B (8)

This linear transformation maps data points from one axis system to another in which the variables are uncorrelated.
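The covariance matrix C that enters equation (8) can be computed from raw data as in the minimal sketch below; the class and method names are illustrative assumptions, with rows taken as observations and columns as variables:

```java
// Sketch: sample covariance matrix of mean-centred data, the input to the
// eigendecomposition in Eq. (8). Rows are observations, columns are variables.
public class Covariance {
    public static double[][] covariance(double[][] data) {
        int n = data.length, m = data[0].length;
        double[] mean = new double[m];
        for (double[] row : data)                 // column means
            for (int j = 0; j < m; j++) mean[j] += row[j] / n;
        double[][] c = new double[m][m];
        for (double[] row : data)                 // accumulate centred products
            for (int j = 0; j < m; j++)
                for (int k = 0; k < m; k++)
                    c[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
        return c;
    }
}
```

Because the result is symmetric, its eigenvalues are real and its eigenvectors orthogonal, exactly as required above.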

2.2. Principal component analysis example


Principal component analysis will now be performed on multivariate data. Table 1 shows the average food consumption in
the four UK countries, given in grams per person per week (source: Department for Environment, Food and Rural Affairs).
It is very hard to find any correlations between variables in Table 1; the differences are not notable enough. Usually a series of plots or
diagrams is used to visualize and analyse such a data set, but for large data sets it is not easy to produce a readable
graph. PCA allows us to perform the analysis for many variables at once. If we take the food types as variables and the countries as
observations, the data set corresponds to 4 points in a 17-dimensional space, and every correlation between those
observations would be visible there. Since a 17-dimensional space cannot be visualised, it
is hard to see any clustering directly.
First, a new set of orthogonal coordinate axes has to be found; the first of these is called the first principal component. Then an
orthogonal projection (a linear transformation P that maps a given vector space to itself and satisfies
P² = P) is used to express the coordinates in the new axes – figure 2.
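A minimal sketch of such an orthogonal projection is given below: for a unit vector v, P = vvᵀ projects any point onto the axis spanned by v, and applying P twice gives the same result (P² = P). The names are illustrative assumptions and only the 2D case is shown:

```java
// Sketch: orthogonal projection onto the axis spanned by a unit vector v.
// P = v v^T is idempotent (P^2 = P), as required of a projection.
public class Projection {
    /** Projects point p onto the line spanned by unit vector v (2D case). */
    public static double[] project(double[] v, double[] p) {
        double dot = v[0] * p[0] + v[1] * p[1]; // scalar coordinate along v
        return new double[] { dot * v[0], dot * v[1] };
    }
}
```

Projecting a projected point again leaves it unchanged, which is the P² = P property mentioned above.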



Table 1. Food consumption in UK in g/person/week.
England Wales Scotland N Ireland
Cheese 105 103 103 66
Carcass meat 245 227 242 267
Other meat 685 803 750 586
Fish 147 160 122 93
Fats and oils 193 235 184 209
Sugars 156 175 147 139
Fresh potatoes 720 874 566 1033
Fresh Veg 253 265 171 143
Other Veg 488 570 418 355
Processed potatoes 198 203 220 187
Processed Veg 360 365 337 334
Fresh fruit 1102 1137 957 674
Cereals 1472 1582 1462 1494
Beverages 57 73 53 47
Soft drinks 1374 1256 1572 1506
Alcoholic drinks 375 475 458 135
Confectionery 54 64 62 41

[Figure 1 image: 1D scores plot along PC1 (axis from -300 to 500); Wal, Eng and Scot cluster together while N Ire lies apart.]
Figure 1. First principal component (PC1) for data presented in table 1 [7].
As can be seen in Figure 1, the coordinates are clustered: two major clusters are forming, one containing Wales,
England and Scotland together, and the second one Northern Ireland by itself, separate from the others. This alone is
notable enough to see that something is different about Northern Ireland. Figure 2 shows two principal components (orthogonal
to each other) that project the coordinates into a 2D scores plot. Now it is much easier to see the differences between the UK countries. One
visible difference is that Northern Ireland is a major outlier, which matches real life, as Northern Ireland is the only one of those
four countries not located on the island of Great Britain.
One may ask why the dots are clustered together in this specific way. The answer is simple: if we look again at
Table 1, we notice that Northern Ireland consumes less fish, fruit, alcoholic drinks and cheese, and at the same time
more fresh potatoes. It is much harder to see such differences in a table with a large amount of data, whereas using the PCA
method clearly improves the data-analysis process. For a better visual understanding of PCA, there is a
website with a multivariate data example which can be useful [9].



[Figure 2 image: 2D scores plot (PC2 vs PC1); Eng, Wal and Scot cluster together while N Ire lies apart.]

Figure 2. PC2 vs PC1 components for data presented in table 1 [7].

3. RESULTS AND DISCUSSION


The main goal of this paper was to present a PCA implementation as an application in the Java programming language and
finally to use this application for the analysis of spectroscopic data (e.g. Raman spectra). Although there are a number of
implementations of the PCA method (e.g. the PCA Model Editor, a component of Jasco Spectra Manager [8]), they are
usually only a module of a software package associated with specific equipment and dedicated to specific
applications. Often, exporting the data is difficult or even impossible. Therefore it was decided to create an
application capable of processing any data saved in the simplest .txt format. Our application was inspired by the Jasco
Spectra Manager; one disadvantage of the PCA module in the Jasco application is its handling of .txt files.
The Java runtime environment and the JFreeChart library were used to create the PCA application. Figure 3 shows the graphical interface
of the application with input data loaded. The graphical interface is uncomplicated and easy to use. The program includes the
following tabs:
- data diagrams - plotting the input data imported from a .txt file,
- PCA - visualisation of the coordinate values resulting from the principal component analysis,
- transformed diagrams - plotting the eigenvectors.
[Figure 3 image: application window, tab "Dane początkowe" (initial data), showing the normalised Raman spectra of the ABS, PEHD, PP and PS samples over an x-axis from 0 to about 4000.]

Figure 3. Graphic interface of developed software.


The developed application was tested on Raman spectroscopy (RS) data. RS is a non-invasive method of studying the
molecular structure of compounds and chemical substances, based on the inelastic scattering of photons. The Raman



spectrum is an appropriately converted energy distribution of the photons scattered inelastically by the sample.
RS is commonly used for material and chemical identification [10, 11]. Because Raman spectra
can be complex (e.g. pharmaceuticals, polymers), their analysis and classification may be difficult and require additional
support methods such as PCA. Spectra of different materials were obtained using a Nicolet Almega XR Raman
spectrometer with 532 nm excitation.
Figure 3 shows data containing the Raman spectra of several polymers: ABS (acrylonitrile butadiene styrene), PEHD
(high-density polyethylene), PP (polypropylene) and PS (polystyrene). In all spectra the PL background was removed, and
their intensities were normalised to the range 0-1. The data sets were pre-processed to obtain the same resolution
and spectral range, which is an important factor for PCA in this case: every sample starts at 150 and ends at 4000 on the x-axis
with a step of 2, so it contains about 1925 data points. The prepared text files contain numerical data in the form of two columns,
the first of which contains the Raman shift (identical for all files) and the second the Raman intensity. These text files can then be
easily imported into the software and processed according to the PCA procedure.
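A parser for the two-column text format described above might be sketched as follows; the paper does not show its actual import code, so the class name, the array-of-pairs return type, and the whitespace-separated layout are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parsing a two-column spectrum file (Raman shift, then intensity),
// one whitespace-separated pair per line, blank lines skipped.
public class SpectrumReader {
    public static double[][] parse(String[] lines) {
        List<double[]> points = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;        // skip blank lines
            String[] cols = trimmed.split("\\s+");  // [shift, intensity]
            points.add(new double[] {
                Double.parseDouble(cols[0]), Double.parseDouble(cols[1]) });
        }
        return points.toArray(new double[0][]);
    }
}
```

In a real application the `String[]` would come from reading the .txt file line by line, e.g. via `java.nio.file.Files.readAllLines`.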

[Figure 4 image: 2D scores plot, PC2 vs PC1; one group of points for the PP and PEHD samples and another for ABS and PS.]

Figure 4. Result of PCA algorithm.


Figure 4 presents the calculated PC2 vs PC1 coordinate plot for the input data. The application makes it
possible to choose any two PCs and visualize them in a 2D plot. By default, the two most important components
(PC1 and PC2) are visualized, as shown in figure 4. As mentioned before, we can see how the data are clustered: the upper
left corner groups PP and PEHD, while the lower right is formed by the other two samples, ABS and PS. This is
correct, as ABS is a copolymer containing polystyrene chains, so the ABS and PS spectra should show similarity.

4. CONCLUSION
PCA is a powerful statistical method used to reduce dimensionality and to visualize correlations and proximities. It has
five steps: data preparation, covariance or correlation matrix calculation, eigenvector and eigenvalue decomposition,
principal component selection, and computation of the new data set [12]. Figures 1, 2 and 4 are good examples of how the variables
were reduced with minimal loss of the original data set's information. In PCA, the direction with the largest variance is the most
important or, in this case, the most principal; the method is particularly useful for highly correlated variables. The example above shows
that in order to classify the data, only 2 variables are needed, rather than nearly two thousand as in the input data. The motivation
for this application and article was to show that a powerful statistical method (used in image compression, face recognition,
Raman spectroscopy and much more) can be implemented in a modern programming language and simply used with
any .txt data on every platform supported by Java.
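The five steps listed above can be sketched end to end in Java. As a simplification, power iteration stands in for a full eigendecomposition and only the PC1 scores are computed; all names are illustrative assumptions, not the authors' implementation:

```java
// Sketch of the five-step PCA pipeline for the first component only:
// centre the data, build the covariance matrix, find the dominant
// eigenvector by power iteration, and project each observation onto it.
public class Pca1 {
    /** Returns the PC1 score of every row of data (rows = observations). */
    public static double[] pc1Scores(double[][] data) {
        int n = data.length, m = data[0].length;
        double[] mean = new double[m];
        for (double[] row : data)                 // step 1: column means
            for (int j = 0; j < m; j++) mean[j] += row[j] / n;
        double[][] cov = new double[m][m];
        for (double[] row : data)                 // step 2: covariance matrix
            for (int j = 0; j < m; j++)
                for (int k = 0; k < m; k++)
                    cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
        double[] v = new double[m];               // step 3: power iteration
        java.util.Arrays.fill(v, 1.0);
        for (int iter = 0; iter < 1000; iter++) {
            double[] w = new double[m];
            double norm = 0;
            for (int j = 0; j < m; j++)
                for (int k = 0; k < m; k++) w[j] += cov[j][k] * v[k];
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);
            for (int j = 0; j < m; j++) v[j] = w[j] / norm;
        }
        double[] scores = new double[n];          // steps 4-5: project onto PC1
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                scores[i] += (data[i][j] - mean[j]) * v[j];
        return scores;
    }
}
```

A production version would compute all components (e.g. via Jacobi rotations or a library eigendecomposition) rather than only the dominant one.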



ACKNOWLEDGEMENTS
Raman measurements were performed using equipment funded by the European Regional Development Fund within the
Innovative Economy Operational Programme 2007-2013 (No. POIG 02.02.00-26-023/08-00).

REFERENCES
[1] Qian, D. and Fowler, J., "Hyperspectral Image Compression Using JPEG2000 and Principal Component Analysis",
Geoscience and Remote Sensing Letters IEEE (4), 201-205 (2007).
[2] Nedevschi, S., "PCA type algorithm applied in face recognition ", Intelligent Computer Communication and
Processing (ICCP), 167-171 (2012).
[3] Zhang, D., Zhou, Z. and Chen, S., "Diagonal principal component analysis for face recognition", Pattern
Recognition 39 (1), 140-142 (2006).
[4] Kottaimalai, R., Rajasekaran, M., Selvam, V. and Kannapiran, B., "EEG signal classification using Principal
Component Analysis with Neural Network in Brain Computer Interface applications", Emerging Trends in
Computing, Communication and Nanotechnology (ICE-CCN), 227-231 (2013).
[5] Smith, L., "A tutorial on Principal Components Analysis", Cornell University, 1-22 (2002).
[6] Gillies, D., "DOC493: Intelligent Data Analysis and Probabilistic Inference Lecture 15", Department of Computing,
Imperial College London.
[7] Richardson, M., "Principal Component Analysis", Wiley Interdisciplinary Reviews: Computational Statistics (2), 3-14 (2009).
[8] http://www.jascoinc.com/applications (on 30.06.2015).
[9] http://setosa.io/ev/principal-component-analysis/ (on 30.06.2015).
[10] Belka, R., Suchańska, M., Czerwosz, E. and Kęczkowska, J., "Raman studies of Pd-C nanocomposites," Central
European Journal of Physics 11 (2), 245-250 (2013).
[11] Belka, R. and Suchańska, M., “Properties of the carbon-palladium nanocomposites studied by Raman spectroscopy
method,” Proceedings of SPIE 8903, (2013).
[12] http://www.sthda.com/english/wiki/principal-component-analysis-the-basics-you-should-read-r-software-and-data-mining (on 30.06.2015).

