INDEX TERMS Learning systems, Supervised learning, Machine learning, Prediction methods, Predictive models, Neural networks, Artificial neural networks, Feedforward neural networks, Radial basis function networks, Computer applications, Scientific computing, Performance analysis, High performance computing, Software, Open source software, Utility programs.
The hidden layer is not constrained to have only one type of transformation function in its neurons. Different functions can be used (sigmoid, hyperbolic tangent, threshold, etc.) [3], [6], [7]. Some neurons may have no transformation function at all. They are linear neurons, which learn linear dependencies between data features and targets directly, without approximating them by a non-linear function. Usually the number of linear neurons equals the number of data features, and each of these neurons copies the corresponding feature (by using an identity W and zero b).

Another type of neuron commonly present in ELMs is the Radial Basis Function (RBF) neuron [32]. RBF neurons use distances to centroids as inputs to the hidden layer, instead of a linear projection. The non-linear transformation function is applied as usual. ELMs with RBF neurons compute predictions based on similar training data samples, which helps in solving tasks with a complex dependency between data features and targets. Any function (norm) of the distances between samples and centroids can be used, for instance the L2, L1 or L∞ norm.

C. MATRIX FORM OF ELMs
Practically, ELMs are often solved in matrix form by a closed-form solution. An implementation with matrices is easy to write and fast to run on computers. An ELM is written in matrix form by gathering the outputs of all hidden neurons into a matrix H, as in equation (3). A graphical representation is shown in Figure 2. The matrix form of ELMs is used in the paper hereafter.

    H = [ φ(w_1·x_1 + b_1)  ⋯  φ(w_L·x_1 + b_L) ]
        [        ⋮          ⋱          ⋮        ]
        [ φ(w_1·x_N + b_1)  ⋯  φ(w_L·x_N + b_L) ],   (3)

    β = [β_1^T ⋯ β_L^T]^T,   T = [y_1^T ⋯ y_N^T]^T,   (4)

where β is the matrix of output weights describing the projection between the two linear spaces. Thus an ELM is viewed as two projections: the input projection XW and the output projection Hβ, with a (non-linear) transformation between them, H = φ(XW + b). The number of hidden neurons regulates the size of the matrices W, H and β; but the network neurons are never treated separately.

With different types of hidden neurons, the first projection and the transformation are performed independently for each type of neuron. The resulting sub-matrices H_i are then concatenated along the second dimension. For two types of hidden neurons:

    H = [H_1 | H_2] = [φ_1(XW_1 + b_1) | φ_2(XW_2 + b_2)].   (5)

Linear neurons are added into an ELM by simply copying the inputs into the hidden layer outputs:

    H = [H_1 | H_2 | X] = [φ_1(XW_1 + b_1) | φ_2(XW_2 + b_2) | X].   (6)

D. ELM SOLUTION WITH PSEUDO-INVERSE
Most often, an ELM problem is over-determined (N > L), with the number of training data samples larger than the number of hidden neurons. For determined (N = L) and under-determined (N < L) instances, an ELM should use regularization [3]; otherwise it has poor generalization performance.

A unique solution for an over-determined system is given by the minimum L2 norm of the training error. It may be found using the Moore-Penrose generalized inverse [33] (pseudoinverse) of the matrix H, denoted H†. As the matrix H has full column rank, the pseudoinverse is computed as in equation (9):

    Hβ = T,   (7)
    β = H†T,   (8)
    H† = (H^T H)^{-1} H^T.   (9)

The pseudoinverse is prone to numerical instabilities if the matrix H^T H is close to singular. Practically (in Matlab® and Python), implementations of the pseudoinverse include a small regularization term, H† = (H^T H + αI)^{-1} H^T, with α = 50ε, where ε is the machine precision of the floating point type used. Adding the regularization term makes the matrix H^T H non-singular, and makes the same solution applicable also to determined and under-determined systems.
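The whole training procedure above fits in a few lines of NumPy. The following is a minimal hedged sketch, not the toolbox implementation: it builds a hidden layer of sigmoid plus linear neurons as in equation (6) and solves for β with the regularized pseudoinverse of equation (9); all names are illustrative.

import numpy as np

def elm_train(X, T, L=100):
    # Random input weights and biases for L sigmoid neurons.
    rng = np.random.default_rng(0)
    d = X.shape[1]
    W = rng.standard_normal((d, L))
    b = rng.standard_normal(L)
    # Hidden layer outputs: sigmoid part, eq. (3), plus linear part, eq. (6).
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    H = np.hstack([H, X])
    # Regularized pseudoinverse solution, eq. (9): beta = (H^T H + aI)^-1 H^T T.
    alpha = 50 * np.finfo(X.dtype).eps
    HtH = H.T @ H + alpha * np.eye(H.shape[1])
    beta = np.linalg.solve(HtH, H.T @ T)
    return W, b, beta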
class independence is kept. The predicted class is assigned according to the target with the largest ELM output.

If the classes are ordinal and have a ranking, then they are translated into real numbers. Only one target is created for all the classes, and the predicted class is the one with the closest number to the ELM output.

In a multi-label problem, a sample can have multiple correct classes. The targets are created similarly as in the independent-classes problem formulation. The predicted classes are assigned for all ELM outputs greater than a threshold value.

Using an ELM for classification with independent classes changes the way the prediction error is calculated. The classification error does not penalize (or encourage) small changes in the ELM output values which do not lead to a different classification. This makes a difference in the model structure selection (described in section II-F), where an optimization w.r.t. the MSE regression error finds an incorrect optimal number of hidden neurons, and creates a model with a sub-optimal classification prediction performance.

F. MODEL STRUCTURE SELECTION IN ELMs
Model structure selection prevents an ELM from learning noise from data and over-fitting. It does so by artificially limiting the learning ability of the ELM. A training dataset has multiple instances of inputs and the corresponding targets, which are generated by the projected data and an added noise. The noise term includes both random noise and the projection from features not present in the inputs. Learning particular data samples together with the associated noise is called over-fitting. An over-fitted ELM model has worse generalization performance (prediction performance on new data), which can be measured using a validation set of data. A model structure selection process finds an optimal generalization performance by changing the number of model parameters or by applying regularization to the model.

A hyper-parameter of ELMs which governs the number of effective parameters is the number of hidden neurons. The optimum number of neurons is found with a validation set, a cross-validation procedure, or a Leave-One-Out validation procedure (which has an efficient solution in ELMs). Hidden neurons can be added and removed randomly, or they can be ranked by their relevance to the problem. This ranking is called Optimal Pruning [28] and it achieves better performance, with a trade-off of a longer runtime. Neuron pruning methods correspond to L1-regularization.

Another model structure selection technique available in ELMs is Tikhonov regularization [34]. It reduces the effective number of model parameters by reducing the influence of neuron outputs, without removing the neurons themselves. Tikhonov regularization is efficient for achieving numerical stability in near-singular ELMs (and in linear problems in general). This regularization corresponds to L2-regularization, and can be combined with L1 to achieve the best results [29].

Model structure selection is less important in Big Data tasks, because with a large number of samples a model learns to ignore noise. Large tasks are often complex enough not to overfit even at the limits of the hardware. Also, most model structure selection methods significantly increase the runtime, which is a limiting factor for training large ELM models. For these reasons, only one fast neuron pruning method with a validation set is included in the toolbox part for large data.
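As an illustration of the validation-set approach, the hedged sketch below picks the best number of neurons by re-solving β on growing leading subsets of a fixed hidden layer; the function name and the candidate grid are illustrative, not toolbox code.

import numpy as np

def select_neurons(H, T, Hv, Tv, candidates):
    # H, T: hidden layer outputs and targets for training data;
    # Hv, Tv: the same for a validation set.
    best_k, best_err = None, np.inf
    for k in candidates:  # e.g. [8, 16, 32, 64, 128]
        beta = np.linalg.lstsq(H[:, :k], T, rcond=None)[0]
        err = np.mean((Hv[:, :k] @ beta - Tv) ** 2)  # validation MSE
        if err < best_err:
            best_k, best_err = k, err
    return best_k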
III. ELMs IN PRACTICE

A. DATA NORMALIZATION
Input data normalization is a critical preprocessing step for many Machine Learning methods, including ELMs. Raw data often has features of different scales: for example, the age of a person is on a scale of 1-100, while an annual salary in dollars is three orders of magnitude larger. Without normalization, small relative variations in the salary make large relative variations in the age negligible. Normalization sets all features to the same scale. Then all features have the same influence, and the training method learns which ones to use for the prediction.

In the ELM toolbox, weights can be given explicitly or generated automatically. Automatic weight generation assumes that the data has zero mean and unit variance. The generated weights keep the performance of a neural network with sigmoid neurons near the optimum, and compensate for a large number of inputs. The explanation and experimental evaluation of the automatic random weight parameters are given in the experimental Section V-B.

B. ELM SOLUTION WITH BEST LINEAR UNBIASED ESTIMATOR
The best linear unbiased estimator gives the optimal least squares solution to the matrix equation Xβ = T for stochastic vectors x and t combined into the corresponding matrices. It uses two theoretical correlation matrices,

    E[x^T x] = C_xx,   E[x^T t] = C_xt,   (10)

which are assumed to be known. The best linear unbiased estimator of T, denoted by Y, is then

    Y = X C_xx^{-1} C_xt = Xβ.   (11)

The inverse of C_xx exists because x is a stochastic variable for which C_xx = E[x^T x] has full rank.

The ELM problem has a finite number of projected data samples H and corresponding targets T, so the correlation matrices are replaced by their estimates,

    C_xx ≈ H^T H = Ω_h,   C_xt ≈ H^T T = Ω_t,   (12)

and the ELM output weights are computed from those estimates:

    β = (H^T H)^{-1} (H^T T) = Ω_h^{-1} Ω_t.   (13)

The inverse of the matrix Ω_h = H^T H exists if it has full rank. In an ELM model, the nonlinear random projection produces
almost orthogonal features, which are linearly independent. The number of hidden neurons (columns of H) is smaller than the number of training samples (rows of H), otherwise the linear model will learn the training samples perfectly and overfit. Under these constraints, the rank of the matrix H equals its number of columns, thus the matrix H^T H = Ω_h has full rank and its inverse exists.
C. NUMERICAL STABILITY OF AN ELM SOLUTION WITH CORRELATION MATRICES
If numerical instabilities are encountered in the inverse, a regularization term is applied to the correlation matrix: Ω_h = H^T H + αI, where α is a small positive constant. This approach is called Ridge Regression [35], a.k.a. Tikhonov regularization [34]. A parameter α greater than zero reduces the effective number of variables in the model, increasing the stability of the inverse but decreasing the predictive power. This default Ridge regression is used in all matrix inverse functions of Python (Numpy) and Matlab®, with α = 50ε, where ε is a machine precision constant.
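In code, the regularized solution of equation (13) is a one-liner on top of the correlation matrices; a minimal hedged sketch with illustrative names:

import numpy as np

def solve_beta(omega_h, omega_t, alpha=None):
    # beta = (Omega_h + alpha*I)^-1 Omega_t, eq. (13) with a ridge term.
    if alpha is None:
        alpha = 50 * np.finfo(omega_h.dtype).eps  # the default described above
    L = omega_h.shape[0]
    return np.linalg.solve(omega_h + alpha * np.eye(L), omega_t)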
D. OUT-OF-MEMORY INCREMENTAL ELMs
ELMs easily run out of memory for storing the matrix H with a large number of data samples and hidden neurons. Previously this problem was tackled by iteratively updating the output weights. However, these methods are computationally slower, because they perform updates of large matrices for each data sample [24], or need to calculate a solution repeatedly [3].

An easier way of finding the solution of a large ELM is possible using the notation of the estimated correlation matrices. It addresses the memory limitation by being invariant to the number of training samples. The runtime is virtually the same as for a pseudoinverse solution on a machine with infinite memory. A comparison of the computational complexity and memory requirements of the pseudo-inverse versus correlation-matrix ELM solutions is presented in Table 1.

TABLE 1. ELM computation and memory requirements; computations along the dimension N can be performed incrementally in L-size batches.

The memory requirement of a correlation ELM solution is constant for any number of training samples, because the correlation matrices Ω_h and Ω_t can be computed for batches of training data. The batch computation replaces the number of samples N in the memory requirements by the batch size. A good trade-off in terms of memory requirement and computational overhead is achieved with a batch size equal to L. The final Ω_h and Ω_t are computed from the batches by a simple summation. The summation adds no runtime overhead, because in software and hardware implementations, matrix multiplication and summation are performed in a single operation: gemm(A, B, C) = AB + C.
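A hedged sketch of this batch accumulation in NumPy; the names are illustrative, and the toolbox reads batches from HDF5 files rather than slicing an in-memory array:

import numpy as np

def accumulate_correlations(X, T, W, b, batch=1000):
    # Accumulate Omega_h = H^T H and Omega_t = H^T T over batches of data,
    # keeping only L x L and L x c matrices in memory.
    L = W.shape[1]
    omega_h = np.zeros((L, L))
    omega_t = np.zeros((L, T.shape[1]))
    for i in range(0, X.shape[0], batch):
        Xb, Tb = X[i:i + batch], T[i:i + batch]
        Hb = 1.0 / (1.0 + np.exp(-(Xb @ W + b)))  # sigmoid hidden layer
        omega_h += Hb.T @ Hb
        omega_t += Hb.T @ Tb
    return omega_h, omega_t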
E. WEIGHTED CLASSIFICATION WITH ELMs
In a classification task with a highly uneven number of data samples for different classes, ELM predictions are biased towards the class with the most data. This behaviour is improved by using a weighted linear system solution in the output layer of an ELM [31]. A weighted linear system has a least squares solution similar to the best linear unbiased estimator solution:

    Ω_h = H^T A H,   Ω_t = H^T A T,   (14)

where A ∈ R^{N×N} is an arbitrary weight matrix. If only sample weights are used, the matrix A is diagonal; but these weights are complicated to obtain if they are not given explicitly. In a classification task, the diagonal elements of A for all samples of a class j ∈ [1, c] are given the same weight a_j:

    a_j = N / Σ_{i=1}^{N} T_ij,   j ∈ [1, c].   (15)

The solution of an ELM obtained this way is unbiased for any class. An additional multiplication by A is avoided by applying the weights a_j directly to the rows of the matrices H, T which correspond to the data samples of a class j.

Alternatively, the correlation matrices can be computed for each class separately: Ω_h^1, Ω_t^1, ..., Ω_h^c, Ω_t^c. The weights are then applied during the summation of the correlation matrices Ω_h and Ω_t:

    Ω_h = a_1 Ω_h^1 + ... + a_c Ω_h^c,   (16)
    Ω_t = a_1 Ω_t^1 + ... + a_c Ω_t^c.   (17)
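A hedged per-class version of equations (15)-(17), assuming one-hot targets T; illustrative code, not the toolbox internals:

import numpy as np

def weighted_correlations(H, T):
    # Combine per-class correlation matrices with weights a_j = N / N_j.
    N, c = T.shape
    L = H.shape[1]
    omega_h = np.zeros((L, L))
    omega_t = np.zeros((L, c))
    for j in range(c):
        m = T[:, j] == 1                  # samples of class j
        a_j = N / m.sum()                 # eq. (15)
        omega_h += a_j * (H[m].T @ H[m])  # eq. (16)
        omega_t += a_j * (H[m].T @ T[m])  # eq. (17)
    return omega_h, omega_t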
on Big Data. The toolbox is written to provide the best-performing ELM implementation to all interested researchers and end users.

A. HOW TO GET THE TOOLBOX
The toolbox is a Python library, also available from Matlab®. It is written in the Python programming language using the efficient numerical libraries Numpy (http://www.numpy.org) and Scipy (http://www.scipy.org). The toolbox requires Python and the following libraries: Numpy, Scipy, Numexpr and pyTables (http://www.pytables.org). The easiest way to get Python with all required libraries is to use the Anaconda Python distribution (http://continuum.io/downloads). It is a one-click install on Windows/Linux/OSX, free, and provides free MKL acceleration to all university affiliates. Any other Python installation will work as well.

To install the toolbox for CPU, open the console and type pip install hpelm. This will download and install the toolbox with all required libraries. Anaconda provides a Python console on Windows; Linux and OSX have built-in ones.

To obtain an accelerated toolbox, first download and install the MAGMA math library (http://icl.cs.utk.edu/magma/) for your accelerator: an Nvidia GPU with CUDA, an AMD GPU with OpenCL, or a Xeon Phi accelerator card (the so-called MIC architecture). All versions of MAGMA are available from the website, which also has a forum for installation support. To build MAGMA, rename one of the make.inc.xxxx files to make.inc and edit that file according to your system installation. Then install MAGMA by running make, make shared and make install in a console from the MAGMA directory.

Second, download the toolbox archive from its repository https://pypi.python.org/pypi/hpelm or the latest version from Github (https://github.com/akusok/hpelm), extract it, and go to the sub-folder ./hpelm/acc. It contains accelerated code which must be compiled. To get the compilation flags, add your MAGMA library to the pkg-config path, or use the same flags MAGMA used to compile its tutorial files during installation. To compile the accelerated ELM library, run python setup_gpu.py build_ext --inplace, replacing _gpu by _ocl for OpenCL MAGMA or by _mic for Xeon Phi MAGMA. You can test the acceleration by running python try_gpu.py from the same folder. After that, go to the root directory of the toolbox and install the now-accelerated toolbox with python setup.py install.

B. BIG DATA VERSUS SMALL DATA
Based on the number of training samples and the underlying model complexity, all machine learning tasks can be separated into two categories: big data and small data. In big data, the number of samples is enough to learn the model accurately without over-fitting, but the training time is a limiting factor. For small data, there are not enough samples to learn the underlying model exactly, thus a model structure selection is necessary to find an optimal model complexity.

Training a small data model is computationally intensive, but the whole data is kept in the working memory for quick access. The big data training algorithm relies on iterative processing of small chunks of data (data which normally does not fit into memory), but with a huge number of training samples there is no need for a model structure selection process (a larger model provides better performance). Processing small data which does not fit into memory is not implemented, because a typical server node has up to 256-512 GB RAM, and anything larger would certainly be limited by the computational speed. Big data for an easy problem which fits into memory is solved by either of the first two methods.

C. OUT-OF-MEMORY ACCELERATED BIG DATA ELM
The HP-ELM toolbox for big data is provided by the hpelm.HPELM class. All data is stored in the HDF5 format (http://www.hdfgroup.org/HDF5/). The toolbox takes names of HDF5 files as inputs and outputs; thus the size of the processed data is limited only by disk capacity. The HDF5 file format provides fast and convenient access to huge data matrices on a hard drive as if they were in memory: data can be read from or written to any place of a matrix. It also supports transparent data compression, and is native to Matlab®. The convention is to store only one matrix per HDF5 file. Additional utility functions make_hdf5 and normalize_hdf5 create HDF5 files from text/csv or matrix data, and normalize these files.

The HPELM class supports GPU or Xeon Phi acceleration. The accelerated functions are provided by the MAGMA library, an accelerated linear algebra library similar to LAPACK. It must be compiled by the user to get the acceleration, but it supports all brands of GPUs as well as Xeon Phi accelerators. The accelerated parts are the computation of the correlation matrices of the BLUE ELM solution, and the calculation of β. These two operations take more than 95% of the runtime for ELMs with very large numbers of hidden neurons.

The ELM solution is computed iteratively by reading chunks of data from HDF5 files. Only the Ω_h, Ω_t and β matrices are stored in memory. The large matrix H is never obtained explicitly. Targets for new inputs are also predicted iteratively and saved into an HDF5 file, and the error is computed iteratively. This makes the Big Data ELM independent of the number of samples in the dataset, so even the largest problems can be solved on a workstation with a GPU.

HPELM has one model structure selection function, which tests different numbers of hidden neurons on a validation set. It takes the pre-computed Ω_h, Ω_t as an input, and creates solutions β_k for different k ∈ [3, L] spaced equally on a logarithmic scale. Then the validation data is projected iteratively, and the errors for all values of k are computed from
the same projected data. This function does the most time-consuming part of the process, projecting the data (see section V-E), only once. The optimal number of hidden neurons is chosen by the minimum validation error.

D. MODEL STRUCTURE SELECTION FOR SMALL DATA ELM
The small data support in the HP-ELM toolbox is provided by the hpelm.ELM class. It has three types of model structure selection alternatives: with a validation set, with cross-validation, and with a LOO validation error computed by the PRESS statistic. All model structure selection methods find an optimal number of hidden neurons less than or equal to the current L. Neurons are ordered randomly, except when the L1 regularization is used. These methods remove the extra neurons from the model and re-calculate the solution.

Both L1 and L2 regularization are available in ELM. The L1 regularization is done by MRSR [36], a multi-output version of LARS [37]. It ranks the neurons starting from the most relevant to the problem. All model structure selection methods work better with such ranked neurons, with a trade-off of extra runtime.

The toolbox includes another method of performing L1-regularization, based on an updated MRSR algorithm [38]. The original MRSR includes a part with O(2^c) complexity w.r.t. the number of outputs c. It takes noticeable runtime with 10 outputs, and makes the method impractically slow with more than 15 outputs. The complexity of the updated version scales linearly with the number of outputs. It allows L1 regularization for a larger set of problems, including an auto-encoder for ELMs in image processing [22].

L2 regularization is a class parameter of ELM called alpha, which can be changed freely. A notable benefit of L2 regularization is making an ill-conditioned ELM solvable. One can use any single-variable optimization method to find an optimum value of the L2 parameter. The resulting method would be similar to TROP-ELM [29].
E. WHAT KIND OF DATA DOES HP-ELM SUPPORT?
The ELM supports matrices (second order tensors) as inputs, and HPELM uses names of HDF5 files as inputs. The utility function make_hdf5 creates an HDF5 file from a matrix or a text/csv file.

F. WHAT ABOUT CLASSIFICATION?
The HP-ELM toolbox supports three kinds of classification: multi-class (one correct class for each sample), multi-label (an arbitrary number of correct classes for each sample) and weighted multi-class (each class has a weight; it is independent of the number of samples in a class). ELM targets must have one feature per class (binary classification is a two-class multi-class classification), where the true class(es) for a sample are set to one and irrelevant classes are set to zero. This convention is required for the correct work of the classification error and model structure selection. Classification is set with an argument while training; see section IV-H below.
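For instance, integer class labels can be encoded into such targets with a few lines of NumPy (an illustrative sketch, not a toolbox function):

import numpy as np

labels = np.array([0, 2, 1, 2])           # integer labels for c = 3 classes
T = np.zeros((labels.size, 3))
T[np.arange(labels.size), labels] = 1.0   # one column per class; 1 marks the true class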
G. HOW TO CREATE AN ELM?
An ELM is an object of the ELM or HPELM class. The two mandatory parameters are the numbers of input and output features. The HPELM also accepts a batch size for iterative processing, and a type of accelerator.

An ELM is created without any neurons. Neurons are added with the elm.add_neurons function. It has two mandatory parameters: the number of neurons and their type, and two optional ones: a projection matrix W and a bias vector b. The types of neurons are the following: lin, sigm, tanh, rbf_l1, rbf_l2, rbf_linf. For RBF neurons, W holds the coordinates of the RBF centers and b holds the corresponding kernel widths. Multiple different types of neurons can be added to a single ELM.
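A short hedged usage sketch of this API; the shapes and the explicit W, b for the linear part are illustrative, and passing W and b positionally is an assumption:

import numpy as np
import hpelm

elm = hpelm.ELM(9, 1)            # 9 input features, 1 output
elm.add_neurons(50, 'sigm')      # 50 sigmoid neurons, random W and b
elm.add_neurons(20, 'tanh')      # several neuron types can coexist
W = np.eye(9)                    # explicit weights that copy each feature
b = np.zeros(9)
elm.add_neurons(9, 'lin', W, b)  # linear neurons with given W and b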
H. HOW TO TRAIN AN ELM?
The train function provides a universal wrapper for training an ELM. The two mandatory parameters are the data samples X and targets T, while optional arguments and keyword arguments specify the selected way of training (a usage sketch follows the list):

- "V" — perform model structure selection using a validation error; requires the keyword arguments Xv and Tv for the validation dataset
- "CV" — perform model structure selection using a cross-validation error; the optional keyword argument k gives the number of data splits (k ≥ 3)
- "LOO" — perform model structure selection using the PRESS LOO error
- "OP" — perform L1 regularization that ranks neurons starting from the most useful one; works with any model structure selection
- "c", "mc", "wc" — use the classification multi-class/multi-label/weighted multi-class error instead of MSE, see the explanations above; "wc" requires the keyword argument w for the class weights vector.
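A hedged example of the validation flag (random placeholder data; the Xv and Tv keyword names are those given above):

import numpy as np
import hpelm

X, T = np.random.rand(1000, 9), np.random.rand(1000, 1)
Xv, Tv = np.random.rand(200, 9), np.random.rand(200, 1)

model = hpelm.ELM(9, 1)
model.add_neurons(100, 'sigm')
model.train(X, T, 'V', Xv=Xv, Tv=Tv)  # neuron count chosen on the validation set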
I. HOW TO TRAIN A LARGE ELM IN PARALLEL?
For an ELM with a large number of neurons trained on a huge dataset, almost all of the running time is spent on computing Ω_h. Fortunately, the computation of Ω_h is conveniently parallel: a large dataset can be split into n parts, and the matrices Ω_h^i, Ω_t^i, i ∈ [1, n] can be computed simultaneously for all the parts of the dataset using the same ELM parameters (loading the same ELM model). The results are combined by a simple summation: Ω_h = Σ_{i=1}^{n} Ω_h^i, Ω_t = Σ_{i=1}^{n} Ω_t^i. The output weights β will then take seconds to compute.

To perform ELM training in parallel, first split the data into multiple parts and store them in the HDF5 format required by HPELM. Then compute the partial matrices Ω_h^i, Ω_t^i using the function HPELM._project on each data part separately. This operation takes the most runtime, and is easy to run in parallel. Save the outputs on disk as they are computed. When all partial matrices are ready, obtain the final correlation matrices by summation: Ω_h = Σ_i Ω_h^i, Ω_t = Σ_i Ω_t^i. The output weights β are computed from Ω_h, Ω_t by the function HPELM._solve_corr.
To validate multiple different numbers of hidden neurons efficiently, use the function HPELM.train_hpv with the pre-computed Ω_h, Ω_t and a validation data set. It outputs the errors for each of the given numbers of neurons, and solves for an optimal number of neurons.

J. HOW TO USE A TRAINED ELM?
The predict function takes inputs X and returns the corresponding calculated outputs Y. It works only on a trained ELM. For HPELM, the second input gives an HDF5 file name for Y, where the predicted outputs are written, and the function returns nothing. ELM predictions Y are always real numbers; predicted classes are found by taking the maximum number (multi-class) or by applying a threshold Y > 0.5 (multi-label).
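Turning the real-valued predictions into class decisions is a one-liner in NumPy (illustrative sketch):

import numpy as np

Y = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.8]])  # example real-valued ELM outputs
multi_class = Y.argmax(axis=1)   # index of the largest output per sample
multi_label = Y > 0.5            # boolean mask of predicted labels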
K. HOW TO GET AN ERROR OF AN ELM?
The error of model predictions is given by the error function of ELM, which takes the true targets T and the predicted targets Y as inputs. It uses the same classification settings as the ones used for training, if any. For HPELM, the error function takes the file names of HDF5 files containing the T and Y matrices.
L. THREE EXAMPLES OF HP-ELM TOOLBOX
Below are three examples of running the ELM and HPELM toolboxes in Python, with the data obtained from Matlab®. The input data has 9 features and the output has one. Example ELMs using 100 sigmoid and 9 linear neurons are given. If the data is already in Python, one can skip the import-from-Matlab® section.

Matlab® section for Examples 1 and 2. Here four variables: x, y, xtest and ytest are saved as comma separated values (.csv files).

csvwrite('x.csv', x)
csvwrite('t.csv', y)
csvwrite('xtest.csv', xtest)
csvwrite('ttest.csv', ytest)
Example 1, Python part. Here the .csv files are converted to HDF5 ones in Python, and an HPELM is trained with those files. Training and test errors are printed.

import hpelm

hpelm.make_hdf5('x.csv', 'x.h5', delimiter=',')
hpelm.make_hdf5('t.csv', 't.h5', delimiter=',')
hpelm.make_hdf5('xtest.csv', 'xtest.h5', delimiter=',')
hpelm.make_hdf5('ttest.csv', 'ttest.h5', delimiter=',')

model = hpelm.HPELM(9, 1)
model.add_neurons(100, 'sigm')
model.add_neurons(9, 'lin')

model.train('x.h5', 't.h5')
model.predict('x.h5', 'y.h5')
print(model.error('y.h5', 't.h5'))

model.predict('xtest.h5', 'ytest.h5')
print(model.error('ytest.h5', 'ttest.h5'))

Example 2, Python part. Here the ELM model is trained with different model structure selections. The previously created .csv files are loaded into Python and normalized to zero mean and unit variance. Then a basic ELM is trained, printing the training and test error. After that, a 10-fold cross-validation is used to reduce the number of neurons, showing an updated test error and the selected neurons in the model. Finally, the model is re-trained using L1 regularization (the OP parameter), showing again the re-calculated test error and model neurons.

import hpelm
import numpy

x = numpy.loadtxt('x.csv', delimiter=',')
t = numpy.loadtxt('t.csv', delimiter=',')
xtest = numpy.loadtxt('xtest.csv', delimiter=',')
ttest = numpy.loadtxt('ttest.csv', delimiter=',')

xx = (x - x.mean(0)) / x.std(0)
tt = (t - t.mean(0)) / t.std(0)
xxtest = (xtest - x.mean(0)) / x.std(0)
tttest = (ttest - t.mean(0)) / t.std(0)

model = hpelm.ELM(9, 1)
model.add_neurons(100, 'sigm')
model.add_neurons(9, 'lin')

model.train(xx, tt)
tth = model.predict(xx)
print(model.error(tt, tth))
yytest = model.predict(xxtest)
print(model.error(tttest, yytest))

model.train(xx, tt, 'CV', k=10)
yytest = model.predict(xxtest)
print(model.error(tttest, yytest))
print(str(model))

model.train(xx, tt, 'LOO', 'OP')
yytest = model.predict(xxtest)
print(model.error(tttest, yytest))
print(str(model))

Matlab® section for Example 3. The training data x and y is saved as HDF5 files using built-in Matlab® functions. Note the transpose operation, as Matlab® uses Fortran matrix ordering by default for HDF5 files.

h5create('x.h5', '/data', size(x'));
h5create('t.h5', '/data', size(y'));
h5write('x.h5', '/data', x');
h5write('t.h5', '/data', y');

Python part for Example 3. An HPELM is built and trained using the HDF5 files created by Matlab®.

import hpelm

model = hpelm.HPELM(9, 1)
model.add_neurons(100, 'sigm')
model.add_neurons(9, 'lin')
model.train('x.h5', 't.h5')
model.predict('x.h5', 'y.h5')
print(model.error('y.h5', 't.h5'))
M. HOW TO USE GAUSSIAN (RBF) NEURONS?
The ELM toolbox has Gaussian neurons. Centroids are given instead of a projection matrix W, and kernel widths are given in the bias vector b. There are three kinds of distance functions: L2 (Euclidean), L1 and L∞. They are chosen by the type of neurons: rbf_l2, rbf_l1 or rbf_linf, correspondingly. RBF neurons are about 10 times slower to compute than sigmoid ones, even though the computation is parallelized.
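A hedged sketch of adding RBF neurons, with the centroids taken as a random subset of training samples and a constant kernel width; both choices, and the column-wise orientation of the centroid matrix, are assumptions for illustration:

import numpy as np
import hpelm

X = np.random.rand(1000, 9)   # placeholder training inputs
idx = np.random.choice(len(X), 30, replace=False)
centers = X[idx].T            # centroids as columns, like a d x L matrix W (assumed)
widths = np.ones(30)          # one kernel width per neuron

elm = hpelm.ELM(9, 1)
elm.add_neurons(30, 'rbf_l2', centers, widths)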
N. MY ELM SOLUTION DOES NOT EXIST!
An ELM may not converge if there are only a few input features (2-3) combined with a large number of hidden neurons, if the data features are strongly correlated and not independent, or if the number of data samples is close to the number of hidden neurons. In these cases, the matrix Ω_h will be almost singular, and its inverse is numerically unstable.

The numerical stability problem is solved by increasing the value of the L2 regularization parameter α (an ELM parameter called alpha). The default value of α = 10^-9 can be increased up to 10^-2 or higher. This reduces the effective number of parameters in the model. The regularization parameter α should be increased if the output matrix β has elements with a large magnitude (larger than 10^2 ... 10^3). However, it is not worthwhile to increase the parameter excessively, as this may reduce the accuracy of ELM predictions.
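A hedged two-line fix, assuming the parameter is exposed as the class attribute alpha described in section IV-D:

import hpelm

model = hpelm.ELM(9, 1)
model.alpha = 1e-2  # raise L2 regularization from the 1e-9 default (assumed attribute)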
V. EXPERIMENTAL RESULTS

A. DATASETS
The HP-ELM toolbox is tested in three scenarios: regular datasets with regularization, large datasets, and a Big Data problem.

Small datasets are 11 regression and 4 classification problems from the University of California at Irvine (UCI) Machine Learning Repository [39]. Ten different permutations of the datasets are taken without replacement, and for each of them 2/3 of the data is used for training and 1/3 for testing. The data permutations are obtained from an author of the OP-ELM [28] paper, exactly as they are used there. Comparison results for Support Vector Machines [40] (SVM), Multilayer Perceptron [41] (MLP) and Gaussian Processes [42] (GP) are taken from the same article [28]. Small datasets are tested on ELMs with and without model structure selection.
Large datasets are 6 relatively large datasets, available from the UCI Machine Learning Repository, with clear prediction targets. They are: the Banana dataset of two banana species; the Adult dataset of people with annual income below/above $50,000; the MNIST handwritten digits dataset for classification of 10 digits based on their image representation; the Record Linkage dataset for detecting duplicate person records, with 5.5 million samples; and the HIGGS dataset for detecting processes which produce a Higgs boson or not, with 11 million samples (one of the largest UCI datasets available). Each dataset is split into training and test parts (respecting the guidelines where applicable), stored in the HDF5 file format, and normalized to zero mean and unit variance for all features. Categorical features from the Adult dataset are encoded as binary inputs (one per category); these are not normalized. Large datasets are tested without model structure selection, but multiple ELMs are built with different numbers of hidden neurons.
The Big Data is obtained from a Face/Skin Detection dataset [43]. It consists of 4000 photos of people with hand-made masks for skin and faces, under various lighting conditions, surroundings and human skin colors. Skin occupies roughly 20% of the pixels in all images. The dataset is separated into 2000 training and 2000 test images. The problem is to classify each image pixel as skin or non-skin. The dataset inputs are the RGB color values of a 7 × 7 pixel mask centered on the classified pixel. The 3-pixel-wide boundaries of the images are omitted. There are 7 × 7 × 3 (RGB) = 147 features and 10^9 (one billion) data samples in total. This gives a 1.1 TB dataset in HDF5 format when stored in double precision. Two separate datasets for all training and all test samples are created from the training/test images. The data features (color values of pixels) are normalized to zero mean and unit variance. A single ELM is trained with 19,000 neurons, limited by the available GPU memory. Performance is tested on differently sized subsets of these 19,000 neurons, as explained in section IV-I.

B. PARAMETERS OF AUTOMATICALLY GENERATED RANDOM WEIGHTS
The sigmoid function is a common choice of non-linear transformation function for the hidden nodes of an ELM. However, it is sensitive to the range of the input weights, which determine XW + b. If the inputs to the sigmoid function have small magnitude, it performs similarly to a linear function. If these inputs have very large magnitude, it performs as a cutoff value. The effect can be seen by checking the difference between the predictions of SLFNs with the same weights and sigmoid/linear/threshold transformation functions. If the input data has zero mean and unit variance, the range of the inputs to the SLFN is governed by the range of the weights W generated from W = N(0, s) with different values of the standard deviation s. The effect is shown in Figure 3. The range of the weights W also affects the performance, as shown in Figure 4.

FIGURE 3. Mean squared error difference of the predictions of SLFNs with 5 hidden neurons for the Iris dataset (averaged over 100 runs), for different values of s in W = N(0, s). For small s, the outputs of a sigmoid SLFN are similar to a linear SLFN, and for large s they are similar to a threshold SLFN.

Another issue is an increase in the standard deviation of the inputs to the transformation function if the dataset has high dimensionality. For a single additive hidden layer neuron, the input to the transformation function a_k = Σ_{i=1}^{d} x_i w_{i,k} is a sum of d components. If a single component has standard deviation s, then that sum has a larger standard deviation, √d·s. This leads to larger magnitudes of the inputs to the transformation function and sub-optimal performance for large d, for example in the MNIST dataset (see Figure 5). The effect of large input dimensionality is fixed by dividing the
TABLE 2. Mean Squared Error (bold) and runtime in seconds for the regression datasets. Results denoted by * are computed with 500 hidden neurons, as suggested by pruning.

TABLE 3. Accuracy in % (bold) and runtime in seconds for the classification datasets.

numbers of neurons the runtime difference is even less. Thus a medium-size ELM model can be trained fast even on a common laptop with a low-power CPU.

FIGURE 7. Test errors (a) and runtimes (b) on different hardware (c) for large datasets, on a logarithmic scale. Runtime on different hardware is shown for two datasets only, for image clarity. The 4-core CPU runs at 4 GHz, the 2x8-core CPU runs at 2.6 GHz, the 2-core laptop CPU runs at 1.4 GHz, and the GPU is a GTX Titan Black (similar to a Tesla K40).

FIGURE 8. Training (a) and prediction (b) runtimes of a basic ELM for the MNIST dataset with 64 neurons. ELM predictions are obtained on the same training dataset for comparable runtimes.

TABLE 4. Training time of an ELM with 19,000 hidden neurons on 0.5 billion samples with 147 features.

Test accuracy for different numbers of neurons is computed from a single matrix Ω_h, as explained in section IV-I. 100 different numbers of neurons are tested, spaced equally on a logarithmic scale from 3 to 19,000. For each number, the ELM output weights β are solved and a separate confusion matrix is computed. The classification results for skin and non-skin from these confusion matrices are shown in Figure 10. Getting this test accuracy plot took 60 hours: 13 hours to obtain the hidden layer output H, and 47 hours to compute the confusion matrices for all the 100 different numbers of hidden neurons.

The final test accuracy with 19,000 hidden neurons is 86.46%, and the confusion matrix is presented in Table 5.

TABLE 5. Test confusion matrix for the ELM with 19,000 neurons.

The test accuracy plot shows very good results for skin pixel classification, owing to the balanced classification method used. An ELM without class balancing would be heavily biased towards predicting non-skin pixels, which are 83% of the samples in the dataset. The improvement in skin classification accuracy slows down past 128 hidden neurons, so a smaller ELM can be used for detecting skin. However, the non-skin classification accuracy grows steadily up to the maximum number of neurons. This can be explained by a higher variety of non-skin pixel masks than skin ones. The ELM does not overfit even with 19,000 neurons, although the performance gain decreases at large numbers of hidden neurons.
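The trick of testing many neuron counts from one pre-computed Ω_h, Ω_t pair works because, with the neurons kept in a fixed order, the correlation matrices of the first k neurons are the leading sub-matrices of the full ones. A hedged sketch with illustrative names:

import numpy as np

def solutions_for_subsets(omega_h, omega_t, ks, alpha=1e-9):
    # Solve beta for several leading-neuron subsets from one Omega pair.
    betas = {}
    for k in ks:  # e.g. 100 counts spaced logarithmically from 3 to 19000
        A = omega_h[:k, :k] + alpha * np.eye(k)  # Omega_h of the first k neurons
        betas[k] = np.linalg.solve(A, omega_t[:k])
    return betas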
[33] C. R. Rao and S. K. Mitra, "Generalized inverse of a matrix and its applications," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability: Theory of Statistics, vol. 1. Berkeley, CA, USA: Univ. California Press, 1972, pp. 601-620. [Online]. Available: http://projecteuclid.org/euclid.bsmsp/1200514113
[34] A. N. Tikhonov, "Solution of incorrectly formulated problems and the regularization method," Soviet Math. Doklady, vol. 4, pp. 1035-1038, 1963.
[35] A. E. Hoerl and R. W. Kennard, "Ridge regression: Applications to nonorthogonal problems," Technometrics, vol. 12, no. 1, pp. 69-82, 1970.
[36] T. Similä and J. Tikka, "Multiresponse sparse regression with application to multidimensional scaling," in Proc. 15th Int. Conf. Artif. Neural Netw., Formal Models Appl. (ICANN), 2005, pp. 97-102. [Online]. Available: http://dl.acm.org/citation.cfm?id=1986079.1986097
[37] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, no. 2, pp. 407-499, 2004.
[38] T. Similä and J. Tikka, "Common subset selection of inputs in multiresponse regression," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2006, pp. 1908-1915.
[39] M. Lichman. (2013). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
[40] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. ID 27.
[41] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), vol. 4, M. Jordan, J. Kleinberg, and B. Schölkopf, Eds. New York, NY, USA: Springer-Verlag, 2006. [Online]. Available: http://www.library.wisc.edu/selectedtocs/bg0137.pdf
[42] C. E. Rasmussen, "Gaussian processes in machine learning," in Advanced Lectures on Machine Learning (Lecture Notes in Computer Science), O. Bousquet, U. von Luxburg, and G. Rätsch, Eds. Berlin, Germany: Springer-Verlag, 2004, pp. 63-71. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-28650-9_4
[43] S. L. Phung, A. Bouzerdoum, and D. Chai, "Skin segmentation using color pixel classification: Analysis and comparison," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 1, pp. 148-154, Jan. 2005.

KAJ-MIKAEL BJÖRK received the master's and Ph.D. degrees in chemical engineering from Åbo Akademi University, in 1999 and 2002, respectively, and the Ph.D. degree in business administration (information systems) from Åbo Akademi University, in 2006. He was a Visiting Researcher with Carnegie Mellon University, Pittsburgh, USA, in 2000, the University of Linköping, Sweden, in 2001, and UC Berkeley, CA, USA, from 2005 to 2006. Before working as the Head of Department, he was a Principal Lecturer in Logistics (Arcada) and an Assistant Professor in Information Systems with Åbo Akademi University. He has held approximately 15 different courses in the fields of logistics and management science and engineering. Within research projects, he has participated in approximately 60 scientific peer-reviewed articles and has an h-index of 10 (according to Google Scholar). His research interests are in information systems, analytics, supply chain management, machine learning, fuzzy logic, and optimization.

YOAN MICHE was born in France in 1983. He received the Engineering degree from TELECOM, Institut National Polytechnique de Grenoble (INPG), France, in 2006, the master's degree in signal, image and telecom from ENSERG, INPG, in 2006, and the Ph.D. degree in computer science and signal and image processing from the Aalto University School of Science and Technology, Finland, and INPG. His main research interests are anomaly detection and machine learning for classification/regression.