

Artificial neural networks in multivariate calibration

Tormod Næs, Knut Kvaal, Tomas Isaksson and Charles Miller^a

MATFORSK, Norwegian Food Research Institute, Osloveien 1, N-1430 Ås, Norway.
^a Present address: Du Pont Polymers, Industrial Polymers Research, PO Box 1089, Orange, TX 77631-1089, USA.

This paper is about the use of artificial neural networks for multivariate calibration. We discuss network
architecture and estimation as well as the relationship between neural networks and related linear and non-linear
techniques. A feed-forward network is tested on two applications of near infrared spectroscopy, both of which
have been treated previously and which have indicated non-linear features. In both cases, the network gives more
precise prediction results than the linear calibration method of PCR.
Keywords: Artificial neural network, back-propagation, feed-forward network, PCR, PLS, near infrared spectroscopy,
multivariate calibration.

Introduction

This paper addresses the use of artificial neural networks (ANN) in multivariate calibration, with special emphasis on their performance in near infrared (NIR) spectroscopy. In particular, we will focus on so-called feed-forward networks that use back-propagation (BP) for estimation. This topic has been addressed by other researchers,1,2,3 but there are still many unanswered questions about, for instance, prediction properties in practical NIR applications.

The first half of the paper is devoted to a discussion of the choice of network and learning (estimation) rule, and a comparison between neural networks and other statistical methods such as partial least squares (PLS), principal component regression (PCR) and non-linear PLS. This discussion will also serve as a background and explanation for the problems considered in the second part, which is an empirical investigation of the performance of ANNs in NIR spectroscopy.

Previous comparisons between the ANN method and other techniques have given quite different results with respect to prediction ability in NIR analysis. For example, in Long et al.1 the ANN method was compared to PCR and the latter gave the best prediction results, while in Borggaard and Thodberg2 a special application of the ANN method gave much better results than both PCR and PLS. The main goal of this paper is to provide more information about the prediction ability of the ANN method in practice, and also to compare it with the performance of a standard linear technique, namely PCR. In this way we want to contribute to the discussion about the performance of non-linear calibration techniques in NIR spectroscopy. We will also investigate how ANNs should be used in order to give as good prediction results as possible. For more theoretical discussions regarding approximation properties of ANN network models and statistical convergence properties of estimation algorithms, we refer to References 4 and 5. For a general background discussion of the ANN method, we refer to References 6, 7 and 8.

© NIR Publications 1993, ISSN 0967-0335


Figure 1. An illustration of a simple feed-forward network with one hidden layer and one output node.

Choice of network

Feed-forward networks

The field of neural networks covers a wide range of different network methods which are developed for and applied to very different situations. In particular, the feed-forward network structure is suitable for handling non-linear relationships between input and output variables,6 when the focus is prediction. Reference 6 also discusses the differences between feed-forward and other networks. In this work we will confine ourselves to discuss some of the main features of the feed-forward network and how it can be fitted to data.

A feed-forward network is a function in which the information from the input data goes via intermediate variables to the output data. The input data (X) is frequently called the input layer and the output data (Y) is referred to as the output layer. Between these two layers are the hidden variables, which are collected in one or more hidden layers. The elements or nodes in the hidden layers can be thought of as sets of intermediate variables analogous to the latent variables in bilinear regression (e.g. PLS and PCR). An illustration of a feed-forward network with one output variable and one hidden layer is given in Figure 1. The information from all input variables goes to each of the nodes in the hidden layer, and all hidden nodes are connected to the single variable in the output layer in each case. The contributions from all nodes or elements are multiplied by constants and added before a possible transformation takes place within the node. The transformation is in practice often a sigmoid function, but can in principle be any function. The sigmoid signal processing in a node is illustrated in Figure 2.

Figure 2. An illustration of the signal processing in a sigmoid function.

The feed-forward neural network in Figure 1 gives a regression equation of the form

y = f\left( \sum_{i=1}^{I} b_i \, g_i\!\left( \sum_{j=1}^{J} w_{ij} x_j + a_{i1} \right) + a_2 \right) + e    (1)

where y is the output variable, the x's are the input variables, e is a random error term, g_i and f are functions and b_i, w_ij, a_i1 and a_2 are constants to be determined. The constants w_ij are the weights that each input element must be multiplied by before their contributions are added in node i in the hidden layer. In this node, the sum over j of all elements w_ij x_j is used as input to the function g_i. Then, each function g_i is multiplied by a constant b_i before summation over i. At last the sum over i is used as input for the function f. More than one hidden layer can also be used, resulting in a similar, but more complicated function. Note that for both the hidden and the output layer, there are constants, a_i1 and a_2, respectively, that are added to the contribution from the rest of the variables before the transformations take place. These constants play the same role as the intercept (constant) term in a linear regression model.
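As a minimal illustration, the model in Equation 1 can be evaluated directly in code. The sketch below shows only the functional form: it assumes a sigmoid g_i, an identity f and arbitrary (untrained) parameter values; in practice the parameters are estimated as described below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W, a1, b, a2, g=sigmoid, f=lambda z: z):
    """Evaluate Equation 1 for one input vector x.

    x  : (J,)   input variables x_j
    W  : (I, J) weights w_ij from input j to hidden node i
    a1 : (I,)   hidden-layer constants a_i1
    b  : (I,)   output weights b_i
    a2 : scalar output-layer constant a_2
    g, f : hidden and output transfer functions
    """
    hidden = g(W @ x + a1)      # node i receives sum_j w_ij x_j + a_i1
    return f(b @ hidden + a2)   # the sum over i plus a_2 goes through f

# Tiny example with arbitrary (untrained) parameters
rng = np.random.default_rng(0)
x = np.array([0.2, 0.5, 0.1])
W = rng.normal(size=(4, 3))
y_hat = feedforward(x, W, a1=np.zeros(4), b=np.ones(4), a2=0.0)
```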
As can be seen from Equation 1, an artificial feed-forward neural network is simply a non-linear parametric model for the relationship between y and all the x-variables. There are functions g_i and f that have to be selected6 and parameters w_ij and b_i that must be estimated from the data. The process by which these parameters are determined is, in the terminology of artificial neural computing, called "learning". The best choice for g_i and f can in practice be found by trial and error, in that several options are tested and the functions that result in the best prediction ability are selected. It has been shown that the class of models in Equation 1 is dense (for suitable choice of functions g_i and f) in the class of smooth continuous functions. This means that any continuous function relating y and x_1, ..., x_J can be approximated arbitrarily well by a function of this kind.4

As with linear calibration methods, network models must be constructed with consideration of two important effects: underfitting and overfitting.9 If a model that is too simple or too rigid is selected, underfitting is the result, and if a model that is too complex is used, overfitting can be the consequence. The optimal model complexity is usually somewhere between these two extremes. Techniques have been developed that simultaneously estimate the complexity (which is related to the number of hidden nodes) and the model parameters,10 but this subject is not covered here.

The back-propagation learning rule

The learning rule that is most frequently used for feed-forward networks is the back-propagation technique.6 In back-propagation, the objects in the calibration set are presented to the network one by one in random order, and the weights w_ij and b_i (regression coefficients) are updated each time in order to make the current prediction error as small as possible. This process continues until convergence of the regression coefficients. Typically, at least 10,000 runs or training cycles are necessary in order to reach a reasonable convergence. In some cases many more iterations are needed, but this depends on the complexity of the underlying function and the choice of input training parameters.

More specifically, back-propagation is based on the following strategy: (1) subtract the output of the current observation from the output of the network, (2) compute the squared residual (or another function of it) and (3) let the weights for that layer be updated in order to make the squared output error as small as possible. The error from the output layer is then back-propagated to the hidden layer and the same procedure is repeated for the w_ij's. The differences between the updated and the previous parameters are dependent on so-called learning constants, which can have a strong influence on the speed of convergence of the network. Such learning constants are usually dynamic, which means that they decrease as the number of iterations increases (see the examples below). The constants can be optimised for each particular application. In some cases, an extra term ("momentum") is added to the standard back-propagation formulae in order to improve the efficiency of the algorithm. We refer to Reference 8 for details.

Although this kind of back-propagation seeks to minimise a least squares criterion, it is important to note that this kind of learning is different from standard least squares (LS) fitting, where the sum of squared residuals \sum (y - \hat{y})^2 is minimised over all parameter values b_i and w_ij. Theoretical studies of the convergence properties of ANNs can be found in Reference 10. Extensions of the back-propagation technique described above have been developed in order to make it more similar to the standard LS fitting of all objects simultaneously.8
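A minimal sketch of this sample-by-sample updating, assuming a single sigmoid hidden layer, an identity output function and fixed learning and momentum constants; the NeuralWorks implementation used in the examples below is more elaborate, so this is only meant to show the structure of one back-propagation step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, a1, b, a2, lr=0.5, momentum=0.0, state=None):
    """One back-propagation update for a single calibration object (x, y)."""
    h = sigmoid(W @ x + a1)            # hidden-node outputs
    y_hat = b @ h + a2                 # identity output function
    err = y_hat - y                    # prediction error for this object

    # Gradients of the squared error 0.5 * err**2
    grad_b, grad_a2 = err * h, err
    delta_h = err * b * h * (1.0 - h)  # error back-propagated through the sigmoid
    grad_W, grad_a1 = np.outer(delta_h, x), delta_h

    # Momentum-smoothed steps (Table 1 gives the constants used in this work)
    if state is None:
        state = [np.zeros_like(W), np.zeros_like(a1), np.zeros_like(b), 0.0]
    for k, grad in enumerate((grad_W, grad_a1, grad_b, grad_a2)):
        state[k] = momentum * state[k] + lr * grad
    W, a1, b, a2 = W - state[0], a1 - state[1], b - state[2], a2 - state[3]
    return W, a1, b, a2, state
```

Presenting the calibration objects one by one in random order and repeating this step over many training cycles gives the learning procedure described above.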

A modification of the standard network

A modification of the standard model described in Equation 1 has been suggested in Reference 2. The idea of this modification is to add direct linear connections between the input and the output layer. This leads to the network model

y = f\left( \sum_{i=1}^{I} b_i \, g_i\!\left( \sum_{j=1}^{J} w_{ij} x_j + a_{i1} \right) + a_2 \right) + \sum_{j=1}^{J} d_j x_j + a_3 + e    (2)

The d's here are the constants that are multiplied by the input variables in the direct connections between the input and output layer, and the a's are constant terms. In this way, the network is composed of a linear regression part and a standard non-linear network part. As for the standard network, back-propagation is used for learning. In Reference 2 this modified network performed well on near infrared data.
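A corresponding sketch of Equation 2, reusing the sigmoid hidden layer from above and treating the direct-connection weights d_j and the constant a_3 as given; only the extra linear term distinguishes it from the standard network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_direct(x, W, a1, b, a2, d, a3):
    """Equation 2: standard network plus a direct linear part sum_j d_j x_j + a_3."""
    nonlinear_part = sigmoid(W @ x + a1) @ b + a2   # standard non-linear network part
    linear_part = d @ x + a3                        # direct input-to-output connections
    return nonlinear_part + linear_part
```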
Relations to other methods

The artificial neural network models have strong relations to other statistical methods that have been used frequently in the past. In this section, however, we will only focus on a limited set of relations to methods that are frequently discussed and used in the chemometric and NIR literature. For a discussion of more statistically oriented aspects of such relations, we refer to References 5 and 10.

Multiple linear regression (MLR)

MLR is based on the following model for the relationship between y and x:

y = a + \sum_{i=1}^{I} b_i x_i + e    (3)

where the a and b_i's are regression coefficients to be estimated. We see that this model is identical to a special version of model Equation 1 in which the hidden layer is removed and the transfer function is set equal to the identity function. It is also easy to see that if we replace both f and g_i in Equation 1 by identity functions, we end up with a model which is essentially the same as the MLR model. This is easily seen by multiplying the w_ij and b_i values and renaming the products. This procedure is called reparameterisation of the network function.

However, in the learning procedure, the MLR equation is most frequently fitted by using standard LS instead of applying, for example, the iterative fitting procedure described above. Below we will investigate differences between the empirical properties of these two estimation techniques.
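In code, the standard LS fit of Equation 3 is a one-line call; the small simulated data matrix below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                                   # 20 samples, 3 x-variables
y = 2.0 + X @ np.array([0.5, -1.0, 0.3]) + 0.1 * rng.normal(size=20)

# MLR: estimate a and the b_i of Equation 3 by ordinary least squares
X1 = np.column_stack([np.ones(len(y)), X])                     # add the intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
a_hat, b_hat = coef[0], coef[1:]
```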
Partial least squares regression and principal component regression

The regression equation for both PLS and PCR (with centred y and x) can be written as

y = \sum_{i=1}^{I} b_i \left( \sum_{j=1}^{J} w_{ij} x_j \right) + e = \sum_{i=1}^{I} b_i t_i + e    (4)

where the w_ij's now correspond to the loading weights9 of factor i and the b_i's correspond to the regression coefficients of y on the latent variables t_i = \sum_j w_{ij} x_j (i.e. the y-loadings). It can be seen that, apart from the non-linear g_i and f in the feed-forward network Equation 1, the two equations (1) and (4) are identical. This means that the model equation for PLS and PCR is a special case of an ANN equation with linear g_i and f.

However, for the learning procedure, PCR and PLS are quite different from ANN. The idea behind the PCR/PLS methods is that the weights w_ij are determined with a strong restriction on them in order to give as relevant scores for prediction as possible.9 In PLS, the w_ij's are the PLS loading weights found by maximising the covariance between y and linear functions of x, and in PCR they are the principal component loadings obtained by maximising the variance of linear functions of x. In other words, estimation of b_i and w_ij is not done by fitting Equation (4) as well as possible, but through the use of quite a different rule. If both b_i and w_ij in Equation 4 were determined by LS directly, the ordinary MLR method would be the result, since Equation 4 is only a reparameterisation of the linear regression equation. Since the ANN method concentrates on fitting without restrictions (on the parameters), there are reasons to believe that ANNs are more sensitive to overfitting than are PLS and PCR. As a compromise, it has been proposed to use principal components as input for the network instead of the original variables themselves.2 This procedure may, however, have some drawbacks, since it restricts the network input to only a few linear combinations of the original spectral values. This problem is studied in the application section below.

In PCR and PLS models, the scores t_i and loading weights w_ij are orthogonal (for each j). This is not the case for ANN models, in which the w_ij's are determined without this restriction. The orthogonality properties of PLS and PCR are strong reasons for their usefulness in the interpretation of data, and this same argument can be used against the present versions of ANN.
Projection pursuit (PP) regression

The PP regression method11 is based on the equation
y = \sum_{i=1}^{I} s_i\!\left( \sum_{j=1}^{J} w_{ij} x_j \right) + e    (5)

for the relation between y and the x-variables. In this case, the s_i's are smooth functions of the linear function t_i = \sum_j w_{ij} x_j. Again, we see that the model Equation 5 is similar to an ANN model with linear f and g_i replaced by s_i. Notice also that since the s_i's have no underlying functional assumption, PP regression is based on an even more flexible function than the ANN in Equation 1. Note also that since the s_i's are totally general except that they are smooth, we no longer need the constants a_i1 and a_2.

Regarding the estimation of the weights w_ij and functions s_i, the PP regression is based on a least squares strategy. First, the constants w_ij and the smooth functions s_i are determined, then the effect of this factor is subtracted and the same procedure is applied to the residuals. This continues until the residuals have no systematic effects left. For each factor, the s_i and w_ij are found by an iterative procedure. The s functions are usually determined using moving averages or local linear fits to the data.

The PP regression method as described here is more flexible than the class of ANN functions, and the optimal functions s_i and constants w_ij are determined simultaneously. This procedure is very different from the ANN procedure, in which different functions g and f have to be tried independently. However, one main drawback with PP regression is that it gives no closed form solution for the prediction equation, only smooth fits to samples in the calibration set. Therefore, prediction of y values for new samples must be based on linear or other interpolations between the calibration points.

The relationship between the PP and ANN methods is treated in more detail in Maechler et al.5

Non-linear PLS regression

In 1990, Frank12 published a method called non-linear PLS, which is a kind of hybrid of PP regression and PLS. The technique is essentially based on the same model as PP regression, but the constants w_ij are determined the same way as the PLS loading weights are determined from residual matrices after all previous factors have been subtracted. Thus, this version of non-linear PLS is based on a similar, but more flexible model than the ANN model, but the estimation of coefficients is carried out under a strong restriction, namely the PLS criterion based on maximisation of covariance between y and linear functions of x variables.

The non-linear PLS method proposed in Wold13 is also essentially based on a model of the same kind as the PP regression. In this case, however, the smooth functions are approximated by splines, and the estimation of parameters follows a quite different rule than the procedure presented in Frank.12

Examples

In this section we will test the feed-forward ANN described by Equation 1 on two real situations in which it is desired to relate NIR spectral data (X) to a quality parameter (Y). Both datasets have been used elsewhere and the wavelengths (X-data) are strongly collinear.14 In this study, the main emphasis will be on the different aspects of prediction ability of the ANN method, considered as a function of the number of nodes in the hidden layer, the number of input variables (or principal components) and the number of iterations.

All prediction results below will be presented in terms of root mean square error of prediction (RMSEP), which is defined by

\mathrm{RMSEP} = \left( \sum_{i=1}^{I_p} (\hat{y}_i - y_i)^2 / I_p \right)^{1/2}    (6)

where I_p is the number of samples used in the prediction set. All reported results are obtained by testing the networks on samples that were not used for training of the network. All networks obtained in this paper are computed using NeuralWorks Professional II/plus (NeuralWare Inc., PA, USA). During the network training, all input and output values are scaled so that they are in the range 0-1 (for input) and 0.2-0.8 (for output), respectively (for arguments, see the manual of the program). After convergence of the network, all prediction results are transformed back to the original scale. The learning constants and momentum values that are used throughout this work are given in Table 1.
It is important to note that these constants are changed as the number of training cycles increases. Such a procedure is recommended (see the manual of the program) and is natural because it enables a sort of "fine-tuning" of the network as it approaches convergence. All non-linear networks used are based on a sigmoid transfer function.

Table 1. The learning constants of the neural network. For both the output and hidden layer we used the same learning constant and momentum value. The constants change as the number of iterations increases.

Number of iterations    0-2000    2000-20,000    20,000-40,000    40,000-100,000
Learning constant       0.9       0.5            0.25             0.1
Momentum                0.6       0.3            0.15             0.1
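The quantities used throughout the examples can be summarised in a short sketch: RMSEP as defined in Equation 6, the 0-1 / 0.2-0.8 range scaling applied before training, and the schedule of Table 1 written as a simple lookup. The exact scaling conventions of the program may differ in detail; this is only meant to fix the definitions.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, Equation 6."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def scale_range(v, lo=0.0, hi=1.0):
    """Linearly map a variable to [lo, hi] using its own min and max."""
    v = np.asarray(v, dtype=float)
    return lo + (hi - lo) * (v - v.min()) / (v.max() - v.min())

def learning_constants(iteration):
    """Learning constant and momentum from Table 1 for a given training cycle."""
    schedule = [(2000, 0.9, 0.6), (20000, 0.5, 0.3),
                (40000, 0.25, 0.15), (100000, 0.1, 0.1)]
    for upper, lr, mom in schedule:
        if iteration <= upper:
            return lr, mom
    return 0.1, 0.1   # beyond 100,000 iterations keep the last values

# Inputs scaled to 0-1 (per variable) and outputs to 0.2-0.8, as described above:
# X_scaled = np.column_stack([scale_range(col) for col in X.T])
# y_scaled = scale_range(y, 0.2, 0.8)
```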

Water predictions in meat

The data set used in this study contains the near infrared transmission (NIT) spectra of 103 different meat samples at 50 different wavelengths (X-variables) and the concentration of water in these samples as determined by oven-drying (Y-variable). Seventy samples are used for calibration and the remaining 33 samples are used for testing the performance of the calibration model. The data set is described in more detail in Reference 15. Regarding the X-data, both original NIT spectra and NIT spectra that are corrected for scattering effects (by the MSC method)16 will be considered.
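The MSC correction referred to here can be sketched as follows, assuming the usual formulation in which each spectrum is regressed on the mean spectrum of the calibration set and then corrected with the fitted offset and slope; Reference 16 gives the full treatment.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction of a (samples x wavelengths) matrix."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra)
    for k, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, deg=1)   # LS fit: s ~ offset + slope * ref
        corrected[k] = (s - offset) / slope
    return corrected, ref

# Prediction samples are corrected against the calibration-set reference spectrum:
# X_cal_msc, ref = msc(X_cal)
# X_test_msc, _ = msc(X_test, reference=ref)
```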

Prediction ability

The prediction results using the PCR method are given in Figure 3. The best result is RMSEP = 0.77, obtained by using eight principal components from scatter-corrected data. The RMSEP obtained for the PCR model that uses the maximum number of factors from the scatter-corrected data is equal to 1.65. In other words, PCR gave much better results than a full MLR based on all components. It can also be seen that scatter-corrected data gave clearly better results than uncorrected data.

Figure 3. PCR prediction error as a function of the number of components (meat data). The upper curve corresponds to uncorrected data and the lower curve corresponds to scatter-corrected data.

The prediction results obtained using non-linear ANN models with different numbers of hidden nodes are given in Table 2 for both uncorrected and scatter-corrected data. As will be discussed below, the use of 50,000 iterations was close to optimal for this data set, and all results reported in this section are therefore based on this number. As can be seen, the scatter-corrected data gave much better results than the uncorrected data, indicating the advantage of input data "prelinearisation", even for the non-linear ANN technique. The effect of using different numbers of nodes in the hidden layer is weak. For the scatter-corrected data the best results were obtained using 10 and 25 nodes, and for the uncorrected data the best results were obtained using 10, indicating that an intermediate number of hidden nodes is optimal. The best network gave about 9% better results than PCR.

Table 2. RMSEP results for a sigmoid ANN with the use of the 50 individual x-values as input (meat data). Both uncorrected and scatter-corrected data are used. All results are based on 50,000 iterations.

                      Number of nodes in hidden layer
                      1       10      25      50
Scatter-corrected     0.82    0.70    0.74    0.81
Uncorrected           3.95    2.28    2.26    2.27
Results obtained by using a sigmoid ANN method based on principal components instead of the original x-variables are presented in Table 3. A smooth graphical illustration of the results is presented in Figure 4. The table presents some results obtained using different numbers of principal components as input, and different numbers of nodes in the hidden layer. Scatter-corrected (MSC) data gave much better results than uncorrected data and therefore only the results using scatter-correction are reported. It is clear that it is best to use an intermediate number of PCs as input, namely eight, and also an intermediate number of nodes in the hidden layer, namely three (RMSEP 0.64). In other words, there is an effect of underfitting and an effect of overfitting both with respect to the number of nodes and the number of input components (variables). It is also worth noting that the results obtained using principal components as input are better than those obtained using individual x-responses as input, although the differences are small. The best ANN result was about 15% better than the best PCR result.

Table 3. RMSEP results for sigmoid ANNs computed on scatter-corrected meat data, using different numbers of principal components as input and different numbers of nodes in the hidden layer. 50,000 iterations were used.

                               Number of nodes in hidden layer
Number of input variables      1       3       8       15
3                              0.84    0.79    0.74    0.92
8                              0.72    0.64    0.68    0.66
15                             1.05    1.03    1.01    1.06
22                             1.32    1.31    1.16    1.37

Figure 4. A smooth illustration of the numbers in Table 3. The smoothing technique used is the default method of PROC G3GRID in SAS (SAS Institute, NC, USA).

For the scatter-corrected data, a linear transfer function network based on all input variables and with only one hidden node was also tested, and the RMSEP was equal to 0.89, which is somewhat larger than the value obtained for the best sigmoid network. For comparison, the linear ANN based on all available principal components gave a RMSEP of 1.66 after 50,000 iterations. (This result is, as expected, similar to the linear PCR based on all available components.) This difference in performance between the linear ANN method based on all original variables and the linear ANN method based on all principal components is somewhat surprising, since the two methods are based on essentially the same model. The most plausible explanation of the different performances must therefore be that the iterative ANN learning rule, as opposed to standard least squares fitting, can be dependent on linear transformations of the data. Our interpretation of this result is that when original variables are presented to the network, back-propagation (at least with the learning constants in Table 1) is not able to detect the directions of minor variability and relevance in the X-space. When principal components are used, however, these directions are sorted out and "blown up" by the scaling, and therefore are given the same possibility as the major variabilities to influence the determination of regression coefficients.
Linear ANN models with three, eight and 15 principal components as input were also tested and these gave exactly the same results as the linear PCR solutions. This result is expected because an ANN model with a linear transfer function in the principal components is essentially the same model as the linear PCR method.

In Næs and Isaksson15 the same data were used for evaluating the locally weighted regression (LWR) method (in that paper, however, 100 wavelengths were used). The LWR method based on three principal components (after scatter-correction) gave an RMSEP of 0.65, which is almost equal to the prediction ability obtained from the best ANN model. The ANN, however, needed more principal components, namely eight, to obtain a comparable result.

The influence of starting values on the RMSEP

To test the dependence of the fitted neural networks on the starting values of the network parameters, we recomputed different networks using different starting values. In Figure 5, the prediction results (on the test data) of three replicated computations (from 5000 to 100,000 iterations) for four quite different networks are given. It can be seen that the differences between the replicates are quite small and the relative differences between the four networks are, in this case, independent of starting values. However, the differences between replicates are large enough to indicate that the smaller differences in Table 3 should not be interpreted too strongly.

Figure 5. Three replicates of four different networks computed on the meat data. For each network, the three replicates are illustrated by the same symbol. For each network, only a limited number of RMSEPs were recorded. In each case they are joined by linear interpolation. Overlapping curves are represented by just one line.

The rate of convergence

From testing of networks (on the prediction set) for different numbers of iterations, it was found that about 50,000 iterations gave the best results for this data set. The results from four different networks are presented in Figure 5. This figure also indicates that already after 30,000 iterations we are close to the optimal values. From 50,000 to 100,000 iterations there is a small increase in the RMSEP. This is interpreted as a slight overfitting of the network model. The results in Table 2 and Table 3 are presented for 50,000 iterations.

Polymer data

The data used in this experiment contains the near infrared diffuse reflectance (NIR) spectra of 90 different polyurethane elastomers at 138 wavelengths (X), and the flex modulus (or stiffness) value for these samples (Y). The flex modulus values of these samples ranged from 5.8 to 68.9 PSI. A training set of 47 samples and a prediction set of 43 samples were chosen such that the Y-values of the prediction samples were within the range of the values for the training set samples. In this example only the scatter-corrected data (by the MSC method) are used. Details regarding this data set are presented in Reference 17. It should be noted that linear PLS analysis of these data revealed a strong non-linear relationship between the X and Y data when three PLS factors were used.

Prediction ability

The PCR prediction results are presented in Figure 6. The best prediction result was in this case RMSEP = 2.84, obtained using 38 components (note that 21 components gave only slightly less precise predictions). The RMSEP was, however, quite stable with respect to the number of components, but increased as a result of overfitting as the number of components became very large. The use of the maximum number of components (46 = 47 - 1, because of mean-centring) resulted in a RMSEP of 3.86.
Figure 6. PCR prediction error as a function of the number of components (polymer data).

Non-linear ANN prediction results for networks that use each of the input wavelengths as input are given in Table 4, for different numbers of nodes in the hidden layer. As in the previous data set, an intermediate number of nodes in the hidden layer, namely 25, gave the best result (RMSEP = 1.96). As can be seen, the improvement over the best PCR result is about 30%. It is important to note that the use of all available wavelengths in a traditional least squares approach would be impossible in this case, because the number of samples is much smaller than the number of variables. However, for the ANN method with back-propagation learning, this situation posed no problem.

Table 4. RMSEPs for polymer data in which all 138 wavelengths are used as input (scatter-corrected data). 100,000 iterations were used.

Number of nodes in hidden layer      1       25      138
RMSEP                                2.58    1.96    2.13

The RMSEPs obtained by using principal components as input variables for the ANN are given in Table 5. As for the previous data set, there is an effect of overfitting with respect to the number of hidden nodes. It is also clear that the use of 13, 16 and 21 components as input resulted in similar and optimal prediction results. Note also the overfitting effect as we increase the number of components to the maximum.

Table 5. RMSEP results for sigmoid ANNs computed on scatter-corrected polymer data, using different numbers of principal components as input and different numbers of nodes in the hidden layer. 100,000 iterations were used except for the computations in the last line, where only 1600 iterations were used (see text).

                               Number of nodes in hidden layer
Number of input variables      1       3       8       13
3                              3.05    2.98    3.61    3.22
5                              3.16    3.13    3.86    4.53
13                             2.65    2.64    3.00    3.11
16                             2.76    2.63    2.85    2.96
21                             2.57    2.51    2.64    2.80
46                             3.11    3.27    3.38    3.79

The influence of starting values on the RMSEP

As in the previous data set, prediction results of three replicates of four different network models are shown in Figure 7. There are some differences between the prediction errors obtained from the replicate models, but these are not large enough to justify a change in the main conclusions regarding the differences among the four networks.

Figure 7. Three replicates of four different networks computed on the polymer data. For each network, the three replicates are illustrated by the same symbol. For each network, only a limited number of RMSEPs were recorded. In each case they are joined by linear interpolation. Overlapping curves are represented by just one line.
The rate of convergence

For this data set, 100,000 iterations resulted in slightly better prediction results (on the test data) than 50,000 iterations, but these differences are very small. The only exception here was the network that used 46 principal components, for which about 1600 iterations resulted in the best model. In other words, a strong effect of overfitting as the number of iterations was increased was observed in this case. All results in Tables 4 and 5 are given for 100,000 iterations, except those obtained from the network with 46 principal components as input.

Discussion

There are a number of conclusions that can be extracted from the applications above, both of which were selected from earlier publications in which they had indicated non-linear features. First of all, the sigmoid ANN method gave clear improvements in prediction ability compared to the linear PCR method (15% and 30%). In addition, scatter-corrected data gave better results than the uncorrected NIR data both for PCR and for the non-linear ANN. This result shows that even the very flexible ANN functions are not always able to cope with non-linearities from the strong non-linear scattering effects in NIR spectra. Next, there is always, except in one case, an effect of overfitting with respect to the number of nodes used in the hidden layer. This result means that the use of a function that is too flexible results in less precise networks.

In both examples, the use of principal components as input data gave good results, but the RMSEP varied quite considerably with the number of components used. In both examples, there was a clear effect of overfitting with respect to the number of principal components used, as was also the case for the linear PCR. For one of the sets, the use of all individual x-values as input gave better prediction results than the optimal number of principal components. In the other case the difference in prediction ability between networks based on principal components and networks based on the original spectral variables was very small. This indicates that the overfitting effect is not necessarily as strong as expected when the original variables are used. When the principal components are used as input, however, the overfitting effect was strong. To understand this fully, more research is needed.

Networks with sigmoid functions gave, in general, better predictions than networks with linear transfer functions.

The number of iterations that gave the best result was different in the two cases. In the first example, 50,000 was close to the optimum and in the other example 100,000 was close. The differences between the two choices, however, were small in both cases. It should also be mentioned that the rate of convergence is dependent on the learning parameters. Parameters other than those used in the present investigation could possibly give somewhat different results, but this was not investigated. The differences between replicated computations of the same network model (with different starting parameters) were quite small in all the cases considered.

The main conclusion is that the ANN method, if used in the right way, was a good method for calibration in the two cases tested. The improvements compared to the linear PCR were, however, much more moderate than in the examples in Borggaard and Thodberg.2 This shows that the performance of ANN methods relative to linear methods is strongly dependent on the situation considered. This conclusion is supported if we also compare the above results with the results in Long et al.1

A last point to mention is that even though artificial neural networks may in certain cases be better than, for example, PCR from a prediction point of view, they are more difficult to understand and the results are much more difficult to visualise and interpret. These aspects may, however, be improved in future developments of ANN techniques.
References

1. R.L. Long, V.G. Gregoriou and P.J. Gemperline, Anal. Chem. 62, 1791 (1990).
2. C. Borggaard and H.H. Thodberg, Anal. Chem. 64, 545 (1992).
3. W.F. McClure, H. Maha and S. Junichi, in Making Light Work: Advances in Near Infrared Spectroscopy (Ed. by I. Murray and I. Cowe). VCH, Weinheim, pp. 200-209 (1992).
4. A.R. Barron and R.L. Barron, in Computing Science and Statistics, Proceedings of the 21st Symposium on the Interface. American Statistical Association, Alexandria, VA, pp. 192-203 (1988).
5. M. Maechler, D. Martin, J. Schimert, M. Csoppenszky and J.N. Hwang, in Proceedings of the 2nd International Conference on Tools for AI (1990).
6. Y.H. Pao, Adaptive Pattern Recognition and Neural Networks. Addison-Wesley (1989).
7. J. Zupan and J. Gasteiger, Anal. Chim. Acta 248, 1 (1991).
8. J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, USA (1991).
9. H. Martens and T. Næs, Multivariate Calibration. John Wiley & Sons, Chichester (1989).
10. A.R. Barron, in Nonparametric Functional Estimation (Ed. by G. Roussas). Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 1034-1053 (1991).
11. J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc. 76, 817 (1981).
12. I. Frank, Chemometrics and Intelligent Laboratory Systems 8, 109 (1990).
13. S. Wold, Chemometrics and Intelligent Laboratory Systems 14, 71 (1992).
14. S. Weisberg, Applied Linear Regression. John Wiley & Sons, New York (1985).
15. T. Næs and T. Isaksson, Appl. Spectrosc. 46, 34 (1992).
16. P. Geladi, D. MacDougall and H. Martens, Appl. Spectrosc. 39, 491 (1985).
17. C.E. Miller and B.E. Eichinger, J. Appl. Polym. Sci. 42, 2169 (1991).

Paper received: 23 October 1992
Accepted: 5 February 1993