Artificial Neural Networks in Multivariate Calibration

T. Næs et al., J. Near Infrared Spectrosc. 1, 111 (1993)
This paper is about the use of artificial neural networks for multivariate calibration. We discuss network architecture and estimation, as well as the relationship between neural networks and related linear and non-linear techniques. A feed-forward network is tested on two applications of near infrared spectroscopy, both of which have been treated previously and have indicated non-linear features. In both cases, the network gives more precise prediction results than the linear calibration method PCR.
Keywords: Artificial neural network, back-propagation, feed-forward network, PCR, PLS, near infrared spectroscopy,
multivariate calibration.
…must be estimated from the data. The process by which these parameters are determined is, in the terminology of artificial neural computing, called learning. The best choice for g_i and f can in practice be found by trial and error, in that several options are tested and the functions that result in the best prediction ability are selected. It has been shown that the class of models in Equation (1) is dense (for a suitable choice of functions g_i and f) in the class of smooth continuous functions. This means that any continuous function relating y and x_1, …, x_J can be approximated arbitrarily well by a function of this kind.⁴

As with linear calibration methods, network models must be constructed with consideration of two important effects: underfitting and overfitting.⁹ If a model that is too simple or too rigid is selected, underfitting is the result; if a model that is too complex is used, overfitting can be the consequence. The optimal model complexity is usually somewhere between these two extremes. Techniques have been developed that simultaneously estimate the complexity (which is related to the number of hidden nodes) and the model parameters,¹⁰ but this subject is not covered here.

…as small as possible. The error from the output layer is then back-propagated to the hidden layer and the same procedure is repeated for the w_ij s. The differences between the updated and the previous parameters depend on so-called learning constants, which can have a strong influence on the speed of convergence of the network. Such learning constants are usually dynamic, which means that they decrease as the number of iterations increases (see the examples below). The constants can be optimised for each particular application. In some cases, an extra term (momentum) is added to the standard back-propagation formulae in order to improve the efficiency of the algorithm. We refer to Reference 8 for details.

Although this kind of back-propagation seeks to minimise a least squares criterion, it is important to note that this kind of learning is different from standard least squares (LS) fitting, where the sum of squared residuals $\sum (y - \hat{y})^2$ is minimised over all parameter values b_i and w_ij. Theoretical studies of the convergence properties of ANNs can be found in Reference 10. Extensions of the back-propagation technique described above have been developed in order to make it more similar to the standard LS fitting of all objects simultaneously.⁸
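To make the learning step concrete, the following is a minimal sketch, in Python, of back-propagation for a one-hidden-layer network of the kind in Equation (1): sigmoid hidden nodes, a linear output, a decaying learning constant and a momentum term. It is an illustration under our own choice of names and constants, not the exact scheme of Reference 8; bias terms are omitted and the data are assumed centred.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_network(X, y, n_hidden=3, n_iter=50_000, lr0=0.5, mom=0.8, seed=0):
    """One-hidden-layer feed-forward network with sigmoid hidden nodes and a
    linear output, trained by per-object back-propagation with a decaying
    learning constant and a momentum term (illustrative constants)."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    W = rng.normal(scale=0.1, size=(J, n_hidden))  # w_ij: input -> hidden
    b = rng.normal(scale=0.1, size=n_hidden)       # b_i: hidden -> output
    dW, db = np.zeros_like(W), np.zeros_like(b)    # momentum terms
    for it in range(n_iter):
        lr = lr0 / (1.0 + it / 10_000)             # dynamic learning constant
        k = rng.integers(n)                        # one calibration object
        h = sigmoid(X[k] @ W)                      # hidden-node outputs g_i(.)
        err = (b @ h) - y[k]                       # output error (linear f)
        # Gradients of 0.5 * err**2; the output error is back-propagated
        # to the hidden layer through the sigmoid derivative h * (1 - h).
        gb = err * h
        gW = np.outer(X[k], err * b * h * (1.0 - h))
        db = mom * db - lr * gb                    # momentum updates
        dW = mom * dW - lr * gW
        b, W = b + db, W + dW
    return W, b

def predict(X, W, b):
    return sigmoid(X @ W) @ b
```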
Relations to other methods

The artificial neural network models have strong relations to other statistical methods that have been used frequently in the past. In this section, however, we will only focus on a limited set of relations to methods that are frequently discussed and used in the chemometric and NIR literature. For a discussion of more statistically oriented aspects of such relations, we refer to References 5 and 10.

Multiple linear regression (MLR)

MLR is based on the following model for the relationship between y and x:

$$y = a + \sum_{i=1}^{I} b_i x_i + e \qquad (3)$$

where a and the b_i s are regression coefficients to be estimated. We see that this model is identical to a special version of model Equation (1) in which the hidden layer is removed and the transfer function is set equal to the identity function. It is also easy to see that if we replace both f and g_i in Equation (1) by identity functions, we end up with a model which is essentially the same as the MLR model. This is easily seen by multiplying the w_ij and b_i values and renaming the products. This procedure is called reparameterisation of the network function.

However, in the learning procedure, the MLR equation is most frequently fitted by using standard LS instead of applying, for example, the iterative fitting procedure described above. Below we will investigate differences between the empirical properties of these two estimation techniques.
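The reparameterisation argument can be verified numerically. The following sketch (variable names are ours) shows that a purely linear network, with both f and g_i equal to the identity, computes exactly an MLR prediction with coefficients beta_j = sum_i b_i w_ij:

```python
import numpy as np

# A linear two-layer network: yhat = b @ (x @ W) = x @ (W @ b),
# so the products beta_j = sum_i b_i * w_ij are ordinary MLR coefficients.
rng = np.random.default_rng(1)
J, I = 5, 3                      # input variables, hidden nodes
W = rng.normal(size=(J, I))      # w_ij
b = rng.normal(size=I)           # b_i
x = rng.normal(size=J)

via_network = (x @ W) @ b        # pass through hidden layer, then output
beta = W @ b                     # reparameterised MLR coefficients
via_mlr = x @ beta               # one linear regression step

assert np.isclose(via_network, via_mlr)
```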
Partial least squares regression and principal component regression

The regression equation for both PLS and PCR (with centred y and x) can be written as

$$y = \sum_{i=1}^{I} b_i \sum_{j=1}^{J} w_{ij} x_j + e = \sum_{i=1}^{I} b_i t_i + e \qquad (4)$$

where the w_ij s now correspond to the loading weights⁹ of factor i and the b_i s correspond to the regression coefficients of y on the latent variables $t_i = \sum_{j=1}^{J} w_{ij} x_j$ (i.e. the y-loadings). It can be seen that, apart from the non-linear g_i and f in the feed-forward network Equation (1), the two equations (1) and (4) are identical. This means that the model equation for PLS and PCR is a special case of an ANN equation with linear g_i and f.

However, for the learning procedure, PCR and PLS are quite different from ANN. The idea behind the PCR/PLS methods is that the weights w_ij are determined under a strong restriction, in order to give scores that are as relevant for prediction as possible.⁹ In PLS, the w_ij s are the PLS loading weights found by maximising the covariance between y and linear functions of x, and in PCR they are the principal component loadings obtained by maximising the variance of linear functions of x. In other words, estimation of b_i and w_ij is not done by fitting Equation (4) as well as possible, but through the use of quite a different rule. If both b_i and w_ij in Equation (4) were determined by LS directly, the ordinary MLR method would be the result, since Equation (4) is only a reparameterisation of the linear regression equation. Since the ANN method concentrates on fitting without restrictions on the parameters, there are reasons to believe that ANNs are more sensitive to overfitting than are PLS and PCR. As a compromise, it has been proposed to use principal components as input for the network instead of the original variables themselves.² This procedure may, however, have some drawbacks, since it restricts the network input to only a few linear combinations of the original spectral values. This problem is studied in the application section below.

In PCR and PLS models, the scores t_i and loading weights w_ij are orthogonal (for each j). This is not the case for ANN models, in which the w_ij s are determined without this restriction. The orthogonality properties of PLS and PCR are strong reasons for their usefulness in the interpretation of data, and this same argument can be used against the present versions of ANN.
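As an illustration of the compromise of using principal components as network input, the sketch below (helper names are ours, and it assumes the train_network and predict routines sketched earlier) computes principal component scores and passes them to the network in place of the raw spectral variables:

```python
import numpy as np

def pca_scores(X, n_components):
    """Scores of the mean-centred data on the leading principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the PC loading vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    return Xc @ loadings, loadings, X.mean(axis=0)

# Hypothetical usage with the earlier train_network sketch:
# T, P, xbar = pca_scores(X_train, n_components=8)
# W, b = train_network(T, y_train - y_train.mean(), n_hidden=3)
# T_test = (X_test - xbar) @ P
# y_pred = predict(T_test, W, b) + y_train.mean()
```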
Projection pursuit (PP) regression

The PP regression method¹¹ is based on the equation
Table 1. The learning constants of the neural network. For both the output and hidden layer we used the same learning constant and momentum value; the constants change as the number of iterations increases.
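The tables that follow report RMSEP, used here in its standard sense of the root mean square error of prediction over the $n_p$ test objects:

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n_p} \sum_{k=1}^{n_p} \left( \hat{y}_k - y_k \right)^2}$$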
…obtained using 10 and 25 nodes, and for the uncorrected data, the best results were obtained using 10, indicating that an intermediate number of hidden nodes is optimal. The best network gave about 9% better results than PCR.

Results obtained by using a sigmoid ANN method based on principal components instead of the original x-variables are presented in Table 3. A smooth graphical illustration of the results is presented in Figure 4. The table presents some results obtained using different numbers of principal components as input, and different numbers of nodes in the hidden layer. Scatter-corrected (MSC) data gave much better results than uncorrected data and therefore only the results using scatter correction are reported. It is clear that it is best to use an intermediate number of PCs as input, namely eight, and also an intermediate number of nodes in the hidden layer, namely three (RMSEP 0.64). In other words, there is an effect of underfitting and an effect of overfitting both with respect to the number of nodes and the number of input components (variables). It is also worth noting that the results obtained using principal components as input are better than those obtained using individual x-responses as input, although the differences are small. The best ANN result was about 15% better than the best PCR result.

Table 3. RMSEP results for sigmoid ANNs computed on scatter-corrected meat data, using different numbers of principal components as input and different numbers of nodes in the hidden layer; 50,000 iterations were used.

Number of input        Number of nodes in hidden layer
variables (PCs)          1       3       8       15
      3                0.84    0.79    0.74    0.92
      8                0.72    0.64    0.68    0.66
     15                1.05    1.03    1.01    1.06
     22                1.32    1.31    1.16    1.37

Figure 4. A smooth illustration of the numbers in Table 3. The smoothing technique used is the default method of PROC G3GRID in SAS (SAS Institute, NC, USA).

For the scatter-corrected data, a linear transfer function network based on all input variables and with only one hidden node was also tested; the RMSEP was equal to 0.89, which is somewhat larger than the value obtained for the best sigmoid network. For comparison, the linear ANN based on all available principal components gave a RMSEP of 1.66 after 50,000 iterations. (This result is, as expected, similar to that of the linear PCR based on all available components.) This difference in performance between the linear ANN method based on all original variables and the linear ANN method based on all principal components is somewhat surprising, since the two methods are based on essentially the same model. The most plausible explanation of the different performances must therefore be that the iterative ANN learning rule, as opposed to standard least squares fitting, can be dependent on linear transformations of the data. Our interpretation of this result is that when original variables are presented to the network, back-propagation (at least with the learning constants in Table 1) is not able to detect the directions of minor variability and relevance in the X-space. When principal components are used, however, these directions are sorted out and blown up by the scaling, and therefore are given the same possibility as the major variabilities to influence the determination of regression coefficients.
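The scatter correction referred to above is the multiplicative scatter correction of Geladi et al.;¹⁶ the following is a minimal sketch of the usual procedure (our implementation, using the mean training spectrum as reference):

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction: each spectrum is regressed on a
    reference spectrum (here the mean) and corrected as (x - a) / b."""
    ref = X.mean(axis=0) if reference is None else reference
    Xc = np.empty_like(X, dtype=float)
    for k, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)   # slope and intercept of x vs. ref
        Xc[k] = (x - a) / b
    return Xc, ref
```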
The influence of starting values on the RMSEP

As in the previous data set, prediction results of three replicates of four different network models are shown in Figure 7. There are some differences between the prediction errors obtained from the replicate models, but these are not large enough to justify a change in the main conclusions regarding the differences among the four networks.

Figure 7. Three replicates of four different networks computed on the polymer data. For each network, the three replicates are illustrated by the same symbol. For each network, only a limited number of RMSEPs were recorded; in each case they are joined by linear interpolation. Overlapping curves are represented by just one line.
The rate of convergence

For this data set, 100,000 iterations resulted in slightly better prediction results (on the test data) than 50,000 iterations, but these differences are very small. The only exception here was the network that used 46 principal components, for which about 1600 iterations resulted in the best model. In other words, a strong effect of overfitting as the number of iterations was increased was observed in this case. All results in Tables 4 and 5 are given for 100,000 iterations, except those obtained from the network with 46 principal components as input.
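One simple way to expose this kind of iteration-dependent overfitting is to record the test-set RMSEP at intervals during learning and keep the iteration count that gave the smallest error, as in the following sketch (the checkpoint interval and helper names are ours):

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def best_iteration(train_step, predict_test, y_test, n_iter=100_000, every=100):
    """Run `train_step` n_iter times, checking test RMSEP every `every`
    iterations, and report the iteration with the smallest error."""
    best_err, best_it = np.inf, 0
    for it in range(1, n_iter + 1):
        train_step()                      # one back-propagation update
        if it % every == 0:
            err = rmsep(y_test, predict_test())
            if err < best_err:
                best_err, best_it = err, it
    return best_it, best_err
```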
Discussion

There are a number of conclusions that can be extracted from the applications above, both of which were selected from earlier publications in which non-linear features had been indicated. First of all, the sigmoid ANN method gave clear improvements in prediction ability compared to the linear PCR method (15% and 30%). In addition, scatter-corrected data gave better results than the uncorrected NIR data, both for PCR and for the non-linear ANN. This result shows that even the very flexible ANN functions are not always able to cope with non-linearities from the strong non-linear scattering effects in NIR spectra. Next, in all but one case there was an effect of overfitting with respect to the number of nodes used in the hidden layer. This means that using a function that is too flexible results in less precise networks.

In both examples, the use of principal components as input data gave good results, but the RMSEP varied quite considerably with the number of components used. In both examples, there was a clear effect of overfitting with respect to the number of principal components used, as was also the case for the linear PCR. For one of the sets, the use of all individual x-values as input gave better prediction results than the optimal number of principal components. In the other case the difference in prediction ability between networks based on principal components and networks based on the original spectral variables was very small. This indicates that the overfitting effect is not necessarily as strong as expected when the original variables are used. When the principal components are used as input, however, the overfitting effect was strong. More research is needed to understand this fully.

Networks with sigmoid functions gave, in general, better predictions than networks with linear transfer functions.

The number of iterations that gave the best result was different in the two cases. In the first example, 50,000 was close to the optimum and in the other example 100,000 was close. The differences between the two choices, however, were small in both cases. It should also be mentioned that the rate of convergence is dependent on the learning parameters. Parameters other than those used in the present investigation could possibly give somewhat different results, but this was not investigated. The differences between replicated computations of the same network model (with different starting parameters) were quite small in all the cases considered.

The main conclusion is that the ANN method, if used in the right way, was a good method for calibration in the two cases tested. The improvements compared to the linear PCR were, however, much more moderate than in the examples in Borggaard and Thodberg.² This shows that the performance of ANN methods relative to linear methods is strongly dependent on the situation considered. This conclusion is supported if we also compare the above results with the results in Long et al.¹

A last point to mention is that even though artificial neural networks may in certain cases be better than, for example, PCR from a prediction point of view, they are more difficult to understand and the results are much more difficult to visualise and interpret. These aspects may, however, be improved in future developments of ANN techniques.
References

1. R.L. Long, V.G. Gregoriou and P.J. Gemperline, Anal. Chem. 62, 1791 (1990).
2. C. Borggaard and H.H. Thodberg, Anal. Chem. 64, 545 (1992).
3. W.F. McClure, H. Maha and S. Junichi, in Making Light Work: Advances in Near Infrared Spectroscopy (Ed. by I. Murray and I. Cowe). VCH, Weinheim, pp. 200–209 (1992).
4. A.R. Barron and R.L. Barron, in Computing Science and Statistics: Proceedings of the 21st Symposium on the Interface. American Statistical Association, Alexandria, VA, pp. 192–203 (1988).
5. M. Maechler, D. Martin, J. Schimert, M. Csoppenszky and J.N. Hwang, in Proceedings of the 2nd International Conference on Tools for AI (1990).
6. Y.H. Pao, Adaptive Pattern Recognition and Neural Networks. Addison-Wesley (1989).
7. J. Zupan and J. Gasteiger, Anal. Chim. Acta 248, 1 (1991).
8. J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, USA (1991).
9. H. Martens and T. Næs, Multivariate Calibration. John Wiley & Sons, Chichester (1989).
10. A.R. Barron, in Nonparametric Functional Estimation (Ed. by G. Roussas). Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 1034–1053 (1991).
11. J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc. 76, 817 (1981).
12. I. Frank, Chemometrics and Intelligent Laboratory Systems 8, 109 (1990).
13. S. Wold, Chemometrics and Intelligent Laboratory Systems 14, 71 (1992).
14. S. Weisberg, Applied Linear Regression. John Wiley & Sons, New York (1985).
15. T. Næs and T. Isaksson, Appl. Spectrosc. 46, 34 (1992).
16. P. Geladi, D. MacDougall and H. Martens, Appl. Spectrosc. 39, 491 (1985).
17. C.E. Miller and B.E. Eichinger, Appl. Pol. Sci. 42, 2169 (1991).

Paper received: 23 October 1992
Accepted: 5 February 1993