
Smart Engineering System Design, 5:119–127, 2003. Copyright © 2003 Taylor & Francis. 1025-5818/03 $12.00 + .00. DOI: 10.1080/10255810390210940

Development of Self-Adaptive Artificial Neural Networks Training Algorithm


Shamsuddin Ahmed
College of Business and Economics, United Arab Emirates University, Al Ain, UAE and Edith Cowan University, School of Engineering and Mathematics, Joondalup, Perth, Australia
This paper introduces an artificial neural network (ANN) training algorithm that computes a directional search vector for rapid convergence in ANN training. A higher-dimensional ANN error function is reduced to a lower dimension, and a mapping scheme is developed to identify self-adaptive learning rates for each neuron to improve training performance. The directional search vector points in the direction of fast training, while the dynamic self-adaptive learning rates identified by the mapping scheme generate a convergent sequence of the error function. As a result, the training is faster than standard backpropagation training by a factor of 1.76 on the XOR problem. The learning rates are self-adaptive and change dynamically every epoch; consequently, oscillation during training is greatly reduced.

Keywords: Artificial Neural Networks, self-adaptive training, directional derivative, variable learning rate, XOR

The convergence of the ANN error function under an algorithmic map is described in the following sections. Two important properties are developed in this training scheme: the direction of training is identified using the algorithm in Table 3, and the proper amount of dynamic learning rate is computed via the algorithm in Table 2. The properties of the descent directions are discussed. To determine appropriate self-adaptive learning rates, the training problem is reconstituted as a one-dimensional error function from the multi-dimensional (E^m) training problem. The training scheme developed in Table 2 prevents the exploration phase from overshooting the minimum trajectory and, consequently, the ANN training is faster.

Self-Adaptive Training
Consider an ANN error function in which m training weights are to be identified dynamically. The higher-dimensional ANN error function is decomposed into several error functions of lower dimension (Ahmed et al. 2000). Such a transformed error function retains the true convex characteristics of the original error function. The training method scales the learning rates for all the weight parameters, say w_j, j = 1, 2, ..., m, by a factor such that the improvement in training is noticeable. Each epoch identifies m different learning rates along the training directions (see Figure 2a, b, c, Table 1, and Table 3). The magnitudes of the weights at each epoch are measured by a mapping scheme, shown in Table 2, that exploits least-squares training criteria (Jang et al. 1997).
Received March 6, 2000; accepted November 6, 2002. Address correspondence to Shamsuddin Ahmed, College of Business and Economics, United Arab Emirates University, P. O. Box 17555, Al Ain, United Arab Emirates. E-mail: Dr_Shamsuddin_Ahmed@yahoo.com

The ANN training weights are then updated and the error function is re-evaluated. The updated training results are noted in terms of the improvement in the error function, the learning behavior of the error function, and the rate of convergence of the ANN error function. To demonstrate the concept behind the self-adaptive backpropagation training scheme, a standard XOR ANN with a 2-2-1 configuration is considered. Figure 1 shows the variable learning rates generated by the algorithm. As the training continues, the variable learning rates approach a lower magnitude. As a result, the training converges rapidly. This dynamically self-adaptive training reduces the error function value to the order of 10^-3 at the early stage of training and eventually trains to a tolerance limit of 10^-6 for accuracy. Figure 1 shows the training performance. Notice that the ANN training function converges without oscillation and the self-adaptive learning rates gradually reduce to a small magnitude. The terminal value of the output vector is [0.00098, 1.00001, 0.99995, 0.0006] against the desired value [0, 1, 1, 0]. The training method generates different learning rates during training, as seen in Figure 1. It is evident from Figure 1 that a considerable reduction in the error function is achieved within 20 epochs of training. The standard backpropagation (BP) takes, on average, 464 epochs to train the same XOR problem, while the proposed method needs, on average, 152 epochs (see Table 4). The standard BP training method and its variants are not self-adaptive, but ad hoc (Fahlman 1988; Hinton 1987; Jacobs 1988; Weir 1991; Vogl et al. 1988; Kamarthi and Pittner 1999; Ahmed and Cross 2000). The training scheme proposed here generates self-adaptive variable learning rates and monotonically reduces the error function. General references on BP neural networks include Rumelhart and McClelland (1989), Muller and Reinhardt (1990), and Hertz et al. (1991). Further discussions of neural networks in useful applications can be found in Cheng and Titterington (1994), Werbos (1994), Kuan and White (1994), Bishop (1995), Haykin (1999), Sakamoto and Kobuchi (2000), and Partovi and Anandrajan (2002).
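To make the 2-2-1 XOR setting concrete, the sketch below evaluates a sum-of-squared-error function for a 2-2-1 network with a log-sigmoid activation, which is the configuration discussed above. The flattened nine-weight layout (two hidden neurons with two input weights and a bias each, plus an output neuron with two weights and a bias) and the use of biases are assumptions made for illustration; the paper does not spell out the exact weight ordering.

```python
import numpy as np

# The four XOR input patterns and targets used throughout the paper.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

def sigmoid(z):
    """Log-sigmoid activation assumed for the hidden and output neurons."""
    return 1.0 / (1.0 + np.exp(-z))

def xor_forward(w):
    """Network outputs for the four XOR patterns under the assumed weight layout."""
    h1 = sigmoid(X @ w[0:2] + w[2])     # hidden neuron 1: weights w[0:2], bias w[2]
    h2 = sigmoid(X @ w[3:5] + w[5])     # hidden neuron 2: weights w[3:5], bias w[5]
    return sigmoid(np.column_stack([h1, h2]) @ w[6:8] + w[8])  # output neuron

def xor_error(w):
    """Sum-of-squared-error over the four patterns; plays the role of f(w)."""
    return 0.5 * np.sum((xor_forward(w) - T) ** 2)

rng = np.random.default_rng(0)
w0 = rng.uniform(-0.5, 0.5, size=9)     # small random starting weights
print(f"initial error = {xor_error(w0):.4f}, outputs = {np.round(xor_forward(w0), 3)}")
```

With small random weights all four outputs sit near 0.5, so the initial error is roughly 0.5, which is in line with the first function value reported in Table 1.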

TABLE 1 Function Convergence During Early Stage of Training and Self-Adaptive Training Parameters in m-Dimensional Space
[Table 1 lists, for successive early training steps, the self-adaptive parameter, the components of the six search directions (Direction 1 through Direction 6), and the error function value, which decreases monotonically from 0.49679 to 0.0343.]


TABLE 2 The Closed Mapping Algorithm to Identify Dynamic Learning Rates


Step 1a: Initialization: Set index j = 0, a small error value ε ≈ 0 as the termination criterion, and ℘ ≈ 0 as the interpolation-search precision factor. Set a scalar δ3 = 0.01 and index i = 0; let m be the number of ANN connection weights; set the learning-rate parameters in each direction j to zero, η = (η1, η2, ..., ηm) = 0. Set the scalar quantities δ1 = 4 and δ2 = 2.5. Initialize the network weights w = (w1, w2, ..., wm). Set the direction vector d = (d1, d2, ..., dm) = 0.
Step 1b: 1.1 Set j = j + 1.
Step 1c: 1.2 Set one locally found learning rate as η1 = w_j and d_j = 1; set the function value f1 = f(w_j + η1 d_j) and perform the next step.
Step 2: 2.1 Set a scalar quantity a1 = δ3.
Step 3: 3.1 Set the second locally found learning rate as η2 = (1 + a1) η1.
Step 4: 4.1 Set the function value f2 = f(w_j + η2 d_j); if f2 < f1, perform step 5, or else
4.2 set a1 = δ1 a1; if |a1| < ℘, perform step 8, otherwise perform step 3.
Step 5: 5.1 Set a scalar quantity a2 = a1.
Step 6: 6.1 Set a2 = δ2 a2.
6.2 Set the third locally found learning rate as η3 = (1 + a2) η1 and the function value f3 = f(w + η3 d_j); if f3 > f2, perform step 7, otherwise
6.3 set a1 = a2 and f2 = f3, and repeat step 6.
Step 7: Find the scalar quantity a0 that minimizes the function value locally.
7.1 Let the best learning rate be η̄ = (1 + a0) η1 and f1 = f(w_j + η̄ d_j); if f1 < f2, perform step 2; otherwise set η̄ = (1 + a1) η1 and f1 = f(w_j + η̄ d_j), so that f2 < f1, and perform step 2.
7.2 If |f1 − f2| ≤ ε, perform step 8; otherwise
7.3 set f̂ = f(w_j + η̄ d_j) and monitor the error function value.
7.4 Change the value of η1 to (1 + a1) η1 / (1 + a0) and perform step 1c.
Step 8: 8.1 Set d = (d1, d2, ..., dm) = 0; if j = m, perform step 9, or else perform step 1b.
Step 9: 9.1 Set the epoch index k = k + 1; for the next training epoch set d = (d1, d2, ..., dm) = 0 and j = 0, and perform step 1b.
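One way to read Table 2 is as a bracket-and-interpolate line search applied to one coordinate direction at a time: start from a small trial step (δ3 = 0.01), keep expanding the step by δ2 = 2.5 while the error keeps falling, and then interpolate for the locally best learning rate once the error turns back up. The sketch below is a simplified rendering of that idea rather than a line-by-line transcription of the table; the quadratic fit stands in for the table's "find the scalar a0 that minimizes the function value locally", and the exact updates of steps 4.2 and 7.4 are not reproduced because they are ambiguous in the source.

```python
import numpy as np

def adaptive_learning_rate(f, w, d, eta0, delta2=2.5, delta3=0.01, max_expand=20):
    """Bracket-and-interpolate search for a learning rate along direction d.

    f     : error function of the flat weight vector
    w, d  : current weights and a search direction
    eta0  : initial learning-rate guess for this direction (Table 2 seeds it
            from the current weight magnitude; here it is just a parameter)
    Returns an estimated learning rate and the error at w + eta * d.
    """
    etas = [eta0, (1.0 + delta3) * eta0]
    errs = [f(w + e * d) for e in etas]
    a = delta3
    for _ in range(max_expand):
        a *= delta2                         # expand the trial step (delta2 = 2.5)
        eta_new = (1.0 + a) * eta0
        err_new = f(w + eta_new * d)
        etas.append(eta_new)
        errs.append(err_new)
        if err_new > errs[-2]:              # error went back up: minimum is bracketed
            break
    # Quadratic fit through the last three samples to interpolate the minimum.
    e1, e2, e3 = etas[-3:]
    f1, f2, f3 = errs[-3:]
    denom = (e1 - e2) * (e1 - e3) * (e2 - e3)
    if abs(denom) < 1e-12:
        return e2, f2
    a_coef = (e3 * (f2 - f1) + e2 * (f1 - f3) + e1 * (f3 - f2)) / denom
    b_coef = (e3**2 * (f1 - f2) + e2**2 * (f3 - f1) + e1**2 * (f2 - f3)) / denom
    if a_coef <= 0:                         # fit is not convex here; keep the best sample
        return etas[int(np.argmin(errs))], min(errs)
    eta_star = -b_coef / (2.0 * a_coef)
    return eta_star, f(w + eta_star * d)
```

As a usage illustration, this could be paired with the xor_error sketch above, e.g. adaptive_learning_rate(xor_error, w0, d, eta0=abs(w0[j])) for the j-th coordinate direction, mirroring how Table 2 seeds the search from the current weight magnitude.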

TABLE 3 Algorithm to Identify Directional Derivatives


Initialization: a) Set a termination criterion W ≈ 0, the epoch counter k = 1, δ ≈ 0 (as a percentage factor), and j = 1; set the limit on the iteration number to 25,000, let w_{k=1} be the initial weight vector, and execute step 1.
Step 1: Determine the directional derivatives, selecting the fraction (a scalar) by a few experiments:
a) Set the perturbation g_j = δ w_j, a small fraction of the weight w_j in w_k, and evaluate f(w_k).
b) Form the central-difference derivative ∇f(w_j) = [f(w_j + g_j) − f(w_j − g_j)] / (2 g_j).
c) If j = m, perform step 2; or else set j = j + 1 and repeat step 1.
Step 2: a) Set j = 1.
b) Normalize the gradient: ∇f(w_j) = ∇f(w_j) / ‖∇f(w_k)‖.
c) Set the direction d_j = −∇f(w_j).
d) If ‖∇f(w_k)‖ < W, stop and report w_k as the solution; otherwise perform the interpolation search.
Step 3: Interpolation search: Perform the interpolation search according to Table 2 and select the adaptation length η_k at iteration k. Solve the following minimization problem:
a) η_k = arg min_{η ≥ 0} f(w_k + η d_k).
b) Set w_{k+1} = w_k + η_k d_k.
Step 4: a) Set j = 1 and k = k + 1, and perform step 1.
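Read this way, Table 3 amounts to estimating each component of the gradient by perturbing the corresponding weight by a small fraction of its magnitude, forming a central difference, and then normalizing the negative gradient to obtain a unit descent direction. The sketch below follows that reading; the perturbation fraction, the floor used for near-zero weights, and the explicit minus sign on the direction are assumptions, since the extracted table does not show them unambiguously.

```python
import numpy as np

def descent_direction(f, w, fraction=1e-4, floor=1e-6):
    """Estimate the gradient of f at w by central differences and return the
    normalized steepest-descent direction, in the spirit of Table 3.

    fraction : relative size of the perturbation g_j = fraction * |w_j|
    floor    : absolute perturbation used when a weight is close to zero
    """
    m = w.size
    grad = np.zeros(m)
    for j in range(m):
        g = max(fraction * abs(w[j]), floor)   # perturbation for weight j
        wp, wm = w.copy(), w.copy()
        wp[j] += g
        wm[j] -= g
        grad[j] = (f(wp) - f(wm)) / (2.0 * g)  # central-difference derivative
    norm = np.linalg.norm(grad)
    if norm < 1e-12:                           # gradient norm below tolerance: report w as solution
        return None, norm
    return -grad / norm, norm                  # unit descent direction d = -grad / ||grad||
```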

Limitations in Standard BP
The training direction, d_k, of the error function f(w) is computed using the gradient information, ∇f(w_k), from a single training pattern in standard online backpropagation (Bishop 1995; Jacobs 1988). In online training, the other training patterns compute gradient components, ∇f(w_k), that may point in different directions (Kamarthi and Pittner 1999) for a particular weight. Therefore, a single descent direction is not generated. The fixed value of the learning rate, η_k, does not always lead to the maximum local decrease in function value. The learning rate depends on the shape of the error function (Jacobs 1988), since training proceeds from the current epoch k to the next epochs k + 1, k + 2, and so on.

TABLE 4 Comparison with Standard Backpropagation Method (2-2-1 ANN, XOR)

Proposed method:
| Statistical measure | Epoch | Function evaluations | Gradient evaluations | Total function evaluations | Terminal function value |
| Mean | 151.8 | 916.8 | 2601.1 | 3518.4 | 0.00859 |
| Median | 17 | 108 | 206 | 310.5 | 1.8E-08 |
| Std deviation | 420.83 | 2524.97 | 7449.30 | 9974.1 | 0.02713 |
| Range | 1343 | 8058 | 23718 | 31770 | 0.0858 |
| Minimum | 6 | 42 | 78 | 126 | 1.3E-10 |
| Maximum | 1349 | 8100 | 23796 | 31896 | 0.0858 |

Standard backpropagation method:
| Statistical measure | Epoch | Function evaluations | Gradient evaluations | Total function evaluations | Terminal function value |
| Mean | 463.6 | 2787.6 | 465.6 | 3253.2 | 0.00557 |
| Median | 25.5 | 159 | 27.5 | 186.5 | 0.00478 |
| Std deviation | 951.12 | 5706.73 | 951.121 | 6657.85 | 0.00489 |
| Range | 2667 | 16002 | 2667 | 18669 | 0.01487 |
| Minimum | 20 | 126 | 22 | 148 | 0.00013 |
| Maximum | 2687 | 16128 | 2689 | 18817 | 0.015 |

The training epochs, therefore, do not produce a convergent sequence in a strict sense. In the absence of a convergent sequence, there is a risk of overshooting the minimum trajectory of an ANN error function. The difficulty originates from two different sources (Jacobs 1988; Weir 1991; Vogl et al. 1988):

1. The fixed value of the learning rate, η_k, may misdirect the search towards the minimum during epoch k.
2. The directions, d_k, generated from the gradient, ∇f(w_k), during epoch k are different for a single weight component in standard backpropagation training.

By selecting arbitrary learning rates for each neuron, the weights are modified, but a steep descent move is not performed for fast convergence. In this case, the learning rates are updated based on the partial derivative of the error function with respect to the patterns in the training set.

Since there are many neurons, the directions of the training movements are also different. The gradient information of the multi-dimensional error function provides suitable learning directions. If the valley of the error function has twists and turns, a large value of the learning rate η_k will prevent the training from making reasonable progress across a long flat slope (Vogl et al. 1988; Jang et al. 1997). Identifying a suitable learning rate for a particular problem involves experimenting with different values of the learning rate that help reduce the training time (Fahlman 1988; Hinton 1987), and it is essentially an ad hoc method. Rumelhart and McClelland (1989) suggest a large value of the learning rate η_k for rapid learning without oscillations. In fact, in some training steps a large learning rate η_k may be suitable, but there is no assurance that the same learning rate would be appropriate for other steps in the training cycle (Jacobs 1988).

FIG. 1 Convergence and self-adaptive variable-magnitude training parameters.


FIG. 2a Automatic descent directions generated in XOR training.

FIG. 2b Automatic descent directions generated in XOR training.


FIG. 2c Automatic descent directions generated in XOR training.

The appropriate value of the learning rate η_k will depend on the topography of the domain of the ANN error function being traversed during training. If the contours of the error function are circular, then the step size will have little influence on the convergence of backpropagation (BP). Convergence difficulties arise when the contours have stiff ridges and the error surface contains local minima, while contours with an elliptical or circular shape favor convergence (Moller 1997; Ahmed et al. 2000; Ahmed and Cross 2000).

IDENTIFYING TRAINING DIRECTIONS


In order to describe the self-adaptive iterative training algorithm, the convergence properties of the equivalent XOR error function f(w) in a 2-2-1 ANN framework (Figure 3), with a log activation function in the hidden layer, are defined. When an iterative algorithm is applied to an error function with an initial arbitrary weight vector w_k at the beginning of iteration k, the algorithm generates a sequence of vectors w_{k+1}, w_{k+2}, ... during epochs k + 1, k + 2, and so on. The iterative algorithm is globally convergent if the sequence of vectors converges to a solution set O. Consider, for example, the following training problem, where w is defined over the dimension E^m:

minimize f(w) subject to w ∈ E^m.  (1)

Let O ⊂ E^m be the solution set. If the application of an algorithmic map, B, generates the sequence w_{k+1}, w_{k+2}, ..., starting with the weight vector w_k, such that (w_k, w_{k+1}, w_{k+2}, ...) ∈ O, then the algorithm converges globally and the algorithmic map is closed over O.

FIG. 3 A 2-2-1 XOR with log activation function.


The XOR error function is considered a minimization problem, since it is possible to train such a function (Haykin 1999). Let O be a non-empty compact subset of E^m. If the algorithmic map B generates a sequence {w_k} ∈ O such that f(w) decreases at each iteration, satisfying the ordering of the network weights f(w_k) > f(w_{k+1}) > f(w_{k+2}) > ..., then the error function f(w) is implicitly a descent function. In ANN computation, f(w) is assumed to possess descent properties, which implies that it is convex in nature (Haykin 1999). Therefore, it is feasible to define a descent direction along which the error function can be trained. The following properties of the ANN error function demonstrate the concept of generating self-adaptive variable learning rates.

Property 1. Suppose that f : E^m → E^1 and the gradient of the error function, ∇f(w), is defined. If there is a directional vector d such that ∇f(w)^T d < 0 and f(w + η d) < f(w) for all η ∈ (0, δ), δ > 0, then the vector d is a descent direction of f(w), where δ is an assumed arbitrary positive scalar. A vector defined as a directional derivative conceptualizes the direction along which the error function converges to a minimum. Also assume that the error function is smooth and continuous. The directional derivative is then defined as follows. Let f : E^m → E^1, w ∈ E^m, and let d be a non-zero vector satisfying w + η d ∈ E^m, η > 0, η → 0+. The directional derivative at w along the descent direction d is given by

∇f(w; d) = lim_{η→0} [f(w + η d) − f(w)] / η.  (2)

Property 2. Let f : E^m → E^1 be a descent function. Consider any training weight w ∈ E^m and d ∈ E^m, d ≠ 0. Then the directional derivative ∇f(w; d) of the error function f(w) in direction d always exists.

The central theme behind the self-adaptive BP training is the computation of the directional search vector, d, and the learning rate parameter, η, in the m-dimensional weight space. Since the exact location of the minimum valley is not known from a starting point, there is uncertainty in identifying the boundary or region of the weight space over which the training may explore. The uncertainty can be reduced if it is possible to eliminate the sections of the error function (Kiefer 1953) that do not contain the minimum. This can be done by an interpolation search in constrained intervals. Therefore, what is needed is a search map that explores the constrained region, L, of the error surface. The search map samples the error surface at discrete intervals in a given direction. The following definitions are needed to describe the interpolation search map. For convenience, define the following expression to update the ANN current connection weight, u_k, or the next weight, w_{k+1}:

u_k = w_{k+1} = w_k + η_k d_k,  or  u = w + η d.  (3)

Now consider a training problem with the ANN error function defined as

η_k = arg min {f(w + η d)} subject to η_k ∈ L,  (4)

where L = {η : η ∈ E^1} is a closed interval.

Property 3. The interpolation map in restricted error space is defined as A : E^m × E^m → E^m such that

A(w, d) = {u : u = w + η̄ d, η̄ ∈ L, f(u) = min f(w + η d) subject to η ∈ L}.  (5)
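As a quick numerical illustration of Property 1 and the limit in Eq. (2), one can pick a direction d = −∇f(w), confirm that ∇f(w)^T d < 0, and check that the finite-difference quotient [f(w + η d) − f(w)] / η approaches that value as η shrinks. The snippet below does this with the xor_error sketch given earlier; the step sizes are arbitrary illustrative choices.

```python
import numpy as np

def grad_fd(f, w, h=1e-6):
    """Central-difference gradient used only for this illustration."""
    g = np.zeros_like(w)
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += h
        wm[j] -= h
        g[j] = (f(wp) - f(wm)) / (2.0 * h)
    return g

rng = np.random.default_rng(3)
w = rng.uniform(-0.5, 0.5, size=9)
g = grad_fd(xor_error, w)
d = -g                                   # steepest-descent direction, so grad^T d < 0
print("directional derivative grad^T d =", float(g @ d))
for eta in (1e-1, 1e-2, 1e-3, 1e-4):     # difference quotient approaches grad^T d as eta -> 0
    q = (xor_error(w + eta * d) - xor_error(w)) / eta
    print(f"eta={eta:.0e}  [f(w + eta d) - f(w)] / eta = {q:.6f}")
```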

Suppose that the mapping algorithm, B, operating on the error function, f(w), produces descent directions. The map is closed as a set-valued mapping (Luenberger 1984), such that the correct learning rate is obtained during the training phase. It is therefore possible, by any standard computing technique, to identify η_k for each epoch of the training cycle, k, while descending along the minimum trajectory and exploring all m dimensions of the weight space. The error function, as a result, monotonically converges to a minimum value. Finally, the algorithm progressively converges to the acceptable minimum boundary in a few epochs. Table 2 shows the algorithm to compute the self-adaptive learning rates, while Table 3 shows the algorithm to compute the directional derivatives.
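Taken together, one pass of the scheme can be read as: estimate a descent direction (Table 3), search for a learning rate along it (Table 2), accept the step only if the error decreases, and stop when the gradient norm or the change in error falls below tolerance. The loop below is a hedged composite that reuses the descent_direction and adaptive_learning_rate sketches given earlier; the tolerances, the retry rule, and the use of a single search direction per epoch (rather than a sweep over all m coordinate directions) are illustrative simplifications.

```python
import numpy as np

def train(f, w0, eta0=0.1, tol_grad=1e-6, tol_err=1e-6, max_epochs=25000):
    """Self-adaptive training loop: descent direction plus interpolated learning rate.

    f  : scalar error function of the flat weight vector
    w0 : initial weights; returns the trained weights and the error history.
    """
    w = w0.copy()
    history = [f(w)]
    for epoch in range(max_epochs):
        d, grad_norm = descent_direction(f, w)               # Table 3: direction
        if d is None or grad_norm < tol_grad:
            break                                            # gradient ~ 0: report w
        eta, err_new = adaptive_learning_rate(f, w, d, eta0) # Table 2: step length
        if err_new < history[-1]:                            # accept only descent steps
            w = w + eta * d
            history.append(err_new)
        else:
            eta0 *= 0.5                                      # shrink the seed step and retry
            continue
        if abs(history[-2] - history[-1]) < tol_err:
            break
    return w, history
```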

DYNAMIC TRAINING
The most difficult computational steps in identifying training weights in an ANN arise when the ANN error function contains several local minima and the slope is flat within the reasonable range of the network training weights w_j. The reduced ANN error function in the lower dimension (E^1) can be considered a continuous function of the m training weights w_j, j = 1, 2, 3, ..., m, describing a hypersurface in m-dimensional space (Moller 1997). By a suitable transformation, an approximate one-dimensional error function with a parabolic hypersurface is constructed, so that the learning weights can be found from the new projected function with an initial estimate made at the beginning of training. The training is initiated with a random starting weight vector, and a few experiments are carried out to approximate suitable learning rates for all the neurons in the neighborhood. The learning rate thus obtained improves the convergence of the error function. Table 4 compares the standard BP training results with the proposed training method.
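The dimensional reduction described here can be illustrated directly: project the m-dimensional error onto a single search direction, φ(η) = f(w + η d), sample φ at a few points, and fit a parabola whose vertex gives the learning-rate estimate. The helper below reuses the xor_error and descent_direction sketches from earlier sections; the three sample points are an arbitrary illustrative choice.

```python
import numpy as np

def parabolic_rate(f, w, d, etas=(0.0, 0.5, 1.0)):
    """Project f onto the line w + eta * d and estimate the minimizing eta
    from a parabola fitted through three samples of the projected function."""
    phi = [f(w + e * d) for e in etas]      # one-dimensional projected error
    a, b, c = np.polyfit(etas, phi, 2)      # phi(eta) ~ a*eta^2 + b*eta + c
    if a <= 0:                              # projection not convex here: keep best sample
        return etas[int(np.argmin(phi))]
    return -b / (2.0 * a)                   # vertex of the fitted parabola

rng = np.random.default_rng(2)
w = rng.uniform(-0.5, 0.5, size=9)
d, _ = descent_direction(xor_error, w)      # direction from the earlier sketch
print(f"estimated learning rate along d: {parabolic_rate(xor_error, w, d):.4f}")
```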


Learning Rate for the Standard Backpropagation Training Method


To choose a suitable learning rate for the standard BP training, several experiments with the XOR problem are carried out. The initial starting weights are the same, but the learning rate is changed from one experiment to another. The simulated results are analyzed with different learning rates, and the effect of the learning rate on the function value, the number of epochs, and the number of function evaluations is observed. The simulated experiments that reach a function value of the order of 10^-4 are taken into consideration to identify suitable learning rates for standard BP training. The experiments suggest that a reasonable value of the learning rate is 0.1, which is a compromise between the number of epochs, the number of function evaluations, and the terminal function value.
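The experiment described above can be reproduced in outline as a simple sweep: train the same network from the same starting weights with several fixed learning rates and record the epochs and terminal error for each. The sketch below uses plain batch gradient descent with a numerically estimated gradient in place of hand-coded backpropagation, and reuses the xor_error sketch from earlier; the candidate rates, the 10^-4 stopping threshold, and the epoch cap are illustrative.

```python
import numpy as np

def numeric_grad(f, w, h=1e-6):
    """Central-difference gradient, standing in for backpropagated derivatives."""
    g = np.zeros_like(w)
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += h
        wm[j] -= h
        g[j] = (f(wp) - f(wm)) / (2.0 * h)
    return g

def fixed_rate_bp(f, w0, eta, tol=1e-4, max_epochs=5000):
    """Gradient descent with a fixed learning rate; returns (epochs, final error)."""
    w = w0.copy()
    for epoch in range(1, max_epochs + 1):
        err = f(w)
        if err < tol:
            return epoch, err
        w -= eta * numeric_grad(f, w)
    return max_epochs, f(w)

rng = np.random.default_rng(1)
w0 = rng.uniform(-0.5, 0.5, size=9)           # same start for every candidate rate
for eta in (0.01, 0.05, 0.1, 0.5, 1.0):       # candidate fixed learning rates
    epochs, err = fixed_rate_bp(xor_error, w0, eta)
    print(f"eta={eta:<4}  epochs={epochs:<5}  final error={err:.2e}")
```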


DISCUSSIONS ON THE TRAINING RESULTS

Figure 1 shows the self-adaptive parameter and the function value as the training progresses. Notice that within 30 epochs the XOR error function converges to a low magnitude. Figure 1 also indicates that the convergence is monotonic, signifying that a convergent sequence of the error function is generated. The self-adaptive dynamic learning rates are relatively large at the early stage of training and gradually reduce to a smaller magnitude as the training progresses. This implies that large oscillations in training are prevented. The training is continued further to observe the convergence pattern. It is noted that as the training continues beyond 100 epochs, the decrease in the error function is relatively insignificant. In general, the learning rate at this stage is very small in magnitude. Somewhere between epochs 181 and 191 there is a learning rate of higher magnitude compared with the previous ones. It indicates that for some time there is no significant improvement in the error function but, at that particular point, a further decrease in the error function is achieved; the training is able to escape the long flat valley. As the training continues, the improvement is small. During the training cycle between epochs 281 and 291, again, a relatively large learning rate is detected, and the training gains a considerable reduction in the error function. The training does not terminate even when a flat valley is encountered during the training phase; this is one of the salient features of the training scheme. Ten simulation experiments were carried out to test the proposed training method against the standard BP algorithm (Rumelhart and McClelland 1989) with small random starting weights. The corresponding simulation results are shown in Table 4. To find the appropriate learning rates, all the directions are explored, as demonstrated in Table 1 and Figure 2a, b, and c. The average epoch count, function evaluations, and gradient evaluations of the proposed method are 151.8, 916.8, and 2601.1, respectively. The corresponding values with the standard backpropagation training are 463.6, 2787.6, and 465.6, respectively. The median performance of the proposed algorithm in terms of epoch count, function evaluations, and gradient evaluations corresponds to the values 17, 108, and 206, while with the standard backpropagation training these counts are 25.5, 159, and 27.5, respectively. The related standard deviations are low with the proposed training method and suggest consistent performance, whereas standard backpropagation shows standard deviations of high magnitude and, therefore, inconsistent behavior is expected.

CONCLUSIONS

It is demonstrated in this study how an ANN error function can be trained dynamically with variable learning rates. A 2-2-1 XOR ANN training function is used to exhibit the training results. A higher-dimensional error function is transformed into a lower-dimensional error function in several phases. A closed algorithmic map defined over a descent error function estimates the variable learning rates. The training explores the error space in all dimensions to identify the individual learning rates of all the neurons. These are important features of this training method. The results indicate that as the training progresses the descent direction changes dynamically. The learning rates do not have fixed values. The closed algorithmic map identifies the descent directions and the variable learning rates. During each epoch, the shape and properties of the error function change as the error function is gradually reduced. The proposed training algorithm takes these changes in the geometry of the error function into account. Sometimes the error function is flat and the reduction in the error function is insignificant, but the proposed algorithm does not prematurely terminate the training. At a training epoch near a flat surface, the algorithm identifies a suitable learning rate so that a further reduction in the error function is possible, and the training escapes the local minimum. The proposed self-adaptive training method trains the XOR problem within 20 epochs and produces the vector [0.00098, 1.00001, 0.99995, 0.0006] against the true value [0, 1, 1, 0].

REFERENCES
Ahmed, S., and J. Cross. 2000. Convergence in Artificial Neural Network without learning parameter. In Proceedings of the Second International Computer Science Conventions on Neural Computations, Berlin, Germany.
Ahmed, S., J. Cross, and A. Bouzerdoum. 2000. Performance analysis of a new multi-directional training algorithm for feed-forward Neural Network. World Neural Network Journal.
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Clarendon.
Cheng, B., and D. M. Titterington. 1994. Neural networks: A review from a statistical perspective. Statistical Science 9(1):2–54.
Fahlman, S. E. 1988. Faster-learning variations on back-propagation: An empirical study. In Proceedings of the Connectionist Models Summer School, eds. D. Touretzky, G. Hinton, and T. Sejnowski, 38–51. San Mateo, CA: Morgan Kaufmann.

Haykin, S. 1999. Neural Networks: A Comprehensive Foundation. New Jersey: Prentice Hall.
Hertz, J., A. Krogh, and R. G. Palmer. 1991. Introduction to the Theory of Neural Computation. Reading, MA: Addison-Wesley.
Hinton, G. E. 1987. Learning translation invariant recognition in massively parallel network. In Proceedings of the PARLE Conference on Parallel Architectures and Languages Europe, eds. J. W. De Bakker, A. J. Nijman, and P. C. Treleaven, 1–13. Berlin: Springer-Verlag.
Jacobs, R. A. 1988. Increased rate of convergence through learning rate adaptation. Neural Networks 1:295–307.
Jang, J.-S. R., C.-T. Sun, and E. Mizutani. 1997. Neuro-Fuzzy and Soft Computing. New Jersey: Prentice Hall International.
Kamarthi, S. V., and S. Pittner. 1999. Accelerating neural network training using weight extrapolation. Neural Networks 12:1285–1299.
Kiefer, J. 1953. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society 4:502–506.
Kuan, C. M., and H. White. 1994. Artificial neural network: An econometric perspective. Econometric Reviews 13(1):1–91.
Luenberger, D. G. 1984. Introduction to Linear and Nonlinear Programming, 2nd edition. Reading, MA: Addison-Wesley.

Moller, M. 1997. Efficient training of feed-forward neural networks. In Neural Network: Analysis, Architectures and Applications, ed. A. Brown. Bristol and Philadelphia, PA: Institute of Physics Publishing.
Muller, B., and J. Reinhardt. 1990. Neural Networks: An Introduction. Berlin: Springer.
Partovi, F., and M. Anandrajan. 2002. Classifying inventory using an artificial neural network approach. Computers and Industrial Engineering 41:389–404.
Rumelhart, D. E., and J. L. McClelland. 1989. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I. Cambridge, MA: The MIT Press.
Sakamoto, S., and Y. Kobuchi. 2000. Convergence property of topographic mapping formation from cell layer to cell layer through correlation learning rule. Neural Networks 13:709–718.
Vogl, T. P., J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon. 1988. Accelerating the convergence of the back-propagation method. Biological Cybernetics 59:257–263.
Weir, M. K. 1991. A method for self-determination of adaptive learning rates in back propagation. Neural Networks 4:371–379.
Werbos, P. 1994. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. New York, NY: John Wiley and Sons.
