
Csm10: BACKPROPAGATION: An Example of Supervised Learning

References: Textbook article on Backpropagation. Byte Magazine article on Backpropagation, October 1987, pages 155-162.
Aims: To introduce one of the most commonly used forms of supervised learning (backpropagation) and
show how it can be used in a feedforward network architecture.
Objectives: You should be able to:
Explain what supervised learning is.
Explain the basic ideas behind how feedforward networks can be trained using the backpropagation
algorithm.
There are many potential computer applications that are difficult to implement: for example, applications that must perform some complex data translation yet have no pre-defined mapping function describing the translation process, or applications that must provide a 'best guess' at the correct output when presented with noisy input data. One neural network that has been shown to be useful in addressing such problems is the feed-forward network (often trained using the backpropagation algorithm), a generic version of which is illustrated in Figure 1.

Fig. 1.
The network consists of three distinct layers of units, each of which can take on a real-numbered value between 0 and 1. The INPUT LAYER, where sets of data are presented to the network, is connected by bi-directional weighted connections to the HIDDEN LAYER, which is itself connected by bi-directional weighted connections to the OUTPUT LAYER. All the weights in the network are modifiable, and the network learns to produce the correct input-output mapping by modifying these weights. The backpropagation network is an example of supervised learning: the network is repeatedly presented with sample inputs to the input layer, the desired activation of the output layer for each sample input is compared with the actual activation of the output layer, and the network learns by adjusting its weights until it has found a set of weights that produces the correct output for every sample input. For example, if we were trying to solve the XOR (exclusive-OR) problem, our training inputs and target outputs would be as in Figure 2.

Fig. 2.
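Written out in full, the training set of Figure 2 consists of just four input patterns and their target outputs. As a minimal sketch in Python (the representation and names are purely illustrative):

```python
# XOR training set: each input pattern x = (x1, x2) is paired with its
# desired (target) output d. Two input units, one output unit.
training_set = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 1.0),
    ((1.0, 0.0), 1.0),
    ((1.0, 1.0), 0.0),
]
```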

Consider a feedforward network with n input units (indexed by i), L hidden units (indexed by j) and M output units (indexed by k). The backpropagation training cycle consists of two distinct phases:
Forward pass (Figure 3):
(A) One of the set of p training input patterns is applied to the input layer.
x_p = (x_{p1}, x_{p2}, \ldots, x_{pn}), which may be a binary or real-numbered vector.
(B) The activations of units in the hidden layer are calculated by taking their net input (the sum of the
activations of the input layer units they are connected to multiplied by their respective connection
weights) and passing it through a transfer function (discussed later).
i)

net input to hidden layer unit j:

net_j = \sum_{i=1}^{n} w_{ji} x_i \qquad (1)

i.e. take the value of each of the n input units connected to it and multiply it by the respective connection
weight between them.
ii)

output (activation) of hidden layer unit j:

oh_j = f(net_j)
i.e. take net input of unit j and pass it through a transfer function
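The transfer function itself is discussed later; however, the delta terms in equations (3) and (4) below use the factor o(1 - o), which is the derivative of the logistic (sigmoid) function, so the sketches in this handout assume a sigmoid:

```python
import math

def sigmoid(net):
    """Logistic transfer function: squashes any net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))
```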
(C) The activations of the hidden layer units calculated in (B) are then used to update the activations of the output units (or unit, in the case of XOR). The activation of each output unit is calculated by taking its net input (the sum of the activations of the hidden layer units it is connected to, multiplied by their respective connection weights) and passing it through the same transfer function.

Fig. 3.
i)

net input to output unit k:

net_k = \sum_{j=1}^{L} w_{kj} oh_j \qquad (2)

ii)

output of output unit k:

oo_k = f(net_k)
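Putting steps (A) to (C) together, the forward pass for a single training pattern might be sketched as follows. The weight layout is an assumption of this sketch: w_hidden[j][i] is the weight from input unit i to hidden unit j, and w_output[k][j] is the weight from hidden unit j to output unit k.

```python
def forward_pass(x, w_hidden, w_output):
    """Forward pass for one input pattern x (equations 1 and 2)."""
    # (B) hidden activations: net input (eq. 1) passed through the transfer function
    oh = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x))) for w_j in w_hidden]
    # (C) output activations: net input (eq. 2) passed through the same transfer function
    oo = [sigmoid(sum(w_kj * oh_j for w_kj, oh_j in zip(w_k, oh))) for w_k in w_output]
    return oh, oo
```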
Backward pass (Figures 4 and 5):
(A) The difference between the actual activation of each output unit and the desired target activation (d k)
for that unit is found, and this difference is used to generate an error signal for each output unit. A
quantity called delta is then calculated for all of the output units.

i)

error signal for each output unit k is the difference between its actual output oo_k and its desired output d_k:

(d_k - oo_k)

ii)

delta term for each output unit is equal to its error signal multiplied by the output of that unit, multiplied by (1 - its output):

\delta o_k = (d_k - oo_k) \, oo_k (1 - oo_k) \qquad (3)

(B) The error signals for the hidden layer units are then calculated by taking the sum of the deltas of the
output units a particular hidden unit connects to multiplied by the weight that connects the hidden and
output unit. The deltas for each of the hidden layer units are then calculated.
i)

error signal for each hidden unit j:

\sum_{k=1}^{M} \delta o_k w_{kj}

ii)

delta term for each hidden unit j is equal to its error signal multiplied by its output, multiplied by (1 - its output):

\delta h_j = oh_j (1 - oh_j) \sum_{k=1}^{M} \delta o_k w_{kj} \qquad (4)

Fig. 4.
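Using the same conventions as the forward-pass sketch, the delta terms of equations (3) and (4) could be computed as:

```python
def compute_deltas(oo, oh, d, w_output):
    """Delta terms for the output units (eq. 3) and hidden units (eq. 4)."""
    # Output deltas: error signal (d_k - oo_k) times oo_k(1 - oo_k)
    delta_o = [(d_k - oo_k) * oo_k * (1.0 - oo_k) for d_k, oo_k in zip(d, oo)]
    # Hidden deltas: back-propagated error sum times oh_j(1 - oh_j)
    delta_h = [oh_j * (1.0 - oh_j) *
               sum(delta_o[k] * w_output[k][j] for k in range(len(delta_o)))
               for j, oh_j in enumerate(oh)]
    return delta_o, delta_h
```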
(C) The weight error derivatives for each weight between the hidden and output units are calculated by
taking the delta of each output unit and multiplying it by the activation of the hidden unit it connects to.
These weight error derivatives are then used to change the weights between the hidden and output
layers.

wed_{jk} = \delta o_k \, oh_j \qquad (5)
i.e. to calculate the weight error derivative between hidden unit j and output unit k take the delta term of
output unit k and multiply it by the output (activation) of hidden unit j
(D) The weight error derivatives for each weight between the input unit i and hidden unit j are calculated
by taking the delta of each hidden unit and multiplying it by the activation of the input unit it connects to
(i.e. that input pattern x i). These weight error derivatives are then used to change the weights between
the input and hidden layers.

wed_{ij} = \delta h_j \, x_i
To change the actual weights themselves, a learning rate parameter η (eta) is used, which controls the amount by which the weights are updated during each backpropagation cycle. The weights at time (t + 1) between the hidden and output layers are set using the weights at time t and the weight error derivatives between the hidden and output layers, using the following equation:

w_{jk}(t + 1) = w_{jk}(t) + \eta \, wed_{jk} \qquad (6)

In a similar way the weights between the input and hidden units are changed:

w_{ij}(t + 1) = w_{ij}(t) + \eta \, wed_{ij} \qquad (7)

Fig. 5.
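Combining the weight error derivatives with the update rules of equations (6) and (7), a sketch of the weight change step is given below; the default value of eta is an arbitrary illustrative choice.

```python
def update_weights(w_hidden, w_output, x, oh, delta_o, delta_h, eta=0.5):
    """Change the weights using the weight error derivatives (eqs. 5-7)."""
    # Hidden-to-output weights: wed_jk = delta_o_k * oh_j (eq. 5), applied via eq. (6)
    for k in range(len(w_output)):
        for j in range(len(oh)):
            w_output[k][j] += eta * delta_o[k] * oh[j]
    # Input-to-hidden weights: wed_ij = delta_h_j * x_i, applied via eq. (7)
    for j in range(len(w_hidden)):
        for i in range(len(x)):
            w_hidden[j][i] += eta * delta_h[j] * x[i]
```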
Using this method, each unit in the network receives an error signal that describes its relative contribution to the total error between the actual output and the target output. Based on the error signal received, the weights connecting the units in different layers are updated.
These two passes are repeated many times for different input patterns and their targets, until the error
between the actual output of the network and its target output is acceptably small for all of the members
of the set of training inputs (what constitutes an acceptably small error will be discussed later).
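For the XOR data of Figure 2, the two passes can be wrapped in a simple training loop such as the one below, using the training_set, forward_pass, compute_deltas and update_weights sketches above. The network size (2-2-1), initial weight range, learning rate and stopping error are illustrative choices only, and convergence from any particular random starting point is not guaranteed, especially as the network described here has no bias (threshold) terms.

```python
import random

def train_xor(epochs=10000, eta=0.5, tolerance=0.1):
    """Repeated forward/backward passes over the XOR training set defined earlier."""
    # 2 input units, 2 hidden units, 1 output unit; small random initial weights
    w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
    w_output = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(1)]
    for _ in range(epochs):
        worst_error = 0.0
        for x, target in training_set:
            oh, oo = forward_pass(x, w_hidden, w_output)                   # forward pass
            delta_o, delta_h = compute_deltas(oo, oh, [target], w_output)  # backward pass
            update_weights(w_hidden, w_output, x, oh, delta_o, delta_h, eta)
            worst_error = max(worst_error, abs(target - oo[0]))
        if worst_error < tolerance:  # "acceptably small" error for every training pattern
            break
    return w_hidden, w_output
```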
This form of training can be applied to much larger networks than the XOR network to solve much more
complex problems, but the basic two-pass cycle remains the same. As the network trains, units in the
hidden layer organise themselves such that different units learn to recognise different features of the
total input space. For example, if a network were trained to respond to a pixel image of the letter 'T', one
unit might develop as a feature detector for the vertical bar on the top of the 'T'. After training, when
presented with an arbitrary new input pattern which is noisy or incomplete, the units in the hidden layer
will respond with an active output if the new input contains a pattern that resembles the feature the
individual units learned to recognise during training (for example our unit may still respond to the vertical
bar on the top of the 'T' if a pixel were missing). Conversely, hidden-layer units have a tendency to inhibit
their outputs if the input pattern does not contain a feature they were trained to recognise (if for example
a 'C' were input). These networks tend to develop internal relationships between units so as to organise the training data into classes of patterns (these classes may not be evident to a human observer, but can sometimes be detected by applying clustering algorithms to the weights and activations of the units in the network). In this way the network develops an internal representation that enables it to generate the desired outputs when given the training inputs. The same internal representation can also be applied to inputs that were not used during training: the backpropagation network will classify these new inputs according to the features they share with the training inputs, i.e. these networks have the ability to generalise.
An Example Application: A Decision System for Processing Consumer Credit Applications
Partitioning of available data: The available data consists of the relevant details of several thousand previous applications for credit, including information as to whether the loan was repaid in full or whether the applicant defaulted on the agreement. If, for example, we have the details of 5000 previous credit agreements, these could be split into a training set of 4000 (used to train the network) and a test set of 1000 (randomly selected from the original 5000 and held in reserve to test the predictive accuracy of the network once it has been trained).
Training inputs (Figure 6): Details from the applications such as age of applicant, salary level and size
of other financial commitments, could be input to the network either as analog values (for example
salary could be input as a real number) or as binary inputs (for example salary could be divided into 4
bands, and the unit associated with the relevant band switched on in each training cycle). Many other
relevant inputs could be used in the training of the network (such as occupation), but they will not be
discussed here.
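As an illustration of the banded binary encoding mentioned above (the four salary bands here are invented purely for the example):

```python
def salary_to_band_inputs(salary):
    """One-of-four binary encoding of salary: switch on the unit for the relevant band."""
    bands = [(0, 15000), (15000, 30000), (30000, 60000), (60000, float("inf"))]
    return [1.0 if low <= salary < high else 0.0 for low, high in bands]

# e.g. salary_to_band_inputs(42000) -> [0.0, 0.0, 1.0, 0.0]
```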
Target outputs: These could simply be two units signifying whether the applicant repaid the loan or not,
or one output to indicate whether the applicant repaid the loan and another to indicate the time taken to
repay the loan.

Fig. 6.
The network would be trained by repeated presentation of the training inputs and their loan outcomes, until the error between the output and target units became acceptably small. The data for the 1000 examples in the test set would then be presented to measure the predictive accuracy of the system on examples it had never before encountered. If the system performed well enough, it could then be used to process new loan applications and decide whether to provide credit to an applicant depending on their application details.

To summarise:
- Backpropagation is an example of supervised learning.
- Training inputs and their corresponding target outputs are supplied to the network.
- The network calculates error signals, and uses these to adjust the weights.
- After many passes, the network settles to a low error on the training data.
- It is then tested on test data that it has not seen before, to measure its generalisation ability.
