
Chapter-2: Fundamental Concepts of ANN

2.1 ANN Fundamentals


An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by
the way biological nervous systems, such as the brain, process information. The key element of
this paradigm is the novel structure of the information processing system. It is composed of a
large number of highly interconnected simple processing elements (neurons) working in unison
to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a
specific application, such as pattern recognition or data classification, through a learning process.

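[Figure: a biological neuron (dendrites, axon, terminal branches of axon) shown alongside an artificial neuron with inputs x1, x2, x3, ..., xn and weights w1, w2, w3, ..., wn.]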

Learning in biological systems involves adjustments to the synaptic connections that exist
between the neurons. This is true of ANNs as well, i.e., learning occurs via changes in the values of the
connection weights. ANNs can process information at great speed owing to their
massive parallelism.

The building blocks of artificial neural networks are artificial neurons. Artificial neurons have
the same basic components as biological neurons. The three basic elements of the neural model are:
A. A set of synapses, or connecting links, each of which is characterized by a weight or
strength of its own.
B. An adder for summing the input signals, weighted by the respective synaptic strengths of
the neuron; the operations described here constitute a linear combiner.
C. An activation function for limiting the amplitude of the output of a neuron.
The simplest ANNs consist of a set of McCulloch-Pitts neurons. The McCulloch-Pitts neuron was
originally proposed by McCulloch and Pitts (McCulloch & Pitts 1943) as an attempt to model the
computing process of the biological neuron. The McCulloch-Pitts model is composed of one neuron only;
it has limited computing capability and no learning capability.

Introduction to Neural Network Compiled By Gidena M Page 1


Figure 2.1 Nonlinear model of a neuron, labeled k.
The neural model also includes an externally applied bias, denoted by θ (theta), which has the effect
of increasing or lowering the net input of the activation function, depending on whether it is
positive or negative, respectively.

Figure 2.2: Artificial neuron model


The neuron model shown in Figure 2.2 is the one that is widely used in artificial neural networks,
with some minor modifications.
• It has N inputs, denoted as u1, u2, ..., uN.
• Each line connecting these inputs to the neuron is assigned a weight; the weights are denoted
w1, w2, ..., wN respectively. Weights in the artificial model correspond to the
synaptic connections in biological neurons.
• Θ or θ represents the threshold or bias of the artificial neuron.
• The inputs and the weights are real values.
• The activation is given by the formula

$$a = \sum_{j=1}^{N} w_j u_j + \theta$$ (Equation-1)

• A negative value for a weight indicates an inhibitory connection while a positive value
indicates an excitatory one. Although in biological neurons θ has a negative value, it
may be assigned a positive value in artificial neuron models. If θ is positive, it is usually
referred to as the bias. For mathematical convenience we will use the (+) sign in the activation
formula.
• Sometimes, the threshold is combined for simplicity into the summation part by assuming
an imaginary input u0 = +1 and a connection weight w0 = θ. Hence the activation
formula becomes:

$$a = \sum_{j=0}^{N} w_j u_j$$ (considering the bias as an external input)

• The output value of the neuron is a function of its activation, in analogy to the firing
frequency of biological neurons: x = f(a).
• Originally, the neuron output function f(a) in the McCulloch-Pitts model was proposed as a
threshold function; however, linear, ramp and sigmoid functions are also widely used
output functions.
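The neuron model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample inputs and weights, and the choice of output functions (a threshold that fires at a >= 0 and a logistic sigmoid) are hypothetical and do not come from the notes.

```python
import math

def activation(inputs, weights, theta):
    """Activation a = sum_j w_j * u_j + theta (bias added with a + sign)."""
    return sum(w * u for w, u in zip(weights, inputs)) + theta

def activation_with_bias_input(inputs, weights, theta):
    """Same value, with the bias folded in as w0 = theta on a constant input u0 = +1."""
    all_weights = [theta] + list(weights)
    all_inputs = [1.0] + list(inputs)
    return sum(w * u for w, u in zip(all_weights, all_inputs))

def threshold(a):
    """Threshold output function (assumed here to fire when a >= 0)."""
    return 1.0 if a >= 0 else 0.0

def sigmoid(a):
    """Logistic sigmoid output function."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical example values, chosen only for illustration.
u = [1.0, 0.0, 1.0]
w = [0.5, -0.4, 0.25]   # the negative weight models an inhibitory connection
theta = -0.5

a = activation(u, w, theta)   # 0.5*1 + (-0.4)*0 + 0.25*1 - 0.5
x_threshold = threshold(a)    # neuron output x = f(a) with a threshold f
x_sigmoid = sigmoid(a)        # neuron output x = f(a) with a sigmoid f
```

Both activation functions return the same value, which is the point of treating the bias as the weight of an extra constant input.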

Artificial neural networks are among the most powerful learning models. They have the
versatility to approximate a wide range of complex functions representing multidimensional
input-output maps. Neural networks also have inherent adaptability, and can perform robustly
even in noisy environments.
An artificial neural network is a computational model inspired by the structure, processing method
and learning ability of a biological brain. It is composed of a number of interconnected simple
processing elements called neurons or nodes. The neurons are interconnected to form a group and
process information using a connectionist approach.

Neural networks, with their remarkable ability to derive meaning from complicated or imprecise
data, can be used to extract patterns and detect trends that are too complex to be noticed by either
humans or other computer techniques. A trained neural network can be thought of as an "expert"
in the category of information it has been given to analyse. This expert can then be used to
provide projections given new situations of interest and answer "what if" questions.
Other advantages include:
A. Adaptive learning: An ability to learn how to do tasks based on the data given for training
or initial experience.
B. Self-Organisation: An ANN can create its own organization or representation of the
information it receives during learning time.
C. Real Time Operation: ANN computations may be carried out in parallel, and special
hardware devices are being designed and manufactured which take advantage of this
capability.
D. Fault Tolerance via Redundant Information Coding: Partial destruction of a network
leads to the corresponding degradation of performance. However, some network
capabilities may be retained even with major network damage.



Artificial neural networks are represented by a set of nodes, often arranged in layers, and a set of
weighted directed links connecting them. The nodes are equivalent to neurons, while the links
denote synapses. The nodes are the information processing units and the links act as
communicating media.

Figure 2.3: Artificial neural network


A neuron or processing element (PE) has primarily two things to do.
• First, it computes an output, which is sent to other PEs or outside the network. The
neuron or PE determines its output value by applying a transfer function.
• Second, it updates a local memory, i.e., weights and other types of data called data
variables.
Specifically, a neural network model is composed of a number of neurons that are organized
in several layers: an input layer, one or more hidden layers, and an output layer. The input layer of neurons
feeds the input variables into the network. The hidden layer is a bridge between the input
layer and the output layer. The neurons in this layer are fundamentally hidden from view,
and their number and arrangement can typically be treated as a black box by those who are
using the system.
The function of the hidden layer is to process the input variables. This is achieved by
summing up all weighted inputs, checking whether the sum meets the threshold value and
applying the transformation function. The weights between the input neurons and the hidden
neurons determine whether each unit in the hidden layer fires, so modifying these
weights changes when a hidden unit fires. In other words, the hidden layers learn the
relationship between inputs and outputs in a way similar to that of the human brain by
adjusting the weights during the training process.
The function of the output layer is similar to that of the hidden layer. Each input for
this layer is processed as in the hidden layer. The processing ability of the network is stored in
the inter-unit connection strengths or weights obtained by a process of learning from a set of
training patterns.



2.2 ANN Architectures/Structures/Topologies

There are a wide variety of networks depending on the nature of information processing carried
out at individual nodes, the topology of the links, and the algorithm for adaptation of link
weights. In other words, a specific neural network model is determined by its topology, learning
paradigm and learning algorithm.
The manner in which the neurons of a neural network are structured is intimately linked with the
learning algorithm used to train the network. We may therefore speak of learning algorithms
(rules) used in the design of neural networks as being structured.
In general, according to the structure/topology of the connections, ANNs can fundamentally be
categorized into two different classes of network architectures:
A. Feed forward Neural networks
B. Recurrent Neural Networks
1. Feed forward Neural networks:
In feed forward neural networks, the neurons are organized in the form of layers. The neurons in
a layer get input from the previous layer and feed their output to the next layer. In this kind of
networks connections to the neurons in the same or previous layers are not permitted.
The last layer of neurons is called the output layer, and the layers between the input and output
layers are called the hidden layers. The input layer is made up of special input neurons,
transmitting only the applied external input to their outputs.
• Links in a feed forward network can only go in one direction, and each unit is linked only to units
in the next layer; i.e., no units are linked within the same layer, back to a previous layer, or
skipping a layer.
• Computations can proceed uniformly from input to output units.
• There are no feedback connections.



Figure 2.4: Feed forward Neural Network
Based on the number of layers, feed forward neural networks are also categorized into two types:
A. Single layer feed forward networks: a network with only a layer of input nodes and a
single layer of neurons constituting the output layer is called a single layer network.
This is also known as a perceptron without a hidden layer.
B. Multilayer feed forward networks: in this network there are one or more hidden
layers, whose computation nodes are correspondingly called hidden neurons or hidden
units. The term “hidden” refers to the fact that this part of the neural network is not seen
directly from either the input or output of the network.

Figure 2.4: Single layer network
Figure 2.5: Multilayer neural network



The function of hidden neurons is to intervene between the external input and the network output
in some useful manner. By adding one or more hidden layers, the network is enabled to extract
higher-order statistics from its input.
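As a rough illustration of how signals pass layer by layer through a feed forward network, the sketch below computes the output of a small network with one hidden layer. The layer sizes, weight values and the use of a logistic sigmoid in every neuron are assumptions made for the example, not values from the notes.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, weights, biases):
    """Output of one layer: weights[k][j] links input j to neuron k of the layer."""
    return [
        sigmoid(sum(w * u for w, u in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Propagate the input through each (weights, biases) layer in turn.

    Each layer only receives the previous layer's output, so there are
    no same-layer, backward, or layer-skipping links.
    """
    signal = list(inputs)
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_layer = ([[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]], [0.0, 0.1, -0.1])
output_layer = ([[1.0, -1.0, 0.5]], [0.0])

hidden_only = feed_forward([1.0, 0.0], [hidden_layer])
y = feed_forward([1.0, 0.0], [hidden_layer, output_layer])
```

Passing only the output layer's shape of data (a single `(weights, biases)` pair) gives a single layer network; adding more pairs to the list gives a multilayer one.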
2. Recurrent Neural Networks:
A recurrent neural network distinguishes itself from a feed forward neural network in that it has
at least one feedback loop. In this structure, connections to the neurons of the same layer or to the
previous layers are allowed. In recurrent networks, due to feedback, it is not possible to obtain a
triangular weight matrix with any assignment of the indices.
• This topology involves backward links from the output to the input and hidden layers, i.e., at
least one feedback connection. It may or may not have hidden units.
• It may take a long time to compute a stable output.
• The learning process is much more difficult.
• It can implement more complex designs.

Figure 2.6: Recurrent Neural Network

2.3 Learning Process in ANN


The weights in an ANN are adjusted to solve the problem presented to ANN. Learning or
training is used to describe the process of finding values of these weights. Just as there are
different ways in which we ourselves learn from our own surrounding environments, so it is with
neural networks. In a broad sense, we may categorize the learning processes through which
neural networks function as follows: learning with a teacher and learning without a teacher.

Learning/training in ANN is the process of teaching the neural net to learn hidden patterns from a
training dataset. During the training process, the NN parameters are adjusted in order to make
the actual (predicted) outputs of the network close to the target (measured) outputs.
There are three learning approaches in ANN. These are:



A. Supervised learning,
B. Unsupervised learning
C. Reinforcement Learning.
A. Supervised learning: this is learning/training by example; i.e., the desired output
for each input pattern is known a priori.
• The network is trained by providing it with input and matching output patterns.
• This learning occurs when there is a known target value associated with each input
in the training set.
• The output of the ANN is compared with a target value, and this difference is used to
train the ANN by altering the weights. Example: classification problems.
Supervised learning is also referred to as learning with a teacher. In conceptual terms, we may
think of the teacher as having knowledge of the
environment, with that knowledge being represented by a set of input–output examples. The
environment is, however, unknown to the neural network. Suppose now that the teacher and the
neural network are both exposed to a training vector (i.e., example) drawn from the same
environment.
By virtue of built-in knowledge, the teacher is able to provide the neural network with a desired
response for that training vector. Indeed, the desired response represents the “optimum” action to
be performed by the neural network. The network parameters are adjusted under the combined
influence of the training vector and the error signal. The error signal is defined as the difference
between the desired response and the actual response of the network. This adjustment is carried
out iteratively in a step-by-step fashion with the aim of eventually making the neural network
emulate the teacher; the emulation is presumed to be optimum in some statistical sense. In this
way, knowledge of the environment available to the teacher is transferred to the neural network
through training and stored in the form of “fixed” synaptic weights, representing long-term
memory. When this condition is reached, we may then dispense with the teacher and let the
neural network deal with the environment completely by itself.
The form of supervised learning we have just described is the basis of error correction learning.
The supervised-learning process constitutes a closed-loop feedback system, but the unknown
environment is outside the loop. As a performance measure for the system, we may think in
terms of the mean square error, or the sum of squared errors over the training sample, defined as
a function of the free parameters (i.e., synaptic weights) of the system. This function may be
visualized as a multidimensional error-performance surface, or simply error surface, with the free
parameters as coordinates.

Learning/training algorithm in supervised learning

1. Compute the error between desired and actual outputs
2. Use the error through a learning rule (e.g., gradient descent) to adjust the network’s
connection weights
3. Repeat steps 1 and 2 for input/output patterns to complete one epoch
4. Repeat steps 1 to 3 until the maximum number of epochs is reached or an acceptable
training error is reached

Two types of supervised learning algorithms exist, based on when/how weights are updated:
1. Stochastic/Delta/(online) learning, where the NN weights are adjusted after each
pattern presentation. In this case the next input pattern is selected randomly from the
training set, to prevent any bias that may occur due to the sequences in which patterns
occur in the training set.
2. Batch/(offline) learning, where the NN weight changes are accumulated and used to
adjust weights only after all training patterns have been presented.
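The difference between the two update schedules can be sketched as follows. The example assumes a single linear unit trained with an error-correction update of the form w_i <- w_i + eta * (t - o) * x_i; the toy data, the learning rate and all function names are hypothetical choices for illustration.

```python
import random

def predict(w, x):
    """Output of a linear unit; x[0] is a constant 1 acting as the bias input."""
    return sum(wi * xi for wi, xi in zip(w, x))

def online_epoch(w, data, eta, rng):
    """Stochastic/online learning: adjust the weights after every pattern,
    presenting the patterns in random order."""
    w = list(w)
    order = list(data)
    rng.shuffle(order)
    for x, t in order:
        err = t - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

def batch_epoch(w, data, eta):
    """Batch/offline learning: accumulate the weight changes over all patterns
    and apply them once at the end of the epoch."""
    delta = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)   # uses the same (old) weights for every pattern
        delta = [d + eta * err * xi for d, xi in zip(delta, x)]
    return [wi + d for wi, d in zip(w, delta)]

# Toy training set (assumed): target t = 2*x1 - 1, with x0 = 1 as bias input.
data = [([1.0, x1], 2.0 * x1 - 1.0) for x1 in (0.0, 0.5, 1.0, 1.5)]

rng = random.Random(0)
w_online = [0.0, 0.0]
w_batch = [0.0, 0.0]
for _ in range(200):                      # maximum number of epochs
    w_online = online_epoch(w_online, data, 0.1, rng)
    w_batch = batch_epoch(w_batch, data, 0.1)
```

On this toy problem both schedules approach the same weights; they differ only in when the accumulated error is turned into a weight change.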
B. Unsupervised learning: there is no a priori set of categories into which the patterns are
to be classified. In other words, the training data is composed of input patterns only. The network
uses the training patterns to discover emerging collective properties and organizes the data
into clusters. Example: clustering.
• The objective of unsupervised learning is to discover patterns or features in the input
data with no help from a teacher, basically performing a clustering of the input space.
• There is no teacher to oversee the learning process.
• An (output) unit is trained to respond to clusters of patterns within the input.
• Unsupervised learning is needed when the training data lack target output values
corresponding to the input patterns.
Learning/training algorithm in unsupervised learning

1. The training data set is presented at the input layer
2. Output nodes are evaluated through a specific criterion
3. Only weights connected to the winner node are adjusted
4. Repeat steps 1 to 3 until the maximum number of epochs is reached or the connection
weights reach a steady state.
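One way to read the four steps above is as simple winner-take-all (competitive) learning. The sketch below is an assumption-laden illustration: the notes do not name the criterion, so the winner is taken to be the output node whose weight vector is closest to the input, and the winner's weights are moved a fraction eta toward that input. The data, initial weights and parameter values are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def winner(weights, x):
    """Step 2: evaluate the output nodes; here the winner is the node whose
    weight vector is closest to the input (an assumed criterion)."""
    return min(range(len(weights)), key=lambda k: squared_distance(weights[k], x))

def train_competitive(data, weights, eta=0.2, max_epochs=50, tol=1e-6):
    weights = [list(w) for w in weights]
    for _ in range(max_epochs):                  # step 4: stop at max epochs ...
        largest_change = 0.0
        for x in data:                           # step 1: present each pattern
            k = winner(weights, x)
            new_w = [wi + eta * (xi - wi) for wi, xi in zip(weights[k], x)]
            change = max(abs(n - o) for n, o in zip(new_w, weights[k]))
            largest_change = max(largest_change, change)
            weights[k] = new_w                   # step 3: only the winner is adjusted
        if largest_change < tol:                 # ... or when the weights are steady
            break
    return weights

# Two hypothetical clusters of input patterns, with no target outputs.
data = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9], [1.0, 1.0]]
weights = train_competitive(data, [[0.2, 0.2], [0.8, 0.8]])
clusters = [winner(weights, x) for x in data]
```

After training, each output node responds to one cluster of inputs, which is the sense in which the network organizes the data without a teacher.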

C. Reinforcement Learning: the learning of an input-output mapping is performed through
continued interaction with the environment in order to minimize a scalar index of
performance. The system is designed to learn under delayed reinforcement, which means
that the system observes a temporal sequence of stimuli also received from the
environment, which eventually result in the generation of the heuristic reinforcement
signal.
The goal of reinforcement learning is to minimize a cost-to-go function, defined as the
expectation of the cumulative cost of actions taken over a sequence of steps instead of simply the
immediate cost. It may turn out that certain actions taken earlier in that sequence of time steps
are in fact the best determinants of overall system behavior. The function of the learning system
is to discover these actions and feed them back to the environment.
• The learning machine performs some action on the environment and gets a feedback
response from the environment.
• The learning system grades its action as good (rewarding) or bad (punishable) based on
the environmental response and adjusts its parameters accordingly.
• Reinforcement learning mimics the way humans adjust their behavior when interacting
with physical systems (e.g., learning to ride a bike).

2.4 Activation Function


The activation function is also called the transfer function. An activation function is a mathematical
formula that determines the output of a processing node; it transforms inputs into outputs. Activation
functions introduce a degree of nonlinearity that is valuable for most ANN applications, and they
govern the relationship between the inputs and outputs of a node and of a network. Example: consider a
single neuron with two inputs. Each input is weighted with an appropriate weight w. The sum of the
weighted inputs and the bias forms the input to the transfer function f. Neurons can use any
differentiable transfer function f to generate their output.

Each activation function is characterized by its shape, output range, and derivative function. In
general, activation functions can be categorized as binary or continuous (differentiable)
activation functions. There are different types of activation functions; the two basic types are the
threshold function and the sigmoid function. Other commonly used activation functions are the
Tan-Sigmoid, Log-Sigmoid, linear (purelin), ramp and Gaussian functions. See the figure below.

Figure 2.7: Different Activation Functions.


A. Threshold Function:

$$f(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ 0 & \text{if } a < 0 \end{cases}$$

In engineering, this form of a threshold function is commonly referred to as a
Heaviside function. Correspondingly, the output of neuron k
employing such a threshold function is expressed as:

$$x_k = \begin{cases} 1 & \text{if } a_k \ge 0 \\ 0 & \text{if } a_k < 0 \end{cases}$$

where a_k is the activation (net input) of neuron k.
B. Sigmoid Function
The sigmoid function, whose graph is “S”-shaped, is by far the most common form of activation
function used in the construction of neural networks. It is defined as a strictly increasing function
that exhibits a graceful balance between linear and nonlinear behavior. An example of the
sigmoid function is the logistic function defined by

$$f(a) = \frac{1}{1 + e^{-a}}$$

Example: see the figure below, which gives the equations of different activation functions.

Figure 2.8: Equations of some activation functions.
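The activation functions named in this section can be written down directly. The definitions below are a sketch using common textbook forms; the exact forms in Figure 2.8 are not recoverable from the text, so the ramp (assumed to saturate at 0 and 1) and the Gaussian (assumed to be exp(-a^2)) in particular are assumptions.

```python
import math

def threshold(a):
    """Threshold (Heaviside) function: binary output."""
    return 1.0 if a >= 0 else 0.0

def log_sigmoid(a):
    """Log-sigmoid (logistic) function: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def tan_sigmoid(a):
    """Tan-sigmoid (hyperbolic tangent) function: output in (-1, 1)."""
    return math.tanh(a)

def linear(a):
    """Linear (purelin) function: output equals the activation."""
    return a

def ramp(a):
    """Ramp function, assumed here to be linear on [0, 1] and saturated outside."""
    return max(0.0, min(1.0, a))

def gaussian(a):
    """Gaussian function, assumed here to be exp(-a^2): peaks at a = 0."""
    return math.exp(-a * a)

# Evaluate each function at a few sample activations.
samples = [-2.0, 0.0, 0.3, 2.0]
table = {f.__name__: [f(a) for a in samples]
         for f in (threshold, log_sigmoid, tan_sigmoid, linear, ramp, gaussian)}
```

The threshold function is the only binary one in this list; the others are continuous, and the sigmoid, tanh, linear and Gaussian functions are differentiable everywhere.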


2.5 Perceptrons
The perceptron is the simplest form of a neural network used for the classification of patterns
said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it
consists of a single neuron with adjustable synaptic weights and bias. The perceptron built
around a single neuron is limited to performing pattern classification with only two classes
(hypotheses). Besides, Rosenblatt’s perceptron is built around a nonlinear neuron, namely, the
McCulloch–Pitts model of a neuron.
The perceptron uses supervised learning to adjust its weights in response to a comparative signal
between the network’s actual output and the target output. It is mainly designed to classify
linearly separable patterns. Patterns are linearly separable when there exists a hyperplane (a flat,
multidimensional decision boundary) that separates the patterns into two classes.

The perceptron consists of a single neuron with multiple inputs and a single output. It has restricted
information processing capability. The information processing is done through a transfer
function which is either linear or non-linear.



Perceptron: remarks
• One neuron (one output)
• Input signals (patterns) x1, x2, ..., xn
• Adjustable synaptic weights w1, w2, ..., wn, and bias θ (w0)
• Binary activation function; i.e., step or hard limiter function
Definition: It’s a step function based on a linear combination of real-valued inputs. If the
combination is above a threshold it outputs a 1, otherwise it outputs a –1.
A perceptron takes a vector of real-valued inputs x1 through xn; the output o(x1, …, xn)
computed by the perceptron is:

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each wi is a real-valued constant, or weight. Learning a perceptron involves choosing values
for the weights w0, w1, …, wn.

• To simplify notation, we imagine an additional constant input x0 = 1, allowing us to write
the above inequality as

$$\sum_{i=0}^{n} w_i x_i > 0$$

Figure 2.9: A Perceptron



A perceptron draws a hyperplane as the decision boundary over the (n-dimensional) input space.
The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a –1 for
instances lying on the other side.
The equation for the decision hyperplane (decision boundary), with x0 = 1, is:

$$\sum_{i=0}^{n} w_i x_i = w_0 + w_1 x_1 + \cdots + w_n x_n = 0$$

Figure 2.10: hyperplane (in this example, a straight line) as decision boundary for a two-
dimensional, two-class pattern-classification problem.
A perceptron can learn only examples that are called “linearly separable”. These are examples
that can be perfectly separated by a hyperplane. Some sets of positive and negative examples
cannot be separated by any hyperplane and those examples are called non linearly separable
examples.

Figure 2.11: Linearly separable Non-linearly separable


Perceptrons can learn many Boolean functions, such as AND, OR, NAND and NOR, but cannot learn
XOR. However, every Boolean function can be represented by a network of perceptrons that is
two or more levels (layers) deep.



Example: let us show that the Boolean function AND can be represented by a single perceptron that uses a step
function, given its truth table (input and output values).
Training Examples
X1   X2   Output (desired)   ∑WiXi
0    0    -1                 -0.8
0    1    -1                 -0.3
1    0    -1                 -0.3
1    1     1                  0.2

The output is obtained from the decision boundary (hyperplane), i.e., from the equation
WX = 0, w0 + w1x1 + w2x2 = 0, given w0 = -0.8 and w1 = w2 = 0.5.
Therefore, the decision boundary is
-0.8 + 0.5x1 + 0.5x2 = 0. Now substitute the values of the two variables x1 and x2 from each row of the AND
truth table into the left-hand side of this equation; if the result is greater than zero the output is 1, otherwise
it is -1.
1st row (x1 = 0, x2 = 0): -0.8 + 0.5(0) + 0.5(0) = -0.8 < 0, output is -1
2nd row (x1 = 0, x2 = 1): -0.8 + 0.5(0) + 0.5(1) = -0.3 < 0, output is -1
3rd row (x1 = 1, x2 = 0): -0.8 + 0.5(1) + 0.5(0) = -0.3 < 0, output is -1
4th row (x1 = 1, x2 = 1): -0.8 + 0.5(1) + 0.5(1) = 0.2 > 0, output is 1

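Figure 2.12: Two-dimensional decision boundary for the AND function. The line -0.8 + 0.5 x1 + 0.5 x2 = 0 is drawn in the (x1, x2) plane; the point (1, 1) lies on the positive side and the other three points lie on the negative side.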
We can also show this using the graph of the decision boundary in the above figure: the examples can
be perfectly separated by a hyperplane (the positive one is on one side and the negative ones are on
the opposite side). Therefore, this function (i.e., AND) can be represented or implemented using a
perceptron.
Example 2: check whether the OR function can be represented by a perceptron.
WX = 0, w0 + w1x1 + w2x2 = 0, given w0 = -0.3 and w1 = w2 = 0.5
w0 + w1x1 + w2x2 = 0 -> -0.3 + 0.5 x1 + 0.5 x2 = 0

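Test result
X1   X2   Output (desired)   ∑WiXi
0    0    -1                 -0.3
0    1     1                  0.2
1    0     1                  0.2
1    1     1                  0.7

[Figure: decision boundary -0.3 + 0.5 x1 + 0.5 x2 = 0 in the (x1, x2) plane; the point (0, 0) lies on the negative side and the other three points lie on the positive side.]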

Therefore, the OR function can also be represented by a perceptron, as both the graph and the equation show.
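The two worked examples can be checked mechanically. The sketch below uses the weights given in the examples (w0 = -0.8 for AND, w0 = -0.3 for OR, w1 = w2 = 0.5) and the step rule stated earlier (output 1 if the weighted sum is greater than zero, otherwise -1); the function and variable names are illustrative only.

```python
def perceptron_output(weights, x):
    """weights = [w0, w1, ..., wn], x = [x1, ..., xn]; w0 multiplies the constant input x0 = 1."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1

AND_WEIGHTS = [-0.8, 0.5, 0.5]   # from the AND example
OR_WEIGHTS = [-0.3, 0.5, 0.5]    # from the OR example

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron_output(AND_WEIGHTS, x) for x in inputs]
or_outputs = [perceptron_output(OR_WEIGHTS, x) for x in inputs]
```

Here `and_outputs` matches the desired AND column (-1, -1, -1, 1) and `or_outputs` matches the desired OR column (-1, 1, 1, 1).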

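Exercise: check (or prove) whether the XOR function can be implemented by a perceptron.

Training examples
x1   x2   output
0    0    -1
0    1     1
1    0     1
1    1    -1

[Figure: the four XOR examples in the (x1, x2) plane, with + at (0, 1) and (1, 0) and - at (0, 0) and (1, 1).]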
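One possible line of reasoning for the exercise, given here as a sketch rather than the only solution: write down what the step rule (output 1 if w0 + w1x1 + w2x2 > 0, otherwise -1) would require of the weights for each XOR row, and look for a contradiction.

```latex
\begin{align*}
(x_1, x_2) = (0, 0),\ \text{output } -1 &: \quad w_0 \le 0 \\
(x_1, x_2) = (0, 1),\ \text{output } +1 &: \quad w_0 + w_2 > 0 \\
(x_1, x_2) = (1, 0),\ \text{output } +1 &: \quad w_0 + w_1 > 0 \\
(x_1, x_2) = (1, 1),\ \text{output } -1 &: \quad w_0 + w_1 + w_2 \le 0 \\[4pt]
\text{Adding the second and third conditions:} &\quad 2 w_0 + w_1 + w_2 > 0
  \;\Rightarrow\; w_0 + w_1 + w_2 > -w_0 \ge 0
\end{align*}
```

The last line contradicts the fourth condition, so no choice of w0, w1, w2 satisfies all four rows: the XOR examples are not linearly separable, and a single perceptron cannot represent XOR.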
2.5.1 Perceptron Learning
Learning a perceptron means finding the right values for W (weight). The hypothesis space of a
perceptron is the space of all weight vectors. The perceptron learning algorithm can be stated as
below.
1. Assign random values to the weight vector
2. Apply the weight update rule to every training example
3. Are all training examples correctly classified?
A. Yes. Quit
B. No. Go back to Step 2.
Learning is to determine a weight vector that causes the perceptron to produce the correct +1 or –1
for each of the given training examples. There are two popular weight update rules (algorithms):
A. The perceptron rule
B. Delta rule
The Perceptron Rule
For a new training example X = (x1, x2, …, xn), update each weight according to this rule:
wi = wi + Δwi
Where Δwi = η (t-o) xi
t: target output value for the current training example
o: output generated by the perceptron
η: a constant called the learning rate (e.g., 0.1). The role of the learning rate is to moderate the
degree to which weights are changed at each step.
Comments about the perceptron training rule:
• If the example is correctly classified, the term (t-o) equals zero, and no update of the
weight is necessary.
• If the perceptron outputs –1 and the real answer is 1, the weight is increased (for a positive input xi).
• If the perceptron outputs a 1 and the real answer is -1, the weight is decreased (for a positive input xi).
• Provided the examples are linearly separable and a small value for η is used, the rule is
proved to classify all training examples correctly (i.e., it is consistent with the training data).
• We can prove that the algorithm will converge
if the training data is linearly separable and η is sufficiently small.
If the data is not linearly separable, convergence is not assured.
Perceptron Convergence Theorem: If the training set is linearly separable, there exists a set of
weights for which the training of the Perceptron will converge in a finite time and the training
patterns are correctly classified.
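The learning algorithm and the perceptron rule can be put together in a short sketch, here trained on the AND examples used earlier. Several choices are assumptions made for the illustration: the weights start at zero rather than at random values (so the run is reproducible), η = 0.5 is used so that every update is a whole number, and a cap on the number of epochs is added as a safeguard.

```python
def perceptron_output(w, x):
    """x includes the constant input x0 = 1, so w[0] plays the role of the bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, eta=0.5, max_epochs=100):
    w = [0.0] * len(examples[0][0])      # assumption: zero initial weights
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, t in examples:            # apply the update rule to every example
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
                # Perceptron rule: w_i = w_i + eta * (t - o) * x_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                # all examples correctly classified: quit
            return w, epoch
    return w, max_epochs

# AND training examples as (x0, x1, x2) with targets in {-1, +1}.
and_examples = [
    ([1, 0, 0], -1),
    ([1, 0, 1], -1),
    ([1, 1, 0], -1),
    ([1, 1, 1], 1),
]
w, epochs_used = train_perceptron(and_examples)
predictions = [perceptron_output(w, x) for x, _ in and_examples]
```

Because the AND examples are linearly separable, the loop stops well before the epoch cap with every example classified correctly. On the XOR examples the same loop would run until `max_epochs` without reaching zero mistakes.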
The Delta Rule
Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable.
To address this situation, we try to
approximate the real concept using the delta rule.
The key idea of the delta rule is to use gradient descent to search the space of possible weight
vectors to find the weights that best fit the training examples. This rule is important because it
provides the basis for the backpropagation algorithm, which can learn networks with many
interconnected units.
Gradient descent will minimize the training error. In order to derive a weight learning rule for
linear units, let us specify a measure for the training error of a weight vector, relative to the training
examples. The training error can be computed as the following squared error:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$ Equ-1

where the sum goes over all training examples, i.e., D is the set of training examples, td is the target
output for training example d and od is the output of the linear unit for training example
d. Here we characterize E as a function of the weight vector because the linear unit output o depends
on this weight vector. The idea is to find the minimum of the error function E in the space of
weights.
To understand the gradient descent algorithm, it is helpful to visualize the entire space of
possible weight vectors and their associated E values, as illustrated in Figure 2.13.
• Here the axes w0, w1 represent possible values for the two weights of a simple linear
unit. The w0, w1 plane represents the entire hypothesis space.
• The vertical axis indicates the error E relative to some fixed set of training examples. The
error surface shown in the figure summarizes the desirability of every weight vector in
the hypothesis space.
For linear units, this error surface must be parabolic with a single global minimum, and we
desire the weight vector at this minimum.

Figure 2.13: The Error Surface


How can we calculate the direction of steepest descent along the error surface? This direction
can be found by computing the derivative of E with respect to each component of the vector w.
This vector derivative is called the gradient of E with respect to the vector <w0, …, wn>, written

$$\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$ Equ-2

Notice that ∇E is itself a vector, whose components are the partial derivatives of E with respect to
each of the wi. When interpreted as a vector in weight space, the gradient specifies the direction
that produces the steepest increase in E. The negative of this vector therefore gives the direction
of steepest decrease.
The delta rule is as follows:
For a new training example X = (x1, x2, …, xn), update each weight according to this rule:
wi = wi + Δwi Equ-3

where

$$\Delta \vec{w} = -\eta \nabla E(\vec{w}), \quad \text{and} \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$ Equ-4

Here η is a positive constant called the learning rate, which determines the step size in the
gradient descent search. The negative sign is present because we want to move the weight vector
in the direction that decreases E.

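Differentiating the squared error E (Equ-1) with respect to wi, with od the output of the linear unit, gives

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) x_{id}$$

Therefore, finally it gives us

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$$ Equ-6

where xid denotes the single input component xi for training example d, and
od and td are the output and the target value associated with that training example.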
The gradient descent algorithm for training linear units is as follows:
• Pick an initial random weight vector.
• Apply the linear unit to all training examples, then compute Δwi for each weight
according to Equation (6).
• Update each weight wi by adding Δwi, then repeat the process.
There are two differences between the perceptron rule and the delta rule. The perceptron rule is based on
the output of a step function, whereas the delta rule uses the linear combination of inputs
directly. The perceptron rule is guaranteed to converge to a consistent hypothesis assuming the data is
linearly separable. The delta rule converges only in the limit (asymptotically) toward the
minimum-error hypothesis, but it does not need the data to be linearly separable.
There are two main difficulties with the gradient descent method:
1. Convergence to a minimum may take a long time.
2. When the error surface has several local minima, there is no guarantee we will find the global minimum.
These are handled by using momentum terms and random perturbations to the weight vectors.
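The gradient descent procedure for a linear unit can be sketched as below. The update applied each epoch is Δwi = η Σd (td - od) xid, and the error tracked is E = ½ Σd (td - od)². The data set (the AND examples with targets ±1), the zero initial weights, η = 0.1 and the fixed number of epochs are assumptions chosen for the illustration.

```python
def linear_output(w, x):
    """Output of a linear unit: o = w . x (no step function applied)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, examples):
    """Squared error E(w) = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

def gradient_descent(examples, eta=0.1, epochs=500):
    w = [0.0] * len(examples[0][0])          # assumption: zero initial weights
    history = [training_error(w, examples)]
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:                # apply the unit to all examples
            err = t - linear_output(w, x)
            for i, xi in enumerate(x):
                delta[i] += eta * err * xi   # accumulate eta * (t - o) * x_i
        w = [wi + di for wi, di in zip(w, delta)]   # w_i = w_i + delta_w_i
        history.append(training_error(w, examples))
    return w, history

# AND examples as (x0, x1, x2) with x0 = 1, targets in {-1, +1}.
examples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w, history = gradient_descent(examples)
```

A linear unit cannot reproduce the ±1 targets exactly, so the error does not fall to zero; it settles at the minimum of the parabolic error surface, which for this data lies at w = (-1.5, 1, 1) with E = 0.5. This is the sense in which the delta rule finds a best-fit weight vector rather than a perfectly consistent one.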



2.6 Multilayer Networks and Backpropagation Algorithm

Multi-layered Perceptron (MLP): It has a layered architecture consisting of input, hidden and
output layers. Each layer consists of a number of perceptrons (neurons). A neuron in any layer of
the network is connected to all the neurons (nodes) in the previous layer. Signal flow through the
network progresses in a forward direction, from left to right and on a layer-by-layer basis. The
output of each layer is transmitted to the input of nodes in other layers through weighted links.
Usually, this transmission is done only to nodes of the next layer, leading to what are known as
feed forward networks.
Three points highlight the basic features of multilayer perceptrons are:
A. The model of each neuron in the network includes a nonlinear activation function that is
differentiable.
B. The network contains one or more layers that are hidden from both the input and output
nodes.
C. The network exhibits a high degree of connectivity, the extent of which is determined by
synaptic weights of the network.
These same characteristics, however, are also responsible for the deficiencies in our knowledge
on the behavior of the network. First, the presence of a distributed form of nonlinearity and the
high connectivity of the network make the theoretical analysis of a multilayer perceptron
difficult to undertake. Second, the use of hidden neurons makes the learning process harder to
visualize. In an implicit sense, the learning process must decide which features of the input
pattern should be represented by the hidden neurons. The learning process is therefore made
more difficult because the search has to be conducted in a much larger space of possible
functions, and a choice has to be made between alternative representations of the input pattern.
The typical architecture of a multi-layer perceptron (MLP) is shown in Figure 2.14.

Figure 2.14: Architectural graph of a multilayer perceptron with two hidden layers.
Two kinds of signals are identified in this network:
1. Function Signals. A function signal is an input signal (stimulus) that comes in at the
input end of the network, propagates forward (neuron by neuron) through the network,
and emerges at the output end of the network as an output signal. We refer to such a
signal as a “function signal” for two reasons. First, it is presumed to perform a useful
function at the output of the network. Second, at each neuron of the network through
which a function signal passes, the signal is calculated as a function of the inputs and
associated weights applied to that neuron. The function signal is also referred to as the
input signal.
2. Error Signals. An error signal originates at an output neuron of the network and
propagates backward (layer by layer) through the network. We refer to it as an “error
signal” because its computation by every neuron of the network involves an error-
dependent function in one form or another.
The output neurons constitute the output layer of the network. The remaining neurons constitute
hidden layers of the network. Each hidden or output neuron of a multilayer perceptron is
designed to perform two computations:
1. the computation of the function signal appearing at the output of each neuron, which is
expressed as a continuous nonlinear function of the input signal and synaptic weights
associated with that neuron
2. the computation of an estimate of the gradient vector (i.e., the gradients of the error
surface with respect to the weights connected to the inputs of a neuron), which is needed
for the backward pass through the network.
The hidden neurons act as feature detectors; as such, they play a critical role in the operation of a
multilayer perceptron. As the learning process progresses across the multilayer perceptron, the
hidden neurons begin to gradually “discover” the salient features that characterize the training
data. They do so by performing a nonlinear transformation on the input data into a new space
called the feature space. In this new space, the classes of interest in a pattern-classification task,
for example, may be more easily separated from each other than could be the case in the original
input data space. Indeed, it is the formation of this feature space through supervised learning that
distinguishes the multilayer perceptron from Rosenblatt’s perceptron.
More conveniently, we can instead extend the simple perceptron to a multi-layer perceptron,
which includes at least one hidden layer of neurons with nonlinear activation functions f(x)
(such as sigmoids).
In contrast to perceptrons, multilayer networks can learn multiple decision boundaries, and
these boundaries may be nonlinear. To make nonlinear partitions of the space, each unit must
compute a nonlinear function of its inputs, since cascaded linear units still produce only linear
functions. One solution is to use the sigmoid unit. A further reason for using sigmoids is that,
unlike hard thresholds, they are continuous and thus differentiable at all points.

O(x1, x2, …, xn) = σ(WX), where σ(WX) = 1 / (1 + e^(−WX))
Function σ is called the sigmoid or logistic function. It has the following property:
d σ(y) / dy = σ(y) (1 – σ(y))
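The derivative property can be checked numerically. The short sketch below compares σ(y)(1 − σ(y)) with a central finite-difference estimate; the helper names and the step size h are assumptions made for the illustration.

```python
import math

def sigmoid(y):
    """Logistic function: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """Closed form from the property: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

def numerical_derivative(f, y, h=1e-6):
    """Central finite-difference estimate of df/dy."""
    return (f(y + h) - f(y - h)) / (2.0 * h)

for y in (-2.0, 0.0, 1.5):
    print(y, sigmoid_derivative(y), numerical_derivative(sigmoid, y))
```

The two columns should agree to many decimal places. The property matters in practice because the derivative can be computed from the unit's output alone, without evaluating the exponential again.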
2.6.1 Back-Propagation (BP) Algorithm
MLPs were proposed to extend the limited information processing capabilities of simple
perceptrons, and are highly versatile in terms of their approximation ability. Training or weight
adaptation is done in MLPs using supervised backpropagation learning.
The most commonly used method for the training of multilayer perceptrons is the
back-propagation algorithm, which includes the LMS algorithm as a special case. The training
proceeds in two phases:
1. In the forward phase, the synaptic weights of the network are fixed and the input signal
is propagated through the network, layer by layer, until it reaches the output. Thus, in this
phase, changes are confined to the activation potentials and outputs of the neurons in the
network.
2. In the backward phase, an error signal is produced by comparing the output of the
network with a desired response. The resulting error signal is propagated through the
network, again layer by layer, but this time the propagation is performed in the backward
direction. In this second phase, successive adjustments are made to the synaptic weights
of the network. Calculation of the adjustments for the output layer is straightforward, but
it is much more challenging for the hidden layers.
Hence, Multi-layered perceptrons can be trained using the back-propagation algorithm described
next.
Goal: To learn the weights for all links in an interconnected multilayer network.
• The BP algorithm learns the weights for a multilayer network, given a network with a
fixed set of units and interconnections. It employs gradient descent to attempt to
minimize the squared error between the network output values and the target values for
these outputs.
We begin by defining our measure of error. Because we are considering networks with multiple
output units rather than single units as before, we redefine E to sum the errors over all of the
network output units:
E(W) = ½ Σd∈D Σk∈outputs (tkd − okd)²
where outputs is the set of output units in the network, D is the set of training examples, and
tkd and okd are the target and output values associated with the kth output unit and training
example d. That is, k varies over the output nodes and d over the training examples.
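As a small worked example of this error measure, the sketch below evaluates E for two training examples and two output units. The function name network_error and the numbers are made up for illustration.

```python
def network_error(targets, outputs):
    """E(W) = 1/2 * sum over examples d and output units k of (t_kd - o_kd)^2."""
    return 0.5 * sum(
        (t - o) ** 2
        for t_d, o_d in zip(targets, outputs)  # one row per training example d
        for t, o in zip(t_d, o_d)              # one entry per output unit k
    )

targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
print(network_error(targets, outputs))  # 0.25
```

Only the first example contributes here: ½ (0.5² + 0.5²) = 0.25, since the second example is reproduced exactly.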
The idea is again to use gradient descent over the space of weights to search for a minimum of E.
There is no guarantee that the minimum found is the global one.
Algorithm:
1. Create a network with n_in input nodes, n_hidden hidden nodes, and n_out output nodes.
2. Initialize all weights to small random numbers.
3. Until the error is small, do:
   For each example X do
   • Propagate example X forward through the network
   • Propagate errors backward through the network
Forward Propagation: Given example X, compute the output of every node, layer by layer, until we
reach the output nodes.
Backward Propagation:
A. For each output node k compute the error:
δk = Ok (1-Ok)(tk – Ok)
B. For each hidden unit h, calculate the error:
δh = Oh (1-Oh) Σk Wkh δk
C. Update each network weight:
Wji = Wji + ΔWji
where ΔWji = η δj Xji (Xji is the input from node i to node j, and Wji is the corresponding weight).
A variety of termination conditions can be used to halt the procedure. One may choose to halt:
• after a fixed number of iterations through the loop, or
• once the error on the training examples falls below some threshold, or
• once the error on a separate validation set of examples meets some criterion.
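The algorithm above can be sketched in plain Python for a network with one hidden layer of sigmoid units. This is an illustrative sketch under stated assumptions, not the author's implementation: the function names, the bias handling (a constant 1.0 input per layer), the learning rate, the seed and the toy task are all choices made for the example. Training halts after a fixed number of epochs or once the training error falls below a threshold.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights; column 0 of every row is a bias weight (an assumption)."""
    def rows(n_rows, n_cols):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_cols)] for _ in range(n_rows)]
    return rows(n_hidden, n_in + 1), rows(n_out, n_hidden + 1)

def forward(net, x):
    """Forward propagation: returns the inputs seen by each layer and the outputs."""
    w_hidden, w_out = net
    x_b = [1.0] + list(x)  # constant 1.0 feeds the bias weight
    h_b = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, x_b))) for row in w_hidden]
    o = [sigmoid(sum(w * v for w, v in zip(row, h_b))) for row in w_out]
    return x_b, h_b, o

def backprop_step(net, x, t, eta):
    w_hidden, w_out = net
    x_b, h_b, o = forward(net, x)
    # A. Output units: delta_k = O_k (1 - O_k)(t_k - O_k)
    d_out = [ok * (1.0 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # B. Hidden units: delta_h = O_h (1 - O_h) * sum_k W_kh delta_k
    d_hid = []
    for h in range(len(w_hidden)):
        oh = h_b[h + 1]  # skip the bias entry at index 0
        back = sum(w_out[k][h + 1] * d_out[k] for k in range(len(w_out)))
        d_hid.append(oh * (1.0 - oh) * back)
    # C. W_ji = W_ji + eta * delta_j * X_ji
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * d_out[k] * h_b[i]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hid[h] * x_b[i]

def total_error(net, examples):
    """Squared error summed over all examples and output units."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(net, x)[2]))

def train(net, examples, eta=0.5, max_epochs=2000, threshold=0.01):
    """Stop after max_epochs, or earlier once the training error is below threshold."""
    for _ in range(max_epochs):
        for x, t in examples:
            backprop_step(net, x, t, eta)
        if total_error(net, examples) < threshold:
            break
    return net

# Toy task with two output units: (x1 AND x2, x1 OR x2).
examples = [([0, 0], [0, 0]), ([0, 1], [0, 1]), ([1, 0], [0, 1]), ([1, 1], [1, 1])]
net = init_network(n_in=2, n_hidden=2, n_out=2, rng=random.Random(0))
error_before = total_error(net, examples)
train(net, examples)
error_after = total_error(net, examples)
print(error_before, error_after)
```

The hidden-unit errors are computed before any weight is changed, so step B uses the same output-layer weights that produced the forward pass.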
Adding Momentum: A momentum term, which depends on the weight update made at the previous
iteration, may also be added to the update rule. At iteration n we have the following:
ΔWji(n) = η δj Xji + α ΔWji(n − 1)
where α (0 ≤ α < 1) is a constant called the momentum. It has the following roles:
A. It can carry the search through small local minima.
B. It increases the speed along flat regions.
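The effect on flat regions can be seen in a short sketch. It assumes that the momentum term multiplies the update made at the previous iteration, ΔWji(n − 1), and it uses a constant gradient term to imitate a flat, gently sloping region; the function name and all numeric values are illustrative.

```python
def momentum_step(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """One update with momentum.

    grad_term stands for delta_j * X_ji; prev_delta is the update from iteration n - 1.
    Returns the new weight and the update just applied.
    """
    delta = eta * grad_term + alpha * prev_delta
    return w + delta, delta

# On a flat region the gradient term is roughly constant (1.0 here, purely illustrative).
w, delta = 0.0, 0.0
deltas = []
for _ in range(50):
    w, delta = momentum_step(w, grad_term=1.0, prev_delta=delta)
    deltas.append(delta)
print(deltas[0], deltas[1], deltas[-1])
```

With a constant gradient term the step grows from η toward η / (1 − α), so with α = 0.9 the steps become up to ten times larger than they would be without momentum.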
Remarks on Back-propagation
1. It implements a gradient descent search over the weight space.
2. It may become trapped in local minima.
3. In practice, it is very effective.
4. The problem of local minima can be reduced in the following ways:
A. Add momentum.
B. Use stochastic gradient descent.
C. Train several networks with different initial values for the weights.
Multi-layered perceptrons have high representational power. They can represent the following:
1. Boolean functions. Every boolean function can be represented exactly by a network having two
layers of units, although the number of hidden units required may grow exponentially with
the number of inputs.
2. Continuous functions. Every bounded continuous function can be approximated with
arbitrarily small error by a network having two layers of units.
3. Arbitrary functions. Any arbitrary function can be approximated to arbitrary accuracy by a
network with three layers of units.
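As an illustration of the first point, XOR (which a single perceptron cannot represent) can be built from two layers of units. The sketch below uses threshold units with hand-picked weights; both the weights and the use of step units instead of sigmoids are assumptions made for the example.

```python
def step(y):
    """Threshold unit: outputs 1 when the weighted sum is positive, else 0."""
    return 1 if y > 0 else 0

def xor_network(x1, x2):
    # Hidden layer (first layer of units)
    h_or = step(x1 + x2 - 0.5)   # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)  # fires only if both inputs are 1
    # Output layer (second layer of units): OR but not AND
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

The loop prints the XOR truth table: 0 for equal inputs and 1 for differing inputs.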
Generalization and Overfitting
The objective of learning is to achieve good generalization to new cases; if only the training
cases mattered, a look-up table would suffice. Generalization can be viewed as mathematical
interpolation or regression over a set of training points.
Good generalization is more likely when the number of training cases m is much larger than the
number of weights (m >> number of weights).
Backpropagation is susceptible to overfitting the training examples at the cost of decreasing
generalization accuracy over other unseen examples. One obvious stopping criterion for
backpropagation is to continue iterating until the training error is below some threshold, but
this can lead to overfitting.
Overfitting can be avoided using the following strategies.
• Use a validation set and stop training when the error on this set stops decreasing.
• Use 10-fold cross-validation.
• Use weight decay: the weights are decreased by a small factor on each iteration.
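The first and third strategies can be sketched as follows. This is a hedged illustration only: train_one_epoch and validation_error are hypothetical caller-supplied functions, the patience and decay values are arbitrary, and the validation curve at the end is made up to show the typical fall-then-rise shape.

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=3):
    """Run training epochs until the validation error has not improved for `patience` epochs.

    train_one_epoch and validation_error are caller-supplied functions (hypothetical here).
    Returns the epoch with the lowest validation error and that error.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error

def weight_decay_update(w, delta_w, decay=0.01):
    """Shrink the weight by a small factor, then apply the usual update."""
    return (1.0 - decay) * w + delta_w

# Made-up validation errors that fall and then rise, as happens when overfitting sets in.
curve = iter([0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8])
best_epoch, best_error = train_with_early_stopping(lambda: None, lambda: next(curve))
print(best_epoch, best_error)  # 4 0.4
```

In a real training loop one would also keep a copy of the weights from the best epoch and restore them after stopping.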
