
NEURAL NETWORKS

Lecturer: Primož Potočnik, University of Ljubljana, Faculty of Mechanical Engineering, Laboratory of Synergetics, www.neural.si, primoz.potocnik@fs.uni-lj.si, +386-1-4771-167

© 2012 Primož Potočnik

NEURAL NETWORKS (0) Organization of the Study

#1

TABLE OF CONTENTS
0. Organization of the Study
1. Introduction to Neural Networks
2. Neuron Model, Network Architectures and Learning
3. Perceptrons and Linear Filters
4. Backpropagation
5. Dynamic Networks
6. Radial Basis Function Networks
7. Self-Organizing Maps
8. Practical Considerations


0. Organization of the Study


0.1 Objectives of the study
0.2 Teaching methods
0.3 Assessment
0.4 Lecture plan
0.5 Books
0.6 SLO books
0.7 E-books
0.8 Online resources
0.9 Simulations
0.10 Homeworks

1. Objectives of the study


Objectives
Introduce the principles and methods of neural networks (NN)
Present the principal NN models
Demonstrate the process of applying NN

Learning outcomes
Understand the concept of nonparametric modelling by NN
Explain the most common NN architectures: feedforward networks, dynamic networks, radial basis function networks, self-organized networks
Develop the ability to construct NN for solving real-world problems: design a proper NN architecture, achieve good training and generalization performance, implement the neural network solution


2. Teaching methods
Teaching methods:
1. Lectures: 4 hours weekly, classical & practical (MATLAB); Tuesday 9:15-10:45, Friday 9:15-10:45
2. Homeworks: home projects
3. Consultations with the lecturer

Organization of the study


Nov-Dec: lectures
Jan: homework presentations
Jan: exam

Location
Institute for Sustainable Innovative Technologies, (Pot za Brdom 104, Ljubljana)


3. Assessment
ECTS credits:
EURHEO (II): 6 ECTS

Final mark:
Homework: 50% of final mark
Written exam: 50% of final mark

Important dates
Homework presentations: Tue, 8 Jan 2013 and Fri, 11 Jan 2013
Written exam: Fri, 18 Jan 2013


4. Lecture plan (1/5)


1. Introduction to Neural Networks
1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks

2. Neuron Model, Network Architectures and Learning


2.1 Neuron model
2.2 Activation functions
2.3 Network architectures
2.4 Learning algorithms
2.5 Learning paradigms
2.6 Learning tasks
2.7 Knowledge representation
2.8 Neural networks vs. statistical methods


4. Lecture plan (2/5)


3. Perceptrons and Linear Filters
3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Adaline
3.4 LMS learning rule
3.5 Adaptive filtering
3.6 XOR problem

4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons


4. Lecture plan (3/5)


5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 NARX network
5.5 Layer recurrent network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control


4. Lecture plan (4/5)


6. Radial Basis Function Networks
6.1 RBFN structure
6.2 Exact interpolation
6.3 Commonly used radial basis functions
6.4 Radial Basis Function Networks
6.5 RBFN training
6.6 RBFN for pattern recognition
6.7 Comparison with multilayer perceptron
6.8 RBFN in Matlab notation
6.9 Probabilistic networks
6.10 Generalized regression networks


4. Lecture plan (5/5)


7. Self-Organizing Maps
7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 Learning vector quantization

8. Practical Considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization


5. Books
1. Neural Networks and Learning Machines, 3/E, Simon Haykin (Pearson Education, 2009)
2. Neural Networks: A Comprehensive Foundation, 2/E, Simon Haykin (Pearson Education, 1999)
3. Neural Networks for Pattern Recognition, Chris M. Bishop (Oxford University Press, 1995)
4. Practical Neural Network Recipes in C++, Timothy Masters (Academic Press, 1993)
5. Advanced Algorithms for Neural Networks, Timothy Masters (John Wiley and Sons, 1995)
6. Signal and Image Processing with Neural Networks, Timothy Masters (John Wiley and Sons, 1994)


6. SLO Books
1. Nevronske mreže, Andrej Dobnikar (Didakta, 1990)
2. Modeliranje dinamičnih sistemov z umetnimi nevronskimi mrežami in sorodnimi metodami, Juš Kocijan (Založba Univerze v Novi Gorici, 2007)


7. E-Books (1/2)
List of links at www.neural.si

An Introduction to Neural Networks, Ben Krose & Patrick van der Smagt, 1996 (recommended as an easy introduction)
Neural Networks - Methodology and Applications, Gerard Dreyfus, 2005
Metaheuristic Procedures for Training Neural Networks, Enrique Alba & Rafael Marti (Eds.), 2006
FPGA Implementations of Neural Networks, Amos R. Omondi & Jagath C. Rajapakse (Eds.), 2006
Trends in Neural Computation, Ke Chen & Lipo Wang (Eds.), 2007

7. E-Books (2/2)
Neural Preprocessing and Control of Reactive Walking Machines, Poramate Manoonpong, 2007
Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, Krzysztof Patan, 2008
Speech, Audio, Image and Biomedical Signal Processing using Neural Networks [only two chapters], Bhanu Prasad & S.R. Mahadeva Prasanna (Eds.), 2008
MATLAB Neural Networks Toolbox 7 User's Guide, 2010


8. Online resources
List of links at www.neural.si
Neural FAQ by Warren Sarle, 2002
How to measure importance of inputs by Warren Sarle, 2000
MATLAB Neural Networks Toolbox (User's Guide), latest version
Artificial Neural Networks on Wikipedia.org
Neural Networks online book by StatSoft
Radial Basis Function Networks by Mark Orr
Principal components analysis on Wikipedia.org
libsvm Support Vector Machines library


9. Simulations
Recommended computing platform
MATLAB R2010b (or later) & Neural Network Toolbox 7: http://www.mathworks.com/products/neuralnet/
Acceptable older release: MATLAB 7.5 & Neural Network Toolbox 5.1 (Release 2007b)

Introduction to Matlab
Get familiar with MATLAB M-file programming
Online documentation: Getting Started with MATLAB

Freeware computing platform


Stuttgart Neural Network Simulator http://www.ra.cs.uni-tuebingen.de/SNNS/


10. Homeworks
EURHEO students (II)
1. Practically oriented projects
2. Based on UC Irvine Machine Learning Repository data: http://archive.ics.uci.edu/ml/
3. Select a data set and discuss it with the lecturer
4. Formulate the problem
5. Develop your solution (concept & Matlab code)
6. Describe the solution in a short report
7. Submit the results (report & Matlab source code)
8. Present the results and demonstrate the solution: presentation (~10 min), demonstration (~20 min)


Video links
Robots with Biological Brains: Issues and Consequences, Kevin Warwick, University of Reading, http://videolectures.net/icannga2011_warwick_rbbi/
Computational Neurogenetic Modelling: Methods, Systems, Applications, Nikola Kasabov, University of Auckland, http://videolectures.net/icannga2011_kasabov_cnm/


1. Introduction to Neural Networks


1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks
1.8 List of symbols


1.1 What is a neural network? (1/2)


Neural network
A network of biological neurons: biological neural networks are made up of real biological neurons that are connected or functionally related in the peripheral nervous system or the central nervous system

Artificial neurons
Simple mathematical approximations of biological neurons


What is a neural network? (2/2)


Artificial neural networks
Networks of artificial neurons
Very crude approximations of small parts of the biological brain
Implemented as software or hardware
By "neural networks" we usually mean artificial neural networks
Also known as neurocomputers, connectionist networks, parallel distributed processors, ...


Neural network definitions


Haykin (1999)
A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: Knowledge is acquired by the network through a learning process. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

Zurada (1992)
Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.

Pinkus (1999)
The question 'What is a neural network?' is ill-posed.


1.2 Biological neural networks


Cortical neurons (nerve cells) growing in culture
Neurons have a large cell body with several long processes extending from it, usually one thick axon and several thinner dendrites
Dendrites receive information from other neurons
The axon carries nerve impulses away from the neuron; its branching ends make contacts with other neurons and with muscles or glands


This complex network forms the nervous system, which relays information through the body

Biological neuron


Interaction of neurons
Action potentials arriving at the synapses stimulate currents in the neuron's dendrites
These currents depolarize the membrane at its axon, provoking an action potential
The action potential propagates down the axon to its synaptic knobs, releasing neurotransmitter and stimulating the post-synaptic neuron


Synapses
Elementary structural and functional units that mediate the interaction between neurons
Chemical synapse: pre-synaptic electrical signal → chemical neurotransmitter → post-synaptic electrical signal


Action potential
Spikes or action potential
Neurons encode their outputs as a series of voltage pulses
The axon is very long, with high resistance and high capacitance
Frequency modulation gives an improved signal/noise ratio


1.3 Human nervous system


Human nervous system can be represented by three stages:

Stimulus → Receptors → Neural net (Brain) → Effectors → Response

Receptors
collect information from environment (photons on retina, tactile info, ...)

Effectors
generate interactions with the environment (muscle activation, ...)

Flow of information
feedforward & feedback


Human brain
Human activity is regulated by the nervous system:
Central nervous system: brain, spinal cord
Peripheral nervous system

~10^10 neurons in the brain
~10^4 synapses per neuron
~1 ms processing speed of a neuron: slow rate of operation
Extreme number of processing units & interconnections
Massive parallelism


Structural organization of brain


Molecules & ions: transmitters
Synapses: fundamental organization level
Neural microcircuits: assemblies of synapses organized into patterns of connectivity to produce desired functions
Dendritic trees: subunits of individual neurons
Neurons: basic processing unit, size ~100 μm
Local circuits: localized regions in the brain, size ~1 mm
Interregional circuits: pathways, topographic maps
Central nervous system: final level of complexity

1.4 Artificial neural networks


Neuron model

Network of neurons


What can NN do?


In principle
NN can compute any computable function (everything a normal digital computer can do)

In practice
NN are especially useful for classification and function approximation problems that are tolerant of some imprecision, have a lot of training data available, but where hard rules (such as those used in an expert system) cannot easily be applied
Almost any finite-dimensional vector function on a compact set can be approximated to arbitrary precision by a feedforward NN

Problems difficult for NN
Predicting random or pseudo-random numbers
Factoring large integers
Determining whether a large integer is prime or composite
Decrypting anything encrypted by a good algorithm


1.5 Benefits of neural networks (1/3)


1. Ability to learn from examples
Train a neural network on training data
The neural network will generalize on new data
Noise tolerant
Many learning paradigms: supervised (with a teacher), unsupervised (no teacher, self-organized), reinforcement learning

2. Adaptivity
Neural networks have a natural capability to adapt to a changing environment
Train the neural network, then retrain
Continuous adaptation in a nonstationary environment


Benefits of neural networks (2/3)


3. Nonlinearity
An artificial neuron can be linear or nonlinear
A network of nonlinear neurons has nonlinearity distributed throughout the network
Important for modelling inherently nonlinear signals

4. Fault tolerance
Capable of robust computation
Graceful degradation rather than catastrophic failure


Benefits of neural networks (3/3)


5. Massively parallel distributed structure
Well suited for VLSI implementation Very fast hardware operation

6. Neurobiological analogy
NN design is motivated by analogy with brain NN are research tool for neurobiologists Neurobiology inspires further development of artificial NN

7. Uniformity of analysis & design


Neurons represent building blocks of all neural networks Similar NN architecture for various tasks: pattern recognition, regression, time series forecasting, control applications, ...


www.stanford.edu/group/brainsinsilicon/


1.6 Brief history of neural networks (1/2)


-1940: von Helmholtz, Mach, Pavlov, etc.
General theories of learning, vision, conditioning
No specific mathematical models of neuron operation

1943: McCulloch and Pitts
Proposed the neuron model

1949: Hebb
Published his book The Organization of Behavior
Introduced the Hebbian learning rule

1958: Rosenblatt; Widrow and Hoff
Perceptron, ADALINE
First practical networks and learning rules

1969: Minsky and Papert
Published the book Perceptrons, generalised the limitations of single-layer perceptrons to multilayered systems
The neural network field went into hibernation


Brief history of neural networks (2/2)


1974: Werbos
Developed the back-propagation learning method in his PhD thesis
Several years passed before this approach was popularized

1982: Hopfield
Published a series of papers on Hopfield networks

1982: Kohonen
Developed the Self-Organizing Maps

1980s: Rumelhart and McClelland
Backpropagation rediscovered, re-emergence of the neural networks field
Books, conferences, courses, funding in USA, Europe, Japan

1990s: Radial Basis Function Networks were developed
2000s: The power of Ensembles of Neural Networks and Support Vector Machines becomes apparent


Current NN research
Topics for the 2013 International Joint Conference on NN
Neural network theory and models, computational neuroscience, cognitive models, brain-machine interfaces, embodied robotics, evolutionary neural systems, self-monitoring neural systems, learning neural networks, neurodynamics, neuroinformatics, neuroengineering, neural hardware, neural network applications, pattern recognition, machine vision, collective intelligence, hybrid systems, self-aware systems, data mining, sensor networks, agent-based systems, computational biology, bioinformatics, artificial life


1.7 Applications of neural networks (1/3)


Aerospace
High performance aircraft autopilots, flight path simulations, aircraft control systems, autopilot enhancements, aircraft component simulations, aircraft component fault detectors

Automotive
Automobile automatic guidance systems, warranty activity analyzers

Banking
Check and other document readers, credit application evaluators

Defense
Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification

Electronics
Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision, voice synthesis, nonlinear modeling


Applications of neural networks (2/3)


Financial
Real estate appraisal, loan advisor, corporate bond rating, credit line use analysis, portfolio trading program, corporate financial analysis, currency price prediction

Manufacturing
Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle identification, visual quality inspection systems, welding quality analysis, paper quality prediction, computer chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis, project planning and management, dynamic modelling of chemical process systems

Medical
Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of transplant times, hospital expense reduction, hospital quality improvement, emergency room test advisement


Applications of neural networks (3/3)


Robotics
Trajectory control, forklift robot, manipulator controllers, vision systems

Speech
Speech recognition, speech compression, vowel classification, text to speech synthesis

Securities
Market analysis, automatic bond rating, stock trading advisory systems

Telecommunications
Image and data compression, automated information services, real-time translation of spoken language, customer payment processing systems

Transportation
Truck brake diagnosis systems, vehicle scheduling, routing systems


1.8 List of symbols


Symbols in this presentation (MATLAB Toolbox equivalents in parentheses):
n: iteration, time step
t: time
x: input (MATLAB: p)
y: network output (MATLAB: a)
d: desired (target) output (MATLAB: t)
f: activation function
v: induced local field (MATLAB: n)
w: synaptic weight
b: bias
e: error


2. Neuron Model, Network Architectures and Learning

2.1 Neuron model
2.2 Activation functions
2.3 Network architectures
2.4 Learning algorithms
2.5 Learning paradigms
2.6 Learning tasks
2.7 Knowledge representation
2.8 Neural networks vs. statistical methods


2.1 Neuron model


Neuron
An information processing unit that is fundamental to the operation of a neural network

Single input neuron
scalar input x
synaptic weight w
bias b
adder or linear combiner
activation potential v
activation function f
neuron output y

Adjustable parameters: synaptic weight w, bias b

y = f(wx + b)


Neuron with vector input


Input vector
x = [x1, x2, ..., xR], R = number of elements in the input vector

Weight vector
w = [w1, w2, ..., wR]

Activation potential
v = w·x + b (product of the weight vector and the input vector, plus bias)

y = f(w·x + b) = f(w1 x1 + w2 x2 + ... + wR xR + b)

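The computation above can be sketched in a few lines of Python (illustrative only; the course itself works in MATLAB, and the weights, bias, and inputs below are arbitrary example values):

```python
import math

def neuron(x, w, b):
    """Neuron with vector input: y = f(w.x + b), here with a sigmoid f."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b   # activation potential v = w.x + b
    return 1.0 / (1.0 + math.exp(-v))              # activation function f(v)

# Example: R = 2 inputs; v = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
```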

2.2 Activation functions (1/2)


The activation function defines the output of a neuron. Common types of activation functions:

Threshold function: y(v) = 1 if v >= 0, y(v) = 0 if v < 0
Linear function: y(v) = v
Sigmoid function: y(v) = 1 / (1 + exp(-v))
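The three activation function types translate directly into code (a Python sketch for illustration; the course uses MATLAB, where these correspond to the hardlim, purelin, and logsig transfer functions):

```python
import math

def threshold(v):
    # Threshold (hard limit): 1 if v >= 0, else 0
    return 1.0 if v >= 0 else 0.0

def linear(v):
    # Linear: output equals the activation potential
    return v

def sigmoid(v):
    # Logistic sigmoid: squashes v into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-v))
```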


Activation functions (2/2)


McCulloch-Pitts Neuron (1943)


Vector input, threshold activation function:
y = f(wx + b), with binary output
y = 1 if wx + b >= 0
y = 0 if wx + b < 0
The output is binary, depending on whether the weighted input meets a specified threshold

An extremely simplified model of real biological neurons
Missing features: non-binary outputs, non-linear summation, smooth thresholding, stochasticity, temporal information processing

Nevertheless, computationally very powerful:
A network of McCulloch-Pitts neurons is capable of universal computation
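A McCulloch-Pitts neuron is easy to simulate; the Python sketch below (illustrative; the choice of weights [1, 1] and bias -1.5 is an assumption made for the example) shows it computing logical AND:

```python
def mcculloch_pitts(x, w, b):
    """McCulloch-Pitts neuron: binary output from a thresholded weighted sum."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if v >= 0 else 0

# With weights [1, 1] and bias -1.5, the sum exceeds the threshold
# only when both inputs are 1, so the neuron computes logical AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert mcculloch_pitts([x1, x2], [1, 1], -1.5) == (x1 and x2)
```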


Matlab notation
Presentation of more complex neurons and networks:
The input vector p is represented by the solid dark vertical bar
The weight vector is shown as a single-row, R-column matrix W
W [1 x R] and p [R x 1] multiply into the scalar Wp


Matlab Demos
nnd2n1: one-input neuron
nnd2n2: two-input neuron


2.3 Network architectures


About network architectures
Two or more neurons can be combined in a layer
A neural network can contain one or more layers
There is a strong link between network architecture and learning algorithm

1. Single-layer feedforward networks
An input layer of source nodes projects onto an output layer of neurons
"Single-layer" refers to the output layer (the only computation layer)

2. Multi-layer feedforward networks
One or more hidden layers
Can extract higher-order statistics

3. Recurrent networks
Contain at least one feedback loop
Powerful temporal learning capabilities


Single-layer feedforward networks


Multi-layer feedforward networks (1/2)


Multi-layer feedforward networks (2/2)


Data flows strictly feedforward: input → output
No feedback
Static network, easy learning


Recurrent networks (1/2)


Also called dynamic networks
Output depends on the current input to the network (as in static networks), and also on current or previous inputs, outputs, or states of the network

Simple recurrent network: a feedback loop through a delay element


Recurrent networks (2/2)


Layered Recurrent Dynamic Network example


2.4 Learning algorithms


An important ability of neural networks is to learn from their environment and to improve their performance through learning

Learning process
1. The neural network is stimulated by an environment
2. The neural network undergoes changes in its free parameters as a result of this stimulation
3. The neural network responds in a new way to the environment because of its changed internal structure

Learning algorithm: a prescribed set of well-defined rules for the solution of a learning problem
1. Error-correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning


Error-correction learning (1/2)


Block diagram: input x(t) drives the neural network, which responds with output y(t); y(t) is compared with the desired output d(t) to produce the error e(t)

1. The neural network is driven by input x(t) and responds with output y(t)
2. The network output y(t) is compared with the target output d(t)

Error signal = difference between target output and network output:
e(t) = d(t) - y(t)


Error-correction learning (2/2)


The error signal drives a control mechanism that corrects the synaptic weights
Corrective adjustments are designed to make the network output y(t) closer to the target d(t)
Learning is achieved by minimizing the instantaneous error energy
E(t) = (1/2) e^2(t)

Delta learning rule (Widrow-Hoff rule)
The adjustment to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse:
Δw(t) = η e(t) x(t)

Comments
The error signal must be directly measurable
Key parameter: the learning rate η
Closed-loop feedback system; stability is determined by the learning rate
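The delta rule can be demonstrated on a single linear neuron; in the Python sketch below (illustrative; the target mapping y = 2x - 1, the learning rate, and the iteration count are arbitrary choices) the rule recovers the parameters of a simple linear function:

```python
import random

random.seed(0)

def target(x):
    # "Unknown" environment the neuron should learn: d = 2x - 1
    return 2.0 * x - 1.0

w, b = 0.0, 0.0   # adjustable parameters
eta = 0.1         # learning rate

for _ in range(2000):
    x = random.uniform(-1.0, 1.0)
    y = w * x + b          # linear neuron output
    e = target(x) - y      # error signal e(t) = d(t) - y(t)
    w += eta * e * x       # delta rule: dw = eta * e * x
    b += eta * e           # bias update (the bias "synapse" has input 1)
```

After training, w and b should be close to the true values 2 and -1; a learning rate that is too large would instead make the feedback loop unstable.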


Memory-based learning
All (or most) past experiences are stored in a memory of input-output pairs (inputs and target classes): {(xi, yi)}, i = 1, ..., N

Two essential ingredients of memory-based learning
1. Define a local neighborhood of a new input x_new
2. Apply a learning rule to the stored examples in the local neighborhood of x_new

Examples of memory-based learning
Nearest neighbor rule: local neighborhood defined by the nearest training example (Euclidean distance)
K-nearest neighbor classifier: local neighborhood defined by the k nearest training examples; robust against outliers
Radial basis function network: memory-based view used for selecting the centers of basis functions
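The k-nearest neighbor classifier described above can be sketched as follows (Python, illustrative; the stored examples and query points are made up for the demonstration):

```python
from collections import Counter

def knn_classify(x_new, examples, k=3):
    """Classify x_new by majority vote among its k nearest stored examples."""
    def dist2(a, b):
        # Squared Euclidean distance (ranking is the same as for the distance)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # examples: the "memory" of (input_vector, class_label) pairs
    neighbors = sorted(examples, key=lambda ex: dist2(x_new, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

memory = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
          ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B")]
```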


Hebbian learning
The oldest and most famous learning rule (Hebb, 1949), formulated as associative learning in a neurobiological context:
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

There is strong physiological evidence for Hebbian learning in the hippocampus, important for long-term memory and spatial navigation

Hebbian learning (Hebbian synapse)
A time-dependent, highly local, and strongly interactive mechanism that increases synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activities:
1. If two neurons on either side of a synapse are activated simultaneously, the strength of that synapse is selectively increased
2. If two neurons on either side of a synapse are activated asynchronously, that synapse is selectively weakened or eliminated

Simplest form of Hebbian learning:
Δw(t) = η y(t) x(t)
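The simplest Hebbian rule translates directly into code (a Python sketch; the activity values and learning rate are arbitrary illustrative choices):

```python
def hebbian_update(w, x, y, eta=0.01):
    """Simplest Hebbian rule: dw = eta * y * x, applied element-wise."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
# Repeated, correlated pre- and postsynaptic activity strengthens the weights
for _ in range(100):
    x = [1.0, -1.0]   # presynaptic activity
    y = 1.0           # postsynaptic activity
    w = hebbian_update(w, x, y)
```

Note that the weight on the positively correlated input grows while the anti-correlated one is driven negative; with no decay term, the weights grow without bound, which is the usual motivation for normalized variants such as Oja's rule.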


Competitive learning
Competitive learning network architecture
1. A set of inputs connected to a layer of output neurons
2. Each output neuron receives excitation from all inputs
3. Output neurons compete to become active by exchanging lateral inhibitory connections
4. Only a single neuron is active at any time

Competitive learning rule
The neuron with the largest induced local field becomes the winning neuron
The winning neuron shifts its synaptic weights toward the input

Individual neurons specialize on ensembles of similar patterns and become feature detectors for different classes of input patterns
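A single winner-take-all step can be sketched as below (Python, illustrative; the winner is chosen here by smallest Euclidean distance to the input, which for normalized weight vectors is equivalent to the largest induced local field, and the weights and learning rate are arbitrary example values):

```python
def competitive_step(weights, x, eta=0.5):
    """One winner-take-all step: the closest weight vector moves toward x."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    winner = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    # Only the winning neuron updates: w <- w + eta * (x - w)
    weights[winner] = [wi + eta * (xi - wi)
                       for wi, xi in zip(weights[winner], x)]
    return winner

weights = [[0.0, 0.0], [1.0, 1.0]]      # two output neurons
winner = competitive_step(weights, [0.9, 1.0])
```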


2.5 Learning paradigms


Learning algorithm: a prescribed set of well-defined rules for the solution of a learning problem
1. Error-correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning

Learning paradigm: the manner in which a neural network relates to its environment
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning


Supervised learning
Learning with a teacher
The teacher has knowledge of the environment; the knowledge is represented by a set of input-output examples
Block diagram: the environment feeds both the teacher and the learning system; the teacher's target response (optimal action) is compared with the learning system's output to form the error signal

Learning algorithms
Error-correction learning
Memory-based learning


Unsupervised learning
Unsupervised or self-organized learning
No external teacher oversees the learning process
Only a set of input examples is available, no output examples (Environment → Learning system)

Unsupervised NNs usually perform some kind of data compression, such as dimensionality reduction or clustering

Learning algorithms
Hebbian learning Competitive learning


Reinforcement learning
No teacher: the environment offers only a primary reinforcement signal
The system learns under delayed reinforcement: a temporal sequence of inputs results in the generation of a reinforcement signal
The goal is to minimize the expectation of the cumulative cost of actions taken over a sequence of steps
RL is realized through two neural networks: a Critic and a Learning system
Block diagram: Environment → primary reinforcement → Critic → heuristic reinforcement → Learning system → actions → Environment
The critic network converts the primary reinforcement signal (obtained directly from the environment) into a higher-quality heuristic reinforcement signal, which solves the temporal credit assignment problem


2.6 Learning tasks (1/7)

1. Pattern Association
Associative memory is a brain-like distributed memory that learns by association
Two phases in the operation of associative memory:
1. Storage
2. Recall

Autoassociation
The neural network stores a set of patterns by repeatedly presenting them to the network
Then, when presented with a distorted pattern, the neural network is able to recall the original pattern
Unsupervised learning algorithms

Heteroassociation
A set of input patterns is paired with an arbitrary set of output patterns
Supervised learning algorithms


Learning tasks (2/7)

2. Pattern Recognition
In pattern recognition, input signals are assigned to categories (classes)
Two phases of pattern recognition:
1. Learning (supervised)
2. Classification

Statistical nature of pattern recognition
Patterns are represented in a multidimensional decision space
The decision space is divided into separate regions, one for each class
Decision boundaries are determined by a learning process
Support Vector Machine example


Learning tasks (3/7)

3. Function Approximation
An arbitrary nonlinear input-output mapping y = f(x) can be approximated by a neural network, given a set of labeled examples {xi, yi}, i = 1, ..., N
The task is to approximate the mapping f(x) by a neural network F(x) so that f(x) and F(x) are close enough:
||F(x) - f(x)|| < ε for all x (ε is a small positive number)

The neural network mapping F(x) can be realized by supervised learning (error-correction learning algorithm)
Important function approximation tasks: system identification, inverse system
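The approximation criterion above can be illustrated with a minimal Python sketch (not one of the course's MATLAB examples). Both f and F below are stand-in functions chosen purely for illustration: f is a "true" mapping, and F plays the role of a trained network's output.

```python
import math

def f(x):                 # "true" mapping (assumed known here, for checking only)
    return math.sin(x)

def F(x):                 # stand-in for a trained network's approximation
    return x - x**3 / 6   # 3rd-order Taylor polynomial of sin(x)

xs = [i / 100.0 for i in range(-50, 51)]        # sample grid on [-0.5, 0.5]
sup_error = max(abs(F(x) - f(x)) for x in xs)   # worst-case (sup-norm) error
eps = 1e-3
print(sup_error < eps)    # True: F approximates f within eps on this interval
```

On this interval the worst-case error is about 2.6e-4, so the criterion ||F(x) - f(x)|| < ε holds for ε = 1e-3.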

Learning tasks (4/7)

System identification
[Diagram: the environment drives both the unknown system and the neural network; the error signal is the difference (+/-) between the unknown system response and the network output, and is used to adjust the network]

Inverse system
[Diagram: inputs from the environment pass through the system and then through the neural network; the error signal (+/-) between the original input and the network output is used to adjust the network]


Learning tasks (5/7)

4. Control
Neural networks can be used to control a plant (a process)
The brain is the best example of a parallel distributed generalized controller:
Operates thousands of actuators (muscles)
Can handle nonlinearity and noise
Can handle invariances
Can optimize over a long-range planning horizon

Feedback control system (Model reference control)
The NN controller has to supply inputs that will drive the plant according to a reference


Model predictive control
The NN model provides multi-step-ahead predictions for the optimizer


Learning tasks (6/7)

5. Filtering
Filter: a device or algorithm used to extract information about a prescribed quantity of interest from a noisy data set
Filters can be used for three basic information processing tasks:

1. Filtering
Extraction of information at discrete time n by using measured data up to and including time n
Examples: cocktail party problem, blind source separation

2. Smoothing
Differs from filtering in that:
a) The data need not be available at time n
b) Data measured later than n can be used to obtain this information

3. Prediction
Deriving information about the quantity in the future, at time n+h, h > 0, by using data measured up to and including time n
Examples: forecasting of energy consumption, stock market prediction

Learning tasks (7/7)

6. Beamforming
A spatial form of filtering, used to distinguish between the spatial properties of a target signal and background noise
The device is called a beamformer
Beamforming is used in the human auditory response and by echo-locating bats, so the task is suitable for neural network application
Common beamforming tasks: radar and sonar systems
The task is to detect a target in the presence of receiver noise and interfering signals
The target signal originates from an unknown direction
No a priori information is available on the interfering signals

Neural beamformer, neuro-beamformer, attentional neurocomputers


Adaptation
Learning has a spatio-temporal nature
Space and time are fundamental dimensions of learning (control, beamforming)

1. Stationary environment
Learning under the supervision of a teacher; the weights are then frozen
The neural network then relies on memory to exploit past experiences

2. Nonstationary environment
Statistical properties of the environment change with time
The neural network should continuously adapt its weights in real time
Adaptive system: continuous learning

3. Pseudostationary environment
Changes are slow over a short temporal window
Speech: stationary over an interval of 10-30 ms
Ocean radar: stationary over an interval of several seconds


2.7 Knowledge representation

What is knowledge?
"Stored information or models used by a person or machine to interpret, predict, and appropriately respond to the outside world" (Fischler & Firschein, 1987)

Knowledge representation
A good solution depends on a good representation of knowledge
Knowledge of the world consists of:
1. Prior information: facts about what is and what has been known
2. Observations of the world: measurements, obtained through sensors designed to probe the environment

Observations can be:
1. Labeled: input signals are paired with desired responses
2. Unlabeled: input signals only


Knowledge representation in NN
Design of neural networks is based directly on real-life data
Examples to train the neural network are taken from observations

Examples to train the neural network can be:
Positive examples ... input and correct target output
e.g. sonar data + echoes from submarines
Negative examples ... input and false output
e.g. sonar data + echoes from marine life

Knowledge representation in neural networks
Defined by the values of the free parameters (synaptic weights and biases)
Knowledge is embedded in the design of a neural network
Interpretation problem: neural networks suffer from an inability to explain how a result (decision / prediction / classification) was obtained
This is a serious limitation for safety-critical applications (medical diagnosis, air traffic)
Explanation capability can be added by integrating NN with other artificial intelligence methods


Knowledge representation rules for NN

Rule 1: Similar inputs from similar classes should produce similar representations inside the network, and should be classified into the same category
Rule 2: Items to be categorized as separate classes should be given widely different representations in the network
Rule 3: If a particular feature is important, then there should be a large number of neurons involved in the representation of that item in the network
Rule 4: Prior information and invariances should be built into the design of a neural network, thereby simplifying the network design by not having to learn them

Prior information and invariances (Rule 4)

Application of Rule 4 results in neural networks with specialized structure
Biological visual and auditory networks are highly specialized
A specialized network has a smaller number of parameters:
needs less training data
faster learning
faster network throughput
cheaper because of its smaller size

How to build prior information into neural network design
Currently no well-defined rules, but useful ad-hoc procedures
We may use a combination of two techniques:
1. Receptive fields: restricting the network architecture by using local connections
2. Weight-sharing: several neurons share the same synaptic weights


How to build invariances into NN

Character recognition example
Transformations the pattern recognition system should be invariant to:
Original / Size / Rotation / Shift / Incomplete image

Techniques
1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space


Invariant feature space

Neural net classifier with an invariant feature extractor:
Input -> Invariant feature extractor -> Neural network classifier -> Class estimate

Features
Characterize the essential information content of the input data
Should be invariant to transformations of the input

Benefits
1. Dimensionality reduction: the number of features is small compared to the original input space
2. Relaxed design requirements for the neural network
3. Invariances for all objects can be assured (for known transformations)
Prior knowledge is required!


Example 2A (1/4): Invariant character recognition

Problem: distinguishing handwritten characters "a" and "b"

Classifier design
Invariant feature extractor -> Neural network classifier -> Class estimate: A, B

Image representation
Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)


Example 2A (2/4): Problems with image representation

1. Invariance problem (various transformations)
2. High dimensionality problem
Image size 256x256 -> 65536 inputs

Curse of dimensionality: increasing input dimensionality leads to sparse data, and this provides a very poor representation of the mapping; problems with correct classification and generalization

Possible solution
Combining inputs into features
The goal is to obtain just a few features instead of 65536 inputs

Idea for feature extraction (for character recognition):
F1 = character height / character width

Example 2A (3/4): Feature extraction

Extracted feature:
F1 = character height / character width

Distribution for various samples from classes A and B:
[Diagram: distributions of F1 for samples from class A and class B, with a decision threshold separating Class A from Class B]

Overlapping distributions: need for additional features
F1, F2, F3, ...

Example 2A (4/4): Classification in multi-feature space

Classification in the space of two features (F1, F2):
[Diagram: samples from class A and class B in the (F1, F2) plane, separated by a decision boundary]

A neural network can be used for classification in the feature space (F1, F2)
2 inputs instead of 65536 original inputs
Improved generalization and classification ability


Generalization and model complexity

What is the optimal decision boundary?
[Three candidate boundaries:]
Linear classifier: insufficient, false classifications
Optimal classifier: ?
Over-fitting: correct classification but poor generalization

Best generalization is achieved by a model whose complexity is neither too small nor too large
Occam's razor principle: we should prefer simpler models to more complex models
Tradeoff: modeling simplicity vs. modeling capacity


2.8 Neural networks vs. statistical methods (1/3)

Considerable overlap between neural nets and statistics
Statistical inference means learning to generalize from noisy data
Feedforward nets are a subset of the class of nonlinear regression and discrimination models
Application of statistical theory to neural networks: Bishop (1995), Ripley (1996)

Most NNs that can learn to generalize effectively from noisy data are similar or identical to statistical methods
Single-layer feedforward nets are basically generalized linear models
Two-layer feedforward nets are closely related to projection pursuit regression
Probabilistic neural nets are identical to kernel discriminant analysis
Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis
Kohonen self-organizing maps are discrete approximations to principal curves and surfaces
Hebbian learning is closely related to principal component analysis

Some neural network areas have no relation to statistics
Reinforcement learning
Stopped training (similar to shrinkage estimation, but the method is quite different)

Neural networks vs. statistical methods (2/3)

Many statistical methods can be used for flexible nonlinear modeling:
Polynomial regression, Fourier series regression
K-nearest neighbor regression and discriminant analysis
Kernel regression and discriminant analysis
Wavelet smoothing, local polynomial smoothing
Smoothing splines, B-splines
Tree-based models (CART, AID, etc.)
Multivariate adaptive regression splines (MARS)
Projection pursuit regression, various Bayesian methods

Why use neural nets rather than statistical methods?
Some advantages of MLPs over projection pursuit regression:
The multilayer perceptron (MLP) tends to be useful in similar situations as projection pursuit regression, i.e. the number of inputs is fairly large and many of the inputs are relevant, but most of the predictive information lies in a low-dimensional subspace
Computing predicted values from MLPs is simpler and faster
MLPs are better at learning moderately pathological functions than are many other methods with stronger smoothness assumptions

Neural networks vs. statistical methods (3/3)

Neural Network Jargon .................................... Statistical Jargon

Generalizing from noisy data .................................... Statistical inference
Neuron, unit, node .................................................... A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof
Neural networks ....................................................... A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems
Architecture .............................................................. Model
Training, Learning, Adaptation ................................. Estimation, Model fitting, Optimization
Classification ............................................................ Discriminant analysis
Mapping, Function approximation ............................ Regression
Competitive learning ................................................. Cluster analysis
Hebbian learning ...................................................... Principal components
Training set ............................................................... Sample, Construction sample
Input ......................................................................... Independent variables, Predictors, Regressors, Explanatory variables, Carriers
Output ....................................................................... Predicted values
Generalization .......................................................... Interpolation, Extrapolation, Prediction
Prediction ................................................................. Forecasting

MATLAB example
nn02_neuron_output


MATLAB example
nn02_custom_nn


MATLAB example
nnstart


3. Perceptrons and Linear Filters

3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Perceptron network
3.4 Adaline
3.5 LMS learning rule
3.6 Adaline network
3.7 ADALINE vs. Perceptron
3.8 Adaptive filtering
3.9 XOR problem


Introduction
Pioneering neural network contributions
McCulloch & Pitts (1943): the idea of neural networks as computing machines
Rosenblatt (1958): proposed the perceptron as the first supervised learning model
Widrow and Hoff (1961): least-mean-square learning as an important generalization of perceptron learning

Perceptron
Layer of McCulloch-Pitts neurons with adjustable synaptic weights
Simplest form of a neural network for classification of linearly separable patterns
Perceptron convergence theorem for two linearly separable classes

Adaline
Similar to the perceptron, trained with LMS learning
Used for linear adaptive filters


3.1 Perceptron neuron

Perceptron neuron (McCulloch-Pitts neuron): hard-limit (threshold) activation function

y(v) = 1 if v >= 0
y(v) = 0 if v < 0

with the induced local field v formed from inputs x1, ..., xR

Perceptron output: 0 or 1, useful for classification
If y = 0, the pattern belongs to class A
If y = 1, the pattern belongs to class B

Linear discriminant function

Perceptron with two inputs:
y = f(w·x + b) = f(w1 x1 + w2 x2 + b)

The separation between the two classes is a straight line, given by
w1 x1 + w2 x2 + b = 0

Geometric representation (solving for x2):
x2 = -(w1/w2) x1 - b/w2

The perceptron represents a linear discriminant function
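A minimal Python sketch of this two-input perceptron (the weight and bias values below are illustrative choices, not from the slides): the point is classified by the sign of v = w1*x1 + w2*x2 + b, and the line v = 0 is the decision boundary.

```python
# Two-input perceptron with hard-limit activation; w1, w2, b are
# hand-picked illustrative values defining the boundary x1 + x2 = 1.5.
def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    v = w1 * x1 + w2 * x2 + b      # induced local field
    return 1 if v >= 0 else 0      # hard-limit (threshold) activation

# Points on opposite sides of the boundary line:
print(perceptron(1, 1))   # 1 -> class B
print(perceptron(0, 0))   # 0 -> class A
```

With these particular weights the neuron happens to implement the logical AND of binary inputs, a linearly separable function.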


Matlab Demos (Perceptron)

nnd2n2: Two-input perceptron
nnd4db: Decision boundaries


How to train a perceptron?

How to train the weights and bias?
Perceptron learning rule
Least-mean-square (LMS) learning rule, or delta rule

Both are iterative learning procedures:
1. A learning sample is presented to the network
2. For each network parameter, the new value is computed by adding a correction:

wj(n+1) = wj(n) + Δwj(n)
b(n+1) = b(n) + Δb(n)

Formulation of the learning problem
How do we compute Δwj(n) and Δb(n) in order to classify the learning patterns correctly?


3.2 Perceptron learning rule

A set of learning samples (inputs and target classes):
{(xi, di)}, i = 1, ..., N,  xi ∈ ℝ^R,  di ∈ {0, 1}

Objective:
Reduce the error e between the target class d and the neuron response y (error-correction learning)
e = d - y

Learning procedure
1. Start with random weights for the connections
2. Present an input vector xi from the set of training samples
3. If the perceptron response is wrong (y ≠ d, e ≠ 0), modify all connections w
4. Go back to 2


Three conditions for a neuron

After the presentation of input x, the neuron can be in three conditions:
CASE 1: The neuron output is correct; the weights w are not altered
CASE 2: The neuron output is 0 instead of 1 (y=0, d=1, e=d-y=1): input x is added to the weight vector w
This makes the weight vector point closer to the input vector, increasing the chance that the input vector will be classified as 1 in the future
CASE 3: The neuron output is 1 instead of 0 (y=1, d=0, e=d-y=-1): input x is subtracted from the weight vector w
This makes the weight vector point farther away from the input vector, increasing the chance that the input vector will be classified as 0 in the future


Three conditions rewritten

Three conditions for a neuron rewritten:
CASE 1: e = 0  ->  Δw = 0
CASE 2: e = 1  ->  Δw = x
CASE 3: e = -1 ->  Δw = -x

Three conditions in a single expression:
Δw = (d - y) x = e x

Similarly for the bias:
Δb = (d - y)(1) = e

Perceptron learning rule:
wj(n+1) = wj(n) + e(n) xj(n)
b(n+1) = b(n) + e(n)


Convergence theorem
For the perceptron learning rule there exists a convergence theorem:
Theorem 1: If there exists a set of connection weights w which is able to perform the transformation d = y(x), the perceptron learning rule will converge to some solution in a finite number of steps for any initial choice of the weights.

Comments
The theorem is only valid for linearly separable classes
Outliers can cause long training times
If the classes are linearly separable, the perceptron offers a powerful pattern recognition tool


Perceptron learning rule summary

1. Start with random weights for the connections w
2. Select an input vector x from the set of training samples
3. If the perceptron response is wrong (y ≠ d), modify all connections according to the learning rule:
Δw = e x
Δb = e
4. Go back to 2 (until all input vectors are correctly classified)
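The procedure above can be sketched in a few lines of Python (not one of the course's MATLAB demos). The task here is the linearly separable AND problem; weights start at zero rather than random so the run is reproducible.

```python
# Perceptron learning rule on the AND problem: dw = e*x, db = e,
# repeated until every pattern is classified correctly.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = [0.0, 0.0], 0.0

for epoch in range(20):
    errors = 0
    for x, d in samples:
        v = w[0] * x[0] + w[1] * x[1] + b
        y = 1 if v >= 0 else 0         # hard-limit activation
        e = d - y                      # error-correction signal
        if e != 0:
            w = [w[j] + e * x[j] for j in range(2)]   # dw = e*x
            b += e                                    # db = e
            errors += 1
    if errors == 0:                    # convergence: all patterns correct
        break

print(all((1 if w[0]*x[0] + w[1]*x[1] + b >= 0 else 0) == d
          for x, d in samples))        # True
```

Because AND is linearly separable, the convergence theorem guarantees this loop terminates in a finite number of steps.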


Matlab demo (Perceptron learning rule)

nnd4pr: Perceptron learning rule (two-input perceptron)


MATLAB example: nn03_perceptron

Classification of linearly separable data with a perceptron


Matlab demo (Presence of an outlier)

demop4: Slow learning in the presence of an outlier


Matlab demo (Linearly non-separable classes)

demop6: Perceptron attempts to classify linearly non-separable classes


Matlab demo (Classification application)

nnd3pc: Perceptron classification, fruit example


3.3 Perceptron network

Single layer of perceptron neurons

Classification into more than two linearly separable classes

MATLAB example: nn03_perceptron_network

Classification of a 4-class problem with a 2-neuron perceptron


3.4 Adaline
ADALINE = Adaptive Linear Element
Widrow and Hoff, 1961: LMS (Least Mean Square) learning, or delta rule
An important generalization of the perceptron learning rule
Main difference with the perceptron: the activation function
Perceptron: threshold activation function
ADALINE: linear activation function

Both the Perceptron and ADALINE can only solve linearly separable problems

Linear neuron
Basic ADALINE element
Linear transfer function:

y = w·x + b,  with activation y(v) = v,  over inputs x1, ..., xR


Simple ADALINE
Simple ADALINE with two inputs:

y = w1 x1 + w2 x2 + b

Like a perceptron, ADALINE has a decision boundary, defined by the network inputs for which the network output is zero:
w1 x1 + w2 x2 + b = 0
(see the Perceptron decision boundary)

ADALINE can be used to classify objects into categories

3.5 LMS learning rule

LMS = Least Mean Square learning rule
A set of learning samples (inputs and target values):
{(xi, di)}, i = 1, ..., N,  xi ∈ ℝ^R,  di ∈ ℝ

Objective: reduce the error e between the target d and the neuron response y (error-correction learning)
e = d - y

The goal is to minimize the average sum of squared errors:

mse = (1/N) Σ_{n=1}^{N} (d(n) - y(n))²


LMS algorithm (1/3)

The LMS algorithm is based on an approximate steepest descent procedure
Widrow & Hoff introduced the idea to estimate the mean square error

mse = (1/N) Σ_{n=1}^{N} (d(n) - y(n))²

by using the squared error at each iteration

e²(n) = (d(n) - y(n))²

and to change the network weights proportionally to the negative derivative of the error:

Δwj(n) = -α ∂e²(n)/∂wj

with some learning constant α


LMS algorithm (2/3)

Now we expand the expression for the weight change:

Δwj(n) = -α ∂e²(n)/∂wj = -2α e(n) ∂e(n)/∂wj

With e(n) = d(n) - y(n) and the neuron activation

y(n) = W x(n) = w1 x1(n) + ... + wj xj(n) + ... + wR xR(n)

we have ∂e(n)/∂wj = -∂y(n)/∂wj = -xj(n). Absorbing the factor 2 into the learning constant (a cosmetic correction), we finally obtain the weight change at step n:

Δwj(n) = α e(n) xj(n)


LMS algorithm (3/3)

Final form of the LMS learning rule:
wj(n+1) = wj(n) + α e(n) xj(n)
b(n+1) = b(n) + α e(n)

Learning is regulated by the learning rate α
For stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix xᵀx of the input vectors

Limitations
A linear network can only learn linear input-output mappings
Proper selection of the learning rate is required
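A minimal Python sketch of the LMS rule in action (illustrative, not one of the course's MATLAB demos): a single ADALINE neuron learns the linear target y = 2·x1 - x2 + 0.5. The learning rate α = 0.05 and the number of iterations are illustrative choices, small enough for stable learning.

```python
# LMS (Widrow-Hoff) training of a linear neuron on a noiseless linear target.
import random

random.seed(0)                       # reproducible run
w, b, alpha = [0.0, 0.0], 0.0, 0.05

for n in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    d = 2 * x[0] - x[1] + 0.5        # target response
    y = w[0] * x[0] + w[1] * x[1] + b   # linear neuron output
    e = d - y                        # error-correction signal
    w = [w[j] + alpha * e * x[j] for j in range(2)]   # dw_j = alpha*e*x_j
    b += alpha * e                                    # db   = alpha*e

# Weights approach the true parameters (2, -1, 0.5):
print(abs(w[0] - 2) < 0.1, abs(w[1] + 1) < 0.1, abs(b - 0.5) < 0.1)
```

Because the target is itself linear and noiseless, the error can be driven arbitrarily close to zero; with a nonlinear target, a linear network could only find the best linear fit.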


Matlab demo (LMS learning)

pp02: Gradient descent learning by the LMS learning rule


3.6 Adaline network

ADALINE network = MADALINE (single layer of ADALINE neurons)


3.7 ADALINE vs. Perceptron

Architectures
ADALINE | PERCEPTRON

Learning rules
LMS learning:
wj(n+1) = wj(n) + α e(n) xj(n)
b(n+1) = b(n) + α e(n)

Perceptron learning:
wj(n+1) = wj(n) + e(n) xj(n)
b(n+1) = b(n) + e(n)


ADALINE and Perceptron summary

Single-layer networks can be built based on ADALINE or Perceptron neurons
Both architectures are suitable for learning only linear input-output relationships
The Perceptron, with its threshold activation function, is suitable for classification problems
ADALINE, with its linear output, is more suitable for regression & filtering
ADALINE is suitable for continuous learning

3.8 Adaptive filtering

ADALINE is one of the most widely used neural networks in practical applications
Adaptive filtering is one of its major application areas
We introduce a new element: the tapped delay line (TDL)
The input signal enters from the left and passes through N-1 delays
The output of the tapped delay line is an N-dimensional vector, composed of the current and past inputs
[Diagram: tapped delay line with input p(k) and N-1 delay elements]
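The tapped delay line can be sketched in a few lines of Python (an illustration, not from the course materials): each new sample shifts the previous ones down, and the N most recent samples form the output vector.

```python
# A tapped delay line (TDL): at each step the current input plus the N-1
# most recent past inputs form the N-dimensional output vector,
# initialized with zeros before any input arrives.
from collections import deque

def make_tdl(n_taps):
    taps = deque([0.0] * n_taps, maxlen=n_taps)
    def push(p):
        taps.appendleft(p)        # newest sample first; oldest is discarded
        return list(taps)         # [p(k), p(k-1), ..., p(k-N+1)]
    return push

tdl = make_tdl(3)
tdl(1.0)          # [1.0, 0.0, 0.0]
tdl(2.0)          # [2.0, 1.0, 0.0]
print(tdl(3.0))   # [3.0, 2.0, 1.0]
```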


Adaptive filter
Adaptive filter = ADALINE combined with a TDL

a(k) = W p + b = Σ_i wi p(k - i + 1) + b

Simple adaptive filter example

Adaptive filter with three delayed inputs:

a(t) = w1 p(t) + w2 p(t-1) + w3 p(t-2) + b


Adaptive filter for prediction

An adaptive filter can be used to predict the next value of a time series, p(t+1)
[Diagram: during learning, the past values p(t-2), p(t-1), p(t) are used to predict p(t+1), and the prediction error drives adaptation; in operation, the trained filter outputs the predicted next value]
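A Python sketch of such a predictor (illustrative, not the course's nn_03_adaline MATLAB example): an ADALINE with a 3-tap delay line is trained by the LMS rule to predict the next sample of a sinusoidal signal one step ahead. The tap count, learning rate, and signal are all illustrative choices; a pure sinusoid is exactly predictable by a short linear filter, so the error can become very small.

```python
import math

signal = [math.sin(0.3 * t) for t in range(1000)]
n_taps, alpha = 3, 0.1
w, b = [0.0] * n_taps, 0.0

for t in range(n_taps - 1, len(signal) - 1):
    x = signal[t - n_taps + 1: t + 1][::-1]   # TDL output: [p(t), p(t-1), p(t-2)]
    d = signal[t + 1]                          # target: the next sample
    y = sum(wj * xj for wj, xj in zip(w, x)) + b
    e = d - y                                  # prediction error
    w = [wj + alpha * e * xj for wj, xj in zip(w, x)]   # LMS update
    b += alpha * e

# One-step-ahead prediction of the next (unseen) sample:
x = signal[-n_taps:][::-1]
y = sum(wj * xj for wj, xj in zip(w, x)) + b
print(abs(y - math.sin(0.3 * 1000)) < 0.05)
```

Because learning happens sample by sample, the same loop keeps working if the signal's statistics drift slowly, which is exactly the continuous-learning setting where ADALINE shines.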


Noise cancellation example

An adaptive filter can be used to cancel engine noise in the pilot's voice in an airplane
The goal is to obtain a signal that contains the pilot's voice, but not the engine noise
A linear neural net is adaptively trained to predict the combined pilot/engine signal m from an engine signal n
Only the engine noise n is available to the network, so it only learns to predict the engine's contribution to the pilot/engine signal m
The network error e becomes equal to the pilot's voice
The linear adaptive network adaptively learns to cancel the engine noise
Such adaptive noise canceling generally does a better job than a classical filter, because the noise here is subtracted from, rather than filtered out of, the signal m


Single-layer adaptive filter network

If more than one output neuron is required, a tapped delay line can be connected to a layer of neurons


Matlab demos (ADALINE)

nnd10eeg: ADALINE for noise filtering of EEG signals
nnd10nc: Adaptive noise cancellation


MATLAB example: nn_03_adaline

ADALINE time series prediction with an adaptive linear filter


3.9 XOR problem

A single-layer perceptron cannot represent the XOR function
One of Minsky and Papert's most discouraging results
Example: perceptron with two inputs

Discriminant function:
x2 = -(w1/w2) x1 - b/w2

Only the AND and OR functions can be represented by the Perceptron


XOR solution
Extending the single-layer perceptron to a multi-layer perceptron by introducing hidden units
[Diagram: two-layer network with inputs x1, x2, two hidden units, and one output unit; specific weight and bias values not recoverable from the slide]

The XOR problem can be solved, but we no longer have a learning rule to train the network
Multilayer perceptrons can do everything. How to train them?
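One illustrative hand-picked two-layer solution (these particular weights are an assumption for illustration, not the values on the slide) can be sketched in Python: the hidden units implement OR and AND, and the output unit fires when OR is true but AND is not, which is exactly XOR.

```python
# Two-layer perceptron computing XOR with hand-set weights:
# h1 = OR(x1, x2), h2 = AND(x1, x2), output = h1 AND NOT h2.
def step(v):
    return 1 if v >= 0 else 0      # hard-limit activation

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)       # hidden unit 1: OR
    h2 = step(x1 + x2 - 1.5)       # hidden unit 2: AND
    return step(h1 - h2 - 0.5)     # output: OR and not AND -> XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 1, 1, 0]
```

The weights here are set by hand; the whole point of the next chapter is how to learn such hidden-layer weights automatically.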


Homework
Create a two-layer perceptron to solve the XOR problem
Create a custom network
Demonstrate the solution


4. Backpropagation

4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons


Introduction
Single-layer networks have severe restrictions
Only linearly separable tasks can be solved

Minsky and Papert (1969)
Showed the power of a two-layer feed-forward network
But didn't find the solution of how to train the network

Werbos (1974); Parker (1985), LeCun (1985), Rumelhart (1986)
Solved the problem of training multi-layer networks by back-propagating the output errors through the hidden layers of the network
Backpropagation learning rule



4.1 Multilayer feedforward networks

An important class of neural networks:
Input layer (only distributing inputs, without processing)
One or more hidden layers
Output layer

Commonly referred to as a multilayer perceptron

Properties of multilayer perceptrons

1. Neurons include a nonlinear activation function
Without nonlinearity, the capacity of the network is reduced to that of a single-layer perceptron
The nonlinearity must be smooth (differentiable everywhere), not hard-limiting as in the original perceptron
Often, the logistic function is used:
φ(v) = 1 / (1 + exp(-v))

2. One or more layers of hidden neurons
Enable learning of complex tasks by extracting features from the input patterns

3. Massive connectivity
Neurons in successive layers are fully interconnected
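The logistic activation and its derivative can be written out in a short Python sketch; the derivative matters because backpropagation needs the activation to be differentiable everywhere, and for the logistic function it takes the convenient form φ'(v) = φ(v)(1 - φ(v)).

```python
# Logistic (sigmoid) activation and its derivative.
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logistic_deriv(v):
    y = logistic(v)
    return y * (1.0 - y)     # phi'(v) = phi(v) * (1 - phi(v))

print(logistic(0.0))         # 0.5
print(logistic_deriv(0.0))   # 0.25 (the maximum slope, at v = 0)
```

Note how the derivative is obtained from the activation value itself, which is what makes the backward pass cheap to compute.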


Matlab demo
nnd11nf: Response of a feedforward network with one hidden layer


About backpropagation
Multilayer perceptrons can be trained by the backpropagation learning rule
Based on the error-correction learning rule
A generalization of the LMS learning rule (used to train ADALINE)

Backpropagation consists of two passes through the network:
1. Forward pass
The input is applied to the network and propagated to the output
Synaptic weights stay frozen
Based on the desired response, the error signal is calculated
2. Backward pass
The error signal is propagated backwards from output to input
Synaptic weights are adjusted according to the error gradient


4.2 Backpropagation algorithm (1/9)


A set of learning samples (inputs and target outputs):
{(x_n, d_n)}, n = 1, ..., N

Error signal at the output layer, neuron j, learning iteration n:
e_j(n) = d_j(n) - y_j(n)

Instantaneous error energy of the output layer with R neurons:
E(n) = (1/2) Σ_{j=1..R} e_j(n)^2

Average error energy over the whole learning set:
E_av = (1/N) Σ_{n=1..N} E(n)


Backpropagation algorithm (2/9)


The average error energy E_av represents a cost function, a measure of learning performance
E_av is a function of the free network parameters:
synaptic weights
bias levels

The learning objective is to minimize the average error energy E_av by adjusting the free network parameters
We use an approximation: pattern-by-pattern learning instead of epoch learning
Parameter adjustments are made for each pattern presented to the network
The instantaneous error energy is minimized at each step, instead of the average error energy


Backpropagation algorithm (3/9)


As in the LMS algorithm, backpropagation applies a correction to each weight proportional to the partial derivative of the instantaneous error energy:

Δw_ji(n) ∝ -∂E(n)/∂w_ji(n)

Expressing this gradient by the chain rule:

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

Signal flow through neuron j: input y_i -> synaptic weight w_ji -> induced local field v_j -> network output y_j, with output error e_j = d_j - y_j and E = (1/2) Σ_{j=1..R} e_j^2


Backpropagation algorithm (4/9)


1. Gradient on the output error:
∂E(n)/∂e_j(n) = e_j(n)

2. Gradient on the network output, using e_j(n) = d_j(n) - y_j(n):
∂e_j(n)/∂y_j(n) = -1

3. Gradient on the induced local field, using y_j(n) = f(v_j(n)):
∂y_j(n)/∂v_j(n) = f'(v_j(n))

4. Gradient on the synaptic weight, using v_j(n) = Σ_{i=0..m} w_ji(n) y_i(n):
∂v_j(n)/∂w_ji(n) = y_i(n)


Backpropagation algorithm (5/9)


Putting the gradients together:

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)
= e_j(n) · (-1) · f'(v_j(n)) · y_i(n)
= -e_j(n) f'(v_j(n)) y_i(n)

The correction of the synaptic weight is defined by the delta rule:

Δw_ji(n) = -η ∂E(n)/∂w_ji(n) = η e_j(n) f'(v_j(n)) y_i(n)

where η is the learning rate and δ_j(n) = e_j(n) f'(v_j(n)) is the local gradient, so that

Δw_ji(n) = η δ_j(n) y_i(n)

Backpropagation algorithm (6/9)


CASE 1: Neuron j is an output node
The output error e_j(n) is available, so computation of the local gradient is straightforward:
δ_j(n) = e_j(n) f'(v_j(n))
For the logistic function f(v) = 1 / (1 + exp(-a v)):
f'(v_j(n)) = a exp(-a v_j(n)) / [1 + exp(-a v_j(n))]^2

CASE 2: Neuron j is a hidden node
The hidden error is not available (credit assignment problem)
The local gradient is obtained by backpropagating errors through the network:
δ_j(n) = -∂E(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) = -∂E(n)/∂y_j(n) · f'(v_j(n))
What remains is the derivative of the output error energy E with respect to the hidden layer output y_j


Backpropagation algorithm (7/9)


CASE 2: Neuron j is a hidden node (continued)
The instantaneous error energy of the output layer with R neurons:
E(n) = (1/2) Σ_{k=1..R} e_k(n)^2

Expressing the gradient of the output error energy E on the hidden layer output y_j:
∂E(n)/∂y_j(n) = Σ_k e_k(n) ∂e_k(n)/∂y_j(n) = Σ_k e_k(n) · ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

With
e_k(n) = d_k(n) - y_k(n) = d_k(n) - f(v_k(n))  =>  ∂e_k(n)/∂v_k(n) = -f'(v_k(n))
v_k(n) = Σ_{j=0..M} w_kj(n) y_j(n)  =>  ∂v_k(n)/∂y_j(n) = w_kj(n)

this gives
∂E(n)/∂y_j(n) = -Σ_k e_k(n) f'(v_k(n)) w_kj(n) = -Σ_k δ_k(n) w_kj(n)

Signal flow: hidden output y_j -> weight w_kj -> local field v_k -> output y_k

Backpropagation algorithm (8/9)


CASE 2: Neuron j is a hidden node (continued)
Finally, combining the ansatz for the hidden layer local gradient

δ_j(n) = -∂E(n)/∂y_j(n) · f'(v_j(n))

with the gradient of the output error energy on the hidden layer output

∂E(n)/∂y_j(n) = -Σ_k δ_k(n) w_kj(n)

gives the final result for the hidden layer local gradient:

δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)


Backpropagation algorithm (9/9)


Backpropagation summary

Weight correction:
Δw_ji(n) = η δ_j(n) y_i(n)
(η = learning rate, δ_j(n) = local gradient, y_i(n) = input of neuron j)

1. Local gradient of an output node:
δ_k(n) = e_k(n) f'(v_k(n))

2. Local gradient of a hidden node:
δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)

Signal flow: input x_i -> w_ji -> v_j -> y_j -> w_kj -> v_k -> y_k


Two passes of computation


1. Forward pass
Input is applied to the network and propagated to the output:
inputs: x_i(n)
hidden layer outputs: y_j = f(Σ_i w_ji x_i)
output layer outputs: y_k = f(Σ_j w_kj y_j)
output error: e_k(n) = d_k(n) - y_k(n)

2. Backward pass
Recursive computation of local gradients:
output local gradients: δ_k(n) = e_k(n) f'(v_k(n))
hidden layer local gradients: δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
Synaptic weights are adjusted according to the local gradients:
Δw_kj(n) = η δ_k(n) y_j(n)
Δw_ji(n) = η δ_j(n) x_i(n)


Summary of backpropagation algorithm


1. Initialization
Pick weights and biases from a uniform distribution with zero mean and a variance that induces local fields between the linear and saturated parts of the logistic function

2. Presentation of training samples
For each sample from the epoch, perform a forward pass and a backward pass

3. Forward pass
Propagate the training sample from the network input to the output
Calculate the error signal

4. Backward pass
Recursive computation of local gradients from the output layer toward the input layer
Adaptation of synaptic weights according to the generalized delta rule

5. Iteration
Iterate steps 2-4 until a stopping criterion is met
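The five-step procedure above can be sketched end-to-end. The following is a minimal illustrative NumPy implementation of sequential (pattern-by-pattern) backpropagation for one hidden layer; the course itself uses MATLAB, and the function names, the OR example, and all parameter values here are my own choices, not from the lecture.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_backprop(X, D, hidden=4, eta=0.5, epochs=1000, seed=0):
    """Steps 1-5: initialize, then repeated forward/backward passes."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: small zero-mean uniform weights (bias folded in)
    W1 = rng.uniform(-0.5, 0.5, (hidden, X.shape[1] + 1))
    W2 = rng.uniform(-0.5, 0.5, (D.shape[1], hidden + 1))
    for _ in range(epochs):                    # 5. iterate over epochs
        for x, d in zip(X, D):                 # 2. present samples one by one
            x1 = np.append(x, 1.0)             # constant bias input
            yj = logistic(W1 @ x1)             # 3. forward pass: hidden layer
            y1 = np.append(yj, 1.0)
            yk = logistic(W2 @ y1)             #    output layer
            e = d - yk                         #    error signal
            # 4. backward pass: local gradients (f'(v) = y(1-y) for logistic)
            dk = e * yk * (1.0 - yk)
            dj = yj * (1.0 - yj) * (W2[:, :-1].T @ dk)
            W2 += eta * np.outer(dk, y1)       # generalized delta rule
            W1 += eta * np.outer(dj, x1)
    return W1, W2

def predict(W1, W2, X):
    Y = []
    for x in X:
        yj = logistic(W1 @ np.append(x, 1.0))
        Y.append(logistic(W2 @ np.append(yj, 1.0)))
    return np.array(Y)

# usage: learn the OR mapping (linearly separable, converges reliably)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [1.]])
W1, W2 = train_backprop(X, D, epochs=2000)
mse = float(np.mean((D - predict(W1, W2, X)) ** 2))
```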


Matlab demo
nnd11bc Backpropagation calculation


Matlab demo
nnd12sd1 Steepest descent


Backpropagation for ADALINE


Using backpropagation learning for ADALINE:
no hidden layers, one output neuron
linear activation function: f(v(n)) = v(n), so f'(v(n)) = 1

Backpropagation rule:
Δw_i(n) = η δ(n) y_i(n), with y_i = x_i(n) and δ(n) = e(n) f'(v(n)) = e(n)
=> Δw_i(n) = η e(n) x_i(n)

Original delta rule:
Δw_i(n) = η e(n) x_i(n)

Backpropagation is a generalization of the delta rule

4.3 Working with backpropagation


Efficient application of backpropagation requires some fine-tuning. Various parameters, functions and methods should be selected:
Training mode (sequential / batch)
Activation function
Learning rate
Momentum
Stopping criterion
Heuristics for efficient backpropagation
Methods for improving generalization


Sequential and batch training


Learning results from many presentations of training examples
Epoch = presentation of the entire training set

Batch training
Weight updating after the presentation of a complete epoch

Sequential training
Weight updating after the presentation of each training example
Stochastic nature of learning, faster convergence
Important practical reasons for sequential learning:
the algorithm is easy to implement
it provides effective solutions to large and difficult problems

Therefore sequential training is the preferred training mode
Good practice is a random order of presentation of training examples


Activation function
The derivative of the activation function, f'(v_j(n)), is required for the computation of local gradients
The only requirement for the activation function: differentiability
Commonly used: the logistic function

f(v_j(n)) = 1 / (1 + exp(-a v_j(n))), a > 0

Derivative of the logistic function:

f'(v_j(n)) = a exp(-a v_j(n)) / [1 + exp(-a v_j(n))]^2

With y_j(n) = f(v_j(n)), the derivative can be expressed as

f'(v_j(n)) = a y_j(n) [1 - y_j(n)]

so the local gradient can be calculated without explicit knowledge of the activation function
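The identity above is easy to verify numerically. The sketch below (illustrative only; the gain value a = 1.5 is an arbitrary choice of mine) compares f'(v) = a·y·(1-y) against a central finite-difference derivative:

```python
import numpy as np

A = 1.5  # logistic gain parameter (arbitrary choice for this check)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-A * v))

def deriv_from_output(y):
    # f'(v) expressed through the neuron output y = f(v) only
    return A * y * (1.0 - y)

v = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (logistic(v + h) - logistic(v - h)) / (2.0 * h)  # central difference
analytic = deriv_from_output(logistic(v))
```

Both vectors agree to high precision, confirming that the backward pass never needs to evaluate the activation function itself, only the stored neuron outputs.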


Other activation functions


Using sin() activation functions:

f(x) = a_0 + Σ_{k>=1} c_k sin(k x + φ_k)

Equivalent to traditional Fourier analysis
A network with sin() activation functions can be trained by backpropagation
Example: approximating a periodic function
(figures: approximation with 8 sigmoid hidden neurons vs. 4 sin hidden neurons)


Learning rate
The learning procedure requires:
the change in the weight space to be proportional to the error gradient
true gradient descent requires infinitesimal steps

Learning in practice:
Δw_ji(n) = η δ_j(n) y_i(n), where the factor of proportionality η is the learning rate
Choose a learning rate as large as possible without leading to oscillations
(figure: learning trajectories for η = 0.010, 0.035, 0.040)


Stopping criteria
Generally, backpropagation cannot be shown to converge
There are no well-defined criteria for stopping its operation

Possible stopping criteria:
1. Gradient vector: the Euclidean norm of the gradient vector becomes sufficiently small
2. Output error: the output error is small enough, or the rate of change of the average squared error per epoch is sufficiently small
3. Generalization performance: generalization performance has peaked or is adequate
4. Maximum number of iterations: we are out of time ...


Heuristics for efficient backpropagation (1/3)


1. Maximizing information content
General rule: every training example presented to the backpropagation algorithm should be chosen so that its information content is the largest possible for the task at hand
Simple technique: randomize the order in which examples are presented from one epoch to the next

2. Activation function
Faster learning with antisymmetric sigmoid activation functions
A popular choice is:

f(v) = a tanh(b v), a = 1.72, b = 0.67

f(1) = 1, f(-1) = -1; effective gain f'(0) ≈ 1; maximum of the second derivative at v = 1


Heuristics for efficient backpropagation (2/3)


3. Target values
Must be within the range of the activation function
An offset is recommended, otherwise learning is driven into saturation
Example: max(target) = 0.9 max(f)

4. Preprocessing inputs
a) Normalizing the mean to zero
b) Decorrelating input variables (by using principal component analysis)
c) Scaling input variables (variances should be approximately equal)
(figures: original -> a) zero mean -> b) decorrelated -> c) equalized variance)
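The three preprocessing steps a)-c) can be chained into a single whitening transform. This is an illustrative NumPy sketch (the function name and the toy data are my own; the slide itself prescribes only the three steps):

```python
import numpy as np

def preprocess_inputs(X):
    """Steps a)-c): zero mean, PCA decorrelation, variance equalization."""
    Xc = X - X.mean(axis=0)                 # a) normalize mean to zero
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    Xd = Xc @ eigvec                        # b) decorrelate (rotate to PCA axes)
    Xs = Xd / np.sqrt(eigval + 1e-12)       # c) scale variances to be equal
    return Xs

# usage: strongly correlated 2-D inputs become zero-mean with unit covariance
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
Xp = preprocess_inputs(X)
```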


Heuristics for efficient backpropagation (3/3)


5. Initialization
The choice of initial weights is important for a successful network design
Large initial values -> saturation
Small initial values -> slow learning due to operation only near the origin (saddle point)
A good choice lies between these extreme values:
the standard deviation of the induced local fields should lie between the linear and saturated parts of the sigmoid function
tanh activation example (a = 1.72, b = 0.67): synaptic weights should be chosen from a uniform distribution with zero mean and standard deviation σ = m^(-1/2), where m is the number of synaptic weights of the neuron

6. Learning from hints
Prior information about the unknown mapping can be included in the learning process:
initialization
possible invariance properties, symmetries, ...
choice of activation functions


Generalization
A neural network is able to generalize when:
the input-output mapping computed by the network is correct for test data
test data were not used during training
test data come from the same population as the training data
the response is correct even if the input is slightly different from the training examples

(figures: overfitting vs. good generalization)


Improving generalization
Methods to improve generalization:
1. Keeping the network small
2. Early stopping
3. Regularization

Early stopping
Available data are divided into three sets:
1. Training set: used to train the network
2. Validation set: used for early stopping, when the validation error starts to increase
3. Test set: used for the final estimate of network performance and for comparison of various models

Early stopping
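The early-stopping rule (stop when the validation error starts to increase) can be sketched as a generic training loop. This is an illustrative sketch; the callback-based interface and the `patience` parameter are my own framing of the idea:

```python
import numpy as np

def train_with_early_stopping(train_epoch, validation_error,
                              max_epochs=1000, patience=20):
    """Run `train_epoch()` repeatedly; stop when `validation_error()` has
    not improved for `patience` epochs (both arguments are assumed to be
    user-supplied callbacks)."""
    best_err, best_epoch = np.inf, -1
    for epoch in range(max_epochs):
        train_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch   # new best validation point
        elif epoch - best_epoch >= patience:
            break                               # validation error keeps rising
    return best_err, best_epoch

# usage: a V-shaped validation-error curve stops shortly after its minimum
errors = iter([5.0, 4.0, 3.0, 2.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
best, when = train_with_early_stopping(lambda: None, lambda: next(errors),
                                       patience=3)
```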


Regularization
Improving generalization by regularization
Modifying the performance function

mse = (1/N) Σ_{n=1..N} (d_j(n) - y_j(n))^2

with the mean sum of squares of the network weights and biases

msw = (1/M) Σ_{m=1..M} w_m^2

thus obtaining a new performance function

msereg = γ·mse + (1-γ)·msw

Using this performance function, the network will have smaller weights and biases, which forces the network response to be smoother and less likely to overfit
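The regularized performance function is a one-liner to compute. A minimal sketch (function name and example numbers are my own):

```python
import numpy as np

def msereg(errors, weights, gamma=0.9):
    """Regularized performance: msereg = gamma*mse + (1-gamma)*msw."""
    mse = np.mean(np.square(errors))    # mean squared network error
    msw = np.mean(np.square(weights))   # mean squared weights and biases
    return gamma * mse + (1.0 - gamma) * msw

e = np.array([0.1, -0.2, 0.05])        # example output errors
w = np.array([0.5, -1.0, 0.25, 0.8])   # example weights/biases
perf = msereg(e, w)
```

Lowering γ penalizes large weights more strongly, trading training accuracy for a smoother network response.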


Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally useful:

1. Long training process
Possibly due to a non-optimal learning rate (advanced algorithms address this problem)

2. Network paralysis
The combination of sigmoidal activation and very large weights can decrease gradients almost to zero, so training almost stops

3. Local minima
The error surface of a complex network can be very complex, with many hills and valleys
Gradient methods can get trapped in local minima
Solutions: probabilistic learning methods (simulated annealing, ...)


4.4 Advanced algorithms


Basic backpropagation is slow
It adjusts the weights in the steepest descent direction (negative of the gradient), in which the performance function decreases most rapidly
It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence

1. Advanced algorithms based on heuristics
Developed from an analysis of the performance of the standard steepest descent algorithm:
Momentum technique
Variable learning rate backpropagation
Resilient backpropagation

2. Numerical optimization techniques
Application of standard numerical optimization techniques to network training:
Quasi-Newton algorithms
Conjugate gradient algorithms
Levenberg-Marquardt


Momentum
A simple method of increasing the learning rate yet avoiding the danger of instability
Modified delta rule with an added momentum term:

Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n-1)

Momentum constant: 0 <= α < 1
Accelerates backpropagation in steady downhill directions

(figures: large learning rate (oscillations), small learning rate)

Learning with momentum
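The momentum-modified delta rule can be sketched as a two-line update. This is an illustrative sketch, demonstrated on a toy quadratic cost (the function name and parameter values are my own):

```python
import numpy as np

def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Modified delta rule: dw(n) = -eta*grad + alpha*dw(n-1)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

# usage: minimizing f(w) = w^2 (gradient 2w); momentum accumulates speed
# in the steady downhill direction, then damps out near the minimum
w, dw = 5.0, 0.0
for _ in range(200):
    w, dw = momentum_update(w, 2.0 * w, dw)
```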


Variable learning rate η(t)

Another method of manipulating the learning rate and momentum to accelerate backpropagation:

1. If the error decreases after a weight update:
the weight update is accepted
the learning rate is increased: η(t+1) = ρ·η(t), ρ > 1
if the momentum has previously been reset to 0, it is restored to its original value

2. If the error increases by less than ξ after a weight update:
the weight update is accepted
the learning rate is not changed: η(t+1) = η(t)
if the momentum has previously been reset to 0, it is restored to its original value

3. If the error increases by more than ξ after a weight update:
the weight update is discarded
the learning rate is decreased: η(t+1) = σ·η(t), 0 < σ < 1
the momentum is reset to 0

Possible parameter values: ξ = 4%, σ = 0.7, ρ = 1.05


Resilient backpropagation
The slope of a sigmoid function approaches zero as the input gets large
This causes a problem when using steepest descent to train a network
The gradient can have a very small magnitude, so changes in the weights are small, even though the weights are far from their optimal values

Resilient backpropagation
Eliminates these harmful effects of the magnitudes of the partial derivatives
Only the sign of the derivative is used to determine the direction of the weight update; the size of the weight change is determined by a separate update value
Resilient backpropagation rules:
1. The update value for each weight and bias is increased by a factor δ_inc if the derivative of the performance function with respect to that weight has the same sign for two successive iterations
2. The update value is decreased by a factor δ_dec if the derivative with respect to that weight changes sign from the previous iteration
3. If the derivative is zero, the update value remains the same
4. If the weights are oscillating, the weight change is reduced
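Rules 1-3 can be sketched as a per-weight step-size adaptation. This is a simplified illustrative sketch of the sign-based update (function name and the factor values 1.2/0.5 are my own; real implementations add further refinements such as gradient zeroing after a sign change):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One simplified Rprop iteration: per-weight step sizes grow when the
    gradient keeps its sign (rule 1) and shrink when it flips (rule 2);
    only the SIGN of the gradient sets the update direction."""
    same = grad * prev_grad
    step = np.where(same > 0, np.minimum(step * inc, step_max), step)
    step = np.where(same < 0, np.maximum(step * dec, step_min), step)
    return w - np.sign(grad) * step, step

# usage: minimize f(w) = w1^2 + w2^2 from a distant start; the step sizes
# grow on the long descent and shrink while oscillating around the minimum
w = np.array([4.0, -3.0])
step = np.full(2, 0.1)
prev = np.zeros(2)
for _ in range(100):
    g = 2.0 * w
    w, step = rprop_step(w, g, prev, step)
    prev = g
```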


Numerical optimization (1/3)


Supervised learning as an optimization problem
The error surface of a multilayer perceptron, expressed by the instantaneous error energy E(n) = E(w(n)), is a highly nonlinear function of the synaptic weight vector w(n)

(figure: error surface E(w1, w2) over the weight plane (w1, w2))


Numerical optimization (2/3)


Expanding the error energy in a Taylor series around w(n):

E(w(n) + Δw(n)) = E(w(n)) + g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n) + ...

Local gradient: g(n) = ∂E(w)/∂w evaluated at w = w(n)

Hessian matrix: H(n) = ∂²E(w)/∂w² evaluated at w = w(n)


Numerical optimization (3/3)


Steepest descent method (backpropagation):
weight adjustment proportional to the gradient: Δw(n) = -η g(n)
simple implementation, but slow convergence

Significant improvement by using higher-order information:
adding a momentum term is a crude approximation to using second-order information about the error surface
a quadratic approximation of the error surface is the essence of Newton's method:

Δw(n) = -H⁻¹(n) g(n)

where H⁻¹ is the inverse of the Hessian matrix
(figure: gradient descent vs. Newton's method trajectories)


Quasi-Newton algorithms
Problems with the calculation of the Hessian matrix:
the inverse Hessian H⁻¹ is required, which is computationally expensive
the Hessian has to be nonsingular, which is not guaranteed
the Hessian of a neural network can be rank-deficient
there is no convergence guarantee for non-quadratic error surfaces

Quasi-Newton method
Only requires calculation of the gradient vector g(n)
The method estimates the inverse Hessian directly, without matrix inversion
Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm
Broyden-Fletcher-Goldfarb-Shanno algorithm ... the best form of the Quasi-Newton algorithm!

Application to neural networks
The method is fast for small neural networks


Conjugate gradient algorithms


Conjugate gradient algorithms
Second-order methods that avoid the computational problems with the inverse Hessian
The search is performed along conjugate directions, which generally produces faster convergence than steepest descent directions
1. In most conjugate gradient algorithms, the step size is adjusted at each iteration
2. A search is made along the conjugate gradient direction to determine the step size that minimizes the performance function along that line

Many variants of conjugate gradient algorithms:
Fletcher-Reeves update
Polak-Ribière update
Powell-Beale restarts
Scaled conjugate gradient
(figure: gradient descent vs. conjugate gradient trajectories)

Application to neural networks
Perhaps the only method suitable for large-scale problems (hundreds or thousands of adjustable parameters); well suited for multilayer perceptrons


Levenberg-Marquardt algorithm
Levenberg-Marquardt algorithm (LM)
Like the quasi-Newton methods, the LM algorithm was designed to approach second-order training speed without having to compute the Hessian matrix
When the performance function has the form of a sum of squares (typical in neural network training), the Hessian matrix H can be approximated as

H ≈ J^T J

where the Jacobian matrix J contains the first derivatives of the network errors with respect to the weights
The Jacobian can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix

Application to neural networks
The algorithm appears to be the fastest method for training moderate-sized feedforward neural networks (up to several hundred weights)
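The core LM step (Gauss-Newton Hessian J^T J with damping) can be sketched in a few lines. This is an illustrative simplification of mine, demonstrated on a linear least-squares problem with a fixed damping value mu; the full algorithm also adapts mu between iterations:

```python
import numpy as np

def lm_step(w, residuals, jacobian, mu=0.01):
    """One Levenberg-Marquardt step using the Gauss-Newton Hessian
    approximation H ≈ J^T J, damped by mu*I."""
    e = residuals(w)
    J = jacobian(w)
    H = J.T @ J + mu * np.eye(len(w))   # approximated, damped Hessian
    g = J.T @ e                         # gradient of (1/2)*||e||^2
    return w - np.linalg.solve(H, g)

# usage: fit a linear least-squares problem (residuals e = A w - b)
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w_true = np.array([0.5, -1.0])
b = A @ w_true
w = np.zeros(2)
for _ in range(20):
    w = lm_step(w, lambda v: A @ v - b, lambda v: A)
```

For a neural network, `residuals` would return the per-sample output errors and `jacobian` their derivatives with respect to all weights.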


Advanced algorithms summary


Practical hints (MATLAB related)
The variable learning rate algorithm is usually much slower than the other methods
The resilient backpropagation method is very well suited to pattern recognition problems
Function approximation problems, networks with up to a few hundred weights: the Levenberg-Marquardt algorithm has the fastest convergence and very accurate training
Conjugate gradient algorithms perform well over a wide variety of problems, particularly for networks with a large number of weights (modest memory requirements)


Training algorithms in MATLAB


4.5 Performance of multilayer perceptrons


The approximation error is influenced by:

Learning algorithm used (discussed in the previous section)
This determines how well the error on the training set is minimized

Number and distribution of learning samples
This determines how well the training samples represent the actual function

Number of hidden units
This determines the expressive power of the network. For smooth functions only a few hidden units are needed; wildly fluctuating functions require more hidden units


Number of learning samples


Function approximation example y = f(x)
(figures: fits with 4 learning samples and with 20 learning samples)

The learning set with 4 samples has a small training error but gives very poor generalization
The learning set with 20 samples has a higher training error but generalizes well
A low training error is no guarantee of good network performance!


Number of hidden units


Function approximation example y = f(x)
(figures: fits with 5 hidden units and with 20 hidden units)

A large number of hidden units leads to a small training error, but not necessarily to a small test error
Adding hidden units always reduces the training error
However, adding hidden units first reduces the test error and then increases it (peaking effect; early stopping can be applied)


Size effect summary

(figures: error rate vs. number of training samples, and error rate vs. number of hidden units; the test-set error reaches a minimum at the optimal number of hidden neurons, while the training-set error keeps decreasing)


Matlab demo
nnd11fa Function approximation, variable number of hidden units


Matlab demo
nnd11gn Generalization, variable number of hidden units


5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 Layer recurrent network
5.5 NARX network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control


Introduction
Time
An essential ingredient of the learning process
Important for many practical tasks: speech, vision, signal processing, control

Many applications require temporal processing:
Time series prediction
Noise cancellation
Adaptive control
System identification
...
Linear systems: well-developed theories
Nonlinear systems: neural networks have the potential to solve such problems


Introduction
How can we build time into the operation of neural networks?
Extending static neural networks into dynamic neural networks: the networks become responsive to the temporal structure of input signals
Networks become dynamic by adding

TEMPORAL MEMORY and/or FEEDBACK

(figure: feedback loop)


Static / dynamic networks


Neural network categories

1. Static networks -> structural pattern recognition
Feedforward networks
No feedback elements, no delays
Output is calculated directly from the input through feedforward connections

2. Dynamic networks -> temporal pattern recognition
Output depends on:
the current input to the network
previous inputs
previous network outputs
previous network states
A need for short-term memory and feedback

Dynamic networks can be divided into two categories:
1. Networks that have only feedforward connections
2. Networks with feedback or recurrent connections


Memory
Long-term memory
Acquired through supervised learning and stored in the synaptic weights

Short-term memory
Temporal memory, useful for capturing the temporal dimension
Implemented as time delays at various parts of the network


Tapped delay line


The simplest form of short-term memory
Already mentioned with linear adaptive filters
Most commonly used for dynamic networks
A tapped delay line (TDL) consists of N unit delay operators
The output of the TDL is an (N+1)-dimensional vector composed of the current and past inputs:

TDL(x(n)) = [x(n), x(n-1), ..., x(n-N)]
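Building TDL vectors from a sampled signal amounts to stacking delayed copies of the sequence. An illustrative NumPy sketch (the function name is my own):

```python
import numpy as np

def tapped_delay_line(x, N):
    """Return the TDL output [x(n), x(n-1), ..., x(n-N)] for every n >= N."""
    return np.stack([x[N - d : len(x) - d] for d in range(N + 1)], axis=1)

# usage: signal x(0..5) with N = 2 delays
x = np.arange(6.0)
X = tapped_delay_line(x, 2)   # rows: [x(n), x(n-1), x(n-2)] for n = 2..5
```

Each row of `X` is one (N+1)-dimensional memory vector that can be fed to a static network.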


5.1 Historical dynamic networks


(figures: Hopfield (1982), Jordan (1986), and Elman (1990) network diagrams)


Hopfield network
Hopfield network (Hopfield, 1982)
The network consists of N interconnected neurons which update their activation values asynchronously and independently of the other neurons
All neurons are both input and output neurons
Activation values are binary (-1, +1)
Multiple-loop feedback system: interesting for studying the stability of the system
Primary applications:
associative memory
solving optimization problems
MATLAB example: demohop1.m


Jordan network
Jordan network (Jordan, 1986)
Network outputs are fed back as extra inputs (state units)
Each state unit is fed by one network output
The connections from the outputs to the state units are fixed (+1)
Learning takes place only in the connections from input to hidden units and from hidden to output units
The standard backpropagation learning rule can be applied to train the network


Elman network
Elman network (Elman, 1990)
Similar to the Jordan network, with the following differences:
1. Hidden units are fed back (instead of output units)
2. Context units have no self-connections

(figure: network with context units)


5.2 Focused time-delay neural network


The most straightforward dynamic network: a feedforward network + a tapped delay line at the input
Temporal dynamics only at the input layer of a static network
Nonlinear extension of linear adaptive filters
Backpropagation training can be used
The structure is suitable for time-series prediction


TDL & prediction horizon


Input delays = [0 6 12], prediction horizon = 1
Inputs {x(t), x(t-6), x(t-12)} -> output x(t+1)
(figure: the tapped delays span the known world; the prediction horizon reaches into the unknown world)


Online prediction application

(figure: online prediction scheme; past observations up to now are used to predict future values)


MATLAB example (1/3)


Application of the focused time-delay neural network for prediction of the chaotic Mackey-Glass time series:

dy(t)/dt = -b y(t) + c y(t-τ) / (1 + y^10(t-τ)), b = 0.1, c = 0.2, τ = 17

Objective
Design a focused time-delay neural network as a recursive one-step-ahead predictor

Fixed network parameters:
Number of hidden layers: 1
Hidden layer activation function: logistic
Output layer activation function: linear

Variable network parameters:
Input delays = ?
Hidden layer neurons = ?


MATLAB example (2/3)


Samples
500 training samples 500 validation samples, recursive prediction

Results

(A)


MATLAB example (3/3)

(B)

(C)


5.3 Distributed time-delay neural network


Tapped delay lines distributed throughout the network
Distributed temporal dynamics: the ability to handle non-stationary environments
Backpropagation training cannot be used any more: temporal backpropagation is needed
Possible applications: phoneme recognition, recognition of various frequency contents in signals


Temporal backpropagation
Backpropagation algorithm
Suitable for static networks and focused time-delay neural networks

Temporal backpropagation
Supervised learning algorithm
Extension of backpropagation
Required for distributed time-delay neural networks
Computationally demanding

Which form of backpropagation to use? It depends on the nature of the temporal processing task:
1. STATIONARY ENVIRONMENT: standard backpropagation + focused time-delay neural networks
2. NON-STATIONARY ENVIRONMENT: temporal backpropagation + distributed time-delay neural networks


Example (1/2)
Wan (1994): Time series prediction by using a connectionist network with internal delay lines
Winner of the Santa Fe Institute Time-Series Competition, USA (1992)
Task: nonlinear prediction of a nonstationary time series exhibiting chaotic pulsations of an NH3 laser


Example (2/2)
Prediction results


5.4 Layer recurrent network


Layer recurrent network = recurrent multilayer perceptron
One or more hidden layers
Each computation layer has a feedback link


Layer recurrent network structure


Feedback loop with a single delay for the hidden layer
Can be trained by backpropagation
(figure: structure, after Elman (1990))


Example (1/3)
Phoneme detection problem
Recognition of various frequency components

Layer recurrent network:
1 hidden layer
8 neurons
5 delays


Example (2/3)
Network training
Successful recognition of two phonemes


Example (3/3)
Network testing
Unreliable generalization: works only on the trained phonemes
(figure: test signals, with correct detections marked OK only for the trained phonemes)


5.5 NARX network


Networks discussed so far
Focused or distributed time delays Feedback only localized to specific network layers

NARX = Nonlinear AutoRegressive Network with EXogenous Inputs


Recurrent network with global feedback
Feedback over several layers of the network
Based on the linear ARX model
Defining equation for the NARX model:

  y(n+1) = f( y(n), ..., y(n−q+1), u(n), ..., u(n−q+1) )

Output y is a nonlinear function of past outputs and past inputs
Nonlinear function f can be implemented by a neural network
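The regressor structure of the defining equation can be sketched in a few lines; the function `toy_f` below is a hypothetical stand-in for the trained network (a tanh of a weighted sum), used only to make the input layout concrete.

```python
import math

def narx_predict(y_hist, u_hist, f):
    # One-step NARX prediction:
    # y(n+1) = f( y(n), ..., y(n-q+1), u(n), ..., u(n-q+1) )
    return f(y_hist + u_hist)

# Toy stand-in for the neural network f (in practice f is a trained MLP).
def toy_f(regressor):
    return math.tanh(sum(0.1 * v for v in regressor))

# q = 2 past outputs and q = 2 past inputs
y_next = narx_predict([0.5, 0.4], [1.0, 0.9], toy_f)
```

Any trained feedforward network with this regressor as input plays the role of f.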


NARX structure
NARX network with global feedback

Possible application areas


Nonlinear prediction and modelling
Adaptive equalization of communication channels
Speech processing
Automobile engine diagnostics


NARX training considerations


NARX output is an estimate of the output of some nonlinear dynamic system
Output is fed back to the input of the feedforward neural network → parallel architecture
True output is available during training → possible to create a series-parallel architecture
True output is used instead of feeding back the estimated output

Advantages of series-parallel architecture for training


1. Training input to the feedforward network is more accurate → improved training accuracy
2. Resulting network is purely feedforward → static backpropagation can be used
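The two modes can be contrasted in a short sketch; the lambda `f` is an arbitrary toy model standing in for the trained network, and the data are made up for illustration.

```python
def parallel_run(u, y0, f):
    # Parallel (recurrent) mode: feed back the model's own predictions.
    y = [y0]
    for n in range(len(u) - 1):
        y.append(f(y[-1], u[n]))
    return y

def series_parallel_outputs(u, y_true, f):
    # Series-parallel mode: the TRUE outputs are fed in as regressors,
    # so every prediction is a purely static (feedforward) computation.
    return [f(y_true[n], u[n]) for n in range(len(u) - 1)]

f = lambda y, u: 0.8 * y + 0.2 * u      # toy "network"
u = [1.0, 1.0, 1.0, 1.0]
y_true = [0.0, 0.2, 0.36, 0.488]        # true plant response

y_recursive = parallel_run(u, y_true[0], f)
y_teacher = series_parallel_outputs(u, y_true, f)
```

In series-parallel mode each output depends only on measured data, which is why static backpropagation suffices during training.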


Example (1/5)
Problem: Magnetic levitation Objective
to control the position of a magnet suspended above an electromagnet, where the magnet can only move in the vertical direction

Equation of motion:

  d²y(t)/dt² = −g + (α/M) · i(t)²/y(t) − (β/M) · dy(t)/dt

y(t) = distance of the magnet above the electromagnet
i(t) = current flowing in the electromagnet
M = mass of the magnet
g = gravitational constant
β = viscous friction coefficient (determined by the material in which the magnet moves)
α = field strength constant, determined by the number of turns of wire on the electromagnet and the strength of the magnet


Example (2/5)
Data
Sampling interval: 0.01 sec Input: current i(t) Output: magnet position y(t)

NARX network structure


3 hidden neurons
5 input delays
5 global feedback delays


Example (3/5)
Series-parallel training results for NARX network


Example (4/5)
Parallel recursive prediction (1000 steps)


Example (5/5)
Possible learning results: unstable learning, local minima
Case A: OK
Case B: Unstable
Case C: Local minimum


5.6 Computational power of dynamic networks


Fully and partially connected recurrent networks


Theorems
Theorem I (Siegelmann & Sontag, 1991)
All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions. (A Turing machine is a theoretical abstraction that is functionally as powerful as any computer, see http://aturingmachine.com )

Theorem II (Siegelmann et al., 1997)


NARX networks with one layer of hidden neurons with BOSS* activation functions and a linear output neuron can simulate fully connected recurrent networks with BOSS* activation functions, except for a linear slowdown

Corollary to Theorem I and II


NARX networks with one hidden layer of neurons with BOSS* activation functions and a linear output neuron are Turing equivalent.

* BOSS = bounded, one-sided saturated function



5.7 Learning algorithms


Two modes of training for recurrent networks
1. Epochwise training
For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to the initial state for the next epoch
METHOD: Backpropagation through time

2. Continuous training
Suitable if no reset states are available or online learning is required
Network learns while it is performing signal processing
The learning process never stops
METHOD: Real-time recurrent learning


Backpropagation through time


Extension of the standard backpropagation algorithm
Derived by unfolding the temporal operation of the network into a layered feedforward network
The topology grows by one layer at every time step
EXAMPLE: unfolding the temporal operation of a 2-neuron recurrent network


Backpropagation through time example (1/2)


Nguyen (1989): The truck backer-upper


... example (2/2)


Training

Generalization


5.8 System Identification


System identification = experimental approach to modeling the process with unknown parameters
STEP 1: Experimental planning
STEP 2: Selection of a model structure
STEP 3: Parameter estimation
STEP 4: Model validation

Unknown nonlinear dynamical process → dynamic neural networks can be used as the identification model
Two basic identification approaches:
1. System identification using a state-space model
2. System identification using an input-output model


System identification using state-space model


State
State plays a vital role in the mathematical formulation of a dynamical system
State of a dynamical system is defined as a set of quantities that summarizes all the information about the past behavior of the system that is needed to uniquely describe its future behavior, except for the purely external effects arising from the applied input.

Plant description by a state-space model


State:  x(n+1) = f( x(n), u(n) )
Output: y(n) = h( x(n) )

f, h: unknown nonlinear vector functions
Two dynamic neural networks can be used to approximate f and h


State-space solution to the identification problem


Neural network (I)
Identification of the plant state
State must be physically accessible!

Neural network (II)


Identification of plant output Actual state x(n) is used as input rather than the predicted output

Both networks are trained by gradient descent minimizing error signals eI and eII


System identification using input-output model


If the system state is not accessible → identification by an input-output model
Plant description by an input-output model:

  ŷ(n+1) = f( y(n), ..., y(n−q+1), u(n), ..., u(n−q+1) )

f is an unknown nonlinear vector function
Input-output formulation is equivalent to the NARX formulation → a NARX neural network can be used to approximate f
q past inputs and outputs should be available


Input-output solution to the identification problem


NARX neural network can be used as a dynamic identification model
Series-parallel learning: the system output is used as feedback, not the predicted output

Parallel architecture for application


5.9 Model-reference adaptive control


Dynamic networks are important for feedback control systems
MULTIPLE PROBLEMS:
Nonlinear coupling of plant state with control signals
Presence of unmeasured or random disturbances
Possibility of a nonunique plant inverse
Presence of unobservable plant states

MRAC = Model reference adaptive control


Well suited for the use of neural networks
Possible control methods:
Direct MRAC
Indirect MRAC


MRAC using direct control


Unknown plant dynamics → adaptive learning
Controller + plant = closed-loop feedback system
Controller and plant build an externally recurrent network
How to get plant gradients? → indirect control

Controller:       u_c(n) = f( x_c(n), y_p(n), r(n), w )
Reference model:  d(n+1) = g( x_r(n), r(n) )


MRAC using indirect control


Two step procedure to train the controller
1. Identification of the plant (identification model)
2. Using the plant model to obtain dynamic derivatives to train the controller
Controller and plant model build an externally recurrent network


Summary
Focused time-delay neural network Distributed time-delay neural network

Layer recurrent network

NARX network


6. Radial Basis Function Networks


6.1 RBFN structure
6.2 Exact interpolation
6.3 Radial basis functions
6.4 Radial basis function networks
6.5 RBFN training
6.6 RBFN for classification
6.7 Comparison with multilayer perceptron
6.8 Probabilistic networks
6.9 Generalized regression networks


Introduction
RBFN = Radial Basis Function Network New class of neural networks
Multilayer perceptrons: output is a nonlinear function of the scalar product of the input vector and the weight vector
RBFN: activation of a hidden unit is determined by the distance between the input vector and a prototype vector

RBFN theory forms a link between


Function approximation
Regularization
Noisy interpolation
Density estimation
Optimal classification theory


6.1 RBFN structure


Feedforward network with two computation layers
1. Hidden layer implements a set of radial basis functions (e.g. Gaussian functions) 2. Output layer implements linear summation functions (as in MLP)


RBFN properties
Two-stage training procedures
1. Training of hidden layer weights 2. Training of output layer weights

Training/learning is very fast

RBFN provides excellent interpolation


6.2 Exact interpolation (1/3)


Exact interpolation task = mapping of every input vector exactly onto the corresponding output vector in the multi-dimensional space
The goal is to find a function that will map input vectors x into target vectors t
Radial basis function approach (Powell, 1987) introduces a set of N basis functions, one for each data point x^p, in the form φ(‖x − x^p‖)

Basis functions are nonlinear, and depend on the distance measure between input x and stored prototype xp
  ‖x − x^p‖ = √( (x₁ − x₁^p)² + ⋯ + (x_M − x_M^p)² )


Exact interpolation (2/3)


Output is a linear combination of basis functions:

  h(x) = Σ_{p=1..N} w_p φ(‖x − x^p‖)

Goal is to find the weights wp such that the function goes through all data points

We introduce the matrix formulation Φw = t, where Φ_qp = φ(‖x^q − x^p‖)

Provided that the inverse of Φ exists, the weights w = Φ⁻¹t are obtained by any standard matrix inversion technique
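The whole exact-interpolation procedure can be sketched in plain Python for the 1-D case; the Gaussian elimination routine is only a stand-in for a library matrix-inversion call, and the sample data are invented for illustration.

```python
import math

def gaussian(r, sigma=1.0):
    # Gaussian radial basis function of the distance r
    return math.exp(-r * r / (2.0 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (stand-in for a
    # library matrix-inversion routine).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def exact_interpolation(x, t, sigma=1.0):
    # Build Phi[q][p] = phi(|x_q - x_p|) and solve Phi w = t
    Phi = [[gaussian(abs(xq - xp), sigma) for xp in x] for xq in x]
    w = solve(Phi, t)
    def h(xnew):
        return sum(wp * gaussian(abs(xnew - xp), sigma) for wp, xp in zip(w, x))
    return h

h = exact_interpolation([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], sigma=0.7)
```

By construction h passes exactly through every training point, one basis function per data point.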


Exact interpolation (3/3)


For a large class of functions φ, the matrix Φ is indeed non-singular, provided that the data points are distinct

Solution represents a continuous differentiable surface that passes exactly through each data point
Both theoretical and empirical studies confirm (in the context of exact interpolation) that many properties of the interpolating function are relatively insensitive to the precise form of the basis functions

Various forms of basis functions can be used

  φ = φ(r),  where r = ‖x − x^p‖

6.3 Radial basis functions (1/2)


1. Gaussian:  φ(r) = exp( −r² / 2σ² )

2. Multi-Quadratic:  φ(r) = (r² + σ²)^(1/2)

3. Generalized Multi-Quadratic:  φ(r) = (r² + σ²)^β,  0 < β < 1

4. Inverse Multi-Quadratic:  φ(r) = (r² + σ²)^(−1/2)


Radial basis functions (2/2)


5. Generalized Inverse Multi-Quadratic:  φ(r) = (r² + σ²)^(−α),  α > 0

6. Thin Plate Spline:  φ(r) = r² ln(r)

7. Cubic:  φ(r) = r³

8. Linear:  φ(r) = r
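The eight basis function types can be written out directly; the formulas below are the standard textbook forms (with σ, α, β as shape parameters), stated here only for illustration.

```python
import math

# Radial basis functions of r = ||x - x^p||; s is the width sigma.
def gaussian(r, s=1.0):
    return math.exp(-r * r / (2 * s * s))

def multiquadric(r, s=1.0):
    return math.sqrt(r * r + s * s)

def gen_multiquadric(r, s=1.0, beta=0.5):
    return (r * r + s * s) ** beta          # 0 < beta < 1

def inv_multiquadric(r, s=1.0):
    return 1.0 / math.sqrt(r * r + s * s)

def gen_inv_multiquadric(r, s=1.0, alpha=1.0):
    return (r * r + s * s) ** -alpha        # alpha > 0

def thin_plate_spline(r):
    return r * r * math.log(r) if r > 0 else 0.0

def cubic(r):
    return r ** 3

def linear(r):
    return r
```

The localised functions (Gaussian, inverse multi-quadrics) decay to zero for large r; the others grow without bound.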


[Figure: shapes of the basis functions plotted against r = ‖x − x^p‖]

Properties of radial basis functions


Gaussian and Inverse Multi-Quadric basis functions are localised

Localised property is not strictly necessary: all the other functions (Multi-Quadratic, Cubic, Linear, ...) are not localised

Note that even the Linear function is still nonlinear in the components of x
In one dimension, this leads to a piecewise-linear interpolating function which performs the simplest form of exact interpolation
For neural network mappings, there are good reasons for preferring localised basis functions → we will focus on Gaussian basis functions


Exact interpolation example (1/2)


Interpolation problem
We would like to find a function which fits all data points

Solution approach
Superposition of Gaussian radial basis functions


Exact interpolation example (2/2)


[Figure: interpolating functions for widths σ = 1, σ = 0.02, σ = 20]


6.4 Radial basis function networks


Exact interpolation model using RB functions can already be described as a radial basis function network

N training inputs directly determine the hidden layer prototypes (centers of hidden layer neurons)
Training inputs and outputs also directly determine the output weights


Problems with exact interpolation


1. Exact interpolation of noisy data is a highly oscillatory function → such interpolating functions are generally undesirable

2. Number of basis functions is equal to the number of data patterns → exact RBF networks are not computationally efficient


RBF neural network model


Introduced by Moody & Darken (1989) by several modifications of the exact interpolation procedure
Number M of basis functions (hidden units) need not equal the number N of training data points. In general it is better to have M much less than N.
Centers of basis functions do not need to be defined as the training data input vectors. They can instead be determined by a training algorithm.
Basis functions need not all have the same width parameter σ. These can also be determined by a training algorithm.
We can introduce bias parameters into the linear sum of activations at the output layer. These will compensate for the difference between the average value over the data set of the basis function activations and the corresponding average value of the targets.


Improved RBFN
Including the proposed changes + expanding to the multidimensional output:

  y_k(x) = Σ_{j=1..M} w_kj φ_j(x) + w_k0

which can be simplified by introducing an extra basis function φ₀ = 1:

  y_k(x) = Σ_{j=0..M} w_kj φ_j(x)

For the case of a Gaussian RBF:

  φ_j(x) = exp( −‖x − μ_j‖² / 2σ_j² ),  with centers μ₁ … μ_M and widths σ₁ … σ_M


RBFN in Matlab notation


[Figure: RBF neuron (center, width) and RBF network (centers, widths, output weights, biases) in Matlab notation]


Computational power of RBFN


Hartman et al. (1990)
Formal proof of universal approximation property for networks with Gaussian basis functions in which the widths are treated as adjustable parameters

Girosi & Poggio (1990)


Showed that RBF networks possess the best approximation property, which states: in the set of approximating functions there is one function which has minimum approximating error for any given function to be approximated. This property is not shared by multilayer perceptrons!

As with the corresponding proofs for MLPs, RBFN proofs rely on the availability of an arbitrarily large number of hidden units (i.e. basis functions) However, proofs provide a theoretical foundation on which practical applications can be based with confidence


6.5 RBFN training


Key aspect of RBFN: different roles of first and second computational layer Training process can be divided into two stages
1. Hidden layer training 2. Output layer training

Hidden layer can be trained by unsupervised methods (random selection, clustering, ...)
Output layer has linear activation → output weights are determined analytically by solving a set of linear equations
Gradient descent learning is not needed for RBFN, therefore training is very fast!


Hidden layer training


One major advantage of RBF networks is the possibility of choosing suitable hidden unit (basis function) parameters without having to perform a full non-linear optimization of the whole network Methods for unsupervised selection of basis function centers
Fixed centres selected at random
Orthogonal least squares
K-means clustering

Problems with unsupervised methods


Selection of the number of centers M
Selection of center widths σ

It is also possible to perform a full supervised non-linear optimization of the network instead

Fixed centres selected at random


Simplest and quickest approach to setting RBFN parameters
Centers fixed at M points selected randomly from the N data points Widths fixed to be equal at an appropriate size for the distribution of data points

Specifically, we can use RBFs centred at {μ_j}, defined by

  φ_j(x) = exp( −‖x − μ_j‖² / 2σ² )

Widths σ_j are all related in the same way to the maximum or average distance between the chosen centres μ_j
A common choice is

  σ = d_max / √(2M)

where d_max is the maximum distance between the chosen centres, which ensures that the individual RBFs are neither too wide, nor too narrow, for the given training data
For large training sets, this approach gives reasonable results
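A minimal 1-D sketch of this procedure, assuming the common width heuristic σ = d_max/√(2M) stated above; the data values are invented for illustration.

```python
import math
import random

def random_centres(data, M, seed=0):
    # Pick M of the N data points at random as centres, then set one
    # common width sigma = d_max / sqrt(2M), where d_max is the maximum
    # distance between the chosen centres.
    rng = random.Random(seed)
    centres = rng.sample(data, M)
    d_max = max(abs(a - b) for a in centres for b in centres)
    sigma = d_max / math.sqrt(2.0 * M)
    return centres, sigma

data = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
centres, sigma = random_centres(data, M=4)
```

No error signal is used at all here, which is what makes this the quickest (if crudest) way to fix the hidden layer.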

Orthogonal least squares


A more principled approach to selecting a sub-set of data points as the basis function centres is based on the technique of orthogonal least squares
1. Sequential addition of new basis functions, each centred on one of the data points
2. At each stage, we try out each potential Lth basis function by using the N − L other data points to determine the network's output weights
3. The potential Lth basis function which gives the smallest output error is used, and we move on to choose which (L+1)th basis function to add

To get good generalization we generally use cross-validation to stop the process when an appropriate number of data points have been selected as centers


K-means clustering
A potentially even better approach is to use clustering techniques to find a set of centres which more accurately reflects the distribution of the data points K-Means Clustering Algorithm
Select the number of centres (K) in advance
Apply a simple re-estimation procedure to partition the data points {x^p} into K disjoint subsets S_j containing N_j data points to minimize the sum-squared clustering function

  J = Σ_{j=1..K} Σ_{p ∈ S_j} ‖x^p − μ_j‖²

where μ_j is the mean/centroid of the data points in set S_j, given by

  μ_j = (1/N_j) Σ_{p ∈ S_j} x^p

Once the basis centres have been determined in this way, the widths can then be set according to the variances of the points in the corresponding cluster
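The re-estimation procedure alternates two steps, assignment and centroid update; a minimal 1-D sketch (naive initialisation, made-up data):

```python
def kmeans(points, K, iters=20):
    # Simple 1-D K-means: assign each point to its nearest centroid,
    # then move each centroid to the mean of its cluster.
    centroids = points[:K]  # naive initialisation: first K points
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in points:
            j = min(range(K), key=lambda j: (x - centroids[j]) ** 2)
            clusters[j].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

centroids = kmeans([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], K=2)
```

Each iteration can only decrease the clustering function J, so the procedure converges to a (local) minimum.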

K-means clustering example


Output layer training


After training the hidden layer neurons (selection of centers and widths), output layer training essentially means optimization of a single layer linear network

As with MLPs, a sum-squared output error can be defined:

  E = ½ Σ_p Σ_k ( y_k(x^p) − t_k^p )²

At the minimum of E, the gradients with respect to the weights w_ki are zero:

  ∂E/∂w_ki = 0


Computing the output weights


Equations for the weights are most conveniently written in matrix form by defining the matrices Φ (basis function activations) and T (targets), which gives

  ΦᵀΦ Wᵀ = Φᵀ T

and the formal solution for the weights is

  Wᵀ = Φ† T,  where Φ† = (ΦᵀΦ)⁻¹ Φᵀ

Here Φ† is the standard pseudo-inverse of Φ

Network weights can be computed by fast linear matrix inversion techniques
In practice, singular value decomposition (SVD) is often used to avoid possible ill-conditioning of Φ, i.e. ΦᵀΦ being singular or near singular
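The normal-equation route can be sketched directly; here Gaussian elimination stands in for the SVD/pseudo-inverse routine used in practice, and the tiny design matrix (a bias column plus one feature) is invented for illustration.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (stand-in for SVD).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def output_weights(Phi, t):
    # Solve the normal equations (Phi^T Phi) w = Phi^T t
    n, m = len(Phi), len(Phi[0])
    PtP = [[sum(Phi[r][i] * Phi[r][j] for r in range(n)) for j in range(m)]
           for i in range(m)]
    Ptt = [sum(Phi[r][i] * t[r] for r in range(n)) for i in range(m)]
    return solve(PtP, Ptt)

Phi = [[1.0, float(x)] for x in range(4)]   # bias column + one "activation"
t = [1.0, 3.0, 5.0, 7.0]                    # targets on the line t = 1 + 2x
w = output_weights(Phi, t)
```

Because the output layer is linear, this one matrix solve replaces the whole iterative weight search an MLP would need.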


Supervised RBFN training


Supervised training of basis function parameters can give good results, but the computational costs are usually enormous
Obvious approach is to perform gradient descent on a sum-squared output error function, as in MLP backpropagation learning. The error function would be

  E = ½ Σ_p Σ_k ( y_k(x^p) − t_k^p )²

Supervised RBFN training would iteratively update the weights (basis function parameters) using the gradients of E


Supervised RBFN training


By using the Gaussian basis functions

  φ_j(x) = exp( −‖x − μ_j‖² / 2σ_j² )

the derivatives of the error function with respect to the centres μ_j and widths σ_j become very complex and therefore computationally inefficient
Additionally, we get all the problems of choosing the learning rates, avoiding local minima, ... that we had for training MLPs by backpropagation
And there is a tendency for the basis function widths to grow large, leaving non-localised basis functions

Regularization theory for RBFN


Alternative approach to prevent overfitting in RBFN Based on the theory of regularization, which is a method of controlling the smoothness of mapping functions We can have one basis function for each training data point as in the case of exact interpolation, but add an extra term to the error measure which penalizes mappings which are not smooth


Regularization term in error measure


In the regularization approach, the error measure is modified with an additional regularization term composed from a differential operator P and a regularization parameter λ:

  Ẽ = E + λ ‖P f‖²

Regularization parameter λ determines the relative importance of smoothness compared with error
Differential operator P can have many possible forms, but the general idea is that mapping functions which have large curvature should yield a large regularization term and hence contribute a large penalty in the total error function


RBFN training summary


Option 1) Exact interpolation model + Regularization Option 2) Supervised RBFN training Option 3) Two-stage hybrid training
3a) Hidden layer training
Fixed centres selected at random
Orthogonal least squares
K-means clustering

3b) Output layer training


Linear matrix operation

Where to start?
Two stage hybrid training with K-means clustering and linear matrix operation for output layer

6.6 RBFN for classification


Key insight into RBFN can be obtained by using such networks for classification problems
Suppose we have a data set with three classes

MLP: multilayer perceptron can separate classes by using hidden units to form hyperplanes in the input space
RBFN: alternative approach is to model the separate class distributions by localised radial basis functions

Implementing RBFN for classification


Define an output function yk(x) for each class k with appropriate targets

RBFN is trained with input patterns x and corresponding target classes t

Underlying justification for using RBFN for classification is found in Cover's theorem, which states: "A complex pattern classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space." Once we have linearly separable patterns, the classification problem can be solved by a linear layer


6.7 Comparison with multilayer perceptron


Similarities between RBF networks and MLPs
1. They are both non-linear feed-forward networks
2. They are both universal approximators for arbitrary nonlinear functional mappings
3. They can be used in similar application areas
There always exists an RBF network capable of accurately mimicking a specified MLP, or vice versa.


RBFN / MLP differences


RBFN
1. Single hidden layer
2. Hidden nodes (basis functions) operate very differently, and have a different purpose compared to the output nodes
3. Argument of each hidden unit activation function is the distance between the input and the weights (RBF centres)
4. Usually trained one layer at a time with the first layer unsupervised
5. Use localised non-linearities (Gaussians) at the hidden layer to construct local approximations
6. Fast training

MLP
1. Can have any number of hidden layers
2. Computation nodes (processing units) in different layers share a common neuronal model, though not necessarily the same activation function
3. Argument of each hidden unit activation function is the inner product of the input and the weights
4. Usually trained with a single global supervised algorithm
5. Construct global approximations to nonlinear input-output mappings with distributed hidden representations
6. Require a smaller number of parameters

6.8 Probabilistic networks


Probabilistic neural networks (PNN) can be used for classification problems
First layer computes distances from the input vector to the training input vectors (prototypes) and produces a vector whose elements indicate how close the input is to a training input
Second layer sums these contributions for each class of inputs to produce, as its net output, a vector of probabilities
Finally, a competitive output layer picks the maximum of these probabilities, and produces 1 for that class and 0 for the other classes


PNN example 1
Three training patterns → PNN division of the input space

Classifying new sample


PNN example 2 (1/4)


PNN example 2 (2/4)


PNN example 2 (3/4)


PNN example 2 (4/4)


PNN considerations
Probabilistic neural networks are specialized to classification (less general than RBFN or MLP)
PNN are sensitive to the selection of the spread parameter → spread can be optimized by the leave-one-out cross-validation technique:
1. Leave one training sample out, train the PNN and test on the omitted sample
2. Repeat the procedure for all samples and save the results
3. Find the optimal spread that yields the minimal average classification error

Benefits
Little or no training required (except spread optimization)
Beside classifications, PNN also provides Bayesian posterior probabilities → a solid theoretical foundation to support confidence estimates for the network's decisions
Robust against outliers → outliers have no real effect on decisions

Drawbacks
PNN performance depends strongly on a thoroughly representative training set
Entire training set must be stored → large memory and poor execution speed
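The classifier and the leave-one-out spread search can be sketched together in a few lines of pure Python; the 1-D two-class data and the candidate spreads are invented for illustration.

```python
import math

def pnn_classify(x, train, spread):
    # Sum Gaussian kernel activations per class, pick the maximum.
    scores = {}
    for xp, label in train:
        k = math.exp(-(x - xp) ** 2 / (2 * spread ** 2))
        scores[label] = scores.get(label, 0.0) + k
    return max(scores, key=scores.get)

def loo_error(train, spread):
    # Leave-one-out cross-validation error for a given spread.
    errors = 0
    for i, (x, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if pnn_classify(x, rest, spread) != label:
            errors += 1
    return errors / len(train)

train = [(0.0, 'A'), (0.2, 'A'), (0.4, 'A'),
         (2.0, 'B'), (2.2, 'B'), (2.4, 'B')]
best = min([0.05, 0.2, 1.0, 5.0], key=lambda s: loo_error(train, s))
```

A far too large spread blurs the two classes together, which is exactly what the leave-one-out error detects.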


6.9 Generalized regression networks


GRNN can be well explained by reviewing the regression problem: How to use measured values (independent variables) to predict the value of a dependent variable?
[Figure: one case where linear regression is OK, one where linear regression fails]


Simple linear regression


Simple linear regression is expressed as

  y = a·x + b

Given the training data, the slope a and bias b are computed as follows
Compute the sums of squares:

  SS_x = Σ_i (x_i − x̄)²,   SS_xy = Σ_i (x_i − x̄)(y_i − ȳ)

Compute the slope and bias:

  a = SS_xy / SS_x,   b = ȳ − a·x̄

The resulting linear equation will minimize the mean squared error of the predicted values y in the training set
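These two steps translate directly into code; the sample points lie exactly on the line y = 2x + 1, chosen so the result is easy to check.

```python
def linear_regression(xs, ys):
    # Slope a and bias b from the sums of squares.
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    ss_x = sum((x - x_mean) ** 2 for x in xs)
    ss_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = ss_xy / ss_x
    b = y_mean - a * x_mean
    return a, b

a, b = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])  # recovers a = 2, b = 1
```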

Multiple regression
Several independent variables x1, x2, x3, ...

  y = a₁x₁ + a₂x₂ + a₃x₃ + b

Matrix notation, with x = [x₁, x₂, x₃, 1] (so the bias b enters as the fourth parameter):

  y = x·a

Pack the training data into matrices:

  Y = X·a

Parameters can be expressed as

  a = (XᵀX)⁻¹ XᵀY

The final solution is usually obtained numerically by the singular value decomposition method (SVD)


General regression neural network


Best predictor for the dependent variable y is defined by its conditional expectation, given the independent variable x:

  ŷ(x) = E[ y | x ] = ∫ y f_xy(x, y) dy / ∫ f_xy(x, y) dy

Joint density function f_xy(x, y) is not known but can be approximated by the Parzen estimator
By using the Parzen approximator with Gaussian kernels, we obtain the equation for the GRNN predictor:

  ŷ(x) = Σ_p y^p exp( −D_p² / 2σ² ) / Σ_p exp( −D_p² / 2σ² ),   D_p² = ‖x − x^p‖²
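The predictor is a kernel-weighted average of the stored targets; a minimal 1-D sketch with made-up training data:

```python
import math

def grnn_predict(x, train_x, train_y, spread):
    # Normalized sum of Gaussian kernels centred at each training case:
    # y(x) = sum_p y_p K_p / sum_p K_p
    weights = [math.exp(-(x - xp) ** 2 / (2 * spread ** 2)) for xp in train_x]
    return sum(w * yp for w, yp in zip(weights, train_y)) / sum(weights)

train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 4.0]
pred = grnn_predict(1.0, train_x, train_y, spread=0.1)
```

Because of the normalization, the prediction is always a convex combination of the training targets, which is what makes the GRNN so robust.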


GRNN properties
GRNN closely resembles RBFN with a normalization term in the denominator → it is sometimes called Normalized RBFN
GRNN also resembles PNN but is used for regression (function approximation), not for classification
Width parameter spread must be selected, as in all RBF networks
First layer has Gaussian kernels located at each training case and computes distances from the input vector to the training input vectors (prototypes)
Second layer is a special linear layer with a normalization operator
Normalization makes GRNN a very robust predictor


GRNN architecture
Standard radial basis layer → Normalization → Linear layer


RBFN vs. GRNN example (1/3)


RBFN vs. GRNN example (2/3)


RBFN vs. GRNN example (3/3)


Summary
[Figures: RBFN, PNN, and GRNN architectures]



7. Self-Organizing Maps

7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 SOM discussion & examples


Introduction
1. We discussed so far a number of networks which were trained to perform a mapping INPUTS → OUTPUTS, which corresponds to the supervised learning paradigm

2. However, problems exist where target outputs are not available; the only information is provided by a set of input patterns, INPUTS → ??, which corresponds to the unsupervised learning paradigm


Examples of problems
Clustering
Input data are grouped into clusters → for any input, the neural net should return a corresponding cluster label

Vector quantization
Continuous space has to be discretised → the neural net has to find an optimal discretisation of the input space

Dimensionality reduction
Input data are grouped in a subspace with lower dimensionality than the original data Neural net has to learn an optimal mapping such that most of the variance in the input data is preserved in the output data

Feature extraction
System has to extract features from the input signal → this often means a dimensionality reduction as described above

7.1 Self-organization
What is self-organization?
System structure appears without explicit pressure or involvement from outside the system. The constraints on form (i.e. organization) of interest to us are internal to the system, resulting from the interactions among the components. The organization can evolve in either time or space, maintain a stable form, or show transient phenomena.


Self-organization properties
Typical features include (in rough order of generality)
Autonomy (absence of external control)
Dynamic operation (evolution in time)
Fluctuations (noise / searches through options)
Symmetry breaking (loss of freedom)
Global order (emergence from local interactions)
Dissipation (energy usage / far-from-equilibrium)
Instability (self-reinforcing choices / nonlinearity)
Multiple equilibria (many possible attractors)
Criticality (threshold effects / phase changes)
Redundancy (insensitivity to damage)
Self-maintenance (repair / reproduction metabolisms)
Adaptation (functionality / tracking of external variations)
Complexity (multiple concurrent values or objectives)
Hierarchies (multiple nested self-organized levels)


John Conway's Game of Life

John Conway (1970) published the game in Scientific American. Game of Life:
an infinite two-dimensional grid of square cells; each cell is in one of two possible states, alive or dead; every cell interacts with its eight neighbours.

RULES:
1. A live cell with fewer than 2 or more than 3 neighbours dies (loneliness / overcrowding)
2. A dead cell with exactly 3 neighbours turns alive (reproduction)

Glider Gun creating gliders


Self-organization in neural networks


Self-organizing networks are based on competitive learning
output neurons of the network compete to be activated and only one neuron can become a winning neuron

Self-organizing maps (SOM)


learn to recognize groups of similar input vectors in such a way that neurons physically near each other in the neuron layer respond to similar input vectors

Learning vector quantization (LVQ)


a method for training competitive layers in a supervised manner; it learns to classify input vectors into target classes chosen by the user


Neurobiological motivation
Neurobiological studies indicate that different sensory inputs (tactile, visual, auditory, etc.) are mapped onto different areas of the cerebral cortex in an ordered fashion. This form of map, known as a topographic map, has two important properties:
1. At each stage of representation, or processing, each piece of incoming information is kept in its proper context / neighbourhood 2. Neurons dealing with closely related pieces of information are kept close together so that they can interact via short synaptic connections

Our interest is in building artificial topographic maps that learn through self-organization in a neurobiologically inspired manner We shall follow the principle of topographic map formation: The spatial location of an output neuron in a topographic map corresponds to a particular domain or feature drawn from the input space


7.2 Self-organizing maps (SOM)


Neurons are placed at the nodes of a lattice, usually 1D or 2D. Neurons are trained by a self-organized competitive learning rule and become selectively tuned to various input patterns or classes of input patterns. The locations of the neurons become ordered in such a way that a meaningful topographic map of the input patterns is created. The process of ordering is automatic (self-organized), without guidance from outside. Self-organizing maps are inherently nonlinear: a nonlinear generalization of principal component analysis (PCA).


Organization of a self-organizing map


Points x from the input space are mapped to points I(x) in the output space (the self-organizing map). Each point I in the output space maps back to a corresponding point w(I) in the input space.


Kohonen network
Kohonen (1982): Self-organized formation of topologically correct feature maps, Biological Cybernetics. The Kohonen network or Self-Organizing Map (SOM) has a single computational layer arranged in rows and columns
1D, 2D, 3D

Each neuron is fully connected to all source nodes in the input layer


SOM architecture
Calculating the distance between inputs and neurons: dist. Competitive layer: selection of a winning neuron and its neighborhood (dist, linkdist, mandist, boxdist)

Topologies: 1D, 2D, 3D


7.3 SOM algorithm


1. Initialization: define the SOM topology, then initialize the weights with small random values
2. Competition: for each input pattern, the neurons compute their values of a distance function, which provides the basis for competition. The neuron with the smallest distance to the input pattern is declared the winner.
3. Cooperation: the winning neuron determines the topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons
4. Adaptation: excited neurons decrease their distance to the input pattern through adjustment of synaptic weights; the response of the winning neuron to the subsequent application of a similar input pattern is thereby enhanced

Competition - Cooperation - Adaptation


We have an m-dimensional input space

    x = [x_1, x_2, ..., x_m]

The synaptic weight vector of each neuron in the network has the same dimension as the input space

    w_j = [w_{j1}, w_{j2}, ..., w_{jm}],   j = 1, ..., K

The best match of the input vector x with the synaptic weight vectors w_j can be found by comparing the Euclidean distance between the input vector x and each neuron j

    d_j(x) = ||x - w_j||

The neuron whose weight vector comes closest to the input vector (i.e. is most similar to it) is declared the winning neuron. In this way the continuous input space can be mapped to the discrete output space of neurons by a simple process of competition between the neurons.
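The competition step just described can be sketched in a few lines of NumPy (an illustrative sketch, not the MATLAB toolbox implementation; the weight matrix W and input x below are toy values):

```python
import numpy as np

# Competition step: find the winning neuron for one input vector x.
# W is a (K, m) matrix of synaptic weight vectors, x an m-dimensional input.
def winner(W, x):
    distances = np.linalg.norm(W - x, axis=1)  # Euclidean distance d_j(x) = ||x - w_j||
    return int(np.argmin(distances))           # index of the closest weight vector

W = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
x = np.array([0.9, 1.2])
print(winner(W, x))  # neuron 1 is closest to x
```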


Competition - Cooperation - Adaptation


The winning neuron locates the center of a topological neighborhood of cooperating neurons. Neurobiological studies confirm that there is lateral interaction within a set of excited neurons:
When one neuron fires, its closest neighbours tend to get excited more than those further away. The topological neighbourhood decays with distance.

We define a similar, neurobiologically plausible topological neighbourhood for the neurons in the SOM and assume two requirements:
1. The topological neighborhood is symmetric around the winning neuron
2. The amplitude of the topological neighborhood decreases monotonically with increasing lateral distance, decaying to zero in the limit d → ∞, which is necessary for convergence


Competition - Cooperation - Adaptation


A typical choice of topological neighbourhood function that covers both requirements is the Gaussian function

    h_{j,i(x)} = exp(-d²_{j,i} / 2σ²)

where σ is the effective width of the topological neighborhood. The Gaussian function is translation invariant (independent of the location of the winning neuron).
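A minimal NumPy sketch of this neighbourhood function (illustrative values):

```python
import numpy as np

# Gaussian topological neighbourhood h_{j,i} = exp(-d_{j,i}^2 / (2*sigma^2)),
# where d is the lateral distance between neuron j and the winning neuron i
# in the output lattice, and sigma is the effective neighbourhood width.
def neighbourhood(d, sigma):
    return np.exp(-d**2 / (2.0 * sigma**2))

print(neighbourhood(0.0, 1.0))  # winner itself: 1.0
# the value decays monotonically with lateral distance, as required
```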


Competition - Cooperation - Adaptation


For cooperation to be effective, the topological neighborhood must depend on the lateral distance between the winning neuron and its neighbors in the output space, NOT on a distance measure in the original input space.

Figure: winning neuron and its lattice neighbours; distance functions: dist, linkdist, mandist, boxdist

Competition - Cooperation - Adaptation


Another special feature of the SOM algorithm is that the size of the topological neighborhood shrinks with time. The shrinking requirement is fulfilled by decreasing the width of the Gaussian neighborhood function with time; a popular choice is exponential temporal decay:

    σ(n) = σ₀ exp(-n / τ₁),   n = 1, 2, ...

Consequently, the topological neighborhood function assumes the time-varying form

    h_{j,i(x)}(n) = exp(-d²_{j,i} / 2σ²(n)),   n = 1, 2, ...

As time n increases, the width σ(n) decreases and the neighborhood shrinks.

Competition - Cooperation - Adaptation


Figure: as time n increases, the width σ(n) decreases and the neighborhood shrinks


Competition - Cooperation - Adaptation


Clearly, the SOM must involve some kind of adaptation or learning by which the outputs become self-organised and the feature map between inputs and outputs is formed. The meaning of the topographic neighbourhood is that not only the winning neuron gets its weights updated; its neighbours have their weights updated as well. The learning rule for adaptation is

    w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) (x - w_j(n))

The rule is applied to all neurons inside the topological neighbourhood of the winning neuron i. Adaptation moves the synaptic weights w_j of the chosen neurons toward the input vector x.

Competition - Cooperation - Adaptation


The adaptation algorithm leads to a topological ordering of the feature map: neurons that are neighbours in the lattice will tend to have similar weight vectors. The learning parameter η(n) should also decrease with time for proper convergence, e.g.

    η(n) = η₀ exp(-n / τ₂),   n = 1, 2, ...

Thus, the SOM algorithm requires the choice of several parameters:

    σ₀, τ₁, η₀, τ₂

Even if the selection of parameters is not optimal, it usually leads to the formation of the feature map in a self-organized manner.


Competition - Cooperation - Adaptation


The adaptation process can be decomposed into two phases:

1. Self-organizing or ordering phase
Topological ordering of the weight vectors; typically about 1000 iterations of the SOM algorithm; needs a proper choice of neighbourhood function and learning rate

2. Convergence phase
Fine-tuning of the feature map; provides statistical quantification of the input space; typically the number of iterations is at least 500 times the number of neurons

Result of the SOM algorithm: starting from an initial state of complete disorder, the SOM algorithm gradually leads to an organized representation of activation patterns drawn from the input space.
However, it is possible to end up in a metastable state in which the feature map has a topological defect.

SOM algorithm essentials


Essential characteristics of the SOM algorithm:
A continuous input space of activation patterns generated according to a certain probability distribution
A discrete output space in the form of a lattice of neurons
A shrinking neighborhood function h defined around the winning neuron
A learning rate that decreases exponentially with time


SOM algorithm summary


1. Initialization: choose random values for the initial weight vectors w_j
2. Sampling: draw a sample training input vector x from the input space
3. Competition: find the winning neuron whose weight vector is closest to the input vector
4. Cooperation: select the neurons in the topological neighbourhood of the winning neuron
5. Adaptation: adjust the synaptic weights of the selected neurons,

    w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) (x - w_j(n))

6. Iteration: continue with step 2 until the feature map stops changing
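The six steps above can be sketched as a minimal 1D SOM in NumPy (an illustrative sketch rather than the MATLAB toolbox implementation; the lattice size, decay constants σ₀, τ₁, η₀, τ₂ and the iteration count are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)

K, m = 20, 2                            # 1D lattice of K neurons, m-dimensional inputs
X = rng.random((1000, m))               # training inputs drawn from the unit square
W = rng.random((K, m)) * 0.1 + 0.45     # 1. Initialization: small random weights near the centre

sigma0, tau1 = K / 2.0, 1000.0          # initial neighbourhood width and its time constant
eta0, tau2 = 0.1, 1000.0                # initial learning rate and its time constant
positions = np.arange(K)                # lattice coordinates used for lateral distance

for n in range(5000):
    x = X[rng.integers(len(X))]                   # 2. Sampling
    i = np.argmin(np.linalg.norm(W - x, axis=1))  # 3. Competition: winning neuron
    sigma = sigma0 * np.exp(-n / tau1)            # shrinking neighbourhood width
    eta = eta0 * np.exp(-n / tau2)                # decreasing learning rate
    d = np.abs(positions - i)                     # lateral distances in the lattice
    h = np.exp(-d**2 / (2.0 * sigma**2))          # 4. Cooperation: Gaussian neighbourhood
    W += eta * h[:, None] * (x - W)               # 5. Adaptation of all excited neurons
                                                  # 6. Iteration: loop until the map stabilizes

# after training, neighbouring neurons tend to hold similar (ordered) weight vectors
```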

Visualizing the SOM algorithm (1/2)

Step 1: Suppose we have four data points (x) in our continuous 2D input space, and want to map them onto four points in a discrete 1D output space (o). The output nodes map to points in the input space; random initial weights start those points at random positions in the centre of the input space.

Step 2: We randomly pick one of the data points for training. The closest output point represents the winning neuron, which is moved towards the data point by a certain amount, while the two neighbouring neurons move by smaller amounts.

Visualizing the SOM algorithm (2/2)

Step 3: Next we randomly pick another data point for training. The closest output point gives the new winning neuron, which moves towards the data point by a certain amount, while its one neighbouring neuron moves by a smaller amount.

Step 4: We carry on randomly picking data points for training. Each winning neuron moves towards the data point by a certain amount, and its neighbouring neuron(s) move by smaller amounts. Eventually the whole output grid unravels itself to represent the input space.

Example: 1D Lattice driven by 2D distribution


Figure panels: 2D input data distribution; initial condition of the 1D lattice; end of the ordering phase; end of the convergence phase


Parameters for 1D example


(a) Exponential decay of the neighborhood width σ(n)
(b) Exponential decay of the learning rate η(n)
(c) Initial neighborhood function (spanning over 100 neurons)
(d) Final neighborhood function at the end of the ordering phase


Example: 2D Lattice driven by 2D distribution


Figure panels: 2D input data distribution; initial condition of the 2D lattice; end of the ordering phase; end of the convergence phase


Matlab examples
nnd14fm1 - 1D feature map
nnd14fm2 - 2D feature map


7.4 Properties of the feature map

Property 1: Approximation of the input space
Property 2: Topological ordering
Property 3: Density matching
Property 4: Feature selection


Property 1: Approximation of the input space


The feature map, represented by the set of weight vectors {w_i} in the output space, provides a good approximation to the input space. The goal of the SOM can be formulated as storing a large set of input vectors {x} by a smaller set of prototypes {w_i} that provide a good approximation to the original input space. The goodness of the approximation is given by the total squared distance, which we wish to minimize.

If we work through gradient descent style mathematics we do end up with the SOM weight update algorithm, which confirms that it is generating a good approximation to the input space


Property 2: Topological ordering


The feature map computed by the SOM algorithm is topologically ordered in the sense that the spatial location of a neuron in the output lattice corresponds to a particular domain or feature of the input patterns. The topological ordering property is a direct consequence of the weight update equation: not only the winning neuron but also the neurons in its topological neighbourhood are updated. Consequently the whole output space becomes appropriately ordered. Visualise the feature map as an elastic net.


Property 3: Density matching


The feature map reflects variations in the statistics of the input distribution: Regions in the input space from which the sample training vectors x are drawn with high probability of occurrence are mapped onto larger domains of the output space, and therefore with better resolution than regions of input space from which training vectors are drawn with low probability. We can relate the input vector probability distribution p(x) to the magnification factor m(x) of the feature map. Generally, for two dimensional feature maps the relation cannot be expressed as a simple function, but in one dimension we can show that

So the SOM algorithm doesnt match the input density exactly, because of the power of 2/3 rather than 1. As a general rule, the feature map tend to over-represent the regions with low input density and to under-represent regions with high input density.


Property 4: Feature selection


Given data from an input space with a non-linear distribution, the self-organizing map is able to select a set of best features for approximating the underlying distribution. This property is a natural culmination of properties 1, 2 and 3. Principal Component Analysis (PCA) computes the input dimensions which carry the most variance in the training data, by computing the eigenvector associated with the largest eigenvalue of the correlation matrix. PCA is fine if the data really does form a line or plane in input space, but if the data forms a curved line or surface, linear PCA is no good; a SOM overcomes this approximation problem by virtue of its topological ordering property. The SOM provides a discrete approximation to so-called principal curves or principal surfaces, and may therefore be viewed as a non-linear generalization of PCA.


7.5 SOM discussion & examples


A SOM is a neural network built around a 1D or 2D lattice of neurons for capturing important features contained in the input data. A SOM provides a structural representation of the input data, using the neurons' weight vectors as prototypes. The SOM is neurobiologically inspired and incorporates the self-organizing mechanisms of
Competition Cooperation Adaptation

SOM is simple to implement yet mathematically difficult to analyze


4 clusters, 1D SOM


4 clusters, 2D SOM


Uniform distribution, square


Figure: weight vectors of a 50-neuron SOM (top) and a 10x10-neuron SOM (bottom) trained on a uniform distribution of 1000 points in a square

Uniform distribution, circle


Figure: weight vectors of a 50-neuron SOM and a 10x10-neuron SOM trained on a uniform distribution of 1000 points in a circle

Gaussian distribution, square


Figure: weight vectors of a 50-neuron SOM and a 10x10-neuron SOM trained on a Gaussian distribution of 1000 points in a square

Complex distribution

Figure: weight vectors of a 50-neuron SOM and a 10x10-neuron SOM trained on a complex distribution in a square

4 clusters, 2D SOM
Figure: weight vectors of an 8x8-neuron SOM trained on 4 classes with uniform distribution, 1000 points in each class

8. Practical Considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
8.8 General guidelines

© 2012 Primož Potočnik, NEURAL NETWORKS (8) Practical considerations

Introduction
A neural network could, in principle, map raw input data into required outputs; in practice, this generally gives poor results. For most applications, some data manipulations are recommended:

Preparing data: designing the training data, handling missing and extreme data, incorporating invariances and prior knowledge
Preparing inputs: pre-processing, rescaling, normalizing, standardizing, detrending; dimensionality reduction (principal component analysis); feature selection, feature extraction
Preparing outputs: encoding of classes, post-processing, rescaling, standardizing

8.1 Designing the training data


Good training data are required to train a NN
Neural nets are not good at extrapolation

Training data must be representative for the problem considered For pattern recognition
Every class must be represented Within each class, statistical variation must be adequately represented Potato chips factory example:
NN must be trained on 1) normal chips, 2) burned chips, 3) uncooked chips, ...

Large training set prevents overfitting


Overfitting = perfect fit to a small number of training data. Three-layer feedforward network example:
With 25 inputs and 10 hidden neurons, there are over 260 free parameters. Apply at least 500-1000 training samples (preferably more) for proper training.


8.2 Preparing data


Some data transformations are usually necessary to achieve good neural network results.
Add/subtract a constant and then multiply/divide by a constant Example: convert a temperature from Celsius to Fahrenheit

Standardizing
Subtracting a measure of location and dividing by a measure of scale Example: subtracting a mean and dividing by standard deviation, thereby obtaining a "standard normal" random variable with mean 0 and standard deviation 1

Normalizing
Dividing a vector by its norm Example: make the Euclidean length of the vector equal to one. In the NN literature, "normalizing" often refers to rescaling into [0,1] range

Which operations should be applied to data? It depends!


NEURAL NETWORKS (8) Practical considerations #340

2012 Primo Potonik

170

Rescaling
Rescaling inputs
The often-recommended rescaling of inputs to the interval [0,1] is a misconception; there is in fact no such requirement. The interval [0,1] is usually a bad choice, and rescaling to the [-1,1] interval is better. Standardizing inputs is better than rescaling.

Rescaling outputs
1. For bounded activation functions (range [0,1] or [-1,1]), the target values must lie within that range. The alternative is to use an activation function suited to the distribution of the targets, for example a linear activation function.
2. It is essential to rescale multidimensional targets so that their variability reflects their importance, or at least is not in inverse relation to their importance. If the targets are of equal importance, they should typically be rescaled or standardized to the same range or the same standard deviation.


Standardizing
Standardizing usually refers to transforming data to zero mean and standard deviation one.
The statistics (mean, std) are computed from the training data, not from the validation data. Validation data must be standardized using the statistics computed from the training data.

Standardizing inputs
Often very beneficial for MLP and RBFN networks:
RBFN: inputs are combined via a distance function (Euclidean), so it is important to standardize them to a similar range
MLP: standardizing enables utilization of the steep parts of the transfer functions, giving faster learning and avoiding saturation

Standardizing outputs
Typically more a convenience for getting good initial weights than a necessity. Important for the equal relevance of targets. Note: use rescaling for bounded activation functions, not standardizing!
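The train/validation discipline for standardizing can be sketched as follows (illustrative NumPy code; the toy data are assumptions):

```python
import numpy as np

# Standardize using statistics computed from the TRAINING data only;
# validation data must be transformed with the same mean and std.
def fit_standardizer(X_train):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mu, sigma

def standardize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(1)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))

mu, sigma = fit_standardizer(X_train)
Z_train = standardize(X_train, mu, sigma)
Z_val = standardize(X_val, mu, sigma)   # training statistics, not validation statistics

print(Z_train.mean(axis=0).round(6))    # ~0 for the training data by construction
```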


Time series transformations


Detrending
Removing a linear trend from the time series; after the neural network application, the original trend is added back to the results. Careful: it is too easy to create a trend where none belongs!

Removing seasonal components


Yearly, monthly, weekly, daily, hourly cycles can be removed before the application of neural networks Decomposition methods

Differencing
Working with differences between successive samples can sometimes bring good results. Example: daily stock-market values convey one sort of information, while the change from one day to the next conveys entirely different information. Differencing can be applied at inputs and outputs; a powerful option is to supply both raw and differenced inputs!
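Differencing and its inversion can be sketched as (illustrative NumPy code with toy values):

```python
import numpy as np

# Differencing: work with the changes between successive samples.
# Raw and differenced series can also be supplied together as inputs.
x = np.array([100.0, 102.0, 101.0, 105.0, 104.0])   # e.g. daily stock values
dx = np.diff(x)                                     # day-to-day changes: [2, -1, 4, -1]

# Invert after modelling: the cumulative sum restores the original series
x_rebuilt = np.concatenate(([x[0]], x[0] + np.cumsum(dx)))
```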

Why detrending is dangerous?


Example of time series preparation


1. Original data x
nonstationary mean nonstationary variance

2. Log(x)
stationary variance

3. Differencing
stationary mean stationary variance

2012 Primo Potonik

NEURAL NETWORKS (8) Practical considerations

#345

Time series decomposition


Time series can often be decomposed into components
Trend (T) Seasonal cycle (S) Residual (E)

Decomposition can be additive or multiplicative:
Additive: Y = T + S + E
Multiplicative: Y = T * S * E

Methods
X-12-ARIMA (U.S. Census Bureau, Statistical Research Divison ) STL (Seasonal Trend Decomposition based on Loess)


Example of STL decomposition


Daily energy consumption

Original data

Trend Weekly cycle

Residual


Missing data and outliers


Handling missing data is difficult
If not many data are missing, discard the incomplete samples, or substitute the missing data with mean values. For an input vector with a single missing variable:
Find similar input vectors (ignoring the missing variable) based on a distance measure, and take the missing value as the average of that variable over the similar input vectors

Outliers can appear due to


Natural variation of the variable's distribution Noise in data acquisition chain Defects

Careful examination of the experiment is required to confirm the validity of outliers


If outliers have some significance, keep them in the training data

Some abnormality is normal!


Do not reject a point unless it is really wild

8.3 Selection of inputs


Importance of inputs: which inputs should be selected for best results (classification or prediction)? Several aspects of importance exist. Predictive importance:
Increase in generalization error when an input is omitted from a network

Causal importance
How much the outputs change if inputs are changed (also called sensitivity)

Marginal importance
Considers inputs in isolation Easy to compute without even training a neural net ...
(Pearson correlation, rank correlation, mutual information, ...)

Marginal importance is of little practical use other than for a preliminary description of the data


How to measure importance of inputs


How to measure importance of inputs: very difficult!
Comparing weights in linear models can be misleading
Comparing standardized weights in linear models can be misleading
Comparing changes in the error function in linear models can be misleading
Statistical p-values can be misleading
Comparing weights in MLPs can be misleading
Sums of products of weights in MLPs can be misleading
Partial derivatives can be misleading
Average partial derivatives over the input space can be misleading
Average absolute partial derivatives can be misleading

ftp://ftp.sas.com/pub/neural/importance.html


Methods of input selection


Practical approach: Selection of inputs based on cross-validation General framework
1. Select a subset of inputs
2. Train and validate the network based on the selected subset of inputs
3. Based on the validation result, decide upon further inclusion/rejection of inputs
4. Continue iterating until good results are obtained

Direct search methods


Exhaustive search Forward selection Backward elimination Selection by genetic algorithms, ...

Pruning methods
Removing nonrelevant inputs during the neural network construction
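The forward-selection search can be sketched as a greedy loop (an illustrative sketch; `validate` stands for the hypothetical step "train the network on this input subset and return its validation error", and the toy scoring function below is an assumption):

```python
# Forward selection: greedily add the input whose inclusion gives the best
# validation score, stopping when no candidate improves the result.
def forward_selection(all_inputs, validate):
    selected, best_error = [], float("inf")
    improved = True
    while improved and len(selected) < len(all_inputs):
        improved, best_candidate = False, None
        for candidate in all_inputs:
            if candidate in selected:
                continue
            error = validate(selected + [candidate])
            if error < best_error:
                best_error, best_candidate, improved = error, candidate, True
        if improved:
            selected.append(best_candidate)
    return selected, best_error

# Toy validation score (hypothetical): inputs 'a' and 'b' are the relevant
# ones, with a small penalty for each extra input in the subset.
def toy_validate(subset):
    return 2 - len(set(subset) & {"a", "b"}) + 0.01 * len(subset)

selected, error = forward_selection(["a", "b", "c"], toy_validate)
print(selected)  # the relevant inputs are kept, the irrelevant one is rejected
```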


8.4 Data encoding (1/3)


Numeric variables
No need for encoding; check the need for rescaling or standardizing

Ordinal variables
Discrete data with natural ordering (e.g. 'small', 'medium', 'big'). Ordinal variables can often be represented by a single variable:

    Small = 1, Medium = 2, Big = 3

Thermometer coding (using dummy variables):

    Small = [0 0 1], Medium = [0 1 1], Big = [1 1 1]

Improved thermometer coding (faster learning):

    Small = [-1 -1 1], Medium = [-1 1 1], Big = [1 1 1]
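The thermometer codings above can be sketched as (illustrative Python; the level names are the example values from the text):

```python
# Thermometer coding of an ordinal variable (small < medium < big):
# each higher level switches one more unit on, counted from the right.
LEVELS = ["small", "medium", "big"]

def thermometer(value, off=0):
    k = LEVELS.index(value) + 1                  # number of active units
    return [off] * (len(LEVELS) - k) + [1] * k

print(thermometer("medium"))          # plain coding:    [0, 1, 1]
print(thermometer("medium", off=-1))  # improved coding: [-1, 1, 1]
```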


Data encoding (2/3)


Categorical variables
Discrete data without ordering (e.g. 'apple', 'banana', 'orange')

1-of-C coding:

    Red = [0 0 1], Green = [0 1 0], Blue = [1 0 0]

1-of-(C-1) coding, if the network has a bias:

    Red = [0 0], Green = [0 1], Blue = [1 0]

1-of-C coding with a softmax activation function will produce valid posterior probability estimates

It is very important NOT to use a single variable for an unordered categorical target


Data encoding (3/3)


Circular discontinuity
How to encode variables that are fundamentally circular? e.g. an angle 0..360, or the day of the week (Mon=1, ..., Sun=7): there is a discontinuity when passing from 7 to 1, although Sunday and Monday are very close.

Solutions:
1. Discretizing and using any of the categorical codings (1-of-C)
2. Encoding with two dummy variables (sin, cos)
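The (sin, cos) encoding can be sketched as (illustrative Python for the day-of-week example):

```python
import math

# Encode a circular variable with two dummy inputs (sin, cos) so that
# neighbouring values stay close; e.g. day of the week (Mon=1 .. Sun=7).
def encode_day(day, period=7):
    angle = 2.0 * math.pi * (day - 1) / period
    return math.sin(angle), math.cos(angle)

# Sunday (7) and Monday (1) become adjacent in the encoded space,
# removing the raw 7 -> 1 discontinuity.
```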


8.5 Principal component analysis


In some situations the dimension of the input vector is large, but the components of the vectors are highly correlated. It is then useful to reduce the dimension of the input vectors (feature extraction). An effective procedure for performing this operation is principal component analysis (PCA). PCA is a vector space transform used to reduce multidimensional data to lower dimensions for analysis; the method generates a new set of variables, called principal components.

Calculation of principal components


The input matrix X is represented as a linear combination of principal components.

The projection vectors p are the eigenvectors of the covariance matrix XX^T. Each principal component z_p is obtained as the product of the input matrix with a projection vector, z_p = X p; each principal component is therefore a linear combination of the original variables. All principal components are orthogonal to each other, so there is no redundant information.


PCA example
Original data: x1, x2. Principal components: z1, z2. Variability: z1 explains 95%, z2 explains 5%.

Benefit
Dimensionality reduction by using only the first principal component (z1) instead of original 2D data (x1, x2)


Properties of principal components


Principal components form an orthogonal basis for the data
1st principal component the variance of this variable is the maximum among all possible choices of the first axis
2nd principal component is perpendicular to the 1st principal component; the variance of this variable is the maximum among all possible choices of this second axis
Often, the first few principal components explain the majority of the total variance → these few new variables can be taken as a low-dimensional input to the neural network instead of the high-dimensional original data

How to use PCA for neural networks


1. Load original data X
2. Compute principal components z
3. Plot variance explained
4. Decide how much variance to keep ... 90%, 95%?
5. Keep only a few selected principal components, discard the rest → data dimensionality is reduced
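Steps 3-5 amount to thresholding the cumulative explained variance (a small sketch; the 95% default is just one common choice):

```python
import numpy as np

def n_components_for_variance(eigvals, threshold=0.95):
    """Smallest number of leading principal components whose
    eigenvalues explain at least `threshold` of the total variance."""
    ratios = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Eigenvalues sorted in decreasing order (e.g. from the PCA step)
vals = np.array([9.0, 0.6, 0.3, 0.1])
k = n_components_for_variance(vals, 0.95)  # 9.0 + 0.6 = 96% of the total 10.0
```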


Intrinsic dimensionality
Suppose we apply PCA on d-dimensional data and discover that the first n eigenvalues have significantly larger values than the rest (n < d)
Consequently, the data can be represented with high accuracy by the first n principal components → the effective dimensionality is only n
Generally, data in d dimensions have intrinsic dimensionality n if the data lies entirely within an n-dimensional subspace


Neural nets for dimensionality reduction


Multilayer feedforward neural networks can be used to perform nonlinear dimensionality reduction
An auto-associative multilayer perceptron with extra hidden layers can perform a general nonlinear dimensionality reduction
Number of neurons: 1024 → 300 → 50 → 300 → 1024 (input and reconstructed output: 32 x 32 pixels)

(Figure: auto-associative network with a 50-unit bottleneck performing nonlinear dimensionality reduction)



8.6 Invariances and prior knowledge


In many practical situations we have, in addition to the data itself, also a priori knowledge
General information about the form of the mapping Prior probabilities of class membership Information about constraints Knowledge about invariances

How to build invariances into neural networks?


1. Invariance by neural network structure shared weights, higher-order neural networks
2. Invariance by training include a large number of translated inputs to train the NN
3. Invariant feature space extract features that are invariant for the problem considered

Review of the Lecture NN-02 feature extraction ...



Handwritten character recognition problem


Recognize handwritten characters a and b

Image representation
Grid of pixels (typically 256x256 = 65536 inputs)
Gray level [0..1] (typically 8-bit coding)

Extraction of the features:


F1 = character width / character height
F2 = closed area / character height

Solving two problems


1. Invariance problem (translations)
2. Curse of dimensionality problem
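The two features can be sketched as follows (illustrative code; F1 and F2 follow our reading of the slide, and the total inked area is used as a simple stand-in for the "closed area"):

```python
import numpy as np

def extract_features(img):
    """Translation-invariant shape features from a binary character
    image (1 = ink, 0 = background). F1 = width/height, F2 = inked
    area/height (inked area stands in for the slide's closed area)."""
    rows = np.any(img, axis=1)   # rows containing ink
    cols = np.any(img, axis=0)   # columns containing ink
    height = rows.sum()
    width = cols.sum()
    area = img.sum()             # total inked pixels
    return width / height, area / height

# A crude 5x2 'bar': taller than wide, regardless of position in the grid
img = np.zeros((8, 8), dtype=int)
img[1:6, 2:4] = 1
f1, f2 = extract_features(img)
```

Shifting the character inside the grid leaves both features unchanged (invariance), and the input dimension drops from tens of thousands of pixels to two features.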

8.7 Generalization
The goal of network training is not to exactly fit the data but to build a statistical model of the process that generates the data
A well-trained network is able to generalize to make good predictions on new inputs
Here it is assumed that the test data are drawn from the same population used to generate the training data

A neural network designed to generalize well will produce a correct input-output mapping even if new inputs differ slightly from the samples used to train the network
Overfitting problem: the neural net learns the complete training set but not the underlying function

Generalization in classification
The task of our network is to learn a classification decision boundary
(Figure: decision boundaries illustrating good generalization vs. overfitting)

If we know that the training data contains noise, we don't necessarily want the training data to be classified totally accurately, as that is likely to reduce the generalization ability

Generalization in function approximation


Function approximation based on noisy data samples
(Figure: fitted curves illustrating good generalization vs. overfitting)

We can expect the neural network output to give a better representation of the underlying function if its output curve does not pass through all the data points
Again, allowing a larger error on the training data is likely to lead to better generalization

Overfitting, underfitting
Overfitting
Neural network perfectly learns the training data but gives poor results on test data

Underfitting
The neural network is unable to properly learn the data due to an insufficient number of neurons or due to extreme regularization
Such a network also generalizes poorly

(Figure: underfitted vs. overfitted function approximations)


Improving generalization
How to prevent underfitting
1. Provide enough hidden units to represent the required mappings
2. Train the network for long enough so that the sum squared error cost function is sufficiently minimised

How to prevent overfitting


3. Design the training data properly use a large training set
4. Cross-validation check the generalization ability on test data
5. Early stopping before the NN has had time to learn the training data too well
6. Restrict the number of adjustable parameters the network has
a) Reduce the number of hidden units, or
b) Force connections to share the same weight values

7. Add a regularization term to the error function to encourage smoother network mappings
8. Add noise to the training patterns to smear out the data points

Cross-validation
Cross-validation is used to estimate the generalization error based on resampling
Available data are randomly partitioned into a
Training set, and Test set

Training set is further partitioned into


Estimation subset, used to train the model
Validation subset, used to validate the model
The training set is used to build and validate various candidate models and to choose the best one

Generalization performance of the selected model is tested on the test set which is different from the validation subset

Variants of cross-validation
If only a small set of data exists ...

Multifold cross-validation
Divide the available N samples into K subsets
The model is trained on all subsets except one
Validation error is measured on the subset left out
The procedure is repeated K times
Model performance is obtained by averaging the K trials


Leave-one-out cross-validation
An extreme form of cross-validation
N-1 samples are used for training
The model is validated on the sample left out
The procedure is repeated N times
The result is averaged over the N trials
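Multifold cross-validation can be sketched as follows (illustrative; `train_and_eval` stands for any user-supplied routine that trains a model and returns its validation error):

```python
import numpy as np

def kfold_score(X, y, train_and_eval, K=5, seed=0):
    """Average validation error over K trials, each leaving one
    subset out for validation and training on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        errors.append(train_and_eval(X[train], y[train], X[val], y[val]))
    return float(np.mean(errors))
```

Setting K = N gives leave-one-out cross-validation.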


Early stopping
Neural networks are often set up with more than enough parameters, which can cause over-fitting
For iterative gradient-descent-based network training procedures (backpropagation, conjugate gradients, ...), the training set error will naturally decrease with an increasing number of epochs of training
The error on the unseen validation and test data sets, however, will start off decreasing as the under-fitting is reduced, but then it will eventually begin to increase again as over-fitting occurs
The natural solution to get the best generalization, i.e. the lowest error on the test set, is to use the procedure of early stopping


Early stopping procedure


How to perform learning with early stopping?
1. Divide the training data into estimation and validation subsets
2. Use a large number of hidden units
3. Use very small random initial values
4. Use a slow learning rate
5. Compute the validation error rate periodically during training
6. Stop training when the validation error rate starts increasing

Since validation error is not a good estimate of the generalization error, a third test set must be applied to estimate generalization performance Available data are divided as in cross-validation
Training set
Estimation subset
Validation subset

Test set
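A minimal sketch of the training loop (illustrative interface: `step` runs one training epoch and `val_error` evaluates the model on the validation subset; the loop keeps the weights from the epoch with the lowest validation error):

```python
import copy

def train_with_early_stopping(model, step, val_error,
                              max_epochs=1000, patience=10):
    """Train until the validation error has failed to improve for
    `patience` epochs, returning the best model seen so far."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        step(model)                       # one training epoch
        err = val_error(model)            # periodic validation check
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                     # validation error stopped improving
    return best_model, best_err
```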

Practical considerations of early stopping


One potential problem: the validation error may go up and down numerous times during training → the safest approach is generally to train to convergence, saving the weights at each epoch, and then go back to the weights at the epoch with the lowest validation error
Early stopping resembles regularization with weight decay, which indicates that it will work best if the training starts with very small random initial weights

General practical problems
How best to split the available training data into training and validation subsets?
What fraction of the patterns should be in the validation set?
Should the data be split randomly, or by some systematic algorithm?

Such issues are problem-dependent ...
Default Matlab parameters (train, validation, test): 70%, 15%, 15%


Weight restriction and weight sharing


Perhaps the most obvious way to prevent over-fitting in neural networks is to restrict the number of free parameters
The simplest solution is to restrict the number of hidden units, as this will automatically reduce the number of weights; the optimal number for a given problem can be determined by cross-validation
An alternative solution is to have many weights in the network, but constrain certain groups of them to be equal
a) If there are symmetries in the problem, we can enforce hard weight sharing by building them into the network in advance
b) In other problems we can use soft weight sharing, where sets of weights are encouraged to have similar values by the learning algorithm → one way to implement soft weight sharing is to add an appropriate term to the error function → regularization


Regularization
The regularization technique encourages smoother network mappings by adding a penalty term to the standard (sum-squared-error) cost function

E = Esse + λΩ

where the regularization parameter λ controls the trade-off between reducing the error Esse and increasing the smoothing
This modifies the gradient descent weight updates

Δwij = −η (∂Esse/∂wij + λ ∂Ω/∂wij)

The resulting neural network mapping is a compromise between fitting the data and minimizing the regularizer Ω


Regularization by weight decay


One of the simplest forms of regularizer is called weight decay and consists of the sum of squares of the network weights

Ω = 1/2 Σ wij²

In conventional curve fitting this regularizer is known as ridge regression. We can see why it is called weight decay when we observe the extra term in the weight updates

Δwij = −η ∂Esse/∂wij − ηλwij

In each epoch the weights decay in proportion to their size
Empirically, this leads to significant improvements in generalization. Weight decay keeps the weights small and hence the mappings are smooth
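A single decayed update can be sketched as follows (illustrative; `lam` stands for the regularization parameter and `grad_esse` for the gradient of the sum-squared-error term):

```python
import numpy as np

def weight_decay_step(w, grad_esse, lr=0.01, lam=0.001):
    """Gradient-descent update with weight decay: besides the usual
    error-gradient step, every weight shrinks in proportion to its size."""
    return w - lr * grad_esse - lr * lam * w

# With a zero error gradient the weights simply decay toward zero
w = np.array([1.0, -2.0])
w_next = weight_decay_step(w, grad_esse=np.zeros(2), lr=0.1, lam=0.5)
```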


Training with noise / Jittering


Adding noise (jitter) to the inputs during training has also been found empirically to improve network generalization
Noise will smear out each data point and make it difficult for the network to fit individual data points precisely, and consequently reduce over-fitting
Jittering is accomplished by generating new inputs from the original inputs and small amounts of jitter
Adding jitter to the targets will not change the optimal weights, it will just slow down training
Jittering is also closely related to regularization methods such as weight decay and ridge regression
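Jittering the inputs can be sketched as follows (illustrative; `sigma` must be tuned to the scale of the data):

```python
import numpy as np

def jitter(X, sigma=0.01, copies=5, seed=0):
    """Augment the training inputs by adding small Gaussian noise;
    each original sample yields `copies` additional noisy samples."""
    rng = np.random.default_rng(seed)
    noisy = [X + rng.normal(0.0, sigma, X.shape) for _ in range(copies)]
    return np.vstack([X] + noisy)
```

The corresponding targets are simply repeated for each noisy copy of the inputs.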


Generalization summary
Preventing underfitting
1. Provide enough hidden units
2. Train the network for long enough

Preventing overfitting
3. Design the training data properly
4. Cross-validation
5. Early stopping
6. Restrict the number of adjustable parameters
7. Regularization
8. Jittering


8.8 General guidelines


General guidelines for designing successful neural network solutions:
1. Understand and specify your problem
2. Acquire and analyze data, define inputs and outputs, remove outliers, apply preprocessing methods (rescale, standardize, normalize), properly encode outputs, ...
3. Acquire prior knowledge and apply it in terms of feature selection, feature extraction, selection of neural network type, neural network complexity, etc.
4. Start with simple neural network architectures few layers, few neurons
5. Train the network and make sure it performs well on its training data. If this doesn't work, increase the complexity of the network.
6. Test its generalization by checking its performance on new test data. If this doesn't work, check your data, check the partitioning of data into train/test sets, check and modify the network architecture, ...
