Lecturer: Primož Potočnik, University of Ljubljana, Faculty of Mechanical Engineering, Laboratory of Synergetics, www.neural.si, primoz.potocnik@fs.uni-lj.si, +386-1-4771-167
TABLE OF CONTENTS
0. Organization of the Study
1. Introduction to Neural Networks
2. Neuron Model, Network Architectures and Learning
3. Perceptrons and Linear Filters
4. Backpropagation
5. Dynamic Networks
6. Radial Basis Function Networks
7. Self-Organizing Maps
8. Practical Considerations
Learning outcomes
Understand the concept of nonparametric modelling by NN
Explain the most common NN architectures:
- Feedforward networks
- Dynamic networks
- Radial Basis Function Networks
- Self-organized networks
2. Teaching methods
Teaching methods:
1. Lectures: 4 hours weekly, classical & practical (MATLAB); Tuesday 9:15-10:45, Friday 9:15-10:45
2. Homeworks: home projects
3. Consultations with the lecturer
Location
Institute for Sustainable Innovative Technologies (Pot za Brdom 104, Ljubljana)
3. Assessment
ECTS credits:
EURHEO (II): 6 ECTS
Final mark:
Homework: 50% of final mark
Written exam: 50% of final mark
Important dates
Homework presentations: Tue, 8 Jan 2013 and Fri, 11 Jan 2013
Written exam: Fri, 18 Jan 2013
4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
8. Practical considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
5. Books
1. Neural Networks and Learning Machines, 3/E, Simon Haykin (Pearson Education, 2009)
2. Neural Networks: A Comprehensive Foundation, 2/E, Simon Haykin (Pearson Education, 1999)
3. Neural Networks for Pattern Recognition, Chris M. Bishop (Oxford University Press, 1995)
4. Practical Neural Network Recipes in C++, Timothy Masters (Academic Press, 1993)
5. Advanced Algorithms for Neural Networks, Timothy Masters (John Wiley and Sons, 1995)
6. Signal and Image Processing with Neural Networks, Timothy Masters (John Wiley and Sons, 1994)
6. SLO Books
1. Nevronske mreže, Andrej Dobnikar (Didakta, 1990)
2. Modeliranje dinamičnih sistemov z umetnimi nevronskimi mrežami in sorodnimi metodami, Juš Kocijan (Založba Univerze v Novi Gorici, 2007)
7. E-Books (1/2)
List of links at www.neural.si
An Introduction to Neural Networks, Ben Krose & Patrick van der Smagt, 1996
Recommended as an easy introduction
Neural Networks - Methodology and Applications, Gerard Dreyfus, 2005
Metaheuristic Procedures for Training Neural Networks, Enrique Alba & Rafael Marti (Eds.), 2006
FPGA Implementations of Neural Networks, Amos R. Omondi & Jagath C. Rajapakse (Eds.), 2006
Trends in Neural Computation, Ke Chen & Lipo Wang (Eds.), 2007
2012 Primož Potočnik, NEURAL NETWORKS (0) Organization of the Study
7. E-Books (2/2)
Neural Preprocessing and Control of Reactive Walking Machines, Poramate Manoonpong, 2007
Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, Krzysztof Patan, 2008
Speech, Audio, Image and Biomedical Signal Processing using Neural Networks [only two chapters], Bhanu Prasad & S.R. Mahadeva Prasanna (Eds.), 2008
8. Online resources
List of links at www.neural.si
Neural FAQ, by Warren Sarle, 2002
How to measure importance of inputs, by Warren Sarle, 2000
MATLAB Neural Networks Toolbox (User's Guide), latest version
Artificial Neural Networks on Wikipedia.org
Neural Networks online book by StatSoft
Radial Basis Function Networks by Mark Orr
Principal components analysis on Wikipedia.org
libsvm Support Vector Machines library
9. Simulations
Recommended computing platform
MATLAB R2010b (or later) & Neural Network Toolbox 7, http://www.mathworks.com/products/neuralnet/
Acceptable older MATLAB release: MATLAB 7.5 & Neural Network Toolbox 5.1 (Release 2007b)
Introduction to Matlab
Get familiar with MATLAB M-file programming Online documentation: Getting Started with MATLAB
10. Homeworks
EURHEO students (II)
1. Practically oriented projects
2. Based on UC Irvine Machine Learning Repository data, http://archive.ics.uci.edu/ml/
3. Select a data set and discuss with the lecturer
4. Formulate the problem
5. Develop your solution (concept & Matlab code)
6. Describe the solution in a short report
7. Submit results (report & Matlab source code)
8. Present results and demonstrate the solution: presentation (~10 min), demonstration (~20 min)
Video links
Robots with Biological Brains: Issues and Consequences, Kevin Warwick, University of Reading, http://videolectures.net/icannga2011_warwick_rbbi/
Computational Neurogenetic Modelling: Methods, Systems, Applications, Nikola Kasabov, University of Auckland, http://videolectures.net/icannga2011_kasabov_cnm/
Artificial neurons
Simple mathematical approximations of biological neurons
Zurada (1992)
Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.
Pinkus (1999)
The question 'What is a neural network?' is ill-posed.
This complex network forms the nervous system, which relays information through the body
Biological neuron
Interaction of neurons
- Action potentials arriving at the synapses stimulate currents in the dendrites
- These currents depolarize the membrane at the axon, provoking an action potential
- The action potential propagates down the axon to its synaptic knobs, releasing neurotransmitter and stimulating the post-synaptic neuron
Synapses
Elementary structural and functional units that mediate the interaction between neurons
Chemical synapse: pre-synaptic electric signal → chemical neurotransmitter → post-synaptic electrical signal
Action potential
Spikes or action potentials:
- Neurons encode their outputs as a series of voltage pulses
- The axon is very long, with high resistance and high capacitance
- Frequency modulation
- Improved signal/noise ratio
Stimulus → Receptors → Effectors → Response
Receptors
collect information from environment (photons on retina, tactile info, ...)
Effectors
generate interactions with the environment (muscle activation, ...)
Flow of information
feedforward & feedback
Human brain
Human activity is regulated by a nervous system:
Central nervous system: brain, spinal cord
- 10^10 neurons in the brain
- 10^4 synapses per neuron
- 1 ms processing time of a neuron: slow rate of operation
- Extreme number of processing units & interconnections: massive parallelism
Network of neurons
In practice
NN are especially useful for classification and function approximation problems which are tolerant of some imprecision, have a lot of training data available, and where hard rules (such as those used in an expert system) are difficult to apply. Almost any finite-dimensional vector function on a compact set can be approximated to arbitrary precision by a feedforward NN.
2. Adaptivity
Neural networks have a natural capability to adapt to the changing environment:
- Train a neural network, then retrain
- Continuous adaptation in a nonstationary environment
4. Fault tolerance
Capable of robust computation: graceful degradation rather than catastrophic failure
6. Neurobiological analogy
NN design is motivated by analogy with brain NN are research tool for neurobiologists Neurobiology inspires further development of artificial NN
www.stanford.edu/group/brainsinsilicon/
1943 McCulloch & Pitts: the idea of neural networks as computing machines
1949 Hebb: published his book The Organization of Behavior; introduced the Hebbian learning rule
1958 Rosenblatt: proposed the perceptron
1969
1982 Hopfield: published a series of papers on Hopfield networks
1982 Kohonen: developed the Self-Organising Maps
1990s: Radial Basis Function Networks were developed
2000s: The power of Ensembles of Neural Networks and Support Vector Machines becomes apparent
Current NN research
Topics for the 2013 International Joint Conference on NN
Neural network theory and models, computational neuroscience, cognitive models, brain-machine interfaces, embodied robotics, evolutionary neural systems, self-monitoring neural systems, learning neural networks, neurodynamics, neuroinformatics, neuroengineering, neural hardware, neural network applications, pattern recognition, machine vision, collective intelligence, hybrid systems, self-aware systems, data mining, sensor networks, agent-based systems, computational biology, bioinformatics, artificial life
Automotive
Automobile automatic guidance systems, warranty activity analyzers
Banking
Check and other document readers, credit application evaluators
Defense
Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification
Electronics
Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision, voice synthesis, nonlinear modeling
Manufacturing
Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle identification, visual quality inspection systems, welding quality analysis, paper quality prediction, computer chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis, project planning and management, dynamic modelling of chemical process systems
Medical
Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of transplant times, hospital expense reduction, hospital quality improvement, emergency room test advisement
Speech
Speech recognition, speech compression, vowel classification, text to speech synthesis
Securities
Market analysis, automatic bond rating, stock trading advisory systems
Telecommunications
Image and data compression, automated information services, real-time translation of spoken language, customer payment processing systems
Transportation
Truck brake diagnosis systems, vehicle scheduling, routing systems
iteration, time step
input .................................. p
network output ................... a
desired (target) output ....... t
activation function
induced local field .............. n
synaptic weight
bias
error
Adjustable parameters
synaptic weight w bias b
y = f(wx + b)
Weight vector:
w = [w1, w2, ..., wR]
Activation potential:
v = wx + b (product of the weight vector and the input vector x = [x1, ..., xR], plus bias)
Neuron output:
y = f(wx + b) = f(w1x1 + w2x2 + ... + wRxR + b)
Threshold: y(v) = 1 if v >= 0, y(v) = 0 if v < 0
Linear: y(v) = v
Logistic (sigmoid): y(v) = 1 / (1 + exp(-v))
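As a sketch (Python/NumPy here rather than the course's MATLAB; not part of the original slides), the three activation functions can be evaluated as:

```python
import numpy as np

def threshold(v):
    # Hard limiter: y(v) = 1 if v >= 0, else 0
    return np.where(np.asarray(v) >= 0, 1.0, 0.0)

def linear(v):
    # Linear activation: y(v) = v
    return np.asarray(v, dtype=float)

def logistic(v):
    # Sigmoid: y(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-np.asarray(v, dtype=float)))

v = np.array([-2.0, 0.0, 2.0])
y_thr = threshold(v)   # 0 below the threshold, 1 at and above it
y_lin = linear(v)      # identity
y_log = logistic(v)    # values in (0, 1), exactly 0.5 at v = 0
```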
Perceptron neuron:
y = f(wx + b), y(v) = sgn(wx + b)
The output is binary, depending on whether the input meets a specified threshold:
y = 1 if wx >= -b
y = 0 if wx < -b
Matlab notation
Presentation of more complex neurons and networks
- Input vector p is represented by the solid dark vertical bar ([R x 1])
- Weight vector is shown as a single-row, R-column matrix W ([1 x R])
- W and p multiply into the scalar Wp
Matlab Demos
nnd2n1: One input neuron
nnd2n2: Two input neuron
3. Recurrent networks
Contains at least one feedback loop; powerful temporal learning capabilities
(Figure: recurrent network with a delay element in the feedback loop)
Learning process
1. Neural network is stimulated by an environment
2. Neural network undergoes changes in its free parameters as a result of this stimulation
3. Neural network responds in a new way to the environment because of its changed internal structure
Learning algorithm
Prescribed set of defined rules for the solution of a learning problem
1. Error-correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning
Error-correction learning
1. Neural network is driven by input x(t) and responds with output y(t)
2. Network output y(t) is compared with target output d(t)
Error signal = difference of network output and target output:
e(t) = y(t) - d(t)
Cost function (instantaneous error energy):
E(t) = (1/2) e^2(t)
Delta rule weight adjustment:
Δw(t) = η e(t) x(t)
Comments:
- Error signal must be directly measurable
- Key parameter: learning rate η
- Closed-loop feedback system; stability determined by the learning rate
Memory-based learning
All (or most) past experiences are stored in a memory of input-output pairs (inputs and target classes):
{(x_i, y_i)}, i = 1, ..., N
Hebbian learning
The oldest and most famous learning rule (Hebb, 1949), formulated as associative learning in a neurobiological context:
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
There is strong physiological evidence for Hebbian learning in the hippocampus, important for long-term memory and spatial navigation.
Hebbian learning rule:
Δw(t) = η y(t) x(t)
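A minimal sketch of the Hebbian update Δw = η y x (Python/NumPy, illustrative only; the initial weights and the repeated pattern are made up, and note that the plain rule grows weights without bound):

```python
import numpy as np

def hebbian_update(w, x, eta=0.1):
    # Hebbian rule: dw(t) = eta * y(t) * x(t),
    # with post-synaptic activity y(t) = w . x(t)
    y = float(w @ x)
    return w + eta * y * x, y

w = np.array([0.1, 0.0, 0.1])      # small initial weights
x = np.array([1.0, 0.0, 1.0])      # repeatedly presented pattern
for _ in range(10):
    w, y = hebbian_update(w, x)
# weights on the active inputs grow; the inactive input stays at zero
```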
Competitive learning
Individual neurons specialize on ensembles of similar patterns and become feature detectors for different classes of input patterns
Learning paradigm
Manner in which a neural network relates to its environment:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised learning
Learning with a teacher:
- The teacher has knowledge of the environment
- Knowledge is represented by a set of input-output examples
- The teacher's target response represents the optimal action; the error signal (difference between the target response and the learning system output) drives the adaptation
Learning algorithms:
- Error-correction learning
- Memory-based learning
Unsupervised learning
Unsupervised or self-organized learning:
- No external teacher to oversee the learning process
- Only a set of input examples is available, no output examples
Unsupervised NNs usually perform some kind of data compression, such as dimensionality reduction or clustering.
Learning algorithms:
- Hebbian learning
- Competitive learning
Reinforcement learning
- No teacher; the environment only offers a primary reinforcement signal
- The system learns under delayed reinforcement: a temporal sequence of inputs results in the generation of a reinforcement signal
- The goal is to minimize the expectation of the cumulative cost of actions taken over a sequence of steps
- RL is realized through two neural networks: Critic and Learning system
- The critic network converts the primary reinforcement signal (obtained directly from the environment) into a higher-quality heuristic reinforcement signal, which solves the temporal credit assignment problem
Autoassociation
Neural network stores a set of patterns by repeatedly presenting them to the network. Then, when presented with a distorted pattern, the neural network is able to recall the original pattern. Unsupervised learning algorithms.
Heteroassociation
Set of input patterns is paired with arbitrary set of output patterns Supervised learning algorithms
Neural network mapping F(x) can be realized by supervised learning (error-correction learning algorithm)
Important function approximation tasks:
- System identification
- Inverse system modelling
1. Filtering
Extraction of information at discrete time n by using measured data up to and including time n
Examples: cocktail party problem, blind source separation
2. Smoothing
Differs from filtering in that: a) data need not be available at time n; b) data measured later than n can be used to obtain this information
3. Prediction
Deriving information about the quantity in the future at time n+h, h > 0, by using data measured up to and including time n
Examples: forecasting of energy consumption, stock market prediction
Adaptation
Learning has a spatio-temporal nature: space and time are fundamental dimensions of learning (control, beamforming)
1. Stationary environment
Learning under the supervision of a teacher, weights then frozen Neural network then relies on memory to exploit past experiences
2. Nonstationary environment
Statistical properties of the environment change with time; the neural network should continuously adapt its weights in real time (adaptive system = continuous learning)
3. Pseudostationary environment
Changes are slow over a short temporal window:
- Speech: stationary in intervals of 10-30 ms
- Ocean radar: stationary in intervals of several seconds
Knowledge representation
1. Good solution depends on a good representation of knowledge
2. Knowledge of the world consists of:
- Prior information: facts about what is and what has been known
- Observations of the world: measurements, obtained through sensors designed to probe the environment
Observations can be:
1. Labeled: input signals are paired with the desired response
2. Unlabeled: input signals only
Knowledge representation in NN
Design of neural networks based directly on real-life data
Examples to train the neural network are taken from observations
(Figure: variations of an input pattern: original, size, rotation, shift, incomplete image)
Techniques
1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space
Features
Characterize the essential information content of the input data; should be invariant to transformations of the input
Benefits
1. Dimensionality reduction: the number of features is small compared to the original input space
2. Relaxed design requirements for a neural network
3. Invariances for all objects can be assured (for known transformations); prior knowledge is required!
Classifier design
Invariant feature extractor → Neural network classifier → Class estimate: A, B
Image representation
Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)
Curse of dimensionality: increasing input dimensionality leads to sparse data, which provides a very poor representation of the mapping; problems with correct classification and generalization
Possible solution
Combining inputs into features; the goal is to obtain just a few features instead of 65536 inputs
(Figure: decision boundary in the feature space (F1, F2))
Neural network can be used for classification in the feature space (F1, F2):
- 2 inputs instead of 65536 original inputs
- Improved generalization and classification ability
Optimal classifier ?
Best generalization is achieved by a model whose complexity is neither too small nor too large. Occam's razor principle: we should prefer simpler models to more complex models. Tradeoff: modeling simplicity vs. modeling capacity.
Most NN that can learn to generalize effectively from noisy data are similar or identical to statistical methods
- Single-layered feedforward nets are basically generalized linear models
- Two-layer feedforward nets are closely related to projection pursuit regression
- Probabilistic neural nets are identical to kernel discriminant analysis
- Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis
- Kohonen self-organizing maps are discrete approximations to principal curves and surfaces
- Hebbian learning is closely related to principal component analysis
Statistical Jargon
Generalizing from noisy data ............... Statistical inference
Neuron, unit, node ......................... A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof
Neural networks ............................ A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems
Architecture ............................... Model
Training, Learning, Adaptation ............. Estimation, Model fitting, Optimization
Classification ............................. Discriminant analysis
Mapping, Function approximation ............ Regression
Competitive learning ....................... Cluster analysis
Hebbian learning ........................... Principal components
Training set ............................... Sample, Construction sample
Input ...................................... Independent variables, Predictors, Regressors, Explanatory variables, Carriers
Output ..................................... Predicted values
Generalization ............................. Interpolation, Extrapolation, Prediction
Prediction ................................. Forecasting
MATLAB example
nn02_neuron_output
MATLAB example
nn02_custom_nn
MATLAB example
nnstart
Introduction
Pioneering neural network contributions
- McCulloch & Pitts (1943): the idea of neural networks as computing machines
- Rosenblatt (1958): proposed the perceptron as the first supervised learning model
- Widrow and Hoff (1961): least-mean-square learning as an important generalization of perceptron learning
Perceptron
Layer of McCulloch-Pits neurons with adjustable synaptic weights Simplest form of a neural network for classification of linearly separable patterns Perceptron convergence theorem for two linearly separable classes
Adaline
Similar to perceptron, trained with LMS learning Used for linear adaptive filters
Perceptron with two inputs:
y(v) = 1 if v >= 0, y(v) = 0 if v < 0
y = f(wx + b) = f(w1x1 + w2x2 + b)
Decision boundary:
w1x1 + w2x2 + b = 0
Geometric representation
The decision boundary is a line in the (x1, x2) plane:
x2 = -(w1/w2) x1 - b/w2
Perceptron learning rule:
w_j(n+1) = w_j(n) + Δw_j(n) = w_j(n) + e(n) x_j(n), j = 1, ..., R
b(n+1) = b(n) + Δb(n) = b(n) + e(n)
Training set: {(x_i, d_i)}, d_i ∈ {0, 1}
Objective:
Reduce the error e between target class d and neuron response y (error-correction learning):
e = d - y
Learning procedure:
1. Start with random weights for the connections
2. Present an input vector x_i from the set of training samples
3. If the perceptron response is wrong (y ≠ d, e ≠ 0), modify all connections w
4. Go back to 2
CASE 3: Neuron output is 1 instead of 0 (y=1, d=0, e=d-y=-1) Input x is subtracted from weight vector w
This makes the weight vector point farther away from the input vector, increasing the chance that the input vector will be classified as a 0 in the future.
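The whole procedure, including the case analysis above, can be sketched in a few lines (Python/NumPy rather than the course's MATLAB; the logical-AND data set is my own illustrative choice of a linearly separable problem):

```python
import numpy as np

def train_perceptron(X, d, epochs=20):
    # Perceptron learning rule: w <- w + e*x, b <- b + e, with e = d - y.
    # For e = -1 (y = 1, d = 0) the input is subtracted from the weights.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, d):
            y = 1.0 if w @ x + b >= 0 else 0.0
            e = t - y
            w += e * x
            b += e
    return w, b

# Linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, d)
pred = [1.0 if w @ x + b >= 0 else 0.0 for x in X]
```

The convergence theorem below guarantees that this loop stops changing the weights after finitely many corrections on linearly separable data.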
Convergence theorem
For the perceptron learning rule there exists a convergence theorem:
Theorem 1: If there exists a set of connection weights w which is able to perform the transformation d = y(x), the perceptron learning rule will converge to some solution in a finite number of steps for any initial choice of the weights.
Comments
- Theorem is only valid for linearly separable classes
- Outliers can cause long training times
- If classes are linearly separable, the perceptron offers a powerful pattern recognition tool
3.4 Adaline
ADALINE = ADAptive LINear Element
- Widrow and Hoff, 1961: LMS learning (least mean square) or Delta rule
- Important generalization of the perceptron learning rule
- Main difference with the perceptron: activation function
Perceptron: threshold activation function
ADALINE: linear activation function
Both the Perceptron and ADALINE can only solve linearly separable problems
Linear neuron
Basic ADALINE element
Linear transfer function
y = wx + b, y(v) = v
Simple ADALINE
Simple ADALINE with two inputs
y = f(wx + b) = w1x1 + w2x2 + b
Decision boundary (see Perceptron decision boundary):
w1x1 + w2x2 + b = 0
Training set: {(x_i, d_i)}
Objective: reduce the error e between target class d and neuron response y (error-correction learning):
e = d - y
Mean square error over the training set:
mse = (1/N) Σ_{n=1}^{N} e^2(n), e(n) = d(n) - y(n)
LMS changes the network weights proportionally to the negative derivative of the squared error:
Δw_j(n) = -η ∂e^2(n)/∂w_j
Using the chain rule,
∂e^2(n)/∂w_j = 2 e(n) ∂e(n)/∂w_j = 2 e(n) ∂[d(n) - y(n)]/∂w_j = -2 e(n) x_j(n)
we finally obtain the weight change at step n:
Δw_j(n) = 2η e(n) x_j(n)
Learning is regulated by a learning rate η. For stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix x^T x of the input vectors.
Limitations:
- A linear network can only learn linear input-output mappings
- Proper selection of the learning rate is required
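A small LMS/ADALINE sketch (Python/NumPy; the noiseless linear target and the learning rate are illustrative assumptions):

```python
import numpy as np

def train_lms(X, d, eta=0.05, epochs=100):
    # LMS (delta) rule for a linear neuron y = w.x + b:
    # dw = 2*eta*e*x, db = 2*eta*e, with e = d - y
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, d):
            e = t - (w @ x + b)
            w += 2 * eta * e * x
            b += 2 * eta * e
    return w, b

# Noiseless linear target d = 2*x1 - x2 + 0.5 (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.5
w, b = train_lms(X, d)
# w converges towards [2, -1] and b towards 0.5
```

The chosen η = 0.05 satisfies the stability condition here, since the inputs lie in [-1, 1].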
Learning rules:
LMS learning: w_j(n+1) = w_j(n) + 2η e(n) x_j(n), b(n+1) = b(n) + 2η e(n)
Perceptron learning: w_j(n+1) = w_j(n) + e(n) x_j(n), b(n+1) = b(n) + e(n)
Adaptive filter
Adaptive filter = ADALINE combined with TDL (tapped delay line):
a(k) = Wp + b = Σ_i w_i p(k - i + 1) + b
Operation
Learning
#131
Discriminant function:
x2 = -(w1/w2) x1 - b/w2
XOR solution
Extending single-layer perceptron to multi-layer perceptron by introducing hidden units
(Figure: two-layer perceptron with inputs x1, x2, hidden units with weights w1,1, w1,2 and bias b1, and an output unit with weights w2,1, w2,2, w2,3 and bias b2, solving the XOR problem)
The XOR problem can be solved, but we no longer have a learning rule to train the network. Multilayer perceptrons can do everything, but how to train them?
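As a sketch of why hidden units help (plain Python; the OR/AND weight construction below is my own illustrative choice, not necessarily the one shown on the slide):

```python
def step(v):
    # Threshold (perceptron) activation
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR
    h2 = step(x1 + x2 - 1.5)        # hidden unit 2: AND
    return step(h1 - h2 - 0.5)      # output: OR and not AND = XOR

truth = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
```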
Homework
Create a two-layer perceptron to solve XOR problem
Create a custom network Demonstrate solution
4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
Introduction
Single-layer networks have severe restrictions
Only linearly separable tasks can be solved
Werbos (1974), Parker (1985), Le Cun (1985), Rumelhart (1986)
Solved the problem of training multi-layer networks by back-propagating the output errors through the hidden layers of the network
Sigmoid activation: y(v) = 1 / (1 + exp(-v))
3. Massive connectivity
Neurons in successive layers are fully interconnected
Matlab demo
nnd11nf Response of the feedforward network with one hidden layer
About backpropagation
Multilayer perceptrons can be trained by backpropagation learning rule
- Based on the error-correction learning rule
- Generalization of the LMS learning rule (used to train ADALINE)
2. Backward pass
Error signal is propagated backwards from output to input Synaptic weights are adjusted according to the error gradient
Training set: {(x_n, d_n)}, n = 1, ..., N
Instantaneous error energy:
E(n) = (1/2) Σ_j e_j^2(n)
Average error energy:
E_av = (1/N) Σ_{n=1}^{N} E(n)
Learning objective is to minimize average error energy E by minimizing free network parameters We use an approximation: pattern-by-pattern learning instead of epoch learning
Parameter adjustments are made for each pattern presented to the network Minimizing instantaneous error energy at each step instead of average error energy
For output neuron j: induced local field v_j, output y_j, error e_j = d_j - y_j
E(n) = (1/2) Σ_{j=1}^{R} e_j^2(n)
Chain rule for the error gradient:
∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)
with
y_j(n) = f(v_j(n)), v_j(n) = Σ_{i=0}^{R} w_ji(n) y_i(n)
∂y_j(n)/∂v_j(n) = f'(v_j(n)), ∂v_j(n)/∂w_ji(n) = y_i(n)
Delta rule, weight correction:
Δw_ji(n) = η δ_j(n) y_i(n)
where η is the learning rate and δ_j(n) is the local gradient.
Δw_ji(n) = -η ∂E(n)/∂w_ji(n)
Local gradient:
δ_j(n) = -∂E(n)/∂v_j(n) = -∂E(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) = -∂E(n)/∂y_j(n) · f'(v_j(n))
For a hidden neuron j, ∂E(n)/∂y_j(n) must be propagated back from the output layer:
E(n) = (1/2) Σ_k e_k^2(n)
∂E(n)/∂y_j(n) = Σ_k e_k(n) ∂e_k(n)/∂y_j(n)
e_k(n) = d_k(n) - y_k(n) = d_k(n) - f(v_k(n)), v_k(n) = Σ_{j=0}^{M} w_kj(n) y_j(n)
∂E(n)/∂y_j(n) = -Σ_k e_k(n) f'(v_k(n)) w_kj(n) = -Σ_k δ_k(n) w_kj(n)
Hidden-layer local gradient:
δ_j(n) = -∂E(n)/∂y_j(n) · f'(v_j(n)) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
Summary of the delta rule:
Δw_ji(n) = η δ_j(n) y_i(n)
(weight correction = learning rate × local gradient × input of neuron j)
Signal flow: x_i → w_ji → v_j → y_j → w_kj → v_k → y_k
δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
1. Forward pass
The training sample propagates from input to output:
v_j = Σ_i w_ji x_i, y_j = f(v_j); v_k = Σ_j w_kj y_j, y_k = f(v_k)
2. Backward pass
Recursive computing of local gradients:
- Output local gradients: δ_k(n) = e_k(n) f'(v_k(n))
- Hidden layer local gradients: δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
Weight updates: Δw_kj(n) = η δ_k(n) y_j(n), Δw_ji(n) = η δ_j(n) x_i(n)
3. Forward pass
Propagate training sample from network input to the output Calculate the error signal
4. Backward pass
Recursive computation of local gradients from output layer toward input layer Adaptation of synaptic weights according to generalized delta rule
5. Iteration
Iterate steps 2-4 until stopping criterion is met
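The steps above can be condensed into a small sequential-backpropagation sketch (Python/NumPy; the network size, random seed, learning rate, and the logical-AND training set are illustrative assumptions, AND being chosen so plain gradient descent converges reliably):

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# 2-2-1 network with logistic activations
W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)
W2 = rng.normal(0, 1, (1, 2)); b2 = np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [0], [0], [1]], dtype=float)   # logical AND targets
eta = 0.5

def sse():
    # Sum of squared errors over the whole training set
    H = logistic(X @ W1.T + b1)
    Y = logistic(H @ W2.T + b2)
    return float(((D - Y) ** 2).sum())

e0 = sse()
for epoch in range(2000):
    for x, d in zip(X, D):
        # 1. forward pass
        v1 = W1 @ x + b1; y1 = logistic(v1)
        v2 = W2 @ y1 + b2; y2 = logistic(v2)
        e = d - y2
        # 2. backward pass: local gradients (f' = y(1-y) for logistic)
        delta2 = e * y2 * (1 - y2)                 # output layer
        delta1 = y1 * (1 - y1) * (W2.T @ delta2)   # hidden layer
        # 3. weight update: dw = eta * delta * input
        W2 += eta * np.outer(delta2, y1); b2 += eta * delta2
        W1 += eta * np.outer(delta1, x);  b1 += eta * delta1
e1 = sse()   # error after training is lower than before
```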
Matlab demo
nnd11bc Backpropagation calculation
Matlab demo
nnd12sd1 Steepest descent
For a linear neuron, f(v(n)) = v(n) and f'(v(n)) = 1.
The backpropagation rule then reduces to the LMS rule:
Δw_i(n) = η δ(n) y_i(n), y_i = x_i(n)
δ(n) = e(n) f'(v(n)) = e(n)
Δw_i(n) = η e(n) x_i(n)
Batch training
Weight updating after the presentation of a complete epoch
Sequential training
Weight updating after the presentation of each training example Stochastic nature of learning, faster convergence Important practical reasons for sequential learning:
Algorithm is easy to implement Provides effective solution to large and difficult problems
Therefore, sequential training is the preferred training mode. Good practice is a random order of presentation of the training examples.
Activation function
The derivative of the activation function, f'(v_j(n)), is required for the computation of local gradients.
The only requirement for the activation function is differentiability. Commonly used is the logistic function:
f(v_j(n)) = 1 / (1 + exp(-a v_j(n))), a > 0
Its derivative can be expressed in terms of the output alone:
f'(v_j(n)) = a y_j(n) [1 - y_j(n)], where y_j(n) = f(v_j(n))
The local gradient can thus be calculated without explicit knowledge of the activation function.
Example: approximating a periodic function by a sum Σ_k c_k sin(kx)
- Equivalent to traditional Fourier analysis
- A network with sin() activation functions can be trained by backpropagation
Learning rate
Learning procedure requires
Change in the weight space to be proportional to error gradient True gradient descent requires infinitesimal steps
Learning in practice
Δw_ji(n) = η δ_j(n) y_i(n); the factor of proportionality η is the learning rate
Choose a learning rate as large as possible without leading to oscillations
(Figure: learning curves for learning rates 0.010, 0.035, 0.040)
Stopping criteria
Generally, backpropagation cannot be shown to converge; there are no well-defined criteria for stopping its operation.
2. Activation function
Faster learning with antisymmetric sigmoid activation functions. A popular choice is the hyperbolic tangent.
#165
4. Preprocessing inputs
a) Normalizing mean to zero b) Decorrelating input variables (by using principal component analysis) c) Scaling input variables (variances should be approx. equal)
(Figure: input distribution before and after preprocessing: original, (a) zero mean, (b) decorrelated, (c) equalized variance)
Generalization
Neural network is able to generalize:
- The input-output mapping computed by the network is correct for test data (test data were not used during training and are from the same population as the training data)
- Correct response even if the input is slightly different from the training examples
(Figure: overfitting vs. good generalization)
Improving generalization
Methods to improve generalization
1. Keeping the network small 2. Early stopping 3. Regularization
Early stopping
Available data are divided into three sets:
1. Training set: used to train the network
2. Validation set: used for early stopping, when the error starts to increase
3. Test set: used for final estimation of network performance and for comparison of various models
Regularization
Improving generalization by regularization
Modifying the performance function:
mse = (1/N) Σ_{n=1}^{N} (d_j(n) - y_j(n))^2
msw = (1/M) Σ_{j=1}^{M} w_j^2
msereg = γ·mse + (1 - γ)·msw
Using this performance function, network will have smaller weights and biases, and this forces the network response to be smoother and less likely to overfit
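A sketch of the regularized performance function (Python/NumPy; the sample targets, outputs, and weights are made up for illustration):

```python
import numpy as np

def msereg(d, y, weights, gamma=0.9):
    # Regularized performance: msereg = gamma*mse + (1-gamma)*msw
    mse = np.mean((np.asarray(d) - np.asarray(y)) ** 2)
    w = np.concatenate([np.ravel(wi) for wi in weights])
    msw = np.mean(w ** 2)
    return gamma * mse + (1 - gamma) * msw

val = msereg(d=[1.0, 0.0], y=[0.8, 0.1],
             weights=[np.array([0.5, -0.5]), np.array([1.0])])
# mse = 0.025, msw = 0.5, so val = 0.9*0.025 + 0.1*0.5 = 0.0725
```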
Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally useful: 1. Long training process
Possibly due to non-optimum learning rate (advanced algorithms address this problem)
2. Network paralysis
Combination of sigmoidal activation and very large weights can decrease gradients almost to zero, so training almost stops
3. Local minima
Error surface of a complex network can be very complex, with many hills and valleys Gradient methods can get trapped in local minima Solutions: probabilistic learning methods (simulated annealing, ...)
Momentum
A simple method of increasing the learning rate yet avoiding the danger of instability: the delta rule is modified by adding a momentum term (momentum constant α):
Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n-1)
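A tiny numeric sketch of the momentum term (plain Python; the constant gradient term and the values of η and α are illustrative):

```python
def momentum_step(grad_term, prev_dw, eta=0.1, alpha=0.9):
    # dw(n) = eta * delta_j(n) * y_i(n) + alpha * dw(n-1)
    return eta * grad_term + alpha * prev_dw

dw = 0.0
for _ in range(200):
    dw = momentum_step(1.0, dw)
# for a constant gradient term g, the step accumulates towards
# eta * g / (1 - alpha): momentum effectively enlarges the step
# on consistent gradients while damping oscillating ones
```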
(Adaptive learning rate heuristic: if the error increases by more than 4%, the update is discarded and the learning rate is multiplied by 0.7; otherwise the learning rate is multiplied by 1.05)
Resilient backpropagation
The slope of sigmoid functions approaches zero as the input gets large. This causes a problem when using steepest descent to train the network: the gradient can have a very small magnitude, so the changes in weights are small, even though the weights are far from their optimal values.
Resilient backpropagation
Eliminates these harmful effects of the magnitudes of the partial derivatives Only sign of the derivative is used to determine the direction of weight update, size of the weight change is determined by a separate update value Resilient backpropagation rules:
1. The update value for each weight and bias is increased by a factor inc if the derivative of the performance function with respect to that weight has the same sign for two successive iterations
2. The update value is decreased by a factor dec if the derivative with respect to that weight changes sign from the previous iteration
3. If the derivative is zero, the update value remains the same
4. If the weights are oscillating, the weight change is reduced
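The four rules above can be sketched as follows. This is a simplified illustrative Python sketch (the classic Rprop algorithm additionally zeroes the stored gradient after a sign change, which is omitted here); the factor values 1.2 and 0.5 and the step bounds are common defaults, not taken from the slides.

```python
import numpy as np

def rprop_step(w, grad, grad_prev, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One simplified Rprop update: only the SIGN of the gradient is used."""
    same = grad * grad_prev > 0          # rule 1: same sign -> grow the step
    changed = grad * grad_prev < 0       # rule 2: sign change -> shrink the step
    step = np.where(same, np.minimum(step * inc, step_max), step)
    step = np.where(changed, np.maximum(step * dec, step_min), step)
    return w - np.sign(grad) * step, step   # move by the step size, not magnitude

w, step, g_prev = np.array([4.0]), np.array([0.1]), np.zeros(1)
for _ in range(100):
    g = 2 * w                            # gradient of E(w) = w**2
    w, step = rprop_step(w, g, g_prev, step)
    g_prev = g
print(float(abs(w[0])))                  # close to the minimum at 0
```

Because only the sign of the derivative is used, tiny gradient magnitudes on the flat tails of sigmoids no longer slow the weight updates down.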
#175
[Figure: error surface E(w1, w2) over weights w1 and w2]
#176
Second-order Taylor expansion of the error E(w) around w(n):

E(w(n) + Δw) ≈ E(w(n)) + gᵀ(n)·Δw + ½·Δwᵀ·H(n)·Δw

Local gradient:
g(n) = ∂E(w)/∂w, evaluated at w = w(n)

Hessian matrix:
H(n) = ∂²E(w)/∂w², evaluated at w = w(n)
#177
Gradient descent:  Δw(n) = −η·g(n)
Newton's method:   Δw(n) = −H⁻¹(n)·g(n)
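The two update rules can be compared on a quadratic error surface, where Newton's method is exact in a single step. This is an illustrative Python sketch; the matrix A (playing the role of the Hessian), the learning rate, and the starting point are arbitrary choices.

```python
import numpy as np

# Quadratic error surface E(w) = 0.5 * w^T A w  (A is the constant Hessian)
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def grad(w):
    return A @ w                       # g(n) = A w for this quadratic

w_gd = np.array([2.0, -1.0])
w_newton = w_gd.copy()

# Gradient descent: dw = -eta * g(n), many small steps
for _ in range(50):
    w_gd = w_gd - 0.1 * grad(w_gd)

# Newton's method: dw = -H^{-1} g(n)  -> solves a quadratic in one step
w_newton = w_newton - np.linalg.solve(A, grad(w_newton))

print(np.allclose(w_newton, 0.0))      # exact minimum after a single step
```

On a non-quadratic error surface Newton's method is no longer exact, which is one of the problems listed below for Hessian-based training.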
#178
Quasi-Newton algorithms
Problems with the calculation of Hessian matrix
The inverse Hessian H⁻¹ is required, which is computationally expensive
The Hessian has to be nonsingular, which is not guaranteed
The Hessian of a neural network can be rank deficient
No convergence guarantee for non-quadratic error surfaces
Quasi-Newton method
Only requires calculation of the gradient vector g(n) The method estimates the inverse Hessian directly without matrix inversion Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm
Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of quasi-Newton algorithm!
#179
#180
Levenberg-Marquardt algorithm
Levenberg-Marquardt algorithm (LM)
Like the quasi-Newton methods, the LM algorithm was designed to approach second-order training speed without having to compute the Hessian matrix
When the performance function has the form of a sum of squares (typical in neural network training), the Hessian matrix H can be approximated by the Jacobian matrix J:
H ≈ JᵀJ
where Jacobian matrix contains first derivatives of the network errors with respect to the weights Jacobian can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix
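One LM iteration can be sketched as follows. This is an illustrative Python sketch with a toy linear model standing in for a network; the damping parameter mu is held fixed (real implementations adapt it between iterations), and the data and parameter values are arbitrary choices.

```python
import numpy as np

def lm_step(J, e, w, mu=0.01):
    """One Levenberg-Marquardt update: dw = -(J^T J + mu*I)^{-1} J^T e."""
    H_approx = J.T @ J                          # Hessian approximation
    g = J.T @ e                                 # gradient of 0.5*||e||^2
    return w - np.linalg.solve(H_approx + mu * np.eye(len(w)), g)

# Toy model y = w0 + w1*x fitted to noiseless data; errors e = y_hat - d
x = np.array([0.0, 1.0, 2.0, 3.0])
d = 1.0 + 2.0 * x
w = np.zeros(2)
for _ in range(50):
    y_hat = w[0] + w[1] * x
    e = y_hat - d
    J = np.column_stack([np.ones_like(x), x])   # de/dw0, de/dw1
    w = lm_step(J, e, w, mu=1e-3)
print(np.allclose(w, [1.0, 2.0], atol=1e-4))
```

For mu → 0 the step approaches the Gauss-Newton step; for large mu it approaches a small gradient-descent step, which is what makes the method robust.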
#181
#182
#183
#184
A learning set with 4 samples has small training error but gives very poor generalization
A learning set with 20 samples has higher training error but generalizes well
Low training error is no guarantee of good network performance!
#185
A large number of hidden units leads to a small training error but not necessarily to a small test error
Adding hidden units always leads to a reduction of the training error
However, adding hidden units will first lead to a reduction of the test error but then to an increase of the test error (peaking effect; early stopping can be applied)
#186
#187
Matlab demo
nnd11fa Function approximation, variable number of hidden units
#188
Matlab demo
nnd11gn Generalization, variable number of hidden units
#189
#190
5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 Layer recurrent network
5.5 NARX network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control
#191
Introduction
Time
An essential ingredient of the learning process Important for many practical tasks: speech, vision, signal processing, control
#192
Introduction
How can we build time into the operation of neural networks?
Extending static neural networks into dynamic neural networks → networks become responsive to the temporal structure of input signals
Networks become dynamic by adding
TEMPORAL MEMORY
and/or
FEEDBACK
Feedback loop
#193
#194
Memory
Memory
Long-term memory
Acquired through supervised learning and stored in synaptic weights
Short-term memory
Temporal memory, useful to capture the temporal dimension
Implemented as time delays at various parts of the network
Long-term memory
#195
Tapped delay line (TDL):

TDL(x(n)) = [x(n), x(n−1), ..., x(n−N)]
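The tapped delay line above can be sketched as a small buffer. This is an illustrative Python sketch (class name and initial-value choice are mine, not from the slides):

```python
from collections import deque

class TappedDelayLine:
    """Short-term memory: TDL(x(n)) = [x(n), x(n-1), ..., x(n-N)]."""
    def __init__(self, n_delays, initial=0.0):
        # buffer of length N+1, pre-filled with an initial value
        self.buf = deque([initial] * (n_delays + 1), maxlen=n_delays + 1)
    def __call__(self, x):
        self.buf.appendleft(x)          # newest sample first, oldest drops out
        return list(self.buf)

tdl = TappedDelayLine(n_delays=2)
tdl(1.0)          # returns [1.0, 0.0, 0.0]
tdl(2.0)          # returns [2.0, 1.0, 0.0]
print(tdl(3.0))   # returns [3.0, 2.0, 1.0]
```

Feeding such a buffer into a static feedforward network is exactly what turns it into a focused time-delay neural network.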
#196
Jordan (1986)
#197
Hopfield network
Hopfield network (Hopfield, 1982)
The network consists of N interconnected neurons which update their activation values asynchronously and independently of other neurons
All neurons are both input and output neurons
Activation values are binary (−1, +1)
Multiple-loop feedback system → interesting to study the stability of the system
Primary applications:
Associative memory
Solving optimization problems
MATLAB example: demohop1.m
#198
Jordan network
Jordan network (Jordan, 1986)
Network outputs are fed back as extra inputs (state units)
Each state unit is fed with one network output
The connections from output to state units are fixed (+1)
Learning takes place only in the connections from input to hidden units and from hidden to output units
The standard backpropagation learning rule can be applied to train the network
#199
Elman network
Elman network (Elman, 1990)
Similar to the Jordan network, with the following differences:
1. Hidden units are fed back (instead of output units)
2. Context units have no self-connections
context units
#200
#201
Unknown world
Prediction horizon = 1
#202
Past
Now
Future
#203
Objective
Design Focused time-delay neural network for recursive one-step-ahead predictor Fixed network parameters
Number of hidden layers: 1
Hidden layer activation func.: Logistic
Output layer activation func.: Linear
#204
Results
(A)
#205
(B)
(C)
#206
#207
Temporal backpropagation
Backpropagation algorithm
Suitable for static networks and focused time-delay neural networks
Temporal backpropagation
Supervised learning algorithm
Extension of backpropagation
Required for distributed time-delay neural networks
Computationally demanding
#208
Example (1/2)
Wan (1994): Time series prediction by using a connectionist network with internal delay lines
Winner of the Santa Fe Institute Time-Series Competition, USA (1992) Task: Nonlinear prediction of a nonstationary time series exhibiting chaotic pulsations of NH3 laser
#209
Example (2/2)
Prediction results
#210
#211
Elman (1990)
#212
Example (1/3)
Phoneme detection problem
Recognition of various frequency components
#213
Example (2/3)
Network training
Successful recognition of two phonemes
#214
Example (3/3)
Network testing
Unreliable generalization, works only on trained phonemes
OK
OK
#215
The output y is a nonlinear function of past outputs and past inputs
The nonlinear function f can be implemented by a neural network
#216
NARX structure
NARX network with global feedback
#217
#218
Example (1/5)
Problem: Magnetic levitation
Objective
to control the position of a magnet suspended above an electromagnet, where the magnet can only move in the vertical direction
Equation of motion
y(t) = distance of the magnet above the electromagnet
i(t) = current flowing in the electromagnet
M = mass of the magnet
g = gravitational constant
β = viscous friction coefficient (determined by the material in which the magnet moves)
α = field strength constant, determined by the number of turns of wire on the electromagnet and the strength of the magnet
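The equation of motion itself did not survive in this copy of the slides; the form below is the one commonly quoted for this magnetic levitation benchmark, reconstructed from the variable definitions above (with α the field strength constant and β the viscous friction coefficient):

```latex
\frac{d^2 y(t)}{dt^2} \;=\; -g \;+\; \frac{\alpha}{M}\,\frac{i^2(t)}{y(t)} \;-\; \frac{\beta}{M}\,\frac{dy(t)}{dt}
```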
#219
Example (2/5)
Data
Sampling interval: 0.01 sec Input: current i(t) Output: magnet position y(t)
#220
Example (3/5)
Series-parallel training results for NARX network
#221
Example (4/5)
Parallel recursive prediction (1000 steps)
#222
Example (5/5)
Possible learning results: unstable learning, local minima
Case A: OK
Case B: Unstable
Case C: Local minimum
#223
#224
Theorems
Theorem I (Siegelmann & Sontag, 1991)
All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions. (A Turing machine is a theoretical abstraction that is functionally as powerful as any computer, see http://aturingmachine.com )
2. Continuous training
Suitable if no reset states are available or online learning is required
The network learns while it is performing signal processing
The learning process never stops
METHOD: Real-time recurrent learning
#226
#227
#228
Generalization
#229
Unknown nonlinear dynamical process → dynamic neural networks can be used as the identification model
Two basic identification approaches: 1. System identification using state-space model 2. System identification using input-output model
#230
x(n+1) = f( x(n), u(n) )
y(n) = h( x(n) )
f, h : unknown nonlinear vector functions Two dynamic neural networks can be used to approximate f and h
#231
Both networks are trained by gradient descent minimizing error signals eI and eII
#232
f is an unknown nonlinear vector function
The input-output formulation is equivalent to the NARX formulation
A NARX neural network can be used to approximate f
q past inputs and outputs should be available
#233
#234
#235
[Figure: model reference adaptive control: controller output u_c(n), reference model output d(n+1)]
#236
#237
Summary
Focused time-delay neural network Distributed time-delay neural network
NARX network
#238
#239
Introduction
RBFN = Radial Basis Function Network
A new class of neural networks
Multilayer perceptrons → output is a nonlinear function of the scalar product of the input vector and the weight vector
RBFN → activation of a hidden unit is determined by the distance between the input vector and a prototype vector
#240
#241
RBFN properties
Two-stage training procedures
1. Training of hidden layer weights 2. Training of output layer weights
#242
Basis functions are nonlinear, and depend on the distance measure between input x and stored prototype xp
||x − x_p|| = sqrt( (x_1 − x_1ᵖ)² + ... + (x_M − x_Mᵖ)² )
#243
Goal is to find the weights wp such that the function goes through all data points
Provided that the inverse of the interpolation matrix Φ exists, the weights are obtained by any standard matrix inversion technique: w = Φ⁻¹·d
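Exact interpolation as described above can be sketched in a few lines. This is an illustrative Python sketch; the Gaussian basis, the unit width, and the toy data points are arbitrary choices.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

# Training data: the interpolating function must pass through every point
x = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1.0, 3.0, 2.0, 0.5])

# Interpolation matrix: Phi[i, p] = phi(||x_i - x_p||), one basis per data point
Phi = gaussian(np.abs(x[:, None] - x[None, :]))
w = np.linalg.solve(Phi, d)                 # w = Phi^{-1} d

def rbf(x_new):
    return gaussian(np.abs(x_new - x)) @ w

print(np.allclose([rbf(xi) for xi in x], d))   # exact interpolation
```

For Gaussian basis functions on distinct data points the matrix Φ is nonsingular, so the linear solve always succeeds on this toy problem.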
#244
The solution represents a continuous differentiable surface that passes exactly through each data point
Both theoretical and empirical studies confirm (in the context of exact interpolation) that many properties of the interpolating function are relatively insensitive to the precise form of the basis functions
1. Gaussian
φ(r) = exp( −r² / (2σ²) ),  r = ||x − x_p||

2012 Primož Potočnik
2. Multi-Quadratic
3. Generalized Multi-Quadratic
4. Inverse Multi-Quadratic
#246
7. Cubic
8. Linear
#247
||x − x_p|| = sqrt( (x_1 − x_1ᵖ)² + ... + (x_M − x_Mᵖ)² )
#248
The localised property is not strictly necessary → all the other functions (Multi-Quadratic, Cubic, Linear, ...) are not localised
Note that even the Linear function is still nonlinear in the components of x
In one dimension, this leads to a piecewise-linear interpolating function which performs the simplest form of exact interpolation
For neural network mappings, there are good reasons for preferring localised basis functions → we will focus on Gaussian basis functions
#249
Solution approach
Superposition of Gaussian radial basis functions
#250
[Figure: Gaussian RBF interpolation with σ = 0.02 and σ = 20]
#251
The N training inputs directly determine the hidden layer prototypes (centers of the hidden layer neurons)
Training inputs and outputs also directly determine the output weights
#252
2. The number of basis functions is equal to the number of data patterns → exact RBF networks are not computationally efficient
#253
#254
Improved RBFN
Including the proposed changes + expanding to the multidimensional output
#255
[Figure: RBF network architecture: centers, widths, output weights, biases]
#256
As with the corresponding proofs for MLPs, RBFN proofs rely on the availability of an arbitrarily large number of hidden units (i.e. basis functions)
However, the proofs provide a theoretical foundation on which practical applications can be based with confidence
#257
The hidden layer can be trained by unsupervised methods (random selection, clustering, ...)
The output layer has linear activation → output weights are determined analytically by solving a set of linear equations
Gradient descent learning is not needed for RBFN, therefore training is very fast!
#258
It is also possible to perform a full supervised non-linear optimization of the network instead
#259
The widths σ_j are all related in the same way to the maximum or average distance between the chosen centres
Common choices are, e.g., σ = d_max / sqrt(2K) (d_max = maximum distance between the chosen centres, K = number of centres),
which ensure that the individual RBFs are neither too wide nor too narrow for the given training data
For large training sets, this approach gives reasonable results
#260
To get good generalization we generally use cross-validation to stop the process when an appropriate number of data points have been selected as centers
#261
K-means clustering
A potentially even better approach is to use clustering techniques to find a set of centres which more accurately reflects the distribution of the data points
K-Means Clustering Algorithm:
Select the number of centres (K) in advance
Apply a simple re-estimation procedure to partition the data points {x_p} into K disjoint subsets S_j containing N_j data points, so as to minimize the sum-squared clustering function
Once the basis centres have been determined in this way, the widths can then be set according to the variances of the points in the corresponding cluster
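The two-stage hybrid procedure (unsupervised centres, then a linear solve for the output weights) can be sketched as follows. This is an illustrative Python sketch: the plain K-means implementation, the sin(2x) target, the number of centres, and the small width offset are all arbitrary choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=20):
    """Plain K-means: re-estimate centres to minimise the clustering function."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels

# Stage 1: unsupervised placement of centres, widths from the cluster spread
X = rng.standard_normal((60, 1))
d = np.sin(2 * X[:, 0])
k = 8
centers, labels = kmeans(X, k)
sigma = np.array([X[labels == j].std() + 0.1 if np.any(labels == j) else 1.0
                  for j in range(k)])

# Stage 2: output weights by linear least squares, no gradient descent needed
dist = np.abs(X[:, 0][:, None] - centers[:, 0][None, :])
Phi = np.exp(-dist**2 / (2 * sigma[None, :] ** 2))
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
print(np.abs(Phi @ w - d).max())   # training error of the two-stage fit
```

Unlike exact interpolation, here there are far fewer basis functions than data points, so the output weights are a least-squares fit rather than an exact solve.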
#262
#263
#264
#265
Supervised RBFN training would iteratively update the weights (basis function parameters) using gradients
#266
The gradient formulas become very complex and therefore computationally inefficient
Additionally, we get all the problems of choosing the learning rates, avoiding local minima, ... that we had for training MLPs by backpropagation
And there is a tendency for the basis function widths to grow large, leaving non-localised basis functions
#267
#268
The regularization parameter determines the relative importance of smoothness compared with error
The differential operator P can have many possible forms, but the general idea is that mapping functions with large curvature should yield a large regularization term and hence contribute a large penalty to the total error function
#269
Where to start?
Two stage hybrid training with K-means clustering and linear matrix operation for output layer
#270
A multilayer perceptron can separate classes by using hidden units to form hyperplanes in the input space
An alternative approach is to model the separate class distributions by localised radial basis functions
#271
The underlying justification for using RBFN for classification is found in Cover's theorem, which states:
"A complex pattern classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space."
Once we have linearly separable patterns, the classification problem can be solved by a linear layer
#272
#273
#274
#275
PNN example 1
Three training patterns PNN division of the input space
#276
#277
#278
#279
#280
PNN considerations
Probabilistic neural networks are specialized to classification (less general than RBFN or MLP)
PNN are sensitive to the selection of the spread parameter → spread can be optimized by the leave-one-out cross-validation technique:
1. Leave one training sample out, train the PNN and test on the omitted sample
2. Repeat the procedure for all samples and save the results
3. Find the optimal spread that yields minimal average classification error
Benefits
Little or no training required (except spread optimization)
Besides classification, PNN also provides Bayesian posterior probabilities → a solid theoretical foundation to support confidence estimates for the network's decisions
Robust against outliers → outliers have no real effect on decisions
Drawbacks
PNN performance depends strongly on a thoroughly representative training set
The entire training set must be stored → large memory and poor execution speed
#281
#282
y = a·x + b

Given the training data, the slope a and bias b are computed as follows
Compute the sums of squares:

SS_x = Σ_i (x_i − x̄)²
SS_xy = Σ_i (x_i − x̄)·(y_i − ȳ)

Then:

a = SS_xy / SS_x
b = ȳ − a·x̄
Resulting linear equation will minimize mean squared error of predicted values y in the training set
#283
Multiple regression
Several independent variables x1, x2, x3, ...
y = a₁·x₁ + a₂·x₂ + a₃·x₃ + b

Matrix notation:

x = [x₁, x₂, x₃, 1]
y = x·a
Y = X·a

Normal equations:  XᵀX·a = XᵀY
Final solution is usually obtained numerically by singular value decomposition method (SVD)
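The SVD-based solve can be sketched as follows. This is an illustrative Python sketch using NumPy's `lstsq`, which solves the least-squares problem via SVD; the noiseless toy data and the coefficient values are arbitrary choices.

```python
import numpy as np

# Multiple regression y = a1*x1 + a2*x2 + b, solved as Y = X a (bias column of 1s)
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 50)
x2 = rng.uniform(-1, 1, 50)
Y = 2.0 * x1 - 1.0 * x2 + 0.5                    # noiseless targets

X = np.column_stack([x1, x2, np.ones_like(x1)])  # rows [x1, x2, 1]

# lstsq minimises ||X a - Y|| via singular value decomposition (SVD)
a, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(a, [2.0, -1.0, 0.5]))
```

The SVD route is preferred over forming XᵀX explicitly because it remains numerically stable when X is ill-conditioned or rank deficient.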
#284
The joint density function f_xy(x, y) is not known, but it can be approximated by a Parzen estimator
By using the Parzen approximator with Gaussian kernels, we obtain the equation for the GRNN predictor
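The resulting predictor is a normalized sum of Gaussian kernels centred at the training inputs. A minimal illustrative Python sketch (the toy one-dimensional data and the spread value are arbitrary choices, not from the slides):

```python
import numpy as np

def grnn_predict(x_new, X_train, y_train, spread=0.5):
    """GRNN: normalized sum of Gaussian kernels centred at training inputs."""
    d2 = (x_new - X_train) ** 2
    k = np.exp(-d2 / (2 * spread**2))        # kernel activations
    return (k @ y_train) / k.sum()           # normalization in the denominator

X_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

print(grnn_predict(1.0, X_train, y_train, spread=0.05))  # ~1.0 near a stored point
```

With a small spread the prediction follows the nearest training target; with a large spread it approaches a smooth average over all targets, which is the spread trade-off mentioned below.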
#285
GRNN properties
GRNN closely resembles RBFN with a normalization term in the denominator → it is sometimes called Normalized RBFN
GRNN also resembles PNN but is used for regression (function approximation), not for classification
The width parameter spread must be selected, as in all RBF networks
The first layer has Gaussian kernels located at each training case and computes distances from the input vector to the training input vectors (prototypes)
The second layer is a special linear layer with a normalization operator
Normalization makes GRNN a very robust predictor
#286
GRNN architecture
Standard radial basis layer → normalization → linear layer
#287
#288
#289
#290
Summary
RBFN PNN
GRNN
#291
#292
7. Self-Organizing Maps
Self-organization Self-organizing maps SOM algorithm Properties of the feature map SOM discussion & examples
#293
Introduction
1. So far we have discussed a number of networks which were trained to perform a mapping INPUTS → OUTPUTS, which corresponds to the supervised learning paradigm
2. However, problems exist where target outputs are not available → the only information is provided by a set of input patterns: INPUTS → ??, which corresponds to the unsupervised learning paradigm
#294
Examples of problems
Clustering
Input data are grouped into clusters → for any input, the neural net should return the corresponding cluster label
Vector quantization
A continuous space has to be discretised → the neural net has to find an optimal discretisation of the input space
Dimensionality reduction
Input data are grouped in a subspace with lower dimensionality than the original data Neural net has to learn an optimal mapping such that most of the variance in the input data is preserved in the output data
Feature extraction
System has to extract features from the input signal this often means a dimensionality reduction as described above
#295
7.1 Self-organization
What is self-organization?
The system structure appears without explicit pressure or involvement from outside the system
Constraints on the form (i.e. organization) of interest to us are internal to the system, resulting from the interactions among the components
The organization can evolve in either time or space, maintain a stable form, or show transient phenomena
#296
Self-organization properties
Typical features include (in rough order of generality)
Autonomy (absence of external control)
Dynamic operation (evolution in time)
Fluctuations (noise / searches through options)
Symmetry breaking (loss of freedom)
Global order (emergence from local interactions)
Dissipation (energy usage / far-from-equilibrium)
Instability (self-reinforcing choices / nonlinearity)
Multiple equilibria (many possible attractors)
Criticality (threshold effects / phase changes)
Redundancy (insensitivity to damage)
Self-maintenance (repair / reproduction metabolisms)
Adaptation (functionality / tracking of external variations)
Complexity (multiple concurrent values or objectives)
Hierarchies (multiple nested self-organized levels)
#297
#298
#299
Neurobiological motivation
Neurobiological studies indicate that different sensory inputs (tactile, visual, auditory, etc.) are mapped onto different areas of the cerebral cortex in an ordered fashion
This form of a map, known as a topographic map, has two important properties:
1. At each stage of representation, or processing, each piece of incoming information is kept in its proper context / neighbourhood
2. Neurons dealing with closely related pieces of information are kept close together, so that they can interact via short synaptic connections
Our interest is in building artificial topographic maps that learn through self-organization in a neurobiologically inspired manner
We shall follow the principle of topographic map formation: the spatial location of an output neuron in a topographic map corresponds to a particular domain or feature drawn from the input space
#300
#301
#302
Kohonen network
Kohonen (1982): Self-organized formation of topologically correct feature maps. Biological Cybernetics
The Kohonen network or Self-Organizing Map (SOM) has a single computational layer arranged in rows and columns
1D, 2D, 3D
Each neuron is fully connected to all source nodes in the input layer
#303
SOM architecture
Calculating the distance between inputs and neurons: dist
Competitive layer: selection of a winning neuron and its neighborhood: dist, linkdist, mandist, boxdist
#304
x = [x₁, x₂, ..., x_m]ᵀ
Synaptic weight vector of each neuron in the network has the same dimension as input space
w_j = [w_j1, w_j2, ..., w_jm],  j = 1, ..., K
The best match of the input vector x with the synaptic weight vectors wj can be found by comparing the Euclidean distance between input vector x and each neuron j
d_j(x) = ||x − w_j||
The neuron whose weight vector comes closest to the input vector (i.e. is most similar to it) is declared the winning neuron
In this way the continuous input space can be mapped to the discrete output space of neurons by a simple process of competition between the neurons
#306
We define a similar neurobiologically correct topological neighbourhood for the neurons in SOM and assume two requirements:
1. The topological neighborhood is symmetric around the winning neuron
2. The amplitude of the topological neighborhood decreases monotonically with increasing lateral distance (decaying to zero in the limit d → ∞, which is necessary for convergence)
#307
h_{j,i(x)} = exp( −d²_{j,i} / (2σ²) )
Gaussian function is translation invariant (independent of the location of the winning neuron)
#308
Neighbours
σ(n) = σ₀·exp(−n/τ₁),  n = 1, 2, ...

h_{j,i(x)}(n) = exp( −d²_{j,i} / (2σ²(n)) ),  n = 1, 2, ...
#311
w_j(n+1) = w_j(n) + η(n)·h_{j,i(x)}(n)·( x(n) − w_j(n) )
the rule is applied to all neurons inside the topological neighbourhood of the winning neuron i Adaptation moves the synaptic weights wj of the chosen neurons toward the input vector x
#312
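The competition, neighbourhood, and adaptation steps above can be put together in a minimal illustrative Python sketch of a 1D SOM. The grid size, decay constants, and the uniform 2D input distribution are arbitrary choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D SOM: 10 neurons mapping a 2D input distribution
n_neurons, n_iter = 10, 2000
W = rng.uniform(0.4, 0.6, size=(n_neurons, 2))     # small random initial weights
X = rng.uniform(0, 1, size=(500, 2))               # training inputs

for n in range(n_iter):
    x = X[rng.integers(len(X))]
    i_win = np.argmin(((W - x) ** 2).sum(axis=1))  # competition: winning neuron
    # exponentially decaying neighbourhood width and learning rate
    sigma = 3.0 * np.exp(-n / 1000)
    eta = 0.5 * np.exp(-n / 1000)
    d = np.arange(n_neurons) - i_win               # lateral distance on the grid
    h = np.exp(-d**2 / (2 * sigma**2))             # topological neighbourhood
    W += eta * h[:, None] * (x - W)                # w_j <- w_j + eta*h*(x - w_j)

print(W.min() >= 0 and W.max() <= 1)               # weights stay in the input range
```

Because every update is a convex step towards an input sample, the weight vectors remain inside the input region while the shrinking neighbourhood unfolds the map.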
η(n) = η₀·exp(−n/τ₂),  n = 1, 2, ...
Even if not optimal, this selection of parameters usually leads to the formation of the feature map in a self-organized manner
#313
2. Convergence phase
Feature-map fine-tuning provides a statistical quantification of the input space → typically the number of iterations is at least 500 times the number of neurons
Result of the SOM algorithm: starting from an initial state of complete disorder, the SOM algorithm gradually leads to an organized representation of activation patterns drawn from the input space
However, it is possible to end up in a metastable state in which the feature map has a topological defect
#314
#315
w_j(n+1) = w_j(n) + η(n)·h_{j,i(x)}(n)·( x(n) − w_j(n) )
6. Iteration Continue with step 2 until the feature map stops changing
#316
Step 2: We randomly pick one of the data points for training. The closest output point represents the winning neuron. That winning neuron is moved towards the data point by a certain amount, and the two neighbouring neurons move by smaller amounts.
#317
Step 4: We carry on randomly picking data points for training. Each winning neuron moves towards the data point by a certain amount, and its neighbouring neuron(s) move by smaller amounts. Eventually the whole output grid unravels itself to represent the input space.
#318
#319
#321
Matlab examples
nnd14fm1 1D feature map nnd14fm2 2D feature map
#322
Property 1: Approximation of the input space
Property 2: Topological ordering
Property 3: Density matching
Property 4: Feature selection
#323
If we work through gradient-descent-style mathematics, we do end up with the SOM weight update algorithm, which confirms that it is generating a good approximation to the input space
#324
#325
So the SOM algorithm doesn't match the input density exactly, because of the power 2/3 rather than 1. As a general rule, the feature map tends to over-represent regions with low input density and to under-represent regions with high input density.
#326
#327
#328
4 clusters, 1D SOM
#329
4 clusters, 2D SOM
#330
[Figure: trained SOM weight vectors, 10x10 neurons]
#331
[Figure: SOM weight vectors, 10x10 neurons]
#332
[Figure: SOM weight vectors: 1D SOM with 50 neurons and 2D SOM with 10x10 neurons]
#333
Complex distribution
[Figure: SOM weight vectors for a complex input distribution, 50 neurons]
#334
4 clusters, 2D SOM
[Figure: 2D SOM weight vectors for 4 clusters]
#335
#336
8. Practical Considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
8.8 General guidelines
#337
Introduction
A neural network could, in principle, map raw input data into the required outputs → in practice, this will generally give poor results
For most applications, some data manipulations are recommended:
Preparing data
designing the training data
handling missing and extreme data
incorporating invariances and prior knowledge
Preparing inputs
pre-processing, rescaling, normalizing, standardizing, detrending
dimensionality reduction: principal component analysis
feature selection, feature extraction
Preparing outputs
encoding of classes, post-processing, rescaling, standardizing
#338
Training data must be representative of the problem considered
For pattern recognition:
Every class must be represented
Within each class, statistical variation must be adequately represented
Potato chips factory example:
NN must be trained on 1) normal chips, 2) burned chips, 3) uncooked chips, ...
#339
Standardizing
Subtracting a measure of location and dividing by a measure of scale Example: subtracting a mean and dividing by standard deviation, thereby obtaining a "standard normal" random variable with mean 0 and standard deviation 1
Normalizing
Dividing a vector by its norm Example: make the Euclidean length of the vector equal to one. In the NN literature, "normalizing" often refers to rescaling into [0,1] range
Rescaling
Rescaling inputs
The often-recommended rescaling of inputs to the interval [0,1] is a misconception → there is in fact no such requirement
The interval [0,1] is usually a bad choice; rescaling to the [-1,1] interval is better
Standardizing inputs is better than rescaling ...
Rescaling outputs
1. For bounded activation functions (range [0,1] or [-1,1]), the target values must lie within that range. The alternative is to use an activation function suited to the distribution of the targets, for example a linear activation function.
2. It is essential to rescale multidimensional targets so that their variability reflects their importance, or at least is not in inverse relation to their importance. If the targets are of equal importance, they should typically be rescaled or standardized to the same range or the same standard deviation.
#341
Standardizing
Standardizing usually refers to transforming data to zero mean and standard deviation one
Statistics (mean, std) are computed from training data, not from validation data Validation data must be standardized using the statistics computed from training data
Standardizing inputs
Often very beneficial for MLP and RBFN networks:
RBFN → inputs are combined via a distance function (Euclidean), therefore it is important to standardize them into a similar range
MLP → standardizing enables utilization of the steep parts of the transfer functions → faster learning and avoidance of saturation
Standardizing outputs
Typically more a convenience for getting good initial weights than a necessity
Important for the equal relevance of targets
Note: use rescaling for bounded activation functions, not standardizing!
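The rule that validation data must reuse the training statistics can be sketched as follows (an illustrative Python sketch with arbitrary toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(100, 3))
valid = rng.normal(5.0, 2.0, size=(30, 3))

# Statistics come from the TRAINING data only
mu = train.mean(axis=0)
sd = train.std(axis=0)

train_std = (train - mu) / sd
valid_std = (valid - mu) / sd        # same mu, sd reused for validation data

print(np.allclose(train_std.mean(axis=0), 0.0, atol=1e-12),
      np.allclose(train_std.std(axis=0), 1.0))
```

Recomputing mu and sd on the validation set would leak information and make validation errors incomparable with training errors.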
#342
Differencing
Working with differences between successive samples can sometimes bring good results
Example: daily stock-market values convey one sort of information; the change from one day to the next conveys entirely different information
Differencing can be applied to inputs and outputs; a powerful option is to apply both raw and differenced inputs!
#343
#344
2. Log(x) → stationary variance
3. Differencing → stationary mean, stationary variance
#345
Methods
X-12-ARIMA (U.S. Census Bureau, Statistical Research Division)
STL (Seasonal Trend Decomposition based on Loess)
#346
Original data
Residual
#347
Causal importance
How much the outputs change if inputs are changed (also called sensitivity)
Marginal importance
Considers inputs in isolation Easy to compute without even training a neural net ...
(Pearson correlation, rank correlation, mutual information, ...)
Marginal importance is of little practical use other than for a preliminary description of the data
#349
ftp://ftp.sas.com/pub/neural/importance.html
#350
Pruning methods
Removing nonrelevant inputs during the neural network construction
#351
Ordinal variables
Discrete data with natural ordering (e.g. 'small', 'medium', 'big') Ordinal variables can often be represented by a single variable
Single variable:      Small → 1,       Medium → 2,      Big → 3
Thermometer coding:   Small → [0 0 1], Medium → [0 1 1], Big → [1 1 1]
Bipolar thermometer:  Small → [-1 -1], Medium → [-1 1],  Big → [1 1]
#352
1-of-C coding with a softmax activation function will produce valid posterior probability estimates
It is very important NOT to use a single variable for an unordered categorical target
#353
Day of the week (Mon=1, ..., Sun=7) → we have a discontinuity when passing from 7 to 1, although Sunday and Monday are very close
Solutions:
1. Discretizing and using any of the categorical coding (1-of-C) 2. Encoding with two dummy variables (sin,cos)
cos(2π·day/7), sin(2π·day/7)
#354
Projection vectors p are eigenvectors of the covariance matrix XXᵀ
Each principal component z_p is obtained as the product of the input matrix with a projection vector
Each principal component is a linear combination of the original variables
All principal components are orthogonal to each other, so there is no redundant information
#356
PCA example
Original data: x1, x2 Principal components: z1, z2 Variability
z1 95% z2 5%
Benefit
Dimensionality reduction by using only the first principal component (z1) instead of original 2D data (x1, x2)
#357
#358
4. Decide how much variance to keep ... 90%, 95%?
5. Keep only a few selected principal components, discard the rest → data dimensionality is reduced
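The full PCA recipe can be sketched end to end. This is an illustrative Python sketch on synthetic 2D data (the correlated toy distribution and the 95% threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: most variance lies along one direction
x1 = rng.normal(0, 1, 200)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(0, 1, 200)])

Xc = X - X.mean(axis=0)                        # 1. centre the data
C = np.cov(Xc, rowvar=False)                   # 2. covariance matrix
eigval, eigvec = np.linalg.eigh(C)             # 3. eigen-decomposition
order = np.argsort(eigval)[::-1]               # sort by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

explained = eigval / eigval.sum()              # 4. variance kept per component
Z = Xc @ eigvec[:, :1]                         # 5. keep first principal component

print(explained[0] > 0.95, Z.shape)            # almost all variance in z1
```

For new inputs, the same centring and the same stored projection vectors must be reused, analogous to reusing training statistics when standardizing.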
#359
Intrinsic dimensionality
Suppose we apply PCA on d-dimensional data and discover that the first n eigenvalues have significantly larger values than the rest (n < d)
Consequently, the data can be represented with high accuracy by the first n principal components → the effective dimensionality is only n
Generally, data in d dimensions have intrinsic dimensionality n if the data lie entirely within an n-dimensional subspace
#360
[Figure: example images, 32 x 32 pixels]
Image representation
Grid of pixels (typically 256x256 → 65536 inputs)
Gray level [0..1] (typically 8-bit coding)
8.7 Generalization
The goal of network training is not to exactly fit the data but to build a statistical model of the process that generates the data
A well-trained network is able to generalize → to make good predictions on new inputs
Here it is assumed that the test data are drawn from the same population used to generate the training data
A neural network designed to generalize well will produce a correct input-output mapping even when new inputs differ slightly from the samples used to train the network. Overfitting problem: the neural net learns the complete training set but not the underlying function.
#364
Generalization in classification
The task of our network is to learn a classification decision boundary
Good generalization vs. overfitting
If we know that the training data contain noise, we don't necessarily want the training data to be classified totally accurately, as that is likely to reduce the generalization ability.
#365
We can expect the neural network output to give a better representation of the underlying function if its output curve does not pass through all the data points. Again, allowing a larger error on the training data is likely to lead to better generalization.
#366
Overfitting, underfitting
Overfitting
The neural network learns the training data perfectly but gives poor results on test data
Underfitting
The neural network is unable to properly learn the data, due to an insufficient number of neurons or to extreme regularization. Such a network also generalizes poorly.
#367
Improving generalization
How to prevent underfitting
1. Provide enough hidden units to represent the required mappings
2. Train the network for long enough so that the sum-squared-error cost function is sufficiently minimised

How to prevent overfitting

3. Design the training data properly
4. Use cross-validation
5. Use early stopping
6. Restrict the number of adjustable parameters
7. Add a regularization term to the error function to encourage smoother network mappings
8. Add noise to the training patterns to smear out the data points (jittering)
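Adding noise to the training patterns (jittering) can be sketched in numpy (illustrative code, not from the course materials; `jitter` and the noise level are my choices):

```python
import numpy as np

def jitter(X, sigma=0.05, rng=None):
    """Jittering: add small Gaussian noise to training patterns so the
    network sees slightly different inputs each epoch (smears data points)."""
    if rng is None:
        rng = np.random.default_rng()
    return X + rng.normal(scale=sigma, size=X.shape)

X = np.array([[0.2, 0.8], [0.5, 0.1]])
Xj = jitter(X, sigma=0.05, rng=np.random.default_rng(1))
print(np.abs(Xj - X).max())   # perturbation on the order of sigma
```

A fresh noise realization would typically be drawn each training epoch, so the network cannot memorize exact input positions.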
#368
Cross-validation
Cross-validation is used to estimate the generalization error, based on resampling. Available data are randomly partitioned into a
Training set, and Test set
Generalization performance of the selected model is tested on the test set, which is different from the validation subset.
#369
Variants of cross-validation
If only a small set of data exists ... Multifold cross-validation
Divide the available N samples into K subsets
Model is trained on all subsets except one
Validation error is measured on the subset left out
Procedure is repeated K times
Model performance is obtained by averaging the K trials
Leave-one-out cross-validation
Extreme form of cross-validation with K = N
N-1 samples are used for training
Model is validated on the sample left out
Procedure is repeated N times
Result is averaged over the N trials
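The multifold procedure above can be sketched in numpy (illustrative code, not the course's MATLAB tools; the `fit`/`error` callbacks and the toy mean-predictor are placeholders for a real network and its error measure):

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Split sample indices into K folds for multifold cross-validation."""
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)

def cross_validate(X, y, k, fit, error, seed=0):
    """Train on K-1 folds, measure error on the left-out fold,
    repeat K times and average (K = N gives leave-one-out)."""
    folds = kfold_indices(len(X), k, np.random.default_rng(seed))
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        errors.append(error(model, X[val], y[val]))
    return np.mean(errors)

# Toy usage: the "model" is just the training-set mean of y
X = np.arange(20.0).reshape(-1, 1)
y = 3.0 * X[:, 0]
fit = lambda Xt, yt: yt.mean()
mse = lambda m, Xv, yv: np.mean((yv - m) ** 2)
err = cross_validate(X, y, k=5, fit=fit, error=mse)
print(err)   # averaged validation error over the 5 trials
```

Setting `k=len(X)` turns the same routine into leave-one-out cross-validation.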
#370
Early stopping
Neural networks are often set up with more than enough parameters, which can cause over-fitting.
For iterative gradient-descent-based training procedures (backpropagation, conjugate gradients, ...), the training-set error will naturally decrease with an increasing number of training epochs.
The error on the unseen validation and test sets, however, will start off decreasing as under-fitting is reduced, but will eventually begin to increase again as over-fitting occurs.
The natural solution for getting the best generalization, i.e. the lowest error on the test set, is the procedure of early stopping.
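Early stopping can be sketched generically (illustrative Python, not from the course materials; the `step`/`val_error` callbacks stand in for one training epoch and the validation-set error of a real network):

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_epochs=1000, patience=10):
    """Generic early stopping: keep training while the validation error
    improves; stop after `patience` epochs without improvement."""
    best, best_epoch, since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        step()                      # one epoch of gradient descent
        e = val_error()
        if e < best - 1e-12:
            best, best_epoch, since_best = e, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break               # validation error stopped improving
    return best, best_epoch

# Toy "training": validation error falls, then rises again (over-fitting)
epoch = [0]
def step(): epoch[0] += 1
def val_error(): return (epoch[0] - 30) ** 2   # minimum at epoch 30
best, when = train_with_early_stopping(step, val_error)
print(best, when)   # stops shortly after the validation minimum
```

The weights saved at the epoch of minimum validation error, not the final ones, are kept as the trained network.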
#371
Since the validation error is not a good estimate of the generalization error, a third set, the test set, must be used to estimate generalization performance. Available data are divided as in cross-validation:
Training set
Estimation subset Validation subset
Test set
#372
Such issues are problem dependent ... Default MATLAB parameters (train, validation, test): 70%, 15%, 15%
#373
#374
Regularization
The regularization technique encourages smoother network mappings by adding a penalty term Ω to the standard (sum-squared-error) cost function

E = Esse + λ Ω

where the regularization parameter λ controls the trade-off between reducing the error Esse and increasing the smoothing. This modifies the gradient descent weight updates.
The resulting neural network mapping is a compromise between fitting the data and minimizing the regularizer
#375
In conventional curve fitting this regularizer is known as ridge regression. We can see why it is called weight decay when we observe the extra term in the weight updates

Δwij = -η ∂Esse/∂wij - η λ wij
In each epoch the weights decay in proportion to their size. Empirically, this leads to significant improvements in generalization. Weight decay keeps the weights small, and hence the mappings are smooth.
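The decaying effect of the extra update term can be seen in a small numpy sketch (illustrative, not the course's implementation; η and λ values are arbitrary):

```python
import numpy as np

# Weight-decay sketch: gradient descent on Esse plus the penalty
# (lambda/2) * sum(w^2); the extra term -eta*lam*w shrinks each
# weight in proportion to its size, every epoch.
def update(w, grad_esse, eta=0.1, lam=0.01):
    return w - eta * grad_esse - eta * lam * w   # decay term: -eta*lam*w

w = np.array([2.0, -1.0, 0.5])
w_new = update(w, grad_esse=np.zeros(3))   # zero data gradient: pure decay
print(w_new)   # each weight shrunk by the factor (1 - eta*lam) = 0.999
```

With a zero data gradient the update reduces to multiplying every weight by (1 - ηλ), which is exactly the "decay in proportion to their size" described above.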
#376
#377
Generalization summary
Preventing underfitting
1. Provide enough hidden units 2. Train the network for long enough
Preventing overfitting
3. Design the training data properly
4. Cross-validation
5. Early stopping
6. Restrict the number of adjustable parameters
7. Regularization
8. Jittering
#378
#379
#380