
DENIS ENĂCHESCU

University of Bucharest




ELEMENTS OF STATISTICAL LEARNING. APPLICATIONS IN DATA
MINING
Lecture Notes
1 The Nature of Machine Learning
1.1 Basic Definitions and Key Concepts
Learning (understood as artificial, automatic learning; Machine Learning). This concept covers any
method that makes it possible to build a model of reality starting from data, either by improving a
partial or less general model, or by creating the model from scratch. There are two principal
tendencies in learning: the one coming from artificial intelligence, qualified as symbolic, and the
one coming from statistics, qualified as numerical.
Precision vs. generalization: the great dilemma of learning. Precision is defined as the
difference between a measured or predicted value and the actual value. Learning with too much
precision leads to "over-fitting", like learning by heart, in which unimportant details (or details
induced by noise) are learned. Learning with not enough precision leads to "over-generalization",
and the model then applies even when the user does not wish it.
Intelligibility (should be comprehensibility, but tends to become understandability). For a few
years, mainly under the push of industry, researchers have also started to try to control the
intelligibility of the model obtained by data mining. Until now, the methods for measuring
intelligibility reduce to checking that the results are expressed in the language of the user and
that the size of the models is not excessive. Specific visualization methods are also used.
The criterion of success. The criterion of success is what is measured in the performance
evaluation; it is thus a criterion relative to an external observer. For example, the performance
may be measured by the number of errors made by the learner during learning, or by its error
rate after learning. More generally, the measurement of performance can include factors
independent of the adequacy to the learning data and of very diverse natures: for example, the
simplicity of the result produced by the learning machine (LM), its comprehensibility, its
intelligibility to an expert, how easily it can be integrated into a current theory, the low
computational cost necessary to obtain it, etc.
An important remark should be made here. The criterion of success, measured by an external
observer, is not necessarily identical to the performance index or loss function that is internal
to the LM and used in the internal evaluation of the learning model. For example, a learning
algorithm for a connectionist network generally seeks to minimize a quadratic error between what
it predicts on each learning example and the desired output.
The learning protocol. Learning and its evaluation depend on the protocol that establishes the
interactions between the LM and its environment, including the supervisor (the oracle). It is thus
necessary to distinguish between batch learning, in which all the learning data are provided at
the start, and on-line learning, in which the data arrive in sequence and the learner must
deliberate and provide an answer after each entry or group of entries.
The protocol also stipulates the type of entries provided to the learner and the type of expected
outputs. For example, a scenario can specify that at every moment the LM receives an observation
x_i, that it must provide an answer y_i, and that only then does the supervisor produce the
correct answer u_i. One speaks then naturally of a prediction task. The tasks known as prediction
aim at correctly anticipating the response at a precise point.
In contrast, in identification tasks the goal is to find a total explanation among all those
possible, which, once known, will make it possible to make predictions whatever the question.
The scenario is then different. For example, the learning system must provide, after each new
entry (x_i, u_i), a hypothesis about the "hidden function" of the supervisor by which it
determines u_i as a function of x_i. It is clear that the criterion of success is not the same in
the case of a prediction task as in that of an identification task. In this last case, indeed, one
asks much more from the LM, since one expects from it an explicit hypothesis, therefore a kind of
explanation of its predictions.
In addition, the LM can be more or less active. In the protocols described up to now, the LM
receives the data passively, without having any influence on their selection. It is possible to
consider scenarios in which the LM has a certain initiative in the search for information. In
certain cases this initiative is limited, for example when the LM, without having total control
over the choice of the learning sample, is simply able to bias its probability distribution; the
boosting methods are an illustration of this case. In other cases, the LM can ask questions about
the class membership of an observation, and one then speaks of learning by membership queries, or
even organize experiments on the world, and one then speaks of active learning. The game of
Mastermind, which consists in guessing a hidden configuration of colored pawns by asking questions
according to certain rules, is a simple example of active learning in which the learner has the
initiative of the questions.
The task of learning. It is possible to approach the objective of the learning process from
several points of view.

The knowledge point of view.
The goal of learning can be to modify the contents of knowledge [2]. One then speaks of knowledge
acquisition, of revision, and, why not, of forgetting.
The goal of learning can also be, without necessarily modifying the "contents" of knowledge, to
make it more effective with respect to a certain goal, by reorganization, optimization or
compilation for example. It could be the case of a chess player or a mental calculator who learns
how to go faster and faster without learning new rules of play or of calculation. One speaks in
this case of performance optimization (speed-up learning).
[2] Measured, for example, by its deductive closure, i.e., in a logical representation, all that can be deduced correctly from the current knowledge base.
The environment point of view.
The task of learning can also be defined with respect to what the learning agent must carry out
to survive in its environment. That can include:
To learn how to recognize patterns (for example: handwritten characters, birds, predators, an
upward trend of a security on the stock exchange, appendicitis, etc.). When the learning is done
with a professor, or supervisor, who provides the desired answers, one has supervised learning.
If not, one speaks of unsupervised learning. In this last case, the task of learning consists at
the same time in discovering categories and finding rules of categorization.
To learn how to predict. There is then a concept of temporal dependence or causality.
To learn how to be more effective. It is the case in particular of situations of problem solving,
or of search for action plans in the world.
The abstract classes of problems point of view.
Independently of the learning algorithm, it is possible to characterize the learning process by a
general and abstract class of problems and resolution processes. Thus a certain number of
disciplines, in particular coming from mathematics or information theory, have discovered an
interest in learning problems.
The theories of information compression. In a certain sense, learning can be approached as a
problem of extraction and compression of information. It is a question of extracting the essential
information, or the initial message of an ideal transmitter, cleared of all its redundancies. In a
sense, the natural sciences, such as astronomy or ornithology, proceed by elimination of the
superfluous or redundant details and by the description of hidden regularities.
Cryptography. From a similar point of view, close to the goals of information theory, learning
can be regarded as an attempt at decoding a message coded by the ideal transmitter and intercepted
in whole or in part by the learning agent. After all, this is sometimes the situation of the
scientist studying nature. It is then logical to study under which conditions a message can
"be broken", i.e. under which conditions learning is possible.
Mathematical / numerical analysis. Learning can also be examined as a problem of approximation.
The task of learning is to find as good an approximation as possible of a hidden function known
only through a sample of data. The problem of learning often becomes that of the study of the
conditions of approximation and convergence.
Induction. In the Seventies and at the beginning of the Eighties, under the influence of the
cognitive point of view, a broad community of researchers, particularly active in France, looked
at learning as a problem of generalization. This approach starts from two essential hypotheses.
First, the cognitive learning agent must learn something that another cognitive agent equivalently
knows; it is thus normally able to reach the target knowledge perfectly. Second, knowledge and
data can be described by a language. One then seeks the operators in this language that can
correspond to generalization or specialization operations useful for induction, and one builds
algorithms using them, making it possible to summarize the data while avoiding over-fitting and
the drawing of illegitimate consequences.
Applied mathematics. Finally, the engineer can be tempted to see in learning a particular case of
the resolution of an inverse problem. Let us take two examples:
given a parameterized model, probability theory addresses a direct problem (what are the
probabilities associated with such an event?), while statistics attacks an inverse problem (given
a sample of data, which model makes it possible to explain it, i.e. could have produced it?);
given two numbers, it is easy to find their product (direct problem); it is on the other hand
generally impossible to find, starting from a number, those of which it is the product (inverse
problem).
Inverse problems are thus often problems known as ill-posed, i.e. not having a unique solution.
According to this point of view, the study of learning can be seen as that of the conditions
making it possible to solve an ill-posed problem, i.e. of the constraints that will have to be
added so that the resolution procedure can find a particular solution.
The data structures or types of hypotheses point of view.
It frequently happens that one imposes the type of structure (or the language of expression of the
hypotheses) that must be sought by the learning system. That makes it possible to guide at the
same time the choice of the learning algorithm to be used, but also the data that will be
necessary so that learning is possible. Without seeking to be exhaustive, we quote among the
principal structures of data studied (a small sketch of two of them follows this list):
Boolean expressions, which are often adapted to learning concepts defined on an attribute-value
language (for example the rules of an expert system);
grammars and Markov processes, allowing to represent sequences of events;
linear / nonlinear functions making it possible to discriminate objects belonging to a subspace or
to its complement;
decision trees, which allow classifications by hierarchies of questions; the corresponding
decision tree is often at the same time concise and comprehensible;
logic programs, which allow learning relational concepts;
Bayesian networks, allowing at the same time to represent universes structured by causality
relations and to take into account and express measurements of certainty or confidence.
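As a toy illustration of two of these structures, the following Python/NumPy sketch (the weights, thresholds and example point are purely hypothetical, not taken from the text) expresses one hypothesis as a linear discriminant and another as a small decision tree, i.e. a hierarchy of questions on the attributes.

    import numpy as np

    # A linear hypothesis: the sign of a weighted sum discriminates the two classes.
    def linear_hypothesis(x, w, b):
        """Return +1 or -1 according to the half-space the point x falls in."""
        return 1 if np.dot(w, x) + b >= 0 else -1

    # A decision-tree hypothesis: a hierarchy of questions on the attributes.
    def tree_hypothesis(x):
        """Toy tree with hypothetical thresholds on two attributes."""
        if x[0] <= 0.5:          # first question
            return -1
        elif x[1] <= 0.2:        # second question, asked only on one branch
            return -1
        else:
            return 1

    x = np.array([0.8, 0.7])
    print(linear_hypothesis(x, w=np.array([1.0, -1.0]), b=0.0))  # +1
    print(tree_hypothesis(x))                                    # +1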
Sometimes the learning can consist in changing the data structure in order to find an equivalent
but computationally more effective structure. It is once again, under another angle, the problem
of performance optimization.
To simplify, we will suppose that the LM seeks an approximation of the target function inside a
family H of hypothesis functions. It is the case, for example, of learning using a neural network
whose architecture constrains the type of realizable functions to a certain space of functions.
We defined the task of learning as a problem of estimating a function starting from the
observation of a sample of data. We turn now to the principles allowing to carry out this
estimate.
The exploration of the hypothesis space. Let H be a hypothesis space, X a data space and S a
training sample. The task of learning is to find a hypothesis h approximating as well as possible,
within the meaning of a certain measurement of performance, a target function f on the basis of
the sample S = {(x_i, u_i)}_{i=1,...,m}, in which one supposes that each label u_i was calculated
by the function f applied to the data x_i.
How to find such a hypothesis h ∈ H? Two questions arise:
1. How to know that a satisfactory (even optimal) hypothesis has been found, and more generally
how to evaluate the quality of a hypothesis?
2. How to organize the search in H?
Whatever the process guiding the exploration of H, it is necessary that the LM can evaluate the
hypothesis h that it considers at each moment of its search. We will see that this evaluation
uses an internal performance index (for example a quadratic error between the outputs calculated
from h and the desired targets u provided in the training sample). It is this performance index,
plus possibly other information provided by the environment (including the user, for example),
which allows the LM to measure its performance on the training sample and to decide whether it
must continue its search in H or whether it can stop.
By supposing that at the moment t the LM judges its current hypothesis h_t unsatisfactory, how can
it change it? It is there that the effectiveness of the learning is decided and, in this context,
the structure of the space H plays an important role. The richer and finer it is, the more it will
be possible to organize the exploration of H effectively. Let us quickly examine three
possibilities, in ascending order of structuring:
the hypothesis space does not present any structure. In this case, only a random exploration is
possible. Nothing makes it possible to guide the search, nor even to benefit from the information
already gained on H. It is the case where nothing is known a priori on H.
a concept of neighborhood is definable on H. It is then possible to operate an exploration by
optimization techniques like the gradient method. The advantage of these techniques, and what
makes them so popular, is that they are of very general use, since it is often possible to define
a concept of neighborhood on a space. A fundamental problem is that of the relevance of this
concept: a bad neighborhood relation can indeed move the LM away from the promising areas of the
space! In addition, it is still a weak structure which, except in particular cases
(differentiability, convexity, etc. of the function to be optimized), does not allow a fast
exploration. (A small sketch of this gradient-based exploration is given after this list.)
it is sometimes possible to have a stronger structure making it possible to organize the
exploration of H. In this case, for example, it becomes possible to modify an erroneous hypothesis
by specializing it just enough so that it no longer covers the new negative example, or on the
contrary by generalizing it just enough so that it covers the new positive example provided. This
type of exploration, possible in particular when the hypothesis space is structured by a language,
is generally better guided and more effective than a blind exploration.
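To make the neighborhood-based exploration of the second case concrete, here is a minimal Python/NumPy sketch; the quadratic loss, the linear parameterization of the hypotheses and all numerical values are illustrative assumptions, not taken from the text. The current hypothesis is repeatedly replaced by a neighboring one obtained by a gradient step on the empirical loss.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical learning sample: a noisy linear target u = 2*x - 1.
    x = rng.uniform(-1, 1, size=50)
    u = 2 * x - 1 + rng.normal(scale=0.1, size=50)

    # Hypothesis space: h_w(x) = w[0]*x + w[1]; a "neighbor" of w is w moved by a
    # small step along the gradient of the empirical quadratic loss.
    w = np.zeros(2)
    eta = 0.1                                   # step size
    for _ in range(200):
        pred = w[0] * x + w[1]
        grad = np.array([np.mean(2 * (pred - u) * x),   # d loss / d w[0]
                         np.mean(2 * (pred - u))])      # d loss / d w[1]
        w -= eta * grad                         # move to the neighboring hypothesis

    print(w)   # close to [2, -1]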
From what precedes, it is obvious that the stronger the structuring of the hypothesis space is,
and the better adapted it is to the learning problem, the more the learning will be facilitated.
On the other hand, of course, that will require a preliminary deliberation.
1.2 Short History
Artificial learning is a young discipline at the common frontier of artificial intelligence and
computer science, but it already has a history. We sketch it here rapidly, believing that it is
always interesting to know the past of a discipline, because it can reveal, through the tensions
it brings to light, its major problems and its major options.
The theoretical foundations of learning were laid with the first results in statistics in the
years 1920 and 1930. These results seek to determine how to infer a model starting from data, but
especially how to validate a hypothesis based on a sample of data. Fisher in particular studied
the properties of linear models and how they can be derived starting from a sample of data. At the
same period, computer science was born with the work of Gödel, Church and especially Turing in
1936, and the first computer simulations became possible after the Second World War. Besides the
theoretical reflections and the conceptual debates on cybernetics and cognitivism, the pioneers of
the domain tried to program machines to carry out intelligent tasks, often integrating learning.
It is particularly the case of the first simulations of tortoises or cybernetic mice, which one
places in labyrinths while hoping to see them learn how to get out faster and faster. On his side,
Samuel at IBM, in the years 1959-1962, developed a program to play American checkers, which
included an evaluation function of the positions enabling it to become quickly a very good player.
In the years 1960, learning is marked by two currents. On the one hand, a first connectionism
which, under the leadership of Rosenblatt, father of the perceptron, sees the development of small
artificial neural networks tested on class recognition tasks using supervised learning. On the
other hand, conceptual tools for pattern recognition are developed.
At the end of the 1960s, the publication of the book of Minsky and Papert (1969), which states the
limits of the perceptron, causes the stop, for about fifteen years, of almost all research in this
field. In a concomitant manner, the accent put by artificial intelligence in the years 1970 on
knowledge, its representation and the use of sophisticated inference rules (the period of expert
systems) encourages work on learning systems based on structured knowledge representations,
bringing into play complex inference rules like generalization, analogy, etc.
Figure 1-1 The first period of the artificial learning.
It is then the triumph of impressive systems carrying out specific learning tasks by simulating,
more or less, strategies used in human learning. One must cite the system ARCH of Winston in 1970,
which learns how to recognize arches in a blocks world starting from examples and counterexamples;
the system AM of Lenat in 1976, which discovers conjectures in the field of arithmetic by the use
of a set of heuristic rules; or the system META-DENDRAL of Mitchell, which learns rules in an
expert system dedicated to the identification of chemical molecules.
It is also a period during which the dialogue is easy and fertile between psychologists and
experts of artificial learning. From there come hypotheses relating to concepts like short-term
and long-term memory, the procedural or declarative type of knowledge, etc., and also the ACT
system of Anderson, testing general hypotheses on the learning of mathematical concepts in
education.
However spectacular they are, these systems have weaknesses which come from their complexity.
Indeed, their realization necessarily implies a great number of choices, small and large, often
implicit, which therefore do not allow an easy replication of the experiments, and especially
throw doubt on the general and generic range of the proposed principles. This is why the 1980s saw
work relating to such simulations gradually dry up, with some brilliant exceptions like the
systems ACT or SOAR.
Moreover, these years saw a very powerful comeback of connectionism in 1985, with in particular
the discovery of a new learning algorithm based on gradient descent for multi-layer perceptrons.
That deeply modified the study of artificial learning by opening the door wide to all the concepts
and mathematical techniques relating to optimization and convergence properties. Parallel to the
intrusion of continuous mathematics, other mathematicians rushed (behind Valiant in 1984) into the
breach opened by the concept of version space due to Mitchell.
Figure 1-2 The second period of the artificial learning.
At a stroke, learning was no longer seen as the search for algorithms simulating a learning task,
but as a process of elimination of hypotheses not satisfying an optimization criterion. The
question, within this research framework, was how a sample of data drawn at random could make it
possible to identify a good hypothesis in a given space of hypotheses. This was quite
disconcerting, and as the language used in this new research direction was rather distant from
that of the experts of artificial learning, the latter continued to develop algorithms simpler but
more general than those of the previous decade: decision trees, genetic algorithms, induction of
logic programs, etc.
It is only in the years 1990, and especially after 1995 and the publication of a small book by
Vapnik (1995), that the statistical theory of learning truly influenced artificial learning by
giving a solid theoretical framework to the interrogations and empirical observations made in the
practice of artificial learning.
The current development of the discipline is dominated at the same time by a vigorous theoretical
effort in the directions opened by Vapnik and the theorists of the statistical approach, and by a
redeployment towards the application of the developed techniques to large applications of economic
purpose, such as the mining of socio-economic data, or with other finalities, like genomics. It is
undeniable that for the moment learning is felt as necessary in very many fields and that we live
a golden age for this discipline. That should not however make us forget the need for renewing the
dialogue with psychologists, teachers, and more generally all those who work on learning in one
form or another.
Figure 1-3 The third period of the artificial learning.
A non-exhaustive list of journals specialized in artificial learning is:
Machine Learning Journal
Journal of Machine Learning Research (available free at http://www.ai.mit.edu/projects/jmlr/)
Journal of Artificial Intelligence Research (JAIR), accessible free on the Internet
(http://www.ai.mit.edu/projects/jmlr/)
Data Mining and Knowledge Discovery Journal
Transactions on Knowledge and Data Engineering
Table 1.1 - Core tasks for Machine Learning

Task category: Specific tasks
Classification: Classification, Theory revision, Characterization, Knowledge refinement, Prediction, Regression, Concept drift
Heuristics: Learning heuristics, Learning in Planning, Learning in Scheduling, Learning in Design, Learning operators, Strategy learning, Utility problem, Learning in Problem solving, Knowledge compilation
Discovery: Scientific knowledge discovery, Theory formation, Clustering
Grammatical inference: Grammar inference, Automata Learning, Learning programs
Agents: Learning agents, Multiagent system learning, Control, Learning in Robotics, Learning in perception, Skill acquisition, Active learning, Learning models of environment
Theory: Foundations, Theoretical issues, Evaluation issues, Comparisons, Complexity, Hypothesis selection
Features/Languages: Feature selection, Discretization, Missing value handling, Parameter setting, Constructive induction, Abstraction, Bias issues
Cognitive Modeling: Cognitive modeling

The Information Society Technologies Advisory Group (ISTAG) has recently identified a set of
grand research challenges for the preparation of FP7 (July 2004). Among these challenges are:

The 100% safe car
A multilingual companion
A service robot companion
The self-monitoring and self repairing computer
The internet police agent
A disease and treatment simulator
An augmented personal memory
A pervasive communication jacket
A personal everywhere visualiser
An ultra light aerial transportation agent
The intelligent retail store

If perceived from an application perspective, a multilingual companion, an internet police agent
or a 100% safe car are vastly different things. Consequently, such systems are investigated in
largely unconnected scientific disciplines and will be commercialized in various industrial
sectors ranging from health care to automotive.

Curs 02 - The General Model of Supervised Learning
1.1 General Model of Learning from Examples
A problem of learning is defined by the following components:

1. A set of three actors:
The environment: it is supposed to be stationary and it generates data x_i drawn independently and
identically distributed (i.i.d. sample) according to a distribution D_X on the space of data X.
The oracle (or supervisor, professor, or Nature), who, for each x_i, returns a desired answer or
label u_i in agreement with an unknown conditional probability distribution F(u | x).
The learner or learning machine (LM), able to fulfill a function h (not necessarily deterministic)
belonging to a space of functions H, such that the output produced by the LM verifies
y_i = h(x_i) for h ∈ H.
2. The learning task: the LM seeks in the space H a function h which approximates as well as
possible the desired response of the supervisor. In the case of induction, the distance between
the hypothesis function h and the response of the supervisor is defined by the mean loss over the
possible situations in Z = X × U. Thus, for each entry x_i and response of the supervisor u_i, one
measures the loss (or cost) l(u_i, h(x_i)) evaluating the cost of having taken the decision
y_i = h(x_i) when the desired answer was u_i (one will suppose, without loss of generality, the
loss positive or null). The mean cost, or real risk, is then:

R_{real}(h) = \int_{Z} l(u, h(x)) \, dF(x, u)

It is a statistical measurement that is a function of the functional dependence F(x, u) between
the entries x and the desired outputs u. This dependence can be expressed by a joint probability
density defined on X × U which is unknown. In other words, it is a question of finding a
hypothesis h near to f in the sense of the loss function, and this particularly in the frequently
met areas of the space X. As these areas are not known a priori, it is necessary to use the
training sample to estimate them, and the problem of induction is thus to seek to minimize the
unknown real risk starting from the observation of the training sample S.
3. Finally, an inductive principle that prescribes what the sought function h must satisfy,
according at the same time to the concept of proximity evoked above and to the observed training
sample S = {(x_1, u_1), ..., (x_m, u_m)}, with the aim of minimizing the real risk.
The inductive principle dictates what the best hypothesis must satisfy according to the training
sample, the loss function and, possibly, other criteria. It is an ideal objective. It should be
distinguished from the learning method (or algorithm), which describes an effective realization of
the inductive principle. For a given inductive principle, there are many learning methods, which
result from different choices for solving the computational problems that are beyond the scope of
the inductive principle. For example, the inductive principle can prescribe that it is necessary
to choose the simplest hypothesis compatible with the training sample. The learning method must
then specify how to actually seek this hypothesis, or a suboptimal hypothesis if necessary, while
satisfying certain feasibility constraints such as computational resources. Thus, for example, the
learning method will seek, by a gradient method, sub-optimal but easily controllable, the optimum
defined by the inductive principle.
The definition given above is very general: in particular, it does not depend on the selected loss
function. It has the merit of distinguishing the principal ingredients of a learning problem,
which are often mixed in the descriptions of practical achievements.
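The sketch below instantiates the three actors in Python/NumPy under entirely hypothetical choices of D_X, F(u|x), h and the loss (none of them come from the text): the environment draws x_i i.i.d., the oracle produces u_i from a conditional distribution, the LM answers y_i = h(x_i), and the average 0-1 loss over a large i.i.d. sample gives a Monte-Carlo approximation of the real risk.

    import numpy as np

    rng = np.random.default_rng(1)

    def environment(m):
        """D_X: here, uniform points on [0, 1] (hypothetical choice)."""
        return rng.uniform(0, 1, size=m)

    def oracle(x):
        """F(u|x): label 1 with probability x, else 0 (hypothetical noisy supervisor)."""
        return (rng.uniform(size=x.shape) < x).astype(int)

    def h(x):
        """A fixed hypothesis from H: threshold the input at 0.5."""
        return (x > 0.5).astype(int)

    def loss(u, y):
        """0-1 loss l(u, h(x))."""
        return (u != y).astype(float)

    x = environment(100_000)
    u = oracle(x)
    y = h(x)
    print(loss(u, y).mean())   # Monte-Carlo approximation of the real risk R_real(h)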
1.1.1 The Theory of Inductive Inference
The inductive principle prescribes which hypothesis one should choose to minimize the real risk
based on the observation of a training sample. However, there is no unique or ideal inductive
principle. How to extract, starting from the data, a regularity which has a chance to be relevant
for the future? A certain number of "reasonable" answers have been proposed. We describe the
principal ones here in a qualitative way, before re-examining them more formally in this and the
next chapters.
The choice of the hypothesis minimizing the empirical risk (Empirical Risk Minimization, or the
ERM principle). The empirical risk is the average loss measured on the training sample S:

R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} l(u_i, h(x_i))

The idea underlying this principle is that the hypothesis which agrees best with the data,
supposing that these are representative, is a hypothesis that describes the world correctly in
general.
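As a minimal illustration of the ERM principle, the following sketch (hypothetical data and a deliberately small, finite hypothesis space of threshold classifiers) computes R_emp(h) for each candidate hypothesis and keeps the minimizer.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical training sample S = {(x_i, u_i)}: labels follow a threshold at 0.6.
    m = 30
    x = rng.uniform(0, 1, size=m)
    u = (x > 0.6).astype(int)

    # Finite hypothesis space H: classifiers h_t(x) = 1[x > t] for a grid of thresholds t.
    thresholds = np.linspace(0, 1, 21)

    def empirical_risk(t):
        """R_emp(h_t) = (1/m) * sum of 0-1 losses on S."""
        return np.mean((x > t).astype(int) != u)

    risks = [empirical_risk(t) for t in thresholds]
    best = thresholds[int(np.argmin(risks))]
    print(best, min(risks))   # threshold near 0.6, empirical risk near 0

With a finite hypothesis space the ArgMin is an exhaustive scan; with a richer, parameterized space one would instead rely on an exploration method such as the gradient descent sketched earlier.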
The ERM principle was, often implicitly, the principle used in artificial intelligence since the
origin, in connectionism as well as in symbolic learning. What could be more natural indeed than
to consider that a regularity observed on the known data will still be verified by the phenomenon
that produced these data? It is for example the guiding principle of the perceptron algorithm as
well as that of the ARCH system. In these two cases, one seeks a hypothesis coherent with the
examples, i.e. of null empirical risk. It is possible to refine the principle of empirical risk
minimization by choosing, among the optimal hypotheses, either one of the most specific or one of
the most general.
The choice of the most probable hypothesis given the training sample. It is the Bayesian decision
principle. The idea here is that it is possible to define a probability distribution on the
hypothesis space H, and that the knowledge prior to the learning can be expressed in particular in
the form of an a priori probability distribution on the hypothesis space. The learning sample is
then regarded as information modifying the probability distribution on H (see Figure 2-1). One can
then either choose the most probable a posteriori hypothesis (the maximum likelihood principle, or
Maximum A Posteriori (MAP)), or adopt a composite hypothesis resulting from the average of the
hypotheses weighted by their a posteriori probability (true Bayesian approach).
Figure 2-1 The space of hypotheses H is supposed to be provided with an a priori probability density. Learning consists in modifying this density according to the learning examples.
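A small sketch of the Bayesian principle on a finite hypothesis space (the prior, the candidate hypotheses and the data-generating bias are all hypothetical): the a priori distribution on H is multiplied by the likelihood of the sample, and one can then keep the MAP hypothesis or average the hypotheses weighted by their a posteriori probability.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical sample: binary labels generated by a coin of bias 0.7.
    u = (rng.uniform(size=40) < 0.7).astype(int)

    # Finite hypothesis space: "the coin has bias p" for a few candidate values p.
    H = np.array([0.3, 0.5, 0.7, 0.9])
    prior = np.full(len(H), 1.0 / len(H))          # a priori distribution on H

    # Likelihood of the observed labels under each hypothesis.
    k = u.sum()
    likelihood = H**k * (1 - H)**(len(u) - k)

    posterior = prior * likelihood
    posterior /= posterior.sum()                   # a posteriori distribution on H

    h_map = H[np.argmax(posterior)]                # MAP hypothesis
    bayes_prediction = np.dot(posterior, H)        # posterior-weighted average
    print(h_map, bayes_prediction)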

The choice of a hypothesis that compresses as well as possible the information contained in
the training sample. We will call this precept: the information compression principle. The idea
is to eliminate the redundancies present in the data in order to extract the subjacent regularities
allowing an economic description of the world. It is implied that the regularities discovered in
the data are valid beyond the data and apply to the whole world.
The question is to know if these ideas intuitively tempting make it possible to learn effectively.
More precisely, we would like to obtain answers to a certain number of naive questions :
does the application of the selected inductive principle to minimize the real risk indeed?
what conditions should be checked for that? Moreover, the conditions must be verified on
the training sample, or on the target functions, or by the supervisor, or on the hypotheses
space.
how the performance in generalization depends on the information contained in the training
sample, or of its size, etc. ?
which maximum performance is possible for a given learning problem?
which is the best LM for a given learning problem?
To answer these questions implies choices that depend partly on the type of inductive principle
used. It is why we made a brief description of it above.
8

1.1.2 How to Analyze the Learning?
We described learning, at least inductive learning, as a problem of optimization: to seek the best
hypothesis in the sense of the minimization of the mean risk from a training sample. We now want
to study under which conditions the resolution of such a problem is possible. We also want to have
tools permitting to judge the performance of an inductive principle or of a learning algorithm.
This analysis requires additional assumptions, which correspond to options on what is expected
from the LM.
Thus, a learning problem depends on the environment, which generates the data x_i according to a
certain unknown distribution D_X, on the supervisor, which chooses a target function f, and on the
selected loss function l. The performance of the LM (which depends on the selected inductive
principle and on the learning algorithm carrying it out) will be evaluated according to the
choices of each one of these parameters. When we seek to determine the expected performance of the
LM, we must thus discuss the source of these parameters. There are in particular three
possibilities:
1. It is supposed that one does not know anything a priori about the environment, therefore
neither about the distribution of the learning data nor about the target dependence, but one wants
to guard oneself against the worst possible situations, as if the environment and the supervisor
were adversaries. One then seeks to characterize the performance of learning in the worst possible
situations, which is generally expressed by bounds on the risk. It is the worst-case analysis. One
also speaks of the Min Max framework, by reference to game theory. The advantage of this point of
view is that the guarantees of possible performances will be independent of the environment (the
real risk being evaluated whatever the distribution of the events) and of the supervisor or Nature
(i.e. whatever the target function). On the other hand, the conditions identified to obtain such
guarantees will be so strong that they will often be very far away from the real situations of
learning.
2. One can on the contrary want to measure an average performance. In this case, it should be
supposed that there is a distribution D_X on the learning data, but also a distribution D_F on the
possible target functions. The resulting analysis is the average-case analysis. One also speaks of
the Bayesian framework. This analysis allows in theory a finer characterization of the
performance, at the price however of having to make a priori assumptions on the spaces X and F.
Unfortunately, it is often very difficult to analytically obtain guarantee conditions for
successful learning, and it is generally necessary to use approximation methods, which removes a
part of the interest of such an approach.
3. Finally, one could seek to characterize the most favorable case, when the environment and the
supervisor are benevolent and want to help the LM. But it is difficult to determine the border
between benevolence, that of a professor for example, and collusion, which would see the
supervisor acting like an accomplice and coding the target function in a code known to the
learner; this would no longer be learning, but an illicit transmission. This is why this type of
analysis, though interesting, does not yet have a well-established framework.
1.1.3 Validity Conditions for the ERM Principle
In this section, we concentrate on the analysis of the ERM inductive principle, which prescribes
to choose the hypothesis minimizing the empirical risk measured on the learning sample. It is
indeed the most employed rule, and its analysis leads to very general conceptual principles. The
ERM principle has initially been the subject of a worst-case analysis, which we describe here. An
average-case analysis, using ideas from statistical physics, has also been the object of many very
interesting works; it is however technically definitely more difficult.
Let us recall that the learning consists in seeking a hypothesis h that minimizes the average
loss. Formally, it is a question of finding an optimal hypothesis h* minimizing the real risk:

h^* = \mathrm{ArgMin}_{h \in H} \, R_{real}(h)
The problem is that one does not know the real risk attached to each hypothesis h. The natural
idea is thus to select the hypothesis h in H which behaves well on the learning data S: it is the
inductive principle of the ERM. We will note ĥ_S this optimal hypothesis for the empirical risk
measured on the sample S:

\hat{h}_S = \mathrm{ArgMin}_{h \in H} \, R_{emp}(h)
This inductive principle will be relevant only if the empirical risk is correlated with the real
risk. Its analysis must thus attempt to study the correlation between the two risks, and more
particularly the correlation between the real risk incurred by the hypothesis selected using the
ERM principle, R_real(ĥ_S), and the optimal real risk R_real(h*).
This correlation involves two aspects:
1. The difference (inevitably positive or null) between the real risk of the hypothesis ĥ_S
selected using the training sample S and the real risk of the optimal hypothesis h*:
R_real(ĥ_S) − R_real(h*).
2. The probability that this difference is higher than a given bound ε. Given indeed that the
empirical risk depends on the training sample, the correlation between the measured empirical risk
and the real risk depends on the representativeness of this sample. This is also why, when the
difference R_real(ĥ_S) − R_real(h*) is studied, it is necessary to take into account the
probability of the training sample given a certain target function. One cannot be a good learner
in all situations, but only for the reasonable ones (representative training samples), which are
the most probable.
Thus, let us take up again the question of the correlation between the empirical risk and the real
risk. The ERM principle is a valid inductive principle if the real risk computed with the
hypothesis ĥ_S that minimizes the empirical risk is guaranteed to be close to the optimal real
risk obtained with the optimal hypothesis h*. This closeness must hold in the large majority of
the situations that can occur, i.e. for the majority of the learning samples drawn at random
according to the distribution D_X.
In a more formal way, one seeks under which conditions it would be possible to ensure:

\forall \varepsilon > 0, \forall \delta \leq 1 : \quad P\big( R_{real}(\hat{h}_S) - R_{real}(h^*) \geq \varepsilon \big) < \delta \qquad (1)

Figure 2-2
It is well understood that the correlation between the empirical risk and the real risk depends on
the training sample S and, since this one is drawn randomly, on its size m too. That suggests a
natural application of the law of large numbers, according to which, under very general
conditions, the average of a random variable (here R_emp(h)) converges towards its expectation
(here R_real(h)) when the size m of the sample grows.
The law of large numbers encourages to ensure the inequality (1) by letting the size of the
training set S grow, and to ask from which size m of the training sample, drawn randomly
(according to an unspecified distribution D_X), the inequality (1) is guaranteed:

\forall \varepsilon > 0, \forall \delta \leq 1, \; \exists m \text{ such that } \quad P\big( R_{real}(\hat{h}_S) - R_{real}(h^*) \geq \varepsilon \big) < \delta

Figure 2-3 illustrates the desired convergence of the empirical risk towards the real risk.
Figure 2-3 Convergence of the empirical risk towards the real risk.
Definition 1.1 (The consistency of the ERM principle)
The ERM principle is said to be consistent if the unknown real risk R_real(ĥ_S) and the empirical
risk R_emp(ĥ_S) converge towards the same limit R_real(h*) when the size m of the sample tends
towards infinity (see Figure 2-4).

Figure 2-4 Consistency of the ERM principle.
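The consistency property can be checked empirically; the following Monte-Carlo sketch (threshold classifiers, a noisy threshold target, and a large held-out sample standing in for the real risk are all hypothetical choices) shows R_emp(ĥ_S) and an estimate of R_real(ĥ_S) both approaching the optimal real risk as m grows.

    import numpy as np

    rng = np.random.default_rng(4)

    def sample(m, noise=0.1):
        """Draw (x, u) with target 1[x > 0.5] and label-flipping noise."""
        x = rng.uniform(0, 1, size=m)
        u = (x > 0.5).astype(int)
        flip = rng.uniform(size=m) < noise
        return x, np.where(flip, 1 - u, u)

    thresholds = np.linspace(0, 1, 101)   # finite hypothesis space

    def erm(x, u):
        risks = [np.mean((x > t).astype(int) != u) for t in thresholds]
        i = int(np.argmin(risks))
        return thresholds[i], risks[i]

    x_test, u_test = sample(100_000)       # large sample to estimate real risks
    for m in (10, 100, 1000, 10000):
        x, u = sample(m)
        t_hat, r_emp = erm(x, u)
        r_real = np.mean((x_test > t_hat).astype(int) != u_test)
        print(m, round(r_emp, 3), round(r_real, 3))   # both tend towards ~0.1 (the optimal real risk)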
Unfortunately, the law of large numbers is not sufficient for our study. Indeed, what this law
affirms is that the empirical risk of a given hypothesis h converges towards its real risk.
However, what we seek is different. We want to ensure that the hypothesis ĥ_S taken in H which
minimizes the empirical risk for the sample S has an associated real risk that converges towards
the optimal real risk obtained for the optimal hypothesis h*, which is independent of S. It is
necessary to see that in this case the training sample does not play only the role of a test set,
but also the role of a set used for the choice of the hypothesis. One cannot thus take without
precaution the performance measured on the learning sample as representative of the real
performance.
Indeed, one can build hypothesis spaces H such that it is always possible to find a hypothesis
with null empirical risk without that indicating a good general performance. It is sufficient to
imagine a hypothesis that agrees with all the learning data and which randomly draws the labels of
the unseen data. This is why it is necessary to generalize the law of large numbers.
This generalization is easy in the case of a finite space of hypothesis functions. It was obtained
only recently by Vapnik and Chervonenkis (1971, 1989), within the framework of induction, for the
case of spaces of infinite size.
Curs 03 - The Bias-Variance Tradeoff
The bias-variance tradeoff expresses the effect of various possible factors on the final error
between the hypothesis chosen by the LM and the one it should ideally have chosen, the target
function.
According to the general model of learning from examples, the LM receives from the environment a
sample of data {x_1, ..., x_m}, where x_i ∈ X. In the absence of additional information on their
source, and for reasons of simplicity of modeling and mathematical analysis, one will suppose that
these objects are drawn randomly and independently of one another according to a probability
distribution D_X (this is what one calls the assumption of independently and identically
distributed data). Attached to each of these data x_i, the LM receives in addition a label or
supervision u_i produced according to a functional dependence between x and u.
We note S = {z_1 = (x_1, u_1), ..., z_m = (x_m, u_m)} the learning sample, made up here of
supervised examples. To simplify, we will suppose that the functional dependence between an entry
x and its label takes the form of a function u = f(x) belonging to a family of functions F.
Without losing generality, we also suppose that there can be erroneous labeling, in the form of a
noise, i.e. a measurable bias between the proposed label and the true label according to f. The LM
seeks to find a hypothesis function h ∈ H, in the space of functions H, as near as possible to f,
the target function. We will specify later the concept of proximity used to evaluate the distance
between f and h.
Figure 3-1 illustrates the various sources of error between the target function f and the
hypothesis function h. We call total error the error resulting from the conjunction of these
various errors between f and h. Let us detail them.
The first source of error comes from the following fact: nothing allows us a priori to postulate
the equality between the space F of target functions of Nature and the space H of hypothesis
functions realizable by the LM. Because of this, even if the LM provides an optimal hypothesis h*
(in the sense of the proximity measurement mentioned above), h* is inevitably taken in H and can
thus be different from the target function f. It is the approximation error, often called
inductive bias (or simply bias), due to the difference between F and H.

Then, the LM does not provide in general the optimal hypothesis in but a hypothesis
based on the learning sample . Depending of this sample, the learned hypothesis will be
able to vary inside a set of functions that we denote by
{
*
h H

h S

h
}

S
h to underline the dependence of each
one of its elements on the random sample . The distance between and the estimated
hypothesis who depends on the particularities of is the estimating error. One can show
formally that it is the variance related on the sensitivity of the calculation of the hypothesis as
function of the sample . More the hypotheses space is rich, more, in general, this variance
is important.
S
*
h

h S

h
S H


Finally, there is the noise on labeling: because of transmission errors, the label u_i associated
with x_i can be inaccurate with respect to f. Hence the LM receives a sample of data relative to a
disturbed function f_b = f + noise. It is the intrinsic error, which generally complicates the
search for the optimal hypothesis h*.
Figure 3-1 The various types of errors arising in the estimation of a target function starting from a learning sample. With a more restricted space of hypotheses, one can reduce the variance, but generally at the price of a greater approximation error.
Given these circumstances, the bias-variance tradeoff can be defined in the following way: to
reduce the bias due to the bad adequacy of H to F, it is necessary to increase the richness of H.
Unfortunately, this enrichment will generally be paid for by an increase in the variance. Because
of this, the total error, which is the sum of the approximation error and the estimation error,
cannot be significantly decreased.
The bias-variance tradeoff should thus rather be called the approximation error / estimation error
tradeoff. However, the important thing is that it is indeed a question of making a compromise,
since one deals with a sum of terms that vary together in opposite directions. On the other hand,
the noise, or intrinsic error, can only worsen things as it increases. The ideal would be to have
a null noise and a restricted hypothesis space H to reduce the variance, but at the same time a
well-informed one, i.e. containing only functions close to the target function, which would
obviously be equivalent to having a priori knowledge on Nature.
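A classical way to see the tradeoff is to let the richness of H grow and watch the two errors; the sketch below (the sine target, the noise level and the polynomial degrees are hypothetical choices) fits polynomials of increasing degree: the error on the learning sample keeps decreasing, while the error on independent data eventually increases again.

    import numpy as np

    rng = np.random.default_rng(5)

    def target(x):
        return np.sin(2 * np.pi * x)          # hypothetical target function f

    def sample(m, noise=0.2):
        x = rng.uniform(0, 1, size=m)
        return x, target(x) + rng.normal(scale=noise, size=m)

    x_train, u_train = sample(20)
    x_test, u_test = sample(10_000)

    for d in (1, 3, 9, 12):                    # increasingly rich hypothesis spaces H_d
        coeffs = np.polyfit(x_train, u_train, deg=d)       # least-squares fit in H_d
        train_err = np.mean((np.polyval(coeffs, x_train) - u_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - u_test) ** 2)
        print(d, round(train_err, 3), round(test_err, 3))
        # training error decreases with d; test error first drops (bias shrinks)
        # then rises again (variance grows): the bias-variance tradeoff.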
Regularization Methods
The examination of the bias-variance tradeoff and the analysis of the ERM principle by Vapnik have
clearly shown that the mean risk (the real risk) depends at the same time on the empirical risk
measured on the learning sample and on the "capacity" of the space H of hypothesis functions. The
larger this capacity is, the greater the chance to find a hypothesis close to the target function
(small approximation error), but also the more the hypothesis minimizing the empirical risk
depends on the particular learning sample provided (large estimation error), which prohibits
transferring with certainty the performance measured by the empirical risk to the real risk.
In other words, supervised induction must always face the risk of over-fitting. If the space H of
hypotheses is too rich, there are strong chances that the selected hypothesis, whose empirical
risk is small, presents a high real risk. That is because several hypotheses can have a small
empirical risk on a learning sample while having very different real risks. It is thus not
possible, only on the basis of the measured empirical risk, to distinguish the good hypotheses
from the bad ones. It is thus necessary to restrict as much as possible the richness of the
hypothesis space, while seeking to preserve a sufficient approximation capacity.
Tuning the Hypotheses Class
Since one can measure only the empirical risk, the idea is thus to try to evaluate the real risk
by correcting the empirical risk, necessarily optimistic, by a penalization term corresponding to
a measurement of the capacity of the hypothesis space H used. This is the essence of all the
induction approaches which revise the ERM principle (the adaptation to the data) by a
regularization term (depending on the hypothesis class). This fundamental idea is found at the
heart of a whole set of methods like regularization theory, the Minimum Description Length
Principle (MDLP), the Akaike information criterion (AIC), and other methods based on complexity
measures.
The problem thus defined has been known, at least empirically, for a long time, and many
techniques were developed to solve it. One can arrange them in three principal categories: model
selection methods, regularization techniques, and averaging methods.
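The following sketch illustrates the penalization idea with a ridge-style regularization of polynomial coefficients; the data, the degree and the values of the regularization coefficient lam are hypothetical, and this is only one instance of an "empirical risk + penalization" criterion, not the specific MDLP or AIC formulas.

    import numpy as np

    rng = np.random.default_rng(6)

    # Hypothetical sample from a sine target with noise.
    x = rng.uniform(0, 1, size=20)
    u = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

    degree = 9
    X = np.vander(x, degree + 1)               # polynomial design matrix

    for lam in (0.0, 1e-4, 1e-1):
        # Minimize  ||X w - u||^2 + lam * ||w||^2  (empirical risk + penalization term).
        w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ u)
        emp_risk = np.mean((X @ w - u) ** 2)
        print(lam, round(emp_risk, 4), round(float(np.sum(w ** 2)), 2))
        # a larger lam gives a larger empirical risk but much smaller (smoother) coefficients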
In the model selection methods, the approach consists in considering a hypothesis space H and
decomposing it into a discrete collection of nested subspaces H_1 ⊂ H_2 ⊂ ... ⊂ H_d ⊂ ..., then,
given a learning sample, in trying to identify the optimal subspace in which to choose the final
hypothesis. Several methods were proposed within this framework, which one can gather into two
types:
complexity penalization methods, among which appear the structural risk minimization (SRM)
principle of Vapnik, the Minimum Description Length principle of Rissanen (1978) and various
statistical methods or criteria of selection;
methods of validation by multiple learning, among which appear cross-validation and bootstrapping.
The regularization methods act in the same spirit as the model selection methods, except that they
do not impose a discrete decomposition on the hypothesis class. A penalization criterion is
associated to each hypothesis, which either measures the complexity of its parametric structure,
or global properties of "regularity" related, for example, to the differentiability of the
hypothesis functions or to their dynamics (for example, high-frequency functions, i.e. changing
value quickly, will be more penalized than low-frequency functions).
The averaging methods do not select a single hypothesis in the space H, but choose a weighted
combination of hypotheses to form one prediction function. Such a weighted combination can have
the effect of "smoothing" erratic hypotheses (as in the Bayesian averaging methods and in the
bagging methods), or of increasing the representation capacity of the hypothesis class if this one
is not convex (as in the boosting methods).
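A minimal sketch in the spirit of bagging (hypothetical data; bootstrap replicates of S, one polynomial hypothesis per replicate, uniform weights): the averaged predictor is smoother than the individual, rather erratic, hypotheses.

    import numpy as np

    rng = np.random.default_rng(7)

    x = rng.uniform(0, 1, size=30)
    u = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)
    x_grid = np.linspace(0, 1, 5)

    degree = 7
    predictions = []
    for _ in range(50):                          # 50 bootstrap replicates of S
        idx = rng.integers(0, len(x), size=len(x))
        coeffs = np.polyfit(x[idx], u[idx], deg=degree)   # one (erratic) hypothesis
        predictions.append(np.polyval(coeffs, x_grid))

    bagged = np.mean(predictions, axis=0)        # uniformly weighted combination
    print(np.round(bagged, 2))
    print(np.round(np.sin(2 * np.pi * x_grid), 2))   # target values for comparison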
All these methods generally lead to notable improvements of the performances compared to the
"naive" methods. However, they must be used carefully. On the one hand, indeed, they sometimes
correspond to an increase in the richness of the hypothesis space, and thus to an increased risk
of over-fitting. On the other hand, they frequently require expertise to be applied, in particular
because additional parameters must be tuned. Some recent work tries, for these reasons, to
determine automatically the suitable complexity of the candidate hypotheses in order to adapt to
the learning data.
Selection of the Models
We will now define more formally the problem of model selection, which is the objective of all
these methods.
Let H_1 ⊆ H_2 ⊆ ... ⊆ H_d ⊆ ... be a nested sequence of spaces or classes of hypotheses (or
models), where the spaces are of increasing capacity. The target function f may or may not be
included in one of these classes. Let h*_d be the optimal hypothesis in the class of hypotheses
H_d, and R(d) = R_real(h*_d) the associated real risk. We note that the sequence {R(d)}_{d≥1} is
decreasing, since the hypothesis classes are nested and thus their capacity to approximate the
target function f can only increase.
Using these notations, the problem of model selection can be defined as follows.
Definition 1.2 The model selection problem consists in choosing, on the basis of a learning sample
S of length m, a class of hypotheses H_{d*} and a hypothesis h_{d*} ∈ H_{d*} such that the
associated real risk R_real(h_{d*}) is minimal.
The underlying conjecture is that the real risk associated with the hypothesis h_d selected in
each class H_d presents a global minimum for a nontrivial value of d (i.e. different from zero and
from m), corresponding to the "ideal" hypothesis space H_{d*} (see Figure 3-2).
It is thus a question of finding the ideal hypothesis space H_{d*}, and in addition of selecting
the best hypothesis h_d in H_{d*}. The definition says nothing about this last problem. It is
generally solved by using the ERM principle, dictating to seek the hypothesis that minimizes the
empirical risk.
For the selection of H_{d*}, one uses an estimate of the optimal real risk in each H_d, obtained
by choosing the best hypothesis according to the empirical risk (the ERM method) and by correcting
the associated empirical risk with a penalization term related to the characteristics of the space
H_d. The problem of model selection then consists in solving an equation of the type:

d^* = \mathrm{ArgMin}_{d} \big\{ R^{estimated}_{real}(h_d) : h_d \in H_d \big\} = \mathrm{ArgMin}_{d} \big\{ R_{emp}(h_d) + \text{penalization term} : h_d \in H_d \big\}

Let us note that the choice of the best hypothesis space depends on the size m of the data sample.
The larger it is, the more it is possible, if necessary, to choose without risk (i.e. with a small
variance or confidence interval) a rich hypothesis space, making it possible to approach the
target function f as closely as possible.
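A sketch of this selection equation on nested polynomial classes H_d (the penalization term used here, proportional to d/m, is a hypothetical stand-in, not Vapnik's SRM bound nor Rissanen's MDL criterion): for each d the ERM hypothesis is fitted, its empirical risk is corrected by the penalty, and the d with the smallest corrected risk is retained.

    import numpy as np

    rng = np.random.default_rng(8)

    m = 30
    x = rng.uniform(0, 1, size=m)
    u = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=m)

    def erm_in_Hd(d):
        """ERM hypothesis in H_d = polynomials of degree d, and its empirical risk."""
        coeffs = np.polyfit(x, u, deg=d)
        return coeffs, np.mean((np.polyval(coeffs, x) - u) ** 2)

    penalized = []
    for d in range(1, 13):
        _, r_emp = erm_in_Hd(d)
        penalty = 0.05 * (d + 1) / m            # hypothetical penalization term
        penalized.append(r_emp + penalty)

    d_star = 1 + int(np.argmin(penalized))
    print(d_star)                                # the selected hypothesis space H_{d*}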
Figure 3-2 The bound on the real risk results from the sum of the empirical risk and a confidence interval depending on the capacity of the associated hypothesis space H_d. Supposing that one has a nested sequence of hypothesis spaces of increasing capacity indexed by d, the accessible optimal empirical risk decreases for increasing d (corresponding to the bias), while the confidence interval (corresponding to the variance) increases. The minimal bound on the real risk is reached for a suitable hypothesis space H_{d*}.

Evaluation of the Learning Performances
We have a set of examples and a learning algorithm of which we can tune certain parameters. This
algorithm returns a hypothesis. How can we evaluate the performance of this hypothesis?
A solution consists in applying the theoretical results that provide probability bounds on the
real risk according to the empirical risk. These bounds have the general form:

R_{real}(h) \leq R_{emp}(h) + \Phi\big( d_{VC}(H), m \big)

where Φ is a function of the Vapnik-Chervonenkis dimension d_VC(H) of the hypothesis space H and m
is the size of the training sample S. Even if one can in theory obtain asymptotically tight
bounds, the assumptions to be made to compute Φ(d_VC(H), m) in practice often imply such margins
that the obtained bounds are too loose and do not allow to estimate the real performance
precisely. That is why, except in favorable particular cases, the learning performance is
generally estimated by empirical measurements.
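In practice the real risk is therefore estimated empirically; the sketch below (hypothetical data, a hand-coded 5-fold split, threshold classifiers as the hypothesis space) illustrates cross-validation: the hypothesis is trained on four folds, evaluated on the held-out fold, and the average held-out error estimates the real risk.

    import numpy as np

    rng = np.random.default_rng(9)

    m = 100
    x = rng.uniform(0, 1, size=m)
    u = (x + rng.normal(scale=0.1, size=m) > 0.5).astype(int)

    def fit_threshold(x_tr, u_tr):
        """ERM over threshold classifiers 1[x > t]."""
        grid = np.linspace(0, 1, 101)
        errs = [np.mean((x_tr > t).astype(int) != u_tr) for t in grid]
        return grid[int(np.argmin(errs))]

    folds = np.array_split(rng.permutation(m), 5)       # 5-fold cross-validation
    scores = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        t = fit_threshold(x[train_idx], u[train_idx])
        scores.append(np.mean((x[test_idx] > t).astype(int) != u[test_idx]))

    print(round(float(np.mean(scores)), 3))   # empirical estimate of the real risk

The held-out data play the role of an independent sample, so the estimate does not suffer from the optimism of the empirical risk measured on the training data.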
A Posteriori Empirical Evaluation
We admit most of the time in this paragraph that the optimization algorithm employed for the
training worked perfectly, i.e. it discovered the best hypothesis within the framework of the ERM
principle. It is an unrealistic simplification, but of no importance for the developments
presented here. To return to the practical case, it is enough to remember that in general the
empirical risk of the hypothesis found by the algorithm is not minimal.
Let h*_H be the hypothesis that minimizes R_real(h) for h ∈ H:

h^*_H = \mathrm{ArgMin}_{h \in H} \, R_{real}(h)

Its real risk R_real(h*_H) can be noted more simply R_real(H), since this hypothesis depends only
on H.
This hypothesis is theoretically the one that any training algorithm should seek to approach.
Nevertheless, this search is illusory: one cannot actually know whether one is close to it or not.
The assumption made by the ERM method is that one can replace its search by that of the hypothesis
h*_{S,H} described in the following paragraph.
The empirical risk of h*_H on the training data can be noted R_emp(h*_H), but this term is in
general not measurable since h*_H is unknown.
One notes h*_{S,H} the hypothesis that minimizes the empirical risk on the training sample S:

h^*_{S,H} = \mathrm{ArgMin}_{h \in H} \, R_{emp}(h)
This hypothesis is the one that the learning algorithm seeks to find using S, in accordance with
the ERM principle. As one is never sure that the selected algorithm finds this hypothesis, one
knows neither its empirical risk R_emp(h*_{S,H}) nor its real risk R_real(h*_{S,H}).
One notes $h^*_{alg,\mathcal{S},\mathcal{H}}$ the hypothesis found by the training algorithm. It depends on $\mathcal{H}$, $\mathcal{S}$ and the training algorithm. Its empirical risk $R_{emp}(h^*_{alg,\mathcal{S},\mathcal{H}})$ is measured on the training sample. Its real risk $R_{real}(h^*_{alg,\mathcal{S},\mathcal{H}})$ can be estimated by methods that we will see below.
As said above, one supposes for the moment that $h^*_{alg,\mathcal{S},\mathcal{H}} = h^*_{\mathcal{S},\mathcal{H}}$, i.e. that the algorithm is effective from the point of view of the ERM principle. However, in reality the empirical risk $R_{emp}(h^*_{alg,\mathcal{S},\mathcal{H}})$ generally differs from $R_{real}(h^*_{\mathcal{S},\mathcal{H}})$ and, in addition, most of the time
$$R_{real}\left(h^*_{alg,\mathcal{S},\mathcal{H}}\right) \ge R_{real}\left(h^*_{\mathcal{S},\mathcal{H}}\right).$$

Practical Selection of the Model
We seek to approach $h^*_{\mathcal{S},\mathcal{H}}$. The choice of the hypotheses space $\mathcal{H}$ that one explores being left free, it would be useful to know how to compare a priori two spaces $\mathcal{H}$ and $\mathcal{H}'$. However, one does not have any indication in general on this subject. On the other hand, once $\mathcal{H}$ is selected, it is often easy to order it partially according to a criterion. One can often index its elements by the order of the algorithmic complexity of the program which fulfills the corresponding decision function, and parameterize the learning algorithm according to this index.
It will thus be supposed that one can define a nested sequence of sets $\mathcal{H}_k$ of increasing algorithmic complexity:
$$\mathcal{H}_1 \subset \ldots \subset \mathcal{H}_k \subset \mathcal{H}_{k+1} \subset \ldots \subset \mathcal{H}$$
We also suppose that the target function f ends up being included in one of these sets of increasing size, i.e. $f \in \mathcal{H}$.
Let us note:
- $h^*_{\mathcal{H}_k}$ the hypothesis having the smallest real risk (probability of error) of $\mathcal{H}_k$;
- $h^*_{\mathcal{S},\mathcal{H}_k}$ the hypothesis having the smallest empirical risk (the apparent error rate) of $\mathcal{H}_k$.
Let us recall that we make the simplifying assumption that the learning algorithm is ideal from the point of view of the ERM principle: it is supposed to be able to discover, for any training set, the hypothesis $h^*_{\mathcal{S},\mathcal{H}_k}$ of $\mathcal{H}_k$.
What can one say about $h^*_{\mathcal{H}_k}$ and $h^*_{\mathcal{S},\mathcal{H}_k}$ for a given k, and when k varies?
k is constant.
One has first of all:
$$R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right) \ge R_{real}\left(h^*_{\mathcal{H}_k}\right)$$
This inequality simply translates the fact that the hypothesis $h^*_{\mathcal{S},\mathcal{H}_k}$, found by the algorithm, is in general not optimal in terms of real error, because the training set cannot perfectly summarize the probability distributions of all the data. It is the problem of any generalization.
One also has in general:
$$R_{real}\left(h^*_{\mathcal{H}_k}\right) \ge R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$$
This formula expresses the fact that the learned hypothesis, being tuned on the learning data, tends to fit their characteristics too closely, to the detriment of good generalization in the sense of the ERM principle.
k increases.
One has in general, for all k:
- $R_{real}(\mathcal{H}_k)$ decreases when k increases;
- $R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$ decreases when k increases.
Indeed, the apparent error of $h^*_{\mathcal{S},\mathcal{H}_k}$ decreases when k increases, in general until it becomes zero for k large enough: in a sufficiently complex hypotheses space, one can learn the sample $\mathcal{S}$ by heart.
In general the value
$$R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right) - R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$$
is positive and increases with k.
One also has, when k increases, up to a given $k_0$¹:
$$R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_1}\right) \ge R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_2}\right) \ge \ldots \ge R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_{k_0-1}}\right) \ge R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_{k_0}}\right)$$

¹ To simplify, it is supposed that there is a single such value; actually, it is an area around this value. However, that does not change the basic argument.
It seems that increasing k has a positive effect, since the error probability of the learned hypothesis tends to decrease.
However, beyond $k_0$, the inequality is reversed:
$$\forall k \ge k_0:\quad R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_{k_0}}\right) \le \ldots \le R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right) \le R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_{k+1}}\right) \le \ldots$$
Thus, beyond a value $k_0$, the real performance of the learned hypotheses decreases.
The last phenomenon is called over-fitting (see Figure 1-2). Intuitively, it means that a hypothesis of too great complexity represents the training set too exactly, i.e. realizes in fact learning by heart, to the detriment of its generalization quality. There is consequently a compromise value $k_0$, about which one has no a priori information, which is the best for a given training sample and a family of hypotheses ordered according to the complexity index k. The value $k_0$ is thus crucial. To estimate it, one can use a validation sample $\mathcal{V}$.
In the case where the learning algorithm is not optimal from the ERM point of view, the same phenomenon is met.
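To make this concrete, here is a minimal sketch (Python/NumPy; the helper names are ours and the nested spaces $\mathcal{H}_k$ are taken, by assumption, to be thresholded polynomial scores of increasing degree) of how $k_0$ could be estimated with a validation sample.

```python
import numpy as np

def poly_features(x, degree):
    # Map a 1-D input to the monomials 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def fit_erm(x, y, degree):
    # Least-squares fit: a simple surrogate for empirical risk minimization in H_degree.
    return np.linalg.lstsq(poly_features(x, degree), y, rcond=None)[0]

def error_rate(w, x, y, degree):
    # Error of the thresholded polynomial score sign(phi(x) . w) against labels in {+1, -1}.
    return np.mean(np.sign(poly_features(x, degree) @ w) != y)

def select_k0(x_train, y_train, x_val, y_val, degrees=range(1, 11)):
    # Pick the complexity index whose learned hypothesis has the smallest validation error.
    scored = [(error_rate(fit_erm(x_train, y_train, d), x_val, y_val, d), d) for d in degrees]
    return min(scored, key=lambda t: t[0])[1]
```

The training error typically keeps decreasing with the degree, while the validation error goes through a minimum around $k_0$ before rising again, which is exactly the over-fitting behavior described above.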
An example
Let there be two classes of uniform density, one inside the central ellipse, the other between the external rectangle and this ellipse. The optimal separating curve is here known: it is the central ellipse.
To make the problem a little more difficult, we draw the 40 coordinates of the training points according to the uniform distributions, adding a Gaussian noise. The optimal separating curve remains the ellipse, but the points of the two classes are no longer all exactly on the right side of the separating curve. It is noticed that the noise put one of the points outside the ellipse and another point in its interior. The error of the best separating curve $h_B$ is thus not null, because of these noise effects. Its empirical value on the training data is 7.5% (3 points badly classified out of forty: $\frac{3}{40} = 7.5\%$).
Let $\mathcal{H}_1$ be the lines, $\mathcal{H}_2$ the curves of second degree, $\mathcal{H}_3$ the curves of third degree, etc. For k = 1, the optimal separating curve is the horizontal straight line in the middle of the figure. For k = 2, one has $h^*_{\mathcal{H}_2} = h_B$. The best separating surface thus belongs to $\mathcal{H}_2$; one is certain of it here only because the data are generated and $h_B$ is known. For $k \ge 3$ one has $h^*_{\mathcal{H}_k} = h^*_{\mathcal{H}_2} = h_B$, since one can bring a curve of higher degree back to a curve of the second degree by canceling the necessary coefficients. The best line, which minimizes the real error, is noted $h^*_{\mathcal{H}_1}$. It is computable since the probability densities of both classes are known: it is the horizontal median line of the ellipse. Its real error is also computable: it is $R_{real}(h^*_{\mathcal{H}_1}) = 35\%$. In our example, its empirical error is worth $\frac{10+7}{20+20} = 42.5\%$, since its empirical confusion matrix is the following:
For example, the number 7 in this matrix means that seven objects of one of the classes have been classified in the other class.
A good-quality training algorithm finds in $\mathcal{H}_1$ the line $h^*_{\mathcal{S},\mathcal{H}_1}$ that minimizes the empirical error. This one is $\frac{10+1}{20+20} = 27.5\%$, since its empirical confusion matrix is the following:

As the distributions are uniform and the geometry fixed, one can measure $R_{real}(h^*_{\mathcal{S},\mathcal{H}_1})$, which is 45%.
In $\mathcal{H}_2$, this algorithm finds $h^*_{\mathcal{S},\mathcal{H}_2}$, for which one has $R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_2}\right) = \frac{1+0}{20+20} = 2.5\%$.
It is the best ellipse that one can trace to separate the training data: it makes only one error. Its real risk $R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_2}\right)$ is 10%.
We do not treat the case of $\mathcal{H}_3$ and pass directly to $\mathcal{H}_4$. In $\mathcal{H}_4$ one can find $h^*_{\mathcal{S},\mathcal{H}_4}$ such that $R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_4}\right) = 0$, but this time $R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_4}\right)$ has increased to approximately 15% and exceeds $R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_2}\right)$.
One thus has $k_0 = 2$ in this example. In short:
Hypothesis | Curve | Empirical risk | Real risk
$h^*_{\mathcal{S},\mathcal{H}_1}$ | Oblique line | 27.5% | 45%
$h^*_{\mathcal{S},\mathcal{H}_2}$ | Leaning ellipse | 2.5% | 10%
$h^*_{\mathcal{S},\mathcal{H}_4}$ | Curve of degree 4 | 0% | 15%
$h^*_{\mathcal{H}_1}$ | Horizontal line | 42.5% | 35%
$h^*_{\mathcal{H}_2} = h_B$ | Central ellipse | 7.5% | 5%
In summary, for k constant:
$$R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right) \ge R_{real}\left(h^*_{\mathcal{H}_k}\right), \qquad R_{real}\left(h^*_{\mathcal{H}_k}\right) \ge R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$$
and, when k increases:
- $R_{real}(\mathcal{H}_k)$ and $R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$ decrease;
- the gap $R_{real}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right) - R_{emp}\left(h^*_{\mathcal{S},\mathcal{H}_k}\right)$ is positive and increases with k;
- up to the value $k_0$ the real risk of the learned hypothesis decreases, and beyond $k_0$ it increases again.
Estimation of the Hypothesis Real Risk
The simplest method to estimate the objective quality of a learned hypothesis h is to divide the set of examples into two independent sets: the first, noted $\mathcal{A}$, is used for the training of h, and the second, noted $\mathcal{T}$, is used to measure its quality. This second set is called the test sample (or test set). One has $\mathcal{S} = \mathcal{A} \cup \mathcal{T}$ and $\mathcal{A} \cap \mathcal{T} = \emptyset$.
As we will see, the measurement of the errors made by h on the test set $\mathcal{T}$ is an estimate of the real risk of h. This estimate is noted $\hat{R}_{real}(h)$.
Let us examine initially the particular case of learning a separating function for a classification rule.
Let us point out initially the definition of a confusion matrix:
Definition 1.3
The confusion matrix M(i, j) of a classification rule h is a $C \times C$ matrix whose generic element gives the number of examples of the test set $\mathcal{T}$ of the class i which were classified in the class j.
In the case of a binary classification, the confusion matrix is thus of the form:

              classified +       classified -
class +       true positives     false negatives
class -       false positives    true negatives

If all the errors have the same gravity, the sum of the nondiagonal terms of M, divided by the size t of the test set, is an estimate $\hat{R}_{real}(h)$, on $\mathcal{T}$, of the real risk of h:
$$\hat{R}_{real}(h) = \frac{\sum_{i \ne j} M(i,j)}{t}$$
Noting $t_{err}$ the number of objects of the test set that are misclassified, one has:
$$\hat{R}_{real}(h) = \frac{t_{err}}{t}$$
The empirical confusion matrix is the confusion matrix defined on the training set; for this matrix, the sum of the nondiagonal terms is proportional to the empirical risk, but it is not an estimate of the real risk.
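As an illustration, here is a small sketch (Python/NumPy; the function names are ours, not taken from any particular library) that builds the confusion matrix of a classifier on a test set and derives the error estimate above from its nondiagonal terms.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    # M[i, j] = number of test examples of class i classified in class j.
    M = np.zeros((len(classes), len(classes)), dtype=int)
    index = {c: k for k, c in enumerate(classes)}
    for t, p in zip(y_true, y_pred):
        M[index[t], index[p]] += 1
    return M

def estimated_real_risk(M):
    # Sum of the nondiagonal terms divided by the size t of the test set.
    t = M.sum()
    return (t - np.trace(M)) / t
```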
The estimation by the confidence interval
Which confidence can one grant to the estimate $\hat{R}_{real}(h)$? Can it be expressed numerically?
The answer to these two questions is given in a simple way by traditional statistical considerations. If the random samples of training and test are independent, then the precision of the estimate depends only on t, the number of examples of the test set, and on $\hat{R}_{real}(h)$ itself.
A sufficient approximation, if t is rather large (beyond a hundred), is given by the confidence interval of $R_{real}(h)$:
$$\left[\frac{t_{err}}{t} \pm \zeta(x)\sqrt{\frac{\frac{t_{err}}{t}\left(1 - \frac{t_{err}}{t}\right)}{t}}\right]$$
where $\zeta(x)$ is the factor associated with the confidence level x.
The function $\zeta(x)$ has in particular the following values:

x          50%    68%    80%    90%    95%    98%    99%
$\zeta(x)$ 0.67   1.00   1.28   1.64   1.96   2.33   2.58

The estimate of the real error rate on a test sample $\mathcal{T}$ independent of the training sample $\mathcal{A}$ provides an unbiased estimate of $R_{real}(h)$ with a controllable confidence interval, depending only on the size t of the test sample. The larger this size is, the more reduced is the confidence interval, and consequently the more the empirical error rate gives an indication of the real error rate.
Unfortunately, in the majority of applications the number of examples, i.e. the observations for which an expert provided a label, is limited. Generally each new example is expensive to obtain, and hence the training sample and the test sample cannot be increased arbitrarily. There is a conflict between the interest of having the largest possible learning sample $\mathcal{A}$ and that of having the largest possible sample $\mathcal{T}$ to test the result of the training. As the two samples must be independent, what is given to one is withdrawn from the other. This is why this method of validation is called the hold-out method. It can be applied when the data are abundant. If on the other hand the data are scarce, it is necessary to resort to other methods.
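A minimal sketch of the hold-out estimate and its confidence interval, under the normal approximation used above (Python/NumPy; the quantile values are hard-coded from the table rather than taken from a statistics library):

```python
import numpy as np

# Two-sided quantiles zeta(x) from the table above.
ZETA = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64,
        0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def holdout_estimate(y_true, y_pred, confidence=0.95):
    # Error rate on the test sample and its confidence interval.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    t = len(y_true)
    err = np.mean(y_true != y_pred)                     # t_err / t
    half = ZETA[confidence] * np.sqrt(err * (1.0 - err) / t)
    return err, (err - half, err + half)
```

For example, with t = 100 test examples and 15 errors, this gives 0.15 ± 1.96·sqrt(0.15·0.85/100) ≈ 0.15 ± 0.07 at the 95% level.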
The estimation by cross validation
The idea of the cross validation (N-fold cross-validation) consists in:
1. Dividing the training data $\mathcal{S}$ into N subsamples of equal size.
2. Retaining one of these subsamples, say the i-th, for the test, and learning on the N - 1 others.
3. Measuring the error rate $\hat{R}_{real,i}(h)$ on the subsample i.
4. Starting again N times, varying the subsample i from 1 to N.
The final error is given by the average of the measured errors:
$$\hat{R}_{real}(h) = \frac{1}{N}\sum_{i=1}^{N}\hat{R}_{real,i}(h)$$

One can show that this procedure provides an unbiased estimate of the real error rate. It is common to take for N values ranging between 5 and 10. In this manner, one can use a great part of the examples for the training while obtaining a precise measurement of the real error rate. On the other hand, it is necessary to carry out the training procedure N times.
The question arises, however, of knowing which learned hypothesis one must finally use. It is indeed probable that each learned hypothesis depends on the subsample i used for the training, and that one thus obtains N different hypotheses.
Note that if the learned hypotheses are very different from one another (supposing that one can measure this difference), this is perhaps an indication of the inadequacy of $\mathcal{H}$: it indeed seems to reveal a great variance (in general associated with a great Vapnik-Chervonenkis dimension), and thus the empirical risk carries little information.
The best is then to redo the training on the total set $\mathcal{S}$. The precision will be good, and the estimate of the error rate is known from the N previous trainings.
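A minimal sketch of this procedure (Python/NumPy), written for any training function `fit(X, y)` returning a hypothesis and any `predict(h, X)` returning labels; both names are placeholders for whatever learning algorithm is being evaluated:

```python
import numpy as np

def cross_validation_error(X, y, fit, predict, n_folds=10, seed=0):
    # Shuffle the indices, split them into n_folds parts of (almost) equal size,
    # train on the N-1 remaining folds, measure the error on the held-out fold, average.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        h = fit(X[train_idx], y[train_idx])
        errors.append(np.mean(predict(h, X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))
```

As noted above, after estimating the error this way one would typically retrain once on the whole of $\mathcal{S}$ to obtain the final hypothesis.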
The estimate by the leave-one-out method
When the available data are scarce, it is possible to push the cross-validation method to the extreme. In this case, one retains each time a single example for the test, and one repeats the training as many times as there are examples, on all the other examples each time.
It should be noted that if one of the interests of the leave-one-out validation is to produce less variable hypotheses, one can show, on the other hand, that the estimate of the error rate is often more variable than for cross-validation with a smaller N.
The hold-out method has the advantage of simplicity and speed. However, when the total number of available examples is restricted, it can be interesting not to distinguish between the training and test sets, but to use techniques requiring several training passes. In this case, one loses in computing time but gains in the quality of the estimation relative to the quantity of available data.
Some alternatives to the cross-validation method: bootstrap, jackknife
These techniques differ from the preceding ones by the use of sampling with replacement over the set of examples. The process is as follows: one draws an example randomly and places it in a set called the bootstrap set¹. The process is repeated n times and the training is then carried out on the bootstrap set. A test is made on the examples not present in this set, computing a first value $P_1$ of the classifier error. Another test is made on the complete set of examples, computing $P_2$. The whole set of operations is repeated K times. A certain linear combination of $\bar{P}_1$ and $\bar{P}_2$, the averaged values of $P_1$ and $P_2$, gives the value $\hat{R}_{real}(h)$. The theory (Hastie, 2001) proposes the formula:
$$\hat{R}_{real}(h) = 0.632\,\bar{P}_1 + 0.368\,\bar{P}_2$$

¹ It is known that the Baron of Munchausen could rise into the air by pulling on his own boots. The method of the same name gives quite astonishing results (here justified theoretically and practically).
based on the fact that the mean proportion of the examples not appearing in the bootstrap set is equal to 0.368. For small samples, the bootstrap method provides a remarkably precise estimate of $R_{real}(h)$. On the other hand, it requires a large value of K (several hundreds), i.e. a high number of trainings of the classification rule.
There is, finally, another close but more complex method called the jackknife, which aims at reducing the bias of the error rate estimated by plug-in, when the data are used at the same time for learning and for testing.
We refer the interested reader to Ripley (1996), pp. 72-73, where good references for the problem of performance estimation in general can also be found.
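A minimal sketch of this bootstrap estimate (Python/NumPy; `fit` and `predict` are again placeholders for the learning algorithm under study, and the 0.632/0.368 weights are the ones given above):

```python
import numpy as np

def bootstrap_632_error(X, y, fit, predict, n_rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    p1_values, p2_values = [], []
    for _ in range(n_rounds):
        boot = rng.integers(0, n, size=n)                 # draw n examples with replacement
        out_of_bag = np.setdiff1d(np.arange(n), boot)     # examples absent from the bootstrap set
        h = fit(X[boot], y[boot])
        p1_values.append(np.mean(predict(h, X[out_of_bag]) != y[out_of_bag]))
        p2_values.append(np.mean(predict(h, X) != y))
    # Linear combination of the averaged out-of-bag and full-set error rates.
    return 0.632 * np.mean(p1_values) + 0.368 * np.mean(p2_values)
```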
Algorithm Tuning by a Validation Set
The search for the best method for solving a given learning problem implies:
- the choice of the inductive principle;
- the choice of a measurement of performance, which often implies the choice of a cost function;
- the choice of a training algorithm;
- the choice of the hypotheses space, which depends partly on the choice of the algorithm;
- the tuning of the parameters controlling the running of the algorithm.
Generally the operator tests several methods on the learning problem in order to determine which one seems the most suitable for the class of problems concerned. How should one proceed?
It is necessary to be wary of an approach that seems natural. One could indeed believe that it is enough to measure, for each method, the empirical performance using one of the techniques described above. Proceeding in this way, one ends up minimizing the risk measured on the test sample, and hence tuning the method according to this sample. That is dangerous because, as in the over-fitting phenomenon, by working too hard towards this end one may move away from a reduction of the real risk. This is why one provides, beside the training sample and the test sample, a third sample independent of both others: the validation sample, on which one evaluates the real performance of the method. Hence one divides the supervised data into three parts: the training set $\mathcal{A}$, the test set $\mathcal{T}$ and the validation set $\mathcal{V}$.
The separation of the examples into three sets is also useful to determine at which moment certain training algorithms converge.
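The three-way split can be sketched as follows (Python/NumPy); the proportions are arbitrary, and `fit`, `predict` and the list of candidate parameter settings are placeholders:

```python
import numpy as np

def three_way_split(n, seed=0, frac_train=0.6, frac_test=0.2):
    # Random partition of the example indices into training, test and validation parts.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(frac_train * n), int(frac_test * n)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

def tune_and_evaluate(X, y, fit, predict, candidate_params):
    train, test, val = three_way_split(len(y))
    # Tune the parameter on the test set...
    scored = [(np.mean(predict(fit(X[train], y[train], p), X[test]) != y[test]), p)
              for p in candidate_params]
    best_p = min(scored, key=lambda t: t[0])[1]
    # ...and report the performance of the chosen setting on the untouched validation set.
    h = fit(X[train], y[train], best_p)
    return best_p, np.mean(predict(h, X[val]) != y[val])
```

The roles of the test and validation sets here follow the text above: the test set serves for tuning, and the validation set gives the final, independent evaluation.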
The ROC curve
Up to now we have primarily described evaluation methods of the performances taking into account only one number: the estimate of the real risk. However, in a context of decision-making, it may be useful to be finer in the performance evaluation and to take into account not only the number of errors, but also the rates of "false positives" and "false negatives" (available from the confusion matrix). Often, indeed, the cost of bad classification is not symmetrical, and one may prefer a slightly worse error rate if that makes it possible to reduce the most expensive type of error (for example, it is better to operate on an appendix wrongly (false positive) than not to detect an appendicitis (false negative)). The ROC curve (Receiver Operating Characteristic) allows tuning this compromise².

² These curves were used for the first time during the Second World War, when one wanted to quantify the capacity of radars to distinguish random interference from a signal really indicating the presence of aircraft.
Let us suppose that one characterizes the data items (for example patients) by a score that can result from the combination of examinations (for example the age of the patient, the family antecedents, the blood pressure, etc.). One can then establish, for each class, a graph giving the probability of belonging to this class according to this score (see Figure 1).
In a second stage, one determines, for each value of the score, the probability that a positive diagnosis is correct. This probability is given by the fraction of the training sample for which the prediction is exact (see Figure 2).
The third stage corresponds to the construction of the ROC curve. For each threshold on the score, one plots the rate of "true positives" against that of "false positives". If a straight line is obtained, one must conclude that the test has a 50% chance of leading to the good diagnosis. The more the curve bends upwards, the more consistent the test is (the ratio of "true positives" to "false positives" increases). The consistency is measured by the area under the curve; it increases with its curvature (see Figure 3).
Figure 1 Curves corresponding to the classes '+' and '-'

When one has found a classification system judged sufficiently good, it remains to choose the threshold for the 'class +' / 'class -' diagnosis. The choice of the threshold must provide a high proportion of true positives without involving an unacceptable proportion of false positives. Each point of the curve represents a particular threshold, ranging from the most severe (limiting the number of false positives at the price of many examples of the class '+' not being diagnosed, i.e. a strong proportion of false negatives and a small proportion of true positives) to the most lax (increasing the number of true positives at the price of many false positives, see Figure 3). The optimal threshold for a given application depends on factors such as the relative costs of false positives and false negatives, as well as the prevalence of the class '+'. For example, an operator (of telephony or cable television) seeks to detect the churners (in the jargon of the domain, subscribers likely to leave it). These leaving subscribers are very few, but very expensive. One will thus seek to detect as many of them as possible in order to try to retain them, even if that means detecting some false churners. One will then use a "lax" threshold.
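A minimal sketch of the construction described above (Python/NumPy): given a score for each example and its true class, each distinct threshold yields one (false-positive rate, true-positive rate) point, and the area under the resulting curve measures the consistency of the test. The function names are ours.

```python
import numpy as np

def roc_curve_points(scores, labels):
    # labels in {+1, -1}; a higher score should mean "more likely class +".
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
    # Lowering the threshold one example at a time gives one curve point per example.
    tpr = np.concatenate(([0.0], np.cumsum(labels == 1) / n_pos))
    fpr = np.concatenate(([0.0], np.cumsum(labels == -1) / n_neg))
    return fpr, tpr

def area_under_curve(fpr, tpr):
    # Trapezoidal integration of the ROC curve.
    return float(np.trapz(tpr, fpr))
```

A diagonal curve (area 0.5) corresponds to a test with a 50% chance of giving the good diagnosis; the closer the area is to 1, the better the compromise that can be tuned by moving the threshold along the curve.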
Figure 2 Threshold deciding for each class the "true positive", "false negative", "false positive" and "true negative ".
Other evaluation criteria
Beyond the numerical criteria, there is a certain number of qualities that allow distinguishing one hypothesis among others. One can enumerate:
1. The intelligibility of the training results;
2. The simplicity of the produced hypotheses.

This last criterion arises from a traditional rhetorical argument, Occam's razor, which affirms that it is wasteful to multiply "entities" uselessly³, in other words that a simple explanation is better than a complicated one.

³ Frustra fit per plura, quod fieri potest per pauciora, classically translated by: It is futile to do with more what can be done with less. Alternatively, Essentia non sunt multiplicanda praeter necessitatem: entities should not be multiplied unnecessarily. Guillaume d'Occam (1288-1348).
Figure 3 A ROC curve on the left. Two thresholds on this curve on the right

Comparison of the Learning Methods
One can always use various training algorithms for the same task. How should the difference in performance measured empirically between algorithms be interpreted? More concretely, is an algorithm whose error rate in binary classification is 0.17 better than another whose measured performance is 0.20?
The answer is not obvious, because the measured performance depends at the same time on the characteristics of the empirical tests carried out and on the test samples used. To decide between two systems knowing only one measurement each is thus problematic. The same question arises, besides, for two hypotheses produced by the same algorithm starting from different initial conditions.
A whole literature bears on this domain. Here we only list the main results.
Comparison of two Hypotheses Produced by the same Algorithm on two Different Test
Samples.
Let $h_1$ and $h_2$ be produced by the same training algorithm, and let $\mathcal{T}_1$ and $\mathcal{T}_2$ be two test sets of sizes $t_1$ and $t_2$. It is supposed that these two test samples are independent, i.e. i.i.d. We want to estimate the quantity:
$$\Delta_R(h_1, h_2) = R_{real}(h_1) - R_{real}(h_2)$$
If we have estimators of these two real risks, it is possible to show that an estimator of their difference is written as the difference of the estimators:
$$\hat{\Delta}_R(h_1, h_2) = \hat{R}_{real}(h_1) - \hat{R}_{real}(h_2)$$
Noting $t_{err,1}$ the number of data of $\mathcal{T}_1$ misclassified by the hypothesis $h_1$ and $t_{err,2}$ the number of data of $\mathcal{T}_2$ misclassified by the hypothesis $h_2$, one has:
$$\hat{\Delta}_R(h_1, h_2) = \frac{t_{err,1}}{t_1} - \frac{t_{err,2}}{t_2}$$
The confidence interval of this value is given by the formula:
$$\hat{\Delta}_R(h_1, h_2) \pm \zeta(x)\sqrt{\frac{\frac{t_{err,1}}{t_1}\left(1 - \frac{t_{err,1}}{t_1}\right)}{t_1} + \frac{\frac{t_{err,2}}{t_2}\left(1 - \frac{t_{err,2}}{t_2}\right)}{t_2}}$$
Comparison of two Algorithms on Different Test Sets
We consider now a situation that one often meets in practice: one has training data and one seeks which algorithm to apply to them, among the available panoply. For example, if one seeks a concept in $R^d$, so that the learning data are numerical and labeled positive or negative, one can use a separating function in the form of a hyperplane, or the output of a neural network, or the k-nearest neighbors, etc. How to choose the best one?
Let us suppose that we have at our disposal two algorithms $A_1$ and $A_2$ and a set of supervised data. The simplest method consists in dividing this set into two subsets $\mathcal{S}$ and $\mathcal{T}$, training $A_1$ and $A_2$ on $\mathcal{S}$ (possibly tuning the parameters using a subset $\mathcal{V}$ of $\mathcal{S}$), then comparing the performances obtained on $\mathcal{T}$. Two questions arise:
- can one trust this comparison?
- can one make a more precise comparison with the same data?
The answer to the first question is rather negative. The principal reason is that, the training sample being the same for the two methods, its characteristics will be magnified by the comparison. What one seeks is the better algorithm not on $\mathcal{S}$ but on the values of the target function, of which the examples of $\mathcal{S}$ are only a random selection.
To answer the second question positively, it is necessary to use a technique that browses the training and test data randomly, like cross validation. An effective algorithm is given below.

Algorithm 1.1 The comparison of two training algorithms
1. Divide the training data $\mathcal{D} = \mathcal{S} \cup \mathcal{T}$ into K equal parts, noted $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K$.
2. For i = 1, ..., K do:
   $\mathcal{S}_i \leftarrow \mathcal{D} - \mathcal{T}_i$
   Train the algorithm $A_1$ on $\mathcal{S}_i$; it provides the hypothesis $h_1^i$.
   Train the algorithm $A_2$ on $\mathcal{S}_i$; it provides the hypothesis $h_2^i$.
   $\delta_i \leftarrow R_1^i - R_2^i$, where $R_1^i$ and $R_2^i$ are the error rates of $h_1^i$ and $h_2^i$ on $\mathcal{T}_i$.
   end
3. $\bar{\delta} \leftarrow \dfrac{1}{K}\sum_{i=1}^{K}\delta_i$

The confidence interval of $\bar{\delta}$, which is an estimator of the difference in performance between the two algorithms, is then given by the formula:
$$\bar{\delta} \pm \zeta(x, K)\sqrt{\frac{\sum_{i=1}^{K}\left(\delta_i - \bar{\delta}\right)^2}{K(K-1)}}$$
The function $\zeta(x, K)$ tends towards $\zeta(x)$ when K increases, but the formula above is valid only if the size of each $\mathcal{T}_i$ is at least thirty examples. Let us give some values:
x                               90%    95%    98%    99%
$\zeta(x, 2)$                   2.92   4.30   6.96   9.92
$\zeta(x, 5)$                   2.02   2.57   3.36   4.03
$\zeta(x, 10)$                  1.81   2.23   2.76   3.17
$\zeta(x, 30)$                  1.70   2.04   2.46   2.75
$\zeta(x, \infty) = \zeta(x)$   1.64   1.96   2.33   2.58
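Algorithm 1.1 and its confidence interval can be sketched as follows (Python/NumPy; `fit1`, `fit2` and `predict` are placeholders for the two algorithms being compared, and only a few quantiles from the table above are hard-coded):

```python
import numpy as np

ZETA_K = {(0.95, 2): 4.30, (0.95, 5): 2.57, (0.95, 10): 2.23, (0.95, 30): 2.04}

def compare_algorithms(X, y, fit1, fit2, predict, K=10, confidence=0.95, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    deltas = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        err1 = np.mean(predict(fit1(X[train], y[train]), X[test]) != y[test])
        err2 = np.mean(predict(fit2(X[train], y[train]), X[test]) != y[test])
        deltas.append(err1 - err2)
    deltas = np.array(deltas)
    mean = deltas.mean()
    half = ZETA_K[(confidence, K)] * np.sqrt(np.sum((deltas - mean) ** 2) / (K * (K - 1)))
    return mean, (mean - half, mean + half)
```

If the resulting interval does not contain 0, the observed difference between the two algorithms can be considered significant at the chosen confidence level.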

Comparison of two Algorithms on the Same Test Set
If the test sets on which the two algorithms are evaluated are the same, the confidence intervals can be much tighter, insofar as one eliminates the variance due to the difference between the test samples.
Dietterich published in 1997 a long paper on the comparison of training algorithms on the same test set. He examined five statistical tests in order to study the probability of detecting a difference between two algorithms when there is none.
Let $h_1$ and $h_2$ be two classification hypotheses. Let us note:
- $n_{00}$ = the number of test examples badly classified by both $h_1$ and $h_2$;
- $n_{01}$ = the number of test examples badly classified by $h_1$, but not by $h_2$;
- $n_{10}$ = the number of test examples badly classified by $h_2$, but not by $h_1$;
- $n_{11}$ = the number of test examples correctly classified by both $h_1$ and $h_2$.
Let us consider the statistic
$$z = \frac{n_{01} - n_{10}}{\sqrt{n_{01} + n_{10}}}$$
The assumption that $h_1$ and $h_2$ have the same error rate can be rejected with a probability higher than x% if $z > \zeta(x)$.
The test is known under the name of the paired test of McNemar or Gillick.
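A minimal sketch of this paired test (Python/NumPy), counting the two disagreement cells directly from the prediction vectors on the common test set; the absolute value is taken so that the rejection criterion is symmetric in $h_1$ and $h_2$, which is an assumption of ours rather than a detail given in the text:

```python
import numpy as np

ZETA = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def mcnemar_statistic(y_true, pred1, pred2):
    wrong1 = np.asarray(pred1) != np.asarray(y_true)
    wrong2 = np.asarray(pred2) != np.asarray(y_true)
    n01 = np.sum(wrong1 & ~wrong2)   # misclassified by h1 but not by h2
    n10 = np.sum(wrong2 & ~wrong1)   # misclassified by h2 but not by h1
    return (n01 - n10) / np.sqrt(n01 + n10)   # assumes n01 + n10 > 0

def same_error_rate_rejected(y_true, pred1, pred2, confidence=0.95):
    # Reject the hypothesis of equal error rates if |z| exceeds zeta(x).
    return abs(mcnemar_statistic(y_true, pred1, pred2)) > ZETA[confidence]
```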
Discussion and Perspective
This chapter introduced the study of various reasonable inductive principles. These transform a learning problem into an optimization problem by providing a criterion that the ideal hypothesis must optimize. The majority of the learning methods can then be seen as ways of specifying the hypotheses space to be considered, as well as the technique for exploring this space in order to find the best hypothesis in it. This vision of learning is of a great force. It makes it possible to conceive learning methods, to compare them, and even to build new inductive principles, such as those that control the hypotheses space automatically. It is easy to be seduced and to start to reason in the terms of this approach. However, if one reflects on it, this framework for learning is in some ways surprising.

On the one side, there is an indifferent Nature, which distils the messages, the data, in a random way, excluding thereby the situations of organized or at least benevolent learning. On the other side, there is a solitary learner, completely passive, which awaits the messages and, in general, does nothing before it has collected them all. One thus excludes continuous, collaborative learning, with an evolution of the learner. In the same way, learning in non-stationary environments is excluded, a negative point for a science which should above all be a science of dynamics. Moreover, the LM optimizes on average a mean of the risk, but it does not really seek to identify the target concept. Otherwise it would undoubtedly have an interest in devoting its resources to the areas of the space in which the target function presents great dynamics (strong variations) and fewer to the regions where things occur quietly. This implies having a hypotheses space with variable geometry: rich in the areas of strong dynamics and poorer elsewhere. In addition, the role of a priori knowledge, so important in natural learning, is here reduced to a very poor expression, related only to the choice of the hypotheses space.
Finally, the performance criteria take into account only the mean error or the risk, and not at all criteria of intelligibility or fruitfulness of the produced knowledge. One is therefore far from a framework that analyzes the whole diversity of learning situations. Nevertheless, this much purified framework proves of a great effectiveness in the analysis of data, which corresponds to a vast field of application.
The Perceptron
Perceptron Learning Rule

Consider the linear threshold gate (LTG) shown in Figure 1, which will be referred to as the perceptron. The perceptron maps an input vector $\mathbf{x} = [x_1\ x_2\ \ldots\ x_{n+1}]^T$ to a bipolar binary output y, and thus it may be viewed as a simple two-class classifier. The input signal $x_{n+1}$ is usually set to 1 and plays the role of a bias to the perceptron. We will denote by $\mathbf{w}$ the vector $\mathbf{w} = [w_1\ w_2\ \ldots\ w_{n+1}]^T \in R^{n+1}$ consisting of the free parameters (weights) of the perceptron. The input/output relation for the perceptron is given by $y = \mathrm{sgn}(\mathbf{x}^T\mathbf{w})$, where sgn is the "sign" function, which returns +1 or -1 depending on whether the sign of its scalar argument is positive or negative, respectively.
Assume we are training this perceptron to load (learn) the training pairs $\{\mathbf{x}^1, d^1\}, \{\mathbf{x}^2, d^2\}, \ldots, \{\mathbf{x}^m, d^m\}$, where $\mathbf{x}^k \in R^{n+1}$ is the kth input vector and $d^k \in \{+1, -1\}$, k = 1, 2, ..., m, is the desired target for the kth input vector (usually the order of these training pairs is random). The entire collection of these pairs is called the training set.
The goal, then, is to design a perceptron such that, for each input vector $\mathbf{x}^k$ of the training set, the perceptron output $y^k$ matches the desired target $d^k$; that is, we require $y^k = \mathrm{sgn}\left((\mathbf{x}^k)^T\mathbf{w}\right) = d^k$ for each k = 1, 2, ..., m. In this case we say that the perceptron correctly classifies the training set. Of course, "designing" an appropriate perceptron to correctly classify the training set amounts to determining a weight vector $\mathbf{w}^*$ such that the following relations are satisfied:
$$\begin{cases} (\mathbf{x}^k)^T\mathbf{w}^* > 0 & \text{if } d^k = +1\\ (\mathbf{x}^k)^T\mathbf{w}^* < 0 & \text{if } d^k = -1 \end{cases} \qquad (2.1)$$
Recall that the set of all $\mathbf{x}$ which satisfy $\mathbf{x}^T\mathbf{w}^* = 0$ defines a hyperplane in $R^n$. Thus, in the context of the preceding discussion, finding a solution vector $\mathbf{w}^*$ to Equation (2.1) is equivalent to finding a separating hyperplane that correctly classifies all vectors $\mathbf{x}^k$, k = 1, 2, ..., m. In other words, we desire a hyperplane $\mathbf{x}^T\mathbf{w}^* = 0$ that partitions the input space into two distinct regions, one containing all points $\mathbf{x}^k$ with $d^k = +1$ and the other region containing all points $\mathbf{x}^k$ with $d^k = -1$.
One possible incremental method for arriving at a solution $\mathbf{w}^*$ is to invoke the perceptron learning rule (Rosenblatt, 1962):
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho\left(d^k - y^k\right)\mathbf{x}^k, & k = 1, 2, \ldots\end{cases} \qquad (2.2)$$
where $\rho$ is a positive constant called the learning rate. The incremental learning process given in Equation (2.2) proceeds as follows: First, an initial weight vector $\mathbf{w}^1$ is selected (usually at random) to begin the process.
Then, the m pairs $\{\mathbf{x}^k, d^k\}$ of the training set are used to successively update the weight vector until (hopefully) a solution $\mathbf{w}^*$ is found that correctly classifies the training set. This process of sequentially presenting the training patterns is usually referred to as cycling through the training set, and a complete presentation of the m training pairs is referred to as a cycle (or pass) through the training set. In general, more than one cycle through the training set is required to determine an appropriate solution vector. Hence, in Equation (2.2), the superscript k in $\mathbf{w}^k$ refers to the iteration number. On the other hand, the superscript k in $\mathbf{x}^k$ (and $d^k$) is the label of the training pair presented at the kth iteration. To be more precise, if the number of training pairs m is finite, then the superscripts in $\mathbf{x}^k$ and $d^k$ should be replaced by $\left[(k-1) \bmod m\right] + 1$. Here, a mod b returns the remainder of the division of a by b (e.g., 5 mod 8 = 5, 8 mod 8 = 0, and 19 mod 8 = 3). This observation is valid for all incremental learning rules presented in this chapter.
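A minimal sketch of the rule in Equation (2.2) (Python/NumPy; a bias component equal to 1 is appended to every input, as assumed in the text):

```python
import numpy as np

def train_perceptron(X, d, rho=1.0, max_cycles=100, seed=0):
    # X: (m, n) input vectors, d: (m,) targets in {+1, -1}.
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append x_{n+1} = 1 (bias input)
    w = rng.normal(size=Xb.shape[1])            # w^1 chosen at random
    for _ in range(max_cycles):                 # cycles through the training set
        corrections = 0
        for xk, dk in zip(Xb, d):
            yk = 1.0 if xk @ w > 0 else -1.0    # y^k = sgn(x^k . w)
            if yk != dk:
                w = w + rho * (dk - yk) * xk    # Equation (2.2)
                corrections += 1
        if corrections == 0:                    # training set correctly classified
            break
    return w
```

If the training set is linearly separable, the convergence result proved below guarantees that the inner loop stops making corrections after a finite number of iterations.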
Decision Boundaries
nnd4db Decision boundaries demonstration.

Notice that for $\rho = 0.5$, the perceptron learning rule can be written as
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le 0\\ \mathbf{w}^{k+1} = \mathbf{w}^k & \text{otherwise}\end{cases} \qquad (2.3)$$
where
$$\mathbf{z}^k = \begin{cases} +\mathbf{x}^k & \text{if } d^k = +1\\ -\mathbf{x}^k & \text{if } d^k = -1\end{cases} \qquad (2.4)$$
That is, a correction is made if and only if a misclassification, indicated by
$$(\mathbf{z}^k)^T\mathbf{w}^k \le 0 \qquad (2.5)$$
occurs.
The addition of the vector $\mathbf{z}^k$ to $\mathbf{w}^k$ in Equation (2.3) moves the weight vector directly toward and perhaps across the hyperplane $(\mathbf{z}^k)^T\mathbf{w} = 0$. The new inner product $(\mathbf{z}^k)^T\mathbf{w}^{k+1}$ is larger than $(\mathbf{z}^k)^T\mathbf{w}^k$ by the amount $\|\mathbf{z}^k\|^2$, and the correction $\Delta\mathbf{w}^k = \mathbf{w}^{k+1} - \mathbf{w}^k$ is clearly moving $\mathbf{w}^k$ in a good direction, the direction of increasing $(\mathbf{z}^k)^T\mathbf{w}^k$, as can be seen from Figure 2-2¹. Thus the perceptron learning rule attempts to find a solution $\mathbf{w}^*$ for the following system of inequalities:
$$(\mathbf{z}^k)^T\mathbf{w} > 0 \quad \text{for } k = 1, 2, \ldots, m \qquad (2.6)$$

¹ The quantity $\|\mathbf{z}\|^2$ is given by $\mathbf{z}^T\mathbf{z}$ and is sometimes referred to as the energy of $\mathbf{z}$. $\|\mathbf{z}\|$ is the Euclidean norm (length) of vector $\mathbf{z}$, given by the square root of the sum of the squares of the components of $\mathbf{z}$ [note that $\|\mathbf{z}\| = \|\mathbf{x}\|$ by virtue of Equation (2.4)].
In an analysis of any learning algorithm, and in particular the perceptron learning algorithm of Equation (2.2), there are two main issues to consider:
(1) the existence of solutions and
(2) convergence of the algorithm to the desired solutions (if they exist).
In the case of the perceptron, it is clear that a solution vector (i.e., a vector $\mathbf{w}^*$ that correctly classifies the training set) exists if and only if the given training set is linearly separable. Assuming, then, that the training set is linearly separable, we may proceed to show that the perceptron learning rule converges to a solution (Novikoff, 1962; Nilsson, 1965) as follows: Let $\mathbf{w}^*$ be any solution vector, so that
$$(\mathbf{z}^k)^T\mathbf{w}^* > 0 \quad \text{for } k = 1, 2, \ldots, m \qquad (2.7)$$
Then, from Equation (2.3), if the kth pattern is misclassified, we may write
$$\mathbf{w}^{k+1} - \alpha\mathbf{w}^* = \mathbf{w}^k - \alpha\mathbf{w}^* + \mathbf{z}^k \qquad (2.8)$$
where $\alpha$ is a positive scale factor, and hence
$$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 = \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 + 2\,(\mathbf{z}^k)^T\left(\mathbf{w}^k - \alpha\mathbf{w}^*\right) + \left\|\mathbf{z}^k\right\|^2 \qquad (2.9)$$
Since $\mathbf{z}^k$ is misclassified, we have $(\mathbf{z}^k)^T\mathbf{w}^k \le 0$, and thus
$$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - 2\alpha\,(\mathbf{z}^k)^T\mathbf{w}^* + \left\|\mathbf{z}^k\right\|^2 \qquad (2.10)$$
Now, let $\beta^2 = \max_i \left\|\mathbf{z}^i\right\|^2$ and $\gamma = \min_i (\mathbf{z}^i)^T\mathbf{w}^*$ [$\gamma$ is positive because $(\mathbf{z}^i)^T\mathbf{w}^* > 0$] and substitute into Equation (2.10) to get
$$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - 2\alpha\gamma + \beta^2 \qquad (2.11)$$
If we choose $\alpha$ sufficiently large, in particular $\alpha = \beta^2/\gamma$, we obtain
$$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - \beta^2 \qquad (2.12)$$
Thus the squared distance between $\mathbf{w}^k$ and $\alpha\mathbf{w}^*$ is reduced by at least $\beta^2$ at each correction, and after k corrections we may write Equation (2.12) as
$$0 \le \left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^1 - \alpha\mathbf{w}^*\right\|^2 - k\beta^2 \qquad (2.13)$$
It follows that the sequence of corrections must terminate after no more than $k_0$ corrections, where
$$k_0 = \frac{\left\|\mathbf{w}^1 - \alpha\mathbf{w}^*\right\|^2}{\beta^2} \qquad (2.14)$$
Therefore, if a solution exists, it is achieved in a finite number of iterations. When corrections cease, the resulting weight vector must classify all the samples correctly, since a correction occurs whenever a sample is misclassified, and since each sample appears infinitely often in the sequence. In general, a linearly separable problem admits an infinite number of solutions. The perceptron learning rule in Equation (2.2) converges to one of these solutions. This solution, though, is sensitive to the value of the learning rate $\rho$ used and to the order of presentation of the training pairs. This sensitivity is responsible for the varying quality of the perceptron-generated separating surface observed in simulations.
The bound on the number of corrections $k_0$ given by Equation (2.14) depends on the choice of the initial weight vector $\mathbf{w}^1$. If $\mathbf{w}^1 = \mathbf{0}$, we get
$$k_0 = \frac{\alpha^2\left\|\mathbf{w}^*\right\|^2}{\beta^2} = \frac{\beta^2\left\|\mathbf{w}^*\right\|^2}{\gamma^2}$$
or
$$k_0 = \frac{\max_i\left\|\mathbf{x}^i\right\|^2\,\left\|\mathbf{w}^*\right\|^2}{\min_i\left[(\mathbf{x}^i)^T\mathbf{w}^*\right]^2} \qquad (2.15)$$
Here, $k_0$ is a function of the initially unknown solution weight vector $\mathbf{w}^*$. Therefore, Equation (2.15) is of no help for predicting the maximum number of corrections. However, the denominator of Equation (2.15) implies that the difficulty of the problem is essentially determined by the samples most nearly orthogonal to the solution vector.
2.1.1 Generalizations of the Perceptron Learning Rule
The perceptron learning rule may be generalized to include a variable increment $\rho^k$ and a fixed positive margin b. This generalized learning rule updates the weight vector whenever $(\mathbf{z}^k)^T\mathbf{w}^k$ fails to exceed the margin b. Here, the algorithm for the weight vector update is given by
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho^k\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le b\\ \mathbf{w}^{k+1} = \mathbf{w}^k & \text{otherwise}\end{cases} \qquad (2.16)$$
The margin b is useful because it gives a dead-zone robustness to the decision boundary. That is, the perceptron's decision hyperplane is constrained to lie in a region between the two classes such that sufficient clearance is realized between this hyperplane and the extreme points (boundary patterns) of the training set. This makes the perceptron robust with respect to noisy inputs. It can be shown (Duda and Hart, 1973) that if the training set is linearly separable and if the following three conditions are satisfied:
$$1.\quad \rho^k \ge 0 \qquad (2.17a)$$
$$2.\quad \lim_{m\to\infty}\sum_{k=1}^{m}\rho^k = \infty \qquad (2.17b)$$
$$3.\quad \lim_{m\to\infty}\frac{\sum_{k=1}^{m}\left(\rho^k\right)^2}{\left(\sum_{k=1}^{m}\rho^k\right)^2} = 0 \qquad (2.17c)$$
(e.g., $\rho^k = \rho/k$ or even $\rho^k = \rho k$), then $\mathbf{w}^k$ converges to a solution $\mathbf{w}^*$ that satisfies $(\mathbf{z}^i)^T\mathbf{w}^* > b$ for i = 1, 2, ..., m. Furthermore, when $\rho^k$ is fixed at a positive constant, this learning rule converges in finite time.
Another variant of the perceptron learning rule is given by the batch update procedure
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho\sum_{\mathbf{z} \in Z(\mathbf{w}^k)}\mathbf{z}\end{cases} \qquad (2.18)$$
where $Z(\mathbf{w}^k)$ is the set of patterns $\mathbf{z}$ misclassified by $\mathbf{w}^k$. Here, the weight vector change $\Delta\mathbf{w}^k = \mathbf{w}^{k+1} - \mathbf{w}^k$ is along the direction of the resultant vector of all misclassified patterns. In general, this update procedure converges faster than the perceptron rule, but it requires more storage.
In the nonlinearly separable case, the preceding algorithms do not converge. Few theoretical results are available on the behavior of these algorithms for nonlinearly separable problems [see Minsky and Papert (1969) for some preliminary results]. For example, it is known that the length of $\mathbf{w}$ in the perceptron rule is bounded, i.e., it tends to fluctuate near some limiting value $\mathbf{w}^*$. This information may be used to terminate the search for $\mathbf{w}^*$. Another approach is to average the weight vectors near the fluctuation point $\mathbf{w}^*$. Butz (1967) proposed the use of a reinforcement factor $\lambda$, $0 \le \lambda \le 1$, in the perceptron learning rule. This reinforcement places $\mathbf{w}$ in a region that tends to minimize the probability of error for nonlinearly separable cases. Butz's rule is as follows:
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le 0\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \lambda\rho\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k > 0\end{cases} \qquad (2.19)$$
2.1.2 The Perceptron Criterion Function
It is interesting to see how the preceding error-correction rules can be derived by a gradient descent on an appropriate criterion (objective) function. For the perceptron, we may define the following criterion function (Duda and Hart, 1973):
$$J(\mathbf{w}) = \sum_{\mathbf{z} \in Z(\mathbf{w})}\left(-\mathbf{z}^T\mathbf{w}\right) \qquad (2.20)$$
where $Z(\mathbf{w})$ is the set of samples misclassified by $\mathbf{w}$ (i.e., for which $\mathbf{z}^T\mathbf{w} \le 0$). Note that if $Z(\mathbf{w})$ is empty, then $J(\mathbf{w}) = 0$; otherwise, $J(\mathbf{w}) > 0$. Geometrically, $J(\mathbf{w})$ is proportional to the sum of the distances from the misclassified samples to the decision boundary. The smaller J is, the better the weight vector $\mathbf{w}$ will be.
Given this objective function $J(\mathbf{w})$, the search point $\mathbf{w}^k$ can be incrementally improved at each iteration by sliding downhill on the surface defined by $J(\mathbf{w})$ in $\mathbf{w}$ space. Specifically, we may use J to perform a discrete gradient-descent search that updates $\mathbf{w}^k$ so that a step is taken downhill in the "steepest" direction along the search surface $J(\mathbf{w})$ at $\mathbf{w}^k$. This can be achieved by making $\Delta\mathbf{w}^k$ proportional to the gradient of J at the present location $\mathbf{w}^k$; formally, we may write
$$\mathbf{w}^{k+1} = \mathbf{w}^k - \rho\,\nabla J(\mathbf{w})\Big|_{\mathbf{w}=\mathbf{w}^k} = \mathbf{w}^k - \rho\left[\frac{\partial J}{\partial w_1}\ \ \frac{\partial J}{\partial w_2}\ \ \ldots\ \ \frac{\partial J}{\partial w_{n+1}}\right]^T_{\mathbf{w}=\mathbf{w}^k} \qquad (2.21)$$
Here, the initial search point $\mathbf{w}^1$ and the learning rate (step size) $\rho$ are to be specified by the user¹. Equation (2.21) can be called the steepest gradient descent search rule or, simply, gradient descent. Next, substituting the gradient

¹ Discrete gradient-search methods are generally governed by the equation $\mathbf{w}^{k+1} = \mathbf{w}^k - \rho\,\mathbf{A}\,\nabla J\big|_{\mathbf{w}=\mathbf{w}^k}$. Here, $\mathbf{A}$ is an $n \times n$ matrix and $\rho$ is a real number, both functions of $\mathbf{w}^k$. Numerous versions of gradient-search methods exist, and they differ in the way in which $\mathbf{A}$ and $\rho$ are selected at $\mathbf{w} = \mathbf{w}^k$. For example, if $\mathbf{A}$ is taken to be the identity matrix and $\rho$ is set to a small positive constant, the gradient "descent" search in Equation (2.21) is obtained. On the other hand, if $\rho$ is a small negative constant, a gradient "ascent" search is realized which seeks a local maximum. In either case, though, a saddle point (nonstable equilibrium) may be reached. However, the existence of noise in practical systems prevents convergence to such nonstable equilibria. It should also be noted that, in addition to its simple structure, Equation (2.21) implements "steepest" descent: it can be shown that, starting at a point $\mathbf{w}_0$, the gradient direction $\nabla J(\mathbf{w}_0)$ yields the greatest incremental increase of $J(\mathbf{w})$ for a fixed incremental distance $\|\Delta\mathbf{w}\| = \|\mathbf{w} - \mathbf{w}_0\|$. The speed of convergence of the steepest descent search is affected by the choice of $\rho$, which is normally adjusted at each time step to make the most error correction subject to stability constraints. Finally, it should be pointed out that setting $\mathbf{A}$ equal to the inverse of the Hessian matrix $\left[\nabla^2 J\right]^{-1}$ and $\rho$ to 1 results in the well-known Newton's search method.
$$\nabla J(\mathbf{w}^k) = -\sum_{\mathbf{z} \in Z(\mathbf{w}^k)}\mathbf{z} \qquad (2.22)$$
into Equation (2.21) leads to the weight update rule
$$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho\sum_{\mathbf{z} \in Z(\mathbf{w}^k)}\mathbf{z} \qquad (2.23)$$
The learning rule given in Equation (2.23) is identical to the multiple-sample (batch) perceptron rule of Equation (2.18). The original perceptron learning rule of Equation (2.3) can be thought of as an "incremental" gradient descent search rule for minimizing the perceptron criterion function in Equation (2.20). Following a procedure similar to Equations (2.21) through (2.23), it can be shown that
$$J(\mathbf{w}) = \sum_{\mathbf{z}:\ \mathbf{z}^T\mathbf{w} \le b}\left(b - \mathbf{z}^T\mathbf{w}\right) \qquad (2.24)$$
is the appropriate criterion function for the modified perceptron rule in Equation (2.16).
Before moving on, it should be noted that the gradient of J in Equation (2.22) is not mathematically precise. Owing to the piecewise linear nature of J, sudden changes in the gradient of J occur every time the perceptron output y goes through a transition at $(\mathbf{z}^k)^T\mathbf{w} = 0$. Therefore, the gradient of J is not defined at the "transition" points $\mathbf{w}$ satisfying $(\mathbf{z}^k)^T\mathbf{w} = 0$, k = 1, 2, ..., m. However, because of the discrete nature of Equation (2.21), the likelihood of $\mathbf{w}^k$ overlapping with one of these transition points is negligible, and thus we may still express $\nabla J$ as in Equation (2.22).
2.1.3 Mays' Learning Rule
The criterion functions in Equations (2.20) and (2.24) are by no means the only functions that are minimized when $\mathbf{w}$ is a solution vector. For example, an alternative function is the quadratic function
$$J(\mathbf{w}) = \frac{1}{2}\sum_{\mathbf{z}:\ \mathbf{z}^T\mathbf{w} \le b}\left(\mathbf{z}^T\mathbf{w} - b\right)^2 \qquad (2.25)$$
where b is a positive constant margin. Like the previous criterion functions, the function J(w) in Equation (2.25) focuses attention on the misclassified samples. Its major difference is that its gradient is continuous, whereas the gradient of the perceptron criterion function, with or without the use of a margin, is not. Unfortunately, the present function can be dominated by the input vectors with the largest magnitudes. We may eliminate this undesirable effect by dividing by $\|\mathbf{z}\|^2$:
$$J(\mathbf{w}) = \frac{1}{2}\sum_{\mathbf{z}:\ \mathbf{z}^T\mathbf{w} \le b}\frac{\left(\mathbf{z}^T\mathbf{w} - b\right)^2}{\|\mathbf{z}\|^2} \qquad (2.26)$$
The gradient of J(w) in Equation (2.26) is given by
$$\nabla J(\mathbf{w}) = \sum_{\mathbf{z}:\ \mathbf{z}^T\mathbf{w} \le b}\frac{\mathbf{z}^T\mathbf{w} - b}{\|\mathbf{z}\|^2}\,\mathbf{z} \qquad (2.27)$$
which, upon substitution into Equation (2.21), leads to the following learning rule:
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho\sum_{\mathbf{z}:\ \mathbf{z}^T\mathbf{w}^k \le b}\frac{b - \mathbf{z}^T\mathbf{w}^k}{\|\mathbf{z}\|^2}\,\mathbf{z}\end{cases} \qquad (2.28)$$
If we consider the incremental update version of Equation (2.28), we arrive at Mays' rule (Mays, 1964):
$$\begin{cases}\mathbf{w}^1 \text{ arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \rho\,\dfrac{b - (\mathbf{z}^k)^T\mathbf{w}^k}{\left\|\mathbf{z}^k\right\|^2}\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le b\\ \mathbf{w}^{k+1} = \mathbf{w}^k & \text{otherwise}\end{cases} \qquad (2.29)$$
If the training set is linearly separable, Mays' rule converges in a finite number of iterations, for $0 < \rho < 2$ (Duda and Hart, 1973). In the case of a nonlinearly separable training set, the training procedure in Equation (2.29) will never converge. To fix this problem, a decreasing learning rate such as $\rho^k = \rho/k$ may be used to force convergence to some approximate separating surface.
Widrow-Hoff (α-LMS) Learning Rule
Another example of an error-correcting rule with a quadratic criterion function is the Widrow-Hoff rule (Widrow and Hoff, 1960). This rule was originally used to train the linear unit, also known as the adaptive linear combiner element (ADALINE), shown in Figure 2-3. In this case, the output of the linear unit in response to the input $\mathbf{x}^k$ is simply $y^k = (\mathbf{x}^k)^T\mathbf{w}$. The Widrow-Hoff rule was proposed originally as an ad hoc rule which embodies the so-called minimal disturbance principle. Later, it was discovered (Widrow and Stearns, 1985) that this rule converges in the mean square to the solution $\mathbf{w}^*$ that corresponds to the least-mean-square (LMS) output error, if all input patterns are of the same length (i.e., $\|\mathbf{x}^k\|^2$ is the same for all k). Therefore, this rule is sometimes referred to as the α-LMS rule (the α is used here to distinguish this rule from another very similar rule that is discussed in the next section).

Figure 2-3 Adaptive linear combiner element (ADALINE).

The α-LMS rule is given by
$$\begin{cases}\mathbf{w}^1 = \mathbf{0} \text{ or arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \alpha\left(d^k - y^k\right)\dfrac{\mathbf{x}^k}{\left\|\mathbf{x}^k\right\|^2}\end{cases} \qquad (2.30)$$
where $d^k \in R$ is the desired response and $\alpha > 0$. Equation (2.30) is similar to the perceptron rule if one sets $\rho$ in Equation (2.2) as
$$\rho = \rho^k = \frac{\alpha}{\left\|\mathbf{x}^k\right\|^2} \qquad (2.31)$$
However, the error in Equation (2.30) is measured at the linear output, not after the nonlinearity, as in the perceptron. The constant $\alpha$ controls the stability and speed of convergence (Widrow and Stearns, 1985; Widrow and Lehr, 1990). If the input vectors are independent over time, stability is ensured for most practical purposes if $0 < \alpha < 2$.
As for Mays' rule, this rule is self-normalizing in the sense that the choice of $\alpha$ does not depend on the magnitude of the input vectors. Since the α-LMS rule selects $\Delta\mathbf{w}^k$ to be collinear with $\mathbf{x}^k$, the desired error correction is achieved with a weight change of the smallest possible magnitude. Thus, when adapting to learn a new training sample, the responses to previous training samples are, on average, minimally disturbed. This is the basic idea behind the minimal disturbance principle on which the α-LMS rule is founded. Alternatively, one can show that the α-LMS learning rule is a gradient descent minimizer of an appropriate quadratic criterion function.
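A minimal sketch of the α-LMS rule of Equation (2.30) for a single linear unit (Python/NumPy; whether a bias column is appended to the inputs is left to the caller):

```python
import numpy as np

def train_alpha_lms(X, d, alpha=0.5, n_cycles=50):
    # X: (m, n) input vectors, d: (m,) desired responses.
    w = np.zeros(X.shape[1])
    for _ in range(n_cycles):
        for xk, dk in zip(X, d):
            yk = xk @ w                                   # linear output, no nonlinearity
            w = w + alpha * (dk - yk) * xk / (xk @ xk)    # Equation (2.30)
    return w
```

The normalization by $\|\mathbf{x}^k\|^2$ is what makes the admissible range $0 < \alpha < 2$ independent of the magnitude of the input vectors, as noted above.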
Other Gradient-Descent-Based Learning Rules
In the following, additional learning rules for single-unit training are derived. These rules are derived systematically by first defining an appropriate criterion function and then optimizing such a function by an iterative gradient search procedure.

μ-LMS Learning Rule
The μ-LMS learning rule (Widrow and Hoff, 1960) represents the most analyzed and most applied simple learning rule. It is also of special importance due to its possible extension to learning in multiple-unit neural nets. Therefore, special attention is given to this rule in this chapter. In the following, the μ-LMS rule is described in the context of the linear unit in Figure 2-3. Let
$$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{m}\left(d^i - y^i\right)^2 \qquad (2.32)$$
be the sum of squared error (SSE) criterion function, where
$$y^i = (\mathbf{x}^i)^T\mathbf{w} \qquad (2.33)$$
Now, using steepest gradient-descent search to minimize J(w) in Equation (2.32) gives
$$\mathbf{w}^{k+1} = \mathbf{w}^k - \mu\,\nabla J(\mathbf{w})\Big|_{\mathbf{w}=\mathbf{w}^k} = \mathbf{w}^k + \mu\sum_{i=1}^{m}\left(d^i - y^i\right)\mathbf{x}^i \qquad (2.34)$$
The criterion function J(w) in Equation (2.32) is quadratic in the weights because of the linear relation between $y^i$ and $\mathbf{w}$. In fact, J(w) defines a convex¹ hyperparaboloidal surface with a single minimum $\mathbf{w}^*$ (the global minimum). Therefore, if the positive constant $\mu$ is chosen sufficiently small, the gradient-descent search implemented by Equation (2.34) will asymptotically converge toward the solution $\mathbf{w}^*$ regardless of the setting of the initial search point $\mathbf{w}^1$. The learning rule in Equation (2.34) is sometimes referred to as the batch LMS rule.

¹ A function $f: R^n \to R$ is said to be convex if $f\left(\lambda\mathbf{u} + (1-\lambda)\mathbf{v}\right) \le \lambda f(\mathbf{u}) + (1-\lambda)f(\mathbf{v})$ for any pair of vectors $\mathbf{u}$ and $\mathbf{v}$ in $R^n$ and any real number $\lambda$ in the closed interval [0, 1].

The incremental version of Equation (2.34), known as the μ-LMS or LMS rule, is given by
$$\begin{cases}\mathbf{w}^1 = \mathbf{0} \text{ or arbitrary}\\ \mathbf{w}^{k+1} = \mathbf{w}^k + \mu\left(d^k - y^k\right)\mathbf{x}^k\end{cases} \qquad (2.35)$$
Note that this rule becomes identical to the α-LMS learning rule in Equation (2.30) upon setting $\mu$ as
$$\mu = \mu^k = \frac{\alpha}{\left\|\mathbf{x}^k\right\|^2} \qquad (2.36)$$
Also, when the input vectors have the same length, as would be the case when $\mathbf{x} \in \{+1, -1\}^n$, the μ-LMS rule becomes identical to the α-LMS rule. Since the α-LMS learning algorithm converges when $0 < \alpha < 2$, we can start from Equation (2.36) and calculate the required range on $\mu$ for ensuring the convergence of the μ-LMS rule for "most practical purposes":
$$0 < \mu < \frac{2}{\max_i\left\|\mathbf{x}^i\right\|^2} \qquad (2.37)$$
For input patterns independent over time and generated by a stationary process, convergence of the mean of the weight vector $\langle\mathbf{w}^k\rangle$ is ensured if the fixed learning rate $\mu$ is chosen to be smaller than $2/\langle\|\mathbf{x}\|^2\rangle$ (Widrow and Stearns, 1985). Here, $\langle\cdot\rangle$ signifies the "mean" or expected value. In this case, $\langle\mathbf{w}^k\rangle$ approaches the solution $\mathbf{w}^*$ as $k \to \infty$. Note that this bound is less restrictive than the one in Equation (2.37). The bound $\mu < 2/\left(3\langle\|\mathbf{x}\|^2\rangle\right)$ guarantees the convergence of $\mathbf{w}^k$ in the mean square (i.e., $\langle\|\mathbf{w}^k - \mathbf{w}^*\|^2\rangle \to 0$ as $k \to \infty$) for input patterns generated by a zero-mean Gaussian process independent over time. It should be noted that convergence in the mean square implies convergence in the mean; however, the converse is not necessarily true. The assumptions of decorrelated patterns and stationarity are not necessary conditions for the convergence of μ-LMS. For example, Macchi and Eweda (1983) have a much stronger result regarding convergence of the μ-LMS rule, which is valid even when a finite number of successive training patterns are strongly correlated.
In practical problems, $m > n + 1$; hence it becomes impossible to satisfy all the requirements $d^k = (\mathbf{x}^k)^T\mathbf{w}$, k = 1, 2, ..., m, and Equation (2.35) never converges. Thus, for convergence, $\mu$ is set to $\mu_0/k$, where $\mu_0$ is a small positive constant. In applications such as linear filtering, though, the decreasing step size is not very valuable, because it cannot accommodate nonstationarity in the input signal. Indeed, $\mathbf{w}^k$ will essentially stop changing for large k, which precludes the tracking of time variations. Thus the fixed-increment (constant $\mu$) LMS learning rule has the advantage of limited memory, which enables it to track time fluctuations in the input data.
When the learning rate $\mu$ is sufficiently small, the μ-LMS rule becomes a "good" approximation to the gradient-descent rule in Equation (2.34). This means that the weight vector $\mathbf{w}^k$ will tend to move toward the global minimum $\mathbf{w}^*$ of the convex SSE criterion function.
Next, we show that $\mathbf{w}^*$ is given by
$$\mathbf{w}^* = \mathbf{X}^{\dagger}\mathbf{d} \qquad (2.38)$$
where $\mathbf{X} = \left[\mathbf{x}^1\ \mathbf{x}^2\ \ldots\ \mathbf{x}^m\right]$, $\mathbf{d} = \left[d^1\ d^2\ \ldots\ d^m\right]^T$, and $\mathbf{X}^{\dagger} = \left(\mathbf{X}\mathbf{X}^T\right)^{-1}\mathbf{X}$ is the generalized inverse or pseudoinverse (Penrose, 1955) of $\mathbf{X}$ for $m > n + 1$.
The extreme points (minima and maxima) of the function J(w) are solutions of the equation
$$\nabla J(\mathbf{w}) = \mathbf{0} \qquad (2.39)$$
Therefore, any minimum of the SSE criterion function in Equation (2.32) must satisfy
$$\nabla J(\mathbf{w}) = -\sum_{i=1}^{m}\left(d^i - (\mathbf{x}^i)^T\mathbf{w}\right)\mathbf{x}^i = \mathbf{X}\left(\mathbf{X}^T\mathbf{w} - \mathbf{d}\right) = \mathbf{0} \qquad (2.40)$$
Equation (2.40) can be rewritten as
$$\mathbf{X}\mathbf{X}^T\mathbf{w} = \mathbf{X}\mathbf{d} \qquad (2.41)$$
which, for a nonsingular matrix $\mathbf{X}\mathbf{X}^T$, gives the solution in Equation (2.38), or explicitly
$$\mathbf{w}^* = \left(\mathbf{X}\mathbf{X}^T\right)^{-1}\mathbf{X}\mathbf{d} \qquad (2.42)$$
Recall that just because $\mathbf{w}^*$ in Equation (2.42) satisfies the condition $\nabla J(\mathbf{w}^*) = \mathbf{0}$, this does not guarantee that $\mathbf{w}^*$ is a local minimum of the criterion function J. It does, however, considerably narrow the choices in that such a $\mathbf{w}^*$ represents (in a local sense) either a point of minimum, maximum, or saddle point of J. To verify that $\mathbf{w}^*$ is actually a minimum of J(w), we may evaluate the second derivative or Hessian matrix
$$\nabla^2 J = \left[\frac{\partial^2 J}{\partial w_i\,\partial w_j}\right]$$
of J at $\mathbf{w}^*$ and show that it is positive definite². But this result follows immediately after noting that $\nabla^2 J$ is equal to the positive-definite matrix $\mathbf{X}\mathbf{X}^T$. Thus $\mathbf{w}^*$ is a minimum of J³.

² An $n \times n$ real symmetric matrix $\mathbf{A}$ is positive-definite if the quadratic form $\mathbf{x}^T\mathbf{A}\mathbf{x}$ is strictly positive for all nonzero column vectors $\mathbf{x}$ in $R^n$.
³ Of course, the same result could have been achieved by noting that the convex, unconstrained quadratic nature of J(w) admits one extreme point $\mathbf{w}^*$, which must be the global minimum of J(w).
The LMS rule may also be applied to synthesize the weight vector $\mathbf{w}$ of a perceptron for solving two-class classification problems. Here, one starts by training the linear unit in Figure 2-3 with the given training pairs $\{\mathbf{x}^k, d^k\}$, k = 1, 2, ..., m, using the LMS rule. During training, the desired target $d^k$ is set to +1 for one class and to -1 for the other class. (In fact, any positive constant can be used as the target for one class, and any negative constant can be used as the target for the other class.) After convergence of the learning process, the solution vector obtained may be used in the perceptron for classification. Because of the thresholding nonlinearity in the perceptron, the output of the classifier will now be properly restricted to the set $\{+1, -1\}$.
When used as a perceptron weight vector, the minimum SSE solution in Equation (2.42) does not generally minimize the perceptron classification error rate. This should not be surprising, since the SSE criterion function is not designed to constrain its minimum inside the linearly separable solution region. Therefore, this solution does not necessarily represent a linearly separating solution, even when the training set is linearly separable (this is explored further above). However, when the training set is nonlinearly separable, the solution arrived at may still be a useful approximation. Therefore, by employing the LMS rule for perceptron training, linear separability is sacrificed for good compromise performance on both separable and nonseparable problems.
Example 2.1 This example presents the results of a set of simulations that should help give some insight into the dynamics of the batch and incremental LMS learning rules. Specifically, we are interested in comparing the convergence behavior of the discrete-time dynamical systems in Equations (2.34) and (2.35). Consider the training set depicted in Figure 2-4 for a simple mapping problem. The 10 squares and 10 filled circles in this figure are positioned at the points whose coordinates (x_1, x_2) specify the two components of the input vectors. The squares and circles are to be mapped to the targets +1 and -1, respectively. For example, the left-most square in the figure represents the training pair {[1, 0]^T, 1}. Similarly, the right-most circle represents the training pair {[2, 2]^T, -1}.

Figure 2-4 A 20-sample training set used in the simulations associated with Example 2.1. Points signified by a square and a filled circle should map into +1 and -1, respectively.
Figure 2-5 shows plots of the evolution of the square of the distance between the vector w_k and the (computed) minimum SSE solution w* for batch LMS (dashed line) and incremental LMS (solid line). In both simulations, the learning rate (step size) was set to 0.005. The initial search point w_1 was set to [0, 0]^T. For the incremental LMS rule, the training examples are selected randomly from the training set. The batch LMS rule converges to the optimal solution w* in fewer than 100 steps. Incremental LMS requires more learning steps, on the order of 2000, to converge to a small neighborhood of w*.

Figure 2-5 Plots (learning curves) of the square of the distance between the search point w_k and the minimum SSE solution w*, generated using two versions of the LMS learning rule. The dashed line corresponds to the batch LMS rule in Equation (2.34). The solid line corresponds to the incremental LMS rule in Equation (2.35) with a random order of presentation of the training patterns. In both cases, w_1 = 0 and ρ = 0.005 are used. Note the logarithmic scale for the iteration number k.
The fluctuations of ‖w_k - w*‖² in this neighborhood are less than 0.02, as can be seen from Figure 2-5. The effect of a deterministic order of presentation of the training examples on the incremental LMS rule is shown by the solid line in Figure 2-6. Here, the training examples are presented in a predefined order, which did not change during training. The same initialization and step size are used as before. In order to allow for a more meaningful comparison between the two LMS rule versions, one learning step of incremental LMS is taken to mean a full cycle through the 20 samples. For comparison, the simulation result with batch LMS learning is plotted in the figure (dashed line). These results indicate a very similar behavior in the convergence characteristics of incremental and batch LMS learning. This is so because of the small step size used. Both cases show asymptotic convergence toward the optimal solution w*, but with a relatively faster convergence of the batch LMS rule near w*. This is attributed to the use of more accurate gradient information.

Figure 2-6 Learning curves for the batch LMS (dashed line) and incremental LMS (solid line) learning rules for the data in Figure 2-5. The result for the batch LMS rule shown here is identical to the one shown in Figure 2-5 (it looks different only because of the present use of a linear scale for the horizontal axis). The incremental LMS rule results shown assume a deterministic, fixed order of presentation of the training patterns. Also, for the incremental LMS case, w_k represents the weight vector after the completion of the kth learning "cycle." Here, one cycle corresponds to 20 consecutive learning iterations.
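Readers who wish to reproduce the flavor of Example 2.1 may start from the sketch below. It is an illustrative addition, not the original simulation code: the toy training set, the learning rate, and all variable names are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)

# Toy training set: rows of X are input vectors [x1, x2, 1] (bias included),
# d holds the +1 / -1 targets. With patterns stored as rows, the normal
# equations of the minimum SSE solution read X^T X w = X^T d.
X = np.array([[1.0, 0.0, 1.0], [1.5, 0.5, 1.0], [2.0, 2.0, 1.0], [0.5, 1.5, 1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
rho = 0.005
w_star = np.linalg.solve(X.T @ X, X.T @ d)        # reference minimum SSE solution

def batch_lms(steps):
    """Batch LMS: one update per pass, using the full summed gradient (Equation (2.34))."""
    w, dist = np.zeros(3), []
    for _ in range(steps):
        w = w + rho * X.T @ (d - X @ w)
        dist.append(np.sum((w - w_star) ** 2))
    return np.array(dist)

def incremental_lms(steps):
    """Incremental (mu-)LMS: one update per randomly drawn pattern (Equation (2.35))."""
    w, dist = np.zeros(3), []
    for _ in range(steps):
        k = rng.integers(len(d))
        w = w + rho * (d[k] - X[k] @ w) * X[k]
        dist.append(np.sum((w - w_star) ** 2))
    return np.array(dist)

# The squared distances to w* shrink as training proceeds, batch LMS faster per step.
print(batch_lms(100)[-1], incremental_lms(2000)[-1])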
The μ-LMS as a Stochastic Process

Stochastic approximation theory may be employed as an alternative to the deterministic gradient-descent analysis presented thus far. It has the advantage of naturally arriving at a learning-rate schedule ρ_k for asymptotic convergence in the mean square. Here, one starts with the mean-square error (MSE) criterion function

    J(w) = (1/2) ⟨(d - x^T w)²⟩                           (2.43)

where ⟨·⟩ again denotes the mean (expectation) over all training vectors. Now one may compute the gradient of J as

    ∇J(w) = -⟨(d - x^T w) x⟩                              (2.44)

which, upon setting it to zero, allows us to find the minimum w* of J in Equation (2.43) as the solution of

    ⟨x x^T⟩ w* = ⟨d x⟩,   which gives   w* = C^(-1) P      (2.45)

where C = ⟨x x^T⟩ and P = ⟨d x⟩. Note that the expected value of a vector or a matrix is found by taking the expected values of its components. We refer to C as the autocorrelation matrix of the input vectors and to P as the cross-correlation vector between the input vector x and its associated desired target d. In Equation (2.45), the determinant of C, |C|, is assumed to be different from zero. The solution w* in Equation (2.45) is sometimes called the Wiener weight vector (Widrow and Stearns, 1985). It represents the minimum MSE solution, also known as the least-mean-square (LMS) solution.

It is interesting to note here the close relation between the minimum SSE solution in Equation (2.42) and the LMS or minimum MSE solution in Equation (2.45). In fact, one can show that when the size of the training set m is large, the minimum SSE solution converges to the minimum MSE solution.
First, let us express X X^T as the sum of vector outer products Σ_{k=1..m} x_k x_k^T. We can also rewrite X d as Σ_{k=1..m} d_k x_k. This representation allows us to express Equation (2.42) as

    w* = (Σ_{k=1..m} x_k x_k^T)^(-1) (Σ_{k=1..m} d_k x_k)

Now, multiplying the right-hand side of the preceding equation by m/m allows us to express it as

    w* = [(1/m) Σ_{k=1..m} x_k x_k^T]^(-1) [(1/m) Σ_{k=1..m} d_k x_k]

Finally, if m is large, the averages (1/m) Σ_{k=1..m} x_k x_k^T and (1/m) Σ_{k=1..m} d_k x_k become very good approximations of the expectations C = ⟨x x^T⟩ and P = ⟨d x⟩, respectively. Thus we have established the equivalence of the minimum SSE and minimum MSE solutions for a large training set.
Next, in order to minimize the MSE criterion, one may employ a gradient-descent procedure in which, instead of the expected gradient in Equation (2.44), the instantaneous gradient -(d_k - x_k^T w_k) x_k is used. Here, at each learning step the input vector x_k is drawn at random. This leads to the stochastic process

    w_{k+1} = w_k + ρ_k (d_k - x_k^T w_k) x_k              (2.46)

which is the same as the μ-LMS rule in Equation (2.35) except for a variable learning rate ρ_k. It can be shown that if |C| ≠ 0 and ρ_k satisfies the three conditions

    1. ρ_k > 0                                             (2.47a)
    2. lim_{m→∞} Σ_{k=1..m} ρ_k = +∞                       (2.47b)
    3. lim_{m→∞} Σ_{k=1..m} ρ_k² < ∞                       (2.47c)

then w_k converges to w* in Equation (2.45) asymptotically in the mean square; i.e.,

    lim_{k→∞} ⟨‖w_k - w*‖²⟩ = 0                            (2.48)

The criterion function in Equation (2.43) is of the form ⟨g(w, x)⟩ and is known as a regression function. The iterative algorithm in Equation (2.46) is also known as a stochastic approximation procedure (or Kiefer-Wolfowitz or Robbins-Monro procedure). For a thorough discussion of stochastic approximation theory, the reader is referred to Wasan (1969).
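The schedule ρ_k = ρ_0/k mentioned earlier satisfies conditions (2.47a)-(2.47c). The short sketch below runs the stochastic process of Equation (2.46) with that schedule and compares the result with the Wiener solution C^(-1) P; it is an illustrative addition, and the synthetic data model, the value ρ_0 = 0.5, and all variable names are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(1)

n = 3
w_true = np.array([0.5, -1.0, 2.0])

def sample():
    # One random training pair: x from a fixed distribution, d a noisy linear target.
    x = rng.normal(size=n)
    return x, w_true @ x + 0.1 * rng.normal()

# Wiener (minimum MSE) solution estimated from a large sample: w* = C^{-1} P.
Xs = rng.normal(size=(100000, n))
Ds = Xs @ w_true + 0.1 * rng.normal(size=100000)
C = Xs.T @ Xs / len(Ds)
P = Xs.T @ Ds / len(Ds)
w_star = np.linalg.solve(C, P)

# Stochastic approximation (Equation (2.46)) with rho_k = rho_0 / k.
w, rho0 = np.zeros(n), 0.5
for k in range(1, 20001):
    x, d = sample()
    w = w + (rho0 / k) * (d - x @ w) * x

print(np.linalg.norm(w - w_star))     # small: w_k approaches w* in the mean square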
Curs 08 - Correlation Learning Rule / The Delta Rule

Correlation Learning Rule

The correlation learning rule is derived by starting from the criterion function

    J(w) = -Σ_{i=1..m} d_i y_i                             (2.49)

where y_i = x_i^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x_i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule

    w_{k+1} = w_k + ρ d_k x_k                              (2.50)

By setting w_1 = 0 and ρ = 1 and completing one learning cycle using Equation (2.50), we arrive at the weight vector w* given by

    w* = Σ_{i=1..m} d_i x_i = X d                          (2.51)

where X and d are as defined above. Note that Equation (2.51) leads to the minimum SSE solution in Equation (2.38) if X† = X. This is only possible if the training vectors x_k are encoded such that X X^T is the identity matrix (i.e., the vectors x_k are orthonormal).
Another version of this type of learning is the covariance learning rule. This rule is obtained by steepest gradient descent on the criterion function

    J(w) = -Σ_{i=1..m} (y_i - ⟨y⟩)(d_i - ⟨d⟩)

Here, ⟨y⟩ and ⟨d⟩ are computed averages, over all training pairs, of the unit's output and the desired target, respectively. Covariance learning provides the basis of the cascade-correlation net.
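To make the orthonormality condition above concrete, the following sketch compares the correlation-rule weight vector X d with the minimum SSE solution for orthonormal and non-orthonormal inputs. It is an illustrative addition with hypothetical data; the function names and matrices are assumptions chosen for the example.

import numpy as np

def correlation_rule(X, d):
    # One cycle of Equation (2.50) with w_1 = 0 and rho = 1: w* = sum_i d_i x_i = X d.
    return X @ d

def min_sse(X, d):
    # Minimum SSE solution of Equation (2.42): w* = (X X^T)^{-1} X d.
    return np.linalg.solve(X @ X.T, X @ d)

d = np.array([1.0, -1.0, 1.0])

# Orthonormal inputs (columns of the identity): the two solutions coincide.
X_ortho = np.eye(3)
print(np.allclose(correlation_rule(X_ortho, d), min_sse(X_ortho, d)))   # True

# Correlated (non-orthonormal) inputs: the correlation rule is no longer optimal.
X_corr = np.array([[1.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 1.0, 1.0]])
print(np.allclose(correlation_rule(X_corr, d), min_sse(X_corr, d)))     # False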
The Delta Rule

The following rule is similar to the μ-LMS rule except that it allows for units with a differentiable nonlinear activation function f. Figure 2-7 illustrates a unit with a sigmoidal activation function. Here, the unit's output is y = f(net), with net defined as the vector inner product x^T w.

Figure 2-7 A perceptron with a differentiable sigmoidal activation function.

Again, consider the training pairs {x_i, d_i}, i = 1, 2, ..., m, with x_i ∈ R^(n+1) (x_{n+1}^i = 1 for all i) and d_i ∈ [-1, +1]. Performing gradient descent on the instantaneous SSE criterion function

    J(w) = (1/2)(d - y)²

whose gradient is given by

    ∇J(w) = -(d - y) f'(net) x                             (2.52)

leads to the delta rule:

    w_{k+1} = w_k + ρ (d_k - y_k) f'(net_k) x_k,   w_1 arbitrary        (2.53)

where net_k = x_k^T w_k and f' = df/d(net). If f is defined by f(net) = tanh(β net), then its derivative is given by f'(net) = β [1 - f²(net)]. For the "logistic" function, f(net) = 1/(1 + e^(-β net)), the derivative is f'(net) = β f(net) [1 - f(net)]. Figure 2-8 plots f and f' for the hyperbolic tangent activation function with β = 1. Note how f asymptotically approaches +1 and -1 in the limit as net approaches +∞ and -∞, respectively.

Figure 2-8 Hyperbolic tangent activation function f and its derivative f', plotted for -3 ≤ net ≤ +3.
One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of f'(net) in Figure 2-8. In particular, notice how f'(net) ≈ 0 when net has large magnitude (i.e., |net| > 3); these regions are called flat spots of f'. In these flat spots, we expect the delta learning rule to progress very slowly (i.e., to make very small weight changes even when the error (d - y) is large), because the magnitude of the weight change in Equation (2.53) depends directly on the magnitude of f'(net). Since slow convergence results in excessive computation time, it would be advantageous to try to eliminate the flat-spot phenomenon when using the delta learning rule. One common flat-spot elimination technique involves replacing f' by f' plus a small positive bias ε. In this case, the weight update equation reads

    w_{k+1} = w_k + ρ (d_k - y_k) [f'(net_k) + ε] x_k       (2.54)

One of the primary advantages of the delta rule is that it has a natural extension that may be used to train multilayered neural nets. This extension, known as error backpropagation, will be discussed in Chapter 3.
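A minimal sketch of the delta rule of Equation (2.53) with the flat-spot bias of Equation (2.54) follows. It is an illustrative addition: the tanh unit, the small training set, and the parameter values ρ = 0.1, β = 1, ε = 0.05 are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(2)

beta, rho, eps = 1.0, 0.1, 0.05     # gain, learning rate, flat-spot bias

f = lambda net: np.tanh(beta * net)
fprime = lambda net: beta * (1.0 - np.tanh(beta * net) ** 2)

# Small two-class training set; the last input component is the constant bias input.
X = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, 1.0, 1.0, 1.0])   # targets chosen inside the range of tanh

w = np.zeros(3)
for epoch in range(200):
    for k in rng.permutation(len(d)):
        net = X[k] @ w
        y = f(net)
        # Equation (2.54): the small bias eps keeps learning alive inside flat spots.
        w = w + rho * (d[k] - y) * (fprime(net) + eps) * X[k]

print(np.sign(f(X @ w)) == np.sign(d))     # all True for this separable mapping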
Adaptive Ho-Kashyap (AHK) Learning Rules
Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as
enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning
rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an
appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well
suited for generating robust decision surfaces for linearly separable problems. The third training rule,
AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable
problems. The three AHK learning rules preserve the simple incremental nature found in the LMS
and perceptron learning rules. The AHK rules also possess additional processing capabilities, such
as the ability to automatically identify critical cluster boundaries and place a linear decision surface
in such a way that it leads to enhanced classification robustness.
Consider a two-class {c_1, c_2} classification problem with m labeled feature vectors (training vectors) {x_i, d_i}, i = 1, 2, ..., m. Assume that x_i belongs to R^(n+1) (with the last component of x_i being a constant bias of value 1) and that d_i = +1 (-1) if x_i ∈ c_1 (c_2). Then a single perceptron can be trained to correctly classify the preceding training pairs if an (n+1)-dimensional weight vector w is computed that satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):

    x_i^T w > 0 if d_i = +1
    x_i^T w < 0 if d_i = -1        for i = 1, 2, ..., m     (2.55)

Next, if we define a set of m new vectors z_i according to

    z_i = +x_i if d_i = +1
    z_i = -x_i if d_i = -1         for i = 1, 2, ..., m     (2.56)

and we let

    Z = [z_1 z_2 ... z_m]                                   (2.57)

then Equation (2.55) may be rewritten as the single matrix equation

    Z^T w > 0                                               (2.58)

Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation (2.58), we arrive at the following equivalent form of Equation (2.55):

    Z^T w = b                                               (2.59)
Thus the training of the perceptron is now equivalent to solving Equation (2.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (2.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudoinverse is used to generate a solution for w (based on the initial guess of b) that minimizes the SSE criterion function J(w, b) = (1/2)‖Z^T w - b‖²:

    w = Z† b                                                (2.60)

where Z† = (Z Z^T)^(-1) Z, for m > n + 1. Next, a new estimate for the margin vector is computed by performing the constrained (b > 0) gradient descent

    b^(k+1) = b^k + (ρ/2)(ε^k + |ε^k|)   with   ε^k = Z^T w^k - b^k       (2.61)

where |·| denotes the component-wise absolute value of its argument vector and b^k is the "current" margin vector. A new estimate of w can now be computed using Equation (2.60) and employing the updated margin vector from Equation (2.61). This process continues until all the components of ε are zero (or are sufficiently small and positive), which is an indication of linear separability of the training set, or until ε < 0, which is an indication of nonlinear separability of the training set (no solution is found). It can be shown (Ho and Kashyap, 1965) that the Ho-Kashyap procedure converges in a finite number of steps if the training set is linearly separable. For simulations comparing the preceding training algorithm with the LMS and perceptron training procedures, the reader is referred to Hassoun and Clark (1988). This algorithm will be referred to here as the direct Ho-Kashyap (DHK) algorithm.
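A compact sketch of the DHK iteration described above follows. It is an illustrative addition: the stopping thresholds, the learning rate ρ = 0.9, the toy data, and the function names are assumptions made for this example, not part of the original algorithm statement.

import numpy as np

def dhk(Z, rho=0.9, b0=0.1, max_iter=1000, tol=1e-6):
    """Direct Ho-Kashyap; Z has one training vector z_i per column (Equation (2.57))."""
    m = Z.shape[1]
    b = np.full(m, b0)                        # small positive initial margins
    Z_pinv = np.linalg.solve(Z @ Z.T, Z)      # Z† = (Z Z^T)^{-1} Z
    for _ in range(max_iter):
        w = Z_pinv @ b                        # Equation (2.60)
        eps = Z.T @ w - b                     # error vector
        if np.all(np.abs(eps) < tol):
            return w, b, True                 # linearly separable: Z^T w ≈ b > 0
        if np.all(eps < -tol):
            return w, b, False                # evidence of nonlinear separability
        b = b + (rho / 2.0) * (eps + np.abs(eps))   # Equation (2.61), keeps b > 0
    return w, b, bool(np.all(Z.T @ w > 0))

# Example: fold the two-class pairs into z_i = d_i * x_i (bias component included).
X = np.array([[1.0, 0.0, 1.0], [2.0, 0.5, 1.0], [0.0, 2.0, 1.0], [0.5, 2.5, 1.0]]).T
d = np.array([1.0, 1.0, -1.0, -1.0])
Z = X * d                                     # column-wise sign flip for class c2
w, b, separable = dhk(Z)
print(separable, np.all(Z.T @ w > 0))         # True True for this separable set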
The direct synthesis of the w estimate in Equation (2.60) involves a one-time computation of the pseudoinverse of Z. However, such a computation can be expensive and requires special treatment when Z Z^T is ill-conditioned (i.e., when the determinant |Z Z^T| is close to zero). An alternative algorithm that is based on gradient-descent principles and does not require the direct computation of Z† can be derived. This derivation is presented next.

Starting with the criterion function J(w, b) = (1/2)‖Z^T w - b‖², gradient descent may be performed with respect to b and w so that J is minimized subject to the constraint b > 0. The gradients of J with respect to b and w are given by

    ∇_b J(w, b)|_{w^k, b^k} = -(Z^T w^k - b^k)                          (2.62a)
    ∇_w J(w, b)|_{w^k, b^(k+1)} = Z (Z^T w^k - b^(k+1))                 (2.62b)

where the superscripts k and k+1 represent current and updated values, respectively. One analytic method for imposing the constraint b > 0 is to replace the gradient in Equation (2.62a) by -(1/2)(ε^k + |ε^k|), with ε^k as defined in Equation (2.61). This leads to the following gradient-descent formulation of the Ho-Kashyap procedure:

    b^(k+1) = b^k + (ρ_1/2)(ε^k + |ε^k|)   with   ε^k = Z^T w^k - b^k    (2.63a)

and

    w^(k+1) = w^k - ρ_2 Z (Z^T w^k - b^(k+1))
            = w^k - ρ_2 Z [(1 - ρ_1/2) ε^k - (ρ_1/2)|ε^k|]               (2.63b)

where ρ_1 and ρ_2 are strictly positive constant learning rates. Because of the requirement that all training vectors z_k (or x_k) be present and included in Z, this procedure is called the batch-mode adaptive Ho-Kashyap (AHK) procedure. It can easily be shown that if ρ_1 = 0 and b^1 = 1, Equation (2.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ_1 < 2 and 0 < ρ_2 < 2/λ_max, where λ_max is the largest eigenvalue of the positive-definite matrix Z Z^T.
A completely adaptive Ho-Kashyap procedure for solving Equation (2.59) is arrived at by starting from the instantaneous criterion function

    J(w, b_i) = (1/2)(z_i^T w - b_i)²

which leads to the following incremental update rules:

    b_i^(k+1) = b_i^k + (ρ_1/2)(ε_i^k + |ε_i^k|)   with   ε_i^k = z_i^T w^k - b_i^k      (2.64a)

and

    w^(k+1) = w^k - ρ_2 z_i (z_i^T w^k - b_i^(k+1))
            = w^k - ρ_2 [(1 - ρ_1/2) ε_i^k - (ρ_1/2)|ε_i^k|] z_i                          (2.64b)

Here, b_i represents a scalar margin associated with the input x_i. In all the preceding Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron weights are initialized to zero (or small random) values. If full margin error correction is assumed in Equation (2.64a), i.e., ρ_1 = 1, the incremental learning procedure in Equation (2.64) reduces to the heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (2.64) is

    Δb_i = ρ_1 ε_i^k   and   Δw = -ρ_2 (1 - ρ_1) ε_i^k z_i   if ε_i^k > 0                  (2.65a)
    Δb_i = 0           and   Δw = -ρ_2 ε_i^k z_i              if ε_i^k ≤ 0                  (2.65b)

where Δb and Δw signify the difference between the updated and current values of b and w, respectively. This procedure is called the AHK I learning rule. For comparison purposes, it may be noted that the μ-LMS rule in Equation (2.35) can be written as Δw = -ρ ε_i^k z_i, with b_i held fixed at +1.
The implied constraint b_i > 0 in Equations (2.64) and (2.65) was realized by starting with a positive initial margin and restricting the change Δb to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in b, except for the cases where a decrease in b would result in a negative margin. This modification results in the following alternative AHK II learning rule:

    Δb_i = ρ_1 ε_i^k   and   Δw = -ρ_2 (1 - ρ_1) ε_i^k z_i   if b_i^k + ρ_1 ε_i^k > 0      (2.66a)
    Δb_i = 0           and   Δw = -ρ_2 ε_i^k z_i              if b_i^k + ρ_1 ε_i^k ≤ 0      (2.66b)

In the general case of an adaptive margin, as in Equation (2.66), Hassoun and Song (1992) showed that a sufficient condition for the convergence of the AHK rules is given by

    0 < ρ_2 < 2 / max_i ‖z_i‖²                              (2.67a)
    0 < ρ_1 < 2                                             (2.67b)

Another variation results in the AHK III rule, which is appropriate for both linearly separable and nonlinearly separable problems. Here, Δw is set to 0 in Equation (2.66b). The advantages of the AHK III rule are that (1) it is capable of adaptively identifying difficult-to-separate class boundaries and (2) it uses such information to discard nonseparable training vectors and speed up convergence (Hassoun and Song, 1992).
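The incremental AHK I rule of Equation (2.65) can be sketched as follows. This is an illustrative addition: the initial margin b_i = 0.1, the learning rates ρ_1 = 1 and ρ_2 = 0.1, the toy data, and the variable names are assumptions chosen for this example.

import numpy as np

rng = np.random.default_rng(3)

def ahk1(Z, rho1=1.0, rho2=0.1, b0=0.1, epochs=200):
    """Incremental AHK I (Equation (2.65)); Z holds one training vector z_i per column."""
    n_plus_1, m = Z.shape
    w = np.zeros(n_plus_1)
    b = np.full(m, b0)                 # per-pattern scalar margins, kept positive
    for _ in range(epochs):
        for i in rng.permutation(m):
            eps = Z[:, i] @ w - b[i]
            if eps > 0:                # Equation (2.65a): margins can only grow
                b[i] += rho1 * eps
                w -= rho2 * (1.0 - rho1) * eps * Z[:, i]
            else:                      # Equation (2.65b): margin violation
                w -= rho2 * eps * Z[:, i]
    return w, b

# Same folded representation as in the DHK sketch: z_i = d_i * x_i with a bias term.
X = np.array([[1.0, 0.0, 1.0], [2.0, 0.5, 1.0], [0.0, 2.0, 1.0], [0.5, 2.5, 1.0]]).T
d = np.array([1.0, 1.0, -1.0, -1.0])
w, b = ahk1(X * d)
print(np.all((X * d).T @ w > 0))       # True: a separating weight vector was found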
Example 2.2 In this example the perceptron, LMS, and AHK learning rules are compared in terms of the quality of the solutions they generate. Consider the simple two-class linearly separable problem shown earlier in Figure 2-4. The μ-LMS rule of Equation (2.35) is used to obtain the solution shown as a dashed line in Figure 2-9. Here, the initial weight vector was set to 0 and a learning rate ρ = 0.005 was used. This solution is not one of the linearly separable solutions for this problem. Four examples of linearly separable solutions are shown as solid lines in the figure. These solutions were generated using the perceptron learning rule of Equation (2.2), with varying order of input vector presentations and with a learning rate of ρ = 0.1. Here, it should be noted that the most robust solution, in the sense of tolerance to noisy input, is the one shown as a dotted line in Figure 2-9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (2.65).
Other Criterion Functions

The SSE criterion function in Equation (2.32) is not the only possible choice. We have already seen other alternative functions, such as the ones in Equations (2.20), (2.24), and (2.25). In general, any differentiable function that is minimized upon setting y_i = d_i, for i = 1, 2, ..., m, could be used. One possible generalization of SSE is the Minkowski-r criterion function (Hanson and Burr, 1988), given by

    J(w) = (1/r) Σ_{i=1..m} |d_i - y_i|^r                   (2.68)

or its instantaneous version

    J(w) = (1/r) |d_i - y_i|^r                              (2.69)

Figure 2-9 LMS-generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the AHK I rule.
Figure 2-10 shows a plot of |d_i - y_i|^r for r = 1, 1.5, 2, and 20. The general form of the gradient of this criterion function is given by

    ∇J(w) = -|d - y|^(r-1) sgn(d - y) f'(net) x             (2.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given in Equation (2.52). If r = 1, then J(w) = |d_i - y_i|, with the gradient [note that the gradient of J(w) does not exist at the solution points d = y]

    ∇J(w) = -sgn(d - y) f'(net) x                           (2.71)

In this case, the criterion function in Equation (2.68) is known as the Manhattan norm. For r → ∞, a supremum error measure is approached.

A small r gives less weight to large deviations and tends to reduce the influence of the outer-most points in the input space during learning. It can be shown, for a linear unit with normally distributed inputs, that r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood). The proof follows from a maximum-likelihood argument: when the deviations d - y are modeled as zero-mean Gaussian noise, the log-likelihood of the training data is, up to constants, proportional to the negative of the SSE criterion, so the maximum-likelihood weight vector coincides with the minimum SSE solution.

Figure 2-10 A family of instantaneous Minkowski-r criterion functions.
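The instantaneous Minkowski-r gradient of Equation (2.70) can be written compactly; the sketch below shows one weight update for a single pattern. It is an illustrative addition, and the tanh unit, the step size, and the chosen values of r are assumptions made for the example.

import numpy as np

def minkowski_r_update(w, x, d, r, rho=0.05, beta=1.0):
    """One delta-rule-style step using the Minkowski-r gradient (Equation (2.70))."""
    net = x @ w
    y = np.tanh(beta * net)
    fprime = beta * (1.0 - y ** 2)
    # For r = 2 this is the usual delta rule; for r = 1 it is the Manhattan-norm rule.
    grad_factor = np.abs(d - y) ** (r - 1) * np.sign(d - y)
    return w + rho * grad_factor * fprime * x

x = np.array([0.5, -1.0, 1.0])          # input vector with a bias component
w = np.zeros(3)
for r in (1.0, 2.0):
    print(r, minkowski_r_update(w, x, d=1.0, r=r))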
Another criterion function that can be used (Hopfield, 1987) is the instantaneous relative entropy error measure (Kullback, 1959), defined by

    J(w) = (1/2) [ (1 + d) ln((1 + d)/(1 + y)) + (1 - d) ln((1 - d)/(1 - y)) ]      (2.76)

where d belongs to the open interval (-1, +1). As before, J(w) ≥ 0, and if y = d, then J(w) = 0. If y = f(net) = tanh(β net), the gradient of Equation (2.76) is

    ∇J(w) = -β (d - y) x                                    (2.77)

The factor f'(net) present in Equations (2.53) and (2.70) is missing from Equation (2.77). This eliminates the flat spot encountered in the delta rule and makes the training here more like μ-LMS [note, however, that y is given here by f(net) rather than by net]. This entropy criterion is "well formed" in the sense that gradient descent over such a function will result in a linearly separable solution, if one exists (Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not share this property, since it may fail to find a linearly separable solution, as demonstrated in Example 2.2.
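A minimal sketch contrasting the entropy-based update of Equation (2.77) with the ordinary delta rule of Equation (2.53) follows. It is an illustrative addition; the single training pair, β = 1, ρ = 0.1, and the deliberately saturated initial weights are assumptions chosen to expose the flat-spot difference.

import numpy as np

beta, rho = 1.0, 0.1
x = np.array([1.0, 1.0])          # input (second component acts as the bias)
d = 1.0                           # desired target
w = np.array([-4.0, -4.0])        # deliberately deep inside a flat spot (net = -8)

net = x @ w
y = np.tanh(beta * net)

# Ordinary delta rule (Equation (2.53)): the f'(net) factor is essentially zero here.
dw_sse = rho * (d - y) * beta * (1.0 - y ** 2) * x

# Relative-entropy delta rule (Equation (2.77)): no f'(net) factor, so learning proceeds.
dw_entropy = rho * beta * (d - y) * x

print(np.linalg.norm(dw_sse), np.linalg.norm(dw_entropy))   # ~0 versus a sizable step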
In order for gradient-descent search to find a solution w* in the desired linearly separable region, we need to use a well-formed criterion function. Consider the following general criterion function:

    J(w) = Σ_{i=1..m} g(z_i^T w)                            (2.78)

where

    z = +x if x ∈ class c_1
    z = -x if x ∈ class c_2

Let s = z^T w. The criterion function J(w) is said to be well formed if g(s) is differentiable and satisfies the following conditions (Wittner and Denker, 1988):

1. For all s, dg(s)/ds ≤ 0; i.e., g does not push in the wrong direction.
2. There exists ε > 0 such that dg(s)/ds ≤ -ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.
3. g(s) is bounded from below.

For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the criterion function is well formed, then gradient descent is guaranteed to enter the region of linearly separable solutions w*, provided that such a region exists.
Table 2-1 Summary of Basic Learning Rules

Notation: z_k = x_k if d_k = +1 and z_k = -x_k if d_k = -1; the general form of the learning equation is w^(k+1) = w^k + ρ_k s^k, where ρ_k is the learning rate and s^k is the learning vector; net = x^T w. All rules listed are supervised.

Perceptron rule
  Criterion function: J(w) = -Σ_{z^T w ≤ 0} z^T w
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0; 0 otherwise
  Conditions: ρ > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence time if the training set is linearly separable; w stays bounded for arbitrary training sets.

Perceptron rule with variable learning rate and fixed margin
  Criterion function: J(w) = -Σ_{z^T w ≤ b} (z^T w - b)
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ b; 0 otherwise
  Conditions: b > 0; ρ_k ≥ 0; Σ_{k=1..m} ρ_k → ∞; (Σ_{k=1..m} ρ_k²)/(Σ_{k=1..m} ρ_k)² → 0 as m → ∞
  Activation function: f(net) = sgn(net)
  Remarks: Converges to a solution if the training set is linearly separable; finite convergence if ρ_k = ρ, where ρ is a finite positive constant.

May's rule
  Criterion function: J(w) = (1/2) Σ_{z^T w ≤ b} (z^T w - b)² / ‖z‖²
  Learning vector: s^k = [(b - (z^k)^T w^k)/‖z^k‖²] z^k if (z^k)^T w^k ≤ b; 0 otherwise
  Conditions: 0 < ρ < 2; b > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence to a solution satisfying z^T w > b > 0 if the training set is linearly separable.

Butz's rule
  Criterion function: J(w) = -Σ_i γ_i (z_i^T w), with γ_i = 1 for misclassified patterns and γ_i = γ otherwise
  Learning vector: s^k = z^k if (z^k)^T w^k ≤ 0; γ z^k otherwise
  Conditions: 0 ≤ γ < 1; ρ > 0
  Activation function: f(net) = sgn(net)
  Remarks: Finite convergence if the training set is linearly separable. Places w in a region that tends to minimize the probability of error for nonlinearly separable cases.

Widrow-Hoff rule (α-LMS)
  Criterion function: J(w) = (1/2) Σ_i (d_i - x_i^T w)² / ‖x_i‖²
  Learning vector: s^k = [(d_k - x_k^T w^k)/‖x_k‖²] x_k
  Conditions: 0 < ρ < 2
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution if ‖x_i‖ = ‖x_j‖ for all i, j.

μ-LMS rule
  Criterion function: J(w) = (1/2) Σ_i (d_i - x_i^T w)²
  Learning vector: s^k = (d_k - x_k^T w^k) x_k
  Conditions: 0 < ρ < 2/(3 ⟨‖x‖²⟩)
  Activation function: f(net) = net
  Remarks: Converges in the mean square to the minimum SSE or LMS solution.

Stochastic μ-LMS rule
  Criterion function: J(w) = (1/2) ⟨(d - x^T w)²⟩
  Learning vector: s^k = (d_k - x_k^T w^k) x_k
  Conditions: ρ_k > 0; Σ_k ρ_k = +∞; Σ_k ρ_k² < ∞
  Activation function: f(net) = net
  Remarks: ⟨·⟩ is the mean operator (at each learning step the training vector x_k is drawn at random). Converges in the mean square to the minimum SSE or LMS solution.

Correlation rule
  Criterion function: J(w) = -Σ_i d_i (x_i^T w)
  Learning vector: s^k = d_k x_k
  Conditions: ρ > 0
  Activation function: f(net) = net
  Remarks: Converges to the minimum SSE solution if the vectors x_k are mutually orthonormal.

Delta rule
  Criterion function: J(w) = (1/2) Σ_i (d_i - y_i)², with y_i = f(x_i^T w)
  Learning vector: s^k = (d_k - y_k) f'(net_k) x_k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: Extends the μ-LMS rule to units with differentiable nonlinear activations.

Minkowski-r delta rule
  Criterion function: J(w) = (1/r) Σ_i |d_i - y_i|^r
  Learning vector: s^k = |d_k - y_k|^(r-1) sgn(d_k - y_k) f'(net_k) x_k
  Conditions: 0 < ρ < 1
  Activation function: y = f(net), where f is a sigmoid function
  Remarks: 0 < r < 2 is suited to pseudo-Gaussian distributions p(x) with pronounced tails; r = 2 gives the delta rule; r = 1 arises when p(x) is a Laplace distribution.

Relative entropy delta rule
  Criterion function: J(w) = (1/2) Σ_i [ (1 + d_i) ln((1 + d_i)/(1 + y_i)) + (1 - d_i) ln((1 - d_i)/(1 - y_i)) ]
  Learning vector: s^k = (d_k - y_k) x_k
  Conditions: 0 < ρ < 1
  Activation function: y = tanh(β net)
  Remarks: Eliminates the flat spot suffered by the delta rule. Converges to a linearly separable solution if one exists.

AHK I
  Criterion function: J(w, b) = (1/2) Σ_i (z_i^T w - b_i)², with margins b_i > 0
  Updates (with ε_i^k = z_i^T w^k - b_i^k):
    margin: Δb_i = ρ_1 ε_i^k if ε_i^k > 0; 0 otherwise
    weight vector: Δw = -ρ_2 (1 - ρ_1) ε_i^k z_i if ε_i^k > 0; Δw = -ρ_2 ε_i^k z_i if ε_i^k ≤ 0
  Conditions: b^1 > 0; 0 < ρ_1 < 2; 0 < ρ_2 < 2/max_i ‖z_i‖²
  Activation function: f(net) = sgn(net)
  Remarks: The b_i values can only increase from their initial values. Converges to a robust solution for linearly separable problems.

AHK II
  Criterion function: J(w, b) = (1/2) Σ_i (z_i^T w - b_i)², with margin vector b > 0
  Updates (with ε_i^k = z_i^T w^k - b_i^k):
    margin: Δb_i = ρ_1 ε_i^k if b_i^k + ρ_1 ε_i^k > 0; 0 otherwise
    weight vector: Δw = -ρ_2 (1 - ρ_1) ε_i^k z_i if b_i^k + ρ_1 ε_i^k > 0; Δw = -ρ_2 ε_i^k z_i otherwise
  Conditions: b^1 > 0; 0 < ρ_1 < 2; 0 < ρ_2 < 2/max_i ‖z_i‖²
  Activation function: f(net) = sgn(net)
  Remarks: The b_i values can take any positive value. Converges to a robust solution for linearly separable problems.

AHK III
  Criterion function: J(w, b) = (1/2) Σ_i (z_i^T w - b_i)², with margin vector b > 0
  Updates (with ε_i^k = z_i^T w^k - b_i^k):
    margin: Δb_i = ρ_1 ε_i^k if b_i^k + ρ_1 ε_i^k > 0; 0 otherwise
    weight vector: Δw = -ρ_2 (1 - ρ_1) ε_i^k z_i if b_i^k + ρ_1 ε_i^k > 0; 0 otherwise
  Conditions: b^1 > 0; 0 < ρ_1 < 2; 0 < ρ_2 < 2/max_i ‖z_i‖²
  Activation function: f(net) = sgn(net)
  Remarks: Converges for linearly separable as well as nonlinearly separable cases. It automatically identifies and discards the critical points affecting the nonlinear separability and results in a solution that tends to minimize misclassifications.

Delta rule for stochastic units
  Criterion function: J(w) = (1/2) Σ_i (d_i - ⟨y_i⟩)²
  Learning vector: s^k = [d_k - tanh(β net_k)] β [1 - tanh²(β net_k)] x_k
  Conditions: 0 < ρ < 1
  Activation function: stochastic; y = +1 with probability P(y = +1 | net) and y = -1 with probability 1 - P(y = +1 | net), where P(y = +1 | net) = 1/(1 + e^(-2β net))
  Remarks: ⟨·⟩ is the mean operator. The average behavior is equivalent to the delta rule applied to a unit with the deterministic activation y = tanh(β net).
Learning Rules for Multilayer Feedforward Neural Networks
This chapter extends the gradient-descent-based delta rule of Chapter 2 to multilayer feedforward
neural networks. The resulting learning rule is commonly known as error backpropagation (or
backprop), and it is one of the most frequently used learning rules in many applications of artificial
neural networks.
The backprop learning rule is central to much current work on learning in artificial neural networks.
In fact, the development of backprop is one of the main reasons for the renewed interest in artificial
neural networks. Backprop provides a computationally efficient method for changing the weights in
a feedforward network, with differentiable activation function units, to learn a training set of input-
output examples. Backprop-trained multilayer neural nets have been applied successfully to solve
some difficult and diverse problems, such as pattern classification, function approximation,
nonlinear system modeling, time-series prediction, and image compression and reconstruction. For
these reasons, most of this chapter is devoted to the study of backprop, its variations, and its
extensions.
Curs 10 - Perceptron Networks
Backpropagation is a gradient-descent search algorithm that may suffer from slow convergence to
local minima. In this chapter, several methods for improving back-prop's convergence speed and
avoidance of local minima are presented. Whenever possible, theoretical justification is given for
these methods. A version of backprop based on an enhanced criterion function with global search
capability is described which, when properly tuned, allows for relatively fast convergence to good
solutions.
Consider the two-layer feedforward architecture shown in Figure 0-1. This network receives a set of scalar signals {x_0, x_1, ..., x_n}, where x_0 is a bias signal equal to 1. This set of signals constitutes an input vector x ∈ R^(n+1). The layer receiving the input signal is called the hidden layer. Figure 0-1 shows a hidden layer having J units. The output of the hidden layer is a (J+1)-dimensional real-valued vector z = [z_0, z_1, ..., z_J]^T. Again, z_0 = 1 represents a bias input and can be thought of as being generated by a "dummy" unit (with index zero) whose output is clamped at 1. The vector z supplies the input for the output layer of L units. The output layer generates an L-dimensional vector y in response to the input x which, when the network is fully trained, should be identical (or very close) to a "desired" output vector d associated with x.

Figure 0-1 A two-layer fully interconnected feedforward neural network architecture. For clarity, only selected connections are drawn.
The activation function of the hidden units is assumed to be a differentiable nonlinear function f_h [typically, f_h is the logistic function defined by f_h(net) = 1/(1 + e^(-λ net)) or the hyperbolic tangent function f_h(net) = tanh(β net), with values of λ and β close to unity]. Each unit of the output layer is assumed to have the same activation function, denoted f_o; the functional form of f_o is determined by the desired output signal/pattern representation or the type of application. For example, if the desired output is real valued (as in some function approximation applications), then a linear activation f_o(net) = net may be used. On the other hand, if the network implements a pattern classifier with binary outputs, then a saturating nonlinearity similar to f_h may be used for f_o. In this case, the components of the desired output vector d must be chosen within the range of f_o. It is important to note that if f_h is linear, then one can always collapse the net in Figure 0-1 to a single-layer net and thus lose the universal approximation/mapping capabilities discussed in Chapter 4. Finally, we denote by w_{ji} the weight of the jth hidden unit associated with the input signal x_i. Similarly, w_{lj} is the weight of the lth output unit associated with the hidden signal z_j.
Next, consider a set of m input/output pairs {x_k, d_k}, where d_k is an L-dimensional vector representing the desired network output upon presentation of x_k. The objective here is to adaptively adjust the J(n+1) + L(J+1) weights of this network such that the underlying function/mapping represented by the training set is approximated or learned. Since the learning here is supervised (i.e., target outputs are available), an error function may be defined to measure the degree of approximation for any given setting of the network's weights. A commonly used error function is the SSE measure, but this is by no means the only possibility, and later in this chapter several other error functions will be discussed. Once a suitable error function is formulated, learning can be viewed (as was done in Chapter 2) as an optimization process. That is, the error function serves as a criterion function, and the learning algorithm seeks to minimize the criterion function over the space of possible weight settings. For instance, if a differentiable criterion function is used, gradient descent on such a function will naturally lead to a learning rule.

    E(w) = (1/2) Σ_{l=1..L} (d_l - y_l)²                    (3.1)
Here, w represents the set of all weights in the network. Note that Equation (3.1) is the
"instantaneous" SSE criterion of Equation (2.32) generalized for a multiple-output network.
3.1 Error Backpropagation Learning Rule

Since the targets for the output units are explicitly specified, one can use the delta rule, derived in Section 2.3, directly for updating the output-layer weights w_{lj}. That is,

    Δw_{lj} = w_{lj}^new - w_{lj}^c = -ρ_o ∂E/∂w_{lj} = ρ_o (d_l - y_l) f_o'(net_l) z_j      (3.2)

with l = 1, 2, ..., L and j = 0, 1, ..., J. Here, net_l = Σ_{j=0..J} w_{lj} z_j is the weighted sum for the lth output unit, f_o' is the derivative of f_o with respect to net, and w_{lj}^new and w_{lj}^c represent the updated (new) and current weight values, respectively. The z_j values are computed by propagating the input vector x through the hidden layer according to

    z_j = f_h(Σ_{i=0..n} w_{ji} x_i) = f_h(net_j),   j = 1, 2, ..., J                        (3.3)
The learning rule for the hidden-layer weights w_{ji} is not as obvious as that for the output layer, because we do not have available a set of target values (desired outputs) for the hidden units. However, one may derive a learning rule for hidden units by attempting to minimize the output-layer error. This amounts to propagating the output errors (d_l - y_l) back through the output layer toward the hidden units in an attempt to estimate "dynamic" targets for these units. Such a learning rule is termed error backpropagation or the backprop learning rule and may be viewed as an extension of the delta rule [Equation (3.2)] used for updating the output layer. To complete the derivation of backprop for the hidden-layer weights, and similarly to the preceding derivation for the output-layer weights, gradient descent is performed on the criterion function in Equation (3.1), but this time the gradient is calculated with respect to the hidden weights:

    Δw_{ji} = -ρ_h ∂E/∂w_{ji},   j = 1, 2, ..., J;  i = 0, 1, 2, ..., n                      (3.4)
where the partial derivative is to be evaluated at the current weight values. Using the chain rule for differentiation, one may express the partial derivative in Equation (3.4) as

    ∂E/∂w_{ji} = (∂E/∂z_j)(∂z_j/∂net_j)(∂net_j/∂w_{ji})                                      (3.5)

with

    ∂net_j/∂w_{ji} = x_i                                                                     (3.6)

    ∂z_j/∂net_j = f_h'(net_j)                                                                (3.7)

and

    ∂E/∂z_j = ∂/∂z_j [ (1/2) Σ_{l=1..L} (d_l - f_o(net_l))² ]
            = -Σ_{l=1..L} (d_l - f_o(net_l)) f_o'(net_l) ∂net_l/∂z_j
            = -Σ_{l=1..L} (d_l - y_l) f_o'(net_l) w_{lj}                                     (3.8)

Now, upon substituting Equations (3.6) through (3.8) into Equation (3.5) and using Equation (3.4), the desired learning rule is obtained:

    Δw_{ji} = ρ_h [ Σ_{l=1..L} (d_l - y_l) f_o'(net_l) w_{lj} ] f_h'(net_j) x_i              (3.9)

By comparing Equation (3.9) with Equation (3.2), one can immediately define an "estimated target" d_j for the jth hidden unit implicitly in terms of the backpropagated error signal as follows:

    d_j - z_j = Σ_{l=1..L} (d_l - y_l) f_o'(net_l) w_{lj}                                    (3.10)
It is usually possible to express the derivatives of the activation functions in Equations (3.2) and (3.9) in terms of the activations themselves. For example, for the logistic activation function,

    f'(net) = λ f(net) [1 - f(net)]                         (3.11)

and for the hyperbolic tangent function,

    f'(net) = β [1 - f²(net)]                               (3.12)

These learning equations may also be extended to feedforward nets with more than one hidden layer and/or nets with connections that jump over one or more layers. The complete procedure for updating the weights in a feedforward neural net utilizing these rules is summarized below for the two-layer architecture of Figure 0-1. This learning procedure will be referred to as incremental backprop or just backprop.
1. Initialize all weights and refer to them as "current" weights w_{lj}^c and w_{ji}^c (see Section 3.3.1 for details).
2. Set the learning rates ρ_o and ρ_h to small positive values (refer to Section 3.3.2 for additional details).
3. Select an input pattern x_k from the training set (preferably at random) and propagate it through the network, thus generating hidden- and output-unit activities based on the current weight settings.
4. Use the desired target d_k associated with x_k, and employ Equation (3.2) to compute the output-layer weight changes Δw_{lj}.
5. Employ Equation (3.9) to compute the hidden-layer weight changes Δw_{ji}. Normally, the current weights are used in these computations. In general, enhanced error correction may be achieved if one employs the updated output-layer weights w_{lj}^new = w_{lj}^c + Δw_{lj}. However, this comes at the added cost of recomputing y_l and f_o'(net_l).
6. Update all weights according to w_{lj}^new = w_{lj}^c + Δw_{lj} and w_{ji}^new = w_{ji}^c + Δw_{ji} for the output and hidden layers, respectively.
7. Test for convergence. This is done by checking some preselected function of the output errors(1) to see if its magnitude is below some preset threshold. If convergence is met, stop; otherwise, set w_{ji}^c = w_{ji}^new and w_{lj}^c = w_{lj}^new, and go to step 3. It should be noted that backprop may fail to find a solution that passes the convergence test. In this case, one may try to reinitialize the search process, tune the learning parameters, and/or use more hidden units. A minimal code sketch of this incremental procedure is given immediately after this list.
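The following Python sketch implements the incremental backprop procedure above for the two-layer architecture. It is an illustrative addition, not part of the original notes: the tanh activations for both layers, the XOR-like toy data, the learning rates, the number of hidden units, and all variable names are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(4)

# Toy training set (XOR): inputs x_k with a leading bias component x_0 = 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
D = np.array([[-1.0], [1.0], [1.0], [-1.0]])           # one output unit (L = 1)

n, J, L = 2, 3, 1                                      # inputs, hidden units, outputs
rho_h, rho_o = 0.1, 0.1

# Small random initial weights; W_h is J x (n+1), W_o is L x (J+1).
W_h = rng.uniform(-0.5, 0.5, size=(J, n + 1))
W_o = rng.uniform(-0.5, 0.5, size=(L, J + 1))

f = np.tanh
fprime = lambda net: 1.0 - np.tanh(net) ** 2

for epoch in range(5000):
    for k in rng.permutation(len(X)):
        x, d = X[k], D[k]
        # Forward pass (Equation (3.3)), with the bias unit z_0 = 1 prepended.
        net_h = W_h @ x
        z = np.concatenate(([1.0], f(net_h)))
        net_o = W_o @ z
        y = f(net_o)
        # Output-layer update (Equation (3.2)).
        delta_o = (d - y) * fprime(net_o)
        W_o_new = W_o + rho_o * np.outer(delta_o, z)
        # Hidden-layer update (Equation (3.9)); the current output weights are used.
        delta_h = (W_o[:, 1:].T @ delta_o) * fprime(net_h)
        W_h += rho_h * np.outer(delta_h, x)
        W_o = W_o_new

# The output signs typically match the targets after training; as discussed below,
# backprop may occasionally settle in a local minimum instead.
Z = np.concatenate((np.ones((len(X), 1)), f(X @ W_h.T)), axis=1)
print(np.sign(f(Z @ W_o.T)).ravel(), D.ravel())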
This procedure is based on incremental learning, which means that the weights are updated after every presentation of an input pattern. Another alternative is to employ batch learning, where weight updating is performed only after all patterns (assuming a finite training set) have been presented. Batch learning is formally stated by summing the right-hand sides of Equations (3.2) and (3.9) over all patterns x_k. This amounts to gradient descent on the criterion function

    E(w) = (1/2) Σ_{k=1..m} Σ_{l=1..L} (d_l^k - y_l^k)²     (3.13)

(1) A convenient selection is the root-mean-square (RMS) error given by sqrt(2E/(mL)), with E as in Equation (3.13). An alternative, and more sensible, stopping test may be formulated by using cross-validation; see Section 3.3.6 for details.
Even though batch updating moves the search point w in the direction of the true gradient at each
update step, the "approximate" incremental updating is more desirable for two reasons: (1) it
requires less storage, and (2) it makes the search path in the weight space stochastic (here, at each
time step, the input vector x is drawn at random), which allows for a wider exploration of the search
space and, potentially, leads to better-quality solutions. When backprop converges, it converges to a
local minimum of the criterion function. This fact is true of any gradient-descent-based learning rule
when the surface being searched is nonconvex; i.e., it admits local minima. Using stochastic
approximation theory, Finnoff (1994) showed that for "very small" learning rates (approaching
zero), incremental backprop approaches batch backprop and produces essentially the same results.
However, for small constant learning rates there is a nonnegligible stochastic element in the training
process that gives incremental backprop a quasi-annealing character in which the cumulative
gradient is continuously perturbed, allowing the search to escape local minima with small and
shallow basins of attraction. Thus solutions generated by incremental backprop are often practical
ones. The local minima problem can be eased further by heuristically adding random noise to the
weights (von Lehman et al., 1988) or by adding noise to the input patterns (Sietsma and Dow, 1988).
In both cases, some noise-reduction schedule should be employed to dynamically reduce the added
noise level toward zero as learning progresses.
Example 3.1 Consider the two-class problem shown in Figure 0-2. The points inside the shaded region belong to class B, and all other points are in class A. A three-layer feedforward neural network with backprop training is employed that is supposed to learn to distinguish between these two classes. The network consists of an eight-unit first hidden layer, followed by a second hidden layer with four units, followed by a one-unit output layer. Such a network is said to have an 8-4-1 architecture. All units employ a hyperbolic tangent activation function. The output unit should encode the class of each input vector: a positive output indicates class B, and a negative output indicates class A. Incremental backprop was used with learning rates set to 0.1. The training set consists of 500 randomly chosen points, 250 from region A and another 250 from region B. In this training set, points representing class B and class A were assigned desired output (target) values of +1 and -1, respectively.(2) Training was performed for several hundred cycles over the training set.

Figure 0-2 Decision regions for the pattern-classification problem in Example 3.1.

(2) In fact, the actual targets used were offset by a small positive constant ε (say, ε = 0.1) away from the limiting values of the activation function. This resulted in replacing the +1 and -1 targets by 1 - ε and -1 + ε, respectively. Otherwise, backprop tends to drive the weights of the network to infinity and thereby slow the learning process.
Figure 0-3 shows geometric plots of all unit responses upon testing the network with a new set of 1000 uniformly (randomly) generated points inside the [-1, +1]² region. In generating each plot, a black dot was placed at the exact coordinates of the test point (input) in the input space if and only if the corresponding unit response was positive. The boundaries between the dotted and the white regions in the plots represent approximate decision boundaries learned by the various units in the network. Figure 0-3a-h represents the decision boundaries learned by the eight units in the first hidden layer. Figure 0-3i-l shows the decision boundaries learned by the four units of the second hidden layer. Figure 0-3m shows the decision boundary realized by the output unit. Note the linear nature of the separating surfaces realized by the first-hidden-layer units, from which complex nonlinear separating surfaces are realized by the second-hidden-layer units and ultimately by the output-layer unit. This example also illustrates how a single-hidden-layer feedforward net (counting only the first two layers) is capable of realizing convex, concave, as well as disjoint decision regions, as can be seen from Figure 0-3i-l. Here, we neglect the output unit and view the remaining net as one with an 8-4 architecture.
The present problem can also be solved with smaller networks (a smaller number of hidden units, or even a network with a single hidden layer). However, the training of such smaller networks with
backprop may become more difficult. A smaller network with a 5-3-1 architecture utilizing a variant
backprop learning procedure has a comparable separating surface to the one in Figure 0-3m.
Huang and Lippmann (1988) employed Monte Carlo simulations to investigate the capabilities of
backprop in learning complex decision regions (see Figure 4.5). They reported no significant
performance difference between two- and three-layer feedforward nets when forming complex
decision regions using backprop. They also demonstrated that backprop's convergence time is
excessive for complex decision regions and that the performance of such trained classifiers is similar
to that obtained with the k-nearest neighbor classifier (Duda and Hart, 1973). Villiers and Barnard (1993) reported similar simulations, but on data sets that consisted of a "distribution of distributions," where a typical class is a set of clusters (distributions) in the feature space, each of which can be more or less spread out and might involve some or all of the dimensions of the feature space; the distribution of distributions thus assigns a probability to each distribution in the data set. It was
found for networks of equal complexity (same number of weights) that there is no significant
difference between the quality of "best" solutions generated by two- and three-layer backprop-
trained feedforward networks; actually, the two-layer nets demonstrated better performance, on
average. As for the speed of convergence, three-layer nets converged faster if the number of units in
the two hidden layers were roughly equal.
Gradient-descent search may be eliminated altogether in favor of a stochastic global search
procedure that guarantees convergence to a global solution with high probability; genetic algorithms
and simulated annealing are examples of such procedures. However, the assured (in probability)
optimality of these global search procedures comes at the expense of slow convergence. Next, a
deterministic search procedure termed global descent is presented that helps backprop reach globally
optimal solutions.

Figure 0-3 Separating surfaces generated by the various units in the 8-4-1 network of Example 3.1. (a-h) Separating surfaces realized by the units in the first hidden layer; (i-l) separating surfaces realized by the units in the second hidden layer; (m) separating surface realized by the output unit.
3.1 Global-Descent-Based Error Backpropagation

Here, a learning method is described in which the gradient-descent rule in batch backprop is replaced with a global-descent rule (Cetin et al., 1993a). This methodology is based on a global optimization scheme, acronymed TRUST (terminal repeller unconstrained subenergy tunneling), that formulates optimization in terms of the flow of a special deterministic dynamical system (Cetin et al., 1993b).(1)

Global descent is a gradient descent on a special criterion function C(w, w*) given by

    C(w, w*) = ln [ 1 / (1 + e^{-(E(w) - E(w*) + a)}) ] - (3k/4) Σ_i (w_i - w_i*)^(4/3) u(E(w) - E(w*))      (3.14)

where w*, with component values w_i*, is a fixed weight vector that can be a local minimum of E(w) or an initial weight state w^0, u[·] is the unit step function, a is a shifting parameter (typically set to 2), and k is a small positive constant. The first term on the right-hand side of Equation (3.14) is a monotonic transformation of the original criterion function (e.g., the SSE criterion may be used) that preserves all critical points of E(w) and has the same relative ordering of the local and global minima of E(w). It also flattens the portion of E(w) above E(w*), with minimal distortion elsewhere. On the other hand, the term Σ_i (w_i - w_i*)^(4/3) is a "repeller term," which gives rise to a convex surface with a unique minimum located at w = w*. The overall effect of this energy transformation is schematically represented for a one-dimensional criterion function in Figure 3-1.

(1) This method was originally designed for implementation in parallel analog VLSI circuitry, allowing implementation in a form whose computational complexity is only weakly dependent on problem dimensionality.

Curs 11 - Training Algorithms for Perceptron Networks
Performing gradient descent on C(w, w*) leads to the global-descent update rule:

    Δw_i = -ρ (∂E(w)/∂w_i) / (1 + e^{E(w) - E(w*) + a}) + k (w_i - w_i*)^(1/3) u(E(w) - E(w*))      (3.15)

The first term on the right-hand side of Equation (3.15) is a subenergy gradient, while the second term is a non-Lipschitzian terminal repeller (Zak, 1989). Upon replacing the gradient descent in Equations (3.2) and (3.4) by Equation (3.15), where w_i represents an arbitrary hidden-unit or output-unit weight, the modified backprop procedure may escape local minima of the original criterion function E(w) given in Equation (3.13). Here, batch training is required because Equation (3.15) necessitates a unique error surface for all patterns.
The update rule in Equation (3.15) automatically switches between two phases: a tunneling phase and a gradient-descent phase. The tunneling phase is characterized by E(w) ≥ E(w*). Since for this condition the subenergy gradient term is nearly zero in the vicinity of the local minimum w*, the terminal repeller term in Equation (3.15) dominates, leading to the dynamical system

    Δw_i = k (w_i - w_i*)^(1/3)                             (3.16)

Figure 3-1 A plot of a one-dimensional criterion function E(w) with a local minimum at w*. The function E(w) - E(w*) is plotted below it, as well as the global-descent criterion function C(w, w*).
This system has an unstable repeller equilibrium point at w_i = w_i*, i.e., at the local minimum of E(w). The "power" of this repeller is determined by the constant k. Thus the dynamical system given by Equation (3.15), when initialized with a small perturbation from w*, is repelled from this local minimum until it reaches a lower-energy region E(w) < E(w*); i.e., tunneling through portions of E(w) where E(w) ≥ E(w*) is accomplished. The second phase is a gradient-descent minimization phase characterized by E(w) < E(w*). Here, the repeller term is identically zero. Thus Equation (3.15) becomes

    Δw_i = -ρ(w) ∂E(w)/∂w_i                                 (3.17)

where ρ(w) is a dynamic learning rate (step size) equal to ρ {1 + exp[E(w) - E(w*) + a]}^(-1). Note that ρ(w) is approximately equal to ρ when E(w*) is large compared to E(w) + a.
Initially, w* is chosen as one corner of a domain in the form of a hyperparallelepiped of dimension J(n+1) + L(J+1), which is the dimension of w in the architecture of Figure 0-1. A slightly perturbed version of w*, namely w* + δw, is taken as the initial state of the dynamical system in Equation (3.15). Here, δw is a small perturbation that drives the system into the domain of interest. If E(w* + δw) < E(w*), the system immediately enters a gradient-descent phase that equilibrates at a local minimum. Every time a new equilibrium is reached, w* is set equal to this equilibrium, and Equation (3.15) is reinitialized with w* + δw, which ensures a necessary consistency in the search flow direction. Since w* is now a local minimum, E(w) ≥ E(w*) holds in the neighborhood of w*. Thus the system enters a repelling (tunneling) phase, and the repeller at w* repels the system until it reaches a lower basin of attraction where E(w) < E(w*). As the dynamical system enters the next basin, the system automatically switches to gradient descent and equilibrates at the next lower local minimum. Then w* is set equal to this new minimum, and the process is repeated. If, on the other hand, E(w* + δw) ≥ E(w*) at the onset of training, then the system is initially in a tunneling phase. The tunneling will proceed to a lower basin, at which point the system enters the minimization phase and follows the behavior discussed above. Training can be stopped when a minimum w* corresponding to E(w*) = 0 is reached or when E(w*) becomes smaller than a preset threshold.
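To make the two-phase behavior concrete, the following one-dimensional sketch applies the update of Equation (3.15) to a toy criterion function with two minima. It is an illustrative addition only: the criterion E, the constants ρ, k, and a, the perturbation, and the stopping rule are all assumptions chosen for the demo, not values from the original text.

import numpy as np

# Toy one-dimensional criterion: a shallow local minimum near w = -1 and the
# global minimum near w = 2 (the -0.2*w tilt makes the right basin the deeper one).
E = lambda w: 0.05 * (w + 1.0) ** 2 * (w - 2.0) ** 2 - 0.2 * w
dE = lambda w: 0.1 * (w + 1.0) * (w - 2.0) * (2.0 * w - 1.0) - 0.2

rho, k, a = 0.05, 0.3, 2.0
w_star = -1.5                 # initial reference state (a "corner" of the search domain)
w = w_star + 1e-3             # slightly perturbed starting point

for _ in range(20000):
    diff = E(w) - E(w_star)
    repel = k * np.cbrt(w - w_star) if diff >= 0.0 else 0.0          # terminal repeller
    subgrad = -rho * dE(w) / (1.0 + np.exp(min(diff + a, 50.0)))     # subenergy gradient
    w = w + subgrad + repel                                          # Equation (3.15)
    if diff < 0.0 and abs(dE(w)) < 1e-3:      # equilibrated at a new, lower minimum
        w_star = w
        if E(w_star) < -0.3:                  # preset stopping threshold for this toy E
            break
        w = w_star + 1e-3                     # reinitialize at w* + dw and tunnel again

print(round(w_star, 2))        # ends near the global minimum, around w = 2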


Figure 3-2 Learning curves for global-descent-based and gradient-descent-based batch backprop for the 4-bit parity problem.

The global-descent method is guaranteed to find the global minimum for functions of one variable, but not for multivariate functions. However, in the multidimensional case, the algorithm will always escape from one local minimum to another with a lower or equal functional value. Figure 3-2 compares the learning curve of global-descent-based backprop with that of batch backprop for the 4-bit parity problem in a feedforward net with four hidden units and a single output unit. The same initial random weights are used in both cases. The figure depicts one tunneling phase for the global-descent algorithm before convergence to a (perfect) global-minimum solution. In performing this simulation, it was found that the choice of the direction of the perturbation vector δw is very critical in regard to reaching a global minimum successfully. On the other hand, batch backprop converges to the first local minimum it reaches. This local solution represents a partial solution to the 4-bit parity problem (i.e., mapping error is present). Simulations using incremental backprop with the same initial weights as in the preceding simulations were also performed but are not shown in the figure. Incremental backprop was able to produce both solutions shown in Figure 3-2; very small learning rates (ρ_o and ρ_h) often lead to imperfect local solutions, while relatively larger learning rates may lead to a perfect solution.
3.2 Backprop Enhancements and Variations
In general, learning with backprop is slow (Huang and Lippmann, 1988). Typically, this is due to the
characteristics of the error surface. The surface is characterized by numerous flat and steep regions.
In addition, it has many troughs that are flat in the direction of search. These characteristics are
particularly pronounced in classification problems, especially when the size of the training set is
small.
Many enhancements of and variations to backprop have been proposed. These are mostly heuristic
modifications with goals of increased speed of convergence, avoidance of local minima, and/or
improvement in the network's ability to generalize. This section presents some common heuristics
that may improve these aspects of backprop learning in multilayer feedforward neural networks.
3.2.1 Weight Initialization
Owing to its gradient-descent nature, backprop is very sensitive to initial conditions. If the choice of the initial weight vector $\mathbf{w}^0$ (here $\mathbf{w}$ is a point in the weight space being searched by backprop) happens to be located within the attraction basin of a strong local-minimum attractor (one where the minimum is at the bottom of a steep-sided valley of the criterion/error surface), then the convergence of backprop will be fast, and the solution quality will be determined by the depth of that valley relative to the depth of the global minimum. On the other hand, backprop converges very slowly if $\mathbf{w}^0$ starts the search in a relatively flat region of the error surface.
An alternative explanation for the sensitivity of backprop to initial weights (as well as to other
learning parameters) is advanced by Kolen and Pollack (1991). Using Monte Carlo simulations on
simple feedforward nets with incremental backprop learning of simple functions, they discovered a
complex fractal-like structure for convergence as a function of initial weights. They reported regions
of high sensitivity in the weight space where two very close initial points can lead to substantially
different learning curves. Thus they hypothesize that these fractal-like structures arise in backprop
due to the nonlinear nature of the dynamic learning equations, which exhibit multiple attractors;
rather than the gradient-descent metaphor with local valleys to get stuck in, they advance a many-
body metaphor where the search trajectory is determined by complex interactions with the system's attractors.
In practice, the weights are normally initialized to small zero-mean random values (Rumelhart et al.,
1986). The motivation for starting from small weights is that large weights tend to prematurely
saturate units in a network and render them insensitive to the learning process (this phenomenon is
known as flat spot). On the other hand, randomness is introduced as a symmetry-breaking
mechanism; it prevents units from adopting similar functions and becoming redundant.
A sensible strategy for choosing the magnitudes of the initial weights for avoiding premature saturation is to choose them such that an arbitrary unit i starts with a small and random weighted sum $net_i$. This may be achieved by setting the initial weights of unit i to be on the order of $1/\sqrt{f_i}$, where $f_i$ is the number of inputs (fan-in) of unit i. It can be easily shown that for zero-mean random uniform weights in $[-r, +r]$, and assuming normalized inputs that are randomly and uniformly distributed in the range [0, 1], $net_i$ has zero mean and has standard deviation $\sigma(net_i) = r\sqrt{f_i}/3$. Thus, by generating uniform random weights within the range $[-3/\sqrt{f_i}, +3/\sqrt{f_i}]$, the input to unit i ($net_i$) is a random variable with zero mean and a standard deviation of unity, as desired.
In simulations involving single-hidden-layer feedforward networks for pattern classification and
function approximation tasks, substantial improvements in backprop convergence speed and
avoidance of "bad" local minima are possible by initializing the hidden unit weight vectors to
normalized vectors selected randomly from the training set (Denoeux and Lengelle, 1993).
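As a minimal illustration of the fan-in rule above, the following NumPy sketch draws the weights of each unit uniformly from $[-3/\sqrt{f_i}, +3/\sqrt{f_i}]$; the 1-8-1 layer sizes, the function name, and the seeds are illustrative assumptions, not part of the original text.

```python
import numpy as np

def init_fanin_uniform(fan_in, fan_out, seed=0):
    """Uniform weights in [-3/sqrt(fan_in), +3/sqrt(fan_in)] so that each unit's
    initial weighted sum net_i has roughly unit standard deviation
    (assuming inputs uniformly distributed in [0, 1])."""
    rng = np.random.default_rng(seed)
    r = 3.0 / np.sqrt(fan_in)
    return rng.uniform(-r, r, size=(fan_out, fan_in))

# example: a 1-8-1 network (one input, eight tanh hidden units, one linear output)
W_hidden = init_fanin_uniform(fan_in=1, fan_out=8, seed=1)
W_output = init_fanin_uniform(fan_in=8, fan_out=1, seed=2)
```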
3.2.2 Learning Rate
The convergence speed of backprop is directly related to the learning rate parameter [$\eta_o$ and $\eta_h$ in Equations (3.2) and (3.9), respectively]; if $\eta$ is small, the search path will closely approximate the gradient path, but convergence will be very slow due to the large number of update steps needed to reach a local minimum. On the other hand, if $\eta$ is large, convergence initially will be very fast, but the algorithm will eventually oscillate and thus not reach a minimum. In general, it is desirable to have large steps when the search point is far away from a minimum, with decreasing step size as the search approaches a minimum. This section gives a sample of the various approaches for selecting the proper learning rate.
One early proposed heuristic (Plaut et al., 1986) is to use constant learning rates that are inversely
proportional to the fan-in of the corresponding units. The increased convergence speed of backprop
as a result of using this method of setting the individual learning rates for each unit inversely
proportional to the number of inputs to that unit has been theoretically justified by analyzing the
eigenvalue distribution of the Hessian matrix $\nabla^2 E$ of the criterion function (Le Cun et al., 1991).
Such learning rate normalization can be thought of intuitively as maintaining balance between the
learning speed of units with different fan-in. Without this normalization, after each learning
iteration, units with high fan-in have their input activity (net) changed by a larger amount than units
with low fan-in. Thus, and due to the nature of the sigmoidal activation function used, the units with
large fan-in tend to commit their output to a saturated state prematurely and are rendered difficult to
adapt. Therefore, normalizing the learning rates of the various units by dividing by their
corresponding fan-in helps speed up learning.
The optimal learning rate for fast convergence of backprop/gradient-descent search is the inverse of
the largest eigenvalue of the Hessian matrix H of the error function E evaluated at the search point
w. Computing the full Hessian matrix is prohibitively expensive for large networks with thousands
of parameters involved. Therefore, finding the largest eigenvalue $\lambda_{\max}$ for speedy convergence seems rather inefficient. However, one may employ a shortcut to efficiently estimate $\lambda_{\max}$ (Le Cun et al., 1993). This shortcut is based on a simple method of approximating the product of H by an arbitrarily chosen (random) vector z through Taylor expansion:

$$\mathbf{H}\mathbf{z} \cong \frac{1}{\alpha}\left[\nabla E(\mathbf{w} + \alpha\mathbf{z}) - \nabla E(\mathbf{w})\right]$$

where $\alpha$ is a small positive constant. Now, using the power method, which amounts to iterating the procedure

$$\mathbf{z} \leftarrow \frac{1}{\alpha}\left[\nabla E\!\left(\mathbf{w} + \alpha\,\frac{\mathbf{z}}{\|\mathbf{z}\|}\right) - \nabla E(\mathbf{w})\right] \cong \mathbf{H}\,\frac{\mathbf{z}}{\|\mathbf{z}\|}$$

the vector z converges to $\lambda_{\max}\mathbf{c}_{\max}$, where $\mathbf{c}_{\max}$ is the normalized eigenvector of H corresponding to $\lambda_{\max}$. Thus the norm of the converged vector z gives a good estimate of $\lambda_{\max}$, and its reciprocal $1/\lambda_{\max}$ may now be used as the learning rate in backprop. An on-line version of this procedure is reported by Le Cun et al. (1993).
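The shortcut can be sketched roughly as follows. Here `grad_E` is a hypothetical function returning the batch gradient $\nabla E(\mathbf{w})$, and the values of `alpha` and the iteration count are illustrative choices rather than values prescribed by Le Cun et al. (1993).

```python
import numpy as np

def estimate_lambda_max(grad_E, w, alpha=1e-4, n_iter=20, seed=0):
    """Estimate the largest Hessian eigenvalue by the power method, using the
    finite-difference product Hz ~ (grad_E(w + alpha*z) - grad_E(w)) / alpha."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(w.shape)
    g0 = grad_E(w)
    for _ in range(n_iter):
        z_unit = z / np.linalg.norm(z)
        z = (grad_E(w + alpha * z_unit) - g0) / alpha   # approximates H @ z_unit
    lam_max = np.linalg.norm(z)       # ||H z_unit|| -> lambda_max
    return lam_max, 1.0 / lam_max     # eigenvalue estimate and suggested learning rate
```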
Many heuristics have been proposed so as to adapt the learning rate automatically. Chan and Fallside (1987) proposed an adaptation rule for $\eta(t)$ that is based on the cosine of the angle between the gradient vectors $\nabla E(t)$ and $\nabla E(t-1)$ (here, t is an integer that represents iteration number). Sutton (1986) presented a method that can increase or decrease $\eta_i(t)$ for each weight $w_i$, according to the number of sign changes observed in the associated partial derivative $\partial E/\partial w_i$. Franzini (1987) investigated a technique that heuristically adjusts $\eta(t)$, increasing it whenever $\nabla E(t)$ is close to $\nabla E(t-1)$ and decreasing it otherwise. Cater (1987) suggested using separate parameters $\eta_k(t)$, one for each pattern $\mathbf{x}^k$.
Silva and Almeida (1990) used a method where the learning rate for a given weight $w_i$ is set to $a\,\eta_i(t-1)$ if $\partial E(t)/\partial w_i$ and $\partial E(t-1)/\partial w_i$ have the same sign, with a > 1; if the partial derivatives have different signs, then a learning rate of $b\,\eta_i(t-1)$ is used, with 0 < b < 1. A similar, theoretically justified method for increasing the convergence speed of incremental gradient-descent search is to set $\eta(t) = \eta(t-1)$ if $\nabla E(t)$ has the same sign as $\nabla E(t-1)$, and $\eta(t) = \eta(t-1)/2$ otherwise (Pflug, 1990).
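A rough sketch of the Silva-Almeida-style sign rule is given below; the growth and shrink factors a = 1.2 and b = 0.5 and the per-weight array representation are assumptions made only for illustration.

```python
import numpy as np

def update_rates(eta, grad, prev_grad, a=1.2, b=0.5):
    """Per-weight learning rates: grow by a where the gradient keeps its sign,
    shrink by b where it changes sign (in the spirit of Silva and Almeida, 1990)."""
    same_sign = grad * prev_grad > 0
    return np.where(same_sign, a * eta, b * eta)

# usage inside a training loop (grad_t, grad_prev are gradient arrays):
# eta = update_rates(eta, grad_t, grad_prev); w -= eta * grad_t
```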
When the input vectors are assumed to be randomly and independently chosen from a probability distribution, we may view incremental backprop as a stochastic gradient-descent algorithm. Thus simply setting the learning rate $\eta$ to a constant results in persistent residual fluctuations around a local minimum $\mathbf{w}^*$. The variance of such fluctuations depends on the size of $\eta$, the criterion function being minimized, and the training set. Based on results from stochastic approximation theory, the "running average" schedule $\eta(t) = \eta_0/(1+t)$, with sufficiently small $\eta_0$, guarantees asymptotic convergence to a local minimum $\mathbf{w}^*$. However, this schedule leads to very slow convergence. Here, one would like to start the search with a learning rate faster than 1/t but then ultimately converge to the 1/t rate as $\mathbf{w}^*$ is approached. Unfortunately, increasing $\eta_0$ can lead to instability for small t. Darken and Moody (1991) proposed the "search then converge" schedule $\eta(t) = \eta_0/(1 + t/\tau)$, which allows for faster convergence without compromising stability. In this schedule, the learning rate stays relatively high for a "search time" $\tau$ during which it is hoped that the weights will hover about a good minimum. Then, for times $t \gg \tau$, the learning rate decreases as $\eta_0\tau/t$, and the learning converges. Note that for $\tau = 1$, this schedule reduces to the running average schedule. Therefore, a procedure for optimizing $\tau$ is needed. A completely automatic "search then converge" schedule can be found in Darken and Moody (1992).
3.2.3 Momentum
Another simple approach to speed up backprop is through the addition of a momentum term (Plaut et al., 1986) to the right-hand side of the weight update rules in Equations (3.2) and (3.9). Here, each weight change $\Delta w_i$ is given some momentum so that it accelerates in the average downhill direction instead of fluctuating with every change in the sign of the associated partial derivative $\partial E(t)/\partial w_i$. The addition of momentum to gradient search is stated formally as

$$\Delta w_i(t) = -\eta\,\frac{\partial E}{\partial w_i(t)} + \alpha\,\Delta w_i(t-1) \qquad (3.18)$$

where $\alpha$ is a momentum rate normally chosen between 0 and 1 and $\Delta w_i(t-1) = w_i(t-1) - w_i(t-2)$.
Equation (3.18) is a special case of multistage gradient methods that have been proposed for
accelerating convergence and escaping local minima.
The momentum term also can be viewed as a way of increasing the effective learning rate in almost-flat regions of the error surface while maintaining a learning rate close to $\eta$ (here $0 < \alpha < 1$) in regions with high fluctuations. This can be seen by employing an N-step recursion and writing Equation (3.18) as

$$\Delta w_i(t) = \alpha^{N}\,\Delta w_i(t-N) - \eta\sum_{n=0}^{N-1}\alpha^{n}\,\frac{\partial E}{\partial w_i(t-n)} \qquad (3.19)$$

If the search point is caught in a flat region, then $\partial E/\partial w_i$ will be about the same at each time step, and Equation (3.19) can be approximated as (with $0 < \alpha < 1$ and N large)

$$\Delta w_i(t) \cong -\eta\sum_{n=0}^{N-1}\alpha^{n}\,\frac{\partial E}{\partial w_i} \cong -\frac{\eta}{1-\alpha}\,\frac{\partial E}{\partial w_i} \qquad (3.20)$$

Thus, for flat regions, a momentum term leads to increasing the learning rate by a factor of $1/(1-\alpha)$. On the other hand, if the search point is in a region of high fluctuation, the weight change will not gain momentum; i.e., the momentum effect vanishes. An empirical study of the effects of $\eta$ and $\alpha$ on the convergence of backprop and on its learning curve can be found in Tollenaere (1990).
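For a whole weight vector, Equation (3.18) amounts to the following sketch, where `grad` is assumed to hold the current batch gradient $\partial E/\partial \mathbf{w}$ and the values of $\eta$ and $\alpha$ are arbitrary examples.

```python
import numpy as np

def momentum_step(w, delta_prev, grad, eta=0.1, alpha=0.9):
    """One gradient-descent step with momentum, Equation (3.18):
    delta = -eta * grad + alpha * delta_prev."""
    delta = -eta * grad + alpha * delta_prev
    return w + delta, delta   # new weights and stored Delta w(t) for the next step
```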
Adaptive momentum rates also may be employed. Fahlman (1989) proposed and extensively simulated a heuristic variation of backprop, called quickprop, that employs a dynamic momentum rate given by

$$\alpha(t) = \frac{\dfrac{\partial E}{\partial w_i}(t)}{\dfrac{\partial E}{\partial w_i}(t-1) - \dfrac{\partial E}{\partial w_i}(t)} \qquad (3.21)$$

With this adaptive $\alpha(t)$ substituted in Equation (3.18), if the current slope is persistently smaller than the previous one but has the same sign, then $\alpha(t)$ is positive, and the weight change will accelerate. Here, the acceleration rate is determined by the magnitude of successive differences between slope values. If the current slope is in the opposite direction from the previous one, it signals that the weights are crossing over a minimum. In this case, $\alpha(t)$ has a negative sign, and the weight change starts to decelerate. Additional heuristics are used to handle the undesirable case where the current slope is in the same direction as the previous one but has the same or larger magnitude; otherwise, this scenario would lead to taking an infinite step or moving the search point backwards, up the current slope and toward a local maximum. Substituting Equation (3.21) in Equation (3.18) leads to the update rule

$$\Delta w_i(t) = -\eta\,\frac{\partial E}{\partial w_i}(t) + \frac{\dfrac{\partial E}{\partial w_i}(t)}{\dfrac{\partial E}{\partial w_i}(t-1) - \dfrac{\partial E}{\partial w_i}(t)}\,\Delta w_i(t-1) \qquad (3.22)$$

It is interesting to note that Equation (3.22) corresponds to steepest gradient-descent-based adaptation with a dynamically changing effective learning rate $\eta(t)$. This learning rate is given by the sum of the original constant learning rate $\eta$ and the reciprocal of the denominator of the second term in the right-hand side of Equation (3.22).
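A per-weight sketch of the quickprop-style update of Equation (3.22) follows; clipping the momentum factor at a "maximum growth factor" (1.75 here) stands in for the additional heuristics mentioned above and is an assumption of this sketch, not a full reproduction of Fahlman's algorithm.

```python
import numpy as np

def quickprop_step(w, delta_prev, grad, grad_prev, eta=0.01, max_factor=1.75):
    """Equation (3.22): Delta w = -eta*grad + [grad / (grad_prev - grad)] * delta_prev,
    with the momentum factor clipped to avoid infinite or uphill steps."""
    denom = grad_prev - grad
    safe = np.where(np.abs(denom) > 1e-12, denom, 1e-12)   # avoid division by zero
    factor = np.clip(grad / safe, -max_factor, max_factor)
    delta = -eta * grad + factor * delta_prev
    return w + delta, delta
```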
The use of error gradient information at two consecutive time steps in Equation (3.21) to improve convergence speed can be justified as being based on approximations of second-order search methods such as Newton's method. Newton's method is based on a quadratic model $\hat{E}(\mathbf{w})$ of the criterion E(w) and hence uses only the first three terms in a Taylor series expansion of E about the "current" weight vector $\mathbf{w}^c$:

$$\hat{E}(\mathbf{w}) = E(\mathbf{w}^c) + (\mathbf{w} - \mathbf{w}^c)^{\mathrm{T}}\,\nabla E(\mathbf{w}^c) + \frac{1}{2}\,(\mathbf{w} - \mathbf{w}^c)^{\mathrm{T}}\,\nabla^{2}E(\mathbf{w}^c)\,(\mathbf{w} - \mathbf{w}^c)$$

This quadratic function is minimized by solving the equation $\nabla E(\mathbf{w}^c + \Delta\mathbf{w}) = \mathbf{0}$, which leads to Newton's method:

$$\Delta\mathbf{w} = -\left[\nabla^{2}E(\mathbf{w}^c)\right]^{-1}\nabla E(\mathbf{w}^c) = -\mathbf{H}^{-1}\,\nabla E(\mathbf{w}^c)$$

Here, H is the Hessian matrix with components $H_{ij} = \dfrac{\partial^{2}E}{\partial w_i\,\partial w_j}$.
Newton's algorithm iteratively computes the weight changes $\Delta\mathbf{w}$ and works well when initialized within a convex region of E. In fact, the algorithm converges quickly if the search region is quadratic or nearly so. However, this method is very computationally expensive, since the computation of $\mathbf{H}^{-1}$ requires $O(N^{3})$ operations at each iteration (here, N is the dimension of the search space). Several authors have suggested computationally efficient ways of approximating Newton's method (e.g., Becker and Le Cun, 1989). Becker and Le Cun proposed an approach whereby the off-diagonal elements of H are neglected, thus arriving at the approximation

$$\Delta w_i = -\left(\frac{\partial^{2}E}{\partial w_i^{2}}\right)^{-1}\frac{\partial E}{\partial w_i} \qquad (3.23)$$

which is a "decoupled" form of Newton's rule where each weight is updated separately. The second term in the right-hand side of Equation (3.22) can now be viewed as an approximation of Newton's rule, since its denominator is a crude approximation of the second derivative of E at step t. In fact, this suggests that the weight update rule in Equation (3.22) may be used with $\eta = 0$.
As with Equation (3.22), special heuristics must be used in order to prevent the search from moving in the wrong gradient direction and in order to deal with regions of very small curvature, such as inflection points and plateaus, which cause $\Delta w_i$ in Equation (3.23) to blow up. A simple solution is to replace the $\dfrac{\partial^{2}E}{\partial w_i^{2}}$ term in Equation (3.23) by $\dfrac{\partial^{2}E}{\partial w_i^{2}} + \varepsilon$, where $\varepsilon$ is a small positive constant. The
approximate Newton method just described is capable of scaling the descent step in each direction.
However, because it neglects off-diagonal Hessian terms, it is not able to rotate the search direction
as in the exact Newton's method. Thus this approximate rule is only efficient if the directions of
maximal and minimal curvature of E happen to be aligned with the weight space axes. Bishop
(1992) reported a somewhat efficient technique for computing the elements of the Hessian matrix
exactly using multiple feedforward propagation through the network followed by multiple backward
propagation.
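The "decoupled" Newton step of Equation (3.23), with the $\varepsilon$-regularized curvature just described, can be sketched as follows; `grad` and `diag_hessian` are assumed to be arrays holding $\partial E/\partial w_i$ and $\partial^2 E/\partial w_i^2$ (the latter obtained, for example, by the Becker-Le Cun approximation).

```python
import numpy as np

def diagonal_newton_step(w, grad, diag_hessian, eps=1e-3):
    """Equation (3.23) with regularized curvature:
    delta_i = -grad_i / (d2E/dw_i^2 + eps)."""
    return w - grad / (diag_hessian + eps)
```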
Another approach for deriving theoretically justifiable update schedules for the momentum rate $\alpha$ in Equation (3.18) is to adjust $\alpha(t)$ at each update step such that the gradient-descent search direction is "locally" optimal. In optimal steepest descent (also known as best-step steepest descent), the learning rate is set at time t such that it minimizes the criterion function E at time step t + 1; i.e., we desire an $\eta$ that minimizes $E(t+1) = E\left\{\mathbf{w}(t) - \eta\,\nabla E\left[\mathbf{w}(t)\right]\right\}$. Unfortunately, this optimal learning step is impractical because it requires computation of the Hessian $\nabla^{2}E$ at each time step. However, one may still use some of the properties of the optimal $\eta$ in order to accelerate the search, as demonstrated next.

When such an optimal $\eta(t)$ is used, the necessary condition for minimizing $E\left[\mathbf{w}(t+1)\right]$ is

$$\frac{\partial E\left[\mathbf{w}(t+1)\right]}{\partial\eta} = \left[\frac{\partial E(t+1)}{\partial\mathbf{w}(t+1)}\right]^{\mathrm{T}}\frac{\partial\mathbf{w}(t+1)}{\partial\eta} = -\nabla E(t+1)^{\mathrm{T}}\,\nabla E(t) = 0 \qquad (3.24)$$
This implies that the search directions in two successive steps of optimal steepest descent are orthogonal. The easiest method to enforce this orthogonality requirement is the Gram-Schmidt orthogonalization method. Suppose that the search direction at time t-1 is known, denoted $\mathbf{d}(t-1)$, and that the "exact" gradient $\nabla E(t)$ (used in batch backprop) can be computed at time step t [to simplify notation, we write $\nabla E\left[\mathbf{w}(t)\right]$ as $\nabla E(t)$]. Now, we can satisfy the condition of orthogonal consecutive search directions by computing a new search direction, employing Gram-Schmidt orthogonalization:

$$\mathbf{d}(t) = -\nabla E(t) + \frac{\nabla E(t)^{\mathrm{T}}\,\mathbf{d}(t-1)}{\mathbf{d}(t-1)^{\mathrm{T}}\,\mathbf{d}(t-1)}\,\mathbf{d}(t-1) \qquad (3.25)$$

Performing descent search in the direction d(t) in Equation (3.25) leads to the weight vector update rule

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\,\nabla E(t) + \frac{\nabla E(t)^{\mathrm{T}}\,\mathbf{d}(t-1)}{\|\mathbf{d}(t-1)\|^{2}}\,\Delta\mathbf{w}(t-1) \qquad (3.26)$$

where the relation $\Delta\mathbf{w}(t-1) = \mathbf{w}(t) - \mathbf{w}(t-1) = \eta\,\mathbf{d}(t-1)$ has been used. Comparing the component-wise weight update version of Equation (3.26) with Equation (3.18) reveals another adaptive momentum rate given by

$$\alpha(t) = \frac{\nabla E(t)^{\mathrm{T}}\,\mathbf{d}(t-1)}{\|\mathbf{d}(t-1)\|^{2}} = \eta\,\frac{\nabla E(t)^{\mathrm{T}}\,\Delta\mathbf{w}(t-1)}{\|\Delta\mathbf{w}(t-1)\|^{2}}$$
Another similar approach is to set the current search direction d(t) to be a compromise between the current "exact" gradient $\nabla E(t)$ and the previous search direction $\mathbf{d}(t-1)$; i.e., $\mathbf{d}(t) = -\nabla E(t) + \beta\,\mathbf{d}(t-1)$, with $\mathbf{d}(0) = -\nabla E(0)$. This is the basis for the conjugate gradient method, in which the search direction is chosen (by appropriately setting $\beta$) so that it distorts as little as possible the minimization achieved by the previous search step. Here, the current search direction is chosen to be conjugate (with respect to H) to the previous search direction. Analytically, we require $\mathbf{d}(t)^{\mathrm{T}}\,\mathbf{H}(t-1)\,\mathbf{d}(t-1) = 0$, where the Hessian $\mathbf{H}(t-1)$ is assumed to be positive definite. In practice, $\beta$, which plays the role of an adaptive momentum, is chosen according to the Polak-Ribiere rule (Polak and Ribiere, 1969):

$$\beta(t) = \frac{\left[\nabla E(t) - \nabla E(t-1)\right]^{\mathrm{T}}\,\nabla E(t)}{\left\|\nabla E(t-1)\right\|^{2}}$$

Thus the search direction in the conjugate gradient method at time t is given by

$$\mathbf{d}(t) = -\nabla E(t) + \beta(t)\,\mathbf{d}(t-1) = -\nabla E(t) + \frac{\left[\nabla E(t) - \nabla E(t-1)\right]^{\mathrm{T}}\,\nabla E(t)}{\left\|\nabla E(t-1)\right\|^{2}}\,\mathbf{d}(t-1)$$

Now, using $\mathbf{d}(t-1) = \left(1/\eta\right)\Delta\mathbf{w}(t-1)$ and substituting the preceding expression for d(t) in $\Delta\mathbf{w}(t) = \eta\,\mathbf{d}(t)$ leads to the weight update rule:

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\,\nabla E(t) + \frac{\left[\nabla E(t) - \nabla E(t-1)\right]^{\mathrm{T}}\,\nabla E(t)}{\left\|\nabla E(t-1)\right\|^{2}}\,\Delta\mathbf{w}(t-1)$$
When E is quadratic, the conjugate gradient method theoretically converges in N or fewer iterations.
In general, E is not quadratic, and therefore, this method would be slower than what the theory
predicts. However, it is reasonable to assume that E is approximately quadratic near a local
minimum. Therefore, conjugate gradient descent is expected to accelerate the convergence of
backprop once the search enters a small neighborhood of a local minimum. As a general note, the
basic idea of conjugate gradient search was introduced by Hestenes and Stiefel (1952). Beckman
(1964) gives a good account of this method. van der Smagt (1994) gave additional characterization
of second-order backprop (such as conjugate gradient-based backprop) from the point of view of
optimization. The conjugate gradient method has been applied to multilayer feedforward neural net
training (van der Smagt, 1994) and is shown to outperform backprop in speed of convergence.
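A compact sketch of conjugate gradient-based batch descent with the Polak-Ribiere rule is shown below; `grad_E` is a hypothetical batch-gradient function, and the fixed step size $\eta$ replaces the line search that a complete conjugate gradient implementation would normally perform.

```python
import numpy as np

def conjugate_gradient_backprop(grad_E, w, eta=0.05, n_steps=100):
    """Batch descent along Polak-Ribiere conjugate directions:
    d(t) = -grad(t) + beta(t) * d(t-1),
    beta(t) = (grad(t) - grad(t-1))^T grad(t) / ||grad(t-1)||^2."""
    g_prev = grad_E(w)
    d = -g_prev
    for _ in range(n_steps):
        w = w + eta * d
        g = grad_E(w)
        beta = float((g - g_prev) @ g) / float(g_prev @ g_prev)
        d = -g + beta * d
        g_prev = g
    return w
```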
It is important to note that the preceding second-order modifications to backprop improve the speed
of convergence of the weights to the "closest" local minimum. This faster convergence to local
minima is the direct result of employing a better search direction as compared with incremental
backprop. On the other hand, the stochastic nature of the search directions of incremental backprop
and its fixed learning rates can be an advantage, since they allow the search to escape shallow local
minima, which generally leads to better solution quality. These observations suggest the use of
hybrid learning algorithms, where one starts with incremental backprop and then switches to
conjugate gradient-based backprop for the final convergence phase. This hybrid method has its roots
in a technique from numerical analysis known as Levenberg-Marquardt optimization (Press et al.,
1986).
As a historical note, it should be mentioned that the concept of gradient descent was first introduced
by Cauchy for use in the solution of simultaneous equations; the method has enjoyed popularity ever
since. For a good survey of gradient search, the reader is referred to the book by Polyak (1987).

3.2.4 Weight Decay, Weight Elimination, and Unit Elimination
It can be shown that in order to guarantee good generalization, the number of degrees of freedom or number of weights (which determines a network's complexity) must be considerably smaller than the amount of information available for training. Some insight into this matter can be gained from considering an analogous problem in curve fitting (Duda and Hart, 1973). For example, consider the rational function

$$g(x) = \frac{(x-2)(2x+1)}{1+x^{2}}$$

which is plotted in Figure 3-1 (solid line), and assume that we are given a set of 15 samples (shown as small circles) from which we are to find a "good" approximation to g(x). Two polynomial approximations are shown in Figure 3-1: an eleventh-order polynomial (dashed line) and an eighth-order polynomial (dotted line). These approximations are computed by minimizing the SSE criterion over the sample points. The higher-order polynomial has about the same number of parameters as the number of training samples and thus is shown to give a very close fit to the data; this is referred to as memorization. However, it is clear from the figure that this polynomial does not provide good "generalization" (i.e., it does not provide reliable interpolation and/or extrapolation) over the full range of data. On the other hand, fitting the data by an eighth-order polynomial leads to
relatively better overall interpolations over a wider range of x values (refer to the dotted line in
Figure 3-1).
In this case, the number of free parameters is equal to 9, which is smaller than the number of training samples. This "overdetermined" nature of the fitting problem leads to an approximation function that better matches the
"smooth" function g(x) being approximated. Trying to use a yet lower-order polynomial (e.g., fifth
order or less) leads to a poor approximation because this polynomial would not have sufficient
"flexibility" to capture the nonlinear structure in g(x).
The reader is advised to consider the nature and complexity of this simple approximation problem by
carefully studying Figure 3-1. Here, the total number of possible
training samples of the form (x, g(x)) is uncountably infinite. From this huge set of potential data,
though, we chose only 15 samples to try to approximate the function. In this case, the approximation
involved minimizing an SSE criterion function over these few sample points.
Clearly, however, a solution that is globally (or near globally) optimal in terms of sum-squared error
over the training set (e.g., the eleventh-order polynomial) may be hardly appropriate in terms of
interpolation (generalization) between data points.
Figure 3-1 Polynomial approximation for the function $g(x) = (x-2)(2x+1)/(1+x^{2})$ (solid line) based on the 15 samples shown (small circles). The objective of the approximation is to minimize the sum of squared error criterion. The dashed line represents an eleventh-order polynomial. A better overall approximation for g(x) is given by an eighth-order polynomial (dotted line).

Figure 3-2 Neural network approximation for the function $g(x) = (x-2)(2x+1)/(1+x^{2})$ (solid line). The dotted line was generated by a 3-hidden-unit feedforward net. The dashed line, which is shown to have substantial overlap with g(x), was generated by a 12-hidden-unit feedforward net. In both cases, standard incremental backprop training was used.
Thus one should choose a class of approximation functions that penalizes unnecessary fluctuations between training sample points. Neural networks satisfy this approximation property and are thus superior to polynomials in approximating arbitrary nonlinear functions from sample points (see further discussion given below). Figure 3-2 shows the results of simulations involving the approximation of the function g(x), with the same set of samples used in the preceding simulations, using single-hidden-layer feedforward neural nets. Here, all hidden units employ the hyperbolic tangent activation function (with a slope of 1), and the output unit is linear. These nets are trained using the incremental backprop algorithm [given by Equations (3.2) and (3.9)] with $\eta_o = 0.05$ and $\eta_h = 0.01$. Weights are initialized randomly and uniformly over the range $[-0.2, +0.2]$. The training was stopped when the rate of change of the SSE became insignificantly small. The dotted line in Figure 3-2 is for a net with three hidden units (which amounts to 10 degrees of freedom). Surprisingly, increasing the number of hidden units to 12 (37 degrees of freedom) improved the quality of the fit, as shown by the dashed line in the figure. By comparing Figure 3-1 and Figure 3-2, it is clear that the neural net approximation for g(x) is superior to that of polynomials in terms of accurate interpolation and extrapolation.
The generalization superiority of the neural net can be attributed to the bounded and smooth nature
of the hidden-unit responses as compared with the potentially divergent nature of polynomials. The
bounded-unit response localizes the nonlinear effects of individual hidden units in a neural network
and allows for the approximations in different regions of the input space to be independently tuned.
This approximation process is similar in its philosophy to the traditional spline technique for curve
fitting. Hornik et al. (1990) gave related theoretical justification for the usefulness of feedforward
neural nets with sigmoidal hidden units in function approximation. They showed that in addition to
approximating the training set, the derivative of the output of the network evaluated at the training
data points is also a good approximation of the derivative of the unknown function being
approximated.[1] This result explains the good extrapolation capability of neural nets observed in simulations. For example, the behavior of the neural net output shown in Figure 3-2 for x < -5 and x > 10 is a case in point. It should be noted, though, that in most practical situations the training data are noisy. Hence an exact fit of these data must be avoided, which means that the degrees of freedom of a neural net approximator must be constrained. Otherwise, the net will have a tendency for overfitting. This issue is explored next.

[1] In addition, Hornik et al. (1990) showed that a multilayer feedforward network can approximate functions that are not differentiable in the classical sense but possess a generalized derivative, as in the case of piecewise differentiable functions and functions with discontinuities. For example, a neural net with one hidden unit that employs the hyperbolic tangent activation function and a linear output unit can approximate very accurately the discontinuous function $g(x) = a\,\mathrm{sgn}(x - b) + c$.
Once a particular approximation function or network architecture is decided on, generalization can
be improved if the number of free parameters in the net is optimized. Since it is difficult to estimate
the optimal number of weights (or units) a priori, there has been much interest in techniques that
automatically remove excess weights and/or units from a network. These techniques are sometimes
referred to as network pruning algorithms and are surveyed in Reed (1993).
One of the earliest and simplest approaches to remove excess degrees of freedom from a neural
network is through the use of simple weight decay (Plaut et al., 1986), in which each weight decays
toward zero at a rate proportional to its magnitude so that connections disappear unless reinforced.
Hinton (1987) gave empirical justification by showing that such weight decay improves
generalization in feedforward networks. Krogh and Hertz (1992) gave some theoretical justification
for this generalization phenomenon.
Weight decay in the weight update equations of backprop can be accounted for by adding a complexity (regularization) term to the criterion function E that penalizes large weights:

$$J(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_{i} w_i^{2} \qquad (3.29)$$

Here, $\lambda$ represents the relative importance of the complexity term with respect to the error term E(w). Now, gradient search for minima of J(w) leads to the following weight update rule:

$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} - \eta\lambda\, w_i \qquad (3.30)$$

which shows an exponential decay in $w_i$ if no learning occurs. Because it penalizes more weights than necessary, the criterion function in Equation (3.29) overly discourages the use of large weights, since a single large weight costs much more than many small ones. Weigend et al. (1991) proposed a procedure of weight elimination given by minimizing

$$J(\mathbf{w}) = E(\mathbf{w}) + \lambda\sum_{i}\frac{w_i^{2}/w_0^{2}}{1 + w_i^{2}/w_0^{2}} \qquad (3.31)$$

where the penalty term on the right-hand side helps regulate weight magnitudes and $w_0$ is a positive free parameter that must be determined. For large $w_0$, this procedure reduces to the weight decay procedure described above and hence favors many small weights, whereas if $w_0$ is small, fewer large weights are favored. Also note that when $w_i \gg w_0$, the cost of the weight approaches 1 (times $\lambda$), which justifies interpretation of the penalty term as a counter of large weights. In practice, a $w_0$ close to unity is used. It should be noted that this weight elimination procedure is very sensitive to the choice of $\lambda$. A heuristic for adjusting $\lambda$ dynamically during learning is described in Weigend et al. (1991).
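The penalties of Equations (3.29) and (3.31) enter backprop only through an extra term added to the error gradient; a sketch of that extra term for both cases follows, with $\lambda$ and $w_0$ as the free parameters discussed above.

```python
import numpy as np

def weight_decay_grad(w, lam):
    """Gradient of (lam/2) * sum(w_i^2): simple decay toward zero, as in Eq. (3.30)."""
    return lam * w

def weight_elimination_grad(w, lam, w0=1.0):
    """Gradient of lam * sum((w_i/w0)^2 / (1 + (w_i/w0)^2)), Weigend et al. (1991)."""
    u = (w / w0) ** 2
    return lam * (2.0 * w / w0**2) / (1.0 + u) ** 2

# inside backprop:  w -= eta * (dE_dw + weight_elimination_grad(w, lam))
```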
The preceding ideas have been extended to unit elimination. Here, one would start with an excess of hidden units and dynamically discard redundant ones. As an example, one could penalize redundant units by replacing the weight-decay term in Equation (3.30) by $-\eta_h\lambda\, w_{ji}\Big/\Big(1 + \sum_i w_{ji}^{2}\Big)^{2}$ for all weights of hidden units, which leads to the hidden-unit update rule

$$\Delta w_{ji} = -\eta_h\,\frac{\partial E}{\partial w_{ji}} - \eta_h\lambda\,\frac{w_{ji}}{\left(1 + \sum_i w_{ji}^{2}\right)^{2}} \qquad (3.32)$$
Generalization in feedforward networks also can be improved by using network construction
procedures as opposed to weight or unit pruning. Here, one starts with a small network and allows it
to grow gradually (add more units) in response to incoming data. Further details on network
construction procedures can be found in Marchand et al. (1990).
3.2.5 Cross-Validation
An alternative or complementary strategy to the preceding methods for improving generalization in
feedforward neural networks is suggested by findings based on empirical results (e.g. Weigend et al.,
1991). In simulations involving backprop training of feedforward nets on noisy data, it is found that
the validation (generalization) error decreases monotonically to a minimum but then starts to
increase, even as the training error continues to decrease. This phenomenon is depicted in the
conceptual plot in Figure 3-3 and is illustrated through the computer
simulation given next.
Figure 3-3 Training error and validation error encountered in training multilayer feedforward neural nets using backprop.

Consider the problem of approximating the rational function g(x) plotted in Figure 3-2 from a set of noisy sample points. This set of points is generated from the 15 perfect samples, shown in Figure 3-2, by adding zero-mean normally distributed random noise with variance of 0.25. A single-hidden-layer feedforward neural net is used with 12 sigmoidal hidden units and a single linear output unit. It employs incremental backprop training with $\eta_o = 0.05$, $\eta_h = 0.01$, and initial random weights in $[-0.2, +0.2]$. After 80 training cycles on the 15 noisy samples, the net is tested for uniformly sampled inputs x in the range $[-8, 12]$. The output of this 80-cycle net is shown as a dashed line in Figure 3-4. Next, the training continued and then stopped after 10,000 cycles. The output of the resulting net is shown as a dotted line in the figure.
Comparing the two approximations in Figure 3-4 leads to the conclusion that the partially trained net is superior to the excessively trained net in terms of overall interpolation and extrapolation capabilities. Further insight into the dynamics of the generalization process for this problem can be gained from Figure 3-5. Here, the validation RMS error is monitored by testing the net on a validation set of 294 perfect samples, uniformly spaced in the interval $[-8, 12]$, after every 10 training cycles. This validation error is shown as the dashed line in Figure 3-5. The training error (RMS error on the training set of 15 points) is also shown in the figure as a solid line. Note that the optimal net in terms of overall generalization capability is the one obtained after about 80 to 90 training cycles.[2] Beyond this training point, the training error keeps decreasing, while the validation error increases. It is interesting to note the nonmonotonic behavior of the validation error between training cycles 2000 and 7000. This suggests that, in general, multiple local minima may exist in the validation error curves of backprop-trained feedforward neural networks. The location of these minima is a complex function of the network size, weight initialization, and learning parameters. To summarize, when training with noisy data, excessive training usually leads to overfitting.

[2] In general, the global minimum of the validation RMS error curve, which is being used to determine the optimally trained net, is also a function of the number of hidden units/weights in the net. Therefore, the possibility exists for a better approximation than the one obtained here, though, in order to find such an approximation, one would need to repeat the preceding simulations for various numbers of hidden units. The reader should be warned that this example is for illustrative purposes only. For example, if one has 294 (or fewer) "perfect" samples, one would not have to worry about validation; one would just train on the available perfect data! In practice, the validation set is noisy, usually having the same noise statistics as the training set. Also, in practice, the size of the validation set is smaller than that of the training set. Thus the expected generalization capability of what is referred to as the "optimally trained" net may not be as good as those reported here.

Figure 3-4 Two different neural network approximations (dashed and dotted lines) of the rational function g(x) (solid line) from noisy samples. The training samples shown are generated from the 15 perfect samples in Figure 3-2 by adding zero-mean normally distributed random noise with 0.25 variance. Both approximations resulted from the same net with 12 hidden units and incremental backprop learning. The dashed line represents the output of the net after 80 learning cycles. After completing 10,000 learning cycles, the same net generates the dotted line output.
On the other hand, partial training may lead to a better approximation of the unknown function in the
sense of improved interpolation and, possibly, improved extrapolation.

Figure 3-5 Training and validation RMS errors for the neural net approximation of the function g(x). The training set consists of the 15 noisy samples in Figure 3-4. The validation set consists of 294 perfect samples uniformly spaced in the interval $[-8, 12]$. The validation error starts lower than the training error mainly because perfect samples are used for validation.
A qualitative explanation for the generalization phenomenon depicted in Figure 3-3 (and illustrated by the simulation in Figure 3-4 and Figure 3-5) was advanced by Weigend et al. (1991). They explain
that, to a first approximation, backprop initially adapts the hidden units in the network such that they
all attempt to fit the major features of the data. Later, as training proceeds, some of the units then
start to fit the noise in the data. This later process continues as long as there is error and as long as
training continues (this is exactly what happens in the simulation of Figure 3-4). The overall process suggests that the effective number of free parameters (weights) starts
small (even if the network is oversized) and gets larger, approaching the true number of adjustable
parameters in the network as training proceeds. Baldi and Chauvin (1991) derived analytical results
on the behavior of the validation error in LMS-trained single-layer feedforward networks learning
the identity map from noisy autoassociation pattern pairs. Their results agree with the preceding
generalization phenomenon in nonlinear multilayer feedforward nets. More recently, Wang et al.
(1994) gave a formal justification for the phenomenon of improved generalization by stopping
learning before the global minimum of the training error is reached. They showed that there exists a
critical region in the training process where the trained network generalizes best, and after that the
generalization error will increase. In this critical region, as long as the network is large enough to
learn the examples, the size of the network has only a small effect on the best generalization
performance of the network. All this means that stopping learning before reaching the global minimum of the training error has the effect of network size (that is, network complexity) selection.
Therefore, a suitable strategy for improving generalization in networks of non-optimal size is to
avoid "overtraining" by carefully monitoring the evolution of the validation error during training and
stopping just before it starts to increase. This strategy is based on one of the early criteria in model
evaluation known as cross-validation. Here, the whole available data set is split into three parts:
training set, validation set, and prediction set. The training set is used to determine the values of the
weights of the network. The validation set is used for deciding when to terminate training. Training
continues as long as the performance on the validation set keeps improving. When it ceases to
improve, training is stopped. The third part of the data, the prediction set, is used to estimate the
expected performance (generalization) of the trained network on new data. In particular, the
prediction set should not be used for validation during the training phase. Note that this heuristic
requires the application to be data-rich. Some applications, though, suffer from scarcity of training
data, which makes this method inappropriate. The reader is referred to Finnoff et al. (1993) for an
empirical study of cross-validation-based generalization and its comparison to weight decay and
other generalization-inducing methods.
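A minimal early-stopping loop matching this strategy is sketched below; `train_one_cycle` and `validation_error` are hypothetical callbacks supplied by the user, and the `patience` parameter (how many validation checks to tolerate without improvement) is an assumption of the sketch rather than part of the classical cross-validation recipe.

```python
import copy

def train_with_early_stopping(net, train_one_cycle, validation_error,
                              max_cycles=10000, check_every=10, patience=20):
    """Stop training when the validation error stops improving; return the best net."""
    best_err, best_net, waited = float("inf"), copy.deepcopy(net), 0
    for cycle in range(1, max_cycles + 1):
        train_one_cycle(net)
        if cycle % check_every == 0:
            err = validation_error(net)
            if err < best_err:
                best_err, best_net, waited = err, copy.deepcopy(net), 0
            else:
                waited += 1
                if waited >= patience:
                    break
    return best_net, best_err
```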
3.2.6 Criterion Functions
As seen earlier in Section 2.2.5, other criterion/error functions can be used from which new versions of the backprop weight update rules can be derived. Here we consider two such criterion functions: (1) relative entropy and (2) Minkowski-r. Starting from the instantaneous entropy criterion (Baum and Wilczek, 1988)

$$E(\mathbf{w}) = \frac{1}{2}\sum_{l=1}^{L}\left[\left(1+d_l\right)\ln\frac{1+d_l}{1+y_l} + \left(1-d_l\right)\ln\frac{1-d_l}{1-y_l}\right] \qquad (3.33)$$

and employing gradient-descent search, the following learning equations are obtained:

$$\Delta w_{lj} = \eta_o\left(d_l - y_l\right)z_j \qquad (3.34)$$

and

$$\Delta w_{ji} = \eta_h\left[\sum_{l=1}^{L}\left(d_l - y_l\right)w_{lj}\right]\left(1 - z_j^{2}\right)x_i \qquad (3.35)$$

for the output- and hidden-layer units, respectively. Equations (3.34) and (3.35) assume hyperbolic tangent activations at both layers.
From Equation (3.34) we see that the $f'_o$ term present in the corresponding equation of standard backprop [Equation (3.2)] has now been eliminated. Thus the output units do not have a flat-spot problem; on the other hand, $f'_h$ still appears in Equation (3.35) for the hidden units [this derivative appears implicitly as the $(1 - z_j^{2})$ term in the standard backprop equation]. Therefore, the flat-spot problem is only partially solved by employing the entropy criterion.
The entropy-based backprop is well suited to probabilistic training data. It has a natural
interpretation in terms of learning the correct probabilities of a set of hypotheses represented by the
outputs of units in a multilayer neural network. Here, the probability that the lth hypothesis is true
given an input pattern $\mathbf{x}^{k}$ is determined by the output of the lth unit as $\frac{1}{2}\left(1 + y_l\right)$. The entropy
criterion is a "well-formed" error function; the reader is referred to Section 2.2.5 for a definition and
discussion of "well-formed" error functions. Such functions have been shown in simulations to
converge faster than standard backprop.
Another choice is the Minkowski-r criterion function (Hanson and Burr, 1988):

$$E(\mathbf{w}) = \frac{1}{r}\sum_{l=1}^{L}\left|d_l - y_l\right|^{r} \qquad (3.36)$$

which leads to the following weight update equations:

$$\Delta w_{lj} = \eta_o\left|d_l - y_l\right|^{r-1}\operatorname{sgn}\left(d_l - y_l\right)f'_o(net_l)\,z_j \qquad (3.37)$$

and

$$\Delta w_{ji} = \eta_h\left[\sum_{l=1}^{L}\left|d_l - y_l\right|^{r-1}\operatorname{sgn}\left(d_l - y_l\right)w_{lj}\,f'_o(net_l)\right]f'_h(net_j)\,x_i \qquad (3.38)$$

where sgn is the sign function. These equations reduce to those of standard backprop for the case r = 2. The motivation behind the use of this criterion is that it can lead to maximum-likelihood estimation of the weights for Gaussian and non-Gaussian input data distributions by appropriately choosing r (e.g., r = 1 for data with Laplace distributions). A small r ($1 \leq r < 2$) gives less weight to large deviations and tends to reduce the influence of outlier points in the input space during learning. On the other hand, when noise is negligible, the sensitivity of the separating surfaces implemented by the hidden units to the geometry of the problem may be increased by employing $r > 2$. Here, fewer hidden units are recruited when learning complex nonlinearly separable mappings for larger r values (Hanson and Burr, 1988).
If no a priori knowledge is available about the distribution of the training data, it would be difficult to estimate a value for r without extensive experimentation with various r values (e.g., r = 1.5, 2, 3). Alternatively, an automatic method for estimating r is possible by adaptively updating r in the direction of decreasing E. Here, steepest gradient descent on E(r) results in the update rule

$$\Delta r = -\eta_r\,\frac{\partial E}{\partial r} = \eta_r\sum_{l}\left[\frac{1}{r^{2}}\left|d_l - y_l\right|^{r} - \frac{1}{r}\left|d_l - y_l\right|^{r}\ln\left|d_l - y_l\right|\right] \qquad (3.39)$$

which, when restricting r to be strictly greater than 1 (metric error measure case), may be approximated as

$$\Delta r \cong -\eta_r\,\frac{1}{r}\sum_{l}\left|d_l - y_l\right|^{r}\ln\left|d_l - y_l\right| \qquad (3.40)$$

Note that it is important that the r update rule be invoked much less frequently than the weight update rule (e.g., r is updated once every 10 training cycles of backprop).
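In practice the Minkowski-r rules change only the output-error factor used by backprop; the following sketch computes the criterion of Equation (3.36) and the error term that replaces $(d_l - y_l)$ in Equations (3.37) and (3.38), with `d` and `y` standing for target and output vectors.

```python
import numpy as np

def minkowski_error(d, y, r=2.0):
    """Equation (3.36): E = (1/r) * sum(|d - y|^r)."""
    return np.sum(np.abs(d - y) ** r) / r

def minkowski_error_signal(d, y, r=2.0):
    """Error term |d-y|^(r-1) * sgn(d-y) used in Eqs. (3.37)-(3.38); r = 2 gives
    back the ordinary (d - y) of standard backprop."""
    e = d - y
    return np.abs(e) ** (r - 1.0) * np.sign(e)
```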
The idea of increasing the learning robustness of backprop in noisy environments can be placed in a more general statistical framework where the technique of robust statistics takes effect. Here, robustness of learning refers to insensitivity to small perturbations in the underlying probability distribution p(x) of the training set. These statistical techniques motivate the replacement of the linear error $e_l = d_l - y_l$ in Equations (3.2) and (3.9) by a nonlinear error suppressor function $f_e(e_l)$ that is compatible with the underlying probability density function p(x). One example is to set $f_e(e_l) = \left|d_l - y_l\right|^{r-1}\operatorname{sgn}\left(d_l - y_l\right)$ with $1 \leq r < 2$. This error suppressor leads to the exact Minkowski-r weight update rules of Equations (3.37) and (3.38). In fact, the case r = 1 is equivalent to minimizing the summed absolute error criterion that is known to suppress outlier data points. Similarly, the selection $f_e(e_l) = 2e_l/\left(1 + e_l^{2}\right)$ leads to robust backprop if p(x) has long tails, such as a Cauchy distribution or some other infinite-variance density.
Furthermore, regularization terms may be added to the preceding error functions E(w) in order to introduce some desirable effects such as good generalization, smaller effective network size, smaller weight magnitudes, faster learning, etc. (Poggio and Girosi, 1990). The regularization terms in Equations (3.29) and (3.31) used for enhancing generalization through weight pruning/elimination are examples. Another possible regularization term is $\left\|\frac{\partial E}{\partial\mathbf{x}}\right\|^{2}$, which has been shown to improve backprop generalization by forcing the output to be insensitive to small changes in the input. It also helps speed up convergence by generating hidden-layer weight distributions that have smaller variances than those generated by standard backpropagation.
Weight sharing, a method where several weights in a network are controlled by a single parameter, is another way of enhancing generalization [Rumelhart et al. (1986b)]. It imposes equality constraints among weights, thus reducing the number of free (effective) parameters in the network, which leads to improved generalization. An automatic method for effecting weight sharing can be derived by adding the regularization term

$$J_R = -\sum_{i}\ln\left[\sum_{j}\pi_j\, p_j(w_i)\right] \qquad (3.41)$$

to the error function, where each $p_j(w_i)$ is a Gaussian density with mean $\mu_j$ and variance $\sigma_j^{2}$, $\pi_j$ is the mixing proportion of Gaussian $p_j$ with $\sum_j \pi_j = 1$, and $w_i$ represents an arbitrary weight in the network. The $\mu_j$, $\sigma_j$, and $\pi_j$ parameters are assumed to adapt as the network learns. The use of multiple adaptive Gaussians allows the implementation of "soft weight sharing," in which the learning algorithm decides for itself which weights should be tied together. If the Gaussians all start with high variance, the initial grouping of weights into subsets will be very soft. As the network learns and the variances shrink, the groupings become more and more distinct and converge to subsets influenced by the task being learned. For gradient-descent-based adaptation, one may employ the partial derivatives

$$\frac{\partial J_R}{\partial w_i} = \sum_{j} r_j(w_i)\,\frac{w_i - \mu_j}{\sigma_j^{2}} \qquad (3.42)$$

$$\frac{\partial J_R}{\partial \mu_j} = \sum_{i} r_j(w_i)\,\frac{\mu_j - w_i}{\sigma_j^{2}} \qquad (3.43)$$

$$\frac{\partial J_R}{\partial \sigma_j} = -\sum_{i} r_j(w_i)\,\frac{\left(w_i - \mu_j\right)^{2} - \sigma_j^{2}}{\sigma_j^{3}} \qquad (3.44)$$

and

$$\frac{\partial J_R}{\partial \pi_j} = \sum_{i}\frac{\pi_j - r_j(w_i)}{\pi_j} \qquad (3.45)$$

with

$$r_j(w_i) = \frac{\pi_j\, p_j(w_i)}{\sum_{k}\pi_k\, p_k(w_i)} \qquad (3.46)$$
It should be noted that the derivation of the partial derivative of $J_R$ with respect to the mixing proportions is less straightforward than those in Equations (3.42) through (3.44) because the sum of the $\pi_j$ values must be maintained at 1. Thus the result in Equation (3.45) has been obtained by appropriate use of a Lagrange multiplier method and a bit of algebraic manipulation. The term $r_j(w_i)$ in Equations (3.42) through (3.46) is the posterior probability of Gaussian j given weight $w_i$; i.e., it measures the responsibility of Gaussian j for the ith weight. Equation (3.42) attempts to pull the weights toward the center of the "responsible" Gaussian. It realizes a competition mechanism among the various Gaussians for taking on responsibility for weight $w_i$. The partial derivative for $\mu_j$ drives $\mu_j$ toward the weighted average of the set of weights for which Gaussian j is responsible. Similarly, one may come up with simple interpretations for the derivatives in Equations (3.44) and (3.45). To summarize, the penalty term in Equation (3.41) leads to unsupervised clustering of weights (weight sharing) driven by the biases in the training set.
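A sketch of the soft-weight-sharing penalty is given below: it evaluates $J_R$ of Equation (3.41), the responsibilities of Equation (3.46), and the weight gradient of Equation (3.42); the gradients for $\mu_j$, $\sigma_j$, and $\pi_j$ follow the same pattern. The two-component mixture in the usage example is an arbitrary choice for illustration.

```python
import numpy as np

def gaussian(w, mu, sigma):
    """Gaussian density p_j(w_i) with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def soft_sharing_penalty_and_grad(w, pi, mu, sigma):
    """J_R = -sum_i ln(sum_j pi_j p_j(w_i)) and dJ_R/dw_i = sum_j r_j(w_i)(w_i - mu_j)/sigma_j^2."""
    p = gaussian(w[:, None], mu[None, :], sigma[None, :])   # p_j(w_i), shape (n_weights, n_components)
    mix = p @ pi                                            # sum_j pi_j p_j(w_i)
    resp = (p * pi[None, :]) / mix[:, None]                 # responsibilities r_j(w_i), Eq. (3.46)
    J_R = -np.sum(np.log(mix))
    dJ_dw = np.sum(resp * (w[:, None] - mu[None, :]) / sigma[None, :] ** 2, axis=1)
    return J_R, dJ_dw

# example: two Gaussians, one centered at zero (to prune small weights)
w = np.array([0.05, -0.8, 0.76, 0.02])
J, g = soft_sharing_penalty_and_grad(w, pi=np.array([0.5, 0.5]),
                                     mu=np.array([0.0, 0.8]),
                                     sigma=np.array([0.1, 0.1]))
```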