
Ch2: Adaline and Madaline

Adaline : Adaptive Linear neuron


Madaline : Multiple Adaline
2.1 Adaline (Bernard Widrow, Stanford Univ.)
[Adaline structure: inputs x_0 (bias term), x_1, ..., x_n with adjustable weights;
 an error-feedback (gain-adjust) term tunes the weights]

   y = Σ_{j=0}^{n} w_j x_j = w^T x      (linear combination)

   d = f(y)
2.1.1 Least Mean Square (LMS) Learning
◎ Input vectors : { x_1, x_2, ..., x_L }
  Ideal outputs : { d_1, d_2, ..., d_L }
  Actual outputs : { y_1, y_2, ..., y_L }
Assume the output function: f(y) = y = d
Mean square error:

   ξ = <ε_k^2> = (1/L) Σ_{k=1}^{L} (d_k - y_k)^2

   ε_k = d_k - y_k = d_k - w^T x_k

   <ε_k^2> = <d_k^2> + w^T <x_k x_k^T> w - 2 <d_k x_k^T> w      -- (2.4)

Let ξ = <ε_k^2>,  p = <d_k x_k>,  R = <x_k x_k^T> : correlation matrix

(2.4) ⇒  ξ(w) = <d_k^2> + w^T R w - 2 p^T w

Idea: w* = arg min_w ξ(w)

   Let dξ(w)/dw = 2Rw - 2p = 0.  Obtain w* = R^{-1} p.
Practical difficulties of the analytical formula:
1. Large dimensions: R^{-1} is difficult to calculate
2. < > expected values: require knowledge of the underlying probabilities
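As a concrete illustration of the analytical solution w* = R^{-1} p (and of why the two points above make it impractical for large problems), the following is a minimal NumPy sketch that replaces the expected values by sample averages over a training set; the synthetic data and all variable names are illustrative and not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (illustrative only): L samples of an n-dimensional input.
L, n = 500, 3
X = rng.normal(size=(L, n))                 # rows are the input vectors x_k
w_true = np.array([1.5, -2.0, 0.5])         # "unknown" weights that generate d_k
d = X @ w_true + 0.1 * rng.normal(size=L)   # ideal outputs d_k

# Replace the expected values by sample averages over the training set.
R = (X.T @ X) / L                           # estimate of R = <x_k x_k^T>
p = (X.T @ d) / L                           # estimate of p = <d_k x_k>

# Analytical solution w* = R^{-1} p (solve a linear system instead of inverting R).
w_star = np.linalg.solve(R, p)
print(w_star)                               # close to w_true
```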

2.1.2 Steepest Descent
The graph of ξ(w) = <d_k^2> + w^T R w - 2 p^T w is a paraboloid.
Steps: 1. Initialize weight values w(t_0)
       2. Determine the steepest descent direction

             -∇_w ξ(w(t)) = -dξ(w(t))/dw(t) = 2( p - R w(t) )

          Let Δw(t) = -∇_w ξ(w(t))

       3. Modify weight values

             w(t+1) = w(t) + μ Δw(t),   μ : step size

       4. Repeat 2~3.
No calculation of R^{-1}
Drawbacks: i) To know R and p is equivalent to
knowing the error surface in advance. ii) Steepest
descent training is a batch training method.
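A minimal sketch of the batch steepest-descent iteration above, assuming R and p have already been estimated from the training data (e.g., as in the previous sketch); the step size μ and the iteration count are illustrative and must be chosen small enough for convergence.

```python
import numpy as np

def steepest_descent(R, p, mu=0.05, n_iter=200):
    """Batch steepest descent on xi(w) = <d_k^2> + w^T R w - 2 p^T w."""
    w = np.zeros(len(p))                  # step 1: initialize w(t0)
    for _ in range(n_iter):
        delta_w = 2.0 * (p - R @ w)       # step 2: -grad_w xi(w(t)) = 2(p - R w(t))
        w = w + mu * delta_w              # step 3: w(t+1) = w(t) + mu * delta_w(t)
    return w                              # step 4: repeat steps 2~3

# Usage (R, p estimated from data as in the previous sketch):
# w_hat = steepest_descent(R, p)
```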
2.1.3 Stochastic Gradient Descent
Approximate -∇_w ξ(w(t)) = 2( p - R w(t) ) by randomly
selecting one training example at a time:

1. Apply an input vector x_k
2. ε_k^2(t) = ( d_k - y_k )^2 = ( d_k - w^T(t) x_k )^2
3. ∇_w ξ(t) ≈ ∇_w ε_k^2(t)
      = -2( d_k - w^T(t) x_k ) x_k = -2 ε_k(t) x_k
4. w(t+1) = w(t) + 2μ ε_k(t) x_k
5. Repeat 1~4 with the next input vector

No calculation of p and R
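A minimal sketch of the stochastic (Widrow-Hoff / LMS) update in steps 1~5 above, written in NumPy; the learning rate, epoch count, and names are illustrative.

```python
import numpy as np

def lms_train(X, d, mu=0.01, n_epochs=20, seed=0):
    """Stochastic LMS (Widrow-Hoff): one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    L, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        for k in rng.permutation(L):        # step 1: apply an input vector x_k
            err = d[k] - w @ X[k]           # step 2: eps_k(t) = d_k - w^T(t) x_k
            w = w + 2.0 * mu * err * X[k]   # steps 3-4: w(t+1) = w(t) + 2 mu eps_k(t) x_k
        # step 5: repeat with the next input vector (next pass of the loop)
    return w

# Usage: w_hat = lms_train(X, d)   # X, d as in the earlier sketch
```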
Drawback: time consuming.
Improvement: mini-batch training method.

○ Practical Considerations:
(a) No. of training vectors, (b) Stopping criteria
(c) Initial weights, (d) Step size
2.1.4 Conjugate Gradient Descent
-- Drawback: can only minimize quadratic functions,
   e.g., f(w) = (1/2) w^T A w - b^T w + c
   Advantage: guaranteed to find the optimum solution in
   at most n iterations, where n is the size of matrix A.
A-Conjugate Vectors:
Let A (n×n) be a square, symmetric, positive-definite matrix and
S = { s(0), s(1), ..., s(n-1) }.

   If s^T(i) A s(j) = 0, ∀ i ≠ j, the vectors are A-conjugate.

* If A = I (identity matrix), conjugacy = orthogonality.
The set S forms a basis for the space R^n.
The solution w* in R^n can be written as

   w* = Σ_{i=0}^{n-1} a_i s(i)
• The conjugate-direction method for minimizing f(w) is defined by

   w(i+1) = w(i) + η(i) s(i),   i = 0, 1, ..., n-1

  where w(0) is an arbitrary starting vector and
  η(i) is determined by min_η f( w(i) + η s(i) ).
How to determine s(i)?
Define r(i) = b - A w(i), which is in the steepest descent
direction of f(w)  (∵ -∇_w f(w) = b - A w).

Let s(i) = r(i) + α(i) s(i-1),   i = 1, 2, ..., n-1      -- (A)

Multiply by s^T(i-1) A :
   s^T(i-1) A s(i) = s^T(i-1) A ( r(i) + α(i) s(i-1) ).
In order to be A-conjugate, s^T(i) A s(j) = 0, ∀ i ≠ j :
   0 = s^T(i-1) A r(i) + α(i) s^T(i-1) A s(i-1)

   α(i) = - s^T(i-1) A r(i) / ( s^T(i-1) A s(i-1) )      -- (B)

s(1), s(2), ..., s(n-1) generated by Eqs. (A) and (B) are A-conjugate.
• Desire that evaluating α(i) does not require knowledge of A.

   Polak-Ribiere formula:    α(i) = r^T(i) ( r(i) - r(i-1) ) / ( r^T(i-1) r(i-1) )

   Fletcher-Reeves formula:  α(i) = r^T(i) r(i) / ( r^T(i-1) r(i-1) )
* The conjugate-direction method for minimizing
  ξ(w) = <d_k^2> + w^T R w - 2 p^T w :

   w(i+1) = w(i) + η(i) s(i),   i = 0, 1, ..., n-1

   w(0) is an arbitrary starting vector
   η(i) is determined by min_η ξ( w(i) + η s(i) )

   s(i) = r(i) + α(i) s(i-1),   r(i) = p - R w(i)

   α(i) = - s^T(i-1) R r(i) / ( s^T(i-1) R s(i-1) )
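A minimal sketch of the conjugate-direction iteration just described, using r(i) = p - R w(i), an exact line search for η(i), and the Fletcher-Reeves formula for α(i); the function and variable names are illustrative.

```python
import numpy as np

def conjugate_gradient(R, p, n_iter=None):
    """Linear CG for xi(w) = <d_k^2> + w^T R w - 2 p^T w, R symmetric positive-definite."""
    n = len(p)
    n_iter = n if n_iter is None else n_iter
    w = np.zeros(n)                        # w(0): arbitrary starting vector
    r = p - R @ w                          # r(0) = p - R w(0)
    s = r.copy()                           # first direction: steepest descent
    for _ in range(n_iter):
        Rs = R @ s
        eta = (s @ r) / (s @ Rs)           # exact line search: min_eta xi(w + eta s)
        w = w + eta * s                    # w(i+1) = w(i) + eta(i) s(i)
        r_new = p - R @ w                  # new residual r(i+1)
        alpha = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves alpha
        s = r_new + alpha * s              # s(i+1) = r(i+1) + alpha s(i)
        r = r_new
    return w

# Usage: w_hat = conjugate_gradient(R, p)   # converges in at most n iterations
```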
Nonlinear Conjugate Gradient Algorithm
Initialize w(0) by an appropriate process
Example: A comparison of the convergence of gradient descent (green)
and conjugate gradient (red) for minimizing a quadratic function.
Conjugate gradient converges in at most n steps, where n is the size
of the matrix of the system (here n = 2).
2.3. Applications
2.3.1. Echo Cancellation in Telephone Circuits

n : incoming voice, s : outgoing voice


n�: noise (leakage of the incoming voice)
y : the output of the filter mimics n� 14
Hybrid circuit: deals with the leakage issue; it attempts to isolate
the incoming signal from the outgoing signal.
Adaptive filter: deals with the choppy-speech issue; it mimics the
leakage of the incoming voice so that this leakage can be suppressed
in the outgoing signal.
Let s̃ = s + n′ - y be the signal actually sent out (the error ε fed back
to the filter):

   <s̃^2> = <( s + n′ - y )^2> = <s^2> + <( n′ - y )^2> + 2<s( n′ - y )>
         = <s^2> + <( n′ - y )^2>
      since <s( n′ - y )> = 0 (s is not correlated with y and n′)

   <ε^2> = <s̃^2> = <s^2> + <( n′ - y )^2>

   ∴ min <ε^2>  ⇔  min <( n′ - y )^2>
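A minimal sketch of this adaptive echo canceller with an LMS transversal filter: the filter input is the incoming voice n, its output y learns to mimic the leakage n′, and the error s + n′ - y (the signal that goes out on the line) converges toward the clean outgoing voice s. The signals, leakage path, filter length, and step size are all synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, taps, mu = 5000, 8, 0.01

n_in = rng.normal(size=T)                      # incoming voice n (reference input)
leak = np.convolve(n_in, [0.6, 0.3, 0.1])[:T]  # n': leakage through the hybrid (unknown)
s = np.sin(0.05 * np.arange(T))                # outgoing voice s
line = s + leak                                # signal on the line: s + n'

w = np.zeros(taps)
cleaned = np.zeros(T)
for t in range(taps, T):
    x = n_in[t - taps + 1 : t + 1][::-1]       # recent samples of the incoming voice
    y = w @ x                                  # filter output, learns to mimic n'
    err = line[t] - y                          # error = s + n' - y  (close to s once adapted)
    w += 2 * mu * err * x                      # LMS update of the filter weights
    cleaned[t] = err                           # what is actually transmitted
```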
2.3.2 Predict Signal

An adaptive filter is trained to predict a signal.
The input used to train the filter is a delayed version of the actual
signal; the expected output is the current signal.
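A minimal sketch of this predictor setup, trained with the same LMS rule as before: the filter input is a window of delayed samples of the signal, and the target is the current sample. The signal, delay, filter length, and step size are illustrative.

```python
import numpy as np

signal = np.sin(0.07 * np.arange(3000))        # illustrative signal to be predicted
taps, delay, mu = 10, 1, 0.02
w = np.zeros(taps)

for t in range(taps + delay, len(signal)):
    x = signal[t - delay - taps + 1 : t - delay + 1][::-1]  # delayed input window
    y = w @ x                                  # predicted value of the current sample
    err = signal[t] - y                        # expected output = the current sample
    w += 2 * mu * err * x                      # LMS update
```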
2.3.3 Reproduce Signal

2.3.4. Adaptive Beam-Forming Antenna Arrays
Antenna : a spatial array of sensors which are directional in their
reception characteristics.
The adaptive filter learns to steer the antenna array so that it can
respond to incoming signals regardless of their directions, while
reducing its response to unwanted noise signals coming in from other
directions.

2.4 Madaline : Many Adalines
○ XOR function ?
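A single Adaline cannot realize XOR (it is not linearly separable), but a Madaline with two Adalines feeding a fixed logic unit can. Below is a minimal sketch with hand-picked weights; the particular weight values and the OR combiner are illustrative and may differ from the network shown on the original slide.

```python
import numpy as np

def sgn(v):
    """Bipolar hard limiter."""
    return np.where(v >= 0, 1, -1)

def adaline(x, w):
    """One Adaline: bias plus weighted sum, followed by a bipolar threshold."""
    return sgn(w[0] + w[1] * x[0] + w[2] * x[1])

# Hand-picked (illustrative) weights, bias weight first:
w1 = np.array([-1.0,  1.0, -1.0])   # unit 1 fires only for (+1, -1)
w2 = np.array([-1.0, -1.0,  1.0])   # unit 2 fires only for (-1, +1)

def madaline_xor(x):
    """Two Adalines combined by a fixed OR unit realize XOR on bipolar inputs."""
    y1, y2 = adaline(x, w1), adaline(x, w2)
    return sgn(1 + y1 + y2)          # bipolar OR of y1 and y2

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, madaline_xor(np.array(x)))   # -1, +1, +1, -1
```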

2.4.2. Madaline Rule II (MRII)
○ Training algorithm – A trial–and–error procedure
with a minimum disturbance principle (those
nodes that can affect the output error while
incurring the least change in their weights
should have precedence in the learning
process)
○ Procedure –
1. Input a training pattern
2. Count #incorrect values in the output layer
3. For all units on the output layer
3.1. Select the first previously unselected error
node whose analog output is closest to zero
(∵ this node can reverse its bipolar output with the least change in its weights)
3.2. Change the weights on the selected unit s.t.
the bipolar output of the unit changes
3.3. Input the same training pattern
3.4. If the number of errors is reduced, accept the weight change;
otherwise restore the original weights
4. Repeat Step 3 for all layers except the input layer
5. For all units on the output layer
5.1. Select the previously unselected pair of units whose analog
outputs are closest to zero
5.2. Apply a weight correction to both units, in
order to change their bipolar outputs
5.3. Input the same training pattern
5.4. If the number of errors is reduced, accept the correction;
otherwise, restore the original weights.
6. Repeat step 5 for all layers except the input layer.

※ Steps 5 and 6 can be repeated with triplets,
quadruplets or longer combinations of units
until satisfactory results are obtained

The MRII learning rule considers the network with only one hidden
layer. For networks with more hidden layers, the backpropagation
learning strategy to be discussed later can be employed.
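To make the procedure concrete, here is a simplified sketch of the single-unit trials (steps 1~4) of MRII for one adaptive layer of Adalines whose bipolar outputs are combined by a fixed majority-vote unit; this architecture, the margin delta, and all names are simplifications for illustration rather than the exact network on the slides.

```python
import numpy as np

def sgn(v):
    """Bipolar hard limiter."""
    return np.where(v >= 0, 1, -1)

class MadalineMRII:
    """Simplified sketch: one adaptive layer of Adalines, fixed majority-vote output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Column 0 holds the bias weight (bias input x_0 = 1).
        self.W = rng.normal(scale=0.1, size=(n_hidden, n_in + 1))

    def _analog(self, x):
        xb = np.concatenate(([1.0], x))      # prepend the bias input
        return self.W @ xb, xb

    def predict(self, x):
        a, _ = self._analog(x)
        return sgn(np.sum(sgn(a)))           # fixed majority vote of the bipolar outputs

    def train_pattern(self, x, d, delta=0.3):
        """Steps 1~4 for one pattern: trial-and-error flips in minimum-disturbance order."""
        if self.predict(x) == d:             # step 2: output already correct
            return
        a, xb = self._analog(x)
        for j in np.argsort(np.abs(a)):      # step 3.1: analog output closest to zero first
            old = self.W[j].copy()
            # Step 3.2: smallest weight change that reverses the unit's bipolar output.
            self.W[j] -= (a[j] + sgn(a[j]) * delta) * xb / (xb @ xb)
            if self.predict(x) == d:         # steps 3.3-3.4: keep the change if it helps
                return
            self.W[j] = old                  # otherwise restore the original weights

# Usage sketch (bipolar data, e.g. the XOR patterns above):
# net = MadalineMRII(n_in=2, n_hidden=3)
# for epoch in range(100):
#     for x, d in zip(X_train, d_train):
#         net.train_pattern(np.asarray(x, dtype=float), d)
```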

2.4.3. A Madaline for Translation-Invariant Pattern Recognition

○ Relationships among the weight matrices of Adalines

○ Extension -- Multiple slabs with different key weight matrices for
discriminating more than two classes of patterns

