Você está na página 1de 31

Deep Learning: Theory, Techniques & Applications

- Recurrent Neural Networks -

Prof. Matteo Matteucci – matteo.matteucci@polimi.it

Department of Electronics, Information and Bioengineering


Artificial Intelligence and Robotics Lab - Politecnico di Milano
Sequence Modeling

So far we have considered only «static» datasets

[Figure: a feedforward network taking a single input vector (features x1 … xI) through hidden units with weights w_ji … w_JI to outputs g_1(x|w) … g_K(x|w)]
Sequence Modeling

So far we have considered only «static» datasets

[Figure: a sequence of input vectors X0, X1, X2, X3, …, Xt (each with features x1 … xI) unfolding along the time axis]
Sequence Modeling

Different ways to deal with «dynamic» data:

Memoryless models:
• Autoregressive models
• Feedforward neural networks

Models with memory:
• Linear dynamical systems
• Hidden Markov models
• Recurrent Neural Networks
• ...

[Figure: the same sequence of input vectors X0, X1, X2, X3, …, Xt along the time axis]
Memoryless Models for Sequences

Autoregressive models
• Predict the next input from previous ones using «delay taps» (a sketch follows below)

Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers

[Figure: an autoregressive model with delay-tap weights W_{t-2}, W_{t-1} connecting past inputs directly to X_t, and a feedforward network where the same past inputs feed a hidden layer before predicting X_t, both drawn along the time axis]
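Both memoryless approaches fit in a few lines of code. The sketch below is a minimal illustration, not material from the slides: it fits a linear autoregressive predictor with k delay taps by least squares; feeding the same windowed inputs to a network with a non-linear hidden layer gives the feedforward variant. All names (make_windows, ar_fit, ar_predict, the toy sine-wave data) are made up for the example.

```python
import numpy as np

def make_windows(x, k):
    """Stack k consecutive samples ("delay taps") as features for predicting the next one."""
    X = np.stack([x[i:i + k] for i in range(len(x) - k)])   # shape (N-k, k)
    y = x[k:]                                               # targets: the sample right after each window
    return X, y

def ar_fit(x, k):
    """Least-squares fit of a linear AR model x_t ~ w . [x_{t-k}, ..., x_{t-1}] + b."""
    X, y = make_windows(x, k)
    Xb = np.hstack([X, np.ones((len(X), 1))])               # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def ar_predict(x_recent, w):
    """Predict the next value from the k most recent samples."""
    return np.append(x_recent, 1.0) @ w

# toy data: a noisy sine wave
t = np.arange(300)
x = np.sin(0.1 * t) + 0.05 * np.random.randn(len(t))
w = ar_fit(x, k=5)
print("next value:", ar_predict(x[-5:], w))
```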
Dynamical Systems
Stochastic systems ...

Generative models with a real-valued hidden state which cannot be observed directly:

• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output, the hidden state has to be inferred
• Inputs are treated as driving inputs

In linear dynamical systems this becomes:

• State is continuous with Gaussian uncertainty
• Transformations are assumed to be linear
• State can be estimated using Kalman filtering (sketched below)

[Figure: a state-space model unrolled over time: hidden states evolve from one step to the next, driven by inputs X0, X1, …, Xt and emitting outputs Y0, Y1, …, Yt]
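For the linear-Gaussian case, state estimation has a closed form. Below is a minimal Kalman filter sketch assuming dynamics x_t = A x_{t-1} + B u_t + noise and observations y_t = C x_t + noise; the function name kalman_step and the matrix names are illustrative, not part of the slides.

```python
import numpy as np

def kalman_step(mu, P, u, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter.
    mu, P   : current state mean and covariance
    u, y    : driving input and observation at this step
    A, B, C : linear dynamics, input and observation matrices
    Q, R    : process and observation noise covariances
    """
    # Predict: propagate the hidden state through the linear dynamics
    mu_pred = A @ mu + B @ u
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation
    S = C @ P_pred @ C.T + R                    # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```

Running kalman_step over the whole sequence of inputs and observations yields the filtered estimate of the hidden state at every time step.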
Dynamical Systems
Stochastic systems ...

Generative models with a real-valued hidden state which cannot be observed directly:

• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output, the hidden state has to be inferred
• Inputs are treated as driving inputs

In hidden Markov models this becomes:

• State is assumed to be discrete, state transitions are stochastic (transition matrix)
• Output is a stochastic function of the hidden states
• State can be estimated via the Viterbi algorithm (sketched below)

[Figure: the same state-space model unrolled over time, now with a discrete hidden state]
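The discrete-state decoding mentioned above can also be written compactly. This is a generic Viterbi sketch under assumed model parameters (pi, A, E are made-up names for the initial, transition and emission probabilities), not material from the slides.

```python
import numpy as np

def viterbi(obs, pi, A, E):
    """Most likely hidden state sequence for integer observations `obs`.
    pi : (S,)   initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(state j | state i)
    E  : (S, O) emission matrix,  E[i, o] = P(obs o | state i)
    """
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(E[:, obs[0]])    # log-prob of the best path ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # scores[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(scores, axis=0)     # best predecessor for each state j
        logd = scores[back[t], np.arange(S)] + np.log(E[:, obs[t]])
    # backtrack the best path
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```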
Recurrent Neural Networks
Deterministic systems ...

Introduce memory by recurrent connections:

• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

"With enough neurons and time, RNNs can compute anything that can be computed by a computer."
(Computation Beyond the Turing Limit, Hava T. Siegelmann, 1995)

[Figure: a recurrent network with inputs x1 … xI, hidden units h_j^t(x, W^(1), V^(1)), context units c_1^t … c_B^t(x, W_B^(1), V_B^(1)) fed back from their previous values c_b^{t-1}, and output g^t(x|w)]
Recurrent Neural Networks

Introduce memory by recurrent connections:

• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

Output:

$$g^t(x_n \mid w) = g\left( \sum_{j=0}^{J} w_{1j}^{(2)} \cdot h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \cdot c_b^t(\cdot) \right)$$

Hidden units:

$$h_j^t(\cdot) = h_j\left( \sum_{i=0}^{I} w_{ji}^{(1)} \cdot x_{i,n} + \sum_{b=0}^{B} v_{jb}^{(1)} \cdot c_b^{t-1} \right)$$

Context units:

$$c_b^t(\cdot) = c_b\left( \sum_{i=0}^{I} v_{bi}^{(1)} \cdot x_{i,n} + \sum_{b'=0}^{B} v_{bb'}^{(1)} \cdot c_{b'}^{t-1} \right)$$

[Figure: the same recurrent network annotated with the quantities above]
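To make the recursion concrete, here is a minimal numpy sketch of one time step of a recurrent network in the spirit of the formulas above, with the common simplification that the context units simply hold the previous hidden activations (an Elman-style network). The names (rnn_step, Wxh, Whh, Why), the sizes and the tanh/linear choices are illustrative assumptions, not the slides' exact architecture.

```python
import numpy as np

def rnn_step(x, c_prev, Wxh, Whh, Why, bh, by):
    """One step of a simple recurrent network.
    x      : (I,) input at time t
    c_prev : (H,) context = hidden state from the previous step
    Returns the output y_t and the new context c_t.
    """
    h = np.tanh(Wxh @ x + Whh @ c_prev + bh)   # hidden state: mixes current input and context
    y = Why @ h + by                           # linear read-out
    return y, h                                # the new hidden state becomes the next context

# run the network over a toy sequence
rng = np.random.default_rng(0)
I, H, K = 4, 8, 2                              # input, hidden and output sizes
Wxh, Whh, Why = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (K, H))
bh, by = np.zeros(H), np.zeros(K)

c = np.zeros(H)                                # initial context (see the tips at the end of the deck)
for x_t in rng.normal(size=(10, I)):           # 10 input vectors
    y_t, c = rnn_step(x_t, c, Wxh, Whh, Why, bh, by)
```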
Backpropagation Through Time

[Figure: the recurrent network from the previous slides, whose recurrent connections will be unrolled over time]
Backpropagation Through Time

[Figure: the network unrolled over several time steps; the input layer and the recurrent weights are replicated once per step]

All these weights should be the same.
Backpropagation Through Time

• Perform network unroll for U steps
• Initialize the V, V_B replicas to be the same
• Compute gradients and update the replicas with the average of their gradients (sketched in code below):

$$V = V - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V^{\,t-u} \qquad\qquad V_B = V_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V_B^{\,t-u}$$

[Figure: the unrolled network with one replica V_B^{t-3}, V_B^{t-2}, V_B^{t-1}, V_B^{t} of the recurrent weights per time step]
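Below is a minimal sketch of truncated backpropagation through time for a tanh RNN with a linear read-out and squared error on the last output; the names (bptt_grads, Wxh, Whh, Why) and the loss choice are made-up assumptions. As on the slide, the shared recurrent weights are updated with the average of the per-replica gradients.

```python
import numpy as np

def bptt_grads(xs, y_target, c0, Wxh, Whh, Why):
    """Gradients from an unroll of U = len(xs) steps of h_t = tanh(Wxh x_t + Whh h_{t-1}),
    with y = Why h_U and loss 0.5 * ||y - y_target||^2."""
    U = len(xs)
    # forward pass, keeping the hidden state of every replica
    cs = [c0]
    for x in xs:
        cs.append(np.tanh(Wxh @ x + Whh @ cs[-1]))
    y = Why @ cs[-1]
    # backward pass through the unrolled graph
    dy = y - y_target                                  # d(loss)/d(output)
    dWhy = np.outer(dy, cs[-1])
    dh = Why.T @ dy                                    # gradient flowing into the last hidden state
    dWhh_replicas, dWxh_replicas = [], []
    for u in reversed(range(U)):                       # walk back through the replicas
        dpre = dh * (1.0 - cs[u + 1] ** 2)             # through the tanh non-linearity
        dWhh_replicas.append(np.outer(dpre, cs[u]))    # gradient of this time step's replica
        dWxh_replicas.append(np.outer(dpre, xs[u]))
        dh = Whh.T @ dpre                              # pass the gradient to the previous step
    # shared weights get the average of the replica gradients
    dWhh = sum(dWhh_replicas) / U
    dWxh = sum(dWxh_replicas) / U
    return dWxh, dWhh, dWhy

# toy usage
rng = np.random.default_rng(0)
I, H, K, U = 3, 5, 2, 4                                # truncation window of U steps
Wxh, Whh, Why = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (K, H))
xs, y_target = rng.normal(size=(U, I)), rng.normal(size=K)
dWxh, dWhh, dWhy = bptt_grads(xs, y_target, np.zeros(H), Wxh, Whh, Why)
Whh -= 0.1 * dWhh                                      # gradient-descent update with the averaged gradient
```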
How much should we go back in time?
Sometimes the output might be related to some input that happened quite long before:

Jane walked into the room. John walked in too.
It was late in the day. Jane said hi to <???>

However, backpropagation through time was not able to train recurrent neural networks significantly back in time ...

This was due to not being able to backpropagate through many layers ...

[Figure: the recurrent network, whose context units would have to carry information over many time steps]
How much can we go back in time?

To better understand why it was not working, let us consider a simplified case:

$$h^t = g(v^{(1)} h^{t-1} + w^{(1)} x) \qquad\qquad y^t = w^{(2)} g(h^t)$$

Backpropagation over an entire sequence is computed as

$$\frac{\partial E}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial h^t}\,\frac{\partial h^t}{\partial h^k}\,\frac{\partial h^k}{\partial w} \qquad\text{with}\qquad \frac{\partial h^t}{\partial h^k} = \prod_{i=k+1}^{t} \frac{\partial h^i}{\partial h^{i-1}} = \prod_{i=k+1}^{t} v^{(1)} g'(h^{i-1})$$

If we consider the norm of these terms,

$$\left\lVert \frac{\partial h^i}{\partial h^{i-1}} \right\rVert = \left\lVert v^{(1)} g'(h^{i-1}) \right\rVert \le \gamma_v \cdot \gamma_{g'} \qquad\Rightarrow\qquad \left\lVert \frac{\partial h^t}{\partial h^k} \right\rVert \le (\gamma_v \cdot \gamma_{g'})^{t-k}$$

If $(\gamma_v \gamma_{g'}) < 1$ this converges to 0 ...

With sigmoids and tanh we have vanishing gradients.
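The bound above is easy to check numerically. The toy sketch below (made-up weights, tanh non-linearity) multiplies the per-step factors |v^(1) g'| over an increasing gap t − k and shows the product collapsing toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = 0.9, 0.5                       # recurrent and input weights (|v| * max|g'| < 1 here)
x = rng.normal(size=200)              # a toy input sequence

# forward pass of the simplified recurrence h_t = tanh(v * h_{t-1} + w * x_t)
h = np.zeros(len(x) + 1)
for t in range(len(x)):
    h[t + 1] = np.tanh(v * h[t] + w * x[t])

# |dh_t / dh_k| = prod_{i=k+1..t} |v * (1 - h_i^2)|, since d tanh(a)/da = 1 - tanh(a)^2
factors = np.abs(v * (1.0 - h[1:] ** 2))
for gap in (1, 5, 10, 20, 50):
    print(f"gap {gap:3d}: |dh_t/dh_k| ~ {np.prod(factors[-gap:]):.2e}")
```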
Dealing with Vanishing Gradient

Force all gradients to be either 0 or 1:

$$g(a) = \mathrm{ReLU}(a) = \max(0, a) \qquad\qquad g'(a) = 1_{a > 0}$$

Build Recurrent Neural Networks using small modules that are designed to remember values for a long time:

$$h^t = v^{(1)} h^{t-1} + w^{(1)} x \qquad\qquad y^t = w^{(2)} g(h^t) \qquad\text{with } v^{(1)} = 1$$

It only accumulates the input ...
Long-Short Term Memories

Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem by designing a
memory cell using logistic and linear units with multiplicative interactions:

• Information gets into the cell


whenever its “write” gate is on.
• The information stays in the cell
so long as its “keep” gate is on.
• Information is read from the cell
by turning on its “read” gate.

We can backpropagate through this since the loop has a fixed weight.
RNN vs. LSTM

RNN

[Figure: the repeating module of a standard RNN (image from Chris Olah's blog)]

Long Short Term Memory

LSTM

[Figure: the repeating module of an LSTM cell (image from Chris Olah's blog)]

Long Short Term Memory

Input gate

[Figure: the LSTM cell with the input gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Forget gate

[Figure: the LSTM cell with the forget gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Memory gate

[Figure: the LSTM cell with the memory gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Output gate

[Figure: the LSTM cell with the output gate highlighted (image from Chris Olah's blog)]
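The gate slides above can be summarized in code. Here is a minimal numpy sketch of one LSTM step in the standard formulation (forget, input, candidate and output gates acting on a cell state); the function name lstm_step and the stacked weight layout are illustrative assumptions, not the slides' notation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, I + H): rows for forget, input, candidate, output."""
    z = W @ np.concatenate([x, h_prev]) + b        # all gate pre-activations at once
    H = len(h_prev)
    f = sigmoid(z[0:H])                            # forget gate: how much of the old cell state to keep
    i = sigmoid(z[H:2 * H])                        # input gate: how much new content to write
    g = np.tanh(z[2 * H:3 * H])                    # candidate values to be written
    o = sigmoid(z[3 * H:4 * H])                    # output gate: what to expose as the hidden state
    c = f * c_prev + i * g                         # cell state: keep part of the old state, add new content
    h = o * np.tanh(c)                             # hidden state read out from the cell
    return h, c

# toy usage
rng = np.random.default_rng(0)
I, H = 3, 4
W, b = rng.normal(0, 0.1, (4 * H, I + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=I), np.zeros(H), np.zeros(H), W, b)
```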
Gated Recurrent Unit (GRU)

It combines the forget and input gates into a single “update gate.” It also merges the
cell state and hidden state, and makes some other changes.

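For reference, one standard formulation of the GRU update matching the description above (a single update gate z_t, a reset gate r_t, and a merged cell/hidden state h_t); this notation is an assumption about the variant being described, not taken from the slide:

$$\begin{aligned}
z_t &= \sigma\!\left(W_z\,[\,h_{t-1},\ x_t\,]\right) && \text{update gate (merged forget and input gates)}\\
r_t &= \sigma\!\left(W_r\,[\,h_{t-1},\ x_t\,]\right) && \text{reset gate}\\
\tilde{h}_t &= \tanh\!\left(W\,[\,r_t \odot h_{t-1},\ x_t\,]\right) && \text{candidate state}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{merged cell/hidden state}
\end{aligned}$$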
LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: a computation graph of hidden layers unrolled over time, mapping inputs X0, X1, …, Xt to outputs Y0, Y1, …, Yt]
LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: the same computation graph built from stacked LSTM layers unrolled over time, with ReLU layers producing the outputs Y0, Y1, …, Yt]
Sequential Data Problems

• Fixed-sized input to fixed-sized output (e.g. image classification)
• Sequence output (e.g. image captioning takes an image and outputs a sentence of words)
• Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g. video classification where we wish to label each frame of the video)

Credits: Andrej Karpathy
Sequence to Sequence Modeling

Given <S, T> pairs, read S, and output T’ that best matches T

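A minimal sketch of this read-then-write pattern, assuming simple tanh recurrences for both encoder and decoder and greedy argmax decoding; all names (encode, decode_greedy, the embedding and weight matrices, the BOS/EOS indices) are made up for illustration, not the slides' model.

```python
import numpy as np

def encode(src_ids, Emb, Wxh, Whh):
    """Read the source sequence S and compress it into a final hidden state."""
    h = np.zeros(Whh.shape[0])
    for tok in src_ids:
        h = np.tanh(Wxh @ Emb[tok] + Whh @ h)
    return h

def decode_greedy(h, Emb, Wxh, Whh, Wout, bos=0, eos=1, max_len=20):
    """Generate T' token by token, feeding back the previous prediction."""
    out, tok = [], bos
    for _ in range(max_len):
        h = np.tanh(Wxh @ Emb[tok] + Whh @ h)     # decoder step, started from the encoder state
        tok = int(np.argmax(Wout @ h))            # greedy choice of the next target token
        if tok == eos:
            break
        out.append(tok)
    return out

# toy sizes: vocabulary of 10 tokens, embedding/hidden size 8
rng = np.random.default_rng(0)
V, H = 10, 8
Emb = rng.normal(0, 0.1, (V, H))
enc_Wxh, enc_Whh = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
dec_Wxh, dec_Whh, Wout = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (V, H))

h_src = encode([3, 5, 2], Emb, enc_Wxh, enc_Whh)
print(decode_greedy(h_src, Emb, dec_Wxh, dec_Whh, Wout))
```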
Tips & Tricks

When conditioning on a full input sequence, bidirectional RNNs can exploit it (a minimal sketch follows below):
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation

When initializing an RNN we need to specify the initial state:
• We could initialize it to a fixed value (such as 0)
• It is better to treat the initial state as learned parameters:
  • Start off with random guesses of the initial state values
  • Backpropagate the prediction error through time all the way to the initial state values and compute the gradient of the error with respect to these
  • Update these parameters by gradient descent
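A minimal sketch of the bidirectional feature extraction described above, reusing a simple tanh recurrence; the names (rnn_states, birnn_features) and the two weight sets are illustrative assumptions.

```python
import numpy as np

def rnn_states(xs, Wxh, Whh):
    """Hidden states of a simple tanh RNN over the sequence xs (one state per step)."""
    h, states = np.zeros(Whh.shape[0]), []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)
        states.append(h)
    return np.stack(states)                        # shape (T, H)

def birnn_features(xs, fwd, bwd):
    """Concatenate left-to-right and right-to-left hidden states at every time step."""
    h_fwd = rnn_states(xs, *fwd)                   # reads the sequence left-to-right
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]       # reads right-to-left, then re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=1)  # shape (T, 2H): the feature representation

# toy usage
rng = np.random.default_rng(0)
I, H, T = 4, 6, 10
fwd = (rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)))
bwd = (rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)))
xs = rng.normal(size=(T, I))
print(birnn_features(xs, fwd, bwd).shape)          # (10, 12)
```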
Acknowledgements

These slides are largely based on material taken from:

• Geoffrey Hinton
• Hugo Larochelle
• Andrej Karpathy
• Nando De Freitas
• Chris Olah

You can find more details in the original slides.

The amazing images of the LSTM cell are taken from Chris Olah's blog:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Deep Learning: Theory, Techniques & Applications
- Recurrent Neural Networks -

Prof. Matteo Matteucci – matteo.matteucci@polimi.it

Department of Electronics, Information and Bioengineering


Artificial Intelligence and Robotics Lab - Politecnico di Milano
