Você está na página 1de 31

Deep Learning: Theory, Techniques & Applications

- Recurrent Neural Networks -

Prof. Matteo Matteucci – matteo.matteucci@polimi.it

Department of Electronics, Information and Bioengineering


Artificial Intelligence and Robotics Lab - Politecnico di Milano
Sequence Modeling

So far we have considered only «static» datasets

[Figure: a feedforward network taking a single input vector (features x1 … xI) through hidden units with weights w_ji … w_JI to outputs g_1(x|w) … g_K(x|w)]
Sequence Modeling

So far we have considered only «static» datasets

[Figure: a sequence of input vectors X0, X1, X2, X3, …, Xt (each with features x1 … xI) unfolding along the time axis]
Sequence Modeling

Different ways to deal with «dynamic» data:

Memoryless models:
• Autoregressive models
• Feedforward neural networks

Models with memory:
• Linear dynamical systems
• Hidden Markov models
• Recurrent Neural Networks
• ...

[Figure: the same sequence of input vectors X0, X1, X2, X3, …, Xt along the time axis]
Memoryless Models for Sequences

Autoregressive models
• Predict the next input from previous ones using «delay taps» (a sketch follows below)

Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers

[Figure: an autoregressive model with delay-tap weights W_{t-2}, W_{t-1} connecting past inputs directly to X_t, and a feedforward network where the same past inputs feed a hidden layer before predicting X_t, both drawn along the time axis]
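Both memoryless approaches fit in a few lines of code. The sketch below is a minimal illustration, not material from the slides: it fits a linear autoregressive predictor with k delay taps by least squares; feeding the same windowed inputs to a network with a non-linear hidden layer gives the feedforward variant. All names (make_windows, ar_fit, ar_predict, the toy sine-wave data) are made up for the example.

```python
import numpy as np

def make_windows(x, k):
    """Stack k consecutive samples ("delay taps") as features for predicting the next one."""
    X = np.stack([x[i:i + k] for i in range(len(x) - k)])   # shape (N-k, k)
    y = x[k:]                                               # targets: the sample right after each window
    return X, y

def ar_fit(x, k):
    """Least-squares fit of a linear AR model x_t ~ w . [x_{t-k}, ..., x_{t-1}] + b."""
    X, y = make_windows(x, k)
    Xb = np.hstack([X, np.ones((len(X), 1))])               # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def ar_predict(x_recent, w):
    """Predict the next value from the k most recent samples."""
    return np.append(x_recent, 1.0) @ w

# toy data: a noisy sine wave
t = np.arange(300)
x = np.sin(0.1 * t) + 0.05 * np.random.randn(len(t))
w = ar_fit(x, k=5)
print("next value:", ar_predict(x[-5:], w))
```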
Dynamical Systems
Stochastic systems ...

Generative models with a real-valued hidden state which cannot be observed directly:

• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output, the hidden state has to be inferred
• Inputs are treated as driving inputs

In linear dynamical systems this becomes:

• State is continuous with Gaussian uncertainty
• Transformations are assumed to be linear
• State can be estimated using Kalman filtering (sketched below)

[Figure: a state-space model unrolled over time: hidden states evolve from one step to the next, driven by inputs X0, X1, …, Xt and emitting outputs Y0, Y1, …, Yt]
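For the linear-Gaussian case, state estimation has a closed form. Below is a minimal Kalman filter sketch assuming dynamics x_t = A x_{t-1} + B u_t + noise and observations y_t = C x_t + noise; the function name kalman_step and the matrix names are illustrative, not part of the slides.

```python
import numpy as np

def kalman_step(mu, P, u, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter.
    mu, P   : current state mean and covariance
    u, y    : driving input and observation at this step
    A, B, C : linear dynamics, input and observation matrices
    Q, R    : process and observation noise covariances
    """
    # Predict: propagate the hidden state through the linear dynamics
    mu_pred = A @ mu + B @ u
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation
    S = C @ P_pred @ C.T + R                    # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```

Running kalman_step over the whole sequence of inputs and observations yields the filtered estimate of the hidden state at every time step.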
Dynamical Systems
Stochastic systems ...

Generative models with a real-valued hidden state which cannot be observed directly:

• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output, the hidden state has to be inferred
• Inputs are treated as driving inputs

In hidden Markov models this becomes:

• State is assumed to be discrete, state transitions are stochastic (transition matrix)
• Output is a stochastic function of the hidden states
• State can be estimated via the Viterbi algorithm (sketched below)

[Figure: the same state-space model unrolled over time, now with a discrete hidden state]
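The discrete-state decoding mentioned above can also be written compactly. This is a generic Viterbi sketch under assumed model parameters (pi, A, E are made-up names for the initial, transition and emission probabilities), not material from the slides.

```python
import numpy as np

def viterbi(obs, pi, A, E):
    """Most likely hidden state sequence for integer observations `obs`.
    pi : (S,)   initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(state j | state i)
    E  : (S, O) emission matrix,  E[i, o] = P(obs o | state i)
    """
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(E[:, obs[0]])    # log-prob of the best path ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # scores[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(scores, axis=0)     # best predecessor for each state j
        logd = scores[back[t], np.arange(S)] + np.log(E[:, obs[t]])
    # backtrack the best path
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```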
Recurrent Neural Networks
Deterministic systems ...

Introduce memory by recurrent connections:

• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

"With enough neurons and time, RNNs can compute anything that can be computed by a computer."
(Computation Beyond the Turing Limit, Hava T. Siegelmann, 1995)

[Figure: a recurrent network with inputs x1 … xI, hidden units h_j^t(x, W^(1), V^(1)), context units c_1^t … c_B^t(x, W_B^(1), V_B^(1)) fed back from their previous values c_b^{t-1}, and output g^t(x|w)]
Recurrent Neural Networks

Introduce memory by recurrent connections:

• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

Output:

$$g^t(x_n \mid w) = g\left( \sum_{j=0}^{J} w_{1j}^{(2)} \cdot h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \cdot c_b^t(\cdot) \right)$$

Hidden units:

$$h_j^t(\cdot) = h_j\left( \sum_{i=0}^{I} w_{ji}^{(1)} \cdot x_{i,n} + \sum_{b=0}^{B} v_{jb}^{(1)} \cdot c_b^{t-1} \right)$$

Context units:

$$c_b^t(\cdot) = c_b\left( \sum_{i=0}^{I} v_{bi}^{(1)} \cdot x_{i,n} + \sum_{b'=0}^{B} v_{bb'}^{(1)} \cdot c_{b'}^{t-1} \right)$$

[Figure: the same recurrent network annotated with the quantities above]
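To make the recursion concrete, here is a minimal numpy sketch of one time step of a recurrent network in the spirit of the formulas above, with the common simplification that the context units simply hold the previous hidden activations (an Elman-style network). The names (rnn_step, Wxh, Whh, Why), the sizes and the tanh/linear choices are illustrative assumptions, not the slides' exact architecture.

```python
import numpy as np

def rnn_step(x, c_prev, Wxh, Whh, Why, bh, by):
    """One step of a simple recurrent network.
    x      : (I,) input at time t
    c_prev : (H,) context = hidden state from the previous step
    Returns the output y_t and the new context c_t.
    """
    h = np.tanh(Wxh @ x + Whh @ c_prev + bh)   # hidden state: mixes current input and context
    y = Why @ h + by                           # linear read-out
    return y, h                                # the new hidden state becomes the next context

# run the network over a toy sequence
rng = np.random.default_rng(0)
I, H, K = 4, 8, 2                              # input, hidden and output sizes
Wxh, Whh, Why = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (K, H))
bh, by = np.zeros(H), np.zeros(K)

c = np.zeros(H)                                # initial context (see the tips at the end of the deck)
for x_t in rng.normal(size=(10, I)):           # 10 input vectors
    y_t, c = rnn_step(x_t, c, Wxh, Whh, Why, bh, by)
```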
Backpropagation Through Time

[Figure: the recurrent network from the previous slides, whose recurrent connections will be unrolled over time]
Backpropagation Through Time

[Figure: the network unrolled over several time steps; the input layer and the recurrent weights are replicated once per step]

All these weights should be the same.
Backpropagation Through Time

• Perform network unroll for U steps
• Initialize the V, V_B replicas to be the same
• Compute gradients and update the replicas with the average of their gradients (sketched in code below):

$$V = V - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V^{\,t-u} \qquad\qquad V_B = V_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V_B^{\,t-u}$$

[Figure: the unrolled network with one replica V_B^{t-3}, V_B^{t-2}, V_B^{t-1}, V_B^{t} of the recurrent weights per time step]
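Below is a minimal sketch of truncated backpropagation through time for a tanh RNN with a linear read-out and squared error on the last output; the names (bptt_grads, Wxh, Whh, Why) and the loss choice are made-up assumptions. As on the slide, the shared recurrent weights are updated with the average of the per-replica gradients.

```python
import numpy as np

def bptt_grads(xs, y_target, c0, Wxh, Whh, Why):
    """Gradients from an unroll of U = len(xs) steps of h_t = tanh(Wxh x_t + Whh h_{t-1}),
    with y = Why h_U and loss 0.5 * ||y - y_target||^2."""
    U = len(xs)
    # forward pass, keeping the hidden state of every replica
    cs = [c0]
    for x in xs:
        cs.append(np.tanh(Wxh @ x + Whh @ cs[-1]))
    y = Why @ cs[-1]
    # backward pass through the unrolled graph
    dy = y - y_target                                  # d(loss)/d(output)
    dWhy = np.outer(dy, cs[-1])
    dh = Why.T @ dy                                    # gradient flowing into the last hidden state
    dWhh_replicas, dWxh_replicas = [], []
    for u in reversed(range(U)):                       # walk back through the replicas
        dpre = dh * (1.0 - cs[u + 1] ** 2)             # through the tanh non-linearity
        dWhh_replicas.append(np.outer(dpre, cs[u]))    # gradient of this time step's replica
        dWxh_replicas.append(np.outer(dpre, xs[u]))
        dh = Whh.T @ dpre                              # pass the gradient to the previous step
    # shared weights get the average of the replica gradients
    dWhh = sum(dWhh_replicas) / U
    dWxh = sum(dWxh_replicas) / U
    return dWxh, dWhh, dWhy

# toy usage
rng = np.random.default_rng(0)
I, H, K, U = 3, 5, 2, 4                                # truncation window of U steps
Wxh, Whh, Why = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (K, H))
xs, y_target = rng.normal(size=(U, I)), rng.normal(size=K)
dWxh, dWhh, dWhy = bptt_grads(xs, y_target, np.zeros(H), Wxh, Whh, Why)
Whh -= 0.1 * dWhh                                      # gradient-descent update with the averaged gradient
```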
How much should we go back in time?
Sometimes the output might be related to some input that happened quite long before:

Jane walked into the room. John walked in too.
It was late in the day. Jane said hi to <???>

However, backpropagation through time was not able to train recurrent neural networks significantly back in time ...

This was due to not being able to backpropagate through many layers ...

[Figure: the recurrent network, whose context units would have to carry information over many time steps]
How much can we go back in time?

To better understand why it was not working, let us consider a simplified case:

$$h^t = g(v^{(1)} h^{t-1} + w^{(1)} x) \qquad\qquad y^t = w^{(2)} g(h^t)$$

Backpropagation over an entire sequence is computed as

$$\frac{\partial E}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial h^t}\,\frac{\partial h^t}{\partial h^k}\,\frac{\partial h^k}{\partial w} \qquad\text{with}\qquad \frac{\partial h^t}{\partial h^k} = \prod_{i=k+1}^{t} \frac{\partial h^i}{\partial h^{i-1}} = \prod_{i=k+1}^{t} v^{(1)} g'(h^{i-1})$$

If we consider the norm of these terms,

$$\left\lVert \frac{\partial h^i}{\partial h^{i-1}} \right\rVert = \left\lVert v^{(1)} g'(h^{i-1}) \right\rVert \le \gamma_v \cdot \gamma_{g'} \qquad\Rightarrow\qquad \left\lVert \frac{\partial h^t}{\partial h^k} \right\rVert \le (\gamma_v \cdot \gamma_{g'})^{t-k}$$

If $(\gamma_v \gamma_{g'}) < 1$ this converges to 0 ...

With sigmoids and tanh we have vanishing gradients.
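The bound above is easy to check numerically. The toy sketch below (made-up weights, tanh non-linearity) multiplies the per-step factors |v^(1) g'| over an increasing gap t − k and shows the product collapsing toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = 0.9, 0.5                       # recurrent and input weights (|v| * max|g'| < 1 here)
x = rng.normal(size=200)              # a toy input sequence

# forward pass of the simplified recurrence h_t = tanh(v * h_{t-1} + w * x_t)
h = np.zeros(len(x) + 1)
for t in range(len(x)):
    h[t + 1] = np.tanh(v * h[t] + w * x[t])

# |dh_t / dh_k| = prod_{i=k+1..t} |v * (1 - h_i^2)|, since d tanh(a)/da = 1 - tanh(a)^2
factors = np.abs(v * (1.0 - h[1:] ** 2))
for gap in (1, 5, 10, 20, 50):
    print(f"gap {gap:3d}: |dh_t/dh_k| ~ {np.prod(factors[-gap:]):.2e}")
```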
Dealing with Vanishing Gradient

Force all gradients to be either 0 or 1:

$$g(a) = \mathrm{ReLU}(a) = \max(0, a) \qquad\qquad g'(a) = 1_{a > 0}$$

Build Recurrent Neural Networks using small modules that are designed to remember values for a long time:

$$h^t = v^{(1)} h^{t-1} + w^{(1)} x \qquad\qquad y^t = w^{(2)} g(h^t) \qquad\text{with } v^{(1)} = 1$$

It only accumulates the input ...
Long-Short Term Memories

Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem by designing a
memory cell using logistic and linear units with multiplicative interactions:

• Information gets into the cell


whenever its “write” gate is on.
• The information stays in the cell
so long as its “keep” gate is on.
• Information is read from the cell
by turning on its “read” gate.

We can backpropagate through this since the loop has a fixed weight.
RNN vs. LSTM

RNN

[Figure: the repeating module of a standard RNN (image from Chris Olah's blog)]

Long Short Term Memory

LSTM

[Figure: the repeating module of an LSTM cell (image from Chris Olah's blog)]

Long Short Term Memory

Input gate

[Figure: the LSTM cell with the input gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Forget gate

[Figure: the LSTM cell with the forget gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Memory gate

[Figure: the LSTM cell with the memory gate highlighted (image from Chris Olah's blog)]

Long Short Term Memory

Output gate

[Figure: the LSTM cell with the output gate highlighted (image from Chris Olah's blog)]
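The gate slides above can be summarized in code. Here is a minimal numpy sketch of one LSTM step in the standard formulation (forget, input, candidate and output gates acting on a cell state); the function name lstm_step and the stacked weight layout are illustrative assumptions, not the slides' notation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, I + H): rows for forget, input, candidate, output."""
    z = W @ np.concatenate([x, h_prev]) + b        # all gate pre-activations at once
    H = len(h_prev)
    f = sigmoid(z[0:H])                            # forget gate: how much of the old cell state to keep
    i = sigmoid(z[H:2 * H])                        # input gate: how much new content to write
    g = np.tanh(z[2 * H:3 * H])                    # candidate values to be written
    o = sigmoid(z[3 * H:4 * H])                    # output gate: what to expose as the hidden state
    c = f * c_prev + i * g                         # cell state: keep part of the old state, add new content
    h = o * np.tanh(c)                             # hidden state read out from the cell
    return h, c

# toy usage
rng = np.random.default_rng(0)
I, H = 3, 4
W, b = rng.normal(0, 0.1, (4 * H, I + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=I), np.zeros(H), np.zeros(H), W, b)
```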
Gated Recurrent Unit (GRU)

It combines the forget and input gates into a single “update gate.” It also merges the
cell state and hidden state, and makes some other changes.

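For reference, one standard formulation of the GRU update matching the description above (a single update gate z_t, a reset gate r_t, and a merged cell/hidden state h_t); this notation is an assumption about the variant being described, not taken from the slide:

$$\begin{aligned}
z_t &= \sigma\!\left(W_z\,[\,h_{t-1},\ x_t\,]\right) && \text{update gate (merged forget and input gates)}\\
r_t &= \sigma\!\left(W_r\,[\,h_{t-1},\ x_t\,]\right) && \text{reset gate}\\
\tilde{h}_t &= \tanh\!\left(W\,[\,r_t \odot h_{t-1},\ x_t\,]\right) && \text{candidate state}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{merged cell/hidden state}
\end{aligned}$$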
LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: a computation graph of hidden layers unrolled over time, mapping inputs X0, X1, …, Xt to outputs Y0, Y1, …, Yt]
LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: the same computation graph built from stacked LSTM layers unrolled over time, with ReLU layers producing the outputs Y0, Y1, …, Yt]
Sequential Data Problems

• Fixed-sized input to fixed-sized output (e.g. image classification)
• Sequence output (e.g. image captioning takes an image and outputs a sentence of words)
• Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g. video classification where we wish to label each frame of the video)

Credits: Andrej Karpathy
Sequence to Sequence Modeling

Given <S, T> pairs, read S, and output T’ that best matches T

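A minimal sketch of this read-then-write pattern, assuming simple tanh recurrences for both encoder and decoder and greedy argmax decoding; all names (encode, decode_greedy, the embedding and weight matrices, the BOS/EOS indices) are made up for illustration, not the slides' model.

```python
import numpy as np

def encode(src_ids, Emb, Wxh, Whh):
    """Read the source sequence S and compress it into a final hidden state."""
    h = np.zeros(Whh.shape[0])
    for tok in src_ids:
        h = np.tanh(Wxh @ Emb[tok] + Whh @ h)
    return h

def decode_greedy(h, Emb, Wxh, Whh, Wout, bos=0, eos=1, max_len=20):
    """Generate T' token by token, feeding back the previous prediction."""
    out, tok = [], bos
    for _ in range(max_len):
        h = np.tanh(Wxh @ Emb[tok] + Whh @ h)     # decoder step, started from the encoder state
        tok = int(np.argmax(Wout @ h))            # greedy choice of the next target token
        if tok == eos:
            break
        out.append(tok)
    return out

# toy sizes: vocabulary of 10 tokens, embedding/hidden size 8
rng = np.random.default_rng(0)
V, H = 10, 8
Emb = rng.normal(0, 0.1, (V, H))
enc_Wxh, enc_Whh = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
dec_Wxh, dec_Whh, Wout = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (V, H))

h_src = encode([3, 5, 2], Emb, enc_Wxh, enc_Whh)
print(decode_greedy(h_src, Emb, dec_Wxh, dec_Whh, Wout))
```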
Tips & Tricks

When conditioning on a full input sequence, bidirectional RNNs can exploit it (a minimal sketch follows below):
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation

When initializing an RNN we need to specify the initial state:
• We could initialize it to a fixed value (such as 0)
• It is better to treat the initial state as learned parameters:
  • Start off with random guesses of the initial state values
  • Backpropagate the prediction error through time all the way to the initial state values and compute the gradient of the error with respect to these
  • Update these parameters by gradient descent
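A minimal sketch of the bidirectional feature extraction described above, reusing a simple tanh recurrence; the names (rnn_states, birnn_features) and the two weight sets are illustrative assumptions.

```python
import numpy as np

def rnn_states(xs, Wxh, Whh):
    """Hidden states of a simple tanh RNN over the sequence xs (one state per step)."""
    h, states = np.zeros(Whh.shape[0]), []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)
        states.append(h)
    return np.stack(states)                        # shape (T, H)

def birnn_features(xs, fwd, bwd):
    """Concatenate left-to-right and right-to-left hidden states at every time step."""
    h_fwd = rnn_states(xs, *fwd)                   # reads the sequence left-to-right
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]       # reads right-to-left, then re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=1)  # shape (T, 2H): the feature representation

# toy usage
rng = np.random.default_rng(0)
I, H, T = 4, 6, 10
fwd = (rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)))
bwd = (rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)))
xs = rng.normal(size=(T, I))
print(birnn_features(xs, fwd, bwd).shape)          # (10, 12)
```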
Acknowledgements

These slides are largely based on material taken from:

• Geoffrey Hinton
• Hugo Larochelle
• Andrej Karpathy
• Nando De Freitas
• Chris Olah

You can find more details in the original slides.

The amazing images of the LSTM cell are taken from Chris Olah's blog:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Deep Learning: Theory, Techniques & Applications
- Recurrent Neural Networks -

Prof. Matteo Matteucci – matteo.matteucci@polimi.it

Department of Electronics, Information and Bioengineering


Artificial Intelligence and Robotics Lab - Politecnico di Milano
