[Figure: a feedforward neural network with inputs x_1, …, x_I, weights w_ji, and outputs g_1(x | w), …, g_K(x | w).]
Sequence Modeling
[Figure: a sequence of input vectors X_0, X_1, X_2, X_3, …, X_t over time, each with components x_1, …, x_I.]
Sequence Modeling
Memoryless models:
• Autoregressive models
• Feedforward neural networks

Models with memory:
• Linear dynamical systems
• Hidden Markov models

[Figure: the input sequence X_0, X_1, X_2, X_3, …, X_t unrolled over time.]
Memoryless Models for Sequences
Autoregressive models
• Predict the next input from previous ones using «delay taps»

Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers (see the sketch below)

[Figure: top, an autoregressive model with delay-tap weights W_{t-2}, W_{t-1} predicting X_t; bottom, a feedforward network in which the delayed inputs pass through a hidden layer before predicting X_t.]
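As an illustration of the delay-tap idea, here is a minimal NumPy sketch (function and variable names are my own) of a feedforward predictor that maps a fixed window of the K previous input vectors to the next one; dropping the hidden layer would reduce it to a linear autoregressive model.

```python
import numpy as np

def predict_next(window, W1, b1, W2, b2):
    """Feedforward prediction of x_t from the K previous inputs (delay taps).

    window : array of shape (K * I,) -- the K previous input vectors, concatenated
    W1, b1 : hidden-layer weights and bias (the non-linear generalization of AR models)
    W2, b2 : output-layer weights and bias
    """
    h = np.tanh(W1 @ window + b1)   # non-linear hidden layer
    return W2 @ h + b2              # predicted next input vector

# Toy usage: I = 3 input components, K = 2 delay taps, 5 hidden units
rng = np.random.default_rng(0)
I, K, H = 3, 2, 5
W1, b1 = rng.normal(size=(H, K * I)), np.zeros(H)
W2, b2 = rng.normal(size=(I, H)), np.zeros(I)
x_prev = rng.normal(size=K * I)     # concatenation of x_{t-2} and x_{t-1}
print(predict_next(x_prev, W1, b1, W2, b2))
```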
Dynamical Systems
Stochastic systems ...
Generative models with a real-valued hidden state which cannot be observed directly.

[Figure: a chain of hidden states unrolled over time, each driving the next hidden state and generating an observation.]
Recurrent Neural Networks
Deterministic systems ...
• Non-linear dynamics allow complex hidden-state updates

[Figure: a recurrent network with inputs x_1, …, x_I, hidden units, recurrent connections, and output g^t(x | w).]
Recurrent Neural Networks
• Non-linear dynamics allow complex hidden-state updates

[Figure: a recurrent network with inputs x_1, …, x_I, hidden units h_j^t, context units c_1^t(x, W_B, V_B), …, c_B^t(x, W_B, V_B) fed back from c_1^{t-1}, …, c_B^{t-1}, and output g^t(x | w).]

$$ g^t(x_n \mid \mathbf{w}) = g\left( \sum_{j=0}^{J} w_{1j}^{(2)} \, h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \, c_b^t(\cdot) \right) $$

$$ h_j^t(\cdot) = h_j\left( \sum_{i=0}^{I} w_{ji}^{(1)} \, x_{i,n} + \sum_{b=0}^{B} v_{jb}^{(1)} \, c_b^{t-1} \right) $$

$$ c_b^t(\cdot) = c_b\left( \sum_{i=0}^{I} v_{bi}^{(1)} \, x_{i,n} + \sum_{b'=0}^{B} v_{bb'}^{(1)} \, c_{b'}^{t-1} \right) $$
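A minimal NumPy sketch of one forward step of these equations (variable names, the tanh activations, and the identity output activation are my own assumptions):

```python
import numpy as np

def rnn_step(x, c_prev, W1, V1, V_in, V_cc, W2, V2):
    """One time step of the recurrent network defined above.

    x      : input vector x_n at time t, shape (I,)
    c_prev : context units c^{t-1}, shape (B,)
    Returns the output g^t and the new context c^t.
    """
    # Hidden units: h_j^t = h( sum_i w_ji x_i + sum_b v_jb c_b^{t-1} )
    h = np.tanh(W1 @ x + V1 @ c_prev)
    # Context units: c_b^t = c( sum_i v_bi x_i + sum_b' v_bb' c_b'^{t-1} )
    c = np.tanh(V_in @ x + V_cc @ c_prev)
    # Output: g^t = g( sum_j w_1j h_j^t + sum_b v_1b c_b^t ), taking g as the identity here
    y = W2 @ h + V2 @ c
    return y, c

# Toy usage on a short random sequence
rng = np.random.default_rng(0)
I, J, B = 4, 3, 2
W1, V1 = rng.normal(size=(J, I)), rng.normal(size=(J, B))
V_in, V_cc = rng.normal(size=(B, I)), rng.normal(size=(B, B))
W2, V2 = rng.normal(size=(1, J)), rng.normal(size=(1, B))
c = np.zeros(B)
for x in rng.normal(size=(5, I)):
    y, c = rnn_step(x, c, W1, V1, V_in, V_cc, W2, V2)
```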
Backpropagation Through Time
[Figure: the recurrent network from the previous slide, highlighting the context units c_1^t(x, W_B, V_B), …, c_B^t(x, W_B, V_B) and their delayed copies c_1^{t-1}, …, c_B^{t-1}.]
Backpropagation Through Time
[Figure: the recurrent network unrolled over several time steps; every copy of the network shares the same weights.]
Backpropagation Through Time
The network is unrolled for U time steps; since every copy shares the same weights, the update for the shared recurrent weights averages the gradient contributions of all the copies:

$$ V = V - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V^{\,t-u} \qquad\qquad V_B = V_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \nabla V_B^{\,t-u} $$

where $\nabla V^{\,t-u}$ denotes the gradient computed for the copy of the weights at time step t − u.

[Figure: the unrolled network with the weight copies V_B^{t-3}, V_B^{t-2}, V_B^{t-1}, V_B^{t}.]
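A toy sketch of backpropagation through time for a scalar RNN h_t = tanh(v·h_{t-1} + w·x_t), y_t = w_out·h_t, accumulating the gradient of the shared recurrent weight v over all unrolled copies and averaging it as in the update above (the tanh nonlinearity and all names are my own choices):

```python
import numpy as np

def bptt_grad_v(xs, targets, v, w, w_out):
    """Return the loss and the averaged gradient of the shared recurrent weight v."""
    U = len(xs)
    hs, loss = [0.0], 0.0
    for x, t in zip(xs, targets):            # forward pass, storing hidden states
        hs.append(np.tanh(v * hs[-1] + w * x))
        loss += 0.5 * (w_out * hs[-1] - t) ** 2
    grad_v, dh_next = 0.0, 0.0
    for u in range(U, 0, -1):                 # backward pass through time
        dy = w_out * hs[u] - targets[u - 1]
        dh = dy * w_out + dh_next             # gradient flowing into h_u
        da = dh * (1.0 - hs[u] ** 2)          # through the tanh nonlinearity
        grad_v += da * hs[u - 1]              # contribution of the copy at step u
        dh_next = da * v                      # pass the gradient back to h_{u-1}
    return loss, grad_v / U                   # average over the U copies

# Toy usage
xs, targets = [0.5, -0.2, 0.1, 0.7], [0.4, 0.1, 0.0, 0.6]
print(bptt_grad_v(xs, targets, v=0.9, w=1.0, w_out=1.0))
```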
How much should we go back in time?
Jane walked into the room. John walked in too. It was late in the day. Jane said hi to <???>

However, backpropagation through time was not able to train recurrent neural networks to use information from significantly far back in time ...

[Figure: the recurrent network with context units c_1^t(x, W_B, V_B) and feedback c_1^{t-1}.]
How much can we go back in time?
To better understand why it was not working, let us consider a simplified case:

$$ h^t = v^{(1)} \, h^{t-1} + w^{(1)} x \qquad\qquad y^t = w^{(2)} \, g(h^t) $$

$$ g(a) = \mathrm{ReLU}(a) = \max(0, a) \qquad\qquad g'(a) = \mathbf{1}_{a>0} $$

With $v^{(1)} = 1$ the hidden state only accumulates the input ...

Remedy: build recurrent neural networks using small modules that are designed to remember values for a long time.
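A quick numerical illustration of why this recursion makes BPTT hard: the gradient flowing back k steps through h^t = v^{(1)} h^{t-1} + w^{(1)} x is scaled by (v^{(1)})^k, so it vanishes for |v^{(1)}| < 1 and explodes for |v^{(1)}| > 1; only v^{(1)} = 1, the pure accumulator, preserves it. The values below are arbitrary.

```python
# d h^t / d h^{t-k} = v**k for the linear recursion h^t = v*h^{t-1} + w*x:
# it shrinks or blows up exponentially with the time lag k.
for v in (0.5, 1.0, 1.5):
    print(v, [v ** k for k in (1, 5, 20)])
# 0.5 -> vanishing gradient: 0.5, 0.03125, ~9.5e-07
# 1.0 -> the accumulator case: the gradient is preserved
# 1.5 -> exploding gradient: 1.5, ~7.59, ~3325
```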
Long Short-Term Memories
Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem by designing a memory cell that uses logistic and linear units with multiplicative interactions.

One can backpropagate through the cell since its self-loop has a fixed weight.

[Figure: the LSTM memory cell.]
RNN vs. LSTM
[Figure: the repeating module of a standard RNN.]
Long Short Term Memory
[Figure: the repeating module of an LSTM.]
Long Short Term Memory
Input gate

[Figure: the input gate of the LSTM cell.]
Long Short Term Memory
Forget gate

[Figure: the forget gate of the LSTM cell.]
Long Short Term Memory
Memory gate

[Figure: the memory (cell state) update of the LSTM cell.]
Long Short Term Memory
Output gate

[Figure: the output gate of the LSTM cell.]
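To make the gates concrete, here is a minimal NumPy sketch of a single LSTM step in its standard formulation (weight shapes and names are my own; it follows the walkthrough in Chris Olah's post):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM step. Each W[k] maps [h_prev, x] to one gate's pre-activation."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to erase from the cell
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what to write to the cell
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate memory content
    C = f * C_prev + i * C_tilde              # cell update via multiplicative interactions
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose
    h = o * np.tanh(C)
    return h, C

# Toy usage over a short random sequence
rng = np.random.default_rng(0)
I, H = 3, 4
W = {k: rng.normal(size=(H, H + I)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(6, I)):
    h, C = lstm_step(x, h, C, W, b)
```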
Gated Recurrent Unit (GRU)
It combines the forget and input gates into a single “update gate.” It also merges the
cell state and hidden state, and makes some other changes.
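For reference, the standard GRU equations (as presented in Chris Olah's post):

$$ z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \qquad\qquad r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) $$

$$ \tilde{h}_t = \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \qquad\qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$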
LSTM Networks
[Figure: a recurrent network unrolled over time, mapping inputs X_0, X_1, …, X_t to outputs Y_0, Y_1, …, Y_t through hidden layers.]
LSTM Networks
[Figure: a deep LSTM network unrolled over time: inputs X_0, X_1, …, X_t feed two stacked LSTM layers, whose outputs pass through ReLU units to produce Y_0, Y_1, …, Y_t.]
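A minimal sketch of the stack in the figure, written here with PyTorch as an assumed framework (two stacked LSTM layers followed by a ReLU output layer; all sizes are arbitrary):

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Two stacked LSTM layers followed by a ReLU output layer, one output per time step."""
    def __init__(self, input_size=10, hidden_size=32, output_size=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden_size, output_size), nn.ReLU())

    def forward(self, x):                 # x: (batch, time, input_size)
        h, _ = self.lstm(x)               # h: (batch, time, hidden_size)
        return self.out(h)                # Y_0 ... Y_t, one output per time step

# Toy usage: batch of 4 sequences, 7 time steps, 10 features each
y = StackedLSTM()(torch.randn(4, 7, 10))
print(y.shape)                            # torch.Size([4, 7, 5])
```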
Sequential Data Problems
• Fixed-sized input to fixed-sized output (e.g. image classification)
• Sequence output (e.g. image captioning: takes an image and outputs a sentence of words)
• Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video)

Credits: Andrej Karpathy
Sequence to Sequence Modeling
Given <S, T> pairs, read S, and output T’ that best matches T
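A toy, untrained sketch of the read-S-then-emit-T' loop (every name, size, the random weights, and the greedy decoding choice are my own assumptions; a real system would learn the weights from the <S, T> pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 8                          # toy vocabulary size and hidden size (assumed)
E = rng.normal(size=(V, H))          # token embeddings
W_enc = rng.normal(size=(H, 2 * H))  # encoder recurrence
W_dec = rng.normal(size=(H, 2 * H))  # decoder recurrence
W_out = rng.normal(size=(V, H))      # output projection over the vocabulary

def step(W, h, x):
    # one simple recurrent step: new state from previous state and current input
    return np.tanh(W @ np.concatenate([h, x]))

def seq2seq(source_ids, max_len=5, eos=0):
    h = np.zeros(H)
    for s in source_ids:             # encoder: read the whole source sequence S
        h = step(W_enc, h, E[s])
    out, tok = [], eos               # decoder starts from the encoder's final state
    for _ in range(max_len):         # decoder: emit T' one token at a time
        h = step(W_dec, h, E[tok])
        tok = int(np.argmax(W_out @ h))   # greedy choice of the next token
        out.append(tok)
        if tok == eos:
            break
    return out

print(seq2seq([2, 3, 4]))
```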
Tips & Tricks
When conditioning on a full input sequence, bidirectional RNNs can exploit it:
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of the two hidden layers as the feature representation (see the sketch below)
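A minimal NumPy sketch of the three bullets above (simple tanh RNNs and arbitrary sizes; all names are my own):

```python
import numpy as np

def rnn_pass(xs, W, h0):
    """Run a simple tanh RNN over a sequence and return all hidden states."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

def bidirectional_features(xs, W_fwd, W_bwd, H):
    h0 = np.zeros(H)
    fwd = rnn_pass(xs, W_fwd, h0)                 # left-to-right pass
    bwd = rnn_pass(xs[::-1], W_bwd, h0)[::-1]     # right-to-left pass, re-aligned
    # concatenate the two hidden states at each position as the feature representation
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
I, H, T = 3, 4, 5
W_fwd = rng.normal(size=(H, H + I))
W_bwd = rng.normal(size=(H, H + I))
xs = list(rng.normal(size=(T, I)))
feats = bidirectional_features(xs, W_fwd, W_bwd, H)
print(len(feats), feats[0].shape)    # T positions, each with a 2H-dimensional feature
```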
Acknowledgements
The amazing images of the LSTM cell are taken from Chris Olah's blog:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Deep Learning: Theory, Techniques & Applications
- Recurrent Neural Networks -