Evolving Recurrent Neural Network Architectures

Abstract

The Recurrent Neural Network (RNN) is an extremely powerful sequence model that is often difficult to train. The Long Short-Term Memory (LSTM) is a specific RNN architecture whose design makes it much easier to train. While wildly successful in practice, the LSTM's architecture appears to be ad-hoc, so it is not clear whether it is optimal, and the significance of its individual components is unclear.

In this work, we aim to determine whether the LSTM architecture is optimal or whether much better architectures exist. We conducted a thorough architecture search where we evaluated over ten thousand different RNN architectures, and identified an architecture that outperforms both the LSTM and the recently introduced Gated Recurrent Unit (GRU) on some but not all tasks. We found that adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the GRU.

Work done while the author was at Google.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

The Deep Neural Network (DNN) is an extremely expressive model that can learn highly complex vector-to-vector mappings. The Recurrent Neural Network (RNN) is a DNN that is adapted to sequence data, and as a result the RNN is also extremely expressive. RNNs maintain a vector of activations for each timestep, which makes the RNN extremely deep. Their depth, in turn, makes them difficult to train due to the exploding and the vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994).

There have been a number of attempts to address the difficulty of training RNNs. Vanishing gradients were successfully addressed by Hochreiter & Schmidhuber (1997), who developed the Long Short-Term Memory (LSTM) architecture, which is resistant to the vanishing gradient problem. The LSTM turned out to be easy to use, causing it to become the standard way of dealing with the vanishing gradient problem. Other attempts to overcome the vanishing gradient problem include the use of powerful second-order optimization algorithms (Martens, 2010; Martens & Sutskever, 2011), regularization of the RNN's weights that ensures that the gradient does not vanish (Pascanu et al., 2012), giving up on learning the recurrent weights altogether (Jaeger & Haas, 2004; Jaeger, 2001), and a very careful initialization of the RNN's parameters (Sutskever et al., 2013). Unlike the vanishing gradient problem, the exploding gradient problem turned out to be relatively easy to address by simply enforcing a hard constraint over the norm of the gradient (Mikolov, 2012; Pascanu et al., 2012).

A criticism of the LSTM architecture is that it is ad-hoc and that it has a substantial number of components whose purpose is not immediately apparent. As a result, it is also not clear that the LSTM is an optimal architecture, and it is possible that better architectures exist.
Motivated by this criticism, we attempted to determine whether the LSTM architecture is optimal by means of an extensive evolutionary architecture search. We found specific architectures similar to the Gated Recurrent Unit (GRU) (Cho et al., 2014) that outperformed the LSTM and the GRU on most tasks, although an LSTM variant achieved the best results whenever dropout was used. In addition, by adding a bias of 1 to the LSTM's forget gate (a practice that has been advocated by Gers et al. (2000) but that has not been mentioned in recent LSTM papers), ... the LSTM on all tasks; the same architecture also outperformed the GRU on all tasks, though by a small margin. To outperform the GRU by a significant margin we needed to use different architectures for different tasks.
3.1. The Search Procedure

Our search procedure is straightforward. We maintain a list of the 100 best architectures encountered so far, and estimate the quality of each architecture as the performance of its best hyperparameter setting up to the current point. This top-100 list is initialized with only the LSTM and the GRU, which have been evaluated on every allowable hyperparameter setting (sec. 3.7).

At each step of its execution the algorithm performs one of the following steps:

1. It selects a random architecture from the list of its 100 best architectures, evaluates it on 20 randomly chosen hyperparameter settings (see sec. 3.3 for the hyperparameter distribution) for each of the three tasks, and updates its performance estimate. We maintained a list of the evaluated hyperparameters to avoid evaluating the same hyperparameter configuration more than once. Each task is evaluated on an independent set of 20 hyperparameters, and the architecture's performance estimate is updated to

   \min_{\text{task}} \frac{\text{architecture's best accuracy on task}}{\text{GRU's best accuracy on task}}    (1)

   For a given task, this expression represents the performance of the best known hyperparameter setting over the performance of the best GRU (which has been evaluated on all allowable hyperparameter settings). By selecting the minimum over all tasks, we make sure that our search procedure will look for an architecture that works well on every task (as opposed to merely on most tasks). A short sketch of this scoring rule in code is given after this list.

2. Alternatively, it can propose a new architecture by choosing an architecture from the list of the 100 best architectures and mutating it (see sec. 3.2). First, we use the architecture to solve a simple memorization problem, where a sequence of 5 symbols (with 26 possibilities) is to be read and then reproduced in the same order. This task is exceedingly easy for the LSTM, which can solve it with 25K parameters in under four minutes with all but the most extreme settings of its hyperparameters. We evaluated the candidate architecture on one random setting of its hyperparameters, and discarded it if its performance was below 95% with teacher forcing (Zaremba & Sutskever, 2014). The algorithm is allowed to re-examine a discarded architecture in future rounds, but it is not allowed to run it with a previously-run hyperparameter configuration.

   If an architecture passes the first stage, we evaluate it on the first task on a set of 20 random hyperparameter settings from the distribution described in sec. 3.3. If the architecture's best performance exceeds 90% of the GRU's best performance, then the same procedure is applied to the second, and then to the third task, after which the architecture may be added to the list of the 100 best architectures (so it will displace one of the architectures in the top-100 list). If the architecture does not make it into the top-100 list after this stage, then the hyperparameters it has been evaluated on are stored, and if the architecture is randomly chosen again then it will not need to pass the memorization task and it will only be evaluated on fresh hyperparameter settings.
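To make the scoring rule in eq. (1) and the staged evaluation above concrete, the following minimal Python sketch shows one way they could be implemented. It is only an illustration: the function names, the evaluate callback, and the data structures are our own assumptions, not the paper's actual tooling.

def architecture_score(best_accuracy, gru_best_accuracy):
    """Eq. (1): the minimum over tasks of the architecture's best accuracy
    relative to the GRU's best accuracy on the same task."""
    return min(best_accuracy[task] / gru_best_accuracy[task]
               for task in gru_best_accuracy)

def staged_evaluation(evaluate, tasks, gru_best_accuracy,
                      threshold=0.9, n_settings=20):
    """Evaluate a candidate task by task on 20 fresh hyperparameter settings
    per task (sec. 3.3), stopping early if it falls below 90% of the GRU's
    best performance on any task. evaluate(task, n) is assumed to return the
    best accuracy over n sampled settings."""
    best = {}
    for task in tasks:
        best[task] = evaluate(task, n_settings)
        if best[task] < threshold * gru_best_accuracy[task]:
            return False, best   # rejected; the tried settings are stored elsewhere
    return True, best            # may now displace an entry in the top-100 list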
3.2. Mutation Generation

A candidate architecture is represented with a computational graph, similar to the ones that are easily expressed in Theano (Bergstra et al., 2010) or Torch (Collobert et al., 2011). Each layer is represented with a node, where all nodes have the same dimensionality. Each architecture receives M + 1 nodes as inputs and produces M nodes as outputs, so the first M inputs are the RNN's hidden state at the previous timestep, while the (M + 1)-th input is the external input to the RNN. For example, the LSTM architecture has three inputs (x, h and c) and two outputs (h and c).

We mutate a given architecture by choosing a probability p uniformly at random from [0, 1] and independently applying a randomly chosen transformation from the list below with probability p to each node in the graph. The value of p determines the magnitude of the mutation: this way, most of the mutations are small while some are large. The transformations below are anchored to a "current node"; a toy sketch of the mutation scheme in code is given after the list.

1. If the current node is an activation function, replace it with a randomly chosen activation function. The set of permissible nonlinearities is {tanh(x), sigmoid(x), ReLU(x), Linear(0, x), Linear(1, x), Linear(0.9, x), Linear(1.1, x)}. Here the operator Linear(a, x) introduces a new square parameter matrix W and a bias b, which replace x by W · x + b. The new matrix is initialized to small random values whose magnitude is determined by the scale parameter, and the scalar value a is added to its diagonal entries.

2. If the current node is an elementwise operation, replace it with a different randomly chosen elementwise operation (multiplication, addition, subtraction).
3. Insert a random activation function between the current node and one of its parents.

4. Remove the current node if it has one input and one output.

5. Replace the current node with a randomly chosen node from the current node's ancestors (i.e., all nodes that the current node depends on). This allows us to reduce the size of the graph.

6. Randomly select a node (node A). Replace the current node with either the sum, product, or difference of a random ancestor of the current node and a random ancestor of A. This guarantees that the resulting computation graph remains well-defined.

We apply one to three of these transformations to mutate an architecture. We reject architectures whose number of output vectors is incompatible with the number of its input vectors. If an architecture is rejected, we try again until we find an architecture that is accepted.
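The following toy Python sketch illustrates the per-node mutation scheme on a deliberately simplified graph representation (a flat list of nodes). The node encoding, the subset of transformations shown, and all names are assumptions made for illustration; a real implementation would operate on a Theano- or Torch-style computation graph.

import random

ACTIVATIONS = ["tanh", "sigmoid", "ReLU",
               "Linear(0)", "Linear(1)", "Linear(0.9)", "Linear(1.1)"]
ELEMENTWISE = ["multiplication", "addition", "subtraction"]

def mutate(nodes):
    """Draw p ~ U[0, 1] once, then independently mutate each node with
    probability p (only transformations 1 and 2 from the list are shown)."""
    p = random.random()                     # magnitude of the mutation
    children = []
    for node in nodes:
        node = dict(node)                   # copy: the parent architecture stays unchanged
        if random.random() < p:
            if node["kind"] == "activation":
                node["op"] = random.choice(ACTIVATIONS)
            elif node["kind"] == "elementwise":
                node["op"] = random.choice(
                    [op for op in ELEMENTWISE if op != node["op"]])
        children.append(node)
    return children

parent = [{"kind": "activation", "op": "tanh"},
          {"kind": "elementwise", "op": "addition"},
          {"kind": "activation", "op": "sigmoid"}]
print(mutate(parent))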
3.3. Hyperparameter Selection for New Architectures

The random hyperparameters of a candidate architecture are chosen as follows. If the parent architecture has been evaluated on more than 420 different hyperparameter settings, then 80% of the hyperparameters for the child (i.e., the candidate) come from the best 100 hyperparameter settings of the parent, and the remaining 20% are generated completely randomly. If the parent has had fewer than 420 hyperparameter evaluations, then 33% of the hyperparameters are generated completely randomly, 33% are selected from the best 100 hyperparameters of the LSTM and the GRU (which have been evaluated with a full hyperparameter search and thus have known settings of good hyperparameters), and 33% are selected from the hyperparameters that have already been tried on the parent architecture.
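A minimal sketch of this sampling scheme follows, assuming a hypothetical random_setting() sampler over the search space and simple (setting, score) histories; the exact data structures are not specified in the paper.

import random

def random_setting():
    # Hypothetical sampler over the hyperparameter space; the actual ranges
    # are described in sec. 3.3 / 3.7.
    return {"learning_rate": 10 ** random.uniform(-4, -1),
            "init_scale": 10 ** random.uniform(-2, 0)}

def best_settings(history, k=100):
    """Return the settings of the k best (setting, score) pairs."""
    return [s for s, _ in sorted(history, key=lambda x: x[1], reverse=True)[:k]]

def sample_child_setting(parent_history, seed_history, rich_threshold=420):
    """Sample one hyperparameter setting for a candidate architecture.
    parent_history: (setting, score) pairs tried on the parent architecture.
    seed_history:   (setting, score) pairs of the fully searched LSTM and GRU."""
    u = random.random()
    if len(parent_history) > rich_threshold:
        # Well-explored parent: 80% from its best settings, 20% fully random.
        if u < 0.8:
            return random.choice(best_settings(parent_history))
        return random_setting()
    # Poorly explored parent: roughly a third each from random, the seeds'
    # best settings, and whatever has already been tried on the parent.
    if u < 1.0 / 3 or not parent_history:
        return random_setting()
    if u < 2.0 / 3:
        return random.choice(best_settings(seed_history))
    return random.choice([s for s, _ in parent_history])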
3.4. Some Statistics

We evaluated 10,000 different architectures, and 1,000 of them made it past the initial filtering stage on the memorization problem. Each such architecture was evaluated on 220 hyperparameter settings on average. These figures suggest that every architecture that we evaluated had a reasonable chance of achieving its highest performance. In total, we evaluated 230,000 hyperparameter configurations.

The three tasks used in the search were the following:

• Arithmetic. In this problem, the RNN is required to compute the digits of the sum or difference of two numbers, where it first reads the sequence containing the two arguments and then emits the result, one character at a time (as in Sutskever et al. (2014); Zaremba & Sutskever (2014)). The numbers can have up to 8 digits and can be both positive and negative. To make the task more difficult, we introduced a random number of distractor symbols between successive input characters. Thus a typical instance would be 3e36d9-h1h39f94eeh43keg3c=-13991064, which represents 3369-13994433=-13991064. The RNN reads the input sequence until it observes the "=" sign, after which it is to output the answer, one digit at a time. We used a next-step prediction formulation, so the LSTM observes the correct values of the past digits when predicting the present digit. The total number of different symbols was 40: the 26 lowercase characters, the 10 digits, and "+", "-", "=", and "." (the latter represents the end of the output).

• XML modelling. The goal of this problem is to predict the next character in a synthetic XML dataset. We create XML tags consisting of a random number (between 2 and 10) of lowercase characters. The generation procedure is as follows (a small generator sketch in code is given after this list):
  – With 50% probability, or after 50 iterations, we either close the most recently opened tag, or stop if no open tags remain.
  – Otherwise, open a new random tag.
A short random sample is the following: "<etdomp> <pegshmnaj> <zbhbmg> </zbhbmg> </pegshmnaj> <autmh> </autmh> </etdomp>"

• Penn Tree-Bank (PTB). We also included a word-level language modelling task on the Penn TreeBank (Marcus et al., 1993), following the precise setup of Mikolov et al. (2010), which has 1M words and a vocabulary of size 10,000. We did not use dropout in the search procedure, since the dropout models were too large and too expensive to properly evaluate, although we did evaluate our candidate architectures with dropout later.
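A small Python generator following the two XML-generation rules above; the parameter names and the exact stopping behaviour at the iteration cap are our reading of the description, not code from the paper.

import random
import string

def generate_xml_sample(max_iters=50):
    """Generate one synthetic XML string: with 50% probability (or once the
    iteration cap is reached) close the most recently opened tag, stopping if
    no tags remain open; otherwise open a new random 2-10 character tag."""
    out, stack, iters = [], [], 0
    while True:
        iters += 1
        if random.random() < 0.5 or iters >= max_iters:
            if not stack:
                break
            out.append("</%s>" % stack.pop())
        else:
            tag = "".join(random.choice(string.ascii_lowercase)
                          for _ in range(random.randint(2, 10)))
            stack.append(tag)
            out.append("<%s>" % tag)
    return " ".join(out)

print(generate_xml_sample())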
Although we optimized our architectures on only the previous three tasks, we added music datasets (Nottingham and Piano-Midi; see Table 2) to measure the generalization ability of our architecture search procedure.
Unlike the other tasks, where we needed to predict discrete symbols and therefore used the softmax nonlinearity, for the music datasets we were required to predict binary vectors, so we used the sigmoid. We did not use the music datasets in the search procedure.

For all problems, we used a minibatch of size 20 and unrolled the RNNs for 35 timesteps. We trained on sequences of length greater than 35 by moving the minibatch sequentially through the sequence, and ensuring that the hidden state of the RNN is initialized with the last hidden state of the RNN in the previous minibatch.
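The following numpy sketch illustrates the state carry-over between consecutive 35-step windows described above, using a plain tanh RNN cell as a stand-in for whichever architecture is being trained (the backward pass, which would be truncated at the window boundaries, is omitted).

import numpy as np

def tanh_rnn_step(x_t, h, Wxh, Whh, b):
    """One step of a plain tanh RNN; any recurrent cell could be used instead."""
    return np.tanh(x_t @ Wxh + h @ Whh + b)

def run_truncated(inputs, Wxh, Whh, b, unroll=35):
    """Process a long batched sequence (shape: time x batch x input_size) in
    windows of `unroll` timesteps. The last hidden state of each window is
    carried over as the initial state of the next one, so the network sees
    the whole sequence even though gradients only flow within a window."""
    T, batch, _ = inputs.shape
    h = np.zeros((batch, Whh.shape[0]))       # state before the first window
    states = []
    for start in range(0, T, unroll):
        for t in range(start, min(start + unroll, T)):
            h = tanh_rnn_step(inputs[t], h, Wxh, Whh, b)
            states.append(h)
        # `h` now holds the state that initializes the next minibatch window.
    return states

rng = np.random.RandomState(0)
T, batch, n_in, n_hid = 105, 20, 8, 16
x = rng.randn(T, batch, n_in)
hs = run_truncated(x, Wxh=0.1 * rng.randn(n_in, n_hid),
                   Whh=0.1 * rng.randn(n_hid, n_hid), b=np.zeros(n_hid))
print(len(hs), hs[-1].shape)                  # 105 states, each (20, 16)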
MUT1:
z = sigm(W_{xz} x_t + b_z)
r = sigm(W_{xr} x_t + W_{hr} h_t + b_r)
h_{t+1} = tanh(W_{hh} (r \odot h_t) + tanh(x_t) + b_h) \odot z + h_t \odot (1 - z)

MUT2:
z = sigm(W_{xz} x_t + W_{hz} h_t + b_z)
r = sigm(x_t + W_{hr} h_t + b_r)
h_{t+1} = tanh(W_{hh} (r \odot h_t) + W_{xh} x_t + b_h) \odot z + h_t \odot (1 - z)

MUT3:
z = sigm(W_{xz} x_t + W_{hz} tanh(h_t) + b_z)
r = sigm(W_{xr} x_t + W_{hr} h_t + b_r)
h_{t+1} = tanh(W_{hh} (r \odot h_t) + W_{xh} x_t + b_h) \odot z + h_t \odot (1 - z)

(Here \odot denotes elementwise multiplication.)
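As a concrete reference, the following numpy sketch transcribes MUT1. It uses a row-vector convention and, for simplicity, assumes the input x_t has already been projected to the hidden dimensionality so that the tanh(x_t) term is well defined; the parameter names mirror the equations, but the code itself is a sketch, not the authors' implementation.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mut1_step(x_t, h_t, p):
    """One step of MUT1. `p` maps parameter names (W_xz, b_z, ...) to arrays;
    x_t and h_t have shape (batch, hidden)."""
    z = sigm(x_t @ p["W_xz"] + p["b_z"])
    r = sigm(x_t @ p["W_xr"] + h_t @ p["W_hr"] + p["b_r"])
    h_tilde = np.tanh((r * h_t) @ p["W_hh"] + np.tanh(x_t) + p["b_h"])
    return h_tilde * z + h_t * (1.0 - z)

n_hid, batch = 16, 4
rng = np.random.RandomState(0)
params = {name: 0.1 * rng.randn(n_hid, n_hid)
          for name in ["W_xz", "W_xr", "W_hr", "W_hh"]}
params.update({name: np.zeros(n_hid) for name in ["b_z", "b_r", "b_h"]})
h_next = mut1_step(rng.randn(batch, n_hid), np.zeros((batch, n_hid)), params)
print(h_next.shape)                           # (4, 16)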
Arch.    Arith.    XML       PTB
Tanh     0.29493   0.32050   0.08782
LSTM     0.89228   0.42470   0.08912
LSTM-f   0.29292   0.23356   0.08808
LSTM-i   0.75109   0.41371   0.08662
LSTM-o   0.86747   0.42117   0.08933
LSTM-b   0.90163   0.44434   0.08952
GRU      0.89565   0.45963   0.09069
MUT1     0.92135   0.47483   0.08968
MUT2     0.89735   0.47324   0.09036
MUT3     0.90728   0.46478   0.09161

Table 2. Negative Log Likelihood on the music datasets. N stands for Nottingham, N-dropout stands for Nottingham with nonzero dropout, and P stands for Piano-Midi.

Arch.    N        N-dropout   P
Tanh     3.612    3.267       6.809
LSTM     3.492    3.403       6.866
LSTM-f   3.732    3.420       6.813
LSTM-i   3.426    3.252       6.856
LSTM-o   3.406    3.253       6.870
LSTM-b   3.419    3.345       6.820
GRU      3.410    3.427       6.876
MUT1     3.254    3.376       6.792
MUT2     3.372    3.429       6.852
MUT3     3.337    3.505       6.840
Our results also shed light on the significance of the various LSTM gates. The forget gate turns out to be of the greatest importance. When the forget gate is removed, the LSTM exhibits drastically inferior performance on the ARITH and XML problems, although it is relatively unimportant in language modelling, consistent with Mikolov et al. (2014). The second most significant gate turns out to be the input gate. The output gate was the least important for the performance of the LSTM. When removed, h_t simply becomes tanh(c_t), which was sufficient for retaining most of the LSTM's performance. However, the results on the music datasets with dropout do not conform to this pattern, since there LSTM-i and LSTM-o achieved the best performance.

But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique is the simplest to implement, we recommend it for every LSTM implementation. Interestingly, this idea has already been stated in the paper that introduced the forget gate to the LSTM (Gers et al., 2000).

... the GRU and MUT1. Thus we recommend adding a bias of 1 to the forget gate of every LSTM in every application; it is easy to do and often results in better performance on our tasks. This adjustment is the simple improvement over the LSTM that we set out to discover.
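For reference, initializing the forget-gate bias to 1 amounts to a one-line change in most LSTM implementations. The sketch below assumes a single packed bias vector with the common [input, forget, cell, output] gate ordering; other implementations order the gates differently, so the slice that must be set to 1 may vary.

import numpy as np

def init_lstm_bias(hidden_size, forget_bias=1.0):
    """Packed LSTM bias with the forget-gate block set to `forget_bias`.
    Assumed gate order: [input, forget, cell, output]."""
    b = np.zeros(4 * hidden_size)
    b[hidden_size:2 * hidden_size] = forget_bias   # the forget-gate slice
    return b

print(init_lstm_bias(4))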
References

Amit, Daniel J. Modeling brain function: The world of attractor neural networks. Cambridge University Press, 1992.

Bayer, Justin, Wierstra, Daan, Togelius, Julian, and Schmidhuber, Jürgen. Evolving memory cell structures for sequence learning. In Artificial Neural Networks – ICANN 2009, pp. 755–764. Springer, 2009.

Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Greff, Klaus, Srivastava, Rupesh Kumar, Koutník, Jan, Steunebrink, Bas R., and Schmidhuber, Jürgen. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.

Hinton, Geoffrey E. and Shallice, Tim. Lesioning an attractor network: investigations of acquired dyslexia. Psychological Review, 98(1):74, 1991.

Hochreiter, Sepp. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Institut für Informatik, Technische Universität München, 1991.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jaeger, Herbert. The echo state approach to analysing and training recurrent neural networks (with an erratum note). German National Research Center for Information Technology GMD Technical Report, 148:34, Bonn, Germany, 2001.

Jaeger, Herbert and Haas, Harald. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.

Plaut, David C. Semantic and associative priming in a distributed attractor network. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, volume 17, pp. 37–42, Pittsburgh, PA, 1995.

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.