
Learning to Understand Phrases by Embedding the Dictionary

Felix Hill
Computer Laboratory, University of Cambridge
felix.hill@cl.cam.ac.uk

Kyunghyun Cho
Courant Institute of Mathematical Sciences and Centre for Data Science, New York University
kyunghyun.cho@nyu.edu

Anna Korhonen
Department of Theoretical and Applied Linguistics, University of Cambridge
alk23@cam.ac.uk

Yoshua Bengio
CIFAR Senior Fellow, Université de Montréal
yoshua.bengio@umontreal.ca

Abstract

Distributional models that learn rich semantic word representations are a success story of recent NLP research. However, developing models that learn useful representations of phrases and sentences has proved far harder. We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well or better than existing commercial systems that rely on significant task-specific engineering. The results highlight the effectiveness of both neural embedding architectures and definition-based training for developing models that understand phrases and sentences.

1 Introduction

Much recent research in computational semantics has focussed on learning representations of arbitrary-length phrases and sentences. This task is challenging partly because there is no obvious gold standard of phrasal representation that could be used in training and evaluation. Consequently, it is difficult to design approaches that could learn from such a gold standard, and also hard to evaluate or compare different models.

In this work, we use dictionary definitions to address this issue. The composed meaning of the words in a dictionary definition (a tall, long-necked, spotted ruminant of Africa) should correspond to the meaning of the word they define (giraffe). This bridge between lexical and phrasal semantics is useful because high quality vector representations of single words can be used as a target when learning to combine the words into a coherent phrasal representation.

This approach still requires a model capable of learning to map between arbitrary-length phrases and fixed-length continuous-valued word vectors. For this purpose we experiment with two broad classes of neural language models (NLMs): Recurrent Neural Networks (RNNs), which naturally encode the order of input words, and simpler (feed-forward) bag-of-words (BOW) embedding models. Prior to training these NLMs, we learn target lexical representations by training the Word2Vec software (Mikolov et al., 2013) on billions of words of raw text.

We demonstrate the usefulness of our approach by building and releasing two applications. The first is a reverse dictionary or concept finder: a system that returns words based on user descriptions or definitions (Zock and Bilac, 2004). Reverse dictionaries are used by copywriters, novelists, translators and other professional writers to find words for notions or ideas that might be on the tip of their tongue.

* Work mainly done at the University of Montreal.

For instance, a travel-writer might look to enhance her prose by searching for examples of a country that people associate with warm weather or an activity that is mentally or physically demanding. We show that an NLM-based reverse dictionary trained on only a handful of dictionaries identifies novel definitions and concept descriptions comparably or better than commercial systems, which rely on significant task-specific engineering and access to much more dictionary data. Moreover, by exploiting models that learn bilingual word representations (Vulic et al., 2011; Klementiev et al., 2012; Hermann and Blunsom, 2013; Gouws et al., 2014), we show that the NLM approach can be easily extended to produce a potentially useful cross-lingual reverse dictionary.

The second application of our models is as a general-knowledge crossword question answerer. When trained on both dictionary definitions and the opening sentences of Wikipedia articles, NLMs produce plausible answers to (non-cryptic) crossword clues, even those that apparently require detailed world knowledge. Both BOW and RNN models can outperform bespoke commercial crossword solvers, particularly when clues contain a greater number of words. Qualitative analysis reveals that NLMs can learn to relate concepts that are not directly connected in the training data and can thus generalise well to unseen input. To facilitate further research, all of our code, training and evaluation sets (together with a system demo) are published online with this paper.1

1 https://www.cl.cam.ac.uk/fh295/

2 Neural Language Model Architectures

The first model we apply to the dictionary-based learning task is a recurrent neural network (RNN). RNNs operate on variable-length sequences of inputs; in our case, natural language definitions, descriptions or sentences. RNNs (with LSTMs) have achieved state-of-the-art performance in language modelling (Mikolov et al., 2010), image caption generation (Kiros et al., 2015) and approach state-of-the-art performance in machine translation (Bahdanau et al., 2015).

During training, the input to the RNN is a dictionary definition or sentence from an encyclopedia. The objective of the model is to map these defining phrases or sentences to an embedding of the word that the definition defines. The target word embeddings are learned independently of the RNN weights, using the Word2Vec software (Mikolov et al., 2013).

The set of all words in the training data constitutes the vocabulary of the RNN. For each word in this vocabulary we randomly initialise a real-valued vector (input embedding) of model parameters. The RNN reads the first word in the input by applying a non-linear projection of its embedding v_1, parameterised by input weight matrix W and b, a vector of biases:

A_1 = \phi(W v_1 + b)

yielding the first internal activation state A_1. In our implementation, we use \phi(x) = tanh(x), though in theory \phi can be any differentiable non-linear function. Subsequent internal activations (after time-step t) are computed by projecting the embedding of the t-th word and using this information to update the internal activation state:

A_t = \phi(U A_{t-1} + W v_t + b).

As such, the values of the final internal activation state units A_N are a weighted function of all input word embeddings, and constitute a summary of the information in the sentence.

2.1 Long Short Term Memory

A known limitation when training RNNs to read language using gradient descent is that the error signal (gradient) on the training examples either vanishes or explodes as the number of time steps (sentence length) increases (Bengio et al., 1994). Consequently, after reading longer sentences the final internal activation A_N typically retains useful information about the most recently read (sentence-final) words, but can neglect important information near the start of the input sentence. LSTMs (Hochreiter and Schmidhuber, 1997) were designed to mitigate this long-term dependency problem.
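For concreteness, the following short sketch (ours, not the authors' released code) shows how the plain RNN update above reads a definition word by word; the dimensions, initialisation and parameter names are illustrative assumptions, and the LSTM described next replaces this single tanh update with the gated memory defined below.

import numpy as np

rng = np.random.RandomState(0)
d_emb, d_hid = 256, 512                          # illustrative sizes (cf. Section 2.5)

W = rng.normal(scale=0.1, size=(d_hid, d_emb))   # input projection
U = rng.normal(scale=0.1, size=(d_hid, d_hid))   # recurrent weights
b = np.zeros(d_hid)

def rnn_read(definition_vectors):
    """Apply A_t = tanh(U A_{t-1} + W v_t + b) over the input embeddings and
    return the final state A_N, the model's summary of the definition."""
    A = np.zeros(d_hid)
    for v in definition_vectors:
        A = np.tanh(U @ A + W @ v + b)
    return A

# e.g. a three-word definition with random stand-in embeddings
summary = rnn_read([rng.normal(size=d_emb) for _ in range(3)])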

At each time step t, in place of the single internal layer of units A, the LSTM RNN computes six internal layers i^w, g^i, g^f, g^o, h and m. The first, i^w, represents the core information passed to the LSTM unit by the latest input word at t. It is computed as a simple linear projection of the input embedding v_t (by input weights W_w) and the output state of the LSTM at the previous time step h_{t-1} (by update weights U_w):

i^w_t = W_w v_t + U_w h_{t-1} + b_w

The layers g^i, g^f and g^o are computed as weighted sigmoid functions of the input embeddings, again parameterised by layer-specific weight matrices W and U:

g^s_t = 1 / (1 + exp(-(W_s v_t + U_s h_{t-1} + b_s)))

where s stands for one of i, f or o. These vectors take values on [0, 1] and are often referred to as gating activations. Finally, the internal memory state m_t and new output state h_t of the LSTM at t are computed as

m_t = i^w_t \odot g^i_t + m_{t-1} \odot g^f_t
h_t = g^o_t \odot \phi(m_t),

where \odot indicates elementwise vector multiplication and \phi is, as before, some non-linear function (we use tanh). Thus, g^i determines to what extent the new input word is considered at each time step, g^f determines to what extent the existing state of the internal memory is retained or forgotten in computing the new internal memory, and g^o determines how much this memory is considered when computing the output state at t.

The sentence-final memory state of the LSTM, m_N, a summary of all the information in the sentence, is then projected via an extra non-linear projection (parameterised by a further weight matrix) to a target embedding space. This layer enables the target (defined) word embedding space to take a different dimension to the activation layers of the RNN, and in principle enables a more complex definition-reading function to be learned.

2.2 Bag-of-Words NLMs

We implement a simpler linear bag-of-words (BOW) architecture for encoding the definition phrases. As with the RNN, this architecture learns an embedding v_i for each word in the model vocabulary, together with a single matrix of input projection weights W. The BOW model simply maps an input definition with word embeddings v_1 ... v_n to the sum of the projected embeddings \sum_{i=1}^{n} W v_i. This model can also be considered a special case of an RNN in which the update function U and nonlinearity are both the identity, so that reading the next word in the input phrase updates the current representation more simply:

A_t = A_{t-1} + W v_t.

2.3 Pre-trained Input Representations

We experiment with variants of these models in which the input definition embeddings are pre-learned and fixed (rather than randomly-initialised and updated) during training. There are several potential advantages to taking this approach. First, the word embeddings are trained on massive corpora and may therefore introduce additional linguistic or conceptual knowledge to the models. Second, at test time, the models will have a larger effective vocabulary, since the pre-trained word embeddings typically span a larger vocabulary than the union of all dictionary definitions used to train the model. Finally, the models will then map to and from the same space of embeddings (the embedding space will be closed under the operation of the model), so conceivably could be more easily applied as a general-purpose composition engine.

2.4 Training Objective

We train all neural language models M to map the input definition phrase s_c defining word c to a location close to the pre-trained embedding v_c of c. We experiment with two different cost functions for the word-phrase pair (c, s_c) from the training data. The first is simply the cosine distance between M(s_c) and v_c. The second is the rank loss

max(0, m - cos(M(s_c), v_c) + cos(M(s_c), v_r))

where v_r is the embedding of a randomly-selected word from the vocabulary other than c. This loss function was used for language models, for example, in (Huang et al., 2012). In all experiments we apply a margin m = 0.1, which has been shown to work well on word-retrieval tasks (Bordes et al., 2015).
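The BOW composition and the two training costs above are equally simple to state in code. The sketch below is ours and only illustrates the forward computation and the per-example losses; the projection matrix, the target Word2Vec vectors and the sampling of the random confounder v_r are stand-ins, and in the paper the parameters are fitted with minibatch SGD and adadelta (Section 2.5).

import numpy as np

rng = np.random.RandomState(1)
d_emb, d_target = 256, 500                         # illustrative sizes
W = rng.normal(scale=0.1, size=(d_target, d_emb))  # single BOW projection matrix

def bow_encode(definition_vectors):
    # A_N = sum_t W v_t : order-invariant sum of projected input embeddings
    return np.sum([W @ v for v in definition_vectors], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cosine_loss(phrase_vec, target_vec):
    # first objective: cosine distance between M(s_c) and v_c
    return 1.0 - cosine(phrase_vec, target_vec)

def rank_loss(phrase_vec, target_vec, random_vec, margin=0.1):
    # second objective: max(0, m - cos(M(s_c), v_c) + cos(M(s_c), v_r))
    return max(0.0, margin - cosine(phrase_vec, target_vec)
                    + cosine(phrase_vec, random_vec))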

2.5 Implementation Details

Since training on the dictionary data took 6-10 hours, we did not conduct a hyper-parameter search on any validation sets over the space of possible model configurations such as embedding dimension or size of hidden layers. Instead, we chose these parameters to be as standard as possible based on previous research. For fair comparison, any aspects of model design that are not specific to a particular class of model were kept constant across experiments.

The pre-trained word embeddings used in all of our models (either as input or target) were learned by a continuous bag-of-words (CBOW) model using the Word2Vec software on approximately 8 billion words of running text.2 When training such models on massive corpora, a large embedding length of up to 700 has been shown to yield the best performance (see e.g. (Faruqui et al., 2014)). The pre-trained embeddings used in our models were of length 500, as a compromise between quality and memory constraints.

2 The Word2Vec embedding models are well known; further details can be found at https://code.google.com/p/word2vec/. The training data for this pre-training was compiled from various online text sources using the script demo-train-big-model-v1.sh from the same page.

In cases where the word embeddings are learned during training on the dictionary objective, we make these embeddings shorter (256), since they must be learned from much less language data. In the RNN models, at each time step each of the four LSTM RNN internal layers (gating and activation states) had length 512, another standard choice (see e.g. (Cho et al., 2014)). The final hidden state was mapped linearly to length 500, the dimension of the target embedding. In the BOW models, the projection matrix projects input embeddings (either learned, of length 256, or pre-trained, of length 500) to length 500 for summing.

All models were implemented with Theano (Bergstra et al., 2010) and trained with minibatch SGD on GPUs. The batch size was fixed at 16 and the learning rate was controlled by adadelta (Zeiler, 2012).

3 Reverse Dictionaries

The most immediate application of our trained models is as a reverse dictionary or concept finder. It is simple to look up a definition in a dictionary given a word, but professional writers often also require suitable words for a given idea, concept or definition.3 Reverse dictionaries satisfy this need by returning candidate words given a phrase, description or definition. For instance, when queried with the phrase an activity that requires strength and determination, the OneLook.com reverse dictionary returns the concepts exercise and work. Our trained RNN model can perform a similar function, simply by mapping a phrase to a point in the target (Word2Vec) embedding space, and returning the words corresponding to the embeddings that are closest to that point.

3 See the testimony from professional writers at http://www.onelook.com/?c=awards

Several other academic studies have proposed reverse dictionary models. These generally rely on common techniques from information retrieval, comparing definitions in their internal database to the input query, and returning the word whose definition is closest to that query (Bilac et al., 2003; Bilac et al., 2004; Zock and Bilac, 2004). Proximity is quantified differently in each case, but is generally a function of hand-engineered features of the two sentences. For instance, Shaw et al. (2013) propose a method in which the candidates for a given input query are all words in the model's database whose definitions contain one or more words from the query. This candidate list is then ranked according to a query-definition similarity metric based on the hypernym and hyponym relations in WordNet, features commonly used in IR such as tf-idf, and a parser.

There are, in addition, at least two commercial online reverse dictionary applications, whose architecture is proprietary knowledge. The first is the Dictionary.com reverse dictionary,4 which retrieves candidate words from the Dictionary.com dictionary based on user definitions or descriptions. The second is OneLook.com, whose algorithm searches 1061 indexed dictionaries, including all major freely-available online dictionaries and resources such as Wikipedia and WordNet.

4 Available at http://dictionary.reference.com/reverse/
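Retrieval in our reverse dictionary is a nearest-neighbour search in the target embedding space: the trained model encodes the query phrase, and the closest pre-trained word vectors give the candidates. As a hedged sketch (the variable names and the shape of the embedding matrix are assumptions, not the released implementation), the query procedure can be written as:

import numpy as np

def reverse_dictionary(definition_vec, target_embeddings, vocab, k=5):
    """Return the k vocabulary words whose pre-trained target embeddings are
    closest (by cosine similarity) to the encoded definition.

    target_embeddings: array of shape (|vocab|, 500); vocab: list of words."""
    E = target_embeddings / np.linalg.norm(target_embeddings, axis=1, keepdims=True)
    q = definition_vec / np.linalg.norm(definition_vec)
    sims = E @ q
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]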

3.1 Data Collection and Training

To compile a bank of dictionary definitions for training the model, we started with all words in the target embedding space. For each of these words, we extracted dictionary-style definitions from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster's. We chose these five dictionaries because they are freely available via the WordNik API,5 but in theory any dictionary could be chosen. Most words in our training data had multiple definitions. For each word w with definitions {d_1 ... d_n} we included all pairs (w, d_1) ... (w, d_n) as training examples.

5 See http://developer.wordnik.com

To allow models access to more factual knowledge than might be present in a dictionary (for instance, information about specific entities, places or people), we supplemented this training data with information extracted from Simple Wikipedia.6 For every word in the model's target embedding space that is also the title of a Wikipedia article, we treat the sentences in the first paragraph of the article as if they were (independent) definitions of that word. When a word in Wikipedia also occurs in one (or more) of the five training dictionaries, we simply add these pseudo-definitions to the training set of definitions for the word. Combining Wikipedia and dictionaries in this way resulted in 900,000 word-definition pairs of 100,000 unique words.

6 https://simple.wikipedia.org/wiki/Main_Page

To explore the effect of the quantity of training data on the performance of the models, we also trained models on subsets of this data. The first subset comprised only definitions from WordNet (approximately 150,000 definitions of 75,000 words). The second subset comprised only words in WordNet and their first definitions (approximately 75,000 word-definition pairs).7 For all variants of RNN and BOW models, however, reducing the training data in this way resulted in a clear reduction in performance on all tasks. For brevity, we therefore do not present these results in what follows.

7 As with other dictionaries, the first definition in WordNet generally corresponds to the most typical or common sense of a word.

3.2 Comparisons

As a baseline, we also implemented two entirely unsupervised methods using the neural (Word2Vec) word embeddings from the target word space. In the first (W2V add), we compose the embeddings for each word in the input query by pointwise addition, and return as candidates the nearest word embeddings to the resulting composed vector.8 The second baseline (W2V mult) is identical except that the embeddings are composed by elementwise multiplication. Both methods are established ways of building phrase representations from word embeddings (Mitchell and Lapata, 2010).

8 Since we retrieve all answers from embedding spaces by cosine similarity, addition of word embeddings is equivalent to taking the mean.

None of the models or evaluations from previous academic research on reverse dictionaries is publicly available, so direct comparison is not possible. However, we do compare performance with the commercial systems. The Dictionary.com system returned no candidates for over 96% of our input definitions. We therefore conduct detailed comparison with OneLook.com, which is the first reverse dictionary tool returned by a Google search and seems to be the most popular among writers.

3.3 Reverse Dictionary Evaluation

To our knowledge there are no established means of measuring reverse dictionary performance. In the only previous academic research on English reverse dictionaries that we are aware of, evaluation was conducted on 300 word-definition pairs written by lexicographers (Shaw et al., 2013). Since these are not publicly available we developed new evaluation sets and make them freely available for future evaluations.

The evaluation items are of three types, designed to test different properties of the models. To create the seen evaluation, we randomly selected 500 words from the WordNet training data (seen by all models), and then randomly selected a definition for each word. Testing models on the resulting 500 word-definition pairs assesses their ability to recall or decode previously encoded information.
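The two unsupervised baselines of Section 3.2 reduce to elementwise operations over the query's Word2Vec vectors; a minimal sketch, assuming w2v is simply a mapping from words to their pre-trained vectors, is:

import numpy as np

def w2v_add(query_words, w2v):
    # pointwise addition; under cosine retrieval this is equivalent to the mean
    return np.sum([w2v[w] for w in query_words if w in w2v], axis=0)

def w2v_mult(query_words, w2v):
    # elementwise multiplication of the query word embeddings
    vectors = [w2v[w] for w in query_words if w in w2v]
    composed = vectors[0].copy()
    for v in vectors[1:]:
        composed = composed * v
    return composed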

Test Set                  | Seen (500 WN defs)      | Unseen (500 WN defs)    | Concept descriptions (200)
                          | rank  acc@10/100  var   | rank  acc@10/100  var   | rank  acc@10/100  var
Unsup.  W2V add           |   -      -          -   |  923   .04/.16    163   |  339   .07/.30    150
models  W2V mult          |   -      -          -   | 1000   .00/.00    10*   | 1000   .00/.00    27*
        OneLook           |   0    .89/.91     67   |    -      -         -   | 18.5   .38/.58    153
NLMs    RNN cosine        |  12    .48/.73    103   |   22   .41/.70    116   |   69   .28/.54    157
        RNN w2v cosine    |  19    .44/.70    111   |   19   .44/.69    126   |   26   .38/.66    111
        RNN ranking       |  18    .45/.67    128   |   24   .43/.69    103   |   25   .34/.66    102
        RNN w2v ranking   |  54    .32/.56    155   |   33   .36/.65    137   |   30   .33/.69     77
        BOW cosine        |  22    .44/.65    129   |   19   .43/.69    103   |   50   .34/.60     99
        BOW w2v cosine    |  15    .46/.71    124   |   14   .46/.71    104   |   28   .36/.66     99
        BOW ranking       |  17    .45/.68    115   |   22   .42/.70     95   |   32   .35/.69    101
        BOW w2v ranking   |  55    .32/.56    155   |   36   .35/.66    138   |   38   .33/.72     85

(rank = median rank, lower better; acc@10/100 = accuracy@10/100, higher better; var = rank variance, lower better. The Seen and Unseen sets consist of dictionary definitions.)

Table 1: Performance of different reverse dictionary models in different evaluation settings. *Low variance in mult models is due to consistently poor scores, so not highlighted.
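The three statistics reported in Table 1 can be computed directly from the rank assigned to the correct word on each test item; the following sketch is our reading of the metric definitions in Section 3.3 rather than the evaluation script itself.

import numpy as np

def ranking_statistics(ranks):
    """ranks[i] is the position of the correct word in the model's candidate
    list for test item i (1 = returned first)."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "median rank": float(np.median(ranks)),        # lower is better
        "accuracy@10": float(np.mean(ranks <= 10)),    # higher is better
        "accuracy@100": float(np.mean(ranks <= 100)),  # higher is better
        "rank variance": float(np.var(ranks)),         # lower is better
    }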

For the unseen evaluation, we randomly selected 500 words from WordNet and excluded all definitions of these words from the training data of all models.

Finally, for a fair comparison with OneLook, which has both the seen and unseen pairs in its internal database, we built a new dataset of concept descriptions that do not appear in the training data for any model. To do so, we randomly selected 200 adjectives, nouns or verbs from among the top 3000 most frequent tokens in the British National Corpus (Leech et al., 1994) (but outside the top 100). We then asked ten native English speakers to write a single-sentence description of these words. To ensure the resulting descriptions were good quality, for each description we asked two participants who did not produce that description to list any words that fitted the description (up to a maximum of three). If the target word was not produced by one of the two checkers, the original participant was asked to re-write the description until the validation was passed.9 These concept descriptions, together with the other evaluation sets, can be downloaded from our website for future comparisons.

9 Re-writing was required in 6 of the 200 cases.

Test set               Word     Description
Dictionary definition  valve    control consisting of a mechanical device for controlling fluid flow
Concept description    prefer   when you like one thing more than another thing

Table 2: Style difference between dictionary definitions and concept descriptions in the evaluation.

Given a test description, definition, or question, all models produce a ranking of possible word answers based on the proximity of their representations of the input phrase and all possible output words. To quantify the quality of a given ranking, we report three statistics: the median rank of the correct answer (over the whole test set, lower better), the proportion of test cases in which the correct answer appears in the top 10/100 in this ranking (accuracy@10/100 - higher better) and the variance of the rank of the correct answer across the test set (rank variance - lower better).

3.4 Results

Table 1 shows the performance of the different models in the three evaluation settings. Of the unsupervised composition models, elementwise addition is clearly more effective than multiplication, which almost never returns the correct word as the nearest neighbour of the composition. Overall, however, the supervised models (RNN, BOW and OneLook) clearly outperform these baselines.

The results indicate interesting differences between the NLMs and the OneLook dictionary search engine. The Seen (WN first) definitions in Table 1 occur in both the training data for the NLMs and the lookup data for the OneLook model. Clearly the OneLook algorithm is better than NLMs at retrieving already available information (returning 89% of correct words among the top-ten candidates on this set).

However, this is likely to come at the cost of a greater memory footprint, since the model requires access to its database of dictionaries at query time.10

10 The trained neural language models are approximately half the size of the six training dictionaries stored as plain text, so would be hundreds of times smaller than the OneLook database of 1061 dictionaries if stored this way.

The performance of the NLM embedding models on the (unseen) concept descriptions task shows that these models can generalise well to novel, unseen queries. While the median rank for OneLook on this evaluation is lower, the NLMs retrieve the correct answer in the top ten candidates approximately as frequently, within the top 100 candidates more frequently and with lower variance in ranking over the test set. Thus, NLMs seem to generalise more consistently than OneLook on this dataset, in that they generally assign a reasonably high ranking to the correct word. In contrast, as can also be verified by querying our web demo, OneLook tends to perform either very well or poorly on a given query.11

11 We also observed that the mean ranking for NLMs was lower than for OneLook on the concept descriptions task.

When comparing between NLMs, perhaps the most striking observation is that the RNN models do not significantly outperform the BOW models, even though the BOW model output is invariant to changes in the order of words in the definition. Users of the online demo can verify that the BOW models recover concepts from descriptions strikingly well, even when the words in the description are permuted. This observation underlines the importance of lexical semantics in the interpretation of language by NLMs, and is consistent with some other recent work on embedding sentences (Iyyer et al., 2015).

It is difficult to observe clear trends in the differences between NLMs that learn input word embeddings and those with pre-trained (Word2Vec) input embeddings. Both types of input yield good performance in some situations and weaker performance in others. In general, pre-training input embeddings seems to help most on the concept descriptions, which are furthest from the training data in terms of linguistic style. This is perhaps unsurprising, since models that learn input embeddings from the dictionary data acquire all of their conceptual knowledge from this data (and thus may overfit to this setting), whereas models with pre-trained embeddings have some semantic memory acquired from general running-text language data and other knowledge acquired from the dictionaries.

3.5 Qualitative Analysis

Some example output from the various models is presented in Table 3. The differences illustrated here are also evident from querying the web demo. The first example shows how the NLMs (BOW and RNN) generalise beyond their training data. Four of the top five responses could be classed as appropriate in that they refer to inhabitants of cold countries. However, inspecting the WordNik training data, there is no mention of cold or anything to do with climate in the definitions of Eskimo, Scandinavian, Scandinavia etc. Therefore, the embedding models must have learned that coldness is a characteristic of Scandinavia, Siberia, Russia, relates to Eskimos etc. via connections with other concepts that are described or defined as cold. In contrast, the candidates produced by the OneLook and (unsupervised) W2V baseline models have nothing to do with coldness.

The second example demonstrates how the NLMs generally return candidates whose linguistic or conceptual function is appropriate to the query. For a query referring explicitly to a means, method or process, the RNN and BOW models produce verbs in different forms or an appropriate deverbal noun. In contrast, OneLook returns words of all types (aerodynamics, draught) that are arbitrarily related to the words in the query. A similar effect is apparent in the third example. While the candidates produced by the OneLook model are the correct part of speech (Noun), and related to the query topic, they are not semantically appropriate. The dictionary embedding models are the only ones that return a list of plausible habits, the class of noun requested by the input.

3.6 Cross-Lingual Reverse Dictionaries

We now show how the RNN architecture can be easily modified to create a bilingual reverse dictionary - a system that returns candidate words in one language given a description or definition in another. A bilingual reverse dictionary could have clear applications for translators or transcribers.
Input Description                 OneLook                       W2V add             RNN                        BOW
a native of a cold country        1:country 2:citizen           1:a 2:the           1:eskimo 2:scandinavian    1:frigid 2:cold
                                  3:foreign 4:naturalize        3:another 4:of      3:arctic 4:indian          3:icy 4:russian
                                  5:cisco                       5:whole             5:siberian                 5:indian
a way of moving through the air   1:drag 2:whiz                 1:the 2:through     1:glide 2:scooting         1:flying 2:gliding
                                  3:aerodynamics 4:draught      3:a 4:moving        3:glides 4:gliding         3:glide 4:fly
                                  5:coefficient of drag         5:in                5:flight                   5:scooting
a habit that might annoy          1:sisterinlaw 2:fatherinlaw   1:annoy 2:your      1:bossiness 2:jealousy     1:infidelity 2:bossiness
your spouse                       3:motherinlaw 4:stepson       3:might 4:that      3:annoyance 4:rudeness     3:foible 4:unfaithfulness
                                  5:stepchild                   5:either            5:boorishness              5:adulterous

Table 3: The top-five candidates for example queries (invented by the authors) from different reverse dictionary models. Both the RNN and BOW models are without Word2Vec input and use the cosine loss.

Input description                     RNN EN-FR                   W2V add                     RNN + Google
an emotion that you might feel        triste, pitoyable,          insister, effectivement,    sentiment, regretter,
after being rejected                  répugnante, épouvantable    pourquoi, nous              peur, aversion
a small black flying insect that      mouche, canard,             attentivement, pouvions,    voler, faucon,
transmits disease and likes horses    hirondelle, pigeon          pourrons, naturellement     mouches, volant

Table 4: Responses from cross-lingual reverse dictionary models to selected queries. Underlined responses are correct or potentially useful for a native French speaker.
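Only the target space changes in the cross-lingual variant described in this section: the candidate matrix holds the French vectors of a shared bilingual embedding space, so an English definition retrieves French words with the same nearest-neighbour step used monolingually. A hedged sketch (function and variable names are assumed, not from the released code):

import numpy as np

def cross_lingual_lookup(definition_vec, fr_embeddings, fr_vocab, k=5):
    # definition_vec: English definition encoded into the bilingual space;
    # fr_embeddings / fr_vocab: the French half of that space (e.g. BilBOWA-style)
    E = fr_embeddings / np.linalg.norm(fr_embeddings, axis=1, keepdims=True)
    q = definition_vec / np.linalg.norm(definition_vec)
    top = np.argsort(-(E @ q))[:k]
    return [fr_vocab[i] for i in top]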

Indeed, the problem of attaching appropriate words to concepts may be more common when searching for words in a second language than in a monolingual context.

To create the bilingual variant, we simply replace the Word2Vec target embeddings with those from a bilingual embedding space. Bilingual embedding models use bilingual corpora to learn a space of representations of the words in two languages, such that words from either language that have similar meanings are close together (Hermann and Blunsom, 2013; Chandar et al., 2014; Gouws et al., 2014). For a test-of-concept experiment, we used English-French embeddings learned by the state-of-the-art BilBOWA model (Gouws et al., 2014) from the Wikipedia (monolingual) and Europarl (bilingual) corpora.12 We trained the RNN model to map from English definitions to English words in the bilingual space. At test time, after reading an English definition, we then simply return the nearest French word neighbours to that definition.

12 The approach should work with any bilingual embeddings. We thank Stephan Gouws for doing the training.

Because no benchmarks exist for quantitative evaluation of bilingual reverse dictionaries, we compare this approach qualitatively with two alternative methods for mapping definitions to words across languages. The first is analogous to the W2V add model of the previous section: in the bilingual embedding space, we first compose the embeddings of the English words in the query definition with elementwise addition, and then return the French word whose embedding is nearest to this vector sum. The second uses the RNN monolingual reverse dictionary model to identify an English word from an English definition, and then translates that word using Google Translate.

Table 4 shows that the RNN model can be effectively modified to create a cross-lingual reverse dictionary. It is perhaps unsurprising that the W2V add model candidates are generally the lowest in quality given the performance of the method in the monolingual setting. In comparing the two RNN-based methods, the RNN (embedding space) model appears to have two advantages over the RNN + Google approach.

First, it does not require online access to a bilingual word-word mapping as defined e.g. by Google Translate. Second, it is less prone to errors caused by word sense ambiguity. For example, in response to the query an emotion you feel after being rejected, the bilingual embedding RNN returns emotions or adjectives describing mental states. In contrast, the monolingual+Google model incorrectly maps the plausible English response regret to the verbal infinitive regretter. The model makes the same error when responding to a description of a fly, returning the verb voler (to fly).

3.7 Discussion

We have shown that simply training RNN or BOW NLMs on six dictionaries yields a reverse dictionary that performs comparably to the leading commercial system, even with access to much less dictionary data. Indeed, the embedding models consistently return syntactically and semantically plausible responses, which are generally part of a more coherent and homogeneous set of candidates than those produced by the commercial systems. We also showed how the architecture can be easily extended to produce bilingual versions of the same model.

In the analyses performed thus far, we only test the dictionary embedding approach on tasks that it was trained to accomplish (mapping definitions or descriptions to words). In the next section, we explore whether the knowledge learned by dictionary embedding models can be effectively transferred to a novel task.

4 General Knowledge (crossword) Question Answering

The automatic answering of questions posed in natural language is a central problem of Artificial Intelligence. Although web search and IR techniques provide a means to find sites or documents related to language queries, at present, internet users requiring a specific fact must still sift through pages to locate the desired information.

Systems that attempt to overcome this, via fully open-domain or general knowledge question-answering (open QA), generally require large teams of researchers, modular design and powerful infrastructure, exemplified by IBM's Watson (Ferrucci et al., 2010). For this reason, much academic research focuses on settings in which the scope of the task is reduced. This has been achieved by restricting questions to a specific topic or domain (Molla and Vicedo, 2007), allowing systems access to pre-specified passages of text from which the answer can be inferred (Iyyer et al., 2014; Weston et al., 2015), or centering both questions and answers on a particular knowledge base (Berant and Liang, 2014; Bordes et al., 2014).

In what follows, we show that the dictionary embedding models introduced in the previous sections may form a useful component of an open QA system. Given the absence of a knowledge base or web-scale information in our architecture, we narrow the scope of the task by focusing on general knowledge crossword questions. General knowledge (non-cryptic, or quick) crosswords appear in national newspapers in many countries. Crossword question answering is more tractable than general open QA for two reasons. First, models know the length of the correct answer (in letters), reducing the search space. Second, some crossword questions mirror definitions, in that they refer to fundamental properties of concepts (a twelve-sided shape) or request a category member (a city in Egypt).13

13 As our interest is in language understanding, we do not address the question of fitting answers into a grid, which is the main concern of end-to-end automated crossword solvers (Littman et al., 2002).

4.1 Evaluation

General knowledge crossword questions come in different styles and forms. We used the Eddie James crossword website to compile a bank of sentence-like general-knowledge questions.14 Eddie James is one of the UK's leading crossword compilers, working for several national newspapers. Our long question set consists of the first 150 questions (starting from puzzle #1) from his general-knowledge crosswords, excluding clues of fewer than four words and those whose answer was not a single word (e.g. kingjames).

14 http://www.eddiejames.co.uk/

To evaluate models on a different type of clue, we also compiled a set of shorter questions based on the Guardian Quick Crossword. Guardian questions still require general factual or linguistic knowledge, but are generally shorter and somewhat more cryptic than the longer Eddie James clues.

We again formed a list of 150 questions, beginning on 1 January 2015 and excluding any questions with multiple-word answers. For clear contrast, we excluded those few questions of length greater than four words. Of these 150 clues, a subset of 30 were single-word clues. All evaluation datasets are available online with the paper.

As with the reverse dictionary experiments, candidates are extracted from models by inputting definitions and returning words corresponding to the closest embeddings in the target space. In this case, however, we only consider candidate words whose length matches the length specified in the clue.

Test set          Word         Description
Long (150)        Baudelaire   French poet and key figure in the development of Symbolism.
Short (120)       satanist     devil devotee
Single-Word (30)  guilt        culpability

Table 5: Examples of the different question types in the crossword question evaluation dataset.

4.2 Benchmarks and Comparisons

As with the reverse dictionary experiments, we compare RNN and BOW NLMs with a simple unsupervised baseline of elementwise addition of Word2Vec vectors in the embedding space (we discard the ineffective W2V mult baseline), again restricting candidates to words of the pre-specified length. We also compare to two bespoke online crossword-solving engines. The first, One Across (http://www.oneacross.com/), is the candidate generation module of the award-winning Proverb crossword system (Littman et al., 2002). Proverb, which was produced by academic researchers, has featured in national media such as New Scientist, and beaten expert humans in crossword solving tournaments. The second comparison is with Crossword Maestro (http://www.crosswordmaestro.com/), a commercial crossword solving system that handles both cryptic and non-cryptic crossword clues (we focus only on the non-cryptic setting), and has also been featured in national media.15 We are unable to compare against a third well-known automatic crossword solver, Dr Fill (Ginsberg, 2011), because code for Dr Fill's candidate-generation module is not readily available. As with the RNN and baseline models, when evaluating existing systems we discard candidates whose length does not match the length specified in the clue.

15 See e.g. http://www.theguardian.com/crosswords/crossword-blog/2012/mar/08/crossword-blog-computers-crack-cryptic-clues

Certain principles connect the design of the existing commercial systems and differentiate them from our approach. Unlike the NLMs, they each require query-time access to large databases containing common crossword clues, dictionary definitions, the frequency with which words typically appear as crossword solutions and other hand-engineered and task-specific components (Littman et al., 2002; Ginsberg, 2011).

4.3 Results

The performance of models on the various question types is presented in Table 6. When evaluating the two commercial systems, One Across and Crossword Maestro, we have access to web interfaces that return up to approximately 100 candidates for each query, so can only reliably record membership of the top ten (accuracy@10).

On the long questions, we observe a clear advantage for all dictionary embedding models over the commercial systems and the simple unsupervised baseline. Here, the best performing NLM (RNN with Word2Vec input embeddings and ranking loss) ranks the correct answer third on average, and in the top-ten candidates over 60% of the time.

As the questions get shorter, the advantage of the embedding models diminishes. Both the unsupervised baseline and One Across answer the short questions with comparable accuracy to the RNN and BOW models. One reason for this may be the difference in form and style between the shorter clues and the full definitions or encyclopedia sentences in the dictionary training data. As the length of the clue decreases, finding the answer often reduces to generating synonyms (culpability - guilt), or category members (tall animal - giraffe). The commercial systems can retrieve good candidates for such clues among their databases of entities, relationships and common crossword answers.

Question Type        Long (150)               Short (120)              Single-Word (30)
                     rank  acc@10/100   var   rank  acc@10/100   var   rank  acc@10/100   var
One Across            -      .39/ -      -     -      .68/ -      -     -      .70/ -      -
Crossword Maestro     -      .27/ -      -     -      .43/ -      -     -      .73/ -      -
W2V add              42      .31/.63    92    11      .50/.78    66     2      .79/.90    45
RNN cosine           15      .43/.69   108    22      .39/.67   117    72      .31/.52   187
RNN w2v cosine        4      .61/.82    60     7      .56/.79    60    12      .48/.72   116
RNN ranking           6      .58/.84    48    10      .51/.73    57    12      .48/.69    67
RNN w2v ranking       3      .62/.80    61     8      .57/.78    49    12      .48/.69   114
BOW cosine            4      .60/.82    54     7      .56/.78    51    12      .45/.72   137
BOW w2v cosine        4      .60/.83    56     7      .54/.80    48     3      .59/.79   111
BOW ranking           5      .62/.87    50     8      .58/.83    37     8      .55/.79    39
BOW w2v ranking       5      .60/.86    48     8      .56/.83    35     4      .55/.83    43

(rank = average rank, lower better; acc@10/100 = accuracy@10/100, higher better; var = rank variance, lower better)

Table 6: Performance of different models on crossword questions of different length. The two commercial systems are evaluated via their web interface so only accuracy@10 can be reported in those cases.
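For the crossword task, the only addition to the retrieval step is the length constraint described in Sections 4.1-4.2: candidates whose letter count does not match the number given in the clue are discarded before ranking. A sketch of ours, under the same assumptions as the earlier retrieval snippet:

import numpy as np

def crossword_candidates(clue_vec, target_embeddings, vocab, answer_length, k=5):
    # keep only words of the specified length, then rank the survivors by
    # cosine similarity to the encoded clue
    keep = [i for i, w in enumerate(vocab) if len(w) == answer_length]
    E = target_embeddings[np.array(keep)]
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = clue_vec / np.linalg.norm(clue_vec)
    sims = E @ q
    order = np.argsort(-sims)[:k]
    return [vocab[keep[i]] for i in order]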

Unsupervised Word2Vec representations are also known to encode these sorts of relationships (even after elementwise addition for short sequences of words) (Mikolov et al., 2013). This would also explain why the dictionary embedding models with pre-trained (Word2Vec) input embeddings outperform those with learned embeddings, particularly for the shortest questions.

4.4 Qualitative Analysis

A better understanding of how the different models arrive at their answers can be gained from considering specific examples, as presented in Table 7. The first three examples show that, despite the apparently superficial nature of its training data (definitions and introductory sentences), embedding models can answer questions that require factual knowledge about people and places. Another notable characteristic of these models is the consistent semantic appropriateness of the candidate set. In the first case, the top five candidates are all mountains, valleys or places in the Alps; in the second, they are all biblical names. In the third, the RNN model retrieves currencies, in this case performing better than the BOW model, which retrieves entities of various types associated with the Netherlands. Generally speaking (as can be observed via the web demo), the smoothness or consistency in candidate generation of the dictionary embedding models is greater than that of the commercial systems. Despite its simplicity, the unsupervised W2V addition method is at times also surprisingly effective, as shown by the fact that it returns Joshua in its top candidates for the third query.

The final example in Table 7 illustrates the surprising power of the BOW model. In the training data there is a single definition for the correct answer Schoenberg: United States composer and musical theorist (born in Austria) who developed atonal composition. The only word common to both the query and the definition is composer (there is no tokenization that allows the BOW model to directly connect atonal and atonality). Nevertheless, the model is able to infer the necessary connections between the concepts in the query and the definition to return Schoenberg as the top candidate.

Despite such cases, it remains an open question whether, with more diverse training data, the world knowledge required for full open QA (e.g. secondary facts about Schoenberg, such as his family) could be encoded and retained as weights in a (larger) dynamic network, or whether it will be necessary to combine the RNN with an external memory that is less frequently (or never) updated. This latter approach has begun to achieve impressive results on certain QA and entailment tasks (Bordes et al., 2014; Graves et al., 2014; Weston et al., 2015).

5 Conclusion

Dictionaries exist in many of the world's languages. We have shown how these lexical resources can constitute valuable data for training the latest neural language models to interpret and represent the meaning of phrases and sentences.
Input Description             One Across                    Crossword Maestro             BOW                           RNN
Swiss mountain peak famed     1:noted 2:front               1:after 2:favor               1:Eiger 2:Crags               1:Eiger 2:Aosta
for its north face (5)        3:Eiger 4:crown 5:fount       3:ahead 4:along 5:being       3:Teton 4:Cerro 5:Jebel       3:Cuneo 4:Lecco 5:Tyrol
Old Testament successor       1:Joshua 2:Exodus             1:devise 2:Daniel             1:Isaiah 2:Elijah             1:Joshua 2:Isaiah
to Moses (6)                  3:Hebrew 4:person 5:across    3:Haggai 4:Isaiah 5:Joseph    3:Joshua 4:Elisha 5:Yahweh    3:Gideon 4:Elijah 5:Yahweh
The former currency of        1:Holland 2:general           1:Holland 2:ancient           1:Guilder 2:Holland           1:Guilder 2:Escudos
the Netherlands (7)           3:Lesotho                     3:earlier 4:onetime           3:Drenthe 4:Utrecht           3:Pesetas 4:Someren
                                                            5:qondam                      5:Naarden                     5:Florins
Arnold, 20th Century          1:surrealism 2:laborparty     1:disharmony 2:dissonance     1:Schoenberg 2:Christleib     1:Mendelsohn 2:Williamson
composer, pioneer of          3:tonemusics 4:introduced     3:bringabout 4:constitute     3:Stravinsky 4:Elderfield     3:Huddleston 4:Mandelbaum
atonality (10)                5:Schoenberg                  5:triggeroff                  5:Mendelsohn                  5:Zimmerman

Table 7: Responses from different models to example crossword clues. In each case the model output is filtered to exclude any candidates that are not of the same length as the correct answer. BOW and RNN models are trained without Word2Vec input embeddings and use the cosine loss.

While humans use the phrasal definitions in dictionaries to better understand the meaning of words, machines can use the words to better understand the phrases. We used two dictionary embedding architectures - a recurrent neural network architecture with a long short-term memory, and a simpler linear bag-of-words model - to explicitly exploit this idea.

On the reverse dictionary task that mirrors its training setting, NLMs that embed all known concepts in a continuous-valued vector space perform comparably to the best known commercial applications despite having access to many fewer definitions. Moreover, they generate smoother sets of candidates and require no linguistic pre-processing or task-specific engineering. We also showed how the description-to-word objective can be used to train models useful for other tasks. NLMs trained on the same data can answer general-knowledge crossword questions, and indeed outperform commercial systems on questions containing more than four words. While our QA experiments focused on crosswords, the results suggest that a similar embedding-based approach may ultimately lead to improved output from more general QA and dialog systems and information retrieval engines in general.

We make all code, training data, evaluation sets and both of our linguistic tools publicly available online for future research. In particular, we propose the reverse dictionary task as a comparatively general-purpose and objective way of evaluating how well models compose lexical meaning into phrase or sentence representations (whether or not they involve training on definitions directly).

In the next stage of this research, we will explore ways to enhance the NLMs described here, especially in the question-answering context. The models are currently not trained on any question-like language, and would conceivably improve on exposure to such linguistic forms. We would also like to understand better how BOW models can perform so well with no awareness of word order, and whether there are specific linguistic contexts in which models like RNNs or others with the power to encode word order are indeed necessary. Finally, we intend to explore ways to endow the model with richer world knowledge. This may require the integration of an external memory module, similar to the promising approaches proposed in several recent papers (Graves et al., 2014; Weston et al., 2015).

Acknowledgments

KC and YB acknowledge the support of the following organizations: NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR. FH and AK were supported by a Google Faculty Research Award, and AK further by a Google European Fellowship.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157-166.
Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Association for Computational Linguistics.
James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Slaven Bilac, Timothy Baldwin, and Hozumi Tanaka. 2003. Improving dictionary accessibility by maximizing use of available knowledge. Traitement Automatique des Langues, 44(2):199-224.
Slaven Bilac, Wataru Watanabe, Taiichi Hashimoto, Takenobu Tokunaga, and Hozumi Tanaka. 2004. Dictionary search based on the target word description. In Proceedings of NLP 2004.
Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of EMNLP.
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853-1861.
Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59-79.
Matthew L. Ginsberg. 2011. Dr. FILL: Crosswords and an implemented solver for singly weighted CSPs. Journal of Artificial Intelligence Research, pages 851-886.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the NIPS Deep Learning Workshop.
Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.
Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. In Proceedings of ICLR.
Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the Association for Computational Linguistics.
Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daume III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.
Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, to appear.
Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING.
Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of COLING.
Michael L. Littman, Greg A. Keim, and Noam Shazeer. 2002. A probabilistic approach to solving crossword puzzles. Artificial Intelligence, 134(1):23-55.
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH 2010.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor-
rado, and Jeff Dean. 2013. Distributed representations
of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems.
Jeff Mitchell and Mirella Lapata. 2010. Composition in
distributional models of semantics. Cognitive Science,
34(8):1388-1429.
Diego Molla and Jose Luis Vicedo. 2007. Question an-
swering in restricted domains: An overview. Compu-
tational Linguistics, 33(1):41-61.
Ryan Shaw, Anindya Datta, Debra VanderMeer, and
Kaushik Dutta. 2013. Building a scalable database-
driven reverse dictionary. Knowledge and Data Engi-
neering, IEEE Transactions on, 25(3):528-540.
Ivan Vulic, Wim De Smet, and Marie-Francine Moens.
2011. Identifying word translations from comparable
corpora using latent topic models. In Proceedings of
the Association for Computational Linguistics.
Jason Weston, Antoine Bordes, Sumit Chopra, and
Tomas Mikolov. 2015. Towards AI-complete question
answering: A set of prerequisite toy tasks. In arXiv
preprint arXiv:1502.05698.
Matthew D. Zeiler. 2012. Adadelta: An adaptive learn-
ing rate method. In arXiv preprint arXiv:1212.5701.
Michael Zock and Slaven Bilac. 2004. Word lookup on
the basis of associations: From an idea to a roadmap.
In Proceedings of the ACL Workshop on Enhancing
and Using Electronic Dictionaries.
