
An Efficient Algorithm for Easy-First Non-Directional Dependency Parsing

Yoav Goldberg and Michael Elhadad


Ben Gurion University of the Negev
Department of Computer Science
POB 653 Beer Sheva, 84105, Israel
{yoavg|elhadad}@cs.bgu.ac.il

Abstract

We present a novel deterministic dependency parsing algorithm that attempts to create the easiest arcs in the dependency structure first, in a non-directional manner. Traditional deterministic parsing algorithms are based on a shift-reduce framework: they traverse the sentence from left to right and, at each step, perform one of a possible set of actions, until a complete tree is built. A drawback of this approach is that it is extremely local: while decisions can be based on complex structures on the left, they can look only at a few words to the right. In contrast, our algorithm builds a dependency tree by iteratively selecting the best pair of neighbours to connect at each parsing step. This allows incorporation of features from already built structures both to the left and to the right of the attachment point. The parser learns both the attachment preferences and the order in which they should be performed. The result is a deterministic, best-first, O(n log n) parser, which is significantly more accurate than best-first transition-based parsers, and nears the performance of globally optimized parsing models.

1 Introduction

Dependency parsing has been a topic of active research in natural language processing in the last several years. An important part of this research effort are the CoNLL 2006 and 2007 shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007), which allowed for a comparison of many algorithms and approaches for this task on many languages.*

Current dependency parsers can be categorized into three families: local-and-greedy transition-based parsers (e.g., MaltParser (Nivre et al., 2006)), globally optimized graph-based parsers (e.g., MSTParser (McDonald et al., 2005)), and hybrid systems (e.g., (Sagae and Lavie, 2006b; Nivre and McDonald, 2008)), which combine the output of various parsers into a new and improved parse, and which are orthogonal to our approach.

Transition-based parsers scan the input from left to right, are fast (O(n)), and can make use of rich feature sets, which are based on all the previously derived structures. However, all of their decisions are very local, and the strict left-to-right order implies that, while the feature set can use rich structural information from the left of the current attachment point, it is very restricted in information to the right of the attachment point: traditionally, only the next two or three input tokens are available to the parser. This limited look-ahead window leads to error propagation and worse performance on root and long distant dependencies relative to graph-based parsers (McDonald and Nivre, 2007).

Graph-based parsers, on the other hand, are globally optimized. They perform an exhaustive search over all possible parse trees for a sentence, and find the highest scoring tree. In order to make the search tractable, the feature set needs to be restricted to features over single edges (first-order models) or edge pairs (higher-order models, e.g. (McDonald and Pereira, 2006; Carreras, 2007)). There are several attempts at incorporating arbitrary tree-based features, but these involve either solving an ILP problem (Riedel and Clarke, 2006) or using computa-

* Supported by the Lynn and William Frankel Center for Computer Sciences, Ben Gurion University.

Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 742-750, Los Angeles, California, June 2010. (c) 2010 Association for Computational Linguistics
[Figure 1 shows six parsing steps, (1)-(6): each panel depicts the current pending list with candidate ATTACHRIGHT/ATTACHLEFT actions and their scores drawn as rounded arcs. Image omitted.]

Figure 1: Parsing the sentence "a brown fox jumped with joy". Rounded arcs represent possible actions.
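The derivation in Figure 1 can be replayed as plain list operations (a minimal sketch: the word strings stand in for the partial structures, indices are 0-based here while the figure counts from 1, and the action semantics are the ones defined in Section 3):

```python
# Replaying the Figure 1 derivation as operations on the pending list.
# ATTACHLEFT(i) makes pending[i] the parent of pending[i+1];
# ATTACHRIGHT(i) makes pending[i+1] the parent of pending[i].
# In both cases the child is removed from the list.

def attach_left(pending, arcs, i):
    parent, child = pending[i], pending[i + 1]
    arcs.append((parent, child))
    del pending[i + 1]

def attach_right(pending, arcs, i):
    parent, child = pending[i + 1], pending[i]
    arcs.append((parent, child))
    del pending[i]

pending = ["a", "brown", "fox", "jumped", "with", "joy"]
arcs = []
attach_right(pending, arcs, 1)  # (1): brown attaches under fox
attach_right(pending, arcs, 0)  # (2): a attaches under fox
attach_right(pending, arcs, 0)  # (3): fox attaches under jumped
attach_left(pending, arcs, 1)   # (4): joy attaches under with
attach_left(pending, arcs, 0)   # (5): with attaches under jumped
# pending now holds only "jumped", the root of the tree
```

After the five actions, `arcs` holds the full (parent, child) edge set of the tree in Figure 1(6).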

tionally intensive sampling-based methods (Nakagawa, 2007). As a result, these models, while accurate, are slow (O(n^3) for projective, first-order models, higher polynomials for higher-order models, and worse for richer tree-feature models).

We propose a new category of dependency parsing algorithms, inspired by (Shen et al., 2007): non-directional easy-first parsing. This is a greedy, deterministic parsing approach, which relaxes the left-to-right processing order of transition-based parsing algorithms. By doing so, we allow the explicit incorporation of rich structural features derived from both sides of the attachment point, and implicitly take into account the entire previously derived structure of the whole sentence. This extension allows the incorporation of much richer features than those available to transition- and especially to graph-based parsers, and greatly reduces the locality of transition-based algorithm decisions. On the other hand, it is still a greedy, best-first algorithm, leading to an efficient implementation.

We present a concrete O(n log n) parsing algorithm, which significantly outperforms state-of-the-art transition-based parsers, while closing the gap to graph-based parsers.

2 Easy-first parsing

When humans comprehend a natural language sentence, they arguably do it in an incremental, left-to-right manner. However, when humans consciously annotate a sentence with syntactic structure, they hardly ever work in a fixed left-to-right order. Rather, they start by building several isolated constituents by making easy and local attachment decisions, and only then combine these constituents into bigger constituents, jumping back and forth over the sentence and proceeding from easy to harder phenomena to analyze. When getting to the harder decisions, a lot of structure is already in place, and this structure can be used in deciding a correct attachment.

Our parser follows a similar kind of annotation process: starting from easy attachment decisions, and proceeding to harder and harder ones. When making later decisions, the parser has access to the entire structure built in earlier stages. During the training process, the parser learns its own notion of easy and hard, and learns to defer specific kinds of decisions until more structure is available.

3 Parsing algorithm

Our (projective) parsing algorithm builds the parse tree bottom up, using two kinds of actions: ATTACHLEFT(i) and ATTACHRIGHT(i). These actions are applied to a list of partial structures p1, ..., pk, called pending, which is initialized with the n words of the sentence w1, ..., wn. Each ac-

tion connects the heads of two neighbouring structures, making one of them the parent of the other, and removing the daughter from the list of partial structures. ATTACHLEFT(i) adds a dependency edge (p_i, p_i+1) and removes p_i+1 from the list. ATTACHRIGHT(i) adds a dependency edge (p_i+1, p_i) and removes p_i from the list. Each action shortens the list of partial structures by 1, and after n-1 such actions, the list contains the root of a connected projective tree over the sentence.

Figure 1 shows an example of parsing the sentence "a brown fox jumped with joy". The pseudocode of the algorithm is given in Algorithm 1.

Algorithm 1: Non-directional Parsing
  Input: a sentence = w1 ... wn
  Output: a set of dependency arcs over the sentence (Arcs)
  1: Acts = {ATTACHLEFT, ATTACHRIGHT}
  2: Arcs <- {}
  3: pending = p1 ... pn <- w1 ... wn
  4: while length(pending) > 1 do
  5:   best <- argmax_{act in Acts, 1 <= i <= len(pending)} score(act(i))
  6:   (parent, child) <- edgeFor(best)
  7:   Arcs.add((parent, child))
  8:   pending.remove(child)
  9: end
 10: return Arcs

  edgeFor(act(i)) = (p_i, p_i+1) if act = ATTACHLEFT(i); (p_i+1, p_i) if act = ATTACHRIGHT(i)

At each step the algorithm chooses a specific action/location pair using a function score(ACTION(i)), which assigns scores to action/location pairs based on the partially built structures headed by p_i and p_i+1, as well as neighbouring structures. The score() function is learned from data. This scoring function reflects not only the correctness of an attachment, but also the order in which attachments should be made. For example, consider the attachments (brown,fox) and (joy,with) in Figure (1.1). While both are correct, the scoring function prefers the (adjective,noun) attachment over the (prep,noun) attachment. Moreover, the attachment (jumped,with), while correct, receives a negative score for the bare preposition "with" (Fig. (1.1)-(1.4)), and a high score once the verb has its subject and the PP "with joy" is built (Fig. (1.5)). Ideally, we would like to score easy and reliable attachments higher than harder, less likely attachments, thus performing attachments in order of confidence. This strategy allows us both to limit the extent of error propagation, and to make use of richer contextual information in the later, harder attachments. Unfortunately, this kind of ordering information is not directly encoded in the data. We must, therefore, learn how to order the decisions. We first describe the learning algorithm (Section 4) and a feature representation (Section 5) which enables us to learn an effective scoring function.

4 Learning Algorithm

We use a linear model score(x) = w . phi(x), where phi(x) is a feature representation and w is a weight vector. We write phi_act(i) to denote the feature representation extracted for action act at location i. The model is trained using a variant of the structured perceptron (Collins, 2002), similar to the algorithm of (Shen et al., 2007; Shen and Joshi, 2008). As usual, we use parameter averaging to prevent the perceptron from overfitting.

The training algorithm is initialized with a zero parameter vector w. The algorithm makes several passes over the data. At each pass, we apply the training procedure given in Algorithm 2 to every sentence in the training set.

At training time, each sentence is parsed using the parsing algorithm and the current w. Whenever an invalid action is chosen by the parsing algorithm, it is not performed (line 6). Instead, we update the parameter vector w by decreasing the weights of the features associated with the invalid action, and increasing the weights of the features associated with the currently highest scoring valid action.[1] We then proceed to parse the sentence with the updated values. The process repeats until a valid action is chosen.

Note that each single update does not guarantee that the next chosen action is valid, or even different from the previously selected action. Yet, this is still an aggressive update procedure: we do not leave a sentence until our parameter vector parses it cor-

[1] We considered 3 variants of this scheme: (1) using the highest scoring valid action, (2) using the leftmost valid action, and (3) using a random valid action. The 3 variants achieved nearly identical accuracy, while (1) converged somewhat faster than the other two.

rectly, and we do not proceed from one partial parse to the next until w predicts a correct location/action pair. However, as the best ordering, and hence the best attachment point, is not known to us, we do not perform a single aggressive update step. Instead, our aggressive update is performed incrementally in a series of smaller steps, each pushing w away from invalid attachments and toward valid ones. This way we integrate the search for confident attachments into the learning process.

Algorithm 2: Structured perceptron training for direction-less parser, over one sentence.
  Input: sentence, gold arcs, current w, feature representation phi
  Output: weight vector w
  1: Arcs <- {}
  2: pending <- sent
  3: while length(pending) > 1 do
  4:   allowed <- {act(i) | isValid(act(i), Gold, Arcs)}
  5:   choice <- argmax_{act in Acts, 1 <= i <= len(pending)} w . phi_act(i)
  6:   if choice in allowed then
  7:     (parent, child) <- edgeFor(choice)
  8:     Arcs.add((parent, child))
  9:     pending.remove(child)
 10:   else
 11:     good <- argmax_{act(j) in allowed} w . phi_act(j)
 12:     w <- w + phi_good - phi_choice
 13: end
 14: return w

Function isValid(action, Gold, Arcs):
  1: (p, c) <- edgeFor(action)
  2: if (exists c' : (c, c') in Gold and (c, c') not in Arcs) or (p, c) not in Gold then
  3:   return false
  4: return true

The function isValid(act(i), gold, arcs) (line 4) is used to decide if the chosen action/location pair is valid. It returns True if two conditions apply: (a) (p_i, p_j) is present in gold, and (b) all edges (p_j, c') in gold are also in arcs. In words, the function verifies that the proposed edge is indeed present in the gold parse and that the suggested daughter has already found all its own daughters.[2]

5 Feature Representation

The feature representation for an action can take into account the original sentence, as well as the entire parse history: phi_act(i) above is actually phi(act(i), sentence, Arcs, pending).

We use binary valued features, and each feature is conjoined with the type of action.

When designing the feature representation, we keep in mind that our features should not only direct the parser toward desired actions and away from undesired actions, but also provide the parser with means of choosing between several desired actions. We want the parser to be able to defer some desired actions until more structure is available and a more informed prediction can be made. This desire is reflected in our choice of features: some of our features are designed to signal to the parser the presence of possibly incomplete structures, such as an incomplete phrase, a coordinator without conjuncts, and so on.

When considering an action ACTION(i), we limit ourselves to features of partial structures around the attachment point: p_i-2, p_i-1, p_i, p_i+1, p_i+2, p_i+3, that is, the two structures which are to be attached by the action (p_i and p_i+1), and the two neighbouring structures on each side.[3]

While these features encode local context, it is local in terms of syntactic structure, and not purely in terms of sentence surface form. This lets us capture some, though not all, long-distance relations.

For a partial structure p, we use wp to refer to the head word form, tp to the head word POS tag, and lcp and rcp to the POS tags of the left-most and right-most children of p, respectively.

All our prepositions (IN) and coordinators (CC) are lexicalized: for them, tp is in fact wp tp.

We define structural, unigram, bigram and pp-attachment features.

The structural features are: the length of the structures (lenp), whether the structure is a single word (contains no children: ncp), and the surface distance between structure heads (Delta p_i p_j). The unigram and bigram features are adapted from the feature set for left-to-right Arc-Standard dependency parsing de-

[2] This is in line with the Arc-Standard parsing strategy of shift-reduce dependency parsers (Nivre, 2004). We are currently experimenting also with an Arc-Eager variant of the non-directional algorithm.
[3] Our sentences are padded from each side with sentence delimiter tokens.

Structural
  for p in p_i-2, p_i-1, p_i, p_i+1, p_i+2, p_i+3:
    lenp, ncp
  for p,q in (p_i-2,p_i-1), (p_i-1,p_i), (p_i,p_i+1), (p_i+1,p_i+2), (p_i+2,p_i+3):
    Delta qp, Delta qp tp tq
Unigram
  for p in p_i-2, p_i-1, p_i, p_i+1, p_i+2, p_i+3:
    tp, wp, tp lcp, tp rcp, tp rcp lcp
Bigram
  for p,q in (p_i,p_i+1), (p_i,p_i+2), (p_i-1,p_i), (p_i-1,p_i+2), (p_i+1,p_i+2):
    tp tq, wp wq, tp wq, wp tq,
    tp tq lcp lcq, tp tq rcp lcq, tp tq lcp rcq, tp tq rcp rcq
PP-Attachment
  if p_i is a preposition:
    w_i-1 w_i rc_i, t_i-1 w_i rcw_i
  if p_i+1 is a preposition:
    w_i-1 w_i+1 rc_i+1, t_i-1 w_i+1 rcw_i+1, w_i w_i+1 rc_i+1, t_i w_i+1 rcw_i+1
  if p_i+2 is a preposition:
    w_i+1 w_i+2 rc_i+2, t_i+1 w_i+2 rcw_i+2, w_i w_i+2 rc_i+2, t_i w_i+2 rcw_i+2

Figure 2: Feature Templates
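As an illustration, the unigram and bigram templates of Figure 2 can be instantiated as feature strings roughly as follows (a sketch with an assumed dict-based layout for partial structures; the real feature set also includes the structural and pp-attachment templates and lexicalizes IN/CC tags):

```python
# Instantiating a subset of the Figure 2 templates around location i.
# A partial structure is a dict: head word 'w', head POS 't', and the
# POS tags of its left-/right-most children 'lc' / 'rc'.

PAD = {"w": "*PAD*", "t": "*PAD*", "lc": "", "rc": ""}

def features(pending, i, action):
    def p(k):  # structure at offset k from the attachment point
        j = i + k
        return pending[j] if 0 <= j < len(pending) else PAD

    feats = []
    for k in (-2, -1, 0, 1, 2, 3):  # unigram templates
        s = p(k)
        feats += [f"t{k}={s['t']}", f"w{k}={s['w']}",
                  f"tlc{k}={s['t']}|{s['lc']}", f"trc{k}={s['t']}|{s['rc']}"]
    for a, b in ((0, 1), (0, 2), (-1, 0), (-1, 2), (1, 2)):  # bigrams
        s, q = p(a), p(b)
        feats += [f"tt{a}{b}={s['t']}|{q['t']}",
                  f"ww{a}{b}={s['w']}|{q['w']}"]
    # every feature is conjoined with the action type
    return [f"{action}^{f}" for f in feats]

fox = {"w": "fox", "t": "NN", "lc": "DT", "rc": "JJ"}
jumped = {"w": "jumped", "t": "VBD", "lc": "", "rc": ""}
feats = features([fox, jumped], 0, "ATTACHRIGHT")
```

The padding dict plays the role of the sentence-delimiter tokens mentioned in footnote 3, so the templates are well defined near the sentence boundaries.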

scribed in (Huang et al., 2009). We extended that feature set to include the structure on both sides of the proposed attachment point.

In the case of unigram features, we added features that specify the POS of a word together with the POS of its left-most and right-most children. These features provide the non-directional model with means to prefer some attachment points over others based on the types of structures already built. In English, the left- and right-most POS-tags are good indicators of constituency.

The pp-attachment features are similar to the bigram features, but fire only when one of the structures is headed by a preposition (IN). These features are more lexicalized than the regular bigram features, and also include the word form of the right-most child of the PP (rcwp). This should help the model learn lexicalized attachment preferences such as (hit, with-bat).

Figure 2 enumerates the feature templates we use.

6 Computational Complexity and Efficient Implementation

The parsing algorithm (Algorithm 1) begins with n + 1 disjoint structures (the words of the sentence plus a ROOT symbol), and terminates with one connected structure. Each iteration of the main loop connects two structures and removes one of them, and so the loop repeats exactly n times.

The argmax in line 5 selects the maximal scoring action/location pair. At iteration i, there are n - i locations to choose from, and a naive computation of the argmax is O(n), resulting in an O(n^2) algorithm.

Each performed action changes the partial structures, and with them the extracted features and the computed scores. However, these changes are limited to a fixed local context around the attachment point of the action. Thus, we observe that the feature extraction and score calculation can be performed once for each action/location pair in a given sentence, and reused throughout all the iterations. After each iteration we need to update the extracted features and calculated scores for only k locations, where k is a fixed number depending on the window size used in the feature extraction, and usually k << n.

Using this technique, we perform only (k + 1)n feature extractions and score calculations for each sentence, that is, O(n) feature-extraction operations per sentence.

Given the scores for each location, the argmax can then be computed in O(log n) time using a heap, resulting in an O(n log n) algorithm: n iterations, where the first iteration involves n feature extraction operations and n heap insertions, and each subsequent iteration involves k feature extractions and heap updates.

We note that the dominating factor in polynomial-time discriminative parsers is by far the feature extraction and score calculation. It makes sense to compare parser complexity in terms of these operations only.[4] Table 1 compares the complexity of our

[4] Indeed, in our implementation we do not use a heap, and opt instead to find the argmax using a simple O(n) max operation. This O(n^2) algorithm is faster in practice than the heap-based one, as both are dominated by the O(n) feature extraction, while the cost of the O(n) max calculation is negligible compared to the constants involved in heap maintenance.

parser to other dependency parsing frameworks.

Parser               Runtime       Features / Scoring
MALT                 O(n)          O(n)
MST                  O(n^3)        O(n^2)
MST2                 O(n^3)        O(n^3)
BEAM                 O(n * beam)   O(n * beam)
NONDIR (this work)   O(n log n)    O(n)

Table 1: Complexity of different parsing frameworks. MST: first-order MST parser; MST2: second-order MST parser; MALT: shift-reduce left-to-right parsing; BEAM: beam search parser, as in (Zhang and Clark, 2008).

In terms of feature extraction and score calculation operations, our algorithm has the same cost as traditional shift-reduce (MALT) parsers, and is an order of magnitude more efficient than graph-based (MST) parsers. Beam-search decoding for left-to-right parsers (Zhang and Clark, 2008) is also linear, but has an additional linear dependence on the beam size. The reported results in (Zhang and Clark, 2008) use a beam size of 64, compared to our constant of k = 6.

Our Python-based implementation[5] (the perceptron is implemented in a C extension module) parses about 40 tagged sentences per second on an Intel-based MacBook laptop.

7 Experiments and Results

We evaluate the parser using the WSJ Treebank. The trees were converted to dependency structures with the Penn2Malt conversion program,[6] using the head-finding rules from (Yamada and Matsumoto, 2003).[7] We use Sections 2-21 for training, Section 22 for development, and Section 23 as the final test set. The text is automatically POS tagged using a trigram HMM-based POS tagger prior to training and parsing. Each section is tagged after training the tagger on all other sections. The tagging accuracy of the tagger is 96.5 for the training set and 96.8 for the test set. While better taggers exist, we believe that the simpler HMM tagger overfits less, and is more representative of the tagging performance on non-WSJ corpus texts.

Parsers: We evaluate our parser against the transition-based MALT parser and the graph-based MST parser. We use version 1.2 of the MALT parser,[8] with the settings used for parsing English in the CoNLL 2007 shared task. For the MST parser,[9] we use the default first-order, projective parser settings, which provide state-of-the-art results for English. All parsers are trained and tested on the same data. Our parser is trained for 20 iterations.

Evaluation Measures: We evaluate the parsers using three common measures:
(unlabeled) Accuracy: the percentage of tokens which got assigned their correct parent.
Root: the percentage of sentences in which the ROOT attachment is correct.
Complete: the percentage of sentences in which all tokens were assigned their correct parent.

Unlike most previous work on English dependency parsing, we do not exclude punctuation marks from the evaluation.

Results are presented in Table 2. Our non-directional easy-first parser significantly outperforms the left-to-right greedy MALT parser in terms of accuracy and root prediction, and significantly outperforms both parsers in terms of exact match. The globally optimized MST parser is better in root prediction, and slightly better in terms of accuracy.

We evaluated the parsers also on the English dataset from the CoNLL 2007 shared task. While this dataset is also derived from the WSJ Treebank, it differs from the previous dataset in two important aspects: it is much smaller in size, and it is created using a different conversion procedure, which is more linguistically adequate. For these experiments, we use the dataset POS tags, and the same parameters as in the previous set of experiments: we train the non-directional parser for 20 iterations, with the same feature set. The CoNLL dataset contains some non-projective constructions. MALT and MST deal with non-projectivity. For the non-directional parser, we projectivize the training set prior to training using the procedure described in (Carreras, 2007).

Results are presented in Table 3.

[5] http://www.cs.bgu.ac.il/yoavg/software/
[6] http://w3.msi.vxu.se/nivre/research/Penn2Malt.html
[7] While other and better conversions exist (see, e.g., (Johansson and Nugues, 2007; Sangati and Mazza, 2009)), this conversion heuristic is still the most widely used. Using the same conversion facilitates comparison with previous works.
[8] http://maltparser.org/dist/1.2/malt-1.2.tar.gz
[9] http://sourceforge.net/projects/mstparser/

Parser               Accuracy   Root    Complete
MALT                 88.36      87.04   34.14
MST                  90.05      93.95   34.64
NONDIR (this work)   89.70      91.50   37.50

Table 2: Unlabeled dependency accuracy on PTB Section 23, automatic POS-tags, including punctuation.

Parser               Accuracy   Root    Complete
MALT                 85.82      87.85   24.76
MST                  89.08      93.45   24.76
NONDIR (this work)   88.34      91.12   29.43

Table 3: Unlabeled dependency accuracy on CoNLL 2007 English test set, including punctuation.

While all models suffer from the move to the smaller dataset and the more challenging annotation scheme, the overall story remains the same: the non-directional parser is better than MALT but not as good as MST in terms of parent-accuracy and root prediction, and is better than both MALT and MST in terms of producing complete correct parses.

That the non-directional parser has lower accuracy but more exact matches than the MST parser can be explained by its being a deterministic parser, and hence still vulnerable to error propagation: once it has erred, it is likely to err again, resulting in low accuracies for some sentences. However, due to the easy-first policy, it manages to parse many sentences without a single error, which leads to higher exact-match scores. The non-directional parser avoids error propagation by not making the initial error. On average, the non-directional parser manages to assign correct heads to over 60% of the tokens before making its first error.

The MST parser would have ranked 5th in the shared task, and NONDIR would have ranked 7th. The better-ranking systems in the shared task are either higher-order global models, beam-search based systems, or ensemble-based systems, all of which are more complex and less efficient than the NONDIR parser.

Parse Diversity: The parses produced by the non-directional parser are different from the parses produced by the graph-based and left-to-right parsers. To demonstrate this difference, we performed an Oracle experiment, in which we combine the output of several parsers by choosing, for each sentence, the parse with the highest score. Results are presented in Table 4.

Combination          Accuracy   Complete
Penn2Malt, Train 2-21, Test 23
MALT+MST             92.29      44.03
NONDIR+MALT          92.19      45.48
NONDIR+MST           92.53      44.41
NONDIR+MST+MALT      93.54      49.79
CoNLL 2007
MALT+MST             91.50      33.64
NONDIR+MALT          91.02      34.11
NONDIR+MST           91.90      34.11
NONDIR+MST+MALT      92.70      38.31

Table 4: Parser combination with Oracle, choosing the highest scoring parse for each sentence of the test set.

A non-oracle blending of MALT+MST+NONDIR using Sagae and Lavie's (2006) simplest combination method, assigning each component the same weight, yields an accuracy of 90.8 on the CoNLL 2007 English dataset, making it the highest scoring system among the participants.

7.1 Error Analysis / Limitations

When we investigate the POS category of mistaken instances, we see that for all parsers, nodes with structures of depth 2 and more which are assigned an incorrect head are predominantly PPs (headed by IN), followed by NPs (headed by NN). All parsers have a hard time dealing with PP attachment, but the MST parser is better at it than NONDIR, and both are better than MALT.

Looking further at the mistaken instances, we notice a tendency of the PP mistakes of the NONDIR parser to involve, before the PP, an NP embedded in a relative clause. This reveals a limitation of our parser: recall that for an edge to be built, the child must first acquire all its own children. This means that in the case of relative clauses such as "I saw the boy [who ate the pizza] with my eyes", the parser must decide if the PP "with my eyes" should be attached to "the pizza" or not before it is allowed to build parts of the outer NP ("the boy who ..."). In this case, the verb "saw" and the noun "boy" are both outside of the sight of the parser when deciding on the PP attachment, and it is forced to make a decision in ignorance, which, in many cases, leads to mistakes. The globally optimized MST does not suffer as much from such cases. We plan to address this deficiency in future work.
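The oracle combination behind Table 4 amounts to picking, per sentence, the candidate parse with the most correct heads (a sketch; the head-dict encoding, with head 0 as ROOT, is an assumption for illustration):

```python
# Oracle blending: for each sentence, keep the candidate parse that
# agrees with the gold parse on the largest number of tokens.

def oracle_combine(gold_parses, parser_outputs):
    chosen = []
    for i, gold in enumerate(gold_parses):
        best = max((out[i] for out in parser_outputs),
                   key=lambda parse: sum(parse.get(t) == h
                                         for t, h in gold.items()))
        chosen.append(best)
    return chosen

gold = [{1: 0, 2: 1, 3: 1}]
parser_a = [{1: 0, 2: 3, 3: 1}]   # 2 of 3 heads correct
parser_b = [{1: 0, 2: 1, 3: 1}]   # fully correct
picked = oracle_combine(gold, [parser_a, parser_b])
```

Since the oracle consults the gold parse, the resulting numbers are an upper bound on what any per-sentence selector over the same component parsers could achieve, which is what makes them a useful measure of parse diversity.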

8 Related Work

Deterministic shift-reduce parsers are restricted by a strict left-to-right processing order. Such parsers can rely on rich syntactic information on the left, but not on the right, of the decision point. They are forced to commit early, and suffer from error propagation. Our non-directional parser addresses these deficiencies by discarding the strict left-to-right processing order, and attempting to make easier decisions before harder ones. Other methods of dealing with these deficiencies have been proposed over the years:

Several Passes: Yamada and Matsumoto's (2003) pioneering work introduces a shift-reduce parser which makes several left-to-right passes over a sentence. Each pass adds structure, which can then be used in subsequent passes. Sagae and Lavie (2006b) extend this model to alternate between left-to-right and right-to-left passes. This model is similar to ours, in that it attempts to defer harder decisions to later passes over the sentence, and allows late decisions to make use of rich syntactic information (built in earlier passes) on both sides of the decision point. However, the model is not explicitly trained to optimize attachment ordering, has an O(n^2) runtime complexity, and produces results which are inferior to current single-pass shift-reduce parsers.

Beam Search: Several researchers dealt with the early commitment and error propagation of deterministic parsers by extending the greedy decisions with various flavors of beam search (Sagae and Lavie, 2006a; Zhang and Clark, 2008; Titov and Henderson, 2007). This approach works well and produces highly competitive results. Beam search can be incorporated into our parser as well. We leave this investigation to future work.

Strict left-to-right ordering is also prevalent in sequence tagging. Indeed, one major influence on our work is Shen et al.'s bi-directional POS-tagging algorithm (Shen et al., 2007), which combines a perceptron learning procedure similar to our own with beam search to produce a state-of-the-art POS-tagger which does not rely on left-to-right processing. Shen and Joshi (2008) extend the bidirectional tagging algorithm to LTAG parsing, with good results. We build on top of that work and present a concrete and efficient greedy non-directional dependency parsing algorithm.

Structure Restrictions: Eisner and Smith (2005) propose to improve the efficiency of a globally optimized parser by posing hard constraints on the lengths of the arcs it can produce. Such constraints pose an explicit upper bound on parser accuracy.[10] Our parsing model does not pose such restrictions. Shorter edges are arguably easier to predict, and our parser builds them early. However, it is also capable of producing long dependencies at later stages in the parsing process. Indeed, the distribution of arc lengths produced by our parser is similar to those produced by the MALT and MST parsers.

9 Discussion

We presented a non-directional deterministic dependency parsing algorithm, which is not restricted by the left-to-right parsing order of other deterministic parsers. Instead, it works in an easy-first order. This strategy allows using more context at each decision. The parser learns both what and when to connect. We show that this parsing algorithm significantly outperforms a left-to-right deterministic algorithm. While it still lags behind globally optimized parsing algorithms in terms of accuracy and root prediction, it is much better in terms of exact match, and much faster. As our parsing framework can easily and efficiently utilize more structural information than globally optimized parsers, we believe that with some enhancements and better features it can outperform globally optimized algorithms, especially when more structural information is needed, such as for morphologically rich languages.

Moreover, we show that our parser produces different structures than those produced by both left-to-right and globally optimized parsers, making it a good candidate for inclusion in an ensemble system. Indeed, a simple combination scheme of graph-based, left-to-right and non-directional parsers yields state-of-the-art results on English dependency parsing on the CoNLL 2007 dataset.

We hope that further work on this non-directional parsing framework will pave the way to a better understanding of an interesting cognitive question: which kinds of parsing decisions are hard to make, and which linguistic constructs are hard to analyze?

[10] In (Dreyer et al., 2006), constraints are chosen to be the minimum value that will allow recovery of 90% of the left (right) dependencies in the training corpus.

References

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. of CoNLL.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. of the CoNLL Shared Task, EMNLP-CoNLL.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.

Markus Dreyer, David A. Smith, and Noah A. Smith. 2006. Vine parsing and minimum risk reranking for speed and precision. In Proc. of CoNLL, pages 201-205.

Jason Eisner and Noah A. Smith. 2005. Parsing with soft and hard constraints on dependency length. In Proc. of IWPT.

Liang Huang, Wenbin Jiang, and Qun Liu. 2009. Bilingually-constrained (monolingual) shift-reduce parsing. In Proc. of EMNLP.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proc. of NODALIDA.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proc. of EMNLP.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. of EACL.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proc. of ACL.

Tetsuji Nakagawa. 2007. Multilingual dependency parsing using global features. In Proc. of EMNLP-CoNLL.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proc. of ACL, pages 950-958.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of EMNLP-CoNLL.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Incremental Parsing: Bringing Engineering and Cognition Together, ACL Workshop.

Sebastian Riedel and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. of EMNLP.

Kenji Sagae and Alon Lavie. 2006a. A best-first probabilistic shift-reduce parser. In Proc. of ACL.

Kenji Sagae and Alon Lavie. 2006b. Parser combination by reparsing. In Proc. of NAACL.

Federico Sangati and Chiara Mazza. 2009. An English dependency treebank à la Tesnière. In Proc. of TLT8.

Libin Shen and Aravind K. Joshi. 2008. LTAG dependency parsing with bidirectional incremental construction. In Proc. of EMNLP.

Libin Shen, Giorgio Satta, and Aravind K. Joshi. 2007. Guided learning for bidirectional sequence classification. In Proc. of ACL.

Ivan Titov and James Henderson. 2007. Fast and robust multilingual dependency parsing with a generative latent variable model. In Proc. of EMNLP-CoNLL.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. of IWPT.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proc. of EMNLP.

