
Edit Distance Textual Entailment Suite (EDITS)

User Manual - Version 2.1
Milen Kouylekov and Matteo Negri
Fondazione Bruno Kessler, FBK-irst
{kouylekov,negri}@fbk.eu

1 Introduction
Textual Entailment (TE) has been proposed as a unifying generic framework for modeling language variability and semantic inference in different Natural Language Processing (NLP) tasks. The Recognizing Textual Entailment (RTE) task (Dagan and Glickman, 2007) consists in deciding, given two text fragments (respectively called Text - T, and Hypothesis - H), whether the meaning of H can be inferred from the meaning of T, as in:

T: Yahoo acquired Overture
H: Yahoo owns Overture

The system has been designed following three basic requirements:

Modularity. The system architecture is such that the overall processing task is broken up into major modules. Modules can be composed through a configuration file, and extended as plug-ins according to individual requirements. The system's workflow, the behavior of the basic components, and their I/O formats are described in a comprehensive documentation available upon download.

Flexibility. The system is general-purpose, and suited for any TE corpus provided in a simple XML format. In addition, both language-dependent and language-independent configurations are allowed by algorithms that manipulate different representations of the input data.

Adaptability. Modules can be tuned over training data to optimize performance along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and NO entailment judgements). In addition, an optimization component based on genetic algorithms is available to automatically set parameters starting from a basic configuration.

EDITS is open source, and available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java, runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package can be downloaded from: http://edits.fbk.eu.
EDITS comes pre-packaged with:

- System Graphical Interface: an editor capable of handling the system core data structures;
- Set of System Configurations: configurations used to run EDITS, described in Section 4.2;
- Set of Cost Schemes: XML files used by the entailment engine, described in Section 5;
- Entailment Rules: XML files that contain knowledge extracted from WordNet, VerbOcean and Wikipedia, represented as rules;
- Set of RTE Datasets: the publicly available RTE corpora (RTE 1, 2, 3) in the ETAF format described in Section 4.1;
- Trained Models: configured and trained entailment engines used by the FBK Textual Entailment group;

- HTML Reference: an HTML document describing all the modules available in the system;
- INSTALL.txt: a file that describes the installation of EDITS.

2 System Overview

Figure 1: Entailment Engine

The EDITS package allows users to:

- Create an Entailment Engine (Figure 1) by defining its basic components (i.e. algorithms, cost schemes, rules, and optimizers);
- Train such an Entailment Engine over an annotated RTE corpus (containing T-H pairs annotated in terms of entailment) to learn a Model;
- Use the Entailment Engine and the Model to assign an entailment judgment and a confidence score to each pair of an un-annotated test corpus.

EDITS implements a distance-based framework which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment). Within this framework the system implements and harmonizes different approaches to distance computation, providing both edit distance algorithms and similarity algorithms (see Section 3.1). Each algorithm returns a normalized distance score (a number between 0 and 1). At the training stage, distance scores calculated over annotated T-H pairs are used to estimate a threshold that best separates positive from negative examples. The threshold, which is stored in a Model, is used at the test stage to assign an entailment judgment and a confidence score to each test pair.

In the creation of a distance Entailment Engine, algorithms are combined with cost schemes (see Section 3.2) that can be optimized to determine their behavior (see Section 3.3), and with optional external knowledge represented as rules (see Section 3.4). Besides the definition of a single Entailment Engine, a unique feature of EDITS is that it allows for the combination of multiple Entailment Engines in different ways (see Section 4.4). Basic components are already provided with EDITS, allowing users to create a variety of entailment engines.
Fast prototyping of new solutions is also supported: the modular architecture of the system can be extended with new algorithms, cost schemes, rules, or plug-ins to new language processing components.
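The train/test cycle described above can be sketched as follows. This is an illustrative reimplementation of the thresholding idea, not EDITS code; all function names are hypothetical:

```python
# Illustrative sketch of the distance-threshold framework (not EDITS code).

def learn_threshold(training_pairs):
    """Pick the distance threshold that best separates YES from NO pairs.

    training_pairs: list of (distance, judgment), with distance in [0, 1]
    and judgment in {"YES", "NO"}.
    """
    candidates = sorted({d for d, _ in training_pairs})
    best_t, best_acc = 0.0, -1.0
    for t in candidates:
        # A pair is judged YES when its distance falls below the threshold.
        correct = sum(1 for d, j in training_pairs
                      if (j == "YES") == (d <= t))
        acc = correct / len(training_pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def judge(distance, threshold):
    # The lower the distance, the higher the probability of entailment.
    return "YES" if distance <= threshold else "NO"
```

For instance, given training scores [(0.2, "YES"), (0.3, "YES"), (0.7, "NO"), (0.9, "NO")], the learned threshold is 0.3, and any test pair scoring below it is judged YES.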

3 Basic Components
This section overviews the main components of a distance Entailment Engine, namely: i) algorithms, ii) cost schemes, iii) the cost optimizer, and iv) entailment/contradiction rules.

3.1 Algorithms

Algorithms are used to compute a distance score between T-H pairs. EDITS provides a set of predefined algorithms, including edit distance algorithms, and similarity algorithms adapted to the proposed distance framework. The choice of the available algorithms is motivated by their wide use documented in the RTE literature.

Edit distance algorithms cast the RTE task as the problem of mapping the whole content of H into the content of T. Mappings are performed as sequences of editing operations (i.e. insertion, deletion, substitution of text portions) needed to transform T into H, where each edit operation has a cost associated with it. The edit distance algorithms available in the current release of the system are:

- Token Edit Distance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H;
- Tree Edit Distance: an implementation of the algorithm described in (Zhang and Shasha, 1990), with edit operations defined over single nodes of a syntactic representation of T and H.

Similarity algorithms are adapted to the EDITS distance framework by transforming measures of the lexical/semantic similarity between T and H into distance measures. These algorithms are also adapted to use the three edit operations to support overlap calculation, and to define term weights. For instance, substitutable terms in T and H can be treated as equal, and non-overlapping terms can be weighted proportionally to their insertion/deletion costs. Five similarity algorithms are available, namely:

- Word Overlap: computes an overall (distance) score as the proportion of common words in T and H. In the current implementation the algorithm uses the cost scheme to find the least costly substitution of a word from H with a word from T. One word from T can substitute more than one of the words in H. The score returned by the algorithm is the sum of the costs of all substitutions divided by the number of words in H;
- Jaro-Winkler Distance: a similarity algorithm between strings, adapted to similarity between words. The algorithm uses the cost scheme to decide whether two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the Jaro-Winkler metric from 1 (i.e. score(A,B) = 1 - JW(A,B));
- Cosine Similarity: a common vector-based similarity measure. The EDITS implementation uses the cost scheme to decide whether two words are the same (they have a 0 cost of substitution) and to define the weight of words (the cost of insertion for a word in H, and the cost of deletion for a word in T);
- Longest Common Subsequence: searches for the longest possible sequence of words appearing in both T and H in the same order, normalizing its length by the length of H. The algorithm uses the cost scheme to decide whether two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B) = 1 - (LCS(A,B) / words(B)));
- Jaccard Coefficient: compares the intersection of the words in T and H to their union. The algorithm uses the cost scheme to decide whether two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B) = 1 - JSC(A,B)).
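As a concrete illustration of the similarity-to-distance conversion, the Jaccard Coefficient variant can be sketched as follows. This is an illustrative sketch, not the EDITS implementation: plain string identity stands in for the cost scheme's zero-cost substitution test.

```python
def jaccard_distance(t, h):
    """Entailment (distance) score: 1 - |T ∩ H| / |T ∪ H| over word sets.

    Simplified sketch: word identity stands in for the cost scheme's
    zero-cost substitution test used by the real implementation.
    """
    t_words, h_words = set(t.split()), set(h.split())
    similarity = len(t_words & h_words) / len(t_words | h_words)
    return 1.0 - similarity
```

For the introductory example, jaccard_distance("Yahoo acquired Overture", "Yahoo owns Overture") shares 2 of 4 distinct words, giving a distance of 0.5.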

In addition, Rouge is available: our implementation of a set of metrics originally proposed for summarization evaluation (Lin, 2004).

3.2 Cost Schemes

Cost schemes are used to define the cost of each edit operation. Cost schemes are defined as XML files that explicitly associate a cost (a positive real number) to each edit operation applied to elements of T and H. Elements, referred to as A and B, can be of different types, depending on the algorithm used. For instance, Tree Edit Distance will manipulate nodes in a dependency tree representation, whereas Token Edit Distance and the similarity algorithms will manipulate words. Figure 2 shows an example of a cost scheme, where edit operation costs are defined as follows:

- Insertion(B)=10: inserting an element B from H into T, no matter what B is, always costs 10;
- Deletion(A)=10: deleting an element A from T, no matter what A is, always costs 10;
- Substitution(A,B)=0 if A=B: substituting A with B costs 0 if A and B are equal;
- Substitution(A,B)=20 if A!=B: substituting A with B costs 20 if A and B are different.

<scheme>
  <insertion><cost>10</cost></insertion>
  <deletion><cost>10</cost></deletion>
  <substitution>
    <condition>(equals A B)</condition>
    <cost>0</cost>
  </substitution>
  <substitution>
    <condition>(not (equals A B))</condition>
    <cost>20</cost>
  </substitution>
</scheme>
Figure 2: Example of XML Cost Scheme
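Combined with the Token Edit Distance algorithm, the fixed costs of Figure 2 amount to a weighted Levenshtein distance over token sequences. A minimal sketch (illustrative only, not the EDITS implementation):

```python
def token_edit_distance(t, h, ins=10, dele=10, sub=20):
    """Minimal cost of transforming T into H over token sequences,
    using the fixed costs of the Simple Cost Scheme (Figure 2)."""
    t_toks, h_toks = t.split(), h.split()
    n, m = len(t_toks), len(h_toks)
    # d[i][j]: cost of transforming the first i tokens of T
    # into the first j tokens of H.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * dele
    for j in range(1, m + 1):
        d[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if t_toks[i - 1] == h_toks[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete from T
                          d[i][j - 1] + ins,       # insert from H
                          d[i - 1][j - 1] + s)     # substitute
    return d[n][m]
```

On the introductory example, transforming "Yahoo acquired Overture" into "Yahoo owns Overture" takes one non-identical substitution, for a total cost of 20.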

In the distance-based framework adopted by EDITS, the interaction between algorithms and cost schemes plays a central role. Given a T-H pair, in fact, the distance score returned by an algorithm directly depends on the cost of the operations applied to transform T into H (edit distance algorithms), or on the cost of mapping words in H onto words in T (similarity algorithms). Such interaction determines the overall behavior of an Entailment Engine, since distance scores returned by the same algorithm with different cost schemes can be considerably different. This allows users to define (and optimize, as explained in Section 3.3) the cost schemes that best suit the RTE data they want to model. For instance, when dealing with T-H pairs composed of texts that are much longer than the hypotheses (as in the RTE5 Campaign), setting low deletion costs avoids penalizing short Hs fully contained in the Ts.

To facilitate the usage of the system, EDITS provides a mechanism for the automatic generation of cost schemes. This mechanism creates a simple cost scheme that is compatible with the distance algorithm used and with the entailment/contradiction resources accessible to the system. More information about this mechanism can be found in Section 4.6.

EDITS provides two predefined cost schemes:

- Simple Cost Scheme: the one shown in Figure 2, setting fixed costs for each edit operation;

- IDF Cost Scheme: insertion and deletion costs for a word w are set to the inverse document frequency of w (IDF(w)). The substitution cost is set to 0 if a word w1 from T and a word w2 from H are the same, and to IDF(w1)+IDF(w2) otherwise.

In the creation of new cost schemes, users can express edit operation costs, and conditions over the A and B elements, using a meta-language based on a Lisp-like syntax (e.g. (+ (IDF A) (IDF B)), (not (equals A B))). The system also provides functions to access data stored in hash files. For example, the IDF Cost Scheme accesses the IDF values of the 100K most frequent English words (calculated on the Brown Corpus), stored in a file distributed with the system. Users can create new hash files to collect statistics about words in other languages, or other information to be used inside the cost scheme. The definition of a cost scheme is described in detail in Section 5. EDITS also provides a mechanism for the automatic generation of cost schemes, as described in Section 4.6.

3.3 Cost Optimizer

A cost optimizer is used to adapt cost schemes (either those provided with the system, or new ones defined by the user) to specific datasets. The optimizer is based on cost adaptation through genetic algorithms, as proposed in (Mehdad, 2009). To this aim, cost schemes can be parametrized by externalizing the edit operation costs as parameters. The optimizer iterates over the training data using different values of these parameters until an optimal set is found (i.e. the one that performs best on the training set). The optimization mechanism is described in Section 4.5.

3.4 Rules

Rules are used to provide the Entailment Engine with knowledge (e.g. lexical, syntactic, semantic) about the probability of entailment or contradiction between elements of T and H. Rules are invoked by cost schemes to influence the cost of substitutions between elements of T and H.
Typically, the cost of the substitution between two elements A and B is inversely proportional to the probability that A entails B. Rules are stored in XML files called Rule Repositories, with the format shown in Figure 3. Each rule consists of three parts: i) a left-hand side, ii) a right-hand side, and iii) a probability that the left-hand side entails (or contradicts) the right-hand side.

<rule entailment="ENTAILMENT">
  <t>acquire</t>
  <h>own</h>
  <probability>0.95</probability>
</rule>
<rule entailment="CONTRADICTION">
  <t>beautiful</t>
  <h>ugly</h>
  <probability>0.88</probability>
</rule>
Figure 3: Example of XML Rule Repository

The format of the entailment rules repository is described in Section 6.
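To make the interaction between cost schemes and rules concrete, here is a sketch of a substitution cost informed by a rule repository like the one in Figure 3. This is illustrative only: the constant, function name, and exact cost formula are assumptions, not the EDITS internals; the only grounded idea is that the cost is inversely proportional to the entailment probability.

```python
# Hypothetical rule repository mirroring Figure 3: maps (t_element, h_element)
# to the probability that the first entails the second.
RULES = {("acquire", "own"): 0.95}

BASE_SUB_COST = 20.0  # fallback cost, as in the Simple Cost Scheme (Figure 2)

def substitution_cost(a, b):
    """Cost of substituting element A (from T) with element B (from H):
    0 for identical elements, low for high-probability entailment rules,
    and the full base cost otherwise."""
    if a == b:
        return 0.0
    if (a, b) in RULES:
        # Inversely proportional to the entailment probability (assumed
        # linear scaling for illustration).
        return BASE_SUB_COST * (1.0 - RULES[(a, b)])
    return BASE_SUB_COST
```

With the "acquire entails own" rule of Figure 3, the substitution drops from the base cost of 20 to roughly 1, so "Yahoo acquired Overture" maps onto "Yahoo owns Overture" almost for free.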

4 Using the System


This section provides basic information about the use of EDITS, which can be run with commands in a Unix shell.

4.1 EDITS Input

The input of the system is an entailment corpus represented in the EDITS Text Annotation Format (ETAF), a simple XML internal annotation format (DTD in Figure 4). ETAF is used to represent both the input T-H pairs and the entailment and contradiction rules. ETAF allows texts to be represented at two different levels: i) as sequences of tokens with their associated morpho-syntactic properties, or ii) as syntactic trees with structural relations among nodes. Plug-ins for several widely used annotation tools (including TreeTagger, Stanford Parser, and OpenNLP) can be downloaded from the system's website. Users can also extend EDITS by implementing plug-ins to convert the output of other annotation tools into ETAF.

<!ELEMENT entailment-corpus (pair+)>
<!ELEMENT pair (t, h, tAnnotation*, hAnnotation*)>
<!ELEMENT t (#PCDATA)>
<!ELEMENT h (#PCDATA)>
<!ELEMENT tAnnotation (string?,word*,tree?,semantics?)>
<!ELEMENT hAnnotation (string?,word*,tree?,semantics?)>
<!ELEMENT string (#PCDATA)>
<!ELEMENT word (attribute+)>
<!ELEMENT attribute (#PCDATA)>
<!ELEMENT tree (node+,edge*)>
<!ELEMENT node (word|label)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT edge EMPTY>
<!ELEMENT semantics (entity*,relation*)>
<!ELEMENT entity EMPTY>
<!ELEMENT relation EMPTY>
<!ATTLIST pair id CDATA #REQUIRED entailment (YES|NO|UNKNOWN) #REQUIRED task CDATA #IMPLIED length CDATA #IMPLIED>
<!ATTLIST tAnnotation id CDATA #IMPLIED>
<!ATTLIST hAnnotation id CDATA #IMPLIED>
<!ATTLIST word id CDATA #IMPLIED>
<!ATTLIST attribute name CDATA #REQUIRED>
<!ATTLIST tree id CDATA #IMPLIED>
<!ATTLIST node id CDATA #IMPLIED>
<!ATTLIST edge name CDATA #IMPLIED from CDATA #REQUIRED to CDATA #REQUIRED>
<!ATTLIST relation name CDATA #REQUIRED source CDATA #IMPLIED>
<!ATTLIST entity name CDATA #REQUIRED start CDATA #IMPLIED end CDATA #IMPLIED source CDATA #IMPLIED>
Figure 4: DTD of the ETAF annotation format

The basic level of annotation represents texts as sequences of tokens with morpho-syntactic features. Common properties are token, lemma, morpho and part of speech, but other linguistic annotations may be used at this level, including named entities, weights of tokens (like IDF), and many others. For example, the following is the morpho-syntactic representation of the sentence "Edison invented the Kinetoscope.", with user-defined attribute names for two POS tag sets (pos: the TextPro tagset; wnpos: the WordNet part of speech), token, lemma, full morphological analysis (full_morpho), and sentence boundaries.

<hAnnotation>
  <string>Edison invented the Kinetoscope.</string>
  <word id="2232">
    <attribute name="wnpos">n</attribute>
    <attribute name="token">Edison</attribute>
    <attribute name="lemma">edison</attribute>
    <attribute name="pos">NP0</attribute>
  </word>
  <word id="2233">
    <attribute name="full_morpho">invent+v+part+past invented+adj+zero+invent+v+indic+past</attribute>
    <attribute name="wnpos">v</attribute>
    <attribute name="token">invented</attribute>
    <attribute name="lemma">invent</attribute>
    <attribute name="pos">VVD</attribute>
  </word>
  <word id="2234">
    <attribute name="full_morpho">the+adv the+art</attribute>
    <attribute name="wnpos"></attribute>
    <attribute name="token">the</attribute>
    <attribute name="lemma">the</attribute>
    <attribute name="pos">AT0</attribute>
  </word>
  <word id="2235">
    <attribute name="wnpos">n</attribute>
    <attribute name="token">Kinetoscope</attribute>
    <attribute name="lemma">kinetoscope</attribute>
    <attribute name="pos">NN1</attribute>
  </word>
  <word id="2236">
    <attribute name="full_morpho">.+punc</attribute>
    <attribute name="sentence">&lt;eos&gt;</attribute>
    <attribute name="wnpos"></attribute>
    <attribute name="token">.</attribute>
    <attribute name="lemma">.</attribute>
    <attribute name="pos">PUN</attribute>
  </word>
</hAnnotation>
Figure 5: Morpho-Syntactic Annotation Example

The second level of annotation represents texts as syntactic trees with their structural

features. Both nodes (terminal and non-terminal) and edges with syntactic relations are represented. Nodes are typically described with their morpho-syntactic properties. The example below shows the output of the Stanford Parser for the sentence "Edison invented the Kinetoscope." converted into ETAF.

<hAnnotation>
  <string>Edison invented the Kinetoscope.</string>
  <tree root="2">
    <node id="1">
      <word id="1">
        <attribute name="token">Edison</attribute>
        <attribute name="lemma">Edison</attribute>
        <attribute name="pos">NNP</attribute>
      </word>
    </node>
    <node id="2">
      <word id="2">
        <attribute name="token">invented</attribute>
        <attribute name="lemma">invent</attribute>
        <attribute name="pos">VBD</attribute>
      </word>
    </node>
    <node id="3">
      <word id="3">
        <attribute name="token">the</attribute>
        <attribute name="lemma">the</attribute>
        <attribute name="pos">DT</attribute>
      </word>
    </node>
    <node id="4">
      <word id="4">
        <attribute name="token">Kinetoscope</attribute>
        <attribute name="lemma">Kinetoscope</attribute>
        <attribute name="pos">NN</attribute>
      </word>
    </node>
    <edge to="1" name="nsubj" from="2"/>
    <edge to="3" name="det" from="4"/>
    <edge to="4" name="dobj" from="2"/>
  </tree>
</hAnnotation>
Figure 6: Syntactic Annotation Example

Publicly available RTE corpora (RTE 1-3, and EVALITA 2009), annotated in ETAF at both annotation levels, are delivered together with the system to be used as first experimental datasets.

The annotation of an entailment corpus with one of the annotation tools known to the system is done with the following command:

edits -a -name-of-the-tool -o output-file input-file

where "-a" indicates that EDITS must annotate a file. The "-name-of-the-tool" option indicates the annotation tool (e.g. "stanford-parser", "textpro" or "opennlp") used to perform the annotation. The "-o" option indicates the file where EDITS will store the result of the annotation. For example, the following command will annotate the RTE2 corpus with the

Stanford parser:

edits -a -stanford-parser -o RTE2_dev-annotated.xml rte/RTE2_dev.xml

ETAF-annotated files can be visualised with the EDITS graphical interface. For example snapshots, check Section 8.

4.2 Configuration

The creation of an Entailment Engine is done by defining its basic components (algorithms, cost schemes, optimizer, and rules) through an XML configuration file.

<conf>
  <module alias="distance">
    <module alias="tree"/>
    <module alias="xml">
      <option name="scheme-file" value="${EDITS}/share/cost-schemes/idfscheme.xml"/>
      <option name="hash-file" id="idf" value="${EDITS}/share/cost-schemes/idf.txt"/>
      <option name="hash-file" id="stopwords" value="${EDITS}/share/cost-schemes/stopwords.txt"/>
    </module>
    <module alias="pso"/>
  </module>
</conf>
Figure 7: An Example of a Configuration File

The configuration file in Figure 7 is divided into modules, each having a set of options. This configuration defines a distance Entailment Engine that combines Tree Edit Distance as a core distance algorithm with the predefined IDF Cost Scheme, which will be optimized on training data with the Particle Swarm Optimization algorithm (pso) as in (Mehdad, 2009). Adding external knowledge to an entailment engine can be done by extending the configuration file with a reference to a rules file (e.g. wordnet-rules.xml) as follows:

<module alias="memory">
  <option name="rules-file" value="${EDITS}/share/cost-schemes/wordnet-rules.xml"/>
</module>

The DTD of the format and more information about the configuration file appear in Section 7. Configuration files can be created, visualized and modified with the EDITS graphical interface. For example snapshots, check Section 8.

4.3 Training and Test

4.3.1 Training

Given a configuration file and an RTE corpus annotated in ETAF, the user can run the training procedure to learn a model. This is done using the following command:

edits -r -c configuration_file -sm model annotated_entailment_corpus
where "-r" instructs the system to train a model using the entailment engine defined in the configuration file specified by the "-c" option on the annotated entailment corpus file or directory provided as input. The obtained model will be saved in the file specified by


the "-sm" option. For example, if we want to train a model on the RTE3 development dataset with the predefined configuration specified in the file share/configurations/conf2.xml, we use the following command:

edits -r -c share/configurations/conf2.xml -sm RTE3dev-model-conf2 rte/etaf/morpho-syntax/RTE3dev.xml

The output of the training phase is a model: a zip file that contains the learned threshold, the configuration file, the cost scheme, and the entailment/contradiction rules used to calculate the threshold. The explicit availability of all this information in the model allows users to share, replicate and modify experiments. [1]

At the end of the training phase, a summary of the system's performance over the training set is printed on screen. This summary reports: (i) the distance model, including the distance threshold; (ii) the accuracy of the annotation of the whole training set; (iii) separate precision, recall and F-measure scores for the YES and NO training pairs. An example summary is shown in Figure 8.

Calculated Threshold: 0.7948717948717948
Accuracy: 0.61

# Examples # Precision # Recall # F-Measure # Confidence # Class #
# 412      # 0.6244    # 0.6092 # 0.6167    # 0.1454     # YES   #
# 388      # 0.5955    # 0.6108 # 0.6031    # 0.3313     # NO    #

#     # YES # NO  #
# YES # 251 # 161 #
# NO  # 151 # 237 #

Figure 8: Example of Training Summary

The training stage also allows performance to be tuned along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and/or NO entailment judgments). By default the system maximizes the overall accuracy (distinction between YES and NO pairs). To make such adjustments the user should modify the configuration file, adding the following option to the definition of the distance entailment engine:

<module alias="distance">
  ...
  <option name="optimize-per-metric" value="METRIC"/>
  ...
</module>
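As a sanity check, the per-class scores reported in a training summary like Figure 8 follow directly from its confusion matrix; Python is used here purely for the arithmetic:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Figure 8, YES class: 251 YES pairs judged YES, 151 NO pairs judged YES,
# 161 YES pairs judged NO.
p, r, f = prf(tp=251, fp=151, fn=161)
# p ≈ 0.6244, r ≈ 0.6092, f ≈ 0.6167, matching the YES row of the summary.
```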

1. Our policy is to publish online the models we use for participation in the RTE Challenges. We encourage other users of EDITS to do the same, thus creating a collaborative environment, allowing new users to quickly modify working configurations and replicate results.


More information on the values of this option, and on the other options of the distance entailment engine, can be found in the HTML reference guide downloadable with the system.

4.3.2 Test

Given a model and an ETAF-annotated RTE corpus as input, the test procedure produces a file containing, for each pair: i) the decision of the system (YES, NO); ii) the confidence of the decision; iii) the entailment score; iv) the sequence of edit operations made to calculate the entailment score. The command used to invoke the test procedure is the following:

edits -e -m model -o edits_result entailment_corpus

where "-e" instructs the system to load the entailment engine stored in the file specified by the "-m" option, and to annotate the entailment relation for the pairs of the input file. EDITS will store the result for each entailment pair in the output file specified by the "-o" option. An example output is the following XML fragment:

<pair task="IE" length="short" id="1" entailment="NO" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
</pair>
Figure 9: Simple EDITS output

The entailment relation found by EDITS is reported in the entailment attribute. If the pairs in the corpus already had an entailment relation assigned (in the case of training data or an annotated test set), the additional attribute benchmark is added, containing the original value. The user can also obtain the edit distance operations made by the system for each pair by controlling the verbosity of the output with the "-ot" option. This allows the user to produce the Extended Output (-ot=extended) shown in Figure 10, or the Full Output shown in Figure 11.
<pair task="IE" score="0.84" normalization="500.0" length="short" id="1" entailment="NO" distance="420.0" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
</pair>
Figure 10: Extended Output

<pair task="IE" score="0.84" normalization="500.0" length="short" id="1" entailment="NO" distance="420.0" confidence="0.22" benchmark="YES">
  <t>Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .</t>
  <h>Le Beau Serge was directed by Chabrol.</h>
  <log xsi:type="EditOperations" cost="420.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


    <operation type="deletion" scheme="deletion" cost="10.0">
      <source xsi:type="Word" id="1">
        <attribute name="lemma">claude</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Claude</attribute>
        <attribute name="full_morpho">claude+pn</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NP0</attribute>
      </source>
    </operation>
    ....
    <operation type="insertion" scheme="insertion" cost="10.0">
      <target xsi:type="Word" id="48">
        <attribute name="lemma">by</attribute>
        <attribute name="token">by</attribute>
        <attribute name="full_morpho">by+prep by+adv by+adj+zero</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">PRP</attribute>
      </target>
    </operation>
    ...
    <operation type="substitution" scheme="equal" cost="0.0">
      <source xsi:type="Word" id="2">
        <attribute name="lemma">chabrol</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Chabrol</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NN1</attribute>
      </source>
      <target xsi:type="Word" id="50">
        <attribute name="lemma">chabrol</attribute>
        <attribute name="wnpos">n</attribute>
        <attribute name="token">Chabrol</attribute>
        <attribute name="sentence">-</attribute>
        <attribute name="pos">NN1</attribute>
      </target>
    </operation>
  </log>
</pair>
Figure 11: Full Output

EDITS-annotated files can be visualised with the EDITS graphical interface. For example snapshots, check Section 8.

4.4 Combining Engines

A relevant feature of EDITS is the possibility to combine multiple Entailment Engines into a combined entailment engine, as shown in Figure 12. This can be done by grouping their definitions as sub-modules in the configuration file. EDITS allows users to define customized combination strategies, or to use two predefined combination modalities provided with the package, namely: i) Linear Combination, and ii) Classifier Combination.
The two modalities combine the entailment scores produced by multiple independent engines in different ways, and return a final decision for each T-H pair.


Figure 12: Combined Entailment Engine

Linear Combination returns an overall entailment score as the weighted sum of the entailment scores returned by each engine:

score(T,H) = Σi (weighti × scorei(T,H))

In this formula, weighti is an ad-hoc weight parameter for each entailment engine. Optimal weight parameters can be determined using the same optimization strategy used to optimize the cost schemes, as described in Sections 3.3 and 4.5.

Classifier Combination is based on using the entailment scores returned by each engine as features to train a classifier (see Figure 12). To this aim, EDITS provides a plug-in that uses the Weka machine learning workbench as a core. By default the plug-in uses an SVM classifier, but other Weka algorithms can be specified as options in the configuration file. The configuration file in Figure 13 describes a combination of two engines (one based on Tree Edit Distance, the other based on Cosine Similarity), used to train a classifier with Weka.

<module alias="weka">
  <module alias="distance" id="1">
    <module alias="tree"/>
    <module alias="xml">
      <option name="scheme-file" value="scheme1.xml"/>
    </module>
  </module>
  <module alias="distance" id="2">
    <module alias="cosine"/>
    <module alias="xml">
      <option name="scheme-file" value="scheme2.xml"/>
    </module>
  </module>
</module>
Figure 13: Configuration file of Combined Entailment Engine

A linear combination can easily be obtained by changing the alias of the highest-level module (weka) into linear. Entailment engines can also be combined by merging their configurations using the EDITS graphical interface. For example snapshots, check Section 8.

4.5 Optimization

The goal of the optimization process is to change the values of certain parameters (optimizable parameters) of an entailment engine in order to make it perform better on the training set. The procedure is performed by modules called engine-optimizers. EDITS provides two such modules, "genetic" and "pso", as plug-ins. For both modules the optimization procedure uses some form of genetic search to find the optimal values. The optimizable parameters are specific to the optimized engine: the optimizable parameters of the simple entailment engine are the OP constants of the cost scheme (more information in Section 5.2.1), while the optimizable parameters of the linear combination engine are the weights of each sub-engine. The Weka-based entailment engine cannot be optimized. Figure 14 shows the configuration file conf-optimize.xml (found in the share/configurations folder), which represents a simple entailment engine using an optimizable cost scheme (scheme-optimize.xml) that will be optimized when the training process is activated.
<conf>
  <module alias="distance">
    <module alias="token"/>
    <module alias="xml">
      <option name="scheme-file" value="${EDITS_PATH}/share/cost-schemes/scheme-optimize.xml"/>
    </module>
    <module alias="genetic"/>
    <option name="optimize-parameters" value="true"/>
  </module>
</conf>
Figure 14: Configuration file that represents an entailment engine that will be optimized during training

The option "optimize-parameters" indicates to EDITS that it should use the engine-optimizer module ("genetic") to tune the performance of the entailment engine. Figure 15 shows the result of the optimization process, with the new values of the optimizable parameters.

bin/edits -r -c share/configurations/conf-optimize.xml rte/etaf/morpho-syntax/RTE2_dev.xml

Parameter: OPinsertion Value: 0.7482828808106775
Parameter: OPdeletion Value: 0.08147454355821715
Parameter: OPsubstitution Value: 0.7209179287932125
Calculated Threshold: 0.5636193641242588


******************************* Accuracy: 0.61625 ############################################# # Examples # Precision # Recall # FMeasure # Confidence # Class # # 400 # 0.5947 # 0.73 # 0.6554 # 0.2658 # YES # # 400 # 0.6505 # 0.5025 # 0.567 # 0.2097 # NO # ########################################### ################### # # YES # NO # # YES # 292 # 108 # # NO # 199 # 201 # ################### Figure 15: Result of Optimization Process 4.6 Automatic Generation of Cost Schemes EDITS provides a mechanism for quick experimentation with different distance algorithms by allowing the user to avoid the specification of a cost scheme and still have a fully functional entailment engine. In this case EDITS automatically generates a cost scheme that is adapted to the algorithm and to the resources of entailment rules defined in the configuration file. For example, the cost scheme in Figure 17 is automatically generated for the simple entailment engine that uses the token edit distance algorithm, whose configuration file is shown in Figure 16. <conf> <module alias="distance"> <module alias="token"/> </module> </conf> Figure 16: Simple Entailment Engine that uses Token Edit Distance Algorithm

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <scheme> <constant value="1" type="number" name="OPinsertion"/> <constant value="1" type="number" name="OPdeletion"/> <constant value="0" type="number" name="OPsubstitution1"/> <constant value="1" type="number" name="OPsubstitution2"/> <insertion> <cost>(* OPinsertion (size (words T)))</cost> </insertion> <deletion> <cost>(* OPdeletion (size (words H)))</cost> </deletion> <substitution> <condition>(equals (a.token A) (a.token B))</condition> <cost>OPsubstitution1</cost> </substitution> <substitution> <cost>(* OPsubstitution2 (+ (size (words T)) (size (words H))))</cost> </substitution>


</scheme> Figure 17: Automatically Generated Cost Scheme The generated cost scheme is also optimizable. Users are encouraged to start from automatically generated cost schemes as templates for future cost scheme development. 4.7 Automatic Generation of Configuration File EDITS allows the user to make quick experiments without providing a configuration file. For example, the following command: bin/edits -r rte/etaf/morpho-syntax/RTE2_dev.xml will automatically generate a configuration file, like the one shown in Figure 16, that describes a simple entailment engine using token edit distance as distance algorithm and an automatically generated cost scheme. The following command will create a simple entailment engine with tree edit distance as distance algorithm: bin/edits -r -tree rte/etaf/syntax/RTE2_dev.xml The following command will generate a configuration file that describes a simple entailment engine with tree edit distance as distance algorithm and will optimize the automatically created cost scheme with the genetic entailment engine optimizer: bin/edits -r -op -genetic -tree rte/etaf/syntax/RTE2_dev.xml 4.8 Experiment <experiment use-memory="true" overwrite="false" output-type="NO" configuration="/hardmnt/queneau0/tcc/kouylekov/edits/share/configurations/conf1.xml" add-date="false"> <training>/hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_dev.xml</training> <test>/hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_test.xml</test> </experiment> Figure 18: Example of Experiment File The experiment file allows the user to reduce the length of command lines and to quickly repeat the same experiment. An example of an experiment file is shown in Figure 18. This file defines an experiment in which EDITS will use the configuration defined in the file "share/configurations/conf1.xml" to train a model on the RTE2 development set and will test this model on the RTE2 test set. 
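The experiment file above is plain XML, so its fields can be read with standard tools. The following is an illustrative sketch (not part of EDITS) that extracts the configuration path and the training/test file lists from an experiment file using the Python standard library; the inlined XML is a shortened version of Figure 18.

```python
# Illustrative sketch (not part of EDITS): reading the configuration path
# and the training/test file lists out of an experiment file like the one
# in Figure 18.
import xml.etree.ElementTree as ET

xml_text = """
<experiment use-memory="true" overwrite="false" output-type="NO"
            configuration="share/configurations/conf1.xml">
  <training>rte/etaf/morpho-syntax/RTE2_dev.xml</training>
  <test>rte/etaf/morpho-syntax/RTE2_test.xml</test>
</experiment>
"""

root = ET.fromstring(xml_text)
configuration = root.get("configuration")            # attribute on <experiment>
training = [e.text for e in root.findall("training")]  # one or more <training>
test = [e.text for e in root.findall("test")]          # zero or more <test>
```

A real experiment file may list several training and test sets, which is why the DTD declares `training+` and `test*`.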
The user can specify in the experiment file all the options that will handle the models created during training and the output of a test. The DTD of an experiment file is shown in Figure 19.
<!ELEMENT experiment (training+, test*)>
<!ELEMENT training (#PCDATA)>
<!ELEMENT test (#PCDATA)>
<!ATTLIST experiment
configuration CDATA #IMPLIED
model CDATA #IMPLIED
output CDATA #IMPLIED
output-type CDATA #IMPLIED
use-memory CDATA #IMPLIED
overwrite CDATA #IMPLIED
add-date CDATA #IMPLIED
>
Figure 19: DTD of Experiment File

New experiments can be easily created using the EDITS graphical interface. For example snapshots, see Section 8.


5 Defining Cost Schemes for Edit Operations


According to the distance-based approach, T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold. The underlying assumption is that pairs between which an entailment relation holds have a low transformation cost. EDITS allows for the definition of the cost of each edit operation carried out by the distance algorithm in order to find the best (i.e. least costly) sequence of edit operations that transforms T into H. The basic data structure in EDITS for the definition of costs is the cost scheme. One or more cost schemes can be associated with each edit operation, and they are collected in a cost scheme file that can be created by the user. A cost scheme is invoked by the edit distance algorithm with three parameters: (i) an edit operation, (ii) an element of T, called the source and referred to through the variable A, and (iii) an element of H, called the target and referred to through the variable B. Each cost scheme for a certain edit operation consists of three parts: 1. Name - every cost scheme must have a user-defined unique name. 2. Condition - a set (possibly empty) of constraints over the source and the target elements, which need to be satisfied in order to activate the cost scheme. Each constraint is expressed in a lisp-like syntax, and all constraints must be satisfied (i.e. they have to return true) for the cost scheme to be applied. 3. Cost - a fixed value, or a function that returns a numerical value, expressing the cost of the edit operation applied to the source and to the target. 
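The distance-based decision above can be traced with a small sketch. This is not EDITS code: it is a minimal word-level edit distance with pluggable per-operation costs, using the 10/10/0/20 values of the simple cost scheme shown later in this section; the threshold value is a made-up number for the example.

```python
# Illustrative sketch (not part of EDITS): word-level edit distance with
# pluggable operation costs, mirroring the cost-scheme idea.
def edit_distance(t, h, ins=10, dele=10, sub_eq=0, sub_neq=20):
    """Minimum cost of transforming token list t into token list h."""
    n, m = len(t), len(h)
    # dp[i][j] = cheapest way to turn t[:i] into h[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + dele
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sub_eq if t[i - 1] == h[j - 1] else sub_neq
            dp[i][j] = min(dp[i - 1][j] + dele,      # delete t[i-1]
                           dp[i][j - 1] + ins,       # insert h[j-1]
                           dp[i - 1][j - 1] + sub)   # substitute
    return dp[n][m]

def entails(t, h, threshold=15):
    # T entails H if the cheapest transformation stays below the threshold
    return edit_distance(t.split(), h.split()) < threshold
```

For the introductory pair, transforming "Yahoo acquired Overture" into "Yahoo owns Overture" needs one substitution of non-equal tokens, so the overall cost is 20; whether that counts as entailment depends entirely on the trained threshold.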
<!ELEMENT schemes (scheme+)> <!ELEMENT scheme (constant*,insertion*,deletion*,substitution*)> <!ELEMENT constant EMPTY> <!ELEMENT insertion (condition*,cost)> <!ELEMENT deletion (condition*,cost)> <!ELEMENT substitution (condition*,cost)> <!ELEMENT condition (#PCDATA)> <!ELEMENT cost (#PCDATA)> <!ATTLIST scheme name CDATA #REQUIRED> <!ATTLIST insertion name CDATA #REQUIRED> <!ATTLIST deletion name CDATA #REQUIRED> <!ATTLIST substitution name CDATA #REQUIRED> <!ATTLIST constant name CDATA #REQUIRED type (string|number|boolean) #REQUIRED value CDATA #REQUIRED > Figure 20: XML Cost Scheme DTD. A cost function can consider as parameters the source element, the target element, the text T, and the hypothesis H. EDITS adopts a combination of XML annotations and functional expressions to define the cost schemes. The XML Document Type Definition (DTD) of the cost scheme file is reported in Figure 20. A simple example of a predefined cost scheme file (simple-scheme.xml, introduced in Section 3.2) is shown in Figure 21. <scheme> <insertion name="insertion"> <cost>10</cost> </insertion>


<deletion name="deletion"> <cost>10</cost> </deletion> <substitution name="equal"> <condition>(equals (attribute "token" A) (attribute "token" B))</condition> <cost>0</cost> </substitution> <substitution name="not-equal"> <condition>(not (equals (attribute "token" A) (attribute "token" B)))</condition> <cost>20</cost> </substitution> </scheme> Figure 21: Simple Cost Scheme This cost scheme applies to elements of T and H, referred to respectively as A and B, which are annotated as words (see ETAF in Section 3). The function (attribute "token" A) returns the token of the source element. Within the example, there are four edit operations (1 for insertion, 1 for deletion, and 2 for substitution), assigned different costs: insertion(B) = 10 - inserting an element B, no matter what B is, always costs 10. deletion(A) = 10 - deleting an element A, no matter what A is, always costs 10. substitution(A,B) = 0 if A=B - substituting A with B costs 0 if the token of A and the token of B are equal (i.e. they are the same string). substitution(A,B) = 20 if A != B - substituting A with B costs 20 if the tokens of A and B are not equal. <substitution name="equal"> <condition>(and (equals (attribute "lemma" A) (attribute "lemma" B)) (equals (attribute "pos" A) (attribute "pos" B)))</condition> <cost>0</cost> </substitution> Figure 22 shows a more complex example of a cost scheme for the substitution operation. In the example, a token from T is substituted by a token from H with a cost equal to 0 if their lemmas and parts of speech are equal. As shown in the previous examples, EDITS allows the user to define the cost of the edit operations by means of user-defined attributes. In the example in Figure 23 the cost scheme exploits the pre-computed frequency of a token to calculate the cost of insertion, according to the intuition that more frequent words should have a lower cost of insertion. 
<insertion name="insertion_frequency"> <condition>(not (null (attribute "freq" B)))</condition> <cost>(* (/ 1 (number (attribute "freq" B))) 20)</cost> </insertion> Figure 23: Insertion Based on Frequency Cost schemes can be easily created, modified and viewed using the EDITS graphical interface. For example snapshots, see Section 8. 5.1 Data Types and Functions


Conditions and costs are defined using a set of functions, expressed in a lisp-like syntax. Such functions can consider as parameters the source A, the target B, the text T, and the hypothesis H. This means that all the information about T and H derived from their linguistic processing (e.g. part of speech, syntactic structure, etc.) is available for defining conditions and cost functions. As an example, typical constraints involve checking the token and the part of speech of A and B, while typical cost functions are computed considering the lexical similarity between A and B, possibly normalizing such value over the length of T and H. 5.1.1 Data Types Basic elements for defining constraints and costs in a cost scheme are derived from the three representation levels defined in ETAF. EDITS provides functions for the most relevant elements defined in the ETAF linguistic representation. The arguments of such functions are the variables (i.e. A, B, T, H) which are instantiated within a specific cost scheme. Functions use the following primitive object data types: 1. Word - represents a token in T and H; it is instantiated by the variables A and B in a cost scheme. 2. Node - represents a tree element in T and H; it is instantiated by the variables A and B in a cost scheme. 3. Tree - represents a syntactic tree; it is obtained using the function tree. 4. Number - a real number, for example: 0, 3.14, etc. 5. Boolean - True or False. 6. String - a sequence of characters, for example: "Dolomiti" or "Milen". 7. List - a sequence of elements (words, nodes, numbers, booleans, etc.). 8. Set - a group of elements of a given type (strings, numbers, or booleans) loaded from a file. 9. Hash - an object that contains maps from keys to values, loaded from a file. Keys are strings, while values are either strings, numbers, or booleans. 10. Edge - represents an edge in the dependency tree of T and H; it is instantiated by the variables A and B in a cost scheme. 
Sets and Hashes are objects that have to be read from an external file (i.e. they can not be created inside a cost scheme or read from the entailment corpus). The format of a file containing a Hash is shown in Figure 24. The type is the data type of the values in the file, and each key is separated from its value by a tab. A fragment from the file "share/cost-schemes/idf.txt" containing the IDF of words is shown in Figure 25. type key-1 value-1 ... key-n value-n Figure 24: Hash file format number speak 12.23 ride 3.23 ... read 10.32 Figure 25: Hash file example
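The Hash file layout of Figure 24 is simple enough to sketch a loader for it. The following is an illustrative Python sketch (not part of EDITS): the first non-empty line gives the value type, and every following line holds a tab-separated key/value pair.

```python
# Illustrative sketch (not part of EDITS): reading the Hash file format of
# Figure 24 -- first line is the value type ("string", "number" or
# "boolean"), then one tab-separated key/value pair per line.
def read_hash(text):
    lines = [l for l in text.splitlines() if l.strip()]
    vtype = lines[0].strip()
    cast = {"number": float,
            "string": str,
            "boolean": lambda s: s == "true"}[vtype]
    table = {}
    for line in lines[1:]:
        key, value = line.split("\t")
        table[key] = cast(value)
    return table
```

For example, `read_hash("number\nspeak\t12.23\nride\t3.23")` yields a dictionary mapping each word to its numeric IDF value.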


The format of a file containing a Set is shown in Figure 26. The type is the data type of the elements in the Set. A fragment from the file "share/cost-schemes/stopwords.txt" containing stop words is shown in Figure 27. type element-1 ... element-n Figure 26: Set file format string speak ride ... read Figure 27: Set file example Hashes and Sets are defined as options in the configuration of the cost scheme module. They are accessed by referring to their id, using the functions set-contains and hash-value described in the following section. The fragment in Figure 28 shows a simple definition of a hash and a set inside a configuration file. <module alias="xml"> <option name="scheme-file" value="${EDITS}/share/cost-schemes/idf-scheme.xml"/> <option name="hash-file" id="idf" value="${EDITS}/share/cost-schemes/idf.txt"/> <option name="hash-file" id="stopwords" value="${EDITS}/share/cost-schemes/stopwords.txt"/> </module> Figure 28: Definition of a Hash and a Set in a configuration file 5.1.2 Functions Functions over AnnotatedText (string AnnotatedText) - returns the text of the AnnotatedText (i.e. the text of T or H). (tree AnnotatedText) - returns the syntactic tree of the AnnotatedText. (words AnnotatedText) - returns the list of words in the AnnotatedText. Functions for accessing Entailment Rules (entail SimpleRulesObject1 SimpleRulesObject2) - checks for the existence of an entailment rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type. The allowed types are: String, Word and Node. If the rule exists, then the probability associated to the rule is returned, otherwise the output of the function is null. 
(contradict SimpleRulesObject1 SimpleRulesObject2 ) - checks for the existence of a contradiction rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type. The


allowed types are: String, Word and Node. If the rule exists, then the probability associated to the rule is returned, otherwise the output of the function is null. Functions over Trees (nodes Tree ) - returns the list of nodes of a tree. (parent Node Tree ) - returns the parent of a node in the syntactic tree of T or H. (children Node Tree ) - returns the children of a node in the syntactic tree of T or H. (from Edge) - returns the from node of an edge in the syntactic tree of T or H. (to Edge) - returns the to node of an edge in the syntactic tree of T or H. Functions over Nodes (word Node) - returns the word of the node. (label Node) - returns the label (i.e. the syntactic category) of the node. (edge Node) - returns the edge (i.e. the syntactic relation) entering in the node. (is-label-node Node) - returns true if the node contains a label. (is-word-node Node) - returns true if the node contains a word.

Functions over Words (attribute String Word) - returns the value of the attribute String of the word. If the attribute is missing the function returns null. Functions with string arguments (equals String1 String2) - returns True if String1 is equal to String2. (equals-ignore-case String1 String2) - compares two strings ignoring their case. (capitalized String) - returns True if the string is capitalized. (starts-with String1 String2) - returns True if String1 starts with String2. For instance: (starts-with "reading" "read") is True. (ends-with String1 String2) - returns True if String1 ends with String2. (contains String1 String2) - returns True if String1 contains String2. For instance, (contains "reenacting" "act") returns True. (number String) - reads a number from String. For instance, (number "3.14") returns 3.14. (boolean String) - reads a boolean from String. The possible arguments are "true" and "false". (to-lower-case String) - converts String to lower case. (length String) - returns the number of characters in String. (char String Number) - returns the character in String at the position corresponding to Number. (substring String Number1 Number2) - returns the sub-string of String from the position corresponding to Number1 to the position Number2. (distance String1 String2 :normalize) - returns the Levenshtein distance between String1 and String2. If the :normalize parameter is present the function returns a normalized distance (with respect to the length of the two arguments) between 0 and 1.
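For reference, the character-level distance computed by the distance function can be sketched as follows. This is an illustrative sketch, not EDITS code; in particular, dividing by the longer string's length is an assumption about how :normalize works, since the manual only states that the result falls between 0 and 1.

```python
# Illustrative sketch (not part of EDITS): character-level Levenshtein
# distance, optionally normalized into [0, 1] (normalization formula is
# an assumption, not documented behavior).
def levenshtein(s1, s2, normalize=False):
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(m + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    d = dp[n][m]
    return d / max(n, m, 1) if normalize else d
```

For instance, the distance between "kitten" and "sitting" is 3 (two substitutions and one insertion).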

Functions with numeric arguments (= Number1 Number2) - returns True if Number1 is equal to Number2. (< Number1 Number2) - returns True if Number1 is less than Number2. (> Number1 Number2) - returns True if Number1 is more than Number2. (+ Number1 ... Numbern) - sums numbers (example: (+ 1 2) is equal to 3).


(- Number1 ... Numbern) - subtracts numbers from Number1 (example: (- 5 2 1) is equal to 2). (* Number1 ... Numbern) - multiplies numbers (example: (* 2 2) is equal to 4). (/ Number1 ... Numbern) - divides Number1 by the rest (example: (/ 24 3 2) is equal to 4). Functions with boolean arguments (and Boolean1 ... Booleann) - returns True if all the arguments are True. (or Boolean1 ... Booleann) - returns True if at least one of the arguments is True. (not Boolean) - returns True if the argument is False, and False if it is True. Conditional Functions (if Boolean Object1 Object2) - if the Boolean is equal to True then the function returns Object1, otherwise Object2. If Object2 is not defined the function returns null. Functions with list arguments (member List Object) - returns True if List contains Object. For example: the list (1 2 3) contains the number 1; the list (sum plus minus) contains the string plus. (size List) - returns the number of elements in List. (nth Number List) - returns the n-th (Number) element of List. The first element of a list is returned by (nth 1 list), the last with (nth (- (size list) 1) list). (subseq List Number1 Number2) - returns the sub-list of the list from the position corresponding to Number1 to the position Number2. Functions handling Hash and Set (hash-value String1 String2) - returns the value of the hash with id equal to String1 for the key String2. (set-contains String Object) - returns True if the set with id equal to String contains Object. Null handling functions (null Object) - returns True if the argument is null. For example, to express that a word A from T does not have an attribute freq, the expression (null (attribute "freq" A)) can be used. 5.2 Using constants <constant name="COST" type="number" value="10"/> <insertion name="insertion"> <cost>COST</cost> </insertion> <deletion name="deletion"> <cost>COST</cost> </deletion> Figure 29: Example of Cost Scheme constants. 
The constants are used in a cost scheme to externalize certain values that can be used by more than one of the operations of the cost scheme. In the XML scheme in Figure 29


the constant "COST" is used as the cost of both the insertion and deletion operations. Each constant must have a type that defines how the cost scheme will interpret its value. The possible types are "string", "number" and "boolean". EDITS provides two constants, T and H, that the user can use inside cost schemes. The content of these constants is the annotation of T and H, respectively. For example, if the user wants to set as the cost of deletion the number of words in T, he/she must use the following XML fragment: <deletion name="deletion"> <cost>(size (words T))</cost> </deletion> 5.2.1 Optimizing Cost Schemes Constants also play an important role in optimizing the cost scheme. Constants whose name starts with the capital letters OP are considered by the system as parameters of the cost scheme that can be optimized. A cost scheme is optimizable if it contains at least one such parameter. An example of a cost scheme with optimizable constants is the fragment in Figure 30 (which can be found in share/cost-schemes/optimize-scheme.xml). In this scheme, the costs of insertion, deletion, and of substitution when A and B are not equal are optimized. 
<scheme> <constant name="OPinsertion" value="10" type="number"/> <constant name="OPdeletion" value="10" type="number"/> <constant name="OPsubstitution" value="20" type="number"/> <insertion name="insertion"> <cost>OPinsertion</cost> </insertion> <deletion name="deletion"> <cost>OPdeletion</cost> </deletion> <substitution name="equal"> <condition>(equals (attribute "token" A) (attribute "token" B))</condition> <cost>0</cost> </substitution> <substitution name="not-equal"> <condition>(not (equals (attribute "token" A) (attribute "token" B)))</condition> <cost>OPsubstitution</cost> </substitution> </scheme> Figure 30: Optimizable Cost Scheme 5.3 Matching Edges From version 2.1, all algorithms that process tokens (the only one that works on trees is tree edit distance) are adapted to work with the edges of the syntactic tree. To do this the user must use the match-edges option, either in the configuration file or in the command line. For certain algorithms, like token edit distance or longest common subsequence, this is not semantically motivated, as the edges do not have a predefined order. bin/edits -train -overlap -match-edges <module alias="overlap">


<option name="match-edges" value="true"/> </module>


6 Defining Rules in EDITS


EDITS allows the use of sets of rules, both entailment rules and contradiction rules, in order to provide specific knowledge (e.g. lexical, syntactic, semantic) about transformations between T and H. Rules can be manually created, or they can be extracted from any available resource (e.g. WordNet, Wikipedia, DIRT) and stored in XML files which are called Rule Repositories. Each rule in EDITS consists of five parts: 1. Name - a unique identifier of the rule within a certain rule repository. This is used for logging purposes only, in order to help the user understand which rules have been applied by the system for a certain pair. If not provided by the user, the rule name is automatically generated by the system. 2. Type (entailment) - specifies the type of the rule: entailment or contradiction. 3. t - a text T, i.e. the left hand side of the rule. 4. h - a hypothesis H, i.e. the right hand side of the rule. 5. Probability - the probability that the rule maintains either the entailment or the contradiction between T and H. Both in entailment and contradiction rules, a probability equal to 0 means that the relation between T and H is unknown, while a probability equal to 1 means that the entailment/contradiction between T and H is fully preserved. 6.1 Rule format Both T and H can be defined using the Edits Text Annotation Format (ETAF) described in Section 4. ETAF allows text portions to be represented at three different levels of annotation: just as strings (i.e. the STRING object), as sequences of tokens with their morpho-syntactic features (i.e. the WORD object), and as syntactic trees (i.e. the NODE object). Rules in EDITS can be defined using the three data types, provided that they are used consistently in T and H, i.e. either STRING, or WORD or NODE. The XML Document Type Definition (DTD) of the rules file is reported in Figure 30. 
<!ELEMENT rules (rule+)> <!ELEMENT rule (t, h, probability)> <!ELEMENT t (string?,word*,tree?,semantics?)> <!ELEMENT h (string?,word*,tree?,semantics?)> <!ELEMENT string (#PCDATA)> <!ELEMENT word (attribute+)> <!ELEMENT attribute (#PCDATA)> <!ELEMENT tree (node+,edge*)> <!ELEMENT node (word|label)> <!ELEMENT label (#PCDATA)> <!ELEMENT edge EMPTY> <!ELEMENT semantics (entity*,relation*)> <!ELEMENT entity EMPTY> <!ELEMENT relation EMPTY> <!ATTLIST rule name CDATA #REQUIRED> <!ATTLIST t id CDATA #IMPLIED> <!ATTLIST h id CDATA #IMPLIED> <!ATTLIST word id CDATA #IMPLIED> <!ATTLIST attribute name CDATA #REQUIRED> <!ATTLIST tree id CDATA #IMPLIED> <!ATTLIST node id CDATA #IMPLIED> <!ATTLIST edge name CDATA #IMPLIED from CDATA #REQUIRED to CDATA #REQUIRED >


<!ATTLIST relation name CDATA #REQUIRED entailment CDATA #IMPLIED source CDATA #IMPLIED > <!ATTLIST entity name CDATA #REQUIRED start CDATA #IMPLIED end CDATA #IMPLIED source CDATA #IMPLIED > Figure 30: DTD of the rule file In the current release of EDITS only rules that contain just one element both in t and h (i.e. lexical rules) are allowed. 6.1.1 Entailment Rules Entailment rules preserve, with some degree of confidence, the entailment relation between T and H. The following are examples of entailment rules at the different levels allowed by ETAF. <rule name="1" entailment="Entailment"> <t><string>invented</string></t> <h><string>pioneered</string></h> <probability>1.0</probability> </rule> Figure 31: String Entailment Rule The string entailment rule in Figure 31 states that the word "invented" entails the word "pioneered" with a probability equal to 1.0. <rule name="2" entailment="Entailment"> <t> <word> <attribute name="lemma">invent</attribute> <attribute name="pos">v</attribute> </word> </t> <h> <word> <attribute name="lemma">pioneer</attribute> <attribute name="pos">v</attribute> </word> </h> <probability>1.0</probability> </rule> Figure 32: Morpho-Syntactic Entailment Rule The entailment rule in Figure 32 states that the lemma "invent" entails the lemma "pioneer" with a probability equal to 1.0. <rule name="3" entailment="Entailment"> <t>


<node> <attribute name="edge-to-parent">dobj</attribute> <word> <attribute name="token">home</attribute> </word> </node> </t> <h> <node> <attribute name="edge-to-parent">dobj</attribute> <word> <attribute name="token">habitation</attribute> </word> </node> </h> <probability>1.0</probability> </rule> Figure 33: Syntactic Entailment Rule The entailment rule in Figure 33 states that the node "home" in the dependency relation of direct object with its syntactic head (e.g. John bought a home, where "home" is the direct object of the verb "buy") entails the node "habitation" in the dependency relation of direct object with its syntactic head (e.g. John bought a habitation, where "habitation" is the direct object of the verb "buy") with a probability equal to 1.0. 6.1.2 Contradiction rules Contradiction rules represent, with some degree of confidence, the semantic incompatibility between T and H. The following are examples of contradiction rules at the different levels allowed by ETAF. <rule name="1" entailment="Contradiction"> <t><string>beautiful</string></t> <h><string>ugly</string></h> <probability>1.0</probability> </rule> Figure 34: String Contradiction Rule The contradiction rule in Figure 34 states that the string "beautiful" contradicts the string "ugly" with a probability equal to 1.0. <rule name="2" entailment="Contradiction"> <t> <word> <attribute name="lemma">extend</attribute> <attribute name="pos">v</attribute> </word> </t> <h> <word> <attribute name="lemma">shorten</attribute> <attribute name="pos">v</attribute> </word> </h>


<probability>1.0</probability> </rule> Figure 35: Morpho-Syntactic Contradiction Rule The contradiction rule in Figure 35 states that the lemma "extend" contradicts the lemma "shorten" with a probability equal to 1.0. <rule name="3" entailment="Contradiction"> <t> <node> <attribute name="edge-to-parent">amod</attribute> <word> <attribute name="token">white</attribute> </word> </node> </t> <h> <node> <attribute name="edge-to-parent">amod</attribute> <word> <attribute name="token">black</attribute> </word> </node> </h> <probability>1.0</probability> </rule> Figure 36: Syntactic Contradiction Rule The contradiction rule in Figure 36 states that the node "white" in the dependency relation of adjectival modifier with its syntactic head (e.g. Mary wears a white T-shirt, where "white" is the adjective modifying the noun "T-shirt") contradicts the node "black" in the dependency relation of adjectival modifier with its syntactic head (e.g. Mary wears a black T-shirt, where "black" is the adjective modifying the noun "T-shirt") with a probability equal to 1.0. 6.2 Rules repository In order to be used by EDITS, both entailment and contradiction rules have to be stored in a rule repository. EDITS allows the user to declare and use multiple XML rule files as sets of entailment or contradiction rules that can be referred to using user-defined identifiers. As an example, the declaration below defines two rule repositories, identified as wordnet-entailment and wordnet-contradiction, each containing one rule file. An example of an entailment rules repository configuration can be found in Figure 37. <module type="entailment-engine"> <module type="rule-repository" alias="memory" id="wordnet-entailment"> <option name="rules-file" value="${EDITS}/share/rules/entailment-rules-wordnet.xml"/> </module> <module type="rule-repository" alias="memory" id="wordnet-contradiction"> <option name="rules-file" value="${EDITS}/share/rules/contradiction-rules-wordnet.xml"/> </module> </module>


Figure 37: Entailment Rules Repository Configuration 6.3 Rule activation The basic way of activating a rule is through one of the two functions, entail and contradict, which can be used within a cost scheme (see Section 5). The two functions check for the existence of an entailment or a contradiction rule between the values assumed by A and B in a cost scheme. If a rule exists in the specified repository which matches both A and B, then the probability associated to the rule is returned, otherwise null. The two functions accept four parameters: 1. The first two parameters X and Y are portions of T and H managed by the distance algorithm and the cost scheme. 2. The name of a set of rules in the rules repository where the search has to be carried out. This parameter is optional. 3. The search modality. Two search modalities are allowed: First, which selects the first rule that matches the X and Y parameters; Max, which selects the rule that matches X and Y with the highest probability. As an example, the following call to the entail function: (entail X Y "wordnet-entailment" :max) searches for the rule with the highest probability among those that are activated by the X and Y parameters and that are contained in the rules repository with id wordnet-entailment. A rule is activated when the X parameter of the entail/contradict function matches the T part of the rule, i.e. the left hand side, and the Y parameter of the function matches the H part of the rule, i.e. the right hand side. All the elements of the X/Y argument have to match against all elements of the rule. In case the rule contains variables, their assignments to corresponding elements of the X/Y argument need to be satisfied. The entail and contradict functions are called in cost schemes, typically in the cost scheme defined for the substitution edit operation. 
Figure 38 shows a substitution that calculates the cost of substituting A with B as one minus the probability of the entailment rule between A and B in the repository with id wordnet-entailment. <substitution name="entail"> <cost>(- 1 (entail A B "wordnet-entailment" :max))</cost> </substitution> Figure 38: Substitution using Entailment Function
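The behavior of the entail lookup and its two search modalities can be sketched as follows. This is an illustrative sketch, not EDITS code: the repository is modeled as a plain list of (lhs, rhs, probability) triples, and the function names mirror the manual's terminology.

```python
# Illustrative sketch (not part of EDITS): rule lookup with the
# :first / :max search modalities. A repository is a list of
# (lhs, rhs, probability) triples.
def entail(x, y, rules, modality=":max"):
    matches = [p for lhs, rhs, p in rules if lhs == x and rhs == y]
    if not matches:
        return None                     # no rule fires -> null
    return matches[0] if modality == ":first" else max(matches)

rules = [("acquired", "owns", 0.6),     # two competing rules for the
         ("acquired", "owns", 0.8),     # same lexical pair
         ("invented", "pioneered", 1.0)]

cost = None
p = entail("acquired", "owns", rules)   # :max picks the 0.8 rule
if p is not None:
    cost = 1 - p                        # substitution cost as in Figure 38
```

With :first the 0.6 rule would be selected instead, since it appears earlier in the repository.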


7 EDITS Configuration File


The purpose of a configuration file is to define the three basic modules (i.e. distance algorithm, cost scheme and rule repositories), and their corresponding parameters, that will actually be used while running EDITS on a certain dataset. Only modules defined in an EDITS Configuration File (ECF) can be used for training and testing with the command bin/edits (see Sections 4.2 and 4.3). A module may require that another module is defined in order to work. Such dependencies are expressed in a configuration file through nested modules. The whole EDITS configuration is itself considered a module, the most global one, called the entailment engine, which requires three nested modules, respectively for the distance-algorithm, the cost-scheme and the rules repository. The XML Document Type Definition (DTD) of the configuration file is reported in Figure 39. <!ELEMENT conf (constant*, module*)> <!ELEMENT module (module*, option*, mlink*)> <!ELEMENT option EMPTY> <!ELEMENT mlink EMPTY> <!ATTLIST module type CDATA #IMPLIED id CDATA #IMPLIED alias CDATA #IMPLIED className CDATA #IMPLIED > <!ATTLIST option name CDATA #REQUIRED id CDATA #IMPLIED value CDATA #IMPLIED > <!ATTLIST mlink idref CDATA #REQUIRED > Figure 39: DTD of the configuration file 7.1 Module Configuration Modules are defined by the following pieces of information: alias - an internal identifier known to the system. All modules with their internal identifiers are listed in the HTML reference; className - a path to the Java class of the module, referring to the code that will be executed when the module is activated; id - a unique identifier for the module, assigned by the user; type - indicates the category of the module being defined. Accepted values for the type attribute are entailment-engine, distance-algorithm, cost-scheme, rules-repository, etc. (see the HTML reference); option - sets the options of the module. Examples of configuration files can be found in the share/configurations folder. 7.2 Usage of Constants


EDITS allows the values of options which are frequently used in the configuration file to be referred to through constants, declared at the beginning of the configuration file. For example, the configuration in Figure 40 uses the ${DATA_PATH} constant to indicate the access path to cost scheme resources.

<conf>
  <constant name="DATA_PATH">/home/epack/edits</constant>
  <module alias="distance">
    <module alias="tree"/>
    <module alias="xml">
      <option name="scheme-file" value="${DATA_PATH}/share/cost-schemes/idfscheme.xml"/>
      <option name="hash-file" id="idf" value="${DATA_PATH}/share/cost-schemes/idf.txt"/>
      <option name="hash-file" id="stopwords" value="${DATA_PATH}/share/cost-schemes/stopwords.txt"/>
    </module>
    <module alias="pso"/>
  </module>
</conf>

Figure 40: Example of a constant in the configuration file.

Every configuration can access the path of the EDITS installation through the constant "EDITS_PATH".
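As an illustration, the built-in EDITS_PATH constant can replace a user-defined one, so that the configuration needs no <constant> declaration at all. The following sketch is adapted from Figure 40; the path is illustrative and should point at an actual cost scheme file in your installation:

```xml
<conf>
  <!-- EDITS_PATH is predefined by the system: no <constant> declaration is needed -->
  <module alias="distance">
    <module alias="tree"/>
    <module alias="xml">
      <!-- illustrative path, resolved relative to the installation directory -->
      <option name="scheme-file"
              value="${EDITS_PATH}/share/cost-schemes/idfscheme.xml"/>
    </module>
  </module>
</conf>
```

Such a configuration remains valid when the installation is moved, since the constant is resolved at run time.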


8 EDITS Graphical Interface


This section contains several snapshots of the EDITS graphical interface. The graphical interface is started with the command:

edits -g

The graphical interface is a simple editor for configurations, cost schemes and experiments. It can also display entailment corpora, EDITS output and EDITS models. The interface is organized as a desktop in which different views are opened as windows. The user can copy and cut objects from one window and paste them into another one of the same type.

Figure 41: New Configuration Simple Engine

The interface in Figure 41 represents a dialog for creating a new entailment engine. The user must select one algorithm to create a configuration for a simple entailment engine, or more than one for a combination entailment engine. In the latter case, a combination strategy must also be selected, as shown in Figure 42.

Figure 42: New Configuration Combined Engine


Figure 43: Configuration Editor - Conf1.xml

EDITS provides an interface for editing a configuration file, as presented in Figure 43. The configuration file is represented as a tree. The user can interact with it through the context menu available by right-clicking a node of the tree.


Figure 44: Cost Scheme Editor - Simple Cost Scheme

EDITS provides an interface for creating and editing a cost scheme file, as presented in Figure 44. The cost scheme file is represented as a tree. The user can interact with it through the context menu available by right-clicking a node of the tree.


Figure 45: ETAF Corpus Representation

EDITS provides an interface for browsing entailment corpora, as presented in Figure 45. The files are represented as a table containing the entailment pairs in the rows and their contents in the columns. The user can view the annotation of each pair by clicking the View button. The morpho-syntactic annotation of a pair is presented as a table in Figure 46. The syntactic annotation of a pair is presented as a tree in Figure 47.


Figure 46: ETAF Morpho-Syntactic Annotation of an Entailment Pair


Figure 47: ETAF Syntactic Annotation of a Pair

EDITS provides an interface for browsing EDITS output, as presented in Figure 48. The table is similar to the interface for browsing an entailment corpus, with additional columns representing the extra information (score, confidence, etc.) included in the EDITS output. The View button opens a simple viewer, shown in Figure 49, for the log of edit operations attached to each pair.


Figure 48: EDITS output


Figure 49: Edit Operations

EDITS provides an interface for creating and editing an experiment file, as presented in Figure 50. The interface allows the user to change the basic elements of an experiment and to execute it, obtaining a result as represented in Figure 51.


Figure 50: Experiment interface


Figure 51: Experiment Result


EDITS provides an interface for viewing the contents of an EDITS model, as presented in Figure 52.


Figure 52: EDITS Model View

References
Ido Dagan and Oren Glickman (2004). Probabilistic Textual Entailment: Generic Applied Modeling of Language Variability. In Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining, Grenoble, France.

Ido Dagan, Oren Glickman and Bernardo Magnini (2005). The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, U.K., 11-13 April.

Dan Klein and Christopher D. Manning (2003). Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.

Milen Kouylekov and Bernardo Magnini (2005). Tree Edit Distance for Recognizing Textual Entailment. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 21-23 September.

Vladimir I. Levenshtein (1965). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR, 163(4), pp. 845-848.


Emanuele Pianta, Christian Girardi and Roberto Zanoli (2008). The TextPro Tool Suite. In Proceedings of LREC, 6th edition of the Language Resources and Evaluation Conference, Marrakech, Morocco, 28-30 May.

Kaizhong Zhang and Dennis Shasha (1990). Fast Algorithm for the Unit Cost Editing Distance Between Trees. Journal of Algorithms, vol. 11, December 1990.

Yashar Mehdad (2009). Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization. In Proceedings of ACL-IJCNLP 2009.

Matteo Negri and Milen Kouylekov (2009). Question Answering over Structured Data: an Entailment-Based Approach to Question Analysis. In Proceedings of RANLP 2009.

Yashar Mehdad, Matteo Negri, Elena Cabrio, Milen Kouylekov and Bernardo Magnini (2009). Recognizing Textual Entailment for English: EDITS @ TAC 2009. To appear in Proceedings of TAC 2009.

Chin-Yew Lin (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out, ACL 2004.

