A Knowledge-Based Information Extraction Prototype for Data-Rich Documents in the Information Technology Domain

By
Sergio Gonzalo Jimenez Vargas

Master in Systems Engineering and Computer Science
Systems and Industrial Engineering Department
Engineering School of the National University of Colombia (Bogotá D.C. - Colombia)
2008

This thesis has been read by each member of the following graduate committee and has been found to be satisfactory.

___________________ Date ________________________ Fabio A. Gonzalez O. (Chair)
Abstract
The Internet is a great source of information. Semi-structured text documents represent a great part of that information; commercial data sheets of the Information Technology (IT) domain are among them (e.g. laptop computer data sheets). However, the capability to automatically gather and manipulate such information is still limited, because those documents are designed to be read by people. Documents in domains such as IT describe commercial products in data sheets with technical specifications. People use those data sheets mainly to make buying decisions. Commercial data sheets are a kind of data-rich document, characterized by heterogeneous format, specialized terminology, names, abbreviations, acronyms, quantities, magnitudes, units and limited use of complex natural language structures.

This thesis presents an information extractor for data-rich documents based on a lexicalized domain ontology built mainly with meronymy and attribute relationships. Ontology concepts were manually lexicalized with words and n-grams. The extractor combines a fuzzy string matcher module and a term disambiguator based on ideas from the word sense disambiguation (WSD) problem in the natural language processing (NLP) field. The former is used to find references to ontology concepts allowing an error margin, and the latter is used to choose the best concept and use (or sense) associated with each referenced concept in a data-rich document according to its context.

A domain ontology, a lexicon and a labeled corpus were manually constructed for the laptop computers domain using data sheets downloaded from the web sites of three of the most popular computer makers. The ontology had more than 350 concepts and the lexicon had more than 380 entries with at least 1500 different related terms. The validation document corpus consisted of five selected data-sheet documents with more than 5000 tokens in total, of which 2300 were extraction targets. The ontology and lexicon were built from a set of 30 laptop data sheets; a subset of them was manually annotated with unambiguous semantic labels in order to be used for validation.

The approximate text string-matching problem using static measures was reviewed. Static string measures are those that compare two strings in an algorithmic way using only the information contained in both strings. Additionally, a comparative study of different string matching techniques at character and at token level was conducted using data sets well known in the literature. In particular, a new general method for combining any string measure at character level with resemblance coefficients (e.g. Jaccard, Dice and cosine) was proposed.
Encouraging results were obtained in the proposed experiments. On the other hand, to develop the term disambiguator, the concept of semantic path was proposed. A semantic path links the ontology root concept with a terminal concept (or sink node) in an ontology directed acyclic graph. Semantic paths have three uses: (i) to define a label type for unambiguous semantic document annotation, (ii) to determine the use/sense inventory for each ontology concept and (iii) to serve as the comparison unit for semantic relatedness measures. The proposed disambiguation technique was inspired by WSD methods based on general lexicalized ontologies such as WordNet. Additionally, a new global context-graph optimization criterion for disambiguation was proposed (i.e. shortest path). That criterion seemed to be suitable for the specific task. Finally, experiments showed that the proposed information extraction approach is feasible for data-rich documents.
Acknowledgments
I am indebted to many people for their assistance and support with my thesis. This work would not have been possible without the support of many people in and outside of the National University of Colombia. I would like to express particular appreciation to the following people: To my wife, colleague and partner, Claudia Becerra, who provided the main inspiration, encouragement and everyday moral support in completing this thesis. I also appreciate her love, help and tolerance of all the late nights and weekends that we have put into studying, discussing and working. To my advisor, Dr. Fabio Gonzalez, for his enthusiasm and intelligence, and for generously finding time in his very busy schedule to discuss ideas with me. I also would like to thank Dr. ____, whose comments reoriented our approach in the right direction at the right time. To Dr. Jonatán Gómez, who advised me during my efforts to find the bounds of this work and provided invaluable suggestions. To Dr. Yoan Pinzon, for his time and support in polishing my technical writing. To Dr. Luis F. Niño and Dr. Elizabeth León, whose skillful teaching and patient tutoring were helpful to me at the early stages of this project. To Zaza Aziz, for patiently helping me to improve my English skills. Special thanks to my colleagues at the Laboratory of Intelligent Systems (LISI) and to the fellow students of the three research seminars, who provided valuable comments and feedback during innumerable presentations and reviews. To my family and my family-in-law. I would also like to thank Emilse Villamil, who provided invaluable assistance by taking kind care of my daughters. Thanks for her help and patience with my irregular schedule and, undoubtedly, for her healthy and delicious food. Finally, I would like to thank my daughters Sara and Luisa for their bright smiles, company, happiness and unconditional love, which gave me the necessary motivation and support all the time.
Contents

1 Introduction
2 Domain Ontology, Lexicon and Corpus
    2.2 Laptop Ontology
3 …
    3.2.2 q-gram Distances
    3.3 Combining Character-Level Measures at Token Level
        3.3.1 Monge-Elkan Method
        3.3.2 Token Order Heuristics
        3.3.4 Token Edit-Distance
        3.3.5 Proposed Extension to Monge-Elkan Combination Method
        3.3.6 Proposed Cardinality-Based Token-/Character-Level Combination Method
    3.4 Experimental Evaluation
        3.4.1 Experimental Setup (Experiments, Performance Measure, Statistical Test, …)
        3.4.2 Results (General Measures Ranking, Results versus Baselines, Results for Extended Monge-Elkan Method, Results for the Resemblance Coefficients, Results by Data Set)
    3.5 Summary
4 …
    4.1 Knowledge-Based Word Sense Disambiguation (Gloss Overlap WSD Algorithms, …)
    4.2 Proposed Information Extraction System for Data-Rich Documents
        4.2.1 Fuzzy String Matcher
    4.3 Results (Fuzzy String Matcher, Semantic Relatedness Measures, Disambiguation Strategies)
    4.4 Summary
5 …
    5.2 Proposed Fuzzy String Matcher Sensitive to Acronyms and Abbreviations
6 Conclusions
    6.1 Summary of Contributions
    6.2 Conclusions
A Laptop Ontology
    A.1 CommercialOffer
    A.2 ProductID
    A.3 Processor
    A.4 Chipset & Memory
    A.5 HardDisk
    A.6 Display
    A.7 VideoAdapter
    A.8 OpticalDrive
    A.9 NetworkAdapter(s) & Modem
    A.10 InputDevice(s)
    A.11 Battery & ACAdapter
    A.12 PhysicalDescription
    A.13 AudioAdapter & Speaker
    A.14 MediaAdapter & Interfaces & ExpansionSlots
    A.15 WarrantyServices
    A.16 Software
    A.17 Security
    A.18 Measurements
List of Figures

2.1 Laptop domain ontology DAG sample.
2.2 Node-frequency chart for each definition level (i.e. out degree).
2.3 … (in degree) …
2.4 Semantic-path-frequency chart for different semantic path lengths in the ontology graph.
2.5 Terminal-node count chart for different numbers of their semantic paths.
2.7 Term-frequency chart grouped by the number of times that terms are referred to in the laptop lexicon.
2.8 Concept-frequency chart grouped by the number of terms in each concept in the laptop lexicon.
3.1 Edit distance …
3.2 … {Business}.
3.3 … {Business}.
3.5 … exponent m in extended Monge-Elkan method for each internal character-level string similarity measure.
3.6 Behavior of the exponent in extended Monge-Elkan …
4.1 Sample graph built over the set of possible labels for a sequence of four words by Sinha and Mihalcea.
4.? An example of the graph aligned to the token sequence for term disambiguation.
4.? An example of the matching schema.
4.? Token sequence labeled with possible senses (i.e. semantic paths).
4.10 Graph of semantic paths aligned to the document token sequence.
4.11 Characterization of the labeling problem as a two-class problem.
4.12 Graph with the frequency of true positives, false positives and false negatives, ranging the string matching threshold in (0,1].
4.13 Precision/recall/F-measure vs. threshold for the experiment of Figure 4.12.
4.15 Number of valid matches for different string comparison thresholds using EditDistance-MongeElkan as matching method.
4.16 F1 score for different matching methods using five levels of character-editing noise in the lexicon.
4.17 F1 score for different matching methods using five levels of character-editing noise combined with token shuffle noise in the lexicon.
4.18 Precision/recall/F-measure vs. threshold plots.
4.19 F-measure using shortest-path and other different semantic relatedness measures.
4.20 Number of nodes in the disambiguation graph for different string comparison thresholds.
4.21 Extraction performance (F-measure) comparison between Adapted Lesk with window size of 3 vs. shortest path and baselines.
5.1 Pattern/document example.
5.2 Example of the token configuration combinations enumerated by the acronym-sensitive token matcher.
5.3 Performance comparison of the information-extraction system with and without the acronym-sensitive matcher.
List of Tables

1.1 Specific objectives.
2.? Concepts lexicalized with regular expressions in the laptop lexicon.
2.? Laptop data sheets in the evaluation corpus.
2.? Characterization of the documents in the evaluation document corpus.
3.? Meaning of the Minkowski distance and generalized mean exponent.
3.? Pairwise token similarities with edit-distance-based similarity measure for the string "Jimenez Gimenez Jimenez".
3.4 Pairwise token similarities with edit-distance-based similarity measure for the string "giveaway girlfriend gingerbread".
3.5 Pairwise token similarities with edit-distance-based similarity measure for the string "Bush Clinton Kerry".
3.? cardA(·) …
3.? Data sets used for experiments.
3.? Some examples of valid matches in the 12 data sets.
3.10 Critical values for the Wilcoxon test statistic for one-tailed test.
3.11 IAP Wilcoxon sign test for experiments: 2gram-MongeElkan2 (method 1) and ed.Dist-MongeElkan0 (method 2).
3.12 IAP Wilcoxon sign test for experiments: 2gram-compress(A) (method 1) and 2gram-MongeElkan2 (method 2).
3.13 General ranking for combined string matching methods. Part 1/2.
3.14 General ranking for combined string matching methods. Part 2/2.
3.15 Average IAP for baseline measures improved with Monge-Elkan method.
3.16 Average IAP for the Monge-Elkan combination method compared with the best combination method.
3.17 Values of the Wilcoxon's test statistic comparing extended …
3.? cardA(·) …
4.2 Description of the noisy generated lexicons with character edition operations for different noise levels.
4.3 Description of the noisy generated lexicons with term shuffling at token level.
4.4 F1 score for different string matching techniques using the noise-free laptop domain lexicon.
4.5 Threshold where the F1 score was reached for different string matching techniques using the noise-free laptop-domain lexicon.
4.6 F1 score results for different string matching techniques at different noise levels in the lexicon.
4.7 String matching thresholds for the F1 score results reported in Table 4.6.
Chapter 1
Introduction
The amount of documents available electronically has increased dramatically with the vertiginous growth of the Web. However, the capacity to automatically understand those documents is still considered an open problem [34]. On the other hand, information extraction (IE) is the task of obtaining structured information from semi-structured or free text. Although IE is generally considered a shallow reading process, it is a step towards solving the challenge of machine reading. For instance, extracting a list of authors, book titles and publishers from a set of texts with literary criticism is an IE task. In particular, IE aims to find some selected entities and relationships in texts. Returning to the book publishing example, three target fields can be extracted (i.e. author, book title and publisher), as well as two possible relationships between them. Usually, IE focuses only on extracting a small portion of the full amount of information present in a document. Many other entities and complex relations expressed in the document are ignored; for instance, criticisms, opinions, feelings, sarcastic remarks and recommendations are out of the reach of IE processes.

However, other document types can be considered in which the amount of information that can be extracted is comparable to the total document information. For instance, a laptop data sheet describes the product in detail, including features, composing parts and performance capabilities. Some examples of data sheets are shown in Figure 1.1. Data sheets are full of entities to be extracted, and most of the relationship types are clearly defined. The most common relationships between entities in data sheets are is-a, is-part-of, is-attribute-of, is-capability-of, or their inverses such as is-a-generalization, has-part, has-attribute and so on. Documents with those characteristics are known as data-rich documents.

In a data-rich document, many of the entities and relationships may be extracted. Due to this fact, IE processes in data-rich documents might have a wider scope. Consequently, it is possible to think that IE from data-rich documents is not a machine-reading process as shallow as IE from natural language documents.
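The kinds of relationships just listed can be pictured as relation triples. The following is a minimal sketch, with illustrative entity and relation names (not the schema actually used in this thesis):

```python
# Hypothetical (subject, relation, object) triples for facts that could
# be extracted from a laptop data sheet. Names are illustrative only.
triples = [
    ("Laptop", "has-part", "HardDisk"),
    ("HardDisk", "has-attribute", "HardDiskSize"),
    ("HardDiskSize", "has-magnitude", "120"),
    ("HardDiskSize", "has-unit", "GB"),
]

def parts_of(entity, facts):
    """Return the direct components of an entity via has-part relations."""
    return [o for (s, r, o) in facts if s == entity and r == "has-part"]

print(parts_of("Laptop", triples))  # ['HardDisk']
```

Under this view, extraction amounts to recovering such triples from the free-form text of the data sheet.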
[Figure 1.1: sample laptop data sheets, panels a) and b).]
In the Information Technology domain, data sheets are frequently used to describe products for technical and commercial purposes. Commercial data sheets are relevant in e-commerce environments because they are an important support source for buying decision-making. For instance, a buyer looking for a laptop on the Internet using a price engine such as Pricegrabber (http://computers.pricegrabber.com) may find hundreds of options with tens of features for each product data sheet. To read data sheets in an automatic way (i.e. to extract their information) is the first step to assist decision-making processes. However, aided decision-making is not the only use scenario. Additional applications are: technological monitoring, information retrieval refinement, automatic document translation, question answering, information integration, etc.

The problem addressed in this thesis is to extract all the product characteristics mentioned in a set of commercial data sheets circumscribed to a sub-domain of the IT domain. The selected sub-domain was that of laptop computers. Laptops are among the most popular products in the IT domain; therefore, a wide range of products is available and the extracted information is of special interest. Data sheets that describe laptops cover many composing parts, physical characteristics and technical features. Thus, the methods, prototypes and conclusions developed in this thesis could be applied to products of similar or smaller complexity.
The data-sheet documents considered in this thesis have a particular set of characteristics and challenges:

- Formatting homogeneity is not guaranteed. Even documents generated by the same computer brand for similar products show several differences in structure, style and appearance.
- The number of extraction targets is high. A typical laptop data sheet can have more than a hundred extraction targets.
- The density of the extraction targets is high. The documents are full of targets. Many IE approaches build separate extraction models for each individual target [37, 36]. This approach does not seem appropriate for data sheets, because the target density makes some targets become context of the others, and it assumes independence between targets.
- The target ambiguity is high. Ambiguity happens when targets of the same type have different meanings. Consider the example of extracting the targets date-of-the-news and date-of-the-event from a news corpus: both are dates and have to be disambiguated.
- Memory sizes are an example of such ambiguity: installed memory, cache memory, size of the installed memory modules, memory requirements of a particular included software, video memory, etc. The same type of ambiguity happens with other measures such as distances, weights, voltages and more.
- Many targets, plus high density, plus ambiguity mean a high labeling cost. Providing a labeled corpus for model construction or evaluation is a particularly expensive task.
- The use of acronyms, abbreviations and morphological variations is, in general, extensive in data sheets. For instance, MS Win XP Pro Ed. stands for Microsoft Windows XP Professional Edition.
- The availability of large quantities of data sheets for different sub-domains is not guaranteed. Whereas it is possible to gather some hundreds of laptop data sheets in English, only a few tens can be found for blade servers, for instance.
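The abbreviation challenge above can be made concrete with a small sketch. The heuristic below (checking whether the letters of the short form occur in order within the full form) is only an illustration of the general idea, not the acronym-sensitive matcher proposed later in this thesis:

```python
# Rough abbreviation heuristic: "short" matches "full" if its letters
# appear in order inside "full". This covers prefixes ("Win" -> Windows)
# and contractions ("MS" -> MicroSoft), but is deliberately simplistic.

def abbrev_match(short, full):
    """True if the letters of `short` occur in order within `full`."""
    s = short.rstrip(".").lower()
    it = iter(full.lower())
    return len(s) > 0 and all(ch in it for ch in s)

pairs = zip("MS Win XP Pro Ed.".split(),
            "Microsoft Windows XP Professional Edition".split())
print(all(abbrev_match(s, f) for s, f in pairs))  # True
```

Note that such a loose rule also produces false positives (e.g. many short tokens match many long words), which is exactly the precision/recall trade-off discussed in Chapter 5.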
In order to face those challenges, a new approach that poses the IE task as a WSD problem is proposed. WSD [61] is a natural language processing (NLP) task that aims to select the correct sense combination for all polysemous words in a text. We noticed the following similarities between WSD and our particular IE problem:

- Most of the verbs, nouns and adjectives in a text are polysemous. Each polysemous word is a disambiguation target that has to be labeled.
- The density of polysemous words in a text is high.
- The number of senses for each polysemous word can be very high if the word senses are defined in detail.
- The selected sense combination reflects the semantic coherence of the text sequence. Similarly, the product features in a data sheet form a coherent sequence clearly related to their meaning.
- Verbs, nouns and adjectives are subject to morphological variations such as plurals, conjugations and gender.
A particular set of approaches to WSD uses as disambiguation resource a general lexicalized ontology called WordNet [63]. WordNet is a semantic network whose nodes are concepts and whose edges are semantic relationships such as hypernymy (i.e. is-a-generalization-of), hyponymy (i.e. is-a), meronymy (i.e. is-part-of), holonymy (i.e. the opposite of meronymy), synonymy and antonymy. Besides, each concept has a set of lexicalizations, called a synset, in a particular language such as English.
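The structure just described can be sketched with a toy WordNet-style network; all concept names, terms and links below are illustrative, not WordNet's actual data:

```python
# Toy semantic network: hypernymy links between concepts, plus a set of
# lexicalizations (a "synset") per concept. Illustrative data only.

hypernyms = {"hard_disk": "storage_device", "storage_device": "artifact"}
synsets = {
    "hard_disk": {"hard disk", "hard drive", "HDD"},
    "storage_device": {"storage device", "storage"},
    "artifact": {"artifact", "artefact"},
}

def hypernym_chain(concept):
    """Follow is-a-generalization-of links up to the top of the hierarchy."""
    chain = [concept]
    while chain[-1] in hypernyms:
        chain.append(hypernyms[chain[-1]])
    return chain

print(hypernym_chain("hard_disk"))
# ['hard_disk', 'storage_device', 'artifact']
```

Knowledge-based WSD methods exploit exactly this kind of graph structure to measure how related two candidate senses are.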
As part of the proposed solution, a laptop domain ontology was built from scratch by the authors, thoroughly describing all the attributes and composing parts of a laptop computer. Next, it was lexicalized with terms in English according to a set of data sheets gathered from the Internet. This lexicalized ontology was used as input to an IE system that labeled the extraction targets using a WSD-inspired algorithm. On the other hand, the problem of the morphological variations associated with terms, acronyms and abbreviations was addressed using a strategy based on approximate string matching. The final output of the proposed system was an unambiguous labeling over the text of the data sheets. An example of such an unambiguous labeling is shown in Figure 1.2.

[Figure 1.2: a fragment of the laptop ontology (Memory, Processor and Hard Disk concepts linked by is-part-of, is-attribute-of and is-a relationships, down to Size, Speed, Magnitude and Units nodes) aligned to the data-sheet token sequence "Intel Core Duo Processor T2400 / 1024 MB DDR2 / 120 GB 5400 rpm (S-ATA)".]
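The kind of approximate matching mentioned above can be illustrated with the classic Levenshtein edit distance, normalized to a [0, 1] similarity. This is a generic sketch, not the specific matcher developed in Chapter 3:

```python
# Classic Levenshtein distance via dynamic programming, plus a
# normalized similarity in [0, 1] (1.0 means identical strings).

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    m = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / m

print(edit_distance("gigabyte", "gigabite"))  # 1
```

A threshold on this similarity decides whether a document token is accepted as a reference to a lexicon term, which is the flexibility/noise trade-off studied in the experiments.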
1.1
The problem of IE from text documents has been addressed in the past using different techniques. Some approaches use machine-learning (ML) models trained with a labeled corpus (see [17] for a survey). Those ML models can be applied in several domains and languages with only small changes. However, the limited availability of labeled corpora is generally a drawback, and the process of building them for specific applications is slow and expensive. Another approach is to perform IE based on a set of fixed extraction rules combined with dictionaries or gazetteer lists [27]. Rule-based approaches are relatively easy to set up, but experts with linguistic and specific domain skills are required for writing the rules, which is, in fact, an expensive resource. Bootstrapping approaches have also been used, seeding a small labeled corpus [24] or a set of rules to a system that expands them based on a large corpus or the web [35, 23, 13]. This is an effective way to address the drawbacks of the previous approaches. Nevertheless, the size of the corpus has to be considerable in order to achieve enough statistical evidence to infer generalizations. Moreover, bootstrapping approaches based on the web require a huge quantity of queries to search engines, which commonly restrict the number of queries.
In general, the previously mentioned IE approaches exploit context, order in text sequences and dictionaries to extract the target information from poorly structured or free text. There is another IE approach based on the structural homogeneity of the documents. The IE systems based on that idea, known as wrappers, exploit the structure of the web pages in order to identify extraction targets. Ontologies can also be used to model concepts in specific domains, and their use has become popular with the arrival of the Semantic Web [11]. Ontologies have been used as repository of the extraction model [33], as semantic dictionary of instances [75, 31, 58], as schema to be populated with instances [42, 89], and as source of inferences on extracted information [16]. Only a few approaches have used ontologies for guiding the extraction process.
1.2
There are two main issues related to IE that have been addressed in this thesis: term identification and term disambiguation. The main research questions studied in this thesis related to those issues are:

- Which of the static approximate string matching techniques are suitable for the term-identification task in the IE context?
- Are the methods for word sense disambiguation suitable for the term-disambiguation task in the IE context?
- Is it possible to construct an information extraction system based on term identification and disambiguation in conjunction with a lexicalized domain ontology?

Related to the term identification problem, this project focuses on the following question: Is it more convenient to treat the text strings to be compared as sequences of characters or as sequences of tokens (i.e. words) for the name-matching task? In order to answer this question, different string comparison methods were evaluated. In particular, a new general combination method for character- and token-level measures based on fuzzy cardinality was proposed in this thesis. Experiments carried out with 12 name-matching data sets well studied in the related literature showed that the proposed method consistently outperformed other methods or reached similar results.

Approximate string matching techniques bring the ability to detect terms that have misspellings, OCR errors, typos and morphological variations. However, it is unavoidable that similar strings may have no semantic relation at all. Thus, given our specific information extraction task, approximate string matching provides flexibility for term detection, but it also introduces noise. Some of the experiments carried out in this thesis were designed to study this trade-off.
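For reference, a well-known character/token combination scheme that the proposed method is compared against is Monge-Elkan: each token of the first string is matched to its most similar token in the second, and the maxima are averaged. The sketch below uses a Levenshtein-based internal similarity; the proposed cardinality-based combination itself is more elaborate and is presented in Chapter 3:

```python
# Monge-Elkan token-level combination over a character-level similarity
# sim(a, b) in [0, 1]. Illustrative sketch, not the thesis's new method.

def lev(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def sim(a, b):
    return 1.0 - lev(a, b) / (max(len(a), len(b)) or 1)

def monge_elkan(s, t):
    """Average, over tokens of s, of the best match in t."""
    toks_s, toks_t = s.split(), t.split()
    return sum(max(sim(a, b) for b in toks_t) for a in toks_s) / len(toks_s)

print(round(monge_elkan("microsoft windows", "windows microsft"), 3))
```

Note the measure is robust to token reordering and to character-level typos at the same time, which is why it is a natural baseline for name matching.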
Secondly, motivated by the specific problem addressed in this thesis, we explored the applicability of WSD techniques to the term disambiguation problem in the context of an IE task. The main question is: Is the proposed specific information-extraction problem analogous to NLP's word sense disambiguation problem?

Summarizing, the two main contributions of this thesis are:

- A new method for text string matching that allows the combination of existing character-based measures with cardinality-based resemblance coefficients. This contribution has been applied to our specific term matching problem and has promising applications in fields such as natural language processing, bioinformatics and signal processing.
- Contributions to the IE field: i) exploring the application of WSD methods in IE, and ii) a new graph-based term-disambiguation method. The results of this thesis open a new way to address the extraction problem when the number and type of targets is considerably high.

Finally, other new ideas were proposed in this thesis, but they were not exhaustively compared with similar approaches due to the scope of this work.
1.3

Among those additional ideas are: the concept of semantic paths, which provides independence from the design of the ontology or the document format; the use of the shortest path criterion for disambiguation; and a method to integrate acronym-matching heuristics into any approximate string matching technique based on tokens.
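The shortest-path idea can be roughly illustrated as follows: among the candidate senses of an ambiguous value, prefer the one closest (in the ontology graph) to an unambiguous concept from its context. The four-concept graph below is made up for illustration, not the actual laptop ontology:

```python
# Toy shortest-path sense selection. "512 MB" next to a RAM-related
# token should resolve to MemorySize rather than VideoMemorySize.
from collections import deque

edges = [("Laptop", "Memory"), ("Laptop", "VideoAdapter"),
         ("Memory", "MemorySize"), ("VideoAdapter", "VideoMemorySize")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def distance(src, dst):
    """Breadth-first shortest-path length between two concepts."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")

senses = ["MemorySize", "VideoMemorySize"]
best = min(senses, key=lambda s: distance(s, "Memory"))
print(best)  # MemorySize
```

The disambiguator proposed in Chapter 4 applies this intuition globally, over a graph of candidate semantic paths aligned to the whole token sequence, rather than pairwise as sketched here.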
Thesis Structure
Chapter 2 presents the developed laptop ontology, lexicon and evaluation document corpus. Those resources are the input data for the proposed IE prototype. That chapter describes and quantifies the ontology, the lexicon and the labeled documents included in appendices A, B and C. Chapter 3 studies the name-matching problem using static approximate string-matching methods; it can be read as a stand-alone paper with only a few references to the rest of the document. The experimental evaluation of the proposed methods was made with data sets used in previous studies by researchers in the fields of record linkage and name matching.
Table 1.1: Specific objectives

  Objective                              Place in the document
  Develop an ontology                    Section 2.2 (Appendix A)
  Develop a lexicon                      Section 2.3 (Appendix B)
  Develop a string matching algorithm    Chapter 3
  Develop a labeling algorithm           Chapter 4 (IE as WSD)
The proposed IE system is presented in Chapter 4, where the background and relevant previous work related to knowledge-based WSD are presented, as well as a new graph-based disambiguation method. Additionally, that chapter integrates the results from Chapter 3 into an IE system. Finally, an experimental evaluation was made to assess the proposed methods for approximate string matching, semantic relatedness measurement and the term disambiguation strategy. The results obtained with these methods were compared against baselines.

In Chapter 5, methods for acronym matching are reviewed, and some of the heuristics presented in those approaches are used to propose a method that integrates acronym matching into the name matching task. The proposed method was integrated into the system presented in Chapter 4, and its performance was assessed by comparing its results against the best results obtained in that chapter. Chapter 6 presents the conclusions of this thesis and briefly discusses future work.
Finally, in Table 1.1 the main objectives of this work are mapped to particular places in this thesis document.
Chapter 2
Domain Ontology, Lexicon and Corpus
This chapter presents the construction and preparation of the input data for the IE system proposed in this thesis. Firstly, some basic concepts related to ontologies (in the sense of a formalism for knowledge representation in computer science) are presented. In order to make the next sections understandable, some general issues about ontology constructors and their formal definition are introduced in Section 2.1. The system presented in this thesis was built from scratch; thus, the ontology, lexicon and evaluation corpus were manually prepared by us. This process is briefly described in this chapter. In Section 2.2 the laptop ontology is presented, as well as the formalism used to represent it. Specifically, the laptop ontology describes in detail what a laptop computer is composed of, its attributes and those of its composing parts. In Section 2.3 the laptop lexicon is presented. This lexicon is a dictionary with the representation in English terms of the concepts modeled in the laptop ontology. Finally, in Section 2.4 the data-sheet corpus is presented with the selected annotation schema.
2.1 Ontologies

Ontologies are artifacts that are useful to describe real-world entities and domains. In computer science, ontologies are graphs in which the nodes (i.e. vertices) are entities or concepts, and the edges correspond to semantic relationships between them [41]. Ontology definition languages offer a wide set of constructors allowing complex ontology modeling. The most common relationship types used in ontologies are hypernymy (i.e. is-a) and meronymy (i.e. is-part-of, or its inverse has-part). The former allows defining an entity class hierarchy and the latter describes the composition hierarchy of each concept. Other commonly used ontology constructors are cardinality restrictions on meronymy relations (e.g. a laptop has to have only one display panel but can have from zero to
many units of some other part). Expressive ontology definition languages also include the use of logical axioms such as the universal and existential quantifiers.
The knowledge included in domain ontologies describes objects and domains with concepts and relations. The names of those concepts and relations are only reference names assigned by the person who builds the ontology; for instance, the concept related to a hard disk can be named HardDisk or concept_235. Clearly, good practices in ontology construction recommend that the names of the concepts be related to the names of the represented entities in the real world. Although ontologies model real-world entities, there is no commitment as to how those entities are referred to in a specific language such as English.
The most popular ontology constructors, offered for instance by OWL¹, are the following:

Class Definition: defines and names a class of objects.
Class Hierarchy: a directed acyclic graph (DAG) ranging from general classes to specific ones; class hierarchies are built with is-a relationships.
Object Properties: define and name the relationships between objects.
Domain/Range Restrictions: define the allowed classes on the left or right side of a relationship.
Cardinality Restrictions: restrict the number of times a relationship can hold for the instances of a class.
Universal Restrictions: restrict a relationship with a condition that has to be satisfied by all the instances of a class.
Existential Restrictions: restrict a relationship with a condition that has to be satisfied by at least one of the instances of a class.
Set Operations: build new classes by combining other classes (e.g. by union, intersection or complement).

¹http://www.w3.org/2004/OWL/
Formally, the ontology can be defined as a tuple O = {S, C, h, g}, where:

- S is the root concept (i.e. Laptop);
- C is a set of concepts, comprising internal or high-level concepts (e.g. Memory, HardDisk, Display) and terminal or low-level concepts (e.g. MegaByte) that are not further divided by hypernymy, holonymy or attribute relationships;
- h is the hypernymy (i.e. the inverse of is-a) relationship between pairs of concepts;
- g is the meronymy or attribute relationship between pairs of concepts.
The lexicalization of an ontology is the addition of a dictionary of terms mapping the concepts modeled in the ontology to a language such as English. We extend the previous ontology definition to OL = {S, C, h, g, L, f}, where L is a set of terms and f is a lexicalization relationship

f : (c1, c2, ..., cn) → L, with n ≥ 1 and ci ∈ C,

where the concepts c1, c2, ..., cn are connected through h or g relations. This relationship allows simple lexicalizations such as f(Brand) = {Hewlett Packard, HP}, or more informative ones such as f(Software h PhotoSoftware g Family) = {Photo …}.
2.2 Laptop Ontology
The laptop ontology is a fine-grained description of all the parts and attributes of a laptop computer. The model goes beyond the physical description, including additional attributes such as prices, commercial issues, warranty terms, etc. The aim of the ontology is to model all the concepts that are commonly included in the laptop data sheets used for commercial purposes.

The first step in the construction methodology of the ontology was to collect 30 laptop data sheets during the month of May 2007 from the web sites of three popular laptop makers (i.e. Hewlett Packard, Toshiba and Sony). Those documents were used to build the ontology from scratch, trying to identify all the major concepts (e.g. processor, hard disk, memory, etc.). Further, each major concept was decomposed by reviewing all its descriptions in the documents. The ontology construction was not a linear process, since it was changed several times based on feedback. Additionally, the concepts included in the ontology were not limited to the concepts mentioned in the documents; for example, a complete system of units of memory, distance, time and weight was defined even though some of those units were not used in the documents.

The built ontology can be represented as a directed acyclic graph. Figure 2.1 shows a small sample of that graph. The graph contains three types of nodes and two types of edges. The node types are: a root node associated with the concept Laptop, which has no incoming edges; internal nodes, which have both incoming and outgoing edges; and terminal nodes, which have only incoming edges. The edges in the ontology graph are mainly has-part relationships, shown with gray arrows in Figure 2.1. Actually, the is-a relationships (represented with black arrows) were not used to model a class hierarchy but to model some relationships close to the terminal nodes. The reason to use the is-a relationships in that way is that the candidate is-a hierarchies were too narrow and did not offer additional information. On the other hand, the composition hierarchy made with the has-part and has-attribute relationships was deep and carried important information. Additionally, a semantic network with only one type of relationship is more convenient for designing semantic relatedness metrics (see Section 4.1.1). The full laptop ontology is provided as reference in Appendix A.

The most common format to represent ontologies in text formats (i.e. RDF, OWL) is using a set of ordered triples {
entity1, relationship, entity2}. However, Appendix A presents the ontology in a significantly more readable way, following a top-down approach that presents first the constructions related to the more general concepts. Next, some data about the laptop ontology graph are presented in order to give an idea of its general shape.
Figure 2.1: A small sample of the laptop ontology graph (nodes include Laptop, Processor, Memory, HardDisk, ProductID, Brand, Family, Speed, FSB, Model, FrequencyMeasurement, MHz, MemoryUnits, Magnitude, KB, MB, GB, DecimalReal and Integer).

Number of nodes: 612
Number of edges: 1342
Figures 2.2 and 2.3 show charts of the out-degree and in-degree distributions of the ontology graph nodes. The counting of nodes at each out-degree level is directly related to the specificity of the node: the most general node, Laptop, has the highest out-degree, and the most specific nodes (i.e. the terminal nodes) are sinks, nodes without outgoing edges. A semantic path connects the concept Laptop with a terminal node; the chart in Figure 2.4 gives an idea of the depth of the graph. Another important issue related to the semantic paths is the number of semantic paths of each terminal node: the more semantic paths a terminal node has, the higher its use ambiguity. The chart with the counting of terminal nodes grouped by their number of semantic paths is shown in Figure 2.5. In that chart it is possible to see that only 49 of the 118 terminal nodes are unambiguous. Thus, it is possible to say that the laptop ontology has an ambiguity of 58.47% at the terminal-node level, or ambiguity in terminal use.
Figure 2.2: Node counting chart for each out-degree level in the ontology graph (e.g. there are 56 nodes with an out-degree of 2; the 118 terminal nodes have out-degree 0, and the node Laptop has the highest out-degree, 23).
Figure 2.3: Node counting chart for each in-degree level in the ontology graph (e.g. there are 15 nodes with an in-degree of 2).
Figure 2.4: Semantic-path counting chart grouped by path length in the ontology graph (e.g. there are 86 semantic paths with a length of 4 nodes).
Figure 2.5: Terminal-node counting chart for different numbers of semantic paths (e.g. there are 10 terminal nodes in the ontology with three semantic paths; 49 terminal nodes have a single semantic path).
Figure 2.6: Laptop domain lexicon entry example, showing the roles of the concept (REF_HardDisk), its terms and their tokens.
2.3 Laptop Lexicon
The laptop domain lexicon was built simultaneously with the laptop ontology, using the same methodology. Similarly to the laptop ontology, the reach of the terms included in the lexicon goes beyond the selected data sheets, particularly for the concepts related to measures and units. The aim of the lexicon is to provide terms in English linked to concepts in the ontology; for instance, the ontology concept HardDisk is related (in English) to the terms Hard Disk and Hard Drive, or to the acronyms HD and HDD. Clearly, terms can be composed of one or more words. We use the word token for representing words because in this thesis some special characters are considered words; for instance, the term Built-in is tokenized as {Built, -, in}. Each lexicon entry is headed with a concept name and contains an unordered list of terms. Figure 2.6 depicts a lexicon entry with the roles of the concept, terms and tokens.
the concept, terms and tokens. The laptop lexicon includes two types of lexicalizations one for the other for
schema
and
data.
(node) in the ontology even to the root concept Laptop and the terminals. They were called
schema
the targets that we are interested in extract. The schema lexicalizations can be identied by the prex REF_ added to the concept name in the head of a lexicon entry. The lexicon entry shown in Figure 2.6 is an example of a schema lexicalization. The
data
tudes, units, brand names, technology names etc. The data lexicalizations are only linked to terminal nodes in the ontology. However, a terminal node can
be referred by both schema and data lexicalizations. For instance, the terminal Brand has the schema lexicalization REF_Brand :[Brand, Made by, Manufactured by] and the data lexicalization Brand :[HP, Sony, Toshiba, Dell].
Table 2.1: Concepts lexicalized with regular expressions in the laptop lexicon.

Concept    Regular Expression
Integer    ^[0-9]+$
Decimal    ^[0-9]+\.[0-9]+$
UPCCode    ^[0-9][0-9]{10}[0-9]$
Version    ^([0-9]+\.)+([0-9]+|x|a|b|c|d)$
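The regular-expression terms above can be matched against individual tokens in a straightforward way. The following is a minimal sketch (the dictionary mirrors Table 2.1; the helper name is ours and is not part of the thesis system):

```python
import re

# Regular-expression terms from Table 2.1, keyed by lexicon concept.
REGEX_TERMS = {
    "Integer": r"^[0-9]+$",
    "Decimal": r"^[0-9]+\.[0-9]+$",
    "UPCCode": r"^[0-9][0-9]{10}[0-9]$",
    "Version": r"^([0-9]+\.)+([0-9]+|x|a|b|c|d)$",
}

def concepts_for_token(token):
    """Return the concepts whose regular-expression term matches the token."""
    return [c for c, rx in REGEX_TERMS.items() if re.match(rx, token)]

print(concepts_for_token("512"))  # an Integer
print(concepts_for_token("2.0"))  # both a Decimal and a Version
```

Note that, as in the lexicon itself, a single token can match more than one concept, which is one source of the term ambiguity discussed below.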
As mentioned in Section 2.1.2, more informative lexicon entries are allowed by including a list of concepts in the head. For instance, the lexicalization ProcessorID_Brand: [Intel, AMD] refers only to processor brands. Also, schema lexicalizations are allowed to have head entries with multiple concept names, as in the following example: REF_Modem_Version: [Version, Speed, Compliant]. For some particular concepts it is not possible to enumerate all their possible related terms; for instance, the terms for the concept Integer clearly cannot be enumerated. In order to address this problem, regular expressions are allowed as terms in lexicon entries. The concepts with terms expressed as regular expressions are listed in Table 2.1. The complete version of the laptop lexicon is presented in Appendix B. In the same way that the ontology was outlined with some data, some counts over the complete lexicon can help to figure out its structure.
It is worth noting that there was a considerable difference between the total term count and the distinct term count. This means that many terms were included in more than one lexicon entry. Figure 2.7 depicts the number of terms according to the number of times they are included in lexicon entries. Considering the total set of terms, only 1410 of them are unambiguous; thus, the lexicon has an ambiguity of 17.06% at the term level. Another important issue to consider is the number of terms per lexicon entry, illustrated in Figure 2.8. That chart shows that there are only 49 concepts with a single term; the remaining concepts are lexicalized with several terms.
Figure 2.7: Term-frequency chart grouped by the number of times that terms are referred in the laptop lexicon (e.g. there are 97 terms that have been referred in two lexicon entries; 1410 terms are referred in only one entry).
Table 2.2: Laptop data sheets in the evaluation corpus.

Doc ID  Brand            Family     Model         Date (m/d/y)
1       Hewlett Packard  Pavilion   dv9500t       06/05/2007
2       Toshiba          Satellite  A130-ST1311   11/05/2007
3       Sony             VAIO       VGN-TXN15P/B  06/05/2007
4       Toshiba          Satellite  A135-S4498    11/05/2007
5       Hewlett Packard  Pavilion   dv9330us      06/05/2007
Many lexicon entries also include variations of a basic term; the most common of these variations is the basic term transformed into its acronymic form. The number of tokens per term is depicted in Figure 2.9: almost half of the terms have more than one token. Due to this fact, multi-token comparison is discussed in Chapter 3.
2.4 Data-Sheet Corpus
In order to evaluate the IE system proposed in this thesis, it is necessary to provide a labeled set as a gold standard against which the output of the system can be compared. Five data sheets were chosen among the set of documents used to build the laptop ontology and lexicon to constitute the evaluation corpus. The laptop computers that those documents describe and the download dates are shown in Table 2.2. The documents were labeled so that all the relevant tokens had a label.
Figure 2.8: Concept-frequency chart grouped by the number of terms in each concept in the laptop lexicon (e.g. there are 91 concepts with 2 terms).
Figure 2.9: Term counting chart grouped by the number of tokens per term (e.g. 992 terms have a single token).
Table 2.3: Tokens and extraction targets per document in the evaluation corpus.

Doc ID  #tokens  #target tokens  #targets  schema targets  data targets
1       1925     744             536       31.90%          68.10%
2       744      363             246       43.09%          56.91%
3       921      462             313       47.28%          52.72%
4       801      416             267       47.19%          52.81%
5       740      327             234       38.89%          61.11%
TOTAL   5131     2312            1596      40.23%          59.77%
Relevant tokens are those that belong to some extraction target. In the same way that lexicon entries were divided into schema and data entries, extraction targets are also separated by the same criteria. In Table 2.3 the total number of tokens, the number of target tokens, the number of targets and the percentages of schema and data targets are reported for each document. The manually labeled document #1 is included as reference in Appendix C. As can be seen in Appendix C, the text was extracted from the original documents, so the only text structure considered was punctuation marks and newline characters.
2.4.1 Tokenizer
To tokenize a text means to divide the original character sequence into a sequence of sub-sequences called tokens. The text tokenizer was designed to fit the special needs of the IT data sheets and is based on the following static rules:

- One or more consecutive space characters are token separators.
- Any of the following characters is a token separator and constitutes by itself a one-character token: '(', ')', '[', ']', '$', '%', '&', '+', '-', '/', ';', ':', '*', '?', '!'.
- The period ('.') is considered a token separator only when it has blank characters (i.e. empty spaces) on both sides.
- A digit consecutive to an alphabetic character is divided into two tokens, e.g. 2GB and 5400rpm are tokenized as {2, GB} and {5400, rpm} respectively.
- An 'x' character surrounded by digits is divided into three tokens, e.g. 1200x768 separates as {1200, x, 768}.

The data-sheet documents in the corpus were tokenized using those rules. Similarly, when the terms of the lexicon were treated as sequences of tokens, they were tokenized with the same rule set.
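The rules above can be sketched as a small Python function (an illustrative re-implementation; the composition order of the rules and the helper names are our assumptions):

```python
import re

# One-character separator tokens from the rule list above.
SEPARATORS = set("()[]$%&+-/;:*?!")

def tokenize(text):
    # Rule: an 'x' surrounded by digits becomes its own token (1200x768).
    text = re.sub(r"(?<=[0-9])x(?=[0-9])", " x ", text)
    # Rule: split a digit followed by an alphabetic character (2GB, 5400rpm).
    text = re.sub(r"([0-9])([A-Za-z])", r"\1 \2", text)
    tokens = []
    # Rule: runs of spaces separate tokens; a period surrounded by blanks
    # becomes a standalone chunk here, while 'M.b.' stays one token.
    for chunk in text.split():
        buf = ""
        for ch in chunk:
            if ch in SEPARATORS:
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

print(tokenize("Installed Memory: 512MB / 2GB (max)"))
```

Applied to the example above, the sketch yields the token sequence [Installed, Memory, :, 512, MB, /, 2, GB, (, max, )] used later in this chapter.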
A semantic path is a sequence of nodes starting with the root concept (i.e. Laptop) and finishing in any terminal node. This sequence has to be a valid path in the ontology graph. Consider the semantic paths in the sample graph in Figure 2.1 associated with the terminal concept MB (i.e. megabytes):

1. Laptop → Processor → Cache → Size → MemorySize → MemoryUnits → MB
2. Laptop → Memory → Installed → MemorySize → MemoryUnits → MB
3. Laptop → Memory → MaxInstallable → MemorySize → MemoryUnits → MB

All those semantic paths are possible uses for the concept MB. Therefore, if an acronymic term such as M.b. is linked in the lexicon to the concept MB, then those semantic paths are possible uses (or senses) for the term M.b.. Consider another example with the semantic paths for the terminal Integer in the same sample graph:
1. Laptop → Processor → Speed → FrequencyMeasurement → FrequencyMagnitude → Magnitude → Integer
2. Laptop → Processor → FSB → FrequencyMeasurement → FrequencyMagnitude → Magnitude → Integer
3. Laptop → Processor → Cache → Size → MemorySize → MemoryMagnitude → Magnitude → Integer
4. Laptop → Memory → Installed → MemorySize → MemoryMagnitude → Magnitude → Integer
5. Laptop → Memory → MaxInstallable → MemorySize → MemoryMagnitude → Magnitude → Integer
As shown in Figure 2.5, the terminal concept with the highest number of semantic paths in the laptop ontology is Integer.
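The enumeration of semantic paths amounts to a depth-first traversal of the ontology graph from the root. The following sketch encodes a toy fragment of the Figure 2.1 sample as adjacency lists (an illustrative subset of the real ontology, not its full edge set) and reproduces the three paths of MB:

```python
# Child lists for a toy fragment of the sample ontology graph.
GRAPH = {
    "Laptop": ["Processor", "Memory"],
    "Processor": ["Cache"],
    "Cache": ["Size"],
    "Size": ["MemorySize"],
    "Memory": ["Installed", "MaxInstallable"],
    "Installed": ["MemorySize"],
    "MaxInstallable": ["MemorySize"],
    "MemorySize": ["MemoryUnits"],
    "MemoryUnits": ["MB"],
    "MB": [],
}

def semantic_paths(node, terminal, prefix=()):
    """Depth-first enumeration of all root-to-terminal paths."""
    prefix = prefix + (node,)
    if node == terminal:
        yield prefix
    for child in GRAPH.get(node, []):
        yield from semantic_paths(child, terminal, prefix)

paths = list(semantic_paths("Laptop", "MB"))
for p in paths:
    print(" -> ".join(p))
```

On this fragment the traversal yields exactly the three MB paths listed above; the number of paths reaching a terminal is its use ambiguity.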
For instance, consider the labeling of the token sequence [Installed, Memory, :, 512, MB, /, 2, GB, (, max, )]: the tokens Installed and Memory are labeled with the schema target Laptop → Memory → REF_Installed; the tokens ':', '/', '(' and ')' receive no-label; the tokens 512 and MB are labeled with data targets under Laptop → Memory → Installed; the tokens 2 and GB with data targets under Laptop → Memory → MaxInstallable; and the token max with the schema target Laptop → Memory → REF_MaxInstallable.
Chapter 3
Static Fuzzy String Searching in Text
The term fuzzy string searching (FSS) refers to the comparison of strings allowing errors. FSS is based on similarity metrics that compare sequences at character and token level (i.e. word level). Static string metrics are those that compare two strings with some algorithm that uses only the information contained in both strings. On the other hand, adaptive methods are those that use additional information from the corpus, in supervised [81, 12] or unsupervised ways such as the Soft-TF-IDF measure. Adaptive methods can outperform some static measures in particular matching problems, but they have drawbacks such as the cost of training the model, the cost of gathering statistics from large corpora, the lack of labeled corpora and their limited scalability. This chapter is focused on some widely known static string measures as well as some new ones.

This chapter also presents different methods for combining static fuzzy string metrics at character and token level. Many popular fuzzy string metrics (e.g. edit distance) use characters as the comparison unit, disregarding the natural division in words (tokens) of text sequences and considering the separator character (i.e. the space) as an ordinary character. On the other hand, token-level metrics consider words as the comparison unit, but most of them compare tokens in a crisp way. In many NLP tasks, qualities of both character- and token-level measures are desirable. Although the combination of those types of measures seems to be an interesting topic, only a few combination methods have been proposed in the related literature. We tested and compared them against a new general method to combine cardinality-based resemblance coefficients (e.g. Jaccard, Dice, Sorensen, cosine, etc.) with character-level measures such as edit distance, Jaro distance and n-grams.
3.1 Introduction
String similarity metrics have been used to compare sequences in domains such as biomedicine, signal processing and natural language processing (NLP), among others. Probably the best-known character-level metric is the edit distance. This approach is very affordable when the strings to be compared are single words having misspellings, typographical errors, OCR errors and even some morphological variations. However, due to the inherent nature of human language, text sequences are separated into words (i.e. tokens). This property can also be exploited to compare text strings as sequences of tokens instead of characters. That approach deals successfully with problems such as tokens out of order and missing tokens when two multi-token strings are being compared. Besides, it has been used in fields such as information retrieval [6], record linkage [64] and acronym matching [87], among others.

Depending on the specific matching problem, some measures perform better than others. Nevertheless, there is no technique that consistently outperforms the others [12, 21]. For instance, edit distance and Jaro distance [93] are suitable for name-matching tasks but perform poorly for record linkage or acronym-matching tasks. Approaches for combining character- and token-level measures [19, 65] have been proposed to preserve the properties of character-level measures while considering the token information. Nevertheless, the problem has not been studied in detail and there is no general method to combine the known metrics. In this chapter, a new general method is proposed for using cardinality-based metrics such as Jaccard, Dice and cosine in combination with character-level measures; it is based on the Monge-Elkan [65] combination method extended with the generalized mean instead of the arithmetic mean. The proposed approaches open a wide set of options, with possible combinations of known techniques that can deal with different matching problems in NLP and in other fields.

This chapter is organized as follows. Section 3.2 reviews some of the most relevant static character- and token-level string measures. In Section 3.3, the known static combination methods are reviewed, and the extended Monge-Elkan combination method is presented in Section 3.3.5. In Section 3.3.6, the new proposed cardinality-based combination method is presented. Results from experiments carried out using 12 data sets are presented in Section 3.4. Finally, some concluding remarks are given in Section 3.5.
Algorithm 1 Edit distance

EditDistance(string1, string2)
  // declare matrix of integers
  d[0..len(string1), 0..len(string2)]
  for i from 0 to len(string1)
    d[i, 0] = i
  for j from 0 to len(string2)
    d[0, j] = j
  for i from 1 to len(string1)
    for j from 1 to len(string2)
      if string1[i] = string2[j] then
        cost = 0
      else
        cost = 1
      d[i, j] = min(d[i-1, j] + 1,        // deletion
                    d[i, j-1] + 1,        // insertion
                    d[i-1, j-1] + cost)   // substitution
  return d[len(string1), len(string2)]

3.2 Static Fuzzy String Measures
Although fuzzy string measures such as the edit distance and n-gram distances are commonly motivated by the need to deal with typos and misspellings, our special interest is their ability to deal with inconsistent abbreviations, variable acronym-generation rules and typographical variations.
3.2.1.1 The Edit-Distance Family

The edit distance was originally proposed by Levenshtein [54]. It consists in counting the minimum number of edit operations needed to transform one sequence into another. The three basic edit operations are insertion, deletion and substitution. Several modifications to the original edit distance have been proposed, varying the cost schemes and adding more edit operations such as transpositions and opening and extending gaps (see [32] for a recent survey). The algorithm to compute the edit distance [91] is a dynamic programming solution that stores in a matrix the counting of edit operations for all possible prefixes of both strings. This algorithm computes the edit distance between two strings sm and sn of lengths m and n in a time complexity of O(mn) and a space complexity of O(min(m, n)). An implementation of the edit distance is shown in Algorithm 1, and an example of the solution matrix is shown in Figure 3.1. The edit distance has also been extended in its operations and in their associated costs.
Figure 3.1: Edit distance solution matrix for the strings SATURDAY and SUNDAY (the final distance, 3, is in the bottom-right cell).

        S  U  N  D  A  Y
     0  1  2  3  4  5  6
  S  1  0  1  2  3  4  5
  A  2  1  1  2  3  3  4
  T  3  2  2  2  3  4  4
  U  4  3  2  3  3  4  5
  R  5  4  3  3  4  4  5
  D  6  5  4  4  3  4  5
  A  7  6  5  5  4  3  4
  Y  8  7  6  6  5  4  3
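For reference, Algorithm 1 can be transcribed directly into Python. This sketch uses the full O(mn) matrix; the O(min(m, n)) space optimization mentioned in the text, which keeps only two rows, is omitted for clarity:

```python
def edit_distance(s, t):
    """Levenshtein edit distance via dynamic programming (Algorithm 1)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):      # deleting i characters from s
        d[i][0] = i
    for j in range(n + 1):      # inserting j characters of t
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("SATURDAY", "SUNDAY"))  # the example of Figure 3.1: 3
```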
The longest common subsequence distance [28] can be viewed as an edit distance with a cost schema that allows only insertions and deletions at cost 1 (the number of unpaired characters). The Hamming distance [82] only counts the number of differing characters at the same positions (this approach is only suitable for strings with the same length). Needleman and Wunsch also proposed a differentiated substitution cost d(c1, c2) depending on the characters to be compared; for instance, the substitution cost between the characters 'l' and '1' might be lower than the substitution cost between 'e' and 'h'. Smith and Waterman extended the model considering open-/extend-gap operations with variable costs (the Smith-Waterman distance). These new operations allow closer matches with truncated or abbreviated strings (e.g. comparing Michael S. Waterman with M.S. Waterman). In order to have that property, the cost of extending a gap has to be smaller than the gap-opening cost.
have that property the cost of extending a gap has to be smaller than the gap opening cost. Ristad and Yianilos [81] were the rst in proposing an adaptive method to learn the optimal cost schema for an edit distance for a particular matching problem in a supervised way. basic operations (i.e. They proposed an edit distance with the three Let
and
be
of size
|| || ca,b
is
ca,
Similarly
,a
and
respectively.
The optimal
costs are estimated using a training set with an expectation maximization (EM) strategy. Bilenko and Mooney [12] proposed a similar approach to learn the
CHAPTER 3.
39
optimal cost for an edit distance including open and extend-gap operations.
3.2.1.2 Jaro and Jaro-Winkler Distances

The Jaro distance between two strings sm and sn takes only O(m + n) in space and time complexity. It considers the number c of common characters and the number t of transpositions:

simJaro(sm, sn) = (1/3) × (c/m + c/n + (c − t)/c)

The common characters are considered only within a sliding window of size max(m, n)/2 − 1; additionally, a common character cannot be matched more than once. If c = 0, simJaro returns a value of 0. To count the transpositions, the common characters of both strings are arranged in two lists, ordered according to their occurrence in the strings being compared; the number of non-coincident characters between both lists, divided by two, gives t.

The Jaro-Winkler distance [93] is an improvement of the Jaro distance considering the number ℓ of common characters at the beginning of both strings (the common prefix, usually limited to 4), with the following expression:

simJaro-Winkler(sm, sn) = simJaro(sm, sn) + ℓ × 0.1 × (1 − simJaro(sm, sn))
3.2.1.3 q-gram Distances

The q-grams of a string are all its overlapping substrings of length q. Depending on the value of q, the q-grams are called unigrams, bigrams, trigrams and so on. For instance, the trigrams in 'laptop' are 'lap', 'apt', 'pto' and 'top'. Commonly, q − 1 pad characters are added as prefix and suffix in order to consider and to differentiate the q-grams at the beginning and the end of the string; e.g. for trigrams, 'laptop' becomes '##laptop##' using '#' as padding character. When two strings are being compared, a resemblance coefficient such as the Jaccard, Dice or overlap coefficient (see Section 3.2.2) is computed over their q-gram sets to obtain a final resemblance metric between them. The time complexity of computing and comparing the q-grams of two strings of lengths m and n is O(m + n).

In addition to normal q-grams, skip-grams [46] are bigrams with one skipped character in the center; for instance, the 1-skip-grams of 'laptop' are 'l*p', 'a*t', 'p*o' and 't*p'. Positional q-grams additionally store the position of each q-gram in the string, and two positional q-grams match only if their positions differ at most by a common value. In comparative evaluations, the performance of these q-gram variants has been ranked as follows: positional unigrams (better), bigrams, positional bigrams, skip-grams, unigrams, trigrams and positional trigrams (worse).
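Padded q-gram extraction combined with a resemblance coefficient can be sketched as follows (the function names are ours; the Jaccard coefficient is the one described in Section 3.2.2):

```python
def qgrams(s, q=3, pad="#"):
    """Set of padded q-grams of a string (q - 1 pad characters on each side)."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_jaccard(a, b, q=3):
    """Jaccard resemblance between the q-gram sets of two strings."""
    A, B = qgrams(a, q), qgrams(b, q)
    return len(A & B) / len(A | B)

print(sorted(qgrams("laptop", 3)))          # 8 padded trigrams
print(qgram_jaccard("laptop", "laptops"))   # shared trigrams / all trigrams
```

Building each q-gram set is linear in the string length, which gives the O(m + n) comparison cost mentioned above.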
3.2.1.4
Bartlolini
Bag Distance
et al.
[9] proposed an approximation to the edit distance between
two strings
sm
and
sn of
length
and
a, a, a, b and a, a, b, c, c is a.
common characters and then takes the maximum of the number of elements in the asymmetrical dierences. The bag distance works as an approximation to the edit distance but in a more ecient way. All edit distance family measures are computed in but the bag distance is computed in
O(mn),
O(m + n).
lower performance than edit distance, but gives an ecient strategy to lter candidates to the same edit distance and to other approximate string measures.
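A minimal sketch of the bag distance using Python multisets (the function name is ours):

```python
from collections import Counter

def bag_distance(s, t):
    """Bag distance: max of the two asymmetric multiset differences, O(m + n)."""
    x, y = Counter(s), Counter(t)
    return max(sum((x - y).values()), sum((y - x).values()))

# The bag distance never exceeds the edit distance, so it can be used as a
# cheap filter before computing the exact edit distance on candidate pairs.
print(bag_distance("aaab", "aabcc"))
```

Note that, being order-insensitive, the measure assigns distance 0 to anagrams, which is why it is only a lower bound of the edit distance.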
3.2.1.5
Compression Distance
Cilibrasi and Vitnyi [22] proposed the idea of using data compression to obtain a distance measure between two ob jects
x, y
using the
normalized compression
distance
N CD(x, y) = C(x)
x and y .
The compressor
normal
compressor:
1. 2. 3. 4.
and
C() = 0,
where
Distributivity.
The data-compressor
BZ2, ZLib
or
This method was originally tested by Christen [21] tested the method in
a comparative study of approximate string metrics in name matching tasks obtaining comparable results against the best results obtained with measures of the Jaro and edit distance families.
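The NCD can be sketched with zlib as the (approximately normal) compressor; the sample strings reuse product names from Table 2.2, and the helper names are ours:

```python
import zlib

def C(data):
    """Compressed length of a byte string."""
    return len(zlib.compress(data))

def ncd(x, y):
    """Normalized compression distance between two strings."""
    x, y = x.encode(), y.encode()
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Strings that share long substrings compress well together, so their
# concatenation adds little and the distance is smaller.
similar = ncd("Hewlett Packard Pavilion dv9500t", "Hewlett Packard Pavilion dv9330us")
different = ncd("Hewlett Packard Pavilion dv9500t", "Toshiba Satellite A130-ST1311")
print(similar < different)
```

Because zlib adds a fixed header overhead, the axioms above hold only approximately for very short strings, which is consistent with the "up to logarithmic terms" caveat.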
3.2.2 Resemblance Coefficients

In order to compare two strings at token level, they can be regarded as sets of tokens A and B, over which basic set operations such as union, intersection, difference and complement are used. Additionally to the set operations, the cardinalities of their results are used too. Let A and B be two sets of tokens:

1. |A ∪ B| = number of different tokens in A and B together.
2. |A ∩ B| = number of different tokens in common between A and B.
3. |A \ B| = number of different tokens present in A but not in B.
4. |B \ A| = number of different tokens present in B but not in A.
5. |A| = number of different tokens in A.
6. |B| = number of different tokens in B.
7. |A|c = number of tokens in the complement of A, i.e. the tokens of the vocabulary that are not in the string. Nevertheless, this operation is not practical in the task of comparing relatively short strings, because the complement is taken against a finite universe of cardinality n: a vocabulary that can be considerably greater than the number of tokens in a specific set of tokens.
8. n = number of tokens in the universe (i.e. the vocabulary).

Using the elements previously enumerated, it is possible to build a wide range of resemblance coefficients. In Table 3.1, some of the best-known coefficients are listed; the name of each coefficient is either its usual name in the scientific community or the reference name given by De Baets and De Meyer [5].
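A few of the coefficients built from these cardinalities can be sketched over token sets (the selection and function names here are ours):

```python
def jaccard(A, B):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B)

def dice(A, B):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(A & B) / (len(A) + len(B))

def overlap(A, B):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(A & B) / min(len(A), len(B))

A = set("hewlett packard pavilion".split())
B = set("hewlett packard".split())
print(jaccard(A, B), dice(A, B), overlap(A, B))
```

Note that the overlap coefficient reaches 1 when one token set is a subset of the other, which makes it tolerant to missing tokens.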
3.3 Combination Methods

Unlike character-based similarity measures, token-level measures use tokens (i.e. words) as the comparison unit instead of characters. Some approaches at token level
Table 3.1: Some resemblance coefficient expressions, built from the cardinalities |A|, |B|, |A ∪ B|, |A ∩ B|, |A \ B|, |B \ A|, their complements and n (27 coefficients in total; e.g. the Jaccard coefficient |A ∩ B| / |A ∪ B| and the overlap coefficient |A ∩ B| / min(|A|, |B|)).
combine a character-level measure, used to compare tokens, to produce an overall measure between two multi-token strings. In addition, the structure of many token-level algorithms is analogous to that of many character-based algorithms [19]. Moreover, these approaches can also be applied in a recursive way: for instance, edit distance can be used to compare tokens; the obtained measures can be used to compare sentences; and, once again, these measures can be used to compare paragraphs, and so on.
Monge and Elkan [65] proposed a simple but effective method to combine an internal similarity measure sim' over the tokens of two tokenized strings a and b, with |a| and |b| tokens respectively:

simMongeElkan(a, b) = (1/|a|) Σ_{i=1}^{|a|} max_{j=1..|b|} sim'(a_i, b_j)    (3.1)

This measure is a greedy approximation to the optimal assignment problem, computed in a time complexity of O(|a| × |b|). This approximation is reasonable due to the fact that the best solutions to the assignment problem would have a time complexity of O(min(|a|, |b|)³).
In some of those measures, the two strings to be compared are tokenized (separated into words), then sorted and re-joined again (sorted tokens). Another heuristic consists of listing all token permutations of one of the strings and comparing each resulting string with the second one (permuted tokens); the permutation with the maximum similarity value is chosen. Two comparative studies [74, 21] showed that the permuting heuristic outperforms the sorting heuristic. Unfortunately, the complexity of exploring all token permutations is factorial, clearly impractical when the number of tokens in the strings is considerable.

Finally, the authors of [19] proposed the idea of using the edit distance algorithm with tokens as the comparison unit instead of characters, in a database-cleaning task. Analogously to the edit distance, the three basic operations are token replacement, token insertion and token deletion. The costs of these operations can be functions of the tokens in question; for instance, the replacement cost can be a character-based distance measure between the token to be replaced and the replacement token. Chaudhuri et al. proposed insertion and deletion costs weighted with the IDF (inverse document frequency) [45] factor, motivated by the idea that more informative tokens are more expensive to insert or delete. However, a static approach can consider the insertion and deletion costs as constants.
Table 3.2: Meaning of the Minkowski distance and generalized mean exponent.

p     meaning
+∞    maximum
2     quadratic mean
1     arithmetic mean
0     geometric mean
-1    harmonic mean
-∞    minimum
Although this combination method cannot be applied directly in the context of this chapter, it is worth reviewing because the information extraction system proposed in this thesis has a dictionary (the laptop lexicon) to be matched against a set of documents. Let R1 and R2 be two records with the same fields, whose values are a_1, ..., a_n and b_1, ..., b_n respectively; sim_i can be any character-level similarity measure of those presented in this chapter. The question is how to obtain an overall similarity measure sim(R1, R2) between both records. The most common approach is to choose a metric from the Minkowski family of distance metrics, defined by:

    sim(R1, R2) = ( Σ_{i=1}^{n} (sim_i(a_i, b_i))^p )^(1/p)                 (3.2)
The Euclidean and City-block distances are special cases of Equation 3.2 obtained when the exponent p is 2 and 1 respectively. The arithmetic average, the simplest measure from the Minkowski family, can be computed with weighted fields using the following expression:

    sim(R1, R2) = Σ_{i=1}^{n} w_i * sim_i(a_i, b_i),   with Σ_{i=1}^{n} w_i = 1.

Weights w_i were normalized so that they add up to one. A closely related expression is the generalized mean; the only difference between the expression of the generalized mean (Equation 3.3) and that of the Minkowski distance (Equation 3.2) is the factor 1/n.
    x(p) = ( (1/n) * Σ_{i=1}^{n} x_i^p )^(1/p)                              (3.3)

The term n means the number of values to be averaged and p is a parameter. When p = 2 the values are combined with the quadratic mean; for a complete description of the effect of p see Table 3.2.
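The generalized mean of Equation 3.3 can be sketched as follows; the function name is illustrative, and the p = 0 branch handles the geometric-mean limit case explicitly.

```python
def generalized_mean(values, p):
    # x(p) = ((1/n) * sum(x_i^p))^(1/p)  (Equation 3.3).
    # p = 1 gives the arithmetic mean, p = 2 the quadratic mean,
    # p -> 0 the geometric mean, p = -1 the harmonic mean (Table 3.2).
    n = len(values)
    if p == 0:
        # limit case: geometric mean
        prod = 1.0
        for x in values:
            prod *= x
        return prod ** (1.0 / n)
    return (sum(x ** p for x in values) / n) ** (1.0 / p)
```

For instance, generalized_mean([1, 0.1666], 2) yields about 0.7168, the quadratic-mean value used in the worked example below.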
This approach assumes that higher inter-token similarities are as informative as the lower ones. Consider the following example, using as internal character-based measure sim' a normalized edit distance converted into a similarity measure:

    a = "Lenovo inc."      b = "Lenovo corp."

    sim'(Lenovo, Lenovo) = 1 - 0/6 = 1
    sim'(Lenovo, corp)   = 1 - 5/6 = 0.1666
    sim'(inc, Lenovo)    = 1 - 5/6 = 0.1666
    sim'(inc, corp)      = 1 - 4/4 = 0

    sim(a, b) = (1/2) * (max(sim'(a_1, b_1), sim'(a_1, b_2)) + max(sim'(a_2, b_1), sim'(a_2, b_2)))
              = (1/2) * (1 + 0.1666) = 0.5833

The final measure of 0.5833 seems to be very low, taking into account that sim'(a_1, b_1) is equal to 1 and that the plain character-level measure over the full strings gives an even higher value:

    sim(Lenovo inc., Lenovo corp.) = 1 - 4/12 = 0.6666
An approach that promotes the higher similarities in the average calculated in Equation 3.1 is to use the generalized mean instead of the arithmetic mean. When the generalized mean exponent is m = 1, it corresponds to the arithmetic mean; when m = 2, it corresponds to the quadratic mean. In general, the higher the value of m, the stronger the promotion of the higher values in the mean. The generalized mean is computed with Equation 3.3, here with exponent m:

    x(m) = ( (1/n) * Σ_{i=1}^{n} x_i^m )^(1/m)

Let us consider the previous example with m = 2:

    sim(a, b) = ( (1/2) * (1^2 + 0.1666^2) )^(1/2) = 0.7168
This quadratic mean gives more importance to the number one than to 0.1666. We therefore propose to extend the Monge-Elkan measure, replacing its arithmetic mean with the generalized mean of exponent m:

    sim(a, b) = ( (1/|a|) * Σ_{i=1}^{|a|} max_{j=1..|b|} sim'(a_i, b_j)^m )^(1/m)        (3.4)
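The extended Monge-Elkan measure of Equation 3.4 can be sketched as follows, assuming a normalized edit-distance similarity as the internal measure sim'; with the tokens of the worked example (punctuation stripped) it reproduces 0.5833 for m = 1 and about 0.7168 for m = 2.

```python
def edit_distance(s, t):
    # Standard dynamic-programming edit distance with unit costs.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def sim(a, b):
    # Internal character-level similarity: 1 - editDistance / max length.
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def monge_elkan(s, t, m=1.0):
    # Extended Monge-Elkan (Equation 3.4): for each token of s take its best
    # match in t, then combine the maxima with the generalized mean of
    # exponent m (m = 1 recovers the original arithmetic-mean measure).
    a, b = s.split(), t.split()
    best = [max(sim(x, y) for y in b) for x in a]
    return (sum(v ** m for v in best) / len(best)) ** (1.0 / m)
```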
    Jaccard(A, B) = |A ∩ B| / |A ∪ B|

where |.| denotes set cardinality. It is convenient to rewrite the coefficient using the individual cardinalities and only one basic set operation, such as union or intersection. Using the identity |A ∩ B| = |A| + |B| - |A ∪ B|, the Jaccard coefficient becomes:

    Jaccard(A, B) = ( |A| + |B| - |A ∪ B| ) / |A ∪ B|                       (3.5)
As it was explained in Section 3.2.2, a resemblance coefficient can be used to compare two multi-token strings if they are considered as unordered sets of tokens. For instance, Jaccard(Sergio Jimenez, Sergio Jimenez Vargas) = 2/3 = 0.6666, whereas Jaccard(Cergio Gimenez, Sergio Jimenez Vargas) = 0/5 = 0.
Now, let us consider two multi-token strings A and B with n and m tokens respectively; the token set of A is a_1, a_2, ..., a_n and that of B is b_1, b_2, ..., b_m. Let us assume that there is a function card(.) that takes an unordered list of tokens and returns some kind of text cardinality. The Jaccard coefficient can then be rewritten as follows:

    Jaccard(A, B) = ( card(A) + card(B) - card(A ∪ B) ) / card(A ∪ B)       (3.6)

The function card(.) can have a crisp behavior, counting the number of different tokens (Equation 3.7). This function has to satisfy two basic properties, stated in Equations 3.8 and 3.9.
Table 3.3: Pairwise token similarities with the edit-distance-based similarity measure for the string "Jimenez Gimenez Jimenes".

              Jimenez   Gimenez   Jimenes
    Jimenez   1         0.8571    0.8571
    Gimenez   0.8571    1         0.7143
    Jimenes   0.8571    0.7143    1
Alternatively, card(.) could consider the similarity between tokens to estimate their cardinality. For instance, the cardinality of a set of tokens with high similarity among them (e.g. {Jimenez, Gimenez, Jimenes}) should be closer to 1 than to 3. On the other hand, if the tokens have a relatively low similarity among them (e.g. {giveaway, girlfriend, gingerbread}), the cardinality should be closer to 3. Finally, for tokens that do not share any characters among them (e.g. {Bush, Clinton, Kerry}), the cardinality should be exactly 3.
card(.)
Let consider the following cardinality function where larity measure between
sim(ai , aj )
is a simi-
ai
and
aj ,
n j=1
1 sim(ai , aj )
(3.10)
Given two multi-token strings A and B with m and n tokens, evaluating Equation 3.6 requires computing cardA(A), cardA(B) and cardA(A ∪ B). The total time complexity of computing these three cardinalities is O((m + n)^2) multiplied by the time complexity of the function sim(a, b). Note that the cardA(.) function reproduces the behavior of the classic cardinality in set theory when the internal function sim(a_i, a_j) is a crisp string comparator, i.e. returns 1 if the strings are identical and 0 otherwise.

Now, let us consider some examples. Tables 3.3, 3.4 and 3.5 show the pairwise token similarities, measured with the normalized edit distance, for the sample strings "Jimenez Gimenez Jimenes", "giveaway girlfriend gingerbread" and "Bush Clinton Kerry". Next, Table 3.6 shows the cardinalities computed with the proposed function cardA(.) of Equation 3.10.
Table 3.4: Pairwise token similarities with the edit-distance-based similarity measure for the string "giveaway girlfriend gingerbread".

                  giveaway   girlfriend   gingerbread
    giveaway      1          0.1999       0.3636
    girlfriend    0.1999     1            0.4545
    gingerbread   0.3636     0.4545       1
Table 3.5: Pairwise token similarities with the edit-distance-based similarity measure for the string "Bush Clinton Kerry".

              Bush   Clinton   Kerry
    Bush      1      0         0
    Clinton   0      1         0
    Kerry     0      0         1
The proposed concept of cardinality for a set of tokens is similar to the size of the compressed object information used in the Compression Distance (see Section 3.2.1.5). In fact, the function cardA(.) can be seen as an estimate of the size of the compressed information within the string. As an example of comparing two multi-token strings with the Jaccard coefficient, the cardinality function cardA(.) and an internal character-based similarity measure, one such comparison yields:

    Jaccard(A, B) = 0.70336169
Table 3.6: Examples of cardinality estimation with the function cardA(.).

    Multi-token string example        cardA
    Sergio Sergio Sergio              1
    Jimenez Gimenez Jimenes           1.1462
    giveaway girlfriend gingerbread   1.7939
    Bush Clinton Kerry                3
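Equation 3.10 and the rewritten Jaccard coefficient (Equation 3.6) can be sketched as follows. The internal measure is again assumed to be the normalized edit-distance similarity, so the sketch reproduces the cardinalities of Table 3.6 (e.g. about 1.1462 for {Jimenez, Gimenez, Jimenes}).

```python
def edit_distance(s, t):
    # Standard dynamic-programming edit distance with unit costs.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def sim(a, b):
    # Internal character-level similarity: 1 - editDistance / max length.
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def card_a(tokens):
    # Soft cardinality (Equation 3.10): each token contributes the inverse
    # of its total similarity to every token in the list (itself included).
    return sum(1.0 / sum(sim(ai, aj) for aj in tokens) for ai in tokens)

def soft_jaccard(a, b):
    # Jaccard rewritten with cardinalities (Equation 3.6): the intersection
    # is recovered as card(A) + card(B) - card(A union B).
    ta, tb = a.split(), b.split()
    ca, cb, cab = card_a(ta), card_a(tb), card_a(ta + tb)
    return (ca + cb - cab) / cab
```

With a crisp internal comparator the same code reproduces the classic Jaccard coefficient, as noted in the text.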
The Monge-Elkan and Token-Edit Distance measures, given two strings A and B, use the internal sim(a, b) function only to compare tokens belonging to A with tokens belonging to B; tokens belonging to the same string are never compared. In the same way, the character-level measures only make comparisons between the characters of both strings. In contrast, the method proposed in this section first compares the tokens of each string among themselves when cardA(A) and cardA(B) are computed; further, the combined token list A ∪ B is used to compute card(A ∪ B).
Therefore, the proposed method exploits information that is ignored by the existing approaches proposed in the past. The question to be answered is: can this information increase the performance of a string measure in a string-matching problem? The most similar approach to ours has been proposed recently by Michelson and Knoblock [59]. They combined the Dice coefficient with the Jaro-Winkler measure (see Section 3.2.1.2) to compare two multi-token strings A and B: the token pairs are compared with the Jaro-Winkler measure, and two tokens are put into the intersection set if they have a Jaro-Winkler similarity above an internal threshold. The Michelson and Knoblock approach has the same drawback as soft TF-IDF [25], namely the necessity of a second threshold for the internal measure, additional to the threshold that the combined method requires in order to decide the matching.
3.4
Experimental Evaluation
The aim of this section is to compare the performance of a set of character-level string similarity measures with the measures obtained from the combination of a character-level measure with some token-level measure. This comparison aims to answer the following questions:

Is it better to tokenize (i.e. separate into words) the text strings and treat them as sequences of tokens than to treat them as simple sequences of characters? Which combination of measures at character level and at token level has the best performance?

Additionally, the new combination methods presented in Sections 3.3.5 and 3.3.6 will be compared with two existing combination methods (i.e. Monge-Elkan and the edit distance at token level) to assess the contribution of those new approaches.
Table 3.7: Description of the 12 data sets used in the experiments.

    Name           #records   |rel1|   |rel2|   |rel1|x|rel2|   #tokens
    Birds-Scott1   38         15       23       345             122
    Birds-Scott2   719        155      564      87,420          3,215
    Birds-Kunkel   336        19       315      5,985           1,302
    Birds-Nybird   985        67       918      61,506          1,957
    Business       2,139      976      1,163    1,135,088       6,275
    Game-Demos     911        113      768      86,784          3,263
    Parks          654        258      396      102,168         2,119
    Restaurants    863        331      532      176,062         9,125
    UCD-people     90         45       45       2,025           275
    Animals        1,378      689      689      474,721         3,436
    Hotels         1,257      1,125    132      148,500         9,094
    Census         841        449      392      176,008         4,083
These data sets have been used in previous name-matching studies [25, 21, 74, 12]. The name-matching task consists of comparing two multi-token strings that contain names, addresses, telephone numbers or other information, and deciding whether the pair of strings refers to the same entity or not.
3.4.1.1
Data Sets
The data sets used for string matching are usually divided in two relations; a specific subset of the Cartesian product of those two relations is the set of valid matches. Table 3.7 describes the 12 used data sets with the total number of strings, the separation into two relations (|rel1| and |rel2|), the size of the Cartesian product of the relations, and the total number of tokens. The Animals data set was not originally separated in two relations, so the relations were obtained using the 327 valid matches and the 689 strings involved in those matches.

The tokenizing process was made using a small set of token (word) separator characters, including ':'. Additionally, all consecutive double token separators were withdrawn, and all heading and trailing blank spaces were removed. The number of tokens obtained with this tokenizing policy in each data set is reported in the last column of Table 3.7. Finally, all characters were converted to uppercase and characters with French accents were replaced with their equivalent characters without accent. For the sake of illustration, Table 3.8 shows a sample string (in some cases hard-to-match strings) for each data set.
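The tokenizing policy above can be sketched as follows. Since the full separator list is not reproduced here, the separator set and the accent-replacement table below are assumptions for illustration.

```python
import re

# Hypothetical separator set: the text lists ':' among the separators; the
# whitespace characters included here are assumptions.
SEPARATORS = r"[ \t:]+"

# Assumed accent table for the French accented characters mentioned above.
ACCENTS = str.maketrans("ÀÂÄÇÉÈÊËÎÏÔÖÙÛÜ", "AAACEEEEIIOOUUU")

def tokenize(s):
    # Uppercase, strip accents, split on separator runs (consecutive
    # separators collapse into one), and drop empty heading/trailing tokens.
    s = s.upper().translate(ACCENTS)
    return [t for t in re.split(SEPARATORS, s) if t]
```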
Table 3.8: Sample strings for each data set.

    Data-set (rel#)    Sample string
    Birds-Scott1 (1)   Great Blue Heron
    Birds-Scott1 (2)   Heron: Great blue (Grand Héron) Ardea herodias
    Birds-Scott2 (1)   King Rail Rallus elegans
    Birds-Scott2 (2)   Rail: King (Râle élégant) Rallus elegans
    Birds-Kunkel (1)   Eastern Towhee
    Birds-Kunkel (2)   Rufous-sided towhee (Pipilo erythrophthalmus)
    Birds-Nybird (1)   Yellow-rumped Warbler
    Birds-Nybird (2)   Yellow-rumped (Myrtle) Warbler
    Business (1)       BBN Corporation
    Business (2)       BOLT BERANEK & NEWMAN INC
    Game-Demos (1)     Learning in Toyland
    Game-Demos (2)     Fisher-Price Learning in Toyland
    Parks (1)          Timucuan Ecol. & Hist. Preserve
    Parks (2)          Timucuan Ecological and Historic Preserve
    Restaurants (1)    Shun Lee West 43 W. 65th St. New York 212/371-8844 Asian
    Restaurants (2)    Shun Lee Palace 155 E. 55th St. New York City 212-371-8844 Chinese
    UCD-people (1)     Barragry, Thomas B.
    UCD-people (2)     Dr. Thomas B. Barragry
    Animals (1)        Amargosa pupfish
    Animals (2)        Pupfish, Ash Meadows Amargosa
    Hotels (1)         3* Four Points Sheraton Airport 3/13 $30
    Hotels (2)         Four Points Barcelo by Sheraton Pittsburgh 3* Airport PIT
    Census (1)         GASBARRO PABLOS U 401 VILLAGRANDE
    Census (2)         GASBARRD PABLO V 401 VILLAGRANDE
3.4.1.2
Experimental Setup
Experiments were carried out comparing the full Cartesian product between the relations rel1 and rel2 of each data set, without using any blocking strategy. Blocking methods are used to reduce the number of candidate comparison pairs to a feasible number; avoiding them is a time-consuming approach, but it gives us better elements to fairly compare the performance of different matching techniques.

The baseline experiments consist of using a character-based string measure (e.g. edit distance, 2-grams, Jaro distance) to compare the full-length strings. In order to perform a fairer comparison, the baseline experiments were not carried out with the original strings but with the tokenized strings re-joined with the space character. The four measures used in the baseline experiments are described and named as follows:
1. exactMatch: It returns 1 if the two strings are identical and 0 otherwise.

2. ed.Dist: It returns a normalized similarity using the expression 1 - editDistance(A, B) / max(|A|, |B|). The used edit distance had three basic edit operations (i.e. insertion, deletion and replacement) at cost 1 each one.

3. Jaro: It returns the Jaro distance measure explained in Section 3.2.1.2.

4. 2grams: It returns #commonBigrams(A, B) / max(|A|, |B|), as explained in Section 3.2.1.3.
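Two of the baselines can be sketched as follows; the normalization of the bigram count by max(|A|, |B|) follows the expression given above (other formulations in the literature normalize by the number of bigrams instead).

```python
def exact_match(a, b):
    # exactMatch baseline: 1 if the strings are identical, 0 otherwise.
    return 1.0 if a == b else 0.0

def bigrams(s):
    # Multiset of character 2-grams of a string.
    return [s[i:i + 2] for i in range(len(s) - 1)]

def two_gram_sim(a, b):
    # 2grams baseline: #commonBigrams(A, B) / max(|A|, |B|); each shared
    # bigram is counted at most once per occurrence.
    common, rest = 0, bigrams(b)
    for g in bigrams(a):
        if g in rest:
            rest.remove(g)
            common += 1
    return common / max(len(a), len(b))
```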
The previously listed measures are also going to be used as internal similarity measure sim(a, b) in the following combination methods at token level:

1. SimpleStr.: The baseline; the whole strings are compared as simple character sequences with the sim(a, b) measure.

2. MongeElkan: The basic Monge-Elkan method in Equation 3.1.

3. MongeElkan0: The extension to the basic Monge-Elkan method with exponent m = 0 in Equation 3.4.

4. MongeElkan0.5: The extension to the basic Monge-Elkan method with exponent m = 0.5 in Equation 3.4.

5. MongeElkan1.5: The extension to the basic Monge-Elkan method with exponent m = 1.5 in Equation 3.4.

6. MongeElkan2: The extension to the basic Monge-Elkan method with exponent m = 2 in Equation 3.4.
Table 3.9: The second set of character-/token-level combination methods, based on cardinality (methods 11 to 19). Each row is a resemblance coefficient expressed with card(A), card(B), card(A ∩ B) and card(A ∪ B); among the expressions are:

    2*card(A ∩ B) / (card(A) + card(B))
    (card(A ∪ B) - min(card(A), card(B))) / max(card(A), card(B))
    card(A ∩ B) / min(card(A), card(B))
    card(A ∩ B) / max(card(A), card(B))
    min(card(A), card(B)) / card(A ∪ B)
    max(card(A), card(B)) / card(A ∪ B)
7. MongeElkan5: The extension to the basic Monge-Elkan method with exponent m = 5 in Equation 3.4.

8. MongeElkan10: The extension to the basic Monge-Elkan method with exponent m = 10 in Equation 3.4.

9. max_max: The maximum of the internal similarities over all the |a| x |b| token pairs, i.e. max_i max_j sim(a_i, b_j).

10. TokenEd.Dist: The edit distance computed at token level, deriving the token replacement cost from sim(a, b).
Additionally to this first set of combination methods, a second set based on resemblance coefficients using the soft text cardinality cardA(.) (see Equation 3.10) is listed and explained in Table 3.9. The total number of combination strategies at token level is thus the 10 methods previously listed plus the 9 cardinality-based methods in Table 3.9; these 19 token-level strategies combined with the four character-based measures previously listed give 76 multi-token comparison measures. Each combination is referred to using the name heading its definition or the column Name in Table 3.9, with the hyphen-separated format name_of_character_level_metric-name_of_token_level_metric, e.g. ed.Dist.-overlap.

The total number of experiments carried out in the tests combined the 76 string matching methods with the 12 data sets; that is, 912 experiments. For each experiment, the string similarity measure is computed and stored for each pair of strings in the Cartesian product of the two relations in the data set. All computed similarities were in the range [0, 1], having 1 as the maximum level of similarity and 0 as the maximum dissimilarity.
3.4.1.3
The matching problem between two relations (or simply sets) can be viewed as a classification problem over the Cartesian product of the sets. A subset of this Cartesian product is the set of valid matches (i.e. the positives), and its complement is the set of non-valid matches (i.e. the negatives). A similarity measure value in the range [0, 1] is computed for each possible string pair in a data set; however, in order to decide if a pair is a valid match it is necessary to establish a threshold. Pairs with similarity above the threshold are labeled positive and the rest are labeled negative. Thus, it is possible to define:

True Positive: a pair labeled positive that is a valid match.
False Positive: a pair labeled positive that is not a valid match.
False Negative: a pair labeled negative that is a valid match.
The performance in the context of classification tasks is usually measured using precision, recall and F-measure:

    Precision = #true positives / (#true positives + #false positives)

    Recall = #true positives / (#true positives + #false negatives)

    F-measure = 2 * precision * recall / (precision + recall)

For each experiment (i.e. each pair {method, data set}), the counting of true positives, false positives and false negatives was made ranging the decision threshold over [0, 1], which gives as a result 101 values of precision, recall and F-measure. The graph shown in Figure 3.2 plots the data obtained for the experiment {ed.Dist.-Cosine(A), Business}. The graph also shows the trade-off between precision and recall.
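The threshold sweep just described can be sketched as follows; the data layout (a dictionary of scored pairs and a set of valid matches) is an assumption for illustration.

```python
def prf_curves(scores, true_matches, steps=101):
    # scores: {(id1, id2): similarity in [0, 1]}; true_matches: set of pairs
    # that are valid matches. Sweep the decision threshold over [0, 1] and
    # report (threshold, precision, recall, F-measure) at each step.
    curves = []
    for k in range(steps):
        t = k / (steps - 1)
        predicted = {pair for pair, s in scores.items() if s >= t}
        tp = len(predicted & true_matches)
        fp = len(predicted) - tp
        fn = len(true_matches - predicted)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        curves.append((t, precision, recall, f))
    return curves
```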
Figure 3.2: Precision, recall and F-measure vs. threshold curves for the experiment {ed.Dist.-Cosine(A), Business}.
This F-measure, known as the F1 score, weighs precision and recall equally and can be seen as a general performance measure of the overall experiment. Moreover, the maximum F1 score is close to the intersection point of the precision and recall curves, which is in fact another general performance measure for the experiment; however, the F1 score is more commonly used. The same data can be used to plot a precision vs. recall curve: Figure 3.3 shows the precision vs. recall curve obtained from the experiment {ed.Dist.-Cosine(A), Business}.
Another general performance measure, commonly used in information retrieval problems, is the interpolated average precision (IAP) [6]. A method for computing IAP is to interpolate the precision vs. recall curve at 11 evenly separated recall points (i.e. 0.0, 0.1, 0.2, ..., 1.0). The interpolated precision at recall point r is the maximum precision obtained at any point with recall greater than or equal to r. Figure 3.4 shows an interpolated version of the curve in Figure 3.3 using the described procedure.

The area under the interpolated precision vs. recall curve is a general performance measure for the experiment. Consider a classifier that obtains precision and recall values of 1 at some threshold: its interpolated curve is a step, the area under that curve is 1, and that result is the maximum possible performance, i.e. a perfect classifier. In order to approximate the area under the curve, the precision values at the 11 evenly separated recall points are averaged. The measure obtained is known as the 11-point interpolated average precision; it summarizes the general performance of the string matching technique considering all the possible values of the decision threshold.
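The 11-point interpolation can be sketched as follows; `points` is assumed to be the list of (recall, precision) pairs produced by the threshold sweep.

```python
def interpolated_average_precision(points):
    # 11-point IAP: at each recall level r in {0.0, 0.1, ..., 1.0}, take the
    # maximum precision among points with recall >= r, then average the 11
    # interpolated values.
    total = 0.0
    for k in range(11):
        r = k / 10.0
        reachable = [p for rec, p in points if rec >= r - 1e-9]
        total += max(reachable) if reachable else 0.0
    return total / 11.0
```

A perfect classifier, with a point at precision 1 and recall 1, obtains an IAP of 1.0, matching the step-curve argument above.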
Figure 3.3: Precision vs. recall curve for the experiment {ed.Dist.-Cosine(A), Business}.

Figure 3.4: Interpolated precision vs. recall curve for the experiment {ed.Dist.-Cosine(A), Business}.
Using a matching method in practice requires selecting a specific threshold, which is unknown beforehand. Nevertheless, we are interested in measuring the general performance of the string matching method: clearly, the better the general performance of the matching method, the better the performance of the matching task at reasonable threshold values. The IAP was selected as the primary performance measure since its graphical interpretation (area under the curve) makes aggregation operations, such as the average, more convenient. For instance, in order to report in only one number the performance of one of the 76 matching methods, its IAP values obtained for the 12 data sets are averaged and reported under the name avg. IAP.
3.4.1.4
Statistical Test
The average IAP over the 12 data sets, for each one of the 76 tested matching methods, is the selected performance measure. However, for two given methods, m1 and m2, it is possible that m1 reaches a higher average IAP than m2 merely by chance. So, the question is how to decide whether m1 consistently outperforms m2. In order to answer this question, the Wilcoxon signed-rank test [92] was used to assess the 12 paired results (one per data set). The Wilcoxon's test is a nonparametric alternative to the two-sample t-test; the term non-parametric means that there is no assumption that the results follow a particular distribution. This test evaluates whether the medians of the two samples are equal (H0, the null hypothesis). We are interested in a one-tailed test, where the alternative hypothesis is that the median of the results of m1 is greater than that of m2.

The assumptions required by the test are the following. The results obtained with the methods m1 and m2 are paired; if both samples are assumed to have identical distributions, their differences should have a symmetrical distribution, which is in fact the most important assumption of the test. In addition, it is required that the sample has been randomly chosen from the population (the sample in our case is the 12 used data sets). Besides, this test has no requirements about the size of the sample; in fact, there are tables of critical values with sample sizes starting at 5. The procedure to perform the Wilcoxon's test is as follows:

1. Calculate the performance difference between m1 and m2 for each pair of results on the 12 data sets.

2. Ignore the zero differences (we ignored differences lower than 0.0001). The ignored differences reduce the size of the sample from 12 to a lower number n.

3. Rank the absolute values of the remaining differences. If there are identical differences, average their ranks.

4. Sum the ranks of the negative differences in W.
Table 3.10: Critical values for the Wilcoxon test statistic for a one-tailed test.

    n    10.0%   5.0%   2.5%
    5    2       1      -
    6    3       3      1
    7    5       4      3
    8    8       6      4
    9    11      9      6
    10   13      11     9
    11   17      14     11
    12   21      18     14
    13   26      22     17
    14   31      26     18
    15   36      31     22
5. Take the statistic W and compare it against the critical value in Table 3.10 for the sample size n; if W is lower than or equal to the critical value, the null hypothesis is rejected at the corresponding significance level. The critical values used for the Wilcoxon test statistic at the different sample sizes are shown in Table 3.10.
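Steps 1 to 4 of the procedure can be sketched as follows (a minimal illustration; a full implementation of the test is available as scipy.stats.wilcoxon).

```python
def wilcoxon_w(results1, results2, eps=1e-4):
    # Paired differences; near-zero differences dropped; absolute values
    # ranked (ties get averaged ranks); W is the sum of the ranks of the
    # negative differences. Returns (effective sample size n, W).
    diffs = [a - b for a, b in zip(results1, results2)]
    diffs = [d for d in diffs if abs(d) >= eps]
    ranked = sorted(diffs, key=abs)
    ranks = [0.0] * len(ranked)
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0   # average rank of the tie group
        i = j
    w = sum(r for r, d in zip(ranks, ranked) if d < 0)
    return len(ranked), w
```

With the differences of Table 3.11 this returns W = 4, below the 2.5% critical value of 14 for n = 12, so the null hypothesis is rejected.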
Now, let us consider the results of the experiments for the methods 2gram-MongeElkan2 and ed.Dist-MongeElkan0, shown in Table 3.11. The sum of the ranks of the negative differences is W = 4, and the critical value in Table 3.10 with a significance of 2.5% and n = 12 is 14. Thus, the null hypothesis is rejected, and the evidence shows, at the 2.5% significance level, that the matching method 2gram-MongeElkan2 outperforms ed.Dist-MongeElkan0.

Consider now the comparison between 2gram-MongeElkan2 and 2gram-compress(A), shown in Table 3.12. Here W = 28, while the critical value is 14 at the significance level of 2.5%. Thus, the null hypothesis is accepted: although the value of the average IAP is greater for 2gram-MongeElkan2, there is not enough statistical evidence to affirm that this method outperforms 2gram-compress(A).
The Wilcoxon's test was performed to compare each one of the 76 matching methods against the other 75 methods. For each method, the quotient of the number of passed tests over 75 is reported as the Wilcoxon's Rate.
3.4.2 Results
3.4.2.1 General Measures Ranking
Tables 3.13 and 3.14 show a list, ordered by average IAP (avg. IAP), with the results obtained for each one of the 76 combination matching methods tested. The avg. IAP values were obtained averaging the IAP results obtained in the 12 data sets.
Table 3.11: IAP Wilcoxon signed-rank test for the experiments 2gram-MongeElkan2 (method1) and ed.Dist-MongeElkan0 (method2).

    Data-set       method1    method2    diff.       rank   W+   W-
    Birds-Nybirds  0.694809   0.674526   0.020283    1      1
    Game-Demos     0.541242   0.491103   0.050139    2      2
    Animals        0.128630   0.074987   0.053643    3      3
    Census         0.652203   0.724021   -0.071818   4           4
    Birds-Scott1   0.869519   0.781612   0.087907    5      5
    Restaurants    0.891184   0.781859   0.109325    6      6
    UCD-people     0.849196   0.697854   0.151343    7      7
    Parks          0.839082   0.654155   0.184927    8      8
    Birds-Scott2   0.877792   0.690973   0.186820    9      9
    Hotels         0.551949   0.363171   0.188778    10     10
    Birds-Kunkel   0.892562   0.697015   0.195547    11     11
    Business       0.717120   0.498934   0.218186    12     12
    AVERAGE        0.708774   0.594184                      74   4
Table 3.12: IAP Wilcoxon signed-rank test for the experiments 2gram-MongeElkan2 (method1) and 2gram-compress(A) (method2).

    Data-set       method1    method2    diff.       rank   W+   W-
    Restaurants    0.891184   0.890231   0.000953    1      1
    Birds-Scott1   0.869519   0.858471   0.011048    2      2
    Parks          0.839082   0.851435   -0.012353   3           3
    Birds-Nybird   0.694809   0.729448   -0.034639   4           4
    Hotels         0.551949   0.508777   0.043172    5      5
    Animals        0.128630   0.083351   0.045280    6      6
    Business       0.717120   0.671056   0.046064    7      7
    Birds-Scott2   0.877792   0.779647   0.098145    8      8
    UCD-people     0.849196   0.738955   0.110241    9      9
    Census         0.652203   0.806547   -0.154344   10          10
    Game-Demos     0.541242   0.708952   -0.167709   11          11
    Birds-Kunkel   0.892562   0.460259   0.432303    12     12
    AVERAGE        0.708774   0.673927                      50   28
The column tests in Tables 3.13 and 3.14 shows the number of times that each method statistically outperformed the other 75 methods using the Wilcoxon's test; that is, the number of accepted alternative hypotheses stating that the median of the results obtained with the method is greater than the median of the method against which it is compared, with a significance of 5%. Finally, the last column shows the rate of passed statistical tests divided by the total number of methods minus one, namely:

    Wilcoxon's Rate = #passed tests / 75

The exactMatch baseline obtained an avg. IAP of 0.1835057, surpassing 15 other methods; however, as it was expected, this method did not statistically outperform other methods. The complete rank list is reported for the sake of completeness and to give a general guide of the good and the not-so-good method combinations.
3.4.2.2
The baselines of our experiments are four character-level measures used for comparing pairs of multi-token strings as simple sequences of characters: 2grams, exactMatch, ed.Dist. and the Jaro distance. Table 3.15 compares the results obtained with the baseline methods versus the Monge-Elkan method (see Equation 3.1) using as internal measure sim(a, b) the same baseline measure; for instance, 2gram-simpleStr. versus 2gram-MongeElkan. The results shown in Table 3.15 are in agreement with the results obtained by Cohen et al. [25], who used the same IAP performance metric and obtained the best results with combined measures.

The four baseline measures combined with the Monge-Elkan method improved in all cases the IAP when compared against the baselines, as reported in Table 3.15; the relative improvement is reported in the last column. The column W shows the test statistic for the one-tailed Wilcoxon's test. Given the critical values of W, the only improvement with enough statistical evidence is the one obtained with exactMatch-MongeElkan over exactMatch-simpleStr (starred with * in Table 3.15). This result also reveals the fact that, in the 12 used data sets, the main matching challenge is the token disorder.

The Monge-Elkan combination method is considered by the string processing and record linkage communities as one of the best static combination methods; however, the obtained improvements did not have enough statistical evidence of consistency in the sample of 12 data sets.
Table 3.13: General ranking for combined string matching methods. Part 1/2.

    rank   avg. IAP   tests   Wilcoxon's Rate
    1      0.739927   67      89.33%
    2      0.729502   59      78.67%
    3      0.728885   59      78.67%
    4      0.728885   59      78.67%
    5      0.714650   44      58.67%
    6      0.710334   37      49.33%
    7      0.708774   47      62.67%
    8      0.707249   39      52.00%
    9      0.707147   39      52.00%
    10     0.707147   39      52.00%
    11     0.704047   34      45.33%
    12     0.703136   31      41.33%
    13     0.701698   35      46.67%
    14     0.696779   33      44.00%
    15     0.694706   38      50.67%
    16     0.689713   33      44.00%
    17     0.681110   22      29.33%
    18     0.678082   30      40.00%
    19*    0.677859   24      32.00%
    20     0.677835   26      34.67%
    21     0.674830   23      30.67%
    22     0.674830   23      30.67%
    23     0.673927   28      37.33%
    24     0.673825   28      37.33%
    25     0.672695   21      28.00%
    26     0.672688   28      37.33%
    27     0.671519   28      37.33%
    28     0.662875   20      26.67%
    29     0.658941   19      25.33%
    30     0.658511   18      24.00%
    31     0.658385   18      24.00%
    32     0.657882   18      24.00%
    33     0.657856   18      24.00%
    34     0.657585   18      24.00%
    35     0.656911   18      24.00%
    36     0.656911   25      33.33%
    37     0.656879   26      34.67%
    38     0.656635   26      34.67%
    39     0.656062   25      33.33%
    40     0.656062   25      33.33%

Methods in this ranking include:
2gram-MongeElkan1.5 2gram-MongeElkan2
Jaccard Sorensen Dice
Jaro-MongeElkan5 ed.Dist.-MongeElkan2
2gram-MongeElkan
2gram-MongeElkan5
compress R1
2gram-compress(A) 2gram-R1(A)
overlap
exactMatch-MongeElkan2 exactMatch-MongeElkan0.5 exactMatch-MongeElkan1.5 exactMatch-MongeElkan5 exactMatch-MongeElkan10 ed.Dist.-Jaccard(A) 2gram-MongeElkan0.5 Jaro-MongeElkan1.5 ed.Dist.-Sorensen(A) ed.Dist.-Dice(A)
Table 3.14: General ranking for combined string matching methods. Part 2/2.

    rank   avg. IAP   tests   Wilcoxon's Rate
    41     0.636822   22      29.33%
    42     0.635641   22      29.33%
    43     0.629443   18      24.00%
    44     0.628162   17      22.67%
    45     0.616575   20      26.67%
    46     0.610215   18      24.00%
    47     0.594873   16      21.33%
    48     0.594511   16      21.33%
    49     0.591896   17      22.67%
    50     0.585316   16      21.33%
    51*    0.577744   16      21.33%
    52     0.573967   17      22.67%
    53     0.573967   17      22.67%
    54     0.554194   15      20.00%
    55     0.553339   17      22.67%
    56*    0.550062   16      21.33%
    57     0.523569   15      20.00%
    58     0.516899   15      20.00%
    59     0.466801   15      20.00%
    60     0.450146   14      18.67%
    61     0.427008   14      18.67%
    62     0.334419   11      14.67%
    63     0.255566   10      13.33%
    64     0.255534   10      13.33%
    65     0.252449   10      13.33%
    66*    0.183505   0       0.00%
    67     0.130602   5       6.67%
    68     0.129052   3       4.00%
    69     0.129052   3       4.00%
    70     0.109623   2       2.67%
    71     0.101732   2       2.67%
    72     0.101724   2       2.67%
    73     0.101721   2       2.67%
    74     0.101709   2       2.67%
    75     0.028704   1       1.33%
    76     0.028212   0       0.00%

Methods in this ranking include:
ed.Dist.-MongeElkan0.5
2gram-tokenEd.Dist.
ed.Dist.-MongeElkan0
R8 exactMatch-tokenEd.Dist. Jaro
Jaro-Jaccard(A) Jaro-Sorensen(A) Jaro-Dice(A) Jaro-cosine(A) Jaro-max_max 2gram-max_max ed.Dist.-max_max max_max Jaro-overlap(A) Jaro-R8(A)
Table 3.15: Average IAP (± standard deviation) for the baseline measures combined with the Monge-Elkan method.

    sim(a, b)    MongeElkan            W
    2grams       0.701698 ± 0.227473   33
    exactMatch   0.658511 ± 0.249272   1*
    ed.Dist.     0.678082 ± 0.241435   26
    Jaro         0.636822 ± 0.275805   32
Table 3.16: Average IAP for the Monge-Elkan combination method compared with the best combination method.

    sim(a, b)    MongeElkan            W
    2grams       0.701698 ± 0.227473   17*
    exactMatch   0.658511 ± 0.249272   18*
    ed.Dist.     0.678082 ± 0.241435   19**
    Jaro         0.636822 ± 0.275805   18*
This result also shows that the Wilcoxon's test is not easy to pass, even at a relatively high significance level such as 0.05.

The results in Table 3.16 have the same format as those in Table 3.15, but compare each Monge-Elkan combination against the combination method with the highest avg. IAP reached for the same baseline measure. The values starred with one * or two ** mean that the Wilcoxon's test was passed with a significance of 0.05 and 0.1 respectively. Clearly, the best methods outperformed the Monge-Elkan method. It is also important to note that the standard deviations obtained by the best methods are lower than those obtained by the Monge-Elkan combinations.
3.4.2.3
The extension proposed for the Monge-Elkan method in Equation 3.4 was tested for several values of the exponent m: 0, 0.5, 1 (i.e. the standard Monge-Elkan), 1.5, 2.0, 5.0 and 10.0. The averaged IAP over the 12 data sets for each internal character-level similarity measure is plotted in Figure 3.5; the average of the four curves is shown in Figure 3.6.

Additionally, the Wilcoxon's test was used to compare the standard Monge-Elkan method (m = 1) against the extended method with exponents m of 1.5, 2.0 and 5.0. The aim of this test is to establish whether there is enough statistical evidence that the standard Monge-Elkan method is improved when the exponent m in Equation 3.4 is above 1. Table 3.17 shows the resulting Wilcoxon's test statistics, starred with one or two * given their significance of 5% or 10% respectively.
Figure 3.5: Avg. IAP vs. the exponent m in the extended Monge-Elkan measure, for each internal measure (2gram, ed.Dist., exactMatch and Jaro).

Figure 3.6: Avg. IAP vs. the exponent m in the extended Monge-Elkan measure, averaged over the four internal measures.
Table 3.17: Wilcoxon's test statistic W comparing the extended Monge-Elkan methods against the standard Monge-Elkan (m = 1), for the internal measures sim(a, b) in {2grams, exactMatch, ed.Dist., Jaro}.

    MongeElkan1.5:   15*    18*    5*
    MongeElkan2:     19**   20**   12*
    MongeElkan5:     42     27     12*
Table 3.18: Ranking of resemblance coefficients' performance using the cardinality function cardA(.).

    rank   resemblance coef.
    1      cosine
    2      Jaccard
    3      Sorensen
    4      Dice
    5      compress
    6      R1
    7      overlap
    8      R14
    9      R8
3.4.2.4
The results presented in this sub-section aim to assess the performance of the different tested resemblance coefficients. Once again, the average of the IAP results over the 12 data sets and the four internal character-level string similarity measures (48 experiments in total) is computed for each resemblance coefficient. Table 3.18 reports the ranking obtained using cardA(.). Although the resemblance coefficients most used in the name-matching task are Jaccard, Dice and overlap, the best-performing coefficient was cosine. It is computed between two sets as the quotient of the cardinality of the intersection and the geometric mean of the individual cardinalities. The geometric mean between two numbers always returns values lower than or equal to the arithmetic mean, and the greater the difference between those numbers, the greater the difference between the arithmetic and geometric means. Consequently, the cosine coefficient considers as more similar two sets with a higher difference in their cardinalities. The interpretation of that effect in the name-matching task is that the cosine coefficient tries to compensate the resemblance between two names when one of them has several additional tokens.
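The cosine coefficient just described can be sketched from the three cardinalities; the function name is illustrative, and the cardinalities can be crisp counts or soft values from cardA(.).

```python
import math

def cosine_coefficient(card_a, card_b, card_union):
    # Cosine resemblance coefficient: the intersection cardinality divided
    # by the geometric mean of the individual cardinalities. The intersection
    # is recovered as card(A) + card(B) - card(A union B).
    inter = card_a + card_b - card_union
    return inter / math.sqrt(card_a * card_b)
```

For sets of cardinalities 2 and 8 sharing 2 elements, cosine gives 2/sqrt(16) = 0.5 while Jaccard gives 2/8 = 0.25, illustrating how cosine compensates when one name has several additional tokens.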
Table 3.19: The best matching method for each data set, the best baseline, and the IAP of 2gram-cosine(A).

    Data-set       best method              IAP b.m.   best baseline   IAP b.b.   IAP 2gram-cosine(A)
    Animals        Jaro-MongeElkan1.5       0.129343   2gram           0.066294   0.103992
    Birds-Kunkel   2gram-MongeElkan2        0.892562   Jaro            0.773828   0.722540
    Birds-Nybirds  ed.Dist.-cosine(A)       0.734960   2gram           0.714876   0.733630
    Birds-Scott1   exactMatch-cosine(A)     0.889610   2gram           0.848485   0.876623
    Birds-Scott2   exactMatch-R8(A)         0.909091   ed.Dist.        0.869423   0.894928
    Business       2gram-tokenEd.Dist.      0.731611   Jaro            0.678614   0.684721
    Census         ed.Dist.                 0.859146   ed.Dist.        0.859146   0.818457
    Demos          2gram-cosine(A)          0.775548   Jaro            0.725928   0.775548
    Hotels         exactMatch-cosine(A)     0.722460   2gram           0.610279   0.612641
    Parks          Jaro                     0.888874   Jaro            0.888874   0.867983
    Restaurants    exactMatch-Sorensen(A)   0.905571   Jaro            0.890970   0.899097
    Ucd-people     exactMatch-cosine(A)     0.909091   2gram           0.796911   0.888961
3.4.2.5
In order to allow comparison with past and future studies, the method with the best results for each data set is reported in the second and third columns of Table 3.19. Most of the best-performing methods involve new or extended methods proposed in this thesis, or combinations not previously tested. Those best results are compared with the baseline method that obtained the best IAP on the same data set. Additionally, the IAP obtained by the method with the best results on average (i.e. 2grams-cosine(A)) is reported in the last column.
3.5
Summary
The final rank obtained from all the experiments (shown in Tables 3.13 and 3.14) shows and simultaneously hides information. Let us make an analogy between this ranking and the results of an athletic race. The four character-based measures (2gram, ed.Dist., Jaro and exactMatch) would be the fair competitors, and the 28 combination methods could be seen as performance-enhancing drugs (simpleStr. would also correspond to a fair competitor). This bare result would show that the doping methods clearly enhanced the performance of the athletes. However, the order in the rank list does not necessarily reflect the performance of the athletes or the drugs. Consider, for instance, ranks 6 and 7: 2gram-MongeElkan1.5 reached a higher avg. IAP than 2gram-MongeElkan2, but the latter was more often the winner. Nevertheless, 2gram-cosine(A) obtained the best results on average.
The rank analysis is based on an all-against-all comparison, and it does not reflect a good comparison against baselines. Tables 3.15 and 3.16 showed that the baselines (at character level and at token level) were clearly outperformed by the proposed combination methods. Even better, the improvements obtained with the methods cosine(A), MongeElkan2 and MongeElkan5 over the basic MongeElkan method were consistent. The results clearly showed that treating text strings as sequences of tokens instead of sequences of characters provides an important gain. The conclusions that can be extracted from the empirical results are summarized as follows:
- The function cardA(.) can be seen as a soft-cardinality, or the size of the compressed information of a set of words. Thus, this function in combination with resemblance coefficients provides a convenient general method for text string comparisons.
- The resemblance coefficients that have in their numerator the cardinality of the intersection of the tokens, such as cosine and Dice, reached the best results. The cosine coefficient, in particular, had a good and consistent behavior. To the best of our knowledge, this combination had not previously been evaluated in the string-matching task.
- The use of the generalized mean instead of the arithmetic mean in the MongeElkan measure seems to improve consistently the performance of the basic measure. The best results were obtained when the exponent m is m = 2.
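To illustrate the generalized-mean variant of the MongeElkan measure, here is a small Python sketch; the internal character-level similarity (difflib's SequenceMatcher ratio) and the example token lists are stand-ins chosen for this illustration, not the measures evaluated in the thesis:

```python
from difflib import SequenceMatcher

def internal_sim(a: str, b: str) -> float:
    # Stand-in character-level similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(a_tokens, b_tokens, m=2.0):
    """MongeElkan with a generalized (power) mean of exponent m:
    each token in A contributes its best similarity against B."""
    total = 0.0
    for a in a_tokens:
        best = max(internal_sim(a, b) for b in b_tokens)
        total += best ** m
    return (total / len(a_tokens)) ** (1.0 / m)

score = monge_elkan(["hewlett", "packard"], ["hewlet", "packard", "inc"])
```

With m = 1 the original arithmetic-mean MongeElkan is recovered; by the power-mean inequality, the m = 2 variant never scores lower than the m = 1 variant on the same pair.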
Chapter 4
Knowledge-based Information Extraction for Data-Rich Documents
In Section 4.1 the problem of term disambiguation is addressed, considering it as a generalization of the word sense disambiguation (WSD) problem in the NLP field. Subsection 4.1.1 presents different approaches to the problem of providing a comparison measure between concepts in ontologies. The relevant WSD approaches for the term disambiguation problem are reviewed in Subsection 4.1.2. Section 4.2 presents the final information extractor, which integrates the approximate string comparison techniques studied in Chapter 3 with the term disambiguation method proposed in Section 4.1. Subsections 4.2.1, 4.2.2, 4.2.3 and 4.2.4 give additional details about the implementation of the different modules of the system. The experimental evaluation of the proposed system using the laptop domain ontology, lexicon and corpus is presented in Section 4.3. In addition to the experiments carried out with the laptop lexicon, new synthetic lexicons were generated by adding noise to the original lexicon in order to evaluate the robustness of the system. Finally, some concluding remarks are presented in Section 4.4.
4.1
Term Disambiguation
In Chapter 2 the laptop ontology, lexicon and corpus were presented. Existing and proposed techniques for matching the lexicon entries against the document corpus were compared in Chapter 3. As a result, lexicon terms were identified in the token sequence of the documents in the corpus. However, a new problem arises: ambiguity. The first source of ambiguity is homonymy, i.e. a single term linked to several concepts. For instance, the acronym term MB can be linked to the terms MegaByte and MotherBoard. In Section 2.3 we concluded that 17.06% of the terms in the lexicon have this kind of ambiguity. The second source of ambiguity is morphological similarity. This ambiguity originates from the fact that if an approximate string comparison technique is used to compare two similar strings, then they could be considered as equal. For instance, a term naming an optical disk technology and a term naming a laptop exterior color could be accepted as a valid match if their strings are similar enough under the edit distance with a permissive threshold. However, it is expected that the use of approximate string comparison techniques provides resilience against noise (misspellings, typos, OCR errors and morphological variations) to the final extraction process. The trade-off between the morphological ambiguity and that resilience is studied in this chapter.
Two elements were used for dealing with the two mentioned sources of ambiguity: i) the context surrounding each term occurrence, and ii) the information contained in the laptop ontology. Unfortunately, the laptop ontology itself adds a third source of ambiguity: use ambiguity. Use ambiguity originates from the fact that the ontology is a directed acyclic graph (DAG) instead of a tree. Nevertheless, this property in ontology design allows a concept to be reused. For instance, if the term MB (linked to the concept MegaByte) is found in a document, it is not possible (without considering additional information) to determine whether it refers to the memory units of the laptop main memory or of the cache memory. This ambiguity was briefly outlined in Section 2.4 when the possible semantic paths were considered for labeling a sample of a document sequence. Figure 2.5 showed that 58.47% of the terminal nodes in the laptop ontology have this kind of ambiguity. Because of the previously discussed sources of ambiguity, the token sequences in the document corpus can have many semantic interpretations. We noticed that this problem is similar to the word sense disambiguation (WSD) problem in the NLP field. Most approaches that address WSD in natural language text use a general lexicalized ontology called WordNet [63]. WordNet is structurally similar to our laptop ontology and lexicon. Thus, we proposed a solution to our information extraction problem similar to the WSD solutions that use WordNet as disambiguation resource.
Semantic relatedness (SR) measures are important for WSD algorithms because they provide information to evaluate the coherence of a discourse. Semantic similarity is a less general concept than SR, because it is defined as a measure for semantic networks connected with only a single type of relationship (usually is-a hierarchies). Due to the fact that the laptop ontology was built mainly with one type of relationship, we do not make distinctions between SR and semantic similarity measures. Resnik [79] introduced the concept of information content, IC(c) = -log P(c), where P(c) is the probability of finding an instance of the concept c in a large corpus. Some SR measures have been proposed in the literature using this concept [79, 44, 55, 15]. However, we are only interested in the SR measures that can be computed from the ontology graph.
4.1.1.1

The simplest SR measure between two concepts is the path length [78], i.e. the number of edges of the shortest path connecting them in the ontology (see Figure 4.1). However, the path length metric can generate misleading results since, in practice, pairs of concepts closer to the ontology root tend to be less related than concepts closer to the leaves. In order to lessen this situation, Leacock and Chodorow proposed to normalize the path length by the depth D of the hierarchy:

sim_LC(a, b) = -log( pathLength(a, b) / 2D )    (4.1)

Another approach to the same problem has been proposed by Wu and Palmer [94] using the concept of least common subsumer (LCS), i.e. the deepest concept, measured from the root, that is common to two concepts. The LCS is also known as the lowest common ancestor. For instance, in Figure 4.1 Processor is the LCS of Brand and GHz. The Wu and Palmer measure is given by Equation 4.2, where the function depth(x) represents the number of edges from the concept x to the root:

sim_WP(a, b) = 2 * depth(LCS(a, b)) / ( depth(a) + depth(b) )    (4.2)

The path length measure can be written using the LCS and depth(x) functions with the following expression:

pathLength(a, b) = depth(a) + depth(b) - 2 * depth(LCS(a, b))
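These functions can be sketched over a toy tree; the parent map below is a made-up fragment loosely following Figure 4.1 and is only for illustration:

```python
# Made-up fragment of the hierarchy in Figure 4.1 (child -> parent).
PARENT = {
    "Processor": "Laptop", "Memory": "Laptop",
    "Brand": "Processor", "Speed": "Processor",
    "Frequency": "Speed", "FrequencyUnits": "Frequency",
    "GHz": "FrequencyUnits",
}

def depth(c):
    """Number of edges from concept c up to the root."""
    d = 0
    while c in PARENT:
        c = PARENT[c]
        d += 1
    return d

def lcs(a, b):
    """Least common subsumer: deepest shared ancestor of a and b."""
    ancestors = {a}
    while a in PARENT:
        a = PARENT[a]
        ancestors.add(a)
    while b not in ancestors:
        b = PARENT[b]
    return b

def path_length(a, b):
    return depth(a) + depth(b) - 2 * depth(lcs(a, b))

def wu_palmer(a, b):
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

On this fragment, lcs("Brand", "GHz") is "Processor", matching the example given for Figure 4.1.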
[Figure: the laptop hierarchy from the root Laptop through Processor (Brand, Family, Model, Speed), Memory, HardDisk, ProductID, Cache, FSB and BusWidth, down to leaf concepts such as DecimalReal, MHz and GHz.]

Figure 4.1: Sample of the laptop "attribute" hierarchy.
Altintas et al. proposed a measure that takes into account both the abstractness of the LCS concept and the concreteness of the concepts being compared. Altintas et al. reported the best results (in terms of the harmonic mean between precision and recall) compared with previously presented metrics in the noun word sense disambiguation task of Senseval-2. The measure combines a normalized path-length factor,

LenFactor(a, b) = pathLength(a, b) / 2D

with a specificity factor based on the function

Spec(x) = depth(x) / clusterDepth(x)

where clusterDepth(x) returns the depth of the cluster (sub-hierarchy) that contains the concept x. Both factors are combined into a semantic distance:

sd(a, b) = 1/2 ( LenFactor(a, b) + SpecFactor(a, b) )

Further, the semantic distance is converted into a semantic similarity using Equation 4.3. The similarity is squared in order to assign higher values to semantically closer terms:

SIM_Altintas(a, b) = ( 1 / (1 + sd(a, b)) )^2    (4.3)

1 http://www.senseval.org/
Sussna [86] proposed a semantic relatedness measure using all types of relationships in the WordNet [63] semantic network (i.e. synonymy, hypernymy, hyponymy, holonymy, meronymy and antonymy). The Sussna measure between two concepts consists of the shortest path using edges weighted according to a type-specific fanout factor and to their depth in the hierarchy. The weight of an edge between two adjacent nodes in the semantic network is related to the number of like relations that the source node has. For instance, the weight of the edge in the meronymy relation between Processor and Laptop is inverse to the number of meronymy relations between Laptop and other concepts. Similarly, the weight for the inverse relation (i.e. the holonymy between Laptop and Processor) is obtained in the same way. Both values are averaged and scaled, obtaining the undirected edge weight as shown in the following expressions:
weight(a, b) = ( w(a ->r b) + w(b ->r' a) ) / 2d

w(x ->r y) = max_r - ( max_r - min_r ) / n_r(x)

Where r' is the inverse relation of r; max_r and min_r delimit the range of possible weights for a relation of type r; n_r(x) is the number of relations of type r having x as source node; and d is the depth of the deeper of the two nodes. Sussna assigned max_r = 2 and min_r = 1 for all relation types except antonymy, whose range was (2.5, 2.5), i.e. a single value with no range. Agirre and Rigau [2] proposed a conceptual distance using not only the path length between concepts and their depth in the hierarchy, but also adding the density of terms in the hierarchy as evidence. They argue that concepts in a dense part of the hierarchy are relatively closer than those in a sparser region. Maynard and Ananiadou [57] proposed another similarity measure and applied it to the biomedical UMLS ontology.
[Figure: the hierarchy of Figure 4.1 with weighted edges, w=4 near the root (e.g. Laptop-Processor) decreasing to w=1 near the leaves (e.g. MHz, GHz).]

WeightedPathLength(Brand, GHz) = 3+4+4+3+2+1 = 17
Figure 4.2: Weighted path-length semantic relatedness measure example.
4.1.1.2

Leacock and Chodorow modified the path length metric (see Figure 4.1) following the idea that two concepts closer to the root of the ontology hierarchy are semantically more distant than concepts that are farther from it. Using that idea, we propose a weighted path length measure that assigns weights to edges according to their proximity to the root of the ontology tree. Figure 4.2 shows an example of that metric.
WSD methods usually obtain the sense inventory from a machine-readable dictionary [53] or a semantic network such as WordNet. The following example has each open-class word subscripted with the number of senses found in the on-line version of The American Heritage Dictionary of the English Language:

Jackson found(4) good(42) medicine(9) in his summer(7) field(37) trips(23) with their unfolding(6) of new(15) horizons(8)

The number of possible sense combinations in that sentence is 6,485,028,480. SR metrics are used to assess the semantic coherence of a specific sense combination. The WSD goal is to find a sense combination that maintains logical
discourse coherence. Due to the large search space of this problem, techniques such as simulated annealing [26] and genetic algorithms [39] have been used to find near-optimal combinations.
4.1.2.1
Gloss-overlap approaches assume that the overlaps (i.e. words in common between the dictionary definitions) of a targeted word and its surrounding words (i.e. context) are some kind of SR measure. Lesk [53] first proposed this approach using a machine-readable dictionary (MRD). The sense combination whose glosses reach the highest overlap (i.e. words in common) is assumed to be the correct one. Another approach, known as simplified Lesk, proposes to disambiguate a targeted word by searching for the highest gloss overlap between the glosses of the targeted word and the context words. The simplified Lesk algorithm does not use the previously disambiguated word to disambiguate the next. This simplification reduces the search space considerably and, surprisingly, gets better accuracy than the original approach [47]. Extended Gloss Overlaps [8] uses not only the overlaps with the glosses of neighboring words, but extends the comparisons to the glosses of related words. Using this additional information, the disambiguation process is improved. Cowie et al. proposed a method that can be seen as an application of Lesk's algorithm simultaneously to all ambiguous words in a text. The coherence evaluation of a candidate sense combination is made using the set of sense glosses: a redundancy factor R is computed in the set of glosses by giving a stemmed word that appears n times a score of n - 1 and adding up the scores. Finally, the sense combination with the highest R factor in its glosses is selected, using simulated annealing as search strategy.
A global coherence method at document level was proposed in [39]. The method aimed to find an optimal sense combination at document level that minimizes the average distance between senses. Due to the great search space, a genetic algorithm was used to find a near-optimal solution, with a fitness function based on a similar coherence criterion. Experiments showed that the global coherence hypothesis gave better results than methods that use only local context, such as Lesk's algorithm.
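A minimal Python sketch of the simplified Lesk idea described above; the sense identifiers, glosses and context tokens are invented for illustration:

```python
def simplified_lesk(senses, context_tokens):
    """Pick the sense whose gloss shares the most words with the context.
    senses: dict mapping a sense id to its gloss string."""
    context = {t.lower() for t in context_tokens}
    def overlap(gloss):
        return len(set(gloss.lower().split()) & context)
    return max(senses, key=lambda s: overlap(senses[s]))

# Invented two-sense inventory for the word "bank".
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}
chosen = simplified_lesk(senses, ["money", "loan", "interest"])
```

Here the financial sense is chosen because its gloss shares "money" with the context, while the river gloss shares nothing.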
4.1.2.2

Banerjee and Pedersen [7, 8] noticed that Lesk's algorithm could be expressed as an algorithm based on a semantic relatedness measure, and named it the Extended Gloss Overlap measure. The target word is disambiguated within a context window of 2n + 1 words centered on it. Let s_i,j be the j-th sense of the word w_i at the i-th position relative to the target word, which is at position i = 0. The score of each candidate sense s_0,k is computed as follows:

SenseScore_k = sum_{i=-n..n, i!=0} sum_{j=1..|w_i|} relatedness(s_0,k, s_i,j)

Where |w_i| denotes the number of senses of the word w_i.
The algorithm chooses the sense with the highest SenseScore_k for the target word. Sussna [86] proposed a strategy called mutual constraint, which consists of the addition of all pairwise semantic relatedness measures in a sense combination of a text. The candidate sense combination that maximizes the mutual constraint gives the most coherent sense combination. This approach is only practical at sentence level or within a context window, due to the great number of possible sense combinations. This approach also agrees with the one-sense-per-discourse hypothesis, because different senses in two instances of a single word in one text convey a sub-optimal mutual constraint. Mavroeidis et al. proposed a global criterion in which the candidate sense combinations are treated as a set and the most coherent combination is selected. This combination is obtained by computing the Steiner tree [43] in the semantic network formed with all the possible word senses in the discourse. The Steiner tree is restricted to include paths through all the least common subsumers (LCS) of the set of senses.
4.1.2.3
Mihalcea et al. proposed to build a graph with nodes representing word senses in a text and edges that are semantic relations extracted from WordNet. Then PageRank is used to assess the popularity of each node (i.e. treating nodes as web pages and edges as hyperlinks). The idea of PageRank can be extended to undirected graphs, in which the outdegree of a node is the same as its indegree. Let G = (V, E) be an undirected graph with the set of nodes V and the set of edges E. The PageRank value for a node Vi is computed as:

PageRank(Vi) = (1 - d) + d * sum_{Vj in neighbors(Vi)} PageRank(Vj) / |neighbors(Vj)|

Where the function neighbors(Vi) returns the set of nodes connected to the node Vi, and d is a damping factor set to 0.85 (see [14] for details). Scores for each node are initialized to 1 and PageRank is calculated iteratively until convergence. Finally, for each word, the sense with the highest PageRank is selected. This method is in agreement with the one-sense-per-discourse hypothesis [38], because nodes representing the same sense obtain the same PageRank score. The authors reported better results for their method (47.27% average accuracy) compared with Lesk's algorithm (43.19% average accuracy).
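The iteration above can be sketched in a few lines of Python; the example graphs and the fixed number of iterations are illustrative choices, not part of the original method:

```python
def pagerank_undirected(edges, d=0.85, iterations=50):
    """Iterative PageRank over an undirected graph given as (u, v) pairs,
    following PR(Vi) = (1 - d) + d * sum_j PR(Vj) / |neighbors(Vj)|."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    pr = {n: 1.0 for n in neighbors}  # scores initialized to 1
    for _ in range(iterations):
        pr = {n: (1 - d) + d * sum(pr[m] / len(neighbors[m])
                                   for m in neighbors[n])
              for n in pr}
    return pr
```

On a symmetric graph (e.g. a triangle) every node keeps the score 1, while on a star the hub accumulates a higher score than the leaves.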
Figure 4.3: Sample graph of candidate labels (shaded nodes) for a sequence of four words (unshaded nodes). Label dependencies are indicated as edge weights. Scores computed by the graph-based algorithm are shown in brackets, next to each label (taken from Sinha and Mihalcea [60, 84]).
Sinha and Mihalcea [60, 84] proposed a method for WSD that constructs a graph aligned to the document, whose nodes are the possible senses of each word and whose edges are semantic similarity measures. Edges are only allowed within a sliding window over the word sequence (of size 6 in the experiments) and above a similarity threshold. Further, a graph centrality measure (such as closeness, betweenness or PageRank) is computed for each node to determine a weight. The nodes (i.e. senses) with higher weights are selected for disambiguation. Figure 4.3 shows a sample graph.
4.1.2.4
The WSD algorithms reviewed in the previous section consider the disambiguation procedure at word level. This means that each word in the text has a set of possible senses aligned with the position of the word in the document. The sense inventory is provided by a machine-readable dictionary or a lexicalized ontology such as WordNet. Figure 4.4 depicts this scenario; this picture is clearly similar to the graph in Figure 4.3. As was mentioned in the introduction of this chapter, our intention is to propose a term disambiguation strategy inspired by the WSD algorithms. Nevertheless, terms, unlike words, are composed of one or more tokens (words). The case of terms with more than one token is very common in the laptop domain lexicon, as was shown in Figure 2.9. Additionally, the sense inventory
[Figure 4.4: columns of candidate senses aligned above each word w1, ..., w5 in the traditional WSD scenario.]
is generated by the three sources of ambiguity mentioned in the introduction of this chapter:

- Polysemy: terms associated with many concepts.
- Morphological similarity: terms whose character-string representations are similar.
- Use ambiguity: concepts that can be reached through several semantic paths in the ontology.

The possibility of having terms with many tokens and the sources of ambiguity generate a disambiguation scenario with a shape different from the traditional WSD scenario shown in Figure 4.4. The candidate senses may include many tokens, and a token can have senses with different numbers of tokens. Figure 4.5 shows a graphical representation of the term disambiguation scenario. Notice that some terms include more than one token. The sense combination in the term disambiguation scenario has to guarantee that each token is covered by only one sense. For instance, in Figure 4.5 the sense that ranges from
t3 to t5 excludes any other candidate sense covering t5. Given the similarities between WSD and term disambiguation, most of the WSD algorithms can be adapted to term disambiguation. Another problem that arises when the WSD algorithms are adapted to term disambiguation is that the one-sense-per-discourse hypothesis, with which some approaches agree, cannot be accepted. The reason is that, depending on the design of the ontology, different occurrences of the same term in a document may have different meanings. For instance, it is not acceptable that all occurrences of the term MB in a document be related to only one meaning, such as the units of the cache memory. Consequently, the WSD algorithms that are in agreement with that hypothesis do not seem to be suitable for the proposed term disambiguation task.
[Figure 4.5: term senses, some spanning several tokens, aligned above the token sequence t1, ..., t5.]
For our specific disambiguation task, a new graph-based criterion, similar to those presented in Section 4.1.2.3, was proposed. The main idea is to build a graph aligned to the document token sequence where the nodes are the term senses, undirected edges are added between adjacent senses, and the edges are weighted with a semantic distance (i.e. the inverse of the semantic relatedness) between the senses. An outline of this graph is shown in Figure 4.6. We argue that finding the shortest path in the built graph is a valid disambiguation strategy. This idea is motivated by the fact that consecutive terms in a text document tend to be semantically close. Therefore, in a coherent discourse the chain of terms is connected with small semantic distances. The combination that minimizes the sum of those semantic distances is the shortest path in the proposed graph. The proposed shortest-path strategy has the following advantages over the previously presented disambiguation strategies:

- There is no assumption of context windows with a specific size, as in the approaches similar to Lesk's algorithm. The proposed approach aims to optimize a criterion for the complete document.
- The senses are not treated as bags of senses, as many global coherence criteria do.
- There are no zones in the document semantically independent of the others. Consider the fact that if the weight of only one edge changed anywhere in the document, the effect could be a completely new path, giving a different interpretation to the document.

The shortest path in the proposed graph can be obtained using Dijkstra's algorithm [30] or the Viterbi algorithm [90], due to the Markovian property of the graph. For our implementation, we chose Dijkstra's algorithm.
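As an illustration, a compact Dijkstra implementation over a layered graph of candidate senses; the graph below, with virtual START and END nodes and made-up semantic distances, is purely hypothetical:

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path in a weighted graph {node: [(neighbor, weight), ...]};
    returns (cost, path)."""
    heap = [(0.0, source, [source])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical layered sense graph: two candidate senses per token,
# edge weights standing in for semantic distances.
graph = {
    "START": [("s1a", 0.0), ("s1b", 0.0)],
    "s1a": [("s2a", 1.0), ("s2b", 5.0)],
    "s1b": [("s2a", 4.0), ("s2b", 2.0)],
    "s2a": [("END", 0.0)],
    "s2b": [("END", 0.0)],
}
cost, path = dijkstra(graph, "START", "END")
```

The returned path picks one sense per token layer, which is exactly the disambiguation decision described above.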
Figure 4.6: Outline of the graph of candidate senses over the tokens t1, ..., t5 used for shortest-path term disambiguation.

4.2
This section assembles the building blocks developed in the previous chapters and sections into an information extraction system. A block diagram with the overall system architecture is shown in Figure 4.7. The architecture is outlined by describing each one of the numbered arrows in that figure.

1. The input to the system is a sequence of tokens obtained from the text of a laptop data sheet (see Section 2.4). The character sequences were tokenized with a rule-based procedure described in Section 5.2.1.

2. The list of terms of the laptop lexicon (see Section 2.3) and the document sequences are provided as input to the Fuzzy String Matcher. This module uses the techniques studied in Chapter 3 to search for occurrences of the lexicon terms in the documents in a resilient way.

3. The set of occurrences found by the Fuzzy String Matcher is the main input for the term disambiguator. Each match stores the starting and ending positions in the document and its associated concept in the ontology (usually a terminal concept). It is important to note that the tokens in the sequence can have several valid matches.

4. The ontology is a directed acyclic graph (see Section 2.2), which means that concepts can be referenced several times. The possible uses of a specific concept can be obtained by enumerating the possible paths from the root concept to that concept. This enumeration constitutes the ontology sense inventory of semantic paths.

5. The implemented semantic relatedness measures were computed using only the ontology graph.
[Figure 4.7: block diagram of the system. The Ontology feeds the Sense Inventory (4) and the Semantic Relatedness Measure (5); the Document (1) and the lexicon (2) feed the Fuzzy String Matcher, whose matches (3), together with the sense inventory (6) and the relatedness measure (7), drive the Term Disambiguation module, which produces the Labeled Document (8).]
6. For each matched term in the document, a set of senses or uses is provided by the sense inventory. The term disambiguator replaces each matched term label with a set of possible semantic interpretations (i.e. a set of semantic paths).

7. A mechanism to compare two semantic interpretations (i.e. semantic paths) is provided to the term disambiguation module to assess the coherence of a specific sense combination over the document.

8. Finally, the term disambiguation module combines the inputs 3, 6 and 7 to decide a unique semantic interpretation for each token in the original document token sequence.
4.2.1 Fuzzy String Matcher

The Fuzzy String Matcher (FSM) searches for the terms of the laptop lexicon in the laptop data-sheet documents. The terms in the laptop lexicon can have one or more tokens, so each match label over the document stores three fields: the lexicon entry name and the positions of the starting and ending tokens. The diagram in Figure 4.8 outlines the match-labeling schema. There are some additional issues related to the matching that cannot be seen in that figure:
[Figure 4.8: matches Match#2, ..., Match#6 over the document token sequence, each linked to a lexicon entry (Lex.entry#23, #67, #85, #185 and #249).]
- If two terms in the same lexicon entry match the same tokens, only one match is reported.
- If two consecutive matches refer to the same lexicon entry, both matches are not replaced by a single match.
The naive approach to match the entire lexicon against the document is to check all token positions against all lexicon terms. However, this approach is prohibitive, since the number of terms in the lexicon is greater than 1500 and the evaluation corpus has more than 5000 tokens. Approaches to solve this problem efficiently, such as suffix trees and finite automata, are out of the scope of this thesis. The FSM may use any of the approximate string comparison techniques presented in Chapter 3. Nevertheless, we implemented a simple blocking strategy with a term index keyed by the first character bigram. This blocking strategy avoids comparing the complete lexicon against all the document terms, while reasonably maintaining the matching performance. Clearly, all the tokens of the terms (not only the first token) are included in the bigram index built to implement the blocking. Additionally, all terms that seem to be acronyms (i.e. words only with capital letters and with less than 7 characters) are indexed only by their first letter. For instance, the term USB was included in all the bigram index entries whose first character is U. This blocking approach is very conservative, but in preliminary experiments it showed to be very effective. All the lexicon entries are the patterns to be compared with the document. Those patterns have considerably fewer tokens than the number of tokens in the document. Consequently, it is assumed that in most cases (except at the end of the document) the document provides enough tokens to be compared with the patterns. Matches are not allowed to cross boundary separators; those boundary separators are listed in Table 4.1.
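A rough Python sketch of this blocking index; the function names and the tiny lexicon are illustrative, not the thesis implementation (which also indexes acronyms under every bigram starting with their first letter):

```python
from collections import defaultdict

def build_blocking_index(terms):
    """Index each lexicon term under the first character bigram of each of
    its tokens; apparent acronyms (all caps, fewer than 7 chars) are
    indexed under their first letter only."""
    index = defaultdict(set)
    for term in terms:
        for token in term.split():
            if token.isupper() and len(token) < 7:
                index[token[0]].add(term)
            else:
                index[token[:2].lower()].add(term)
    return index

def candidates(index, token):
    """Lexicon terms worth comparing against a document token."""
    return index.get(token[:2].lower(), set()) | index.get(token[:1].upper(), set())
```

A document token then triggers comparisons only against the small candidate set sharing its leading bigram (or leading letter, for acronyms), instead of against the whole lexicon.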
4.2.2
Use/Sense Inventory
The use/sense inventory is a simple repository that stores all the possible semantic paths for each terminal node in the ontology (see the semantic path definition in Section 2.2). This inventory is implemented with a dictionary whose key is a terminal concept name and whose data is a set of semantic paths. The inventory is preprocessed to improve the efficiency of the term disambiguation process.

Table 4.1: List of boundary separators

Character     Name
.             period
,             comma
;             semicolon
:             colon
(             open parenthesis
)             close parenthesis
[             open bracket
]             close bracket
?             question mark
¿             open question mark
!             exclamation mark
¡             open exclamation mark
<CR>,<LF>     carriage return, line feed
4.2.3
Five semantic relatedness measures among those reviewed in Section 4.1.1 were implemented:

1. Path Length [78] (baseline).
2. Weighted Path Length (see Section 4.1.1.2).
3. Wu & Palmer measure [94].
4. Maynard & Ananiadou measure [57].
5. Stetina et al. measure [85].
The selected measures were standardized as semantic distances instead of semantic similarities. Measures 1 and 2 are already distances. Because the values of measures 3 and 4 are normalized in the range [0, 1], they were converted to distances with the simple expression distance = 1 - normalizedSimilarity. The Stetina et al. measure was converted using Equation 4.3, as proposed by its authors. The semantic relatedness measures that use large corpora were not used, because in the laptop domain it is difficult to gather a really large document corpus of data sheets in order to reach enough statistical evidence for each target. Additionally, the string similarity measures studied in Chapter 3 were those that do not use additional information from the corpus; therefore, this selection is consistent with the approach of that chapter.
[Figure: candidate semantic paths stacked above the tokens 512, MB and Cache, e.g. paths through Laptop, Processor, Memory, Cache, VideoAdapter, HardDisk, Installed and MaxInstallable, ending in leaf concepts such as MemorySize, MemoryMagnitude, Integer, MemoryUnits, Megabyte and REF_Cache.]
Figure 4.9: Token sequence labeled with possible senses (i.e. semantic paths)
4.2.4

Five disambiguation strategies were implemented:

1. Shortest path (see Section 4.1.2.4).
2. Adapted Lesk algorithm [72] with a window of 3 tokens.
3. Shortest path, but evaluated only in a sliding window of 3 tokens.
4. First-sense selection (baseline).
5. Random-sense selection (baseline).
The input from the fuzzy string matcher is first combined with the use/sense inventory. As a result, the document token sequence is labeled with the possible semantic paths for each identified term. An example of a token sequence sample at this point is shown in Figure 4.9. Further, each semantic path is considered as a node in Figure 4.10, which is a kind of view from above of Figure 4.9. It is important to note that this sample (512MB Cache) does not include multi-token terms; thus, in a more common scenario the shape of the graph is closer to that of Figure 4.5. The edges added to the graph in Figure 4.10 are weighted with semantic distances obtained from one of the implemented semantic relatedness measures. Finally, one of the implemented disambiguation strategies selects a unique semantic path for each token that had candidate senses. The final output is the same input sequence of tokens, with the selected tokens labeled with a semantic path.
[Figure: semantic paths #1, ..., #n as nodes aligned above the tokens 512, MB and Cache, connected by edges weighted with semantic distances w.]
Figure 4.10: Graph of semantic paths aligned to the document token sequence
4.3
Experimental Validation
The experimental validation aimed to answer the following questions:

- Which static fuzzy string matching techniques are more suitable for the proposed information extraction task?
- Do fuzzy string matching techniques provide robustness or noise to the proposed information extraction task?
- Which of the selected and proposed semantic relatedness measures makes the term disambiguation process better?
- Which of the selected and proposed disambiguation strategies are more convenient for the specific information extraction task?
The experiments were carried out using the manually labeled corpus described in Section 2.4, combining each one of the following variables:

1. The approximate string matching technique.
2. The semantic relatedness measure.
3. The disambiguation algorithm.
4. The lexicon. The options were the original lexicon and 8 synthetically generated noisy lexicons.

The synthetically generated noisy lexicons were obtained by making random edit operations in the original laptop lexicon. A noise level P was established in order to control the amount of noise inserted in the original lexicon. Four levels of noise were used (P = 25, 50, 75 and 100).
Algorithm 2 noisyCharacterEdition(LEXICON, P)
for all TERM in LEXICON do
    for all TOKEN in TERM do
        if length(TOKEN) > 2 and randomIntegerInRange(0, 100) < P then
            position = randomIntegerInRange(2, length(TOKEN))
            operation = randomIntegerInRange(1, 3)
            if operation = 1 then TOKEN = deleteCharacterAtPosition(position, TOKEN)
            if operation = 2 then TOKEN = insertCharacterAtPosition(position, TOKEN)
            if operation = 3 then TOKEN = replaceCharacterAtPosition(position, TOKEN)
return(LEXICON)
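A Python transcription of Algorithm 2 for experimentation; the insertion/replacement alphabet, the 0-based positions and the fixed seed are implementation choices of this sketch:

```python
import random

def noisy_character_edition(lexicon, p, seed=0):
    """With probability p% per token (only tokens longer than 2 characters),
    apply one random edit (delete, insert or replace a character), never
    touching the first character, as in Algorithm 2."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    noisy = []
    for term in lexicon:
        tokens = []
        for token in term.split():
            if len(token) > 2 and rng.randrange(100) < p:
                pos = rng.randrange(1, len(token))
                op = rng.randrange(3)
                if op == 0:      # deletion
                    token = token[:pos] + token[pos + 1:]
                elif op == 1:    # insertion
                    token = token[:pos] + rng.choice(alphabet) + token[pos:]
                else:            # replacement
                    token = token[:pos] + rng.choice(alphabet) + token[pos + 1:]
            tokens.append(token)
        noisy.append(" ".join(tokens))
    return noisy
```

With p = 0 the lexicon is returned unchanged; with p = 100 every eligible token receives exactly one edit, changing its length by at most one character.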
Table 4.2: Description of the noisy lexicons generated with character edition at four noise levels.

P      # edited tokens   edition rate
25     510               18.40%
50     1034              37.31%
75     1553              56.04%
100    2109              76.10%

P = 0 means the original noise-free lexicon, and 100 or more the maximum noise level. The procedure used to generate these noisy lexicons is presented in Algorithm 2. The four lexicons generated with random edit operations at character level are listed in Table 4.2, with the number of edited tokens and the percentage of edited tokens out of the 2771 tokens in the lexicon. The other set of noisy lexicons was generated with the same procedure, but additionally shuffling the tokens in the lexicon terms. This procedure is coarsely described in Algorithm 3. In the same way, the four additional generated lexicons are described in Table 4.3 (the total number of terms is 1700).
To make an analysis of those errors, it is necessary to have a very large corpus when the number of possible labels is large too; that is our case.
Algorithm 3
function shuffledTokenEdition(LEXICON, P)
    LEXICON = noisyCharacterEdition(LEXICON, P)
    for all TERM in LEXICON do
        if numberOfTokensIn(TERM) > 1 and randomIntegerInRange(0, 100) < P then
            TERM1 = shuffleTokensIn(TERM)
            while TERM1 = TERM do
                TERM1 = shuffleTokensIn(TERM)
            TERM = TERM1
    return (LEXICON)
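The token-shuffling step of Algorithm 3 can be sketched as follows. Note that this sketch covers only the shuffling step (in the thesis, the character-level noise of Algorithm 2 is applied to the lexicon first), and the retry loop assumes each multi-token term has at least two distinct tokens.

```python
import random

def shuffled_token_edition(lexicon, p, rng=random.Random(7)):
    """Token-shuffling step of Algorithm 3: with probability p%, re-order the
    tokens of each multi-token term, retrying until the order actually
    changes.  (In the thesis, the character-level noise of Algorithm 2 is
    applied to the lexicon first.)"""
    noisy = []
    for term in lexicon:
        term = term[:]
        if len(term) > 1 and rng.randrange(100) < p:
            shuffled = term[:]
            rng.shuffle(shuffled)
            while shuffled == term:        # assumes not all tokens are equal
                rng.shuffle(shuffled)
            term = shuffled
        noisy.append(term)
    return noisy
```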
Table 4.3: Description of the noisy generated lexicons with token shuffling (P = 25, 50, 75, 100).
Another option is to consider the labeling as a two-class problem. That is, a particular target token has been labeled or not, and the selection was correct or not. Figure 4.11 shows that approach. The target labeling is the manually labeled laptop evaluation corpus (Section 2.4) and the selected labeling is the output of the information extraction system. Each token in the selected labeling sequence falls in one of the following four categories:

True positives (TP): the token has been labeled and the selected label is the same as the target label. This is the result of an accurate string matching and/or an accurate term disambiguation.

False positives (FP): the token has been labeled and the selected label is wrong, either (A) because the target token had no label, or (B) because the target token had a different label. Case (A) is the result of a poor string matching, and case (B) is the result of a poor string matching and/or an inaccurate term disambiguation.

False negatives (FN): the token has not been labeled but the target token had a label. This category is the result of a poor string matching.

True negatives (TN): neither the target nor the selected labeling assigns a label to the token.
Figure 4.11: Target (A) and selected (B) labelings and the resulting true positives, false positives, true negatives and false negatives.
The graphic in Figure 4.12 shows the tendency of TP, FP and FN when the string matching threshold is varied in the range [0, 1]. A threshold value of 0 means that all string comparisons among the strings in the set defined by the blocking strategy (see Section 4.2.1) are considered valid matches. A threshold value of 1 means that only identical strings are considered valid matches. It is important to note that the count of FN is low because the lexicon used is noise free. However, if a noisy lexicon were used, the FN plot would have a clearly increasing tendency. The counts of TP, FP and FN are used to define precision, recall and F-measure:
precision = TP / (TP + FP)

recall = TP / (TP + FN)

F-measure = (2 × precision × recall) / (precision + recall)
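A minimal sketch of how these metrics can be computed from a target and a selected label sequence (token positions without a label are represented by None; all names are illustrative):

```python
def count_outcomes(target, selected, no_label=None):
    """Count TP, FP and FN by aligning the target and selected label
    sequences token by token."""
    tp = fp = fn = 0
    for t, s in zip(target, selected):
        if s is not no_label and s == t:
            tp += 1                  # correct label
        elif s is not no_label:
            fp += 1                  # spurious or wrong label
        elif t is not no_label:
            fn += 1                  # missed label
    return tp, fp, fn

def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure as defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```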
The precision measure can be interpreted as the rate of correctly labeled tokens in the set of selected labeled tokens. Similarly, the recall measure is the rate of correctly labeled tokens in the set of target labeled tokens. The F-measure is the harmonic mean between precision and recall and reflects the balance of the two measures when both have high values. The graphic in Figure 4.13 shows a plot of precision, recall and F-measure for the same experiment of Figure 4.12. The performance metric used for the string-matching problem in Chapter 3 was the interpolated average precision (IAP). This metric was obtained by interpolating the recall-precision curve for different thresholds; the area under that
Figure 4.12: True positives, false positives and false negatives ranging the string matching threshold in (0, 1]. This result was obtained using edit distance at character level and Minkowski distance at token level (p = 10).
curve is the IAP metric. The graphic in Figure 4.14 shows the recall-precision plot for the same experiment of Figure 4.12. The interpolated curve would look like a step with its elbow at the point where precision and recall have their highest combined value. This point is also the maximum value of the F-measure and is called the F1 score. The F1 score is a measure of the general performance of the labeling process, but focused on the point where the performance reaches its maximum. The string-matching threshold was varied in the range [0, 1] using
steps of 0.05. The IAP measure reflects the performance of the labeler over the complete range of threshold values. However, there is not much interest in the performance of the labeler for small values of the threshold. Thus, the F1 score seems to be a more convenient quantitative measure for our particular extraction system. In addition to the F1 score, the plot of precision/recall/F-measure vs. threshold is reported in some experiments as a qualitative measure. Those plots also give important information about the string matching thresholds where the F1 score was reached.
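The F1-score selection described above amounts to a sweep over thresholds. A compact sketch, where run_at_threshold is a stand-in for one full labeling run returning (TP, FP, FN):

```python
def f1_score_over_thresholds(run_at_threshold, step=0.05):
    """Sweep the string-matching threshold over [0, 1] and return the maximum
    F-measure (the F1 score) together with the threshold where it is reached.
    `run_at_threshold` performs one labeling run and returns (TP, FP, FN)."""
    best_f, best_t = 0.0, 0.0
    n_steps = round(1 / step)
    for k in range(n_steps + 1):
        th = round(k * step, 2)
        tp, fp, fn = run_at_threshold(th)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f:
            best_f, best_t = f, th
    return best_f, best_t
```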
Figure 4.13: Precision, recall and F-measure vs. string matching threshold.

Figure 4.14: Recall-precision plot for the same experiment of Figure 4.12.
Figure 4.15: Number of valid matches for different string comparison thresholds using EditDistance-MongeElkan as matching method.
proposed in the previous subsection. However, an evaluation data set only for the fuzzy string matcher is not available. Thus, in order to evaluate only the fuzzy string matcher, the semantic relatedness measure and the disambiguation strategy were pre-established for all the experiments, and experiments varying the string matching technique were carried out. It is expected that changes in the overall performance originate only from the accuracy of the string matching technique. The semantic relatedness measure used was path length (see subsection 4.1.1) and the disambiguation strategy was shortest path (see subsection 4.1.2.4). It is important to note that the lower the string matching threshold, the higher the number of valid matches and, consequently, the higher the ambiguity that the term disambiguation module has to deal with. The graphic in Figure 4.15 shows the number of valid matches vs. the string matching threshold using EditDistance-MongeElkan as the string-matching technique.
The experiments were carried out varying five string matching techniques at character level (i.e. ExactMatch, 2gram(Dice), Jaro, Jaro-Winkler and SimpleStr.) combined with different token-level measures (e.g. MongeElkan and cosine(A)).
The results obtained with the noise-free laptop lexicon for the F1 score are listed in Table 4.4, and the string matching thresholds where those F1 scores were reached are listed in Table 4.5. The best result and the best average result are shown in bold characters. The F1 score of 0.816071 obtained with the matching technique ExactMatch-
(The naming convention for the string matching techniques is the same used in Chapter 3.)
Table 4.4: F1 score for different string matching techniques using the noise-free laptop domain lexicon.

2gram(Dice):  0.831804, 0.829362, 0.830689, 0.831666, 0.831816 (average 0.831067)
Jaro:         0.836105
ExactMatch:   (0.816071), 0.812196, 0.813544, 0.813544, 0.815807 (average 0.814233)
JaroWinkler:  0.816479, 0.831784, 0.829053, 0.829053, 0.741935 (average 0.809661)
Table 4.5: Threshold where the F1 score was reached for different string matching techniques using the noise-free laptop domain lexicon.

2gram(Dice):  0.85, 0.75, 0.85, 0.85, 0.90
Jaro:         0.95, 0.95, 0.95, 0.95, 1.00
ExactMatch:   0.85, 0.95
JaroWinkler:  1.00, 0.95, 0.95, 0.95, 1.00
SimpleStr. can be considered an upper bound for exact matching because, as it was mentioned in Section 2.3, the laptop lexicon was made based partially on the documents of the laptop evaluation corpus. However, several
string-matching techniques slightly outperformed this upper bound, revealing that even in the five documents of the evaluation corpus there were morphological variations that had not been considered in the lexicon. The fact that the approximate string matching techniques outperformed the upper bound is weak proof of the convenience of using those techniques instead of simple exact matching. However, in order to provide stronger proof, experiments with the noisy synthetic lexicons obtained with Algorithm 2 and four string matching techniques were carried out. The results of the F1 score for different levels of noise in the lexicon are shown in Figure 4.16. As expected, the results show that the overall system performance is better maintained when an approximate string comparison technique is used at character level. The lexicons generated with additional noise at token level (see Algorithm 3) were used in a similar way for evaluating the resilience of the same matching methods. The results are shown in Figure 4.17. This graphic clearly shows that a matching technique with a fuzzy strategy at both character and token level, such as
EditDistance-MongeElkan, better preserves the performance. Table 4.6 shows the tabulated results of the curves of Figure 4.17, with two additional columns for the matching methods 2grams(Dice)-MongeElkan and EditDistance-cosine(A).
Figure 4.16: F1 score for different matching methods using five levels of character-edition noise in the lexicon.
Figure 4.17: F1 score for different matching methods using five levels of noise with additional token shuffling in the lexicon.
Table 4.6: F1 score results for different string matching techniques at different noise levels in the lexicon.

P     ExactMatch-  ExactMatch-  EditDist.-  EditDist.-  2grams(Dice)-  EditDist.-
      SimpleStr    MongeElkan   SimpleStr   MongeElkan  MongeElkan     cosine(A)
0     0.816071     0.813544     0.833443    0.835548    0.831666       0.832407
25    0.636644     0.705415     0.741921    0.784871    0.784615       0.778088
50    0.440036     0.537177     0.662466    0.753542    0.754949       0.746999
75    0.255391     0.418154     0.587655    0.716136    0.710129       0.704862
100   0.081679     0.207174     0.529873    0.699073    0.682116       0.683386

(Jaro-SimpleStr reached 0.836105 at P = 0.)
Table 4.7: String matching thresholds for the F1 score results reported in Table 4.6.

P     EditDist.-  EditDist.-  2grams(Dice)-  EditDist.-
      SimpleStr   MongeElkan  MongeElkan     cosine(A)
0     0.90        0.90        0.85           0.85
25    0.85        0.85        0.65           0.65
50    0.80        0.80        0.65           0.65
75    0.75        0.80        0.60           0.50
100   0.75        0.80        0.50           0.50
There is another important issue related to the use of approximate string matching techniques in the extraction process, which is to determine the right value for the string-matching threshold. Figure 4.18 shows a matrix of precision/recall/F-measure vs. threshold plots for 5 levels of noise in the lexicon (generated with Algorithm 3). It can be noticed that the threshold where the F1 score is reached tends to move to the left in the plots as the level of noise in the lexicon increases. The threshold values where the F1 scores were reached are shown in Table 4.7. Surprisingly, we notice that the combination method at token level with the best fixation of the threshold was cosine(A), the measure proposed in this thesis in Section 3.3.6.
The experimental parameters used to evaluate the semantic relatedness measures were:

- EditDistance-MongeElkan as string-matching technique.
- Noise-free laptop lexicon.
The F-measure for values of the string similarity threshold in [0, 1] was plotted in Figure 4.19. The proposed baseline metric is the simple path length measure,
Figure 4.18: Precision/recall/F-measure vs. threshold plots for EditDistance-MongeElkan, 2grams(Dice)-MongeElkan and EditDistance-cosine(A) (columns) at lexicon noise levels P = 0, 25, 50, 75 and 100 (rows).
Figure 4.19: F-measure using shortest path and different semantic relatedness measures.
but no measure manages to consistently outperform this baseline. In fact, in the zone where the F-measure reaches the F1 score (the maximum), path length surpasses all the other measures. On the other hand, the Stetina et al. measure

Remember that the total number of targets in the laptop evaluation corpus is 1596. Thus, with a typical threshold fixed at 0.8, the term disambiguator has to choose 1596 senses among more than 90,000 candidates. The experimental parameters used to evaluate the implemented disambiguation strategies (Section 4.2.4) are:
- EditDistance-MongeElkan as string-matching technique.
- Noise-free laptop lexicon.
Figure 4.20: Number of nodes in the disambiguation graph for different string comparison thresholds.
4.4 Summary
The results of the experiments related to the string matching methods let us conclude that the evaluated approximate matching measures improved the overall performance of the system in comparison with the use of a simplistic exact match. Additionally, it is possible to conclude that when the string matching threshold is greater than 0.8, the additional noise that is added to the disambiguation process does not significantly reduce the overall performance. Particularly, the methods that combine a character-level measure with a token-level metric showed the best resilience against noise. Comparable results were obtained using:
EditDistance-MongeElkan, 2grams(Dice)-MongeElkan and EditDistance-cosine(A).
EditDistance-MongeElkan, in particular, kept stable the threshold where it reaches its best performance (0.9) in the information extraction task when exposed to different noise levels. Related to the semantic relatedness experiments, we can clearly conclude that the simple path length measure is hard to beat: the measures with more complex approaches at best reached the same performance that path length reached.
Among the disambiguation strategies evaluated, the proposed shortest path strategy clearly outperformed the other techniques and baselines.
Figure 4.21: Extraction performance (F-measure) comparison of Shortest-Path (window=3), Shortest-Path, Adapted Lesk (window=3) and the First Sense and Random Sense baselines.
Chapter 5
Matching with Acronyms and Abbreviations
Chapter 3 studied the problem of comparing text strings in a fuzzy way, dealing mainly with problems such as misspellings, typos, token order and morphological variations. Chapter 4 showed that it is possible to provide robustness to an information extraction system using fuzzy string searching. However, there is another important source of variation in strings: in written and spoken language it is very common to compress long names into shorter ones. This practice in human language is not much more than 200 years old, but nowadays it is omnipresent in modern language. Particularly, the use of acronyms and abbreviations in the data sheets of the IT domain is very frequent. However, the reviewed techniques for approximate string matching are not specially designed to deal with the problem of matching short forms with long forms, with the exception of the edit distance proposed by Gotoh [40]. The Gotoh distance introduces separate open gap and extend gap costs. Consequently, the editions needed to transform USB into Universal Serial Bus are three open-gap editions and the remaining extend-gap editions.
Consider, e.g., an acronym made with the last letters of the words in the same previous example, that is, lls. The Gotoh distance from USB or lls to Universal Serial Bus is the same. On the other hand, the results of the previous chapter showed that it is more convenient to treat text strings as sequences of tokens instead of as sequences of characters, but the token-oriented approaches do not seem to be suitable for acronym matching. For instance, matching USB with Universal Serial Bus at token level is only possible if the token USB is divided into three tokens. Separating consecutive capital letters into tokens is a common heuristic used in the acronym-matching field.
This chapter presents a review of approaches to acronym and abbreviation matching. Additionally, a general integrated method is proposed for applying some of the acronym heuristics together with the matching methods presented in Chapter 3. The aim of this integration method is to obtain a string matcher that preserves the properties of the string matching measures while supporting acronym matching. The proposed method is not going to be validated experimentally in a direct way, but as part of the complete extraction system proposed in this thesis. Thus, the overall extraction performance will be compared with and without the proposed integration method. This chapter is organized as follows. Section 5.1 presents a small survey of the known approaches to the acronym-matching task. A new method to integrate acronym-matching heuristics into the token-based string comparison methods is proposed in Section 5.2. Finally, the effect of the proposed method on the extraction system presented in Chapter 4 is discussed in Section 5.3.
5.1
The most common form of abbreviation is the acronym: a short form made with the initial letters or syllables of the words in the long form (e.g. USB stands for Universal Serial Bus). Acronyms are single-word contracted versions of multi-word terms that are used in place of their long versions. Nevertheless, new acronym construction styles keep appearing, and the approaches are so creative that the task can be compared with that of assigning names to things. There are notices of the early use of acronyms since the 19th century. Acronyms have been used commonly in modern (post-1940s) English in technical and legal documents, and their use has nowadays spread to many other domains and languages. Commonly, acronyms are not included in general domain dictionaries, and new acronyms are created every day.
Acronym matching consists of deciding whether a short- and long-form pair (i.e. an {acronym, definition} pair) is valid. Acronym detection (a.k.a. acronym identification or acronym recognition) consists of finding {acronym, definition} pairs in a corpus for building an acronym-definition list. This task is usually divided into two sub-tasks: the acronym-definition candidate pair identification and the acronym-definition matching evaluation. We are especially interested in the latter, that is, deciding whether a candidate acronym matches a candidate definition.
Some applications of the acronym-matching task are: OCR refinement, information retrieval (retrieving documents that use the definition of an acronym when the query only contains the acronym), knowledge extraction, thesaurus construction, ontology construction and enrichment, etc. The metrics used for evaluating acronym matching are precision, recall and F-measure. Unfortunately, the results reported by authors show the overall performance of their methods combining candidate detection and matching evaluation. Some recall and precision results are reported for the sake of comparison, but they do not necessarily reflect performance in the acronym-matching task alone.
recall = (# correct acronym definitions found) / (total # of acronym definitions in the document)

precision = (# correct acronym definitions found) / (total # of acronym definitions found)

F-measure = (2 × precision × recall) / (precision + recall)
The most common approaches to the acronym/abbreviation-matching problem have addressed the task using static pattern matching rules and specific heuristics. We consider those methods as unsupervised because they do not use any kind of labeled data or predefined lists of acronym-definition pairs to build their models. Previous work related to the acronym-definition matching evaluation sub-task (omitting the candidate identification task) is presented in the following subsections.
Taghva and Gilbreth proposed a method based on the longest common subsequence (LCS). They moved a sliding window over the document looking for occurrences of acronym-definition pairs, obtaining in their experiments a precision of 98% at a recall level of 86% using governmental documents. Their approach considered only acronyms from three to ten characters. Short acronyms such as US (United States) or AI (Artificial Intelligence) had too high error rates due to the high relative error of the LCS.
The model was refined using heuristics such as optionally ignoring stop words (e.g. and, of ) and separating hyphenated words (e.g. gas-cooled, non-high-level).

Another rule-based approach, used as part of a record linkage task, considered four types of abbreviation:

1. The abbreviation is a prefix of its expansion, e.g. Univ. abbreviates University, or
2. The abbreviation combines a prefix and a suffix of its expansion, e.g. Dept. matches Department, or
3. The abbreviation combines initial letters of the words of its expansion, e.g. UCSD abbreviates University of California San Diego, or
4. The abbreviation is a concatenation of prefixes from its expansion, e.g. Caltech matches California Institute of Technology.
The authors do not report results for the matching task because it was embedded in a general record linkage task.

Yeates [95] proposed the Three Letter Acronyms (TLA) algorithm, a greedy sequential matching algorithm between the acronym and the definition. The heuristics proposed by Yeates are the following:

1. Acronyms are shorter than their definitions.
2. Acronyms contain initials of most of the words in their definitions.
3. Acronyms are given in upper case.
4. Shorter acronyms tend to have longer words in their definition.
5. Longer acronyms tend to have more stop words.

The matching algorithm tokenizes the definition and tries to build the acronym using none or the initial letters of each token. Yeates claimed to obtain a more general method than Taghva and Gilbreth's method in spite of the lower performance of his method. Additionally, Yeates proposed as an improvement the use of machine learning techniques to weight the heuristics in order to adapt the method to different data sets.

Another rule-based acronym extractor, named Acrophile, was proposed by Larkey et al. [51]. The Acrophile method first searches for acronym candidates in the document using regular expressions. Next, a greedy acronym expansion algorithm is used in the neighborhood of the acronym candidates to find possible expansions, weighting each candidate.
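The core initial-letter idea behind TLA-style matchers (heuristic 2 above) can be sketched as a greedy subsequence test; this simplification is ours, not Yeates's actual algorithm:

```python
def acronym_from_initials(acronym, definition):
    """Greedy test in the spirit of TLA's second heuristic: can the acronym be
    built left to right taking, for each definition word, either nothing or
    its initial letter?  Case-insensitive."""
    initials = [word[0].lower() for word in definition.split()]
    i = 0
    for ch in acronym.lower():
        while i < len(initials) and initials[i] != ch:
            i += 1                   # skip words that do not contribute
        if i == len(initials):
            return False             # ran out of definition words
        i += 1
    return True
```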
Pustejovsky et al. [77] observed that the classical acronym style (short forms constructed with the initial letters of the definition words) is poorly used in domains such as biomedical literature. For instance, a long form such as carboxifluorescein diacetate does not follow those conventions and its short form fits better with an abbreviation definition. The proposed method, named AcroMed, tries to match all acronym characters as a prefix or infix of the words of the candidate definition. Considering only definitions whose score exceeds a threshold, this method,
combined with a simple regular expression strategy to find candidate pairs, outperforms the Acrophile algorithm using the same evaluation corpus. AcroMed reached a recall of 63% at a precision of 90%, and the best result obtained by Acrophile was a recall of 26% at the same precision.

Schwartz and Hearst [83] proposed a simple but fast and effective method. The matching algorithm processes each character of the candidate acronym and anchors it to characters from the candidate definition. Some characters of the acronym and the definition are restricted in how they may match in order to reach a valid pair. The algorithm does not use complex heuristics, which usually are developed for specific training sets and consequently make generalization to other domains and documents difficult. This method reached a recall of 82% at a precision of 96% using abstracts from MEDLINE, comparable to the results obtained by Chang et al. using a supervised approach and those of Pustejovsky et al. in the same biomedical texts.
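A simplified sketch of the Schwartz and Hearst matching step: the short form is scanned right to left and each character is anchored in the long form, requiring the first character to start a word. The published algorithm has additional conditions, so this is only an approximation:

```python
def find_best_long_form(short_form, long_form):
    """Scan the short form right-to-left and anchor each alphanumeric
    character in the long form; the first short-form character must start a
    word.  Returns the matched long-form suffix or None."""
    s = short_form.lower()
    l = long_form.lower()
    s_i = len(s) - 1
    l_i = len(l) - 1
    while s_i >= 0:
        c = s[s_i]
        if not c.isalnum():          # skip separators in the short form
            s_i -= 1
            continue
        # move left until c is found; the first character of the short form
        # must additionally sit at the start of a long-form word
        while l_i >= 0 and (l[l_i] != c or
                            (s_i == 0 and l_i > 0 and l[l_i - 1].isalnum())):
            l_i -= 1
        if l_i < 0:
            return None
        s_i -= 1
        l_i -= 1
    return long_form[l_i + 1:]
```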
The idea of using compression for acronym detection was proposed by Yeates et al. [96]. The method relies on the idea that, in valid definition-acronym pairs, the definition can be coded using the acronym model with fewer bits than with a general text compression model such as PPM [10]. The decision rule was given by establishing a threshold on the difference between the number of bits required for coding the definition with the acronym model and with the general compression model. The acronym model stores the words that have been used to build the acronym and the number of initial characters that were used. Yeates et al. reported 90% precision with a threshold of 0.2 on a collection of millions of words including 1080 manually labeled acronym definitions.

Adar [1] proposed an acronym matching method mixing approximate string matching with a set of rules for scoring the matches. The method first finds all possible common sub-sequences between the candidate abbreviation-definition pair. Further, the following score rules (related to acronym matching) are applied to each candidate pair:

- For every abbreviation character that is at the start of a definition word, add 1 to the score.
- The number of definition words should be less than or equal to the number of abbreviation characters; for every extra word, subtract 1.
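The two scoring rules quoted above can be sketched as follows; the alignment representation (abbreviation character index mapped to a (word index, character index) position) is our assumption, not Adar's data structure:

```python
def adar_score(abbreviation, definition_words, alignment):
    """Score a candidate pair with the two rules above.  `alignment` maps each
    abbreviation character index to the (word index, character index) it
    matched inside the definition."""
    score = 0
    for _word_idx, char_idx in alignment.values():
        if char_idx == 0:            # character starts a definition word: +1
            score += 1
    extra_words = len(definition_words) - len(abbreviation)
    if extra_words > 0:              # more words than abbreviation characters
        score -= extra_words
    return score
```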
1 MEDLINE.
http://www.ncbi.nlm.nih.gov/pubmed
Table 5.1: Rule examples of Park and Byrd's method [71].

Definition                   Formation Rule
Two-Micron All Sky Survey    (1,R) (2,F) (3,F) (4,F) (5,F)
chlorofluorocarbons          (1,F) (1,I) (1,I) (1,I) (1,I)
Hexadecimal to Binary        (1,R) (3,R) (4,F)
Near-End CrossTalk           (1,F) (2,F) (3,R) (3,I)
Supernova 1987A              (1,F) (2,F) (3,R) (4,F)
The method was evaluated on a collection of texts with 11,253,125 documents. The number of unique abbreviation-definition pairs was 136,082. Of these, a random sample of 644 was manually checked.
In Park and Byrd's method [71], abbreviation-definition formation rules are extracted and ordered by occurrence. Examples of extracted rules taken from [71] are shown in Table 5.1.
Abbreviation patterns are formed from abbreviations by replacing each alphabetic character with a 'c' and each sequence of numeric characters (including '.' and ',') with an 'n'. For instance, the patterns for the 2MASS and V.3.5 abbreviations are respectively 'ncccc' and 'cn'. Definition patterns are formed with the following code: w (normal word), s (stop word from a pre-defined list), p (prefix from a pre-defined list), h (headword) and n (number); e.g. a definition pattern could be 'phnw'. Formation rules are lists of pairs (word index, formation method), where the formation method is coded as follows:

F: first character of a word
I: interior character of a word
L: last character of a word
E: exact match (only for matching numbers)
R: special replacement match
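The abbreviation-pattern coding can be sketched as below; the exact treatment of '.' and ',' is inferred from the 2MASS and V.3.5 examples:

```python
import re

def abbreviation_pattern(abbreviation):
    """Code an abbreviation as a pattern: alphabetic characters become 'c' and
    each run of numeric characters (possibly mixed with '.' and ',') becomes a
    single 'n'."""
    # map letters first so the 'n' placeholder is never re-coded as a letter
    lettered = "".join("c" if ch.isalpha() else ch for ch in abbreviation)
    # collapse any run containing at least one digit (with '.'/',') to 'n'
    return re.sub(r"[0-9.,]*[0-9][0-9.,]*", "n", lettered)
```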
Once the patterns and formation rules are extracted from the training set, frequent rules are selected. Further, for the validation of an unseen abbreviation-definition pair, rules are selected by matching the abbreviation and definition patterns. Next, if the formation rule can generate the abbreviation from the definition, then the candidate pair is labeled as valid. Park and Byrd tested their method on several collections, including a book from a pharmaceutical company and NASA press releases for 1999. The obtained results for recall and precision were in the range 93.8%-100%. Nevertheless, the authors do not give details about the nature and size of
the test sets.

Chang et al. [18] proposed a supervised method that extracts a set of features from each candidate abbreviation-definition pair using a character alignment between them obtained with the longest common sub-sequence (LCS) algorithm. The features are:

1. Percent of letters in the abbreviation in lower case
2. Percent of letters aligned at the beginning of a word
3. Percent of letters aligned at the end of a word
4. Percent of letters aligned on a syllable boundary
5. Percent of letters aligned immediately after another letter
6. Percent of letters in the abbreviation that are aligned
7. Number of words in the definition
8. Average number of aligned letters per word

Positive examples for the training set were obtained from a list of valid abbreviations and their definitions using the LCS alignments. Negative examples were generated with incorrect alignments of correct abbreviations and correct alignments of incorrect abbreviations. The machine-learning model used was
Logistic Regression, and the best results reached a recall of 83% at 80% precision against the Medstract Gold Standard Evaluation Corpus [76] of labeled biomedical abstracts. Nadeau and Turney [67] proposed a supervised method similar to the one of Chang et al. [18] but with more features. They considered all possible acronym-definition pairs at sentence level and used a set of heuristics to reduce the search space. The training and evaluation sets were built with the Medstract Gold Standard. Results from experiments using a support vector machine reached F-measure = 88.3%, close to the performance of F-measure = 88.4% obtained by Schwartz and Hearst [83].
5.2
The strategy that is proposed in this section aims to integrate some of the heuristics reviewed in this chapter into a token-based approximate string matcher.
An idea could be to activate the heuristics only when one of the tokens being compared seems to be an acronym. However, it is also desirable that when two acronyms are being compared, those heuristics are deactivated and the acronyms are compared as regular tokens. Additionally, there are some token separators, such as the period, that have to be ignored in some cases (e.g. when U.S. is compared with US). The proposed strategy consists of enumerating all possible token configurations obtained when the heuristics are applied or ignored. Then, each pair of token configurations is compared with a token-based string similarity method, and the highest obtained value is returned as the similarity value. The acronym tokenizing method is presented in Section 5.2.1 and the token combination enumeration policy is explained in Section 5.2.2.
- Divide into separate tokens two consecutive capital letter characters, e.g. US is tokenized as U, S.
- Divide into separate tokens capital letters separated by a period ('.'), e.g. U.S.A. is tokenized as U, ., S, ., A, ..

Clearly, if none of those rules can be applied to a specific token, it is considered that the token does not contain an acronym.
- If the actual token can be divided with the acronym tokenizer, then: option 1) consider the tokens obtained with the acronym tokenizer; option 2) consider the original token.
- If the actual token is a period, a hyphen or a slash, then: option 1) ignore it; option 2) consider it.
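A sketch of the acronym tokenizer of Section 5.2.1 and the configuration enumeration of Section 5.2.2 follows; the regular expression used to recognize acronym-like tokens is our assumption:

```python
import re
from itertools import product

def acronym_tokenize(token):
    """Acronym tokenizer: split runs of capital letters (US -> U, S) and
    capitals separated by periods (U.S.A. -> U, ., S, ., A, .).  Returns None
    when the token does not look like an acronym."""
    if len(token) > 1 and re.fullmatch(r"(?:[A-Z]\.?)+", token):
        return list(token)           # one sub-token per character
    return None

def token_configurations(tokens):
    """Enumerate the token configurations: each acronym-like token is kept
    whole or split, and each '.', '-' or '/' separator is kept or ignored."""
    options = []
    for tok in tokens:
        split = acronym_tokenize(tok)
        if split is not None:
            options.append([[tok], split])   # original vs. acronym sub-tokens
        elif tok in {".", "-", "/"}:
            options.append([[tok], []])      # keep vs. ignore the separator
        else:
            options.append([[tok]])
    for choice in product(*options):
        yield [t for group in choice for t in group]
```

Each enumerated configuration pair would then be scored with a token-based similarity measure, and the maximum taken as the final similarity value.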
Figure 5.1: Fragment of the document token sequence being compared ( ... Hard Disk 120 GB 5400 rpm ... ).
The enumeration of the combinations is illustrated with an example. Consider the pattern and the document being compared at the shaded position in Figure 5.1. The possible token configurations using the proposed acronym heuristics are enumerated in Figure 5.2. Each token configuration is compared with a string matching method to obtain a similarity measure, and the configuration with the highest similarity is chosen. In the example in Figure 5.2, the best configuration is the combination marked with j)*.
5.3
The method proposed in this chapter was integrated into the information extraction system proposed in Chapter 4. The best results obtained in that chapter are compared against the same parameter configuration with the acronym-sensitive token matcher integrated. The configuration used in the information extraction system was: shortest path as disambiguation strategy. The proposed baseline is the F-measure obtained using the same configuration but with exact string comparison. The results in Figure 5.3 show that the use of the acronym-sensitive matcher clearly improves the performance of the information-extraction system. Additionally, the range of the string-matching threshold that is above the baseline was considerably extended. The approach proposed in this chapter was particularly designed to fit the styles of acronym construction in the data sheets considered in this thesis. The acronym-sensitive matcher could also be a candidate for a general acronym matching method, but a comparison with other methods and experiments with other data sets would be mandatory. Although the results obtained with our particular application are encouraging, the proposed method should be considered as a baseline for future work.
Figure 5.2: Enumeration of the token configurations (a to r) for the example comparison between the pattern and the document; each configuration keeps or splits acronym-like tokens (e.g. ATA vs. A, T, A) and keeps or ignores separators. Configuration j)* obtains the highest similarity.
Figure 5.3: Performance comparison (F-measure vs. threshold) of the information-extraction system with and without the acronym-sensitive matcher; curves: Baseline (ExactMatch-SimpleStr.), Best Configuration, and Best Configuration + Acronym Matcher.
Chapter 6
Conclusions
6.1 Summary of Contributions
The contributions of this thesis can be summarized as follows:

1. The main product of this thesis is a system capable of extracting a large set of characteristics from data sheets, based on a lexicalized ontology that describes the domain.

2. A new method for string matching was proposed, based on a new expression for estimating the cardinality of a set of words. The proposed method reached better performance in a name-matching task previously studied in the literature with similar techniques. In addition, the proposed method had the best robustness against noise with the laptop domain data proposed in this thesis.

3. It was shown that an IE process can be addressed as a WSD problem when the density of extraction targets and the number of target types are high. This approach has not been considered in the past; thus, our work opens the way to a new set of IE methods based on WSD ideas.

Part of the ideas discussed in this thesis were presented to the scientific community in the following academic papers:
temas e Informática, Vol. 5, No. 2 (June 2008).

Jimenez, S. G., Becerra, C. J., Gonzalez, F. A., "Knowledge-based Information Extraction Using Fuzzy String Searching and Word Sense Disambiguation for IT Product Data-sheets", Proceedings of the 2008 International Conference on Information and Knowledge Engineering (IKE'08).
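The cardinality expression mentioned in contribution 2 can be illustrated with a simplified sketch: the cardinality of a token set is estimated so that near-duplicate tokens contribute less than one each, and the resulting "soft" cardinalities are plugged into a Jaccard-style coefficient. The bigram Dice similarity and the weighting used below are illustrative assumptions, not the exact formulation of Chapter 3.

```python
def char_sim(a, b):
    """Character-bigram Dice similarity between two tokens (illustrative)."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A or not B:
        return 1.0 if a == b else 0.0
    return 2 * len(A & B) / (len(A) + len(B))

def soft_cardinality(tokens):
    """Soft cardinality: near-duplicate tokens count as less than one each."""
    total = 0.0
    for t in tokens:
        total += 1.0 / sum(char_sim(t, u) for u in tokens)
    return total

def soft_jaccard(s, t):
    """Jaccard coefficient computed with soft cardinalities;
    |A intersect B| is estimated as |A| + |B| - |A union B|."""
    a, b = soft_cardinality(s), soft_cardinality(t)
    union = soft_cardinality(s + t)
    inter = a + b - union
    return inter / union if union else 1.0
```

With these definitions, two token sets that share a token and contain lexically close remaining tokens (e.g. ["video", "adapter"] versus ["video", "controller"]) score strictly higher than unrelated sets, while identical sets still score 1.0.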
6.2
Conclusions
The following enumeration summarizes the strongest and most important statements that can be made from our observations:
1. The results obtained in Chapter 3 lead us to conclude that, by exploiting the fact that text strings are naturally divided into subsequences of words or tokens, the quality of string-comparison processes can be increased.

2. The use of approximate string-matching techniques that combine strategies at the character and token levels, embedded in our specific IE task, brings resilience against noise in spite of the additional ambiguity introduced by their use.

3. Given the results obtained in Section 4.2, the proposed knowledge-based algorithm, inspired by word sense disambiguation (WSD), was capable of handling the IE task in data-rich documents.

4. The proposed acronym-sensitive string matcher was capable of increasing the performance of our IE system, given the high number of acronyms in the laptop lexicon.
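Statement 3 can be made concrete with a minimal sketch: each ambiguous term proposes several candidate ontology concepts (paths from the root), and the disambiguator keeps the candidate most related, on aggregate, to the concepts of the neighboring targets. The prefix-based relatedness measure and the example data below are illustrative assumptions, not the exact measure of Chapter 4.

```python
def relatedness(path_a, path_b):
    """Illustrative semantic relatedness between two ontology concepts,
    each given as a path from the root: length of the shared prefix
    divided by the length of the longer path."""
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(path_a), len(path_b))

def disambiguate(candidates, context):
    """Pick the candidate concept path closest to the context concepts."""
    return max(candidates,
               key=lambda c: sum(relatedness(c, ctx) for ctx in context))

# "5400" near disk-related targets should resolve to the rotation speed:
context = [("Laptop", "HardDisk", "Capacity"),
           ("Laptop", "HardDisk", "Controller")]
candidates = [("Laptop", "HardDisk", "HDRotationSpeed"),
              ("Laptop", "Processor", "Speed")]
print(disambiguate(candidates, context))  # ('Laptop', 'HardDisk', 'HDRotationSpeed')
```

The same token "5400" surrounded by processor-related targets would instead resolve to the processor-speed candidate, which is exactly the context effect the WSD framing exploits.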
6.3
The information extraction system presented in this thesis included new methods for term matching and disambiguation that can be applied in other domains. However, an extraction problem has to fulfill some important requirements in order to take advantage of the proposed methods:

1. The ontology must provide a hierarchy deep enough to allow the computation of the semantic relatedness measure. The term disambiguator depends on it; if the ontology is a simple flat template, the disambiguation process cannot be carried out.

2. The documents must have a high target density. That is, the extraction targets need to be near one another, because the proposed method uses the neighboring targets as the disambiguation context for a specific target.

3. The proposed method is better suited to extracting numerical measures than nominal targets. Measures composed of magnitudes and units are conveniently handled by the extractor; for nominal targets such as names, however, the extraction process is dictionary-based.
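Requirement 2 matters because the disambiguator takes the nearest extraction targets as context. A minimal sketch of that context selection follows; the window size and the (position, concept) encoding are illustrative assumptions.

```python
def context_window(targets, index, k=4):
    """Return the k targets nearest to `targets[index]` by token position.
    Each target is a (position, concept) pair; the target itself is excluded."""
    pos = targets[index][0]
    others = targets[:index] + targets[index + 1:]
    return sorted(others, key=lambda t: abs(t[0] - pos))[:k]

targets = [(0, "Brand"), (3, "Family"), (10, "Capacity"),
           (12, "HDRotationSpeed"), (40, "Weight")]
# Context for the target at position 12:
print(context_window(targets, 3, k=2))  # [(10, 'Capacity'), (3, 'Family')]
```

When targets are sparse, the nearest neighbors lie far away and carry little disambiguating information, which is why low target density degrades the method.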
We expect the proposed extraction method to have applications in other subdomains of the Information Technology domain (e.g. computing, printing, networking, imaging). Besides, other domains such as consumer electronics, mechanics, and biochemistry describe products with data sheets similar to those of the IT domain.
An important limitation of the proposed IE system is that it was designed to work with documents that describe only one object. Therefore, documents with product comparison charts, price lists, and the like are out of the reach of our extractor.
6.4
Future Work
1. The developed prototype can be the basis for the construction of an assisted semantic document annotation system. The data obtained with this tool opens the possibility of building an IE model with predictive abilities for ontology population; that is, a model that can enrich the ontology and the lexicon.

2. The proposed methods for combining character- and token-level measures can be extended to integrate other measures in place of the character-level ones. For instance, if there is evidence that the words controller and adapter are synonyms in the IT domain, a comparison between video controller and video adapter can easily be identified as a valid match. Similarly, using general bilingual dictionaries, the information extraction process can also be extended to different languages.
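This extension can be sketched by wrapping the token comparison so that dictionary synonyms score as exact matches and everything else falls back to a character-level measure. The tiny synonym table and the threshold are illustrative assumptions.

```python
# Illustrative synonym table for the IT domain (an assumption, not thesis data).
SYNONYMS = {frozenset({"controller", "adapter"}),
            frozenset({"notebook", "laptop"})}

def token_sim(a, b, base_sim):
    """Token similarity that treats known synonyms as exact matches,
    falling back to a character-level measure otherwise."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return base_sim(a, b)

def phrase_match(p, q, base_sim, threshold=0.8):
    """Check that every token of phrase p has a close counterpart in q."""
    return all(max(token_sim(a, b, base_sim) for b in q) >= threshold
               for a in p)

# With a trivial character measure, only the synonym table carries the match:
no_char_sim = lambda a, b: 0.0
assert phrase_match(["video", "controller"], ["video", "adapter"], no_char_sim)
```

A bilingual dictionary could be plugged into the same `token_sim` hook, mapping translated token pairs to a similarity of 1.0.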
Appendix A
Laptop Domain Ontology
Laptop
Disk, Display, VideoAdapter, OpticalDrive, NetworkAdapter, Modem, InputDevice, Battery, ACAdapter, AudioAdapter, Interfaces, ExpansionSlots, WarrantyServices, PhysicalDescription, Software, Webcam, MediaAdapter, SecuritySpecs]
A.1
CommercialOffer
CommercialOffer haspart [ListPrice, MonthlyPrice, NetPrice, Rebate, Shipping, Availability, NoPaymentsTime]
ListPrice haspart [Money]
MonthlyPrice haspart [Money]
NetPrice haspart [Money]
Rebate haspart [Money]
Shipping haspart [Money, TimeMeasurement]
NoPaymentsTime haspart [TimeMeasurement]
Money haspart [Currency, MoneyAmount]
A.2
ProductID
Certification]
haspart [UPCCode, Brand, Family, SubFamily, Model, PartNumber, haspart [Brand, Family, Model] haspart [Brand, Family, Model] haspart [Brand, Model]
LaptopID
ProcessorID
VideoAdapterID
WirelessAdapterID
AudioAdapterID
Processor
Technology]
haspart [ProcessorID, BusWidth, FrontSideBus, Cache, Speed, BusWidth, haspart [FrequencyMagnitude, MegaHertz]
Processor
haspart [CacheSize, CacheLevel] haspart [MemorySize] haspart [TwoPower, Bit] isa [FrequencyMeasurement]
Chipset Memory
haspart [MemorySize] haspart [TotalSlots, OccupiedSlots, AvailableSlots] haspart [Digit] haspart [Digit]
haspart [Digit]
haspart [Boolean]
A.5
HardDisk
HardDisk haspart [Capacity, MaxStorageCapacity, Controller, HDRotationSpeed, MotionSensor, MinSeekTime, AverageSeekTime, MaxSeekTime, PhysicalDescription]
Capacity haspart [MemorySize]
MaxStorageCapacity haspart [MemorySize]
HDRotationSpeed haspart [HDRotationSpeedMagnitude, RevolutionsPerMinute]
AverageSeekTime haspart [TimeMeasurement]
A.6
Display

Display haspart [PixelResolution, NamedResolution, Technology, Diagonal, PixelPitch, ContrastRatio, hasAmbientLightSensor, isWidescreen]
PixelResolution haspart [PixelResolutionHorizontal, X, PixelResolutionVertical]
Diagonal
PixelPitch
ContrastRatio
hasAmbientLightSensor haspart [Boolean]
isWidescreen haspart [Boolean]

A.7
VideoAdapter

VideoAdapter haspart [VideoAdapterID, VideoMemory, Slot, TVTuner, VideoInterfaces]
VideoInterfaces haspart [HDMI]
VideoAdapter
crete, Technology]
hasTVTuner
A.8
OpticalDrive
OpticalDrive haspart [OpticalFormat, DVDLayer, Labeling, supportsBluRay, supportsDoubleLayer]
OpticalFormat haspart [CDFormat, DVDFormat, BluRayFormat, HDDVDFormat, numberOfFormats]
numberOfFormats haspart [Integer]
CDFormat haspart [Media, ReadSpeed, WriteSpeed]
DVDFormat haspart [Media, ReadSpeed, WriteSpeed]
BluRayFormat haspart [Media, ReadSpeed, WriteSpeed]
HDDVDFormat haspart [Media]
ReadSpeed isa [OpticalDriveSpeedMagnitude]
WriteSpeed isa [OpticalDriveSpeedMagnitude]
supportsBluRay haspart [Boolean]
supportsDoubleLayer haspart [Boolean]
A.9
NetworkAdapter
LANAdapter haspart [DataRate, DataRateUnits, Jack, hasLANAdapter]
hasLANAdapter haspart [Boolean]
WirelessAdapter haspart [WirelessAdapterID, Authentication, Version, hasAntenna, hasWirelessAdapter, hasOnOffSwitch]
hasWirelessAdapter haspart [Boolean]
hasOnOffSwitch haspart [Boolean]
BlueToothAdapter haspart [Technology, Version, hasBlueToothAdapter, hasBlueToothAntenna]
hasBlueToothAdapter haspart [Boolean]
hasBlueToothAntenna haspart [Boolean]
BroadBandWirelessAdapter haspart [BroadBandWirelessAdapterID, hasAntenna]
Modem haspart [ModemID, Version, Jack, Certification, isSoftwareModem]
isSoftwareModem haspart [Boolean]
A.10
InputDevice (s)
haspart [Keyboard, PointingDevice, HotKeys, hasRemoteControl] haspart [Boolean]
InputDevice
haspart [Technology, hasLeftButton, hasRightButton, hasCenterButton, hasScrollZone, isElectroStaticTouchPad] haspart [Boolean] haspart [Boolean] haspart [Boolean] haspart [Boolean]
hasCenterButton
haspart [Boolean]
isElectroStaticTouchPad
haspart [Country, Layout, KeyPitch, KeyStroke, NumberOfKeys, hasNumericKeyPad, hasScrollBar, isSpillResistant] haspart [Boolean] haspart [Boolean] haspart [Boolean] haspart [Integer]
KeyStroke
haspart [Cells, Technology, Life, Energy, PhysicalDescription, RechargeTime, BatteryWarranty, ElectricalCurrentCapacity, isRechargeable, isRemovable] haspart [Digit]
RechargeTime
haspart [EnergyMeasurement]
BatteryWarranty haspart [TimeMeasurement] ElectricalCurrentCapacity haspart [CurrentMeasurement] isRechargeable haspart [Boolean] isRemovable haspart [Boolean] ACAdapter haspart [ACInput, DCOutput, PhysicalDescription] ACInput haspart [MinFrequency, MaxFrequency, MinVoltage, MaxVoltage, PowerRequirements, CurrentRequirements, hasVoltageAutoSensing]
MinFrequency haspart [FrequencyMeasurement] MaxFrequency haspart [FrequencyMeasurement] MinVoltage haspart [PotentialMeasurement] MaxVoltage haspart [PotentialMeasurement] PowerRequirements haspart [PowerMeasurement, Tolerance] CurrentRequirements haspart [CurrentMeasurement, Tolerance] Tolerance haspart [Percentage] hasVoltageAutoSensing haspart [Boolean] DCOutput haspart [Voltage, Current, Power] Voltage haspart [PotentialMeasurement] Current haspart [CurrentMeasurement] Power haspart [PowerMeasurement]
A.12
PhysicalDescription
PhysicalDescription haspart [Weight, Color, Material, Dimensions] Weight haspart [WeightMeasurement] Dimensions haspart [Height, Width, Depth] Height haspart [DistanceMeasurement, MinHeight, MaxHeight] Width haspart [DistanceMeasurement] Depth haspart [DistanceMeasurement] MinHeight haspart [DistanceMeasurement] MaxHeight haspart [DistanceMeasurement] Webcam haspart [Resolution] Resolution haspart [Magnitude, PixelUnits]
A.13
AudioAdapter
haspart [AudioAdapterID, Speaker, Sampling, Technology, hasMicrophone, MicrophoneDirection, AudioInterfaces] haspart [Integer, Bit]
LineOutJack]
haspart [Boolean]
haspart [Boolean]
haspart [SpeakerID, Type, SoundPower, Impedance, hasSpeaker] haspart [PowerMeasurement] haspart [Boolean] haspart [ImpedanceMeasurement]
MediaAdapter
NumberOfMediaFormats
haspart [USB, IEEE1394, hasSerial, hasParallel, hasInfraRed, HDMI, hasPS/2, hasPortReplicatorConnector] haspart [NumberOfPorts, Version] haspart [NumberOfPorts, PinOut] haspart [Integer, Pin]
IEEE1394 PinOut
HDMI VGA
haspart [NumberOfPorts, Version] haspart [Boolean] haspart [Boolean] haspart [Boolean] haspart [Boolean]
hasSerial
hasParallel hasPS/2
hasInfraRed
haspart [Boolean] haspart [PCCardSlots, PCExpressSlots, PCISlots] haspart [SlotsNumber, Type, hasPCExpressSlots] haspart [Boolean] haspart [Boolean]
hasPCExpressSlots
WarrantyServices
haspart [Warranty, Support, hasDamageCoverage] haspart [Boolean] haspart [OnSiteWarranty, CarryInWarranty, PartsWarranty, Laborhaspart [TimeMeasurement, Region] haspart [TimeMeasurement, Region]
WarrantyServices Warranty
hasDamageCoverage
Warranty]
haspart [EmailSupport, TelephoneSupport, OnlineSupport] haspart [EmailAddress, TimeMeasurement] haspart [ToolFreeNumber, Region, WeeklySchedule, TimeMea-
TelephoneSupport
OnlineSupport
A.16
Software
haspart [ProcessorArchitecture, Brand, Family, Softwarehaspart [ProcessorType, BusWidth]
Software
OperatingSystem
Version, SubVersion]
ProcessorArchitecture ProductivitySoftware
Licence]
PhotoSoftware VideoSoftware
haspart [Brand, Family, SoftwareVersion, SubVersion, Licence] haspart [Brand, Family, SoftwareVersion, SubVersion, Licence] haspart [Brand, Family, SoftwareVersion, SubVersion, Licence] haspart [Brand, Family, SoftwareVersion, SubVersion, Licence]
haspart [Brand, Family, SoftwareVersion, SubVersion] haspart [Brand, Family, SoftwareVersion, SubVersion] haspart [Decimal, Integer, Version]
Security
haspart [PhysicalSecurity, SoftwareSecurity] haspart [hasCableLockSlot, hasFingerprintSensor, TPM] haspart [Boolean]
SecuritySpecs TPM
PhysicalSecurity
hasFingerprintSensor hasCableLockSlot
SoftwareSecurity haspart [hasPasswordSecurity, hasUserPassword, hasSupervisorPassword]
hasPasswordSecurity
hasUserPassword haspart [Boolean]
hasSupervisorPassword
A.18
Measurements
haspart [Magnitude, FrequencyUnits] haspart [MemoryMagnitude, MemoryUnits] haspart [Magnitude, TimeUnits] haspart [Magnitude, DistanceUnits] haspart [Magnitude, EnergyUnits] haspart [Magnitude, WeightUnits] haspart [Magnitude, PotentialUnits]
PotentialMeasurement CurrentMeasurement CurrentMeasurement MemoryMagnitude Magnitude Percentage MoneyAmount FrequencyUnits MemoryUnits TimeUnits
haspart [Magnitude, PowerUnits] haspart [Magnitude, CurrentUnits] haspart [Magnitude, ImpedanceUnits] haspart [Magnitude, CurrentUnits] haspart [Magnitude, DataRateUnits]
ImpedanceMeasurement DataRateMeasurement
isa [Integer, Decimal, Fraction, Digit] isa [Integer, Decimal] isa [Hertz, KiloHertz, MegaHertz, GigaHertz] haspart [Magnitude, PercentageCharacter] isa [Bit, Byte, KiloByte, MegaByte, GigaByte, TeraByte]
isa [PicoSecond, NanoSecond, MicroSecond, MiliSecond, Second, Minute, Hour, Day, BusinessDay, Week, Month, Year] isa [PicoMeter, NanoMeter, MicroMeter, MiliMeter, CentiMeter, Inch, Feet, Yard, Meter, KiloMeter, Mile]
DistanceUnits
isa [WattHour, KiloWattHour, MiliWattHour] isa [Gram, MiliGram, KiloGram, Pound, Ounce] isa [Volt, MiliVolt, KiloVolt]
PotentialUnits CurrentUnits
isa [Watt, MiliWatt, KiloWatt, HorsePower] isa [Ampere, MiliAmpere] isa [Ohm, MiliOhm, KiloOhm, MegaOhm]
isa [KilobitPerSecond, MegabitPerSecond, GigabitPerSecond, TerabitPerSecond, KilobytePerSecond, MegabytePerSecond, GigabytePerSecond, TerabytePerSecond] haspart [Integer, Slash, Integer] isa [Yes, Not] haspart [Integer, Colon, Integer]
Appendix B
Laptop Lexicon
Ampere:
[A, Ah, Amperes"] [Creative"] [Sound Blaster"] [SRS Labs audio enhancements", SRS WOW
AudioAdapterID_Brand: AudioAdapterID_Family:
AudioAdapter_Technology:
HD stereo"]
AudioJackType: Availability:
[In Stock"] [Available", Free", Disponible"] [Ion", Li-ion", Lithium Ion", NiCad", Nickel Cad-
AvailableSlots:
Battery_Technology:
Bit:
BluRayFormat_Media:
ROM"]
BlueToothAdapter_Technology: BlueToothAdapter_Version:
[Nero", Sonic", Roxio", Sony", Toshiba"] [Burning ROM", Burn", Creator", CD Burn-
BusinessDay:
Byte:
CDFormat_Media:
Bridge", CD-MIDI", CD-R", CD-ROM", CD-ROM XA", CD-RW", CD-RW", CD-TEXT", Multisession CD", Photo-CD", Portfolio", Video CD"]
CacheLevel: CentiMeter:
[Level2", L2"] [centimeter", cm"] [ATI", Intel"] [RADEON"] [XPress 200M", 200M", 945GMS"]
ver", Cloud", Red", Titanium Silver", Magnesium alloy", Charcoal Black", Sienna", Slate Blue", Onyx Blue Metallic", Mist Gray", Smart Indigo"]
Currency:
DVDFormat_Media:
DVD+-RW", DVD-10", DVD-18", DVD-5", DVD-9", DVD-R", DVDR DL", DVD-RAM", DVD-RAM", DVD-ROM", DVD-RW", HD DVDROM", DVD-Video", DVD-Audio"]
",
"]
DVDLayer: Day:
Diagonal_DistanceMagnitude:
9.5", 17.0"]
Digit:
[0", 1", 2", 3", 4", 5", 6", 7", 8", 9", one", two", three",
Display_Technology:
tive matrix"]
Feet:
[feet", ft"] 5 [. . . ", . . . ", . . . ", . . . ", / ", . . . ", ", ", "] 8
Fraction:
[GB", GByte", GigaByte", Giga Byte", G.B."] [GHz", Gigahertz", Giga Hertz"] [Gigabit per second", Gbit/s", Gbps"] [Gigabyte per second", GB/s", GBps"]
GigabitPerSecond:
GigabytePerSecond: Gram:
HDDVDFormat_Media: HDMI_Version:
[1.0", 1.1", 1.2", 1.2a", 1.3", 1.3a", 1.3b"] [4200", 5400", 7200"]
HDRotationSpeed_HDRotationSpeedMagnitude: HardDisk_Controller:
UATA", U-ATA", Ultra-ATA", Ultra ATA", IDE", EIDE", E-IDE", Enhanced IDE", Enhanced IDE"]
Hertz:
[Hz", Hertz"] [hp", horsepower", horse power"] [Media", Application", Windows", Internet", CD/DVD",
HorsePower:
HotKeys_Type:
DVD/CD", DVD", Play Menu", Sound", Volume", volume up", volume down", volume mute", mute", Screen Blank", Security", Pause", Play", Skip", Skip to Next Track", Next Track", Skip to Previous Track", Previous Track", Stop", Instant On"]
Hour: Inch:
[hour", hours", hrs", hr"] [inch", inches", in", ""] [ES", LA", FR", US"]
[KB", KByte", KiloByte", Kilo Byte", K.B.", K"] [Kg", kg", kilograms"] [KHz", Kilohertz", Kilo Hertz"] [kilometer", km"]
"]
KiloWattHour:
[KWh", KWHr"] [Kilobit per second", Kbit/s", Kbps"] [Kilobyte per second", KB/s", KBps"] [10", 100", 1000", 10/100", 100/1000", 10/100/1000",
KilobitPerSecond:
KilobytePerSecond:
LANAdapter_DataRate:
LANAdapter_Jack: LaptopID_Brand:
[Toshiba",
Compaq", Dell", Sony", Lenovo", NEC", Packard Bell", Samsung", LG", ASUS", Gateway"]
LaptopID_Certification:
BELUS", BSMI", CCC", CE Marking", CI", CSA", Energy Star", FCC", GOST", GS", ICES", Japan VCCI", MIC", SABS", Saudi Arabian (ICCP)", UKRSERTCOMPUTER", UL "]
LaptopID_Family: LaptopID_Model:
[VAIO", Pavilion", Presario", Tecra", Satellite", Portege"] [A130-ST1311", dv9500t", VGN-UX390N", VGN-TXN15P/B",
[UX"]
Material:
MediaAdapter_Technology:
ory Stick
", miniSD
", Mem-
MediaSoftware_Brand:
HP"]
MediaSoftware_Family:
Media Player",
Media Player",
MegaByte:
M"]
MegaHertz: MegaOhm:
"]
MegaPixel:
[M", MP", megapixel"] [Megabit per second", Mbit/s", Mbps"] [Megabyte per second", MB/s", MBps"] [DDR", DDR2", DDR3", EDO", FPM", GDDR2",
MegabitPerSecond:
MegabytePerSecond:
Memory_Technology:
GDDR3", GDDR4", GDDR5", Rambus", SDR", SDRAM", SGRAM", XDR", DDR2-400", DDR2-533", DDR2-667", DDR2-800", DDR21066"]
Meter:
[meter", m"]
MicroMeter: MicroSecond:
MicrophoneDirection: Mile:
[mile", mi"]
"]
MiliWattHour: Minute:
[minutes", min", '", m"] [Conexant", Toshiba"] [Australian ACA", C.I.S.P.R.22", Canadian ICES-
ModemID_Brand:
Modem_Certification:
003 Class B", CCIB", CE Mark", CSA", CTR21", FCC Part 15 Class B", FCC Part 68", Industry Canada", NEMKO", UL"]
Modem_Jack:
[RJ-11", Standard Telephone Jack", Phone Jack"] [56K", v.90", v.92"] [PC66", PC100", PC133", PC166", PC200",
Modem_Version:
ModuleTransferRate_Name:
PC1600", PC2100", PC2700", PC3200", PC4200", PC5300", PC23200", PC2-4200", PC2-5300", PC-66", PC-100", PC-133", PC-166", PC-200", PC-1600", PC-2100", PC-2700", PC-3200", PC-4200", PC5300"]
[Dimm", Simm"]
[Free", No Charge"] [month", mth"] [3D Accelerometer", 3D DriveGuard", G-Sensor"] [SXGA+", Wide SVGA", Widescreen XGA", Widescreen
MotionSensor:
NamedResolution:
[No", Not", Non", Optional", Select models", Selected models", some models", in selected models", in selected configurations"]
NumberOfMediaFormats: NumberOfPorts:
x5", x6"]
4", 5", 6", 1x", 2x", 3x", 4x", 5x", 6x", x1", x2", x3", x4",
OccupiedSlots: Ohm:
[Used", Occupied"]
[Ohms", Ohm",
"]
OperatingSystem:
[Windows XP", Win XP", Windows Vista", Linux"] [Microsoft", MS", SUN", Apple", Caldera",
OperatingSystem_Brand:
SCO", IBM", Novell", Caldera", Debian", Mandriva", Red Hat", Fedora", SuSE", Ubuntu"]
OperatingSystem_Family:
DOS"]
OperatingSystem_SoftwareVersion: OpticalDriveSpeedMagnitude:
52"]
[XP", Vista"]
OpticalDrive_Labeling: Ounce:
[oz", ounces"]
[DiscT@2", Labelflash
", LightScribe"]
PCCardSlots_Type:
PCExpressSlots_Type: PercentageCharacter:
[%"]
[pin", Pin"] [pixel", pixels"] [1024", 1280", 1366", 1400", 1440", 1680",
Pixel:
PixelResolutionHorizontal:
1680", 1920", 2048"]
PixelResolutionVertical:
1536"]
PointingDevice_Technology:
Pad
Pound:
ProcessorArchitecture_ProcessorType:
MIPS", RISC"]
ProcessorID_Brand: ProcessorID_Family:
trino Duo", Centrino Pro", Core(TM) 2 Duo", Core Duo", Core Solo", Core 2 Duo", Pentium dual-core", Semprom", Turion", Turion 64", Turion 64 Mobile", Turion 64 X2"]
ProcessorID_Model:
3800+", T2060", T2080", T2250", T2350", T2400", T2450", T2500", T2600", T5200", T5300", T5500", T5600", T5600", T7100", T7200", T7300", T7400", T7500", T7600", T7700", U1400", U1500"]
Processor_Technology:
[ULV", Ultra Low Voltage", Mobile"] [Microsoft", MS", Corel", SUN", Lotus",
ProductivitySoftware_Brand:
IBM", Adobe"]
ProductivitySoftware_Family:
Works", Oce X3", Acrobat Reader", PowerPoint", Accounting Express", Outlook", Publisher", Access", Money"]
REF_ACAdapter:
[AC Adapter",
AC-Adapter",
Power",
external AC
REF_ACInput:
[Input"]
REF_AudioAdapter:
tem"]
REF_AudioJackType:
Audio Plug"]
REF_Availability:
REF_AverageSeekTime: REF_Battery:
[Warranty"]
[cell", cells"]
REF_Battery_Technology: REF_BluRayFormat:
REF_BlueToothAdapter: REF_Brand:
"]
REF_BroadBandWirelessAdapter:
REF_CacheLevel:
[Color", Case Color", Chasis Color", Cover Color"] [Computer", PC", Personal Computer"] [Contrast Ratio"]
REF_Computer:
REF_ContrastRatio: REF_DCOutput:
REF_DVDFormat:
REF_DVDLayer: REF_Depth:
[Layer"]
REF_Dimensions: REF_Display:
REF_Display_Diagonal:
REF_ExpansionSlots: REF_Family:
REF_FrontSideBus:
running", up to"]
REF_HDDVDFormat: REF_HDMI:
REF_HDRotationSpeed: REF_HardDisk:
Hard Drive"]
REF_HardDisk_Capacity:
REF_HardDisk_Controller: REF_Height:
[H", Height", thick", thin"] [Hot Keys", Hot Key Functions", One-Touch Button",
REF_HotKeys:
One-Touch Productivity Button", Launch Button", Action Button", Key Function", Programmable Key", Dial Control", Buttons", Control Button"]
REF_IEEE1394: REF_ITProduct:
[Firewire", i.Link", IEEE 1394a", iLink", IEEE 1394"] [Product", Product Description", Product Specs", Product Specifications"]
REF_InputDevice: REF_Interfaces:
[Input Device"]
REF_KeyStroke: REF_Keyboard:
[Stroke", Key Stroke"] [Keyboard"] [Country", Layout"] [Pitch", Key Pitch", Key Size", center-to-
REF_Keyboard_Country:
REF_Keyboard_KeyPitch:
center spacing"]
REF_LANAdapter:
ernet"]
REF_LANAdapter_DataRate: REF_LANAdapter_Jack:
REF_Labeling:
[Laptop", Notebook", Portable Computer"] [Licence"] [List Price", From", Original Price", Price", Regular
REF_ListPrice:
Price"]
REF_Material:
[Material", Case Material", Chasis Material"] [Max", Max Height", Rear", R"] [Max Seek Time", Maximum Seek Time"] [Max Capacity", Max Storage Capacity", Up
REF_MaxHeight:
REF_MaxSeekTime:
REF_MaxStorageCapacity:
to", Max"]
REF_Media:
REF_MediaAdapter:
dia Adapter", Media Slot", Media Slot", Media Reader", Memory Media", Media Card Reader", Multimedia Card Reader", Media", Slot", Media Port"]
REF_MediaSoftware: REF_Memory:
[Media", Player"]
[Memory", Main Memory", System Memory"] [Slots", Memory Slots", Main memory slots", Acce-
REF_MemorySlots:
REF_Memory_Installed:
REF_Memory_MaxInstallable:
able to", Maximum", Max",
mum Supported", Max Memory", Maximum Memory", Max Capacity", Maximum Capacity"]
REF_Memory_Technology: REF_MinHeight:
thin as"]
REF_MinSeekTime: REF_Model:
[Model", Model Number"] [Modem", Fax/Modem", Fax Modem"] [Modem Certification", Telephone Certification"]
REF_Modem:
REF_Modem_Certification:
REF_Modem_Jack:
[Modem Port", Modem Jack", Telephone Plug"] [Version", Speed", Compliant"] [Bandwidth", Transfer Rate", Module Name",
REF_Modem_Version:
REF_ModuleTransferRate:
REF_MonthlyPrice:
REF_MotionSensor:
tion", Protection"]
REF_NamedResolution: REF_NetPrice:
[Resolution"]
[Net Price", Your Price", After Rebate", From"] [Network Adapter", Communications", Network
REF_NetworkAdapter:
Connection"]
[no payments"]
[key", keys"] [port", jack", ports", jacks"] [OnSite", On Site", On-Site"]
[OnLine", Online", On line", On-line", Chat"] [Operating System", Operative System", SO",
REF_OperatingSystem:
S.O."]
REF_OpticalDrive:
REF_OpticalFormat: REF_PCCardSlots:
[Format", Formats"]
REF_PCCardSlots_Type: REF_PCExpressSlots:
pressCard"]
[PCExpress",
REF_PCExpressSlots_Type: REF_PCISlots:
pansion"]
[Type", FormFactor"]
REF_PartNumber:
[Part No.", Part", Part Number", Product Number"] [Parts", Hardware", Hardware Parts"] [Photo", Photograph", Image", Picture"] [Physical Description", Physical Specifications"]
REF_PartsWarranty: REF_PhotoSoftware:
[Physical Security"]
[Pixel Pitch", Pitch"] [Resolution", Native resolution"] [Pointing Device", Mouse", Mouse Device"] [Power Requirements", Power Consumption"]
REF_PixelResolution: REF_PointingDevice:
REF_PowerRequirements: REF_Processor:
[Rebate",
REF_RechargeTime:
REF_Resolution: REF_Sampling:
REF_SecuritySoftware:
REF_SecuritySpecs: REF_Shipping:
[Security"]
REF_Shipping_Money:
REF_Shipping_TimeMeasurement:
ally ships in"]
REF_SlotsNumber: REF_Software:
party software"]
[Speakers"]
[Speed", Processor Speed", Microprocessor Speed", Clock"] [Support", Customer Care", Assistance", available"]
REF_Support: REF_TPM:
REF_TVTuner:
[TV Tuner", Television", TV", TV & Entretainment"] [Technology", Type"] [Telephone Support", Telephone Assistance", toll-
REF_Technology:
REF_TelephoneSupport:
free", Call"]
[+", +-", +/-"] [total", total memory slots"] [UPC", UPC Code", Universal Product Code"]
[USB", Universal Serial Bus", USB Ports", USB Slots"] [compliant", version", ver."] [Utility", Utilities", Configuration", Config", Setup",
REF_USB_Version:
REF_UtilitySoftware:
REF_Version:
REF_VideoAdapter:
ics Engine", Graphics Subsystem", Graphics Adapter", Video Graphics", Graphics Processing Unit", Video"]
REF_VideoInterfaces:
REF_VideoMemory:
REF_VideoMemory_Installed:
REF_VideoMemory_MaxInstallable:
Maximum", Max", Supported",
Max Supported",
REF_VideoSoftware:
REF_WarrantyServices:
REF_Webcam:
REF_WeeklySchedule: REF_Weight:
REF_Width:
REF_WirelessAdapter:
Adapter", Wireless LAN", Wireless LAN Adapter", Wireless Network Connection", Wireless Option", Wireless Support", Wi-Fi", Communications", cable free"]
REF_WriteSpeed:
REF_hasAmbientLightSensor: REF_hasAntenna:
[Antenna"]
REF_hasBlueToothAntenna: REF_hasCableLockSlot:
[Lock Slot",
REF_hasCenterButton:
[mouse center", left center", center"] [Damage Coverage", Accident Coverage", Ac-
REF_hasDamageCoverage:
REF_hasECC:
REF_hasFingerprintSensor:
Print", Reader"]
REF_hasHeadphoneJack:
REF_hasInfraRed:
Remote Receiver"]
REF_hasLeftButton:
[mouse left", left button", left"] [line out", line-out", line out jack", line out port"] [microphone", microphone jack", microphone
REF_hasLineOutJack:
REF_hasMicrophoneJack:
REF_hasNumericKeyPad: REF_hasPS/2:
[PS/2", mouse port", keyboard port"] [DB-25", Parallel", Parallel Port", IEEE1284", IEEE-
REF_hasParallel:
REF_hasPasswordSecurity:
REF_hasPortReplicatorConnector:
cator", Replicator", Port Replicator Connector", Port Replicator Slot", Docking Station Port", Docking Station Connector", Docking Station Slot"]
REF_hasRemoteControl:
trol", IR Control"]
REF_hasRightButton: REF_hasS/PDIF:
REF_hasScrollBar:
REF_hasScrollZone:
Up/Down pad"]
REF_hasSerial:
REF_hasSpeaker:
REF_hasSupervisorPassword:
REF_hasUserPassword:
REF_hasVoltageAutoSensing:
[Rechargeable"]
REF_isSoftwareModem:
modem", soft"]
REF_isSpillResistant: REF_isWidescreen:
REF_numberOfFormats: REF_supportsBluRay:
REF_supportsDoubleLayer: Region:
[US", USA", Canada", WorldWide", Word Wide", Word-Wide"] [rpm", RPM", r/min"]
RevolutionsPerMinute: Second:
SecuritySoftware_Brand:
Micro"]
SecuritySoftware_Family: Slash:
[/"]
SoftwareVersion:
2001", 2002", 2003", 2004", 2005", 2006", 2007", 2008", 2009", 2010", I", II", III", IV", V", VI", VII", VIII", IX", X", XI", XII", XIII", XIV", XV"]
Essentials", full", Home", Home", Student", Teacher", Home Basic", Premier", Premium", Pro", Professional", Runtime Enviroment", Server", Service Pack 1", Service Pack 2", SP1", SP2", Standard", Std",
TPM_Brand:
bond"]
TVFormat: TeraByte:
[TB", TByte", TeraByte", Tera Byte", T.B."] [Terabit per second", Tbit/s", Tbps"] [Terabyte per second", TB/s", TBps"]
TerabitPerSecond:
TerabytePerSecond: TwoPower:
[1", 2", 4", 8", 16", 32", 64", 128", 256", 512", 1024",
USB_Version:
[2.0", 1.1", v2.0", v1.1"] [Roxio", Toshiba", HP", Softthinks"] [Backup MyPC", ConfigFree", Game Console",
UtilitySoftware_Brand: UtilitySoftware_Family:
PC Recovery"]
VideoAdapterID_Brand: VideoAdapterID_Family:
Fire GL", Fire MV", GeForce", GeForce M", Quadro", Quadro NVS", Quadro FX", GMA", Graphics Media Accelerator"]
VideoAdapterID_Model:
Go 7200", Go 7300", Go 7400", Go 7600", Go 7600", 950"]
VideoAdapter_Slot: VideoInterfaces:
[TurboCache [dedicated"]
"]
[discrete"]
[Ulead", Adobe", Microsoft", Muvee"] [DVD MovieFactory", DVD Movie Factory", Pre-
Volt: Watt:
[V", volt", volts"] [W", watt", watts"] [watthour", watts hour", Wh", WHr"] [VGA", 640x480", 800x600"]
WattHour:
Webcam_Resolution: Week:
[week", wk"]
WeeklySchedule:
[7x24", 24x7"] [Atheros", Broadcom", Intel"] [PRO Wireless", PRO/Wireless LAN", 3945BG",
WirelessAdapterID_Brand: WirelessAdapterID_Model:
WirelessAdapter_Authentication:
Access", WPA", WPA2"]
WirelessAdapter_Version:
X:
[Yes", Included", Standard", Capable", Ready", Installed", PreInstalled", Compatible", Bundled", BuiltIn", Embedded", Integrated", Supported", Support"]
daysWeek:
[workdays", working days", weekends", business days"] [mic", microphone", built-in microphone"] [on/off"]
hasMicrophone: hasOnOffSwitch:
Appendix C
A Laptop Domain Labeled Data Sheet Example
This manually labeled sequence example corresponds to the first reported data sheet in Table 2.2.
['HP']:
Laptop
LaptopID
['Pavilion']:
Laptop
Laptop
Laptop
Laptop
Brand
LaptopID
LaptopID
LaptopID
LaptopID
Family Model
Family
REF_Family
Brand
Laptop
Software
OperatingSystem
Software
OperatingSystem
SubVersion
Laptop
Software
OperatingSystem Model
SubVersion
Laptop
LaptopID
Laptop
LaptopID
Family
REF_Family
Laptop
CommercialOffer NetPrice
NetPrice
REF_NetPrice
Laptop
CommercialOffer Laptop
Money
CommercialOffer
NetPrice
Laptop
CommercialOffer
MonthlyPrice Money
Laptop
CommercialOffer
MonthlyPrice
Currency
['35']:
Laptop
Integer
CommercialOffer
MonthlyPrice
Money
MoneyAmount
[]:
Laptop
CommercialOffer
['90']:
Laptop
Magnitude
['days']:
Laptop
TimeUnits
MonthlyPrice
REF_MonthlyPrice
CommercialOffer Integer
NoPaymentsTime
TimeMeasurement
CommercialOffer Day
NoPaymentsTime
TimeMeasurement
['no','payments']: ['Free']:
Laptop
Laptop
CommercialOffer
CommercialOffer
Laptop
teger
Laptop
Shipping
CommercialOffer
CommercialOffer
CommercialOffer
REF_NoPaymentsTime Money
Shipping
Shipping
Shipping
Money Money
In-
['$',]:
Laptop
['150']:
Laptop
teger
CommercialOffer
CommercialOffer
Rebate
Rebate
Money
Currency MoneyAmount
Money
In-
Laptop
Laptop
Laptop
teger
CommercialOffer
CommercialOffer
CommercialOffer
Rebate
Rebate
Rebate
REF_Rebate
Money
Money
Currency MoneyAmount
In-
['off']:
Laptop
CommercialOffer Laptop
['upgrade','to']: ['2']:
Laptop Integer
Memory
Memory
Rebate
REF_Rebate
MaxInstallable
MaxInstallable
MemorySize
REF_MaxInstallable MemoryMagnitude
['GB']:
Laptop
GigaByte
Memory
The remainder of this appendix reproduces, token by token, the labeled corpus built from the laptop data sheets. Each token or n-gram extracted from a data sheet (e.g. ['GHz'], ['Hard','Drive'], ['Windows','Vista']) is annotated with the path of ontology concepts it instantiates, from the root concept Laptop down to a leaf concept or datatype value; labels prefixed with REF_ mark references to a concept rather than to a value. For instance, the token ['GHz'] is labeled with the path Laptop, Processor, Speed, FrequencyMeasurement, FrequencyUnits, GigaHertz; the token ['MB'] in a cache specification is labeled Laptop, Processor, Cache, CacheSize, MemorySize, MemoryUnits, MegaByte; and the token ['RPM'] is labeled Laptop, HardDisk, HDRotationSpeed, RevolutionsPerMinute. The annotated excerpts cover memory, processor, display, hard disk, optical drive, battery, network, video and audio adapters, expansion slots and interfaces, physical dimensions and weight, software (operating system, security, productivity, media, photo and burning software) and warranty and support services.
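To make the annotation scheme above concrete, the following is a minimal sketch of how one labeled-corpus entry could be represented in code. The concept names mirror those that appear in the appendix (Laptop, Processor, HardDisk, etc.); the LabeledToken class itself and its field names are illustrative assumptions, not the representation actually used in the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledToken:
    """One labeled-corpus entry: a surface n-gram plus its ontology path."""
    tokens: tuple          # surface n-gram as it appears in the data sheet
    concept_path: tuple    # root-to-leaf path of ontology concepts
    is_reference: bool = False  # True for REF_ labels (concept references)


# A few entries reconstructed from the appendix examples.
corpus_sample = [
    LabeledToken(("GHz",),
                 ("Laptop", "Processor", "Speed", "FrequencyMeasurement",
                  "FrequencyUnits", "GigaHertz")),
    LabeledToken(("MB",),
                 ("Laptop", "Processor", "Cache", "CacheSize",
                  "MemorySize", "MemoryUnits", "MegaByte")),
    LabeledToken(("Hard", "Drive"),
                 ("Laptop", "HardDisk"), is_reference=True),
]


def leaf_concepts(corpus):
    """Return the set of leaf concepts instantiated in a labeled corpus."""
    return {entry.concept_path[-1] for entry in corpus}


print(sorted(leaf_concepts(corpus_sample)))
# ['GigaHertz', 'HardDisk', 'MegaByte']
```

A structure of this kind is enough to compute simple corpus statistics (e.g. how many distinct leaf concepts or REF_ labels occur), which is the sort of summary reported for the manually labeled corpus in this thesis.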
Bibliography
[1] Eythan Adar. S-rad a simple and robust abbreviation dictionary. Technical report, HP Laboratories Technical Report, September 2002. [2] Eneko Agirre and German Rigau. A proposal for word sense disambiguation using conceptual distance. In
1996.
[3] Ergin Altintas, Elif Karsligil, and Vedat Coskun. A new semantic similarity measure evaluated in word sense disambiguation. In
Algorithmica,
2(1):315336, 1987.
[5] B. De Baets and H. De Meyer. Transitivity-preserving fuzzication schemes for cardinality-based similarity measures.
160:726740, 2005.
trieval.
[7] Satanjeev Banerjee and Ted Pedersen. An adapted lesk algorithm for word sense disambiguation using wordnet. In
2002.
[8] Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In
2003.
[9] Ilaria Bartolini, Paolo Ciaccia, and Marco Patella. metric trees using an approximate distance. In
Proceedings of SPIRE,
2002.
[10] T.C. Bell, J.G. Cleary, and I.H. Witten. Englewood Clis, 1990.
Text Compression.
Prentice Hall,
Scientic
American,
2001.
164
BIBLIOGRAPHY
165
[12] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In
[14] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. 117, 1998. [15] Alexander Budanitsky and Graeme Hirst. Semantic distance in wordnet:
30(1-7):107
Work2001.
shop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics,
[16] Paul Buitelaar, Philipp Cimiano, Stefania Racioppa, and Melanie Siegel. Ontology-based information extraction with soba. In
Proceedings of the
IEEE Transac-
18:14111428, 2006.
[18] Jerey T. Chang, Hinrich Schtze, and Russ B. Altman. Creating an online dictionary of abbreviations from medline.
Informatics Association,
9(6):612620, 2002.
[19] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and ecient fuzzy match for online data cleaning. In
Proceedings of
Machine Learning,
[21] Peter Christen.
2000. Techniques
and practical issues. Technical report, The Australian National University, Department of Computer Science, Faculty of Engineering and Information Technology, 2006. [22] Rudi Cilibrasi and Paul M. B. Vitnyi. Clustering by compression.
IEEE
[23] Philipp Cimiano, Siegfried Handschuh, and Steen Staab. self-annotating web. In
Web, 2004,
2004.
BIBLIOGRAPHY
166
[24] Fabio Ciravegna, Sam Chapman, Alexiei Dingli, and Yorick Wilks. Learning to harvest information for the semantic web. In
[25] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In
Proceedings
2003.
[26] Jim Cowie, Joe Guthrie, and Louise Guthrie. Lexical disambiguation using simulated annealing. In
Proceedings of COLING-92,
1992.
Encyclopedia of
2005.
[29] Debabrata Dey and Sumit Sarkar. A distance-bases approach to entity reconciliation in heterogeneous databases.
14(3):567582, 2002.
Numerische Mathematik,
[31] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, K.S. McCurley, S. Rajagopalan, A. Tomkins, J.A. Tomlin, and J.Y. Zienberer. A case for automated large scale semantic annotation. 1(1), 2003.
Web
Semantics,
[32] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey.
Machine read-
Intelligence,
[35] Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in knowitall. In
Proceedings of the
2004.
[36] Dayne Freitag and Andrew McCallum. Information extraction with hmm structures learned by stochastic optimization. In
AAAI,
2000.
BIBLIOGRAPHY
167
[37] Dayne Freitag and Andrew Kachites McCallum. with hmms and shrinkage. In
Information extraction
[38] William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In
Language,
1992.
[39] Alexander Gelbukh, Grigori Sidorov, and San-Yong Han. Evolutionary approach to natural language word sense disambiguation through global coherence optimization.
2(1):11
19, January 2003 2003. [40] Osamu Gotoh. An improved algorithm for matching biological sequences.
162:705708, 1982.
[41] Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing.
Representation, Kluwer,
[42] Siegfried Handschuh, Steen Staab, and Fabio Ciravegna. S-cream: Semiautomatic creation of metadata.
Annals
of Discrete Mathematics,
53, 1992.
[44] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In
[45] Karen Sparck Jones. A statistical interpretation of term specicity and its application in retrieval.
Journal of Documentation,
Proceedings
2003.
[47] A. Kilgarri and J. Rosenzweig. Framework and results for english senseval.
34:1548, 2000.
Naval
ACM
Computing Surveys,
PhD
BIBLIOGRAPHY
168
Proceedings
2000.
Proceedings
1986.
[54] Vladimir Levenshtein. Bynary codes capable of correcting deletions, insertions and reversals.
163(4):845848, 1965.
Proceed-
1998.
[56] Dimitrios Mavroeidis, George Tsatsaronis, Michalis Vazirgiannis, Martin Theobald, and Gerhard Weikum. Word sense disambiguation for exploit-
Knowledge Discovery in
3721/2005:181192, 2005.
Terminology,
pages 271278. John Benjamins, 2001. [58] Luke K. McDowell and Michael Cafarella. extraction with ontosyphon. In Ontology-driven information
extraction from unstructured, ungrammatical data sources on the world wide web.
10(3):211226, 2007. [60] Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In
Proceedings of
the HLT/EMNLP,
2005.
[61] Rada Mihalcea and Dan Moldovan. A method for word sense disambiguation of unrestricted text. In
Proceedings
2004. Derek
Wordnet:
3(4):235244, 1990.
BIBLIOGRAPHY
169
[64] Steven N. Minton, Claude Nanjo, Craig A. Knoblock, Martin Michalowski, and Matthew Michelson. A heterogeneous eld matching method for record linkage. In
Mining,
2005.
[65] Alvaro Monge and Charles Elkan. The eld matching problem: Algorithms and applications. In
Advances
in Math,
[68] Roberto Navigli and Mirella Lapata. Graph connectivity measures for unsupervised word sense disambiguation. In
Proceedings of ICJAI,
2007.
[69] Saul Needleman and Christian Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[71] Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 2001.
[72] Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2003.
[73] Ted Pedersen, Serguei V.S. Pakhomov, Siddharth Patwardhan, and Christopher G. Chute. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3):288–299, 2007.
[74] Jakub Piskorski and Marcin Sydow. Usability of string distance metrics for name matching tasks in Polish. 2007.
[75] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, A. Kirilov, and M. Goranov. Towards semantic web information extraction. In Proceedings of the Human Language Technology Workshop at the 2nd International Semantic Web Conference (ISWC), 2003.
[76] James Pustejovsky, José Castaño, Brent Cochran, Maciej Kotecki, Michael Morrell, and Anna Rumshisky. Extraction and disambiguation of acronym-meaning pairs in Medline. 2001.
[77] James Pustejovsky, José Castaño, Brent Cochran, Maciej Kotecki, and Michael Morrell. Automatic extraction of acronym-meaning pairs from Medline databases. Medinfo, 2001.
[78] Roy Rada, Hafedh Mili, Ellen Bicknell, and Maria Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30, 1989.
[79] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995.
[80] Uwe Reyle and Jasmin Saric. Ontology driven information extraction. 2001.
Proceedings of
Proceedings
1998.
1(4):191–198, 1999.
[88] J. R. Ullmann. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. The Computer Journal, 20(2):141–147, 1977.
[89] Alexandros G. Valarakos, Georgios Sigletos, Vangelis Karkaletsis, Georgios Paliouras, and George A. Vouros. A methodology for enriching a multi-lingual domain ontology using machine learning. In Proceedings of the Workshop "Text Processing for Modern Greek: From Symbolic to Statistical Approaches", 6th International Conference of Greek Linguistics, 2003.
[90] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[91] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168–173, 1974.
[92] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[93] William Winkler and Y. Thibaudeau. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. decennial census. Technical report, Bureau of the Census, Washington, D.C., 1991.
[94] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, 1994.
[95] Stuart Yeates. Automatic extraction of acronyms from text. In Proceedings of the New Zealand Computer Science Research Students' Conference, University of Waikato, 1999.
[96] Stuart Yeates, David Bainbridge, and Ian Witten. Using compression to identify acronyms in text. In