INFORMATION ENGINEERING
XVIII CYCLE
Administrative Seat
Università degli Studi di MODENA e REGGIO EMILIA
Advisor:
Chiar.mo Prof. Paolo Tiberio
Acknowledgments
Introduction
6 Multi-version management and personalized access to XML documents
6.1 Temporal versioning and slicing support
6.1.1 Preliminaries
6.1.2 Providing a native support for temporal slicing
6.2 Semantic versioning and personalization support
6.2.1 The complete infrastructure
6.2.2 Personalized access to versions
6.3 Related work
6.3.1 Temporal XML representation and querying
6.3.2 Personalized access to XML documents
6.4 Experimental evaluation
C Proofs
C.1 Proofs of Chapter 1
C.2 Proofs of Chapter 3
C.3 Proofs of Chapter 4
C.4 Proofs of Appendix B
List of Figures
4.1 The XML data collections used for experimental evaluation
4.2 DBLP Test-Collection Statistics
4.3 Pattern matching results
4.4 Summary of the discussed cases
4.5 Behavior of the isCleanable() and isNeeded() functions
4.6 Performance comparison for unordered tree inclusion
I would like to begin this thesis with some words of gratitude to all the people who helped me and were close to me during my Ph.D. experience.
First of all, the people who made all this possible: my supervisor Paolo Tiberio, an always stimulating and incredibly kind person who gave me the possibility of working on the topics that interested me in the Information Systems research group, and Federica Mandreoli. Indeed, to thank you properly, Federica, I would have to write several pages... what could I say in a few words? Ever since my master thesis, you have been (and are) an amazing supervisor, an ideal co-author, a colleague with whom working is more a pleasure than just “work”, and a good friend. Thank you for showing me the beauty of research and for all the interesting discussions we had; it is always incredible for me to see your “multi-tasking” brain getting all those brilliant ideas, even when you are away from university doing something else, such as driving your car back to Bologna or taking care of your sons...
I would like to thank all my colleagues and coauthors (and friends) at
ISGroup: thank you Enrico and Mattia, for all the enjoyable moments we have spent and the remarkable results we achieved together; and thank you Simona: the cooperation with you has only just started, but you instantly left your mark in our group (and on me) with the exceptional enthusiasm, wonderful pleasantness and high-quality work that distinguish you. Many thanks also to
Pavel Zezula and Fabio Grandi, working with you is always a truly inspiring
experience. Thank you all, this thesis would not have been as it is without
your precious collaboration!
Special thanks to Sonia Bergamaschi and Domenico Beneventano, who have always been friendly and helpful to me on many occasions, to all my
friends at LabInfo and “lunch mates”, Francesco, Luca, Maurizio, Robby...
Last but not least, to my parents: A huge “Thank You” for the constant
support you provided in all these years!
January 25, 2006
Riccardo Martoglia
Introduction
the user queries with respect to the different documents available in a collection. Such techniques first exploit a reworking of the documents’ schemas (schema matching); then, with the extracted information, the structural components of the query are interpreted and adapted (query rewriting). The key to good effectiveness is to exploit the right meaning of the terminology employed in the schemas. To this end, we propose a further service for automatic structural disambiguation which can prove valuable in enhancing the effectiveness of the matching (and rewriting) techniques. Indeed, the presented approach is completely generic and versatile and can be used to make explicit the meaning of a wide range of structure-based information, such as XML schemas, the structures of XML documents, web directories, and even ontologies. All the discussed services have been implemented in our XML S3MART system, which includes the STRIDER disambiguation component.
Finally, in Chapter 6, we also consider the versioning aspect of XML management and deal with the problem of managing and querying time-varying multi-version XML documents. Indeed, as data changes over time, the possibility to deal with historical information is essential to many computer applications, such as accounting, banking, law, medical records and customer relationship management. The central issue of supporting temporal versioning, as of most temporal queries in any language, is time-slicing the input data while retaining period timestamping. A time-varying XML document records a version history, and temporal slicing makes the different states of the document available to the application needs. Standard XML query engines are not aware of temporal semantics, and this makes it more difficult to map temporal XML queries into efficient “vanilla” queries and to apply query optimization and indexing techniques particularly suited to temporal XML documents. In the light of these facts, in a completely general setting, we first propose a native solution to the temporal slicing problem, addressing the question of how to construct a complete XML query processor supporting temporal querying. Then, we focus on one of the most interesting scenarios in which such techniques can be successfully exploited, the eGovernment one. In this context, we present how the slicing technology can be adapted and exploited in a complete normative system in order to provide efficient access to repositories of temporal XML norm texts. Further, we propose additional techniques to support personalized access to them.
Part I
Pattern Matching
for Plain Text
Chapter 1
Approximate (sub)sequence
matching
chapter. Chapters 2 and 3 will then provide a much more detailed analysis
of how such techniques can be exploited in real applications. The first sce-
nario is the one of EBMT (Example-Based Machine Translation), one of the
most promising paradigms for multilingual document translation. An EBMT
system translates by analogy: it is given a set of sentences in the source lan-
guage (from which one is translating) and their corresponding translations
in the target language, and uses those examples to translate other, similar
source-language sentences into the target language. Large corpora of bilin-
gual text are maintained in a database, known as translation memory. Due to
the high complexity and extent of languages, in most cases it is rather difficult
for a translation memory to store the exact translation of a given sentence.
Thus, an EBMT system proves to be a useful translator assistant only
if most of the suggestions provided are based on similarity searching rather
than exact matching. Besides EBMT, there are many other motivating ap-
plications, such as syntactical document similarity search and independent
sentence repositories correlation. Syntactical document similarity search in-
volves the comparison of a query document against the data documents so
that some relevance between the document and the query is obtained. For
this purpose, documents are usually broken up into more primitive units
such as sentences and inserted into a database. When a document is to be
compared against the stored documents, only the documents that overlap at
the unit level will be considered. This type of document similarity can be
exploited both for copy detection [122] and for similar document retrieval services, such as the one offered by the digital library CiteSeer [80]. The corre-
lation of independent sentence repositories is a prerequisite of applications
such as warehousing and mining which analyse textual data. In this context,
correlation between the data should be based on approximate joins which also take flexibility in specifying sentence attributes into account.
We argue that the kind of similarity matching useful for most applications
should go beyond the search for whole sequences. The similarity matching
we refer to attempts to match any parts of data sequences against any query
parts. Although complex, this kind of search enables the detection of similar-
ities that could otherwise remain unidentified. Even if some works in the literature
address the problem of similarity search in the context of information ex-
traction, we are not aware of works related to finding syntactic similarities
between sequences. In particular, we are not aware of solutions fitting into
a DBMS context, which represents the most common choice adopted by the
above-cited applications for managing their large amounts of textual data.
In this chapter, we propose this kind of solution, based on a purely syntac-
tic approach for searching similarities within sequences [90]. The underlying
similarity measure is exploitable for any language, since it is based on the similarity between sequences of terms: the parts closest to a given one are those which maintain most of its original form and contents.
Applying an approximate sub²sequence matching algorithm to a given query
sequence and a collection of data sequences is extremely time consuming.
Efficiency in retrieving the most similar parts available in the sequence repos-
itory is ensured by exploiting filtering techniques. Filtering is based on the
fact that it may be much easier to state that two sequences do not match
than to tell that they match. We introduce two new filters for approximate sub²sequence matching which quickly discard sequences that cannot match, efficiently ensuring no false dismissals and few false positives. Section 1.1 introduces the foundation, i.e. the similarity measure and the filters, of our approximate sub²sequence matching.
As far as the matching processing is concerned, we chose a solution that
would require minimal changes to existing databases. In Section 1.2 we show
how sequence similarity search can be mapped into SQL expressions and
optimized by conventional optimizers. The immediate practical benefit of
our techniques is twofold: approximate sub²sequence matching can be widely
deployed without changes to the underlying database and existing facilities,
like the query optimizer, can be reused, thus ensuring efficient processing.
Finally, in Section 1.3 we assess and evaluate the results of the conducted
experiments and, in Section 1.4, we discuss related work on approximate
matching.
1.1.1 Background
The problem of sequence matching has been extensively studied in the lit-
erature as sequences constitute a large portion of data stored in computers.
q-grams at the beginning and the end of the sequence can have fewer than
q terms from σ, new terms “#” and “$” not in the term grammar are in-
troduced, and the sequence σ is conceptually extended by prefixing it with
q − 1 occurrences of “#” and suffixing it with q − 1 occurrences of “$”.
The filtering techniques basically take the total number of q-gram matches
and the position of individual q-gram match into account:
Proposition 1.1 (Count Filtering) Consider sequences σ1 and σ2 . If σ1
and σ2 are within an edit distance of d, then the cardinality of Gσ1 ∩ Gσ2 ,
ignoring positional information, must be at least max(|σ1 |, |σ2 |)−1−(d−1)∗q.
Proof and explanations of the above filters can be found in [126, 127].
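For concreteness, the q-gram extension and the count-filtering test can be rendered as in the following sketch (ours, with illustrative method names; not the system’s actual code):

import java.util.*;

public class QGramFiltering {
    // Extended positional q-grams: the sequence is prefixed with q-1 "#"
    // and suffixed with q-1 "$" terms, then every window of q terms is
    // emitted together with its starting position.
    static List<String> positionalQgrams(List<String> seq, int q) {
        List<String> ext = new ArrayList<>();
        for (int i = 0; i < q - 1; i++) ext.add("#");
        ext.addAll(seq);
        for (int i = 0; i < q - 1; i++) ext.add("$");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + q <= ext.size(); i++)
            grams.add(i + ":" + String.join(" ", ext.subList(i, i + q)));
        return grams;
    }

    // Count filtering (Proposition 1.1): q-gram matches are counted
    // ignoring positional information (multiset intersection).
    static boolean countFilter(List<String> s1, List<String> s2, int q, int d) {
        Map<String, Integer> bag = new HashMap<>();
        for (String g : positionalQgrams(s1, q))
            bag.merge(g.substring(g.indexOf(':') + 1), 1, Integer::sum);
        int matches = 0;
        for (String g : positionalQgrams(s2, q)) {
            String key = g.substring(g.indexOf(':') + 1);
            Integer c = bag.get(key);
            if (c != null && c > 0) { matches++; bag.put(key, c - 1); }
        }
        return matches >= Math.max(s1.size(), s2.size()) - 1 - (d - 1) * q;
    }
}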
import java.util.Arrays;

// sub²Position filtering (Figure 1.1): S1 and S2 are token sequences, w is
// the window size and c the counter threshold; the function returns true as
// soon as a counter reaches c (the pair survives filtering), and false if
// the pair can be safely pruned.
static boolean sub2PosFilter(String[] S1, String[] S2, int w, int c) {
    int[] S1c = new int[S1.length + w];           // counters
    boolean[] S1lim = new boolean[S1.length + w]; // increment/decrement limitation check
    for (int p2 = 0; p2 < S2.length; p2++) {      // outer sentence cycle
        if (p2 - w >= 0) {
            Arrays.fill(S1lim, false);
            for (int p1 = 0; p1 < S1.length; p1++) {      // inner sentence cycle
                if (S1[p1].equals(S2[p2 - w])) {
                    for (int i = 0; i < w; i++) {         // decrement cycle
                        if (!S1lim[p1 + i]) { S1c[p1 + i]--; S1lim[p1 + i] = true; }
                    }
                }
            }
        }
        Arrays.fill(S1lim, false);
        for (int p1 = 0; p1 < S1.length; p1++) {          // inner sentence cycle
            if (S1[p1].equals(S2[p2])) {
                for (int i = 0; i < w; i++) {             // increment cycle
                    if (!S1lim[p1 + i]) { S1c[p1 + i]++; S1lim[p1 + i] = true; }
                }
                if (S1c[p1] >= c) return true;
            }
        }
    }
    return false;
}
Consider the following sequence (here, sentence) pair, where equal words are emphasized: [sentence pair of Example 1.1 omitted]
Example 1.2 Let minL = 4, d = 1. Then, the threshold c for sub²Position filtering is 3 and the window size w is 4. Consider the sentence pair of Example 1.1, where the first four terms of σ2 are not found in σ1. From p2 = 5 to p2 = 7, the algorithm works as follows (the marked term is the current term in the outer cycle; the number of √ symbols over the term at position p1 in the inner cycle corresponds to the value of σ1c[p1]; equal terms are in italics): [counter illustration for p2 = 5 . . . 7 omitted]
Example 1.3 Let minL = 3, d = 1. Then, the threshold c for sub²Position filtering is 2 and the window size w is 3. Consider the following sequence
(sentence) pair which represents a wrong candidate answer for the standard
position filter since it contains two close and equal terms:
σ1 : XPaint is very easy to use.
σ2 : Is XPaint a bitmap processing software?
The filter counters are updated for the two first (common) terms, and the √ counter marks over σ1 evolve as the outer cycle advances over σ2: [illustration omitted]
As you can see, some counters reach the threshold (i.e. 2), but not σ1c[1], which is the counter of the only term equal to the current term in the outer cycle. So, sub²PosFilter(σ1, σ2) returns FALSE, correctly pruning out even this pair of sentences. □
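Using the Java rendering of Figure 1.1 given above, Example 1.3 can be replayed directly (a sketch, assuming whitespace tokenization with punctuation removed and case folded):

// w = 3 and c = 2, as computed in Example 1.3
String[] s1 = {"xpaint", "is", "very", "easy", "to", "use"};
String[] s2 = {"is", "xpaint", "a", "bitmap", "processing", "software"};
System.out.println(sub2PosFilter(s1, s2, 3, 2));   // prints false: the pair is pruned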
As far as the correctness of the filter is concerned, we provide the following
theorem.
analysed for approximate sub²sequence matches. The sub²Position filtering algorithm shown in Figure 1.1 is implemented by means of a UDF function sub2Position(S1,S2,minL,d). The sub²Count filtering is implemented by comparing the number of q-gram matches with the lengths of the involved sequences.
Notice that the structure of the query of Figure 1.2 requires that at least
one q-gram is shared by the two sequences. The fact that the parts of the
sequences analyzed for filtering purposes cannot benefit from extended q-grams (as already outlined in Subsection 1.1.2) also influences the size of the q-grams stored in tables Dq and Qq. Indeed, choosing a q-gram size too big with respect to the minimum length minL and/or too small with respect to the number of allowed errors d could imply that a sequence pair shares no q-gram even if the two sequences have some approximate matching parts.
Proposition 1.5 The propositional formula “If σ1 and σ2 have some approximate matching parts then there is at least one q-gram shared by σ1 and σ2” is true if and only if q ∈ [1, ⌊minL/(d + 1)⌋]. For instance, with minL = 4 and d = 1, only q ∈ {1, 2} satisfies the condition.
1.3.2 Implementation
The similarity search techniques described in the previous sections have been
implemented using Java2 and JDBC code; the underlying DBMS is Oracle
9i Enterprise Edition running on a Pentium IV 1.8Ghz Microsoft Windows
XP Pro workstation.
As for query efficiency, by issuing conventional SQL expressions we have
been able to rely on the DBMS standard optimization algorithms; we just
added some appropriate indexes on the q-gram tables to further speed up the
filtering and search processes.
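To give a concrete flavour of such expressions, a count-filtering step in the style of the query of Figure 1.2 can be issued through JDBC as a single SQL statement. The sketch below is purely illustrative: the Dq(sid, qgram, len) and Qq(sid, qgram, len) schema and the method name are our assumptions, not the actual schema, and the bound shown is the whole-matching one of Proposition 1.1; the sub2Position UDF is then applied to the surviving pairs.

import java.sql.*;

public class FilteringQuery {
    static void printCandidatePairs(Connection con, int qSize, int dErr) throws SQLException {
        String sql =
            "SELECT d.sid, q.sid " +
            "FROM Dq d JOIN Qq q ON d.qgram = q.qgram " +   // at least one shared q-gram
            "GROUP BY d.sid, q.sid " +
            "HAVING COUNT(*) >= GREATEST(MAX(d.len), MAX(q.len)) - 1 - ? * ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, dErr - 1);   // (d - 1) ...
            ps.setInt(2, qSize);      // ... * q, as in Proposition 1.1
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    System.out.println("candidate pair: " + rs.getInt(1) + ", " + rs.getInt(2));
            }
        }
    }
}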
As to the computation of sub²sequence matching, we tested the three alternatives described in Section 1.2. Since our experiments do not focus on such computation performance, the ad-hoc algorithm we implemented is a naive one which, for each possible pair of starting terms in the two sequences, performs two nested cycles to compute the matrices of the edit distance dynamic programming algorithm. In particular, given two sequences σ1 and σ2, a minimum length minL, and a distance threshold d, multi-edit-distance(σ1, σ2, minL, d) locates and presents all possible matching parts along with their distance. This function will be presented in more depth in Chapter 2, in the context where we first tested it, i.e. our complete EBMT system EXTRA. In order to implement the two queries
for whole and subsequence matching we further extended the database by
introducing a new table storing all possible starting and ending positions
for each part of each sentence and by extending the q-gram tables with all
possible extended q-grams. Despite the high computational complexity of the implemented algorithm, we noticed that its performance was better than that of the two full-database solutions. Suffice it to say that the benefits of using q-gram filtering were completely nullified by the enormous overhead in generating, storing and exploiting the new q-grams: not only because of their larger quantity (approximately 2(|S| − minL)(q − 1) more for each sentence) but also because of the complications of considering each extended q-gram in the right place. We measured the execution time of these solutions, but already from the first tests they turned out to be more than ten times slower than the naive algorithm solution, and so in the following discussions they will not be
further considered.
[Figure: candidate set sizes (log scale) for the Real answer set, Sub²Pos filtering, Sub²Count filtering and the Cross Product, under the settings MinL=3, K=1, Q=1; MinL=6, K=1, Q=3; MinL=4, K=2, Q=1; MinL=4, K=1, Q=2, on the two data sets.]
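Although multi-edit-distance is fully presented only in Chapter 2, the naive computation described above can be sketched as follows (a hypothetical reconstruction: the policy of reporting every pair of parts of length at least minL within distance d is our reading of the description):

import java.util.*;

public class MultiEditDistance {
    // For each pair of starting terms, the edit distance matrix of the two
    // suffixes is computed and every pair of parts of length >= minL whose
    // distance is within d is reported as {start1, end1, start2, end2, dist}.
    static List<int[]> multiEditDistance(String[] s1, String[] s2, int minL, int d) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < s1.length; i++) {
            for (int j = 0; j < s2.length; j++) {
                int n = s1.length - i, m = s2.length - j;
                int[][] dp = new int[n + 1][m + 1];
                for (int a = 0; a <= n; a++) dp[a][0] = a;
                for (int b = 0; b <= m; b++) dp[0][b] = b;
                for (int a = 1; a <= n; a++)
                    for (int b = 1; b <= m; b++)
                        dp[a][b] = Math.min(Math.min(dp[a - 1][b] + 1, dp[a][b - 1] + 1),
                                dp[a - 1][b - 1] + (s1[i + a - 1].equals(s2[j + b - 1]) ? 0 : 1));
                for (int a = minL; a <= n; a++)
                    for (int b = minL; b <= m; b++)
                        if (dp[a][b] <= d)
                            matches.add(new int[]{i, i + a - 1, j, j + b - 1, dp[a][b]});
            }
        }
        return matches;
    }
}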
1.3.3 Performance
In this chapter, we are mainly interested in presenting the new filters perfor-
mance. The main objective of filtering techniques is to reduce the number
of candidate answer pairs. Obviously, the more effective the filters are, the closer the size of the candidate set gets to the size of the answer set. In
order to examine how effective each filter and each combination of filters is
we ran different queries, enabling different filters each time, and measured
the size of the candidate set with respect to the cross product of the query
sentences and data sentences. Another key aspect of filtering techniques is
their efficiency. Indeed, for a filter to be useful its response time should not
be greater than the processing time of just the match algorithm on the whole
cross product. In order to examine how efficient each combination of filters
and matching algorithm is, we ran different queries, enabling different filters
each time, and measured the response time also considering the scalability
for the two most meaningful cases.
Effectiveness of filters
We conducted experiments on both the data sets introduced in Section 1.3.1.
Performance trends were observed under the parameters that are associated
with our problem, that is the minimum length minL, the number of allowed
errors d, and the values of q-gram size allowed by Prop. 1.5. We started by
considering the most meaningful minimum length minL that, in most cases,
[Figure 1.4: response times (seconds) for MED alone and for the combinations Sub²Count > MED, Sub²Pos > MED and Sub²Count + Sub²Pos > MED, on the two data sets.]
Efficiency of filters
Figure 1.4 presents the response time of the experiments detailed in the effectiveness of filters paragraph. In particular, it shows the times required to get the answer sentence pairs for each possible combination of filters and matching algorithm.
[Figure: response-time scalability (seconds, log scale) of MED and the filter combinations at 25%, 50% and 100% of the data sets.]
The reason why this strong performance is not evident from the graphs is that, in the total times shown, the linear-growing MED time prevails.
Chapter 2
Approximate matching
for EBMT
[Figure: overview of the EXTRA workflow: the suggestion search process matches a text file in the source language against the translation memory and returns translation suggestions in the target language; pairs of sentences feed the TM update.]
Section 2.2 we present our approach and discuss the reasons behind it. In
particular, we present an overview of the processes involved, i.e. document
analysis (analysed in detail in Section 2.3) and suggestion search, to which Section 2.4 is dedicated. Finally, Section 2.5 presents a detailed discussion
on the results of the experiments performed.
employed structure(s), such as sequences or trees, and the amount and the
kind of linguistic processing performed [41]. For instance the approach in [24]
stores examples as strings of words together with some alignment and infor-
mation on equivalence classes (numbers, weekdays, etc.). Other approaches
[42, 41] are more language- and knowledge-dependent and perform some text
operations (e.g. POS tagging and stemming) on the sentences in order to
store more structural information about them. Finally, approaches such as
[118] perform advanced parsing on the text and choose to represent the full
complexity of its syntax by means of complex syntactic tree structures. It should be noticed that the more complex an approach and the consequent logical representation are, the more information can be exploited in the search for useful material in the translation memory; however, such complexity is paid for in terms of huge computational costs for the creation, storage and matching/retrieval algorithms (see next paragraph). Furthermore, a strong advantage of EBMT should be the ability to develop systems automatically despite the lack of external knowledge resources [124]; instead, some of the cited approaches assume the availability of particular knowledge strictly related to the language involved, such as the equivalence classes.
However, such “reference” collections have never been defined in the Ma-
chine Translation field and, more generally, there are no universally accepted
ways to automatically and objectively evaluate MT systems. In [97], the
authors propose a way to measure the closeness between the output of an
MT system and a “reference” translation by measuring it in proportion to
the number of matching words. The “BLEU” measure [108] enhances this technique by partially considering the order of the words: it counts the number of equal q-grams. The above techniques require the existence of a complete set of reference (hand-made) and automatic (machine-generated) translations to be applicable. Therefore, they are clearly not completely ap-
plicable to EBMT systems, which, by definition, do not generally provide a
complete translation as their output.
[Figure: the two processes of EXTRA: document analysis turns pairs of translations into sequences of tokens (sentS, σ, sentT) stored in the translation memory and the text to be translated into queries (sentSq, σq); suggestion search then returns the translation suggestions.]
The more an example maintains the same words in the same positions of the unit of text to be translated, the more its translation is a good suggestion. For these reasons, we argue that the classical IR models based on bag-of-words approaches are not suitable for our purposes, as they do not take into account any notion of word order and position (e.g. both the sentences “The cat eats the dog” and “The dog eats the cat” would be represented by the set {dog, eat, cat}). Instead, the examples can be logically represented as sequences of representative items named tokens. The knowledge concern-
ing the position of the tokens in the example can thus be exploited through
the edit distance [103], the metric we presented in Chapter 1 for general
sequence matching and which constitutes the foundation of the suggestion
search process in EXTRA.
Such options are incremental, i.e. stemming also includes punctuation re-
moval and WSD also includes stemming, and produce different logical rep-
resentations with an increasing level of resilience. By switching between the
first two options, the logical representation of each sentence ranges from the
sentence itself with the exception of the punctuation to the sequence of its
most meaningful terms with the exception of their inflections. In a sug-
gestion search perspective, notice that in the former representation all the
words in the sentence are equally important while the latter disregards com-
mon words such as articles and so on while focusing on the most meaningful
terms. On the other hand, translators usually find it easier to translate com-
mon words rather than far-fetched terms, thus the latter representation helps
to find more useful suggestions than the former representation as only the
differences among the fundamental terms are important. Both the above
mentioned representations disregard semantic aspects, as they are the prod-
uct of a syntactic analysis where the meanings of terms are not taken into
account. The syntactic representation of a sentence is thus a sequence of
tokens, i.e. the stemmed version of its worth surviving terms (see the upper
block of Figure 2.3).
Finally, with the Word Sense Disambiguation option, we perform a se-
mantic analysis to disambiguate the main terms in the sequence, thus en-
abling the comparison between meanings instead of terms. For instance,
consider the technical context of a computer graphic software manual: In
such a context, the author could refer to the artistic creation of the user
as both “image” or “picture”, which should be considered as two equivalent
[Figure 2.3: The syntactic and semantic analysis and the actual WSD phase: the lists of nouns (e.g. “cat”, “mouse”) and verbs (e.g. “hunt”) are extracted, WSD disambiguates them into synset identifiers, and the resulting sequence of tokens is, for instance, “white *N-1788952* *V-903354* *N-1993014*”.]
words. On the other hand, he could describe the picture of a “mouse” from
Cinderella, obviously meant as an animal, which should not be mistaken for the “mouse”, the electronic device, used to digitally paint it. By employing
WSD, different terms which, in the context of the sequences they belong
to, have the same sense, can be judged by the comparison scheme to be
the same. On the other hand, terms which, used in different contexts, have
different meanings can be considered as distinct tokens. By reconsidering
the previous example, “image” and “picture” would be considered to be the
same, while the two instances of “mouse” would be considered different. The
WSD techniques we designed can be categorized as relational information,
knowledge-driven ones, exploiting one of the most known lexical resources for
the English language: WordNet [100, 101]. Specifically, we devised two com-
pletely automatic techniques for the disambiguation of nouns and verbs that
ensure good effectiveness while not requiring any additional external data
or corpora besides WordNet and an English language probabilistic Parts-Of-
Speech (POS) tagger. Note that we disregarded categories such as adjectives and adverbs, which are usually less ambiguous and play a marginal role in the meaning of the sentence.
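As a minimal sketch of the token rewriting performed after disambiguation (the synset map is assumed to have been produced by the WSD techniques over WordNet; the names are illustrative and reproduce the example of Figure 2.3):

import java.util.*;

public class SemanticTokens {
    // Disambiguated nouns and verbs are replaced by their synset identifier;
    // the remaining terms survive unchanged.
    static List<String> toSemanticTokens(List<String> stemmed, Map<String, String> synsetOf) {
        List<String> tokens = new ArrayList<>();
        for (String term : stemmed)
            tokens.add(synsetOf.getOrDefault(term, term));
        return tokens;
    }
    // e.g. with synsetOf = {cat=*N-1788952*, hunt=*V-903354*, mouse=*N-1993014*},
    // [white, cat, hunt, mouse] becomes [white, *N-1788952*, *V-903354*, *N-1993014*]
}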
The syntactic analysis precedes the actual WSD phase which receives
a stemmed sentence as input and produces a disambiguated version of it,
translated. For example, setting d = 0.3 would allow 3 errors w.r.t. a query
of length 10, 6 errors w.r.t. a query of length 20, and so on.
Efficiency in retrieving the most similar sequences available in the trans-
lation memory is once again ensured by exploiting filtering techniques (see
Section 1.1).
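Assuming the relative threshold is applied as ⌊d · |query|⌋, which is consistent with the example above, the conversion is simply:

// illustrative helper: relative distance threshold to allowed errors
static int allowedErrors(double d, int queryLength) {
    return (int) Math.floor(d * queryLength);   // d = 0.3, length 10 -> 3 errors
}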
language sentence, i.e. identify the right part sentT [iσ . . . j σ ], a special align-
ment technique is needed where the sentence in the source language acts
as “bridge” between the sequence and the sentence in the target language.
Such a problem, which we call word alignment, has been addressed in [92]. Furthermore, in order to preserve efficiency, we exploit the new filtering techniques described in Section 1.1. In EXTRA, such filters are adapted
to the context of a relative distance threshold but are based on the same
properties which have been previously described.
actual translation. This is certainly true for the results of whole matching, as they are suggestions for the whole sentence to be translated. The case of sub²sequence matching, which suggests parts of the TM sentences that match parts of the query sentence, is different. The suggestions concern parts of the query sentence starting in different positions and have variable lengths.
Thus, one possible way to prepare the suggestions for presentation is to group
them by starting point, i.e. considering the position of the first word of the
query segment for which a translation is proposed. Moreover, together with
the edit distance value, the length is also a factor that can affect the time
required to complete the actual translation. Indeed, in most cases, it takes longer to use a large number of short suggestions than to use a smaller number of longer ones, even when the longer ones are less similar to the involved parts than the short ones. Thus, for each starting point the suggestions can
be ordered by length and then by edit distance values. Since, for each starting point, the longest suggestions contain the other suggestions, another possibility is to output only the longest ones and to sort them on the basis of the edit distance values. Contained matches are in fact usually not useful, since they
would typically only slow the work of the translator down while giving no
additional hint. In this case, we avoid computing the unnecessary suggestions
by exploiting an ad-hoc algorithm. An algorithm that implements this idea
is shown in Appendix A.2. The impact of document analysis, of the type of
suggestions, and of the ranking on the translation process has been subject
of several experiments. A detailed account of the results that we obtained is
presented in Section 2.5.
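An illustrative rendering of this presentation strategy follows (the Suggestion type and the method names are assumptions for the sketch; the actual algorithm is the one of Appendix A.2):

import java.util.*;

public class SuggestionRanking {
    record Suggestion(int start, int end, double dist, String text) {
        int length() { return end - start + 1; }
    }

    // For each starting point of the query segment, only the longest
    // suggestion is kept (ties broken by smaller edit distance), so that
    // contained matches are discarded; survivors are sorted by distance.
    static List<Suggestion> rank(List<Suggestion> all) {
        Comparator<Suggestion> longerThenCloser =
            Comparator.comparingInt(Suggestion::length).reversed()
                      .thenComparingDouble(Suggestion::dist);
        Map<Integer, Suggestion> bestPerStart = new TreeMap<>();
        for (Suggestion s : all)
            bestPerStart.merge(s.start(), s,
                    (a, b) -> longerThenCloser.compare(a, b) <= 0 ? a : b);
        List<Suggestion> ranked = new ArrayList<>(bestPerStart.values());
        ranked.sort(Comparator.comparingDouble(Suggestion::dist));
        return ranked;
    }
}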
[Figure: the matching setup: document analysis fills the translation memory (a permanent DB table); the new document to be translated is loaded into an on-the-fly DB table, and parts of sentences undergo sub² matching.]
Query sentence: Position the 4 clips (D) as shown and at the specified dimen-
sions.
Similar sentence in the source language: Position the 4 clips (A) as shown
and at the specified distance.
Corresponding sentence in the target language: Posizionare le 4 mollette
(A) come indicato e alla distanza prevista.
Query sentence: On completion of electrical connections, fit the cooktop in
place from the top and secure it by means of the clips as shown.
Sentence containing a similar part: After the electrical connection, fit the
hob from the top and hook it to the support springs, according to the illustration.
Corresponding sentence in the target language: Dopo aver eseguito il collegamento elettrico, montare il piano cottura dall’alto e agganciarlo alle molle di supporto come da figura.
Suggestion in the target language: collegamento elettrico, montare il piano
cottura dall’alto
Sentence containing a similar part: Secure it by means of the clips.
Suggestion in the target language: Fissare definitivamente per mezzo dei
ganci.
also to test the efficiency of the system in a larger and, thus, more challeng-
ing scenario. Furthermore, Collection2 contains both English sentences and their Italian translations; since the sentences are available in two languages, the system could also be tested in aligning them and giving suggestions in the target language.
Figure 2.6: (a) coverage and (b) mean number of suggestions per sentence:

(a) Coverage:
               Collection1                 Collection2
    d, dSub    0.1      0.2      0.3       0.1      0.2      0.3
    Sub        25.5%    62.3%    58.0%     2.1%     2.6%     1.9%
    Whole      8.0%     12.0%    17.8%     95.7%    96.7%    97.6%

(b) Suggestions per sentence:
               Collection1            Collection2
    d, dSub    Whole    Sub           Whole    Sub
    0.1        1.1      2.0           2.6      1.5
    0.2        1.1      11.4          4.5      12.5
    0.3        1.1      15.2          10.1     15.875
Coverage
The content of the translation memory represents the history of past transla-
tions. When a translator is going to translate texts concerning subjects that
have already been dealt with, the EBMT system should help him/her save
time by exploiting the potentialities of the translation memory contents.
In order to quantify the ability of EXTRA to retrieve suggestions for the submitted text, we propose a new measure, named coverage, which corresponds to the percentage of query sentences for which at least one suggestion, obtained from either a whole or a sub match, has been found in the translation memory. Such a measure is a good indicator of the effectiveness of a suggestion search process only if there is a good correlation between the text to be translated and the translation memory, as is the case for our collections. Moreover, it proves to be useful in the comparison of different systems, as shown in Section 2.5.5. Figure 2.6-a shows that our search techniques ensure a good coverage for the considered collections, while Figure 2.6-b shows
the high number of retrieved suggestions per sentence and demonstrates the
wide range of proposed translations. The good size and consolidation of Col-
lection2 implies a very high level of coverage (over 97%, even with a very
restrictive setting of d and dSub = 0.1), where most of the suggestions con-
cern whole matches. As to Collection1, which is relatively small and not so well established, notice that, as we expected, sub² matching covers a remarkable percentage of query sentences and becomes essential to further exploit the translation memory potentialities. In particular, by setting the distance thresholds d and dSub to at least 0.2, the obtained coverage is more than 70% of the available sentences. Furthermore, notice that, while the mean number
of whole suggestions per sentence is generally particularly sensitive to the
TM size (e.g. the possibilities of finding similar whole sentences in small
TMs are quite low), the mean number of partial suggestions is sufficiently
high for both scenarios (see Figure 2.6-b), thus proving the good flexibility
of our matching techniques.
[Figure: the translation process simulation model: sentences are processed by the translators until exit.]
deviation (σ); for the other parameters we simply specify the corresponding value (val). All the time values are expressed in seconds. The number of translators (N_trans) is 1 by default, while the document to be translated consists of the (N_sent) query sentences from Collection1. The amount of time needed to perform a translation is proportional to the length (in words) of the sentences to be translated. For this reason we added a parameter, the base word translation time t_word_base, representing the time needed to translate one word whether or not the translator is confident in its translation. Depending on the translator's experience and on the difficulty of the text (factors described by the probability of word look-up parameter P_look), certain words may require an additional amount of time, which we call word look-up time t_word_look: it is the time needed to look up the word in a dictionary and to decide on the right meaning and translation. The default setting is that, on average, 1 out of 20 words (5%) requires such additional time. On the other hand, there are a number of very common and/or frequent words whose translation time can be less than the base translation time, since the translator has already translated them in the preceding sentences and is very confident about their meaning. Therefore, the word recall saved time parameter t_word_rec models such a time saving for each of the recalled words, while the probability of recalling a word translation P_rec ranges linearly from a minimum (beginning of the translation, less confidence) to a maximum (end of the translation) value.
Let us now specifically analyze the manual translation model: one peculiar parameter is the time required to make the “hand-made” translation coherent in its terminology, avoiding inconsistencies with the previous sentences and with the other translators, if any, working on the same document. For these reasons, such a sentence coherence-check time t_sent_coher is added for each of the translated sentences and ranges from a minimum to a maximum value in proportion to the number of sentences already translated. The maximum value is also proportional to the number of translators working on the same task, since the higher the number of translators working on the translation, the greater the coordination time needed to obtain a final coherent work. The following formula summarizes all the contributions to the total document translation time for one non-assisted translator, while setting aside all probability considerations:
\[
t_{manual} = \sum_{sentences} \Big( t_{sent\_coher} + \sum_{words} \big( t_{word\_base} + t_{word\_look} - t_{word\_rec} \big) \Big) \tag{2.1}
\]
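Folding the probabilistic parameters into Eq. (2.1) as expectations (this averaging is our assumption; the formula itself sets probability considerations aside), a per-document estimate can be computed as:

public class ManualTranslationModel {
    // sentenceLengths[i] = number of words of the i-th sentence; pLook and
    // pRec are treated as average probabilities, tSentCoher as an average
    // coherence-check time per sentence.
    static double expectedManualTime(int[] sentenceLengths, double tWordBase,
                                     double tWordLook, double pLook,
                                     double tWordRec, double pRec,
                                     double tSentCoher) {
        double total = 0;
        for (int words : sentenceLengths)
            total += tSentCoher
                   + words * (tWordBase + pLook * tWordLook - pRec * tWordRec);
        return total;
    }
}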
[Figure 2.9: total translation time trends for the three models when varying (a) the number of query sentences N_sent, (b) the maximum number of suggestions to be read N_read, and (c) the probability of look-up P_look.]
mean and maximum translation time per sentence is higher in this scenario. Now, notice that the assisted translation time is slightly more than 2 hours (2 hours and 6 minutes) and is significantly closer to the performance of the team of two translators than to that of the single translator. This is good proof of the significant improvement that can be obtained by employing assisted translation software but, above all, it quantifies the real effectiveness of the EXTRA translation suggestions. In particular, the time required to adapt the suggestions retrieved from past translations is not only significantly lower than the time required to produce the same translations from scratch, but the speed gain is also particularly substantial: the mean sentence translation time is by far the lowest of the three scenarios considered, and this corresponds to the highest per-translator productivity. Furthermore, notice that the maximum time required to translate a sentence in the given query document is also particularly close to the manual single-translator scenario; this is because, even for the query sentences for which many suggestions are available, the time spent in reading them is limited by the maximum number of suggestions parameter.
Figure 2.9 shows the trends of the total translation time for automatic
and/or manual translation obtained by our simulation models by varying
the number of query sentences (N_sent, Figure 2.9-a), the maximum number of suggestions to be read (N_read, Figure 2.9-b), and the probability of look-up (P_look, Figure 2.9-c) parameters. Notice that the variation of the number of
sentences produces a sort of scalability graph, with linear trends for the three
models: The automatic translation trend stands, as expected, between the
two manual ones. Figure 2.9-b demonstrates the trade-off between the time
saved by exploiting the translation suggestions and the time spent in reading
and selecting them: The best trade-off is given by reading (and presenting)
4 or 5 suggestions at the most, therefore such values are the optimal ones
that have been used in the other assisted translation simulations and that
can be used in EXTRA itself in order to deliver a balanced suggestion range
to the user. Finally, Figure 2.9-c shows the trends of total translation time when varying the P_look parameter, which may represent the ability and experience of the translator: in this case, the graph shows that the translation assistance is equally useful for experienced translators (P_look = 0.01) and inexperienced ones (P_look > 0.3).
(a) Whole and sub² matching time (mSec):

    d, dSub    Collection1 Whole   Collection1 Sub²   Collection2 Whole   Collection2 Sub²
    0.1        390                 532                4,719               2,605
    0.2        422                 1,312              6,766               2,957
    0.3        437                 1,950              9,047               2,709

(b) Total matching time scalability (mSec):

                  25%      50%      75%      100%
    Collection1   211      554      1,027    1,734
    Collection2   3,928    5,743    7,541    9,723
experiment, where all the filters were on and stemming was performed on all
the query and the TM sentences. Notice that the graph employs a logarithmic
scale. Disabling stemming, but keeping the filters enabled, produces a total running time of nearly 22 seconds for Collection1 and approximately 185 seconds for Collection2, more than 20 times higher than in the standard configuration; conversely, disabling the filters, but keeping stemming enabled, produces a huge performance loss, with total searching times of more than 2 minutes and 20 minutes, respectively, for the first and second collection. Thus, filters have a great impact on the final execution time of the search algorithms. Enabling all the available whole and sub² filters allows the system to reduce the overall response times by a factor of at least 70.
As to document analysis, it can be extremely useful not only for enhanc-
ing the effectiveness of the retrieved suggestions (see Section 2.5.3) but also
for a clear increase in performance. Notice that the document analysis time
is not included in the total response time, since this is not strictly related to
the search algorithms; such time is shown for the query sentences of the two
collections in subfigure 2.11-b, both for stemming and WSD. In particular,
the “no analysis” time corresponds to the time required to read the docu-
ments, extract their sentences and store them in the DB, along with their q-grams. For instance, this time is 6 seconds for the 400 query sentences in Collection1. By enabling stemming, such time increases by only 2 seconds, thus showing the performance of over 200 sentences per second offered by
our stemming algorithms. The same graph also reports on the WSD analysis
time; such analysis is much more complex and, therefore, the time required
for it is higher than for stemming. However, 10 sentences per second are
(a) Total search time (mSec) with / without filters and stemming:

                      Collection1    Collection2
    Filters + Stem    1,734          9,723
    Stem Off          21,812         185,320
    Filters Off       128,453        1,250,438

(b) Query document analysis time (Sec):

                      Collection1    Collection2
    No analysis       6              7
    Stemming          8              9
    Stemming + WSD    51             54

Figure 2.11: Further efficiency tests: impact of the filtering and document analysis techniques
still processed (approx. 50 seconds in total), and such time is still quite low, especially if we consider that WSD can prove valuable in achieving optimal effectiveness of the search techniques (see Section 2.5.3).
The first series of tests concerns the way past translations are logically
represented and the penalty scheme adopted by the two systems to judge the
differences between the submitted text and the TM content. Starting from
a sentence in the TM, we initially modified (deleted) a word from it to find
out if the systems were still able to identify the match and, if so, how much
the matching score was penalized. Then we modified (deleted) other words
and analyzed the penalty trend. After several tests we came to the following
conclusions:
- the programs identify the different parts up to a certain level of modifications; the penalties given seem to depend not only on the number of modified (deleted) words but also on the length of the original sentences;
- unlike the EXTRA approach, the systems seem to perform no stemming on the TM and query sentences’ words. For example, the systems give an equal penalty to the modification of “graphics” into “graphic” and of “graphics” into “raphics”, while they should penalize the second variation much more heavily, since the new term is completely unconnected to the original one from a linguistic point of view.
In the second series of experiments, we inverted some of the positions of
the words in a given sentence in order to verify if the comparison mechanism is
sensitive to the order of words. In particular, we inverted the terms “graphics
tool” in “tool graphics” and “ease and precision” in “precision and ease”. In
this case, the search algorithms appeared to be order-sensitive, similarly to EXTRA’s edit distance approach, and not simply based on a “bag of words” approach. Further, Trados identified and displayed the moved segments with a specific color.
Finally, we tested the ability of such systems to identify interesting parts
in the stored text. In these experiments, we joined two whole sentences (s1
and s2 in the following) contained in the TM and tried to pre-translate the
newly created sentence in each of the systems. The text was created in three
different ways:
- separation with a conjunction (“s1 and s2”);
algorithms. As to commercial systems, in all three cases Trados was not able
to automatically retrieve any suggestions. In particular, we found out that
the system was not able to dynamically segment the query and/or the exam-
ples in order to find sub-matches. To retrieve the suggestions to our queries,
for example, the user would have two possibilities, both of them quite impractical: to translate in Concordance mode, manually selecting the partial segments to look for in the TM, or, in batch mode, to change the segmentation rules, adding the comma and the conjunction as permanent separators. The only way to automatically retrieve matches between
segments is indeed to insert the interesting parts, already segmented, in the
TM and to split the query sentences in multiple segments. The problem is
that the user does not know which segments could match and adding static
segmentation rules would not be a general solution. Translation Manager
was able to partially solve the three queries we submitted. The Transla-
tion Manager similarity model seems to be more complex than the Trados
one: While Trados does not analyze further unknown sentences, Translation
Manager tries to find more suggestions for the unknown sequences of words
by re-analyzing the sequences of which a segment consists. For example, in the sentence “s1, s, s2” the system first identified the sentence s1, then it specifically restricted the search to the contaminating segment (s). Some
limitations still remain: The system is not able to suggest the interesting
part of the TM sentence that matches a part of the query sentence, but it
just presents the whole sentence to the user. Furthermore, for the segments
to be identified, they must match the majority of a TM sentence, otherwise
they cannot be retrieved by the engine.
exact match (with a similarity greater than 95%) and fuzzy match (from 50% to 95% similarity). Parts with a smaller similarity were categorized as “not found”. The results are shown in Table 2.4, where we also report, for
ease of comparison, the results obtained by EXTRA on the same collection.
For EXTRA we considered the default parameter values and, for the time comparison, we considered the total time for whole and sub² matching; further, the distinction between exact and fuzzy match does not apply to our system. The quantity of exact matches favors Trados, but Translation Manager is able to find almost twice the number of fuzzy matches and ultimately proves to have the most effective similarity search engine of the two commercial systems. On the other hand, the level of coverage provided in EXTRA by our whole and sub² matching algorithms is nearly three times higher and guarantees a much more accurate suggestion retrieval. Finally, notice that the time required for the pre-translation operation is quite different for the three systems. In particular, the EXTRA search algorithms are more complex but do not require more time; indeed, the EXTRA search techniques prove to be the most efficient ones.
Chapter 3
these reasons, copy prevention is arguably not the best choice for the digital library context. Digital libraries are open and shared systems, supposed to be designed to advance and spread knowledge. Nonetheless, as has been recently highlighted in [83], the application of too many and too strict limitations in such a domain often leads to the paradoxical and intolerable result of severely limiting the usefulness, stability, accessibility and flexibility of
digital libraries. Further, it has been noted that, since copies are typically
what we preserve, works that are copy protected are less likely to survive into
the future [21], thus also limiting the preservation of our cultural heritage.
In our opinion, a good trade-off between protecting the information and
ensuring its availability can be reached by using duplicate detection ap-
proaches that allow the free access to the documents while identifying the
works and the users that violate the rules. One of the techniques following
this approach relies on watermark schemes [78], where a publisher adds a
unique signature in a document at publishing time so that when an unau-
thorized copy is found, the source will be known. On the other hand, watermark-based protection systems can be easily broken by attackers who
remove embedded watermarks or cause them to be undetectable. Another
approach, which is the one we advocate, is that of proper duplicate detection
techniques [22, 23, 33, 66, 121]. These techniques are able to identify the violations that occur when a document infringes upon another document in
some way (e.g. by rearranging portions of text). For a duplicate detection
technique, the notion of security represents how hard it is for a malicious
user to break the system [22]. Obviously, not all the duplicate detection
approaches present the same level of security. The level of security deliv-
ered by duplicate detection techniques is variable and is strictly correlated
to the scheme adopted for the comparison of documents. For instance, any approach relying on an exact comparison of documents cannot be particularly secure, since a few insignificant changes to a document may prevent the approach from identifying it as a duplicate.
A duplicate detection technique ensuring a good level of security can thus be employed as a service of an infrastructure that gives users access to a wide
variety of digital libraries and information sources [22] and that, at the same
time, protects the owners of intellectual properties by detecting different
levels of violation such as plagiarisms, subsets, overlaps and so on. In con-
texts requiring secrecy, a duplicate detection service could be supported by
a number of other important services, such as encryption and authorization
mechanisms, which would help too in the protection of intellectual property.
Moreover, in a digital library context, the use of duplicate detection tech-
niques is not limited to the safeguard of the intellectual property. Indeed,
duplicates are widely diffused for a number of reasons other than the illegal ones: the same document may be stored in almost identical form in
multiple places (e.g. mirror sites, partially overlapping data sources) or different versions of a document may be available (e.g. multiple revisions and updates over time, various summarization levels, different formats). As a consequence, since the document collection of a digital library is the union of data obtained from multiple sources, it usually presents a high level of duplication. “Legal” duplicates do not give any extra information to the user
and therefore lower the accuracy of the results of searches in a digital li-
brary. The availability of duplicate detection techniques represents an added
value for a digital library search engine as they improve the quality and the
correctness of search results.
For a duplicate detection technique to be secure, a solid pair-wise document comparison scheme is clearly needed, one which is able to detect different levels of duplication, ranging from (near) duplicates up to (partial) overlaps.
In this chapter, we specialize the approximate matching techniques described in Chapter 1 to the document comparison scenario, devising effective similarity measures that allow us to accurately determine how much a document is similar to (or is contained in) another. Conceptually, our pair-wise document comparison scheme tries to detect the resemblance between the content of documents. To do so, we do not extract
stand-alone keywords but we consider the document chunks representing the
contexts of words selected from the text. The comparison of the informa-
tion conveyed by the chunks allows us to accurately quantify the similarity
between the involved documents. The security delivered by such an approach
is particularly improved w.r.t. other schemes relying on crisp similarities,
since it is able to identify with much more precision and reliability the actual
level of overlap and similarity between two documents, thus detecting differ-
ent levels of duplication and violations. Further, we address efficiency and
scalability by introducing a number of data reduction techniques that are
able to reduce both time and space requirements without affecting the good
quality of the results. This is achieved by reducing the number of useless
comparisons (filtration), as well as the amount of space required to store the
logical representation of documents (intra-document reduction) and the doc-
ument search space (inter-document reduction). Such techniques have been
implemented in a system prototype named DANCER (Document ANalysis
and Comparison ExpeRt) [93].
The chapter is organized as follows: In Section 3.1 we define the new doc-
ument similarity measures for duplicate detection. Section 3.2 presents the
data reduction techniques. In Section 3.3 we discuss related work. Finally,
in Section 3.4 we show the DANCER architecture and the results of some
experiments.
[Figure 3.1: (a) the mapping between chunks maximizing the overall document similarity; (b) the permutation-based rearrangement of the positions of the Dj chunks.]
The similarity measure between documents satisfying the requirements stated above is the one that looks for the mapping between chunks maximizing the overall document similarity (see Figure 3.1.a). The statement can also be formulated in the following way: let us suppose that the longest document in terms of number of chunks is Dj, i.e. m > n, and let us consider the index set I_m of Dj; then a permutation of I_m is a function p_m allowing the rearrangement of the positions of the Dj chunks. We look for the permutations p_m maximizing the overall document similarity obtained by combining the similarities between the chunks in the same position (see Figure 3.1.b). Its formalization is given in the following definition.
similarity with D0 and the resulting score are summarized in the following table:

Doc   Permutation              Similarity with D0
D1    Identity                 1
D2    {1 ↦ 2; 2 ↦ 1; 3 ↦ 3}    1
D3    Identity                 ((10+10)·1 + (8+8)·1) / ((10+8+2) + (10+8)) = 0.947
D4    Identity                 ((10+10)·1 + (8+8)·1 + (2+2)·1) / ((10+8+2) + (10+8+2+7)) = 0.851
D5    Identity                 ((10+10)·1) / ((10+8+2) + n·10) ≤ 0.667
D6    Identity                 ((8+10)·sim(A, A′) + (8+8)·1 + (2+2)·1) / ((10+8+2) + (8+8+2)) ≥ 0.526
edit distance has proved to be a good metric for detecting syntactic similari-
ties between sentences. In this context, we extend the concept of sentence to that of a chunk, having a stand-alone meaning and defining a context. Given
two chunks chi ∈ Di and ckj ∈ Dj and an edit distance threshold t, we define
the similarity between chi and ckj in the following way:
\[
sim(c_i^h, c_j^k) =
\begin{cases}
1 - \dfrac{ed(c_i^h, c_j^k)}{\max(|c_i^h|, |c_j^k|)} & \text{if } \dfrac{ed(c_i^h, c_j^k)}{\max(|c_i^h|, |c_j^k|)} \le t \\
0 & \text{otherwise}
\end{cases}
\tag{3.2}
\]
By computing Eq. 3.1 of the resemblance measure with the above similarity
measure between chunks, we are able to obtain a good level of effectiveness
since we perform a term comparison without giving up the context-awareness
which is guaranteed by the use of chunks. The security delivered by such
approach is particularly improved w.r.t. other schemes relying on crisp simi-
larities, since it is much more independent from the chosen size of the chunks
and it is able to identify with much more precision the actual level of over-
lap and similarity between two documents, thus detecting different levels
of duplication and violations (see Section 3.4 for an extensive experimental
evaluation).
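To make Eq. 3.2 concrete, the following is a minimal Python sketch; the function names and the simple dynamic-programming edit distance are ours, and chunks are assumed to be lists of terms:

```python
def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance between two term lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution / match
        prev = curr
    return prev[-1]

def chunk_sim(ci, cj, t):
    """Eq. 3.2: normalized edit similarity, zeroed when above threshold t."""
    norm = edit_distance(ci, cj) / max(len(ci), len(cj))
    return 1.0 - norm if norm <= t else 0.0
```

For instance, chunk_sim("the figure shows results".split(), "the figure showed results".split(), t=0.4) yields 0.75, since one substitution over four terms gives a normalized distance of 0.25.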
[...]

\[ MaxOverlap(D_i)_{p_m}^{D_j} = \max_{s,e \in I_n :\, s \le e \,\wedge\, sim(c_i^k,\, c_j^{p_m(k)}) > 0\ \forall k \in [s,e]} \ \sum_{k \in [s,e]} |c_j^{p_m(k)}| \tag{3.4} \]

where [s, e] is a sequence of indexes in I_n such that sim(c_i^k, c_j^{p_m(k)}) > 0
for each k ∈ [s, e].
The role of the above measure is twofold. When it is computed w.r.t. one of
the permutations p_m associated with the maximum value Sim_{p_m}(D_i, D_j) of Def.
3.1, it is an additional indicator that helps to understand the meaning of the
score of the resemblance measure. Otherwise, it can provide a different score
quantifying the maximum contiguous part overlapping another document.
For instance, such a score can help to easily identify partial plagiarisms, such
as an entire paragraph copied from one document to another, even if the
relative size of the copied part is very small w.r.t. the final document. The
formula to be computed is:
\[ MaxOverlap(D_i)_{D_j} = \max_{p_m} MaxOverlap(D_i)_{p_m}^{D_j} \tag{3.5} \]

Notice that, in general, the two scores MaxOverlap(D_i)_{p_m}^{D_j} and
MaxOverlap(D_i)_{D_j} are different.
Example 3.2 Let us reconsider the document set of Example 3.1. As we
have already shown, the measure Sim(D_0, D_3) = 0.947 is indicative of a
high similarity but gives no idea about the maximum overlap. For a deeper
analysis we compute MaxOverlap(D_0)_{D_3} = MaxOverlap(D_3)_{D_0} = (10 +
8) = 18. Such scores state that the content of D_3 is approximately a copy of
a great part of D_0. □
Another useful indicator is one telling how much one of the two documents
is contained in the other. It can be measured by introducing an
asymmetric measure as a variant of the resemblance measure: the containment
measure.
Definition 3.3 (Containment measure) Given two documents D_i = c_i^1
. . . c_i^n and D_j = c_j^1 . . . c_j^m, such that m > n, and a similarity measure sim(c_i^k, c_j^h)
between chunks, the containment aSim(D_i, D_j) of D_i in D_j is estimated by
the maximum of the following value set {aSim_{p_m}(D_i, D_j) | p_m is a permutation
of I_m}, where

\[ aSim_{p_m}(D_i, D_j) = \frac{\sum_{k=1}^{n} |c_i^k| \cdot sim(c_i^k, c_j^{p_m(k)})}{\sum_{k=1}^{n} |c_i^k|} \tag{3.6} \]
Example 3.3 Let us consider again the document set of Example 3.1 and,
in particular, the containment of D_3 in D_0 and of D_0 in D_4, and vice versa.
The scores are aSim(D_3, D_0) = (10·1 + 8·1)/(10 + 8) = 1, aSim(D_0, D_3) =
(10·1 + 8·1)/(10 + 8 + 2) = 0.9, aSim(D_0, D_4) = (10·1 + 8·1 + 2·1)/(10 + 8 + 2) = 1,
and aSim(D_4, D_0) = (10·1 + 8·1 + 2·1)/(10 + 8 + 2 + 7) = 0.74.
In particular, aSim(D_3, D_0) and aSim(D_0, D_4) state that document D_3 is
fully contained in document D_0 which, in turn, is fully contained in document
D_4. □
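A brute-force sketch of the resemblance and containment measures may help to fix ideas. It enumerates all chunk mappings explicitly, which is only feasible for very small documents (Section 3.4 describes the ILP-based optimization actually used); all names are ours, and `sim` is any chunk similarity such as Eq. 3.2:

```python
from itertools import permutations

def resemblance(Di, Dj, sim):
    """Def. 3.1 by brute force: maximize the weighted chunk-mapping score
    over all injective mappings.  Di, Dj are lists of chunks (term lists)."""
    if len(Di) > len(Dj):
        Di, Dj = Dj, Di                               # ensure n <= m
    total = sum(len(c) for c in Di) + sum(len(c) for c in Dj)
    best = 0.0
    for pm in permutations(range(len(Dj)), len(Di)):
        score = sum((len(ci) + len(Dj[k])) * sim(ci, Dj[k])
                    for ci, k in zip(Di, pm))
        best = max(best, score / total)
    return best

def containment(Di, Dj, sim):
    """Eq. 3.6: asymmetric aSim(Di, Dj); chunks of Di left unmatched when
    Dj is shorter contribute 0 (None padding)."""
    denom = sum(len(c) for c in Di)
    targets = list(range(len(Dj))) + [None] * max(0, len(Di) - len(Dj))
    best = 0.0
    for pm in permutations(targets, len(Di)):
        score = sum(len(ci) * sim(ci, Dj[k])
                    for ci, k in zip(Di, pm) if k is not None)
        best = max(best, score / denom)
    return best
```

With chunk lengths as in Example 3.1, these functions reproduce the scores in the table and in Example 3.3.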
They are orthogonal and thus fully combinable. The order of presentation
corresponds to their increasing impact on the document search space. Filtering
leaves it unchanged since, although filters reduce the document search space,
they ensure no false dismissals; intra-document reduction introduces an
approximation in the logical representation of documents by storing a selected
number of chunk samples; inter-document reduction approximates the
document search space representation by pruning out less significant documents
while maintaining significant ones.
The tests we performed, which will be presented in Section 3.4, show
that the similarity measures are robust w.r.t. these three data reduction
techniques and that a good trade-off between the effectiveness of
the measures and the efficiency of the duplicate detection technique can be
reached, as data reduction allows us to decrease both time and space requirements
while keeping a very high effectiveness in the results.
3.2.1 Filtering
Filters allow the reduction of the number of useless comparisons while always
ensuring the correctness of the results. Indeed, filtering is based on the fact
that it may be much easier to state that two data items do not match than
to state that they match. In this section, we consider the application of
filtering techniques based on (dis)similarity thresholds for the comparison of
both documents and their chunks.
Since the edit distance threshold t in Eq. 3.2 allows the identification of
pairs of chunks similar enough, we can ensure efficiency in the chunk similar-
ity computation by applying the filtering techniques described in Chapter 1
before the edit distance computation: count filtering, position filtering, and
length filtering. In the following, we will refer to such filters as chunk filters.
Exploiting chunk filters allows us to greatly reduce the costs of approximate
chunk match, i.e. the first phase of the document similarity computation.
Because of the way such filters work, it is not possible to give a theoretical
approximation of the computational benefits they offer: The behavior of all
such filters depends on the data to which they are applied. For instance,
the results offered by the length filter are strictly influenced by the
distribution of the sentences' lengths. In a typical case, the number of
required comparisons between chunks (sentences) can be reduced from hundreds
of thousands to a few dozen, thus dramatically reducing the required time
without missing any of the final results.
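As an illustration, here is a minimal sketch of two of the chunk filters (length filtering and count filtering on word-level q-grams) under the relative threshold t of Eq. 3.2; the names and bookkeeping are ours, and position filtering is omitted:

```python
from collections import Counter

def qgrams(tokens, q=2):
    """Multiset of word-level q-grams of a chunk."""
    return Counter(tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1))

def passes_chunk_filters(ci, cj, t, q=2):
    """Cheap necessary conditions for ed(ci, cj) <= t * max(|ci|, |cj|):
    pairs failing them can skip the edit distance computation of Eq. 3.2."""
    k = int(t * max(len(ci), len(cj)))       # absolute edit-distance budget
    if abs(len(ci) - len(cj)) > k:           # length filtering
        return False
    common = sum((qgrams(ci, q) & qgrams(cj, q)).values())
    # count filtering: each edit operation destroys at most q q-grams
    return common >= max(len(ci), len(cj)) - q + 1 - k * q
```

Only the pairs surviving these tests need the full dynamic-programming edit distance, which is where the reduction from hundreds of thousands of comparisons to a few dozen comes from.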
The pair-wise document comparison scheme introduced in Def. 3.1 and
its variants essentially try to find the best mapping between the document
chunks by considering the similarities between all possible pairs of chunks
themselves. The mapping computation phase is the second and last phase
Sim(D_i, D_j) ≥ s, aSim(D_i, D_j) ≥ s, or aSim(D_j, D_i) ≥ s each imply that

\[ \widehat{aSim}(D_i, D_j) \ge s \quad \text{or} \quad \widehat{aSim}(D_j, D_i) \ge s, \]

where

\[ \widehat{aSim}(D_i, D_j) = \frac{\sum_{k=1}^{n}\sum_{h=1}^{m} |c_i^k| \cdot sim(c_i^k, c_j^h)}{\sum_{k=1}^{n} |c_i^k|} \tag{3.7} \]
The above theorem shows the correctness of the filter we devised for
data reduction. In other words, we ensure no false dismissals, i.e. none
of the pairs of documents whose similarity actually exceeds a given
threshold is left out by the filter. Consequently, whenever we solve range
queries where a query document Q and a duplication threshold s are specified,
we can first apply the filter shown in the theorem (i.e. for each document
D in the collection, check whether \widehat{aSim}(Q, D) ≥ s or \widehat{aSim}(D, Q) ≥ s),
which quickly discards documents that cannot match. Then, we compute
the resemblance level between the query document and each document in
the resulting candidate set. As for chunk filters, it is not possible to define
an a priori formula quantifying in general terms the reduction offered by the
document filter to the computational complexity of the mapping computation
between the chunks, since the improvement depends on the data to which
it is applied. In Section 3.4 we describe the effect of such filter by means
of experimental evaluation on different document collections.
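A sketch of the resulting filter-and-refine strategy, as we read Eq. 3.7 (names are ours; `resemblance` stands for any exact implementation of Def. 3.1, such as the one sketched earlier):

```python
def asim_hat(Di, Dj, sim):
    """Eq. 3.7: permutation-free score used as the document filter."""
    num = sum(len(ci) * sim(ci, cj) for ci in Di for cj in Dj)
    return num / sum(len(ci) for ci in Di)

def range_query(Q, collection, s, sim, resemblance):
    """Filter-and-refine: discard documents whose filter scores are both
    below s, then compute the exact resemblance on the candidates."""
    candidates = [D for D in collection
                  if asim_hat(Q, D, sim) >= s or asim_hat(D, Q, sim) >= s]
    return [(D, r) for D in candidates
            if (r := resemblance(Q, D, sim)) >= s]
```

Since the filter score never falls below the exact one for a qualifying pair, no true answer is discarded; only some false positives survive to the refine step.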
Since the chunk filters and the document filter are correct, they reduce but
do not approximate the document search space. In particular, the adoption
of the chunk filtering techniques requires a space overhead to store the q-gram
repository. Following the arguments given in [61], the required space is
bounded by a linear function of q times the size of the corresponding chunks.
Finally, in Section 3.4 we quantify the effectiveness of the chunk filters
and the document filter in the reduction of the number of comparisons. From
our tests, we infer that filters ensure a small number of false positives as in
most cases the level of duplication between documents is heavily dependent
on the number of similar chunks.
Chunk clustering

Chunk clustering is the process of cluster analysis in the chunk search
space representing one document. The intuition is that, if a document contains
two or more very similar chunks, chunk clustering stores just one of
them. Its effectiveness is mainly due to the availability of the proximity
measure defined in Eq. 3.2, which is appropriate to the chunk domain and on
which the document resemblance measure relies. By exploiting the similarities
among chunks, chunk clustering is able to choose the “right” representatives,
giving particularly good results for documents with a high inner
repetitiveness. More precisely, given a document D containing n chunks, the
clustering algorithm produces chunkRatio ∗ n clusters in the chunk space.
As to the clustering algorithm, among those proposed in the literature, we
experimentally found that the most suitable for our needs is hierarchical
complete-link agglomerative clustering [70]. For each cluster γ, we keep
some features which will be exploited in the document similarity computation.
To this end, the cluster centroid R will be used in place of the chunks
it represents. The centroid corresponds to the chunk minimizing the maximum
of the distances between itself and the other elements of the cluster.
Moreover, in order to weight the contribution of γ to the document similarity,
the total length |γ| in words and the number N of the chunks in γ are also
considered.
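A possible sketch of this step, using SciPy's complete-link implementation as a stand-in for whatever clustering code the prototype actually employs (the function name and the chunkRatio default are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_chunks(chunks, sim, chunk_ratio=0.3):
    """Complete-link agglomerative clustering of a document's chunks; per
    cluster we keep (centroid chunk, number of chunks N, total length)."""
    n = len(chunks)
    if n < 2:
        return [(chunks[0], 1, len(chunks[0]))] if chunks else []
    dist = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            dist[a, b] = dist[b, a] = 1.0 - sim(chunks[a], chunks[b])
    labels = fcluster(linkage(squareform(dist), method='complete'),
                      t=max(1, int(chunk_ratio * n)), criterion='maxclust')
    clusters = []
    for lab in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == lab]
        # centroid: the member minimizing its maximum distance to the others
        centroid = min(members, key=lambda i: max(dist[i, j] for j in members))
        clusters.append((chunks[centroid], len(members),
                         sum(len(chunks[i]) for i in members)))
    return clusters
```

The complete-link criterion merges clusters on their farthest members, which matches the centroid definition above (minimizing the maximum intra-cluster distance).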
Therefore, in the chunk clustering setting, documents are no longer represented
as sequences of chunks but as sequences of chunk clusters, and an adjustment
of the resemblance measure of Def. 3.1 is required. More precisely,
we start from two documents D_i and D_j of n and m chunks, respectively,
which have been clustered in n′ = chunkRatio ∗ n and m′ = chunkRatio ∗ m
clusters, and thus are represented by the sequences D_i = γ_i^1 . . . γ_i^{n′} and
D_j = γ_j^1 . . . γ_j^{m′}, where the k-th cluster γ_{i(j)}^k in D_i (D_j) is a tuple
(R_{i(j)}^k, N_{i(j)}^k, |γ_{i(j)}^k|) with centroid R_{i(j)}^k, number of chunks N_{i(j)}^k, and
length |γ_{i(j)}^k|. Being the reduction ratio chunkRatio common to the two documents,
it follows that if n ≤ m then n′ ≤ m′. In the revised form of the resemblance measure of
Def. 3.1, the similarity Sim(D_i, D_j) between D_i and D_j is the maximum of
the similarity scores between the two documents, each one computed on a
permutation p_{m′} on I_{m′}. To this end, we devised two variants of Eq. 3.1.

Cluster-Based Function The Cluster-Based Function is a straightforward
adaptation of Eq. 3.1:

\[ CSim_{p_{m'}}(D_i, D_j) = \frac{\sum_{k=1}^{n'} \left( |\gamma_i^k| + |\gamma_j^{p_{m'}(k)}| \right) \cdot sim(R_i^k, R_j^{p_{m'}(k)})}{\sum_{k=1}^{n'} |\gamma_i^k| + \sum_{k=1}^{m'} |\gamma_j^k|} \tag{3.8} \]
items randomly chosen from an arbitrary data set, assuming only that a distance
measure is defined for the original objects [20]. We introduce the notion
of document bubble, which is a specialization of the data bubble notion in
the context of duplicate detection, where the distance measure between documents
d(D_i, D_j) is the complement to 1 of the resemblance measure shown
in Def. 3.1, i.e. 1 − Sim(D_i, D_j).
Definition 3.4 (Document bubble) Let D = {D_1 . . . D_n} be a set of n
documents. The document bubble B w.r.t. D is a tuple (R, N, ext, inn),
where:

- R is the representative document for D, corresponding to the document
  in D minimizing the maximum of the distances between itself and the
  other documents of the cluster;
- N is the number of documents in D;
- ext is a real number such that the documents of D are located within a
  radius ext around R;
- inn is the average of the nearest neighbor distances within the set of
  documents D.
We build document bubbles in the following way: Given a set of documents
DS and a reduction ratio bubRatio, by applying Vitter's algorithm [135]
we perform a random sampling in order to select the initial bubRatio ∗ |DS|
document originators. The remaining documents are assigned to the “closest”
originator by applying the standard document similarity algorithm. The
outcome is a collection of bubRatio ∗ |DS| sets of documents. Finally, for each
document set, the features of the resulting cluster are computed and stored
as a document bubble.
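A sketch of this construction (our naming; plain random.sample stands in for Vitter's sequential random sampling, and dist(D_i, D_j) = 1 − Sim(D_i, D_j) as above):

```python
import random

def build_bubbles(DS, bub_ratio, dist):
    """Build document bubbles: sample originators, assign every remaining
    document to its closest originator, extract (R, N, ext, inn) per group."""
    originators = random.sample(DS, max(1, int(bub_ratio * len(DS))))
    groups = {id(o): [o] for o in originators}
    for D in DS:
        if not any(D is o for o in originators):
            closest = min(originators, key=lambda o: dist(D, o))
            groups[id(closest)].append(D)
    bubbles = []
    for members in groups.values():
        R = min(members, key=lambda a: max(dist(a, b) for b in members))
        ext = max(dist(R, b) for b in members)
        inn = (sum(min(dist(a, b) for b in members if b is not a)
                   for a in members) / len(members)) if len(members) > 1 else 0.0
        bubbles.append({'R': R, 'N': len(members), 'ext': ext, 'inn': inn})
    return bubbles
```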
Therefore, the inter-document reduction summarizes the starting set of
DS documents to be clustered in bubRatio ∗ |DS| document bubbles which
approximate the original document search space. In comparison with tra-
ditional hierarchical clustering, inter-document reduction allows us to cut
down the number of pair-wise document comparisons from |DS| × |DS| to
bubRatio2 × |DS| × |DS|. In such context, the distance measure suitable for
hierarchical clustering is no longer the distance d(Di , Dj ) between documents
but the distance measure between two bubbles B and C given in [20]:

\[ d(B, C) = \begin{cases} d(R_B, R_C) - (ext_B + ext_C) + (inn_B + inn_C) & \text{if } d(R_B, R_C) - (ext_B + ext_C) \ge 0 \\ \max(inn_B, inn_C) & \text{otherwise} \end{cases} \tag{3.10} \]
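Under the same assumptions, Eq. 3.10 translates directly into:

```python
def bubble_dist(B, C, dist):
    """Eq. 3.10: distance between two document bubbles (dicts as above)."""
    gap = dist(B['R'], C['R']) - (B['ext'] + C['ext'])
    return gap + B['inn'] + C['inn'] if gap >= 0 else max(B['inn'], C['inn'])
```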
[Figure: the distance between two data bubbles B and C is based on the distance dist(rep_B, rep_C) between their representatives.]

[...]
and in the DSC [23] techniques, which address the general problem of textual
matching for document search and clustering and for plagiarism/copyright
applications. In such approaches, chunks or shingles are obtained by “sliding” a
window of fixed length over the tokens of the document. The DSC algorithm
has a more efficient alternative, DSC-SS, which uses super shingles but does
not work well with short documents and thus is not suitable for web clustering.
Such approaches, as outlined by their authors, are heavily dependent
on the type and size of the chunks, which affect both the accuracy and
the efficiency of the systems. As to accuracy, the bigger the chunking unit,
the lower the probability of matching unrelated documents, but the higher
the probability of missing actual overlaps. As to search cost, the larger the
chunk, the lower the running time but the higher the potential number of
distinct chunks to be stored. Moreover, such approaches are usually
not resilient to small changes occurring within chunks, as only equal chunks
are considered. Our approach differs from the ones cited above since it is less
dependent on the size of the chunks to be compared, as our comparison
scheme enters into the chunks. In this way, we are able to accurately detect
different levels of duplication. Indeed, notice that if we adopt a crisp similarity,
the similarity measure of Eq. 3.1 computes the number of common chunks
weighted on their lengths. For the above approaches, finding the chunk size
ensuring a “good” level of efficiency is almost impossible. The problem is
also outlined in [121], where Garcia-Molina et al. address again the problem
of copy detection by proposing a different approach named SCAM, and compare
SCAM with COPS by showing that the adoption of the sentence as
chunking unit implies a percentage of false negatives of approximately 50%
(on more than 1000 netnews articles). SCAM investigates the use of words as the unit
of chunking. It essentially measures the similarity between pairs of documents
by considering the corresponding vectors of words together with their
occurrences, and by adopting a variant of the cosine similarity measure which
uses the relative frequencies of words as indicators of similar word usage. In
our opinion, it does not fully meet the characteristics of a good measure of
resemblance or, at least, of those depicted in Section 3.1. For instance, the
authors show that, setting the measure parameters to the values adopted in
the implementation, the similarity between ⟨a, b, c⟩ and ⟨a, b⟩ is 1; the same
holds for ⟨a, b, c⟩ against ⟨a, b, c, d⟩ and against ⟨a^k⟩ with k = 1, whereas no similarity
is detected with k > 1. Moreover, as documents are broken up into stand-alone
words with no notion of context, and the adopted measure relies on
the cosine similarity measure which is insensitive to word order, documents
differing in contents but sharing more or less the same words are judged very
similar.
Recently, an approach based on collection statistics has been proposed in
[33] for clustering purposes. The approach extracts from each document a
set of significant words selected on the basis of their inverse document frequency
and then computes a single hash value for the document by using the
SHA1 algorithm [1], which is order sensitive. All documents resulting in the
same hash value are duplicates. The approach supports a high similarity
threshold, justified by the aim of clustering large amounts of documents. It
thus turns out to be not well suited for plagiarism or similar problems
where the detection of different degrees of duplication within a collection is
required. As to the efficacy of the measure, a comparison with our approach
is also possible. Indeed, they conducted some experiments on synthetic documents
generated in a similar way as ours, i.e. 10 seed documents and 10
variants for each seed document. Although they introduce a mechanism
which lowers the probability of modification, they show that the average number of
document clusters formed for each seed document, which should ideally be
1, is 3.3, with a maximum of 9.
As far as the techniques for data reduction are concerned, the main ob-
jective of our work has been to devise techniques able to reduce the number
of document comparisons relying on our scheme and to test the robustness
of the resemblance measures w.r.t. such techniques. Filters have been suc-
cessfully employed in the context of approximate string matching (e.g. [61])
where the properties of strings whose edit distance is within a given thresh-
old are exploited in order to quickly discard strings which cannot match.
In our context, we devised an effective filter suitable for the resemblance
measure of the comparison scheme we propose. Data bubbles capitalize on
sampling as a means of improving the efficiency of the clustering task. In our
work, we have revised and adapted to our context the general data bubble
framework proposed in [20]. The construction of the data bubbles could be
sped up by adopting the very recent data summarization method proposed
in [142], which is very fast and based only on distance information. Finally,
several indexing techniques have been proposed to efficiently support
similarity queries. Among those, the Metric Access Methods (MAMs) only
require that the distance function used to measure the (dis)similarity of the
objects is a metric. They are orthogonal to the data reduction techniques
we proposed and thus could be combined with them in order to deal with duplicate
detection problems involving large amounts of documents. Indeed, the
fact that our resemblance measure is not a metric (it does not satisfy the
triangle inequality property) does not prevent us from using a MAM such
as the M-Tree [36] for indexing the document search space. The paper [35]
shows that the M-Tree and any other MAM can be extended so as to process
queries using distance functions based on user preferences. In particular any
distance function that is lower-bounded by the distance used to build the
80 Approximate matching for duplicate document detection
Pre-processing
Conversion to ASCII
Chunk Identification
Documents
Chunks
Syntactic Filtration
Length Filtration
Document
Statistic and Repository Inter / Intra document
Graphic Analysis Data reduction
Similarity Computation
Chunk Mapping
Chunk Similarity Search
Optimization
tree can be used to correctly query the index without any false dismissal.
In this enlarged scenario, several distances can be dealt with at a time, the
query (user-defined) distance, the index distance (used to build the tree)
and the comparison distance (used to quickly discard uninteresting index
paths), where the query and/or the comparison distance functions can also
be nonmetric. As a matter of fact, a metric which is a lower-bound of our
resemblance measure can be easily defined and our filters can be used as
comparison distances.
Eq. 3.1. Instead of generating all possible permutations and testing them, we
express the problem in the following form: We find the maximum of the function

\[ \frac{\sum_{k=1}^{m}\sum_{h=1}^{n} \left( |c_i^h| + |c_j^k| \right) \cdot sim(c_i^h, c_j^k) \cdot x_{h,k}}{\sum_{k=1}^{n} |c_i^k| + \sum_{k=1}^{m} |c_j^k|} \]

where x_{h,k} is a boolean variable stating whether the pair of chunks (c_i^h, c_j^k)
actually participates in the similarity computation, thus assuming the values

\[ x_{h,k} = \begin{cases} 1 & \text{if chunk } c_i^h \text{ is associated to chunk } c_j^k \\ 0 & \text{otherwise} \end{cases} \]
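Because the x_{h,k} variables encode a one-to-one mapping, the maximization can also be solved with the Hungarian algorithm instead of a general ILP package; the following sketch (our naming, SciPy's linear_sum_assignment) computes the same optimum:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_mapping_score(Di, Dj, sim):
    """Maximize sum_{h,k} (|c_h| + |c_k|) * sim(c_h, c_k) * x_{h,k} over
    one-to-one chunk mappings, via the Hungarian algorithm."""
    W = np.array([[(len(ci) + len(cj)) * sim(ci, cj) for cj in Dj]
                  for ci in Di])
    rows, cols = linear_sum_assignment(W, maximize=True)
    total = sum(len(c) for c in Di) + sum(len(c) for c in Dj)
    return W[rows, cols].sum() / total
```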
1000 (short), 500 (short), 100 (long) and 100 (short) documents, con-
taining 10 variations (including the seed documents themselves) for
each of the 100, 50, 10 and 10 seed documents, respectively, extracted
from articles of the Times newspaper digital library;
          CiteR25   CiteR8
# docs    25        8
# sents   8223      4086
# words   164935    81339
Effectiveness tests
[Figure 3.5: image matrixes for sampling (first row) and length-rank chunk selection (second row); panels (d)–(f) show length-rank selection at ratios 0.5, 0.3, and 0.1.]
We did not perform effectiveness tests on the filtering techniques, since they
have been proved to be correct (see Subsection 3.2.1) and thus do not affect
the accuracy of the system. As to the intra-document reduction, we first compare the
results of the length-rank chunk selection with the sampling approach used
in [22]. Sampling was implemented both through a fixed-step sampling algorithm
keeping every i-th chunk and with the sequential random sampling
algorithm by Vitter [135]. Since the two implementations produced approximately
the same quality level, we only present random sampling results.
Figure 3.5 compares the effect of the different ratios of sampling (first line)
and length-rank chunk selection (second line). For instance, the third column
represents the image matrixes obtained by applying a reduction ratio of 0.1,
i.e. in both cases we kept one every 10 chunks per document. Here, the ref-
erence collection is again Times100L and thus the decay of the effectiveness
can be evaluated by comparing the image matrixes of Figure 3.5 with the
image matrix of Figure 3.4. With sampling 0.5 and 0.3 the 10 groups are still
recognizable but the intra-group duplication scores are much reduced (pixels
are darker than the corresponding ones in Figure 3.4). With sampling 0.1
the quality of the results is quite poor. The quality of the length-rank chunk
selection is, instead, very good: even at the highest selection level (0.1),
the similarity scores are much better than any sampling at 0.5.

[Figure 3.6: image matrixes for chunk clustering — panels (d)–(f): A-L C-B at reduction ratios 0.5, 0.3, and 0.1.]
Figure 3.6 shows the results of the experiments we conducted with the
chunk clustering technique at three different levels of reduction ratio, where
duplication scores were computed using the cluster-based (C-B) and the
average-length cluster-based (A-L C-B) functions (see Section 3.2.2). Since
the inner repetitiveness of the collection is quite low, the results are not as
good as for length-rank chunk selection. Notice that the A-L C-B resemblance
measure (see Eq. 3.9) produces “smooth” results with respect to the
different reduction settings: Results have an acceptable quality up to a 0.2–0.1
reduction ratio. On the other hand, the C-B resemblance measure (see
Eq. 3.8) behaves almost in a crisp way: the resemblance results near the
extremes, 0 and 1, degrade very slowly at the different reduction ratios, and
several document pairs are recognized with a full score even at 0.1.
The results of the effectiveness experiments we conducted for inter-document
reduction are depicted in Figure 3.7. The collections tested are
Times100S and Times500S. The reduction ratio was set to 0.2, that is, 20 and
100 document bubbles were generated for Times100S and Times500S, respectively.
[Figure 3.7: image matrixes for inter-document reduction on Times100S (20 bubbles) and Times500S (100 bubbles).]
We also performed tests on the impact of the chunking unit on the quality
of the obtained results (Figure 3.8). The results concern Cite100M; other
synthetic collections behaved similarly. The tests shown so far have been
performed with a chunk size of one sentence (Figure 3.8a). By choosing a
bigger chunk size (i.e. a paragraph, Figure 3.8b), the similarity scores were
noticeably lower, but the similar document groups were still perfectly identifiable.
Such results are due to the fact that our comparison scheme, through the
chunk similarity measure, succeeds in correlating the information conveyed
by the chunks.
           Sim                                          aSim
Group      Affinity            Noise                    Affinity            Noise
1          0.36496 ± 0.39617   0.00012 ± 0.00038        0.37176 ± 0.40008   0.00011 ± 0.00037
2          0.28984 ± 0.40268   0.00013 ± 0.00053        0.29572 ± 0.40308   0.00009 ± 0.00040
3          0.31576 ± 0.39189   0.00030 ± 0.00066        0.32352 ± 0.39356   0.00045 ± 0.00074
4          0.32384 ± 0.40992   0.00024 ± 0.00067        0.32580 ± 0.41103   0.00035 ± 0.00109
5          0.50104 ± 0.32719   0.00015 ± 0.00041        0.50332 ± 0.32709   0.00011 ± 0.00035
Total Avg. 0.35909 ± 0.38557   0.00019 ± 0.00053        0.36402 ± 0.38697   0.00022 ± 0.00059

Table 3.3: Affinity and noise values for the CiteR25 collection
[Figure 3.9: Chunk size and length-rank test results in CiteR25 — (a) chunk = sentence, no reduction; (b) paragraphs as chunks; (c) length-rank selection 0.3.]
In order to test the effectiveness of the resemblance measure in deter-
mining duplicate documents also in real heterogeneous document sets, we
performed “ad-hoc” tests on the CiteR25 and CiteR8 collections. First, we
inserted into the system the 5 groups of 5 documents of CiteR25 and then
computed the affinity and noise values for the symmetric and asymmetric
similarity measures. Affinity represents how close documents in the same
group are in terms of duplication. The affinity value a ± d reported in each
line of Table 3.3 is the average a and the standard deviation d over the intra-
group document comparisons. Noise represents undesired matches between
documents belonging to different groups and the value reported in each line
of Table 3.3 is the average and the standard deviation over the comparisons
between the documents of the group w.r.t. the other groups. Notice the
extremely low noise and the high affinity values, which are three orders of
magnitude higher than the noise ones. This once again confirms the good-
ness of our resemblance measures, even for real and therefore not extremely
correlated document sets. The image matrix of CiteR25 is depicted in Figure
3.9a, whereas Figure 3.9b shows the effect of the variation of the chunking
Citeseer
         Doc. 1  Doc. 2  Doc. 3  Doc. 4    Doc. 5   Doc. 6   Doc. 7   Doc. 8
Doc. 1   -       31.5%   18.0%   -         -        -        -        -
Doc. 2   41.1%   -       47.6%   -         -        -        -        -
Doc. 3   -       -       -       5.2%      5.2%     -        -        -
Doc. 4   -       -       41.6%   -         -        9.2%     -        -
Doc. 5   -       -       19.7%   -         -        -        -        -
Doc. 6   -       -       20.7%   8.8%      -        -        20.5%    13.8%
Doc. 7   -       -       14.3%   -         -        35.4%    -        68.1%
Doc. 8   -       -       12.0%   -         -        27.4%    59.4%    -

DANCER
         Doc. 1  Doc. 2  Doc. 3  Doc. 4    Doc. 5   Doc. 6   Doc. 7   Doc. 8
Doc. 1   100.0%  44.3%   25.3%   3.4%      3.7%     3.4%     2.6%     1.6%
Doc. 2   57.3%   100.0%  46.2%   3.2%      3.7%     3.1%     2.6%     1.9%
Doc. 3   2.0%    2.9%    100.0%  8.3%      7.1%     5.5%(A)  3.0%     2.3%
Doc. 4   1.8%    1.3%    54.6%   100.0%    5.9%(B)  18.2%    8.8%(C)  7.2%(D)
Doc. 5   0.9%    0.7%    22.7%   2.8%      100.0%   2.2%     4.1%     0.9%
Doc. 6   1.8%    1.2%    36.2%   18.5%     4.5%     100.0%   31.0%    24.1%
Doc. 7   2.3%    1.8%    32.0%   14.0%(E)  4.1%     48.6%    100.0%   66.7%
Doc. 8   1.8%    1.7%    30.6%   14.5%(F)  3.9%     47.8%    83.8%    100.0%

Table 3.4: Similarity percentages between the CiteR8 documents as computed by Citeseer (top) and DANCER (bottom)
unit, and Figure 3.9c that of the length-rank chunk selection. As to the
first aspect (Figure 3.9b), the obtained results are comparable to the ones
obtained in the synthetic tests. As to the second aspect, notice that the quality
of the computation on the reduced data is even better than on the original
one, as noise decreases and affinity increases. This is due to the fact that
length-rank chunk selection stresses the similarities between large portions
of the involved documents, while small chunks are excluded from the computation.
In the Citeseer real data sets, small chunks often correspond to
short sentences, such as “The figure shows the results”, “Technical
report”, “Experimental evaluation”, or even the authors' affiliations, which are
quite common in scientific articles, so that even unrelated documents might
both contain them. They are thus a source of noise that is left out by the
length-rank chunk selection.
The second test on real collections involved CiteR8 and was devised to
stress the DANCER capabilities in discovering inter-correlations between sets
of scientific articles coming from a digital library. Such correlations are quite
important in order to classify the documents and to show to the user a web
of links involving articles similar to the one being accessed. In fact, the 8
documents were selected by starting from the article [9] and navigating the
[Figure 3.10: graph of the significant links A–F between the CiteR8 documents (e.g. link B: 5.9% between Doc. 4 and Doc. 5).]
“similar on the sentence level” links proposed by the Citeseer digital library
search engine. They correspond to variations of the original article and to
articles on the same subject from the same authors, and include extended
versions, extended abstracts and surveys. In order to provide the links, the
Citeseer engine adopts a sentence-based approach to detect the similarities
between the documents: It maintains the collection of all sentences of all the
stored documents and then computes the percentage of identical sentences
between all documents. The percentages computed by Citeseer between the
CiteR8 documents are shown in the upper part of Table 3.4 (Doc. 1 is the
starting document). Notice that the matrix is asymmetric, as each value represents
the percentage of the sentences that the document in the row shares
with the document in the column. For instance, Doc. 1 is linked to Doc. 2
(with which it shares 31.5% of its sentences) and to Doc. 3 (18.0%). Further,
notice that the resulting correlation matrix is quite sparse.
Now, consider the corresponding matrix produced by DANCER (lower part
of Table 3.4). To make the resulting values directly comparable with the
ones from Citeseer, we used the containment measure (see Subsection 3.1.3).
Notice that the output given by our approach is much more detailed, providing
more precise similarities between all the available document pairs. For
instance, Citeseer does not provide a link from Doc. 1 to Doc. 4, whereas our
approach reveals that the former shares 3.4% of its contents with the latter.
Even considering a 5.0% minimum similarity threshold, which Citeseer
could employ in order not to flood the user with low-similarity links, there
still remains a good number of significant links that must be discovered for
the similarity computation to be considered complete and correct. Such significant
links, which are not provided by Citeseer, are marked in Table 3.4 with
capital letters A to F and are also graphically represented in Figure 3.10.
Their significance cannot be ignored, since they involve similarities of up
to 14.5%.
Finally, in order to specifically test the level of security provided by our
approach, we performed an in-depth violation detection analysis, whose results
are shown in Table 3.5. The tests consisted in verifying the output (i.e.
the similarity scores) of our approach obtained by simulating well-defined
expected-case violation scenarios: exact copy, several levels of partial copy,
several levels and types of plagiarism. Each row of data in the table summarizes
the results of 20 similarity tests performed on a specific scenario; for
this reason each cell of the results is represented by a similarity range, showing
the lowest and the highest computed similarity score. The table shows
all three DANCER similarity measures (resemblance, containment and
maximum contiguous overlap) defined in Section 3.1. For the two asymmetric
measures, the presented values refer to the violating document w.r.t.
the original one. The overlap measure is expressed in number of (stemmed
and stopword-filtered) words. Further, we computed and compared the resemblance
similarities that would be delivered by other approaches based on
crisp chunk similarities, such as [22, 33], with and without syntactic filtrations
(such as stemming): Such values are shown in the last two columns of
the table. For each test, a document of the Times collection was chosen as
the “original” document and its similarity was measured w.r.t. a “violating”
document, derived from the original by applying various modifications. By
choosing specific similarity thresholds, such as 15%–20% for resemblance and
containment, and 100 words for overlap, such values can be used to help the
human judgement of the various violations, even well substituting and ap-
[...]
acceptable security level, showing similarities well below the threshold and thus
producing false negatives.

The last row of the table shows the resulting similarities in a “genuine”
document scenario: Here the analyzed document pairs are extracted from
related documents, i.e. documents involving the same topics but with no
common or plagiarized parts. As expected, since the DANCER approach is
much more sensitive, the obtained similarities are higher than with the crisp
approaches; however, they are still low and below typical violation thresholds.
Thus, the percentage of false positives delivered by our approach is still kept
very low, similarly to crisp techniques and differently from other approaches
which are able to identify subtle violations but at the cost of very high percentages
of false positives, such as [121].
Efficiency tests
As far as the efficiency of the proposed techniques is concerned, we measured
the system runtime performance in performing the resemblance computations
by analyzing the impact of the data reduction techniques.
The computing time required to query each collection against itself when
only chunk filters are applied is shown in Figure 3.11. The graph makes
a distinction between the time required for each of the two phases, that is
chunk matching and similarity scores’ computation. The similarity score
computation time represents the time required to compute the similarity
scores for each pair of documents. The overall time can be significantly
decreased by applying the data reduction techniques we devised.
As to the document filter, setting the similarity threshold s to a value
greater than 0, even at a minimal setting such as 0.1, halves the total required
time. Such an improvement is due to the ability of the filter to quickly
discard pairs of documents which cannot match and to ensure few false positives,
thus limiting the number of useless comparisons. For instance, for
Times100L the cross product computation requires 100 ∗ 100/2 comparisons
(the resemblance measure is symmetric). By setting the similarity threshold at
0.1, the document filter leaves out 91% of the document pairs, while the
surviving document pairs on which we compute the similarity scores
are 450. From such a computation, we found that all the pairs contained
in the candidate set are actually similar enough, that is, their similarity score
is at least 0.1. In this case we have no false positives and, thus, the best
filter performance. The same applies to threshold values up to 0.6. As the
threshold grows over 0.6, the number of positive document pairs obviously
decreases (down to 0 when the threshold is 0.9) and the filter leaves out more
document pairs. The worst case occurs when the threshold is 0.7, where we
have a candidate set of 378 document pairs containing 5 false positives. Such
a behavior is mainly due to the fact that documents having a certain number
of similar chunks are often similar documents. In most of these cases,
the mapping between chunks can be straightforwardly computed without
requiring the intervention of the ILP package (see Subsection 3.4.1).
Moreover, the intra-document reduction techniques provide us with fur-
ther ways to enhance the system performance, as shown in Figure 3.12, where
we compare the time required to compute the resemblance matrix employing
the three intra-document reduction techniques on Times100L and Cite100M
at different values of reduction ratios, starting from 1.0 which means no re-
duction. Using length-rank chunk selection at 0.1, for instance, allows us to
[...]

[Figure: computing times (seconds) with sentences vs. paragraphs as chunks.]
Chapter 4

Pattern Matching for XML Documents
With the rapidly increasing popularity of XML for data representation, there
is a lot of interest in query processing over data that conform to the labelled-
tree data model. The idea behind evaluating tree pattern queries, sometimes
called the twig queries, is to find all existing ways of embedding the pattern in
the data. From the formal point of view, three main types of pattern match-
ing exist: One involving paths and two involving trees. XML data objects
can be seen as ordered labelled trees, so the problem can be characterized
as the ordered tree pattern matching, of which the path pattern matching can
be seen as a particular case. Though there are certainly situations where
the ordered tree pattern matching perfectly reflects the information needs
of users, there are many others that would prefer to consider query trees as
unordered. For example, when searching for a twig of the element person
with the subelements first name and last name (possibly with specific val-
ues), ordered matching would not consider the case where the order of the
first name and the last name is reversed. However, this could be exactly
the person we are searching for. The way to solve this problem is to con-
sider the query twig as an unordered tree in which each node has a label
and where only the ancestor-descendant relationships are important – the
preceding-following relationships are unimportant. This is called unordered
tree pattern matching.
In general, since XML data collections can be very large, efficient evalu-
ation techniques for all types of tree pattern matching are needed. A naive
approach to solve the problem is to first decompose complex relationships into
binary structural relationships (parent-child or ancestor-descendant) between
pairs of nodes, then match each of the binary relationships against the XML
database, and finally combine the results of those basic matches.
- how to take advantage of the indexes built on the content of the
  document nodes (Section 4.3);
- how the discovered conditions and properties can be used to write pattern
  matching algorithms that are correct and which, from a numbering
  scheme point of view, cannot be further improved (Section 4.4 presents
  a summary of their main features).
All the twig matching algorithms have been implemented in the XSiter (XML
SIgnaTure Enhanced Retrieval) system, a native and extensible XML query
processor providing very high querying performance in general XML query-
ing settings (see Section 4.6 for an overview of the system architecture and
features). A detailed description of the complete pattern matching algo-
rithms is available in Appendix B. Further, we also consider an alternative
approach [138] specific to unordered tree matching, which in certain querying
scenarios can provide equivalently high (or even better) efficiency than
that of the “standard” algorithms (Section 4.5). Such an approach is based on
decomposition and structurally consistent joins. Finally, we provide exten-
sive experimental evaluation performed on real and synthetic data of all the
proposed algorithms (Section 4.7).
[Figure 4.1: a sample data tree — root A (1,10), with B (2,5) and F (7,9) among its nodes; pre-order and post-order ranks are shown in brackets.]

pre:  A B C D E G F H O P
post: D E C G B O P H F A
rank: 1 2 3 4 5 6 7 8 9 10
The pre-order and post-order sequences are ordered lists of all nodes of
a given tree T . In a pre-order sequence, a tree node v is traversed and
assigned its (increasing) pre-order rank, pre(v), before its children are recur-
sively traversed from left to right. In the post-order sequence, a tree node v
is traversed and assigned its (increasing) post-order rank, post(v), after its
children are recursively traversed from left to right. For illustration, see the
pre-order and post-order sequences of our sample tree in Figure 4.1 – the
node’s position in the sequence is its pre-order/post-order rank, respectively.
Pre- and post-order ranks are also indicated in brackets near the nodes.
Given a node v ∈ T with pre(v) and post(v) ranks, the following proper-
ties are important towards our objectives:
[Figure 4.2: properties of pre-order and post-order ranks — relative to a node v, the pre/post plane separates the ancestor (A), descendant (D), preceding (P), and following (F) regions.]
Observe that the index in the signature sequence is the node's pre-order rank, so
the value actually serves two purposes. In the following, we use the term
pre-order when we mean the rank of the node; when we consider the position of
the node's entry in the signature sequence, we use the term index. For example,
⟨a, 10; b, 5; c, 3; d, 1; e, 2; g, 4; f, 9; h, 8; o, 6; p, 7⟩ is the signature of the
tree from Figure 4.1. The first signature element, a, is the tree root. Leaf
nodes in signatures are all nodes with post-order smaller than the post-order
of the following node in the signature sequence, that is, nodes d, e, g, o – the
last node, in our example node p, is always a leaf. We can also determine
the level of leaf nodes, because size(i) = 0 for all leaves i, thus
level(i) = i − post(i).
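As an illustration, a minimal sketch (our naming; trees as nested (label, children) pairs) that computes a signature in one depth-first pass:

```python
def signature(tree):
    """Build the tree signature: (label, post-order rank) pairs listed in
    pre-order; `tree` is a nested (label, [children]) structure."""
    sig, post = [], [0]

    def visit(node):
        label, children = node
        entry = [label, None]          # post-order rank filled after children
        sig.append(entry)              # appended in pre-order
        for child in children:
            visit(child)
        post[0] += 1
        entry[1] = post[0]

    visit(tree)
    return [tuple(e) for e in sig]

def leaves(sig):
    """Pre-order ranks of the leaves: entries whose post-order is smaller
    than their successor's; the last entry is always a leaf."""
    return [i + 1 for i, (_, p) in enumerate(sig)
            if i + 1 == len(sig) or p < sig[i + 1][1]]
```

For the tree of Figure 4.1, signature(('a', [('b', [('c', [('d', []), ('e', [])]), ('g', [])]), ('f', [('h', [('o', []), ('p', [])])])])) yields exactly the signature above, and leaves(...) returns [4, 5, 6, 9, 10], i.e. nodes d, e, g, o, p.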
Extended Signatures
Sub-Signatures
Lemma 4.1 The query tree Q is included in the data tree D in an ordered
fashion if the following two conditions are satisfied: (1) on the level of
node names, sig(Q) is sequence-included in sig(D), determining sub_sig_S(D)
through the ordered set of indexes S = (s_1, s_2, . . . , s_n) where s_1 < . . . <
s_n; (2) for all pairs of entries i and j in sig(Q) and sub_sig_S(D), i, j =
1, 2, . . . , |Q| − 1 and i + j ≤ |Q|, whenever post(q_{i+j}) > post(q_i) it is also true
that post(d_{s_{i+j}}) > post(d_{s_i}), and whenever post(q_{i+j}) < post(q_i) it is also true
that post(d_{s_{i+j}}) < post(d_{s_i}).
Observe that Lemma 4.1 defines a weak inclusion of the query tree in the
data tree, in the sense that the parent-child relationships of the query are
implicitly reflected in the data tree only as ancestor-descendant relationships. However,
due to the properties of pre-order and post-order ranks, such constraints can
easily be strengthened, if required.
Example 4.1 For example, consider the data tree D in Figure 4.1 and the
query tree Q in Figure 4.3. Such query qualifies in D, because sig(Q) =
⟨h, 3; o, 1; p, 2⟩ determines sub_sig_S(D) = ⟨h, 8; o, 6; p, 7⟩ through the ordered
set S = (8, 9, 10), because (1) q_1 = d_8, q_2 = d_9, and q_3 = d_10; (2) the
post-order of node h is higher than the post-order of nodes o and p, and
the post-order of node o is smaller than the post-order of node p (both
in sig(Q) and sub_sig_S(D)). If we change in our query tree Q the label
h for f, we get sig(Q) = ⟨f, 3; o, 1; p, 2⟩. Such a modified query tree is
[Figure 4.3: the query tree Q — root H (1,3) with children O (2,1) and P (3,2).]
also included in D, because Lemma 4.1 does not insist on strict parent-child
relationships, and implicitly considers all such relationships as ancestor-descendant.
However, the query tree with root g, resulting in sig(Q) =
⟨g, 3; o, 1; p, 2⟩, does not qualify, even though it is also sequence-included (on
the level of names) as the sub-signature sub_sig_S(D) = ⟨g, 4; o, 6; p, 7⟩ with
S = (6, 9, 10). The reason is that the query requires the post-order to go down
from g to o (from 3 to 1), while in the sub-signature it actually goes up
(from 4 to 6). That means that o is not a descendant node of g, as required
by the query, which can be verified in Figure 4.1. □
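A brute-force check of Lemma 4.1 can be sketched as follows (our naming; it enumerates all increasing index sets, so it is for illustration only — the algorithms of Section 4.4 avoid this blow-up):

```python
from itertools import combinations

def ordered_inclusion(sig_q, sig_d):
    """Brute-force test of Lemma 4.1: signatures are (label, post) lists in
    pre-order; try every increasing index set S and check name equality
    plus agreement of the post-order relationships."""
    n = len(sig_q)
    for S in combinations(range(len(sig_d)), n):        # s_1 < ... < s_n
        if any(sig_q[i][0] != sig_d[S[i]][0] for i in range(n)):
            continue                                    # condition (1): names
        if all((sig_q[i][1] < sig_q[j][1]) == (sig_d[S[i]][1] < sig_d[S[j]][1])
               for i, j in combinations(range(n), 2)):
            return True                                 # condition (2): post-orders
    return False
```

On the signatures of Example 4.1, it returns True for the query rooted in h and False for the one rooted in g.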
Multiple nodes with common names may result in multiple tree inclu-
sions. As demonstrated in [137], the tree signatures can easily deal with
such situations just by simply distinguishing between node names and their
unique occurrences.
Lemma 4.2 A path P is included in the data tree D if the following two
conditions are satisfied: (1) on the level of node names, sub_sig_P(Q) is
sequence-included in sig(D), determining sub_sig_S(D) through the ordered set
of indexes S = (s_1, . . . , s_n) where s_1 < . . . < s_n; (2) for each i ∈ [1, |P| − 1]:
post(d_{s_i}) < post(d_{s_{i+1}}).
Lemma 4.3 The query tree Q is included in the data tree D in an unordered
fashion if the following two conditions are satisfied: (1) on the level of node
names, an ordered set of indexes S = (s_1, s_2, . . . , s_n) exists, 1 ≤ s_i ≤ m for
i = 1, . . . , n, such that d_{s_i} = q_i for i = 1, . . . , n; (2) for all pairs of entries
i and j, i, j = 1, 2, . . . , |Q| − 1 and i + j ≤ |Q|, if post(q_{i+j}) < post(q_i) then
post(d_{s_{i+j}}) < post(d_{s_i}) ∧ s_{i+j} > s_i.
Notice that the index set S is ordered but, unlike the ordered inclusion of
Lemma 4.1, the indexes are not necessarily in increasing order. In other
words, an unordered tree inclusion does not necessarily imply the node-name
inclusion of the query signature in the data signature. Should the signature
sig(Q) of the query not be included on the level of node names in the signature
sig(D) of the data, S would not determine the qualifying sub-signature
sub_sig_S(D). In any case, as shown in [138], any S satisfying the properties
specified in Lemma 4.3 can always undergo a sorting process in order to determine
the corresponding sub-signature of sig(D) qualifying the unordered
tree inclusion of Q in D.
Figure 4.4 shows further examples of the three types of pattern matching.
In the shown trees, the nodes' names are followed by their pre-order and
post-order ranks in brackets.
and

\[ sig(D) = \langle d_1, post(d_1);\ d_2, post(d_2);\ \ldots;\ d_m, post(d_m) \rangle \]

[Figure 4.4: summary of the three types of pattern matching on the sample data tree — a query with author (2,2) and lastName (3,1) (matches ⟨1,2,4⟩ and ⟨1,6,7⟩), one with title (2,1) and lastName (3,2) (matches ⟨1,5,7⟩ and ⟨6,8,7⟩), and one with firstName (2,1) and lastName (3,2) (match ⟨2,3,4⟩).]
we denote with ans_Q(D) the set of answers to the ordered inclusion of Q
in D and with Uans_Q(D) the set of answers to the unordered inclusion of
Q in D. For the sake of brevity, we will use the notation (U)ans_Q(D) to
designate situations which apply to both cases. Obviously, if Q is a path
P then ans_P(D) = Uans_P(D). In the previous section, we have shown the
properties an index set must satisfy so that it is an answer to the inclusion
of Q in D. In all three cases, matching on the level of node names
is required. Let Σ_i designate the set (domain) of all positions j in the data
signature where the query node name q_i occurs (i.e. d_j = q_i). The set of
answers (U)ans_Q(D) is a subset of the Cartesian product of the domains
Σ_1 × Σ_2 × . . . × Σ_n determined by Lemma 4.1 or Lemma 4.3, respectively.
Obviously, if one of the Σ_i sets is empty, Q is not included in D, because the
Cartesian product is empty.
A naïve strategy to compute the desired Cartesian product subset is to
first compute the sets Σ_i for i ∈ [1, n], and then discard from the Cartesian
product Σ_1 × Σ_2 × . . . × Σ_n the tuples whose corresponding sub-signatures
do not satisfy the properties required by the specific pattern matching constraints.
The intrinsic limitation of this approach is twofold: it can produce
very large intermediate results and, even more important, such evaluation
[...]

[Figure 4.5: snapshot of the sequential scan at the k-th step — the domain Σ_h collects the positions of the data signature (1, 2, . . . , k, . . . , m) matching query node h.]
In other words, at a given step j, not all data nodes represented in the
domains Σ_1^j, . . . , Σ_n^j are necessary for the generation of the delta answers
∆(U)ans_Q^j(D), . . . , ∆(U)ans_Q^m(D). Thus, we denote with ∆Σ_i^j the “reduced”
versions of the original domains Σ_i^j which are needed to decide the delta
answers from the j-th step. In the following, we show the pre- and post-order
conditions ensuring that a data node already accessed (or accessed
at a given step) is not necessary for the generation of the solutions from
that step up to the end of the scan. Necessary conditions are founded
on the relative positions between such a data node and the other data nodes
accessed so far, because at a given step of the sequential scan we have no
information about the properties of the data nodes that follow. Moreover, we
characterize the delta answers that can be decided at each step. To this end,
we will consider the snapshot of the sequential scanning process occurring at
the k-th step (see Figure 4.5). We assume that l_k(D) matches l_h(Q) and thus
k should be added to ∆Σ_h^k. Our main aim is to determine the conditions
under which any of the already accessed data nodes, i.e. those with pre-orders
1, . . . , k, will always violate one of the pre- or post-order relationships
required by Lemmas 4.1–4.3 with the data nodes already accessed or those
following the k-th. In this case, either the data node has already been used
in the generation of the previous delta answers or it will never be used and
thus is unnecessary for the generation of ∆(U)ans_Q^j(D), for each j ≥ k.
[Figure 4.6: (a) Condition PRO1 and (b) Condition PRO2 in the domain space.]

[...]
The following Theorem shows that the three previous conditions together
constitute the sufficient conditions such that a data node, due to its pre-order
value, is no longer necessary.
Theorem 4.1 (Completeness) For the ordered case, beyond the condi-
tions expressed in Lemmas 4.4, 4.5, and 4.6, there is no other condition
ensuring that at each step k, any data node due to its pre-order value does
not belong to the solutions which will be generated in the following steps.
Example 4.2 With respect to the example of Figure 4.4, the table above
shows the impact of Conditions PRO1 and PRO2 on the composition of the
delta domains in the domain space during the sequential scan for ordered
twig matching. Notice that at the 4-th step, Condition PRO2 avoids the
insertion of the data node 4 in the pertinence domain ∆Σ_3^4, as ∆Σ_2^4 is empty.
Moreover, at the 8-th step, thanks to Condition PRO1, ∆Σ_3^8 is empty. □
As to the unordered case, notice that the pre-order values of any qualifying
set of indexes are not required to be completely ordered as they are for
the ordered evaluation. For this reason, the Lemmas above are no longer
sound. However, the unordered evaluation requires a partial order among
the pre-order values of a qualifying set of indexes (s_1, . . . , s_n). In particular,
whenever post(q_{i+j}) < post(q_i) it is required that post(d_{s_{i+j}}) < post(d_{s_i}) and
that s_{i+j} > s_i. Thus, Lemmas 4.5 and 4.6, rewritten in the following way, are
still sound.
Theorem 4.2 (Completeness) For the unordered case, beyond the condi-
tions expressed in Lemmas 4.7 and 4.8, there is no other condition ensuring
that at each step k, any data node due to its pre-order value does not belong
to the solutions which will be generated in the following steps.
In this way, we have shown the sufficient and necessary pre-order conditions
for the exclusion of a data node from the generation of the delta solutions of
the ordered and unordered inclusion of a query tree in a data tree.
Lemma 4.9 (Condition POT1) If i ∈ [1, h − 1] exists such that, for each
s_i ∈ ∆Σ_i^{k−1}, post(d_{s_i}) < post(d_k) is required but post(d_{s_i}) > post(d_k), or
post(d_{s_i}) > post(d_k) is required but post(d_{s_i}) < post(d_k), then k ∉ S for each
S ∈ ∆(U)ans_Q^j(D), for each j ∈ [k, m]. Thus k can be deleted from ∆Σ_h^j.
Different is the case of the deletion of a data node preceding the k-th in the
sequential scan and thus already belonging to a delta domain. Notice
that, as in the pre-order case, at the k-th step of the sequential scan a
node belonging to a delta domain will no longer be necessary for the generation
of the delta answer sets, due to its post-order value, if one of the required
post-order relationships w.r.t. the other nodes will always be violated by the
data nodes following the k-th in the sequential scan. It means that either
d_j with j < k has already been used in the generation of the previous delta
answer sets or it will never be used. In the pattern matching definition, two
kinds of relationships are taken into account between the post-order values
of two data nodes d_i and d_j: either it is required that post(d_i) < post(d_j)
or that post(d_i) > post(d_j). Given the relationship between the post-order value
of the k-th node, post(d_k), and that of a preceding node post(d_j) (j < k),
we want to predict what kind of inequality relationship will hold between
the post-order value of d_j and those of the nodes following d_k in the sequential
scan. Only if we are able to do so can we state that d_j is no
longer necessary due to its post-order value. At first glance, considering
the properties of the pre-order and post-order ranks given in Figure 4.2, it
seems that the post-order relationships between post(d_j) and post(d_k), and between
post(d_j) and post(d_{j′}) with j′ > k, are completely independent. This is true
when post(d_j) > post(d_k). Different is the case of the other post-order relationship,
post(d_j) < post(d_k), which is taken into account in the following
Lemma.
Lemma 4.10 Let j < k and post(d_j) < post(d_k). It follows that post(d_j) <
post(d_{j′}), for each j′ ∈ [k, m].
Thus, both in the case of path matching and in that of (un)ordered twig matching,
a data node d_j can be deleted from the pertinence delta domain if and only if it is
required that its post-order value be greater than that of another data node but
that condition will never be satisfied from a particular step of the scanning
process onwards. We analyze such situations in depth in the following.
Let us first consider the case of path matching. Lemma 4.10 allows the
introduction of the following Lemma showing a sufficient condition such that
a data node, due to its post-order value, is no longer necessary to generate
the answers to the path inclusion evaluation. For illustration see Figure 4.7-a
depicting Condition POP on the domain space.
[Figure 4.7: (a) Condition POP and (b) Condition POT2 in the domain space.]

[...]
Example 4.3 Consider the example of Figure 4.4 at step 6, when the composition
of the delta domains is the following: ∆Σ_1^6 = {1} (book), ∆Σ_2^6 = {2}
(author), and ∆Σ_3^6 = {4} (lastName). In this case, the post-order of the
current data node is post_6(D) = 7, which is greater than both post_2(D) = 3
and post_4(D) = 2. Thus nodes 2 and 4 can be deleted from their pertinence
domains and the composition of the delta domains becomes ∆Σ_1^6 = {1},
∆Σ_2^6 = {}, and ∆Σ_3^6 = {}. Intuitively, they have already been used in the
generation of the delta answer ∆ans_P^4 = {⟨1, 2, 4⟩} at the 4-th step and they
will never be used again because node 6 belongs to another path. □
Given the situation depicted in Lemma 4.9, the previous Lemma produces the
same effect on the current data node d_k as Lemma 4.9, i.e. its deletion from
∆Σ_h^j. On the other hand, Lemma 4.11 also acts on the nodes preceding d_k in
the sequential scan. For this reason, Lemma 4.11 can be used in place
of Lemma 4.9. The proof is given in the following proposition. Notice that,
the query Q being a path P, we only consider the case post(q_i) < post(q_h).

Lemma 4.12 If for each i ∈ [1, h − 1] and for each s_i ∈ ∆Σ_i^{k−1}, post(d_{s_i}) <
post(d_k), then, due to Lemmas 4.5 and 4.11, k ∉ S for each S ∈ ∆ans_P^j(D),
for each j ∈ [k, m]. Thus k can be deleted from ∆Σ_h^j.
It should be emphasized that Condition POP is a necessary and sufficient
condition, i.e. it states the only possible condition under which a data node can
be deleted due to its post-order value. The proof is in Theorem 4.4, where we
show that in the computation of the delta answers there is no further condition on
post-order values to be checked.
The problem of generic twig inclusion evaluation is more involved
than the path matching problem. In this case, both kinds of post-order
relationships are in principle allowed between any two data nodes in
a set qualifying the pattern matching, and only a partial order is required.
The following Lemma is the counterpart of Lemma 4.11, as it shows the
post-order conditions under which the nodes preceding d_k in the sequential
scan can be deleted. It only considers the inequality condition on post-order
values where it is required that post(d_k) be greater than that of any other data node.
For illustration see Figure 4.7-b.
Lemma 4.13 (Condition POT2) Let s_i ∈ ∆Σ_i^k and post(d_{s_i}) < post(d_k).
If ī ∈ [1, n] exists such that post(q_ī) < post(q_i) and ∆Σ_ī^k = ∅, or for each
s ∈ ∆Σ_ī^k such that s > s_i, post(d_s) > post(d_{s_i}), then s_i ∉ S, for each
S ∈ ∆(U)ans_Q^j(D), for each j ∈ [k, m].
Finally, whenever Condition POT2 involves the root domain, i.e. ∆Σ_1^k, no
check on the other delta domains is required and all the nodes s_1 ∈ ∆Σ_1^k
such that post(d_{s_1}) < post(d_k) can be deleted from ∆Σ_1^k.
Lemma 4.14 (Condition POT3) Let s_1 ∈ ∆Σ_1^k and post(d_{s_1}) < post(d_k);
then s_1 ∉ S, for each S ∈ ∆(U)ans_Q^j(D), for each j ∈ [k, m].
Example 4.4 Consider the unordered twig matching of the example of Figure 4.4
at step 6, when the delta domains are as follows: ∆Σ_1^6 = {2} (author),
∆Σ_2^6 = {3} (firstName), and ∆Σ_3^6 = {4} (lastName). At this step a new
root arrives, i.e. node 6, as post_6(D) > post_2(D). Thus Condition POT3
allows us to delete node 2 from ∆Σ_1^6, which becomes empty. Consequently, the
delta domains ∆Σ_2^6 and ∆Σ_3^6 can also be emptied thanks to Condition PRO2.
Note that such a situation is very frequent in data-centric XML scenarios. □
Theorem 4.3 (Completeness) For the twig case, beyond the conditions
expressed in Lemmas 4.13 and 4.14, there is no other condition ensuring
that, at each step k, a data node does not belong, due to its post-order value,
to the solutions which will be generated in the following steps.
For the unordered case, where only a partial order is required, new matches
can only be decided when the data node k matches with a query node which
is a leaf.
Lemma 4.16 If i > h exists such that post(q_h) > post(q_i) then ∆ans_Q^k = ∅.
post(d_u) < post(t′(q_l)) for each t′(q_l) ∈ T(q_l) : pre(t′(q_l)) > pre(t(q_l)).
Under these conditions, if T_l is scanned in sorted order by pre-order values
then we can safely discard the node d_u, since it will never belong to any
answer. Moreover, since post(d_{u′}) < post(d_u) for each descendant d_{u′} of
d_u, the conditions above also hold for these nodes, so we can safely discard
them too. From the above considerations we can conclude that if, during the
sequential scan, we access a node d_u such that post(d_u) < post(t(q_l)), we
can directly continue the scan from the first following of d_u. If the
signature does not contain the first following values ff_i, we can still safely
skip a part of the document thanks to the following observation. We have
ff_{d_u} = u + size(d_u) + 1 and size(d_u) = post(d_u) − u + level(d_u); since
0 ≤ level(d_u) ≤ h, where h is the height of the data tree, we can safely
continue the scan from the node having pre-order equal to post(d_u) + 1.
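This observation can be turned into a tiny helper; the following sketch (Python, names ours and purely illustrative) computes the pre-order from which the scan can resume, falling back to the conservative bound when first-following values are not stored in the signature.

    def next_scan_position(u, post_u, ff_u=None):
        # If the signature stores first-following values, jump there directly:
        # ff(d_u) = u + size(d_u) + 1 is the first node after d_u's subtree.
        if ff_u is not None:
            return ff_u
        # Otherwise use ff(d_u) = post(d_u) + level(d_u) + 1 >= post(d_u) + 1:
        # every pre-order up to post(d_u) + 1 that follows u belongs to a
        # descendant of d_u, so resuming from max(u + 1, post_u + 1) can
        # never skip a useful node.
        return max(u + 1, post_u + 1)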
[Figure 4.8: the query twig and the data tree of Example 4.5; nodes are labelled with (pre-order, post-order) pairs, e.g. A(1,3) for the query root and A(1,6) for the data root.]
Example 4.5 Consider Figure 4.8, where the nodes are represented as circles
filled with different shades. The query in the example has two value-constrained
leaves; suppose that each document leaf satisfies the corresponding condition.
Then the lists T(B_2) and T(C_3) obtained through the index are {B_4} and
{C_2, C_3, C_5, C_6}, respectively. An answer to the query requires that any
node matching "C" follows (i.e. has a greater pre-order value than) any node
matching "B"; thus we know that the elements C_2 and C_3 in T(C_3) will never
be candidates for domain insertion, because no element of T(B_2) has a smaller
pre-order value. □
Definition 4.2 During the sequential scan, let k be the current document
pre-order; we say that two lists T(q_li) and T(q_li+1) are aligned iff the
current target t(q_li+1) is determined by the following three values:
the current document pre-order k;
the pre-order of the current target for the element q_li (pre(t(q_li)));
the minimum pre-order value in the delta domain ∆Σ_li^k (minPre(∆Σ_li^k)).
With these three values we can define the minimum pre-order value that the
target t(q_li+1) must assume. More precisely, given

minPre = max{k, min{pre(t(q_li)), minPre(∆Σ_li^k)}}

t(q_li+1) is the first element in T(q_li+1) that has a pre-order value greater
than or equal to minPre. The updating of the targets should then be performed
during the sequential scan and whenever a deletion is performed on a ∆Σ_li
delta domain. In order to take advantage of the transitive property and to
minimize the number of operations, we perform the update starting from t(q_li)
and then propagate it to the targets on its right (with increasing values of i).
If the target lists are aligned we also say that the current targets are
aligned.
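The alignment step can be sketched as follows (Python; list and variable names are ours, and the delta-domain minima are assumed to be maintained elsewhere):

    import bisect

    def align_targets(target_lists, cur, k, min_pre_domain):
        # target_lists[i]: sorted pre-orders of T(q_li); cur[i]: index of t(q_li)
        # min_pre_domain[i]: minPre of the delta domain of q_li
        #                    (float('inf') if the domain is empty)
        for i in range(len(target_lists) - 1):
            t_i = target_lists[i][cur[i]]
            # minPre = max{k, min{pre(t(q_li)), minPre(delta domain of q_li)}}
            min_pre = max(k, min(t_i, min_pre_domain[i]))
            # t(q_li+1) is the first element of T(q_li+1) with pre >= minPre
            # (an index past the end of the list means the list is exhausted)
            cur[i + 1] = bisect.bisect_left(target_lists[i + 1], min_pre)
        return cur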
Insertion Avoidance and Skipping Policy For the path case we have
shown that during a sequential scan, depending on the current target and on
the post-order value of the current document element, we can safely skip some
parts of the input document, because we are sure that no elements useful for
the current query will be found in the skipped parts. For the ordered twig
case we need to introduce more constraints that limit the applicability of the
skipping strategy. As for the path case, the skipping policy can only be based
on conditions on the post-order values of document nodes and current targets;
in particular, Lemma 4.10 ensures that if j < k and post(d_j) < post(d_k) then
post(d_j) < post(d_j′), for each j′ ∈ [k, m]. From the skipping policy point of
view this means that, while we are looking for answers that include the current
targets, if we access a document node that should have a post-order value
greater than the targets' one (i.e. we are looking for an ancestor of the
targets) but it is actually smaller, then we can skip all descendants of the
current document element, because none of them could be useful.

[Figure 4.9: the two data trees of Example 4.6; nodes are labelled with (pre-order, post-order) pairs, e.g. B(n+4, n+2), C(n+5, n+3), B(3,1).]
Example 4.6 Consider Figure 4.9. The sample query specifies a value condition
over its leaf "C"; suppose that in both input documents the "C" elements
satisfy this condition. Let us first analyze the first data tree. During the
sequential scan we first access the "D" element, which does not match with any
query element; then we access its first child and, independently of its label
"X", we can entirely skip its subtree. At this step, the delta domain
associated to "A" is empty and the current document post-order is smaller than
the current target's one (i.e. the current target is outside the subtree rooted
at "X"). The second data tree case is different. Here the first element matches
with the root query element and, being an ancestor of the current target, it is
inserted into the corresponding delta domain. Again we access element "X", but
this time, even if the current post-order value is smaller than the current
target's one, we cannot skip the subtree of the current element: in fact, the
delta domain associated to "A" is not empty and possible matches for "B" (whose
post-order is not required to be greater than that of "C") could be lost if we
skipped the current element's subtree (as the example shows). □
This simple example shows that, differently from the path case, we cannot
establish whether a skip is safe by taking into consideration only post-order
values: the status of the delta domains must be analyzed as well.

[Figure 4.10: the query and the data tree of Example 4.7; nodes are labelled with (pre-order, post-order) pairs, e.g. A(1,3), D(1,12), E(8,5), C(9,6).]
Example 4.7 Consider Figure 4.10. As before, suppose that each document leaf
satisfies the corresponding value condition, that the initial target lists are
T(B_2) = {B_3, B_6, B_11} and T(C_3) = {C_9, C_12}, and that they are aligned;
then the current targets are t(B_2) = B_3 and t(C_3) = C_9. The first accessed
node is D_1, which does not match with any query node. Then we access A_2,
which matches with the root of the query; we have L_1 = {B_2, C_3} and, since
post(t(C_3)) = 6 and [...]
Each query node q_x is thus associated with:
no reference target, if L_x = ∅;
one reference target q_lu_x, i.e. the one with the highest pre-order in L_x.
The observation above enables us to avoid the insertion of useless nodes; but,
in order to perform a skip, we must first define under which conditions a skip
is safe. Basically, a skip is safe if no useful element lies in the skipped
part of the document. This rough condition can be refined as follows:
none of the current partial solutions can be extended by nodes that are
descendants of the current document node;
[...] update of the related current targets.

[Figure 4.11: the two data trees of Example 4.8; nodes are labelled with (pre-order, post-order) pairs, e.g. C(n+4, n+2), B(n+5, n+3), B(3,1).]
Assume that the current reference target for q_x is q_li_x and that q_x matches
with the current document node d_k; if post(d_k) < post(t(q_li_x)) then the
node d_k is useless and we can avoid inserting it into the delta domain
associated to q_x. Now we can discuss the skipping policy used for the
unordered matching algorithm. As in the previous cases, the skipping policy can
only be based on conditions on the post-order values of nodes and current
targets and, as in the ordered case, in order to establish whether a skip is
safe we need to analyze the status of the delta domains. For the unordered case
a matching node can be inserted in the corresponding domain only if there is at
least one supporting occurrence (a node with a greater post-order value) in the
domain of its parent/ancestor.
Example 4.8 Consider Figure 4.11. The scenario shown is very similar to the
one for the ordered case; the query specifies a value condition on the node
"B", and we suppose again that in both input documents all "B" nodes satisfy
this condition. Let us analyze the first data tree. The sequential scan first
accesses node "D", which does not match with any query node; then it accesses
its first child and, independently of its label, we can avoid scanning its
subtree: at this step all the delta domains are empty and the current document
post-order value is smaller than the post-order value of the current target for
"B". Even if we could build a partial solution with nodes found in the subtree
of the current element, we would never be able to complete it with the required
"B" node (the next match for "B" is outside the subtree and, since a match for
"A" is still missing, it is not possible to complete partial solutions with
this node). Now let us analyze the second data tree. The root of the document
matches with "A" and, since it is an ancestor of the current target for "B",
we can insert it into the delta domain associated to "A". Now we access the
first child of the root; again the current document post-order value is smaller
than the one of the current target for "B", but this time we cannot skip the
subtree. The delta domain associated to "A" is not empty, so possible matches
for "C" (like the one shown in the example) will be useful. It has to be noted
that, if we introduced the order constraint, the same subtree could be safely
skipped (since "C" matches are useless unless we find a preceding match for
"B"). □
Starting from these examples we can derive the conditions under which a
skip is safe for the unordered matching algorithm. From the point of view of a
query node q_i, a skip is considered safe iff:

L_i ≠ ∅ ∧ post(d_k) < post(t(q_lu_i)) ∧ (∆Σ_i^k = ∅ ∨ the skip is safe for every q_j ∈ children(q_i))
It is obvious that if q_i is related to targets only by following/preceding
relationships (i.e. L_i = ∅), or if its current reference target is a
descendant of the current document node, the skip is unsafe: in the former case
we have no information about the next matches for these kinds of nodes and, in
the latter case, we know that at least one target would be lost if we performed
the skip. The second part of the condition is less intuitive. First, if the
delta domain associated to q_i is empty the skip is considered safe, because we
are guaranteed that elements matching q_i or any descendant of q_i will never
be part of a solution, since there is no valid match for q_lu_i in the subtree
of the current document node. If the delta domain associated to q_i is not
empty, we need to verify whether the skip is safe for all the children of q_i;
if for at least one child the skip is considered unsafe, then the skip is
unsafe for q_i as well. In order to establish whether a skip is safe, it is
sufficient to check the condition above for the query root q_1.
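The condition translates directly into a recursive check over the query tree; a sketch (Python, attribute names ours), where each query node carries its children, whether its list L_i is empty, the post-order of its current reference target, and whether its delta domain is empty:

    def skip_is_safe(q, post_dk):
        """Recursive skip-safety test for the unordered case, checked on the
        query root q1. Mirrors: L_i != {} and post(d_k) < post(t(q_lu_i)) and
        (delta domain of q_i empty or skip safe for every child of q_i)."""
        if q.L_empty:                         # related only by following/preceding
            return False                      # no information on the next matches
        if post_dk >= q.current_target_post:  # reference target inside the subtree
            return False                      # skipping would lose it
        if q.domain_empty:
            return True                       # nothing below can extend a solution
        return all(skip_is_safe(c, post_dk) for c in q.children)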
In this section we show how the conditions presented so far can be used
in pattern matching algorithms to manage the domains associated with the
query nodes. In particular, for each query node identified by its pre-order
value i, we assume that the post(i) operator accesses its post-order value and
the l(i) operator accesses its label, and we associate with it a domain Di
together with the minimum and the maximum post-order values of the data nodes
stored in Di (accessible by means of the minPost and maxPost operators,
respectively). In this context we will only show a sketch of the ideas and the
basic skeletons of the algorithms; further, we will not analyze the
content-index-optimized versions exploiting the properties described in
Section 4.3. For a full discussion of the complete algorithms and all their
different versions see Appendix B.
Nodes are scanned in sorted order of their pre-order values and insertions
in the domains are always performed on the top by means of the push operator.
Thus the data nodes from the bottom up to the top of each domain are in
increasing pre-order. Moreover, each data node k in the pertinence domain Dh
consists of a pair (post(k), pointer to a node in Dprev(h)), where prev(h) is
h − 1 in the ordered case whereas it is the parent of h in Q in the unordered
case. When the data node is inserted into Dh, its pointer indicates the pair
which is at the top of Dprev(h). In this way, the set of data nodes in Dprev(h)
from the bottom up to the data node pointed by k implements ∆Σ_prev(h)^k. By
recursively following such links from Dprev(h) down to D1, we can derive
∆Σ_prev(prev(h))^k, . . . , ∆Σ_1^k. Figure 4.12 shows the algorithm skeletons
with place-holders implementing the complete set of deletion conditions, as
suggested by Theorems 4.4, 4.5 and 4.6 (e.g. PRO1 for Condition PRO1). Even if
the implementation of the complete set of deletion conditions ensures the
compactness of the domains, putting the theoretical framework into practice
also means selecting the most effective reduction conditions with respect to
the pattern and the data tree. Indeed, in some cases the CPU time spent to
apply a condition is not repaid by the advantages gained on the domain
dimensions and/or the execution time. A deep analysis of this aspect is
provided in Section 4.7.
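A compact realization of this layout, as a sketch in Python (class and function names are ours; domains is assumed to be indexed by the query pre-order, with prev() as defined above):

    class Domain:
        """Pertinence domain D_h: a stack of (post, ptr) pairs; ptr records
        the index of the top of D_prev(h) at the moment the pair was pushed."""
        def __init__(self):
            self.stack = []

    def push(domains, prev, h, post_k):
        # insertion is always on top: pre-orders grow from bottom to top
        top_prev = len(domains[prev(h)].stack) - 1   # -1 if D_prev(h) is empty
        domains[h].stack.append((post_k, top_prev))

    def delta_chain(domains, prev, h, ptr):
        """Follow the back-pointers from D_prev(h) down to D_1; each slice
        stack[:ptr+1] materializes the corresponding delta domain."""
        slices = []
        while h > 1:
            h = prev(h)                                  # h-1, or parent(h) in Q
            slices.append(domains[h].stack[:ptr + 1])
            if ptr >= 0:
                ptr = domains[h].stack[ptr][1]
        return slices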
The three kinds of pattern matching considered share a common skeleton,
shown at the top of Figure 4.12. Line 0 determines the sequence of data nodes
for the sequential scanning. In the worst case it is the whole data tree;
otherwise it can be reduced by filters or auxiliary structures as in
[25, 32, 72], to which our theoretical framework can easily be adapted. Indeed,
whenever the pre-order value of the first occurrence first(l) of each label l
in D and of the last one last(l) are available, then, thanks to Condition PRO2,
the sequential scanning can start from first(l(1)). In the ordered case, it can
end at last(l(n)) due to Lemma 4.15 whereas, in the unordered case, Lemma 4.16
suggests setting the end at the maximum value among last(l(l)) for each leaf l
in the query. Note that, if the algorithms are performed on compressed
structural surrogates of the original data tree, then the first and last values
can be computed in the surrogate construction phase.

[Figure 4.12: skeletons of the three matching algorithms with the place-holders of the deletion conditions — (a) PMatch(P): (3) POP, (4) PRO2, (5) Lemma 4.15, (6) PRO1; (b) OTMatch(Q): (3) POT2 & POT3, (4) PRO2, (5) POT1, (6) Lemma 4.15, (7) PRO1; (c) UTMatch(Q): (3) POT2 & POT3, (4) PRU, (5) POT1, (6) Lemma 4.16.]

Therefore, assuming that the current data node k matches with the h-th query
node and should thus be added to Dh (lines 1-2), the three algorithms implement
the required conditions in the most effective order. First they try to delete
nodes by means of the conditions on post-orders (line 3); then they check
whether the intersection between domains at different steps is empty (line 4)
and, in this case, delete all data nodes in the domains specified by the
corresponding Conditions; finally they work on the current node. In particular,
the twig algorithms implement Condition POT1 to check whether k can be added to
Dh, and all the algorithms verify whether solutions can be generated. Finally,
through the PRO1 code fragment, the ordered algorithms delete k if it is the
last node.
As to the PMatch(P) algorithm, domains can be treated as stacks, i.e.
deletions are implemented following a LIFO policy by means of the pop operator.
Indeed, node deletions are only performed in the POP fragment, which pops from
each domain the data nodes whose post-order value is smaller than post(k);
after it, post(si) > post(k) for each si ∈ Di and thus Condition POP can no
longer be applied. Moreover, the fact that domains are stacks allows us to
implement Condition PRO2 efficiently (isEmpty(Di) checks whether Di is empty,
whereas empty(Di′) empties Di′). The remaining code fragments are the
following:
(1) if(h = n)
(2)   showSolutions(h,1);
generates the solutions once the current node matches the last query node,
while
(1) if(h = n)
(2)   pop(Dh);
implements the PRO1 fragment, deleting the current node when it is the last
one. In the twig algorithms, insertions are guarded by the PRO2/PRU and POT1
fragments, respectively:
(1) if(¬isEmpty(Dprev(h)))
(2)   push(Dh,(post(k),pointerToTop(Dprev(h))));
(1) if(isNeeded(h,k))
(2)   push(Dh,(post(k),pointerToTop(Dprev(h))));
where the boolean function isNeeded() checks the condition shown in Condition
POT1 by using the minPost(D) and maxPost(D) values of each domain D instead of
comparing k with each data node in D. Obviously, whenever both POT1 and PRO2 or
PRU are applied, the two conditions of Lines 1 are put together:
¬isEmpty(Dprev(h)) ∧ isNeeded(h,k). By analogy with the path matching
algorithm, Lemma 4.15 and Lemma 4.16 check whether new solutions can be
generated and, in this case, call the recursive functions implementing
Theorems 4.5 and 4.6, respectively.
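For illustration only, the existence test behind isNeeded() can be answered from the cached extremum alone — assuming, as our reading of Condition POT1 (not stated in full here), that k needs a potential ancestor, i.e. a node with a greater post-order value, in the previous/parent domain:

    def is_needed(prev_domain_max_post, post_k):
        """prev_domain_max_post: the cached maxPost of D_prev(h), or None if
        the domain is empty. Some node of D_prev(h) has a post-order greater
        than post(k) iff the cached maximum does."""
        return prev_domain_max_post is not None and prev_domain_max_post > post_k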
4.5 Unordered decomposition approach
[Figure 4.13: the two unordered twig queries and the data tree of Examples 4.9 and 4.11; nodes are labelled with (pre-order, post-order) pairs, e.g. A(1,4), A(1,5), A(1,3), B(3,1), C(4,2).]
Example 4.9 Consider Figure 4.13, in particular the first unordered query
shown, which we will call Q, and the data tree, D. We have sig(Q) =
⟨a, 3, 4, 0; f, 1, 3, 1; b, 2, 4, 1⟩ and sig(D) = ⟨a, 5; a, 3; b, 1; c, 2; f, 4⟩.
The only sub-signature qualifying the unordered tree inclusion of Q in D is
defined by the index set {1, 5, 3} and the corresponding sub-signature is
sub_sig_{1,3,5}(D) = ⟨a, 5; b, 1; f, 4⟩. Notice that the index set {1, 5, 3}
satisfies both conditions of Lemma 4.3, whereas the index set {2, 5, 3} only
matches at the level of node names and is not a qualifying one. The rewriting
of Q gives rise to the following paths: rew(Q) = {P2, P3}, where P2 = {1, 2}
and P3 = {1, 3}, and the outcome of their evaluation is ans_P2 = {{1, 5}} and
ans_P3 = {{1, 3}, {2, 3}}. The common sub-path between P2 and P3 is P2 ∩ P3 =
{1}. The index 1 occurs in the first position both in P2 and in P3. From the
Cartesian product of ans_P2(D) and ans_P3(D) it follows that the index sets
{1, 5} ∈ ans_P2(D) and {1, 3} ∈ ans_P3(D) are structurally consistent, as they
share the same value in the first position and have different values in the
second position, whereas {1, 5} ∈ ans_P2(D) and {2, 3} ∈ ans_P3(D) are not
structurally consistent and thus not joinable. □
The following definition states the meaning of structural consistency for two
generic subtrees Ti and Tj of Q; paths Pi and Pj are particular instances of
Ti and Tj.

Definition 4.3 (Structural consistency) Let Q be a query twig, D a data
tree, Ti = {t_i^1, . . . , t_i^n} and Tj = {t_j^1, . . . , t_j^m} two ordered
sets of indexes determining sub_sig_Ti(Q) and sub_sig_Tj(Q), respectively, and
ans_Ti(D) and ans_Tj(D) the answers of the unordered inclusion of Ti and Tj in
D, respectively. Si = {s_i^1, . . . , s_i^n} ∈ ans_Ti(D) and
Sj = {s_j^1, . . . , s_j^m} ∈ ans_Tj(D) are structurally consistent if:
for each pair of common indexes t_i^h = t_j^k, s_i^h = s_j^k;
for each pair of different indexes t_i^h ≠ t_j^k, s_i^h ≠ s_j^k.
[Figure 4.14: Structural join of Example 4.9 — ans_P2(D) = {(1, 5)} over P2 = {1, 2}; ans_P3(D) = {(1, 3), (2, 3)} over P3 = {1, 3}; sj(ans_P2, ans_P3) = {(1, 5, 3)} over P2 ∪ P3 = {1, 2, 3}.]
The structural join sj(ans_Ti(D), ans_Tj(D)) thus returns an answer set defined
on the union of the two sub-queries Ti and Tj as the join of the structurally
consistent answers of ans_Ti(D) and ans_Tj(D). Starting from the set of answers
{ans_Px1(D), . . . , ans_Pxk(D)} for the paths in rew(Q), we get the answer set
ans_Q(D) identifying the unordered inclusion of Q in D by incrementally merging
the answer sets by means of the structural join. Since the structural join
operator is associative and symmetric, we can compute ans_Q(D) as:

ans_Q(D) = sj(. . . sj(sj(ans_Px1(D), ans_Px2(D)), ans_Px3(D)) . . . , ans_Pxk(D))    (4.1)
Example 4.11 Consider Figure 4.13 again. In this example we show the
evaluation of the unordered tree inclusion, in the data tree D, of the second
twig query depicted, which we will call Q. It can easily be verified that there
is no qualifying sub-signature, since at most two of the three paths find a
correspondence in the data tree. The rewriting phase produces the set rew(Q) =
{P2, P3, P4}, where P2 = {1, 2}, P3 = {1, 3}, and P4 = {1, 4}. The final result
ans_Q(D) is the outcome of the structural join:

sj(ans_P2(D), ans_P3(D), ans_P4(D)) = sj(sj(ans_P2(D), ans_P3(D)), ans_P4(D)) = ∅

The answer sets of the separate paths and of sj(ans_P2(D), ans_P3(D)) are
shown in Figure 4.15. The final result is empty since the only pair of joinable
answers, {1, 5, 3} ∈ sj(ans_P2(D), ans_P3(D)) and {1, 5} ∈ ans_P4(D), is not
structurally consistent: the two different query nodes 2 ∈ P2 ∪ P3 and 4 ∈ P4
correspond to the same data node 5. This means that there are not as many data
tree paths as query tree paths. □
[Figure 4.15: Structural join of Example 4.11 — ans_P2(D) = {(1, 5)} over P2 = {1, 2}; ans_P3(D) = {(1, 3), (2, 3)} over P3 = {1, 3}; ans_P4(D) = {(1, 5)} over P4 = {1, 4}; sj(ans_P2(D), ans_P3(D)) = {(1, 5, 3)} over P2 ∪ P3 = {1, 2, 3}.]
Theorem 4.7 Given a query twig Q and a data tree D, the answer set
ansQ (D) as defined by Eq. 4.1 contains all and only the index sets S quali-
fying the unordered inclusion of Q in D according to Lemma 4.3.
(1)  P = pop(rew(Q));
(2)  pQ = P;
(3)  evaluate ans_pQ(D);
(4)  while((rew(Q) not empty) AND (ans_pQ(D) not empty))
(5)    P = pop(rew(Q));
(6)    pP = P \ (P ∩ pQ);
(7)    t_k is the parent of pP, k is its position in pQ;
(8)    PAns = ∅;
(9)    for each answer S in ans_pQ(D)
(10)     evaluate ans_pP(sub_sig_{s_k+1,...,ff_{s_k}−1}(D));
(11)     if(ans_pP(sub_sig_{s_k+1,...,ff_{s_k}−1}(D)) not empty)
(12)       add sj({S}, ans_pP(sub_sig_{s_k+1,...,ff_{s_k}−1}(D))) to PAns;
(13)   pQ = pQ ∪ P;
(14)   ans_pQ(D) = PAns;
Notice that the algorithm does not directly compute the structural join
sj(ans_pQ(D), ans_P(D)) as shown in Eq. 4.1, but rather applies a sort of
nested-loop algorithm in order to perform the two phases in one shot. As each
pair of index sets must be structurally consistent in order to be joinable, we
compute only those answers in ans_P(D) which are structurally consistent with
some of the answers in ans_pQ(D); as a matter of fact, only such answers may be
part of the answers to Q. To do so, the algorithm tries to extend each answer
in ans_pQ(D) to the answers to pQ ∪ P by only evaluating the sub-path of P
which has not been evaluated in pQ. In particular, step 6 stores in the
sub-path pP the part of the path P still to be evaluated which is not in common
with the query pQ evaluated up to that moment: P \ (P ∩ pQ). Step 7 identifies
t_k as the parent of the partial path pP, where k is its position in pQ. For
instance, considering Example 4.9, the two paths P2 and P3 of the query Q are
depicted in Figure 4.17-a. If rew(Q) = {P2, P3}, then at step 5 pQ = P2 and, as
the part of the path P3 corresponding to the query node a has already been
evaluated while evaluating P2, the partial path pP to be evaluated and the
parent t_k of pP are as depicted in Figure 4.17-b.
For each index set S ∈ anspQ (D), each index set in ansP (Q), which is
structurally consistent with S, must share the same values in the positions
corresponding to the common sub-path P ∩ pQ. In other words, we assume
[Figure 4.17: (a) the two paths P2 and P3 of the query Q of Example 4.9, both rooted at A with leaves F and B; (b) the evaluated query part pQ, the partial path pP still to be evaluated, and its parent t_k.]
that the part of the path P which is common to pQ has already been evalu-
ated and that the indexes of the data nodes matching P ∩ pQ are contained
in S. In particular, the index sk in S actually represents the entry of the
data node matching the query node corresponding to tk . Thus, in order to
compute the answers in ansP (D) that are structurally consistent with S and,
then, join with S, the algorithm extends S to the answers to P ∪ pQ by only
evaluating in the “right” sub-tree of the data tree the inclusion of the part
pP of the path P which has not been evaluated yet (step 10). As the path P
has been split into two branches P ∩ pQ and pP , where tk is the parent of pP
and S contains a set of indexes matching P ∩ pQ, the evaluation of pP must
be limited to the descendants of the data node dsk which in the tree signature
corresponds to the sequence of nodes having pre-order values from sk + 1 up
to f fsk − 1. Then it joins S with such answer set by only checking that
different query entries correspond to different data entries (step 12). Notice
that, in step 10, by shrinking the index interval to a limited portion of the
data signature, we are able to reduce the computing time for the sequence
inclusion evaluation.
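In the tree-signature encoding this restriction is a simple index slice; a minimal sketch (ours), where a signature is a list of (label, post, ff) triples ordered by pre-order:

    def descendants(sig, s_k):
        """sig[i-1] = (label, post, ff) for the node with pre-order i.
        Return the slice holding the descendants of node s_k, i.e. the
        nodes with pre-order values s_k + 1 .. ff(s_k) - 1."""
        _label, _post, ff = sig[s_k - 1]
        return sig[s_k:ff - 1]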
The algorithm ends when all the paths in rew(Q) have been evaluated or when the
partial answer set collected up to that moment, ans_pQ(D), becomes empty. The
latter case occurs when we evaluate a path P having no answer which is
structurally consistent with those in ans_pQ(D): sj(ans_pQ(D), ans_P(D)) = ∅.
In this case, for each answer S in ans_pQ(D) two alternatives exist. Either the
evaluation of the partial path pP fails (line 11), which means that none of the
answers in ans_P(D) shares the same values of S in the positions corresponding
to the common sub-path P ∩ pQ, or the structural join between S and the answers
to pP fails (line 12), which means that some of the answers in ans_P(D) share
the same values of S in positions corresponding to different indexes in P and
pQ.
Example 4.12 Let us apply the algorithm described in Figure 4.16 to Ex-
[Figure: overall architecture of the prototype — a query language GUI producing the internal query representation via a query importer; a document importer translating Doc.xml into the internal document representation; a query engine operating on the core system's datastore, which holds the global tag index, the signatures, and the stored documents in internal representation.]
Table 4.1: The XML data collections used for experimental evaluation
In our tests, we used both real and synthetic data sets. We present the results
we obtained on one real and two synthetic collections (see Table 4.1). In order
to show the performance of the matching algorithms in real-world "data-centric"
XML scenarios, we chose the complete DBLP Computer Science Bibliography
archive. The file consists of more than 3.8 million elements; Table 4.2 shows
more details about this XML archive. It is a very "flat" (3 levels) and wide
(very high root fan-out) data tree, as is typical of "data-centric" XML
documents. As in most real data sets, the distribution of the node labels is
not uniform; in fact, the whole set presents repetitions of typical patterns
(for instance, "article-author"). Since typical real data sets are very flat,
DBLP alone would not allow us to test some of the most complex conditions, such
as POT2. For this reason, we also generated two synthetic data sets, Gen1 and
Gen2, as random trees, using the following parameters: depth (5 and 8,
respectively), fan-out (3) and root fan-out (50000 and 30, respectively). Both
synthetic collections differ from the DBLP set in their label distribution,
which is uniform. However, while Gen1 proposes a similarly wide and slightly
deeper tree, Gen2 is very deep and has a smaller root fan-out, thus simulating
more "text-centric" trees. Note that the size of the collections (in particular
the root fan-out) is not very important: our aim is mainly to analyze the
behavior of the algorithms and the trends of the domain sizes, and these
typically become clear after a significant portion of the data has been
scanned.
Table 4.2: DBLP Test-Collection Statistics

Middle-level                       Leaf-level
Element name       Occs            Element name      Occs
inproceedings      241244          author            823369
article            129468          title             376737
proceedings        3820            year              376709
incollection       1079            url               375192
book               1010            pages             361989
phdthesis          72              booktitle         245795
mastersthesis      5               ee                143068
                                   crossref          141681
                                   editor            8032
                                   publisher         5093
                                   isbn              4450
                                   school            77

Summary
Total number of elements     3814975
Total number of values       3438237
Maximum tree height          3
Figure 4.20 shows the testing queries we used to perform the main tests on the
algorithms discussed in Section 4.4. The upper row shows the queries on the
DBLP collection (denoted Dn), while the lower one defines the queries for the
synthetically generated collections (denoted Gn). In both cases, we provide a
path and two twigs. For DBLP, the query depth is limited to 2, therefore we
differentiated the queries by means of an increasing fan-out. The labels are
specifically chosen among the least selective ones, in order to test our
algorithms in the most demanding settings. As to Gen1 and Gen2, we created
queries G2 and G3 deeper than the DBLP ones; they are specifically conceived to
test all the conditions which would not be activated in the shallow DBLP
setting.
[Figure 4.20: the testing queries — for DBLP, a path and two twigs over labels such as author, title, pages, ee, url, year, crossref and booktitle; for the synthetic collections, a path and two twigs over labels B–G.]
Figure 4.21: The query templates used in the decomposition approach tests [trees rooted at phdthesis and book, with children such as author, title and year].
Table 4.3: Pattern matching results for the different queries and collections
We conducted experiments by using not only queries defined by the plain
templates (designated as "NSb-h"), which only contain tree structural
relationships, but also queries (designated as "VSb-h") where the templates are
extended by predicates on the author name. Value accesses are supported by a
content index. We chose highly content-selective predicates because we believe
this kind of query is especially significant for highly selective fields, such
as the author name; on the other hand, the performance of queries with
low-selectivity fields should be very close to that of the corresponding
templates. In this way, we measured the response time of twelve queries, half
of which contain predicates.
All experiments were executed on a Pentium 4 2.5GHz Windows XP Professional
workstation, equipped with 512MB RAM and a RAID0 cluster of two 80GB EIDE disks
with NT file system (NTFS).
Queries are denoted with a prefix signifying the applied matching algorithm
(P- for path matching, O- for ordered and U- for unordered twig matching). For
each query setting, we present the fundamental details of the algorithm
executions, i.e. the total execution time (in milliseconds), the total number
of solutions retrieved, the mean number of solutions constructed each time a
solution construction is started, the mean domain size (denoted MDS), the total
number of node insertions, and the percentage of avoided insertions with
respect to their total number. Observe that in all the cases the time is in the
order of a few seconds (7 seconds at most, for query U-D3), even though each of
the settings presents non-trivial query execution challenges: a very wide data
tree for both DBLP and Gen1, a considerable repetitiveness of DBLP labels and
patterns (notice the very high number of solutions, over half a million for
queries D2) and a very deep and involved tree for Gen2 (notice the high number
of solutions per solution construction, nearly one or even two orders of
magnitude larger than for the other two collections). Also observe the large
number of node insertions and, especially, the high percentage of avoided
insertions, which is very significant for all collections (e.g. nearly a
million avoided insertions for the DBLP O-D3 query, while for the Gen1 and Gen2
twigs the number of non-inserted nodes is much higher than that of the inserted
ones).
Finally, the MDS parameter is particularly significant for all the queries. It
represents the mean size of the domains, measured each time a solution
construction is called, over the whole collection. Its low values in each of
the settings (less than 1.8 nodes for DBLP and Gen1, reaching 7.92 for the most
complex query in the deep Gen2) testify to the efficacy of our reduction
conditions: since the mean domain size is low, the number of deletions is very
close to the number of insertions. Keeping the MDS low is essential for
efficiency reasons, since the time spent in constructing the solutions is
roughly proportional to the Cartesian product of the domain sizes; in many
cases it is also essential for the good outcome of the matching, since an
overflow of the domains would mean a total failure.
Table 4.4: Summary of the discussed cases (disabled conditions are marked
with an 'x')

In the following we analyze in detail some of the most complex and interesting
cases, briefly discussing the others in words. Table 4.4 presents a summary of
the cases we will discuss, together with the associated disabled conditions.
Path matching
Twig matching
As to twig matching, the number of available conditions requires a deeper
analysis. As for paths, we will first inspect the post-order conditions (i.e.
POT), which are the main source of avoided insertions and deletions. As shown
in the algorithms (Section 4.4), the key functions activating the POT
conditions are isCleanable() for deletions (exploiting POT2 and POT3) and
isNeeded() (isNeededOrd() for the ordered and isNeededUnord() for the unordered
case) for insertions (exploiting POT1). We started by analyzing the
"percentages of success" of such functions for each call in each of the
queries; Table 4.5 provides such statistics. For isCleanable(), "success" means
allowing a node deletion (returning true, either for POT3 or POT2), while for
isNeeded() it means preventing a useless insertion (returning false for POT1).
Notice that POT1 can be satisfied by examining nodes in the parent domain
(denoted in the table as POT1p) or, only for the unordered case, in the sibling
domains (POT1s in the table).
[Figure 4.22: (a) mean domain size plotted against the document pre-order, standard case vs. case T-A (query O-G3, collection Gen1); (b) mean domain size, standard case vs. cases T-B and T-D:]

              O-G2    O-G3    U-G2    U-G3    O-G2    O-G3    U-G2    U-G3
             (Gen1)  (Gen1)  (Gen1)  (Gen1)  (Gen2)  (Gen2)  (Gen2)  (Gen2)
Case std.      1.32    1.59    1.47    1.75    2.93    5.99    5.40    7.92
Case T-B       1.65    1.96    1.88    2.29    4.40    7.94    6.42    8.74
Case T-D       1.44    1.83    1.85    2.16    3.70    7.81    8.50   12.11
The percentage of success of both functions is considerable in all cases. In
DBLP, the percentage of deletion success is lower than in the other
collections; this is due both to the more repetitive and simple structure of
its data tree and to the inapplicability of condition POT2 (the DBLP queries
have only two levels). Such condition proves instead quite useful in the other
collections, particularly for unordered matching, where its application rate
often comes close to that of POT3. As to POT1, the main contribution is given
by situation POT1p, while POT1s can also give a good contribution in the
ordered matching. For the two functions under discussion we also performed some
CPU utilization tests and found that their contribution is generally
significant also from an execution-time point of view, since their percentage
of CPU utilization is typically less than 4% of the total CPU time.
To quantify the specific effects of the conditions on the domain size and on
time, we distinguish cases T-A to T-D (see Table 4.4). The first three cases
clearly produce fewer node deletions, while case T-D allows useless insertions.
Case T-A is conceptually equivalent to case P-A, i.e. deletions are almost
totally prevented. As in P-A, we found that, as expected, the domain sizes grow
uncontrollably, preventing the termination of the matching in acceptable time.
As an example, the graph in Figure 4.22-a compares, after each data node, the
mean stack size of case T-A with the standard one for query O-G3 on collection
Gen1. As to cases T-B and T-D, the algorithms generally produced domain sizes
larger than in the standard case (see Figure 4.22-b). Even if the difference in
size may not seem particularly significant, we have to consider that the time
spent in constructing the solutions is roughly proportional to the Cartesian
product of the domain sizes; thus the differences in execution time may become
more evident.
[Figure 4.23: execution times (msec) of the standard case vs. case T-D for the most complex queries on collection Gen2:]

              O-G2   O-G2b   O-G3   U-G2   U-G2b   U-G3
Case std.      113     188    190    180     592    961
Case T-D       152     227    266    319    2516   2552
For instance, if the domains are on average one and a half times larger, as for
query U-G3 (collection Gen2, case T-D), each solution construction run becomes
nearly 20 times longer. While for the simplest queries we found that the
execution time is still not much affected, since the time required to check the
conditions compensates for the shorter solution generation time, for the most
complex settings the difference in execution time can be remarkable. As an
example, Figure 4.23 shows, for the most complex queries on collection Gen2,
the comparison between the standard-case time and that of case T-D. Notice
that, in order to verify execution-time savings in more complex situations, we
also employed new queries specifically for these tests, e.g. a modified version
of query U-G2, named U-G2b, where second-level nodes have two children instead
of one. As seen in the graph, the difference in execution time can reach a
proportion of 1:5 (query U-G2b). The results obtained for case T-C are very
different between the ordered and unordered settings. Disabling condition POT3
produces almost no variation in the ordered matching (condition POT2 produces
the same deletions at the cost of a little more time spent in checking the
hypotheses), while POT3 proves essential for unordered matching (time and
domain size grow uncontrollably). Note that if we disabled PRO1 together with
POT3, case T-C would degenerate for the ordered setting too, since the last
domain would not be empty and POT2 could no longer always be applied.
Finally, as for the paths, we can also briefly analyze the cases involving the
deactivation of the pre-order conditions (PRO, PRU), denoted as T-E and T-F in
Table 4.4. Simulating case T-E (which is conceptually similar to the P-B case
for paths) means disabling the deletions produced by the pointer updates, i.e.
there will be dangling pointers. Differently from P-B, such a case produces
uncontrolled growth in domain size and in time, proving that such conditions
are essential for more complex queries. As to case T-F, this is equivalent to
case P-C, and the results obtained for ordered matching confirm those discussed
for that case.
[Table: performance comparison for the unordered decomposition approach — number of twig elements and predicates, number of solutions, decomposition evaluation time, number N of permutations, and mean/total permutation evaluation times:]

Twig     elem. #  pred. #  solutions #  Decomp. (sec)      N   Perm. mean (sec)  Perm. total (sec)
NH2-2       3        0          1343       0.016            2       0.014              0.028
NH3-2       4        0          1343       0.016            6       0.015              0.105
NH7-3      10        0         90720       1.1            288       0.9              259.2
NL2-2       3        0        559209       2.2              2       2.28               4.56
NL3-2       4        0        559209       4.2              6       2.49              14.94
NL8-2       9        0        149700       7.7          40320       4.8            193536
VH2-2       3        1             1       0.015            2       0.014              0.028
VH3-2       4        1             1       0.016            6       0.016              0.096
VH7-3      10        2             1       0.031          288       0.03               8.64
VL2-2       3        1            39       0.65             2       0.832              1.664
VL3-2       4        1            36       0.69             6       1.1                6.6
VL8-2       9        1            29       0.718        40320       2.3            92736
in Section 5.1 we propose the XML S3MART services [94], which are able [...]

5.1 Matching and rewriting services

[Figure 5.1: The role of schema matching and query rewriting in an open-architecture web repository offering advanced XML search functions — the interface logic (an XQuery GUI client) interacts with the business logic (XML S3MART), which in turn relies on the data logic (the Data Manager & Search Engine over the XML and multimedia repositories).]
The structural parts of the documents, described by XML schemas, are used to
search the documents, as they are involved in the query formulation. Due to the
intrinsic nature of semi-structured data, all such documents can be useful to
answer a query only if the target schemas, though different, share meaningful
similarities with the source one, both structural (similar structure of the
underlying XML tree) and semantic (the employed terms have similar meanings).
Consider for instance the query shown in the upper left part of Figure 5.2,
asking for the names of the music stores selling a particular album. The
document shown in the right part of the figure would clearly be useful to
answer such a need; however, since its structure and element names are
different, it would not be returned by a standard XML search engine. In order
to retrieve all such useful documents available in the document base, and thus
fully exploit the potential of the data, the query needs to be rewritten (lower
left part of Figure 5.2). Since such similarities are independent of the
queries which could be issued, they are identified by a schema matching
operation which is preliminary to the proper query processing phase. Then,
using all the information extracted by such an analysis, the approximate query
answering process is performed by first applying a query rewriting operation in
a completely automatic, effective and efficient way. See also Figure 5.3, which
depicts the role of the two operations and the interaction between them.
[Figure 5.3: interaction between schema matching and query rewriting — the schemas S undergo structural expansion, schema matching produces the matching information M, and each query Q is rewritten into the corresponding target query by the query rewriting module.]
As a matter of fact, searches are performed on the XML documents stored in the
repository, and a query in XQuery usually contains paths expressing structural
relationships between elements and attributes. For instance, the path
/musicStore/storage/stock selects all the stock elements that have a storage
parent and a musicStore grandparent which is the root element.
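For a quick check of this behaviour outside the prototype, any standard XPath engine will do; for instance, with lxml and a toy document (ours):

    from lxml import etree

    doc = etree.XML(
        "<musicStore><storage><stock/><stock/></storage>"
        "<location/></musicStore>")

    # all stock elements with a storage parent and the musicStore root
    # as grandparent
    print(len(doc.xpath("/musicStore/storage/stock")))   # -> 2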
The resulting expanded schema file abstracts from several complexities of the
XML Schema syntax, such as complex type definitions, element references, global
definitions, and so on, and ultimately better captures the tree structure
underlying the concepts expressed in the schema.
Consider, for instance, Figure 5.4, showing a fragment of an XML Schema
describing the structural part of documents about music stores and their
merchandise, along with a fragment of the corresponding expanded schema file
and a representation of the underlying tree structure expressing the structural
relationships between the elements which can appear in the XML documents
complying with the schema. As can be seen from the figure, the original XML
Schema contains, along with the element definitions, whose importance is
definitely central (i.e. elements musicStore, location), also type definitions
(i.e. the complex types musicStoreType, locationType) and regular expression
keywords (i.e. all), which may interfere with or even distort the discovery of
the real underlying tree structure, which is essential for an effective schema
matching. In general, XML Schema constructs need to be resolved and rewritten
in a more explicit way in order for the structure of the schema to be as
similar as possible to its underlying conceptual tree structure involving
elements and attributes. Going back to the example of Figure 5.4, the element
location, for instance, is conceptually a child of musicStore: this relation is
made explicit only in the expanded version of the schema, while in the original
XML Schema location was the child of an all node, which was the child of a
complex type. Further, every complex type
Terminology disambiguation

As discussed in the introduction, after having made the structural
relationships of a schema explicit with the expansion process, a further step
is required in order to refine and complete the information delivered by each
schema, thus maximizing the effectiveness of the subsequent matching
computation. This time the focus is on the semantics of the terms used in the
element and attribute definitions. In this step each term is disambiguated,
that is, its meaning is made explicit, as it will be used for the
identification of the semantic similarities between the elements and attributes
of the schemas, which actually rely on the distance between meanings. To this
end, we exploit one of the best-known lexical resources for the English
language: WordNet [100]. The WordNet (WN) lexical database is conceptually
organized in synonym sets, or synsets, representing different meanings or
senses. Each term in WN is usually associated with more than one synset,
signifying that it is polysemic, i.e. it has more than one meaning (some
preliminary WordNet concepts are also available in Appendix A.1). In this
section the focus is on the rewriting and matching computation techniques,
therefore we will consider term disambiguation as a semi-automatic operation
where the operator, by using an ad-hoc GUI, is required to "annotate" each term
used in each XML schema with the best candidate among the WN terms and, then,
to select one of its synsets. Our new structural disambiguation technique
greatly enhances this approach by providing completely automatic terminology
disambiguation, and will be specifically discussed later in this chapter, in
Section 5.2.
Matching Computation

The matching computation phase performs the actual matching operation between
the expanded annotated schemas made available by the previous steps. For each
pair of schemas, we identify the "best" matchings between the attributes and
the elements of the two schemas by considering both the structure of the
corresponding trees and the semantics of the involved terms. Indeed, in our
opinion the meanings of the terms used in the XML schemas cannot be ignored, as
they represent the semantics of the actual content of the XML documents.
Figure 5.5: Example of two related schemas and of the expected matches [Schema A: musicStore with children location (town, country), signboard (colorsign, namesign) and storage, whose stock contains compactDisk elements with a songList of track elements (songTitle, singer) and an albumTitle; Schema B: cdStore with children name, address (city, street, state) and cd elements with vocalist, cdTitle and a trackList of passage elements with title; corresponding nodes are marked with the same letter.]
On the other hand, the structural part of XML documents
cannot be considered as a plain set of terms, as the position of each node in
the tree provides the context of the corresponding term. For instance, let us
consider the two expanded schemas represented by the trees shown in Figure 5.5.
Though different in structure and in the adopted terminology, they both
describe the contents of the albums sold by music stores, for which information
about their location is also represented. Thus, among the results of a query
expressed by using Schema A we would also expect documents consistent with
Schema B. In particular, by looking at the two schemas of Figure 5.5, a careful
reader would probably identify the matches which are represented by the same
letter. At this step, the terms used in the two schemas have already been
disambiguated by choosing the best WN synset. As WordNet is a general-purpose
lexical ontology, it does not provide meanings for specific contexts. Thus, the
best choice is to associate terms such as albumTitle and songTitle of Schema A
and cdTitle and title of Schema B with the same WN term, i.e. title, for which
the best synset can then be chosen. In these cases, which are quite common, it
is only the position of the corresponding nodes which can help us to better
contextualize the selected meaning. For instance, it should be clear that the
node albumTitle matches with the node cdTitle, as both refer to album titles,
and that songTitle matches with the node title, as both refer to song titles.
The steps we devised for the matching computation are partially derived
from the ones proposed in [98] and are the following:

1. the involved schemas are first converted into directed labelled graphs
following the RDF specifications [79], where each entity represents an [...]
[Figure 5.6 content: portions of the RDF models for Schemas A and B (e.g. /musicStore --name--> "musicStore" and /cdStore --name--> "cdStore", with child arcs) and of the resulting pairwise connectivity graph pairing the corresponding nodes.]
Figure 5.6: RDF and corresponding PCG for portions of Schemas A and B
2. Then, an initial similarity score is computed for each node pair contained
in the PCG. This is one of the most important steps in the matching process.
In [98] the scores are obtained using a simple string matcher that compares
common prefixes and suffixes of literals. Instead, in order to maximize the
matching effectiveness, we chose to adopt an in-depth semantic approach.
Exploiting the semantics of the terms in the XML schemas provided by the
disambiguation phase, we follow a linguistic approach in the computation of the
similarities between pairs of literals (names), which quantifies the distance
between the involved meanings by comparing the WN hypernym hierarchies of the
involved synsets. We recall that hypernym relations are also known as IS-A
relations (for instance "feline" is a hypernym of "cat", since you can say "a
cat is a feline"). In this case, the score for each pair of synsets (s1, s2) is
obtained by computing the depths of the synsets in the WN hypernym hierarchy
and the length of the path connecting them, as follows:

sim(s1, s2) = (2 · depth of the least common ancestor) / (depth of s1 + depth of s2)
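This is the classical Wu-Palmer measure; a self-contained sketch over an explicit toy hypernym map (data and names ours):

    def depth(parent, s):
        # number of IS-A links from synset s up to the root of the hierarchy
        d = 0
        while s in parent:
            s, d = parent[s], d + 1
        return d

    def sim(parent, s1, s2):
        # 2 * depth(least common ancestor) / (depth(s1) + depth(s2))
        ancestors = set()
        s = s1
        while True:
            ancestors.add(s)
            if s not in parent:
                break
            s = parent[s]
        s = s2
        while s not in ancestors:     # climb until the two chains meet
            s = parent[s]
        return 2 * depth(parent, s) / (depth(parent, s1) + depth(parent, s2))

    # toy hypernym map: cat IS-A feline IS-A mammal, dog IS-A canine IS-A mammal
    parent = {"cat": "feline", "feline": "mammal",
              "dog": "canine", "canine": "mammal", "mammal": "entity"}
    print(sim(parent, "cat", "dog"))   # lca = mammal: 2*1/(3+3) = 0.333...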
3. The initial similarities, reflecting the semantics of the single node pairs,
are refined by an iterative fixpoint calculation as in the similarity flooding
algorithm [98], which brings the structural information of the schemas into the
computation. In fact, this method is one of the most versatile and also
provides realistic metrics for match accuracy [48]. The intuition behind this
computation is that two nodes belonging to two distinct schemas are the more
similar, the more similar their adjacent nodes are. In other words, the
similarity of two elements propagates to their respective adjacent nodes. The
fixpoint computation is iterated until the similarities converge or a maximum
number of iterations is reached.
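A stripped-down fixpoint iteration in this spirit might look as follows (a simplification of ours with uniform propagation weights, not the actual similarity flooding implementation of [98]):

    def fixpoint(sim0, neighbors, alpha=0.5, iters=50, eps=1e-4):
        """sim0: initial (linguistic) score per PCG node pair;
        neighbors: adjacency lists of the pairwise connectivity graph."""
        sim = dict(sim0)
        for _ in range(iters):
            new = {}
            for p in sim:
                ns = neighbors.get(p, [])
                flow = sum(sim[q] for q in ns) / len(ns) if ns else 0.0
                new[p] = sim0[p] + alpha * flow       # keep seed, add propagation
            top = max(new.values()) or 1.0
            new = {p: v / top for p, v in new.items()}  # normalize into [0, 1]
            if max(abs(new[p] - sim[p]) for p in sim) < eps:
                return new                               # converged
            sim = new
        return sim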
1. all the full paths in the query are rewritten by using the best matches
between the nodes of the given source schema and target schema (e.g.
Query 1, original (Schema A):
FOR $x IN /musicStore
WHERE $x/storage/*/compactDisk//singer = "Elisa"
AND $x//track/songTitle = "Gift"
RETURN $x/signboard/namesign
Query 1, rewritten (Schema B):
FOR $x IN /cdStore
WHERE $x/cd/vocalist = "Elisa"
AND $x/cd/trackList/passage/title = "Gift"
RETURN $x/name

Query 2, original (Schema A):
FOR $x IN /musicStore/storage/stock/compactDisk/songlist/track
WHERE $x/singer = "Elisa"
AND $x/songtitle = "Gift"
RETURN $x
Query 2, rewritten (Schema B):
FOR $x IN /cdStore/cd
WHERE $x/vocalist = "Elisa"
AND $x/trackList/passage/title = "Gift"
RETURN $x/trackList/passage

Query 3, original (Schema A):
FOR $x IN /musicStore
WHERE $x/storage/stock/compactDisk = "Gift"
AND $x/location = "Modena"
RETURN $x
Query 3, rewritten (Schema B):
FOR $x IN /cdStore
WHERE ( $x/cd/vocalist = "Gift"
OR $x/cd/cdTitle = "Gift"
OR $x/cd/trackList/passage/title = "Gift" )
AND ( $x/address/city = "Modena"
OR $x/address/street = "Modena"
OR $x/address/state = "Modena" )
RETURN $x
directly translating the variable value in the rewritten query would lead to a
wrong rewrite: while the elements singer and songTitle referenced in the query
are descendants of track, the corresponding best matches in Schema B, that is
vocalist and title, are not both descendants of passage. Notice that, in these
cases, the query is correctly rewritten because we first substitute each path
with the corresponding full path and then reconstruct the variable, which in
this case holds the value of the path cdStore/cd. Query 3 highlights an
additional rewriting feature concerning the management of predicates involving
values and, in particular, textual values. At present, whenever the best match
of an element containing a value is a middle element, the predicates expressed
on such an element are rewritten as OR clauses on the elements which are
descendants of the matching target element and which contain a compatible
value. For instance, the element compactDisk and its match cd in Schema B are
not leaf elements, therefore the condition is rewritten on the descendant
leaves vocalist, cdTitle and title.
[Figure 5.9: overview of the disambiguation approach — the terms of the schemas with their senses¹, the external knowledge sources and the feedback loop; see Section 5.2.2.]
¹ Notice that the same term could be included more than once and that the disambiguation is strictly dependent on the node each instance belongs to.
corresponding to each pattern specified in the crossing setting (i.e. arc label
and arc crossing direction), and we define the distance d between N and Nc as
the sum of the products of the weight associated with each pattern and the
corresponding number of instances. Then, weight(Nc) is computed by applying a
gaussian distance decay function defined on d:

weight(Nc) = 2 · e^{−d²/8} / √(2π) + 1 − 2/√(2π)
Example 5.1 Assume that in the eBay tree (Figure 5.8) the context is made up of
the siblings and ancestors, that the weight of parent/child arcs is 1 in the
direct direction and 0.5 in the opposite one, and that the maximum number of
crossings is 2. The graph context of the term (mouse, 7) is made up of the
terms (computers, 2), (desktop, 3), (PC, 3), (components, 3), (memory, 4),
(speaker, 5), and (fan, 6). The distances between node 7 and nodes 2, 3, and 4
(5 and 6) are 1 (i.e. 2 arcs crossed in the opposite direction with weight
0.5), 0.5 (i.e. 1 arc crossed in the opposite direction with weight 0.5), and
1.5 (i.e. 1 arc crossed in the opposite direction with weight 0.5 and 1 arc
crossed in the direct direction with weight 1), respectively. Then weight(2) =
0.91, weight(3) = 0.95, and weight(4) = weight(5) = weight(6) = 0.8. □
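A quick numeric check of the reconstructed formula (sketch ours): it reproduces the weights reported for d = 1 and d = 1.5, while d = 0.5 comes out slightly higher (≈0.98), so the reconstruction should be taken as approximate.

    import math

    def weight(d):
        # 2 * exp(-d^2/8) / sqrt(2*pi) + 1 - 2 / sqrt(2*pi)
        c = 2.0 / math.sqrt(2.0 * math.pi)
        return c * math.exp(-d * d / 8.0) + 1.0 - c

    print(round(weight(1.0), 2))   # 0.91, as for weight(2) in Example 5.1
    print(round(weight(1.5), 2))   # 0.80, as for weight(4), weight(5), weight(6)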
The context of each term (t, N) can be expanded with the contexts Scontext(s)
of each sense s in Senses(t, N). This is particularly useful when the graph
context provides too little information. In particular, for each sense we
consider the definitions, the examples and any other explanation of the sense
provided by the thesaurus. As most of the semantics is carried by nouns [68],
the "context expansion" module in Figure 5.9 defines Scontext(s) as the set of
nouns contained in the sense explanation.
Finally, each term (t, N) with its senses Senses(t, N) is disambiguated by
using the previously extracted context. The disambiguation process proper is
the subject of the following section. The result is a ranked version of
Senses(t, N) where each sense s ∈ Senses(t, N) is associated with a confidence
φ(s) in choosing s as a sense of (t, N).
The overall approach is quite versatile. It supports several disambiguation
needs by means of parameters which can be freely combined, from the weights to
the graph context. Moreover, the ranking approach has been conceived in order
to support two types of graph disambiguation services: the assisted and the
completely automatic one. In the former case, the disambiguation [...]

[Figure 5.10: algorithm Disambiguate(t, N).]
The underlying intuition is that the more similar two terms are, the more
informative will be the most specific concept that subsumes them both. However,
our approach differs in the semantic similarity measure sim(t, tc), as it does
not rely on a training phase on large pre-classified corpora but exploits the
hypernymy hierarchy of the thesaurus. In this context, one of the most
promising measures is the Leacock-Chodorow one [82], which we have revised in
the following way:
Leacock-Chodorow [82], which has been reviewed in the following way:
(
−ln len(t,t
2·H
c)
if ∃ a common hypernymy
sim(t, tc ) = (5.1)
0 otherwise
where len(t, tc ) is the minimum among the number of links connecting each
sense in Senses(t, N ) and each sense in Senses(tc , Nc ) and H is the height
of the hypernymy hierarchy (in WordNet it is 16). Moreover, we define the
minimum common hypernym c(t, tc ) of t and tc as the sense which is the
most specific (lowest in the hierarchy) of the hypernyms common to the two
senses (i.e. that crossed in the computation of len(t, tc )). For instance, in
WordNet the minimum path length between the terms “cat” and “mouse” is
5, since the senses of such nouns that join most rapidly are “cat (animal)”
and “mouse (animal)” and the minimum common hypernym is “placental
mammal”. Obviously these two values are not computed within the function
but once for each pair of the involved terms. Eq. 5.1 is decreasing as one
moves higher in the taxonomy thus guaranteeing that “more abstract” is
synonymous of “less informative”. Therefore, function T ermCorr() increases
the confidence of such senses in Senses(t, N ) which are descendants of the
minimum common hypernym (lines 3-4) and the increment is proportional
to how informative the minimum common hypernym is (line 5). At the end
of the process (Figure 5.10, line 6), the value assigned in φG to each sense is
then the proportion of support it receives, out of the support possible which
is kept updated by function T ermCorr() (line 6) and in the main algorithm
(Figure 5.10, line 5).
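A minimal Python sketch of the revised measure of Eq. 5.1 (our own illustration; the helper is hypothetical and assumes that len(t, t_c) has already been computed as described above):

    import math

    def lc_similarity(path_len, H=16):
        # Revised Leacock-Chodorow similarity (Eq. 5.1); H is the height
        # of the hypernymy hierarchy (16 in WordNet). path_len is None
        # when the two terms have no common hypernym.
        if path_len is None:
            return 0.0
        return -math.log(path_len / (2 * H))

    # e.g. for "cat"/"mouse": lc_similarity(5) = -ln(5/32) ~ 1.86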
Besides the contribution of the graph context, the expanded context can also be exploited in the disambiguation process (Figure 5.10, lines 7-9). In this
case, the main objective is to quantify the semantic correlation between the
context Gcontext(t, N ) of the polysemous term (t, N ) and the explanation
of each sense s in Senses(t, N ) represented by Scontext(s). In particular,
the confidence in choosing s is proportional to the computed similarity value
(Figure 5.10, line 9). The pseudocode of function ContextCorr() is shown
in Figure 5.12. It essentially computes the semantic similarity between each
term ti in the graph context and the terms in the sense context Scontext(s)
(lines 3-7) by calling the TermCorr() function for each term tsj in Scontext(s) (line 6) and then by computing the maximum of the obtained confidence vector φT. The returned value (line 8) is the mean of the similarity values
computed for the terms in Gcontext(t, N ).
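The following Python fragment sketches the logic of ContextCorr() as just described (an illustration only: term_corr stands in for the confidence produced by TermCorr(), and the data structures are hypothetical):

    def context_corr(gcontext, scontext, term_corr):
        # For each term ti in the graph context, compute its correlation
        # with every term tsj in the sense context and keep the maximum;
        # the returned value is the mean over the graph context.
        maxima = []
        for ti in gcontext:
            phi_t = [term_corr(ti, tsj) for tsj in scontext]
            maxima.append(max(phi_t, default=0.0))
        return sum(maxima) / len(maxima) if maxima else 0.0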
The last contribution is that of function decay(), which exploits the frequency of use of the senses in the English language (Figure 5.10, line 9). In particular, WordNet orders the list of senses WNSenses(t) of each term t on the basis of the frequency of use (i.e. the first is the most common sense, and so on). We increment the confidence in choosing each sense s in Senses(t, N) in a way which is inversely proportional to its position, pos(s), in such an ordered list:
\[
decay(s_i) = 1 - \rho \, \frac{pos(s_i) - 1}{|WNSenses(t)|}
\]
done on the instance level, trying to reduce the approximate structural query evaluation problem to the well-known unordered tree inclusion [119] or tree edit distance [65] problems directly on the data trees. However, the process of unordered tree matching is difficult and extremely time consuming; for instance, the edit distance on unordered trees was shown to be NP-hard in [141]. Ad-hoc approaches based on explicit navigation of the nodes’ instances, such as [37], are equally expensive and generally deliver inadequate performance due to the very large size of most of the available XML data trees.
On the other hand, a large number of approaches prefer to address the
problem of structural heterogeneity by first trying to solve the differences
between the schemas on which data are based. Schema matching is a prob-
lem which has been the focus of work since the 1970s in the AI, DB and
knowledge representation communities [15, 48]. Many systems have been de-
veloped: The most interesting ones working on XML data are COMA [49],
which supports the combination of different schema matching techniques,
CUPID [85], combining a name and a structural matching algorithm, and
Similarity Flooding (SF) [98], providing a particularly versatile graph match-
ing algorithm. Many approaches also combine schema level with instance
level analysis, such as LSD and GLUE [50], which are based on machine
learning approaches needing a preliminary training phase. However, most of the work on XML schema matching has been motivated by the problem of schema integration: A global view of the schemas is constructed, and from this point on the fundamental aspect of query rewriting remains particularly problematic and difficult to solve [115]. As to rewriting, most of the works present interesting and complex theoretical studies [107]. In [27] the theoretical foundations are laid for query rewriting techniques based on views of semi-structured data, in particular for regular path queries. Some rewriting
methods have also been studied in the context of mediator systems, in or-
der to rewrite the submitted queries on the involved sources: For instance,
[111] presents an approach based on the exploitation of a description logic,
while [106] deals with the problem of the informative capability of each of
the sources. However, in general, the proposed rewriting approaches rarely
actually benefit from the great promises of the schema matching methods.
bag of words approach, where the context is merely a set of words next to
the term to disambiguate, and the relational information approach, which
extends the former with other information such as their distance or rela-
tion with the involved word. Our disambiguation algorithm adopts the relational information approach, which is more complex but generally performs much better. In the literature, a further distinction is based on the kind of information source used to assign a sense to each word occurrence [68]. Our disambiguation method is a knowledge-driven method as it combines the context of the word to be disambiguated with additional information extracted from an external knowledge source, such as electronic dictionaries and thesauri. Such an approach often benefits from general applicability and is able to achieve good effectiveness even when it is not restricted to specific domains. WordNet is, without doubt, the most widely used external knowledge source [26] and its hypernym hierarchies constitute
a very solid foundation on which to build effective relatedness and similarity
measures, the most common of which are the path based ones [82]. Further,
the descriptions and glosses provided for each term can deliver additional
ways to perform or refine the disambiguation: The gloss overlap approach
[11] is one of them. Among the alternative approaches, the most common
one is the corpus-based or statistic approach where the context of a word is
combined with previously disambiguated instances of such word, extracted
from annotated corpora [4, 5]. Recently, new methods relying on the entire
web textual data, and in particular on the page count statistics gathered by
search engines like Google [38, 39], have also been proposed. However, generally speaking, the problem with such approaches is that they are extremely data-hungry: they require extensive training, huge textual corpora (which are not always available) and/or a very large amount of manual work to produce the sense-annotated corpora they rely on. This problem prevents their use in the application contexts we refer to, as even “raw” data are not always available (e.g. in a PDMS, peers do not necessarily store actual data).
Experimental setting
We evaluated the effectiveness of our techniques in a wide range of contexts
by performing tests on a large number of different XML schemas. Such
schemas include the ones we devised ad hoc in order to precisely evaluate
the behavior of the different features of our approach, an example of which
is the music store example introduced in Figure 5.5, and schemas officially
adopted in worldwide DLs in order to describe bibliography metadata or
audio-visual content.
In particular, we further tested XML S3MART on carefully selected pairs of similar “official” schemas derived from:
In this section we will discuss the results obtained for the music store
example and for a real case concerning the schemas employed for storing
the two most popular digital libraries of scientific references in XML format:
[Figure 5.13: A small selection of the matching results between the nodes of Schema A (on the left) and B (on the right) before filtering; similarity scores are shown on the edges (e.g. musicStore–cdStore 0.98, compactDisk–cd 0.59, stock–street 0.08).]
The DBLP Computer Science Bibliography archive and the ACM SIGMOD
Record.
Effectiveness of matching
For the music store schema matching, we devised the two schemas so as to have both different terms describing the same concept (such as musicStore and
cdStore, location and address) and also different conceptual organizations
(notably singer, associated to each of the tracks of a cd in Schema A, vs.
vocalist, pertaining to a whole cd in Schema B). First, we performed a careful annotation, in which we associated each of the different terms with the most similar term and sense available in WordNet. The annotation phase for
the ad-hoc schemas was quite straightforward, since we basically only had to
associate each term with the corresponding WN term; the only peculiarities
are the annotations of composite terms (e.g. cdTitle, annotated as title) and
of the very few terms not present in WN (e.g. compactDisk annotated as cd ).
After annotation, the XML S3MART iterative matching algorithm automatically identified the best matches among the node pairs, which coincide with the ones shown in Figure 5.5. For instance, matches A, E, F and G between
nodes with identical annotation and a similar surrounding context are clearly
identified. A very similar context of surrounding nodes, together with similar
but not identical annotations, are also the key to identify matches B, C,
D, H and J. The matches I and K require particular attention: Schema A’s songTitle and albumTitle are correctly matched with Schema B’s title and cdTitle, respectively. In these cases, all four annotations are the same
(title) but the different contexts of surrounding nodes allow XML S3 MART
to identify the right correspondences. Notice that before applying the stable
marriage filtering each node in Schema A is matched to more than one node
in Schema B; simply choosing the best matching node in Schema B for each
of the nodes in Schema A would not represent a good choice. Consider for
instance the small excerpt of the results before filtering shown in Figure 5.13: The best match for stock (Schema A) is cdStore, but such node has a
better match with musicStore. The same applies between stock - cd, and
cd - compactDisk. Therefore, the matches for musicStore and compactDisk
are correctly selected (similarities in bold), while stock, a node which has no
correspondent in Schema B, is ultimately matched with street. However, the score for such a match is very low (< 0.1) and is finally filtered out by a threshold filter. We also tested our system against specific changes in the ad-hoc schemas. For instance, we modified Schema B by making the vocalist node a child of the node passage, making the whole schema more similar to Schema A’s structure, and we tested whether the similarities between the
involved nodes increased as expected. In particular, we noticed that the
similarities of the matches between cd, passage and vocalist and their
Schema B correspondents were 10% to 30% higher.
As to the tests on a real case, Figure 5.14 shows the two involved schemas,
describing the proceedings of conferences along with the articles belonging
to conference proceedings. Along with the complexities already discussed
in the ad-hoc test, such as different terms describing the same concept
(proceedings and issue, inproceedings and article), the proposed pair
of schemas presents additional challenges, such as a higher number of nodes, structures describing the same reality with different levels of detail (as for author) and a different distribution of the nodes (more linear for DBLP, with a higher depth for SIGMOD), making the evaluation of the matching phase particularly critical and interesting. In such real cases the annotation process is no longer trivial and many terms may have no WN correspondence: For instance, DBLP’s ee, the link to the electronic edition of an article, is annotated as link, and inproceedings, a term not available in WN, as article. In general, we tried to be as objective and as faithful to the schemas’ terms as possible, avoiding, for instance, artificially hinting at the right matches by selecting identical annotations for different corresponding terms: For instance, terms like proceedings (DBLP) and issue (SIGMOD) were annotated with the respective WN terms, dblp was annotated as bibliography while sigmodRecord as record.
After annotation, the XML S3MART matcher automatically produced the
matches shown in Figure 5.14. Each match is identified by the same letter
inside the nodes and is associated with a similarity score (on the right). The
effectiveness is very high: Practically all the matches, from the fundamental
ones like B and G, involving articles and proceedings, to the most subtle,
such as L involving the link to electronic editions, are correctly identified
without any manual intervention. Notice that the nodes having no match
(weak matches were pruned out by filtering them with a similarity threshold
of 0.1) actually represent concepts not covered in the other schema, such as
[Figure 5.14: Results of schema matching between Schema DBLP and Schema SIGMOD. Each match, represented by a letter inside the nodes, is associated with a similarity score (shown on the right): A 0.48, B 0.64, C 0.21, D 0.29, E 0.29, F 0.29, G 0.98, H 0.22, I 0.28, J 0.22, K 0.31, L 0.29.]
between nodes: For instance, match K is not completely correct since the node pages would have to be matched with two nodes of Schema SIGMOD, initPage and endPage, and not just with initPage.
We performed many other tests on XML S3 MART effectiveness, generally
confirming the correct identification of at least 90% of the available matches.
Among them, we conducted “mixed” tests between loosely correlated schemas,
for instance between Schema B and DBLP. In this case, the matches’ scores
were very low as we expected. For instance, the nodes labelled title in
Schema B (title of a song) and DBLP (title of articles and proceedings)
were matched with a very low score, more than three times smaller than the
corresponding DBLP-SIGMOD match. This is because, though having the
same annotation, the nodes had a completely different surrounding context.
Finally, notice that in order to obtain such matching results it was necessary to find a good trade-off between the influence of the similarities between given pairs of nodes and that of the surrounding nodes, i.e. between the annotations and the context of the nodes; in particular, we tested several graph propagation coefficient formulas [98] and we found that the one delivering the most effective results is the inverse total average.
Experimental setting
Tests were conceived in order to show the behavior of our disambiguation
approach in different scenarios. We tested three groups of trees characterized by two dimensions of interest. The first dimension, specificity, indicates how much
a tree is contextualized in a particular scope; trees with low specificity can
be used to describe heterogeneous concepts, such as a web directory, whereas
trees with high specificity are used to represent specialized fields such as data
about movies and their features and staff. The second dimension, polysemy,
indicates how much the terms are ambiguous. Trees with high polysemy
contain terms with very different meanings: For instance, rock and track
whose meanings radically change in different contexts. On the other hand,
trees with low polysemy contain mostly terms whose senses are characterized
by subtle shades of meaning, such as title. For each feasible combination
of these properties we formed a group by selecting the three most represen-
tative trees. Group1 is characterized by a low specificity and a polysemy
which increases along with the level of the tree; it is the case of web directo-
ries in which we usually find very different categories under the same root, a
low polysemy at low levels and high polysemy at the leaf level. The trees we
selected for Group1 are a small portion of Google’s and Yahoo!’s web directories and of eBay’s catalog. Group2 is characterized by a high specificity and a high polysemy; we chose structures extracted from XML documents of Shakespeare’s plays, the Internet Movie Database (IMDb, www.imdb.org) and a possible On Line Music Shop (OLMS). Finally, Group3 is characterized by a high specificity and a low polysemy and contains representative XML schemas from the DBLP and SIGMOD Record scientific digital libraries and the Dublin Core Metadata Initiative (DCMI, dublincore.org) specifica-
tions. Low specificity and high polysemy are hardly compatible, therefore
we will not consider this one as a feasible combination.
Table 5.1 shows the features of each tree involved in our experimental evaluation. From left to right: The number of terms, the mean and maximum number of terms’ senses, the percentage of correct senses among all the possible senses, and the average similarity among the senses of each given term in the tree (computed by using a variant of Eq. 5.1). Notice that our trees are composed of 15-40 terms. Even though not large, their composition allows us to generate a significant variety of graph contexts. The other features are instead important in order to understand the difficulty of the disambiguation task: For instance, the higher the number of senses of the involved terms, the more difficult their disambiguation will be. The mean number of senses of Group2 and Group3 is almost double that of Group1, thus we expect their disambiguation to be harder. This is confirmed by the percentage of correct
senses among all the possible senses, which can be considered an even more significant “ease factor” and is higher in Group1. The last feature partially expresses how the trees are positioned w.r.t. the polysemy dimension: the higher the average sense similarity, the lower the polysemy, since the different senses have closer meanings. This is true in particular for Group1 and Group3 trees, confirming the initial hypothesis.
Effectiveness evaluation
In our experiments we evaluated the performance of our disambiguation algorithm mainly in terms of effectiveness. Efficiency evaluation is not crucial for a disambiguation approach, so it will not be examined in depth (in any case, the disambiguation process for the analysed trees required at most a few seconds). Traditionally, wsd algorithms are evaluated in terms of precision and recall figures [68]. In order to produce a deeper analysis not only of the quality of the results but also of its possible motivations w.r.t. the different tree scenarios, we considered the precision figure along with a number of newly introduced indicators. The recall figure is not considered because its computation is usually based on frequent repetitions of the same terms in different documents, and we are not interested in evaluating the wsd quality from a single-term perspective.
The disambiguation algorithm has first been tested on the entire collection
of trees using the default graph context: all the terms in the tree. Figure 5.15
shows the precision results for the disambiguation of the three groups. Three
contributions are presented: The graph context one (Graph), the expanded
context one (Exp) and the combined one (Comb). In general, precision P
is the mean of the number of terms correctly disambiguated divided by the
number of terms in the trees of each group. Since we have at our disposal
[Figure 5.16: Typical context selection behavior for Group1 (Yahoo tree). Precision levels:]

         Complete context          Selected context
         Graph   Exp     Comb      Graph   Exp     Comb
P(1)     0.867   0.533   0.867     1.000   0.533   1.000
P(2)     1.000   0.867   1.000     1.000   0.867   1.000
P(3)     1.000   1.000   1.000     1.000   1.000   1.000
[Figure 5.17: Typical context selection behavior for Group2 (IMDb tree). Precision levels:]

         Complete context          Selected context
         Graph   Exp     Comb      Graph   Exp     Comb
P(1)     0.878   0.659   0.878     0.756   0.659   0.805
P(2)     0.951   0.951   0.951     0.902   0.951   0.927
P(3)     0.951   0.976   0.951     0.951   0.976   0.951
How far the right senses’ confidences are from the incorrect ones, i.e. how confident the algorithm is in its choices (delta confidence to the followings, first column of the right part of the table), and, when the choice is not right, how far the correct sense’s confidence is from the chosen one (delta confidence from the top). We see that the “to the followings” values are sufficiently high (from 14% for Group3 to over 24% for Group1), while the “from the top” ones are nearly null, meaning very few and small mistakes. Notice that the wsd choices performed on Group1, which gave the best results in terms of precision, are also the most “reliable” ones, as we expected.
In Table 5.2 we showed aggregate delta values for each group; however, we also found it interesting to investigate the visual trend of the delta confidences of the terms of a tree. Figure 5.18 shows the double histogram composed of
the delta to the followings (top part) and the delta from the top (bottom
part) values, where the horizontal axis represents the 21 terms of the On Line
Music Shop tree. Notice that for two terms no contributions are present: This
is due to the fact that these terms have only one available sense and, thus,
their disambiguation is not relevant. Further, the graph shows that, when the
upper bars are not particularly high (low confidence), the bottom bars are
not null (wrong disambiguation choices), but only in a very limited number
of cases. In most cases, the upper bars are evidence of good disambiguation
confidence and reliability, with peaks of over 40%.
Up to now we have not considered the contribution of the terms/senses’
feedback to the overall effectiveness of the results, in particular in the disambiguation of the most ambiguous terms in the tree.

[Figure 5.18: Delta confidences of the terms of the On Line Music Shop tree: delta to the followings (top part) and delta from the top (bottom part).]

For illustration, suggest-
ing the correct meaning of the term volume in the DBLP tree as a book helps
the algorithm in choosing the right meaning for number as a periodic publi-
cation. Moreover, suggesting the correct meaning of the term line (part of
character’s speech) in the Shakespeare tree produces better disambiguation
results, for instance for the speaker term, where the position of the right
sense passes from second to first in the ranking. Notice that, in this case, and
in many others, the feedback on the term merely confirms the top sense in the
ranking (i.e. our algorithm is able to correctly disambiguate it); nonetheless,
this has a positive effect on the disambiguation of the nearby terms since the “noise” produced by the wrong senses is eliminated. The flexibility of our approach also makes it possible to benefit from a completely automatic feedback, where the results of a given run are refined by automatically disabling the contributions of all but the top X senses in the following runs. We can generally choose a very low X value, such as 2 or 3, since the right sense typically occupies the very top positions in the ranking. For instance, by choosing
X = 2 in the SIGMOD tree, the results of the second run show a precision
increment of almost 17%, and similar results are generally obtainable on all
the considered trees.
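As an illustration, the automatic feedback loop can be sketched as follows (our own Python sketch; the disambiguate callable abstracts a full run of the algorithm and is hypothetical):

    def automatic_feedback(senses, disambiguate, X=2, runs=2):
        # senses: {term: candidate senses}; disambiguate returns, for each
        # term, its senses ranked by confidence. After each run, all but
        # the top X senses are disabled for the following run.
        for _ in range(runs):
            ranked = disambiguate(senses)
            senses = {t: r[:X] for t, r in ranked.items()}
        return senses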
Multi-version management
and personalized access
to XML documents
particularly suited for temporal XML documents. Thus, a native solution to the temporal slicing problem is needed in order to manage temporal versioning effectively and efficiently.
One of the most interesting scenarios in which this is particularly essen-
tial is the eGovernment one. Indeed, we are witnessing a strong institutional
push towards the implementation of eGovernment support services, aimed
at a higher level of integration and involvement of the citizens in the Public
Administration (PA) activities that concern them. In this context, collec-
tions of norm texts and legal information presented to citizens are becoming popular on the internet, and one of the main objectives of many research activities and projects is the development of techniques supporting not only temporal querying but also, as in [52], another versioning dimension, the semantic one, thus enabling personalization facilities. Here, personalization plays an
important role, because some norms or some of their parts have or acquire a
limited applicability. For example, a given norm may contain some articles
which are only applicable to particular classes of citizens (e.g. public employ-
ees). Hence, a citizen accessing the repository may be interested in finding
a personalized version of the norm, that is a version only containing articles
which are applicable to his/her personal case. In existing works, personal-
ization is either absent (e.g. www.normeinrete.it) or predefined by human
experts and hardwired in the repository structure (e.g. www.italia.gov.it),
whereas flexible and on-demand personalization services are lacking.
In this chapter we propose new techniques for the effective and efficient
management and querying of time varying XML documents and for their
personalized access. In particular:
in the first part (Section 6.1) we deal with the problem of managing and
querying time-varying multi-version XML documents in a completely
general scenario. In particular, we propose a native solution to the
temporal slicing problem, addressing the question of how to construct
an XML query processor supporting time-slicing [88];
6.1.1 Preliminaries
A time-varying XML document records a version history, which consists of
the information in each version, along with timestamps indicating the lifetime
of that version [44]. The left part of Figure 6.1 shows the tree representa-
tion of our reference time-varying XML document taken from a legislative
repository of norms. Data nodes are identified by capital letters. For sim-
plicity’s sake, timestamps are defined on a single time dimension and the
granularity is the year. Temporal slicing is essentially the snapshot of the
time-varying XML document(s) at a given time point but, in its broader
meaning, it consists in computing simultaneously the portion of each state
Document representation
A temporal XML model is required when there is the need of managing
temporal information in XML documents and the adopted solution usually
depends on the peculiarities of the application one wants to support. For
the sake of generality, our proposal is not bound to a specific temporal XML
model. On the contrary, it is able to deal with time-varying XML docu-
ments containing timestamps defined on an arbitrary number of temporal
dimensions and represented as temporal elements [54], i.e. disjoint unions of periods, as well as single periods.
In the following, we will refer to time-varying XML documents by adopt-
ing part of the notation introduced in [44]. A time-varying XML database is a collection of XML documents which may also contain time-varying documents.
We denote with DT a time-varying XML document represented as an ordered
labelled tree containing timestamped elements and attributes (in the follow-
ing denoted as nodes) related by some structural relationships (ancestor-
descendant, parent-child, preceding-following). The timestamp is a temporal
element chosen from one or more temporal dimensions and records the life-
time of a node. Not all nodes are necessarily timestamped. We will use the
notation nT to signify that node n has been timestamped and lif etime(nT )
to denote its lifetime. Sometimes it can be necessary to extend the lifetime
of a node n[T ] , which can be either temporal or snapshot, to a temporal di-
mension not specified in its timestamp. In this case, we follow the semantics
given in [45]: If no temporal semantics is provided, for each newly added tem-
poral dimension we set the value on this dimension to the whole time-line,
i.e. [t0 , t∞ ).
The snapshot operator is an auxiliary operation which extracts a complete
snapshot or state of a time-varying document at a given instant and which
is particularly useful in our context. Timestamps are not represented in the
snapshot. A snapshot at time t replaces each timestamped node nT with its non-timestamped copy if t ∈ lifetime(nT), or with the empty string otherwise. The snapshot operator is defined as snp(t, DT) = D, where D is the snapshot at time t of DT.
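A minimal sketch of the snapshot operator (our own illustration; the dictionary-based node representation and the single time dimension are assumptions):

    def snapshot(t, node):
        # snp(t, DT): drop a timestamped node when t is outside its
        # lifetime (a list of [start, end) periods); untimestamped nodes
        # (lifetime None) survive every snapshot. Timestamps are not
        # represented in the result.
        periods = node.get("lifetime")
        if periods is not None and not any(s <= t < e for s, e in periods):
            return None
        kept = (snapshot(t, c) for c in node.get("children", []))
        return {"label": node["label"], "children": [c for c in kept if c]}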
empty and it is contained in the temporal window, lifetime(n1[T], . . . , nk[T]) ⊆ t-window. For instance, in the reference example, the tuple (B,D) is structurally but not temporally consistent as lifetime(B) ∩ lifetime(D) = ∅. In this chapter, we consider the temporal slicing problem:

Given a twig pattern twig, a temporal window t-window and a time-varying XML database TXMLdb, for each distinct slice (n1[T], . . . , nk[T]), time-slice(twig, t-window) computes the snapshot snp(t, (n1[T], . . . , nk[T])), where t ∈ lifetime(n1[T], . . . , nk[T]).

Obviously, it is possible to provide a period-timestamped representation of the results by associating each distinct state snp(t, (n1[T], . . . , nk[T])) with its pertinence lifetime(n1[T], . . . , nk[T]) in t-window.
[Figure 6.2: The temporal inverted indices for the reference example.]
[Figure 6.3: The basic holistic twig join four-level architecture: inverted indices Iq1, . . . , Iqn (level L0), nodes nq1, . . . , nqn and buffers Bq1, . . . , Bqn (level L1), stacks Sq1, . . . , Sqn (level L2), and solutions (level SOL).]
The basic four level architecture of the holistic twig join approach is de-
picted in Figure 6.3. Similarly to the tree signature twig matching algorithms
we described in Chapter 4, the approach maintains in main-memory a chain
of linked stacks to compactly represent partial results to root-to-leaf query
paths, which are then composed to obtain matches for the twig pattern (level
SOL in the figure). In particular, given a path involving the nodes q1, . . . , qn, the two stack-based algorithms presented in [25], one for path matching and the other for twig matching, work on the inverted indices Iq1, . . . , Iqn (level L0 in the figure) and build solutions from the stacks Sq1, . . . , Sqn (level L2 in the figure).
During the computation, thanks to a deletion policy the set of stacks contains
data nodes which are guaranteed to lie on a root-to-leaf path in the XML
database and thus represents in linear space a compact encoding of partial
and total answers to the query twig pattern. The skeleton of the two holistic
twig join algorithms (HTJ algorithms in the following) is presented in Figure
6.4. At each iteration the algorithms identify the next node to be processed.
To this end, for each query node q, at level L1 there is the node in the inverted index Iq with the smallest LeftPos value not yet processed. Among those, the algorithms choose the node with the smallest value; let it be nq̄. Then, given knowledge of such a node, they remove from the stacks the partial answers that cannot be extended to total answers and push the node nq̄ onto the stack
Sq̄ . Whenever a node associated with a leaf node of the query path is pushed
on a stack, the set of stacks contains an encoding of total answers and the
algorithms output these answers. The algorithms presented in [25] have been
Figure 6.4: Skeleton of the holistic twig join algorithms (HTJ algorithms)
further improved in [30, 72]. As our solutions do not modify the core of such
algorithms, we refer interested readers to the above cited papers.
The time-slice operator can be implemented by applying minimal changes
to the holistic twig join architecture. The time-varying XML database is
recorded in the temporal inverted indices which substitute the “conventional”
inverted index at the lower level of the architecture and thus the nodes in the
stacks will be represented both by the position and the temporal attributes.
Given a twig pattern twig and a temporal window t-window, a slice is the snapshot of any answer to twig which is temporally consistent. Thus the
holistic twig join algorithms continue to work as they are responsible for
the structural consistency of the slices and provide the best management of
the stacks from this point of view. Temporal consistency, instead, must be
checked on each answer output of the overall process. In particular, for each
potential slice ((D, L1 : R1 , N1 |T1 ), . . . , (D, Lk : Rk , Nk |Tk )) it is necessary
to intersect the periods represented by the values T1 , . . . , Tk and then check
both that such intersection is not empty and that it is contained in the
temporal window. Finally, the snapshot operation is simply a projection
of the temporally consistent answers on the non-temporal attributes. In
this way, we have described the “first step” towards the realization of a
temporal XML query processor. On the other hand, the performances of
this first solution are strictly related to the peculiarities of the underlying
database. Indeed, XML documents usually contain millions of nodes and
this is absolutely true in the temporal context where documents record the
history of the applied changes. Thus, the holistic twig join algorithms can produce many answers which are structurally consistent but which are eventually discarded as they are not temporally consistent. This situation implies useless computations due to an uncontrolled growth of the number of tuples put on the stacks.
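The check can be sketched as follows (our own Python illustration, assuming a single time dimension and half-open periods; names are hypothetical):

    def intersect(p, q):
        # Intersection of two half-open periods, or None if empty.
        lo, hi = max(p[0], q[0]), min(p[1], q[1])
        return (lo, hi) if lo < hi else None

    def temporally_consistent(periods, t_window):
        # periods: the temporal pertinences T1, ..., Tk of a potential slice.
        common = periods[0]
        for p in periods[1:]:
            common = intersect(common, p)
            if common is None:                 # non-empty intersection fails
                return False
        # containment: the common lifetime must lie within t-window
        return t_window[0] <= common[0] and common[1] <= t_window[1]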
Temporal consistency considers two aspects: The intersection of the in-
volved lifetimes must be non-empty (non-empty intersection constraint in the
following) and it must be contained in the temporal window (containment
constraint in the following). We devised alternative solutions which rely on
the two different aspects of temporal consistency and act at the different
levels of the architecture with the aim of limiting the number of temporally
useless nodes the algorithms put in the stacks. The reference architecture is
slightly different from the one presented in Figure 6.3. Indeed, in our context, any timestamped node whose lifetime is a temporal element is encoded into multiple tuples (e.g. see the encoding of the timestamped node E in the reference example). Thus, at level L1, each node nq must be interpreted as the set of tuples encoding nq. They are stored in the buffer Bq, and step 3 of the HTJ algorithms empties Bq and pushes the tuples onto the stack Sq.
Notice that, at each step of the process, the tuples having LeftPos smaller
than L can be in the stacks, in the buffers or still have to be read from
the inverted indices. However, looking for such tuples in the three levels
of the architecture would be quite computationally expensive. Thus, in the
following we introduce a new approach for buffer loading which allows us
to look only at the stack level. Moreover, we avoid accessing the temporal
pertinence of the tuples contained in the stacks by associating a temporal
pertinence to each stack (temporal stack ). Such a temporal pertinence must
therefore be updated at each push and pop operation. At each step of the
process, for efficiency purposes both in the update and in the intersection
phase, such a temporal pertinence is the smallest multidimensional period Pq containing the union of the temporal pertinences of the tuples in the stack Sq.
The aim of our buffer loading approach is to avoid loading the temporal tuples encoding a node n[T] into the corresponding buffer Bq if the inverted indices associated with the parents of q contain not-yet-processed tuples with LeftPos smaller than that of nq. Such an approach is consistent with step 1 of the HTJ algorithms, as it chooses the node at level L1 with the smallest LeftPos value, and it ensures that when n[T] enters Bq all the tuples involved
in Prop. 6.1 are in the stacks. The algorithm implementing step 1 of the
HTJ algorithms is shown in Figure 6.5. We associate each buffer Bq with
the minimum minq among the LeftPos values of the tuples contained in the
buffer itself and those of its ancestors. Assuming that all buffers are empty,
the algorithm starts from the root of the twig (step 2) and, for each node q down to the leaf, it updates the minimum minq and inserts nq, the node in Iq with the smallest LeftPos value not yet processed, if it is smaller than
minq . The same applies when some buffers are not empty. In this case, it
starts from the query node matching with the previously processed data node
and it can be easily shown that the buffers of the ancestors of such node are
not empty whereas the buffers of the subpath rooted by such node are all
empty.
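A loose sketch of the buffer loading step, under several assumptions of ours (LeftPos-ordered lists stand in for the inverted indices, cursors mark the next unprocessed tuple, and the tuple type is hypothetical; this illustrates the idea, not the algorithm of Figure 6.5 verbatim):

    from collections import namedtuple

    Tup = namedtuple("Tup", "doc left right level period")

    def load(path, indices, cursors, buffers):
        # path: query nodes in root-to-leaf order. A node is admitted into
        # Bq only if its LeftPos is smaller than the minimum LeftPos found
        # in Bq itself and in the buffers of q's ancestors.
        min_anc = float("inf")
        for q in path:
            min_q = min([t.left for t in buffers[q]] + [min_anc])
            i = cursors[q]
            if i < len(indices[q]) and indices[q][i].left < min_q:
                left = indices[q][i].left      # admit the node: move every
                while i < len(indices[q]) and indices[q][i].left == left:
                    buffers[q].append(indices[q][i])   # tuple encoding it
                    i += 1
                cursors[q] = i
                min_q = min(min_q, left)
            min_anc = min_q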
Lemma 6.1 Assume that step 1 of the HTJ algorithms depicted in Figure 6.4 is implemented by the algorithm Load. The tuple (D, L : R, N|T) in Bq will belong to no slice if the intersection of its temporal pertinence T with the multidimensional period Pq1→qk = Pq1 ∩ . . . ∩ Pqk, obtained by intersecting the periods of the stacks of the ancestors q1, . . . , qk of q, is empty.
For instance, at the first iteration of the HTJ algorithms applied to the
reference example, step 1 and step 3 produce the situation depicted in Figure
6.6. Notice that when the tuple (1, 4 : 5, 4|1970 : 1990) encoding node D
(label article) enters level L1 all the tuples with LeftPos smaller than 4
are already at level L2 and due to the above Lemma we can state that it will
belong to no slice.
[Figure 6.6: The situation after the first iteration on the reference example (query nodes contents and article).]
Containment constraint
The following proposition is the equivalent of Prop. 6.1 when the containment
constraint is considered.
It allows us to act at levels L1 and L2, but also between level L0 and level L1. At levels L1 and L2 the approach is the same as for the non-empty intersection constraint; it is sufficient to use the temporal window t-window, and thus
Prop. 6.2, instead of Lemma 6.1. Moreover, it is also possible to add an
intermediate level between level L0 and level L1 of the architecture, which
we call “under L1” (UL1), where the only tuples satisfying Prop. 6.2 are
selected from each temporal inverted index, are ordered on the basis of their
(DocId,LeftPos) values and then pushed into the buffers. Similarly to the
approach explained in the previous section, to speed up the selection, we
exploit B+ -tree indices built on one temporal dimension. Notice that this
solution deals with buffers as streams of tuples and thus it provides interesting
efficiency improvements only when the temporal window is quite selective.
Combining solutions
The non-empty intersection constraint and the containment constraint are orthogonal; thus, in principle, the solutions presented in the above subsections can be freely combined in order to decrease the number of useless tuples we
put in the stacks. Each combination gives rise to a different scenario denoted
as “X/Y”, where “X” and “Y” are the employed solutions for the non-empty
intersection constraint and for the containment constraint, respectively (e.g.
scenario L1/L2 employs solution L1 for the non-empty intersection constraint
and solution L2 for the containment constraint). Some of these scenarios will
be discussed in the following. First, scenario L1/UL1 is not applicable since
in solution UL1 selected data is kept and read directly from buffers, with no
chance of additional indexing. Instead, in scenario L1/L1 the management of
the two constraints can be easily combined by querying the indices with the
[Figure 6.7: The complete infrastructure: the web services of the Public Administration support (1) identification, (2) classification with respect to the civic ontology OC, and (3) querying, over the XML repository of annotated norms.]
Semantic versioning also plays an important role, due to the limited applica-
bility that norms or some of their parts have or acquire. Hence, it is crucial
to maintain the mapping between each portion of a norm and the maximal
class of citizens it applies to in order to support an effective personaliza-
tion service. Finally, notice that temporal and limited applicability aspects
though orthogonal, may also interplay in the production and management of
versions. For instance, a new norm might state a modification to a preexist-
ing norm, where the modified norm becomes applicable to a limited category
of citizens only (e.g. retired persons), whereas the rest of the citizens remain
subject to the unmodified norm.
Temporal versioning
We first focused on the temporal aspects and on the effective and efficient
management of time-varying norm texts. Our work on these aspects is based
on our previous research experiences [58, 59, 60] and on the work discussed
in Section 6.1. To this purpose, we developed a temporal XML data model
which uses four time dimensions to correctly represent the evolution of norms
in time and their resulting versioning. The considered dimensions are:
Validity time. It is the time during which the norm is in force. It has the same semantics as valid time in temporal databases [71], since it represents the time the norm actually belongs to the regulations in the real world.

Efficacy time. It is the time during which the norm can be applied to concrete cases. As long as such cases exist, the norm retains its efficacy even if it is no longer in force. It also has the semantics of valid time, although it is independent of validity time.
The data model was defined via an XML Schema, where the structure of
norms is defined by means of a contents-section-article-paragraph hierarchy
and multiple content versions can be defined at each level of the hierarchy.
Each version is characterized by timestamp attributes defining its temporal
pertinence with respect to each of the validity, efficacy and transaction time dimensions.

[Figure 6.8: An example of civic ontology, where each class has a name and is associated with a (pre, post) pair, and a fragment of an XML norm containing applicability annotations.]
Legal text repositories are usually managed by traditional information
retrieval systems where users are allowed to access their contents by means
of keyword-based queries expressing the subjects they are interested in. We
extended such a framework by offering users the possibility of expressing
temporal specifications for the reconstruction of a consistent version of the
retrieved normative acts (consolidated act).
Semantic versioning
The temporal multi-version model described above has then been enhanced
to include a semantic versioning mechanism providing personalized access, that is, the retrieval of all and only the norm provisions that are applicable to a given citizen according to his/her digital identity. Hence, the semantic versioning dimension encodes information about the applicability of the different parts of a norm text to the relevant classes of the civic ontology defined in the infrastructure (OC in Figure 6.7). Semantic information is mapped onto a tree-like civic ontology, which is based on a taxonomy induced by IS-A relationships. The tree-like civic ontology is sufficient to satisfy basic
application requirements as to applicability constraints and personalization
services, though more advanced application requirements may need a more
sophisticated ontology definition.
For instance, the left part of Figure 6.8 depicts a simple civic ontology
built from a small corpus of norms ruling the status of citizens with respect
to their work position. The right part shows a fragment of a multi-version
XML norm text supporting personalized access with respect to this ontology.
which were valid and in effect between 2002 and 2004 (temporal con-
straint), ...
Such a query can be issued to our system using the standard XQuery FLWR
syntax as follows:
FOR $a IN norm
WHERE textConstr ($a//paragraph//text(), ’health AND care’)
AND tempConstr (’vTime OVERLAPS
PERIOD(’2002-01-01’,’2004-12-31’)’)
AND tempConstr (’eTime OVERLAPS
PERIOD(’2002-01-01’,’2004-12-31’)’)
AND applConstr (’class 7’)
RETURN $a
where textConstr, tempConstr, and applConstr are suitable functions al-
lowing the specification of the textual, temporal and applicability constraints,
respectively (the structural constraint is implicit in the XPath expressions
used in the XQuery statement). Notice that the temporal constraints can
involve all the four available time dimensions (publication, validity, efficacy
and transaction), allowing high flexibility in satisfying the information needs
of users in the eGovernment scenario. In particular, by means of validity
and efficacy time constraints, a user is able to extract consolidated current
versions from the multi-version repository, or to access past versions of par-
ticular norm texts, all consistently reconstructed by the system on the basis
of the user’s requirements and personalized on the basis of his/her identity.
binary structural joins can get very large, even when the input and the final
result sizes obtained by stitching together the basic matches are much more
manageable.
Our native approach extends one of the most efficient approaches for XML
query processing and the underlying indexing scheme in order to support tem-
poral slicing and overcome most of the previously discussed problems. Start-
ing from the holistic twig join approach [25], which directly avoids the prob-
lem of very large intermediate results size by using a chain of linked stacks
to compactly represent partial results, we proposed new flexible technologies consisting of alternative solutions, and extensively experimented with them in different settings.
vention of human experts to build the ontology, but avoids privacy issues.
Moreover, we access more complex documents, such as semi-structured ones,
rather than unstructured documents such as web pages; indeed, the high flex-
ibility we provide makes it possible to access fragments of the documents,
returning all and only the ones that fit to the user needs and avoiding the
retrieval of useless information.
The document collections follow the structure of the documents used in [58], where three temporal dimensions are involved, and have been generated by a configurable XML generator. On average, each document contains 30-40 nodes and has a depth of 10; 10-15 of these nodes are timestamped nodes nT, each one having 2-3 versions composed of the union of 1-2 distinct periods. We are also able to change the length of the periods and the probability that the temporal pertinences of the document nodes overlap. Finally, we investigated different kinds of probability density functions, generating collections with different distributions, thus directly affecting the containment constraint.
Experiments were conducted on a reference collection (C-R), consisting of 5000 documents (120 MB) generated following a uniform distribution and characterized by nodes which are not much scattered, and on several variations of it. We tested the performance of the time-slice operator with different twig and t-window parameters. In this context we will deepen the performance analysis by considering the same path, involving three nodes, and different temporal windows, as our focus is not on the structural aspects.
[Figure: execution time (ms) and percentage of tuples in the buffers for the compared scenarios.]
solution. The scenarios within each group show similar execution times and percentages of tuples. In group */L1 the low percentage of tuples in the buffers (10%) means low I/O costs, and this has a good influence on the execution time. In group */L2 the percentages of tuples in the buffers are more than double those of group */L1, while the execution time is about 1.5 times higher. Finally, group */SOL is characterized by percentages of tuples in the buffers and execution times approximately ten and six times higher than those of group */L1, respectively. Moreover, within each group it should be noticed that raising the non-empty intersection constraint solution from level L1 to level SOL produces a progressive deterioration of the overall performance.
[Figure 6.10: Execution times (ms) of the test queries: (a) comparison of the combined scenarios; (b) comparison of the access methods.]

(a)          L1/L1    L2/UL1   SOL/UL1   SOL/SOL
     TS1     1290     3078     3081      12688
     TS2     2812     3938     3953      12691
     TS3     1031     797      813       9891

(b)          L1/L1 (B+TREE)   L1/L1 (MVBT)   SOL/SOL   STRUCT
     TS1     1290             2655           12688     17750
     TS2     2812             5709           12691     17859
In Figure 6.10-b we compare the execution time for scenario L1/L1 when the access method is the B+-tree w.r.t. the MVBT. Notice that when MVBT indices are used to access the data, the execution time is generally higher than with the B+-tree solution. This might be due to the implementation we used, which is a beta version included in the XXL package [130]. The last comparison involves the holistic twig join algorithms applied to the original indexing scheme proposed in [139], where temporal attributes are added to the index structure but are treated as common attributes. Notice that in this indexing scheme tuples must have different LeftPos and RightPos values, and thus each temporal XML document must be converted into an XML document where each timestamped node gives rise to a number of distinct nodes equal to the number of its distinct periods. The results are shown on the right of Figure 6.10-b, where it is clear that the execution time of the purely structural
approach (STRUCT) is generally higher than our baseline scenario and thus also than the other scenarios (13 times slower than the best scenario). This demonstrates that the introduction of our temporal indexing scheme alone brings significant benefits to temporal slicing performance. We refer the interested reader also to Section 6.3, where we provide additional discussion of state-of-the-art techniques w.r.t. ours.

[Figure 6.11: Comparison between the two collections C-R and C-S.]

Execution time (ms):
         TS1                          TS2
         L1/L1   SOL/L1   SOL/SOL    L1/L1   SOL/L1   SOL/SOL
C-R      1890    2000     12688      2812    2859     12691
C-S      906     1383     9766       1250    1797     9875

Percentage of non-consistent solutions:
         TS1                          TS2
         L1/L1   SOL/L1   SOL/SOL    L1/L1   SOL/L1   SOL/SOL
C-R      23.10   39.13    96.51      29.96   43.23    91.95
C-S      32.5    95.01    99.98      63.17   98.22    99.88
[Figure 6.12: Scalability of the time-slice operator: execution time (ms, logarithmic scales) for TS1 on collections of 5000, 10000 and 20000 documents.]

              5000 Docs   10000 Docs   20000 Docs
L1/L1         1890        3531         5654
L2/L2         2797        5329         9844
SOL/SOL       12688       22893        45750
Scalability
Figure 6.12 (notice the logarithmic scales) reports the performance of our
XML query processor in executing TS1 for the reference collection C-R and
for two collections having the same characteristics but different sizes: 10000
and 20000 documents. The execution time grew linearly in every scenario,
with a proportion of approximately 0.75 w.r.t. the number of documents for
our best scenario L1/L1. Such tests have also been performed on the other temporal slicing settings, where we measured a similar trend, thus showing the good scalability of the processor in every type of query context.
Table 6.2: Features of the test queries and query execution time (time in
msecs, collection C1)
(St constraint); types Q1 and Q2 also involve textual search by keywords (Tx
constraint), with different selectivities; type Q3 contains temporal conditions
(Tm constraint) on three time dimensions: transaction, valid and publica-
tion time; types Q4 and Q5 mix the previous ones since they involve both
keywords and temporal conditions. For each query type, we also present a
personalized access variant involving an additional applicability constraint,
denoted as Qx-A in Table 6.2 (performance figures in parentheses).
Let us first focus on the “standard” queries. Our approach shows good efficiency in every context, providing a short response time (including query analysis, retrieval of the qualifying norm parts and reconstruction of the result) of approximately one or two seconds for most of the queries. Notice that the selectivity of the query predicates does not impair performance,
even when large amounts of documents containing some (typically small)
relevant portions have to be retrieved, as it happens for queries Q2 and Q3.
Our system is able to deliver a fast and reliable performance in all cases, since
it practically avoids the retrieval of useless document parts. Furthermore,
consider that, for the same reasons, the main memory requirements of the
Multi-version XML Query Processor are very small, less than 5% with respect
to “DOM-based” approaches such as the one adopted in [59, 58]. Notice that
this property is also very promising towards future extensions to cope with
concurrent multi-user query processing.
The time needed to answer the personalized access versions of the Q1–Q5
queries is approximately 0.5-1% more than for the original versions. More-
over, since the applicability annotations of each part of an XML document are
stored as simple integers, the size of the tuples with applicability annotations
is practically unchanged (only a 3-4% storage space overhead is required with
respect to documents without semantic versioning), even with quite complex
annotations involving several applicability extensions and restrictions.
Scalability
Finally, let us comment on the performance in querying the other two collections C2 and C3 and, therefore, on the scalability of the system. We ran the same queries as in the previous tests on the larger collections and saw that the computing time always grew sub-linearly with the number of documents. For instance, query Q1 executed on the 10,000 documents of collection C2 (which is twice as large as C1) took 1,366 msec (i.e. the system was only 30% slower); similarly, on the 20,000 documents of collection C3, the average response time was 1,741 msec (i.e. the system was less than 30% slower than with C2). The measured trend was the same for the other queries, thus showing the good scalability of the system in every type of query context.
Conclusions
and Future Directions
Starting from the ideas and work presented in this thesis, several interesting
issues for future research can be considered. These include:
[Figure A.1: The senses s1 = “cat (animal)” (a carnivore) and s2 = “mouse (animal)” (a rodent) in the hypernym hierarchy, with their minimum common hypernym cs1,s2 = “placental mammal”.]
able (e.g. note that the noun hierarchy has nine root senses). Furthermore,
whenever a minimum common hypernym between the two senses si and sj is available, we call path length, denoted as len(si, sj), the number of links connecting si with sj and passing through the minimum common hypernym node. For instance, starting from the senses s1 “cat (animal)” and s2 “mouse (animal)” and moving up in the hierarchy, we see that they first intersect at the common sense “placental mammal”, which therefore is the minimum common hypernym cs1,s2, and the path length is len(s1, s2) = 3 + 2 = 5 (3 links from s1 to cs1,s2 plus 2 from cs1,s2 to s2). As to nouns, we define the path length between two nouns ni, nj as:
\[
len(n_i, n_j) = \min_{s_k^{n_i} \in S_{n_i},\; s_{k'}^{n_j} \in S_{n_j}} len(s_k^{n_i}, s_{k'}^{n_j}) \tag{A.1}
\]
In the same way, we can introduce the minimum common hypernym cni ,nj
between two nouns ni , nj as the minimum common hypernym sense corre-
sponding to the minimum path length between the two nouns. For instance,
the minimum path length between nouns “cat” and “mouse” is 5, since the
senses of such nouns that join most rapidly are the ones depicted in Figure
A.1. Their minimum common hypernym is the sense “placental mammal”.
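For illustration, the minimum path length of Eq. A.1 can be computed over WordNet as follows (our own sketch; using NLTK and its shortest_path_distance is an assumption):

    from nltk.corpus import wordnet as wn

    def noun_path_length(n1, n2):
        # len(n1, n2): the minimum number of links between any sense of n1
        # and any sense of n2 (None when no common hypernym exists).
        dists = [s1.shortest_path_distance(s2)
                 for s1 in wn.synsets(n1, pos=wn.NOUN)
                 for s2 in wn.synsets(n2, pos=wn.NOUN)]
        dists = [d for d in dists if d is not None]
        return min(dists) if dists else None

    # noun_path_length("cat", "mouse") -> 5 in the text's example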
In the literature, the hypernym hierarchies and the notion of path length between pairs of synsets have been extensively studied as a backbone for the definition of similarities between two senses [26]. Among the proposed measures, one of the most promising ones that does not require a training phase on large pre-classified corpora is the Leacock-Chodorow measure [81]. In the following definition, we introduce a more general variant of such a measure for quantifying the similarity between two nouns, where we also consider the case in which the minimum common hypernym is not available:
\[
sim(n_i, n_j) = \begin{cases}
-\ln\dfrac{len(n_i, n_j)}{2 \cdot H_{max}} & \text{if } c_{n_i,n_j} \text{ is available} \\
0 & \text{otherwise}
\end{cases} \tag{A.2}
\]

where Hmax has value 16, as it is the maximum height of the WordNet IS-A hierarchy.
the sentence in Figure 2.3 again: The nouns to disambiguate are “cat” and
“mouse”. If no context is available, a mouse could be an animal, but also
an electronic device, while a cat could also be a vehicle. If we consider their
minimum common hypernym, i.e. “placental mammal”, the two senses that
join through it, “cat (animal)” and “mouse (animal)”, will have the highest
confidence and will be the ones chosen.
The basis of the technique we propose derives from the one in [113]. As
in [113], the confidence in choosing one of the senses associated with each
term is directly proportional to the semantic similarities between that term
and the other ones. On the other hand, we deal with sentences of variable
length whose meaning is also affected by the relative positions of their
terms. Thus, we refined the confidence computation by introducing two
enhancements which make our approach more effective in such a context. In
particular, in the computation of the confidence in choosing one of the senses
associated with each noun, we weigh the contribution of the surrounding
nouns w.r.t. their positions and we consider the frequency of use of each
sense in the English language.
As far as the first aspect is concerned, note that the contributions of the
surrounding nouns to the confidence of each of the senses of a given noun
are not equally important. In particular, we assume that the closer the positions
of two nouns are in the sentence, the more correlated the two nouns are
and, therefore, the more important the information extracted from the
computation of their semantic similarity. Thus, given a noun n_h ∈ N to
be disambiguated, we weigh the similarity between n_h and each of the
surrounding nouns n_j ∈ N on their relative positions by adopting a Gaussian
distance decay function D, centered on n_h and defined on their distance d_{h,j}:
$$D(d_{h,j}) = 2 \cdot \frac{e^{-d_{h,j}^2/8}}{\sqrt{2\pi}} + 1 - \frac{2}{\sqrt{2\pi}} \qquad \text{(A.3)}$$
For two close nouns, the decay is almost absent (values near 1), while for
distant words the decay asymptotically tends towards values around 1/5 of
the full value. For example, consider the following technical sentence: “As
soon as the curve is the right shape, click the mouse button to freeze it in that
position” (see Figure A.2). The decay function is centered on “mouse”, the
noun to be disambiguated, while the surrounding nouns forming its context
are underlined. In this example, the “mouse” is clearly an electronic device
and not an animal; the best hint of this fact is given by the presence of the
noun “button”, located close to the term “mouse” (point B in the figure,
low decay), clearly providing a good way to understand the meaning. More
distant nouns such as “curve” or “position” are less correlated to “mouse”
[Figure A.2: the Gaussian distance decay D(d_{h,j}) centered on "mouse": nearby context nouns such as "button" (point B) receive almost no decay, while distant ones such as "soon" or "position" (point A) approach the asymptotic value.]
and have a lesser influence on its disambiguation (point A in the figure, high
decay).
As to the second aspect, we exploit the rank of the senses, which is based
on the frequency of use of the senses in the English language, by incrementing
the confidence of a sense of a given noun n in a way which is inversely
proportional to its position in the list S_n. In this case, we define a linear
sense-decay function R on the sense number k (ranging from 1 to m) of a
given noun n:
$$R(s^{n}_k) = 1 - \gamma\,\frac{k-1}{m-1} \qquad \text{(A.4)}$$
where 0 < γ < 1 is a parameter we usually set to 0.8. In this way, we
quantify the frequency of the senses: the first sense has no decay and
the last sense decays to 1/5 of the full value. Such an adjustment attempts to emulate
the common sense of a human in choosing the right meaning of a noun when
the context gives little help. In our experiments, it proved to be
particularly useful for short sentences, where the choice of a sense can be led
astray by the small number of surrounding nouns.
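The two decay functions are straightforward to implement; a minimal sketch, with distances measured in token positions and senses numbered from 1 as above:

```python
import math

def distance_decay(d):
    """Gaussian distance decay of Eq. (A.3): equals 1 at d = 0 and tends
    to 1 - 2/sqrt(2*pi) (about 0.2, i.e. roughly 1/5) as d grows."""
    return (2.0 * math.exp(-d * d / 8.0) / math.sqrt(2.0 * math.pi)
            + 1.0 - 2.0 / math.sqrt(2.0 * math.pi))

def sense_decay(k, m, gamma=0.8):
    """Linear sense-rank decay of Eq. (A.4): no decay for the first sense
    (k = 1), decay to 1 - gamma for the last sense (k = m)."""
    return 1.0 if m == 1 else 1.0 - gamma * (k - 1) / (m - 1)
```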
All the concepts discussed are involved in the notion of confidence in
choosing a sense for a noun nh ∈ N . It is defined as a sum of two components,
the first one involving the semantic similarity and the distance decay between
nh and the surrounding nouns, while the second represents the contribution
of the sense decay. The confidence in choosing the meaning of a noun in a
sentence is formally defined as follows.
The meaning chosen for n_h ∈ N in the sentence is the sense s^{n_h} ∈ S_{n_h} such
that ϕ(s^{n_h}) = max{ϕ(s^{n_h}_k) | s^{n_h}_k ∈ S_{n_h}}.
By varying the values of the α and β parameters we can change the relative
weight of each of the components; the default values, 0.7 and 0.3 respectively,
make the contribution of the first component predominant. Confidences are
values between 0 and 1; a high value of ϕ(s^{n_h}_k) indicates a high probability
that the correct sense of n_h is s^{n_h}_k. Therefore, we choose the sense with the highest
confidence as the most probable meaning for each noun in the sentence. For
instance, in the sentence "You can even pick up an animation as a brush and
paint a picture with it", the noun "brush" has several senses in WordNet. Our
disambiguation algorithm correctly disambiguates it by measuring
the semantic similarities between it and the other nouns, i.e. "animation"
and "picture". At the end of the algorithm, the sense having the highest
confidence (0.79) is "an implement that has hairs or bristles firmly set into
a handle", which is the correct one and will be the one chosen, while other
off-topic senses, such as "a dense growth of bushes" or "the act of brushing
your teeth", all have a much lower confidence (less than 0.4).
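Since the formal definition of the noun confidence is elided above, the following is only a hedged sketch of the sense choice it describes, reusing the decay functions sketched earlier; the normalization of the similarity component to [0, 1] is an assumption:

```python
def choose_sense(h, senses, sim, nouns):
    """Pick the sense of nouns[h] with the highest confidence.
    `senses` lists the WordNet senses of nouns[h] in rank order;
    `sim(sense, noun)` is assumed to return a similarity in [0, 1]."""
    alpha, beta = 0.7, 0.3                      # default weights from the text
    context = [(n, abs(j - h)) for j, n in enumerate(nouns) if j != h]

    def confidence(k, sense):
        weighted = sum(sim(sense, n) * distance_decay(d) for n, d in context)
        norm = sum(distance_decay(d) for _, d in context) or 1.0  # assumption
        return alpha * weighted / norm + beta * sense_decay(k + 1, len(senses))

    best_k, _ = max(enumerate(senses), key=lambda ks: confidence(ks[0], ks[1]))
    return senses[best_k]
```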
As previously described, the nouns N contained in a sentence provide
a good informational context for the correct disambiguation of each noun
n_h ∈ N. However, there may be situations in which such a context is not
sufficient, for example when the nouns in N are not strictly inter-correlated
or when the sentence is too short. Consider, for instance, the
sentence "The cat hunts the beetle": The two nouns "cat" and "beetle"
alone could be correlated both as vehicles and as animals. If we also consider the
following sentence, "It just ran away from the dog", the noun "dog"
would clearly be very useful in order to disambiguate the meaning of "cat"
and "beetle" as animals. Therefore, in order to provide good effectiveness
verbs of a given sentence, the confidence ϕ(s^{v_h}_k) in choosing sense s^{v_h}_k ∈ S_{v_h}
of a verb v_h ∈ V is computed as:
$$\varphi(s^{v_h}_k) = \alpha \cdot \frac{\sum_{n_i \in N} \max_{n_j \in N(s^{v_h}_k)} \mathrm{sim}(n_i, n_j) \cdot D(d_{i,h})}{\sum_{n_i \in N} \max_{n_{j'} \in N(S_{v_h})} \mathrm{sim}(n_i, n_{j'}) \cdot D(d_{i,h})} + \beta \cdot R(s^{v_h}_k) \qquad \text{(A.7)}$$
where α > 0 and β > 0 (α + β = 1). The meaning chosen for v_h ∈ V in the
sentence is the sense s^{v_h} ∈ S_{v_h} such that ϕ(s^{v_h}) = max{ϕ(s^{v_h}_k) | s^{v_h}_k ∈ S_{v_h}}.
Again, notice that the confidence is a value between 0 and 1; a high value
of ϕ(s^{v_h}_k) indicates a high probability that s^{v_h}_k is the correct sense of v_h.
Therefore, we choose the sense with the highest confidence as the most probable
meaning for each verb in the sentence. Since for each noun n_i ∈ N in the sentence
we identify the most semantically close noun from the usage examples
(max_{n_j ∈ N(s^{v_h}_k)} sim(n_i, n_j)), the most probable sense is the one whose
usage-example nouns best match the nouns in the sentence. In this way, we
choose the sense whose usage examples best reflect the use of the verb in the
sentence.
For example, consider the disambiguation of the verb "click" in the fragment
"click the mouse button". In WordNet this verb has 7 different senses,
each with different usage examples. The correct sense, that is "to move or
strike with a click", contains the following usage fragment: "he clicked on
the light". Since the semantic similarities between "mouse", "button" and
"light" as electronic devices are very high, the algorithm chooses "light" as
the best match for both "mouse" and "button" and, consequently, it is able
to choose the correct sense for the verb, since this match greatly increments
its confidence.
Good hints for the disambiguation of a verb can also be extracted from
the definitions of its different senses. In particular, our approach is also able
to extract, along with the nouns of the usage examples, or instead of them,
the nouns present in such definitions, computing the similarities between
them and the nouns of the original sentence. Consider the sentence “the
white cat is hunting the mouse” again. A large contribution to the correct
disambiguation of the verb “hunt” comes from the nouns of the definition
“pursue for food or sport (as of wild animals)”: The semantic similarity
between “cat”, “mouse” and “animal” is obviously very high and decisive in
correctly choosing such a sense as the preferred one.
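A hedged sketch of this heuristic, again on top of NLTK's WordNet API; the naive whitespace tokenization of glosses and examples, and the noun_sim callable standing in for the noun similarity introduced earlier, are assumptions:

```python
from nltk.corpus import wordnet as wn

def best_verb_sense(verb, sentence_nouns, noun_sim, use_definition=True):
    """Score each verb sense by how well the nouns of its usage examples
    (and, optionally, of its gloss) match the nouns of the sentence."""
    best_score, best_sense = -1.0, None
    for sense in wn.synsets(verb, pos=wn.VERB):
        texts = list(sense.examples())
        if use_definition:
            texts.append(sense.definition())
        words = {w.strip('".,;()') for t in texts for w in t.lower().split()}
        # for each sentence noun, keep its best match among the sense's words
        score = sum(max((noun_sim(n, w) for w in words), default=0.0)
                    for n in sentence_nouns)
        if score > best_score:
            best_score, best_sense = score, sense
    return best_sense
```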
[Figure: edit distance computations between suffixes of the query sequence σq = "white dog hunt mouse" and σ = "cat hunt mouse" for increasing starting tokens; e.g. A: ed(σq[2...4], σ[1...3]) and B: ed(σq[3..4], σ[2..3]).]
It is possible to perform all the steps required for sub2 matching and to identify
all the valid sub2 matches. However, as discussed in Chapter 2, the translator
could be interested only in suggestions that are not contained in other
suggestions, i.e. the longest ones. To satisfy this particular need, a filtration
process, which we call inclusion filtering, is also required in order to prune
out the shorter matches contained in the longer ones.
In order to better understand the set of requirements, consider the following
example of approximate sub2 matching with inclusion filtering. Suppose
that minL = 3 and d = 0.3 and that we search for the sub2 matches of
the query sentence "So, welcome to the world of computer generated art",
where "welcome world compute generate art" is the sequence resulting from the
syntactic analysis. Let us suppose that the translation memory contains
five sentences, numbered 2518, 3945, 5673, 10271, and 13456.
All the above sequences share some tokens with the query sequence
(see Figure A.4). However, the subsequence of 2518 is not a valid
sub2 match since it is too short, and the part of 3945 is not considered since
ed(σ^3945[welcome . . . art], σ^q[welcome . . . art]) > round(0.3 · 5). The remaining
sequences all satisfy the minL and d requirements. However, note that
the inclusion requirement excludes σ^5673[welcome . . . compute] since it is contained
in σ^10271[welcome . . . generate].
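The validity constraints of the example can be summarized in a small predicate; a sketch, with the edit distance function itself assumed given:

```python
def is_valid_sub2_match(q_sub, s_sub, edit_distance, min_l=3, d=0.3):
    """Check the minL and d requirements on a candidate pair of
    subsequences, with the threshold rounded as in the example above
    (round(d * |query subsequence|))."""
    if len(q_sub) < min_l or len(s_sub) < min_l:
        return False          # too short, as for the subsequence of 2518
    return edit_distance(q_sub, s_sub) <= round(d * len(q_sub))
```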
In order to implement sub2 matching satisfying all the discussed requirements,
the multiEditDistance algorithm could still be used as a first step to
produce all candidate sub2 matches; a subsequent phase could then check
the d, minL, and inclusion constraints and prune out all the invalid matches.
However, a more efficient way to proceed is to avoid, at least partially, the
computation of the undesired matches. In particular, it is possible to
skip the computation of the shorter matches that would not satisfy the
minimum length requirement and of those matches that would be included
in already identified ones. To this end, we developed a modified version of the
multiEditDistance function, named multiEditDistanceOpt, which is able
to efficiently solve the sub2 matching problem by considering all the above
constraints.
The pseudo-code is shown in Figure A.5. It is applied to the collection
of the pairs of sequences coming out of the filters. Such a collection is
ordered on the basis of the query sequences σq; in this way, all the pairs
of sequences sharing the same query sequence will be one after the other.
Figure A.5: Algorithms for approximate sub2 matching with inclusion filtering
Each match is a tuple containing the start and end positions in σq, the start
and end positions in σ, and the computed edit distance. Besides matches,
a few auxiliary data structures are exploited in the multiEditDistanceOpt
function and initialized by the sub2MatchingOpt function (lines 8, 11, 12,
14) for each new query sequence σq. The goal of the integers L and lastPos is
to optimize the computation w.r.t. the minimum length constraint, whereas
that of the arrays maxEnd and pMatches is to dynamically apply the inclusion
filtering. In particular, L is a constant representing the number of
acceptable starting positions for a match on the query sequence; it depends
on the length of the query sequence and the minimum length of any suggestion:
|σq| − minL + 1 (line 8). lastPos represents the current last acceptable
starting position on the query sequence; its initial value is L − 1 (line 11).
As to the inclusion filter structures, maxEnd is an array of L integers, each
entry maxEnd[i] representing the maximum end position of all the already computed
matches starting at the i-th position in the query. Finally, pMatches is
an array of L sets, each set pMatches[i] containing the computed matches
starting at the i-th position in the query.
If, for a given σq, L is less than 1, the query sequence is discarded for
insufficient length and is not analyzed any further (lines 9-10); otherwise,
the current pair of sequences (σq, σ) is passed to multiEditDistanceOpt.
Such function considers each starting position in each of the two sequences
(lines 23-24), computes the corresponding distance matrix (lines 25-27), identifies
new matches (lines 28-29), checks if a match is included in or includes
other matches (lines 31-36) and, eventually, if the match is not included in
others, inserts it in the relevant set (lines 38-39) and updates the maxEnd
and lastPos values accordingly (lines 40-42). More precisely, if the new
match is longer than the already computed matches having the same starting
position, thanks to the inclusion filter the function empties the relevant
match set (line 36). For the same reason, if the match is included in others,
the insertion is not performed by setting insertFlag to false (line 34). Notice
that, with the help of the auxiliary structures, the inclusion filtering process
is performed quite efficiently, without ever re-accessing the already computed
matches. Furthermore, note that lastPos needs to be updated only when the
ending position of the query sequence in the new match coincides with the
last token of such sequence (line 41): In this case, by the properties of the
inclusion requirements, lastPos is set to the starting position of the found
match and, for such query sequence, the following starting positions are not
analyzed any further since they would only produce shorter, contained
matches.
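To make the bookkeeping concrete, here is a simplified sketch of the insertion step under the naming of Figure A.5; it applies the inclusion test on the query-side intervals only, which is a simplification of the full procedure (maxEnd would be initialized to L values below any valid position, e.g. -1, and pMatches to L empty lists):

```python
def try_insert(match, max_end, p_matches):
    """match = (q_start, q_end, s_start, s_end, dist). Returns True if the
    match is kept, dynamically applying the inclusion filtering."""
    q_start, q_end = match[0], match[1]
    for i in range(q_start + 1):
        # an already found match starting earlier (or here) and reaching
        # at least as far includes the new one: do not insert it
        if max_end[i] > q_end or (i < q_start and max_end[i] >= q_end):
            return False
    if max_end[q_start] < q_end:
        p_matches[q_start].clear()   # the new match includes the shorter ones
    p_matches[q_start].append(match)
    max_end[q_start] = q_end
    return True
```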
Appendix B
The complete XML matching algorithms
In this appendix we present and discuss the complete versions of the sequential
twig pattern matching algorithms whose basic properties and ideas have
been overviewed in Section 4.4. We have three classes of algorithms
which, respectively, perform path matching (Section B.2), ordered
(Section B.3) and unordered (Section B.4) twig matching. For each class
of algorithms we present both the "standard" and the content-based index
optimized versions. All these algorithms perform a sequential scan of the
tree signature; in Section B.5 we give more detail on the current solutions
used to delimit the scan range. The algorithms associate one domain to each
query node and rely on two principles: generate the qualifying index set as
soon as possible and delete from the domains those data nodes which are no
longer needed for the generation of the subsequent answers. With reference
to the previous section, during the scanning process the algorithms generate
the "delta" answer sets ∆(U)ans^k_Q, that is the sets of answers which can be
computed at step k.
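For concreteness, here is a small illustrative sketch (not the thesis code) of the tree signature these algorithms scan, i.e. the sequence of nodes annotated with pre-order and post-order ranks (0-based in this sketch):

```python
class Node:
    def __init__(self, label, children=None):
        self.label, self.children = label, children or []

def signature(root):
    """Return the (label, pre, post) triples in pre-order: the sequence
    that a single left-to-right scan of the matching algorithms consumes."""
    sig, counter = [], {"pre": 0, "post": 0}
    def visit(node):
        entry = [node.label, counter["pre"], None]
        counter["pre"] += 1
        sig.append(entry)
        for child in node.children:
            visit(child)
        entry[2] = counter["post"]     # post-order assigned on the way up
        counter["post"] += 1
    visit(root)
    return [tuple(e) for e in sig]

# signature(Node("a", [Node("b"), Node("c", [Node("d")])])) yields
# [("a", 0, 3), ("b", 1, 0), ("c", 2, 2), ("d", 3, 1)]
```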
Figure B.1: How ∆Σ^k_{prev(h')} (and ∆Σ^{k'}_{prev(h')}) are implemented over the domains D_{prev(h')} and D_{h'}
algorithm PathMatch(P)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk ))
(5) pop(Di );
(6) if(isEmpty(Di ))
(7) for each Di0 where i0 ranges from i + 1 to n
(8) empty(Di0 );
(9) if(¬isEmpty(Dh−1 ))
(10) push(Dh ,(post(dk ),pointerToTop(Dh−1 )));
(11) if(h = n)
(12) showSolutions(h,1);
(13) pop(Dh );
(14) for each Di where i ranges from 1 to n
(15) if(isEmpty(Di ) ∧ last(qi )<k)
(16) exit;
procedure showSolutions(h,p)
(1) index[h] ← p;
(2) if(h = 1) output(D1 .index[1],...,Dn .index[n])
(3) else
(4) for i = 1 to pointer(Dh .index[h])
(5) showSolutions(h − 1,i);
Lemma B.2 For each i ∈ [1, n], ∆Σ^j_i ∩ ∆Σ^k_i = ∅ iff the condition isEmpty(D_i)
is true at some step j' with k ≤ j' ≤ j.
Lemma B.3 At each step j and for each query index i, the stack D_i is a
subset of Σ^j_i containing only the data entries that cannot be deleted from Σ^j_i,
i.e. it has the same content as ∆Σ^j_i when Lemmas 4.4, 4.5, 4.6, and 4.11
have been applied.
Starting from the data node in the leaf stack D_n (function call at Line 12
of the main algorithm), function showSolutions() uses the pointer associated
with each data node d_k to "jump" from one stack D_i to the previous one
D_{i-1} (down to D_1) and recursively combines d_k with each node from the
bottom of D_{i-1} up to the node pointed to by d_k. The correctness of the algorithm
follows from the properties shown so far.
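A compact re-expression of this enumeration (an illustrative sketch assuming 0-based stacks of (post, pointer) pairs, where the pointer indexes the top of the previous stack at push time):

```python
def show_solutions(stacks, h, p, tail=(), out=None):
    """Emit every answer obtained by combining the p-th entry of stack h
    with the entries at or below the node it points to in stack h-1."""
    if out is None:
        out = []
    tail = (p,) + tail                # index chosen in stack h
    if h == 0:                        # reached the first stack: one solution
        out.append(tail)
    else:
        ptr = stacks[h][p][1]         # pointer into stack h-1, saved at push time
        for i in range(ptr + 1):      # from the bottom of D_{h-1} up to ptr
            show_solutions(stacks, h - 1, i, tail, out)
    return out
```

Mirroring Line 12 of the main algorithm, it would be invoked as show_solutions(stacks, len(stacks) - 1, top_index) whenever a node enters the last stack.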
Theorem B.1 For each data node d_j, S = (s_1, . . . , s_n) ∈ ∆ans^j_P(D) iff the
algorithm, by calling the function showSolutions(), generates the solution
S.
Differently from the previous algorithm, we first retrieve the list of the document's
elements of type q_l (the last element of the path, i.e. the only one with a
specified value condition) that satisfy the specified value condition by calling
the function getMatchList() (Line 2). If the list is empty, no answer can be
found (i.e. there is no element q_l that satisfies the specified value condition),
so we can terminate the algorithm (Lines 3 and 4); otherwise we can
start the search. Since the list is ordered by pre-order value and answers
are sought sequentially from start to end, we will first find (if they exist)
the answers that end with the first element of the list, then the answers ending with
the second element, and so on. Each element of the list is sequentially used
as a target element (curLeaf) for the search; following the observations made in
Section 4.3.1, we can skip document subtrees rooted at nodes d_k such that
post(d_k) < post(curLeaf). Such a skip is made by the loop at Lines 13 and
14; the proposed algorithm assumes a signature that does not contain
the first-following values, and if the signature contains those values we can simply
replace Line 14 with k ← ff(d_k). The loop from Lines 8 to 12 updates the
target of the search, if needed. If we have gone past curLeaf (Line 8) we
need to change the target: if a next target exists (Line 9) we simply update
the curLeaf variable, otherwise no other answers can be found and we can
terminate the algorithm (Line 12). In order to keep curLeaf and k coherently
updated, we need to repeat the two loops described above (Lines 8 to
14) until we have curLeaf and k such that k ≤ pre(curLeaf) and post(d_k)
≥ post(curLeaf).
algorithm PathMatchCont(P)
(0) getRange(start, end);
(1) ql ← getLastElement(P );
(2) docLeaves ← getMatchList(ql ,getValue(ql ),getCondition(ql ));
(3) if (¬hasNext(docLeaves))
(4) exit;
(5) curLeaf ← getNext(docLeaves);
(6) for each dk where k ranges from start to end
(7) while((k > pre(curLeaf )) ∨ ((post(dk ) < post(curLeaf ))))
(8) while(k > pre(curLeaf ))
(9) if(hasNext(docLeaves))
(10) curLeaf ← getNext(docLeaves);
(11) else
(12) exit;
(13) while(post(dk ) < post(curLeaf ))
(14) k ← post(dk ) + 1;
(15) for each h such that qh = dk in descending order
(16) for each Di where i ranges from 1 to n
(17) while(¬isEmpty(Di ) ∧ post(top(Di ))<post(dk ))
(18) pop(Di );
(19) if(isEmpty(Di ))
(20) for each Di0 where i0 ranges from i + 1 to n
(21) empty(Di0 );
(22) if(¬isEmpty(Dh−1 ))
(23) push(Dh ,(post(dk ),pointerToTop(Dh−1 )));
(24) if(h = n ∧ dk = curLeaf )
(25) showSolutions(h,1);
(26) pop(Dh );
(27) for each Di where i ranges from 1 to n
(28) if(isEmpty(Di ) ∧ last(qi )<k)
(29) exit;
From Lines 15 to 29 the algorithm is substantially the same as in the previous
section; at Line 24, before generating answers, we also have to check that d_k
is the same node as curLeaf, because we need to verify that the value of d_k
satisfies the specified condition (and that is possible only if d_k = curLeaf).
twig root, it simply returns true (Condition POT3); otherwise it checks the
conditions expressed in Condition POT2. In this case, the links connecting each
domain D_i with the domains of the descendants ī of the i-th twig node and the
minPost operator are exploited to speed up the process. In particular, instead
of checking the post-order value of each data node in the domains, we check
whether minPost(D_ī) > post(d_i) for each domain D_ī (Line 4 of the isCleanable()
code). Whenever a node d_i is deleted, the updateRightLists() function,
shown in detail in Figure B.5, updates the pointer of all the nodes pointing
to d_i in order to make it point to the node below d_i (Line 12 of the function).
Such an update is performed in descending order and stops when a node
pointing to a node below d_i is accessed (Line 3).
As to the current node insertion (Lines 10-11 of the main algorithm),
algorithm OrderedTwigMatch(Q)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) for each di in Di in ascending order
(5) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(6) pos ← index(Di ,di );
(7) delete(Di ,di );
(8) if(i 6= n)
(9) updateRightLists(i,pos);
(10) if(¬isEmpty(Dh−1 ) ∧ isNeededOrd(h,dk ))
(11) push(Dh ,(post(dk ),pointerToLast(Dh−1 )));
(12) if(h = n)
(13) findSolsOrd(h,1);
(14) deleteLast(Dh );
(15) for each Di where i ranges from 1 to n
(16) if(isEmpty(Di ) ∧ last(qi )<k)
(17) exit;
function isCleanable(i,di )
(1) if(i = 1)
(2) return true;
(3) for each ı̄ in descendants(i)
(4) if(isEmpty(Dı̄ ) ∨ minPost(Dı̄ )>post(di ))
(5) return true;
(6) return false;
function isNeededOrd(i,di )
(1) if(i = 1)
(2) return true;
(3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(4) return false;
(5) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di ))
(6) return false;
(7) return true;
procedure updateRightLists(i,pos)
(1) for each di+1 in Di+1 in descending order
(2) if(pointer(di+1 )<pos)
(3) return;
(4) if(pointer(di+1 )=1)
(5) for each di+1 from di+1 in descending order
(6) rP os ← index(Di+1 ,di+1 );
(7) delete(Di+1 ,di+1 );
(8) if(i + 1 6= n)
(9) updateRightLists(i + 1,rP os);
(10) return;
(11) else
(12) decreasePointer(di+1 );
procedure findSolsOrd(h,p)
(1) index[h] ← p;
(2) if(h = 1) output(D1 .index[1],...,Dn .index[n])
(3) else
(4) if(twigBlock(h)<0)
(5) for i = 0 to twigBlock(h)
(6) push(postStack,post(Dh .index[h]));
(7) else if (twigBlock(h)=0)
(8) updateStackMax(postStack,post(Dh .index[h]));
(9) if(twigBlock(h − 1)>0)
(10) curP ost ← pop(postStack);
(11) for i = 1 to pointer(Dh .index[h])
(12) okT oContinue ← false;
(13) if(checkPostDir(Dh .index[h],Dh−1 .i))
(14) if(twigBlock(h − 1)<=0)
(15) okT oContinue ← true;
(16) else
(17) if(post(Dh−1 .i)>curP ost)
(18) okT oContinue ← true;
(19) updateStackMax(postStack,post(Dh−1 .i));
(20) if(okT oContinue)
(21) findSolsOrd(h − 1,i);
Lemma B.4 At each step j and for each query index i, the list D_i is a subset
of Σ^j_i containing only the data entries that cannot be deleted from Σ^j_i, i.e. it
has the same content as ∆Σ^j_i when Lemmas 4.4, 4.5, 4.6, 4.9, 4.13, and 4.14
have been applied.
The solution construction function findSolsOrd() exploits a stack
structure named postStack, in which the post-orders of the children nodes
are kept, and a function named twigBlock(), which helps in identifying the
structure of the query, i.e. its "blocks" of children. In particular, for a given
node, the twigBlock() function returns an integer: 1 for a block opening (i.e.
the given node is the father of other nodes), −n for n blocks closing (i.e. the
given node is a leaf and is the last child of n parents), 0 otherwise (no
blocks opening or closing). If a block is closing (Line 4), and this will be
the first case since we construct the solutions from the last domain,
the current post-order is saved in the stack (Line 6); such post-order is then
updated in case of other siblings (Lines 7-8) in order to keep the maximum
one of the current block. Then, in case of a parent node (block opening,
Line 9), such value is retrieved and, if the post-order direction check succeeds
(Line 13), Line 17 checks whether such value, representing the maximum post-order
of the children, is less than the post-order of the current node. Finally,
if all the checks succeed, the algorithm continues by recursively calling itself
(Line 21).
Theorem B.2 For each data node d_j, S = (s_1, . . . , s_n) ∈ ∆ans^j_Q(D) iff the
algorithm, by calling the function findSolsOrd(), generates the solution S.
QUERY NAVIGATION
isConstrainedLeaf(element): Returns true if the passed element is a value-constrained leaf.
getValueConstrainedLeaves(signature): Returns the list of leaves that specify a value condition.
getTarget(pre): Returns the current target for the pre-th twig element; by convention, if the element is a target itself, it returns the relative target.
MATCHING
isNeededOrdCont(pre,elem): Checks if the insertion of the given elem in the pre-th list is needed, considering the current targets.
updateRightListsCont(pre,pos,curDocPre): Updates the pointers in the (pre+1)-th list, possibly propagating the update, following the deletion of the pos-th element from the pre-th list. It also performs target alignment if needed.
canSkipOrd(post): Returns true if, evaluating the current document post-order value, the current targets and the current domains, it is possible to perform a skip.
algorithm OrderedTwigMatchCont(Q)
(0) getRange(start, end);
(1) targetLeaves ← getValueConstrainedLeaves(Q);
(2) for each li in targetLeaves
(3) targetListli ← getMatchList(li ,getValue(li ),getCondition(li ));
(4) if(¬hasNext(targetListli ))
(5) exit;
(6) if(¬firstAlignment(start))
(7) exit;
(8) for each dk where k ranges from start to end
(9) if(¬adjustTargetsOrd(k))
(10) exit;
(11) while(canSkipOrd(post(dk )))
(12) k ← post(dk ) + 1;
(13) if(¬adjustTargetsOrd(k))
(14) exit;
(15) for each h such that qh = dk in descending order
(16) for each Di where i ranges from 1 to n
(17) for each di in Di in ascending order
(18) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(19) pos ← index(Di ,di );
(20) delete(Di ,di );
(21) if(isConstrainedLeaf(qi ) ∧ pos = 1)
(22) alignment(k,qi )
(23) if(i 6= n)
(24) updateRightListsCont(i,pos,k);
(25) if(¬isEmpty(Dh−1 ) ∧ isNeededOrdCont(h,dk ))
(26) push(Dh ,(post(dk ),pointerToLast(Dh−1 )));
(27) if(h = n)
(28) findSolsOrd(h,1);
(29) deleteLast(Dh );
(30) for each Di where i ranges from 1 to n
(31) if(isEmpty(Di ) ∧ last(qi )<k)
(32) exit;
function isNeededOrdCont(i,di )
(0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i)))
(1) return false;
(2) if(i = 1)
(3) return true;
(4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(5) return false;
(6) if(isEmpty(Di−1 ) ∨ minPost(Di−1 )>post(di ))
(7) return false;
(8) return true;
procedure updateRightListsCont(i,pos,pre)
(1) for each di+1 in Di+1 in descending order
(2) if(pointer(di+1 )<pos)
(3) return;
(4) if(pointer(di+1 )=1)
(5) for each di+1 from di+1 in descending order
(6) rP os ← index(Di+1 ,di+1 );
(7) delete(Di+1 ,di+1 );
(8) if(isConstrainedLeaf(qi+1 ) ∧ rP os = 1)
(9) alignment(pre,qi+1 )
(10) if(i + 1 6= n)
(11) updateRightListsCont(i + 1,rP os,pre);
(12) return;
(13) else
(14) decreasePointer(di+1 );
function canSkipOrd(post)
(0) for i from 1 to n
(1) if( ¬(hasAReferenceTarget(i) ∨ isConstrainedLeaf(qi )) ∨
post ≥post(getTarget(i)) )
(2) return false;
(3) else
(4) if( isEmpty(Di ) ∨
(hasAReferenceTarget(i) ∧ post >maxPost(Di )) )
(5) return true;
Figure B.8: Content index optimized ordered twig matching algorithm aux-
iliary functions
by the status of the domains associated with the target leaves. After the deletion
of an occurrence (Line 7), we check whether we have deleted the first occurrence of a
function adjustTargetsOrd(pre)
(0) if((curT argetl1 is null) ∧ isEmpty(Dl1 ))
(1) return false;
(2) while((curT argetl1 is not null) ∧ pre(curT argetl1 )<pre)
(3) if(hasNext(targetListl1 ))
(4) curT argetl1 ← getNext(targetListl1 );
(5) else
(6) curT argetl1 ← null;
(7) if(isEmpty(Dl1 ))
(8) return false;
(9) return alignment(pre,l1 );
function firstAlignment(pre)
(0) foreach targetListli in targetLists
(1) if(i = 1)
(2) minP re ← pre;
(3) else
(4) minP re ← pre(curT argetli−1 );
(5) do
(6) if(¬hasNext(targetListli ))
(7) return false;
(8) curT argetli ← getNext(targetListli );
(9) while(pre(curT argetli )<minP re)
(10) return true;
At the beginning, the lists are not aligned; in order to perform a first alignment we call the function
firstAlignment(). It has to be noted that, instead of introducing a new
function, we could perform the same task by initializing all the curTargets
and then calling the function adjustTargetsOrd(); but since the first alignment
is simpler than a normal adjustment (we do not have to check the domains because
at the beginning they are all empty), we have chosen to develop an ad-hoc
function. The function firstAlignment() moves the iterator of each list
until it finds an element that has a pre-order value greater than minPre
(loop from Lines 5 to 9). The minPre value is the first pre-order value
scanned by the main algorithm (the value start returned by the getRange()
function) for the first list, or the pre-order value of the target of the preceding
list for the other lists. If we reach the end of a list, we return false (Line
7); if we reach the end of the main loop (i.e. we have a valid target for each
element) we can return true (Line 10).
At each step of the main algorithm we may need to update the targets,
function alignment(pre,li )
(0) if(@li+1 )
(1) return true;
(2) if(isEmpty(Dli ))
(3) minP re ← pre(curT argetli );
(4) else
(5) minP re ← pre(Dli [0]);
(6) if(minP re < pre)
(7) minP re ← pre;
(8) while((curT argetli+1 is not null) ∧ (pre(curT argetli+1 )<minP re))
(9) if(hasNext(targetListli+1 ))
(10) curT argetli+1 ← getNext(targetListli+1 );
(11) else
(12) curT argetli+1 ← null;
(13) if(isEmpty(Dli+1 ))
(14) return false;
(15) return alignment(pre,li+1 );
MATCHING
isNeededUnord(pre,elem): Checks if the insertion of the given elem in the pre-th list is needed (unordered version).
updateDescLists(pre,pos): Updates the pointers in the descendants of the pre-th list, possibly propagating the update, following the deletion of the pos-th element from the pre-th list.
TWIG QUERY NAVIGATION
firstChild(pre): Returns the pre-order of the first child of the given twig node, -1 if the node is a leaf.
firstLeaf(): Returns the pre-order of the first leaf in the twig.
isLeaf(pre): Returns true if the given twig node is a leaf, false otherwise.
parent(pre): Returns the pre-order of the parent of the given twig node.
siblings(pre): Returns the pre-orders of the siblings of the given twig node.
SOLUTION CONSTRUCTION
findSolsUnord(...): Recursively builds the unordered twig solutions.
preVisit(...): Used by findSolsUnord() to navigate the domains.
extendSols(...): Used by findSolsUnord() to build the solutions.
algorithm UnorderedTwigMatch(Q)
(0) getRange(start, end);
(1) for each dk where k ranges from start to end
(2) for each h such that qh = dk in descending order
(3) for each Di where i ranges from 1 to n
(4) for each di in Di in ascending order
(5) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(6) pos ← index(Di ,di );
(7) delete(Di ,di );
(8) if(¬isLeaf(i))
(9) updateDescLists(i,pos);
(10) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnord(h,dk ))
(11) push(Dh ,(post(dk ),pointerToLast(Dparent(h) )));
(12) if(isLeaf(h) ∧ noEmptyLists())
(13) findSolsUnord(firstLeaf(),-1, h, indexesList);
(14) for each Di where i ranges from 1 to n
(15) if(isEmpty(Di ) ∧ last(qi )<k)
(16) exit;
the ordered case. Further, additional twig query navigation functions are needed,
which are quite self-explanatory, and, since the solution construction is different
from the ordered case, new functions are also needed in this respect
(findSolsUnord(), preVisit() and extendSols()). These functions will
be discussed later, while explaining the solution construction algorithm.
The unordered twig matching algorithm is shown in Figure B.11. As in
the other two algorithms we analysed, we first try to delete nodes by means
of the post-order conditions, in particular POT2 and POT3 (Lines 3-9), and,
if a deletion is performed, we update the pointers in the subsequent lists, in
this case in the descendant ones (Line 9). Then, we work on the current node
(Lines 10-13), checking whether an insertion is needed (conditions PRU and POT1,
Line 10) and verifying whether new solutions can be generated (Lines 12-13). This
time, condition PRO1 is not available and, thus, we do not delete a node
from the last stack after solution construction as in the other algorithms.
Finally, the algorithm exits whenever any stack D_i is empty and no data
node matching q_i will be accessed (Lines 14-16).
The boolean function isCleanable() is the same as in the ordered case and
will not be further discussed. In this case, whenever a node d_i is deleted, the
function isNeededUnord(i,di )
(1) if(i = 1)
(2) return true;
(3) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(4) return false;
(5) return true;
procedure updateDescLists(i,pos)
(1) i ← firstChild(i);
(2) if(i 6= -1)
(3) children ← i ∪ siblings(i);
(4) for each h in children
(5) for each dh in Dh in descending order
(6) if(pointer(dh )<pos)
(7) break;
(8) if(pointer(dh )=1)
(9) for each dh from dh in descending order
(10) dP os ← index(Dh ,dh );
(11) delete(Dh ,dh );
(12) if(¬isLeaf(h))
(13) updateDescLists(h,dP os);
(14) break;
(15) else
(16) decreasePointer(dh );
Lemma B.5 At each step j and for each query index i, the list D_i is a subset
of Σ^j_i containing only the data entries that cannot be deleted from Σ^j_i, i.e. it
has the same content as ∆Σ^j_i when Lemmas 4.7, 4.8, 4.9, 4.13, and 4.14
have been applied.
The unordered twig matching solution construction, and all its required
functions, are shown in detail in Figures B.13 and B.14. In this case, the
step-by-step backward construction behavior of the other cases would not
be the best choice, since the pointers now go from children to parents and
not from right to left between domains. The solution construction starts
from the first leaf (see the initial call at Line 13 of the main algorithm), then
navigates all the query nodes one by one, gradually checking and producing
all the answers by extending them with the nodes contained in the associated
domains through the extendSols() function. The solutions are kept in
indexesList, which contains, for each of them, an index array pointing to
the domain nodes, as in the path and ordered matching cases.
Let us examine the way in which the query domains are navigated in
order to build the solutions: Starting from the first leaf, findSolsUnord()
goes up the query twig by recursively calling itself up to the query root node
(Lines 4, 10 of its code), thus covering the leftmost path. For each of the
navigated nodes having right siblings, before navigating up it calls on each
of the siblings the preVisit() function (Line 8), which recursively explores
in pre-order, from parent to child, the subtrees of the given nodes (Lines 4, 6 of
its code). In this way, all the query nodes (domains) are covered and we move
from one domain to the other in the most suitable way w.r.t. the pointers
in the domain nodes and the solution construction. Indeed, first we go up
from a leaf to its parent, thus exploiting the available node pointers going in
the same direction; in this way we extend the current solution only with the
pointed node, which is a sort of upper bound, and the ones underlying it (the
same as in the path and ordered construction) (Lines 10-13 of extendSols()).
Then, when we have reached a parent node, we go downward from it to its
other children, doing the opposite: Starting from the last node in the child
domain to be included in the solution, we extend the solutions with all the
nodes which point to the parent node, or to any node above it (Lines 15-20 of
Theorem B.3 For each data node d_j, S = (s_1, . . . , s_n) ∈ ∆Uans^j_Q(D) iff the
algorithm, by calling the function findSolsUnord(), generates the solution
S.
MATCHING
isNeededUnordCont(pre,elem): Checks if the insertion of the given elem in the pre-th list is needed, considering the current targets.
canSkipUnord(post): Returns true if, evaluating the current document post-order value, the current targets and the current domains, it is possible to perform a skip.
TARGET LIST MANAGEMENT
adjustTargetsUnord(pre): Adjusts the current targets depending on the current document pre. Returns true if the operation was completed successfully.
algorithm UnorderedTwigMatchCont(Q)
(0) getRange(start, end);
(1) targetLeaves ← getValueConstrainedLeaves(Q);
(2) for each li in targetLeaves
(3) targetListli ← getMatchList(li ,getValue(li ),getCondition(li ));
(4) if(¬hasNext(targetListli ))
(5) exit;
(6) for each dk where k ranges from start to end
(7) if(¬adjustTargetsUnord(k))
(8) exit;
(9) while(canSkipUnord(post(dk )))
(10) k ← post(dk ) + 1;
(11) if(¬adjustTargetsUnord(k))
(12) exit;
(13) for each h such that qh = dk in descending order
(14) for each Di where i ranges from 1 to n
(15) for each di in Di in ascending order
(16) if(post(di )<post(dk ) ∧ isCleanable(i,di ))
(17) pos ← index(Di ,di );
(18) delete(Di ,di );
(19) if(¬isLeaf(i))
(20) updateDescLists(i,pos);
(21) if(¬isEmpty(Dparent(h) ) ∧ isNeededUnordCont(h,dk ))
(22) push(Dh ,(post(dk ),pointerToLast(Dparent(h) )));
(23) if(isLeaf(h) ∧ noEmptyLists())
(24) findSolsUnord(firstLeaf(),-1, h, indexesList);
(25) for each Di where i ranges from 1 to n
(26) if(isEmpty(Di ) ∧ last(qi )<k)
(27) exit;
post-order value is greater than the maximum post-order value in the pre-domain
(recall that, since domains are cleaned only when we find a match,
a domain could be not physically empty but substantially empty; see
Section 4.3.3 for the skipping policy), we can directly return true (Line 2).
If the domain is not empty, we need to verify what kind of children the
current element has and possibly call the function again on them. If the current
element has at least one child that is related to targets only by following-
function isNeededUnordCont(i,di )
(0) if(hasAReferenceTarget(i) ∧ post(di ) < post(getTarget(i)))
(1) return false;
(2) if(i = 1)
(3) return true;
(4) if(isEmpty(Dparent(i) ) ∨ maxPost(Dparent(i) )<post(di ))
(5) return false;
(6) return true;
function canSkipUnord(post)
(0) return checkSkipUnord(1,post);
function checkSkipUnord(pre,post)
(0) if(post < post(getTarget(pre)))
(1) if( isEmpty(Dpre ) ∨
(hasAReferenceTarget(pre) ∧ post >maxPost(Dpre )) )
(2) return true;
(3) else
(4) j ← firstChild(pre);
(5) if(j = -1)
(6) return true;
(7) children ← j ∪ siblings(j);
(8) for each qi in children
(9) if(¬(hasAReferenceTarget(i) ∨ isConstrainedLeaf(qi )))
(10) return false;
(11) if(¬checkSkipUnord(i,post))
(12) return false;
(13) return true;
(14) else
(15) return false;
function adjustTargetsUnord(pre)
(0) for each li in targetLeaves
(1) if((curT argetli is null) ∧ isEmpty(Dli ))
(2) return false;
(3) while((curT argetli is not null) ∧ pre(curT argetli )<pre)
(4) if(hasNext(targetListli ))
(5) curT argetli ← getNext(targetListli );
(6) else
(7) curT argetli ← null;
(8) if(isEmpty(Dli ))
(9) return false;
(10)return true;
scan from the first occurrence of q_1 (first(q_1)), while to establish the right
limit of the range we need to separate the ordered (and path) case from the
unordered one. In the former case, due to Lemma 4.15, we can stop the scan
at the last occurrence of q_n (last(q_n)) whereas, in the latter case, Lemma
4.16 suggests to stop at the maximum value among last(q_l) for each leaf
l in the query.
During the computation of the range we can also make some checks that
could identify an empty answer space; for each element of the query we can
compute a specific range that represents the part of the document where
we can find occurrences of that element, that is [first(q_i),last(q_i)]. If
some twig element specifies a value condition and we have a content index
for those elements, we can limit the specific range for an element q_i to
[firstV(q_i),lastV(q_i)], where firstV(q_i) and lastV(q_i) return the first
and the last pre-order value for element q_i in the document satisfying the
specified value condition, respectively.
By analyzing the specific ranges we can readily conclude whether a document
cannot contain a solution. Again, we need to analyze the ordered and
unordered cases separately. Ordered matching requires by definition (see Lemma 4.1) that
answer elements are totally ordered by their pre-order values, so subsequent
elements in the query must be subsequent in the answer. If the specific range
of an element q_j ends before the beginning of the specific range of q_i with
i < j, the document cannot contain any answer. For each query node q_i with
i ∈ [1, n − 1] we need to perform the described check with every node q_j with
j ∈ [i + 1, n]. It is obvious that these checks represent a necessary condition
but not a sufficient one.
Unordered matching requires only a partial order between answer node pre-orders,
so the checks related to a node q_i are performed only with the nodes q_j ∈ descendants(q_i).
and considering all these ranges we can identify a single range for the query
node q_j, that is [minStart(q_j),maxEnd(q_j)], where minStart(q_j) represents
the minimum start (the start with the smallest pre-order) of the previous ranges,
while maxEnd(q_j) represents the maximum end (the end with the greatest pre-order)
of the previous ranges. The second kind of filter computes such a range
for all the nodes q_j that satisfy the conditions explained before and then returns
a range obtained from the intersection of these ranges with the whole data
tree range.
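A sketch of the range-based pruning described above; the ranges are assumed to be precomputed as [first(q_i), last(q_i)] (or, with a content index, [firstV(q_i), lastV(q_i)]) pairs of pre-order values:

```python
def ordered_answer_possible(ranges):
    """Necessary condition for ordered (and path) matching: the specific
    range of q_j must not end before the range of q_i begins, for i < j."""
    n = len(ranges)
    return all(ranges[j][1] >= ranges[i][0]
               for i in range(n - 1) for j in range(i + 1, n))

def unordered_answer_possible(ranges, descendants):
    """Unordered variant: the check is restricted to q_j in descendants(q_i),
    where `descendants(i)` returns the indexes of the descendant query nodes."""
    return all(ranges[j][1] >= ranges[i][0]
               for i in range(len(ranges)) for j in descendants(i))
```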
Appendix C
Proofs
Proof of Theorem 1.1. First, notice that proving Theorem 1.1 is equivalent
to proving the following statement: if extPosFilter(σ_1, σ_2, minL, d) returns
FALSE, then there does not exist a pair (σ_1[i_1 . . . j_1] ∈ σ_1, σ_2[i_2 . . . j_2] ∈ σ_2)
such that (j_1 − i_1 + 1) ≥ minL, (j_2 − i_2 + 1) ≥ minL and ed(σ_1[i_1 . . . j_1],
σ_2[i_2 . . . j_2]) ≤ d. Let extPosFilter(σ_1, σ_2, minL, d) return FALSE; then there
are two alternatives:
2. For each common term σ_1[p_1] = σ_2[p_2] the corresponding counter has
the following value: σ_1c[p_1] < c.
In this second case, if we show that for each common term σ_1[p_1] =
σ_2[p_2], if σ_1c[p_1] < c then ed(σ_1[p_1 − w + 1 . . . p_1], σ_2[p_2 − w + 1 . . . p_2]) > d,
then it is obvious that such a pair (σ_1[i_1 . . . j_1] ∈ σ_1, σ_2[i_2 . . . j_2] ∈ σ_2) does not exist.
From the edit distance definition and from the fact that PP_3 contains the
ordered and non-repeated equal terms, it directly follows that the PP_3 cardinality
is as follows:
$$w - \mathrm{ed}(\sigma_1(w_1), \sigma_2(w_2)) = \mathrm{card}(PP_3) \overset{PP_3 \subseteq PP_2}{\le} \mathrm{card}(PP_2) \overset{PP_2 \subseteq PP_1}{\le} \mathrm{card}(PP_1) = \sigma_1 c[p_1] < c$$
where card(PP_1) = σ_1c[p_1] < c from the filter definition and by hypothesis.
Then, w − ed(σ_1(w_1), σ_2(w_2)) < c. But, since c = w − d, then w −
ed(σ_1(w_1), σ_2(w_2)) < w − d. It follows that ed(σ_1(w_1), σ_2(w_2)) > d. □
$$\frac{\overbrace{\textstyle\sum_{k=1}^{n} |c^k_i| \cdot \mathrm{sim}(c^k_i, c^{p_m(k)}_j)}^{\alpha} + \overbrace{\textstyle\sum_{k=1}^{n} |c^{p_m(k)}_j| \cdot \mathrm{sim}(c^k_i, c^{p_m(k)}_j)}^{\beta}}{\underbrace{\textstyle\sum_{k=1}^{n} |c^k_i|}_{\gamma} + \underbrace{\textstyle\sum_{h=1}^{m} |c^h_j|}_{\delta}}$$
where α, β, γ, and δ are positive values. Moreover, from Eq. 3.6, we have
aSim(D_i, D_j) = α/γ and aSim(D_j, D_i) = β/δ. Obviously, either aSim(D_i, D_j) ≤
aSim(D_j, D_i) or aSim(D_i, D_j) ≥ aSim(D_j, D_i). Let us suppose that
aSim(D_i, D_j) ≤ aSim(D_j, D_i); then α/γ ≤ β/δ ⇒ α/γ ≤ (α+β)/(γ+δ) ≤ β/δ, that is
aSim(D_i, D_j) ≤ Sim(D_i, D_j) ≤ aSim(D_j, D_i). As to the second inequality, notice that
(α+β)/(γ+δ) ≤ β/δ ⇔ (α+β)δ ≤ (γ+δ)β ⇔ (αδ+βδ) ≤ (βγ+βδ) ⇔ αδ ≤ βγ, which
is true since α/γ ≤ β/δ. In the same way, if aSim(D_i, D_j) ≥ aSim(D_j, D_i) then
aSim(D_j, D_i) ≤ Sim(D_i, D_j) ≤ aSim(D_i, D_j).
It follows that if aSim(D_i, D_j) ≤ aSim(D_j, D_i) then aSim(D_i, D_j) ≤
Sim(D_i, D_j) ≤ aSim(D_j, D_i) ≤ $\widehat{aSim}$(D_j, D_i), whereas if aSim(D_i, D_j) ≥
aSim(D_j, D_i) then aSim(D_j, D_i) ≤ Sim(D_i, D_j) ≤ aSim(D_i, D_j) ≤ $\widehat{aSim}$(D_i, D_j),
from which the statements of the theorem follow. □
Proof of Lemma 4.1. Because the index i increases according to the pre-order
sequence, node i + j must be either a descendant or a following
node of i. If post(q_{i+j}) < post(q_i), the node i + j in the query is a descendant
of the node i, thus also post(d_{s_{i+j}}) < post(d_{s_i}) is required. By analogy, if
post(q_{i+j}) > post(q_i), the node i + j in the query is a following node of i,
thus also post(d_{s_{i+j}}) > post(d_{s_i}) must hold. □
Proof of Lemma 4.5. Let us suppose that j ∈ [k, m] exists such that
S = (s_1, . . . , s_n) ∈ ∆ans^j_Q(D) and s_h = k. Notice that it should be s_i < k,
but ∆Σ^k_i = ∅ and thus no index s_i exists such that d_{s_i} = q_i and s_i < k. □
Proof of Lemma 4.10. Obviously, due to the premise, post(d_j) < post(d_{j_0})
when j_0 = k. As to j_0 ∈ [k + 1, m], let us consider the two possible alternatives:
either post(d_k) < post(d_{j_0}) or post(d_k) > post(d_{j_0}). In the former
case, it easily follows that post(d_j) < post(d_{j_0}). The latter case means that
d_k is an ancestor of d_{j_0} since k < j_0. Moreover, d_j is a preceding node of d_k since
post(d_j) < post(d_k) and j < k. It follows that d_j is a preceding node of d_{j_0} and
thus post(d_j) < post(d_{j_0}). □
Proof of Lemma 4.11. Let us suppose that j ∈ [k, m] exists such that
S = (s_1, . . . , s_n) ∈ ∆ans^j_Q(D), s_i ∈ S. Notice that (s_1, . . . , s_n) ∈ ∆ans^j_Q(D)
iff j exists such that s_{j'} > k for each j' ≥ j. Let us consider s_j. Being
s_i < s_j, S is a solution iff post(d_{s_i}) > post(d_{s_j}). But this is impossible,
since post(d_{s_i}) < post(d_k) and, from Lemma 4.10, post(d_{s_i}) < post(d_{s_j}). □
Proof of Lemma 4.12. Notice that, due to Lemma 4.11, each data node
in ∆Σ^{k-1}_i, for i ∈ [1, h − 1], can be deleted from its domain. In this way,
each ∆Σ^k_i is empty and thus, due to Lemma 4.5, k can be deleted from ∆Σ^j_h. □
Proof of Lemma 4.13. Let us suppose that j ∈ [k, m] exists such that
S = (s_1, . . . , s_n) ∈ ∆(U)ans^j_Q(D) and s_i ∈ S, and that ī ∈ [1, n] exists such
that post(q_ī) < post(q_i). Notice that, being post(q_ī) < post(q_i), the node
matching q_ī, i.e. the one with pre-order value equal to s_ī, both in the
case of ordered and unordered matching must satisfy the following property:
post(d_{s_i}) > post(d_{s_ī}). On the other hand, whenever ∆Σ^k_ī = ∅ or, for each
s ∈ ∆Σ^k_ī, post(d_s) > post(d_{s_i}), the condition post(d_{s_i}) > post(d_{s_ī}) can be
satisfied iff s_ī > k. But, being post(d_{s_i}) < post(d_k), due to Lemma 4.10,
post(d_{s_i}) < post(d_s) for each s > k and thus also for s = s_ī. Thus the above
condition cannot be satisfied. □
Proof of Lemma 4.14. First notice that, for each q_i with i ∈ [2, n], post(q_1) >
post(q_i) since q_1 is the root of the twig pattern. Thus, for each j ∈ [k, m] and for
each solution S = (s_1, s_2, . . . , s_n) ∈ ∆(U)ans^j_Q(D) involving s_1, it should
be post(d_{s_1}) > post(d_{s_i}). On the other hand, being post(d_{s_1}) < post(d_k),
from Lemma 4.10 it follows that post(d_{s_1}) < post(d_s) for each s > k, and
S ∈ ∆(U)ans^j_Q(D) iff at least one s_i is greater than k. □
Proof of Theorem 4.3. For a data node s_i ∈ ∆Σ^k_i, whenever post(d_{s_i}) >
post(d_k) there is no way to predict the relationship between the post-order
of the data node s_i and that of the nodes after k. On the other hand,
whenever post(d_{s_i}) < post(d_k), we will show that the following two facts
together constitute an absurd:
1. s_i does not belong to ∆ans^j_Q(D), j ∈ [k, m], because for each (s_{i+1}, . . . ,
s_n) ∈ ∆Σ^j_{i+1} × . . . × ∆Σ^j_n, i' exists such that it is required that post(d_{s_{i'}}) <
post(d_{s_i}) but post(d_{s_{i'}}) > post(d_{s_i});
2. i ≠ 1 and, for each ī ∈ [1, n] such that post(q_ī) < post(q_i), ∆Σ^k_ī ≠ ∅
and there exists s ∈ ∆Σ^k_ī such that s > k and post(d_s) > post(d_{s_i}) (condition
of Lemma 4.13).
Proof of Theorem 4.4. The set of answers ∆ans^k_P(D) is a subset of the
cartesian product ∆Σ^k_1 × . . . × ∆Σ^k_n as, by applying Lemmas 4.4, 4.5, 4.6,
and 4.11, we never delete useful data nodes.
Given that premise, we have to show that if (s_1, . . . , s_n) ∈ ∆ans^k_P(D) then
s_i ∈ ∆Σ^{s_{i+1}}_i for each i ∈ [1, n − 1], and vice versa. If (s_1, . . . , s_n) ∈ ∆ans^k_P(D)
then s_1 < . . . < s_n and thus s_i must be processed before s_{i+1} in the sequential
scanning, that is s_i ∈ ∆Σ^{s_{i+1}}_i. The other way around, we have to show
that if s_i ∈ ∆Σ^{s_{i+1}}_i for each i ∈ [1, n − 1] then (s_1, . . . , s_n) ∈ ∆ans^k_P(D).
We first show that (s_1, . . . , s_n) is a solution: indeed, s_1 < s_2 < . . . < s_n because
s_i ∈ ∆Σ^{s_{i+1}}_i and, for each i ∈ [1, n − 1], post(d_{s_i}) > post(d_{s_{i+1}}) since,
due to Lemma 4.11, we have that post(d_{s_i}) > post(d_j) for each j such that
s_i < j ≤ k, being s_i ∈ ∆Σ^k_i. Finally, we show that (s_1, . . . , s_n) ∉ ans^{k-1}_P(D).
This is true as s_n ∈ ∆Σ^k_n iff, due to Lemma 4.4, k = s_n, and thus it
straightforwardly follows that s_n ∉ Σ^{k-1}_n, which is one of the domains of
ans^{k-1}_P(D). □
Proof of Theorem 4.5. The set of index tuples specified in the theorem is
a subset of the cartesian product ∆Σ^k_1 × . . . × ∆Σ^k_n as, by applying Lemmas
4.4, 4.5, 4.6, 4.9, 4.13, and 4.14, we never delete useful data nodes. Moreover,
any index tuple (s_1, . . . , s_n) which satisfies conditions 1 and 2 is a solution
because, as in the path case, condition 1 implies that s_1 < s_2 < . . . < s_n
whereas condition 2 explicitly requires that the relationships between post-orders
are satisfied. Finally, the fact that (s_1, . . . , s_n) ∉ ans^{k-1}_Q(D) is shown as in
the path case. □
Proof of Theorem 4.6. The set of index tuples specified in the theorem is
a subset of the cartesian product ∆Σ^k_1 × . . . × ∆Σ^k_n as, by applying Lemmas
4.7, 4.8, 4.9, 4.13, and 4.14, we never delete useful data nodes. Moreover, any
index tuple (s_1, . . . , s_n) which satisfies conditions 1 and 2 is a solution and
C.4 Proofs of Appendix B
Proof of Lemma B.1. Let k' ∈ D_i and let top(D_i) = k''; then k' < k''
as k'' is at the top of D_i and the data signature is scanned in a sequential
way. Then there are two alternatives: either post(d_{k'}) > post(d_{k''})
or post(d_{k'}) < post(d_{k''}). In the first case, it straightforwardly follows that
post(d_{k'}) > post(d_k) as, due to the premise, post(d_{k''}) > post(d_k), whereas the
second case is impossible as, when k'' was added to D_i, the algorithm would
have deleted k' from D_i (see Line 5). □
Proof of Lemma B.2. Let us suppose that ∆Σ^{j'}_i = ∅ with k ≤ j' ≤ j;
then, being k ≤ j' and j' ≤ j, it easily follows that ∆Σ^j_i ∩ ∆Σ^k_i = ∅.
As far as the opposite direction is involved, we show that if for each step
j' with k ≤ j' ≤ j we have that ∆Σ^{j'}_i ≠ ∅, then ∆Σ^j_i ∩ ∆Σ^k_i ≠ ∅. The proof
is by induction. When j = k and ∆Σ^k_i ≠ ∅, then ∆Σ^k_i ∩ ∆Σ^k_i = ∆Σ^k_i ≠ ∅.
Let us suppose that the statement is true for j = r; we show it for j = r + 1.
As, if for each step j' with k ≤ j' ≤ r, ∆Σ^{j'}_i ≠ ∅ then ∆Σ^r_i ∩ ∆Σ^k_i ≠ ∅, let
us suppose that ∆Σ^r_i ∩ ∆Σ^k_i = (i_1, . . . , i_n). We have to show that if for each
step j' with k ≤ j' ≤ (r + 1), ∆Σ^{j'}_i ≠ ∅ then ∆Σ^{r+1}_i ∩ ∆Σ^k_i ≠ ∅. Notice that
∆Σ^{r+1}_i ∩ ∆Σ^k_i = ∅ iff at step r + 1 we delete the index set (i_1, . . . , i_n) from
∆Σ^{r+1}_i. But, as such domains are stacks and k ≤ (r + 1), (i_1, . . . , i_n) are
at the bottom of the stack and the deletion of (i_1, . . . , i_n) implies the deletion
of all the data nodes in ∆Σ^{r+1}_i, which is impossible because ∆Σ^{r+1}_i ≠ ∅. □
Proof of Lemma B.3. First, notice that D_i ⊆ Σ^j_i as, for each data node d_j,
at Line 10 the algorithm adds d_j to the right stack. Moreover, the algorithm
deletes some indexes from D_i by means of the pop and empty operators. If k
cannot belong to ∆Σ^j_i due to its pre-order value, it is due to one of the three
possible alternatives shown in Theorem 4.1. The algorithm detects all these
conditions and deletes k from D_i. At Lines 6-8, assuming that D_i = ∆Σ^j_i, due
to Lemma 4.6 it deletes all the nodes in D_{i+1} ∪ . . . ∪ D_n. Indeed, in Lemma B.2
it has been shown that ∆Σ^j_i must be empty in order that ∆Σ^k_i ∩ ∆Σ^{j'}_i = ∅, for
j' ≥ j. Thus, at the j-th step we delete all the "unnecessary" data nodes. At
Lines 9-10, it applies Lemma 4.5. In particular, as whenever a stack is empty
we empty all the stacks at "its right", it is sufficient to check stack D_{h-1}.
Finally, at Line 13, it applies Lemma 4.4. Moreover, the algorithm deletes k
from D_i due to its post-order value at Lines 4-5, where it applies Lemma 4.11.
Notice that, due to Lemma B.1, the algorithm correctly avoids going on with
D_i whenever post(top(D_i)) > post(d_j) as, in this case, no other data node
can be deleted from that stack due to its post-order value. It follows that
the algorithm never deletes from the stacks data nodes which could belong
to a delta answer set ∆ans^{j_0}_P(D) for j_0 ∈ [j, m]. Thus, k ∈ D_i at step j iff
k ∈ ∆Σ^j_i. □
[1] Secure Hash Standard. Technical Report FIPS PUB 180-1, U.S. Department
of Commerce/National Institute of Standards and Technology, 1995.
[3] R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast Similarity Search
in the Presence of Noise, Scaling, and Translation in Time-Series Databases.
In Proc. of 21th International Conference on Very Large Data Bases (VLDB
1995), 1995.
[6] F. Baader, I. Horrocks, and U. Sattler. Description logics for the semantic
web. Künstliche Intelligenz, 16(4), 2002.
[10] T. Baldwin and H. Tanaka. The Effects of Word Order and Segmentation
on Translation Retrieval Performance. In Proc. of the 18th International
Conference on Computational Linguistics (COLING 2000), 2000.
[19] D. Braga and A. Campi. A Graphical Environment to Query XML Data with
XQuery. In Proc. of the 4th Intl. Conference on Web Information Systems
Engineering, 2003.
[22] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms
for Digital Documents. In Proc. of the 1995 ACM SIGMOD International
Conference on Management of Data (SIGMOD 1995), 1995.
[37] P. Ciaccia and W. Penzo. Relevance ranking tuning for similarity queries on
XML data. In Proc. of the VLDB EEXTT Workshop, 2002.
[38] R. Cilibrasi and P.M.B. Vitanyi. Automatic meaning discovery using Google.
Technical report, University of Amsterdam, 2004.
[40] A. Cobbs. Fast Approximate Matching Using Suffix Trees. In Proc. of the 6th
International Symposium on Combinatorial Pattern Matching (CPM 1995),
1995.
[46] Atril Deja Vu - Translation Memory and Productivity System. Home page
http://www.atril.com.
[47] P.F. Dietz. Maintaining Order in a Linked List. In Proceedings of the 14th Annual
ACM Symposium on Theory of Computing (STOC 1982), 1982.
[53] M. Ehrig and A. Maedche. Ontology-focused crawling of web documents.
In Proc. of the ACM SAC, 2003.
[63] T. Grust, M. Van Keulen, and J. Teubner. Staircase Join: Teach a Rela-
tional DBMS to Watch its (Axis) Steps. In Proceedings of 29th International
Conference on Very Large Data Bases (VLDB 2003), 2003.
[68] N. Ide and J. Veronis. Introduction to the Special Issue on Word Sense
Disambiguation: The State of the Art. Computational Linguistics, 24(1),
1998.
[69] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall
Inc., 1988.
[72] H. Jiang, W. Wang, H. Lu, and J. Xu Yu. Holistic Twig Joins on Indexed
XML Documents. In Proceedings of 29th International Conference on Very
Large Data Bases (VLDB 2003), 2003.
[73] T. Kahveci and A. K. Singh. Variable Length Queries for Time Series Data.
In Proc. of the 17th International Conference on Data Engineering (ICDE
2001), 2001.
[97] I. Dan Melamed. Automatic Evaluation and Uniform Filter Cascades for
Inducing N-Best Translation Lexicons. In Proc. of the Third Workshop on
Very Large Corpora (WVLC3), 1995.
[100] G.A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 1995.
[101] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Five Papers
on WordNet. Technical report, Princeton University’s Cognitive Science
Laboratory, 1993.
[105] G. Navarro and R. A. Baeza-Yates. New and faster filters for multiple
approximate string matching. Random Structures and Algorithms, 20(1),
2002.
[108] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a Method for Au-
tomatic Evaluation of Machine Translation. In Proc. of the 40th Annual
Meeting of the Association for Computational Linguistics (ACL 2002), 2002.
[114] C. Rick. A New Flexible Algorithm for the Longest Common Subsequence
Problem. Technical report, University of Bonn, Computer Science Depart-
ment IV, 1994.
[116] S. Rodotà. Introduction to the “one world, one privacy” session. In Proc. of
23rd Data Protection Commissioners Conference, Paris, France.
[119] T. Schlieder and F. Naumann. Approximate tree embedding for querying
XML data. In Proc. of the ACM SIGIR Workshop On XML and Information
Retrieval, 2000.
[120] X. Shen, B. Tan, and C. X. Zhai. Implicit user modeling for personalized
search. In Proc. of CIKM 2005, Bremen, Germany, 2005.
[130] A. Theobald and G. Weikum. The index-based XXL search engine
for querying XML data with relevance ranking. Lecture Notes in Computer
Science, 2287, 2002.
[137] P. Zezula, G. Amato, F. Debole, and F. Rabitti. Tree Signatures for XML
Querying and Navigation. In Proceedings of the XML Database Symposium
(XSym 2003), 2003.
[142] J. Zhou and J. Sander. Data bubbles for non-vector data: Speeding-up hier-
archical clustering in arbitrary metric spaces. In Proc. of 29th International
Conference on Very Large Data Bases (VLDB), 2003.