
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2656906, IEEE Transactions on Knowledge and Data Engineering
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

An Efficient Indexing Method for Skyline Computations with Partially Ordered Domains
Yu-Ling Hsueh, Member, IEEE, Chia-Chun Lin, and Chia-Che Chang

Abstract—

Efficient processing of skyline queries with partially ordered domains has been intensively addressed in recent years. To further
reduce the query processing time to support highly responsive applications, the skyline queries that were previously processed with user
preferences similar to those of the new query contribute useful candidate result points. Hence, the answered queries can be cached with
both their results and the user preferences such that the query processor can rapidly retrieve the result for a new query only from the
result sets of cached queries with compatible user preferences. When caching a significant number of queries accumulated over time,
it is essential to adopt effective access methods to index the cached queries to retrieve a set of relevant cached queries for facilitating
the cache-based skyline query computations. In this paper, we propose an extended depth-first search indexing method (e-DFS for
short) for accessing user preference profiles represented by directed acyclic graphs (DAGs), and emphasize the design of the e-DFS
encoding that effectively encodes a user preference profile into a low-dimensional feature point which is eventually indexed by an R-tree.
We obtain one or more traversal orders for each node in a DAG by traversing it through a modified version of the depth-first search
which is utilized to examine the topology structure and dominance relations to measure closeness or similarity. As a result, e-DFS which
combines the criteria of similarity evaluation is able to greatly reduce the search space by filtering out most of the irrelevant cached
queries such that the query processor can avoid accessing the entire data set to compute the query results. Extensive experiments are
presented to demonstrate the performance and utility of our indexing method, which outperforms the baseline planning techniques,
reducing the computational time by 37% on average.

Index Terms—Indexing methods, query processing, multi-dimensional databases, data management.

1 INTRODUCTION

TABLE 1: An example of a data set with TO and PO domains.

id   TO Domain d1: Price   TO Domain d2: Rank   PO Domain d3: Type
p1   120                   3                    Black coffee
p2   120                   2                    White coffee
p3   120                   3                    Latte
p4   120                   1                    Macchiato
p5   125                   1                    Americano
p6   120                   2                    Irish coffee
p7   135                   5                    Viennese
p8   150                   4                    Frappe coffee
p9   130                   4                    Cappuccino

[Fig. 1 panels: (a) User profile g^1 and the query result {p1, p2, p4, p5}; (b) User profile g^2 and the query result {p1, p2, p3, p4, p5}.]

In recent years, skyline query computation has drawn extensive attention because significant high-dimensional data have become widely available. Many multiple-criteria decision-making applications utilize skyline query computation, e.g., mobile information systems, geographic information systems and traffic monitoring systems. The formal definition of skyline points in d-dimensional space is a distinct object set S, where any two objects in S do not dominate each other. We say p1 dominates p2, if and only if p1 is better than or equal to p2 in all dimensions, and p1 is strictly better than p2 in at least one dimension. Furthermore, skyline queries may involve both totally ordered (TO) and partially ordered (PO) domains. There have been a number of studies on skyline query computation with partially ordered domains. For example, in Table 1, the price of coffee d1 and the popularity d2 are totally ordered domains, and d3, which represents a coffee type, is a partially ordered domain. The coffee type d3 includes the following attributes: {Black coffee, White coffee, Latte, Macchiato, Americano, Irish coffee, Viennese, Frappe coffee, Cappuccino}. We assume that a smaller value for Price and Rank is preferred in the totally ordered domains, and each user provides a user preference profile to describe his/her order of interest among the attributes in a partially ordered domain.
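The dominance test over mixed TO and PO domains can be illustrated with a small sketch. It uses rows p1 to p5 of Table 1 and assumes a preference graph with only four edges (Americano over Black coffee and Cappuccino; Black coffee over White coffee and Latte). The edge list, the function names `reachable`, `dominates` and `skyline`, and the naive pairwise skyline loop are illustrative assumptions, not the authors' implementation.

```python
# Rows p1..p5 of Table 1: (id, price, rank, coffee type); smaller price/rank is better.
POINTS = [
    ("p1", 120, 3, "Black coffee"),
    ("p2", 120, 2, "White coffee"),
    ("p3", 120, 3, "Latte"),
    ("p4", 120, 1, "Macchiato"),
    ("p5", 125, 1, "Americano"),
]

# ASSUMED preference edges (preferred -> less preferred), for illustration only.
PREFERS = [
    ("Americano", "Black coffee"),
    ("Americano", "Cappuccino"),
    ("Black coffee", "White coffee"),
    ("Black coffee", "Latte"),
]


def reachable(edges):
    """Map each attribute to every attribute it is (transitively) preferred over."""
    succ = {}
    for better, worse in edges:
        succ.setdefault(better, set()).add(worse)
    reach = {}
    for start in succ:
        seen, stack = set(), list(succ[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        reach[start] = seen
    return reach


def dominates(a, b, reach):
    """True if a is at least as good as b in every domain and strictly better in one."""
    _, price_a, rank_a, type_a = a
    _, price_b, rank_b, type_b = b
    type_better = type_b in reach.get(type_a, set())
    if price_a > price_b or rank_a > rank_b:
        return False  # worse in a totally ordered domain
    if not (type_better or type_a == type_b):
        return False  # incomparable or worse in the partially ordered domain
    return price_a < price_b or rank_a < rank_b or type_better


def skyline(points, reach):
    """Naive pairwise skyline: keep the ids of points that no other point dominates."""
    return [p[0] for p in points
            if not any(dominates(q, p, reach) for q in points if q is not p)]


REACH = reachable(PREFERS)
print(skyline(POINTS, REACH))
```

Under these assumed edges, p1 dominates p3 (equal price and rank, Black coffee preferred over Latte), while p1 does not dominate p2 because its rank is worse.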
Fig. 1: Two examples of skyline queries with user preference profiles g^1 and g^2, respectively.

A directed acyclic graph (DAG) is used to represent a user preference profile, where each node represents an attribute in a partially ordered domain. Two examples of user preference profiles g^1 and g^2 are shown in Figs. 1(a) and (b). We assume {Black coffee = B, White coffee = W, Latte = L, Macchiato

Yu-Ling Hsueh, Chia-Chun Lin, and Chia-Che Chang are with Department of Computer Science and Information Engineering, National Chung Cheng University, Taiwan.

1041-4347 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

[Fig. 2 components: a new query enters the cache-based query processor, which consists of (i) access methods (source-clustered indexing, attribute relational graph indexing, extended depth-first search indexing), (ii) a similarity measure (a similarity function that validates mappings and violations and yields the top-k most similar cached queries), and (iii) query evaluation (query answer retrieval from the similar cached queries, or from the data set when necessary). Queries are cached with their results in the cached query set, and the query results are returned.]
Fig. 2: The system framework of a cache-based query processor using access methods.

= M, Americano = A, Irish coffee = I, Viennese = V, Frappe coffee = F, Cappuccino = C}. g^1 represents that the user prefers an Americano (A) to a Black coffee (B) and a Cappuccino (C), and prefers a Black coffee (B) to a White coffee (W) and a Latte (L), and so on. We first evaluate the query answer based on the user preference profile g^1 in Fig. 1(a). A data point p1 dominates p3, because the price and the rank of p1 are equal to those of p3 and the user has specified in g^1 that Black coffee (B) is preferred over Latte (L). A data point p1 does not dominate p2: although the user prefers Black coffee (B) to White coffee (W) and the price of p1 is equal to that of p2, the rank of p1 is worse than that of p2. As a result, the final skyline query result set {p1, p2, p4, p5} is returned. The result set of the second query with the user preference profile g^2 shown in Fig. 1(b) is {p1, p2, p3, p4, p5}. We can observe that g^2 contains all of the compatible preference relations in g^1, although the structures of the two DAGs are not identical. For example, the relation that A is preferred over V in g^2 is less strict than the preference relation with respect to A and V in g^1. As a result, the query result set of g^2 must be a superset of that of g^1. If the query with g^2 has been answered by the system before the new query with the user preference profile g^1 is requested, we can efficiently answer the new query by utilizing the result of the previously answered query with g^2.

The skyline query computation suffers a high cost in high dimensions with partially ordered domains. In our previous work, we proposed a cache-based framework called Caching Support for Skyline computations (CSS) [1], which uses a cache to store user preference profiles and skyline results such that CSS does not have to access the entire data set for calculating the skyline results for a new query. We have concluded that such a cache-based approach improves upon existing methods and is especially well-suited for interactive applications that require a fast response time. When caching a significant number of queries accumulated over time, it is essential to adopt effective access methods to index the cached queries for efficiently retrieving a set of relevant cached queries for the skyline query computations. Therefore, we have introduced three indexing methods for the cache-based framework in this paper. In addition to the source-clustered and attribute relational graph (ARG) indexing methods, we propose an extended depth-first search indexing method (e-DFS) for accessing user preference profiles of the cached queries. We first perform the e-DFS encoding that effectively encodes a user preference profile into a low-dimensional feature point which is eventually indexed by an R-tree. We then obtain one or more traversal orders for each node in a DAG by traversing it through a modified version of the depth-first search which is utilized to examine the topology structure and dominance relations to measure closeness or similarity. Fig. 2 shows the system framework using indexing methods. When a new query q (see Step (1)) is requested, one access method (i.e., an indexing method) (see Step (2)) searches for the similar user preference profiles from the cached query set (see Step (3)). Next, the system performs a similarity evaluation to compute the similarity scores only on the cached queries selected by the access method to measure the level of similarity with respect to the new query q, and then evaluates the new query q based on the top-k similar cached queries (see Steps (4)-(5)). If q cannot be answered given the cached queries, the system directly accesses the data set (see Step (6)) to compute the query result for q. Finally, the system outputs skyline query results to the user and caches the queries with the user preference profiles and the results (see Step (7)). In this paper, we mainly focus on the improvement of the access methods (i.e., Step (2)) that are described in Section 4. In summary, the main contributions of this work are listed as follows:

1) We propose a new access method called e-DFS that combines the criteria of similarity evaluation to index user preference profiles represented by DAGs to facilitate the cache-based skyline query computation with partially ordered domains. e-DFS is able to greatly reduce the search space by filtering out most of the irrelevant cached queries such that the query processor can avoid accessing the entire data set to compute the query results.
2) We design an e-DFS encoding method to convert a DAG into a low-dimensional point that preserves most of the preference orders (i.e., dominance relations). As a result, the access time to the cached queries can be greatly reduced so as to reduce the overall query processing time.
3) We conduct a series of experimental evaluations to verify our indexing methods. In addition to the e-DFS indexing method, we compare our work with the source-clustered and the ARG indexing methods. Our experimental results show that e-DFS outperforms the other two methods, reducing the computational time by 37% on average, in particular when the number of partially ordered domains and the number of dimensions are high.

2 BACKGROUND AND RELATED WORK

Early work related to skyline query computation with only totally ordered domains has been proposed in the past decades. Borzsonyi et al. [2] introduced the non-progressive block-nested-loop (BNL) and divide-and-conquer (D&C) algorithms. The BNL approach


recursively compares a data point p with the skyline candidates which might be dominated in a later step. However, the performance of BNL is restricted by the main memory size. The D&C technique divides the data set into several partitions and computes the partial skyline points for each one. By merging the partial skyline points, the final skyline can be obtained. Both of these algorithms can incur many iterations and are inadequate for on-line processing. Tan et al. [3] presented a progressive skyline processing algorithm named Bitmap, which converts multidimensional data points into bit strings to reduce the time of dominance checking. The nearest neighbor (NN) method [4] indexes the data set with an R-tree and utilizes a k-nearest neighbors algorithm to find the skyline results. This approach repeats the query-and-divide procedure and inserts new partitions that are not dominated by any skyline point into a to-do list. The algorithm terminates when the to-do list is empty. A special method is applied to remove duplicates retrieved from overlapping partitions. The branch-and-bound skyline (BBS) algorithm [5], [6] traverses an R-tree to find a set of skyline points. BBS recursively performs a k-nearest neighbors algorithm to compute leaf nodes that are not dominated by the currently discovered skyline points. Because BBS traverses R-tree nodes based on their mindist from the origin, each retrieved point is guaranteed to be a final skyline point and can be immediately returned to the user. Subsequently, the techniques aim to support continuous skyline computations for moving objects and data streams. Lin et al. [7] utilized n-of-N skyline queries with the most recent n of N elements to support on-line computation against sliding windows over a rapid data stream. Morse et al. [8] illustrated a scalable LookOut algorithm for efficiently updating a continuous time-interval skyline. Sharifzadeh and Shahabi [9] introduced the concept of spatial skyline queries (SSQ). Given a set of data points P and a set of query points Q, SSQ retrieves from P a set of points that are not dominated by any other point in P, while considering their derived spatial attributes with respect to the query points in Q. For dynamic query points, a strategy of processing continuous skyline queries has been presented with a kinetic-based data structure [10]. A suite of novel skyline algorithms based on a Z-order curve [11] has also been proposed in [12]. A new indexing structure called ZBtree is used to index and store data points based on Z-order curves. Among the solutions, ZUpdate facilitates incremental skyline result maintenance by utilizing the properties of a Z-order curve. Other related studies can be found in the literature [13], [14], [15], [16], [17], [18], [19], [20]. However, all of the existing studies differ from the main goal of this research which is to support the efficient indexing for skyline queries with partially ordered domains.

The techniques proposed in [21], [22], [23], [24], [25], [26] are related to skyline query computation with partially ordered domains. Chan et al. [21] presented three algorithms for evaluating skyline queries with partially ordered attributes. Their solution transforms each partially ordered attribute into a two-integer domain value, which allows the users to utilize index-based algorithms to compute skyline queries in the transformed space. However, all of the techniques proposed in [21] have limited progressiveness and pruning abilities. In real applications, dynamic preferences for categorical attributes are more common than a fixed ordering for skyline query evaluation. One straightforward solution is to enumerate all of the possible preferences and to materialize all of the results of the preferences; however, the costs of a full materialization are usually prohibitive. Therefore, Wong et al. [25] proposed a semi-materialization method called IPO-tree search, which stores only partial useful results. With these partial results, the result of each possible preference can be efficiently returned. The adaptive SFS algorithm is proposed to transform the value in a partially ordered domain to the numerical values by a ranking function. When a new query with its user preference profile is requested, the adaptive SFS algorithm computes the skyline results using an IPO-tree. However, IPO-trees only consider very simple totally-order-like user preferences. Sacharidis et al. designed a topological sort-based mechanism called topologically-sorted skylines (TSS) [24], which converts the values of a partially ordered domain into integer intervals that enable the traditional methods to process a skyline query with partially ordered domains. In addition, TSS can handle dynamic skyline queries. Zhang et al. [26] extended the lattice theorem and an off-the-shelf skyline algorithm and then designed a mechanism that employs an appropriate mapping of a partial order to a total order. Jung et al. [22] presented two indexing methods named SM-treaps and SR-trees for the attributes of totally ordered domains to support progressive query evaluation. These methods use the zero in-degree sorting algorithm (e.g., topological sorting) to assign a scheduling number called tso to each attribute, and build a priority search tree according to the tso value. The common disadvantage of SM-treaps and SR-trees is the greater requirement for memory space, because the system needs to build an SM-treap or SR-tree for each attribute. Liu et al. [23] proposed a method called ZINC that uses a partial order reduction algorithm to simplify a user preference profile, and converts it using a nested encoding scheme to map partial orders into total orders. ZINC then uses a ZB-tree to index the encoding values. A ZB-tree maps multi-dimensional data to a one-dimensional Z-address value in a totally ordered domain. However, the processing of the partial order reduction algorithm suffers a high converting cost when a user preference profile is complex. Zhang et al. [26] presented two methods called CPS and the strata cyclic linked approach. CPS uses the attributes which cannot be compared in a directed acyclic graph (DAG) to convert a user preference profile into a CPS encoding, and the strata cyclic linked approach stores a CPS encoding for a user preference profile. However, all of the aforementioned studies differ from the main goal of this research.

In this paper, we focus on indexing user preference profiles represented by DAGs to facilitate the cache-based skyline computations with partially ordered domains. The related work regarding graph indexing proposed in [27], [28] is suitable for indexing large graphs. SPath [28] uses the decomposed shortest paths around the vertex neighborhood as basic indexing units. SPath is efficient for addressing the graph query problem on large networks. SAPPER [27] uses the hybrid neighborhood unit structures to index large graphs. The techniques proposed in [29], [30], [31], [32], [33] are also related to graph indexing. Wang et al. [30] proposed a framework called SEGOS for graph similarity searches. SEGOS uses a two-level indexing method to index the sub-units of graphs. The lower-level index is used to efficiently find the top-k similar sub-units, and the upper-level index is used to find a similar graph from a list of graphs that are sorted based on the similarity score. Subsequently, they use two algorithms adapted from the TA and CA methods to improve the performance of the graph similarity searches. Petrakis et al. [29] proposed a method named ImageMap, which maps images into low-dimensionality points according to their ARG editing distance for image searches. Zhang et al. [32] proposed a method named TreePi, which uses frequent subtrees as indexing structures. Yan et al. [31], [33] proposed a graph


indexing model named gIndex, which uses frequent substructures of a graph as the basic indexing feature. Frequent substructures are ideal candidates because the intrinsic characteristics of the frequent substructures are relatively stable. However, none of the existing graph indexing methods are primarily designed for the skyline query computation with partially ordered domains. In [34], Hsueh et al. proposed a framework that adopts a cache to store previously answered queries to speed up the query processing for a new query with partially ordered domains. Two baseline indexing methods were proposed to access the cached queries; however, these two indexing methods fail to filter out irrelevant cached queries. Therefore, unlike the baseline indexing methods, we have designed a new indexing method named e-DFS that converts a DAG into a low-dimensional feature point with the consideration of dominance relations by examining the topology structure so as to improve the query evaluation performance.

3 PROBLEM DEFINITIONS AND SYSTEM OVERVIEW

We formally define skyline queries with partially ordered domains, user preference profiles, and a similarity function as our objective function. The problem statement is addressed subsequently in this section. We first introduce the notations of the symbols used throughout the paper as shown in Table 2. Note that some symbols may be used as prefixes, and their meanings are explained when used.

TABLE 2: Symbols and functions.

Symbol                  Description
P                       The entire data set in the system.
TO and m                TO represents a totally ordered domain and the size is m.
PO and n                PO represents a partially ordered domain and the size is n.
d                       Number of total dimensions and d = m + n.
q_i                     A query of a user i.
G^i and g^i_j           G^i is a user-specified preference set G^i = {g^i_1, g^i_2, ..., g^i_n} for each PO domain, where each g^i_j ∈ G^i, 1 ≤ j ≤ n, is a preference profile of a user i in a PO domain j.
G                       A cached preference set consists of G = {G^1, G^2, ..., G^MAX}, where each G^i ∈ G is the user-specified preference set of a user i, and MAX is the maximum number of preference sets which can be cached in the system.
g = (V, E) and h        A user preference profile in a preference set, comprising a set of the attributes V with a size of h = |V|, and a set of edges E. For simplicity, the superscript and subscript may be neglected. g is represented as a user preference profile for any PO domain of one unspecified user; g^i is for one unspecified domain of user i; g_j is for a PO domain j of one unspecified user.
v_i → v_j               A primitive relation exists in a user preference profile, where node v_i is adjacent to v_j.
v_i ⇢ v_j               A transitive relation exists in a user preference profile, where node v_i is (indirectly) connected to v_j.
v_i ↔ v_j               An equivalent relation exists in a user profile, when v_i and v_j are sibling nodes.
S_g(g')                 A similarity function returns a score to measure the level of similarity between g and g' of a new query and a cached query in G, respectively.
kCQ                     A top-k query set sorted by their similarity scores is selected from G.
R                       The candidate result set R = {q_1.R ∪ q_2.R ∪ ... ∪ q_k.R} is a union set of the result R of each cached query q_i in kCQ, 1 ≤ i ≤ k, selected for computing the final query result.
A_g and A_g^k           An adjacency matrix represents all of the relations between vertices in a DAG. The non-diagonal entry a_ij of A_g is 1 if there is an edge from vertex v_i to vertex v_j (i ≠ j); otherwise a_ij is 0. A_g^k is a transitive closure with a path length r = k. When r = 1, A_g is identical to A_g^1.
r                       The length of a path from v_i to v_j is the number of a sequence of directed edges (e_1, e_2, ..., e_r) on the path.
DFS_start and DFS_end   DFS_start is the e-DFS traversal order for a node when visiting the node on a forward path and DFS_end is the traversal order when visiting the same node on a backward path.

Definition 1. (A skyline query with partially ordered domains): Given a data set P in a d-dimensional space and a user-specified preference set G^i = {g_1, g_2, ..., g_n} of a user i, a skyline query with partially ordered domains retrieves a distinct object set S, where any two objects in S do not dominate each other. We say p1 dominates p2 (p1 ⊢ p2 for short), if and only if p1 is better than or equal to p2 for ALL TO and PO dimensions, and p1 is strictly better than p2 in at least one dimension.

Definition 2. (A user preference profile g): A user preference set for a PO dimension can be represented by a directed acyclic graph (DAG) g = (V, E), comprising a set V of the attributes with a set E of edges.

The topological structure of a DAG represents the preferences for each option in V. In each DAG, there exist three possible mapping relations: a primitive relation, a transitive relation, and an equivalent relation. We adopt an adjacency matrix A_g to represent a DAG g and then compute its transitive closure g^+ = (V, E^+), which is composed of all of the primitive and transitive relations in g. Let g^i_j represent a preference profile of a user i in PO domain j. Note that superscript and/or subscript of g may be omitted for simplicity. In that case, g is represented as a user preference profile for any PO domain of one unspecified user; g^i is for one unspecified PO domain of user i; g_j is for PO domain j of one unspecified user. The definitions of the three mapping relations are given in Definition 3:

Definition 3. (Mapping relations): There are three types of mapping relations: a primitive relation, a transitive relation, and an equivalent relation.
1) A primitive relation (denoted by v_i → v_j) has one edge <v_i, v_j> in E, connecting from node v_i to node v_j, and v_i and v_j are adjacent;
2) A transitive relation (denoted by v_i ⇢ v_j) from node v_i to node v_j holds, if there exists at least one additional node v_w ∈ V between v_i and v_j; the edge <v_i, v_j> must exist in E^+.
3) An equivalent relation (denoted by v_i ↔ v_j) between node v_i and node v_j holds, if v_i and v_j are sibling nodes.

As shown by the examples in Fig. 3(a), the relation of A → B or A → C is a primitive relation, because there is an edge in E of g connecting A and B or A and C. However, a transitive relation of A ⇢ E has one transitive edge (dotted line) in E^+ of g^+ between A and E. For an equivalent relation, the relation of B ↔ C indicates that the user does not prefer one over the other, where B and C are sibling nodes.

Definition 4. (Similarity S_g(g')): The level of similarity between g of a new query and g' of a cached query is measured by computing the scores of mappings in relations, where a primitive relation contributes a higher score than a transitive relation and an equivalent relation contributes the lowest score.
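The three relations of Definition 3 can be sketched on a small DAG. This is a minimal illustration, not the authors' code. The edge set is hypothetical: only A → B, A → C and A ⇢ E are stated for Fig. 3(a), so B → D and C → E are assumed. "Sibling" is interpreted here as "sharing a direct parent", which is also an assumption.

```python
def transitive_closure(nodes, edges):
    """Warshall-style closure: reach[u] holds every node reachable from u."""
    reach = {v: set() for v in nodes}
    for u, v in edges:
        reach[u].add(v)
    for k in nodes:              # k is the intermediate node
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach


def classify(u, v, edges, reach):
    """Return 'primitive', 'transitive', 'equivalent', or None for the pair (u, v)."""
    if (u, v) in edges:
        return "primitive"       # direct edge in E
    if v in reach[u]:
        return "transitive"      # edge only in the closure E+
    parents_u = {a for a, b in edges if b == u}
    parents_v = {a for a, b in edges if b == v}
    if parents_u & parents_v:
        return "equivalent"      # ASSUMPTION: siblings share a direct parent
    return None


# Hypothetical DAG; B -> D and C -> E are assumed edges.
NODES = ["A", "B", "C", "D", "E"]
EDGES = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")}
REACH = transitive_closure(NODES, EDGES)

for pair in [("A", "B"), ("A", "E"), ("B", "C"), ("D", "A")]:
    print(pair, classify(*pair, EDGES, REACH))
```

The set-based closure plays the role of the transitive closure of the adjacency matrix A_g; a boolean matrix would work equally well for small attribute sets.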


Lemma 1. Given g of a new query q and g 0 of a cached query q 0 ,


when all relations of g can be found in g 0 and there exist only A A B
mapping relations between g and g 0 , the skyline results of q
and q 0 must completely overlap. B C C D
Proof: (By definition) For simplicity, the size of the PO domain
of the data set is one, and g and g 0 are the user preference profiles D EE E
of q and q 0 , respectively. Let T = q.R, where q.R is the skyline
result of q ; T 0 = q 0 .R, where q 0 .R is the skyline result of q 0 .
Because all the relations of g can be found in g 0 and there are only
mapping relations between g and g 0 , T must be a subset of T 0 .
(a) g (b) g 1 (5 mappings,
As the example is shown in Fig. 1, we can see that since g 2 1 violation)
contains all of the mapping relations in g 1 , and there exist only
mapping relations, the skyline results of g 1 are a subset of the B A
results of g 2 . C
The existing graph indexing approaches [30] measure the level
of similarity between two graphs in terms of their morphisms. D B EE B D
However, the concept of similarity cannot be applied to our work.
We consider the mappings, which consist of the three above-
mentioned relations, for similarity measures to facilitate the cache- C D E
based skyline query search over partially ordered domains. The
similarity cannot be measured solely based on the shapes of the A
DAGs. For example, in Figs. 3(a)−(d), the similarity between
g and g 1 is higher than that between g 2 and g 3 , despite the (c) g 2 (5 mappings, 4 (d) g 3 (2 mappings,
fact that the shape of g 1 is not similar to g . In addition, we violations) 5 violations)
also define violations, when the relation does not belong to any
cases in the mapping relations. A violation causes no common
Fig. 3: An example of the DAGs of cached queries with the
results between two queries. Therefore, the similarity scores are
number of mappings and violations with respect to g of the new
deducted to penalize the contribution of similarity. For example, in
query.
Fig. 3(a), g.A −→ g.C violates g 3 .C 99K g 3 .A. In the following
definitions, we define mappings and violations for a new query with respect to a cached query. Note that the definitions cannot be applied to the other direction.

Definition 5. (Mapping): Given $g$ of a new query and $g'$ of a cached query, if $g.v_i \longrightarrow g.v_j$ or $g.v_i \dashrightarrow g.v_j$ holds, a mapping occurs with either $g'.v_i \longrightarrow g'.v_j$ or $g'.v_i \dashrightarrow g'.v_j$. Additionally, if $g.v_i \longrightarrow g.v_j$, $g.v_i \dashrightarrow g.v_j$, or $g.v_i \longleftrightarrow g.v_j$ holds, a mapping occurs with $g'.v_i \longleftrightarrow g'.v_j$.

Definition 6. (Violation): Given $g$ of a new query and $g'$ of a cached query, a pair of preference relations are in violation of each other, if either $g.v_i \longrightarrow g.v_j$ or $g.v_i \dashrightarrow g.v_j$ holds and at the same time either $g'.v_j \longrightarrow g'.v_i$ or $g'.v_j \dashrightarrow g'.v_i$ exists. Additionally, if $g.v_i \longleftrightarrow g.v_j$ holds, a violation occurs with $g'.v_i = g'.v_j$.

Based on the definitions of a mapping and a violation, the number of mappings and violations are computed with respect to $g$ as shown in Figs. 3(b)-(d). Subsequently, we compute the similarity score given by the following equation for $n$ partially ordered domains. The similarity score can possibly become a negative value due to violations.

$$S_g(g') = \sum_{i=1}^{n} \left[ \mathit{mapping}_{g_i}(g'_i) - \mathit{violation}_{g_i}(g'_i) \right] \qquad (1)$$

Problem Statement
The problem statement for our work is described as follows. Given a new query $G_q$ and a cached query set $G$, a top-k cached query set $kCQ$ sorted in descending order based on the similarity function defined in Equation 1 is returned.

$$kCQ = \operatorname*{argmax}_{G_i \in G} \sum_{\forall G_i \in kCQ} S_{G_q}(G_i)$$

$$S_{G_q}(G_i) = \sum_{w=1}^{n} S_{g_w^q}(g_w^i)$$

where $kCQ$ consists of the top-k cached queries selected from $G$, and the sum of the similarity scores for all queries in $kCQ$ with respect to $G_q$ achieves the highest scores. After we obtain the $kCQ$ set (a filtering step), the system is able to compute the query results (a refinement step) only from $R = \{q_1.R \cup q_2.R \cup \cdots \cup q_k.R\}$, which is a union set of the result of each cached query $q_i$ in $kCQ$, $1 \le i \le k$, without going through the entire data set $P$, if at least one $q_i.R \in R$ has no violation relations. Otherwise, additional steps are adopted to obtain the missing data tuples due to the violation relations. We adopt the CSS [1], [34] system framework illustrated in Fig. 2 and its similarity function. In this paper, we mainly focus on the improvement of the query access methods (indexing methods) that are described in Section 4.

4 QUERY INDEXING METHODS

In this section, we discuss the design of the access method for DAGs, which is one of the most challenging tasks for a cache-based system to perform skyline query computations. When a new query is requested, the system searches for a set of queries with similar user preference profiles from the cache. An effective index structure for indexing the user preference profiles for a partially ordered domain is needed because it has a great impact on the performance as the number of cached queries increases over time.
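As a rough illustration of the filtering step that the index has to support, the following sketch scores cached queries with Equation 1 and keeps the top-k. It assumes that the per-domain mapping and violation counts have already been computed; the function names `similarity_score` and `top_k_cached_queries` and the dictionary layout are hypothetical, and the counts are the ones reported for g1, g2, and g3 in Fig. 9 with a single PO domain.

```python
import heapq


def similarity_score(counts):
    """Equation 1: sum of (mappings - violations) over the n PO domains.

    `counts` is a list of (mappings, violations) pairs, one pair per PO domain.
    """
    return sum(mappings - violations for mappings, violations in counts)


def top_k_cached_queries(cached, k):
    """Return the ids of the k cached queries with the highest scores, best first."""
    return heapq.nlargest(k, cached, key=lambda q: similarity_score(cached[q]))


# Hypothetical cache layout: query id -> per-domain (mappings, violations).
# The counts are those listed in Fig. 9 for one PO domain (n = 1).
cached = {
    "g1": [(9, 1)],
    "g2": [(8, 3)],
    "g3": [(7, 5)],
}

scores = {q: similarity_score(c) for q, c in cached.items()}
ranking = top_k_cached_queries(cached, k=2)
print(scores)
print(ranking)
```

The score can be negative when violations outnumber mappings, so a real system would probably also want a cut-off below which cached queries are not used at all; that policy is not specified here.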


However, the existing graph indexing methods only focus on how to store the graph data for similarity searches, regardless of the characteristics of the dominance relations for skyline queries with partially ordered domains. The similarity search for graphs measures the similarity between two graphs and finds the analogous graphs. A well-known technique to compare two graphs is GED (graph edit distance), which measures the minimum number of graph edit operations to transform one graph into another. However, we use a DAG to represent a user preference profile, and the previous graph indexing methods cannot be applied to our work, because the existing methods measure the level of similarity in terms of morphisms. Therefore, we have designed an efficient indexing method, an extended depth-first search indexing algorithm (e-DFS for short), to search for similar DAGs for partially ordered domains to facilitate the skyline queries in this section.

We describe two baseline algorithms [34], the source-clustered and the attribute relational graph (ARG) indexing methods, for comparison, and subsequently we detail the e-DFS algorithm. The source-clustered indexing method uses vertices in the first level of DAGs as the keys to search for similar DAGs, while the ARG indexing method identifies the relations between vertices as features. Our e-DFS utilizes the depth-first search algorithm to preserve the characteristics of the dominance relations in DAGs, on which the similarity search is based. Note that a DAG and a user preference profile are used interchangeably, since they are equivalent.

4.1 Source-clustered Indexing

We introduce the first indexing method named source-clustered indexing. This method uses the node(s) in the first level of a DAG as the key(s) to search for similar user preference profiles. We use these nodes as the keys because the nodes in the first level dominate those in the second level, those in the second level dominate those in the third level, and so on. Therefore, the nodes in the first level are more important than those in the other levels. For example in Fig. 3(a), the key of $g$ and $g^2$ is A, the keys of $g^1$ are A and B, and the key of $g^3$ is C. When the new query $g$ is requested, the system retrieves the cached user preference profiles $g^1$ and $g^2$ for $g$, because these two profiles have the same source key (i.e., A). As a result, $g^3$ is ignored because its key does not include A. The indexing structure is easy to implement; however, it is efficient only when the data cardinality is low and the number of attributes in a DAG is small, because the indexing method incurs a significant load imbalance. For example, in Fig. 3, if A is a popular option, the number of cached queries under key A is large, incurring a sequential search while the data access is performed.

4.2 Attribute Relational Graph Indexing

The source-clustered indexing method cannot efficiently handle the cached queries with complex user preference profiles, because the indexing structure simply uses the source nodes as the keys and the rest of the relations are not considered. On the other hand, the attribute relational graph (ARG) indexing structure maps a DAG to a corresponding ARG. The ARG indexing method uses the relations between vertices in a DAG as features, and converts a DAG to a multi-dimensional feature point. Subsequently, an R-tree is built to store the converted feature points. As a result, an R-tree groups DAGs with similar user preference relations into the same minimum bounding rectangles. When searching a similar preference profile for a query, which is then converted to a corresponding feature point by the same principles, a nearest feature point in the R-tree is considered as the most similar DAG to the query.

For example, the user preference profile $g$ and its corresponding ARG are shown in Figs. 4(a) and (b). In Fig. 4(b), for example, the relation of "B, A, -1" indicates that B is dominated by A; the relation of "A, C, 1" indicates that A dominates C; the relation of "C, E, 0" indicates that C and E have an equivalent relation. Therefore, the corresponding adjacency matrix $A_g$ is shown in Fig. 4(c), which shows all the relations between two vertices. Because the adjacency matrix $A_g$ is skew-symmetric ($a_{ji} = -a_{ij}$), a set of values in row-by-row order is obtained only from the top-right area (shaded area) of the adjacency matrix $A_g$ as a feature point so as to reduce the dimensionality of the feature point. As a result, the feature point of $g$ is converted to (1, 1, 1, 1, 0, 1, 1, 0, 0, 0), which is a 10-dimensional point for one PO domain ($n = 1$). When the number of PO domains $n > 1$, the feature point is obtained by merging all the feature points of each PO domain into a single one.

[Fig. 4: An example of DAG transformation into an ARG and its corresponding adjacency matrix. Panels: (a) g; (b) The ARG of g; (c) The adjacency matrix of the ARG, reproduced below.]

       A    B    C    D    E
  A    0    1    1    1    1
  B   -1    0    0    1    1
  C   -1    0    0    0    0
  D   -1   -1    0    0    0
  E   -1   -1    0    0    0

4.3 Extended Depth-First Search Indexing

The ARG indexing method is designed to represent user preference profiles more effectively than the source-clustered indexing method, particularly when the number of vertices in a DAG is large. However, the encoding method of the ARG indexing may return high-dimensional feature points eventually indexed by an R-tree that suffers the curse of dimensionality. Therefore, the ARG indexing method incurs high computational costs when searching for similar user preference profiles for new queries. In this section, we introduce a method, the extended depth-first search indexing algorithm (e-DFS for short), which utilizes a modified depth-first search algorithm to preserve the characteristics of the dominance


relations in a DAG, while reducing the number of dimensions of the converted feature points. Similar to the ARG indexing method, the e-DFS indexing method adopts R-trees to store these feature points.

In this paper, we mainly focus on the design of the indexing algorithm that utilizes the e-DFS encoding method for DAGs. The details of the e-DFS method, which involves consecutive procedures, are described in the following sections. Section 4.3.1 describes the redundant DAG edge removal that simplifies a DAG before performing the e-DFS encoding method for indexing user preference profiles of the cached queries. The procedure of the redundant DAG edge removal is mainly to reduce the time complexity of e-DFS by simplifying a DAG. Subsequently, in Section 4.3.2, we propose an e-DFS encoding method which traces the vertices in a DAG and obtains a pair of depth-first search traversal orders for each node. Section 4.3.3 merges the traversal orders by converting the intermediary results obtained from the e-DFS traversal orders into the final encodings.

4.3.1 Redundant DAG Edge Removal

The procedure of redundant DAG edge removal is mainly used to simplify a DAG before performing the encoding algorithm for feature point conversion. We delete the redundant edges of a DAG by scanning its transitive closures because the e-DFS encoding method is likely to trace a node more than once. Subsequently, a simplified DAG results in less time complexity when performing the e-DFS encoding method, yet still preserves the original dominance relations in the user preference profile. For example, in Figs. 5(a) and (b), we can delete the edge from A to D in $g$, because there is one edge from A to B and another from B to D. The original relations of the user preference profile $g$ do not change after deleting the edge from A to D. However, there are no redundant edges in $g^1$. As a result, none of the edges are removed from $g^1$.

[Fig. 5: An example of a redundant DAG edge removal and a non-redundant DAG. Panels: (a) g; (b) g1.]

First, we utilize Warshall's algorithm to obtain the path between two vertices so that redundant edges in the graph can be inferred. The transitive closures are established to represent the reachability of the graph. Given a user preference profile DAG $g = (V, E)$ and a length $r = 1$, the transitive closure of $g$ is $A_g^1$ (i.e., the adjacency matrix $A_g$). The non-diagonal entry $a_{ij}$ of $A_g^1$ is 1 if there is one edge from node $v_i$ to node $v_j$ ($i \ne j$); otherwise $a_{ij}$ is 0. Likewise, when the length $r = k$, $a_{ij}$ in $A_g^k$ is 1 if there is one path from node $v_i$ to node $v_j$ through exactly $k$ edges; otherwise $a_{ij}$ is 0. Therefore, the number of paths from node $v_i$ to node $v_j$ in $g$ with length $r$ can be inferred by checking $a_{ij}$ in $A_g^r$. The property used for redundant DAG edge removal is described as follows.

Property 1. (Redundant edge detection): The redundant edges of $g$ can be detected by checking each value of $a_{ij}$ in $A_g^1 + A_g^k$, where $k = 2$ to $h$, and $h$ is the total number of nodes in $g$. If $a_{ij} > 1$, there is at least one redundant edge in $g$.

Figs. 6(a)-(c) show the transitive closures of $g$ in Fig. 5 with a length of $r = 1$ and 2. The $a_{14}$ of transitive closure $A_g^1$ is 1 (i.e., one direct edge from A to D), because there is a path with $r = 1$, and the $a_{14}$ of transitive closure $A_g^2$ is 1 (see Fig. 6(b)), because there is a path from A to D through exactly two edges. We can detect a redundant edge by computing the sum of $A_g^1$ and $A_g^2$ as shown in Fig. 6(c). As a result, because the $a_{14}$ is 2, there is a path with $r = 1$ and a path with $r = 2$, respectively. Therefore, we can find a redundant edge from A to D and it should be removed from the DAG before the e-DFS encoding method is performed so as to reduce the computational cost.

[Fig. 6: The transitive closures of g in Fig. 5. $A_g^1 + A_g^2$ is computed for redundant edge detection. Panels: (a) $A_g^1$; (b) $A_g^2$; (c) $A_g^1 + A_g^2$.]

4.3.2 Extended Depth-First Search Encoding

The objective of the encoding algorithm is to convert a DAG into a low-dimensional feature point which is then indexed by an R-tree. For this purpose, we obtain an interval label for each node in the DAG using the e-DFS encoding method to capture the DAG structure (i.e., the dominance relations among vertices) by performing a depth-first search traversal. Unlike the traditional depth-first search algorithm (DFS), a modified version of the DFS algorithm used by the indexing algorithm may traverse a node more than once. For each node, the algorithm computes an interval, which is a pair of $DFS_{start}$ and $DFS_{end}$ as the traversal orders, where $DFS_{start}$ represents an order when visiting the node on a forward path and $DFS_{end}$ represents an order when the traversal moves backward to visit the same node on a backward path. Figs. 7(a)-(c) show the intervals for $g$, $g^1$, and $g^2$, where the dotted arc is a forward path and the dashed arc is a backward path.

To facilitate the e-DFS traversal, we maintain the number of incoming edges for each node as an indicator, which is denoted by $edgeNum[v_i]$ for a node $v_i$ in $g$. The number of incoming edges is one of the characteristics that helps to preserve the dominance relations. As a result, we propagate $edgeNum[v_i]$ of node $v_i$ to all its offspring nodes. If $edgeNum[v_i] = x$, the node $v_i$ must be traversed exactly $x$ times during the e-DFS traversal. For example, Table 3 (first line marked in red) shows the initial $edgeNum[A \cdots F]$ for each node in $g^1$ in Fig. 7(b). In this example, F is chosen as an entry node of the DAG, so $edgeNum[F]$ is initialized to 1 to enable the traversal. Since node B has two incoming edges, $edgeNum[B] = 2$ is propagated to all its offspring nodes. Therefore, $edgeNum[E]$ is also set to 2.
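The propagation of edgeNum can be sketched as follows. The text does not spell out how counts combine when a node has several ancestors that are themselves visited more than once, so the rule used here (each node receives the sum of the counts of its parents, with the entry node set to 1) is an assumption. The function name `propagate_edge_num` is hypothetical, the graph is assumed to have a single entry node from which every node is reachable, and the edge list of g1 is inferred from the trace in Table 3.

```python
from collections import deque


def propagate_edge_num(graph, entry):
    """Assumed rule: edgeNum[v] = sum of edgeNum over the parents of v, entry = 1.

    `graph` maps each node to the list of nodes it points to (a DAG).
    """
    # Count incoming edges so nodes can be processed in topological order.
    pending = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            pending[w] += 1

    edge_num = {v: 0 for v in graph}
    edge_num[entry] = 1  # the entry node is given one visit to start the traversal

    queue = deque([entry])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            edge_num[w] += edge_num[v]  # pass the parent's count on to the child
            pending[w] -= 1
            if pending[w] == 0:         # all parents of w have been processed
                queue.append(w)
    return edge_num


# Edge list of g1 as inferred from the traversal trace in Table 3 (an assumption).
g1 = {"F": ["C"], "C": ["A", "D"], "A": ["B"], "D": ["B"], "B": ["E"], "E": []}
edge_num_g1 = propagate_edge_num(g1, "F")
print(edge_num_g1)
```

For g1 this reproduces the initial row of Table 3: B and E receive 2, and all other nodes receive 1.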

TABLE 3: Traces of the e-DFS traversal for $g^1$ in Fig. 7(b). Each node column lists the current edgeNum value followed by the interval(s) of the node; "-" means that no interval has been assigned yet.

Call of DFS   Order  Node A       Node B              Node C       Node D       Node E             Node F
Initial       0      1 -          2 -                 1 -          1 -          2 -                1 -
DFS(g1, F)    1      1 -          2 -                 1 -          1 -          2 -                0 (1,null)
DFS(g1, C)    2      1 -          2 -                 0 (2,null)   1 -          2 -                0 (1,null)
DFS(g1, A)    3      0 (3,null)   2 -                 0 (2,null)   1 -          2 -                0 (1,null)
DFS(g1, B)    4      0 (3,null)   1 (4,null)          0 (2,null)   1 -          2 -                0 (1,null)
DFS(g1, E)    5      0 (3,null)   1 (4,null)          0 (2,null)   1 -          1 (5,6)            0 (1,null)
DFS(g1, B)    7      0 (3,null)   1 (4,7)             0 (2,null)   1 -          1 (5,6)            0 (1,null)
DFS(g1, A)    8      0 (3,8)      1 (4,7)             0 (2,null)   1 -          1 (5,6)            0 (1,null)
DFS(g1, D)    9      0 (3,8)      1 (4,7)             0 (2,null)   0 (9,null)   1 (5,6)            0 (1,null)
DFS(g1, B)    10     0 (3,8)      0 (4,7) (10,null)   0 (2,null)   0 (9,null)   1 (5,6)            0 (1,null)
DFS(g1, E)    11     0 (3,8)      0 (4,7) (10,null)   0 (2,null)   0 (9,null)   0 (5,6) (11,12)    0 (1,null)
DFS(g1, B)    13     0 (3,8)      0 (4,7) (10,13)     0 (2,null)   0 (9,null)   0 (5,6) (11,12)    0 (1,null)
DFS(g1, D)    14     0 (3,8)      0 (4,7) (10,13)     0 (2,null)   0 (9,14)     0 (5,6) (11,12)    0 (1,null)
DFS(g1, C)    15     0 (3,8)      0 (4,7) (10,13)     0 (2,15)     0 (9,14)     0 (5,6) (11,12)    0 (1,null)
DFS(g1, F)    16     0 (3,8)      0 (4,7) (10,13)     0 (2,15)     0 (9,14)     0 (5,6) (11,12)    0 (1,16)
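The trace in Table 3 can be reproduced with a short counter-limited depth-first search. The sketch below is one reading of the traversal, chosen because it yields the same intervals as the table: the counter of a node is decremented when the node is entered, and the search only descends into children whose counter is still positive. It is not a line-by-line transcription of Algorithm 1; the function name `e_dfs_intervals`, the edge list of g1, and the child ordering are assumptions inferred from the trace.

```python
def e_dfs_intervals(graph, entry, edge_num):
    """Return {node: [(DFS_start, DFS_end), ...]} for a counter-limited DFS.

    `graph` maps each node to its children; `edge_num` holds the number of
    times each node may be visited.
    """
    remaining = dict(edge_num)           # keep the caller's counters untouched
    intervals = {v: [] for v in graph}
    order = 0

    def visit(v):
        nonlocal order
        remaining[v] -= 1                # one allowed visit of v is used up
        order += 1
        start = order                    # order on the forward path
        for w in graph[v]:
            if remaining[w] > 0:         # descend only while w may still be visited
                visit(w)
        order += 1
        intervals[v].append((start, order))  # order on the backward path

    visit(entry)
    return intervals


# g1 and its initial counters, as read from Table 3 (edge list is an assumption).
g1 = {"F": ["C"], "C": ["A", "D"], "A": ["B"], "D": ["B"], "B": ["E"], "E": []}
edge_num = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 2, "F": 1}

intervals_g1 = e_dfs_intervals(g1, "F", edge_num)
for node in sorted(intervals_g1):
    print(node, intervals_g1[node])
```

Nodes B and E each receive two intervals because their counters start at 2, which matches the last row of the table.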

[Fig. 7: Examples of ($DFS_{start}$, $DFS_{end}$) pairs. Panels: (a) g; (b) g1; (c) g2.]

The e-DFS traversal method is outlined in Algorithm 1, which takes three parameters: a DAG $g$, an input node $v_i$ in $g$, and the $edgeNum$ set, which has been initialized based on the propagated number of incoming edges in $g$. Lines 1-2 update the $DFS_{start}$ of the current visited node $v_i$ on the forward path. Lines 3-6 recursively perform the main depth-first search procedure, only when $edgeNum[v_i] > 0$. For each time $v_i$ is visited, $edgeNum[v_i]$ is reduced by one (Line 5). Lines 7-8 update the $DFS_{end}$ of the current visited node $v_i$ on the backward path. As shown in Fig. 7(a), the $DFS_{start}$ of B is 2 and the $DFS_{end}$ is 11, when the traversal procedure moves backward to B. In Fig. 7(b), B has two intervals, because the algorithm traverses it twice.

Algorithm 1: e-DFS(g, v_i, edgeNum)
Data: g is a DAG that represents a user preference profile, and v_i is an input node in g. edgeNum is a reference to an array edgeNum[v_1 ... v_h], where each element edgeNum[v_i], v_i ∈ g, stores the propagated number of incoming edges to v_i. A global variable order is initialized to 0 and incremented by one each time.
1 order ← order + 1
2 v_i.DFS_start ← order
3 for (all adjacent edges from v_i to v_w) do
4     if (edgeNum[v_i] > 0) then
5         edgeNum[v_i] ← edgeNum[v_i] − 1
6         recursively call e-DFS(g, v_w, edgeNum)
7 order ← order + 1
8 v_i.DFS_end ← order

Table 3 demonstrates the traces of the e-DFS traversal for $g^1$ in Fig. 7(b) for each recursion call, for which we present the traversal order, the $edgeNum$ value, and the interval of each node. At Order = 1, F is visited first, $edgeNum[F]$ decreases by one, and the interval is set to (1, null). At Order = 5, E is reached, and since there exists a backward path at the node, the

TABLE 4: Examples of DFS intervals and $DFS_{diff}$ for $g$, $g^1$, and $g^2$ in Figs. 7(a)-(c).

      A              B               C        D        E               F        DFSdiff
g     (1,12)         (2,11)          (3,8)    (9,10)   (4,5)           (6,7)    (11, 9, 5, 1, 1, 1)
g1    (3,8)          (4,7) (10,13)   (2,15)   (9,14)   (5,6) (11,12)   (1,16)   (5, 3, 13, 5, 1, 15)
g2    (4,5) (8,9)    (3,6)           (2,11)   (7,10)   (12,13)         (1,14)   (1, 3, 9, 3, 1, 13)

TABLE 5: Number of visited descendant nodes for $g^1$ in Fig. 7(b).

Node  Interval(s)      Initial descendant intervals                       Final descendant intervals                         Count
A     (3,8)            (4,7) (5,6) (10,13) (11,12)                        (4,7) (5,6)                                        2
B     (4,7) (10,13)    (5,6) (11,12)                                      (5,6) (11,12)                                      1
C     (2,15)           (3,8) (4,7) (5,6) (9,14) (10,13) (11,12)           (3,8) (4,7) (5,6) (9,14) (10,13) (11,12)           6
D     (9,14)           (4,7) (5,6) (10,13) (11,12)                        (10,13) (11,12)                                    2
E     (5,6) (11,12)    -                                                  -                                                  0
F     (1,16)           (2,15) (3,8) (4,7) (5,6) (9,14) (10,13) (11,12)    (2,15) (3,8) (4,7) (5,6) (9,14) (10,13) (11,12)    7
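Tables 4 and 5 are linked by Equation 2, which states that $DFS_{diff}$ equals twice the number of visited descendants plus one. The following sketch checks this relation for g1 by counting, for each node, the completed intervals that are strictly nested inside one of its own intervals. The function names are hypothetical, and using only the first interval of a node that is visited several times is an assumption that follows the "one visit to B" convention described for Table 5.

```python
# Intervals of g1 as listed in Table 4.
INTERVALS_G1 = {
    "A": [(3, 8)],
    "B": [(4, 7), (10, 13)],
    "C": [(2, 15)],
    "D": [(9, 14)],
    "E": [(5, 6), (11, 12)],
    "F": [(1, 16)],
}
NODES = "ABCDEF"


def dfs_diff(intervals, node):
    """DFS_end - DFS_start for one visit of the node."""
    start, end = intervals[node][0]
    return end - start


def visited_descendants(intervals, node):
    """Count completed intervals strictly nested inside one visit of the node.

    In a depth-first traversal, an interval nested inside the interval of
    `node` can only belong to a descendant of `node`.
    """
    start, end = intervals[node][0]
    return sum(
        1
        for node_intervals in intervals.values()
        for s, e in node_intervals
        if start < s and e < end
    )


diffs = tuple(dfs_diff(INTERVALS_G1, v) for v in NODES)
counts = tuple(visited_descendants(INTERVALS_G1, v) for v in NODES)
print(diffs)
print(counts)
# Equation 2 holds when every diff equals 2 * count + 1.
print(all(d == 2 * c + 1 for d, c in zip(diffs, counts)))
```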

interval is set to (5, 6). Note that Order is updated to 6 in this step. Subsequently, B is visited at Order = 7, so the interval becomes (4, 7). At Order = 8, A is visited due to the backward path; however, the recursion ends because $edgeNum[A] = 0$. As a result, the next recursion call $DFS(g^1, D)$ that reaches D is performed. The procedure proceeds until each value in the $edgeNum$ set becomes 0. Table 4 summarizes the intervals in the form of ($DFS_{start}$, $DFS_{end}$) for $g$, $g^1$, and $g^2$ in Figs. 7(a)-(c).

Subsequently, the algorithm obtains the dominance relations denoted by $DFS_{diff}$ for each node in a DAG by subtracting its corresponding $DFS_{start}$ from $DFS_{end}$. From our observation, for each node $v_i$, we can obtain the number of visited descendant vertices before its interval is completed, which holds when the following condition is satisfied: $v_i.DFS_{start} < v_j.DFS_{start}$ and $v_i.DFS_{end} > v_j.DFS_{end}$, where $v_j$ is one descendant node of $v_i$. Therefore, the values of $DFS_{diff}$ can also be computed by Equation 2 as follows.

$$v_i.DFS_{diff} = VisitedDescendant(v_i) \times 2 + 1 \qquad (2)$$

$VisitedDescendant(v_i)$ can be obtained by propagating the intervals of all its descendant nodes and counting only the number of completed intervals. Table 5 illustrates the final counts for $VisitedDescendant(v_i)$ of $g^1$ in Fig. 7(b). Notably, node B has been visited twice in total. However, we only count the completed intervals for one visit to B. The corresponding completed descendant interval for (4,7) or (10,13) is (5,6) or (11,12), respectively. Hence, $VisitedDescendant(B)$ is one. Let us continue with the example in Table 4. The $DFS_{diff}$ of $g^1$ is (5, 3, 13, 5, 1, 15), which represents the (A, B, C, D, E, F) attributes. The visited descendant nodes of E are an empty set (a terminal node) as shown in Table 5, so the $DFS_{diff}$ value of E is $0 \times 2 + 1 = 1$; the visited descendant node of B is {E}, so the $DFS_{diff}$ value of B is $1 \times 2 + 1 = 3$; and the visited descendant nodes of A are {B, E}, so the $DFS_{diff}$ value of A is $2 \times 2 + 1 = 5$. Note that the number of visited descendant nodes includes those vertices which are traced more than once.

4.3.3 Encoding generalization

Finally, in this section, we describe the final procedure to convert a $DFS_{diff}$ set into the final encoding for a DAG through a generalization process, which is conducted because the $DFS_{diff}$ value of node $v_i \in g$ may not faithfully preserve the dominance relations when the e-DFS encoding algorithm traces the descendant nodes of $v_i$ more than once. As shown in Figs. 7(a) and (c), the interval of both B in $g$ and C in $g^2$ is (2, 11). However, the number of the dominated vertices of B in $g$ (i.e., {C, D, E, F}) is larger than that of C in $g^2$ ({A, B, D}). From the observation, we can see that in $g^2$, since node A has been visited twice, the traversal order on the backward path (i.e., $DFS_{end}$) of node C turns out to be identical to that of B in $g$. Therefore, we need to generalize $DFS_{diff}$ for node C in $g^2$ to distinguish it from B in $g$. This process is termed encoding generalization; it is needed to properly measure the similarity between two DAGs, because the complexity (e.g., density or shape) of different DAGs varies. Based on Property 1, we provide Property 2 as follows:

Property 2. (Encoding generalization): The value of $DFS_{diff}$ of node $v_i$ in $g$ needs to be generalized, if there exists $a_{ij} > 1$ in $A_g^r$, where $2 \le r \le h$.

$MultiDescendant(v_i)$ represents the total number of its descendant nodes that are visited more than once based on Property 2. The final encoding for $v_i$ is given by Equation 3.

$$v_i.DFS_{encoding} = v_i.DFS_{diff} - MultiDescendant(v_i) \times 2 \qquad (3)$$

For example, Figs. 8(a)-(c) show the transitive closures of $g^1$ in Fig. 7(b) to the power of 2, 3, and 4, respectively. $a_{32}$ in $A_{g^1}^2$ is 2, because there are two paths with a length $r = 2$ from C to B. Furthermore, $a_{35}$ in $A_{g^1}^3$ is 2, because there are two paths with a length $r = 3$ from C to E. Both B and E are descendant nodes of C, so $DFS_{encoding}$ of node C in $g^1$ is set to 9 (i.e., $13 - 2 \times 2$). On the other hand, there are two paths with a length $r = 3$ from F to B and two paths with a length $r = 4$ from F to E. Both B and E are descendant nodes of F, so $DFS_{encoding}$ of node F in $g^1$ is set to 11 (i.e., $15 - 2 \times 2$). After the generalization process, the final e-DFS encodings are shown in Table 6.

Finally, a full example is given to describe our extended depth-first search indexing algorithm. Figs. 9(a)-(d) show the DAGs and their corresponding intervals of $DFS_{start}$ and $DFS_{end}$ pairs. The $DFS_{encoding}$ values of these DAGs are shown in Table 7, where each set of the encoding values is considered as a six-dimensional data point, which is then indexed by an R-tree. The distance is used to measure the closeness to the DAG of the new


query. Hence, a pair of DAGs is more similar if the distance is shorter. For example, the squared Euclidean distance between $g$ and $g^1$ is $(9-7)^2 + (1-3)^2 + (5-3)^2 + (1-1)^2 + (1-1)^2 + (12-11)^2 = 13$; the distance between $g$ and $g^2$ is $(9-7)^2 + (1-3)^2 + (5-1)^2 + (1-1)^2 + (1-5)^2 + (11-11)^2 = 40$; and the distance between $g$ and $g^3$ is $(9-1)^2 + (1-1)^2 + (5-9)^2 + (1-5)^2 + (1-3)^2 + (11-11)^2 = 100$. Since the closeness (i.e., similarity) can be measured by the Euclidean distance, we apply a top-k nearest neighbor query (a k-NN query) to access the top-k most similar DAGs to be the candidates, which are subsequently used by the similarity measure to compute the similarity scores. The result of the k-NN query of $g$ (a 3-NN query in this example) is $\{g^1, g^2, g^3\}$ as shown in Fig. 10. The order of the top k nearest neighbors is identical to the similarity order ($g^1 \succ g^2 \succ g^3$) with respect to $g$ based on the similarity measure (see the number of mappings and violations in Figs. 9(b)-(d)). Therefore, the e-DFS indexing method is able to access similar DAGs to the new query efficiently.

[Fig. 8: An example of encoding generalization for g1 in Fig. 7(b). Panels: (a) $A_{g^1}^2$; (b) $A_{g^1}^3$; (c) $A_{g^1}^4$.]

TABLE 6: The final DFS encodings of $g$, $g^1$, and $g^2$ in Figs. 7(a)-(c).

      DFSdiff (A,B,C,D,E,F)    DFSencoding (A,B,C,D,E,F)
g     (11, 9, 5, 1, 1, 1)      (11, 9, 5, 1, 1, 1)
g1    (5, 3, 13, 5, 1, 15)     (5, 3, 9, 5, 1, 11)
g2    (1, 3, 9, 3, 1, 13)      (1, 3, 7, 3, 1, 11)

[Fig. 9: An example of the e-DFS indexing algorithm. Panels: (a) g (a new query); (b) g1 (9 mappings, 1 violation); (c) g2 (8 mappings, 3 violations); (d) g3 (7 mappings, 5 violations).]

TABLE 7: A full example of e-DFS encodings of $g$, $g^1$, $g^2$, and $g^3$ in Figs. 9(a)-(d).

      DFSencoding {A,B,C,D,E,F}
g     {9, 1, 5, 1, 1, 11}
g1    {7, 3, 3, 1, 1, 12}
g2    {7, 3, 1, 1, 5, 11}
g3    {1, 1, 9, 5, 3, 11}

[Fig. 10: An illustration of the distance as similarity measure between DAGs in an R-tree.]
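The nearest-neighbor ranking in this example can be recomputed directly from the encodings in Table 7. In the sketch below a linear scan stands in for the R-tree k-NN search, which is an assumption made only to keep the example self-contained; the function names `squared_distance` and `k_nearest` are hypothetical.

```python
def squared_distance(p, q):
    """Sum of squared coordinate differences between two feature points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_nearest(query, points, k):
    """Return the names of the k points closest to `query`, nearest first.

    A linear scan is used here in place of an R-tree k-NN search.
    """
    return sorted(points, key=lambda name: squared_distance(query, points[name]))[:k]


# e-DFS encodings (A, B, C, D, E, F) taken from Table 7.
query_g = (9, 1, 5, 1, 1, 11)
cached = {
    "g1": (7, 3, 3, 1, 1, 12),
    "g2": (7, 3, 1, 1, 5, 11),
    "g3": (1, 1, 9, 5, 3, 11),
}

distances = {name: squared_distance(query_g, p) for name, p in cached.items()}
neighbors = k_nearest(query_g, cached, k=3)
print(distances)
print(neighbors)
```

Because squaring is monotonic for non-negative values, ranking by the squared distance gives the same order as ranking by the Euclidean distance itself.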

4.4 Space and Time Complexity Analysis

In this section, we compare the time and space complexities of the three indexing methods. Firstly, the time complexity of the source-clustered index for data access is $O(h^2 \times n \times |P|)$, where $h$ is the total number of attributes, $h^2$ is the total number of dimensions to represent a DAG, $n$ is the number of PO domains, and $|P|$ is the total number of user preference profiles. The worst-case running time is incurred when considering a set of highly skewed user preference profiles (e.g., most of the DAGs use the same node as the indexing key), because a sequential scan has to be performed on the entire data set with size of $|P|$. However, the best-case running time is $\Omega(h^2 \times n \times |P|/h)$ for a uniformly distributed set of user preference profiles, where $|P|/h$ is the average number of user preference profiles per indexing key. The space complexity is $O(h^2 \times n \times |P|)$. Secondly, the time complexity of the ARG index for data access is $O(h^2 \times n \times \log |P|)$, because the feature points for the PO domains are indexed by an R-tree. The space complexity of the ARG indexing is $O(h^2 \times n \times |P|)$, because a feature point has $(h^2 - h)/2$ dimensions per PO domain, giving $n \times (h^2 - h)/2$ dimensions over all PO domains. Lastly, the time complexity of the e-DFS index for data access is $O(h \times n \times \log |P|)$, because each feature point, which has only $h$ dimensions per PO domain, is indexed by one R-tree, and the space complexity of the e-DFS indexing is $O(h \times n \times |P|)$, because a feature point only has $h$ dimensions per PO domain, giving $n \times h$ dimensions over all PO domains. Therefore, e-DFS achieves better performance than the source-clustered and ARG indexing methods in terms of the time and space complexity. Additionally, Table 8 shows the summary of time complexity for data access, insertion and deletion operations.
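The difference in dimensionality behind these bounds can be made concrete with a small sketch. It flattens the upper-right triangle of the ARG adjacency matrix of Fig. 4(c) row by row, as described in Section 4.2, and compares the resulting number of dimensions with the $h$ dimensions of an e-DFS feature point. The function names are hypothetical, and the value h = 18 is used only because it is the maximum number of DAG nodes mentioned in the experiments.

```python
def arg_feature_point(matrix):
    """Flatten the upper-right triangle of an ARG adjacency matrix row by row."""
    h = len(matrix)
    return tuple(matrix[i][j] for i in range(h) for j in range(i + 1, h))


def arg_dimensions(h, n=1):
    """Feature-point dimensions of the ARG encoding for n PO domains."""
    return n * (h * h - h) // 2


def edfs_dimensions(h, n=1):
    """Feature-point dimensions of the e-DFS encoding for n PO domains."""
    return n * h


# Adjacency matrix of the ARG in Fig. 4(c), rows and columns ordered A..E.
A_g = [
    [0, 1, 1, 1, 1],
    [-1, 0, 0, 1, 1],
    [-1, 0, 0, 0, 0],
    [-1, -1, 0, 0, 0],
    [-1, -1, 0, 0, 0],
]

point = arg_feature_point(A_g)
print(point, len(point))
print(arg_dimensions(18), edfs_dimensions(18))
```

The gap grows quadratically with the number of DAG nodes, which is the regime where R-tree search is known to degrade.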


TABLE 8: Summary of time complexity. 5.1 Effect of Cache Threshold and Cache Size
Operation \ We first evaluate the average query processing time by varying
Data Access Insertion/Deletion
Indexing Methods the cache threshold size for the cache-based query processor in
Source-clustered    O(h^2 × n × |P|)        O(1)
ARG                 O(h^2 × n × log |P|)    O(h^2 × n × log |P|)
e-DFS               O(h × n × log |P|)      O(h × n × log |P|)

our system. The cache threshold size is used to avoid seeking a large result set of a cached query to retrieve the result for a new query. Therefore, an answered query is cached only if the size of its result set is less than a specified threshold. The choice of the threshold size is important for the performance of our system. If the threshold is too small, more queries with small result sets are cached. The queries with small result sets often have complex user preference profiles. Therefore, the performance is degraded. If the threshold is too high, the queries with large result sets are cached. Therefore, more cache space is occupied, and the cache is more likely to be full. Fig. 11(a) shows the average query processing time for threshold sizes ranging from 0.005% to 0.8% of the entire data set given 10k queries. When the threshold size is set to 0.005% or lower, the performance of the system using the SC, ARG and e-DFS indexing methods is degraded in terms of the average query processing time. A threshold size greater than 0.01% provides stable performance. The performance of e-DFS is the best among those methods, and the performance of ARG is better than that of SC. Next, the size of the cache is important in terms of its overall impact on improving the performance of the cache-based query processor. If the cache size is too small, the system accesses the disk frequently because fewer useful queries are cached. However, if the cache size is too large, the indexing methods have to process a large set of cached queries. Specifically, many similarity measures must be executed to retrieve the candidate data set. Fig. 11(b) shows the result of different indexing methods by varying the cache size from 2k to 10k. The performance of e-DFS is stable and better than that of SC and ARG over all different cache sizes.

5 EXPERIMENTS
We use both synthetic and real-world data sets for evaluating the performance of the access methods: the source-clustered indexing (SC for short), the ARG indexing (ARG for short), and the extended depth-first search indexing (e-DFS for short) methods under the framework using the CSS algorithm [34], which is a cache-based approach to the skyline query computations with partially ordered domains. In addition, TSS [24], which is a non-cache-based approach, is compared and this approach may access the entire data set when a new skyline query is performed. We use a spatial index library [35] (Libspatialindex) for the R-tree index and all the methods use a disk-based R-tree to index the data set. Our synthetic data set for a totally ordered domain is in the range of [0, 1000), and we generate up to 1,024,000 normally distributed data points. For partially ordered domains, we generate a PO value of 6 to 18 for each PO dimension, where 18 is the maximum number of attributes (i.e., number of DAG nodes) for a user preference profile in the system. Our query set contains 100 and 10,000 queries, and 100 queries are used for the TSS approach, which can only handle a small query set, because the performance degrades significantly as the query size increases. For the real-world data set, we use a data set called Household downloaded from http://www.ipums.org in our evaluation. Household has 127,931 tuples, and each tuple contains six household costs including gas, electricity, water, heating, insurance, and property tax as the TO domains. However, we simulate the PO
domains which are not provided in the Household data set by considering the actual criteria collected from a housing website. Furthermore, we follow the definition in Wikipedia’s List of house types page [36] to set the number of DAG nodes as four to represent the four main house types: detached single-unit housing, semi-detached dwellings, attached single-unit housing, and attached multi-unit housing. The second real-world data set downloaded from https://github.com/sean-chester/SkyBench contains 17,264 NBA player records, each of which consists of eight dimensions (i.e., eight TO domains). Similarly, we generate a PO value of 5 for each of five PO domains.

In summary, we first show the experimental results in Sections 5.1–5.6 using the synthetic data sets. Section 5.7 presents the performance evaluation using the real-world data set. Table 9 shows a brief summary of the default parameter settings used in the following simulations.

TABLE 9: Simulation parameters.
Parameter               Default        Range
Data cardinality        128k           64k, 128k, 256k, 512k, 1024k
Query cardinality       100 and 10k    1k, 5k, 10k, 15k, 20k
Number of TO domains    2              2, 3, 4
Number of PO domains    2              2, 3, 4, 5, 6
Cache threshold         0.8%           0.005%, 0.01%, 0.05%, 0.1%, 0.2%, 0.4%, 0.8%
Cache size              4k             2k, 4k, 6k, 8k, 10k
Number of DAG nodes     12             6, 9, 12, 15, 18

[Figure 11: panels (a) Threshold and (b) Cache size; x-axes: Threshold (%) and Cache Size (k); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS.]
Fig. 11: Effect of cache threshold and cache size on the average query processing for the cache-based query processor in our system.

5.2 Effect of Data Cardinality
From this section, we start to compare our system performance using the SC, ARG, and e-DFS indexing methods with the TSS algorithm. Figs. 12(a) and (b) show the average query processing time as a function of the number of data points over a small (100 queries) and a large query set (10k queries), respectively. Overall, the query processing time increases when the number of data points increases. In Fig. 12(a), the average query processing time of our system using the SC, ARG and e-DFS indexing methods are competitive and reach a significant reduction in terms of the average query processing time compared with TSS. Because the TSS approach may access the entire data set when a new query is performed, the average query processing time is significantly higher. However, in Fig. 12(b), the performance of e-DFS is more efficient than that of SC and ARG in handling a large query set. From the results, we can see that the e-DFS is a scalable indexing method which achieves 41% and 33% reduction on the query processing time of the SC and ARG methods on average, respectively.
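The synthetic workload behind this experiment can be approximated with a short script. The following is a minimal Python sketch, not the authors' generator: the text states only that the points are normally distributed with TO values in [0, 1000), so the mean of 500, the standard deviation of 150, the resampling of out-of-range values, and the function name `generate_to_points` are all assumptions made for illustration.

```python
import random

def generate_to_points(cardinality, num_to_domains=2, mean=500.0, std=150.0, seed=0):
    """Generate normally distributed points with every TO value in [0, 1000).

    mean and std are assumed values; the source fixes only the value range
    and the distribution family.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(cardinality):
        point = []
        for _ in range(num_to_domains):
            value = rng.gauss(mean, std)
            # Resample until the value falls inside the half-open range [0, 1000).
            while not (0.0 <= value < 1000.0):
                value = rng.gauss(mean, std)
            point.append(value)
        points.append(tuple(point))
    return points

# Data cardinalities from Table 9, scaled down by 1000 to keep the demo quick.
for k in (64, 128, 256, 512, 1024):
    sample = generate_to_points(k)
    print(k, len(sample))
```

Fixing the seed keeps runs reproducible, which matters when the same data set is reused across the SC, ARG, e-DFS, and TSS comparisons.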
[Figure 12: panels (a) Small query set and (b) Large query set; x-axis: Data Cardinality (64k to 1024k); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS, TSS in (a) and SC, ARG, e-DFS in (b).]
Fig. 12: Effect of the size of data cardinality on the average query processing time for SC, ARG, e-DFS, and TSS.

5.3 Effect of Query Cardinality
We report the impact of the query cardinality on the performance of the system using the three indexing methods. Fig. 13(a) shows the average query processing time over a small query set (100 queries) and Fig. 13(b) shows the average query processing time for query cardinality values ranging from 1k to 20k. In Fig. 13(a), SC, ARG and e-DFS outperform TSS dramatically, while the performances of SC, ARG and e-DFS are very competitive when using a small query set. However, in Fig. 13(b), the performances of the system using the three indexing methods are close to each other when the number of queries is 1k. As the number of queries increases, e-DFS achieves significant improvement in the average query processing time compared with SC and ARG, because e-DFS reduces the complexity of each DAG and converts the DAG to a low-dimensional feature point. SC has the worst performance because its simple design of the indexing structure may cause load imbalance when most of the DAG keys are identical. Furthermore, we can see that the average query processing time for the large query set is much lower than that for the small query set for all dimension settings, because more cached queries can be utilized so as to reduce the query processing time.

[Figure 13: panels (a) Small query set and (b) Large query set; x-axis: Query Cardinality (100 queries in (a), 1k to 20k in (b)); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS, TSS in (a) and SC, ARG, e-DFS in (b).]
Fig. 13: Effect of query cardinality on the average query processing time for SC, ARG, e-DFS, and TSS. When the size of query cardinality is small (100 queries), the three indexing methods result in similar performance, yet still greatly outperform TSS. When the size of query cardinality is large, the performance gap is significant as the number of queries increases.

5.4 Effect of the Complexity of User Preference Profiles
In this experiment, we investigate the effect of the number of DAG nodes associated with PO domains. In Figs. 14(a) and (b), we vary the number of DAG nodes from 6 to 18, because a user preference is likely more complex if the number of DAG nodes increases. The SC and ARG indexing methods suffer performance degradation as the number of DAG nodes increases. In Fig. 14(a), the TSS approach has the worst performance among the four methods. The performances of SC, ARG, and e-DFS are very competitive when the query set is small. However, in Fig. 14(b), the average query processing time of SC and ARG increases greatly because the SC indexing method fails to filter out irrelevant cached queries when the user preference profiles are complex and the ARG indexing method has n × (h^2 − h)/2 dimensions of all PO domains. On the other hand, e-DFS maintains stable performance and outperforms SC and ARG, because the e-DFS only has n × h dimensions of all PO domains. In addition, e-DFS preserves most of the dominance relations for the similarity measure. Therefore, e-DFS is able to select better candidates for the query processor so as to reduce the average query processing time.

[Figure 14: panels (a) Small query set and (b) Large query set; x-axis: Number of DAG Nodes (6 to 18); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS, TSS in (a) and SC, ARG, e-DFS in (b).]
Fig. 14: Effect of the complexity of user preference profiles. As the number of DAG nodes increases, the complexity of a DAG is very likely to be high. Consequently, the efficiency and accuracy of the closeness measures for accessing the similar user preference profiles are affected.

5.5 Effect of Dimensionality
We report the impact of dimensionality on the performance in this section. Figs. 15(a) and (b) show the average query processing time by varying the TO and PO dimensions in pairs of (size of TO, size of PO), ranging from 2 to 4 for the TO domains and 2 to 3 for the PO domains over a small and a large query set, respectively. Fig. 15(a) shows the performance for the small query set, where we can see that TSS has the worst performance, because TSS may access the entire data set to retrieve the query answer for each query, while our system using the SC, ARG, and e-DFS indexing methods achieves better results. Fig. 15(b) shows the average query processing time of our system using the SC, ARG, and e-DFS for all dimensions over a large query set. When the dimensionality increases, the performance of all indexing methods is degraded because the R-trees fail to filter out irrelevant cached queries in the higher dimension.
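To make the dimensionality argument of Section 5.4 concrete, the following Python sketch evaluates the two feature-point sizes quoted there, n × (h^2 − h)/2 for ARG and n × h for e-DFS. It assumes that n is the number of PO domains and h is the number of DAG nodes per domain, which is how the symbols appear to be used in this section; the function names are hypothetical and this is only an illustration of the arithmetic, not part of either indexing method.

```python
def arg_dimensions(n, h):
    """Feature-point dimensions quoted for ARG: n * (h^2 - h) / 2."""
    # h^2 - h = h * (h - 1) is always even, so integer division is exact.
    return n * (h * h - h) // 2

def edfs_dimensions(n, h):
    """Feature-point dimensions quoted for e-DFS: n * h."""
    return n * h

n = 2  # default number of PO domains in Table 9 (assumed meaning of n)
for h in (6, 9, 12, 15, 18):  # DAG node counts varied in Section 5.4
    print(h, arg_dimensions(n, h), edfs_dimensions(n, h))
```

With the default setting of n = 2 and h = 12, this gives 132 dimensions for ARG against 24 for e-DFS. The quadratic growth in h is in line with the observation that R-trees filter less effectively as the dimensionality rises.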


[Figure 15: panels (a) Small query set and (b) Large query set; x-axis: Dimensionality (|TO|, |PO|); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS, TSS in (a) and SC, ARG, e-DFS in (b).]
Fig. 15: Effect of TO and PO dimensionality. We vary the dimensionality settings for the TO and PO domains. For the large query set, the performance degrades significantly as the number of dimensions increases.

[Figure 16: panels (a) Similarity scores and (b) Hit ratios; x-axes: Dimensionality (|TO|, |PO|) and Query Cardinality; y-axes: Similarity Score and Hit Ratio (%); series: SC, ARG, e-DFS.]
Fig. 16: The indexing quality evaluated based on the similarity scores and hit ratios.

5.6 Indexing Degradation
In this section, we analyze the indexing quality of these three indexing methods. Our system uses a similarity function to score the cached queries depending on their similarity with respect to a new query. The similarity function returns a negative similarity score if there are violation relations in the user preference profiles between the new query and the cached query; otherwise, the similarity score is positive. A higher score means the level of similarity is higher. The cached query with the highest similarity score is used as the representative in our comparison regarding the indexing degradation, and we then compare the similarity scores to measure the quality of the selected queries accessed by these three indexing methods. The similarity scores of the selected queries from the cache using the source-clustered indexing method can be treated as the optimal scores because a significant number of the relevant as well as the irrelevant cached queries are selected due to its simple design. The design of the ARG and e-DFS methods may not preserve all the dominance relations, depending on the number of dimensions to represent a feature point; as a result, the most relevant cached query might be filtered out. Figs. 16(a) and (b) show the actual similarity scores of the selected queries by varying the number of PO and TO domains and the hit ratios by varying the query cardinality, respectively. In Fig. 16(a), the similarity scores of the three methods are degraded, because it is difficult to find similar preference profiles when the dimensionality increases. The similarity scores of the ARG method are closer to those of the SC method than the scores of our e-DFS method are, because e-DFS reduces the number of dimensions of a DAG to represent a low-dimensional feature point for efficient access. Nevertheless, the difference in the similarity scores is minor, but e-DFS achieves significant reductions in the query processing time due to its efficient encodings. In Fig. 16(b), we obtain the hit ratios of the total number of mapped edges to the total number of edges of the selected cached queries for the three methods, because the mapped edges (or relations) are useful for the query processor for retrieving the query answers. As expected, we can see that SC achieves the highest hit ratio. In particular, although the hit ratio of the e-DFS is lower than that of SC, the cost of accessing the candidate queries from the cache is much lower than that of the SC approach. As a result, the overall query processing time is efficient as well.

5.7 Real-world Data Set Evaluation
In this section, we evaluate the SC, ARG, e-DFS, and TSS algorithms with two real-world data sets of US households (127,931 tuples) and NBA players (17,264 tuples). In Fig. 17(a), SC, ARG and e-DFS have very close performance and all these three algorithms outperform TSS, when the size of the query cardinality is small (100 queries) for both data sets. However, in Fig. 17(b), when the size of query cardinality is 10k, e-DFS still outperforms SC and ARG. The performance improvement is not significant because e-DFS is most efficient when the number of DAG nodes is large. Nevertheless, the experimental results using the real-world data sets are consistent with those using the synthetic data sets. Therefore, we can conclude that e-DFS is efficient in practice.

[Figure 17: panels (a) Small query set and (b) Large query set; x-axis: data set (NBA, Household) at Query Cardinality of 100 queries in (a) and 10k queries in (b); y-axis: Ave. Query Processing Time (ms); series: SC, ARG, e-DFS, TSS in (a) and SC, ARG, e-DFS in (b).]
Fig. 17: Effect of query cardinality using the real-world data set Household for SC, ARG, e-DFS, and TSS.

6 CONCLUSION
We propose a new indexing method called extended depth-first search indexing method (e-DFS) for user preference profiles represented by DAGs to facilitate the access to the cached queries for efficient skyline query computation with partially ordered domains. The computation time of processing a new query is significantly reduced, because the query results are retrieved from the results of cached queries with compatible user preferences, which must be accessed through an efficient access method to select a set of relevant cached queries for query processing. In this work, we emphasize the design of the e-DFS indexing method that effectively encodes a user preference profile to a low-dimensional feature point. We first conduct a redundant DAG edge removal to reduce the complexity of the DAG itself, and subsequently, the e-DFS encoding is performed to obtain one or more intervals for each node in the DAG by traversing it through a modified version of a depth-first search, which is utilized to examine the topology structure and dominance relations. Finally, the encoding generation step enhances the e-DFS encoding method for the nodes whose intervals do not properly preserve the dominance relations. The converted, low-dimensional feature points are stored and indexed by an R-tree. When a new query is given, the system


performs a top-k NN query to select a set of similar cached queries for the new queries during the query processing. Therefore, the closeness or similarity can be measured with the consideration of the dominance relations in our access method. Our experimental evaluation demonstrates that the e-DFS indexing method outperforms the source-clustered and ARG indexing methods and confirms the utility of our novel approach.

REFERENCES
[1] Y.-L. Hsueh, R. Zimmermann, and W.-S. Ku, “Caching support for skyline query processing with partially-ordered domains,” in Proceedings of the 20th International Conference on Advances in Geographic Information Systems. ACM, 2012, pp. 386–389.
[2] S. Borzsony, D. Kossmann, and K. Stocker, “The skyline operator,” in Proceedings of the 17th International Conference on Data Engineering. IEEE, 2001, pp. 421–430.
[3] K.-L. Tan, P.-K. Eng, B. C. Ooi et al., “Efficient progressive skyline computation,” in Proceedings of the 27th International Conference on Very Large Data Bases, vol. 1. Morgan Kaufmann Publishers Inc., 2001, pp. 301–310.
[4] D. Kossmann, F. Ramsak, and S. Rost, “Shooting stars in the sky: An online algorithm for skyline queries,” in Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment, 2002, pp. 275–286.
[5] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “An optimal and progressive algorithm for skyline queries,” in Proceedings of the International Conference on Management of Data. ACM, 2003, pp. 467–478.
[6] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Transactions on Database Systems, vol. 30, no. 1, pp. 41–82, 2005.
[7] X. Lin, Y. Yuan, W. Wang, and H. Lu, “Stabbing the sky: efficient skyline computation over sliding windows,” in Proceedings of the 21st International Conference on Data Engineering. IEEE, 2005, pp. 502–513.
[8] M. D. Morse, J. M. Patel, and W. I. Grosky, “Efficient Continuous Skyline Computation,” in Proceedings of the 22nd International Conference on Data Engineering. IEEE, 2006, p. 108.
[9] M. Sharifzadeh and C. Shahabi, “The spatial skyline queries,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 751–762.
[10] Z. Huang, H. Lu, B. C. Ooi, and A. K. H. Tung, “Continuous Skyline Queries for Moving Objects,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 12, pp. 1645–1658, 2006.
[11] V. Gaede and O. Günther, “Multidimensional access methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170–231, 1998.
[12] K. C. Lee, B. Zheng, H. Li, and W.-C. Lee, “Approaching the skyline in z order,” in Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, 2007, pp. 279–290.
[13] Z. Huang, H. Lu, B. C. Ooi, and A. Tung, “Continuous skyline queries for moving objects,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 12, pp. 1645–1658, 2006.
[14] M. S. Islam, R. Zhou, and C. Liu, “On answering why-not questions in reverse skyline queries,” in Proceedings of the 29th International Conference on Data Engineering. IEEE, 2013, pp. 973–984.
[15] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, “Selecting stars: The k most representative skyline operator,” in Proceedings of the 23rd International Conference on Data Engineering. IEEE, 2007, pp. 86–95.
[16] M. Nagendra and K. S. Candan, “Layered processing of skyline-window-join (swj) queries using iteration-fabric,” in Proceedings of the 29th International Conference on Data Engineering. IEEE, 2013, pp. 985–996.
[17] M. Sharifzadeh and C. Shahabi, “The spatial skyline queries,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 751–762.
[18] L. Tian, L. Wang, P. Zou, Y. Jia, and A. Li, “Continuous monitoring of skyline query over highly dynamic moving objects,” in Proceedings of the 6th International Workshop on Data Engineering for Wireless and Mobile Access. ACM, 2007, pp. 59–66.
[19] A. Vlachou, C. Doulkeridis, and N. Polyzotis, “Skyline query processing over joins,” in Proceedings of the International Conference on Management of Data. ACM, 2011, pp. 73–84.
[20] P. Wu, D. Agrawal, O. Egecioglu, and A. El Abbadi, “Deltasky: Optimal maintenance of skyline deletions without exclusive dominance region generation,” in Proceedings of the 23rd International Conference on Data Engineering. IEEE, 2007, pp. 486–495.
[21] C.-Y. Chan, P.-K. Eng, and K.-L. Tan, “Stratified computation of skylines with partially-ordered domains,” in Proceedings of the International Conference on Management of Data. ACM, 2005, pp. 203–214.
[22] H. Jung, H. Han, H. Y. Yeom, and S. Kang, “A fast and progressive algorithm for skyline queries with totally-and partially-ordered domains,” Journal of Systems and Software, vol. 83, no. 3, pp. 429–445, 2010.
[23] B. Liu and C.-Y. Chan, “Zinc: Efficient indexing for skyline computation,” Proceedings of the VLDB Endowment, vol. 4, no. 3, pp. 197–207, 2010.
[24] D. Sacharidis, S. Papadopoulos, and D. Papadias, “Topologically sorted skylines for partially ordered domains,” in Proceedings of the 25th International Conference on Data Engineering. IEEE, 2009, pp. 1072–1083.
[25] R. C.-W. Wong, A. W.-C. Fu, J. Pei, Y. S. Ho, T. Wong, and Y. Liu, “Efficient skyline querying with variable user preferences on nominal attributes,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 1032–1043, 2008.
[26] S. Zhang, N. Mamoulis, D. W. Cheung, and B. Kao, “Efficient skyline evaluation over partially ordered domains,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1255–1266, 2010.
[27] S. Zhang, J. Yang, and W. Jin, “Sapper: Subgraph indexing and approximate matching in large graphs,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1185–1194, 2010.
[28] P. Zhao and J. Han, “On graph query optimization in large networks,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 340–351, 2010.
[29] E. G. Petrakis, C. Faloutsos, and K.-I. D. Lin, “Imagemap: An image indexing method based on spatial similarity,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 5, pp. 979–987, 2002.
[30] X. Wang, X. Ding, A. K. Tung, S. Ying, and H. Jin, “An efficient graph indexing method,” in Proceedings of the 28th International Conference on Data Engineering. IEEE, 2012, pp. 210–221.
[31] X. Yan, P. S. Yu, and J. Han, “Graph indexing: a frequent structure-based approach,” in Proceedings of the 2004 International Conference on Management of Data. ACM, 2004, pp. 335–346.
[32] S. Zhang, M. Hu, and J. Yang, “Treepi: A novel graph indexing method,” in Proceedings of the IEEE 23rd International Conference on Data Engineering. IEEE, 2007, pp. 966–975.
[33] X. Yan, P. S. Yu, and J. Han, “Graph indexing based on discriminative frequent structure analysis,” ACM Transactions on Database Systems, vol. 30, no. 4, pp. 960–993, 2005.
[34] Y.-L. Hsueh and T. Hascoet, “Caching support for skyline query processing with partially ordered domains,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 11, pp. 2649–2661, 2014.
[35] “Libspatialindex,” https://libspatialindex.github.io/overview.html.
[36] “List of house types,” https://en.wikipedia.org/wiki/List_of_house_types.

Yu-Ling Hsueh has been an Associate Professor with the Department of Computer Science and Information Engineering at the National Chung Cheng University, Taiwan, since 2011. Hsueh received her M.S. and Ph.D. degrees in computer science from the University of Southern California in 2003 and 2009, respectively. Her research interests are spatio-temporal databases, mobile data management, scalable continuous query processing, and spatial data indexing.

Chia-Chun Lin received his BS degree in Electronic Engineering from the Chung Yuan Christian University in 2013, and is currently working toward the Ph.D. degree in Computer Science and Information Engineering at the National Chung Cheng University, Taiwan. His research interests include spatio-temporal databases, and multimedia data mining and retrieval.

Chia-Che Chang received his BS and MS degrees in Computer Science and Information Engineering from the Chung Shan Medical University in 2011, and from the National Chung Cheng University in 2015, respectively. His research interests include mobile data management, spatio-temporal databases, and spatial data indexing.
