Research Track Paper KDD ’19, August 4–8, 2019, Anchorage, AK, USA

Effective and Efficient Reuse of Past Travel Behavior for Route Recommendation
Lisi Chen1 , Shuo Shang2,∗ , Christian S. Jensen3 , Bin Yao4 , Zhiwei Zhang5 , Ling Shao6
2 UESTC, China, 1,2,6 Inception Institute of Artificial Intelligence, UAE, 3 Aalborg University, Denmark
4 Shanghai Jiao Tong University, China, 5 Hong Kong Baptist University
1 lisi.chen@inceptioniai.org, 2 jedi.shang@gmail.com, 3 csj@cs.aau.dk
4 yaobin@cs.sjtu.edu.cn, 5 cszwzhang@comp.hkbu.edu.hk, 6 ling.shao@inceptioniai.org

ABSTRACT
With the increasing availability of moving-object tracking data, use of this data for route search and recommendation is increasingly important. To this end, we propose a novel parallel split-and-combine approach to enable route search by locations (RSL-Psc). Given a set of routes, a set of places to visit O, and a threshold θ, we retrieve the route composed of sub-routes that (i) has similarity to O no less than θ and (ii) contains the minimum number of sub-route combinations. The resulting functionality targets a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general.

To enable efficient and effective RSL-Psc computation on massive route data, we develop novel search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we develop two parallel algorithms, Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). We divide the route split-and-combine task into $\sum_{k=0}^{M} S(|O|, k+1)$ sub-tasks, where M is the maximum number of combinations and S(·) is the Stirling number of the second kind. In each sub-task, we use network expansion and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent and are performed in parallel. Extensive experiments with real data offer insight into the performance of the algorithms, indicating that our RSL-Psc problem can generate high-quality results and that the two algorithms are capable of achieving high efficiency and scalability.

KEYWORDS
Route recommendation; Trajectory search

ACM Reference Format:
Lisi Chen, Shuo Shang, Christian S. Jensen, Bin Yao, Zhiwei Zhang, Ling Shao. 2019. Effective and Efficient Reuse of Past Travel Behavior for Route Recommendation. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330835

∗ Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08...$15.00

Figure 1: An example of the RSL-Psc problem

1 INTRODUCTION
The continued proliferation of GPS-equipped mobile devices (e.g., vehicle navigation systems and smart phones) and the proliferation of online map-based services (e.g., Bing Maps, Google Maps, and MapQuest) enable the collection and sharing of travel routes. Specialized sites, including Bikely, GPS-Way-Points, Share-My-Routes, and Microsoft Geolife [20], as well as general social network sites, including Twitter, Facebook, and Foursquare, are starting to support route sharing and search. The availability of massive route data enables novel mobile functionality, including route search by locations (RSL query [1, 12, 13]), which retrieves routes that are similar in some specific sense to a set of user-specified places (e.g., sightseeing places).

The RSL query is useful in a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general [1, 12, 13]. For example, tourists can exploit the travel histories of other tourists to improve their own travel. Others with similar interests may have visited nearby landmarks that the tourist may not know, but may be interested in; or others may have avoided a specific road because it is unpleasant, although it may seem like a good choice in terms of distance. Such experiences are captured in the routes shared by previous tourists. In addition, tourists may post their routes to attract potential ridesharing partners. The RSL query can identify such tourists with similar interests (i.e., the user-specified places are similar to the posted route) and can recommend them as ridesharing partners.

In most existing studies (e.g., [1, 12, 13]), the RSL query is defined as a top-k query. However, sometimes the quality of query results cannot be guaranteed due to insufficient data (e.g., the top-1 route is relatively far away from the user-specified places). Consider the example in Figure 1, where o1, o2, o3, and o4 are query locations (user-specified places) and τ1, τ2, and τ3 are routes. Compared to τ1
and τ3, route τ2 is spatially close to the query locations so it is returned as the top-1 result. However, this result is of low quality (i.e., relatively far away from the query locations), and it is not a useful result in real applications like travel planning and recommendation and ridesharing.

This motivates us to study a novel split-and-combine approach to solving the RSL problem. Given a set of routes, a set of user-specified places O, and a threshold θ, we retrieve the route τ that consists of several sub-routes and satisfies two conditions: (1) τ has similarity to O no less than θ, and (2) τ contains the minimum number of sub-routes (the minimum number of transfers in ridesharing). Consider the example in Figure 1, where routes τ1 and τ3 are split at o2 into sub-routes τ1−1 (from τ1.start to o2), τ1−2 (from o2 to τ1.end), τ3−1 (from τ3.start to o2), and τ3−2 (from o2 to τ3.end). Here, we combine sub-routes τ3−1 and τ1−2 to make up a new route τ = ⟨τ3−1, τ1−2⟩. Compared to the original routes τ1, τ2, and τ3, τ matches query locations o1, o2, o3, and o4 well while combining only two sub-routes.

The RSL-Psc problem is applied in spatial networks because in many practical scenarios, objects (e.g., commuters and vehicles) move in spatial networks [9, 11, 12] rather than in Euclidean space. In spatial networks, the most relevant distance notion when quantifying the distance between two objects is network distance; Euclidean distance may lead to errors. We adopt aggregate-distance matching (i.e., the sum of distances between query locations and routes) [1, 9, 10, 12] to match routes and query locations.

The RSL-Psc problem is challenging due to its high computational complexity. There exist $\sum_{k=0}^{M} S(|O|, k+1)$ possibilities when partitioning the set O of query locations, where M is the maximum number of combinations (e.g., the number of transfers a tourist tolerates) and S(·) is the Stirling number of the second kind. The computations in different partitionings are independent of each other, so they can occur in parallel. We propose two parallel solutions to the RSL-Psc problem.

Fully-Split Parallel Search: In Fully-Split Parallel Search (FSPS), we first use network expansion [3] to explore the spatial network from each query location o ∈ O and retrieve the route candidates that are spatially close to o. We define a distance lower bound and a similarity upper bound to prune the search space. Then we partition set O into (k + 1) subsets, where k ∈ [0, M], and we refine the value of k from the minimum to the maximum (since we retrieve the routes with the minimum number of combinations, once we find a qualified route, it is unnecessary to consider larger values of k). For each possible subset Oi ⊆ O, we select the intersection of the route candidate sets of the corresponding query locations and generate the route candidate set at the subset level (i.e., the route candidate set of Oi). Next, we combine and evaluate candidate sets associated with every query location subset in each partitioning to obtain the query result. The computations in each subset and in each partitioning occur in parallel.

The advantage of the FSPS algorithm is that it only needs to conduct the route search once, after which it can reuse the search results for route combination. Its limitation lies in the tightness of its upper and lower bounds (it uses a single distance to prune an aggregate distance). As a result, each query location must maintain a large candidate set C. When combining route candidates (two steps: locations to subsets, and subsets to partitionings), a larger |C| causes a very high (exponential) number of combination possibilities.

Group-Split Parallel Search: To further improve the efficiency of RSL-Psc processing, we propose the Group-Split Parallel Search (GSPS) algorithm that adopts a divide-and-conquer strategy. In the case of GSPS, we partition the set of query locations into (k + 1) subsets (O = O1 ∪ ... ∪ Ok+1), where k ∈ [0, M], and we refine the value of k from the minimum to the maximum. For each value of k, we have S(|O|, k + 1) possible partitionings. For each subset Oi (i ∈ [1, k + 1]), we search the route candidates that are spatially close to Oi. Upper and lower bounds on the aggregate distance are defined in order to prune the search space. The route search in each subset is performed in parallel. Then we combine and evaluate the route candidates of the location subsets of the same partitioning, again performing the computation for each partitioning in parallel. Compared to FSPS, GSPS achieves tighter candidate sets and avoids the combination from locations to subsets.

Our contributions can be summarized as follows. First, we propose a novel parallel split-and-combine approach to tackling the problem of route search by locations (RSL-Psc) efficiently and effectively, thus targeting applications such as route planning and recommendation, ridesharing, and location-based services in general. Second, we develop two efficient algorithms, Fully-Split Parallel Search (FSPS) (Section 3) and Group-Split Parallel Search (GSPS) (Section 4), to process the RSL-Psc query efficiently. Third, we conduct extensive experiments on large real route data sets to study the performance of the algorithms (Section 5). Our experimental results show that the RSL-Psc query is much more likely to return a valid result than the RSL query without route combination.

2 PRELIMINARIES

2.1 Spatial Networks and Routes
A spatial network is modeled as a connected, undirected graph G = (V, E, F, W), where V is a vertex set and E ⊆ {{vi, vj} | vi, vj ∈ V ∧ vi ≠ vj} is an edge set. A vertex vi ∈ V represents a road intersection or an end of a road, and an edge ek = {vi, vj} ∈ E represents a road segment that enables travel between vertices vi and vj. Function F : V ∪ E → Geometries maps a vertex to the point location of the corresponding road intersection and maps an edge to a polyline representing the corresponding road segment. Function W : E → R assigns a real-valued weight W(e) to an edge e that represents the corresponding road segment’s length.

The shortest path between two vertices vi and vj is a sequence of edges linking vi and vj such that the sum of their edge weights is minimal. Such a path is denoted by SP(vi, vj), and its length is denoted by sd(vi, vj). Euclidean-space based spatial indices (e.g., the R-tree [6]) and accompanying techniques are relatively ineffective in network environments due to loose bounds. For simplicity, we assume that the data points considered (e.g., route sample points and query locations) are located on vertices.

Definition 1: (Route) A route τ is a finite sequence ⟨p1, p2, ..., pn⟩ that consists of at least 2 vertices, where pi and pi+1 (i ∈ [1, n − 1]) are adjacent vertices in V.
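
The network model, the shortest network distance sd(vi, vj), and the route condition of Definition 1 can be illustrated with a small sketch. The Python code below is only an illustration under stated assumptions: the adjacency-dictionary layout, the function names (`build_network`, `shortest_distance`, `is_route`), and the toy vertices and weights are hypothetical and do not come from the paper, and the geometry function F is omitted.

```python
import heapq


def build_network(edges):
    """Build an undirected adjacency dict {v: {u: W({v, u})}} from (v, u, weight) triples."""
    adj = {}
    for vi, vj, w in edges:
        adj.setdefault(vi, {})[vj] = w
        adj.setdefault(vj, {})[vi] = w
    return adj


def shortest_distance(adj, source, target):
    """Return sd(source, target) using Dijkstra's algorithm; inf if unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            return d
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry, a shorter label was already settled
        for u, w in adj.get(v, {}).items():
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return float("inf")


def is_route(adj, points):
    """Check Definition 1: at least 2 vertices, consecutive vertices adjacent."""
    return len(points) >= 2 and all(
        b in adj.get(a, {}) for a, b in zip(points, points[1:])
    )


# Hypothetical toy network: four vertices, four road segments.
network = build_network([
    ("v1", "v2", 1.0),
    ("v2", "v3", 2.0),
    ("v1", "v3", 4.0),
    ("v3", "v4", 1.5),
])
print(shortest_distance(network, "v1", "v4"))  # 4.5, via v1 -> v2 -> v3 -> v4
print(is_route(network, ["v1", "v2", "v3"]))   # True
print(is_route(network, ["v1", "v4"]))         # False, v1 and v4 are not adjacent
```

An adjacency dictionary is used here because network expansion only ever needs the neighbors of the vertex currently being settled.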

Figure 2: An example of route split-and-combine

Definition 2: (Sub-route) A sub-route of τ, denoted by τ.sub(pi, pj), is a segment of τ, where pi, pj ∈ τ and pi and pj are the start point and end point of the segment, respectively.

Definition 3: (Conflicting directions) Assume that τa.sub(pi, pj) and τb.sub(pk, pl) are two sub-routes of τa and τb, respectively. If pi and pl are the same vertex or pj and pk are the same vertex (i.e., one sub-route ends where the other starts), we say that they do not have conflicting directions; otherwise, they have conflicting directions.

If two routes τ1 and τ2 have an intersection point (i.e., τ1 ∩ τ2 ≠ ∅) that is neither the common start point of τ1 and τ2 nor the common end point of τ1 and τ2, they can be split and combined from that intersection. An example of route split-and-combine is shown in Figure 2, where τ1 = ⟨p1, p2, p3, p8, p9⟩ and τ2 = ⟨p6, p7, p3, p4, p5⟩. Specifically, p3 is their intersection. We split τ1 and τ2 at p3 into sub-routes τ1.sub(p1, p3) = ⟨p1, p2, p3⟩ and τ1.sub(p3, p9) = ⟨p3, p8, p9⟩, and τ2.sub(p6, p3) = ⟨p6, p7, p3⟩ and τ2.sub(p3, p5) = ⟨p3, p4, p5⟩, respectively. Then we combine τ2.sub(p6, p3) and τ1.sub(p3, p9) into a new route τ = τ2.sub(p6, p3) + τ1.sub(p3, p9) = ⟨p6, p7, p3, p8, p9⟩, which consists of one combination. Notice that sub-routes τ1.sub(p1, p3) and τ2.sub(p6, p3), and τ1.sub(p3, p9) and τ2.sub(p3, p5), cannot be combined because their directions conflict.

2.2 Distance Measures
Given a query location o and a route τ, the spatial network distance d(o, τ) between them is defined as follows.

$$d(o, \tau) = \min_{p_i \in \tau} \{ sd(o, p_i) \} \qquad (1)$$

where sd(o, pi) denotes the shortest network distance between o and pi. Given a set O of query locations and a route τ, the similarity Sim(O, τ) between them is defined according to aggregate distance [1, 9]:

$$Sim(O, \tau) = \sum_{o \in O} e^{-d(o, \tau)} \qquad (2)$$

2.3 Problem Definition
We formally define the RSL-Psc problem in Definition 4.

Definition 4: (RSL-Psc Problem) Given a set O of query locations, a set T of routes, a threshold θ, and the maximum number of combinations M, RSL-Psc finds a route τr satisfying the following conditions:
(1) Sim(O, τr) ≥ θ;
(2) τr = τ1.sub(p1, p2) + τ2.sub(p2, p3) + ... + τm−1.sub(pm−1, pm) + τm.sub(pm, pm+1), where m − 1 ≤ M and τi ∈ T (1 ≤ i ≤ m);
(3) for any k < m, we cannot find a route τu such that Sim(O, τu) ≥ θ and τu = τ1.sub(p1, p2) + τ2.sub(p2, p3) + ... + τk−1.sub(pk−1, pk) + τk.sub(pk, pk+1), where τi ∈ T (1 ≤ i ≤ k).

3 FULLY-SPLIT PARALLEL SEARCH

3.1 Basic Idea
We propose the first solution to the RSL-Psc problem, the Fully-Split Parallel Search (FSPS) algorithm. Initially, we fully split the set O of query locations into |O| individual expansion centers. Then we use network expansion [3] to explore the spatial network and to retrieve route candidates that are spatially close to each expansion center. We define an upper bound on network distance to prune the search space (Section 3.2). Next, we generate partitionings of the set O. Each partitioning consists of (k + 1) disjoint subsets (k ∈ [0, M]), which are called groups. We refine the value of k from the minimum to the maximum. Since we retrieve the routes with the minimum number of combinations, once we find a qualified route, it is unnecessary to consider routes with a larger value of k. For each group in each partitioning, we combine and evaluate the candidates of the corresponding expansion centers; for each partitioning, we combine and evaluate the candidates of the corresponding subsets. The computations in each group and in each partitioning occur in parallel (Section 3.3). The FSPS algorithm is detailed in Section A.

3.2 Parallel Expansion Search
Consider the example in Figure 3, where O = {o1, o2, o3} is a set of query locations, and τ1, τ2, and τ3 are routes. Set O is fully split and each query location is used as an expansion center. Network expansion is performed from each query location oi (i ∈ [1, 3]) using Dijkstra’s algorithm [3]. The exploration space of query location oi is a region (oi, rsi), where radius rsi is the network distance from the center oi to the expansion boundary (i.e., the exploration regions of o1, o2, and o3 are depicted by the blue, red, and green regions, respectively). As Dijkstra’s algorithm always selects the vertex with the minimum distance label for expansion, if p ∈ τ is the first vertex scanned by the expansion from o, p is the vertex closest to o, i.e.,

$$d(o, \tau) = sd(o, p). \qquad (3)$$

For example, in Figure 3, d(o1, τ2) = sd(o1, p1); the shortest path between o1 and p1 is illustrated by the dashed line in the blue region. Notice that the expansions from query locations are independent of each other; thus, they occur in parallel.

Upper bound: For each expansion center oi, if a vertex p is scanned by the expansion from oi and route τ passes p, we can derive the similarity upper bound of τ as follows:

$$Sim(O, \tau) = \sum_{o_j \in O} e^{-d(o_j, \tau)} < |O| - 1 + e^{-sd(o_i, p)} = Sim(O, \tau).ub \qquad (4)$$

Here, e^{−d(oj, τ)} < 1, so we use 1 as the upper bound for all oj ∈ O \ {oi}. For each expansion center oi, if its similarity upper bound is smaller than the threshold θ, the expansion from oi terminates, and all unscanned routes are pruned safely. We place all scanned routes (e.g., in Figure 3, τ2 and τ3, passing through the red region,
are scanned by the expansion from o2) in a candidate set C(oi) for expansion center oi, and we also maintain the network distance d(oi, τ) for each candidate (it is derived during network expansion, see Equation 3).

Figure 3: An example of network expansion

Now we are ready to present the FSPS algorithm. Specifically, FSPS consists of two steps: (1) generating route candidate sets (Section 3.3); (2) combining route candidates from different query location subsets (Section 3.4).

3.3 Generating Route Candidate Sets
Before presenting the algorithm, we first define the route tuple, which will be used to record the route candidates and corresponding labels for each query location. Route tuples will be retrieved and updated when combining route candidates from different query location subsets.

Definition 5: (Route Tuple) A route tuple of route τ associated with query location oi and point p is denoted by oi(τ, p) = ⟨e, p, d⟩, which consists of three elements: an entry e (identifier) of route τ, an expansion point p in τ scanned by network expansion, and the shortest network distance d between p and oi.

Algorithm 1 presents the pseudo code for generating a route candidate set. The inputs are a set of query locations O, the route data set T, and the similarity threshold θ. The output is a route candidate tuple set of each query location (i.e., C = {C1, C2, ..., C|O|}). Note that each Ci is maintained as a priority queue and contains route tuples associated with query location oi. Specifically, the route tuples in Ci are sorted in ascending order of oi(τ).d.

We first initialize the route (candidate) tuple set associated with each query location (lines 1–2). Next, we find the route candidate tuple set of each query location oi ∈ O. Specifically, we perform a network expansion from query location oi. If an unvisited vertex exists (line 4), we retrieve the next unvisited vertex p (line 5). Then we update the upper bound of the route candidate tuple set associated with oi (i.e., ubi) to be |O| − 1 + e^{−sd(oi, p)} (Equation 4) (line 6). If ubi is no less than the similarity threshold θ, we regard all routes whose vertices contain p as candidates, so we add their route candidate tuples to Ci (lines 8–9). If the value of ubi is lower than θ, the expansion from oi terminates (lines 10–11). Having searched all query locations, we combine their results and get the result C of route candidates.

Algorithm 1: FSPSCandidates
Data: Query location set O, trajectory set T, similarity threshold θ
Result: C = {C1, C2, ..., C|O|}
 1 for each Ci in C do
 2     Ci ← ∅;
 3 for each query location oi in O do
 4     while DijkstraExpansion(oi).hasNext() do
 5         p ← DijkstraExpansion(oi).next();
 6         ubi ← |O| − 1 + e^{−sd(oi, p)};
 7         if ubi ≥ θ then
 8             for each τ s.t. p ∈ τ do
 9                 Ci.add(oi(τ, p));
10         else
11             break;
12 return C;

3.4 Combining Route Candidates
Now we have a route candidate tuple set of each query location. The next step is to combine route candidates and acquire the final route. In particular, we need to: (1) generate partitionings for query locations; (2) for each partitioning, retrieve route candidates associated with each query location group; (3) combine route candidates in each group and generate the final route. Before presenting the algorithm, we introduce the concepts of partitioning, query location group, and relevant data structures for maintaining partitionings and route candidates associated with each query location group.

3.4.1 Partitioning and query location group. We present the definition of a partitioning of query locations in Definition 6.

Definition 6: (Partitioning of Query Locations) A partitioning of query locations is denoted by P. It consists of a set of disjoint query location groups {G1, ..., Gn}. Each location group contains a subset of the query locations in O. We use P(k) to denote the set of query location partitionings that consist of k groups. In particular, we have:

$$P(k) = \{ P_i \mid |P_i| = k \} \qquad (5)$$

Recall that once we find a qualified final route, which is generated from a set of qualified route candidates from each group (i.e., group-wise route candidates) in a particular partitioning, the algorithm terminates immediately. To enhance the search efficiency, we need to find a qualified final route as early as possible. To achieve this, we use a priority queue to maintain group-wise route candidates in each query location group Gi ∈ P. In particular, it stores group-wise route tuples that are sorted in descending order of the similarity upper bound. Using this data structure, route candidates with high similarity scores, which are more likely to produce a qualified final route, are evaluated first. Group-wise route candidates are stored as group-wise route tuples (Definition 7).

Definition 7: (Group-wise route tuple) A group-wise route tuple of route τ associated with query location group G is denoted by G(τ, P) = ⟨e, P, ub⟩, which consists of three elements: an entry (identifier) e of route τ, a set of key-value pairs where the key is a query location (expansion center) in G and the value is an expansion
point of τ, and the similarity upper bound ub of τ. Specifically, each pair ⟨oi, pj⟩ ∈ P satisfies the following conditions: (1) oi ∈ G; (2) oi(τ, pj) ∈ Ci; (3) ∀(⟨o, p⟩, ⟨o′, p′⟩ ∈ P) (o ≠ o′). The value of ub is computed as follows:

$$G(\tau, P).ub = |O| - |G| + \sum_{\langle o_i, p_j \rangle \in P} e^{-d(o_i, p_j)}$$

We say that G(τ, P) is a valid group-wise route tuple if G(τ, P) satisfies Definition 7 and G(τ, P).ub ≥ θ, where θ is the similarity threshold.

3.4.2 Route combination. We proceed to present how to combine two route candidates from two different query location groups into a route candidate in the next level (i.e., a route candidate associated with the union of the two query location groups). Specifically, given group-wise route tuples G(τ, P) and G′(τ′, P′), we need to be able to determine whether we can generate a qualified group-wise route tuple of group G ∪ G′ based on G(τ, P) and G′(τ′, P′). We use τ−(p) to denote the sub-route of τ starting from the beginning of τ and terminating at point p (p is a point in τ). Likewise, τ+(p) denotes the sub-route of τ starting from point p and terminating at the end of τ.

Intuitively, two route candidates τ and τ′ can be combined into a route candidate in the next level if they have an intersection point (i.e., τ ∩ τ′ ≠ ∅) and the intersection is a “transfer point”, which lies between the expansion points of query locations in the one group and the expansion points of query locations in the other group. A next-level route candidate tuple is defined as follows.

Definition 8: (Next-level route tuple) Let G(τ, P) and G′(τ′, P′) be two group-wise route tuples of G and G′, respectively. Routes τ and τ′ intersect at point pin, and O denotes a set of query locations (G ⊆ O, G′ ⊆ O). Route τs = τ−(pin) + τ′+(pin) is a next-level route candidate for S = G ∪ G′, and S(τs, P ∪ P′) is a next-level route tuple for G and G′ if:

$$\forall (\langle o_i, p_j \rangle \in P, \langle o_i', p_j' \rangle \in P') \; (p_j \in \tau^{-}(p_{in}) \wedge p_j' \in \tau'^{+}(p_{in}))$$

Having derived a next-level route tuple S(τs, P ∪ P′), we regard its corresponding route candidate τs as a new group-wise route candidate associated with G ∪ G′. The route combination processing is performed iteratively in a bottom-up fashion until we find a qualified final route associated with group O.

Detailed algorithms and complexity analyses for generating group-wise route tuples and deriving a qualified final route by combining route candidates are presented in Appendix, Section A.

4 GROUP-SPLIT PARALLEL SEARCH

4.1 Basic Idea
The FSPS algorithm maintains a set of route candidates for each query location. Because the similarity upper bound of each route candidate in FSPS only takes one query location into consideration (cf. Equation 4), which has low pruning power, the number of route candidates associated with each query location can be large. When combining route candidates in Algorithm 2, a large |Ci| results in a very large (exponential) number of combination possibilities, which makes the combination process computationally expensive.

To improve the efficiency of RSL-Psc search, we need to decrease the number of route candidates and route candidate tuples. Hence, a more effective pruning strategy is required for the filtering of unqualified candidates. To achieve this, we propose the Group-Split Parallel Search (GSPS) algorithm. Unlike the FSPS algorithm, the GSPS algorithm does not need to maintain route candidates and tuples for each query location. Instead, we generate route candidates associated with each group directly. Thus, a similarity upper bound between a route and a query location set can be derived by computing the aggregate distances to query locations in a group rather than to a single query location. Consequently, the pruning power provided by GSPS is much larger than that provided by FSPS.

The high-level idea of the GSPS algorithm is as follows. First, we partition the set of query locations into k + 1 groups, where k ∈ [0, M], and we refine the value of k from the minimum to the maximum. For each group Gi (i ∈ [1, k + 1]), we directly generate the route candidates and route tuples that are spatially close to Gi. This involves two steps: (1) group-based network expansion (cf. Section 4.2) and (2) route candidate filtering (cf. Section B.2). The route candidate search in each group is performed in parallel. After that, we combine and evaluate the route candidates associated with location groups of the same partitioning. Note that the computation for each partitioning is also performed in parallel. Compared to FSPS, GSPS produces far fewer candidate sets, and it avoids the route candidate combination from query-location level to group level.

4.2 Group-based Network Expansion
Recall that in FSPS, network expansion is performed individually for each query location. When we parallelize the network expansion, we may only consider the minimum distance between a query location o and its nearest vertex p (i.e., sd(o, p) in Equation 3) when calculating the similarity upper bound, which is a static value and has limited pruning effect. In GSPS, we introduce group-based network expansion that performs expansion for all query locations in a group simultaneously. In addition, given a group G, instead of storing a comparably loose and static similarity upper bound for each query location, we maintain a dynamic upper bound for each query location o ∈ G that takes an aggregated distance between all query locations in G and their corresponding nearest vertices into consideration. With the group-based dynamic similarity upper bounds, we generate route candidates for G directly.

Next, we explain how to compute the group-based similarity upper bounds and how to perform route candidate pruning based on the upper bounds. Consider the example in Figure 4, where G = {o1, o2, o3} is a query location group in a partitioning P, and τ1, τ2, τ3, and τ4 are routes. In each network expansion iteration, we select one of the query locations in G as an expansion center. For each query location o, we maintain its network distance to its current expansion point, which is denoted by o.sd. The location that has the minimum o.sd value will be selected as the expansion center in the current iteration.

Table 1 presents the values of o.sd in each iteration. At the beginning (Iter 0), the values of o.sd for all o ∈ G are 0. Assume that we select o1 as the expansion center in the first iteration and p1 is the first vertex scanned by the expansion from o1. We update o1.sd
to be 1.0. Because $p_1 \in \tau_2$, we generate the label $\langle o_1, p_1, 1.0 \rangle$ for $\tau_2$ and add it to the route label hash map maintained by group G (cf. Table 2). The route label hash map is used during route candidate filtering (cf. Section B.2). In the second iteration, we select $o_2$ as the expansion center (after the first iteration, $o_2.sd = o_3.sd$, so either $o_2$ or $o_3$ can be selected as the expansion center), and $p_3$ is the first vertex scanned by the expansion from $o_2$. Thus, we set $o_2.sd = 1.2$. Likewise, since $p_3 \in \tau_1$, we generate a label $\langle o_2, p_3, 1.2 \rangle$ for $\tau_1$ and add it to the route label hash map. We continue the iterative process until we reach the similarity upper bound of G. Theorem 1 explains that the resulting pruning is safe.

Theorem 1: Given a query location set O, a similarity threshold θ, and a group of query locations G where G ⊂ O, group-based network expansion can be stopped and all unexplored routes can be safely pruned when:

$$|O| - |G| + \sum_{o \in G} e^{-o.sd} < \theta \qquad (6)$$

Proof. Let $\tau_u$ be a route that is unexplored during group-based network expansion. The Dijkstra expansion has the property that $\forall (o_i \in G)\ (d(o_i, \tau_u) \geq o_i.sd)$. As a result, $\forall (o_i \in G)\ (e^{-d(o_i, \tau_u)} \leq e^{-o_i.sd})$. Because $d(o, \tau_u)$ is non-negative, $e^{-d(o, \tau_u)}$ cannot exceed 1. Then we have $\forall (o_j \in O \setminus G)\ (e^{-d(o_j, \tau_u)} \leq 1)$. Consequently, if $|O| - |G| + \sum_{o \in G} e^{-o.sd}$ is smaller than θ, $\sum_{o \in O} e^{-d(o, \tau_u)}$ must be smaller than θ. This completes the proof. □

Detailed algorithms for group-based network expansion and route candidate filtering are presented in Sections B.1 and B.2, respectively.

[Figure 4: An example of group-based network expansion. Shortest network distances shown in the figure: sd(o1, p1) = 1.0, sd(o2, p3) = 1.2, sd(o3, p4) = 1.3, sd(o2, p2) = 1.8, sd(o1, p5) = 2.2, sd(o3, p6) = 2.7, sd(o2, p7) = 3.5.]

Table 1: Update of o.sd

| | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 |
|---|---|---|---|---|---|---|---|
| $o_1.sd$ | 1.0 | 1.0 | 1.0 | 2.2 | 2.2 | 2.2 | 2.2 |
| $o_2.sd$ | 0 | 1.2 | 1.2 | 1.2 | 1.8 | 1.8 | 3.5 |
| $o_3.sd$ | 0 | 0 | 1.3 | 1.3 | 1.3 | 2.7 | 2.7 |

Table 2: Route Label Hash Map

| Key | Value (route label set of $\langle o, p, o.sd \rangle$) |
|---|---|
| $\tau_1$ | $\langle o_2, p_3, 1.2 \rangle$, $\langle o_3, p_4, 1.3 \rangle$ |
| $\tau_2$ | $\langle o_1, p_1, 1.0 \rangle$, $\langle o_2, p_2, 1.8 \rangle$, $\langle o_3, p_6, 2.7 \rangle$ |
| $\tau_3$ | $\langle o_1, p_5, 2.2 \rangle$ |

5 EXPERIMENTAL STUDY
We report on experiments with real road networks, routes, and Points-of-Interest (POI) data sets that offer insight into the efficiency and scalability of the proposed algorithms.

5.1 Experiment Settings
5.1.1 Data sets. We use two road networks, namely the Beijing Road Network (BRN) and the New York Road Network (NRN)¹, which contain 28,342 vertices and 27,690 edges, and 95,581 vertices and 260,855 edges, respectively. The corresponding network graphs are stored and indexed by adjacency lists. In BRN, we use a real taxi trajectory data set collected by the T-drive project [19], while in NRN, we use a real taxi data set from New York¹. Each item in the New York data set contains the pick-up and drop-off locations of a taxi. We derive the shortest path from the pick-up location to the drop-off location and regard it as a route. The T-drive taxi trajectory data set contains 800K trajectories and 300K POIs (each POI has a spatial coordinate with latitude and longitude), while the New York taxi data set contains 700M routes. In NRN, we use a real POI data set that contains 19,918 POIs in New York City². For NRN, the POIs may not match the trajectory points, so we map each POI in NRN to its nearest road network vertex.

5.1.2 Query location sets. A query location set O is generated as follows: First, we plot n circular query selector regions with radius r and place each selector region at a random position in the underlying space. Next, we randomly select |O|/n POIs from each selector region. The selected POIs constitute the query location set. In the experiments, we evaluate the effect of the parameters n and r.

5.1.3 Implementations. In the experiments, the road network graphs, routes, and POIs are memory resident. All algorithms are implemented in Java and run on a cluster with 10 data nodes. Each node is equipped with two Intel Xeon Processors E5-2620 v3 (2.4GHz) and 128GB RAM. Unless stated otherwise, experimental results are averaged over 200 and 50 independent trials using different query location sets for the effectiveness (Section 5.2) and efficiency evaluations, respectively. The performance metrics are runtime and the number of route visits. The number of route visits is used as a metric because it reflects the number of data accesses. In multi-threaded executions, the total runtime is the maximum runtime among all individual threads. Trajectories in T are selected randomly from the real data sets.

We evaluate the following three methods:
• FSPS: Fully-Split Parallel Search (Section 3);
• GSE+CTF: Group-Split Parallel Search (GSPSExpansion + CTFilter) (Section 4);
• GSE: Group-Split Parallel Search without CTFilter (GSPSExpansion only) (Section 4.2).

When evaluating the number of route visits, we do not report the performance of GSE+CTF because GSE and GSE+CTF incur the same number of route visits. The parameter settings are listed in Table 3.

¹ https://publish.illinois.edu/dbwork/open-data/
² https://data.cityofnewyork.us/City-Government/Points-Of-Interest/rxuy-2muj
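The stopping condition of Theorem 1 (Equation 6) can be traced on the o.sd values of Table 1. The sketch below is illustrative only: the function names, the value |O| = 4, and the threshold θ = 1.3 are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def similarity_upper_bound(num_query_locations, group_sd):
    """Return |O| - |G| + sum over o in G of exp(-o.sd) (left side of Equation 6)."""
    return (num_query_locations - len(group_sd)
            + sum(math.exp(-sd) for sd in group_sd.values()))


def can_stop_expansion(num_query_locations, group_sd, theta):
    """Theorem 1: unexplored routes can be pruned once the bound drops below theta."""
    return similarity_upper_bound(num_query_locations, group_sd) < theta


# o.sd values after each iteration, copied from Table 1.
TABLE_1 = [
    {"o1": 1.0, "o2": 0.0, "o3": 0.0},
    {"o1": 1.0, "o2": 1.2, "o3": 0.0},
    {"o1": 1.0, "o2": 1.2, "o3": 1.3},
    {"o1": 2.2, "o2": 1.2, "o3": 1.3},
    {"o1": 2.2, "o2": 1.8, "o3": 1.3},
    {"o1": 2.2, "o2": 1.8, "o3": 2.7},
    {"o1": 2.2, "o2": 3.5, "o3": 2.7},
]

NUM_QUERY_LOCATIONS = 4  # assumed |O|, so that G = {o1, o2, o3} is a proper subset
THETA = 1.3              # assumed threshold, chosen only for illustration

bounds = [similarity_upper_bound(NUM_QUERY_LOCATIONS, sd) for sd in TABLE_1]
stop_iteration = next(
    i + 1
    for i, sd in enumerate(TABLE_1)
    if can_stop_expansion(NUM_QUERY_LOCATIONS, sd, THETA)
)

print([round(b, 3) for b in bounds])
print("expansion can stop after iteration", stop_iteration)
```

Because each iteration only increases one o.sd value, the bound never grows, which is why a route that passed the check in an early iteration may fail it later.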

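The query location set generation of Section 5.1.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: POIs are treated as planar (x, y) points with Euclidean distance, region centers are drawn uniformly from the bounding box of the POIs, a region with too few unused POIs is redrawn, and when |O| is not divisible by n the remainder is spread over the first regions. The function name and all parameters are hypothetical.

```python
import math
import random


def generate_query_set(pois, size, n_regions, radius, rng, max_tries=1000):
    """Pick `size` POIs from `n_regions` random circular selector regions."""
    xs = [x for x, _ in pois]
    ys = [y for _, y in pois]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)

    # Roughly |O|/n POIs per region; the remainder handling is an assumption.
    base, extra = divmod(size, n_regions)
    quotas = [base + (1 if i < extra else 0) for i in range(n_regions)]

    chosen = []
    for quota in quotas:
        for _ in range(max_tries):
            # Place the selector region at a random position.
            cx = rng.uniform(x_min, x_max)
            cy = rng.uniform(y_min, y_max)
            inside = [
                p for p in pois
                if math.hypot(p[0] - cx, p[1] - cy) <= radius and p not in chosen
            ]
            if len(inside) >= quota:
                chosen.extend(rng.sample(inside, quota))
                break
        else:
            raise ValueError("no selector region with enough POIs was found")
    return chosen


# Toy data: a 20 x 20 grid of POIs, |O| = 10, n = 3, r = 3.
grid_pois = [(float(x), float(y)) for x in range(20) for y in range(20)]
query_set = generate_query_set(grid_pois, size=10, n_regions=3, radius=3.0,
                               rng=random.Random(7))
print(len(query_set), "query locations selected")
```

Passing the random generator explicitly keeps trials reproducible, which matters when results are averaged over many independent query location sets.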

Table 3: Parameter Settings

| Parameter | BRN | NRN |
|---|---|---|
| Cardinality of routes \|T\| | 100,000–500,000 / default 300,000 | 2,000,000–10,000,000 / default 2,000,000 |
| Cardinality of query location set \|O\| | 8–12 / default 10 | 8–12 / default 10 |
| Similarity threshold θ | 0.4–0.8 (× \|O\|) / default 0.6 (× \|O\|) | 0.3–0.7 (× \|O\|) / default 0.5 (× \|O\|) |
| Maximum number of transfers M | 1–5 / default 3 | 1–5 / default 3 |
| Radius of query location selector region r | 1–9 km / default: detailed in experiments | 1–9 km / default: detailed in experiments |
| Number of query location selector regions n | 1–4 / default 2 | 1–4 / default 3 |
| Thread count m | 24–120 / default 24 | 24–120 / default 24 |

5.2 Result Quality Analysis

[Figure 5: Ratio of valid results. Panels: (a) BRN, (b) NRN. x-axis: similarity threshold θ (× |O|); y-axis: ratio of valid results; series: RSL (M=0) and RSL-Psc (M=1, 2, 3, 5).]

We first evaluate the ratio of queries that return valid results (i.e., qualified final routes) as we vary the similarity threshold θ. Here, RSL (M=0) in Figure 5 denotes an RSL query without route combination, and RSL-Psc (M=x) denotes an RSL-Psc query with x route combinations. All methods exhibit a decreasing trend regarding the ratio of valid results as we increase the similarity threshold. In particular, the performance of RSL decreases significantly as we increase the similarity threshold. We also find that the decreasing trend becomes less significant when the maximum number of combinations (M) becomes larger. Specifically, when we set θ to 0.8 (× |O|) in BRN, RSL has only 4 out of 200 queries that return valid results. In contrast, RSL-Psc (M=3) and RSL-Psc (M=5) have 78 and 135 out of 200 queries that return valid results, respectively. Similarly, when we set θ to 0.7 (× |O|) in NRN, RSL has 31 out of 200 queries that return valid results. In contrast, RSL-Psc (M=3) and RSL-Psc (M=5) have 149 and 186 out of 200 queries that return valid results, respectively. As a result, the RSL-Psc query, even with a small number of route combinations, demonstrates superiority over the RSL query (without route combination) with regard to the probability of returning a valid result route.

5.3 Pruning Effectiveness
First, we investigate the pruning achieved by the three methods with default parameter settings. The experimental results are shown in Table 4. Specifically, the pruning rate (PR) and proportion of candidates (PC) are defined as follows.

$$PC_{FSPS} = \frac{\#DistinctR(C)}{|T|} \qquad PC_{GSPS} = \frac{\sum_{P_i \in P} \#DistinctR(P_i)}{|P| \times |T|}$$

$$PR_{FSPS} = 1 - PC_{FSPS} \qquad PR_{GSPS} = 1 - PC_{GSPS}$$

Table 4: Pruning Effect

| | FSPS | GSE | GSE+CTF |
|---|---|---|---|
| PR in BRN | 0.29 | 0.81 | 0.91 |
| PC in BRN | 0.71 | 0.19 | 0.09 |
| PR in NRN | 0.34 | 0.84 | 0.93 |
| PC in NRN | 0.66 | 0.26 | 0.07 |

Here, the pruning effectiveness of FSPS is computed by $PC_{FSPS}$ and $PR_{FSPS}$, while the pruning effectiveness of GSE and GSE+CTF is computed by $PC_{GSPS}$ and $PR_{GSPS}$. The numerator of $PC_{FSPS}$ (i.e., #DistinctR(C)) denotes the number of distinct routes in C, and |T| is the cardinality of the route set. The value of $\#DistinctR(P_i)$ in $PC_{GSPS}$ is the number of distinct route candidates in all groups of partition $P_i$. By comparing the pruning rate of GSE+CTF to that of FSPS, we see that the pruning effectiveness is improved by approximately a factor of 3 when using the group-based dynamic similarity upper bounds based on the aggregated distance between all query locations in each group (cf. Section 4.2). In addition, the comparison between GSE+CTF and GSE suggests that the route candidate filtering algorithm (CTFilter) is capable of improving the pruning effectiveness by a factor of 1.1.

5.4 Evaluation of Query Performance
Effect of the number of routes: Figure 6 presents the performance of the algorithms when varying the number of routes |T|. Intuitively, a larger |T| causes more routes to be processed and yields a larger search space. As a result, both the CPU time and the count of visited routes are expected to increase for all three algorithms. Figures 6(a) and 6(b) show that GSE+CTF outperforms FSPS by approximately a factor of 6 in both BRN and NRN regarding the CPU time. In particular, the group-based network expansion algorithm (cf. Section 4.2) in GSE is able to improve the CPU-time performance by a factor of 3–4 compared with FSPS, and the route candidate filtering algorithm (CTFilter) can further improve the efficiency of GSE by a factor of 1.5. It is worth noting that the CPU time is not fully aligned with the count of visited routes. Specifically, the performance discrepancy between FSPS and GSE in terms of route counts is less significant than that in terms of CPU time. The reason is as follows. In FSPS, each visited route is regarded as a route candidate of a query location, and it is evaluated in the combination process. In GSE, a large number of visited routes are pruned by the group-based dynamic similarity upper bounds, so they are not evaluated in the combination process.

Effect of query location set cardinality: Figure 7 shows the effect of varying the size of the query location set |O| on the efficiency of the algorithms. A larger |O| implies: (1) a larger search space with more routes to be accessed and evaluated; (2) a larger number of possible partitions and split-and-combine sub-tasks (i.e., the number of split-and-combine sub-tasks is $\sum_{k=0}^{M} S(|O|, k+1)$). Thus, when we increase the number of query locations in O, more CPU time and route visits occur. It is worth mentioning that when |O| is increased


from 8 to 12, the value of $\sum_{k=0}^{M} S(|O|, k+1)$ (M = 3) increases by a factor of about 250 (from 2,795 to 700,075 sub-tasks). However, according to Figures 7(a) and 7(b), the CPU time only increases by 30–100 times for all algorithms. This is because some qualified routes require only 0, 1, or 2 transfers, and once we find a qualified route, the search process terminates.

[Figure 6: Effect of the number of routes. Panels: (a) BRN-time, (b) NRN-time, (c) NRN-count. x-axis: number of routes; y-axes: runtime (ms) and count of visited routes; series: FSPS, GSE, GSE+CTF.]

[Figure 7: Effect of |O|. Panels: (a) BRN-time, (b) NRN-time, (c) NRN-count. x-axis: cardinality of query location set |O|; y-axes: runtime (ms) and count of visited routes; series: FSPS, GSE, GSE+CTF.]

Effect of similarity threshold θ: This set of experiments investigates the effect of the similarity threshold θ. Figure 8 shows the results when we vary θ. Increasing the value of θ has the following two effects on the performance: (1) Based on Equation 4 and Theorem 1, a larger value of θ leads to higher pruning effectiveness, which may improve the efficiency. (2) A larger value of θ may postpone the termination of the algorithms. Recall that while combining routes, we first generate the routes with the minimum number of combinations (transfers), and that once we find a qualified route, it is unnecessary to consider larger numbers of combinations. Hence, when we increase θ, it is less likely that we will be able to generate a qualified route with few combinations. This effect may deteriorate the efficiency. Compared to Effect (2), Effect (1) is negligible. As a result, all algorithms exhibit increasing CPU time and route visits as we increase the value of θ.

Effect of maximum number of transfers M: We proceed to evaluate the effect of varying the maximum number of route transfers (M). From Figures 9(a) and 9(b), we find that when we increase M, the CPU time increases for all algorithms. Specifically, when the value of M reaches 4 in NRN, the subsequent increase in CPU time is modest. The reason is that when M is set to 5 in NRN, most of the trials return qualified final routes with no more than 4 transfers. Additionally, Figure 9(c) suggests that the number of route visits is relatively stable as we increase M.

Effect of the radius of query location selector region: Figure 10 shows the effect of varying the radius of the query location selector region. We find that both CPU time and route visits exhibit slight or moderate increasing trends for all algorithms when we increase the radius of the selector region from 1 km to 9 km. The reason is that when we apply a large query location selector region, the query locations in O are distributed increasingly widely in the underlying space, which increases the number of route visits during network expansion.

Effect of the number of query location selector regions: Figure 11 covers the effect of varying the number of query location selector regions. More query regions imply that more expansion centers must be processed, which increases the search space and the number of route visits. Thus, the CPU time and the count of visited routes for all three algorithms increase with the number of query location selector regions.

Effect of thread counts: We study the effect of thread count on the efficiency of the algorithms using large route data sets (|T| = 500K for BRN and |T| = 10M for NRN). The results in Figure 12 show that GSE+CTF outperforms FSPS by a factor of 6–8 in terms of runtime and outperforms GSE by 30%–60% in terms of runtime. In BRN, when we set the thread count to 120, GSE+CTF is able to solve the RSL-Psc problem over a collection of 500K routes in 15 milliseconds, while in NRN, GSE+CTF is able to solve the RSL-Psc problem with 10M routes in 100 milliseconds. When we increase the thread count from 24 to 120 (5 times), the runtimes of GSE+CTF and GSE are improved by a factor of around 3, while the runtime of FSPS is improved by factors of 2.8 and 2.1 in BRN and NRN, respectively.

6 RELATED WORK
Existing studies related to the RSL-Psc problem can be classified into two categories: location-to-trajectory search and location-based route recommendations.

Location-to-trajectory search: Location-to-trajectory search aims at retrieving the trajectories that have the highest relevance to the query arguments [4, 12, 15, 21]. In particular, the relevance functions may contain spatial [1], temporal [8], textual [12, 21], and density elements. The resulting queries are useful in many popular applications including travel planning, carpooling, friend recommendation in social networks, and location-based services in general.

According to the types of trajectory query arguments, we further classify existing studies regarding location-to-trajectory search into two sub-categories: (1) trajectory search based on a single location; (2) trajectory search based on multiple locations. Zheng et al. [22] extend the single-point trajectory query to cover spatial and textual domains and propose the TkSK query, which retrieves the trajectories that are spatially close to the query point and also meet semantic requirements defined by the query. For trajectory search based on multiple locations, the query takes a set of locations as argument and returns a trajectory that connects or is close to the query locations according to specific metrics. The concept of trajectory search by locations (TSL) was first proposed by Chen et al. [1]. The main difference between RSL-Psc and the problem of location-based trajectory search studied by existing work is that RSL-Psc returns a route by combining a set of connected trajectories. In contrast, existing location-based trajectory queries return a single pre-existing trajectory or a list of pre-existing trajectories.

Location-based route recommendations: Given a set of locations (e.g., POIs, taxi locations), the location-based route recommendation problem aims to derive a new route based on the locations and user preferences. Ge et al. [5] and Ye et al. [16, 17] study the mobile sequential recommendation problem that outputs an optimal route with minimum potential travel distance to a taxi driver's next potential passenger. In particular, Ye et al. [18] first


study the multiple mobile sequential recommendation problem that generates optimal routes for a group of taxis with different locations. Another area of related studies is travel itinerary recommendation (e.g., [7, 14]). Specifically, it takes a set of user-specified POIs and constraints as input to generate an itinerary through a subset of POIs with a specific starting and ending POI that can be completed within a certain time. Additionally, Dai et al. [2] recommend the shortest route to users based on existing trajectories by considering multiple costs. However, the results generated by these proposals are new individual routes, while the results of RSL-Psc are combinations of existing trajectories.

[Figure 8: Effect of θ. Panels: (a) BRN-time, (b) NRN-time, (c) NRN-count. x-axis: similarity threshold θ (× |O|); y-axes: runtime (ms) and count of visited routes; series: FSPS, GSE, GSE+CTF.]

[Figure 9: Effect of M. Panels: (a) BRN-time, (b) NRN-time, (c) NRN-count. x-axis: maximum number of transfers M; y-axes: runtime (ms) and count of visited routes; series: FSPS, GSE, GSE+CTF.]

[Figure 10: Effect of r. Panels: (a) BRN-time, (b) NRN-time. x-axis: r (km); y-axis: runtime (ms); series: FSPS, GSE, GSE+CTF.]

[Figure 11: Effect of n. Panels: (a) BRN-time, (b) NRN-time. x-axis: number of selector regions n; y-axis: runtime (ms); series: FSPS, GSE, GSE+CTF.]

[Figure 12: Effect of thread counts. Panels: (a) BRN-time, (b) NRN-time. x-axis: thread count; y-axis: runtime (ms); series: FSPS, GSE, GSE+CTF.]

7 CONCLUSIONS
We propose and study the RSL-Psc problem, namely a parallel split-and-combine approach to enable route search by locations. To answer the RSL-Psc query, we develop two parallel search algorithms: Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). Specifically, we divide the route split-and-combine task into $\sum_{k=0}^{M} S(|O|, k+1)$ sub-tasks, where M is the maximum number of combinations and S(·) is the Stirling number of the second kind. In each sub-task, we use network expansion to explore the spatial network and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent of each other and are performed in parallel. Extensive experiments with real data demonstrate that our proposed RSL-Psc query is much more likely to return a valid result than the RSL query without route combination. In addition, the FSPS and GSPS algorithms are capable of achieving high efficiency and scalability on massive route data.

Acknowledgements: This work (Bin Yao) was supported by the NSFC (61872235, 61729202, 61832017, U1636210) and the National Key Research and Development Program of China (2018YFC1504504). Additionally, Zhiwei Zhang is supported by GRF (12201518, 12232716, 12258116) and NSFC (61602395).

REFERENCES
[1] Zaiben Chen, Heng Tao Shen, Xiaofang Zhou, Yu Zheng, and Xing Xie. 2010. Searching trajectories by locations: an efficiency study. In SIGMOD. 255–266.
[2] Jian Dai, Bin Yang, Chenjuan Guo, and Zhiming Ding. 2015. Personalized route recommendation using big trajectory data. In ICDE. 543–554.
[3] E. W. Dijkstra. 1959. A note on two problems in connection with graphs. Numerische Math 1 (1959), 269–271.
[4] E. Frentzos, K. Gratsias, and Y. Theodoridis. 2007. Index-based most similar trajectory search. In ICDE. 816–825.
[5] Yong Ge, Hui Xiong, Alexander Tuzhilin, Keli Xiao, Marco Gruteser, and Michael J. Pazzani. 2010. An energy-efficient mobile recommender system. In KDD. 899–908.
[6] Antonin Guttman. 1984. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD. 47–57.
[7] Kwan Hui Lim, Jeffrey Chan, Shanika Karunasekera, and Christopher Leckie. 2017. Personalized Itinerary Recommendation with Queuing Time Awareness. In SIGIR. 325–334.
[8] Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Ji-Rong Wen, and Panos Kalnis. 2016. Collective Travel Planning in Spatial Networks. IEEE Trans. Knowl. Data Eng. 28, 5 (2016), 1132–1146.
[9] Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Kai Zheng, and Panos Kalnis. 2017. Trajectory Similarity Join in Spatial Networks. PVLDB 10, 11 (2017), 1178–1189.
[10] Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Kai Zheng, and Panos Kalnis. 2018. Parallel trajectory similarity joins in spatial networks. VLDB J. 27, 3 (2018), 395–420.
[11] Shuo Shang, Lisi Chen, Kai Zheng, Christian S. Jensen, Zhewei Wei, and Panos Kalnis. 2019. Parallel Trajectory-to-Location Join. IEEE Trans. Knowl. Data Eng. 31, 6 (2019), 1194–1207.
[12] Shuo Shang, Ruogu Ding, Bo Yuan, Kexin Xie, Kai Zheng, and Panos Kalnis. 2012. User oriented trajectory search for trip recommendation. In EDBT. 156–167.
[13] Shuo Shang, Ruogu Ding, Kai Zheng, Christian S. Jensen, Panos Kalnis, and Xiaofang Zhou. 2014. Personalized trajectory matching in spatial networks. VLDB J. 23, 3 (2014), 449–468.
[14] Kendall Taylor, Kwan Hui Lim, and Jeffrey Chan. 2018. Travel Itinerary Recommendations with Must-see Points-of-Interest. In WWW. 1198–1205.
[15] Kexin Xie, Ke Deng, and Xiaofang Zhou. 2009. From trajectories to activities: a spatio-temporal join approach. In LBSN. 25–32.
[16] Zeyang Ye, Keli Xiao, and Yuefan Deng. 2018. A Unified Theory of the Mobile Sequential Recommendation Problem. In ICDM. 1380–1385.
[17] Zeyang Ye, Keli Xiao, Yong Ge, and Yuefan Deng. 2019. Applying Simulated Annealing and Parallel Computing to the Mobile Sequential Recommendation. IEEE Trans. Knowl. Data Eng. 31, 2 (2019), 243–256.
[18] Zeyang Ye, Lihao Zhang, Keli Xiao, Wenjun Zhou, Yong Ge, and Yuefan Deng. 2018. Multi-User Mobile Sequential Recommendation: An Efficient Parallel Computing Paradigm. In KDD. 2624–2633.
[19] Jing Yuan, Yu Zheng, Xing Xie, and Guangzhong Sun. 2013. T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence. IEEE Trans. Knowl. Data Eng. 25, 1 (2013), 220–232.
[20] Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. 2010. T-drive: driving directions based on taxi trajectories. In ACM SIGSPATIAL. 99–108.
[21] Kai Zheng, Shuo Shang, Nicholas Jing Yuan, and Yi Yang. 2013. Towards efficient search for activity trajectories. In ICDE. 230–241.
[22] Kai Zheng, Bolong Zheng, Jiajie Xu, Guanfeng Liu, An Liu, and Zhixu Li. 2016. Popularity-aware spatial keyword search on activity trajectories. World Wide Web 19, 6 (2016), 1–25, online first.


A ALGORITHM FOR COMBINING ROUTE CANDIDATES IN FSPS

Algorithm 2 presents the pseudo code for generating group-wise route tuples and deriving a qualified final route by combining route candidates. We partition the query location set O into k + 1 subsets, where k ∈ [0, M], and we evaluate each possible partitioning, starting from the partitioning with one group and proceeding to partitionings with M + 1 groups. For each possible partitioning P, we generate the group-wise route tuples associated with each group G (lines 4–8). First, we initialize G.candidates, which stores all group-wise route tuples of G (line 5). Next, for each τ that was scanned by every query location in G during Dijkstra expansion, we generate its group-wise route tuples based on Definition 7 (lines 6–8). After having generated the group-wise route tuples of all groups in partitioning P, we evaluate each possible sequence of groups X, and for each sequence we combine route candidates in each group in a bottom-up fashion until all groups are combined (lines 9–26). Here, i in $G_i^{(j)}$ is the group index and j in $G_i^{(j)}$ denotes the level of the group. In particular, for each sequence of groups we initialize the group-wise route tuples of each group $G_i^{(0)}$ from level 0 (lines 10–11). If |X| > 1, which means that the groups in sequence X can be combined, we proceed to combine route candidates in groups from the lowest level. Starting from level j = 0, we combine route candidates associated with two adjacent groups (i.e., $G_l^{(j)}$ and $G_{l+1}^{(j)}$) and generate corresponding next-level route candidate tuples (lines 19–20). Specifically, we derive the union of groups $G_l^{(j)}$ and $G_{l+1}^{(j)}$ (i.e., $G_s^{(j+1)}$) (line 19) and generate route candidates associated with $G_s^{(j+1)}$ by calling the NextLevelRoute function (Algorithm 3) (line 20). If the cardinality of $G_s^{(j+1)}$ equals the cardinality of the query location set O, we have combined all groups and the route candidates are associated with all locations in the query location set. Thus, we can return any route candidate as a result (lines 21–22). Otherwise, we need to update X by merging $G_l^{(j)}$ and $G_{l+1}^{(j)}$ (line 23).

Algorithm 2: FSPSCombination
Data: Route candidate tuple set C, similarity threshold θ, combination threshold M
Result: Final result route
 1  k ← 0;
 2  while k ≤ M do
 3      for each P in P(k+1) do
 4          for each G in P do
 5              G.candidates ← ∅;
 6              for each τ s.t. ∀(o_i ∈ G) (o_i(τ, ·) ∈ C_i) do
 7                  for each valid G(τ, P) do
 8                      G.candidates.add(G(τ, P)); // Definition 7
 9          for each group sequence X of P do
10              for each G_i^(0) in X do
11                  Initialize G_i^(0);
12              j ← 0;
13              while |X| > 1 do
14                  l, s ← 1;
15                  while l ≤ k do
16                      if l = k then
17                          G_s^(j+1) ← G_l^(j);
18                      else
19                          G_s^(j+1) ← G_l^(j) ∪ G_(l+1)^(j);
20                          G_s^(j+1).candidates ← NextLevelRoute(G_l^(j), G_(l+1)^(j));
21                          if |G_s^(j+1)| = |O| and G_s^(j+1).candidates ≠ ∅ then
22                              return G_s^(j+1).candidates.top();
23                          Replace G_l^(j) and G_(l+1)^(j) by G_s^(j+1) in X;
24                      l ← l + 2;
25                      s ← s + 1;
26                  j ← j + 1;
27      k ← k + 1;

Algorithm 3 presents the pseudo code for combining the route candidate tuples associated with two adjacent groups and generating corresponding next-level route candidate tuples. After initialization (lines 1–2), we evaluate each pair of routes from G and G′, respectively, and check whether we can generate a next-level route candidate from the route pair τ and τ′. Based on Definition 8, if τ and τ′ can generate a next-level route candidate, they must have at least one intersection point (line 4). For each intersection point $p_i$, we consider two potential combinations of sub-routes, $\tau^{-}(p_i) + \tau'^{+}(p_i)$ and $\tau'^{-}(p_i) + \tau^{+}(p_i)$, and we check whether they are qualified for combination based on Definition 8 (lines 8–17). Qualified next-level route candidate tuples are added to the route candidate tuple set associated with the union of G and G′. In particular, if the union of G and G′ equals the query location set O, each route candidate associated with S = G ∪ G′ is considered to be a final route candidate. Here, we check whether the similarity between the route and the query location set satisfies the similarity threshold θ. If so, we regard the route as a qualified final route and return S.candidates.

Algorithm 3: NextLevelRoute(G, G′)
Data: Groups G and G′, similarity threshold θ
Result: Next-level route tuples of G ∪ G′
 1  S ← G ∪ G′;
 2  S.candidates ← ∅;
 3  for each G(τ, P) ∈ G.candidates, G′(τ′, P′) ∈ G′.candidates do
 4      if τ ∩ τ′ ≠ ∅ then
 5          for each p_i ∈ τ ∩ τ′ do
 6              τ_l ← τ⁻(p_i) + τ′⁺(p_i);
 7              τ_r ← τ′⁻(p_i) + τ⁺(p_i);
 8              if τ_l is a next-level route and |G ∪ G′| < |O| then
 9                  S.candidates.add(S(τ_l, P ∪ P′));
10              else if τ_l is a next-level route and Sim(O, τ_l) ≥ θ then
11                  S.candidates.add(S(τ_l, P ∪ P′));
12                  return S.candidates;
13              if τ_r is a next-level route and |G ∪ G′| < |O| then
14                  S.candidates.add(S(τ_r, P ∪ P′));
15              else if τ_r is a next-level route and Sim(O, τ_r) ≥ θ then
16                  S.candidates.add(S(τ_r, P ∪ P′));
17                  return S.candidates;
18  return S.candidates;

Time Complexity: The time complexity of generating route tuples of each query location in O (i.e., FSPSCandidates) is $O((|V| \log |V| + |E|) \cdot |O| \cdot |T_v|)$. Specifically, $(|V| \log |V| + |E|)$ is the complexity of the Dijkstra expansion for each query location, |V| is the number of vertices, and |E| is the number of edges. Further, $|T_v|$ denotes the average number of routes passing each vertex. The time complexity


of running FSPSCombination on each partitioning can be approximated as $O(k \cdot |\tau_o|^{|O|/k} + |T_G|^2 \cdot k!)$, where k denotes the number of groups in the partitioning, $|\tau_o|$ is the average number of route tuples of each route associated with each query location, and $|T_G|$ is the average number of group-wise route tuples of each group.³ Here, k and |O| are much smaller than $|T_v|$ and $|T_G|$.

B ALGORITHM OF GSPS

B.1 Algorithm of Network Expansion
Algorithm 4 presents the pseudo code of group-based network expansion. First, we initialize the route label hash map H and the value of $o_i.sd$ for each $o_i \in G$ (lines 1–3). Next, we check whether the expansion can be terminated based on Theorem 1 (line 4). If not, we start the current iteration of network expansion (lines 5–13). Specifically, we select the query location that has the minimum value of o.sd in G for performing expansion (i.e., $o_{min}$) (line 5). Here, p is the next vertex scanned by the expansion from $o_{min}$ (line 6). Then we update $o_{min}.sd$ to be $sd(o_{min}, p)$, which is the shortest network distance between $o_{min}$ and p (line 7). After that, we scan and evaluate the routes that pass through p (lines 8–13). In particular, for each route τ passing through p, we retrieve its route label set (i.e., L) in H (line 9). If L is null, τ has never been scanned, so we need to add a new key-value pair for τ into H. The key is the entry of τ and the value is a new route label set containing the label $\langle o_{min}, p, o_{min}.sd \rangle$ (lines 10–11). If L is not null, τ was scanned before. Here, we just insert a new label $\langle o_{min}, p, o_{min}.sd \rangle$ into L (lines 12–13). When the expansion terminates, we return the hash map H as the result (line 14). Elements in H are considered to be route candidates associated with G.

Algorithm 4: GSPSExpansion
Data: Query location group G, cardinality of query location set |O|, route set T, similarity threshold θ
Result: Route label hash map H
 1  H ← ∅;
 2  for each o_i in G do
 3      o_i.sd ← 0;
 4  while |O| − |G| + Σ_{o_i ∈ G} e^(−o_i.sd) ≥ θ do
 5      o_min ← the o with the minimum o.sd in G;
 6      p ← DijkstraExpansion(o_min).next();
 7      o_min.sd ← sd(o_min, p);
 8      for each τ s.t. p ∈ τ do
 9          L ← H.get(τ);
10          if L is null then
11              H.put(τ, {⟨o_min, p, o_min.sd⟩});
12          else
13              L.add(⟨o_min, p, o_min.sd⟩);
14  return H;

B.2 Route Candidate Filtering
Recall that the GSPSExpansion algorithm generates route candidates (H) for group G. However, some routes in H can be eliminated based on the route label set associated with each route in H. In particular, while evaluating whether a new route τ is a qualified route candidate for G during iteration i, the corresponding similarity upper bound (i.e., $|O| - |G| + \sum_{o_i \in G} e^{-o_i.sd}$) is computed based on the value of $o_i.sd$ in iteration i. As we continue the iterative process, the value of $o_i.sd$ will increase, which makes the value of $|O| - |G| + \sum_{o_i \in G} e^{-o_i.sd}$ decrease. Therefore, it is possible that a route τ added to H during iteration i can be pruned based on Theorem 1 during iteration i+n.

To address the problem, we develop a route candidate filtering algorithm, CTFilter, that takes H as input and evaluates each τ ∈ H based on the final value of $o_i.sd$ produced by the GSPSExpansion algorithm.

Algorithm 5: CTFilter
Data: Route label hash map H, the final values of o_i.sd for each o_i ∈ G, cardinality of query location set |O|, route set T, similarity threshold θ
Result: G.candidates
 1  G.candidates ← ∅;
 2  for each τ in H do
 3      L ← H.get(τ);
 4      S ← ∅; P ← ∅;
 5      a ← 0;
 6      for each label l ∈ L do
 7          o ← query location in l;
 8          sd ← o.sd in l;
 9          p ← p in l;
10          S.add(o);
11          P.add(⟨o, p⟩);
12          a ← a + e^(−sd);
13      if |O| − |G| + a + Σ_{o_i ∈ G\S} e^(−o_i.sd) ≥ θ then
14          G.candidates.add(G(τ, P));
15  return G.candidates;

After initializing G.candidates, we evaluate each τ in H. First, we retrieve the route label set L associated with τ (line 3). Next, sets S and P and variable a are initialized (lines 4–5). Specifically, S stores the query locations in L, P stores the ⟨o, p⟩ pairs (cf. P in Definition 7), and a records the aggregated value of $e^{-o_i.sd}$ ($o_i \in S$). Then we visit each label l in L and acquire the query location o and its corresponding p and o.sd in l (lines 7–9). We update S and P by inserting o and ⟨o, p⟩, respectively, and update a by adding the value of $e^{-o.sd}$ (lines 10–12). Next, we calculate the up-to-date similarity upper bound of τ. In the expression $|O| - |G| + a + \sum_{o_i \in G \setminus S} e^{-o_i.sd}$, a denotes the aggregated similarity score contributed by the route labels of τ (i.e., L), and $\sum_{o_i \in G \setminus S} e^{-o_i.sd}$ computes the similarity score contributed by the query locations in G that are not stored in L. If the upper bound is no less than θ, we add the group-wise route tuples of τ into G.candidates. When we have completed the scan of H, we return G.candidates as the result.

The algorithm for combining route candidates in GSPS is similar to Algorithm 2. The only difference is that we do not need to generate the group-wise route tuples associated with each group G (cf. Algorithm 2, lines 4–8) because they have already been generated by Algorithms 4 and 5.

Time Complexity: The time complexity of generating route tuples of each query location in O (i.e., GSPSExpansion) is $O(2^{|O|} \cdot (|V| \cdot \log |V| + |E|) \cdot |T_v|)$, where $2^{|O|}$ denotes the number of unique groups that can be generated from query location set O. The time complexity of route combination on each partitioning can be approximated as $O(|T_G|^2 \cdot k!)$. The notation was explained at the end
3 We
of Section A.
omit the detailed reduction and approximation of the time complexity due to the
space limitation.
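The group-wise expansion (GSPSExpansion) and the label-based candidate filtering described in this appendix can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sketch assumes that the road network is an adjacency dictionary, that query locations coincide with network vertices, and that a route is a list of vertex ids. The names `dijkstra_expansion`, `upper_bound`, `gsps_expansion`, and `filter_candidates` are hypothetical, and a candidate is returned as a plain `(route id, pairs)` tuple rather than a group-wise route tuple G(τ, P).

```python
import heapq
import math
from collections import defaultdict


def dijkstra_expansion(graph, source):
    """Yield (vertex, shortest distance) pairs in non-decreasing distance order."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        yield u, d
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))


def upper_bound(n_query, sd):
    """|O| - |G| + sum over o_i in G of exp(-o_i.sd)."""
    return n_query - len(sd) + sum(math.exp(-d) for d in sd.values())


def gsps_expansion(graph, routes, group, n_query, theta):
    """Sketch of the expansion: returns the label map H and the final o.sd values."""
    # Inverted index from vertex to the routes passing through it (assumption).
    routes_at = defaultdict(list)
    for rid, vertices in routes.items():
        for v in set(vertices):
            routes_at[v].append(rid)

    sd = {o: 0.0 for o in group}
    frontier = {o: dijkstra_expansion(graph, o) for o in group}
    labels = {}  # H: route id -> list of (o, p, sd) labels

    while frontier and upper_bound(n_query, sd) >= theta:
        o_min = min(frontier, key=sd.get)
        step = next(frontier[o_min], None)
        if step is None:
            # Assumption: an exhausted expansion is simply dropped.
            del frontier[o_min]
            continue
        p, d = step
        sd[o_min] = d
        for rid in routes_at.get(p, ()):
            labels.setdefault(rid, []).append((o_min, p, d))
    return labels, sd


def filter_candidates(labels, sd, n_query, theta):
    """Keep routes whose label-based similarity upper bound is at least theta."""
    candidates = []
    for rid, route_labels in labels.items():
        seen, pairs, a = set(), [], 0.0
        for o, p, d in route_labels:
            seen.add(o)
            pairs.append((o, p))
            a += math.exp(-d)
        # Query locations in G without a label on this route use their final o.sd.
        rest = sum(math.exp(-sd[o]) for o in sd if o not in seen)
        if n_query - len(sd) + a + rest >= theta:
            candidates.append((rid, pairs))
    return candidates
```

As in the pseudocode, every label of a route contributes to `a`, so a route that is labeled more than once by the same query location gets a looser (but still valid) bound. Each query location keeps its own lazy Dijkstra generator so that the expansion can alternate between locations, always advancing the one with the smallest `o.sd`.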
