
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT

By
SUBRAMANIAN ARUMUGAM
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Subramanian Arumugam
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor Chris Jermaine. This dissertation would
not have been made possible had it not been for his excellent mentoring and guidance
through the years. Chris is a terrific teacher, a critical thinker and a passionate researcher.
He has served as a great role model and has helped me mature as a researcher. I cannot
thank him more for that.
My thanks also goes to Prof. Alin Dobra. Through the years, Alin has been a patient
listener and has helped me structure and refine my ideas countless times. His excitement
for research is contagious!
I would like to take this opportunity to mention my colleagues at the database
center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing
interesting problems with them. Special thanks goes to my friends Manas, Srijit, Arun,
Shantanu, and Seema, for making my stay in Gainesville all the more enjoyable.
Finally, I would like to thank my parents for being a source of constant support and
encouragement throughout my studies.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Research Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Data Modeling and Database Design . . . . . . . . . . . . . . . . . 15
1.2.2 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories . . . 18
1.3.2 Entity Resolution in Spatiotemporal Databases . . . . . . . . . . . . 19
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases . . . 19
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Spatiotemporal Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Probabilistic Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES . 25
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Moving Object Trajectories . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Closest Point of Approach (CPA) Problem . . . . . . . . . . . . . . 28
3.3 Join Using Indexing Structures . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Trajectory Index Structures . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 R-tree Based CPA Join . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Join Using Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Basic CPA Join using Plane-Sweeping . . . . . . . . . . . . . . . . . 36
3.4.2 Problem With The Basic Approach . . . . . . . . . . . . . . . . . . 37
3.4.3 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Adaptive Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.2 Cost Associated With a Given Granularity . . . . . . . . . . . . . . 41
3.5.3 The Basic Adaptive Plane-Sweep . . . . . . . . . . . . . . . . . . . 41
3.5.4 Estimating Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.5 Determining The Best Cost . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.6 Speeding Up the Estimation . . . . . . . . . . . . . . . . . . . . . . 46
3.5.7 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 Test Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.2 Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES . . . . . . . . 58
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Outline of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 PDF for Restricted Motion . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2 PDF for Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Learning the Restricted Model . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.2 Learning K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Learning Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5.1 Applying a Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.2 Handling Multiple Objects . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.3 Update Strategy for a Sample given Multiple Objects . . . . . . . . 75
4.5.4 Speeding Things Up . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS . . 84
5.1 Problem and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.2 The False Positive Problem . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.3 The False Negative Problem . . . . . . . . . . . . . . . . . . . . . . 90
5.2 The Sequential Probability Ratio Test (SPRT) . . . . . . . . . . . . . . . . 91
5.3 The End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 What's Wrong With the SPRT? . . . . . . . . . . . . . . . . . . . . 95
5.3.2 Removing the Magic Epsilon . . . . . . . . . . . . . . . . . . . . . . 96
5.3.3 The End-Biased Algorithm . . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Indexing the End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.2 Building the Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.3 Processing Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. . . 80
4-2 Varying the number of time ticks. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-3 Varying the number of sensors fired. . . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Varying the standard deviation of the Gaussian cloud. . . . . . . . . . . . . . . 80
4-5 Varying the number of time ticks where EM is applied. . . . . . . . . . . . . . . 81
5-1 Running times over varying database sizes. . . . . . . . . . . . . . . . . . . . . . 109
5-2 Running times over varying query sizes. . . . . . . . . . . . . . . . . . . . . . . 109
5-3 Running times over varying object standard deviations. . . . . . . . . . . . . . . 109
5-4 Running times over varying confidence levels. . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES
Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) . . . . . . . . . . 28
3-2 Closest Point of Approach Illustration . . . . . . . . . . . . . . . . . . . . . . . 29
3-3 CPA Illustration with trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Example of an R-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-5 Heuristic to speed up distance computation . . . . . . . . . . . . . . . . . . . . 34
3-6 Issues with R-trees- Fast moving object p joins with everyone . . . . . . . . . . 35
3-7 Progression of plane-sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-8 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3-9 Problem with using large granularities for bounding box approximation . . . . . 40
3-10 Adaptively varying the granularity . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-11 Convexity of cost function illustration. . . . . . . . . . . . . . . . . . . . . . . . 45
3-12 Iteratively evaluating k cut points . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Speeding up the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-14 Injection data set at time tick 2,650 . . . . . . . . . . . . . . . . . . . . . . . . . 49
3-15 Collision data set at time tick 1,500 . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-16 Injection data set experimental results . . . . . . . . . . . . . . . . . . . . 51
3-17 Collision data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 52
3-18 Buffer size choices for Injection data set . . . . . . . . . . . . . . . . . . . . . . 53
3-19 Buffer size choices for Collision data set . . . . . . . . . . . . . . . . . . . . . . 53
3-20 Synthetic data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 54
3-21 Buffer size choices for Synthetic data set . . . . . . . . . . . . . . . . . . . . . . 56
4-1 Mapping of a set of observations for linear motion . . . . . . . . . . . . . . . . . 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d) . . . . . . . . . . . 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) 64
4-4 The baseline input set (10,000 observations) . . . . . . . . . . . . . . . . . . . . 79
4-5 The learned trajectories for the data of Figure 4-4 . . . . . . . . . . . . . . . . . 79
5-1 The SPRT in action. The middle line is the LRT statistic . . . . . . . . . . . . . 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty . . . . 97
5-3 The sequence of SPRTs run by the end-biased test . . . . . . . . . . . . . . . . 98
5-4 Building the MBRs used to index the samples from the end-biased test. . . . . . 104
5-5 Using the index to speed the end-biased test . . . . . . . . . . . . . . . . . . . . 106
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
Subramanian Arumugam
August 2008
Chair: Christopher Jermaine
Major: Computer Engineering
This work focuses on interesting data management problems that arise in the analysis,
modeling and querying of large-scale spatiotemporal data. Such data naturally arise in the
context of many scientific and engineering applications that deal with physical processes
that evolve over time.
We first focus on the issue of scalable query processing in spatiotemporal databases.
In many applications that produce a large amount of data describing the paths of moving
objects, there is a need to ask questions about the interaction of objects over a long
recorded history. To aid such analysis, we consider the problem of computing joins over
moving object histories. The particular join studied is the Closest-Point-Of-Approach
join, which asks: Given a massive moving object history, which objects approached within
a distance d of one another?
Next, we study a novel variation of the classic entity resolution problem that
appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is
to perform an accurate segmentation of all of the observations into sets, where each set is
associated with one object. Each set should also be annotated with the path of the object
through the area.
Finally, we consider the problem of answering selection queries in a spatiotemporal
database, in the presence of uncertainty incorporated through a probabilistic model.
We propose very general algorithms that can be used to estimate the probability that a
selection predicate evaluates to true over a probabilistic attribute or attributes, where
the attributes are supplied only in the form of a pseudo-random attribute value generator.
This enables the efficient evaluation of queries such as "Find all vehicles that are in close
proximity to one another with probability p at time t" using Monte Carlo statistical
methods.
CHAPTER 1
INTRODUCTION
This study is a step towards addressing some of the many issues faced in extending
current database technology to handle spatiotemporal data in a seamless and efficient
manner. This chapter motivates spatiotemporal data management and introduces
the reader to the main research issues. This is followed by a summary of the key
contributions.
1.1 Motivation
The last few years have seen a significant interest in extending databases to support
spatiotemporal data as evidenced by the growing number of books, workshops, and
conferences devoted to this topic [14]. The advent of computational science and the
increasing use of wireless technology, sensors, and devices such as GPS has resulted in
numerous potential sources of spatiotemporal data. Large volumes of spatiotemporal data
are produced by many scientific, engineering and business applications that track and
monitor moving objects. Moving objects may be people, vehicles, wildlife, products
in transit, or weather systems. Such applications often arise in the context of traffic
surveillance and monitoring, land use management in GIS, simulation in astrophysics,
climate monitoring in earth sciences, fleet management, multimedia animation, etc. The
increasing importance of spatiotemporal data can be attributed to the improved reliability
of tracking devices and their low cost, which has reduced the acquisition barrier for such
data. Tracking devices have been adopted in varying degrees in a number of scientific
and enterprise application domains. For instance, vehicles increasingly come equipped
with GPS devices which enable location-based services [3]. Sensors play an increasingly
important role in surveillance and monitoring of physical spaces [5]. Enterprises such
as Walmart and Target, and organizations like the Department of Defense (DoD), plan to
track products in their supply chain through the use of smart Radio Frequency Identification
(RFID) labels [6].
Extending modern database systems to support spatiotemporal data is challenging for
several reasons:
Conventional databases are designed to manage static data, whereas spatiotemporal
data describe spatial geometries that change continuously with time. This requires a
unified approach to deal with aspects of spatiality and temporality.
Current databases are designed to manage data that is precise. However, uncertainty
is often an inherent property in spatiotemporal data due to discretization of
continuous movement and measurement errors. The fact that most spatiotemporal
data sources (particularly polling and sampling-based schemes) provide only a
discrete snapshot of continuous movement poses new problems to query processing.
For example, consider a conventional database record that stores the fact "John
Smith earns $200,000" and a spatiotemporal record which stores the fact "John
Smith walks from point A to point B" in the form of a discretized ordered pair
(A, B). In the former case, a query such as "What is the salary of John Smith?"
involves dealing with precise data. On the other hand, a spatiotemporal query
such as "Did John Smith walk through point C between A and B?" requires dealing
with information that is often not known with certainty. Further compounding the
problem is that even the recorded observations are only accurate to within a few
decimal places. Thus, even queries such as "Identify all objects located at point
A" may not return meaningful results unless allowed a certain margin for error.
Due to the presence of the time dimension, spatiotemporal applications have the
potential to produce a large amount of data. The sheer volume of data generated
by spatiotemporal applications presents a computational and data management
challenge. For instance, it is not uncommon for many scientific processes to produce
spatiotemporal data on the order of terabytes or even petabytes [7]. Developing
scalable algorithms to support query processing over tera- and petabyte-sized
spatiotemporal data sets is a significant challenge.
The semantics of many basic operations in a database changes in the presence of
space and time. For instance, basic operations like joins typically employ equality
predicates in a classic relational database, whereas equality is rare between two
arbitrary spatiotemporal objects.
1.2 Research Landscape
Over the last decade, database researchers have begun to respond to the challenges
posed by spatiotemporal data. Most of the research effort is concentrated on supporting
either predictive or historical queries. Within this taxonomy, we can further distinguish
work based on whether it supports time-instance or time-interval queries.
In predictive queries, the focus is on the future position of the objects and only a
limited time window of the object positions needs to be maintained. On the other hand,
for historical queries, the interest is in efficient retrieval of past history and thus the
database needs to maintain the complete timeline of an object's past locations. Due to
these divergent requirements, techniques developed for predictive queries are often not
suitable for historical queries.
What follows is a brief tour of the major research areas in spatiotemporal data
management. For a more complete treatment of this topic, the interested reader is
referred to [1, 3].
1.2.1 Data Modeling and Database Design
Early research focused on aspects of data modeling and database design for
spatiotemporal data [8]. Conventional data types employed in existing databases are
often not suitable to represent spatiotemporal data which describe continuous time-varying
spatial geometries. Thus, there is a need for a spatiotemporal type system that can model
continuously moving data. Depending on whether the underlying spatial object has an
extent or not, abstractions have been developed to model a moving point, line, and region
in two- and three-dimensional space with time considered as the additional dimension
[8–11]. Similarly, early work has also focused on refining existing CASE tools to aid in the
design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and
UML present a non-temporal view of the world and extensions to incorporate temporal
and spatial awareness have been investigated [12, 13].
Recently there has been interest in designing flexible type systems that can model
aspects of uncertainty associated with an object's spatial location [14]. There has also
been active effort towards designing SQL language extensions for spatiotemporal data
types and operations [15].
1.2.2 Access Methods
Efficient processing of spatiotemporal queries requires developing new techniques
for query evaluation, providing suitable access structures and storage mechanisms, and
designing efficient algorithms for the implementation of spatiotemporal operators.
Developing efficient access structures for spatiotemporal databases is an important
area of research. A variety of spatiotemporal index structures have been developed to
support both predictive and historical selection queries, most based on
generalization of the R-tree [16] to incorporate the time dimension. Indexing structures
designed to support predictive queries typically manage object movement within a small
time window and need to handle frequent updates to object locations. A popular choice
for such applications is the TPR-tree [17] and its many variants.
On the other hand, index structures designed to support historical queries need to
manage an object's entire past movement trajectory (for this reason they can be viewed as
trajectory indexes). Depending on the time interval indexed, the sheer volume of data that
needs to be managed presents significant technical challenges for overlap-allowing indexing
schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based
solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing
structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and
linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such
as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.
1.2.3 Query Processing
The development of efficient index structures has also led to a growing body of
research on different types of queries on spatiotemporal data, such as time-instant and
range queries [26–28], continuous queries, joins [29, 30], and their efficient evaluation
[31, 32]. In the same vein, there has also been some preliminary work on optimizing
spatiotemporal selection queries [33, 34].
Much of the work focuses specifically on indexing two-dimensional space and/or
supporting time-instance or short time-interval selection queries. Thus many indexing
structures often do not scale well for higher-dimensional spaces and have difficulty with
queries over long time intervals. Finally, historical data collections may be huge and joins
over such data require new solutions, since the predicates involved are non-traditional (such as
closest point of approach, within, sometimes-possibly-inside, etc.).
1.2.4 Data Analysis
Spatiotemporal data analysis allows us to obtain interesting insights from the stored
data collection. For instance:
In a road network database, the history of movement of various objects can be used
to understand traffic patterns.
In aviation, the flight path of various planes can be used in future path planning and
computing minimum separation constraints to avoid collision.
In wildlife management, one can understand animal migration patterns from the
trajectories traced by them.
Pollutants can be traced to their source by studying air flow patterns of aerosols
stored as trajectories.
Research in this area focuses on extending traditional data mining techniques to the
analysis of large spatiotemporal data sets. Topics of interest include discovering similarities
among object trajectories [35], data classification and generalization [36], trajectory
clustering and rule mining [37–39], and supporting interactive visualization for browsing
large spatiotemporal collections [40].
1.2.5 Data Warehousing
Supporting data analysis also requires designing and maintaining large collections of
historical spatiotemporal data, which falls under the domain of data warehousing.
Conventional data warehouses are often designed around the goal of supporting
aggregate queries efficiently. However, the interesting queries in a spatiotemporal data
warehouse seek to discover the interaction patterns of moving objects and understand the
spatial and/or temporal relationships that exist between them. Facilitating such queries
in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant
challenge. This requires extending traditional data mining techniques to the analysis
of large spatiotemporal data sets to discover spatial and temporal relationships, which
might exist at various levels of granularity involving complex data types. Research in
spatiotemporal data warehousing [41, 42] is relatively new and is focused on refining
existing multidimensional models to support continuous data and defining semantics for
spatiotemporal aggregation [43, 44].
1.3 Main Contributions
It is clear that extending modern database systems to support data management
and analysis of spatiotemporal data requires addressing issues that span almost the entire
breadth of database research. A full treatment of the various issues can be the subject of
numerous dissertations! To keep the scope of this dissertation manageable, I tackle three
important problems in spatiotemporal data management. The dissertation focuses on
data produced by moving objects, since moving object databases represent the most
common application domain for spatiotemporal databases [1]. The three specic problems
considered are described briey in the following subsections.
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
I first consider the scalability problem in computing joins over massive moving
object histories. In applications that produce a large amount of data describing the
paths of moving objects, there is a need to ask questions about the interaction of
objects over a long recorded history. This problem is becoming especially important
given the emergence of computational, simulation-based science (where simulations
of natural phenomena naturally produce massive databases containing data with
spatial and temporal characteristics), and the increased prevalence of tracking and
positioning devices such as RFID and GPS. The particular join that I study is the CPA
(Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another? I carefully consider several
obvious strategies for computing the answer to such a join, and then propose a novel,
adaptive join algorithm which naturally alters the way in which it computes the join in
response to the characteristics of the underlying data. A performance study over two
physics-based simulation data sets and a third, synthetic data set validates the utility of
my approach.
1.3.2 Entity Resolution in Spatiotemporal Databases
Next, I consider the problem of entity resolution for a large database of spatiotemporal
sensor observations. The following scenario is assumed. At each time-tick, one or more of
a large number of sensors report back that they have sensed activity at or near a specic
spatial location. For example, a magnetic sensor may report that a large metal object has
passed by. The goal is to partition the sensor observations into a number of subsets so
that it is likely that all of the observations in a single subset are associated with the same
entity, or physical object. For example, all of the sensor observations in one partition may
correspond to a single vehicle driving across the area that is monitored. The dissertation
describes a two-phase, learning-based approach to solving this problem. In the first phase,
a quadratic motion model is used to produce an initial classification that is valid for a
short portion of the timeline. In the second phase, Bayesian methods are used to learn the
long-term, unrestricted motion of the underlying objects.
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases
Finally, I consider the problem of answering selection queries in the presence of
uncertainty incorporated through a probabilistic model. One way to facilitate the
representation of uncertainty in a spatiotemporal database is by allowing tuples to
have probabilistic attributes whose actual values are unknown, but are assumed
to be selected by sampling from a specied distribution. This can be supported by
including a few, pre-specified, common distributions in the database system when it is
shipped. However, to be truly general and extensible and support distributions that
cannot be represented explicitly or even integrated, it is necessary to provide an interface
that allows the user to specify arbitrary distributions by implementing a function that
produces pseudo-random samples from the desired distribution. Allowing a user to specify
uncertainty via arbitrary sampling functions creates several interesting technical challenges
during query evaluation. Specifically, evaluating time-instance selection queries such as
"Find all vehicles that are in close proximity to one another with probability p at time
t" requires the principled use of Monte Carlo statistical methods to determine whether
the query predicate holds. To support such queries, the thesis describes new methods
that draw heavily on the relevant statistical theory of sequential estimation. I also
consider the problem of indexing for the Monte Carlo algorithms, so that samples from the
pseudo-random attribute value generator can be pre-computed and stored in a structure in
order to answer subsequent queries quickly.
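To make the Monte Carlo idea concrete, the toy sketch below estimates the probability that a selection predicate holds by repeatedly invoking a user-supplied pseudo-random attribute value generator. It uses a fixed sample size for simplicity; the generator, the predicate, and all names here are hypothetical, and the dissertation's actual algorithms instead rely on the sequential tests developed in Chapter 5.

import random

def estimate_predicate_probability(sample_attribute, predicate, n=10_000):
    """Naive Monte Carlo estimate of P(predicate holds) for a probabilistic
    attribute supplied only as a pseudo-random value generator."""
    hits = sum(1 for _ in range(n) if predicate(sample_attribute()))
    return hits / n

# Hypothetical example: a vehicle's position at time t is modeled as a 2-D
# Gaussian; the query asks whether it lies within 50 meters of a query point.
def sample_position():
    return (random.gauss(100.0, 10.0), random.gauss(200.0, 10.0))

def within_50_meters(pos, query_point=(110.0, 205.0)):
    dx, dy = pos[0] - query_point[0], pos[1] - query_point[1]
    return dx * dx + dy * dy <= 50.0 ** 2

print(estimate_predicate_probability(sample_position, within_50_meters))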
Organization. The rest of this study is organized as follows. Chapter 2 provides
a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the
scalability issue when processing join queries over massive spatiotemporal databases.
Chapter 4 describes an approach to handling the entity-resolution problem in cleaning
spatiotemporal data sources. Chapter 5 describes a simple and general approach to
answering selection queries over spatiotemporal databases that incorporate uncertainty
within a probabilistic model framework (selection queries over probabilistic spatiotemporal
databases). Chapter 6 concludes the dissertation by summarizing the contributions and
identifying potential directions for future work.
CHAPTER 2
BACKGROUND
This section provides a survey of literature related to the three problems addressed in
this dissertation.
2.1 Spatiotemporal Join
Though research in spatiotemporal joins is relatively new, the closely related problem
of processing joins over spatial objects has been extensively studied. The classical paper
in spatial joins is due to Brinkhoff et al. [45]. Their approach assumes the existence
of a hierarchical spatial index, such as an R-tree [16], on the underlying relations. The
join Brinkhoff proposes makes use of a carefully synchronized depth-first traversal of the
underlying indices to narrow down the candidate pairs. A breadth-first strategy with
several additional optimizations is considered by Huang et al. [46]. Lo and Ravishankar
[47] explore a non-index based approach to processing a spatial join. They consider how
to extend the traditional hash join algorithm to the spatial join problem and propose a
strategy based on a partitioning of the database objects into extent mapping hash buckets.
A similar idea, referred to as the partition-based spatial merge (PBSM), is considered
by Patel et al. [48]. Instead of partitioning the input data objects, they consider a grid
partitioning of the data space on to which objects are mapped. This idea is further
extended by Lars et al. [49], where they propose a dynamic partitioning of the input space
into vertical strips. Their strategy avoids the data spill problem encountered by previous
approaches since the strips can be constructed such that they fit within the available main
memory.
A common theme among existing approaches is their use of the plane-sweep [50] as a
fast pruning technique. In the case of index-based algorithms, plane-sweep is used to filter
candidate node pairs enumerated from the traversal. Non-index-based algorithms make
use of the plane-sweep to construct candidate sets over partitions.
To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.
[51]. However, they only consider spatiotemporal join techniques that are straightforward
extensions to traditional spatial join algorithms. Further, they limit their scope to
index-based algorithms for objects over limited time windows.
2.2 Entity Resolution
Research in entity resolution has a long history in databases [52–55] and has focused
mainly on integrating non-geometric, string-based data from noisy external sources. Closely
related to the work in this thesis is the large body of work on target tracking that exists
in fields as diverse as signal processing, robotics, and computer vision. The goal in target
tracking [56, 57] is to support the real-time monitoring and tracking of a set of moving
objects from noisy observations.
Various algorithms to classify observations among objects can be found in the
target tracking literature. They characterize the problem as one of data association (i.e.
associating observations with corresponding targets). A brief summary of the main ideas is
given below.
The seminal work is due to Reid [58], who proposed a multiple hypothesis technique
(MHT) to solve the tracking problem. In the MHT approach, a set of hypotheses is
maintained, with each hypothesis reflecting the belief about the location of an individual
target. When a new set of observations arrive, the hypotheses are updated. Hypotheses
with minimal support are deleted and additional hypotheses are created to reect new
evidence. The main drawback of the approach is that the number of hypotheses can grow
exponentially over time. Though heuristic filters [59–61] can be used to bound the search
space, this limits the scalability of the algorithm.
Target tracking also has been studied using Bayesian approaches [62]. The Bayesian
approach views tracking as a state estimation problem. Given some initial state and a
set of observations, the goal is to predict the object's next state. An optimal solution to
the problem is given by the Bayes filter [63, 64]. Bayes filters produce optimal estimates by
integrating over the complete set of observations. The formulation is often recursive and
involves complex integrals that are difficult to solve analytically. Hence, approximation
schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often
used in practice.
Recently, Markov Chain Monte Carlo (MCMC) [65, 66] techniques have been
proposed. MCMC techniques attempt to approximate the optimal Bayes filter for multiple
target tracking. MCMC-based methods employ sequential MC sampling and are shown to
perform better than existing sub-optimal approaches such as MHT for tracking objects in
highly cluttered environments.
A common theme among most of the research in target tracking is its focus on
accurate tracking and detection of objects in real time in highly cluttered environments
over relatively short time periods. In a data warehouse context, the ability of techniques
such as MCMC to make fine-grained distinctions makes them ideal candidates when
performing operations such as drilldown that involve analytics over small time windows.
Their applicability to entity resolution in a data warehouse is limited, however. In such a
context, summarization and visualization of historical trajectories smoothed over long time
intervals is often more useful. The model-based approach considered in this work seems a
more suitable candidate for such tasks.
2.3 Probabilistic Databases
Uncertainty management in spatiotemporal databases is a relatively new area of
research. Earlier work has focused on aspects of modeling uncertainty and query language
support [9, 67].
In the context of query processing, one of the earliest papers in this area is the
paper by Pfoser et al. [68], where different sources of uncertainty are characterized and
a probability density function is used to model errors. Hosbond et al. [69] extended this
work by employing a hyper-square uncertainty region, which expands over time to answer
queries using a TPR-tree.
Trajcevski et al. [70] study the problem from a modeling perspective. They model
trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries
over trajectories in both space and time. However, the approach does not specify how to
choose the dimensions of the cylindrical region which may have to change over time to
account for shrinking or expanding of the underlying uncertainty region.
Cheng et al. [71] describe algorithms for time instant queries (probabilistic range
and nearest neighbor) using an uncertainty model where a probability density function
(PDF) and an uncertain region are associated with each point object. Given a location in
the uncertain region, the PDF returns the probability of finding the object at that location.
A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle
time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic
process with a time-parametric uniform distribution.
CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
In applications that produce a large amount of data describing the paths of
moving objects, there is a need to ask questions about the interaction of objects
over a long recorded history. In this chapter, the problem of computing joins over
massive moving object histories is considered. The particular join studied is the
Closest-Point-Of-Approach join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another?
3.1 Motivation
Frequently, it is of interest in applications which make use of spatial data to ask
questions about the interaction between spatial objects. A useful operation that enables
one to answer such questions is the spatial join operation. Spatial join is similar to the
classical relational join except that it is defined over two spatial relations based on a
spatial predicate. The objective of the join operation is to retrieve all object pairs that
satisfy a spatial relationship. One common predicate involves distance measures, where
we are interested in objects that were within a certain distance of each other. The query
"Find all restaurants within a distance of 10 miles from a hotel" is an example of a spatial
join.
For moving objects, the spatial join operation involves the evaluation of both a spatial
and a temporal predicate and for this reason the join is referred to as a spatiotemporal
join. For example, consider the spatial relations PLANES and TANKS, where each relation
represents accumulated trajectory data of planes and tanks from a battlefield simulation.
The query "Find all planes that are within a distance of 10 miles of a tank" is an example of a
spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and
the temporal predicate restricts the time period to the current time instance.
In the more general case, the spatiotemporal join is issued over a moving object
history, which contains all of the past locations of the objects stored in a database. For
example, consider the query "Find all pairs of planes that came within a distance of 1,000
feet during their flight paths." Since there is no restriction on the temporal predicate,
answering this query involves an evaluation of the spatial predicate at every time instance.
The amount of data to be processed can be overwhelming. For example, in a typical
flight, the flight data recorder stores about 7 MB of data, which records, among other
things, the position and time of the flight for every second during its operation. Given
that on average the US Air Traffic Control handles around 30,000 flights in a single day,
if all of this data were archived, it would result in 200 GB of data accumulation just
for a single day. For another example, it is not uncommon for scientific simulations to
output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the
references contained therein).
In this chapter, the spatiotemporal join problem for moving object histories in
three-dimensional space, with time considered as the fourth dimension, is investigated. The
spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach
Join). By Closest Point of Approach, we refer to a position at which two moving objects
attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of
the following type: Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that
CPA-distance(p, q) ≤ d. The goal is to retrieve all object pairs that are within a distance
d at their closest-point-of-approach.
Surprisingly, this problem has not been studied previously. The spatial join problem
has been well-studied for stationary objects in two- and three-dimensional space [45, 47–49];
however, very little work related to spatiotemporal joins can be found in the literature.
There has been some work related to joins involving moving objects [75, 76] but the work
has been restricted to objects in a limited time window and does not consider the problem
of joining object histories that may be gigabytes or terabytes in size.
The contributions can be summarized as follows:
Three spatiotemporal join strategies for data involving moving object histories are
presented.
Simple adaptations of existing spatial join processing algorithms, based on the R-tree
structure and on a plane-sweeping algorithm, for spatiotemporal histories are explored.
To address the problems associated with straightforward extensions to these
techniques, a novel join strategy for moving objects based on an extension of the
basic plane-sweeping algorithm is described.
A rigorous evaluation and benchmarking of the alternatives is provided. The
performance results suggest that we can obtain significant speedup in execution time
with the adaptive plane-sweeping technique.
The rest of this chapter is organized as follows: In Section 3.2, the closest point
of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives to
implementing the CPA join using R-trees and plane-sweeping are described. In Section
3.5, a novel adaptive plane-sweeping technique that outperforms competing techniques
considerably is presented. Results from our benchmarking experiments are given in Section
3.6. Section 3.7 outlines related work.
3.2 Background
In this Section, we discuss the motion of moving objects, and give an intuitive
description of the CPA problem. This is followed by an analytic solution to the CPA
problem over a pair of points moving in a straight line.
3.2.1 Moving Object Trajectories
Trajectories describe the motion of objects in a 2- or 3-dimensional space. Real-world
objects tend to have smooth trajectories and storing them for analysis often involves
approximation to a polyline. A polyline approximation of a trajectory connects object
positions, sampled at discrete time instances, by line segments (Figure 3-1).
In a database the trajectory of an object can be represented as a sequence of the form
⟨(t_1, v_1), (t_2, v_2), . . . , (t_n, v_n)⟩, where each v_i represents the position vector of the object at
Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)
time instance t_i. The arity of the vector describes the dimensions of the space. For flight
simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The
position of the moving objects is normally obtained in one of several ways: by sampling or
polling the object at discrete time instances, through use of devices like GPS, etc.
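For concreteness, a trajectory stored this way might look like the minimal sketch below; the class and field names are illustrative only and are not taken from the dissertation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrajectorySample:
    t: float                    # time instance t_i
    v: Tuple[float, ...]        # position vector v_i (arity 2 or 3)

@dataclass
class Trajectory:
    samples: List[TrajectorySample]   # assumed sorted by time

    def segments(self):
        """Return the straight-line segments of the polyline approximation,
        each as a pair of consecutive samples."""
        return list(zip(self.samples, self.samples[1:]))

# A flight-style trajectory with arity 3, sampled at discrete time instances.
flight = Trajectory([
    TrajectorySample(0.0, (0.0, 0.0, 1000.0)),
    TrajectorySample(1.0, (120.0, 35.0, 1010.0)),
    TrajectorySample(2.0, (240.0, 80.0, 1025.0)),
])
for start, end in flight.segments():
    print(start.t, "->", end.t)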
3.2.2 Closest Point of Approach (CPA) Problem
We are now ready to describe the CPA problem. Let CPA(p,q,d) over two straight
line trajectories be evaluated as follows. Assuming the minimum distance attained between
the two objects is given by mindist, we output true if mindist < d (the objects were within distance d
during their motion in space), otherwise false. We refer to the calculation of CPA(p,q,d)
as the CPA problem.
The minimum distance mindist between two objects is the distance between the
object positions at their closest point of approach. It is straightforward to calculate
mindist once the CPA time t_cpa, the time instance at which the objects reached their closest
distance, is known.
We now give an analytic solution to the CPA problem for a pair of objects on a
simple straight-line trajectory.
Calculating the CPA time t_cpa. Figure 3-2 shows the trajectory of two objects p
and q in 2-dimensional space for the time period [t_start, t_end]. The position of these objects
at any time instance t is given by p(t) and q(t). Let their positions at time t = 0 be p_0
and q_0, and let their velocity vectors per unit of time be u and v. The motion equations for
Figure 3-2. Closest Point of Approach Illustration
Figure 3-3. CPA Illustration with trajectories
these two objects are p(t) = p_0 + t·u; q(t) = q_0 + t·v. At any time instance t, the distance
between the two objects is given by d(t) = |p(t) - q(t)|.
Using basic calculus, one can find the time instance at which the distance d(t) is
minimum (when D(t) = d(t)^2 is a minimum). Solving for this time we obtain:

t_cpa = -[(p_0 - q_0) · (u - v)] / |u - v|^2

Given this, mindist is given by |p(t_cpa) - q(t_cpa)|.
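A direct transcription of this calculation might look like the sketch below. The clamping of t_cpa to the segment's time interval is an added practical detail not spelled out above, and all function and variable names are mine rather than the dissertation's.

import numpy as np

def cpa_time(p0, u, q0, v):
    """Time at which two constant-velocity points p(t) = p0 + t*u and
    q(t) = q0 + t*v attain their minimum distance."""
    dv = np.asarray(u, float) - np.asarray(v, float)
    dv2 = float(np.dot(dv, dv))
    if dv2 == 0.0:                       # equal velocities: distance is constant
        return 0.0
    w0 = np.asarray(p0, float) - np.asarray(q0, float)
    return -float(np.dot(w0, dv)) / dv2  # t_cpa = -(p0 - q0).(u - v) / |u - v|^2

def cpa_distance(p0, u, q0, v, t_start=0.0, t_end=1.0):
    """Minimum distance attained within [t_start, t_end], with t_cpa clamped
    to the segment's valid time interval."""
    t = min(max(cpa_time(p0, u, q0, v), t_start), t_end)
    p = np.asarray(p0, float) + t * np.asarray(u, float)
    q = np.asarray(q0, float) + t * np.asarray(v, float)
    return float(np.linalg.norm(p - q)), t

dist, t = cpa_distance(p0=(0, 0, 0), u=(1, 0, 0), q0=(5, 3, 0), v=(0, 1, 0))
print(dist, t)   # minimum separation and the time at which it occurs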
The distance calculation that we described above is applicable only between two
objects on a straight line trajectory. To calculate the distance between two objects on a
polyline trajectory, we apply the same basic technique. For trajectories consisting of a
chain of line-segments, we find the minimum distance by first determining the distance
between each pair of line-segments and then choosing the minimum distance.
As an example, consider Figure 3-3 which shows the trajectory of two objects in
2-dimensional space with time as the third dimension. Each object is represented by
an array that stores the chain of segments comprising the trajectory. The line-segments
are labeled by the array indices. To determine the qualifying pairs, we find the CPA
distance between the line segment pairs (p[1],q[1]), (p[1],q[2]), (p[1],q[3]), (p[2],q[1]),
(p[2],q[2]), (p[2],q[3]), (p[3],q[1]), (p[3],q[2]), (p[3],q[3]), and return the pair with the minimum
distance among all evaluated pairs. The complete code for computing CPA(p,q,d) over
multi-segment trajectories is given as Algorithm 3-1.
Algorithm 1 CPA (Object p, Object q, distance d)
1: mindist = ∞
2: for (i = 1 to p.size) do
3:   for (j = 1 to q.size) do
4:     tmp = CPA-Distance (p[i], q[j])
5:     if (tmp ≤ mindist) then
6:       mindist = tmp
7:     end if
8:   end for
9: end for
10: if (mindist ≤ d) then
11:   return true
12: end if
13: return false
In the next two Sections, we consider two obvious alternatives for computing the
CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and
Q, where CPA(p, q, d) evaluates to true. The first technique we describe makes use of an
underlying R-tree index structure to speed up join processing. The second methodology is
based on a simple plane-sweep.
3.3 Join Using Indexing Structures
Given numerous existing spatiotemporal indexing structures, it is natural to first
consider employing a suitable index to perform the join.
Though many indexing structures exist, unfortunately most are not suitable for the
CPA Join. For example, a large number of indexing structures like the TPR-tree [17],
R^EXP-tree [77], and TPR*-tree [78] have been developed to support predictive queries, where
the focus is on indexing the future position of an object. However, these index structures
are generally not suitable for CPA Join, where access to the entire history is needed.
Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]
are more relevant since they are geared towards answering time-instance queries (in the case
of the MV3R-tree, also short time-interval queries), where all objects alive at a certain time
instance are retrieved. The general idea behind these index structures is to maintain a
separate spatial index for each time instance. However, such indices are meant to store
discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA Join
over continuous trajectories.
3.3.1 Trajectory Index Structures
More relevant are indexing structures specific to moving object trajectory histories
like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation
since they are primarily designed to handle topological queries where access to the entire
trajectory is desired (segments belonging to the same trajectory are stored together). The
problem with TB-trees in the context of the CPA Join is that segments from different
trajectories that are close in space or time will be scattered across nodes. Thus, retrieving
segments in a given time window will require several random I/Os. In the same paper
[21], an STR-tree is introduced that attempts to somewhat balance spatial locality with
trajectory preservation. However, as the authors point out, STR-trees turn out to be a
weak compromise that do not perform better than traditional 3D R-trees [20] or TB-trees.
More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space
statically into non-overlapping cells and uses a separate spatial index for each cell. SETI
might be a good candidate for CPA Join since it preserves spatial and temporal locality.
However, there are several reasons why SETI is not the most natural choice for a CPA
Join:
It is not clear that SETI's forest scales to three-dimensional space. A 25 × 25 SETI
grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in
three dimensions.
SETI's grid structure is an interesting idea for addressing problems with high
variance in object speeds (we will use a related idea for the adaptive plane-sweep
algorithm described later). However, it is not clear how to size the grid for a given
data set, and sizing it for a join seems even harder. It might very well be that
relation R should have a different grid for R ⋈ S compared to R ⋈ T.
For a CPA Join over a limited history, SETI has no way of pruning the search space,
since every cell will have to be searched.
3.3.2 R-tree Based CPA Join
Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree
[16]. The R-tree [16] is a hierarchical, multi-dimensional index structure that is commonly
used to index spatial objects. The join problem has been studied extensively for R-trees
and several spatial join techniques exist [45, 46, 79] that leverage underlying R-tree
index structures to speed up join processing. Hence, our first inclination is to consider a
spatiotemporal join strategy that is based on R-trees. The basic idea is to index object
histories using R-trees and then perform a join over these indices.
The R-Tree Index
It is a very straightforward task to adapt the R-tree to index a history of moving
object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,
the four-dimensional line segments making up each individual object trajectory are simply
treated as individual spatial objects and indexed directly by the R-tree. The R-tree and
its associated insertion or packing algorithms are used to group those line segments into
disk-page sized groups, based on proximity in their four-dimensional space. These pages
make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed
by computing the minimum bounding rectangle that encloses the set of objects stored in
each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are
themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional
space is depicted in Figure 3-4.
Figure 3-4. Example of an R-tree
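As a rough sketch of the leaf-level organization described above, the helper functions below compute the 4-D bounding rectangles of trajectory segments and of a group of segments stored together in one leaf page. These are simplified stand-ins; the actual implementation described later in this section packs disk pages with the STR algorithm.

def segment_mbr(seg):
    """4-D minimum bounding rectangle (t, x, y, z) of one trajectory segment,
    where seg = ((t0, x0, y0, z0), (t1, x1, y1, z1))."""
    lo = tuple(min(a, b) for a, b in zip(*seg))
    hi = tuple(max(a, b) for a, b in zip(*seg))
    return lo, hi

def group_mbr(mbrs):
    """MBR enclosing a group of MBRs, e.g. the segments stored in one leaf page."""
    los, his = zip(*mbrs)
    return (tuple(min(c) for c in zip(*los)), tuple(max(c) for c in zip(*his)))

segments = [
    ((0.0, 0.0, 0.0, 100.0), (1.0, 10.0, 2.0, 100.0)),
    ((1.0, 10.0, 2.0, 100.0), (2.0, 18.0, 7.0, 105.0)),
]
print(group_mbr([segment_mbr(s) for s in segments]))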
Basic CPA Join Algorithm Using R-Trees
Assuming that the two spatiotemporal relations to be joined are organized using
R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.
The common approach to joins using R-trees employs a carefully controlled, synchronized
traversal of the two R-trees to be joined. The pruning power of the R-tree index arises
from the fact that if two bounding rectangles R_1 and R_2 do not satisfy the join predicate,
then the join predicate is not satisfied between any two bounding rectangles that can be
enclosed within R_1 or R_2.
In a synchronized technique, both the R-trees are simultaneously traversed retrieving
object-pairs that satisfy the join predicate. To begin with, the root nodes of both the
R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing
up every entry of the first node with every entry in the second node to form the candidate
set for further expansion. Each pair in the candidate set that satisfies the join predicate is
pushed into the queue for subsequent processing. The strategy described leads to a BFS
(Breadth-First-Search) expansion of the trees. BFS-style traversal lends itself to global
optimization of the join processing steps [46] and works well in practice.
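A skeletal version of this queue-driven, synchronized expansion is sketched below. It assumes both trees have the same height, in-memory nodes with a children list, and a pair_qualifies predicate standing in for the bounding-rectangle distance test; none of these names are taken from the dissertation.

from collections import deque
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:
    is_leaf: bool
    children: List[Any] = field(default_factory=list)  # child Nodes, or data entries at a leaf

def synchronized_rtree_join(root_p, root_q, pair_qualifies, emit_result):
    """BFS-style synchronized traversal of two R-trees of equal height.
    pair_qualifies(a, b) is the join predicate applied to bounding rectangles
    (or, at the leaves, to trajectory segments); emit_result reports matches."""
    queue = deque([(root_p, root_q)])
    while queue:
        node_p, node_q = queue.popleft()
        for a in node_p.children:
            for b in node_q.children:
                if not pair_qualifies(a, b):
                    continue                  # pruned: cannot satisfy the predicate
                if node_p.is_leaf and node_q.is_leaf:
                    emit_result(a, b)         # candidate segment pair
                else:
                    queue.append((a, b))      # qualifying node pair: expand further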
Figure 3-5. Heuristic to speed up distance computation
The distance routine is used in evaluating the join predicate to determine the distance
between two bounding rectangles associated with a pair of nodes. A node-pair qualifies
for further expansion if the distance between the pair is less than the limiting distance d
supplied by the query.
Heuristics to Improve the Basic Algorithm
The basic join algorithm can be improved in several ways by using standard
and non-standard techniques for reducing I/O and CPU costs over spatial joins. These
include:
Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation
when pairs of nodes are expanded and their children are checked for possible
matches.
Carefully considering the processing of node pairs so that when each pair is
considered, one or both of the nodes are in the buffer [46].
Avoiding expensive distance computations by applying heuristic filters. Computing
the distance between two 3-dimensional rectangles can be a very costly operation,
since the closest points may be on arbitrary positions on the faces of the rectangles.
To speed this computation, the magnitudes of the diagonals of the two rectangles
(d_1 and d_2) can be computed first. Next, we pick an arbitrary point from both of
the rectangles (points P_1 and P_2), and compute the distance between them, called
d_arbit. If d_arbit - d_1 - d_2 > d_join, then the two rectangles cannot contain any points
as close as d_join from one another and the pair can be discarded, as shown in Figure
3-5. This provides for immediate dismissals with only three distance computations
(or one if the diagonal distances are precomputed and stored with each rectangle).
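In code, this quick-reject test might look like the following sketch (names are illustrative; in practice the corners and diagonal lengths would be precomputed and stored with each rectangle).

import math

def diagonal(rect):
    """Length of the main diagonal of an axis-aligned 3-D rectangle
    given as (lo_corner, hi_corner)."""
    lo, hi = rect
    return math.dist(lo, hi)

def cannot_be_within(rect1, rect2, d_join):
    """Cheap dismissal test: pick one arbitrary corner from each rectangle; if
    their distance minus both diagonals still exceeds d_join, no pair of points
    drawn from the two rectangles can be within d_join of one another."""
    d_arbit = math.dist(rect1[0], rect2[0])
    return d_arbit - diagonal(rect1) - diagonal(rect2) > d_join

r1 = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
r2 = ((50.0, 50.0, 50.0), (52.0, 51.0, 50.5))
print(cannot_be_within(r1, r2, d_join=10.0))   # True: the pair can be discarded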
Figure 3-6. Issues with R-trees: fast moving object p joins with everyone
In addition, there are some obvious improvements to the algorithm that can be made
which are specific to the 4-dimensional CPA Join:
The fourth dimension, time, can be used as an initial filter. If two MBRs or line
segments do not overlap in time, then the pair cannot possibly be a candidate for a
CPA match (a short sketch of this check follows this list).
Since time can be used to provide for immediate dismissals without Euclidean
distance calculations, it is given priority over the other attributes. For example, when a
plane-sweep is performed to prune an all-pairs CPA distance computation, time is
always chosen as the sweeping axis. The reason is that time will usually have the
greatest pruning power of any attributes since time-based matches must always be
exact, regardless of the join distance.
In our implementaion of the CPA Join for R-trees, we make use of the STR packing
algorithm [80] to build the trees. Because the potential pruning power of the time
dimension is greatest, we ensure that the trees are well-organized with respect to
time by choosing time as the rst packing dimension.
Problem With R-tree CPA Join
Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem
of computing spatiotemporal joins over moving object histories. R-trees have a problem
handling databases with a high variance in object velocities. The reason for this is that
join algorithms which make use of R-trees rely on tight and well-behaved minimum
bounding rectangles to speed the processing of the join. When the positions of a set of
moving objects are sampled at periodic intervals, fast moving objects tend to produce
larger bounding rectangles than slow moving objects.
Figure 3-7. Progression of plane-sweep
One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects
on a 2-D plane for a given time period. A fast moving object such as p will be contained
in a very large MBR, while slower objects such as q will be contained in much smaller
MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR
associated with p can overlap many smaller MBRs, and each overlap will result in an
expensive distance computation (even if the objects do not travel close to one another).
Thus, any sort of variance in object velocities can adversely affect the performance of the join.
3.4 Join Using Plane-Sweeping
The second technique that is considered is a join strategy based on a simple plane-
sweep. Plane-sweep is a powerful technique for solving proximity problems involving
geometric objects in a plane and has previously been proposed [49] as a way to efficiently compute the spatial join operation.
3.4.1 Basic CPA Join using Plane-Sweeping
Plane-sweep is an excellent candidate for use with the CPA join because no matter
what distance threshold is given as input into the join, two objects must overlap in the
time dimension for there to be a potential CPA match. Thus, given two spatiotemporal
relations P and Q, we could easily base our implementation of the CPA join on a
plane-sweep along the time dimension.
We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep a vertical line along the time dimension. A sweepline data structure D is maintained which keeps track of all line segments that are valid given the current position of the line along the time dimension. As the sweepline progresses, D is updated with insertions (new segments that become active) and deletions (segments whose time period has expired). Any pair of segments from the two input relations that satisfies the join predicate will at some point be simultaneously present in D, so qualifying pairs can be checked for and reported during updates to D. Pseudo-code for the algorithm is given below:
Algorithm 2 PlaneSweep(Relation P, Relation Q, distance d)
1: Form a single list L containing segments from P and Q, sorted by t_start
2: Initialize sweepline data structure D
3: while not IsEmpty(L) do
4:   Segment top = popFront(L)
5:   Insert(D, top)
6:   Delete from D all segments s such that s.t_end < top.t_start (remove segments that do not intersect the sweepline)
7:   Query(D, top, d) (report segments in D that are within distance d of top)
8: end while
In the case of the CPA join, assuming that all moving objects at any given moment can be stored in main memory, any of a number of data structures can be used to implement D, such as a quad-tree or oct-tree, or an interval skip-list [81]. The main requirement is that the selected data structure should make it easy to check the proximity of objects in space.
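The sketch below follows Algorithm 2, but keeps D as a plain Python list instead of a spatial index and leaves the CPA test to a caller-supplied `within_cpa(s1, s2, d)` routine; it is meant to illustrate the control flow under those assumptions, not an optimized implementation.

```python
def cpa_plane_sweep(P, Q, d, within_cpa):
    """Time-sweep CPA join.  Each segment is a dict with at least 't_start',
    't_end' and 'rel' ('P' or 'Q'), plus whatever geometry within_cpa needs."""
    L = sorted(P + Q, key=lambda s: s['t_start'])
    D = []                                            # sweepline structure
    for top in L:
        # drop segments that no longer intersect the sweepline
        D = [s for s in D if s['t_end'] >= top['t_start']]
        # report qualifying pairs between top and segments of the other relation
        for s in D:
            if s['rel'] != top['rel'] and within_cpa(s, top, d):
                pair = (s, top) if s['rel'] == 'P' else (top, s)
                yield pair
        D.append(top)
```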
3.4.2 Problem With The Basic Approach
Although the plane-sweep approach is simple, in practice it is usually too slow to be useful for processing moving object queries. The problem has to do with how the sweepline progression takes place. As the sweepline moves through the data space, it has to stop momentarily at sample points (time instances at which object positions were recorded) to process newly encountered segments into the data structure D.
Figure 3-8. Layered Plane-Sweep
New segments that are encountered at the sample point are added into the data structure, and segments in D that are no longer active are deleted from it.
Consequently, the sweepline pauses more often when objects with high sampling rates are present, and the progress of the sweepline is heavily influenced by the sampling rates of the underlying objects. For example, consider Figure 3-7, which shows the trajectories of four objects over a given time period. In the case illustrated, object p_2 controls the progression of the sweepline. Observe that in the time interval [t_start, t_end], only new segments from object p_2 get added to D, but expensive join computations are performed each time against the same set of line segments.
The net result is that if the sampling rate of a data set is very high relative to the
amount of object movement in the data set, then processing a multi-gigabyte object
history using a simple plane-sweeping algorithm may take a prohibitively long time.
3.4.3 Layered Plane-Sweep
One way to address this problem is to reduce the number of segment-level comparisons by comparing the regions of movement of various objects at a coarser level. For example, reconsider the CPA join depicted in Figure 3-7. If we were to replace the many oscillations of object p_2 with a single minimum bounding rectangle which enclosed all of those oscillations from t_start to t_end, we could then use that rectangle during the plane-sweep as an initial approximation to the path of object p_2. This would potentially save many distance computations.
This idea can be taken to its natural conclusion by constructing a minimum bounding
box that encompasses the line-segments of each object. A plane-sweep is then performed
over the bounding boxes, and only qualifying boxes are expanded further. We refer to this
technique as the Layered Plane-Sweep approach, since the plane-sweep is performed at two layers: first at the coarser level of bounding boxes and then at the finer level of individual line segments.
One issue that must be considered is how much movement is to be summarized
within the bounding rectangle for each object. Since we would like to eliminate as many
comparisons as possible, one natural choice would be to let the available system memory
dictate how much movement is covered for each object. Given a fixed buffer size, the algorithm will proceed as follows.
Algorithm 3 LayeredPlaneSweep(Relation P, Relation Q, distance d)
1: Segment s is defined by [(x_start, x_end), (y_start, y_end), (z_start, z_end), (t_start, t_end)]
2: Assume a sorted list of object segments (by t_start) on disk
3: while there is still some unprocessed data do
4:   Read in enough data from P and Q to fill the buffer
5:   Let t_start be the first time tick which has not yet been processed by the plane-sweep
6:   Let t_end be the last time tick for which no data is still on disk
7:   Next, bound the trajectory of every object present in the buffer by an MBR
8:   Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
9:   Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10:  Sort the line segments by t_start
11:  Perform a final sweep along the time dimension to get the final result set
12: end while
Figure 3-8 illustrates the idea. It depicts the snapshot of object trajectories starting at some time instance t_start. Segments in the interval [t_start, t_end] represent the maximum that can be buffered in the available memory. A first-level plane-sweep is carried out over the bounding boxes to eliminate false positives. Qualifying objects are expanded and a second-level plane-sweep is carried out over individual line segments. In the best case, there is an opportunity to process the entire data set through just three comparisons at the MBR level.
Figure 3-9. Problem with using large granularities for bounding box approximation
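A compact sketch of the layered idea is shown below. For brevity the first level is written as an all-pairs MBR check rather than a sweep along a spatial dimension, and the helpers `mbr_of`, `mbrs_within`, and `within_cpa` are assumed to be supplied by the caller; only the overall two-level structure is intended to match the algorithm above.

```python
def layered_sweep(buffer_P, buffer_Q, d, within_cpa, mbr_of, mbrs_within):
    """Two-level sweep over one buffer-full of data.  buffer_P and buffer_Q
    map an object id to the list of its buffered segments (dicts as in the
    basic sweep sketch)."""
    boxes_P = {oid: mbr_of(segs) for oid, segs in buffer_P.items()}
    boxes_Q = {oid: mbr_of(segs) for oid, segs in buffer_Q.items()}
    for p_id, p_box in boxes_P.items():
        for q_id, q_box in boxes_Q.items():
            if not mbrs_within(p_box, q_box, d):
                continue                          # dismissed at the MBR level
            # expand only qualifying object pairs down to raw line segments
            segs = sorted(buffer_P[p_id] + buffer_Q[q_id],
                          key=lambda s: s['t_start'])
            D = []
            for top in segs:
                D = [s for s in D if s['t_end'] >= top['t_start']]
                for s in D:
                    if s['rel'] != top['rel'] and within_cpa(s, top, d):
                        yield (p_id, q_id, s, top)
                D.append(top)
```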
3.5 Adaptive Plane-Sweeping
While the layered plane-sweep typically performs far better than the basic plane-sweeping
algorithm, it may not always choose the proper level of granularity for the bounding box
approximations. This Section describes an adaptive strategy that carefully considers the underlying object interaction dynamics and adjusts this granularity dynamically in response to the data characteristics.
3.5.1 Motivation
In the simple layered plane-sweep, the granularity for the bounding box approximation
is always dictated by the available system memory. The assumption is that pruning power
increases monotonically with increasing granularity. Unfortunately, this is not always the
case. As a motivating example, consider Figure 3-9. Assume available system memory
allows us to buffer all the line segments. In this case, the layered plane-sweep performs no better than the basic plane-sweep, due to the fact that all the object bounding boxes overlap with each other and as a result no pruning is achieved at the first-level plane-sweep.
However, assume we had instead fixed the granularity to correspond to the time period [t_start, t_i], as depicted in Figure 3-10. In this case, none of the bounding boxes overlap, and there are possibly many dismissals at the first level. Though less of the buffer is processed initially, we are able to eliminate many of the segment-level distance comparisons compared to a technique that bounds the entire time period, thereby potentially increasing the efficiency of the algorithm. The entire buffer can then be processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency of the layered plane-sweep is tied not to the granularity of the time interval that is processed, but to the granularity that minimizes the number of distance comparisons.
3.5.2 Cost Associated With a Given Granularity
Since distance computations dominate the time required to compute a typical CPA Join, the cost associated with a given granularity can be approximated as a function of the number of distance comparisons that are needed to process the segments encompassed in that granularity. Let n_MBR be the number of distance computations at the box level, let n_seg be the number of distance calculations at the segment level, and let α be the fraction of the time range in the buffer which is processed at once. Then the cost associated with processing that fraction of the buffer can be estimated as:

cost_α = (n_seg + n_MBR) · (1/α)

This function reflects the fact that if we choose a very small value for α, we will have to process many cut points in order to process the entire buffer, which can increase the cost of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep. On the other hand, choosing a very large value for α tends to increase (n_seg + n_MBR), eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In practice, the optimal value for α lies somewhere in between the two extremes, and varies from data set to data set.
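The sketch below shows how this cost model can drive the choice of a fraction; the `estimate_counts` callback, assumed here, stands for whatever procedure supplies the box-level and segment-level comparison counts for a candidate fraction (in this chapter they are estimated by the sampling method of Section 3.5.4).

```python
def cost(alpha, estimate_counts):
    """cost_alpha = (n_seg + n_MBR) * (1 / alpha): the comparisons needed for
    the fraction, scaled by roughly how many such fractions fill the buffer."""
    n_mbr, n_seg = estimate_counts(alpha)
    return (n_seg + n_mbr) / alpha

def best_fraction(candidate_alphas, estimate_counts):
    """Greedy choice: the candidate fraction with the smallest estimated cost."""
    return min(candidate_alphas, key=lambda a: cost(a, estimate_counts))
```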
3.5.3 The Basic Adaptive Plane-Sweep
Figure 3-10. Adaptively varying the granularity
Given this cost function, it is easy to design a greedy plane-sweep algorithm that attempts to repeatedly minimize cost_α in order to adapt to the underlying (and potentially time-varying) characteristics of the data. At every iteration, the algorithm simply chooses to process the fraction of the buffer that appears to minimize the overall cost of the plane-sweep in terms of the expected number of distance computations. The algorithm is given below:
Algorithm 4 AdaptivePlaneSweep(Relation P, Relation Q, distance d)
1: while there is still some unprocessed data do
2:   Read in enough data from P and Q to fill the buffer
3:   Let t_start be the first time tick which has not yet been processed by the plane-sweep
4:   Let t_end be the last time tick for which no data is still on disk
5:   Choose α so as to minimize cost_α
6:   Perform a layered plane-sweep from time t_start to t_start + α·(t_end − t_start) (steps 5-9 of procedure LayeredPlaneSweep)
7: end while
Unfortunately, there are two obvious difficulties involved with actually implementing the above algorithm:
- First, the cost cost_α associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.
- Second, even if we can compute cost_α easily, it is not obvious how we can compute cost_α for all values of α from 0 to 1 so as to minimize cost_α over all α.
These two issues are discussed in detail in the next two Sections.
3.5.4 Estimating cost_α
This Section describes how to efficiently estimate cost_α for a given α using a simple, online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].
At a high level, the idea is as follows. To estimate cost_α, we begin by constructing bounding rectangles for all of the objects in P, considering their trajectories from time t_start to t_start + α·(t_end − t_start). These rectangles are then inserted into an in-memory index, just as if we were going to perform a layered plane-sweep. Next, we randomly choose an object q_1 from Q, and construct a bounding box for its trajectory as well. This object is joined with all of the objects in P by using the in-memory index to find all bounding boxes within distance d of q_1. Then:
- Let n_{MBR,q_1} be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q_1, and
- Let n_{seg,q_1} be the total number of distance computations that would have been needed to compute the CPA distance between q_1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q_1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).
Once n_{MBR,q_1} and n_{seg,q_1} have been computed for q_1, the process can be repeated for a second randomly selected object q_2 ∈ Q, for a third object q_3, and so on. A key observation is that after m objects from Q have been processed, the value

μ_m = (1/m) · Σ_{i=1}^{m} (n_{MBR,q_i} + n_{seg,q_i}) · |Q|

represents an unbiased estimator for (n_MBR + n_seg) at α, where |Q| denotes the number of data objects in Q.
In practice, however, we are not only interested in μ_m. We would also like to know at all times just how accurate our estimate μ_m is, since at the point where we are satisfied with our guess as to the real value of cost_α, we want to stop the process of estimating cost_α and continue with the join.
Fortunately, the central limit theorem can easily be used to estimate the accuracy of μ_m. Assuming sampling with replacement from Q, for large enough m the error of our estimate will be normally distributed around (n_MBR + n_seg) with variance

σ²_m = (1/m) · σ²(Q),

where σ²(Q) is defined as

σ²(Q) = (1/|Q|) · Σ_{i=1}^{|Q|} ( (n_{MBR,q_i} + n_{seg,q_i}) · |Q| − (n_MBR + n_seg) )²
Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Q_m) = (1/(m − 1)) · Σ_{i=1}^{m} ( (n_{MBR,q_i} + n_{seg,q_i}) · |Q| − μ_m )²

(Q_m denotes the sample of Q that is obtained after m objects have been randomly retrieved from Q). Substituting into the expression for σ²_m, we can treat μ_m as a normally distributed random variable with variance σ̂²(Q_m)/m.
In our implementation of the adaptive plane-sweep, we continue the sampling process until our estimate for cost_α is accurate to within 10% at 95% confidence. Since 95% of the standard normal distribution falls within two standard deviations of the mean, this is equivalent to sampling until 2·√(σ̂²(Q_m)/m) is less than 0.1·μ_m.
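A sketch of this stopping rule is given below. `count_for_object` is a hypothetical helper that returns n_{MBR,q} + n_{seg,q} for one sampled object q (computed as described in the two bullets above); sampling stops once the two-standard-deviation bound drops below 10% of the running mean.

```python
import math
import random

def estimate_total_comparisons(alpha, objects_Q, count_for_object, rel_err=0.1):
    """Online estimate of (n_MBR + n_seg) at fraction alpha by sampling objects
    of Q with replacement, following the estimator and variance given above."""
    samples = []
    scale = len(objects_Q)                       # the |Q| factor
    while True:
        q = random.choice(objects_Q)             # sampling with replacement
        samples.append(count_for_object(q, alpha) * scale)
        m = len(samples)
        mean = sum(samples) / m
        if m >= 2:
            var = sum((s - mean) ** 2 for s in samples) / (m - 1)
            if 2.0 * math.sqrt(var / m) <= rel_err * mean:
                return mean                      # estimate of n_MBR + n_seg
```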
3.5.5 Determining The Best Cost
We now address the second issue: how to compute cost_α for all values of α from 0 to 1 so as to minimize cost_α over all α.
Calculating cost_α for all possible values of α is prohibitively expensive and hence not feasible in practice. Fortunately, in practice we do not have to evaluate all values of α to determine the best α. This is due to the following interesting fact: if we plot all possible values of α and their respective associated costs, we would observe that the resulting curve is not linear, but dips to a minimum. The region around this minimum represents a sweet spot and constitutes the feasible region for the best cost.
As an example, consider Figure 3-11, which shows the plot of the cost function for various fractions α for one of the experimental data sets from Section 3.6.
Figure 3-11. Convexity of cost function illustration (estimated number of distance computations vs. percentage of the buffer, for k = 20)
Given this fact, we identify the feasible region by evaluating cost_{α_i} for a small number k of α_i values. Given k, the number of allowed cut points, the fraction α_1 can be determined as follows:

α_1 = r^{1/k} / r

where r = (t_end − t_start) is the time range described by the buffer (the above formula assumes that r is greater than one; if not, then the time range should be scaled accordingly). In the general case, the fraction of the buffer considered by any α_i (1 ≤ i ≤ k) is given by:

α_i = (r·α_1)^i / r
Note that since the growth rate of each subsequent α_i is exponential, we can cover the entire buffer with just a small k and still guarantee that we will consider some value of α_i that is within a factor of α_1 from the optimal. After computing α_1, α_2, ..., α_k, we successively evaluate increasing buffer fractions α_1, α_2, α_3, and so on, and determine their associated costs. From these k costs we determine the α_i with the minimum cost.
Note that if we choose α based on the evaluation of a small k, then it is possible that the optimal choice of α may lie outside the feasible region. However, there is a simple approach to solving this issue. After an initial evaluation of k granularities, consider just the region starting before and ending after the best of the k candidates, and recursively reapply the evaluation described above within this region.
Figure 3-12. Iteratively evaluating k cut points
For instance, assume we chose α_i after evaluation of k cut points in the time range r. To further tune this α_i, we consider the time range defined between the adjacent cut points α_{i−1} and α_{i+1} and recursively apply cost estimation in this interval (i.e., we evaluate k points in the time range (t_start + α_{i−1}·r, t_start + α_{i+1}·r)). Figure 3-12 illustrates the idea. This approach is simple and very effective in considering a large number of choices of α.
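The candidate fractions, and one round of refinement between the cut points adjacent to the current best, can be computed as in the sketch below (assuming, as stated above, a buffered time range greater than one).

```python
def cut_point_fractions(t_start, t_end, k):
    """alpha_i = r**(i/k) / r for i = 1..k, where r = t_end - t_start (> 1),
    so successive candidates grow geometrically and alpha_k covers the buffer."""
    r = float(t_end - t_start)
    return [r ** (i / k) / r for i in range(1, k + 1)]

def refine_fractions(t_start, t_end, alpha_lo, alpha_hi, k):
    """Re-apply the same scheme inside the sub-range bracketing the best cut
    point, and map the results back to fractions of the full buffer."""
    r = t_end - t_start
    sub_start = t_start + alpha_lo * r
    sub_end = t_start + alpha_hi * r             # sub-range also assumed > 1
    return [alpha_lo + f * (alpha_hi - alpha_lo)
            for f in cut_point_fractions(sub_start, sub_end, k)]
```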
3.5.6 Speeding Up the Estimation
Restricting the number of candidate cut points can help keep the time required to find a suitable value for α manageable. However, if the estimation is not implemented carefully, the time required to consider the cost at each of the k possible time periods can still be significant.
The most obvious method for estimating cost_α for each of the k granularities would be to simply loop through each of the associated time periods. For each time period, we would build bounding boxes around each of the trajectories of the objects in P, and then sample objects from Q as described in Section 3.5.4 until the cost was estimated with sufficient accuracy.
However, this simple algorithm results in a good deal of repeated work for each time
period, and can actually decrease the overall speed of the adaptive plane-sweep compared
to the layered plane-sweep. A more intelligent implementation can speed the optimization
process considerably.
In our implementation, we maintain a table of all the objects in P and Q, organized on the ID of each object. Each entry in the table points to a linked list that contains a chain of MBRs for the associated object. Each MBR in the list bounds the trajectory of the object for one of the k time periods considered during the optimization, and the MBRs in each list are sorted from the coarsest of the k granularities to the finest. The data structure is depicted in Figure 3-13.
Given this structure, we can estimate cost_α for each of the k values of α in parallel, with only a small hit in performance associated with an increased value for k. Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process described in Section 3.5.4 is first evaluated at the coarsest granularity, corresponding to α_k. If the two MBRs are within distance d of one another, then the cost estimate for α_k is updated, and the evaluation is then repeated at the second coarsest granularity α_{k−1}. If there is again a match, then the cost estimate for α_{k−1} is updated as well. The process is repeated until there is not a match. As soon as we find a granularity at which the MBRs for p and q are not within a distance d of one another, we can stop the process, because if the MBRs for p and q are not within distance d for the time period associated with α_i, then they cannot be within this distance for any time period α_j where j < i.
The benefit of this approach is that in cases where the data are well-behaved and the optimization process tends to choose a value for α that causes the entire buffer to be processed at once, a quick check of the distance between the outer-most MBRs of p and q is the only geometric computation needed to process p and q, no matter what value of k is chosen. A sketch of this coarse-to-fine check is given below.
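The early-terminating check over a pair of per-object MBR chains can be written as below; `mbrs_within` is an assumed distance test on two MBRs, and the caller uses the returned count to update the cost estimates of the granularities that matched.

```python
def coarse_to_fine_matches(p_chain, q_chain, d, mbrs_within):
    """p_chain and q_chain list each object's MBRs from the coarsest
    granularity (alpha_k) down to the finest (alpha_1).  Returns how many
    granularities the pair matches at; as soon as one level fails, every
    finer (nested) level must fail too, so the scan stops early."""
    matches = 0
    for p_box, q_box in zip(p_chain, q_chain):
        if not mbrs_within(p_box, q_box, d):
            break
        matches += 1
    return matches
```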
The bounding box approximations themselves can be formed while the system buffer is being filled with data from disk. As trajectory data are read from disk, we grow the MBRs for each α_i progressively. Since each α_i represents a fraction of the buffer, the updates to its MBR can be stopped as soon as that fraction of the buffer has been filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is consumed and to expand them when the buffer is refilled.
Figure 3-13. Speeding up the Optimizer (each object's table entry points to a chain of MBRs, one per granularity α_1, ..., α_k)
3.5.7 Putting It All Together
In our implementation of the adaptive plane-sweep, data are fetched in blocks and stored in the system buffer. Then an optimizer routine is called which evaluates k granularities and returns the granularity with the minimum cost. Data in the granularity chosen by the optimizer are then evaluated using the LayeredPlaneSweep routine (the procedure described in Section 3.4.3). When the LayeredPlaneSweep routine returns, the buffer is refilled and the process is repeated. The techniques described in the previous Section are utilized to make the optimizer implementation fast and efficient.
3.6 Benchmarking
This section presents experimental results comparing the various methods discussed so
far for computing a spatiotemporal CPA Join: an R-tree, a simple plane-sweep, a layered
plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is organized as follows. First, a description of the three three-dimensional temporal data sets used to test the algorithms is given. This is followed by the actual experimental results and a detailed discussion analyzing the experimental data.
3.6.1 Test Data Sets
The first two data sets that we use to test the various algorithms result from two physics-based, N-body simulations. In both data sets, constituent records occupy 80B on disk (80B is the storage required to record the object ID, time information, as well as the position and motion of the object). Each data set is around 50 gigabytes in size. The two data sets are as follows:
Figure 3-14. Injection data set at time tick 2,650
1. The Injection data set. This data set is the result of a simulation of the injection of
two gasses into a chamber through two nozzles on the opposite sides of the chamber
via the depression of pistons behind each of the nozzles. Each gas cloud is treated as
one of the input relations to the CPA-join. In addition to heat energy transmitted
to the gas particles via the depression of the pistons, the gas particles also have an
attractive charge. The purpose of the join is to determine the speed of the reaction
resulting from the injection of the gasses into the chamber, by determining the
number of (near) collisions of the gas particles moving through the chamber. Both
data sets consist of 100,000 particles, and the positions of the particles are sampled
at 3,500 individual time ticks, resulting in two relations that are around 28 gigabytes
in size each. During the first 2,500 time ticks, for the most part both gasses are simply compressed in their respective cylinders. After tick 2,500, the majority of the particles begin to be ejected from the two nozzles. A small sample of the particles in the data set is depicted in Figure 3-14, at time tick 2,650.
2. The Collision data set. This data set is the result of an N-body gravitational simulation of the collision of two small galaxies. Again, both galaxies contain around 100,000 star systems, and the positions of the systems in each galaxy are polled at 3,000 different time ticks. The size of the relations tracking each galaxy is around 24 gigabytes each. For the first 1,500 or so time ticks, the two galaxies merely approach one another. For the next thousand time ticks, there is an intense interaction as they pass through one another. During the last few hundred time ticks, there is less interaction as the two galaxies have largely gone through one another. The purpose of the CPA Join is to find pairs of star systems that approached closely enough to have a strong gravitational interaction. A small sample of the galaxies in the simulation is depicted in Figure 3-15, at time tick 1,500.
Figure 3-15. Collision data set at time tick 1,500
In addition, we test a third data set created using a simple, 3-dimensional random walk. We call this the Synthetic data set (this data set was again about 50GB in size). The speed of the various objects varies considerably during the walk. The purpose of including this data set is to rigorously test the adaptability of the adaptive plane-sweep, by creating a synthetic data set where there are significant fluctuations in the amount of interaction among objects as a function of time.
3.6.2 Methodology and Results
All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The
experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk.
For all three of the data sets, we tested an R-tree-based CPA Join (implemented as described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an R-tree for each input relation), a simple plane-sweep (implemented as described in Section 3.4), and a layered plane-sweep (implemented as described in Section 3.4.3).
We also tested the adaptive plane-sweep algorithm, implemented as described in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the two relevant parameter settings on the efficiency of the algorithm.
Figure 3-16. Injection data set experimental results (wall-clock time taken vs. percentage of the join completed, for the R-tree, simple sweep, layered sweep, and adaptive sweep with k = 5, 10, and 20, each with an additional recursive call)
These settings are the number of cut points k considered at each level of the optimization performed by the algorithm, as well as the number of recursive calls made to the optimizer. In our experiments, we used k values of 5, 10, and 20, and we tested using either a single recursive call or no recursive calls to the optimizer.
The results of our experiments are plotted in Figures 3-16 through 3-21. Figures 3-16, 3-17, and 3-20 show the progress of the various algorithms as a function of time, for each of the three data sets (only Figure 3-16 depicts the running time of the adaptive plane-sweep making use of a recursive call to the optimizer). For the various plane-sweep-based joins, the x-axis of these plots shows the percentage of the join that has been completed, while the y-axis shows the wall-clock time required to reach that point in the completion of the join. For the R-tree-based join (which does not progress through virtual time in a linear fashion) the x-axis shows the fraction of the MBR-MBR pairs that have been evaluated at each particular wall-clock time instant. These values are normalized so that they are comparable with the progress of the plane-sweep-based joins.
Figure 3-17. Collision data set experimental results (wall-clock time taken vs. percentage of the join completed)
Figures 3-18, 3-19, and 3-21 show the buffer-size choices made by the adaptive plane-sweeping algorithm using k = 20 and no recursive calls to the optimizer, as a function of time, for all three test data sets.
3.6.3 Discussion
On all three data sets, the R-tree was clearly the worst option. The R-tree indices
were not able to meaningfully restrict the number of leaf-level pairs that needed to be
expanded during join processing. This results in a huge number of segment pairs that
must be evaluated. It may have been possible to reduce this cost by using a smaller page
size (we used 64KB pages, a reasonable choice for a modern 15,000 RPM hard disk with a
fast sequential transfer rate), but reducing the page size is a double-edged sword. While it
may increase the pruning power in the index and thus reduce the number of comparisons,
it may also increase the number of random I/Os required to process the join, since there
will be more pages in the structure. Unfortunately, however, it is not possible to know
the optimal page size until after the index is created and the join has been run, a clear
weakness of the R-tree.
The standard plane-sweep and the layered plane-sweep performed comparably on the three data sets we tested, and both far outperformed the R-tree. It is interesting to note that the standard plane-sweep performed well when there was much interaction among the input relations (when the gasses are expelled from the nozzles in the Injection data set and when the two galaxies overlap in the Collision data set).
Figure 3-18. Buffer size choices for Injection data set (percentage of buffer consumed vs. virtual timeline in the data set, k = 20)
Figure 3-19. Buffer size choices for Collision data set (percentage of buffer consumed vs. virtual timeline in the data set, k = 20)
During such periods it makes sense to consider only very small time periods in order to reduce the number of comparisons, leading to good efficiency for the standard plane-sweep. On the other hand, during time periods when there was relatively little interaction between the input relations, the layered plane-sweep performed far better because it was able to process large time periods at once. Even when the objects in the input relations have very long paths during such periods, the input data were isolated enough that there tends to be little cost associated with checking these paths for proximity during the first level of the layered plane-sweep. The adaptive plane-sweep was the best option by far for all three data sets, and was able to smoothly transition from periods of low to high activity in the data and back again, effectively picking the best granularity at which to compare the paths of the objects in the two input relations.
Figure 3-20. Synthetic data set experimental results (wall-clock time taken vs. percentage of the join completed)
From the graphs, we can see that the cost of performing the optimization causes the adaptive approach to be slightly slower than the non-adaptive approach when optimization is ineffective. In both the Injection and Collision data sets, this happens in the beginning, when the objects are moving towards each other but are still far enough apart that no interaction takes place. As expected, in both experiments, adaptivity begins to take effect when the objects in the underlying data set start interacting. From Figures 3-18, 3-19, and 3-21 it can be seen that the buffer-size choices made by the adaptive plane-sweep are very finely tuned to the underlying object interaction dynamics (decreasing with increasing interaction and vice versa). In both the Injection and Collision data sets, the size of the buffer falls dramatically just as the amount of interaction between the input relations increases. In the Synthetic data set, the oscillations in buffer usage depicted in Figure 3-21 mimic almost exactly the energy of the data as the objects perform their random walk.
The graphs also show the impact of varying the parameters to the adaptive plane-sweep routine, namely, the number of cut points k considered at each level of the optimization, and whether or not a chosen granularity is refined through recursive calls to the optimizer. It is surprising to note that varying these parameters causes no significant changes in the granularity choices made by the optimizer. The reason is that with increasing interaction in the underlying data set, the optimizer has a preference towards smaller granularities, and these granularities are naturally considered in more detail due to the logarithmic way in which the search space is partitioned.
Another interesting observation is that the recursive call does not improve the performance of the algorithm, for two reasons. First, since each invocation of the optimization is a separate attempt to find the best cut point in a different time range, it is not possible to share work among the recursive calls. Second, it is likely that just being in the feasible region, or at least a region close to it, is enough to enjoy significant performance gains. Since the coarse first-level optimization already achieves this, further optimization in the form of recursive calls to tune the chosen granularity does not seem to be necessary.
In all of our experiments, k = 5 with no recursive call to the optimizer uniformly gave the best performance. However, if the nature of the input data set is unknown and the data may be extremely poorly behaved, then we believe a choice of k = 10 with one recursive call may be a safer, all-purpose choice. On one hand, the cost of optimization will be increased, which may lead to a greater execution time in most cases (our experiments showed about a 30% performance hit associated with using k = 10 and one recursive call compared to k = 5). However, the benefit of this choice is that it is highly unlikely that such a combination would miss the optimal buffer choice in a very difficult scenario with a highly skewed data set.
3.7 Related Work
To our knowledge, the only prior work which has considered the problem of
computing joins over moving object histories is the work of Jeong et al. [51]. However,
their paper considers the basic problem at a high level. The algorithmic and implementation
issues addressed by our own work were not considered.
Figure 3-21. Buffer size choices for Synthetic data set (percentage of buffer consumed vs. virtual timeline in the data set, k = 20)
Though little work has been reported on spatiotemporal joins, there has been a
wealth of research in the area of spatial joins. The classical paper on spatial joins is due to Brinkhoff, Kriegel, and Seeger [45] and is based on the R-tree index structure. An improvement of this work was given by Huang et al. [46]. Hash-based spatial join strategies have been suggested by Lo and Ravishankar [47], and Patel and DeWitt [48]. Arge et al. [49] proposed a plane-sweep approach to address the spatial join problem in the absence of underlying indexes.
Within the context of moving objects, research has been focused on two main areas:
predictive queries, and historical queries. Within this taxonomy, our work falls in the
latter category. In predictive queries, the focus is on the future position of the objects
and only a limited time window of the object positions need to be maintained. On the
other hand, for historical queries, the interest is on ecient retrieval of past history and
usually the index structure maintains the entire timeline of an objects history. Due to
these divergent requirements, index structures designed for predictive queries are usually
not suitable for historical queries.
A number of index structures have been proposed to support predictive and historical queries efficiently. These structures are generally geared towards efficiently answering selection or window queries and do not study the problem of joins involving multi-gigabyte object histories addressed by our work.
Index structures for historical queries include the 3D R-trees [20], spatiotemporal R-trees and TB (Trajectory Bounding)-trees [21], and linear quad-trees [22]. A technique based on space partitioning is reported in [18]. For predictive queries, Saltenis et al. [17] proposed the TPR-tree (time-parametrized R-tree), which indexes the current and predicted future positions of moving point objects. They mention the sensitivity of the bounding boxes in the R-tree to object velocities. An improvement of the TPR-tree can be found in [78]. In [76], a framework to cover time-parametrized versions of spatial queries by reducing them to a nearest-neighbor search problem has been suggested. In [23], an indexing technique is proposed in which trajectories in a d-dimensional space are mapped to points in a higher-dimensional space and then indexed. In [75], the authors propose a framework called SINA in which continuous spatiotemporal queries are abstracted as a spatial join involving moving objects and moving queries. An overview of different access structures can be found in [83].
3.8 Summary
This chapter explored the challenging problem of computing joins over massive
moving object histories. We compared and evaluated obvious join strategies and
then described a novel join technique based on an extension to the plane-sweep. The
benchmarking results suggest that the proposed adaptive technique offers significant benefits over the competing techniques.
CHAPTER 4
ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES
Sensor networks are steadily growing larger and more complex, as well as more
practical for real-world applications. It is not hard to imagine a time in the near future
where huge networks will be widely deployed. A key data mining challenge will be making
sense of all of the data that those sensors produce.
In this chapter, we study a novel variation of the classic entity resolution problem that appears in sensor network applications. In entity resolution, the goal is to determine whether or not various bits of data pertain to the same object. For example, two records from different data sources that have been loaded into a database may reference the names Joe Johnson and Joseph Johnson. Entity resolution methodologies may be useful in determining if these records refer to the same person.
This problem also appears in sensor networks. A large number of sensors may all produce data relating to the same object or event, and it becomes necessary to be able to determine when this is the case. Unfortunately, the problem is exceptionally difficult in a sensor application, for two primary reasons. First, sensor data is often not as rich as classical, relational data, and often gives far fewer clues as to when two observations correspond to the same object. The most extreme case is a simple motion sensor, which will simply report a "yes" indicating that motion was sensed, along with a timestamp and an approximate location. This provides very little information to make use of during the resolution process. Second, there is the large number of data sources. The largest sensor networks in existence today already contain on the order of one thousand sensors. This goes far beyond what one might expect in a classical entity resolution application.
In this chapter, we consider a specic version of the entity resolution problem
that appears in sensor networks, where the goal is monitoring a spatial area in order
to understand the motion of objects through the area. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our
goal is to perform an accurate segmentation of all of the observations into sets, where each
set is associated with one object. Each set should also be annotated with the path of the
object through the area.
We develop a statistical, learning-based approach to solving the problem. We first learn a model for the motion of the various objects through the spatial area, and then associate sensor observations with the objects via a maximum likelihood procedure. The major technical difficulty lies in using the sensor observations to learn the spatial motion of the objects. A key aspect of our approach is that we make use of two different motion models to develop a reasonable solution to this problem: a restricted motion model that is easy to learn and yet is only applicable to smooth motion over small time periods, and a more general model that takes as input the initial motion learned via the restricted model.
Some specific contributions of this work are as follows:
1. A unique expectation-maximization (EM) algorithm that is suitable for learning associations of spatiotemporal, moving object data is described. This algorithm allows us to recognize quadratic (fixed acceleration) motion in a large set of sensor observations.
2. We apply and extend the method of Bayesian filters for recognizing unrestricted motion to the case when a large number of interacting objects produce data, and it is not clear which observation corresponds to which object.
3. Experimental results show that the proposed method can accurately perform resolution over more than one hundred simultaneously moving objects, even when the number of moving objects is not known beforehand.
The remainder of this chapter is organized as follows: In the next section, we
state the problem formally and give an overview of our approach. We then describe
the generative model and define the PDFs for the restricted and unrestricted motion.
This is followed by a detailed description of the learning algorithms in section 4.4.
An experimental evaluation of the algorithms is given in section 4.5 followed by the
conclusion.
4.1 Problem Definition
Consider a large database of sensor observations associated with K objects moving through some monitored area. Each observation is a potentially inaccurate ordered pair of the form (x, t), where x is the position vector and t is the time instance of the observation. Given a set of object IDs O = {o_1, o_2, ..., o_K}, the entity resolution problem that we consider consists of associating with each observation a label o_i ∈ O such that all sensor observations labeled o_i were produced by the same object. That is, we are partitioning the observations into K classes where each class of observations represents the path of a moving object on the field.
Figure 4-1. Mapping of a set of observations for linear motion
As an example, consider the set of observations (2,8,0), (9,9,0), (4,11,1), (11,7,1), (6,14,2), (13,5,2), (8,17,3), (15,3,3), as shown in Figure 4-1(a). Given that the underlying motion is linear and K = 2, Figure 4-1(b) shows a mapping of the observations to objects. Observations (2,8,0), (4,11,1), (6,14,2), (8,17,3) are associated with object 1, and observations (9,9,0), (11,7,1), (13,5,2), (15,3,3) are associated with object 2. Though in this case the classification was easy, the problem in general is hard due to a number of factors, including:
- Paths traced by real-life objects tend to be irregular and often cannot be approximated with simple curves.
- The measurement process is not accurate and often introduces error into the observations, which needs to be taken into account in classification.
- Objects can cross paths, or track one another's paths for relatively long time periods, complicating both the segmentation and the problem of figuring out how many objects are under observation.
4.2 Outline of Our Approach
In order to solve this problem in the general case, we make use of a model-oriented
approach. We model the uncertainty inherent in the production of sensor observations by
assuming that the observations are produced by a generative, random process.
The location of an object moving through space is expressed as a function of time. As an object wanders through the data space, it triggers sensors that generate observations in a cloud around it. Our model assumes that an object generates sensor observations in a cloud around it by sampling repeatedly from a Gaussian (multidimensional normal) probability density function (PDF) centered at the current location of the object, which we denote by f_obs. This randomized generative process nicely models the uncertainty inherent in any sort of sensor-based system. As the object moves through the field, it tends to trigger sensors that are located close to it, but the probability that a sensor is triggered falls off the further away from the object the sensor is located.
If we know the exact path of each object through the field, given such a model it is then a simple matter to perform the required partitioning of sensor observations by making use of the principle of maximum likelihood estimation (MLE). Using MLE, each observation is simply associated with the object that was most likely to have produced it; that is, it is assigned to the object whose f_obs function has the greatest value at the location returned by the sensor.
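The assignment step itself is a simple argmax, as the sketch below illustrates; `f_obs` stands for whatever observation density has been learned for each object (its exact form is given in Section 4.3), and the dictionary-based interface is an assumption of the example.

```python
def assign_observations(observations, objects, f_obs):
    """Maximum-likelihood labeling: each observation (x, t) gets the id of the
    object whose density f_obs(x, t, theta) is largest at that location.
    `objects` maps an object id to its learned parameter set theta."""
    labels = []
    for x, t in observations:
        labels.append(max(objects, key=lambda oid: f_obs(x, t, objects[oid])))
    return labels
```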
Of course, the difficulty is that in order to make use of such a simple MLE, we must first have access to the parameters defining f_obs and the motion of the various objects through the data space. This requires learning the parameters and the underlying motion, which can be very difficult, particularly so in our case, since we are unsure of the number of objects K that are producing the sensor observations.
Figure 4-2. Object path (a) and quadratic fit for varying time ticks (b-d)
One of the key aspects of our approach is that to make the learning process feasible,
we rely on two separate motion models: a restricted motion model that is used for only
the first few time ticks in order to recognize the number and initial motion of the various
objects, and an unrestricted motion model that takes this initial motion as input and
allows for arbitrary object motion. Given this, the following describes our overall process
for grouping sensor observations into objects:
1. First, we determine K and learn the set of model parameters governing object
motion under the restricted model for the first few time ticks.
2. Next, we use K as well as the object positions at the end of the first few time ticks
as input into a learning algorithm for the remainder of the timeline. The goal here is
to learn how objects move under the more general, unrestricted motion model.
3. Finally, once all object trajectories have been learned for the complete timeline, each
sensor observation is assigned to the object that was most likely to have produced it.
Of course, this basic approach requires that we address two very important technical
questions:
1. What exactly is our model for how objects move through the data space, and how do
objects probabilistically generate observations around them?
2. How do we learn the parameters associated with the model, in order to tailor the
general motion model to the specific objects that we are trying to model?
In the next section, we consider the answer to the first question. In section 4.4, we consider the answer to the second question.
4.3 Generative Model
In this section, we define the generative model that we make use of in order to perform the classification.
The high-level goal is to define the PDF f_obs(x, t | θ). For a particular object, f_obs takes as input a sensor observation x and a time t and gives the probability that the particular object in question would have triggered x at time t. f_obs is parametrized on the parameter set θ, which describes how the object in question moves through the data space, and how it tends to scatter sensor observations around it.
Before we describe the restricted and general motion models formally, it is worthwhile
to explain the need for two separate models. The parameter set θ governing the motion of the object through the data space is unknown, and must be inferred from the observed data. This is a difficult task. The problem is compounded when the observed data are a set of mixed sensor observations generated by an unknown number of objects. Given the difficulty inherent in learning θ, we choose to make the initial classification problem simpler by allowing only a very restricted type of quadratic motion characterized by constant acceleration.
Furthermore, object paths tend to be complex only over a relatively long time period. That is, motion seems unrestricted only when we take a long-term view. This is illustrated in Figure 4-2, where the initial quadratic approximations (for time periods [0, 3] and [0, 6]) are faithful to the object's actual path. As the time period extends and the object has a chance to change its acceleration, a simple quadratic fit is no longer appropriate (Figure 4-2(d)).
We take advantage of the fact that a simple motion model may be reasonable for a short time period, and learn the initial parameters of the generative process by using a restricted motion model over a small portion Δt of the timeline.
Figure 4-3. Object path in a sensor field (a) and sensor firings triggered by object motion (b)
Once the initial parameters are learned, we can make use of the unrestricted model for the remainder of the timeline, since there will be fewer unknowns and the computational complexity is greatly reduced.
4.3.1 PDF for Restricted Motion
We will now describe the PDF for observations assuming a restricted motion model
that is valid for short time periods. In this model, the location of an object is expressed as
a function of time in the form of a quadratic equation. The restricted model assumes that
acceleration is constant. The position of an object at some time instance t is specified by the parametric equation:

p_t = a·t² + v·t + p_0

where p_0 represents the initial position of the object, v the initial velocity, and a the constant acceleration.
We define the probability of an observation x at time t by the PDF:

f_obs(x, t | θ) = f_N(x | Σ_obs, p_t)

where

f_N(x | Σ, p) = (1 / (2π·|Σ|^{1/2})) · exp( −(1/2)·(x − p)^T·Σ^{−1}·(x − p) )

is a Gaussian PDF that models the cloud of sensors triggered by the object at time t. Figure 4-3 shows a typical scenario of how observations are generated. The parameter set θ contains:
- The object's initial position p_0, initial velocity v, and acceleration a.
- The covariance matrix Σ_obs specifying how the object produces sensor readings around itself.
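For illustration, the sketch below samples observations from this restricted generative model (a constant-acceleration path with Gaussian scatter around it); the NumPy interface and the sampling-per-tick parameter are assumptions of the example, not part of the chapter's sensor model.

```python
import numpy as np

def simulate_observations(p0, v, a, sigma_obs, ticks, per_tick, seed=0):
    """Generate (location, timestamp) pairs around the path
    p_t = a*t**2 + v*t + p0, scattering per_tick points per time tick
    according to the 2x2 covariance sigma_obs."""
    rng = np.random.default_rng(seed)
    p0, v, a = (np.asarray(u, dtype=float) for u in (p0, v, a))
    observations = []
    for t in range(ticks):
        center = a * t ** 2 + v * t + p0           # true position at time t
        for x in rng.multivariate_normal(center, sigma_obs, size=per_tick):
            observations.append((x, t))
    return observations
```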
4.3.2 PDF for Unrestricted Motion
While the restricted motion model may be applicable for a reasonably small time period, the fact is that accelerations do not remain constant for long. Thus, we make use of a second PDF, providing for more irregular motion, that can be used over longer time periods. In the more general PDF, at each time tick an object moves not in a nice, smooth trajectory, but instead moves through the data space in a completely random fashion. Given that the object's position in space at time t − 1 is p_{t−1}, the object's position at time t is simply p_{t−1} + N, where N is a multidimensional, normally distributed random variable parameterized on the covariance matrix Σ_mot. This random motion provides for a much more general model.
One result of using such a very general model is that there is no longer a simple equation for p_t. Instead, p_t has to be modeled by a random variable P_t which depends on the random variable for the object's position P_{t−1}, which itself depends on the random variable for the object's position P_{t−2}. Thus, the likelihood of observing p_t must be specified via a recursive PDF, where f_mot(p_t | θ) depends on f_mot(p_{t−1} | θ):

f_mot(p_t | θ) = ∫_{p_{t−1}} f_N(p_t | Σ_mot, p_{t−1}) · f_mot(p_{t−1} | θ) dp_{t−1}
As we will discuss in Section 4.4, the fact that an object's position is not specified precisely by the parameter set θ and the recursive nature of the PDF make this motion model much more difficult to deal with, which is why this more general motion model is not used throughout the timeline.
Given the PDF describing the distribution of an object's location at time t, it is then possible to give a PDF specifying the probability that we observe a sensor reading corresponding to the object at time t:

f_obs(x, t | θ) = ∫_{p_t} f_N(x | Σ_obs, p_t) · f_mot(p_t | θ) dp_t

Thus, for the more general PDF, the parameter set θ contains:
- The covariance matrix Σ_mot specifying the object's random motion.
- The object's initial position p_0.
- The covariance matrix Σ_obs specifying how the object produces sensor readings around itself.
Before we can actually map observations to individual objects, we must be able to learn the two underlying models. As we will discuss in detail subsequently, the term "learn" has a different meaning for each of the two models.
In the case of the restricted model, learning consists of computing the parameter set θ for each object, as well as determining the number of objects. Since this is a classical parameter estimation problem, we will make use of an MLE framework that will be solved by an EM algorithm.
In the case of the unrestricted model, determining the parameter set θ is easy, in the sense that once the parameter set of the restricted model has been learned, θ for the unrestricted model is already fully determined (see Section 4.3.2). However, this does not mean that use of the unrestricted model is easy. Because the model allows for arbitrary motion, f_mot as described above is not very useful by itself. Thus, our learning of the unrestricted model makes use of Bayesian methods to update and restrict f_mot. As we process the data, sensor observations from the database are used in Bayesian fashion to update f_mot in order to reflect a more refined belief in the position of the object. This updated f_mot places less weight in portions of the data space that do not contain sensor observations relevant to the object in question, and more weight in portions that do.
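One simple way to realize such a Bayesian update of f_mot is with a particle approximation, as sketched below under the random-walk motion model; this is only an illustration of the idea (propagate, reweight by the observation likelihood, resample), not the specific filtering algorithm developed later in the chapter.

```python
import numpy as np

def particle_update(particles, sigma_mot, sigma_obs, observations, rng):
    """One time tick of a particle approximation of f_mot for a single object.
    particles is an (n, 2) array of position hypotheses."""
    n = len(particles)
    # predict: p_t = p_{t-1} + N(0, sigma_mot)
    particles = particles + rng.multivariate_normal(np.zeros(2), sigma_mot, size=n)
    # update: weight each particle by the likelihood of the observations
    weights = np.ones(n)
    inv = np.linalg.inv(sigma_obs)
    for x in observations:
        diff = particles - x
        weights *= np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
    weights /= weights.sum()    # degenerate (all-zero) weights not handled here
    # resample to concentrate particles where the posterior weight is
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx]
```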
4.4 Learning the Restricted Model
We begin our discussion of how to learn the restricted model by assuming that the number of objects K is known. We will address the extension to an unknown K subsequently.
4.4.1 Expectation Maximization
Given a set of observations produced by a single object and the associated PDF f_obs(x, t | θ), the parameter θ can be learned by applying a relatively simple MLE. However, in our case, the observations come from a set of K unknown objects, where each object potentially contributes some fraction π of the sample. Note that the individual π values need not be uniform, since an object moving in a dense field of sensors, or a very large object, might produce more observations than an object moving in a sparser region. Given K objects, the probability of an arbitrary observation x at time t is then given by:

p(x, t | Θ) = Σ_{j=1}^{K} π_j · f_obs(x, t | θ_j)

where Θ = {π_j, θ_j | 1 ≤ j ≤ K} denotes the complete parameter set and π_j represents the fraction of data generated by the j-th object, with the constraint that Σ_{j=1}^{K} π_j = 1.
Our goal is to learn the complete parameter set Θ. Applying MLE, we want to find a Θ̂ that maximizes the following likelihood function:

L(Θ | X) = ∏_{i=1}^{N} p(x_i, t_i | Θ)

where X = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)} is the set of observations from some initial time period. As is standard practice, we instead try to maximize the log of the likelihood, since it makes the computations easier to handle:
log(L(Θ | X)) = log ∏_{i=1}^{N} p(x_i, t_i | Θ) = Σ_{i=1}^{N} log ( Σ_{j=1}^{K} π_j · f_obs(x_i, t_i | θ_j) )
Unfortunately, this problem is in general difficult because we do not know which observation was produced by which object. That is, if we had access to a vector Y = ⟨y_1, y_2, ..., y_N⟩, where y_i = j if the i-th observation is generated by the j-th object, the maximization would be a relatively straightforward problem.
The fact that we lack access to Y can be addressed by making use of the EM algorithm [84]. EM is an iterative algorithm that works by repeating the E-Step and the M-Step. At all times, EM maintains a current guess as to the parameter set Θ. In the E-Step, we compute the so-called Q-function, which is nothing more than the expected value of the log-likelihood, taken with respect to all possible values of Y. The probability of generating any given Y is computed using the current guess for Θ; this removes the dependency on Y. The M-Step then updates Θ so as to maximize the value of the resulting Q-function. The process is repeated until there is little step-to-step change in Θ.
In order to derive an EM algorithm for learning the restricted motion model, we must first derive the Q-function. In general, the Q-function takes the form:

Q(Θ, Θ^i) = E[ log L(X, Y | Θ) | X, Θ^i ]
In our particular case, this can be expanded to:

Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log(π_j · p(x_i, t_i | θ_j)) P_{j,i}

where Θ^g = {π_j^g, θ_j^g | 1 ≤ j ≤ K} represents our guess for the various parameters of the K objects and P_{j,i} is the posterior probability that the i-th observation came from the j-th object, given by the formula:

P_{j,i} = P(j | x_i, t_i) = π_j^g · p(x_i, t_i | θ_j^g) / Σ_{k=1}^{K} π_k^g · p(x_i, t_i | θ_k^g)    (by Bayes' rule)
Once we have derived Q, we need to maximize Q with respect to Θ. Notice that we can isolate the term containing π_j and the term containing θ_j by rewriting the Q-function as follows:
Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log(π_j) P_{j,i} + Σ_{i=1}^{N} Σ_{j=1}^{K} log(p(x_i, t_i | θ_j)) P_{j,i}
We can now maximize the above equation with respect to the various parameters of interest. This can be done using standard optimization methods [85]. Doing so results in the following update rules for the parameter set θ_j of the j-th object:
[ Σ_i P_{j,i}        Σ_i t_i P_{j,i}     Σ_i t_i² P_{j,i} ]   [ μ_j ]   [ Σ_i x_i P_{j,i}      ]
[ Σ_i t_i P_{j,i}    Σ_i t_i² P_{j,i}    Σ_i t_i³ P_{j,i} ] · [ v_j ] = [ Σ_i x_i t_i P_{j,i}  ]
[ Σ_i t_i² P_{j,i}   Σ_i t_i³ P_{j,i}    Σ_i t_i⁴ P_{j,i} ]   [ a_j ]   [ Σ_i x_i t_i² P_{j,i} ]

(where each sum runs over i = 1, ..., N),

Σ_j = Σ_i (x_i − μ_j)(x_i − μ_j)^T P_{j,i} / Σ_i P_{j,i}, and π_j = (1/N) Σ_{i=1}^{N} P_{j,i}.
Given these equations, our final EM algorithm is given as Algorithm 6.
Algorithm 6 EM Algorithm
1: while Θ continues to improve do
2:   for each object j do
3:     for each observation i do
4:       Compute P_{j,i}
5:     end for
6:     Compute θ_j = (μ_j, v_j, a_j, Σ_j, π_j) using P_{j,i} and the update rules given above
7:   end for
8: end while
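As a concrete companion to Algorithm 6, the following Python sketch implements one plausible version of the E-step and M-step for the restricted model, assuming (as the update rules above suggest) a quadratic motion model x(t) ≈ μ_j + v_j·t + a_j·t². It is only a sketch under those assumptions, not the implementation used in the experiments; names such as em_restricted are illustrative.

import numpy as np

def em_restricted(X, T, K, n_iter=50, seed=0):
    # X: (N, d) observations, T: (N,) time ticks, K: number of objects.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)].astype(float)   # crude initialization
    v = np.zeros((K, d)); a = np.zeros((K, d))
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities P[j, i] that object j produced observation i.
        logp = np.empty((K, N))
        for j in range(K):
            mean = mu[j] + np.outer(T, v[j]) + np.outer(T ** 2, a[j])
            diff = X - mean
            inv = np.linalg.inv(Sigma[j])
            quad = np.einsum('nd,de,ne->n', diff, inv, diff)
            logp[j] = (np.log(pi[j]) - 0.5 * (quad + np.linalg.slogdet(Sigma[j])[1]
                       + d * np.log(2 * np.pi)))
        P = np.exp(logp - logp.max(axis=0))
        P /= P.sum(axis=0)
        # M-step: weighted least squares for (mu, v, a), then Sigma_j and pi_j.
        for j in range(K):
            w = P[j]
            A = np.vander(T, 3, increasing=True)          # columns 1, t, t^2
            G = (A * w[:, None]).T @ A                    # the 3x3 moment matrix above
            b = (A * w[:, None]).T @ X
            coef = np.linalg.solve(G, b)
            mu[j], v[j], a[j] = coef
            resid = X - A @ coef
            Sigma[j] = (resid * w[:, None]).T @ resid / w.sum() + 1e-6 * np.eye(d)
            pi[j] = w.sum() / N
    return mu, v, a, Sigma, pi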
4.4.2 Learning K
So far we have assumed that the number of objects K is known. However, in practice, we often have very little knowledge about K, thus requiring us to estimate it from the observed data. The problem of choosing K can be viewed as the problem of selecting the number of components of a mixture model that describes some observed data. The problem has been well studied, as it arises in many different areas of research, and a variety of criteria have been proposed to solve it [86, 87, 88, 89].
The basic idea behind the various techniques is as follows. Assume we have a model for some observed data in the form of a parameter set Θ_K = {θ_1, ..., θ_K}. Further, assume we have a cost function C(Θ_K) to evaluate the cost of the model. In order to select the model with the optimal number of components, we simply compute the cost for a range of K values and choose the one with the minimum cost:

K = argmin_K { C(Θ_K) | K_low ≤ K ≤ K_high }
The various techniques proposed in the literature can be distinguished by the cost criterion they use to evaluate a model: AIC (Akaike's Information Criterion), MDL (Minimum Description Length) [88], MML (Minimum Message Length) [90], and so on. For the cost function, we make use of the Minimum Message Length (MML) criterion, as it has been shown to be competitive with and even superior to other techniques [89]. MML is an information-theoretic criterion in which different models are compared on the basis of how well they can encode the observed data. The MML criterion nicely captures the trade-off between the number of components and model simplicity. The general formula [89] for the MML criterion is given by:
C(Θ_k) = −log h(Θ_k) − log L(X | Θ_k) + (1/2) log |I(Θ_k)| + (c/2)(1 + 1/12)

where h(·) describes the prior probabilities of the various parameters, L(·) is the likelihood of observing the data, and |I(·)| is the determinant of the Fisher information matrix of the observed data. For our specific case, we need a formulation that is applicable to Gaussian distributions [87]:
C(Θ_k) = (P/2) Σ_{j=K_low}^{K_high} log(N π_j / 12) + (K_high / 2) log(N / 12) + K_high (N + 1) / 2 − log L(X | Θ_k)
One final issue that should be mentioned with respect to choosing K is computational efficiency. It is clearly unacceptable to re-run the entire EM algorithm for every possible K in order to minimize the MML criterion. A solution to this is the component-wise EM framework [91]. In this variation of EM, a model is first learned with a very large K value. Then, in an iterative fashion, poor components are pruned off and the model is re-adjusted to incorporate any data that is no longer well fit. For each resulting value of K, the MML criterion is checked and the best model is chosen.
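For clarity, the following is a naive Python sketch of the argmin-style selection described above: fit a model for each candidate K and keep the one with the lowest cost. The fit_model and cost arguments are assumed callables (for example, an EM fit and an MML-style cost); the component-wise EM optimization just described avoids refitting each candidate from scratch and is not shown here.

def select_k(fit_model, cost, k_low, k_high):
    # Fit a candidate model for every K in the range and return the cheapest one.
    candidates = {k: fit_model(k) for k in range(k_low, k_high + 1)}
    best_k = min(candidates, key=lambda k: cost(candidates[k], k))
    return best_k, candidates[best_k]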
4.5 Learning Unrestricted Motion
Once the restricted model has been learned over a short portion of the timeline, the next step is to use the learned parameters as a starting point in order to extend our estimate of each object's position to the remainder of the timeline.
The learning process for the unrestricted model is quite different from that for the restricted model. The restricted model makes use of a classical parameter estimation framework, where the goal is to learn the parameter set governing the motion of the object. In the unrestricted case, the values for the parameter set for f_mot are fully defined before the learning ever begins:
• An object's initial position p_0 at the time the unrestricted model becomes applicable can be computed directly from the parameters learned over the restricted model.
• Σ_obs can be taken directly from the restricted model, since it is one of the learned parameters.
• Σ_mot can be determined in a number of ways from the restricted model, such as by using an MLE over the object's time-tick to time-tick motion for the trajectory learned under the restricted model.
Thus, rather than relying on classical parameter estimation, we instead make use of Bayesian techniques [62] to update f_mot to take into account the various sensor observations. f_mot defines a distribution over p_t for every time-tick t, which can be viewed as describing a belief in the object's position at time-tick t. In a Bayesian fashion, this belief (i.e., distribution) can be updated and made more accurate by making use of the observed data. We will use the notation f^t_mot to denote f_mot updated with all of the information up until time-tick t. Such Bayesian techniques for modeling motion are often referred to as Bayesian filters [64].
4.5.1 Applying a Particle Filter
The mathematics associated with using a large number of discrete sensor observations to update f_mot quickly becomes very difficult, particularly in the case of Gaussian motion, resulting in an unwieldy multi-modal posterior distribution. To address this, we make use of a common method for rendering Bayesian filters practical, called a particle filter [57]. A particle filter simplifies the problem by representing f^t_mot by a set of discrete particles, where each particle is a possible current position for the object. We denote the set of particles associated with time-tick t as S_t, and the i-th particle in this set as S_t[i]. The i-th particle has a non-negative weight w_i attached to it, with the constraint that Σ_i w_i = 1. Highly weighted particles indicate that the object is more likely to be located at or close to the particle's position. Given S_t, f^t_mot(p_t) simply returns w_i if p_t = S_t[i], and 0 otherwise.
The basic application of a particle filter to our problem is quite simple (though there is a major complication that we will consider in the next subsection). To compute f^t_mot for any time-tick t, we use a recursive algorithm. For the base case t = 0, we have a single particle located at p_0, having weight one. Then, given a set of particles S_{t−1} for time-tick t − 1, the set S_t for time-tick t is computed as given in Algorithm 8.
Algorithm 8 Sampling a particle cloud
1: for i = 1 to |S_t| do
2:   Roll a biased die so that the chance of rolling j is w_j
3:   Sample from f_N(Σ_mot, S_{t−1}[j])
4:   Add the result as the i-th particle of S_t
5: end for
Essentially, this is nothing more than sampling from the distribution representing the object's position at time t − 1, and then adding the appropriate possible random motion to each sample. At this point, all weights are uniform. This gives us a discrete representation of the prior distribution for the object's position at time-tick t. We use f^{t−}_mot to denote this prior.
The next step is to use the various sensor observations to update the prior weights in order to obtain f^t_mot. For each particle:
. For each particle:
w
i
= Pr[p
t
= S
t
[i]] =

j=1
f
N
(x
j
[
obs
, S
t
[i])
|S
t
|

k=1
N

j=1
f
N
(x
j
[
obs
, S
t
[i])
Given S_t, it is then an easy matter to define an updated version of f_obs that corresponds to f^t_mot:

f^t_obs(x) = Σ_{i=1}^{|S_t|} w_i · f_N(x | Σ_obs, S_t[i])
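A minimal Python sketch of a single propagate-and-reweight step for one object, combining Algorithm 8 with the weight update above, is given below. The names (pf_step, gauss_pdf, and so on) and the exact interface are illustrative assumptions, not the implementation described later in the experiments.

import numpy as np

def gauss_pdf(x, mean, cov):
    # Multivariate normal density f_N(x | cov, mean).
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / np.sqrt(np.linalg.det(2 * np.pi * cov))

def pf_step(particles, weights, obs, sigma_mot, sigma_obs, rng):
    # particles: (n, 2) positions S_{t-1}; weights: (n,) summing to 1; obs: list of readings x_j.
    n = len(particles)
    # Resample by the current weights ("roll a biased die"), then add random motion.
    idx = rng.choice(n, size=n, p=weights)
    moved = particles[idx] + rng.multivariate_normal(np.zeros(2), sigma_mot, size=n)
    # Reweight each particle by the likelihood of all sensor observations.
    w = np.array([np.prod([gauss_pdf(x, s, sigma_obs) for x in obs]) for s in moved])
    return moved, w / w.sum()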
4.5.2 Handling Multiple Objects
Unfortunately, the simple filter described in the previous subsection cannot be applied directly to our problem. Unlike in most applications of particle filters, we actually have many objects producing sensor observations. As a result, it may be that a given observation x_j has nothing to do with the current object, which we denote by φ. As such, this observation should not be used to update our belief in the position of φ.
To handle this, we need to modify the basic framework. Rather than handling each and every x_j in a uniform fashion when dealing with φ, we instead associate a belief (represented as a probability) with every x_j. This belief tells us the extent to which we think that x_j was in fact produced by φ, and is computed as follows:

γ_{φ,j} = f^{t−}_obs(x_j | φ) / Σ_{k=1}^{K} f^{t−}_obs(x_j | k)
Note that f^{t−}_obs is the uniform-weighted version of f^t_obs computed using the particles associated with time-tick t, before the particle weights have been updated. f^{t−}_obs(x | φ) denotes evaluation of the f^{t−}_obs function that is specifically associated with object φ at the point x.
Given this, we now need to produce an alternative formula for w_i that takes into account the possibility that x_j was not produced by φ. In the case of a single object, w_i is simply the probability that φ would produce x_j given that the actual object location is S_t[i]. In the case of multiple objects, w_i is the probability that the entire collection of objects would produce x_j given that the actual location of φ is S_t[i].
To derive an expression for this probability, we first compute the likelihood that another object (other than φ) would produce x, given all of our beliefs regarding which object produced x:

f^{t−}_obs(x | ¬φ) = Σ_{k≠φ} γ_{k,j} · f^{t−}_obs(x | k) / (1 − γ_{φ,j})
Then, we can apply Bayes' rule to compute w_i:

w_i = ∏_{j=1}^{N} [ (1 − γ_{φ,j}) f^{t−}_obs(x_j | ¬φ) + γ_{φ,j} f_N(x_j | Σ_obs, S_t[i]) ] / Σ_{k=1}^{|S_t|} ∏_{j=1}^{N} [ (1 − γ_{φ,j}) f^{t−}_obs(x_j | ¬φ) + γ_{φ,j} f_N(x_j | Σ_obs, S_t[k]) ]
This formula bears a bit of additional discussion. In the numerator, we simply take the product of all of the likelihoods that are associated with each sensor observation, given that object φ is actually present at location S_t[i]. The likelihood of observing x_j under this constraint is (1 − γ_{φ,j}) f^{t−}_obs(x_j | ¬φ) + γ_{φ,j} f_N(x_j | Σ_obs, S_t[i]). The first of the two terms that are summed in this expression is the likelihood that an object other than φ would produce x_j, multiplied by our belief that this is the case. The second of the terms is the likelihood that φ (located at position S_t[i]) would produce x_j, multiplied by our belief that this is the case. The denominator in the expression is simply the normalization factor, which is computed by summing the likelihood over all possible positions of object φ.
w_i = Pr[p_t = S_t[i]] = ∏_{j=1}^{N} f_N(x_j | Σ_obs, S_t[i]) / Σ_{k=1}^{|S_t|} ∏_{j=1}^{N} f_N(x_j | Σ_obs, S_t[k])
However, the method has a problem. The update formula above is valid only when the observations X_t are for a single object (i.e., K = 1). For multiple objects (i.e., K > 1), we cannot use this formulation, since we would be allowing potential observations from other objects to influence the weight update of the samples for any given object. To update the belief of the j-th object, we should ideally make use of only some set of observations X^j_t ⊆ X_t attributed to it. Thus, we need a new strategy that updates the weights of the samples for a given object while taking into account the contribution of the other objects present in the field to the observation set. Our modified update strategy is explained below.
4.5.3 Update Strategy for a Sample Given Multiple Objects
For the purpose of this discussion, we consider K objects, where each object is represented by |S_t| samples, and N observations produced at time t, represented by X_t. First, we need some definitions.
Prior Probability of an Object: We denote the prior probability that some observation x_i was generated by object j by src_j. This is obtained via Bayes' rule as follows:

src_j = p(j | x_i) = p(x_i | j) / Σ_{k=1}^{K} p(x_i | k)
Probability of an Observation: We define the probability of an observation x_i in reference to some object j by the function f_obs. There are two variations of this function. For a given object position p_t of object j, the function f_obs(x_i | j) is described by a Gaussian PDF f_N(·) parametrized on θ_j = (p_t, Σ_obs):
f_obs(x_i | j) = 1 / (2π |Σ_obs|^{1/2}) · exp( −(1/2) (x_i − p_t)^T Σ_obs^{−1} (x_i − p_t) )

where Σ_obs describes how observations are scattered around the path of object j.
For a given sample position s^k_m of object k, the function f_obs(x_i | s^k_m) describes the probability that sample m of object k can trigger observation x_i. In this case, f_obs is parametrized on θ = (s^k_m, Σ_sensor), where Σ_sensor describes the width of the region around which a sensor is able to record observations.
Likelihood of an Observation: The likelihood of an observation x_i with respect to some object j is simply:

L(j | x_i) = c · Σ_{j=1}^{K} src_j · f_obs(x_i | j)

where c is a marginalizing constant. The likelihood can be viewed as the possibility that the j-th object triggered the observation.
Weight of a Sample: Given the prior src_j, the PDF f_obs, and the likelihood L(j | x_i), we can update the weight associated with some sample m of object k as follows:

w^k_m = p(s^k_m | X_t) = [ Σ_{j≠k} src_j · f_obs(x_i | j) + src_k · f_obs(x_i | s^k_m) ] / Σ_{l=1}^{|S_t|} [ Σ_{j≠k} src_j · f_obs(x_i | j) + src_k · f_obs(x_i | s^k_l) ]
The update equation for a set of observations X_t = (x_1, ..., x_N) is simply:

w^k_m = p(s^k_m | X_t) = ∏_{i=1}^{N} [ Σ_{j≠k} src_j · f_obs(x_i | j) + src_k · f_obs(x_i | s^k_m) ] / Σ_{l=1}^{|S_t|} ∏_{i=1}^{N} [ Σ_{j≠k} src_j · f_obs(x_i | j) + src_k · f_obs(x_i | s^k_l) ]
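The per-sample update above can be sketched in Python as follows. The callables obj_dens (returning f_obs(x | j) for every object j) and samp_dens (returning f_obs(x | s^k_m) for every sample of object k), as well as the function name sample_weights, are assumptions made for illustration; a log-domain accumulation is used to keep the product over observations numerically stable.

import numpy as np

def sample_weights(obs, src, obj_dens, samp_dens, k):
    # obs: (N, d) observations X_t; src: (K,) priors src_j; k: index of the object being updated.
    M = samp_dens(obs[0]).shape[0]
    log_w = np.zeros(M)
    for x in obs:
        d = obj_dens(x)                               # f_obs(x | j) for all objects j
        other = np.dot(src, d) - src[k] * d[k]        # sum over j != k of src_j * f_obs(x | j)
        log_w += np.log(other + src[k] * samp_dens(x))
    w = np.exp(log_w - log_w.max())
    return w / w.sum()                                # normalize over the samples of object k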
4.5.4 Speeding Things Up
A close examination of the formulas from the previous subsections shows that their evaluation requires considerable computation, especially if the number of particles per object, the number of objects, and/or the number of sensor observations per time tick is very large. However, a couple of simple tricks can alleviate a substantial portion of the complexity. First, when computing each γ_{φ,j} value, we can make use of the average or centroid of object φ in order to compute f^{t−}_obs, rather than considering each particle separately. Thus, we approximate f^{t−}_obs with f^{t−}_obs(x) ≈ f_N(x | Σ_obs + Σ_part, μ), where μ = Σ_{i=1}^{|S_t|} w_i S_t[i] and Σ_part is the covariance matrix computed over the positions of the particles in S_t.
Second, for any given object, on average only slightly more than 1/K of the observations at a given time tick actually apply to it. This is because we are usually quite sure which observation applies to which object, and only for a few observations is this in doubt. For a given object, those observations that do not apply to it will have very low corresponding γ values and will not affect w_i. Thus, when computing w_i we first drop any observation j for which γ_{φ,j} does not exceed a small threshold. This can be expected to achieve a speedup factor of close to K.
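The centroid approximation described above amounts to summarizing an object's particle cloud by a weighted mean and covariance, as in the following small sketch (the names are illustrative):

import numpy as np

def cloud_summary(particles, weights):
    # Weighted centroid and covariance of a particle cloud S_t with weights w_i.
    mu = weights @ particles
    diff = particles - mu
    sigma_part = (diff * weights[:, None]).T @ diff
    return mu, sigma_part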
4.6 Benchmarking
This section presents an experimental evaluation of the proposed algorithm. The goal of the evaluation is to answer the following questions:
• How many objects is the algorithm able to classify effectively?
• Is the algorithm able to effectively classify object observations that span a long period of time?
• How accurate is the proposed algorithm in classifying observations into objects?
• Is there an advantage to using the particle filter step?
Methodology. The experiments were conducted over a simple, synthetic database. This allows us to easily measure the sensitivity of the algorithm to varying data characteristics and parameter settings. The database stores sensor observations from a set of moving objects spanning some time interval.
The database is generated as follows. The various objects are initialized randomly throughout a 2D field and allowed to perform a random walk through the field. As the objects move through the field, their positions at various time ticks are recorded. At each time tick, sensor observations are generated as a Gaussian cloud around the various object locations. A snapshot of recorded observations from 40 objects is shown in Figure 4-4.
For various parameter settings, we measure the wall-clock execution time required to classify all the observations and the classification accuracy of the algorithm. As is standard practice in machine learning, classification accuracy is measured through recall and precision. For a given object, recall is the fraction of the observations actually produced by the object that the classifier assigns to it, and precision is the fraction of the observations assigned to the object that were actually produced by it. Ideally, recall and precision should be close to 1.
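For concreteness, the per-object recall and precision used here can be computed as in the short sketch below, assuming true_obj[i] and assigned_obj[i] give the true and assigned object for each observation i (the names are illustrative):

def recall_precision(true_obj, assigned_obj, obj_id):
    assigned = {i for i, a in enumerate(assigned_obj) if a == obj_id}
    produced = {i for i, t in enumerate(true_obj) if t == obj_id}
    hits = len(assigned & produced)
    recall = hits / len(produced) if produced else 1.0
    precision = hits / len(assigned) if assigned else 1.0
    return recall, precision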
Given this setup, we vary five parameters and measure their effect on execution time and classification accuracy:
1. numObj: the number of unique objects that produced the observations
2. numTicks: the number of time ticks over which observations were collected
3. stdDev: the standard deviation of the average Gaussian sensor cloud
4. numObs: the average number of sensor firings for a given Gaussian sensor cloud
5. emTime: the portion of the initial timeline over which EM was used
Five separate tests were conducted:
1. In the first test, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, numObs is set at 5, and emTime is fixed at 5. numObj is varied from 10 to 110 objects in increments of 30.
2. In the second test, numObj is fixed at 40, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and numObs is set at 5. The time interval over which observations were recorded, numTicks, is varied in increments of 25 up to 100 time ticks.
3. In the third test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and the average number of sensor firings generated per object at each time tick, numObs, is varied from 5 to 25 in increments of 5.
Figure 4-4. The baseline input set (10,000 observations)
Figure 4-5. The learned trajectories for the data of Figure 4-4
4. In the fourth test, numObj is fixed at 40, numTicks is fixed at 50, and numObs is set at 5. We then vary the spread of the Gaussian cloud, stdDev, from 2% to 10% of the width of the field.
numObj 10 40 70 110
Recall 1.0 0.91 0.76 0.69
Precision 1.0 0.92 0.92 0.93
Runtime 9 sec 38 sec 131 sec 378 sec
Table 4-1. Varying the number of objects and its effect on recall, precision, and runtime.
numTicks 25 50 75 100
Recall 0.93 0.91 0.75 0.64
Precision 0.96 0.93 0.92 0.92
Runtime 21 sec 38 sec 59 sec 72 sec
Table 4-2. Varying the number of time ticks.
numObs 5 10 15 20
Recall 0.91 0.91 0.91 0.92
Precision 0.93 0.92 0.92 0.91
Runtime 38 sec 71 sec 102 sec 134 sec
Table 4-3. Varying the number of sensors fired.
stdDev 2% 5% 7% 10%
Recall 0.91 0.90 0.88 0.80
Precision 0.93 0.94 0.91 0.83
Runtime 38 sec 37 sec 37 sec 38 sec
Table 4-4. Varying the standard deviation of the Gaussian cloud.
5. In the final test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, and numObs is set at 5. emTime is varied from 5 to 20 time ticks in increments of 5.
All tests were carried out on a dual-core Pentium PC with 2 GB of RAM. The tests were run in two stages. First, the EM algorithm is run to get an initial estimate of the number of objects and their starting locations. The number of time ticks over which EM is applied is controlled by the emTime parameter. Next, the estimates produced by EM are used to bootstrap the particle filter phase of the algorithm, which tracks the individual objects for the rest of the timeline. In a post-processing step, the recall and precision values are computed. Each test was repeated five times and the results were averaged across the runs.
emTime 5 10 15 20
Recall 0.91 0.88 0.87 0.83
Precision 0.92 0.95 0.94 0.96
Runtime 38 sec 38 sec 37 sec 37 sec
Table 4-5. Varying the number of time ticks where EM is applied.
Results. The results from the experiments are given in Tables 4-1 through 4-5. All times are wall-clock running times in seconds. The smallest database processed consisted of around 5,000 observations from 40 objects over 25 time ticks. The largest data set processed consisted of around 40,000 observations from 40 objects over 50 time ticks. Disk I/O cost is limited to the initial loading of the data set into main memory. To give a visual illustration, the actual plot of the sensor firings and the learned trajectories is given in Figures 4-4 and 4-5 for the baseline configuration (numObj 40, numTicks 50, numObs 5, stdDev 2%, emTime 5).
Discussion. There are several interesting findings. In general, the accuracy of the algorithm suffers as we vary the parameter of interest from low to high. The algorithm seems to be particularly sensitive to both the number of objects and the length of the time interval over which observations are obtained.
As Table 4-1 demonstrates, the classification accuracy suffers as we increase the number of objects considered by the algorithm. This is because, with an increasing number of objects, spatial discrimination is greatly reduced. Objects with observation clouds that are not well separated are grouped together as a single component by the EM stage of the algorithm. This has the effect of reducing the total number of objects that is tracked by the particle filter stage of the algorithm, and the observations from the untracked objects contribute to a reduced recall.
A somewhat different issue arises when the length of the time interval is increased (Table 4-2). When the time interval is increased, we increase the chance that the paths traced by two arbitrary objects will collide. Whenever object paths overlap or intersect, the individual particle filters tracking the objects can no longer perform any meaningful discrimination between the objects. When this happens, the filters end up dividing the observations among themselves in an arbitrary manner. A somewhat subtle issue arises when two objects intersect briefly and then diverge: in this case, the individual particle filters may end up swapping the objects. Similar factors are in play when we increase the spread of the sensor firings around object paths (Table 4-4).
A somewhat surprising finding (Table 4-3) is that increasing the density of observations does not seem to cause any noticeable improvement in classification accuracy beyond increasing the run times. Finally, our use of the EM stage only to bootstrap the particle filter phase is validated by the results shown in Table 4-5. If EM is used for more than a few initial time ticks, the limitations of the restricted model employed by EM come into play, and result in poor estimates being fed to the filter stage.
4.7 Related Work
There is a wealth of database research on supporting analytics over object paths. This includes trajectory indexing [18, 21, 92-94], queries over tracks [75, 76, 83], and clustering paths [35, 37, 38, 95, 96]. However, little work exists in the database literature that worries about how to actually obtain the object paths.
The only prior work in the database literature closely related to the problem we address is the work of Kubi et al. [36]. Given a set of asteroid observations, they consider the problem of linking observations that correspond to the same underlying asteroid. Their approach consists of building a forest of k-d trees [97], one for each time tick, and performing a synchronized search of all the trees, with exchange of information among tree nodes to guide the search towards feasible associations. They assume that each asteroid has at most one observation at every time tick and consider only simple linear or quadratic motion models.
Model-based approaches [85, 98, 99] have previously been employed in target tracking to map observations onto targets. The focus there is primarily on supporting real-time tracking using simple motion models. In contrast to existing research, we focus on aiding the ETL process in a warehouse context to support entity resolution and provide historical aggregation of object movements.
4.8 Summary
This chapter described a novel entity resolution problem that arises in sensor databases and then proposed a statistical-learning-based approach to solving the problem. The learning was carried out in two stages: an EM algorithm applied over a small portion of the data in the first stage to learn initial object patterns, followed by a particle-filter-based algorithm to track the individual objects. Experimental results confirm the utility of the proposed approach.
CHAPTER 5
SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS
For nearly 20 years, database researchers have produced various incarnations of probabilistic data models [100-106]. In these models, the relational model is extended so that a single stored database actually specifies a distribution of possible database states, where these possible states are also called possible worlds. In this sort of model, answering a query is closely related to the problem of statistical inference. Given a query over the database, the task is to infer some characteristic of the underlying distribution of possible worlds. For example, the goal may be to infer whether the probability that a specific tuple appears in the answer set of a query exceeds some user-specified p.
Along these lines, most of the existing work on probabilistic databases has focused on providing exact solutions to various inference problems. For example, imagine that one relation R_1 has an attribute lname, where exactly one tuple t from R_1 has the value t.lname = 'Smith'. The probability of t appearing in a given world is 0.2. t also has another attribute t.SSN = 123456789, which is a foreign key into a second database table R_2. The probability of 123456789 appearing in R_2 is 0.6. Then (assuming that there are no other Smiths in the database) the probability that 'Smith' will appear in the output of R_1 ⋈ R_2 can be computed exactly as 0.2 × 0.6 = 0.12.
Unfortunately, probabilistic data models where tuples or attribute values can be described using simple, discrete probability distributions may be of only limited utility in the real world. If the goal is to build databases that can represent the sort of uncertainty present in modern data management applications, it is very useful to handle complex, continuous, multi-attribute distributions. For example, consider an application where moving objects are automatically tracked, perhaps by video, magnetic, or seismic sensors, and the observed tracks are stored in a database. The standard, modern method for automatic tracking via electronic sensory input is the so-called particle filter [63], which generates a complex, time-parameterized probabilistic mixture model for each
object that is tracked. If this mixture model is stored in a database, then it becomes natural to ask questions such as: "Find all of the tracks that entered area A from time t_start to time t_end with probability greater than (p × 100)%." Answering this question involves somehow computing the mass of each object's time-varying positional distribution that intersects A during the given time range, and checking whether it exceeds p.
For many applications, such a problem can be quite difficult; it may be that no closed-form (and integrable) probability density function (PDF) is even available. For example, Bayesian inference [62] is a popular method that is commonly proposed as a way to infer unknown or uncertain characteristics of data; one standard application of Bayesian inference is automatically guessing the topic of a document such as an email. The so-called posterior distribution resulting from Bayesian inference often has no closed form, cannot be integrated, and can only be sampled from, using tools such as Markov chain Monte Carlo (MCMC) methods [107].
Thus, in the most general case, an integrable PDF is unavailable, and the user can only provide an implementation of a pseudo-random variable that can be used to provide samples from the probability distribution that he or she wishes to attach to an attribute or set of correlated attributes. By asking only for a pseudo-random generator, we can handle both difficult cases (such as the Bayesian case) and simpler cases where the underlying distribution is well known and widely used (such as Gaussian, Poisson, Gamma, Dirichlet, etc.) in a unified fashion. Myriad algorithms exist for generating Monte Carlo samples in a computationally efficient manner [108]. For more details on how a database system might support user-defined functions for generating the required pseudo-random samples, we point to our earlier paper on the subject [109].
Our Contributions. If the user is asked only to supply pseudo-random attribute value generators, it becomes necessary to develop new technologies that allow the database system to integrate the unknown density function underlying a pseudo-random generator over the space of database tuples accepted by a user-supplied query predicate. In this chapter, I consider the problem of how the required computations can and should be performed using Monte Carlo methods in a principled fashion. I propose very general algorithms that can be used to estimate the probability that a selection predicate evaluates to true over a probabilistic attribute or attributes, where the attributes are supplied only in the form of a pseudo-random attribute value generator.
The specific technical contributions are as follows:
• I carefully consider the statistical theory relevant to applying Monte Carlo methods to decide whether a database object is sufficiently accepted by the query predicate. Unfortunately, it turns out that due to peculiarities specific to the application of Monte Carlo methods to probabilistic databases, even so-called optimal methods can perform quite poorly in practice.
• I devise a new statistical test for deciding whether a database object should be included in the result set when Monte Carlo is used. In practice, the test can be used to scan a database and determine which objects are accepted by the query using far fewer samples than even existing, optimal tests.
• I also consider the problem of indexing for relational selection predicates over a probabilistic database to facilitate fast evaluation using Monte Carlo techniques.
Chapter Organization. Section 5.1 defines the basic problem of evaluating selection predicates when probabilistic attribute values can only be obtained via Monte Carlo methods, and considers the false positive and false negative problems associated with testing whether a database object should be accepted. Section 5.2 describes a classical test from statistics that is very relevant, called the sequential probability ratio test (SPRT). Section 5.3 describes our own proposed test, which makes use of the SPRT. The remaining sections consider the problem of indexing for our test, describe our experiments, cover related work, and conclude the chapter.
5.1 Problem and Background
In this section, we first define the basic problem: relational selection in a probabilistic database where uncertainty is represented via a black-box, possibly multi-dimensional sampling function. While we limit the scope by considering only relational selection, we note that join predicates are really nothing more than simple relational selection over a cross product, and so joins can be handled in a similar fashion. We begin by discussing the basic problem definition, and then give the reader a step-by-step tour through the relevant statistical theory, which will provide the necessary background to discuss our own technical contributions in the following sections.
5.1.1 Problem Definition
We consider generic selection queries of the form:

SELECT obj
FROM MYTABLE AS obj
WHERE pred(obj)
USING CONFIDENCE p

pred() is some (any) relational selection predicate over the database object obj. pred() may include references to probabilistic and non-probabilistic attributes, which may or may not be correlated with one another. For an example of this type of query, consider the following:

SELECT v.ID
FROM VEHICLES v
WHERE v.latitude BETWEEN 29.69.32 AND 29.69.38 AND
      v.longitude BETWEEN -82.35.12 AND -82.35.19
USING CONFIDENCE 0.98

This query will accept all vehicles falling in the specified rectangular region with probability higher than 0.98.
In general, our assumption is that there is a function obj.GetInstance() which supplies one random instance of the object obj (note that a random instance could contain a mix of deterministic and non-deterministic attributes, where deterministic attributes have the same value in every sample). In our example, latitude and longitude could be supplied by sampling from a two-dimensional Gaussian distribution.
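As an illustration only, a GetInstance() implementation for the VEHICLES example might look like the following Python sketch, where a vehicle's latitude and longitude are drawn from a two-dimensional Gaussian while its ID is deterministic. The class and attribute names are assumptions for the example, not part of any actual system interface.

import numpy as np

class VehicleObj:
    def __init__(self, vid, mean, cov, rng=None):
        self.ID = vid                       # deterministic attribute
        self.mean = np.asarray(mean, dtype=float)
        self.cov = np.asarray(cov, dtype=float)
        self.rng = rng or np.random.default_rng()

    def get_instance(self):
        # One random instance: probabilistic position, deterministic ID.
        lat, lon = self.rng.multivariate_normal(self.mean, self.cov)
        return {"ID": self.ID, "latitude": lat, "longitude": lon}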
5.1.2 The False Positive Problem
Algorithm 9 MonteCarlo (MYTABLE, p, n)
1: result = ∅
2: for each object obj in MYTABLE do
3:   k = 0
4:   for i = 1 to n do
5:     if pred(obj.GetInstance()) = true then
6:       k = k + 1
7:     end if
8:   end for
9:   if (k/n ≥ p) then
10:    result = result ∪ obj
11:  end if
12: end for
13: return result
Given this interface, one way to answer our basic selection query is to use Monte Carlo methods to guess whether or not each object should be contained in the answer set, as described in Algorithm 9.
For every database object, a number of random instances of the object are generated by the algorithm; the selection predicate pred() is applied to each of them, and the object is accepted or rejected depending upon how many times pred() evaluates to true.
While this algorithm is quite simple, the obvious problem is that there may be some inaccuracy in the result. For example, imagine that p is .95, n is 1000, and for a given object, the counter k ends up as 955. While it may be likely that the probability that the object is accepted by pred() exceeds .95, this does not mean that the object should necessarily be accepted; there is a possibility that the real chance that the object is accepted by pred() is 94%, but we just got lucky, and 95.5% of the samples were accepted. The chance of making such an error is intimately connected with the value n; the larger the n, the lower the chance of making an error.
For this reason, we might modify our query slightly so that the USING clause also includes a user-specified false positive error rate:

USING CONFIDENCE p
FALSE POSITIVE RATE α

To actually incorporate this error rate into our computation, it becomes necessary to modify Algorithm 9 so that it implements a statistical hypothesis test to guard against the error. For a given object obj, the inner loop of Algorithm 9 runs n Bernoulli (true/false) random trials, and counts how many true results are observed. There are many ways to accomplish this. For example, if the real probability that pred(obj.GetInstance()) is true is p, then the number of observed true results will follow a binomial distribution. For a given database object obj, we will use θ as shorthand for this probability; that is, θ = Pr[pred(obj.GetInstance()) = true]. Using the binomial distribution, we can set up a proper statistical hypothesis test with two competing hypotheses:

H_0: θ < p versus H_1: θ ≥ p
To do this, we use the fact that:

Pr[k ≥ c | θ = p] = Σ_{k=c}^{n} binomial(p, n, k)

Thus, if we want the chance of erroneously including obj in the result set when it should not be included (that is, when H_0 is true) to be less than a user-supplied α, we should accept obj only if the number of observed true results, k, meets or exceeds the value c, where Σ_{k=c}^{n} binomial(p, n, k) does not exceed the user-supplied α. Thus, we can first compute the smallest such c, and replace the last if statement (lines (9)-(11)) in the pseudo-code for Algorithm 9 with:

if k ≥ c then
  result = result ∪ obj
end if
Then, we can be sure that we are unlikely to erroneously include incorrect results in the answer set.
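The cutoff c can be computed directly from the binomial tail. The following sketch uses scipy.stats to find the smallest c whose tail probability under θ = p does not exceed α; the function name binomial_cutoff is illustrative.

from scipy.stats import binom

def binomial_cutoff(n, p, alpha):
    # Smallest c such that Pr[K >= c] <= alpha for K ~ Binomial(n, p);
    # binom.sf(c - 1, n, p) gives Pr[K >= c].
    for c in range(n + 1):
        if binom.sf(c - 1, n, p) <= alpha:
            return c
    return n + 1   # no k can be convincing enough; never accept

# Example: with n = 1000, p = 0.95, alpha = 0.05,
# accept only if k >= binomial_cutoff(1000, 0.95, 0.05).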
5.1.3 The False Negative Problem
There is a key problem with this approach: it only guards against false positives, and provides no protection against false negatives; using the terminology common in the statistical literature, this approach provides no guarantees as to the power of the test.¹
Fortunately, standard statistical methods make it easy to handle a lower bound on the power of the test. Assume that we alter the query syntax so that the desired power is specified directly in the query:

USING CONFIDENCE p
FALSE POSITIVE RATE α
FALSE NEGATIVE RATE β
Then, given some small value ε, we wish to choose from one of the two alternatives:²

H_0: θ = p + ε versus H_1: θ = p − ε
When evaluating a query, either H_0 or H_1 should be chosen for each database object obj, subject to the constraints that:
• Pr[Accept(H_0) | θ ≤ p − ε] is less than α
• Pr[Accept(H_1) | θ ≥ p + ε] is less than β
¹ In fact, this particular binomial test is quite weak compared to other possible tests.
² We assume that ε is an internal system parameter that is not chosen directly by the user; this is an important point discussed in depth in Section 5.3. ε is set to be small enough that no reasonable user would care about the difference between p + ε and p − ε. Since most PDFs stored in a so-called probabilistic database are the result of an inexact modeling and inference process that introduces its own error and renders very high-precision query evaluation of somewhat dubious utility, ε should not be too small in practice. We expect that an ε on the order of 10⁻⁴ might be reasonable.
If these two constraints are met, then when H_0 is accepted we can put obj into the answer set and be sure that the probability of incorrectly including obj is at most α, and when H_1 is accepted we can safely leave obj out of the answer set and be sure that the probability of incorrectly dismissing obj is at most β.
Fortunately, it is quite easy to do this using standard statistical machinery known as the Neyman-Pearson test [110], or Neyman test for short. For a given database object obj, the Neyman test chooses between H_0 and H_1 by analyzing a fixed sample of size n drawn using GetInstance(). The test relies on a likelihood ratio test (LRT) that compares the probabilities of observing the sample sequence under H_0 and H_1. It is named after a theoretical result (the Neyman-Pearson lemma) that states that a test based on the LRT is the most powerful of all possible tests for a fixed sample size n comparing the two simple hypotheses (i.e., it is a uniformly most powerful test). Since the Neyman test for the Bernoulli (yes/no) probability case is given in many textbooks on hypothesis testing, we omit its exact definition here. Given an implementation of a Neyman test that returns ACCEPT if H_0 is selected, it is possible to replace lines (9) to (11) of Algorithm 9 with:

if (Neyman(obj, pred, p, ε, α, β) = ACCEPT) then
  result = result ∪ obj
end if
The resulting framework will then correctly control the false positive and false
negative rates associated with the underlying query evaluation.
5.2 The Sequential Probability Ratio Test (SPRT)
While the Neyman test is theoretically optimal, it is important to carefully consider what the word optimal means in this context: it means that no other test can choose between H_0 and H_1 for a given α and β pair in fewer samples; specifically, no other test can do a better job when either H_0 or H_1 is true. The problem is that in a real probabilistic database there is little chance that either H_0 or H_1 is true: these two hypotheses relate to specific, infinitely precise probability values p + ε and p − ε, when in reality the true probability θ is likely to be either greater than p + ε or less than p − ε, but not exactly equal to either of them. In this case, the Neyman test will still be correct in the sense that, while still respecting α and β, it will choose H_0 if θ < p − ε and H_1 if θ > p + ε. However, the test is somewhat silly in this case, because it still requires just as many samples as it would in the hard case where θ is precisely equal to one of these values.

Figure 5-1. The SPRT in action. The middle line is the LRT statistic
To make this concrete, imagine that p = .95, and after 100 samples have been taken from GetInstance(), absolutely none of them have been accepted by pred(), but the Neyman algorithm has determined that in the worst case, we need 10⁵ samples to choose between H_0 and H_1. Even though there is a probability of at most (1 − .95)¹⁰⁰ < 10⁻¹³⁰ of observing 100 consecutive false values if θ were at least 0.95, the test cannot terminate, meaning that we must still take 99,900 more samples. In this extreme case we would like to be able to realize that there is no chance that we will accept H_1, and terminate early with a result of H_0. In fact, this extreme case may be quite common in a probabilistic database, where p will often be quite large and pred() highly selective.
Not surprisingly, this issue has been considered in detail by the statistics community, and there is an entire subfield of work devoted to so-called sequential tests. The basis for much of this work is Wald's famous sequential probability ratio test [111], or SPRT for short. The SPRT can be seen as a sequential version of the Neyman test. At each iteration, the SPRT draws another sample from the underlying data distribution, and uses it to update the value of a likelihood ratio statistic. If the statistic exceeds a certain upper threshold, then H_1 is accepted. If it ever falls below a certain lower threshold, then H_0 is accepted. If neither of these things happens, then at least one more iteration is required; however, the SPRT is guaranteed to end (eventually).
Thus, over time, the likelihood ratio statistic can be seen as performing a random walk between two moving goalposts. As soon as the value of the statistic falls outside of the goalposts, a decision is reached and the test is ended. The process is illustrated in Figure 5-1. This plot shows the SPRT for a specific case where θ = .5, ε = .05, p = .3, and α = β = 0.05. The x-axis of this plot shows the number of samples that have been taken, while the wavy line in the middle is the current value of the LRT statistic. As soon as the statistic exits either boundary, the test is ended.
The key benefit of this approach is that for very low values of θ that are very far from p, H_0 is accepted quickly (H_1 is accepted with similar speed when θ greatly exceeds p). All of this is done while fully controlling for the multiple-hypothesis-testing (MHT) problem: when the test statistic is checked repeatedly, extreme care must be taken with respect to α and β, because there are many chances to erroneously accept H_0 (or H_1), and so the effective or real α (or β) can be much higher than what would naively be expected. Furthermore, like the Neyman test, the SPRT is also optimal in the sense that, on expectation, it requires no more samples than any other sequential test to choose between H_0 and H_1, assuming that one of the two hypotheses is true.
Just like the Neyman test, the SPRT makes use of a likelihood ratio statistic.
Algorithm 10 SPRT (obj, pred, p, ε, α, β)
1: mult = a/b
2: tot = 0
3: numAcc = 0
4: constUp = log((1 − β)/α) / b
5: constDown = log(β/(1 − α)) / b
6: while (constDown + tot < numAcc < constUp + tot) do
7:   sample = obj.GetInstance()
8:   tot = tot + mult
9:   if (pred(sample) = true) then
10:    numAcc = numAcc + 1
11:  end if
12: end while
13: if (numAcc ≥ constUp + tot) then
14:   decision = ACCEPT
15: else
16:   decision = REJECT
17: end if
18: return decision
In the Bernoulli case we study here, after numAcc samples out of num total samples have been accepted by pred(), this statistic is:

λ = numAcc · log((p + ε)/(p − ε)) + (num − numAcc) · log((1 − p − ε)/(1 − p + ε))

Given α and β, the test continues as long as:

log(β/(1 − α)) < λ < log((1 − β)/α)

For simplicity, this can be re-worked a bit. Let:

a = log((1 − p + ε)/(1 − p − ε)),   b = log((p + ε)/(p − ε)) − log((1 − p − ε)/(1 − p + ε))

Then the test continues as long as:

numAcc ≤ log((1 − β)/α)/b + num · (a/b)

and

numAcc ≥ log(β/(1 − α))/b + num · (a/b)

This leads directly to the pseudo-code for the basic SPRT algorithm, which can be inserted into Algorithm 9 to produce a test that uses an adaptive sample size to choose between H_0 and H_1. The pseudo-code is given as Algorithm 10.
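A compact Python sketch of this Bernoulli SPRT, written to mirror Algorithm 10, is shown below. The function name sprt and the max_samples guard are illustrative; get_instance() and pred() stand in for obj.GetInstance() and the query predicate.

import math

def sprt(get_instance, pred, p, eps, alpha, beta, max_samples=10**6):
    a = math.log((1 - p + eps) / (1 - p - eps))
    b = math.log((p + eps) / (p - eps)) - math.log((1 - p - eps) / (1 - p + eps))
    const_up = math.log((1 - beta) / alpha) / b
    const_down = math.log(beta / (1 - alpha)) / b
    num_acc, tot = 0, 0.0
    for _ in range(max_samples):
        tot += a / b                      # one more sample drawn
        if pred(get_instance()):
            num_acc += 1
        if num_acc >= const_up + tot:
            return "ACCEPT"               # H1 accepted: theta >= p + eps
        if num_acc <= const_down + tot:
            return "REJECT"               # H0 accepted: theta <= p - eps
    return "ACCEPT" if num_acc >= const_up + tot else "REJECT"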
5.3 The End-Biased Test
In this section, we devise a new, simple sequential test called the end-biased test that is specifically designed to work well for queries over a probabilistic database.
5.3.1 What's Wrong With the SPRT?
To motivate our test, it is first necessary to consider why the SPRT and its existing, close cousins may not be the best choice for use in a probabilistic database.
The SPRT and its variants (of which there are many; see the related work section) are widely used in practice. Unfortunately, there is a key reason why the SPRT as it was originally defined is not a good choice for use in the innermost loop of a selection algorithm over a probabilistic database: the existence of the magic constant ε.
In classical applications of the SPRT, ε (that is, the distance between H_0 and H_1) is carefully chosen in an application-specific fashion by an expert who understands the significance of the parameter and its effect on the cost of running the SPRT. For example, a widget producer may wish to sample the widgets that his or her assembly line produces to see whether the unknown rate of defective widgets is acceptable. In this setting, ε would be chosen so that there is an economically significant difference between H_0 and H_1, while at the same time taking into account the adverse effect of a small ε; a small ε is associated with a large number of (expensive) samples. That is, p − ε is likely chosen so that the defect rate is so low that it would be a waste of money and time to stop the production line and determine the problem. On the other hand, p + ε is chosen at the point where production must be stopped, because so many defective widgets are produced that the associated cost is unacceptable. The widget producer understands that if the true rate of defective widgets is between p − ε and p + ε, the SPRT may return either result, so there is a natural inclination to shrink its value; however, he or she is also strongly motivated to make ε as large as possible, understanding that a small ε will require that more widgets be sampled, which increases the cost of the quality control program.
Unfortunately, in the context of a probabilistic database, the existence of a user-defined parameter with such a profound effect on the cost of the test is highly problematic. We contrast this with the fairly intuitive nature of the parameter p. A user might choose p = 0.95 if she or he wants only those objects that are probably accepted by pred(). She or he might choose p = 0.05 if she or he wants any object with even a slight chance of being accepted by pred(). p may even be computed automatically in the case of a top-k query. But what about ε? Without an in-depth understanding of the effect of ε on the SPRT, the choice of ε is necessarily arbitrary. A user may wonder: why not simply choose ε = 10⁻⁵ to ensure that all results are highly accurate? The reason is that this may (or may not) have a very significant effect on the speed of query evaluation, depending upon many factors that include the particular predicate chosen as well as the underlying probabilistic model, but it is not acceptable to ask an end user to understand and account for this!
5.3.2 Removing the Magic Epsilon
According to basic statistical theory, it is impossible to do away with ε altogether. Intuitively, no statistical hypothesis test can decide between two options that are almost identical. Thus, our goal is to take the choice of ε away from the user, and simply ship the database software with an ε that is small enough that no reasonable user would care about the error induced (see the relevant discussion in Footnote 2). The problem with this plan is that the classical SPRT may require a very large number of samples to terminate with a small ε. For example, consider the following simple test. We choose p = 0.5, ε = 10⁻⁵, and θ = 0.2, and run Algorithm 10: in this case, it turns out that more than ten thousand samples are required for the test to terminate. For a one-million-object database, generating this many samples per object is probably unacceptable.
Figure 5-2. Two spatial queries over a database of objects with Gaussian uncertainty
The unique problem in the database context is that while H_0 and H_1 are very close to one another (due to a tiny ε), in reality θ is typically very far from both p − ε and p + ε; usually, it will be close to zero or one. For example, consider Figure 5-2, which shows a simple spatial query over a database of objects whose positions are represented as two-dimensional Gaussian density functions (depicted as ovals in the figure). For both the more selective query at the left and the less selective query at the right, only the few objects falling on the boundary of the query region would have θ ≈ p for any user-specified p ≠ 0, 1.
This creates a unique setup that is quite different from classic applications of the SPRT and its variants. In fact, the SPRT itself is provably optimal only for θ values lying at p − ε and p + ε; for those θ far from these boundaries (such as at zero and one), it may do quite poorly. Many other optimal tests have been proposed in the statistical literature, but few seem to be applicable to this rather unique application domain; see the experimental section as well as the related work section for more details.
5.3.3 The End-Biased Algorithm
As a result, we propose our own sequential test that is specifically geared towards operating in a database environment, where (a) ε is vanishingly small, (b) θ for the typical object is close to 0 or 1, and (c) only for a few objects is θ ≈ p.
The algorithm that we develop is called the end-biased test. Unlike many of the tests from statistics, it has no optimality properties, but by design it functions very well in practice, an issue we consider experimentally later in the chapter.
Figure 5-3. The sequence of SPRTs run by the end-biased test
To perform the end-biased test, we run a series of pairs of individual hypothesis tests. In the first pair of tests, one SPRT is run right after another:
• The first SPRT tries to reject the object quickly, in just a few samples, if this is possible. To do this, a standard SPRT is run to decide between H_0: θ = p/2 and H_1: θ = p + ε. If the SPRT accepts H_0, then obj is immediately rejected. However, if the SPRT accepts H_1, then a second test is run.
• The second test tries to accept the object quickly, again in just a few samples, if this is possible. To do this, a standard SPRT is run to decide between H_0: θ = p − ε and H_1: θ = p + (1 − p)/2. If the SPRT accepts H_1, then obj is immediately accepted. However, if the SPRT accepts H_0, then the object survives for another round of testing.
The first pair of tests is set up so that the region of indifference (that is, the region between H_0 and H_1 in each test) is very large. A large region of indifference tends to speed the execution of the test. Intuitively, the reason for this is that it is much easier to decide between two disparate values for θ such as .1 and .9 than it is to decide between close values such as .1 and .100001, because the latter two values can explain any given observation almost equally well. Thus, the relatively large indifference ranges used by the first pair of SPRT sub-tests in the end-biased test tend to allow θ values below p/2 or above p + (1 − p)/2 to be accepted or rejected very quickly.
The drawback of using a large region of indifference is that if θ falls within either test's region of indifference, then the test can produce an arbitrary result that is not governed by the test's false positive and false negative parameters. Fortunately, since we choose the region of indifference so that it always falls entirely below p + ε in the rejection case (or above p − ε in the accept case), this will not cause problems in terms of the correctness of the test. For example, in the rejection case, if H_1 is accepted for an object whose θ value happens to fall in the region of indifference, then we do not immediately (and incorrectly) accept the object as an actual query result; rather, we will then run the second SPRT to determine whether the object should actually be accepted. The real cost of an erroneous H_1 for an object in the region of indifference is that the object is not immediately pruned and we will need to do more work.
If an object is neither accepted nor rejected by the first pair of tests, then a second pair of tests must be run. This time, however, the size of the region of indifference is halved for both the rejection test and the acceptance test. This means that more samples will probably be required to arrive at a result in either test, due to the fact that H_0 and H_1 will be closer to one another, but it also means that fewer objects will have θ values that fall in either test's region of indifference. Specifically, the third SPRT that is run is used to determine possible rejection using H_0: θ = 3p/4 versus H_1: θ = p + ε. If the SPRT accepts H_0, then obj is immediately rejected. However, if the third SPRT accepts H_1, then a second test for acceptance (the fourth test overall) is run. This test checks H_0: θ = p − ε against H_1: θ = p + (1 − p)/4. If the SPRT accepts H_1, then obj is accepted; otherwise, a third pair of tests is run, and so on.
This process is repeated, shrinking the region of indifference each time, until one of two things happens:
1. The process terminates with either an accept or a reject in some test, or
2. The space of possible θ values for which the process would not have terminated falls strictly in the range from p − ε to p + ε. In this case, an arbitrary result can be chosen.
The sequence of SPRT tests that are run is illustrated in Figure 5-3. At each iteration, the region of indifference shrinks, until it becomes vanishingly small and the test terminates. Since a large initial region of indifference means that the first few tests terminate quickly (but will only accept or reject large or small values of θ), the test is end-biased; that is, it is biased towards terminating early in those cases where θ is either small or large. For those θ values that are closer to p, more sub-tests and more samples will be required, which is very different from classical tests such as the SPRT or the Neyman test, which try to optimize for the case when θ is close to p ± ε.
, , and the MHT problem. One thing that we have ignored thus far is how to
choose

and

(that is, the false negative and false positive rate of each individual SPRT
subtest) so that the overall, user-supplied and values are respected by the end-biased
test. This is a bit more dicult than it may seem to be at rst glance: one signicant
result of running a series of SPRTs is that it becomes imperative that we be very careful
not to accidentally accept or reject an object due to the fact that we are running multiple
hypothesis tests.
We begin our discussion by assuming that the limit on the number of pairs of tests
run is n; that is, there are n tests that can accept obj, and there are n tests that can reject
obj. We also note that in practice, the 2n tests are not run in sequence, but they are run
in parallel; this is done so that all of the tests can make use of the same samples, and thus
samples can be re-used and the total number of samples is minimized (see Algorithm 5-3
below and the accompanying discussion). Specifically, first we use obj.GetInstance() to
generate one sample from the underlying distribution, then we feed this sample to each
of the 2n tests. If any one of the n acceptance tests accepts the object, then the overall
end-biased test accepts the object; if any one of the n rejection tests rejects the object,
then the overall end-biased test rejects the object.
Given this setup, imagine that there is an object obj that should be accepted by the end-biased test. We ask the question, what is the probability that we will falsely reject obj? This can be computed as:

β = Pr[ ⋃_{i=1}^{n} (reject in rejection test i ∩ no prior accept) ]

In this expression, "no prior accept" means that no test for acceptance of obj terminated with an accept before test i incorrectly rejected. We can then upper-bound β by simply removing this clause:

β ≤ Pr[ ⋃_{i=1}^{n} (reject in test i) ]

The reason for this inequality is that by removing any restriction on the set of outcomes accepted by the inner boolean expression, the probability that any event is accepted by the expression can only increase. Furthermore, by Bonferroni's inequality [112], we have:

Pr[ ⋃_{i=1}^{n} (reject in test i) ] ≤ Σ_{i=1}^{n} Pr[reject in test i]

As a result, if we run each individual rejection test using a false reject rate of β', we know that:

Pr[ ⋃_{i=1}^{n} (reject in test i) ] ≤ n β'

Thus, by choosing β' = β/n, we correctly bound the false negative rate of the overall end-biased test. A similar argument holds for the false positive rate: by choosing a rate of α' = α/n for each individual test, we will correctly bound the overall false positive rate of the test.
The Final Algorithm. Given all of these considerations, pseudo-code for the end-biased
test is given in Algorithm 5-3.
Algorithm 11 EndBiased (obj, pred, p, δ, α, β)
1: rejIndLo = p/2; accIndHi = (1 - p)/2
2: numTests = 0
3: while (p - rejIndLo < p - δ) or (p + accIndHi > p + δ) /* first, count the number of tests */ do
4:   numTests++
5:   rejIndLo /= 2; accIndHi /= 2
6: end while
7: for i = 1 to numTests /* now, set up the tests */ do
8:   rejSPRTs[i].Init(p - p(1/2)^i, p + δ, α/numTests, β/numTests)
9:   accSPRTs[i].Init(p - δ, p + (1 - p)(1/2)^i, α/numTests, β/numTests)
10: end for
11: while any test is still going /* run them all */ do
12:   sam = pred(obj.GetInstance())
13:   for i = 1 to numTests do
14:     if rejSPRTs[i].AddSam(sam) == REJECT then
15:       return REJECT
16:     end if
17:     if accSPRTs[i].AddSam(sam) == ACCEPT then
18:       return ACCEPT
19:     end if
20:   end for
21: end while
22: return ACCEPT
This algorithm assumes two arrays of SPRTs, where the elements of each array function just like the classic SPRT described earlier. The only difference is that the various SPRTs are first initialized (via a call to Init) and then fed true/false results one at a time via calls to AddSam(); that is, they do not operate independently. The array rejSPRTs[] attempts to reject obj; the array accSPRTs[] attempts to accept obj.
For simplicity, in Algorithm 5-3, each sample is added to each and every SPRT in turn. In practice, this can be implemented more efficiently in a way that produces a statistically equivalent outcome. First, we run rejSPRTs[0] to completion; if this SPRT does not reject, then accSPRTs[0] picks up where the first SPRT left off (using its final count of accepted samples) and runs to completion. If this SPRT does not accept, then rejSPRTs[1] picks up where the second one left off and also runs to completion. This is repeated until any member of rejSPRTs[] rejects, any member of accSPRTs[] accepts, or all SPRTs complete.
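To make the shared-sample execution concrete, here is a minimal Python sketch of a Bernoulli SPRT with Wald's thresholds, together with the "pick up where the previous test left off" strategy just described. The class and function names are ours, and the sketch is an illustration of the idea rather than the actual implementation used in our experiments:

    import math

    class BernoulliSPRT:
        # Wald SPRT for H0: gamma = p0 versus H1: gamma = p1, fed boolean samples.
        def __init__(self, p0, p1, alpha, beta):
            self.llr = 0.0                               # running log-likelihood ratio
            self.lo = math.log(beta / (1.0 - alpha))     # accept H0 at or below this
            self.hi = math.log((1.0 - beta) / alpha)     # accept H1 at or above this
            self.step_true = math.log(p1 / p0)
            self.step_false = math.log((1.0 - p1) / (1.0 - p0))

        def add_sample(self, sample):
            self.llr += self.step_true if sample else self.step_false

        def decision(self):
            if self.llr <= self.lo:
                return "H0"
            if self.llr >= self.hi:
                return "H1"
            return None

    def end_biased(samples, p, delta, alpha, beta, num_tests):
        # Build the 2n sub-tests in the order rejection 1, acceptance 1, rejection 2, ...
        battery = []
        for i in range(1, num_tests + 1):
            a, b = alpha / num_tests, beta / num_tests
            battery.append(("REJ", BernoulliSPRT(p - p * 0.5 ** i, p + delta, a, b)))
            battery.append(("ACC", BernoulliSPRT(p - delta, p + (1 - p) * 0.5 ** i, a, b)))
        stream, trues, total = iter(samples), 0, 0
        for kind, sprt in battery:
            # Replay the counts of samples already consumed; since the log-likelihood
            # ratio is a sum, only the counts matter, not the order of the samples.
            for _ in range(trues):
                sprt.add_sample(True)
            for _ in range(total - trues):
                sprt.add_sample(False)
            while sprt.decision() is None:               # draw fresh samples as needed
                sample = next(stream)
                total += 1
                trues += int(sample)
                sprt.add_sample(sample)
            if kind == "REJ" and sprt.decision() == "H0":
                return "REJECT"                          # some rejection sub-test rejected obj
            if kind == "ACC" and sprt.decision() == "H1":
                return "ACCEPT"                          # some acceptance sub-test accepted obj
            # Otherwise this sub-test ended on its uninteresting side; hand off.
        return "ACCEPT"

Here samples is assumed to be an effectively unbounded iterator of boolean values pred(obj.GetInstance()), and num_tests can be computed as in the earlier sketch. Replaying only the counts is statistically equivalent because the log-likelihood ratio of a Bernoulli SPRT depends only on how many true and false samples have been seen.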
5.4 Indexing the End-Biased Test
The end-biased test can easily be used to answer a selection query over a database
table: apply the test to each database object, and add the object to the output set
if it is accepted. However, this can be costly if the underlying database is large. One
of the longest-studied problems in database systems is how to speed such selection operations, particularly in the case of very selective queries, via indexing. Fortunately, the end-biased test is amenable to indexing, which is the issue we consider in this section. Specifically, we consider the problem of indexing for queries where the spatial location of an object is represented by the user-defined sampling function GetInstance(), because
spatial and temporal data is one of the most obvious application areas for probabilistic
selection.
5.4.1 Overview
The basic idea behind our proposed indexing strategy is as follows:
1. First, during an off-line pre-computation phase, we obtain, from each database object, a sequence of samples. Those samples (or at least a summary of the samples) are stored within an index to facilitate fast evaluation of queries at a later time.
2. Then, when a user asks a query with a specific p, δ, α, β, and a range predicate pred(), the first step is to determine how many samples would need to be taken in order to reject any given object by the first rejection SPRT in the end-biased test, if pred() evaluated to false for each and every one of those samples. This quantity can be computed as:
minSam = ⌈ log(β') / log((1 - p - δ) / (1 - p/2)) ⌉
(A short sketch of this computation, in Python, appears right after this list.)
3. Once minSam is obtained, the index is used to answer the question: which database objects could possibly have one of the first minSam samples in their pre-computed sequence accepted by pred()? All such database objects are placed within a candidate set C. All those objects not in C are implicitly rejected.
4. Finally, for each object within C, an end-biased test is run to see whether the object is actually accepted by the query.
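As a sketch of the computation in step 2 (referenced above), and under our reading of the formula (Wald's accept-H0 boundary log(β') for the first and widest rejection sub-test, whose hypotheses are γ = p/2 versus γ = p + δ), minSam could be computed as follows. The helper name and the example parameter values are illustrative only:

    import math

    def min_samples_to_reject(p, delta, beta_prime):
        # Each sample with pred() == False moves the first rejection SPRT's
        # log-likelihood ratio down by log((1 - p - delta) / (1 - p/2)) < 0; the
        # test accepts H0 (and so rejects the object) once the ratio reaches log(beta').
        step = math.log((1.0 - p - delta) / (1.0 - p / 2.0))
        return math.ceil(math.log(beta_prime) / step)

    # e.g. p = 0.8, delta = 1e-5, and beta = 0.01 spread over 16 sub-test pairs:
    print(min_samples_to_reject(0.8, 1e-5, 0.01 / 16))   # prints 7

Because minSam is typically this small, the index query over the sample sequence range from 1 to minSam is very selective, which is what makes the implicit rejection in step 3 effective.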
In the following few subsections, we discuss some of the details associated with each of these steps.
5.4.2 Building the Index
The first issue we consider is how to construct the index for the pre-computed samples. For each database object having d attributes that may be accessed by a spatial range query, index construction results in a series of (d + 1)-dimensional MBRs (minimum bounding rectangles) being inserted into a spatial index such as an R-Tree. Each MBR has a lower and upper bound for each of the d probabilistic attributes to be indexed, as well as a lower bound bℓ and an upper bound b on the sample sequence number. Specifically, if an MBR associated with object obj has a bound (bℓ, b) and rectangle R, this means that the first b pre-computed samples produced via calls to obj.GetInstance() all fell within R.
In addition, a pseudo-random seed value S is also stored along with the MBR. S is the seed used to produce all obj.GetInstance() values starting at sample number bℓ. Storing this (S, bℓ) pair is of key importance. As we describe subsequently, during query evaluation S can be used to re-create some of the samples that were bounded within R.
(Since true randomness is difficult and expensive to create on a computer, virtually all applications using Monte Carlo methods make use of pseudo-random number generation [108]. To generate a pseudo-random number, a string of bits, called a seed, is first sent as an argument to a function that uses the bits to produce the next random value. As a side effect of producing the new random value, the seed itself is updated. This updated seed value is then saved and used to produce the next random value at a later time.)
Given this setup, to construct a series of MBRs for a given object, the following method is used. For a given number of pre-computed samples m, we first store the pair (S, bℓ), where S is the initial pseudo-random number seed and bℓ = 1. We then use S to obtain two initial samples and bound them using the rectangle R. (m would typically be chosen to be just large enough so that, with any reasonable, user-supplied query parameters, it would always be possible to reject a database object where pred(obj.GetInstance()) evaluated to false m times in a row.) After this initialization, the following sequence of operations is repeated until m samples have been taken:
1. First, we obtain as many samples as are needed until a new sample is obtained that cannot fit into R.
2. Let b be the current number of samples that have been obtained. Create a (d + 1)-dimensional MBR using R along with the sequence number pair (bℓ, b - 1), and insert this MBR along with the current S and the object identifier into the spatial index.
3. Next, update R by expanding it so that it contains the new sample. Update S to be the current random number seed, and set bℓ = b.
4. Repeat from step (1).
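The following Python sketch illustrates this loop for two-dimensional samples; it is a simplified rendering (the helper names, the tuple layout of the entries, and the use of random.getstate() as a stand-in for the compact seed described above are ours). Each returned entry corresponds to one (d + 1)-dimensional MBR: the spatial bounds in rect plus the sequence range [b_lo, b_hi], where b_lo plays the role of bℓ in the text:

    import random

    def grow(rect, pt):
        # Expand an axis-aligned rectangle ((xlo, ylo), (xhi, yhi)) to cover pt.
        (xlo, ylo), (xhi, yhi) = rect
        return ((min(xlo, pt[0]), min(ylo, pt[1])),
                (max(xhi, pt[0]), max(yhi, pt[1])))

    def contains(rect, pt):
        (xlo, ylo), (xhi, yhi) = rect
        return xlo <= pt[0] <= xhi and ylo <= pt[1] <= yhi

    def build_sample_mbrs(sample_fn, initial_seed, m):
        # Pre-compute m samples for one object and return a list of entries
        # (rect, b_lo, b_hi, seed): the first b_hi samples all fell inside rect,
        # and seed regenerates the samples from sequence number b_lo onward.
        # sample_fn(rng) plays the role of obj.GetInstance().
        rng = random.Random(initial_seed)
        cur_seed, b_lo = rng.getstate(), 1
        p1 = sample_fn(rng)
        rect = grow((p1, p1), sample_fn(rng))    # bound the two initialization samples
        entries, b = [], 2
        while b < m:
            seed_before = rng.getstate()         # regenerates the next sample onward
            pt = sample_fn(rng)
            b += 1
            if contains(rect, pt):
                continue
            # The new sample forces R to grow: first emit a copy of the current MBR
            # (it bounds samples 1 .. b-1), then expand R and begin a new era at b.
            entries.append((rect, b_lo, b - 1, cur_seed))
            rect = grow(rect, pt)
            cur_seed, b_lo = seed_before, b
        entries.append((rect, b_lo, m, cur_seed))  # final MBR, covering all m samples
        return entries

Each of these entries would then be inserted, together with the object identifier, into the spatial index.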
Figure 5-4. Building the MBRs used to index the samples from the end-biased test.
This process is illustrated pictorially above in Figure 5-4, for a series of one-dimensional random values, up to b = 16. In this example, we begin by taking two samples during initialization. We then keep sampling until the fifth sample, which is the first one that does not fit into the initial MBR. This completes step (1) above. Then, a two-dimensional MBR is created to bound the sample sequence range from 1 to 4, as well as the set of pseudo-random values that have been observed. This MBR is inserted (along with S) into the spatial index as MBR 1 (step (2)). Next, the fifth sample is used to enlarge the MBR (step (3)). More samples are taken until it is found that the eighth sample does not fit into the new MBR (back to step (1)). Then, MBR 2 is created to cover the first seven samples as well as the sequence range from 5 to 7, and inserted into the spatial index. The process is repeated until all m samples have been obtained. The process can be summed up as follows: every time that a new sample forces the current MBR to grow, a copy of the MBR is first inserted into the index, and then the MBR is expanded to accommodate the sample.
Figure 5-5. Using the index to speed the end-biased test.
5.4.3 Processing Queries
To use the resulting index to process a range query R encoded by the predicate pred(), the minimum number of samples required to reject, minSam, is first computed as described in Section 5.4.1. Then, a query Q is issued to the index, searching for all MBRs intersecting R as well as the sample sequence range from 1 to minSam. Due to the way that the MBRs inserted into the index were constructed, we know that any database
object obj that does not intersect Q can immediately (and implicitly) be rejected, because
the MBR covering the first minSam samples from obj.GetInstance() did not intersect R.
However, we must still deal with those objects that did have an MBR intersecting Q. For those objects, we run a modified version of the end-biased test that skips ahead as far as possible in the sample sequence; the details of this modified test are not too hard to imagine, and are left out for brevity. For a given intersecting object, we find the MBR with the lowest sample sequence range that intersected Q. For example, consider Figure 5-5; in this example, the MBRs from Object 1 intersect Q. We choose the earlier of the two MBRs, which is the MBR covering the sample sequence range from 6 through 9. Let bℓ be the low sample sequence value associated with this MBR, and let S be the pseudo-random seed value associated with it. To run the modified end-biased test, we use Algorithm 5-3, as well as the fact that none of the first bℓ - 1 samples from obj.GetInstance() could have been accepted by pred(). Thus, we initialize each of the 2n SPRTs with bℓ - 1 false samples, and start execution at the bℓ-th sample. In this way, we skip immediately to the first sample sequence number that was likely accepted by pred().
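Continuing the sketches above (and reusing the end_biased function and the random module from them), the modified test for one candidate object might look as follows; the entry layout and the function names are, again, ours:

    def test_candidate(entry, pred, sample_fn, p, delta, alpha, beta, num_tests):
        # entry = (rect, b_lo, b_hi, seed) is the candidate's intersecting MBR with
        # the lowest sample sequence range. None of the first b_lo - 1 samples could
        # have satisfied pred, so they are replayed as False, and real sampling
        # resumes from the stored seed at sequence number b_lo.
        rect, b_lo, b_hi, seed = entry
        rng = random.Random()
        rng.setstate(seed)

        def sample_stream():
            for _ in range(b_lo - 1):
                yield False                       # the known misses
            while True:
                yield pred(sample_fn(rng))        # fresh samples from b_lo onward

        return end_biased(sample_stream(), p, delta, alpha, beta, num_tests)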
5.5 Experiments
In this section, we experimentally evaluate our proposed algorithms. Specific questions we wish to answer are:
1. In a realistic environment where a selection predicate must be run over millions of database objects, how well do standard methods from statistics perform?
2. Can our end-biased test improve upon methods from the statistical literature?
3. Does the proposed indexing framework effectively speed application of the end-biased test?
Experimental Setup. In each of our experiments, we consider a simple, synthetic database, which allows us to easily test the sensitivity of the various algorithms to different data and query characteristics. This database consists of two-dimensional Gaussians spread randomly throughout a field. For a number of different (query, database)
combinations, we measure the wall-clock execution time required to discover all of the database objects that fall in a given, rectangular box with probability p. Since in all cases the database size is large relative to the query result size, the false positive rate we allow is generally much lower than the false negative rate. The reason is that a false positive rate of 1% over a 10-million object database could theoretically result in 1 × 10^5 false positives. Thus, in all of our queries, we use a false positive rate of (number of database objects)^-1 and a false negative rate of 10^-2; since the average result size is quite small, a relatively high false drop rate is acceptable. The δ value we use in our experiments is 10^-5.
Given this setup, there are four different parameters that we vary to measure the effect on query execution time:
1. dbSize: the size of the database, in terms of the number of objects.
2. stdDev: the standard deviation of an average Gaussian stored in the database, along each dimension. Since this controls the spread of each object's PDF, if the value is small, then effectively the database objects are all very far apart from each other. As the value grows, the objects effectively get closer to one another, until they eventually overlap.
3. qSize: this is the size of the query box.
4. p: this is the user-supplied p value that is used to accept or reject objects.
We run four separate tests:
1. In the first test, stdDev is fixed at 10% of the width of the field, qSize along each axis is fixed at 3% of the width of the field, and p is set at 0.8. Thus, many database objects intersect each query, but likely none are accepted. dbSize is varied from 10^6 to 3 × 10^6 to 5 × 10^6 to 10^7.
2. In the second test, dbSize is fixed at 10^7, stdDev is again fixed at 1%, and p is 0.95. qSize is varied from 0.3% to 1% to 3% to 10% along each axis. In the first case, most database objects intersecting the query region are accepted; in the latter, none are, since the objects' spread is much greater than the query region.
3. In the third test, dbSize is fixed at 10^7, qSize is 3%, p = 0.8, and stdDev is varied from 1% to 3% to 10%.
4. In the final test, dbSize is 10^7, qSize is 3%, stdDev is 10%, and p is varied from 0.8 to 0.9 to 0.95. The first case is particularly difficult because while very few objects are accepted, the spread of each object is so great that most are candidates for acceptance.
Method        10^6       3 × 10^6    5 × 10^6    10^7
SPRT          568 sec    1700 sec    2824 sec    5653 sec
Opt           2656 sec   8517 sec    14091 sec   26544 sec
End-biased    9 sec      24 sec      38 sec      76 sec
Indexed       1 sec      3 sec       7 sec       15 sec
Table 5-1. Running times over varying database sizes.
Method 0.3% 1% 3% 10%
SPRT 1423 sec 1420 sec 1427 sec 3265 sec
End-biased 76 sec 75 sec 75 sec 430 sec
Indexed 11 sec 4 sec 4 sec 962 sec
Table 5-2. Running times over varying query sizes.
Method 1% 3% 10%
SPRT 5734 sec 5608 sec 5690 sec
End-biased 116 sec 75 sec 75 sec
Indexed 107 sec 12 sec 15 sec
Table 5-3. Running times over varying object standard deviations.
Method 0.8 0.9 0.95
SPRT 5672 sec 2869 sec 1436 sec
End-biased 75 sec 75 sec 75 sec
Indexed 14 sec 12 sec 13 sec
Table 5-4. Running times over varying confidence levels.
Each test is run several times, and results are averaged across all runs.
Methods Tested. For each of the above tests, we test four methods: the SPRT, an
alternative sequential test that is approximately, asymptotically optimal [113], the
end-biased test via sequential scan, and the end-biased test via indexing. In practice, we
found the optimal test to be so slow that it was only used for the first set of tests.
Results. The results are given in Tables 5-1 through 5-4. All times are wall-clock running times in seconds. The raw data files for a database of size 10^7 required about 500MB of storage. The indexed, pre-sampled version of this data file requires around 7GB to store in its entirety if 500 samples are used.
Discussion.
There are several interesting findings. First and foremost is the terrible relative performance of the optimal sequential test, which was generally about five times slower than Wald's classic SPRT. The results are so bad that we removed this option after
the first set of experiments. Since we were quite curious about the poor performance, we ran some additional, exploratory experiments using the optimal test and found that it can be better than the SPRT, particularly in cases where H0 and H1 were far apart. Unfortunately, in our application δ is chosen to be quite small, and under such circumstances the optimal test is quite useless. The poor results associated with this test illustrate quite strongly how asymptotic statistical theory is often quite far away from practice.
On the other hand, the end-biased test always far outperformed the SPRT, sometimes by almost two orders of magnitude. This is perhaps not surprising given the fact that, by design, the end-biased test can quickly reject the multitude of objects where γ ≈ 0. The spread between the two tests was particularly significant for the third set of experiments, which tests the effect of the object standard deviation (or size) on the running time. It was interesting that the end-biased test performed better with higher standard deviation: as the object size increases, fewer objects are accepted by the query box, which cannot encompass enough of the PDF. The end-biased test appears to be particularly adept at rejecting such objects quickly. However, the SPRT performance seemed to be invariant to the size of the object.
Another interesting finding was the sensitivity of the SPRT to the p parameter. For lower confidence, the test was far more expensive. This is because as the confidence is lowered, the actual probability that an object is in the query box gets closer to the user-defined input parameter. As this happens, the SPRT has a harder time rejecting the object.
The results regarding the index were informative as well. It is not surprising that the indexed method was almost always the fastest of the four. For the 10-million-record database, it seems that the standard end-biased test bottoms out at a sequential scan plus processing time of about 70 seconds. However, the indexed, end-biased method is able to cut this baseline cost down to under ten seconds for the same database size, though the time taken from query to query tended to vary a lot more for the index than for the other methods.
It is interesting that in the cases where the regular end-biased test becomes more expensive than its 70-second baseline (for example, consider the first column in Table 5-3), the indexed version also suffers to almost the same extent. This is not surprising. The reason for an increased cost for the un-indexed version is that a large number of objects were encountered that required many samples. Perhaps there were even a few objects that required an extreme number of samples; numbers in the billions happen occasionally when γ is very close to p. The indexed version is no better than the un-indexed version
in this case; it cannot dismiss such an expensive object outright using the index, and the
few, pre-computed, indexed samples it has access to are useless if millions of samples are
eventually required to accept or reject the object.
Perhaps the most interesting case with respect to the index is the fourth column of
Table 5-2, where the indexed end-biased test actually doubles the running time of the
un-indexed version. The explanation here seems to be that this particular query has
the largest result set size. The size is so large, in fact, that use of the index induces an
overhead and actually slows query evaluation, a phenomenon that is always possible in
database indexing.
5.6 Related Work
Since the SPRT was first proposed by Wald in 1947 [111], sequential statistics have been widely studied. Wald's original SPRT is proven to be optimal for γ values lying exactly at H0 or H1; in other cases, it may perform poorly. Kiefer and Weiss first raised the question of developing tests that are provably optimal at points other than those covered by H0 and H1 [114]. However, in the general case, this problem has not been solved, though there has been some success in solving the asymptotic case where (α = β) → 0. Such a solution was first proposed by Lorden in 1976 [115], where
he showed that Kiefer-Weiss optimality can be achieved asymptotically under certain conditions. Well-known follow-up work is due to Eisenberg [116], Huffman [113], and Pavlov [117]. Work in this area continues to this day. However, a reasonable criticism of much of this work is its focus on asymptotic optimality, particularly its focus on applications having vanishingly small (and equal) values of α and β. It is unclear how well such asymptotically optimal tests perform in practice, and the statistical research literature provides surprisingly little in the way of guidance, which was our motivation for undertaking this line of work. In our particular application, α and β are not equal (in fact, they will most often differ by orders of magnitude), and it is unclear to us whether practical and clearly superior alternatives to Wald's original proposal exist in such a case. In contrast to related work in pure and applied statistics, we seek no notion of optimality; our goal is to design a test that is (a) correct, and (b) experimentally proven to work well in the case where δ is vanishingly small, and yet more often than not, the true probability of object inclusion is either zero or one.
In the database literature, the paper that is closest to our indexing proposal is due to Tao et al. [118]. They consider the problem of indexing spatial data where the position is defined by a PDF. However, they assume that the PDF is non-infinite and integrable. The assumption of finiteness may be reasonable for many applications (since many distributions, such as the Gaussian, fall off exponentially in density as distance from the mean increases). However, integrability is a strong assumption, precluding, for example, many distributions resulting from Bayesian inference [62] that can only be sampled from using MCMC methods [107].
Most of the work in probabilistic databases is at least tangentially related to our own,
in that our goal is to represent uncertainty. We point the reader to Dalvi and Suciu for a
general treatment of the topic [119]. The paper most closely related to this work is due to
Jampani et al. [109] who propose a data model for uncertain data that is fundamentally
based upon Monte Carlo.
5.7 Summary
In this chapter, we have considered the problem of using Monte Carlo methods to test which objects in a probabilistic database are accepted by a query predicate. Our two primary contributions are (1) the definition of a new sequential test, which strictly controls both the false positive and false negative rates, of whether or not the probability that an object is accepted by the query predicate is high enough for the object to qualify as a query result, and (2) an indexing methodology for the test. The test was found to work quite well in practice, and the indexing is very successful in speeding the test's application.
We close the chapter by pointing out that our goal was not to make a contribution to
statistical theory, and arguably, we have not! Most of the relevant statistical literature is
concerned with various definitions of optimality, and while our new test is correct, there is no sense in which it is optimal. However, we believe that our new test is of practical significance to the implementation of probabilistic databases. The experimental evidence
that it works well is strong, and there is also strong intuition behind the design of the
test. In practice, the new test outperforms an oft-cited optimal test from the relevant
statistical literature for database selection problems, and while a more appropriate test
may exist, we are unaware of a more suitable candidate for solving the problem at hand.
CHAPTER 6
CONCLUDING REMARKS
With the increasing popularity of tracking devices, and the decreasing cost of storage, large spatiotemporal data collections are becoming increasingly commonplace. Extending current database systems to support such collections requires the development of new solutions. The work presented in this study represents a small step in that direction. The main theme of this work was the development of scalable and efficient algorithms for processing historical spatiotemporal data, particularly in a warehouse context. As much as this work solves some important research issues, it also opens new avenues for future research. Some potential directions for further development include:
The CPA-join focused on historical queries over two spatiotemporal relations.
Extending the work to support predictive queries would be an interesting exercise.
Unlike historical queries, which span long time intervals, predictive queries typically involve short time intervals. This could make the use of indexes potentially
more attractive.
The version of entity resolution considered in this work assumed simple binary sensors that provide limited information about the tracked objects. This, however, limits the ability of the algorithms to discriminate between closely moving objects. The accuracy could be improved if one considers sensors that provide a richer feature set (such as sensors that provide additional color information). This would give the algorithms an additional dimension along which to differentiate the observation clouds.
Finally, we focused only on answering probabilistic spatiotemporal selection queries using the end-biased test. However, statistical hypothesis testing, on which the end-biased test is built, is a basic technique used in many fields of science and engineering. Hence, the end-biased algorithm proposed in this work has potentially broad applicability beyond probabilistic databases.
REFERENCES
[1] R. Guting and M. Schneider, Moving Object Databases, Morgan Kaufmann, 2005.
[2] D. Papadias, D. Zhang, and G. Kollios, Advances in Spatial and Spatio Temporal
Data Management, Springer-Verlag, 2007.
[3] J.Schillier and A.Voisard, Location-Based Services, Morgan Kaufmann, 2004.
[4] Y.Zhang and O.Wolfson, Satellite-based information services, Kluwer Academic
Publishers, 2002.
[5] W.I.Grosky, A. Kansal, S. Nath, J. Liu, and F.Zhao, Senseweb: An infrastructure
for shared sensing, in IEEE Multimedia, 2007.
[6] Cover, Mandate for change, RFID Journal, 2004.
[7] G.Abdulla, T.Critchlow, and W.Arrighi, Simulation data as data streams,
SIGMOD Record 33(1):89-94, 2004.
[8] N. Pelekis, B. Theodoulidis, I. Kopanakis, and Y. Theodoridis, Literature review of
spatio-temporal database models, in The Knowledge Engineering Review, 2004.
[9] A.P.Sistla, O.Wolfson, S.Chamberlain, and S.Dao, Modeling and querying moving
objects, in ICDE, 1997.
[10] M. Erwig, R. Guting, M. Schneider, and M. Vazirgianni, A foundation for
representing and querying moving objects, in TODS, 2000.
[11] L.Forlizzi, R.H.Guting, E.Nardelli, and M.Schneider, A data model and data
structures for moving objects databases, in SIGMOD, 2000.
[12] C.Parent, S.Spaccapietra, and E.Zimanyl, Spatiotemporal conceptual models: Data
structures + space + time, in GIS, 1999.
[13] N.Tryfona, R.Price, and C.S.Jensen, Conceptual models for spatiotemporal
applications, in The CHOROCHRONOS Approach, 2002.
[14] E. Tossebro, Representing uncertainty in spatial and spatiotemporal databases, in
Phd Thesis, 2002.
[15] M. Erwig and S.Schneider, Stql: A spatiotemporal query language, in Mining
spatio-temporal information systems, 2002.
[16] A. Guttman, R-trees: a dynamic index structure for spatial searching, in SIGMOD, 1984.
[17] S. Saltenis, C. Jensen, S. Leutenegger, and M. Lopez, Indexing the positions of continuously moving objects, in SIGMOD, 2000.
[18] P. Chakka, A. Everspaugh, and J. Patel, Indexing large trajectory data sets with
SETI, in CIDR, 2003.
[19] J.Patel, Y.Chen, and P.Chakka, Stripes: An efficient index for predicted trajectories, in SIGMOD, 2004.
[20] S. Theodoridis, Spatio-temporal Indexing for Large Multimedia Applications, in
IEEE Intl Conference on Multimedia Computing and Systems, 1996.
[21] D. Pfoser, C. S. Jensen, and Y. Theodoridis, Novel approaches to the indexing of
moving object trajectories, in VLDB, 2000.
[22] T. Tzouramanis, M. Vassilakopoulos, and Y. Manolopoulos, Overlapping linear
quadtrees: A spatio-temporal access method, in Advances in GIS, 1998.
[23] G.Kollios, D.Gunopulos, and V.J.Tsotras, Nearest neighbor queries in a mobile
environment, in Spatiotemporal database management, 1999.
[24] Z.Song and N.Roussopoulos, K-nearest neighbor search for moving query point, in
Symp. on Spatial and Temporal Databases, 2001.
[25] Z.Huang, H.Lu, B. Ooi, and A. Tung, Continuous skyline queries for moving
objects, in TKDE, 2006.
[26] G.Kollios, D.Gunopulos, and V.J.Tsotras, An improved R-tree indexing for
temporal spatial databases, in SDH, 1990.
[27] Y.Tao and D.Papadias, Mv3r-tree: A spatiotemporal access method for timestamp
and interval queries, in VLDB, 2001.
[28] M.A.Nascimento and J.R.O.Silva, Towards historical R-trees, in ACM SAC, 1998.
[29] G.Iwerks, H.Samet, and K.P.Smith, Maintenance of spatial semijoin queries on
moving points, in VLDB, 2004.
[30] S.Arumugam and C.Jermaine, Closest-point-of-approach join over moving object
histories, in ICDE, 2006.
[31] Y.Choi and C.Chung, Selectivity estimation for spatio-temporal queries to moving
objects, in SIGMOD, 2002.
[32] M.Schneider, Evaluation of spatio-temporal predicates on moving objects, in
ICDE, 2005.
[33] Y.Tao, J.Sun, D.Papadias, and G.Kollios, Analysis of predictive spatio-temporal
queries, in TODS, 2003.
[34] J.Sun, Y.Tao, D.Papadias, and G.Kollios, Spatiotemporal join selectivity, in
Information Systems, 2006.
[35] M. Vlachos, G. Kollios, and D. Gunopulos, Discovering similar multidimensional
trajectories, in ICDE, 2002.
[36] J.Kubica, A.Moore, A.Connolly, and R.Jedicke, A multiple tree algorithm for the efficient association of asteroid observations, in KDD, 2005.
[37] S. Gaffney and P. Smyth, Trajectory Clustering with Mixtures of Regression Models, in KDD, 1999.
[38] Y.Li, J.Han, and J.Yang, Clustering Moving Objects, in KDD, 2004.
[39] J.Lee, J.Han, and K.Whang, Trajectory clustering: A partition-and-group
framework, in SIGMOD, 2007.
[40] D.Guo, J.Chen, A. MacEachren, and K.Liao, A visualization system for space-time and multivariate patterns, in IEEE Transactions on Visualization and Computer Graphics, 2006.
[41] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, Indexing spatio-temporal data
warehouses, in ICDE, 2002.
[42] N.Mamoulis, H.Cao, G.Kollios, M.Hadjieleftheirou, Y.Tao, and D.Cheung, Mining,
indexing, and querying historical spatiotemporal data, in KDD, 2004.
[43] Y.Tao, G.Kollios, J.Considine, F.Li, and D.Papadias, Spatio-temporal aggregation
using sketches, in ICDE, 2004.
[44] D. Papadias, Y.Tao, P.Kalnis, and J.Zhang, Historical spatio-temporal
aggregation, in Trans. of Information Systems, 2005.
[45] T.Brinkhoff, H.P.Kriegel, and B.Seeger, Efficient processing of spatial joins using R-trees, in SIGMOD, 1993.
[46] Y.W.Huang, N.Jing, and E.A.Rundensteiner, Spatial joins using R-trees: Breadth-first traversal with global optimizations, in VLDB, 1997.
[47] M. Lo and C.V.Ravishankar, Spatial hash joins, in SIGMOD, 1996.
[48] J. Patel and D. DeWitt, Partition based spatial-merge join, in SIGMOD, 1996.
[49] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter, Scalable sweeping-based spatial join, in VLDB, 1998.
[50] M. Berg, M. Kreveld, M.Overmars, and O.Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer-Verlag, 2000.
[51] S.H.Jeong, N.W.Paton, A. Fernandes, and T. Griffiths, An experimental performance evaluation of spatio-temporal join strategies, in Transactions in GIS, 2004.
[52] W. Winkler, Matching and record linkage, in Business Survey Methods, 1995.
[53] M.Hernandez and S.Stolfo, The merge/purge problem for large databases, in
SIGMOD, 1995.
[54] A. E. Monge and C. P. Elkan, The field matching problem: Algorithms and applications, in KDD, 1996.
[55] W. Cohen and J. Richman, Learning to match and cluster large high-dimensional
data sets for data integration, in KDD, 2002.
[56] Y.Bar-Shalom and T.Fortmann, Tracking and data association, in Academic
Press, 1988.
[57] B.Ristic, S.Arulampalam, and N.Gordon, Beyond the Kalman filter: Particle filters for tracking applications, Artech House Publishers, 2004.
[58] D.B.Reid, An algorithm for tracking multiple targets, in IEEE Trans. Automat. Control, 1979.
[59] X.Li, The PDF of nearest neighbor measurement and a probabilistic nearest neighbor filter for tracking in clutter, in IEEE Control and Decision Conference, 1993.
[60] I. Cox and S.L.Hingorani, An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking, in Intl. Conf. on Pattern Recognition, 1994.
[61] T.Song, D.Lee, and J.Ryu, A probabilistic nearest neighbor filter algorithm for tracking in a clutter environment, in Signal Processing, Elsevier Science, 2005.
[62] A. O'Hagan and J. J. Forster, Bayesian Inference, Volume 2B of Kendall's Advanced Theory of Statistics, Arnold, second edition, 2004.
[63] A. Doucet, C. Andrieu, and S. Godsill, On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing, vol. 10, pp. 197-208, 2000.
[64] D.Fox, J.Hightower, L.Liao, D.Schulz, and G.Borriello, Bayesian filtering for location estimation, in IEEE Pervasive Computing, 2003.
[65] Z.Khan, T.Balch, and F.Dellaert, An MCMC-based particle filter for multiple interacting targets, in ECCV, 2004.
[66] S. Oh, S. Russell, and S. Sastry, Markov Chain Monte Carlo data association for
general multiple-target tracking problems, in IEEE Conf. on Decision and Control,
2004.
[67] O.Wolfson, S.Chamberlain, S.Dao, L.Jiang, and G.Mendez, Cost and imprecision in modeling the position of moving objects, in ICDE, 1998.
[68] D.Pfoser, Capturing the uncertainty of moving objects, in LNCS, 1999.
[69] J.H.Hosbond, S.Saltenis, and R.Ortfort, Indexing uncertainty of continuously
moving objects, in IDEAS, 2003.
[70] G.Trajcevski, O.Wolfson, K.Hinrichs, and S.Chamberlain, Managing uncertainty in moving object databases, in TODS, 2004.
[71] R.Cheng, D.Kalashnikov, and S.Prabhakar, Querying imprecise data in moving object environments, in TKDE, 2004.
[72] Y.Tao, R.Cheng, and X.Xiao, Indexing multidimensional uncertain data with
arbitrary probability density functions, in VLDB, 2005.
[73] H.Mokhtar and J.Su, Universal trajectory queries on moving object databases, in
Mobile Data Management, 2004.
[74] D. Eberly, 3D Game Engine Design: A Practical Approach to Real-time Computer
Graphics, Morgan Kaufmann, 2001.
[75] M. Mokbel, X. Xiong, and W. Aref, SINA: Scalable incremental processing of
continuous queries in spatio-temporal databases, in SIGMOD, 2004.
[76] Y. Tao, Time-parametrized queries in spatio-temporal databases, in SIGMOD,
2004.
[77] S. Saltenis and C. Jensen, Indexing of moving objects for location-based services,
in ICDE, 2002.
[78] Y. Tao, D. Papadias, and J. Sun, The TPR*-tree: An optimized spatio-temporal
access method for predictive queries, in VLDB, 2003.
[79] O. Gunther, Efficient computation of spatial joins, in ICDE, 1993.
[80] S. Leutenegger and J.Edgington, STR: A simple and efficient algorithm for R-tree packing, in 13th Intl. Conf. on Data Engineering (ICDE), 1997.
[81] D.Mehta and S.Sahni, Handbook of Data Structures and Applications, Chapman and Hall, 2004.
[82] P.J.Haas and J.M.Hellerstein, Ripple joins for online aggregation, in SIGMOD,
1999.
[83] M. Nascimento and J. Silva, Evaluation of access structures for discretely moving
points, in Intl Workshop on Spatio-Temporal Database Management, 1999.
[84] A.Dempster, N.Laird, and D.Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm, in Journ. Royal Statistical Society, 1977.
[85] J.Bilmes, A gentle tutorial of the em algorithm and its application to parameter
estimation for gaussian mixture and hidden markov models, in Technical Report,
Univ. of Berkeley, 1997.
[86] J.Banfield and A.Raftery, Model-based Gaussian and non-Gaussian clustering, in Biometrics, 1993.
[87] J.Oliver, R.Baxter, and C.Wallace, Unsupervised learning using mml, in ICML,
1996.
[88] M.Hansen and B. Yu, Model selection and the principle of minimum description
length, in Journal of the American Statistical Association, 1998.
[89] M. Figueiredo and A. Jain, Unsupervised learning of finite mixture models, in IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.
[90] R. Baxter, Minimum message length inference: Theory and applications, in PhD
Thesis, 1996.
[91] G.Celeux, S.Chretien, F.Forbes, and A.Mikhadri, A component-wise em algorithm
for mixtures, in Journ. of Computational and Graphical Statistics, 1999.
[92] D.Pfoser and C.Jensen, Trajectory indexing using movement constraints, in
GeoInformatica, 2005.
[93] Y.Cai and R.Ng, Indexing spatio-temporal trajectories with chebyshev
polynomials, in SIGMOD, 2004.
[94] S.Rasetic and J.Sander, A trajectory splitting model for efficient spatio-temporal indexing, in VLDB, 2005.
[95] D.Chudova, S.Gaffney, E.Mjolsness, and P.Smyth, Translation-invariant Mixture Models for Curve Clustering, in KDD, 2003.
[96] H.Kriegel and M.Pfeifle, Density-based Clustering of Uncertain Data, in KDD, 2005.
[97] J.L.Bentley, K-d trees for semidynamic point sets, in Annual Symposium on
Computational Geometry, 1990.
[98] L.Frenkel and M.Feder, Recursive Expectation Maximization algorithms for
time-varying parameters with applications to multiple target tracking, in IEEE
Trans. Signal Processing, 1999.
[99] P. Chung, J. Bohme, and A. Hero, Tracking of multiple moving sources using
recursive em algorithm, in EURASIP Journal on Applied Signal Processing, 2005.
[100] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara,
and J. Widom, Trio: A system for data, uncertainty, and lineage, in VLDB, 2006.
[101] P. Andritsos, A. Fuxman, and R. J. Miller, Clean answers over dirty databases: A
probabilistic approach, in ICDE, 2006, p. 30.
[102] L. Antova, C. Koch, and D. Olteanu, MayBMS: Managing incomplete information with probabilistic world-set decompositions, in ICDE, 2007, pp. 1479-1480.
[103] R. Cheng, S. Singh, and S. Prabhakar, U-DBMS: A database system for managing constantly-evolving data, in VLDB, 2005, pp. 1271-1274.
[104] N. N. Dalvi and D. Suciu, Efficient query evaluation on probabilistic databases, VLDB J., vol. 16, no. 4, pp. 523-544, 2007.
[105] N. Fuhr and T. Rolleke, A probabilistic relational algebra for the integration of information retrieval and database systems, ACM Trans. Inf. Syst., vol. 15, no. 1, pp. 32-66, 1997.
[106] R. Gupta and S. Sarawagi, Creating probabilistic databases from information extraction models, in VLDB, 2006, pp. 965-976.
[107] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, second edition, 2004.
[108] J. E. Gentle, Random Number Generation and Monte Carlo Methods, Springer,
second edition, 2003.
[109] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas, MCDB: A Monte Carlo approach to handling uncertainty, in SIGMOD, 2008.
[110] J. Neyman and E. Pearson, On the problem of the most efficient tests of statistical hypotheses, Phil. Trans. of the Royal Soc. of London, Series A, vol. 231, pp. 289-337, 1933.
[111] A. Wald, Sequential Analysis, Wiley, 1947.
[112] J. Galambos and I. Simonelli, Bonferroni-Type Inequalities with Applications,
Springer-Verlag, 1996.
[113] M. Huffman, An efficient approximate solution to the Kiefer-Weiss problem, in The Annals of Statistics, 1983, vol. 11, pp. 306-316.
[114] J. Kiefer and L. Weiss, Some properties of generalized sequential probability ratio tests, in The Annals of Mathematical Statistics, 1957, vol. 28, pp. 57-74.
[115] G. Lorden, 2-SPRTs and the modified Kiefer-Weiss problem of minimizing an expected sample size, in The Annals of Statistics, 1976, vol. 4, pp. 281-291.
[116] B. Eisenberg, The asymptotic solution to the Kiefer-Weiss problem, in Comm. Statistics C-Sequential Analysis, 1982, vol. 1, pp. 81-88.
[117] I. Pavlov, Sequential procedure of testing composite hypotheses with application to the Kiefer-Weiss problem, in Theory of Probability and Its Applications, 1991, vol. 35, pp. 280-292.
[118] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, Indexing multi-dimensional uncertain data with arbitrary probability density functions, in VLDB, 2005, pp. 922-933.
[119] N. Dalvi and D. Suciu, Management of probabilistic data: foundations and challenges, in PODS, 2007, pp. 1-12.
BIOGRAPHICAL SKETCH
Subramanian Arumugam is a member of the query processing team at the database
startup, Greenplum. He is a recipient of the 2007 ACM SIGMOD Best Paper Award.
He received his bachelor's degree from the University of Madras in 2000. He obtained his master's in computer engineering in 2003, and his PhD in computer engineering in 2008, both from the University of Florida.