Escolar Documentos
Profissional Documentos
Cultura Documentos
Metrics
Piotr Indyk
Abstract
We present a data structure for the approximate nearest
neighbor problem under edit metric (which is defined as
the minimum number of insertions, deletions and character substitutionsneeded
to transform one string into an
other). For any
and a set of strings of length , the
data structure reports a
-approximate Nearest Neighbor
for any given query string in
time. The space requirement of this data structure is roughly
,
i.e., strongly subexponential. To our knowledge, this is
the first data structure for this problem with both !"
#
query time and storage subexponential in .
1 Introduction
The Nearest Neighbor Search (NN) problem is: Given a
set $ of points in a metric space % , preprocess $ so
as to efficiently answer queries for finding the point in
$ closest to a query point . NN and its approximate
versions are among the most extensively studied problems
in the fields of Computational Geometry and Algorithms,
resulting in discovery of many efficient algorithms. In
particular, for the case when the metric space % is a
low-dimensional Euclidean space & , it is known how
to construct data structure for exact [Cla88, Mei93] or
approximate [AMN ' 94, Kle97, IM98, HP01] NN with
query time
)(+*-,/.0#21436527 . Unfortunately, those data
structures require space exponential in . More recently,
several data structures using space (quasi)-polynomial
in and , and query time sublinear in , have been
discovered for approximate
NN under and & [KOR98,
8
5
IM98, Ind00] and
[Ind98] norms.
While many metrics of interest for nearest neighbor
search are norms, quite a few of them are not. A prominent example of the latter is the Levenshtein (or edit) metric, which, for two strings 9 and 9;: , is defined as the minimum number of insertions, deletions and substitutions
that transform 9 into 9;: . Edit distance is a natural measure of similarity between sequences of characters, and
is widely used e.g., in computational biology and text
processing. Unfortunately, the only nearest neighbor al-
646
2 Preliminaries
K
Z
% O U be metrics,K
QP PRP , and let
N
%
U
N 5 PRPRP N S
. The W -product
of
K
O
% 5 ]JPRPQP_] % S , and, for any
is defined by setting %
a O
\% , setting
Let Z N
W X S [
U a O B
Z S
!
"$#&%
+*
,
.-/10 3254
9 7
647
O ;
'#)(
6*
7
8
:
K
-
Therefore, if we set
( A Q U
O a :6 .
f a 5;O U 5
;5 O a 5 ORPRPQPRO a S O U 5
hS O adScji to a point a :
which minimizes
U
O a over a $ , the algorithm will
report a kc
A ( -approximate nearest neighbor.
kB
4 Edit distance
T HEOREM 4.1. For any integer
, there is a data
T HEOREM 3.1. Assume that each metric N
is equipped structure solving / -NN under the edit metric N , using
with a -NN data structure with query time
# and
#1436 7 space and
query time.
I
using space
# . Then, for any
, there is a data
structure for kc
(+A Q
(+
e -NN, with query time
In the following we will show a data structure for space
# using
#0(C
*-,/. 52'
S (we denote NN under the edit distance, which uses
K
).
*-,/. 5
M L ! #"$ % space and has
query time. The gen
K I eralization to the bounds stated are obtained by recursive
.
Proof: For simplicity, we focus on the case
is straightforward.
Consider two strings a
. Partition into
The data structure is constructed as follows. First, we
O
"
@
substrings
of
length
each (assuming
S
build a -NN data structure for each N . Then, we prepare
5 PRPRP
[
divides
).
a lookup table X/$
$gS
5 ]
#
PQPRP_] PQPRP ]
PQPRP
U:e
O a
and U
that U
U O a :
U2
a
f Ue
ba
O a : #(HUe
O a j
a : U
a i"( ABU2
U
O Oa : D(HA/U
O O a
( A Q U
O a :
Thus, U:
a :
U 2
O a :
( A R U:2
O
which implies U`
a :
U
O a :6
O
K
#
#
O a
O a : ,
U! O a
- 1/ 0 .
K
)(+*
S
,
&'&'&
K.-
U! O a j
:!
!
5
#
!
5 Sum-product
In this section we show how to perform an embedding
of a sum-product N of metrics N ORPQPRPRO N S into a max5
product of those metrics; the max-product includes several (scaled) copies of each N . The embedding is somewhat weak, but is nevertheless sufficient to reduce the
648
"
decision version of the problem of computing approxi- that #" contains @BA
@BA elements. The probability
"
mate nearest neighbor under sum-product of metrics to the that all of them are 2A is at most
same problem over max-product of those metrics. The lat& K
&
J
@/A @
Q@BA
ter problem can be solved using the algorithm of [Ind02].
The embedding is based on the following approach.
Thus : satisfies the conditions of the lemma.
isKY
a power of A . We start by choosing
Assume
PRPRP ,
PQPRP *-,/. . Each is chosen by picking
@/A elements
independently and uniformly at random T HEOREM 5.1. The embedding X4N [ N+: satisfies
from
PRK PRP , without replacement. In addition, we the following conditions: for any a O )N we have:
define
PQPRP .
Our mapping will map a point a
N into 1. If U
ba O B , then
K
N :
%`: O U : , which is a max-product of metrics
&
f U :
$V
a O V
Be
i
@
N PQPRP N S . Each N itself is a max-product of
metrics N
, multiplied by A . The coordinate
O
G
ba of V
bad corresponding to
O as above is defined 2. If U
ba O B Q* ,
. , then
to be a . The mapping V
bad is obtained by concatenating
f U :
$V
a O V
Be G ji G %
ba over all
.
O
The key properties of the mapping follow from the
C OROLLARY 5.1. Assume that each metric N
is
following lemma.
k
equipped
with
a
-NN
data
structure
with
query
time
N , O G we have:
L EMMA 5.1. For any a O
# and using space
# . Then there is a
k *-,/.0
*-,/. *-,/. *-,/.0# -NN data structure for N , with query time
1. If U`
ba B
, then
O
#
( *-,/. #21436527 and space M1 36527d(
#2 .
f . - 2
U
ba O h
G
O B *-,/. , then
2
U
a O h G 2 A
f -
2. If U` ba
;
6
A i
i G
&
5
5
5
649
References
650