
Approximate Nearest Neighbor under Edit Distance via Product Metrics

Piotr Indyk

Abstract

We present a data structure for the approximate nearest neighbor problem under the edit metric (which is defined as the minimum number of insertions, deletions and character substitutions needed to transform one string into another). For any integer t >= 2 and a set of n strings of length d, the data structure reports a 2^{O(t)}-approximate Nearest Neighbor for any given query string in O(d) time. The space requirement of this data structure is roughly (nd)^{O(d^{1/t})}, i.e., strongly subexponential. To our knowledge, this is the first data structure for this problem with both o(n) query time and storage subexponential in d.
1 Introduction

The Nearest Neighbor Search (NN) problem is: Given a set S of n points in a metric space M, preprocess S so as to efficiently answer queries for finding the point in S closest to a query point q. NN and its approximate versions are among the most extensively studied problems in the fields of Computational Geometry and Algorithms, resulting in the discovery of many efficient algorithms. In particular, for the case when the metric space M is a low-dimensional Euclidean space R^d, it is known how to construct data structures for exact [Cla88, Mei93] or approximate [AMN+94, Kle97, IM98, HP01] NN with query time (d + log n)^{O(1)}. Unfortunately, those data structures require space exponential in d. More recently, several data structures using space (quasi-)polynomial in n and d, and query time sublinear in n, have been discovered for approximate NN under the l_1 and l_2 [KOR98, IM98, Ind00] and l_infinity [Ind98] norms.
While many metrics of interest for nearest neighbor search are norms, quite a few of them are not. A prominent example of the latter is the Levenshtein (or edit) metric, which, for two strings s and s', is defined as the minimum number of insertions, deletions and substitutions that transform s into s'. Edit distance is a natural measure of similarity between sequences of characters, and is widely used, e.g., in computational biology and text processing. Unfortunately, the only nearest neighbor algorithms for the edit metric known so far were the trivial ones, i.e., using a linear scan (which suffers from Omega(n) query time) or using exhaustive storage (which suffers from memory requirements exponential in the length of the strings).

* Laboratory for Computer Science, MIT, Cambridge, MA 02139. This work was supported in part by NSF CAREER grant CCR-0133849.
A potential approach for devising an approximate NN data structure for the edit metric is to embed it into a normed space, and then use NN algorithms for normed spaces. In other words, we would like to map every sequence into a vector such that the distance between the vectors approximates the distance between the sequences. This approach has been successfully applied to some non-normed metric spaces. In particular, it was shown that a variant of the edit distance (called block edit distance) can be embedded into Hamming space or the l_1 norm with distortion roughly log d * log* d [CPSV00, MS00, CM02], where d is the maximum string length. The block edit distance measures the minimum number of block (not just character) operations needed to transform one string into another. Although it is plausible that the standard edit metric can be embedded into, say, the l_1 norm with low distortion, so far researchers have been unable to provide such an embedding. A (somewhat weak) lower bound of 3/2 for the distortion of any such embedding has recently been given in [ADG+03].

In this paper we present the first non-trivial approximate algorithm for NN under the edit metric. For any integer t >= 2, our algorithm solves the 2^{O(t)}-approximate NN using O(d) query time and space roughly (nd)^{O(d^{1/t})}. This, e.g., allows us to obtain a constant factor approximation using space that is exponential only in d^eps, for arbitrarily small eps > 0. Alternatively, we can use t = eps * log d, and obtain a d^{O(eps)}-approximate algorithm using polynomial space.
The result is obtained by representing the edit metric as a product of several smaller metrics, using a method first proposed in [Ind02]. A product of metrics M_1, ..., M_k with distance functions D_1, ..., D_k and a function f : R^k -> R is a metric over X_1 x ... x X_k with a distance function D such that

    D((a_1, ..., a_k), (b_1, ..., b_k)) = f(D_1(a_1, b_1), ..., D_k(a_k, b_k)).

The most typical functions f are max and sum; in general, our result holds for any b-monotone function f (see Preliminaries).
We prove the following general result about NN
for product metrics. Assume that each M_i is equipped with a c-approximate NN data structure with query time Q(n) and using space S(n). Then we can construct a data structure for O(c)-approximate NN, with query time O(k * Q(n)) and space roughly S(n) * n^k. Given the above theorem, we show that NN for the edit metric can be reduced to NN over a product of several edit metrics over shorter sequences. Then, the NN data structure for each individual metric can be constructed either recursively, or using the exhaustive storage approach.

Although the main focus of this paper is on algorithms for the edit metric, we mention that NN data structures for product metrics are also of significant interest by themselves. Product metrics arise in scenarios where one searches for an object satisfying a set of metric constraints, e.g., "Find a person with a similar signature and a similar face photo." We can answer such a query by using an NN data structure for a product of the individual metrics.

For this reason, we further study approximate NN algorithms for product metrics. In particular, we obtain the following result for the case when f is the sum. As before, assume that each metric M_i is equipped with a c-approximate NN data structure with query time Q(n) and space roughly S(n). Then we construct an O(c * log^2 k * log log n)-approximate data structure that solves the decision version (see Preliminaries) of NN for the sum-product of the M_i's, with query time Q(n) * (log n)^{O(1)} and space n^{O(1)} * S(n). Note that, unlike in the previous case, the space bound does not depend exponentially on k. However, the polylogarithmic approximation factor makes it impossible to use this data structure in the context of the edit metric.
Related results. The approach of solving NN in non-normed metrics via product metrics was introduced in [Ind02]. In that paper, the author gave an O(c * log log n)-approximate algorithm for the case when the function f computes the maximum of its arguments. The space bound of that data structure lacked the n^k term present here; other parameters were similar. The data structure was then used to provide a subexponential-space approximate NN data structure for the Frechet metric. Unfortunately, the approach of [Ind02] relied significantly on properties of the maximum function, since it essentially generalized the earlier algorithm of [Ind98] for the l_infinity norm. On the other hand, in order to obtain an algorithm for the edit metric, one needs to design an algorithm for the case when f computes the sum of its arguments. Thus, one cannot use the data structure of [Ind02] to obtain the result presented in this paper.

We mention that by using our product metric data structure, we can construct an approximation algorithm for NN under the Frechet metric, with space and query time as in [Ind02], that has only a constant approximation factor.

Our techniques. Our first data structure (for any b-monotone function f) is quite simple. Its basic idea is to partition the product metric into roughly n^k regions, within which an approximate nearest neighbor can be fixed and pre-computed during the preprocessing. In addition, we can determine the region that a given query belongs to by invoking only the nearest neighbor oracles for the individual metrics M_i.

Our second algorithm can be viewed as an abstract version of the Locality-Sensitive Hashing algorithm of [IM98]. It allows us to reduce the problem of approximate NN in sum-product metrics to approximate NN in max-product metrics, by embedding the former into the latter. Then we can use the algorithm of [Ind02] to solve the problem in the latter space.
2 Preliminaries

Let M_i = (X_i, D_i), i = 1, ..., k, be metrics, and let f : R_+^k -> R_+. The f-product of M_1, ..., M_k is defined as M = (X, D), where X = X_1 x ... x X_k and, for any a, b in X,

    D(a, b) = f(D_1(a_1, b_1), ..., D_k(a_k, b_k)).

Note that if f is such that, for all u, v in R_+^k, u <= v (coordinate-wise) implies f(u) <= f(v), and f(u + v) <= f(u) + f(v), then M is a metric.

A function f is called b-monotone if, for any lambda >= 1 and u, v in R_+^k, u <= lambda * v implies f(u) <= b * lambda * f(v). Note that the functions max and sum are 1-monotone.

We define ed(s, s') to be the edit distance between strings s and s', i.e., the minimum number of insertions, deletions and substitutions of symbols needed to transform s into s'. In addition, we set M^{ed}_d = (Sigma^{<=d}, ed), the edit metric over strings of length at most d.

The c-approximate Nearest Neighbor problem (c-NN), under a metric M = (X, D), is defined as follows: given S, a subset of X with |S| = n, build a data structure which, given any query q in X, finds a' in S such that D(q, a') <= c * min_{a in S} D(q, a). The decision version of the above problem is defined as: given S, a subset of X with |S| = n, and R > 0, build a data structure which, given any q in X such that D(q, a) <= R for some a in S, finds a' in S such that D(q, a') <= c * R. Note that R is fixed before the data structure is constructed. If the data structures are randomized, we require that the above specifications hold, for any fixed q, with constant probability. As shown in [IM98, HP01], the decision and optimization versions of approximate NN are equivalent, up to polylogarithmic factors in the space and query time bounds.

For any metric M = (X, D) and t > 0, we define t * M = (X, t * D) to be the metric such that (t * D)(a, b) = t * D(a, b) for all a, b in X.
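To make the definitions concrete, here is a small self-contained sketch (our own illustration, not part of the original paper) of an f-product of edit metrics; edit_distance is the standard dynamic program, and f can be instantiated with sum or max, both of which are 1-monotone:

```python
def edit_distance(s, t):
    # Standard dynamic program: minimum number of insertions,
    # deletions and substitutions transforming s into t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def f_product_distance(f, dists, a, b):
    # D(a, b) = f(D_1(a_1, b_1), ..., D_k(a_k, b_k))
    return f(d(x, y) for d, x, y in zip(dists, a, b))

# The f-product of two edit metrics, with f = sum and f = max.
a, b = ("kitten", "abc"), ("sitting", "abd")
dists = [edit_distance, edit_distance]
print(f_product_distance(sum, dists, a, b))  # 3 + 1 = 4
print(f_product_distance(max, dists, a, b))  # max(3, 1) = 3
```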
3 Nearest neighbor under f-product metrics

Consider metric spaces M_i = (X_i, D_i), i = 1, ..., k. Assume that all distances in the metrics M_i are integers from {0, ..., Delta}. In addition, let f : R_+^k -> R_+ be a b-monotone function. Let M = (X, D) be the f-product of M_1, ..., M_k. Let S be a subset of X with |S| = n, and define S_i = {a_i : a in S}.

THEOREM 3.1. Assume that each metric M_i is equipped with a c-NN data structure with query time Q(n) and using space S(n). Then, for any gamma > 0, there is a data structure for b^2 (2c+1)(1+gamma)-NN, with query time O(k * Q(n)) and space S(n) * k + O(n * log_{1+gamma} Delta)^k (we write log_{1+gamma} Delta for log Delta / log(1+gamma)).

Proof: For simplicity, we focus on the case gamma = 0, i.e., we do not round the distances; the generalization to gamma > 0 (obtained by rounding all distances up to powers of 1+gamma) is straightforward.

The data structure is constructed as follows. First, we build a c-NN data structure for each M_i. Then, we prepare a lookup table T : (S_1 x {0, ..., Delta}) x ... x (S_k x {0, ..., Delta}) -> S. We describe how to construct T later.

Given a query point q = (q_1, ..., q_k), the data structure proceeds as follows. First, for each i, it queries the c-NN data structure for M_i with the argument q_i. Let a^i be the point returned by the i-th data structure. Then the data structure returns the point of S stored in T under the index (a^1, D_1(q_1, a^1), ..., a^k, D_k(q_k, a^k)).

The query time and space bounds follow directly from the above description. It remains to show how to construct T so that the algorithm returns an approximate nearest neighbor.

Let q and a^1, ..., a^k be as above. Consider any point a' = (a'_1, ..., a'_k) in S. We define D~(q, a') to be an approximation of the distance D(q, a'). Specifically, we set

    D~(q, a') = f(D_1(q_1, a^1) + D_1(a^1, a'_1), ..., D_k(q_k, a^k) + D_k(a^k, a'_k)).

Observe that D~(q, a') can be computed using exclusively the information presented to the lookup table, without any further knowledge of q.

By the triangle inequality, D_i(q_i, a'_i) <= D_i(q_i, a^i) + D_i(a^i, a'_i); since f is b-monotone, this implies D(q, a') <= b * D~(q, a'). In addition, D_i(a^i, a'_i) <= D_i(a^i, q_i) + D_i(q_i, a'_i), and the c-NN guarantee gives D_i(q_i, a^i) <= c * D_i(q_i, a'_i). Therefore

    D_i(q_i, a^i) + D_i(a^i, a'_i) <= 2 D_i(q_i, a^i) + D_i(q_i, a'_i) <= (2c+1) * D_i(q_i, a'_i),

which, again by b-monotonicity, implies D~(q, a') <= b(2c+1) * D(q, a').

Therefore, if we set the entry of T indexed by (a^1, delta_1, ..., a^k, delta_k) to a point a' which minimizes f(delta_1 + D_1(a^1, a'_1), ..., delta_k + D_k(a^k, a'_k)) over a' in S, the algorithm will report a b^2 (2c+1)-approximate nearest neighbor.
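The following sketch (ours, for illustration; it assumes gamma = 0, integer distances bounded by Delta, and brute-force component "oracles", so c = 1) spells out the lookup-table construction from the proof of Theorem 3.1. It is exhaustive and therefore feasible only for tiny k, n and Delta:

```python
from itertools import product

def build_product_nn(S, dists, f, Delta):
    # S: list of k-tuples (the point set); dists: k component distance
    # functions with integer values in {0, ..., Delta}; f: b-monotone.
    k = len(dists)
    S_i = [sorted({a[i] for a in S}) for i in range(k)]   # projections S_i
    # Component "c-NN data structures": brute force, hence c = 1.
    oracles = [lambda q, i=i: min(S_i[i], key=lambda p: dists[i](q, p))
               for i in range(k)]
    # Lookup table T: for every index (a^1, d_1, ..., a^k, d_k), store the
    # point a' in S minimizing f(d_i + D_i(a^i, a'_i)).
    T = {}
    axes = [[(p, d) for p in S_i[i] for d in range(Delta + 1)]
            for i in range(k)]
    for key in product(*axes):
        T[key] = min(S, key=lambda ap: f(
            d + dists[i](anchor, ap[i])
            for i, (anchor, d) in enumerate(key)))
    return oracles, T

def query_product_nn(q, oracles, T, dists):
    # k component queries followed by a single table lookup.
    key = []
    for i, oracle in enumerate(oracles):
        a_i = oracle(q[i])
        key.append((a_i, dists[i](q[i], a_i)))
    return T[tuple(key)]
```

With f = sum and exact component oracles (b = 1, c = 1), the point returned by query_product_nn is a 3-approximate nearest neighbor, matching the b^2 (2c+1) bound of the analysis above.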

4 Edit distance

THEOREM 4.1. For any integer t >= 2, there is a data structure solving 2^{O(t)}-NN under the edit metric M^{ed}_d, using (nd)^{O(d^{1/t})} space and O(d) query time.

In the following we show a data structure for O(1)-NN under the edit distance which uses roughly (nd)^{O(sqrt(d))} space and has O(d) query time; this corresponds to the case t = 2. The general bounds stated above are obtained by recursive application of the same idea, with each level of the recursion partitioning the strings into blocks shorter by a factor of d^{1/t}.
Consider two strings x, y of length d. Partition x into k = sqrt(d) substrings x_1, ..., x_k of length sqrt(d) each (assuming, for simplicity, that sqrt(d) is an integer and divides d).
CLAIM 1. The edit distance ed(x, y) satisfies the following equality:

    ed(x, y) = min_{y = y_1 y_2 ... y_k} Sum_{i=1}^{k} ed(x_i, y_i),

where the minimum is taken over all partitions of y into k consecutive (possibly empty) substrings y_1, ..., y_k. Indeed, for any partition the block edit scripts can be concatenated, so ed(x, y) never exceeds the sum; conversely, cutting y at the positions that an optimal alignment matches to the block boundaries of x yields a partition whose sum does not exceed ed(x, y).
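Claim 1 is easy to check numerically. The sketch below (our illustration, not from the paper) compares the dynamic-programming edit distance against the best block-wise sum over all partitions of y, for small random binary strings with k = 3 blocks of length 2:

```python
import itertools, random

def edit_distance(s, t):
    # Standard dynamic program for the edit (Levenshtein) distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def blockwise_min(x_blocks, y):
    # min over all partitions y = y_1 ... y_k (pieces may be empty)
    # of the sum of block-wise edit distances, as in Claim 1.
    k = len(x_blocks)
    best = None
    for cuts in itertools.combinations_with_replacement(range(len(y) + 1), k - 1):
        bounds = (0,) + cuts + (len(y),)
        total = sum(edit_distance(x_blocks[i], y[bounds[i]:bounds[i + 1]])
                    for i in range(k))
        best = total if best is None else min(best, total)
    return best

random.seed(0)
for _ in range(200):
    x = "".join(random.choice("ab") for _ in range(6))
    y = "".join(random.choice("ab") for _ in range(random.randint(0, 8)))
    assert edit_distance(x, y) == blockwise_min([x[0:2], x[2:4], x[4:6]], y)
print("Claim 1 verified on random examples")
```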
Set k = sqrt(d). We create an auxiliary set of points S' containing all k-tuples (y_1, ..., y_k) such that y_1 y_2 ... y_k is in S. By Claim 1, the (approximate) nearest neighbor of a point x in the set S with respect to ed can be computed by finding the (approximate) nearest neighbor of the query point x' = (x_1, ..., x_k) in S' with respect to the distance Sum_i ed(x_i, .) (informally, we can combine the min from the nearest neighbor definition with the min of Claim 1 into one min). The latter distance function is a sum-product of k copies of the edit metric. The key property of this transformation, however, is that each of the x_i's has length sqrt(d). Therefore, the c-NN problem under the component metrics can be trivially solved with O(sqrt(d)) query time and space roughly |Sigma'|^{O(sqrt(d))} (where Sigma' is the effective alphabet), via exhaustive storage of the answers to all possible queries. At the same time, the sum function is 1-monotone. Thus, using the results from the previous section, we can build a data structure solving the O(1)-NN problem under the Sum_i ed metric. Since each projection S'_i consists of substrings of the strings in S, so that |S'_i| <= n d^2, and k = sqrt(d), the space bound follows.
5 Sum-product

In this section we show how to perform an embedding of a sum-product M of metrics M_1, ..., M_k into a max-product of those metrics; the max-product includes several (scaled) copies of each M_i. The embedding is somewhat weak, but it is nevertheless sufficient to reduce the decision version of the problem of computing an approximate nearest neighbor under a sum-product of metrics to the same problem over a max-product of those metrics. The latter problem can be solved using the algorithm of [Ind02].

The embedding is based on the following approach. Assume k is a power of 2. We start by choosing sets R_0, R_1, ..., R_{log k}, each a subset of {1, ..., k}. Each R_j is chosen by picking k/2^j elements independently and uniformly at random from {1, ..., k}, without replacement; in addition, we define R_0 = {1, ..., k}.

Our mapping Phi will map a point a in M into M' = (X', D'), which is a max-product of the metrics M^{R_0}, ..., M^{R_{log k}}. Each M^{R_j} is itself a max-product of the metrics M_i for i in R_j, multiplied by 2^j (recall the definition of a scaled metric from the Preliminaries). The coordinate Phi_{j,i}(a) of Phi(a) corresponding to j and i in R_j is defined to be a_i, and the mapping Phi(a) is obtained by concatenating the Phi_{j,i}(a) over all j and i. Thus, for any a, b,

    D'(Phi(a), Phi(b)) = max_j 2^j * max_{i in R_j} D_i(a_i, b_i).
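As a quick illustration (ours; the constants follow the reconstruction below), the following sketch samples the sets R_j and evaluates the embedded distance D' on precomputed coordinate distances D_i(a_i, b_i):

```python
import math, random

def sample_sets(k, rng):
    # R_0 = {1, ..., k}; R_j = k / 2^j indices sampled without replacement.
    L = int(math.log2(k))
    return [list(range(k))] + [rng.sample(range(k), k // 2**j)
                               for j in range(1, L + 1)]

def embedded_distance(sets, coord_dists):
    # D'(Phi(a), Phi(b)) = max_j 2^j * max_{i in R_j} D_i(a_i, b_i)
    return max(2**j * max(coord_dists[i] for i in R_j)
               for j, R_j in enumerate(sets))

rng = random.Random(1)
k = 64
sets = sample_sets(k, rng)

# Coordinate distances for a "close" pair (sum-product distance 1) and a
# "far" pair with one large coordinate (sum-product distance 40 * log2(k)).
close = [1.0 / k] * k
far = [0.0] * k
far[7] = 40 * math.log2(k)
print(embedded_distance(sets, close))  # stays below 4 * log2(k) = 24
print(embedded_distance(sets, far))    # 240.0: exceeds the threshold via R_0
```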
The key properties of the mapping Phi follow from the following lemma.

LEMMA 5.1. For any a, b in M and R > 0 we have:

1. If D(a, b) <= R, then Pr[D'(Phi(a), Phi(b)) <= 4R log k] >= 3/4.

2. If D(a, b) >= 16R log^2 k, then Pr[D'(Phi(a), Phi(b)) > 4R log k] >= 1 - 1/e.

Proof: We start from part (1). Consider first R_0: since D(a, b) = Sum_i D_i(a_i, b_i) <= R, we have 2^0 * max_i D_i(a_i, b_i) <= R deterministically. Now fix j >= 1. By the Markov inequality, the fraction of indices i such that D_i(a_i, b_i) > 4R log k / 2^j is at most 2^j / (4k log k). Thus, the probability that R_j contains any such index is at most (k/2^j) * 2^j / (4k log k) = 1/(4 log k). Therefore, the probability that 2^j * max_{i in R_j} D_i(a_i, b_i) <= 4R log k holds simultaneously for all j = 1, ..., log k is at least 1 - 1/4 = 3/4.

Now we consider part (2). Let s_j denote the number of indices i such that D_i(a_i, b_i) > 4R log k / 2^j. We will show first that there exists j such that s_j >= 2^j. Indeed, assume otherwise, i.e., s_j < 2^j for all j = 0, ..., log k. Then we could upper bound D(a, b) by summing the contributions of the coordinates level by level:

    D(a, b) <= Sum_{j=1}^{log k} s_j * (4R log k / 2^{j-1}) + k * (4R log k / k) < 8R log^2 k + 4R log k < 16R log^2 k,

which yields a contradiction.

By the above argument, there exists j such that s_j >= 2^j. Consider the following two cases:

Case 1: j = 0. In this case some index i satisfies D_i(a_i, b_i) > 4R log k and, since R_0 = {1, ..., k}, this index is always included; thus Phi satisfies the conditions of the lemma with probability 1.

Case 2: j >= 1. The set R_j contains k/2^j elements chosen at random. The probability that all of them avoid the s_j >= 2^j indices from above is at most (1 - 2^j/k)^{k/2^j} <= 1/e. Thus, with probability at least 1 - 1/e, some i in R_j satisfies 2^j * D_i(a_i, b_i) > 4R log k, and Phi satisfies the conditions of the lemma.

THEOREM 5.1. The embedding Phi : M -> M' satisfies the following conditions, for any a, b in M and R > 0:

1. If D(a, b) <= R, then Pr[D'(Phi(a), Phi(b)) <= 4R log k] >= 3/4.

2. If D(a, b) >= 16R log^2 k, then Pr[D'(Phi(a), Phi(b)) > 4R log k] >= 1 - 1/e.

COROLLARY 5.1. Assume that each metric M_i is equipped with a c-NN data structure with query time Q(n) and using space S(n). Then there is an O(c * log^2 k * log log n)-NN data structure for the decision version of NN under the sum-product M, with query time Q(n) * (log n)^{O(1)} and space n^{O(1)} * S(n).

Proof: Follows by combining Theorem 5.1 and the algorithm of [Ind02]. Specifically, we set c' = O(c * log log n). By scaling, we can assume that the threshold R in the decision version of approximate NN is equal to 1. Assume that there is a data point a which is within distance 1 from the query q. By performing the embedding as in Theorem 5.1 we are guaranteed that:

- the distance between Phi(q) and Phi(a) does not increase beyond 4 log k, with probability at least 3/4;

- the distance between Phi(q) and any Phi(a') such that D(q, a') >= 16 log^2 k remains greater than 4 log k, with probability at least 1 - 1/e.

Thus, with constant probability, a c'-approximate nearest neighbor of Phi(q) under D' is a correct solution to the decision version of O(c' * log^2 k)-approximate NN for q. The probability can be amplified by building O(log n) data structures (each with its own independent embedding) and using all of them. It then suffices to solve the decision version of a c'-approximate NN problem under M'. Since M' is a max-product metric in which each component is equipped with a c-approximate NN data structure, we can solve the latter problem using the result of [Ind02] (as mentioned in the Introduction).

6 Conclusions

In this paper we showed how to solve NN in edit metrics by designing a new data structure for product metrics. The approach undertaken in this paper is fairly general, and can probably be used for other metrics which do not seem
amenable to constant distortion embeddings into normed spaces. However, the latter option would still be more attractive, due to much lower space requirements.
This work raises several interesting open problems on the complexity of the nearest neighbor problem under product metrics. One natural question is whether it is possible to improve the O(c) approximation factor, and maybe even solve the exact NN problem. However, there is evidence that a factor of this order could be tight, even for the case when f = max. Indeed, assume we could get an O(1)-NN data structure for max-products with space only singly exponential in k. Then we could obtain an O(1)-approximation algorithm for NN under the l_infinity norm in d dimensions that uses space only exponential in sqrt(d), by setting k = sqrt(d). By the result of [Ind98], this would imply a data structure for the partial match query over {0,1}^d, with space only exponential in sqrt(d). Such a result, although possible, appears to be somewhat unlikely (see [CIP02] for more background on the partial match problem).
References

[ADG+03] A. Andoni, M. Deza, A. Gupta, P. Indyk, and S. Raskhodnikova. Lower bounds for embedding of edit distance into normed spaces. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2003.

[AMN+94] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.

[CIP02] M. Charikar, P. Indyk, and R. Panigrahy. New algorithms for subset query, partial match, orthogonal range searching and related problems. International Colloquium on Automata, Languages, and Programming, 2002.

[Cla88] K. Clarkson. A randomized algorithm for closest-point queries. SIAM Journal on Computing, 17:830-847, 1988.

[CM02] G. Cormode and S. Muthukrishnan. The string edit distance matching problem with moves. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2002.

[CPSV00] G. Cormode, M. Paterson, C. Sahinalp, and U. Vishkin. Communication complexity of document exchange. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2000.

[HP01] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.

[Ind98] P. Indyk. On approximate nearest neighbors in the l_infinity norm. Journal of Computer and System Sciences, to appear. Preliminary version appeared in Proceedings of the Symposium on Foundations of Computer Science, 1998.

[Ind00] P. Indyk. Dimensionality reduction techniques for proximity problems. Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms, 2000.

[Ind02] P. Indyk. Approximate nearest neighbor algorithms for Frechet metric via product metrics. Symposium on Computational Geometry, 2002.

[Kle97] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.

[KOR98] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614-623, 1998.

[Mei93] S. Meiser. Point location in arrangements of hyperplanes. Information and Computation, 106:286-303, 1993.

[MS00] S. Muthukrishnan and C. Sahinalp. Approximate sequence nearest neighbors. Proceedings of the Symposium on Theory of Computing, 2000.

