
SOVIET PHYSICS DOKLADY

VOL.

10,

NO.

FEBRUARY,

1966

CYBERNETICS AND CONTROL THEORY

BINARY CODES CAPABLE OF CORRECTING DELETIONS, INSERTIONS, AND REVERSALS

V. I. Levenshtein
(Presented by Academician P. S. Novikov, January 4, 1965) Translated from Doklady Akademii Nauk SSSR, Vol. 163, No. 4, pp. 845-848, August, 1965. Original article submitted January 2, 1965

Investigations of transmission of binary information usually consider a channel model in which failures of the type $0 \to 1$ and $1 \to 0$ (which we will call reversals) are admitted. In the present paper (as in [1]) we investigate a channel model in which it is also possible to have failures of the form $0 \to \Lambda$, $1 \to \Lambda$, which are called deletions, and failures of the form $\Lambda \to 0$, $\Lambda \to 1$, which are called insertions (here $\Lambda$ is the empty word). For such channels, by analogy to the combinatorial problem of constructing optimal codes capable of correcting s reversals, we will consider the problem of constructing optimal codes capable of correcting deletions, insertions, and reversals.

1. Codes Capable of Correcting Deletions and Insertions

By a binary word we will mean a word in the alphabet {0, 1}. By a code we will mean an arbitrary set of binary words of fixed length.¹ We will say that a code K can correct s deletions (s insertions) if any binary word can be obtained from no more than one word in K by s or fewer deletions (insertions). This last property guarantees the possibility of unique determination of the initial code word from a word obtained as the result of some number i ($i \ge 0$) of deletions and some number j ($j \ge 0$) of insertions if $i + j \le s$. The following assertion shows that all of the codes defined above are equivalent.

Lemma 1. Any code that can correct s deletions (like any code that can correct s insertions) can correct s deletions and insertions.

Proof (by contradiction). Assume that the same word z is obtained from a word x of length n by $i_1$ deletions and $j_1$ insertions, where $i_1 + j_1 \le s$, and from a word y of length n by $i_2$ deletions and $j_2$ insertions, where $i_2 + j_2 \le s$. If the symbols that were inserted into (deleted from) at least one of the words x or y to obtain z are deleted from (inserted into) the word z, then, as we can easily see, we obtain a word that can be obtained from both x and y by no more than $\max(i_2 + j_1,\ j_2 + i_1)$ deletions (insertions). Because x and y have the same length, $j_1 - i_1 = j_2 - i_2$ and, consequently, $i_2 + j_1 = j_2 + i_1 = \frac{1}{2}(i_1 + i_2 + j_1 + j_2) \le s$, which proves Lemma 1.

Codes that can correct s deletions and insertions admit another, metric, description. Consider a function $\rho(x, y)$ defined on pairs of binary words and equal to the smallest number of deletions and insertions that transform the word x into y. It is not difficult to show that the function $\rho(x, y)$ is a metric, and that a code K can correct s deletions and insertions if and only if $\rho(x, y) > 2s$ for any two different words x and y in K.

Let $B_n$ be the set of all binary words of length n. For an arbitrary word x in $B_n$, let $|x|$ denote the number of ones in x, and let $\|x\|$ be the number of runs² in the word x. We will now estimate the number $P_s(x)$ [$Q_s(x)$] of different words that can be obtained from x by s deletions (s insertions). We have the bounds
$$C_{\|x\|-s+1}^{s} \le P_s(x) \le C_{\|x\|+s-1}^{s}, \qquad (1)$$

$$\sum_{i=0}^{s} C_n^{i}\, 2^{s-i} \le Q_s(x) \le \sum_{i=0}^{s} C_n^{i}\, C_s^{i}\, 2^{s-i}. \qquad (2)$$
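As a sanity check on bounds (1) and (2), the quantities $P_s(x)$ and $Q_s(x)$ can be counted by brute force for short words. The sketch below is illustrative Python and not part of the paper; the helper names (`runs`, `deletions`, `insertions`, `bounds_hold`, `all_words`) are hypothetical. The bound formulas coded in `bounds_hold` are the forms suggested by the proof outline in the text, namely $C_{\|x\|-s+1}^{s} \le P_s(x) \le C_{\|x\|+s-1}^{s}$ and $\sum_i C_n^i 2^{s-i} \le Q_s(x) \le \sum_i C_n^i C_s^i 2^{s-i}$, and should be read as an assumption about the damaged equations.

```python
from itertools import combinations, product
from math import comb


def all_words(n: int) -> list[str]:
    """All binary words of length n (the set B_n)."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def runs(x: str) -> int:
    """Number of maximal runs of identical symbols, written ||x|| in the text."""
    return sum(1 for i, c in enumerate(x) if i == 0 or c != x[i - 1])


def deletions(x: str, s: int) -> set[str]:
    """Distinct words obtained from x by deleting exactly s symbols (s <= len(x))."""
    n = len(x)
    return {"".join(x[i] for i in keep) for keep in combinations(range(n), n - s)}


def insertions(x: str, s: int) -> set[str]:
    """Distinct binary words obtained from x by inserting exactly s symbols."""
    words = {x}
    for _ in range(s):
        words = {w[:i] + b + w[i:] for w in words for i in range(len(w) + 1) for b in "01"}
    return words


def bounds_hold(x: str, s: int) -> bool:
    """Check the assumed forms of bounds (1) and (2) for one word x."""
    n, r = len(x), runs(x)
    p, q = len(deletions(x, s)), len(insertions(x, s))
    # math.comb rejects negative arguments, so guard the lower bound in (1).
    p_lo = comb(r - s + 1, s) if r - s + 1 >= 0 else 0
    p_hi = comb(r + s - 1, s)
    q_lo = sum(comb(n, i) * 2 ** (s - i) for i in range(s + 1))
    q_hi = sum(comb(n, i) * comb(s, i) * 2 ** (s - i) for i in range(s + 1))
    return p_lo <= p <= p_hi and q_lo <= q <= q_hi
```

For instance, `runs("01101")` is 4, in agreement with the example in the footnote on runs, and for s = 1 both pairs of bounds collapse to the equalities $P_1(x) = \|x\|$ and $Q_1(x) = n + 2$ noted in the text.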

In order to prove the upper bound in (1), note that each word obtained by deletion from x is uniquely determined by the number of symbols deleted
¹The definitions given below are also meaningful if the code is taken to mean an arbitrary set of words (possibly of different lengths) in some alphabet containing r letters ($r \ge 2$). We should note, however, that in the case of words of different length Lemma 1 is generally not true.
²By a run in a word x we mean a maximal subword consisting of identical symbols. For example, the word x = 01101 has 4 runs.


from each run, so $P_s(x)$ is no greater than the number of ways of choosing s items, with repetition, from $\|x\|$ kinds. On the other hand, it is easy to see that if one symbol is eliminated from each of any s pairwise nonadjacent runs in x, all of the words thus obtained will be different. This leads to the lower bound in (1), if we note that the number of such words is equal to the number of ordered partitions of the number $\|x\| - s$ into $s + 1$ non-negative terms, where only two (the first and the last) may, perhaps, be equal to zero.

The upper bound in (2) follows from the fact that each word obtained from $x = \sigma_1 \ldots \sigma_n$ by s insertions can be obtained in the following manner. For some i (i = 0, 1, ..., s), choose i indices $n_1, \ldots, n_i$ ($1 \le n_1 < \ldots < n_i \le n$) and $i + 1$ words $\beta_1, \ldots, \beta_i, \beta_{i+1}$ such that the sum of their lengths is s and such that each of the first i words $\beta_j$ is nonempty and does not end in the symbol $\sigma_{n_j}$; then, insert each word $\beta_j$ (j = 1, ..., i) into the word x before the symbol $\sigma_{n_j}$, and insert $\beta_{i+1}$ after the symbol $\sigma_n$. The lower bound in (2) follows from the fact that if each of the words $\beta_1, \ldots, \beta_i$ has length 1, all of the words obtained from x in this way are different. We should note that (1) and (2) imply that $P_1(x) = \|x\|$ and $Q_1(x) = n + 2$.

Let $L_s(n)$ denote the power (number of words) of a maximal code in $B_n$ that can correct s deletions and insertions.

Lemma 2.³ For fixed s and $n \to \infty$,

$$2^{s}(s!)^{2}\, 2^{n} / n^{2s} \lesssim L_s(n) \lesssim s!\, 2^{n} / n^{s}. \qquad (3)$$

Proof. Let K be a maximal code in $B_n$ that can correct s deletions and insertions, and for arbitrary k ($1 \le k \le n/2$), let $L_s(n) = L_k' + L_k''$, where $L_k'$ is the number of words $x \in K$ such that $k \le \|x\| \le n - k$. By the definition of K, $\sum_{x \in K} P_s(x) \le 2^{n-s}$, and because of maximality, $\sum_{x \in K} R_{2s}(x) \ge 2^{n}$, where $R_{2s}(x)$ is the number of words at a distance of 2s or less [in the metric $\rho(x, y)$] from x. It follows from (1) and (2) that $2^{n-s} \ge L_k'\, C_{k-s+1}^{s}$ and $2^{n} \le \ldots$

Estimate (3) follows from these last inequalities when we note that

$$L_k'' \le 2\Bigl(\sum_{i=1}^{k-1} C_{n-1}^{i-1} + \sum_{i=n-k+1}^{n} C_{n-1}^{i-1}\Bigr) = o\!\left(\frac{2^{n}}{n^{2s}}\right)$$

when $k = [n/2 - (sn \ln n)^{1/2}]$ (since the number of words in $B_n$ with i runs is $2C_{n-1}^{i-1}$, and we use the fact that $\sum_{i=0}^{k} C_n^{i} = o(2^{n}/n^{2s})$ for this k; see, for example, [2]).

Theorem 1.

$$L_1(n) \sim 2^{n} / n. \qquad (4)$$

Proof. In virtue of Lemma 2, it is sufficient to prove that

$$L_1(n) \ge 2^{n} / (n + 1). \qquad (5)$$

In order to prove this, we will use one of the Varshamov-Tenengol'ts [3] constructions. Consider the class of codes $K^{a}_{n,m}$, where each $K^{a}_{n,m}$ (a = 0, 1, ..., m - 1) is defined as the set of words $\sigma_1 \ldots \sigma_n$ in $B_n$ such that $\sum_{i=1}^{n} \sigma_i i \equiv a \pmod{m}$. We will show that for $m \ge n + 1$, each code $K^{a}_{n,m}$ can correct one deletion.

As the result of one deletion, assume that a word $x = \sigma_1 \ldots \sigma_n$ in $K^{a}_{n,m}$ has been transformed into the word $x' = \sigma_1' \ldots \sigma_{n-1}'$. We can then assume that we know $|x'|$ and the smallest non-negative residue of $a - \sum_{i=1}^{n-1} \sigma_i' i$ mod m, which we will denote by $a'$. In order to restore the word x from $x'$ it is clearly sufficient to know: 1) which of the binary symbols 0 or 1 has been eliminated and 2) either the number (which we denote by $n_0$) of zeros to the left of the deleted symbol if this symbol is 1, or the number (which we denote by $n_1$) of ones to the right of the deleted symbol, if this symbol is 0. But it follows from the definition of $K^{a}_{n,m}$ and of the numbers $n_0$ and $n_1$ that when $m \ge n + 1$ we have either $a' = |x'| + 1 + n_0$ (if the symbol 1 has been deleted) or $a' = n_1$ (if the symbol 0 has been deleted), and $n_1 \le |x'|$. As a result, depending on whether $a'$ is larger than $|x'|$ or not, we can determine which of the binary symbols has been deleted, and then find $n_0$ or $n_1$.

As a result, by Lemma 1, each code $K^{a}_{n,m}$ can, for $m \ge n + 1$, correct one deletion or insertion. Since each of the words in $B_n$ belongs to exactly one of the m codes $K^{a}_{n,m}$ (a = 0, 1, ..., m - 1), at least one of these codes contains no fewer than $2^{n}/m$ words, which, for m = n + 1, yields estimate (5).

2. Codes that Can Correct Deletions, Insertions, and Reversals

We will say that a code K can correct s deletions, insertions, and reversals if any binary word can be obtained from no more than one word in K by s or fewer deletions, insertions,

³In what follows the notation $f(n) \lesssim g(n)$ will mean that $\lim_{n \to \infty} f(n)/g(n) \le 1$, while the notation $f(n) \sim g(n)$ will mean that $\lim_{n \to \infty} f(n)/g(n) = 1$.


or reversals. It can be shown that the function $r(x, y)$ defined on pairs of binary words as equal to the smallest number of deletions, insertions, and reversals that will transform x into y is a metric, and that a code K can correct s deletions, insertions, and reversals if and only if $r(x, y) > 2s$ for any two different words x and y in K. Let $M_s(n)$ denote the power of the maximal code in $B_n$ that can correct s deletions, insertions, and reversals.

Theorem 2.

$$2^{n-1} / n \le M_1(n) \le 2^{n} / (n + 1). \qquad (6)$$

Proof. The upper bound is Hamming's estimate [4] for codes that can correct one reversal. In order to prove the lower bound, it is sufficient to show that all of the codes $K^{a}_{n,m}$ defined in the proof of Theorem 1 are, when $m \ge 2n$, capable of correcting one deletion, insertion, or reversal. The fact that these codes can correct deletions or insertions has already been proved. We should note, furthermore, that if no more than one reversal is required to change a word $\sigma_1 \ldots \sigma_n$ in $K^{a}_{n,m}$ into a word $\sigma_1' \ldots \sigma_n'$, the smallest of the non-negative residues of $a - \sum_{i=1}^{n} \sigma_i' i$ and $\sum_{i=1}^{n} \sigma_i' i - a$ mod 2n is larger than or equal to j, where j is the index of the reversed symbol (or j = 0 if there is no reversal).

By using the same method as we used to prove Lemma 2, we can show that for fixed s and $n \to \infty$

S
(2 )! I

V
^C -C J) ^ M . (n) ^ si (7)
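The decoding arguments in the proofs of Theorems 1 and 2 can be sketched in code. The following Python is an illustration under stated assumptions, not the paper's own procedure: the function names are hypothetical, words are represented as strings over "0" and "1", and `correct_insertion` is a brute-force stand-in (the paper only asserts, via Lemma 1, that insertions are correctable). For the reversal case the sketch relies on the observation that, for $m \ge 2n$, the smaller of the two residues is exactly the index j of the reversed symbol, because the other residue, m - j, is at least j.

```python
from itertools import product


def weighted_sum(x: str) -> int:
    """Sum of i * sigma_i over the 1-based positions i of the word x."""
    return sum(i for i, c in enumerate(x, start=1) if c == "1")


def code(n: int, m: int, a: int) -> list[str]:
    """All words of length n whose weighted sum is congruent to a mod m."""
    words = ("".join(bits) for bits in product("01", repeat=n))
    return [w for w in words if weighted_sum(w) % m == a % m]


def correct_deletion(y: str, n: int, m: int, a: int) -> str:
    """Restore the code word (m >= n + 1) from y, the result of one deletion."""
    assert len(y) == n - 1 and m >= n + 1
    ones = y.count("1")                  # |x'| in the text
    deficit = (a - weighted_sum(y)) % m  # a' in the text
    if deficit <= ones:
        # A 0 was deleted; exactly n1 = deficit ones lie to its right.
        n1, pos = deficit, len(y)
        while n1 > 0:
            pos -= 1
            if y[pos] == "1":
                n1 -= 1
        return y[:pos] + "0" + y[pos:]
    # A 1 was deleted; exactly n0 zeros lie to its left.
    n0, pos = deficit - ones - 1, 0
    while n0 > 0:
        if y[pos] == "0":
            n0 -= 1
        pos += 1
    return y[:pos] + "1" + y[pos:]


def correct_insertion(y: str, n: int, m: int, a: int) -> str:
    """Brute force: the unique code word reachable from y by one deletion."""
    assert len(y) == n + 1 and m >= n + 1
    candidates = {y[:i] + y[i + 1:] for i in range(len(y))}
    (x,) = [w for w in candidates if weighted_sum(w) % m == a % m]
    return x


def correct_reversal(y: str, n: int, m: int, a: int) -> str:
    """Restore the code word (m >= 2n) from y, which has at most one reversal."""
    assert len(y) == n and m >= 2 * n
    down = (a - weighted_sum(y)) % m  # equals j if a 1 at position j became 0
    up = (weighted_sum(y) - a) % m    # equals j if a 0 at position j became 1
    j = min(down, up)
    if j == 0:
        return y
    flipped = "1" if y[j - 1] == "0" else "0"
    return y[: j - 1] + flipped + y[j:]
```

As a small example, for n = 3, m = 4, a = 0 the class consists of the words 000 and 101, and `correct_deletion("11", 3, 4, 0)` returns `"101"`. The received length tells the decoder which of the three routines applies, since a deletion, an insertion, and a reversal give lengths n - 1, n + 1, and n respectively.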

3. Use of Codes for Transmission Over Channels that Delete, Insert, and Reverse (Without Synchronizing Symbols)

Let $Z'_{s,n}$ ($Z''_{s,n}$; $Z_{s,n}$; $M_{s,n}$) denote a channel in which no more than s deletions (insertions; deletions and insertions; deletions, insertions, and reversals) occur in each segment of length n. We agree to write the sequence obtained at a channel output from an arbitrary infinite sequence $z_1 z_2 \ldots$ of words in a code J in the form $z_1' z_2' \ldots$, where $z_i'$ denotes the word obtained from the code word $z_i$ as the result of failures in the channel. We will call a code J admissible for a given channel if there exists a finite automaton⁴ that maps any sequence $z_1' z_2' \ldots$ into the sequence $z_1 z_2 \ldots$. In order for a code J to be admissible for the channels defined above, it is necessary (but generally not sufficient) that it be a code capable of correcting s failures of the appropriate types. The following assertion is useful for construction of admissible codes: for any binary words $\alpha$ and $\beta$, the codes K and $\alpha K \beta = \{\alpha x \beta,\ x \in K\}$ can correct the same number of failures of the types under discussion. This statement follows from the obvious equations $\rho(\alpha x \beta, \alpha y \beta) = \rho(x, y)$, $r(\alpha x \beta, \alpha y \beta) = r(x, y)$. In what follows, the word $\beta\alpha$ will play the role of a separator between code words, although it is generally distorted by the channel.

We should also note the important fact that in contrast to the case of the channel $Z'_{s,n}$, in the case of the channels $Z''_{s,n}$, $Z_{s,n}$, $M_{s,n}$ no code J permits, when $s \ge 2$ (i.e., in channels with two or more insertions), determination of the end of the word $z_1'$ from any sequence $z_1' z_2' \ldots$. This means that, in the cases indicated, decoding must start with the assumption that not only can there be failures in the channel, but there can be failures due to improper location of the beginning of a word $z_i'$ (decoding failures). The idea at the basis of the constructions proposed below for the indicated channels is that as a result of treating decoding failures as channel failures, no more than s failures occur in each code word. This is achieved by decreasing the length of the code and appropriately selecting a separator $\beta\alpha$.
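The two metrics and the padding identities $\rho(\alpha x \beta, \alpha y \beta) = \rho(x, y)$ and $r(\alpha x \beta, \alpha y \beta) = r(x, y)$ can be spot-checked numerically. The Python below is a minimal sketch with hypothetical function names. It computes $\rho$ through the longest common subsequence and $r$ through the usual edit-distance recurrence, treating a reversal as a substitution, which is valid for a binary alphabet; neither algorithm is taken from the paper. The predicate `corrects` encodes the criteria stated in the text, that a code corrects s failures if and only if the relevant distance exceeds 2s for every pair of distinct code words.

```python
from itertools import combinations, product


def rho(x: str, y: str) -> int:
    """Smallest number of deletions and insertions transforming x into y."""
    # rho = len(x) + len(y) - 2 * LCS(x, y), with the LCS built row by row.
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return len(x) + len(y) - 2 * prev[len(y)]


def r(x: str, y: str) -> int:
    """Smallest number of deletions, insertions, and reversals from x to y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # keep or reverse
        prev = cur
    return prev[len(y)]


def corrects(code_words: list[str], s: int, dist) -> bool:
    """True if every pair of distinct code words is at distance > 2s."""
    return all(dist(x, y) > 2 * s for x, y in combinations(code_words, 2))


def padding_invariant(dist, pads: list[str], max_len: int) -> bool:
    """Check dist(a + x + b, a + y + b) == dist(x, y) on all short words."""
    words = ["".join(w) for k in range(max_len + 1) for w in product("01", repeat=k)]
    return all(dist(a + x + b, a + y + b) == dist(x, y)
               for x in words for y in words for a in pads for b in pads)
```

For example, `rho("0", "1")` is 2 while `r("0", "1")` is 1, since a single reversal replaces one deletion plus one insertion. The four words 0000, 0110, 1001, 1111 (the class a = 0, m = 5 of the construction used for Theorem 1 with n = 4) satisfy `corrects(..., 1, rho)` but not `corrects(..., 1, r)`, which is consistent with the larger modulus required in Theorem 2.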
The following statements hold: 1) if a code K in $B_{n-2s-1}$ can correct s deletions, then the code J = K 1s >0s is admissible for the channel $Z'_{s,n}$; 2) if a code K in $B_{n-4s}$ can correct s insertions, then J=K^j SqS is admissible for the channel $Z''_{s,n}$; 3) if K B n- 4(s + 1 )2 .2s can correct s deletions, insertions, and reversals (insertions and deletions), J = K ^ (ls+1 0s + 1)s + 11s is admissible⁵ for the channel $M_{s,n}$ ($Z_{s,n}$).

⁴In some generalized sense (see, for example, [5]).
⁵It can be shown that if a code K in $B_{n-7}$ can correct one deletion, insertion, or reversal (e.g., $K = K^{a}_{n-7,\,2(n-7)}$), the code J = K11>01 is admissible.

LITERATURE CITED

1. F. F. Sellers Jr., IRE Trans., IT-8, No. 1 (1962).
2. W. Feller, An Introduction to Probability Theory and Its Applications [Russian translation], 1964.
3. R. R. Varshamov and G. M. Tenengol'ts, Avtomatika i telemekhanika, 26, No. 2 (1965).
4. R. W. Hamming, Bell Syst. Techn. J., 29, No. 2 (1950).
5. V. I. Levenshtein, Problemy kibernetiki, No. 11, 1964.

All abbreviations of periodicals in the above bibliography are letter-by-letter transliterations of the abbreviations as given in the original Russian journal. Some or all of this periodical literature may well be available in English translation. A complete list of the cover-to-cover English translations appears at the back of this issue.