
136 Bioinformatics I, WS09-10, J. Fischer (script by J. Fischer and D. Huson), February 3, 2010


9 Suffix Trees
This lecture is based on the following sources, which are all recommended reading:
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge UP, 1997.
M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2002.
M. Crochemore, C. Hancart, T. Lecroq, Algorithms on Strings, Cambridge UP, 2007.
R. Giegerich, S. Kurtz, J. Stoye, Efficient implementation of lazy suffix trees, Software: Practice and Experience 33(11): 1035-1049, 2003.
S. Kurtz, Foundations of Sequence Analysis, Universität Bielefeld, 2001.
9.1 Introduction
The first fact of biological sequence analysis: In biomolecular sequences, high sequence similarity usually implies significant functional or structural similarity.
"We didn't know it at the time, but we found everything in life is so similar, that the same genes that work in flies are the ones that work in humans."
(Eric Wieschaus, co-winner of the 1995 Nobel Prize in Medicine for work on the genetics of Drosophila development.)¹
Problem. Assume we are given a long text T and many short queries q_1, ..., q_k. For each query sequence q_i, find all its occurrences in T.
We would like to have a data structure that allows us to solve this problem efficiently.
Important applications are in the comparison of genomes (in programs such as MUMmer, which computes alignments between genomic sequences based on maximal unique matches) and in the analysis of repeats (in programs such as REPuter).
For example, the text T might be a genomic sequence and the queries might be short words such as transcription factor binding sites.
Text T =
tttttttgagacggagtctcgctctgtcgcccaggctggagtgcagt
ggcgggatctcggctcactgcaagctccgcctcccgggttcacgcca
tctcctgcctcagcctcccaagtagctgggactacaggcgcccgcca
cggctaattttttgtatttttagtagagacggggtttcaccgtttta
cgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcct
cccaaagtgctgggattacaggcgt
Query q = tttta
Find all occurrences of the query q in the text T.
¹ From Dan Gusfield, 1997, p. 212.
9.2 Basic definitions
Definition 9.2.1 (Prefix, suffix, i-th suffix) We use Σ to denote an alphabet and Σ* for the set of all strings over Σ. We use ε to denote the empty string and Σ⁺ = Σ* \ {ε}.
Let T = t_1 t_2 ... t_n be a text. A prefix of T is a substring of T beginning at the first position in T. A suffix of T is a substring of T ending at the last position in T.
In bioinformatics, the alphabet Σ is usually of size 4 or 20. In other applications, the alphabet can be much larger; for example, in the analysis of web-surfing patterns, Σ consists of the set of all links contained in a collection of web sites.
9.3 The role of suffixes and the sentinel $
Consider the text abab.
It has the following suffixes: abab, bab, ab and b.
The sentinel $: In the following, we want to ensure that no suffix is a prefix of any other. To do so, we append a special character $ ∉ Σ to the end of the text.
Now, consider the text abab$.
It has the following suffixes: abab$, bab$, ab$, b$, and $.
Queries are prefixes of suffixes: To determine whether a given query q is contained in the text, we could simply check whether q is a prefix of one of the suffixes.
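This naive strategy is easy to state in code (a Python sketch for illustration only; it takes time proportional to |T| times |q| per query, which is exactly what the suffix tree will avoid):

```python
def occurrences(T, q):
    # position i is an occurrence of q exactly when q is a
    # prefix of the suffix of T starting at i (0-indexed)
    return [i for i in range(len(T)) if T[i:i + len(q)] == q]
```

For example, occurrences("abab", "ab") yields [0, 2].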
9.4 Sharing prefixes
E.g., the query ab is a prefix of both abab$ and ab$.
To speed up the search for all suffixes that have the query as a prefix, we use a tree structure to share common prefixes between the suffixes.
(a) The suffixes abab$ and ab$ both share the prefix ab.
(b) The suffixes bab$ and b$ both share the prefix b.
(c) The suffix $ doesn't share a prefix.
[Figure: three small trees: (a) a path for the shared prefix ab branching into ab$ and $; (b) a path for the shared prefix b branching into ab$ and $; (c) a single edge labeled $.]
Here is an example of how the idea of searching the prefix of every suffix can be optimized by sharing prefixes whenever possible:
[Figure: two views of the suffixes of abab$: the uncompacted trie and the compacted suffix tree, with leaves labeled 1-5 by the start positions of the corresponding suffixes.]
A suffix tree for abab$ is obtained by sharing prefixes whenever possible. The leaves are annotated by the positions of the corresponding suffixes in the text.
9.5 Σ⁺-trees
Definition 9.5.1 (Σ⁺-tree) A compact Σ⁺-tree is a rooted tree S = (V, E) with edge labels from Σ⁺ that fulfills the following two constraints:
• For all v ∈ V, all outgoing edges from v start with a different letter a ∈ Σ.
• Apart from the root, all nodes have out-degree ≠ 1.
You can think of a compact Σ⁺-tree as a tree that looks like the trie data structure of Section 2.7, but with degree-one paths contracted into single edges.
As usual, a leaf is a node with no children, and an edge leading to a leaf is called a leaf edge. A node with at least two children is called a branching node.
Definition 9.5.2 (Notations in Σ⁺-trees) Let S = (V, E) be a compact Σ⁺-tree.
• For v ∈ V, v̄ denotes the concatenation of all path labels from the root of S to v.
• |v̄| is called the string-depth of v and is denoted by d(v).
• S is said to display α ∈ Σ* iff there are v ∈ V and β ∈ Σ* with v̄ = αβ. Another way to say this is that the string α occurs in S.
• If v̄ = α for v ∈ V and α ∈ Σ*, we also write ᾱ to denote v.
• words(S) denotes all strings in Σ* that are displayed by S: words(S) = {α ∈ Σ* : S displays α}.
• For i ∈ {1, 2, ..., n}, t_i t_{i+1} ... t_n is called the i-th suffix of T and is denoted by T_{i...n}. In general, we use the notation T_{i...j} as an abbreviation of t_i t_{i+1} ... t_j.
For example, a node u with ū = pqr:
[Figure: a path from the root to u with edge labels p, q, r.]
The root is called ε̄.
9.6 Suffix tree
Definition 9.6.1 Let factor(T) denote the set of all factors (= subwords) of T, factor(T) = {T_{i...j} : 1 ≤ i ≤ j ≤ n}. The suffix tree of T is a compact Σ⁺-tree S with words(S) = factor(T).
Here is an example.
For several reasons, we shall find it useful that each suffix ends in a leaf of S. This can be accomplished by adding a new character $ to the end of T and building the suffix tree over T$.
From now on, we assume that T terminates with a $, and we define $ to be lexicographically smaller than all other characters in Σ: $ < a for all a ∈ Σ. This gives the desired one-to-one correspondence between T's suffixes and the leaves of S, which implies that we can label the leaves with a function l by the start index of the suffix they represent: l(v) = i ⇔ v̄ = T_{i...n}. This also explains the name suffix tree. (Observe that we did not define suffix trees in terms of suffixes, but in terms of factors!)
For every leaf v with v̄ = T_{j...n}, we define L(v) = {l(v)} = {j}. Recursively, for every branching node u we define
L(u) = ⋃_{v is a child of u} L(v).
We call L(u) the leaf set of u.
We can use S to find the occurrences of a pattern P ∈ Σ* as follows. Start at the root of S and try reading P's characters from the edges of S. If at some point this is impossible, P does not occur in T. Otherwise, P occurs in T. Assume that this matching procedure has brought us to some point p in S (where by "point" we mean either an edge or a node). Then traversing the subtree below p and reporting all leaf labels in that subtree gives the set of all occurrences of P in T.
The decision query (does P occur in T?) takes O(m) time, where m = |P|.
The reporting query (report all occurrences of P in T) takes O(m + o) time, where o denotes the number of occurrences of P in T.
The correctness of these query algorithms follows from the fact that each factor of T is a prefix of some suffix of T.
We have assumed a constant alphabet size |Σ| = O(1), such that finding the correct outgoing edge takes O(1) time. As seen in the exercises, there are several ways to cope with non-constant alphabet sizes.
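The matching procedure can be illustrated with an uncompacted suffix trie (a Python sketch; unlike a real suffix tree it stores one node per character and thus uses up to O(n²) space; the key None marking a leaf is an implementation convenience, not part of the definition):

```python
def suffix_trie(T):
    # T is assumed to end with the sentinel $
    root = {}
    for i in range(len(T)):
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
        node[None] = i              # leaf: start position of this suffix
    return root

def find(trie, P):
    node = trie
    for c in P:                     # read P's characters edge by edge
        if c not in node:
            return []               # decision query: P does not occur
        node = node[c]
    occ, stack = [], [node]         # reporting query: leaf labels below
    while stack:
        v = stack.pop()
        for k, w in v.items():
            if k is None:
                occ.append(w)
            else:
                stack.append(w)
    return sorted(occ)
```

On the running example, find(suffix_trie("abab$"), "ab") reports the start positions [0, 2].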
An important implementation detail is that the edge labels in a suffix tree are represented by a pair (i, j), 1 ≤ i ≤ j ≤ n, such that T_{i...j} is equal to the corresponding edge label. This ensures that an edge label uses only a constant amount of memory.
From this implementation detail and the fact that S contains exactly n leaves and hence less than n branching nodes, we can formulate the following theorem:
Theorem 9.6.2 A suffix tree occupies O(n) space in memory.
9.7 Suffix and LCP Arrays
We will now introduce two arrays that are closely related to the suffix tree: the suffix array A and the LCP-array H.
Definition 9.7.1 The suffix array A of T is a permutation of {1, 2, ..., n} such that A[i] is the start position of the i-th smallest suffix in lexicographic order: T_{A[i-1]...n} < T_{A[i]...n} for all 1 < i ≤ n.
If we do a lexicographically driven depth-first search through S (visiting the children in lexicographic order of the first characters of their corresponding edge labels), then the leaf labels seen in this order give the suffix array A. This observation relates the suffix array A to the suffix tree S.
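For small inputs the suffix array can be computed by direct sorting (a 0-indexed Python sketch; sorting whole suffixes costs up to O(n² log n) time, and the linear-time construction is the subject of Section 9.9):

```python
def suffix_array(T):
    # sort the start positions by the suffixes beginning there;
    # the sentinel '$' sorts before letters in ASCII, matching Def. 9.7.1
    return sorted(range(len(T)), key=lambda i: T[i:])
```

For T = abab$ this yields [4, 2, 0, 3, 1] (0-indexed).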
The second array H builds on the suffix array:
Definition 9.7.2 The LCP-array H of T is defined such that H[1] = 0 and, for all i > 1, H[i] holds the length of the longest common prefix of T_{A[i]...n} and T_{A[i-1]...n}.
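Definition 9.7.2 translates directly into code (a quadratic-time, 0-indexed Python sketch; Section 9.10 shows how to compute H in O(n)):

```python
def lcp(u, v):
    # length of the longest common prefix of u and v
    k = 0
    while k < min(len(u), len(v)) and u[k] == v[k]:
        k += 1
    return k

def lcp_array_naive(T, A):
    # H[0] = 0; H[i] = lcp of the suffixes ranked i-1 and i
    return [0] + [lcp(T[A[i - 1]:], T[A[i]:]) for i in range(1, len(A))]
```

For T = abab$ with A = [4, 2, 0, 3, 1] this yields H = [0, 0, 2, 0, 1].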
To relate the LCP-array H to the suffix tree S, we need to define the concept of lowest common ancestors:
Definition 9.7.3 Given a tree S = (V, E) and two nodes v, w ∈ V, the lowest common ancestor of v and w is the deepest node in S that is an ancestor of both v and w. This node is denoted by lca(v, w).
Lemma 9.7.4 The string-depth of the lowest common ancestor of the leaves labeled A[i] and A[i-1] is given by the corresponding entry H[i] of the LCP-array; in symbols: for all i > 1, H[i] = d(lca(v, w)), where v and w are the leaves with v̄ = T_{A[i]...n} and w̄ = T_{A[i-1]...n}.
9.8 Construction of Suffix Trees from Suffix and LCP Arrays
Assume for now that we are given T, A, and H, and we wish to construct S, the suffix tree of T. We will show in this section how to do this in O(n) time. Later, we will also see how to construct A and H from T alone in linear time. In total, this will give us an O(n)-time construction algorithm for suffix trees.
The idea of the algorithm is to insert the suffixes into S in the order of the suffix array: T_{A[1]...n}, T_{A[2]...n}, ..., T_{A[n]...n}. To this end, let S_i denote the partial suffix tree for 0 ≤ i ≤ n (S_i is the compact Σ⁺-tree with words(S_i) = {T_{A[k]...j} : 1 ≤ k ≤ i, A[k] ≤ j ≤ n}). In the end, we will have S = S_n.
We start with S_0, the tree consisting only of the root (and thus displaying only ε). In step i + 1, we climb up the rightmost path of S_i (i.e., the path from the leaf labeled A[i] to the root) until we meet the deepest node v with d(v) ≤ H[i + 1]. If d(v) = H[i + 1], we simply insert a new leaf x into S_i as a child of v, and label (v, x) by T_{A[i+1]+H[i+1]...n}. Leaf x is labeled by A[i + 1]. This gives us S_{i+1}.
Otherwise (i.e., d(v) < H[i + 1]), let w be the child of v on S_i's rightmost path. In order to obtain S_{i+1}, we split up the edge (v, w) as follows.
1. Delete (v, w).
2. Add a new node y and a new edge (v, y). (v, y) is labeled by T_{A[i]+d(v)...A[i]+H[i+1]-1}.
3. Add (y, w) and label it by T_{A[i]+H[i+1]...A[i]+d(w)-1}.
4. Add a new leaf x (labeled A[i + 1]) and an edge (y, x). Label (y, x) by T_{A[i+1]+H[i+1]...n}.
The correctness of this algorithm follows from observations 1 and 2 above. Let us now consider its execution time. Although climbing up the rightmost path could take O(n) time in a single step, a simple amortized argument shows that the running time of this algorithm is bounded by O(n) in total: each node traversed in step i (apart from the last) is removed from the rightmost path and will not be traversed again in any subsequent step j > i. Hence, at most 2n nodes are traversed in total.
Theorem 9.8.1 We can construct T's suffix tree in linear time from T's suffix and LCP arrays.
This suffix tree construction algorithm is good in terms of memory locality, which means that there is a high correlation between how close two pieces of data are in the tree and how close they are in physical memory. Hence, an average query produces only few cache misses, which leads to good performance in practice.
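The insertion algorithm of this section can be sketched as follows (0-indexed Python; nodes are plain dicts holding explicit edge-label strings rather than (i, j) index pairs, so the sketch shows the algorithm but not the O(n)-space representation):

```python
def suffix_tree(T, A, H):
    """Build the suffix tree of T from its suffix array A and LCP-array H."""
    n = len(T)
    root = {'edge': '', 'depth': 0, 'children': [], 'leaf': None}
    stack = [root]                      # rightmost path, root first
    for i in range(n):
        h = H[i]                        # H[0] == 0 by definition
        w = None
        while stack[-1]['depth'] > h:   # climb up the rightmost path
            w = stack.pop()
        v = stack[-1]
        if v['depth'] < h:              # split edge (v, w) at string-depth h
            y = {'edge': w['edge'][:h - v['depth']], 'depth': h,
                 'children': [w], 'leaf': None}
            w['edge'] = w['edge'][h - v['depth']:]
            idx = next(k for k, c in enumerate(v['children']) if c is w)
            v['children'][idx] = y      # y takes w's place under v
            v = y
            stack.append(y)
        x = {'edge': T[A[i] + h:], 'depth': n - A[i],
             'children': [], 'leaf': A[i]}
        v['children'].append(x)         # new leaf for suffix A[i]
        stack.append(x)
    return root
```

On T = abab$ with A = [4, 2, 0, 3, 1] and H = [0, 0, 2, 0, 1], a depth-first traversal of the result lists the leaves in suffix-array order, as the observation in Section 9.7 predicts.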
9.9 Linear-Time Construction of Suffix Arrays
For an easier presentation, we assume in this section that T is indexed from 0 to n-1: T = t_0 t_1 ... t_{n-1}.
Definition 9.9.1 If x mod y = z, we write x ≡ z (mod y) or simply x ≡ z (y).
Our general approach will be recursive: first construct the suffix array A_{12} for the suffixes T_{i...n} with i ≢ 0 (3) by a recursive call to the suffix sorting routine. From this, derive the suffix array A_0 for the suffixes T_{i...n} with i ≡ 0 (3). Then merge A_{12} and A_0 to obtain the final suffix array.
9.9.1 Creation of A_{12} by Recursion
In order to call the suffix sorting routine recursively, we need to construct a new text T' from whose suffix array A' we can derive A_{12}. To this end, we look at all character triplets t_i t_{i+1} t_{i+2} with i ≢ 0 (3), and sort the resulting set of triplets S = {t_i t_{i+1} t_{i+2} : i ≢ 0 (3)} with a bucket sort in O(n) time. (To have all triplets well-defined, we pad T with sufficiently many $'s at the end.)
We define T' as follows: its first half consists of the bucket numbers of t_i t_{i+1} t_{i+2} for increasing i ≡ 1 (3), and its second half consists of the same for the triplets with i ≡ 2 (3). Note that we interpret the bucket numbers as characters in T'; our new alphabet is {1, 2, ..., 2n/3}.
We now build the suffix array A' for T' by a recursive call (stopping the recursion when |T'| = O(1)). This already gives us the sorted order of the suffixes starting at positions i ≢ 0 (3), because of the following:
• The suffixes starting at i ≡ 2 (3) in T have a one-to-one correspondence to the suffixes in T' and are hence in correct lexicographic order.
• The suffixes starting at i ≡ 1 (3) in T are longer than they should be (because of the bucket numbers of the triplets starting at i ≡ 2 (3)), but due to the $'s in the middle of T' this tail does not influence the result.
Thus, all that remains to be done is to convert the indices in T' to indices in T, which is done as follows:

    A_{12}[i] = 1 + 3A'[i]               if A'[i] < |T'|/2
    A_{12}[i] = 2 + 3(A'[i] - |T'|/2)    otherwise
9.9.2 Creation of A_0
The following lemma follows immediately from the definition of A_{12}.
Lemma 9.9.2 Let i, j ≡ 0 (3). Then T_{i...n} < T_{j...n} ⇔ t_i < t_j, or t_i = t_j and i + 1 appears before j + 1 in A_{12}.
This suggests the following strategy to construct A_0:
1. Initialize A_0 with all numbers 0 ≤ i < n for which i ≡ 0 (3).
2. Bucket-sort A_0, where A_0[i] has sort key t_{A_0[i]}.
3. Bucket-sort each bucket obtained from step 2 again, using the position of A_0[i] + 1 in A_{12} as a sort key for A_0[i]. This step can be realized by a left-to-right scan of A_{12}: if A_{12}[i] ≡ 1 (3), swap A_{12}[i] - 1 with the entry at the current beginning of its bucket, and move the beginning of that bucket one to the right (such that all processed entries remain where they have just been moved to).
9.9.3 Merging A_{12} with A_0
We scan A_0 and A_{12} simultaneously. The suffixes from A_0 and A_{12} can be compared among each other in O(1) time by the following lemma, which again follows directly from the definition of A_{12}.
Lemma 9.9.3 Let i and j be two indices in T with i ≡ 0 (3).
1. If j ≡ 1 (3), then T_{i...n} < T_{j...n} ⇔ t_i < t_j, or t_i = t_j and i + 1 appears before j + 1 in A_{12}.
2. If j ≡ 2 (3), then T_{i...n} < T_{j...n} ⇔ t_i < t_j, or t_i = t_j and t_{i+1} < t_{j+1}, or t_i t_{i+1} = t_j t_{j+1} and i + 2 appears before j + 2 in A_{12}.
One should note that in both of the above cases the values i + 1 and j + 1 (or i + 2 and j + 2, respectively) appear in A_{12}; this is why it is enough to compare at most two characters before one can derive the lexicographic order of T_{i...n} and T_{j...n} from A_{12}.
In order to check efficiently whether i appears before j in A_{12}, we need the inverse suffix array A_{12}^{-1} of A_{12}, defined by A_{12}^{-1}[A_{12}[i]] = i for all i. With this, it is easy to see that i appears before j in A_{12} ⇔ A_{12}^{-1}[i] < A_{12}^{-1}[j].
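Computing the inverse permutation is a single scan (0-indexed Python sketch, applicable to A_{12} or to the full suffix array alike):

```python
def inverse(A):
    # inv[A[i]] = i, i.e. inv[j] is the rank of position j in A
    inv = [0] * len(A)
    for rank, pos in enumerate(A):
        inv[pos] = rank
    return inv
```

Then i appears before j in A exactly when inverse(A)[i] < inverse(A)[j].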
The running time T(n) of the whole suffix sorting algorithm presented in this section is given by the recursion T(n) = T(2n/3) + O(n), which solves to T(n) = O(n).
Theorem 9.9.4 We can construct the suffix array for a text of length n in O(n) time.
9.10 Linear-Time Construction of LCP-Arrays
It remains to be shown how the LCP-array H can be constructed in O(n) time. Here, we assume that we are given T, A, and A^{-1}, the latter being the inverse suffix array.
We will construct H in the order of the inverse suffix array (i.e., filling H[A^{-1}[i]] before H[A^{-1}[i+1]]), because in this case we know that H cannot decrease too much, as shown next.
Going from suffix T_{i...n} to T_{i+1...n}, we see that the latter equals the former with the first character t_i truncated. Let h = H[A^{-1}[i]]. Then the suffix T_{j...n}, j = A[A^{-1}[i] - 1], has a longest common prefix with T_{i...n} of length h. So T_{i+1...n} has a longest common prefix with T_{j+1...n} of length h - 1. But every suffix T_{k...n} that lies lexicographically between T_{j+1...n} and T_{i+1...n} must have a longest common prefix with T_{j+1...n} that is at least h - 1 characters long (for otherwise T_{k...n} would not be in lexicographic order). We have thus proved the following:
Lemma 9.10.1 For all 1 ≤ i < n: H[A^{-1}[i + 1]] ≥ H[A^{-1}[i]] - 1.
This gives rise to the following elegant algorithm to construct H:

1  h ← 0, H[1] ← 0
2  for i = 1, ..., n do
3      if A^{-1}[i] ≠ 1 then
4          while t_{i+h} = t_{A[A^{-1}[i]-1]+h} do h ← h + 1
5          H[A^{-1}[i]] ← h
6          h ← max{0, h - 1}
7      end
8  end
The linear running time follows because h is always less than n and is decreased at most n times in line 6. Hence, the number of times h is increased in line 4 is bounded by 2n, so there are O(n) character comparisons in the whole algorithm. We have proved:
Theorem 9.10.2 The LCP-array for a text of length n can be constructed in O(n) time.
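The pseudocode above (known in the literature as Kasai's algorithm) can be written out as follows (a 0-indexed Python sketch; the explicit reset of h when a suffix has no lexicographic predecessor makes the function safe even for texts without the sentinel $):

```python
def lcp_array(T, A):
    n = len(T)
    inv = [0] * n               # inverse suffix array: inv[A[r]] = r
    for r, p in enumerate(A):
        inv[p] = r
    H = [0] * n
    h = 0
    for i in range(n):          # suffixes in text order
        if inv[i] > 0:
            j = A[inv[i] - 1]   # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            H[inv[i]] = h
            h = max(0, h - 1)   # Lemma 9.10.1: next entry drops by at most 1
        else:
            h = 0               # smallest suffix has no predecessor
    return H
```

For T = abab$ with A = [4, 2, 0, 3, 1] this yields H = [0, 0, 2, 0, 1], matching the running example.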