Lothar May
Master Thesis
Information Technology
Supervisors:
Prof. Dr. André Neubauer
Prof. Dr. Michael Tüxen
Contents

1 Introduction
  1.1 Security Patches
  1.2 Matching With Mismatches
  1.3 Structure
  1.4 Conventions
  1.5 Definitions
2 Analysis
  2.1 Approach
  2.2 Motivation
    2.2.1 Intuitive Substring Matching
    2.2.2 Run Time Considerations
  2.3 Derivation of the Algorithm
    2.3.1 Formal Model
    2.3.2 Restrictions of the Model
    2.3.3 Optimisation using the FFT
    2.3.4 Projecting onto Subspaces
    2.3.5 Optimising Random Behaviour
  2.4 The Algorithm
    2.4.1 Reference Version
    2.4.2 The First Variant
    2.4.3 The Second Variant
    2.4.4 The Third Variant
    2.4.5 Comparison
3 Implementation
  3.1 Considerations
    3.1.1 Portability
    3.1.2 Environment
    3.1.3 Library Integration
  3.2 Implementation
    3.2.1 Structure
4 Improvements
  4.1 Theorem 1.1
    4.1.1 Selecting Primes Randomly
    4.1.2 Numerical Probability
  4.2 Derandomisation
    4.2.1 Selecting Primes Non-Randomly
    4.2.2 Creating φ Non-Randomly
5 Conclusion
Appendix
Bibliography
List of Figures
33 Matlab calls to construct data using primes around n/10
34 Plot of the cyclic correlation C(1) of data using primes around n/10
35 Plot of the cyclic correlation C(2) of data using primes around n/10
36 Matlab function implementing "Algorithm 1.2" of [1]
37 Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
38 Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
39 Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)
40 Plot of y = −log(√(δ + e^−x)) with δ = 2
41 Matlab function to calculate vector D
42 Plot of the raw cyclic correlation C
43 Plot of the filtered cyclic correlation D
44 Matlab function implementing "Algorithm 1.3" of [1]
45 Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
46 Limit on m in algorithm 1.3 (ε = 0.1, p = 0.9, t = 2)
47 Matlab function to reconstruct X as in theorem 1.1 of [1]
48 Matlab function to provide numeric probability on a good match not in X
49 Helper function to add up good matches not in X
50 Helper function to count good matches not in X
51 Matlab function to compute C as in figure 8 by using the FFT as in figure 9
52 Matlab function to simply calculate the positions of the t largest values
53 Matlab function to iteratively calculate the Cartesian product (2 dimensions)
54 Matlab function to recursively calculate the Cartesian product
List of Tables
Chapter 1
Introduction
1 Even though not entirely correct, "bandwidth" is used as a synonym for data rate, as is often done in the computer science literature.
2 and also many embedded systems
to mention that some security fixes cause changes in multiple files, and this will again result in
large patches if the new files are copied.
A totally different approach is to consider the existing files on the system, and compare
them to the corresponding files containing security fixes, in order to avoid replacing entire files
by new ones. Since these fixes usually implement only a few changes, most of the data is already present on the system, within the files which need patching. If this method is applied, and only the differences between the files are stored in security patches, the patches can be tremendously smaller. Actually, some OS vendors have recently started using this technique to deploy at least some of their updates:
• Microsoft Windows Update on Windows XP and above supports “Binary Delta Compres-
sion (BDC)” (see [15]), but this is a proprietary system and little information is available
on its inner workings.
• FreeBSD and Mac OS X both come with the open source tool “bsdiff 4” (see [2]) and
provide update tools which (can) make use of it.
Colin Percival, the author of bsdiff, has written a doctoral thesis [1] on this subject, in which he
presents an algorithm to further improve bsdiff 4. Unfortunately, he did not publish the source code for his new algorithm, and there is only little third party material available on it. A reference to bsdiff 6 can be found in [5] (on page 939), but it does not provide a description. The authors of [4] use bsdiff 6 for comparison with their patch tool, but again no details are mentioned. If we wish to make use of the new algorithm, we need to work through the doctoral thesis.
very little code is actually added or removed. However, due to these modifications, memory
addresses throughout the executable file change.5 In this context, an algorithm for matching
with mismatches helps to find similar “blocks” of the original file in the new file, such that the
differences can be identified and encoded. Blocks or parts of blocks which are identical in both
files can be referenced and do not need to be copied.
1.3 Structure
This thesis is structured as follows:
1. Introduction
In the course of this chapter, conventions and definitions are provided which are used
throughout the thesis.
2. Analysis
The analysis contains a step-by-step derivation of the algorithm for matching with mis-
matches. This includes the motivation for each step as well as numerical examples. Addi-
tionally, example code for the different iterations of the algorithms is provided, and basic
tests are performed.
3. Implementation
Based on the prior analysis, an implementation of the algorithm using C++ is presented,
and possible use in a tool for delta compression is prepared. This chapter describes the
basic structure of the implementation and additional considerations like portability and
third party libraries.
4. Improvements
With regard to the analysis and the C++ implementation, specific improvements of the
algorithm are proposed and illustrated.
5. Conclusion
In this chapter, we provide an overview of what we have achieved and give an outlook
identifying possible further projects.
1.4 Conventions
All examples are written in Matlab [19] code. However, neither knowledge of Matlab nor a
license of Matlab is required to understand this thesis and test the examples. Basic knowledge
of C should be sufficient to be able to read the code, and GNU Octave [20] (with the additional
packages from Octave-Forge [21]) can be used to run the examples.
Nevertheless, it should be noted that array and vector indexing in Matlab code starts at 1 (i.e.
array{1} is the first element). This is opposed to C, where indexing starts at 0 (i.e. array[0]). In
spite of this, we still stick to the constraints of all values as described in [1], which are mainly
zero-based. So whenever there is an unexplained increment or decrement in the code, the reason
is almost certainly the difference in indexing.
Whenever possible, we use the symbols of [1] (on pages iii-iv). This means that the reader
may generally switch to and from the doctoral thesis without problems. One notable difference
though is the use of the vector indices i and j: In [1], the index i is first used as the index for the match count vector (V_i), but later i represents the "vector number" and j becomes the index (e.g. A_j^(i)). For the sake of clarity, we use j as vector index from the start. Additionally, when we are talking about "the vectors" A(i), B(i), or C(i), we actually mean A(i), B(i), respectively C(i) ∀i ∈ {1, . . . , k}, with k being described in the context.
Some concepts of probability theory are applied in this thesis without further explanation; for more information on this subject we refer to [8], especially chapter 3.
1.5 Definitions
A programmer usually regards executable code as binary data. A string, in contrast to that, is
seen as human-readable data, which is binary data with special semantics. This distinction is
generally applied because there are certain functions which do not work with all forms of binary
data. Yet in the context of this thesis it is not relevant. We therefore take the mathematical point
of view (see also [9] on pages 28-29):
Definition (Alphabet)
An alphabet Σ is a finite, non-empty set of symbols.

Definition (String)
A string over an alphabet Σ is a finite sequence of elements of Σ.
These definitions will be used throughout the thesis, so whenever we are talking about a
string, we do not impose any semantics on its elements. To give an example, an executable file
(analysed at byte level) is a string over the alphabet Σexe = {0, 1, . . . , 255}.
Definition (Ceiling)
⌈·⌉ : R → Z is the ceiling function which "rounds towards +∞", i.e. ⌈x⌉ returns the smallest n ∈ Z which is not less than x. The corresponding Matlab function is "ceil".
Definition (Interval)
[a, b) = {n ∈ N | a ≤ n < b} for a, b ∈ R. Note that this is not a common definition, because
we only allow positive integers as elements of the interval.
Chapter 2
Analysis
2.1 Approach
In his doctoral thesis [1], Colin Percival writes that the first chapter, in which he introduces the
new algorithm for matching with mismatches, "is not for the faint of heart"6. This is quite true: the mathematics behind the algorithm can seem daunting. We try to ease the pain a bit. However, even if we have applications in mind, a good understanding of the underlying algorithm is essential.
In this chapter we provide a description of the algorithm for matching with mismatches. Our approach is to numerically show how and why the algorithm works. We regard this as a reasonable addition to chapter 1 of [1], which mainly consists of lemmas, theorems, and proofs.
2.2 Motivation
2.2.1 Intuitive Substring Matching

Intuitively, this can be done as follows: We iterate through the large string and compare
small string with the substring at the current position. For each substring comparison, we count
the number of elements which match, and then process these match counts to find the “good”
matches.8 As an example, consider the strings9 S, T and their lengths n, m in figure 1.
In the language of mathematics, the match counts can be seen as a vector, and the calculation
is formally done as follows (see [1] on page 6):
V_j = ∑_{i=0}^{m−1} δ(S_{i+j}, T_i)   ∀ j ∈ {0, . . . , n − m}   (2.1)
8 “Good” matches are matches with a high number of matching characters, compared to the maximum possible
match count.
9 The longer string was taken from J. W. Goethe’s ”The Sorcerer’s Apprentice”.
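The corresponding code listing is not reproduced in this extract; the following is a minimal sketch of equation (2.1), assuming δ(x, y) = 1 if x = y and 0 otherwise, and using Matlab's one-based indexing:

function V = match_count(S, T)
% Minimal sketch of equation (2.1): for every zero-based position j,
% count the characters of T that match the substring of S starting at j.
n = length(S);
m = length(T);
V = zeros(1, n - m + 1);
for j = 0:(n - m)
  V(j + 1) = sum(S((1:m) + j) == T);
end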
Now that we are able to calculate the match counts, we need to process them to find good matches. In our example (table 1), the maximum possible match count is m = 5, which is the length of T, the smaller string. Assuming that we wish to find the positions with at least 50% of the maximum match count, we require no less than ⌈m/2⌉ = 3 matches for one substring. Thus, we extract only those positions j which satisfy the predicate V_j ≥ ⌈m/2⌉. This way, we get all the spikes in figure 2, while ignoring small match counts which are "kind of random". This filtering is easily done in Matlab code (see figure 4).
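Figure 4 itself is not reproduced in this extract; assuming a vector V as computed by the sketch above, the filtering amounts to something like:

% Keep only zero-based positions with at least ceil(m/2) matching characters.
good = find(V >= ceil(m/2)) - 1;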
2.2.2 Run Time Considerations

We have presented an intuitive algorithm for matching with mismatches, implemented it, and it works fine. However, is this algorithm also applicable to our problem of finding similar sections of two different files, given that these files contain much more data than our test input strings?
In the context of comparing two files, n could be the size of one file, and m could be the size of a "block" of the second file which we are trying to match. File sizes tend to be quite large nowadays, for example n_exe = 2,097,152 (2 MB), with a sample block size of m_exe = 2,048 (2 KB). The run time is O(n_exe · m_exe), which results in approximately 4.2950 · 10^9 steps, and that would be only for matching a single block. Assuming the second file also has a size of 2 MB, and we simply want to match all non-overlapping blocks, that would mean matching n_exe / m_exe = 1024 blocks, i.e. some O(n_exe · m_exe · n_exe / m_exe) = O(n_exe²), which is approximately 4.3980 · 10^12 steps. This is a quadratic time algorithm, (mostly) independent of the block size m. Even modern processors cannot compensate for this.
Based on this result, we can be sure that our intuitive algorithm is not quite fast enough (in
other words: it is too expensive), and that optimisation is a necessity. There is one thing in our
favour, though: We do not have the absolute requirement to always calculate the exact match
counts and find only the best matches. If we do not find them, the calculated difference between
the two files will be larger, which will result in larger patches, but we will still succeed. With
that in mind, one option to speed up this calculation is to estimate the match count vector V
using a randomised algorithm with a sufficiently high chance of success. This leads us to the
new algorithm of matching with mismatches as described in [1].
10 For a description of the O-Notation as used in this thesis, see [6] on pages 44-45.
2.3 Derivation of the Algorithm

2.3.1 Formal Model

In other words (with regard to finding differences of files), we create T from randomly chosen
characters to be a “block” of one file. We then create another file S from randomly chosen
characters, but at certain positions in S we copy the block T into the file S. This copy of T is not
an exact copy, but an “approximate” copy, and the probability p states how accurate the copy is.
For example if p = 0.9, 9 of 10 characters will be correctly copied on average. The construction
can be performed using a Matlab function as shown in figure 5.
Based on this model, which generates S and T using given positions of good matches, we
wish to invert the random construction and find X with a probability of at least 1 − ε. In this
context, ε is a non-zero parameter which can be set to achieve the desired “accuracy”, but
choosing ε will impose certain restrictions on other input values, as we will see later.
In addition to the problem of reconstructing X, we wish to identify the parts of the algorithm
which are independent of T . Following the nomenclature of [1], we call a pre-calculation of
these parts an “index” of the algorithm. As we need to match several blocks (different strings
T ) with the same target file (constant string S), such an index can speed up the processing.
Figure 5: Matlab function to construct strings S and T according to the formal model
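The listing of figure 5 is not reproduced in this extract. A minimal sketch under the model's assumptions (every copied character of T survives with probability p and is replaced by a uniformly random character otherwise) could look as follows:

function [S, T] = construct(n, m, p, Sigma, X)
% Sketch: random strings S and T over Sigma; an approximate copy of T
% is placed at each zero-based position in X.
S = Sigma(randi(length(Sigma), 1, n));
T = Sigma(randi(length(Sigma), 1, m));
for x = X
  S(x+1 : x+m) = T;                       % exact copy first (x is zero-based)
  bad = find(rand(1, m) >= p);            % characters which are not copied intact
  S(x + bad) = Sigma(randi(length(Sigma), 1, length(bad)));
end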
However, this chapter mainly deals with the reconstruction of X; an index is generated in the
C++ implementation in section 3.2.4.
2.3.2 Restrictions of the Model

Randomly good matches of T in S, i.e. good matches at positions not in X, are distracting in our model, because our aim is to compute X and only X. Let us assume we have at least one such good match, specifically at position i with i ∉ X, i.e. i ∈ X̄ = {0, . . . , n − m} − X. Hence, when estimating X, we cannot be sure that X contains the positions of maximum match counts, because V_i (our lucky match count) might actually be larger. Thus, we will have to guess the elements of X, which reduces the probability of finding the correct X to 0.5 or less. As an illustration:

Σ = {a, b, c, . . . , z}
X = {1, 12}
T = "abc"
S = "fabcklabchxvabcs"

Here T occurs in S at position 1 ∈ X, at position 6 ∉ X, and at position 12 ∈ X.
We therefore conclude: To effectively estimate X without "fifty-fifty guessing", all match counts at positions not in X must be smaller than those at positions in X, i.e.:

V_i < V_j   ∀ i ∈ X̄, ∀ j ∈ X   (2.3)
The probability q that a randomly good match happens, i.e. that the predicate (2.3) is not
true for a specific pair (i, j), depends on several parameters (see also [1] on page 8): The first
parameter is m, the size of string T . The smaller m is, the more likely we are to encounter
a randomly good match. The second parameter is p, because if p is near zero, the supposed
“good” matches in X will not be very good, and it will be more likely that a random match
occurs which is “better”. The third parameter is the size of the alphabet, namely |Σ|. The larger
the alphabet, the smaller is the probability of a randomly good match (if the other factors are
constant). Implicitly, and as the last parameter, the probability also depends on |X|. For example
if X = {0, m, 2m, ...hm} with h ∈ N, hm ≤ n − m it is not possible to locate any randomly good
matches which are not in X.11
We assume m to be the most important of these factors, because it is variable and depends
on the string T , while p and |Σ| will be constant in most applications of the model, and |X| is
considered to be very small, i.e. |X| mn . Hence, as an example, we numerically calculate q
for m = [1, 15] with constant p = 0.7, |Σ| = 2 and |X| = 1. Figure 7 shows the result.12
Our numerical example suggests that q(m), under the assumptions above, is an exponential function; the exact expression is given as equation (2.4) in [1] on page 8.
As apparent from figure 7, q becomes negligible for large values of m. So, basically, this
limitation of the model imposes some lower bound on m when reconstructing X.
However, if we leave the model for a moment and switch back to what we intend to do, that
is matching two files, this limitation turns out to be insignificant for the following reasons:
11 The factor |X| is not mentioned in [1], but it is included here to show that the calculation of q is quite complex.
Equation (2.4), also given in [1] on page 8, is based on certain assumptions and cannot be applied in general.
12 The script used to create this plot can be found in the appendix.
1. If we found random matches between two files that we did not “expect”, we would be
happy and would gladly accept them.
2. The block size m when matching two files should not be too small anyway, so that proper
compression can be achieved.
3. We will set p to be near 1 anyway, otherwise we would not be able to achieve proper
results.
Even though we have identified certain restrictions of the model, we can continue with our work, knowing that these limitations will not block the progress of solving our problem.

2.3.3 Optimisation using the FFT
13 Unfortunately, the references mentioned in [1] on using the FFT to calculate V do not cover the FFT at all,
and are therefore not very useful in this context.
The key idea of this optimisation is to calculate the cyclic correlation14 of the two strings S and T, when treating them as discrete signals with certain properties.
Assuming φ : Σ → R is a function which converts a character to a signal value,15 the cyclic
correlation C is calculated as follows:16
∀ j ∈ {0, . . . , n − 1}:

A_j = φ(S_j)   (2.5)

B_j = φ(T_j) if j < m,  B_j = 0 otherwise   (2.6)

C_j = ∑_{r=0}^{n−1} A_{(r+j) mod n} · B_r   (2.7)
In this calculation, the string S is converted to the signal vector A, and T is converted to B
with zero padding, such that A and B have the same size. In order to retrieve proper results, we
have to define φ in a way that it does not “weight” characters differently. As counter-example,
defining φ to map each element of Σ to a unique numerical representation
given Σ = {x_1, x_2, . . . , x_{|Σ|}} = ⋃_{j=1}^{|Σ|} {x_j}:

φ(x_j) = j   (2.8)
will not produce proper results, because matching x1 and x1 (which are equal) when calculating
the cyclic correlation will yield 1 · 1 = 1,17 while matching x1 and x2 (which are non-equal)
will produce 1 · 2 = 2. In other words, certain mismatches will count much more than certain
matches, and this is not the result we wish to have.
Instead, we define φ to randomly map half of Σ to 1 and the other half to −1 (similar to [1]
on page 12):
Choose Σ′ ⊂ Σ with |Σ′| = |Σ|/2 uniformly at random.

φ(x) = (−1)^{|Σ′ ∩ {x}|}   (2.9)
Note that in case |Σ| > 2 this is a lossy conversion of the characters to signal values, since
it maps Σ to {−1, 1}. However, in contrast to equation (2.8), mismatches will never produce
larger values than matches: Matching x_1 and x_1 will produce 1, matching x_1 and x_2 will produce either 1 or −1, depending on how Σ′ was chosen.
Figure 8 shows the calculation of the cyclic correlation in Matlab code using φ as defined
in equation (2.9).
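The listing of figure 8 is not reproduced in this extract; a minimal sketch of the direct O(n²) calculation, assuming Σ is a contiguous range starting at Sigma(1) (as in our examples), might look as follows:

function C = match_cyclic_correl(S, T, Sigma)
% Sketch: map characters to +/-1 via a random phi, equation (2.9), then
% compute the cyclic correlation of equation (2.7) directly.
n = length(S);
phi = ones(1, length(Sigma));
phi(randperm(length(Sigma), floor(length(Sigma) / 2))) = -1;
A = phi(double(S) - double(Sigma(1)) + 1);
B = [phi(double(T) - double(Sigma(1)) + 1), zeros(1, n - length(T))];
C = zeros(1, n);
for j = 0:(n - 1)
  C(j + 1) = sum(A(mod((0:n-1) + j, n) + 1) .* B);
end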
Due to the nested loop in equation (2.7),18 the calculation of C requires O(n2 ) time. The
actual improvement comes from the fact that the cyclic correlation can be computed using the
FFT (see [12] on pages 545-546) in O(n log2 (n)) time (according to [10]). The corresponding
Matlab code is shown in figure 9. In this context, fft and ifft are the Fast Fourier Transform
and the inverse FFT, respectively, and conj is the complex conjugate.
Figure 9: Matlab code to calculate the cyclic correlation using the FFT
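The body of figure 9 is likewise not reproduced; assuming signal vectors A and B as defined in equations (2.5) and (2.6), the essential step is a one-liner:

% Cyclic correlation via the correlation theorem: O(n log n) instead of O(n^2).
C = real(ifft(fft(A) .* conj(fft(B))));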
The resulting vector C does not necessarily contain match counts. Mismatches decrease values in C; e.g., for m = 15, a match count of 10 at position j results in C_j = 10 − 5 = 5. This means that processing C as we processed V in figure 4, by filtering values ≥ ⌈m/2⌉, might lead to different results. This is specifically true for |Σ| > 2, because in that case the string-to-signal conversion is lossy. Thus, there is more "background noise" than in the match count vector V.
Performing the matching of the example strings (see figure 1) using the cyclic correlation leads
to a result19 as plotted in figure 10. Negative values in this plot are clear mismatches (because
they are the result of accumulated −1 · 1 multiplications, see equation (2.7)). Positive values are
likely but not guaranteed to be matches, depending on how suitable the randomly generated φ
is for our example strings.
Figure 10: Plot of the cyclic correlation C calculated from example strings S and T
To find the t best matches using the result of the cyclic correlation, we identify the t positions
for which C takes the largest values. In our example, the positions 7 and 19 both indicate
full matches, although in fact position 7 only matches 4 of 5 characters (see table 1). This
emphasizes the fact that our new method does not always lead to correct results. When choosing
φ unluckily, we might even find full matches at positions where none are present.
However, the method is still very useful if certain conditions are met. Let us now apply
our model and create S and T according to it, which means that their characters are uniformly
distributed20 . If m is large enough to make up for the (possibly) lossy conversion done by φ ,21
the spikes within C will (with high probability) be the good matches we wish to find, since T
matches within S “well or not at all” ([1] page 6).
Our initial problem remains, though: The run time is dominated by the O(n log2 (n)) time
required for the FFT, which does not scale well for our purpose.22 In addition to that, memory
usage is O(3n),23 which can be too much for use with large files.24 The next step is therefore to
“shorten” the lengths of the vectors A and B before calculating the cyclic correlation, while still
retaining the necessary information.
2.3.4 Projecting onto Subspaces

A Simplified Approach
We now introduce such a projection (based on [1], page 9), starting with the special case |X| = 1
and |Σ| = 2, which basically means that we are only looking for the best match of T in S,
whereby the conversions of strings to signals are lossless. For this case, we construct an example
to show how the projection is performed and to explain the mathematical background.
n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);
Figure 11: Matlab calls to construct data according to the model (|X| = 1)
Figure 11 lists the Matlab calls used to create example data according to the model. Note that we choose |X| ≪ n/m and select m not to be too small, to make sure that the restrictions of the model (section 2.3.2) do not apply. Figure 12 shows the resulting vector C (following equation (2.7)). There is significant noise and a spike at position j = 9000, which we expected because X = {9000}.
Now we need to reduce the data size and still get the position of the maximum j = 9000
as result. Assuming we can somehow reduce the data size modulo a prime number, we could
extract the position modulo this prime. Performing this several times with different primes will
give us the position modulo multiple primes. To calculate the actual result we can make use
of the Chinese Remainder Theorem, which states “that it is possible to reconstruct integers in
a certain range from their residues modulo a set of coprime moduli” ([7], page 194). This is
possible if the integer x we wish to reconstruct follows the predicate 0 ≤ x < M where M =
p1 p2 . . . pk is the product of the coprime integers (with k being the number of primes).
For example, we can choose the primes p_1 = 5003, p_2 = 6007 (being about n/2 with enough difference; this choice is for simplicity)25, and perform the character-to-signal conversions accumulated modulo each of the primes (based on algorithm 1.1 in [1] on page 12):

25 The reconstruction is clearly possible because 0 ≤ n = 10000 < p_1 · p_2 = 30053021. We are not required to choose primes, coprime values are sufficient, but we prepare for later changes to the algorithm.
∀i ∈ {1, . . . , k}, ∀ j ∈ {0, . . . , p_i − 1}:

A_j^(i) = ∑_{λ=0}^{⌈(n−j)/p_i⌉−1} φ(S_{j+λ·p_i})   (2.10)

B_j^(i) = ∑_{λ=0}^{⌈(m−j)/p_i⌉−1} φ(T_{j+λ·p_i})   (2.11)

C_j^(i) = ∑_{r=0}^{p_i−1} A_{(r+j) mod p_i}^(i) · B_r^(i)   (2.12)
This means that we shorten the original vector A_j = φ(S_j) by adding up "roughly the second half of the vector to the first half" (with a different boundary for each prime). Equation (2.10) specifies a projection Σ^n → R^{p_i}, ∀i ∈ {1, . . . , k}. Given p_i < n, this is a lossy (irreversible) projection, but it maintains certain characteristics. In our example, instead of one large vector A, we now have two vectors: A(1) with size 5003 and A(2) with size 6007. Vector B is less affected; only some zeros are cut from the end, since in our case m < p_1 < p_2.26
26 Basically, we could use the same definition for B(i) as in equation (2.6), but the new definition covers the general case and is therefore preferable.
Figure 13: Matlab function to project data before calculating the cyclic correlation
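As the listing of figure 13 is not reproduced in this extract, here is a minimal sketch of the folding for a single prime, following equations (2.10) and (2.11); the function name is ours, and a and b are assumed to be the full signal vectors φ(S) and φ(T):

function [A, B] = project_mod_prime(a, b, p)
% Sketch: fold the signal vectors modulo the prime p by accumulating
% the values at positions j, j+p, j+2p, ... into position j.
A = zeros(1, p);
B = zeros(1, p);
for j = 0:(length(a) - 1)
  A(mod(j, p) + 1) = A(mod(j, p) + 1) + a(j + 1);
end
for j = 0:(length(b) - 1)
  B(mod(j, p) + 1) = B(mod(j, p) + 1) + b(j + 1);
end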
The cyclic correlation is calculated for each of these smaller vectors (see equation (2.12)). Figure 13 shows a Matlab function implementing this "projection onto subspaces"; figures 14 and 15 show the resulting vectors C(1) and C(2) for our example. Due to adding up the values before calculating the correlation, the level of noise has increased compared to figure 12, but the maximum value still rises clearly above the noise.
Figure 14: Plot of the cyclic correlation C(1) of projected model data (|X| = 1)
Figure 15: Plot of the cyclic correlation C(2) of projected model data (|X| = 1)
The position of the maximum value in each of these vectors C(i) is (with high probability) the position of the maximum of the original vector C, reduced modulo the corresponding prime p_i. Using the residues modulo these primes, we can reconstruct the position of the maximum correlation. In our example the
residues are 9000 mod 5003 = 3997 (see figure 14) and 9000 mod 6007 = 2993 (see figure 15).
The reconstruction following the Chinese Remainder Theorem is done as follows (see also [7]
page 194f):
∀i ∈ {1, . . . , k}:

M = ∏_{i=1}^{k} p_i :   M = 5003 · 6007 = 30053021   (2.14)

M_i = M / p_i :   M_1 = 30053021 / 5003 = 6007,   M_2 = 30053021 / 6007 = 5003   (2.15)

N_i · M_i ≡ 1 (mod p_i) :   N_1 = −294,   N_2 = 353   (2.16)

x ≡ a_1 N_1 M_1 + · · · + a_k N_k M_k (mod M) :   (2.17)
x ≡ −3997 · 294 · 6007 + 2993 · 353 · 5003 (mod 30053021)
  ≡ −1773119239 (mod 30053021)
  ≡ 9000 (mod 30053021)
While most of the steps are fairly straightforward, the solution of equation (2.16) requires
some work. Rearranging it leads to a form which can be solved more easily:
N_i M_i ≡ 1 (mod p_i)
⇔ N_i M_i = 1 − r · p_i,   r ∈ Z
⇔ N_i M_i + r · p_i = 1   (2.18)
Since gcd(Mi , pi ) = 1 (by definition of Mi and pi ),27 we can apply the extended Euclidean
algorithm (see [6] on pages 859-860) to calculate Ni .
27 gcd: greatest common divisor
Figure 16 shows a Matlab function which performs the reconstruction of an integer accord-
ing to the Chinese Remainder Theorem.
Figure 16: Chinese Remainder Theorem: Matlab function which calculates the solution
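As a stand-in for the listing of figure 16 (the function name is ours), a minimal sketch could be:

function x = crt_solve(residues, moduli)
% Sketch of the reconstruction in equations (2.14)-(2.17); the moduli must
% be pairwise coprime. Doubles suffice for the small values used here.
M = prod(moduli);
x = 0;
for i = 1:length(moduli)
  Mi = M / moduli(i);
  [g, Ni, r] = gcd(Mi, moduli(i));   % extended Euclid: g = Ni*Mi + r*p_i = 1
  x = x + residues(i) * Ni * Mi;
end
x = mod(x, M);

For our example, crt_solve([3997, 2993], [5003, 6007]) returns 9000.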
Using this algorithm, the position x of the maximum correlation can be uniquely calculated
as long as n < M. Instead of one large cyclic correlation, we now have multiple smaller cyclic
correlations to calculate, which is an improvement in terms of speed, because the time spent
calculating the FFT grows ”faster than linear” with n. There are drawbacks which need to be
kept in mind, however:
1. The level of noise increases, because we add up values in vectors A(i) and (possibly) B(i) .
We need to make sure that, at least with high probability, the noise does not increase too
much.
2. Finding the position of the maximum correlation is more complex now, but if the time
gain is sufficiently large, it is worth the effort.
A Generic Approach
In our introduction of the projection, we only handled the special case that |X| = 1 and |Σ| = 2.
Now we extend the previous approach to present a generic solution. For a start, we stick with
|Σ| = 2, but consider varying |X|.
The case |X| = 0 does not need to be handled: it means zero solutions need to be found, and we are finished instantly. What needs to be covered is the general case |X| ≥ 1.
If we apply the same procedure as above, we stumble upon the fact that we cannot simply
use the Chinese Remainder Theorem for the general case. Actually, we are able to reconstruct
integers from their residues modulo some primes, but if we consider multiple solutions, we
have multiple residues for each prime and do not know which integer they belong to.
n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);
Figure 17: Matlab calls to construct data according to the model (|X| = 3)
As an example, consider model data with |X| = 3, generated using the Matlab calls in figure
17. The cyclic correlation C (see equation (2.7)) of this data has three spikes at positions which
are elements of X (see figure 18). We need to reconstruct these three positions from their
residues modulo the primes p1 and p2 as above.
Each of the cyclic correlations C(i) of projected data (calculated as in equation (2.12)) also
has three spikes (see figures 19 and 20). Thus, when extracting the positions of the three largest
values from each of the vectors C(i) , we have three residues modulo each prime. However,
unfortunately it is not apparent which of the residues modulo one prime ”belongs to” a specific
residue modulo another prime to reconstruct one of the results.
Intuitively, one might consider simply calculating the result x using the Chinese Remainder Theorem for all combinations of residues and checking each time whether it is a valid value (i.e. whether x ≤ n − m according to the model). This actually works fairly well, given M ≫ n − m, such that it is possible to drop invalid combinations.
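For k = 2 primes, this brute-force combination step might look like the following sketch (using the crt_solve sketch from above; res1 and res2 are assumed to hold the extracted spike positions modulo p_1 and p_2, with n and m as in the example of figure 17):

% Try every residue pair and keep only results in the valid range 0..n-m.
Xest = [];
for a1 = res1
  for a2 = res2
    x = crt_solve([a1, a2], [5003, 6007]);
    if x <= n - m
      Xest(end + 1) = x;   % valid combination found
    end
  end
end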
Figure 19: Plot of the cyclic correlation C(1) of projected model data (|X| = 3)
Figure 20: Plot of the cyclic correlation C(2) of projected model data (|X| = 3)
The residues and corresponding solutions for our example are shown in table 2 (with a_1 being the position of a spike in C(1), and a_2 the position in C(2)). The valid results, i.e. those which are ≤ n − m = 9744, are marked with (*). As shown in the table, we have successfully reconstructed the elements of X; all other combinations of residues lead to values well out of range.
Table 2: Residues and solutions for p_1 = 5003, p_2 = 6007, X = {7700, 8050, 9000}

a_1    a_2    x (mod 30053021)
2697   1693   7700 (*)
2697   2043   17067930
2697   2993   11854804
3047   1693   13000841
3047   2043   8050 (*)
3047   2993   24847945
3997   1693   18214917
3997   2043   5222126
3997   2993   9000 (*)
Theorem 1.1 in [1] on page 10 establishes a lower bound on the probability that this reconstruction will lead only to the correct results.28 The problem is formulated slightly differently in this theorem. It starts with a set of candidates for the solution, namely {0, . . . , n − 1} (assuming m = 1, the minimum reasonable value). For each prime, only those values of this set which are elements of one of the residue classes (of the actual results modulo the prime) are accepted (see [1] on page 10):
X̄ = {0, . . . , n − 1} ∩ (X + p1 Z) ∩ · · · ∩ (X + pk Z) (2.19)
The ”filtering” by intersecting for each prime is basically the same as trying all combinations
and removing invalid results. The set X̄ will always contain the correct results (because all
combinations of residues are considered), but it might also contain additional values. A lower
probability bound on the condition X = X̄ according to [1] is
1 − n · (t log(n) log(L) / L)^k   (2.20)
with L ∈ R, L ≥ 5 specifying the interval [L, L(1+2/ log(L))) from which the primes p1 , . . . , pk
are randomly selected; and t = |X| according to our model definition. This probability bound is
used by Colin Percival for further proofs, which is why theorem 1.1 is the very foundation of
the algorithm proposed in [1]. Still, what is missing in [1] is a critical analysis of this theorem.
We provide this, together with suggestions for improvement, in section 4.1.
For now we choose input values such that the lower probability bound in equation (2.20) is
near 1. Hence, we can expect that the set X is properly reconstructed most of the time. To be able
to choose input values accordingly, we need to select primes from the specified interval. Figure
28 This theorem depends on p_i being prime and not just coprime ∀i ∈ {1, . . . , k}, which is why we used prime numbers from the start.
Figure 21 shows a Matlab function which performs this task. Please note that this function actually creates a random permutation of primes instead of selecting them "uniformly at random" (as described in [1]), which makes it behave slightly differently. This is explained in section 4.1.1.
Figure 21: Matlab function to randomly select primes for the reconstruction of X
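The listing of figure 21 is not reproduced in this extract; a minimal sketch (the function name is ours; note the random permutation instead of uniform drawing) could be:

function P = select_primes(L, k)
% Sketch: take all primes in [L, L*(1 + 2/log(L))) and return the first k
% elements of a random permutation of them.
upper = L * (1 + 2 / log(L));
cand = primes(ceil(upper) - 1);   % primes() returns all primes up to its argument
cand = cand(cand >= L);
idx = randperm(length(cand));
P = cand(idx(1:k));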
Now we have provided a solution which covers the general case |X| ≥ 1. The solution has
the drawback of restricting some of the input values, an issue which we will revisit in section
2.4.2.
To provide a fully generic approach, we still need to cover the case |Σ| ≥ 2. However, this is only a small problem: We recall the fact that |Σ| > 2 will cause φ(x) to be a lossy conversion from characters to signal values. This means that when comparing the signal values of different characters, they might turn out to be equal. However, as we discussed in section 2.3.2, the larger |Σ| is, the less likely randomly good matches of T in S are (if all other factors are constant). These two effects compensate for each other: While the lossy character-to-signal conversion increases the level of the noise in which to find good matches, the reduced probability of random matches decreases the noise. Figure 22 shows vector C constructed as in figure 17 but with Σ = {0, . . . , 65535}. Compared to figure 18 we do not observe any difference except for the influence of the randomly constructed input values. Actually, there is a slight difference in detail: Since φ is a "randomly created" function, there is a certain chance that we create an "unlucky" φ for our specific input values. This issue will be handled in the next section.
Comparing C to the match count vector V (see equation (2.1) on page 7) with varying |Σ|
reveals another issue: For larger |Σ|, the level of noise in V is reduced, while the level of noise
in C remains constant. This should be kept in mind for applications where C is meant to be used
as direct replacement for V .
Figure 22: Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)
2.3.5 Optimising Random Behaviour

As noted above, since φ is randomly created, it may turn out to be "unlucky" for specific input strings. Figure 23 illustrates such a case:

Σ = {a, b, c, d}
X = {1}
T = "abbd"
S = "dabbdcbabccaddc"

φ(x) = 1 if x = a or x = b, −1 otherwise
⇒ B = (1, 1, 1, −1, 0, 0, . . . )

Here T occurs at position 1 ∈ X, while the substring "babc" at position 6 is not equal to T. Applying φ to "babc" nevertheless yields the signal values (1, 1, 1, −1), which are identical to B, so the cyclic correlation indicates a full match at position 6 although only one character matches T.
For specific input strings this problem can often be solved by choosing a different ("lucky") φ. In figure 24, φ is redefined, which leads to the correct result:

φ(x) = 1 if x = a or x = c, −1 otherwise
However, there is no general solution which works for all input strings as long as φ is a lossy conversion. Given φ, it is always possible to maliciously construct input strings such that a good match of T in S is found where there is none. One tempting way to reduce the probability of finding "false matches" is to analyse the input strings and define φ purposefully instead of at random. This possibility is discussed later in section 4.2.2.
Nevertheless, there is something else we can do: In the equations (2.10) and (2.11), the
same φ is used to calculate the vectors A(i) and B(i) for all i ∈ {1, . . . , k}. This means that if a
false match occurs, it will occur in all vectors C(i) , at positions modulo the respective primes.
If we use a different φi for each i to calculate the vectors A(i) and B(i) , false matches are still
possible, but most likely on different positions for each prime, and therefore likely to be filtered
out as invalid results (as in table 2). Extending our previous equations (2.9), (2.10) and (2.11)
we now have (see also [1] on page 12):
∀i ∈ {1, . . . , k}:

Choose Σ_i ⊂ Σ with |Σ_i| = |Σ|/2 uniformly at random.

φ_i(x) = (−1)^{|Σ_i ∩ {x}|}   (2.21)

A_j^(i) = ∑_{λ=0}^{⌈(n−j)/p_i⌉−1} φ_i(S_{j+λ·p_i})   (2.22)

B_j^(i) = ∑_{λ=0}^{⌈(m−j)/p_i⌉−1} φ_i(T_{j+λ·p_i})   (2.23)
For further usage, figure 25 shows a Matlab function performing the creation of φi and the
projection accordingly.
2.4 The Algorithm

In the following sections, we present different variants of the algorithm to estimate X according to the model.

2.4.1 Reference Version

Figure 26 shows an algorithm to be used as the base for later comparisons and tests. This algorithm is not specified in [1]; we have created it by strongly simplifying the first algorithm presented there. It implements only the optimisation using the FFT (see section 2.3.3) and does not reduce the data size. To be able to compare the run time of the different algorithms, calls to the Matlab functions tic and toc are added at the beginning and the end of the function, respectively. tic starts the timer, and when calling toc the elapsed time is displayed.

The Matlab implementation of this algorithm uses a few functions which have not yet been introduced:
Figure 27: Matlab function to check input values according to the model
29 The returned positions are Matlab indices, which is why an additional decrement is performed to get zero-based positions.
30 Actually, our implementation of the function has a run time of O(tL), assuming that the length of C is L. Using an algorithm based on a priority queue (see e.g. [6] on page 194), a run time of O(L) can be achieved, but this would add a lot of complexity.
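Based on footnote 29, a sketch of the helper that extracts the zero-based positions of the t largest values of C could be (the name is ours; a full sort is simpler than, though not as fast as, the approaches of footnote 30):

function pos = largest_positions(C, t)
% Sketch: sort descending, take the first t indices, convert to zero-based.
[vals, idx] = sort(C, 'descend');
pos = idx(1:t) - 1;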
Description
After checking its input values according to the model restrictions, this algorithm basically
performs all the steps described in section 2.3.3 on page 13. It converts the strings S and T to
discrete signals A and B through use of the φ function and computes the cyclic correlation C of
these signals. The set X is then estimated by extracting the positions of the t largest values of
the cyclic correlation.
Example
Figure 28 shows an example application of the algorithm. Note that the resulting elements in
Xest might be in a different order than in the supplied X. This, however, is assumed not to be
a problem.31
n = 102400;
m = 512;
p = 0.9;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_simple(S, T, Sigma, length(X))
Figure 28: Matlab code which demonstrates the usage of our reference algorithm
Review
This algorithm has one great benefit: Its run time relies mostly on the FFT; other than that only
simple processing is done. Matlab uses the FFTW library [22] for fft/ifft-calls, which has
been heavily optimised and is very fast even for large input sizes. There are several drawbacks,
however: Even FFTW does not calculate the FFT faster than O(n log2 (n)), and memory usage
is very high. Also, false matches according to section 2.3.5 on page 28 cannot be filtered.
2.4.2 The First Variant

Two issues remain open before we can state the first variant of the algorithm; both are addressed below.

1. We need to deduce restrictions of input values from the probability bound in equation (2.20). These restrictions should be based on ε, where 1 − ε is the probability of correctly reconstructing X (according to our model). If ε is near 0, a high chance to succeed will be guaranteed, and thus the limits on the input values are more strict. If ε is near 1, the restrictions on the input values will be more graceful.
2. In section 2.3.3 we have emphasized the fact that vector C does not generally contain
match counts, and therefore we cannot easily process C to extract all positions with at
least 50% matches. This is even more true of the vectors C(i) , because some values
have been added up. Extracting only the t largest values, as we proposed for C, has the
drawback that one of these largest values might be a “false match” according to section
2.3.5, with the effect that we possibly miss one of the t results. Consequently, we still
need to specify how to extract spikes from vectors C(i) .
These issues are resolved in [1] as follows:

1. The input value constraints are specified as follows (see [1] on page 11):

16 log(4n/ε) / p² < m < min( √(32nε) / (t log n),  8(√n + 1) log(4n/ε) / p² )   (2.24)
p2 t log n p2
The number of primes k and the minimum size of the primes L are calculated from input
values and therefore indirectly restricted (see [1] on page 12):
log(2n/ε)
k= (2.25)
log(8n) − log(mt log(n))
8n log(2kn/ε)
L= (2.26)
mp2 − 8 log(2kn/ε)
According to [1] on page 28 the “restrictions placed upon the input parameters, and the
values assigned to L, have naturally erred on the side of caution”. This means that they
have been chosen such that the proofs concerning the algorithm are “successful”, but
apparently some of them are also the results of trial and error to prevent certain border
cases. For now we accept these constraints, but later we will analyse how restrictive they
are in applications.
2. For each i ∈ {1, . . . , k}, we will process the vector C(i) by finding all positions j with C_j^(i) > m·p/2 (according to [1] on page 12). Please note, however, that this does not clearly define the number of character matches we require; it is simply a bound that tends to work reasonably well with our model.
With these supplements, we finally present a Matlab function implementing “Algorithm 1.1” of
[1] (on pages 11-12). It is shown in figure 29. All previous results are used in this implementa-
tion, and additionally one new helper function is introduced:
cartesian_prod Given a cell array32 and the number of dimensions, this function calculates
the Cartesian product and returns it as cell array of vectors. The implemen-
tation is beyond the scope of this thesis and is provided in the appendix. On
a side note, the function has been optimised for two dimensions because
this is the most frequent case.
32 A cell array is a Matlab array with dynamic size and content.
Description
The algorithm starts by checking its input values, first against the limitations of the model and
second against the specific constraints in equation (2.24). It then initialises k and L according
to equations (2.25) and (2.26), and also sets a helper variable k̂ (khat) whose sole use is speed
optimisation. It selects k primes as shown in figure 21 on page 27 and then projects the input
strings onto subspaces according to section 2.3.5, figure 25 on page 30. Afterwards, the cyclic
correlations C(i) of vectors A(i) and B(i) are calculated using the FFT as in section 2.3.3, figure 9
on page 16. Positions in C(i) with values larger than m·p/2 are considered good matches and thus
extracted, and the Chinese Remainder Theorem is applied combining the extracted positions
modulo one prime with the positions modulo each other prime (similar to table 2 on page 26).33
Results which are within the valid range are accepted, building the estimated set X.34
Example
Figure 30 shows an example application of this algorithm similar to the “reference example” in
figure 28. It turns out that the input restrictions in equation (2.24) are not met by these input
values, although algorithm 1.1 does produce the correct solution with high probability (as can
be verified numerically). Thus the last parameter “pedantic” is set to 0 for this example in
order to disable the checking of the input value constraints on m.35 Figure 31 shows a different
example where the conditions on m are met, and pedantic is set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 0)
Figure 30: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)
n = 1000000;
m = 325;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 1)
Figure 31: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)
Review
The first thing we note is that ε is just a probability bound, and one can achieve correct results
with high numerical probability even if ε is near one. Similarly, the input constraints (equation
33 The exact implementation is slightly different, because only k̂ primes are used for the reconstruction, and the
result is checked against the remaining primes. However, this is only an optimisation; the effect is the same.
34 Following the model, we only accept values x ≤ n − m (instead of x < n as in [1]).
35 This should be done with caution because L might turn out to be negative, especially for small m.
(2.24)) are not irrevocable — we have to comply with them if we wish to use ε as proven
probability bound,36 but often we also achieve good results when ignoring them.
If we provide input values in the context of executable file comparisons, it will be quite hard
to stick to the input constraints. To achieve a good encoding of the file differences, we need
the correct result of the matching with high probability, so we choose ε = 0.1. We are mainly
interested in matches with only a few mismatches, and for that reason set p = 0.9. Further, we
assume that we wish to find the two best matches (t = 2), to allow for some flexibility. Given
these basic input parameters, n needs to be very large for us to be possible to meet the input
conditions. Figure 32 shows the upper and lower limits of m with these input values for varying
n. We observe that n needs to be roughly 8 · 107 simply to be able to select a valid m, and even
then we are restricted to m ≈ 430. Using a block size m of several kilobytes is only possible for
huge values of n, way beyond the usual size of executable files. This also means that the run
time of the algorithm given in [1] (on page 13) is not valid for file comparisons,37 because it
relies (by definition) on
m ≫ 16 log(4n/ε) / p²   (2.27)
i.e. m is required to be a lot larger than its lower limit.
There is even more to say about the input constraints: For our second Matlab example
(see figure 31) we selected the relatively small (given the restrictions) n = 10^6 and chose m = 325 to be near its lower limit. In this example the input leads to L ≈ 8.9686 · 10^5, and while the condition L < n is true as observed in [1] (page 13), the interval [L, L(1 + 2/ log(L))) ≈
36 Note that ε only guarantees a certain probability of success given random input according to our model.
37 at least not for today’s usual file sizes
[8.9686 · 10^5, 1.0277 · 10^6) from which the primes are randomly chosen actually exceeds n.
Therefore, it can happen that primes larger than n are selected, which is inefficient in terms of
time and memory, and can even lead to incorrect results, because A(i) is zero-padded for the
corresponding i. Even if false results are eventually caught, this is still a situation which is
undesirable. Given the calculation of L as it is, the limits on m are probably not strict enough.
Thinking in terms of L instead of considering the interval [L, L(1 + 2/ log(L))) also seems
to be a problem in the proofs of [1]. It specifically makes the proof of the algorithm’s time
bound appear questionable: The size of the cyclic correlation is assumed to be L (see [1] on
page 13), but in fact it can be larger, and therefore the size is not L but worst case L(1 +
2/ log(L)). Possibly this difference can be ignored for “huge” values of n, but it is nevertheless
an assumption which should at least have been justified if used in a proof. Actually, the run
time equation has the precondition n ≫ 1 (see [1] on page 13), implying asymptotic behaviour,
but this underlines the fact that the time bound cannot be applied for our application. We have
to expect slower run time, so further improvement of the algorithm is necessary.
Last but not least, given how the set X is estimated in this algorithm (by extracting many results and filtering those out of range), we have no guarantee of retrieving t results. We can get any number of results. For some applications this might be a desired effect, but when matching files we usually wish to find the t best matches.
2.4.3 The Second Variant

n = 50000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000]
Primes = [5003, 6007];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl_project(S, T, Sigma, Primes);
Figure 33: Matlab calls to construct data using primes around n/10
Figures 34 and 35 show the resulting cyclic correlations. As expected, there is a consid-
erably higher level of random noise compared to the previous example (figures 19 and 20).
For C(1) , we expect spikes at positions 2697, 3047, 3997. Those actually are present, but we
also observe random spikes, e.g. roughly at positions 1500, 3100. The expected spikes of C(2)
at positions 1693, 2043, 2993 are also present, but there are additional spikes, e.g. roughly at
positions 100, 2800. However, it is very unlikely that these spikes occur in all projections at
positions which reconstruct a valid result. Therefore, these spikes are “filtered” during recon-
struction, because they lead to results > n − m (with high probability). This is basically the
same as in section 2.3.5, where “false matches” are removed.
Figure 34: Plot of the cyclic correlation C(1) of data using primes around n/10
Theorem 1.3 in [1] on page 15 addresses this issue from a mathematical point of view. It
extends theorem 1.1 of [1] by considering additional random elements for each of the intersec-
tions. Assuming that these elements are selected randomly in sets Y (i) ⊂ {0, . . . , n − 1} − X for
all i ∈ {1, . . . , k},38 this looks as follows:39
X̄ = {0, . . . , n − 1} ∩ ((X ∪ Y^(1)) + p_1·Z) ∩ · · · ∩ ((X ∪ Y^(k)) + p_k·Z)   (2.28)
38 This definition has been simplified, for more information see [1].
39 This has been slightly corrected from [1].
Figure 35: Plot of the cyclic correlation C(2) of data using primes around n/10
With β denoting (roughly) the probability of an additional random element falling into each of the k sets,40 the lower probability bound corresponding to equation (2.20) becomes

1 − n · (β + t log(n) log(L) / L)^k   (2.29)
In principle, we can use the same procedure as in the first variant of the algorithm but with
shorter primes, and have a slightly smaller probability of succeeding. To stay within proven
bounds, however, we revisit the input value restrictions and the processing of C(i) from the
previous section:
1. The input value constraints should now be deduced from the new probability bound in
equation (2.29). Unfortunately, this is not trivial, because it involves establishing propo-
sitions about β . Theorem 1.4 in [1] on page 17 performs this task, basically giving guid-
ance on how "high" the spikes of the valid results still need to rise above the noise. The
resulting restriction according to [1] on page 19 is (again chosen such that the proof is
successful):

m < min( (n²·ε/2)^(1/3) / (t·(log(n))²),  √(n·ε/2) / (8p²) )   (2.30)
40 Cited from [1] on page 15. The definition of β given there as “the probability [...] of y falling into each of the
k sets” is very fuzzy, since y is undefined.
The number of primes k, the probability β and the minimum size of the primes L are
derived values and thus indirectly restricted (see [1] on page 20):41
k = log(2n/ε) / (log(n) − log(m·t·log(n)²))   (2.31)

β = (1/2) · (ε/(2n))^(1/k)   (2.32)

y = ( √(−log(β)) + √(log(4kt/ε)) )²   (2.33)

L = 2n·y / (m·p² − 2y)   (2.34)
We will analyse later how restrictive the limit on m actually is for applications.
2. To solve the problem of algorithm 1.1 that we are not guaranteed to get t results, we can start by extracting the t largest values in vectors C(i) for each i ∈ {1, . . . , k} and perform the reconstruction. However, as mentioned before, if for any reason we have a random spike rising above one of the results, we will miss some results and might end up with less than t values (because invalid results are removed). Given that the number of "additional spikes" in C(i) according to our new approach is expected to be β·p_i for each i ∈ {1, . . . , k} (see [1] on page 15), we can assume that, in the worst case, all of the additional spikes rise above our actual results. Therefore, we are on the safe side if we extract the β·p_i + t largest values.
Now that these issues are solved, we present a Matlab function implementing “Algorithm 1.2”
of [1] (on pages 19-20). It is shown in figure 36.
Description
At first, this algorithm checks its input values against the limitations of the model as well as
against the constraints given in equation (2.30). Next, it initialises k, β and L according to
equations (2.31), (2.32) and (2.34). It selects k primes and then projects the input strings onto
subspaces. In the next step, the cyclic correlations C(i) of vectors A(i) and B(i) are calculated
using the FFT. The positions of the β pi +t largest values in C(i) are considered as candidates for
good matches and thus extracted, and the Chinese Remainder Theorem is applied combining
the extracted positions modulo one prime with the positions modulo each other prime (similar
to table 2 on page 26). Those results that are within the valid range are accepted, and form the
estimated set X.
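The combination step can be illustrated as follows. This C++ sketch (our own illustration; the Matlab code instead enumerates all combinations) solves the two-congruence case of the Chinese Remainder Theorem for a pair of extracted positions:

#include <cstdint>

// Extended Euclidean algorithm: returns gcd(a, b) and coefficients
// x, y such that a*x + b*y = gcd(a, b).
static int64_t egcd(int64_t a, int64_t b, int64_t& x, int64_t& y) {
    if (b == 0) { x = 1; y = 0; return a; }
    int64_t x1, y1;
    int64_t g = egcd(b, a % b, x1, y1);
    x = y1;
    y = x1 - (a / b) * y1;
    return g;
}

// Find the unique z modulo p1*p2 with z = a1 (mod p1) and z = a2 (mod p2),
// for distinct primes p1 and p2 (of moderate size, to avoid overflow).
int64_t crtPair(int64_t a1, int64_t p1, int64_t a2, int64_t p2) {
    int64_t x, y;
    egcd(p1, p2, x, y);               // p1*x + p2*y = 1, since gcd(p1, p2) = 1
    int64_t mod = p1 * p2;
    int64_t z = (a1 * p2 % mod) * (y % mod) % mod
              + (a2 * p1 % mod) * (x % mod) % mod;
    z %= mod;
    if (z < 0) z += mod;
    return z;
}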
Example
Unfortunately, the Matlab implementation of algorithm 1.2 cannot be run on the reference example. Figure 37 shows the corresponding calls, ignoring the limit on m in equation (2.30),
41 In equation (2.33) we choose the name y instead of x to avoid a naming conflict.
i.e. using 0 for the parameter “pedantic”. However, the algorithm fails, because k turns out to be negative. This points to one major weakness of the algorithm: for our use, either n has to be huge, or m needs to be extremely small. In contrast to algorithm 1.1, which tends to work well outside the specified limits of m, algorithm 1.2 seems to depend strongly on the limitation. Even if we choose m such that k is just barely non-negative (still ignoring the limit), k is unnecessarily large (e.g. around 20), which in turn has a very negative impact on the run time of the algorithm.
If the limit is respected, however, the algorithm can be run.42 Figure 38 shows an example with pedantic set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 0)
Figure 37: Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
n = 1000000;
m = 90;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 1)
Figure 38: Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
Review
As mentioned above, the upper limit on m is very strict and cannot be ignored. When analysing it for varying n in the same context as before (i.e. with ε = 0.1, p = 0.9, t = 2), we observe that only small block sizes are allowed with this algorithm. Figure 39 shows the corresponding plot.
However, even if the limit on m is met, the run time and memory usage of the Matlab implementation of algorithm 1.2 are unfortunately prohibitive.43 The main reason for this is that β p_i + t easily turns out to be several thousand, so thousands of largest values are extracted from the vectors C(i). Since solving according to the Chinese Remainder Theorem is not optimised in Matlab, calculating the solutions for all combinations is a very slow process and clearly the bottleneck of our implementation. Colin Percival shows in [1] on page 22 that the seemingly quadratic run time O((βL + t)²) of this part of the algorithm is actually not quadratic, by the definitions of β and L. However, even in theory this is questionable: it again relies
42 Please note that this example might take several hours to run, and Matlab might run out of memory due to the
large Cartesian product.
43 Specifically, either Matlab runs out of memory or we have given up waiting for results after 8 hours.
on the assumption that primes of size L are used, while their worst-case size is roughly L(1 + 2/log(L)). This difference should not be dismissed without further comment, especially because it enters the bound squared. If we use a completely different way to reconstruct X, one which does not tend to have a quadratic run time, this problem is solved.
2.4.4 The Third Variant
The third variant of the algorithm replaces the reconstruction via the Chinese Remainder Theorem. Instead, for all j ∈ {0, …, n − 1}, a vector F is calculated as

$$F_j = \sum_{i=1}^{k} C^{(i)}_{j \bmod p_i} \qquad (2.35)$$

Spikes which existed e.g. in C(1) at positions modulo p_1 and in C(2) at corresponding positions modulo p_2 are added up and lead to spikes in F. Thus, F is actually an approximation of C, and equation (2.35) can be seen as an “inverse projection”, because it restores the spikes at the positions where they would have been without the projection. Further processing of F can then be done like the processing of C in our reference algorithm.
While this is intuitively a reasonable result, the approximation according to the Bayesian
analysis is a bit different (see [1] on page 24):
∀ j ∈ {0, …, n − 1}:

$$F'_j = \sum_{i=1}^{k} \frac{C^{(i)}_{j \bmod p_i} - \frac{mp}{2}}{\sigma_{p_i}(n, m, j)} \qquad (2.36)$$

with σ being the standard deviation defined in [1] on page 13. Two problems arise when using F′ in this form:
1. The calculation of F′ depends on the match probability p and on σ, a dependency we would like to avoid (see also the end of this section).
2. Using equation (2.36) to calculate F 0 will not lead to correct results for maliciously
formed X (see [1] on page 24). This is more a theoretical problem, because these X
are very unlikely to occur in real applications, but it is still a drawback.
While the best way to solve the first problem is to use F instead of F′, the second problem can be dealt with by performing further processing of C(i) “for some appropriate δ”44 (see [1] on page 25):

∀ i ∈ {1, …, k}; ∀ j ∈ {0, …, p_i − 1}:

$$\exp\left(-D^{(i)}_j\right) = \sqrt{\delta} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right)$$

$$\Leftrightarrow \quad D^{(i)}_j = -\log\left(\sqrt{\delta} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right)\right) \qquad (2.38)$$
In order to numerically show “what happens” during the calculation of the vectors D^(i), we define

$$x = \frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}$$

to be used as variable, set δ = 2 and plot y = −log(√δ + e^(−x)). The result is shown in figure 40.
Interpreting this plot, we observe that x > 0 if and only if C^(i)_j > mp/2, due to the restrictions of the model. This means that, according to the plot, spikes in vectors C(i) larger than mp/2 are
44 This is cited from [1] on page 24 to make it clear that δ is a very abstract value.
Figure 40: Plot of y = −log(√δ + e^(−x)) with δ = 2
truncated. Since y ≈ x for all x ≤ 0, all other values are relatively maintained by this function.
Figure 41 shows a Matlab function which calculates a vector D from C, using the definition of
δ as specified later in equation (2.41) and an approximation from [1] on page 28:
$$\sigma_{p_i}(n, m, j) \approx \frac{nm}{p_i} \qquad (2.40)$$
Applying this function to a vector C calculated according to figure 17 on page 24, we clearly see
that all values are relatively maintained, but the spikes are truncated. Figures 42 and 43 show
vectors C and D, respectively.45
Based on these numerical observations, it is questionable why Colin Percival states in [1] on page 24, while deriving the algorithm, that “D^(i)_j = max(C^(i)_j, δ)”. This is in contrast to the definition of D^(i) in equation (2.38)46, which, to the best of our knowledge, leads to a truncation
45 Note that the second figure has been scaled, but the ratio was maintained.
46 which was cited from [1]
of spikes in C(i), thereby showing the behaviour of a “min” function.47 Due to this discrepancy,
47 One might argue that truncating large values and raising small values are essentially the same, but this is questionable: raising all values to a certain minimum level would remove some of the random noise.
and to avoid too much dependency on p, we decide to stick to our initial approach and calculate
F in our implementation as specified in equation (2.35).
What remains to be discussed is, as in the previous variants of the algorithm, the restrictions
on input values and the further processing to estimate X:
1. No input restrictions are specified in [1] for this algorithm, so only the restrictions of the model apply. The number of primes k and the minimum size of the primes L are calculated as follows (see [1] on page 24):

$$\delta = \frac{tmp^2\log(n)}{n} \qquad (2.41)$$

$$k = \frac{\log(nt/\varepsilon)}{\log(1/(4\delta))} \qquad (2.42)$$

$$L = \frac{-8n\log\left(\sqrt[2k]{\varepsilon/(nt)} - \delta\right)}{mp^2 + 8\log\left(\sqrt[2k]{\varepsilon/(nt)} - \delta\right)} \qquad (2.43)$$
However, these definitions lead to implicit input restrictions, because negative values of k and L are not applicable. For example, n = 10000, m = 300, t = 3, ε = 0.1 leads to k = −12. We observe that this happens if 4δ > 1, since in that case log(1/(4δ)) < 0. The case 4δ = 1 should also be prevented, as it leads to a division by zero. Therefore, we require 4δ < 1 and derive an input restriction ourselves:48

$$4\delta < 1 \;\Leftrightarrow\; \frac{4tmp^2\log(n)}{n} < 1 \;\Leftrightarrow\; 4tmp^2\log(n) < n \;\Leftrightarrow\; m < \frac{n}{4tp^2\log(n)} \qquad (2.44)$$
This input restriction helps to avoid invalid input values, though we do not claim that it
covers all cases.
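To make these implicit restrictions concrete, the following C++ sketch (our own helper, with names and the use of double precision as our assumptions) checks equation (2.44) and derives δ, k and L according to equations (2.41) to (2.43):

#include <cmath>
#include <stdexcept>

struct Variant3Params { double delta, k, L; };

// Check the input restriction (2.44) and derive delta, k and L following
// equations (2.41)-(2.43).
Variant3Params deriveParams13(double n, double m, double t,
                              double p, double epsilon) {
    Variant3Params r;
    r.delta = t * m * p * p * std::log(n) / n;                  // (2.41)
    if (4.0 * r.delta >= 1.0)                                   // (2.44)
        throw std::runtime_error("m too large: 4*delta >= 1");
    r.k = std::log(n * t / epsilon) / std::log(1.0 / (4.0 * r.delta)); // (2.42)
    double root = std::pow(epsilon / (n * t), 1.0 / (2.0 * r.k));      // 2k-th root
    if (root <= r.delta)
        throw std::runtime_error("invalid parameters: logarithm undefined");
    double lg = std::log(root - r.delta);
    r.L = (-8.0 * n * lg) / (m * p * p + 8.0 * lg);             // (2.43)
    return r;
}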
We now present a Matlab function implementing “Algorithm 1.3” of [1] (on pages 24-25) with
the changes as described in this section. It is shown in figure 44 on the following page.
Description
The first step of this algorithm is, as in the other variants, to check its input values against
the limitations of the model and against the specific constraints we defined in equation (2.44).
48 In addition to the model definitions, we assume n > 1 and t > 0 to prevent division by zero.
Afterwards, it initialises δ , k and L according to equations (2.41), (2.42) and (2.43). It selects
k primes and then projects the input strings onto subspaces. Then the cyclic correlations C(i) of
vectors A(i) and B(i) are calculated using the FFT. In the next step, vector F is calculated as in
equation (2.35) of this section. The set X is estimated by extracting the positions of the t largest
values of F.
Example
Using algorithm 1.3 is fairly straightforward, since the limit on m is not too restrictive. Figure 45 shows an example application of this algorithm similar to the “reference example” in figure 28, with pedantic set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_13(S, T, p, Sigma, length(X), epsilon, 1)
Figure 45: Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
Review
The plot of the upper limit of m given ε = 0.1, p = 0.9, t = 2 is shown in figure 46. The limit
is very generous, and we can easily select small as well as large block sizes. Consequently, this
variant of the algorithm is the best candidate to be used for delta compression of executable
code.
We also note that calculating F does not tend to have a quadratic run time, and is therefore preferable compared to calculating all combinations of solutions according to the Chinese Remainder Theorem. In theory, it is not even necessary to calculate the whole vector F, as only the largest values of the sum need to be considered (see [1] on page 24).
2.4.5 Comparison
In order to compare the different variants of the algorithm with regard to speed and probability
of success, we run them several times with the same input data and check whether they are
successful, i.e. whether X is estimated correctly. All tests are performed on a Windows XP
system with Intel Core2 CPU 2.1 GHz and 1 GB RAM, using Matlab version 7.1.49
Table 3 shows the input values and the corresponding test results of 100 runs, given Σ =
{0, . . . , 255}. We skip algorithm 1.2 entirely, because it fails due to input restrictions.50
Table 3: Test results using n = 102400, m = 512, p = 0.9, X = {34, 39411, 101410}

Algorithm    Average Run Time    Probability of Success
Reference    0.03 s              1.0
1.1          2.32 s              1.0
1.2          (not applicable)    (not applicable)
1.3          16.88 s             1.0
These results are characteristic for large input values with the precondition n ≫ m and m not being too small, in the context of matching blocks of executable code. Given data according to the model, the numerically measured probability of success is usually about 1. Small input values lead to problems with regard to input restrictions.
Comparing the run times clearly shows that the calculation of the FFT is heavily optimised in Matlab: the more complexity is added outside of the FFT, the slower the algorithm runs. This is especially unfortunate for algorithm 1.3, since we chose it for use in delta compression of executable code. Therefore, we intend to write a Matlab-independent implementation of this algorithm.
Chapter 3
Implementation
In this chapter, we present a C++ implementation of algorithm 1.3 as described in section 2.4.4
on page 43. The implementation aims at overcoming the drawbacks of the Matlab functions
concerning memory usage, speed and reusability to allow further use of this algorithm, for
example in a patch tool.
3.1 Considerations
3.1.1 Portability
When planning to implement the algorithm, we need to consider portability from the start, as it
severely affects the choice of libraries and the basic technical design of the implementation. In
order to achieve portability across a wide range of operating systems, we decide to use CMake
[24] as “build environment”. Thus, we are able to support compiling the program on Windows
and Linux systems as well as BSD-based systems like Mac OS X. This also enables the use of
different compilers on the respective systems.
In order to accomplish portability of the implementation itself, we avoid system-specific calls as much as possible and instead use the standard libraries provided by the compiler, or portable third-party libraries.
3.1.2 Environment
Although we wish to optimise for speed, we do not claim to write optimal code for all subproblems. We do not wish to reinvent the wheel; instead, we intend to reuse widespread and optimised algorithms. While interpreted environments like Microsoft .NET offer various algorithms, programs running in such environments are usually considerably slower than corresponding native applications. This is why we choose C++ as programming language. C++ is based on C and offers low-level functions, but additionally it provides the “Standard Template Library” (STL)51, which supplies various stable and optimised algorithms and containers, e.g. the algorithms of the “<algorithm>” header and priority queues, which can easily be used. Table 4 shows a list of libraries which we utilise to implement the algorithm for matching with mismatches.
We have to be aware of the fact that this adds dependencies to our implementation of the algorithm. Possible tools based on it will also have these dependencies. However, given the complexity of the algorithm, we think that these dependencies are justified. They will only be present on the side of patch generation; a tool applying the patch can perform its task without requiring Fast Fourier Transforms and priority queues.52
Finally, we need to make sure that the third-party libraries we are using integrate well
enough with our implementation, both in terms of license and technology.
52 The tool applying the patch might still depend on libraries to uncompress the data, but this is beyond the
scope of this thesis.
53 Single Instruction Multiple Data
54 A C++ template class is a class which can be used for different base types, e.g. a generic vector for real and complex values.
3.2 Implementation
3.2.1 Structure
We start the implementation with the Matlab code of algorithm 1.3 in mind (see figure 44 on
page 48), but also with regard to delta compression of executable code. Some functions used in
Matlab are not available in C/C++, so we have to implement them on our own. Table 5 shows
the core modules55 and the corresponding source files of our implementation.
In order to implement a patch tool for delta compression of executable files based on these
core modules, further functions are required. These are listed in table 6. Furthermore, a file
format needs to be defined. The implementation of these additional modules is beyond the
scope of this thesis.
We pre-calculate all primes which we could possibly need for the valid input range. These primes are stored in a static array directly in the source code. Thus, choosing primes is very efficient regardless of the interval.
In order to select primes from a given interval, we first find the boundaries of that interval within our array of pre-calculated primes using the STL algorithm lower_bound. Next we create a random permutation of all primes within the boundaries by utilising the STL algorithm random_shuffle, and then return the first k of these primes as a vector of integers.
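A condensed C++ sketch of this selection (our own illustration; the prime table shown is a placeholder, and std::shuffle stands in for the random_shuffle call of the actual implementation):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Placeholder for the pre-calculated, sorted prime table compiled into the
// source; the real array covers the whole valid input range.
static const int kPrimes[] = { 10007, 10009, 10037, 10039, 10061, 10067 };

// Return up to k randomly permuted primes from the interval [lo, hi).
std::vector<int> selectPrimes(int lo, int hi, std::size_t k, std::mt19937& rng) {
    const int* begin = kPrimes;
    const int* end = kPrimes + sizeof(kPrimes) / sizeof(kPrimes[0]);
    // Locate the interval boundaries in the sorted table.
    const int* first = std::lower_bound(begin, end, lo);
    const int* last = std::lower_bound(begin, end, hi);
    std::vector<int> primes(first, last);
    std::shuffle(primes.begin(), primes.end(), rng); // random permutation
    if (primes.size() > k)
        primes.resize(k);
    return primes;
}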
After computing the index, the actual matching of a specific block T with S can be performed. It consists of the following steps:
1. The vectors B(i) are calculated from T according to equation (2.23) on page 29.
2. The cyclic correlations C(i) are computed by use of FFTW functions56 with the plans provided by the index.
3. The sums

$$\sum_{i=1}^{k} C^{(i)}_{j \bmod p_i}$$

for all j ∈ {0, …, n − 1} are calculated (see also equation (2.35) on page 43). In order to save memory, we do not compute the vector F explicitly. Instead, we maintain a list of the positions of the t largest values while calculating the sum.
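A C++ sketch of this memory-saving reconstruction (our own illustration, using std::priority_queue; names and types are our assumptions):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Compute the sums of equation (2.35) position by position and keep only
// the positions of the t largest values, instead of storing all of F.
std::vector<std::size_t> topPositions(
        const std::vector<std::vector<double> >& C, // C^(i), one per prime
        const std::vector<std::size_t>& primes,     // p_1, ..., p_k
        std::size_t n, std::size_t t) {
    typedef std::pair<double, std::size_t> Entry;   // (value, position)
    // Min-heap: the smallest of the current t best sums sits on top.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > best;
    for (std::size_t j = 0; j < n; ++j) {
        double f = 0.0;
        for (std::size_t i = 0; i < primes.size(); ++i)
            f += C[i][j % primes[i]];               // C^(i) indexed mod p_i
        if (best.size() < t) {
            best.push(Entry(f, j));
        } else if (f > best.top().first) {
            best.pop();                             // evict the smallest
            best.push(Entry(f, j));
        }
    }
    std::vector<std::size_t> positions;
    for (; !best.empty(); best.pop())
        positions.push_back(best.top().second);
    return positions;
}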
These steps can be repeated for different blocks T without recreating the index. However, the
block size m needs to be kept constant.
2. Next, the original file, which is assumed to be present on the target system, is read as
string S, with n being the size of that file.
3. Given S, n and the block size m = √(n log(n)) (according to [1] on page 34), the index is created as described in the previous section.
4. The new file including e.g. a security fix is read in blocks of length m, and each block is
matched with the original file. For each block only the best match is considered (t = 1).
The result is a mapping of positions in the new file to corresponding positions of similar blocks
in the original file.
The source code of the implementation, called “rndiff”, is available at http://sourceforge.net/projects/rndiff.
First tests of the implementation show that the run time is slower than that of the Matlab reference algorithm presented in section 2.4.1 on page 30 for similar input values. This is due to the fact that neither the projection (specifically the calculation of vectors B(i)) nor the computation of the sum to reconstruct the results are optimised in the C++ implementation. If the run time turns out to be too slow for specific applications, further efforts will have to be put into optimising these parts.
We also note that when applying “rndiff” to two files which are equal, the resulting mapping usually does not consist of one contiguous block, as one would expect. This is due to the fact that the bytes of executable files are in general not uniformly distributed, an issue which will be addressed in the next chapter.
Chapter 4
Improvements
Small improvements and extensions concerning the algorithm have already been integrated directly into the corresponding sections where appropriate. In this chapter, additional improvements beyond the scope of the other parts of this thesis are presented. Theoretical changes are considered as well as improvements concerning the practical usage of the algorithm, with special regard to reproducibility of results.
To evaluate the probability bound of theorem 1.1 numerically, we use a Matlab function implementing the “filtering by intersecting” as in the theorem. The function is shown in figure 47.
We run several numeric tests, repeating each test 1000 times. Only exact reconstructions count as successes. Table 8 shows the input values, the numerical probability of correct reconstruction, and the corresponding lower bound according to theorem 1.1 (see equation (2.20) on page 26).
The results suggest that the lower probability bound is very pessimistic. This has several reasons:
• Colin Percival states in the proof of theorem 1.1 (see [1] on pages 10-11) that the product

$$\prod_{j=1}^{t}(y - x_j) \qquad (4.1)$$

is bounded in absolute value by n^t, since each of the t factors is at most n − 1 in absolute value.
Thus, since the x_j are pairwise distinct, a tighter (but less trivial) upper bound of the product in equation (4.1) is:

$$\prod_{j=1}^{t}(n - j) = \frac{(n-1)!}{(n-t-1)!} \qquad (4.2)$$
By definition of X in the model (see 2.3.1 on page 10), “X = {x_1, …, x_t} ⊂ {0, …, n − m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t”, meaning that the difference between two consecutive elements of X is at least m. This leads to an even tighter upper bound of the product (a brief numerical illustration follows after this list):

$$\prod_{j=1}^{t}(n - jm) \qquad (4.3)$$
• The proof of the probability bound of theorem 1.1 is based on other worst case assump-
tions, and the probability that worst case values are chosen is very small. However, the
probability bound is meant to be true for all inputs, i.e. it has no further assumptions
about the context.
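As a brief numerical illustration (with numbers chosen by us, not taken from [1]): for n = 100, t = 3 and m = 10, the trivial bound yields n^t = 1000000, the bound of equation (4.2) yields 99 · 98 · 97 = 941094, and the bound of equation (4.3) yields 90 · 80 · 70 = 504000, roughly half of the trivial bound.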
To conclude, we can achieve a direct improvement of theorem 1.1 if we use a random permutation of primes instead of selecting them uniformly at random. This change is already implemented in the Matlab function in figure 21 on page 27. There is further potential for improvement by using a different approach in the proof of theorem 1.1.
These results show why we observed that algorithm 1.1 works well outside its proven limits.
4.2 Derandomisation
One problem with the usage of the algorithm presented in chapter 2 is that its result can be
different each time we apply the algorithm. Especially when the algorithm is used in a tool to
create patches, a certain reproducibility is expected by the users so that they trust the program.
Generally, there are two possibilities to achieve this:
1. The result is “guaranteed” following the law of large numbers, regardless of the random
selection.
2. Instead of randomly choosing values, these values are calculated according to rules and/or
additional input values.
The first of these options is practically impossible to achieve, given the nature of the algorithm. We would need to run the algorithm repeatedly, and even then we could only state that the same result will be returned most of the time. Unfortunately, if a different result is returned only once, a user might lose trust in the product. Any application concerning security patches should output reproducible results.
This leaves the second option as the only alternative. We have to think of rules for choosing the values, as well as possible input parameters which can be used to “tweak” these rules.
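As a minimal illustration of the second option (our own sketch, not the approach developed in the following sections), the “random” values could be drawn from a generator seeded deterministically from the input itself, together with an optional tweak parameter:

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Seed a Mersenne Twister deterministically from the input data using a
// simple FNV-1a hash; identical input (and tweak) yields identical choices.
std::mt19937 makeDeterministicRng(const std::string& input, std::uint32_t tweak) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < input.size(); ++i) {
        h ^= static_cast<unsigned char>(input[i]);
        h *= 16777619u;
    }
    return std::mt19937(h ^ tweak);
}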
In executable files, for example, large parts of the file may be filled with zeros. Additionally, some executable files are statically linked to resources, e.g. bitmaps. Given that this depends on the type of executable file and the underlying operating system, we cannot assume a specific probability distribution of the alphabet for all executable files. But we can calculate the numeric probability distribution of a specific file in a single pass by counting the characters. The result can be used to calculate φ such that φ actually maps half of the input values to 1 and the other half to −1 for non-random input values, which previously was only the case for uniformly distributed input.
Specifically, we can create the function φ as follows: Given the probability distribution, we
calculate the first bit of the Huffman code (see e.g. [13] on pages 99-113) for each character of
Σ. Afterwards, we map φ to −1 for all characters with 0 as first bit of the Huffman code, and
to 1 for all characters with 1 as first bit. Due to the basic properties of the Huffman code, this
usually59 leads to a fair distribution of the mapping between −1 and 1.
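The construction can be sketched in C++ as follows (our own illustration, assuming byte frequencies have already been counted). Since the first code bit of a symbol only depends on which child of the Huffman tree root contains it, it suffices to run the usual Huffman merging until two nodes remain:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Build phi from byte frequencies: run Huffman merging until two nodes
// remain (the children of the tree root); symbols in the first node get
// first code bit 0 and map to -1, symbols in the second map to +1.
std::vector<int> buildPhi(const std::vector<std::uint64_t>& freq) { // size 256
    typedef std::pair<std::uint64_t, std::vector<int> > Node; // (frequency, symbols)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node> > heap;
    for (int c = 0; c < 256; ++c)
        if (freq[c] > 0)
            heap.push(Node(freq[c], std::vector<int>(1, c)));
    while (heap.size() > 2) { // merge the two least frequent nodes
        Node a = heap.top(); heap.pop();
        Node b = heap.top(); heap.pop();
        a.first += b.first;
        a.second.insert(a.second.end(), b.second.begin(), b.second.end());
        heap.push(a);
    }
    std::vector<int> phi(256, 1); // default for characters not in the file
    int sign = -1;
    for (; !heap.empty(); heap.pop()) { // first root child -> -1, second -> +1
        const std::vector<int>& symbols = heap.top().second;
        for (std::size_t i = 0; i < symbols.size(); ++i)
            phi[symbols[i]] = sign;
        sign = 1;
    }
    return phi;
}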
If we use the original file without the security fix for the calculation, we do not even need
to store the probability distribution in the patch file, because this file is present on the target
system.60
59 If, for example, a certain character has a probability of occurrence larger than 0.5, a fair distribution cannot be achieved by φ. In that case, further pre-processing of the input string could be done, but this is beyond the scope of this thesis.
60 However, we should make sure that the correct file is used, for example by checking a cryptographic hash.
Chapter 5
Conclusion
In the present thesis, we have explained the background behind the algorithm for matching
with mismatches from [1], with regard to delta compression of executable code. Supported
by numerical examples, we have critically analysed the different variants of the algorithm and
implemented them in Matlab.
Based on the Matlab code, we have created a C++ implementation of the last variant of the algorithm, specifically optimising for speed and memory usage. This implementation can be used as the basis for a tool that creates compact security patches. Further improvements with regard to reproducibility have been proposed, such that the algorithm is able to produce identical output given identical input.
First tests show that the run time of all variants of the algorithm in practice severely depends
on an efficient implementation of its inner loops and the sub-algorithms it uses. Since the FFTW
library is heavily optimised, we have yet to find a numerical example where one of the variants
of the algorithm is faster than our reference algorithm, which mostly relies on the FFT.
It seems hard to beat FFTW in terms of speed, even though the last variant of the algorithm has a run time “sublinear in n” (see [1] on page 9). The O(n log₂(n)) run time of the FFT only becomes significant for very large values of n, because FFTW keeps the constant parts of the run time very low. In contrast, the non-FFT parts of the algorithm are not easy to optimise, so their practical run time outweighs the theoretical advantage. Additionally, the theoretical run time of the algorithm is only asymptotic for growing values of n.
That being said, the main advantage of the algorithm, in particular of the C++ implementation, is its low memory usage. In order to achieve a speed near that of the reference algorithm for usual file sizes of a few megabytes, more effort needs to be devoted to optimisation.
Appendix
Figure 48: Matlab function to provide numeric probability on a good match not in X
Figure 51: Matlab function to compute C as in figure 8 by using the FFT as in figure 9
Figure 52: Matlab function to simply calculate the positions of the t largest values
Figure 53: Matlab function to iteratively calculate the Cartesian product (2 dimensions)
Bibliography
[1] Colin Percival: Matching with Mismatches and Assorted Applications, D.Phil. thesis, Uni-
versity of Oxford 2006.
[3] Gonzalo Navarro: A Guided Tour to Approximate String Matching, ACM Computing
Surveys, 33 (1): 31-88, 2001.
[4] Giovanni Motta, James Gustafson, Samson Chen: Differential Compression of Executable
Code, In Proceedings of the 2007 Data Compression Conference, Pages 103-112, 2007.
[5] David Salomon: Data Compression, The Complete Reference, 4th Edition, Springer 2007.
[6] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein: Introduction to Algorithms, 2nd Edition, The MIT Press 2001.
[7] Manfred R. Schroeder: Number Theory in Science and Communication, 4th Edition,
Springer 2006.
[9] John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman: Introduction to Automata Theory, Languages, and Computation, 3rd Edition, Addison-Wesley 2007.
[10] Leo I. Bluestein: A Linear Filtering Approach to the Computation of Discrete Fourier
Transform, IEEE Transactions on Audio and Electroacoustics, AU-18 (4): 451-455, 1970.
[11] Hari Krishna: Digital Signal Processing Algorithms: Number Theory, Convolution, Fast
Fourier Transforms, and Applications, CRC Press 1998.
[12] William H. Press, Saul A. Teukolsky, William T. Vetterling: Numerical Recipes in C: The
Art of Scientific Computing, 2nd Edition, Cambridge University Press 1993.
[13] André Neubauer: Informationstheorie und Quellencodierung: Eine Einführung für Ingenieure, Informatiker und Naturwissenschaftler, J. Schlembach Fachverlag 2006.
[14] André Neubauer: Kanalcodierung: Eine Einführung für Ingenieure, Informatiker und
Naturwissenschaftler, J. Schlembach Fachverlag 2006.
[15] Microsoft Corporation: Using Binary Delta Compression (BDC)
Technology to Update Windows Operating Systems, BDC_v2.doc,
http://www.microsoft.com/downloads/details.aspx?FamilyID=
4789196c-d60a-497c-ae89-101a3754bad6&displaylang=en, 2005.
[16] ISO/IEC 9899:1990(E), Programming Languages – C (ISO C90 and ANSI C89 standard),
1990.
[17] ISO/IEC 14882:1998(E), Programming Languages – C++ (ISO and ANSI C++ standard),
1998.