Lothar May
Master Thesis
Information Technology
Supervisors:
Prof. Dr. André Neubauer
Prof. Dr. Michael Tüxen
Contents

1 Introduction
  1.1 Security Patches
  1.2 Matching With Mismatches
  1.3 Structure
  1.4 Conventions
  1.5 Definitions
2 Analysis
  2.1 Approach
  2.2 Motivation
    2.2.1 Intuitive Substring Matching
    2.2.2 Run Time Considerations
  2.3 Derivation of the Algorithm
    2.3.1 Formal Model
    2.3.2 Restrictions of the Model
    2.3.3 Optimisation using the FFT
    2.3.4 Projecting onto Subspaces
    2.3.5 Optimising Random Behaviour
  2.4 The Algorithm
    2.4.1 Reference Version
    2.4.2 The First Variant
    2.4.3 The Second Variant
    2.4.4 The Third Variant
    2.4.5 Comparison
3 Implementation
  3.1 Considerations
    3.1.1 Portability
    3.1.2 Environment
    3.1.3 Library Integration
  3.2 Implementation
    3.2.1 Structure
4 Improvements
  4.1 Theorem 1.1
    4.1.1 Selecting Primes Randomly
    4.1.2 Numerical Probability
  4.2 Derandomisation
    4.2.1 Selecting Primes Non-Randomly
    4.2.2 Creating φ Non-Randomly
5 Conclusion
Appendix
Bibliography
List of Figures
33 Matlab calls to construct data using primes around n/10
34 Plot of the cyclic correlation C(1) of data using primes around n/10
35 Plot of the cyclic correlation C(2) of data using primes around n/10
36 Matlab function implementing "Algorithm 1.2" of [1]
37 Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
38 Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
39 Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)
40 Plot of y = −log(√(δ + e^−x)) with δ = 2
41 Matlab function to calculate vector D
42 Plot of the raw cyclic correlation C
43 Plot of the filtered cyclic correlation D
44 Matlab function implementing "Algorithm 1.3" of [1]
45 Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
46 Limit on m in algorithm 1.3 (ε = 0.1, p = 0.9, t = 2)
47 Matlab function to reconstruct X as in theorem 1.1 of [1]
48 Matlab function to provide numeric probability on a good match not in X
49 Helper function to add up good matches not in X
50 Helper function to count good matches not in X
51 Matlab function to compute C as in figure 8 by using the FFT as in figure 9
52 Matlab function to simply calculate the positions of the t largest values
53 Matlab function to iteratively calculate the Cartesian product (2 dimensions)
54 Matlab function to recursively calculate the Cartesian product
List of Tables
Chapter 1
Introduction
1 Even though not entirely correct, "bandwidth" is used as a synonym for data rate, as is often done in the computer science literature.
2 and also many embedded systems
to mention that some security fixes cause changes in multiple files, and this will again result in
large patches if the new files are copied.
A totally different approach is to consider the existing files on the system, and compare
them to the corresponding files containing security fixes, in order to avoid replacing entire files
by new ones. Since these fixes usually implement only a few changes, most of the data is already present on the system, within the files which need patching. If this method is applied, and only the differences between the files are stored in security patches, the patches can be tremendously smaller. Actually, some OS vendors have recently started using this technique to deploy at least some of their updates:
• Microsoft Windows Update on Windows XP and above supports “Binary Delta Compres-
sion (BDC)” (see [15]), but this is a proprietary system and little information is available
on its inner workings.
• FreeBSD and Mac OS X both come with the open source tool “bsdiff 4” (see [2]) and
provide update tools which (can) make use of it.
Colin Percival, the author of bsdiff, has written a doctoral thesis [1] on this subject, in which he
presents an algorithm to further improve bsdiff 4. Unfortunately, he did not publish the source code for his new algorithm, and there is only little third party material available on it. A reference to bsdiff 6 can be found in [5] (on page 939), but it does not provide a description. The authors of [4] use bsdiff 6 for comparison with their patch tool, but again no details are mentioned. If we wish to make use of the new algorithm, we need to work through the doctoral thesis.
very little code is actually added or removed. However, due to these modifications, memory
addresses throughout the executable file change.5 In this context, an algorithm for matching
with mismatches helps to find similar “blocks” of the original file in the new file, such that the
differences can be identified and encoded. Blocks or parts of blocks which are identical in both
files can be referenced and do not need to be copied.
1.3 Structure
This thesis is structured as follows:
1. Introduction
In the course of this chapter, conventions and definitions are provided which are used
throughout the thesis.
2. Analysis
The analysis contains a step-by-step derivation of the algorithm for matching with mis-
matches. This includes the motivation for each step as well as numerical examples. Addi-
tionally, example code for the different iterations of the algorithms is provided, and basic
tests are performed.
3. Implementation
Based on the prior analysis, an implementation of the algorithm using C++ is presented,
and possible use in a tool for delta compression is prepared. This chapter describes the
basic structure of the implementation and additional considerations like portability and
third party libraries.
4. Improvements
With regard to the analysis and the C++ implementation, specific improvements of the
algorithm are proposed and illustrated.
5. Conclusion
In this chapter, we provide an overview of what we have achieved and give an outlook
identifying possible further projects.
1.4 Conventions
All examples are written in Matlab [19] code. However, neither knowledge of Matlab nor a
license of Matlab is required to understand this thesis and test the examples. Basic knowledge
of C should be sufficient to be able to read the code, and GNU Octave [20] (with the additional
packages from Octave-Forge [21]) can be used to run the examples.
Nevertheless, it should be noted that array and vector indexing in Matlab code starts at 1 (i.e.
array{1} is the first element). This is opposed to C, where indexing starts at 0 (i.e. array[0]). In
spite of this, we still stick to the constraints of all values as described in [1], which are mainly
zero-based. So whenever there is an unexplained increment or decrement in the code, the reason
is almost certainly the difference in indexing.
Whenever possible, we use the symbols of [1] (on pages iii-iv). This means that the reader
may generally switch to and from the doctoral thesis without problems. One notable difference
though is the use of the vector indices i and j: In [1], the index i is first used as the index for the match count vector (V_i), but later i represents the "vector number" and j becomes the index (e.g. A_j^(i)). For the sake of clarity, we use j as vector index from the start. Additionally, when we are talking about "the vectors" A(i), B(i), or C(i), we actually mean A(i), B(i), respectively C(i) ∀i ∈ {1, . . . , k}, with k being described in the context.
Some concepts of probability theory are applied in this thesis without further explanation; for more information on this subject we refer to [8], especially chapter 3.
1.5 Definitions
A programmer usually regards executable code as binary data. A string, in contrast to that, is
seen as human-readable data, which is binary data with special semantics. This distinction is
generally applied because there are certain functions which do not work with all forms of binary
data. Yet in the context of this thesis it is not relevant. We therefore take the mathematical point
of view (see also [9] on pages 28-29):
Definition (Alphabet)
An alphabet Σ is a finite, non-empty set of symbols.

Definition (String)
A string over an alphabet Σ is a finite sequence of elements of Σ.
These definitions will be used throughout the thesis, so whenever we are talking about a
string, we do not impose any semantics on its elements. To give an example, an executable file
(analysed at byte level) is a string over the alphabet Σexe = {0, 1, . . . , 255}.
Definition (Ceiling)
⌈·⌉ : R → Z is the ceiling function which "rounds towards +∞", i.e. ⌈x⌉ returns the smallest n ∈ Z which is not less than x. The corresponding Matlab function is "ceil".
Definition (Interval)
[a, b) = {n ∈ N | a ≤ n < b} for a, b ∈ R. Note that this is not a common definition, because
we only allow positive integers as elements of the interval.
Chapter 2
Analysis
2.1 Approach
In his doctoral thesis [1], Colin Percival writes that the first chapter, in which he introduces the
new algorithm for matching with mismatches, "is not for the faint of heart"6. This is quite true: the mathematics behind the algorithm can seem daunting. We try to ease the pain a bit. However, even if we have applications in mind, a good understanding of the underlying algorithm is essential.
In this chapter we provide a description of the algorithm for matching with mismatches. Our approach is to numerically show how and why the algorithm works. We regard this as a reasonable addition to chapter 1 of [1], which mainly consists of lemmas, theorems, and proofs.
2.2 Motivation
2.2.1 Intuitive Substring Matching

Intuitively, this can be done as follows: We iterate through the large string and compare
small string with the substring at the current position. For each substring comparison, we count
the number of elements which match, and then process these match counts to find the “good”
matches.8 As an example, consider the strings9 S, T and their lengths n, m in figure 1.
In the language of mathematics, the match counts can be seen as a vector, and the calculation
is formally done as follows (see [1] on page 6):
V_j = ∑_{i=0}^{m−1} δ(S_{i+j}, T_i)   ∀ j ∈ {0, . . . , n − m}   (2.1)
8 “Good” matches are matches with a high number of matching characters, compared to the maximum possible
match count.
9 The longer string was taken from J. W. Goethe’s ”The Sorcerer’s Apprentice”.
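The corresponding code listing is not reproduced in this extract; the following is a minimal sketch of equation (2.1), assuming δ(x, y) = 1 if x = y and 0 otherwise, and using Matlab's one-based indexing:

function V = match_count(S, T)
% Minimal sketch of equation (2.1): for every zero-based position j,
% count the characters of T that match the substring of S starting at j.
n = length(S);
m = length(T);
V = zeros(1, n - m + 1);
for j = 0:(n - m)
  V(j + 1) = sum(S((1:m) + j) == T);
end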
Now that we are able to calculate the match counts, we need to process them to find good matches. In our example (table 1), the maximum possible match count is m = 5, which is the length of T, the smaller string. Assuming that we wish to find the positions with at least 50% of the maximum match count, we require no less than ⌈m/2⌉ = 3 matches for one substring. Thus, we extract only those positions j which satisfy the predicate V_j ≥ ⌈m/2⌉. This way, we get all the spikes in figure 2, while ignoring small match counts which are "kind of random". This filtering is easily done in Matlab code (see figure 4).
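Figure 4 itself is not reproduced in this extract; assuming a vector V as computed by the sketch above, the filtering amounts to something like:

% Keep only zero-based positions with at least ceil(m/2) matching characters.
good = find(V >= ceil(m/2)) - 1;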
2.2.2 Run Time Considerations

We have presented an intuitive algorithm for matching with mismatches, implemented it, and it works fine. However, is this algorithm also applicable to our problem of finding similar sections of two different files, given that these files contain much more data than our test input strings?
In the context of comparing two files, n could be the size of one file, and m could be the size of a "block" of the second file which we are trying to match. File sizes tend to be quite large nowadays, for example n_exe = 2,097,152 (2 MB), with a sample block size of m_exe = 2,048 (2 KB). The run time is O(n_exe · m_exe), which results in approximately 4.2950 · 10^9 steps, and that would be only for matching a single block. Assuming the second file also has a size of 2 MB, and we simply want to match all non-overlapping blocks, that would mean matching n_exe / m_exe = 1024 blocks, i.e. some O(n_exe · m_exe · n_exe / m_exe) = O(n_exe²), which is approximately 4.3980 · 10^12 steps. This is a quadratic time algorithm, (mostly) independent of the block size m. Even modern processors cannot compensate for this.
Based on this result, we can be sure that our intuitive algorithm is not quite fast enough (in
other words: it is too expensive), and that optimisation is a necessity. There is one thing in our
favour, though: We do not have the absolute requirement to always calculate the exact match
counts and find only the best matches. If we do not find them, the calculated difference between
the two files will be larger, which will result in larger patches, but we will still succeed. With
that in mind, one option to speed up this calculation is to estimate the match count vector V
using a randomised algorithm with a sufficiently high chance of success. This leads us to the
new algorithm of matching with mismatches as described in [1].
10 For a description of the O-Notation as used in this thesis, see [6] on pages 44-45.
2.3 Derivation of the Algorithm

2.3.1 Formal Model

In other words (with regard to finding differences of files), we create T from randomly chosen
characters to be a “block” of one file. We then create another file S from randomly chosen
characters, but at certain positions in S we copy the block T into the file S. This copy of T is not
an exact copy, but an “approximate” copy, and the probability p states how accurate the copy is.
For example if p = 0.9, 9 of 10 characters will be correctly copied on average. The construction
can be performed using a Matlab function as shown in figure 5.
Based on this model, which generates S and T using given positions of good matches, we
wish to invert the random construction and find X with a probability of at least 1 − ε. In this
context, ε is a non-zero parameter which can be set to achieve the desired “accuracy”, but
choosing ε will impose certain restrictions on other input values, as we will see later.
In addition to the problem of reconstructing X, we wish to identify the parts of the algorithm
which are independent of T . Following the nomenclature of [1], we call a pre-calculation of
these parts an “index” of the algorithm. As we need to match several blocks (different strings
T ) with the same target file (constant string S), such an index can speed up the processing.
Figure 5: Matlab function to construct strings S and T according to the formal model
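The listing of figure 5 is not reproduced in this extract. A minimal sketch under the model's assumptions (every copied character of T survives with probability p and is replaced by a uniformly random character otherwise) could look as follows:

function [S, T] = construct(n, m, p, Sigma, X)
% Sketch: random strings S and T over Sigma; an approximate copy of T
% is placed at each zero-based position in X.
S = Sigma(randi(length(Sigma), 1, n));
T = Sigma(randi(length(Sigma), 1, m));
for x = X
  S(x+1 : x+m) = T;                       % exact copy first (x is zero-based)
  bad = find(rand(1, m) >= p);            % characters which are not copied intact
  S(x + bad) = Sigma(randi(length(Sigma), 1, length(bad)));
end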
However, this chapter mainly deals with the reconstruction of X; an index is generated in the
C++ implementation in section 3.2.4.
2.3.2 Restrictions of the Model

Randomly good matches of T in S, i.e. good matches at positions not in X, are distracting in our model, because our aim is to compute X and only X. Let us assume we have at least one such good match, specifically at position i with i ∉ X, i.e. i ∈ X̄ = {0, . . . , n − m} − X. Hence, when estimating X, we cannot be sure that X contains the positions of maximum match counts, because V_i (our lucky match count) might actually be larger. Thus, we will have to guess the elements of X, which reduces the probability of finding the correct X to 0.5 or less. As an illustration:

Σ = {a, b, c, . . . , z}
X = {1, 12}
T = "abc"
S = "fabcklabchxvabcs"

Here T occurs in S at position 1 ∈ X, at position 6 ∉ X, and at position 12 ∈ X.
We therefore conclude: To effectively estimate X without "fifty-fifty guessing", all match counts at positions not in X must be smaller than those at positions in X, i.e.:

V_i < V_j   ∀ i ∈ X̄, ∀ j ∈ X   (2.3)
The probability q that a randomly good match happens, i.e. that the predicate (2.3) is not
true for a specific pair (i, j), depends on several parameters (see also [1] on page 8): The first
parameter is m, the size of string T . The smaller m is, the more likely we are to encounter
a randomly good match. The second parameter is p, because if p is near zero, the supposed
“good” matches in X will not be very good, and it will be more likely that a random match
occurs which is “better”. The third parameter is the size of the alphabet, namely |Σ|. The larger
the alphabet, the smaller is the probability of a randomly good match (if the other factors are
constant). Implicitly, and as the last parameter, the probability also depends on |X|. For example
if X = {0, m, 2m, ...hm} with h ∈ N, hm ≤ n − m it is not possible to locate any randomly good
matches which are not in X.11
We assume m to be the most important of these factors, because it is variable and depends
on the string T , while p and |Σ| will be constant in most applications of the model, and |X| is
considered to be very small, i.e. |X| mn . Hence, as an example, we numerically calculate q
for m = [1, 15] with constant p = 0.7, |Σ| = 2 and |X| = 1. Figure 7 shows the result.12
Our numerical example suggests that q(m), under the assumptions above, is an exponential function; the exact expression is given as equation (2.4) in [1] on page 8.
As apparent from figure 7, q becomes negligible for large values of m. So, basically, this
limitation of the model imposes some lower bound on m when reconstructing X.
However, if we leave the model for a moment and switch back to what we intend to do, that
is matching two files, this limitation turns out to be insignificant for the following reasons:
11 The factor |X| is not mentioned in [1], but it is included here to show that the calculation of q is quite complex.
Equation (2.4), also given in [1] on page 8, is based on certain assumptions and cannot be applied in general.
12 The script used to create this plot can be found in the appendix.
1. If we found random matches between two files that we did not “expect”, we would be
happy and would gladly accept them.
2. The block size m when matching two files should not be too small anyway, so that proper
compression can be achieved.
3. We will set p to be near 1 anyway, otherwise we would not be able to achieve proper
results.
Even though we have identified certain restrictions of the model, we can continue with our work, knowing that these limitations will not block the progress of solving our problem.

2.3.3 Optimisation using the FFT
13 Unfortunately, the references mentioned in [1] on using the FFT to calculate V do not cover the FFT at all,
and are therefore not very useful in this context.
The key idea of this optimisation is to calculate the cyclic correlation14 of the two strings S and T, when treating them as discrete signals with certain properties.
Assuming φ : Σ → R is a function which converts a character to a signal value,15 the cyclic
correlation C is calculated as follows:16
∀ j ∈ {0, . . . , n − 1}:

A_j = φ(S_j)   (2.5)

B_j = φ(T_j) if j < m,  B_j = 0 otherwise   (2.6)

C_j = ∑_{r=0}^{n−1} A_{(r+j) mod n} · B_r   (2.7)
In this calculation, the string S is converted to the signal vector A, and T is converted to B
with zero padding, such that A and B have the same size. In order to retrieve proper results, we
have to define φ in a way that it does not “weight” characters differently. As counter-example,
defining φ to map each element of Σ to a unique numerical representation
given Σ = {x_1, x_2, . . . , x_{|Σ|}} = ⋃_{j=1}^{|Σ|} {x_j}:

φ(x_j) = j   (2.8)
will not produce proper results, because matching x1 and x1 (which are equal) when calculating
the cyclic correlation will yield 1 · 1 = 1,17 while matching x1 and x2 (which are non-equal)
will produce 1 · 2 = 2. In other words, certain mismatches will count much more than certain
matches, and this is not the result we wish to have.
Instead, we define φ to randomly map half of Σ to 1 and the other half to −1 (similar to [1]
on page 12):
Choose Σ′ ⊂ Σ with |Σ′| = |Σ|/2 uniformly at random.

φ(x) = (−1)^{|Σ′ ∩ {x}|}   (2.9)
Note that in case |Σ| > 2 this is a lossy conversion of the characters to signal values, since
it maps Σ to {−1, 1}. However, in contrast to equation (2.8), mismatches will never produce
larger values than matches: Matching x_1 and x_1 will produce 1, matching x_1 and x_2 will produce either 1 or −1, depending on how Σ′ was chosen.
Figure 8 shows the calculation of the cyclic correlation in Matlab code using φ as defined
in equation (2.9).
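The listing of figure 8 is not reproduced in this extract; a minimal sketch of the direct O(n²) calculation, assuming Σ is a contiguous range starting at Sigma(1) (as in our examples), might look as follows:

function C = match_cyclic_correl(S, T, Sigma)
% Sketch: map characters to +/-1 via a random phi, equation (2.9), then
% compute the cyclic correlation of equation (2.7) directly.
n = length(S);
phi = ones(1, length(Sigma));
phi(randperm(length(Sigma), floor(length(Sigma) / 2))) = -1;
A = phi(double(S) - double(Sigma(1)) + 1);
B = [phi(double(T) - double(Sigma(1)) + 1), zeros(1, n - length(T))];
C = zeros(1, n);
for j = 0:(n - 1)
  C(j + 1) = sum(A(mod((0:n-1) + j, n) + 1) .* B);
end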
Due to the nested loop in equation (2.7),18 the calculation of C requires O(n2 ) time. The
actual improvement comes from the fact that the cyclic correlation can be computed using the
FFT (see [12] on pages 545-546) in O(n log2 (n)) time (according to [10]). The corresponding
Matlab code is shown in figure 9. In this context, fft and ifft are the Fast Fourier Transform
and the inverse FFT, respectively, and conj is the complex conjugate.
Figure 9: Matlab code to calculate the cyclic correlation using the FFT
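The body of figure 9 is likewise not reproduced; assuming signal vectors A and B as defined in equations (2.5) and (2.6), the essential step is a one-liner:

% Cyclic correlation via the correlation theorem: O(n log n) instead of O(n^2).
C = real(ifft(fft(A) .* conj(fft(B))));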
The resulting vector C does not necessarily contain match counts. Mismatches decrease values in C; e.g., for m = 15, a match count of 10 at position j results in C_j = 10 − 5 = 5. This means that processing C as we processed V in figure 4, by filtering values ≥ ⌈m/2⌉, might lead to different results. This is specifically true for |Σ| > 2, because in that case the string-to-signal conversion is lossy. Thus, there is more "background noise" than in the match count vector V.
Performing the matching of the example strings (see figure 1) using the cyclic correlation leads
to a result19 as plotted in figure 10. Negative values in this plot are clear mismatches (because
they are the result of accumulated −1 · 1 multiplications, see equation (2.7)). Positive values are
likely but not guaranteed to be matches, depending on how suitable the randomly generated φ
is for our example strings.
Figure 10: Plot of the cyclic correlation C calculated from example strings S and T
To find the t best matches using the result of the cyclic correlation, we identify the t positions
for which C takes the largest values. In our example, the positions 7 and 19 both indicate
full matches, although in fact position 7 only matches 4 of 5 characters (see table 1). This
emphasizes the fact that our new method does not always lead to correct results. When choosing
φ unluckily, we might even find full matches at positions where none are present.
However, the method is still very useful if certain conditions are met. Let us now apply
our model and create S and T according to it, which means that their characters are uniformly
distributed20 . If m is large enough to make up for the (possibly) lossy conversion done by φ ,21
the spikes within C will (with high probability) be the good matches we wish to find, since T
matches within S “well or not at all” ([1] page 6).
Our initial problem remains, though: The run time is dominated by the O(n log2 (n)) time
required for the FFT, which does not scale well for our purpose.22 In addition to that, memory
usage is O(3n),23 which can be too much for use with large files.24 The next step is therefore to
“shorten” the lengths of the vectors A and B before calculating the cyclic correlation, while still
retaining the necessary information.
2.3.4 Projecting onto Subspaces

A Simplified Approach
We now introduce such a projection (based on [1], page 9), starting with the special case |X| = 1
and |Σ| = 2, which basically means that we are only looking for the best match of T in S,
whereby the conversions of strings to signals are lossless. For this case, we construct an example
to show how the projection is performed and to explain the mathematical background.
n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);
Figure 11: Matlab calls to construct data according to the model (|X| = 1)
Figure 11 lists the Matlab calls used to create example data according to the model. Note that we choose |X| ≪ n/m and select m not to be too small, to make sure that the restrictions of the model (section 2.3.2) do not apply. Figure 12 shows the resulting vector C (following equation (2.7)). There is significant noise and a spike at position j = 9000, which we expected because X = {9000}.
Now we need to reduce the data size and still get the position of the maximum j = 9000
as result. Assuming we can somehow reduce the data size modulo a prime number, we could
extract the position modulo this prime. Performing this several times with different primes will
give us the position modulo multiple primes. To calculate the actual result we can make use
of the Chinese Remainder Theorem, which states “that it is possible to reconstruct integers in
a certain range from their residues modulo a set of coprime moduli” ([7], page 194). This is
possible if the integer x we wish to reconstruct follows the predicate 0 ≤ x < M where M =
p1 p2 . . . pk is the product of the coprime integers (with k being the number of primes).
For example, we can choose the primes p_1 = 5003, p_2 = 6007 (being about n/2 with enough difference; this choice is for simplicity)25, and perform the character-to-signal conversions accumulated modulo each of the primes (based on algorithm 1.1 in [1] on page 12):

25 The reconstruction is clearly possible because 0 ≤ n = 10000 < p_1 · p_2 = 30053021. We are not required to choose primes, coprime values are sufficient, but we prepare for later changes to the algorithm.
∀i ∈ {1, . . . , k}, ∀ j ∈ {0, . . . , p_i − 1}:

A_j^(i) = ∑_{λ=0}^{⌈(n−j)/p_i⌉−1} φ(S_{j+λ·p_i})   (2.10)

B_j^(i) = ∑_{λ=0}^{⌈(m−j)/p_i⌉−1} φ(T_{j+λ·p_i})   (2.11)

C_j^(i) = ∑_{r=0}^{p_i−1} A_{(r+j) mod p_i}^(i) · B_r^(i)   (2.12)
This means that we shorten the original vector A_j = φ(S_j) by adding up "roughly the second half of the vector to the first half" (with a different boundary for each prime). Equation (2.10) specifies a projection Σ^n → R^{p_i}, ∀i ∈ {1, . . . , k}. Given p_i < n, this is a lossy (irreversible) projection, but it maintains certain characteristics. In our example, instead of one large vector A, we now have two vectors: A(1) with size 5003 and A(2) with size 6007. Vector B is less affected; only some zeros are cut from the end, since in our case m < p_1 < p_2.26
26 Basically, we could use the same definition for B(i) as in equation (2.6), but the new definition covers the general case and is therefore preferable.
Figure 13: Matlab function to project data before calculating the cyclic correlation
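As the listing of figure 13 is not reproduced in this extract, here is a minimal sketch of the folding for a single prime, following equations (2.10) and (2.11); the function name is ours, and a and b are assumed to be the full signal vectors φ(S) and φ(T):

function [A, B] = project_mod_prime(a, b, p)
% Sketch: fold the signal vectors modulo the prime p by accumulating
% the values at positions j, j+p, j+2p, ... into position j.
A = zeros(1, p);
B = zeros(1, p);
for j = 0:(length(a) - 1)
  A(mod(j, p) + 1) = A(mod(j, p) + 1) + a(j + 1);
end
for j = 0:(length(b) - 1)
  B(mod(j, p) + 1) = B(mod(j, p) + 1) + b(j + 1);
end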
The cyclic correlation is calculated for each of these smaller vectors (see equation (2.12)). Figure 13 shows a Matlab function implementing this "projection onto subspaces"; figures 14 and 15 show the resulting vectors C(1) and C(2) for our example. Due to adding up the values before calculating the correlation, the level of noise has increased compared to figure 12, but the maximum value still rises clearly above the noise.
Figure 14: Plot of the cyclic correlation C(1) of projected model data (|X| = 1)
Figure 15: Plot of the cyclic correlation C(2) of projected model data (|X| = 1)
The position of the maximum value in each of these vectors C(i) is (with high probability) the position of the maximum of the original vector C, reduced modulo the corresponding prime p_i. Using the residues modulo these primes, we can reconstruct the position of the maximum correlation. In our example the
residues are 9000 mod 5003 = 3997 (see figure 14) and 9000 mod 6007 = 2993 (see figure 15).
The reconstruction following the Chinese Remainder Theorem is done as follows (see also [7]
page 194f):
∀i ∈ {1, . . . , k}:

M = ∏_{i=1}^{k} p_i :   M = 5003 · 6007 = 30053021   (2.14)

M_i = M / p_i :   M_1 = 30053021 / 5003 = 6007,   M_2 = 30053021 / 6007 = 5003   (2.15)

N_i · M_i ≡ 1 (mod p_i) :   N_1 = −294,   N_2 = 353   (2.16)

x ≡ a_1 N_1 M_1 + · · · + a_k N_k M_k (mod M) :   (2.17)
x ≡ −3997 · 294 · 6007 + 2993 · 353 · 5003 (mod 30053021)
  ≡ −1773119239 (mod 30053021)
  ≡ 9000 (mod 30053021)
While most of the steps are fairly straightforward, the solution of equation (2.16) requires
some work. Rearranging it leads to a form which can be solved more easily:
N_i M_i ≡ 1 (mod p_i)
⇔ N_i M_i = 1 − r · p_i,   r ∈ Z
⇔ N_i M_i + r · p_i = 1   (2.18)
Since gcd(Mi , pi ) = 1 (by definition of Mi and pi ),27 we can apply the extended Euclidean
algorithm (see [6] on pages 859-860) to calculate Ni .
27 gcd: greatest common divisor
Figure 16 shows a Matlab function which performs the reconstruction of an integer accord-
ing to the Chinese Remainder Theorem.
Figure 16: Chinese Remainder Theorem: Matlab function which calculates the solution
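As a stand-in for the listing of figure 16 (the function name is ours), a minimal sketch could be:

function x = crt_solve(residues, moduli)
% Sketch of the reconstruction in equations (2.14)-(2.17); the moduli must
% be pairwise coprime. Doubles suffice for the small values used here.
M = prod(moduli);
x = 0;
for i = 1:length(moduli)
  Mi = M / moduli(i);
  [g, Ni, r] = gcd(Mi, moduli(i));   % extended Euclid: g = Ni*Mi + r*p_i = 1
  x = x + residues(i) * Ni * Mi;
end
x = mod(x, M);

For our example, crt_solve([3997, 2993], [5003, 6007]) returns 9000.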
Using this algorithm, the position x of the maximum correlation can be uniquely calculated
as long as n < M. Instead of one large cyclic correlation, we now have multiple smaller cyclic
correlations to calculate, which is an improvement in terms of speed, because the time spent
calculating the FFT grows ”faster than linear” with n. There are drawbacks which need to be
kept in mind, however:
1. The level of noise increases, because we add up values in vectors A(i) and (possibly) B(i) .
We need to make sure that, at least with high probability, the noise does not increase too
much.
2. Finding the position of the maximum correlation is more complex now, but if the time
gain is sufficiently large, it is worth the effort.
A Generic Approach
In our introduction of the projection, we only handled the special case that |X| = 1 and |Σ| = 2.
Now we extend the previous approach to present a generic solution. For a start, we stick with
|Σ| = 2, but consider varying |X|.
The case |X| = 0 does not need to be handled: it means zero solutions need to be found, and we are finished instantly. What needs to be covered is the general case |X| ≥ 1.
If we apply the same procedure as above, we stumble upon the fact that we cannot simply
use the Chinese Remainder Theorem for the general case. Actually, we are able to reconstruct
integers from their residues modulo some primes, but if we consider multiple solutions, we
have multiple residues for each prime and do not know which integer they belong to.
n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);
Figure 17: Matlab calls to construct data according to the model (|X| = 3)
As an example, consider model data with |X| = 3, generated using the Matlab calls in figure
17. The cyclic correlation C (see equation (2.7)) of this data has three spikes at positions which
are elements of X (see figure 18). We need to reconstruct these three positions from their
residues modulo the primes p1 and p2 as above.
Each of the cyclic correlations C(i) of projected data (calculated as in equation (2.12)) also
has three spikes (see figures 19 and 20). Thus, when extracting the positions of the three largest
values from each of the vectors C(i) , we have three residues modulo each prime. However,
unfortunately it is not apparent which of the residues modulo one prime ”belongs to” a specific
residue modulo another prime to reconstruct one of the results.
Intuitively, one might consider simply calculating the result x using the Chinese Remainder Theorem for all combinations of residues and checking each time whether it is a valid value (i.e. whether x ≤ n − m according to the model). This actually works fairly well, given M ≫ n − m, such that it is possible to drop invalid combinations.
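For k = 2 primes, this brute-force combination step might look like the following sketch (using the crt_solve sketch from above; res1 and res2 are assumed to hold the extracted spike positions modulo p_1 and p_2, with n and m as in the example of figure 17):

% Try every residue pair and keep only results in the valid range 0..n-m.
Xest = [];
for a1 = res1
  for a2 = res2
    x = crt_solve([a1, a2], [5003, 6007]);
    if x <= n - m
      Xest(end + 1) = x;   % valid combination found
    end
  end
end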
Figure 19: Plot of the cyclic correlation C(1) of projected model data (|X| = 3)
Figure 20: Plot of the cyclic correlation C(2) of projected model data (|X| = 3)
The residues and corresponding solutions for our example are shown in table 2 (with a_1 being the position of a spike in C(1), and a_2 the position in C(2)). The valid results, i.e. those which are ≤ n − m = 9744, are marked with (*). As shown in the table, we have successfully reconstructed the elements of X; all other combinations of residues lead to values well out of range.
Table 2: Residues and solutions for p_1 = 5003, p_2 = 6007, X = {7700, 8050, 9000}

a_1    a_2    x (mod 30053021)
2697   1693   7700 (*)
2697   2043   17067930
2697   2993   11854804
3047   1693   13000841
3047   2043   8050 (*)
3047   2993   24847945
3997   1693   18214917
3997   2043   5222126
3997   2993   9000 (*)
Theorem 1.1 in [1] on page 10 establishes a lower bound on the probability that this reconstruction will lead only to the correct results.28 The problem is formulated slightly differently in this theorem. It starts with a set of candidates for the solution, namely {0, . . . , n − 1} (assuming m = 1, the minimum reasonable value). For each prime, only those values of this set which are elements of one of the residue classes (of the actual results modulo the prime) are accepted (see [1] on page 10):
X̄ = {0, . . . , n − 1} ∩ (X + p1 Z) ∩ · · · ∩ (X + pk Z) (2.19)
The ”filtering” by intersecting for each prime is basically the same as trying all combinations
and removing invalid results. The set X̄ will always contain the correct results (because all
combinations of residues are considered), but it might also contain additional values. A lower
probability bound on the condition X = X̄ according to [1] is
1 − n · (t log(n) log(L) / L)^k   (2.20)
with L ∈ R, L ≥ 5 specifying the interval [L, L(1+2/ log(L))) from which the primes p1 , . . . , pk
are randomly selected; and t = |X| according to our model definition. This probability bound is
used by Colin Percival for further proofs, which is why theorem 1.1 is the very foundation of
the algorithm proposed in [1]. Still, what is missing in [1] is a critical analysis of this theorem.
We provide this, together with suggestions for improvement, in section 4.1.
For now we choose input values such that the lower probability bound in equation (2.20) is
near 1. Hence, we can expect that the set X is properly reconstructed most of the time. To be able
to choose input values accordingly, we need to select primes from the specified interval. Figure
28 This theorem depends on p_i being prime and not just coprime ∀i ∈ {1, . . . , k}, which is why we used prime numbers from the start.
Figure 21 shows a Matlab function which performs this task. Please note that this function actually creates a random permutation of primes instead of selecting them "uniformly at random" (as described in [1]), which makes it behave slightly differently. This is explained in section 4.1.1.
Figure 21: Matlab function to randomly select primes for the reconstruction of X
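The listing of figure 21 is not reproduced in this extract; a minimal sketch (the function name is ours; note the random permutation instead of uniform drawing) could be:

function P = select_primes(L, k)
% Sketch: take all primes in [L, L*(1 + 2/log(L))) and return the first k
% elements of a random permutation of them.
upper = L * (1 + 2 / log(L));
cand = primes(ceil(upper) - 1);   % primes() returns all primes up to its argument
cand = cand(cand >= L);
idx = randperm(length(cand));
P = cand(idx(1:k));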
Now we have provided a solution which covers the general case |X| ≥ 1. The solution has
the drawback of restricting some of the input values, an issue which we will revisit in section
2.4.2.
To provide a fully generic approach, we still need to cover the case |Σ| ≥ 2. However, this is only a small problem: We recall the fact that |Σ| > 2 will cause φ(x) to be a lossy conversion from characters to signal values. This means that when comparing the signal values of different characters, they might turn out to be equal. However, as we discussed in section 2.3.2, the larger |Σ| is, the less likely randomly good matches of T in S are (if all other factors are constant). These two effects compensate for each other: While the lossy character-to-signal conversion increases the level of the noise in which to find good matches, the reduced probability of random matches decreases the noise. Figure 22 shows vector C constructed as in figure 17 but with Σ = {0, . . . , 65535}. Compared to figure 18 we do not observe any difference except for the influence of the randomly constructed input values. Actually, there is a slight difference in detail: Since φ is a "randomly created" function, there is a certain chance that we create an "unlucky" φ for our specific input values. This issue will be handled in the next section.
Comparing C to the match count vector V (see equation (2.1) on page 7) with varying |Σ|
reveals another issue: For larger |Σ|, the level of noise in V is reduced, while the level of noise
in C remains constant. This should be kept in mind for applications where C is meant to be used
as direct replacement for V .
Figure 22: Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)
2.3.5 Optimising Random Behaviour

As noted above, since φ is randomly created, it may turn out to be "unlucky" for specific input strings. Figure 23 illustrates such a case:

Σ = {a, b, c, d}
X = {1}
T = "abbd"
S = "dabbdcbabccaddc"

φ(x) = 1 if x = a or x = b, −1 otherwise
⇒ B = (1, 1, 1, −1, 0, 0, . . . )

Here T occurs at position 1 ∈ X, while the substring "babc" at position 6 is not equal to T. Applying φ to "babc" nevertheless yields the signal values (1, 1, 1, −1), which are identical to B, so the cyclic correlation indicates a full match at position 6 although only one character matches T.
For specific input strings this problem can often be solved by choosing a different ("lucky") φ. In figure 24, φ is redefined, which leads to the correct result:

φ(x) = 1 if x = a or x = c, −1 otherwise
However, there is no general solution which works for all input strings as long as φ is a lossy conversion. Given φ, it is always possible to maliciously construct input strings such that a good match of T in S is found where there is none. One tempting way to reduce the probability of finding "false matches" is to analyse the input strings and define φ purposefully instead of at random. This possibility is discussed later in section 4.2.2.
Nevertheless, there is something else we can do: In the equations (2.10) and (2.11), the
same φ is used to calculate the vectors A(i) and B(i) for all i ∈ {1, . . . , k}. This means that if a
false match occurs, it will occur in all vectors C(i) , at positions modulo the respective primes.
If we use a different φi for each i to calculate the vectors A(i) and B(i) , false matches are still
possible, but most likely on different positions for each prime, and therefore likely to be filtered
out as invalid results (as in table 2). Extending our previous equations (2.9), (2.10) and (2.11)
we now have (see also [1] on page 12):
∀i ∈ {1, . . . , k}:

Choose Σ_i ⊂ Σ with |Σ_i| = |Σ|/2 uniformly at random.

φ_i(x) = (−1)^{|Σ_i ∩ {x}|}   (2.21)

A_j^(i) = ∑_{λ=0}^{⌈(n−j)/p_i⌉−1} φ_i(S_{j+λ·p_i})   (2.22)

B_j^(i) = ∑_{λ=0}^{⌈(m−j)/p_i⌉−1} φ_i(T_{j+λ·p_i})   (2.23)
For further usage, figure 25 shows a Matlab function performing the creation of φi and the
projection accordingly.
2.4 The Algorithm

In the following sections, we present different variants of the algorithm to estimate X according to the model.

2.4.1 Reference Version

Figure 26 shows an algorithm to be used as the base for later comparisons and tests. This algorithm is not specified in [1]; we have created it by strongly simplifying the first algorithm presented there. It implements only the optimisation using the FFT (see section 2.3.3) and does not reduce the data size. To be able to compare the run time of the different algorithms, calls to the Matlab functions tic and toc are added at the beginning and the end of the function, respectively. tic starts the timer, and when calling toc the elapsed time is displayed.

The Matlab implementation of this algorithm uses a few functions which have not yet been introduced:
Figure 27: Matlab function to check input values according to the model
29 The returned positions are Matlab indices, which is why an additional decrement is performed to get zero-based positions.
30 Actually, our implementation of the function has a run time of O(tL), assuming that the length of C is L. Using an algorithm based on a priority queue (see e.g. [6] on page 194), a run time of O(L) can be achieved, but this would add a lot of complexity.
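Based on footnote 29, a sketch of the helper that extracts the zero-based positions of the t largest values of C could be (the name is ours; a full sort is simpler than, though not as fast as, the approaches of footnote 30):

function pos = largest_positions(C, t)
% Sketch: sort descending, take the first t indices, convert to zero-based.
[vals, idx] = sort(C, 'descend');
pos = idx(1:t) - 1;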
Description
After checking its input values according to the model restrictions, this algorithm basically
performs all the steps described in section 2.3.3 on page 13. It converts the strings S and T to
discrete signals A and B through use of the φ function and computes the cyclic correlation C of
these signals. The set X is then estimated by extracting the positions of the t largest values of
the cyclic correlation.
Example
Figure 28 shows an example application of the algorithm. Note that the resulting elements in
Xest might be in a different order than in the supplied X. This, however, is assumed not to be
a problem.31
n = 102400;
m = 512;
p = 0.9;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_simple(S, T, Sigma, length(X))
Figure 28: Matlab code which demonstrates the usage of our reference algorithm
Review
This algorithm has one great benefit: Its run time relies mostly on the FFT; other than that only
simple processing is done. Matlab uses the FFTW library [22] for fft/ifft-calls, which has
been heavily optimised and is very fast even for large input sizes. There are several drawbacks,
however: Even FFTW does not calculate the FFT faster than O(n log2 (n)), and memory usage
is very high. Also, false matches according to section 2.3.5 on page 28 cannot be filtered.
2.4.2 The First Variant

Two issues remain open before we can state the first variant of the algorithm; both are addressed below.

1. We need to deduce restrictions of input values from the probability bound in equation (2.20). These restrictions should be based on ε, where 1 − ε is the probability of correctly reconstructing X (according to our model). If ε is near 0, a high chance to succeed will be guaranteed, and thus the limits on the input values are more strict. If ε is near 1, the restrictions on the input values will be more graceful.
2. In section 2.3.3 we have emphasized the fact that vector C does not generally contain
match counts, and therefore we cannot easily process C to extract all positions with at
least 50% matches. This is even more true of the vectors C(i) , because some values
have been added up. Extracting only the t largest values, as we proposed for C, has the
drawback that one of these largest values might be a “false match” according to section
2.3.5, with the effect that we possibly miss one of the t results. Consequently, we still
need to specify how to extract spikes from vectors C(i) .
These issues are resolved in [1] as follows:

1. The input value constraints are specified as follows (see [1] on page 11):

16 log(4n/ε) / p² < m < min( √(32nε) / (t log n),  8(√n + 1) log(4n/ε) / p² )   (2.24)
p2 t log n p2
The number of primes k and the minimum size of the primes L are calculated from input
values and therefore indirectly restricted (see [1] on page 12):
log(2n/ε)
k= (2.25)
log(8n) − log(mt log(n))
8n log(2kn/ε)
L= (2.26)
mp2 − 8 log(2kn/ε)
According to [1] on page 28 the “restrictions placed upon the input parameters, and the
values assigned to L, have naturally erred on the side of caution”. This means that they
have been chosen such that the proofs concerning the algorithm are “successful”, but
apparently some of them are also the results of trial and error to prevent certain border
cases. For now we accept these constraints, but later we will analyse how restrictive they
are in applications.
2. For each i ∈ {1, . . . , k}, we will process the vector C(i) by finding all positions j with C_j^(i) > m·p/2 (according to [1] on page 12). Please note, however, that this does not clearly define the number of character matches we require; it is simply a bound that tends to work reasonably well with our model.
With these supplements, we finally present a Matlab function implementing “Algorithm 1.1” of
[1] (on pages 11-12). It is shown in figure 29. All previous results are used in this implementa-
tion, and additionally one new helper function is introduced:
cartesian_prod Given a cell array32 and the number of dimensions, this function calculates
the Cartesian product and returns it as cell array of vectors. The implemen-
tation is beyond the scope of this thesis and is provided in the appendix. On
a side note, the function has been optimised for two dimensions because
this is the most frequent case.
32 A cell array is a Matlab array with dynamic size and content.
Description
The algorithm starts by checking its input values, first against the limitations of the model and
second against the specific constraints in equation (2.24). It then initialises k and L according
to equations (2.25) and (2.26), and also sets a helper variable k̂ (khat) whose sole use is speed
optimisation. It selects k primes as shown in figure 21 on page 27 and then projects the input
strings onto subspaces according to section 2.3.5, figure 25 on page 30. Afterwards, the cyclic
correlations C(i) of vectors A(i) and B(i) are calculated using the FFT as in section 2.3.3, figure 9
on page 16. Positions in C(i) with values larger than m·p/2 are considered good matches and thus
extracted, and the Chinese Remainder Theorem is applied combining the extracted positions
modulo one prime with the positions modulo each other prime (similar to table 2 on page 26).33
Results which are within the valid range are accepted, building the estimated set X.34
Example
Figure 30 shows an example application of this algorithm similar to the “reference example” in
figure 28. It turns out that the input restrictions in equation (2.24) are not met by these input
values, although algorithm 1.1 does produce the correct solution with high probability (as can
be verified numerically). Thus the last parameter “pedantic” is set to 0 for this example in
order to disable the checking of the input value constraints on m.35 Figure 31 shows a different
example where the conditions on m are met, and pedantic is set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 0)
Figure 30: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)
n = 1000000;
m = 325;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 1)
Figure 31: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)
Review
The first thing we note is that ε is just a probability bound, and one can achieve correct results
with high numerical probability even if ε is near one. Similarly, the input constraints (equation
33 The exact implementation is slightly different, because only k̂ primes are used for the reconstruction, and the
result is checked against the remaining primes. However, this is only an optimisation; the effect is the same.
34 Following the model, we only accept values x ≤ n − m (instead of x < n as in [1]).
35 This should be done with caution because L might turn out to be negative, especially for small m.
(2.24)) are not irrevocable — we have to comply with them if we wish to use ε as proven
probability bound,36 but often we also achieve good results when ignoring them.
If we provide input values in the context of executable file comparisons, it will be quite hard
to stick to the input constraints. To achieve a good encoding of the file differences, we need
the correct result of the matching with high probability, so we choose ε = 0.1. We are mainly
interested in matches with only a few mismatches, and for that reason set p = 0.9. Further, we
assume that we wish to find the two best matches (t = 2), to allow for some flexibility. Given
these basic input parameters, n needs to be very large for us to be possible to meet the input
conditions. Figure 32 shows the upper and lower limits of m with these input values for varying
n. We observe that n needs to be roughly 8 · 107 simply to be able to select a valid m, and even
then we are restricted to m ≈ 430. Using a block size m of several kilobytes is only possible for
huge values of n, way beyond the usual size of executable files. This also means that the run
time of the algorithm given in [1] (on page 13) is not valid for file comparisons,37 because it
relies (by definition) on
m ≫ 16 log(4n/ε) / p²   (2.27)
i.e. m is required to be a lot larger than its lower limit.
There is even more to say about the input constraints: For our second Matlab example
(see figure 31) we selected the relatively small (given the restrictions) n = 10^6 and chose m = 325 to be near its lower limit. In this example the input leads to L ≈ 8.9686 · 10^5, and while the condition L < n is true as observed in [1] (page 13), the interval [L, L(1 + 2/ log(L))) ≈
36 Note that ε only guarantees a certain probability of success given random input according to our model.
37 at least not for today’s usual file sizes
[8.9686 · 10^5, 1.0277 · 10^6) from which the primes are randomly chosen actually exceeds n.
Therefore, it can happen that primes larger than n are selected, which is inefficient in terms of
time and memory, and can even lead to incorrect results, because A(i) is zero-padded for the
corresponding i. Even if false results are eventually caught, this is still a situation which is
undesirable. Given the calculation of L as it is, the limits on m are probably not strict enough.
Thinking in terms of L instead of considering the interval [L, L(1 + 2/ log(L))) also seems
to be a problem in the proofs of [1]. It specifically makes the proof of the algorithm’s time
bound appear questionable: The size of the cyclic correlation is assumed to be L (see [1] on
page 13), but in fact it can be larger, and therefore the size is not L but worst case L(1 +
2/ log(L)). Possibly this difference can be ignored for “huge” values of n, but it is nevertheless
an assumption which should at least have been justified if used in a proof. Actually, the run
time equation has the precondition n ≫ 1 (see [1] on page 13), implying asymptotic behaviour,
but this underlines the fact that the time bound cannot be applied for our application. We have
to expect slower run time, so further improvement of the algorithm is necessary.
Last but not least, given how the set X is estimated in this algorithm (by extracting many results and filtering those out of range), we have no guarantee of retrieving t results. We can get any number of results. For some applications this might be a desired effect, but when matching files we usually wish to find the t best matches.
2.4.3 The Second Variant

n = 50000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000]
Primes = [5003, 6007];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl_project(S, T, Sigma, Primes);
Figure 33: Matlab calls to construct data using primes around n/10
Figures 34 and 35 show the resulting cyclic correlations. As expected, there is a consid-
erably higher level of random noise compared to the previous example (figures 19 and 20).
For C(1) , we expect spikes at positions 2697, 3047, 3997. Those actually are present, but we
also observe random spikes, e.g. roughly at positions 1500, 3100. The expected spikes of C(2)
at positions 1693, 2043, 2993 are also present, but there are additional spikes, e.g. roughly at
positions 100, 2800. However, it is very unlikely that these spikes occur in all projections at
positions which reconstruct a valid result. Therefore, these spikes are “filtered” during recon-
struction, because they lead to results > n − m (with high probability). This is basically the
same as in section 2.3.5, where “false matches” are removed.
Figure 34: Plot of the cyclic correlation C(1) of data using primes around n/10
Theorem 1.3 in [1] on page 15 addresses this issue from a mathematical point of view. It
extends theorem 1.1 of [1] by considering additional random elements for each of the intersec-
tions. Assuming that these elements are selected randomly in sets Y (i) ⊂ {0, . . . , n − 1} − X for
all i ∈ {1, . . . , k},38 this looks as follows:39
X̄ = {0, . . . , n − 1} ∩ ((X ∪ Y^(1)) + p_1·Z) ∩ · · · ∩ ((X ∪ Y^(k)) + p_k·Z)   (2.28)
38 This definition has been simplified, for more information see [1].
39 This has been slightly corrected from [1].
Figure 35: Plot of the cyclic correlation C(2) of data using primes around n/10
With β denoting (roughly) the probability of an additional random element falling into each of the k sets,40 the lower probability bound corresponding to equation (2.20) becomes

1 − n · (β + t log(n) log(L) / L)^k   (2.29)
In principle, we can use the same procedure as in the first variant of the algorithm but with
shorter primes, and have a slightly smaller probability of succeeding. To stay within proven
bounds, however, we revisit the input value restrictions and the processing of C(i) from the
previous section:
1. The input value constraints should now be deduced from the new probability bound in
equation (2.29). Unfortunately, this is not trivial, because it involves establishing propo-
sitions about β . Theorem 1.4 in [1] on page 17 performs this task, basically giving guid-
ance on how "high" the spikes of the valid results still need to rise above the noise. The
resulting restriction according to [1] on page 19 is (again chosen such that the proof is
successful):

m < min( (n²·ε/2)^(1/3) / (t·(log(n))²),  √(n·ε/2) / (8p²) )   (2.30)
40 Cited from [1] on page 15. The definition of β given there as “the probability [...] of y falling into each of the
k sets” is very fuzzy, since y is undefined.
The number of primes k, the probability β and the minimum size of the primes L are
derived values and thus indirectly restricted (see [1] on page 20):41
k = log(2n/ε) / (log(n) − log(m·t·log(n)²))   (2.31)

β = (1/2) · (ε/(2n))^(1/k)   (2.32)

y = ( √(−log(β)) + √(log(4kt/ε)) )²   (2.33)

L = 2n·y / (m·p² − 2y)   (2.34)
We will analyse later how restrictive the limit on m actually is for applications.
2. To solve the problem of algorithm 1.1 that we are not guaranteed to get t results, we can start by extracting the t largest values in vectors C(i) for each i ∈ {1, . . . , k} and perform the reconstruction. However, as mentioned before, if for any reason we have a random spike rising above one of the results, we will miss some results and might end up with less than t values (because invalid results are removed). Given that the number of "additional spikes" in C(i) according to our new approach is expected to be β·p_i for each i ∈ {1, . . . , k} (see [1] on page 15), we can assume that, in the worst case, all of the additional spikes rise above our actual results. Therefore, we are on the safe side if we extract the β·p_i + t largest values.
Now that these issues are solved, we present a Matlab function implementing “Algorithm 1.2”
of [1] (on pages 19-20). It is shown in figure 36.
Description
At first, this algorithm checks its input values against the limitations of the model as well as
against the constraints given in equation (2.30). Next, it initialises k, β and L according to
equations (2.31), (2.32) and (2.34). It selects k primes and then projects the input strings onto
subspaces. In the next step, the cyclic correlations C(i) of vectors A(i) and B(i) are calculated
using the FFT. The positions of the β pi +t largest values in C(i) are considered as candidates for
good matches and thus extracted, and the Chinese Remainder Theorem is applied combining
the extracted positions modulo one prime with the positions modulo each other prime (similar
to table 2 on page 26). Those results that are within the valid range are accepted, and form the
estimated set X.
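The combination step can be illustrated as follows. This C++ sketch (our own illustration; the Matlab code instead enumerates all combinations) solves the two-congruence case of the Chinese Remainder Theorem for a pair of extracted positions:

#include <cstdint>

// Extended Euclidean algorithm: returns gcd(a, b) and coefficients
// x, y such that a*x + b*y = gcd(a, b).
static int64_t egcd(int64_t a, int64_t b, int64_t& x, int64_t& y) {
    if (b == 0) { x = 1; y = 0; return a; }
    int64_t x1, y1;
    int64_t g = egcd(b, a % b, x1, y1);
    x = y1;
    y = x1 - (a / b) * y1;
    return g;
}

// Find the unique z modulo p1*p2 with z = a1 (mod p1) and z = a2 (mod p2),
// for distinct primes p1 and p2 (of moderate size, to avoid overflow).
int64_t crtPair(int64_t a1, int64_t p1, int64_t a2, int64_t p2) {
    int64_t x, y;
    egcd(p1, p2, x, y);               // p1*x + p2*y = 1, since gcd(p1, p2) = 1
    int64_t mod = p1 * p2;
    int64_t z = (a1 * p2 % mod) * (y % mod) % mod
              + (a2 * p1 % mod) * (x % mod) % mod;
    z %= mod;
    if (z < 0) z += mod;
    return z;
}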
Example
Unfortunately, the Matlab implementation of algorithm 1.2 cannot be run on the reference example. Figure 37 shows the corresponding calls, ignoring the limit on m in equation (2.30),
41 In equation (2.33) we choose the name y instead of x to avoid a naming conflict.
i.e. using 0 for the parameter “pedantic”. However, the algorithm fails, because k turns out to be negative. This points to one major weakness of the algorithm: for our use, either n has to be huge, or m needs to be extremely small. In contrast to algorithm 1.1, which tends to work well outside the specified limits of m, algorithm 1.2 seems to depend strongly on the limitation. Even if we choose m such that k is just barely non-negative (still ignoring the limit), k is unnecessarily large (e.g. around 20), which in turn has a very negative impact on the run time of the algorithm.
If the limit is respected, however, the algorithm can be run.42 Figure 38 shows an example with pedantic set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 0)
Figure 37: Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
n = 1000000;
m = 90;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 1)
Figure 38: Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
Review
As mentioned above, the upper limit on m is very strict and cannot be ignored. When analysing it for varying n in the same context as before (i.e. with ε = 0.1, p = 0.9, t = 2), we observe that only small block sizes are allowed with this algorithm. Figure 39 shows the corresponding plot.
However, even if the limit on m is met, the run time and memory usage of the Matlab implementation of algorithm 1.2 are unfortunately prohibitive.43 The main reason for this is that β p_i + t easily turns out to be several thousand, so thousands of largest values are extracted from the vectors C(i). Since solving according to the Chinese Remainder Theorem is not optimised in Matlab, calculating the solutions for all combinations is a very slow process and clearly the bottleneck of our implementation. Colin Percival shows in [1] on page 22 that the seemingly quadratic run time O((βL + t)²) of this part of the algorithm is actually not quadratic, by the definitions of β and L. However, even in theory this is questionable: it again relies
42 Please note that this example might take several hours to run, and Matlab might run out of memory due to the
large Cartesian product.
43 Specifically, either Matlab runs out of memory or we have given up waiting for results after 8 hours.
on the assumption that primes of size L are used, while their worst-case size is roughly L(1 + 2/log(L)). This difference should not be dismissed without further comment, especially because it enters the bound squared. If we use a completely different way to reconstruct X, one which does not tend to have a quadratic run time, this problem is solved.
2.4.4 The Third Variant
The third variant of the algorithm replaces the reconstruction via the Chinese Remainder Theorem. Instead, for all j ∈ {0, …, n − 1}, a vector F is calculated as

$$F_j = \sum_{i=1}^{k} C^{(i)}_{j \bmod p_i} \qquad (2.35)$$

Spikes which existed e.g. in C(1) at positions modulo p_1 and in C(2) at corresponding positions modulo p_2 are added up and lead to spikes in F. Thus, F is actually an approximation of C, and equation (2.35) can be seen as an “inverse projection”, because it restores the spikes at the positions where they would have been without the projection. Further processing of F can then be done like the processing of C in our reference algorithm.
While this is intuitively a reasonable result, the approximation according to the Bayesian
analysis is a bit different (see [1] on page 24):
∀ j ∈ {0, …, n − 1}:

$$F'_j = \sum_{i=1}^{k} \frac{C^{(i)}_{j \bmod p_i} - \frac{mp}{2}}{\sigma_{p_i}(n, m, j)} \qquad (2.36)$$

with σ being the standard deviation defined in [1] on page 13. Two problems arise when using F′ in this form:
1. The calculation of F′ depends on the match probability p and on σ, a dependency we would like to avoid (see also the end of this section).
2. Using equation (2.36) to calculate F 0 will not lead to correct results for maliciously
formed X (see [1] on page 24). This is more a theoretical problem, because these X
are very unlikely to occur in real applications, but it is still a drawback.
While the best way to solve the first problem is to use F instead of F′, the second problem can be dealt with by performing further processing of C(i) “for some appropriate δ”44 (see [1] on page 25):

∀ i ∈ {1, …, k}; ∀ j ∈ {0, …, p_i − 1}:

$$\exp\left(-D^{(i)}_j\right) = \sqrt{\delta} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right)$$

$$\Leftrightarrow \quad D^{(i)}_j = -\log\left(\sqrt{\delta} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right)\right) \qquad (2.38)$$
In order to numerically show “what happens” during the calculation of the vectors D^(i), we define

$$x = \frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}$$

to be used as variable, set δ = 2 and plot y = −log(√δ + e^(−x)). The result is shown in figure 40.
Interpreting this plot, we observe that x > 0 if and only if C^(i)_j > mp/2, due to the restrictions of the model. This means that, according to the plot, spikes in vectors C(i) larger than mp/2 are
44 This is cited from [1] on page 24 to make it clear that δ is a very abstract value.
Figure 40: Plot of y = −log(√δ + e^(−x)) with δ = 2
truncated. Since y ≈ x for all x ≤ 0, all other values are relatively maintained by this function.
Figure 41 shows a Matlab function which calculates a vector D from C, using the definition of
δ as specified later in equation (2.41) and an approximation from [1] on page 28:
$$\sigma_{p_i}(n, m, j) \approx \frac{nm}{p_i} \qquad (2.40)$$
Applying this function to a vector C calculated according to figure 17 on page 24, we clearly see
that all values are relatively maintained, but the spikes are truncated. Figures 42 and 43 show
vectors C and D, respectively.45
Based on these numerical observations, it is questionable why Colin Percival states in [1] on page 24, while deriving the algorithm, that “D^(i)_j = max(C^(i)_j, δ)”. This is in contrast to the definition of D^(i) in equation (2.38)46, which, to the best of our knowledge, leads to a truncation
45 Note that the second figure has been scaled, but the ratio was maintained.
46 which was cited from [1]
of spikes in C(i), thereby showing the behaviour of a “min” function.47 Due to this discrepancy,
47 One might argue that truncating large values and raising small values are essentially the same, but this is questionable: raising all values to a certain minimum level would remove some of the random noise.
and to avoid too much dependency on p, we decide to stick to our initial approach and calculate
F in our implementation as specified in equation (2.35).
What remains to be discussed is, as in the previous variants of the algorithm, the restrictions
on input values and the further processing to estimate X:
1. No input restrictions are specified in [1] for this algorithm, so only the restrictions of the model apply. The number of primes k and the minimum size of the primes L are calculated as follows (see [1] on page 24):

$$\delta = \frac{tmp^2\log(n)}{n} \qquad (2.41)$$

$$k = \frac{\log(nt/\varepsilon)}{\log(1/(4\delta))} \qquad (2.42)$$

$$L = \frac{-8n\log\left(\sqrt[2k]{\varepsilon/(nt)} - \delta\right)}{mp^2 + 8\log\left(\sqrt[2k]{\varepsilon/(nt)} - \delta\right)} \qquad (2.43)$$
However, these definitions lead to implicit input restrictions, because negative values of k and L are not applicable. For example, n = 10000, m = 300, t = 3, ε = 0.1 leads to k = −12. We observe that this happens if 4δ > 1, since in that case log(1/(4δ)) < 0. The case 4δ = 1 should also be prevented, as it leads to a division by zero. Therefore, we require 4δ < 1 and derive an input restriction ourselves:48

$$4\delta < 1 \;\Leftrightarrow\; \frac{4tmp^2\log(n)}{n} < 1 \;\Leftrightarrow\; 4tmp^2\log(n) < n \;\Leftrightarrow\; m < \frac{n}{4tp^2\log(n)} \qquad (2.44)$$
This input restriction helps to avoid invalid input values, though we do not claim that it
covers all cases.
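To make these implicit restrictions concrete, the following C++ sketch (our own helper, with names and the use of double precision as our assumptions) checks equation (2.44) and derives δ, k and L according to equations (2.41) to (2.43):

#include <cmath>
#include <stdexcept>

struct Variant3Params { double delta, k, L; };

// Check the input restriction (2.44) and derive delta, k and L following
// equations (2.41)-(2.43).
Variant3Params deriveParams13(double n, double m, double t,
                              double p, double epsilon) {
    Variant3Params r;
    r.delta = t * m * p * p * std::log(n) / n;                  // (2.41)
    if (4.0 * r.delta >= 1.0)                                   // (2.44)
        throw std::runtime_error("m too large: 4*delta >= 1");
    r.k = std::log(n * t / epsilon) / std::log(1.0 / (4.0 * r.delta)); // (2.42)
    double root = std::pow(epsilon / (n * t), 1.0 / (2.0 * r.k));      // 2k-th root
    if (root <= r.delta)
        throw std::runtime_error("invalid parameters: logarithm undefined");
    double lg = std::log(root - r.delta);
    r.L = (-8.0 * n * lg) / (m * p * p + 8.0 * lg);             // (2.43)
    return r;
}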
We now present a Matlab function implementing “Algorithm 1.3” of [1] (on pages 24-25) with
the changes as described in this section. It is shown in figure 44 on the following page.
Description
The first step of this algorithm is, as in the other variants, to check its input values against
the limitations of the model and against the specific constraints we defined in equation (2.44).
48 In addition to the model definitions, we assume n > 1 and t > 0 to prevent division by zero.
Afterwards, it initialises δ , k and L according to equations (2.41), (2.42) and (2.43). It selects
k primes and then projects the input strings onto subspaces. Then the cyclic correlations C(i) of
vectors A(i) and B(i) are calculated using the FFT. In the next step, vector F is calculated as in
equation (2.35) of this section. The set X is estimated by extracting the positions of the t largest
values of F.
Example
Using algorithm 1.3 is fairly straightforward, since the limit on m is not too restrictive. Figure 45 shows an example application of this algorithm similar to the “reference example” in figure 28, with pedantic set to 1.
n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_13(S, T, p, Sigma, length(X), epsilon, 1)
Figure 45: Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
Review
The plot of the upper limit of m given ε = 0.1, p = 0.9, t = 2 is shown in figure 46. The limit
is very generous, and we can easily select small as well as large block sizes. Consequently, this
variant of the algorithm is the best candidate to be used for delta compression of executable
code.
We also note that calculating F does not tend to have a quadratic run time, and is therefore preferable compared to calculating all combinations of solutions according to the Chinese Remainder Theorem. In theory, it is not even necessary to calculate the whole vector F, as only the largest values of the sum need to be considered (see [1] on page 24).
2.4.5 Comparison
In order to compare the different variants of the algorithm with regard to speed and probability
of success, we run them several times with the same input data and check whether they are
successful, i.e. whether X is estimated correctly. All tests are performed on a Windows XP
system with Intel Core2 CPU 2.1 GHz and 1 GB RAM, using Matlab version 7.1.49
Table 3 shows the input values and the corresponding test results of 100 runs, given Σ =
{0, . . . , 255}. We skip algorithm 1.2 entirely, because it fails due to input restrictions.50
Table 3: Test results using n = 102400, m = 512, p = 0.9, X = {34, 39411, 101410}

Algorithm    Average Run Time    Probability of Success
Reference    0.03 s              1.0
1.1          2.32 s              1.0
1.2          (not applicable)    (not applicable)
1.3          16.88 s             1.0
These results are characteristic for large input values with the precondition n ≫ m and m not being too small, in the context of matching blocks of executable code. Given data according to the model, the numerically measured probability of success is usually about 1. Small input values lead to problems with regard to input restrictions.
Comparing the run times clearly shows that the calculation of the FFT is heavily optimised in Matlab: the more complexity is added outside of the FFT, the slower the algorithm runs. This is especially unfortunate for algorithm 1.3, since we chose it for use in delta compression of executable code. Therefore, we intend to write a Matlab-independent implementation of this algorithm.
Chapter 3
Implementation
In this chapter, we present a C++ implementation of algorithm 1.3 as described in section 2.4.4
on page 43. The implementation aims at overcoming the drawbacks of the Matlab functions
concerning memory usage, speed and reusability to allow further use of this algorithm, for
example in a patch tool.
3.1 Considerations
3.1.1 Portability
When planning to implement the algorithm, we need to consider portability from the start, as it
severely affects the choice of libraries and the basic technical design of the implementation. In
order to achieve portability across a wide range of operating systems, we decide to use CMake
[24] as “build environment”. Thus, we are able to support compiling the program on Windows
and Linux systems as well as BSD-based systems like Mac OS X. This also enables the use of
different compilers on the respective systems.
In order to accomplish portability of the implementation itself, we avoid system-specific calls as much as possible and instead use the standard libraries provided by the compiler, or portable third-party libraries.
3.1.2 Environment
Although we wish to optimise for speed, we do not claim to write optimal code for all subproblems. We do not wish to reinvent the wheel; instead, we intend to reuse widespread and optimised algorithms. While interpreted environments like Microsoft .NET offer various algorithms, programs running in such environments are usually considerably slower than corresponding native applications. This is why we choose C++ as programming language. C++ is based on C and offers low-level functions, but additionally it provides the “Standard Template Library” (STL)51, which supplies various stable and optimised algorithms and containers, e.g. the algorithms of the “<algorithm>” header and priority queues, which can easily be used. Table 4 shows a list of libraries which we utilise to implement the algorithm for matching with mismatches.
We have to be aware of the fact that this adds dependencies to our implementation of the algorithm. Possible tools based on it will also have these dependencies. However, given the complexity of the algorithm, we think that these dependencies are justified. They will only be present on the side of patch generation; a tool applying the patch can perform its task without requiring Fast Fourier Transforms and priority queues.52
Finally, we need to make sure that the third-party libraries we are using integrate well
enough with our implementation, both in terms of license and technology.
52 The tool applying the patch might still depend on libraries to uncompress the data, but this is beyond the
scope of this thesis.
53 Single Instruction Multiple Data
54 A C++ template class is a class which can be used for different base types, e.g. a generic vector for real and complex values.
3.2 Implementation
3.2.1 Structure
We start the implementation with the Matlab code of algorithm 1.3 in mind (see figure 44 on
page 48), but also with regard to delta compression of executable code. Some functions used in
Matlab are not available in C/C++, so we have to implement them on our own. Table 5 shows
the core modules55 and the corresponding source files of our implementation.
In order to implement a patch tool for delta compression of executable files based on these
core modules, further functions are required. These are listed in table 6. Furthermore, a file
format needs to be defined. The implementation of these additional modules is beyond the
scope of this thesis.
We pre-calculate all primes which we could possibly need for the valid input range. These primes are stored in a static array directly in the source code. Thus, choosing primes is very efficient regardless of the interval.
In order to select primes from a given interval, we first find the boundaries of that interval within our array of pre-calculated primes using the STL algorithm lower_bound. Next we create a random permutation of all primes within the boundaries by utilising the STL algorithm random_shuffle, and then return the first k of these primes as a vector of integers.
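A condensed C++ sketch of this selection (our own illustration; the prime table shown is a placeholder, and std::shuffle stands in for the random_shuffle call of the actual implementation):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Placeholder for the pre-calculated, sorted prime table compiled into the
// source; the real array covers the whole valid input range.
static const int kPrimes[] = { 10007, 10009, 10037, 10039, 10061, 10067 };

// Return up to k randomly permuted primes from the interval [lo, hi).
std::vector<int> selectPrimes(int lo, int hi, std::size_t k, std::mt19937& rng) {
    const int* begin = kPrimes;
    const int* end = kPrimes + sizeof(kPrimes) / sizeof(kPrimes[0]);
    // Locate the interval boundaries in the sorted table.
    const int* first = std::lower_bound(begin, end, lo);
    const int* last = std::lower_bound(begin, end, hi);
    std::vector<int> primes(first, last);
    std::shuffle(primes.begin(), primes.end(), rng); // random permutation
    if (primes.size() > k)
        primes.resize(k);
    return primes;
}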
After computing the index, the actual matching of a specific block T with S can be performed. It consists of the following steps:
1. The vectors B(i) are calculated from T according to equation (2.23) on page 29.
2. The cyclic correlations C(i) are computed by use of FFTW functions56 with the plans provided by the index.
3. The sums

$$\sum_{i=1}^{k} C^{(i)}_{j \bmod p_i}$$

for all j ∈ {0, …, n − 1} are calculated (see also equation (2.35) on page 43). In order to save memory, we do not compute the vector F explicitly. Instead, we maintain a list of the positions of the t largest values while calculating the sum.
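A C++ sketch of this memory-saving reconstruction (our own illustration, using std::priority_queue; names and types are our assumptions):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Compute the sums of equation (2.35) position by position and keep only
// the positions of the t largest values, instead of storing all of F.
std::vector<std::size_t> topPositions(
        const std::vector<std::vector<double> >& C, // C^(i), one per prime
        const std::vector<std::size_t>& primes,     // p_1, ..., p_k
        std::size_t n, std::size_t t) {
    typedef std::pair<double, std::size_t> Entry;   // (value, position)
    // Min-heap: the smallest of the current t best sums sits on top.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > best;
    for (std::size_t j = 0; j < n; ++j) {
        double f = 0.0;
        for (std::size_t i = 0; i < primes.size(); ++i)
            f += C[i][j % primes[i]];               // C^(i) indexed mod p_i
        if (best.size() < t) {
            best.push(Entry(f, j));
        } else if (f > best.top().first) {
            best.pop();                             // evict the smallest
            best.push(Entry(f, j));
        }
    }
    std::vector<std::size_t> positions;
    for (; !best.empty(); best.pop())
        positions.push_back(best.top().second);
    return positions;
}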
These steps can be repeated for different blocks T without recreating the index. However, the
block size m needs to be kept constant.
2. Next, the original file, which is assumed to be present on the target system, is read as
string S, with n being the size of that file.
3. Given S, n and the block size m = √(n log(n)) (according to [1] on page 34), the index is created as described in the previous section.
4. The new file including e.g. a security fix is read in blocks of length m, and each block is
matched with the original file. For each block only the best match is considered (t = 1).
The result is a mapping of positions in the new file to corresponding positions of similar blocks
in the original file.
The source code of the implementation, called “rndiff”, is available at http://sourceforge.net/projects/rndiff.
First tests of the implementation show that the run time is slower than that of the Matlab reference algorithm presented in section 2.4.1 on page 30 for similar input values. This is due to the fact that neither the projection (specifically the calculation of vectors B(i)) nor the computation of the sum to reconstruct the results are optimised in the C++ implementation. If the run time turns out to be too slow for specific applications, further efforts will have to be put into optimising these parts.
We also note that when applying “rndiff” to two files which are equal, the resulting mapping usually does not consist of one contiguous block, as one would expect. This is due to the fact that the bytes of executable files are in general not uniformly distributed, an issue which will be addressed in the next chapter.
Chapter 4
Improvements
Small improvements and extensions concerning the algorithm have already been integrated directly into the corresponding sections where appropriate. In this chapter, additional improvements beyond the scope of the other parts of this thesis are presented. Theoretical changes are considered as well as improvements concerning the practical usage of the algorithm, with special regard to reproducibility of results.
To evaluate the probability bound of theorem 1.1 numerically, we use a Matlab function implementing the “filtering by intersecting” as in the theorem. The function is shown in figure 47.
We run several numeric tests, repeating each test 1000 times. Only exact reconstructions count as successes. Table 8 shows the input values, the numerical probability of correct reconstruction, and the corresponding lower bound according to theorem 1.1 (see equation (2.20) on page 26).
The results suggest that the lower probability bound is very pessimistic. This has several reasons:
• Colin Percival states in the proof of theorem 1.1 (see [1] on pages 10-11) that the product

$$\prod_{j=1}^{t}(y - x_j) \qquad (4.1)$$

is bounded in absolute value by n^t, since each of the t factors is at most n − 1 in absolute value.
Thus, since the x_j are pairwise distinct, a tighter (but less trivial) upper bound of the product in equation (4.1) is:

$$\prod_{j=1}^{t}(n - j) = \frac{(n-1)!}{(n-t-1)!} \qquad (4.2)$$
By definition of X in the model (see 2.3.1 on page 10), “X = {x_1, …, x_t} ⊂ {0, …, n − m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t”, meaning that the difference between two consecutive elements of X is at least m. This leads to an even tighter upper bound of the product (a brief numerical illustration follows after this list):

$$\prod_{j=1}^{t}(n - jm) \qquad (4.3)$$
• The proof of the probability bound of theorem 1.1 is based on other worst case assump-
tions, and the probability that worst case values are chosen is very small. However, the
probability bound is meant to be true for all inputs, i.e. it has no further assumptions
about the context.
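As a brief numerical illustration (with numbers chosen by us, not taken from [1]): for n = 100, t = 3 and m = 10, the trivial bound yields n^t = 1000000, the bound of equation (4.2) yields 99 · 98 · 97 = 941094, and the bound of equation (4.3) yields 90 · 80 · 70 = 504000, roughly half of the trivial bound.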
To conclude, we can achieve a direct improvement of theorem 1.1 if we use a random permutation of primes instead of selecting them uniformly at random. This change is already implemented in the Matlab function in figure 21 on page 27. There is further potential for improvement by using a different approach in the proof of theorem 1.1.
These results show why we observed that algorithm 1.1 works well outside its proven limits.
4.2 Derandomisation
One problem with the usage of the algorithm presented in chapter 2 is that its result can be
different each time we apply the algorithm. Especially when the algorithm is used in a tool to
create patches, a certain reproducibility is expected by the users so that they trust the program.
Generally, there are two possibilities to achieve this:
1. The result is “guaranteed” following the law of large numbers, regardless of the random
selection.
2. Instead of randomly choosing values, these values are calculated according to rules and/or
additional input values.
The first of these options is practically impossible to achieve, given the nature of the algorithm. We would need to run the algorithm repeatedly, and even then we could only state that the same result will be returned most of the time. Unfortunately, if a different result is returned only once, a user might lose trust in the product. Any application concerning security patches should output reproducible results.
This leaves the second option as the only alternative. We have to think of rules for choosing the values, as well as possible input parameters which can be used to “tweak” these rules.
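As a minimal illustration of the second option (our own sketch, not the approach developed in the following sections), the “random” values could be drawn from a generator seeded deterministically from the input itself, together with an optional tweak parameter:

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// Seed a Mersenne Twister deterministically from the input data using a
// simple FNV-1a hash; identical input (and tweak) yields identical choices.
std::mt19937 makeDeterministicRng(const std::string& input, std::uint32_t tweak) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < input.size(); ++i) {
        h ^= static_cast<unsigned char>(input[i]);
        h *= 16777619u;
    }
    return std::mt19937(h ^ tweak);
}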
In executable files, for example, large parts of the file may be filled with zeros. Additionally, some executable files are statically linked to resources, e.g. bitmaps. Given that this depends on the type of executable file and the underlying operating system, we cannot assume a specific probability distribution of the alphabet for all executable files. But we can calculate the numeric probability distribution of a specific file in a single pass by counting the characters. The result can be used to calculate φ such that φ actually maps half of the input values to 1 and the other half to −1 for non-random input values, which previously was only the case for uniformly distributed input.
Specifically, we can create the function φ as follows: Given the probability distribution, we
calculate the first bit of the Huffman code (see e.g. [13] on pages 99-113) for each character of
Σ. Afterwards, we map φ to −1 for all characters with 0 as first bit of the Huffman code, and
to 1 for all characters with 1 as first bit. Due to the basic properties of the Huffman code, this
usually59 leads to a fair distribution of the mapping between −1 and 1.
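The construction can be sketched in C++ as follows (our own illustration, assuming byte frequencies have already been counted). Since the first code bit of a symbol only depends on which child of the Huffman tree root contains it, it suffices to run the usual Huffman merging until two nodes remain:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Build phi from byte frequencies: run Huffman merging until two nodes
// remain (the children of the tree root); symbols in the first node get
// first code bit 0 and map to -1, symbols in the second map to +1.
std::vector<int> buildPhi(const std::vector<std::uint64_t>& freq) { // size 256
    typedef std::pair<std::uint64_t, std::vector<int> > Node; // (frequency, symbols)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node> > heap;
    for (int c = 0; c < 256; ++c)
        if (freq[c] > 0)
            heap.push(Node(freq[c], std::vector<int>(1, c)));
    while (heap.size() > 2) { // merge the two least frequent nodes
        Node a = heap.top(); heap.pop();
        Node b = heap.top(); heap.pop();
        a.first += b.first;
        a.second.insert(a.second.end(), b.second.begin(), b.second.end());
        heap.push(a);
    }
    std::vector<int> phi(256, 1); // default for characters not in the file
    int sign = -1;
    for (; !heap.empty(); heap.pop()) { // first root child -> -1, second -> +1
        const std::vector<int>& symbols = heap.top().second;
        for (std::size_t i = 0; i < symbols.size(); ++i)
            phi[symbols[i]] = sign;
        sign = 1;
    }
    return phi;
}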
If we use the original file without the security fix for the calculation, we do not even need
to store the probability distribution in the patch file, because this file is present on the target
system.60
59 If, for example, a certain character has a probability of occurrence larger than 0.5, a fair distribution cannot be achieved by φ. In that case, further pre-processing of the input string could be done, but this is beyond the scope of this thesis.
60 However, we should make sure that the correct file is used, for example by checking a cryptographic hash.
Chapter 5
Conclusion
In the present thesis, we have explained the background behind the algorithm for matching
with mismatches from [1], with regard to delta compression of executable code. Supported
by numerical examples, we have critically analysed the different variants of the algorithm and
implemented them in Matlab.
Based on the Matlab code, we have created a C++ implementation of the last variant of the algorithm, specifically optimising for speed and memory usage. This implementation can be used as the basis for a tool that creates compact security patches. Further improvements with regard to reproducibility have been proposed, such that the algorithm is able to produce identical output given identical input.
First tests show that the run time of all variants of the algorithm in practice severely depends
on an efficient implementation of its inner loops and the sub-algorithms it uses. Since the FFTW
library is heavily optimised, we have yet to find a numerical example where one of the variants
of the algorithm is faster than our reference algorithm, which mostly relies on the FFT.
It seems hard to beat FFTW in terms of speed, even though the last variant of the algorithm has a run time “sublinear in n” (see [1] on page 9). The O(n log₂(n)) run time of the FFT only becomes significant for very large values of n, because FFTW keeps the constant parts of the run time very low. In contrast, the non-FFT parts of the algorithm are not easy to optimise, so their practical run time outweighs the theoretical advantage. Additionally, the theoretical run time of the algorithm is only asymptotic for growing values of n.
That being said, the main advantage of the algorithm, in particular of the C++ implementation, is its low memory usage. In order to achieve a speed near that of the reference algorithm for usual file sizes of a few megabytes, more effort needs to be devoted to optimisation.
Appendix
Figure 48: Matlab function to provide numeric probability on a good match not in X
Figure 51: Matlab function to compute C as in figure 8 by using the FFT as in figure 9
Figure 52: Matlab function to simply calculate the positions of the t largest values
Figure 53: Matlab function to iteratively calculate the Cartesian product (2 dimensions)
Bibliography
[1] Colin Percival: Matching with Mismatches and Assorted Applications, D.Phil. thesis, Uni-
versity of Oxford 2006.
[3] Gonzalo Navarro: A Guided Tour to Approximate String Matching, ACM Computing
Surveys, 33 (1): 31-88, 2001.
[4] Giovanni Motta, James Gustafson, Samson Chen: Differential Compression of Executable
Code, In Proceedings of the 2007 Data Compression Conference, Pages 103-112, 2007.
[5] David Salomon: Data Compression, The Complete Reference, 4th Edition, Springer 2007.
[6] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein: Introduction to Algorithms, 2nd Edition, The MIT Press 2001.
[7] Manfred R. Schroeder: Number Theory in Science and Communication, 4th Edition,
Springer 2006.
[9] John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman: Introduction to Automata Theory, Languages, and Computation, 3rd Edition, Addison-Wesley 2007.
[10] Leo I. Bluestein: A Linear Filtering Approach to the Computation of Discrete Fourier
Transform, IEEE Transactions on Audio and Electroacoustics, AU-18 (4): 451-455, 1970.
[11] Hari Krishna: Digital Signal Processing Algorithms: Number Theory, Convolution, Fast
Fourier Transforms, and Applications, CRC Press 1998.
[12] William H. Press, Saul A. Teukolsky, William T. Vetterling: Numerical Recipes in C: The
Art of Scientific Computing, 2nd Edition, Cambridge University Press 1993.
[13] André Neubauer: Informationstheorie und Quellencodierung: Eine Einführung für Ingenieure, Informatiker und Naturwissenschaftler, J. Schlembach Fachverlag 2006.
[14] André Neubauer: Kanalcodierung: Eine Einführung für Ingenieure, Informatiker und
Naturwissenschaftler, J. Schlembach Fachverlag 2006.
[15] Microsoft Corporation: Using Binary Delta Compression (BDC)
Technology to Update Windows Operating Systems, BDC_v2.doc,
http://www.microsoft.com/downloads/details.aspx?FamilyID=
4789196c-d60a-497c-ae89-101a3754bad6&displaylang=en, 2005.
[16] ISO/IEC 9899:1990(E), Programming Languages – C (ISO C90 and ANSI C89 standard),
1990.
[17] ISO/IEC 14882:1998(E), Programming Languages – C++ (ISO and ANSI C++ standard),
1998.