
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Master's Thesis
The Linear Direct Sparse Solver on GPU for
Bundle Adjustment Method
Bc. Ondrej Ivank
Supervisor: Ing. Ivan Šimeček, Ph.D.
Study Programme: Open Informatics
Field of Study: Computer Vision and Image Processing
May 11, 2012
Acknowledgements
I would like to thank my supervisor Ivan Šimeček, who enabled me to work on a very interesting topic, and prof. Olaf Hellwich and Cornelius Wefelscheid, who allowed me to work on my thesis within an individual project at TU Berlin.
Declaration
I hereby declare that I have completed this thesis independently and that I
have listed all the literature and publications used.
I have no objection to the usage of this work in compliance with §60 of Act No. 121/2000 Coll. (Zákon č. 121/2000 Sb., the Copyright Act), and with the rights connected with the Copyright Act, including the changes in the act.
Prague, May 11, 2012
Abstract
The thesis deals with solving sparse linear positive definite systems. It implements Cholesky decomposition on the CPU utilizing the CRS format for sparse matrices, a fast AMD ordering, and symbolic factorization. Possibilities of parallelizing Cholesky decomposition are analysed for sparse diagonal-based linear systems and for the Bundle Adjustment problem, where matrices of a specific structure arise. Cholesky decomposition exploiting a Schur complement is implemented on both the CPU and the GPU side.
Abstrakt
The thesis deals with solving sparse linear positive definite systems. It implements Cholesky decomposition on the CPU using the CRS format for sparse matrices, a fast AMD ordering and symbolic factorization. It analyses the possibilities of parallelizing Cholesky decomposition for sparse linear systems of diagonal form and for the bundle adjustment problem, where sparse matrices of a specific structure arise. It proposes and implements the computation of Cholesky decomposition on the GPU and CPU using the Schur complement.
Contents
1 Introduction
  1.1 Motivation
2 Solving Linear Systems
  2.1 System of Linear Equations
  2.2 Direct Methods for Solving Linear Systems
    2.2.1 Cramer's Rule
    2.2.2 Forward and Backward Substitution
    2.2.3 Gaussian Elimination
    2.2.4 Gauss-Jordan Elimination
    2.2.5 LU Decomposition
    2.2.6 Cholesky Decomposition
  2.3 Iterative Methods for Solving Linear Systems
3 Sparse Matrices
  3.1 Ordering Methods
    3.1.1 Arrowhead Matrix Example
    3.1.2 Graph Representation
    3.1.3 Bottom-up Ordering Methods
    3.1.4 Top-down Ordering Methods
  3.2 Symbolical Factorization
4 Bundle Adjustment
  4.1 Unconstrained Optimization
    4.1.1 Search Methods
    4.1.2 Levenberg-Marquardt Algorithm
5 Overview of NVIDIA CUDA
  5.1 The CUDA Execution Model
  5.2 GPU Memory
6 Analysis of the Problem
  6.1 Structure of Linear Systems in BA
  6.2 Block Cholesky Decomposition for BA
7 Implementation
  7.1 Used Framework
  7.2 Compressed Row Storage Format
  7.3 Cholesky decomposition on CPU
  7.4 Ordering for CPU solver
  7.5 Block Matrix Format for GPU
  7.6 Block Cholesky decomposition on GPU
  7.7 Ordering for GPU solver
8 Testing
  8.1 Octave solvers
  8.2 CPU solver
  8.3 GPU solver
  8.4 CUSP solvers
9 Conclusion
A List of Abbreviations
B User Manual
  B.1 Requirements
  B.2 Usage
C Contents of the Attached CD
List of Figures
3.1 The dependence of the reordering of a sparse matrix on the fill-in count
3.2 Ordering example
4.1 Reprojection error
5.1 Block diagram of a GF100 GPU
5.2 Streaming multiprocessor of a GF100 (Fermi) GPU
5.3 Bandwidth of various GPU memories
6.1 An example of a modestly sized Hessian in BA
7.1 Sample of a symmetric positive definite sparse 6 × 6 matrix with 22 nonzero elements
7.2 Performing k-way ordering on the diagonal-based matrix Wathen 10 × 10
7.3 Performing k-way ordering on the diagonal-based matrix Poisson 30
8.1 Test of Octave solvers
8.2 Test of iterative CUSP solvers. Max. error is the maximal difference from Octave's reference solution
Chapter 1
Introduction
Finding a solution of a system of linear algebraic equations (2.1) is the most basic task in linear algebra and the heart of many engineering problems. It has been a topic of study for many years, not only for its application in many branches of scientific computing, but also for its high computational complexity and the wide variety of methods and approaches that help to solve linear systems of different types faster and more accurately.
Finding a solution of a system of nonlinear algebraic equations can be achieved using iterative solvers whose keystone is solving a linear system in each iteration step to approach a sufficiently accurate solution. A linear solver therefore forms a crucial part of a nonlinear solver and, at the same time, its bottleneck.
A widely used optimization method in 3D reconstruction algorithms is bundle adjustment. As a nonlinear iterative optimization method, it needs to solve a sparse, often very large linear system of a specific structure many times. Studying a suitable linear solver for bundle adjustment is the main part of my thesis.
1.1 Motivation
One particular and promising approach for speeding up the process of solving systems of linear equations consists in parallel computation. In the case of dense direct solvers, the parallelization is more straightforward and has better performance results than for sparse direct solvers. Iterative methods, mostly used for solving large sparse linear systems, are efficiently parallelizable because they use only sparse matrix-vector multiplications and vector additions.
In the last decade, there has been growing interest in general-purpose computation on graphics processing units (GPGPU). Several libraries were developed which implement basic linear algebra subroutines or even linear solvers for dense matrices (NVIDIA cuBLAS, MAGMA, CULA Dense) and sparse matrices (NVIDIA cuSparse, NVIDIA CUSP, CULA Sparse). At the present time, no implementation of a direct linear solver for general sparse matrices on a GPU exists. The main cause is the problematic fine-grain parallelization and the thread divergence on a GPU.
Sparse matrices consisting of many small independent full blocks on the diagonal with some dependent parts on the borders are formed during the computation of bundle adjustment. It seems that there is a possibility to eliminate these blocks effectively in a parallel manner even on a GPU. The question is which type of solver is more suitable: direct or iterative? My thesis aims to give the answer.
Chapter 2
Solving Linear Systems
(The material in this chapter draws on [20] and [21].)
2.1 System of Linear Equations
Definition 1. A system of m linear equations in n unknowns consists of a set of algebraic relations of the form
\[ \sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \dots, m \tag{2.1} \]
where $x_j$ are the unknowns, $a_{ij}$ are the coefficients of the system and $b_i$ are the components of the right-hand side. System (2.1) can be more conveniently written in matrix form as
\[ Ax = b, \tag{2.2} \]
where $A = (a_{ij}) \in \mathbb{C}^{m \times n}$ denotes the coefficient matrix, $b = (b_i) \in \mathbb{C}^{m}$ the right-hand side vector and $x = (x_i) \in \mathbb{C}^{n}$ the unknown vector, respectively. A solution of (2.2) is any n-tuple of values $x_i$ which satisfies (2.1).
Remark 1. The existence and uniqueness of the solution of (2.2) are ensured if one of the following (equivalent) hypotheses holds:
1. A is invertible,
2. rank(A) = n,
3. the homogeneous system Ax = 0 admits only the null solution.
In the next chapters I will deal with numerical methods for finding the solution of real-valued square systems of order n, that is, systems of the form (2.2) with $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^{n}$. Such linear systems arise frequently in any
branch of science, including bundle adjustment. These numerical methods can generally be divided into two classes. In the absence of roundoff errors, direct methods yield the exact solution in a finite number of steps. Iterative methods require (theoretically) an infinite number of steps to find the exact solution.
2.2 Direct Methods for Solving Systems of Linear
Equations
2.2.1 Cramer's Rule
The solution of system (2.2) is formally provided by Cramer's rule
\[ x_j = \frac{\det(A_j)}{\det(A)}, \qquad j = 1, \dots, n, \tag{2.3} \]
where $A_j$ is the matrix obtained by substituting the j-th column of A with the right-hand side b. If the determinants are evaluated by the recursive Laplace rule, the method based on Cramer's rule turns out to be unacceptable even for small dimensions of A because of its computational cost of (n+1)! flops. However, Habgood and Arel [11] have recently shown that Cramer's rule can be implemented in $O(n^3)$ time, which is comparable to more common methods of solving systems of linear equations.
2.2.2 Forward and Backward Substitution
Definition 2. A square matrix with zero entries above the main diagonal ($a_{ij} = 0$ for $i < j$) is called lower triangular. A square matrix with zero entries below the main diagonal ($a_{ij} = 0$ for $i > j$) is called upper triangular. A lower (upper) triangular matrix is strictly lower (upper) triangular when its entries on the main diagonal are zeros, too.
Example 1. Lower (upper) triangular systems can be easily solved using forward (backward) substitution. For example, the nonsingular 3 × 3 upper triangular system
\[ \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \]
can be solved in sequence as follows:
\[ x_3 = b_3 / u_{33}, \qquad x_2 = (b_2 - u_{23} x_3) / u_{22}, \qquad x_1 = (b_1 - u_{12} x_2 - u_{13} x_3) / u_{11}. \]
For a nonsingular upper triangular system of order n (n ≥ 2), the solution can be expressed generally in the form
\[ x_n = \frac{b_n}{u_{nn}}, \qquad x_i = \frac{1}{u_{ii}} \Big( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \Big), \quad i = n-1, \dots, 1. \tag{2.4} \]
Analogously, the solution for a nonsingular lower triangular system of order n (n ≥ 2) has the form
\[ x_1 = \frac{b_1}{l_{11}}, \qquad x_i = \frac{1}{l_{ii}} \Big( b_i - \sum_{j=1}^{i-1} l_{ij} x_j \Big), \quad i = 2, \dots, n. \tag{2.5} \]
The number of multiplications and divisions for forward/backward substitution is $\frac{n}{2}(n+1)$, while the number of sums and subtractions is $\frac{n}{2}(n-1)$. The total operation count for (2.4) and (2.5) is thus $n^2$.
2.2.3 Gaussian Elimination
Let A be a square nonsingular matrix. A linear system Ax = b can be transformed into an equivalent (lower or upper) triangular system $Tx = \tilde{b}$ that has the same solution using three elementary row operations. The solution of the system is invariant to
1. the multiplication of a row by a nonzero scalar,
2. the addition of one row to another,
3. the swapping of two rows.
The basic idea is to multiply the i-th equation by a nonzero constant and subtract the first equation from it so as to zero out the first unknown in the i-th equation. This is done with all equations from 2 to n. Then, the second equation is taken as the reference and all unknowns in equations from 3 to n are zeroed, and so on. The procedure ends when the system has the form $Tx = \tilde{b}$; the right-hand side $\tilde{b}$ is obtained by applying the same row operations to b. Finally, the solution is obtained by forward substitution (if T is a lower triangular matrix) or backward substitution (if T is an upper triangular matrix).
To complete the Gaussian elimination, $\frac{2}{3}(n-1)n(n+1) + n(n-1)$ flops are required. To solve the linear system, about $\frac{2}{3}n^3 + 2n^2$ flops are needed (with $n^2$ flops to backsolve the triangular system). Neglecting the lower order terms, the Gaussian elimination process has a cost of $\frac{2}{3}n^3$ flops.
2.2.4 Gauss-Jordan Elimination
Gauss-Jordan elimination is slightly different from Gaussian elimination. The transformation of the system using the three elementary row operations repeats until each equation contains only one of the unknowns, thus giving an immediate solution. The principal deficiencies of this method are that
1. it requires all the right-hand sides to be stored and manipulated at the same time, and
2. it is three times slower than the alternative solvers when the inverse of A is not desired.
2.2.5 LU Decomposition
Suppose that it is possible to write the matrix A as a product of two matrices, A = LU, where L is lower triangular and U is upper triangular. This decomposition can be used to solve the linear system
\[ Ax = (LU)x = L(Ux) = b \tag{2.6} \]
by first solving (by forward substitution) for the vector y such that
\[ Ly = b \tag{2.7} \]
and then solving (by backward substitution) for the vector x such that
\[ Ux = y. \tag{2.8} \]
Theorem 1. Let $A \in \mathbb{R}^{n \times n}$. The LU decomposition of A with $l_{ii} = 1$ for $i = 1, \dots, n$ exists and is unique iff the principal submatrices $A_i$ of A of order $i = 1, \dots, n-1$ are nonsingular.
The LU decomposition is usually performed in place to avoid copying and wasting memory when storing the triangular matrices L and U separately, as shown in Algorithm 1. At the end (here only for presentational purposes) the result is stored in the L and U matrices.
2.2.6 Cholesky Decomposition
Theorem 2. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric and positive definite matrix. Then, there exists a unique lower triangular matrix L with positive diagonal entries such that
\[ A = L L^{\top}. \tag{2.9} \]
Algorithm 1 LU Decomposition
Require: A square matrix A.
Ensure: A lower triangular matrix L with ones on the main diagonal and
an upper triangular matrix U such that LU = A.
function [L, U] = lu2(A)
[n,n] = size(A);
for k = 1:n
A(k+1:n,k) = A(k+1:n,k) / A(k,k);
A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n);
end
L = tril(A,-1) + eye(n); % ones on the diagonal
U = triu(A);
end
The computational cost for Cholesky halves, with respect to the LU decomposition, to about $\frac{n^3}{3}$ flops because the input matrix A is symmetric. An implementation example of Cholesky decomposition is given in Algorithm 2.
Algorithm 2 Cholesky Decomposition
Require: A square positive definite matrix A.
Ensure: A lower triangular matrix L such that $LL^{\top} = A$.
function [L] = chol2(A)
[n,n] = size(A);
for k = 1:n
A(k,k) = sqrt(A(k,k));
A(k,k+1:n) = A(k,k+1:n) / A(k,k);
for i = k+1:n
A(i,i:n) = A(i,i:n) - A(k,i:n) * A(k,i);
end
end
L = triu(A)'; % the loop builds the upper factor; transpose it to obtain L
end
2.3 Iterative Methods for Solving Systems of Linear
Equations
Iterative methods formally yield the solution x of a linear system after an infinite number of steps. At each step they require the computation of the residual of the system. For full matrices, their computational cost is of the order of $n^2$ operations for each iteration, to be compared with an overall cost of the order of $\frac{2}{3}n^3$ operations needed by direct methods. Iterative methods can therefore become competitive with direct methods provided that the
required number of iterations to converge is either independent of n or scales
sublinearly with respect to n.
The basic idea of iterative methods is to construct a sequence of vectors $x^{(k)}$ that enjoy the property of convergence
\[ x = \lim_{k \to \infty} x^{(k)}, \]
where x is the solution to (2.2). In practice, the iterative process is stopped at the minimum value of n such that $\|x^{(n)} - x\| < \varepsilon$.
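As a concrete, if simple, instance of such an iteration, the following sketch implements the Jacobi method together with a practical stopping test (since the exact solution is unknown, it uses the norm of the difference of two successive iterates). The Jacobi method is used here only as an illustration; it is not one of the solvers used later in the thesis.

#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Jacobi iteration for a dense n x n system Ax = b (illustration only).
 * Returns the number of iterations performed. */
static int jacobi(const double *A, const double *b, double *x,
                  int n, int maxit, double eps)
{
    double *xnew = (double *)malloc((size_t)n * sizeof(double));
    int k = 0;
    while (k < maxit) {
        double diff2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < n; ++j)
                if (j != i)
                    s -= A[i * n + j] * x[j];
            xnew[i] = s / A[i * n + i];
            diff2 += (xnew[i] - x[i]) * (xnew[i] - x[i]);
        }
        memcpy(x, xnew, (size_t)n * sizeof(double));
        ++k;
        if (sqrt(diff2) < eps)   /* stop when successive iterates are close */
            break;
    }
    free(xnew);
    return k;
}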
Chapter 3
Sparse Matrices
Many engineering problems have to deal with large and sparse matrices. A sparse matrix is a matrix that allows special techniques to take advantage of the large number of zero elements. This definition helps to define how many zeros a matrix needs in order to be sparse. The answer is that it depends on what the structure of the matrix is and what it is being used for. For example, a randomly generated sparse n × n matrix with cn entries scattered randomly throughout the matrix is not sparse in the sense of Wilkinson (for direct methods), since it takes $O(n^3)$ time to factorize (with high probability and for large enough c [9]). [3]
Example 2. Using one of the sparse formats to store real sparse matrices can result in significant computational and storage savings. Consider, for instance, a tridiagonal square matrix with 1,000,000 rows. Storing its 3 million nonzero elements in double precision, together with auxiliary data such as row and column indices, consumes approximately 40 MB. Storing the same matrix as a full matrix would consume more than 7 TB. Similarly large differences can be expected in execution times.
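A back-of-the-envelope check of these numbers, assuming 8-byte values and 4-byte indices in a CRS-like layout:
\[
\underbrace{3\cdot 10^{6}\cdot 8\,\mathrm{B}}_{\text{values}} + \underbrace{3\cdot 10^{6}\cdot 4\,\mathrm{B}}_{\text{column indices}} + \underbrace{10^{6}\cdot 4\,\mathrm{B}}_{\text{row pointers}} \approx 40\,\mathrm{MB},
\qquad
10^{6}\cdot 10^{6}\cdot 8\,\mathrm{B} = 8\cdot 10^{12}\,\mathrm{B} \approx 7.3\,\mathrm{TB}.
\]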
3.1 Ordering Methods
An unfavourable fact lies in the process of elimination with sparse matrices. Some zero values of the input matrix become non-zero during the elimination (fill-ins) and their positions must be precomputed in advance. Reordering techniques try to minimize the number of fill-ins by finding a permutation of rows and columns of the input matrix. However, finding such an optimal permutation is an NP-complete problem [26] and could be more time-consuming than solving the original linear system; therefore, heuristic approaches that often give near-optimal results are applied.
3.1.1 Arrowhead Matrix Example
Example 3. The operation counts required for the solution of two linear systems Ax = b will be examined. The input matrices are shown in figure 3.1. Even though both matrices have the same number of non-zero elements, there is a significant reduction of computation simply by permuting rows and columns.

Figure 3.1: The dependence of the reordering of a sparse matrix on the fill-in count: (a) left-up arrowhead matrix, (b) left-up arrowhead matrix after LU, (c) right-down arrowhead matrix, (d) right-down arrowhead matrix after LU. One symbol represents nonzero elements of the input matrix, another the fill-ins; empty space represents zero elements.

For the left-up arrowhead matrix 3.1a, the number of multiplications and divisions required by the forward elimination is 40 and by the back substitution 25. The total number of operations is 65 and the input sparse matrix becomes full. For the right-down matrix 3.1c, the number of multiplications and divisions required by the forward elimination is 8 and by the back substitution 13. The total number of operations is 21 and the input sparse matrix remains sparse.
There have been many recent works on ordering schemes. This is because specific problems give rise to specific types of sparse matrices (band-diagonal, block triangular, block tridiagonal, ...) [20, p. 77]. Below, the most used methods are described. They can be divided into two categories according to how the elimination tree is built. Most state-of-the-art ordering schemes for sparse matrices are a hybrid of a bottom-up method such as minimum degree and a top-down scheme such as George's nested dissection.
3.1.2 Graph Representation of Sparse Matrices
To explain ordering methods, it is convenient to introduce a graph representation of sparse matrices. They are then represented as undirected graphs (the sparse matrix has the structure of an adjacency matrix of this graph). All schemes are described for the undirected graph G = (V, E), $E \subseteq V \times V$, associated with the symmetric matrix S. Let v be a vertex of G. The set of vertices that are adjacent to v is denoted by adjG(v).
3.1.3 Bottom-up Ordering Methods
Bottom-up methods build the elimination tree from the leaves up to the root. In each iteration k a greedy heuristic is applied to $G_{k-1}$ to select a vertex for elimination. This section briefly describes two of the most popular bottom-up algorithms, the minimum degree and the minimum deficiency ordering heuristics.
Minimum Degree Ordering As mentioned above, at each iteration k the minimum degree algorithm eliminates a vertex v that minimizes the number of adjacent vertices $\deg_{G_{k-1}}(v) = |\mathrm{adj}_{G_{k-1}}(v)|$. The algorithm is a symmetric variant of the Markowitz scheme [15] and was first applied to sparse symmetric factorization by Tinney and Walker [22]. Over the years many enhancements have been proposed to the basic algorithm that have greatly improved its efficiency.
Minimum Deficiency Fill A less popular bottom-up scheme is the minimum deficiency or minimum local fill heuristic. The exact amount of fill is used to select a vertex for elimination. The minimum deficiency algorithm has received much less attention because of its prohibitive runtime.
3.1.4 Top-down Ordering Methods
The most popular top-down scheme is George's nested dissection algorithm [7, 8]. The basic idea of this approach is to find a subset of vertices S in G whose removal partitions G into two subgraphs G(B) and G(W) with $V = S \cup B \cup W$ and $|B|, |W| \le \alpha |V|$ for some $0 < \alpha < 1$. Such a partition of G is denoted by (S, B, W). The set S is called a vertex separator of G. If we order the vertices in S after the (black) vertices in B and the (white) vertices in W, no fill-edge can occur between B and W. Typically, the columns corresponding to S constitute a full off-diagonal block in the Cholesky factor. Therefore, S is supposed to be small. Once S has been found, the algorithm is recursively applied to each connected component of G(B) and G(W) until a component consists of a single vertex or a clique. In this way the elimination tree is built from the root down to the leaves.
Graph partitioning heuristics are usually divided into construction and improvement heuristics. A construction heuristic takes the graph as input and computes an initial separator from scratch. An improvement heuristic tries to minimize the size of a separator through a sequence of elementary steps.
As some ordering methods are implemented in MATLAB as standard functions (colperm, symrcm, colamd, symamd, amd, dmperm), I have tested some of them (see figure 3.2).
3.2 Symbolical Factorization
Symbolical factorization is a step executed before the numerical factorization. It precomputes the positions of the fill-ins (see also 3.1) that appear during the factorization process when one row is added to another. It can be seen from the Cholesky or LU factors that they are often much denser than the original matrices (see figure 3.2). The CRS format stores only nonzero elements and therefore the space needed for the fill-ins must be allocated before the numerical factorization. The naive solution is to run a slightly changed numerical factorization and store the new nonzero entries. Since symbolical factorization works only with indices to determine the structure of the Cholesky or LU factors, it can be computed much faster than the full numerical factorization. When implementing my symbolical factorization I have used a great information source [13].
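The naive approach mentioned above can be illustrated by a symbolic (boolean) variant of Gaussian elimination that propagates only the nonzero pattern. The dense-pattern toy version below is purely an illustration of the idea; the solver itself works on the CRS index arrays directly.

/* Symbolic factorization on a dense boolean pattern (1 = structural nonzero).
 * After the call, pat holds the pattern of the factors, i.e. the original
 * nonzeros plus the fill-ins. Returns the number of fill-ins found. */
static long symbolic_fill(unsigned char *pat, int n)
{
    long fill = 0;
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i)
            if (pat[i * n + k])              /* row i is updated by row k */
                for (int j = k + 1; j < n; ++j)
                    if (pat[k * n + j] && !pat[i * n + j]) {
                        pat[i * n + j] = 1;  /* new structural nonzero    */
                        ++fill;
                    }
    return fill;
}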
Figure 3.2: Applying different ordering methods and displaying the LU factors; nonzeros are in black, fill-ins in gray. Resulting fill-in counts: no ordering 13309, colperm 30627, symrcm 13040, colamd 9569, symamd 6681, amd 6583.
Chapter 4
Bundle Adjustment
Three-dimensional (3D) reconstruction is a problem that appears often in many computer vision tasks. 3D reconstruction can be defined as the problem of using 2D measurements arising from a set of images depicting the same scene from different viewpoints, aiming to derive information related to the scene geometry as well as the relative motion and the optical characteristics of the camera(s) employed to acquire these images. Bundle adjustment (BA) is almost invariably used as the last step of every feature-based 3D reconstruction algorithm [14, p. 12].
Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates. Optimal means that the parameter estimates are found by minimizing some cost function that quantifies the model fitting error, and jointly that the solution is simultaneously optimal with respect to both structure and camera variations. The name refers to the bundles of light rays leaving each 3D feature and converging on each camera centre, which are adjusted optimally with respect to both feature and camera positions. Equivalently (unlike independent model methods, which merge partial reconstructions without updating their internal structure), all of the structure and camera parameters are adjusted together in one bundle [23].
BA boils down to minimizing the reprojection error (4.1) between the observed and predicted image points, which is expressed as the sum of squares of a large number of nonlinear, real-valued functions. Thus, the minimization is achieved using nonlinear least-squares algorithms [4], of which Levenberg-Marquardt has proven to be one of the most successful due to its ease of implementation and its use of an effective damping strategy that lends it the ability to converge quickly from a wide range of initial guesses [12].
Figure 4.1: Reprojection error [17]
4.1 Unconstrained Optimization
(The material in this chapter draws on [21].)
The aim of unconstrained optimization is to find $x^*$ such that
\[ x^* = \arg\min_{x \in \mathbb{R}^n} f(x). \tag{4.1} \]
The point $x^*$ is called a global minimizer of f if $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$, while $x^*$ is called a local minimizer of f if a neighborhood N of $x^*$ exists such that $f(x^*) \le f(x)$ for all $x \in N$. The vector of first partial derivatives of the function f (which must be continuously differentiable) with respect to x is denoted by
\[ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)^{\top} \]
and called the gradient of f at a point x. If d is a non-null vector in $\mathbb{R}^n$, then the directional derivative of f with respect to d is
\[ \frac{\partial f}{\partial d}(x) = \lim_{\alpha \to 0} \frac{f(x + \alpha d) - f(x)}{\alpha} \]
and satisfies $\partial f(x)/\partial d = \nabla f(x)^{\top} d$. Moreover, denoting by $(x, x + \alpha d)$ the segment in $\mathbb{R}^n$ joining the points x and $x + \alpha d$, with $\alpha \in \mathbb{R}$, Taylor's expansion ensures that there exists $\xi \in (x, x + \alpha d)$ such that
\[ f(x + \alpha d) - f(x) = \alpha \nabla f(\xi)^{\top} d. \]
If f is twice continuously differentiable, we denote by H(x) (or $\nabla^2 f(x)$) the Hessian matrix of f evaluated at a point x, whose entries are
\[ h_{ij}(x) = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, \dots, n. \]
In that case it can be shown that, if $d \ne 0$, the second-order directional derivative exists and
\[ \frac{\partial^2 f}{\partial d^2}(x) = d^{\top} H(x) d. \]
For a suitable $\xi \in (x, x + d)$ we also have
\[ f(x + d) - f(x) = \nabla f(x)^{\top} d + \frac{1}{2} d^{\top} H(\xi) d. \]
Existence and uniqueness of a solution for (4.1) is not guaranteed in $\mathbb{R}^n$. Nevertheless, it can be proved that the gradient of a local minimizer $x^*$ equals the null vector. This condition is necessary for optimality to hold. It also becomes sufficient if f is a convex function on $\mathbb{R}^n$, i.e., such that for all $x, y \in \mathbb{R}^n$ and for any $\alpha \in [0, 1]$
\[ f[\alpha x + (1 - \alpha) y] \le \alpha f(x) + (1 - \alpha) f(y). \]
4.1.1 Search Methods
Analytical methods can be used only for simple problems (the brachistochrone problem, univariate minimization).
Numerical methods must be used for most engineering optimization problems (too large and complex to solve analytically). Numerical methods can be divided into two classes:
Gradient-based methods are efficient for many variables and for smooth objective functions. The drawback is only local convergence.
Derivative-free methods are suitable for problems where gradients are not available, the objective function is not differentiable, or a global minimizer is to be found.
Gradient-based descent methods compute a direction $d^{(k)}$ and a positive parameter (step length) $\alpha^{(k)}$ at each iteration k with the help of the gradient and the Hessian. Algorithm 3 shows the skeleton of this method. The concept of computing the direction $d^{(k)}$ and the step length $\alpha^{(k)}$ defines a specific descent method.
Algorithm 3 Descent method
Require: f(x), H(x) and a starting point $x^{(0)}$.
Ensure: A local minimizer $x^*$.
1: k ← 0
2: while (not converged) do
3:   compute direction $d^{(k)}$ and step length $\alpha^{(k)}$
4:   $x^{(k+1)} \leftarrow x^{(k)} + \alpha^{(k)} d^{(k)}$
5:   k ← k + 1
6: end while
7: return $x^{(k)}$

Newton's method computes
\[ d^{(k)} = -H^{-1}(x^{(k)}) \nabla f(x^{(k)}), \]
where H is positive definite within a sufficiently large neighborhood of the point $x^*$;
the inexact Newton method computes
\[ d^{(k)} = -B^{-1}(x^{(k)}) \nabla f(x^{(k)}), \]
where $B(x^{(k)})$ is a suitable approximation of $H(x^{(k)})$;
the gradient (steepest descent) method computes
\[ d^{(k)} = -\nabla f(x^{(k)}); \]
the conjugate gradient method computes
\[ d^{(k)} = -\nabla f(x^{(k)}) + \beta^{(k)} d^{(k-1)}, \]
where $\beta^{(k)}$ is a scalar to be suitably selected in such a way that the directions $\{ d^{(k)} \}$ turn out to be mutually orthogonal with respect to a suitable scalar product.
4.1.2 Levenberg-Marquardt Algorithm
The Levenberg-Marquardt (LM) algorithm, also known as the damped least-squares method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. It can be thought of as a combination of the Gauss-Newton method and the steepest descent method. When the current solution is far from a local minimum, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to a local minimum, it becomes a Gauss-Newton method and exhibits fast convergence. For these reasons, the LM algorithm is mostly used in bundle adjustment.
Let f be an assumed functional relation which maps a parameter vector $p \in \mathbb{R}^m$ to an estimated measurement vector $\hat{x} = f(p)$, $\hat{x} \in \mathbb{R}^n$. An initial parameter estimate $p_0$ and a measured vector x are provided, and it is desired to find the vector $p^*$ that best satisfies the functional relation f locally, that is, minimizes the squared distance $\varepsilon^{\top}\varepsilon$ with $\varepsilon = x - \hat{x}$ for all p within a sphere having a certain, small radius. The basis of the LM algorithm is an affine approximation to f in the neighborhood of p. For a small $\|\delta_p\|$, f is approximated by (see [5, p. 75])
\[ f(p + \delta_p) \approx f(p) + J \delta_p, \]
where J is the Jacobian of f.
At each iteration, it is required to find the step $\delta_p$ that minimizes the quantity $\|x - f(p + \delta_p)\| \approx \|x - f(p) - J\delta_p\| = \|\varepsilon - J\delta_p\|$. The minimum is attained when $J\delta_p - \varepsilon$ is orthogonal to the column space of J. This leads to $J^{\top}(J\delta_p - \varepsilon) = 0$, which yields $\delta_p$ as the solution of the so-called normal equations [10]:
\[ J^{\top} J \delta_p = J^{\top} \varepsilon. \tag{4.2} \]
The matrix $J^{\top} J$ in the above equation is the first-order approximation to the Hessian of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ [16] and $\delta_p$ is the Gauss-Newton step. $J^{\top}\varepsilon$ corresponds to the steepest descent direction, since the gradient of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ is $-J^{\top}\varepsilon$. The LM algorithm actually solves a slight variation of Equation (4.2), known as the augmented normal equations
\[ N \delta_p = J^{\top} \varepsilon, \qquad \text{with } N \equiv J^{\top} J + \mu I, \ \mu > 0. \tag{4.3} \]
The strategy of altering the diagonal elements of $J^{\top} J$ is called damping and $\mu$ is referred to as the damping term. It is decreased when the updated parameter vector $p + \delta_p$, with $\delta_p$ computed from Equation (4.3), leads to a reduction in the error $\varepsilon^{\top}\varepsilon$; otherwise it is increased, the augmented normal equations are solved again, and this process iterates until a value of $\delta_p$ that decreases the error is found.
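To make the damping strategy concrete, the following self-contained sketch runs the LM loop on a tiny synthetic problem, fitting the two-parameter model $\hat{x} = p_1 e^{p_2 t}$ to five data points and solving the 2 × 2 augmented normal equations directly. The model, data and constants are chosen only for illustration and have nothing to do with the BA cost function.

#include <math.h>
#include <stdio.h>

#define NPTS 5
/* Model x = p1 * exp(p2 * t); data generated from p = (2, 0.5). */
static const double t[NPTS]  = { 0.0, 0.5, 1.0, 1.5, 2.0 };
static const double xm[NPTS] = { 2.0, 2.568, 3.297, 4.234, 5.437 };

static double sq_error(const double p[2], double eps[NPTS])
{
    double e2 = 0.0;
    for (int i = 0; i < NPTS; ++i) {
        eps[i] = xm[i] - p[0] * exp(p[1] * t[i]);   /* residual eps = x - f(p) */
        e2 += eps[i] * eps[i];
    }
    return e2;
}

int main(void)
{
    double p[2] = { 1.0, 0.0 };          /* initial guess p0       */
    double mu = 1e-3;                    /* damping term           */
    double eps[NPTS];
    double e2 = sq_error(p, eps);

    for (int it = 0; it < 50; ++it) {
        /* Build the 2x2 augmented normal equations N*dp = J^T*eps, eq. (4.3). */
        double N00 = mu, N01 = 0.0, N11 = mu, g0 = 0.0, g1 = 0.0;
        for (int i = 0; i < NPTS; ++i) {
            double J0 = exp(p[1] * t[i]);           /* d f / d p1 */
            double J1 = p[0] * t[i] * J0;           /* d f / d p2 */
            N00 += J0 * J0;  N01 += J0 * J1;  N11 += J1 * J1;
            g0  += J0 * eps[i];  g1 += J1 * eps[i];
        }
        double det = N00 * N11 - N01 * N01;
        double dp0 = ( N11 * g0 - N01 * g1) / det;
        double dp1 = (-N01 * g0 + N00 * g1) / det;

        double pn[2] = { p[0] + dp0, p[1] + dp1 };
        double epsn[NPTS];
        double e2n = sq_error(pn, epsn);
        if (e2n < e2) {                  /* error decreased: accept, relax damping */
            p[0] = pn[0]; p[1] = pn[1]; e2 = e2n; mu *= 0.5;
            for (int i = 0; i < NPTS; ++i) eps[i] = epsn[i];
        } else {                         /* error increased: reject, raise damping */
            mu *= 10.0;
        }
    }
    printf("p = (%.4f, %.4f), squared error = %g\n", p[0], p[1], e2);
    return 0;
}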
Chapter 5
Overview of NVIDIA CUDA
(The material in this chapter draws on [6].)
By introducing CUDA (Compute Unified Device Architecture), NVIDIA has given programmers the opportunity to capitalize on inexpensive, generally available, massively parallel computing hardware. Teraflop computing is now within the economic reach of most people around the world. The impact of GPGPU (General-Purpose Graphics Processing Units) technology spans all aspects of computation, from cell phones to the largest supercomputers. Programmable GPUs are deployed in areas of scientific computing, cloud computing, computer visualization, simulations, games, and so on.
Programming for GPGPU requires a basic knowledge of the GPU architecture, because even small changes in data structures or in the program can make significant differences in performance. Modern GPUs belong in principle to the SIMD class of Flynn's taxonomy. That means that GPUs are capable of doing the same operation on multiple data simultaneously. The restriction is one operation at a time, which narrows the set of problems worth parallelizing on a GPU. On the other hand, well-vectorized problems are able to achieve an acceleration of two or more orders of magnitude over multi-core processors (Top 100 NVIDIA CUDA application showcase speedups: min 100, max 2600, median 1350; published May 9, 2011).
To ensure the best performance of GPGPU, the following three rules should be met.
1. Get the data on the GPGPU and keep it there. GPGPUs are separate devices plugged into the PCI Express bus of the host computer, which is very slow compared to the GPGPU memory system (20 to 28 times slower).
2. Give the GPGPU enough work to do. CUDA-enabled GPUs deliver teraflop performance and they are fast enough to complete small problems faster than the host processor can start kernels. Each thread should perform as many instructions as possible to hide this latency.
3. Focus on data reuse within the GPGPU to avoid memory bandwidth limitations. All high-performance CUDA applications exploit internal resources on the GPU (registers, shared memory) to bypass global memory bottlenecks.
5.1 The CUDA Execution Model
The heart of CUDA performance lies in the execution model and the simple partitioning of a computation into fixed-sized blocks of threads in the execution configuration. CUDA naturally maps the parallelism within an application to the massive parallelism of the GPGPU hardware. The result is compatibility across older and future generations of GPUs.
GPU hardware parallelism is achieved through replication of a common architectural building block called a streaming multiprocessor (SM). Figure 5.1 illustrates 16 SMs on a GF100 (Fermi) series GPGPU. The software abstraction of a thread block translates into a natural mapping of the kernel onto an arbitrary number of SMs on a GPGPU. Each SM can be scheduled (by the GigaThread global scheduler) to run one or more thread blocks. Therefore, thread blocks are independent and cannot be synchronized during kernel execution. (Atomic operations are an exception: they allow threads of different blocks to communicate. They should be used only in reasonable situations, as atomic operations may introduce scalability and performance issues.) Thread blocks also act as containers of thread cooperation, as only threads in a thread block can share data. Threads in a thread block can utilize high-speed memory inside the SM, called shared memory, for data sharing.
Figure 5.2b depicts the composition of one of the 16 streaming multiprocessors in a GF100 GPU. SIMD cores require less power and space than non-SIMD cores. As a result, GPGPUs have a high flop-per-watt ratio compared to conventional CPUs [25]. The threads running on a multiprocessor are partitioned into groups in which all threads execute the same instruction simultaneously. On the CUDA architecture, these groups are called warps; each warp has 32 threads, and this execution model is referred to as SIMT (Single Instruction Multiple Threads) [18].
GPGPUs are not true SIMD machines (but SIMT), since only the individual streaming multiprocessors are SIMD and different multiprocessors may be running different instructions. Conditionals (if statements) can decrease performance inside an SM because each branch of each conditional must be evaluated. This can cause a slowdown of $2^n$ for n nested conditionals.
Figure 5.1: Block diagram of a GF100 (Fermi) GPU [2]
5.2 GPU Memory
For the highest performance of applications developed for a GPU, data inside the SM must be reused. The reason is that the on-board global memory (DRAM in figure 5.2a) is not fast enough when all SMs want to perform read/write operations. CUDA provides configurable caches for each SM to give the opportunity for data reuse. Awareness of the difference between on-board (GPU) and on-chip (SM) memory is the key to achieving the highest performance that a GPGPU can provide.
The fastest and most scalable is the on-chip SM memory. However, it is limited to a few KB. The on-board global memory is accessible by all the SMs across the GPU and is measured in GB. The significant bandwidth gap between on-board and on-chip memories can be seen in table 5.3. Although the high bandwidth of shared memory can greatly accelerate applications, it is still too slow to achieve peak performance [24].
(a) Memory hierarchy [1] (b) Block diagram [1]
Figure 5.2: Streaming multiprocessor of a GF100 (Fermi) GPU

Register memory 8000 GB/s
Shared memory 1600 GB/s
Global memory 177 GB/s
Mapped memory 8 GB/s
Figure 5.3: Bandwidth of various GPU memories [6, p. 111]

Example 4. Consider computing a simple element-wise vector product
for( i = 0; i < N; i++ ) c[i] = a[i] * b[i];
on a GPU utilizing only global memory: the performance is limited by memory bandwidth. When 4-byte floating-point values are used, a 1 Tflop GPU would require 12 TB/s of memory bandwidth. A GPU with 177 GB/s of memory bandwidth can therefore deliver only about 14 Gflop/s (1.4% of the potential 1 Tflop performance).
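For completeness, a minimal CUDA version of the loop above could look as follows (a generic sketch, not code from the thesis). It performs one multiplication per 12 bytes of global-memory traffic, which is exactly why it remains bandwidth-bound.

__global__ void vec_mul(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];   /* 1 flop per 12 bytes of global-memory traffic */
}

/* launch example: vec_mul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); */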
When programming for a GPU, it is necessary to reuse data within the SM (to exploit data locality). GPGPUs support two types of data locality: temporal locality (LRU, least recently used) means that recently accessed data is likely to be used again in the future, and spatial locality means that neighbouring data is cached to be used in the future.
For compute capability 2.0 or higher, the benefit of constant or texture memory, used for effective data broadcasting to all threads, is largely superseded by the global memory. This is because compute capability 2.0 devices contain SMs with an L1 cache and a unified L2 cache that speed up access to global memory.
Chapter 6
Analysis of the Problem
As I have mentioned in the Introduction, finding the solution of a linear system is the most compute-demanding part of the problem of solving a nonlinear system. At each iteration, a linear system Ax = b must be solved. Bundle adjustment (BA), as a least squares problem, works with sparse linear systems of a special structure (doubly bordered block diagonal). A similar structure can be obtained when applying nested dissection ordering to a diagonal-based matrix A (band-diagonal, block tridiagonal, ...). The implemented GPU solver can be used for BA when the information about the structure of the matrix A is provided by BA, or for a diagonal-based matrix when the information about the structure is provided by the ordering function.
6.1 Structure of Linear Systems in BA
The system of augmented normal equations (4.3) arises in BA and is solved at each iteration of the Levenberg-Marquardt algorithm. The matrix J is the Jacobian and N is the first-order approximation of the Hessian. The structure of N can be exactly determined from the input parameters of the BA problem.
Example 5. [14, p. 9] Consider that we want to optimize the parameters of 3 cameras and 4 3D points visible in all cameras. The measurement vector
\[ X = (x_{11}^{\top}, x_{12}^{\top}, x_{13}^{\top}, x_{21}^{\top}, x_{22}^{\top}, x_{23}^{\top}, x_{31}^{\top}, x_{32}^{\top}, x_{33}^{\top}, x_{41}^{\top}, x_{42}^{\top}, x_{43}^{\top})^{\top} \]
is made up of the measured image point coordinates across all cameras, $x_{ij}$ being the projection of the i-th point on the j-th camera. The parameter vector
\[ P = (a_1^{\top}, a_2^{\top}, a_3^{\top}, b_1^{\top}, b_2^{\top}, b_3^{\top}, b_4^{\top})^{\top} \]
is defined by all parameters describing the 3 projection matrices and the 4 3D points. Let $A_{ij}$ and $B_{ij}$ denote $\partial x_{ij} / \partial a_j$ and $\partial x_{ij} / \partial b_i$, respectively; $\partial x_{ij} / \partial a_k = 0$ for $j \ne k$ and $\partial x_{ij} / \partial b_k = 0$ for $i \ne k$. Employing this notation, the Jacobian can be written as
\[ J = \frac{\partial X}{\partial P} = \begin{pmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 & 0 \\
0 & A_{12} & 0 & B_{12} & 0 & 0 & 0 \\
0 & 0 & A_{13} & B_{13} & 0 & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 & 0 \\
0 & 0 & A_{23} & 0 & B_{23} & 0 & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} & 0 \\
0 & 0 & A_{33} & 0 & 0 & B_{33} & 0 \\
A_{41} & 0 & 0 & 0 & 0 & 0 & B_{41} \\
0 & A_{42} & 0 & 0 & 0 & 0 & B_{42} \\
0 & 0 & A_{43} & 0 & 0 & 0 & B_{43}
\end{pmatrix}. \tag{6.1} \]
Then, the approximation of the Hessian (the matrix N from Equation (4.3)) and the augmented normal equations have the form
\[ \begin{pmatrix}
U_1 & 0 & 0 & W_{11} & W_{21} & W_{31} & W_{41} \\
0 & U_2 & 0 & W_{12} & W_{22} & W_{32} & W_{42} \\
0 & 0 & U_3 & W_{13} & W_{23} & W_{33} & W_{43} \\
W_{11}^{\top} & W_{12}^{\top} & W_{13}^{\top} & V_1 & 0 & 0 & 0 \\
W_{21}^{\top} & W_{22}^{\top} & W_{23}^{\top} & 0 & V_2 & 0 & 0 \\
W_{31}^{\top} & W_{32}^{\top} & W_{33}^{\top} & 0 & 0 & V_3 & 0 \\
W_{41}^{\top} & W_{42}^{\top} & W_{43}^{\top} & 0 & 0 & 0 & V_4
\end{pmatrix}
\begin{pmatrix}
\delta_{a_1} \\ \delta_{a_2} \\ \delta_{a_3} \\ \delta_{b_1} \\ \delta_{b_2} \\ \delta_{b_3} \\ \delta_{b_4}
\end{pmatrix}
=
\begin{pmatrix}
\varepsilon_{a_1} \\ \varepsilon_{a_2} \\ \varepsilon_{a_3} \\ \varepsilon_{b_1} \\ \varepsilon_{b_2} \\ \varepsilon_{b_3} \\ \varepsilon_{b_4}
\end{pmatrix}. \tag{6.2} \]
Denoting the upper left, lower right, and upper right parts of the matrix in Equation (6.2), respectively, by U, V and W allows us to rewrite the augmented normal equations (4.3) compactly as
\[ \begin{pmatrix} U^* & W \\ W^{\top} & V^* \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a \\ \varepsilon_b \end{pmatrix}, \tag{6.3} \]
where * designates the augmentation of the diagonal elements of U and V.
Now, let us compare the structure of the Hessian in Equation (6.2) with the Hessian of a bigger BA problem (figure 6.1). The upper left part (U) corresponds to the approximation of the second derivatives of the camera parameters, the lower right part (V) to the approximation of the second derivatives of the 3D points, and the upper right part (W) to the mixed camera-point derivatives.

(a) Original input matrix (b) Rotated by 180 degrees with marked parts (see also figure 7.1 for comparison)
Figure 6.1: An example of a modestly sized Hessian in BA. This is the sparsity pattern of a 992 × 992 normal equation matrix (i.e. approximate Hessian). Black regions correspond to nonzero elements [14, p. 27]
6.2 Block Cholesky Decomposition for BA
Lourakis and Argyros [14] suggest solving the augmented normal equations (6.3) arising in BA in two steps (first for $\delta_a$ and then for $\delta_b$) as follows. Left multiplication of Equation (6.3) by the block matrix
\[ \begin{pmatrix} I & -W V^{*-1} \\ 0 & I \end{pmatrix} \tag{6.4} \]
results in
\[ \begin{pmatrix} U^* - W V^{*-1} W^{\top} & 0 \\ W^{\top} & V^* \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a - W V^{*-1} \varepsilon_b \\ \varepsilon_b \end{pmatrix}. \]
Since the top right block of the above left-hand matrix is zero, $\delta_a$ can be determined from its top half, which is
\[ (U^* - W V^{*-1} W^{\top})\, \delta_a = \varepsilon_a - W V^{*-1} \varepsilon_b. \tag{6.5} \]
The matrix $S \equiv U^* - W V^{*-1} W^{\top}$ is the Schur complement of $V^*$ in the left-hand side matrix of (6.3) and is also positive definite [19]. Linear system (6.5) is solved for $\delta_a$ using the Cholesky decomposition of S. $\delta_b$ is then computed by solving $V^* \delta_b = \varepsilon_b - W^{\top} \delta_a$.
This approach has a big advantage: an absence of fill-ins during the computation. The approach explained in the next example is slightly different [21, p. 102].
Example 6. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix that can be divided into four submatrices $A_{11}$, $A_{12}$, $A_{21}$ and $A_{22}$. Then, according to Theorem 2, the Cholesky decomposition $A = LL^{\top}$ exists, where L is a lower triangular matrix with strictly positive diagonal entries. If the matrix A consists of four submatrices, the equation $A = LL^{\top}$ can be rewritten as
\[ A = \begin{pmatrix} A_{11} & A_{21}^{\top} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^{\top} & L_{21}^{\top} \\ 0 & L_{22}^{\top} \end{pmatrix}. \]
The aim of the block Cholesky decomposition is to compute the values in the $L_{11}$, $L_{21}$, $L_{22}$ submatrices (or $L_{11}^{\top}$, $L_{21}^{\top}$, $L_{22}^{\top}$, respectively). The whole process can be divided into three steps:
1. $A_{11} = L_{11} L_{11}^{\top}$ (Cholesky decomposition)
2. $L_{21}^{\top} = L_{11}^{-1} A_{21}^{\top}$ from $A_{21}^{\top} = L_{11} L_{21}^{\top}$ (equivalently, $L_{21} = A_{21} L_{11}^{-\top}$ from $A_{21} = L_{21} L_{11}^{\top}$)
3. $A_{22} - L_{21} L_{21}^{\top} = L_{22} L_{22}^{\top}$ (Cholesky decomposition)
The first two steps can be done simultaneously during the decomposition process. The last step updates the $A_{22}$ submatrix with the matrix $A^S_{22}$, called the Schur complement of $A_{11}$ in the matrix A, which can be expressed as
\[ A^S_{22} = A_{22} - A_{21} A_{11}^{-1} A_{21}^{\top}
           = A_{22} - L_{21} L_{11}^{\top} (L_{11} L_{11}^{\top})^{-1} L_{11} L_{21}^{\top}
           = A_{22} - L_{21} (L_{11}^{\top} L_{11}^{-\top})(L_{11}^{-1} L_{11}) L_{21}^{\top}
           = A_{22} - L_{21} L_{21}^{\top}, \tag{6.6} \]
and then factorizes the result.
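The three steps can be written down compactly for dense blocks. The following plain C sketch is only an illustration of the procedure (it works with the lower factors, whereas the GPU solver described later stores the upper triangles) and is not code from the solver.

#include <math.h>

/* In-place lower Cholesky of a dense n x n SPD matrix (row-major).
 * On return the lower triangle holds L with A = L*L^T. */
static void dense_chol(double *A, int n)
{
    for (int k = 0; k < n; ++k) {
        A[k * n + k] = sqrt(A[k * n + k]);
        for (int i = k + 1; i < n; ++i)
            A[i * n + k] /= A[k * n + k];
        for (int i = k + 1; i < n; ++i)
            for (int j = k + 1; j <= i; ++j)
                A[i * n + j] -= A[i * n + k] * A[j * n + k];
    }
}

/* Step 2: given L11 (n1 x n1, lower) and A21 (n2 x n1), overwrite A21 with
 * L21, solved row-wise from A21 = L21 * L11^T by forward substitution. */
static void solve_L21(const double *L11, double *A21, int n1, int n2)
{
    for (int r = 0; r < n2; ++r)
        for (int j = 0; j < n1; ++j) {
            double s = A21[r * n1 + j];
            for (int k = 0; k < j; ++k)
                s -= L11[j * n1 + k] * A21[r * n1 + k];
            A21[r * n1 + j] = s / L11[j * n1 + j];
        }
}

/* Block Cholesky: step 1, step 2, Schur update (6.6), step 3. */
static void block_chol(double *A11, double *A21, double *A22, int n1, int n2)
{
    dense_chol(A11, n1);                 /* A11 = L11*L11^T            */
    solve_L21(A11, A21, n1, n2);         /* A21 now holds L21          */
    for (int i = 0; i < n2; ++i)         /* A22 <- A22 - L21*L21^T     */
        for (int j = 0; j <= i; ++j) {
            double s = 0.0;
            for (int k = 0; k < n1; ++k)
                s += A21[i * n1 + k] * A21[j * n1 + k];
            A22[i * n2 + j] -= s;
        }
    dense_chol(A22, n2);                 /* A22 = L22*L22^T            */
}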
Example 7. This method allows parallel computation when the diagonal blocks are independent, as for example in the linear system (6.7). Blocks $A_{11}$ and $A_{22}$ do not have any mutually dependent elements ($A_{12}$ and $A_{12}^{\top}$ are zero matrices).
\[ \begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \tag{6.7} \]
After the first step, the blocks $A_{11}$, $A_{13}$, $A_{13}^{\top}$, $A_{22}$, $A_{23}$, $A_{23}^{\top}$ and the right-hand side parts $b_1$ and $b_2$ are updated in parallel and the system has the following form:
\[ \begin{pmatrix} L_{11}^{\top} & 0 & L_{11}^{-1} A_{13} \\ 0 & L_{22}^{\top} & L_{22}^{-1} A_{23} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
 = \begin{pmatrix} L_{11}^{\top} & 0 & L_{13}^{\top} \\ 0 & L_{22}^{\top} & L_{23}^{\top} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
 = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ b_3 \end{pmatrix}. \]
The next step is to update the block $A_{33}$ with the Schur complement $A^S_{33}$ of the matrix
\[ \begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix} \]
in the matrix A, which according to (6.6) is
\[ A^S_{33} = A_{33} - \begin{pmatrix} L_{13}^{\top} \\ L_{23}^{\top} \end{pmatrix}^{\top} \begin{pmatrix} L_{13}^{\top} \\ L_{23}^{\top} \end{pmatrix}, \]
and to update the vector $b_3$ with $b^S_3$, which equals
\[ b^S_3 = b_3 - \begin{pmatrix} L_{13}^{\top} \\ L_{23}^{\top} \end{pmatrix}^{\top} \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \end{pmatrix}. \]
Next, the linear system
\[ A^S_{33} x_3 = b^S_3 \]
is transformed using Gaussian elimination to
\[ L^{S\top}_{33} x_3 = L^{S-1}_{33} b^S_3 \]
and solved for $x_3$ using back substitution. Finally, the remaining parts of the vector x ($x_1$ and $x_2$) in the transformed system
\[ \begin{pmatrix} L_{11}^{\top} & 0 & L_{13}^{\top} \\ 0 & L_{22}^{\top} & L_{23}^{\top} \\ 0 & 0 & L^{S\top}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ L^{S-1}_{33} b^S_3 \end{pmatrix} \]
are now computed using only back substitution.
Chapter 7
Implementation
This chapter describes the chosen framework and implementation details such as the used structures, functions and data types of the practical output of the thesis, the linear direct solver (LDS).
7.1 Used Framework
The whole application was developed in a Linux environment (Xubuntu 12.04 for 64-bit PC and Debian 6.0 for 32-bit PC). The host code (for the CPU side) was written in ANSI C, the device code (for the GPU side) in CUDA (CUDA Driver 4.0). All object files were linked together into an executable file (ldsexam) using the NVCC compiler; no static or dynamic libraries were created (see my makefile).
7.2 Compressed Row Storage Format
Many formats for sparse matrices exist. One of the most general is the compressed row storage (CRS) format. It makes no assumptions about the sparsity pattern and stores only indices and nonzero elements. On the other hand, it is not very efficient because it needs an indirect addressing step for every scalar operation in a matrix-vector product. I have decided on this format for my CPU-side solver because it can be utilized effectively in the Cholesky decomposition.
The CRS format needs three vectors: nozval of floating-point numbers, and rowptr and colind of integers. The nozval vector stores the values of the nonzero elements of the matrix as they are traversed in a row-wise fashion. The colind vector stores the column indices of the elements in the nozval vector. That is, if nozval(k) = $a_{ij}$ then colind(k) = j. The rowptr vector stores the locations in the nozval vector that start a row, that is, if nozval(k) = $a_{ij}$ then rowptr(i) ≤ k < rowptr(i+1). By convention, rowptr(n+1) = nnz+1, where nnz is the number of all nonzeros.
Example 8. Consider the sparse symmetric matrix in figure 7.1.

      0  1  2  3  4  5
  0 [ 7  .  .  .  .  1 ]
  1 [ .  8  1  .  2  . ]
  2 [ .  1  8  .  3  2 ]
  3 [ .  .  .  9  3  2 ]
  4 [ .  2  3  3  9  3 ]
  5 [ 1  .  2  2  3  9 ]

Figure 7.1: Sample of a symmetric positive definite sparse 6 × 6 matrix with 22 nonzero elements

Its CRS representation has the following attributes: n = 6, nnz = 22,

rowptr (indices 0-6):   0  2  5  9 12 17 22
colind (indices 0-21):  0  5  1  2  4  1  2  4  5  3  4  5  1  2  3  4  5  0  2  3  4  5
nozval (indices 0-21):  7  1  8  1  2  1  8  3  2  9  3  2  2  3  3  9  3  1  2  2  3  9
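As a minimal illustration of the format (the type and function names here are chosen for the example only and use 0-based indexing as above), a CRS matrix and the matrix-vector product with its indirect addressing can be written as:

/* 0-based CRS storage as in the example above. */
typedef struct {
    int     n;        /* number of rows                 */
    int     nnz;      /* number of stored nonzeros      */
    int    *rowptr;   /* n+1 entries, rowptr[n] == nnz  */
    int    *colind;   /* nnz column indices             */
    double *nozval;   /* nnz values                     */
} crs_t;

/* y = A*x; note the indirect access x[colind[k]] mentioned in the text. */
static void crs_spmv(const crs_t *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            s += A->nozval[k] * x[A->colind[k]];
        y[i] = s;
    }
}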
7.3 Cholesky decomposition on CPU
The implementation of the sparse Cholesky decomposition (functions CRS_chol and CRS_chol_subs) was quite straightforward. Before these functions are called, a symbolical factorization must be performed, which determines the indices of the fill-ins and allocates space for them. For the purpose of Cholesky decomposition, it is sufficient to have only the lower or upper triangular matrix. This fact was exploited by skipping all elements from the beginning of each row to the main diagonal, which is done by CRS_shifted_rows. Another difference in the decomposition of sparse matrices lies in the necessity of altering the beginning of each row during the factorization. For that I have worked with temporary arrays rowbeg and rowend.
7.4 Ordering for CPU solver
In my solver, I have utilized approximate minimum degree (AMD) ordering
by Tim Davis that can be found also in MATLABs amd function. It minimizes
number of ll-ins very eectively and fast (see gure 3.2). For BA problem,
even faster ordering (but with more ll-ins) can be used and that is a simple
rotation of 180 degrees.
7.5 Block Matrix Format for GPU
There are three different parts in the matrix: the full diagonal blocks, the sparse border and the almost dense tail (light, middle and dark gray in figure 7.1). Analyzing the properties of these parts and the CUDA architecture, I have decided to use the following matrix data structure (MXBF); a sketch of it as a C struct is given after the attribute listing below.
Blocks: As there are many (from thousands to millions) full but small diagonal blocks ($V_i$), they can be stored in one array (data) in a row-wise manner. In BA the blocks have the same size, but when using METIS k-way ordering they do not. Because of that, for each block the information about its size must be stored (blksz). When iterating over the blocks, it is efficient to have an index saying where the data start for the i-th block (blkp). Only the upper part of each block is stored, but memory is allocated for the full block to avoid awkward indexing.
Border: This part contains the majority of the nonzero elements. Therefore, it must be stored as a sparse matrix; I have chosen the CRS format. Since the input matrix is symmetric, it is sufficient to store only one side of the border.
Tail: After computing the Schur complement, this part will be almost dense. Consequently, it is stored as a full matrix. Only the upper triangle is stored, but memory is allocated for the full matrix, as in the case of the blocks. The data for this part are stored in the data array as well, and tail points to the location where the data for the tail start.
The MXBF structure of the matrix from figure 7.1 has these attributes: n = 6, tail = 5 (where the data for the tail start in the array data), tailsz = 3, ndata = 14 (number of elements in blocks and tail), brd_nnz = 13 (number of nonzeros in the border),

blksz (indices 0-1):      1  2
blkp  (indices 0-1):      0  1
data  (indices 0-13):     7  8  1  0  8  9  3  2  0  9  3  0  0  9
brd_rowptr (indices 0-3): 0  1  2  4
brd_colind (indices 0-3): 2  1  1  2
brd_nozval (indices 0-3): 1  2  3  2
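Putting the fields named above together, the MXBF structure can be pictured roughly as the following C struct. This is only a reconstruction from the description; the actual definition in the solver may differ in field types, names and additional members (single-precision floats are assumed because of the GPU restrictions discussed later).

typedef struct {
    int    n;           /* order of the whole matrix                        */
    /* full diagonal blocks + tail, stored row-wise in one array            */
    float *data;        /* block data followed by the tail data             */
    int    ndata;       /* number of elements in blocks and tail            */
    int   *blksz;       /* size of the i-th diagonal block                  */
    int   *blkp;        /* offset of the i-th block inside data             */
    int    nblocks;     /* number of diagonal blocks                        */
    int    tail;        /* offset of the tail inside data                   */
    int    tailsz;      /* order of the (almost dense) tail                 */
    /* sparse border in CRS format (one triangle only)                      */
    int    brd_nnz;
    int   *brd_rowptr;
    int   *brd_colind;
    float *brd_nozval;
} mxbf_t;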
7.6 Block Cholesky decomposition on GPU
Consider the block matrix
\[ \begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}, \]
where $A_{11}$ and $A_{22}$ are called blocks, $A_{13}$ and $A_{23}$ borders, and $A_{33}$ is the tail.
The block Cholesky decomposition consists of four main parts:
1. Eliminating the blocks ($A_{11} \to L_{11}^{\top}$ and $A_{22} \to L_{22}^{\top}$), updating the corresponding borders ($A_{13} \to L_{11}^{-1} A_{13}$ and $A_{23} \to L_{22}^{-1} A_{23}$), and updating the corresponding parts of the right-hand side of the linear system ($b_1 \to L_{11}^{-1} b_1$ and $b_2 \to L_{22}^{-1} b_2$). All these steps are done simultaneously (within the elimination loops). Each thread eliminates one block (in the test matrix it is of size 3 × 3) and updates its own part of the border and of the b vector (see the kernel sketch after this list). As the border part is sparse and can have an arbitrary number of nonzero elements, I store and access these data in global memory.
2. Computing the Schur complement
\[ A_{33} \to A^S_{33} = A_{33} - \begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}^{\top} \begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}. \]
There was a problem that the updated border part was stored in a row-wise manner and the transposed matrix was not available. Therefore, using a dot product for the matrix-matrix multiplication was not possible. I had to loop through the rows of the matrix and update the elements of the $A_{33}$ matrix at every multiplication. This is possible only when using atomic operations (atomicAdd). Moreover, even on compute capability 2.0 devices, atomicAdd can be used only for single-precision floats. I am aware of this restriction of the proposed approach.
3. Eliminating the tail ($A^S_{33} \to L^{S\top}_{33}$). This part surely has the biggest potential to exploit the full power of a GPU. Unfortunately, it was postponed due to lack of time. I have planned to call a function from the MAGMA library that is able to solve a dense linear system. In my solver, this part is performed on the CPU side.
4. Back substitution. Performed on the CPU side, first for the dense part $L^S_{33}$ and then for the sparse borders and full blocks.
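The core of the block elimination in step 1 can be sketched as the following kernel, in which every thread factorizes one small diagonal block stored row-wise in global memory. This is a simplified illustration only: it covers just the per-block factorization, omits the border and right-hand side updates, and is not the exact kernel used in the solver.

/* One thread factorizes one diagonal block in place (upper factor U with
 * U^T*U = A, matching the row-wise storage of the blocks). */
__global__ void eliminate_blocks(float *data, const int *blkp,
                                 const int *blksz, int nblocks)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nblocks) return;

    float *A = data + blkp[b];      /* this thread's s x s block */
    int    s = blksz[b];

    for (int k = 0; k < s; ++k) {
        A[k * s + k] = sqrtf(A[k * s + k]);
        for (int j = k + 1; j < s; ++j)
            A[k * s + j] /= A[k * s + k];
        for (int i = k + 1; i < s; ++i)
            for (int j = i; j < s; ++j)
                A[i * s + j] -= A[k * s + i] * A[k * s + j];
    }
}

/* launch example: eliminate_blocks<<<(nblocks + 127) / 128, 128>>>(d_data, d_blkp, d_blksz, nblocks); */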
7.7 Ordering for GPU solver
A requirement of my GPU solver is that the input matrix can be partitioned into such a structure as appears in the approximation of Hessians in the BA problem (see the matrix in Equation (6.2)). This can be achieved by applying nested dissection ordering recursively. The METIS k-way ordering was used for partitioning the input matrix into an independent block structure for the GPU solver. Figures 7.2 and 7.3 illustrate the structure of matrices from the MATLAB gallery reordered by k-way ordering. As BA has this structure implicitly and the size and number of independent blocks are known from the BA configuration, it needs only a rotation by 180 degrees to get a structure like in figure 6.1b.
(a) Original matrix (nz = 4861) (b) Reordered into 5 independent blocks (nz = 4861)
Figure 7.2: Performing k-way ordering on the diagonal-based matrix Wathen 10 × 10
(a) Original matrix (nz = 4380) (b) Reordered using k-way ordering into 10 independent blocks (nz = 4380)
Figure 7.3: Performing k-way ordering on the diagonal-based matrix Poisson 30
Chapter 8
Testing
Testing was performed on the following configuration: Intel i7-2600 CPU @ 3.40 GHz, 4 GB RAM, GeForce GTX 570, Debian 6.0 for 32-bit PC, CUDA Driver 4.0. Applications were compiled using GCC (version 4.3.5) and NVCC with -use-fast-math and the -O3 optimization mode.
To check the accuracy of my solvers I have used Octave to get the reference x vector. The solutions from Octave and from my solver were printed into files (x_octave.vec and x_result.vec) and the differences were compared with another Octave function (vec_ck).
The main testing input matrix was the approximation of the Hessian from a BA problem optimizing 3 parameters of 11049 3D points and 7 parameters of 22 cameras. The matrix is of size 33,301 × 33,301 and has 1,817,521 nonzero elements, saved in the Matrix Market coordinate format (data/jTj_mue.mtx).
8.1 Octave solvers
In Octave, I have tested the direct solver (left division operator \), the Preconditioned Conjugate Gradient solver (pcg) and the Preconditioned Conjugate Residuals solver (pcr). The iterative solvers were set to terminate after reaching 200 iterations or a residual norm less than $10^{-6}$. Table 8.1 shows the results. The Preconditioned Conjugate Residuals solver terminated after 45 iterations, but the result was wrong.
Method                   Time      Res. norm   Iterations
Left division operator   695 ms    1.283e-13   -
Conjugate gradient       1440 ms   4.128e-05   75
Conjugate residuals      1386 ms   NaN         45

Figure 8.1: Test of Octave solvers
8.2 CPU solver
After executing the CPU solver from the lds directory with the command
./bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx
the following information is printed:
load matrix: 1070 ms
load vector: 10 ms
symamd ord.: 80 ms
mat. reorder: 390 ms
symbolic: 500 ms
CRS_symbolic: 1834461 nnz
CPU CRS chol: 50 ms
all: 2120 ms
The number of nonzeros has not increased much (from 1,817,521 to 1,834,461),
which means that there are very few fill-ins (less than 1%). It can be seen
that my implemented functions for reordering the matrix and for the symbolic
factorization are not very efficient. The reason may be that the reordering is
performed by transforming the CRS format into the triplet (COO) format, which
is then permuted, sorted, and transformed back, and this requires a lot of data
movement (see the sketch below). Although finding the ordering takes more time
than solving the whole linear system, without it (try commenting it out in
ldscpuexam.c) the computation takes more than several minutes. Executing all
the functions required to find the solution takes about 1 second.
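For illustration, the triplet round-trip described above amounts to something like the following C sketch. The function name crs_sym_permute is illustrative and this is not the actual code in crs.c or ord.c; it also assumes the full symmetric pattern is stored (with only the lower triangle stored, permuted entries landing above the diagonal would additionally need their row and column indices swapped).

#include <stdlib.h>

typedef struct { int r, c; double v; } triplet;

static int cmp_triplet(const void *a, const void *b)
{
    const triplet *x = a, *y = b;
    if (x->r != y->r) return x->r - y->r;
    return x->c - y->c;
}

/* Apply a symmetric permutation to a CRS matrix via a COO round-trip.
 * perm_inv[old] = new.  row_ptr (n+1), col_idx (nnz) and val (nnz) are
 * the usual CRS arrays and are rewritten in place. */
void crs_sym_permute(int n, int *row_ptr, int *col_idx, double *val,
                     const int *perm_inv)
{
    int i, j, k = 0, nnz = row_ptr[n];
    triplet *t = malloc(nnz * sizeof(triplet));

    /* 1. CRS -> COO with relabelled row and column indices. */
    for (i = 0; i < n; i++)
        for (j = row_ptr[i]; j < row_ptr[i + 1]; j++, k++) {
            t[k].r = perm_inv[i];
            t[k].c = perm_inv[col_idx[j]];
            t[k].v = val[j];
        }

    /* 2. Sort the triplets back into row-major order. */
    qsort(t, nnz, sizeof(triplet), cmp_triplet);

    /* 3. COO -> CRS: count entries per row, prefix-sum, copy back. */
    for (i = 0; i <= n; i++) row_ptr[i] = 0;
    for (k = 0; k < nnz; k++) row_ptr[t[k].r + 1]++;
    for (i = 0; i < n; i++) row_ptr[i + 1] += row_ptr[i];
    for (k = 0; k < nnz; k++) { col_idx[k] = t[k].c; val[k] = t[k].v; }

    free(t);
}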
The command
octave -q --eval="vec_ck( x_octave.vec, x_result.vec );"
outputs the residual norm of the difference from the reference Octave solution
and reports where the biggest difference is:
max err: 0.0000000228 at 138th element
res nrm: 0.0000000000
8.3 GPU solver
To check the correctness of the GPU solver, I have also implemented the same
block algorithm on the CPU side (to use it, the constant BLOCK_CHOLESKY_CPU
must be uncommented and the project recompiled with make). Then ldsgpuexam
is performed on the CPU.
Calling
./bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx
gives these results:
load matrix: 1060 ms
load vector: 20 ms
kway ord.: 20 ms
mat. reorder: 370 ms
symbolic: 500 ms
CRS_symbolic: 1834083 nnz
MXBF_from_crs: 11049 blocks
858522 border nnz
123157 block and tail data
block matrix: 10 ms
elim. blks: 10 ms
tail update: 30 ms
elim tail: 0 ms
back subs: 0 ms
CPU block chol: 40 ms
all: 2020 ms
This solver, which exploits the special structure of BA, runs faster than the
general CPU solver (40 vs. 50 ms). Checking against the reference solution gives:
max err: 0.0000221960 at 59th element.
res nrm: 0.0000000010
The output of the solver actually running on the GPU is:
load matrix: 1070 ms
load vector: 10 ms
kway ord.: 20 ms
mat. reorder: 380 ms
symbolic: 500 ms
CRS_symbolic: 1834083 nnz
MXBF_from_crs: 11049 blocks
858522 border nnz
123157 block and tail data
block matrix: 10 ms
elim on GPU:
elim without copy: 15.1688 ms
elim with copy: 20.0004 ms
elim blocks + tail update: 420 ms
elim tail: 0 ms
back subs: 0 ms
GPU block chol: 430 ms
all: 2430 ms
with residual norm:
max err: 0.0000072417 at 103th element.
res nrm: 0.0000000003
The GPU solver must be run with single-precision floats because of the
atomicAdd operations. "elim without copy" is the time needed for the
elimination of the blocks and the tail update (computing the Schur complement).
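The single-precision restriction comes from the accumulation step: during the tail update, many CUDA blocks add their per-block contributions into the same entries of the dense tail, so the additions must be atomic, and the hardware used here provides atomicAdd only for 32-bit floats. The kernel below is a minimal sketch of that idea, not the actual kernel in mxbf_chol.cu; the names, data layout, and launch configuration are illustrative.

/* contrib: nblocks dense m*m update tiles; tail: the shared dense m*m tail.
 * Launched e.g. as accumulate_tail<<<dim3((m*m+255)/256, nblocks), 256>>>(...). */
__global__ void accumulate_tail(const float *contrib, float *tail,
                                int m, int nblocks)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;  /* entry inside an m*m tile */
    int b = blockIdx.y;                             /* which independent block  */
    if (e < m * m && b < nblocks)
        /* Several blocks update the same tail entry, hence the atomic;
         * a float atomicAdd is available here, a double version is not. */
        atomicAdd(&tail[e], -contrib[(size_t)b * m * m + e]);
}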
8.4 CUSP solvers
CUSP is a C++ template library that implements parallel algorithms for
sparse matrix and graph computations. It provides a variety of iterative
solvers such as Conjugate-Gradient (CG), Biconjugate Gradient (BiCG), Bi-
conjugate Gradient Stabilized (BiCGstab), Generalized Minimum Residual
(GMRES), Multi-mass Conjugate-Gradient (CG-M) and Multi-mass Bicon-
jugate Gradient Stabilized (BiCGstab-M). I have tested two of them with the
maximum number of iterations set to 200 and the relative error set to 10^-6.
Table 8.2 shows the results.
Method     Time     Max. error   Iterations
CG         50 ms    3.8e-08      77
BiCGstab   90 ms    2.3e-08      76

Table 8.2: Test of iterative CUSP solvers. Max. error is the maximal
difference from Octave's reference solution.
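For reference, the CUSP measurements above were obtained with calls of roughly the following form. This is a sketch rather than the exact test harness: the headers and types follow the CUSP API of that time (csr_matrix, default_monitor, krylov::cg), and the right-hand side below is only a placeholder where the tests used the g.mtx vector.

#include <cusp/array1d.h>
#include <cusp/csr_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>

int main(void)
{
    /* Load the test matrix into a CSR matrix in device memory. */
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "data/jtj_mueI.mtx");

    /* Zero initial guess and a placeholder right-hand side. */
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

    /* Stop after 200 iterations or a relative residual below 1e-6. */
    cusp::default_monitor<float> monitor(b, 200, 1e-6f);

    /* Conjugate Gradient; cusp::krylov::bicgstab is called analogously. */
    cusp::krylov::cg(A, x, b, monitor);

    return 0;
}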
Chapter 9
Conclusion
The aim of this thesis was to study linear direct solvers and then implement
a linear direct GPU solver for the BA problem. The implementation of a GPU
solver was preceded by studying the mathematical background of linear direct
solvers. First, a CPU solver had to be implemented, and other important
concepts of direct sparse solvers had to be mastered, such as symbolic
factorization, working with the CRS matrix format, and applying ordering
techniques. I can say that my CPU solver is fast and reliable when solving
positive definite linear systems. This was done in the first half of the
academic year.
In the second half of the year, I started experimenting with the METIS k-way
ordering and with how to utilize it for solving general sparse systems in
parallel. Although this approach is fully usable, it has drawbacks such as the
slow computation of the ordering, a relatively big tail part, and independent
blocks of different sizes.
Simultaneously, I analyzed the BA problem and the structure of the sparse
linear systems arising in the Levenberg-Marquardt algorithm. As the structure
produced by the k-way ordering is the same as the structure of BA, I tried to
write a solver that could be general (the needed information about the block
matrix is given by the k-way ordering) and specific at the same time (in that
case the information about the block matrix is provided by the BA
configuration). The general solver on the GPU is not finished (a special
symbolic factorization is missing).
The GPU solver specialized for BA was implemented, but it provides only a very
small speedup in comparison with the CPU solver. The reason is that only global
memory on the GPU was used for all computations.
In the testing phase, I found out that iterative solvers have a great potential
to solve these linear systems very fast. The advantage of iterative solvers is
their configurable accuracy, which can be sufficient for iterative nonlinear
solvers.
When combined with a preconditioner, the solution should be found very fast.
On the other hand, when using direct solvers, the symbolic factorization is
computed only once in the LM algorithm, and direct solvers in general give
more accurate results. From my experiments, I suggest using a direct solver on
the CPU together with a dense GPU solver for factorizing the Schur complement.
I realize that a detailed study of the SBA (Sparse Bundle Adjustment) package
is missing, as well as testing of the practical utilization of the GPU solvers
within this package.
Appendix A
List of Abbreviations
3D Three-Dimensional
BA Bundle Adjustment
CPU Central Processing Unit
CUDA Compute Unied Device Architecture
CRS Compressed Row Storage
GPGPU General-Purpose Computing on Graphics Processing Unit
GPU Graphics Processing Unit
LDS Linear Direct Solver (output of this thesis)
LM Levenberg-Marquardt (algorithm)
LRU Least Recently Used
SBA Sparse Bundle Adjustment
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Threads
SM Streaming Multiprocessor
Appendix B
User Manual
B.1 Requirements
All code was written in ANSI C and CUDA and tested on 64-bit Linux
(Xubuntu distribution) with GCC 4.4.6. For successful compilation, the package
libscotchmetis-dev is required, and the install paths for CUDA and the METIS
include files must be set properly in the makefile. After compilation, the
executables ldscpuexam and ldsgpuexam will be created in the bin directory.
B.2 Usage
ldscpuexam <A.mtx> <b.vec>
ldsgpuexam <A.mtx> <b.vec>
bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx or
bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx for the tested matrix.
A.mtx is a symmetric positive definite matrix of size n × n stored in Matrix
Market format, and b.vec is the right-hand side n × 1 vector of the equation
system, also stored in Matrix Market format. Timing information will be printed
on stdout and the solution will be stored in a file named x_result.vec.
For testing the correctness of the solution, the Octave function vec_ck can be
used; it is callable from the command line:
octave -q --eval='vec_ck( "x_result.vec", "x_octave.vec" );'
Appendix C
Contents of the Attached CD
.
+-- lds
| +-- bin
| +-- data
| | +-- g.mtx
| | +-- jtj_mueI.mtx
| | +-- test_thesis.mtx
| +-- makefile
| +-- obj
| +-- octave
| | +-- matrix_load.m
| | +-- octave_solver.m
| | +-- spy_print.m
| +-- README.txt
| +-- src
| | +-- colamd.c
| | +-- colamd_global.c
| | +-- colamd.h
| | +-- crs.c
| | +-- crs.h
| | +-- etree.c
| | +-- etree.h
| | +-- ldscpuexam.c
| | +-- ldsgpuexam.c
| | +-- mxbf.cu
| | +-- mxbf.h
| | +-- mxbf_chol.cu
| | +-- ord.c
| | +-- ord.h
| | +-- UFconfig.c
| | +-- UFconfig.h
| | +-- uni.c
| | +-- uni.h
| | +-- vec.c
| | +-- vec.h
| +-- vec_ck.m
| +-- x_octave.vec
+-- text
+-- Ivancik_thesis_2012.pdf
7 directories, 31 files