
Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector
Multiplication Using Compressed Sparse Blocks

Aydın Buluç (aydin@cs.ucsb.edu) and John R. Gilbert (gilbert@cs.ucsb.edu)
Dept. of Computer Science, University of California, Santa Barbara, CA 93106

Jeremy T. Fineman (jfineman@csail.mit.edu) and Charles E. Leiserson (cel@mit.edu)
MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139

Matteo Frigo (matteo@cilk.com)
Cilk Arts, Inc., 55 Cambridge Street, Suite 200, Burlington, MA 01803

Abstract

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/√n lg n), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.

Categories and Subject Descriptors

F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems — computations on matrices; G.4 [Mathematics of Computing]: Mathematical Software — parallel and vector implementations

General Terms

Algorithms, Design, Experimentation, Performance, Theory

Keywords

Compressed sparse blocks, compressed sparse columns, compressed sparse rows, matrix transpose, matrix-vector multiplication, multithreaded algorithm, parallelism, span, sparse matrix, storage format, work.

This work was supported in part by the National Science Foundation under Grants 0540248, 0615215, 0712243, 0822896, and 0709385, and by MIT Lincoln Laboratory under contract 7000012980.

1 Introduction

When multiplying a large n × n sparse matrix A having nnz nonzeros by a dense n-vector x, the memory bandwidth for reading A can limit overall performance. Consequently, most algorithms to compute Ax store A in a compressed format. One simple "tuple" representation involves storing each nonzero of A as a triple consisting of its row index, its column index, and the nonzero value itself. This representation, however, requires storing 2 nnz row and column indices, in addition to the nonzeros. The current standard storage format for sparse matrices in scientific computing, compressed sparse rows (CSR) [32], is more efficient, because it stores only n + nnz indices or pointers. This reduction in storage of CSR compared with the tuple representation tends to result in faster serial algorithms.

In the domain of parallel algorithms, however, CSR has its limitations. Although CSR lends itself to a simple parallel algorithm for computing the matrix-vector product Ax, this storage format does not admit an efficient parallel algorithm for computing the product Aᵀx, where Aᵀ denotes the transpose of the matrix A — or, equivalently, for computing the product xᵀA of a row vector xᵀ by A. Although one could use compressed sparse columns (CSC) to compute Aᵀx, many applications, including iterative linear system solvers such as biconjugate gradients and quasi-minimal residual [32], require both Ax and Aᵀx. One could transpose A explicitly, but computing the transpose for either CSR or CSC formats is expensive. Moreover, since matrix-vector multiplication for sparse matrices is generally limited by memory bandwidth, it is desirable to find a storage format for which both Ax and Aᵀx can be computed in parallel without performing more than nnz fetches of nonzeros from the memory to compute either product.

This paper presents a new storage format called compressed sparse blocks (CSB) for representing sparse matrices. Like CSR and CSC, the CSB format requires only n + nnz words of storage for indices. Because CSB does not favor rows over columns or vice versa, it admits efficient parallel algorithms for computing either Ax or Aᵀx, as well as for computing Ax when A is symmetric and only half the matrix is actually stored.

Previous work on parallel sparse matrix-vector multiplication has focused on reducing communication volume in a distributed-memory setting, often by using graph or hypergraph partitioning techniques to find good data distributions for particular matrices ([7, 38], for example). Good partitions generally exist for matrices whose structures arise from numerical discretizations of partial differential equations in two or three spatial dimensions. Our work, by contrast, is motivated by multicore and manycore architectures, in which parallelism and memory bandwidth are key resources. Our algorithms are efficient in these measures for matrices with arbitrary nonzero structure.

Figure 1 presents an overall summary of achieved performance. The serial CSR implementation uses plain OSKI [39] without any matrix-specific optimizations. The graph shows the average performance over all our test matrices except for the largest, which failed to run on Star-P [34] due to memory constraints. The performance is measured in Mflops (Millions of FLoating-point OPerationS) per second. Both Ax and Aᵀx take 2 nnz flops. To measure performance, we divide this value by the time it takes for the computation to complete. Section 7 provides detailed performance results.

The remainder of this paper is organized as follows. Section 2 discusses the limitations of the CSR/CSC formats for parallelizing Ax and Aᵀx calculations. Section 3 describes the CSB format for sparse matrices. Section 4 presents the algorithms for computing Ax and Aᵀx using the CSB format, and Section 5 provides a theoretical analysis of their parallel performance. Section 6 describes the experimental setup we used, and Section 7 presents the results. Section 8 offers some concluding remarks.
[Figure 1: bar chart; y-axis MFlops/sec (0–500), x-axis Processors (1–8); series CSB_SpMV, CSB_SpMV_T, CSR_SpMV (Serial), CSR_SpMV_T (Serial), Star-P (y=Ax), Star-P (y=xA).]

Figure 1: Average performance of Ax and Aᵀx operations on 13 different matrices from our benchmark test suite. CSB_SpMV and CSB_SpMV_T use compressed sparse blocks to perform Ax and Aᵀx, respectively. CSR_SpMV (Serial) and CSR_SpMV_T (Serial) use OSKI [39] and compressed sparse rows without any matrix-specific optimizations. Star-P (y=Ax) and Star-P (y=xA) use Star-P [34], a parallel code based on CSR. The experiments were run on a ccNUMA architecture powered by AMD Opteron 8214 (Santa Rosa) processors.

2 Conventional storage formats

This section describes the CSR and CSC sparse-matrix storage formats and explores their limitations when it comes to computing both Ax and Aᵀx in parallel. We review the work/span formulation of parallelism and show that performing Ax with CSR (or equivalently Aᵀx with CSC) yields ample parallelism. We consider various strategies for performing Aᵀx in parallel with CSR (or equivalently Ax with CSC) and why they are problematic.

The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. In one popular variant [14], CSR maintains one floating-point array val[nnz] and two integer arrays, col_ind[nnz] and row_ptr[n], to store the matrix A = (a_ij). The row_ptr array stores the index of each row in val. That is, if val[k] stores matrix element a_ij, then row_ptr[i] ≤ k < row_ptr[i+1]. The col_ind array stores the column indices of the elements in the val array. That is, if val[k] stores matrix element a_ij, then col_ind[k] = j.

CSR_SpMV(A, x, y)
1  n ← A.rows
2  for i ← 0 to n − 1 in parallel
3      do y[i] ← 0
4         for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
5             do y[i] ← y[i] + A.val[k] · x[A.col_ind[k]]

Figure 2: Parallel procedure for computing y ← Ax, where the n × n matrix A is stored in CSR format.
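For concreteness, the following is a minimal, runnable C++ rendering of Figure 2. The container layout and function names are our own illustrative choices, not code from the paper's library (which is written in Cilk++, where the outer loop would be a parallel cilk_for):

#include <cstddef>
#include <vector>

// Minimal CSR container; field names follow the pseudocode in Figure 2.
struct Csr {
    std::size_t rows;
    std::vector<double>      val;     // nonzeros, stored row by row
    std::vector<std::size_t> col_ind; // column index of each nonzero
    std::vector<std::size_t> row_ptr; // rows+1 entries; row i occupies
                                      // positions [row_ptr[i], row_ptr[i+1])
};

// y <- A x.  Iteration i writes only y[i], so with a Cilk++-style compiler
// the outer loop can safely be made parallel (cilk_for) with no races.
void csr_spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y)
{
    for (std::size_t i = 0; i < A.rows; ++i) {  // parallelizable loop
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_ind[k]];
        y[i] = sum;
    }
}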
The compressed sparse column (CSC) format is analogous to CSR, except that the nonzeros of each column, instead of row, are stored in contiguous memory locations. In other words, the CSC format for A is obtained by storing Aᵀ in CSR format.

The earliest written description of CSR that we have been able to divine from the literature is an unnamed "scheme" presented in Table 1 of the 1967 article [36] by Tinney and Walker, although in 1963 Sato and Tinney [33] allude to what is probably CSR. Markowitz's seminal paper [28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC have since become ubiquitous in sparse matrix computation [13, 16, 17, 21, 23, 32].

The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros — that is, the row_ptr and col_ind arrays.

Lemma 1. The CSR format uses n⌈lg nnz⌉ + nnz⌈lg n⌉ bits of index data for an n × n matrix.

For a CSR matrix A, computing y ← Ax in parallel is straightforward, as shown in Figure 2. Procedure CSR_SpMV in the figure computes each element of the output array in parallel, and it does not suffer from race conditions, because each parallel iteration i writes to a single location y[i], which is not updated by any other iteration.

We shall measure the complexity of this code, and other codes in this paper, in terms of work and span [10, Ch. 27]:
- The work, denoted by T₁, is the running time on 1 processor.
- The span,¹ denoted by T∞, is the running time on an infinite number of processors.
The parallelism of the algorithm is T₁/T∞, which corresponds to the maximum possible speedup on any number of processors. Generally, if a machine has somewhat fewer processors than the parallelism of an application, a good scheduler should be able to achieve linear speedup. Thus, for a fixed amount of work, our goal is to achieve a sufficiently small span so that the parallelism exceeds the number of processors by a reasonable margin.

¹The literature also uses the terms depth [3] and critical-path length [4].

The work of CSR_SpMV is Θ(nnz), assuming, as we shall, that nnz ≥ n, because the body of the outer loop starting in line 2 executes for n iterations, and the body of the inner loop starting in line 4 executes for the number of nonzeros in the ith row, for a total of nnz times.

The span of CSR_SpMV depends on the maximum number nr of nonzeros in any row of the matrix A, since that number determines the worst-case time of any iteration of the loop in line 4. The n iterations of the parallel loop in line 2 contribute Θ(lg n) to the span, assuming that loops are implemented as binary recursion. Thus, the total span is Θ(nr + lg n).

The parallelism is therefore Θ(nnz/(nr + lg n)). In many common situations, we have nnz = Θ(n), which we will assume for estimation purposes. The maximum number nr of nonzeros in any row can vary considerably, however, from a constant, if all rows have an average number of nonzeros, to n, if the matrix has a dense row. If nr = O(1), then the parallelism is Θ(nnz/lg n), which is quite high for a matrix with a billion nonzeros. In particular, if we ignore constants for the purpose of making a ballpark estimate, we have nnz/lg n ≈ 10⁹/(lg 10⁹) > 3 × 10⁷, which is much larger than any number of processors one is likely to encounter in the near future. If nr = Θ(n), however, as is the case when there is even a single dense row, we have parallelism Θ(nnz/n) = Θ(1), which limits scalability dramatically. Fortunately, we can parallelize the inner loop (line 4) using divide-and-conquer recursion to compute the sparse inner product in lg(nr) span without affecting the asymptotic work, thereby achieving parallelism Θ(nnz/lg n) in all cases.
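A sketch of that divide-and-conquer inner product, reusing the Csr sketch above (our illustration of the standard technique, not the paper's code): the two halves read disjoint segments and combine private partial sums, so they could be executed in parallel with cilk_spawn, giving Θ(lg nr) span for a row with nr nonzeros.

// Dot product of CSR row segment [lo, hi) with x, by halving.
// The recursive calls touch disjoint data, so they may run in parallel.
double csr_row_dot(const Csr& A, const std::vector<double>& x,
                   std::size_t lo, std::size_t hi)
{
    if (hi - lo <= 64) {            // small serial base case
        double sum = 0.0;
        for (std::size_t k = lo; k < hi; ++k)
            sum += A.val[k] * x[A.col_ind[k]];
        return sum;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    double left  = csr_row_dot(A, x, lo, mid);  // parallelizable: cilk_spawn
    double right = csr_row_dot(A, x, mid, hi);
    return left + right;                        // combine after a sync
}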
CSR_SpMV_T(A, x, y)
1  n ← A.cols
2  for i ← 0 to n − 1
3      do y[i] ← 0
4  for i ← 0 to n − 1
5      do for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
6             do y[A.col_ind[k]] ← y[A.col_ind[k]] + A.val[k] · x[i]

Figure 3: Serial procedure for computing y ← Aᵀx, where the n × n matrix A is stored in CSR format.

Computing Aᵀx serially can be accomplished by simply interchanging the row and column indices [15], yielding the pseudocode shown in Figure 3. The work of procedure CSR_SpMV_T is Θ(nnz), the same as CSR_SpMV.

Parallelizing CSR_SpMV_T is not straightforward, however. We shall review several strategies to see why it is problematic.

One idea is to parallelize the loops in lines 2 and 5, but this strategy yields minimal scalability. First, the span of the procedure is Θ(n), due to the loop in line 4. Thus, the parallelism can be at most O(nnz/n), which is a small constant in most common situations. Second, in any practical system, the communication and synchronization overhead for executing a small loop in parallel is much larger than the execution time of the few operations executed in line 6.

Another idea is to execute the loop in line 4 in parallel. Unfortunately, this strategy introduces race conditions in the read/modify/write to y[A.col_ind[k]] in line 6.² These races can be addressed in two ways, neither of which is satisfactory.

²In fact, if nnz > n, then the pigeonhole principle guarantees that the program has at least one race condition.

The first solution involves locking column col_ind[k] or using some other form of atomic update.³ This solution is unsatisfactory because of the high overhead of the lock compared to the cost of the update. Moreover, if A contains a dense column, then the contention on the lock is Θ(n), which completely destroys any parallelism in the common case where nnz = Θ(n).

³No mainstream hardware supports atomic update of floating-point quantities, however.

The second solution involves splitting the output array y into multiple arrays y_p in a way that avoids races, and then accumulating the y_p into y at the end of the computation. For example, in a system with P processors (or threads), one could postulate that processor p only operates on array y_p, thereby avoiding any races. This solution is unsatisfactory because the work becomes Θ(nnz + Pn), where the last term comes from the need to initialize and accumulate P (dense) length-n arrays. Thus, the parallel execution time is Θ((nnz + Pn)/P) = Ω(n) no matter how many processors are available.

A third idea for parallelizing Aᵀx is to compute the transpose explicitly and then use CSR_SpMV. Unfortunately, parallel transposition of a sparse matrix in CSR format is costly and encounters exactly the same problems we are trying to avoid. Moreover, every element is accessed at least twice: once for the transpose, and once for the multiplication. Since the calculation of a matrix-vector product tends to be memory-bandwidth limited, this strategy is generally inferior to any strategy that accesses each element only once.

Finally, of course, we could store the matrix Aᵀ in CSR format — that is, store A in CSC format — but then computing Ax becomes difficult.

To close this section, we should mention that if the matrix A is symmetric, so that only about half the nonzeros need be stored — for example, those on or above the diagonal — then computing Ax in parallel for CSR is also problematic. For this example, the elements below the diagonal are visited in an inconvenient order, as if they were stored in CSC format.

3 The CSB storage format

This section describes the CSB storage format for sparse matrices and shows that it uses the same amount of storage space as the CSR and CSC formats. We also compare CSB to other blocking schemes.

For a given block-size parameter β, CSB partitions the n × n matrix A into n²/β² equal-sized square blocks⁴

    A = ( A00        A01        ···  A0,n/β−1
          A10        A11        ···  A1,n/β−1
          ⋮          ⋮          ⋱    ⋮
          An/β−1,0   An/β−1,1   ···  An/β−1,n/β−1 ),

where the block Aij is the submatrix of A containing elements falling in rows iβ, iβ+1, ..., (i+1)β − 1 and columns jβ, jβ+1, ..., (j+1)β − 1 of A. For simplicity of presentation, we shall assume that β is an exact power of 2 and that it divides n; relaxing these assumptions is straightforward.

⁴The CSB format may be easily extended to nonsquare n × m matrices. In this case, the blocks remain as square β × β matrices, and there are nm/β² blocks.

Many or most of the individual blocks Aij are hypersparse [6], meaning that the ratio of nonzeros to matrix dimension is asymptotically 0. For example, if β = √n and nnz = cn, the average block has dimension √n and only c nonzeros. The space to store a block should therefore depend only on its nonzero count, not on its dimension.

CSB represents a block Aij by compactly storing a triple for each nonzero, associating with the nonzero data element a row and column index. In contrast to the column index stored for each nonzero in CSR, the row and column indices lie within the submatrix Aij, and hence require fewer bits. In particular, if β = √n, then each index into Aij requires only half the bits of an index into A. Since these blocks are stored contiguously in memory, CSB uses an auxiliary array of pointers to locate the beginning of each block.

More specifically, CSB maintains a floating-point array val[nnz], and three integer arrays row_ind[nnz], col_ind[nnz], and blk_ptr[n²/β²]. We describe each of these arrays in turn.

The val array stores all the nonzeros of the matrix and is analogous to CSR's array of the same name. The difference is that CSR stores rows contiguously, whereas CSB stores blocks contiguously. Although each block must be contiguous, the ordering among blocks is flexible. Let f(i, j) be the bijection from pairs of block indices to integers in the range 0, 1, ..., n²/β² − 1 that describes the ordering among blocks. That is, f(i, j) < f(i′, j′) if and only if Aij appears before Ai′j′ in val. We discuss choices of ordering later in this section.
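A minimal C++ container for these four arrays might look as follows. The names and the 16-bit-per-coordinate packing (whose rationale is described next) are our own illustrative assumptions, not the paper's library API:

#include <cstdint>
#include <vector>

// Compressed sparse blocks with beta <= 2^16: a (row, col) pair within a
// block packs into one 32-bit word, high half row, low half column.
struct Csb {
    std::size_t   n;                    // matrix is n x n
    std::uint32_t beta;                 // block dimension (a power of 2)
    std::vector<double>        val;     // nonzeros, block by block
                                        // (Z-Morton order inside each block)
    std::vector<std::uint32_t> packed;  // (row_ind << 16) | col_ind per nonzero
    std::vector<std::size_t>   blk_ptr; // (n/beta)^2 + 1 entries; block b
                                        // occupies [blk_ptr[b], blk_ptr[b+1])
    static std::uint32_t pack(std::uint32_t r, std::uint32_t c) { return (r << 16) | c; }
    static std::uint32_t row_of(std::uint32_t w) { return w >> 16; }
    static std::uint32_t col_of(std::uint32_t w) { return w & 0xFFFFu; }
};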
The row_ind and col_ind arrays store the row and column indices, respectively, of the elements in the val array. These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β − 1. That is, if val[k] stores the matrix element a(iβ+r, jβ+c), which is located in the rth row and cth column of the block Aij, then row_ind[k] = r and col_ind[k] = c. As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2⌈lg β⌉ bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array.

The blk_ptr array stores the index of each block in the val array, which is analogous to the row_ptr array for CSR. If val[k] stores a matrix element falling in block Aij, then blk_ptr[f(i, j)] ≤ k < blk_ptr[f(i, j) + 1].

The following lemma states the storage used for indices in the CSB format.

Lemma 2. The CSB format uses (n²/β²)⌈lg nnz⌉ + 2 nnz⌈lg β⌉ bits of index data.

Proof. Since the val array contains nnz elements, referencing an element requires ⌈lg nnz⌉ bits, and hence the blk_ptr array uses (n²/β²)⌈lg nnz⌉ bits of storage.

For each element in val, we use ⌈lg β⌉ bits to represent the row index and ⌈lg β⌉ bits to represent the column index, requiring a total of nnz⌈lg β⌉ bits for each of row_ind and col_ind. Adding the space used by all three indexing arrays completes the proof.

To better understand the storage requirements of CSB, we present the following corollary for β = √n. In this case, both CSR (Lemma 1) and CSB use the same storage.

Corollary 3. The CSB format uses n⌈lg nnz⌉ + nnz⌈lg n⌉ bits of index data when β = √n.

Thus far, we have not addressed the ordering of elements within each block or the ordering of blocks. Within a block, we use a Z-Morton ordering [29], storing first all those elements in the top-left quadrant, then the top-right, bottom-left, and finally bottom-right quadrants, using the same layout recursively within each quadrant. In fact, these quadrants may be stored in any order, but the recursive ordering is necessary for our algorithm to achieve good parallelism within a block.

The choice of storing the nonzeros within blocks in a recursive layout is opposite to the common wisdom for storing dense matrices [18]. Although most compilers and architectures favor conventional row/column ordering for optimal prefetching, the choice of layout within the block becomes less significant for sparse blocks as they already do not take full advantage of such features. More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.

Our algorithm and analysis do not, however, require any particular ordering among blocks. A Z-Morton ordering (or any recursive ordering) seems desirable as it should get better performance in practice by providing spatial locality, and it matches the ordering within a block. Computing the function f(i, j), however, is simpler for a row-major or column-major ordering among blocks.

Comparison with other blocking methods

A blocked variant of CSR, called BCSR, has been used for improving register reuse [24]. In BCSR, the sparse matrix is divided into small dense blocks that are stored in consecutive memory locations. Pointers are maintained to the first block in each row of blocks. BCSR storage is converse to CSB storage, because BCSR stores a sparse collection of dense blocks, whereas CSB stores a dense collection of sparse blocks. We conjecture that it would be advantageous to apply BCSR-style register blocking to each individual sparse block of CSB.

Nishtala et al. [30] have proposed a data structure similar to CSB in the context of cache blocking. Our work differs from theirs in two ways. First, CSB is symmetric, without favoring rows over columns. Second, our algorithms and analysis for CSB are designed for parallelism instead of cache performance. As shown in Section 5, CSB supports ample parallelism for algorithms computing Ax and Aᵀx, even on sparse and irregular matrices.

Blocking is also used in dense matrices. The Morton-hybrid layout [1, 27], for example, uses a parameter equivalent to our parameter β for selecting the block size. Whereas in CSB we store elements in a Morton ordering within blocks and an arbitrary ordering among blocks, the Morton-hybrid layout stores elements in row-major order within blocks and a Morton ordering among blocks. The Morton-hybrid layout is designed to take advantage of hardware and compiler optimizations (within a block) while still exploiting the cache benefits of a recursive layout. Typically the block size is chosen to be 32 × 32, which is significantly smaller than the Θ(√n) block size we propose for CSB. The Morton-hybrid layout, however, considers only dense matrices, for which designing a matrix-vector multiplication algorithm with good parallelism is significantly easier.

4 Matrix-vector multiplication using CSB

This section describes a parallel algorithm for computing the sparse-matrix dense-vector product y ← Ax, where A is stored in CSB format. This algorithm can be used equally well for computing y ← Aᵀx by switching the roles of row and column. We first give an overview of the algorithm and then describe it in detail.

At a high level, the CSB_SpMV multiplication algorithm simply multiplies each "blockrow" by the vector x in parallel, where the ith blockrow is the row of blocks (Ai0 Ai1 ··· Ai,n/β−1). Since each blockrow multiplication writes to a different portion of the output vector, this part of the algorithm contains no races due to write conflicts.

If the nonzeros were guaranteed to be distributed evenly among blockrows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow. One cannot, in general, guarantee that the distribution of nonzeros will be so nice, however. In fact, sparse matrices in practice often include at least one dense row containing roughly n nonzeros, whereas the total number of nonzeros is only nnz ≈ cn for some small constant c. Thus, performing a serial multiplication for each blockrow yields no better than c-way parallelism.

To make the algorithm robust to matrices of arbitrary nonzero structure, we must parallelize the blockrow multiplication when a blockrow contains too many nonzeros. This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector. Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it "in half," yielding two subblockrows, each containing roughly half the nonzeros. Although each of these subblockrows can be multiplied in parallel, they may need to write to the same region of the output vector. To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.

To facilitate fast subblockrow divisions, we first partition the blockrow into "chunks" of consecutive blocks, each containing at most O(β) nonzeros (when possible) and Θ(β) nonzeros on average. The lower bound of Θ(β) will allow us to amortize the cost of writing to the length-β temporary vector against the nonzeros in the chunk. By dividing a blockrow "in half," we mean assigning to each subblockrow roughly half the chunks.
CSB_SpMV(A, x, y)
 1  for i ← 0 to n/β − 1 in parallel    // For each blockrow.
 2      do Initialize a dynamic array Ri
 3         Ri[0] ← 0
 4         count ← 0                    // Count nonzeros in chunk.
 5         for j ← 0 to n/β − 2
 6             do count ← count + nnz(Aij)
 7                if count + nnz(Ai,j+1) > Θ(β)
 8                   then // End the chunk, since the next block makes it too large.
 9                        append j to Ri     // Last block in chunk.
10                        count ← 0
11         append n/β − 1 to Ri
12         CSB_BLOCKROWV(A, i, Ri, x, y[iβ .. (i+1)β − 1])

Figure 4: Pseudocode for the matrix-vector multiplication y ← Ax. The procedure CSB_BLOCKROWV (pseudocode for which can be found in Figure 5) as called here multiplies the blockrow by the vector x and writes the output into the appropriate region of the output vector y. The notation x[a .. b] means the subarray of x starting at index a and ending at index b. The function nnz(Aij) is shorthand for A.blk_ptr[f(i, j) + 1] − A.blk_ptr[f(i, j)], which calculates the number of nonzeros in the block Aij. For conciseness, we have overloaded the Θ(β) notation (in line 7) to mean "a constant times β"; any constant suffices for the analysis, and we use the constant 3 in our implementation.

CSB_BLOCKROWV(A, i, R, x, y)
11  if R.length = 2    // The subblockrow is a single chunk.
12     then ℓ ← R[0] + 1     // Leftmost block in chunk.
13          r ← R[1]         // Rightmost block in chunk.
14          if ℓ = r
15             then // The chunk is a single (dense) block.
16                  start ← A.blk_ptr[f(i, ℓ)]
17                  end ← A.blk_ptr[f(i, ℓ) + 1] − 1
18                  CSB_BLOCKV(A, start, end, β, x, y)
19             else // The chunk is sparse.
20                  multiply y ← (Aiℓ Ai,ℓ+1 ··· Air) x serially
21          return
    // Since the blockrow is dense, split it in half.
22  mid ← ⌈R.length/2⌉ − 1   // Divide chunks in half.
    // Calculate the dividing point in the input vector x.
23  xmid ← β · (R[mid] − R[0])
24  allocate a length-β temporary vector z, initialized to 0
25  in parallel
26     do CSB_BLOCKROWV(A, i, R[0 .. mid], x[0 .. xmid − 1], y)
27     do CSB_BLOCKROWV(A, i, R[mid .. R.length − 1], x[xmid .. x.length − 1], z)
28  for k ← 0 to β − 1
29      do y[k] ← y[k] + z[k]

Figure 5: Pseudocode for the subblockrow vector product y ← (Aiℓ Ai,ℓ+1 ··· Air) x. The "in parallel do ... do ..." construct indicates that all of the do code blocks may execute in parallel. The procedure CSB_BLOCKV (pseudocode for which can be found in Figure 6) calculates the product of the block and the vector in parallel.

Figure 4 gives the top-level algorithm, performing each blockrow vector multiplication in parallel. The "for ... in parallel do" construct means that each iteration of the for loop may be executed in parallel with the others. For each loop iteration, we partition the blockrow into chunks in lines 2–11 and then call the blockrow multiplication in line 12. The array Ri stores the indices of the last block in each chunk; specifically, the kth chunk, for k > 0, includes blocks (Ai,Ri[k−1]+1 Ai,Ri[k−1]+2 ··· Ai,Ri[k]). A chunk consists of either a single block containing Ω(β) nonzeros, or it consists of many blocks containing O(β) nonzeros in total. To compute chunk boundaries, just iterate over blocks (in lines 5–10) until enough nonzeros are accrued.

Figure 5 gives the parallel algorithm CSB_BLOCKROWV for multiplying a blockrow by a vector, writing the result into the length-β vector y. In lines 22–29, the algorithm recursively divides the blockrow such that each half receives roughly the same number of chunks. We find the appropriate "middles" of the chunk array R and the input vector x in lines 22 and 23, respectively. We then allocate a length-β temporary vector z (line 24) and perform the recursive multiplications on each subblockrow in parallel (lines 25–27), having one of the recursive multiplications write its output to z. When these recursive multiplications complete, we merge the outputs into the vector y (lines 28–29).

The recursion bottoms out when the blockrow consists of a single chunk (lines 11–21). If this chunk contains many blocks, it is guaranteed to contain at most O(β) nonzeros, which is sufficiently sparse to perform the serial multiplication in line 20. If, on the other hand, the chunk is a single block, it may contain as many as β² = n nonzeros. A serial multiplication here, therefore, would be the bottleneck in the algorithm. Instead, we perform the parallel block-vector multiplication CSB_BLOCKV in line 18.

If the blockrow recursion reaches a single block, we perform a parallel multiplication of the block by the vector, given in Figure 6. The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array in lines 34–36.

To understand the pseudocode, consider the search for the dividing point s2 between M00 ∪ M01 and M10 ∪ M11. For any recursively chosen dim × dim matrix M, the column indices and row indices of all elements have the same leading ⌈lg β⌉ − lg dim bits. Moreover, for those elements in M00 ∪ M01, the next bit in the row index is a 0, whereas for those elements in M10 ∪ M11, the next bit in the row index is 1. The algorithm does a binary search for the point at which this bit flips. The cases for the dividing point between M00 and M01 or M10 and M11 are similar, except that we focus on the column index instead of the row index.

After dividing the matrix into quadrants, we execute the matrix products involving matrices M00 and M11 in parallel (lines 37–39), as they do not conflict on any outputs. After completing these products, we execute the other two matrix products in parallel (lines 40–42).⁵ This procedure resembles a standard parallel divide-and-conquer matrix multiplication, except that our base case of serial multiplication starts at a matrix containing Θ(dim) nonzeros (lines 29–32). Note that although we pass the full length-β arrays x and y to each recursive call, the effective length of each array is halved implicitly by partitioning M into quadrants. Passing the full arrays is a technical detail required to properly compute array indices, as the indices A.row_ind and A.col_ind store offsets within the block.

⁵The algorithm may instead do M00 and M10 in parallel followed by M01 and M11 in parallel without affecting the performance analysis. Presenting the algorithm with two choices may yield better load balance.

The CSB_SpMV_T algorithm is identical to CSB_SpMV, except that we operate over blockcolumns rather than blockrows.
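Because the tested bit is monotone over a Z-Morton-ordered range, the quadrant searches of lines 34–36 (in Figure 6, which follows) reduce to std::partition_point. Here is a sketch for the row split s2, assuming the packed index word of the earlier Csb sketch; the column splits are identical with col_of applied to the two half-ranges:

#include <algorithm>

// Smallest position s2 in [start, end) whose row index has the dim/2 bit
// set.  In Z-Morton order this predicate is false for a prefix of the
// range and true for the rest, so a binary search applies.
std::size_t find_row_split(const Csb& A, std::size_t start, std::size_t end,
                           std::uint32_t dim)
{
    const std::uint32_t* first = A.packed.data() + start;
    const std::uint32_t* last  = A.packed.data() + end;
    const std::uint32_t* s2 = std::partition_point(first, last,
        [dim](std::uint32_t w) { return (Csb::row_of(w) & (dim / 2)) == 0; });
    return static_cast<std::size_t>(s2 - A.packed.data());
}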
CSB_BLOCKV(A, start, end, dim, x, y)
    // A.val[start .. end] is a dim × dim matrix M.
28  if end − start ≤ Θ(dim)
29     then // Perform the serial computation y ← y + Mx.
30          for k ← start to end
31              do y[A.row_ind[k]] ← y[A.row_ind[k]] + A.val[k] · x[A.col_ind[k]]
32          return
33  // Recurse. Find the indices of the quadrants.
34  binary search start, start+1, ..., end for the smallest s2 such that (A.row_ind[s2] & dim/2) ≠ 0
35  binary search start, start+1, ..., s2 − 1 for the smallest s1 such that (A.col_ind[s1] & dim/2) ≠ 0
36  binary search s2, s2+1, ..., end for the smallest s3 such that (A.col_ind[s3] & dim/2) ≠ 0
37  in parallel
38     do CSB_BLOCKV(A, start, s1 − 1, dim/2, x, y)   // M00.
39     do CSB_BLOCKV(A, s3, end, dim/2, x, y)         // M11.
40  in parallel
41     do CSB_BLOCKV(A, s1, s2 − 1, dim/2, x, y)      // M01.
42     do CSB_BLOCKV(A, s2, s3 − 1, dim/2, x, y)      // M10.

Figure 6: Pseudocode for the subblock-vector product y ← Mx, where M is the list of tuples stored in A.val[start .. end], A.row_ind[start .. end], and A.col_ind[start .. end], in recursive Z-Morton order. The & operator is a bitwise AND of the two operands.

5 Analysis

In this section, we prove that for an n × n matrix with nnz nonzeros, CSB_SpMV operates with work Θ(nnz) and span O(√n lg n) when β = √n, yielding a parallelism of Θ(nnz/√n lg n). We also provide bounds in terms of β and analyze the space usage.

We begin by analyzing block-vector multiplication.

Lemma 4. On a β × β block containing r nonzeros, CSB_BLOCKV runs with work Θ(r) and span O(β).

Proof. The span for multiplying a dim × dim matrix can be described by the recurrence S(dim) = 2S(dim/2) + O(lg dim) = O(dim). The lg dim term represents a loose upper bound on the cost of the binary searches. In particular, the binary-search cost is O(lg z) for a submatrix containing z nonzeros, and we have z ≤ dim², and hence O(lg z) = O(lg dim), for a dim × dim matrix.

To calculate the work, consider the degree-4 tree of recursive procedure calls, and associate with each node the work done by that procedure call. We say that a node in the tree has height h if it corresponds to a 2ʰ × 2ʰ subblock, i.e., if dim = 2ʰ is the parameter passed into the corresponding CSB_BLOCKV call. Node heights are integers ranging from 0 to lg β. Observe that each height-h node corresponds to a distinct 2ʰ × 2ʰ subblock (although subblocks may overlap for nodes having different heights). A height-h leaf node (serial base case) corresponds to a subblock containing at most z = O(2ʰ) nonzeros and has work linear in this number z of nonzeros. Summing across all leaves, therefore, gives Θ(r) work. A height-h internal node, on the other hand, corresponds to a subblock containing at least z = Ω(2ʰ) nonzeros (or else it would not recurse further and be a leaf) and has work O(lg 2ʰ) = O(h) arising from the binary searches. There can thus be at most O(r/2ʰ) height-h internal nodes having total work O((r/2ʰ)h). Summing across all heights gives total work of

    Σ_{h=0}^{lg β} O((r/2ʰ)h) ≤ r · Σ_{h=0}^{lg β} O(h/2ʰ) = O(r)

for internal nodes. Combining the work at internal nodes and leaf nodes gives total work Θ(r).

The next lemma analyzes blockrow-vector multiplication.

Lemma 5. On a blockrow containing n/β blocks and r nonzeros, CSB_BLOCKROWV runs with work Θ(r) and span O(β lg(n/β)).

Proof. Consider a call to CSB_BLOCKROWV on a row that is partitioned into C chunks, and let W(C) denote the work. The work per recursive call on a multichunk subblockrow is dominated by the Θ(β) work of initializing a temporary vector z and adding the vector z into the output vector y. The work for a CSB_BLOCKROWV on a single-chunk subblockrow is linear in the number of nonzeros in the chunk. (We perform linear work either in line 20 or in line 18 — see Lemma 4 for the work of line 18.) We can thus describe the work by the recurrence W(C) ≤ 2W(C/2) + Θ(β) with a base case of work linear in the nonzeros, which solves to W(C) = Θ(Cβ + r) for C > 1. When C = 1, we have W(C) = Θ(r), as we do not operate on the temporary vector z.

To bound the work, it remains to bound the maximum number of chunks in a row. Notice that any two consecutive chunks contain at least Θ(β) nonzeros. This fact follows from the way chunks are chosen in lines 2–11: a chunk is terminated only if adding the next block to the chunk would increase the number of nonzeros to more than Θ(β). Thus, a blockrow consists of a single chunk whenever r = O(β) and at most O(r/β) chunks whenever r = Ω(β). Hence, the total work is Θ(r).

We can describe the span of CSB_BLOCKROWV by the recurrence S(C) = S(⌈C/2⌉) + O(β) = O(β lg C) + S(1). The base case involves either serially multiplying a single chunk containing at most O(β) nonzeros in line 20, which has span O(β), or multiplying a single block in parallel in line 18, which also has span O(β) from Lemma 4. We have, therefore, a span of O(β lg C) = O(β lg(n/β)), since C ≤ n/β.

We are now ready to analyze matrix-vector multiplication itself.

Theorem 6. On an n × n matrix containing nnz nonzeros, CSB_SpMV runs with work Θ(n²/β² + nnz) and span O(β lg(n/β) + n/β).

Proof. For each blockrow, we add Θ(n/β) work and span for computing the chunks, which arise from a serial scan of the n/β blocks in the blockrow. Thus, the total work is O(n²/β²) in addition to the work for multiplying the blockrows, which is linear in the number of nonzeros from Lemma 5.

The total span is O(lg(n/β)) to parallelize all the rows, plus O(n/β) per row to partition the row into chunks, plus the O(β lg(n/β)) span per blockrow from Lemma 5.

The following corollary gives the work and span bounds when we choose β to yield the same space for the CSB storage format as for the CSR or CSC formats.

Corollary 7. On an n × n matrix containing nnz ≥ n nonzeros, by choosing β = Θ(√n), CSB_SpMV runs with work Θ(nnz) and span O(√n lg n), achieving a parallelism of Θ(nnz/√n lg n).

Since CSB_SpMV_T is isomorphic to CSB_SpMV, we obtain the following corollary.

Corollary 8. On an n × n matrix containing nnz ≥ n nonzeros, by choosing β = Θ(√n), CSB_SpMV_T runs with work Θ(nnz) and span O(√n lg n), achieving a parallelism of Θ(nnz/√n lg n).

The space usage of our algorithm is dominated by the temporary vectors z, and thus the space usage on an infinite number of processors matches the work bound. When run on fewer processors, however, the space usage reduces drastically. We can analyze the space in terms of the serialization of the program, which corresponds to the program obtained by removing all parallel keywords.

Lemma 9. On an n × n matrix, by choosing β = Θ(√n), the serialization of CSB_SpMV requires O(√n lg n) space (not counting the storage for the matrix itself).

Proof. The serialization executes one blockrow multiplication at a time. There are two space overheads. First, we use O(n/β) = O(√n) space for the chunk array. Second, we use space to store the temporary vector z for each outstanding recursive call to CSB_BLOCKROWV. Since the recursion depth is O(lg n), the total space becomes O(β lg n) = O(√n lg n).

A typical work-stealing scheduler executes the program in a depth-first (serial) manner on each processor. When a processor completes all its work, it "steals" work from a different processor, beginning a depth-first execution from some unexecuted parallel branch. Although not all work-stealing schedulers are space efficient, those maintaining the "busy-leaves" property [5] (e.g., as used in the Cilk work-stealing scheduler [4]) are space efficient. The busy-leaves property roughly says that if a procedure has begun (but not completed) executing, then there exists a processor currently working on that procedure or one of its descendant procedures.

Corollary 10. Suppose that a work-stealing scheduler with the busy-leaves property schedules an execution of CSB_SpMV on an n × n matrix with the choice β = √n. Then, the execution requires O(P√n lg n) space.

Proof. Combine Lemma 9 and Theorem 1 from [4].

The work overhead of our algorithm may be reduced by increasing the constants in the Θ(β) threshold in line 7. Specifically, increasing this threshold by a constant factor reduces the number of reads and writes to temporaries by the same constant factor. As these temporaries constitute the majority of the work overhead of the algorithm, doubling the threshold nearly halves the overhead. Increasing the threshold, however, also increases the span by a constant factor, and so there is a trade-off.
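As a worked check before turning to the experiments (a restatement of Corollary 7, not a new result), substituting β = Θ(√n) into Theorem 6 and using nnz ≥ n gives

\[
T_1 = \Theta\!\bigl(n^2/\beta^2 + \mathrm{nnz}\bigr) = \Theta(n + \mathrm{nnz}) = \Theta(\mathrm{nnz}),
\qquad
T_\infty = O\!\bigl(\beta \lg(n/\beta) + n/\beta\bigr) = O\!\bigl(\sqrt{n}\,\lg\sqrt{n} + \sqrt{n}\bigr) = O\!\bigl(\sqrt{n}\,\lg n\bigr),
\]

so the parallelism is \(T_1/T_\infty = \Theta(\mathrm{nnz}/\sqrt{n}\,\lg n)\).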

6 Experimental design

This section describes our implementation of the CSB_SpMV and CSB_SpMV_T algorithms, the benchmark matrices we used to test the algorithms, the machines on which we ran our tests, and the other codes with which we compared our algorithms.

Implementation

We parallelized our code using Cilk++ [9], which is a faithful extension of C++ for multicore and shared-memory parallel programming. Cilk++ is based on the earlier MIT Cilk system [20], and it employs dynamic load balancing and provably optimal task scheduling. The CSB code used for the experiments is freely available for academic use at http://gauss.cs.ucsb.edu/aydin/software.html.

The row_ind and col_ind arrays of CSB, which store the row and column indices of each nonzero within a block (i.e., the lower-order bits of the row and column indices within the matrix A), are implemented as a single index array by concatenating the two values together. The higher-order bits of row_ind and col_ind are stored only implicitly, and are retrieved by referencing the blk_ptr array.

The CSB blocks themselves are stored in row-major order, while the nonzeros within blocks are in Z-Morton order. The row-major ordering among blocks may seem to break the overall symmetry of CSB, but in practice it yields efficient handling of block indices for look-up in A.blk_ptr by permitting an easily computed look-up function f(i, j). The row-major ordering also allowed us to count the nonzeros in a subblockrow more easily when computing y ← Ax. This optimization is not symmetric, but interestingly, we achieved similar performance when computing y ← Aᵀx, where we must still aggregate the nonzeros in each block. In fact, in almost half the cases, computing Aᵀx was faster than Ax, depending on the matrix structure.

The Z-Morton ordering on nonzeros in each block is equivalent to first interleaving the bits of row_ind and col_ind, and then sorting the nonzeros using these bit-interleaved values as the keys. Thus, it is tempting to store the index array in a bit-interleaved fashion, thereby simplifying the binary searches in lines 34–36. Converting to and from bit-interleaved integers, however, is expensive with current hardware support,⁶ which would be necessary for the serial base case in lines 29–32. Instead, the kth element of the index array is the concatenation of row_ind[k] and col_ind[k], as indicated earlier. This design choice of storing concatenated, instead of bit-interleaved, indices requires either some care when performing the binary search (as presented in Figure 6) or implicitly converting from the concatenated to interleaved format when making a binary-search comparison. Our preliminary implementation does the latter, using a C++ function object for comparisons [35]. In practice, the overhead of performing these conversions is small, since the number of binary-search steps is small.

⁶Recent research [31] addresses these conversions.

Performing the actual address calculation and determining the pointers to the x and y vectors are done by masking and bit-shifting. The bitmasks are determined dynamically by the CSB constructor depending on the input matrix and the data type used for storing matrix indices. Our library allows any data type to be used for matrix indices and handles any type of matrix dynamically. For the results presented in Section 7, nonzero values are represented as double-precision floating-point numbers, and indices are represented as 32-bit unsigned integers. Finally, as our library aims to be general instead of matrix specific, we did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression [25, 40], but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches.
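The comparison mentioned above — ordering concatenated words as if their bits were interleaved, without materializing the interleaved form — can be sketched with the classic most-significant-differing-bit trick. This function object is our illustration built on the earlier Csb sketch, not the library's actual comparator:

// True iff the highest set bit of a is strictly below that of b.
static bool less_msb(std::uint32_t a, std::uint32_t b)
{
    return a < b && a < (a ^ b);
}

// Z-Morton comparison of two packed (row << 16 | col) words, treating the
// row bit as more significant at every level, matching the quadrant order
// M00, M01, M10, M11 used by CSB_BLOCKV.
struct MortonLess {
    bool operator()(std::uint32_t p, std::uint32_t q) const {
        std::uint32_t rx = Csb::row_of(p) ^ Csb::row_of(q);
        std::uint32_t cx = Csb::col_of(p) ^ Csb::col_of(q);
        if (less_msb(rx, cx))                   // highest differing bit is a column bit
            return Csb::col_of(p) < Csb::col_of(q);
        return Csb::row_of(p) < Csb::row_of(q); // otherwise the row decides
    }
};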
Choosing the block size

We investigated different strategies to choose the block size that achieves the best performance. For the types of loads we ran, we found that a block size slightly larger than √n delivers reasonable performance. Figure 7 shows the effect of different block sizes on the performance of the y ← Ax operation with the representative matrix Kkt_power. The closest exact power of 2 to √n is 1024, which turns out to be slightly suboptimal. In our experiments, the overall best performance was achieved when β satisfies the equation ⌈lg √n⌉ ≤ lg β ≤ 3 + ⌈lg √n⌉.

[Figure 7: line plot; y-axis MFlops/sec (0–500), x-axis block size β (32 to 32768, powers of 2); one curve each for p = 8, p = 4, p = 2.]

Figure 7: The effect of block size parameter β on SpMV performance using the Kkt_power matrix. For values β > 32768 and β < 32, the experiment failed to finish due to memory limitations. The experiment was conducted on the AMD Opteron.

Merely setting β to a hard-coded value, however, is not robust for various reasons. First, the elements stored in the index array should use the same data type as that used for matrix indices. Specifically, the integer β − 1 should fit in 2 bytes so that a concatenated row_ind and col_ind fit into 4 bytes. Second, the length-β regions of the input vector x and output vector y (which are accessed when multiplying a single block) should comfortably fit into L2 cache. Finally, to ensure speedup on matrices with evenly distributed nonzeros, there should be enough parallel slackness for the parallelization across blockrows (i.e., the highest-level parallelism). Specifically, when β grows large, the parallelism is roughly bounded by O(nnz/(β lg(n/β))) (by dividing the work and span from Theorem 6). Thus, we want nnz/(β lg(n/β)) to be large enough, which means limiting the maximum magnitude of β.

We adjusted our CSB constructor, therefore, to automatically select a reasonable block-size parameter β. It starts with lg β = 3 + ⌈lg √n⌉ and keeps decreasing it until the aforementioned constraints are satisfied. Although a research opportunity may exist to autotune the optimal block size with respect to a specific matrix and architecture, in most test matrices, choosing β = √n degraded performance by at most 10%–15%. The optimal β value barely shifts along the x-axis when running on different numbers of processors and is quite stable overall.
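A sketch of such a constructor heuristic follows; this is our paraphrase of the rules above, and the slackness threshold and cache test are crude stand-ins for the real constraints, chosen only for illustration:

#include <cmath>
#include <cstddef>

// Choose lg(beta): start at 3 + ceil(lg sqrt(n)) and shrink toward
// ceil(lg sqrt(n)) until the index-width and parallel-slackness
// constraints hold.  The numeric thresholds here are illustrative only.
unsigned choose_lg_beta(std::size_t n, std::size_t nnz, unsigned nproc)
{
    unsigned lg_sqrt_n = (unsigned)std::ceil(0.5 * std::log2((double)n));
    unsigned lg_beta = 3 + lg_sqrt_n;                 // upper end of the range
    for (; lg_beta > lg_sqrt_n; --lg_beta) {
        double beta  = std::ldexp(1.0, (int)lg_beta); // 2^lg_beta
        double slack = nnz / (beta * std::log2((double)n / beta + 2.0));
        if (2 * lg_beta <= 32 && slack >= 100.0 * nproc)
            break;                        // keep the largest feasible beta
    }
    return lg_beta;
}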
An optimization heuristic for structured matrices

Even though CSB_SpMV and CSB_SpMV_T are robust and exhibit plenty of parallelism on most matrices, their practical performance can be improved on some sparse matrices having regular structure. In particular, a block-diagonal matrix with equally sized blocks has nonzeros that are evenly distributed across blockrows. In this case, a simple algorithm based on blockrow parallelism would suffice in place of the more complicated recursive method from CSB_BLOCKV. This divide-and-conquer within blockrows incurs overhead that might unnecessarily degrade performance. Thus, when the nonzeros are evenly distributed across the blockrows, our implementation of the top-level algorithm (given in Figure 4) calls the serial multiplication in line 12 instead of the CSB_BLOCKROWV procedure.

To see whether a given matrix is amenable to the optimization, we apply the following "balance" heuristic. We calculate the imbalance among blockrows (or blockcolumns in the case of y ← Aᵀx) and apply the optimization only when no blockrow has more than twice the average number of nonzeros per blockrow. In other words, if max(nnz(Ai)) < 2 · mean(nnz(Ai)), then the matrix is considered to have balanced blockrows and the optimization is applied. Of course, this optimization is not the only way to achieve a performance boost on structured matrices.
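The balance test is one pass over the blockrow nonzero counts, which are derivable from blk_ptr; a sketch with names of our own choosing:

#include <cstddef>
#include <vector>

// True iff no blockrow holds more than twice the mean nonzero count,
// i.e., max_i nnz(A_i) < 2 * mean_i nnz(A_i).
bool blockrows_balanced(const std::vector<std::size_t>& blockrow_nnz)
{
    std::size_t total = 0, maxi = 0;
    for (std::size_t c : blockrow_nnz) {
        total += c;
        if (c > maxi) maxi = c;
    }
    double mean = (double)total / (double)blockrow_nnz.size();
    return (double)maxi < 2.0 * mean;
}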
Optimization of temporary vectors

One of the most significant overheads of our algorithm is the use of temporary vectors to store intermediate results when parallelizing a blockrow multiplication in CSB_BLOCKROWV. The balance heuristic above is one way of reducing this overhead when the nonzeros in the matrix are evenly distributed. For arbitrary matrices, however, we can still reduce the overhead in practice. In particular, we only need to allocate the temporary vector z (in line 24) if both of the subsequent multiplications (lines 25–27) are scheduled in parallel. If the first recursive call completes before the second recursive call begins, then we instead write directly into the output vector for both recursive calls. In other words, when a blockrow multiplication is scheduled serially, the multiplication procedure detects this fact and mimics a normal serial execution, without the use of temporary vectors. Our implementation exploits an undocumented feature of Cilk++ to test whether the first call has completed before making the second recursive call, and we allocate the temporary as appropriate. This test may also be implemented using Cilk++ reducers [19].
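In outline, the baseline split looks like the following deliberately generic sketch of ours; the optimization described above elides the allocation of z whenever the left half is known to have finished, a test whose Cilk++-specific implementation we do not reproduce here:

#include <cstddef>
#include <vector>

// Structure of the parallel blockrow split.  In Cilk++ the two half
// multiplications would be spawned; the temporary z exists only so that
// the halves never write the same region concurrently.
template <class Half>
void split_blockrow(Half left_half, Half right_half, std::vector<double>& y)
{
    std::vector<double> z(y.size(), 0.0); // candidate for elision when the
                                          // halves are known to run serially
    left_half(y);                         // parallelizable: cilk_spawn
    right_half(z);
    // (cilk_sync here in the parallel version)
    for (std::size_t k = 0; k < y.size(); ++k)
        y[k] += z[k];                     // serial merge of the temporary
}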
Sparse-matrix test suite

We conducted experiments on a diverse set of sparse matrices from real applications including circuit simulation, finite-element computations, linear programming, and web-connectivity analysis. These matrices not only cover a wide range of applications, but they also greatly vary in size, density, and structure. The test suite contains both rectangular and square matrices. Almost half of the square matrices are asymmetric. Figure 8 summarizes the 14 test matrices.

Name / Description                   Dimensions       Nonzeros   CSC (mean/max)  CSB (mean/max)
Asic_320k                            321K × 321K      1,931K     6.0 / 157K      4.9 / 2.3K
  circuit simulation
Sme3Dc                               42K × 42K        3,148K     73.3 / 405      111.6 / 1,368
  3D structural mechanics
Parabolic_fem                        525K × 525K      3,674K     7.0 / 7         3.5 / 1,534
  diff-convection reaction
Mittelmann                           1,468K × 1,961K  5,382K     2.7 / 7         2.0 / 3,713
  LP problem
Rucci                                1,977K × 109K    7,791K     70.9 / 108      9.4 / 36
  ill-conditioned least-squares
Torso                                116K × 116K      8,516K     73.3 / 1.2K     41.3 / 36.6K
  finite diff, 2D model of torso
Kkt_power                            2.06M × 2.06M    12.77M     6.2 / 90        3.1 / 1,840
  optimal power flow, nonlinear opt.
Rajat31                              4.69M × 4.69M    20.31M     4.3 / 1.2K      3.9 / 8.7K
  circuit simulation
Ldoor                                952K × 952K      42.49M     44.6 / 77       49.1 / 43,872
  structural prob.
Bone010                              986K × 986K      47.85M     48.5 / 63       51.5 / 18,670
  3D trabecular bone
Grid3D200                            8M × 8M          55.7M      6.97 / 7        3.7 / 9,818
  3D 7-point finite-diff mesh
RMat23                               8.4M × 8.4M      78.7M      9.4 / 70.3K     4.7 / 222.1K
  real-world graph model
Cage15                               5.15M × 5.15M    99.2M      19.2 / 47       15.6 / 39,712
  DNA electrophoresis
Webbase2001                          118M × 118M      1,019M     8.6 / 816K      4.9 / 2,375K
  web connectivity

Figure 8: Structural information on the sparse matrices used in our experiments, ordered by increasing number of nonzeros. The first ten matrices and Cage15 are from the University of Florida sparse matrix collection [12]. Grid3D200 is a 7-point finite difference mesh generated using the Matlab Mesh Partitioning and Graph Separator Toolbox [22]. The RMat23 matrix [26], which models scale-free graphs, is generated by using repeated Kronecker products [2]. We chose parameters A = 0.7, B = C = D = 0.1 for RMat23 in order to generate skewed matrices. Webbase2001 is a crawl of the World Wide Web from the year 2001 [8].

Included in Figure 8 is the load imbalance that is likely to occur for an SpMV algorithm parallelized with respect to columns (CSC) and blocks (CSB). In the two rightmost columns, the average (mean) and the maximum number of nonzeros are shown for each matrix, computed among columns (CSC) and among blocks (CSB). The sparsity of matrices can be quantified by the average number of nonzeros per column, which is equivalent to the mean of CSC. The sparsest matrix (Rajat31) has 4.3 nonzeros per column on the average, while the densest matrices have about 73 nonzeros per column (Sme3Dc and Torso). For CSB, the reported mean/max values are obtained by setting the block dimension β to be approximately √n, so that they are comparable with statistics from CSC.

Architectures and comparisons

We ran our experiments on three multicore superscalar architectures. Opteron is a ccNUMA architecture powered by AMD Opteron 8214 (Santa Rosa) processors clocked at 2.2 GHz. Each core of Opteron has a private 1 MB L2 cache, and each socket has its own integrated memory controller. Although it is an 8-socket dual-core system, we only experimented with up to 8 processors. Harpertown is a dual-socket quad-core system running two Intel Xeon X5460s, each clocked at 3.16 GHz. Each socket has 12 MB of L2 cache, shared among four cores, and a front-side bus (FSB) running at 1333 MHz. Nehalem is a single-socket quad-core Intel Core i7 920 processor. Like Opteron, Nehalem has an integrated memory controller. Each core is clocked at 2.66 GHz and has a private 256 KB L2 cache. An 8 MB L3 cache is shared among four cores.

While Opteron has 64 GB of RAM, Harpertown and Nehalem have only 8 GB and 6 GB, respectively, which forced us to exclude our biggest test matrix (Webbase2001) from our runs on Intel architectures. We compiled our code using gcc 4.1 on Opteron and Harpertown and with gcc 4.3 on Nehalem, all with optimization flags -O2 -fno-rtti -fno-exceptions.

To evaluate our code on a single core, we compared its performance with pure OSKI matrix-vector multiplication [39] running on one processor of Opteron. We did not enable OSKI's preprocessing step, which chooses blockings for cache and register usage that are tuned to a specific matrix. We conjecture that such matrix-specific tuning techniques can be combined advantageously with our CSB data structure and parallel algorithms.

To compare with a parallel code, we used the matrix-vector multiplication of Star-P [34] running on Opteron. Star-P is a distributed-memory code that uses CSR to represent sparse matrices and distributes matrices to processor memories by equal-sized blocks of rows.

7 Experimental results

Figures 9 and 10 show how CSB_SpMV and CSB_SpMV_T, respectively, scale for the seven smaller matrices on Opteron, and Figures 11 and 12 show similar results for the seven larger matrices. In most cases, the two codes show virtually identical performance, confirming that the CSB data structure and algorithms are equally suitable for both operations. In all the parallel scaling graphs, only the values p = 1, 2, 4, 8 are reported. They should be interpreted as performance achievable by doubling the number of cores instead of as the exact performance on p threads (e.g., p = 8 is the best performance achieved for 5 ≤ p ≤ 8).

[Figure 9: bar chart; y-axis MFlops/sec (0–700), x-axis the seven smaller matrices (Asic_320k, Sme3Dc, Parabolic_fem, Mittelmann, Rucci, Torso, Kkt_power); bars for p = 1, 2, 4, 8.]
Figure 9: CSB_SpMV performance on Opteron (smaller matrices).

[Figure 10: bar chart; same axes and matrices as Figure 9.]
Figure 10: CSB_SpMV_T performance on Opteron (smaller matrices).

[Figure 11: bar chart; y-axis MFlops/sec (0–700), x-axis the seven larger matrices (Rajat31, Ldoor, Bone010, Grid3D200, RMat23, Cage15, Webbase2001); bars for p = 1, 2, 4, 8.]
Figure 11: CSB_SpMV performance on Opteron (larger matrices).

[Figure 12: bar chart; y-axis MFlops/sec (0–800); same matrices as Figure 11.]
Figure 12: CSB_SpMV_T performance on Opteron (larger matrices).


800 16
p=1
p=2
700 p=4 Asic_320k
p=8 14

600
12

500
10
MFlops/sec

Speedup
400
8
300
6
200
4
100
2
0
R

Ld

Bo

W
aj

ag

M
rid

eb
oo

ne
at

at
e1
3

ba
r
31

01
23
D

5 1 2 4 6 8 10 12 14 16
20

se
0

20
0

Processors

01
Figure 12: CSB_ S P MV_ T performance on Opteron (larger matrices). Figure 14: Parallelism test for CSB_ S P MV on Asic_ 320k obtained by
artificially increasing the flops per byte. The test shows that the algorithm
exhibits substantial parallelism and scales almost perfectly given sufficient
CSB_ S P MV CSB_ S P MV_ T memory bandwidth.
Processors
17 814 17 814
P=2 1.65 1.70 1.44 1.49
P=4 2.34 2.49 2.07 2.30
P=8 2.75 3.03 2.81 3.16 exhibit the least parallelism. Asic_ 320k is also irregular in struc-
ture, which means that our balance heuristic does not apply. Nev-
Figure 13: Average speedup results for relatively smaller (17) and larger ertheless, CSB_ S P MV scaled almost perfectly given enough flops
(814) matrices. These experiments were conducted on Opteron. per byte.
In general, we observed better speedups for larger problems. For example, the average speedup of CSB_SpMV for the first seven matrices was 2.75 on 8 processors, whereas it was 3.03 for the second set of seven matrices with more nonzeros. Figure 13 summarizes these results. The speedups are relative to the CSB code running on a single processor, which Figure 1 shows is competitive with serial CSR codes. In another study [41] on the same Opteron architecture, multicore-specific parallelization of the CSR code for 4 cores achieved speedup comparable to what we report here, albeit on a slightly different sparse-matrix test suite. That study does not consider the y ← Aᵀx operation, however, which is difficult to parallelize with CSR but which achieves the same performance as y ← Ax when using CSB.

For CSB_SpMV on 4 processors, CSB reached its highest speedup of 2.80 on the RMat23 matrix, showing that this algorithm is robust even on a matrix with highly irregular nonzero structure. On 8 processors, CSB_SpMV reached its maximum speedup of 3.93 on the Webbase2001 matrix, indicating the code's ability to handle very large matrices without sacrificing parallel scalability.

Sublinear speedup occurs only after the memory-system bandwidth becomes the bottleneck. This bottleneck occurs at different numbers of cores for different matrices. In most cases, we observed nearly linear speedup up to 4 cores. Although the speedup is sublinear beyond 4 cores, in every case (except CSB_SpMV on Mittelmann), we see some performance improvement going from 4 to 8 cores on Opteron. Sublinear speedup of SpMV on superscalar multicore architectures has been noted by others as well [41].
We conducted an additional experiment to verify that performance was limited by the memory-system bandwidth, not by lack of parallelism. We repeated each scalar multiply-add operation of the form yᵢ ← yᵢ + Aᵢⱼxⱼ a fixed number t of times. Although the resulting code computes y ← tAx, we ensured that the compiler did not optimize away any multiply-add operations. Setting t = 10 did not affect the timings significantly (flops are indeed essentially free), but for t = 100, we saw almost perfect linear speedup up to 16 cores, as shown in Figure 14. We performed this experiment with Asic_320k, the smallest matrix in the test suite, which should exhibit the least parallelism. Asic_320k is also irregular in structure, which means that our balance heuristic does not apply. Nevertheless, CSB_SpMV scaled almost perfectly given enough flops per byte.
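As a concrete illustration, the replicated update can be sketched as follows. This fragment is our own illustration over a simple triplet (coordinate) representation, not the CSB code itself, which applies the same t-fold replication inside its blocked loops. Because floating-point addition is not associative, a compiler that honors IEEE semantics cannot fuse the t additions into a single scaled multiply, so the extra flops are really executed:

    #include <cstddef>

    // Sketch of the flops-per-byte experiment (illustrative, not the
    // paper's code): each scalar update y[i] += A(i,j) * x[j] is repeated
    // t times, so the kernel performs t times the flops over the same
    // memory traffic. The net effect is y <- y + t*A*x.
    void multiply_add_replicated(std::size_t nnz,
                                 const std::size_t* row, const std::size_t* col,
                                 const double* val,   // nonzero values A(i,j)
                                 const double* x, double* y,
                                 int t)               // replication factor
    {
        for (std::size_t k = 0; k < nnz; ++k)
            for (int r = 0; r < t; ++r)               // t-fold replication
                y[row[k]] += val[k] * x[col[k]];
    }

With t = 100, each nonzero fetched from memory is amortized over 100 multiply-adds, raising the flops-per-byte ratio enough that the computation, rather than the memory system, becomes the constraint.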
The parallel performance of CSB_SpMV and CSB_SpMV_T is generally not affected by highly uneven row and column nonzero counts. The highly skewed matrices RMat23 and Webbase2001 achieved speedups as good as those for matrices with flat row and column counts. An unusual case is the Torso matrix, where both CSB_SpMV and CSB_SpMV_T were actually slower on 2 processors than serially. This slowdown does not, however, mark a plateau in performance, since Torso speeds up as we add more than 2 processors. We believe this behavior occurs because the overhead of intrablock parallelization is not amortized for 2 processors. Torso requires a large number of intrablock parallelization calls, because it is unusually irregular and dense.

Figure 15 shows the performance of CSB_SpMV on Harpertown for a subset of the test matrices. We do not report performance for CSB_SpMV_T, as it was consistently close to that of CSB_SpMV. The performance on this platform levels off beyond 4 processors for most matrices. Indeed, the average MFlops/sec on 8 processors is only 3.5% higher than on 4 processors. We believe this plateau results from insufficient memory bandwidth. The continued speedup on Opteron is due to its higher ratio of memory bandwidth (bytes) to peak performance (flops) per second.

Figure 15: CSB_SpMV performance on Harpertown, in MFlops/sec for p = 1, 2, 4, 8.

Figure 16 summarizes the performance results of CSB_SpMV for the same subset of test matrices on Nehalem. Despite having only 4 physical cores, for most matrices Nehalem achieved scaling up to 8 threads thanks to hyperthreading. Running 8 threads was necessary to utilize the processor fully, because hyperthreading fills the pipeline more effectively. We observed that the improvement from oversubscribing is not monotonic, however, because running more threads reduces the effective cache size available to each thread. Nehalem's point-to-point interconnect is faster than Opteron's (a generation-old HyperTransport 1.0), which explains its better speedup values when comparing the 4-core performance of the two architectures. Its raw performance is also impressive, beating both Opteron and Harpertown by large margins.

Figure 16: CSB_SpMV performance on Nehalem, in MFlops/sec for p = 1, 2, 4, 8.

To determine CSB's competitiveness with a conventional CSR code, we compared the performance of the CSB serial code with plain OSKI, using no matrix-specific optimizations such as register or cache blocking. Figures 17 and 18 present the results of the comparison. As can be seen from the figures, CSB achieves serial performance similar to CSR.

Figure 17: Serial performance comparison of SpMV for CSB and CSR, in MFlops/sec.

Figure 18: Serial performance comparison of SpMV_T for CSB and CSR, in MFlops/sec.
In general, CSR seems to perform best on banded matrices, all of whose nonzeros are located near the main diagonal. (The maximum distance of any nonzero from the diagonal is called the matrix's bandwidth, not to be confused with memory bandwidth.) If the matrix is banded, memory accesses to the input vector x tend to be regular and thus favorable to cache-line reuse and automatic prefetching. Strategies for reducing the bandwidth of a sparse matrix by permuting its rows and columns have been studied extensively (see [11, 37], for example). Many matrices, however, cannot be permuted to have low bandwidth. For matrices with scattered nonzeros, CSB outperforms CSR, because CSR incurs many cache misses when accessing the x vector. An example of this effect occurs for the RMat23 matrix, where the CSB implementation is almost twice as fast as CSR.
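For reference, the CSR kernel under discussion has the following generic shape (a textbook sketch, not OSKI's tuned code). The reads of rowptr, colind, and val stream sequentially whatever the matrix looks like; the gather x[colind[k]] is the one irregular access, which is why banded and scattered matrices behave so differently:

    #include <cstddef>

    // Generic serial CSR SpMV (illustrative sketch). For a banded matrix
    // the indices colind[k] within row i stay near i, so the touched
    // entries of x remain cached; for scattered nonzeros each read of
    // x[colind[k]] is a likely cache miss.
    void csr_spmv(std::size_t nrows,
                  const std::size_t* rowptr,  // row i owns [rowptr[i], rowptr[i+1])
                  const std::size_t* colind,  // column index of each nonzero
                  const double* val, const double* x, double* y)
    {
        for (std::size_t i = 0; i < nrows; ++i) {
            double acc = 0.0;
            for (std::size_t k = rowptr[i]; k < rowptr[i + 1]; ++k)
                acc += val[k] * x[colind[k]];  // irregular gather from x
            y[i] = acc;
        }
    }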
Figure 19 compares the parallel performance of the CSB algorithms with Star-P. Star-P's blockrow data distribution does not afford any flexibility for load-balancing across processors. Load balance is not an issue for matrices with nearly flat row counts, including finite-element and finite-difference matrices such as Grid3D200. Load balance does become an issue for skewed matrices such as RMat23, however. Our performance results confirm this effect. CSB_SpMV is about 500% faster than Star-P's SpMV routine for RMat23 on 8 cores. Moreover, for any number of processors, CSB_SpMV runs faster for all the matrices we tested, including the structured ones.

Figure 19: Performance comparison of parallel CSB_SpMV with Star-P, which is a parallel dialect of MATLAB, for p = 1, 2, 4, 8. The vertical axis shows the performance ratio of CSB_SpMV to Star-P in MFlops/sec. A direct comparison of CSB_SpMV_T with Star-P was not possible, because Star-P does not natively support multiplying the transpose of a sparse matrix by a vector.
8 Conclusion

Compressed sparse blocks allow parallel operations on sparse matrices to proceed either row-wise or column-wise with equal facility. We have demonstrated the efficacy of the CSB storage format for SpMV calculations on a sparse matrix or its transpose. It remains to be seen, however, whether the CSB format is limited to SpMV calculations or if it can also be effective in enabling parallel algorithms for multiplying two sparse matrices, performing LU-, LUP-, and related decompositions, linear programming, and a host of other problems for which serial sparse-matrix algorithms currently use the CSC and CSR storage formats.

The CSB format readily enables parallel SpMV calculations on a symmetric matrix where only half the matrix is stored, but we were unable to attain one optimization that serial codes exploit in this situation. In a typical serial code that computes y ← Ax, where A = (aᵢⱼ) is a symmetric matrix, when a processor fetches aᵢⱼ = aⱼᵢ out of memory to perform the update yᵢ ← yᵢ + aᵢⱼxⱼ, it can also perform the update yⱼ ← yⱼ + aᵢⱼxᵢ at the same time. This strategy halves the memory bandwidth compared to executing CSB_SpMV on the matrix, where aᵢⱼ = aⱼᵢ is fetched twice. It remains an open problem whether the 50% savings in storage for sparse matrices can be coupled with a 50% savings in memory bandwidth, which is an important factor of 2, since it appears that the bandwidth between multicore chips and DRAM will scale more slowly than core count.
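In a serial code, the trick looks like the following sketch (our illustration over a triplet form of the lower triangle, not the paper's data structures). One obstacle to parallelizing it is the second update: it scatters writes into y across the whole output vector, so workers assigned different parts of the matrix may race on the same output entries, a conflict that the row-wise (or column-wise) CSB traversal otherwise avoids:

    #include <cstddef>

    // Serial symmetric SpMV sketch: only entries with row[k] >= col[k]
    // (the lower triangle plus the diagonal) are stored. Each off-diagonal
    // a_ij, fetched once, serves both y[i] += a_ij * x[j] and
    // y[j] += a_ij * x[i], halving memory traffic relative to storing and
    // fetching both triangles.
    void symmetric_spmv(std::size_t nnz,
                        const std::size_t* row, const std::size_t* col,
                        const double* val, const double* x, double* y)
    {
        for (std::size_t k = 0; k < nnz; ++k) {
            const std::size_t i = row[k], j = col[k];
            y[i] += val[k] * x[j];
            if (i != j)
                y[j] += val[k] * x[i];  // reuse of the same fetched a_ij
        }
    }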
References

[1] M. D. Adams and D. S. Wise. Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms. In MSPC, pages 41–50, New York, NY, USA, 2006. ACM.
[2] D. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, B. Mann, and T. Meuse. HPCS scalable synthetic compact applications #2. Version 1.1.
[3] G. E. Blelloch. Programming parallel algorithms. CACM, 39(3), Mar. 1996.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In PPoPP, pages 207–216, Santa Barbara, California, July 1995.
[5] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. JACM, 46(5):720–748, Sept. 1999.
[6] A. Buluç and J. R. Gilbert. On the representation and multiplication of hypersparse matrices. In IPDPS, pages 1–11, 2008.
[7] Ü. Çatalyürek and C. Aykanat. A fine-grain hypergraph model for 2D decomposition of sparse matrices. In IPDPS, page 118, Washington, DC, USA, 2001. IEEE Computer Society.
[8] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[9] Cilk Arts, Inc., Burlington, MA. Cilk++ Programmer's Guide, 2009. Available from http://www.cilk.com/.
[10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009.
[11] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 24th National Conference, pages 157–172, New York, NY, USA, 1969. ACM.
[12] T. A. Davis. University of Florida sparse matrix collection. NA Digest, 92, 1994.
[13] T. A. Davis. Direct Methods for Sparse Linear Systems. SIAM, Philadelphia, PA, 2006.
[14] J. Dongarra. Sparse matrix storage formats. In Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, editors, Templates for the Solution of Algebraic Eigenvalue Problems: a Practical Guide. SIAM, 2000.
[15] J. Dongarra, P. Koev, and X. Li. Matrix-vector and matrix-matrix multiplication. In Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, editors, Templates for the Solution of Algebraic Eigenvalue Problems: a Practical Guide. SIAM, 2000.
[16] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, New York, 1986.
[17] S. C. Eisenstat, M. C. Gursky, M. H. Schultz, and A. H. Sherman. Yale sparse matrix package I: The symmetric codes. International Journal for Numerical Methods in Engineering, 18:1145–1151, 1982.
[18] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3–45, 2004.
[19] M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin. Reducers and other Cilk++ hyperobjects. In SPAA, Calgary, Canada, 2009.
[20] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In SIGPLAN, pages 212–223, Montreal, Quebec, Canada, June 1998.
[21] A. George and J. W. Liu. Computer Solution of Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[22] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation and experiments. SIAM Journal on Scientific Computing, 19(6):2091–2110, 1998.
[23] J. R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM J. Matrix Anal. Appl., 13:333–356, 1991.
[24] E.-J. Im, K. Yelick, and R. Vuduc. Sparsity: Optimization framework for sparse matrix kernels. International Journal of High Performance Computing Applications, 18(1):135–158, 2004.
[25] K. Kourtis, G. Goumas, and N. Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In Computing Frontiers (CF), pages 87–96, New York, NY, USA, 2008. ACM.
[26] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, pages 133–145, 2005.
[27] K. P. Lorton and D. S. Wise. Analyzing block locality in Morton-order and Morton-hybrid matrices. SIGARCH Computer Architecture News, 35(4):6–12, 2007.
[28] H. M. Markowitz. The elimination form of the inverse and its application to linear programming. Management Science, 3(3):255–269, 1957.
[29] G. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Canada, Mar. 1966.
[30] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. When cache blocking of sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication and Computing, 18(3):297–311, 2007.
[31] R. Raman and D. S. Wise. Converting to and from dilated integers. IEEE Trans. on Computers, 57(4):567–573, 2008.
[32] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, PA, second edition, 2003.
[33] N. Sato and W. F. Tinney. Techniques for exploiting the sparsity of the network admittance matrix. IEEE Trans. Power Apparatus and Systems, 82(69):944–950, Dec. 1963.
[34] V. Shah and J. R. Gilbert. Sparse matrices in Matlab*P: Design and implementation. In HiPC, pages 144–155, 2004.
[35] B. Stroustrup. The C++ Programming Language. Addison-Wesley, third edition, 2000.
[36] W. Tinney and J. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE, 55(11):1801–1809, Nov. 1967.
[37] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM J. Research and Development, 41(6):711–726, 1997.
[38] B. Vastenhouw and R. H. Bisseling. A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Rev., 47(1):67–95, 2005.
[39] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1):521+, 2005.
[40] J. Willcock and A. Lumsdaine. Accelerating sparse matrix computations via data compression. In ICS, pages 307–316, New York, NY, USA, 2006. ACM.
[41] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.