
Parallel Formulations of Matrix-Vector Multiplication for Matrices with Large Aspect Ratios

John Lloyd (Member, IEEE)
Department of Computation
University of Manchester Institute of Science and Technology
Manchester M60 1QD, United Kingdom
jlloyd@ap.co.umist.ac.uk
Abstract
Matrix-vector multiplication is an important component of a large number of parallel algorithms. Over the years, many parallel formulations of matrix-vector multiplication have been proposed. However, they all tend to suffer from the same basic problem: while they may perform well for square matrices, or matrices with moderate aspect ratios, their efficiency deteriorates considerably for matrices with large aspect ratios. This paper proposes novel techniques for improving the efficiency of matrix-vector multiplication for matrices with large aspect ratios. The basic approach involves partitioning the matrix and vector over a logical array of processors, which is then embedded in the physical architecture. The dimensions of the logical array are chosen so as to minimise the communication overhead associated with the algorithm. Two popular families of parallel architectures are considered: square meshes with wraparound connections, and hypercubes. Theoretical results show that, for large numbers of processors, and for matrices with large aspect ratios, the new schemes perform significantly better than existing ones.

1 Introduction

Matrix-vector multiplication is an important component of a large number of parallel algorithms. Over the years, many parallel formulations of matrix-vector multiplication have been proposed [1, 2]. However, they all tend to suffer from one basic problem: while they may perform well for square matrices, or matrices with moderate aspect ratios (the ratio of the longest side of the matrix to the shortest side), their efficiency deteriorates considerably for matrices with large aspect ratios. This paper proposes novel techniques for improving the efficiency of matrix-vector multiplication for matrices with large aspect ratios. We specifically consider the case where the matrix has few zero entries (in other words, the matrix is dense). Sparse matrix-vector multiplication is discussed by Cook and Sadecki [3], and Ferng and others [4].

The basic approach involves partitioning the matrix and vector over a logical array of processors, which is then embedded in the physical architecture. The dimensions of the logical array are chosen so as to minimise the communication overhead associated with the algorithm. Two popular families of parallel architectures are considered: square meshes with wraparound connections, and hypercubes. In particular, we consider mesh-connected and hypercube-connected parallel computers that employ store-and-forward routing with single-port communication [2]. The time to deliver an m-word message between two processors that are l links apart is $t_s + t_w m l$, where $t_s$ is the startup time and $t_w$ is the per-word transfer time (the inverse of the bandwidth). Theoretical results show that, for large numbers of processors, and for matrices with large aspect ratios, the new schemes perform significantly better than the existing ones.

The remainder of this paper is organised as follows. Section 2 reviews some common parallel formulations of matrix-vector multiplication and obtains expressions for their parallel execution time on mesh-connected and hypercube-connected architectures. Section 3 describes techniques for improving the efficiency of matrix-vector multiplication for matrices with large aspect ratios and compares their theoretical performance with that of the original algorithms. Section 4 contains some conclusions and suggestions for further work.
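To make the cost model concrete, the delivery time above can be written as a one-line helper. This is a minimal sketch under the paper's assumptions (store-and-forward routing, single-port communication); the function and parameter names are ours, not from the paper:

```python
def message_time(t_s, t_w, m_words, l_links):
    """Store-and-forward delivery time for an m-word message
    travelling over l links: one startup cost, plus the per-word
    cost paid on every link (single-port communication assumed)."""
    return t_s + t_w * m_words * l_links

# e.g. message_time(t_s=50.0, t_w=1.0, m_words=64, l_links=4) -> 306.0
```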

2 Parallel Formulations of Matrix-Vector Multiplication

In this section we review some common parallel algorithms for multiplying a dense m × n matrix with an n × 1 vector. At least four distinct parallel formulations of matrix-vector multiplication are possible, depending on the way in which the matrix and vector elements are partitioned among the processors.

Figure 1. Block-partitioned matrix-vector multiplication with non-uniform distribution of the vector.

2.1 Row Partitioning

One of the simplest parallel formulations of matrix-vector multiplication is based on a rowwise partitioning of the matrix among p processors [2, 5]. Each processor is initially assigned m/p complete rows of the matrix and n/p elements of the vector. Since the vector has to be multiplied with each row of the matrix, every processor needs the entire vector. To accomplish this, each processor broadcasts its portion of the vector to every other processor in an all-to-all broadcast operation. After this communication step, each processor has the entire vector available locally, together with its m/p rows of the matrix. Using these, it computes m/p elements of the product vector.

On a mesh, the all-to-all broadcast takes time $2t_s\sqrt{p} + t_w n$ [2]. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends $t_c mn/p$ units of time in multiplying its m/p rows with the vector. Hence, the parallel execution time for this procedure is

$$T_P = t_c \frac{mn}{p} + 2t_s\sqrt{p} + t_w n \qquad (1)$$

On a hypercube, the all-to-all broadcast takes time $t_s\log p + t_w n$ [2]. As the time spent performing computation is still $t_c mn/p$, the parallel execution time is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w n \qquad (2)$$

2.2 Column Partitioning

While the previous scheme involved partitioning the rows of the matrix, an alternative approach involves partitioning the columns among the processors [2, 5]. Each processor is initially assigned n/p complete columns of the matrix and the n/p elements of the vector corresponding to those columns. As the vector is already aligned along the rows of the matrix, no initial communication is required. Each processor multiplies its m partial rows of the matrix with its portion of the vector to obtain m partial product-sums. The partial product-sums are then accumulated using a multinode accumulation operation [2] so that each processor ends up with m/p elements of the product vector.

First, consider the case of the mesh. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends $t_c mn/p$ units of time in multiplying its m partial rows with its n/p elements of the vector. Ignoring the time required to perform additions, a multinode accumulation of messages of size m/p among p processors takes time $2t_s\sqrt{p} + t_w m$ [2]. Hence, the parallel execution time for the entire procedure is

$$T_P = t_c \frac{mn}{p} + 2t_s\sqrt{p} + t_w m \qquad (3)$$

On a hypercube, the multinode accumulation can be performed in time $t_s\log p + t_w m$ [2]. As the time spent performing computation is unchanged, the parallel execution time is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w m \qquad (4)$$

When m < n, this scheme has a lower communication overhead than the row-partitioned scheme.
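The four expressions above are straightforward to compare numerically. The following sketch evaluates Equations (1)-(4); the function names and the illustrative parameter values are ours, not from the paper:

```python
import math

def row_mesh(m, n, p, tc, ts, tw):       # Equation (1)
    return tc * m * n / p + 2 * ts * math.sqrt(p) + tw * n

def row_hypercube(m, n, p, tc, ts, tw):  # Equation (2)
    return tc * m * n / p + ts * math.log2(p) + tw * n

def col_mesh(m, n, p, tc, ts, tw):       # Equation (3)
    return tc * m * n / p + 2 * ts * math.sqrt(p) + tw * m

def col_hypercube(m, n, p, tc, ts, tw):  # Equation (4)
    return tc * m * n / p + ts * math.log2(p) + tw * m

# For a wide matrix (m < n) the column-partitioned scheme wins,
# because its overhead term is t_w*m rather than t_w*n:
print(row_hypercube(64, 64000, 4096, 1.0, 50.0, 1.0))  # 65600.0
print(col_hypercube(64, 64000, 4096, 1.0, 50.0, 1.0))  # 1664.0
```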

2.3 Block Partitioning with Non-Uniform Distribution of the Vector

Another widely-used parallel formulation is based on a two-dimensional partitioning of the matrix [2]. Figure 1 illustrates the data partitioning and communication for this scheme. The m × n matrix is partitioned among p processors so that each processor receives a block of size m/√p × n/√p (Figure 1(a)). When the matrix is partitioned in this way it is convenient to regard the processors as being configured as a √p × √p array. The n × 1 vector is partitioned over the topmost row of the array so that each processor receives n/√p elements. In the first step, the processors in the topmost row broadcast their portions of the vector to all the other processors in their respective columns (Figure 1(b)). After this columnwise one-to-all broadcast, the vector is aligned along the rows of the matrix.


Figure 2. Block-partitioned matrix-vector multiplication with uniform distribution of the vector.

Each processor then multiplies its m/√p × n/√p block of the matrix with the portion of the vector received during the first step to produce m/√p partial product-sums. The partial product-sums are then accumulated along each row to obtain the product vector. Hence, the last step of the algorithm is a single-node accumulation [2] of the m/√p values in each row, with the leftmost processor of the row as the destination (Figure 1(c)).

On a √p × √p mesh of processors, the columnwise one-to-all broadcast can be performed in time $t_s\sqrt{p} + t_w n$ [2]. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends about $t_c mn/p$ time in multiplying its block of the matrix with the portion of the vector received during the first step. Ignoring the time required to perform additions, the final rowwise single-node accumulation can also be performed in time $t_s\sqrt{p} + t_w m$ [2]. Hence, the parallel execution time for the entire procedure is

$$T_P = t_c \frac{mn}{p} + 2t_s\sqrt{p} + t_w(m + n) \qquad (5)$$

This algorithm can also be mapped quite easily onto a hypercube using a well-known embedding of a √p × √p array in a $p = 2^{2k}$ processor hypercube [6]. The processors in each row and column of the array form a $2^k = \sqrt{p}$ processor hypercube. Also, the communication subnetworks corresponding to the different rows and columns are disjoint, so the columnwise one-to-all broadcasts and rowwise single-node accumulations can be implemented using hypercube-based algorithms. The columnwise one-to-all broadcast requires $t_s\log\sqrt{p} + t_w n\log\sqrt{p}/\sqrt{p}$ time [2]. The time spent performing computation is the same as for the mesh. Ignoring the time needed to perform additions, the final rowwise single-node accumulation takes time $t_s\log\sqrt{p} + t_w m\log\sqrt{p}/\sqrt{p}$ [2]. Hence, the overall parallel execution time is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w(m + n)\frac{\log p}{2\sqrt{p}} \qquad (6)$$

The main advantage of this scheme over the row-partitioned and column-partitioned schemes is that it exploits a greater degree of concurrency and can therefore use more processors to run the same-sized problem faster. For hypercubes with large numbers of processors, the block-partitioned formulation is also faster than either of the other two schemes.

2.4 Block Partitioning with Uniform Distribution of the Vector

For large problem sizes, the simple one-to-all broadcast and single-node accumulation algorithms which are assumed to be used in the previous section become very inefficient. This is because they fail to balance the communication load evenly among the processors. One way of achieving a more balanced communication load is by using pipelined or hybrid communication algorithms [2, 7, 8]. Alternatively, we can distribute the vector among the processors in such a way that the communication load is intrinsically more balanced [1, 5]. Figure 2 shows an example of this type of approach. The m × n matrix is block-partitioned among p processors such that each processor receives a block of size m/√p × n/√p (Figure 2(a)). The vector is uniformly partitioned among the processors so that each processor receives n/p vector elements. The portions of the vector are mapped onto the √p × √p array in column-major order (Figure 2(b)). Notice that each column of processors collectively contains the vector elements corresponding to the columns of the matrix that reside in it. In the first step, each processor broadcasts its portion of the vector to the other processors in the same column. This involves performing a columnwise all-to-all broadcast with messages of size n/p (Figure 2(b)). After this communication step, the vector is aligned along the rows of the matrix. Each processor then multiplies its m/√p × n/√p block of the matrix with the vector elements received during the previous step, to produce m/√p partial product-sums. The partial sums are then accumulated along each row to obtain the product vector. This is achieved using a rowwise multinode accumulation; each processor ends up with m/p elements of the product vector.

On a mesh, the columnwise all-to-all broadcast takes time $t_s\sqrt{p} + t_w n/\sqrt{p}$, and, ignoring the time required to perform additions, the final rowwise multinode accumulation can be performed in time $t_s\sqrt{p} + t_w m/\sqrt{p}$ [2]. Hence, the parallel execution time for the entire procedure is

$$T_P = t_c \frac{mn}{p} + 2t_s\sqrt{p} + t_w\frac{m + n}{\sqrt{p}} \qquad (7)$$

On a p-processor hypercube, the columnwise all-to-all broadcast and rowwise multinode accumulation can be implemented using hypercube-based communication procedures. The columnwise all-to-all broadcast requires $t_s\log\sqrt{p} + t_w n/\sqrt{p}$ time [2]. As in the case of the mesh, the time spent performing computation is $t_c mn/p$. Ignoring the time required to perform additions, the final rowwise multinode accumulation takes $t_s\log\sqrt{p} + t_w m/\sqrt{p}$ time [2]. Hence, the parallel execution time for the entire procedure is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w\frac{m + n}{\sqrt{p}} \qquad (8)$$

While this scheme does have a smaller communication overhead than the former block-partitioned scheme, it can only be used when the problem size is sufficiently large compared with the number of processors.
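As a companion to the sketch in Section 2.2, the block-partitioned costs of Equations (5)-(8) can be evaluated the same way. Again this is illustrative only, with function names of our own choosing:

```python
import math

def block_nonuniform_mesh(m, n, p, tc, ts, tw):       # Equation (5)
    return tc * m * n / p + 2 * ts * math.sqrt(p) + tw * (m + n)

def block_nonuniform_hypercube(m, n, p, tc, ts, tw):  # Equation (6)
    return (tc * m * n / p + ts * math.log2(p)
            + tw * (m + n) * math.log2(p) / (2 * math.sqrt(p)))

def block_uniform_mesh(m, n, p, tc, ts, tw):          # Equation (7)
    return (tc * m * n / p + 2 * ts * math.sqrt(p)
            + tw * (m + n) / math.sqrt(p))

def block_uniform_hypercube(m, n, p, tc, ts, tw):     # Equation (8)
    return (tc * m * n / p + ts * math.log2(p)
            + tw * (m + n) / math.sqrt(p))

# Note the (m + n) factor in every t_w term: it is this factor that
# grows with the matrix aspect ratio, motivating Section 3.
```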

3 Improving Efficiency for Matrices with Large Aspect Ratios

In the previous section, four basic algorithms for performing matrix-vector multiplication were described. For matrices with moderate aspect ratios, the block-partitioned schemes tend to result in smaller communication overheads. However, as the aspect ratio is increased, they become increasingly inefficient. The source of this inefficiency is the factor m + n that appears in the coefficient of the per-word transfer time $t_w$ in Equations (5) to (8); this term increases as the matrix aspect ratio increases. To illustrate just how dramatic this effect can be, consider a matrix with an aspect ratio of 1000. Matrices with this sort of aspect ratio are not uncommon in some types of neural network model [9]. For such a matrix, the term m + n is over 30 times larger than that associated with a square matrix having the same number of elements. This section describes techniques for improving the efficiency of block-partitioned algorithms for matrices with large aspect ratios. These techniques basically involve block-partitioning the matrix over a rectangular array of processors, and then embedding the rectangular array in a hypercube or mesh.

Figure 3. Block partitioning an m × n matrix over an r × s array (r = 2, s = 4).

3.1 Improving Efficiency on a Hypercube

Both of the block-partitioned approaches described in Section 2 can be extended by partitioning the m × n matrix over an r × s array (where $rs = p = 2^{2k}$) such that each processor receives a block of size m/r × n/s. Figure 3 illustrates this for a 2 × 4 array. The r × s array is then embedded in a hypercube [6]. The processors in each column of the embedded array correspond to an r-processor hypercube and the processors in each row correspond to an s-processor hypercube.

In the first of the two block-partitioned algorithms, the vector is uniformly distributed over the topmost row of the array so that each processor receives n/s elements. The initial columnwise one-to-all broadcasts take place among r processors with messages of size n/s. The time required for this operation is $(t_s + t_w n/s)\log r$ [2]. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends about $t_c mn/p$ time in multiplying its m/r × n/s block of the matrix with the n/s vector elements received during the first step. The final rowwise single-node accumulations take place among s processors


with messages of size m/r. Ignoring the time needed to perform additions, the time required for this operation is $(t_s + t_w m/r)\log s$ [2]. After this communication step, each processor in the leftmost column contains m/r elements of the product vector. The parallel execution time for the overall procedure is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w\left(\frac{n\log r}{s} + \frac{m\log s}{r}\right) \qquad (9)$$

By applying some elementary calculus we can show that the coefficient of $t_w$ is minimised when

$$\frac{r(\ln r + 1)}{s(\ln s + 1)} = \frac{m}{n} \qquad (10)$$

For a matrix aspect ratio of m/n = 1/1000 and p = 4096, the optimal configuration for the array is 4 × 1024. This configuration reduces the coefficient of $t_w$ by a factor of over 20 compared with that associated with a square array configuration.

In the second block-partitioned algorithm, the vector is uniformly partitioned into blocks of size n/p. These blocks are then allocated to the r × s array in column-major order. The initial columnwise all-to-all broadcasts take place among r processors with messages of size n/p. The time required for this operation is $t_s\log r + t_w n/s$ [2]. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends about $t_c mn/p$ time in multiplying its m/r × n/s block of the matrix with the n/s vector elements received during the first step. The final rowwise multinode accumulations take place among s processors with messages of size m/p. Ignoring the time needed to perform additions, the time required for this operation is $t_s\log s + t_w m/r$ [2]. After this communication step, each processor contains m/p elements of the product vector. The parallel execution time for the overall procedure is

$$T_P = t_c \frac{mn}{p} + t_s\log p + t_w\left(\frac{n}{s} + \frac{m}{r}\right) \qquad (11)$$

Again, applying some elementary calculus, we can show that the coefficient of $t_w$ is minimised when

$$\frac{r}{s} = \frac{m}{n} \qquad (12)$$

In other words, the communication overhead is minimised when the aspect ratio of the array matches the aspect ratio of the matrix. For a matrix aspect ratio of m/n = 1/1000 and p = 4096, the optimal configuration for the array is 2 × 2048. This configuration reduces the coefficient of $t_w$ by a factor of over 15 compared with that associated with a square array configuration.
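The optimality conditions (10) and (12) can also be checked by direct enumeration over the feasible power-of-two configurations. The following sketch (our own code, not from the paper) reproduces the 4 × 1024 and 2 × 2048 configurations quoted above:

```python
import math

def best_config(p, m, n, coeff):
    """Search over r * s == p, with r a power of two, for the
    configuration that minimises the given t_w coefficient."""
    best = None
    for k in range(int(math.log2(p)) + 1):
        r, s = 2 ** k, p // 2 ** k
        c = coeff(r, s, m, n)
        if best is None or c < best[0]:
            best = (c, r, s)
    return best

# t_w coefficient of Equation (9): non-uniform vector distribution
eq9 = lambda r, s, m, n: n * math.log2(r) / s + m * math.log2(s) / r
# t_w coefficient of Equation (11): uniform vector distribution
eq11 = lambda r, s, m, n: n / s + m / r

m, n, p = 64, 64000, 4096          # aspect ratio m/n = 1/1000
print(best_config(p, m, n, eq9))   # (285.0, 4, 1024): coefficient, r, s
print(best_config(p, m, n, eq11))  # (63.25, 2, 2048)
```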

Figure 4. Embedding an r × s array in a √p × √p mesh (s > r).

3.2 Improving Efficiency on a Mesh

Whereas it is relatively easy to embed rectangular arrays in a hypercube, the same does not hold true for the mesh. A number of techniques have been developed for embedding rectangular arrays into square meshes [10], but they are fairly complex and involve irregular patterns of communication. A simpler technique for embedding an r × s array in a √p × √p mesh (where s > r and $rs = p = 2^{2k}$) is illustrated in Figure 4. In this scheme, the r × s array is sliced into blocks of size r × √p which are then stacked one on top of the other in the √p × √p mesh. The situation where r > s is dealt with similarly, by slicing the r × s array into blocks of size √p × s which are then stacked side by side in the √p × √p mesh.
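The slicing scheme of Figure 4 amounts to a simple coordinate mapping from the logical r × s array to the physical √p × √p mesh. A sketch for the s > r case, with function and variable names of our own choosing:

```python
import math

def embed(i, j, r, s):
    """Map logical position (i, j) in an r x s array (s > r, r * s = p)
    onto a sqrt(p) x sqrt(p) mesh by cutting the array into r x sqrt(p)
    slices and stacking them vertically."""
    q = int(math.isqrt(r * s))    # mesh side length, sqrt(p)
    block, col = divmod(j, q)     # which slice, and column inside it
    return (block * r + i, col)   # physical (row, column) in the mesh

# An 8 x 512 array in a 64 x 64 mesh: element (7, 511) of the logical
# array lands in the bottom-right corner of the mesh.
assert embed(7, 511, 8, 512) == (63, 63)
```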


Figure 5. Block-partitioned matrix-vector multiplication with non-uniform distribution of the vector.

For the first of the two block-partitioned algorithms, the vector is uniformly distributed over the topmost row of processors in the embedded rectangular array such that each processor receives n/s elements. The initial columnwise one-to-all broadcasts in the embedded array (Figure 5(a)) are non-interfering and can be performed in approximately $r(t_s + t_w n/s)$ time [2]. Assuming that a single multiplication-addition pair takes $t_c$ units of time, each processor spends about $t_c mn/p$ time in multiplying its m/r × n/s block of the matrix with the n/s vector elements received during the first step. The rowwise single-node accumulations in the embedded array are performed in two phases. In the first phase, single-node accumulations are performed in each row of the mesh with the processor in the leftmost column as the destination (Figure 5(b)). These accumulations are non-interfering and can be performed in approximately $\sqrt{p}(t_s + t_w m/r)$ time [2]. At the end of this phase, each processor in the leftmost column contains m/r partial product-sums. These partial product-sums are then accumulated over the corresponding processors in each of the stacked blocks, leaving the product vector distributed uniformly over the leftmost column of processors in the topmost block (Figure 5(c)). This communication is implemented by shifting the partial sums in the leftmost column upwards, r steps at a time. Ignoring the time needed to perform additions, each r-step shift takes $r(t_s + t_w m/r)$ time [2]. Since there are $\sqrt{p}/r - 1$ such shifts, this phase of the communication takes time $(\sqrt{p} - r)(t_s + t_w m/r)$. The parallel execution time for the entire procedure is

$$T_P = t_c \frac{mn}{p} + 2t_s\sqrt{p} + t_w\left(\frac{rn}{s} + \frac{2m\sqrt{p}}{r} - m\right) \qquad (13)$$

By applying some elementary calculus we can show that the coefficient of $t_w$ is minimised when

$$\frac{r}{s} = \left(\frac{m}{n}\right)^{2/3} \qquad (14)$$

For a matrix aspect ratio of m/n = 1/1000 and p = 4096, the optimal aspect ratio of the array is r/s = 1/100. Under the assumptions made above, we have two possible choices for the configuration: either 8 × 512 or 4 × 1024. In fact, the former of these two configurations gives the lower communication overhead. By embedding an 8 × 512 array in the mesh using the technique described above, we can reduce the coefficient of $t_w$ by a factor of over 30 compared with that associated with a square array configuration.

For the second block-partitioned algorithm, the vector is uniformly partitioned into blocks of size n/p. These blocks are then allocated to the embedded rectangular array in column-major order. The initial columnwise all-to-all broadcasts in the embedded array (Figure 6(a)) are non-interfering and can be performed in approximately $2r(t_s + t_w n/p)$ time using shift-based communication [2]. If a single multiplication-addition pair takes $t_c$ units of time, each processor spends about $t_c mn/p$ time in multiplying its m/r × n/s block of the matrix with the n/s vector elements received during the first step. Each processor now contains m/r partial sums which must be accumulated over the rows of the embedded array to obtain the product vector. The rowwise multinode accumulations in the embedded array are performed in two phases. In the first phase, multinode accumulations are performed in each row of the mesh so that each processor ends up with $m/(r\sqrt{p})$ partial product-sums (Figure 6(b)). These accumulations are non-interfering and, ignoring the time required to perform additions, can be performed in approximately $t_s\sqrt{p} + t_w m/r$ time [2]. The partial product-sums are then accumulated over the corresponding processors in each of the stacked blocks, so that each processor ends up with m/p elements of the product vector (Figure 6(c)). This involves shifting the partial sums around the mesh in a vertical direction, r steps at a time. Ignoring the time required to perform additions, each r-step shift takes $r(t_s + t_w m/p)$ time [2]. Since there are $\sqrt{p}/r - 1$ such shifts, this phase of the communication takes time $(\sqrt{p} - r)(t_s + t_w m/p)$. The parallel execution time for the entire procedure is


$$T_P = t_c \frac{mn}{p} + t_s(2\sqrt{p} + r) + t_w\left(\frac{2rn}{p} + \frac{m}{r} + \frac{m}{\sqrt{p}} - \frac{mr}{p}\right) \qquad (15)$$

By applying some elementary calculus we can show that the coefficient of $t_w$ is minimised when

$$\frac{r}{s} = \frac{m}{2n - m} \qquad (16)$$

For a matrix aspect ratio of m/n = 1/500 and p = 4096, the optimal configuration for the array is approximately 2 × 2048. By embedding the 2 × 2048 array in the square mesh using the technique described above, we can reduce the coefficient of $t_w$ by a factor of over 10 compared with that associated with a square array configuration.

Figure 6. Block-partitioned matrix-vector multiplication with uniform distribution of the vector.
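As on the hypercube, the best power-of-two configuration for the mesh can be found by enumerating the $t_w$ coefficients of Equations (13) and (15) directly. A sketch (our own code) that reproduces the 8 × 512 choice for the first algorithm and the 2 × 2048 configuration discussed above:

```python
import math

def best_mesh_config(p, m, n, coeff):
    """Minimise a t_w coefficient over configurations r * s == p
    with r and s powers of two."""
    configs = [(2 ** k, p // 2 ** k) for k in range(int(math.log2(p)) + 1)]
    return min(configs, key=lambda rs: coeff(rs[0], rs[1], m, n))

# t_w coefficient of Equation (13): non-uniform vector distribution
eq13 = lambda r, s, m, n: r * n / s + 2 * m * math.sqrt(r * s) / r - m
# t_w coefficient of Equation (15): uniform vector distribution
eq15 = lambda r, s, m, n: (2 * r * n / (r * s) + m / r
                           + m / math.sqrt(r * s) - m * r / (r * s))

print(best_mesh_config(4096, 64, 64000, eq13))   # m/n = 1/1000 -> (8, 512)
print(best_mesh_config(4096, 128, 64000, eq15))  # m/n = 1/500  -> (2, 2048)
```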

4 Conclusions

In this paper we have seen that the efficiency of many existing parallel formulations of matrix-vector multiplication deteriorates for matrices with large aspect ratios. Novel techniques have been proposed for improving the efficiency in this situation. The basic approach has involved partitioning the matrix and vector over a logical array of processors which is then embedded in the physical architecture. The dimensions of the logical array are chosen so as to minimise the communication overhead associated with the algorithm. The process of selecting an optimal configuration for the array is non-trivial and depends not only on the aspect ratio of the matrix, but also on the type of communication used in the parallel algorithm. Theoretical results show that, for large numbers of processors, and for matrices with large aspect ratios, the new schemes perform significantly better than the existing ones. We are in the process of implementing these algorithms on a Parsys SN9500 multicomputer.

5 References

1. Fox, G.C., et al., Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
2. Kumar, V., et al., Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, Redwood City, CA, 1994.
3. Cook, R. and Sadecki, J., Sparse Matrix-Vector Product on Distributed Memory MIMD Architectures, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, Virginia, SIAM, 1993.
4. Ferng, W., et al., Basic Sparse Matrix Computations on Massively Parallel Computers, Army High Performance Computing Research Center, Technical Report 92-084, July 1992.
5. Mathur, K.K. and Johnsson, S.L., All-to-All Communication on the Connection Machine CM-200, Center for Research in Computing Technology, Harvard University, Technical Report TR-02-93, January 1993.
6. Saad, Y. and Schultz, M.H., Topological Properties of Hypercubes, IEEE Transactions on Computers, 1988, 37(7), pp. 867-872.
7. Saad, Y. and Schultz, M.H., Data Communication in Parallel Architectures, Parallel Computing, 1989, 11, pp. 131-150.
8. Johnsson, S.L. and Ho, C.-T., Optimum Broadcasting and Personalized Communication in Hypercubes, IEEE Transactions on Computers, 1989, 38(9), pp. 1249-1268.
9. Ienne, P., Architectures for Neuro-Computers, École Polytechnique Fédérale de Lausanne, Technical Report 93/21, January 1993.
10. Aleliunas, R. and Rosenberg, A.L., On Embedding Rectangular Grids in Square Grids, IEEE Transactions on Computers, 1982, 31(9), pp. 907-913.

