
Parallel Computing

Numerical linear algebra II


Thorsten Grahs, 29.06.2015

Overview

Sparse matrices
Example: PageRank
Parallel matrix-matrix multiplication
Virtual topologies in MPI


Matrices are extremely important in HPC


While it's certainly not the case that high-performance
computing is all about computing with matrices, matrix
operations are key to many important HPC applications.
Many important applications can be reduced to operations
on matrices, including (but not limited to)
1. searching and sorting
2. numerical simulation of physical processes
3. optimization

The list of the top 500 supercomputers in the world
(http://www.top500.org) is determined by a benchmark
program (the LINPACK benchmark) that performs matrix operations.


Dense vs. sparse matrices


A matrix A ∈ R^(m×n) is dense if all or most of its entries are non-zero.
Storing a dense matrix (sometimes called a full matrix) requires storing all m·n elements of the matrix.
Usually an array data structure is used to store a dense matrix.
A matrix is sparse if most of its entries are zero.
Here we expect the number of zeros to far exceed the number of non-zeros.
It is often most efficient to store only the non-zero entries of a sparse matrix, but this requires that location information also be stored (see the compressed sparse row sketch below).
Arrays and lists are most commonly used to store sparse matrices.
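To make the storage trade-off concrete, here is a minimal sketch of the widely used compressed sparse row (CSR) layout in C. This is my own illustration, not part of the lecture; the names csr_matrix and csr_matvec are made up for the example.

#include <stdio.h>

/* Compressed Sparse Row (CSR): store only the non-zeros plus their locations. */
typedef struct {
    int n;          /* number of rows                            */
    int nnz;        /* number of stored non-zeros                */
    int *row_ptr;   /* size n+1: start of each row in val/col    */
    int *col_idx;   /* size nnz: column index of each non-zero   */
    double *val;    /* size nnz: the non-zero values             */
} csr_matrix;

/* y = A*x for a CSR matrix A (illustrative helper) */
void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

Instead of m·n values, CSR stores nnz values, nnz column indices and n+1 row pointers, which pays off as soon as most entries are zero.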

Dense matrix examples


Find a matrix representing the complete graph on a set of nodes.
The ij-entry describes the edge connecting the node corresponding to row i with the node corresponding to column j.
Use 0 if a connection is missing.


Sparse matrix examples


Here, not every node is connected to every other node.
This results in a sparse matrix representation.
One can think of the ij-entries as weights (importance) of the node connections.


Example | PageRank
Google's search algorithm works with PageRank.
For each web page a weight (PageRank) is computed.
The weight is higher when more pages link to this page.
The weight is even higher when the linking pages themselves have higher weights.
Thus, the weight PR_i of page i is computed from the weights PR_j of the web pages j linking to it.
If page j links to c_j web pages, its weight PR_j is distributed over these linked pages.
Patent US6285999: Method for node ranking in a linked database. Registered 10.01.1997 by Lawrence Page.

Example | PageRank

[Figure from Wikipedia (http://de.wikipedia.org/wiki/PageRank): PageRank example with d = 0.85]

Example | PageRank
This gives rise to the following recursive formula

    PR_i = (1-d)/n + d * Σ_{j:(j,i)} PR_j / c_j

n   total number of pages
d   damping factor between 0 and 1

This can be written as a linear system

    M · PR = ((1-d)/n) · 1

with M_ij = δ_ij − d·L_ij, 1 the vector of all ones, and L a (sparse) matrix with

    L_ij = 1/c_j,  if page j links to page i
    L_ij = 0,      otherwise.

Example | PageRank
This system leads to the eigenvalue problem

    P^T · PR = PR

with P the so-called Google matrix.
For d < 1 the solution of the system is unique
(theorem of Perron-Frobenius).
The system is solved with an iterative algorithm; a sketch of such an iteration follows below.

[Figure: BiCGstab vs. Reverse GS for a matrix with n = 2 million]

G. M. Del Corso, A. Gullì, F. Romani, Comparison of Krylov subspace methods on the PageRank problem, J. Comput. Appl. Math., Vol. 210(1-2), 2007.
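As an illustration of such an iterative solver, here is a minimal sketch of the fixed-point iteration applied directly to the recursion PR_i = (1-d)/n + d·Σ PR_j/c_j. This is not code from the lecture; the in-link arrays in_ptr and in_src and the out-degree array out_deg are hypothetical names, and dangling pages (out-degree 0) are assumed to have been handled beforehand.

#include <stdlib.h>
#include <math.h>

/* Iterative PageRank: PR_i = (1-d)/n + d * sum_{j->i} PR_j / c_j.
 * in_ptr/in_src is a CSR-like list of in-links: the pages linking to page i
 * are in_src[in_ptr[i] .. in_ptr[i+1]-1]. out_deg[j] = c_j > 0 assumed. */
void pagerank(int n, const int *in_ptr, const int *in_src,
              const int *out_deg, double d, double tol, double *pr)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) pr[i] = 1.0 / n;    /* uniform start vector */

    double diff = 1.0;
    while (diff > tol) {
        diff = 0.0;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = in_ptr[i]; k < in_ptr[i + 1]; k++) {
                int j = in_src[k];
                sum += pr[j] / out_deg[j];           /* PR_j distributed over c_j links */
            }
            next[i] = (1.0 - d) / n + d * sum;
            diff += fabs(next[i] - pr[i]);           /* 1-norm of the update */
        }
        for (int i = 0; i < n; i++) pr[i] = next[i];
    }
    free(next);
}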

Matrix-matrix multiplication
Each entry C_ij is an inner product (dot product) of a row of A and a column of B.
Iterative row-oriented algorithm (a serial sketch follows below).

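For reference, a minimal serial sketch of the row-oriented triple loop (my own illustration, assuming n×n row-major arrays and a zero-initialized C):

/* C = C + A*B for n x n matrices stored row-major as 1-D arrays.
 * The row-oriented loop order (i, k, j) walks A and C by rows. */
void matmul_row_oriented(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)           /* row of A and C          */
        for (int k = 0; k < n; k++) {     /* column of A / row of B  */
            double a_ik = A[i * n + k];
            for (int j = 0; j < n; j++)   /* column of B and C       */
                C[i * n + j] += a_ik * B[k * n + j];
        }
}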

Block-matrix computation

Replace scalar multiplication with (sub-)matrix multiplication
Replace scalar addition with (sub-)matrix addition
(a blocked sketch follows below)

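A minimal sketch of this idea as a loop-blocked (tiled) multiplication in C, under the simplifying assumption that n is a multiple of the block size bs (illustrative code, not from the slides):

/* C = C + A*B with n x n row-major matrices, processed in bs x bs blocks.
 * The three outer loops run over blocks, the inner loops multiply one pair
 * of blocks -- scalar operations replaced by small matrix operations. */
void matmul_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* block multiply: C[ii..][jj..] += A[ii..][kk..] * B[kk..][jj..] */
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++) {
                        double a_ik = A[i * n + k];
                        for (int j = jj; j < jj + bs; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}

Each pair of bs×bs blocks can be kept in cache while it is multiplied, which is also the motivation for the recursive variant on the next slide.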

Block-matrix computation

Recurse until the blocks are small enough (e.g. until they fit into cache).


First parallel algorithm


Partitioning
Divide the matrices into rows
Each primitive task has the corresponding rows of the three matrices
Communication
Each task must eventually see every row of B
Organize the tasks into a ring
Agglomeration and mapping
Fixed number of tasks, each requiring the same amount of computation
Regular communication among tasks
Strategy: assign each process a contiguous group of rows (see the MPI sketch below)
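Putting the decomposition together, a minimal sketch of the row-wise algorithm with a ring rotation of B (my own illustration, not code from the lecture; it assumes n is divisible by p, row-major storage and a zero-initialized C):

#include <mpi.h>

/* Row-wise parallel C = A*B: each rank owns n/p full-width rows of A, B, C.
 * In each of the p steps the local multiply uses the currently held rows
 * of B, then B's rows are rotated one step around the ring of processes. */
void rowwise_matmul(int n, double *a, double *b, double *c, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;                       /* rows per process       */
    int left   = (rank - 1 + p) % p;          /* ring neighbours        */
    int right  = (rank + 1) % p;
    MPI_Status status;

    int owner = rank;                         /* whose B rows we hold   */
    for (int step = 0; step < p; step++) {
        /* local multiply: my rows of A against the held rows of B      */
        for (int i = 0; i < nlocal; i++)
            for (int k = 0; k < nlocal; k++) {
                double a_ik = a[i * n + owner * nlocal + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += a_ik * b[k * n + j];
            }
        /* rotate the B rows around the ring: send left, receive right  */
        MPI_Sendrecv_replace(b, nlocal * n, MPI_DOUBLE,
                             left, 0, right, 0, comm, &status);
        owner = (owner + 1) % p;
    }
}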

Communication of B

[Figure sequence over four slides: the rows of B are rotated step by step around the ring of processes, so that each process eventually sees every row of B.]

Complexity
Algorithm has p iterations
During each iteration a process multiplies an
(n/p) × (n/p) block of A by an (n/p) × n block of B,
i.e. multiplications per processor and iteration: O(n³/p²)
Total computation time: O(n³/p)
Each process ends up passing
(p−1)·n²/p = O(n²) elements of B
Weakness
Each process must access every element of matrix B
Ratio of computation to communication is poor:
only 2n/p
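As a rough worked example (with numbers of my own choosing, not from the slides): for n = 10 000 and p = 100, each process performs about 2n³/p = 2·10¹⁰ floating-point operations but must still receive on the order of n² = 10⁸ elements of B. The ratio 2n/p = 200 operations per communicated element shrinks as p grows, so the algorithm quickly becomes communication-bound.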

Cannon's algorithm
Compute C = A · B with A, B ∈ R^(N×N),
distributed block-wise on a 2D grid of √P × √P processors.
C should be stored in a similar manner.
Processor (i,j) computes

    C_i,j = Σ_k A_i,k · B_k,j

i.e. it needs block-row i of A and block-column j of B.
Prepare with an alignment of the blocks.
Iterate over
  rotation phase
  computation phase

Simple vs. Cannon's algorithm


Alignment
Matrix A
The block-rows of matrix A are shifted to the left.
This has to be done for each row, until the diagonal blocks are placed in the first column.

Matrix B
The block-columns of matrix B are shifted upwards.
This has to be done for each column, until the diagonal blocks are placed in the first row.

After this step processor (i,j) holds the blocks

    A_i,(j+i) % P      (row i has been shifted i times to the left)
    B_(j+i) % P,j      (column j has been shifted j times upwards)

Alignment phase

[Figures: block distribution before and after the alignment]

Computation & Rotation


Phase I | Computation
After the alignment, every processor holds two corresponding blocks.
In the computation phase the matrix-matrix multiplication is carried out block-wise.
Phase II | Rotation
The blocks are shifted one step further to the left and upwards, respectively.

Example
Starting point

The blocks are distributed to P = 9 processors.
Processor (i,j) holds block (i,j).

Example
Rotation

Alignment has taken place


Example | Iteration


Complexity analysis | Cannon's algorithm

Algorithm has √p iterations
During each iteration a process multiplies a pair of
(n/√p) × (n/√p) blocks: O(n³/p^(3/2)) operations per iteration
Total computation time: O(n³/p)
During each iteration a process sends and receives two
blocks of size (n/√p) × (n/√p)
Communication complexity: O(n²/√p)

Row-wise algorithm was:
p iterations
Communication complexity: O(n²)
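Continuing the earlier worked example (again with my own numbers): for n = 10 000 and p = 100, Cannon's algorithm communicates O(n²/√p) = 10⁷ elements per process instead of O(n²) = 10⁸, i.e. a factor √p = 10 less than the row-wise algorithm, while the total computation time O(n³/p) stays the same.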

MPI-Code | Cannon's alg.

/* Create grid and get coordinates */
int dims[2], periods[2] = {1, 1};
int mycoords[2];
dims[0] = sqrt(num_procs);
dims[1] = num_procs / dims[0];
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);
/* Local blocks of the matrices */
double *a, *b, *c;
/* Load a, b and c corresponding to the coordinates */

MPI-Code | Cannon's alg.

/* Matrix rotation (alignment) of A */
MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);
/* Matrix rotation (alignment) of B */
MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);

MPI-Code | Cannon's alg.

/* Find left and upper neighbour */
MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);
for (i = 0; i < dims[0]; i++)
{
  dgemm(nlocal, a, b, c); /* c = c + a * b (local block multiply) */
  /* Matrix A shifted to the left */
  MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                       leftrank, 77, rightrank, 77, comm_2d, &status);
  /* Matrix B shifted upwards */
  MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                       uprank, 77, downrank, 77, comm_2d, &status);
}
/* A and B back in original placement */
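The local routine dgemm(nlocal, a, b, c) is not shown on the slides; a minimal sketch of such a block multiply-accumulate helper could look as follows (the name and signature simply follow the call above, this is not the BLAS dgemm):

/* c = c + a*b for nlocal x nlocal blocks stored row-major as 1-D arrays */
void dgemm(int nlocal, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int k = 0; k < nlocal; k++) {
            double a_ik = a[i * nlocal + k];
            for (int j = 0; j < nlocal; j++)
                c[i * nlocal + j] += a_ik * b[k * nlocal + j];
        }
}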

Virtual topologies in MPI


Associating processors with a topology
MPI allows you to associate your processors with an underlying connection structure.
This connection structure is regarded as a virtual topology.
Usual topologies are:
  Cartesian topology
  Graph topology

Creating a topology produces a new communicator.
MPI provides so-called mapping functions.
Mapping functions are used to compute the corresponding process ranks, based on the topology's naming scheme.

Cartesian topology
Each process is connected to its neighbours in a virtual
grid
Boundaries can be cyclic
Processes can be identified by Cartesian coordinates


Creating a Cartesian topology


int MPI_Cart_create(MPI_Comm comm_old, int ndims,
    int *dims, int *periods, int reorder, MPI_Comm *comm_cart)

comm_old    existing communicator
ndims       number of dimensions
dims        size of each dimension
periods     logical array indicating whether a dimension is cyclic
            (TRUE = cyclic boundary conditions)
reorder     logical flag:
            FALSE = ranks preserved
            TRUE  = ranks may be reordered
comm_cart   new Cartesian communicator

Cartesian example

MPI_Comm vu;
int dim[2], period[2], reorder;

dim[0]=4; dim[1]=3;
period[0]=TRUE;
period[1]=FALSE;
reorder=TRUE;

MPI_Cart_create(MPI_COMM_WORLD,
                2, dim, period, reorder, &vu);

Mapping coordinates to ranks


Mapping coordinates to ranks
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)

Mapping ranks to coordinates
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)

Cartesian example

#include <mpi.h>
#include <stdio.h>            /* needed for printf */
#define TRUE 1                /* TRUE/FALSE as used on the slides */
#define FALSE 0

/* Run with 12 processes */
int main(int argc, char *argv[]) {
  int rank;
  MPI_Comm vu;
  int dim[2], period[2], reorder, coord[2], id;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  dim[0]=4; dim[1]=3;
  period[0]=TRUE; period[1]=FALSE; reorder=TRUE;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);
  if (rank==5) {
    MPI_Cart_coords(vu, rank, 2, coord);
    printf("P:%d My coordinates are %d %d\n", rank, coord[0], coord[1]);
  }
  if (rank==0) {
    coord[0]=3; coord[1]=1;
    MPI_Cart_rank(vu, coord, &id);
    printf("Proc. at position (%d,%d) has rank %d\n", coord[0], coord[1], id);
  }
  MPI_Finalize();
}

Neighbouring ranks
Output of the previous example:
  Proc. at position (3,1) has rank 10
  P:5 My coordinates are 1 2

Computing the ranks of neighbouring processes:
int MPI_Cart_shift(MPI_Comm comm, int direction,
    int disp, int *rank_source, int *rank_dest)

MPI_Cart_shift does not actually shift data: it returns the ranks for a shift, which can then be used in subsequent communication calls.

Neighbouring ranks
Arguments
direction    dimension in which the shift should be made
disp         length of the shift in processor coordinates (positive or negative)
rank_source  rank from which the calling process receives a message during the shift
rank_dest    rank to which the calling process sends a message during the shift

Non-periodic case (period[i] = FALSE)
If we shift off the edge of the topology, MPI_Cart_shift returns
MPI_PROC_NULL for ranks that would lie outside the valid coordinate
range [0, dims[i]−1].
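A minimal usage sketch (my own illustration, not from the slides) that combines MPI_Cart_shift with MPI_Sendrecv to exchange data with the neighbour in dimension 0; with non-periodic boundaries, MPI_PROC_NULL sources and destinations simply turn the communication into a no-op:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Comm cart;
    int size, rank, src, dst;
    int dims[2] = {0, 0}, periods[2] = {0, 0};   /* non-periodic 2-D grid */
    double mine, from_nbr = -1.0;                /* stays -1.0 at the boundary */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);              /* let MPI choose the grid shape */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    mine = (double)rank;

    /* ranks for a shift by +1 along dimension 0:
       receive from src, send to dst; MPI_PROC_NULL at the edges */
    MPI_Cart_shift(cart, 0, 1, &src, &dst);

    /* communication with MPI_PROC_NULL completes immediately and leaves
       the receive buffer untouched, so no special-casing is needed */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, dst, 0,
                 &from_nbr, 1, MPI_DOUBLE, src, 0, cart, &status);

    printf("rank %d received %.0f from its neighbour in dimension 0\n",
           rank, from_nbr);
    MPI_Finalize();
    return 0;
}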
