
Parallel Computing

Numerical linear algebra II


Thorsten Grahs, 29.06.2015

Overview

Sparse matrices
Example: PageRank
Parallel matrix-matrix multiplication
Virtual topologies in MPI


Matrices are extremely important in HPC


While it's certainly not the case that high-performance
computing is all about computing with matrices, matrix
operations are key to many important HPC applications.
Many important applications can be reduced to operations
on matrices, including (but not limited to)
1. searching and sorting
2. numerical simulation of physical processes
3. optimization

The list of the top 500 supercomputers in the world
(http://www.top500.org) is determined by a benchmark
program (the LINPACK benchmark) that performs matrix operations.


Dense vs. sparse matrices


A matrix A ∈ R^(m×n) is dense if all or most of its entries are non-zero.
Storing a dense matrix (sometimes called a full matrix) requires storing all m·n elements of the matrix.
Usually an array data structure is used to store a dense matrix.
A matrix is sparse if most of its entries are zero.
Here we expect the number of zeros to far exceed the number of non-zeros.
It is often most efficient to store only the non-zero entries of a sparse matrix, but this requires that location information also be stored (see the compressed sparse row sketch below).
Arrays and lists are most commonly used to store sparse matrices.
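To make the storage trade-off concrete, here is a minimal sketch of the widely used compressed sparse row (CSR) layout in C. This is my own illustration, not part of the lecture; the names csr_matrix and csr_matvec are made up for the example.

#include <stdio.h>

/* Compressed Sparse Row (CSR): store only the non-zeros plus their locations. */
typedef struct {
    int n;          /* number of rows                            */
    int nnz;        /* number of stored non-zeros                */
    int *row_ptr;   /* size n+1: start of each row in val/col    */
    int *col_idx;   /* size nnz: column index of each non-zero   */
    double *val;    /* size nnz: the non-zero values             */
} csr_matrix;

/* y = A*x for a CSR matrix A (illustrative helper) */
void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

Instead of m·n values, CSR stores nnz values, nnz column indices and n+1 row pointers, which pays off as soon as most entries are zero.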

Dense matrix examples


Find a matrix representing the complete graph on a set of nodes.
The ij-entry describes the edge connecting the node corresponding to row i with the node corresponding to column j.
Use 0 if a connection is missing.


Sparse matrix examples


Here, not every node is connected to every other node.
This results in a sparse matrix representation.
One can think of the ij-entries as weights (importance) of the node connections.


Example | PageRank
Google's search algorithm works with PageRank.
For each web page a weight (PageRank) is computed.
The weight is higher when more pages link to this page.
The weight is even higher when the linking pages themselves have higher weights.
Thus, the weight PR_i of page i is computed from the weights PR_j of the web pages j linking to it.
If page j links to c_j web pages, its weight PR_j is distributed over these linked pages.
Patent US6285999: Method for node ranking in a linked database. Registered 10.01.1997 by Lawrence Page.

Example | PageRank

[Figure from Wikipedia (http://de.wikipedia.org/wiki/PageRank): PageRank example with d = 0.85]

Example | PageRank
This gives rise to the following recursive formula

    PR_i = (1-d)/n + d * Σ_{j:(j,i)} PR_j / c_j

n   total number of pages
d   damping factor between 0 and 1

This can be written as a linear system

    M · PR = ((1-d)/n) · 1

with M_ij = δ_ij − d·L_ij, 1 the vector of all ones, and L a (sparse) matrix with

    L_ij = 1/c_j,  if page j links to page i
    L_ij = 0,      otherwise.

Example | PageRank
This system leads to the eigenvalue problem

    P^T · PR = PR

with P the so-called Google matrix.
For d < 1 the solution of the system is unique
(theorem of Perron-Frobenius).
The system is solved with an iterative algorithm; a sketch of such an iteration follows below.

[Figure: BiCGstab vs. Reverse GS for a matrix with n = 2 million]

G. M. Del Corso, A. Gullì, F. Romani, Comparison of Krylov subspace methods on the PageRank problem, J. Comput. Appl. Math., Vol. 210(1-2), 2007.
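As an illustration of such an iterative solver, here is a minimal sketch of the fixed-point iteration applied directly to the recursion PR_i = (1-d)/n + d·Σ PR_j/c_j. This is not code from the lecture; the in-link arrays in_ptr and in_src and the out-degree array out_deg are hypothetical names, and dangling pages (out-degree 0) are assumed to have been handled beforehand.

#include <stdlib.h>
#include <math.h>

/* Iterative PageRank: PR_i = (1-d)/n + d * sum_{j->i} PR_j / c_j.
 * in_ptr/in_src is a CSR-like list of in-links: the pages linking to page i
 * are in_src[in_ptr[i] .. in_ptr[i+1]-1]. out_deg[j] = c_j > 0 assumed. */
void pagerank(int n, const int *in_ptr, const int *in_src,
              const int *out_deg, double d, double tol, double *pr)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) pr[i] = 1.0 / n;    /* uniform start vector */

    double diff = 1.0;
    while (diff > tol) {
        diff = 0.0;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = in_ptr[i]; k < in_ptr[i + 1]; k++) {
                int j = in_src[k];
                sum += pr[j] / out_deg[j];           /* PR_j distributed over c_j links */
            }
            next[i] = (1.0 - d) / n + d * sum;
            diff += fabs(next[i] - pr[i]);           /* 1-norm of the update */
        }
        for (int i = 0; i < n; i++) pr[i] = next[i];
    }
    free(next);
}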

Matrix-matrix multiplication
Each entry C_ij is an inner product (dot product) of a row of A and a column of B.
Iterative row-oriented algorithm (a serial sketch follows below).

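For reference, a minimal serial sketch of the row-oriented triple loop (my own illustration, assuming n×n row-major arrays and a zero-initialized C):

/* C = C + A*B for n x n matrices stored row-major as 1-D arrays.
 * The row-oriented loop order (i, k, j) walks A and C by rows. */
void matmul_row_oriented(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)           /* row of A and C          */
        for (int k = 0; k < n; k++) {     /* column of A / row of B  */
            double a_ik = A[i * n + k];
            for (int j = 0; j < n; j++)   /* column of B and C       */
                C[i * n + j] += a_ik * B[k * n + j];
        }
}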

Block-matrix computation

Replace scalar multiplication with (sub-)matrix multiplication
Replace scalar addition with (sub-)matrix addition
(a blocked sketch follows below)

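A minimal sketch of this idea as a loop-blocked (tiled) multiplication in C, under the simplifying assumption that n is a multiple of the block size bs (illustrative code, not from the slides):

/* C = C + A*B with n x n row-major matrices, processed in bs x bs blocks.
 * The three outer loops run over blocks, the inner loops multiply one pair
 * of blocks -- scalar operations replaced by small matrix operations. */
void matmul_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* block multiply: C[ii..][jj..] += A[ii..][kk..] * B[kk..][jj..] */
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++) {
                        double a_ik = A[i * n + k];
                        for (int j = jj; j < jj + bs; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}

Each pair of bs×bs blocks can be kept in cache while it is multiplied, which is also the motivation for the recursive variant on the next slide.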

Block-matrix computation

Recurse until the blocks are small enough (e.g. until they fit into cache).


First parallel algorithm


Partitioning
Divide the matrices into rows
Each primitive task has the corresponding rows of the three matrices
Communication
Each task must eventually see every row of B
Organize the tasks into a ring
Agglomeration and mapping
Fixed number of tasks, each requiring the same amount of computation
Regular communication among tasks
Strategy: assign each process a contiguous group of rows (see the MPI sketch below)
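Putting the decomposition together, a minimal sketch of the row-wise algorithm with a ring rotation of B (my own illustration, not code from the lecture; it assumes n is divisible by p, row-major storage and a zero-initialized C):

#include <mpi.h>

/* Row-wise parallel C = A*B: each rank owns n/p full-width rows of A, B, C.
 * In each of the p steps the local multiply uses the currently held rows
 * of B, then B's rows are rotated one step around the ring of processes. */
void rowwise_matmul(int n, double *a, double *b, double *c, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;                       /* rows per process       */
    int left   = (rank - 1 + p) % p;          /* ring neighbours        */
    int right  = (rank + 1) % p;
    MPI_Status status;

    int owner = rank;                         /* whose B rows we hold   */
    for (int step = 0; step < p; step++) {
        /* local multiply: my rows of A against the held rows of B      */
        for (int i = 0; i < nlocal; i++)
            for (int k = 0; k < nlocal; k++) {
                double a_ik = a[i * n + owner * nlocal + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += a_ik * b[k * n + j];
            }
        /* rotate the B rows around the ring: send left, receive right  */
        MPI_Sendrecv_replace(b, nlocal * n, MPI_DOUBLE,
                             left, 0, right, 0, comm, &status);
        owner = (owner + 1) % p;
    }
}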

Communication of B

[Figure sequence over four slides: the rows of B are rotated step by step around the ring of processes, so that each process eventually sees every row of B.]

Complexity
Algorithm has p iterations
During each iteration a process multiplies an
(n/p) × (n/p) block of A by an (n/p) × n block of B,
i.e. multiplications per processor and iteration: O(n³/p²)
Total computation time: O(n³/p)
Each process ends up passing
(p−1)·n²/p = O(n²) elements of B
Weakness
Each process must access every element of matrix B
Ratio of computation to communication is poor:
only 2n/p
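As a rough worked example (with numbers of my own choosing, not from the slides): for n = 10 000 and p = 100, each process performs about 2n³/p = 2·10¹⁰ floating-point operations but must still receive on the order of n² = 10⁸ elements of B. The ratio 2n/p = 200 operations per communicated element shrinks as p grows, so the algorithm quickly becomes communication-bound.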

Cannon's algorithm
Compute C = A · B with A, B ∈ R^(N×N),
distributed block-wise on a 2D grid of √P × √P processors.
C should be stored in a similar manner.
Processor (i,j) computes

    C_i,j = Σ_k A_i,k · B_k,j

i.e. it needs block-row i of A and block-column j of B.
Prepare with an alignment of the blocks.
Iterate over
  rotation phase
  computation phase

Simple vs. Cannon's algorithm


Alignment
Matrix A
The block-rows of matrix A are shifted to the left.
This has to be done for each row, until the diagonal blocks are placed in the first column.

Matrix B
The block-columns of matrix B are shifted upwards.
This has to be done for each column, until the diagonal blocks are placed in the first row.

After this step processor (i,j) holds the blocks

    A_i,(j+i) % P      (row i has been shifted i times to the left)
    B_(j+i) % P,j      (column j has been shifted j times upwards)

Alignment phase

[Figures: block distribution before and after the alignment]

Computation & Rotation


Phase I | Computation
After the alignment, every processor holds two corresponding blocks.
In the computation phase the matrix-matrix multiplication is carried out block-wise.
Phase II | Rotation
The blocks are shifted one step further to the left and upwards, respectively.

Example
Starting point

The blocks are distributed to P = 9 processors.
Processor (i,j) holds block (i,j).

Example
Rotation

Alignment has taken place


Example | Iteration


Complexity analysis | Cannon's algorithm

Algorithm has √p iterations
During each iteration a process multiplies a pair of
(n/√p) × (n/√p) blocks: O(n³/p^(3/2)) operations per iteration
Total computation time: O(n³/p)
During each iteration a process sends and receives two
blocks of size (n/√p) × (n/√p)
Communication complexity: O(n²/√p)

Row-wise algorithm was:
p iterations
Communication complexity: O(n²)
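Continuing the earlier worked example (again with my own numbers): for n = 10 000 and p = 100, Cannon's algorithm communicates O(n²/√p) = 10⁷ elements per process instead of O(n²) = 10⁸, i.e. a factor √p = 10 less than the row-wise algorithm, while the total computation time O(n³/p) stays the same.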

MPI-Code | Cannon's alg.

/* Create grid and get coordinates */
int dims[2], periods[2] = {1, 1};
int mycoords[2];
dims[0] = sqrt(num_procs);
dims[1] = num_procs / dims[0];
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);
/* Local blocks of the matrices */
double *a, *b, *c;
/* Load a, b and c corresponding to the coordinates */

MPI-Code | Cannon's alg.

/* Matrix rotation (alignment) of A */
MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);
/* Matrix rotation (alignment) of B */
MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                     shiftdest, 77, shiftsource, 77, comm_2d, &status);

MPI-Code | Cannon's alg.

/* Find left and upper neighbour */
MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);
for (i = 0; i < dims[0]; i++)
{
  dgemm(nlocal, a, b, c); /* c = c + a * b (local block multiply) */
  /* Matrix A shifted to the left */
  MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE,
                       leftrank, 77, rightrank, 77, comm_2d, &status);
  /* Matrix B shifted upwards */
  MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE,
                       uprank, 77, downrank, 77, comm_2d, &status);
}
/* A and B back in original placement */
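The local routine dgemm(nlocal, a, b, c) is not shown on the slides; a minimal sketch of such a block multiply-accumulate helper could look as follows (the name and signature simply follow the call above, this is not the BLAS dgemm):

/* c = c + a*b for nlocal x nlocal blocks stored row-major as 1-D arrays */
void dgemm(int nlocal, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int k = 0; k < nlocal; k++) {
            double a_ik = a[i * nlocal + k];
            for (int j = 0; j < nlocal; j++)
                c[i * nlocal + j] += a_ik * b[k * nlocal + j];
        }
}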

Virtual topologies in MPI


Associating processors with a topology
MPI allows you to associate your processors with an underlying connection structure.
This connection structure is regarded as a virtual topology.
Usual topologies are:
  Cartesian topology
  Graph topology

Creating a topology produces a new communicator.
MPI provides so-called mapping functions.
Mapping functions are used to compute the corresponding process ranks, based on the topology's naming scheme.

Cartesian topology
Each process is connected to its neighbours in a virtual
grid
Boundaries can be cyclic
Processes can be identified by Cartesian coordinates


Creating a Cartesian topology


int MPI_Cart_create(MPI_Comm comm_old, int ndims,
    int *dims, int *periods, int reorder, MPI_Comm *comm_cart)

comm_old    existing communicator
ndims       number of dimensions
dims        size of each dimension
periods     logical array indicating whether a dimension is cyclic
            (TRUE = cyclic boundary conditions)
reorder     logical flag:
            FALSE = ranks preserved
            TRUE  = ranks may be reordered
comm_cart   new Cartesian communicator

Cartesian example

MPI_Comm vu;
int dim[2], period[2], reorder;

dim[0]=4; dim[1]=3;
period[0]=TRUE;
period[1]=FALSE;
reorder=TRUE;

MPI_Cart_create(MPI_COMM_WORLD,
                2, dim, period, reorder, &vu);

Mapping coordinates to ranks


Mapping coordinates to ranks
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)

Mapping ranks to coordinates
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)

Cartesian example

#include <mpi.h>
#include <stdio.h>            /* needed for printf */
#define TRUE 1                /* TRUE/FALSE as used on the slides */
#define FALSE 0

/* Run with 12 processes */
int main(int argc, char *argv[]) {
  int rank;
  MPI_Comm vu;
  int dim[2], period[2], reorder, coord[2], id;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  dim[0]=4; dim[1]=3;
  period[0]=TRUE; period[1]=FALSE; reorder=TRUE;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);
  if (rank==5) {
    MPI_Cart_coords(vu, rank, 2, coord);
    printf("P:%d My coordinates are %d %d\n", rank, coord[0], coord[1]);
  }
  if (rank==0) {
    coord[0]=3; coord[1]=1;
    MPI_Cart_rank(vu, coord, &id);
    printf("Proc. at position (%d,%d) has rank %d\n", coord[0], coord[1], id);
  }
  MPI_Finalize();
}

Neighbouring ranks
Output of the previous example:
  Proc. at position (3,1) has rank 10
  P:5 My coordinates are 1 2

Computing the ranks of neighbouring processes:
int MPI_Cart_shift(MPI_Comm comm, int direction,
    int disp, int *rank_source, int *rank_dest)

MPI_Cart_shift does not actually shift data: it returns the ranks for a shift, which can then be used in subsequent communication calls.

Neighbouring ranks
Arguments
direction    dimension in which the shift should be made
disp         length of the shift in processor coordinates (positive or negative)
rank_source  rank from which the calling process receives a message during the shift
rank_dest    rank to which the calling process sends a message during the shift

Non-periodic case (period[i] = FALSE)
If we shift off the edge of the topology, MPI_Cart_shift returns
MPI_PROC_NULL for ranks that would lie outside the valid coordinate
range [0, dims[i]−1].
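A minimal usage sketch (my own illustration, not from the slides) that combines MPI_Cart_shift with MPI_Sendrecv to exchange data with the neighbour in dimension 0; with non-periodic boundaries, MPI_PROC_NULL sources and destinations simply turn the communication into a no-op:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Comm cart;
    int size, rank, src, dst;
    int dims[2] = {0, 0}, periods[2] = {0, 0};   /* non-periodic 2-D grid */
    double mine, from_nbr = -1.0;                /* stays -1.0 at the boundary */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);              /* let MPI choose the grid shape */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    mine = (double)rank;

    /* ranks for a shift by +1 along dimension 0:
       receive from src, send to dst; MPI_PROC_NULL at the edges */
    MPI_Cart_shift(cart, 0, 1, &src, &dst);

    /* communication with MPI_PROC_NULL completes immediately and leaves
       the receive buffer untouched, so no special-casing is needed */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, dst, 0,
                 &from_nbr, 1, MPI_DOUBLE, src, 0, cart, &status);

    printf("rank %d received %.0f from its neighbour in dimension 0\n",
           rank, from_nbr);
    MPI_Finalize();
    return 0;
}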
