Você está na página 1de 22

Block LU Factorization

Lecture 24
MA471 Fall 2003
Example Case
1) Suppose we are faced with the solution of a
linear system Ax=b

2) Further suppose:
1) A is large (dim(A)>10,000)
2) A is dense
3) A is full
4) We have a sequence of different b vectors.
Problems
Suppose we are able to compute the
matrix
It costs N
2
doubles to store the matrix
E.g. for N=100,000 we require 76.3 gigabytes
of storage for the matrix alone.
32 bit processors are limited to 4 gigabytes of
memory
Most desktops (even 64 bit) do not have 76.3
gigabytes

What to do?
Divide and Conquer
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
One approach is to assume we have a square number of processors.
We then divide the matrix into blocks storing one block per processor.
Back to the Linear System
We are now faced with LU factorization of
a distributed matrix.
This calls for a modified LU routine which
acts on blocks of the matrix.
We will demonstrate this algorithm for one
level.
i.e. we need to construct matrices L,U
such that A=LU and we only store single
blocks of A,L,U on any processor.
Constructing the
Block LU Factorization
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 1 0
L20 0 1
*
U00 U01 U02
0 ?11 ?12
0 ?21 ?22
First we LU factorize A00 and look for the above block factorization.
However, we need to figure out what each of the entries are:
A00 = L00*U00 (compute by L00, U00 by LU factorization)

A01 = L00*U01 => U01 = L00\A01
A02 = L00*U02 => U02 = L00\A02

A10 = L10*U00 => L10 = A10/U00
A20 = L20*U00 => L20 = A20/U00

A11 = L10*U01 + ?11 => ?11 = A11 L10*U01..
cont
A00 = L00*U00 (compute by L00, U00 by LU factorization)

A01 = L00*U01 => U01 = L00\A01
A02 = L00*U02 => U02 = L00\A02

A10 = L10*U00 => L10 = A10/U00
A20 = L20*U00 => L20 = A20/U00

A11 = L10*U01 + ?11 => ?11 = A11 L10*U01
A12 = L10*U02 + ?12 => ?12 = A12 L10*U02
A21 = L20*U01 + ?21 => ?21 = A21 L20*U01
A22 = L20*U02 + ?22 => ?22 = A22 L20*U02

In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm Ln0*U0m

Summary First Stage
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 1 0
L20 0 1
*
U00 U01 U02
0 ?11 ?12
0 ?21 ?22
First step: LU factorize uppermost block diagonal

Second step: a) compute U0n = L00\A0n n>0
b) compute Ln0 = An0/U00 n>0

Third step: compute ?nm = Anm Ln0*U0m, (n,m>0)
Now Factorize Lower SE Block
?11 ?12
?21 ?22
=
L11 0
L21 1
*
U11 U12
0 ??22
We repeat the previous algorithm this time on the two by two SE block.
End Result
A00 A01 A02
A10 A11 A12
A20 A21 A22
=
L00 0 0
L10 L11 0
L20 L21 L22
*
U00 U01 U02
0 U11 U12
0 0 U22
Matlab
Version
Parallel Algorithm
P0 P1 P2
P3 P4 P5
P6 P7 P8
P0: A00 = L00*U00 (compute by L00, U00 by LU factorization)

P1: U01 = L00\A01
P2: U02 = L00\A02

P3: L10 = A10/U00
P6: L20 = A20/U00

P4: A11 <- A11 L10*U01
P5: A12 <- A12 L10*U02
P7: A21 <- A21 L20*U01
P8: A22 <- A22 L20*U02

In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm Ln0*U0m

Parallel Communication
L00
U00
U01 U02
L10 A11 A12
L20 A21 A22
P0: L00,U00 =lu(A)

P1: U01 = L00\A01
P2: U02 = L00\A02

P3: L10 = A10/U00
P6: L20 = A20/U00

P4: A11 <- A11 L10*U01
P5: A12 <- A12 L10*U02
P7: A21 <- A21 L20*U01
P8: A22 <- A22 L20*U02

In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm Ln0*U0m

Communication
Summary
P0: L00,U00 =lu(A)

P1: U01 = L00\A01
P2: U02 = L00\A02

P3: L10 = A10/U00
P6: L20 = A20/U00

P4: A11 <- A11 L10*U01
P5: A12 <- A12 L10*U02
P7: A21 <- A21 L20*U01
P8: A22 <- A22 L20*U02

P0: sends L00 to P1,P2
sends U00 to P3,P6

P1: sends U01 to P4,P7
P2: sends U02 to P5,P8

P3: sends L10 to P4,P5
P4: sends L20 to P7,P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
L00
U00
U01 U02
L10 A11 A12
L20 A21 A22
Upshot
Notes:
1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait
compared to compute time.
3) The compute part of this code can be optimized much more making the parallel
efficiency even worse.
a
b
(a) P0: sends L00 to P1,P2
sends U00 to P3,P6

(b) P1: sends U01 to P4,P7
(c) P2: sends U02 to P5,P8

(d) P3: sends L10 to P4,P5
(e) P4: sends L20 to P7,P8
c
d
e
(f) P4: sends L11 to P5
sends U11 to P7

(g) P1: sends U12 to P8

(h) P3: sends L21 to P8
f
1
st
stage:
1
st
stage:
g
h
Block Back Solve
After factorization we are left with the task
of using the distributed L and U to
compute the backsolve:
U00
L00
U01 U02
L10
U11
L11
U12
L20 L21
U22
L22
Block distribution of L and U
P0 P1 P2
P3 P4 P5
P6 P7 P8
Recall
Given an LU factorization of A namely,
L,U such that A=LU
Then we can solve Ax=b by
y=L\b
x=U\y
Distributed Back Solve
L00 0 0
L10 L11 0
L20 L21 L22
=
y0

y1

y2
b0

b1

b2
P0: solve L00*y0 = b0 send: y0 to P3,P6
P3: send: L10*y0 to P4
P4: solve L11*y1 = b1-L10*y0 send: y1 to P7
P6: send: L20*y0 to P8\
P7: send: L21*y1 to P8
P8: solve L22*y2 = b2-L20*y0-L21*y1
Results: y0 on P0, y1 on P4, y2 on P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
Matlab
Code
Back Solve
After the factorization we computed a solution to Ax=b
This consists of two distributed block triangular systems to solve
Barrier Between Back Solves
This time I inserted an MPI_Barrier call between the backsolves.
This highlights the serial nature of the backsolves..
Example Code
http://www.math.unm.edu/~timwar/MA471F03/blocklu.m

http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c

Você também pode gostar