CPS343 (Parallel and HPC)
Parallel Algorithm Analysis and Design
Spring 2016
Outline
Acknowledgements
Material used in creating these slides comes from Designing and Building
Parallel Programs by Ian Foster, Addison-Wesley, 1995. Available on-line
at http://www.mcs.anl.gov/~itf/dbpp/
Foster's model
Four-phase design process: PCAM
Partitioning
Domain decomposition
Domain decomposition example: 3-D cube of data
1-D decomposition: split the cube into a 1-D array of 2-D slices (coarse granularity)
2-D decomposition: split the cube into a 2-D array of 1-D columns
3-D decomposition: split the cube into a 3-D array of individual data elements (fine granularity)
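For concreteness, the three options can be compared with a small sketch (hypothetical helper name) counting tasks and per-task data when an n × n × n cube is split evenly:

```python
# Sketch (hypothetical helper): number of tasks and data points per task
# for each decomposition of an n x n x n cube, assuming even splits.
def decomposition_summary(n):
    return {
        "1-D (slices)":  (n,      n * n),  # n slices, each an n x n plane
        "2-D (columns)": (n * n,  n),      # n^2 columns of n points
        "3-D (points)":  (n ** 3, 1),      # one task per data element
    }

for name, (tasks, per_task) in decomposition_summary(8).items():
    print(f"{name}: {tasks} tasks, {per_task} points per task")
```

Note the trade-off: finer granularity means more tasks but less data (and less computation) per task.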
Functional decomposition
Partitioning design checklist
Communication
Communication in domain and functional decomposition
Patterns of communication
Local communication
Local communication: Jacobi finite differences
The communication channels for a particular node are shown by the arrows in the diagram on the right.
Local communication: Jacobi finite differences
Define channels linking each task that requires a value with the task
that generates that value.
Each task then executes the following logic:
for t = 0 to T-1
    send X(t)_{i,j} to each neighbor
    receive X(t)_{i-1,j}, X(t)_{i,j-1}, X(t)_{i+1,j}, and X(t)_{i,j+1} from neighbors
    compute X(t+1)_{i,j}
endfor
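A serial sketch of the update each task performs (hypothetical names; a plain four-neighbor average stands in for the actual finite-difference formula, which the slides do not spell out):

```python
# Serial sketch of one Jacobi sweep: every interior point of X(t+1) is
# computed from X(t) only, so all points can be updated concurrently.
def jacobi_step(x):
    n = len(x)
    new = [row[:] for row in x]            # copy keeps boundary values fixed
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                                + x[i][j-1] + x[i][j+1])
    return new

grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [1.0] * 4                        # boundary condition on one edge
grid = jacobi_step(grid)                   # grid[1][1] is now 0.25
```

Because X(t+1) depends only on X(t), the send/receive/compute loop above can run in every task at once.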
Global communication
Distributing communication and computation
Uncovering concurrency: Divide and conquer
Divide and conquer algorithm
Divide and conquer analysis
Asynchronous communication
In this case, tasks that possess data (producers) cannot determine when other tasks (consumers) will require it.
Consumers must explicitly request data from producers.
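A minimal sketch of this request/reply pattern using threads (all names hypothetical): the producer cannot anticipate its consumers, so it blocks until an explicit request arrives.

```python
import queue
import threading

requests = queue.Queue()                  # consumers' explicit requests
data = {"a": 1, "b": 2}                   # values owned by the producer

def producer():
    while True:
        key, reply = requests.get()       # block until someone asks
        if key is None:                   # sentinel: shut down
            return
        reply.put(data[key])              # answer that consumer only

def consume(key):
    reply = queue.Queue()
    requests.put((key, reply))            # explicitly request the value
    return reply.get()

t = threading.Thread(target=producer)
t.start()
value = consume("b")                      # value == 2
requests.put((None, None))
t.join()
```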
Communication checklist
Agglomeration
Agglomeration: Conflicting goals
Increasing granularity
Increasing granularity: Fine grained version
Fine-grained partition of an 8 × 8 grid.
Partitioned into 64 tasks.
Each task is responsible for a single point.
64 × 4 = 256 communications are required, 4 per task.
A total of 256 data values are transferred.

(Figure: outgoing messages are dark shaded and incoming messages are light shaded.)
Increasing granularity: Coarse grained version
Coarse-grained partition of an 8 × 8 grid.
Partitioned into 4 tasks.
Each task is responsible for 16 points.
4 × 4 = 16 communications are required.
A total of 16 × 4 = 64 data values are transferred.

(Figure: outgoing messages are dark shaded and incoming messages are light shaded.)
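Both slides follow from a small cost model (hypothetical helper; assumes square b × b blocks with one message per side and b boundary values per message, counted as on the slides):

```python
# Messages and transferred values for an n x n grid split into b x b blocks.
def comm_cost(n, b):
    tasks = (n // b) ** 2
    messages = 4 * tasks                  # 4 messages per task
    values = messages * b                 # b boundary values per message
    return tasks, messages, values

print(comm_cost(8, 1))                    # fine-grained:   (64, 256, 256)
print(comm_cost(8, 4))                    # coarse-grained: (4, 16, 64)
```

Larger blocks shrink communication because only the surface of a block is sent, while its entire volume is computed locally.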
Surface-to-volume effects
Replicating computation
Replicating computation example
Sum followed by broadcast: N tasks each have a value that must be
combined into a sum and made available to all tasks.

1. Each task receives a partial sum from its neighbor, updates the sum, and
passes on the updated value. Task 0 completes the sum and sends it back.
This requires 2(N - 1) communication steps.
Replicating computation example
These algorithms are optimal in the sense that they do not perform any
unnecessary computation or communication.

To improve the first summation, assume that tasks are connected in a ring
rather than an array, and that all N tasks execute the same algorithm so
that N partial sums are in motion simultaneously. After N - 1 steps, the
complete sum is replicated in every task.

This strategy avoids the need for a subsequent broadcast operation, but at
the expense of (N - 1)^2 redundant additions and (N - 1)^2 unnecessary
(but simultaneous) communications.
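The ring variant can be checked with a sequential simulation (hypothetical helper name): each of the N tasks forwards the value it last received while adding it into its own running sum.

```python
# Simulate the ring algorithm: after N-1 shift-and-add steps, every task
# holds the complete sum (the redundant additions happen "in parallel").
def ring_allsum(values):
    n = len(values)
    sums = list(values)                   # each task's running sum
    carry = list(values)                  # value each task sends next
    for _ in range(n - 1):
        received = [carry[(i - 1) % n] for i in range(n)]  # shift right
        sums = [s + r for s, r in zip(sums, received)]
        carry = received                  # forward what was just received
    return sums

print(ring_allsum([1, 2, 3, 4]))          # [10, 10, 10, 10]
```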
Replicating computation example
The tree summation algorithm can be modified so that after log N steps
each task has a copy of the sum. When the communication structure is a
butterfly, only O(N log N) operations are required. In the case that
N = 8 this looks like the following (figure not shown):
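The N = 8 butterfly can be sketched in code (hypothetical helper; assumes N is a power of two, with task i pairing with task i XOR 2^k at step k, the standard butterfly pattern):

```python
# Butterfly summation: in step k each task swaps partial sums with the
# partner whose index differs in bit k; after log2(N) steps all tasks
# hold the total, using N additions per step.
def butterfly_allsum(values):
    n = len(values)                       # assumed to be a power of two
    sums = list(values)
    step = 1
    while step < n:
        sums = [sums[i] + sums[i ^ step] for i in range(n)]
        step *= 2
    return sums

print(butterfly_allsum([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, ..., 36]
```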
Avoiding communication
Preserving flexibility
Reducing software engineering costs
Agglomeration design checklist
Agglomeration design checklist (continued)
Mapping
Load balancing
Recursive Bisection
Local Algorithms
Probabilistic Methods
Cyclic Methods
Recursive bisection
Local algorithms
Probabilistic methods
Cyclic mappings
Task-scheduling algorithms
Manager/Worker
Hierarchical Manager/Worker
Decentralized schemes
No central manager.
A separate task pool is maintained on each processor.
Idle workers request problems from other processors.
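A toy round-based simulation of such a scheme (the function name and the steal-half policy are illustrative assumptions, not from the slides):

```python
import random

# Each processor owns its own task pool; an idle processor requests work
# from a random peer, which hands over half of its remaining tasks.
# Returns the total number of tasks executed.
def run(pools, seed=0):
    rng = random.Random(seed)
    done = 0
    while any(pools):
        for i in range(len(pools)):
            if pools[i]:
                pools[i].pop()                    # execute one local task
                done += 1
            else:
                j = rng.randrange(len(pools))     # ask a random peer
                if j != i and len(pools[j]) > 1:
                    half = len(pools[j]) // 2
                    pools[i] = [pools[j].pop() for _ in range(half)]
    return done

print(run([[1] * 12, [], [], []]))                # all 12 tasks complete
```

With no central manager, load spreads as idle processors pull work toward themselves; termination detection (next slide) is the hard part such schemes must add.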
Termination detection
Mapping design checklist