http://www.dcs.warwick.ac.uk/~hpsg/
email: {junwei, djke}@dcs.warwick.ac.uk
IEEE International Performance Computing and Communications Conference, IPCCC-2000, Phoenix, February 2000, pp. 485-492.
This is a modular approach that leads to readily reusable models, which can be interchanged for experimental analysis.

Each of the modules in PACE can be described at multiple levels of detail in a similar way to POEMS, thus providing a range of result accuracies but at varying costs in terms of prediction evaluation time. PACE is aimed at pre-implementation analysis, such as design or code-porting activities, as well as at on-the-fly use in scheduling systems, in a similar manner to that of AppLeS.

The core component of PACE is a performance specification language, CHIP³S (Characterisation Instrumentation for Performance Prediction of Parallel Systems) [9]. CHIP³S provides a syntax that allows the performance aspects of an application, and of its parallelisation, to be expressed. This includes control flow information, resource usage information (e.g. number of operations), communication structures, and mapping information for a parallel or distributed system.

In the work presented in this paper, the use of the PACE system is described through an example application kernel, Sweep3D [10]. Sweep3D is part of the ASCI application suite [11], which has been used to evaluate advanced parallel architectures at Los Alamos National Laboratories. The capabilities for performance evaluation within PACE are illustrated through the cross-platform use of Sweep3D on both an SGI Origin2000 (a shared memory system) and a cluster of SunUltra1 workstations.

The rest of the paper is organised as follows: Section 2 describes the performance modelling approach based on the PACE conceptual framework. Section 3 gives an overview of the Sweep3D application and how it is described in the CHIP³S performance specification language. Section 4 illustrates the performance predictions that can be produced by PACE on the two systems considered. Preliminary conclusions are discussed in Section 5.

2 PACE Performance Modelling Approach

The main concepts behind PACE include a layered framework and the use of associative objects as a basis for representing system components. An initial implementation of PACE supports performance modelling of parallel and distributed applications from object definition, through model creation, to result generation. These factors are described further below.

2.1 Layered Framework

Many existing techniques, particularly for the analysis of serial machines, use Software Performance Engineering (SPE) methodologies [12] to provide a representation of the whole system in terms of two modular components, namely a software execution model and a system model. However, for high performance computing systems, which involve concurrency and parallelism, the model must be enhanced. The layered framework is an extension of SPE for the characterisation of parallel and distributed systems. It supports the development of three types of models: a software model, a parallelisation model, and a system (hardware) model. It allows the separation of the software and hardware models by the addition of the intermediate parallelisation model.

The framework and layers can be used to represent entire systems, including the application, parallelisation, and hardware aspects, as illustrated in Figure 1.

[Figure 1. The Layered Framework: the Application Domain maps onto an Application Layer, a Subtask Layer, a Parallel Template Layer, and a Hardware Layer.]

The functions of the layers are:

• Application Layer – describes the application in terms of a sequence of parallel kernels, or subtasks. It acts as the entry point to the performance study, and includes an interface that can be used to modify the parameters of a performance study.

• Application Subtask Layer – describes the sequential part of every subtask within an application that can be executed in parallel.

• Parallel Template Layer – describes the parallel characteristics of subtasks in terms of the expected computation-communication interactions between processors.

• Hardware Layer – collects system specification parameters, micro-benchmark results, statistical models, analytical models, and heuristics to characterise the communication and computation abilities of a particular system.

According to the layered framework, a performance model is built up from a number of separate objects. Each object is of one of the following types: application, subtask, parallel template, or hardware. A key feature of the object organisation is the independent representation of computation, parallelisation, and hardware. This is possible due to strict object interaction rules.

All objects have a similar structure, and a hierarchical set of objects, representing the layers of the framework, is built up into the complete performance model. An example of a complete performance model, represented by a Hierarchical Layered Framework Diagram (HLFD), is shown in Figure 7.
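The object types and their strict interaction rules can be sketched in a few lines of Python. This is only an illustration of the layering, not the actual PACE implementation: all class names, cost parameters, and the evaluation pattern below are invented for the sketch.

```python
# Sketch of the PACE layered framework: an application object composed
# of subtasks, each bound to a parallel template that is evaluated
# against a hardware object.  All names and costs are illustrative.

class Hardware:
    """Characterises computation/communication costs of one system."""
    def __init__(self, time_per_op, time_per_msg):
        self.time_per_op = time_per_op    # seconds per operation
        self.time_per_msg = time_per_msg  # seconds per message

class Subtask:
    """Sequential part of one parallel kernel: resource usage counts."""
    def __init__(self, ops, msgs):
        self.ops, self.msgs = ops, msgs

class ParallelTemplate:
    """Computation-communication interaction pattern of a subtask."""
    def evaluate(self, subtask, hw, nproc):
        # naive pattern: perfect compute split plus per-sweep messages
        return (subtask.ops / nproc) * hw.time_per_op \
               + subtask.msgs * hw.time_per_msg

class Application:
    """Entry point: a sequence of (subtask, template) pairs."""
    def __init__(self, parts):
        self.parts = parts
    def predict(self, hw, nproc):
        return sum(t.evaluate(s, hw, nproc) for s, t in self.parts)
```

Because the hardware layer is a separate object, swapping one hardware model for another changes the prediction without touching the software objects, which is exactly the separation the layered framework is designed to provide.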
2.2 Object Definition

Each software object (application, subtask, or parallel template) comprises an internal structure, options, and an interface that can be used by other objects to modify its behaviour. A schematic representation of a software object is shown in Figure 2.

[Figure 2. Software Object Structure: a type identifier, include statements, external variable definitions, link statements, options, and procedures, with links to lower and higher objects.]

Each hardware object is subdivided into many smaller component hardware models, each describing the behaviour of an individual part of the hardware system.

In PACE, a Hardware Model Configuration Language (HMCL) allows users to create new hardware objects by specifying system-dependent parameters. On evaluation, the relevant sets of parameters are used and supplied to the evaluation methods of each of the component models. An example HMCL fragment is illustrated later in Figure 9E.

2.4 Mapping Relations

There are strict mapping relations between the source code of an application and its performance model. Figure 5 illustrates the way in which independent objects are abstracted directly from the source code and built up into a complete performance model, which can then be used to produce performance prediction results.

[Figure 5. Model objects (application, subtasks, parallel templates) are abstracted from the application source code and combined into the model scripts.]
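As a rough intuition for what an HMCL-specified hardware object provides, one can think of named sets of system-dependent parameters consumed by the component evaluation methods. The system names below mirror those used later in the paper, but every parameter name and value is invented for illustration; the real HMCL syntax appears in Figure 9E.

```python
# Sketch: a hardware object as named sets of system-dependent
# parameters, fed to component-model evaluation methods.
# Parameter names and values are illustrative, not PACE's real ones.

HARDWARE_CONFIGS = {
    "SgiOrigin2000": {"t_op": 5e-9, "t_setup": 12e-6, "t_byte": 8e-9},
    "SunUltra1":     {"t_op": 9e-9, "t_setup": 500e-6, "t_byte": 95e-9},
}

def comm_time(system, nbytes):
    """Component communication model: latency + bandwidth term."""
    p = HARDWARE_CONFIGS[system]
    return p["t_setup"] + nbytes * p["t_byte"]
```

Selecting a different system name selects a different parameter set, which is how one hardware object can be interchanged with another at evaluation time.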
IJK. The angular dependence is handled by discrete angles with a spherical harmonics treatment for the scattering source. The solution involves two main steps:

• the streaming operator is solved by sweeps for each angle, and
• the scattering operator is solved iteratively.

A sweep (Sn) proceeds as follows. For one of eight given angles, each grid cell has 4 equations with 7 unknowns (6 faces plus 1 central); boundary conditions complete the system of equations. The solution is by a direct ordered solve known as a sweep, from one corner of the data cube to the opposite corner. Three known inflows allow the cell centre to be solved, producing three outflows. Each cell's solution then provides inflows to 3 adjoining cells (1 in each of the I, J, and K directions). This represents a wavefront evaluation in all 3 grid directions. For XYZ geometries, each octant of angles has a different sweep direction through the mesh, but all angles in a given octant sweep the same way.

Sweep3D exploits parallelism through the wavefront process. The data cube undergoes a decomposition so that a set of processors, indexed in a 2D array, hold part of the data in the I and J dimensions, and all of the data in the K dimension. The sweep processing consists of pipelining the data flow from each cube vertex in turn to its opposite vertex. It is possible for different sweeps to be in operation at the same time, but on different processors.

[Figure 6. Data Decomposition of the Sweep3D Cube: the I-J plane is divided over a 2D processor grid; vertices A, B, and C mark wavefront start and end points.]

For example, Figure 6 depicts a wavefront (shaded in grey) that originated from the unseen vertex of the cube and is about to finish at vertex A. At the same time, a further wavefront is starting at vertex B and will finish at vertex C. Note that the example shows the use of a 5x5 grid of processors, and in this case each processor holds a total of 2x2x10 data elements (a data set of 10x10x10).

3.2 Model Description

We define the application object of the performance model as sweep3d, and divide each iteration of the application into four subtasks according to their different functions and different parallelisations. The object hierarchy is shown in Figure 7; each object is a separate rectangle and is labelled with the object name.

[Figure 7. Sweep3D Object Hierarchy (HLFD Diagram): the application object sweep3d uses the subtask objects source, sweep, fixed, and flux_err; these map onto the parallel template objects async, pipeline, globalsum, and globalmax, which are evaluated against the hardware object SgiOrigin2000.]

The functions of each object are:

• sweep3d – the entry point of the whole performance model. It initialises all parameters used in the model and calls the subtasks iteratively according to the convergence control parameter (epsi) as input by the user. Figure 8 shows the different parts of the sweep3d object in CHIP³S scripts.

    application sweep3d {
        include hardware;
        include source;
        include sweep;
        include fixed;
        include flux_err;
        var numeric:
            npe_i = 2,
            npe_j = 3,
            mk = 10,
            mmi = 3,
            it_g = 50,
            jt_g = 50,
            kt = 50,
            epsi = -12,
            ······
        link {
            hardware:
                Nproc = npe_i * npe_j;
            source:
                it = it,
                ······
            sweep:
                it = it,
                ······
        }
        option {
            hrduse = "SgiOrigin2000";
        }
        proc exec init {
            ······
            for(i = 1; i <= -epsi; i = i + 1) {
                call source;
                call sweep;
                call fixed;
                call flux_err;
            }
        }
    }

    Figure 8. Sweep3D Application Object
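The pipelined wavefront described above admits a simple first-order cost sketch: on an m x n processor grid, a sweep must cross m + n - 1 diagonals before it completes, and successive blocks of work overlap in the pipeline. The step-cost model below is an illustrative assumption, not the PACE pipeline template.

```python
# First-order sketch of pipelined wavefront timing on an m x n
# processor array: a single sweep needs (m + n - 1) diagonal steps to
# cross the grid; with b blocks of work pipelined through it, later
# blocks overlap earlier ones.  t_step is a hypothetical per-step
# compute-plus-communicate cost, not a measured PACE value.

def wavefront_time(m, n, blocks, t_step):
    fill = m + n - 2          # steps before the far-corner processor starts
    return (fill + blocks) * t_step

# The diagonal structure itself: processors (i, j) with i + j == s can
# all compute at step s, since their I- and J-inflows are then ready.
def wavefront_cells(m, n, s):
    return [(i, s - i) for i in range(m) if 0 <= s - i < n]
```

The fill term is why parallel efficiency drops as the processor grid grows for a fixed problem size: the pipeline spends proportionally more time filling and draining.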
The sections correspond to those shown schematically in Figure 2.

• source – subtask for computing the source moments, which is actually a sequential process.
• sweep – subtask for the sweeper, which is the core component of the application.
• fixed – subtask to compute the total flux fixup number during each iteration.
• flux_err – subtask to compute the maximum relative flux error.
• async – a sequential "parallel" template.
• pipeline – parallel template made specifically for the sweeper function.
• globalsum – parallel template which represents the parallel pattern for obtaining the sum of a given parameter over all the processors.
• globalmax – parallel template which represents the parallel pattern for obtaining the maximum of a given parameter over all the processors.
• SgiOrigin2000 – contains all the hardware configurations for the SGI Origin2000, and comprises smaller component hardware models already in existence within PACE. It can be interchanged with a hardware model of a different system, e.g. a cluster of SUN workstations.
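The per-iteration calling structure that these objects implement (shown concretely in the exec proc of the CHIP³S script in Figure 8) can be mirrored in a toy cost function: the four subtasks run in sequence once per iteration, and the iteration count is driven by the convergence parameter epsi. The subtask times below are placeholders, not PACE outputs.

```python
# Sketch of the sweep3d driver from Figure 8: source, sweep, fixed and
# flux_err are called once per iteration, and the loop runs -epsi times
# (for epsi = -12, 12 iterations).  Subtask times are placeholders.

def sweep3d_time(epsi, t_source, t_sweep, t_fixed, t_flux_err):
    per_iter = t_source + t_sweep + t_fixed + t_flux_err
    return int(-epsi) * per_iter
```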
The example model objects and their correspondence with the C source code are shown in Figure 9. Figure 9A is the C source code of part of the main function sweep, whose serial parts have been abstracted into a number of sub-functions (in bold font). Figure 9C shows how the same source code structure is used to provide the parallel template description. Figure 9B is an example sub-function source code, which can be converted automatically to the control flow procedure in the subtask object, as shown in Figure 9D.

Some of the main statements used in the CHIP³S language to represent the performance aspects of the source code are as follows:

• compute – a processing part of the application; its argument is a resource usage vector. This vector is evaluated through the hardware object.
• loop – the body of which includes a list of the control flow statements that will be repeated.
• call – used to execute another procedure.
• case – the body of which includes a list of expressions and corresponding control flow statements that might be evaluated.
• step – corresponds to the use of one of the hardware resources of the system. Its argument is used to configure the device specified in the current step. This is used in parallel templates only.
• confdev – configures a device. The meaning of its arguments depends on the device. For example, the device mpirecv (MPI receive communication operation) accepts three arguments: source processor ID, destination processor ID, and message size.

It can be seen from the part of the Sweep3D model shown in Figure 9 that a lot of information extracted from the source code is used for the performance prediction. The accuracy of the resulting model is important, and in Section 4 below, detailed results are shown to validate the model against measurements on the two systems considered.

Figure 9 also shows the inner mapping between the software objects and the hardware object of the performance model. The abundant off-line configuration information included in the hardware object is the basis for achieving a rapid evaluation time when producing the performance predictions.

4 Validation Results

In this section, preliminary validation results on execution time for Sweep3D are given to illustrate the accuracy of the PACE modelling capabilities for performance evaluation. The procedures used by the PACE evaluation engine to achieve these results are complex and beyond the scope of this paper; further details can be found in [8].
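The control-flow statements listed above (compute, loop, call, case) essentially define a small language that an evaluation engine can walk to produce a time estimate. A toy evaluator for a subset of them might look as follows; the tuple encoding and cost callback are invented for illustration, and the real CHIP³S evaluation engine [8] is far more involved.

```python
# Sketch: evaluating a CHIP3S-like control-flow procedure to a
# predicted time.  Statements are (kind, payload) tuples; 'compute'
# costs are resolved through a hardware-object callback, mirroring the
# statement list above.  All encodings here are illustrative.

def evaluate(stmts, procs, hw_cost):
    t = 0.0
    for kind, arg in stmts:
        if kind == "compute":        # arg: resource usage vector
            t += hw_cost(arg)
        elif kind == "loop":         # arg: (count, body)
            count, body = arg
            t += count * evaluate(body, procs, hw_cost)
        elif kind == "call":         # arg: name of another procedure
            t += evaluate(procs[arg], procs, hw_cost)
        elif kind == "case":         # arg: [(probability, body), ...]
            t += sum(p * evaluate(b, procs, hw_cost) for p, b in arg)
    return t
```

Because compute costs are resolved through the hardware callback, the same control-flow description yields different predictions for different hardware objects.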
[Figure 10. Run times predicted by the model versus measured run times on the SGI Origin2000, for grid sizes 15x15x15, 25x25x25, 35x35x35, and 50x50x50 on 2 to 16 processors.]
[Figure 11. Run times predicted by the model versus measured run times on the SunUltra1 cluster, for grid sizes 15x15x15, 25x25x25, 35x35x35, and 50x50x50 on 2 to 9 processors.]
Figure 10 shows the validation of the PACE model against the code running on an SGI Origin2000 shared memory system. Note that the single-processor result is not included, because the sequential code involves many special configurations that are not included in the current performance model. As shown in the figure, the run time decreases as the number of processors increases. At the same time, the parallel efficiency decreases too; in fact, when the number of processors exceeds 16, the run time does not improve any further.

By changing only the hardware object to the SunUltra1, predictions on this new system can be obtained, as shown in Figure 11. A cluster of 9 SunUltra1 workstations was used to obtain the measurements, assuming no background loading. The run time is much greater than on the SGI Origin2000 with the same workload, but the trend of the curve is almost the same.

The accuracy of the prediction results was evaluated as follows:

    Error = |Measurement − Prediction| / Measurement × 100%.

The errors between measurements and predictions are shown in Table 1 for the SGI Origin2000, and in Table 2 for the SunUltra1 workstation cluster. It can be seen that the maximum error is approximately 10% in both cases, and the average error is approximately 5%.

Table 1. Prediction Error on SGI Origin2000

    Err.(%)  15x15x15  25x25x25  35x35x35  50x50x50
    1x2         6.53     10.44      7.02     -5.02
    2x2         0.45      4.60      9.37      9.80
    2x3         1.38     -0.73      4.47     -2.46
    2x4        -5.66      0.82      1.12     -5.60
    3x3        -0.29     -0.13      0.48     -4.55
    3x4        -4.72     -4.92     -1.13     -7.62
    4x4        -9.54     -4.90    -11.44      0.20

Table 2. Prediction Error on SunUltra1

    Err.(%)  15x15x15  25x25x25  35x35x35  50x50x50
    1x2        -6.79      0.15      3.24     -1.12
    2x2         7.07      8.07      5.62      5.30
    2x3         4.00      1.64     -0.20      0.32
    2x4         2.85     -1.49     -4.30    -10.06
    3x3         5.01      3.42      2.27      0.82

Besides the reasonable accuracy, the performance model can be used to obtain evaluation results rapidly, typically in less than 2 s. This is a key feature of PACE that enables the performance models to be used to help steer the execution of an application onto an available system at run time in an efficient manner [15].
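The error figures follow directly from this formula. A signed variant is assumed below to match the positive and negative entries of Tables 1 and 2; the sign convention is not stated in the text.

```python
# Sketch: prediction error as defined above.  The signed variant
# (prediction minus measurement, relative to measurement) is an
# assumption made to match the +/- entries of Tables 1 and 2.

def error_pct(measurement, prediction, signed=False):
    e = (prediction - measurement) / measurement * 100.0
    return e if signed else abs(e)
```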
5 Conclusions

This work has described a performance modelling approach for parallel and distributed computing using the PACE toolset. A case study of the Sweep3D application has been given, containing both model descriptions and validation results. The main points include:

• a parallel template layer in the PACE framework,
• a performance modelling language, CHIP³S,
• a semi-automated Application Characterization Tool, ACT,
• a Hardware Model Configuration Language, HMCL, and
• strict mapping relations to obtain a performance model directly from the source code with maximum application information.

These lead to the key features of PACE, which include:

• reasonable prediction accuracy, approximately 10% error at most,
• a rapid evaluation time, typically less than 2 s, and
• easy performance comparison across different computational systems.

Ultimately, the value of a tool such as PACE is measured by the studies it enables. The performance evaluation study on Sweep3D proves it to be successful.

Acknowledgement

This work is funded in part by DARPA contract N66001-97-C-8530, awarded under the Performance Technology Initiative administered by NOSC.

References

[1] I. Foster and C. Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, July 1998.
[2] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R.L. Oliver, D. Sundaram-Stukel, H. Wasserman, V.S. Adve, R. Bagrodia, J.C. Browne, E. Houstis, O. Lubeck, J. Rice, P.J. Teller, and M.K. Vernon, "POEMS: End-to-end Performance Design of Large Parallel Adaptive Computational Systems", Proceedings of the 1st International Workshop on Software and Performance, pp. 18-30, 1998.
[3] M.I. Frank, A. Agarwal, and M.K. Vernon, "LoPC: Modelling Contention in Parallel Algorithms", Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming (PPoPP '97), Las Vegas, pp. 62-73, June 1997.
[4] R. Bagrodia, R. Meyer, M. Takai, Y.A. Chen, X. Zeng, J. Martin, and H.Y. Song, "Parsec: A Parallel Simulation Environment for Complex Systems", IEEE Computer, Vol. 31(10), pp. 77-85, October 1998.
[5] J.M. Schopf, "Structural Prediction Models for High-Performance Distributed Applications", Proceedings of the 1997 Cluster Computing Conference, 1997.
[6] M. Uysal, T.M. Kurc, A. Sussman, and J. Saltz, "A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines", Proceedings of the 4th Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, 1998.
[7] V.S. Pai, P. Ranganathan, and S.V. Adve, "RSIM Reference Manual, Version 1.0", Department of Electrical and Computer Engineering, Rice University, Technical Report 9705, July 1997.
[8] G.R. Nudd, D.J. Kerbyson, E. Papaefstathiou, S.C. Perry, J.S. Harper, and D.V. Wilcox, "PACE – A Toolset for the Performance Prediction of Parallel and Distributed Systems", to appear in High Performance Systems, Sage Science Press, 1999.
[9] E. Papaefstathiou, D.J. Kerbyson, G.R. Nudd, and T.J. Atherton, "An Overview of the CHIP³S Performance Prediction Toolset for Parallel Systems", Proceedings of the 8th ISCA International Conference on Parallel and Distributed Computing Systems, pp. 527-533, 1995.
[10] K.R. Koch, R.S. Baker, and R.E. Alcouffe, "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor", Transactions of the American Nuclear Society, Vol. 65(108), 1992.
[11] D.A. Nowak and C. Christensen, "ASCI Applications", Report 232247, Lawrence Livermore National Laboratory, November 1997.
[12] C.U. Smith, "Performance Engineering of Software Systems", Addison-Wesley, 1990.
[13] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy, S. Liao, E. Bugnion, and M.S. Lam, "Maximizing Multiprocessor Performance with the SUIF Compiler", IEEE Computer, Vol. 29(12), pp. 84-89, December 1996.
[14] B. Qin, H.A. Sholl, and R.A. Ammar, "Micro Time Cost Analysis of Parallel Computations", IEEE Transactions on Computers, Vol. 40(5), pp. 613-628, 1991.
[15] D.J. Kerbyson, E. Papaefstathiou, and G.R. Nudd, "Application Execution Steering Using On-the-fly Performance Prediction", in High Performance Computing and Networking, Springer-Verlag, 1998.