ExaShark: A Scalable Hybrid Array Kit for Exascale Simulation
Imen Chakroun
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium
Tom Vander Aa
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium
Bruno De Fraine
Vrije Universiteit Brussel, Brussels, Belgium
Tom Haber
ExaScience Life Lab, Belgium
UHasselt, Belgium
Roel Wuyts
ExaScience Life Lab, Belgium
IMEC, Leuven, Belgium
DistriNet, KU Leuven, Belgium
Wolfgang De Meuter
Vrije Universiteit Brussel, Brussels, Belgium
ABSTRACT
This paper introduces ExaShark: a partitioned global address space (PGAS) library that provides a high-level, abstract interface for handling shared and distributed multidimensional arrays. It offers its users global-arrays-like usability, while its runtime can be configured to use any combination of shared-memory threading techniques (Pthreads, OpenMP, TBB) and inter-node distribution techniques (MPI, GPI). Compared to similar general-purpose structured grid-based application programming interfaces, ExaShark is enhanced with advanced features such as expression templates and operator overloading for global arrays, coupling of multiple grids, and dynamic redistribution of the grid.
INTRODUCTION
High Performance Computing (HPC) architectures are expected to change dramatically in the next decade with the arrival of exascale computing: high-performance computers that offer 1 exaFlop (10^18 floating-point operations per second) of performance. Because of power and cooling constraints, large increases in individual core performance are not possible, and as a result on-chip parallelism is increasing rapidly. The expected hardware for an exascale machine node will therefore need to rely on massive parallelism both on-chip and off-chip, with a complex distributed hierarchy of resources. Programming a machine of such scale and complexity will be very hard unless appropriate, workable layers of abstraction are introduced to bridge the gap between the hardware and the application software.
To validate our claims of performance and usability, the performance of ExaShark is evaluated on advanced state-of-the-art benchmarks: numerical solvers and stencil codes.
The remainder of this paper is organized as follows. In Section 2, existing work dealing with multidimensional arrays is
presented. Section 3 describes the ExaShark design focusing
mainly on the communication layer, the data sharing patterns
and the advanced features such as ghost regions. In Section
4, the experimental results are presented on two benchmarks:
advanced numerical solvers and stencil codes. Some conclusions and perspectives of this work are drawn in Section 5.
RELATED WORK
This section presents existing work dealing with multidimensional arrays. We particularly focus on general-purpose libraries, not on libraries such as [8] and [16] that are specific to stencil computations, which is only one optimization domain of ExaShark.

Despite the importance of multidimensional arrays to scientific applications, common programming languages such as C++ provide only limited support for such arrays (e.g., the Boost library).

However, a number of efforts exist that aim at creating reusable libraries to support scientists in implementing grids. The most commonly used PGAS array library is probably Global Arrays (GA) [4]. GA is implemented as a library with C and Fortran bindings, and more recently added Python and C++ interfaces (starting with release 3.2). GA is built on top of the Aggregate Remote Memory Copy Interface (ARMCI) [1], a low-level, one-sided communication runtime system. ARMCI forms GA's portability layer: when porting GA to a new platform, ARMCI must be implemented using that platform's native communication and data management primitives. GA works with either the MPI or TCGMSG message-passing libraries for communication, but it does not seem to support multi-threading. The fact that variables are always local to an MPI process, and that sharing them requires explicit communication between processes, renders the pure MPI approach, without support for one-sided communication, unsustainable on future large-scale systems with growing numbers of cores and decreasing amounts of memory per core.

In [17], LibGeoDecomp (Library for Geometric Decomposition codes) is presented. It is a generic C++ library with a high abstraction level that uses grids to express computations in the presence of high-latency networks. It is limited to 2D and 3D simulations. LibGeoDecomp supports accelerators such as the Intel Xeon Phi coprocessor and GPUs. Even though LibGeoDecomp provides a high-level C++ interface, it does not use mechanisms such as expression templates to speed up mathematical operations on the grids.

FRAMEWORK DESIGN
ExaShark is an open-source middleware [6], offered as a library, targeted at reducing the increasing programming burden on heterogeneous current and future exascale architectures. ExaShark handles matrices that are physically distributed blockwise, either regularly or as the Cartesian product of irregular distributions on each axis. Access to the global array is performed through logical indexing.

To define a global array, the programmer specifies its coordinates, which are the n dimensions of the structure, and its data. The data can also be redistributed at run time. In addition, the programmer can define halos/ghost regions in the global array (see Figure 1). Indeed, many applications based on regular grids need support for ghost cells. These regions are boundary data that one process needs in order to compute its inner part of the global array, but which are remotely held by other processes and need to be exchanged (updated) at each iteration. Because they are error-prone and require expert knowledge, ExaShark manages these features transparently for the programmer. Along with ghost cells, periodic boundary conditions are also supported.
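To make the model concrete, the sketch below shows how such an array might be declared and its ghost regions refreshed. The names (shark::Init, GlobalArray, update_ghosts) are illustrative assumptions for this sketch, not ExaShark's documented interface.

#include "exashark.h"  // hypothetical header name

int main(int argc, char **argv) {
  shark::Init(&argc, &argv);                 // start the runtime (threads + MPI/GPI underneath)
  // A 2D double-precision global array of 1024x1024 elements, distributed
  // blockwise over all processes, with a ghost region of width 1 on each side.
  shark::GlobalArray<double, 2> grid({1024, 1024}, /*ghost_width=*/1);
  for (int it = 0; it < 100; ++it) {
    grid.update_ghosts();                    // exchange halo data with neighboring processes
    // ... compute on the locally owned part of the grid ...
  }
  shark::Finalize();
  return 0;
}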
Our major architectural drivers for a scalable structured grid-based library are efficiency, portability and ease of coding. ExaShark is portable since it is built upon widely used technologies such as MPI and the C++ programming language. It provides simple coding via a global-arrays-like interface which offers template-based functions (dot products, matrix multiplications, unary expressions). The functionality offered by ExaShark is efficient since it uses asynchronous and specialized communication patterns. For example, consider the update operation, which fills in ghost cells with the visible data residing on neighboring processors. This operation usually induces latency. To hide this latency, ExaShark provides an asynchronous version in which the data exchanges are overlapped with the update of the inner region of each process.

Table 1. Comparison of ExaShark, Global Arrays (GA) and LibGeoDecomp.

                        ExaShark   Global Arrays        LibGeoDecomp
Programming language    C++        Python/Fortran/C++   C++
Ghost regions           Yes        Yes                  Yes
Communication layer     MPI/GPI    MPI/TCGMSG           MPI/OpenMP
-                       Yes        Yes                  No
Xeon Phi support        No         No                   Yes
Domain redistribution   Yes        No                   Yes
Stencil operations      Yes        No                   Yes
-                       Yes        Yes                  No
Expression templates    Yes        No                   No

One of the keys to ExaShark's efficiency is that it uses asynchronous communication and ensures overlapping of communication and computation whenever possible. Several types of communication are used according to the computation performed. For instance, geometric communication is needed for the update of the ghost regions, one-sided communication is used for the get and put routines, and collective operations are performed for the reductions.
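The overlap described above typically follows a split-phase pattern. The sketch below illustrates it with hypothetical names (update_ghosts_async, wait_ghosts), since the exact ExaShark calls are not shown in this paper.

grid.update_ghosts_async();   // post asynchronous sends/receives for the halos
compute_inner(grid);          // update cells that do not depend on ghost values
grid.wait_ghosts();           // complete the exchange
compute_boundary(grid);       // update cells that need the fresh ghost values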
Data distribution
ExaShark allows the programmer to control regular and irregular data distributions of the global array. Indeed, ExaShark is based on the PGAS parallel programming model, which is convenient for expressing algorithms with large and random data access. Each process is assumed to have fast access to a portion of each distributed matrix, and slower access to the remainder. These access-time differences define the data as being either local or remote, respectively. This locality information can be exploited at each iteration of the computation. For example, the user can inquire which data portion is held by a given process or which process owns a particular array element, identify inner and outer regions of the grid, etc.
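For illustration, such locality queries might look as follows; the method names are assumptions for this sketch, not the documented API.

auto local = grid.local_region();        // index ranges owned by this process
int  owner = grid.owner_of({512, 512});  // rank that holds a given element
bool mine  = (owner == shark::my_rank());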
Expression templates over global arrays
ExaShark offers many high-level functions traditionally associated with arrays, eliminating the need for programmers to write these functions themselves. Examples are basic mathematical operators, unary functions, standard global-array operations such as dot products and matrix-vector multiplication, and so on. These functionalities are implemented using expression templates [18].

In Figure 2, we show how a basic code (left) of the conjugate gradient solver, which uses mathematical operations over vectors, is rewritten using ExaShark (right). The goal of this algorithm is to give an approximate solution of the equation Ax = b, where A is the coefficient matrix, x is the solution vector that is returned, and b is the input vector. The first instruction of the ExaShark code (right) defines the global domain size (in this case n*n). The second instruction defines two global arrays which correspond to the x and b vectors in the pseudo code of CG (left). Here the matrix A is not allocated, since it is implicitly assembled using stencils. Indeed, the information about the underlying matrix is available only through matrix-vector computations; the matrix is probed approximately without forming and storing its elements, so we do not have access to a fully assembled matrix [15].

By using expression templates, ExaShark incorporates delayed expressions to reduce temporary memory allocation. For example, consider line 7 of the code sample in Figure 2. A naive implementation of the operation x = x + alpha*p, where alpha is a scalar and x and p are global arrays, would have operator+ and operator* overloaded to return arrays. The considered expression would then mean creating a temporary for alpha*p, then another temporary for x plus that first temporary, and then assigning that result to x. Even with the return value optimization, this allocates memory at least twice. Expression templates delay evaluation, so the expression essentially generates at compile time a new array constructor. It is as if this constructor takes a scalar and two arrays by reference; it allocates the necessary memory and then performs the computation. Thus only one memory allocation is performed.
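The mechanism can be illustrated with a self-contained miniature, independent of ExaShark's actual implementation [18], which is more general:

#include <cstddef>
#include <vector>

// Miniature expression-template example: x = x + alpha * p evaluates in a
// single loop with no intermediate array allocations.
struct Array {
  std::vector<double> d;
  double operator[](std::size_t i) const { return d[i]; }
  template <typename E>
  Array &operator=(const E &expr) {        // one pass over the data
    for (std::size_t i = 0; i < d.size(); ++i) d[i] = expr[i];
    return *this;
  }
};

struct Scaled {                            // represents alpha * p, unevaluated
  double a; const Array &p;
  double operator[](std::size_t i) const { return a * p[i]; }
};

struct Sum {                               // represents x + (alpha * p), unevaluated
  const Array &x; Scaled s;
  double operator[](std::size_t i) const { return x[i] + s[i]; }
};

Scaled operator*(double a, const Array &p) { return {a, p}; }
Sum operator+(const Array &x, const Scaled &s) { return {x, s}; }

int main() {
  Array x{std::vector<double>(1000, 1.0)};
  Array p{std::vector<double>(1000, 2.0)};
  double alpha = 0.5;
  x = x + alpha * p;                       // no temporaries: x[i] += alpha * p[i]
  return 0;
}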
Interoperability
The PETSc solvers, for example, can be called within an ExaShark application to solve PDEs that require solving large-scale, sparse nonlinear systems of equations. One important issue related to interoperability is how to convert the data structures before calling the PETSc solvers, and how to convert the PETSc data structures back after the solvers return. We believe that one of the most efficient ways to exchange data structures between ExaShark and PETSc, without inducing data-copying overheads, is to use the VecGetArray() and VecRestoreArray() routines of PETSc. VecGetArray() returns a pointer to a contiguous array that contains a processor's portion of the vector data. After calling the PETSc solvers, VecRestoreArray() must be invoked on that pointer; if the vector data is not stored in a contiguous array, this routine copies the data back into the underlying vector data structure from the array obtained with VecGetArray().

For the integration with Patus, the idea is to improve the performance of stencil operations at the node level, while ExaShark orchestrates the distribution of data and computation across different nodes and manages the communication between them. Here as well, the data conversion and memory alignment for the structures used by both packages should be carefully taken into account. Our plan is to use the stencil-embedded definition provided by Patus, which allows exchanging only pointers rather than performing data copies.

Generally speaking, it is not hard to integrate with external libraries because ExaShark internally stores its data in the common row-wise pattern. It is therefore sufficient in many cases to exchange pointers. If other data formats are used, integration is still possible but may incur the penalty of copying data from and/or to the external representation.
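As a sketch of this pointer-exchange pattern: the PETSc calls below are real API, while the surrounding setup is simplified and assumes the local slab is already a contiguous double array.

#include <petscksp.h>

// Wrap a process-local, contiguous slab of a distributed array in a PETSc Vec
// without copying, then read the results back through VecGetArray().
PetscErrorCode solve_local_slab(double *slab, PetscInt nlocal) {
  Vec x;
  // The Vec aliases the caller's buffer; no data is copied.
  VecCreateMPIWithArray(PETSC_COMM_WORLD, 1, nlocal, PETSC_DECIDE, slab, &x);
  // ... set up and run a PETSc solver on x here (e.g., with KSPSolve) ...
  PetscScalar *a;
  VecGetArray(x, &a);        // pointer to this processor's portion of the data
  // ... use a[0 .. nlocal-1] as the updated slab ...
  VecRestoreArray(x, &a);    // returns ownership (copies back if non-contiguous)
  VecDestroy(&x);
  return 0;
}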
VALIDATION
Benchmarks
Two applications are considered as benchmarks for measuring the performance of ExaShark. The first one is the conjugate gradient method (CG) [12], an algorithm for the numerical solution of particular systems of linear equations. The second one is a heat distribution simulation, a five-point stencil application. The aim of the first experiment is to evaluate the added value of the asynchronous and hybrid communication techniques supported by ExaShark. The second measures the performance of ExaShark in comparison to similar libraries, namely GA and LibGeoDecomp.
Iterative methods are an efficient way to obtain a good numerical approximation to the solution of Ax = b when the
matrix A is large and sparse. The CG method is a widely
used iterative method for solving such systems when the
matrix A is symmetric and positive definite. Generalizations
of CG exist for non-symmetric problems. A pseudo code of
the conjugate gradient method is given in Figure 2.
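For reference, a textbook serial version of the method is sketched below; in the ExaShark version of Figure 2, the vectors become global arrays, the dot products become global reductions, and matvec applies the stencil-assembled operator.

#include <cmath>
#include <vector>

// Unpreconditioned conjugate gradient for Ax = b, with A symmetric positive
// definite and applied only through matvec (no assembled matrix).
using Vec = std::vector<double>;

double dot(const Vec &a, const Vec &b) {
  double s = 0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

template <typename MatVec>
void cg(MatVec matvec, const Vec &b, Vec &x, int maxit, double tol) {
  Vec r = b, Ap(b.size());
  matvec(x, Ap);                                    // r = b - A x
  for (std::size_t i = 0; i < r.size(); ++i) r[i] -= Ap[i];
  Vec p = r;
  double rs = dot(r, r);
  for (int k = 0; k < maxit && std::sqrt(rs) > tol; ++k) {
    matvec(p, Ap);
    double alpha = rs / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];                         // x = x + alpha p
      r[i] -= alpha * Ap[i];                        // r = r - alpha A p
    }
    double rs_new = dot(r, r);
    double beta = rs_new / rs;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rs = rs_new;
  }
}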
We run three experiments for the CG solver using ExaShark, each with a different communication protocol. The first scenario is a CG implementation that relies on asynchronous communication, for which hybrid asynchronous MPI/OpenMP is used. In the second scenario, the same implementation of CG is run with MPI only. The last scenario corresponds to an implementation of CG where no asynchronous communication is considered; for this latter scenario, hybrid synchronous MPI/OpenMP is used.
The results, reported in Figure 3, show that using both intra- and inter-node asynchronous communication techniques is on average 32.17% and 19.4% better than synchronous communication and pure message-passing mechanisms, respectively. For example, resolving the same problem instance using 1024 cores takes 576 seconds with the hybrid asynchronous version of CG, while it lasts 971 seconds using hybrid synchronous communication and 835 seconds with asynchronous MPI only. The results also show that the speedup with asynchronous hybrid communication protocols scales better than with the other communication techniques. The results are machine- and preconditioner-dependent.
Heat distribution using the Jacobi iteration
Figure 4. Comparing the performance of ExaShark, GA and LibGeoDecomp on the 2D heat distribution problem
Jacobi [9] is a popular algorithm for solving Laplace's differential equation on a square, regularly discretized domain. The idea is the following: consider a 2D array of particles, each with an initial temperature value. Each particle is in contact with a fixed number of neighboring particles, which impact its temperature. Laplace's equation is solved for all particles to determine their temperature as the average of the four neighboring particles. To do so, a number of iterations are performed over the data to recompute average temperatures repeatedly, and the values gradually converge until the desired accuracy is reached.

Memory consumption is a serious problem for exascale numerical simulations. Indeed, our simulations with LibGeoDecomp crashed for the 2D heat distribution simulation of a 65536^2 grid because it consumes more than 64 GB of data, which is the maximum allowed memory size on our testing cluster.

Figure 5. Analyzing weak scaling of ExaShark, GA and LibGeoDecomp on the 2D heat distribution problem.
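A minimal serial sketch of the five-point update described above (the distributed version additionally refreshes ghost regions between sweeps):

#include <vector>

// One Jacobi sweep on an n x n grid stored row-wise: every interior point
// becomes the average of its four neighbors; boundary values stay fixed.
void jacobi_sweep(const std::vector<double> &in, std::vector<double> &out, int n) {
  for (int i = 1; i < n - 1; ++i)
    for (int j = 1; j < n - 1; ++j)
      out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                               in[i * n + j - 1] + in[i * n + j + 1]);
}
// The solver repeatedly calls jacobi_sweep, swapping the in/out buffers,
// until the largest change between sweeps falls below the desired accuracy.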
Usability
From the code point of view, InCode [5] computes software metrics for a given program, ranging from basic size and complexity metrics to more advanced ones. In Table 2, we report two metrics: the number of classes and the number of lines. According to these metrics, ExaShark has the smallest number of lines of code and of classes, which is an indication of an easier use of the library.

Table 2. Code metrics for ExaShark, Global Arrays and LibGeoDecomp.

                     ExaShark   Global Arrays   LibGeoDecomp
Number of lines      8881       501553          65925
Number of classes    12         464             535

From the design point of view, InCode detects design anti-patterns that are most commonly encountered in software projects. These anti-patterns cover all the essential design aspects: complexity, encapsulation, coupling, cohesion and inheritance. We ran InCode on GA, ExaShark and LibGeoDecomp and report the corresponding design flaws in Table 3. InCode shows that ExaShark has no design flaws, whereas the two other libraries have several. Some of these design flaws, such as data clumps and feature envy, affect the understanding of the operations provided by a library and consequently its usability. An explanation of the design flaws reported in Table 3 is given below.

Table 3. Design flaws detected by InCode.

Design Flaws            ExaShark   Global Arrays   LibGeoDecomp
Internal Duplication    0          280             -
External Duplication    0          729             -
Feature Envy            0          122             -
Data Clumps             0          1153            67
Data Class              0          249             22
Tradition Breaker       0          -               -
Schizophrenic Class     0          -               -

- Data Class: refers to a class with an interface that exposes data members instead of providing any substantial functionality. This means that related data and behavior are not in the same scope, which indicates poor data-functionality proximity.

- Tradition Breaker: refers to a class that breaks the interface inherited from a base class or an interface. A class can do this by reducing the visibility of the services published by the base class, by means of private/protected inheritance (in C++).
CONCLUSION
As future work, we are working on a GASPI/GPI (Global Address Space Programming Interface) integration within ExaShark [11]. For optimizing stencil code implementations, stencil compilers such as Patus [10] will be integrated into ExaShark.
ACKNOWLEDGMENT
We thank Dr. Pascal Costanza from the ExaScience Lab Belgium and Intel Health & Life Sciences for his help.
REFERENCES
1. ARMCI. http://hpc.pnl.gov/armci/documentation.htm.
4. Global Arrays. http://hpc.pnl.gov/globalarrays.
5. InCode Helium. https://www.intooitus.com/products/incode.
14. Kamil, S., Chan, C., Oliker, L., Shalf, J., and Williams, S. An auto-tuning framework for parallel multicore stencil computations. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, IEEE (2010), 1-12.