Neighborhood Communications
Torsten Hoefler
Timo Schneider
¹Using the MPI-1.0 graph topology is possible but not recommended due to its scalability issues.
Abstract—Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at runtime and enable optimizations similar to traditional collective calls. We show a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that provide additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations result in a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules that are comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, with optimization methods for collective communications. We expect that our methods will influence the design of future MPI libraries and provide a significant performance benefit on large-scale systems.
I. INTRODUCTION
Fig. 1. Send and receive buffers of process 0 for neighborhood (a) allgather and (b) alltoall on a 2D Cartesian topology (scatter/bcast view).
int dims[2] = {2, 3}, periods[2] = {1, 1};
MPI_Comm comm_cart;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
int sbuf[1], rbuf[4];
MPI_Neighbor_allgather(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, comm_cart);
1) We classify different levels of persistence (static properties) of collective communications in parallel codes and propose a mechanism to specify this knowledge in a portable way to the runtime library.
2) We develop an open-source execution and optimization framework for arbitrary nonblocking neighborhood collectives.
3) We propose and evaluate different optimization techniques for arbitrary collective operations to improve performance.
4) We demonstrate the applicability of our approach with a wide set of benchmarks and applications.
II. PERSISTENCE LEVELS IN COLLECTIVE COMMUNICATION
Numerous optimizations in different communication models
(cf. [9], [10], [11] among many others) have been proposed
to optimize predefined collective communications with known
static patterns. Common techniques are to schedule communications in order to avoid congestion and to use message
coalescing and forwarding through proxy processes to reduce
injection rate limitations.
We now describe how those techniques can be used dynamically, i.e., at runtime, to optimize arbitrary neighborhood communications. As with most optimizations, being able to make additional assumptions about the problem that constrain the optimization space leads to higher optimization potential.
For neighborhood collectives, such assumptions can be made
about the communication topology, sizes, and buffer access.
We define three persistence levels that indicate how fixed
each of the parameters is across iterations.
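One plausible encoding of these levels, assuming they are cumulative (each stronger level implies all guarantees of the weaker ones; the names are illustrative, not part of the MPI standard):

```c
#include <assert.h>

/* Illustrative persistence levels for a neighborhood collective
 * (hypothetical names): each level includes all guarantees of the
 * weaker ones. */
enum persistence_level {
    PERSIST_NONE     = 0, /* nothing fixed across iterations   */
    PERSIST_TOPOLOGY = 1, /* communication graph is fixed      */
    PERSIST_SIZES    = 2, /* ...and message sizes are fixed    */
    PERSIST_BUFFERS  = 3  /* ...and buffer addresses are fixed */
};

/* A stronger persistence level permits every optimization that a
 * weaker one permits. */
static int allows(enum persistence_level have, enum persistence_level need) {
    return have >= need;
}
```

Under this encoding, a runtime can gate each optimization on the minimum level it requires.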
The user can communicate persistence properties of their application by passing special info keys that guide the optimization of neighborhood collectives. This scheme is completely transparent and portable within the MPI standard.
If an implementation does not understand an info key, then
the key is ignored and the code executes as if the key was
not specified. Thus, info keys can be used to specify hints
about the code but they cannot change the structure of the
communication. We point out that users are not allowed to
lie with specified info keys, i.e., if a user specifies a certain
code property, the code must act within the specified limits.
Also, info keys for MPI topologies are collective, i.e., all
processes have to specify the same set of keys for the topology
construction.
If an application has multiple repeating communication
steps with different sizes or collectives, the user can create
one topology communicator for each call-site and provide
the highest optimization options. Creating a communicator for
each set of arguments may seem expensive, however, we point
out that communicators that only differ in their info arguments
can be stored efficiently with simple lossless compression
techniques with a constant (O(1)) time and space overhead.
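A minimal sketch of this storage scheme, with hypothetical structure names: all communicators derived for different call-sites share one reference to the topology state and keep only a compact encoding of their info keys, so each variant costs O(1) extra space.

```c
#include <assert.h>

/* Sketch: communicators that differ only in their info arguments can
 * share all topology state and store just the per-call-site hints.
 * Names are illustrative, not the actual library implementation. */
struct topo_state { int nneighbors; /* ...graph, schedules, ... */ };

struct comm_variant {
    struct topo_state *shared; /* shared topology and schedule state */
    unsigned info_flags;       /* compact encoding of the info keys  */
};

static struct comm_variant derive(struct topo_state *base, unsigned flags) {
    struct comm_variant c = { base, flags };
    return c; /* constant time and space: no topology is copied */
}
```

Deriving a new variant therefore never copies the (potentially large) neighborhood graph.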
Our optimizations target networks that are able to perform remote direct memory access (RDMA). This type of network has recently gained high popularity due to its
The subtraction of L/(o+(k-1)G) in the second step exists because process pm balances its excess load to ps, which in turn can only start sending after it received the message (latency L), while process pm can continue sending L/(o+(k-1)G) messages during that time. This algorithm can be used for arbitrarily
large k because it does not increase the total communication
volume if forwarding processes are neighbors of the source.
The example shown in Figure 4 also illustrates a valid allgather optimization (messages 1, 3, and 5 are identical in this case).
The parameter t is generally architecture-dependent. We
chose t = 2 for all our experiments. Both heuristics converge rather quickly in practice, e.g., for the hotspot pattern
in O(log P ) steps and each step has a cost of O(log P )
(allreduce). Thus, our balancing algorithm is scalable to large
process counts. However, the resulting tree may not be optimal
in the LogGP model. Optimal tree configurations (cf. [14]) are
an interesting direction for future work.
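The fraction L/(o+(k-1)G) can be read directly in the LogGP model: injecting one k-byte message costs o+(k-1)G, so that many further messages fit into one network latency L. A minimal sketch of this quantity (our illustration, not the paper's implementation):

```c
#include <assert.h>

/* LogGP model: after injecting one message, a process can inject a
 * further k-byte message every o + (k-1)*G time units (send overhead
 * plus per-byte gap). L/(o+(k-1)G) such messages fit into one network
 * latency L. Illustrative helper, not the paper's code. */
static int msgs_during_latency(double L, double o, double G, int k) {
    return (int)(L / (o + (k - 1) * G));
}
```

For example, with L = 10, o = 1, G = 1, and k = 2, five messages fit into one latency window.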
Fig. 4. Tree transformation of a neighborhood communication (messages 1, 3, and 5 are identical).
calls to the same neighborhood collective on this communicator are the same, then the user may set the info argument cdag_static_buffers to true. Static buffer addresses allow optimizations similar to those for persistent point-to-point messages; however, those optimizations can now be applied in a collective manner, taking advantage of the knowledge about the full communication schedule.
1) Static RDMA Regions: Static buffers allow the buffer memory to be registered with the network interface at the first call. The access keys (RDMA remote access handles) are then exchanged along all communication edges, and all the information is saved for future calls to the same function. The next call then simply fires off the communications, for which all RDMA descriptors are already prepared. Direct memory access to remote nodes also enables advanced synchronization protocols.
Having static information about the communication buffers
enables us to apply static protocol optimizations. Every
RDMA communication requires two synchronizations: (1)
Ready to Receive (RTR), where the receiver needs to indicate
that the buffer is ready for remote writes (i.e., the receiver
entered the collective communication function) and (2) Ready
to Exit (RTE), where the sender notifies the receiver that the data communication is finished and the receive buffer is in a consistent state. Figure 5(a) illustrates the state-of-the-art point-to-point RDMA synchronization scheme [12]. A collective communication function can return when a process has received RTEs from all neighbors and finished its own sends.
2) Collective RDMA RTR Protocol: The RTR and RTE
protocols can be optimized separately. The RTR protocol can
be performed collectively, e.g., a global barrier may be used at
the beginning of the collective call to communicate RTR. This
may be a good strategy if the machine offers fast (O(1) time)
global barrier support, like BG/L and BG/P [15]. However, if
the barrier is implemented with point-to-point messages, this
will add Θ(log P) communication overhead.
Instead of a global barrier, we propose a collective RTR protocol on each process neighborhood, using neighborhood RTR trees with each sender as root. This protocol combines RTR messages in a tree shape and can reduce the RTR time from Θ(k) to O(log k) for a process with k neighbors. The desired degree of the tree depends on the system parameters (L/o as discussed before). Figure 5(b) shows a direct comparison between the traditional model (with four RTR messages going to the root) and the optimized tree model (with two RTR messages arriving at the root).
Fig. 5. Illustration of our collective tree RDMA protocol. RTE was omitted
from this figure for readability because it would simply invert the RTR arrows.
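To make the Θ(k) versus O(log k) claim concrete, the number of combining rounds of a d-ary RTR reduction tree over k neighbors can be sketched as follows (an illustrative model; the actual tree degree is derived from L/o as discussed above):

```c
#include <assert.h>

/* Number of combining rounds needed to gather RTR notifications from
 * k neighbors through a d-ary reduction tree (d >= 2): ceil(log_d(k)).
 * With the flat scheme, the root instead receives k separate messages. */
static int rtr_rounds(int k, int d) {
    int rounds = 0;
    long reach = 1; /* neighbors covered by a tree of this depth */
    while (reach < k) {
        reach *= d;
        rounds++;
    }
    return rounds;
}
```

For the four-neighbor example of Figure 5(b), a binary tree needs two rounds instead of four direct RTR messages at the root.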
Fig. 6. Point-to-point RDMA synchronization: the flag-based RTR/RTE protocol (left) and the canary protocol, in which the receiver sets a canary value at the last byte of the buffer and tests whether the incoming data has overwritten it (right).
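The canary mechanism from Figure 6 can be sketched in plain C, with the remote RDMA write mocked by a local memcpy. The names are illustrative, and a real protocol must also handle the corner case where the payload's last byte happens to equal the canary value:

```c
#include <assert.h>
#include <string.h>

#define CANARY 0x5A

/* Receiver side: place a canary value in the last byte of the receive
 * buffer before the remote write is expected. */
static void arm_canary(unsigned char *buf, size_t len) {
    buf[len - 1] = CANARY;
}

/* Data arrival is detected when the (in-order) RDMA write has
 * overwritten the canary byte. */
static int data_arrived(const unsigned char *buf, size_t len) {
    return buf[len - 1] != CANARY;
}
```

This check avoids a separate completion flag but relies on the network delivering the last byte last.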
Fig. 7. DAG scheduling with an active queue of send (S) and receive (R) operations.
DAG. The scheduler starts with the DAG and puts all roots
(vertices without dependencies) into the active queue (AQ).
Then the scheduler iterates over the queue and progresses each
operation (vertex) in the active queue and checks it for completion. Upon completion of a vertex, the scheduler removes
all dependencies on this vertex and starts all vertices with no
remaining dependencies. Using this scheme, the scheduler is
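The active-queue scheme above can be sketched as follows. This is a simplified model in which progressing an operation completes it immediately, whereas the real scheduler repeatedly tests nonblocking operations; the names and array encoding are illustrative:

```c
#include <assert.h>

/* Sketch of the active-queue DAG scheduler: deps[v] counts unfinished
 * dependencies of vertex v; out[v]/nout[v] list its dependents.
 * Returns the number of executed vertices. */
#define MAXV 64

static int run_dag(int nv, int deps[], int out[][MAXV], int nout[]) {
    int aq[MAXV], head = 0, tail = 0, done = 0;
    for (int v = 0; v < nv; v++)            /* roots: no dependencies */
        if (deps[v] == 0) aq[tail++] = v;
    while (head < tail) {                   /* iterate over active queue */
        int v = aq[head++];
        done++;                             /* vertex completed */
        for (int i = 0; i < nout[v]; i++)
            if (--deps[out[v][i]] == 0)     /* start newly ready vertices */
                aq[tail++] = out[v][i];
    }
    return done;
}
```

On a diamond-shaped DAG, the scheduler executes the root first, then both middle vertices, then the join vertex.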
None of those sparse communication patterns can be represented by MPI's predefined collective functions. Thus, we use the most common way to implement such exchanges: post all receives nonblocking (MPI_Irecv), start all sends nonblocking (MPI_Isend), and wait for all sends and receives to complete (MPI_Waitall). We confirmed that this is the fastest method to implement such exchanges on our test system.
1) Sparse Alltoall Pattern: A sparse alltoall pattern A(s, p) is specified by a data size s and a parameter p. The parameter p (0 ≤ p ≤ 1) indicates the probability that process i sends a message of size s to process j (i.e., there is an edge from i to j in the neighborhood collective), independently of any other process pair. The resulting graphs are essentially random Erdős–Rényi graphs [22].
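A sketch of generating such a pattern (our illustration; not the benchmark's actual generator) draws each ordered process pair independently with probability p:

```c
#include <assert.h>
#include <stdlib.h>

/* Generate the edge set of a sparse alltoall pattern A(s, p) as a
 * directed Erdos-Renyi graph: each ordered pair (i, j), i != j,
 * becomes an edge with probability p. adj is a P x P 0/1 matrix.
 * Returns the number of edges. */
static int gen_pattern(int P, double p, unsigned seed, int *adj) {
    srand(seed);
    int edges = 0;
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++) {
            /* rand()/(RAND_MAX+1.0) is uniform in [0, 1) */
            int e = (i != j) && ((double)rand() / (RAND_MAX + 1.0) < p);
            adj[i * P + j] = e;
            edges += e;
        }
    return edges;
}
```

With p = 1 this degenerates to a dense alltoall (P(P-1) edges); with p = 0 no process communicates.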
Figure 9 shows the alltoall pattern with varying density on 1,024 processes, communicating 16 bytes along each edge. Coloring was used to improve the ordering of messages.
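One simple way to obtain such an ordering (a greedy sketch, not necessarily the heuristic used in our implementation) is to edge-color the communication graph so that edges sharing an endpoint get different colors; each color class then forms a matching that can be scheduled as one contention-free round:

```c
#include <assert.h>
#include <string.h>

#define MAXP 64   /* maximum number of processes (illustrative bound) */
#define MAXC 128  /* maximum number of colors */

/* Greedy edge coloring: assign each communication edge (src, dst) the
 * smallest color not yet used at either endpoint. Edges of one color
 * form a matching, i.e., one round without endpoint contention.
 * Returns the number of rounds (colors) used. */
static int color_edges(int nedges, const int src[], const int dst[],
                       int color[]) {
    static unsigned char used[MAXP][MAXC];
    memset(used, 0, sizeof used);
    int maxc = 0;
    for (int e = 0; e < nedges; e++) {
        int c = 0;
        while (used[src[e]][c] || used[dst[e]][c]) c++;
        color[e] = c;
        used[src[e]][c] = used[dst[e]][c] = 1;
        if (c + 1 > maxc) maxc = c + 1;
    }
    return maxc;
}
```

For a directed triangle 0→1→2→0 every pair of edges shares an endpoint, so three rounds are needed; a two-edge star needs two.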
To evaluate each optimization technique, we analyzed microbenchmark performance with manually crafted communication patterns and various message sizes to show the
impact of each optimization in isolation. We then extracted
several communication patterns from real-world applications
and examined the resulting microbenchmarks. In addition,
we modified one application kernel and a full application to
support neighborhood collective communications.
Our implementation supports nonblocking and blocking
variants of all collectives. However, all experiments were done
with the blocking implementation of neighborhood collectives because transforming applications to use nonblocking
collectives is a more complex task than simply introducing
neighborhood collectives. Evaluating tradeoffs and benefits for
nonblocking neighborhood communication is an interesting direction for future work.
A. Experimental Environment
For our experiments, we use the Blue Waters Test System
(JYC), a single cabinet Cray XE6 (approx. 50 nodes with
1600 Interlagos 2.3-2.6 GHz cores). We use the GNU compiler
version 4.6.2 in the Cray compiler environment version 4.0.46.
B. Relevant Communication Patterns
In this section, we will demonstrate each key optimization
with representative communication patterns. We start with
a sparse alltoall pattern to demonstrate the benefit of message scheduling (coloring) and then show different real-world
Cartesian stencil examples.
Fig. 8. Latency [us] and improvement [%] of neighborhood collectives over Cray MPI.

Fig. 9. Sparse alltoall performance on 1,024 processes: latency [us] and improvement [%] over Cray MPI vs. density [%].

Fig. 10. 2D Cartesian performance: latency [us] and improvement [%] vs. data size [Bytes].

Fig. 11. 4D Cartesian performance: latency [us] and improvement [%] vs. data size [Bytes].

Fig. 12. Performance vs. number of cores (32 to 512).

Fig. 13. Results for the sparse matrix patterns aug2dc, andrews, and tube2 vs. number of processes (100 to 500).