
Novice Insights

SPMD Message Passing Broadcast on

Mircea-Valeriu ULINIC1, Omid SHAHMIRZADI2, André SCHIPER2
Technical University of Cluj-Napoca, Romania
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

The advent of manycore architectures raises new scalability challenges for concurrent
applications. Implementing scalable data structures is one of them. Several manycore
architectures provide hardware message passing as a means to efficiently exchange data
between cores. In this paper we study the implementation of high-throughput, low latency
broadcast algorithms in message-passing manycores. The model is validated through
experiments on a 36-core TILE-Gx8036 processor. Evaluations show that an efficient
implementation of the algorithms can maximize the number of messages exchanged
and reduce the delay.


It has become increasingly clear that superior performance cannot be achieved only by
increasing the CPU frequency, since that also requires more efficient cooling systems and
leads to higher power consumption. The need to keep raising the number of operations per
second while remaining power-efficient has caused processor manufacturers to move in the
direction of multi- and many-core architectures [1]. A many-core chip is built by
interconnecting a large number of cores through a powerful NoC (Network on Chip).
Nowadays, chips with hundreds of cores are already available, while chips with thousands of
cores are still under development.
The main issues of many-core systems stem from the overhead of hardware cache
coherence [2], which can be avoided by implementing one of the following alternatives: (i)
sticking to the shared memory paradigm, but managing data coherence in software [3], or (ii)
adopting message passing as the new communication paradigm.
Message passing is anonymous one-sided communication wherein any core can write
into the instruction or data memory of any other core (including itself) [2].
The natural choice to program a high-performance message-passing system is to use Single
Program Multiple Data (SPMD) algorithms. This paper focuses on the broadcast primitive (one-to-all), deployed under the previous considerations, on a 36-core chip from Tilera.



The Tilera TILE-Gx36 Processor

TILE-Gx™ is a multicore processor family of Tilera Corporation [5], produced by TSMC
[6] using 40nm technology. It is based on 64-bit cores offering high performance for both
control plane and data plane processing. In this section we briefly describe the TILE-Gx36
architecture and inter-core communication.
a) Architecture: As shown in Figure 1, this processor consists of a 2-dimensional array of
identical cores, called tiles, delivering more than 100 billion 64-bit operations per
second while consuming less than 500mW of power each. Each tile consists of a
3-wide-issue processor core with translation lookaside buffers (TLBs), L1 and L2 caches, and
interfaces to the iMesh networks. Each tile can independently run a full operating
system, or a group of tiles can run a multi-processing OS, like SMP Linux. The
TILE-Gx instruction set architecture (ISA) uses a 64-bit instruction word to encode up to three
RISC operations per cycle. The ISA provides a rich set of single-instruction, multiple-data
(SIMD) operations that support packed 8x8-bit, 4x16-bit, and 2x32-bit operations
such as multiply-and-add and sum-of-absolute-differences. At a nominal clock of
1.2GHz, the device delivers over 200 billion 16-bit multiply-accumulates (MACs) per
second and over 400 billion 8-bit MACs per second [7]. Each tile contains a 2-way
set-associative 32KB instruction cache and a 2-way set-associative 32KB data cache. Both
caches are backed by a unified 8-way, 256KB second-level cache with Single Error
Correction Double Error Detection (SECDED) ECC protection. The second-level cache
subsystem supports up to 8 outstanding cache line misses from each tile.

Figure 1: TILE-Gx36 block diagram


b) Inter-core communication: All on-chip communication occurs via multiple point-to-point
intelligent Mesh (iMeshTM) networks. The iMesh interconnect provides low latency,
high bandwidth communication to all on-chip components, including tiles and
accelerators. Each tile contains an identical iMesh switch block that connects the tile to
its immediate north, east, south and west neighbors. The tile's iMesh interface also
connects to the L2 cache pipeline at every tile for sinking and sourcing traffic. All
networks are identical at the flow control level. Messages on the network are represented
as packets and are divided into units the width of the network (called flits), plus a
header flit to specify the route for the packet. The hop latency is a single cycle: packets
are routed at the rate of one flit per cycle through the network. The route a packet
takes is determined at the source, and wormhole routing minimizes the link-level
buffering requirements. iMesh packets move from source tile to the destination tile while
traversing the minimum amount of wire. This leads directly to lower power and lower
latency compared to ring or bus implementations [7].

MPB Algorithms

When it comes to broadcast algorithms, there are multiple choices, such as
RCCE_common [8], RCKMPI [3] or OC-Bcast [4], but none of them is suitable for a
grid processor architecture. Thus, we have considered Algorithm 1, based on five types of trees
(each node representing one CPU), as follows.
Algorithm 1. Generic Tree Based Algorithm (for one chunk)
if root then
  if children_can_receive_chunk() then
    send_chunk()
  end if
else
  if chunk_available() then
    if children_can_receive_chunk() then
      send_chunk()
    end if
  end if
end if

Flat Tree (FT)

This is the most intuitive algorithm: one core (called the root) that has data to
be shared with all the others simply delivers the information directly. Thereby, the
tree has only two levels: the first level consists of the root, which has N-1 leaves
(considering N the maximum number of used cores).


Novice Insights

Balanced Binary Tree (BBT)

In this case we have considered a binary tree, balancing the two halves of the
tree as much as possible, depending only on the number N. Once the tree is built, the IDs
of the nodes from the 2nd level downwards are assigned in order from 0 to N-1, skipping
ROOT_ID, which is also the ID of the root of the tree.


Clever Balanced Binary Tree (CBBT)

Similarly to the previous case, the tree is built as balanced as possible, but completed
with the IDs of the children nearest to the parent.


Minimum Spanning Tree (MST)

This algorithm is based on the idea of delivering each chunk of information only to
direct neighbors of every CPU. If we suppose the root is in the middle of the grid
(e.g. CPU#14), it has four children (direct neighbors to the right, bottom, left and
top, not necessarily in this order). Its children will resend the chunk to at most
three CPUs, and so on. The order of the children is chosen depending on the position
of the root relative to the grid: in the very middle of the grid we can consider the
origin of an X-Y axis system; if the X-coordinate of the root is less than 0, the
horizontal broadcast direction should be primarily to the right, otherwise to the left
of the grid. By the same reasoning, if the Y-coordinate of the root is lower than 0,
the vertical priority should be oriented to the bottom.


Binomial Tree (BT)

The binomial broadcast algorithm is based on a recursive tree. The set of nodes is
divided into two subsets of ceil(N/2) and floor(N/2) nodes. The root, belonging to one
of the subsets, sends the message to one node from the other subset. Then, broadcast is
recursively called on both subsets.
Obviously, the formed tree has log2(N) levels and in each of them the whole
message is sent between the pairs of nodes.

Using the same generic algorithm, the performance actually reflects the effectiveness of
the tree-building algorithms.


We run experiments on a Tilera TILE-Gx8036 processor. We use it as a representative of
current message-passing manycore architectures. Experiments are run with Tilera's
custom Linux kernel. Applications are compiled using GCC 4.4.6 with the -O3 flag. To
implement our algorithms, we use the User Dynamic Network (UDN). In our experiments,
we dedicate one queue to asynchronous messages: an interrupt is generated each time a
new message is available in this queue. Note that the TILE-Gx8036 processor does not
provide support for collective operations. Hence, we implement broadcast as a set of
point-to-point send operations.
Figure 2 describes how we model a point-to-point communication on the TILE-Gx
processor. The figure illustrates the case of a 2-word message transmission using send and recv.
This model is solely based on our evaluations of the communication performance and is only

Novice Insights
valid for small-sized messages. We do not claim that Figure 2 describes the way
communication is actually implemented in the processor.

Figure 2: Point-to-point communication on the TILE-Gx for a 2-word message m

(NI: Network Interface)
The overhead osend of a message of n words includes a fixed cost of 8 cycles associated with
issuing a header packet, plus a variable cost of 1 cycle per word. The overhead orecv of a
message is equal to 2 cycles per word. The header packet is not received at the application
level. The transmission delay L between the sender and the receiver includes some fixed overhead
at the network engines on both the sender and the receiver, plus the latency l associated with
network traversal. The fixed overhead is 10 cycles in total. The latency l depends on the number
of routers on the path from the source to the destination: 1 cycle per router. However, on a
36-core mesh the distance between processes has little impact on the performance. Finally, note
that there is no gap between two consecutive messages sent by the same core.
Four operations are available to deal with data exchange: send, send_n_messages, receive
and receive_n_messages. Operation send(m, i, x) sends message m to thread ti using multiplexer
#x. Note that x can have one of these values: 0 (west), 1 (north), 2 (east) or 3 (south). Operation
send_n_messages(m1, . . . , mn, i, x) sends n messages on MUX#x to CPUi. Messages can be
received using a synchronous receive function. Operation receive(m) blocks until message m
can be received. Alternatively, threads can be interrupted when a new message is available. In a
similar way, receive_n_messages(m1, . . . , mn) blocks until all n messages are received.
In our implementation we use send and receive in both directions: from parent to children
and from children to parent, as follows. MUX0 is used to send genuine information, while
MUX2 is used for receiving acknowledgements. Each time a CPU (except the root) reads a
whole chunk, it sends a notification message upwards, to its parent CPU, on MUX2. The parent
is able to send another chunk only after receiving the acknowledgement. A chunk refers to a
group of consecutive independent messages.
Algorithms are compared in terms of latency and throughput. Latency is defined as the time
elapsed between the moment when root CPU sends the information and the moment when the
last CPU receives it. On the other hand, throughput is a measure of total number of messages
that can be sent during one second. For our experiments, we have considered the latency for
only one chunk, while the throughput for a stream of chunks. Note that, in order to avoid other
processing influences, the latency has been computed as an average of the middle 80% of
values from a set of 1000 experiments.
Due to the limited space of this paper, we present the detailed results only for the most
efficient and for the least efficient algorithm.

The throughput and the latency for the Flat Tree algorithm are depicted in Figure 3, while
for the CBBT in Figure 4. Evaluations have been performed for a number of cores from 2 to 36
increased gradually with a granularity of 1, and a chunk size from 1 to 128 messages, using
powers of 2. The maximum limit of a chunk is dependent on the cache size which is 12Mbytes.
Transmitting data of 64Kbytes, results that we cannot send more than 200 messages one time,
which means that 128 is the highest power of 2 value lower that can be used. One can observe
in Figure 3 that the throughput decreases as the latency increases. Although there is no
direct mathematical relationship between them, it is obvious that the root is able to send another
chunk only after receiving the acknowledgement from the last CPU. Thus, for Flat Tree, when
the chunk size is 1, Throughput = 1/T, where T is the time between two consecutive chunks
sent by the root (the chunk delivery plus the returning acknowledgement). In this case we can
also compute the latency when the number of used cores is 36 as the sum of T1, ..., T35,
where Ti is the cost from the root (here considered CPU 0) to CPUi. The cost can be determined
using the model presented previously. Using 36 cores receiving a chunk of 1 message, the
latency obtained is 1024 cycles. From this value, we can compute the theoretical value of the
throughput, which indeed corresponds to our results, being around 0.6M messages/second.
When the chunk size is increased, the latency and the throughput cannot be determined
theoretically anymore. The reason lies in the UDN routing: when a new routing
command from CPUi to CPUj is issued, the router determines the shortest path using X-Y
routing. This path is cached for an undetermined amount of time. When another request arrives
at the controller, it is compared with the cached ones, if any are available. If there is a cached
path from CPUi to CPUj, the process is much faster. Thus, the more consecutive messages are
sent using the same route, the quicker they are routed. But there is also an upper bound that
cannot be surpassed. That effect can be observed in Figure 4: there is no essential improvement
from a chunk of 64 to 128 messages.

(a) Latency

(b) Throughput

Figure 3: Latency and Throughput for Flat Tree



Figure 4: Latency and Throughput for CBBT


It is quite natural that the throughput decreases as the number of used cores increases. Unlike
FT, the throughput of the CBBT algorithm decreases less smoothly. The first sudden descent is
the transition from 2 used cores (when one CPU transmits and the other one receives) to 3 (one
root with 2 children). Also observe that the differences between the throughput for two
consecutive numbers of used cores are approximately the same, since the measurements are
made from the point of view of the root, which always has the same number of children: 2.
They are not exactly the same, however, because each level of CPUs can receive another chunk
only after receiving the confirmation from below.
Four out of the five algorithms that we have considered have a static structure, independent
of the architecture of the grid. But when it comes to MST, the position of the root matters. For
example, if the root is CPU#0 it has only two children, while a root placed somewhere in the
middle of the grid, such as CPU#14, has four children, etc. This leads to a lower throughput, as
one can see in Figure 5, which presents a comparison between all algorithms, using the
maximum number of cores with a chunk size of 128 messages, for both cases: when the root is
CPU#0 and when it is in the middle. One can observe that CBBT is the most efficient
algorithm, being up to 15% more powerful than BBT.



Figure 5: Latency and Throughput for all algorithms. Chunk size=128 messages, number of
used cores=36. Root can be CPU#0 or in the middle of the grid
More similar algorithms can be implemented, keeping in mind to ensure a minimum
number of children in order to enhance the throughput, and also a low number of levels, in
order to minimize the latency. This trade-off can be avoided by designing special algorithms
either just for throughput, without taking the latency into account, or the reverse. There is also
a threshold on the minimum number of levels of the tree which must be considered in order to
avoid an FT-type situation.


The paper studies the implementation of broadcast algorithms in message-passing
manycores. Using a communication model, it compares the performance of the tree-building
algorithms used to facilitate the broadcast. A Tilera TILE-Gx8036 processor is used to validate
the model and serves as a baseline for the evaluations. The results show that a more mindful,
though more complicated, manner of building the broadcast tree leads to higher efficiency,
also lowering the delay. Based on our experiments, we have identified the Clever Balanced
Binary Tree as the most performant of our set of algorithms. However, since its computation is
more sophisticated, we recommend it for applications with a fixed set of CPUs; otherwise, the
Balanced Binary Tree is good enough.


[1] Shekhar Borkar, Thousand core chips: a technology perspective, Proceedings of the 44th Annual
Design Automation Conference, DAC '07, New York, NY, USA, 2007, pp. 746-749.
[2] Timothy Mattson, Rob Van der Wijngaart, Michael Frumkin, Programming the Intel 80-core
network-on-a-chip terascale processor, Proceedings of the 2008 ACM/IEEE Conference on
Supercomputing, SC '08, Art. no. 38, IEEE Press, Piscataway, NJ, USA, 2008.
[3] Isaías Comprés Ureña, Michael Riepen, Michael Konow, RCKMPI - lightweight MPI
implementation for Intel's Single-chip Cloud Computer (SCC), Proceedings of the 18th European MPI
Users' Group Conference on Recent Advances in the Message Passing Interface, EuroMPI '11,
Springer-Verlag, Berlin, Heidelberg, 2011, pp. 208-217.
[4] Darko Petrović, Omid Shahmirzadi, Thomas Ropars, André Schiper, High-performance RMA-based
broadcast on the Intel SCC, Proceedings of the 24th Annual ACM Symposium on Parallelism in
Algorithms and Architectures, SPAA '12, New York, NY, USA, 2012, pp. 121-130.
[5] Tilera Corporation, www.tilera.com.
[6] Taiwan Semiconductor Manufacturing Company, www.tsmc.com.
[7] Tilera Corporation, Architecture and Performance of the TILE-Gx Processor Family, available on
www.tilera.com, 2013.
[8] Ernie Chan, RCCE_comm: a collective communication library for the Intel Single-chip Cloud
Computer, available on https://communities.intel.com/docs/DOC-5663, 2010.

Mircea-Valeriu Ulinic received his B.Sc. in Telecommunications Technologies and
Systems and is currently studying at the Technical University of Cluj-Napoca, expecting to
receive his M.Sc. in Telecommunications in July 2015. He is 23 years old, with a great interest
in programming software solutions for the world's needs. During the summer of 2013 he
completed a two-month internship at the Distributed Systems Laboratory led by André
Schiper, École Polytechnique Fédérale de Lausanne.

Mircea-Valeriu ULINIC, student

Technical University of Cluj-Napoca
Faculty of Electronics, Telecommunications and Information Technology
Memorandumului 28, 40114, Cluj-Napoca, ROMANIA
E-mail: mircea.ulinic@epfl.ch / mirceaulinic@student.utcluj.ro
Manuscript received on July 8, revised on September 7, 2014