Abstract
The advent of manycore architectures raises new scalability challenges for concurrent applications, and implementing scalable data structures is one of them. Several manycore architectures provide hardware message passing as a means to efficiently exchange data between cores. In this paper we study the implementation of high-throughput, low-latency broadcast algorithms on message-passing manycores. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that an efficient implementation of the algorithms can maximize the number of messages exchanged and reduce the delay.
Introduction
It is increasingly clear that higher performance cannot be achieved solely by raising the CPU frequency, since that also requires more powerful cooling systems and leads to higher power consumption. The need to keep power in check while still raising the number of operations per second has pushed processor manufacturers toward multi- and many-core architectures [1]. A many-core chip is built by interconnecting a large number of cores through a powerful NoC (Network on Chip). Nowadays, chips with hundreds of cores are already available, while chips with thousands of cores are still under development.
The main issue with many-core systems is the overhead of hardware cache coherence [2], which can be avoided by implementing one of the following alternatives: (i) sticking to the shared-memory paradigm, but managing data coherence in software [3], or (ii) adopting message passing as the new communication paradigm.
Message passing is anonymous one-sided communication wherein any core can write into the instruction or data memory of any other core (including itself) [2].
The natural choice for programming a high-performance message-passing system is to use Single Program Multiple Data (SPMD) algorithms. This paper focuses on the broadcast primitive (one-to-all), deployed under the previous considerations on a 36-core chip from Tilera.
Novice Insights
b) Inter-core communication: All on-chip communication occurs via multiple point-to-point intelligent Mesh (iMesh™) networks. The iMesh interconnect provides low-latency, high-bandwidth communication to all on-chip components, including tiles and accelerators. Each tile contains an identical iMesh switch block that connects the tile to its immediate north, east, south, and west neighbors. The tile's iMesh interface also connects to the L2 cache pipeline at every tile for sinking and sourcing traffic. All networks are identical at the flow-control level. Messages on the network are represented as packets and are divided into units the width of the network (called flits), plus a header flit that specifies the route for the packet. The hop latency is a single cycle: packets are routed at the rate of one flit per cycle through the network. The route a packet takes is determined at the source, and wormhole routing minimizes the link-level buffering requirements. iMesh packets move from the source tile to the destination tile while traversing the minimum amount of wire. This leads directly to lower power and lower latency compared to ring or bus implementations [7].
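The single-cycle hops and dimension-ordered routing described above can be illustrated with a toy cost model (an illustrative sketch in Python, not Tilera's actual timing; the 6x6 grid layout and the flit counts are assumptions on our part):

```python
# Toy model of X-Y (dimension-ordered) routing on a 6x6 mesh.
# Per the text, hop latency is one cycle; everything else is assumed.
GRID_W = 6  # 36 tiles arranged as a 6x6 grid (assumption)

def xy_hops(src, dst):
    """Mesh hops between two tiles under X-Y routing: travel fully
    along X first, then along Y (i.e. the Manhattan distance)."""
    sx, sy = src % GRID_W, src // GRID_W
    dx, dy = dst % GRID_W, dst // GRID_W
    return abs(dx - sx) + abs(dy - sy)

def packet_latency(src, dst, flits):
    """With wormhole routing at one flit per cycle per hop, the head
    flit pays one cycle per hop and the remaining flits of the packet
    pipeline behind it."""
    return xy_hops(src, dst) + flits

if __name__ == "__main__":
    print(xy_hops(0, 35))            # corner to corner: 5 + 5 = 10 hops
    print(packet_latency(0, 35, 4))  # 10 hops + 4 flits = 14 cycles
```

Because the route is fixed at the source, this cost depends only on the endpoints and the packet length, which is what makes a per-link cost model of the chip tractable.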
MPB Algorithms
When it comes to the broadcast algorithms, there are multiple choices such as
RCCE_common [8] as well as RCKMPI [3] or OC-Bcast [4], but none of them are suitable for a
grid processor architecture. Thus, we have considered Algorithm 1, based on five types of trees
(each node representing one CPU), as follows.
Algorithm 1. Generic tree-based algorithm (for one chunk)
if root then
    if children_can_receive_chunk() then
        empty_buffers()
        send_chunk()
    end if
else
    if chunk_available() then
        read_chunk()
        send_ack_to_parent()
        if children_can_receive_chunk() then
            empty_buffers()
            send_chunk()
        end if
    end if
end if
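The behavior of Algorithm 1 can be exercised with a minimal single-threaded simulation (a sketch in Python; acknowledgements and buffer management are abstracted away, and all the names below are ours, not from the paper's implementation):

```python
# Single-threaded simulation of the generic tree broadcast for one chunk.
# Real code on the chip would use hardware message passing; here
# "sending" is just queueing (node, data) pairs for processing.
def broadcast_chunk(tree, chunk, root=0):
    """tree maps each node to its list of children. Returns the order
    in which non-root nodes received (read) the chunk."""
    received = []
    pending = [(root, chunk)]  # root starts with the chunk in hand
    while pending:
        node, data = pending.pop(0)
        if node != root:
            received.append(node)  # read_chunk(); ack is implicit here
        for child in tree.get(node, []):
            pending.append((child, data))  # send_chunk() to each child
    return received

if __name__ == "__main__":
    # Flat tree: root 0 with every other CPU as a direct child.
    flat = {0: list(range(1, 6))}
    print(broadcast_chunk(flat, "chunk"))  # [1, 2, 3, 4, 5]
```

The simulation delivers the chunk level by level, which mirrors the fact that in the real algorithm a node can forward a chunk only after it has read it from its parent.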
Since all variants share the same generic algorithm, the measured performance actually reflects the effectiveness of the tree-building algorithms.
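Two of the simpler tree shapes (the Flat Tree and a balanced binary tree) can be sketched as follows; this is only an illustration of tree building in general, not a reproduction of the paper's five actual construction algorithms:

```python
def flat_tree(n):
    """FT: the root (CPU 0) is the parent of every other CPU."""
    return {0: list(range(1, n))}

def binary_tree(n):
    """A balanced-binary-tree shape: CPU i has children 2i+1 and 2i+2
    (heap layout). The paper's BBT/CBBT construction may differ."""
    tree = {}
    for i in range(n):
        kids = [c for c in (2 * i + 1, 2 * i + 2) if c < n]
        if kids:
            tree[i] = kids
    return tree

def depth(tree, root=0):
    """Number of levels below the root; latency grows with depth,
    while a node's fan-out limits its sending throughput."""
    kids = tree.get(root, [])
    return 0 if not kids else 1 + max(depth(tree, k) for k in kids)

if __name__ == "__main__":
    print(depth(flat_tree(36)))    # 1 level, but the root has 35 children
    print(depth(binary_tree(36)))  # 5 levels, but at most 2 children each
```

The two extremes already show the trade-off the evaluation explores: the flat tree minimizes depth at the cost of root fan-out, and the binary tree does the opposite.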
Experiments
valid for small-sized messages. We do not claim that Figure 2 describes the way communication is actually implemented in the processor.
The throughput and the latency for the Flat Tree algorithm are depicted in Figure 3, and for the CBBT in Figure 4. Evaluations have been performed for a number of cores from 2 to 36, increased gradually with a granularity of 1, and for chunk sizes from 1 to 128 messages, using powers of 2. The maximum chunk size is limited by the cache size, which is 12 Mbytes. When transmitting 64-Kbyte data, we cannot send more than 200 messages at one time, which means that 128 is the highest usable power of 2. One can observe in Figure 3 that the throughput decreases as the latency increases. Although there is no closed-form relationship between them in general, it is obvious that the root is able to send another chunk only after receiving the acknowledgement from the last CPU. Thus, for the Flat Tree, and only when the chunk size is 1, we have throughput = 1/latency. In this case we can also compute the latency when the number of used cores is 36 as latency = T_1 + T_2 + ... + T_35, where T_i is the cost from the root (here considered CPU 0) to CPU_i. The cost can be determined using the model presented previously. Using 36 cores receiving a chunk of 1 message, the latency obtained is 1024 cycles. Having this value, we can compute the theoretical value of the throughput, which indeed corresponds to our results, being around 0.6M messages/second.
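As a numeric illustration of the Flat Tree reasoning above (the per-CPU costs T_i below are made-up values chosen only to land near the reported 1024 cycles, not measurements from the TILE-Gx8036):

```python
# Flat Tree, chunk size 1: the root may send the next chunk only after
# the last CPU has acknowledged, so the chunk latency is the sum of the
# per-CPU costs and the throughput (in chunks per cycle) is its inverse.
def flat_tree_latency(costs):
    """costs[i] is T_i, the cost from the root to CPU_i."""
    return sum(costs)

if __name__ == "__main__":
    costs = [29] * 35  # hypothetical T_i values for the 35 receivers
    lat = flat_tree_latency(costs)
    print(lat, "cycles per chunk")       # 1015, close to the paper's 1024
    print(1.0 / lat, "chunks per cycle")
```

Converting chunks per cycle into messages per second additionally requires the clock frequency, which is why the text derives the throughput figure from the measured 1024-cycle latency.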
When the chunk size is increased, the latency and the throughput can no longer be determined theoretically. The reason lies in the UDN-type routing: when a new routing command from CPU_i to CPU_j is issued, the router determines the shortest path using X-Y routing. This path is cached for an undetermined amount of time. When another request arrives at the controller, it is compared with the cached ones, if any are available. If there is a cached path from CPU_i to CPU_j, the process is much faster. Thus, the more consecutive messages are sent using the same route, the quicker they are routed. But there is also an upper bound that cannot be surpassed. That effect can be observed in Figure 4: there is no essential improvement from a chunk of 64 to 128 messages.
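The caching effect can be mimicked with a toy cost model (purely illustrative; the controller's real cache policy and the cost constants are assumptions, since they are not public at this level of detail):

```python
# Toy single-entry route cache: a cache miss pays a route-computation
# penalty, a hit pays a smaller fixed cost. The hit cost is the upper
# bound on the speedup: runs of same-route messages cannot beat it.
def routed_cost(routes, miss_cost=10, hit_cost=4):
    """routes is a list of (src, dst) pairs in send order."""
    cached, total = None, 0
    for r in routes:
        total += hit_cost if r == cached else miss_cost
        cached = r
    return total

if __name__ == "__main__":
    same = [(0, 14)] * 8                 # eight messages on one route
    mixed = [(0, i) for i in range(8)]   # eight distinct routes
    print(routed_cost(same))   # 10 + 7*4 = 38
    print(routed_cost(mixed))  # 8 * 10 = 80
```

This reproduces the qualitative behavior in the text: longer runs on one route amortize the miss cost, but the per-message cost saturates at the hit cost, matching the plateau between chunks of 64 and 128 messages.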
Figure 3: (a) Latency and (b) Throughput for the Flat Tree algorithm
Figure 4: (a) Latency and (b) Throughput for the CBBT algorithm
It is quite natural that the throughput decreases as the number of used cores increases. Unlike with FT, the throughput of the CBBT algorithm decreases less abruptly. The first sudden drop is at the transition from 2 used cores (one CPU transmits and the other receives) to 3 (one root with 2 children). Also observe that the difference between the throughputs for two consecutive numbers of used cores is approximately the same, since the measurements are made from the point of view of the root, which always has the same number of children (2); the values are not exactly the same, however, because each level of CPUs can receive another chunk only after receiving the confirmation from the level below.
Four out of the five algorithms that we have considered have a static structure, independent of the layout of the grid. But when it comes to the MST, the choice of root matters. For example, if the root is CPU 0 it has only two children, while a root placed somewhere in the middle of the grid, such as CPU#14, has four children, and so on. This leads to a lower throughput, as one can see in Figure 5, which presents a comparison between all the algorithms using the maximum number of cores and a chunk size of 128 messages, for both the case when the root is CPU 0 and when it is in the middle. One can observe that CBBT is the most efficient algorithm, being up to 15% more efficient than BBT.
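The dependence of the MST root's fan-out on its grid position can be checked with a quick neighbor count (a sketch assuming the 6x6 layout; the paper's MST construction itself is not reproduced):

```python
GRID_W = 6  # 36 tiles as a 6x6 grid (assumption)

def mesh_neighbors(cpu):
    """Immediate north/east/south/west neighbors of a tile. An MST
    rooted at `cpu` can have at most this many children at the root."""
    x, y = cpu % GRID_W, cpu // GRID_W
    out = []
    if x > 0:
        out.append(cpu - 1)
    if x < GRID_W - 1:
        out.append(cpu + 1)
    if y > 0:
        out.append(cpu - GRID_W)
    if y < GRID_W - 1:
        out.append(cpu + GRID_W)
    return out

if __name__ == "__main__":
    print(len(mesh_neighbors(0)))   # corner root: 2 children
    print(len(mesh_neighbors(14)))  # middle root (CPU#14): 4 children
```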
Figure 5: (a) Latency and (b) Throughput for all algorithms. Chunk size = 128 messages, number of used cores = 36. The root is either CPU 0 or in the middle of the grid.
More algorithms of this kind can be implemented, keeping in mind that a small number of children per node enhances the throughput, while a small number of levels minimizes the latency. This trade-off can be avoided by designing special-purpose algorithms optimized either for throughput alone, without taking the latency into account, or the reverse. There is also a threshold on the minimum number of tree levels which must be considered in order to avoid an FT-type situation.
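The depth-versus-fan-out trade-off just described can be quantified for complete k-ary trees (an illustrative calculation; the constants relating depth to latency and fan-out to throughput on the actual chip are not modeled):

```python
def kary_depth(n, k):
    """Number of levels of a complete k-ary tree over n nodes.
    Latency grows with the depth, while each node's fan-out k
    limits its sending throughput."""
    depth, covered = 0, 1
    while covered < n:
        covered += k ** (depth + 1)
        depth += 1
    return depth

if __name__ == "__main__":
    for k in (1, 2, 4, 35):
        print(k, kary_depth(36, k))
    # k=35 is the FT-type extreme: depth 1 but maximal fan-out;
    # k=1 is a chain: minimal fan-out but depth 35.
```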
Conclusions
References
[1] Shekhar Borkar, Thousand core chips: a technology perspective, Proceedings of the 44th Annual Design Automation Conference, DAC '07, New York, NY, USA, 2007, pp. 746-749.
[2] Timothy Mattson, Rob Van der Wijngaart, Michael Frumkin, Programming the Intel 80-core network-on-a-chip terascale processor, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, Art. no. 38, IEEE Press, Piscataway, NJ, USA, 2008.
[3] Isaías A. Comprés Ureña, Michael Riepen, Michael Konow, RCKMPI - lightweight MPI implementation for Intel's single-chip cloud computer (SCC), Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface, EuroMPI '11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 208-217.
[4] Darko Petrović, Omid Shahmirzadi, Thomas Ropars, André Schiper, High-performance RMA-based broadcast on the Intel SCC, Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, New York, NY, USA, 2012, pp. 121-130.
[5] Tilera Corporation, www.tilera.com.
[6] Taiwan Semiconductor Manufacturing Company, www.tsmc.com.
[7] Tilera Corporation, Architecture and Performance of the TILE-Gx Processor Family, available on www.tilera.com, 2013.
[8] Ernie Chan, RCCE_comm: a collective communication library for the Intel Single-chip Cloud Computer, available on https://communities.intel.com/docs/DOC-5663, 2010.
Biography
Mircea-Valeriu Ulinic received a B.Sc. in Telecommunications Technologies and Systems and is currently studying at the Technical University of Cluj-Napoca, expecting to receive an M.Sc. in Telecommunications in July 2015. He is 23 years old, with a great interest in developing software solutions for real-world needs. During the summer of 2013 he completed a two-month internship at the Distributed Systems Laboratory led by André Schiper at École Polytechnique Fédérale de Lausanne.