
CNNA 2016, August 23-25, 2016, Dresden, Germany

Cellular Neural Networks for FPGAs with OpenCL


Franz Richter-Gottfried and Dietmar Fey
Chair of Computer Science 3 (Computer Architecture)
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
91058 Erlangen, Germany
Email: {franz.richter-gottfried, dietmar.fey}@fau.de

Abstract: Cellular Neural Networks (CNNs) are an inherently parallel computational model for multiple applications, and they are especially appropriate for image processing tasks. Besides being implemented with analogue electronic circuits, they can be simulated on digital processor architectures like CPUs and GPUs, with the drawback of limited parallelism. FPGAs offer fine-grained parallel execution with low power consumption and are thus attractive for embedded systems like smart cameras, for which it is not possible to use a full-featured CPU or GPU drawing tens or hundreds of watts. The drawback of implementing CNNs on FPGAs to profit from the high performance-to-power ratio is the time-consuming design process with conventional hardware description languages. High-level synthesis, e.g., from OpenCL, eases the process of generating CNNs on FPGAs. By using the OpenCL programming model, the programmer can explicitly express the parallel nature of CNNs in a platform-independent way. To investigate its applicability to CNNs, we compare the execution of an unmodified OpenCL kernel on a recent CPU with an FPGA design generated with Altera's SDK for OpenCL. The results show that, although the CPU is faster, the FPGA solution performs better in terms of energy efficiency and is suitable for smart camera systems.

I. INTRODUCTION

Many applications based on CNNs have been presented, in particular image processing operations, since Chua and Yang introduced CNNs [1]. Originally, CNNs were implemented by analogue circuits, taking advantage of the parallel processing of multiple physically realized CNN cells. Using CPUs for simulation has the problem that only a limited number of arithmetic resources is available, so the cells' computation has to be time-multiplexed. FPGAs offer a higher degree of parallelism than CPUs or even GPUs, but at the cost of a complex and time-consuming design process. High-level synthesis (HLS) may shorten this drastically if the source language is capable of expressing the algorithm's properties. OpenCL, supported by HLS tools, allows and simplifies the expression of an array of CNN cells that operate in parallel and are connected to each other. Our experiments show that there is a loss of performance on the FPGA compared to a realization of CNNs on a CPU. However, when we also consider energy efficiency, the OpenCL design of CNNs offers a reduction of both design time and energy consumption, making it attractive for smart cameras.

The paper is structured as follows. After a short introduction to CNNs in Section II, we outline related work in Section III. In Section IV we briefly introduce OpenCL, followed by a description of an OpenCL kernel implementation of CNNs in Section V. Section VI compares the results achieved on an FPGA with a multi-core CPU. Finally, we conclude the paper with Section VII.

II. CELLULAR NEURAL NETWORKS

CNNs are modeled by cells connected in a regular 2D grid. Only neighboring cells are linked and communicate. Each cell maintains a state $x$ which iteratively evolves over time based on its current state and the feedback $y$ from its neighbors. The change of the cell state is given by (1).

$$\dot{x}_1 = -x_1 + \sum_{k \in N} a_k y_k + \sum_{k \in N} b_k u_k + z \tag{1}$$

The output of a cell is a function of its current state (2).

$$y_k(x_k) = 0.5\,(|x_k + 1| - |x_k - 1|) \tag{2}$$

This basic concept is valid for all CNNs. The actual functionality of a CNN depends on two filtering masks (of size $N$), $a$ and $b$. Mask $b$ is used to convolve the input signal $u$ once, and $a$ is applied to the neighbors' outputs to produce their feedback for computing the cell's new state. This step is applied iteratively until a defined convergence criterion is met. The state is also influenced by a constant bias $z$. The size of the masks may differ among applications, but it strongly influences the number of calculations and memory transfers and thus the execution time on most devices. In general, it is desirable to have smaller masks.

III. RELATED WORK

Several papers deal with software implementations of CNNs on CPUs and GPUs, but also on FPGAs, to take advantage of their flexibility and applicability for embedded applications. To the best of our knowledge, few publications investigate CNNs in OpenCL to compare CPUs and GPUs, but they do not cover FPGAs. Potluri et al. [2] present an OpenCL implementation on CPU and GPU, but they lack an architectural description as well as a detailed performance analysis. Dolan and DeSouza [3] focus on the implementation but without any optimization, resulting in very poor CPU performance. In [4], optimized CPU and GPU implementations are compared, and the GPU outperforms the CPU by a factor of ten. However, our FPGA implementation easily exceeds their CPU performance, which is of course also due to the technical progress of the devices. The authors of [5] and [6] present custom FPGA architectures, but they have to trade flexibility regarding image and filter size for performance and design simplicity. Keeping this flexibility is the advantage of using OpenCL for FPGA design.
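The cell dynamics of Section II, equations (1) and (2), can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, the masks are assumed to be 3x3, cells outside the grid are assumed to contribute 0, and the state is advanced with a forward-Euler step of assumed size h, since the text does not state the integration scheme.

```python
def output(x):
    """Piecewise-linear cell output, equation (2)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def neighborhood_sum(mask, grid, i, j):
    """Weighted sum over the 3x3 neighborhood of cell (i, j).

    Assumption: cells outside the grid contribute 0.
    """
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            if 0 <= r < rows and 0 <= c < cols:
                total += mask[di + 1][dj + 1] * grid[r][c]
    return total


def input_term(b, u, z):
    """Sum of b_k * u_k plus the bias z; computed once per input image."""
    return [[neighborhood_sum(b, u, i, j) + z for j in range(len(u[0]))]
            for i in range(len(u))]


def euler_step(x, a, bu_z, h):
    """One forward-Euler step of equation (1) with assumed step size h."""
    y = [[output(v) for v in row] for row in x]
    return [[x[i][j] + h * (-x[i][j] + neighborhood_sum(a, y, i, j) + bu_z[i][j])
             for j in range(len(x[0]))]
            for i in range(len(x))]
```

A caller would compute `input_term` once and then apply `euler_step` repeatedly until the state stops changing by more than a chosen tolerance, which mirrors the "convolve the input once, then iterate" structure described in Section II.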

ISBN 978-3-8007-4252-3 79 VDE VERLAG GMBH, Berlin, Offenbach



IV. OPENCL

OpenCL defines a library interface for the host, typically a normal CPU, to control devices like GPUs or FPGAs. The algorithm itself, referred to as kernel, is written in the C-style language OpenCL-C. Devices offer compute units (CU), and each of them consists of processing elements (PE) to execute the kernel in parallel. Work is distributed among PEs according to the OpenCL execution model. A work item, which is an entity in the problem space, represents a single instance of the kernel implementing the functionality of a single CNN cell. Multiple work items combined in work groups can be rapidly defined in OpenCL to realize a whole CNN array exchanging data by using fast local memory.

TABLE I
PERFORMANCE OF ONE ITERATION

Device  Image size  Time [ms]  Bandwidth [GB/s]
CPU     512x512          0.28             24.09
CPU     4096x4096       39.94             10.95
FPGA    512x512          3.21              2.13
FPGA    4096x4096      192.80              2.27

TABLE II
FPGA RESOURCES

Component       Total (Percent)
Logic Elements  71484 (21%)
FlipFlops       104582 (15%)
RAMs            531 (26%)
DSPs            15 (1%)
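The bandwidth column of Table I can be related to the measured times with a small calculation. The sketch below assumes that each of the seven memory transfers per update reported in Section VI moves one image-sized array of 4-byte values, and that the table's GB denotes 2^30 bytes. Neither assumption is stated in the text; they are chosen because they reproduce the tabulated values. The function name is hypothetical.

```python
GIB = 2 ** 30          # assumed: "GB" in Table I means 2**30 bytes
BYTES_PER_VALUE = 4    # assumed: single-precision values
TRANSFERS = 7          # image-sized memory transfers per update (Section VI)


def effective_bandwidth(width, height, time_ms):
    """Effective memory bandwidth in GiB/s for one iteration."""
    moved_bytes = TRANSFERS * width * height * BYTES_PER_VALUE
    return moved_bytes / GIB / (time_ms / 1000.0)


# Rows of Table I: (device, width, height, time in ms)
table_one = [
    ("CPU", 512, 512, 0.28),
    ("CPU", 4096, 4096, 39.94),
    ("FPGA", 512, 512, 3.21),
    ("FPGA", 4096, 4096, 192.80),
]

for device, w, h, t in table_one:
    print(f"{device:5s} {w}x{h}: {effective_bandwidth(w, h, t):6.2f} GiB/s")
```

Under these assumptions the three rows with longer run times match the table after rounding (10.95, 2.13 and 2.27). The 512x512 CPU row comes out near 24.4 rather than 24.09, which is consistent with the time of 0.28 ms having been rounded to two decimals.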
V. IMPLEMENTATION

In the OpenCL implementation, each cell is represented by a single work item. Due to the regular memory access pattern, we omitted buffering data in local memory and instead rely on efficient streaming access to main memory and caching. The CNN simulation is split into three kernels, buz, update and output, to express the temporal dependencies between the neighboring cells' current states and their outputs.

The host first transfers the input image to the OpenCL device and calls the first kernel to compute $\sum_{k \in N} b_k u_k + z$ once, as it is constant during execution. For each iteration, the other two kernels compute the output of each cell based on its current state, and then the next state using the processed input, the cell's current state and the neighbors' outputs.

The performance of our implementation is measured using the filter setup and masks described as EDGEGRAY in [7], which are shown in (3).

$$a = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad b = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{pmatrix}, \quad z = 0.5 \tag{3}$$

VI. RESULTS

We evaluated our implementation on a CPU (Intel Core i7-4790, 3.60 GHz) and an FPGA accelerator card (Bittware S5-PCIe-HQ D5). For both platforms, automatic selection of the work group size gave the best performance.

Table I shows the execution times and effective memory bandwidth for a single iteration. In total, seven memory transfers are needed for an update, but two are only needed once for initialization, resulting in five additional transfers for each further iteration. As can be seen, the CPU outperforms the FPGA by a factor of about 10 for smaller images and about 5 for larger ones. See Table II for the FPGA resources of the design. Even smaller FPGAs may be used, as there are enough free resources, provided the memory bandwidth stays the same. With a higher memory bandwidth, multiple kernel instances may increase performance.

Besides raw performance, power consumption is of interest, especially for embedded applications like smart camera systems. The Intel CPU used has a thermal design power (TDP) of 84 W, leading to a system power of more than 100 W. The FPGA board consumes at most 25 W, and by using OpenCL pipes the FPGA can read directly from the input, so the host becomes redundant. Smaller FPGAs may be even more efficient.

VII. CONCLUSION

We implemented a typical image processing application for a CNN in OpenCL and generated an FPGA design using Altera's SDK for OpenCL to compare its performance to that of a recent CPU. Though the CPU outperforms the FPGA, the lower energy consumption and the flexibility of the FPGA solution compensate for this. We show that HLS from OpenCL is a reasonable tradeoff between performance and design complexity.

REFERENCES

[1] L. O. Chua and L. Yang, "Cellular neural networks: applications," IEEE Transactions on Circuits and Systems, vol. 35, no. 10, pp. 1273-1290, Oct 1988.
[2] S. Potluri, A. Fasih, L. K. Vutukuru, F. A. Machot, and K. Kyamakya, "CNN based high performance computing for real time image processing on GPU," in Proceedings of the Joint INDS'11 & ISTET'11, July 2011, pp. 1-7.
[3] R. Dolan and G. DeSouza, "GPU-based simulation of cellular neural networks for image processing," in 2009 International Joint Conference on Neural Networks, June 2009, pp. 730-735.
[4] T.-Y. Ho, P.-M. Lam, and C.-S. Leung, "Parallelization of cellular neural networks on GPU," Pattern Recognition, vol. 41, no. 8, pp. 2684-2692, 2008.
[5] O. Y. H. Cheung, P. H. W. Leong, E. K. C. Tsang, and B. E. Shi, "A scalable FPGA implementation of cellular neural networks for Gabor-type filtering," in The 2006 IEEE International Joint Conference on Neural Network Proceedings, 2006, pp. 15-20.
[6] R. Grech, E. Gatt, I. Grech, and J. Micallef, "Digital implementation of cellular neural networks," in Electronics, Circuits and Systems, 2008. ICECS 2008. 15th IEEE International Conference on, Aug 2008, pp. 710-713.
[7] L. O. Chua and T. Roska, Cellular Neural Networks and Visual Computing: Foundations and Applications. New York, NY, USA: Cambridge University Press, 2002.
