Abstract: Cellular Neural Networks (CNNs) are an inherently parallel computational model for multiple applications, and they are especially appropriate for image processing tasks. Besides implementing them with analogue electronic circuits, they can be simulated on digital processor architectures like CPUs and GPUs, with the drawback of limited parallelism. FPGAs offer fine-grained parallel execution with low power consumption and are thus attractive for embedded systems like smart cameras, for which it is not possible to use a full-featured CPU or GPU consuming tens or hundreds of watts. The drawback of implementing CNNs with FPGAs, to profit from the high performance-to-power ratio, is the time-consuming design process with conventional hardware description languages. High-level synthesis, e.g., from OpenCL, eases the process of generating CNNs in FPGAs. By using the OpenCL programming model, the programmer can explicitly express the parallel nature of CNNs in a platform-independent way. To investigate its applicability to CNNs, we compare the execution of an unmodified OpenCL kernel on a recent CPU with an FPGA design generated with Altera's SDK for OpenCL. The results show that, although the CPU is faster, the FPGA solution performs better in terms of energy efficiency and is well suited to smart camera systems.

I. INTRODUCTION

Many applications based on CNNs have been presented, in particular image processing operations, since Chua and Yang introduced CNNs [1]. Originally, CNNs were implemented by analogue circuits, taking advantage of the parallel processing of multiple physically realized CNN cells. Using CPUs for simulation has the problem that only a limited number of arithmetic resources is available, so the cells' computation has to be time-multiplexed. FPGAs offer a higher degree of parallelism than CPUs or even GPUs, but at the cost of a complex and time-consuming design process. High-level synthesis (HLS) may shorten this process drastically if the source language is capable of expressing the algorithm's properties. OpenCL, supported by HLS tools, allows and simplifies the expression of an array of CNN cells that operate in parallel and are connected to each other. Our experiments show that there is a loss of performance on the FPGA compared to a realization of CNNs on a CPU. However, when we also consider energy efficiency, the OpenCL design of CNNs offers a reduction of both design time and energy consumption, making it attractive for smart cameras.

The paper is structured as follows. After a short introduction to CNNs in Section II, we outline related work in Section III. In Section IV we briefly introduce OpenCL, followed by a description of an OpenCL kernel implementation of CNNs in Section V. Section VI compares the results achieved on an FPGA with those of a multi-core CPU. Finally, we conclude the paper with Section VII.

II. CELLULAR NEURAL NETWORKS

CNNs are modeled by cells connected in a regular 2D grid. Only neighboring cells are linked and communicate. Each cell maintains a state x, which iteratively evolves over time based on its current state and the feedback from its neighbors y. Equation (1) shows the change of the cell state.

$$\dot{x}_l = -x_l + \sum_{k \in N} a_k y_k + \sum_{k \in N} b_k u_k + z \qquad (1)$$

The output of a cell is a function of its current state (2).

$$y_k(x_k) = 0.5\,(|x_k + 1| - |x_k - 1|) \qquad (2)$$

This basic concept is valid for all CNNs. The actual functionality of a CNN depends on two filtering masks (of size N), a and b. Mask b is used to convolve the input signal u once, and a is applied to the neighbors' outputs to produce their feedback for computing the cell's new state. This step is applied iteratively until a defined convergence criterion is met. The state is also influenced by a constant bias z. The size of the masks may differ among applications, but it strongly influences the number of calculations and memory transfers and thus the execution time on most devices. In general, it is desirable to have smaller masks.

III. RELATED WORK

Several papers deal with software implementations of CNNs on CPUs and GPUs, but also on FPGAs, to take advantage of their flexibility and applicability for embedded applications. To the best of our knowledge, few publications investigate CNNs in OpenCL, and those compare CPUs and GPUs, but not FPGAs. Potluri et al. [2] present an OpenCL implementation on CPU and GPU, but they lack an architectural description as well as a detailed performance analysis. Dolan and DeSouza [3] focus on the implementation but without any optimization, resulting in very poor CPU performance. In [4], optimized CPU and GPU implementations are compared, and the GPU outperforms the CPU by a factor of ten. However, our FPGA implementation easily exceeds their CPU performance, although this is of course also influenced by the technical progress of the devices. The authors of [5] and [6] present custom FPGA architectures, but they have to trade flexibility regarding image and filter size for performance and design simplicity, which is where OpenCL-based FPGA design has its advantage.
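
To make the cell model of Section II concrete, the following is a minimal Python sketch of the state update (1) and the output function (2). It is not the OpenCL kernel evaluated in this paper. Several details are assumptions made only for illustration: the continuous-time equation is discretized with a forward Euler step of size dt, all states start at zero, cells outside the grid contribute zero, and the names output, apply_mask, cnn_step, and simulate are hypothetical.

```python
def output(x):
    """Piecewise-linear cell output of Eq. (2); saturates at -1 and +1."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def apply_mask(grid, mask):
    """Weighted sum over each cell's neighborhood with a square mask.

    Assumption: neighbors outside the grid contribute zero.
    """
    rows, cols = len(grid), len(grid[0])
    r = len(mask) // 2  # neighborhood radius, e.g. 1 for a 3x3 mask
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        acc += mask[di + r][dj + r] * grid[ni][nj]
            result[i][j] = acc
    return result


def cnn_step(x, bu, a, z, dt):
    """One forward Euler step of Eq. (1) (the discretization is an assumption)."""
    y = [[output(v) for v in row] for row in x]
    feedback = apply_mask(y, a)  # mask a applied to the neighbors' outputs
    return [
        [x[i][j] + dt * (-x[i][j] + feedback[i][j] + bu[i][j] + z)
         for j in range(len(x[0]))]
        for i in range(len(x))
    ]


def simulate(u, a, b, z, dt=0.1, tol=1e-6, max_iter=10000):
    """Iterate until the largest state change drops below tol."""
    bu = apply_mask(u, b)  # mask b is applied to the input only once
    x = [[0.0] * len(u[0]) for _ in u]  # assumption: zero initial state
    for _ in range(max_iter):
        x_new = cnn_step(x, bu, a, z, dt)
        delta = max(abs(n - o)
                    for row_n, row_o in zip(x_new, x)
                    for n, o in zip(row_n, row_o))
        x = x_new
        if delta < tol:
            break
    return [[output(v) for v in row] for row in x]
```

Within one step, every cell reads only the previous states of its neighbors, so all cells can be updated independently of each other. This is the parallelism that an OpenCL description is meant to expose, and the nested loops over the mask show why the mask size directly drives the number of operations and memory accesses per cell.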