Abstract—Three-dimensional (3D) numerical simulation is an indispensable technique for various analyses of physical phenomena, but it generally requires a large amount of computation. In this paper, we propose an FPGA-based accelerator for 3D numerical simulations and focus on acceleration of the 3D finite-difference time-domain (FDTD) method. This accelerator consists of a 2D single instruction multiple data (SIMD) array processor, and it can execute 3D parallel computing with little data transfer overhead by applying the virtual processing-elements cuboid (VPEC) technique with synchronous shift data transfer. We demonstrate that the experimental hardware implemented on an Altera Stratix V FPGA (5SGSMD5K2F40C2N) is 3.1 times faster than parallel computing on the NVIDIA Tesla C2075, and it reaches a 94.57% operating rate of the calculation units for the computation of the 3D FDTD method. The proposed accelerator is suitable for multi-chip composition.

I. INTRODUCTION

Three-dimensional (3D) numerical simulation is an indispensable technique for various analyses of physical phenomena. However, it generally requires a large amount of computation; acceleration of 3D numerical simulation is therefore important for many fields of science and engineering.

As an approach to accelerating numerical simulations, parallel processing based on general-purpose computing on graphics processing units (GPGPUs) [1] has been proposed and widely used recently. The GPU has many processing cores, and when all of them perform efficiently, the GPU operates with very high performance. However, simulations of physical phenomena generally require a considerable amount of memory access, i.e., they require multiple data accesses per unit operation, and GPU computing suffers from a memory-access bottleneck caused by the memory structure of GPUs [2]. Moreover, tuning to reduce the bottleneck and achieve the best performance is difficult without deep knowledge of GPU computing.

Another approach to such acceleration is to implement an accelerator dedicated to numerical simulations using field-programmable gate arrays (FPGAs), because FPGAs are rapidly advancing in performance. Several FPGA-based accelerators for numerical simulations have been proposed in [2]–[4]. However, the accelerators proposed in [2] and [3] cannot solve 3D problems. A scalable array processor proposed in [4] can execute 3D parallel computing, but its performance is limited by the DRAM's memory bandwidth. It is not easy for an FPGA-based accelerator to implement 3D parallel processing and achieve high performance.

In this paper, we propose an FPGA-based accelerator for 3D numerical simulations, and we focus on acceleration of the 3D finite-difference time-domain (FDTD) method [5] as a representative of 3D numerical simulations. This accelerator consists of a 2D single instruction multiple data (SIMD) [6] array processor. In our proposal, the 2D SIMD array processor executes 3D parallel computing with little data transfer overhead, and the accelerator can be extended easily to a multi-FPGA implementation. We implement the accelerator on a high-end FPGA and execute the 3D FDTD method for electromagnetic simulations of waveguides. Moreover, we compare its performance with GPGPUs.

II. PARALLEL 3D COMPUTING ON A 2D SIMD ARRAY PROCESSOR

In this section, we describe several techniques for parallel computing to solve 3D numerical simulations. We apply the SIMD architecture as our basic scheme for parallel computing. The SIMD processor consists of multiple PEs that are specialized in calculation. All PEs execute the same instruction simultaneously, but each of the PEs processes different data. The SIMD architecture works with the best performance when it operates on independent data; thus, SIMD is suitable for problems that have data-level parallelism [6]. The SIMD array processor has the advantages of simple control and high performance for numerical simulations.

Basically, spatial parallelism is realized by dividing an analysis domain into 3D computing grids and assigning each of them to a PE, as shown in Fig. 1. The discretized element of the 3D computing grid is called a node or cell. Each PE stores the assigned nodes' data in its local memory and processes the nodes one by one. Because the processing of a node generally requires its neighboring nodes, PE-to-PE communication is indispensable. Composing a 3D array of PEs is a basic idea for 3D parallel computing. However, a 3D array is not suitable for FPGAs because the structure of the FPGA is 2D, and a 3D array requires a great deal of wiring resources. It is likely to be infeasible especially in the case of multi-FPGA implementation because it needs significant I/O bandwidth between FPGAs. Therefore, we introduce the virtual processing-elements cuboid (VPEC) technique to execute highly parallel 3D processing on the 2D SIMD array with little data transfer overhead.

In VPEC, a 3D array is sliced in the z-direction, and the slices are connected to each other in the x-direction to compose a 2D SIMD array, as shown in Fig. 1. In order to realize VPEC, we employ synchronous shift data transfer [2], [7], which is a technique for communication between the PEs on the SIMD array processor.
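The slice-and-concatenate composition of the VPEC can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the coordinate convention and function names are assumptions, based on an X × Y × Z virtual PE cuboid placed on a W × H = (X·Z) × Y physical array (matching the W, H, X, Y, Z notation of Table I).

```python
# Illustrative sketch of the VPEC composition: an X x Y x Z virtual PE
# cuboid is sliced along z, and the slices are placed side by side along
# x, giving a physical 2D array of W = X * Z columns and H = Y rows.
def vpec_to_2d(px, py, pz, X):
    """Map a virtual PE coordinate (px, py, pz) to its 2D array position."""
    return (px + pz * X, py)

def z_neighbor_shifts(X):
    """x-directional synchronous shifts needed to reach the z-adjacent PE:
    adjacent z-slices sit exactly X columns apart in the 2D array."""
    return X
```

Under this mapping, z-directional communication reduces to a fixed number of x-directional shifts, which is why multiple synchronous shift data transfers suffice for z-neighbor exchange.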
Fig. 1. Composition of the VPEC on the 2D PE array. The legend distinguishes x-directional data transfer (single transfer) from z-directional wires between PEs, which are not necessary with synchronous shift data transfer.

Fig. 2. All of the PEs update nodes in a single direction. In order to update a node, the adjacent nodes' data are necessary; when a PE reads the edge nodes' data for the calculation, it transfers them to the adjacent PE.
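The node-to-PE assignment described in Section II can be sketched as follows, assuming each PE owns a w × h × d block of the global grid (the w, h, d block-size notation follows Table I); the helper names are hypothetical.

```python
# Hypothetical helpers for the block decomposition of the global grid,
# where each PE stores a w x h x d sub-block in its local memory.
def owner_pe(i, j, k, w, h, d):
    """3D coordinate of the PE that owns global node (i, j, k)."""
    return (i // w, j // h, k // d)

def local_index(i, j, k, w, h, d):
    """Position of node (i, j, k) inside its PE's local block."""
    return (i % w, j % h, k % d)
```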
In synchronous shift data transfer, all of the PEs transfer data to adjacent PEs simultaneously in the same direction. The PEs are also able to transfer data to nonadjacent PEs by multiple synchronous shift data transfers. Therefore, z-directional communication on the VPEC is realized by using multiple x-directional synchronous shift data transfers, as shown in Fig. 1. We can save routing resources with this technique.

As shown in Fig. 1, z-directional data transfer takes more clocks than x- and y-directional transfers. However, data transfer between PEs normally does not occur frequently while computing all nodes assigned to each PE. Therefore, if the computing time of each PE is longer than the transfer time, we can reduce the time loss of transfer by optimally controlling the order of calculation and the transfer timing. We show the optimal control of the 3D FDTD method in the next section.

In this paper, the PE architecture is optimized for computation of the 3D FDTD method. We can also apply the VPEC to other numerical simulations, such as heat conduction and fluid dynamics, by optimizing the PE composition for each computation.

III. CONTROL OF THE 3D FDTD METHOD

In this section, we describe the control method of the 3D FDTD method on the VPEC. We decompose each of the electromagnetic field-update equations shown in [5] into two pseudo codes in order to calculate them with the same pipelined datapath; e.g., the codes of E_x are as follows:

    E_x(i, j, k) <- E_x(i, j, k) + C_1 {H_z(i, j, k) - H_z(i, j-1, k)},    (1)
    E_x(i, j, k) <- E_x(i, j, k) - C_2 {H_y(i, j, k) - H_y(i, j, k-1)},    (2)
    C_1 = Δt / (ε Δy),   C_2 = Δt / (ε Δz),    (3)

where i, j, and k denote the discrete position, and ε indicates the permittivity. Δt is the time interval, and Δy and Δz are the sizes of a node in the y- and z-directions. In the 3D FDTD method, we must execute six field-update computations (E_x, E_y, E_z, H_x, H_y, H_z), i.e., the proposed accelerator executes a dozen pseudo codes. As shown in these codes, the adjacent nodes' data are necessary for updating a node. By this decomposition, the PE updates nodes in a single direction, as shown in Fig. 2. For example, in the case of Code (1), the PE updates nodes in the y-direction.

As shown in Fig. 2, suppose that w × h × d nodes are assigned to each PE; adjacent PEs' data are required when the PE calculates the nodes on the surface of the PE cuboid in Fig. 2. We call these nodes edge nodes. In our control method, when the PE reads the edge nodes' data for the calculation, the PE transfers them to the adjacent PE, as shown in Fig. 2. Therefore, if the computing time to update each line is longer than each transfer time, there is no time loss. In this control method, the computing time depends on the number of nodes in each direction, and the transfer time is determined by the hardware architecture. Note that the z-directional data transfer time depends on the x-directional number of PEs in the VPEC because of the multiple synchronous shift data transfers. Thus, we optimize the number of nodes and PEs and the composition of the VPEC to achieve the best performance.

Fig. 3. 2D SIMD array processor. This processor consists of the PE array, the CP, and several peripherals.

IV. HARDWARE ARCHITECTURE

We expand the 2D FDTD accelerator [2] to a 3D accelerator with little data transfer overhead. Here we explain the 2D SIMD array architecture and the PE composition for the 3D FDTD method.

A. 2D SIMD Array Processor

In order to implement the VPEC, we compose a 2D SIMD array processor on an FPGA, as shown in Fig. 3. This SIMD processor consists of 2D-arrayed PEs, a control processor (CP), and several peripherals. The PEs are specialized in the calculation of the 3D FDTD method and calculate either 32-bit fixed-point or 32-bit floating-point numbers.

The CP controls the PEs' operation and the communication between the PEs and the host PC or between the PEs and the SRAM blocks. The control instructions for the PEs are generated on the host PC and stored in the PE control memory. The CP fetches the instructions from this memory in sequence.
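The decomposed field-update passes (Codes (1) and (2)) can be sketched as follows; each pass reads neighbors in only one direction and has the single form x = x ± C·(a − b). The field names and the tiny grids here are illustrative, not the accelerator's data layout.

```python
# Illustrative sketch of the two decomposed E_x update passes.
def update_pass_y(Ex, Hz, C1):
    # Code (1)-style pass: y-directional difference of the H field.
    for i in range(len(Ex)):
        for j in range(1, len(Ex[0])):
            for k in range(len(Ex[0][0])):
                Ex[i][j][k] += C1 * (Hz[i][j][k] - Hz[i][j - 1][k])

def update_pass_z(Ex, Hy, C2):
    # Code (2)-style pass: z-directional difference of the H field.
    for i in range(len(Ex)):
        for j in range(len(Ex[0])):
            for k in range(1, len(Ex[0][0])):
                Ex[i][j][k] -= C2 * (Hy[i][j][k] - Hy[i][j][k - 1])
```

Because each pass sweeps the grid in a single direction, a PE only needs one neighbor's edge-node data per pass, which is what allows the transfer to be overlapped with computation.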
Fig. 4. Pipelined datapath of a PE (pipeline stages: address set, memory read, addition/subtraction, multiplication, addition/subtraction, and write back). This datapath is a six-stage pipeline dedicated to calculations of the form x = a ± b × (c ± d).

Fig. 5. Canonical problem of waveguide analysis given in [10]. This is a four-stage waveguide bandpass filter (BPF).

Fig. 6. An asymmetrical resonant iris given in [9]. This waveguide must be solved by the 3D FDTD method.
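The single-cycle datapath operation of Fig. 4 can be sketched as follows. The operand routing and the form x = a ± b × (c ± d) are reconstructed here from the pipeline stage order (add/subtract, multiply, add/subtract), so treat the details as assumptions.

```python
# Illustrative model of one PE datapath operation:
# x = a +/- b * (c +/- d), computed in three arithmetic stages.
def datapath_op(a, b, c, d, sub_inner=True, sub_outer=False):
    inner = (c - d) if sub_inner else (c + d)       # first add/subtract stage
    prod = b * inner                                # multiply stage
    return (a - prod) if sub_outer else (a + prod)  # second add/subtract stage
```

A Code (1)-style update then maps onto a single datapath operation, e.g., datapath_op(e, c1, h1, h2) for e + c1·(h1 − h2).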
The instructions are very long instruction word (VLIW) [8] type operation codes which consist of 80-bit control signals. VLIW allows the PEs to execute multiple operations simultaneously, such as memory read, memory write, calculation, and data transfer.

B. PE Architecture

Each PE has a pipelined datapath that consists of two adder-subtractors, one multiplier, two dual-port memories, and two single-port memories, as shown in Fig. 4. This datapath is a six-stage pipeline dedicated to calculations of the form x = a ± b × (c ± d). This calculation is suitable for the decomposed field-update computations such as Codes (1) and (2). The PE can execute this calculation with a throughput of one result per clock cycle.

In Fig. 4, EMem and HMem are data memories storing the electromagnetic field components. ConstMem stores constant data, and IndexMem stores structure data of the computing grid. The structure data control the range of computation and the conditional execution for the boundary of the analysis domain. Because the processor is based on the SIMD architecture, such data are essential for the conditional execution.

By controlling OutData, InData, and MUXdir, communication with the four adjacent PEs can be implemented. In addition, multiple synchronous shift data transfers are performed by using MUXshift.

V. IMPLEMENTATION AND PERFORMANCE EVALUATION

In this section, we describe the implementation results and the performance comparison when the proposed accelerator executes electromagnetic simulations. We analyzed the two waveguides shown in Figs. 5 and 6 with the impedance boundary condition [9] for each waveguide's ports.

A. Synthesis Result

We have implemented the proposed accelerator described in the preceding section with an Altera Stratix V 5SGSMD5K2F40C2N FPGA. The sizes of the VPECs implemented on the FPGA, optimized for each case, are summarized in Table I. In Table I, w, h, and d are the x-, y-, and z-directional numbers of nodes assigned to each PE, respectively; X, Y, and Z are the numbers of PEs in each direction; W and H are the array sizes shown in Fig. 3. Table II shows the maximum resource utilization for the accelerator compiled with the Altera Quartus II software; in this case, the accelerator is implemented with the floating-point calculation unit, and the array size is optimized for the four-stage BPF. We set the clock of the FPGA to 100 MHz.

B. Electromagnetic Simulation Result

We ran electromagnetic simulations of the waveguides on the accelerator. We computed 65,536 time steps for the node grids shown in Figs. 5 and 6. As part of the simulation results, Fig. 7 shows visualized absolute values of the electric fields in the asymmetrical resonant iris.

C. Performance Comparison with GPGPU

In this section, we compare the performance of the FPGA accelerator with that of the GPGPU. We have implemented the 3D FDTD method on NVIDIA GPUs using CUDA. We set optimal thread and block sizes for CUDA, and one thread updates one node. In addition, we employ the constant memory of each GPU and the reduction technique [11]. The GPUs compute with single-precision (32-bit) floating-point numbers.

Fig. 7. An example of the visualized absolute values of the electric fields, calculated by |E| = sqrt(E_x^2 + E_y^2 + E_z^2).
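The per-node field magnitude used for the visualization in Fig. 7 can be sketched as:

```python
# Per-node electric-field magnitude used for visualization:
# |E| = sqrt(Ex^2 + Ey^2 + Ez^2).
def e_magnitude(ex, ey, ez):
    return (ex * ex + ey * ey + ez * ez) ** 0.5
```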
TABLE I. THE SIZES OF VPECs.

                  Four-stage BPF (Fig. 5)                    Asymmetrical resonant iris (Fig. 6)
                  Nodes / PE   PEs / direction  2D array size   Nodes / PE   PEs / direction  2D array size
                  w   h   d    X    Y    Z      W    H   Total  w   h   d    X    Y    Z      W    H   Total
Fixed point       5   3   15   4   13    9     36   13    468   4   3   13   4   11   10     40   11    440
Floating point    5   4   23   4   10    6     24   10    240   4   4   19   4    8    7     28    8    224
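The array-size relations implied by Table I (W = X·Z, H = Y, and Total = W·H) can be checked with a small sketch; the function name is illustrative.

```python
# Sketch of the Table I relations: slicing the X x Y x Z PE cuboid along z
# yields a 2D array of W = X * Z columns by H = Y rows, Total = W * H PEs.
def array_size(X, Y, Z):
    W, H = X * Z, Y
    return W, H, W * H
```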