
Computer Generation of Platform-Adapted Physical Layer Software

Yevgen Voronenko (SpiralGen, Pittsburgh, PA, USA; yevgen@spiralgen.com)


Volodymyr Arbatov (Carnegie Mellon University, Pittsburgh, PA, USA; arbatov@cmu.edu)
Christian R. Berger (Carnegie Mellon University, Pittsburgh, PA, USA; crberger@ece.cmu.edu)
Ronghui Peng (SpiralGen, Pittsburgh, PA, USA; jonathan.peng@spiralgen.com)
Markus Püschel (ETH Zurich, Switzerland; markus.pueschel@inf.ethz.ch)
Franz Franchetti (Carnegie Mellon University, Pittsburgh, PA, USA; franzf@ece.cmu.edu)∗

∗ This work was supported by ONR through the STTR contract N00014-09-M-0332, by NSF through awards 0325687 and 0702386, and by DARPA through the DOI grant NBCH1050009.

Abstract

In this paper, we describe a program generator for physical layer (PHY) baseband processing in a software-defined radio implementation. The input of the generator is a very high-level, platform-independent description of the transmitter and receiver PHY functionality, represented in a domain-specific declarative language called Operator Language (OL). The output is performance-optimized and platform-tuned C code with single-instruction multiple-data (SIMD) vector intrinsics and threading directives. The generator performs these optimizations by restructuring the algorithms for the individual components at the OL level before mapping to code. This way known compiler limitations are overcome. We demonstrate the approach and the excellent performance of the generated code on the IEEE 802.11a (WiFi) receiver and transmitter PHY for all transmission modes.

1 Introduction

A major challenge in implementing software-defined radios (SDRs) is meeting the real-time demands of the signal processing in the physical layer (PHY). For this reason, most common SDR platforms are not software-only, but use a hybrid architecture that, besides a digital signal processor (DSP) or a general purpose processor (GPP), also includes one or more field-programmable gate arrays (FPGAs). The PHY functionality is then partitioned across these devices, making the implementation, optimization, and maintenance difficult and costly.

Recently, true SDR has come into reach with modern multi-core GPPs (e.g., Intel Core [1]) or multi-core DSPs (e.g., the Sandblaster DSP [2, 3] or Tilera [4]). However, optimizing the PHY layer for these processors is still very challenging since it requires careful tuning to the memory hierarchy, the explicit use of ν-way single-instruction multiple-data (SIMD) vector instructions, and efficient threading. Further, when the platform changes, the code still has to be reoptimized, since performance usually does not port.

Contribution of this paper. In this paper, we propose to overcome these problems using a program generator for PHYs. The generator is based on Spiral [5–8] and automates the production and optimization of PHY source code. The input to the generator is a high-level, platform-independent description of the PHY receiver or transmitter functionality, described in a domain-specific mathematical declarative language called Operator Language (OL). The output is highly optimized C code including SIMD vector intrinsics and threading directives. Difficult optimizations, such as vectorization, are performed at the OL level using a platform-cognizant rewriting system that effectively manipulates the algorithm based on a few hardware parameters to obtain a suitable structure. Similarly, our generator is able to perform cross-block optimizations and to efficiently map different parts of the computation to different data types (e.g., 16-way bytes for Viterbi decoding and 4-way floating point for the fast Fourier transform).

The main contribution of the presented work is to show that the PHY layer can be expressed in OL (which was prototypically introduced in [9]), to include the necessary support for multiple data types and cross-block optimizations into the generator, and finally, to show that the generated code has excellent performance.

We demonstrate the latter for both the transmitter and receiver of IEEE 802.11a (WiFi) on the Intel platforms Core and Atom (backends for other architectures are in development). For example, the generated code achieves real-time WiFi transmission speeds for all data rates up to 54 Mbps on an off-the-shelf Intel Core based system. Further, we show that the computer-generated code outperforms the best hand-optimized code.

2 Background and Prior Work

To date most SDR implementations run (fully or partially) on an FPGA [10]. Although FPGAs are reconfigurable, the PHY implementation still has to be done in Verilog or VHDL just as for application-specific integrated circuits (ASICs). The difference compared to a true software PHY implementation is that in an FPGA or ASIC design, the hardware is designed to match the algorithm, naturally using fine-grained parallelism and pipelining to achieve high performance.

The first true software PHY implementations used a single DSP that is programmed serially [11–13]. Although fully flexible (and easy to program), these could not achieve real-time performance for computationally intensive PHY standards like WiFi. Recent processors possess multiple cores, each typically possessing further parallelism in the form of SIMD vector extensions [1–4, 14, 15]. While these platforms come close to the computational power needed to run PHY standards like WiFi, it also becomes increasingly challenging to implement PHY software that exploits the full computational potential.

Finally, a quite different kind of SDR platform has surfaced, using simple radio front-end boards attached to commodity personal computers (PCs) [16–18]. These are mostly of interest for academic test-beds as in [18, 19] and run the full functionality on a PC that commonly features an Intel or similar GPP. Hence the challenges to efficiently utilize the computational resources in terms of SIMD and multicore parallelism are quite similar to the SDR platforms described above.

3 Operator Language and PHYs

The Operator Language (OL) [9] is a domain-specific declarative mathematical language used to represent certain classes of numerical algorithms. OL is an extension or superset of SPL [5, 20, 21] to cover non-linear multi-input and multi-output operations. The idea behind SPL and OL is to formally represent algorithm knowledge in a platform-independent way. Actual code is generated from this representation using a number of steps that depend on the target platform. We first introduce SPL and then extend the discussion to OL.

SPL. SPL is a language to describe fast algorithms for linear transforms, which are functions of the form x ↦ y = Mx with a fixed matrix M. By slight abuse of notation we will simply refer to M as transform. An example is the discrete Fourier transform (DFT) defined by

    M = DFT_n = [ω_n^{ij}]_{0≤i,j<n},

where ω_n = e^{−2π√−1/n}. An SPL program, or formula, is a fast algorithm for a transform M represented as a factorization of M into a product of sparse matrices.

SPL contains basic matrices such as the identity matrix I_n, diagonal matrices diag_{0≤m<n}(f(m)) with a scalar function f, or the stride permutation matrix L^n_k, which transposes an n/k × k matrix stored linearized in memory. More complex SPL formulas are built from other SPL formulas using matrix operators, such as the matrix product A · B or the Kronecker product A ⊗ B defined as

    A ⊗ B = [a_{i,j} B]_{i,j},  for A = [a_{i,j}]_{i,j}.

All SPL constructs have natural interpretations as code. For example, the formula M = A · B implies the two-step computation t = Bx; y = At.

Using SPL, the well-known Cooley-Tukey fast Fourier transform (FFT) is expressed as

    DFT_{km} = (DFT_k ⊗ I_m) diag(t^{km}_m) (I_k ⊗ DFT_m) L^{km}_k.   (1)

It shows that y = DFT_{km} x can be computed in four steps corresponding to the four factors in (1). Two of the steps involve the recursive computation of smaller DFTs.
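To make this correspondence concrete, the following hand-written C99 sketch (ours, not generator output) renders the four factors of (1) as four explicit steps, using a naive O(n^2) DFT for the small sub-transforms; the variable names and the scratch-buffer interface are our own choices.

    #include <complex.h>
    #include <math.h>

    static const float TWO_PI = 6.283185307f;

    /* naive size-n DFT used for the small sub-transforms */
    static void dft_naive(int n, float complex *y, const float complex *x) {
        for (int u = 0; u < n; u++) {
            float complex s = 0;
            for (int j = 0; j < n; j++)
                s += x[j] * cexpf(-I * TWO_PI * (float)(u * j) / (float)n);
            y[u] = s;
        }
    }

    /* y = DFT_{km} x via the Cooley-Tukey factorization (1);
     * t1 and t2 are caller-provided scratch buffers of length k*m. */
    void dft_ct(int k, int m, float complex *y, const float complex *x,
                float complex *t1, float complex *t2)
    {
        int n = k * m;

        /* Step 1: t1 = L^{km}_k x  (transpose x, read as an m x k matrix). */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < m; j++)
                t1[i * m + j] = x[j * k + i];

        /* Step 2: t2 = (I_k ⊗ DFT_m) t1  (k independent size-m DFTs). */
        for (int i = 0; i < k; i++)
            dft_naive(m, t2 + i * m, t1 + i * m);

        /* Step 3: pointwise twiddle scaling, the diag(t^{km}_m) factor. */
        for (int i = 0; i < k; i++)
            for (int j = 0; j < m; j++)
                t2[i * m + j] *= cexpf(-I * TWO_PI * (float)(i * j) / (float)n);

        /* Step 4: y = (DFT_k ⊗ I_m) t2  (m interleaved size-k DFTs). */
        for (int j = 0; j < m; j++)
            for (int u = 0; u < k; u++) {
                float complex s = 0;
                for (int i = 0; i < k; i++)
                    s += t2[i * m + j] * cexpf(-I * TWO_PI * (float)(u * i) / (float)k);
                y[u * m + j] = s;
            }
    }

A generator, in contrast, chooses the split k · m recursively and replaces the naive inner DFTs by further applications of (1).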
Further important building blocks of SPL, and later OL, are the following matrices parametrized by index mappings. An index mapping is a function on integer intervals. Denote by e^n_i the i-th column basis vector of size n, i.e., the column vector of n elements with a 1 in the i-th position and 0s elsewhere. Given an index mapping function

    f : {0, ..., n−1} → {0, ..., m−1},

gather and scatter matrices are defined as follows:

    G(f) = [e^m_{f(0)} | ... | e^m_{f(n−1)}]^T,   S(f) = G(f)^T = [e^m_{f(0)} | ... | e^m_{f(n−1)}].

OL. OL is a superset of SPL. Where SPL can only describe transforms, i.e., linear single-input and single-output operations, OL removes this restriction and considers more general operators. An operator of arity (c, d) is a function that takes c vectors as input and produces d vectors as output. For example, a k × n matrix M, the simplest possible SPL formula, is in OL viewed as the arity (1, 1) operator

    M_{k×n} : C^n → C^k.

The matrix product A_{m×n} · B_{n×k} becomes in OL the operator composition

    A_{m×n} ∘ B_{n×k} : C^k → C^m.

The tensor product of matrices generalizes to the tensor product of operators, but in this paper we only need one special case. Namely, for any arity (1, 1) operator A : C^m → C^n; x ↦ A(x), the operator I_k ⊗ A is defined as

    I_k ⊗ A : C^{km} → C^{kn};   x ↦ (A(x_0, ..., x_{m−1}), ..., A(x_{(k−1)m}, ..., x_{km−1})).

WiFi physical layer in OL. The 802.11a OFDM transmitter (TX) and receiver (RX) map data bits into a complex baseband signal and vice versa. Formally, and more precisely,

    WiFiTX_{k,m,r} : Z_2^{48kmr−6} → C^{80k},   (2)
    WiFiRX_{k,m,r} : C^{80k} → Z_2^{48kmr−6}.   (3)

Namely, if a number, say ℓ, of bits is to be transmitted at a transmission mode characterized by a modulation scheme with m ∈ {1, 2, 4, 6} bits per subcarrier and a coding rate r ∈ {1/2, 3/4, 2/3}, they will take up

    k = ⌈(ℓ + 6)/N_DBPS⌉ = ⌈(ℓ + 6)/(48mr)⌉   (4)

OFDM symbols, where N_DBPS = 48mr is the number of data bits per OFDM symbol, see Table 1. The data bits are appended with zero bits to fill exactly k OFDM symbols and we will always assume the TX and RX operate on this extended bit sequence.

    Rate [Mbps]  Modulation  Bits/sc. m  Code rate r  Coded bits/symb.  Data bits/symb. N_DBPS
    6            BPSK        1           1/2          48                24
    9            BPSK        1           3/4          48                36
    12           QPSK        2           1/2          96                48
    18           QPSK        2           3/4          96                72
    24           16-QAM      4           1/2          192               96
    36           16-QAM      4           3/4          192               144
    48           64-QAM      6           2/3          288               192
    54           64-QAM      6           3/4          288               216

Table 1: Data rates in IEEE 802.11a with corresponding modulation schemes and coding rates [22]. (sc. = subcarrier; symb. = symbol)
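As a small worked example of (4), and a sketch under our reading of the formula (the +6 accounting for the encoder tail bits), the following C helpers compute k and the padded bit-sequence length 48kmr − 6 from ℓ and the N_DBPS column of Table 1:

    /* ndbps is the "Data bits/symb." column of Table 1, i.e., 48*m*r. */
    static int num_ofdm_symbols(int ell, int ndbps) {
        return (ell + 6 + ndbps - 1) / ndbps;           /* k = ceil((ell + 6) / N_DBPS) */
    }

    static int padded_bits(int ell, int ndbps) {
        return num_ofdm_symbols(ell, ndbps) * ndbps - 6; /* = 48*k*m*r - 6 */
    }

For instance, ℓ = 1000 data bits at 54 Mbps (m = 6, r = 3/4, N_DBPS = 216) give k = ⌈1006/216⌉ = 5 symbols and a padded length of 5 · 216 − 6 = 1074 bits.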
An important difference of physical layer computations from other types of numerical code is the diversity of data types used, which is also evident above. The transmitter maps bits (denoted with Z_2) to complex (floating point) values (denoted with C), and the receiver vice versa. The actual implementation will use one or more additional data types during the course of the computation to optimally use the available vector instruction set. This makes the domain and range specifications of blocks, as in (2)–(3), very important.

We now translate the entire computational data flow of the receiver and transmitter PHY as defined in [22] into OL. The result is (5) and (7), and the occurring blocks (marked bold) are defined in Table 2. We show a further decomposition in (6) and (8), where some of the blocks are broken down themselves; the grouping of these operations in the later implementation is discussed under "Operator grouping" at the end of this section.

    WiFiTX_{k,m,r} = [I_k ⊗ (CPIns_64 ∘ IDFT_64 ∘ PltIns_64 ∘ Map_{48,m} ∘ Int_{48m} ∘ Punc^r_{48m} ∘ CvEnc_{48m,r})] ∘ Scr^ℓ_s   (5)
                   = [I_k ⊗ ([G(h_{64,1}); G(h_{0,1})] ∘ IDFT_64 ∘ S(plt_48) ∘ (I_48 ⊗ M_{m,1}) ∘ G(int_{48m}) ∘ G(d^r_{48m}) ∘ CvEnc_{48m,r})] ∘ (+s).   (6)

    WiFiRX^h̃_{k,m,r} = Scr^ℓ_s ∘ VitDec_ℓ ∘ DePunc^r_{48km} ∘ [I_k ⊗ (DeInt_{48m} ∘ DeMap_{48,m} ∘ PltRm_64 ∘ Eq_h̃ ∘ DFT_64 ∘ CPRm_64)]   (7)
                     = (+s) ∘ VitDec_ℓ ∘ S(d^r_{48km}) ∘ [I_k ⊗ (S(int_{48m}) ∘ (I_48 ⊗ M^{-1}_{m,8}) ∘ G(plt_48) ∘ diag(h̃) ∘ DFT_64 ∘ G(h_{16,1}))].   (8)

Table 2 defines all of the blocks in the receiver and transmitter. Most of the blocks are linear and perform a matrix-vector product; the OL definition can thus be interpreted as a matrix. The non-linear blocks are Map, DeMap, PltIns, VitDec, and Scr.

All of the blocks, except for the Viterbi decoder VitDec, can be defined in terms of primitive OL constructs and matrices, and most of the blocks are normally computed by definition. The important exceptions are the DFT and the Viterbi decoder, for which several alternative fast algorithms exist (e.g., for the DFT the choice of k in (1)), which are again expressed in OL.

PltIns, PltRm, Int, DeInt, Punc, and DePunc are all basically data reorderings or padding operations and thus can be expressed as a gather or scatter with the corresponding index mapping function plt, int, and d, respectively, for pilot removal/insertion, (de)interleaving, and (de)puncturing. The precise form is not relevant here. We do show the matrix structures of the (de)interleaver and (de)puncturer, but in translating these to code, these structures are not used.

The modulator and demodulator are defined by the scalar functions M and M^{-1} that map m hard bits to a complex number and, vice versa, a complex number to m soft bit estimates.
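For illustration, here is a minimal C99 sketch of the scalar mapper for m = 2 (QPSK) and a matching soft demapper producing 8-bit estimates; the Gray mapping and 1/√2 scaling follow the usual 802.11a convention [22], while the soft-bit scaling and clipping are our own illustrative choices:

    #include <complex.h>

    /* M_{2,1}: two hard bits -> one QPSK symbol */
    static float complex qpsk_map(unsigned b0, unsigned b1) {
        const float a = 0.70710678f;   /* 1/sqrt(2) */
        return a * ((2.0f * (float)b0 - 1.0f) + (2.0f * (float)b1 - 1.0f) * I);
    }

    /* M^{-1}_{2,8}: one received symbol -> two signed 8-bit soft estimates;
     * the sign carries the bit decision, the magnitude its reliability. */
    static void qpsk_demap(float complex z, signed char est[2]) {
        float re = crealf(z) * 90.0f;   /* illustrative scaling into the 8-bit range */
        float im = cimagf(z) * 90.0f;
        est[0] = (signed char)(re > 127.0f ? 127.0f : (re < -128.0f ? -128.0f : re));
        est[1] = (signed char)(im > 127.0f ? 127.0f : (im < -128.0f ? -128.0f : im));
    }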
Implementation degrees of freedom. Before the OL formulas (6) and (8) can be mapped to code, all remaining unexpanded blocks (in bold) must be expressed in primitive OL constructs. There are multiple ways of doing so that correspond to different computational algorithms and the internal degrees of freedom within the algorithms. Spiral employs feedback-driven search to make the best, i.e., fastest choice on the given platform. This search effectively performs platform adaptation. In the interest of space, we only briefly discuss some of these degrees of freedom next.

The tensor product I_k ⊗ A itself is not a primitive construct. It has four alternative implementations: as a loop over A, as a parallel loop, as a vectorized loop, or via loop splitting (discussed below). The parallelization and vectorization are implemented using special tags, explained in [6, 7]. If the tensor product is not vectorized, the vectorization is performed "internally," i.e., in A, via rewrite rules (not shown here). An interesting twist in the case of both internal and external vectorization is that the inner subformula of the tensor product operates on different data types, each having a different associated vector length, which makes the vectorization more complex. The equivalent of loop splitting is the transformation I ⊗ AB → (I ⊗ A)(I ⊗ B), illustrated in the sketch below. In our experiments, such "vertical" implementations always performed better.
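The following C sketch (with hypothetical fixed-size stage callbacks A and B, not generator output) contrasts the two variants for arity-(1,1) stages B: R^n → R^m and A: R^m → R^p:

    /* hypothetical fixed-size stages, passed as callbacks */
    typedef void (*stage_fn)(float *out, const float *in);

    /* y = (I_k ⊗ (A ∘ B)) x: one loop, the composition fused per chunk ("horizontal") */
    void tensor_fused(int k, int n, int m, int p, stage_fn A, stage_fn B,
                      float *y, const float *x, float *t /* scratch, length m */)
    {
        for (int j = 0; j < k; j++) {
            B(t, x + j * n);      /* t  = B(x_j) */
            A(y + j * p, t);      /* y_j = A(t)  */
        }
    }

    /* y = (I_k ⊗ A)(I_k ⊗ B) x: after loop splitting ("vertical"), each loop
     * streams through a single stage; u is a full intermediate of length k*m. */
    void tensor_split(int k, int n, int m, int p, stage_fn A, stage_fn B,
                      float *y, const float *x, float *u)
    {
        for (int j = 0; j < k; j++)
            B(u + j * m, x + j * n);
        for (int j = 0; j < k; j++)
            A(y + j * p, u + j * m);
    }

The split form needs a full-size intermediate array, but each loop then processes a single, uniformly typed stage, which is convenient when the two stages use different data types or vector lengths.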

    Operator                 Notation         Domain → Range                      Definition
    Cyclic prefix insertion  CPIns_64         C^64 → C^80                         [0_{16×48} I_16; I_64] = [G(h_{64,1}); G(h_{0,1})]
    Cyclic prefix removal    CPRm_64          C^80 → C^64                         [0_{64×16} I_64] = G(h_{16,1})
    Forward DFT              DFT_64           C^64 → C^64                         [ω_64^{ij}]_{0≤i,j<64}
    Inverse DFT              IDFT_64          C^64 → C^64                         [ω_64^{-ij}]_{0≤i,j<64}
    Pilot tone insertion     PltIns_64        C^48 → C^64                         (+P) ∘ S(plt_48)
    Pilot tone removal       PltRm_64         C^64 → C^48                         G(plt_48)
    Symbol mapping           Map_{48,m}       Z_2^{48m} → C^48                    I_48 ⊗ M_{m,1}
    Symbol demapping         DeMap_{48,m}     C^48 → Z_{2^8}^{48m}                I_48 ⊗ M^{-1}_{m,8}
    Bit interleaving         Int_{48m}        Z_2^{48m} → Z_2^{48m}               G(int_{48m}) = L^{48m}_3 for m ≤ 2;  (I_16 ⊗_i (I_6 ⊗ Z^{m/2}_i)) L^{48m}_3 for m > 2
    Bit deinterleaving       DeInt_{48m}      Z_{2^8}^{48m} → Z_{2^8}^{48m}       S(int_{48m}) = (Int_{48m})^T
    Puncturing               Punc^r_{48m}     Z_2^{2·48mr} → Z_2^{48m}            G(d^r_{48m}) = I_{48mr/6} ⊗ S_r
    Depuncturing             DePunc^r_{48km}  Z_{2^8}^{48km} → Z_{2^8}^{2·48kmr}  S(d^r_{48km}) = (I_{48kmr/6} ⊗ S_r^T) = (Punc^r_{48km})^T
    Convolutional encoding   CvEnc_{48m,r}    Z_2^{48mr} → Z_2^{2·48mr}           L^{2·48mr}_{48mr} [C_0 C_1]^T
    Viterbi decoding         VitDec_{k,m,r}   Z_{2^8}^{2·48kmr} → Z_2^{48kmr−6}   (defined algorithmically, see [23])
    Channel equalization     Eq_h̃             C^64 → C^64                         diag(h̃)
    (De)Scrambling           Scr^ℓ_s          Z_2^ℓ → Z_2^ℓ                       I_ℓ ⊗_i (+s)

    S_{1/2} = I_12,  S_{2/3} = I_3 ⊗ [1 0 0 0; 0 1 0 0; 0 0 1 0],  S_{3/4} = I_2 ⊗ [1 0 0 0 0 0; 0 1 0 0 0 0; 0 0 1 0 0 0; 0 0 0 0 0 1],  Z^n_i = [0 I_i; I_{n−i} 0].
    h_{b,s} : i ↦ b + is (stride-s index mapping),  (+a) : X ↦ X + a ("add constant" operator).
    M_{m,1} : Z_2^m → C is the m-bit modulation operator,  M^{-1}_{m,8} : C → Z_{2^8}^m is the demodulation operator to m 8-bit estimates.
    h̃ is the inverted channel frequency response, a 64-element vector; plt, int, d are index mapping functions, not shown here.
    C_0, C_1 are the Toeplitz matrices formed from the degree-7 bit polynomials of the convolutional encoder.

Table 2: Definition of the block operators used in (5)–(8).
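To connect the first two table rows to code, here is a plain C99 sketch (our indexing: a 16-sample prefix copied from the end of a 64-sample symbol) of cyclic prefix insertion and removal:

    #include <complex.h>
    #include <string.h>

    /* CPIns_64: copy the last 16 of 64 samples in front of the symbol */
    void cp_insert_64(float complex y[80], const float complex x[64]) {
        memcpy(y, x + 48, 16 * sizeof(float complex));   /* prefix = last 16 samples */
        memcpy(y + 16, x, 64 * sizeof(float complex));   /* then the full symbol     */
    }

    /* CPRm_64: drop the first 16 of 80 samples, i.e., the gather G(h_{16,1}) */
    void cp_remove_64(float complex y[64], const float complex x[80]) {
        memcpy(y, x + 16, 64 * sizeof(float complex));
    }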

Next, the DFT and IDFT blocks have different vectorization possibilities [6], different choices of radices in the Cooley-Tukey algorithm, and alternative algorithms.

For the Viterbi decoder, [23] gives the OL description of the standard decoding algorithm. However, there exist other algorithms amenable to parallelization, e.g., [24], which could provide scaling beyond the 2 threads enabled by pipelined parallelism. In our implementation of the Viterbi decoder we only consider the degree of unrolling.

The convolutional encoder can be treated as an FIR filter on blocks, and most FIR filter breakdown rules in Spiral apply. Most importantly, these include different vectorization strategies; less important are FIR blocking rules that improve register reuse.

Alternatively, the convolutional encoder can be grouped with the adjacent matrices, resulting in a single matrix-vector product with a less structured matrix. This matrix-vector product has several degrees of freedom, including different ways of blocking for locality and different vectorization methods (tiling into vector-sized diagonals, cyclic diagonals, or vertical stripes).

Operator grouping. The operators in equations (6) and (8) are grouped into larger blocks that the code generator implements together. Currently, the grouping is guided by rewrite rules, which can be somewhat ad hoc, or it can be controlled manually by explicitly forcing it. Grouping is very advantageous when the code can be unrolled and the resulting temporary buffers scalarized.

4 Optimized Code Generation

Standard code generation process. The standard code generation process in Spiral for SPL/Σ-SPL is shown in Fig. 1. In this paper, we use OL/Σ-OL, which are extensions, but the code generation process is identical.

Figure 1 (diagram: transform and input size → algorithm generation from breakdown rules and platform paradigms → algorithm in SPL → SPL-to-Σ-SPL loop merging → Σ-SPL-to-C code optimizations → C source code, with performance evaluation feeding results back through a search module).

Figure 1: The code generation process in Spiral. The "Search" block closes the performance-driven feedback loop and enables exploring the space of implementation degrees of freedom to achieve automated platform tuning. In this paper, we use OL/Σ-OL instead of SPL/Σ-SPL, but the code generation process is identical.

In the "Algorithm generation" block, the required functionality (in this case the receiver or transmitter) is expanded using the algorithmic breakdown rules as well as the so-called paradigm breakdown rules, which apply the high-level parallelization and vectorization transformations based on the characterization of the target platform.
Note that the platform knowledge is used from the very beginning of the code generation process, and enables the generator to undertake major restructuring, if needed.

First, the system will apply (5)–(7), next expand them using the algorithm rules, such as (1), and the identities in Table 2, and further restructure the formulas using the paradigm breakdown rules. The process continues until all blocks in bold (called non-terminals) are expanded. This yields (6)–(8) initially, and then separately processes the non-terminals that still remain in the formula.

Additional loop-level restructuring is performed in the "SPL → Σ-SPL" block, as explained in [25]. The "Σ-SPL → Code" block generates the final optimized code, as explained in detail below. Finally, the "Performance evaluation" block measures the runtime of the generated code and closes the performance-driven feedback loop that achieves automatic platform tuning.

Note that here we show code-generation-time platform tuning, while other more general tuning modes are possible, such as installation-time and initialization-time (or runtime) tuning; see [26] for details.

Extensions. We had to extend Spiral to be able to generate optimized code for the formulas (5)–(8). This also required extensions for dealing with mixed data-type formulas, bit-level and byte-level SIMD vectorization, and mixed vector-length vectorization.

For full vectorization, additional rewriting rules for the vectorization of the modulator and of general bit-matrices were required.

Spiral was able to parallelize the transmitter without major extensions. The parallelization of the receiver required a special breakdown rule that exposed pipeline parallelism, to mimic the implementation in [27]. This was needed to break the dependency caused by the sequential nature of the Viterbi decoder. This is not the best and most elegant way to deal with this problem. Most probably the fastest implementation (and also the one that would scale to more than 2 threads) will be obtained by directly parallelizing the Viterbi decoder and performing further loop fusion with the data processing that precedes the Viterbi block. This can be done by implementing algorithms such as [24].

Target platform description. Paradigm breakdown rules are guided by the platform knowledge, encapsulated in the platform backend module. Each backend contains both high-level and low-level information about the platform. The high-level information is a set of basic parameters, such as the number of threads, threading model, vector length, cache line length, and others. The low-level information supplies additional information such as the mapping of in-register vector permutations to the machine instructions of the underlying vector instruction set architecture, vector instruction mnemonics, and the syntactic sugaring required for threading and thread synchronization. More details can be found in [6, 7].
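As an illustration of the high-level part only, the following hypothetical C descriptor lists the kind of parameters mentioned above (the field names are ours; actual Spiral backends carry considerably more information, in particular the low-level instruction mappings):

    /* hypothetical high-level platform descriptor, illustrative only */
    typedef struct {
        int num_threads;      /* cores/threads usable for parallelization    */
        int cacheline_bytes;  /* used by parallelization to avoid false sharing */
        int simd_bytes;       /* vector register width, e.g. 16 for SSE      */
        int vlen_f32;         /* = simd_bytes / 4: 4-way single precision    */
        int vlen_i16;         /* = simd_bytes / 2: 8-way 16-bit integers     */
        int vlen_i8;          /* = simd_bytes:    16-way 8-bit integers      */
    } platform_desc;

    static const platform_desc core2_sse = { 4, 64, 16, 4, 8, 16 };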
Memory hierarchy information is not explicitly used in the platform description, with the exception of the cache line length, which is used by the parallelization to prevent false sharing. However, memory hierarchy adaptation is performed as part of the overall feedback-driven algorithm selection process, i.e., by iterating over different degrees of freedom, which leads to differently structured implementations.

Code generation and optimization. A special Spiral OL compiler generates code from OL formulas. The OL compiler, briefly explained in [9], is an extension of the SPL compiler, described in detail in [5, 20, 25].

To generate optimized code, the OL compiler first converts OL into Σ-OL, a lower-level representation. In this stage constructs like ⊗ are converted into iterative sums with gather and scatter matrices. Next, initial code is created using code generation rules, as shown in the table below. (x and y denote the input and output vectors, t is a temporary vector.)

    Parametrized matrices (assume domain(f) = n)
    code(G(f), y, x)      for(j=0..n-1) y[j] = x[f(j)];
    code(S(f), y, x)      for(j=0..n-1) y[f(j)] = x[j];
    code(diag(f), y, x)   for(j=0..n-1) y[j] = f(j)*x[j];

    Operators (assume A : C^n → C^m)
    code(A ∘ B, y, x)     code(B, t, x); code(A, y, t);
    code(I_k ⊗ A, y, x)   for(j=0..k-1) code(A, y + mj, x + nj);
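For example (an illustration in the style of the rules above, not actual generator output), applying the rules directly to the fragment S(f) ∘ diag(d) ∘ G(g) yields three loops with temporary arrays; loop merging as in [25] composes the index mappings and removes the temporaries:

    /* direct application of the code generation rules */
    void formula_naive(int n, float *y, const float *x, const float *d,
                       const int *f, const int *g, float *t1, float *t2) {
        for (int j = 0; j < n; j++) t1[j] = x[g[j]];       /* G(g)    */
        for (int j = 0; j < n; j++) t2[j] = d[j] * t1[j];  /* diag(d) */
        for (int j = 0; j < n; j++) y[f[j]] = t2[j];       /* S(f)    */
    }

    /* after loop merging: one gather-compute-scatter loop, no temporaries */
    void formula_merged(int n, float *y, const float *x, const float *d,
                        const int *f, const int *g) {
        for (int j = 0; j < n; j++)
            y[f[j]] = d[j] * x[g[j]];
    }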
Finally, the compiler applies a set of standard compiler optimizations, such as loop unrolling, copy propagation, constant folding, and strength reduction. Many of the latter optimizations are enabled by completely unrolling the inner loops with fixed bounds and a small number of iterations. For example, the fastest implementation of DFT_64 is a fully unrolled "flat" implementation with no control flow at all. The same holds for most other blocks in the PHYs. The degree of such unrolling can be controlled, if needed.

Figure 2 (three panels, "WiFi Receiver per Symbol," on Core i7 (3.3 GHz), Core 2 (2.6 GHz), and Atom (1.6 GHz); run time [µs] vs. data rate [Mbit/s]; stacked bars for FFT/equalization, de-mapping/de-interleaving, Viterbi forward pass, and Viterbi traceback; the real-time bound is 4 µs per symbol).

Figure 2: Composition of runtime per OFDM symbol across different architectures in a single-threaded receiver. The Viterbi decoder is the most computationally demanding block, and its runtime share increases with the nominal data rate.

Figure 3 (three panels: "WiFi Transmitter Spiral Code Across Platforms," "WiFi Receiver Spiral Code Across Platforms," and "WiFi Receiver Code Comparison on Core 2"; achievable data rate [Mbit/s] vs. nominal data rate [Mbit/s]; curves for Core i7, Core 2, and Atom in vectorized and threaded versions, and for Spiral, Berger et al., and Sora on Core 2, with the real-time bound marked).

Figure 3: Comparison of the achievable vs. nominal throughput in the transmitter and receiver. The hand-written implementations are Berger [27] and Sora [18]. An implementation marked as "threaded" is both threaded and vectorized. Even though Atom only achieves real-time at 9 Mbps, with its TDP of 2.5 W it provides the best performance per Watt.

In addition, complete loop unrolling eliminates temporary array storage through scalar replacement, which in some cases is necessary for optimal performance. Array scalarization also helps when multiple blocks are combined into a single piece of code through grouping.

5 Performance Results

Experimental setup. In our experiments we performed receiver and transmitter simulations on synthetic channel data stored on the hard drive. Thus, the external I/O time from main memory to the hypothetical radio card is not accounted for. Real-world measurements, however, should be quite close, as long as the external I/O is able to sustain the required data rate, because it can be done efficiently in the background, with data coming to main memory via DMA.

The measured program is the Spiral-generated C99 code with SIMD vector intrinsics. The code was compiled using the Intel compiler icc 11.0. In all cases, the benchmarks were performed under 64-bit Linux with a 2.6.x series kernel. Hyperthreading was not used, since it only degrades performance for compute-bound workloads such as WiFi.

We benchmarked the generated code on the three Intel platforms listed below, comparing the achievable rate based solely on the required baseband processing runtimes (TDP indicates the thermal design power of the processor):

• Intel Core i7-975, 3.33 GHz, TDP 130 W, 4 cores;
• Intel Core 2 Quad Q6700, 2.66 GHz, TDP 95 W, 4 cores;
• Intel Atom N270, 1.6 GHz, TDP 2.5 W, 1 core.

Benchmarks. The first set of benchmarks analyzes the runtime of the receiver and transmitter per symbol across different data rates. The recorded runtimes are given in Fig. 2. The bar plots also show the breakdown of runtimes across the computational blocks of the WiFi receiver running on a single thread. The Viterbi decoder is the most time-consuming block and requires a larger proportion of the runtime as the data rate increases. At 54 Mbps it is 88% of the runtime on the Core platforms and 82% on the Atom; at 6 Mbps, it is 64% on the Cores and 54% on the Atom. The other blocks are still important, and aggressive optimizations and block fusions are needed to reduce their relative runtime to the current level.

The second set of plots in Fig. 3 compares achievable vs. nominal data rates. In these plots, besides varying the platform, we also consider two versions of the generated code: one that was used in the previous set of plots (denoted as "vectorized"), and one that is vectorized and in addition uses two threads as described in Section 4 (denoted as "threaded"). The generated code outperforms both of the hand-coded implementations [27] and [18] we compared against. The two biggest enabling factors are the ability to generate multiple algorithmic code alternatives and search within the available degrees of freedom, and the ability to combine multiple blocks.

In Fig. 3, an achievable data rate above nominal means that the processor can process the data fast enough to meet the nominal (required) data rate (i.e., the real-time bound is met). If this is the case, the ratio of nominal over achievable data rate will be approximately equal to the CPU load. For example, on Core i7, the generated vectorized receiver achieves 84 Mbps at the nominal rate of 54 Mbps, which means that the CPU load is 54/84 × 100% = 64%. The threaded and vectorized implementation achieves 115 Mbps, or 54/115 × 100% = 46% CPU load on each of the 2 used cores.

Generally, the achievable data rate grows with the nominal rate, because the number of data bits per OFDM symbol increases with the nominal data rate, and hence there is less computation per data bit. This increase is non-monotonic, because the amount of computation per bit also depends on the code rate r (see Table 1). For example, there are drops in achievable data rates at nominal speeds of 24 Mbps and 48 Mbps due to a reduced r, which slightly bumps up the amount of computation per data bit.

Support for different instruction sets. The next-generation instruction set on x86 platforms is AVX, and we already have a validated AVX backend in the generator. However, at the time AVX hardware was not publicly available from either AMD or Intel. Access to pre-release hardware was only possible under NDA to select Intel customers, and thus runtimes cannot be shown in this paper.

Notably, the AVX instruction set poses additional challenges. It doubles the available vector length without providing any integer instructions on the wider vectors. Thus, the most important block of the WiFi receiver, the Viterbi decoder, cannot benefit, while the use of wider floating-point instructions adds complications for blocks such as the DFT.

The support of multiple instruction sets is needed even on a single platform, due to the mixture of different data types. Every one of our generated PHY implementations uses at least 8-bit integer, 32-bit integer, and 32-bit floating point data types. Each data type requires a distinct set of vector instructions to manipulate the data, and the vector length also differs: it is 16 for 8-bit data, 8 for 16-bit data, and 4 for 32-bit data. Finally, for bit-level operations, the vector length is 128.
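As a concrete illustration with SSE intrinsics (a toy example, not taken from the generated PHY code), the same 128-bit register is 16-way for the 8-bit soft bits of the Viterbi decoder but only 4-way for the floating-point DFT, and each view has its own instructions:

    #include <emmintrin.h>   /* SSE2 integer intrinsics            */
    #include <xmmintrin.h>   /* SSE single-precision intrinsics    */

    void widths_demo(signed char *a8, const signed char *b8, float *af, const float *bf)
    {
        /* 16-way 8-bit view: saturating add, as used for Viterbi path metrics */
        __m128i va = _mm_loadu_si128((const __m128i *)a8);
        __m128i vb = _mm_loadu_si128((const __m128i *)b8);
        _mm_storeu_si128((__m128i *)a8, _mm_adds_epi8(va, vb));

        /* 4-way 32-bit float view of the same register width */
        __m128 xa = _mm_loadu_ps(af);
        __m128 xb = _mm_loadu_ps(bf);
        _mm_storeu_ps(af, _mm_add_ps(xa, xb));
    }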
An extension to other shared-memory multi-core vector architectures is straightforward, as explained in Sec. 4.

6 Conclusion

As true SDR comes into reach, every available performance optimization technique has to be used to achieve maximal performance and real-time operation. The programming burden hence becomes considerable, since the software developer has to use different vector instruction sets and threading, consider the available choices, and use a variety of other techniques. A program generator like Spiral is an attractive solution to this problem. As we have demonstrated, the input description is platform-independent and hence has to be created only once. Mapping to different processors is done by simply changing the platform parameters and the backend and regenerating the code. This way, migration can be accelerated while maintaining excellent performance. We believe that program generators are an important part of a solution to the performance/productivity problem that plagues signal processing and communication applications.

References

[1] G. Blake, R. G. Dreslinski, and T. Mudge, "A survey of multicore processors," IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 26–37, Dec. 2009.
[2] V. Ramadurai, S. Jinturkar, S. Agarwal, M. Moudgill, and J. Glossner, "Software implementation of 802.11a blocks on SandBlaster DSP," in Proc. SDR'06 Technical Conference and Product Exposition, 2006.
[3] D. Iancu, H. Ye, E. Surducan, M. Senthilvelan, J. Glossner, V. Surducan, V. Kotlyar, A. Iancu, G. Nacer, and J. Takala, "Software implementation of WiMAX on the Sandbridge SandBlaster platform," Embedded Computer Systems: Architectures, Modeling, and Simulation, vol. LNCS 4017, pp. 435–446, 2006.
[4] L. J. Karam, I. AlKamal, A. Gatherer, G. A. Frantz, D. V. Anderson, and B. L. Evans, "Trends in multicore DSP platforms," IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 38–49, Nov. 2009.
[5] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proc. of the IEEE, vol. 93, no. 2, pp. 232–275, 2005.
[6] F. Franchetti, Y. Voronenko, and M. Püschel, "A rewriting system for the vectorization of signal transforms," in Proc. High Perf. Computing for Computational Science (VECPAR), 2006.
[7] ——, "FFT program generation for shared memory: SMP and multicore," in Proc. Supercomputing, 2006.
[8] Y. Voronenko, "Library generation for linear transforms," Ph.D. dissertation, Dept. of Electrical and Computer Eng., Carnegie Mellon University, 2008.
[9] F. Franchetti, F. de Mesmay, D. McFarlin, and M. Püschel, "Operator language: A program generation framework for fast kernels," in IFIP Working Conference on Domain Specific Languages, ser. LNCS, vol. 5658. Springer, 2009, pp. 385–410.
[10] P. Koch and R. Prasad, "The universal handset," IEEE Spectrum, vol. 46, no. 4, pp. 36–41, Apr. 2009.
[11] M. J. Meeuwsen, O. Sattari, and B. Baas, "A full-rate software implementation of an IEEE 802.11a compliant digital baseband transmitter," in Proc. IEEE Workshop Signal Processing Systems (SIPS), Oct. 2004.
[12] Y. Tang, L. Qian, and Y. Wang, "Optimized software implementation of a full-rate IEEE 802.11a compliant digital baseband transmitter on a digital signal processor," in Proc. GLOBECOM, Nov. 2005.
[13] A. L. Cinquino and Y. R. Shayan, "A real-time software implementation of an OFDM modem suitable for software defined radios," in Proc. Canadian Conf. Electrical and Computer Engineering, May 2004.
[14] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "SODA: A high-performance DSP architecture for software-defined radio," IEEE Micro, vol. 27, no. 1, pp. 114–123, Jan. 2007.
[15] A. T. Tran, D. N. Truong, and B. M. Baas, "A complete real-time 802.11a baseband receiver implemented on an array of programmable processors," in Proc. of Asilomar Conf. on Signals, Systems, and Computers, Nov. 2008.
[16] GNU Radio. [Online]. Available: http://gnuradio.org/
[17] Wireless Open-Access Research Platform (WARP). [Online]. Available: http://warp.rice.edu/
[18] K. Tan, J. Zhang, J. Fang, H. Liu, Y. Ye, S. Wang, Y. Zhang, H. Wu, W. Wang, and G. Voelke, "Sora: High performance software radio using general purpose multi-core processors," in Proc. 6th USENIX Symposium on Networked Systems Design and Implementation, Apr. 2009.
[19] M. L. Dickens, B. P. Dunn, and J. N. Laneman, "Design and implementation of a portable software radio," IEEE Communications Magazine, vol. 46, no. 8, pp. 58–66, Aug. 2008.
[20] J. Xiong, J. Johnson, R. Johnson, and D. Padua, "SPL: A language and compiler for DSP algorithms," in Proc. PLDI, 2001, pp. 298–308.
[21] J. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, "A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures," IEEE Trans. Circuits and Systems, vol. 9, pp. 449–500, 1990.
[22] IEEE Computer Society, "IEEE Std 802.11-2007, Part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications, Revision of IEEE Std 802.11-1999," June 2007.
[23] F. de Mesmay, S. Chellappa, F. Franchetti, and M. Püschel, "Computer generation of efficient software Viterbi decoders," in High Performance Embedded Architectures and Compilers (HiPEAC), ser. Lecture Notes in Computer Science, vol. 5952. Springer, 2010, pp. 353–368.
[24] G. Fettweis and H. Meyr, "High-speed parallel Viterbi decoding: Algorithm and VLSI-architecture," IEEE Communications Magazine, vol. 29, no. 5, pp. 46–55, May 1991.
[25] F. Franchetti, Y. Voronenko, and M. Püschel, "Loop merging for signal transforms," in Proc. PLDI, 2005, pp. 315–326.
[26] F. de Mesmay, "On the computer generation of adaptive numerical libraries," Ph.D. dissertation, Dept. of Electrical and Computer Eng., Carnegie Mellon University, 2010.
[27] C. R. Berger, V. Arbatov, Y. Voronenko, F. Franchetti, and M. Püschel, "Real-time software implementation of an IEEE 802.11a baseband receiver on Intel multicore," submitted for publication. [Online]. Available: http://www.ece.cmu.edu/~crberger/paper_allerton.pdf
