Since the Radix-4 algorithm is based on the symbol level, we must define a symbol probability (or reliability):

    L(ψ^{ij}) = max*_{s_{k−2}, s_k} { α_{k−2}(s_{k−2}) + γ_k^{ij} + β_k(s_k) }    (6)

Based on the justification in Section 2.1 that both binary and duo-binary codes can be decoded in a unified way, we propose a configurable Radix-4 Log-MAP decoder architecture to perform both types of decoding operations.
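The max* operator and the 16-term reduction in (6) can be sketched in software. The state count and metric values below are illustrative stand-ins, not taken from the actual decoder:

```python
import math
import random

def max_star(a, b):
    """Jacobian logarithm: max*(a, b) = log(e^a + e^b)
                                      = max(a, b) + log(1 + e^-|a-b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Toy setting: 4 trellis states, one Radix-4 step covering the symbol
# pair ij in {00, 01, 10, 11}. All metric values are random stand-ins.
random.seed(0)
S = 4
alpha = [random.uniform(-2.0, 2.0) for _ in range(S)]             # alpha_{k-2}(s)
beta = [random.uniform(-2.0, 2.0) for _ in range(S)]              # beta_k(s)
gamma = {(sp, sn): [random.uniform(-1.0, 1.0) for _ in range(4)]  # gamma_k^{ij}
         for sp in range(S) for sn in range(S)}

def symbol_reliability(ij):
    """L(psi^{ij}) from (6): a max* reduction over all state pairs
    (s_{k-2}, s_k) of alpha + gamma + beta."""
    terms = [alpha[sp] + gamma[(sp, sn)][ij] + beta[sn]
             for sp in range(S) for sn in range(S)]
    acc = terms[0]
    for t in terms[1:]:
        acc = max_star(acc, t)
    return acc

reliabilities = [symbol_reliability(ij) for ij in range(4)]
```

Since max* is an exact pairwise log-sum-exp, the reduction equals log Σ e^term over all 16 branches; replacing max* with a plain max yields the Max-Log-MAP approximation.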
[Figure residue: sliding-window tile charts for (a) binary tailing and (b) duo-binary tail-biting, plotting windows W, 2W, …, 7W against time, with phases for data input, α dummy, α calculation, β0 dummy, and Λ & β1 calculation (the extra β0 dummy pass is tail-biting only). A companion diagram shows the top-level MAP decoder: trellis branch metrics (BMs) are routed to eight Radix-4 ACSA units forming the α/β unit; α-BMC, α-unit and α-RAM handle the forward recursion, β0-/β1-BMCs and units the backward recursions, with inputs La(u), Lc(ys), Lc(yp1,2), a scratch RAM, and a Λ-unit producing û and Le(û).]
Figure 4. Sliding window tile chart
[Figure residue: the Λ-unit. max* trees combine α_{k−1}, γ_k and β_k into symbol reliabilities L(φ00), L(φ01), L(φ10), from which LLR trees form Λ(û_{2k−1}) and Λ(û_{2k}) and a bit-LLR tree forms Λ(û_k). Legend: α: forward; β0: dummy backward; β1: effective backward; Λ: LLR and extrinsics.]

To efficiently implement the Log-MAP algorithm, an overlapping sliding window technique [9] is employed. We generalize the two types of decoding operations into
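The overlapping sliding-window schedule of Fig. 4 can be sketched as a small pipeline model. The three-stage timing below (input, α, dummy β0, effective β1 with LLR output) is an illustrative simplification of the tile chart, not a cycle-accurate description:

```python
def sliding_window_schedule(n_windows):
    """Per time slot, which window each unit processes (None = idle).
    Window w: data is read at slot w; alpha runs at slot w + 1 while
    beta0 performs a dummy backward recursion over window w + 1 to
    seed beta1; beta1 plus the Lambda unit emit LLRs at slot w + 2.
    The last window needs no dummy pass (tail / tail-biting init)."""
    rows = []
    for t in range(n_windows + 2):        # two extra slots drain the pipeline
        rows.append({
            "input": t if t < n_windows else None,
            "alpha": t - 1 if 0 <= t - 1 < n_windows else None,
            "beta0_dummy": t if 1 <= t <= n_windows - 1 else None,
            "beta1_llr": t - 2 if 0 <= t - 2 < n_windows else None,
        })
    return rows

schedule = sliding_window_schedule(4)     # a toy block of 4 windows
```

The first LLRs appear three window-periods after the block starts, consistent with a decoding latency of about three windows per MAP decoder.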
contains 410K gates of logic, our architecture achieves better flexibility by supporting multiple code types at low hardware overhead.

Table 2. Area distribution of the MAP decoder

  Block                                 Area
  α-processor (including α-BMU)         30.8K gates
  β-processor ×2 (including β-BMUs)     66.2K gates
  Λ-processor                           37.3K gates
  α-RAM                                 2.560K bits
  Scratch RAMs ×3                       4.224K bits
  Control logic                         13.4K gates

[Figure 7. Parallel decoder architecture: (a) multi-MAP turbo decoder — input LLR buffer (Lc) and extrinsic buffer (Le), each partitioned into banks 1…Q, connected through crossbar switches to MAP decoders 1…P, with QPP/ARP interleaver address generators (Π); (b) parallel MAP decoding — one decoding block N divided into sub-blocks SB 1…SB P of window size W, processed over time, with tail-biting for duo-binary codes.]

4. Low complexity interleaver architecture

One of the main obstacles to parallel turbo decoding is interleaver parallelism, due to memory access collision problems. For example, in [10] memory collisions are resolved with additional write buffers. Recently, however, new contention-free interleavers have been adopted by 4G standards, such as the QPP interleaver [11] in 3GPP LTE and the ARP interleaver [12] in WiMax. The simplest way to implement an interleaver is to store all the interleaving patterns in ROMs, but this approach becomes impractical for a turbo decoder supporting multiple block sizes and multiple standards. In this section, we propose a unified on-line address generator with two recursion units supporting both QPP and ARP interleavers.

We first look at the QPP interleaver. Given an information block length N, the x-th interleaved output position is given by Π(x) = (f2·x² + f1·x) mod N, with 0 ≤ x, f1, f2 < N. This can be computed recursively as:

    Π(x + 1) = ((f2·x² + f1·x) + (2f2·x + f1 + f2)) mod N
             = (Π(x) + Γ(x)) mod N                              (7)

where Γ(x) = (2f2·x + f1 + f2) mod N, which can itself be computed recursively as Γ(x + 1) = (Γ(x) + g) mod N with g = 2f2. Since Π(x), Γ(x) and g are all smaller than N, the QPP interleaver can be efficiently implemented by cascading two Add-Compare-Choose (ACC) units, as shown in Fig. 6(a); this avoids the expensive multipliers and dividers.

The ARP interleaver employs a two-step interleaving. In the first step it switches the alternate couples: [Bx, Ax] = [Ax, Bx] if x mod 2 = 1. In the second step it computes Π(x) = (P0·x + Qx) mod N, where Qx equals 1, 1 + N/2 + P1, 1 + P2 or 1 + N/2 + P3 when x mod 4 = 0, 1, 2 or 3, respectively; P0, P1, P2 and P3 are constants depending on N. Denoting λ(x) = P0·x, the ARP interleaver can be implemented in a similar manner by reusing the same two ACC units, as shown in Fig. 6(b): one unit computes λ(x + 1) = (λ(x) + P0) mod N, the other Π(x) = (λ(x) + Qx) mod N.

This architecture requires only a few adders and multiplexors, which leads to very low complexity, and it can support all QPP/ARP turbo interleaving patterns. Compared to the latest work on row-column based interleavers [13], which needs complex state machines and RAMs/ROMs, our QPP/ARP interleaver architecture has a lower gate count and more flexibility.

[Figure 6. Low-complexity unified interleaver: (a) QPP interleaver architecture — two cascaded ACC units implementing Γ(x+1) = (Γ(x)+g) mod N and Π(x+1) = (Π(x)+Γ(x)) mod N, each built from an adder, a subtract-by-N stage and a sign-bit-controlled multiplexor; (b) ARP interleaver architecture — the same two ACC units computing λ(x+1) = (λ(x)+P0) mod N and Π(x) = (λ(x)+Qx) mod N, with Q0–Q3 selected by x mod 4.]

5. Parallel turbo decoder architecture

To increase the throughput, parallel MAP decoding can be employed by dividing the whole information block into several sub-blocks, each decoded separately by a dedicated MAP decoder [14][15][16][17]. For example, in [15] a 75.6 Mbps data rate is achieved using 7 SISO decoders running at 160 MHz.
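As a cross-check of the recursion in (7) and the ARP reuse scheme, the address generation can be modeled in a few lines. The QPP pair (f1 = 3, f2 = 10) is the 3GPP LTE assignment for N = 40; P0 and Q0–Q3 below are illustrative placeholders rather than WiMax table values:

```python
def acc(value, increment, N):
    """Add-Compare-Choose unit: (value + increment) mod N computed with
    one adder, one subtractor and a sign-controlled mux. Valid because
    both inputs are already reduced below N."""
    s = value + increment
    return s - N if s >= N else s

def qpp_addresses(N, f1, f2):
    """QPP positions Pi(x) = (f2*x^2 + f1*x) mod N, generated with the
    two cascaded ACC recursions of (7): the Gamma unit feeds the Pi unit."""
    g = (2 * f2) % N
    pi, gamma = 0, (f1 + f2) % N          # Pi(0) = 0, Gamma(0) = f1 + f2
    out = []
    for _ in range(N):
        out.append(pi)
        pi = acc(pi, gamma, N)            # Pi(x+1) = (Pi(x) + Gamma(x)) mod N
        gamma = acc(gamma, g, N)          # Gamma(x+1) = (Gamma(x) + g) mod N
    return out

def arp_addresses(N, P0, Q):
    """ARP second step Pi(x) = (P0*x + Q[x mod 4]) mod N, reusing the
    same two ACC units: a lambda recursion plus a Q-offset ACC."""
    lam, out = 0, []
    for x in range(N):
        out.append(acc(lam, Q[x % 4] % N, N))   # Pi(x) = (lambda(x) + Qx) mod N
        lam = acc(lam, P0 % N, N)               # lambda(x+1) = (lambda(x) + P0) mod N
    return out

pi40 = qpp_addresses(40, 3, 10)   # LTE parameters for N = 40: f1 = 3, f2 = 10
```

Because every operand entering acc() is already below N, a single conditional subtraction implements the modulo, which is exactly what lets the hardware drop the multipliers and dividers.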
[Figure 8. (a) Normalized area versus throughput (N=6144, I=6, W=32); (b) normalized power versus throughput (N=6144, I=6, W=32). Each curve corresponds to a parallelism level P ∈ {1, 2, 4, 8, 16, 32}, swept over the clock frequency range 80:20:200 MHz; scaled comparison points from [4], [14], [15] and [16] are marked.]
In this section, we propose a scalable decoder architecture, shown in Fig. 7(a), based on the QPP/ARP contention-free 4G interleavers. Parallelism is achieved by dividing the whole block N into P sub-blocks (SBs) and assigning P MAP decoders to work in parallel, reducing the latency to O(N/P). The memory structure is designed to support concurrent access to the LLRs by the P MAP decoders in both linear and interleaved addressing modes. This is achieved by partitioning the memory into Q banks (Q ≥ P), each independently accessible; in this paper we use Q = P. Taking the QPP interleaver as an example, the P MAP decoders always access data simultaneously at a particular offset x, which guarantees no memory access conflicts due to the contention-free property ⌊Π(x + jM)/M⌋ ≠ ⌊Π(x + kM)/M⌋, where x is the offset within sub-blocks j and k (0 ≤ j < k < P) and M is the sub-block length N/P. Since any MAP decoder may write to any memory during the decoding process, a full crossbar is designed for routing data among them. Fig. 7(b) depicts a parallel decoding example for duo-binary codes (the general concepts hold for binary codes).

5.1. Architecture trade-off analysis

High throughput is achieved at the cost of multiple MAP decoders and multiple memory banks. In this section, we analyze the impact of parallelism on throughput, area and power consumption. The maximum throughput is estimated as:

    Throughput = N / (decoding time) ≈ N·f / (2·I·(Ñ/P + 3W̃))    (8)

where Ñ = N/2 and W̃ = W/2 for Radix-4 decoding, I is the number of iterations (each containing two half-iterations), 3W̃ is the decoding latency of one MAP decoder, and f is the clock frequency. The area is estimated as:

    Area ≈ P·Amap(f) + Amem(P, N) + Aroute(P, f)    (9)

where Amap is the area of one MAP decoder, which increases with f; Amem is the total memory area, which increases with N and also with P due to the sub-banking overhead; and Aroute is the routing cost (crossbars plus interleavers), which increases with both P and f. Note that the complexity of the full crossbar actually increases with P². We describe the decoder in Verilog and synthesize it on a 65 nm technology using Synopsys Design Compiler. The area trade-off analysis is given in Fig. 8(a), which plots the normalized area versus throughput for different parallelism levels and clock rates (80-200 MHz in steps of 20 MHz). The power estimate is obtained from the product of area and frequency (actual power estimation is still in progress); it is based on the dynamic power consumption P = C·f·V², where V is assumed to be the same for all cases and C is assumed to be proportional to the silicon area. Fig. 8(b) shows the power trade-off analysis based on this area × f metric.

5.2. Comparison with related works

Table 3 compares the flexibility and performance with existing state-of-the-art turbo decoders. Fig. 8(a) compares the silicon area with [4][14][15][16], where the areas are scaled and normalized to a 65 nm technology and the block size is scaled to 6144. In [2][4][18], programmable processors are proposed to provide multi-code flexibility. In [14][15]
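Before comparing against Table 3, it is easy to check that (8) reproduces the reported numbers; the helper below is a direct transcription of (8). The area model (9) is omitted here since its coefficients are technology-dependent:

```python
def radix4_throughput(N, W, I, P, f):
    """Estimated throughput from (8): N*f / (2*I*(N~/P + 3*W~)),
    where N~ = N/2 and W~ = W/2 because Radix-4 processes two bits
    per trellis step, I is the number of full iterations (two
    half-iterations each) and 3*W~ is one MAP decoder's latency."""
    n_half, w_half = N / 2, W / 2
    return N * f / (2 * I * (n_half / P + 3 * w_half))

# Operating point of the last row of Table 3: N=6144, W=32, I=6, P=32, f=200 MHz.
tput = radix4_throughput(6144, 32, 6, 32, 200e6)   # about 711 Mbps
```

With P = 32, the sub-block term Ñ/P (96 here) is already comparable to the fixed window latency 3W̃ (48), which is why throughput stops scaling linearly in P.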
Table 3. Comparison with existing turbo decoder architectures
Work Architecture Flexibility MAP algorithm Parallelism Iteration Frequency Throughput
[2] 32-wide SIMD Multi-code Max-LogMAP 4 window 5 400 MHz 2.08 Mbps
[18] Clustered VLIW Multi-code LogMAP/Max-LogMAP Dual cluster 5 80 MHz 10 Mbps
[4] ASIP-SIMD Multi-code Max-LogMAP 16 ASIP 6 335 MHz 100 Mbps
[14] ASIC Single-code LogMAP 6 SISO 6 166 MHz 59.6 Mbps
[15] ASIC Single-code LogMAP 7 SISO 6 160 MHz 75.6 Mbps
[16] ASIC Single-code Constant-LogMAP 64 SISO 6 256 MHz 758 Mbps
This work ASIC Multi-code LogMAP 32 SISO 6 200 MHz 711 Mbps
efficient ASIC architectures are presented for decoding of simple binary turbo codes using the LogMAP algorithm. In [16], a high-speed turbo decoder architecture based on sub-optimal Constant-LogMAP SISO decoders (Radix-2) is discussed. Though the decoder in [16] achieves high throughput at a low silicon area, it has some limitations: the interleaver is not standard-compliant, and it cannot support the WiMax duo-binary turbo codes. As can be seen, our architecture exhibits both flexibility and efficiency, supporting multiple turbo codes (simple binary + double binary) while achieving high throughput.

6. Conclusion

A flexible multi-code turbo decoder architecture is presented together with a low-complexity contention-free interleaver. The proposed architecture is capable of decoding simple and double binary turbo codes with a limited complexity overhead. A data rate of 10-711 Mbps is achievable with scalable parallelism. The architecture is capable of supporting multiple proposed 4G wireless standards.

References

[1] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-Codes. In IEEE Int. Conf. Commun., pages 1064–1070, May 1993.
[2] Y. Lin et al. Design and implementation of turbo decoders for software defined radio. In IEEE Workshop on Signal Processing Design and Implementation (SIPS), pages 22–27, Oct. 2006.
[3] M.C. Shin and I.C. Park. SIMD processor-based turbo decoder supporting multiple third-generation wireless standards. IEEE Trans. on VLSI, vol. 15, pp. 801–810, Jun. 2007.
[4] O. Muller, A. Baghdadi, and M. Jezequel. ASIP-based multiprocessor SoC design for simple and double binary turbo decoding. In DATE'06, volume 1, pages 1–6, Mar. 2006.
[5] P. Robertson, E. Villebrun, and P. Hoeher. A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain. In IEEE Int. Conf. Commun., pages 1009–1013, 1995.
[6] M. Bickerstaff et al. A 24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless. In IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2003.
[7] Y. Zhang and K.K. Parhi. High-throughput radix-4 logMAP turbo decoder architecture. In Asilomar Conf. on Signals, Syst. and Computers, pages 1711–1715, Oct. 2006.
[8] C. Zhan et al. An efficient decoder scheme for double binary circular turbo codes. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), pages 229–232, 2006.
[9] G. Masera, G. Piccinini, M. Roch, and M. Zamboni. VLSI architectures for turbo codes. IEEE Trans. VLSI Syst., volume 7, pages 369–379, 1999.
[10] P. Salmela, R. Gu, S.S. Bhattacharyya, and J. Takala. Efficient parallel memory organization for turbo decoders. In Proc. European Signal Processing Conf., pages 831–835, Sep. 2007.
[11] J. Sun and O.Y. Takeshita. Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Trans. Inform. Theory, vol. 51, Jan. 2005.
[12] C. Berrou et al. Designing good permutations for turbo codes: towards a single model. In IEEE Int. Conf. Commun., volume 1, pages 341–345, June 2004.
[13] Z. Wang and Q. Li. Very low-complexity hardware interleaver for turbo decoding. IEEE Trans. Circuits and Syst., vol. 54, pp. 636–640, Jul. 2007.
[14] M.J. Thul et al. A scalable system architecture for high-throughput turbo-decoders. The Journal of VLSI Signal Processing, pages 63–77, 2005.
[15] B. Bougard et al. A scalable 8.7-nJ/bit 75.6-Mb/s parallel concatenated convolutional (turbo-) codec. In IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2003.
[16] G. Prescher, T. Gemmeke, and T.G. Noll. A parametrizable low-power high-throughput turbo-decoder. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 25–28, Mar. 2005.
[17] S.-J. Lee, N.R. Shanbhag, and A.C. Singer. Area-efficient high-throughput MAP decoder architectures. IEEE Trans. VLSI Syst., vol. 13, pp. 921–933, Aug. 2005.
[18] P. Ituero and M. Lopez-Vallejo. New schemes in clustered VLIW processors applied to turbo decoding. In Proc. Int. Conf. on Application-specific Syst., Architectures and Processors (ASAP), pages 291–296, Sep. 2006.