
Hardware Implementation of 4x4 DCT/Quantization Block Using Multiplication and Error-Free Algorithm

Suvam Nandi
Department of Instrumentation, Indian Institute of Science, Bangalore, India

K. Rajan
Department of Physics, Indian Institute of Science, Bangalore, India

Prasenjit Biswas
SuperComputer Education & Research Centre, Indian Institute of Science, Bangalore, India

978-1-4244-4547-9/09/$26.00 © 2009 IEEE. TENCON 2009.

Abstract—The 4x4 discrete cosine transform (DCT) is one of the most important building blocks of the emerging video coding standard H.264. The conventional implementation approximates the transform matrix elements to facilitate integer arithmetic, for which hardware is well suited. Although the transform coding then involves no multiplications, the quantization process requires sixteen 16-bit multiplications. The algorithm used here eliminates both the approximation in transform coding and the multiplications in quantization by means of algebraic integer coding. We propose an area-efficient implementation of the transform and quantization blocks based on algebraic integer coding. The designs were synthesized in 90 nm TSMC CMOS technology and were also implemented on a Xilinx FPGA. The achievable gate count and throughput are 7000 gates and 125 Msamples/s.

I. INTRODUCTION

H.264/AVC provides an outstanding compression gain compared with previous video coding standards. H.264 also introduces many innovations that were not employed in previous standards, such as hybrid predictive/transform coding of intra frames and integer transforms.

The integer 4x4 transform proposed in H.264 [3] is an approximation to the 4x4 DCT, which degrades performance. The algorithm proposed in [1] is error-free because it avoids these approximations. It also provides shift-and-add tables that do away with the 16-bit multiplications in the quantization stage.

The VLSI implementation therefore contains only additions and shifts, which points toward lower resource utilization and a faster design.

A. 4x4 DCT: Conventional Implementation

The forward 4x4 DCT of a sample block is given by

\[ Y = A X A^T \tag{1} \]

and the IDCT by

\[ X = A^T Y A \tag{2} \]

where $X$ is the residual data block, $Y$ is the matrix of coefficients, and $A$ is the 4x4 transform matrix

\[ A = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} \tag{3} \]

with $a = \frac{1}{2}$, $b = \sqrt{\frac{1}{2}} \cos\frac{\pi}{8}$, $c = \sqrt{\frac{1}{2}} \cos\frac{3\pi}{8}$.

The matrix multiplication in equation (1) can be factorized into the form

\[ Y = (C X C^T) \otimes E_f \tag{4} \]

where

\[ C = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{bmatrix}, \quad E_f = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \tag{5} \]

and $d = c/b$.

In H.264, to simplify implementation, $d$ is approximated by 0.5. To ensure orthogonality, $b$ also needs to be modified [2]. Because of these modifications, the results of the H.264 transform are not identical to those of the 4x4 DCT.

B. New Algorithm

The proposed algorithm [1] makes use of algebraic integers, which are defined as real numbers that are roots of monic polynomials with integer coefficients [5].

If we denote

\[ z = 2 \cos\frac{\pi}{8} = \sqrt{2 + \sqrt{2}} \tag{6} \]

then $z$ is a root of the polynomial $F(x) = x^4 - 4x^2 + 2$. So, in the polynomial expansion

\[ f(z) = \sum_i a_i z^i \tag{7} \]

$z$ itself corresponds to the particular choice $a_i = (0, 1, 0, 0)$.

Therefore, if we have

\[ a' = \sqrt{\frac{1}{2}}, \quad b' = 2 \cos\frac{\pi}{8}, \quad c' = 2 \cos\frac{3\pi}{8} \tag{8} \]

then $b = 2^{-1} a' b'$ and $c = 2^{-1} a' c'$. With these intermediate definitions, we can represent all elements as combinations of the $a_i$ [1].

Using these, the actual 4x4 DCT algorithm is modified. In equation (4), let $C = C_1 + d C_2$, where

\[ C_1 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & -1 \\ 1 & -1 & -1 & 1 \\ 0 & -1 & 1 & 0 \end{bmatrix}, \quad C_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix} \tag{9} \]
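The factorization of equation (1) into equation (4), the split $C = C_1 + d C_2$, and the claim that $z$ is a root of $F(x)$ can all be checked numerically. The following is a minimal floating-point sketch in plain Python, not part of the original design; the sample block `X` and the helper names `matmul` and `transpose` are arbitrary choices made here for illustration.

```python
import math

def matmul(p, q):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

# Constants of equation (3) and d = c/b from equation (5).
a = 0.5
b = math.sqrt(0.5) * math.cos(math.pi / 8)
c = math.sqrt(0.5) * math.cos(3 * math.pi / 8)
d = c / b

# Exact transform matrix A, equation (3).
A = [[a, a, a, a], [b, c, -c, -b], [a, -a, -a, a], [c, -b, b, -c]]

# C assembled as C1 + d*C2, equation (9).
C1 = [[1, 1, 1, 1], [1, 0, 0, -1], [1, -1, -1, 1], [0, -1, 1, 0]]
C2 = [[0, 0, 0, 0], [0, 1, -1, 0], [0, 0, 0, 0], [1, 0, 0, -1]]
C = [[C1[i][j] + d * C2[i][j] for j in range(4)] for i in range(4)]

# Post-scaling matrix E_f of equation (5): outer product of (a, b, a, b).
s = [a, b, a, b]
Ef = [[s[i] * s[j] for j in range(4)] for i in range(4)]

# Arbitrary residual block, chosen only for this check.
X = [[5, -11, 8, 10], [9, 8, -4, 12], [1, 10, 11, -4], [19, -6, 15, 7]]

# Equation (1) versus equation (4).
Y_direct = matmul(matmul(A, X), transpose(A))
W = matmul(matmul(C, X), transpose(C))
Y_factored = [[W[i][j] * Ef[i][j] for j in range(4)] for i in range(4)]

max_diff = max(abs(Y_direct[i][j] - Y_factored[i][j])
               for i in range(4) for j in range(4))

# z from equation (6) should satisfy F(z) = z^4 - 4 z^2 + 2 = 0.
z = 2 * math.cos(math.pi / 8)
poly_residual = z**4 - 4 * z**2 + 2

print(max_diff, poly_residual)
```

Both printed values are expected to sit at floating-point round-off level, which is consistent with the two forms of the transform being algebraically identical.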
Factorizing the original matrix multiplication in equation (4) using $C = C_1 + d C_2$, we get [1]

\[ W_a = E + d k + d^2 h = E + K + H \tag{10} \]

\[ Y = W_a \otimes E_f \tag{11} \]

where $E = C_1 X C_1^T$ and

\[ h = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & E_{44} & 0 & -E_{42} \\ 0 & 0 & 0 & 0 \\ 0 & -E_{24} & 0 & E_{22} \end{bmatrix} \tag{12} \]

\[ k = \begin{bmatrix} 0 & -E_{14} & 0 & E_{12} \\ -E_{41} & -E_{24} - E_{42} & -E_{43} & E_{22} - E_{44} \\ 0 & -E_{34} & 0 & E_{32} \\ E_{21} & -E_{44} + E_{22} & E_{23} & E_{42} + E_{24} \end{bmatrix} \tag{13} \]

So $E$ can be computed with a butterfly structure, after which $d$ needs to be applied.

From $d = c/b = z^2 - 3$ (using the intermediate definitions in equations (6) and (8)), $d$ can be represented with good enough precision [1] by $1 - 2^{-1} - 2^{-4} - 2^{-6} - 2^{-7}$.

The post-scaling operation ($\otimes E_f$) is absorbed into the quantization process. Since the operations are scalar multiplications rather than matrix multiplications, the tables of multiplication and scaling factors can be replaced with corresponding shift-and-add tables [1] with good precision.

Tables for inverse scaling and quantization are also provided, and with the help of the reverse transform steps (see Table I) we are able to design both the forward and the inverse DCT/Quant processes.

Table I

Step   Forward DCT/Quant
1      $W_a = C X C^T$
2      $qbits = \mathrm{floor}(QP/6)$
3      $az_a = [\mathrm{abs}(W_a) \otimes MF_a] \gg qbits$
4      $Z = \mathrm{round}(az_a) \otimes \mathrm{sign}(W_a)$

Step   Inverse DCT/Quant
1      $W'_a = Z \otimes V_a \times 2^{qbits}$
2      $X_r = \mathrm{round}(C^T W'_a C)$

QP is the quantization parameter, which enables the encoder to control the trade-off between bit rate and quality accurately and flexibly. $MF_a$ and $V_a$ are the tables of multiplication factors and rescaling factors, respectively, expressed in algebraic integers and corresponding to the 16-bit multiplicand tables of the original H.264 transform.

II. HARDWARE IMPLEMENTATION

The digital design for the proposed algorithm has been carried out and verified using VHDL in Xilinx. Some modifications were made to the algorithm to gain speed, although the overall optimization target has been area.

A. Modifications

As per equation (10), $d$ is applied only after $E$ has been calculated, to arrive at the final matrix of transform coefficients. To calculate $E$, the butterfly structure looks like Figure 1.

Fig. 1. $E = C_1 X C_1^T$

The butterfly structure used to obtain the 1-D transform in the integer 4x4 DCT [3] can be modified slightly, as in Figure 2, to include $d$, so that the transform coefficient matrix $W_a$ of equation (10) is obtained at the end of the 2-D transform butterfly computation, with only the post-scaling and quantization steps remaining.

Fig. 2.

Note that, with proper pipelining stages and without much loss in speed, the hardware can be built with just one digital block implementing $d$, rather than two placed consecutively to mimic $d^2$ as required by equation (10).

B. Salient features of the Architecture

The block diagram in Figure 3 gives the top-level structure of the architecture. For sequential logic, it is difficult to implement equation (1). So the architecture is designed for $C (C X^T)^T$, which of course is equivalent to $C X C^T$. There is a 16-register memory bank that takes the input and stores the intermediate and the final results.

Every register is accessed twice, once at the end of each of the two 1-D transforms, i.e. $X' = C X^T$ and $Y = C X'^T$, during the computation phase, apart from the load/store phase when
input is saved into these registers and output taken. During the load/store phase, the 16-register bank is connected in a shift-register formation, so that inputs are taken from one side and results from the other. Of course, minor modifications to the design would allow all 16 registers to be loaded or stored in a single clock cycle.

Fig. 3. Top-Level Block Diagram

From the computation unit, which performs the 1-D transform with the help of the butterfly structure, the 4 results are obtained in individual cycles because of pipelining. Suitable multiplexing, along with the corresponding demultiplexing on the register input side, saves precious resources and inter-block wiring area at the cost of a few muxes and demuxes and, of course, reduced speed.

Therefore, we need 16 (once for every register) × 2 (sets of 1-D transform) cycles to complete the task of DCT/Quantization. After each of the two sets of 1-D transforms, the quantization steps are performed, irrespective of whether the result is required or rejected. The quantized results are meaningful only at the end of the 2-D transform, and are then saved in the memory bank.

This stretches the total time, because the clock period must account for the worst-case calculations involved, even though the quantization calculations are necessary for only one half of the 32 cycles, i.e., after the 2-D transform is over. Still, as we will see later, the reported speed is fairly decent, so this does not pose a concern.

This kind of pipelining actually helps us optimize area by a factor of 2. The registers are all 16-bit, so the input can be at most 12-bit to prevent overflow in the worst case.

1) Pipeline Stages: The pipeline stages are realized in the memory bank and the computation unit.

As per the timeline of events (Figure 4) in the computation phase, an appropriate selection of 4 of the 16 (= 4x4) input data stored in the memory bank is first presented to the computation unit. This is done with the help of multiplexers. Since the butterfly structure is designed for 4 inputs and 4 outputs, we have to stick with 4 data. This selection constitutes the first of the 4 columns of $X^T$ ($X' = C X^T$) in the first cycle.

Fig. 4.

In the computation unit, these inputs or their complements are provided to the first set of adders (see Figure 5). The butterfly computations can be represented in terms of the general equation $X(N) = f_N(x(0), x(1), x(2), x(3))$, $N = [0, 3]$. The combination of inputs to the adders in the computation unit is modified every cycle to realize $f_N$ over every 4 cycles, and this is repeated until the 2-D transform calculation is over. Thus, at the output of the last adder, the 4 results obtained in consecutive cycles form the complete vector of 1-D transform output data $(X(0), X(1), X(2), X(3))$. The multiplexing ensures that the minimum number of adders is used. Also, just one block implementing $d = 1 - 2^{-1} - 2^{-4} - 2^{-6} - 2^{-7} \approx 0.414$ is used. The input to the computation unit is held constant during these 4 cycles.

The other important block is the memory bank. Apart from the 16-register bank, it contains some other digital blocks, as shown in Figure 6.

Inside this block, the quantizer steps (steps 2, 3 and 4 of Table I) are also carried out. The quantizer block takes inputs $qp = \mathrm{floor}(QP/6)$ and $qm = QP \bmod 6$, the latter being necessary to reference the $MF_a$ table. The 4 LSBs padded to the input residual samples to serve as the fractional part are dropped, and the integer part is rounded off based on the contents of these 4 bits, in the final stage of this block.

The quantizer block is present in both passes of the 1-D transform, though it is necessary only at the end of the 2nd pass, when the 2-D transform is complete. This stretches our timing, to
an extent, unnecessarily, in the first pass. The results from the quantizer, however, may be bypassed.

Fig. 5. Block Diagram of the Computation Unit

Fig. 6. Inside the Storage Unit

The output from the computation unit is passed through unchanged during the first pass of the 1-D transform, $X' = C X^T$, i.e., for half of the 32 cycles needed in the computation phase. This is ensured by the mux that uses bit 4 of the mod32 counter as its select line.

The demux at the input of the memory bank is necessary to select the one register out of 16 into which the current result from the computation unit is written. Since the design is pipelined and all results arrive from the computation unit on the same bus, the demux ensures that nothing in the memory is overwritten and that each result reaches its proper destination. This is crucial because in a matrix, here 4x4, the importance of the data also relies on its appropriate location. The select lines to the demux come from the mod32 counter.

The mux at the output of the memory bank helps in selecting a four-element input vector from the 16 input data. This is not completely straightforward, as the matrix multiplication implemented is of the form $C (C X^T)^T$. So a proper selection of inputs is necessary, taking care of whether they come from the input sample matrix $X$ directly or from its transpose, $X^T$.

The area-optimized DCT/QUANT blocks were first synthesized with Xilinx Project Navigator 10.01 for Xilinx Virtex 5 (xc5vlx30). In addition to the synthesis performed for the FPGA, these blocks were also synthesized with Synopsys Design Compiler in the 90 nm TSMC CMOS technology.

This gives reliable area and timing information for each block. To get the data in comparable terms, the area used is expressed as the gate count of the design. This is obtained by normalizing the total cell area by the area of a 2-input NAND gate synthesized in the same technology.

The performance results are summarized in Table II. For comparison, typical 4x4 implementations reported earlier in [4], [6], [7] are tabulated.

Table II

            | Wung et al. [4] | Agostini et al. [6] | Shirani et al. [7]                                                | Proposed architecture
Bit width   | 9 bits          | 8 bits              | 16 bits                                                           | 16 bits
Area        | 3737 gates      | 10,000 gates        | DCT block requires 294 gates + 65 FFs + 256 bits R/W bits of ROM  | 7000 gates
Throughput  | 500 Msamples/s  | 137 Msamples/s      | 10 Msamples/s                                                     | 125 Msamples/s

In [7], the design implemented is for the integer transform algorithm proposed in [3]. The authors have provided area-optimized and speed-optimized designs; the data used here for comparison are those of the area-optimized design. Since the authors of [4] have not included a design for the quantization block, a comparison is difficult, and the speed data are of no practical importance without quantization, as it is the most time-consuming part of the DCT/Quant architecture.

Without the pipelining in the computation unit, the vector output from the butterfly structure can be obtained in one cycle, at the cost of more resources of course. This variant of the original design is faster, at 500 Msamples/s, but consumes around 14,500 gates.

The IDCT steps of the proposed algorithm [1] are implemented with almost the same top-level structure. Of course, the internal implementation is different, as the butterfly structure needs to be modified (see Inverse DCT/Quant step 2, Table I). This design has also been synthesized, and it consumes 7500 gates at 250 Msamples/s.
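To illustrate why inverse step 2 of Table I, $X_r = \mathrm{round}(C^T W'_a C)$, recovers the input block, the sketch below runs a block through $W_a = C X C^T$ and back. It is a floating-point illustration under stated assumptions, not the synthesized design: quantization is skipped entirely, and the rescaling that Table I folds into $V_a$ is modelled here by an element-wise multiplication with $E_f \otimes E_f$, which follows from $A = \mathrm{diag}(a, b, a, b)\,C$ and equation (2). The names `forward`, `inverse`, `matmul` and `transpose` are hypothetical helpers.

```python
import math

def matmul(p, q):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

# Constants from equations (3) and (5), kept in floating point.
a = 0.5
b = math.sqrt(0.5) * math.cos(math.pi / 8)
c = math.sqrt(0.5) * math.cos(3 * math.pi / 8)
d = c / b

C = [[1, 1, 1, 1], [1, d, -d, -1], [1, -1, -1, 1], [d, -1, 1, -d]]
s = [a, b, a, b]
Ef = [[s[i] * s[j] for j in range(4)] for i in range(4)]

def forward(X):
    """Forward step 1 of Table I: W_a = C X C^T (no post-scaling yet)."""
    return matmul(matmul(C, X), transpose(C))

def inverse(W):
    """Inverse step 2 of Table I with quantization omitted (assumption).

    Rescale W by E_f twice (forward post-scaling and inverse pre-scaling),
    then apply C^T ( . ) C and round to integers.
    """
    Wp = [[W[i][j] * Ef[i][j] ** 2 for j in range(4)] for i in range(4)]
    Xr = matmul(matmul(transpose(C), Wp), C)
    return [[round(v) for v in row] for row in Xr]

# Arbitrary residual block used only for this round-trip check.
X = [[5, -11, 8, 10], [9, 8, -4, 12], [1, 10, 11, -4], [19, -6, 15, 7]]
X_r = inverse(forward(X))
print(X_r == X)
```

With quantization included, the round trip would no longer be lossless; the sketch only shows that the transform pair itself is exact up to rounding.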
We have presented an architecture for the H.264/AVC 4x4 DCT/quantization based on the multiplication- and error-free algorithm of [1]. Because the proposed model does not approximate the transform matrix, the results obtained are more accurate.

In addition, the absence of multiplications makes the design less complex and faster.
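The multiplier-free property can be illustrated for the constant $d$, which the design represents as $1 - 2^{-1} - 2^{-4} - 2^{-6} - 2^{-7}$. The following is a minimal integer sketch, assuming the result is kept scaled by $2^7$ so that no bits are lost to right shifts; that scaling and the function name `mul_d_scaled` are choices made for this illustration, whereas the described hardware pads 4 fractional LSBs instead.

```python
import math

def mul_d_scaled(x: int) -> int:
    """Return 128 * d_hw * x using only shifts and subtractions.

    d_hw = 1 - 2^-1 - 2^-4 - 2^-6 - 2^-7, so
    128 * d_hw = 128 - 64 - 8 - 2 - 1 = 53.
    """
    return (x << 7) - (x << 6) - (x << 3) - (x << 1) - x

# Shift-and-add constant versus the exact value d = z^2 - 3.
D_HW = 1 - 2**-1 - 2**-4 - 2**-6 - 2**-7
z = 2 * math.cos(math.pi / 8)
D_EXACT = z * z - 3

approx_error = abs(D_EXACT - D_HW)
print(mul_d_scaled(100) / 128, D_EXACT * 100, approx_error)
```

The four subtractions correspond to four adder stages in hardware, which is the kind of cost that replaces a general multiplier.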
REFERENCES

[1] Mohammad Norouzi, Karim Mohammadi, and Mohammad Mahdy Azadfar, Multiplication and Error Free Implementation of H.264 like 4x4 DCT/Quan IQuan/IDCT using Algebraic Integer Encoding, International Journal of Computer Science and Network Security, vol. 6, no. 9B, September 2006.
[2] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd., Sussex, England, December 2003.
[3] Henrique S. Malvar, Antti Hallapuro, Marta Karczewicz, and Louis Kerofsky, Low-Complexity Transform and Quantization in H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003.
[4] Tu-Chih Wung, Yu-Wen Huang, Hung-Chi Fang, and Liang-Gee Chen, Parallel 4x4 2D Transform and Inverse Transform Architecture for MPEG-4 AVC/H.264, Proceedings of the 2003 International Symposium on Circuits and Systems, vol. 2, May 2003, pp. II-800-II-803.
[5] K. A. Wahid, V. S. Dimitrov, and G. A. Jullien, Error-Free Arithmetic for Discrete Wavelet Transforms using Algebraic Integers, Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03), 1063-6889/03, 2003.
[6] Luciano Agostini, Roger Porto, Sergio Bampi, Leandro Rosa, José Güntzel, and Ivan Saraiva Silva, High Throughput Architecture for H.264/AVC Forward Transforms Block, Proceedings of the 16th ACM Great Lakes Symposium on VLSI, Philadelphia, PA, USA, pp. 320-323, 2006.
[7] Roman Kordasiewicz and Shahram Shirani, On Hardware Implementations of DCT and Quantization Blocks for H.264/AVC, Journal of VLSI Signal Processing, vol. 47, pp. 93-102, 2007, Springer Science.