
Suvam Nandi
Department of Instrumentation, Indian Institute of Science, Bangalore, India
Email: suvam@isu.iisc.ernet.in

K. Rajan
Department of Physics, Indian Institute of Science, Bangalore, India
Email: rajan@physics.iisc.ernet.in

Prasenjit Biswas
SuperComputer Education & Research Centre, Indian Institute of Science, Bangalore, India
Email: prasenjit@cadl.iisc.ernet.in

978-1-4244-4547-9/09/$26.00 © 2009 IEEE. TENCON 2009.

... most important building blocks for the emerging video coding standard, viz. H.264. The conventional implementation approximates the transform matrix elements to facilitate integer arithmetic, for which hardware is suitably prepared. Though the transform coding does not involve any multiplications, the quantization process requires sixteen 16-bit multiplications. The algorithm used here eliminates the approximation in transform coding and the multiplication in the quantization process through the use of algebraic integer coding. We propose an area-efficient implementation of the transform and quantization blocks based on algebraic integer coding. The designs were synthesized with 90 nm TSMC CMOS technology and were also implemented on a Xilinx FPGA. The gate count and throughput achieved are 7000 gates and 125 Msamples/s, respectively.

I. INTRODUCTION

H.264/AVC provides an outstanding compression gain compared to previous video coding standards. H.264 also introduces many innovations that were not employed in the previous standards, such as hybrid predictive/transform coding of intra frames and integer transforms.

The integer 4x4 transform proposed in H.264 [3] is an approximation to the 4x4 DCT, which results in performance degradation. The algorithm proposed in [1] is error-free because it does not make these approximations. It also provides shift-and-add tables that do away with the 16-bit multiplications in the quantization stage.

Therefore, the VLSI implementation contains only additions and shifts, which points toward lower resource utilization and a faster design.

A. 4x4 DCT: Conventional Implementation

The forward 4x4 DCT of a sample block is given by

$$Y = A X A^T \tag{1}$$

and the IDCT is given by

$$X = A^T Y A \tag{2}$$

where $X$ is the residual data block, $Y$ is a matrix of coefficients, and $A$ is a 4x4 transform matrix given by

$$A = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} \tag{3}$$

where $a = \tfrac{1}{2}$, $b = \sqrt{\tfrac{1}{2}}\cos\tfrac{\pi}{8}$, $c = \sqrt{\tfrac{1}{2}}\cos\tfrac{3\pi}{8}$.

The matrix multiplication in equation (1) can be factorized into the following form, where $\otimes$ denotes element-by-element multiplication:

$$Y = (C X C^T) \otimes E_f \tag{4}$$

where

$$C = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{bmatrix}, \qquad E_f = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \tag{5}$$

and $d = c/b$.

In H.264, to simplify implementation, $d$ is approximated by 0.5. To ensure orthogonality, $b$ also needs to be modified [2]. Due to these modifications, the results of the H.264 transform are not identical to those of the 4x4 DCT.

B. New Algorithm

The proposed algorithm [1] makes use of algebraic integers. Algebraic integers are defined as real numbers that are roots of monic polynomials with integer coefficients [5].

If we denote

$$z = 2\cos\tfrac{\pi}{8} = \sqrt{2 + \sqrt{2}} \tag{6}$$

then $z$ is a root of the polynomial $F(x) = x^4 - 4x^2 + 2$. So, in the polynomial expansion

$$f(z) = \sum_{i=0}^{3} a_i z^i \tag{7}$$

$z$ corresponds to the particular choice $a_i = (0, 1, 0, 0)$.

Therefore, if we have

$$a' = \sqrt{\tfrac{1}{2}}, \qquad b' = 2\cos\tfrac{\pi}{8}, \qquad c' = 2\cos\tfrac{3\pi}{8} \tag{8}$$

then $b = 2^{-1} a' b'$ and $c = 2^{-1} a' c'$. With these intermediate definitions, all elements can be represented as combinations of the $a_i$ [1].

Using these, the actual 4x4 DCT algorithm is modified. In equation (4), write $C = C_1 + dC_2$, where

$$C_1 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & -1 \\ 1 & -1 & -1 & 1 \\ 0 & -1 & 1 & 0 \end{bmatrix}, \qquad C_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix} \tag{9}$$
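The representation in equation (7) can be made concrete with a few lines of Python. The sketch below is only an illustration and not the implementation from [1]: it stores an element as the integer tuple (a0, a1, a2, a3) and multiplies two elements exactly by reducing powers of z with z^4 = 4z^2 - 2, which follows from F(z) = 0. The helper names `ai_mul` and `ai_eval` are hypothetical.

```python
import math

# z = 2*cos(pi/8) = sqrt(2 + sqrt(2)), a root of F(x) = x^4 - 4x^2 + 2
Z = math.sqrt(2 + math.sqrt(2))


def ai_mul(p, q):
    """Multiply a0 + a1*z + a2*z^2 + a3*z^3 by another such element.

    All coefficients stay integers; powers z^4..z^6 are folded back
    into the basis (1, z, z^2, z^3) using z^4 = 4z^2 - 2.
    """
    prod = [0] * 7
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            prod[i + j] += pi * qj
    # z^k = 4*z^(k-2) - 2*z^(k-4); reduce the highest degree first
    for k in (6, 5, 4):
        coeff = prod[k]
        prod[k] = 0
        prod[k - 2] += 4 * coeff
        prod[k - 4] -= 2 * coeff
    return tuple(prod[:4])


def ai_eval(p):
    """Floating-point value of the element, used only for checking."""
    return sum(c * Z ** i for i, c in enumerate(p))


if __name__ == "__main__":
    z = (0, 1, 0, 0)    # the element z itself, as in equation (7)
    d = (-3, 0, 1, 0)   # z^2 - 3
    print(ai_mul(z, z))
    print(ai_mul(d, d), ai_eval(d))
```

For example, the tuple (-3, 0, 1, 0) stands for z^2 - 3 = sqrt(2) - 1, which equals c/b = d. Squaring it with `ai_mul` gives (7, 0, -2, 0), i.e. 7 - 2z^2, without any rounding.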

Factorizing the original matrix multiplication in equation (4) using $C = C_1 + dC_2$, we get [1]

$$W_a = E + dk + d^2 h = E + K + H \tag{10}$$

$$Y = W_a \otimes E_f \tag{11}$$

where $E = C_1 X C_1^T$, $K = dk$, $H = d^2 h$, and

$$h = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & E_{44} & 0 & -E_{42} \\ 0 & 0 & 0 & 0 \\ 0 & -E_{24} & 0 & E_{22} \end{bmatrix} \tag{12}$$

$$k = \begin{bmatrix} 0 & -E_{14} & 0 & E_{12} \\ -E_{41} & -E_{24} - E_{42} & -E_{43} & E_{22} - E_{44} \\ 0 & -E_{34} & 0 & E_{32} \\ E_{21} & -E_{44} + E_{22} & E_{23} & E_{42} + E_{24} \end{bmatrix} \tag{13}$$

So, $E$ can be computed with the help of a butterfly structure, after which $d$ needs to be applied.

Since $d = c/b = z^2 - 3$ (using the intermediate definitions in equations (6) and (8)), $d$ can be represented with good enough precision [1] by $1 - 2^{-1} - 2^{-4} - 2^{-6} - 2^{-7}$.

The post-scaling operation ($\otimes E_f$) is absorbed into the quantization process. Since the operations are scalar multiplications rather than matrix multiplications, the tables of multiplication and scaling factors can be replaced with corresponding shift-and-add tables [1] with good precision.

The tables for inverse scaling and quantization are also provided, and with the help of the reverse transform steps (refer to Table I), we are able to design both the forward and the inverse DCT/Quant process.

TABLE I
STEPS AS PER NEW ALGORITHM

Step   Forward DCT/Quant
1      $W_a = C X C^T$
2      $qbits = \mathrm{floor}(QP/6)$
3      $az_a = [\mathrm{abs}(W_a) \otimes MF_a] \gg qbits$
4      $Z = \mathrm{round}(az_a) \otimes \mathrm{sign}(W_a)$

Step   Inverse DCT/Quant
1      $W_a' = Z \otimes V_a \times 2^{qbits}$
2      $X_r = \mathrm{round}(C^T W_a' C)$

QP is the quantization parameter that enables the encoder to accurately and flexibly control the trade-off between bit rate and quality. $MF_a$ and $V_a$ are the tables of multiplication factors and rescaling factors, respectively, expressed in algebraic integers for the corresponding 16-bit multiplicand tables of the original H.264 transform.

II. HARDWARE IMPLEMENTATION

The digital design for the proposed algorithm has been carried out and verified using VHDL in the Xilinx tools. Some modifications were made to the algorithm to gain speed, although the overall optimization target has been area.

A. Modifications

As per equation (10), $d$ is applied only after $E$ has been calculated, to arrive at the final matrix of transform coefficients. The butterfly structure used to calculate $E$ is shown in Figure 1.

Fig. 1. $E = C_1 X C_1^T$

The butterfly structure for the 1-D transform in the integer 4x4 DCT [3] can be modified slightly, as in Figure 2, to include $d$, so that the transform coefficient matrix $W_a$ of equation (10) is obtained at the end of the 2-D transform butterfly computation, with only the post-scaling and quantization steps remaining.

Fig. 2.

Note that, with proper pipelining stages and without much loss in speed, the hardware can be built with just one digital block implementing $d$, rather than two placed consecutively to mimic $d^2$ as required by equation (10).

B. Salient features of the Architecture

The block diagram in Figure 3 gives the top-level structure of the architecture. For sequential logic, it is difficult to implement equation (1) directly. So the architecture is designed for $C(CX^T)^T$, which is equivalent to $CXC^T$. A 16-register memory bank takes the input and stores the intermediate as well as the final results.

Every register is accessed twice during the computation phase, once at the end of each of the two 1-D transforms, i.e. $X' = CX^T$ and $Y = CX'^T$, apart from the load/store phase, when input is saved into these registers and output is taken. During the load/store phase, the 16-register bank is connected as a shift register, so that inputs are taken in from one side and results are taken out from the other. Minor modifications to the design would allow all 16 registers to be loaded or stored in a single clock cycle.

Because of pipelining, the 4 results from the computation unit, which performs the 1-D transform with the help of the butterfly structure, are obtained in individual cycles. Suitable multiplexing, along with the corresponding demultiplexing on the register input side, saves resources and inter-block wiring area at the cost of a few muxes and demuxes and, of course, reduced speed.

Therefore, we need 16 (once for every register) × 2 (sets of 1-D transforms) = 32 cycles to complete the DCT/quantization. The quantization steps are performed at the end of both sets of 1-D transforms, irrespective of whether the result is required or rejected. The quantized results are meaningful only at the end of the 2-D transform, and are then saved in the memory bank.

This stretches the total time, as the clock period must account for the worst-case calculation, even though the quantization calculations are necessary for only one half of the 32 cycles, i.e., after the 2-D transform is over. Still, as the results show, the speed achieved is fairly decent, so this does not pose a concern.

This kind of pipelining helps optimize area by a factor of 2. The registers are all 16-bit, so the input can be at most 12 bits wide to prevent overflow in the worst case.

1) Pipeline Stages: The pipeline stages are realized in the memory bank and the computation unit.

As per the timeline of events (Figure 4), in the computation phase, 4 of the 16 (= 4x4) input data stored in the memory bank are first selected and presented to the computation unit. This is done with the help of multiplexers. Since the butterfly structure is designed for 4 inputs and 4 outputs, we have to stick with 4 data at a time. In the first cycle, these constitute the first of the 4 columns of $X^T$ ($X' = CX^T$).

Fig. 4.

In the computation unit, these inputs or their complements are provided to the first set of adders (refer to Figure 5). The butterfly computations can be represented in terms of the general equation $X(N) = f_N(x(0), x(1), x(2), x(3))$, $N \in [0, 3]$. The combination of inputs to the adders in the computation unit is modified every cycle to realize $f_N$ over 4 cycles, and this is repeated until the 2-D transform calculation is over. Thus, at the output of the last adder, the 4 results obtained in consecutive cycles form the complete vector of 1-D transform output data $(X(0), X(1), X(2), X(3))$. The multiplexing ensures that a minimum number of adders is used. Also, just one block implementing $d = 1 - 2^{-1} - 2^{-4} - 2^{-6} - 2^{-7} \approx 0.414$ is used. The input to the computation unit is held constant during these 4 cycles.

Fig. 5. Block Diagram of the Computation Unit

The other important block is the memory bank. Apart from the 16-register bank, it contains some other digital blocks, as shown in Figure 6.

Fig. 6. Inside the Storage Unit

Inside this block, the quantizer steps (steps 2, 3 and 4 of Table I) are also carried out. The quantizer block takes the inputs $qp = \mathrm{floor}(QP/6)$ and $qm = QP \bmod 6$, the latter being necessary to index the $MF_a$ table. The 4 LSBs padded to the input residual samples to serve as the fractional part are dropped, and the integer part is rounded off based on the contents of these 4 bits, in the final stage of this block.

The quantizer block is active in both passes of the 1-D transform, though it is necessary only at the end of the second pass, when the 2-D transform is complete. This stretches our timing, to an extent unnecessarily, in the first pass. The results from the quantizer, however, may be bypassed.

The output from the computation unit is passed through unchanged during the first pass of the 1-D transform, $X' = CX^T$, i.e., for half of the total 32 cycles of the computation phase. This is ensured by a mux with bit 4 of the mod-32 counter as its select line.

The demux at the input of the memory bank selects one of the 16 registers into which the current result from the computation unit is written. Since the design is pipelined and all results arrive on the same bus from the computation unit, this ensures that nothing in the memory is overwritten and that each result reaches its proper destination. This is crucial because in a matrix, here 4x4, the significance of a datum also depends on its location. The select lines of the demux come from the mod-32 counter.

The mux at the output of the memory bank selects a four-element input vector from the 16 stored data. This is not completely straightforward, as the matrix multiplication implemented is of the form $C(CX^T)^T$. So the inputs must be selected with care as to whether they come from the input sample matrix $X$ directly or from its transpose, $X^T$.

III. RESULTS

The area-optimized DCT/Quant blocks were first synthesized with Xilinx Project Navigator 10.01 for a Xilinx Virtex 5 (xc5vlx30). In addition to the FPGA synthesis, the blocks were also synthesized with Synopsys Design Compiler in 90 nm TSMC CMOS technology.

This gives reliable area and timing information for each block. To make the data comparable, the area is expressed as the gate count of the design, obtained by normalizing the total cell area by the area of a 2-input NAND gate synthesized in the same technology.

The performance results are summarized in Table II. For comparison, typical 4x4 implementations reported earlier in [4], [6], [7] are tabulated.

TABLE II
RESULT COMPARISON

              [4]              [6]              [7]                               This architecture
Function      DCT              DCT              DCT/Quant                         DCT/Quant
Bit width     9 bits           8 bits           16 bits                           16 bits
Area          3737 gates       10,000 gates     DCT block requires 294 gates      7000 gates
                                                + 65 FFs + 256 bits R/W memory
                                                + 64 bits of ROM
Throughput    500 Msamples/s   137 Msamples/s   10 Msamples/s                     125 Msamples/s

In [7], the design implements the integer transform algorithm proposed in [3]. The authors have provided area-optimized and speed-optimized designs; the data used here for comparison are those of the area-optimized design. Since the authors of [4] have not included a quantization block, a direct comparison is difficult, and the speed data are of little practical importance without quantization, as it is the most time-consuming part of the DCT/Quant architecture.

Without the pipelining in the computation unit, the vector output of the butterfly structure can be obtained in one cycle, at the cost of more resources. This variant of the original design is faster, at 500 Msamples/s, but consumes around 14,500 gates.

The IDCT steps of the proposed algorithm [1] are implemented with almost the same top-level structure. The internal implementation differs, as the butterfly structure needs to be modified (refer to Inverse DCT/Quant step 2 in Table I). This design has also been synthesized; it consumes 7500 gates at 250 Msamples/s.
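The forward transform evaluated by this architecture rests on the factorization of equations (10), (12) and (13). As a sanity check on that factorization (not a model of the hardware), the sketch below builds k and h from the entries of E and compares E + dk + d^2 h with a direct floating-point evaluation of C X C^T. The function names and the sample block are made up for illustration.

```python
import math

D = math.sqrt(2) - 1  # d = c/b = z^2 - 3

C1 = [[1, 1, 1, 1], [1, 0, 0, -1], [1, -1, -1, 1], [0, -1, 1, 0]]
C2 = [[0, 0, 0, 0], [0, 1, -1, 0], [0, 0, 0, 0], [1, 0, 0, -1]]


def matmul(a, b):
    """Plain 4x4 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def factored_transform(x):
    """W_a = E + d*k + d^2*h, with k and h taken from equations (13) and (12)."""
    e = matmul(matmul(C1, x), transpose(C1))

    def E(r, c):
        # 1-based accessor so the entries read like the paper's E_rc
        return e[r - 1][c - 1]

    k = [[0, -E(1, 4), 0, E(1, 2)],
         [-E(4, 1), -E(2, 4) - E(4, 2), -E(4, 3), E(2, 2) - E(4, 4)],
         [0, -E(3, 4), 0, E(3, 2)],
         [E(2, 1), -E(4, 4) + E(2, 2), E(2, 3), E(4, 2) + E(2, 4)]]
    h = [[0, 0, 0, 0],
         [0, E(4, 4), 0, -E(4, 2)],
         [0, 0, 0, 0],
         [0, -E(2, 4), 0, E(2, 2)]]
    return [[e[i][j] + D * k[i][j] + D * D * h[i][j] for j in range(4)]
            for i in range(4)]


def direct_transform(x):
    """Reference: C X C^T with C = C1 + d*C2 evaluated directly."""
    c = [[C1[i][j] + D * C2[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(c, x), transpose(c))


if __name__ == "__main__":
    block = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
    w = factored_transform(block)
    ref = direct_transform(block)
    print(max(abs(w[i][j] - ref[i][j]) for i in range(4) for j in range(4)))
```

In the factored form, E needs only additions and subtractions, and d enters through the two scalar factors d and d^2. This is what allows the hardware to reuse a single shift-and-add block for d.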

IV. CONCLUSIONS

We have presented an architecture for the H.264/AVC 4x4 DCT and quantization based on the multiplication-free, error-free algorithm proposed in [1]. The algorithm avoids the approximations made in the H.264 integer transform, so the transform results are closer to those of the 4x4 DCT.

In addition, because no multiplications are involved, the design needs only adders and shifters, which makes it less complex and faster.
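The multiplier-free arithmetic can be illustrated with the constant d. The sketch below applies d ≈ 1 - 2^-1 - 2^-4 - 2^-6 - 2^-7 using only shifts and subtractions. It assumes 7 fractional bits so that every shift is exact for integer inputs; this word format is an assumption made for the example (the design described earlier pads 4 LSBs), and the function names are hypothetical.

```python
import math

FRAC_BITS = 7  # assumption for this sketch only, not the paper's word format


def to_fixed(x):
    """Integer sample -> fixed point with FRAC_BITS fractional bits."""
    return x << FRAC_BITS


def to_float(x_fixed):
    """Fixed point -> float, for inspection only."""
    return x_fixed / (1 << FRAC_BITS)


def times_d(x_fixed):
    """Multiply by d ~ 1 - 2^-1 - 2^-4 - 2^-6 - 2^-7 with shifts and subtractions."""
    return (x_fixed
            - (x_fixed >> 1)
            - (x_fixed >> 4)
            - (x_fixed >> 6)
            - (x_fixed >> 7))


if __name__ == "__main__":
    sample = 100
    approx = to_float(times_d(to_fixed(sample)))
    exact = sample * (math.sqrt(2) - 1)
    print(approx, exact, abs(approx - exact))
```

For an input of 100 the sketch returns 41.40625, against roughly 41.4214 for the exact value 100(sqrt(2) - 1).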

REFERENCES

[1] Mohammad Norouzi, Karim Mohammadi, and Mohammad Mahdy Azadfar, Multiplication and Error Free Implementation of H.264 like 4x4 DCT/Quan IQuan/IDCT using Algebraic Integer Encoding, International Journal of Computer Science and Network Security, Vol. 6, No. 9B, September 2006.

[2] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd., Sussex, England, December 2003.

[3] Henrique S. Malvar, Antti Hallapuro, Marta Karczewicz, and Louis Kerofsky, Low-Complexity Transform and Quantization in H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.

[4] Tu-Chih Wang, Yu-Wen Huang, Hung-Chi Fang, and Liang-Gee Chen, Parallel 4x4 2D Transform and Inverse Transform Architecture for MPEG-4 AVC/H.264, Proceedings of the 2003 International Symposium on Circuits and Systems, Vol. 2, May 2003, pp. II-800-II-803.

[5] K. A. Wahid, V. S. Dimitrov, and G. A. Jullien, Error-Free Arithmetic for Discrete Wavelet Transforms using Algebraic Integers, Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03), 2003.

[6] Luciano Agostini, Roger Porto, Sergio Bampi, Leandro Rosa, José Güntzel, and Ivan Saraiva Silva, High Throughput Architecture for H.264/AVC Forward Transforms Block, Proceedings of the 16th ACM Great Lakes Symposium on VLSI, Philadelphia, PA, USA, 2006, pp. 320-323.

[7] Roman Kordasiewicz and Shahram Shirani, On Hardware Implementations of DCT and Quantization Blocks for H.264/AVC, Journal of VLSI Signal Processing, Vol. 47, pp. 93-102, 2007.
