An IC Design For Real-Time Motion Estimation For H.264 Digital Video


Kenneth W. Hsu (kwheec@rit.edu), Xiang Li (shanelee_cn@yahoo.com), Rahul Chopra (rxc1173@osfmail.rit.edu)
Department of Computer Engineering, Rochester Institute of Technology, Rochester, New York 14623
ABSTRACT

An IC design that achieves real-time motion estimation and compensation encoding for the H.264 ITU video compression standard is presented. A full-search block matching algorithm has been adapted to a pipelined data flow to enable parallel processing of variable-block-size block matching and fractional-pixel motion vector generation. High definition TV (HDTV) requires wide bandwidth and a large amount of memory for digital video processing. The SOC is designed in TSMC 0.18 um technology using VHDL and optimized to achieve a 125 MHz clock speed, making real-time processing possible.

I. INTRODUCTION:

In today's fast-paced and constantly changing market for consumer electronics, the demand for high-quality video is growing rapidly. High Definition Televisions [HDTVs] are seeing a boom in demand, as is evident from the significant increase in programming and live sporting events available for them through all major networks. High-quality video, whether for HDTVs, Internet video conferencing, or live streaming and portable video on cellular phones over CDMA-based networks, requires extensive bandwidth. The bandwidth requirement grows with video quality, and video quality often has to be sacrificed for efficient streaming. Another significant drawback is the strain placed on memory and processing power, both of which are expensive in terms of silicon area.

Present-day DVD output delivers high-quality audio and video by representing both each video pixel and each audio sample (at 96 kHz) with 24-bit precision. A single DVD frame consists of 720x576 pixels, and at the North American video standard of 29.97 frames per second this leads to about 250 Mb/s of uncompressed data that needs to be transmitted [2]. Even by Internet-II standards, that is an enormous amount of data.

In video, as in any other signal, information is conveyed as divergence from the carrier wave or default signal. This appears as relatively small temporal differences over multiple frames, caused by objects and/or the camera changing position with respect to what was portrayed in the previous frame. Precise motion estimation makes these changes easier to predict and thus enables large gains in video compression. Removing redundancies with techniques such as block-based motion estimation and spatial intra-prediction ensures a low bit rate for streaming applications.

The H.264 standard, the latest effort by the ITU and JVT to improve on current video compression standards and practices in a bid to lower bit rate, provides adaptive and powerful coding schemes to tackle the memory and bandwidth issues at the heart of video compression. In the field of motion estimation, H.264 brings highly efficient and precise motion vector calculations, with motion vectors accurate to quarter-pixel resolution. It also includes variable-block-size block matching, an innovative de-blocking filter, Context-Adaptive Binary Arithmetic Coding [CABAC], Context-Adaptive Variable Length Coding, and a host of other techniques for compression and for mitigation of errors and packet losses during transmission, in order to achieve half the bit rate of its predecessors [2][3][4][5][6]. The computational complexity of the tasks required by the standard, coupled with the memory interactions involved, renders most motion estimation algorithms too slow for real-time System-On-Chip design [3][14].

In this paper, Section II first introduces the H.264 standard [6] and its predominant features, focusing on motion estimation; Section III discusses our algorithm and its implementation details; and Section IV presents our results.

II. OVERVIEW OF H.264/AVC:
The International Telecommunication Union [ITU] initiated the H.26L techniques in 1998 as a means to continue in the footsteps of the revolutionary results provided by the MPEG-2 standard and, more recently, the H.263 standard. The ITU's goal is to reduce the bit rate achieved by compression to half with every new standard. The H.264 standard is largely inherited from the lineage of past H.26L standards, as is evident from its block-based encoding approach as established by the MPEG and ITU standards. It preserves the prominent features of its predecessors: motion estimation to support inter-picture prediction, which eliminates temporal redundancies; spatial correlation within each frame for intra-prediction; residuals formed as the difference between predicted and source images; entropy encoding of the transformed residual coefficients and motion vectors; and the use of a discrete spatial transform and filtering to eliminate spatial redundancies in residuals [1][2][3][4][5][6][7][11]. The biggest difference lies in maximizing the compression returns by optimizing these basic algorithms and extracting as much savings as possible.

Fig. 1 illustrates the basic building blocks of H.264/AVC at the subsystem level. In this paper we are primarily concerned with the motion estimation subsystem, which calculates the motion vectors used for motion compensation in inter-prediction. Motion estimation is the most computationally intensive subsystem of video encoding. It involves large amounts of data transfer and data storage, and the variable block sizes of H.264 increase the computation further [4][5][6][7][11][14].

The motion model of the H.264 standard is similar to the H.263 motion model. It uses a similar block matching approach: frames are segmented into Macroblocks, and a Sum of Absolute Differences [SAD] or Mean Square Error [MSE] function is minimized against blocks in the vicinity of the original block co-ordinates in the previous frame. H.264/AVC also uses smaller Macroblock sizes to achieve greater precision and works with tree-structured motion compensation, i.e. motion compensation with variable Macroblock sizes. Tree-structured motion compensation allows the use of 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 sub-Macroblock sizes.

0-7803-9197-7/05/$20.00 2005 IEEE.
Fig. 1. H.264 AVC-Subsystem Level [1][3][4]

Compared with previous standards, variable block sizes together with quarter-pixel motion estimation increase the complexity and memory access bandwidth approximately four-fold, creating the need for a design that is highly pipelined in nature in order to use processing power efficiently. A programmable Very Long Instruction Word [VLIW] DSP on its own is not ideal: it relies on instruction-level parallelism [Fetch-Decode-Execute] to complete multiple algorithmic/logical tasks in the same cycle, which is poorly suited to the highly recursive data of a search-based algorithm [4]. A common solution to this problem is either a dedicated ASIC, like our design, or a highly programmable, pipelined VLIW DSP used in conjunction with FPGAs and large on-chip memories, so that the highly data-recursive operations are split off onto the FPGA and the DSP has extra idle cycles to perform its own task [4].

III. ALGORITHM & IMPLEMENTATION:

Many different algorithms are used to implement motion estimation for video compression. These include the Diamond Search [13], the Three Step Search [12], and Full Search Block Matching Motion Estimation algorithms [7][8]. A block matching algorithm is chosen because it is adaptable to hardware design. A block matching algorithm, though excellent for motion estimation, is very computationally intensive, which makes implementation of a real-time system difficult.

For our hardware implementation, we chose a pipelined version of the Block Matching Motion Estimation algorithm [8] proposed by industry pioneers K.M. Yang, M.T. Sun, and L. Wu. It achieves parallel processing with 100% efficiency of the Process Engine [PE] block by managing the data flow such that the PEs are never idle, except for an initial delay. The recursive data required is managed through a smart interconnect logic unit and a controller between an Address Generation Unit [AGU] and the memory to achieve pipelining.

In the Block Matching Motion Estimation process, the goal is to find the best-fit matching position of the Macroblock from the current frame within a pre-determined or adaptive search area in the previous frame [7][8][10]. Fig. 2 illustrates the basic block matching algorithm.
Fig. 2. Block Matching Illustration [1][10]. Labels in the figure: Current Frame with Macro Block [x,y]; Tracking/Previous Frame with Search Area, possible matching block [x+m,y+n], and MV [m,n].

The best-fit Macroblock in the tracking area is found by calculating the Sum of Absolute Differences [SAD] between the Macroblock in the current frame and all Macroblocks in the surrounding search area in the previous frame, and choosing the Macroblock with the least SAD. The SAD is computed from individual pixel differences in an overlap-and-compare technique, in which the current Macroblock is overlapped with and compared to every possible Macroblock in a pre-determined search area:

$$\mathrm{SAD}_{\min} = \min_{m,n}\left[\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\bigl|A[x+i,\,y+j] - B[x+i+m,\,y+j+n]\bigr|\right] \qquad (1)\ [14]$$

where A[x+i, y+j] is a pixel of the macro block from the current frame, B[x+i+m, y+j+n] is a pixel of a candidate matching block from the previous frame with candidate MV [m, n], m and n range over the search range, and N is the size of the macro block [14].

The block-based motion estimation uses Macroblocks of size 16x16, 8x8, or 4x4, and includes a novel hardware algorithm for quarter-pixel motion vector generation. For 16x16 Macroblock motion estimation, the design includes 16 Process Engines that calculate the pixel differences between a row of the current Macroblock and a row of the candidate Macroblock in the search area of the previous frame, and relay them to the Comparator, which adds the differences and compares the sums to find the minimum SAD. The key to high efficiency in the PEs is managing the data flow such that data used repeatedly at different searching positions is available at the correct clock cycle. For a 16x16 Macroblock, the search area extends from -7 to +8 index values around the current Macroblock's starting index, giving 16 x 16 = 256 possible starting co-ordinates and hence 256 candidate best-fit Macroblocks.

As is evident from equation 1, the overlap-and-compare technique of block matching involves a significant amount of data recursion: the same pixel co-ordinate is part of many different searching positions and thus of many different Macroblock searches. As an example, consider a[i,j] and b[k,l] in Fig. 3 to be pixel values in the current Macroblock and the previous frame respectively, with i, j, k, l representing indices. Let c denote the current block data, and let p and p' denote previous-frame pixel data in the corresponding Macroblock and in the rest of the tracking area, respectively. [Ia, Ja] and [Ib, Jb] denote the upper left corner addresses of a and b [8].

Fig. 3. Block Matching-Pixel Level [8].
For difference calculation using PEs, the same pixel b[Ib,Jb+15] is used in 16 different searching positions [8]. Taking advantage of this recursion, b[Ib,Jb+15] can be broadcast to all the PEs that require it to enable multiple searches for the same current Macroblock to occur simultaneously.
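The full-search procedure of equation (1) can be sketched in software as follows. This is a minimal reference model written for illustration, not the pipelined hardware data flow of [8]; the function names, the rule for skipping out-of-frame candidates, and the random test frames are assumptions introduced here.

```python
import random


def sad(cur, ref, x, y, m, n, size):
    """Sum of absolute differences between the current block at (x, y)
    and the reference block displaced by the candidate MV (m, n)."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(cur[x + i][y + j] - ref[x + i + m][y + j + n])
    return total


def full_search(cur, ref, x, y, size=16, lo=-7, hi=8):
    """Test every displacement in [lo, hi] x [lo, hi] (256 candidates for
    the defaults) and return (minimum SAD, best MV). Candidates that fall
    outside the reference frame are skipped; this boundary policy is an
    assumption of the sketch."""
    rows, cols = len(ref), len(ref[0])
    best = None
    for m in range(lo, hi + 1):
        for n in range(lo, hi + 1):
            inside = (0 <= x + m and x + m + size <= rows and
                      0 <= y + n and y + n + size <= cols)
            if not inside:
                continue
            cost = sad(cur, ref, x, y, m, n, size)
            if best is None or cost < best[0]:
                best = (cost, (m, n))
    return best


# Hypothetical test data: a random 48x48 reference frame and a current
# frame whose content equals the reference displaced by (2, -3).
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(48)] for _ in range(48)]
cur = [[ref[r + 2][c - 3] if r + 2 < 48 and c - 3 >= 0 else 0
        for c in range(48)] for r in range(48)]

best_sad, best_mv = full_search(cur, ref, 16, 16)
print(best_sad, best_mv)
```

Each of the 256 candidates reads 256 reference pixels, most of them shared with neighboring candidates, which is the redundancy the broadcast scheme exploits.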
Table 1. Data flow for broadcasting data to PEs [Start point 0,0] [8]
Table 1 illustrates the data flow for a 16x16 Macroblock size. Data flows for 8x8 and 4x4 Macroblocks and for quarter-pixel motion vector calculation are generated similarly. Fig. 4 shows the pipelining procedure, which uses flip-flops and multiplexers to provide recurring data to the PEs so that they can compute the differences as shown in Table 1. This forms the basis of the interconnect logic.
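The amount of data reuse that motivates the broadcast scheme can be checked with a short count. The sketch below assumes 16 x 16 search positions numbered from the upper-left corner b[Ib, Jb] of the tracking area; the helper names are hypothetical and the code models only which candidate blocks read a pixel, not the clock-by-clock schedule of Table 1.

```python
def windows_covering(index, block=16, positions=16):
    """Count the 1-D search offsets (0 .. positions-1) whose block-long
    window contains the given search-area index."""
    return sum(1 for start in range(positions)
               if start <= index < start + block)


def reuse_count(row, col, block=16, positions=16):
    """Number of candidate blocks that read search-area pixel (row, col),
    where (0, 0) stands for the upper-left corner b[Ib, Jb]."""
    return (windows_covering(row, block, positions) *
            windows_covering(col, block, positions))


side = 16 + 16 - 1  # the 256 candidates touch a 31 x 31 pixel region
total_reads = sum(reuse_count(r, c)
                  for r in range(side) for c in range(side))
print(reuse_count(0, 15), reuse_count(15, 15), total_reads)
```

Under these assumptions the pixel b[Ib, Jb+15] is read by 16 candidates, in agreement with the text, while a pixel near the center of the region is read by all 256.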
Fig. 4. Interconnect Logic for Pipelining between PEs and data [8].

Fig. 5. Core Motion Estimation Unit [8][10]. Blocks shown in the figure: Cur Frame Memory, Tracking Area Frame Memory, Address Generator Unit [AGU], StartPoint Gen. Unit [SGU], TopLevel Controller, Interconnect Logic, PEs, and Comparator.

Quarter-pixel motion vectors were generated by first generating quarter-pixel data through interpolation [8]. The interpolation detail is shown in the following example, in which the neighboring pixels A[0,0] and A[0,1] are interpolated as:

$$A_0^{0}(0,0) = A[0,0] \qquad (2)$$

$$A_0^{1}(0,0) = 0.75\,A[0,0] + 0.25\,A[0,1] \qquad (3)$$

$$A_0^{2}(0,0) = 0.5\,A[0,0] + 0.5\,A[0,1] \qquad (4)$$

$$A_0^{3}(0,0) = 0.25\,A[0,0] + 0.75\,A[0,1] \qquad (5)$$
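The linear weights of equations (2) to (5) can be illustrated along one pixel row as follows. This is a minimal sketch under the assumption that the same weights apply between every pair of horizontal neighbors; the function name and the sample values are hypothetical, and the hardware interpolation unit itself is not modeled.

```python
def quarter_pel_row(row):
    """Expand a row of integer pixels into quarter-pixel samples.

    Between each pair of neighbors (left, right) the four samples are
    left, 0.75*left + 0.25*right, 0.5*left + 0.5*right and
    0.25*left + 0.75*right, following equations (2) to (5).
    The last integer pixel is appended unchanged."""
    out = []
    for left, right in zip(row, row[1:]):
        for k in range(4):  # k quarter steps away from the left pixel
            out.append((4 - k) * left / 4 + k * right / 4)
    out.append(float(row[-1]))
    return out


samples = quarter_pel_row([100, 120, 80])
print(samples)
```

A row of W integer pixels yields 4(W - 1) + 1 samples, which indicates how much additional tracking-area data the quarter-pixel search has to process.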
Once interpolation results are available, quarter-pixel motion estimation is carried out in the same way, using a pipelined data flow [8]. The designs for 16x16, 8x8, and 4x4 motion vector generation are based on the building block depicted in Fig. 5. The interconnect logic, the controller, and the AGU are modified to support the respective pipelined data flows. The similar core architecture for 16x16, 8x8, and 4x4 Macroblocks leads to similar processing times. Bridging and transfer units are added between these neighboring stages for pipelining. The hardware design was written in VHDL and simulated in a ModelSim environment. Design synthesis and optimization were carried out using Design Compiler from Synopsys. An ARM library is available for future low-power implementations.
Though smaller block sizes usually lead to a smaller SAD and thus more precise motion vectors, they may at times be too computationally intensive, and smaller blocks also produce more motion vectors, which increases the number of bits to be transmitted [7]. New encoders use Rate Distortion [RD] optimization [9], based on the Lagrangian multiplier technique, and minimize the function:

$$C[x,y] = \mathrm{SAD}[x,y] + \lambda \cdot B[x-x_0,\,y-y_0] \qquad (6)$$

where [x,y] is the candidate MV, [x0,y0] is the predicted MV for the block, SAD[x,y] is the image distortion, B[x-x0, y-y0] is proportional to the number of bits required for encoding the difference between the candidate MV and the predicted MV, and λ is the Lagrangian multiplier. This algorithm represents a possible future addition to our research.

IV. RESULTS & CONCLUSIONS:

Pipelining the data flow significantly lowers the number of data transfers, and thus lowers transistor switching and the power consumption of the design. The design was synthesized and optimized using Design Compiler and achieved a clock speed of 125 MHz, which is suitable for real-time motion estimation for video compression in the H.264/AVC realm. The images shown in Fig. 6, extracted from high-speed video frames of the Athens Olympics in 2004, show a previous and a current frame and the significant reduction in the residual information that needs to be transmitted after motion estimation. Fig. 7 shows the synthesized circuits of the controller and a top-level schematic.
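As an illustration of the rate-distortion cost in equation (6), which Section III lists as a possible future addition, the following sketch compares two candidate motion vectors. The bit-count model is an assumption made here: it uses the length of a signed Exp-Golomb code word for each component of the MV difference. The λ value, the candidate SADs, and all function names are hypothetical.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code word for integer v.
    Used here as an assumed stand-in for B in equation (6)."""
    code_num = 2 * v - 1 if v > 0 else -2 * v
    return 2 * (code_num + 1).bit_length() - 1


def rd_cost(sad_value, mv, pred_mv, lam):
    """C = SAD + lambda * B, where B is the number of bits needed for
    the difference between the candidate MV and the predicted MV."""
    bits = (se_golomb_bits(mv[0] - pred_mv[0]) +
            se_golomb_bits(mv[1] - pred_mv[1]))
    return sad_value + lam * bits


# Hypothetical candidates as (SAD, MV) pairs, with predicted MV (0, 0).
candidates = [(100, (4, 4)), (110, (0, 0))]
pred = (0, 0)
lam = 4

best_by_sad = min(candidates, key=lambda c: c[0])
best_by_rd = min(candidates, key=lambda c: rd_cost(c[0], c[1], pred, lam))
print(best_by_sad[1], best_by_rd[1])
```

With these numbers the candidate with the lower SAD costs 14 bits for its MV difference, so the RD criterion prefers the slightly worse match at the predicted position.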
References:
[1] Ian E. G. Richardson, H.264 white papers, www.vcodex.com, 2002.
[2] Andrew Gibson, "The H.264 Video Compression Standard," Master's Project, Queen's University, Kingston, Ontario, Canada, 2002.
[3] Schafer, Wiegand, Schwarz, "The Emerging H.264/AVC Standard," Heinrich Hertz Institute, Berlin, Germany. EBU Technical Review, January 2003. http://www.packetizer.com/codecs/h264/trev_293-schaefer.pdf
[4] William C. Chung, "Implementing the H.264/AVC Video Coding Standard on FPGAs," Xtreme DSP. http://www.xilinx.com/publications/xcellonline/xcell_51/xc_pdf/xc_dsp-avc51.pdf
[5] "A whirlwind tour - H.264 Advanced Video Coding," http://www.pixeltools.com/h264_paper.html
[6] JVT, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification [ITU-T Rec. H.264/ISO/IEC 14496-10 AVC]," in Joint Video Team [JVT] of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, 2003.
[7] Toivonen and Heikkila, "Fast Full Search Block Motion Estimation for H.264/AVC with Multilevel Successive Elimination Algorithm," IEEE International Conference on Image Processing [ICIP], Singapore, Oct. 2004.
[8] K.M. Yang, M.T. Sun, and L. Wu, "A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm," IEEE Transactions on Circuits and Systems, pp. 1317-1325, Oct. 1989.
[9] G.J. Sullivan and T. Wiegand, "Rate-Distortion Optimization for Video Compression," IEEE Signal Processing Magazine, Vol. 15, No. 6, pp. 74-90, 1998.
[10] Xiang Li, "A Novel VLSI Architecture of Motion Estimation and Compensation for the H.264 Standard," Master's Thesis, Rochester Institute of Technology, Rochester, NY, USA, 2004.
[11] Ian E. G. Richardson, H.264 and MPEG-4 Video Compression, John Wiley & Sons Ltd., West Sussex PO19 85Q, England, 2003.
[12] H.M. Jong, L.G. Chen, and T.D. Chiueh, "Parallel architectures for three step hierarchical search block matching algorithm," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 4, No. 4, 1994, pp. 407-417.
[13] Shan Zhu and Kai-Kuang Ma, "A new diamond search algorithm for fast block matching," IEEE Transactions on Circuits and Systems on Video Technology, Vol. 9, No. 2, Feb. 2000, pp. 287-290.
[14] Peter Kuhn, Algorithm, Complexity, Analysis and VLSI Architecture for MPEG-4 Motion Estimation, Kluwer Academic Publishers, Boston, 1999.
Fig. 6. Olympics 2004 Video Grabs [10]: (a) current frame; (b) previous frame; (c) high-energy residual without motion estimation; (d) low-energy residual with 4x4 motion estimation.
Fig. 7 Synthesis Result Snapshot of Main Controller.[10]