
School of Electronics and Computer Science

Faculty of Engineering, Sciences and Mathematics

University of Southampton

Author: Joseph Conway (jc12g08@ecs.soton.ac.uk)

Date: June 8, 2012

Utilizing the H.264/MPEG-4 AVC compressed domain for computationally cheap abnormal motion detection

Project Supervisor: Eric Cooke (ecc@ecs.soton.ac.uk)

Second Examiner: John Carter (jnc@ecs.soton.ac.uk)

A project report submitted for the award of


BSc Computer Science
Abstract
This paper discusses the implementation of an algorithm for extraction and filtration
of compressed data features for classifying abnormal events from an H.264/MPEG-4
AVC video stream. The compressed domain in H.264 contains textural and motion
information in the form of discrete cosine transform coefficients and motion vectors
which are subsequently decoded to reconstitute the video sequence. Pre-
vious research has shown the effectiveness of these two components being used for
analysing optical flow on prerecorded MPEG video sequences. This paper uses mo-
tion vectors filtered by analysis of both the DCT coefficients and traits of the motion
vectors for unsupervised classification of abnormal events occurring in a live video.
The use of the compressed domain has allowed for extremely computationally cheap
realtime video analysis which has a low error rate and the benefit of low power usage.


Contents

Acknowledgments

1 Introduction
  1.1 Video Compression
  1.2 Motion Analysis in the uncompressed domain
  1.3 MPEG Video Analysis
  1.4 MPEG Compression
  1.5 Project goals

2 Background Research
  2.1 Embedded Computer Vision Systems on Mobile Hardware
  2.2 Efficacy of macroblock motion vectors for motion analysis
  2.3 Motion vector noise removal
    2.3.1 Low textured areas
    2.3.2 Edge Vectors
    2.3.3 Global motion analysis
  2.4 Classification of results

3 Design
  3.1 Frames
  3.2 Macroblocks
  3.3 Discrete Cosine Transform Coefficients
  3.4 Motion Vectors
  3.5 Global Motion Estimation
  3.6 Classifier

4 Implementation
  4.1 Motion Vector Extraction
    4.1.1 FFmpeg - Parsing a Frame
    4.1.2 FFmpeg - Extracting the Motion Vectors
  4.2 Filtering the Vectors
    4.2.1 Macroblock Skip Table
    4.2.2 DCT Coefficients
    4.2.3 Global Motion Compensation
    4.2.4 Edge Vectors
  4.3 Operation on a smartphone
  4.4 Classifier
    4.4.1 Sum of Absolute Motion
    4.4.2 Number of Majorly Different Macroblocks
  4.5 Submitted Code

5 Experimentation
  5.1 Live Video
  5.2 Test Scenes
    5.2.1 Road.mov
    5.2.2 Room.mov
    5.2.3 Motorway.mov
  5.3 Extraction and filtration of the motion vectors
    5.3.1 Base noise level
    5.3.2 Removal of textureless motion vectors
    5.3.3 Removal of textureless motion vectors and global motion
  5.4 Classifier
    5.4.1 Road.mov
  5.5 Room.mov
  5.6 Motorway.mov
  5.7 Efficiency

6 Critical Evaluation

7 Conclusion

References

Appendices
Appendix A Bibliography
Appendix B Project Brief
Appendix C Code Snippets


Acknowledgments
This research was conducted following an investigation into the feasibility of compressed domain analysis in H.264, carried out by myself while under the employ of Roke Manor Research, who have indicated an interest in the outcome of this study. It should be noted that this project is largely dependent on the open source video codec FFmpeg.


1 Introduction
1.1 Video Compression
When a video is stored or transmitted it is critical that sensible measures are taken
to account for limitations of storage or network bandwidth. Uncompressed video
data is a bitmapped representation of a frame, where colour and intensity values are
stored for each pixel in each frame. It is common for each pixel to be represented by
24 bits, 8 each for the red, blue and green components. This frame representation
takes up a huge amount of space; e.g. a 1920x1080 video at 30 frames per second would require a bandwidth of

(1920 × 1080 × 24 × 30) bits/s ≈ 186 MB/s    (1)

This is obviously highly impractical, so a method of compressing this data is required.


There are many standards for video compression, the most common of which are the MPEG family of standards. MPEG (the Moving Picture Experts Group) has gone through many revisions from MPEG-1 to the now commonplace MPEG-4, of which the H.264 extension has become ubiquitous as a video codec. H.264 uses multiple forms of compression and can thus achieve substantial bitrate reductions; for example, a 1920x1080 video at 30 frames per second would require a maximum bandwidth of 25 MB/s.

1.2 Motion Analysis in the uncompressed domain


Motion analysis is a process which has been used and improved upon for decades.
The goal of motion analysis is to determine the optical flow of a video. The optical
flow is the direction and speed at which objects are moving in a frame. There are
many ways of performing motion analysis in the uncompressed domain but even the
simplest method requires performing operations and comparisons on every pixel in
an image. This requires either a large amount of parallel processing or having enough
processing power to quickly deal with arithmetic and differential analysis of millions
of pixels. A simple outline of a conventional motion analysis algorithm would be as
follows:

- Take two frames and subtract them to get an error image

- Perform Sobel edge detection to find edges and their directions

- Repeat this for future frames and compare the positions of the edges in order to calculate feature vectors for edge motion

This is a highly complex procedure and does not account for any form of segmentation
to differentiate the edge points from whole objects in the frame.
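To make the per-pixel cost concrete, a minimal sketch of the first two steps of such a pipeline is given below, operating on plain greyscale buffers; the buffer layout and the absence of any segmentation are assumptions of the example rather than part of any particular library.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: absolute difference of two greyscale frames. Every pixel must be
 * visited, which is what makes uncompressed-domain analysis so expensive. */
void frame_difference(const uint8_t *prev, const uint8_t *curr,
                      uint8_t *diff, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        diff[i] = (uint8_t)abs((int)curr[i] - (int)prev[i]);
}

/* Sketch: 3x3 Sobel gradient magnitude (border pixels skipped for brevity). */
void sobel_magnitude(const uint8_t *img, uint8_t *out, int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int gx = -img[(y - 1) * w + x - 1] + img[(y - 1) * w + x + 1]
                     - 2 * img[y * w + x - 1]  + 2 * img[y * w + x + 1]
                     - img[(y + 1) * w + x - 1] + img[(y + 1) * w + x + 1];
            int gy = -img[(y - 1) * w + x - 1] - 2 * img[(y - 1) * w + x]
                     - img[(y - 1) * w + x + 1] + img[(y + 1) * w + x - 1]
                     + 2 * img[(y + 1) * w + x] + img[(y + 1) * w + x + 1];
            int mag = abs(gx) + abs(gy);
            out[y * w + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```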

1.3 MPEG Video Analysis


Analysis of MPEG video can be separated into two separate fields of study, the com-
pressed and the uncompressed domain. Conventional computer vision algorithms


operate exclusively in the uncompressed domain on a pixel level basis. This is ex-
tremely computationally expensive as it requires both fully decoding the frame and
analysing every pixel. The decoded frame is typically such a large dataset that
the image must be reduced to a much lower resolution, greyscale image in order
for computation to be performed in a reasonable amount of time unless sufficiently
powerful hardware is available. The large amount of raw image data, in addition
to the complexity of the analysis that needs to be performed, results in a system
which is severely computationally complex in both time and space; requiring exten-
sive processing power and memory. Conversely, analysis in the compressed domain
is a relatively new field of research which is significantly computationally cheaper.
This affords us the opportunity of performing motion and texture analysis on high
resolution video using cheap and lightweight hardware, with the potential to be used
in realtime.

One of the most prevalent video compression standards in use today is H.264/MPEG-4 AVC (Advanced Video Coding); this standard has been widely adopted owing to
its bitrate being substantially lower than others. This has resulted in H.264 being
the de facto standard on mobile hardware such as smartphones and tablets for video
storage and transmission, where memory and bandwidth have critical limitations.
Consequently, there is a wealth of cheap, low powered devices which have been
manufactured with a dedicated hardware image signal processing (ISP) chip. This
chip allows for extraordinarily fast H.264 video compression performed purely in
hardware. This project aims to take advantage of the presence of this chip and the
pre-calculated motion and textural dataset for the expedited discovery of segmented
moving video objects.

1.4 MPEG Compression


The MPEG video codec utilises many layered compression techniques, principally
the use of discrete cosine transforms and macroblock motion vectors to generate
predictive frames. These predictive frames consist of merely the difference between
frames. They are composed of a set of instructions on how to reconstitute the image
based on prior knowledge, rather than the raw image data. These frames are com-
pressed with purely a reduction of bitrate in mind, no consideration is made into the
preservation of optical flow. Any semantic meaning which can be inferred from this
data is purely fortuitous.

This project is an investigation into the analysis of the compressed domain of a video
stream. It will attempt to determine the effectiveness of using macroblock motion
vectors for low complexity and high speed semantic analysis, from intrinsically mean-
ingless and noisy data. If this is a possibility, then it allows us to take advantage
of the dramatically reduced bit rate and novel representation of video data in order
to determine the level of abnormality which the frame represents in contrast to the
learnt average scene whilst only performing a minimal amount of computation. The result is a framework for abnormal motion detection which is then potentially executable in realtime on a low powered mobile device.


1.5 Project goals


The goal of this project is to implement a combination of algorithms for the ex-
traction, filtration and analysis of the vector space and DCT coefficients. It will be
determined if this data is a viable means of determining the level of abnormality
in a scene whilst demonstrating the propensity to be executed in realtime on an
extremely low powered platform such as a smartphone. Such a platform is subject to extreme restrictions in processing power, system resources and power usage. The implementation will be written in C using the FFmpeg and libx264 video codec libraries and tested on a 2.5 GHz x86-64 Intel Core 2 Duo with 4 GB of RAM running Mac OS X. The project will be considered a success if the algorithm demonstrates a level of efficiency and accuracy which indicates the possibility for translation onto a 1 GHz ARM processor within reasonable limits of power usage.


2 Background Research
2.1 Embedded Computer Vision Systems on Mobile Hardware
The need for a method of computationally cheap optical flow analysis for use on
cheap hardware was found after an investigation into the use of standard computer
vision libraries on ARM processors. However, the use case of performing video anal-
ysis on a low powered device comes with several considerations in terms of hardware
limitations. Firstly, almost all microcontroller boards or smartphone handsets run
on ARM CPUs or ARM System on Chips (SoC). These are RISC (Reduced Instruc-
tion Set Computing) based architectures, which impair the use of computer vision
libraries such as OpenCV, which were designed for CISC (Complex Instruction Set
Computing) x86 architectures. Moreover, because of the reduced processing power
of mobile CPUs, the SoC will offload a lot of work to other components such as
the NEON SIMD (Single instruction, multiple data) coprocessor, the VFP (Vector
Floating point) unit and the DSP (Digital signal processor). These are all compo-
nents which are optimised for certain kinds of calculations and embedded into one
component, such as the Apple A5 system on chip. However OpenCV was written
for use with just a CPU with occasional GPU support, and as such does not take
advantage of the available resources [10].

These advanced SoC resources are also only available on higher end systems and as
such, the possibility of performing motion analysis using purely a low powered CPU
with minimal vector or floating point operations would be hugely beneficial. This
paper suggests that this is a possibility with the utilisation of the compressed domain
in conjunction with the image signal processing chip.

2.2 Efficacy of macroblock motion vectors for motion analysis
The justification for the use of macro block motion vectors is effectively conveyed
in [2] and [12]. Takacs [12] makes a comparison between the accuracy of track-
ing movement with macro block motion vectors against the well established SURF
algorithm[7] which is used for feature detection. It was concluded by [12] that there
are inherent flaws with the use of macroblock motion vectors. One of [12]'s chief
concerns was that they do not truly represent motion of an object. They point to the
nearest visually similar macro block, they do not actually refer to the true motion of
a feature. They also point out that macro block motion vectors point both forwards
and backwards in time, as H.264 frames are not always transmitted in a temporally
linear fashion. However they conclude that these issues are negligible as they quan-
tised the error rate and determined that it was more than accurate enough, with an
error rate of only 10 pixels when tracking various kinds of movement of an object.

The main basis for my work will be a continuation of the work started by Kiryati
[6] and Ozer [9]. Kiryati [6] investigates the use of probability densities generated from several algorithms for analysing motion from the macroblock motion
vectors. They successfully prove that they can create an abnormal event detector
which trains itself by generating probability densities for various sums of the set of
vectors for each frame. They quantify the total overall motion and prominent regions
and directions of motion. However, one critical flaw of the paper is that it makes no attempt to attach any kind of semantics to the motion it is analysing. Aside from
this, it succeeds in creating a very effective abnormal event detector; one of its most
notable achievements is that the algorithm runs 3x faster than the video framerate
on a very slow processor. However, the system described does not analyse live video,
which is a crucial factor for my system.

Another critical paper on which to base this work is [9]. This paper describes the development of a 3-tier system for detecting humans in MPEG video by analysis of both
the compressed and uncompressed domain. Their system initially analyses frames
for areas of motion using motion vectors. The system then proceeds to analyse the
intra coded DCT coefficients for chrominance on these subsets of a frame to detect
skin colour. The last tier then performs segmentation to group blocks of motion,
searches for blocks of human proportions then performs graph matching algorithms
such as super-elliptical fitting and similarity matching.

There is a large difference in this system as compared to what I would like to achieve,
in that it is a system built around the need to detect humans. They are however
demonstrating the efficacy of macroblock motion vectors, using the compressed domain to dramatically reduce the area in a frame which is subject to further analysis in the uncompressed domain. Despite the disparity between [9] and my goals, they have effectively
proven that macro block motion vectors can be used to detect motion despite the
level of noise[12] in the dataset and can drastically reduce the amount of computation
involved.

2.3 Motion vector noise removal


It has been identified by several researchers that there are several inherent flaws with
using macroblock motion vectors for motion analysis. Because their actual purpose
is to simply find visually similiar areas, there are several kinds of motion vector
produced [14] [13] [3] . It has been shown that in homogenous areas of low texture,
the motion vectors produced by an encoder are highly erratic and will typically result
in one macroblock being redrawn in dozens of places to reconstitute an area of flat
texture. There is also significant vector noise where the encoder detects edges.

2.3.1 Low textured areas

Figure 1 demonstrates the chaotic nature of vectors on an area with no texture. A solution to this is presented by [8], [1], [5], [13], where it is suggested that these areas
of noise can be isolated by looking at the texture of the macroblock. The notion
being that an area of flat texture can be eliminated from the vector space. The
complexity of the texture of a macroblock can be indicated by a concentration of the
coefficients at either the lower or higher frequencies in the transform values for the
relevant I-frame [1], [5], [13].

Figure 1: Chaotic vectors in an area of low texture and no motion

It was found that, by checking the sum of the absolute values of the top 3x3 quadrant of the DCT coefficients (the low frequencies), there is a direct correlation between the result and the complexity of the texture of
the macroblock. Hesseler [5] found that a suitable method of filtration is by summing
the cumulative absolute values for the top left 3x3 matrix of the four luminance DCT
blocks per macroblock. If the values amounted to a value greater than 1, then he
deemed the macroblock to be contentless. Liu et al.[8] took a different approach
and chose to store the sum of the squared values of the DCT coefficients for each
macroblock, so as to check whether the new MB's sum of squares is similar. This works on the basis that this value should change significantly if the content of the macroblock's texture has changed.

2.3.2 Edge Vectors

Yokoyama [13] demonstrates the need to detect and eliminate edge vectors. He
indicates that by means of a zero comparison method, noisy vectors traversing along
an edge can be eliminated.

2.3.3 Global motion analysis

Another key contributor to noise in the vector space is when global motion is in-
troduced. Minute movement of the camera will instantly saturate the vector space
with every macroblock shifting to accommodate the relative motion in the frame. soandso[nokia] found that this global motion can be classified as either pan, zoom or rotate. There have been several methods of accounting for global motion and compensating accordingly. someone[],[] & [] have proven that an effective method is to perform an 8 parameter affine transform to compensate for the camera's motion in 3 dimensions. The 8 parameters are established through regression testing and a least squares fitting algorithm.

2.4 Classification of results


A key paper in my research into classification of abnormal events in video data has
been Rao & Sastry [11]. They performed event classification based on dynamic prob-
ability densities calculated from quantised vectors, using a K-means clustering algorithm to learn what is a normal set of observations. They determined their motion vec-
tors by means of segmentation and classical object tracking algorithms, however this
resulted in them having a similar dataset to what I will be working with. They
used their probability densities to generate prototypes of motion which are used as
observations in a first order Markov process.

The idea of general motion prototypes is also used in Zhong [15]. They generate a
binary moving object map, which is essentially the same dataset that I am using.
However, their algorithm involves analysing sections of time as well, whereas my algorithm must be independent of time as it is using a live video source.


3 Design
3.1 Frames
H.264/AVC, also known as MPEG-4 AVC, is a video codec which uses high levels of com-
pression to achieve a drastic reduction in bitrate. A major factor in this compression
is the use of predictive frames. The encoder will produce two different types of frame
from a video source: key frames and predictive frames. Key frames, referred to as
I-Frames, contain all the information needed to recreate a still image. This data
consists of the luminance(brightness) and chrominance(colour) data for a frame.
These frames are transmitted at either a fixed interval set by the encoder, or after
a dramatic change in scene content. Predictive frames can contain information pertaining to both future and past frames; these are referred to as P and B frames.
P and B frames are both predictive frames which represent the change from the key
frame to the current frame. H.264 allows for both forward and backward prediction
with B-frames as frame transmission does not require temporal consistency. This
inconsistency is dealt with by the decoder where frames are reordered according to
their frame type and sequence number contained in the packet data.

Figure 2: An illustration demonstrating H.264 variable frame ordering. P frames (4, 7) reference image data from previous I-frames (1). B-frames (5) reference both past (4) and future (6) B/P frames.

3.2 Macroblocks
When compressing a frame, H.264 will break up the image into small NxN groups
of pixels; these are macroblocks. Macroblocks, when reconverted from the frequency domain to the pixel domain, reconstitute the uncompressed image. The typical size
for a macroblock is 16x16 pixels, but H.264 allows for smaller macroblock size when
more granular image information is required. This would result in a 1920x1080 pixel
image having a macroblock resolution of 120x68.

In an I-frame, the macroblock will contain DCT coefficients for chrominance and
luminance which can be used to regenerate the texture for the 16x16 block through
an inverse DCT calculation. Predictive frames, however, will contain a series of
vectors indicating only that the preexisting macroblocks have shifted and can be
replicated in other places of the frame. This results in the ability for the codec
to save only the fact that the macroblock at position (x,y) should be redrawn at
positions (x1 , y1 ), (x2 , y2 )...(xn , yn ) thus preventing the same high bitrate image data
from being retransmitted multiple times. This can in practice reduce a frame by a
factor of 40, from 8MB per frame to 200KB per frame for a P or B frame.

The macroblocks in H.264 can also be split into sub-macroblocks of 8x8, 8x16, 16x8, 4x8, 8x4 and 4x4 block sizes, but this is a choice left to the encoder. This means that analysis in the compressed domain has a maximum resolution of a 4x4 block of pixels; however, this is inconsequential for the goals of this project.

Figure 3: A frame being broken up into 8x8 macroblocks (image source: https://www.projectrhea.org/rhea/index.php/Homework3ECE438JPEG)

3.3 Discrete Cosine Transform Coefficients


An I-frame contains chrominance and luminance data for the entire frame which the
decoder can reconstitute into an error free uncompressed image. Each macroblock
is split 3 channels, one is for brightness (luminance) and two for colour informa-
tion(chrominance), as seen in Figure 4.

This information is not stored as intensity values for each pixel. Rather, the en-
coder has analysed the luminance and chrominance intensity values and converted
the blocks into the frequency domain with a discrete cosine transform. By this it is
meant that the image data is represented as the sum of numerous cosine waves of
differing frequencies. The matrix of frequencies that are represented by the DCT are
visualised in Figure 5. This results in an 8x8 matrix of DCT coefficients which can
be used to recreate the block, as seen in Figure 8.
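For reference, a naive sketch of the forward 8x8 DCT that moves a block into the frequency domain is given below; real encoders use fast integer approximations of this transform, so the direct floating point version here is purely illustrative.

```c
#include <math.h>

#define BLOCK 8

/* Naive forward 2D DCT-II of an 8x8 block of intensity values.
 * coeff[0][0] is the DC term; higher indices correspond to higher spatial
 * frequencies, as visualised in Figure 5. */
void dct_8x8(const double block[BLOCK][BLOCK], double coeff[BLOCK][BLOCK])
{
    for (int u = 0; u < BLOCK; u++) {
        for (int v = 0; v < BLOCK; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < BLOCK; x++)
                for (int y = 0; y < BLOCK; y++)
                    sum += block[x][y]
                         * cos((2 * x + 1) * u * M_PI / (2.0 * BLOCK))
                         * cos((2 * y + 1) * v * M_PI / (2.0 * BLOCK));
            coeff[u][v] = 0.25 * cu * cv * sum;
        }
    }
}
```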



Figure 4: An image split into YUV channels. Luminance (Y) and chrominance (U and V) are combined to form the original top-left image.

The goal of this is to reduce the size of the data as much as possible. An example of this is shown in Figures 6 - 8. This is desirable because, despite already being at a much lower bitrate than the original intensity values, the coefficients can be further compressed through quantization, zig-zag ordering, run-length encoding and then Huffman encoding, for a considerable drop in bitrate whilst still containing enough information to accurately reconstruct the image block.

This information is pertinent as the vectors we are utilising are subject to significant levels of noise. We can analyse the complexity of the texture of the block by assessing the coefficients for the various frequencies of the DCT and filtering accordingly. By performing texture analysis using these values, we can reduce the inherent vector noise introduced by homogeneous areas of low texture.

Figure 5: A visual representation of the discrete cosine transform frequencies. A block of an image is reduced to an 8x8 matrix, with the value at each index corresponding to the weight given to that frequency. The weighted sum of these 64 frequencies can rebuild the texture.


Figure 6: An 8x8 luminance block's intensity values

Figure 7: The block after a 2D DCT

Figure 8: The transformed block after quantization


3.4 Motion Vectors


A predictive frame consists of a series of vectors, indicating that a macroblock from
a previous frame can be redrawn at the position indicated by the vector. This is a
viable means of frame representation as the vast majority of change between frames is simply pre-existing macroblocks moving, rather than new macroblocks being
introduced. The reshuffling and replication of various macroblocks from the previous,
or future frames is enough to reconstitute a sufficiently viable image with negligible,
albeit noticeable, visual artifacts.

There is one significant drawback with these vectors in that they do not definitively
represent the actual movement of an object within a frame, their significance is much
more abstract. This is because the motion vectors are generated with the intention
of compression, rather than optical flow in mind. Although optical flow of a moving
object can be inferred from the vectors, there are severe inherent problems with using
these vectors as a data source.

This notion can best be seen in both homogeneous areas of low texture, such as sections
of sky, or in edge sections. Areas of low texture are subject to numerous highly erratic
vectors spanning the entire plane of texture because the encoder will want to draw
one macroblock then reproduce it for the entire surface, as seen in Figure 1. Edges
are also subject to large amounts of noise with vectors of high magnitude traversing
along the edge.

Figure 9: An example of motion vectors demonstrating movement of a ball rolling right to left.

3.5 Global Motion Estimation


One of the biggest causes of noise in the vector field is the introduction of global
motion from the camera panning, rotating, zooming etc. When dealing with a camera
in a fixed position, we are still subject to occasional jolting of the camera sensor from
vibrations and wind. An example of this can be seen in Figure 10. It was found that the software based image stabilisation, in conjunction with the auto-focus feature of a camera, induced significant global motion despite the camera being fixed on a tripod.


Figure 10: Screenshot demonstrating global motion. Each block represents the motion vector corresponding to that macroblock; the block's size refers to the MB size and the block's colour refers to the direction of motion. The saturation of orange 16x16 vectors can be clearly seen here, indicating that the camera has shifted south-east.

3.6 Classifier
Once the motion vectors have been extracted from the frame and filtered they are
primed for input into a classifier. It is important however that the vectors have been
sufficiently filtered to remove any outliers and only contain vectors representative of
the spatial flow of moving objects in the frame. This classifier will perform unsupervised classification of the motion in a frame in the context of previously observed
motion magnitude and directions. The software will require a period to habituate
to a scene and classify what normal motion is; for example it must be able to learn
that a tree swaying at a certain point in the frame is normal motion and should be
considered likely behaviour. Likewise, it must be able to learn that directional traffic
down either side of a road or multidirectional lateral movement by pedestrians along
a pavement is normal. However, because it would rarely be exposed to abnormal
events such as a person crossing the road it should calculate a low probability for
such an event and classify the scene as abnormal accordingly. Such a system would
allow a camera to be pointed at any scene from a fixed position and quickly be used
to notify a user of anything unusual. An example of unusual activity is shown in
Figure 11 where normal motorway directional traffic is seen on the left but a person
crossing the road would be seen by the feature vector on the right, which would be
abnormal for this scene.

Figure 11: An example of what might be considered normal (left) and abnormal (right) motion.


4 Implementation
This paper was primarily an investigation into the efficacy of extracting semantic
meaning from noisy macroblock motion vectors. The secondary objective of it was
to indicate that the efficiency of such an approach allows for semantic analysis to
be executed on mobile hardware. As such the implementation of this project can be
split into four separate problems of similar complexity, whether theoretical or practical in difficulty. These are:

- The extraction of the vectors from a video sequence and the vector space filtration algorithms;

- Gaining access to the compressed information from the camera's image signal processing chip in order to implement these algorithms without needing to fully encode and decode image frames in software;

- Investigating the feasibility of compressed domain analysis on an ARM processor under reasonable power usage levels;

- Testing the efficacy of optical flow analysis in the compressed domain by implementation of an abnormal event detector.

4.1 Motion Vector Extraction


Owing to the intensely problematic nature of accessing the ISP on a mobile operating
system, it was deemed sensible to first outline and test the algorithms on a desktop
operating system using prerecorded video. For this the open source video codec
FFmpeg was used to open and decode a set of prerecorded test videos encoded in
various levels of H.264/MPEG-4 AVC.

These test videos comprised a combination of outdoor and indoor scenes filmed with the various sources of vector space noise in mind. The main test scene was filmed without a tripod and incorporated areas of sky and telegraph poles to introduce global motion, homogeneous low textured areas and edge vectors respectively.
Screenshots of these scenes can be seen in Section 5.

Successful vector extraction was also possible with live video, but for the purposes
of consistent testing, prerecorded video was used. For this, video was encoded on the fly using libx264, an open source H.264 video encoding library.

4.1.1 FFmpeg - Parsing a Frame

FFmpeg was used for this project as it is widely considered to be the best and most
supported video codec library. The FFmpeg fork LibAV was considered, but was
found to not offer much benefit over FFmpeg. FFmpeg is a video codec library writ-
ten in C, as such it will be compilable for any architecture desired, which fits with
the scope of this project perfectly.

Video decoding in FFmpeg is made extremely simple for execution as a command


line application; however, for use programmatically, it requires a much more comprehensive understanding of both FFmpeg's structure and the H.264 specification. This
is because in order to open the file and decode it, one must first call the appropriate
functions and set up the decoding contexts in the correct order to ensure that all
the data structures are initialised with the right parameters before parsing the file data. The exact procedure used for initialising FFmpeg, loading it with a file and iterating through the frames can be seen in the Appendix as the functions open_file and next_frame.

Once the file has been loaded, a function was written to iteratively parse data from
the file until an entire frame had been decoded, at which point the data and the frame parameters are stored inside an FFmpeg struct called an AVFrame.
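A condensed sketch of that setup against a recent FFmpeg API is shown below; the project's actual open_file and next_frame functions are listed in the Appendix, and error handling and cleanup are omitted here for brevity.

```c
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

/* Sketch: open an H.264 file and decode it frame by frame. */
static void decode_file(const char *path)
{
    AVFormatContext *fmt = NULL;
    avformat_open_input(&fmt, path, NULL, NULL);
    avformat_find_stream_info(fmt, NULL);

    int stream = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    const AVCodec *codec =
        avcodec_find_decoder(fmt->streams[stream]->codecpar->codec_id);
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    avcodec_parameters_to_context(ctx, fmt->streams[stream]->codecpar);
    avcodec_open2(ctx, codec, NULL);

    AVPacket *pkt  = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == stream) {
            avcodec_send_packet(ctx, pkt);
            while (avcodec_receive_frame(ctx, frame) == 0) {
                /* frame->pict_type distinguishes I, P and B frames here */
            }
        }
        av_packet_unref(pkt);
    }
}
```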

The first thing to be considered is what kind of frame has been decoded. FFmpeg stores the frame type in an AVFrame in the field pict_type as either AV_PICTURE_TYPE_P, AV_PICTURE_TYPE_B or AV_PICTURE_TYPE_I for predictive frames and key frames respectively.

Motion vectors are only present in predictive frames, and are much more coherent in
P frames, as opposed to the more convoluted B frames. Fortunately we can specify in
the video encoding step a GOP (Group of pictures) size and the maximum number of
B frames between P frames. GOP refers to the maximum number of frames between I
frames. For all the test video sequences and in the live video implementation, a GOP
size of 10 and a maximum B frame frequency of 1 was specified. This produced the
average pattern of frames as I B P B P B P B P P I which allowed for the most
consistent and coherent optical flow representation in the video sequence. We can
also elect to ignore any B frames which point backwards in time, as this would
indicate motion which is the opposite to the actual optical flow.
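As a brief illustration, the GOP size and B-frame limit described above map onto standard AVCodecContext fields; the sketch below uses the values quoted in the text and omits the rest of the encoder configuration.

```c
#include <libavcodec/avcodec.h>

/* Sketch: encoder settings producing the I B P B P ... pattern described
 * above. Bitrate settings and error handling are omitted. */
static AVCodecContext *configure_encoder(void)
{
    const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    AVCodecContext *enc  = avcodec_alloc_context3(codec);

    enc->width        = 1920;
    enc->height       = 1080;
    enc->time_base    = (AVRational){1, 30};
    enc->pix_fmt      = AV_PIX_FMT_YUV420P;
    enc->gop_size     = 10;   /* at most 10 frames between I-frames     */
    enc->max_b_frames = 1;    /* at most one B-frame between references */

    avcodec_open2(enc, codec, NULL);
    return enc;
}
```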

4.1.2 FFmpeg - Extracting the Motion Vectors

Inside an AVFrame there is a 3 dimensional array containing the macroblock motion vectors; it is of the format AVFrame->motion_val[xy][direction][component], where xy is the serialised 2 dimensional cartesian coordinate of the relevant macroblock in the frame, direction is the temporal reference point (0 = backward, 1 = forward), as H.264 allows both forward and backward predictive frames, and lastly component is either 0 or 1, returning the x or y component of the vector respectively.

These values are extracted by iterating over the width and height of the macroblock
resolution of the image. The values are then further manipulated according to their
corresponding macroblock size so as to get an accurate pixel level cartesian coordinate
displacement, accurate to a quarter of a pixel. Once a vector is extracted from the
frame, it is passed into the filtering function before being saved for further reference.
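Recent FFmpeg releases also expose the same information as exported frame side data, which avoids indexing the internal motion_val array directly; the hedged sketch below shows that alternative route and is not the extraction code used for this project (see extract_vectors in the Appendix).

```c
#include <stdio.h>
#include <libavcodec/avcodec.h>
#include <libavutil/motion_vector.h>

/* Before avcodec_open2(), ask the decoder to export motion vectors:      */
/*     ctx->flags2 |= AV_CODEC_FLAG2_EXPORT_MVS;                          */

/* Sketch: read the exported motion vectors from one decoded frame. */
static void print_motion_vectors(const AVFrame *frame)
{
    AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_MOTION_VECTORS);
    if (!sd)
        return;                        /* e.g. I-frames carry no vectors */

    const AVMotionVector *mv = (const AVMotionVector *)sd->data;
    size_t count = sd->size / sizeof(*mv);
    for (size_t i = 0; i < count; i++) {
        int dx = mv[i].dst_x - mv[i].src_x;   /* horizontal displacement */
        int dy = mv[i].dst_y - mv[i].src_y;   /* vertical displacement   */
        printf("%dx%d block at (%d,%d): (%d,%d)\n",
               mv[i].w, mv[i].h, mv[i].dst_x, mv[i].dst_y, dx, dy);
    }
}
```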

In terms of the implementation phase of the project, this step was one of the most
significant. Despite the fact that FFmpeg provides all the information needed to
extract the vectors, the actual implementation of this algorithm took much longer
than anticipated. This is because FFmpeg is written with purely speed of processing
in mind, at the cost of ease of use. The obscure nature of the source code, the lack
of documentation and obtuse attitude of the developers towards providing any help
to the public contributed to this being an extremely complex process. The source


code for extracting the vectors can be found in the Appendix under the function extract_vectors.

4.2 Filtering the Vectors


Once the vectors have been extracted from the frame, they need to be passed through a strict filter in order to remove as much noise as possible. The level of noise is so severe that on the majority of frames it is completely saturating.

However, a human observer can look at a visualisation of the vector space and immediately see that although the noise is ubiquitous, there is a significant level of
coherence where there is a moving object.

4.2.1 Macroblock Skip Table

The simplest and most immediate method of noise reduction is to ensure that only
pertinent motion vectors are being included. The decoded frame also provides a macroblock skip table: a list of the macroblocks which have not changed and as such can be ignored.

4.2.2 DCT Coefficients

It is clear from Figure 1 that the chaotic motion in homogeneous areas of low texture is significant and must be dealt with. The DCT coefficients can be used to accurately
determine that an area is of sufficiently low texture complexity so that we can ignore
it. This is achieved quickly without requiring any video decoding. We achieve this
by taking each vector and looking at the values for that macroblock in the I frame.

The macroblock in the keyframe contains the luminance DCT coefficients needed
to rebuild the block. If an area is of high textural complexity, then the highest
coefficients are localised in the lower right of the coefficients matrix. Conversely,
for areas of little or no texture, the majority of the high value DCT coefficients are
localised in the top left. In some instances the top left coefficient (known as the
DC coefficient) has the highest value indicating clearly that the macroblock has no
texture. An example of low texture DCT coefficients can be seen in the clear top left (i.e. low frequency) bias in Figure 12.

Figure 12: DCT coefficients of a quantized low textured macroblock

This fact affords us the opportunity to perform quick textural analysis by simply
summing the absolute values of the top left quadrant of the 8x8 matrix. By perform-
ing this high level analysis, we are able to effectively and quickly remove any vectors
which simply represent unmoving background information such as blank walls or sky.


Occasionally it also effectively acts to remove the shadows of moving objects as the
complexity of the MB remains low, even though it does actually represent a moving
video object.
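A minimal sketch of this check is given below; the quadrant size and the way the resulting sum is thresholded are tuning choices, discussed later in this section, rather than fixed values from the report.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: texture measure for one macroblock, summing the absolute values
 * of the low-frequency (top-left) region of an 8x8 block of quantized
 * luminance DCT coefficients. `quad` controls the region size (e.g. 3 or 4). */
static int low_frequency_energy(const int16_t coeffs[8][8], int quad)
{
    int sum = 0;
    for (int i = 0; i < quad; i++)
        for (int j = 0; j < quad; j++)
            sum += abs(coeffs[i][j]);
    return sum;
}

/* Motion vectors belonging to macroblocks whose measure indicates a flat,
 * textureless area are then discarded before any further analysis. */
```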

An important consideration to take into account is that DCT coefficients are typically
represented in an 8x8 matrix of values in the range of -1 to 1. However, in H.264
the coefficients are subject to multiple layers of further compression. The values are
subject to quantization, zig-zag ordering, run length encoding and then Huffman
encoding.

FFmpeg decompresses these values for the user; however, the coefficients we are
given have been quantized. The quantisation process aims to reduce as many of the
coefficients as possible to 0. It achieves this by having a lookup table in the encoder
which assigns a weighting to a quantization function. This will take the value of the
coefficient and its value in the quantization table given by its position and formulate
its new value. The quantisation table puts a different weighting on each of the different frequencies in the cosine transform. The resultant table consists of mostly 0s, with the non-zero values lying between -32565 and 32565.

One of the considerations made whilst tuning the filtering process was to ascertain
the best threshold value for the sum of the top left quadrant of the DCT coefficients.
Different values allow for different levels of texture complexity.

4.2.3 Global Motion Compensation

Lastly, we must account for the global motion of the camera. Many papers ignore
this factor as a camera on a tripod is the most likely use case and camera movement
could be argued to fall outside the problem domain. However, it was found that a
huge amount of global motion was being introduced to all of the test videos despite
being recorded from a stationary viewpoint. Initially this was put down to ambient
movement such as footsteps. However it soon became apparent that the movement
was actually a result of the camera's built-in image stabilisation. The image stabiliser was shifting the image a couple of pixels on each frame. This is compounded by the fact that most cameras operate with an autofocus which varies the focal length slightly, therefore introducing a huge amount of global motion vectors from a zooming action when dealing with high resolution video.

A quick and elegant solution to this problem was found. A small 10x10 matrix is initialised with each frame; this acts as a two dimensional motion histogram. This histogram stores the frequency of low magnitude relative motion values, with the position (4,4) (the centre of the matrix) acting as the null movement value. For
example, a macroblock which shifts 4 pixels to the right (4,0) will increment the bin at position (8,4), and so on.

At the end of each frame, this histogram is analysed to determine the bin with the
highest count. This vector is assumed to be the global motion vector. From this
we allow a 2 pixel leeway in each direction and compare the relative motion of each vector in this frame. If the vector falls within the range of the global motion vector,
then it is discarded.


If the magnitude and direction of a vector fall beyond this range, then it is adjusted accordingly. This reduces the amount of noise in a scene, as seen in Section 5, where the scene instantly becomes almost entirely noise free.
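A sketch of this histogram scheme is given below, with the 10x10 grid, the (4,4) null bin and the 2 pixel leeway from the text treated as illustrative constants rather than the exact submitted implementation.

```c
#include <stdlib.h>
#include <string.h>

#define HIST   10   /* 10x10 histogram of small relative displacements   */
#define CENTRE 4    /* bin (4,4) represents zero motion                  */
#define LEEWAY 2    /* +/-2 pixel tolerance around the dominant bin      */

static int hist[HIST][HIST];

/* Reset at the start of every frame. */
static void hist_reset(void) { memset(hist, 0, sizeof(hist)); }

/* Accumulate one motion vector (in whole pixels) into the histogram. */
static void hist_add(int dx, int dy)
{
    int x = dx + CENTRE, y = dy + CENTRE;
    if (x >= 0 && x < HIST && y >= 0 && y < HIST)
        hist[y][x]++;
}

/* Once every vector of the frame has been added, the fullest bin is taken
 * as the global motion vector for that frame. */
static void hist_peak(int *gx, int *gy)
{
    int best = -1;
    for (int y = 0; y < HIST; y++)
        for (int x = 0; x < HIST; x++)
            if (hist[y][x] > best) {
                best = hist[y][x];
                *gx = x - CENTRE;
                *gy = y - CENTRE;
            }
}

/* A vector within LEEWAY pixels of the global motion is treated as camera
 * movement and dropped; anything else is compensated by subtraction.
 * Returns 1 if the (adjusted) vector should be kept. */
static int compensate(int *dx, int *dy, int gx, int gy)
{
    if (abs(*dx - gx) <= LEEWAY && abs(*dy - gy) <= LEEWAY)
        return 0;
    *dx -= gx;
    *dy -= gy;
    return 1;
}
```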

4.2.4 Edge Vectors

A clear method of edge vector removal is explained in [13] where an edge is tracked
spatiotemporally. However, it was quickly found in the implementation phase that
any kind of edge vector removal was unnecessary and had no effect on the level of
noise. This was largely due to the fact that most edge vectors were negated by the
global motion compensation.

4.3 Operation on a smartphone


The ultimate goal is to have the software running at 30 frames per second on a live
video stream on a smartphone. By smartphone is meant a device with an ARM
central processing unit, which may or may not have access to the NEON SIMD. In an
effort to approach the problem in a sensible way it was decided to take incremental
steps towards the goal. This meant approaching the encoding/decoding process in
this order:

To achieve the optimum level of efficiency, it would be perfect to receive a pointer to


a block of memory from the hardware video encoder. From this block we would then
only partially decode the block to parse out the NAL units, reconstruct the frame
slices and then parse out the motion vectors from the macroblocks. In reality this is
a highly complex procedure [4], but it is the price paid for having a bitstream which is so highly compressed. It is far beyond the scope of this project to implement our own decoder. This results in a compromise needing to be made in order to ensure that the project is possible in the available time.

Stage  Encoding (ISP / FFmpeg)  Decoding (Partial / Fully)  Source (File / Live)  Platform (x86 / ARMv9)  Implemented
1 X X X X Fully
2 X X X X Fully
3 X X X X Fully
4 X X X X Fully
5 X X X X
6 X X X X
7 X X X X

Initially, it was attempted to implement this algorithm on a mobile operating system as per the requirements. A comprehensive investigation into the possibility of using
both iOS and Android was carried out. It was preferable to implement this project
in iOS, as Objective-C and the Cocoa frameworks are familiar, allowing for rapid prototyping; however, Android may have been more suited to the task, so both were investigated.

It was determined that accessing the compressed data stream is not as straightfor-
ward as one would hope. Initially, an inordinate amount of time was spent attempting


to either gain programmatic access to the hardware video decoder or utilise the camera software for returning the compressed data packets. The first attempt was to use Google Android; being an open source operating system, it was assumed that access to the camera hardware would be much more accommodating. It was found, however, that at the application layer the Android Camera application only allowed access to preview frames in JPEG form. The video encoding and decoding is provided by the OpenCore frameworks, of which only the video decoding API is accessible to the user. This meant the only possibility of accessing the hardware video encoder was to alter the Android source code and rebuild the OS, which proved impractical on many levels.

It was found that on a desktop running OS X, one can initialise the iSight camera in
such a way that the Quicktime framework can provide frames in either compressed
or uncompressed formats.

Unfortunately, it was discovered that in iOS, despite the similarities between the Core Video and QuickTime frameworks and the fact that the same compressed data pixel format was declared in the CoreVideo header file, when the camera was initialised with the request the error "Access to compressed frames on iOS is currently not available" was returned. An investigation into the possibility of using iOS private frameworks
was begun but the undocumented nature of these frameworks resulted in this being
an impossibility. Furthermore it was found that gaining root access to an iOS device
did not override this limitation. Thus it was concluded that there is no realistic way
of accessing the ISP on either iOS or on Android.

4.4 Classifier
Once the motion vectors are thought to be sufficiently cleaned, this theory will be
tested by determining if this dataset can be used to classify normal and abnormal
activity in a scene.

In the spirit of the rest of the project, it was deemed necessary to implement as lightweight a classifier as possible. It was decided that the most sensible approach was to determine both the sum of absolute motion and the number of significantly changed macroblocks.

Previous research has shown that there are much more complex methods of analysing
the compressed data features, however they were not appropriate for this use case.
This is because previous research has exclusively looked at the analysis of prerecorded
video where time of execution is not a critical factor. This allowed for methods such
as backpropagation for hidden Markov model generation, multi-layered perceptrons
and other machine learning techniques which require temporal as well as spatial
analysis to be performed to generate models.

When analysing live video there is a much stronger emphasis on speed rather than
accuracy, with a much greater need for unsupervised learning. The conclusion was
drawn that if a simple classifier can be shown to be effective then this is much more
significant than the implementation of a needlessly complex classifier.


4.4.1 Sum of Absolute Motion

Once we have a clean dataset we can assume that every vector represents motion of
an object in the scene. Contrary to all the papers discovered in my research, it was
found that segmentation and temporal tracking was not required. The cumulative
amount of different motion in the frame could be surmised by keeping a running average of each macroblock index's most common direction of motion. This angle
was calculated by having the average vector matrix store the x and y component for
its vector and also keeping track of the number of times this macroblock has had
its value updated. This way, when a new vector is presented in a frame, the new
iterative average can be calculated using the following algorithm:

av_{n} = av_{n-1} + (v_n - av_{n-1}) / av_{n-1}.occurrence    (2)

This takes an average of all the vectors that have ever been assigned to the mac-
roblock, putting a relevant weighting on the new value by using an incremental mean.
Now when a new vector is presented, the angle can be calculated from its absolute
pixel vector components. This is calculated from the following formula

Δθ(v_n, av_{n-1}) = | atan2(y(v_n), x(v_n)) * (180/π) - atan2(y(av_{n-1}), x(av_{n-1})) * (180/π) |    (3)

From this equation, a value arises for how different the angle between the new and
the average vector is. This value is simply added to the global total for the current
frame. It was found that this value was a clear indicator of how much the motion occurring within the frame differed from the learnt average.

4.4.2 Number of Majorly Different Macroblocks

When the absolute difference between the new and average vector is calculated, it
is checked against a predetermined threshold value. A threshold value of 45 degrees was decided upon. If the calculated value was above this threshold then the macroblock was
deemed to have majorly changed. Once every macroblock has been processed, the
number of majorly different macroblocks is then checked against the preset threshold
value for what is considered normal. In the test sequences used, this was calculated
as follows:

Threshold = (mb_width × mb_height) / 256

mb_width refers to the number of macroblocks across and mb_height refers to the number of macroblocks vertically. This meant that for a 1080p input video, a threshold value of 30 macroblocks was chosen.

The success of this simple binary classifier is hugely dependent on the vector space
being almost entirely clean of noise. If presented with a clean vector space for
each frame then the classifier should initially find every frame abnormal, but slowly
habituate to a scene.
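Putting the two measures together, a compact sketch of the per-macroblock update is given below; the structure layout and the wrap-around handling are assumptions of the example, and the 45 degree and (mb_width x mb_height)/256 thresholds are the values quoted above.

```c
#include <math.h>

typedef struct {
    double x, y;        /* running average vector for this macroblock */
    int    occurrence;  /* how many vectors have contributed so far   */
} AvgVector;

/* Angle of a vector in degrees. */
static double angle_deg(double x, double y)
{
    return atan2(y, x) * 180.0 / M_PI;
}

/* Process one filtered motion vector for a macroblock: return the absolute
 * angular difference from the learnt average (equation (3)) and update the
 * incremental mean (equation (2)). */
static double classify_vector(AvgVector *av, double vx, double vy)
{
    double diff = 0.0;
    if (av->occurrence > 0) {
        diff = fabs(angle_deg(vx, vy) - angle_deg(av->x, av->y));
        if (diff > 180.0)
            diff = 360.0 - diff;       /* assumed wrap-around handling */
    }
    av->occurrence++;
    av->x += (vx - av->x) / av->occurrence;
    av->y += (vy - av->y) / av->occurrence;
    return diff;
}

/* Per frame, the returned differences are summed, and any macroblock whose
 * difference exceeds 45 degrees counts as "majorly different". The frame is
 * flagged abnormal when that count exceeds (mb_width * mb_height) / 256,
 * roughly 30 macroblocks for 1080p video. */
```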

4.5 Submitted Code


Note that the code written is contained on the enclosed CD. On this disc there are
3 project folders.

Project Name      Platform   Stages Implemented
MV FromFileOSX    x86        1
MVExtractor       ARM        2, 4
iSight            x86        3

MV FromFileOSX is decoding a video file via FFmpeg and is the main project
with all the vector filtering and the abnormal motion classifier.

The following are projects which were under development but were not completed
and should largely be ignored.

MVExtractor is an iOS project which attempts to either load a file for decoding
and processing, or connect to a UDP stream for processing live video

iSight is an OS X project which successfully encodes and decodes live video from a MacBook iSight camera.

All of the above projects share almost 90% of the same code. Almost all functionality
is in the Objective-C classes Decoder and Encoder

N.B. In all the above projects, the function to extract the vectors, called extract_vectors in Decoder.m, is largely extracted from the FFmpeg source code in h264.c.


5 Experimentation
5.1 Live Video
A critical aim of this project was to implement the algorithms in real time on live
video. Vector extraction and filtration was found to be possible on live video; however, without access to the ISP this video required both encoding into H.264 from bitmapped images and then immediately decoding it again.

This technique worked for real time vector extraction on a powerful desktop operating
system but was incredibly resource intensive, owing to the software based video
encoding. This rendered this technique unrealistic for use on the test ARM device.

The live video analysis can be seen in Figure 13, where a clear difference between the actual motion and the delayed motion vectors can be seen. This is introduced by the time delay in vector processing induced by the extra computational load of software based video encoding. This pushed a 2.5 GHz dual core x86-64 processor to 180% load and utilised 180MB of RAM. The test ARM device which contained
a top of the range Apple A5 SoC could only achieve a maximum framerate of 4fps
when encoding video to H.264, even whilst utilising the onboard NEON SIMD. The
device would promptly run out of available heap space as iOS only allows for 60MB
of RAM for an application. This was a hindrance to the project; however, it further
emphasised the importance of being able to access the ISP on a mobile device.

Without access to the ISP, the use of live video sequences was not conducive to prov-
ing the efficacy of compressed domain analysis. Furthermore, it was decided that for the purposes of calibrating the filtration algorithms, prerecorded test video sequences would be adequate. If, in the future, compressed data from the ISP can be accessed and partially decoded, then the filtration algorithms and semantic analysis could then be applied under the assumption that it
would only ever be faster than having to fully encode and/or fully decode the video
sequence.

Figure 13: Live video from a webcam with delayed motion vectors overlaid onto the image.


5.2 Test Scenes


In order to test the effectiveness of the extraction and filtration, various test scenes
were used. These were as follows:

5.2.1 Road.mov

This sequence was filmed out of a window onto a public road with a reasonable amount
of traffic flowing in each direction. This scene can be seen in Figure 14.

Figure 14: A scene depicting a busy public highway

This scene was chosen as it had both traffic and pedestrians, flowing regularly in
both directions. It was also subject to all forms of vector noise. This scene was
encoded in H.264/MPEG-4 AVC Baseline at 1920x1080 30fps on an Apple iPhone
4S

To demonstrate the noise present in this scene, a heatmap of motion intensity on the
unfiltered vector space was generated, as seen in Figure 15.

5.2.2 Room.mov

This scene can be seen in Figure 16. The scene was chosen as a static indoor scene
would be a good indicator of how much noise was left, when there was nothing
moving in the scene. Frames were still subject to all forms of vector noise owing to
white walls, footsteps and a doorway inducing textureless vectors, global motion and
edge vectors respectively. The scene was encoded in H.264/MPEG-4 AVC Baseline at 1920x1080 30fps on an Apple iPhone 4S.

5.2.3 Motorway.mov

This scene was used as it was generally free from global motion and showed consis-
tent moving objects in multiple directions at several different times of day.


Figure 15: A heatmap of intensity values based on the frequency and magnitude of
motion in Road.mov. N.B. this was captured without global motion estimation, so the edge vectors are still clearly visible.

The different scenes in this sequence can be seen in Figures 17 - 19.

5.3 Extraction and filtration of the motion vectors


The test video sequences were initially run with no form of filtering, so as to demon-
strate the extent of the noise that was being removed. The test sequences all showed
a huge amount of noise.

In all the following figures, the motion vectors have been visualised according to the
macroblock size and the colour indicates the direction of travel.

5.3.1 Base noise level

See figures 20 - 24

5.3.2 Removal of textureless motion vectors

These screenshots indicate vectors which have been filtered by their DCT coefficients
and also by ignoring the vectors which the decoder has flagged as not representing
any worthwhile information by listing them in the MB Skip table of the frame. See
Figures 25 - 29

5.3.3 Removal of textureless motion vectors and global motion

See Figures 30 - 32

From these figures you can clearly see the progress in the reduction of noise through
each step.


Figure 16: A scene depicting a mostly motionless room with occasional human
traversal across the scene

Figure 17: A video filmed from a stationary camera of motorway traffic


Figure 18: A video filmed from a stationary camera of motorway traffic

Figure 19: A video filmed from a stationary camera of motorway traffic


Figure 20: Visualisation of unmanipulated motion vectors

Figure 21: Visualisation of unmanipulated motion vectors


Figure 22: Visualisation of unmanipulated motion vectors

Figure 23: Visualisation of unmanipulated motion vectors


Figure 24: Visualisation of unmanipulated motion vectors

Figure 25: Visualisation of motion vectors filtered by texture level and MB Skip
table


Figure 26: Visualisation of motion vectors filtered by texture level and MB Skip
table

Figure 27: Visualisation of motion vectors filtered by texture level and MB Skip
table


Figure 28: Visualisation of motion vectors filtered by texture level and MB Skip
table

Figure 29: Visualisation of motion vectors filtered by texture level and MB Skip
table


Although the DCT coefficient analysis does reduce the noise somewhat, it pales in
comparison to the gains from implementing the global motion histogram. The global
motion histogram even went so far as to eliminate the edge vector noise, so it was
deemed unnecessary to implement any further edge reduction algorithms, both for the
sake of efficiency and for lack of necessity.
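
The sketch below illustrates one plausible reading of the global motion histogram: the
surviving vectors are binned in a two-dimensional histogram, the mode is taken as the
global (camera) motion, and vectors close to that mode are discarded. The bin range and
tolerance are assumptions, and this is a sketch rather than a transcript of the routine
actually implemented.

    #include <stdlib.h>
    #include <string.h>

    #define BIN_RANGE 32                   /* assumed clamp on vector components */
    #define BINS      (2 * BIN_RANGE + 1)

    /* Estimate the dominant (global) motion as the mode of a 2-D histogram of the
     * surviving vectors, then mark for removal any vector close to that mode.
     * mv is an n x 2 array of vectors; keep[] receives 1 to retain, 0 to discard. */
    void suppress_global_motion(int n, const int (*mv)[2], int tolerance, int *keep)
    {
        static int hist[BINS][BINS];
        memset(hist, 0, sizeof(hist));

        for (int i = 0; i < n; i++) {
            int x = mv[i][0], y = mv[i][1];
            if (x < -BIN_RANGE || x > BIN_RANGE || y < -BIN_RANGE || y > BIN_RANGE)
                continue;
            hist[y + BIN_RANGE][x + BIN_RANGE]++;
        }

        int best = 0, gx = 0, gy = 0;      /* mode of the histogram = global motion */
        for (int y = 0; y < BINS; y++)
            for (int x = 0; x < BINS; x++)
                if (hist[y][x] > best) {
                    best = hist[y][x];
                    gx = x - BIN_RANGE;
                    gy = y - BIN_RANGE;
                }

        for (int i = 0; i < n; i++)
            keep[i] = abs(mv[i][0] - gx) > tolerance ||
                      abs(mv[i][1] - gy) > tolerance;
    }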

After adjusting the calibration of the filters, it was determined that the DCT
coefficient filtering was not as effective as hoped, most likely owing to quantisation
and to quirks of FFmpeg. Often the DCT coefficients for significant portions of a
frame would be entirely zero; at other times the coefficients did not accurately
represent the texture level, as low-textured areas showed high coefficients in the
high-frequency bands. This was unfortunate; however, the filtering system as a sum of
all its constituent parts was found to perform more than suitably.

5.4 Classifier
The classifier which was implemented determined the abnormality of a frame from its
sum of absolute differences against the average learnt scene, and from the number of
macroblocks whose motion angle differed by more than 45 degrees from what was
expected. The classifier took a screenshot of any frame that exceeded the threshold
level of normality.
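
A minimal sketch of this classifier is given below, assuming an exponential running
average for the learnt scene; the update rule, the grid dimensions and the threshold
parameters are illustrative assumptions, as the report does not reproduce the exact
values used.

    #include <math.h>

    #define MB_W 120   /* assumed macroblock grid for a 1080p sequence */
    #define MB_H 68

    static float avg_mv[MB_H][MB_W][2];   /* learnt average motion per macroblock */

    /* Returns nonzero if the frame should be captured as abnormal.
     * alpha, sad_threshold and angle_count_threshold are illustrative tuning values. */
    int classify_frame(const float mv[MB_H][MB_W][2],
                       float alpha, float sad_threshold, int angle_count_threshold)
    {
        const float PI_F = 3.14159265358979f;
        float sad = 0.0f;
        int angle_outliers = 0;

        for (int y = 0; y < MB_H; y++)
            for (int x = 0; x < MB_W; x++) {
                /* sum of absolute differences against the learnt scene */
                sad += fabsf(mv[y][x][0] - avg_mv[y][x][0]) +
                       fabsf(mv[y][x][1] - avg_mv[y][x][1]);

                /* count macroblocks whose direction is more than 45 degrees off */
                float observed = atan2f(mv[y][x][1], mv[y][x][0]);
                float expected = atan2f(avg_mv[y][x][1], avg_mv[y][x][0]);
                float diff = fabsf(observed - expected);
                if (diff > PI_F)
                    diff = 2.0f * PI_F - diff;
                if (diff > PI_F / 4.0f)
                    angle_outliers++;

                /* habituation: fold the new frame into the learnt average */
                avg_mv[y][x][0] = (1.0f - alpha) * avg_mv[y][x][0] + alpha * mv[y][x][0];
                avg_mv[y][x][1] = (1.0f - alpha) * avg_mv[y][x][1] + alpha * mv[y][x][1];
            }

        return sad > sad_threshold || angle_outliers > angle_count_threshold;
    }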

This classifier was tested on all of the test sequences with the following results:

5.4.1 Road.mov

The classifier quickly habituated to the scene, and the captured screenshots reduced
from 5 per second to 5 per minute over a 5 minute unsupervised learning period.
Based on the screenshots taken, it became evident what kind of behaviour it had learnt
as not being normal. There was a very clear emphasis on cars and pedestrians travelling
from right to left. This was both correct and expected: the video was recorded at the
main entrance to a university campus ten minutes before the hour, so the majority of
motion was people travelling left to right on their way to the university.

5.5 Room.mov
This scene habituated extremely quickly, with screenshots starting at around 5 per
second and reducing to 1 per minute except when an individual was travelling across
the screen.

5.6 Motorway.mov
There was not any abnormal motion in these scenes, which was unremarkable given
the nature of the footage. The classifier flagged little other than the scene changes
and abnormally large vehicles.

5.7 Efficiency
Unfortunately, only stage 1 was fully realised, with stages 2-4 being only partially
implemented.


Figure 30: Visualisation of motion vectors filtered by texture level and global
motion

Figure 31: Visualisation of motion vectors filtered by texture level and global
motion


This was owing to the extremely prolonged amount of time spent attempting to compile
FFmpeg with libx264 (the third-party H.264 software encoding library) for an ARM
processor, and to the crippling limitations that the manufacturers impose on access
to the onboard ISP on both Android and iOS.

However, whilst executing on OS X v10.7 the performance was monitored and found to be
pleasingly efficient, remarkably so given that only stage 1 was fully realised for
testing and the video was being decoded entirely in software, which places a large
load on the system from the inverse DCT functions needed to reconstitute the image.
This was much more computation than had been anticipated. Despite this, the software
consistently used a maximum of 200 MB of memory, which was almost entirely used by
libx264 in the decoding process. This was determined using Apple's Instruments
developer tool, which tracks mallocs throughout execution. Furthermore, the video
ran at 30 fps at 1080p.


Figure 32: Visualisation of motion vectors filtered by texture level and global
motion


6 Critical Evaluation
In hindsight, this project was hugely ambitious, especially given my complete lack of
prior knowledge of computer vision or motion analysis. This was very much shown by
the amount of time spent learning about the H.264 and MPEG specifications. Had I
known the complexity of video encoding, I would have dialled back the scope of the
project to simply vector filtration on a desktop operating system.

However, I do feel that the software I have written achieves three of the four goals it
set out to meet. It extracts and filters the vectors very effectively, based on the best
approaches found in previous research together with my own algorithms, and it runs
extremely quickly in the testing undertaken.

In terms of my personal approach to the project, I feel it was managed well, with
regular visits to my supervisor to keep him informed of my progress. All the work
undertaken and the code written was the result of comprehensive background research.
Despite not being able to implement the software on a mobile operating system, the
work and time spent investigating these platforms could easily account for half the
time spent on the project.

I feel this work has a lot of potential, and if carried on further there could be a lot
to gain. If work were to continue, it would be desirable to have the filter and
classifier running on an ARM processor on live video, with access to the ISP and with
a partial decoder written to parse the NAL units. Many classifiers have already been
researched which would be applicable to this data, and they could readily be
implemented and tested for efficiency on a smartphone.


7 Conclusion
This project has determined that it is completely viable to perform semantic motion
analysis based purely on video data in the compressed domain. This is true both in the
sense that enough meaning can be extracted, and in the sense that it can be achieved
with a sufficiently low computational burden to offer a huge advantage over analysis
in the uncompressed domain. Directly arising from this is the added economy in power
consumption on a device running from a battery.

It has been determined that the inherent flaws of using the macroblock motion vectors
are significant, owing to their raison d'être. However, it is possible to perform
several methods of filtration on the vectors to remove the outliers and be left only
with motion data representative of moving video objects. Without filtration it is very
clear that the vectors do not represent coherent motion of objects in a frame and are
simply a means of compressing the data as much as possible. However, after performing
various kinds of noise reduction and texture analysis, the motion vectors can be
classified accurately enough to become a viable dataset with a wealth of possibilities
for further semantic analysis.

A rudimentary classifier has been created by summing the absolute differences between
a presented frame and the learnt histogram of average motion from previous frames, and
determining abnormality by assuming a direct correlation between the magnitude of that
sum and the presence of abnormal motion in the scene. However, beyond proving the
efficacy of this dataset, the problem of creating a classifier is outside the scope of
this project, as the problem of extracting semantics from segmented moving video
objects has already been comprehensively explored. The novel finding of this project
is that compressed domain analysis could, in terms of both viability of the dataset
and speed of execution, be applied to live video and performed in realtime on a
low-powered piece of highly restrictive hardware, given access to the hardware video
encoder.


References
[1] Din-Yuen Chan, Zih-Siang Lin, and Pei-Shan Wu. Rate controlling based on
effective source-complexity metrics appropriated to h.264/avc. In Computer
Symposium (ICS), 2010 International, pages 142-147, dec. 2010.

[2] Neva Cherniavsky, Anna C Cavender, Richard E Ladner, and Eve A Riskin.
Variable frame rate for low power mobile sign language communication, pages
163-170. ACM, 2007.

[3] How-Lung Eng and Kai-Kuang Ma. Spatiotemporal segmentation of moving
video objects over mpeg compressed domain. In Multimedia and Expo, 2000.
ICME 2000. 2000 IEEE International Conference on, volume 3, pages 1531-1534
vol.3, 2000.

[4] M. Fiedler. Implementation of basic h.264/avc decoder. Seminar paper at
Chemnitz University of Technology, pages 1-28, 2004.

[5] Wolfgang Hesseler and Stefan Eickeler. Mpeg-2 compressed-domain algorithms
for video analysis. EURASIP J. Appl. Signal Process., 2006:186-186, January
2006.

[6] Nahum Kiryati, Tammy Riklin Raviv, Yan Ivanchenko, and Shay Rochel. Real-
time abnormal motion detection in surveillance video, 2008.

[7] Aleš Leonardis, Horst Bischof, Axel Pinz, Herbert Bay, Tinne Tuytelaars, and
Luc Van Gool. Computer Vision - ECCV 2006: SURF: Speeded Up Robust
Features, volume 3951. Springer Berlin Heidelberg, 2006.

[8] Haowei Liu, Ming-Ting Sun, Ruei-Cheng Wu, and Shiaw-Shian Yu. Automatic
video activity detection using compressed domain motion trajectories for h.264
videos. J. Vis. Comun. Image Represent., 22(5):432-439, July 2011.

[9] Burak Ozer, Wayne Wolf, and Ali N. Akansu. Human activity detection in mpeg
sequences. In Workshop on Human Motion '00, pages 61-66, 2000.

[10] Kari Pulli, Anatoly Baksheev, Kirill Kornyakov, and Victor Eruhimov. Realtime
computer vision with opencv. Queue, 10(4):40:40-40:56, April 2012.

[11] S. Rao and P.S. Sastry. Abnormal activity detection in video sequences us-
ing learnt probability densities. In TENCON 2003. Conference on Convergent
Technologies for Asia-Pacific Region, volume 1, pages 369-372 Vol.1, oct. 2003.

[12] G. Takacs, V. Chandrasekhar, B. Girod, and R. Grzeszczuk. Feature track-
ing for mobile augmented reality using video coder motion vectors. In Mixed
and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International
Symposium on, pages 141-144, nov. 2007.

[13] Takanori Yokoyama, Shuhei Ota, and Toshinori Watanabe. Noisy mpeg motion
vector reduction for motion analysis. In Proceedings of the 2009 Sixth IEEE
International Conference on Advanced Video and Signal Based Surveillance,
AVSS '09, pages 274-279, Washington, DC, USA, 2009. IEEE Computer Society.

[14] Wei Zeng, Jun Du, Wen Gao, and Qingming Huang. Robust moving object
segmentation on h.264/avc compressed video using the block-based mrf model.
Real-Time Imaging, 11(4):290-299, August 2005.

[15] Hua Zhong, Jianbo Shi, and M. Visontai. Detecting unusual activity in video.
In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings
of the 2004 IEEE Computer Society Conference on, volume 2, pages II-819 -
II-826 Vol.2, june-2 july 2004.


Appendices
Appendix A Bibliography
[1] Oren Boiman and Michal Irani. Detecting irregularities in images and in video.
Int. J. Comput. Vision, 74:17-31, August 2007.

[2] Din-Yuen Chan, Zih-Siang Lin, and Pei-Shan Wu. Rate controlling based on
effective source-complexity metrics appropriated to h.264/avc. In Computer
Symposium (ICS), 2010 International, pages 142-147, dec. 2010.

[3] Neva Cherniavsky, Anna C Cavender, Richard E Ladner, and Eve A Riskin.
Variable frame rate for low power mobile sign language communication, pages
163-170. ACM, 2007.

[4] F Colace, M De Santo, M Molinara, and G Percannella. Noisy motion vectors
removal for reliable camera parameters estimation in MPEG coded videos. IEEE,
2003.

[5] James W. Davis and Mark A. Keck. Modeling behavior trends and detecting
abnormal events using seasonal kalman filters.

[6] How-Lung Eng and Kai-Kuang Ma. Spatiotemporal segmentation of moving
video objects over mpeg compressed domain. In Multimedia and Expo, 2000.
ICME 2000. 2000 IEEE International Conference on, volume 3, pages 1531-1534
vol.3, 2000.

[7] M. Fiedler. Implementation of basic h.264/avc decoder. Seminar paper at
Chemnitz University of Technology, pages 1-28, 2004.

[8] Roger S. Gaborski, Vishal S. Vaingankar, Vineet Chaoji, and Ankur Teredesai.
Venus: A system for novelty detection in video streams with learning. In
FLAIRS Conference '04, pages 11, 2004.

[9] I. Haritaoglu, D. Harwood, and L.S. Davis. W4: real-time surveillance of people
and their activities. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 22(8):809-830, aug 2000.

[10] Alexander Haubold and Milind Naphade. Classification of video events using
4-dimensional time-compressed motion features. In Proceedings of the 6th ACM
international conference on Image and video retrieval, CIVR '07, pages 178-185,
New York, NY, USA, 2007. ACM.

[11] Wolfgang Hesseler and Stefan Eickeler. Mpeg-2 compressed-domain algorithms
for video analysis. EURASIP J. Appl. Signal Process., 2006:186-186, January
2006.

[12] C. Kas, M. Brulin, H. Nicolas, and C. Maillet. Compressed domain aided anal-
ysis of traffic surveillance videos. In Distributed Smart Cameras, 2009. ICDSC
2009. Third ACM/IEEE International Conference on, pages 1-8, 30 2009-sept.
2 2009.

[13] Christian Kas and Henri Nicolas. An approach to trajectory estimation of mov-
ing objects in the h.264 compressed domain. In Proceedings of the 3rd Pacific
Rim Symposium on Advances in Image and Video Technology, PSIVT '09, pages
318-329, Berlin, Heidelberg, 2008. Springer-Verlag.

[14] Nahum Kiryati, Tammy Riklin Raviv, Yan Ivanchenko, and Shay Rochel. Real-
time abnormal motion detection in surveillance video, 2008.

[15] Aleš Leonardis, Horst Bischof, Axel Pinz, Herbert Bay, Tinne Tuytelaars, and
Luc Van Gool. Computer Vision - ECCV 2006: SURF: Speeded Up Robust
Features, volume 3951. Springer Berlin Heidelberg, 2006.

[16] Xiaokun Li and Fatih M. Porikli. A hidden markov model framework for traffic
event detection using video features. In ICIP 2004.

[17] Haowei Liu, Ming-Ting Sun, Ruei-Cheng Wu, and Shiaw-Shian Yu. Automatic
video activity detection using compressed domain motion trajectories for h.264
videos. J. Vis. Comun. Image Represent., 22(5):432-439, July 2011.

[18] G. Medioni, I. Cohen, F. Bremond, S. Hongeng, and R. Nevatia. Event detection
and analysis from video streams. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 23(8):873-889, aug 2001.

[19] N/A. Homepage of the ffmpeg codec library, December 2011.

[20] N/A. Homepage of the x264 encoding library, December 2011.

[21] Burak Ozer, Wayne Wolf, and Ali N. Akansu. Human activity detection in mpeg
sequences. In Workshop on Human Motion '00, pages 61-66, 2000.

[22] S. Rao and P.S. Sastry. Abnormal activity detection in video sequences us-
ing learnt probability densities. In TENCON 2003. Conference on Convergent
Technologies for Asia-Pacific Region, volume 1, pages 369-372 Vol.1, oct. 2003.

[23] Chris Stauffer and W. Eric L. Grimson. Learning patterns of activity using real-
time tracking. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):747-757, August
2000.

[24] Yeping Su, Ming-Ting Sun, and V. Hsu. Global motion estimation from coarsely
sampled motion vector field and the applications. Circuits and Systems for Video
Technology, IEEE Transactions on, 15(2):232-242, feb. 2005.

[25] G. Takacs, V. Chandrasekhar, B. Girod, and R. Grzeszczuk. Feature track-
ing for mobile augmented reality using video coder motion vectors. In Mixed
and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International
Symposium on, pages 141-144, nov. 2007.

[26] Kun Tao, Shouxun Lin, and Yongdong Zhang. Compressed domain motion
analysis for video semantic events detection. In Proceedings of the 2009 WASE
International Conference on Information Engineering - Volume 01, ICIE '09,
pages 201-204, Washington, DC, USA, 2009. IEEE Computer Society.

[27] Takanori Yokoyama, Shuhei Ota, and Toshinori Watanabe. Noisy mpeg motion
vector reduction for motion analysis. In Proceedings of the 2009 Sixth IEEE
International Conference on Advanced Video and Signal Based Surveillance,
AVSS '09, pages 274-279, Washington, DC, USA, 2009. IEEE Computer Society.

[28] K Yoon, D DeMenthon, and D Doermann. Event detection from mpeg video in
the compressed domain. Proceedings 15th International Conference on Pattern
Recognition ICPR 2000, pages 819-822, 2000.

[29] Wei Zeng, Jun Du, Wen Gao, and Qingming Huang. Robust moving object
segmentation on h.264/avc compressed video using the block-based mrf model.
Real-Time Imaging, 11(4):290-299, August 2005.

[30] Hua Zhong, Jianbo Shi, and M. Visontai. Detecting unusual activity in video.
In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings
of the 2004 IEEE Computer Society Conference on, volume 2, pages II-819 -
II-826 Vol.2, june-2 july 2004.

Appendix B Project Brief

Analysis of MPEG4 video compression artifacts for
computationally cheap feature extraction

Joseph Conway
Supervisor: Eric Cooke

Computer vision systems can be split into a four stage process: Image acquisition,
feature detection, semantic analysis & decision making. The process of analysing a
video feed requires a large amount of processing power and system resources. One of
the most complex parts of the process is feature extraction. During feature extrac-
tion it is the responsibility of the algorithm to determine which elements of the image
are worthy of further analysis and which parts of the image can be ignored. Feature
extraction involves detecting edges, ridges, textures or corners in conjunction with
segmentation or blobbing for use in the higher level semantic analysis.

The complexity of feature extraction algorithms has significant repercussions in the
design of computer vision systems. System designers often have to compromise on the
functionality of an application or system owing to limited processing power available.
With the increasing popularity of mobile platforms, it would be extremely desirable
to perform feature detection and analysis on a live video feed in a fast, simple and
low power fashion with no compromise on the input resolution or frame rate.

I propose the usage of compression artifacts inherent in an h.264/AVC video stream
as the basis for an extremely low complexity and novel means of feature extraction.
Specifically, the macroblock motion vectors which are encoded by a dedicated hard-
ware chip. These motion vectors can be used for performing motion detection with
almost no computation at all. As such, further semantic analysis can be performed
with a reasonably low total cost of computation. Previously, algorithms have been
able to analyse only key frames and on a massive per-pixel basis. Owing to the small
data structure involved in this algorithm, it is possible that higher level analysis
could be performed by established machine learning and classification algorithms
implemented on the low resolution matrix of vectors.

The project will be implemented and tested on mobile development platforms such as
Google Android and Apple iOS. Such a development platform is perfect as current
smartphones have limited resources available yet have embedded cameras capable of
producing high definition h.264/AVC video.


Appendix C Code Snippets


// Headers assumed by the following snippets:
#import <Foundation/Foundation.h>
#include <string.h>
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

- (int)openFile:(const char *)fn {
    char *filename = strdup(fn);   // keep a copy of the path for av_dump_format

    // fn = "udp://@?localport=1234";

    av_register_all();

    // open file
    if (avformat_open_input(&pFormatCtx, fn, NULL, NULL) != 0) {
        NSLog(@"Couldn't open file");
    }
    // Retrieve stream information
    if (avformat_find_stream_info(pFormatCtx, NULL) < 0) {
        NSLog(@"Couldn't find stream information");
    }

    // Dump information about the file onto standard error
    av_dump_format(pFormatCtx, 0, filename, 0);
    int i;

    // Find the first video stream
    videoStream = -1;
    for (i = 0; i < pFormatCtx->nb_streams; i++)
        if (pFormatCtx->streams[i]->codec->codec_type == AVMEDIA_TYPE_VIDEO) {
            videoStream = i;
            break;
        }
    NSLog(@"vs %d", videoStream);
    if (videoStream == -1) {
        NSLog(@"didn't find video stream");
    }
    // Get a pointer to the codec context for the video stream
    pCodecCtx = pFormatCtx->streams[videoStream]->codec;

    AVCodec *pCodec;

    // Find the decoder for the video stream
    pCodec = avcodec_find_decoder(pCodecCtx->codec_id);
    if (pCodec == NULL) {
        NSLog(@"Unsupported codec!\n");
    }
    // Open codec
    if (avcodec_open2(pCodecCtx, pCodec, NULL) < 0) {
        NSLog(@"could not open codec");
    }
    pCodecCtx->lowres = 2;
    // Combine the debug flags (a plain assignment would overwrite the previous flag)
    pCodecCtx->debug |= FF_DEBUG_DCT_COEFF;
    pCodecCtx->debug |= FF_DEBUG_MB_TYPE;
    pCodecCtx->debug |= FF_DEBUG_MV;
    pCodecCtx->debug |= FF_DEBUG_VIS_MB_TYPE;

    pCodecCtx->debug_mv |= FF_DEBUG_VIS_MV_B_FOR;
    // pCodecCtx->debug_mv |= FF_DEBUG_VIS_MV_B_BACK;
    pCodecCtx->debug_mv |= FF_DEBUG_VIS_MV_P_FOR;
    // Allocate video frame
    pFrame = avcodec_alloc_frame();

    uint8_t *buffer;
    int numBytes;
    // Determine required buffer size and allocate buffer
    numBytes = avpicture_get_size(PIX_FMT_YUV420P, pCodecCtx->width,
                                  pCodecCtx->height);
    buffer = (uint8_t *)av_malloc(numBytes * sizeof(uint8_t));

    // Assign appropriate parts of buffer to the image planes in pFrame
    // Note that pFrame is an AVFrame, but AVFrame is a superset of AVPicture
    avpicture_fill((AVPicture *)pFrame, buffer, PIX_FMT_YUV420P,
                   pCodecCtx->width, pCodecCtx->height);

    [self initialiseDecoder];

    return 1;
}

- (int)nextFrame {

    // printf("\nNextFrame");
    int frameFinished = 0;
    AVPacket packet;
    while (frameFinished == 0) {
        int i = 0;
        if (av_read_frame(pFormatCtx, &packet) >= 0) {
            // Is this a packet from the video stream?
            if (packet.stream_index == videoStream) {
                // Decode video frame
                pCodecCtx->debug = FF_DEBUG_DCT_COEFF;
                avcodec_decode_video2(pCodecCtx, pFrame, &frameFinished, &packet);

                // Did we get a video frame?
                if (frameFinished) {
                    i++;
                    // empty vectors array
                    timeCounter++;
                    [parent setTime:timeCounter];
                    //[vectors removeAllObjects];
                    [self cleanVectors];
                    // populate vectors array

                    // average majorly changed MBs
                    // meanMajorlyDifferentMB = meanMajorlyDifferentMB + ((majorlyDifferentMBs ...
                    if (drawBMP) {
                        [self convertToBmp:i saveImg:NO];
                    }

                    printf("SAD:%d\n", sumAbsoluteDifference);
                    if (majorlyDifferentMBs > 30) {
                        // NSLog(@"SAD:%d %d", timeCounter / 30, majorlyDifferentMBs);
                        //[self convertToBmp:i saveImg:YES];
                    } else {
                        //[[parent imgView] setAlphaValue:parent.imgView.alphaValue * 0.9];
                    }
                    sumAbsoluteDifference = 0;
                    meanMajorlyDifferentMB = (meanMajorlyDifferentMB + majorlyDifferentMBs) / 2;

                    majorlyDifferentMBs = 0;

                    // Only P- and B-frames carry motion vectors
                    if (pFrame->pict_type != AV_PICTURE_TYPE_I) {
                        frameNumber++;
                        [self extract_vectors:pFrame];
                        if (filterMB)
                            [self estimateGlobalMotion];
                        [self updateAverages];

                        if (dokmeans) {
                            [self kmeans];
                        }

                        [[parent curView] setNeedsDisplay:YES];
                    } else {
                        if (!drawHeatmap) {
                            [self cleanAverageVectors];
                        }

                        [self cleanHistogram];

                        // draw vectors
                    }
                }

                // Free the packet that was allocated by av_read_frame
                av_free_packet(&packet);
            }
        } else {
            NSLog(@"Finished");
            [parent setFinished:YES];
            return 0;
        }
    }
    return 1;
}

// This routine closely follows FFmpeg's motion vector visualisation code.
- (void)extract_vectors:(AVFrame *)pict {
    const int shift = 1 + pFrame->motion_subsample_log2;
    int mb_y;
    uint8_t *ptr;
    int i;
    int h_chroma_shift, v_chroma_shift, block_height;
    const int width = pCodecCtx->width;
    const int height = pCodecCtx->height;
    const int mv_sample_log2 = 4 - pict->motion_subsample_log2;
    const int mv_stride = (mb_width << mv_sample_log2) +
                          (pCodecCtx->codec_id == CODEC_ID_H264 ? 0 : 1);
    int mb_stride = (width + 15) >> 4;

    ///// s->low_delay = 0; // needed to see the vectors without trashing the buffers

    avcodec_get_chroma_sub_sample(pCodecCtx->pix_fmt, &h_chroma_shift, &v_chroma_shift);

    pict->type = FF_BUFFER_TYPE_COPY;
    pict->opaque = NULL;
    ptr = pict->data[0];
    block_height = 16 >> v_chroma_shift;

    for (mb_y = 0; mb_y < mb_height; mb_y++) {
        int mb_x;
        for (mb_x = 0; mb_x < mb_width; mb_x++) {
            const int mb_index = mb_x + mb_y * (mb_width + 1);
            if ((pCodecCtx->debug_mv) && pict->motion_val) {
                int type;
                for (type = 0; type < 3; type++) {
                    int direction = 0;
                    switch (type) {
                    case 0:
                        if ((!(pCodecCtx->debug_mv & FF_DEBUG_VIS_MV_P_FOR)) ||
                            (pict->pict_type != AV_PICTURE_TYPE_P))
                            continue;
                        direction = 0;
                        break;
                    case 1:
                        if ((!(pCodecCtx->debug_mv & FF_DEBUG_VIS_MV_B_FOR)) ||
                            (pict->pict_type != AV_PICTURE_TYPE_B))
                            continue;
                        direction = 0;
                        break;
                    case 2:
                        if ((!(pCodecCtx->debug_mv & FF_DEBUG_VIS_MV_B_BACK)) ||
                            (pict->pict_type != AV_PICTURE_TYPE_B))
                            continue;
                        direction = 1;
                        break;
                    }
                    if (!USES_LIST(pict->mb_type[mb_index], direction))
                        continue;

                    if (IS_8X8(pict->mb_type[mb_index])) {
                        int i;
                        for (i = 0; i < 4; i++) {
                            int sx = mb_x * 16 + 4 + 8 * (i & 1);
                            int sy = mb_y * 16 + 4 + 8 * (i >> 1);
                            int xy = (mb_x * 2 + (i & 1) +
                                      (mb_y * 2 + (i >> 1)) * mv_stride) << (mv_sample_log2 - 1);
                            int mx = (pict->motion_val[direction][xy][0] >> shift) + sx;
                            int my = (pict->motion_val[direction][xy][1] >> shift) + sy;
                            // Discard vectors whose macroblock is listed in the MB skip table
                            if ((pict->mbskip_table[xy] < 1) || (filterMB == NO)) {
                                [self draw_arrow:sx sy:sy ex:mx ey:my type:i];
                            }
                        }
                    } else if (IS_16X8(pict->mb_type[mb_index])) {
                        int i;
                        for (i = 0; i < 2; i++) {
                            int sx = mb_x * 16 + 8;
                            int sy = mb_y * 16 + 4 + 8 * i;
                            int xy = (mb_x * 2 + (mb_y * 2 + i) * mv_stride) << (mv_sample_log2 - 1);
                            int mx = (pict->motion_val[direction][xy][0] >> shift);
                            int my = (pict->motion_val[direction][xy][1] >> shift);

                            if (IS_INTERLACED(pict->mb_type[mb_index]))
                                my *= 2;

                            if ((pict->mbskip_table[xy] < 1) || (filterMB == NO)) {
                                [self draw_arrow:sx sy:sy ex:mx ey:my type:4 + i];
                            }
                        }
                    } else if (IS_8X16(pict->mb_type[mb_index])) {
                        int i;
                        for (i = 0; i < 2; i++) {
                            int sx = mb_x * 16 + 4 + 8 * i;
                            int sy = mb_y * 16 + 8;
                            int xy = (mb_x * 2 + i + mb_y * 2 * mv_stride) << (mv_sample_log2 - 1);
                            int mx = pict->motion_val[direction][xy][0] >> shift;
                            int my = pict->motion_val[direction][xy][1] >> shift;

                            if (IS_INTERLACED(pict->mb_type[mb_index]))
                                my *= 2;

                            if ((pict->mbskip_table[xy] < 1) || (filterMB == NO)) {
                                [self draw_arrow:sx sy:sy ex:mx ey:my type:6 + i];
                            }
                        }
                    } else {
                        int sx = mb_x * 16 + 8;
                        int sy = mb_y * 16 + 8;
                        int xy = (mb_x + mb_y * mv_stride) << mv_sample_log2;
                        int mx = (pict->motion_val[direction][xy][0] >> shift) + sx;
                        int my = (pict->motion_val[direction][xy][1] >> shift) + sy;
                        if ((pict->mbskip_table[xy] < 1) || (filterMB == NO)) {
                            [self draw_arrow:sx sy:sy ex:mx ey:my type:8];
                        }
                    }
                }
            }
        }
    }
}
