
976 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 41, NO. 4, AUGUST 2011

Spatial Markov Kernels for Image Categorization and Annotation

Zhiwu Lu and Horace H. S. Ip

Abstract—This paper presents a novel discriminative stochastic method for image categorization and annotation. We first divide the images into blocks on a regular grid and then generate visual keywords through quantizing the features of image blocks. The traditional Markov chain model is generalized to capture 2-D spatial dependence between visual keywords by defining the notion of past as what we have observed in a row-wise raster scan. The proposed spatial Markov chain model can be trained via maximum-likelihood estimation and then be used directly for image categorization. Since this is a completely generative method, we can further improve it through developing new discriminative learning. Hence, spatial dependence between visual keywords is incorporated into kernels in two different ways, for use with a support vector machine in a discriminative approach to the image categorization problem. Moreover, a kernel combination is used to handle rotation and multiscale issues. Experiments on several image databases demonstrate that our spatial Markov kernel method for image categorization can achieve promising results. When applied to image annotation, which can be considered as a multilabel image categorization process, our method also outperforms state-of-the-art techniques.

Index Terms—Image annotation, image categorization, kernel methods, Markov models, visual keywords.

Manuscript received February 7, 2010; revised September 20, 2010; accepted December 12, 2010. Date of publication January 24, 2011; date of current version July 20, 2011. This work was supported in part by the Research Council of Hong Kong under Grant CityU 114007, by the City University of Hong Kong under Grant 7008040, and by the National Natural Science Foundation of China under Grants 60873154 and 61073084. This paper was recommended by Associate Editor D. Goldgof.
The authors are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: lzhiwu2@student.cityu.edu.hk; cship@cityu.edu.hk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCB.2010.2102749

I. INTRODUCTION

IMAGE categorization refers to labeling images into one of some predefined categories. Although this is usually not very difficult for humans, it has been proven to be an extremely challenging problem for machines, owing to variable and, sometimes, uncontrolled imaging conditions as well as complex and hard-to-describe objects in an image. In the literature, one direct strategy is to classify images using some low-level visual features such as color and texture. This approach considers each category as an individual object [1], [2] and is usually applied to classify only a small number of categories. Another more effective strategy adopts intermediate representation [3], [4] in order to reduce the gap between low- and high-level processing and thus match the scene/object model with the perception we humans have, which can therefore be used to classify a larger number of categories.

One influential line of work that follows the second strategy is the bag-of-words (BOW) approach, such as probabilistic latent semantic analysis (PLSA) [5] and latent Dirichlet allocation [6]. Derived from the statistical text literature, the BOW approach has now been widely used for the semantic analysis of image content [7], [8]. To generate visual keywords automatically for image representation, we first divide images into regions and then perform k-means clustering on the feature vectors extracted from these regions. Since regions within an image are assumed to be independently drawn from a generative model, previous BOW methods ignore spatial dependence between visual keywords.

To capture this spatial information within images, we propose a spatial Markov chain (SMC) model by representing each image as a 2-D sequence of visual keywords. Formation of such a 2-D sequence can be achieved through clustering feature vectors that are derived from image blocks on a regular grid. Unlike hidden Markov models (HMMs) [9], [10], our SMC model does not operate directly on the feature vectors of image blocks; these vectors are used only for generating the visual keywords. In fact, our SMC model can be regarded as a 2-D generalization of a Markov chain, obtained by employing a second-order neighborhood system on the regular grid and assuming the conditional independence of vertical and horizontal transitions between states. The notion of past in an SMC is defined as what we have observed in a row-wise raster scan on the regular grid. Based on our independence assumption of horizontal and vertical transitions, the computation of the probability of an observed image given an SMC model is tractable. In the literature, other 2-D extensions of a Markov chain have also been proposed, such as a pseudo-2-D Markov model [11] and a Markov mesh random field [12]. However, these Markov models either may not simultaneously consider the vertical and horizontal transitions in a tractable way or may not capture the local relationship between blocks because rows or columns of states serve as their calculation elements.

The standard algorithm to train our SMC model is maximum-likelihood estimation (MLE), and the trained model can be used directly for image categorization. Although this generative method is shown to be effective in later experiments, we can further improve it by developing new discriminative learning. That is, we can incorporate the spatial dependence between visual keywords captured by an SMC into two kernels, for use with a support vector machine (SVM) in a discriminative approach to the image categorization problem. The proposed two spatial Markov kernels (SMKs) differ in how the SMC models are estimated. If an SMC model is estimated for each image, a kernel serves to




measure the similarity between two SMC models. Otherwise, if an SMC model is estimated for each category, a kernel can be calculated based on the probabilities of two images being generated from this model. In this case, we can follow the idea of Perronnin and Dance [13] and Jaakkola et al. [14] to define a discriminative Fisher kernel.

As we have mentioned, image categorization is an extremely challenging problem. The fact that images could have quite different orientations and scales further adds difficulty to this problem. To handle such rotation and multiscale issues, we consider a kernel combination. That is, we first define SMKs at different orientations and scales and then combine them to make our technique less sensitive to the changes of orientations or scales across different images. Here, it should be noted that the rotation issue is challenging for an HMM and a Markov chain model because they are directed graphs. To the best of our knowledge, we have made the first effort to handle the rotation issue for such Markov models.

In this paper, we first combine our SMKs with an SVM classifier for image categorization. Moreover, we also apply our SMKs to image annotation, which is a more challenging image categorization problem (with a large number of categories) when each keyword is assumed to denote a category. Our main contributions can be summarized as follows.

1) The spatial dependence between visual keywords has been incorporated into our SMKs defined based on our SMC model.
2) A Fisher kernel has been defined based on our SMC model so that this generative model can be used for image categorization in a discriminative approach.
3) Multiscale extensions of our SMKs have been formulated to capture the global spatial layout of visual keywords.
4) Our SMKs can potentially be applied to many machine learning techniques that are used for image categorization.
5) Our SMKs have been shown to achieve promising results in the challenging application of image annotation.

The remainder of this paper is organized as follows. In Section II, we give a brief review of closely related work. Section III presents our spatial Markov model to capture the spatial dependence between visual keywords. In Section IV, this spatial information is further incorporated into our SMKs in two different ways. In Section V, a kernel combination is used to deal with rotation and multiscale issues. Section VI gives the algorithmic details of image categorization and annotation using our SMKs. Section VII presents the evaluation of our method for image categorization. In Section VIII, we compare our method with closely related work on image categorization. In Section IX, our method is evaluated in the challenging application of image annotation. Finally, Section X gives the conclusions drawn from our experimental results.

II. RELATED WORK

Since BOW-based intermediate representation and 2-D Markov models are combined in our proposed model, we give a brief review of these two techniques in the following.

A. Intermediate Representation

In the context of image categorization, one direct strategy is to classify images using low-level visual features such as color and texture. This strategy considers each category as an individual object and is usually applied to classify only a small number of categories, such as indoor versus outdoor or city versus landscape scenes. To narrow the semantic gap, we adopt intermediate representation [3], [4] for image categorization, which matches the scene/object model with the perception we humans have and can therefore be used to classify a larger number of categories. For example, BOW methods that follow this strategy have been shown to be effective in [7] and [8]. However, a serious problem with these BOW methods is that spatial dependence between visual keywords within images is usually not considered for image categorization.

Unlike BOW methods, this paper exploits spatial information of images for image categorization. In fact, both local and global spatial information are captured in this paper: The spatial dependence between visual keywords learned with our 2-D Markov model can be regarded as the local spatial information, and the spatial layout of visual keywords obtained with our multiscale kernel combination is the global spatial information. In the literature, most methods only consider either local or global spatial information. For example, the collapsed graph [15] and Markov stationary analysis [16] only learn local spatial information, whereas the constellation model [17] and spatial pyramid matching (SPM) [18] only capture the global layout of an image. Here, it should be noted that Fisher kernels [14] are used similarly in [17] and in this paper. Moreover, different from these methods that exploit spatial information with intermediate scene representation, the dominant spatial structure of a scene can also be represented by a set of perceptual dimensions proposed in [19]. However, our later experiments show that intermediate scene representation can achieve better results.

For the above image categorization methods that generate visual keywords based on ordinary k-means clustering, their overall performance can certainly be improved if we adopt other more advanced clustering algorithms such as modified global k-means [20]. However, in this paper, we focus on developing very general techniques to capture the spatial dependence between visual keywords. More importantly, since we aim to make a fair comparison, we adopt the ordinary k-means clustering, which is commonly used in closely related work.

B. 2-D Markov Models

Most natural images have an inherent layered structure that delineates semantically distinct regions. For example, for the beach scene, there are three horizontal regions (or layers) starting from the bottom of the images: sand, water, and sky. To learn the spatial structure of the images, a popular type of probabilistic graphical model, i.e., the Markov models such as an HMM [9], [10] and a Markov random field [21], has been applied to image content analysis. Previous work on an HMM focused on capturing transitions of low-level visual features extracted from different regions (or blocks) across

images or different resolutions (or scales) of images [22], [23]. In particular, although the spatial structure of the images was successfully captured by the HMM for 2-D shape recognition in [24] and [25], each 2-D shape had to first be represented as a 1-D sequence.

More recently, an HMM has been applied to model transitions of high-level semantic labels across an image in [26]. That is, an HMM is used to capture spatial dependence between semantic labels. This has been achieved by first dividing an image into equivalent blocks on a regular grid so that the spatial position can be characterized for each block. Through defining the notion of past as what we have observed in a row-wise raster scan on the regular grid, a novel 2-D spatial HMM (SHMM) [26] has been proposed to capture spatial dependence between semantic labels within an image. Although the SHMM can achieve an automatic assignment of semantic labels for each block in a test image, the initial manual assignment of these labels for the training image blocks has to be provided in order to train the SHMM. This is indeed a challenging task for a large database. Similar to the SHMM, our 2-D Markov model proposed in this paper can also model transitions of high-level semantic labels. However, unlike the SHMM, manual labeling of image blocks for model training is not necessary since each image block can be automatically labeled with a visual keyword.

In summary, this paper differs from our previous SHMM in that we automatically annotate each image block with a visual keyword, whereas the manual annotation of image blocks has to be provided as training data in [26]. Compared with our other previous work [27], this paper provides three additional contributions: 1) a Fisher kernel is defined based on our SMC model; 2) multiscale extensions are formulated for our SMKs; and 3) our SMKs are used for the challenging problem of image annotation. More importantly, we have successfully applied 2-D Markov models and kernel methods [28], [29] simultaneously to image categorization and annotation, which is different from other closely related approaches that have combined kernel methods with a 1-D HMM for protein classification [14] or with a Gaussian mixture model for image categorization [13]. Moreover, since we seamlessly integrate kernel methods and Markov models, our methods also differ from generative methods [30]-[32] that similarly make use of the Markov models for semantic analysis or face recognition.

III. SPATIAL MARKOV MODEL

In order to apply 2-D Markov models to image categorization, we have to first generate an image representation based on visual keywords. In this paper, the images are first divided into equivalent blocks on a regular grid, and the feature vectors extracted from the blocks are then clustered by k-means to give a vocabulary of visual keywords S = \{s_i\}_{i=1}^{M}. By a row-wise raster scan on the regular grid, each image can then be represented as a 2-D sequence of visual keywords, which have been automatically assigned to the respective blocks. More formally, an image with X \times Y blocks can be denoted as a sequence Q = q_{1,1} q_{1,2} \ldots q_{1,Y} q_{2,1} q_{2,2} \ldots q_{2,Y} \ldots q_{X,Y}, where q_{x,y} \in S (1 \le x \le X, 1 \le y \le Y) is the visual keyword automatically assigned to block (x, y) in the image.

Fig. 1. Second-order neighborhood system used in our SMC model.

We now present our SMC model as follows: Let Q_{x,y} denote the sequence of states from block (1, 1) to block (x, y) in a row-wise raster scan on the regular grid (see Fig. 1). We follow our previous work [26], [27] and formulate the defining property for an SMC model as

    P(q_{x,y} | Q_{x,y-1}) = P(q_{x,y} | q_{x,y-1}) P(q_{x,y} | q_{x-1,y}).    (1)

That is, given the previous state sequence Q_{x,y-1} of block (x, y), its state q_{x,y} only depends on the states (i.e., q_{x,y-1} and q_{x-1,y}) of its two previously visited neighbor blocks in a row-wise raster scan.

Elements of the proposed 2-D Markov model can now be formally defined as follows.

1) \Pi = \{\pi_i\} is the initial state distribution, where \pi_i = P(q_{1,1} = s_i).
2) The horizontal state transition matrix is H = \{h_{ij} : s_i, s_j \in S, 1 \le i, j \le M\}, where h_{ij} is defined as P(q_{x,y} = s_j | q_{x,y-1} = s_i).
3) The vertical state transition matrix is V = \{v_{ij} : s_i, s_j \in S, 1 \le i, j \le M\}, where v_{ij} is defined as P(q_{x,y} = s_j | q_{x-1,y} = s_i).

For convenience, the compact notation \lambda = \{\Pi, H, V\} is used to indicate the complete set of parameters of an SMC model.

Based on the Markov property defined in (1), the probability of a sequence of states Q given an SMC model \lambda can be formulated as

    P(Q | \lambda) = \pi_{q_{1,1}} \prod_{1 \le i,j \le M} h_{ij}^{n^h_{ij}(Q)} v_{ij}^{n^v_{ij}(Q)}    (2)

where n^h_{ij}(Q) (or n^v_{ij}(Q)) is the number of horizontal (or vertical) transitions from state s_i to state s_j that occur in the sequence Q.

We further present the estimation of the model parameters \lambda that maximize the probability of the sequences given the model, which reduces to an MLE of the model parameters. Given
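To make the role of the transition counts in (2) concrete, the log-likelihood can be evaluated directly from a grid of keyword indices. The following Python sketch is our own illustration (the function and variable names are hypothetical, not from the paper); it assumes Q is an X x Y integer array of keyword indices and that pi, H, and V are valid SMC parameters:

```python
import numpy as np

def smc_log_likelihood(Q, pi, H, V, eps=1e-12):
    """Log of Eq. (2): initial-state probability plus the log of every
    horizontal and vertical transition probability counted over the grid."""
    X, Y = Q.shape  # Q[x, y] is the keyword index of block (x, y)
    ll = np.log(pi[Q[0, 0]] + eps)
    for x in range(X):
        for y in range(1, Y):          # horizontal transitions within rows
            ll += np.log(H[Q[x, y - 1], Q[x, y]] + eps)
    for x in range(1, X):
        for y in range(Y):             # vertical transitions within columns
            ll += np.log(V[Q[x - 1, y], Q[x, y]] + eps)
    return ll
```

The small eps guards against log(0) for transitions that were never observed during training; how zero-probability transitions should be smoothed is not specified in the text above.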

Fig. 2. Illustration of the spatial dependence between visual keywords encoded in the learned SMC model, which can help to distinguish two images with the same BOW representation. We find that the horizontal transition matrices (i.e., H) learned from the two images are quite different. As for the vertical transition matrices (i.e., V), they show no difference because the two objects (i.e., A and C) in the front have the same vertical dependence.

a set of 2-D sequences C from one category, the parameters of an SMC model can then be derived as

    \pi_i = n_i(C) / \sum_{i'=1}^{M} n_{i'}(C)    (3)

    h_{ij} = n^h_{ij}(C) / \sum_{j'=1}^{M} n^h_{ij'}(C)    (4)

    v_{ij} = n^v_{ij}(C) / \sum_{j'=1}^{M} n^v_{ij'}(C)    (5)

where n_i(C) is the number of times that the state s_i occurs at block (1, 1) over the training set C, n^h_{ij}(C) = \sum_{Q \in C} n^h_{ij}(Q), and n^v_{ij}(C) = \sum_{Q \in C} n^v_{ij}(Q). The spatial dependence between visual keywords encoded in the learned SMC model can help to distinguish two images even with the same BOW representation, as shown in Fig. 2. We can find that the horizontal transition matrices (i.e., H) learned from the two images are quite different, which is consistent with the fact that objects A and C in the two images have different horizontal dependence (adjacent or not).

Our SMC model can be used directly for image categorization. For a multiclass classification problem with C image categories, we first train an SMC model \lambda_c for each category by the above MLE scheme and then classify a new test image Q_{test} by

    c^* = \arg\max_{1 \le c \le C} \log P(Q_{test} | \lambda_c).    (6)

It should be noted that this is a completely generative method for image categorization. In the following, we will try to make use of our SMC model in a discriminative approach.

IV. SMKs

Although our SMC model is shown to achieve satisfactory results in later experiments when applied directly to image categorization, we can further improve it through developing new discriminative learning. Here, it should be noted that the image categorization approach based on our SMC model is completely generative, but the final classification problem is actually discriminative. To develop a discriminative approach to the image categorization problem, we consider incorporating the spatial dependence between visual keywords captured by an SMC into kernels for use with an SVM.

In the following, we propose two SMKs, which differ in how our SMC models are estimated. If an SMC model is estimated for each image, the kernel is used to measure how close two SMC models are, which will be denoted as SMK1. Otherwise, if an SMC model is estimated for each category, the kernel can be calculated based on the probabilities of two images coming from this SMC model, which will be denoted as SMK2. The labels of images are used only for SMK2.

A. Definition of SMK1

The basic idea of defining a kernel is to map the 2-D sequence Q of an image into a high-dimensional feature space: Q \to \Phi(Q). If an SMC model \lambda(Q) is estimated via MLE for each image (or each sequence) Q, the feature mapping \Phi can be given as

    \Phi(Q) = \lambda(Q) = (\Pi^{(Q)}, H^{(Q)}, V^{(Q)}).    (7)

That is, each sequence Q is now represented by the model parameters of an SMC.

Since we focus on capturing spatial dependence between states (visual keywords), we only consider the horizontal and vertical transition matrices in \lambda(Q). Moreover, to make the computation more efficient, the feature mapping \Phi is then defined as

    \Phi(Q) = ( (h^{(Q)}_{ij} + v^{(Q)}_{ij}) / 2 )_{1 \le i,j \le M}.    (8)

Although we can map Q into an even higher dimensional feature space by stacking H^{(Q)} and V^{(Q)} in one vector, the experimental results show that the above definition of \Phi performs better. More importantly, this kernel definition can help
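The MLE of (3)-(5) amounts to counting and row-normalizing transitions, and the SMK1 feature map of (8) is then the average of the two estimated matrices. The following Python sketch is our own illustration under hypothetical naming, not the authors' code; it assumes each grid is a 2-D integer array of keyword indices in {0, ..., M-1}:

```python
import numpy as np

def _count_transitions(Q, M):
    """Horizontal and vertical transition counts n^h(Q), n^v(Q)."""
    nh = np.zeros((M, M))
    nv = np.zeros((M, M))
    X, Y = Q.shape
    for x in range(X):
        for y in range(1, Y):
            nh[Q[x, y - 1], Q[x, y]] += 1
    for x in range(1, X):
        for y in range(Y):
            nv[Q[x - 1, y], Q[x, y]] += 1
    return nh, nv

def _row_normalize(counts):
    sums = counts.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0  # rows with no observed transitions stay all-zero
    return counts / sums

def estimate_smc(grids, M):
    """MLE of Eqs. (3)-(5): relative frequencies of the initial state and
    of horizontal/vertical transitions over a set of 2-D keyword grids."""
    pi = np.zeros(M)
    nh = np.zeros((M, M))
    nv = np.zeros((M, M))
    for Q in grids:
        pi[Q[0, 0]] += 1
        h, v = _count_transitions(Q, M)
        nh += h
        nv += v
    return pi / max(pi.sum(), 1.0), _row_normalize(nh), _row_normalize(nv)

def smk1_features(Q, M):
    """Per-image feature map of Eq. (8): (H(Q) + V(Q)) / 2, flattened."""
    _, H, V = estimate_smc([Q], M)
    return ((H + V) / 2.0).ravel()
```

How unobserved (all-zero) rows of the transition matrices are treated is our own choice here; the paper only notes that the resulting feature vectors are extremely sparse.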

Fig. 3. Illustration of our definition of the feature mapping that can help to deal with the rotation issue to some extent. Although different probability transition
matrices (i.e., H and V ) are learned from the two images, the obtained feature vectors (i.e., ) are the same, which means that we successfully remove the effect
of the vertical rotation.
to deal with horizontal and vertical rotations (also flips) to some extent, as shown in Fig. 3. In the next section, to make our kernel definition invariant to more complex rotations, we will consider more orientations through a kernel combination.

After the feature mapping \Phi has been defined, our SMK1 function in the feature space (determined by \Phi) can be given as

    K(Q, \tilde{Q}) = K_{orig}(\Phi(Q), \Phi(\tilde{Q}))    (9)

where Q and \tilde{Q} are two sequences, and K_{orig} can be any kind of original kernel. In our experiments, the Gaussian kernel is used as K_{orig}. Although the feature vector given by (8) has dimensionality M^2 for M states, the computation of the above kernel is very efficient because the model parameters H^{(Q)} and V^{(Q)} are extremely sparse.

B. Definition of SMK2

Although we know the labels of the images (i.e., sequences) used to train the SMC model, we do not take advantage of this supervisory information in the definition of SMK1. To take such prior knowledge into account, we estimate an SMC model for each category and follow the idea of Jaakkola et al. [14] to define a new type of Fisher kernel called SMK2.

Let \lambda_0 = \{\Pi_0, H_0 = \{h^0_{ij}\}_{M \times M}, V_0 = \{v^0_{ij}\}_{M \times M}\} be the SMC model estimated for a category. The feature mapping \Phi for a sequence Q can be defined based on the following Fisher score:

    \Phi(Q) = \nabla_\lambda \log P(Q | \lambda) |_{\lambda = \lambda_0}    (10)

where \lambda = \{\Pi, H, V\} denotes an arbitrary SMC model. Since we focus on capturing the spatial dependence between states (visual keywords), only the horizontal and vertical transition matrices (i.e., H and V) are used to define \Phi(Q) in the following.

To make sure that the two transition matrices H = \{h_{ij}\}_{M \times M} and V = \{v_{ij}\}_{M \times M} satisfy the probability constraint conditions (i.e., \sum_{j=1}^{M} h_{ij} = 1 and \sum_{j=1}^{M} v_{ij} = 1), we take into account the following substitutions:

    h_{ij} = e^{\alpha_{ij}} / \sum_{j'=1}^{M} e^{\alpha_{ij'}}    (11)

    v_{ij} = e^{\beta_{ij}} / \sum_{j'=1}^{M} e^{\beta_{ij'}}    (12)

where -\infty < \alpha_{ij}, \beta_{ij} < +\infty. These two unconstrained auxiliary parameters are introduced to ensure that the gradients always exist. According to the probability P(Q | \lambda) given in (2), the gradients of \log P(Q | \lambda) with respect to \alpha_{ij} and \beta_{ij} have the following detailed forms:

    \alpha^{(Q)}_{ij} = \partial \log P(Q | \lambda) / \partial \alpha_{ij} |_{\lambda = \lambda_0} = n^h_{ij}(Q) (1 - h^0_{ij})    (13)

    \beta^{(Q)}_{ij} = \partial \log P(Q | \lambda) / \partial \beta_{ij} |_{\lambda = \lambda_0} = n^v_{ij}(Q) (1 - v^0_{ij}).    (14)

Similar to our definition of the feature mapping \Phi used by SMK1, we finally take into account the following \Phi for the SMK2 function:

    \Phi(Q) = ( (\alpha^{(Q)}_{ij} + \beta^{(Q)}_{ij}) / 2 )_{1 \le i,j \le M}    (15)

which is slightly different from the original definition in (10) for the Fisher kernel that stacks the horizontal and vertical gradients into one higher dimensional feature vector.

We now give our SMK2 function in the same form as (9), which can be calculated very efficiently due to the sparsity of the two matrices \{n^h_{ij}(Q)\}_{M \times M} and \{n^v_{ij}(Q)\}_{M \times M} used in the definition of the above feature mapping \Phi.

V. KERNEL COMBINATION

The main advantage of the kernel methods for image categorization is that different kernels can readily be combined. In this paper, a kernel combination is used to handle rotation and multiscale issues, which are both very challenging in the literature.
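The SMK2 features depend on a category model only through its transition matrices, and the kernel itself is then an ordinary Gaussian kernel on the resulting vectors, as in (9). A Python sketch under our own (hypothetical) naming, implementing the gradient forms stated in (13) and (14); it assumes H0 and V0 are the row-stochastic matrices of the per-category model:

```python
import numpy as np

def smk2_features(Q, H0, V0):
    """Fisher-score features of Eqs. (13)-(15) for a keyword grid Q under
    a category model (H0, V0): n_ij(Q) * (1 - p0_ij) per direction,
    averaged over the horizontal and vertical directions."""
    M = H0.shape[0]
    nh = np.zeros((M, M))
    nv = np.zeros((M, M))
    X, Y = Q.shape
    for x in range(X):
        for y in range(1, Y):          # horizontal transition counts
            nh[Q[x, y - 1], Q[x, y]] += 1
    for x in range(1, X):
        for y in range(Y):             # vertical transition counts
            nv[Q[x - 1, y], Q[x, y]] += 1
    alpha = nh * (1.0 - H0)            # Eq. (13)
    beta = nv * (1.0 - V0)             # Eq. (14)
    return ((alpha + beta) / 2.0).ravel()  # Eq. (15)

def gaussian_kernel(f1, f2, gamma=1.0):
    """A Gaussian K_orig as in Eq. (9); gamma is a bandwidth we choose."""
    return np.exp(-gamma * np.sum((f1 - f2) ** 2))
```

Since the count matrices are sparse, the feature vectors are sparse as well, which is what makes the kernel evaluation cheap despite its nominal M^2 dimensionality.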

A. On the Rotation Issue

As we have mentioned, image categorization is an extremely difficult problem. The fact that different objects in an image may have different orientations (even within the same category) adds difficulty to this problem. To handle such a rotation issue, we take into account dual SMC models, for which the notions of past are defined completely inversely. The two kernels obtained based on these two SMC models are then combined to make our method less sensitive to rotations between images from one category.

The inverse SMC model can be defined similarly to the original SMC model, and we have to first rewrite the original sequence Q as

    \bar{Q} = q_{X,Y} q_{X,Y-1} \ldots q_{X,1} q_{X-1,Y} q_{X-1,Y-1} \ldots q_{X-1,1} \ldots q_{1,1}

which is obtained by an inverse row-wise raster scan on the regular grid. Based on dual SMC models, we then define the feature mapping \Phi_1 for SMK1 and \Phi_2 for SMK2 as follows:

    \Phi_1(Q) = ( (h^{(Q)}_{ij} + v^{(Q)}_{ij} + h^{(\bar{Q})}_{ij} + v^{(\bar{Q})}_{ij}) / 4 )_{1 \le i,j \le M}    (16)

    \Phi_2(Q) = ( (\alpha^{(Q)}_{ij} + \beta^{(Q)}_{ij} + \alpha^{(\bar{Q})}_{ij} + \beta^{(\bar{Q})}_{ij}) / 4 )_{1 \le i,j \le M}.    (17)

These definitions of our SMKs can help to deal with the horizontal and vertical rotations (also flips). To ensure that our kernel definitions are invariant to more complex rotations, we can similarly consider the four diagonal orientations.

It should be noted that our two SMK methods are only approximately rotation invariant. To obtain a completely invariant solution with respect to arbitrary rotations, we would have to try all the possible scanning paths on a regular grid (with an image block located at each grid cell) and then generate a sequence of blocks along each scanning path. However, this may incur too large a computational cost to be applicable in practice. Fortunately, in natural images, the image blocks are usually arranged in an order for normal viewing, and the horizontal or vertical rotations that most commonly occur among images due to the image preparation process can be handled by our SMC models. Hence, the categorization results are shown to be encouraging in our later experiments.

B. On the Multiscale Issue

We now consider the multiscale issue. Similar to the idea of a wavelet transform, we place a series of increasingly finer grids over the state space of the SMC model. That is, each subsequence at the level l will be divided into 2 \times 2 parts at the level l + 1, where l = 0, \ldots, L - 1, and L is the finest scale. Hence, we can obtain 4^l subsequences at the level l. For example, given a sequence Q = q_{1,1} q_{1,2} q_{1,3} q_{1,4} q_{2,1} q_{2,2} q_{2,3} q_{2,4}, the four subsequences for l = 1 are q_{1,1} q_{1,2}, q_{1,3} q_{1,4}, q_{2,1} q_{2,2}, and q_{2,3} q_{2,4}, respectively. Based on these subsequences, we can then estimate the model parameters to define our SMK.

Let Q^l_i be the ith subsequence at the level l for a sequence Q. The SMK at this scale can be formulated as

    K^{(l)}(Q, \tilde{Q}) = \sum_{i=1}^{4^l} K(Q^l_i, \tilde{Q}^l_i)    (18)

where Q and \tilde{Q} are two sequences, and K can be SMK1 or SMK2. We first define an SMK for each subsequence at the level l and then take a sum of the obtained 4^l kernels. Intuitively, K^{(l)}(Q, \tilde{Q}) not only measures the number of the same cooccurrences (i.e., spatial dependence) of visual keywords found at the level l in both Q and \tilde{Q} but also captures the spatial layout of these cooccurrences on the 2^l \times 2^l grid at this level.

Since the cooccurrences of visual keywords found at the level l also include all the cooccurrences found at the finer level l + 1, the increment of the same cooccurrences found at the level l in both Q and \tilde{Q} is measured by K^{(l)}(Q, \tilde{Q}) - K^{(l+1)}(Q, \tilde{Q}) for l = 0, \ldots, L - 1. The SMKs at multiple scales can then be combined by a weighted sum as follows:

    K^{(L)}_{ms}(Q, \tilde{Q}) = K^{(L)}(Q, \tilde{Q}) + \sum_{l=0}^{L-1} (1 / 2^{L-l}) ( K^{(l)}(Q, \tilde{Q}) - K^{(l+1)}(Q, \tilde{Q}) )
                              = (1 / 2^L) K^{(0)}(Q, \tilde{Q}) + \sum_{l=1}^{L} (1 / 2^{L-l+1}) K^{(l)}(Q, \tilde{Q}).    (19)

Here, we apply empirical weights to the kernels at different scales, which ensure that a coarser scale plays a less important role. When L = 0, the above multiscale kernel K^{(L)}_{ms} degrades to the original SMK.

VI. ALGORITHM

This section gives the algorithmic details of image categorization and annotation using our SMKs. We first combine our SMKs with the SVM classifier for image categorization. Moreover, we also apply our SMKs to the challenging problem of image annotation through graph-based semisupervised learning [33].

A. Image Categorization

Given a multiclass image categorization problem, the strategy for training an SVM classifier is one-against-the-rest for our SMK2 and one-against-one for our SMK1. The optimization problem in an SVM is solved by A Library for Support Vector Machines (LIBSVM).^1 It should be noted that our SMK2 has to follow the one-against-the-rest strategy due to its special definition (i.e., we must first train an SMC model to define our SMK2 for each category). Hence, we need to compute our SMK2 C times to solve an image categorization problem with C categories based on the one-against-the-rest strategy, whereas our SMK1 can be

^1 http://www.csie.ntu.edu.tw/~cjlin/libsvm

commonly adopted for all the binary classification subproblems (i.e., we only need to compute this kernel once). Given a database of N images, the worst-case time complexity of these two SVM-based algorithms using our SMK1 and SMK2 is approximately O(M^2 N^2 4^L + N^2) and O(C M^2 N^2 4^L + N^2), respectively. The first part of the time complexity is related to the kernel computation. In practice, these two algorithms incur much less time cost since the feature vectors used to define our SMKs are extremely sparse, and we can compute the kernel matrices very efficiently.

In fact, the learned SMC model used to define our SMKs can also be applied to image categorization directly. For an image categorization problem with C categories, we first train an SMC model \lambda_c for each category and then classify a new test image Q_{test} by c^* = \arg\max_{1 \le c \le C} \log P(Q_{test} | \lambda_c). Unlike the above image categorization algorithms that combine our SMKs with the SVM classifier, this SMC-based algorithm is a completely generative method for image categorization. A distinct advantage of this image categorization algorithm is that it has linear time complexity of O(M^2 N) with respect to the data size.

B. Image Annotation

Our SMKs are further applied to automatic image annotation. When each keyword is treated as an independent category, image annotation can be formulated as a multilabel image categorization problem. Methods like linguistic indexing of pictures [23] and image annotation using an SVM [34] or a Bayes point machine [35] follow this strategy. Hence, the SVM classifier with our SMKs can be used directly for image annotation. However, this classification-based method is not scalable to a large-scale concept space. In the context of image annotation, the concept space is very large because there are usually hundreds of keywords. Therefore, the problem of semantic overlap and data imbalance among different semantic classes becomes serious, and the classification performance degrades heavily. To deal with this problem, we make use of graph-based semisupervised learning [33] together with our SMKs. The basic idea of semisupervised learning is keyword propagation along the underlying manifold structure among all the images.

With the initial keyword labels of the training images (i.e., the label matrix Z) provided, the annotation algorithm based on semisupervised learning is summarized as follows:

1) Form the affinity matrix W of a graph by W_{ij} = K(Q_i, Q_j) / \sqrt{K(Q_i, Q_i) K(Q_j, Q_j)} if Q_j (j \ne i) is among the k-nearest neighbors of Q_i and W_{ij} = 0 otherwise (which also implies W_{ii} = 0).
2) Construct the matrix S = D^{-1/2} W D^{-1/2}, in which D is a diagonal matrix with its (i, i)-element equal to the sum of the ith row of W.
3) Iterate F(t + 1) = \alpha S F(t) + (1 - \alpha) Z for keyword propagation until convergence, where \alpha is a smoothing parameter in the range (0, 1).
4) Let F^* denote the limit of the sequence \{F(t)\}. Annotate each image Q_i by a set of keywords with respect to its ranking scores F^*_i.

According to Zhou et al. [33], the above semisupervised learning algorithm converges to F^* = (1 - \alpha)(I - \alpha S)^{-1} Z. Based on our SMKs, this algorithm can exploit the semantic context across visual keywords for image annotation. Here, it should be noted that we only consider our SMK1 for image annotation since we would have to compute our SMK2 too many times, given the large number of keywords (i.e., class labels) involved in the annotation of images. When our SMK1 is used for graph construction, the above algorithm has linear time complexity of O(kCN) with respect to the data size.

VII. CATEGORIZATION RESULTS

In this section, we compare our two SMK methods for image categorization with the SMC, BOW, and PLSA methods on Corel and histological image databases to assess the gain in combining Markov models and kernel methods. In the following, a BOW method refers only to BOW representation with the frequencies of visual keywords as features and includes no topic analysis. To solve the image categorization problem, our SMC model is used as a generative classifier according to (6), whereas all the other methods take into account the discriminative SVM classifier.
A. Experimental Setup
To apply semisupervised learning to automatic im-
age annotation, we need to first formulate this challeng- The first image database consists of 1000 images taken from
ing problem as follows. Given an image data set Q = ten compact disks read-only memory (CD-ROMs) published by
{Q1 , . . . , QNl , QNl +1 , . . . , QN } and a vocabulary of annota- Corel Corporation. Each Corel CD-ROM contains 100 images
tion keywords L = {w1 , . . . , wC }, the first Nl images Qi (i and represents a distinct concept (i.e., category). All the images
Nl ) are labeled as Zi {0, 1}C , where Zij = 1 if the keyword are of the size of 384 256 or 256 384 pixels. The category
wj L is attached to Qi and zero otherwise, whereas the names and some randomly selected sample images from each
remaining images Qu (Nl + 1 u N ) are not labeled, and category are shown in Fig. 4(a). This is a rather complex image
then, Zuj = 0. The goal is to predict the keywords of the database. Some images from two natural scenes (e.g., skiing
unlabeled images. versus mountains or beach versus mountains) are difficult to
Let F denote the set of N C matrices with nonnegative distinguish even by humans.
entries. For each matrix F F, Fij is defined as the ranking The second image database is the same as that used in [26]
score of image Qi according to the keyword wj . We can and [36], which has five categories [see some samples from
understand F as a vector function F : Q RC , which assigns each category in Fig. 4(b)] of histological images captured from
a vector Fi to each image Qi . For the initial annotation Z, the mentioned five major regions along the human gastroin-
we have Z F. Assuming that an SMK function K has been testinal tract with 40 images for each region. The collection of
LUAND IP: SPATIAL MARKOV KERNELS FOR IMAGE CATEGORIZATION AND ANNOTATION 983

Fig. 4. Some sample images from the two image databases. (a) Corel images (ten categories). (b) Histological images (five categories).
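The four-step keyword-propagation algorithm above maps almost directly onto a few lines of NumPy. The sketch below is an illustrative reimplementation under stated assumptions, not the authors' code: any precomputed positive-definite kernel matrix stands in for an SMK, and the k-nearest-neighbor graph is symmetrized by an elementwise maximum (a common simplification so that S stays well defined).

```python
import numpy as np

def propagate_keywords(K, Z, alpha=0.5, n_neighbors=5):
    """Sketch of the four-step keyword-propagation algorithm (Zhou et al. [33]).

    K: (N, N) precomputed kernel matrix (an SMK in the paper; any PSD kernel here).
    Z: (N, C) initial annotation matrix, Z_ij = 1 if keyword j is attached to image i.
    Returns the closed-form limit F* = (1 - alpha) (I - alpha S)^{-1} Z.
    """
    N = K.shape[0]
    # Step 1: W_ij = K(Q_i, Q_j) / sqrt(K(Q_i, Q_i) K(Q_j, Q_j)), restricted to
    # the k-nearest neighbors of each image, with W_ii = 0.
    d = np.sqrt(np.diag(K))
    W = K / np.outer(d, d)
    np.fill_diagonal(W, 0.0)
    for i in range(N):
        drop = np.argsort(W[i])[::-1][n_neighbors:]
        W[i, drop] = 0.0
    W = np.maximum(W, W.T)  # symmetrize (a simplification; keeps S well defined)
    # Step 2: S = D^{-1/2} W D^{-1/2} with D = diag(row sums of W).
    dinv = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * dinv[:, None] * dinv[None, :]
    # Steps 3-4: instead of iterating F(t+1) = alpha S F(t) + (1 - alpha) Z,
    # solve directly for the known limit F*.
    return (1.0 - alpha) * np.linalg.solve(np.eye(N) - alpha * S, Z)
```

Each image Q_i is then annotated with the keywords whose ranking scores in row F*_i are largest.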
The collection of these histological images is rather time consuming. They were obtained from patients' records over the past ten years from a local hospital. The image resolution was originally set to 4491 × 3480 pixels during the capturing process. Similar to [26] and [36], all the images are then down-sampled to 1123 × 870 pixels.

To generate a vocabulary of visual keywords, we first divide each image into blocks of equal size and then extract a 30-dimensional feature vector (similar to [26]) for each block: six color features (average and standard deviation of each color component) and 24 texture features (average and standard deviation of Gabor outputs over three scales and four orientations). For the gray histological images, the feature dimensionality is 26. We then perform k-means clustering on the extracted feature vectors to generate a vocabulary of M visual keywords.

Moreover, images within each category are randomly selected into two subsets of the same size to form training and test sets, respectively. We repeat each experiment for ten random splits and report the average of the results obtained over ten different test sets. In the following, we only take into account L = 0, 1, and 2 for our two SMK methods. The reason is that we have found in the experiments that more levels (or scales) do not achieve better results but incur significantly more computational cost. For all the methods, the parameters are selected by twofold cross-validation on the training set.

B. Scene/Object Categorization

We compare the five methods on all the ten categories of scene/object images from the Corel database. Images within each category are randomly selected into two subsets of the same size to form training and test sets, respectively. The number of visual keywords M and the number of topics are selected by twofold cross-validation on the training set (the other parameters are selected similarly), and the corresponding results are shown in Fig. 5. We finally select M = 400 for SMK1, M = 200 for SMK2 and the SMC, and M = 1000 for the BOW method and PLSA. Moreover, the number of topics is set to 100 for PLSA.

Fig. 5. Categorization results by twofold cross-validation on the Corel image database. (a) Varying number of visual keywords for the five methods. (b) Varying number of latent topics for PLSA.

The average categorization accuracy values for each category over ten random test sets are listed in Table I, which also presents the overall accuracy values on all the ten categories.

In terms of the overall accuracy values, it can be found from Table I that our two SMK methods for image categorization always outperform the other three methods. This superior performance could be due to the combination of Markov models and kernel methods. Although the spatial dependence between visual keywords can be captured by the SMC model, this type of spatial information is exploited for image categorization in a completely generative way. Both the BOW method and PLSA adopt a discriminative SVM classifier, but they ignore the spatial information within the images. Moreover, for our two SMK methods, we can find that SMK2 leads to better results than SMK1. Interestingly, similar to the results reported in [3], PLSA does not perform better than the BOW method.

In terms of the categorization accuracy values for each category, it can be found from Table I that our SMC model is particularly suitable for processing images with a layered structure. For natural scenes (e.g., skiing, beach, and mountains) that have multiple layers, our two SMK methods based on the SMC model achieve significant improvements, as compared with the BOW method and PLSA, which do not consider the spatial dependence between visual keywords within the images.
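The vocabulary-generation step described above (divide each image into blocks, extract a feature vector per block, cluster into M visual keywords) can be sketched with a tiny k-means. The helper names `build_vocabulary` and `quantize` and the farthest-first initialization are illustrative stand-ins for the paper's clustering step, chosen only to keep this toy version well behaved.

```python
import numpy as np

def build_vocabulary(block_features, M, n_iter=20, seed=0):
    """Cluster block feature vectors into M visual keywords with a tiny k-means."""
    X = np.asarray(block_features, dtype=float)
    rng = np.random.default_rng(seed)
    # farthest-first initialization: one random centre, then repeatedly
    # promote the block farthest from all existing centres
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, M):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(dist.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):  # standard Lloyd iterations
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    return centers, labels

def quantize(blocks, centers):
    """Map each block feature vector to the index of its nearest visual keyword."""
    d = ((np.asarray(blocks, float)[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

After clustering, every image becomes a grid of visual-keyword indices via `quantize`, which is the representation the SMC model and the SMKs operate on.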
TABLE I
CATEGORIZATION RESULTS (%) BY THE FIVE METHODS ON THE COREL IMAGE DATABASE

Moreover, due to the rotation invariant formulation, our SMK methods are shown to be robust with respect to rotations. That is, when recognizing animals with diverse poses (e.g., tigers, owls, elephants, and horses), we can obtain better results using our SMK methods. To further confirm the superiority of the rotation invariant formulation for our SMK methods, we directly compare it with the standard formulation (without the kernel combination). As shown in Fig. 6, the rotation invariant formulation consistently performs better than the standard formulation when recognizing animals with diverse poses.

Fig. 6. Comparison of the standard and rotation invariant formulations for our two SMK methods when recognizing the four categories of animals with diverse poses. Here, SMK1-S (or SMK2-S) denotes the standard formulation, whereas SMK1-RI (or SMK2-RI) denotes the rotation invariant formulation.

Although the vocabulary size used here is relatively small, as compared with what is typically used in the literature (1000 at least), our two SMK methods can still achieve encouraging results. The reason could be that we have actually learned the semantic context (i.e., the spatial dependence between visual keywords) for image classification. Of course, a smaller vocabulary size means that a single visual keyword may denote multiple concepts (e.g., a green block may mean a leaf or grass) with a higher probability. However, similar to the case in natural language processing, such semantic ambiguity can be reduced when the context is known. Moreover, besides this type of spatial information, our two SMK methods can capture another type of spatial information when multiple scales are considered. That is, the spatial layout of each pair of visual keywords is actually included in (18). Here, only slightly better results are obtained with this global spatial information, as shown in Table I, which may be due to the fact that the first type of spatial information contributes much more to image categorization on this database.

The confusion matrix for the ten accuracy values provided by our SMK2 method is presented in Table II to show more details on the categorization of each category. Each row lists the average percentages of images in a specific category classified to each of the ten categories. The numbers on the diagonal show the categorization accuracy for each category, and off-diagonal entries indicate categorization errors. A detailed examination of the confusion matrix given by Table II shows that six categories, namely, skiing, tigers, owls, flowers, horses, and food, are classified the best (with an accuracy value of > 90%), whereas the largest error (the underlined number in Table II) arises between two categories: beach and mountains. As for these two natural scenes, 16% of beach images are misclassified as mountains. The high errors can be due to the fact that many images from these two natural scenes have image blocks that are semantically related and visually similar, such as blocks corresponding to mountains, rivers, lakes, and ocean.

We now evaluate the five methods for image categorization when less training data are available. The number of training images decreases from 500 to 50, and we select one tenth of the total training images from each category. The number of test images then increases from 500 to 950. Here, it should be noted that PLSA is always run on the initial 500 training images, and only the labeled images input into an SVM are changed. As indicated in Fig. 7, when the number of training images decreases, the average classification accuracy values of our two SMK methods degrade as expected.
However, when the number of training images decreases to a small value (e.g., 50), our SMK2 and SMC methods are shown to degrade the slowest. In this case, we can also find that PLSA degrades more gracefully than the BOW method and then performs a little better when there are only 50 training images, which is also reported in [3].

TABLE II
CONFUSION MATRIX (%) ON THE COREL IMAGE DATABASE BY OUR SMK2 METHOD

Fig. 7. Comparison of the categorization results provided by the five methods as the number of training images varies from 50 to 500. The average classification accuracy values (in percentage) are computed over ten random test sets, and the error bars indicate the corresponding 95% confidence intervals.

Finally, the five methods for image categorization are compared in terms of time cost. In the experiments, we find that the SMC and BOW methods run faster than the other three methods. Moreover, for the four discriminative methods, the time cost of training an SVM is the same if the kernel matrix is provided in advance. Hence, we focus on the kernel computation to make a fair comparison. It can be observed that the BOW method obtains the kernel matrix the fastest, our two SMK methods are a little slower, and PLSA is the slowest due to the costly expectation-maximization iteration. Because the feature vectors used for kernel definition are extremely sparse, our two SMK methods result in a very efficient computation of the kernel matrices.

C. Histological Image Categorization

Our two SMK methods are further tested on the histological image database used in [26] and [36]. Images within each category are randomly selected into two subsets of the same size to form training and test sets, respectively. The parameters are selected by twofold cross-validation on this database just as on the Corel image database. The overall accuracy values over the five categories are used for evaluation. The average results over ten random test sets are listed in Table III. We find again that our two SMK methods outperform the other three methods, due to our combination of Markov models and kernel methods for image categorization. That is, our methods indeed lead to better results through incorporating the spatial information into the SVM classifier.

TABLE III
CATEGORIZATION RESULTS (%) BY THE FIVE METHODS ON THE HISTOLOGICAL IMAGE DATABASE

VIII. COMPARISON WITH RELATED WORK

In this section, we make a further comparison with three closely related approaches [8], [17], [19] to image categorization using their own data sets and the same number of training and test images. The corresponding three data sets are denoted as follows:

1) Scenes-8: eight scene categories provided by Oliva and Torralba [19].
2) Scenes-13: 13 scene categories provided by Fei-Fei and Perona [8].
3) Caltech-4: four object categories from Caltech [17].

On the three data sets, our SMK method is compared with SPM [18] and PLSA [5]. Since the spatial information of visual keywords is also considered, SPM is closely related to our SMK method.

Although our SMK2 performs obviously better than our SMK1 when a small number of training images are provided (see Fig. 7), they are shown to achieve comparable results when the number of training images increases to a relatively large value. Hence, we only consider SMK1 in the following (i.e., an SMK actually denotes our SMK1), due to the fact that a relatively large number of training images are selected for all the three data sets. Moreover, a distinct advantage of our SMK1 is that it runs much faster than SMK2. The reason is that we must compute our SMK2 C times to solve an image classification problem with C classes based on the one-against-the-rest strategy, whereas our SMK1 can be commonly adopted for all the binary classification subproblems (i.e., this kernel is only computed once).

For each data set, we extract scale-invariant feature transform (SIFT) descriptors [37] of 16 × 16 pixel blocks computed over a regular grid with a spacing of eight pixels, which is the same as in [18].
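The cost asymmetry noted above (SMK2 recomputed for each of the C one-against-the-rest subproblems, while a class-independent SMK1 is computed once and shared) can be made concrete with a small counting sketch. The `compute_kernel` callback, the caching scheme, and `fake_kernel` are hypothetical illustrations, not the paper's code.

```python
def train_one_vs_rest(n_classes, compute_kernel, class_dependent=False):
    """Count kernel-matrix computations in one-against-the-rest training.

    compute_kernel(c) returns the Gram matrix for binary subproblem c. A
    class-independent kernel (like SMK1) ignores c and is computed once; a
    class-dependent kernel (like SMK2) must be recomputed for every class.
    """
    cache = {}
    for c in range(n_classes):
        key = c if class_dependent else None
        if key not in cache:
            cache[key] = compute_kernel(c)
        # ... a binary SVM for class c would be fit on cache[key] here ...
    return len(cache)  # number of distinct kernel computations

calls = []  # records every actual kernel computation
def fake_kernel(c):
    calls.append(c)
    return [[1.0]]  # placeholder Gram matrix
```

With C = 5 classes, the shared kernel is computed once, while the class-dependent variant triggers five computations, matching the O(M^2 N^2 4^L) versus O(CM^2 N^2 4^L) costs quoted earlier.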
We then perform k-means clustering on a random subset of blocks from the training set to form a visual vocabulary. For the SMK and SPM, we train the SVM classifier directly for image categorization. For PLSA, we first compute the Gaussian kernel using the learned topics as features. In the following, we only consider L = 0, 1, and 2 for the SMK and SPM since we have found in the experiments that more levels do not achieve better results but incur significantly more computational cost. For all the image categorization methods, the parameters are selected by twofold cross-validation on the training set.

The classification accuracy (%) is used for performance evaluation, which is computed on the test set and then averaged over ten independent trials. The results are given in Table IV.

TABLE IV
COMPARISON OF OUR METHOD WITH PREVIOUS APPROACHES USING THEIR OWN DATABASES

We can find that our method outperforms the three previous approaches [8], [17], [19]. As compared with [19], our superior performance could be due to a different interpretation of scene categories. That is, a set of perceptual dimensions is proposed in [19] to represent the dominant spatial structure of a scene, whereas our method uses an intermediate scene representation with automatically learned visual keywords. Although such an intermediate representation was also used in [8], the spatial information of visual keywords was not considered in scene classification. Therefore, our SMK method performs much better. In [17], the constellation model is used to capture spatial information, and then a kernel is defined based on this model, which is similar to our method. However, a sparse representation is formed in [17] through extracting interest points, whereas our method uses dense SIFT descriptors. As shown in [8], dense features work better for scene classification. More importantly, based on the 2-D Markov model, our method has actually learned the semantic context (i.e., the spatial dependence between visual keywords), which can help to reduce the ambiguity in scene classification.

Similar to Section VII, we find again that better results are obtained with spatial information (i.e., when our method is compared with PLSA). Moreover, when L = 0, our method achieves more improvement over SPM [18] (compared with L = 1 and 2). Although our method only performs slightly better than SPM when L = 1 and 2, it should be noted that the spatial information captured by the two methods is quite different. That is, SPM uses 2-D histograms of visual keywords, whereas our method learns the spatial dependence between visual keywords, which can be considered as some form of semantic context. Similar to the case in natural language processing, semantic ambiguity can be reduced when the context is known. This helps to shed some light on why our method gains more over SPM on the Scenes-13 data set (compared with Caltech-4). That is, when there are more categories in the data set, the semantic ambiguity will become larger. This challenge can be handled by our method to some extent.

TABLE V
ANNOTATION RESULTS (%) MEASURED BY F1 FOR THE TWO MULTISCALE METHODS

In the next section, our method will be applied to image annotation, which can be considered as a multilabel image categorization problem when each keyword is assumed to denote a category. However, it becomes significantly more difficult than the traditional image categorization task, because there are usually hundreds of keywords (i.e., categories) involved in the annotation of images. In the following experiments, our method is shown to perform much better than SPM through incorporating the learned semantic context into image annotation.

IX. APPLICATION TO IMAGE ANNOTATION

In this section, our SMKs will be evaluated in the challenging application of image annotation. We compare our method with two previous methods [38], [39] on the standard annotation database [40]. Based on the continuous-space relevance model proposed in [38] or the multiple Bernoulli relevance model (MBRM) proposed in [39], the correlations between images and keywords can be estimated and then used for image annotation. Moreover, our SMK is also compared with SPM [18] and PLSA [5]. Instead of the relevance model, we adopt graph-based semisupervised learning for image annotation. We can construct a graph directly using the SMK or SPM as the affinity matrix. For PLSA, we first compute the Gaussian kernel.

A. Experimental Setup

The standard annotation database [40] is used in the experiments, which consists of 5000 images. Each image is annotated with 1-5 keywords, and there are in total 371 keywords. We first divide each image into blocks and then extract a 30-dimensional feature vector (similar to [39]) for each block: six color features (block color average and standard deviation) and 24 texture features (average and standard deviation of Gabor outputs computed over three scales and four orientations). Moreover, we consider two training/test splits of the image database: One is the same as [39], with 4500 training images and 500 test images, and the other is the inverse, with 500 training images and 4500 test images (i.e., the 500 test images used in [39] are now used for training).

Similar to previous work, we evaluate automatic image annotation through the process of retrieving test images with a single keyword.
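When PLSA topic proportions replace a precomputed SMK or SPM, the graph affinity comes from a Gaussian kernel, as noted above. A minimal sketch, assuming plain Euclidean distances on feature rows and a hand-picked bandwidth sigma:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """RBF affinity K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Used here as the graph affinity when a feature representation (e.g., PLSA
    topic proportions) replaces a precomputed kernel; sigma is a free bandwidth.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

The resulting symmetric matrix can be plugged into the same graph-based propagation as the SMK affinities; in practice sigma would be tuned by cross-validation like the other parameters.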
For each keyword, the number of correctly annotated images is denoted as N_c, the number of retrieved images is denoted as N_r, and the number of truly related images in the test set is denoted as N_t. Then, the recall, precision, and F1 measures are computed as follows:

recall = N_c/N_t,  precision = N_c/N_r    (20)
F1 = 2 × recall × precision/(recall + precision)    (21)

which are further averaged over all the keywords in the test set.

In our experiments, we take the top five keywords for automatic annotation of an image, which is the same as in [39]. We only consider L = 1 for the SMK and SPM since the performance degrades with more levels, as shown in Table V. These two kernels can be directly used as the affinity matrix of a graph. For PLSA, we first compute the Gaussian kernel. For all the annotation methods, the parameters are tuned by twofold cross-validation on the training set according to F1.

B. Annotation Results

The annotation results for the two training/test splits of the image database are listed in Table VI.

TABLE VI
ANNOTATION RESULTS (%) FOR TWO TRAINING/TEST SPLITS OF THE IMAGE DATABASE

We find that our SMK method for image annotation always outperforms all the other methods. Although both the SMK method and SPM make use of spatial information, our SMK method can achieve about 15% improvement over SPM on the F1 measure through learning the semantic context (i.e., the spatial dependence between visual keywords) by the 2-D Markov model. That is, the type of spatial information captured by our SMK method is more suitable for image annotation than that (i.e., the 2-D histograms of visual keywords) captured by SPM, particularly when there exists a much larger semantic overlap or ambiguity (due to the hundreds of keywords used for image annotation). Since PLSA does not consider spatial information, our SMK method can perform significantly better than PLSA. When there are only 500 training images, our SMK method gains 69% over the MBRM on the F1 measure. This superior performance could be due to keyword propagation among all the images (including the training and test images) through graph-based semisupervised learning. As for relevance models such as the MBRM, keyword propagation is carried out only from the training images to the test images. Moreover, our SMK method is also compared with more recent state-of-the-art methods [41], [42] using their own results. As shown in Table VI, our SMK method outperforms [41], [42] due to the use of the context between visual keywords.

To further illustrate the effectiveness of our SMK method for image annotation, samples of annotation for the split of the database with training/test = 500/4500 are presented in Fig. 8, together with manual (ground truth) annotations.

Fig. 8. Annotation results of six sample images for the split training/test = 500/4500. Each image is annotated with the top five keywords.

Our method is shown to generate more semantically related annotations for each image than all the other methods. This could be due to the fact that the semantic context learned by our method is used for keyword propagation. Moreover, the annotation keywords of an image generated by our method are more complete than those by the other methods.
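The per-keyword measures in (20) and (21) are straightforward to compute. This small sketch (the function name is hypothetical) also guards the degenerate case recall + precision = 0, which the paper's formulas leave implicit:

```python
def keyword_prf(n_correct, n_retrieved, n_relevant):
    """Per-keyword recall, precision, and F1 as in (20) and (21).

    n_correct = N_c, n_retrieved = N_r, n_relevant = N_t for one keyword.
    """
    recall = n_correct / n_relevant
    precision = n_correct / n_retrieved
    if recall + precision == 0:
        return recall, precision, 0.0  # guard the degenerate case
    return recall, precision, 2 * recall * precision / (recall + precision)
```

In the evaluation protocol above, these three values are computed per keyword and then averaged over all keywords in the test set.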
For example, we can consider that the annotation "sky" for the first image (or "cave" for the second image) is not only correct but also semantically related to the other four annotation keywords of the image.

X. CONCLUSION

We have proposed a novel 2-D Markov model for image categorization and annotation in order to capture the spatial dependence between visual keywords and overcome the limitation of the BOW methods. Instead of directly using it as a generative model for image categorization, we further incorporate the learned Markov dependence into kernels in two different ways, for use with an SVM in a discriminative approach to this challenging problem. Moreover, a kernel combination is adopted to handle rotation and multiscale issues. The categorization experiments on several image databases demonstrate that our SMK method can obtain promising results. When applied to image annotation, our SMK method has also outperformed the state-of-the-art annotation techniques. In particular, both our rotation invariant and multiscale extensions of the SMK have been shown to generally achieve improved performance. For future work, since the kernel can be regarded as a similarity measure, our SMK method will be applied to other challenging problems such as image and video retrieval.

REFERENCES

[1] A. Vailaya, M. Figueiredo, A. Jain, and H.-J. Zhang, "Image classification for content-based indexing," IEEE Trans. Image Process., vol. 10, no. 1, pp. 117–130, Jan. 2001.
[2] P. Huang, C.-H. Lee, and P.-L. Lin, "Support vector classification for pathological prostate images based on texture features of multi-categories," in Proc. IEEE Int. Conf. Syst., Man Cybern., Oct. 2009, pp. 912–916.
[3] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. Van Gool, "Modeling scenes with local descriptors and latent aspects," in Proc. ICCV, Oct. 2005, pp. 883–890.
[4] M. Boutell, J. Luo, and C. Brown, "Scene parsing using region-based generative models," IEEE Trans. Multimedia, vol. 9, no. 1, pp. 136–146, Jan. 2007.
[5] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 42, no. 1/2, pp. 177–196, Jan./Feb. 2001.
[6] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Jan. 2003.
[7] A. Bosch, A. Zisserman, and X. Muñoz, "Scene classification via pLSA," in Proc. ECCV, 2006, pp. 517–530.
[8] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. CVPR, 2005, pp. 524–531.
[9] S. Huda, J. Yearwood, and R. Togneri, "A constraint-based evolutionary learning approach to the expectation maximization for optimal estimation of the hidden Markov model for speech signal modeling," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 1, pp. 182–197, Feb. 2009.
[10] J. Kwon and F. Park, "Natural movement generation using hidden Markov models and principal components," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 5, pp. 1184–1194, Oct. 2008.
[11] S.-S. Kuo and O. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 8, pp. 842–848, Aug. 1994.
[12] P. Devijver, "Segmentation of binary images using third order Markov mesh image models," in Proc. ICPR, Oct. 1986, pp. 259–261.
[13] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. CVPR, Jun. 2007, pp. 1–8.
[14] T. Jaakkola, M. Diekhans, and D. Haussler, "A discriminative framework for detecting remote protein homologies," J. Comput. Biol., vol. 7, no. 1/2, pp. 95–114, Feb. 2000.
[15] R. Behmo, N. Paragios, and V. Prinet, "Graph commute times for image representation," in Proc. CVPR, Jun. 2008, pp. 1–8.
[16] J. Li, W. Wu, T. Wang, and Y. Zhang, "One step beyond histograms: Image representation using Markov stationary features," in Proc. CVPR, Jun. 2008, pp. 1–8.
[17] A. Holub, M. Welling, and P. Perona, "Hybrid generative-discriminative object recognition," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 239–258, May 2008.
[18] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, Jun. 2006, pp. 2169–2178.
[19] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, May/Jun. 2001.
[20] A. Bagirov, "Modified global k-means algorithm for minimum sum-of-squares clustering problems," Pattern Recognit., vol. 41, no. 10, pp. 3192–3199, Oct. 2008.
[21] F. Salzenstein and C. Collet, "Fuzzy Markov random fields versus chains for multispectral image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1753–1767, Nov. 2006.
[22] J. Li, A. Najmi, and R. Gray, "Image classification by a two-dimensional hidden Markov model," IEEE Trans. Signal Process., vol. 48, no. 2, pp. 517–533, Feb. 2000.
[23] J. Li and J. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sep. 2003.
[24] J. Cai and Z.-Q. Liu, "Hidden Markov models with spectral features for 2D shape recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 12, pp. 1454–1458, Dec. 2001.
[25] Y. He and A. Kundu, "2-D shape classification using hidden Markov model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 11, pp. 1172–1184, Nov. 1991.
[26] F. Yu and H. Ip, "Semantic content analysis and annotation of histological images," Comput. Biol. Med., vol. 38, no. 6, pp. 635–649, Jun. 2008.
[27] Z. Lu and H. Ip, "Image categorization by learning with context and consistency," in Proc. CVPR, Jun. 2009, pp. 2719–2726.
[28] H. Cevikalp, M. Neamtu, and A. Barkana, "The kernel common vector method: A novel nonlinear subspace classifier for pattern recognition," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 4, pp. 937–951, Aug. 2007.
[29] E. Pekalska and R. Duin, "Beyond traditional kernels: Classification in two dissimilarity-based representation spaces," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 6, pp. 729–744, Nov. 2008.
[30] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum, "Integrating topics and syntax," Adv. Neural Inf. Process. Syst., vol. 17, pp. 537–544, 2005.
[31] A. Gruber, M. Rosen-Zvi, and Y. Weiss, "Hidden topic Markov models," in Proc. AISTATS, Mar. 2007, pp. 1–8.
[32] H. Othman and T. Aboulnasr, "A separable low complexity 2D HMM with application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1229–1238, Oct. 2003.
[33] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," Adv. Neural Inf. Process. Syst., vol. 16, pp. 321–328, 2004.
[34] Y. Gao, J. Fan, X. Xue, and R. Jain, "Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers," in Proc. ACM Multimedia, 2006, pp. 901–910.
[35] E. Chang, K. Goh, G. Sychay, and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 26–38, Jan. 2003.
[36] H. Tang, R. Hanka, and H. Ip, "Histological image retrieval based on semantic content analysis," IEEE Trans. Inf. Technol. Biomed., vol. 7, no. 1, pp. 26–36, Mar. 2003.
[37] D. Lowe, "Distinctive image features from scale invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[38] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," Adv. Neural Inf. Process. Syst., vol. 16, pp. 553–560, 2004.
[39] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proc. CVPR, 2004, pp. II-1002–II-1009.
[40] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. ECCV, 2002, pp. 97–112.
[41] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognit., vol. 42, no. 2, pp. 218–228, Feb. 2009.
[42] A. Makadia, V. Pavlovic, and S. Kumar, "A new baseline for image annotation," in Proc. ECCV, 2008, pp. 316–329.
Zhiwu Lu received the M.Sc. degree in applied mathematics from Peking University, Beijing, China, in 2005. He is currently working toward the Ph.D. degree with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
From July 2005 to August 2007, he was a Software Engineer with Founder Corporation, Beijing. From September 2007 to June 2008, he was a Research Assistant with the Institute of Computer Science and Technology, Peking University. He has published over 30 papers in international journals and conference proceedings, including IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B, IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), the European Conference on Computer Vision (ECCV), the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI), and the Association for Computing Machinery International Conference on Multimedia (ACM-MM). His research interests include pattern recognition, machine learning, multimedia information retrieval, and computer vision.

Horace H. S. Ip received the B.Sc. (first-class honors) degree in applied physics and the Ph.D. degree in image processing from University College London, London, U.K., in 1980 and 1983, respectively.
He is currently the Chair Professor of computer science, the Founding Director of the Centre for Innovative Applications of Internet and Multimedia Technologies (AIMtech Centre), and the Acting Vice-President of City University of Hong Kong, Kowloon, Hong Kong. He has published over 200 papers in international journals and conference proceedings. His research interests include pattern recognition, multimedia content analysis and retrieval, virtual reality, and technologies for education.
Dr. Ip is a Fellow of the Hong Kong Institution of Engineers, the U.K. Institution of Electrical Engineers, and the International Association of Pattern Recognition.