
Multimed Tools Appl

DOI 10.1007/s11042-013-1707-2

Automatic tag-to-region assignment via multiple instance learning

Zhaoqiang Xia · Yi Shen · Xiaoyi Feng · Jinye Peng · Jianping Fan

© Springer Science+Business Media New York 2013

Abstract Translating image tags given at the image level to regions (i.e., tag-to-region assignment), which can play an important role in leveraging loosely-labeled training images for object classifier training, has become a popular research topic in the multimedia research community. In this paper, a novel two-stage multiple instance learning algorithm is presented for automatic tag-to-region assignment. Regions are generated by multiple-scale image segmentation, and instances with unique semantics are selected from those regions by a random walk process. Affinity propagation (AP) clustering and the Hausdorff distance are then applied to the instances to identify the most positive instance, which is used to initialize the search for the maximum of the Diverse Density likelihood in the first stage. In the second stage, the most contributive instance chosen from each bag is treated as the key instance to simplify the computation of the Diverse Density likelihood. Finally, an automatic method is proposed to determine the boundary between positive and negative instances. Our experiments on three well-known image sets have provided positive results.

Z. Xia (B) · X. Feng · J. Peng
Northwestern Polytechnical University, Xi'an, China
e-mail: xiazhaoqiang@gmail.com
X. Feng
e-mail: fengxiao@nwpu.edu.cn
J. Peng
e-mail: jinyepeng@nwpu.edu.cn
Y. Shen · J. Fan
University of North Carolina at Charlotte, Charlotte, NC, USA
Y. Shen
e-mail: yshen9@uncc.edu
J. Fan
e-mail: jfan@uncc.edu


Keywords Tag-to-region assignment · Multiple instance learning · Instance identification · AP clustering

1 Introduction
With the exponential growth of digital images, there is an urgent need to achieve
automatic image annotation for supporting keyword-based (concept-based) image
retrieval [14, 27]. For the task of automatic image annotation, machine learning
techniques are usually involved to learn the classifiers from large amounts of labeled
training images. The ground-truth labels are usually provided by professionals.
Because it is time consuming and labor intensive to hire professionals for labeling
large amounts of training images, the sizes of such professionally-labeled image sets
tend to be small. As a result, the classifiers learned from a small set of professionally-labeled training images may not generalize well. To achieve more reliable classifier training, a large set of labeled training images is needed because [26]: (1) the number of object classes could be very large; and (2) the learning complexity for some object classes could be very high because they may have large intra-class visual diversity and inter-class visual similarity (i.e., visual ambiguity).
On the other hand, it is much easier to obtain large-scale loosely-labeled images, where object labels are given loosely at the image level rather than at the region level or the object level, as shown on the left of Fig. 1 [11]. Such loosely-labeled images have multiple advantages: (1) they can represent the various visual properties of object classes more sufficiently; (2) they can be obtained with less effort by providing the object-level labels loosely at the image level rather than at the object level or at the region level; and (3) both their labels and their visual properties are diverse, so they provide a real-world point of departure for object detection and scene recognition. Therefore, one potential solution to the shortage of object-based labeled training images is to leverage large-scale loosely-labeled training images for object classifier training by multiple instance learning [24, 28, 34], where each loosely-labeled image is considered as a bag and each region generated from the image is treated as an instance.
It is not a trivial task to leverage the loosely-labeled images for object classifier
training because they may seriously suffer from the critical issue of correspondence

Fig. 1 Illustration of multiple instance learning for tag-to-region assignment. The loosely-labeled images are shown on the left


uncertainty, e.g., each loosely-labeled image contains multiple image regions and
multiple object labels are given at the image level, thus the correspondences between
the image regions and the available labels are uncertain [15]. To leverage the loosely-labeled images for object classifier training, it is very attractive to develop new algorithms for: (a) supporting an ambiguous image representation which can transform each loosely-labeled image into a bag of instances and express its semantic ambiguity (i.e., multiple labels are available for a single image) explicitly in the instance space; (b) identifying the instance labels automatically when the labels are provided only at the image level (i.e., loose labels); and (c) quickly identifying the true positive instances for object classifier training.
As illustrated in Fig. 1, by assigning multiple labels (which are given at the image
level) into the most relevant image regions automatically, our multiple instance
learning algorithm can provide a good solution to leverage large-scale loosely-labeled images for object classifier training. Compared to traditional multiple instance learning algorithms, which usually label the entire bag rather than individual instances, a fast two-stage multiple instance learning algorithm is presented in this paper to identify the true positive instances, and it can be applied to tag-to-region assignment.
Three characteristics of our proposed approach are as follows:

– Multiple-scale regions are generated by image segmentation, and semantically unique instances are selected from those regions by a random walk process;
– A two-stage multiple instance learning algorithm for speeding up the Diverse Density likelihood computation is proposed to identify the true positive instances;
– An automatic method of boundary demarcation is proposed to determine the boundaries of categories.

The rest of this paper is organized as follows. Section 2 briefly reviews the related work, and the Diverse Density framework that our work relies on is presented in Section 3. Section 4 introduces our new algorithm in detail and Section 5 presents our experimental results for algorithm evaluation. We conclude and discuss future work in Section 6.

2 Related work
In the last few decades, many multiple instance learning algorithms have been proposed and applied to many fields since the term Multiple Instance Learning was coined by Dietterich et al. in the drug activity prediction domain [10]. An intuitive solution to the multiple instance learning problem is to find the true positive instances in positive bags, and much research has focused on this. The axis-parallel rectangle learning algorithm has been proposed by Dietterich et al. [10] to find directly, at the instance level, which instances in the positive bags are true positive. The Diverse Density approach has been proposed by Maron and Lozano-Pérez [21] and applied to scene classification [22]; it integrates all the instances in a probabilistic model.
The RW-SVM (Random Walk-SVM) algorithm has used a random walk process to find the true positive instances, and an SVM is then used to train the image classifiers to annotate entire images of three categories [31]. A multiple-task SVM algorithm, called MTSMLMIL [26], has utilized graph clustering to find the true positive instances and


then the multi-task SVM is used to recommend tags for the image annotation. Many
other optimization algorithms, such as mi-SVM [1] and sparse-transductive SVM [2],
have been proposed to identify the true positive instances through an iterative
procedure. Another direction for solving the multiple instance learning problem is to measure similarity at the instance level, such as Citation-kNN [32] and BAMIC [35] (a multiple instance clustering algorithm). These approaches only label the entire bag, while the preceding approaches, which find the true positive instances, can label both instances and bags. Chen et al. have developed an approach called MILES (Multiple-Instance Learning via Embedded instance Selection) to enable region-based image
annotation when labels are available only at the image level [5]. That approach maps
bags into a feature space defined by the instances and provides features for the
1-norm SVM through the mapping. Vijayanarasimhan et al. have developed a
multiple-label multiple-instance learning approach to achieve more effective learning from loosely-labeled images [29], which uses a sparse SVM to iteratively improve
positive bags. Viola et al. [30] have adapted traditional boosting methods to make them better suited to the multiple instance learning problem and used them to learn object detectors from loosely-labeled images.
Some approaches utilize expert-labeled training images to learn models and annotate images through these models. A multi-class SVM algorithm has been utilized
to annotate different image regions by Cusano et al. [8]. Through statistical modeling
and optimization techniques, Li and Wang [18] have developed an algorithm to
train the classifiers for hundreds of semantic concepts. A probabilistic model has
been proposed to estimate the mixture density for each image and minimize the
annotation error by Carneiro et al. [3]. Jeon et al. utilize a cross-media relevance
model to annotate images automatically [13]. Other approaches have been proposed
to utilize the user-supported image-level tags to annotate new images. A bi-layer
sparse coding algorithm based on over-segmented image regions has been used to
annotate images [20]. Liu et al. [16] have proposed a multi-edge graph model to label
the regions. Yang et al. have utilized the Diverse Density framework to enrich image
tags [33]. A weakly supervised graph propagation method has been developed to
assign annotated labels at the image level to the semantic regions [19].
In this paper, we focus on the Diverse Density algorithm for multiple instance
learning, which has been used in many applications [5, 33]. The advantage of the
Diverse Density framework is that it uses the statistical information of all bags in a probabilistic way, accumulating the instances to the bag level so as to utilize the provided label information. The instances are combined in a Noisy-OR probabilistic model to obtain the likelihood over all bags; however, it is difficult to solve the optimization problem if too many instances exist in the Noisy-OR model. So we propose a novel two-stage approach based on the Diverse Density algorithm to accelerate the computation of the Diverse Density likelihood. Before introducing our approach, we briefly revisit the Diverse Density algorithm in the next section.

3 Diverse density
Maron et al. proposed the Diverse Density framework to solve the problem of drug activity prediction and then applied it to scene classification [21, 22]. The
general framework uses the likelihood of instances being positive (i.e., Diverse


Density) to measure the intersection of positive bags minus the union of negative
bags. Diverse Density at a point is defined to measure how many different positive
bags have instances near that point and how far the negative instances are away from
that point [21]. The target of the Diverse Density likelihood DD(x) is to find an
appropriate point (denoted as t) in the feature space which has the most true positive
instances around that point and most true negative instances away from that point.
This appropriate point t in the Diverse Density framework is also called the concept,
where a bag is labeled positive even if only one of the instances in it falls within
the concept and a bag is labeled negative only if all the instances in it are negative.
The concept can be discovered through maximizing the likelihood of positive bags
and negative bags in the feature space. The appropriate point (the desired concept)
corresponds to the maximum of Diverse Density likelihood in the feature space.
In the Diverse Density framework, the set of loosely-labeled images (their labels are given at the image level, or bag level) is defined as D, which consists of a set of bags B = {B_1, ..., B_m} and corresponding labels L = {l_1, ..., l_m}. Let bag B_i = {B_i1, ..., B_ij, ..., B_in}, where B_ij is the jth instance, and label l_i = {l_i1, ..., l_ij, ..., l_ip}, where l_ij corresponds to the label of the jth instance in bag B_i. The positive bags are denoted as B_i+ and the jth instance in B_i+ as B_ij+. Likewise, B_ij− represents a negative instance in the negative bag B_i−. The Diverse Density over all points x in the feature space is denoted as DD(x) = P(x = t | B_1+, ..., B_n+, B_1−, ..., B_m−). The concept (the point t) can be found by maximizing DD(x). Assuming that the maximum point t follows a uniform prior in the feature space, by Bayes' rule this is equivalent to


\arg\max_{x} DD(x) = \arg\max_{x} \prod_{i} P(x = t \mid B_i^+) \prod_{i} P(x = t \mid B_i^-) \qquad (1)

The Diverse Density algorithm uses the Noisy-OR model to compute the probability
of a bag being positive near the potential point t and the probability of negative bags
being away from the point t

P(x = t \mid B_i^+) = 1 - \prod_{j} \bigl(1 - P(x = t \mid B_{ij}^+)\bigr), \qquad
P(x = t \mid B_i^-) = \prod_{j} \bigl(1 - P(x = t \mid B_{ij}^-)\bigr) \qquad (2)

Then those probabilities are modeled by the distance between the potential target of
concept t and the instances:
P(x = t \mid B_{ij}) = \exp\bigl(-\lVert B_{ij} - x \rVert^2\bigr) \qquad (3)

If the instances in a positive bag are near the potential target x, the probability P(x = t | B_i+) would be high. Likewise, the probability P(x = t | B_i−) should be high only if the instances in a negative bag are far away from the candidate point x. To reduce the complexity of the Diverse Density algorithm, the negative logarithm of DD(x) can be adopted, so that its minimum is searched for instead of computing DD(x) directly.
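To make the Noisy-OR objective concrete, the following Python sketch (ours, not the authors' code; the bag data structures, the starting points, and the use of SciPy's general-purpose minimizer are assumptions) evaluates the negative log Diverse Density of (1)–(3) and searches for its minimum:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (not the authors' code) of the Noisy-OR Diverse Density objective
# in (1)-(3). Bags are lists of 1-D numpy feature vectors; the concept point
# x is found by minimizing the negative log-likelihood.

def instance_prob(x, inst):
    # P(x = t | B_ij) = exp(-||B_ij - x||^2), Eq. (3)
    return np.exp(-np.sum((inst - x) ** 2))

def neg_log_dd(x, pos_bags, neg_bags, eps=1e-12):
    nll = 0.0
    for bag in pos_bags:                       # Noisy-OR over positive bags, Eq. (2)
        p = 1.0 - np.prod([1.0 - instance_prob(x, b) for b in bag])
        nll -= np.log(p + eps)
    for bag in neg_bags:                       # negative bags should stay far from x
        p = np.prod([1.0 - instance_prob(x, b) for b in bag])
        nll -= np.log(p + eps)
    return nll

def maximize_dd(pos_bags, neg_bags, starts):
    # try several starting points and keep the best local optimum
    results = [minimize(neg_log_dd, s, args=(pos_bags, neg_bags)) for s in starts]
    return min(results, key=lambda r: r.fun).x
```

The list of starting points in this sketch is exactly what the rest of the paper aims to shrink to a single good candidate.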
From (1), one can observe that all instances need to be integrated into the likelihood DD(x); the formula thus becomes computationally complicated and has no analytic solution. Furthermore, the complexity of the generative model is nonlinear in the number of instances and bags. The gradient ascent algorithm


can be utilized to find the maximum of DD(x) even if it is not convex. To avoid the
local maximum of DD(x), multiple initial points need to be adopted to find the global
maximum of DD(x). So all the positive instances are used as its initial points and one
of them is likely to be close to the maximum point t. Although it is beneficial to find
the global solution for DD(x), it is inefficient to search from multiple starting points, especially when the number of positive instances becomes very large.
Based on this observation, our proposed algorithm for multiple instance learning
will identify an instance from all these positive instances as the single initial point
rather than using all the positive instances as the initial points. On the other hand, the
DD(x) is actually affected mostly by the instances which are nearest to the concept
t in each bag. Thus our proposed algorithm for multiple instance learning only uses
one instance instead of all instances to compute DD(x) in each bag. Through these
two steps, our proposed algorithm for multiple instance learning can reduce the
computational cost significantly and guarantee faster convergence.

4 Our proposed method


In this section, we introduce our proposed algorithm of multiple instance learning,
which can be applied to solve the problem of tag-to-region assignment. Figure 2
illustrates the framework of our proposed algorithm regarding the tag-to-region
assignment. Firstly, the positive bags (positive images) for a certain tag are generated
by collecting images labeled the specified tag (e.g. cow, aeroplane, or tree) while
negative bags are generated by collecting images without that tag. Then we utilize a

Fig. 2 Illustration of our proposed algorithm for the problem of tag-to-region assignment with
a multiple instance learning procedure, which includes three key components: a using the JSEG
segmentation to generate multiple-scale instances and select instances with unique semantics;
b utilizing the AP clustering to choose the best candidate for Diverse Density as the single initial
point; c speeding up the procedure of computing Diverse Density likelihood maximum by identifying
the most contributive instance from each bag


Then we utilize a multiple instance learning framework to discover a concept and its boundary for each tag (e.g. cow), which is explained in Sections 4.1 to 4.5. Finally, an appropriate tag selected from all tags is assigned to each instance (i.e. image region) by ranking the relative distances between the instance and the concepts of all tags, as introduced in Section 4.5.
Our proposed algorithm with multiple instance learning consists of three key
components as shown in Fig. 2: (a) we utilize the image segmentation technique to
generate multiple-scale regions and extract those instances with unique semantics
(Section 4.1); (b) we utilize the AP clustering algorithm to find the best candidate
in the semantic-unique instances for computing the maximum of Diverse Density
likelihood (Sections 4.2 and 4.3); (c) we speed up the procedure of computing the
Diverse Density likelihood maximum by identifying the most contributive instance
from each bag (Section 4.4).
4.1 Multiple-scale instance generation
In this section, we generate multiple-scale instances in each bag and pick out those with unique semantics. These regions with unique semantics are also referred to as good instances. For some existing
multiple instance learning algorithms, the instances are generated by randomly cropping regions from images [1, 22]. Such a random selection procedure could produce too many instances in one bag, and each instance may be only partly positive
because the randomly sampled boxes in images may not contain the objects of
interest completely. These instances produced by random selection are actually not
responsible for the bag-level tags and give rise to the non-uniqueness of semantics.
Another approach to tackle the non-uniqueness problem of semantics is to utilize
automatic image segmentation. However, over-segmentation or under-segmentation can easily occur when different parameters are used for a segmentation algorithm. To solve this dilemma, we utilize a set of parameters to generate multiple segments (i.e. instances) with different sizes and shapes. We call this procedure Multiple-Scale Instance Generation, in which at least one of these segments can correspond to the bag-level tag and satisfy the condition of semantic uniqueness. Then a random
walk procedure can be used to find the instances with unique semantics in each bag.
We make use of the J-images segmentation (JSEG) [9] algorithm to partition an
image into a set of regions (i.e., instances), which are determined by the adjustable parameter pair (q, m). The parameter q denotes the color quantization threshold and m the spatial segmentation threshold in the JSEG algorithm.1 Compared to other segmentation methods, JSEG is relatively fast at generating multiple instances over a sufficient range of parameters. Figure 3 shows examples of JSEG segmentation. All these image regions are treated as candidate instances with unique semantics, but some candidates generated by over-segmentation are only fragments of instances with unique semantics, so those fragments cannot be used to compute the Diverse Density likelihood. Usually, the instances with unique semantics are similar to those fragments because the fragments are parts of complete instances [25].

1 http://vision.ece.ucsb.edu/segmentation/jseg/


Fig. 3 Illustration of generating multiple-scale regions (four scales with different parameter settings) and choosing good instances with unique semantics

Based on this observation, a random walk process is utilized to seek the regions which are similar to other regions; these regions are then selected as the good instances in a bag, and they usually have unique semantics.
Assume that n nodes (i.e. candidate instances or regions) exist in the random walk process, where each node corresponds to one candidate instance in the bag. The random walk process is then formulated as

\pi_{k+1}(i) = \alpha \sum_{j \in \Omega_i} \pi_k(j)\, p(i, j) + (1 - \alpha)\, \pi_o(i) \qquad (4)

where Ωi is the set of neighboring instances connected with the ith instance, πo(i) is the initial relevance score for the ith instance, p(i, j) is the transition probability from instance j to instance i, and α ∈ [0, 1] linearly weights the two terms. The relevance score for the ith instance at the kth iteration is denoted πk(i). The first term in (4) represents the similarity between the ith instance and the other instances.
Because the multiple-scale segmentation may generate the same instance (over 90 % area overlap) at different scales, the initial relevance score is defined as

\pi_o(i) = \frac{c(i)}{\sum_{i=1}^{n} c(i)} \qquad (5)


where c(i) is the number of times the ith instance appears among the multiple-scale segments. In this context, we define the transition probability using the similarity of two instances, that is,

p(i, j) = \frac{s_{ij}}{\sum_{k} s_{ik}} \qquad (6)

where the similarity sij is computed from the distance between instances, as explained in (15).
According to (4), the random walk process selects the instances that have higher similarities with the others (through the accumulation of similarities in the first term of (4)), i.e., the instances with stronger relations to the others. For each image, we choose the top candidate instances (n divided by the number of segmentation scales) as the final good instances. These good instances are usually semantically unique and are further used for instance clustering and for computing the maximum of the Diverse Density likelihood.
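A minimal sketch of this random walk selection, under our own assumptions (a precomputed similarity matrix S built from (15), appearance counts across scales, and hypothetical parameter values), could look as follows:

```python
import numpy as np

# Sketch of the random walk in (4)-(6) for one bag. S is a precomputed
# instance-similarity matrix and counts[i] is the number of scales at which
# instance i re-appears; both inputs are our placeholders.

def select_good_instances(S, counts, alpha=0.85, n_keep=5, n_iter=100, tol=1e-8):
    P = S / S.sum(axis=1, keepdims=True)       # transition probabilities, Eq. (6)
    pi0 = counts / counts.sum()                # initial relevance scores, Eq. (5)
    pi = pi0.copy()
    for _ in range(n_iter):                    # iterate Eq. (4) until convergence
        pi_next = alpha * (P @ pi) + (1 - alpha) * pi0
        if np.abs(pi_next - pi).max() < tol:
            pi = pi_next
            break
        pi = pi_next
    return np.argsort(-pi)[:n_keep]            # highest-relevance instances are kept
```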
4.2 Instance clustering
In this section and the next, we discuss a method for identifying the single starting point for computing the maximum of the Diverse Density likelihood. Even though the good
instances have been detected by the JSEG segmentation and random walk algorithm,
there are too many instances to choose as the initial points for Diverse Density
likelihood computation. The goal is to find the most positive instance as the single starting point. According to the definition of Diverse Density, the maximum of the Diverse Density likelihood is likely to be located in an area that is dense in positive instances and sparse in negative instances. In other words, the optimal solution may occur in the vicinity of one of the groups that gather similar positive instances together. The group we want to find should not contain negative instances.
Based on this observation, the instances in positive bags and negative bags are first
grouped into multiple clusters respectively according to their visual similarity. Then
a cluster of positive instances, which would be furthest away from all clusters of
negative instances, is identified by the distance between the clusters in the positive
instances and the clusters in the negative instances. This cluster center will be selected
as the initial point for computing Diverse Density.
In this paper, we adopt the AP [12] clustering approach to group the instances in
positive bags and negative bags into multiple clusters. The classical K-means or K-medoids clustering approaches need to randomly choose k initial cluster centers at the beginning, and thus they are not very suitable for our instance clustering problem because
the number of instance clusters is unknown. As an extension of K-medoids clustering,
AP clustering simultaneously takes all instances as the potential exemplars (i.e. cluster centers), where real-value preferences can be used to represent the probability
of instances as the exemplars. As a result, AP clustering does not need to assign the
initial cluster centers at the beginning and it can automatically detect the exemplars
and group instances.
The AP clustering algorithm propagates two kinds of messages (responsibility
and availability) between instances and uses these accumulated messages to determine exemplars and group instances. The responsibility r(i, k) , sent from the instance
i to the candidate exemplar k, is used to reflect how well-suited k is to serve as
the exemplar for i compared to all other possible exemplars. The availability a(i, k),
sent from the candidate exemplar instance k to the instance i, is used to reflect how


appropriate it would be for the instance i to select k as its exemplar, taking support
from other instances into account [12]. These messages are updated using the rules




r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \bigl\{ a(i, k') + s(i, k') \bigr\}

a(i, k) \leftarrow \min\Bigl\{ 0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\} \Bigr\} \qquad (7)

where s(i, k) represents the similarity between the ith instance and the kth instance. The self-availability is updated as follows

a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\} \qquad (8)

According to the rules above, the responsibilities are first updated with the availabilities a(i, k) initialized to zero, and the availabilities are then updated given the responsibilities. The algorithm is deemed to have converged when the messages no longer change, and the entities with maximal a(k, k) among all a(i, k) are then selected as exemplars automatically. After the cluster centers are detected, the remaining
instances are assigned to their nearest cluster centers automatically. Through the AP
clustering, the instances from positive bags and negative bags are grouped separately
and similar instances are grouped into the same clusters. The selection of an optimal initial point based on these clusters is introduced in the next section.
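As an illustration only, the grouping step could be realized with an off-the-shelf Affinity Propagation implementation applied to a precomputed similarity matrix; scikit-learn is our stand-in here, not necessarily what the authors used, and the similarity is the negated distance of (15):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Illustration only: group instances (rows of X) with Affinity Propagation on
# a precomputed similarity matrix. scikit-learn stands in for the message
# passing of (7)-(8); the similarity is the negated normalized L1 distance.

def ap_clusters(X):
    diff = np.abs(X[:, None, :] - X[None, :, :])
    S = -np.sum(diff / (1.0 + X[:, None, :] + X[None, :, :]), axis=2)
    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
    exemplars = X[ap.cluster_centers_indices_]   # automatically detected cluster centers
    return exemplars, ap.labels_
```

Using a precomputed similarity keeps the clustering consistent with the distance used elsewhere in the method, and Affinity Propagation removes the need to fix the number of clusters in advance.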
4.3 Candidate identification
As discussed above, we can utilize AP clustering to group instances into two kinds of clusters: positive clusters and negative clusters. The clusters grouped from positive instances are called positive clusters, denoted as Θ+; the clusters grouped from negative instances are called negative clusters, denoted as Θ−. After that, we need to find the cluster in Θ+ that is furthest away from all the clusters in Θ−. The maximum of the Diverse Density likelihood defined in (1) is likely to occur in the area adjacent to such a cluster. So this cluster center is referred to as the most positive instance, which can be used as the best initial point for the Diverse Density computation. Such a cluster can be identified by the distance between the positive clusters Θ+ and the negative clusters Θ−.
In this paper, the Hausdorff distance is used to measure the distance between two
clusters (i.e., two instance sets), which has the advantage that it is not affected by a
moderate number of outliers. For two instance sets Cm ∈ Θ+ and Cn ∈ Θ−, the Hausdorff distance between Cm and Cn is the smallest value d such that each element of Cm is within distance d of at least one element of Cn and each element of Cn is within distance d of at least one element of Cm. The Hausdorff distance is defined as:

H(C_m, C_n) = \max \{\, h(C_m, C_n),\; h(C_n, C_m) \,\}, \qquad h(C_m, C_n) = \max_{x_m \in C_m} \min_{x_n \in C_n} d(x_m, x_n) \qquad (9)

where xm and xn are elements of the instance sets Cm and Cn. The distance d(xm, xn) between two instances can be any distance metric appropriate for the extracted features. The distance we adopt will be introduced in detail in Section 5.2.


Based on this set distance, we use a score φ to represent the distance between each cluster in the positive clusters Θ+ and all the negative clusters Θ−. It is defined as follows

\phi(C_m) = \min_{n} H(C_m, C_n), \qquad C_m \in \Theta^+,\ C_n \in \Theta^- \qquad (10)

We can pick out the furthest cluster through the equation above and identify the cluster Cm* having the maximum value of φ. Since the maximum of the Diverse Density likelihood will occur in the vicinity of Cm*, we could take any instance in the cluster Cm* as the initial point of Diverse Density; we find that (1) converges to the same solution regardless of which instance in Cm* is used. So we take the center tm of the cluster Cm* as the single initial point for computing the maximum of the Diverse Density likelihood and call this best initial point tm the most positive instance among all instances.
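A sketch of this candidate identification step follows; the cluster contents, the cityblock stand-in metric, and the use of the cluster mean in place of the AP exemplar are all assumptions on our part:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Sketch of candidate identification: each positive cluster is scored by its
# smallest Hausdorff distance to any negative cluster, Eqs. (9)-(10), and the
# center of the best cluster initializes the Diverse Density search.

def hausdorff(Cm, Cn):
    D = cdist(Cm, Cn, metric="cityblock")
    return max(D.min(axis=1).max(),   # h(Cm, Cn) = max_m min_n d(x_m, x_n)
               D.min(axis=0).max())   # h(Cn, Cm)

def most_positive_center(pos_clusters, neg_clusters):
    scores = [min(hausdorff(Cm, Cn) for Cn in neg_clusters) for Cm in pos_clusters]
    best = pos_clusters[int(np.argmax(scores))]   # furthest from every negative cluster
    return best.mean(axis=0)                      # stand-in for t_m, the most positive instance
```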
4.4 Speeding up diverse density
Through the selection of the best initial point, we can acquire a global solution for the maximum of Diverse Density without trying each positive instance, which removes some unnecessary computation. Even with this step, we still need to use all instances in the positive bags when computing the Diverse Density likelihood according to (1). In this section, we explain how to further reduce the Diverse Density computation.
From (1), we can see that the likelihood between each bag and the point t is the product of the probabilities between the instances in that bag and the point t. The maximum of this product is mainly influenced by the instance nearest to the point t. As the probability of each instance being positive always lies in the range [0, 1], the product cannot exceed the largest of its factors. From the viewpoint of multiple instance learning, there is always one instance in each bag that contributes most to the bag-level label. Based on these observations, the most contributive instance in each bag is identified to represent that bag and reduce the complexity of the Diverse Density computation, which is similar to the Maximization step in [36]. We choose an instance from each bag by
\arg\max_{j} P(x = t \mid B_{ij}) \qquad (11)

After we use the most contributive instance B_ij* to represent a bag, we write the conditional probability between a candidate point x and the bag B_i as P(x = t | B_i) = P(x = t | B_ij*), where the j*th instance is determined by the equation above. The distance used in P(x = t | B_ij) in our algorithm differs slightly from that of Diverse Density: we define P(x = t | B_ij) = exp(−d(B_ij, x)), where the distance d(B_ij, x) can be any measurement between two vectors (instances). In this paper, we use the distance defined in (15). The Diverse Density likelihood computation in our algorithm can thus be simplified as follows

\arg\max_{x} DD'(x) = \arg\max_{x} \prod_{i} P(x = t \mid B_i) = \arg\min_{x} \sum_{i} -\log P(x = t \mid B_{ij^*}) \qquad (12)


Even though we have reduced the computation steps, it is still difficult to derive an analytical solution for (12). For the optimization problem above, we therefore search for the optimal solution numerically, where the optimal solution represents the concept of each tag. The method for determining the boundary of each tag is explained in the next section.
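The simplified objective can be sketched as follows (our own illustration; negative bags, omitted here, could be handled analogously, and the choice of a generic SciPy optimizer is an assumption):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

# Sketch of the simplified objective in (11)-(12): each positive bag is
# represented only by the instance nearest to the current point x, and the
# sum of negative log probabilities is minimized from the single initial
# point t_m of Section 4.3.

def simplified_neg_log_dd(x, pos_bags):
    # -log P(x = t | B_i) = d(B_ij*, x), with P = exp(-d) and j* from Eq. (11)
    return sum(cdist(bag, x[None, :], metric="cityblock").min() for bag in pos_bags)

def find_concept(pos_bags, t_m):
    res = minimize(simplified_neg_log_dd, t_m, args=(pos_bags,))
    return res.x     # the learned concept point for this tag
```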

4.5 Boundary determination


Even after the point t in the feature space has been identified by the above procedures, we still need to discover the boundary between positive instances and negative instances for automatic tag-to-region assignment. A distance threshold T can be used to determine the boundary: an instance is true positive if d(x, t) ≤ T; otherwise, it is negative. Cross validation can be used to tune this distance threshold, but the search range of the validation is still too large. So we first find the minimum and maximum thresholds as the lower and upper bounds of the search before cross validation is applied. In a positive bag, at least one instance is positive and at most all of them are positive. Accordingly, we set the lower bound to the minimum distance threshold at which at least one instance of every positive bag is covered, and the upper bound to the maximum distance threshold at which all the instances of a positive bag are covered. We define the boundary search range as follows
T_{\min} = \max_{B_i^+} \min_{j \in B_i^+} d(B_{ij}, t), \qquad T_{\max} = \min_{B_i^+} \max_{j \in B_i^+} d(B_{ij}, t) \qquad (13)

After the search range [Tmin, Tmax] has been determined, we adopt K-fold (K = 10) cross validation to find the best distance threshold T ∈ [Tmin, Tmax]. First, the training set is partitioned into K subsets and each of them is used in turn as the validation set. In each evaluation, we partition the search range [Tmin, Tmax] into L intervals and use the centers of these L intervals as candidate thresholds to validate the accuracy. The threshold which yields the best performance is then set as the final distance threshold. The procedure for finding the distance threshold is shown in Fig. 4, which shows results for three categories of the MSRC and NUS-WIDE(OBJECT) datasets.
The concept and its boundary can be used to assign tags to regions; however, the boundaries of some categories may overlap in the feature space, and one instance could then be labeled with many tags. To determine a unique category for each instance, we utilize a ranking of relative distances. For the concept C_i of the ith category, the ranking scores of an instance are defined as
\Gamma(x) = \operatorname{sort}_{i} \left( \frac{d(x, C_i)}{T_i} \right) \qquad (14)

The distance d(x, C_i) is computed according to (15) and T_i is the best distance threshold of the ith category obtained by cross validation. The top-ranked category (i.e., the one with the smallest relative distance) is assigned as the tag of this instance.
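A sketch of the boundary search and tag assignment follows; the exact form of the bounds was reconstructed from the surrounding text, so treat it as an assumption, and dist stands for the normalized L1 distance of (15):

```python
import numpy as np

# Sketch of boundary determination and tag assignment, Eqs. (13)-(14).
# concepts and thresholds are the per-tag outputs of Sections 4.3-4.5.

def threshold_range(pos_bags, t, dist):
    per_bag = [np.array([dist(b, t) for b in bag]) for bag in pos_bags]
    t_min = max(d.min() for d in per_bag)   # every positive bag keeps at least one instance
    t_max = min(d.max() for d in per_bag)   # some positive bag keeps all of its instances
    return t_min, t_max

def assign_tag(x, concepts, thresholds, dist):
    # relative distances d(x, C_i) / T_i of Eq. (14); smaller means closer
    scores = np.array([dist(x, c) / T for c, T in zip(concepts, thresholds)])
    if scores.min() > 1.0:                  # outside every boundary: leave unlabeled
        return None
    return int(np.argmin(scores))           # best-ranked category becomes the region tag
```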


Fig. 4 Threshold determination for 3 image categories on the MSRC and NUS-WIDE(OBJECT) datasets

5 Experimental evaluation
In this section, we describe the details of our experiments, including the image sets
we used, the visual feature extraction of instances, the baseline methods and results
of our experiments.
5.1 Image sets
To evaluate our algorithm precisely, several image sets are used to verify its effectiveness. MSRC2 is collected from search engines and includes 591 images and 23 object categories; the horse category is disregarded due to the lack of examples. On average, each image in MSRC has about 3.95 tags. NUS-WIDE [6] is collected from the social website Flickr and originally contains 269,648 images
and the tags associated with these images. In our experiments, we only use the
categories which can correspond to image regions, i.e. object categories. So we pick
out 25 object categories and 10,157 images (marked as NUS-WIDE(OBJECT)) to
evaluate our algorithm, where each image has 2.01 tags. COREL30K [3] is based

2 http://research.microsoft.com/en-us/projects/objectclassrecognition/; we use version 2.0.


on the Corel image dataset, containing 31,695 images and 1,035 tags. We process the COREL30K dataset in the same manner as the NUS-WIDE dataset and select 27,194 images and 121 object categories3 that have enough images for training and testing. Each image of COREL30K has 2.13 tags on average.
5.2 Visual feature
To extract effective features from the visual content of instances (whose generation was introduced in Section 4), we use a well-known descriptor, the Bag-of-Words model with SIFT, to capture key-point information of instances [17]. For each instance, interest points are extracted with a difference-of-Gaussian function and represented by 128-dimensional descriptors. K-means clustering is then used to construct a codebook of SIFT points. One critical issue for codebook construction is determining the codebook size, which we do by grid search. In our algorithm, we choose a codebook size of 500 and represent each instance with a 500-dimensional vector.
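For illustration, a bag-of-words descriptor along these lines could be computed with OpenCV's SIFT and scikit-learn's K-means; these libraries and the helper names are our assumptions, not the authors' implementation:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

# Illustration of the bag-of-words instance descriptor: SIFT keypoints
# quantized against a 500-word codebook.

def build_codebook(all_descriptors, k=500):
    # all_descriptors: stacked 128-D SIFT descriptors from the training instances
    return KMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def bow_vector(region_bgr, codebook, k=500):
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    hist = np.zeros(k)
    if desc is not None:                       # a region may contain no keypoints
        words = codebook.predict(desc)
        np.add.at(hist, words, 1)              # 500-dimensional word histogram
    return hist
```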
The distance between two instances xm and xn can be measured in many ways, for example by the Euclidean distance. However, the L1 norm (i.e. Manhattan distance) has been shown to be more robust to outliers than the L2 norm (i.e., Euclidean distance). In this paper, we utilize a normalized L1 norm to measure the distance between two instances. The distance function we adopt is defined as follows

d(x_m, x_n) = \sum_{i} \frac{|x_m(i) - x_n(i)|}{1 + x_m(i) + x_n(i)} \qquad (15)

The normalized distance measurement reduces the impact of scale differences across dimensions. This distance is used to compute the similarity matrix for clustering and to determine the distance boundary.
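Equation (15) translates directly into a small helper (assuming non-negative bag-of-words histograms as inputs); this is the dist function referred to by the earlier sketches:

```python
import numpy as np

# Direct transcription of the normalized L1 distance in Eq. (15).
def normalized_l1(xm, xn):
    return float(np.sum(np.abs(xm - xn) / (1.0 + xm + xn)))
```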
5.3 Experiments
To evaluate the performance of choosing the best initial point for computing Diverse
Density likelihood, the traditional precision and recall are used to demonstrate the
effectiveness of our clustering method. As the MSRC image set provides pixel-wise ground-truth images, we utilize them to test the performance of our AP clustering. As shown in Fig. 5, precision is high in most categories because most instances grouped into the same cluster belong to the same category. Recall is not very high in many categories because images of the same category are very diverse in visual content, and not all the positive instances can be grouped into a single cluster since the number of members per cluster is limited. As the number of positive instances of a category grows, the recall becomes lower, as observed in Fig. 5.
As a comparison for our distance measurement, we use the normalized Euclidean distance (abbreviated as ED in Fig. 5) to observe the influence of the distance on clustering. We find from Fig. 5 that the distance defined in (15) improves the precision at the cost of a lower recall rate compared to the normalized Euclidean distance.

3 These object categories are presented in Fig. 8 and the Appendix.


Fig. 5 Precision and recall of AP clustering with the ground-truth of instances on the MSRC dataset

To evaluate the effectiveness of our proposed algorithm, we use the following


approaches as the baseline algorithms: (a) our algorithm versus our approach without
three key components (without Multiple-scale Instances Generation, without Candidate Identification, and without Threshold Selection); (b) our algorithm versus the
Diverse Density framework [21] (using all positive instances as their initial points
and searching the maximum Diverse Density directly); (c) our algorithm versus the
EM-DD algorithm [36] (choosing all positive instances as initial points and using the
most positive instance to compute the maximum Diverse Density); (d) our algorithm
versus the mi-SVM algorithm [1] (an approach to find the true positive instances
using the optimization technique directly); and (e) our algorithm versus the RW-SVM algorithm [31] (another approach to find the true positive instances using
the random walk algorithm). For all approaches mentioned above, we compare the
accuracy and run-time of algorithms based on the MSRC, NUS-WIDE(OBJECT)
and COREL30K datasets. The algorithms are executed on computer clusters with
Intel Xeon X5570 processors running Red Hat server 6.2. To avoid the influence of sample selection, we randomly generate K different training datasets and use the average result
of these random subsets to evaluate the effectiveness of our proposed algorithm.
In order to illustrate the impact of the three key components, we compare variants of our approach with each component removed separately. The three key components are Multiple-scale Instance Generation (MIG), Candidate Identification (CI), and Threshold Selection (TS); the results are summarized in Table 1.



Table 1 The average accuracy/run-time of each category on three image datasets when removing different components of our method

Our Method        Without MIG    Without CI     Without TS    Overall
MSRC (%/s)        66.1/1.69      70.7/25.51     63.4/1.43     70.9/1.5
NUS-WIDE (%/s)    56.9/19.38     58.0/319.44    56.1/17.48    58.1/18.1
COREL30K (%/s)    53.2/14.74     54.2/247.8     52.9/14.5     54.2/14.7

From Table 1, it can be concluded that the three components have different effects on the accuracy of tag-to-region assignment and on the run-time (the run-time of learning a concept and its boundary). The components MIG and TS mainly contribute to the improvement of accuracy, while the component CI contributes to the reduction of run-time. The MIG procedure selects semantically unique instances and is beneficial for discovering a more accurate concept for each tag. Without MIG, some over-segmented regions (instances) may be generated and impair the detection of concepts for tags. When TS is bypassed, an empirical threshold is taken instead; compared to using TS, the empirical approach is simpler but less accurate. The component CI seeks the optimal initialization for computing the maximum of the Diverse Density likelihood and saves the time of trying different initial points: it utilizes the clustering technique to discover the best initial point instead of the time-consuming computation with multiple initial points. More details on the experiments are shown in Figs. 6, 7, 8 and the Appendix.
To assess the advantages of the Diverse Density framework, we compare the
performance between Diverse Density-based approaches (i.e., Diverse Density,
EM-DD and ours) and SVM-based algorithms (i.e., mi-SVM and RW-SVM) using
the same three datasets and feature extraction method. As shown in Table 2, we
can observe from these results that Diverse Density-based algorithms improve the
average accuracy for most categories. The improvement on the average accuracy is
mainly attributed to the fact that Diverse Density algorithms utilize the likelihood of
instances to reduce the ambiguity between instances (image regions) and bag-level
labels (i.e., tags of the entire image).

Fig. 6 Average accuracy on the MSRC dataset using 8 approaches: a mi-SVM; b RW-SVM;
c Diverse Density(DD); d EM-DD; e Our Method; f Our Method without MIG; g Our Method
without CI; h Our Method without TS


Fig. 7 Average accuracy on the NUS-WIDE(OBJECT) dataset using 8 approaches: a mi-SVM; b RW-SVM; c Diverse Density(DD); d EM-DD; e Our Method; f Our Method without MIG; g Our Method without CI; h Our Method without TS

In contrast, SVM-based algorithms utilize the optimization or iterative approach to obtain the relationship between instances and bag-level labels directly. It is difficult to reach the target when the initial labels are not assigned to the right instances at the start of the algorithm. Diverse Density-based algorithms do not need the right labels to be assigned to the instances when the algorithm starts. In addition, the geometric shape of the boundary in Diverse Density is a hypersphere in the feature space, while the boundary of the SVM-based algorithms is a hyperplane.
To illustrate the improvement of our algorithm, we compare it with two existing Diverse Density-based approaches: DD and EM-DD. The detailed average accuracy of our experiments using these Diverse Density algorithms is shown in Figs. 6, 7, 8 and the Appendix. The improvement of our algorithm mainly comes from two components:
AP clustering and Hausdorff distance between clusters are utilized to identify the

Fig. 8 Average accuracy on the 29 categories (first part of the 121 categories, which also occur in MSRC or NUS-WIDE) of the COREL30K dataset using 7 approaches: a mi-SVM; b RW-SVM; c EM-DD; d Our Method; e Our Method without MIG; f Our Method without CI; g Our Method without TS



Table 2 Average accuracy of image annotation on three datasets using five different methods

            mi-SVM (%)   RW-SVM (%)   DD (%)   EM-DD (%)   Our Method (%)
MSRC        56.6         57.8         65.8     55.2        70.9
NUS-WIDE    53.2         54.1         55.6     52.8        58.1
COREL30K    51.8         52.2         NR       52.5        54.2

Bold values show the best performance

AP clustering and the Hausdorff distance between clusters are utilized to identify the candidate exemplar of the clusters as the initial point in the first step (coarse step). From this initial point, the concept corresponding to each category is found by the speeded-up Diverse Density procedure in the second step (fine step). (b) We design an automatic method to find the boundary for the concept after the concept has been found, which is better than a method relying only on the leave-one-out approach. The search range defined in (13) can avoid the over-fitting problem and accelerate the search. The concepts found by our algorithm (the maxima of DD'(x)) are almost the same as those of the Diverse Density algorithm, but the main performance difference between the two algorithms is determined by the distance threshold selection procedure. We use the automatic method to find the threshold and obtain more accurate results. In contrast to EM-DD, our algorithm utilizes AP clustering and the Hausdorff distance to find the most likely candidate in the feature space rather than randomly choosing some initial points; randomly chosen points may lead to a local maximum of the Diverse Density likelihood. EM-DD also uses the leave-one-out approach to obtain the boundary, which may cause over-fitting. Our algorithm utilizes AP clustering and the Hausdorff distance to find the concept (the point t) approximately, uses Diverse Density to obtain the exact concepts, and then detects the boundary to recognize new instances.
In Table 2, the average accuracy over the categories of each of the three datasets is shown. The results illustrate that our algorithm obtains the best overall performance and the best performance for most categories on the three datasets, while other algorithms outperform ours in some categories, for instance, the face category for the Diverse Density algorithm. Our experiments also show that the NUS-WIDE(OBJECT) and COREL30K image datasets are much noisier than the MSRC dataset, because the average results of the five algorithms on MSRC are better overall than those on NUS-WIDE(OBJECT) and COREL30K. We cannot find the optimal solution for most categories of the COREL30K dataset using the Diverse Density method because there are too many positive instances for computing the Diverse Density likelihood, so the results on COREL30K with the Diverse Density method are not reported (indicated by NR in Table 2).
The run-times of the algorithms compared in our experiments are shown in Table 3.
Table 3 The average run-time of each category on three image datasets using different methods

               mi-SVM (100)  mi-SVM (200)  mi-SVM (300)  RW-SVM   DD       EM-DD (50 %)  Our Method
MSRC (s)       1.05          1.95          3.12          6.37     780.6    257.1         1.5
NUS-WIDE (s)   12.6          22.8          42.6          100.8    9829.8   3686.33       18.6
COREL30K (s)   10.2          23.4          32.5          65.9     NR       135.6         14.7

Bold values show the best performance


Fig. 9 Examples of tag-to-region assignment results. The three rows are from the MSRC, NUS-WIDE(OBJECT) and COREL30K datasets, respectively

The mi-SVM obtains the best run-time performance because many specialized solution algorithms for SVM4 have been designed [23], while the Diverse Density-based algorithms need to compute the Diverse Density maximum using numerical approaches, such as Newton's method [7]. We choose the best initial point (i.e., the most positive instance) by clustering and the Hausdorff distance to replace the multiple initial-point trials in DD and EM-DD, and use the most contributive instance of each bag to compute the Diverse Density instead of using all instances, so the run-time is reduced greatly. Among all algorithms, the mi-SVM algorithm uses the least run-time; however, mi-SVM can hardly converge to a stable solution, so we manually set limits on the number of iterations for terminating the algorithm, namely 100, 200 and 300 in Table 3. The run-time of the EM-DD algorithm is also determined by the number of initial instances. Although we ran experiments with different numbers of initial points for the EM-DD algorithm, we only report the run-time using 50 % of the positive instances, which gives the best performance.
Finally, we show some results of the tag-to-region assignment experiments in Fig. 9. In the images of Fig. 9, the instance-level annotation is shown directly on the segmented images. Some instances (image regions) are not assigned any tag when the ranking scores defined in (14) are all larger than 1.0. In other words, an instance does not belong to any category if all the scores Γ(x) > 1.0.

6 Conclusion and future work


In this paper, a novel multiple instance learning algorithm is developed to speed up the Diverse Density likelihood computation, which is used for automatic tag-to-region assignment. First, we utilize JSEG image segmentation to generate multi-scale regions and choose the good instances in each bag by a random walk process. Then the AP clustering technique is performed on the instances of positive bags and negative bags to identify the best initial point and

4 LIBSVM [4] is used as the SVM implementation in mi-SVM and RW-SVM.


initialize the search for the maximum of the Diverse Density likelihood. To recognize which instances are positive for a given category, we propose an automatic method to determine the boundary of each category. Our experiments on three well-known image sets have provided very positive results. For synonymous tags and co-occurring tags, the performance of our proposed approach degrades; for instance, the words car and automobile cannot be distinguished easily.
In the future, we will extend our work in two directions: (a) testing our proposed algorithm on large-scale image sets with many more categories (object classes); and (b) utilizing the relationships between tags to achieve a more effective solution for multiple instance learning.
Acknowledgements The authors would like to thank Jonathan Fortune for language polishing. This
work is partly supported by the doctorate foundation of Northwestern Polytechnical University
(No: CX201113), Doctoral Program of Higher Education of China (Grant No.20106102110028
and 20116102110027) and National Science Foundation of China (under Grant No.61075014 and
61272285).

Appendix: Parts 2 and 3 of the experiments on COREL30K

Fig. 10 Average accuracy on the 92 categories (parts 2 and 3 of the 121 categories) of the COREL30K dataset using 7 approaches: a mi-SVM; b RW-SVM; c EM-DD; d Our Method; e Our Method without MIG; f Our Method without CI; g Our Method without TS


References
1. Andrews S, Tsochantaridis I, Hofmann T (2002) Support vector machines for multiple-instance learning. Adv Neural Inf Proc Syst 15:561–568
2. Bunescu R, Mooney R (2007) Multiple instance learning for sparse positive bags. In: Proceedings of the 24th International Conference on Machine Learning (ICML), pp 105–112
3. Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
4. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27
5. Chen Y, Bi J, Wang J (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947
6. Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceeding of the ACM international conference on image and video retrieval, p 48
7. Coleman TF, Li Y (1996) An interior trust region approach for nonlinear minimization subject to bounds. SIAM J Optim 6(2):418–445
8. Cusano C, Ciocca G, Schettini R (2004) Image annotation using SVM. In: Society of Photo-Optical Instrumentation Engineers conference (SPIE), vol 5304, pp 330–338
9. Deng Y, Manjunath B, Shin H (1999) Color image segmentation. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), vol 2
10. Dietterich T, Lathrop R, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71
11. Fan J, Shen Y, Zhou N, Gao Y (2010) Harvesting large-scale weakly-tagged image databases from the web. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), pp 802–809
12. Frey B, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972
13. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, pp 119–126
14. Lew M, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
15. Liu D, Hua X, Zhang H (2011) Content-based tag processing for internet social images. Multimed Tools Appl 51:723–738
16. Liu D, Yan S, Rui Y, Zhang H (2010) Unified tag analysis with multi-edge graph. In: Proceedings of the international conference on Multimedia (ACM MM), pp 25–34
17. Li F, Fergus R, Torralba A (2007) Recognizing and learning object categories. CVPR 2007 short course
18. Li J, Wang J (2008) Real-time computerized annotation of pictures. IEEE Trans Pattern Anal Mach Intell 30(6):985–1002
19. Liu S, Yan S, Zhang T, Xu C, Liu J, Lu H (2012) Weakly-supervised graph propagation towards collective image parsing. IEEE Trans Multimedia 14(2):361–373
20. Liu X, Cheng B, Yan S, Tang J, Chua T, Jin H (2009) Label to region by bi-layer sparsity priors. In: Proceedings of the 17th ACM international conference on multimedia, pp 115–124
21. Maron O, Lozano-Pérez T (1998) A framework for multiple-instance learning. In: Advances in neural information processing systems, pp 570–576
22. Maron O, Ratan A (1998) Multiple-instance learning for natural scene classification. In: Proceedings of the fifteenth international conference on machine learning, vol 15, pp 341–349
23. Platt J, et al (1998) Sequential minimal optimization: a fast algorithm for training support vector machines. Technical report MSR-TR-98-14, Microsoft Research
24. Qi G, Hua X, Rui Y, Mei T, Tang J, Zhang H (2007) Concurrent multiple instance learning for image categorization. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
25. Russell B, Freeman W, Efros A, Sivic J, Zisserman A (2006) Using multiple segmentations to discover objects and their extent in image collections. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), pp 1605–1614
26. Shen Y, Fan J (2010) Leveraging loosely-tagged images and inter-object correlations for tag recommendation. In: Proceedings of the international conference on Multimedia (ACM MM), pp 5–14
27. Smeulders A, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380
28. Tang J, Hong R, Yan S, Chua T, Qi G, Jain R (2011) Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images. ACM Trans Intell Syst Technol 2(2):14
29. Vijayanarasimhan S, Grauman K (2008) Keywords to visual categories: multiple-instance learning for weakly supervised object categorization. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
30. Viola P, Platt J, Zhang C (2006) Multiple instance boosting for object detection. Adv Neural Inf Proc Syst 18:1417
31. Wang D, Li J, Zhang B (2006) Multiple-instance learning via random walk. In: Machine learning: ECML 2006, pp 473–484
32. Wang J, Zucker J (2000) Solving the multiple-instance problem: a lazy learning approach. In: Proc. 17th international conf. on machine learning, pp 1119–1125
33. Yang K, Hua X, Wang M, Zhang H (2011) Tag tagging: towards more descriptive keywords of image content. IEEE Trans Multimedia 13(4):662–673
34. Zha Z, Hua X, Mei T, Wang J, Qi G, Wang Z (2008) Joint multi-label multi-instance learning for image classification. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
35. Zhang M, Zhou Z (2009) Multi-instance clustering with applications to multi-instance prediction. Appl Intell 31(1):47–68
36. Zhang Q, Goldman S (2001) EM-DD: an improved multiple-instance learning technique. Adv Neural Inf Proc Syst 14:1073–1080

Zhaoqiang Xia is a PhD student at Northwestern Polytechnical University. His research interests
include multimedia retrieval, statistical machine learning and computer vision.


Yi Shen is a PhD student at University of North Carolina at Charlotte. His research interests include
multi-label learning and multiple-instance learning.

Xiaoyi Feng is a professor at Northwestern Polytechnical University. Her research interests include computer vision, image processing, radar imagery and recognition.

Jinye Peng is a professor at Northwestern Polytechnical University. His research interests include
computer vision, pattern recognition and signal processing.


Jianping Fan is a professor at University of North Carolina at Charlotte. His research interests
include semantic image and video analysis, computer vision, cross-media analysis, and statistical
machine learning.
