DOI 10.1007/s11042-013-1707-2
Abstract Translating image tags at the image level to regions (i.e., tag-to-region
assignment), which could play an important role in leveraging loosely-labeled training images for object classifier training, has become a popular research topic in the
multimedia research community. In this paper, a novel two-stage multiple instance
learning algorithm is presented for automatic tag-to-region assignment. The regions
are generated by performing multiple-scale image segmentation and the instances
with unique semantics are selected from those regions by a random walk process. The affinity propagation (AP) clustering technique and the Hausdorff distance are then applied to the instances to identify the most positive instance, which is used to initialize the maximum search of the Diverse Density likelihood in the first stage.
In the second stage, the most contributive instance, which is chosen from each bag,
is treated as the key instance for simplifying the computing procedure of Diverse
Density likelihood. Finally, an automatic method is proposed to discriminate the boundary between positive and negative instances. Our experiments on three well-known image sets provide positive results.
1 Introduction
With the exponential growth of digital images, there is an urgent need to achieve
automatic image annotation for supporting keyword-based (concept-based) image
retrieval [14, 27]. For the task of automatic image annotation, machine learning
techniques are usually involved to learn the classifiers from large amounts of labeled
training images. The ground-truth labels are usually provided by professionals.
Because it is time consuming and labor intensive to hire professionals for labeling
large amounts of training images, the sizes of such professionally-labeled image sets
tend to be small. As a result, classifiers learned from a small set of professionally-labeled training images may generalize poorly. To achieve more reliable classifier training, a large set of labeled training images is needed because [26]: (1) the number of object classes could be very large; and (2) the learning
complexity for some object classes could be very high because they may have large
intra-class visual diversity and inter-class visual similarity (i.e., visual ambiguity).
On the other hand, it is much easier for us to obtain large-scale loosely-labeled images (object labels are loosely given at the image level rather than at the region level
or at the object level, shown in the left of Fig. 1) [11]. Such loosely-labeled images
may have multiple advantages: (1) they can represent various visual properties of
object classes more sufficiently; (2) they can be obtained with less effort by providing
the object-level labels loosely at the image level rather than at the object level or at
the region level; and (3) both their labels and their visual properties are diverse, thus
they can give a real-world point of departure for object detection and scene recognition. Therefore, one potential solution to the critical shortage of object-based labeled training
images for object classifier training by multiple instance learning [24, 28, 34], where
each loosely-labeled image is considered as a bag and each region generated from the
image is treated as an instance.
It is not a trivial task to leverage the loosely-labeled images for object classifier
training because they may seriously suffer from the critical issue of correspondence
Fig. 1 Illustration of multiple instance learning for tag-to-region assignment. The loosely-labeled
images are shown in the left
uncertainty: each loosely-labeled image contains multiple image regions and multiple object labels given at the image level, so the correspondences between the image regions and the available labels are uncertain [15]. To leverage loosely-labeled images for object classifier training, it is very attractive to develop new algorithms for: (a) supporting an ambiguous image representation that can transform each loosely-labeled image into a bag of instances and express its semantic ambiguity (i.e., multiple labels available for one single image) explicitly in the instance space; (b) identifying the instance labels automatically when the labels are provided only at the image level (i.e., loose labels); and (c) identifying the true positive instances quickly for object classifier training.
As illustrated in Fig. 1, by assigning multiple labels (which are given at the image
level) into the most relevant image regions automatically, our multiple instance
learning algorithm can provide a good solution for leveraging large-scale loosely-labeled images for object classifier training. In contrast to traditional multiple instance learning algorithms, which usually label entire bags rather than individual instances, this paper presents a fast two-stage multiple instance learning algorithm that identifies the true positive instances and applies them to tag-to-region assignment.
Three characteristics of our proposed approach are as follows:
The rest of this paper is organized as follows. Section 2 briefly reviews related work, and Section 3 presents the Diverse Density framework our work relies on. Section 4 introduces our new algorithm in detail, and Section 5 reports our experimental results for algorithm evaluation. Section 6 concludes the paper and discusses future work.
2 Related work
Over the last decades, many multiple instance learning algorithms have been proposed and applied in many fields since the term Multiple Instance Learning was coined by Dietterich et al. in the drug activity prediction domain [10]. An intuitive solution for the multiple instance learning problem is to find the true positive instances in the positive bags, and much research has addressed this. The axis-parallel rectangle learning algorithm was proposed by Dietterich et al. [10] to find, directly at the instance level, which instances in the positive bags are true positives. The Diverse
Density approach has been proposed by Maron and Lozano-Pérez [21] and applied
to scene classification [22], which integrates all the instances in a probabilistic model.
The RW-SVM (Random Walk SVM) algorithm [31] uses a random walk process to find the true positive instances and an SVM to train image classifiers that annotate entire images of three categories. A multi-task SVM algorithm, called MTSMLMIL [26], utilizes graph clustering to find the true positive instances, and the multi-task SVM is then used to recommend tags for image annotation. Many
other optimization algorithms, such as mi-SVM [1] and sparse-transductive SVM [2],
have been proposed to identify the true positive instances through an iterative
procedure. Another direction to solve the multiple instance learning problem is to
measure the image at the instance level, such as Citation-kNN [32] and BAMIC [35]
(a multiple instance clustering algorithm). These approaches just label the entire bag
while the preceding approaches finding the true positive instances can label both
instances and bags. Chen et al. have developed an approach called MILES (MultipleInstance Learning via Embedded instance Selection) to enable region-based image
annotation when labels are available only at the image level [5]. That approach maps
bags into a feature space defined by the instances and provides features for the
1-norm SVM through the mapping. Vijayanarasimhan et al. have developed a
multiple-label multiple-instance learning approach to achieve more effective learning from loosely-labeled images [29], which uses a sparse SVM to iteratively improve
positive bags. Viola et al. [30] have adapted traditional boosting methods to better suit the multiple instance learning problem and used them to learn object detectors from loosely-labeled images.
Some approaches utilize expert-labeled training images to learn models and annotate images through these models. A multi-class SVM algorithm has been utilized
to annotate different image regions by Cusano et al. [8]. Through statistical modeling
and optimization techniques, Li and Wang [18] have developed an algorithm to
train the classifiers for hundreds of semantic concepts. A probabilistic model has
been proposed to estimate the mixture density for each image and minimize the
annotation error by Carneiro et al. [3]. Jeon et al. utilize a cross-media relevance
model to annotate images automatically [13]. Other approaches have been proposed
to utilize the user-supported image-level tags to annotate new images. A bi-layer
sparse coding algorithm based on over-segmented image regions has been used to
annotate images [20]. Liu et al. [16] have proposed a multi-edge graph model to label
the regions. Yang et al. have utilized the Diverse Density framework to enrich image
tags [33]. A weakly supervised graph propagation method has been developed to
assign annotated labels at the image level to the semantic regions [19].
In this paper, we focus on the Diverse Density algorithm for multiple instance learning, which has been used in many applications [5, 33]. The advantage of the Diverse Density framework is that it uses the statistical information of all bags in a probabilistic way, accumulating instance evidence to the bag level to exploit the provided label information. The instances are combined in the Noisy-OR probabilistic model to obtain the likelihood over all bags; however, the optimization problem becomes hard to solve when too many instances appear in the Noisy-OR model. We therefore propose a novel two-stage approach based on the Diverse Density algorithm to accelerate the computation of the Diverse Density likelihood. Before introducing our approach, we briefly revisit the Diverse Density algorithm in the next section.
3 Diverse density
Maron et al. proposed the Diverse Density framework to solve the problem of drug activity prediction and then applied it to scene classification [21, 22]. The
general framework uses the likelihood of instances being positive (i.e., Diverse
Density) to measure the intersection of positive bags minus the union of negative
bags. Diverse Density at a point is defined to measure how many different positive
bags have instances near that point and how far the negative instances are away from
that point [21]. The target of the Diverse Density likelihood DD(x) is to find an
appropriate point (denoted as t) in the feature space which has the most true positive
instances around that point and most true negative instances away from that point.
This appropriate point t in the Diverse Density framework is also called the concept,
where a bag is labeled positive even if only one of the instances in it falls within
the concept and a bag is labeled negative only if all the instances in it are negative.
The concept can be discovered through maximizing the likelihood of positive bags
and negative bags in the feature space. The appropriate point (the desired concept)
corresponds to the maximum of Diverse Density likelihood in the feature space.
In the Diverse Density framework, the set of loosely-labeled images (their labels are given at the image level or bag level) is defined as D, which consists of a set of bags B = {B_1, ..., B_m} and corresponding labels L = {l_1, ..., l_m}. Let bag B_i = {B_{i1}, ..., B_{ij}, ..., B_{in}}, where B_{ij} is the j-th instance, and label l_i = {l_{i1}, ..., l_{ij}, ..., l_{ip}}, where l_{ij} corresponds to the label of the j-th instance in bag B_i. The positive bags are denoted as B_i^+ and the j-th instance in B_i^+ as B_{ij}^+. Likewise, B_{ij}^- represents a negative instance in the negative bag B_i^-. The Diverse Density over all points x in the feature space is defined as

DD(x) = \prod_i P(x = t \mid B_i^+) \prod_i P(x = t \mid B_i^-)    (1)

The Diverse Density algorithm uses the Noisy-OR model to compute the probability of a bag being positive near the potential point t and the probability of negative bags being far from the point t:

P(x = t \mid B_i^+) = 1 - \prod_j \big(1 - P(x = t \mid B_{ij}^+)\big)

P(x = t \mid B_i^-) = \prod_j \big(1 - P(x = t \mid B_{ij}^-)\big)    (2)
Those probabilities are then modeled by the distance between the potential target concept t and the instances:

P(x = t \mid B_{ij}) = \exp\big(-\|B_{ij} - x\|^2\big)    (3)
If the instances in a positive bag are near the potential target x, the probability P(x = t \mid B_i^+) will be high. Likewise, the probability P(x = t \mid B_i^-) is high only if the instances in a negative bag are far away from the candidate point x. To reduce the complexity of the Diverse Density algorithm, the negative logarithm of DD(x) can be minimized instead of maximizing DD(x) directly.
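As a concrete illustration, the negative log-likelihood implied by (1)–(3) can be sketched as follows. This is a minimal numpy sketch under our own conventions: each bag is given as a matrix of instance feature vectors, and a small epsilon guards the logarithm.

```python
import numpy as np

def neg_log_dd(x, pos_bags, neg_bags):
    """Negative log Diverse Density at a candidate point x.

    pos_bags / neg_bags: lists of (n_instances, n_features) arrays.
    Instance probabilities follow Eq. (3): P = exp(-||B_ij - x||^2);
    bag probabilities follow the Noisy-OR model of Eqs. (1)-(2).
    """
    nll = 0.0
    for bag in pos_bags:
        p_inst = np.exp(-np.sum((bag - x) ** 2, axis=1))
        p_bag = 1.0 - np.prod(1.0 - p_inst)          # Noisy-OR: bag is positive
        nll -= np.log(p_bag + 1e-12)
    for bag in neg_bags:
        p_inst = np.exp(-np.sum((bag - x) ** 2, axis=1))
        p_bag = np.prod(1.0 - p_inst)                # all instances must be far away
        nll -= np.log(p_bag + 1e-12)
    return nll
```

A gradient-based optimizer would minimize this quantity over x; points close to positive instances and far from negative ones receive a lower negative log-likelihood.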
From (1), one can observe that all instances must be integrated into the likelihood DD(x); the resulting formula is computationally expensive and has no analytic solution. Furthermore, the complexity of the generative model grows nonlinearly with the number of instances and bags. The gradient ascent algorithm can be utilized to find the maximum of DD(x) even though DD(x) is not convex. To avoid local maxima, multiple initial points need to be adopted to find the global maximum of DD(x), so all the positive instances are used as initial points, one of which is likely to be close to the maximum point t. Although this is beneficial for finding the global solution for DD(x), computing from multiple starting points is inefficient, especially when the number of positive instances becomes very large.
Based on this observation, our proposed algorithm for multiple instance learning
will identify an instance from all these positive instances as the single initial point
rather than using all the positive instances as the initial points. On the other hand, the
DD(x) is actually affected mostly by the instances which are nearest to the concept
t in each bag. Thus our proposed algorithm for multiple instance learning only uses
one instance instead of all instances to compute DD(x) in each bag. Through these
two steps, our proposed algorithm for multiple instance learning can reduce the
computational cost significantly and guarantee faster convergence.
Fig. 2 Illustration of our proposed algorithm for the problem of tag-to-region assignment with
a multiple instance learning procedure, which includes three key components: a using the JSEG
segmentation to generate multiple-scale instances and select instances with unique semantics;
b utilizing the AP clustering to choose the best candidate for Diverse Density as the single initial
point; c speeding up the procedure of computing Diverse Density likelihood maximum by identifying
the most contributive instance from each bag
multiple instance learning framework to discover a concept and its boundary for each tag (e.g., cow), as explained in Sections 4.1 to 4.5. Finally, an appropriate tag selected from all tags is assigned to each instance (i.e., image region) by ranking the relative distances between the instance and the concepts of all tags, which is introduced in Section 4.5.
Our proposed algorithm with multiple instance learning consists of three key
components as shown in Fig. 2: (a) we utilize the image segmentation technique to
generate multiple-scale regions and extract those instances with unique semantics
(Section 4.1); (b) we utilize the AP clustering algorithm to find the best candidate
in the semantic-unique instances for computing the maximum of Diverse Density
likelihood (Sections 4.2 and 4.3); (c) we speed up the procedure of computing the
Diverse Density likelihood maximum by identifying the most contributive instance
from each bag (Section 4.4).
4.1 Multiple-scale instance generation
In this section, we generate multiple-scale instances in each bag and pick out those with unique semantics from among them. These regions with unique semantics will also be referred to as good instances. In some existing multiple instance learning algorithms, the instances are generated by randomly cropping regions from images [1, 22]. Such a random selection procedure can produce too many instances in one bag, and each instance may be only partly positive because the randomly sampled boxes may not contain the objects of interest completely. Instances produced by random selection thus may not correspond to the bag-level tags and give rise to non-uniqueness of semantics.
Another way to tackle the non-uniqueness problem of semantics is to utilize automatic image segmentation. However, over-segmentation or under-segmentation can easily result from different parameter settings of a segmentation algorithm. To resolve this dilemma, we utilize a set of parameters to generate multiple segments (i.e., instances) with different sizes and shapes. We call this procedure Multiple-Scale Instance Generation; with it, at least one of the segments can correspond to the bag-level tag and satisfy the condition of semantic uniqueness. A random walk procedure is then used to find the instances with unique semantics in each bag.
We make use of the J-images segmentation (JSEG) [9] algorithm to partition an image into a set of regions (i.e., instances), which are determined by an adjustable parameter pair (q, m), where q denotes the color quantization threshold and m the spatial segmentation threshold in the JSEG algorithm.1 Compared to other segmentation methods, JSEG can generate multiple instances relatively quickly across a range of parameters. Figure 3 shows examples of JSEG segmentation. All these image regions are treated as candidate instances with unique semantics, but some candidates generated by over-segmentation are only fragments of semantically unique instances, so such fragments cannot be used to compute the Diverse Density likelihood. Usually, the instances with unique semantics are similar to those fragments, because the fragments are parts of
1 http://vision.ece.ucsb.edu/segmentation/jseg/
Fig. 3 Illustration of generating multiple-scale regions (four scales with different parameter settings) and choosing good instances with unique semantics
instances [25]. Based on this observation, a random walk process is utilized to seek the regions that are similar to other regions; these regions are then selected as the good instances in a bag and usually have unique semantics.
Assume n nodes (i.e., candidate instances or regions) exist in the random walk process, where each node corresponds to one candidate instance in a bag. The random walk process is then formulated as

\rho_{k+1}(i) = \alpha \sum_{j \in \Omega_i} \rho_k(j)\, p(i, j) + (1 - \alpha)\, \rho_o(i)    (4)

where \Omega_i is the set of neighboring instances connected with the i-th instance, \rho_o(i) is the initial relevance score for the i-th instance, p(i, j) is the transition probability from instance j to instance i, and \alpha \in [0, 1] linearly weights the two terms. The relevance score for the i-th instance at the k-th iteration is denoted \rho_k(i). The first term in (4) represents the similarity between the i-th instance and the other instances.
Because the multiple-scale segmentation method may generate the same instance (over 90 % area overlap) at different scales, the initial relevance score is defined as

\rho_o(i) = \frac{c(i)}{\sum_{i=1}^{n} c(i)}    (5)

where c(i) is the number of times the i-th instance appears in the multiple-scale segments. In this context, we define the transition probability using the similarity of two instances, that is

p(i, j) = \frac{s_{ij}}{\sum_k s_{ik}}    (6)

where s_{ij} is computed from the distance between instances, as explained in (15).
According to (4), the random walk process selects the instances having higher similarities with the others (through the aggregation of similarities in the first term of (4)), so the instances most strongly related to the rest are retained. For each image, we choose the top n/S candidate instances, where S denotes the number of segmentation scales, as the final good instances. These good instances are usually semantically unique and can further be used for instance clustering and for computing the maximum of the Diverse Density likelihood.
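The random walk of (4)–(6) can be sketched as follows. This is a minimal numpy sketch: the similarity matrix, the weight alpha, and the fixed iteration count are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def random_walk_scores(S, counts, alpha=0.85, n_iter=100):
    """Relevance scores for candidate instances via the random walk of Eq. (4).

    S:      (n, n) pairwise similarity matrix between candidate instances.
    counts: occurrence count of each instance across the multi-scale segments.
    """
    # Eq. (6): row-normalized transition probabilities p(i, j)
    P = S / S.sum(axis=1, keepdims=True)
    # Eq. (5): initial relevance from multi-scale occurrence counts
    rho_o = counts / counts.sum()
    rho = rho_o.copy()
    for _ in range(n_iter):
        rho = alpha * P @ rho + (1 - alpha) * rho_o   # Eq. (4)
    return rho
```

Instances that recur across scales and are similar to many other candidates accumulate higher relevance scores; the top-scoring instances per image would then be kept as the good instances.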
4.2 Instance clustering
In this section and the next, we discuss a method to identify the single starting point for computing the maximum of the Diverse Density likelihood. Even after the good instances have been detected by the JSEG segmentation and random walk algorithm, there are still too many instances to use as initial points for the Diverse Density likelihood computation. The goal is to find the most positive instance as the single starting point. According to the definition of Diverse Density, the maximum of the Diverse Density likelihood is likely to lie in a dense area of positive instances that is far from negative instances. In other words, the optimal solution may occur in the vicinity of one of the groups that gathers similar positive instances together; the group we want to find should contain no negative instances. Based on this observation, the instances in positive bags and negative bags are first grouped into multiple clusters according to their visual similarity. Then a cluster of positive instances that is furthest away from all clusters of negative instances is identified by the distances between the positive clusters and the negative clusters. This cluster's center is selected as the initial point for computing Diverse Density.
In this paper, we adopt the AP [12] clustering approach to group the instances in positive bags and negative bags into multiple clusters. Classical K-means or K-medoids clustering needs to randomly choose k initial cluster centers at the beginning, so it is not well suited to our instance clustering problem, where the number of clusters is unknown. As an extension of K-medoids clustering, AP clustering treats all instances simultaneously as potential exemplars (i.e., cluster centers), with real-valued preferences representing the probability of each instance being an exemplar. As a result, AP clustering does not need initial cluster centers and can automatically detect the exemplars and group the instances.
The AP clustering algorithm propagates two kinds of messages (responsibility and availability) between instances and uses these accumulated messages to determine exemplars and group instances. The responsibility r(i, k), sent from instance i to the candidate exemplar k, reflects how well-suited k is to serve as the exemplar for i compared to all other possible exemplars. The availability a(i, k), sent from the candidate exemplar k to instance i, reflects how appropriate it would be for instance i to select k as its exemplar, taking support from other instances into account [12].
from other instances into account [12]. These messages are updated using the rules
r(i, k) s(i, k) max
a(i,
k
)
+
s(i,
k
)
k s.t.k =k
i s.t.i {i,k}
/
max 0, r(i , k)
(7)
where s(i, k) represents the similarity between ith instance and kth instance. The selfavailability is updated as follows
a(k, k)
max{0, r(i , k)}
(8)
i s.t.i {i,k}
/
According to the rules above, the responsibilities are first updated with the availabilities a(i, k) initialized to zero, and the availabilities are then updated given the responsibilities. The algorithm is deemed to have converged when the message updates no longer change, at which point the entities with maximal a(k, k) among all a(i, k) are selected as exemplars automatically. After the cluster centers are detected, the remaining instances are assigned to their nearest cluster centers. Through AP clustering, the instances from positive bags and negative bags are grouped separately, and similar instances are grouped into the same clusters. The selection of an optimal initial point based on these clusters is introduced in the next section.
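The message-passing updates of (7)–(8) can be sketched as follows. This is a minimal numpy implementation of standard affinity propagation with damping; the damping factor and iteration count are our own assumptions rather than settings from the paper.

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Minimal sketch of AP message passing (Eqs. (7)-(8)).

    S: (n, n) similarity matrix; the diagonal S[k, k] holds the preferences.
    Returns the indices of the detected exemplars.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(n_iter):
        # r(i, k) <- s(i, k) - max_{k' != k} { a(i, k') + s(i, k') }
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new
        # a(i, k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())        # keep r(k, k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)               # a(k, k): Eq. (8), not clipped
        A = damping * A + (1 - damping) * A_new
    # exemplars: points whose combined messages are positive
    return np.where(A.diagonal() + R.diagonal() > 0)[0]
```

As in [12], the number of clusters emerges from the preferences on the diagonal of S rather than being fixed in advance.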
4.3 Candidate identification
As discussed above, we can utilize AP clustering to group instances into two kinds of clusters: positive clusters and negative clusters. The clusters grouped from positive instances are called positive clusters, denoted as \Phi; the clusters grouped from negative instances are called negative clusters, denoted as \Psi. We then need to find the cluster in the positive clusters \Phi that is furthest away from all the negative clusters \Psi. The maximum of the Diverse Density likelihood defined in (1) is likely to occur in the area adjacent to such a cluster, so this cluster's center is referred to as the most positive instance and can be used as the best initial point for the Diverse Density computation. Such a cluster can be identified by the distances between the positive clusters \Phi and the negative clusters \Psi.
In this paper, the Hausdorff distance is used to measure the distance between two clusters (i.e., two instance sets); it has the advantage of not being affected by a moderate number of outliers. For two instance sets C_m and C_n, the Hausdorff distance is the smallest d such that every element of C_m is within distance d of at least one element of C_n, and every element of C_n is within distance d of at least one element of C_m. It is defined as:

H(C_m, C_n) = \max\big\{\, h(C_m, C_n),\; h(C_n, C_m) \,\big\}

h(C_m, C_n) = \max_{x_m \in C_m} \min_{x_n \in C_n} d(x_m, x_n)    (9)

where x_m and x_n are elements of the instance sets C_m and C_n, respectively. The distance d(x_m, x_n) between two instances can be any distance metric appropriate to the extracted features; the distance we adopt is introduced in detail in Section 5.2.
Based on this distance between two sets, we use a score \gamma to represent the distance between each cluster in the positive clusters \Phi and all the negative clusters \Psi. It is defined as follows:

\gamma(C_m) = \min_{C_n \in \Psi} H(C_m, C_n), \quad C_m \in \Phi    (10)

Through the equation above we can pick out the furthest cluster pair and identify the cluster C_m^* having the maximum \gamma.
Since the maximum of the Diverse Density likelihood will occur in the vicinity of C_m^*, (1) converges to the same solution regardless of which instance in C_m^* is used. We therefore take the center t_m of cluster C_m^* as the initial point for computing the maximum of the Diverse Density likelihood, and call this best initial point t_m the most positive instance among all instances.
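The candidate identification of (9)–(10) can be sketched as follows. This is a minimal numpy sketch; Euclidean distance is used here as a stand-in for the instance distance of (15), and representing each cluster center by its mean is an illustrative simplification.

```python
import numpy as np

def hausdorff(Cm, Cn):
    """Hausdorff distance between two instance sets, Eq. (9)."""
    # pairwise distances d(x_m, x_n); Euclidean stands in for Eq. (15)
    D = np.linalg.norm(Cm[:, None, :] - Cn[None, :, :], axis=2)
    h_mn = D.min(axis=1).max()   # h(Cm, Cn)
    h_nm = D.min(axis=0).max()   # h(Cn, Cm)
    return max(h_mn, h_nm)

def most_positive_cluster(pos_clusters, neg_clusters):
    """Pick the positive cluster furthest from all negative clusters, Eq. (10)."""
    scores = [min(hausdorff(Cm, Cn) for Cn in neg_clusters)   # gamma(Cm)
              for Cm in pos_clusters]
    best = int(np.argmax(scores))
    return best, pos_clusters[best].mean(axis=0)   # index and cluster center
```

The returned center plays the role of t_m, the single starting point for the Diverse Density maximization.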
4.4 Speeding up diverse density
Through the selection of the best initial point, we can acquire a global solution for the maximum of Diverse Density without trying every positive instance, which eliminates some unnecessary computation. Even with this step, we still need to include all instances in the positive bags when computing the Diverse Density likelihood according to (1). In this section, we explain how to reduce the cost of the Diverse Density computation.
From (1), the likelihood between each bag and the point t is the product of the probabilities between the instances in that bag and the point t. The maximum of this product is mainly influenced by the instance nearest to the point t: since the probability of each instance being positive always lies in the range [0, 1], the product cannot exceed its smallest factor. From the viewpoint of multiple instance learning, there is always one instance in each bag that contributes most to the bag-level label. Based on these observations, the most contributive instance is identified in each bag to represent that bag and reduce the complexity of the Diverse Density computation, which is similar to the Maximization step in [36].
We choose an instance from each bag as

j^* = \arg\max_j P(x = t \mid B_{ij})    (11)

After using the most contributive instance B_{ij^*} to represent a bag, we denote the conditional probability between the candidate point x and the bag B_i as P(x = t \mid B_i) = P(x = t \mid B_{ij^*}), where j^* is determined by the equation above. The distance used in P(x = t \mid B_{ij}) in our algorithm differs slightly from that of Diverse Density: we define P(x = t \mid B_{ij}) = \exp(-d(B_{ij}, x)), where the distance d(B_{ij}, x) can be any measure between two vectors (instances); in this paper we use the distance defined by (15). The Diverse Density likelihood computation in our algorithm can thus be simplified as follows:

\arg\max_x DD'(x) = \arg\max_x \prod_i P(x = t \mid B_i) = \arg\min_x -\sum_i \log P(x = t \mid B_{ij^*})    (12)
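The simplified objective of (11)–(12) can be sketched as follows. This is a minimal sketch: Euclidean distance stands in for (15), and representing each bag as a matrix of instance vectors is our own convention.

```python
import numpy as np

def simplified_neg_log_dd(x, bags, dist=None):
    """Simplified Diverse Density objective of Eq. (12).

    Each bag is represented only by its most contributive instance,
    i.e., the instance nearest to the candidate point x (Eq. (11)).
    """
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)   # stand-in for Eq. (15)
    total = 0.0
    for bag in bags:
        d_min = min(dist(inst, x) for inst in bag)  # most contributive instance
        # P(x = t | B_ij) = exp(-d)  =>  -log P = d
        total += d_min
    return total
```

Because each bag contributes a single distance term instead of a product over all its instances, the cost per evaluation drops from the total instance count to the bag count.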
Even with the reduced computation steps, it is still difficult to derive an analytical solution for (12). For this optimization problem, we use a numerical search for the optimal solution, where the optimal solution represents the concept of each tag. The method of determining the boundary for each tag is explained in the next section.
Tmax = min T,
(13)
After the search range [T_min, T_max] has been determined, we adopt K-fold (K = 10) cross validation to find the best distance threshold T \in [T_min, T_max]. First, the training set is partitioned into K subsets, and each of them in turn serves as the validation set. In each evaluation, we partition the search range [T_min, T_max] into L intervals and use the centers of the L intervals as candidate thresholds to validate the accuracy. The threshold yielding the best performance is set as the final distance threshold. The procedure of finding the distance threshold is shown in Fig. 4, which shows results for three categories of the MSRC and NUS-WIDE(OBJECT) datasets.
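The K-fold threshold search described above can be sketched as follows. This is a minimal sketch under our own assumptions: folds are formed by random shuffling, candidates are scored by validation accuracy, and an instance is predicted positive when its distance to the concept is below T.

```python
import numpy as np

def best_threshold(distances, labels, t_min, t_max, K=10, L=20, seed=0):
    """K-fold search for the distance threshold T in [t_min, t_max].

    distances: distance of each training instance to the tag's concept.
    labels:    1 if the instance truly carries the tag, else 0.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(distances))
    folds = np.array_split(order, K)
    # candidate thresholds: centers of L equal intervals in [t_min, t_max]
    step = (t_max - t_min) / L
    candidates = t_min + step * (np.arange(L) + 0.5)
    scores = np.zeros(L)
    for val_idx in folds:                       # each fold serves as validation once
        d_val, y_val = distances[val_idx], labels[val_idx]
        for c, T in enumerate(candidates):
            scores[c] += np.mean((d_val < T) == y_val)   # validation accuracy
    return candidates[int(np.argmax(scores))]
```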
The concept and boundary can be used to assign tags to regions; however, the boundaries of some categories may overlap in the feature space, so one instance could be labeled with many tags. To determine a unique category for each instance, we use a ranking of relative distances. For the concept C_i of the i-th category, the ranking scores of an instance x are defined as

\sigma(x) = \operatorname{sort}_i \frac{d(x, C_i)}{T_i}    (14)

where the distance d(x, C_i) is computed according to (15) and T_i is the best distance threshold of the i-th category obtained by cross validation. The top-ranked category (the one with the smallest relative distance) is assigned as the tag of the instance.
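The relative-distance ranking of (14) can be sketched as follows. This is a minimal sketch; the distance function is passed in, and assigning the category with the smallest relative distance is our reading of the ranking rule.

```python
import numpy as np

def assign_tag(x, concepts, thresholds, dist):
    """Assign the single best tag by relative distance, Eq. (14)."""
    ratios = np.array([dist(x, c) / t for c, t in zip(concepts, thresholds)])
    return int(np.argmin(ratios))   # top-ranked = smallest relative distance
```

Dividing by T_i normalizes each distance by its category's boundary, so a category with a wide boundary can win even when another concept is closer in raw distance.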
5 Experimental evaluation
In this section, we describe the details of our experiments, including the image sets
we used, the visual feature extraction of instances, the baseline methods and results
of our experiments.
5.1 Image sets
To evaluate our algorithm precisely, different image sets are used to verify its effectiveness. MSRC2 was collected from search engines and includes 591 images and 23 object categories; the horse category is disregarded due to the lack of examples. On average, each image in MSRC has about 3.95 tags. NUS-WIDE [6] was collected from the social website Flickr and originally contains 269,648 images and their associated tags. In our experiments, we only use the categories that can correspond to image regions, i.e., object categories. We therefore pick out 25 object categories and 10,157 images (marked as NUS-WIDE(OBJECT)) to evaluate our algorithm, where each image has 2.01 tags on average. COREL30K [3] is based
2 http://research.microsoft.com/en-us/projects/objectclassrecognition/,
on the Corel image dataset, containing 31,695 images and 1,035 tags. We process the COREL30K dataset in the same manner as NUS-WIDE, selecting 27,194 images and 121 object categories3 that have enough images for training and testing. Each image in COREL30K has 2.13 tags on average.
5.2 Visual feature
To extract effective features from the visual content of instances (whose generation was introduced in Section 4), we use a well-known feature descriptor, the Bag-of-Words model with the SIFT descriptor, to capture the key-point information of instances [17]. For each instance, interest points are detected with a difference-of-Gaussian function and each is represented by a 128-dimensional descriptor. K-means clustering is then used to construct a code book of SIFT points. One critical issue for code book construction is determining the size of the code book, which we do by grid search. In our algorithm, we choose a codebook size of 500 and represent each instance with a 500-dimensional vector.
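As a rough illustration of this pipeline, the sketch below quantizes SIFT-like descriptors against a k-means codebook and builds a normalized per-instance histogram. The function names, the toy Lloyd's iteration, and the small codebook used in the example are our own illustrative choices (the paper uses a codebook of size 500 and standard K-means):

```python
import numpy as np

def build_codebook(descriptors, k=500, iters=10, seed=0):
    """Toy Lloyd's k-means over SIFT-like descriptors (one row each)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, codebook):
    """Quantize an instance's descriptors into visual words and
    return its L1-normalized Bag-of-Words histogram."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

With a 500-word codebook, `bow_histogram` yields the 500-dimensional instance vector described above.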
The distance between two instances xm and xn can be measured in many ways, such as the Euclidean distance. However, the L1 norm (i.e., Manhattan distance) has been shown to be more robust to outliers than the L2 norm (i.e., Euclidean distance). In this paper, we utilize a normalized L1 norm to measure the distance between two instances. The distance function we adopt is defined as follows:

$$d(x_m, x_n) = \sum_i \frac{|x_m(i) - x_n(i)|}{1 + x_m(i) + x_n(i)} \qquad (15)$$
This normalized distance measurement reduces the impact of differently scaled dimensions. It is used to compute the similarity matrix for clustering and to determine the distance boundary.
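A direct implementation of (15) is straightforward; the following sketch (the function name is our own) computes the normalized L1 distance between two feature vectors:

```python
import numpy as np

def normalized_l1(xm, xn):
    """Normalized L1 distance of (15): each dimension's absolute
    difference is damped by 1 + xm(i) + xn(i), so large-valued
    dimensions cannot dominate the sum."""
    xm, xn = np.asarray(xm, float), np.asarray(xn, float)
    return float(np.sum(np.abs(xm - xn) / (1.0 + xm + xn)))
```

Note that for non-negative inputs such as Bag-of-Words histograms, each denominator is at least 1, so the distance is always well defined.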
5.3 Experiments
To evaluate the performance of choosing the best initial point for computing Diverse
Density likelihood, the traditional precision and recall are used to demonstrate the
effectiveness of our clustering method. As MSRC image set provides the pixel-wise
ground-truth images, we utilize these ground-truth images to test the performance of
our AP clustering. As shown in Fig. 5, precision in most categories is high, because
most instances grouped into the same cluster are in the same category. Recall is not
very high in many categories because images in the same category are very diverse in
the visual content and not all the positive instances could be grouped into a cluster as
the members of one cluster are not infinite. When the positive instances of one category become more and more, the recall would become lower observed from Fig. 5.
As the comparison of our distance measurement, we use the normalized Euclidean
distance (abbreviated to ED in Fig. 5) to observe the influence of distance for clustering. We can find that the distance defined in (15) can improve the precision with the
cost of reducing the recall rate compared to normalized Euclidean distance observed
from Fig. 5.
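As an illustration of how these per-cluster scores can be computed against pixel-wise ground truth (the helper below and its argument names are our own, not the paper's code): precision is the fraction of a cluster's members whose ground-truth label matches the category, and recall is the fraction of all the category's instances that the cluster captures.

```python
def cluster_precision_recall(cluster_labels, category, n_category_total):
    """Precision/recall of one cluster against ground-truth labels.

    cluster_labels   -- ground-truth label of each instance in the cluster
    category         -- the category the cluster is evaluated against
    n_category_total -- total number of instances of that category
    """
    hits = sum(1 for lbl in cluster_labels if lbl == category)
    precision = hits / len(cluster_labels) if cluster_labels else 0.0
    recall = hits / n_category_total if n_category_total else 0.0
    return precision, recall
```

For example, a cluster of three instances containing two "cow" regions out of four "cow" regions overall has precision 2/3 and recall 1/2, matching the pattern in Fig. 5 of high precision with lower recall.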
Fig. 5 Precision and recall of AP clustering with the ground-truth of instances on the MSRC dataset
Average accuracy (%) / run-time (s) per dataset:

              Without MIG   Without CI    Without TS   Overall
MSRC          66.1/1.69     70.7/25.51    63.4/1.43    70.9/1.5
NUS-WIDE      56.9/19.38    58.0/319.44   56.1/17.48   58.1/18.1
COREL30K      53.2/14.74    54.2/247.8    52.9/14.5    54.2/14.7
Fig. 6 Average accuracy on the MSRC dataset using 8 approaches: a mi-SVM; b RW-SVM;
c Diverse Density(DD); d EM-DD; e Our Method; f Our Method without MIG; g Our Method
without CI; h Our Method without TS
Fig. 8 Average accuracy on the 29 categories (first part of 121 categories, simultaneously occurring in MSRC or NUS-WIDE) of the COREL30K dataset using 7 approaches: a mi-SVM; b RW-SVM; c EM-DD; d Our Method; e Our Method without MIG; f Our Method without CI; g Our Method without TS
Table 2 Average accuracy (%) of each category on the three image datasets

              mi-SVM   RW-SVM   DD     EM-DD   Our Method
MSRC          56.6     57.8     65.8   55.2    70.9
NUS-WIDE      53.2     54.1     55.6   52.8    58.1
COREL30K      51.8     52.2     NR     52.5    54.2
candidate exemplar of the clusters as the initial point in the first (coarse) step. From this initial point, the concept corresponding to each category is found by the boosting Diverse Density procedure in the second (fine) step. (b) We design an automatic method to find the boundary for the concept after the concept has been found, which performs better than using the leave-one-out approach alone. The searching range defined in (13) avoids the over-fitting problem and speeds up the search. The concepts found by our algorithm (the maxima of DD(x)) are almost the same as those of the Diverse Density algorithm; the main performance difference between the two algorithms comes from the distance-threshold selection procedure. We use the automatic method to find the threshold and obtain more accurate results. In contrast to EM-DD, our algorithm utilizes AP clustering and the Hausdorff distance to find the most likely candidate in the feature space rather than randomly choosing some initial points. Such randomly chosen points may lead only to a local maximum of the Diverse Density likelihood. EM-DD also uses the leave-one-out approach to obtain the boundary, which may cause over-fitting. In summary, our algorithm utilizes AP clustering and the Hausdorff distance to find the concept (the point t) approximately, uses Diverse Density to obtain the exact concepts, and then detects the boundary to recognize new instances.
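For reference, a minimal sketch of the symmetric Hausdorff distance between two instance sets follows (assuming Euclidean point distances; the paper does not spell out the exact variant it uses, so this is the standard textbook form):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two instance sets
    (rows are feature vectors): the largest distance from any
    point in one set to its nearest point in the other set."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    forward = D.min(axis=1).max()   # worst nearest-neighbor distance A -> B
    backward = D.min(axis=0).max()  # worst nearest-neighbor distance B -> A
    return max(forward, backward)
```

A small forward/backward asymmetry is what the two inner terms capture: a set can sit close to another set without the reverse being true.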
Table 2 shows the average accuracy of each category on the three datasets. Overall, our algorithm obtains the best performance in most categories on the three datasets, although other algorithms perform better in a few categories, for instance, the face category for the Diverse Density algorithm. Our experiments also show that the NUS-WIDE(OBJECT) and COREL30K datasets are much noisier than the MSRC dataset, as the average results of the five algorithms on MSRC are better overall than those on NUS-WIDE(OBJECT) and COREL30K. The Diverse Density method cannot find the optimal solution in most categories of the COREL30K dataset because there are too many positive instances to compute the Diverse Density likelihood, so we cannot report its results on COREL30K (indicated by NR in Table 2).
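For concreteness, the classic noisy-OR Diverse Density likelihood of Maron and Lozano-Pérez [21] can be sketched as follows (an unweighted illustrative form, not the paper's exact implementation). Its cost grows with the number of instances per bag, which is why bags with many positive instances make the maximization expensive:

```python
import numpy as np

def diverse_density(t, positive_bags, negative_bags):
    """Noisy-OR Diverse Density likelihood of a concept point t.

    Each bag is a list of instance feature vectors. A positive bag
    contributes the probability that at least one of its instances
    matches t; a negative bag, the probability that none do."""
    t = np.asarray(t, float)

    def p_instance(x):  # Pr(instance x matches concept t)
        return np.exp(-np.sum((np.asarray(x, float) - t) ** 2))

    dd = 1.0
    for bag in positive_bags:   # noisy-OR over the bag's instances
        dd *= 1.0 - np.prod([1.0 - p_instance(x) for x in bag])
    for bag in negative_bags:   # every instance must miss the concept
        dd *= np.prod([1.0 - p_instance(x) for x in bag])
    return dd
```

Maximizing this product over t requires evaluating every instance of every bag at each step, which motivates the paper's use of a single most contributive instance per bag.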
The run-times of the algorithms compared in our experiments are shown in Table 3. The mi-SVM obtains the best run-time performance because many specialized
Table 3 The average run-time (s) of each category on the three image datasets using different methods

              mi-SVM(100)  mi-SVM(200)  mi-SVM(300)  RW-SVM  DD      EM-DD(50%)  Our Method
MSRC          1.05         1.95         3.12         6.37    780.6   257.1       1.5
NUS-WIDE      12.6         22.8         42.6         100.8   9829.8  3686.33     18.6
COREL30K      10.2         23.4         32.5         65.9    NR      135.6       14.7
Fig. 9 Examples of tag-to-region assignment results. The three rows are from the MSRC, NUS-WIDE(OBJECT) and COREL30K datasets respectively
solution algorithms for SVM have been designed [23], while the Diverse Density-based algorithms need to compute the Diverse Density maximum with numerical approaches such as Newton's method [7]. We choose the best initial point (i.e., the most positive instance) by clustering and the Hausdorff distance to replace the multiple-initial-point trials in DD and EM-DD, and use the most contributive instance of each bag to compute the Diverse Density instead of using all instances, so the run-time is greatly reduced. Among all algorithms, mi-SVM uses the least run-time; however, mi-SVM can hardly converge to a stable solution, so we manually limit the number of iterations to terminate the algorithm (100, 200 and 300 in Table 3). The run-time of the EM-DD algorithm is also determined by the number of initial instances. Although we ran experiments with different numbers of initial points for EM-DD, we only report the run-time using 50% of the positive instances, which yields the best performance.
Finally, we show some results of the tag-to-region assignment experiments in Fig. 9. In the images of Fig. 9, the instance-level annotation is shown directly on the segmented images. Some instances (image regions) are not assigned any tag when the ranking scores defined in (14) are all larger than 1.0; in other words, an instance does not belong to any category if all of its scores exceed 1.0.
Fig. 10 Average accuracy on the 92 categories (part 2 and 3 of 121 categories) of COREL30K dataset
using 7 approaches: a mi-SVM; b RW-SVM; c EM-DD; d Our Method; e Our Method without MIG;
f Our Method without CI; g Our Method without TS
References
1. Andrews S, Tsochantaridis I, Hofmann T (2002) Support vector machines for multiple-instance learning. Adv Neural Inf Proc Syst 15:561–568
2. Bunescu R, Mooney R (2007) Multiple instance learning for sparse positive bags. In: Proceedings of the 24th International Conference on Machine Learning (ICML), pp 105–112
3. Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
4. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27
5. Chen Y, Bi J, Wang J (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947
6. Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference on image and video retrieval, p 48
7. Coleman TF, Li Y (1996) An interior trust region approach for nonlinear minimization subject to bounds. SIAM J Optim 6(2):418–445
8. Cusano C, Ciocca G, Schettini R (2004) Image annotation using SVM. In: Society of Photo-Optical Instrumentation Engineers conference (SPIE), vol 5304, pp 330–338
9. Deng Y, Manjunath B, Shin H (1999) Color image segmentation. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), vol 2
10. Dietterich T, Lathrop R, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71
11. Fan J, Shen Y, Zhou N, Gao Y (2010) Harvesting large-scale weakly-tagged image databases from the web. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), pp 802–809
12. Frey B, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972
13. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, pp 119–126
14. Lew M, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19
15. Liu D, Hua X, Zhang H (2011) Content-based tag processing for internet social images. Multimed Tools Appl 51:723–738
16. Liu D, Yan S, Rui Y, Zhang H (2010) Unified tag analysis with multi-edge graph. In: Proceedings of the international conference on Multimedia (ACM MM), pp 25–34
17. Li F, Fergus R, Torralba A (2007) Recognizing and learning object categories. CVPR 2007 short course
18. Li J, Wang J (2008) Real-time computerized annotation of pictures. IEEE Trans Pattern Anal Mach Intell 30(6):985–1002
19. Liu S, Yan S, Zhang T, Xu C, Liu J, Lu H (2012) Weakly-supervised graph propagation towards collective image parsing. IEEE Trans Multimedia 14(2):361–373
20. Liu X, Cheng B, Yan S, Tang J, Chua T, Jin H (2009) Label to region by bi-layer sparsity priors. In: Proceedings of the 17th ACM international conference on multimedia, pp 115–124
21. Maron O, Lozano-Pérez T (1998) A framework for multiple-instance learning. In: Advances in neural information processing systems, pp 570–576
22. Maron O, Ratan A (1998) Multiple-instance learning for natural scene classification. In: Proceedings of the fifteenth international conference on machine learning, vol 15, pp 341–349
23. Platt J, et al (1998) Sequential minimal optimization: a fast algorithm for training support vector machines. Technical report MSR-TR-98-14, Microsoft Research
24. Qi G, Hua X, Rui Y, Mei T, Tang J, Zhang H (2007) Concurrent multiple instance learning for image categorization. In: IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
25. Russell B, Freeman W, Efros A, Sivic J, Zisserman A (2006) Using multiple segmentations to discover objects and their extent in image collections. In: IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR), pp 1605–1614
26. Shen Y, Fan J (2010) Leveraging loosely-tagged images and inter-object correlations for tag recommendation. In: Proceedings of the international conference on Multimedia (ACM MM), pp 5–14
Zhaoqiang Xia is a PhD student at Northwestern Polytechnical University. His research interests
include multimedia retrieval, statistical machine learning and computer vision.
Yi Shen is a PhD student at University of North Carolina at Charlotte. His research interests include
multi-label learning and multiple-instance learning.
Xiaoyi Feng is a professor at Northwestern Polytechnical University. Her research interests include computer vision, image processing, radar imagery and recognition.
Jinye Peng is a professor at Northwestern Polytechnical University. His research interests include
computer vision, pattern recognition and signal processing.
Jianping Fan is a professor at University of North Carolina at Charlotte. His research interests
include semantic image and video analysis, computer vision, cross-media analysis, and statistical
machine learning.