Limitations of Content-based Image Retrieval


Theo Pavlidis
© Copyright 2008

Version of June 24, 2008


(latest minor editing: October 24, 2008)

Abstract

This is a viewpoint paper where I discuss my impressions of the current state of the art and then express
opinions on what might be fruitful approaches. I find the current results in CBIR very limited in spite of over 20
years of research effort. Certainly, I am not the only one who thinks that way; the lead editorial of a recent
special issue of the IEEE Proceedings on multimedia retrieval was titled "The Holy Grail of Multimedia
Information Retrieval: So Close or Yet So Far Away?"

I offer certain reasons for this state of affairs, especially for the discrepancy between high quality results
shown in papers and poorer results in practice. The main reason seems to be that the lessons about feature
selection and the "curse of dimensionality" in pattern recognition have been ignored in CBIR. Because there is
little connection between pixel statistics and the human interpretation of an image (the "semantic gap"), the use
of a large number of generic features makes it highly likely that results will not be scalable, i.e., they will not hold
on collections of images other than the ones used during the development of the method. In other words, the
transformation from images to features (or other descriptors) is many-to-one, and when the data set is
relatively small, there are no collisions. But as the size of the set increases, unrelated images are likely to be
mapped into the same features.

I propose that generic CBIR will have to wait both for algorithmic advances in image understanding and
advances in computer hardware. In the meantime I suggest that efforts should be focused on retrieval of
images in specific applications where it is feasible to derive semantically meaningful features.

There are two appendices with examples of image retrieval. One presents the results obtained from some
online systems and the other presents some experiments I conducted to demonstrate how a method that yields
impressive results in its authors' paper gives poor results in independent tests. I have included a set of
images that provide a challenge to the current CBIR methodologies. Two other appendices illustrate the
severe limitations of color and edge strength histograms, respectively, for image characterization.

Related Links: Main CBIR page - A Challenge to CBIR Techniques - Dealing with the Challenge - "The
Holy Grail of Multimedia Information Retrieval: So Close or Yet So Far Away?" - An Answer (Three page
summary of this document) - An Evaluation of SIFT

Main page of theopavlidis.com - Site map

Timeline: First draft (with a different title): June 2007. Second draft: May 14, 2008. Third and Final draft:
June 2008. (The date in parentheses at the top reflects the last time any minor revisions were made.)

Table of Contents

Section 1. Introduction  
Section 2. The Problem  
Section 3. The Curse of Dimensionality and the lack of Scalability  
Section 4. The Role of Context  
Section 5. Why is Image Retrieval so Hard Compared to Text Retrieval  
Section 6. What Can We Learn from Success in Other Areas  
Section 7. What to Do  
Section 8. Conclusions  
Section 9. Acknowledgements  
Section 10. Sources Cited  
Appendix A. Web Test Results of Content-based Image Retrieval  
Appendix B. Simulating What is Happening in Current CBIR systems  
Appendix C. Color Histograms Provide No Semantic Information  
Appendix D. Edge Strength Histograms Provide No Semantic Information Either  

1. Introduction

In October of 1998 I attended the VLBV98 workshop in Urbana-Champaign that included several papers on
video indexing and content-based image retrieval (CBIR). (The overall topic of the workshop was Very Low
Bit-rate Video.) I recall an informal discussion with a few colleagues from the image processing and image
analysis areas where we concluded that CBIR was hopeless and only labeling of video (and image) databases
offered a promise. We concluded that given the cost of creating these databases, the additional cost for
labeling would be comparatively minor.

That put the subject of CBIR off my mind until late in 2005, when one of our dogs (Lucy) died and my wife asked
me to find all the pictures I had of the dog so that she could put them in an album. That was not an easy task
given that I have a personal collection on my PC of several thousand pictures, very few of them labeled.
As a result I started looking for image retrieval tools on the web and looking at the scientific literature. I was
sorely disappointed. Not only were there no tools that I could use on my collection, but there were also too
few publicly available demonstrations. While there have been numerous publications in this area, I found only
three systems that allowed online testing of the authors' methodologies, and all three produced quite poor
results (see Appendix A). At first this was puzzling because the papers describing the methodologies
presented what seemed to be impressive results, but soon I realized that the papers did not deal with
realistic situations. For example, asking to find "all images with a lot of red" is easily answered if the features
include information about the color histogram, but that is not a particularly interesting query. There were also
deeper flaws in the methodologies that I will discuss in the next section.

Clearly, if the pictures in the database are unlabeled, a good way to deal with such a query is to find one
picture of the subject and ask for pictures similar to the chosen one. It is not only private collections that are
likely to be unlabeled. The problem occurs even in public, supposedly tagged databases. When an individual
moves suddenly to prominence, it is highly unlikely that images of the person would have been tagged. A
notorious instance of such a search occurred during President Clinton's impeachment proceedings, when
reporters were looking for pictures of the president with Monica Lewinsky. Therefore the problem of retrieval by
example without relying on tags is not going to go away, in spite of what some of us thought during VLBV98.

In my case, I did the best I could with a manual search, but then I wrote a simple CBIR program based on
published methodologies and tested it. While the results were disappointing for the specific task, they
provided some insight into why impressive published results are followed by practical results of poor quality
(see Appendix B). My test showed that certain methodologies suffer from a lack of scalability. I discuss in the
sequel why this might be an inherent problem in the way current CBIR research is conducted.

In this document I use the term CBIR in a sense that parallels text retrieval. The query is an image and the
user wishes to find images in the database that contain objects present in the query image. The images
retrieved may be quite different from the query image as long as they contain the objects of interest. This is
sometimes referred to as search for objects (e.g. [SZ08]) to distinguish it from methods that attempt to match
whole images.
Back to the Table of Contents

2. The Problem

Many people agree that content-based image retrieval (CBIR) remains well behind content-based text
retrieval. For example, [YSGB05] states "current image retrieval methods are well off the required mark" and
[HSHZ04] states "the semantic gap between the user's needs and the capability of CBIR algorithms remains
significant." [DAL08] states "Research as reported in the scientific literature, however, has not made significant
inroads as medical CBIR applications incorporated into routine clinical medicine or medical research." The
editorial for a recent IEEE Proceedings special issue on Multimedia Information Retrieval was titled "The Holy
Grail of Multimedia Information Retrieval: So Close or Yet So Far Away?" [HLMS08]

Significant effort has been put into using low-level image properties such as color and texture. Simpler
methods use them as part of global statistics (histograms) over the whole image (some methods even forego
such characterizations and rely on global transforms such as wavelets). More advanced methods [CBGM02,
CMMPY00, SZ08] use them to describe regions of an image. Even when sophisticated features such as SIFT
[Lo04] are used, they still rely on pixel values.

Several authors point to the existence of a semantic gap (for example, [SWSGJ00], [RMV07], and [HLMS08]),
defined as the gap between the pixel values and the interpretation of the image. Personally, I think that we deal with a
semantic abyss. The semantic interpretation of an image has very little to do with statistics of the values of the
pixels. Figures F1 and F2 illustrate this issue, but it should be a familiar topic to anyone who has used
histogram equalization for image enhancement. Images tend to maintain their semantics even though their
luminance or color histograms are made flat. (See Appendix C for examples.) Histograms of other image
properties also seem to lack semantic information. (See Appendix D for examples.)


Figure F1: The picture in the middle (B) is a brightened version of the picture on the left (A)
but two different sets of feature measures classify it as being closer to the picture on the right
(C). (See Appendix B for details.) The values were:
    First feature set: dist(Left, Mid) = 25, dist(Left, Right) = 25, dist(Mid, Right) = 0.
    Second feature set: dist(Left, Mid) = 42, dist(Left, Right) = 50, dist(Mid, Right) = 23.

Part of the problem is that representing an image by simple features usually results in a loss of information, so
that different pictures may map onto the same set of features. If an image has area A (in pixels) and each pixel has L
possible levels, then there are L^A such images possible. On the other hand, the number of possible histograms
is less than A^L. Usually the perceptually significant L is much smaller than A, so that different images may
have similar color histograms. This is apparently the case with the middle and right images of Figure F1.
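To make the many-to-one mapping concrete, here is a minimal sketch (my own illustration, not taken from any of the cited systems): shuffling the pixels of an image destroys its semantic content entirely, yet leaves the gray-level histogram exactly unchanged.

    import numpy as np

    # Shuffling the pixels destroys all semantic content, but the
    # gray-level histogram is exactly the same.
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(64, 64))      # stand-in for a real photo
    scrambled = rng.permutation(image.ravel()).reshape(image.shape)

    hist_a = np.bincount(image.ravel(), minlength=256)
    hist_b = np.bincount(scrambled.ravel(), minlength=256)
    assert np.array_equal(hist_a, hist_b)            # identical histograms

Any feature computed from the histogram alone is therefore blind to the difference between the two images.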

Another way of expressing that phenomenon is to think of the mapping of an image into a set of features as a
hashing process. The way such mappings are done results in too many hashing collisions: images of quite
different nature are mapped into the same features. (For example, all six images of Table C1 in Appendix C
are mapped into the same histogram-based features.)

It would be possible to obtain a result closer to human intuition by using image segmentation and features
such as SIFT [CBGM02, CMMPY00, SZ08]. However, such methods, while fixing the problem
illustrated in Figure F1, might still leave relevant pictures unmatched, for example those shown in Figure
F2, because the pose of the object of interest is quite different from that in Figure F1 (A). Note added on July 9: I
have posted results that confirm that while matching SIFT keypoints can meet the first part of the challenge, it
cannot meet the second.

I welcome being corrected by someone with a system where a query with the image shown on the left of
Figure F1 will extract from my collection the image in the middle of Figure F1 and those in Figure F2. (I have
a separate posting of this challenge on the web.)


Figure F2: Two pictures containing the same dog as in Figure F1.


In retrospect the disappointing quality of the results is not surprising. Searching a collection of images for
those containing a particular object is equivalent to solving several pattern classification problems. There is a
large literature on selecting the appropriate features, and several methodologies have been discussed. I will
draw my examples here from optical character recognition (OCR), partly because of my familiarity with the area
and partly because there are several good quality commercial products. (My favorite is Cuneiform 99.)

Quite early, it was recognized that comparing the images of characters to prototypes was inadequate except
for very limited domains. Thus it was found more effective to extract high level features, such as strokes,
loops, corners, arcs, and the like [PM92, MNY99]. Words and individual characters were described in terms of
entities that corresponded to the human definition of symbols. How could one expect to achieve general
object matching by using techniques that have been proven inadequate in much simpler image recognition
problems?

The first step for dealing with the problem exhibited by Figure F1 (left and middle images) and by Figure F2 is
to have a classifier that would recognize that all these images contain a dog. Until then no real progress can be
made.
Back to the Table of Contents

3. The Curse of Dimensionality and the lack of Scalability

Several authors do treat CBIR as a classification problem by dividing images into classes such as "animals",
"buildings", etc., and then retrieving images from the same class as the query image. However, queries of this
type are not particularly interesting. People are less likely to ask for "all pictures of buildings" and more likely to
ask for a picture of a particular building. Furthermore, an image may belong to several categories.

The major weakness of such approaches is that they do not heed another lesson from pattern classification,
namely that increasing the number of features can be detrimental in a classification problem, a phenomenon
usually called the "curse of dimensionality". Duda and Hart [DH73, p. 95] attribute the term to Richard Bellman,
although the pattern classification literature uses it in a different way than Bellman did. Let d be the number
of degrees of freedom of the classifier. If the classifier is linear, d equals the number of features F; otherwise it
is greater than F. If n is the number of training samples, then for n equal to 2(d+1) there is a 50% probability
that successful classification can be obtained with a random assignment of two classes on the sample points.
Such a success can be called spurious classification, and the classifier is likely to fail on new samples. This
issue is discussed at great length in Chapter 3.8 ("Problems of Dimensionality") and other parts of [DH73] as
well as in Chapter 4.3 ("Parzen Windows") of [DHS01]. For this reason it has been the practice to select the
number of samples per class to be considerably greater than the number of features, usually five or even ten
times as many.
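The 2(d+1) figure is easy to check numerically. The following sketch is my own (it uses a nearly hard-margin linear SVM as an approximate test of linear separability, which is not how [DH73] derives the result): it draws n = 2(d+1) random points in d dimensions, assigns random binary labels, and counts how often a linear classifier fits them perfectly.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)

    def separable_fraction(d, trials=200):
        # Fraction of random labelings of n = 2(d+1) points in d dimensions
        # that a (nearly) hard-margin linear SVM fits perfectly.
        n = 2 * (d + 1)
        hits = 0
        for _ in range(trials):
            X = rng.standard_normal((n, d))
            y = rng.integers(0, 2, n)
            if y.min() == y.max():               # degenerate single-class labeling
                hits += 1
                continue
            clf = SVC(kernel="linear", C=1e6).fit(X, y)
            hits += clf.score(X, y) == 1.0       # perfect fit on random labels
        return hits / trials

    for d in (5, 10, 20):
        print(d, separable_fraction(d))          # each fraction is near 0.5

Perfect fits on randomly labeled data are exactly the spurious classifications described above.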

In general, representing images by features corresponds to a projection from a very high dimensional space
(space of all possible images of given dimensions and pixel depth) to the much lower dimensional feature
space (recall Section 2). Ideally, we would like images in a given semantic category to project to nearby points
in the feature space. If the number of samples is small compared to the dimension of the space, then we
might be able to find rules to associate the feature sets of "similar" images even if no real clustering occurs.
But when new samples are added, it is unlikely that such an association will prevail. This is the intuitive
interpretation behind the analyses cited in the previous paragraphs.

This type of problem occurs even under different formulations of CBIR. Some authors treat features as words
and associate images that share many such "words" (for example, [MSMP99] and [SZ08]). The source of the
difficulty here is that image parts that may have different human interpretations may map onto the same
feature. (The likelihood varies with the method; it is more likely with the features used in [MSMP99] than with
those used in [SZ08], but it is always there.) When the method is tested only on a few thousand known
pictures such conflicts are unlikely to occur. (And if they do, the authors are likely to modify their system while
preparing the publication of a paper.) But such conflicts are quite likely to occur in large-scale tests.

It might be useful to provide an analogy of this problem in text retrieval. Suppose that we decide to use only
the first three letters of each word. Thus "apple" would be mapped to "app", but so would "application",
"apparition", "apparel", etc. If one tries this method on a small collection that includes mostly articles on food,
the abbreviation will work fine, but it will fail in general. You can run experiments of this kind in Google
Desktop and Google, and you are likely to find out that a coarse search will work fine in the former but not in
the latter. (See also Section 5.)
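The analogy is easy to act out in a few lines (a toy example of my own invention): index words by their first three letters and watch distinct words collide.

    # Index words by their first three letters, the "feature".
    words = ["apple", "application", "apparition", "apparel", "appendix"]
    index = {}
    for w in words:
        index.setdefault(w[:3], []).append(w)

    # One feature value, five unrelated words: a hashing collision.
    print(index)   # {'app': ['apple', 'application', 'apparition', ...]}

In a small, topically narrow collection the single bucket is harmless; in a large one it returns mostly irrelevant documents.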


There is another issue involved in the selection of training samples. They need to be representative of the
population, a requirement that is much harder to satisfy unless we understand something about the structure
of the classes. For example, in a multifont OCR system we must include samples of all expected fonts as well
as samples from documents printed by all expected types of printers.

If a system is going to rely on features, they should be selected parsimoniously to avoid spurious success in
the learning set and failure in tests (the curse of dimensionality). It seems that most reported CBIR research
places great emphasis on color but color is less important for conveying information than it appears at first.
Different colors generally have different luminance, and two-color displays where both colors have the same luminance are
hard to discern. This phenomenon is called isoluminance (see [Bach_2] for an impressive demonstration) and
its existence implies that an image search based only on luminance is not going to be significantly inferior to a
search that includes colors. For the same reason the JPEG standard uses twice the resolution for luminance
as it does for chrominance. The ability of color blind people to deal quite well with most tasks of daily life as
well as the ability of everybody to obtain information from "black-and-white" images offer additional support to
this view. Adding color increases the number of features without improving significantly the semantic
information carried by features.
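A minimal sketch of the design choice this argues for, assuming the standard ITU-R BT.601 luminance weights (the function names are my own): compare images on a luminance histogram alone and discard the chrominance channels, much as JPEG subsampling does.

    import numpy as np

    def luminance(rgb):
        # rgb: H x W x 3 array with values in 0..255 (ITU-R BT.601 weights)
        return rgb @ np.array([0.299, 0.587, 0.114])

    def luma_hist_distance(a, b, bins=32):
        # L1 distance between normalized luminance histograms of two images
        ha, _ = np.histogram(luminance(a), bins=bins, range=(0, 255), density=True)
        hb, _ = np.histogram(luminance(b), bins=bins, range=(0, 255), density=True)
        return np.abs(ha - hb).sum()

This keeps the feature count down; whether the small loss from dropping chrominance matters should of course be verified on the application at hand.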

Appendix B offers a simulation of the effects of the "curse of dimensionality": impressive results in
the author's paper are followed by poor results in field tests.
Back to the Table of Contents

4. The Role of Context

In some ways the above example presents too easy a challenge. If a person is asked to label the first two
images in Figure F1 and both images in Figure F2, the word "dog" is likely to be the first label. However, a
useful CBIR system must also deal with images of ambiguous meaning.

A well-known observation from research in vision is that humans rely a lot on broad context and are likely to
ignore local details. This is illustrated in Figure F3 and it suggests that reliance on low-level measurements
may produce results different from those of a human viewer. (We miss the forest because of the trees).

Bela Julesz [Ju91] stated that "in real-life situations, bottom-up and top-down processes are interwoven in
intricate ways," and that "progress in psychobiology is ... hampered ... by our inability to find the proper levels
of complexity for describing mental phenomena". According to V. S. Ramachandran “Perceptions emerge as a
result of reverberations of signals between different levels of the sensory hierarchy, indeed across different
senses” ([RB98], p. 56). Ramachandran is explicitly critical of the view that “sensory processing involves a
one-way cascade of information (processing)” [ibid].

Figure F3: Left: Can you find the spelling error in the text? Right: People
ignore the gap between the dots in FH but notice the much smaller gap
between K and Z. (Both figures have been adapted from [Pa94].)

The human visual system is able to ignore local cues once a high level interpretation has been achieved. Also
consider the two sentences:
    New York State lacks proper facilities for the mentally III.
    The New York Jets won Superbowl III.
The last word in each sentence is interpreted differently, even though they are identical. Such occurrences are
even more likely in pictures.

Context has been a major challenge to Artificial Intelligence research overall and the Nobel laureate Arno
Penzias has put it very succinctly: “Human intelligence almost always thrives on context while computers work
on abstract numbers alone. … Independence from context is in fact a great strength of mathematics.” [Pe89, p.
49]

If context is known, then one can include that knowledge in the design of the system, but this is possible only when
we deal with a specific application area. In one industrial application [PJHHL06], context was used to
design, in essence, a top-down vision system. In general we face the formidable problem of context inference
that is usually dealt with by combining top-down and bottom-up approaches. There is considerable research in this
area but it is beyond the scope of this paper to discuss such literature.

Once we recognize the importance of context, there is at least one simple requirement for image retrieval
techniques: it is not enough to look at parts of an image; we must also look at the overall composition.
[FSGD08] offers an example of improved retrieval by including object relationships in a medical application.
Back to the Table of Contents

5. Why is Image Retrieval so Hard Compared to Text Retrieval

The fundamental reason that CBIR (and mechanical analysis of images in general) is so hard is that it faces
the challenge of matching the human visual system that has evolved from animal visual systems over a period
of more than 100 million years. In contrast, speech is barely over 100 thousand years old and written text no
more than 10 thousand years old. The great complexity of the visual system is detailed, amongst other places,
in [RB98], Chapter 4, where it is pointed out that the brain has models of the world (p. 68) and 30 distinct
areas, each performing a different type of information processing, such as detecting color, texture, etc (p. 72).
The complexity of the system may also be inferred from the existence of visual illusions. (See [Bach_1] for
a large collection with scientific explanations.) While auditory illusions exist, they are far less common than
optical illusions. (See [DD95]. I am grateful to Professor Michael Bach of the University of Freiburg for pointing
out the existence of auditory illusions and the above reference.)

An operational reason for the difficulty is that the image data (values of pixels stored in bytes) are very far
from the interpretation of images by human observers. This is not the case with text where the stored data are
much closer to human interpretation.

The following point is so obvious that it would not be worth mentioning, if it were not for the fact that much
research in CBIR seems to ignore it. Text consists of words that are well-defined models of concepts so that
communication amongst humans is possible. While words may be ambiguous, they are usually easily specified
by content. Thus "dog" may mean either an animal or an unattractive person, but if we look at the words
surrounding it, the meaning usually becomes clear. In turn, words have a formal representation either by
single symbols (in Chinese for example) or by a sequence of symbols (in most languages). Whether one or
several, the symbols have a well-defined representation in terms of bits, so going from digital storage to a
human understandable representation is a matter of table look up. Therefore finding all files containing the
word "dog" is a task of (conceptually) simple string matching. (The Unix command fgrep " dog "
filenames does that.) Regular expressions that allow variations in the spelling and delimiters of words make
it easier to identify files where the word dog occurs (the Unix grep).
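For readers without Unix at hand, the same whole-word search takes only a few lines in any language; here is a rough Python equivalent (the function name and file list are placeholders of my own):

    import re

    # Find files containing "dog" as a whole word, whatever the delimiters:
    # the counterpart of the grep example above.
    pattern = re.compile(r"\bdog\b", re.IGNORECASE)

    def files_with_dog(paths):
        hits = []
        for p in paths:
            with open(p, encoding="utf-8", errors="ignore") as f:
                if pattern.search(f.read()):
                    hits.append(p)
        return hits

The point is how short this is: once the target is a byte pattern, retrieval is trivial.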

In text retrieval systems users are free to compose queries using several words, but this is not the case with
image retrieval where the user is usually asked to provide a query image or a sketch. Some researchers have
recognized this and solutions include area selection in an image [Like], semantic labels [RMV07], and the use
of multiple images as well as feedback [IA03]. All these approaches improve the performance of an image
retrieval system. However the challenge remains because the user must perform some image manipulation
work, a more demanding task than typing words.

A fairer comparison between image and text retrieval is to consider retrieval not of words but of concepts. It
turns out that the state of the art of text retrieval based on concepts is not particularly advanced.

Suppose that I am not looking for stories about any dog but for stories about a dog named Lucy. The task
is much harder but Google Desktop did a good job when I asked it to search for "dog Lucy" on my PC. It
seems to have identified all files containing either word, which was fine for my purpose. (There were six
returns, five of which were what I was looking for.) Google did not do as well because it looks at a much
bigger collection of documents. In a large collection even files with both words may be false positives. For
example: "Ask Lucy - How to feed a senior dog". However the first 20 (out of close to two million hits) were
all true positives. False positives sneaked in after No. 22. True positives were found after No. 70 and I did not
look after No. 80. Note that if I ask it to search for "dog named Lucy" I miss a lot of files because many of
them do not contain the word "named". We even get results such as: Lucy and I spent the weekend alone
together. We have a dog named Kyler. Asking to search for the exact phrase "dog named Lucy" yields about
4,000 hits and less than 250 are shown (the others assumed to be duplications). No false positives but a lot of
false negatives. Of course, in all such searches we will miss stories where the word "dog" has been replaced
by the name of the breed.


The drop in performance from Google Desktop to Google is what we also observe in CBIR systems when
going from the authors’ collection to large public databases.

Another example of the poor state of the art in text filtering is the challenge of designing a filter for material
inappropriate for young children. A story with the phrase "she took off her shoes and lifted her skirt in order to
cross the flooded street" is probably quite appropriate. On the other hand the words in italics are also likely to
appear in objectionable stories. Note also that each of the italicized words is not objectionable on its own.

Searching text by content is quite hard when the content we are looking for is a concept rather than a phrase.
A computer search is effective only as long as we look for byte patterns. In the case of text, words are
mapped easily into byte patterns, so searching for words is effective. This should give us a hint as to why content-
based image searching is hard. The counterpart of searching for words is searching for pixels of a particular
color, not a particularly interesting query.
Back to the Table of Contents

6. What Can We Learn from Success in Other Areas

Practical successes in the overall field of Artificial Intelligence (AI) have been relatively few, but we can learn a
lot from them. Not surprisingly, all successes have been in what Hubert Dreyfus calls Weak AI.

Considerable work on "simple" OCR algorithms that could run fast enough on machines of the 1960's or 1970's
turned out to be a waste of time. Such techniques could not provide satisfactory performance for multifont text,
and by 1990 there were machines fast enough to implement more "complex" algorithms. There were also issues of
scanner resolution. 10-point text requires at least 300 dpi resolution and, again, such scanners did not
become available at a low price until around 1990. Earlier scanners of 100 or 150 dpi tended to distort images of
characters significantly. One notorious case was the mapping of "a", "e", and "s" into a form topologically
equivalent to an "8" and recognition efforts relied on slight differences in shape. This problem went away with
the introduction of 300dpi resolution scanners.

OCR products also take into account context by using, at least, a spelling checker.

OCR offers another, particularly appropriate, example. For several years published papers reported recognition
rates of 97-98%. However, such a recognition rate is useless in practice because it corresponds to over 50
errors per page (a typical full page holds roughly 2,500-3,000 characters, so a 2% character error rate means
50 or more errors). In addition, many authors did not distinguish between rejection and substitution errors. For a
practical system the recognition rate on good quality print should be at least 99.9% (2-3 errors per page) and
the errors should be rejections (rather than substitutions) so that the system can flag them. The tolerable rate
for substitution errors is much lower. George Nagy and Bob Haralick wrote a lot on that topic for anyone
interested in reading more.

An Aside on Accuracy
The accuracy issue is particularly pertinent to image retrieval where all too often we see
publications claiming success rates anywhere from 60% to 90%. Such rates are clearly better
than random guesses but it is unclear how useful they are. There is often no distinction between
false positives and false negatives. For example, in a forensic application we may tolerate
several false matches as long as we do not miss any true match. The casual handling of this
issue in the literature tempts me to offer a variation on the old joke "How a physicist proves that
all odd numbers are prime". An image retrieval person's proof: "3 is prime, 5 is prime, 7 is
prime, 9 hmm, 11 is prime, 13 is prime, oh well, we have an 83% success rate, let's publish."
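To put the aside in the usual terms: a single "success rate" hides the split between false matches and missed matches, which precision and recall keep apart. A small sketch with invented counts:

    def precision_recall(tp, fp, fn):
        # tp: true matches found; fp: false matches; fn: true matches missed
        return tp / (tp + fp), tp / (tp + fn)

    # A forensic setting may tolerate many false matches (low precision)
    # as long as no true match is missed (recall = 1.0).
    print(precision_recall(tp=10, fp=40, fn=0))   # (0.2, 1.0)

Reporting both numbers, and stating which one the application cares about, would remove much of the ambiguity.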

Chess playing programs offer another example of the importance of computing power. As long as people were
relying on human inspired heuristics, chess playing programs were weak. A breakthrough occurred in the late
1970’s with Ken Thompson’s Belle that relied on special purpose hardware taking advantage of the raw search
power of machines. Eventually, this led to Deep Blue, which in 1997 defeated the human world champion.
The following is from an interview of Murray Campbell (one of the Deep Blue designers) [Ca07]: “There was a
turning point in the '70s when it was realized that, if you let computers do what they do best -- that is, search
through as many possibilities as they can as quickly as they can -- and stop the pretense of trying to emulate
how humans play, you actually got better performance. And so, from that day on, computers, including Deep
Blue, tended to be focused on searching through as many possible chess moves as they could in the amount
of time that was available for a computation.” We may have to wait for special purpose image processing
hardware (probably multiprocessor networks) before we can make truly significant progress in the general
image analysis area.


Speech recognition provides an example where prior knowledge about the structure of the signal is critical for
success and this argues in favor of restricting image analysis (including CBIR) research to special areas.
Speech recognition systems are now available in many telephone services (for example the American Airlines
reservation system provided by TellMe) and they are all grammar driven; in other words, high-level knowledge
about human language drives the recognition system (see, for example, [KGRMKL01]). Using spatial context for
image labeling has been helpful [SLZ03] but currently there are too few well defined rules that are valid for
large classes of images. (Grammatical approaches have been used in pattern recognition with limited success
because the two-dimensional structure of images does not lend itself to grammatical models the way human
languages do.)

In the area of image analysis there is another important characteristic of successful applications. They tend to
be in situations where human recognition relies on the meticulous study of details rather than a general
overview of the scene. People have trouble with the former task but they excel at the latter (the adaptive
significance of the difference in abilities is obvious). Thus image analysis has found successful applications in
industrial inspection and biometrics, especially fingerprint recognition. Automatic fingerprint identification
(retrieving a match to a fingerprint from a database) is used in practice by the FBI and there are several
commercial products for fingerprint verification (checking whether two fingerprints are the same), for example
by Zvetco. (The mention of such systems attests only to their existence, rather than to their specific abilities.)
In both industrial and biometric applications the mapping of images to features was well understood and the
main challenge has been the reliable extraction of such features from the image (for example, the minutiae in
the case of fingerprints).
Back to the Table of Contents

7. What to Do

Clearly, the general CBIR problem is well beyond the current approaches and it may well turn out to be
intractable in the foreseeable future. In contrast to text retrieval, which relies on linguistic expressions of human
concepts, there is no well-defined set of rules to formalize the content of images. For any automatic image
processing we need a mapping between human impressions ("semantics") and quantities that can be
computed ("features"). Successful pattern recognition relies on such quantities. Statistical techniques may be
used for fine-tuning the parameters necessary to design such features, but they cannot uncover the structure
of the mapping. In generating such features we must also specify the classes of images we are dealing with, and
this, in turn, facilitates the proper selection of "learning" samples.

[DAL08] describes a classification of the obstacles to CBIR that has many merits, but I prefer a more basic
approach based on the arguments I used over 35 years ago when I introduced Structural Pattern Recognition
[Pa72]. The most important step is to understand the nature of the problem and how the image can be
transformed into a mathematical representation without loss of information about what we consider to be the
important aspects of an image. For the case of OCR the important elements were strokes and their spatial
relationships (but not their color, or, within limits, their thickness.) If color and texture are important (for
example, to identify a lesion in a medical image) then we need texture and color descriptors that are accurate
enough to represent uniquely the textures and colors of interest.

I believe that a CBIR problem that fits the following characteristics is a good candidate for solution within the
current state of computer technology.

1. The mapping between semantics and image features is well defined. This way the issue of
"semantic gap" is put to rest. The representation by features can be made without loss of information,
in such a way that only images with similar interpretations and no others are mapped into the same
set of features. (Examples abound in the pattern recognition literature, including OCR and fingerprint
recognition.)
2. The context is known. For example, we may know that the object of interest occupies a particular
area in the image or is close to some other well-defined feature. The context can be taken advantage of in
the design and selection of the features.
3. The matching of images requires careful scrutiny. This is a process that humans are not very good
at and it is likely that machines can match or even exceed human performance. Medical, industrial,
forensic, security, as well as other applications seem to offer problems of this type.
4. The accuracy requirements are well defined. Knowing the relative significance of false matches
versus omitted true matches is particularly important.
5. The computational requirements are well defined. In many cases instant response is not needed.
For example, in a medical application it may take well over an hour to produce an image, so waiting
another hour to find matches in a database is not particularly onerous. In security applications a fast
response is often necessary, but in that case the database is proprietary (not a public collection) and it
may be organized accordingly for fast retrieval.

I do not see how we can have a mathematical representation without loss of information for the general
category of "all images" but we have a good chance of achieving that by confining the problem into retrieving
images from a particular category only (for example, matching an iris scan to a collection of iris scans).

Biometric applications are probably where most progress has been made (as stated in Section 6) and
research is now continuing in more challenging soft biometrics such as the matching of tattoos and scars
[LJJ08].

Medical applications appear particularly appropriate because the images are searched for objects (for
example, lesions) for which there is usually no intuitive human definition and comparisons between a query
image and the database can be quite laborious for a human. However according to [DAL08] there have been
no hospital applications, apparently because the methodology used has been "general" without taking
advantage of the specifics of the applications. A recent work [FSGD08] deals with a the identification of hand
bones, and it describes improvements in performance by taking advantage of the spatial relationships between
the bones and using inexact subgraph matching.

I also noticed recently the following description of a project directed by Professor Bob Fisher of the University
of Edinburgh: "The project goal is to develop an interactive skin cancer image database indexing tool, where
users compare 'live' skin lesions/spots to images selected from a database. Based on user feedback on the
appropriateness of the selections, the database will be searched for better matches." The specificity of the
application makes the development of semantically meaningful features likely.

Another potential area of successful CBIR is in reconnaissance and surveillance either for security purposes
or natural resource management. Over ten years ago we used a graph representation and shape features for
branches to retrieve information from road maps [HP96] that could match the location of an aerial photograph
to a particular geographic area.

My list of areas is far from exhaustive. As long as the objects we are looking for in a set of images can be
characterized accurately by computable features, the area is ready for CBIR development within the current
state of computer technology.

It is important that the images and objects of interest be characterized formally so the results can be scalable.
The use of a standard set of images may be fine for the initial stages of the research but the algorithm must
soon be tested on any image that satisfies appropriate criteria. Let me digress here for a moment. I have
worked in industry, where an imaging product is tested at the customer's site with blue-collar workers (as
opposed to graduate students) handling the equipment, and, while hard, it is quite possible to design systems
that perform well in an unlimited testing set (for example [PJHHL06]).

Evaluation of methodologies solely on the basis of statistical results of their performance on "standard sets" is
fraught with risks because of the difficulty of ensuring that the membership of the set and the rules of
evaluation are representative of the real world where the methodologies will be applied. See, for example,
[Io05]. Even though [Io05] addresses mainly medical studies, several of the issues it raises are also relevant to
pattern recognition. It is important to pay close attention to what is considered a satisfactory performance.

I believe it is the lack of proper testing that contributes to the paradox raised by [HLMS08]: "Multimedia
Information Retrieval: So Close or Yet So Far Away?" It is close if we look at published methods that work well
on a specific data set. It is far away if we realize that many such methods work only on that specific data set. In
the age of the internet it is essential to subject methodologies to public testing.
Back to the Table of Contents

8. Conclusions

I believe that until there is sufficient progress in object recognition or in application-specific CBIR, general
CBIR research is unlikely to be fruitful, especially when constraints on real time performance are added. (Such
constraints tend to limit systems to global features that cannot meet even the simple challenge illustrated in
Figure F1.) If we start with problems fitting the five properties listed in the previous section, then we may have
a hope of not only solving specific problems, but also gaining insight on what methodologies are most
appropriate.


If I were forced to work on general CBIR, then I would look to extend current approaches that rely on the
search for objects by creating "visual words" (for example, [LJSJ08] and [SZ08]). Both of the cited papers rely
on matching sets of key points obtained by SIFT [Lo04]. However, I doubt that such features can deal with
shadows and differences in pose (as exemplified by the images of Figure F2). (Indeed [Pa08] suggests severe
limitations of SIFT for general images although it seems useful for images of 2D objects.) A critical extension
of such work would be to tag such visual words with object names and then search for name matches.
Because different visual words may map onto the same name, fewer matches will be missed. Such labeling
would, in effect, bridge the "semantic gap" (a toy sketch of this idea follows this paragraph). It would be
interesting to investigate whether approaches based on
segmentation (for example [CBGM02]) or graph matching (for example [Pt02]) face a higher threshold in
overcoming differences in pose and distractions like shadows than methods based on key points. Extensive
interaction with the user as described in [HSHZ04] is bound not only to help performance but also to provide
insight into the utility of features. The importance of user feedback has been documented by several other
authors (for example, [IA03]).
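A toy sketch of the tagging idea mentioned above (the visual word IDs and labels are invented for illustration): map visual words to object names and match on the names, so that two different visual words arising from the same object still match.

    # Visual word -> object name; several words may share one name.
    labels = {101: "dog", 102: "dog", 205: "car"}

    def name_matches(query_words, db_words):
        # Object names shared by the query image and a database image.
        q = {labels[w] for w in query_words if w in labels}
        d = {labels[w] for w in db_words if w in labels}
        return q & d

    print(name_matches([101, 205], [102]))   # {'dog'}: different words, same name

The hard part, of course, is producing the labels dictionary, which is the object recognition problem itself.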

Several authors have been combining text tags of images with features [RMV07]. This is certainly a valid
approach, provided that the features used provide really useful information. However global statistics are
unlikely to help because they are so far from the semantic interpretation. Instead of providing semantic
tagging to make up for semantically unreliable features it might be better to look for features that are closer to
semantics.

I would stay away from over-reliance on statistical "learning" techniques that has plagued not only CBIR but
machine vision and pattern recognition in general. Besides the issues discussed in Sections 3 and 6, such
techniques obscure the understanding of the solution to a problem. Of course, being able to solve problems
without understanding them is seductive but it is not conducive to good science. Unfortunately, to paraphrase
Langmuir [La53], the seduction is so strong that such pathological science is not likely to go away any time
soon.

I should emphasize that the problem is not caused by the statistical techniques themselves but by the way
they are applied. Studying the nature of the classification problem and deriving high level features (the core of
structural pattern recognition) allows classification with relatively few such complex features, which in turn
reduces the needed number of training samples. Statistical techniques are then used for fine tuning the
classification (see for example, [KPB87]).

Humans (and animals) are able to learn from very few samples while machines cannot. This issue has
challenged researchers, and [FFP06] attempts to deal with it, although I do not think it
captures the real mechanism that is used in natural systems. [RB98] states that "Rats trained to distinguish
between a square and a rectangle perform quite well when faced with skinnier rectangles. They have the
concept of rectangle!" So when animals and humans are faced with an example, they form a concept that
allows them to extrapolate in future situations. The biological adaptive significance of such ability is obvious!
The ability of humans to identify objects or persons from very rough sketches is remarkable. (I should add that
recent research [MJ06] has cast doubts on Ramachandran's interpretation. However the fact that humans can
learn from very few samples still stands.)
Back to the Table of Contents

9. Acknowledgements

A first draft of my discussion of CBIR was posted on my web site in June 2007. That draft underwent several
changes (I was more optimistic about the future of general CBIR in the beginning). I want to thank David J.
Liu of VIMA and Qasim Iqbal of the University of Texas for comments on earlier drafts and additional pointers
to the literature. Several others provided me with pointers to research work and publications and/or answered
my questions about their work. They include Amanda Stent of Stony Brook University, Stephen Horowitz of
Yahoo, Nuno Vasconcelos of UC San Diego, Stephane Marchand-Maillet of the University of Geneva, Ingemar
Cox, Benedikt Fischer of the University of Aachen, Bob Fisher of the University of Edinburgh, Rahul
Sukthankar of HP Research and CMU, Andrew Lampert of CSIRO and Anil Jain of Michigan State University.

The listing of these people here does not imply in any way that they share the views expressed in this
document.

10. Sources Cited


[Bach_1] http://www.michaelbach.de/ot/

[Bach_2] http://www.michaelbach.de/ot/col_isoluNuBleu/index.html

[Ca07] Interview by Wired News

[CBGM02] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying," IEEE Trans. PAMI, vol. 24 (8), 2002, pp. 1026-1038.

[CMMPY00] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments," IEEE Trans. on Image Processing, vol. 9 (1), 2000, pp. 20-37.

[DAL08] T. M. Deserno, S. Antani, and R. Long, "Ontology of Gaps in Content-Based Image Retrieval," Journal of Digital Imaging, 2008. A preliminary and shorter version appeared in SPIE Proc. vol. 6516 (2007).

[DD95] Diana Deutsch's Audio Illusions. These illusions were first demonstrated in the 1970's.

[DH73] R. O. Duda and P. E. Hart Pattern Classification and Scene Analysis, Wiley, 1973.

[DHS01] R. O. Duda, P. E. Hart, and D. G. Stork Pattern Classification, 2nd edition, Wiley, 2001.

[FFP06] Fei-Fei, L., R. Fergus, and P. Perona “One-Shot Learning of Object Categories”, IEEE
Trans. PAMI, vol. 28 (2006), pp. 594-611.

[FSGD08] B. Fischer, M. Sauren, M. O. Güld, and T. M. Deserno "Scene Analysis with Structural
Prototypes for Content-Based Image Retrieval in Medicine", Proc. SPIE Conf. No. 6914,
2008.

[HLMS08] A. Hanjalic, R. Lienhart, W-Y Ma, and J. R. Smith "The Holy Grail of Multimedia
Information Retrieval: So Close or Yet So Far Away?" IEEE Proceedings, Special Issue on
Multimedia Information Retrieval, 96 (4), 2008, pp. 541-547.

[HP96] J. Hu and T. Pavlidis, "A Hierarchical Approach to Efficient Curvilinear Object Searching," Computer Vision and Image Understanding, 63 (March 1996), pp. 208-220.

[HSHZ04] L. Huston, R. Sukthankar, D. Hoiem, and J. Zhang "SnapFind: Brute Force Interactive
Image Retrieval" Proceedings of International Conference on Image Processing and
Graphics, 2004.

[IA03] Q. Iqbal, and J. K. Aggarwal “Feature Integration, Multi-image Queries and Relevance
Feedback in Image Retrieval”, Proc. VISUAL 2003, pp. 467-474.

[Io05] John P. A. Ioannidis "Why Most Published Research Findings Are False" Public Library of
Science - Medicine 2(8): e124, 2005.

[Ju91] B. Julesz "Early vision and focal attention", Reviews of Modern Physics, vol. 63, (July
1991), pp. 735-772.

[KPB87] S. Kahan, T. Pavlidis, and H. S. Baird, "On the Recognition of Printed Characters of Any Font And Size," IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-9 (1987), pp. 274-288.

[KGRMKL01] S. Knight, G. Gorrell, M. Rayner, D. Milward, R. Koeling, and I. Lewin, "Comparing grammar-based and robust approaches to speech understanding: a case study," EUROSPEECH-2001, pp. 1779-1782.

[La53] I. Langmuir, "Pathological Science," Colloquium at The Knolls Research Laboratory, December 18, 1953. Transcribed and edited by R. N. Hall. Web Posting.

[LJJ08] J-E. Lee, A. K. Jain, R. Jin "Scars, Marks, and Tattoos (SMT): Soft Biometric for Suspect
and Victim Identification" Biometrics Symposium, Tampa, Sept. 2008.

[Lo04] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int. J. Comp. Vision, vol. 60 (2), 2004, pp. 91-110.

[LJSJ08] Y. Liu, R. Jin, R. Sukthankar, and F. Jurie, "Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Recognition," Proceedings of Computer Vision and Pattern Recognition Conference, 2008.

[Like] http://www.like.com

[MJ06] L. Minini and K. J. Jeffery, 'Do rats use shape to solve "shape discriminations"?' Learning & Memory, vol. 13 (2006), pp. 287-297. Also on line.

[MNY99] S. Mori, H. Nishida, and H. Yamada Optical Character Recognition, Wiley, 1999.

[MSMP99] H. Müller, D. Squire, W. Müller, and T. Pun, "Efficient access methods for content-based retrieval with inverted files," 1999. On line.

[Pa72] T. Pavlidis, "Structural Pattern Recognition: Primitives and Juxtaposition Relations," Frontiers of Pattern Recognition (M. S. Watanabe, ed.), Academic Press (1972), pp. 421-451.

[Pa94] T. Pavlidis "Context Dependent Shape Perception", in Aspects of Visual Form Processing,
(C. Arcelli, L. P. Cordella, and G. Sanniti di Baja, eds.) World Scientific, 1994, pp. 440-
454.

[Pa08] T. Pavlidis, An Evaluation of the Scale Invariant Feature Transform (SIFT), August 2008. Web Posting.

[PJHHL06] T. Pavlidis, E. Joseph, D. He, E. Hatton, and K. Lu, "Measurement of dimensions of solid objects from two-dimensional image(s)," U.S. Patent 6,995,762, February 7, 2006. Discussion of the methodology and links to sources on line.

[PM92] T. Pavlidis and S. Mori, Special Issue on Optical Character Recognition, IEEE
Proceedings, vol. 80, No. 7, July 1992.

[Pe89] Arno Penzias Ideas and Information, Norton, 1989.

[Pt02] E. G. M. Petrakis "Design and evaluation of spatial similarity approaches for image
retrieval" Image and Vision Computing, vol. 20 (2002), pp. 59-76.

[RB98] V. S. Ramachandran and S. Blakeslee Phantoms in the Brain, William Morrow and
Company Inc., New York, 1998

[RMV07] Rasiwasia, N., P. J. Moreno, and N. Vasconcelos “Bridging the Gap: Query by Semantic
Example”, IEEE Trans. Multimedia, vol. 9, 2007, pp. 923-938.

[SBTDM02] U. Sinha, A. Bui, R. Taira, J. Dioniso, C. Morioka, D. Johnson, and H. Kangarloo, "A Review of Medical Imaging Informatics," Ann. N. Y. Acad. Sci., vol. 980 (2002), pp. 168-197.

[SLZ03] A. Singhal, J. Luo, and W. Zhu, "Probabilistic spatial context models for scene content understanding," Computer Vision and Pattern Recognition, 2003, Proceedings, pp. I-235 - I-241.

[SWSGJ00] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. PAMI, vol. 22, 2000, pp. 1349-1380.

[SZ08] J. Sivic and A. Zisserman "Efficient Visual Search for Objects in Videos", IEEE
Proceedings, Special Issue on Multimedia Information Retrieval, 96 (4), 2008, pp. 548-
566.

[YSGB05] K. Yanai, N. V. Shirahatti, P. Gabbur, and K. Barnard "Evaluation Strategies for Image
Understanding and Retrieval", Proc. ACM MIR, Nov. 2005.

Back to the Table of Contents
