Image Search Engines: An Overview

Preface

TG/AS
University of Amsterdam
June, 2003
Chapter 1
[Figure 1.1: block diagram. Off-line: image segmentation and computation of features feed a logical store. On-line: a query Q0 enters image retrieval by k-nearest neighbor search; relevance feedback and query reformulation produce the refined query Qi+1.]
Figure 1.1. Overview of the basic concepts of the content-based image retrieval
scheme as considered in this chapter. First, features are extracted from the images
in the database and are stored and indexed. This is done off-line. The on-line
image retrieval process starts from a query example image from which image
features are extracted. These image features are used to find the images in the
database which are most similar. Then, a candidate list of the most similar images is
shown to the user. From the user feedback the query is optimized and used as a
new query in an iterative manner.
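To make the off-line/on-line split of Figure 1.1 concrete, the following minimal Python sketch pairs a global color histogram (an illustrative feature choice, not the only one discussed in this chapter) with Euclidean k-nearest neighbor ranking; the function names and parameters are our own:

    import numpy as np

    def extract_features(image, bins=8):
        """Illustrative feature: a normalized histogram per RGB channel."""
        hist = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
                for c in range(3)]
        h = np.concatenate(hist).astype(float)
        return h / h.sum()

    def build_index(images):
        """Off-line stage: extract and store features for all database images."""
        return np.stack([extract_features(img) for img in images])

    def knn_retrieve(query_image, index, k=10):
        """On-line stage: rank database images by distance to the query features."""
        q = extract_features(query_image)
        dists = np.linalg.norm(index - q, axis=1)
        return np.argsort(dists)[:k]  # candidate list shown to the user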
•Search modes: Three broad search categories can be distinguished: search by
association, target search, and category search. For search by association, the
intention of the user is to browse through a large collection of images without a
specific aim. Search by association tries to find interesting images and is often
applied in an iterative way by means of relevance feedback. Target search aims
to find similar (target) images in the image database. Note that "similar image"
may imply a (partially) identical image, or a (partially) identical object in the
image. The third class is category search, where the aim is to retrieve an
arbitrary image which is typical for a specific class or genre (e.g. indoor images,
portraits, city views). As many image retrieval systems are assembled around one
of these three search modes, it is important to gain more insight into these
categories and their structure. Search modes are discussed in Section 1.2.1.
•Image domains: The definition of image features depends on the repertoire of
images under consideration. This repertoire can be ordered along the complexity
of variations imposed by the imaging conditions such as illumination and viewing
geometry going from narrow domains to broad domains. For images from a narrow
Figure 1.2. Data flow and symbol conventions as used in this chapter. Different
styles of arrows indicate different data structures.
domain, the desire is for features that are invariant to the accidental imaging
conditions while at the same time having high discriminative power. In general, a
tradeoff exists between the amount of invariance and selectivity. In Section 1.3, a
taxonomy of feature extraction modules is given from an image processing
perspective. The taxonomy can be used to select the proper feature extraction
method for a specific application, based on whether images come from broad domains
and which search goals are at hand (target, category, or association search). In
Section 1.3.1, we first focus on color
content descriptors derived from image processing technology. Various color based
image search methods will be discussed based on different representation schemes
such as color histograms, color moments, color edge orientation, and color correlo-
grams. These image representation schemes are created on the basis of RGB, and
other color systems such as HSI and CIE L∗ a∗ b∗ . For example, the L∗ a∗ b∗ space
has been designed to conform to the human perception of color similarity. If the
appreciation by a human observer of an object is based on the perception of certain
conspicuous items in the image, it is natural to direct the computation of broad
domain features to these points and regions.
New visualization techniques such as 3D virtual image clouds can be used to
designate certain images as relevant to the user's requirements. These relevant
images are then further used by the system to construct subsequent (improved)
queries. Relevance feedback is an automatic process designed to produce improved
query formulations following an initial retrieval operation. Relevance feedback is
needed for image retrieval where users find it difficult to formulate pictorial
queries. For example, without any specific query image example, the user might find
it difficult to formulate a query (e.g. to retrieve an image of a car) by image
sketch or by offering a pattern of feature values and weights. This suggests that
the first search is performed with an initial query formulation, and a (new)
improved query formulation is constructed based on the search results, with the
goal of retrieving more relevant images in the next search operations. Hence, from
the user feedback giving negative/positive answers, the method can automatically
learn which image features are more important. The system uses the feature
weighting given by the user to find the images in the image database which are
optimal with respect to that weighting. For example, search by association allows
users to iteratively refine the query definition, the similarity function, or the
examples with which the search was started. Therefore, systems in this category are
highly interactive. Interaction, relevance feedback and learning are discussed in
Section 1.6; one common reformulation rule is sketched below.
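As an illustration of the query reformulation step from Qi to Qi+1 in Figure 1.1, the following sketch implements Rocchio's rule, borrowed from text retrieval; the weights alpha, beta, gamma are conventional defaults, and the rule is one possible instance of relevance feedback, not the method prescribed here:

    import numpy as np

    def rocchio_update(query, relevant, non_relevant,
                       alpha=1.0, beta=0.75, gamma=0.25):
        """One relevance-feedback iteration: move the query feature vector
        toward the mean of relevant examples and away from non-relevant ones."""
        q = alpha * query
        if len(relevant) > 0:
            q += beta * np.mean(relevant, axis=0)
        if len(non_relevant) > 0:
            q -= gamma * np.mean(non_relevant, axis=0)
        return q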
•Testing: In general, image search systems are assessed in terms of precision,
recall, query-processing time, as well as the reliability of a negative answer.
Further, a relevance feedback method is assessed in terms of the number of
iterations needed to approach the ground truth. Today, more and more images are
archived, yielding a very large range of complex pictorial information. In fact,
the average number of images used for experimentation as reported in the literature
has grown from a few in 1995 to over a hundred thousand by now. It is important
that the dataset have ground truths, i.e. images which are (non-)relevant to a
given query. In general, it is hard to obtain these ground truths, especially for
very large datasets. A discussion on system performance is given in Section 1.6.
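A minimal sketch of the two standard assessment measures, assuming retrieved results and ground-truth relevant images are given as sets of image identifiers:

    def precision_recall(retrieved, relevant):
        """Precision: fraction of retrieved images that are relevant.
        Recall: fraction of relevant images that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall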
trademarks. Systems in this category are usually interactive, with a domain-specific
definition of similarity.
Hence, in a narrow domain one finds images with a reduced diversity in their pic-
torial content. Usually, the image formation process is similar for all recordings.
When the object’s appearance has limited variability, the semantic description of
the image is generally well-defined and largely unique. An example of a narrow
domain is a set of frontal views of faces, recorded against a clear background. Al-
though each face is unique and has large variability in the visual details, there are
obvious geometrical, physical and illumination constraints governing the pictorial
domain. The domain would be wider if the faces had been photographed in a crowd
or in an outdoor scene. In that case, variations in illumination, clutter
in the scene, occlusion and viewpoint will have a major impact on the analysis.
On the other end of the spectrum, we have the broad domain:
In broad domains images are polysemic, and their semantics are described only
partially. It might be the case that there are conspicuous objects in the scene for
which the object class is unknown, or even that the interpretation of the scene is
not unique. The broadest class available today is the set of images available on the
Internet.
Many problems of practical interest have an image domain in between these
extreme ends of the spectrum. The notions of broad and narrow are helpful in
characterizing patterns of use, in selecting features, and in designing systems. In
a broad image domain, the gap between the feature description and the semantic
interpretation is generally wide. For narrow, specialized image domains, the gap
between features and their semantic interpretation is usually smaller, so domain-
specific models may be of help.
For broad image domains in particular, one has to resort to generally valid prin-
ciples. Is the illumination of the domain white or colored? Does it assume fully
visible objects, or may the scene contain clutter and occluded objects as well? Is
it a 2D-recording of a 2D-scene or a 2D-recording of a 3D-scene? The given char-
acteristics of illumination, presence or absence of occlusion, clutter, and differences
in camera viewpoint, determine the demands on the methods of retrieval.
The sensory gap is the gap between the object in the world and the in-
formation in a (computational) description derived from a recording of
that scene.
The sensory gap makes the description of objects an ill-posed problem: it yields
uncertainty in what is known about the state of the object. The sensory gap is par-
ticularly poignant when a precise knowledge of the recording conditions is missing.
The 2D-records of different 3D-objects can be identical. Without further knowledge,
one has to decide that they might represent the same object. Also, a 2D-recording
of a 3D-scene contains information accidental to that scene and that sensing, but
one does not know what part of the information is scene related. The uncertainty
due to the sensory gap does not only hold for the viewpoint, but also for occlusion
(where essential parts telling two objects apart may be out of sight), clutter, and
illumination.
The semantic gap is the lack of coincidence between the information that
one can extract from the visual data and the interpretation that the same
data have for a user in a given situation.
A user wants to search for images on a conceptual level e.g. images containing
particular objects (target search) or conveying a certain message or genre (category
search). Image descriptions, on the other hand, are derived by low-level data-
driven methods. The semantic search by the user and the low-level syntactic image
descriptors may be disconnected. Association of a complete semantic system to
image data would entail, at least, solving the general object recognition problem.
Since this problem is yet unsolved and is likely to stay unsolved in its entirety,
research is focused on different methods to associate higher level semantics to
data-driven observables.
Indeed, the most reasonable tool for semantic image characterization entails
annotation by keywords or captions. This converts content-based image access to
a text-based retrieval problem.
1.2.4 Discussion
We have discussed three broad types of search categories: target search, category
search and search by association. Target search is related to the classical methods
in the field of pattern matching and computer vision such as object recognition and
image matching. However, image retrieval differs from traditional pattern matching
in that it must consider ever more images in the database. Therefore, new
challenges in content-based retrieval are the huge number of images to search
among, the query specification by multiple images, and the variability of imaging
conditions and object states. Category search connects to statistical pattern
recognition methods. However, compared to traditional pattern recognition, new
challenges are the interactive manipulation of results, the usually very large
number of object classes, and the absence of an explicit training phase for feature
and classifier tuning (active learning). Search by association is the most distant
from the classical field of computer vision. It is severely hampered by the
semantic gap. As long as the gap is there, the use of content-based retrieval for
browsing will not be within the grasp of the general public, as humans are
accustomed to relying on the immediate semantic imprint the moment they see an
image.
An important distinction we have discussed is that between broad and narrow
domains. The broader the domain, the more browsing or search by association
should be considered during system set-up. The narrower the domain, the more
target search should be taken as search mode.
The major discrepancy in content-based retrieval is that the user wants to re-
trieve images on a semantic level, but the image characterizations can only provide
similarity on a low-level syntactic level. This is called the semantic gap. Fur-
thermore, another discrepancy is that between the properties in an image and the
properties of the object. This is called the sensory gap. Both the semantic and
sensory gap play a serious limiting role in the retrieval of images based on their
content.
1.3.1 Color

Color has been an active area of research in image retrieval, more than in any
other branch of computer vision. Color makes the image take values in a color
vector space. The choice of a color system is of great importance for the purpose
of proper image retrieval. It induces the equivalence classes for the actual
retrieval algorithm. However, no color system can be considered universal, because
color can be interpreted and modeled in different ways. Each color system has its
own set of color models, which are the parameters of the color system. Color
systems have been developed for different purposes: (1) display and printing
processes: RGB, CMY; (2) television and video transmission efficiency: YIQ, YUV;
(3) color standardization: XYZ; (4) color decorrelation: I1I2I3; (5) color
normalization and representation: rgb, xyz; (6) perceptual uniformity: U∗V∗W∗,
L∗a∗b∗, L∗u∗v∗; and (7) intuitive description: HSI, HSV. With this large variety of
color systems, the inevitable question arises which color system to use for which
kind of image retrieval application. To this end, criteria are required to classify
the various color systems.
between two neighboring image locations, for $C \in \{R, G, B\}$, where $\vec{x}_1$ and $\vec{x}_2$ denote the image locations of the two neighboring pixels.

The color ratios of [60] are given by:

$$M(C_1^{\vec{x}_1}, C_1^{\vec{x}_2}, C_2^{\vec{x}_1}, C_2^{\vec{x}_2}) = \frac{C_1^{\vec{x}_1} C_2^{\vec{x}_2}}{C_1^{\vec{x}_2} C_2^{\vec{x}_1}},$$

expressing the color ratio between two neighboring image locations, for $C_1, C_2 \in \{R, G, B\}$, where $\vec{x}_1$ and $\vec{x}_2$ denote the image locations of the two neighboring pixels. All these color ratios are device dependent, not perceptually uniform, and they become unstable when intensity is near zero. Further, N and F are dependent on the object geometry. M has no viewing and lighting dependencies. In [55] a thorough overview is given of color models for the purpose of image retrieval. Figure 1.5 shows the taxonomy of color models with respect to their characteristics. For more information we refer to [55].
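A numerical sketch of the ratio M for one channel pair, assuming an RGB image stored as a numpy array; the horizontal neighbor offset and the small epsilon guarding the near-zero instability noted above are implementation choices:

    import numpy as np

    def color_ratio_M(img, c1=0, c2=1, eps=1e-6):
        """Cross ratio M between horizontally neighboring pixels for channels
        c1, c2, following [60]; unstable when intensity is near zero."""
        a = img[:, :-1, c1].astype(float)   # C1 at location x1
        b = img[:, 1:, c1].astype(float)    # C1 at neighbor x2
        c = img[:, :-1, c2].astype(float)   # C2 at location x1
        d = img[:, 1:, c2].astype(float)    # C2 at neighbor x2
        return (a * d + eps) / (b * c + eps)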
Rather than invariant descriptions, another approach to cope with the inequal-
ities in observation due to surface reflection is to search for clusters in a color
histogram of the image. In the RGB-histogram, clusters of pixels reflected off an
object form elongated streaks. Hence, in [126], a non-parametric cluster algorithm
in RGB-space is used to identify which pixels in the image originate from one
uniformly colored object.
1.3.2 Shape
Under the name 'local shape' we collect all properties that capture conspicuous
geometric details in the image. We prefer the name local shape over other
characterizations such as differential geometrical properties, to denote the result
rather than the method.
Local shape characteristics derived from directional color derivatives have been
used in [117] to derive perceptually conspicuous details in highly textured patches
of diverse materials. A wide, rather unstructured variety of image detectors can be
found in [159].
In [61], a scheme is proposed to automatically detect and classify the physical
nature of edges in images using reflectance information. To achieve this, a
framework is given to compute edges by automatic gradient thresholding. Then, a
taxonomy is given of edge types, based upon the sensitivity of edges with respect
to different imaging variables. A parameter-free edge classifier is provided,
labeling color transitions into one of the following types: (1) shadow-geometry
edges, (2) highlight edges, (3) material edges. In figure 1.6.a, six frames are
shown from a standard video often used as a test sequence in the literature. It
shows a person against a textured background playing ping-pong. The size of the
image is 260×135. The images are of low quality. The frames are clearly
contaminated by shadows, shading and inter-reflections. Note that each individual
object part (i.e. T-shirt, wall and table) is painted homogeneously with a distinct
color. Further, note that the wall is highly textured. The results of the proposed
reflectance-based edge classifier are shown in figure 1.6.b-d. For more details see
[61].
Combining shape and color, both in invariant fashion, is a powerful combination,
as described by [58], where the colors inside and outside affine curvature maxima
in color edges are stored to identify objects.
Scale space theory was devised as the complete and unique primary step in
pre-attentive vision, capturing all conspicuous information [178]. It provides the
theoretical basis for the detection of conspicuous details on any scale. In [109] a
series of Gabor filters of different directions and scales has been used to enhance
image properties [136]. Conspicuous shape geometric invariants are presented in
[135]. A method employing local shape and intensity information for viewpoint and
occlusion invariant object retrieval is given in [143]. The method relies on voting
among a complete family of differential geometric invariants. Also, [170] searches
for differential affine-invariant descriptors. From surface reflection, in [5] the local
sign of the Gaussian curvature is computed, while making no assumptions on the
albedo or the model of diffuse reflectance.
1.3.3 Texture

In computer vision, texture is considered as all that is left after color and local
shape have been considered, or it is given in terms of structure and randomness.
Many common textures are composed of small textons, usually too large in number to
be perceived as isolated objects. The elements can be placed more or less regularly
Figure 1.6. Frames from a video showing a person against a textured background
playing ping-pong. From left to right: a. original color frame; b. classified
edges; c. material edges; d. shadow and geometry edges.
some semantic correspondent. The lowest levels of the wavelet transforms [33], [22]
have been applied to texture representation [96], [156], sometimes in conjunction
with Markovian analysis [21]. Other transforms have also been explored, most no-
tably fractals [41]. A solid comparative study on texture classification from mostly
transform-based properties can be found in [133].
When the goal is to retrieve images containing objects having an irregular texture
organization, the spatial organization of these texture primitives is, in the worst
case, random. It has been demonstrated that for irregular texture, the comparison
of gradient distributions achieves satisfactory accuracy [122], [130], as opposed
to fractal or wavelet features. Therefore, most of the work on texture image
retrieval is stochastic in nature [12], [124], [190]. However, these methods rely
on grey-value information, which is very sensitive to the imaging conditions. In
[56] the aim is to achieve content-based image retrieval of textured objects in
natural scenes under varying illumination and viewing conditions. To achieve this,
image retrieval is based on matching feature distributions derived from color
invariant gradients. To cope with object cluttering, region-based texture
segmentation is applied to the target images prior to the actual image retrieval
process. In Figure 1.7, results are shown of color invariant texture segmentation
for image retrieval. From the results, we can observe that RGB and the normalized
colors θ1θ2 are highly sensitive to a change in illumination color. Only M is
insensitive to a change in illumination color. For more information we refer to
[56].
Texture search has also proved useful in satellite images [100] and images of
documents [31]. Textures have also served as a support feature for
segmentation-based recognition [106], but the texture properties discussed so far
offer little semantic referent. They are therefore ill-suited for retrieval
applications in which the user wants to use verbal descriptions of the image.
Therefore, in retrieval research, in [104] the Wold features of periodicity,
directionality, and randomness are used, which agree reasonably well with
linguistic descriptions of textures as implemented in [127].
1.3.4 Discussion
First of all, image processing in content-based retrieval should primarily be
engaged in enhancing the image information of the query, not in describing the
content of the image in its entirety.
To enhance the image information, retrieval has set the spotlight on color, as
color has a high discriminatory power among objects in a scene, much higher than
gray levels. The purpose of most image color processing is to reduce the influence
of the accidental conditions of the scene and sensing (i.e. the sensory gap).
Progress has been made in tailored color space representations for well-described
classes of variant conditions. Also, the application of geometrical descriptions
derived from scale space theory will reveal viewpoint and scene independent salient
point sets, thus opening the way to similarity of images based on a few most
informative regions or points.
In this chapter, we have made a separation between color, local geometry and
texture. At this point it is safe to conclude that the division is an artificial
labeling. For example, wavelets say something about the local shape as well as the
texture, as do scale space and local filter strategies. For the purposes of
content-based retrieval, an integrated view on color, texture and local geometry is
urgently needed, as only an integrated view on local properties can provide the
means to distinguish among hundreds of thousands of different images. A recent
advancement in that direction is the fusion of illumination and scale invariant
color and texture information into a consistent set of localized properties [74].
Also, in [16], homogeneous regions are represented as collections of ellipsoids of
uniform color or texture, but invariant texture properties deserve more attention
[167] and [177]. Further research is needed in the design of complete sets of image
properties with well-described variant conditions which they are capable of
handling.
[65] roughly partitions the Munsell color space into eleven color zones. Similar
partitioning have been proposed by [29] and [24].
Another approach, proposed by [161], is the introduction of the cumulative color
histogram, which generates denser vectors. This makes it possible to cope with
coarsely quantized color spaces. [186] proposes a variation of cumulative
histograms by applying them to each sub-space.
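A sketch of a cumulative histogram for one color channel; the bin count is an arbitrary choice:

    import numpy as np

    def cumulative_color_histogram(channel, bins=16):
        """Cumulative histogram of one color channel: bin i counts all pixels
        with values up to bin i, yielding a dense, monotone feature vector."""
        h, _ = np.histogram(channel, bins=bins, range=(0, 256))
        h = h.astype(float) / channel.size
        return np.cumsum(h)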
Other approaches are based on the computation of moments of each color channel.
For example, [6] represents color regions by the first three moments of the color
models in the HSV-space. Instead of constructing histograms from color invariants,
[73], [45] propose the computation of illumination-invariant moments from color
histograms. In a similar way, [153] computes the color features from small object
regions instead of the entire object.
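A sketch of the first three moments for a single color channel; taking the cube root is one common way to keep the skewness in the same units as the data:

    import numpy as np

    def color_moments(channel):
        """First three moments of a color channel: mean, standard deviation,
        and skewness (cube root of the third central moment)."""
        x = channel.astype(float).ravel()
        mean = x.mean()
        std = x.std()
        third = np.mean((x - mean) ** 3)
        return mean, std, np.cbrt(third)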
[85] proposes the use of an integrated wavelet decomposition. The color features
are formed by wavelet coefficients together with their energy distribution among
channels and quantization layers. Similar approaches based on wavelets have been
proposed by [175], [101].
All of this is in favor of the use of histograms. When very large data sets are at
stake, plain histogram comparison will saturate the discrimination. For a 64-bin
histogram, experiments show that for reasonable conditions, the discriminatory
power among images is limited to 25,000 images [160]. To keep up performance, in
[125] a joint histogram is used, providing discrimination among 250,000 images in
their database, rendering 80% recall among the best 10 for two shots from the same
scene using simple features. Other joint histograms add local texture or local
shape [68], directed edges [87], and local higher order structures [47].
Another alternative is to add a dimension representing the local distance. This is
the correlogram [80], defined as a 3-dimensional histogram where the colors of any
pixel pair are indexed along the first and second dimensions and the spatial
distance between them along the third. The autocorrelogram, defining the distances
between pixels of identical colors, is found on the diagonal of the correlogram. A
more general version is the geometric histogram [1], with the normal histogram, the
correlogram and several alternatives as special cases. This also includes the
histogram of the triangular pixel values, reported to outperform all of the above
as it contains more information.
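A simplified autocorrelogram sketch for a color-quantized image; for brevity only horizontal and vertical pixel pairs are sampled, whereas the definition in [80] uses the full neighborhood at each distance:

    import numpy as np

    def autocorrelogram(q, n_colors, distances=(1, 3, 5)):
        """For each color c and distance d, estimate the probability that a
        pixel at offset d from a c-colored pixel also has color c."""
        out = np.zeros((n_colors, len(distances)))
        for j, d in enumerate(distances):
            pairs = [(q[:, :-d], q[:, d:]),   # horizontal neighbors
                     (q[:-d, :], q[d:, :])]   # vertical neighbors
            for c in range(n_colors):
                same = total = 0
                for a, b in pairs:
                    mask = (a == c)
                    total += mask.sum()
                    same += (b[mask] == c).sum()
                out[c, j] = same / total if total else 0.0
        return out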
A different view on accumulative features is to demand that all information (or
all relevant information) in the image is preserved in the feature values. When
the bit content of the features is less than that of the original image, this boils
down to compression transforms. Many compression transforms are known, but the
quest is for transforms simultaneously suited as retrieval features. As proper
querying for similarity is based on a suitable distance function between images,
the transform has to be applied on a metric space. Moreover, the components of the
transform have to correspond to semantically meaningful characteristics of the
image. And, finally, the transform should admit indexing in compressed form,
yielding a big computational advantage over first untransforming the image. [144]
is just one of many where the cosine-based JPEG coding scheme is used for image
retrieval.
The JPEG transform fulfills the first and third requirements but fails on the lack
of semantics. In the MPEG standard, the possibility of including semantic
descriptors in the compression transform is introduced [27]. For an overview of
feature indexes in the compressed domain, see [108]. In [96], a wavelet packet was
applied to texture images and, for each packet, entropy and energy measures were
determined and collected in a feature vector. In [83], vector quantization was
applied in the space of coefficients to reduce its dimensionality. This approach
was extended to incorporate the metric of the color space in [141]. In [86] a
wavelet transform was applied independently to the three channels of a color image,
and only the sign of the most significant coefficients is retained. In [3], a
scheme is offered for a broad spectrum of invariant descriptors suitable for
application on Fourier, wavelets and splines, and for geometry and color alike.
Fixed Partitioning
The simplest way is to use a fixed image decomposition in which an image is
partitioned into equally sized segments. The disadvantage of a fixed partitioning
is that the blocks usually do not correspond with the visual content of an image.
For example, [65] splits an image into nine equally sized sub-images, where each
sub-region is represented by a color histogram. [67] segments the image by a
quadtree, and [99] uses a multi-resolution representation of each image. [36] also
uses a 4×4 grid to segment the image. [148] partitions images into three layers,
where the first layer is the whole image, the second layer is a 3×3 grid and the
third layer a 5×5 grid. A similar approach is proposed by [107], where three levels
of a quadtree are used to segment the images. [37] proposes the use of
inter-hierarchical distances measuring the difference between color vectors of a
region and its sub-segments. [20] uses an augmented color histogram capturing the
spatial information between pixels together with the color distribution. In [59]
the aim is to combine color and shape invariants for indexing and retrieving
images. Color invariant edges are derived, from which shape invariant features are
computed. Then computational methods are described to combine the color and shape
invariants into a unified high-dimensional histogram for discriminatory object
retrieval. [81] proposes the use of color correlograms for image retrieval. Color
correlograms integrate the spatial information of colors by expressing the
probability that a pixel of color ci lies at a certain distance from a
pixel of color cj. It is shown that color correlograms are robust to a change in
background, occlusion, and scale (camera zoom). [23] introduces the spatial
chromatic histograms, where for every color the percentage of pixels having that
color is computed. Further, the spatial information is encoded by the barycenter of
the spatial distribution and the corresponding deviation.
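A sketch of fixed partitioning with one joint color histogram per block, here a 3×3 grid as in the second layer of [148]; grid and bin sizes are free parameters:

    import numpy as np

    def grid_histograms(image, grid=(3, 3), bins=8):
        """Fixed partitioning: split the image into equally sized blocks and
        describe each block by its own normalized joint color histogram."""
        H, W = image.shape[:2]
        feats = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                block = image[i*H//grid[0]:(i+1)*H//grid[0],
                              j*W//grid[1]:(j+1)*W//grid[1]]
                h, _ = np.histogramdd(block.reshape(-1, 3),
                                      bins=(bins,)*3, range=((0, 256),)*3)
                feats.append(h.ravel() / h.sum())
        return np.concatenate(feats)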
Region-based Partitioning
Segmentation is a computational method to assess the set of points in an image
which represent one object in the scene. As discussed before, many different
computational techniques exist, none of which is capable of handling any reasonable
set of real world images. However, in this case, weak segmentation may be
sufficient to recognize an object in a scene. Therefore, in [12] an image
representation is proposed providing a transformation from the raw pixel data to a
small set of image regions which are coherent in color and texture space. This
so-called Blobworld representation is based on segmentation using the
Expectation-Maximization algorithm on combined color and texture features. In the
Picasso system [13], a competitive learning clustering algorithm is used to obtain
a multiresolution representation of color regions. In this way, colors are
represented in the l∗u∗v∗ space through a set of 128 reference colors as obtained
by the clustering algorithm. [63] proposes a method based on matching feature
distributions derived from color ratio gradients. To cope with object cluttering,
region-based texture segmentation is applied to the target images prior to the
actual image retrieval process. [26] segments the image first into homogeneous
regions by split and merge using a color distribution homogeneity condition. Then,
histogram intersection is used to express the degree of similarity between pairs of
image regions.
its main components, making the matching between images of the same object easier.
Automatic identification of salient regions in the image based on non-parametric
clustering, followed by decomposition of the shapes found into limbs, is explored
in [50].
1.4.7 Discussion
General content-based retrieval systems have dealt with segmentation brittleness in
a few ways. First, a weaker version of segmentation has been introduced in
content-based retrieval. In weak segmentation the result is a homogeneous region by
some criterion, but not necessarily covering the complete object silhouette. It
results in a fuzzy, blobby description of objects rather than a precise
segmentation. Salient features of the weak segments capture the essential
information of the object in a nutshell. The extreme form of weak segmentation is
the selection of a salient point set as the ultimately efficient data reduction in
the representation of an object, very much like the focus-of-attention algorithms
of an earlier age. Only points on the interior of the object can be used for
identifying the object, and conspicuous points at the borders of objects have to be
ignored. Little work has been done on how to make the selection. Weak segmentation
and salient features are a typical innovation of content-based retrieval. It is
expected that salience will receive much attention in the further expansion of the
field, especially when computational considerations gain in importance.
The alternative is to do no segmentation at all. Content-based retrieval has gained
from the use of accumulative features, computed on the global image or on
partitionings thereof disregarding the content, the most notable being the
histogram. Where most attention has gone to color histograms, histograms of local
geometric properties and texture are following. To compensate for the complete loss
of spatial information, recently the geometric histogram was defined with an
additional dimension for the spatial layout of pixel properties. As it is a
superset of the histogram, an improved discriminability for large data sets is
anticipated. When accumulative features are calculated from the central part of a
photograph, they may be very effective in telling photographs apart by topic, but
the center does not always reveal the purpose. Likewise, features calculated from
the top part of a picture may be effective in telling indoor scenes from outdoor
scenes, but again this holds to a limited degree. A danger of accumulative features
is their inability to discriminate among different entities and semantic meanings
in the image. More work on semantic-driven groupings will increase the power of
accumulative descriptors to capture the content of the image.
Structural descriptions match well with weak segmentation, salient regions and weak
semantics. One has to be certain that the structure is within one object and not an
accidental combination of patches which have no meaning in the object world. The
same brittleness of strong segmentation lurks here. We expect a sharp increase in
research on local, partial or fuzzy structural descriptors for the purpose of
content-based retrieval, especially in broad domains.
on the basis of some feature set. The similarity measure depends on the type of
features.
At its best use, the similarity measure can be manipulated to represent different
semantic contents; images are then grouped by similarity in such a way that close
images are similar with respect to use and purpose. A common assumption is that the
similarity between two feature vectors F can be expressed by a positive,
monotonically non-increasing function. This assumption is consistent with a class
of psychological models of human similarity perception [152], [142], and requires
that the feature space be metric. If the feature space is a vector space, d often
is a simple Euclidean distance, although there are indications that more complex
distance measures might be necessary [142]. This similarity model was well suited
for early query-by-example systems, in which images were ordered by similarity with
one example.
A different view sees similarity as an essentially probabilistic concept. This view
is rooted in the psychological literature [8], and in the context of content-based
retrieval it has been proposed, for example, in [116].
Measuring the distance between histograms has been an active line of research since
the early years of content-based retrieval, where histograms can be seen as a set
of ordered features. In content-based retrieval, histograms have mostly been used
in conjunction with color features, but there is nothing against using them for
texture or local geometric properties.
Various distance functions have been proposed. Some of these are general functions
such as Euclidean distance and cosine distance. Others are specially designed for
image retrieval, such as histogram intersection [162]. The Minkowski-form distance
for two vectors or histograms $\vec{k}$ and $\vec{l}$ with dimension $n$ is given by:

$$D_M(\vec{k}, \vec{l}) = \left( \sum_{i=1}^{n} |k_i - l_i|^\rho \right)^{1/\rho} \quad (1.5.1)$$
where $\phi$ is the angle between the vectors $\vec{k}$ and $\vec{l}$. When the two
vectors have equal directions, the cosine will add to one. The angle $\phi$ can
also be described as a function of $\vec{k}$ and $\vec{l}$:

$$\cos \phi = \frac{\vec{k} \cdot \vec{l}}{\|\vec{k}\| \, \|\vec{l}\|} \quad (1.5.4)$$
The cosine distance is well suited for features that are real vectors and not a col-
lection of independent scalar features.
The histogram intersection distance compares two histograms $\vec{k}$ and $\vec{l}$
of $n$ bins by taking the intersection of both histograms:

$$D_H(\vec{k}, \vec{l}) = 1 - \frac{\sum_{i=1}^{n} \min(k_i, l_i)}{\sum_{i=1}^{n} k_i} \quad (1.5.5)$$

When considering images of different sizes, this distance function is not a metric
due to $D_H(\vec{k}, \vec{l}) \neq D_H(\vec{l}, \vec{k})$. In order to become a
valid distance metric, histograms need to be normalized first:

$$\vec{k}^n = \frac{\vec{k}}{\sum_{i=1}^{n} k_i} \quad (1.5.6)$$
For normalized histograms (total sum of 1), the histogram intersection is given by:

$$D_H^n(\vec{k}^n, \vec{l}^n) = 1 - \sum_{i=1}^{n} |k_i^n - l_i^n| \quad (1.5.7)$$

The normalized cross correlation has a maximum of unity, which occurs if and only
if $\vec{k}$ exactly matches $\vec{l}$.
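Sketches of the three distances above, assuming histograms are given as numpy arrays:

    import numpy as np

    def minkowski(k, l, rho=2):
        """Minkowski-form distance (1.5.1); rho=1 gives L1, rho=2 Euclidean."""
        return (np.abs(k - l) ** rho).sum() ** (1.0 / rho)

    def cosine_distance(k, l):
        """One minus the cosine of the angle between k and l (cf. 1.5.4)."""
        return 1.0 - np.dot(k, l) / (np.linalg.norm(k) * np.linalg.norm(l))

    def histogram_intersection(k, l):
        """Histogram intersection distance (1.5.5); asymmetric for
        unnormalized histograms of images of different sizes."""
        return 1.0 - np.minimum(k, l).sum() / k.sum()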
In the QBIC system [42], the weighted Euclidean distance has been used for the
similarity of color histograms. In fact, the distance measure is based on the
correlation between histograms $\vec{k}$ and $\vec{l}$, where $\vec{k}_{avg}$ and
$\vec{l}_{avg}$ denote the 3×1 average color vectors of $\vec{k}$ and $\vec{l}$.
As stated before, for broad domains, a proper similarity measure should be robust
to object fragmentation, occlusion and clutter by the presence of other objects in
the view. In [58], various similarity functions were compared for color-based
histogram matching. From these results, it is concluded that the retrieval accuracy
of similarity functions depends on the presence of object clutter in the scene. The
histogram cross correlation provides the best retrieval accuracy without any object
clutter (narrow domain). This is due to the fact that this similarity function is
symmetric and can be interpreted as the number of pixels with the same values in
the query image which can be found present in the retrieved image and vice versa.
This is a desirable property when one object per image is recorded without any
object clutter. In the presence of object clutter (broad domain), the highest image
retrieval accuracy is provided by the quadratic similarity function (e.g. histogram
intersection). This is because this similarity measure counts the number of similar
hits and hence is insensitive to false positives.
Finally, the natural measures to compare ordered sets of accumulative features are
non-parametric test statistics. They can be applied to the distributions of the
coefficients of transforms to determine the likelihood that two samples derive from
the same distribution [14], [131]. They can also be applied to compare the equality
of two histograms and all variations thereof.
1.5.6 Discussion
Whenever the image itself permits an obvious interpretation, the ideal content-
based system should employ that information. A strong semantic interpretation
occurs when a sign can be positively identified in the image. This is rarely the case
due to the large variety of signs in a broad class of images and the enormity of
the task to define a reliable detection algorithm for each of them. Weak semantics
rely on inexact categorization induced by similarity measures, preferably online by
interaction. The categorization may agree with semantic concepts of the user, but
the agreement is in general imperfect. Therefore, the use of weak semantics is
usually paired with the ability to gear the semantics of the user to his or her needs
by interpretation. Tunable semantics is likely to receive more attention in the future
especially when data sets grow big.
Similarity is an interpretation of the image based on the difference with another
image. For each of the feature types a different similarity measure is needed. For
similarity between feature sets, special attention has gone to establishing similarity
among histograms due to their computational efficiency and retrieval effectiveness.
Similarity of shape has received considerable attention in the context of
object-based retrieval. Generally, global shape matching schemes break down when
there is occlusion or clutter in the scene. Most global shape comparison methods
implicitly require a frontal viewpoint against a clear enough background to achieve
a sufficiently precise segmentation. With the recent inclusion of perceptually
robust points in the shape of objects, an important step forward has been made.
Similarity of hierarchically ordered descriptions deserves considerable attention,
as it is one mechanism to circumvent the problems with segmentation while main-
taining some of the semantically meaningful relationships in the image. Part of
the difficulty here is to provide matching of partial disturbances in the hierarchical
order and the influence of sensor-related variances in the description.
1.6.3 Learning
As data sets grow big and the processing power matches that growth, the oppor-
tunity arises to learn from experience. Rather than designing, implementing and
testing an algorithm to detect the visual characteristics for each different semantic
term, the aim is to learn from the appearance of objects directly.
For a review on statistical pattern recognition, see [2]. In [174] a variety of
techniques is discussed treating retrieval as a classification problem.
One approach is principal component analysis over a stack of images taken from the
same class z of objects. This can be done in feature space [120] or at the level of
the entire image, for example faces in [115]. The analysis yields a set of
eigenface images, capturing the common characteristics of a face without having a
geometric model.
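A sketch of the eigenface computation via SVD of the centered image stack; the number of retained components is a free parameter:

    import numpy as np

    def eigenfaces(images, n_components=16):
        """PCA over a stack of same-class images (each flattened to a row):
        the top right-singular vectors capture the common appearance."""
        X = np.stack([im.ravel().astype(float) for im in images])
        mean = X.mean(axis=0)
        # SVD of the centered stack; rows of Vt are the principal components
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:n_components]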
Effective ways to learn from partially labeled data have recently been introduced
in [183], [32] both using the principle of transduction [173]. This saves the effort of
labeling the entire data set, infeasible and unreliable as it grows big.
In [169] a very large number of pre-computed features is considered, of which a
small subset is selected by boosting [2] to learn the image class.
An interesting technique borrowed from information retrieval, called latent
semantic indexing [146], [187], bridges the gap between textual and pictorial
descriptions by exploiting information at the level of documents. First a corpus is
formed of documents (in this case images with a caption) from which features are
computed. Then, by singular value decomposition, the dictionary covering the
captions is correlated with the features derived from the pictures. The search is
for hidden correlations of features and captions.
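A sketch of the latent semantic indexing idea, assuming a term-by-document matrix from the captions and a feature-by-document matrix from the pictures; the joint SVD shown here is one simple realization, and the cited works may differ in detail:

    import numpy as np

    def latent_semantic_space(term_doc, visual_doc, rank=10):
        """Joint latent space over captions and visual features: stack the two
        matrices and keep the top singular directions, in which caption terms
        and visual features co-vary across the document corpus."""
        joint = np.vstack([term_doc, visual_doc])  # (terms+features) x documents
        U, S, Vt = np.linalg.svd(joint, full_matrices=False)
        return U[:, :rank], S[:rank], Vt[:rank]    # rank-reduced decomposition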
1.6.4 Discussion

Learning computational models for semantics is an interesting and relatively new
approach. It gains attention quickly as the data sets and the machine power grow.
Learning opens up the possibility of an interpretation of the image without
designing and testing a detector for each new notion. One such approach is
appearance-based learning of the common characteristics of stacks of images from
the same class. Appearance-based learning is suited for narrow domains. For the
success of the learning approach there is a trade-off between standardizing the
objects in the data set and the size of the data set. The more standardized the
data are, the less data will be needed, but, on the other hand, the less broadly
applicable the result will be. Interesting approaches to derive semantic classes
from captions, or from a partially labeled or unlabeled data set, have been
presented recently, see above.
1.7 Conclusion
In this chapter, we have presented an overview on the theory, techniques and appli-
cations of content-based image retrieval. We took patterns of use and computation
as the pivotal building blocks of our survey.
Bibliography
[39] J.P. Eakins, J.M. Boardman, and M.E. Graham. Similarity retrieval of trademark images. IEEE Multimedia, 5(2):53–63, 1998.
[41] L.M. Kaplan et al. Fast texture database retrieval using extended fractal features. In I. Sethi and R. Jain, editors, Proceedings of SPIE vol. 3312, Storage and Retrieval for Image and Video Databases VI, pages 162–173, 1998.
[42] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9), 1995.
[43] R. Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58(1):83–99, 1999.
[44] J. Favella and V. Meza. Image-retrieval agent: Integrating image content and text. 1999.
[46] G.D. Finlayson, M.S. Drew, and B.V. Funt. Spectral sharpening: Sensor transformation for improved color constancy. JOSA, 11:1553–1563, 1994.
[48] D. Forsyth. Novel algorithm for color constancy. International Journal of Computer Vision, 5:5–36, 1990.
[49] D.A. Forsyth and M.M. Fleck. Automatic detection of human nudes. International Journal of Computer Vision, 32(1):63–77, 1999.
[50] G. Frederix and E.J. Pauwels. Automatic interpretation based on robust segmentation and shape extraction. In D.P. Huijsmans and A.W.M. Smeulders, editors, Proceedings of Visual 99, International Conference on Visual Information Systems, volume 1614 of Lecture Notes in Computer Science, pages 769–776, 1999.
[51] C-S. Fuh, S-W. Cho, and K. Essig. Hierarchical color image region segmentation for content-based image retrieval system. IEEE Transactions on Image Processing, 9(1):156–163, 2000.
[52] B.V. Funt and M.S. Drew. Color constancy computation in near-Mondrian scenes. In Computer Vision and Pattern Recognition, pages 544–549, 1988.
[53] B.V. Funt and G.D. Finlayson. Color constant color indexing. IEEE Transactions on PAMI, 17(5):522–529, 1995.
[54] J.M. Geusebroek, A.W.M. Smeulders, and R. van den Boomgaard. Measurement of color invariants. In Computer Vision and Pattern Recognition. IEEE Press, 2000.
[55] Th. Gevers. Color based image retrieval. In Multimedia Search. Springer Verlag, 2000.
[56] Th. Gevers. Image segmentation and matching of color-texture objects. IEEE Trans. on Multimedia, 4(4), 2002.
[57] Th. Gevers and A.W.M. Smeulders. Color based object recognition. Pattern Recognition, 32(3):453–464, 1999.
[58] Th. Gevers and A.W.M. Smeulders. Content-based image retrieval by viewpoint-invariant image indexing. Image and Vision Computing, 17(7):475–488, 1999.
[59] Th. Gevers and A.W.M. Smeulders. PicToSeek: combining color and shape invariant features for image retrieval. IEEE Transactions on Image Processing, 9(1):102–119, 2000.
[60] Th. Gevers and A.W.M. Smeulders. Color based object recognition. Pattern Recognition, 32:453–464, 1999.
[61] Th. Gevers and H.M.G. Stokman. Classification of color edges in video into shadow-geometry, highlight, or material transitions. IEEE Trans. on Multimedia, 5(2), 2003.
[62] Th. Gevers and H.M.G. Stokman. Robust histogram construction from color invariants for object recognition. IEEE Transactions on PAMI, 25(10), 2003.
[63] Th. Gevers, P. Vreman, and J. van der Weijer. Color constant texture segmentation. In IS&T/SPIE Symposium on Electronic Imaging: Internet Imaging I, 2000.
[64] G.L. Gimel'farb and A.K. Jain. On retrieving textured images from an image database. Pattern Recognition, 29(9):1461–1483, 1996.
[65] Y. Gong, C.H. Chuan, and G. Xiaoyi. Image indexing and retrieval using color histograms. Multimedia Tools and Applications, 2:133–156, 1996.
[66] C.C. Gottlieb and H.E. Kreyszig. Texture descriptors based on co-occurrence matrices. Computer Vision, Graphics, and Image Processing, 51, 1990.
[67] L.J. Guibas, B. Rogoff, and C. Tomasi. Fixed-window image descriptors for image retrieval. In IS&T/SPIE Symposium on Electronic Imaging: Storage and Retrieval for Image and Video Databases III, pages 352–362, 1995.
[69] A. Guttman. R-trees: A dynamic index structure for spatial searching. In ACM SIGMOD, pages 47–57, 1984.
[70] J. Hafner, H.S. Sawhney, W. Equit, M. Flickner, and W. Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on PAMI, 17(7):729–736, 1995.
[71] M. Hagendoorn and R.C. Veltkamp. Reliable and efficient pattern matching using an affine invariant metric. International Journal of Computer Vision, 35(3):203–225, 1999.
[75] K. Hirata and T. Kato. Rough sketch-based image information retrieval. NEC Research and Development, 34(2):263–273, 1992.
[77] N.R. Howe and D.P. Huttenlocher. Integrating color, texture, and geometry for image retrieval. In Computer Vision and Pattern Recognition, pages 239–247, 2000.
[78] C.C. Hsu, W.W. Chu, and R.K. Taira. A knowledge-based approach for retrieving images by content. IEEE Transactions on Knowledge and Data Engineering, 8(4):522–532, 1996.
[80] J. Huang, S.R. Kumar, M. Mitra, W-J. Zhu, and R. Zabih. Spatial color indexing and applications. International Journal of Computer Vision, 35(3):245–268, 1999.
[81] J. Huang, S.R. Kumar, M. Mitra, W-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Computer Vision and Pattern Recognition, pages 762–768, 1997.
[82] B. Huet and E.R. Hancock. Line pattern retrieval using relational histograms. IEEE Transactions on PAMI, 21(12):1363–1371, 1999.
[83] F. Idris and S. Panchanathan. Image indexing using wavelet vector quantization. In Proceedings of SPIE Vol. 2606, Digital Image Storage and Archiving Systems, pages 269–275, 1995.
[84] L. Itti, C. Koch, and E. Niebur. A model for saliency-based visual attention for rapid scene analysis. IEEE Transactions on PAMI, 20(11):1254–1259, 1998.
[85] C.E. Jacobs, A. Finkelstein, and D.H. Salesin. Fast multiresolution image querying. In Computer Graphics, 1995.
[86] C.E. Jacobs, A. Finkelstein, and D.H. Salesin. Fast multiresolution image querying. In Proceedings of SIGGRAPH 95, Los Angeles, CA. ACM SIGGRAPH, New York, 1995.
[87] A.K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition, 29(8):1233–1244, 1996.
[88] A.K. Jain and A. Vailaya. Shape-based retrieval: A case study with trademark image databases. Pattern Recognition, 31(9):1369–1390, 1998.
[89] L. Jia and L. Kitchen. Object-based image similarity computation using inductive learning of contour-segment relations. IEEE Transactions on Image Processing, 9(1):80–87, 2000.
[91] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image database - query by visual example. In Proceedings of the ICPR, Computer Vision and Applications, The Hague, pages 530–533, 1992.
[92] J.R. Kender. Saturation, hue, and normalized colors: Calculation, digitization effects, and use. Technical report, Department of Computer Science, Carnegie Mellon University, 1976.
[123] P. Pala and S. Santini. Image retrieval by shape and texture. Pattern Recognition, 32(3):517–527, 1999.
[124] D.K. Panjwani and G. Healey. Markov random field models for unsupervised segmentation of textured color images. IEEE Transactions on PAMI, 17(10):939–954, 1995.
[125] G. Pass and R. Zabih. Comparing images using joint histograms. Multimedia Systems, 7:234–240, 1999.
[129] R.W. Picard and T.P. Minka. Vision texture for annotation.
[133] T. Randen and J. Håkon Husøy. Filtering for texture classification: a comparative study. IEEE Transactions on PAMI, 21(4):291–310, 1999.
[134] E. Riloff and L. Hollaar. Text databases and information retrieval. ACM Computing Surveys, 28(1):133–135, 1996.
[135] E. Rivlin and I. Weiss. Local invariants for recognition. IEEE Transactions on PAMI, 17(3):226–238, 1995.
[167] T. Tan. Rotation invariant texture features and their use in automatic script identification. IEEE Transactions on PAMI, 20(7):751–756, 1998.
[169] K. Tieu and P. Viola. Boosting image retrieval. In Computer Vision and Pattern Recognition, pages 228–235, 2000.
[170] T. Tuytelaars and L. van Gool. Content-based image retrieval based on local affinely invariant regions. In Proceedings of Visual Information and Information Systems, pages 493–500, 1999.
[173] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[177] L.Z. Wang and G. Healey. Using Zernike moments for the illumination and geometry invariant classification of multispectral texture. IEEE Transactions on Image Processing, 7(2):196–203, 1998.
[178] J. Weickert, S. Ishikawa, and A. Imiya. Linear scale space has first been proposed in Japan. Journal of Mathematical Imaging and Vision, 10:237–252, 1999.
[179] M. Werman and D. Weinshall. Similarity and affine invariant distances between 2D point sets. IEEE Transactions on PAMI, 17(8):810–814, 1995.
[182] H.J. Wolfson and I. Rigoutsos. Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4):10–21, 1997.
[184] H.H. Yu and W. Wolf. Scene classification methods for image and video databases. In Proc. SPIE on Digital Image Storage and Archiving Systems, pages 363–371, 1995.
[186] Y.J. Zhang, Z.W. Liu, and Y. He. Comparison and improvement of color-based image retrieval. In IS&T/SPIE Symposium on Electronic Imaging: Storage and Retrieval for Image and Video Databases IV, pages 371–382, 1996.
[187] R. Zhao and W. Grosky. Locating text in complex color images. In IEEE International Conference on Multimedia Computing and Systems, 2000.
[188] Y. Zhong, K. Karu, and A.K. Jain. Locating text in complex color images. Pattern Recognition, 28(10):1523–1535, 1995.
[189] P. Zhu and P.M. Chirlian. On critical point detection of digital shapes. IEEE Transactions on PAMI, 17(8):737–748, 1995.
[190] S.C. Zhu and A. Yuille. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on PAMI, 18(9):884–900, 1996.