Gemini Algorithm


GEMINI

GEneric Multimedia INdexIng

 GEneric Multimedia INdexIng


 distance measure
 Sub-pattern Match
 ‘quick and dirty’ test
 Lower bounding lemma

 1-D Time Sequences


 Color histograms
 Color auto-correlogram
 Shapes

GEneric Multimedia INdexIng
 Given a database of multimedia objects
 Design fast search algorithms that locate objects that
match a query object, exactly or approximately
 Objects:
• 1-d time sequences
• Digitized voice or music
• 2-d color images
• 2-d or 3-d gray scale medical images
• Video clips

 E.g.: “Find companies whose stock prices move similarly”

Applications
 time series:
• financial, marketing (click-streams!), ECGs,
sound;
 images:
• medicine, digital libraries, education, art
 higher-d signals:
• scientific db (e.g., astrophysics), medicine (MRI
scans), entertainment (video)

Sample queries

 Find medical cases similar to Smith's


 Find pairs of stocks that move in sync
 Find pairs of documents that are similar
(plagiarism?)
 Find faces similar to ‘Tiger Woods’

[Figure: three stock price series, each plotting $price against day (1 to 365)]

distance function: by expert
(e.g., Euclidean distance)

Generic Multimedia Indexing

 1st step: provide a measure for the distance between two objects
 Distance function d():
• Given two objects O1, O2, the distance (= dissimilarity) of the two objects is denoted by

d(O1, O2)

E.g., Euclidean distance (square root of the sum of squared
differences) of two equal-length time series

ε-Similarity query
 Given a query object Q, find all objects Oi
from the database that are ε-similar (identical
for ε = 0) to Q

 {Oi ∈ DB | d(Q, Oi) ≤ ε}
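The ε-similarity query above can be sketched as a plain sequential scan. This is a minimal illustration, assuming objects are equal-length lists of numbers and d() is the Euclidean distance; the names `euclidean` and `range_query` are made up for this sketch and do not come from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def range_query(db, q, eps):
    """Return the indices i with d(q, db[i]) <= eps, by sequential scan."""
    return [i for i, obj in enumerate(db) if euclidean(q, obj) <= eps]

# Toy database of three short sequences (illustrative values only).
db = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [9.0, 9.0, 9.0]]
hits = range_query(db, [1.0, 2.0, 3.0], 1.0)
```

Every object is compared against the query here, which is exactly the cost that the rest of the method tries to avoid.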

Types of Similarity Queries

 Whole match queries:


• Given a collection of S objects O1,…, Os and a query
object Q find data objects that are within distance ε
from Q

Types of Similarity Queries

 Sub-pattern Match:
• Given a collection of S objects O1,…, OS and a query
(sub-) object Q and a tolerance ε identify the parts of the
data objects that match the query Q
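A naive way to picture sub-pattern matching is a sliding window over each data object. The sketch below assumes 1-d sequences and Euclidean distance, and the function name `subpattern_matches` is hypothetical; it only illustrates the query semantics, not an indexed search.

```python
import math

def subpattern_matches(series, q, eps):
    """Return (offset, distance) for every window of len(q) within eps of q."""
    w = len(q)
    matches = []
    for start in range(len(series) - w + 1):
        window = series[start:start + w]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
        if d <= eps:
            matches.append((start, d))
    return matches

# Illustrative data: the pattern [1, 2, 1] occurs exactly at offset 1
# and approximately at offset 5.
series = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.1]
matches = subpattern_matches(series, [1.0, 2.0, 1.0], 0.2)
offsets = [start for start, _ in matches]
```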

Ideal method – requirements

 Fast: sequential scanning and distance calculation with
each and every object is too slow for large databases

 Dynamic: easy to insert, delete, and update objects

Basic idea
 Focus on ‘whole match’ queries
• Given a collection of S objects O1,…, Os, a
distance/dis-similarity function d(Oi, Oj), and a
query object Q find data objects that are within
distance ε from Q
 Sequential scanning?
May be too slow, for the following reasons:
• Distance computation is expensive (e.g., edit
distance in DNA strings)
• The database size S may be huge
 Faster alternative?

GEneric Multimedia INdexIng

Christos Faloutsos, QBIC 1994

• A feature extraction function maps the high-dimensional
objects into a low-dimensional space
• Objects that are very dissimilar in the feature space are
also very dissimilar in the original space

Basic idea
 Faster alternative:
 Step 1: a ‘quick and dirty’ test to discard quickly
the vast majority of non-qualifying objects
 Step 2: use of SAMs (R-trees, Hilbert-Curve,..) to
achieve faster than sequential searching
 Example:
 Database of yearly stock price movements
• Euclidean distance function
• Characterize with a single number (‘feature’)
• Or use two or more features

Basic idea - illustration
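[Figure: each sequence S1, ..., Sn ($price over days 1 to 365) is mapped to a point F(S1), ..., F(Sn) in the feature space with axes Feature1 and Feature2]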

 A query with tolerance ε becomes a sphere with radius ε

Basic idea – caution!


 The mapping F() from objects to k-dim. points
should not distort the distances
 d(): distance of two objects
 dfeature(): distance of their corresponding feature
vectors
 Ideally, perfect preservation of distances
 In practice, a guarantee of no false dismissals
 How?

 Objects represented by vectors that are very
dissimilar in the feature space are expected to
be very dissimilar in the original space

 If the distances in the feature space are always
less than or equal to the distances in the
original space, a bound that is valid in both
spaces can be determined

 The distance of similar objects is less than
or equal to ε in the original space and,
consequently, it is less than or equal to ε in
the feature space as well

Lower bounding lemma

 If the distance of similar objects is less than
or equal to ε in the original space,
 then it is also less than or equal to ε in the
feature space

$d_{feature}(F(O_1), F(O_2)) \le d(O_1, O_2) \le \varepsilon$

o.k.: $d(O_1, O_2) \le \varepsilon \implies d_{feature}(F(O_1), F(O_2)) \le \varepsilon$
WRONG!: $d_{feature}(F(O_1), F(O_2)) \le \varepsilon \implies d(O_1, O_2) \le \varepsilon$
$d_{feature}(F(O_1), F(O_2)) \le \varepsilon$ alone only gives $d(O_1, O_2) \le\ ?$

 No qualifying object will be missed in the feature space
(no false dismissals)
 Some selected objects will not be similar
in the original space (false hits/alarms)

 That means that we are guaranteed to have
selected all the objects we wanted plus some
additional false hits in the feature space

 In the second step, false hits have to be filtered
from the set of the selected objects through
comparison in the original space
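The two steps can be sketched as a filter-and-refine loop. The sketch below assumes Euclidean distance and a single feature, the mean scaled by sqrt(n); by the Cauchy-Schwarz inequality this feature distance never exceeds the true distance, so the lemma holds. The function names are hypothetical, and a plain list scan stands in for the spatial access method (e.g., an R-tree) that a real system would use in step 1.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def feature(x):
    """One-number feature: sqrt(n) * mean(x) = sum(x) / sqrt(n).

    |feature(x) - feature(y)| <= euclidean(x, y) by Cauchy-Schwarz,
    so this feature lower-bounds the true distance.
    """
    return sum(x) / math.sqrt(len(x))

def gemini_range_query(db, features, q, eps):
    fq = feature(q)
    # Step 1: quick-and-dirty filter in feature space (no false dismissals).
    candidates = [i for i, f in enumerate(features) if abs(f - fq) <= eps]
    # Step 2: remove false alarms by checking the true distance.
    answers = [i for i in candidates if euclidean(db[i], q) <= eps]
    return candidates, answers

# Illustrative data: object 1 has the same mean as the query but a
# different shape, so it survives step 1 and is discarded in step 2.
db = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0],
      [1.0, 2.0, 3.0, 5.0], [9.0, 9.0, 9.0, 9.0]]
features = [feature(obj) for obj in db]
candidates, answers = gemini_range_query(db, features, [1.0, 2.0, 3.0, 4.0], 1.0)
```

The candidate set is a superset of the answer set, which is the practical meaning of "no false dismissals, some false alarms".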

Time sequences
[Figure: white noise and brown noise time sequences, their Fourier spectra, and the spectra in log-log scale]

Time sequences
 Conclusion: colored noises are well
approximated by their first few Fourier
coefficients

 Colored noises appear in nature

Time sequences
 Eg.:

GEMINI
Important:

Q: how to extract features?


A: “if I have only one number to describe
my object, what should this be?”

1-D Time Sequences
 Distance function: Euclidean distance
 Find features that:
 Preserve/lower-bound the distance
 Carry as much information as possible (reduce false
alarms)
 If we are allowed to use only one feature what
would this be? The average
 … extending it…

1-D Time Sequences

 … extending it:
 The average of the 1st half, of the 2nd half, of the
1st quarter, etc.
 Coefficients of the Fourier transform (DFT),
wavelet transform, etc.

Feature extracting function
1. Define a distance function
2. Find a feature extraction function F() that
satisfies the bounding lemma

Example:
 The Discrete Fourier Transform (DFT) preserves
Euclidean distances between signals (Parseval's
theorem)
 F() = DFT, keeping only the first few coefficients of the
transform

1-D Time Sequences


Show that the distance in feature space lower-bounds the actual
distance
DFT?
Parseval’s Theorem: DFT (with orthonormal 1/√n scaling) preserves the
energy of the signal as well as the distances between two signals
d(x,y) = d(X,Y)
where X and Y are the Fourier transforms of x and y

If we keep the first k ≤ n coefficients of DFT we lower-bound the
actual distance

$d_{feature}^2(F(x), F(y)) = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 \equiv d^2(x, y)$
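The inequality can be checked numerically on a small example. The sketch below uses a direct O(n^2) DFT with 1/sqrt(n) scaling, chosen so that Parseval's theorem holds without extra constants; the helper names and the two sample sequences are assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*j*f*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

def dft_feature(x, k):
    """Feature vector: the first k DFT coefficients."""
    return dft(x)[:k]

def dist(u, v):
    """Euclidean distance; abs() handles real and complex entries alike."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
y = [2.0, 2.0, 3.0, 3.0, 5.0, 5.0, 6.0, 8.0]
d_true = dist(x, y)                       # distance in the time domain
d_full = dist(dft(x), dft(y))             # equals d_true by Parseval
d_feat = [dist(dft_feature(x, k), dft_feature(y, k)) for k in range(1, len(x) + 1)]
```

Each entry of `d_feat` stays at or below `d_true`, and the bound tightens as more coefficients are kept.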

Time sequences - results
 keep the first 2-3 Fourier coefficients
 faster than seq. scan
 no false dismissals
[Figure: total time, cleanup time, and R-tree time vs. number of coefficients kept]

Time sequences - improvements:

 could use Wavelets, or DCT
 could use segment averages
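Segment averages can also be made to satisfy the lower-bounding lemma. The sketch below assumes the sequence length is divisible by the number of segments k and scales the feature distance by the square root of the segment length, which makes it a lower bound on the Euclidean distance (Cauchy-Schwarz applied per segment). The function names are hypothetical.

```python
import math

def segment_means(x, k):
    """Split x into k equal-length segments and return each segment's mean."""
    n = len(x)
    if n % k != 0:
        raise ValueError("this sketch assumes len(x) is divisible by k")
    m = n // k
    return [sum(x[j * m:(j + 1) * m]) / m for j in range(k)]

def segment_distance(fx, fy, seg_len):
    """Feature distance scaled by sqrt(seg_len) so that it lower-bounds d(x, y)."""
    return math.sqrt(seg_len * sum((a - b) ** 2 for a, b in zip(fx, fy)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
y = [2.0, 2.0, 3.0, 3.0, 5.0, 5.0, 6.0, 8.0]
d_true = euclidean(x, y)
bounds = {k: segment_distance(segment_means(x, k), segment_means(y, k), len(x) // k)
          for k in (1, 2, 4, 8)}
```

With k = 1 this reduces to the single-average feature; with k = n the bound equals the true distance.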

Images - color

what is an image?
A: 2-d array

2-D color images – Color histograms

 Each color image – a 2-d array of pixels
 Each pixel – 3 color components (R,G,B)
 h colors – each color denoting a point in 3-d color
space (as high as 2^24 colors)
 For each image compute the h-element color
histogram – each component is the percentage of
pixels that are most similar to that color
 The histogram of image I is defined as:
For a color Ci, H_Ci(I) represents the number of pixels of
color Ci in image I
OR:
For any pixel in image I, H_Ci(I) represents the probability of
that pixel having color Ci.
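A normalized color histogram can be sketched as follows. Uniform quantization of each channel into a few bins is an assumption made here for brevity; the function name and the bin layout are illustrative, not taken from the slides.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Return the fraction of pixels falling into each of bins_per_channel**3 color bins.

    pixels: non-empty list of (R, G, B) tuples with components in 0..255.
    """
    b = bins_per_channel
    hist = [0] * (b ** 3)
    for r, g, bl in pixels:
        # Quantize each channel uniformly into b levels.
        ri, gi, bi = (r * b) // 256, (g * b) // 256, (bl * b) // 256
        hist[(ri * b + gi) * b + bi] += 1
    total = len(pixels)
    return [count / total for count in hist]

# Four-pixel toy image: two reddish pixels, one blue, one green.
image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 255, 0)]
hist = color_histogram(image)
```

Each component is the probability that a randomly chosen pixel falls in that color bin, matching the second definition above.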

2-D color images – Color histograms
 Usually cluster similar colors together and choose one
representative color for each ‘color bin’
 Most commercial CBIR systems include color histogram as
one of the features (e.g., QBIC of IBM)
 No spatial information

Color histograms - distance

 One method to measure the distance between two
histograms x and y is:

$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i}^{h} \sum_{j}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$

where the color-to-color similarity matrix A has entries
a_ij that describe the similarity between color i and color j
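The quadratic-form distance can be written out directly from the double sum. In this sketch the 3-color similarity matrix `A` is invented for illustration (colors 0 and 1 are treated as half-similar, color 2 as unrelated); only the formula itself comes from the slide.

```python
import math

def quadratic_form_distance(x, y, A):
    """d_h(x, y) = sqrt((x - y)^t A (x - y)) for histograms x, y."""
    z = [a - b for a, b in zip(x, y)]
    h = len(z)
    total = sum(A[i][j] * z[i] * z[j] for i in range(h) for j in range(h))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(total, 0.0))

# Hypothetical similarity matrix: a_ij = 1 on the diagonal,
# 0.5 between the two similar colors 0 and 1, 0 otherwise.
A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
only_color0 = [1.0, 0.0, 0.0]
only_color1 = [0.0, 1.0, 0.0]
only_color2 = [0.0, 0.0, 1.0]
d_similar = quadratic_form_distance(only_color0, only_color1, A)
d_unrelated = quadratic_form_distance(only_color0, only_color2, A)
```

The cross terms are what let two images in similar but not identical colors end up closer than two images in unrelated colors.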

Images - color
Mathematically, the distance function is:

Color histograms – lower bounding


 1st step: define the distance function between two color
images d()=dh()
 2nd step: find numerical features (one or more) whose
Euclidean distance lower-bounds dh()
 If we are allowed to use one numerical feature to describe
the color image, what should it be?
 Avg. amount for each color component (R,G,B)
 $x = (R_{avg}, G_{avg}, B_{avg})^t$
 where $R_{avg} = (1/P) \sum_{p=1}^{P} R(p)$, and similarly for G and B

where P is the number of pixels in the image and R(p) is the red
component (intensity) of the p-th pixel

Color histograms – lower bounding
 Given the average color vectors x and y of two images, we define davg() as
the Euclidean distance between the 3-d average color vectors

$d_{avg}^2(x, y) = (x - y)^t (x - y) = \sum_{i=1}^{3} (x_i - y_i)^2$

 3rd step: prove that the feature distance davg() lower-bounds the actual
distance dh()...
• ...by the ``Quadratic Distance Bounding'' theorem it is guaranteed that the
distance between the histograms of two images is greater than or equal to the
distance between their average color vectors. The proof of the
``Quadratic Distance Bounding'' theorem is based upon the unconstrained
minimization problem using Lagrange multipliers
 Main idea of approach:
 First a filtering using the average (R,G,B) color,
 then a more accurate matching using the full h-element histogram

Images - color
performance:

[Figure: time vs. selectivity, sequential scan compared with filtering w/ avg RGB]

Color auto-correlogram
 pick any pixel p1 of color Ci in the image I
 at distance k away from p1 pick another pixel p2
 what is the probability that p2 is also of color Ci ?
[Figure: image I with a red pixel P1 and a pixel P2 at distance k from it; is P2 also red?]

Color auto-correlogram
 The auto-correlogram of image I for color Ci,
distance k:

$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\,|p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i}\,]$

 Integrates both color information and spatial
information
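A brute-force estimate of the auto-correlogram for one color and one distance might look like the sketch below. It assumes the image is a small 2-d list of color labels and measures pixel distance with the chessboard (D8) distance; the function name and the two toy images are made up for illustration.

```python
def auto_correlogram(image, color, k):
    """Estimate Pr[p2 has `color` | p1 has `color`, chessboard distance(p1, p2) == k].

    image: list of rows, each a list of color labels.
    """
    height, width = len(image), len(image[0])
    same = total = 0
    for y1 in range(height):
        for x1 in range(width):
            if image[y1][x1] != color:
                continue
            # Visit every offset whose chessboard distance is exactly k.
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    if max(abs(dx), abs(dy)) != k:
                        continue
                    y2, x2 = y1 + dy, x1 + dx
                    if 0 <= y2 < height and 0 <= x2 < width:
                        total += 1
                        if image[y2][x2] == color:
                            same += 1
    return same / total if total else 0.0

# Two toy images with identical histograms (8 pixels of color 1 each)
# but different spatial layouts.
halves = [[1, 1, 0, 0] for _ in range(4)]
checker = [[(x + y + 1) % 2 for x in range(4)] for y in range(4)]
g_halves = auto_correlogram(halves, 1, 1)
g_checker = auto_correlogram(checker, 1, 1)
```

A histogram cannot tell these two images apart, while the correlogram values differ because the layouts differ.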

Color auto-correlogram

Implementations
 Pixel Distance Measures
 Use D8 distance (also called chessboard distance):

$d_{max}(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$

 Choose distance k=1,3,5,7
 Computation complexity:
• Histogram: O(n^2)
• Correlogram: O(134 * n^2)

Implementations
 Features Distance Measures:
 D( f(I1) - f(I2) ) is small ⇒ I1 and I2 are similar
 m = R,G,B; k = distance

 For histogram:

$|I - I'|_h \equiv \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$

 For correlogram:

$|I - I'|_\gamma \equiv \sum_{i \in [m], k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$
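Both measures share the same form, so a single helper covers them. The sketch below assumes each feature is passed as a flat list: one entry per color for a histogram, or one entry per (color, k) pair for a correlogram. The name `relative_l1` is an invention for this example.

```python
def relative_l1(f1, f2):
    """Sum of |a - b| / (1 + a + b) over corresponding feature entries."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))

# Illustrative 3-bin histograms of two images.
h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
d = relative_l1(h1, h2)
```

The denominator damps the contribution of entries that are large in both images, so differences in rare colors weigh relatively more.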

Color Histogram vs Correlogram

[Figure: a query image (512 colors) with the top five matches (1st to 5th) returned by the correlogram method and by the histogram method]

Color Histogram vs Correlogram

[Figure: query image and target image]

Rank of the target: correlogram method 1st, histogram method 48th

Color Histogram vs Correlogram

[Figure: query image and target image]

Rank of the target: correlogram method 1st, histogram method 31st

Color Histogram vs Correlogram

Target rank   Query 1   Query 2   Query 3   Query 4
C:            178th     1st       1st       5th
H:            230th     1st       3rd       18th

(C: correlogram method, H: histogram method)

 The correlogram method is more stable to contrast &
brightness change than the histogram method.

Color Histogram vs Correlogram

 The color correlogram describes the global
distribution of local spatial correlations of colors.
 It’s easy to compute
 It’s more stable than the color histogram method

Images - shapes
 Distance function: Euclidean, on the area

 Q: how to do dim. reduction?

 A: Karhunen-Loeve (PCA)

Images - shapes
 Performance: ~10x faster

[Figure: log(# of I/Os) vs. # of features kept, compared with keeping all features]

Multimedia Indexing – Conclusions

 GEMINI is a popular method


 Whole matching problem
 Should pay attention to:
• Distance functions
• Feature Extraction functions
• Lower Bounding
• Particular application

Conclusions
 GEMINI works for any setting (time
sequences, images, etc)
 uses a ‘quick and dirty’ filter

 faster than seq. scan

