Gemini Algorithm
GEneric Multimedia INdexIng
Given a database of multimedia objects
Design fast search algorithms that locate objects that
match a query object, exactly or approximately
Objects:
• 1-d time sequences
• Digitized voice or music
• 2-d color images
• 2-d or 3-d gray scale medical images
• Video clips
Applications
time series:
• financial, marketing (click-streams!), ECGs,
sound;
images:
• medicine, digital libraries, education, art
higher-d signals:
• scientific db (e.g., astrophysics), medicine (MRI
scans), entertainment (video)
Sample queries
[Figure: stock-price time series, each plotting $price against day (1 to 365); similarity between series is measured by a distance function (e.g., Euclidean distance)]
Generic Multimedia Indexing
Distance function between two objects: d(O1, O2)
ε-Similarity query
Given a query object Q, find all objects Oi
in the database that are ε-similar (identical
for ε = 0) to Q
Types of Similarity Queries
Sub-pattern Match:
• Given a collection of S objects O1,…, OS, a query
(sub-)object Q, and a tolerance ε, identify the parts of the
data objects that match the query Q
Ideal method – requirements
Basic idea
Focus on ‘whole match’ queries
• Given a collection of S objects O1,…, OS, a
distance/dissimilarity function d(Oi, Oj), and a
query object Q, find the data objects that are within
distance ε from Q
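The whole-match query above can be answered naively by scanning every object. The sketch below is a minimal baseline, assuming objects are equal-length numeric sequences and d() is the Euclidean distance; the names `euclidean` and `sequential_range_query` are illustrative, not from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_range_query(database, query, eps, dist=euclidean):
    """Return the indices of all objects within distance eps of the query.

    dist() is evaluated once per database object, so the cost grows with
    both the database size S and the cost of one distance computation.
    """
    return [i for i, obj in enumerate(database) if dist(obj, query) <= eps]

# Toy database of three short sequences (illustrative values).
database = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [9.0, 9.0, 9.0]]
query = [1.0, 2.0, 3.0]
hits = sequential_range_query(database, query, eps=1.0)
print(hits)
```

With ε = 0 the same call returns only exact matches.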
Sequential scanning?
May be too slow, for the following
reasons:
• Distance computation is expensive (e.g., edit
distance in DNA strings)
• The Database size S may be huge
Faster alternative?
GEneric Multimedia INdexIng
Christos Faloutsos, QBIC 1994
Basic idea
Faster alternative:
Step 1: a ‘quick and dirty’ test to quickly discard
the vast majority of non-qualifying objects
Step 2: use of spatial access methods (SAMs: R-trees, Hilbert curves, ...) to
search faster than a sequential scan
Example:
Database of yearly stock price movements
• Euclidean distance function
• Characterize with a single number (‘feature’)
• Or use two or more features
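Basic idea - illustration
[Figure: each sequence S1, ..., Sn (values over days 1 to 365) is mapped by F() to a point F(S1), ..., F(Sn) in a feature space with axes Feature1 and Feature2]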
Objects represented by vectors that are very
dissimilar in the feature space are expected to
be very dissimilar in the original space
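Lower bounding lemma
To guarantee no false dismissals, the distance in feature space must lower-bound the actual distance:
d_feature(F(O1), F(O2)) ≤ d(O1, O2)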
That means that we are guaranteed to have
selected all the objects we wanted plus some
additional false hits in the feature space
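The filter-and-refine behaviour described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the feature function `first_two` (keep the first two values) is a hypothetical choice that trivially lower-bounds the Euclidean distance, and a plain scan over the feature vectors stands in for the spatial access method.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def first_two(obj):
    """Hypothetical feature function: a projection onto the first two values.
    Dropping coordinates can only shrink a Euclidean distance, so the
    feature distance lower-bounds the true distance."""
    return obj[:2]

def gemini_range_query(database, query, eps, feature_fn, dist, feature_dist):
    """Step 1: cheap test in feature space. Step 2: exact check on survivors."""
    q_feat = feature_fn(query)
    # Filter: the lower-bounding property means no qualifying object is dropped.
    candidates = [i for i, obj in enumerate(database)
                  if feature_dist(feature_fn(obj), q_feat) <= eps]
    # Refine: the exact distance removes the false hits.
    hits = [i for i in candidates if dist(database[i], query) <= eps]
    return candidates, hits

database = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 4.0], [5.0, 5.0, 0.0, 0.0]]
query = [0.0, 0.0, 0.0, 0.0]
candidates, hits = gemini_range_query(database, query, 1.0,
                                      first_two, euclidean, euclidean)
print(candidates, hits)
```

Object 1 passes the filter but fails the exact check: it is a false hit, which costs extra work but never a wrong answer.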
Time sequences
[Figure: white noise and brown noise sequences with their Fourier spectra, in log-log scale]
Conclusion: colored noises are well
approximated by their first few Fourier
coefficients
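A small experiment can make this conclusion concrete. The sketch below assumes brown noise is modelled as a running sum of Gaussian white noise and uses a naive DFT from the standard library; the function name `low_frequency_energy_fraction`, the sequence length, and the seed are illustrative choices.

```python
import cmath
import math
import random

def low_frequency_energy_fraction(x, k):
    """Fraction of the signal energy held by DFT frequencies 0..k-1.

    With 1/sqrt(n) scaling, Parseval's theorem gives total spectral energy
    equal to sum(x_t^2). For a real signal, frequency f and its mirror n-f
    carry equal energy, so each non-DC term is counted twice.
    Requires k - 1 < n / 2.
    """
    n = len(x)
    total = sum(v * v for v in x)
    energy = 0.0
    for f in range(k):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n)) / math.sqrt(n)
        energy += abs(coeff) ** 2 * (1 if f == 0 else 2)
    return energy / total

rng = random.Random(0)  # fixed seed so the run is repeatable
white = [rng.gauss(0.0, 1.0) for _ in range(128)]
brown = []
level = 0.0
for step in white:  # brown noise: running sum of white noise
    level += step
    brown.append(level)

white_fraction = low_frequency_energy_fraction(white, 4)
brown_fraction = low_frequency_energy_fraction(brown, 4)
print(white_fraction, brown_fraction)
```

The brown-noise fraction should come out much larger than the white-noise one, which is why a few low-frequency coefficients make useful features for such sequences.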
1-D Time Sequences
Distance function: Euclidean distance
Find features that:
Preserve/lower-bound the distance
Carry as much information as possible (reduce false
alarms)
If we are allowed to use only one feature, what
would this be? The average
… extending it…
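Why the average works as a single feature can be checked with a short derivation. The notation below is an assumption for illustration: x and y are sequences of length n with averages \bar{x} and \bar{y}, and d is the Euclidean distance. The inequality step is the Cauchy-Schwarz inequality.

```latex
\begin{aligned}
(\bar{x} - \bar{y})^2
  &= \frac{1}{n^2} \Big( \sum_{i=1}^{n} (x_i - y_i) \Big)^2 \\
  &\le \frac{1}{n^2} \cdot n \sum_{i=1}^{n} (x_i - y_i)^2
   = \frac{1}{n}\, d^2(x, y) \\
\Rightarrow \quad \sqrt{n}\, |\bar{x} - \bar{y}| &\le d(x, y)
\end{aligned}
```

So the scaled difference of averages never exceeds the true distance, which is exactly the lower-bounding condition.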
Feature extracting function
1. Define a distance function
2. Find a feature extraction function F() that
satisfies the bounding lemma
Example:
The Discrete Fourier Transform (DFT) preserves
Euclidean distances between signals (Parseval's
theorem)
F() = DFT, keeping only the first few coefficients of the
transform
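The lower-bounding property of truncated DFT features can be verified numerically. The sketch below is a minimal check, assuming a naive O(n^2) DFT with 1/sqrt(n) scaling (the scaling under which Parseval's theorem holds with equality); the function names and sample sequences are illustrative.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT with 1/sqrt(n) scaling, so that the transform
    preserves Euclidean distances (Parseval's theorem)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

def dft_features(x, k):
    """F(): keep only the first k DFT coefficients."""
    return dft(x)[:k]

def complex_distance(fa, fb):
    """Euclidean distance between two vectors of complex coefficients."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.0, 4.5, 2.0, 2.5, 0.5, 1.0]

true_distance = euclidean(x, y)
full_spectrum_distance = complex_distance(dft(x), dft(y))
feature_distance = complex_distance(dft_features(x, 3), dft_features(y, 3))
print(true_distance, full_spectrum_distance, feature_distance)
```

Keeping all coefficients reproduces the true distance; dropping coefficients removes non-negative terms from the sum, so the feature distance can only be smaller.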
Time sequences - results
keep the first 2-3 Fourier coefficients
faster than seq. scan
no false dismissals
[Figure: total time, cleanup time, and R-tree time versus number of coefficients kept]
Time sequences - improvements:
Images - color
what is an image?
A: 2-d array
2-D color images – Color histograms
Usually cluster similar colors together and choose one
representative color for each ‘color bin’
Most commercial CBIR systems include color histogram as
one of the features (e.g., QBIC of IBM)
No spatial information
Images - color
Mathematically, the distance function between two h-bin color histograms x and y is:
$$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$$
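The quadratic-form histogram distance d_h can be sketched in a few lines. The 3-bin matrix A below is a hypothetical example in which bins 0 and 1 hold similar colors (a_01 = 0.5); it is not taken from the slides.

```python
def quadratic_form_distance_sq(x, y, a):
    """Squared histogram distance (x - y)^t A (x - y), written as the
    double sum over a[i][j] * (x_i - y_i) * (x_j - y_j)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    h = len(diff)
    return sum(a[i][j] * diff[i] * diff[j] for i in range(h) for j in range(h))

# Hypothetical similarity matrix: bins 0 and 1 are perceptually close.
A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

x = [1.0, 0.0, 0.0]            # all pixels in bin 0
y_similar = [0.0, 1.0, 0.0]    # all pixels in the similar bin 1
y_different = [0.0, 0.0, 1.0]  # all pixels in the unrelated bin 2

d_similar = quadratic_form_distance_sq(x, y_similar, A)
d_different = quadratic_form_distance_sq(x, y_different, A)
print(d_similar, d_different)  # 1.0 2.0
```

The cross terms a_ij make a shift between similar bins cheaper than a shift between unrelated bins, which a plain Euclidean distance on histograms would not capture.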
Color histograms – lower bounding
Given the average color vectors \bar{x} and \bar{y} of two images, we define davg() as
the Euclidean distance between the 3-d average color vectors:
$$d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2$$
3rd step: prove that the feature distance davg() lower-bounds the actual
distance dh()
• By the "Quadratic Distance Bounding" theorem, the distance between the
vectors representing the histograms is guaranteed to be greater than or equal to the
distance between the average colors of the images. The proof of the
theorem is based upon the unconstrained
minimization problem using Lagrange multipliers
Main idea of approach:
First a filtering using the average (R,G,B) color,
then a more accurate matching using the full h-element histogram
Images - color
performance:
[Figure: time versus selectivity for a sequential scan and for filtering with average RGB]
Color auto-correlogram
pick any pixel p1 of color Ci in the image I
at distance k away from p1 pick another pixel p2
what is the probability that p2 is also of color Ci ?
[Figure: image I with a pixel P1 and a pixel P2 at distance k from it; is P2 also red?]
Color auto-correlogram
The auto-correlogram of image I for color Ci,
distance k:
$$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\,|p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i}\,]$$
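A brute-force estimate of this probability is sketched below. It assumes the image is a small 2-d list of color labels and that pixel distance is the chessboard (D8) distance; the function names are illustrative, and this quadratic-time loop is meant only to show the definition, not an efficient implementation.

```python
def d8(p, q):
    """Chessboard (D8) distance between two pixel coordinates (row, col)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def auto_correlogram(image, color, k):
    """Estimate the probability that a pixel at D8 distance k from a pixel
    of the given color also has that color.

    Counts every ordered pair (p1, p2) with p1 of the given color and
    d8(p1, p2) == k, and returns the share of pairs where p2 matches.
    """
    pixels = [(r, c) for r in range(len(image)) for c in range(len(image[0]))]
    same = 0
    total = 0
    for p1 in pixels:
        if image[p1[0]][p1[1]] != color:
            continue
        for p2 in pixels:
            if d8(p1, p2) == k:
                total += 1
                if image[p2[0]][p2[1]] == color:
                    same += 1
    return same / total if total else 0.0

# 2x2 toy image: red top row, blue bottom row.
image = [["R", "R"],
         ["B", "B"]]
value = auto_correlogram(image, "R", 1)
print(value)
```

For the toy image, each red pixel has three neighbours at distance 1, of which one is red, so the estimate is 1/3. Unlike a histogram, the value changes if the same colors are rearranged spatially.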
Color auto-correlogram
Implementations
Pixel Distance Measures
Use D8 distance (also called chessboard distance): d8(p, q) = max(|px - qx|, |py - qy|)
• Histogram: O(n^2)
• Correlogram: O(134 * n^2)
Implementations
Feature Distance Measures:
D( f(I1) - f(I2) ) is small ⇔ I1 and I2 are similar
i ∈ [m] ranges over the colors (R, G, B); k ∈ [d] ranges over the distances
For histogram:
$$|I - I'|_h \equiv \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$$
For correlogram:
$$|I - I'|_\gamma \equiv \sum_{i \in [m],\, k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$$
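Both feature distances share the same form: a sum of absolute differences, each divided by one plus the sum of the two values. A minimal sketch, assuming the features (histogram bins, or correlogram entries flattened over colors and distances) are given as equal-length lists of non-negative numbers; the function name is illustrative.

```python
def relative_l1_distance(f1, f2):
    """Sum over entries of |a - b| / (1 + a + b).

    The denominator damps differences between large entries, so no single
    dominant bin controls the distance.
    """
    return sum(abs(a - b) / (1.0 + a + b) for a, b in zip(f1, f2))

# Two normalized 3-bin histograms (illustrative values).
h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
dist = relative_l1_distance(h1, h2)
print(dist)
```

Here the first bin contributes nothing and the other two contribute 0.5 / 1.5 each, for a total of 2/3.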
[Figure: retrieval results for the correlogram method and the histogram method]
Color Histogram vs Correlogram
[Figure: query and target image pairs]
Images - shapes
Distance function: Euclidean, on the area
Dimensionality reduction: Karhunen-Loeve transform (PCA)
Images - shapes
Performance: ~10x faster
[Figure: log(# of I/Os) versus # of features kept, compared with keeping all features]
Multimedia Indexing – Conclusions
Conclusions
GEMINI works for any setting (time
sequences, images, etc)
uses a ‘quick and dirty’ filter
GEneric Multimedia INdexIng
distance measure
Sub-pattern Match
‘quick and dirty’ test
Lower bounding lemma