Chapter 3 Indexing Structures
IR ITec-4081
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more relevant documents for a user's query
–Effectiveness of the system is measured in terms of precision, recall, …
–Main emphasis: stemming, stop-word removal, weighting schemes, matching algorithms
• Searching
–An online process that scans the document corpus to find relevant documents that match the user's query
[Figure: IR system architecture with two subsystems.
Indexing subsystem: documents → assign document identifier → tokenize → stop-word removal (non-stoplist tokens) → stemming & normalization (stemmed terms) → term weighting (weighted index terms) → index file.
Searching subsystem: query → parse query (query tokens) → stop-word removal → stemming & normalization (query terms) → similarity measure against the index terms → ranking → ranked relevant document set.]
Text Compression
• Text compression is about finding ways to represent the text in fewer bits or bytes
• Advantages:
–Saves storage space
–Speeds up document transmission
–Takes less time to search the compressed text
• Common compression methods:
–Statistical methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•Estimate probabilities of symbols and code them one at a time, assigning shorter codes to high-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Ziv-Lempel compression
•Replace words or symbols with a pointer to a dictionary entry
Huffman coding
•Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space & decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
[Figure: example Huffman tree over symbols D1–D4, yielding the codes below]
Code of:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
How to construct Huffman coding
Step 1: Create a forest of single-node trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees according to falling probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
–Merge the two trees t1 and t2 with the least probabilities p1 and p2
–Label their root with the sum p1 + p2
–Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf
–Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
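The steps above can be sketched in Python. This is a minimal sketch using a heap for Step 2's ordering; the frequencies fed in at the bottom are those of the example table in this section, and the function name is illustrative.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code for a {symbol: frequency} map.

    A heap entry is (weight, tiebreak, tree); a tree is either a bare
    symbol or a (left, right) pair. Repeatedly merging the two
    least-probable trees implements Steps 1-3 above; Step 4 is the
    final root-to-leaf traversal that concatenates the 0s and 1s.
    """
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)                     # unique tiebreak ids
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"     # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

# frequencies taken from the example table in this section
codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
```

The resulting code is prefix-free, so decoding is unambiguous: no codeword is a prefix of another.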
Example
• Consider the 6-symbol alphabet given in the following table to construct the Huffman coding.

Symbol  Probability
a       45
b       13
c       12
d       16
e       9
f       5

• The Huffman encoding algorithm picks, each time, the two symbols with the smallest frequencies to combine.
Exercise
1. Given the following, apply the Huffman algorithm to find an optimal binary
code:
Character: a b c d e t
Frequency: 16 5 12 17 10 25
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather, it builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of code-words created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• Multi-symbol patterns have the form C0C1 . . . Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1 . . . Cn-1
Lempel-Ziv output: there are three options in assigning a code to each symbol in the list
• If a one-symbol pattern is not in the dictionary, assign (0, symbol)
• If a multi-symbol pattern is not in the dictionary, assign (dictionaryPrefixIndex, lastPatternSymbol)
• If the last input symbol or the last pattern is in the dictionary, assign (dictionaryPrefixIndex, )
Example: LZ Compression
Encode (i.e., compress) the following strings using the LZ algorithm:
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA
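The three output rules above can be turned into a short LZ78-style encoder and decoder. This is a minimal sketch; the function names are illustrative, not from the slides.

```python
def lz78_encode(text):
    """LZ78 compression: emit (dictionaryPrefixIndex, symbol) pairs.

    Index 0 stands for the empty prefix; every emitted pair also
    defines the next dictionary entry (prefix phrase + symbol).
    """
    dictionary = {}              # phrase -> dictionary index
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch         # keep extending a known pattern
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                   # last pattern is already in the dictionary
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    entries = [""]               # entry 0: the empty prefix
    text = []
    for idx, ch in pairs:
        entries.append(entries[idx] + ch)
        text.append(entries[idx] + ch)
    return "".join(text)

pairs = lz78_encode("ABBCBCABABCAABCAAB")
# [(0,'A'), (0,'B'), (2,'C'), (3,'A'), (2,'A'), (4,'A'), (6,'B')]
```

Tracing string 2: the dictionary grows as A(1), B(2), BC(3), BCA(4), BA(5), BCAA(6), BCAAB(7), and each pair references the longest previously seen prefix.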
Indexing structures: Exercise
• Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures.
• Output: a set of index terms (vocabulary) to be used for indexing the documents that each term occurs in.
Basic Indexing Process
[Figure: documents to be indexed (e.g. "Friends, Romans, countrymen.") → Tokenizer → token stream (Friends, Romans, countrymen) → Indexer → index file (inverted file) with postings such as friend → 2, 4; roman → 1, 2; countryman → 13, 16]
Building Index file
•An index file of a document collection is a file consisting of a list of index terms and, for each term, a link to one or more documents that contain that term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
•An index file is a list of search terms that are organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing an index file for a collection of documents, there are various options available:
–Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (Insertion time, Deletion time, modification time….)
• Space overhead
–Computer storage space consumed.
•Why location?
–Having information about the location of each term within the document helps for:
•user interface design: highlighting the location of search terms
•proximity-based ranking: adjacency and NEAR operators (in Boolean searching)
•Why frequencies?
–Having information about frequency is used for:
•calculating term weighting (like TF, TF*IDF, …)
•optimizing query processing
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Doc ID  TF  Location
auto   3   2       1   66
           19      1   213
           29      1   45
bus    4   3       1   94
           19      2   7, 212
           22      1   56
taxi   1   5       1   43
train  3   11      2   3, 70
           34      1   40
Construction of Inverted file
•An inverted index consists of two files:
–vocabulary file
–posting file
Advantage of dividing the inverted file:
•Keeping a pointer in the vocabulary to the corresponding list in the posting file allows:
–the vocabulary to be kept in memory at search time, even for a large text collection, and
–the posting file to be kept on disk for accessing the documents
Inverted index storage
•Separation of the inverted file into a vocabulary and a posting file is a good idea.
–Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
•The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
•Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the size of the index is 100 MB, which can easily be held in the memory of a dedicated computer.
[Figure: vocabulary entries (Act, Bus, pen, total) with their counts pointing to the corresponding inverted lists in the posting file]
Example:
• Given a collection of documents, they are parsed to extract words and these are saved with the document ID.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms:

Unsorted (parse order)        Sorted by term
Term       Doc #              Term       Doc #
I          1                  ambitious  2
did        1                  be         2
enact      1                  brutus     1
julius     1                  brutus     2
caesar     1                  capitol    1
I          1                  caesar     1
was        1                  caesar     2
killed     1                  caesar     2
I          1                  did        1
the        1                  enact      1
capitol    1                  hath       2
brutus     1                  I          1
killed     1                  I          1
me         1                  I          1
so         2                  it         2
let        2                  julius     1
it         2                  killed     1
be         2                  killed     1
with       2                  let        2
caesar     2                  me         1
the        2                  noble      2
noble      2                  so         2
brutus     2                  the        1
hath       2                  the        2
told       2                  told       2
you        2                  was        1
caesar     2                  was        2
was        2                  with       2
ambitious  2                  you        2
Remove stopwords, apply stemming & compute term frequency
•Multiple entries of a term in a single document are merged and frequency information is added
•Counting the number of occurrences of terms in the collection helps to compute TF

Term      Doc #         Term      Doc #  TF
ambition  2             ambition  2      1
brutus    1             brutus    1      1
brutus    2             brutus    2      1
capitol   1             capitol   1      1
caesar    1             caesar    1      1
caesar    2             caesar    2      2
caesar    2             enact     1      1
enact     1             julius    1      1
julius    1             kill      1      2
kill      1             noble     2      1
kill      1
noble     2
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a postings file; each dictionary entry keeps a pointer to its postings list.

Term      DF  CF  Postings (Doc #, TF)
ambition  1   1   (2, 1)
brutus    2   2   (1, 1) (2, 1)
capitol   1   1   (1, 1)
caesar    2   3   (1, 1) (2, 2)
enact     1   1   (1, 1)
julius    1   1   (1, 1)
kill      1   2   (1, 2)
noble     1   1   (2, 1)
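The dictionary/postings split above can be sketched in Python. This is a minimal sketch; the toy documents and the function name are illustrative, not from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Split an inverted file into a dictionary and a postings file.

    docs: {doc_id: text}. Returns (dictionary, postings) where
    dictionary[term] = (df, cf, pointer) and pointer indexes into
    postings, a flat list of (doc_id, tf) runs -- mirroring the
    vocabulary-in-memory / postings-on-disk split described above.
    """
    term_docs = defaultdict(lambda: defaultdict(int))  # term -> doc -> tf
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term_docs[token][doc_id] += 1
    dictionary, postings = {}, []
    for term in sorted(term_docs):            # keep the vocabulary sorted
        runs = sorted(term_docs[term].items())
        df = len(runs)                        # document frequency
        cf = sum(tf for _, tf in runs)        # collection frequency
        dictionary[term] = (df, cf, len(postings))
        postings.extend(runs)
    return dictionary, postings

# hypothetical two-document toy collection
docs = {1: "new home sales", 2: "home sales rise"}
dictionary, postings = build_inverted_index(docs)
```

Because the vocabulary is sorted, a query term can be located by binary search, and its (pointer, df) pair identifies exactly which slice of the postings file to read.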
Complexity Analysis
• The inverted index can be built in O(n) time
–n is the number of index terms processed
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms
Exercises
• Construct the inverted index for the following
document collections.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
Suffix Tree and Suffix Array
Reading Assignment
Suffix trie
• What is a suffix? A suffix is a substring that exists at the end of the given string.
–Each position in the text is considered as a text suffix
–If txt = t1t2...ti...tn is a string, then Ti = titi+1...tn is the suffix of txt that starts at position i
• Example:
txt = mississippi        txt = GOOGOL
T1 = mississippi         T1 = GOOGOL
T2 = ississippi          T2 = OOGOL
T3 = ssissippi           T3 = OGOL
T4 = sissippi            T4 = GOL
T5 = issippi             T5 = OL
T6 = ssippi              T6 = L
T7 = sippi
T8 = ippi
T9 = ppi
T10 = pi
T11 = i
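The suffix enumeration above takes only a couple of lines of Python:

```python
def suffixes(txt):
    """Return T1..Tn: every suffix Ti = txt[i-1:] of the text."""
    return [txt[i:] for i in range(len(txt))]

suffixes("GOOGOL")
# ['GOOGOL', 'OOGOL', 'OGOL', 'GOL', 'OL', 'L']
```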
Suffix trie
•A suffix trie is an ordinary trie whose input strings are all the suffixes of the text.
–Principle: the idea behind a suffix TRIE is to assign to each symbol in the text an index corresponding to its position in the text (i.e., the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
•To build the suffix TRIE we use these indices instead of the actual objects.
•The structure has several advantages:
–It requires less storage space.
–We do not have to worry about how the text is represented (binary, ASCII, etc.).
–We do not have to store the same object twice (no duplicates).
Suffix Trie
•Construct a SUFFIX TRIE for the following string: GOOGOL
•We begin by giving a position to every suffix in the text, starting from left to right as the characters occur in the string.
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: the resulting tree has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
•A suffix tree is a member of the trie family: it is a trie of all the proper suffixes of S.
–The suffix tree is created by compacting the unary nodes of the suffix TRIE.
•We store pointers rather than words in the leaves.
–It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indices of the string, i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
•Let s = abab; a suffix tree of s is a compressed trie of all suffixes of s$ = abab$
•We label each leaf with the starting point of the corresponding suffix.
Suffixes:
5  $
4  b$
3  ab$
2  bab$
1  abab$
[Figure: compressed trie with edge labels ab, b, $ and ab$; leaves labeled 1–5]
Complexity Analysis
• The suffix tree for a string can be built in O(n2) time with the naive algorithm.
• The search time is linear in the length of the query string S:
–searching for a substring S[1..m] in a string[1..n] can be solved in O(m) time, since it only requires walking down the tree for the length of the string, O(|S|).
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
•To make the suffixes prefix-free we add a special character, $, at the end of each s. To associate each suffix with a unique string in S, we add a different special symbol to each s
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
•Example: let s1 = abab and s2 = aab; the suffixes of s1$ are {abab$, bab$, ab$, b$, $} and the suffixes of s2# are {aab#, ab#, b#, #}
[Figure: generalized suffix tree for abab$ and aab#, with leaves labeled by suffix start positions]
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
–Start at the root
–Go down the tree, each time taking the corresponding path
–If S corresponds to a node x, then return all leaves in the sub-tree rooted at x
•the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x
–If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
Example:
• If S = "GO", we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR", we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
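The pseudo-code above can be sketched with an uncompressed suffix trie in Python. This is a sketch of the search idea, not the compacted suffix tree; the dict-based nodes and the "#pos" leaf marker are implementation choices of this example.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie: one node per character, built by
    inserting every suffix of text + '$'. Each leaf records the
    1-based starting position of its suffix."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#pos"] = start + 1          # leaf marker: suffix start

    return root

def find(trie, pattern):
    """Walk the pattern down from the root; a missing child is the
    NIL pointer of the pseudo-code. On success, collect every leaf
    position in the subtree: those are all occurrences."""
    node = trie
    for ch in pattern:
        if ch not in node:                # hit a NIL pointer
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "#pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("GOOGOL")
```

For the example above, `find(trie, "GO")` returns the start positions of GOOGOL$ and GOL$, while `find(trie, "OR")` fails on the missing child after O.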
Suffix Tree Applications
• A suffix tree can be used to solve a large number of string problems that occur in:
–text editing,
–free-text search,
–etc.
• Main drawbacks:
–its costly construction process
–the need for the document/text to be readily available at query time
Building suffix array
• Procedure:
–Identify the suffixes of the given string
–Sort the suffixes lexicographically
–Store the indices of all the suffixes in a table
• The suffix array gives the indices of the suffixes in sorted order
• A suffix array can be constructed in O(n log n) time, where n is the length of the string, by sorting the suffixes
• Example: consider the string "good".
–A special character, $, is appended to the end of the string; here $ is taken to sort after all letters.
–In lexicographical order, the suffixes are "d$", "good$", "od$", "ood$" and "$".
–The suffix array is [4, 1, 3, 2, 5].
Building a suffix array
•Example: given the string S = GOOGOL, construct its suffix array
•Sort the suffixes in lexicographical order and store all the indices in a table.

Sorted suffixes   Index
GOL$              4
GOOGOL$           1
L$                6
OGOL$             3
OL$               5
OOGOL$            2
$                 7
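The procedure above can be sketched in Python. Note the assumption that '$' sorts after every letter, which is what makes the result match the tables in this section; many presentations instead sort '$' before all letters.

```python
def suffix_array(s):
    """Build a suffix array by sorting all suffixes of s + '$'.

    Returns the 1-based starting indices of the suffixes in
    lexicographic order. '$' is ranked above every other character
    (an assumption made to match this section's examples).
    """
    s += "$"
    rank = lambda ch: float("inf") if ch == "$" else ord(ch)
    key = lambda i: [rank(ch) for ch in s[i:]]      # suffix as rank list
    return sorted(range(1, len(s) + 1), key=lambda i: key(i - 1))

suffix_array("GOOGOL")   # [4, 1, 6, 3, 5, 2, 7]
```

Sorting n suffixes with O(n)-length comparisons is O(n^2 log n) in the worst case; the O(n log n) bound mentioned above needs a cleverer construction (e.g. prefix doubling), which this sketch does not attempt.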
• Text signature: 1110101 0111100 1011111