Chapter 3 Indexing Structures
IR ITec-4081
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more relevant documents for a user's query
–Effectiveness of the system is measured in terms of precision, recall, …
–Main emphasis: stemming, stop-word removal, weighting schemes, matching algorithms
• Searching
–An online process that scans the document corpus to find relevant documents that match the user's query
[Figure: IR system architecture with two subsystems.
Indexing subsystem: documents → assign document identifier → tokenize → stop-word removal (non-stoplist tokens) → stemming & normalization (stemmed terms) → term weighting (weighted index terms) → index file.
Searching subsystem: query → parse query (query tokens) → stop-word removal → stemming & normalization (query terms) → similarity measure against the index terms → ranking → ranked relevant document set.]
Text Compression
• Text compression is about finding ways to represent the text in fewer bits or bytes
• Advantages:
–Saves storage space
–Speeds up document transmission
–Takes less time to search the compressed text
• Common compression methods:
–Statistical methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•Estimate probabilities of symbols and code them one at a time, assigning shorter codes to high-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Ziv-Lempel compression
•Replace words or symbols with a pointer to a dictionary entry
Huffman coding
•Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space & decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
[Figure: example Huffman tree over symbols D1–D4, yielding the codes below]
Code of:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
How to construct Huffman coding
Step 1: Create a forest of single-node trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees according to falling probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
–Merge the two trees t1 and t2 with the least probabilities p1 and p2
–Label their root with the sum p1 + p2
–Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf
–Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
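The steps above can be sketched in Python. This is a minimal sketch using a heap for Step 2's ordering; the frequencies fed in at the bottom are those of the example table in this section, and the function name is illustrative.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code for a {symbol: frequency} map.

    A heap entry is (weight, tiebreak, tree); a tree is either a bare
    symbol or a (left, right) pair. Repeatedly merging the two
    least-probable trees implements Steps 1-3 above; Step 4 is the
    final root-to-leaf traversal that concatenates the 0s and 1s.
    """
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)                     # unique tiebreak ids
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"     # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

# frequencies taken from the example table in this section
codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
```

The resulting code is prefix-free, so decoding is unambiguous: no codeword is a prefix of another.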
Example
• Consider the 6-symbol alphabet given in the following table to construct the Huffman coding.

Symbol  Probability
a       45
b       13
c       12
d       16
e       9
f       5

• The Huffman encoding algorithm picks, each time, the two symbols with the smallest frequencies to combine.
Exercise
1. Given the following, apply the Huffman algorithm to find an optimal binary
code:
Character: a b c d e t
Frequency: 16 5 12 17 10 25
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather, it builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of code-words created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• Multi-symbol patterns have the form C0C1 . . . Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1 . . . Cn-1
Lempel-Ziv output: there are three options in assigning a code to each symbol in the list
• If a one-symbol pattern is not in the dictionary, assign (0, symbol)
• If a multi-symbol pattern is not in the dictionary, assign (dictionaryPrefixIndex, lastPatternSymbol)
• If the last input symbol or the last pattern is in the dictionary, assign (dictionaryPrefixIndex, )
Example: LZ Compression
Encode (i.e., compress) the following strings using the LZ algorithm:
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA
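The three output rules above can be turned into a short LZ78-style encoder and decoder. This is a minimal sketch; the function names are illustrative, not from the slides.

```python
def lz78_encode(text):
    """LZ78 compression: emit (dictionaryPrefixIndex, symbol) pairs.

    Index 0 stands for the empty prefix; every emitted pair also
    defines the next dictionary entry (prefix phrase + symbol).
    """
    dictionary = {}              # phrase -> dictionary index
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch         # keep extending a known pattern
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                   # last pattern is already in the dictionary
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    entries = [""]               # entry 0: the empty prefix
    text = []
    for idx, ch in pairs:
        entries.append(entries[idx] + ch)
        text.append(entries[idx] + ch)
    return "".join(text)

pairs = lz78_encode("ABBCBCABABCAABCAAB")
# [(0,'A'), (0,'B'), (2,'C'), (3,'A'), (2,'A'), (4,'A'), (6,'B')]
```

Tracing string 2: the dictionary grows as A(1), B(2), BC(3), BCA(4), BA(5), BCAA(6), BCAAB(7), and each pair references the longest previously seen prefix.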
Indexing structures: Exercise
• Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures.
• Output: a set of index terms (vocabulary) to be used for indexing the documents that each term occurs in.
Basic Indexing Process
[Figure: documents to be indexed (e.g. "Friends, Romans, countrymen.") → Tokenizer → token stream (Friends, Romans, countrymen) → Indexer → index file (inverted file) with postings such as friend → 2, 4; roman → 1, 2; countryman → 13, 16]
Building Index file
•An index file of a document collection is a file consisting of a list of index terms and, for each term, a link to one or more documents that contain that term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
•An index file is a list of search terms that are organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing an index file for a collection of documents, there are various options available:
–Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (Insertion time, Deletion time, modification time….)
• Space overhead
–Computer storage space consumed.
•Why location?
–Having information about the location of each term within the document helps for:
•user interface design: highlighting the location of search terms
•proximity-based ranking: adjacency and NEAR operators (in Boolean searching)
•Why frequencies?
–Having information about frequency is used for:
•calculating term weighting (like TF, TF*IDF, …)
•optimizing query processing
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Doc ID  TF  Location
auto   3   2       1   66
           19      1   213
           29      1   45
bus    4   3       1   94
           19      2   7, 212
           22      1   56
taxi   1   5       1   43
train  3   11      2   3, 70
           34      1   40
Construction of Inverted file
•An inverted index consists of two files:
–vocabulary file
–posting file
Advantage of dividing the inverted file:
•Keeping a pointer in the vocabulary to the corresponding list in the posting file allows:
–the vocabulary to be kept in memory at search time, even for a large text collection, and
–the posting file to be kept on disk for accessing the documents
Inverted index storage
•Separation of the inverted file into a vocabulary and a posting file is a good idea.
–Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
•The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
•Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the size of the index is 100 MB, which can easily be held in the memory of a dedicated computer.
[Figure: vocabulary entries (Act, Bus, pen, total) with their counts pointing to the corresponding inverted lists in the posting file]
Example:
• Given a collection of documents, they are parsed to extract words and these are saved with the document ID.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms:

Unsorted (parse order)        Sorted by term
Term       Doc #              Term       Doc #
I          1                  ambitious  2
did        1                  be         2
enact      1                  brutus     1
julius     1                  brutus     2
caesar     1                  capitol    1
I          1                  caesar     1
was        1                  caesar     2
killed     1                  caesar     2
I          1                  did        1
the        1                  enact      1
capitol    1                  hath       2
brutus     1                  I          1
killed     1                  I          1
me         1                  I          1
so         2                  it         2
let        2                  julius     1
it         2                  killed     1
be         2                  killed     1
with       2                  let        2
caesar     2                  me         1
the        2                  noble      2
noble      2                  so         2
brutus     2                  the        1
hath       2                  the        2
told       2                  told       2
you        2                  was        1
caesar     2                  was        2
was        2                  with       2
ambitious  2                  you        2
Remove stopwords, apply stemming & compute term frequency
•Multiple entries of a term in a single document are merged and frequency information is added
•Counting the number of occurrences of terms in the collection helps to compute TF

Term      Doc #         Term      Doc #  TF
ambition  2             ambition  2      1
brutus    1             brutus    1      1
brutus    2             brutus    2      1
capitol   1             capitol   1      1
caesar    1             caesar    1      1
caesar    2             caesar    2      2
caesar    2             enact     1      1
enact     1             julius    1      1
julius    1             kill      1      2
kill      1             noble     2      1
kill      1
noble     2
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a postings file; each dictionary entry keeps a pointer to its postings list.

Term      DF  CF  Postings (Doc #, TF)
ambition  1   1   (2, 1)
brutus    2   2   (1, 1) (2, 1)
capitol   1   1   (1, 1)
caesar    2   3   (1, 1) (2, 2)
enact     1   1   (1, 1)
julius    1   1   (1, 1)
kill      1   2   (1, 2)
noble     1   1   (2, 1)
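The dictionary/postings split above can be sketched in Python. This is a minimal sketch; the toy documents and the function name are illustrative, not from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Split an inverted file into a dictionary and a postings file.

    docs: {doc_id: text}. Returns (dictionary, postings) where
    dictionary[term] = (df, cf, pointer) and pointer indexes into
    postings, a flat list of (doc_id, tf) runs -- mirroring the
    vocabulary-in-memory / postings-on-disk split described above.
    """
    term_docs = defaultdict(lambda: defaultdict(int))  # term -> doc -> tf
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term_docs[token][doc_id] += 1
    dictionary, postings = {}, []
    for term in sorted(term_docs):            # keep the vocabulary sorted
        runs = sorted(term_docs[term].items())
        df = len(runs)                        # document frequency
        cf = sum(tf for _, tf in runs)        # collection frequency
        dictionary[term] = (df, cf, len(postings))
        postings.extend(runs)
    return dictionary, postings

# hypothetical two-document toy collection
docs = {1: "new home sales", 2: "home sales rise"}
dictionary, postings = build_inverted_index(docs)
```

Because the vocabulary is sorted, a query term can be located by binary search, and its (pointer, df) pair identifies exactly which slice of the postings file to read.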
Complexity Analysis
• The inverted index can be built in O(n) time
–n is the number of index terms processed
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms
Exercises
• Construct the inverted index for the following
document collections.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
Suffix Tree and Suffix Array
Reading Assignment
Suffix trie
• What is a suffix? A suffix is a substring that exists at the end of the given string.
–Each position in the text is considered as a text suffix
–If txt = t1t2...ti...tn is a string, then Ti = titi+1...tn is the suffix of txt that starts at position i
• Example:
txt = mississippi        txt = GOOGOL
T1 = mississippi         T1 = GOOGOL
T2 = ississippi          T2 = OOGOL
T3 = ssissippi           T3 = OGOL
T4 = sissippi            T4 = GOL
T5 = issippi             T5 = OL
T6 = ssippi              T6 = L
T7 = sippi
T8 = ippi
T9 = ppi
T10 = pi
T11 = i
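The suffix enumeration above takes only a couple of lines of Python:

```python
def suffixes(txt):
    """Return T1..Tn: every suffix Ti = txt[i-1:] of the text."""
    return [txt[i:] for i in range(len(txt))]

suffixes("GOOGOL")
# ['GOOGOL', 'OOGOL', 'OGOL', 'GOL', 'OL', 'L']
```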
Suffix trie
•A suffix trie is an ordinary trie whose input strings are all the suffixes of the text.
–Principle: the idea behind a suffix TRIE is to assign to each symbol in the text an index corresponding to its position in the text (i.e., the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
•To build the suffix TRIE we use these indices instead of the actual objects.
•The structure has several advantages:
–It requires less storage space.
–We do not have to worry about how the text is represented (binary, ASCII, etc.).
–We do not have to store the same object twice (no duplicates).
Suffix Trie
•Construct a SUFFIX TRIE for the following string: GOOGOL
•We begin by giving a position to every suffix in the text, starting from left to right as the characters occur in the string.
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: the resulting tree has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
•A suffix tree is a member of the trie family: it is a trie of all the proper suffixes of S.
–The suffix tree is created by compacting the unary nodes of the suffix TRIE.
•We store pointers rather than words in the leaves.
–It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indices of the string, i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
•Let s = abab; a suffix tree of s is a compressed trie of all suffixes of s$ = abab$
•We label each leaf with the starting point of the corresponding suffix.
Suffixes:
5  $
4  b$
3  ab$
2  bab$
1  abab$
[Figure: compressed trie with edge labels ab, b, $ and ab$; leaves labeled 1–5]
Complexity Analysis
• The suffix tree for a string can be built in O(n2) time with the naive algorithm.
• The search time is linear in the length of the query string S:
–searching for a substring S[1..m] in a string[1..n] can be solved in O(m) time, since it only requires walking down the tree for the length of the string, O(|S|).
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
•To make the suffixes prefix-free we add a special character, $, at the end of each s. To associate each suffix with a unique string in S, we add a different special symbol to each s
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
•Example: let s1 = abab and s2 = aab; the suffixes of s1$ are {abab$, bab$, ab$, b$, $} and the suffixes of s2# are {aab#, ab#, b#, #}
[Figure: generalized suffix tree for abab$ and aab#, with leaves labeled by suffix start positions]
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
–Start at the root
–Go down the tree, each time taking the corresponding path
–If S corresponds to a node x, then return all leaves in the sub-tree rooted at x
•the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x
–If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
Example:
• If S = "GO", we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR", we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
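The pseudo-code above can be sketched with an uncompressed suffix trie in Python. This is a sketch of the search idea, not the compacted suffix tree; the dict-based nodes and the "#pos" leaf marker are implementation choices of this example.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie: one node per character, built by
    inserting every suffix of text + '$'. Each leaf records the
    1-based starting position of its suffix."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#pos"] = start + 1          # leaf marker: suffix start

    return root

def find(trie, pattern):
    """Walk the pattern down from the root; a missing child is the
    NIL pointer of the pseudo-code. On success, collect every leaf
    position in the subtree: those are all occurrences."""
    node = trie
    for ch in pattern:
        if ch not in node:                # hit a NIL pointer
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "#pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("GOOGOL")
```

For the example above, `find(trie, "GO")` returns the start positions of GOOGOL$ and GOL$, while `find(trie, "OR")` fails on the missing child after O.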
Suffix Tree Applications
• A suffix tree can be used to solve a large number of string problems that occur in:
–text editing,
–free-text search,
–etc.
• Main drawbacks:
–its costly construction process
–the need for the document/text to be readily available at query time
Building suffix array
• Procedure:
–Identify the suffixes of the given string
–Sort the suffixes lexicographically
–Store the indices of all the suffixes in a table
• The suffix array gives the indices of the suffixes in sorted order
• A suffix array can be constructed in O(n log n) time, where n is the length of the string, by sorting the suffixes
• Example: consider the string "good".
–A special character, $, is appended to the end of the string; here $ is taken to sort after all letters.
–In lexicographical order, the suffixes are "d$", "good$", "od$", "ood$" and "$".
–The suffix array is [4, 1, 3, 2, 5].
Building a suffix array
•Example: given the string S = GOOGOL, construct its suffix array
•Sort the suffixes in lexicographical order and store all the indices in a table.

Sorted suffixes   Index
GOL$              4
GOOGOL$           1
L$                6
OGOL$             3
OL$               5
OOGOL$            2
$                 7
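The procedure above can be sketched in Python. Note the assumption that '$' sorts after every letter, which is what makes the result match the tables in this section; many presentations instead sort '$' before all letters.

```python
def suffix_array(s):
    """Build a suffix array by sorting all suffixes of s + '$'.

    Returns the 1-based starting indices of the suffixes in
    lexicographic order. '$' is ranked above every other character
    (an assumption made to match this section's examples).
    """
    s += "$"
    rank = lambda ch: float("inf") if ch == "$" else ord(ch)
    key = lambda i: [rank(ch) for ch in s[i:]]      # suffix as rank list
    return sorted(range(1, len(s) + 1), key=lambda i: key(i - 1))

suffix_array("GOOGOL")   # [4, 1, 6, 3, 5, 2, 7]
```

Sorting n suffixes with O(n)-length comparisons is O(n^2 log n) in the worst case; the O(n log n) bound mentioned above needs a cleverer construction (e.g. prefix doubling), which this sketch does not attempt.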
• Text signature: 1110101 0111100 1011111