CSC2308 Lec 02
(CSC2308)
_______________________________________________
Zauwali S. Paki
Department of Computer Science
Yusuf Maitama Sule University, Kano
zspaki3@gmail.com
Short quiz on the previous lectures
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Q1. Draw the term-document incidence matrix for
this document collection
Q2. What will be the returned results for these
queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
You have 20 mins
Data Management I 2
Indexing and Search
What is indexing?
Index construction
Inverted index (inverted file)
Inverted index: the steps
• We need to create the index file in advance to gain
the speed benefits at retrieval time
• The major steps are as follows:
3. Do linguistic preprocessing, producing a list of
normalized tokens, which are the indexing terms:
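A minimal sketch of this preprocessing step, assuming that normalization means only case folding and stripping surrounding punctuation (real systems may also apply stemming or stop-word removal). The function name `normalize` and the sample tokens are illustrative, not part of the lecture.

```python
import string


def normalize(tokens):
    """Case-fold each token and strip surrounding punctuation; drop empty results."""
    normalized = []
    for token in tokens:
        term = token.strip(string.punctuation).lower()
        if term:  # a token made only of punctuation becomes "" and is skipped
            normalized.append(term)
    return normalized


# Illustrative tokens (an assumption, not taken from the slides).
tokens = ["Friends,", "Romans,", "Countrymen."]
print(normalize(tokens))  # ['friends', 'romans', 'countrymen']
```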
Indexing process
• The indexing operation gets as input the
normalized list of tokens for each document
• The input normally takes the form of (term, docID) pairs
• Sorting the list, a core indexing step, is then carried
out so that the terms are in alphabetical order
• Multiple occurrences of the same term from the
same document are merged
• Instances of the same term are grouped, and the result is
represented as a dictionary and postings
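The bullets above can be sketched roughly as follows. This is one possible in-memory version, assuming the whole list of pairs fits in memory. The name `build_index` and the sample pairs are hypothetical.

```python
from itertools import groupby


def build_index(pairs):
    """Build {term: [docID, ...]} from a list of (term, docID) pairs."""
    # set() merges repeated (term, docID) pairs from the same document;
    # sorted() orders the pairs by term, then by docID.
    sorted_pairs = sorted(set(pairs))
    index = {}
    # Group the instances of each term: the dict keys act as the dictionary,
    # and each value is that term's postings list.
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index


# Hypothetical input pairs, deliberately unsorted and with one duplicate.
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 2)]
for term, postings in build_index(pairs).items():
    print(term, postings)
# brutus [1, 2]
# caesar [1, 2]
```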
Indexing process: Example
• Here are two documents:
• Doc1: I did enact Julius Caesar: I was killed in the Capitol;
Brutus killed me.
• Doc2: So let it be with Caesar. The noble Brutus hath told
you Caesar was ambitious:
• We now tokenize the documents
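As a rough illustration, the two documents can be turned into (term, docID) pairs like this. The sketch assumes a very simple tokenizer that lowercases the text and keeps runs of letters only. The names `docs` and `tokenize` are illustrative.

```python
import re

docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens (an assumed, simplistic rule)."""
    return re.findall(r"[a-z]+", text.lower())


# One (term, docID) pair per token, in order of appearance.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
print(pairs[:5])
# [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```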
• Two data structures are suitable for storing postings lists
efficiently: singly linked lists and variable-length arrays
• A singly linked list allows cheap insertion into a postings
list in response to updates (for example, when recrawling
the web for updated documents)
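A small sketch of the linked-list option, assuming postings are kept sorted by docID. The class and function names are hypothetical. Finding the insertion point still requires walking the list, but the insertion itself only relinks one pointer, with no shifting of elements as in an array.

```python
class PostingNode:
    """One node of a singly linked postings list."""

    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next = next_node


def insert_posting(head, doc_id):
    """Insert doc_id into a docID-sorted list and return the head; duplicates are ignored."""
    if head is None or doc_id < head.doc_id:
        return PostingNode(doc_id, head)
    node = head
    # Walk to the last node whose successor is still smaller than doc_id.
    while node.next is not None and node.next.doc_id < doc_id:
        node = node.next
    already_present = node.doc_id == doc_id or (
        node.next is not None and node.next.doc_id == doc_id
    )
    if not already_present:
        node.next = PostingNode(doc_id, node.next)  # splice in the new node
    return head


def to_list(head):
    """Collect the docIDs of a linked postings list into a Python list."""
    result = []
    while head is not None:
        result.append(head.doc_id)
        head = head.next
    return result


head = None
for doc_id in [1, 4, 2, 4]:
    head = insert_posting(head, doc_id)
print(to_list(head))  # [1, 2, 4]
```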
Processing Boolean queries using inverted index
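As an illustration of how an AND query can be answered from an inverted index, the sketch below intersects two sorted postings lists with a linear merge. It assumes the postings are plain Python lists sorted by docID. The function name `intersect` and the example postings are hypothetical.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists for two query terms.
postings_a = [1, 2, 4, 11, 31, 45]
postings_b = [2, 31, 54, 101]
print(intersect(postings_a, postings_b))  # [2, 31]
```

Because both lists are sorted, each pointer only moves forward, so the work is proportional to the total length of the two lists.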
Features of IR system