CSC2308 Lec 02

This document discusses indexing and search in information retrieval systems. It explains that an index improves search speed by organizing key terms from documents in a searchable list. Inverted indexes are commonly used, where terms are listed with pointers to the documents that contain them. The steps of index construction include preprocessing documents, extracting terms, and building the inverted file structure with a dictionary and postings lists. Boolean queries can then be efficiently processed against the index by intersecting the postings lists of query terms. Key features of IR systems are precision and recall for measuring search effectiveness.

Uploaded by

aabdurrahaman647

Data Management I

(CSC2308)

_______________________________________________
Zauwali S. Paki
Department of Computer Science
Yusuf Maitama Sule University, Kano
zspaki3@gmail.com
Short quiz on the previous lectures
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Q1. Draw the term-document incidence matrix for
this document collection
Q2. What will be the returned results for these
queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
You have 20 mins
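
One possible way to check your answers is sketched below. It builds the term-document incidence matrix for the four quiz documents and evaluates both queries with set operations. The names `incidence`, `docs_with`, `result_a` and `result_b` are made up for this illustration, and splitting on whitespace is an assumed, simplified form of tokenization.

```python
# The four documents from the quiz, keyed by document number.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

doc_ids = sorted(docs)
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# Term-document incidence matrix: one row per term, one 0/1 entry per document.
incidence = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocabulary
}


def docs_with(term):
    """Return the set of document numbers whose incidence entry for term is 1."""
    row = incidence.get(term, [0] * len(doc_ids))
    return {d for d, bit in zip(doc_ids, row) if bit}


all_docs = set(doc_ids)

# a. schizophrenia AND drug
result_a = docs_with("schizophrenia") & docs_with("drug")

# b. for AND NOT(drug OR approach)
result_b = docs_with("for") & (all_docs - (docs_with("drug") | docs_with("approach")))

for term in vocabulary:
    print(f"{term:<14}", *incidence[term])
print("a:", sorted(result_a))
print("b:", sorted(result_b))
```

AND maps to set intersection, OR to union, and NOT to the complement with respect to the whole collection.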

Indexing and Search

What is indexing?

• An index is a data structure that improves the speed of
a search/lookup
• It is similar to a book index, where the major terms of the
book are organized in a list
• When you are looking for a given major term in the
book, you go straight to the index and get a
pointer to its page in the book
• The same concept applies to indexes in information
retrieval; the differences are the size of the indexed
documents and how the indexes are updated to reflect
changes in those documents

Index construction

• Before we can use an index, we need to create it,
just as in the case of a book index
• This is crucial, as the quality of an index
considerably affects the performance of the search
engines that use it
• An index always maps back from terms to the parts
of a document where they occur

Inverted index (inverted file)

• A document, usually a text document, is split into
terms
• A dictionary is a data structure that contains the terms
• A posting is an item that records that a term occurs in
a document
• The collection of postings for a term is called its postings list
• All postings lists taken together are called the postings
• Within a document collection, each new document
is assigned a successive integer as its document ID
(docID)
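
A minimal sketch of these definitions, assuming an in-memory Python dictionary is good enough for illustration. The class name `InvertedIndex` and its method `add_document` are hypothetical, and whitespace splitting stands in for real term extraction.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index (hypothetical class, for illustration only)."""

    def __init__(self):
        # Dictionary: maps each term to its postings list of docIDs.
        self.postings = defaultdict(list)
        # docIDs are handed out as successive integers.
        self.next_doc_id = 1

    def add_document(self, text):
        """Assign the next docID and record one posting per distinct term."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        # Splitting on whitespace is an assumed stand-in for real tokenization.
        for term in sorted(set(text.lower().split())):
            self.postings[term].append(doc_id)
        return doc_id


index = InvertedIndex()
index.add_document("new schizophrenia drug")
index.add_document("new hopes for schizophrenia patients")
print(index.postings["new"])
print(index.postings["drug"])
```

Because docIDs only ever increase, appending keeps every postings list sorted by docID without an explicit sort.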

Inverted index (inverted file)

• The dictionary is kept in memory and the postings are stored on
disk
Adapted from: C.D. Manning, P. Raghavan, H. Schütze 2009. An Introduction to
Information Retrieval (online version). Cambridge University Press.

Inverted index: the steps
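• We need to build the index in advance to gain
the speed benefits at retrieval time
• The major steps are as follows:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.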

Inverted index: the steps
3. Do linguistic preprocessing, producing a list of
normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in by
creating an inverted index, consisting of a dictionary
and postings.
• Loosely speaking, tokens and normalized tokens
can be thought of as words
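
Tokenization and the linguistic preprocessing of step 3 can be sketched as follows. This is a deliberately simple illustration: the function names `tokenize`, `normalize` and `index_terms` are hypothetical, the sample sentence is made up, and lowercasing is only one assumed form of normalization (real systems may also apply stemming, stop-word handling and more).

```python
import re


def tokenize(text):
    """Split text into tokens: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(token):
    """A very simple normalization: lowercase only."""
    return token.lower()


def index_terms(text):
    """Return the normalized tokens (the indexing terms) of a document."""
    return [normalize(token) for token in tokenize(text)]


terms = index_terms("New hopes for Schizophrenia patients.")
print(terms)
```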

Indexing process
• The indexing operation takes as input the
normalized list of tokens for each document
• This is normally in the form of (term, docID) pairs
• Sorting the list, a core indexing step, is then carried
out so that the terms are in alphabetical order
• Multiple occurrences of the same term from the
same document are merged
• Instances of the same term are grouped and
represented in the dictionary and postings
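
The sort, merge and group steps above can be sketched on a small hand-written list of pairs. The input `pairs` is hypothetical sample data, and using `itertools.groupby` is just one convenient way to express the grouping in Python.

```python
from itertools import groupby

# Hypothetical input: (term, docID) pairs in the order the tokens were read.
pairs = [
    ("new", 1), ("drug", 1), ("new", 2), ("hopes", 2),
    ("new", 2), ("drug", 3),
]

# 1. Sort by term, then by docID (tuples compare element by element).
sorted_pairs = sorted(pairs)

# 2. Merge multiple occurrences of the same term in the same document:
#    after sorting, identical pairs are adjacent, so keep one per run.
unique_pairs = [pair for pair, _ in groupby(sorted_pairs)]

# 3. Group by term: each dictionary entry points to its postings list.
index = {
    term: [doc_id for _, doc_id in group]
    for term, group in groupby(unique_pairs, key=lambda pair: pair[0])
}
print(index)
```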

Indexing process: Example
• Here are two documents:
• Doc1: I did enact Julius Caesar: I was killed in the Capitol;
Brutus killed me.
• Doc2: So let it be with Caesar. The noble Brutus hath told
you Caesar was ambitious:
• We now tokenize the documents
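
As a rough end-to-end sketch, the two documents can be run through tokenization, sorting and grouping in a few lines. Lowercasing every token and keeping only alphabetic runs are assumptions made for this illustration; the variable names are hypothetical.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}

# Tokenize and lowercase each document, emitting (term, docID) pairs.
pairs = [
    (token.lower(), doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[A-Za-z]+", text)
]

# Sort the pairs, drop duplicate (term, docID) pairs, and group docIDs by term.
index = defaultdict(list)
for term, doc_id in sorted(set(pairs)):
    index[term].append(doc_id)

for term, postings in index.items():
    print(f"{term:<10} {postings}")
```

Terms that occur in both documents, such as "brutus" and "caesar", end up with the postings list [1, 2]; terms such as "killed" appear only in document 1.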

Indexing process: Example
• Two data structures are suitable alternatives for
efficient storage of the postings lists: singly linked lists
and variable-length arrays
• A singly linked list allows cheap insertion into a postings
list in response to updates (for example, recrawling
the web for updated documents)
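
The trade-off between the two structures can be sketched as follows. `PostingNode`, `insert_sorted` and `to_list` are hypothetical names introduced for this illustration, and a Python list plays the role of the variable-length array.

```python
import bisect


class PostingNode:
    """One node of a singly linked postings list (hypothetical class)."""

    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next = next_node


def insert_sorted(head, doc_id):
    """Insert doc_id into a sorted linked postings list and return the head."""
    if head is None or doc_id < head.doc_id:
        return PostingNode(doc_id, head)
    node = head
    while node.next is not None and node.next.doc_id <= doc_id:
        node = node.next
    if node.doc_id != doc_id:  # ignore docIDs that are already present
        node.next = PostingNode(doc_id, node.next)
    return head


def to_list(head):
    """Collect the docIDs of a linked postings list into a Python list."""
    out = []
    while head is not None:
        out.append(head.doc_id)
        head = head.next
    return out


# Linked list: once the position is found, insertion only relinks pointers.
head = None
for doc_id in [1, 4, 11, 2]:
    head = insert_sorted(head, doc_id)

# Variable-length array: compact, but a middle insertion shifts later elements.
array_postings = [1, 4, 11]
bisect.insort(array_postings, 2)

print(to_list(head))
print(array_postings)
```

Both structures end up holding the same sorted docIDs; they differ in how much work an insertion in the middle costs and in how compactly the list is stored.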

Processing Boolean queries using inverted index

• How do we process a query using an inverted index
and the basic Boolean retrieval model?
Consider processing the simple conjunctive query
Brutus AND Calpurnia over the inverted index shown
earlier on slide 8

Processing Boolean queries using inverted index

• So, we proceed as follows:
1. Locate Brutus in the dictionary
2. Retrieve its postings
3. Locate Calpurnia in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
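
The five steps can be sketched as a dictionary lookup followed by a linear merge of two postings lists sorted by docID. The function name `intersect` and the docIDs in `index` are assumptions chosen for illustration; they are not read off the slide's figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The docID is in both lists, so the document matches the AND query.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


# Illustrative index with assumed docIDs.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "calpurnia": [2, 31, 54, 101],
}

# Steps 1-4: locate each term in the dictionary and retrieve its postings.
brutus_postings = index["brutus"]
calpurnia_postings = index["calpurnia"]

# Step 5: intersect the two postings lists.
result = intersect(brutus_postings, calpurnia_postings)
print(result)
```

Because both lists are sorted, each pointer only moves forward, so the merge takes time proportional to the combined length of the two lists.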

Features of an IR system

• Precision: What fraction of the returned results
are relevant to the information need?
• Recall: What fraction of the relevant documents
in the collection were returned by the system?
• A document is relevant if the user
perceives it as containing information of value
with respect to their personal information need
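
Both measures can be computed directly from the set of returned documents and the set of relevant documents. The function name `precision_recall` and the docIDs in the usage example are hypothetical; returning 0.0 for an empty set is an assumed convention for this sketch.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = |returned AND relevant| / |returned|
    recall    = |returned AND relevant| / |relevant|
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 relevant in the collection,
# 2 documents in common.
precision, recall = precision_recall(returned=[1, 2, 4, 7], relevant=[2, 4, 5, 6, 9])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example, half of the returned documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).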
