CSC2308 Lec 02

This document discusses indexing and search in information retrieval systems. It explains that an index improves search speed by organizing key terms from documents in a searchable list. Inverted indexes are commonly used, where terms are listed with pointers to the documents that contain them. The steps of index construction include preprocessing documents, extracting terms, and building the inverted file structure with a dictionary and postings lists. Boolean queries can then be efficiently processed against the index by intersecting the postings lists of query terms. Key features of IR systems are precision and recall for measuring search effectiveness.


Data Management I

(CSC2308)

_______________________________________________
Zauwali S. Paki
Department of Computer Science
Yusuf Maitama Sule University, Kano
zspaki3@gmail.com
Short quiz on the previous lectures
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Q1. Draw the term-document incidence matrix for this document collection.
Q2. What results will be returned for these queries?
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
You have 20 minutes.

Indexing and Search

What is indexing?

• An index is a data structure that improves the speed of a search or lookup.
• It is similar to a book index, where the major terms of the book are organized in a list.
• When you are looking for a given major term in the book, you go straight to the index and get a pointer to its page.
• The same concept applies to indexes in information retrieval; the differences are the size of the indexed documents and how the index is updated to reflect changes in them.

Index construction

• Before we can use an index, we need to create it, just as in the case of a book index.
• This step is crucial, as the quality of an index considerably affects the performance of the search engines that use it.
• An index always maps back from terms to the parts of a document where they occur.

Inverted index (inverted file)

• A document, usually a text document, is split into terms.
• The dictionary is a data structure that contains the terms.
• A posting is an item that records that a term occurs in a document.
• The collection of postings for a term is called its postings list.
• All postings lists taken together are called the postings.
• Within a document collection, each new document is assigned a successive integer as its document ID (docID).
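
To make the vocabulary above concrete, here is a minimal sketch of a dictionary and its postings lists. The three-document collection and the variable names (`documents`, `index`) are made up for illustration and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical three-document collection; docIDs are assigned
# successively (1, 2, 3) in the order the documents are added.
documents = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]

# The dictionary maps each term to its postings list.
# In this sketch a posting is simply the docID of a document
# that contains the term.
index = defaultdict(list)

for doc_id, text in enumerate(documents, start=1):
    for term in sorted(set(text.split())):  # each term once per document
        index[term].append(doc_id)          # record one posting

print(sorted(index))   # the terms held in the dictionary
print(index["cat"])    # postings list for "cat"
print(index["the"])    # postings list for "the"
```

Because documents are processed in docID order, each postings list comes out sorted by docID without an explicit sort.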

Inverted index (inverted file)

• The dictionary is commonly kept in memory, while the postings are stored on disk.
Adapted from: C.D. Manning, P. Raghavan, H. Schütze 2009. An Introduction to Information Retrieval (online version). Cambridge University Press.

Inverted index: the steps
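• We need to create the index file in advance to gain the speed benefits at retrieval time.
• The major steps are as follows:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.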

Inverted index: the steps
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
• Loosely speaking, tokens and normalized tokens can be thought of as words.
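
As a rough illustration of tokenization followed by linguistic preprocessing, the sketch below splits a text into tokens and then normalizes them. The function names `tokenize` and `normalize` are hypothetical, and normalization is assumed here to mean lowercasing only; real systems may also apply steps such as stemming or stop-word removal.

```python
import re

def tokenize(text):
    """Split raw text into tokens: here, any run of letters or apostrophes."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(tokens):
    """Normalize tokens. In this sketch that means lowercasing only."""
    return [token.lower() for token in tokens]

text = "Friends, Romans, countrymen."
tokens = tokenize(text)    # punctuation is dropped, case is kept
terms = normalize(tokens)  # the indexing terms

print(tokens)
print(terms)
```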

Indexing process
• The indexing operation takes as input the list of normalized tokens for each document.
• The input is normally in the form of (term, docID) pairs.
• Sorting the list, a core indexing step, is then carried out so that the terms are in alphabetical order.
• Multiple occurrences of the same term in the same document are merged.
• Instances of the same term are then grouped, and the result is split into a dictionary and postings.
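
The sort, merge and group steps listed above can be sketched as follows. This is one possible in-memory version, assuming the whole list of pairs fits in memory; the function name `build_index` and the sample pairs are illustrative.

```python
from itertools import groupby

def build_index(pairs):
    """Build {term: postings list} from a list of (term, docID) pairs."""
    # Step 1: sort by term, then by docID.
    sorted_pairs = sorted(pairs)

    # Step 2: merge multiple occurrences of a term in the same document.
    unique_pairs = []
    for pair in sorted_pairs:
        if not unique_pairs or unique_pairs[-1] != pair:
            unique_pairs.append(pair)

    # Step 3: group the instances of each term into one postings list.
    index = {}
    for term, group in groupby(unique_pairs, key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index

# Hypothetical input pairs, in the order the tokens were read.
pairs = [
    ("caesar", 1), ("brutus", 1), ("killed", 1), ("killed", 1),
    ("caesar", 2), ("brutus", 2), ("caesar", 2),
]

index = build_index(pairs)
print(index)
```

The keys of the returned mapping play the role of the dictionary, and the values are the postings lists, each sorted by docID.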

Indexing process: Example
• Here are two documents:
• Doc1: I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.
• Doc2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
• We now tokenize the documents.
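
As a sketch of the tokenization step applied to these two documents, the snippet below produces the (term, docID) pairs that indexing takes as input. Lowercasing and splitting on non-letters are simplifying assumptions, and the names `docs` and `pairs` are illustrative.

```python
import re

docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
}

# Emit one (term, docID) pair per token, in reading order.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z]+", text.lower()):
        pairs.append((token, doc_id))

print(pairs[:5])   # first five pairs from Doc1
print(len(pairs))  # total number of pairs, duplicates included
```

Duplicates such as the two ("killed", 1) pairs are still present at this point; they are merged later, after sorting.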

Indexing process: Example
• Two data structures are suitable alternatives for efficient storage of the postings lists: singly linked lists and variable-length arrays.
• A singly linked list allows cheap insertion into a postings list in response to updates (for example, recrawling the web for updated documents).
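
The trade-off between the two structures can be sketched as below. The `Posting` class and the helper functions are hypothetical names for illustration; a Python list stands in for the variable-length array.

```python
from bisect import insort

class Posting:
    """One node of a singly linked postings list."""
    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next_node = next_node

def insert_sorted(head, doc_id):
    """Insert doc_id into a sorted linked list and return the head.

    Once the position is found, insertion only relinks one pointer.
    A docID that is already present is not inserted again.
    """
    if head is None or doc_id < head.doc_id:
        return Posting(doc_id, head)
    node = head
    while node.next_node is not None and node.next_node.doc_id < doc_id:
        node = node.next_node
    already_there = node.doc_id == doc_id or (
        node.next_node is not None and node.next_node.doc_id == doc_id
    )
    if not already_there:
        node.next_node = Posting(doc_id, node.next_node)
    return head

def to_list(head):
    """Collect the docIDs of a linked postings list into a Python list."""
    result = []
    while head is not None:
        result.append(head.doc_id)
        head = head.next_node
    return result

# Linked-list version: build 1 -> 4 -> 9, then insert docID 6.
head = None
for doc_id in (1, 4, 9):
    head = insert_sorted(head, doc_id)
head = insert_sorted(head, 6)
linked_postings = to_list(head)

# Array version: inserting in the middle shifts the later elements.
array_postings = [1, 4, 9]
insort(array_postings, 6)

print(linked_postings, array_postings)
```

Both versions end with the same sorted postings. The array stores docIDs contiguously with no pointer overhead, while the linked list avoids shifting elements on insertion.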

Data Management I 13
Processing Boolean queries using inverted index

• How do we process a query using an inverted index and the basic Boolean retrieval model?
• Consider processing the simple conjunctive query Brutus AND Calpurnia over the inverted index shown earlier.

Data Management I 14
Processing Boolean queries using inverted index

• So, we proceed as follows:
1. Locate Brutus in the dictionary.
2. Retrieve its postings.
3. Locate Calpurnia in the dictionary.
4. Retrieve its postings.
5. Intersect the two postings lists.
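
The five steps above can be sketched as follows, assuming both postings lists are sorted by docID so that they can be intersected in a single linear pass. The docIDs in `index` are illustrative values, not taken from the lecture's figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:      # docID is in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:     # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

# Illustrative index with made-up postings.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "calpurnia": [2, 31, 54, 101],
}

# Steps 1-4: locate each term in the dictionary and retrieve its postings.
brutus_postings = index["brutus"]
calpurnia_postings = index["calpurnia"]

# Step 5: intersect the two postings lists.
result = intersect(brutus_postings, calpurnia_postings)
print(result)
```

Each step of the loop advances at least one pointer, so the work is proportional to the combined length of the two lists.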

Features of an IR system

• Precision: what fraction of the returned results are relevant to the information need?
• Recall: what fraction of the relevant documents in the collection were returned by the system?
• A document is relevant if the user perceives it as containing information of value with respect to their personal information need.
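
The two measures defined above can be computed as in this small sketch. The function name `precision_recall` and the docIDs are made up for illustration, and relevance judgments are assumed to be given as a set of docIDs.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: docIDs the system returned
    relevant: docIDs judged relevant to the information need
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 5 relevant in the collection,
# and 2 documents (docIDs 2 and 4) in both sets.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(precision, recall)
```

Here precision is 2/4 and recall is 2/5: half of what was returned is relevant, but fewer than half of the relevant documents were found.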
