Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
Course Outline
Course Title: Information Storage and Retrieval
Course Code: ITec3081
ECTS Credits (CP): 5
Target Group: B.Sc. 3rd year Information
Technology Students (Regular Program)
Year /Semester: Year: III, Semester: II
Status of the Course: Core
Chapter Three: Indexing Structures
• Inverted files
• Tries, Suffix Trees and Suffix Arrays
• Signature files
Course Syllabus
Chapter Four: IR Models
• Introduction to IR models
• Boolean model
• Vector space model
• Probabilistic model
Chapter Six: Query Languages
• Keyword-based queries
• Pattern matching
• Structural queries
Chapter 1: Introduction to Information Storage and Retrieval
Outline
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system
Information Retrieval
Information retrieval (IR) is the process of
finding material (usually documents) of an
unstructured nature (usually text) that satisfies
an information need from within large collections
(usually stored on computers).
General Goal of Information Retrieval
Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
– Quite effective (at some things)
– Commercially successful (some of them)
• But what goes on behind the scenes?
– How do they work?
– What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Examples of IR systems
• Conventional (library catalog): Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): Search in (restricted) natural language.
• Other: Cross-language information retrieval, music retrieval
Information Retrieval vs. Data Retrieval
The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
Data retrieval
• Consists mainly of determining which documents contain the set of keywords in the user query (which is often not enough to satisfy the user's information need)
• Aims at retrieving all objects that satisfy well-defined semantics
• A single erroneous object among a thousand retrieved objects implies failure
Information retrieval
• Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
• Semantics is frequently loose: the retrieved objects might be inaccurate
• Small errors are tolerated
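The contrast above can be illustrated with a small sketch. It is not part of the original slides: the toy collection, the query, and the function names `data_retrieval` and `information_retrieval` are all assumptions made for illustration. The first function returns only exact matches (every keyword must be present), while the second tolerates partial matches and ranks them.

```python
# Toy collection; the documents and the query are invented for illustration.
docs = {
    1: "information retrieval finds relevant documents",
    2: "relational database systems store structured data",
    3: "retrieval of information from large text collections",
}

def data_retrieval(query, docs):
    """Exact match: return only documents containing every query keyword."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]

def information_retrieval(query, docs):
    """Best match: rank documents by how many query keywords they share."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    # Keep partial matches too, best first; ties are broken by document id.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

query = "information retrieval text"
exact = data_retrieval(query, docs)          # only document 3 has all keywords
ranked = information_retrieval(query, docs)  # document 1 is a partial match
print(exact, ranked)
```

Document 1 lacks the keyword "text", so the exact-match function rejects it, whereas the ranked function still returns it below document 3.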
Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database
[Figure: the USER interacts with the DB through Retrieval and Browsing]
The User Task
Retrieval
Retrieval is the process of finding information when the main objective is clearly defined from the onset of the search.
The user of a retrieval system has to translate his
Structure of an IR System
An information retrieval system serves as a bridge between the world of authors and the world of readers/users. That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: the IR system as a black box between User and Documents]
Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant to the query.
Typical IR System Architecture
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
Web Search System
[Figure: a spider collects pages from the Web into the document corpus; a query string and the corpus are the inputs to the IR system, which outputs ranked documents: 1. Page1, 2. Page2, 3. Page3, …]
What is Information Retrieval ?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1)
The Retrieval Process
• It is necessary to define the text database
before any of the retrieval processes are initiated
Detailed View of the Retrieval Process
[Figure: components shown are the User Interface, Text Operations, Query Language & Operations, Indexing, Searching, the Index, and the DB manager Module; the flows between them are labeled user need, text, logical view, user feedback, and documents]
[Figure: indexing flow]
documents → Assign document identifier → text and document IDs → Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → terms with weights → Index
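The indexing steps named above (assign document identifiers, tokenize, apply a stop list, stem and normalize, weight terms, store in an index) can be sketched as follows. This is a minimal illustration, not the course's implementation: the stop list, the crude suffix-stripping `stem` function (not Porter's algorithm), the sample documents, and the use of raw term frequency as the weight are all assumptions.

```python
import re
from collections import Counter, defaultdict

# Assumed for illustration: a tiny stop list.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "to"}

def tokenize(text):
    """Normalize to lowercase and split the text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Very rough stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(documents):
    """Return {term: {doc_id: weight}}, with raw term frequency as the weight."""
    index = defaultdict(dict)
    # Assign document identifiers 1, 2, 3, ... in input order.
    for doc_id, text in enumerate(documents, start=1):
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    return dict(index)

documents = [
    "Indexing of documents",
    "The index stores terms and the documents containing the terms",
]
index = build_index(documents)
print(index["index"], index["term"])
```

Mapping each term to the documents that contain it gives an inverted file, the structure listed under Chapter Three; a real system would typically replace the raw frequency with a scheme such as tf-idf.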
Searching Subsystem
[Figure: searching flow]
query → parse query → query tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Query Term weighting → query terms → Similarity Measure (compared against index terms from the Index) → relevant document set → ranking → ranked document set
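A searching subsystem of this kind can be sketched as below. The sketch assumes a toy index of the form `{term: {doc_id: weight}}`, a small stop list, a crude plural-stripping stemmer, and cosine similarity as the similarity measure; the slides do not fix any of these choices, so they are illustrative only.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "to"}

# Assumed toy index {term: {doc_id: weight}}; weights are term frequencies.
INDEX = {
    "index": {1: 1, 2: 1},
    "document": {1: 1, 2: 1},
    "term": {2: 2},
    "query": {3: 2},
}

def query_terms(query):
    """Parse the query, drop stop words, strip plural 's', weight by frequency."""
    tokens = re.findall(r"[a-z]+", query.lower())
    stems = [t[:-1] if t.endswith("s") and len(t) > 3 else t
             for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

def doc_norms(index):
    """Euclidean length of each document vector, computed from the index."""
    squares = Counter()
    for postings in index.values():
        for doc_id, weight in postings.items():
            squares[doc_id] += weight * weight
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}

def search(query, index):
    """Rank documents by cosine similarity between query and document vectors."""
    q = query_terms(query)
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    norms = doc_norms(index)
    dots = Counter()
    for term, q_weight in q.items():
        for doc_id, d_weight in index.get(term, {}).items():
            dots[doc_id] += q_weight * d_weight
    scores = {d: dot / (q_norm * norms[d]) for d, dot in dots.items()}
    # Highest similarity first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

results = search("the terms of the index", INDEX)
print(results)
```

Only documents sharing at least one term with the query receive a score, so document 3 is never touched for this query; this is the main efficiency benefit of looking up query terms in the index instead of scanning every document.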
Issues that arise in IR
• Text representation
– what makes a “good” representation?
– how is a representation generated from text?
– what are retrievable objects and how are they organized?
• Information needs representation
– what is an appropriate query language?
– how can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
– what weighting scheme and similarity measure should be used?
– what is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
– what are good metrics?
– what constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is on:
• Improving the effectiveness of the system
– Effectiveness is measured in terms of precision, recall, …
– Techniques: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
– The concerns here are storage space usage, access time, searching time, data transfer time, …
– Space–time tradeoffs must be weighed
– Use compression techniques, data/file structures, etc.
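As a small illustration of the effectiveness measures mentioned above, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the document ids below are hypothetical, chosen only for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: the system returned 4 documents; 5 are truly relevant.
retrieved = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).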
Thank you
Questions?