University of Gondar: Information Storage and Retrieval System
College of Informatics
Department of Information Science
Getachew G.
gedamugetachew96@gmail.com
Gondar, Ethiopia
Course Outline
Topic(s): Details
• Overview of IR: Define IR; The retrieval process; Basic structure of an IR system
• Text Document Operations: Basic Laws in IR; Tokenization; Stop word detection; Stemming; Normalization; Term weighting; Similarity measures
• Indexing Structures: The need for indexing; Sequential file; Inverted files
• IR Models: A Formal Characterization of IR Models; Boolean model, Vector space model & Probabilistic model
• Retrieval Evaluation: Evaluation of IR systems; Relevance judgement; Retrieval effectiveness measures (Recall, Precision, F-measure, etc.)
• Query Languages: Types of Query formulation; Keyword-based queries (Boolean queries); Pattern matching; Natural language queries
• Current Issues in IR: IR in Local Languages; Information Extraction; Information Filtering; Text Summarization, Cross-language retrieval...
Text Collections and IR
• Information is organized into (a large number of)
documents
– Large collections of documents are available from various sources:
books, magazines, newspapers, journal articles, conference
papers, digital libraries, Web pages, etc.
[Figure: communication from input to output. Thoughts map to thoughts (telepathy?), words to words (writing), sounds to sounds (speech); the input side encodes and the output side decodes.]
Information Theory
• Better called “communication theory”
[Figure: a communication channel affected by noise]
A Synthesis
• Information retrieval as communication over
time and space, across a noisy channel
[Figure: a source (sender) and a destination (recipient) connected by a noisy channel; the sender side corresponds to indexing/writing, the recipient side to retrieval/reading.]
Types of Information Needs
• Retrospective
– “Searching the past”
– Different queries posed against a static collection
– Time invariant
• Prospective
– “Searching the future”
– Static query posed against a dynamic collection
– Time dependent
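The contrast between the two types of information need can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (`matches`, `retrospective_search`, `StandingQuery`) are hypothetical, and matching is plain all-terms-present keyword matching rather than any particular retrieval model.

```python
def matches(query, document):
    """Return True if every query term occurs in the document (case-insensitive)."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())


def retrospective_search(query, collection):
    """Retrospective: a new query is run against a fixed (static) collection."""
    return [doc for doc in collection if matches(query, doc)]


class StandingQuery:
    """Prospective: one fixed query is checked against each newly arriving document."""

    def __init__(self, query):
        self.query = query
        self.delivered = []

    def on_new_document(self, document):
        if matches(self.query, document):
            self.delivered.append(document)
            return True
        return False


# Static collection, different queries.
collection = [
    "hubble telescope images",
    "chocolate makers in belgium",
    "hubble repair mission",
]
hits_hubble = retrospective_search("hubble", collection)
hits_chocolate = retrospective_search("chocolate", collection)

# Static query, dynamic stream of documents.
profile = StandingQuery("hubble")
for incoming in ["new hubble discovery", "football results"]:
    profile.on_new_document(incoming)
```

In the retrospective case the collection stays the same and the queries change; in the prospective case the query stays the same and the result grows as documents arrive.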
Retrospective Searches (I)
• Ad hoc retrieval: find documents “about this”
Identify positive accomplishments of the Hubble telescope since it
was launched in 1991.
• Directed exploration
Who makes the best chocolates?
• Routing
– Sort incoming documents into different bins?
Categorize news headlines: World? Nation? Metro? Sports?
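Routing headlines into bins could be sketched as below. The keyword lists in `BIN_KEYWORDS` and the `route` function are hypothetical and chosen only for illustration; a practical router would more likely learn its categories from labelled examples.

```python
# Hypothetical keyword lists per bin (assumption, not from the slides).
BIN_KEYWORDS = {
    "World": {"un", "treaty", "embassy", "summit"},
    "Nation": {"parliament", "election", "minister"},
    "Metro": {"city", "council", "traffic"},
    "Sports": {"match", "league", "goal", "marathon"},
}


def route(headline, bin_keywords=BIN_KEYWORDS, default="Unsorted"):
    """Assign the headline to the bin that shares the most keywords with it."""
    words = set(headline.lower().split())
    best_bin, best_overlap = default, 0
    for name, keywords in bin_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_bin, best_overlap = name, overlap
    return best_bin


metro = route("City council debates traffic plan")
sports = route("Late goal decides league match")
unknown = route("Weather stays mild")
```

Headlines that share no keyword with any bin fall into the `default` bin instead of being forced into a category.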
Storage of Text
• Textual Documents
– Searchable as text
– Words are represented as ASCII/Unicode
• Image Documents
– Scanned images of text documents, which are not searchable as
text: the text (characters, words, etc.) is represented as
patterns of pixels
– Retrieval from Document Images: Two options
• Recognition-based retrieval: OCR is required to convert
document images to ASCII (the conversion may be error-prone);
text IR systems are then applied to the recognized documents
• Recognition-free retrieval: Retrieval from document images
without explicit recognition, i.e. searching for relevant
documents directly in the image collection
The Problem of IR
• Need
– The size and number of published documents keep increasing
– Traditional methods had difficulties in document processing
– Different disciplines (Biotechnology, Genetics..) produce
huge amounts of different types of data
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list.]
• Goal
– Find documents relevant to an information need from a large
document set
What is Information Retrieval ?
• Information retrieval is the process of searching a large,
unstructured corpus for relevant documents that satisfy the
information need of users
– It is a tool that finds and selects from a collection of items a
subset that serves the user’s purpose
[Figure: the IR system as a black box between the user and the documents]
Typical IR System Architecture
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs a ranked list of relevant documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
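The architecture above (query string and corpus in, ranked documents out) can be sketched as a single function. This is a minimal illustration that scores documents by the number of distinct query terms they contain; the function name `rank` and the sample corpus are assumptions, and real systems use richer weighting.

```python
def rank(query, corpus):
    """Return (doc_id, score) pairs sorted by descending score, dropping zero scores."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        doc_terms = set(text.lower().split())
        score = len(query_terms & doc_terms)  # number of shared distinct terms
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


corpus = {
    "Doc1": "information retrieval finds relevant documents",
    "Doc2": "storage of text documents",
    "Doc3": "cooking with chocolate",
}
ranking = rank("relevant documents retrieval", corpus)
```

Documents that share no term with the query are left out of the answer list entirely, so only candidates with some evidence of relevance are ranked.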
The Notion of Relevance
• Relevance is a subjective judgment and may include:
– Being timely (recent information)
– Being authoritative (from a trusted source)
– Satisfying the goals of the user and his/her intended use of the
information (information need)
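Anatomy of a Web Search Engine
[Figure: a web spider collects pages into the document corpus; a query string and the corpus are the inputs to the IR system, which outputs a ranked list of relevant documents: 1. Page1, 2. Page2, 3. Page3, ...]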
The Retrieval Process
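[Figure: the retrieval process. The user need enters through the user interface; text operations are applied both to the user need and to the text from the text database, producing a logical view of each.]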
• The user first specifies a user need, which is then parsed and
transformed by the same text operations that are applied to the text
• Comparing representations
– What is a “good” similarity measure & retrieval model?
– How is uncertainty represented?
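The two points above can be illustrated together: the same text operations are applied to the query and to the document, and the resulting representations are then compared. The sketch below uses cosine similarity over raw term frequencies as one common candidate measure, not necessarily the one this course settles on. The stop word list and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"the", "of", "a", "is", "for", "and"}


def text_operations(text):
    """Lowercase, replace punctuation with spaces, tokenize, drop stop words."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty representation matches nothing
    return dot / (norm_a * norm_b)


# The same operations produce the logical view of the query and of the document.
query_view = text_operations("Retrieval of the documents!")
doc_view = text_operations("Document retrieval is the retrieval of documents.")
score = cosine_similarity(query_view, doc_view)
```

Because no stemming is applied here, "document" and "documents" count as different terms, which lowers the score; this is one reason stemming and normalization appear among the text operations in the course outline.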