University of Gondar: Information Storage and Retrieval System

This document provides a course outline for an Information Storage and Retrieval System course. It lists various topics that will be covered, including an overview of IR, text document operations, indexing structures, IR models, retrieval evaluation, query languages, and current issues in IR. The course will be taught by Getachew G. Gedamu at the University of Gondar in Ethiopia.

Uploaded by

Aisha m

University of Gondar

College of Informatics
Department of Information Science

Information Storage and Retrieval System

(INSC 4913)

Getachew G.
gedamugetachew96@gmail.com

Gondar, Ethiopia
Course Outline
Topic(s)                   Details
Overview of IR             Define IR; The retrieval process; Basic structure of an IR system
Text Document Operations   Basic Laws in IR; Tokenization; Stop word detection; Stemming; Normalization; Term weighting; Similarity measures
Indexing Structures        The need for indexing; Sequential file; Inverted files
IR Models                  A Formal Characterization of IR Models; Boolean model, Vector space model & Probabilistic model
Retrieval Evaluation       Evaluation of IR systems; Relevance judgement; Retrieval effectiveness measures (Recall, Precision, F-measure, etc.)
Query Languages            Types of Query formulation; Keyword-based queries (Boolean queries); Pattern matching; Natural language queries
Current Issues in IR       IR in Local Languages; Information Extraction; Information Filtering; Text Summarization, Cross-language retrieval...
Text Collections and IR
• Information is organized into (a large number of) documents
– Large collections of documents are available from various sources: books, magazines, newspapers, journal articles, conference papers, digital libraries, Web pages, etc.

• Example: How Much Data?
– Google processes 20 PB a day (2008)
– Google's web search engine claims to index over 30 trillion pages (1995-2014)
– It performs more than 40,000 search queries each second on average, over 5.2 billion searches per day in 2017, and 4 trillion per year worldwide
– Wayback Machine has 50 PB of used storage (2014)
– Facebook has 100 PB of user data (2012)
– eBay has 6.5 PB of user data + 50 TB/day (2009)
Information as process
• Information = characteristics of the output of a process
– Tells us something about the process and the input

Input(s) → Process → Output(s)

– Information-generating processes do not occur in isolation

Input → Process1 → Process2 → … → Output
Where’s the human?
• If a tree falls in the forest, and no one is around to hear it, is information transmitted?
• In the “information as process” view: Yes, but that’s not very interesting to us
• We’re concerned about information for human consumption
– Transmission of information from one person to another
– Recording of information
– Reconstruction of stored information
Another View
• Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient”
– This implies that the sender has knowledge of the recipient's structure
• Text = “a collection of signs purposefully structured by a sender with the intention of changing the image-structure of a recipient”
• Information = “the structure of any text which is capable of changing the image-structure of a recipient”
Transfer of Information
• Communication = transmission of information

Sender (encoding)    Channel       Recipient (decoding)
Thoughts             Telepathy?    Thoughts
Words                Writing       Words
Sounds               Speech        Sounds
Information Theory
• Better called “communication theory”
• Developed by Claude Shannon in the 1940s
– Concerned with the transmission of electrical signals over wires
– How do we send information quickly and reliably?
• Underlies modern electronic communication:
– Voice and data traffic…
– Over copper, fiber optic, wireless, etc.
• Famous result: Channel Capacity Theorem
• Formal measure of information in terms of entropy
– Information = “reduction in surprise”
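To make the entropy measure concrete, the sketch below computes the Shannon entropy of a sequence of symbols in bits per symbol, estimating each symbol's probability from its relative frequency. The function name `shannon_entropy` and the sample strings are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Return the Shannon entropy of a sequence, in bits per symbol.

    Probabilities are estimated from relative frequencies in `symbols`.
    """
    counts = Counter(symbols)
    total = len(symbols)
    # p * log2(1/p), written as (c/total) * log2(total/c) for each symbol
    return sum((c / total) * math.log2(total / c) for c in counts.values())

fair_coin = shannon_entropy("HT")     # two equally likely outcomes
certain = shannon_entropy("HHHH")     # a single outcome carries no surprise
four_way = shannon_entropy("ABCD")    # four equally likely outcomes
print(fair_coin, certain, four_way)
```

A source that always emits the same symbol has zero entropy, which matches the "reduction in surprise" reading: observing its output tells us nothing new.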
The Noisy Channel Model
• Communication = producing the same message at the destination that was sent at the source
– The message must be encoded for transmission across a medium (called the channel)
– But the channel is noisy and can distort the message
• Semantics (meaning) is irrelevant

Source → message → Transmitter → channel (+ noise) → Receiver → message → Destination
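The model can be sketched in a few lines of code. The example below is an illustration under stated assumptions rather than anything from the slides: the channel is modeled as a binary symmetric channel that flips each bit with a fixed probability, and the transmitter uses a simple 3-fold repetition code with majority-vote decoding. All function names are hypothetical.

```python
import random

def noisy_channel(bits, flip_prob, rng):
    """Assumed channel model: flip each bit independently with flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def encode(bits, n=3):
    """Transmitter: repeat every bit n times (a simple repetition code)."""
    return [b for b in bits for _ in range(n)]

def decode(bits, n=3):
    """Receiver: majority vote over each group of n received bits."""
    return [int(sum(bits[i:i + n]) > n // 2) for i in range(0, len(bits), n)]

message = [1, 0, 1, 1, 0, 0, 1, 0]
rng = random.Random(0)  # fixed seed so repeated runs behave the same
received = decode(noisy_channel(encode(message), flip_prob=0.05, rng=rng))
errors = sum(m != r for m, r in zip(message, received))
print(f"{errors} of {len(message)} bits differ after decoding")
```

Note that the code never inspects what the bits mean, which is the sense in which semantics is irrelevant to the model.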
A Synthesis
• Information retrieval as communication over time and space, across a noisy channel

Communication: Source → message → Transmitter → channel (+ noise) → Receiver → message → Destination
Retrieval:     Sender → message → Encoding (indexing/writing) → storage (+ noise) → Decoding (retrieval/reading) → message → Recipient
Types of Information Needs
• Retrospective
– “Searching the past”
– Different queries posed against a static collection
– Time invariant

• Prospective
– “Searching the future”
– Static query posed against a dynamic collection
– Time dependent
Retrospective Searches (I)
• Ad hoc retrieval: find documents “about this”
    Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
    Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.

• Known item search
    Find Jimmy Lin’s homepage.
    What’s the ISBN number of “Modern Information Retrieval”?

• Directed exploration
    Who makes the best chocolates?
    What video conferencing systems exist for digital reference desk services?
Retrospective Searches (II)
• Question answering
– “Factoid” questions
    Who discovered Oxygen?
    When did Hawaii become a state?
    Where is Ayer’s Rock located?
    What team won the World Series in 1992?
– “List” questions
    What countries export oil?
    Name U.S. cities that have a “Shubert” theater.
– “Definition” questions
    Who is Aaron Copland?
    What is a quasar?
Prospective “Searches”
• Filtering
– Make a binary decision about each incoming document
    Spam or not spam?

• Routing
– Sort incoming documents into different bins
    Categorize news headlines: World? Nation? Metro? Sports?
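The two prospective tasks can be sketched as standing profiles applied to a stream of incoming documents. This is a minimal illustration, assuming simple keyword profiles; the term sets, the threshold, and the function names are all hypothetical and not taken from the slides.

```python
SPAM_TERMS = {"winner", "free", "prize"}   # hypothetical standing profile
BINS = {                                     # hypothetical routing profiles
    "Sports": {"match", "league", "goal"},
    "World": {"summit", "treaty", "embassy"},
}

def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return set(text.lower().split())

def is_spam(document, threshold=2):
    """Filtering: a binary decision about one incoming document."""
    return len(tokens(document) & SPAM_TERMS) >= threshold

def route(document):
    """Routing: choose the bin sharing the most terms, or None if none match."""
    words = tokens(document)
    best = max(BINS, key=lambda name: len(words & BINS[name]))
    return best if words & BINS[best] else None

stream = [
    "Free prize for every winner",
    "League match ends in a late goal",
    "Treaty signed at the summit",
]
for doc in stream:
    print(doc, "| spam:", is_spam(doc), "| bin:", route(doc))
```

The profiles stay fixed while the documents keep arriving, which is the "static query against a dynamic collection" setting.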
Storage of Text
• Textual Documents
– Searchable as text
– Words are represented as ASCII/Unicode
• Image Documents
– Scanned images of text documents, which are not searchable as text: text (characters, words, etc.) is represented as patterns of pixels
– Retrieval from document images: two options
  • Recognition-based retrieval: OCR is required to convert document images to ASCII (may be error-prone), and then text IR systems are applied to the recognized documents
  • Recognition-free retrieval: retrieval from document images without explicit recognition; relevant documents are searched for directly in the image collection
The Problem of IR
• Need
– The size and number of published documents keep increasing
– Traditional methods had difficulties in document processing
– Different disciplines (Biotechnology, Genetics, ...) produce different types of data in huge amounts

[Figure: Info. need → Query → IR system (Retrieval) over a Document collection → Answer list]

• Goal
– Find documents relevant to an information need from a large document set
What is Information Retrieval?
• Information retrieval is the process of searching a large unstructured corpus for relevant documents that satisfy the information need of users
– It is a tool that finds and selects from a collection of items a subset that serves the user’s purpose

• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Examples of IR System
• Many IR systems focus specifically on text retrieval, but there are many other IR areas:
– Cross-language retrieval, text summarization, information filtering, question answering, content-based multimedia (audio, image and video) retrieval

• Text-based (Lexis-Nexis, Google, FAST):
– Search by keywords.
– Limited search using queries in natural language.

• Multimedia (QBIC, WebSeek, SaFe):
– Search by visual features (shapes, colors, …).

• Question answering systems (AskJeeves, Answerbus):
– Search in (restricted) natural language

• Cross-language vs. multilingual information retrieval

Information Retrieval Serves as a Bridge
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users
• That is, writers present a set of ideas in a document using a set of concepts
• Users then query the IR system for relevant documents that satisfy their information need

User ↔ Black box ↔ Documents
Typical IR System Architecture

[Figure: a Query String and a Document corpus feed the IR System, which returns Ranked Relevant Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
The Notion of Relevance
• Relevance is a subjective judgment and may include:
– Being timely (recent information)
– Being authoritative (from a trusted source)
– Satisfying the goals of the user and his/her intended use of the
information (information need)

• Relevant information is information suited to your information need
• What is actually needed (relevant)
– Depends on: user, space/time, group, and context
• IR is very concerned with relevance

IR System vs. Web Search System

[Figure: a Spider crawls the Web to build the Document corpus; a Query String and the corpus feed the IR System, which returns Ranked Relevant Documents: 1. Page1, 2. Page2, 3. Page3, ...]
Anatomy of a Web Search Engine
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Formulation, Indexing, Searching, Ranking, Text Database, Inverted file / Index file. Flows: user need, text, logical view, query, DocID, retrieved docs, ranked docs, user feedback, similarity measures.]
The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• The text operations transform the original documents & the information needs and generate a logical view of them
• Once the logical view of the documents is defined, the database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
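The steps above can be sketched as follows: text operations turn each document into a logical view (here, a list of terms), and the index maps each term to the documents containing it. This is a minimal sketch of an inverted file, assuming lowercasing, punctuation stripping, and a tiny stop-word list as the text operations; the function names and the sample documents are illustrative only.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # tiny illustrative list

def text_operations(text):
    """Produce a logical view: lowercase terms with stop words removed."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text_operations(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

documents = {
    1: "The inverted file is an index of the text.",
    2: "An index allows fast searching.",
    3: "Ranking orders the retrieved documents.",
}
index = build_inverted_index(documents)
print(index["index"])  # IDs of the documents containing the term "index"
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is where the speed-up on large collections comes from.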
The Retrieval Process
• Different index structures might be used, but the most popular one is the inverted file (more on this later), as indicated in the slide
• Given that the document database is indexed, the retrieval process can be initiated
• The user first specifies a user need, which is then parsed & transformed by the same text operations applied to the text
– Next, the query operations are applied before the actual query, which provides a system representation for the user need, is generated
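The point that the user need passes through the same text operations as the documents can be illustrated with a short sketch. It assumes a deliberately simple text operation (lowercasing and whitespace splitting) and a conjunctive (AND) query operation; `normalize`, `build_index`, and `search` are hypothetical names, not an API from the course.

```python
def normalize(text):
    """Text operation shared by documents and queries: lowercase, then split."""
    return text.lower().split()

def build_index(documents):
    """Inverted file: term -> set of IDs of documents containing the term."""
    index = {}
    for doc_id, text in documents.items():
        for term in normalize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, user_need):
    """Turn the user need into query terms, then intersect their postings."""
    query_terms = normalize(user_need)
    if not query_terms:
        return []
    postings = [index.get(term, set()) for term in query_terms]
    return sorted(set.intersection(*postings))

documents = {
    1: "fast index searching",
    2: "index structures",
    3: "ranking documents",
}
index = build_index(documents)
hits = search(index, "Index SEARCHING")
print(hits)
```

Because the query is normalized the same way as the indexed text, the differently cased query still matches; skipping that step on either side would make the lookup fail.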
The Retrieval Process
• The query is then processed to retrieve documents
– Before the retrieved documents are sent to the user, they are ranked according to the likelihood of relevance
• The user then examines the set of ranked documents in the search for useful information
• At this point, the user might pinpoint a subset of the documents seen as definitely of interest & initiate a user feedback cycle
– In such a cycle, the system uses the documents selected by the user to change the query formulation
– Hopefully, this modified query is a better representation of the real user need
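The ranking and feedback cycle described above can be sketched as below. This is an illustration under simplifying assumptions: relevance is approximated by counting query-term occurrences, and feedback simply adds the most frequent new terms from the documents the user selected. The slides do not prescribe either method, and all names and sample data are hypothetical.

```python
from collections import Counter

def score(query_terms, doc_terms):
    """Count query-term occurrences in a document (a crude relevance proxy)."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

def rank(query_terms, documents):
    """Return IDs of documents with a non-zero score, best first (ties by ID)."""
    scored = [(score(query_terms, terms), doc_id)
              for doc_id, terms in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]

def expand_query(query_terms, selected_docs, documents, extra_terms=2):
    """Feedback: add the most frequent new terms from user-selected documents."""
    pool = Counter(t for d in selected_docs for t in documents[d]
                   if t not in query_terms)
    ordered = sorted(pool.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [t for t, _ in ordered[:extra_terms]]

documents = {
    1: ["jaguar", "car", "engine", "car"],
    2: ["jaguar", "cat", "habitat"],
    3: ["car", "engine", "repair"],
}
query = ["jaguar"]
first_pass = rank(query, documents)
new_query = expand_query(query, [1], documents)  # user marks doc 1 as relevant
second_pass = rank(new_query, documents)
print(first_pass, new_query, second_pass)
```

In this toy run the ambiguous query first retrieves both senses of "jaguar"; after feedback the expanded query also pulls in the car-repair document and pushes the animal document to the bottom.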
Issues that arise in IR
• Text document representation
– What makes a “good” representation?
– How is a representation generated from text?
– What are the retrievable objects & how are they organized?

• Information need representation

– What is an appropriate query language?
– How can interactive query formulation & refinement be supported?

• Comparing representations
– What is a “good” similarity measure & retrieval model?
– How is uncertainty represented?

• Evaluating effectiveness of retrieval

– What are good metrics?
– What constitutes a good experimental test bed?
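As one concrete answer to the metrics question, the course outline names Recall, Precision, and F-measure. The sketch below computes them for a single query from a set of retrieved document IDs and a set of judged-relevant IDs; the function name `evaluate` and the sample IDs are assumptions made for illustration.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    retrieved: IDs returned by the system; relevant: IDs judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

precision, recall, f1 = evaluate(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3).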
Students’ Reflection:
What are the main components of an Information Retrieval System?
a) ____________________________________
b) ____________________________________
c) ____________________________________

What are the main differences between an Information Retrieval System and a Database Management System?
a) ____________________________________
b) ____________________________________
c) ____________________________________