University of Gondar: Information Storage and Retrieval System


University of Gondar

College of Informatics
Department of Information Science

Information Storage and Retrieval System


(INSC 4913)

Getachew G.
gedamugetachew96@gmail.com

Gondar, Ethiopia
Course Outline
Topic(s): Details
• Overview of IR: Define IR; The retrieval process; Basic structure of an IR system
• Text Document Operations: Basic Laws in IR; Tokenization; Stop word detection; Stemming; Normalization; Term weighting; Similarity measures
• Indexing Structures: The need for indexing; Sequential file; Inverted files
• IR Models: A Formal Characterization of IR Models; Boolean model, Vector space model & Probabilistic model
• Retrieval Evaluation: Evaluation of IR systems; Relevance judgement; Retrieval effectiveness measures (Recall, Precision, F-measure, etc.)
• Query Languages: Types of Query formulation; Keyword-based queries (Boolean queries); Pattern matching; Natural language queries
• Current Issues in IR: IR in Local Languages; Information Extraction; Information Filtering; Text Summarization, Cross-language retrieval...
Text Collections and IR
• Information is organized into (a large number of)
documents
₋ Large collections of documents available from various sources:
books, magazines, newspapers, journal articles, conference
papers, digital libraries, Web pages, etc.

• Example: How Much Data?


– Google processes 20 PB a day (2008)
– Google Web Search Engine claims to index over 30 trillion pages (1995-2014)
– It handles more than 40,000 search queries per second on average, over 5.2 billion searches per day in 2017, and 4 trillion per year worldwide
– Wayback Machine has 50 PB of used storage (2014)
– Facebook has 100 PB of user data (2012)
– eBay has 6.5 PB of user data + 50 TB/day (2009)
Information as process
• Information = characteristics of the output of a process
– Tells us something about the process and the input
– Input → Process → Output
– Information-generating processes do not occur in isolation
– Input → Process1 → Process2 → … → Output
Where’s the human?
• If a tree falls in the forest, and no one is around to hear it, is information transmitted?
• In the “information as process” view: yes, but that’s not very interesting to us
• We’re concerned about information for human consumption
– Transmission of information from one person to another
– Recording of information
– Reconstruction of stored information
Another View
• Information science is characterized by “the deliberate
(purposeful) structure of the message by the sender in
order to affect the image structure of the recipient”
– This implies that the sender has knowledge of the recipient's
structure

• Text = “a collection of signs purposefully structured by a sender with the intention of changing the image-structure of a recipient”
• Information = “the structure of any text which is capable of changing the image-structure of a recipient”
Transfer of Information
• Communication = transmission of information
– Thoughts → Thoughts: telepathy?
– Words → Words: writing
– Sounds → Sounds: speech
– The sender encodes; the receiver decodes
Information Theory
• Better called “communication theory”

• Developed by Claude Shannon in the 1940s
– Concerned with the transmission of electrical signals over wires
– How do we send information quickly and reliably?
• Underlies modern electronic communication:
– Voice and data traffic…
– Over copper, fiber optic, wireless, etc.
• Famous result: Channel Capacity Theorem
• Formal measure of information in terms of entropy
– Information = “reduction in surprise”
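The entropy measure mentioned above can be illustrated with a minimal sketch. It assumes the standard Shannon definition, H = Σ p · log2(1/p), estimated from symbol frequencies; the function name `entropy` and the coin-toss strings are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    # Each symbol with probability p = c / total contributes p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())

fair = entropy("HT")       # two equally likely outcomes
biased = entropy("HHHT")   # one outcome is far more likely
certain = entropy("HHHH")  # the outcome is never a surprise

print(fair, biased, certain)
```

The more predictable the source, the lower the entropy: an outcome that is certain carries no information, which is one way to read “reduction in surprise”.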
The Noisy Channel Model
• Communication = producing the same message at the
destination that was sent at the source
– The message must be encoded for transmission across a
medium (called channel)
– But the channel is noisy and can distort the message

• Semantics (meaning) is irrelevant
• Source → message → Transmitter → channel (subject to noise) → Receiver → message → Destination
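As a hedged illustration of the model, the sketch below simulates a channel that flips bits at random and a transmitter/receiver pair that uses a repetition code with majority voting. The repetition code, the flip probability, and all function names are assumptions chosen for simplicity; the slides do not prescribe any particular coding scheme.

```python
import random

def encode(bits, n=3):
    """Transmitter: repeat every bit n times (a simple repetition code)."""
    return [b for b in bits for _ in range(n)]

def noisy_channel(bits, flip_prob, rng):
    """Channel: flip each bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def decode(bits, n=3):
    """Receiver: majority vote over each group of n received bits."""
    return [1 if sum(bits[i:i + n]) > n // 2 else 0
            for i in range(0, len(bits), n)]

message = [1, 0, 1, 1, 0]
rng = random.Random(0)  # fixed seed so that runs are repeatable
received = decode(noisy_channel(encode(message), 0.05, rng))
print(message, received)
```

Note that the code never looks at what the bits mean: the only goal is that the destination reproduces the message sent by the source.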
A Synthesis
• Information retrieval as communication over time and space, across a noisy channel
– Communication: Source → message → Transmitter → channel (subject to noise) → Receiver → message → Destination
– Retrieval: Sender → message → Encoding (indexing/writing) → storage (subject to noise) → Decoding (retrieval/reading) → message → Recipient
Types of Information Needs
• Retrospective
– “Searching the past”
– Different queries posed against a static collection
– Time invariant

• Prospective
– “Searching the future”
– Static query posed against a dynamic collection
– Time dependent
Retrospective Searches (I)
• Ad hoc retrieval: find documents “about this”
– Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
– Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.
• Known item search
– Find Jimmy Lin’s homepage.
– What’s the ISBN number of “Modern Information Retrieval”?
• Directed exploration
– Who makes the best chocolates?
– What video conferencing systems exist for digital reference desk services?
Retrospective Searches (II)
• Question answering
– “Factoid”: Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992?
– “List”: What countries export oil? Name U.S. cities that have a “Shubert” theater.
– “Definition”: Who is Aaron Copland? What is a quasar?
Prospective “Searches”
• Filtering
– Make a binary decision about each incoming
document
Spam or not spam?

• Routing
– Sort incoming documents into different bins?
Categorize news headlines: World? Nation? Metro? Sports?
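The difference between filtering and routing can be sketched as follows. This is a toy, keyword-based illustration: the term lists, the threshold, and the function names `is_spam` and `route` are hypothetical, and real systems typically learn such decisions from labelled data.

```python
# Hypothetical keyword lists, for illustration only.
SPAM_TERMS = {"winner", "prize", "free"}
BINS = {
    "World": {"summit", "treaty", "embassy"},
    "Sports": {"match", "goal", "team"},
    "Metro": {"subway", "council", "mayor"},
}

def is_spam(text, threshold=2):
    """Filtering: a binary decision about one incoming document."""
    words = set(text.lower().split())
    return len(words & SPAM_TERMS) >= threshold

def route(headline):
    """Routing: place one incoming document into the best-matching bin."""
    words = set(headline.lower().split())
    scores = {name: len(words & terms) for name, terms in BINS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unsorted"

print(is_spam("you are a winner claim your free prize"))
print(route("home team scores late goal"))
```

In both cases the "query" (the keyword profile) stays fixed while new documents keep arriving, which is what makes these searches prospective.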
Storage of Text
• Textual Documents
– Searchable as text
– Words are represented as ASCII/Unicode
• Image Documents
– Scanned images of text documents, which are not searchable as text: characters, words, etc. are represented as patterns of pixels
– Retrieval from document images: two options
• Recognition-based retrieval: OCR converts document images to ASCII (may be error prone), and text IR systems are then applied to the recognized documents
• Recognition-free retrieval: retrieval from document images without explicit recognition
• Searches for relevant documents directly in image collections
The Problem of IR
• Need
– The size and number of published documents keep increasing
– Traditional methods have difficulty with document processing
– Different disciplines (Biotechnology, Genetics, ...) produce huge amounts of data of different types
• Information need → Query → IR system (retrieval over the document collection) → Answer list
• Goal
– Find documents relevant to an information need from a large document set
What is Information Retrieval ?
• Information retrieval is the process of searching a large unstructured corpus for relevant documents that satisfy the users' information need
– It is a tool that finds and selects from a collection of items a subset that serves the user’s purpose
• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Examples of IR System
• Many IR systems focus specifically on text retrieval, but there are many other IR areas:
– Cross-language retrieval, text summarization, information filtering, question answering, content-based multimedia (audio, image and video) retrieval
• Text-based (Lexis-Nexis, Google, FAST):
– Search by keywords.
– Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe):
– Search by visual features (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus):
– Search in (restricted) natural language

• Cross language vs. Multilingual Information Retrieval


Information Retrieval Serves as a Bridge
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users
• That is, writers present a set of ideas in a document using a set of concepts
• Users then query the IR system for relevant documents that satisfy their information need
• User ↔ IR system (black box) ↔ Documents
Typical IR System Architecture
• Inputs: a query string and a document corpus
• Output: a ranked list of relevant documents (1. Doc1, 2. Doc2, 3. Doc3, …)
The Notion of Relevance
• Relevance is a subjective judgment and may include:
– Being timely (recent information)
– Being authoritative (from a trusted source)
– Satisfying the goals of the user and his/her intended use of the
information (information need)

• Relevant information is information suited to your information need
• What is actually needed (relevant) depends on: user, space/time, group and context
• IR is very concerned with relevance


IR System vs. Web Search System
• In a Web search system, a Web spider collects pages into the document corpus
• The IR system then takes a query string and returns ranked relevant documents (1. Page1, 2. Page2, 3. Page3, …)
Anatomy of a Web Search Engine
The Retrieval Process
• Components shown in the diagram: User Interface, Text Operations (producing the logical view), Query Formulation with user feedback, Indexing, Text Database, inverted file / index file, Searching with similarity measures, Ranking
• Data flowing between them: user need, text, query, DocID, retrieved docs, ranked docs
The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• The text operations transform the original documents & the information needs and generate a logical view of them
• Once the logical view of the documents is defined, the database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
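The two steps above (text operations, then index construction) can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word list is a tiny made-up sample, no stemming is applied, and the names `text_operations` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def text_operations(text):
    """Logical view of a text: lowercase tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "The retrieval of information",
    2: "Information is stored in an index",
    3: "Index structures and retrieval",
}
index = build_inverted_index(docs)
print(sorted(index["information"]))
print(sorted(index["retrieval"]))
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is why the index allows fast searching over large collections.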
The Retrieval Process
• Different index structures might be used, but the most popular one is the inverted file (more on this later), as indicated in the diagram
• Once the document database is indexed, the retrieval process can be initiated
• The user first specifies a user need, which is then parsed & transformed by the same text operations applied to the text
– Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated
The Retrieval Process
• The query is then processed to retrieve documents
– Before the retrieved documents are sent to the user, they are ranked according to the likelihood of relevance
• The user then examines the set of ranked documents in the search for useful information
• At this point, the user might pinpoint a subset of the documents seen as definitely of interest & initiate a user feedback cycle
– In such a cycle, the system uses the documents selected by the user to change the query formulation
– Hopefully, this modified query is a better representation of the real user need
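The ranking and feedback cycle can be sketched with a deliberately simple scheme. As assumptions for illustration only, documents are scored by counting occurrences of query terms, and feedback adds the most frequent new term from the documents the user selected; the slides do not specify a scoring or feedback method, and the names `rank` and `expand_query` are hypothetical.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query_terms, docs):
    """Return ids of matching docs, highest query-term count first."""
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        score = sum(tf[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

def expand_query(query_terms, selected_ids, docs, extra=1):
    """Feedback: add the most frequent new terms from the selected docs."""
    counts = Counter()
    for doc_id in selected_ids:
        counts.update(t for t in tokenize(docs[doc_id]) if t not in query_terms)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ordered[:extra]]

docs = {
    1: "jaguar speed car engine car",
    2: "jaguar habitat rainforest cat",
    3: "rainforest cat species cat",
}
first = rank(["jaguar"], docs)                 # initial ranked list
query2 = expand_query(["jaguar"], [2], docs)   # user marks doc 2 as relevant
second = rank(query2, docs)                    # ranking after feedback
print(first, query2, second)
```

After feedback, the document about the animal moves to the top and a related document that never mentions the original query term becomes retrievable, which is the sense in which the modified query may better represent the user need.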
Issues that arise in IR
• Text document representation
– What makes a “good” representation?
– How is a representation generated from text?
– What are the retrievable objects & how are they organized?
• Information need representation
– What is an appropriate query language?
– How can interactive query formulation & refinement be supported?
• Comparing representations
– What is a “good” similarity measure & retrieval model?
– How is uncertainty represented?
• Evaluating effectiveness of retrieval
– What are good metrics?
– What constitutes a good experimental test bed?
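As one possible answer to the metrics question, the course outline names Recall, Precision and F-measure. The sketch below computes them for a single query under the usual set-based definitions; the function name and the sample document ids are illustrative assumptions.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The system returned docs 1-4; the relevance judgements say 2, 4 and 5 are relevant.
p, r, f = precision_recall_f1([1, 2, 3, 4], [2, 4, 5])
print(p, r, f)
```

Precision asks how much of what was returned is relevant, recall asks how much of what is relevant was returned, and F1 is their harmonic mean.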
Students’ Reflection:
What are the main components of an Information Retrieval System?
a) ____________________________________
b) ____________________________________
c) ____________________________________

What are the main differences between an Information Retrieval System and a Database Management System?
a) ____________________________________
b) ____________________________________
c) ____________________________________