Information Retrieval System-Chapter-1



IRS
Information retrieval (IR) deals with the representation,
storage, organization of, and access to information items.
Information retrieval (IR) is the process of finding
relevant documents that satisfy the information needs
of users from large collections of unstructured text.

General Goal of Information Retrieval
1. To help users find useful information that meets their
information needs with minimum effort, despite:
   (a) Increasing complexity of information: documents may be
   structured or unstructured, the corpus may be of any size,
   and the documents may be distributed.
   (b) Changing needs of users: users may search for the same
   documents using different text and names.
2. To provide immediate random access to the document
collection (efficient searching).

IRS Design/Architecture

Web Search System

(For general users, the IR system is integrated with the Web, e.g. search engines.)

Data retrieval is handled by a DBMS, which is owned by a specific organization.

Indexing is done at the time of storage (organization of information).

A Formal Characterization of IR Models

Boolean Model
The Boolean model is a simple retrieval model based on
set theory and Boolean algebra.
The Boolean model provides a framework that is easy
for a common user of an IR system to grasp.
Queries are specified as Boolean expressions.
The result is binary: a document is either retrieved as relevant or it is not.
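
As a minimal sketch of this set-based view, the Python below builds a small inverted index and answers AND, OR, and NOT queries with set operations. The toy documents, the `build_index` helper, and the whitespace tokenization are assumptions made for illustration only.

```python
# Hypothetical toy collection; ids and texts are illustrative only.
docs = {
    1: "information retrieval deals with storage and access",
    2: "database systems support data retrieval",
    3: "search engines retrieve information from the web",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return index.get(t, set())

# information AND retrieval -> set intersection
and_result = term("information") & term("retrieval")
# information OR data -> set union
or_result = term("information") | term("data")
# retrieval AND NOT information -> intersection with the complement
not_result = term("retrieval") & (all_ids - term("information"))
```

Each query returns a plain set of document ids, so a document is either in the answer or not, with no ordering among the retrieved documents.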


Drawbacks of the Boolean Model


Its retrieval strategy is based on a binary decision
criterion (a document is predicted to be either relevant
or non-relevant), without any notion of a grading scale.
The Boolean model is in reality much more a data
(rather than information) retrieval model.
Most users find it difficult to express their query
requests as Boolean expressions.


Vector Model
The vector model proposes a framework in which partial matching is
possible.
This is done by assigning non-binary weights to index
terms in queries and in documents.
These term weights are used to compute the degree
of similarity between each document stored in the
system and the user query.
Finally, the retrieved documents are sorted in decreasing
order of their degree of similarity.
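
The steps above can be sketched in Python as follows. This is only an illustration: raw term frequency is assumed as the non-binary weight, cosine is assumed as the similarity measure, and the toy documents and the helper names `weights`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection and query.
docs = {
    "d1": "information retrieval system",
    "d2": "database system design",
    "d3": "information information need",
}
query = "information retrieval"

def weights(text):
    """Non-binary term weights; raw term frequency is an assumed scheme."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(docs, query):
    """Score every document against the query, highest similarity first."""
    q = weights(query)
    scored = [(doc_id, cosine(weights(text), q)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(docs, query)
```

In this toy run `d3` matches only one of the two query terms, yet it still receives a non-zero score and is ranked above `d2`, which is the partial matching the model is meant to allow.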

The degree of similarity can be calculated as:
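
Assuming the standard cosine measure of the vector model, the similarity between a document d_j and a query q can be written as below. The notation is an assumption: w_{i,j} is the weight of index term k_i in document d_j, w_{i,q} is its weight in the query, and t is the number of index terms.

```latex
\mathrm{sim}(d_j, q)
  = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
```

Since the weights are non-negative, the value lies between 0 (no shared terms) and 1 (identical direction).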


Advantages of the Vector Model


(1) Its term-weighting scheme improves retrieval
performance.
(2) Its cosine ranking formula sorts the documents
according to their degree of similarity to the query.


Disadvantages of the Vector Model


Theoretically, index terms are assumed to be
mutually independent.


Probabilistic Model (BIR)


The probabilistic model was introduced in 1976 by Robertson and
Sparck Jones and later became known as the binary
independence retrieval (BIR) model.
The probabilistic model attempts to capture the IR
problem within a probabilistic framework.


Fundamental Idea of the Probabilistic Model
Given a user query, there is a set of documents that
contains exactly the relevant documents and no others
(this is called the ideal answer set).
Querying can then be seen as specifying the properties
of this ideal answer set.
The problem is that we do not know exactly what these
properties are.
Since these properties are not known at query time, an
effort has to be made to guess initially what they
could be.


The initial guess allows us to generate a preliminary
probabilistic description of the ideal answer set, which is
used to retrieve a first set of documents.
An interaction with the user is then initiated with the
purpose of improving this probabilistic description.
By repeating this feedback, the description moves closer
to the ideal answer set.
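
One way to picture this guess-then-refine loop is the Python sketch below. It is an illustration under stated assumptions, not the slides' own procedure: the initial guess uses P(k|R) = 0.5 and P(k|not R) = n/N, the feedback step uses the common 0.5-adjusted estimates, and the toy documents, the user judgments, and all helper names are hypothetical.

```python
import math

# Hypothetical toy collection: each document is a set of index terms.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"information", "system"},
    "d3": {"database", "system"},
    "d4": {"retrieval", "evaluation"},
    "d5": {"database", "design"},
    "d6": {"web", "search"},
}
query = {"information", "retrieval"}
N = len(docs)

def doc_freq(k):
    """Number of documents that contain term k."""
    return sum(1 for terms in docs.values() if k in terms)

def initial_estimates():
    # Assumed initial guess: P(k|R) = 0.5 and P(k|not R) = n / N.
    # A term occurring in every document would need smoothing; the toy data avoids that.
    return {k: (0.5, doc_freq(k) / N) for k in query}

def feedback_estimates(relevant):
    # relevant: ids the user judged relevant (hypothetical feedback).
    V = len(relevant)
    est = {}
    for k in query:
        V_k = sum(1 for d in relevant if k in docs[d])
        n_k = doc_freq(k)
        est[k] = ((V_k + 0.5) / (V + 1), (n_k - V_k + 0.5) / (N - V + 1))
    return est

def score(terms, est):
    """Sum of log-odds weights over query terms present in the document."""
    total = 0.0
    for k in query & terms:
        p, u = est[k]
        total += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return total

def rank(est):
    return sorted(docs, key=lambda d: score(docs[d], est), reverse=True)

first_pass = rank(initial_estimates())
second_pass = rank(feedback_estimates({"d1", "d2"}))
```

In the first pass `d2` and `d4` receive the same score because each contains one query term with the same document frequency. After the hypothetical user marks `d1` and `d2` as relevant, the re-estimated probabilities separate them.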


The degree of similarity in the probabilistic model

where ni is the number of documents containing keyword ki,
and N is the total number of documents.
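
For reference, the quantities n_i and N typically enter the BIR ranking through the initial probability estimates. Assuming the standard textbook form of the BIR similarity (an assumption, since the expression is not reproduced in the text), with binary weights w_{i,q}, w_{i,j} in {0, 1}, R the set of relevant documents, and \bar{R} its complement:

```latex
\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j}
  \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

% Assumed initial guesses, before any feedback:
P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}
```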


Cluster-Based Retrieval Model


Cluster-based retrieval has as its foundation the cluster
hypothesis.
It states that closely associated documents tend to be
relevant to the same requests.
Clustering picks out closely associated documents and
groups them together into one cluster.
For each cluster there is one cluster representative
(C.R.).
Each C.R. holds a weight, which further helps the
search.
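
A minimal Python sketch of searching through cluster representatives is given below. Several choices here are assumptions, not taken from the text: the clusters are given in advance, each C.R. is taken to be the centroid (average term weights) of its documents, cosine is used for matching, and the data and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical clusters, assumed to have been formed beforehand.
clusters = {
    "c1": {"d1": "information retrieval model", "d2": "retrieval evaluation model"},
    "c2": {"d3": "database transaction log", "d4": "database index design"},
}

def vec(text):
    """Term-frequency vector of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def representative(members):
    """Assumed C.R.: average term weight over the cluster's documents."""
    total = Counter()
    for text in members.values():
        total.update(vec(text))
    return {t: w / len(members) for t, w in total.items()}

def search(query):
    """Pick the cluster whose C.R. best matches the query, then rank inside it."""
    q = vec(query)
    reps = {cid: representative(m) for cid, m in clusters.items()}
    best = max(reps, key=lambda cid: cosine(reps[cid], q))
    members = clusters[best]
    ranked = sorted(members, key=lambda d: cosine(vec(members[d]), q), reverse=True)
    return best, ranked

best_cluster, ranked_docs = search("information retrieval")
```

The query is compared with one representative per cluster before any individual document is examined, so only the documents of the best-matching cluster are scored.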