IRS PPT Unit II


UNIT-II

Retrieval Utilities

Utilities improve the results of a retrieval strategy. Most utilities add or remove
terms from the initial query in an attempt to refine the query.

Relevance Feedback

A popular information retrieval utility is relevance feedback. The basic premise is to
implement retrieval in multiple passes. The user refines the query in each pass
based on the results of previous queries.
Fig: Relevance feedback process
Relevance Feedback in the Vector Space Model
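Rocchio builds the new query Q' from the old query Q using the equation
given below:

$$Q' = Q + \frac{1}{n_1}\sum_{i=1}^{n_1} R_i - \frac{1}{n_2}\sum_{i=1}^{n_2} S_i$$

R is the set of n1 relevant document vectors and S is the set of n2 non-relevant
document vectors. Ri and Si are individual components of R and S, respectively.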


The document vectors from the relevant documents are added to the initial query
vector, and the vectors from the non-relevant documents are subtracted. If all
documents are relevant, the third term does not appear.
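The update described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the unweighted form in which the mean of the relevant vectors is added and the mean of the non-relevant vectors is subtracted. The function name `rocchio` and the toy vectors are illustrative only and do not come from the slides.

```python
def rocchio(query, relevant, nonrelevant):
    """Return Q' = Q + mean(relevant) - mean(nonrelevant) as a list of floats.

    Assumption: unweighted averaging; each vector is a list of term weights
    with the same length as the query vector.
    """
    new_query = [float(q) for q in query]
    for docs, sign in ((relevant, 1.0), (nonrelevant, -1.0)):
        if not docs:
            # An empty set contributes no term (e.g. no non-relevant documents).
            continue
        for doc in docs:
            for k, weight in enumerate(doc):
                new_query[k] += sign * weight / len(docs)
    return new_query


# Hypothetical three-term vocabulary.
q = [1.0, 0.0, 1.0]
rel = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrel = [[0.0, 0.0, 2.0]]
print(rocchio(q, rel, nonrel))
```

A term that appears only in non-relevant documents can end up with a negative weight. Whether to keep or clip such weights is a design choice this sketch leaves open.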
Clustering
• Clustering provides a grouping of similar objects into a class under a more
general title
– Clustering also allows linkage between clusters to be specified
• An information database can be viewed as being composed of a number of
independent items indexed by a series of index terms
Term clustering
Used to create a statistical thesaurus
Increase recall by expanding searches with related terms (query expansion)
Document clustering
Used to create document clusters
The search can retrieve items similar to an item of interest, even if the query
would not have retrieved the item (resultant set expansion)
Result-set clustering
Complete Term Relation Method

• The similarity between every term pair is calculated as a basis for
determining the clusters
• Using the vector model for clustering
• A similarity measure is required to calculate the similarity between two terms
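The two steps above (pairwise term similarity, then a threshold) can be sketched as follows. This is a minimal sketch, assuming the similarity of two terms is the sum over all items of the product of their weights, and that a pair counts as related when its similarity is at least the threshold. The item/term weights below are made up for illustration and are not the matrix from the slides.

```python
def term_term_matrix(item_term):
    """Similarity of every term pair: sum over items of weight products."""
    n_terms = len(item_term[0])
    sim = [[0] * n_terms for _ in range(n_terms)]
    for a in range(n_terms):
        for b in range(n_terms):
            if a != b:  # the diagonal is left at 0; a term is not paired with itself
                sim[a][b] = sum(row[a] * row[b] for row in item_term)
    return sim


def relationship_matrix(sim, threshold):
    """Binary matrix: 1 where the similarity reaches the threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in sim]


# Hypothetical data: rows are items, columns are terms.
items = [
    [0, 4, 0],
    [3, 1, 4],
    [3, 0, 0],
    [0, 1, 0],
]
sim = term_term_matrix(items)
related = relationship_matrix(sim, threshold=10)
print(sim)
print(related)
```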
Cont….

Term-Term Matrix Threshold = 10

Term Relationship Matrix


Cont….

The final step in creating clusters is to determine when two objects (words) are in
the same cluster

• Cliques
• single link
• stars
• connected components
Cliques

• Cliques require all terms in a cluster to be within the threshold of all
other terms
– Class 1 (Term 1, Term 3, Term 4, Term 6)
– Class 2 (Term 1, Term 5)
– Class 3 (Term 2, Term 4, Term 6)
– Class 4 (Term 2, Term 6, Term 8)
– Class 5 (Term 7)
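The clique rule corresponds to finding the maximal cliques of the graph whose edges are the related term pairs. The sketch below uses the Bron-Kerbosch recursion, which is one standard way to enumerate maximal cliques; the slides do not say which procedure was used. The relationship matrix is not reproduced in this text, so the edge list `EDGES` is an assumption inferred from the five classes listed above (every pair that shares a class is taken to be related).

```python
# Assumed related pairs, inferred from the clique classes listed above.
EDGES = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6),
         (2, 8), (3, 4), (3, 6), (4, 6), (6, 8)]
TERMS = range(1, 9)


def build_neighbors(terms, edges):
    """Map each term to the set of terms it is related to."""
    neighbors = {t: set() for t in terms}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def maximal_cliques(neighbors):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # nothing can extend this clique
            return
        for term in sorted(candidates):
            expand(current | {term},
                   candidates & neighbors[term],
                   excluded & neighbors[term])
            candidates = candidates - {term}
            excluded = excluded | {term}

    expand(set(), set(neighbors), set())
    return sorted(cliques)


clique_classes = maximal_cliques(build_neighbors(TERMS, EDGES))
for number, members in enumerate(clique_classes, start=1):
    print(f"Class {number}: {members}")
```

Note that a term may belong to several cliques (Term 1 is in two classes), and an unrelated term such as Term 7 forms a class of its own.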
Single Link
• Any term that is similar to any term in the cluster can be added to the cluster
• It is impossible for a term to be in two different clusters
• Overhead in assignment of terms to classes: O(n²)
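Because any link to any member is enough to join a cluster, single-link classes are the connected components of the term relationship graph. The sketch below finds them with a simple graph traversal. The edge list is an assumption inferred from the clique classes listed earlier in this unit, since the relationship matrix itself is not reproduced here; the function name is illustrative.

```python
def single_link_classes(terms, edges):
    """Group terms into connected components of the relationship graph."""
    neighbors = {t: set() for t in terms}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    classes, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        members, frontier = set(), [start]
        while frontier:
            term = frontier.pop()
            if term in members:
                continue
            members.add(term)
            # Any term related to a member joins the same class.
            frontier.extend(neighbors[term] - members)
        seen |= members
        classes.append(sorted(members))
    return classes


# Assumed related pairs (see the lead-in above).
edges = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6),
         (2, 8), (3, 4), (3, 6), (4, 6), (6, 8)]
single_link = single_link_classes(range(1, 9), edges)
print(single_link)
```

Under this assumed edge list, seven of the eight terms collapse into one class, which illustrates how single link tends to produce large, loosely related clusters.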
Star
• Select a term and then place in the class all terms that are
related to that term
• Terms not yet in classes are selected as new seeds until all terms
are assigned to a class
• There are many different classes that can be created using the
Star technique
• If we always choose as the starting point for a class the lowest
numbered term not already in a class, the classes are:
– Class 1 (Term 1, Term 3, Term 4, Term 5, Term 6)
– Class 2 (Term 2, Term 4, Term 8, Term 6)
– Class 3 (Term 7)
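The star procedure with the lowest-numbered seed can be sketched as below. As before, the edge list is an assumption inferred from the clique classes listed earlier in this unit, and the function name is illustrative.

```python
def star_classes(terms, edges):
    """Star clustering: each seed's class is the seed plus all related terms."""
    neighbors = {t: set() for t in terms}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    classes, assigned = [], set()
    for seed in sorted(neighbors):  # lowest-numbered unassigned term first
        if seed in assigned:
            continue
        members = {seed} | neighbors[seed]
        classes.append(sorted(members))
        assigned |= members
    return classes


# Assumed related pairs (see the lead-in above).
edges = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 6),
         (2, 8), (3, 4), (3, 6), (4, 6), (6, 8)]
stars = star_classes(range(1, 9), edges)
print(stars)
```

Terms 4 and 6 appear in both of the first two classes: unlike single link, the star technique allows a term to belong to more than one class. A different seed order would generally give different classes.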
Item/Item and Item Relationship Matrix
Clustering Results

Clique:
Class 1: {Item 1, Item 3}
Class 2: {Item 2, Item 4}

Single Link:

Star:

String: Clustering with existing clusters


D'Amore and Mah

The final weight for n-gram i in document j is:

$$w_{ij} = \frac{f_{ij} - e_{ij}}{\sigma_{ij}}$$

where:
f_ij = frequency of n-gram i in document j
e_ij = expected number of occurrences of n-gram i in document j
σ_ij = standard deviation
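As a small worked example, the weight can be computed as a standardized difference between the observed and expected counts. This is a sketch under the assumption that the weight has the form (f − e) / σ; how the expected count and the standard deviation are estimated is not given in the slides, so both are passed in as plain numbers. The function name and the sample values are illustrative.

```python
def ngram_weight(f_ij, e_ij, sigma_ij):
    """Standardized n-gram weight, assuming the form (f - e) / sigma."""
    if sigma_ij <= 0:
        raise ValueError("standard deviation must be positive")
    return (f_ij - e_ij) / sigma_ij


# Hypothetical values: the n-gram occurs 7 times where 4 were expected.
weight = ngram_weight(f_ij=7, e_ij=4, sigma_ij=1.5)
print(weight)
```

An n-gram that occurs more often than expected gets a positive weight, and one that occurs less often than expected gets a negative weight.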

Damashek
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with
the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future
behavior of a dynamical system depends only on its recent
history. In particular, in a kth-order Markov model, the
next state depends only on the k most recent states;
an N-gram model is therefore an (N−1)-order Markov
model.
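A bigram model (N = 2, one word of prior context) can be estimated from counts. The sketch below uses the plain relative-frequency estimate P(word | prev) = count(prev word) / count(prev), with no smoothing. The three-sentence corpus is made up for illustration, and the function name `train_bigram` is hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function giving P(word | prev) by relative frequency."""
    context_counts = Counter()  # how often each word appears as a context
    bigram_counts = Counter()   # how often each (prev, word) pair appears
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def probability(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        return bigram_counts[(prev, word)] / context_counts[prev]

    return probability


# Toy corpus, invented for this example.
corpus = [
    "please turn off your cell phone",
    "your cell phone is ringing",
    "the cell wall is thin",
]
p = train_bigram(corpus)
print(p("phone", "cell"))  # "cell" is followed by "phone" in 2 of its 3 occurrences
```

Any bigram that never occurs in the training text gets probability zero here, which is why practical models add smoothing.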
Damashek

Regression Analysis

A simple least squares polynomial regression could be implemented that
would identify the correct values of α and β to predict life expectancy
(LE) based on age (A):
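A least squares fit of this kind can be sketched in plain Python. The sketch assumes a straight-line form LE = α·A + β (a degree-1 polynomial), which is an assumption about the intended equation. The data points are synthetic, chosen to lie exactly on a line so the result is easy to check; they are not real life-expectancy figures.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = alpha * x + beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                # slope minimizing the squared error
    beta = mean_y - alpha * mean_x   # intercept through the mean point
    return alpha, beta


# Synthetic data lying on LE = -1 * A + 80 (illustrative only).
ages = [20, 40, 60, 80]
remaining_years = [60, 40, 20, 0]
alpha, beta = fit_line(ages, remaining_years)
print(alpha, beta)
print(alpha * 50 + beta)  # predicted value for age 50 under this toy fit
```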
Conti……

For a given age, it is possible to find the related life expectancy. Now, if we
wish to predict the likelihood of a person having heart disease, we might
obtain the following data:
Conti……

Six variables used are given below:


• The mean of the total number of matching terms in the query.
• The square root of the number of terms in the query.
• The mean of the total number of matching terms in the document.
• The square root of the number of terms in the document.
• The average idf of the matching terms.
• The total number of matching terms in the query.
Stage 1:
A logistic regression is done for each composite clue.
Conti……

Stage 2:
The second stage of the staged logistic regression attempts to correct for
errors induced by the number of composite clues. As the number of
composite clues grows, the likelihood of error increases. For N composite
clues, the following logistic regression is computed:
