Irs PPT Unit Ii

UNIT-II
Retrieval Utilities
Utilities improve the results of a retrieval strategy. Most utilities add or remove
terms from the initial query in an attempt to refine the query.
Relevance Feedback
A popular information retrieval utility is relevance feedback. The basic premise is to
implement retrieval in multiple passes. The user refines the query in each pass
based on the results of previous queries.
Fig: Relevance feedback process
Relevance Feedback in the Vector Space Model
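Rocchio builds the new query Q' from the old query Q using the equation
given below:
$$Q' = Q + \frac{1}{n_1}\sum_{i=1}^{n_1} R_i - \frac{1}{n_2}\sum_{i=1}^{n_2} S_i$$
Here R is the set of n_1 relevant documents and S is the set of n_2 non-relevant documents.
Ri and Si are individual components of R and S, respectively.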

The document vectors from the relevant documents are added to the initial query
vector, and the vectors from the non-relevant documents are subtracted. If all
documents are relevant, the third term does not appear.
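The update described above can be sketched in a few lines. This is a minimal illustration, assuming the averaged form of Rocchio's update (the mean of the relevant vectors is added, the mean of the non-relevant vectors is subtracted) and plain lists as vectors. The function name `rocchio` and the sample vectors are made up for the example.

```python
def rocchio(query, relevant, non_relevant):
    """Return Q' = Q + mean(relevant) - mean(non_relevant), component-wise."""
    new_query = list(query)
    for docs, sign in ((relevant, 1.0), (non_relevant, -1.0)):
        if not docs:
            # An empty set contributes nothing, e.g. when every
            # retrieved document was judged relevant.
            continue
        for vec in docs:
            for k, weight in enumerate(vec):
                new_query[k] += sign * weight / len(docs)
    return new_query


# Hypothetical three-term vocabulary.
q = [1.0, 0.0, 1.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 2.0]]

updated = rocchio(q, relevant, non_relevant)   # [2.0, 0.5, -1.0]
only_relevant = rocchio(q, relevant, [])       # [2.0, 0.5, 1.0]
```

The second call shows the case where all documents are relevant: the subtraction term simply drops out.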
Clustering
• Clustering provides a grouping of similar objects into a class under a more general title
– Clustering also allows linkage between clusters to be specified
• An information database can be viewed as being composed of a number of
independent items indexed by a series of index terms
Term clustering
Used to create a statistical thesaurus
Increase recall by expanding searches with related terms (query expansion)
Document clustering
Used to create document clusters
The search can retrieve items similar to an item of interest, even if the query
would not have retrieved the item (resultant set expansion)
Result-set clustering
Complete Term Relation Method
• The similarity between every term pair is calculated as a basis for determining the clusters
• Using the vector model for clustering
• A similarity measure is required to calculate the similarity between two terms
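As a rough sketch of the idea, the code below builds a term-term similarity matrix from an item-term matrix and then applies a threshold to obtain a 0/1 relationship matrix. The similarity measure (sum of products of the two terms' weights over all items), the `>=` comparison at the threshold, the function names, and the tiny item-term matrix are all assumptions made for illustration.

```python
def term_term_matrix(item_term):
    """item_term[k][i] is the weight of term i in item k."""
    n_terms = len(item_term[0])
    sim = [[0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            if i != j:  # self-similarity is ignored
                sim[i][j] = sum(row[i] * row[j] for row in item_term)
    return sim


def relationship_matrix(sim, threshold):
    """Mark a term pair as related when its similarity reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in sim]


# Hypothetical data: 3 items (rows) by 3 terms (columns).
items = [
    [2, 3, 0],
    [1, 5, 0],
    [0, 1, 5],
]

sim = term_term_matrix(items)          # [[0, 11, 0], [11, 0, 5], [0, 5, 0]]
related = relationship_matrix(sim, 10) # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

Only the first two terms end up related here, because they are the only pair whose similarity (11) reaches the threshold of 10.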
Fig: Term-Term Matrix
Fig: Term Relationship Matrix (threshold = 10)
The final step in creating clusters is to determine when two objects (words) are in
the same cluster:
• Cliques
• Single link
• Stars
• Connected components
Cliques
• Cliques require all terms in a cluster to be within the threshold of all other terms
– Class 1 (Term 1, Term 3, Term 4, Term 6)
– Class 2 (Term 1, Term 5)
– Class 3 (Term 2, Term 4, Term 6)
– Class 4 (Term 2, Term 6, Term 8)
– Class 5 (Term 7)
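A small sketch of clique clustering follows, treating each class as a maximal clique of the "related" graph. The relationship data in `RELATED` is an assumption: the slide's matrix is not reproduced in this text, so the pairs were chosen to be consistent with the five classes listed above. The enumeration uses the basic Bron-Kerbosch recursion, which is one common way to list maximal cliques and not necessarily the procedure the slides had in mind.

```python
# Assumed term relationships (term -> set of related terms).
RELATED = {
    1: {3, 4, 5, 6}, 2: {4, 6, 8}, 3: {1, 4, 6}, 4: {1, 2, 3, 6},
    5: {1}, 6: {1, 2, 3, 4, 8}, 7: set(), 8: {2, 6},
}


def maximal_cliques(adj):
    """List every maximal set of terms that are all pairwise related."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for term in sorted(candidates):
            expand(current | {term},
                   candidates & adj[term],
                   excluded & adj[term])
            candidates = candidates - {term}
            excluded = excluded | {term}

    expand(set(), set(adj), set())
    return sorted(cliques)


clique_classes = maximal_cliques(RELATED)
# [[1, 3, 4, 6], [1, 5], [2, 4, 6], [2, 6, 8], [7]]
```

Note that a term may appear in several classes (Term 1, Term 2, Term 4 and Term 6 do), and an unrelated term such as Term 7 forms a class of its own.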
Single Link
• Any term that is similar to any term in the cluster
can be added to the cluster
• It is impossible for a term to be in two different
clusters
• Overhead in assignment of terms to classes: O(n²)
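Single-link clustering can be sketched as a graph traversal: starting from any unassigned term, keep adding every term related to something already in the cluster. The relationship data in `RELATED` is an assumed example (the slide's matrix is not available in this text), and the function name is hypothetical.

```python
# Assumed term relationships (term -> set of related terms).
RELATED = {
    1: {3, 4, 5, 6}, 2: {4, 6, 8}, 3: {1, 4, 6}, 4: {1, 2, 3, 6},
    5: {1}, 6: {1, 2, 3, 4, 8}, 7: set(), 8: {2, 6},
}


def single_link_clusters(adj):
    """Group terms so that each term is related to at least one cluster mate."""
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            term = stack.pop()
            if term in cluster:
                continue
            cluster.add(term)
            stack.extend(adj[term] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


single_link = single_link_clusters(RELATED)
# [[1, 2, 3, 4, 5, 6, 8], [7]]
```

Because a term joins the first cluster that reaches it and is never revisited, no term can end up in two clusters, which matches the property stated above.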
Star
• Select a term and then place in the class all terms that are
related to that term
• Terms not yet in classes are selected as new seeds until all terms
are assigned to a class
• There are many different classes that can be created using the
Star technique
• If we always choose as the starting point for a class the lowest
numbered term not already in a class, the classes are:
– Class 1 (Term 1, Term 3, Term 4, Term 5, Term 6)
– Class 2 (Term 2, Term 4, Term 8, Term 6)
– Class 3 (Term 7)
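The Star procedure with the "lowest numbered unassigned term" seeding rule can be sketched as below. The relationship data in `RELATED` is an assumption chosen to be consistent with the classes listed above, since the slide's matrix is not reproduced in this text; the function name is hypothetical.

```python
# Assumed term relationships (term -> set of related terms).
RELATED = {
    1: {3, 4, 5, 6}, 2: {4, 6, 8}, 3: {1, 4, 6}, 4: {1, 2, 3, 6},
    5: {1}, 6: {1, 2, 3, 4, 8}, 7: set(), 8: {2, 6},
}


def star_clusters(adj):
    """Seed each class with the lowest-numbered term not yet in any class."""
    assigned, classes = set(), []
    for seed in sorted(adj):
        if seed in assigned:
            continue
        members = {seed} | adj[seed]   # the seed plus everything related to it
        classes.append(sorted(members))
        assigned |= members
    return classes


star = star_clusters(RELATED)
# [[1, 3, 4, 5, 6], [2, 4, 6, 8], [7]]
```

A different seeding rule would generally give different classes, and terms such as Term 4 and Term 6 can belong to more than one class.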
Fig: Item/Item Matrix and Item Relationship Matrix
Clustering Results
Clique: Class 1: {Item 1, Item 3}
Class 2: {Item 2, Item 4}
Single Link:
Star:
String: Clustering with existing clusters

D'Amore and Mah
The final weight for n-gram i in document j is:
$$w_{ij} = \frac{f_{ij} - e_{ij}}{\sigma_{ij}}$$
where:
f_ij = frequency of n-gram i in document j
e_ij = expected number of occurrences of n-gram i in document j
σ_ij = standard deviation
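A minimal sketch of the weighting step follows, assuming the weight is the standardized deviation (f_ij − e_ij) / σ_ij suggested by the three definitions above. How e_ij and σ_ij are estimated is not covered here, so the example passes them in as made-up numbers; the function names and the sample string are hypothetical as well.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[k:k + n] for k in range(len(text) - n + 1)]


def ngram_weight(f_ij, e_ij, sigma_ij):
    """Standardized frequency: how far the observed count is from the expected one."""
    return (f_ij - e_ij) / sigma_ij


counts = Counter(char_ngrams("retrieval retrieves", 3))
f = counts["ret"]                                 # 2 occurrences of "ret"

# Placeholder values for the expected count and standard deviation.
w = ngram_weight(f, e_ij=0.5, sigma_ij=0.5)       # (2 - 0.5) / 0.5 = 3.0
```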
Damashek
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with
the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future
behavior of a dynamical system only depends on its recent
history. In particular, in a kth-order Markov model, the
next state only depends on the k most recent states;
therefore an N-gram model is an (N−1)th-order Markov
model.
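To make the bigram case concrete, here is a small sketch that estimates P(word | previous word) by relative frequency, i.e. count(previous word, word) divided by the number of times the previous word is followed by anything. The estimator choice (unsmoothed counts), the function name, and the toy sentence are assumptions for illustration only.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, word):
    """Unsmoothed estimate of P(word | prev_word) from a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    prev_count = sum(1 for t in tokens[:-1] if t == prev_word)
    if prev_count == 0:
        return 0.0  # context never seen; a real model would smooth instead
    return bigrams[(prev_word, word)] / prev_count


# Hypothetical toy corpus.
corpus = ("please turn off your cell phone and put your cell phone away "
          "your cell is loud").split()

p_phone_given_cell = bigram_probability(corpus, "cell", "phone")   # 2 / 3
```

"cell" is followed by "phone" twice and by "is" once, so the estimate is 2/3. A trigram model would condition on the pair ("your", "cell") in the same way.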
Regression Analysis
A simple least squares polynomial regression could be implemented that
would identify the correct values of α and β to predict life expectancy
(LE) based on age (A):
$$LE = \alpha A + \beta$$
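As an illustration, the sketch below fits a straight line LE = α·A + β by ordinary least squares using the closed-form solution. The linear form, the function name `fit_line`, and the three data points are assumptions; the numbers are invented and carry no demographic meaning.

```python
def fit_line(ages, life_expectancies):
    """Ordinary least squares for LE = alpha * A + beta."""
    n = len(ages)
    mean_a = sum(ages) / n
    mean_le = sum(life_expectancies) / n
    sxy = sum((a - mean_a) * (le - mean_le)
              for a, le in zip(ages, life_expectancies))
    sxx = sum((a - mean_a) ** 2 for a in ages)
    alpha = sxy / sxx
    beta = mean_le - alpha * mean_a
    return alpha, beta


# Made-up observations that happen to lie exactly on a line.
ages = [20, 40, 60]
remaining_years = [60, 40, 20]

alpha, beta = fit_line(ages, remaining_years)   # alpha = -1.0, beta = 80.0
predicted = alpha * 50 + beta                   # 30.0
```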
For a given age, it is possible to find the related life expectancy. Now, if we
wish to predict the likelihood of a person having heart disease, we might
obtain the following data:
Six variables used are given below:

• The mean of the total number of matching terms in the query.
• The square root of the number of terms in the query.
• The mean of the total number of matching terms in the document.
• The square root of the number of terms in the document.
• The average idf of the matching terms.
• The total number of matching terms in the query.
Stage 1:
A logistic regression is done for each composite clue.
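For readers unfamiliar with logistic regression, the sketch below fits a generic one-variable model, log-odds of relevance = c0 + c1·x, by gradient ascent on the log-likelihood. It is not the staged procedure from the slides: the single variable `x`, the coefficient names, the learning rate, and the toy relevance labels are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit log-odds(relevant) = c0 + c1 * x by gradient ascent."""
    c0, c1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(c0 + c1 * x)
            g0 += y - p          # gradient w.r.t. the intercept
            g1 += (y - p) * x    # gradient w.r.t. the slope
        c0 += lr * g0 / n
        c1 += lr * g1 / n
    return c0, c1


# Hypothetical data: x could be one of the six variables for a query/document
# pair, y is 1 if the document was judged relevant and 0 otherwise.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]

c0, c1 = fit_logistic(xs, ys)
p_low = sigmoid(c0 + c1 * 0)     # estimated probability of relevance at x = 0
p_high = sigmoid(c0 + c1 * 5)    # estimated probability of relevance at x = 5
```

With these labels the fitted slope is positive, so the estimated probability of relevance rises with x.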
Stage 2:
The second stage of the staged logistic regression attempts to correct for
errors induced by the number of composite clues. As the number of
composite clues grows, the likelihood of error increases. For N composite
clues, the following logistic regression is computed:
