UNIT - II
Chapter 2 : Modeling and Retrieval Evaluation 2 - 1 to 2 - 44
UNIT - III
Chapter 3 : Text Classification and Clustering 3 - 1 to 3 - 46
3.2.1 Clustering ..................................................... 3 - 4
UNIT - IV
Chapter 4 : Web Retrieval and Web Crawling 4 - 1 to 4 - 24
4.1 The Web .......................................................... 4 - 2
4.1.1 Characteristics ................................................ 4 - 3
4.1.2 Modeling the Web ............................................... 4 - 3
4.1.3 Link Analysis .................................................. 4 - 5
4.2 Search Engine Architectures ...................................... 4 - 5
4.2.1 Cluster based Architecture ..................................... 4 - 6
4.2.2 Distributed Architecture ....................................... 4 - 8
4.3 Search Engine Ranking ............................................ 4 - 10
4.3.1 Link based Ranking ............................................. 4 - 11
4.3.2 Simple Ranking Functions ....................................... 4 - 13
4.3.3 Learning to Rank ............................................... 4 - 14
4.3.4 Evaluations .................................................... 4 - 14
4.4 Search Engine User Interaction ................................... 4 - 14
4.5 Browsing ......................................................... 4 - 17
4.5.1 Web Directories ................................................ 4 - 17
4.6 Applications of a Web Crawler .................................... 4 - 17
4.6.1 Web Crawler Architecture ....................................... 4 - 18
4.6.2 Taxonomy of Crawler ............................................ 4 - 20
4.7 Scheduling Algorithms ............................................ 4 - 20
4.8 Part A : Short Answered Questions [2 Marks Each] ................. 4 - 21
4.9 Multiple Choice Questions with Answers ........................... 4 - 23
UNIT - V
Chapter 5 : Recommender System 5 - 1 to 5 - 18
5.1 Recommender Systems Functions .................................... 5 - 2
5.1.1 Challenges ..................................................... 5 - 4
5.2 Data and Knowledge Sources ....................................... 5 - 4
5.3 Recommendation Techniques ........................................ 5 - 4
5.4 Basics of Content-based Recommender Systems ...................... 5 - 5
5.4.1 High Level Architecture Content-based Recommender Systems ...... 5 - 5
5.4.2 Relevance Feedback ............................................. 5 - 6
5.4.3 Advantages and Drawbacks of Content-based Filtering ............ 5 - 8
5.5 Collaborative Filtering .......................................... 5 - 8
5.5.1 Type of CF ..................................................... 5 - 8
5.5.2 Collaborative Filtering Algorithms ............................. 5 - 10
5.5.3 Advantages and Disadvantages ................................... 5 - 12
5.5.4 Difference between Collaborative Filtering and Content based Filtering ... 5 - 12
5.6 Matrix Factorization Models ...................................... 5 - 13
5.6.1 Singular Value Decomposition (SVD) ............................. 5 - 13
5.7 Neighbourhood Models ............................................. 5 - 14
5.7.1 Similarity Measures ............................................ 5 - 14
5.8 Part A : Short Answered Questions [2 Marks Each] ................. 5 - 15
5.9 Multiple Choice Questions with Answers ........................... 5 - 17
Solved Model Question Paper (M - 1) to (M - 4)
UNIT - I
1 Introduction
Syllabus
Information Retrieval - Early Developments - The IR Problem - The User’s Task - Information
versus Data Retrieval - The IR System - The Software Architecture of the IR System - The
Retrieval and Ranking Processes - The Web - The e-Publishing Era - How the web changed
Search - Practical Issues on the Web - How People Search - Search Interfaces Today -
Visualization in Search Interfaces.
Contents
In a modern IR system, users turn to a search engine for a vast range of information needs. A user may be looking for the link to the homepage of a government body, a company or a college, or for information required to execute tasks associated with their job or immediate needs.
Sometimes a user types a full description of the information need; a search engine cannot process such a description directly. The user must therefore first translate this information need into a query to be posed to the system.
Given the user query, the goal of the IR system is to retrieve information that is useful or relevant to the user.
The key issues with IR models are the selection of the search vocabulary, the formulation of the search strategy and information overload.
➥ 1.2.1 The User’s Task
The user of a retrieval system has to translate his information need into a query in the language provided by the system. With an information retrieval system, this normally implies specifying a set of words which convey the semantics of the information need.
With a data retrieval system, a query expression is used to convey the constraints that must be satisfied by objects in the answer set. In both cases, we say that the user searches for useful information executing a retrieval task. Fig. 1.2.1 shows the interaction of the user with the retrieval system.
Fig. 1.2.1 : Interaction of the user with the retrieval system
Suppose the user may be interested in web site about healthcare product. In this situation,
the user might use an interactive interface to simply look around in the collection for
documents related to healthcare product.
The user may be interested in a new beauty product, or a weight loss or weight gain product. Here the user is browsing the documents in the collection, not searching. It is still a process of retrieving information, but one whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system.
Pull technology : The user requests information in an interactive manner. It covers three retrieval tasks, i.e. browsing (hypertext), retrieval (classical IR systems) and combined browsing and retrieval (modern digital libraries and web systems).
Push technology : Automatic and permanent pushing of information to the user. It acts like a software agent.
The computer-based retrieval systems store only a representation of the document or query
which means that the text of a document is lost once it has been processed for the purpose
of generating its representation.
The process may involve structuring the information, such as classifying it. It will also
involve performing the actual retrieval function that is executing the search strategy in
response to a query.
Text documents are the output of an information retrieval system. Web search engines are the most familiar example of IR systems.
The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
A major task of a search engine is to maintain and manipulate an inverted index for a
document collection. This index forms the principal data structure used by the engine for
searching and relevance ranking.
As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.
To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the
length of each document.
In addition the search engine usually has access to the original content of the documents in
order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts
queries from its users, processes these queries, and returns ranked lists of results.
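The following toy sketch illustrates the idea of an inverted index and a crude ranked lookup. It is only an illustration (the three short documents and the scoring by summed term frequency are invented for the example), not the architecture of any real search engine.

from collections import defaultdict, Counter

# Toy document collection (invented for illustration).
docs = {
    1: "information retrieval systems retrieve relevant documents",
    2: "a search engine maintains an inverted index",
    3: "the inverted index maps terms to documents",
}

# Build the inverted index : term -> {doc_id : term frequency}.
index = defaultdict(dict)
doc_len = {}
for doc_id, text in docs.items():
    counts = Counter(text.split())
    doc_len[doc_id] = sum(counts.values())
    for term, tf in counts.items():
        index[term][doc_id] = tf

def search(query):
    """Score documents by the number of query-term occurrences (a crude RSV)."""
    scores = Counter()
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return scores.most_common()            # ranked list of (doc_id, score)

print(search("inverted index"))            # [(2, 2), (3, 2)] : both contain both query terms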
To perform relevance ranking, the search engine computes a score, sometimes called a
Retrieval Status Value (RSV), for each document. After sorting documents according to
their scores, the result list must be subjected to further processing, such as the removal of
duplicate or redundant results.
For example, a web search engine might report only one or two results from a single host or domain, eliminating the others in favor of pages from different sources.
Database management systems support fast and accurate data lookups in business and
industry; in journalism, lookups are related to questions of who, when, and where as
opposed to what, how, and why questions.
In libraries, lookups have been called “known item” searches to distinguish them from
subject or topical searches.
A typical example would be a user wanting to make a reservation to a restaurant and
looking for the phone number on the Web.
On the other hand, exploratory search is described as open-ended, with an unclear
information need, an ill-structured problem of search with multiple targets. This search
activity is evolving and can occur over time.
For example, a user wants to know more about Senegal; she doesn't really know what kind of information she wants or what she will discover in this search session; she only knows she wants to learn more about that topic.
All three search engines offer spelling error correction features and, around 80 % of the time, they provide 4 or more dynamic query suggestions.
Query formulation is the stage of the interactive information access process in which the user translates an information need into a query and submits the query to an information access system such as a search engine.
The system performs some computation to match the query with the documents most likely
to be relevant to the query and returns a ranked list of relevant documents to the user.
🞕 Ans. :
Input : Store only a representation of the document or query which means that the text of a
document is lost once it has been processed for the purpose of generating its representation.
A document representative could be a list of extracted words considered to be significant.
Processor : Involved in performing the actual retrieval function, i.e. executing the search strategy in response to a query.
Syllabus
Contents
Q : A Boolean expression. The terms are index terms and operators are AND, OR, and
NOT.
F : Boolean algebra over sets of terms and sets of documents.
R : A document is predicted as relevant to a query expression if and only if it satisfies the query expression.
For instance, the query social AND economic will produce the set of documents that are
indexed both with the term social and the term economic, i.e. the intersection of both sets.
Combining terms with the OR operator will define a document set that is bigger than or
equal to the document sets of any of the single terms.
So, the query social OR political will produce the set of documents that are indexed with
either the term social or the term political or both, i.e. the union of both sets.
This is visualized in the Venn diagrams of Fig. 2.1.1 in which each set of documents is visualized by a disc.
The intersections of these discs and their complements divide the document collection into 8 non-overlapping regions, the unions of which give 256 different Boolean combinations of "social, political and economic documents". In Fig. 2.1.1 the retrieved sets are visualized by the shaded areas.
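The set operations behind these queries can be sketched directly, for example in Python; the posting lists below are invented document identifiers chosen to mirror the social / political / economic example.

# Invented posting lists (document identifiers) for three index terms.
social    = {1, 2, 3, 5, 8}
economic  = {2, 3, 6, 9}
political = {3, 5, 9}

# social AND economic : intersection of the two document sets.
print(social & economic)             # {2, 3}

# social OR political : union, at least as large as either set.
print(social | political)            # {1, 2, 3, 5, 8, 9}

# social AND NOT economic : set difference.
print(social - economic)             # {1, 5, 8}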
Term weights are used to compute the degree of similarity between documents and the user
query. Then, retrieved documents are sorted in decreasing order.
They considered the index representations and the query as vectors embedded in a high dimensional Euclidean space, where each term is assigned a separate dimension. The similarity measure is usually the cosine of the angle that separates the two vectors d and q. The cosine of an angle is 0 if the vectors are orthogonal in the multidimensional space and 1 if the angle is 0 degrees. The cosine formula is given by :

score(\vec{d}, \vec{q}) = \frac{\sum_{k=1}^{m} d_k \, q_k}{\sqrt{\sum_{k=1}^{m} (d_k)^2} \; \sqrt{\sum_{k=1}^{m} (q_k)^2}}   ... (1)

Fig. 2.1.2 : Query and document representation in the vector space model

Measuring the cosine of the angle between vectors is equivalent to normalizing the vectors to unit length and taking the vector inner product. If index representations and queries are properly normalised, then the vector product measure of equation (1) does have a strong theoretical motivation. The formula then becomes :

score(\vec{d}, \vec{q}) = \sum_{k=1}^{m} n(d_k) \, n(q_k), \quad where \quad n(v_k) = \frac{v_k}{\sqrt{\sum_{k=1}^{m} (v_k)^2}}
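As a quick illustration, the sketch below computes the cosine score of equation (1) for two small term-weight vectors; the example values are invented.

import math

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    return dot / (norm_d * norm_q)

d = [0.5, 0.8, 0.3]    # term weights of a document (arbitrary example values)
q = [0.0, 1.0, 1.0]    # term weights of a query
print(round(cosine(d, q), 3))    # 0.786 : 1 means identical direction, 0 means orthogonal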
We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e. the IR problem can be viewed as a clustering problem).
1. Intra-cluster : One needs to determine what are the features which better describe the objects in the set A.
2. Inter-cluster : One needs to determine what are the features which better distinguish the objects in the set A from the remaining objects in the collection C.
tf : Intra-cluster similarity is quantified by measuring the raw frequency of a term k_i inside a document d_j. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.
idf : Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k_i among the documents in the collection. This frequency is often referred to as the inverse document frequency.
➤ Advantages :
1. Its term-weighting scheme improves retrieval performance.
2. Its partial matching strategy allows retrieval of documents that approximate the query
conditions.
3. Its cosine ranking formula sorts the documents according to their degree of similarity to
the query.
➤ Disadvantages :
1. The assumption of mutual independence between index terms.
2. It cannot denote the "clear logic view" like Boolean model.
University Question
Term weighting is an important aspect of modern text retrieval systems. Terms are words,
phrases, or any other indexing units used to identify the contents of a text. Since different
terms have different importance in a text, an important indicator - the term weight - is
associated with every term.
The retrieval performance of the information retrieval systems is largely dependent on
similarity measures. Furthermore, a term weighting scheme plays an important role for the
similarity measure.
There are three components in a weighting scheme :

a_{ij} = g_i \cdot t_{ij} \cdot d_j

where g_i is the global weight of the i-th term, t_{ij} is the local weight of the i-th term in the j-th document and d_j is the normalization factor for the j-th document.
The document frequency df_t is the number of documents in the collection that the term t occurs in. We define the idf weight of term t as follows :

idf_t = \log_{10} \frac{N}{df_t}

where N is the number of documents in the collection.
The tf-idf weight of a term is the product of its tf weight and its idf weight :

w_{t,d} = (1 + \log tf_{t,d}) \cdot \log_{10} \frac{N}{df_t}
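A small sketch of the weight defined above; the collection size N and the document frequencies used below are invented example numbers.

import math

def tf_idf(tf, df, N):
    """w = (1 + log tf) * log10(N / df); zero if the term is absent from the document."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000                                   # assumed collection size
print(round(tf_idf(tf=3, df=100, N=N), 3))      # ~5.908 : rare term, a few occurrences -> high weight
print(round(tf_idf(tf=3, df=500_000, N=N), 3))  # ~0.445 : very common term -> low weight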
➤ Stop lists and Stemming :
Stoplists : This is a list of words that we should ignore when processing documents, since
they give no useful information about content.
Stemming : This is the process of treating a set of words like "fights, fighting, fighter, …"
as all instances of the same term - in this case the stem is "fight".
The frequency curve actually demonstrates Zipf's law. With this law Zipf stated that the
product of the frequency of occurrences of words and the rank order is approximately
constant.
Luhn's method for each document :
1. Filter terms in the document using a list of stopwords.
2. Normalize terms by stemming like differentiate, different, differently.
3. Calculate frequencies of normalized terms.
4. Remove non-frequent terms.
A problem with Luhn's model is that the cut-off points have to be determined empirically, by trial and error.
The decision may be binary (retrieve/reject), or it may involve estimating the degree of
relevance that the document has to the query.
Conflation algorithms are classified into two main classes : Stemming algorithms which are
language dependent and string-similarity algorithms which are language independent.
Stemming algorithms typically try to reduce related words to the same base (or stem). A conflation system will usually consist of three parts :
The removal of high frequency words, 'stop' words or 'fluff' words is one way of
implementing Luhn's upper cut-off. This is normally done by comparing the input text with
a 'stop list' of words which are to be removed.
The advantages of the process are not only that non-significant words are removed and will therefore not interfere during retrieval, but also that the size of the total document file can be reduced by between 30 % and 50 %.
The second stage, suffix stripping, is more complicated. A standard approach is to have a
complete list of suffixes and to remove the longest possible one. For example, we may well
want UAL removed from FACTUAL but not from EQUAL.
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms
and root form relations. Instead, a typically smaller list of "rules" is stored which provide a
path for the algorithm, given an input word form, to find its root form. Some examples of
the rules include :
1. If the word ends in 'ed', remove the 'ed'
2. If the word ends in 'ing', remove the 'ing'
3. If the word ends in 'ly', remove the 'ly'
Following are the simple steps of the conflation algorithm (a small code sketch follows the list) :
1. Open and read each input file and create a single index file.
2. Remove or filter out all stop words.
3. Remove all suffixes / affixes from each word if present.
4. Count frequencies of occurrences for each root word from step 3.
5. Apply Porter's rules / algorithm for each root word from step 3 and store in the index file.
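A toy sketch of these steps (stop-word removal, crude suffix stripping and frequency counting) is given below; the tiny stop list and the three suffix rules are simplifications taken from the rules quoted above, not the full Porter algorithm.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}   # small assumed stop list
SUFFIXES = ("ing", "ed", "ly")                                   # the rules quoted above

def crude_stem(word):
    """Strip one of the listed suffixes, if present (a rough stand-in for Porter's rules)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflate(text):
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(crude_stem(w) for w in terms)

print(conflate("The fighters fought quickly and the fighting ended"))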
Cosine is a normalized dot product. Documents are ranked by decreasing cosine value.
A term that appears in many documents should not be regarded as more important than one
that appears in few documents.
A document with many occurrences of a term should not be regarded as less important
than a document with few occurrences of the term
University Question
This model was introduced by Robertson and Sparck Jones in 1976. It is also called the Binary Independence Retrieval (BIR) model.
Idea : Given a user query q, and the ideal answer set R of the relevant documents, the
problem is to specify the properties for this set.
sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)} \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}

Assuming independence of index terms :

sim(d_j, q) \sim \frac{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \cdot \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \cdot \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})}

P(k_i | R) stands for the probability that the index term k_i is present in a document randomly selected from the set R.
P(\bar{k}_i | R) stands for the probability that the index term k_i is not present in a document randomly selected from the set R.
The probabilities associated with the set \bar{R} have meanings which are analogous to the ones just described.
Using the fact that P(k_i | R) + P(\bar{k}_i | R) = 1, taking logarithms and ignoring factors which are constant for all documents, we finally obtain :

sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
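As a sketch of how the final expression can be turned into concrete term weights, the code below uses the common initial estimates P(k_i | R) = 0.5 and P(k_i | R̄) ≈ n_i / N, where n_i is the number of documents containing term k_i; the collection size and document frequencies are invented.

import math

def bim_term_weight(p_rel, p_nonrel):
    """log[p/(1-p)] + log[(1-r)/r] for one query term (p = P(ki|R), r = P(ki|R_bar))."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

N = 10_000                                # assumed collection size
doc_freq = {"economic": 400, "the": 9_000}   # invented document frequencies

for term, n in doc_freq.items():
    p = 0.5        # initial guess : term equally likely to appear in relevant documents
    r = n / N      # non-relevant set approximated by the whole collection
    print(term, round(bim_term_weight(p, r), 3))   # rare term gets a high weight, common term a negative one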
➤ Advantage :
Documents are ranked in decreasing order of their probability of being relevant.
➤ Disadvantages :
1. The need to guess the initial relevant and non-relevant sets.
2. Term frequency is not considered.
3. Independence assumption for index terms.
University Question
1. Explain in detail about the binary independence model for the Probability Ranking Principle (PRP).
AU : Dec.-17, Marks 10
Example 2.4.1 : Following are the technical memo titles; construct the document matrix.
1. Give an example for latent semantic indexing and explain the same.
AU : May-17, Marks 8
In a neural network, nodes are processing units and edges play the role of synaptic connections. Weights are assigned to each edge of the neural network.
The strength of a propagating signal is modelled by a weight assigned to each edge. The
state of a node is defined by its activation level. Depending on its activation level, a node
might issue an output signal.
First level of propagation :
a) Query terms issue the first signals.
b) These signals propagate across the network to reach the document nodes.
Second level of propagation :
a) Document nodes might themselves generate new signals which affect the document
term nodes.
b) Document term nodes might respond with new signals of their own.
Thus, the first query formulation should be treated as an initial attempt to retrieve relevant
information. Documents initially retrieved could be analyzed for relevance and used to
improve the initial query.
Fig. 2.6.1 shows relevance feedback on initial query.
There are two basic approaches for compiling implicit feedback information :
1. Local analysis, which derives the feedback information from the top ranked documents
in the result set.
2. Global analysis, which derives the feedback information from external sources such as a
thesaurus.
Goal of relevance feedback :
1. Add query terms and adjust term weights.
2. Improve ranks of known relevant documents.
3. Other relevant docs will also be ranked higher.
➤ Relevance feedback : Basic idea
The user issues a short and simple query. The search engine returns a set of documents.
User marks some docs as relevant, some as non-relevant.
Search engine computes a new representation of the information need. Hope that it is better
than the initial query. Search engine runs new query and returns new results. New results
have better recall. Fig. 2.6.2 shows relevance and click feedback.
Fig. 2.6.2 (a) : Relevance feedback Fig. 2.6.2 (b) : Click feedback
➤ Rocchio Properties :
1. Does not guarantee a consistent hypothesis.
2. Forms a simple generalization of the examples in each class (a prototype).
3. Prototype vector does not need to be averaged or otherwise normalized for length since
cosine similarity is insensitive to vector length.
4. Classification is based on similarity to class prototypes.
Rocchio method is represented by the following formula :

Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}

where Q_0 = the vector for the initial query
R_i = the vector for the relevant document i
S_i = the vector for the non-relevant document i
n_1 = the number of relevant documents chosen
n_2 = the number of non-relevant documents chosen
\beta and \gamma tune the importance of relevant and non-relevant terms (in some studies it is best to set \beta to 0.75 and \gamma to 0.25).
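A minimal sketch of this update for query and document vectors stored as plain lists; the weights β = 0.75 and γ = 0.25 follow the values quoted above, and the example vectors are invented.

def rocchio(q0, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Q1 = Q0 + beta * mean(relevant vectors) - gamma * mean(non-relevant vectors)."""
    def mean(vectors):
        n = len(vectors)
        return [sum(v[k] for v in vectors) / n for k in range(len(q0))]
    r = mean(relevant) if relevant else [0.0] * len(q0)
    s = mean(non_relevant) if non_relevant else [0.0] * len(q0)
    return [q + beta * rk - gamma * sk for q, rk, sk in zip(q0, r, s)]

q0 = [1.0, 0.0, 1.0]                        # invented initial query vector
rel = [[1.0, 1.0, 0.0], [0.8, 0.9, 0.1]]    # vectors of documents judged relevant
nonrel = [[0.0, 0.0, 1.0]]                  # vector of a document judged non-relevant
print([round(w, 2) for w in rocchio(q0, rel, nonrel)])    # [1.68, 0.71, 0.79]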
Rocchio's clustering algorithm was developed on the SMART project. It operates in three
stages.
1. In the first stage it selects a number of objects as cluster centers. The remaining objects
are then assigned to the centers or to a 'rag-bag' cluster. On the basis of the initial
assignment the cluster representatives are computed and all objects are once more
assigned to the clusters. The assignment rules are explicitly defined in terms of
thresholds on a matching function. The final clusters may overlap.
2. The second stage is essentially an iterative step to allow the various input parameters to
be adjusted so that the resulting classification meets the prior specification of such
things as cluster size, etc. more nearly.
3. The third stage is for 'tidying up'. Unassigned objects are forcibly assigned and overlap
between clusters is reduced.
Example 2.6.1 : An IR system returns 8 relevant documents and 10 non-relevant documents. There
are a total of 20 relevant documents in the collection. What is the precision of the system on this
search and what is its recall ?
🞕 Solution :
Precision = 8 / 18 = 0.44
Recall = 8 / 20 = 0.40
Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage.

Recall = \frac{A}{A + B} \times 100 \%

Precision is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved. It is usually expressed as a percentage.

Precision = \frac{A}{A + C} \times 100 \%

where A is the number of relevant records retrieved, B is the number of relevant records not retrieved and C is the number of irrelevant records retrieved.
As recall increases, precision decreases; as recall decreases, precision increases.

Recall = 56.25 %

Precision = \frac{A}{A + C} \times 100 \% = \frac{45}{45 + 15} \times 100 \% = \frac{45}{60} \times 100 \%

Precision = 75 %
Example 2.6.3 : 20 found documents, 18 relevant, 3 relevant documents are not found, 27 irrelevant
are as well not found. Calculate the precision and recall and fallout scores for the search.
🞕 Solution :
Precision : 18/20 = 90 %
Recall : 18/21 = 85.7 %
Fall-out : 2/29 = 6.9 %
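The numbers of Example 2.6.3 can be checked with a few lines of code that simply apply the definitions of precision, recall and fall-out to the four counts of the run.

def evaluate(retrieved_relevant, retrieved_irrelevant, missed_relevant, missed_irrelevant):
    """Precision, recall and fall-out from the four counts of a retrieval run."""
    retrieved = retrieved_relevant + retrieved_irrelevant
    total_relevant = retrieved_relevant + missed_relevant
    total_irrelevant = retrieved_irrelevant + missed_irrelevant
    precision = retrieved_relevant / retrieved
    recall = retrieved_relevant / total_relevant
    fallout = retrieved_irrelevant / total_irrelevant
    return precision, recall, fallout

# Example 2.6.3 : 20 retrieved (18 relevant, 2 irrelevant), 3 relevant and 27 irrelevant not found.
p, r, f = evaluate(18, 2, 3, 27)
print(f"precision={p:.1%}  recall={r:.1%}  fall-out={f:.1%}")   # 90.0 %, 85.7 %, 6.9 %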
Recall is a non-decreasing function of the number of docs retrieved. In a good system,
precision decreases as either the number of docs retrieved or recall increases. This is not a
theorem, but a result with strong empirical confirmation.
The set of ordered pairs makes up the precision-recall graph. Geometrically when the points
have been joined up in some way they make up the precision-recall curve. The performance
of each request is usually given by a precision-recall curve. To measure the overall
performance of a system, the set of curves, one for each request, is combined in some way
to produce an average curve.
Assume that set Rq containing the relevant document for q has been defined. Without loss
of generality, assume further that the set Rq is composed of the following documents :
Rq = { d3, d5, d9, d25, d39, d44, d56, d71, d89, d123 }
The documents that are relevant to the query q are marked with star after the document
number. Ten relevant documents, five included in Top 15.
Fig. 2.6.3 shows the curve of precision versus recall. By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve.
The precision versus recall curve is usually plotted based on 11 standard recall levels : 0 %, 10 %, ....., 100 %.
In this example : The precisions for recall levels higher than 50 % drop to 0 because no relevant documents were retrieved. There was an interpolation for the recall level 0 %; since the recall levels for each query might be distinct from the 11 standard recall levels, an interpolation procedure is necessary.
Fig. 2.6.4 shows the curve for interpolated precision and recall.
Following Fig. 2.6.5 shows the comparison between precision-recall curve and interpolated
precision.
Fig. 2.6.5
Use P = 0 for each relevant document that was not retrieved. Determine the average for each query, then average over queries :

MAP = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{Q_j} \sum_{i=1}^{Q_j} P(doc_i)

where
Q_j = number of relevant documents for query j,
N = number of queries,
P(doc_i) = precision at the i-th relevant document.
Example 2.6.4 : Two queries are evaluated. The table lists the ranks at which relevant documents (marked X) appear, together with the precision P(doc_i) at each of those ranks.

Query 1                              Query 2
Rank    Relev.    P(doc_i)           Rank    Relev.    P(doc_i)
1       X         1.00               1       X         1.00
3       X         0.67               3       X         0.67
6       X         0.50               15      X         0.20
10      X         0.40               AVG : 0.623
20      X         0.25
AVG : 0.564
MAP favors systems which return relevant documents fast.
🞕 Solution :

MAP = \frac{0.564 + 0.623}{2} = 0.594
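The MAP value of Example 2.6.4 can be reproduced with the short function below, which takes, for each query, the ranks at which the relevant documents appear (a relevant document that is never retrieved would contribute P = 0 through the total count).

def average_precision(relevant_ranks, total_relevant=None):
    """Mean of the precision values measured at each retrieved relevant document."""
    total = total_relevant or len(relevant_ranks)    # unretrieved relevant docs count as P = 0
    precisions = [k / rank for k, rank in enumerate(sorted(relevant_ranks), start=1)]
    return sum(precisions) / total

# Ranks of the relevant documents in Example 2.6.4.
q1 = average_precision([1, 3, 6, 10, 20])   # ~0.563
q2 = average_precision([1, 3, 15])          # ~0.622
print(round((q1 + q2) / 2, 3))              # 0.593 (the text rounds each average first and gets 0.594)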
A necessary consequence of its monotonicity is that the average P-R curve will also be
monotonically decreasing. It is possible to define the set of observed points in such a way
that the interpolate function is not monotonically decreasing. In practice, even for this case,
we have that the average precision-recall curve is monotonically decreasing.
➤ Precision-recall appropriateness
Precision and recall have been extensively used to evaluate the retrieval performance of IR
algorithms. However, a more careful reflection reveals problems with these two measures :
First, the proper estimation of maximum recall for a query requires detailed knowledge of
all the documents in the collection.
Second, in many situations the use of a single measure could be more appropriate.
Third, recall and precision measure the effectiveness over a set of queries processed in
batch mode.
Fourth, for systems which require a weak ordering though, recall and precision might be
inadequate.
➤ Single value summaries
Average precision-recall curves constitute standard evaluation metrics for information
retrieval systems. However, there are situations in which we would like to evaluate retrieval
performance over individual queries. The reasons are twofold :
1. First, averaging precision over many queries might disguise important anomalies in the
retrieval algorithms under study.
2. Second, we might be interested in investigating whether one algorithm outperforms the other for each query.
In these situations, a single precision value can be used.
➤ R-Precision
If we have a known set of relevant documents of size Rel, then calculate precision of the
top Rel docs returned.
Let R be the total number of relevant docs for a given query. The idea here is to compute
the precision at the Rth position in the ranking.
For the query q1, the R value is 10 and there are 4 relevant among the top 10 documents in
the ranking. Thus, the R-Precision value for this query is 0.4.
The R-precision measure is useful for observing the behavior of an algorithm for individual queries. Additionally, one can also compute an average R-precision figure over a set of queries.
However, using a single number to evaluate an algorithm over several queries might be quite imprecise.
➤ Precision histograms
The R-precision computed for several queries can be used to compare two algorithms as
follows :
Let,
RP_A(i) : R-precision of algorithm A for the i-th query
RP_B(i) : R-precision of algorithm B for the i-th query
Define, for instance, the difference :

RP_{A/B}(i) = RP_A(i) - RP_B(i)

The algorithm A performs better for 8 of the queries, while the algorithm B performs better for the other 2 queries.
Where tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain
tk ; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.
Even though the set of known relevant documents is a perhaps small subset of the true set
of relevant documents, if we assume that the set of relevant documents is a small subset of
the set of all documents then the estimates given above will be reasonable. This gives a
basis for another way of changing the query term weights.
Re-estimate p_i and r_i on the basis of these judged documents, for example :

p_i^{(k+1)} = \frac{|V_i| + \kappa \, p_i^{(k)}}{|V| + \kappa}

where V is the set of documents currently assumed relevant, V_i is the subset of V containing the term k_i and \kappa is a small smoothing constant.
Repeat, thus generating a succession of approximations to R.
              Relevant          Non-relevant               Total
In doc        R_t               N_t - R_t                  N_t
Not in doc    R - R_t           N - N_t - R + R_t          N - N_t
Total         R                 N - R                      N
The relevance weight is calculated using the assumption that the first 10 documents are
relevant and all others are not. For each term in these documents the following weight is
calculated :
w_t = \log \frac{R_t / (R - R_t)}{(N_t - R_t) / (N - N_t - R + R_t)}
This approach makes various assumptions, such as that the document summaries displayed
in results lists (on whose basis users choose which documents to click on) are indicative of
the relevance of these documents.
In the original DirectHit search engine, the data about the click rates on pages was gathered
globally, rather than being user or query specific. This is one form of the general area of
clickstream mining. Today, a closely related approach is used in ranking the advertisements
that match a web search query.
University Questions
4. To increase the availability of appropriate evaluation techniques for use by industry and
academia, including development of new evaluation techniques more applicable to
current systems.
First TREC conference was held at NIST in November 1992, while the second TREC
conference occurred in August 1993. Starting in TREC-3, a variety of other ''tracks,'' were
introduced.
The tracks invigorate TREC by focusing research on new areas or particular aspects of text
retrieval. The tracks represent the majority of the work done in TREC.
Set of tracks in a particular TREC depends on :
1. Interest of participants
2. Appropriateness of task to TREC
3. Needs of sponsors
4. Resource constraints
➤ Data collection
The TREC collection has been growing steadily over the year. At TREC-3, the collection
size was roughly 2 gigabytes while at TREC-6 it had gone up to roughly 5.8 gigabytes.
The TREC collection is distributed in six CD-ROM disks of roughly 1 gigabyte of
compressed text each. The documents come from the following sources :
WSJ Wall Street Journal
AP Associated Press
ZIFF Computer Selects, Ziff-Davis
FR Federal Register
DOE US DOE Publications
SJMN San Jose Mercury News
PAT US Patents
FT Financial Times
CR Congressional Record
FBIS Foreign Broadcast Information Service
LAT LA Times
<desc> Description:
What information is available on petroleum exploration in
the South Atlantic near the Falkland Islands?
<narr> Narrative:
Any document discussing petroleum exploration in the
South Atlantic near the Falkland Islands is considered
relevant. Documents discussing petroleum exploration in
continental South America are not relevant.
</top>
Fig. 2.7.1 : Sample topic of TREC
The task of converting an information request into a system query must be done by the
system itself and is considered to be an integral part of the evaluation performance.
The topics numbered 1 to 150 were prepared for use with the TREC-1 and TREC-2 conferences. The topics numbered 151 to 200 were prepared for use with the TREC-3 conference; they include only three fields : Title, Description and Narrative.
The topics numbered 201 to 250 were prepared for use with the TREC-4 conference. TREC-5 includes topics 251 to 300 and the TREC-6 conference includes topics 301 to 350.
3. Interactive : Task in which a human searcher interacts with the retrieval system to
determine the relevant documents. Documents are ruled relevant or not relevant.
4. NLP : Task aimed at verifying whether retrieval algorithms based on natural language
processing offer advantages when compared to the more traditional retrieval algorithms
based on the index terms.
5. Cross languages : Ad hoc task in which the documents are in one language but the
topics are in a different language.
6. High precision : Task in which the user of a retrieval system is asked to retrieve ten
documents that answer a given information request within five minutes.
7. Spoken document retrieval : Task in which the documents are written transcripts of
radio broadcast news shows.
8. Very large corpus : Ad hoc task in which the retrieval systems have to deal with
collections of size 20 gigabytes.
In TREC-7, the NLP and Chinese secondary tasks were discontinued. TREC-7 also
included a new task called Query Task in which several distinct query versions were
created for each example information request.
➤ Evaluation measures at the TREC conferences
TREC conferences uses four types of evaluations :
1. Summary table statistics
2. Recall-precision average
3. Document level averages
4. Average precision histograms
➤ Summary table statistics
Single value measures can also be stored in a table to provide a statistical summary
regarding the set of all the queries in a retrieval task.
For instance, these summary table statistics could include :
1. The number of queries.
2. Total number of documents retrieved by all queries.
3. Total number of relevant documents which were effectively retrieved when all queries
are considered.
4. Total number of relevant documents which could have been retrieved by all queries.
➤ Recall-precision average
It consists of a table or graph with average precision at 11 standard recall levels.
Since the recall levels of the individual queries are seldom equal to the standard recall
levels, interpolation is used to define the precision at the standard recall levels.
➤ Document level averages
Average precision is computed at specified document cutoff values.
The average precision might be computed when 5, 10, 20, 100 relevant documents have
been seen.
The average R-precision value might also be provided.
➤ Average precision histogram
It consists of a graph which includes a single measure for each separate topic.
It is difference between the R-precision for a target retrieval algorithm and the average R-
precision computed from the results of all participating retrieval system.
A unique environment for testing retrieval algorithms which are based on information
derived from cross-citing patterns. The CACM collection also includes a set of 52 test
information requests.
Example : "What articles exist which deal with TSS (Time Sharing System), an operating
system for IBM computers?"
Also includes two Boolean query formulations and a set of relevant documents. Since the
requests are fairly specific, the average number of relevant documents for each request is
small (around 15). Precision and recall tend to be low.
➤ The ISI collection
The 1,460 documents in the ISI test collection were selected from a previous collection
assembled by Small at ISI (Institute of Scientific Information).
The documents selected were those most cited in a cross-citation study done by Small. The
main purpose is to support investigation of similarities based on terms and on cross-citation
patterns.
The documents in the ISI collection include three types of subfields as follows.
1. Author names.
2. Word stems from the title and abstract sections.
3. Number of co-citations for each pair of articles.
The ISI collection includes a total of 35 test information requests for which there are
Boolean query formulations. It also includes 41 additional test information requests for
which there is no Boolean query formulation.
➤ Statistics for the CACM and ISI collections
1. Document statistics for the CACM and ISI Collections
Collection Number of documents Number of terms Terms/Documents
CACM 3204 10446 40.1
ISI 1460 7392 104.9
2. Query statistics for the CACM and ISI Collections
Collection    Number of queries    Terms per query    Relevant per query    Relevant in top 10
CACM          52                   11.4               15.3                  1.9
ISI           35 and 76            8.1                49.8                  1.7
3. Provide a controlled process designed to emphasize some terms (relevant ones) and de-
emphasize others (non-relevant ones)
Q.6 What are the assumptions of vector space model ?
🞕 Ans. : Assumption of vector space model :
3 Text Classification and Clustering
Syllabus
Contents
Machine learning provides business insight and intelligence. Decision makers are provided
with greater insights into their organizations. This adaptive technology is being used by
global enterprises to gain a competitive edge.
Machine learning algorithms discover the relationships between the variables of a system
(input, output and hidden) from direct samples of the system.
➤ Supervised Learning :
Supervised learning is the machine learning task of inferring a function from supervised
training data. The training data consist of a set of training examples.
The task of the supervised learner is to predict the output behavior of a system for any set
of input values, after an initial training phase.
Supervised learning in which the network is trained by providing it with input and
matching output patterns. These input-output pairs are usually provided by an external
teacher.
A supervised learning algorithm analyzes the training data and produces an inferred
function, which is called a classifier or a regression function.
➤ Un-Supervised Learning :
The model is not provided with the correct results during the training. It can be used to
cluster the input data in classes on the basis of their statistical properties only. Cluster
significance and labelling.
The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes. All similar input patterns are grouped together as clusters.
If a matching pattern is not found, a new cluster is formed. There is no error feedback.
They are called unsupervised because they do not need a teacher or super-visor to label a
set of training examples. Only the original data is required to start the analysis.
➤ Semi-Supervised Learning
Semi-supervised learning uses both labeled and unlabeled data to improve supervised
learning. The goal is to learn a predictor that predicts future test data better than the
predictor learned from the labeled training data alone.
Semi-supervised learning is motivated by its practical value in learning faster, better, and
cheaper. In many real-world applications, it is relatively easy to acquire a large amount of
unlabeled data x.
For example, documents can be crawled from the Web, images can be obtained from
surveillance cameras, and speech can be collected from broadcasts.
Semi-supervised learning makes use of both labeled and unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data.
➥ 3.2.1 Clustering
A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Conceptual clustering : Two or more objects belong to the same cluster if this one defines a
concept common to all that objects. In other words, objects are grouped according to their
fit to descriptive concepts, not according to simple similarity measures.
Clustering methods are unsupervised learning techniques. The cluster distance is the distance between the two closest members of each cluster. Clustering methods are usually categorized according to the type of cluster they produce :
1. Hierarchical methods : These methods produce a nested hierarchy of clusters, with small clusters of highly similar documents nested within larger clusters of less similar documents.
2. Non-hierarchical methods : These methods produce unordered (flat) lists of clusters.
Other categories of clustering method are exclusive clustering and overlapping clustering. In the first case (exclusive clustering) data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster.
The overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to
two or more clusters with different degrees of membership. In this case, data will be
associated to an appropriate membership value.
Cluster analysis involves applying one or more clustering algorithms with the goal of
finding hidden patterns or groupings in a dataset. Clustering algorithms form groupings or
clusters in such a way that data within a cluster have a higher measure of similarity than
data in any other cluster. The measure of similarity on which the clusters are modeled can
be defined by Euclidean distance, probabilistic distance, or another metric.
Cluster analysis is an unsupervised learning method and an important task in exploratory
data analysis. Popular clustering algorithms include :
1. Hierarchical clustering : Builds a multilevel hierarchy of clusters by creating a cluster
tree.
2. k-Means clustering : Partitions data into k distinct clusters based on distance to the
centroid of a cluster.
3. Gaussian mixture models : Models clusters as a mixture of multivariate normal density
components.
4. Self-organizing maps : Uses neural networks that learn the topology and distribution of
the data.
➤ Desirable Properties of a Clustering Algorithm
1. Scalability (in terms of both time and space)
2. Ability to deal with different data types
3. Minimal requirements for domain knowledge to determine input parameters
4. Interpretability and usability
➤ Distance between Clusters :
1. Single Link : smallest distance between points
2. Complete Link : largest distance between points
3. Average Link : average distance between points
4. Centroid : distance between centroids
K-means seeks to minimize the within-cluster point scatter :

W(C) = \frac{1}{2} \sum_{K} \sum_{C(i)=K} \sum_{C(j)=K} \| x_i - x_j \|^2 = \sum_{K} N_K \sum_{C(i)=K} \| x_i - m_K \|^2

where m_K is the mean vector of the K-th cluster and N_K is the number of observations in the K-th cluster.
➤ K-Means Algorithm Properties
1. There are always K clusters.
2. There is always at least one item in each cluster.
3. The clusters are non-hierarchical and they do not overlap.
4. Every member of a cluster is closer to its cluster than any other cluster because
closeness does not always involve the 'center' of clusters.
➤ The K-Means Algorithm Process
1. The dataset is partitioned into K clusters and the data points are randomly assigned to
the clusters resulting in clusters that have roughly the same number of data points.
2. For each data point.
a. Calculate the distance from the data point to each cluster.
b. If the data point is closest to its own cluster, leave it where it is.
c. If the data point is not closest to its own cluster, move it into the closest cluster.
3. Repeat the above step until a complete pass through all the data points results in no data
point moving from one cluster to another. At this point the clusters are stable and the
clustering process ends.
4. The choice of initial partition can greatly affect the final clusters that result, in terms of
inter-cluster and intracluster distances and cohesion.
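A compact sketch of this process for 2-D points is shown below; for simplicity the initial centers are sampled from the data (rather than built from a random initial partition) and the points are invented.

import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means : assign each point to its nearest center, then recompute the centers."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:          # stable : no point changed cluster
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]   # two obvious groups
centers, clusters = kmeans(points, k=2)
print(centers)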
1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2) Find the least-distance pair of clusters in the current clustering, say pair (r), (s), according to d[(r), (s)] = min d[(i), (j)], where the minimum is over all pairs of clusters in the current clustering.
3) Increment the sequence number : m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r), (s)].
4) Update the distance matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The distance between the new cluster, denoted (r, s), and an old cluster (k) is defined in this way : d[(k), (r, s)] = min(d[(k), (r)], d[(k), (s)]).
5) If all the data points are in one cluster then stop, else repeat from step 2.
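A naive single-link sketch of these steps is shown below; it repeatedly merges the closest pair of clusters and records the merge level, with invented 1-D points and no attempt to update a distance matrix efficiently.

import math

def single_link_distance(a, b):
    """Smallest distance between any member of cluster a and any member of cluster b."""
    return min(math.dist(x, y) for x in a for y in b)

def agglomerate(points, target_clusters=1):
    clusters = [[p] for p in points]             # L(0) = 0 : every point is its own cluster
    merges = []
    while len(clusters) > target_clusters:
        # find the least-distance pair (r, s) over all pairs of current clusters
        (r, s), level = min(
            (((i, j), single_link_distance(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda item: item[1],
        )
        merges.append((clusters[r], clusters[s], level))
        clusters[r] = clusters[r] + clusters[s]   # merge (r) and (s) into a single cluster
        del clusters[s]
    return clusters, merges

pts = [(1,), (1.2,), (5,), (5.3,), (11,)]         # invented 1-D points
clusters, merges = agglomerate(pts, target_clusters=2)
print(clusters)                                    # [[(1,), (1.2,), (5,), (5.3,)], [(11,)]]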
Fig. 3.3.1
In decision tree learning, a new example is classified by submitting it to a series of tests that
determine the class label of the example. These tests are organized in a hierarchical
structure called a decision tree.
Decision tree has three other names :
1. Classification tree analysis is a term used when the predicted outcome is the class to
which the data belongs.
2. Regression tree analysis is a term used when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
Internal node denotes a test on an attribute. Branch represents an outcome of the test. Leaf
nodes represent class labels or class distribution.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification rules.
➤ Decision Tree Algorithm
To generate decision tree from the training tuples of data partition D.
➤ Input :
1. Data partition ( D)
2. Attribute list
3. Attribute selection method
➤ Algorithm :
1. Create a node (N)
2. If tuples in D are all of the same class then
3. Return node (N) as a leaf node labeled with the class C.
4. If attribute list is empty then return N as a leaf node labeled with the majority class in D
5. Apply Attribute selection method(D, attribute list) to find the "best" splitting criterion;
6. Label node N with splitting criterion;
7. If splitting attribute is discrete-valued and multiway splits are allowed
8. Then attribute list ← attribute list − splitting attribute
9. For (each outcome j of splitting criterion )
10. Let Dj be the set of data tuples in D satisfying outcome j;
11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;
12. Else attach the node returned by Generate decision tree(Dj , attribute list) to node N;
13. End of for loop
14. Return N;
CART analysis is a term used to refer to both of the above procedures. The name CART is
an acronym from the words Classification And Regression Trees, and was first introduced
by Breiman et al.
Learn trees in a top-down fashion :
1. Divide the problem into subproblems.
2. Solve each subproblem.
➤ Basic Divide-And-Conquer Algorithm :
1. Select a test for root node. Create branch for each possible outcome of the test.
2. Split instances into subsets. One for each branch extending from the node.
3. Repeat recursively for each branch, using only instances that reach the branch.
4. Stop recursion for a branch if all its instances have the same class.
Goal : Build a decision tree for classifying examples as positive or negative instances of a
concept.
Supervised learning, batch processing of training examples, using a preference bias.
Decision tree is a tree where
a. Each non-leaf node has associated with it an attribute (feature).
b. Each leaf node has associated with it a classification (+ or –).
c. Each arc has associated with it one of the possible values of the attribute at the node
from which the arc is directed.
For example :
Fig. 3.3.2
Decision tree generation consists of two phases : Tree construction and pruning.
In tree construction phase, all the training examples are at the root. Partition examples
recursively based on selected attributes.
In tree pruning phase, the identification and removal of branches that reflect noise or
outliers.
There are various paradigms that are used for learning binary classifiers which include :
1. Decision Trees
2. Neural Networks
3. Bayesian Classification
4. Support Vector Machines
➤ Inductive Learning and Bias
Suppose that we want to learn a function f(x) → y and we are given some sample (x, y) pairs, as in Fig. 3.3.3 (a). There are several hypotheses we could make about this function, e.g. Fig. 3.3.3 (b), (c) and (d).
Fig. 3.3.3
A preference for one over the others reveals the bias of our learning technique, e.g. :
i) Prefer piece-wise functions
ii) Prefer a smooth function
iii) Prefer a simple function and treat outliers as noise
Preference Bias : The simplest explanation that is consistent with all observations is the
best. Here, that means the smallest decision tree that correctly classifies all of the training
examples is best.
Finding the provably smallest decision tree is an NP-Hard problem, so instead of
constructing the absolute smallest tree that is consistent with all of the training examples,
construct one that is pretty small.
Training Set Error : For each record, follow the decision tree to see what it would predict.
For what number of records does the decision tree's prediction disagree with the true value
in the database ? This quantity is called the training set error. The smaller the better.
Test Set Error : We hide some data away when we learn the decision tree. But once
learned, we see how well the tree predicts that data. This is a good simulation of what
happens when we try to predict future data. It is called Test Set Error.
🞕 Solution :
H(class) = H(3/6, 3/6) = 1
H(class | color) = 3/6 · H(2/3, 1/3) + 1/6 · H(1/1, 0/1) + 2/6 · H(0/2, 2/2)

(3 of the 6 examples are red, of which 2 are positive and 1 is negative; 1 of 6 is blue; 2 of 6 are green)

= 1/2 · (– 2/3 log2 2/3 – 1/3 log2 1/3) + 1/6 · (– 1 log2 1 – 0 log2 0) + 2/6 · (– 0 log2 0 – 1 log2 1)
= 1/2 · (– 2/3 (log2 2 – log2 3) – 1/3 (log2 1 – log2 3)) + 1/6 · 0 + 2/6 · 0
Max(0.543, 0.0, 0.459) = 0.543, so color is best. Make the root node's attribute color and
partition the examples for the resulting children nodes as shown :
The children associated with values green and blue are uniform, containing only – and +
examples, respectively. So make these children leaves with classifications – and +,
respectively.
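The arithmetic above is easy to check in code. The helper below computes the entropy and the information gain of the color split for six examples mirroring the counts in the solution; the small difference from the 0.543 quoted above is only rounding.

import math

def entropy(labels):
    """H = -sum p log2 p over the class proportions in `labels`."""
    total = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        h -= p * math.log2(p)
    return h

# Six training examples mirroring the worked solution : (color, class)
examples = [("red", "+"), ("red", "+"), ("red", "-"),
            ("blue", "+"), ("green", "-"), ("green", "-")]

classes = [c for _, c in examples]
h_class = entropy(classes)                        # H(3/6, 3/6) = 1.0

# Conditional entropy H(class | color) and the resulting information gain.
h_cond = 0.0
for color in {"red", "blue", "green"}:
    subset = [c for col, c in examples if col == color]
    h_cond += len(subset) / len(examples) * entropy(subset)

print(round(h_class, 3), round(h_cond, 3), round(h_class - h_cond, 3))   # 1.0 0.459 0.541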
While Euclidean metric is useful in low dimensions, it doesn't work well in high
dimensions and for categorical variables. The drawback of Euclidean distance is that it
ignores the similarity between attributes. Each attribute is treated as totally different from all of the other attributes.
➤ Mahalanobis distance
Mahalanobis distance is also called quadratic distance.
Mahalanobis distance is a distance measure between two points in the space defined by two
or more correlated variables. Mahalanobis distance takes the correlations within a data set between the variables into consideration.
If there are two non-correlated variables, the Mahalanobis distance between the points of
the variable in a 2D scatter plot is same as Euclidean distance.
The Mahalanobis distance is the distance between an observation and the center for each
group in m-dimensional space defined by m variables and their covariance. Thus, a small
value of Mahalanobis distance increases the chance of an observation to be closer to the
group's center and the more likely it is to be assigned to that group.
The Mahalanobis distance between two samples (x, y) of a random variable, taking into account the covariance matrix S of the data set, is defined as

dMahalanobis(x, y) = (x – y)^T S^(–1) (x – y)
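As an illustration only, the following sketch computes the Mahalanobis distance with NumPy; the small two-variable data set is hypothetical and the covariance matrix S is estimated from it :

import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between points x and y, with the covariance
    matrix S estimated from the rows of `data`."""
    S = np.cov(data, rowvar=False)      # covariance of the variables
    S_inv = np.linalg.inv(S)            # S^(-1)
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ S_inv @ d))

# Illustrative two-variable data set
data = np.array([[2.0, 2.1], [2.5, 2.4], [3.0, 3.2], [3.5, 3.4], [4.0, 4.1]])
print(mahalanobis([2.0, 2.1], [4.0, 4.1], data))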
The steps that need to be carried out during the KNN algorithm are as follows :
a. Divide the data into training and test data.
b. Select a value K.
c. Determine which distance function is to be used.
d. Choose a sample from the test data that needs to be classified and compute the distance
to the n training samples.
e. Sort the distances obtained and take the k-nearest data samples.
f. Assign the test class to the class based on the majority vote of its K neighbors.
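A minimal sketch of these steps, assuming numeric feature vectors, Euclidean distance and a hypothetical training set :

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(test_point, training_data, k=3, distance=euclidean):
    """training_data is a list of (feature_vector, label) pairs."""
    # Steps d and e: compute distances to all training samples and sort them
    neighbours = sorted(training_data, key=lambda s: distance(test_point, s[0]))[:k]
    # Step f: majority vote among the k nearest neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set and query point
training = [([1.0, 1.1], "A"), ([1.2, 0.9], "A"), ([3.0, 3.2], "B"), ([3.1, 2.9], "B")]
print(knn_classify([1.1, 1.0], training, k=3))   # "A"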
Fig. 3.3.4 shows geometrical representation of K-nearest neighbor algorithm.
Fig. 3.3.4
The performance of the KNN algorithm is influenced by three main factors :
1. The distance function or distance metric used to determine the nearest neighbors.
2. The decision rule used to derive a classification from the K-nearest neighbors.
3. The number of neighbors used to classify the new example.
➤ Advantages
1. The KNN algorithm is very easy to implement.
2. Nearly optimal in the large sample limit.
3. Uses local information, which can yield highly adaptive behavior.
4. Lends itself very easily to parallel implementations.
➤ Disadvantages
1. Large storage requirements.
2. Computationally intensive recall.
3. Highly susceptible to the curse of dimensionality.
University Question
Given data points xi = (xi1, xi2, ....., xid) and a set of corresponding output labels, assume the
dimension d of the data point x is very large and we want to classify x.
Problems with high-dimensional input vectors include a large number of parameters to learn;
if the dataset is small, this can result in overfitting and a large variance of the estimates.
Stemming : This is the process of treating a set of words like "fights, fighting, fighter, …"
as all instances of the same term - in this case the stem is "fight".
Gain(S, A) = Entropy(S) – Σ v ∈ Values(A) ( |Sv| / |S| ) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A and Sv is the subset of S for
which attribute A has value v.
➤ Pruning by Information Gain :
The simplest technique is to prune out portions of the tree that result in the least
information gain.
This procedure does not require any additional data, and only bases the pruning on the
information that is already computed when the tree is being built from training data.
The process of information gain based pruning requires us to identify “twigs”, nodes whose
children are all leaves.
“Pruning” a twig removes all of the leaves which are the children of the twig, and makes
the twig a leaf. The Fig. 3.4.2 illustrates this.
Fig. 3.4.2
A statistical measure of how well a binary classification test correctly identifies a condition.
Probability of correctly labelling members of the target class.
No single measure tells the whole story. A classifier with 90 % accuracy can be useless if
90 % of the population does not have cancer and the 10 % that do are misclassified by the classifier.
Binary classification accuracy metrics quantify the two types of correct predictions and two
types of errors. Typical metrics are accuracy (ACC), precision, recall, false positive rate,
F1-measure. Each metric measures a different aspect of the predictive model.
Recall is the ratio of the number of relevant records retrieved to the total number of
relevant records in the database. It is usually expressed as a percentage.
Recall = [A / (A + B)] × 100 %
Precision is the ratio of the number of relevant records retrieved to the total number of
irrelevant and relevant records retrieved. It is usually expressed as a percentage.
Precision = [A / (A + C)] × 100 %
As recall increases, precision decreases; as recall decreases, precision increases.
Recall = 56.25 %
Precision = [A / (A + C)] × 100 %
Precision = [45 / (45 + 15)] × 100 % = (45 / 60) × 100 %
Precision = 75 %
Example 3.5.2 : 20 documents are found, of which 18 are relevant; 3 relevant documents are not found,
and 27 irrelevant documents are also not found. Calculate the precision, recall and fallout scores for the search.
🞕 Solution :
Precision : 18/20 = 90 %
Recall : 18/21 = 85.7 %
Fall-out : 2/29 = 6.9 %
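The same arithmetic can be expressed as a small helper; the counts below are those of example 3.5.2 :

def evaluate(retrieved_relevant, retrieved_irrelevant,
             missed_relevant, missed_irrelevant):
    precision = retrieved_relevant / (retrieved_relevant + retrieved_irrelevant)
    recall = retrieved_relevant / (retrieved_relevant + missed_relevant)
    fallout = retrieved_irrelevant / (retrieved_irrelevant + missed_irrelevant)
    return precision, recall, fallout

# Example 3.5.2: 20 documents found, 18 of them relevant;
# 3 relevant and 27 irrelevant documents were not found.
p, r, f = evaluate(18, 2, 3, 27)
print(f"Precision {p:.1%}, Recall {r:.1%}, Fall-out {f:.1%}")
# Precision 90.0%, Recall 85.7%, Fall-out 6.9%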
Recall is a non-decreasing function of the number of docs retrieved. In a good system,
precision decreases as either the number of docs retrieved or recall increases. This is not a
theorem, but a result with strong empirical confirmation.
The set of ordered pairs makes up the precision-recall graph. Geometrically when the points
have been joined up in some way they make up the precision-recall curve. The performance
of each request is usually given by a precision-recall curve. To measure the overall
performance of a system, the set of curves, one for each request, is combined in some way
to produce an average curve.
Assume that set Rq containing the relevant document for q has been defined. Without loss
of generality, assume further that the set Rq is composed of the following documents :
Rq = { d3, d5, d9, d25, d39, d44, d56, d71, d89, d123 }
There are ten documents which are relevant to the query q.
For the query q, the ranking of the documents in the answer set is as follows.
Ranking for query q :

1. d123 *      6. d9 *      11. d38
2. d84         7. d511      12. d48
3. d56 *       8. d129      13. d250
4. d6          9. d187      14. d113
5. d8         10. d25 *     15. d3 *
The documents that are relevant to the query q are marked with a star after the document
number. There are ten relevant documents, five of which are included in the top 15.
Fig. 3.5.1 shows the curve of precision versus recall. By taking various numbers of the top
returned documents (levels of recall), the evaluator can produce a precision-recall curve.
The precision versus recall curve is usually plotted based on 11 standard recall level :
0 %, 10 %,…., 100 %.
In this example, the precisions for recall levels higher than 50 % drop to 0 because no
relevant documents were retrieved beyond that point. The precision at recall level 0 % is
obtained by interpolation, since the recall levels observed for a query might be distinct from
the 11 standard recall levels.
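A minimal sketch of the 11-point interpolation, assuming only the ranks of the relevant documents are known. It uses the ranking of the example above (relevant documents at positions 1, 3, 6, 10 and 15, out of 10 relevant documents) and reproduces the behaviour just described, i.e. interpolated precision at recall 0 % and zero precision beyond 50 % recall :

def interpolated_precisions(relevant_ranks, total_relevant):
    """Interpolated precision at the 11 standard recall levels (0 %, 10 %, ..., 100 %).
    relevant_ranks are the 1-based positions at which relevant documents occur."""
    # Precision and recall observed at each relevant document
    points = [((i + 1) / total_relevant, (i + 1) / rank)
              for i, rank in enumerate(sorted(relevant_ranks))]
    levels = [r / 10 for r in range(11)]
    # Interpolated precision at level r = max precision observed at any recall >= r
    return [max((p for rec, p in points if rec >= level), default=0.0)
            for level in levels]

# Relevant documents at positions 1, 3, 6, 10 and 15; 10 relevant documents in total
print(interpolated_precisions([1, 3, 6, 10, 15], 10))
# 1.0, 1.0, 0.67, 0.5, 0.4, 0.33, then 0.0 for the remaining recall levels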
Fig. 3.5.3
N = Number of queries.
P(doci) = Precision at the i-th relevant document.
Example 3.5.3 :
Query 1 (relevant documents at ranks 1, 3, 6, 10 and 20) :

Rank   Relev.   P(doci)
1      X        1.00
3      X        0.67
6      X        0.50
10     X        0.40
20     X        0.25
AVG : 0.564

Query 2 (relevant documents at ranks 1, 3 and 15) :

Rank   Relev.   P(doci)
1      X        1.00
3      X        0.67
15     X        0.20
AVG : 0.623
MAP favors systems which return relevant documents fast.
🞕 Solution :
MAP = (0.564 + 0.623) / 2
MAP = 0.594
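A minimal sketch that recomputes the average precision of each query and the MAP from the ranks of the relevant documents alone :

def average_precision(relevant_ranks):
    """Average of the precision values measured at each relevant document."""
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

query1 = [1, 3, 6, 10, 20]   # ranks of the relevant documents of query 1
query2 = [1, 3, 15]          # ranks of the relevant documents of query 2
print(round(average_precision(query1), 3))                   # 0.563
print(round(average_precision(query2), 3))                   # 0.622
print(round(mean_average_precision([query1, query2]), 3))    # 0.593
# The text obtains 0.564, 0.623 and 0.594 because it rounds each
# precision value to two decimals before averaging.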
A necessary consequence of the monotonicity of the interpolated precision is that the average
P-R curve will also be monotonically decreasing. It is possible to define the set of observed
points in such a way that the interpolated function is not monotonically decreasing. In
practice, however, even in this case the average precision-recall curve is monotonically decreasing.
➤ Precision-recall appropriateness
Precision and recall have been extensively used to evaluate the retrieval performance of IR
algorithms. However, a more careful reflection reveals problems with these two measures :
First, the proper estimation of maximum recall for a query requires detailed knowledge of
all the documents in the collection.
Second, in many situations the use of a single measure could be more appropriate.
Third, recall and precision measure the effectiveness over a set of queries processed in
batch mode.
Fourth, for systems which require a weak ordering though, recall and precision might be
inadequate.
➤ Single value summaries
Average precision-recall curves constitute standard evaluation metrics for information
retrieval systems. However, there are situations in which we would like to evaluate retrieval
performance over individual queries. The reasons are twofold :
1. First, averaging precision over many queries might disguise important anomalies in the
retrieval algorithms under study.
2. Second, we might be interested in investigating whether one algorithm outperforms the
other for each query.
In these situations, a single precision value can be used.
➤ R-Precision
If we have a known set of relevant documents of size Rel, then calculate precision of the
top Rel docs returned.
Let R be the total number of relevant docs for a given query. The idea here is to compute
the precision at the Rth position in the ranking.
For the query q1, the R value is 10 and there are 4 relevant among the top 10 documents in
the ranking. Thus, the R-Precision value for this query is 0.4.
The R-precision measure is useful for observing the behavior of an algorithm for
individual queries. Additionally, one can also compute an average R-precision figure over a
set of queries.
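A minimal sketch of R-precision, using the relevant set and ranking of the earlier example (R = 10, with four relevant documents among the top 10 positions) :

def r_precision(ranked_docs, relevant_docs):
    """Precision at position R, where R is the number of relevant documents."""
    R = len(relevant_docs)
    top_R = ranked_docs[:R]
    return sum(1 for d in top_R if d in relevant_docs) / R

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25",
           "d38", "d48", "d250", "d113", "d3"]
print(r_precision(ranking, relevant))   # 0.4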
However, using a single number to evaluate an algorithm over several queries might be quite
imprecise.
➤ Precision histograms
The R-precision computed for several queries can be used to compare two algorithms as
follows :
Let,
RPA(i) : R-precision for algorithm A for the i-th query.
RPB(i) : R-precision for algorithm B for the i-th query.
The algorithm A performs better for 8 of the queries, while the algorithm B performs better
for the other 2 queries.
The same process is repeated for phrase-based search, except that all work is done on
phrases identified by a phrase identification program, which identifies phrases by counting
their frequencies; phrases having a suitable count are selected for processing. Thus an index
table is generated containing phrases as its indices.
Automatic indexing is the process of representing the topical content of digitally stored
documents or document surrogates, performed without, or with very modest, human
intervention.
➥ 3.7.1 Inverted Indexes
Each document is assigned a list of keywords or attributes. Each keyword (attribute) is
associated with operational relevance weights.
An inverted file is the sorted list of keywords (attributes), with each keyword having links
to the documents containing that keyword.
Penalty : The size of inverted files ranges from 10 % to 100 % or more of the size of the
text itself. The index needs to be updated as the data set changes.
Inverted file is composed of two elements :
a. Vocabulary b. Occurrences
Vocabulary is the set of all different words in the text. For each such word a list of all the
text positions where the word appears is stored. The set of all those lists is called the
occurrences .
A controlled vocabulary which is the collection of keywords that will be indexed. Words in
the text that are not in the vocabulary will not be indexed. A list of stop-words that for
reasons of volume will not be included in the index.
A set of rules that decide the beginning of a word or a piece of text that is indexable. A list
of character sequences to be indexed (or not indexed).
Fig. 3.7.1 shows a sample text with inverted index file. Each entry in the vocabulary has
the word, a pointer into the postings structure and word metadata.
Fig. 3.7.2 : Sample text split into blocks and an inverted index using block addressing
The blocks can be of fixed size or they can be defined using the natural division of the text
collection into files, documents, web pages etc. Concept of block improves the retrieval
efficiency.
➥ 3.7.2 Searching
The search algorithm on an inverted index follows three steps :
1. Vocabulary search : The words and patterns present in the query are isolated and
searched for in the vocabulary.
2. Retrieval of occurrences : The list of the occurrences of all the words found is
retrieved.
3. Manipulation of occurrences : The occurrences are processed to solve phrases,
proximity or Boolean operations. If block addressing is used it may be necessary to
directly search the text to find the information missing from the occurrences.
Structures used in inverted files are sorted arrays, hashing structures, Tries (digital search
trees) and combinations of these structures. Single word queries can be searched using any
suitable data structure to speed up the search.
If the index stores character positions the phrase query cannot allow the separators to be
disregarded and the proximity has to be defined in terms of character distance.
➥ 3.7.3 Construction
Building and maintaining an inverted index is a relatively low cost task. An inverted index
on a text of "n" characters can be built in O(n) time. All the vocabulary is kept into the trie
data structure, storing for each word a list of its occurrences.
Each word of the text is read and searched in the trie. If it is not found, it is added to the trie
with an empty list of occurrences. Once it is in the trie, the new position is added to the end
of its list of occurrences. Fig. 3.7.3 shows building an inverted index for the sample text.
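A minimal sketch of the construction just described, using a Python dictionary in place of the trie and a hypothetical two-document collection; each vocabulary entry maps to its list of (document, position) occurrences, which is also enough to answer single-word queries :

import re
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict mapping doc_id -> text.
    Returns vocabulary word -> list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index

docs = {1: "This is a text. A text has many words.",
        2: "Words are made from letters."}
index = build_inverted_index(docs)
print(index["text"])    # [(1, 3), (1, 5)]
print(index["words"])   # [(1, 8), (2, 0)]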
4 Web Retrieval
and Web Crawling
Syllabus
The Web - Search Engine Architectures - Cluster based Architecture - Distributed Architectures -
Search Engine Ranking - Link based Ranking - Simple Ranking Functions - Learning to Rank -
Evaluations - Search Engine Ranking - Search Engine User Interaction - Browsing - Applications
of a Web Crawler - Taxonomy - Architecture and Implementation - Scheduling Algorithms -
Evaluation.
WWW uses client-server interaction. The browser program acts as a client that uses the
Internet to contact a remote server for a copy of the requested page. The server on the
remote system returns a copy of the page along with the additional information.
➥ 4.1.1 Characteristics
In characterizing the structure and content of the Web, it is necessary to establish precise
semantics for Web concepts.
Two characteristics make retrieval of relevant information from the Web a really hard task :
a. The large and distributed volume of data available
b. The fast pace of change
The Internet, and in particular the Web, is dynamic in nature, so it is a difficult task to measure.
In the mid-1980s, you might spend all afternoon visiting your friends before dropping by
the bank and grocery, and then go out to dinner and a show after.
Today, your banking, shopping and chat with your friends are all readily handled from your
tablet or phone, and if you're not in the mood for a fancy outing, Netflix and a quick
Internet-ordered pizza take care of the evening's entertainment needs, and all of it can be
accomplished in less time than it takes to say "The Internet made me a hermit."
It was something of a privilege to be online, but the net gained traction really quickly.
While just 16 million people were using the web in 1995, this figure had leapt to 50 million
three years later, one billion by 2009 and more than twice that last year.
In 1993 there were an estimated 130 websites online, a figure which jumped to ten thousand by
1996. Fast-forward to 2012, and there are 634 million websites competing for your
attention.
p(x) = (1 / (x σ √(2π))) exp( – (ln x – μ)² / (2 σ²) )

Fig. 4.1.1

where the average (μ) and standard deviation (σ) are 9.357 and 1.318 respectively.
The majority of the documents are small, but there is a non trivial number of large
documents. This is intuitive for image or video files, but it is also true for HTML pages. A
good fit is obtained with the Pareto distribution
p(x) = α k^α / x^(1 + α)

where x is measured in bytes and k and α are parameters of the distribution.
So what languages dominate the Web ? It should come as no surprise that English still
dominates the Web, with more than two-thirds of the Web's pages being in English.
According to a study by a Web site in the Catalan language, Japanese is the second most
popular language of Web sites.
Web pages by language
Language Web pages Percent of total
English 214,250,996 68.39
Japanese 18,335,739 5.85
German 18,069,744 5.77
Chinese 12,113,803 3.87
French 9,262,663 2.96
Spanish 7,573,064 2.42
Russian 5,900,956 1.88
Italian 4,883,497 1.56
Portuguese 4,291,237 1.37
Korean 4,046,530 1.29
Dutch 3,161,844 1.01
Swedish 2,929,241 0.93
The figure consists of two parts : one deals with users, consisting of the user interface and
the query engine, and another that consists of the crawler and indexer modules.
➤ 2. Indexer
The index is used in a centralized fashion to answer queries submitted from different
places in the web. Index the downloaded pages. Each downloaded page is processed
locally.
The indexing information is saved and the page is discarded.
A module that takes a collection of documents or data and builds a searchable index
from them. Common practices are inverted files, vector spaces, suffix structures and
hybrids of these.
Exception : Some search engines keep a local cache copy of many popular pages
indexed in their database, to allow for faster access and in case the destination server is
temporarily in-accessible.
➤ 3. User interface :
Solicit queries and deliver answers. All requests are submitted to a single site.
➤ 4. Query engine :
It processes queries against the index. All processing is done locally. It requires a massive
array of computers and storage.
The following table lists search engines with their URL and the number of web pages
indexed, in millions, up to May 1998.
Search engine URL Web pages indexed (millions)
AltaVista www.altavista.com 140
AOL Netfind www.aol.com/netfind/ -
Excite www.excite.com 55
Google google.stanford.edu 25
GoTo goto.com -
HotBot www.hotbot.com 110
Infoseek www.infoseek.com 30
Lycos www.lycos.com 30
Magellan www.mckinley.com 55
Microsoft search.msn.com -
Northern Light www.nlsearch.com 67
WebCrawler www.webcrawler.com 2
➤ Features of harvest :
Harvest is a modular, distributed search system framework with a working set of components to
make it a complete search system. The default setup is to be a web search engine, but it is
also much more and provides the following features :
1. Harvest is designed to work as a distributed system. It can distribute the load among
different machines. It is possible to use a number of machines to gather data. The full-
text indexer doesn't have to run on the same machine as the broker or web server.
2. Harvest is designed to be modular. Every single step during data collection and
answering of search requests is implemented as a separate program. This makes it easy to
modify or replace parts of Harvest to customize its behaviour.
3. Harvest allows complete control over the content of data in the search database. It is
possible to customize the summarizer to create desired summaries which will be used
for searching. The filtering mechanism of Harvest allows making modifications to the
summary created by summarizers. Manually created summaries can be inserted to the
search database.
4. The Search interface is written in Perl to make customization easy, if desired.
For 100 million pages, this implies about 150 GB of disk space. Assuming that 500 bytes
are required to store the URL and the description of each Web page, we need 50 GB to
store the description for 100 million pages.
The use of meta search engines is justified by coverage studies which show that only a small
percentage of Web pages are in all search engines. Moreover, less than 1 % of the Web
pages indexed by AltaVista, HotBot, Excite and Infoseek are in all of those search engines.
➤ Advantages of distributed architecture :
1. Server load reduced : A gatherer running on a server reduces the external traffic (i.e.,
crawler requests) on that server.
2. Network traffic reduced : Crawlers retrieve entire documents, whereas Harvest moves
only summaries.
3. Redundant work avoided : A gatherer sending information to multiple brokers reduces
work repetition.
Computes hubs and authorities for a particular topic specified by a normal query. First
determines a set of relevant pages for the query called the base set S. Analyze the link
structure of the web sub-graph defined by S to find authority and hub pages in this set.
Hubs : It contains many outward links and lists of resources.
Authorities : It contains many inward links and provides resources, content.
Fig. 4.3.1
Let H(p) and A(p) be the hub and authority value of page p. These values are defined
such that the following equations are satisfied for all pages p. Authorities are pointed to
by lots of good hubs :

A(p) = Σ H(q), where the sum runs over all pages q that point to p

and hubs point to lots of good authorities :

H(p) = Σ A(q), where the sum runs over all pages q pointed to by p
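A minimal sketch of the iterative HITS computation over a small hypothetical link graph; the scores are normalised after every iteration so that they stay bounded :

import math

def hits(links, iterations=20):
    """links: dict page -> list of pages it points to.
    Returns (hub, authority) score dictionaries."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(q) over pages q pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # H(p) = sum of A(q) over pages q pointed to by p
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": ["p4"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))   # "p4", the most pointed-to page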
➤ C. PageRank
Numeric value to measure how important a page is. PageRank (PR) is the actual ranking
of a page, as determined by Google.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is
commonly expressed as a "50 % chance" of something happening. Hence, a PageRank
of 0.5 means there is a 50 % chance that a person clicking on a random link will be
directed to the document with the 0.5 PageRank.
Purpose : To increase the quality of search results and has left all of its competitors for
dead.
The Google Page Rank is based on how many links you have to and from your pages,
and their respective page rank.
We update our index every four weeks. Each time we update our database of web pages,
our index invariably shifts : We find new sites, we lose some sites, and sites ranking
may change. Your rank naturally will be affected by changes in the ranking of other
sites.
The Google PageRank (PR) is calculated for every webpage that exists in Google's
database. Its real value varies from 0.15 to infinity, but for representation purposes it is
converted to a value between 0 and 10 (from low to high). The calculation of the PR for
a page is based on the quantity and quality of web pages that contain links to that page.
Let C(a) be the number of outgoing links of page "a" and suppose that page "a" is
pointed to by pages p1 to pn. Then, the PageRank PR(a) of "a" is defined as

PR(a) = q + (1 – q) Σ (i = 1 .. n) PR(pi) / C(pi)

where q must be set by the system.
It is obvious that the PageRank algorithm does not rank the whole website, but it's
determined for each page individually. Furthermore, the PageRank of page A is
recursively defined by the PageRank of those pages which link to page A.
PageRank can be computed using an iterative algorithm. This means that each page is
assigned an initial starting value and the PageRanks of all pages are then calculated in
several computation circles based on the equations determined by the PageRank
algorithm. The iterative calculation shall again be illustrated by the three-page example,
whereby each page is assigned a starting PageRank value of 1.
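A minimal sketch of this iterative calculation, using the formula above with a hypothetical three-page graph and a starting PageRank value of 1 for every page :

def pagerank(links, q=0.15, iterations=50):
    """links: dict page -> list of pages it links to.
    Uses PR(a) = q + (1 - q) * sum(PR(p) / C(p)) over the pages p linking to a."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}                       # starting value of 1
    out_degree = {p: len(links.get(p, [])) for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            incoming = (pr[p] / out_degree[p] for p in pages if a in links.get(p, []))
            new_pr[a] = q + (1 - q) * sum(incoming)
        pr = new_pr
    return pr

# Three-page example: A <-> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(graph))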
➥ 4.3.2 Simple Ranking Functions
The simplest ranking scheme uses a global ranking function such as PageRank. In this case,
the quality of a Web page in the result set is independent of the query. The query only selects
pages to be ranked.
A more elaborate ranking scheme uses a linear combination of different ranking signals.
To illustrate, consider the pages p that satisfy a query Q. The rank score R(p, Q) of page p with
regard to query Q can be computed as :

R(p, Q) = α · BM25(p, Q) + (1 – α) · PR(p)

where α = 1 yields a purely text-based ranking (as in early search engines) and
α = 0 yields a purely link-based ranking, independent of the query.
Current engines combine a text-based ranking with a link-based ranking.
➥ 4.3.4 Evaluations
To evaluate quality, Web search engines typically use
a. human judgements of which results are relevant for a given query
b. some approximation of a ground truth inferred from users' clicks
c. combination of both
To evaluate search results, use precision-recall metrics. Precision of Web results should be
measured only at the top positions in the ranking, say P@5, P@10 and P@20. Each query-
result pair should be subjected to 3-5 independent relevance assessments.
An advantage of using click-through data to evaluate the quality of answers derives from its
scalability. Note that users’ clicks are not used as a binary signal but in significantly more
complex ways such as :
a. Considering whether the user remained a long time on the page they clicked (a good
signal),
b. Jumped from one result to the other (a signal that nothing satisfying was found), or
c. The user clicked and came back right away (possibly implies Web spam)
These measures and their usage are complex and kept secret by leading search engines. An
important problem when using clicks is to take into account that the click rate is biased by
the ranking of the answer and the user interface.
While the search rectangle is the favored layout style, there are alternatives :
a. Many sites include an Advanced Search page (rarely used)
b. Search toolbars provided by most search engines as a browser plug-in can be seen as a
version of the search rectangle.
c. Ultimate rectangle, introduced by Google’s Chrome omnibox, merges the functionality
of the address bar with that of the search box.
➤ Query Language
Queries in relational or object-oriented database systems are based on an exact match
mechanism, by which the system is able to return exactly those tuples satisfying some well
specified criteria given in the query expression.
When the query is submitted, the features of the query object are matched against the
features of the objects stored in the database and only the objects that are most similar to
the query object are returned to the user.
In designing a multimedia query language, following points are considered :
1. How the user enters his/her request to the system, i.e. which interfaces are provided to
the user for query information ?
2. Which conditions on multimedia objects can be specified in the user request.
3. How uncertainty, proximity and weights impact the design of the query language.
➤ Search Query Operators
Search operator is classified as
1. Punctuation based search operator.
2. Boolean search operator.
3. Advanced search operators.
Boolean search is a search that uses the logical operators (AND, OR, NOT, -) in addition to
the keywords.
a. AND : The AND operator tells the search engine to return only documents with all the
keywords you entered.
b. OR : The OR operator tells the search engine to return documents if they contain one or
more keywords.
c. NOT : The NOT operator tells the search engine to exclude documents from a search if
they contain the keywords.
d. - Operator : The "-" operator is the same as the NOT operator and tells the search
engine to exclude documents from a search if they contain the keywords.
➠ 4.5 Browsing
Browsing and searching are the two main paradigms for finding information on line. The
search paradigm has a long history; search facilities of different kinds are available in all
computing environments.
The browsing paradigm is newer and less ubiquitous, but it is gaining enormous popularity
through the World-Wide Web.
Both paradigms have their limitations. Search is sometimes hard for users who do not know
how to form a search query so that it is limited to relevant information.
Browsing can make the content come alive, and it is therefore more satisfying to users who
get positive reinforcement as they proceed. However, browsing is time-consuming and
users tend to get disoriented and lose their train of thoughts and their original goals.
Flat browsing : The user explores a document space that follows a flat organization. Documents
might be represented as dots in a two-dimensional plane or as elements in a single-
dimension list, which might be ranked alphabetically or by any other order.
A Web crawler is also called an ant, bot, worm or Web spider. The process of scanning the
WWW is called Web crawling or spidering.
Web crawling is the process by which we gather pages from the Web, in order to index
them and support a search engine. The objective of crawling is to quickly and efficiently
gather as many useful web pages as possible, together with the link structure that
interconnects them.
It starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks
in the page and adds them to the list of URLs to visit, recursively browsing the Web
according to a set of policies.
Web crawlers are mainly used to create a copy of all the visited pages for later processing
by a search engine, which will index the downloaded pages to provide fast searches.
Crawlers can also be used for automating maintenance tasks on a Web site, such as
checking links or validating HTML code. Also, crawlers can be used to gather specific
types of information from Web pages, such as harvesting e-mail addresses.
When a search engine's web crawler visits a web page, it "reads" the visible text, the
hyperlinks, and the content of the various tags used in the site, such as keyword rich Meta
tags.
Using the information gathered from the crawler, a search engine will then determine what
the site is about and index the information. The website is then included in the search
engine's database and its page ranking process.
➤ Crawling process:
The crawler begins with a seed set of URLs to fetch. The crawler fetches and parses the
corresponding Web-pages and extracts both text and links. The text is fed to a text indexer;
the links (URL) are added to a URL frontier.
Initially, the URL frontier contains the seed set; as pages are fetched, the corresponding
URLs are deleted from the URL frontier. The entire process may be viewed as traversing
the web graph. In continuous crawling, the URL of a fetched page is added back to the
frontier for fetching again in the future.
A DNS resolution module that determines the web server from which to fetch the page
specified by a URL.
A fetch module that uses the http protocol to retrieve the web page at a URL.
A parsing module that extracts the text and set of links from a fetched web page.
A duplicate elimination module that determines whether an extracted link is already in the
URL frontier or has recently been fetched.
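A minimal sketch of a crawling loop built from these modules (fetch, parse, duplicate elimination, URL frontier); the seed URL is a placeholder, the link extraction is deliberately naive, and robots.txt and politeness handling are omitted :

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=10):
    """Very small crawling loop: fetch, parse links, extend the URL frontier."""
    frontier = deque(seed_urls)           # URL frontier initialised with the seed set
    seen = set(seed_urls)                 # duplicate elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                      # unreachable page - skip it
        pages[url] = html                 # the text would be fed to the indexer here
        for link in re.findall(r'href="(http[^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in seen:      # add new links to the frontier
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# pages = crawl(["https://example.com/"])   # placeholder seed URL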
Robots.txt file : A web server administrator can use a file /robots.txt to notify a crawler
of what is allowed and what is not.
➤ Politeness policies :
Within website www.cs.sfu.ca, many pages contain many links to other pages in the same
site. Downloading many pages simultaneously from one website may overwhelm the
server.
A reasonable web crawler should use only a tiny portion of the bandwidth of a website
server, not fetching more than one page at a time.
Implementation : Logically split the request queue into a single queue per web server; a
server queue is open only if it has not been accessed within the specified politeness
window.
Suppose a web crawler can fetch 100 pages per second and the politeness policy dictates
that it cannot fetch more than 1 page every 30 seconds from a given server; then we need URLs
from at least 3,000 different servers for the crawler to reach its peak throughput.
Detecting updates : If a web page is updated, the page should be crawled again. In HTTP,
request HEAD returns only header information about the page but not the page itself. A
crawler can compare the date it received from the last GET request with the Last-Modified
value from a HEAD request.
An example of a selection policy is the PageRank policy where the importance of a page is
determined by the links to and from that page.
Common selection policies are restricting followed links, path-ascending crawling, focused
crawling and crawling the deep web.
Revisit policy : Web crawlers use revisiting policies to determine the cost associated with
an outdated resource. The goal is to minimize this cost. This is important because resources
in the Web are continually created, updated or deleted; all within the time it takes a web
crawler to finish its crawl through the Web.
Two re-visit policies are uniform policy and proportional policy.
The politeness policy is used so that the performance of a site is not heavily affected whilst
the web crawler downloads a portion of the site. Otherwise the server may be overloaded, as it
has to handle the requests of the viewers of the site as well as those of the web crawler.
5 Recommender System
Recommendation systems are a key part of almost every modern consumer website. The
systems help drive customer interaction and sales by helping customers discover products
and services they might not ever find themselves.
Recommender systems predict the preference of the user for these items, which could be in
form of a rating or response. When more data becomes available for a customer profile, the
recommendations become more accurate.
There are a variety of applications for recommendations including movies (e.g. Netflix),
consumer products (e.g., Amazon or similar on-line retailers), music (e.g. Spotify), or
news, social media, online dating and advertising.
Fig. 5.1.2 shows how Amazon uses recommendation concept.
Information Retrieval Techniques (5 - 3) Recommender System
The recommendation approaches can be classified based on the information sources they
use. Three possible sources of information can be identified as input for the
recommendation process. The available sources are the user data (demographics), the item
data (keywords, genres) and the user-item ratings.
➥ 5.1.1 Challenges
Following are the challenges for building recommender systems :
1. Huge amounts of data, tens of millions of customers and millions of distinct catalog
items.
2. Results are required to be returned in real time.
3. New customers have limited information.
4. Old customers can have a glut of information.
5. Customer data is volatile.
➤ 1. Content Analyzer
Extracts the features (keywords, n-grams) from the source.
Converts unstructured items into a structured representation.
The structured data is stored in the Represented Items repository.
➤ 2. Profile Learner
To build user profile
Updates the profile using the data in Feedback repository
➤ 3. Filtering Component
Matching the user profile with the actual item to be recommended
Uses different strategies
Users have no detailed knowledge of collection makeup and the retrieval environment.
Most users often need to reformulate their queries to obtain the results of their interest.
➥ 5.4.2 Relevance Feedback
Thus, the first query formulation should be treated as an initial attempt to retrieve relevant
information. Documents initially retrieved could be analyzed for relevance and used to
improve the initial query.
Fig. 5.4.2 shows relevance feedback on initial query.
The process of query modification is commonly referred as Relevance feedback, when the
user provides information on relevant documents to a query.
The process of query modification is commonly referred as Query expansion, when
information related to the query is used to expand it.
The user issues a short and simple query. The search engine returns a set of documents.
User marks some docs as relevant, some as non-relevant.
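One common way to perform this reformulation, not detailed in the text here, is the Rocchio method : the query vector is moved towards the centroid of the documents marked relevant and away from the centroid of those marked non-relevant. A minimal sketch, assuming term-weight vectors of equal dimension :

import numpy as np

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """query and documents are term-weight vectors of the same dimension."""
    q = alpha * np.asarray(query, dtype=float)
    if relevant_docs:
        q += beta * np.mean(np.asarray(relevant_docs, dtype=float), axis=0)
    if nonrelevant_docs:
        q -= gamma * np.mean(np.asarray(nonrelevant_docs, dtype=float), axis=0)
    return np.clip(q, 0, None)   # negative term weights are usually dropped

# Hypothetical 4-term vocabulary
original_query = [1, 0, 1, 0]
relevant = [[1, 1, 1, 0], [1, 1, 0, 0]]
nonrelevant = [[0, 0, 0, 1]]
print(rocchio(original_query, relevant, nonrelevant))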
Characteristics of relevance feedback :
1. It shields the user from the details of the query reformulation process.
2. It breaks down the whole searching task into a sequence of small steps which are easier
to grasp.
3. Provide a controlled process designed to emphasize some terms (relevant ones) and de-
emphasize others (non-relevant ones).
Issues with relevance feedback :
1. The user must have sufficient knowledge to form the initial query.
2. This does not work too well in cases like : Misspellings, CLIR and mismatch in user's
and document's vocabulary.
3. Relevant documents have to be similar to each other while similarity between relevant
and non-relevant document should be small.
4. Long queries generated may cause long response time.
5. Users are often reluctant to participate in explicit feedback.
Advantages of relevance feedback
1. Relevance feedback usually improves average precision by increasing the number of
good terms in the query.
2. It breaks down the search operation into a sequence of small search steps, designed to
approach the wanted subject area gradually.
3. It provides a controlled query alteration process designed to emphasize some terms and
to deemphasize others, as required in particular search environments.
Disadvantages of relevance feedback
1. More computational work
2. Easy to decrease precision
Two basic approaches to feedback methods :
1. Explicit feedback : The information for query reformulation is provided directly by the
users. However, collecting feedback information is expensive and time consuming.
The accuracy of recommendation depends on the quantity of ratings provided by the
user.
2. Implicit feedback : The feedback information is derived indirectly from user behaviour,
such as clicks, dwell time or purchases, without asking the user directly.
➥ 5.5.1 Type of CF
There are two types of collaborative filtering algorithms : user based and item based.
➤ 1. User based
User-based collaborative filtering algorithms work off the premise that if a user (A) has
a similar profile to another user (B), then A is more likely to prefer things that B prefers
when compared with a user chosen at random.
The assumption is that users with similar preferences will rate items similarly. Thus
missing ratings for a user can be predicted by first finding a neighborhood of similar
users and then aggregate the ratings of these users to form a prediction.
The neighborhood is defined in terms of similarity between users, either by taking a
given number of most similar users (k nearest neighbors) or all users within a given
similarity threshold. Popular similarity measures for CF are the Pearson correlation
coefficient and the Cosine similarity.
For example, a collaborative filtering recommendation system for television tastes could
make predictions about which television show a user should like given a partial list of
that user's tastes (likes or dislikes).
Note that these predictions are specific to the user, but use information gleaned from
many users. This differs from the simpler approach of giving an average score for each
item of interest, for example based on its number of votes.
User-based CF is a memory-based algorithm which tries to mimic word-of-mouth
recommendation by analyzing rating data from many individuals.
The two main problems of user-based CF are that the whole user database has to be kept
in memory and that expensive similarity computation between the active user and all
other users in the database has to be performed.
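A minimal sketch of user-based CF over a small hypothetical rating matrix (0 means "not rated"); the missing rating is predicted as a similarity-weighted average of the ratings given by the most similar users :

import numpy as np

def cosine(u, v):
    mask = (u > 0) & (v > 0)              # compare only co-rated items
    if not mask.any():
        return 0.0
    return float(np.dot(u[mask], v[mask]) /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

def predict(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users who rated the item."""
    sims = [(cosine(ratings[user], ratings[other]), other)
            for other in range(len(ratings))
            if other != user and ratings[other, item] > 0]
    neighbours = sorted(sims, reverse=True)[:k]
    num = sum(s * ratings[o, item] for s, o in neighbours)
    den = sum(s for s, _ in neighbours)
    return num / den if den else 0.0

# Rows = users, columns = items, 0 = unknown rating
R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 5, 5],
              [5, 4, 4, 0]])
print(round(predict(R, user=0, item=2), 2))   # predicted rating of user 0 for item 2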
➤ 2. Item-based collaborative filtering
Item-based CF is a model-based approach which produces recommendations based on
the relationship between items inferred from the rating matrix. The assumption behind
this approach is that users will prefer items that are similar to other items they like.
The model-building step consists of calculating a similarity matrix containing all item-
to-item similarities using a given similarity measure. Popular are again Pearson
correlation and Cosine similarity. All pair-wise similarities are stored in an n × n similarity
matrix S.
Item-based collaborative filtering has become popularized due to its use by YouTube
and Amazon to provide recommendations to users. This algorithm works by building an
item-to-item matrix which defines the relationship between pairs of items.
When a user indicates a preference for a certain type of item, the matrix is used to
identify other items with similar characteristics that can also be recommended.
Item-based CF is more efficient than user-based CF since the model is relatively small
(n × k) and can be fully pre-computed. Item-based CF is known to produce only slightly
inferior results compared to user-based CF, and higher-order models which take the joint
distribution of sets of items into account are also possible. Furthermore, item-based CF is
successfully applied in large scale recommender systems (e.g., by Amazon.com).
➤ 2. Model-based algorithms :
Input the user database to estimate or learn a model of user ratings, then run new data
through the model to get a predicted output.
A prediction is computed through the expected value of a user rating, given his/her
ratings on other items.
Static structure. In dynamic domains the model could soon become inaccurate.
Model-based collaborative filtering algorithms provide item recommendation by first
developing a model of user ratings. Algorithms in this category take a probabilistic
approach and envision the collaborative filtering process as computing the expected
value of a user prediction, given his/her ratings on other items.
The model building process is performed by different machine learning algorithms such
as Bayesian network, clustering and rule-based approaches. The Bayesian network
model formulates a probabilistic model for collaborative filtering problem.
The clustering model treats collaborative filtering as a classification problem and works
by clustering similar users in same class and estimating the probability that a particular
user is in a particular class C and from there computes the conditional probability of
ratings.
The rule-based approach applies association rule discovery algorithms to find
associations between co-purchased items and then generates item recommendations based
on the strength of the association between items.
Advantages
1. Scalability : Most models resulting from model-based algorithms are much smaller than
the actual dataset, so that even for very large datasets, the model ends up being small
enough to be used efficiently. This imparts scalability to the overall system.
2. Prediction speed : Model-based systems are also likely to be faster, at least in
comparison to memory-based systems because, the time required to query the model is
usually much smaller than that required to query the whole dataset.
3. Avoidance of overfitting : If the dataset over which we build our model is
representative enough of real-world data, it is easier to avoid overfitting with
model-based systems.
Disadvantages
1. Inflexibility : Because building a model is often a time- and resource-consuming
process, it is usually more difficult to add data to model-based systems, making them
inflexible.
2. Quality of predictions : Because we are not using all the information (the whole
dataset) available to us, it is possible that with model-based systems we don't get
predictions as accurate as with memory-based systems. It should be noted, however, that
the quality of predictions depends on the way the model is built. In fact, as can be seen
from the results page, a model-based system performed the best among all the
algorithms we tried.
1. Explain collaborative filtering and content based recommendation system with an example.
AU : May-17, Marks 16
AU : Dec-17, Marks 10
AU : May-17, Marks 8
The matrix S is a diagonal matrix containing the singular values of the matrix X. There are
exactly r singular values, where r is the rank of X.
The rank of a matrix is the number of linearly independent rows or columns in the matrix.
Recall that a set of vectors is linearly independent if no vector in the set can be written as a
linear combination of the other vectors in the set.
➤ Incremental SVD Algorithm (SVD++)
The idea is borrowed from the Latent Semantic Indexing (LSI) world to handle dynamic
databases.
LSI is a conceptual indexing technique which uses the SVD to estimate the underlying
latent semantic structure of the word to document association.
Projection of additional users provides good approximation to the complete model
SVD-based recommender systems have the following limitations :
a. SVD cannot be applied directly to sparse data.
b. It does not have regularization.
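As an illustration only, the following sketch builds a rank-k approximation of a small dense rating matrix with the SVD and reads a predicted rating from it; real systems additionally have to handle sparsity and regularization, which (as noted above) plain SVD does not :

import numpy as np

def low_rank_approximation(R, k):
    """Rank-k approximation of the rating matrix R via the SVD R = U S V^T."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Small dense rating matrix (rows = users, columns = items)
R = np.array([[5.0, 3.0, 4.0, 1.0],
              [4.0, 2.0, 4.0, 1.0],
              [1.0, 1.0, 2.0, 5.0],
              [2.0, 1.0, 3.0, 4.0]])
R_hat = low_rank_approximation(R, k=2)
print(round(R_hat[0, 1], 2))   # reconstructed rating of user 0 for item 1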
Q.1 Explain the type of natural language technology used in information retrieval.
(Refer Two Marks Q.5 of Chapter - 1)
Q.7 Define search engine optimization. (Refer Two Marks Q.5 of Chapter - 4)
Q.8 What are politeness policies used in web crawling ?
(Refer Two Marks Q.10 of Chapter - 4)
Q.11 a) i) Explain in detail about the components of IR. (Refer section 1.4) [6]
ii) Explain various method used for visualization of search engine .
(Refer section 1.8) [7]
OR
ii) Appraise the history of information retrieval. (Refer section 1.1) [8]
Q.15 a) i) What is recommender systems ? What are the challenges of recommender systems ?
(Refer section 5.1) [6]
ii) Explain various collaborative filtering algorithms. (Refer section 5.5.2) [7]
OR
Q.16 a) Consider the following six training examples, where each example has three attributes :
color, shape and size. Color has three possible values : red, green and blue. Shape has two
possible values : square and round. Size has two possible values : big and small.
Example Color Shape Size Class
1 red square big +
2 blue square big +
3 red round small –
4 green square small –
5 red round big +
6 green square big –
Which is the best attribute for the root node of the decision tree ? (Refer example 3.3.1) [15]
OR