TEST 1 Regular
Date: 09.09.2016 Weightage: 20% (60 M)
Duration: 60 min. Type: Closed Book
Instructions: Answer all parts of the question together. Your answers should be brief.
Q1. Boolean retrieval [3+4+3 = 10M]
A. Consider the fragment of a positional index of three terms given below which has the following
format: word: doc#: <posn, posn,...>; doc#: <posn, posn,...>; .
KUMARAMANGALAM: 7: <1,6,8>; 8: <6>; 9: <2,15>; 10: <1>.
VIT: 4: <3>; 7: <14>.
BITS: 7: <2,23,56>; 8: <12,16,21>; 9: <13>; 11: <21,25>.
The /n operator, as in word1 /n word2, finds occurrences of word1 within n words of word2 (on
either side), where n is a positive integer argument. Thus n = 1 demands that word1 be adjacent
to word2.
i. Identify the set of documents that satisfy the query: KUMARAMANGALAM /2 BITS.
ii. Given the query KUMARAMANGALAM /n BITS, identify the set of values for n where
documents {7,9} are returned as the answer.
iii. Identify the set of values for n for which the query BITS /n BITS returns a non-empty set of
documents as the answer.
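The definition of the /n operator above can be sketched in code as follows. This is a minimal illustration, not a prescribed solution: the dictionary layout, the name `proximity`, and the rule that a position is never matched against itself (which matters for a query such as BITS /n BITS) are assumptions, not part of the question.

```python
# Positional index fragment copied from the question: term -> {doc_id: [positions]}.
INDEX = {
    "KUMARAMANGALAM": {7: [1, 6, 8], 8: [6], 9: [2, 15], 10: [1]},
    "VIT": {4: [3], 7: [14]},
    "BITS": {7: [2, 23, 56], 8: [12, 16, 21], 9: [13], 11: [21, 25]},
}


def proximity(index, word1, word2, n):
    """Return the set of doc IDs where word1 occurs within n words of word2.

    Assumption: a position is never matched against itself, so a query such
    as X /n X needs two distinct occurrences of X in the same document.
    """
    postings1, postings2 = index[word1], index[word2]
    hits = set()
    # Only documents that contain both words can match.
    for doc in postings1.keys() & postings2.keys():
        if any(
            0 < abs(p1 - p2) <= n
            for p1 in postings1[doc]
            for p2 in postings2[doc]
        ):
            hits.add(doc)
    return hits
```

Calling `proximity(INDEX, "KUMARAMANGALAM", "BITS", n)` for increasing n shows how the result set grows as the window widens.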
B. Assume that our search engine lets us enter a query, which is a set of words, and returns the set
of documents that contain all the words in the query.
Imagine that we configure the system in four different modes, and for each mode we ask the same query.
Mode 1: We don't remove stopwords, and we stem neither documents nor queries. Let A1 be
the set of returned documents.
Mode 2: We don't remove stopwords, but we stem both documents and queries. Let A2 be the set
of returned documents.
Mode 3: We remove stopwords, but don't stem. Let A3 be the set of returned documents.
Mode 4: We remove stopwords, and then we stem both documents and queries. Let A4 be the set
of returned documents.
Identify the relations among A1, A2, A3, and A4. For example, is A1 = A2? Is A2 a subset of
A4?
C. In a corpus of 3,00,000 documents, we have the following term frequencies for some of the terms:
Term                 Frequency
ShivKera             24,000
RuskinBond           1,000
ChetanBhagat         10,000
VikramSeth           4,000
RabindranathTagore   13,000
KiranDesai           7,000
Propose an evaluation plan for the following query: (ShivKera AND RuskinBond) AND
(ChetanBhagat AND VikramSeth) OR (RabindranathTagore AND KiranDesai) in order to
minimize the list processing time. Justify your answer.
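One common heuristic for such a plan is to intersect postings lists in order of increasing frequency, and to bound the size of each sub-result before deciding what to combine next. The sketch below only illustrates that heuristic. The helper names are hypothetical, the frequencies are treated as postings-list lengths, and the precedence of the trailing OR in the query is left open because the question does not specify it.

```python
# Frequencies copied from the question, treated here as postings-list lengths.
DF = {
    "ShivKera": 24_000,
    "RuskinBond": 1_000,
    "ChetanBhagat": 10_000,
    "VikramSeth": 4_000,
    "RabindranathTagore": 13_000,
    "KiranDesai": 7_000,
}


def and_order(terms, df):
    """Order conjuncts by increasing frequency (shortest list first)."""
    return sorted(terms, key=lambda t: df[t])


def and_bound(terms, df):
    """Upper bound on an AND result: the length of the shortest list."""
    return min(df[t] for t in terms)


def or_bound(sizes):
    """Upper bound on an OR result: the sum of the operand sizes."""
    return sum(sizes)


# Bounds for the three parenthesised groups in the query.
groups = [
    ["ShivKera", "RuskinBond"],
    ["ChetanBhagat", "VikramSeth"],
    ["RabindranathTagore", "KiranDesai"],
]
group_bounds = [and_bound(g, DF) for g in groups]
```

Here `group_bounds` gives a conservative size estimate for each group, which can then be used to choose which intermediate results to combine first.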
B. Give an example scenario that illustrates when the retrieval system is likely to fail to accurately
retrieve the top k documents for a query. [3 M]
C. Describe the effect of adding new documents or changing existing documents within the VSM.
[Hint: What values have to be recomputed?] [3 M]
D. Euclidean distance is a measure that may be used to compute the similarity between two
vectors. Given a query q and documents d1, . . . , dn, we may rank the documents d1, . . . , dn
in the increasing order of Euclidean distance from q. Show that if q and the document vectors
di are all normalized to unit vectors, then the rank ordering produced by Euclidean distance is
identical to that produced by cosine similarity. [5 M]
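The claim can be checked numerically before proving it. The sketch below uses randomly generated vectors (hypothetical data, not from the question) and relies on the identity |q - d|^2 = 2 - 2 cos(q, d), which holds when both vectors have unit length. It is a sanity check under those assumptions, not a substitute for the proof.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    # For unit vectors the cosine similarity is just the dot product.
    return sum(a * b for a, b in zip(u, v))


def rankings(q, docs):
    """Return doc indices ranked by increasing distance and by decreasing cosine."""
    by_distance = sorted(range(len(docs)), key=lambda i: euclidean(q, docs[i]))
    by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(q, docs[i]))
    return by_distance, by_cosine


rng = random.Random(0)  # fixed seed so the check is repeatable
q = normalize([rng.random() for _ in range(5)])
docs = [normalize([rng.random() for _ in range(5)]) for _ in range(8)]
by_distance, by_cosine = rankings(q, docs)
```

For unit vectors the two lists `by_distance` and `by_cosine` should coincide, since the squared distance is a decreasing function of the cosine.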
Q4. Probabilistic IR
A. What are the differences between standard vector space tf-idf weighting and the Binary
Independence Model of probabilistic retrieval (in the case where no document relevance
information is available)? [3 M]
B. Given the following term incidence matrix as shown in Table 1 below, rank order the
documents using the probabilistic retrieval model where relevance estimates are not given for
a query containing terms {T2, T5, T6}. [11 M]
     T1  T2  T3  T4  T5  T6
D1   1   0   0   1   1   0
D2   0   1   0   1   1   0
D3   1   0   1   0   1   1
D4   1   0   1   0   1   1
Table 1
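One way such a ranking could be computed is sketched below. It assumes that, with no relevance information, every query term gets p_t = 0.5, so the term weight reduces to an idf-like quantity. It also assumes 0.5 smoothing, log((N - df + 0.5) / (df + 0.5)), chosen here so that a term occurring in every document (T5) does not produce log 0. Both choices are assumptions; the expected answer may use a different estimate.

```python
import math

TERMS = ["T1", "T2", "T3", "T4", "T5", "T6"]

# Term incidence matrix copied from Table 1.
INCIDENCE = {
    "D1": [1, 0, 0, 1, 1, 0],
    "D2": [0, 1, 0, 1, 1, 0],
    "D3": [1, 0, 1, 0, 1, 1],
    "D4": [1, 0, 1, 0, 1, 1],
}


def term_weight(df, n_docs):
    """Smoothed weight for a query term when p_t = 0.5 is assumed."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def rank(query, incidence, terms):
    """Score each document by summing weights of the query terms it contains."""
    n_docs = len(incidence)
    col = {t: i for i, t in enumerate(terms)}
    df = {t: sum(row[col[t]] for row in incidence.values()) for t in query}
    scores = {
        doc: sum(term_weight(df[t], n_docs) for t in query if row[col[t]])
        for doc, row in incidence.items()
    }
    # Highest score first; ties keep the table order because sorted() is stable.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank(["T2", "T5", "T6"], INCIDENCE, TERMS)
```

Under these assumptions the score of a document depends only on which of T2, T5, and T6 it contains, so documents with identical rows on those columns tie.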