Query-term proximity refers to the distance between two occurrences of query terms within a given document. This information is used to reduce the number of documents considered during scoring.

Pseudo-relevance feedback is a technique that considers the k first results of a retrieval to be relevant documents for feedback and query expansion.

ISCL winter semester 2007 IR Final exam

11 February 2008 Hand-written / printed documents and calculators are authorized. Name : Semester :

Exercise 1 : Definitions
Define the following terms :
query-term proximity : distance between two occurrences of query terms within a given document. This information is used to reduce the number of documents considered during scoring.
pseudo-relevance feedback : technique that treats the top k results of a retrieval as relevant documents for feedback and query expansion.
shingle : sequence of k consecutive terms within a document; the k-shingles of a document are the set of all such sequences.
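As an illustration of the shingle definition, here is a minimal Python sketch. The function name `shingles`, the whitespace tokenization and the example sentence are assumptions made for this example, not part of the exam.

```python
def shingles(text: str, k: int) -> set:
    """Return the set of all sequences of k consecutive terms in text."""
    terms = text.lower().split()  # naive whitespace tokenization (assumption)
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

# Hypothetical example document: 8 terms, but only 3 distinct 4-shingles.
example = shingles("a rose is a rose is a rose", 4)
```

Because the result is a set, repeated sequences are counted once, which is what makes shingle sets convenient for near-duplicate detection.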

Exercise 2 : Vector Space Model

Consider a collection made of the 4 following documents (one document per line) :
d1. John gives a book to Mary
d2. John who reads a book loves Mary
d3. who does John think Mary love ?
d4. John thinks a book is a good gift

These documents are pre-processed using a stop-list and a stemmer. The resulting index is built to support vector-based queries. Give a (graphical or textual) representation of this index.

Index :

term t   N/df_t   postings (d : tf_t,d)
book     4/3      d1 :1   d2 :1   d4 :1
gift     4/1      d4 :1
give     4/1      d1 :1
good     4/1      d4 :1
John     4/4      d1 :1   d2 :1   d3 :1   d4 :1
love     4/2      d2 :1   d3 :1
Mary     4/3      d1 :1   d2 :1   d3 :1
read     4/1      d2 :1
think    4/2      d3 :1   d4 :1
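The construction of this index can be sketched in Python as follows. The stop-list `STOP_WORDS` and the lookup table `STEMS` are hypothetical stand-ins chosen so that the four documents reduce to the nine terms of the index; a real system would use a full stop-list and a proper stemmer.

```python
from collections import defaultdict

docs = {
    "d1": "John gives a book to Mary",
    "d2": "John who reads a book loves Mary",
    "d3": "who does John think Mary love ?",
    "d4": "John thinks a book is a good gift",
}
# Hypothetical stop-list and toy stemmer (assumptions for this sketch).
STOP_WORDS = {"a", "to", "who", "does", "is", "?"}
STEMS = {"gives": "give", "reads": "read", "loves": "love", "thinks": "think"}

index = defaultdict(dict)  # term -> {document id: term frequency}
for doc_id, text in docs.items():
    for token in text.split():
        if token in STOP_WORDS:
            continue
        term = STEMS.get(token, token)
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

# Document frequency of each term, i.e. the length of its postings list.
df = {term: len(postings) for term, postings in index.items()}
```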

We now focus on 3 terms belonging to the dictionary, namely book, love and Mary. Compute the tf-idf-based vector representation for the 4 documents in the collection (these vectors are normalized using the Euclidean normalization).

The components are ordered (book, love, Mary) :

\[
v(d_1) = \begin{pmatrix} 1 \cdot \log(4/3) / D_1 \\ 0 \\ 1 \cdot \log(4/3) / D_1 \end{pmatrix}
\qquad
v(d_2) = \begin{pmatrix} 1 \cdot \log(4/3) / D_2 \\ 1 \cdot \log(4/2) / D_2 \\ 1 \cdot \log(4/3) / D_2 \end{pmatrix}
\]

\[
v(d_3) = \begin{pmatrix} 0 \\ 1 \cdot \log(4/2) / D_3 \\ 1 \cdot \log(4/3) / D_3 \end{pmatrix}
\qquad
v(d_4) = \begin{pmatrix} 1 \cdot \log(4/3) / D_4 \\ 0 \\ 0 \end{pmatrix}
\]

where

\[ D_1 = \sqrt{(\log(4/3))^2 + 0 + (\log(4/3))^2} \]
\[ D_2 = \sqrt{(\log(4/3))^2 + (\log(4/2))^2 + (\log(4/3))^2} \]
\[ D_3 = \sqrt{0 + (\log(4/2))^2 + (\log(4/3))^2} \]
\[ D_4 = \sqrt{(\log(4/3))^2 + 0 + 0} \]

Consider the query "love Mary". Give the results of a ranked retrieval for this query. What document is considered to be the most relevant ?

With the components ordered (book, love, Mary) :

\[ v(q) = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \]

most relevant document : d3

\[ s(q, d_3) = \frac{\log(4/2)}{D_3} + \frac{\log(4/3)}{D_3} \]

(NB : D_3 < D_2)
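The scores of all four documents can be checked numerically with a short Python sketch. The base-10 logarithm is an assumption (the exam does not state the base); since each vector is divided by its own norm, the normalized weights and the ranking do not depend on that choice. The names `tfidf_vector`, `scores` and `ranking` are introduced here for illustration.

```python
import math

N = 4                                   # number of documents
terms = ["book", "love", "Mary"]
df = {"book": 3, "love": 2, "Mary": 3}  # document frequencies from the index
tf = {                                  # term frequencies per document
    "d1": {"book": 1, "love": 0, "Mary": 1},
    "d2": {"book": 1, "love": 1, "Mary": 1},
    "d3": {"book": 0, "love": 1, "Mary": 1},
    "d4": {"book": 1, "love": 0, "Mary": 0},
}

def tfidf_vector(doc):
    """Euclidean-normalized tf-idf vector over (book, love, Mary)."""
    raw = [tf[doc][t] * math.log10(N / df[t]) for t in terms]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

query = [0, 1, 1]  # "love Mary", not normalized, as in the solution
scores = {d: sum(q * w for q, w in zip(query, tfidf_vector(d))) for d in tf}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the ranking is d3, d2, d1, d4, which agrees with d3 being the most relevant document.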

Exercise 3 : Architecture of an IR system

Give a description of the architecture of an IR system for text retrieval (no hyperlinks), supposing wildcard and phrase queries are allowed and the documents are ranked (a figure can be used).

Which components must be added to your system to deal with web pages (give a brief definition of each component) ? Where are they located within the system ? (a figure can be used).

Exercise 4 : Evaluation of unranked retrieval

Consider the query "Tübingen Tiger", with the following retrieval results :

Retrieved :
1. Tübinger Kindertheater Teo Tiger
2. Home: WALTER Tigers Tübingen
3. Kindertheater Teo Tiger - Universitätsstadt
4. Alumni Tübingen

Not retrieved :
5. Tübinger Tiger - Basketball
6. Willkommen bei Theater Teo Tiger
7. Theater Teo Tiger / Städte & Kultur
8. Maskottchen - Wikipedia

Give the precision and recall of the retrieval (several interpretations of the results are possible, put yourself in the user's shoes and choose one).

\[ Recall = \frac{\#relevant\ retrieved}{\#relevant} \qquad Precision = \frac{\#relevant\ retrieved}{\#retrieved} \]

Interpretation 1 : query about the Tübingen basketball team
Recall = 1/2, Precision = 1/4

Interpretation 2 : query about the Teo Tiger theater
Recall = 2/4, Precision = 2/4

What is the F-measure if we pay twice as much attention to precision as to recall ?

\[ F = \frac{1}{\alpha \frac{1}{Pr} + (1 - \alpha) \frac{1}{Re}} \qquad \text{with } \alpha = 2/3 \text{ here} \]

\[ F = \frac{1}{\frac{2}{3} \cdot \frac{4}{1} + \left(1 - \frac{2}{3}\right) \cdot \frac{2}{1}} = \frac{3}{10} \quad \text{for interpretation 1} \]

\[ F = \frac{1}{\frac{2}{3} \cdot \frac{4}{2} + \left(1 - \frac{2}{3}\right) \cdot \frac{4}{2}} = \frac{1}{2} \quad \text{for interpretation 2} \]
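These values can be re-derived with exact fractions in a few lines of Python. The sets of relevant documents for each interpretation are an assumption read off the result titles (results 2 and 5 for the basketball team; results 1, 3, 6 and 7 for the theater); the function names are introduced for this sketch only.

```python
from fractions import Fraction

def precision_recall(retrieved, relevant):
    """Return (precision, recall) as exact fractions."""
    hits = len(set(retrieved) & set(relevant))
    return Fraction(hits, len(retrieved)), Fraction(hits, len(relevant))

def f_measure(precision, recall, alpha):
    """Weighted harmonic mean of precision and recall."""
    return 1 / (alpha / precision + (1 - alpha) / recall)

retrieved = [1, 2, 3, 4]
basketball = [2, 5]       # interpretation 1 (assumed relevant set)
theater = [1, 3, 6, 7]    # interpretation 2 (assumed relevant set)
alpha = Fraction(2, 3)    # precision weighted twice as much as recall

p1, r1 = precision_recall(retrieved, basketball)
p2, r2 = precision_recall(retrieved, theater)
f1 = f_measure(p1, r1, alpha)
f2 = f_measure(p2, r2, alpha)
```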

Exercise 5 : Evaluation of ranked retrieval

In this exercise, we consider the retrieval for the query introduced in the previous exercise (exercise 4). Given that the user meant theater, what is the R-precision of the retrieval when looking at the 4 results ?

\[ Pr = Re = \frac{r}{rel} \]

with rel = 4 (4 relevant documents in the collection) and r = 2 (number of retrieved relevant documents)

Pr = 1/2

Considering that this query only contains one information need, namely a theater named tiger in Tübingen, what is the Mean Average Precision for the query ?

\[ MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} Precision(R_{jk}) \]

|Q| = 1, and we consider S_j = {d1, d2, d3} (the 2 retrieved relevant documents are within the first 3 results), thus m_j = 3. The precision after the first document is 1 and the precision after the third document is 2/3, thus :

\[ MAP(Q) = \frac{1}{3} \sum_{k=1}^{3} Precision(R_k) = \frac{1}{3} \left(1 + \frac{2}{3}\right) = \frac{5}{9} \]

The relevance of the retrieval has been evaluated by 2 judges the following way (+ means relevant, - means non-relevant) :
Judge a: 1.+ 2.+ 3.+ 4.-
Judge b: 1.+ 2.- 3.+ 4.+

Is there a good agreement between judges (give a measure of this agreement) ? The agreement can be measured using the kappa statistic :

\[ kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

where :
P(A) is the proportion of agreements within the judgements
P(E) is the proportion of expected agreements

Here, we have :

                    Judge b
                    Yes   No   Total
Judge a   Yes        2     1     3
          No         1     0     1
          Total      3     1     4

P(A) = 2/4
P(rel) = (3 + 3)/8
P(not-rel) = (1 + 1)/8
P(E) = P(rel)^2 + P(not-rel)^2 = 5/8

\[ kappa = \frac{4/8 - 5/8}{1 - 5/8} = -\frac{1}{3} \]

This indicates a bad agreement.
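The kappa computation can be replayed with exact fractions. This is a minimal sketch using pooled marginals, as in the solution; the function name `kappa` and the list encoding of the judgements are choices made for this example.

```python
from fractions import Fraction

judge_a = ["+", "+", "+", "-"]
judge_b = ["+", "-", "+", "+"]

def kappa(a, b):
    """Kappa statistic for two judges, with pooled marginals."""
    n = len(a)
    p_agree = Fraction(sum(x == y for x, y in zip(a, b)), n)
    # Pooled probability of a "relevant" judgement over both judges.
    p_rel = Fraction(a.count("+") + b.count("+"), 2 * n)
    p_expected = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_expected) / (1 - p_expected)

k = kappa(judge_a, judge_b)
```

A negative value means the judges agree less often than would be expected by chance.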

Exercise 6 : Relevance feedback

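In this exercise, we consider the collection from exercise 2 and the same query "love Mary". Say document 2 has been annotated as relevant, and document 4 as non-relevant. Give the Rocchio-modified query (using the standard balancing weights).

\[ q_m = \alpha q_0 + \beta \frac{1}{|D_r|} \sum_{d_j \in D_r} d_j - \gamma \frac{1}{|D_{nr}|} \sum_{d_j \in D_{nr}} d_j \]

where α = 1, β = 0.75 and γ = 0.15, D_r = {d_2} and D_nr = {d_4}

With the components ordered (book, love, Mary) :

\[
q_m = 1 \cdot \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}
+ 0.75 \begin{pmatrix} \log(4/3) \\ \log(4/2) \\ \log(4/3) \end{pmatrix}
- 0.15 \begin{pmatrix} \log(4/3) \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 0.6 \log(4/3) \\ 1 + 0.75 \log(4/2) \\ 1 + 0.75 \log(4/3) \end{pmatrix}
\]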
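The same update can be written as a small Python function. This sketch assumes base-10 logarithms and, like the solution, uses the unnormalized tf-idf vectors of d2 and d4; the function name `rocchio` and the requirement that both document sets are non-empty are assumptions of the example.

```python
import math

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update; both document lists are assumed to be non-empty."""
    dims = range(len(q0))

    def centroid(docs):
        return [sum(d[i] for d in docs) / len(docs) for i in dims]

    cr, cnr = centroid(relevant), centroid(nonrelevant)
    return [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in dims]

# idf weights over (book, love, Mary); base-10 logarithm is an assumption.
idf_book, idf_love, idf_mary = math.log10(4 / 3), math.log10(4 / 2), math.log10(4 / 3)
q0 = [0, 1, 1]
d2 = [idf_book, idf_love, idf_mary]
d4 = [idf_book, 0, 0]
qm = rocchio(q0, [d2], [d4])
```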

Exercise 7 : Web index evaluation

We want to evaluate the respective index size of 2 web search engines A and B (i.e. the number of pages these systems index). After an experiment using an as fair as possible sample of web pages, we make the following observation :
25 % of the pages in A's index are also in B's index.
40 % of the pages in B's index are also in A's index.

What is the number of web pages indexed by A relative to B ? Give your computation.

\[ X = |A \cap B| = 0.25 \, |A| = 0.40 \, |B| \]

\[ \frac{|B|}{|A|} = \frac{0.25}{0.40} = \frac{5}{8} \]

That is, A's index is 8/5 = 1.6 times the size of B's index.

Exercise 8 : Web crawling

Consider the following web graph :
Page A points to pages B, C and D.
Page B points to pages C and D.
Page C points to pages A and E.
Page D points to page F.
Page E points to page G.
Page F points to pages G and H.

Using page A for starting a web crawl, give the order of the indexing (the crawler uses a frontier with duplicate detection, all pages have equal priority). For information, the graph looks like : [figure : web graph over pages A to H]

The crawling is done in the following order : A-B-C-D-E-F-G-H (URL frontier as a FIFO).
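A FIFO frontier with duplicate detection amounts to a breadth-first traversal, which can be sketched in Python as below. The dictionary `links` encodes the graph of the exercise; pages G and H are given empty link lists because the exercise lists no outgoing links for them.

```python
from collections import deque

links = {
    "A": ["B", "C", "D"],
    "B": ["C", "D"],
    "C": ["A", "E"],
    "D": ["F"],
    "E": ["G"],
    "F": ["G", "H"],
    "G": [],  # no outgoing links listed in the exercise
    "H": [],
}

def crawl(start):
    """Visit pages in FIFO order, never enqueueing a page twice."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in links[page]:
            if target not in seen:  # duplicate detection
                seen.add(target)
                frontier.append(target)
    return order

order = crawl("A")
```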

Exercise 9 : Link analysis

Consider the following web graph :
Page A points to pages C and D.
Page B points to pages A and C.
Page C points to page B.
Page D points to page E.
Page E points to page A.

Compute the adjacency matrix underlying this web graph. For information, the graph looks like : [figure : web graph over pages A to E]

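The adjacency matrix is the following (rows are source pages, columns are target pages) :

\[
\begin{array}{c|ccccc}
   & A & B & C & D & E \\ \hline
A: & 0 & 0 & 1 & 1 & 0 \\
B: & 1 & 0 & 1 & 0 & 0 \\
C: & 0 & 1 & 0 & 0 & 0 \\
D: & 0 & 0 & 0 & 0 & 1 \\
E: & 1 & 0 & 0 & 0 & 0
\end{array}
\]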

Compute the probability matrix underlying this web graph (teleport has a probability = 0.5). The probability matrix is the following :

\[
P =
\begin{array}{c|ccccc}
   & A & B & C & D & E \\ \hline
A: & 0.1 & 0.1 & 0.35 & 0.35 & 0.1 \\
B: & 0.35 & 0.1 & 0.35 & 0.1 & 0.1 \\
C: & 0.1 & 0.6 & 0.1 & 0.1 & 0.1 \\
D: & 0.1 & 0.1 & 0.1 & 0.1 & 0.6 \\
E: & 0.6 & 0.1 & 0.1 & 0.1 & 0.1
\end{array}
\]
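Each entry of P combines a uniform teleport share with the remaining probability spread over the outgoing links. A minimal Python sketch of that construction follows; the function name `transition_matrix` is introduced here, and the sketch assumes every page has at least one outgoing link, which holds for this graph.

```python
pages = ["A", "B", "C", "D", "E"]
links = {"A": ["C", "D"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["A"]}
teleport = 0.5

def transition_matrix(pages, links, teleport):
    """Row-stochastic matrix: row = source page, column = target page."""
    n = len(pages)
    matrix = []
    for source in pages:
        out = links[source]  # assumed non-empty for every page
        row = [
            teleport / n + ((1 - teleport) / len(out) if target in out else 0.0)
            for target in pages
        ]
        matrix.append(row)
    return matrix

P = transition_matrix(pages, links, teleport)
```

Every row of the resulting matrix sums to 1, since a surfer on any page must move somewhere.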

Say a web surf starts at page C. Propose a probability vector for this page. We can represent the page C via the following probability vector (components ordered A, B, C, D, E) :

\[ v = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} \]

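That is, a 100 % chance of being on page C. Compute an approximation of the PageRank score of the pages of the graph, using 3 iterations. We first compute v1 = P.v :

\[
v_1 =
\begin{pmatrix}
0.1 & 0.1 & 0.35 & 0.35 & 0.1 \\
0.35 & 0.1 & 0.35 & 0.1 & 0.1 \\
0.1 & 0.6 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0.6 \\
0.6 & 0.1 & 0.1 & 0.1 & 0.1
\end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 0.35 \\ 0.35 \\ 0.1 \\ 0.1 \\ 0.1 \end{pmatrix}
\]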

Then, we compute v2 = P.v1 :

\[
v_2 =
\begin{pmatrix}
0.1 & 0.1 & 0.35 & 0.35 & 0.1 \\
0.35 & 0.1 & 0.35 & 0.1 & 0.1 \\
0.1 & 0.6 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0.6 \\
0.6 & 0.1 & 0.1 & 0.1 & 0.1
\end{pmatrix}
\begin{pmatrix} 0.35 \\ 0.35 \\ 0.1 \\ 0.1 \\ 0.1 \end{pmatrix}
=
\begin{pmatrix} 0.15 \\ 0.2125 \\ 0.275 \\ 0.15 \\ 0.275 \end{pmatrix}
\]

Finally, we compute v3 = P.v2 :

\[
v_3 = P \begin{pmatrix} 0.15 \\ 0.2125 \\ 0.275 \\ 0.15 \\ 0.275 \end{pmatrix}
= \begin{pmatrix} 0.2125 \\ 0.2125 \\ 0.2125 \\ 0.24375 \\ 0.18125 \end{pmatrix}
\]

As a result, we obtain the following approximations of the PageRank scores :
s(A) = 0.2125
s(B) = 0.2125
s(C) = 0.2125
s(D) = 0.24375
s(E) = 0.18125

Exercise 10 : Hubs and authorities (extra credit)

Propose an algorithm to compute the hub and authority scores of a base set. You can modularize your code by defining inter-dependent functions computing respectively the hub and authority scores.

computeHITS(Set B)
    init(hub, B)                    // for each b in B hub[b] <- 1
    init(aut, B)                    // for each b in B aut[b] <- 1
    i <- 0
    while (i < 5)                   // 5 iterations (cf lecture)
        computeH(hub, aut)
        computeA(hub, aut)
        i <- i + 1
    endwhile
    return (hub, aut)

computeH(Set H, Set A)
    for i in H
        for j in A
            if (link(i,j))
                H[i] <- H[i] + A[j]
            endif
        endfor
    endfor

computeA(Set H, Set A)
    for i in A
        for j in H
            if (link(j,i))
                A[i] <- A[i] + H[j]
            endif
        endfor
    endfor
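For comparison, here is a minimal runnable sketch in Python. It is not a transcription of the pseudocode: it uses the usual HITS update, in which each score is recomputed from scratch at every iteration and then normalized, rather than added to its previous value. The function name `hits`, the three-page graph `example_links` and the Euclidean normalization are assumptions of this example.

```python
import math

def hits(pages, links, iterations=5):
    """Return (hub, aut) score dictionaries for the pages of a base set."""
    hub = {p: 1.0 for p in pages}
    aut = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of the pages linking to p.
        aut = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub of p: sum of the authority scores of the pages p links to.
        hub = {p: sum(aut[q] for q in links.get(p, ()) if q in aut) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (hub, aut):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, aut

# Hypothetical base set: A links to B and C, B links to C.
base_set = ["A", "B", "C"]
example_links = {"A": ["B", "C"], "B": ["C"], "C": []}
hub, aut = hits(base_set, example_links)
```

On this toy graph, C ends up as the strongest authority and A as the strongest hub.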

