Final Sol
11 February 2008
Hand-written / printed documents and calculators are authorized.
Name :                    Semester :
Exercise 1 : Definitions
Define the following terms :
query-term proximity : distance between two occurrences of query terms within a given document. This information is used to reduce the number of documents considered during scoring.
pseudo-relevance feedback : technique consisting in considering the k first results of a retrieval to be relevant documents for feedback and query expansion.
shingle : set of all consecutive sequences of k terms within a document.
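As an illustration of the shingle definition, here is a minimal Python sketch (the function name and the whitespace tokenization are illustrative assumptions, not part of the exam) :

    def k_shingles(text, k):
        # Return the set of all consecutive sequences of k terms in a document
        terms = text.lower().split()
        return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

    # e.g. the 2-shingles of "John gave Mary a book" are
    # ('john', 'gave'), ('gave', 'mary'), ('mary', 'a'), ('a', 'book')
    print(k_shingles("John gave Mary a book", 2))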
These documents are pre-processed using a stop-list and a stemmer. The resulting index is built to support vector-based queries. Give a (graphical or textual) representation of this index.

Index :

term t    N/df_t    postings (d : tf_t,d)
book      4/3       d1 : 1, d2 : 1, d4 : 1
gift      4/1       d4 : 1
give      4/1       d1 : 1
good      4/1       d4 : 1
John      4/4       d1 : 1, d2 : 1, d3 : 1, d4 : 1
love      4/2       d2 : 1, d3 : 1
Mary      4/3       d1 : 1, d2 : 1, d3 : 1
read      4/1       d2 : 1
think     4/2       d3 : 1, d4 : 1
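A compact programmatic representation of this index is a dictionary mapping each term to its postings (document : term frequency). The sketch below is illustrative : the per-document term lists are simply the postings of the table read back document by document, and the data structure is an assumption, not the exam's required answer.

    from collections import defaultdict

    # Pre-processed documents (after stop-word removal and stemming), as term lists
    docs = {
        "d1": ["John", "give", "Mary", "book"],
        "d2": ["Mary", "read", "book", "John", "love"],
        "d3": ["John", "think", "love", "Mary"],
        "d4": ["John", "think", "book", "good", "gift"],
    }

    # Inverted index : term -> {doc_id : tf}
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] += 1

    N = len(docs)
    for term in sorted(index, key=str.lower):
        postings = index[term]
        line = ", ".join(f"{d} : {tf}" for d, tf in sorted(postings.items()))
        print(f"{term:6s} N/df = {N}/{len(postings)}   {line}")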
We now focus on 3 terms belonging to the dictionary, namely book, love and Mary. Compute the tf-idf based vector representation for the 4 documents in the collection (these vectors are normalized using the Euclidean normalization). With the components ordered (book, love, Mary) and weights tf_t,d * log(N/df_t) :

v(d1) = (1/D1) * (1*log(4/3), 0,          1*log(4/3)),   D1 = sqrt((log(4/3))^2 + 0 + (log(4/3))^2)
v(d2) = (1/D2) * (1*log(4/3), 1*log(4/2), 1*log(4/3)),   D2 = sqrt((log(4/3))^2 + (log(4/2))^2 + (log(4/3))^2)
v(d3) = (1/D3) * (0,          1*log(4/2), 1*log(4/3)),   D3 = sqrt(0 + (log(4/2))^2 + (log(4/3))^2)
v(d4) = (1/D4) * (1*log(4/3), 0,          0),            D4 = sqrt((log(4/3))^2 + 0 + 0)
Consider the query "love Mary". Give the results of a ranked retrieval for this query. What document is considered to be the most relevant ?

v(q) = (book : 0, love : 1, Mary : 1)

s(q, d1) = log(4/3) / D1
s(q, d2) = (log(4/2) + log(4/3)) / D2
s(q, d3) = (log(4/2) + log(4/3)) / D3
s(q, d4) = 0

Since D3 < D2, we have s(q, d3) > s(q, d2), and the ranking is d3, d2, d1, d4 : d3 is considered to be the most relevant document.
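A quick numerical check of this ranking, assuming base-10 logarithms (the helper names are illustrative) :

    import math

    N = 4
    df = {"book": 3, "love": 2, "Mary": 3}
    tf = {                                     # tf of (book, love, Mary) per document
        "d1": {"book": 1, "love": 0, "Mary": 1},
        "d2": {"book": 1, "love": 1, "Mary": 1},
        "d3": {"book": 0, "love": 1, "Mary": 1},
        "d4": {"book": 1, "love": 0, "Mary": 0},
    }

    def tfidf_vector(doc):
        # Euclidean-normalized tf-idf vector over (book, love, Mary)
        w = [tf[doc][t] * math.log10(N / df[t]) for t in ("book", "love", "Mary")]
        norm = math.sqrt(sum(x * x for x in w))
        return [x / norm for x in w]

    q = [0, 1, 1]                              # query "love Mary"
    scores = {d: sum(qi * vi for qi, vi in zip(q, tfidf_vector(d))) for d in tf}
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    # [('d3', 1.31...), ('d2', 1.22...), ('d1', 0.71...), ('d4', 0.0)] : d3 ranks first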
Which components must be added to your system to deal with web pages (give a brief definition of each component) ? Where are they located within the system ? (a figure can be used).
Give the precision and recall of the retrieval (several interpretations of the results are possible, put yourself in the user's shoes and choose one).

Interpretation 1 : query about the Tübingen Basketball Team
Recall = #(relevant retrieved) / #relevant = 1/2
Precision = #(relevant retrieved) / #retrieved = 1/4

Interpretation 2 : query about the Teo Tiger Theater
Recall = #(relevant retrieved) / #relevant = 2/4
Precision = #(relevant retrieved) / #retrieved = 2/4

What is the F-measure if we pay twice more attention to precision than recall ?

F = 1 / (alpha * (1/Precision) + (1 - alpha) * (1/Recall)),   with alpha = 2/3

Interpretation 1 : F = 1 / ((2/3) * 4 + (1/3) * 2) = 1 / (10/3) = 3/10
Interpretation 2 : F = 1 / ((2/3) * 2 + (1/3) * 2) = 1/2
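The same weighted F-measure as a small Python helper (a sketch, with alpha = 2/3 as in the solution above) :

    def f_measure(precision, recall, alpha=2/3):
        # Weighted harmonic mean : F = 1 / (alpha/P + (1 - alpha)/R)
        return 1.0 / (alpha / precision + (1 - alpha) / recall)

    print(f_measure(1/4, 1/2))   # interpretation 1 : 3/10 = 0.3
    print(f_measure(2/4, 2/4))   # interpretation 2 : 1/2 = 0.5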
MAP(Q) = (1/|Q|) * sum over j of [ (1/m_j) * sum_{k=1}^{m_j} Precision(R_jk) ]

|Q| = 1, and we consider S_j = {d1, d2, d3} (the 2 retrieved relevant documents are within the first 3 results), thus m_j = 3 ; the precision after the first document is 1 and the precision after the third document is 2/3, thus :

MAP(Q) = (1/3) * sum_k Precision(R_k) = (1/3) * (1 + 2/3) = 5/9
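A sketch of the same computation, assuming the first three results are d1, d2, d3 in that order with the relevant documents at ranks 1 and 3 (names are illustrative) :

    def average_precision(ranking, relevant, m):
        # Sum of Precision(R_k) at the ranks of the retrieved relevant documents, divided by m
        hits, total = 0, 0.0
        for k, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / k            # precision after the k-th result
        return total / m

    print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}, m=3))   # (1 + 2/3) / 3 = 5/9 ≈ 0.556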
The relevance of the retrieval has been evaluated by 2 judges in the following way (+ means relevant, - means non-relevant) :
Judge a : 1.+ 2.+ 3.+ 4.-
Judge b : 1.+ 2.- 3.+ 4.+
Is there a good agreement between judges (give a measure of this agreement) ? The agreement can be measured using the kappa statistic :

kappa = (P(A) - P(E)) / (1 - P(E))

where :
P(A) is the proportion of agreements within the judgements
P(E) is the proportion of expected agreements

Here, we have :

                     Judge b
                  Yes    No    Total
Judge a   Yes      2      1      3
          No       1      0      1
          Total    3      1      4

P(A) = 2/4
P(rel) = (3 + 3)/8        P(not-rel) = (1 + 1)/8
P(E) = P(rel)^2 + P(not-rel)^2 = 5/8
kappa = (2/4 - 5/8) / (1 - 5/8) = -1/3

The kappa value is negative (agreement below what would be expected by chance), so the agreement between the judges is not good.
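A small check of this kappa computation for the two-category case (the helper name is an illustrative choice) :

    def kappa(judge_a, judge_b):
        # Kappa statistic for two judges giving binary judgements (True = relevant)
        n = len(judge_a)
        p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
        p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)      # pooled marginal P(rel)
        p_expected = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_agree - p_expected) / (1 - p_expected)

    a = [True, True, True, False]    # judge a : + + + -
    b = [True, False, True, True]    # judge b : + - + +
    print(kappa(a, b))               # -1/3 ≈ -0.333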
The modified query q_m is obtained with the Rocchio formula :

q_m = alpha * q_0  +  beta * (1/|Dr|) * sum_{dj in Dr} dj  -  gamma * (1/|Dnr|) * sum_{dj in Dnr} dj

with alpha = 1, beta = 0.75, gamma = 0.15, Dr = {d2} and Dnr = {d4}. Using the (unnormalized) tf-idf vectors over (book, love, Mary) :

q_m = 1 * (0, 1, 1) + 0.75 * (log(4/3), log(4/2), log(4/3)) - 0.15 * (log(4/3), 0, 0)
    = (0.6 * log(4/3), 1 + 0.75 * log(4/2), 1 + 0.75 * log(4/3))
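A sketch of this Rocchio update in Python (base-10 logarithm is an assumption, the solution keeps log symbolic ; vector components are ordered (book, love, Mary)) :

    import math

    log = math.log10

    q0 = [0.0, 1.0, 1.0]                           # query "love Mary"
    d2 = [log(4/3), log(4/2), log(4/3)]            # relevant document (Dr)
    d4 = [log(4/3), 0.0, 0.0]                      # non-relevant document (Dnr)

    def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        # q_m = alpha*q + beta*centroid(Dr) - gamma*centroid(Dnr)
        def centroid(vectors):
            return [sum(col) / len(vectors) for col in zip(*vectors)]
        cr, cnr = centroid(relevant), centroid(non_relevant)
        return [alpha * qi + beta * ri - gamma * ni for qi, ri, ni in zip(q, cr, cnr)]

    print(rocchio(q0, [d2], [d4]))
    # [0.6*log(4/3), 1 + 0.75*log(4/2), 1 + 0.75*log(4/3)] ≈ [0.075, 1.226, 1.094]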
What is the number of web pages indexed by A relative to B ? Give your computation.
Using page A for starting a web crawl, give the order of the indexing (the crawler uses a frontier with duplicate detection, all pages have equal priority). For information, the graph covers the pages A, B, C, D, E, F, G and H.
[figure : web graph over pages A to H]
The crawling is done in the following order : A-B-C-D-E-F-G-H (URL frontier as a FIFO).
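A minimal sketch of such a crawl (FIFO URL frontier with duplicate detection, i.e. a breadth-first traversal). The link structure below is hypothetical, since the figure is not reproduced here ; it is only chosen to be consistent with the A-B-C-D-E-F-G-H order of the solution :

    from collections import deque

    links = {                                   # hypothetical out-links per page
        "A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": ["G"],
        "E": ["H"], "F": [], "G": [], "H": [],
    }

    def crawl(start):
        # FIFO frontier + seen-set duplicate detection = breadth-first visit order
        frontier = deque([start])
        seen, order = {start}, []
        while frontier:
            page = frontier.popleft()
            order.append(page)
            for target in links[page]:
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
        return order

    print("-".join(crawl("A")))                 # A-B-C-D-E-F-G-H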
Compute the adjacency matrix underlying this web graph. For information, the graph covers the pages A, B, C, D and E.
[figure : web graph over pages A to E]

A =
         A  B  C  D  E
    A :  0  0  1  1  0
    B :  1  0  1  0  0
    C :  0  1  0  0  0
    D :  0  0  0  0  1
    E :  1  0  0  0  0
Compute the probability matrix underlying this web graph (teleport has a probability of 0.5). The probability matrix is the following :

P =
          A      B      C      D      E
    A :  0.1    0.1    0.35   0.35   0.1
    B :  0.35   0.1    0.35   0.1    0.1
    C :  0.1    0.6    0.1    0.1    0.1
    D :  0.1    0.1    0.1    0.1    0.6
    E :  0.6    0.1    0.1    0.1    0.1
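A sketch showing how P can be derived from the adjacency matrix : each row mixes the uniform teleport distribution (probability 0.5 of jumping to any of the N pages) with the uniform distribution over the page's out-links :

    N, teleport = 5, 0.5
    adjacency = [                    # rows/columns ordered A, B, C, D, E
        [0, 0, 1, 1, 0],             # A -> C, D
        [1, 0, 1, 0, 0],             # B -> A, C
        [0, 1, 0, 0, 0],             # C -> B
        [0, 0, 0, 0, 1],             # D -> E
        [1, 0, 0, 0, 0],             # E -> A
    ]

    P = []
    for row in adjacency:
        out_degree = sum(row)        # every page here has at least one out-link
        P.append([teleport / N + (1 - teleport) * a / out_degree for a in row])

    for label, row in zip("ABCDE", P):
        print(label, [round(x, 2) for x in row])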
Say a web surfer starts at page C. Propose a probability vector for this page. We can represent the page C via the following probability vector :

v = (A : 0, B : 0, C : 1, D : 0, E : 0)
That is, a 100% chance of being at page C. Compute an approximation of the PageRank score of the pages of the graph, using 3 iterations. We first compute v1 = P.v :

v1 = (0.35, 0.35, 0.1, 0.1, 0.1)

Then, we compute v2 = P.v1 :

v2 = (0.15, 0.2125, 0.275, 0.15, 0.275)

Finally, we compute v3 = P.v2 :

v3 = (0.2125, 0.2125, 0.2125, 0.24375, 0.18125)
As a result, we obtain the following approximations of the PageRank scores : s(A) = 0.2125, s(B) = 0.2125, s(C) = 0.2125, s(D) = 0.24375, s(E) = 0.18125.
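The three iterations can be reproduced in a few lines of Python, using the matrix P and starting vector above and the v_{k+1} = P.v_k convention of the solution :

    P = [
        [0.1,  0.1, 0.35, 0.35, 0.1],
        [0.35, 0.1, 0.35, 0.1,  0.1],
        [0.1,  0.6, 0.1,  0.1,  0.1],
        [0.1,  0.1, 0.1,  0.1,  0.6],
        [0.6,  0.1, 0.1,  0.1,  0.1],
    ]
    v = [0, 0, 1, 0, 0]                          # surfer starts at page C

    def step(matrix, vector):
        # One iteration v <- P.v
        return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

    for _ in range(3):                           # 3 iterations
        v = step(P, v)

    for label, score in zip("ABCDE", v):
        print(f"s({label}) = {score}")           # ≈ 0.2125, 0.2125, 0.2125, 0.24375, 0.18125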
computeHITS(Set B)
    init(hub, B)                  // for each b in B : hub[b] <- 1
    init(aut, B)                  // for each b in B : aut[b] <- 1
    i <- 0
    while (i < 5)                 // 5 iterations (cf. lecture)
        computeH(B, hub, aut)
        computeA(B, hub, aut)
        i <- i + 1
    endwhile
    return (hub, aut)

computeH(Set B, hub, aut)
    for i in B
        for j in B
            if (link(i, j))
                hub[i] <- hub[i] + aut[j]   // i gains hub score for pointing to authority j
            endif
        endfor
    endfor

computeA(Set B, hub, aut)
    for i in B
        for j in B
            if (link(j, i))
                aut[i] <- aut[i] + hub[j]   // i gains authority score for being pointed to by hub j
            endif
        endfor
    endfor