IR Chap4
Abdo A.
2019/20
Outline
• Boolean model
• Vector-Space Model (VSM)
March 9, 2020
IR Models - Basic Concepts
Word evidence: Bag of Words
• IR systems usually adopt index terms to index and retrieve documents.
• Each document is represented by a set of representative keywords or index terms (called a Bag of Words).
• An index term is a word from a document that is useful for remembering the document's main themes.
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents.
• However, no ordering information is attached to the Bag of Words identified from the document collection.
IR Models - Basic Concepts
• One central problem regarding IR systems is the issue of predicting which documents are relevant and which are not.
• Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved.
• Documents appearing at the top of this ordering are considered to be more likely to be relevant.
• Thus, ranking algorithms are at the core of IR systems.
• The IR model determines the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system.
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY.
• Let
  – ki be an index term i and dj be a document j
  – K = (k1, k2, …, kN) be the set of all index terms
• Each term i in a document or query j is given a real-valued weight, wij.
  – wij is the weight associated with the pair (ki, dj). If wij = 0, term ki does not belong to document dj.
• The weight wij quantifies the importance of the index term for describing the document contents.
• vec(dj) = (w1j, w2j, …, wNj) is the weighted vector associated with the document dj.
Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query.
An entry in the matrix corresponds to the "weight" of a term in the document:
  dj = (t1,j, t2,j, …, tN,j);  qk = (t1,k, t2,k, …, tN,k)
[Figure: documents d1, d3, d4, d5 plotted as vectors in a term space, with axis k3 shown]
The Boolean Model: Example
Given the following, determine the documents retrieved by the Boolean model based IR system.
Index Terms: K1, …, K8.
Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 AND (K2 OR NOT K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
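The set operations above map directly onto Python's built-in set type; a minimal sketch of Boolean retrieval over this example (document and term names taken from the slide):

```python
# Each document is a set of index terms (Bag of Words, binary weights).
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def having(term):
    """Set of document IDs that contain the given index term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)

# Query: K1 AND (K2 OR NOT K3)
# AND -> intersection (&), OR -> union (|), NOT -> complement (all_docs - ...)
result = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(result))  # ['D1', 'D2', 'D6']
```

Note how each Boolean operator becomes one set operation; this is exactly why the model gives a yes/no answer with no ranking.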
The Boolean Model: Further Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query "gold silver truck". Also find the relevant documents for the queries:
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
(a) "gold delivery"; (b) "ship gold"; (c) "silver truck"
Table below shows the document-term (ti) matrix (1 = term present, 0 = absent):
        arrive  damage  deliver  fire  gold  silver  ship  truck
D1        0       1       0       1     1     0       1     0
D2        1       0       1       0     0     1       0     1
D3        1       0       0       0     1     0       1     1
Query     0       0       0       0     1     1       0     1
Exercise
Given the following four documents with the following contents:
D1 = "computer information retrieval"
D2 = "computer retrieval"
D3 = "information"
D4 = "computer information"
Find the relevant documents for the query:
Q2 = "information ¬computer"
Exercise:
What are the relevant documents retrieved for the query:
((information OR technology) AND (science OR computer))
Doc No Term 1 Term 2 Term 3 Term 4
1 science
2 computer
3 computer science
4 technology
5 technology science
6 technology computer
7 technology computer science
8 information
9 information science
10 information computer
11 information computer science
12 information technology
13 information technology science
14 information technology computer
15 information technology computer science
Drawbacks of the Boolean Model
• Retrieval is based on a binary decision criterion with no notion of partial matching.
• No ranking of the documents is provided (absence of a grading scale).
• The information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.
• Just changing a Boolean operator from "AND" to "OR" changes the result from an intersection to a union.
2. Vector-Space Model (VSM)
This is the most commonly used strategy for measuring the relevance of documents for a given query, because:
• Use of binary weights is too limiting.
• Non-binary weights provide consideration for partial matches.
• These term weights are used to compute a degree of similarity between a query and each document.
• A ranked set of documents provides better matching.
The idea behind VSM is that the meaning of a document is conveyed by the words used in that document.
Vector-Space Model
To find relevant documents for a given query:
First, documents and queries are mapped into term vector space.
• Note that queries are treated as short documents.
Second, in the vector space, queries and documents are represented as weighted vectors.
• There are different weighting techniques; the most widely used one is computing tf*idf for each term.
Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.
• Closeness is determined by a similarity score calculation.
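The tf*idf weighting step above can be sketched in a few lines, assuming raw term counts as tf and idf = log10(N/df), the conventions used in the worked example later in this chapter:

```python
import math

def tfidf_weights(docs):
    """Compute w_ij = tf_ij * log10(N / df_i) for each term in each document.

    docs: dict mapping document ID to a list of tokens.
    Returns: dict mapping document ID to {term: weight}.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {}
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, tokens in docs.items():
        weights[doc_id] = {
            term: tokens.count(term) * math.log10(n / df[term])
            for term in set(tokens)
        }
    return weights

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
w = tfidf_weights(docs)
print(round(w["D2"]["silver"], 3))  # 0.954 (tf = 2, idf = log10(3/1))
```

Terms that occur in every document (like "a", "in", "of" here) get idf = log10(1) = 0, so they carry no weight, which is exactly the discriminating behavior tf*idf is designed to give.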
Term-Document Matrix
A collection of n documents and a query can be represented in the vector space model by a term-document matrix.
An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document (i.e., it does not appear there).
Example: Computing Weights
• A collection includes 10,000 documents.
• The term A appears 20 times in a particular document.
• The maximum frequency of any term in this document is 50.
• The term A appears in 2,000 of the collection's documents.
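The slide leaves the computation implicit; a sketch of one common solution, assuming normalized tf (raw frequency divided by the maximum term frequency in the document, as hinted by the "maximum appearance" figure) and idf = log10(N/df):

```python
import math

N = 10_000    # documents in the collection
tf_raw = 20   # occurrences of term A in the document
max_tf = 50   # maximum frequency of any term in that document
df = 2_000    # documents containing term A

tf = tf_raw / max_tf       # normalized tf = 20/50 = 0.4
idf = math.log10(N / df)   # log10(10000/2000) = log10(5) ≈ 0.699
weight = tf * idf
print(round(weight, 4))  # 0.2796
```

With raw (unnormalized) tf the weight would instead be 20 * log10(5) ≈ 13.98; which variant is intended depends on the weighting scheme chosen.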
Similarity Measure
• Sim(q, dj) = cos(θ), the cosine of the angle θ between the query vector q and the document vector dj:

  sim(dj, q) = (dj · q) / (|dj| × |q|)
             = Σ(i=1..n) wi,j × wi,q / ( sqrt(Σ(i=1..n) wi,j²) × sqrt(Σ(i=1..n) wi,q²) )
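The formula translates directly to code; a minimal sketch where each weight vector is a dict keyed by term (terms absent from a vector implicitly have weight zero):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)

# Identical vectors have similarity 1; vectors with no shared terms, 0.
print(round(cosine_sim({"gold": 1.0, "truck": 1.0},
                       {"gold": 1.0, "truck": 1.0}), 6))  # 1.0
print(cosine_sim({"gold": 1.0}, {"silver": 1.0}))         # 0.0
```

Dividing by the vector lengths makes the score independent of document length, which is why cosine is preferred over the raw dot product.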
Vector-Space Model: Example
• Suppose we are given the query Q: "gold silver truck". The collection consists of the following three documents:
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
• Assume that all terms are used, including common terms and stop words, and that no terms are reduced to root terms.
• Show the retrieval results in ranked order.
Vector-Space Model: Example
Terms     |  Counts (TF)   | DF | IDF   |      Wi = TF*IDF
          | Q  D1  D2  D3  |    |       | Q      D1     D2     D3
a         | 0  1   1   1   | 3  | 0     | 0      0      0      0
arrived   | 0  0   1   1   | 2  | 0.176 | 0      0      0.176  0.176
damaged   | 0  1   0   0   | 1  | 0.477 | 0      0.477  0      0
delivery  | 0  0   1   0   | 1  | 0.477 | 0      0      0.477  0
fire      | 0  1   0   0   | 1  | 0.477 | 0      0.477  0      0
gold      | 1  1   0   1   | 2  | 0.176 | 0.176  0.176  0      0.176
in        | 0  1   1   1   | 3  | 0     | 0      0      0      0
of        | 0  1   1   1   | 3  | 0     | 0      0      0      0
silver    | 1  0   2   0   | 1  | 0.477 | 0.477  0      0.954  0
shipment  | 0  1   0   1   | 2  | 0.176 | 0      0.176  0      0.176
truck     | 1  0   1   1   | 2  | 0.176 | 0.176  0      0.176  0.176
Vector-Space Model TF*IDF
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure, Sim(q, dj).
• First, for each document and the query, compute the vector lengths (zero terms ignored):
  |d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
  |d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
  |d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
  |q|  = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute the dot products (zero products ignored):
  Q · d1 = 0.176*0.176 = 0.0310                  (gold)
  Q · d2 = 0.477*0.954 + 0.176*0.176 = 0.4862    (silver, truck)
  Q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620    (gold, truck)
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538 * 0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538 * 1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538 * 0.352) = 0.3271
Finally, we sort and rank the documents in descending order of similarity score:
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
• Exercise: using normalized TF, rank the documents using the cosine similarity measure. Hint: normalize the TF of term i in doc j using the maximum frequency of any term k in document j.
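The whole worked example can be reproduced end to end; a sketch using raw term counts as tf and idf = log10(N/df), the same conventions as the tables above:

```python
import math

def rank(query, docs):
    """Rank documents against a query using tf*idf weights and cosine similarity."""
    n = len(docs)
    # Document frequencies and idf values over the collection.
    df = {}
    for tokens in docs.values():
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    idf = {t: math.log10(n / f) for t, f in df.items()}

    def weights(tokens):
        # Raw tf * idf; query terms unseen in the collection get weight 0.
        return {t: tokens.count(t) * idf.get(t, 0.0) for t in set(tokens)}

    def cosine(d, q):
        dot = sum(w * q.get(t, 0.0) for t, w in d.items())
        nd = math.sqrt(sum(w * w for w in d.values()))
        nq = math.sqrt(sum(w * w for w in q.values()))
        return dot / (nd * nq) if nd and nq else 0.0

    qw = weights(query.lower().split())
    scores = {doc_id: cosine(weights(tokens), qw) for doc_id, tokens in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
ranking = rank("gold silver truck", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
# D2 ≈ 0.8247, D3 ≈ 0.3272, D1 ≈ 0.0801 (matching the ranking above,
# up to the intermediate rounding used in the slides)
```

The tiny differences from the slide values (0.8246 vs 0.8247) come from the slides rounding intermediate weights to three decimals.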
Vector-Space Model
• Advantages:
  • Term weighting improves the quality of the answer set, since results are displayed in ranked order.
  • Partial matching allows retrieval of documents that approximate the query conditions.
  • The cosine ranking formula sorts documents according to their degree of similarity to the query.
• Disadvantages:
  • Assumes index terms are mutually independent, which is rarely true in practice.
Technical Memo Example
Suppose the database collection consists of the following documents.
Thank you
Exercises