Lecture 7
Information Retrieval
Cross-language IR
Finding documents written in another language; touches on machine translation.
....
Concerns
The set of texts can be very large, hence efficiency is a concern. Textual data is noisy, incomplete and untrustworthy, hence robustness is a concern. Information may be hidden:
need to derive information from raw data
need to derive information from vaguely expressed needs
IR Basic concepts
Information needs: queries and relevance. Indexing: helps speed up retrieval. Retrieval models: describe how to search and recover relevant documents. Evaluation: IR systems are large and convincing evaluation is tricky.
Information needs
INFORMATION NEED: the topic about which the user desires to know more.
QUERY: what the user conveys to the computer in an attempt to communicate the information need.
RELEVANCE: a document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.
Example: for the topic pipeline leaks, it doesn't matter whether relevant documents use those words or express the concept with other words, such as pipeline rupture.
Queries
Information needs are typically expressed as a query :
Information need: Where shall I go on holiday? → Query: holiday destinations
Remarks
A query :
is usually quite short and incomplete; may contain misspelled or poorly selected words; may contain too many or too few words.
Relevance
Relevance is subjective
python: an ambiguous query, but not for the user. Topicality vs. utility: a document is relevant with respect to a specific goal. A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Relevance is a gradual concept (a document is not just relevant or not; it is more or less relevant to a query). IR systems usually rank retrieved documents by relevance.
But many algorithms use a binary decision of relevance.
Terminology
An IR system looks for data matching some criteria defined by the users in their queries. The language used to ask a question is called the query language. These queries use keywords (atomic items characterizing some data). The basic unit of data is a document (can be a file, an article, a paragraph, etc.). A document corresponds to free text (may be unstructured). All the documents are gathered into a collection (or corpus).
Linear scanning (e.g., with grep) is enough for simple querying of modest collections (millions of words). But for many purposes, you do need more:
To process large document collections (billions or trillions of words) quickly.
To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as within 5 words or within the same sentence.
To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.
Index
Indexing documents
How to relate the user's information need to the documents' content? Idea: use an index to refer to documents. Usually an index is a list of terms that appear in a document; it can be represented mathematically as:
index : doc_i → ∪_j keyword_j
Here, the kind of index we use maps keywords to the list of documents they appear in:
index : keyword_j → ∪_i doc_i
We call this an inverted index.
Indexing documents
The set of keywords is usually called the dictionary (or vocabulary).
A document identifier appearing in the list associated with a keyword is called a posting.
The list of document identifiers associated with a given keyword is called a posting list.
Inverted files
The most common indexing technique. Source file: collection organised by documents. Inverted file: collection organised by terms.
Inverted Index
Given a dictionary of terms (also called a vocabulary or lexicon), for each term, record in a list which documents the term occurs in. Each item in the list:
records that a term appeared in a document (and, later, often the positions in the document)
is conventionally called a posting
Exercise
Draw the inverted index that would be built for the following document collection:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
For this document collection, what are the returned results for these queries?
1. schizophrenia AND drug
2. schizophrenia AND NOT (drug OR approach)
Indexing documents
Arising questions: how to build an index automatically? What are the relevant keywords? Some additional desiderata:
fast processing of large collections of documents
flexible matching operations (robust retrieval)
the possibility to rank the retrieved documents in terms of relevance
To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance. Note that the format of the index has a huge impact on the performance of the system.
Indexing documents
NB: an index is built in 4 steps (a minimal sketch follows below):
1. Gathering of the collection (each document is given a unique identifier)
2. Segmentation of each document into a list of atomic tokens (tokenization)
3. Linguistic processing of the tokens in order to normalize them (e.g., lemmatization)
4. Indexing the documents by computing the dictionary and the postings lists
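To make these four steps concrete, here is a minimal sketch in Python (the toy stop list, the corpus and the strip-based normalization are illustrative assumptions; a real system would plug a stemmer or lemmatizer into step 3):

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "for", "from", "in", "of", "the", "to"}  # toy stop list

def tokenize(text):
    """Step 2: segment a document into lowercase word tokens."""
    return text.lower().split()

def normalize(token):
    """Step 3: simplistic normalization; a real system would stem or lemmatize."""
    return token.strip(".,;:!?")

def build_index(docs):
    """Steps 1 and 4: map each normalized term to a sorted postings list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():        # doc ids assigned during gathering
        for token in tokenize(text):
            term = normalize(token)
            if term and term not in STOP_WORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_index(docs)
print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```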
Manual indexing
Advantages
Human judgements are the most reliable. Retrieval is better.
Drawbacks
Time consuming. Not always consistent: different people build different indexes for the same document.
Automatic indexing
Using NLU?
Not fast enough in real-world settings (e.g., web search). Not robust enough (low coverage). Difficulty: what to include and what to exclude.
Indexes should not contain headings for topics for which there is no information in the document Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?
Stop list
Some extremely common words which appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely; the members of the stop list are discarded during indexing.
Ex:
a an and are as at be by for from has he in is it its of on that the to was were will with
Quality of results
Most of the time, not indexing stop words does little harm: keyword searches with terms like the and by don't seem very useful.
Choosing keywords
Selecting the words that are most likely to appear in a query
These words characterize the documents they appear in. Which are they?
BoW
'Not the same thing a bit!' said the Hatter. 'You might just as well say that "I see what I eat" is the same thing as "I eat what I see"!'
'You might just as well say,' added the March Hare, 'that "I like what I get" is the same thing as "I get what I like"!'
'You might just as well say,' added the Dormouse, who seemed to be talking in its sleep, 'that "I breathe when I sleep" is the same thing as "I sleep when I breathe"!'
Bags of words
Nevertheless, it seems intuitive that two documents with similar bag-of-words representations are similar in content.
Morphological normalization
Should index terms be word forms, lemmas or stems?
Matching morphological variants increases recall. Example morphological variants:
anticipate, anticipating, anticipated, anticipation
company/companies, sell/sold
USA vs. U.S.A., 22/10/2007 vs. 10/22/2007 vs. 2007/10/22
university vs. University, opel vs. Opel
Two techniques:
Stemming: refers to a crude heuristic process that chops off the ends of words, in the hope of achieving this goal correctly most of the time.
Lemmatisation: refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the dictionary form of a word, which is known as the lemma.
NB: documents and queries have to be processed using the same tokenization process !
Porter stemmer (English, 1980)
Algorithm based on a set of context-sensitive rewriting rules:
http://tartarus.org/~martin/PorterStemmer/index.html
http://tartarus.org/~martin/PorterStemmer/def.txt
Rules are composed of a pattern (left-hand side) and a string (right-hand side), for example:
(.*)sses → \1ss          sses → ss: caresses → caress
(.*[aeiou].*)ies → \1i   ies → i: ponies → poni, ties → ti
(.*[aeiou].*)ss → \1ss   ss → ss: caress → caress
Rules may be constrained by conditions on the word's measure, for example:
(m > 1) (.*)ement → \1   replacement → replac, but not cement → c
(m > 0) (.*)eed → \1ee   feed → feed, but agreed → agree
(*v*) ed → \1            plastered → plaster, but bled → bled
(*v*) ing → \1           motoring → motor, but sing → sing
A word or word part can be represented in the form [C](VC)^m[V], where C is a sequence of consonants and V a sequence of vowels; m will be called the measure of any word or word part when represented in this form. Here are some examples:
m=0: TR, EE, TREE, Y, BY
m=1: TROUBLE, OATS, TREES, IVY
m=2: TROUBLES, PRIVATE, OATEN, ORRERY
The rule (m > 1) EMENT → (empty string) would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. A sketch of computing m follows.
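A minimal sketch of computing m from the [C](VC)^m[V] form (following Porter's definition, under which y counts as a vowel when preceded by a consonant):

```python
def porter_measure(word):
    """Compute m in the form [C](VC)^m[V] by counting V-to-C transitions."""
    vowels = "aeiou"
    pattern = []
    prev_is_vowel = False
    for i, ch in enumerate(word.lower()):
        # y is a vowel when preceded by a consonant (Porter's definition)
        is_vowel = ch in vowels or (ch == "y" and i > 0 and not prev_is_vowel)
        pattern.append("V" if is_vowel else "C")
        prev_is_vowel = is_vowel
    # every V immediately followed by a C closes one (VC) group
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")

for w in ("tree", "trouble", "troubles", "replac"):
    print(w, porter_measure(w))   # tree 0, trouble 1, troubles 2, replac 2
```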
Exercise
What is the Porter measure of the following words (give your computation) ?
crepuscular rigorous placement
Stemming
Most stemmers also remove suffixes such as -ed, -ing, -ational, -ation, -able, -ism, ...
Example: relational → relate
Stemming
Popular stemmers
Porter's, Lovins, Iterated Lovins, Kstem
Lemmatization
Exceptions need to be handled:
sought → seek, sheep → sheep, feet → foot
Computationally more expensive than stemming, as it looks words up in a dictionary (an illustration follows). Lemmatizers for French:
http://bach.arts.kuleuven.be/pmertens/morlex/
FLEMM (F. Namer)
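For illustration (the tool choice is an assumption, since the slide only lists French resources), the English WordNet lemmatizer in NLTK handles such exceptions by dictionary lookup:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("sought", pos="v"))  # seek  (verb exception list)
print(lemmatizer.lemmatize("sheep", pos="n"))   # sheep
print(lemmatizer.lemmatize("feet", pos="n"))    # foot
```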
Most web search engines do use stop word lists but not stemming/lemmatising, because:
the text collection is extremely large, so the chance of matching morphological variants is higher and recall is not an issue
stemming is imperfect, and the size and diversity of the web increase the chance of a mismatch
stemming/tokenising tools are available for only a few languages
Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, announced, Monday
Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, announce, Monday
Granularity
Document unit :
An index can map terms
... to documents
... to paragraphs in documents
... to sentences in documents
... to positions in documents
An IR system should be designed to offer choices of granularity. We will henceforth assume that a suitable document unit size has been chosen, together with an appropriate way of dividing or aggregating files, if needed.
Index Content
The index usually stores some or all of the following information:
For each term:
Document count: how many documents the term occurs in.
Total frequency count: how many times the term occurs across all documents (a popularity measure).
Retrieval model
Relevance can be measured in terms of a match between queries and the document index.
Best match
A query describes good or best matching documents The result is a ranked list of documents
Statistical retrieval
Statistical Models
A document is typically represented by a bag of words (unordered words with frequencies). The user specifies a set of desired terms with optional weights:
Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
Unweighted query terms: Q = <database; text; information>
No Boolean conditions are specified in the query.
Statistical Retrieval
Retrieval based on similarity between query and documents. Output documents are ranked according to similarity to the query. Similarity is based on occurrence frequencies of keywords in query and document. Automatic relevance feedback can be supported.
The user issues a (short, simple) query. The system returns an initial set of retrieval results. The user marks some returned documents as relevant or non-relevant. The system computes a better representation of the information need based on the user feedback. The system displays a revised set of retrieval results.
Boolean model
Additionally, the Boolean model may support:
proximity operators
simple regular expressions
spelling variants
The intersection operation is the crucial one. It has to be efficient so as to be able to quickly find documents that contain both terms.
It is sometimes referred to as merging postings lists, because it uses a merge algorithm. Merge algorithm: a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each list.
Intersection
Ideas:
consider keywords with shorter postings lists first (to reduce the number of operations)
use the frequency information stored in the dictionary
See Manning et al., 2007 for the algorithm; a sketch of the basic merge follows.
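A minimal sketch of the two-pointer merge for an AND query, assuming both postings lists are sorted by document id (runs in time linear in the total list length):

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc ids present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1        # advance the pointer sitting on the smaller doc id
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))   # [2, 31]
```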
Exercise
How would you process the following queries (main steps)?
Brutus AND NOT Caesar
Try your algorithm on
Exercise
How would you process the following query (main steps)?
Brutus OR NOT Caesar
All terms have equal importance (no term weighting). Ranking models are consistently better.
The user wants documents where the whole phrase appears, and not only some parts of it (i.e., The inventor Stanford Ovshinsky never went to university is not a match). About 10% of web queries are phrase queries (song names, institutions, ...). Such queries need either more complex dictionary terms, or a more complex index (critical parameter: size of the index).
Biword indexes
Use key-phrases of length 2. Example:
Text: Natural Language Processing
Dictionary: Natural Language, Language Processing
The dictionary is made of biwords (notion of context).
Positional indexes
Store positions in the inverted index, for example:
termID ::= doc1: position1, position2, ...; doc2: position1, position2, ...; ...
Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists). NB: such indexes can be used to process proximity queries (i.e., using constraints on proximity between words); a positional-merge sketch is given below.
Positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document id), so their size grows much faster with document size. The size of a positional index depends on the language being indexed and the type of document (books, articles, etc.). On average, a positional index is 2-4 times bigger than an inverted index, and it can reach 35 to 50% of the size of the original text (for English). Positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al., 2005]).
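A minimal sketch of phrase matching over a positional index (the dict-of-dicts layout and the example data are illustrative assumptions; real indexes store compressed sorted lists):

```python
def phrase_match(pos_index, terms):
    """Return ids of documents where `terms` occur as a contiguous phrase.
    pos_index: term -> {doc_id: sorted list of token positions}."""
    docs = set(pos_index[terms[0]])
    for t in terms[1:]:
        docs &= set(pos_index[t])              # docs containing every term
    result = []
    for d in sorted(docs):
        starts = set(pos_index[terms[0]][d])
        for offset, t in enumerate(terms[1:], start=1):
            # keep start positions where term t occurs `offset` tokens later
            starts &= {p - offset for p in pos_index[t][d]}
        if starts:
            result.append(d)
    return result

idx = {"natural":  {1: [0, 10], 2: [5]},
       "language": {1: [1, 11], 2: [9]}}
print(phrase_match(idx, ["natural", "language"]))   # [1]
```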
Exercise
Which documents can contain the sentence to be or not to be, considering the following (incomplete) indexes?
be ::= 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367
to ::= 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191
Exercise
Given the following positional indexes, give the document ids corresponding to the query world wide web:
world ::= 1: 7, 18, 33, 70, 85, 131; 2: 3, 149; 4: 17, 190, 291, 430, 434
wide ::= 1: 12, 19, 40, 72, 86, 231; 2: 2, 17, 74, 150, 551; 3: 8, 16, 191, 429, 435
web ::= 1: 20, 22, 41, 75, 87, 200; 2: 18, 32, 45, 56, 77, 151; 4: 25, 192, 300, 332, 440
The postings lists to access are: to, be, or, not. We will examine intersecting the postings lists for to and be. We first look for documents that contain both terms. Then we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to, and then we look for another occurrence of each word with token index 4 higher than the first occurrence. In the above lists, the pattern of occurrences that is a possible match is:
to: <...; 4: <..., 429, 433>; ...>
be: <...; 4: <..., 430, 434>; ...>
Exercise
Consider the following index:
language: <d1,12> <d2,23-32-43> <d3,53> <d5,36-42-48>
Loria: <d1,25> <d2,34-40> <d5,38-51>
where dI refers to document I, the other numbers being positions. The infix operator NEAR/x refers to a proximity of at most x between two terms.
Give the solutions to the query language NEAR/2 Loria.
Give the pairs (x, docids) for each x such that language NEAR/x Loria has at least one solution.
Propose an algorithm for retrieving matching documents for this operator.
Example: WESTLAW
Large commercial system that has served the legal and professional market since 1974
legal materials (court opinions, statutes, regulations, ...) news (newspapers, magazines, journals, ...) financial (stock quotes, financial analyses, ...)
Total collection size: 5-7 terabytes. 700,000 users (they claim 56% of legal searchers as of 2002). Best match added in 1992.
Best-Match retrieval
Boolean retrieval is the archetypal example of exact-match retrieval. Best-match or ranking models are now more common. Advantages:
easier to use
similar efficiency
provides ranking (the most relevant documents appear at the top of the ranking)
best match generally has better retrieval performance
Boolean model: all documents matching the query are retrieved. The matching is binary: yes or no. Extreme cases: the list of retrieved documents can be empty, or huge. A ranking of the documents matching a query is needed: a score is computed for each (query, document) pair.
Vector-space Retrieval
By far the most common type of retrieval system. Key idea: everything (documents, queries) is a vector in a high-dimensional space. Vector coefficients for an object (document, query, term) represent the degree to which this object embodies each of the basic dimensions. Relevance is measured using vector similarity: a document is relevant to a query if their representing vectors are similar.
Vector-space Representation
Documents are vectors of terms Terms are vectors of documents A query is a vector of terms
Graphic Representation
[Figure: documents as vectors in the space spanned by term axes T1, T2, T3; e.g., D2 = 3T1 + 7T2 + T3]
sim(q, d) = Σ_i q_i · d_i
where q_i (d_i) is the value of the i-th position of q (d). With binary values this amounts to counting the matching terms between q and d.
Problem: longer documents tend to match more terms by chance. Solution: the length of a document must be taken into account when computing the similarity score, e.g., via cosine normalization (a sketch follows).
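A minimal sketch of cosine similarity over sparse term-weight dictionaries (the document weights are made up for illustration; the query weights reuse the earlier weighted-query example):

```python
import math

def cosine_sim(q, d):
    """Dot product divided by both vector lengths, so long documents
    are not favoured merely for containing more terms."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"database": 0.5, "text": 0.8, "information": 0.2}
d = {"text": 3.0, "information": 1.0, "retrieval": 2.0}
print(round(cosine_sim(q, d), 2))   # 0.72
```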
Term weights
q_i is the weight of term i in q. Up to now, we only considered binary term weights:
0: term absent
1: term present
Two shortcomings:
Does not reflect how often a term occurs All terms are equally important (president vs. the)
Term frequency
A document is treated as a set of words. Each word characterizes that document to some extent. Once stop words have been eliminated, the most frequent words tend to be what the document is about. Therefore f_{k,d} (the number of occurrences of word k in document d) will be an important measure. It is also called the term frequency (tf).
Document frequency
What makes this document distinct from others in the corpus? The terms which discriminate best are not those which occur with high document frequency! Therefore d_k (the number of documents in which word k occurs) will also be an important measure. It is also called the document frequency (df).
TF.IDF
This can all be summarized as:
Words are best discriminators when :
they occur often in this document (high term frequency) and do not occur in a lot of documents (low document frequency)
There are multiple formulas for actually computing this. The underlying concept is the same in all of them.
Term weights
tf-score: tf_{i,j} = frequency of term i in document j
idf-score: idf_i = inverse document frequency of term i
idf_i = log(N / n_i) with:
N, the size of the document collection (number of documents)
n_i, the number of documents in which term i occurs
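A minimal sketch computing these weights (raw term frequency times log(N/n_i); as noted above, many tf-idf variants exist):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc_id -> list of normalized tokens.
    Returns doc_id -> {term: tf * idf} using idf = log(N / n_i)."""
    N = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))                  # each doc counts once per term
    return {doc_id: {t: tf * math.log(N / df[t])
                     for t, tf in Counter(tokens).items()}
            for doc_id, tokens in docs.items()}

docs = {1: ["new", "schizophrenia", "drug"],
        2: ["new", "approach", "schizophrenia"]}
print(tf_idf(docs)[1])   # "drug" gets log(2/1) > 0; shared terms get log(2/2) = 0
```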
Evaluation
Evaluation
Issues: user-based evaluation, system-based evaluation, TREC, precision and recall.
Evaluation methods
Two types of evaluation methods:
User-based: measures user satisfaction.
System-based: focuses on how well the system ranks the documents.
F-Measure
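Combines precision and recall into a single score; the standard definitions are:
Precision: P = |relevant ∩ retrieved| / |retrieved|
Recall: R = |relevant ∩ retrieved| / |relevant|
F-measure: the harmonic mean of the two, F = 2PR / (P + R)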
TREC
Text REtrieval Conference. Proceedings at http://trec.nist.gov/. Established in 1991 to evaluate large-scale IR: retrieving documents from a gigabyte collection. Organised by NIST and run continuously since 1991. Best known IR evaluation setting.
25 participants in 1992; 109 participants from 4 continents in 2004. European (CLEF) and Asian (NTCIR) counterparts.
TREC Format
Several IR research tracks
ad-hoc retrieval, routing/filtering, cross-language, scanned documents, spoken documents, video, Web, question answering, ...
Query Expansion
Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall.
Use a dictionary/thesaurus
Use relevance feedback
Thesauri
A thesaurus contains information about words (e.g., violin) such as:
Synonyms: similar words e.g., fiddle Hyperonyms: more general words e.g., instrument Hyponyms: more specific words e.g., Stradivari Meronyms: parts, e.g., strings
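As an illustration (the tool choice is an assumption, the lecture does not name one), WordNet accessed through NLTK exposes exactly these relations:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

violin = wn.synsets("violin")[0]       # first noun sense of "violin"
print(violin.hypernyms())              # more general, e.g. bowed stringed instrument
print(violin.hyponyms())               # more specific, e.g. Stradivarius
print(violin.part_meronyms())          # parts, e.g. fingerboard, chin rest
```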
Problems of Thesauri
Language dependent Available only for a couple of languages
Cooccurrence models
Semantically or syntactically related terms. Cooccurrence vs. thesauri:
easy to adapt to other languages/domains
also covers relations not expressed in thesauri
not as reliable as manually edited thesauri
can introduce considerable noise
Relevance feedback
Ask the user to identify a few documents which appear to be related to their information need. Extract terms from those documents and add them to the original query. Run the new query and present those results to the user. Typically converges quickly.
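A common formalization of this loop is Rocchio's algorithm (the algorithm name, the parameter values and the numpy sketch are additions, not from the slide): move the query vector toward relevant documents and away from non-relevant ones, assuming tf-idf weighted vectors:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reweight the query vector using judged document vectors."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)    # negative term weights are usually clipped

# Toy example over a 3-term vocabulary:
query = np.array([1.0, 0.0, 0.0])
relevant = np.array([[0.9, 0.8, 0.0]])      # user-marked relevant docs
nonrelevant = np.array([[0.0, 0.0, 1.0]])   # user-marked non-relevant docs
print(rocchio(query, relevant, nonrelevant))   # [1.675 0.6 0.]
```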
Blind feedback
Assume that the first few documents returned are the most relevant, rather than having users identify them. Proceed as for relevance feedback. Tends to improve recall at the expense of precision.
Post-Hoc Analysis
When a set of documents has been returned, they can be analyzed to improve their usefulness in addressing the information need:
grouped by meaning for polysemous queries (using N-gram-type approaches)
grouped by extracted information (named entities, for instance)
grouped into an existing hierarchy if structured fields are available
filtering (e.g., eliminating spam)
References
Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. To appear at Cambridge University Press (chapters available at the book website).
Information Retrieval, Second Edition, by C.J. van Rijsbergen, Butterworths, London, 1979.