
Chapter 3 - InformationRetrieval-1


VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

DS307
SOCIAL MEDIA ANALYSIS

Faculty of Information Science and Engineering


University of Information Technology, VNU-HCM

UIT, VNU-HCM Social Media Analysis 1


This Course’s Contents

Introduction to Information Retrieval


Why Study IR?
 Many reasons, but if you want a one-word
answer:


Google …
❖ Examines billions of web pages
❖ Returns results in less than half a second
❖ Valued at gazillions of dollars by the public
market


How Does Google Work?
❖ Only Google knows, but . . .
❖ Uses hundreds of thousands of machines
❖ Uses some sophisticated computer science
(efficient storage and searching of large datasets)
❖ Uses an innovative ranking algorithm
(based on the hypertext structure of the web)


How Does Google Work?
❖ Underlying Google is basic IR technology.
❖ The Web is indexed.
▪ an index links terms with pages.
❖ A user’s information need is represented as a query.
❖ Queries are matched against web pages.
▪ Google attempts to return pages which are relevant to
the information need.
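As a rough illustration of "an index links terms with pages", here is a minimal inverted-index sketch. The function names `build_index` and `match`, the page ids, and the toy pages are assumptions made for this example; nothing here describes how Google actually stores or matches pages.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercased term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def match(index, query):
    """Return ids of pages containing every query term (a simple AND match)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy pages, invented for illustration.
pages = {
    "p1": "Australia collapse as Hoggard takes 6 wickets",
    "p2": "Pietersen's century puts Australia on back foot",
}
index = build_index(pages)
hits = match(index, "Australia Hoggard")  # {'p1'}
```

The lookup touches only the entries for the query terms, which is why an index makes matching feasible over very large collections.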


IR is More Than Web Search
❖ IR is much older than the Web (1950s –)
❖ The Web has some unique characteristics which
make it a special case.
❖ IR deals with tasks other than searching:
▪ Categorizing Documents.
▪ Summarizing Documents.
▪ Answering Questions.
▪ …


Motivation for IR
❖ Searching literature databases.
❖ Web search.
❖ Volume of information stored electronically is
growing at ever faster rates
▪ need to search it
▪ categorize it
▪ filter it
▪ translate it
▪ summarize it
▪ ...

Biomedical Information
❖ Biomedical literature is growing at a startling rate
▪ Around 1,000,000 new articles are added to Medline
each year
❖ Tasks:
▪ Literature search
▪ Creation and maintenance of biological databases
▪ Knowledge discovery from text mining


Document Retrieval
 IR is often used to mean Document Retrieval
 Primary task of an IR system:
retrieve documents with content that is relevant
to a user’s information need
 How do we represent content?
 How do we represent information need?
 How do we decide on relevance?


Document Retrieval
 Representation/Indexing
◼ Representation of documents and requests:
▪ bag of words?
▪ stop words, upper/lower case, . . .
▪ query language
◼ Storing the documents, building the index
 Searching
◼ Is a document relevant to the query?
▪ models of IR: Boolean, vector-space, probabilistic
◼ Efficient algorithms for searching large datasets


What IR is Not
 An IR system is not a Database Management
System
 A DBMS stores and processes well-defined data
 A search in a DBMS is exact / deterministic
 Search in an IR system is probabilistic
◼ inherent uncertainty exists at all stages of IR:
information need, formulating query, searching


A Simple Retrieval Model
 Bag of Words approach
◼ A document is represented as a bag of words
◼ Word order is ignored
◼ Syntactic structure is ignored
◼ ...
 Relevance is determined by comparing the
words in the document with the words in a
query
 Simple approach has been very effective
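The bag-of-words idea can be sketched in a few lines. The stop-word list, the punctuation handling, and the function names `bag_of_words` and `overlap` are assumptions for illustration only; real systems use more careful tokenization.

```python
from collections import Counter

# Hypothetical stop-word list; real systems use longer, curated lists.
STOP_WORDS = {"as", "the", "on", "of", "a", "to", "and", "for"}

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words, count."""
    tokens = (tok.strip(".,;:!?'\"") for tok in text.lower().split())
    return Counter(tok for tok in tokens if tok and tok not in STOP_WORDS)

def overlap(query, document):
    """Count the distinct query words that also occur in the document."""
    return len(set(bag_of_words(query)) & set(bag_of_words(document)))

# Two sentences with the same words in a different order give the same bag.
a = bag_of_words("Hoggard takes wickets as Australia collapse")
b = bag_of_words("Australia collapse as Hoggard takes wickets")
same_bag = (a == b)  # True: word order is ignored

shared = overlap("Hoggard wickets", "Australia collapse as Hoggard takes 6 wickets")  # 2
```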

Vector Space Model
 Provides a ranking of documents with respect to a
query
 Documents and queries are vectors in a multi-dimensional
information space
 Key questions:
◼ What forms the dimensions of the space?
▪ terms, concepts, . . .
◼ How are document and query vectors compared?


Coordinate Matching
 Document relevance measured by the number
of query terms appearing in a document
 Terms provide the dimensions
◼ Large vocabulary ⇒ high dimensional space
 Similarity measure is the dot-product of the
query and document vectors


Simple Example
 Term vocabulary: (England, Australia, Pietersen, Hoggard,
run, wicket, catch, century, collapse)
 Documents:
◼ d1: Australia collapse as Hoggard takes 6 wickets
◼ d2: Pietersen’s century puts Australia on back foot
 Queries:
◼ q1: {Hoggard, Australia, wickets}
 Query, document similarity
◼ q1.d1 = (0,1,0,1,0,1,0,0,0) · (0,1,0,1,0,1,0,0,1) = 3
◼ q1.d2 = (0,1,0,1,0,1,0,0,0) · (0,1,1,0,0,0,0,1,0) = 1
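The two dot products above can be reproduced with a short sketch. The `normalize` helper is a crude assumption (it only lowercases, strips punctuation and possessives, and folds a plural "s" onto a vocabulary term); it stands in for whatever stemming the slide example implicitly applies when it maps "wickets" to "wicket".

```python
VOCAB = ["england", "australia", "pietersen", "hoggard",
         "run", "wicket", "catch", "century", "collapse"]

def normalize(token):
    """Crude normalization (an assumption): lowercase, strip punctuation and
    a possessive 's, then drop a plural 's' if that yields a vocabulary term."""
    token = token.lower().strip(".,;:!?")
    for suffix in ("'s", "\u2019s"):
        if token.endswith(suffix):
            token = token[:-len(suffix)]
    if token.endswith("s") and token[:-1] in VOCAB:
        token = token[:-1]
    return token

def binary_vector(text):
    """1 if the vocabulary term occurs in the text, else 0."""
    present = {normalize(tok) for tok in text.split()}
    return [1 if term in present else 0 for term in VOCAB]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

q1 = binary_vector("Hoggard Australia wickets")
d1 = binary_vector("Australia collapse as Hoggard takes 6 wickets")
d2 = binary_vector("Pietersen's century puts Australia on back foot")

score_d1 = dot(q1, d1)  # 3
score_d2 = dot(q1, d2)  # 1
```

With binary vectors the dot product simply counts the query terms present in the document, which is exactly coordinate matching.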


Term Frequency (TF)
 Coordinate matching does not consider the frequency of
query terms in documents
 Term vocabulary: (England, Australia, Pietersen, Hoggard,
run, wicket, catch, century, collapse)
 d1: Australia collapsed as Hoggard took 6 wickets. Flintoff
praised Hoggard for his excellent line and length.
 q1: {Hoggard, Australia, wickets}
 q1 · d1 = (0,1,0,1,0,1,0,0,0) · (0,1,0,2,0,1,0,0,1) = 4


Term Frequency (TF)
 Coordinate matching does not consider the number of
documents query terms appear in
 Term vocabulary: (England, Australia, Pietersen, Hoggard,
run, wicket, catch, century, collapse)
 d2: Flintoff took the wicket of Australia’s Ponting, to give
him 2 wickets for the innings and 5 wickets for the match.
 q1: {Hoggard, Australia, wickets}
 q1 · d2 = (0,1,0,1,0,1,0,0,0) · (0,1,0,0,0,3,0,0,0) = 4
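A small check of the two term-frequency scores, using the vectors exactly as given on the slides (the variable names are assumptions). It shows the weakness being illustrated: d2 never mentions Hoggard, yet it scores the same as d1 because "wicket" occurs three times.

```python
# Term-frequency vectors copied from the slides, in vocabulary order:
# (England, Australia, Pietersen, Hoggard, run, wicket, catch, century, collapse)
q1 = [0, 1, 0, 1, 0, 1, 0, 0, 0]
d1_tf = [0, 1, 0, 2, 0, 1, 0, 0, 1]
d2_tf = [0, 1, 0, 0, 0, 3, 0, 0, 0]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

score_d1 = dot(q1, d1_tf)  # 1 + 2 + 1 = 4
score_d2 = dot(q1, d2_tf)  # 1 + 0 + 3 = 4
```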


Inverse Document Frequency (IDF)
 Assume wicket appears in 100 documents in total, Hoggard
appears in 5, and Australia in 10
(ignoring IDF of other terms)
 d1: Australia collapsed as Hoggard took 6 wickets. Flintoff
praised Hoggard for his excellent line and length.
 d2: Flintoff took the wicket of Australia’s Ponting, to give him 2
wickets for the innings and 5 wickets for the match.
 q1: {Hoggard, Australia, wickets}
 q1 · d1 = (0,1,0,1,0,1,0,0,0) · (0,1/10,0,2/5,0,1/100,0,0,1) = 0.51
 q1 · d2 = (0,1,0,1,0,1,0,0,0) · (0,1/10,0,0/5,0,3/100,0,0,0) = 0.13
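The weighting used in this example divides each term frequency by the number of documents containing the term. The sketch below recomputes both scores with exact fractions; the names `DOC_FREQ` and `weight` are assumptions, and terms without a stated document frequency are left unweighted, as the slide does. Summing 1/10 + 2/5 + 1/100 gives 0.51 for d1, and 1/10 + 0 + 3/100 gives 0.13 for d2. Note that this is plain division by document frequency; many systems use a logarithmic IDF instead.

```python
from fractions import Fraction

VOCAB = ["england", "australia", "pietersen", "hoggard",
         "run", "wicket", "catch", "century", "collapse"]

# Document frequencies assumed on the slide; other terms are left unweighted.
DOC_FREQ = {"australia": 10, "hoggard": 5, "wicket": 100}

def weight(tf_vector):
    """Divide each term frequency by the term's document frequency."""
    return [Fraction(tf, DOC_FREQ.get(term, 1))
            for term, tf in zip(VOCAB, tf_vector)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

q1 = [0, 1, 0, 1, 0, 1, 0, 0, 0]
d1 = weight([0, 1, 0, 2, 0, 1, 0, 0, 1])
d2 = weight([0, 1, 0, 0, 0, 3, 0, 0, 0])

score_d1 = float(dot(q1, d1))  # 0.51
score_d2 = float(dot(q1, d2))  # 0.13
```

After weighting, d1 outranks d2 because the rarer term Hoggard contributes far more than the common term wicket.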


Document Length


Vector Space Similarity


Vector Space Similarity


Language Understanding?
 Want a system which “understands” documents and query and matches
them?
◼ use semantic representation and logical inference
 Until recently such technology was not robust / did not scale to large
unrestricted text collections
 But:
◼ useful for restricted domains
◼ now used for some large-scale tasks (QA, IE)
 Is a “deep” approach appropriate for document retrieval?
◼ Powerset (Natural Language Search) thinks so (see www.powerset.com)


Tasks in IR (broadly conceived)
 Document Retrieval (ad-hoc retrieval)
 Document Filtering or Routing
 Document Categorization
 Document Summarizing
 Information Extraction
 Question Answering


Other Topics
 Multimedia IR (images, sound, . . .)
◼ but text can be of different types (web pages,
e-mails, . . .)
 User-system interaction (HCI)
 Browsing


Evaluation
 IR has largely been treated as an empirical, or engineering,
task
 Evaluation has played an important role in the development of
IR
 DARPA/NIST Text Retrieval Conference (TREC)
◼ began in 1992
◼ has many participants
◼ uses large text databases
◼ considers many tasks in addition to document retrieval


IR in One Sentence
 “Indexing, retrieving and organizing text by
probabilistic or statistical techniques that reflect
semantics without actually understanding”
(James Allan, UMass)


Brief History of IR
 1960s
◼ development of basic techniques in automated indexing
and searching
 1970s
◼ Development of statistical methods / vector space models
◼ Split from NLP/AI
◼ Operational Boolean systems
 1980s
◼ Increased computing power
◼ Spread of operational systems


Brief History of IR
 1990s and 2000s
◼ Large-scale full text IR systems for retrieval and filtering
◼ Dominance of statistical ranking approaches
◼ Web search
◼ Multimedia and multilingual applications
◼ Question Answering
◼ TREC evaluations


Brief History of IR
 Course book
◼ Introduction to Information Retrieval
Manning, Raghavan, Schütze
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html


Q&A


Thank you!
