Text-Based (Image) Retrieval: Henning Müller HES SO//Valais Sierre, Switzerland

This document discusses techniques for text-based retrieval of images and documents. It covers topics like stemming, stop word removal, term weighting using TF-IDF, and using tools like Lucene for text retrieval. Challenges discussed include multi-lingual retrieval, query expansion, relevance feedback, and ensuring diversity in results. The document provides an overview of fundamental concepts and techniques in text retrieval that are also applied to image retrieval.


Business Information Systems

Text-based (image) retrieval

Henning Müller
HES SO//Valais
Sierre, Switzerland

Overview

•  Difference of words and features

–  Weightings instead of distance measures
•  Stemming and pre-treatment
•  Approaches for multilingual retrieval
•  Tools available on the web
–  Lucene, …

Text retrieval (of images)

•  Started in the early 1960s … for images in the 1970s

•  Not the main focus of this talk
•  Text retrieval is old!!
–  Many techniques in image retrieval are taken from
this domain (sometimes reinvented)
•  It becomes clear that the combination of visual
and textual retrieval has the biggest potential
–  Good text retrieval engines exist in Open Source

Problems with annotation (of images)

•  Many things are hard to express
–  Feelings, situations, … (what is scary?)
–  What is in the image, what is it about, what does
it evoke?
•  Annotation is never complete
–  Plus it depends on the goal of the annotation
•  Many ways to say the same thing …
–  Synonyms, hyponyms, hypernyms, …
•  Mistakes
–  Spelling errors, spelling differences (US vs. UK),
weird abbreviations (particularly medical …)

Basics in text retrieval

•  Started with boolean search of words in text
–  In combination with AND, OR, NOT
–  No ranking, only an unordered list of matching
documents
•  Vector space model defines a distance between
search terms and documents
–  Each occurring word is a dimension; differences
in frequency along it can be measured
–  Overall frequency of a word sets the importance of its axis
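
The two models above can be contrasted in a few lines. This is a minimal sketch, not the implementation of any particular engine: the toy captions in DOCS, the whitespace tokenizer and the use of raw term counts with cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection of image captions (not from the slides).
DOCS = {
    "d1": "chest x-ray showing fracture of the rib",
    "d2": "x-ray of the hand without fracture",
    "d3": "photo of a cat on a car",
}

def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()

def boolean_and(terms, docs):
    """Boolean AND: the unranked set of documents containing every term."""
    return {doc_id for doc_id, text in docs.items()
            if all(term in tokenize(text) for term in terms)}

def boolean_or(terms, docs):
    """Boolean OR: the unranked set of documents containing any term."""
    return {doc_id for doc_id, text in docs.items()
            if any(term in tokenize(text) for term in terms)}

def cosine(query, text):
    """Cosine similarity between raw term-count vectors (one axis per word)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(text))
    dot = sum(q[word] * d[word] for word in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Vector space model: documents ordered by decreasing similarity."""
    scored = [(cosine(query, text), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

# NOT is expressed as a set difference between two Boolean results.
xray_not_hand = boolean_and(["x-ray"], DOCS) - boolean_and(["hand"], DOCS)
print(xray_not_hand)
print(rank("fracture rib", DOCS))
```

The Boolean functions return sets, so there is no order to the results; the vector space function returns a ranked list, which is the main practical difference between the two models.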

Zipf distribution (Wikipedia example)

[Figure: word frequency plot. X axis: rank; Y axis: number of occurrences of the word]
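
A rank/frequency table like the one plotted can be computed from any text. The sketch below is illustrative only: the function name and the sample sentence are assumptions, and the Zipf pattern (frequency falling roughly in proportion to 1/rank) only becomes visible on a large corpus, not on a toy string.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, occurrences) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Hypothetical sample text; replace with a real corpus to see the Zipf shape.
table = rank_frequency("the cat and the dog and the bird")
for rank, word, freq in table:
    print(rank, word, freq)
```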

Principal ideas used in text IR

•  Word frequencies basically follow a Zipf distribution

•  Tf/idf weightings
–  A word frequent in a document describes it well
–  A word rare in a collection has a high
discriminative power
–  Many variations of tf/idf (see also Salton/Buckley
paper)
•  Use of inverted files for quick query responses
–  Relevance feedback, query expansion, …
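
The sketch below combines an inverted file with one simple tf/idf variant, tf × log(N/df). It is only one of the many weightings mentioned above, chosen here for brevity; the function names and the toy collection are assumptions.

```python
import math
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """Map each term to its postings: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(query, index, n_docs):
    """Score documents by the sum of tf * idf over the query terms.

    Only the postings of the query terms are visited, which is why an
    inverted file answers queries quickly.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rare term -> high weight
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf          # frequent in doc -> high weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy collection.
DOCS = {
    "d1": "lung x-ray lung nodule",
    "d2": "hand x-ray",
    "d3": "lung ct scan",
}
INDEX = build_inverted_file(DOCS)
print(search("lung x-ray", INDEX, len(DOCS)))
```

With this variant a term that occurs in every document gets an idf of zero and no longer influences the ranking.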

Techniques used in text retrieval

•  Bag of words approach
–  Or N-grams can be used
•  Stop words can be removed
•  Stemming can improve results
•  Named entity recognition
•  Spelling correction (also umlauts, accents, …)
–  Google had a big success with this
•  Mapping of text to a controlled vocabulary/
ontology
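
As a rough illustration of the first item, a bag of words keeps only word counts, while N-grams retain some local order. The helper names and the whitespace tokenization below are assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts only; the order of words is discarded."""
    return Counter(text.lower().split())

def word_ngrams(text, n):
    """Sequences of n consecutive words."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Sequences of n consecutive characters, useful for spelling variants."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(bag_of_words("the cat saw the dog"))
print(word_ngrams("new york city", 2))
print(char_ngrams("color", 3))
```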

Stop word removal

•  Very frequent words contain little information and
can be removed
–  Automatically in Google et al.
•  These words depend on the language
–  Stop word lists exist in many languages
•  Stop words often make up 40-50% of a text
–  Stop word lists also contain less frequent words that
carry no information
•  Alternatively, simply remove all words above a certain
frequency
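
Both strategies can be sketched briefly. The stop word list below is a tiny made-up sample, not a real list, and interpreting "frequency" as the share of documents containing a word is an assumption.

```python
from collections import Counter

# Tiny illustrative English list; real stop word lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """List-based removal: drop every token found in the stop word list."""
    return [token for token in tokens if token not in stop_words]

def frequent_words(docs, max_doc_ratio):
    """Frequency-based alternative: words found in more than
    max_doc_ratio of the documents are treated as stop words."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return {word for word, n in doc_freq.items() if n / len(docs) > max_doc_ratio}

print(remove_stop_words("the x-ray of the hand".split()))
print(frequent_words(["the cat", "the dog", "the bird sings"], 0.5))
```

The frequency-based variant needs no language-specific list, at the cost of a threshold that has to be tuned per collection.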

Stemming - conflation

•  Strongly dependent on the language

•  Basically suffix stripping based on a set of rules
–  cats, catty, catlike → cat as root or stem
•  Can also create errors or slightly change
meaning (error rates of around 5% are often reported)
•  The Porter stemmer for English is one of the
best-known algorithms and has a free
implementation
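
The idea of rule-based suffix stripping can be shown with a deliberately naive stemmer. This is not the Porter algorithm: the four rules and the minimum stem length below are invented for illustration and cover only a handful of English endings.

```python
# Hypothetical rule set: (suffix, replacement), tried in this order.
RULES = [("like", ""), ("ty", ""), ("ies", "y"), ("s", "")]

def stem(word, min_stem=3):
    """Apply the first matching rule that leaves at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:len(word) - len(suffix)] + replacement
    return word

for word in ["cats", "catty", "catlike", "ponies", "news"]:
    print(word, "->", stem(word))
```

The last example shows the kind of error mentioned above: "news" is conflated with "new", which changes the meaning.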

Synonymy, polysemy

•  Synonymy
–  Several words can say the same thing: car,
automobile
•  Polysemy
–  The same word can have several meanings
•  Latent Semantic Indexing (LSI)
–  Word co-occurrences in the entire collection
–  Can reduce effects of synonyms

Query expansion vs. relevance feedback

•  Most queries contain only very few keywords

•  Add keywords to expand the original query
–  Can be automatic or manual
–  Semantically similar words, synonyms,
discriminative words
•  Often used in a similar way as relevance
feedback but not with entire documents
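
Automatic expansion with synonyms might look like the sketch below. The synonym table is a hypothetical hand-made dictionary; in practice the added terms could come from a thesaurus or from co-occurrence statistics.

```python
# Hypothetical synonym table used only for this example.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "x-ray": ["radiograph"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the original keywords followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("red car"))
```

Relevance feedback would instead take its additional terms from entire documents that the user marked as relevant.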

Medical terminologies

•  MeSH, UMLS are frequently used

–  Mapping of free text to terminologies
•  Quality for the first few is very high
–  Links between items can be used
•  Hyponyms, hypernyms, …
–  Several axes exist (anatomy, pathology, …)
•  This can be used for making a query more
discriminative
•  This can also be used for multilingual retrieval

WordNet
•  Hierarchy, links, definitions in the English language
–  Maintained at Princeton
•  Car, auto, automobile, machine, motorcar
–  motor vehicle, automotive vehicle
•  vehicle
–  conveyance, transport
»  instrumentality, instrumentation
»  artifact, artefact
»  object, physical object
»  entity, something

Apache Lucene

•  Open source text retrieval system

–  Written in Java
•  Several tools available
–  Easy to use
•  Used in many research projects and in industry
•  Image retrieval plugin exists
–  LIRE (Lucene Image REtrieval)
–  Using simple MPEG-7 visual features

Multilingual retrieval

•  Many collections are inherently multilingual

–  Web, Flickr, medical teaching files, …
•  Translation resources exist on the web
–  TrebleCLEF has a survey of such resources in
progress
–  Translate query into document language
–  Translate documents into query language
–  Map documents and queries onto a common
terminology of concepts
•  We understand documents in other languages

Cross Language Evaluation Forum (CLEF)

•  Forum to compare multilingual retrieval in a
variety of domains
–  GeoCLEF
–  QA CLEF
–  Domain-specific CLEF
–  …
•  Proceedings are a very good start for multilingual
techniques

Challenges in multi-linguality

•  Language pairs have a strongly varying difficulty

–  Families of languages are easier for multilingual
retrieval
•  Resources available depend strongly on the
languages used
–  English has many resources; German, Spanish
and French have quite a few; rare languages have
very few

Multilingual tools

•  Many translation tools are accessible on the
web
–  Yahoo! Babel Fish
–  www.reverso.net
–  Google Translate
•  Named entity recognition
•  Word-sense disambiguation

Current challenges in text retrieval

•  Many taken from the WWW or linked to it
•  Analysis of link structures to obtain information
on potential relevance
–  Also in companies, social platforms, …
•  Question of diversity in results
–  You do not want the same result to show
up ten times at the top
•  Retrieval in context (domain specific)
•  Question answering
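
One simple way to address the diversity question is to filter near-duplicates from a ranked list. The sketch below is an assumption-laden illustration: it compares result texts with Jaccard similarity over word sets, and the 0.8 threshold is arbitrary.

```python
def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def diversify(ranked_texts, max_similarity=0.8):
    """Walk down the ranking and keep a result only if it is not too
    similar to any result that was already kept."""
    kept = []
    for text in ranked_texts:
        if all(jaccard(text, earlier) <= max_similarity for earlier in kept):
            kept.append(text)
    return kept

# Hypothetical ranked captions; the second is a duplicate of the first.
RANKED = [
    "eiffel tower at night",
    "Eiffel Tower at night",
    "eiffel tower by day",
    "louvre museum",
]
print(diversify(RANKED))
```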

Diversity

Conclusions
•  Text retrieval is the basis of image retrieval
–  Many techniques come from this domain
•  Text has more semantics than visual features
–  But other problems as well
•  Text and image features combined have the best
chance of success
–  Use text wherever available
•  Multilinguality is an important issue as most of
the web is very multilingual
–  And also a part of research

References
•  G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5):513-523, 1988.
•  K. Sparck Jones and C. J. Van Rijsbergen, Progress in documentation, Journal of Documentation, 32:59-75, 1976.
•  J. J. Rocchio, Relevance feedback in information retrieval, The SMART Retrieval System, Experiments in Automatic Document Processing, pages 313-323.
•  M. Braschler and C. Peters, Cross-Language Evaluation Forum: Objectives, Results, Achievements, Information Retrieval, 2004.
•  J. Gobeill, H. Müller and P. Ruch, Translation by Text Categorization: Medical Image Retrieval in ImageCLEFmed 2006, Springer Lecture Notes in Computer Science (LNCS 4730), pages 706-710, 2007.
