
Business Information Systems

Text-based (image) retrieval

Henning Müller
HES SO//Valais
Sierre, Switzerland

Overview

•  Differences between words and visual features
–  Weightings instead of distance measures
•  Stemming and preprocessing
•  Approaches for multilingual retrieval
•  Tools available on the web
–  Lucene, …

Text retrieval (of images)

•  Text retrieval started in the early 1960s; text-based retrieval of images followed in the 1970s
•  Not the main focus of this talk
•  Text retrieval is old!
–  Many techniques in image retrieval are taken from this domain (sometimes reinvented)
•  It has become clear that combining visual and textual retrieval has the biggest potential
–  Good open-source text retrieval engines exist

Problems with annotation (of images)

•  Many things are hard to express
–  Feelings, situations, … (what is scary?)
–  What is in the image, what is it about, what does it evoke?
•  Annotation is never complete
–  It also depends on the goal of the annotation
•  Many ways to say the same thing …
–  Synonyms, hyponyms, hypernyms, …
•  Mistakes
–  Spelling errors, spelling differences (US vs. UK), weird abbreviations (particularly medical ones …)

Basics in text retrieval

•  Started with Boolean search for words in text
–  Terms combined with AND, OR, NOT
–  No ranking, just a set of matching documents
•  Vector space model gives a distance between search terms and documents
–  Each occurring word is a dimension; differences in its frequency can be measured
–  The overall frequency of a word determines the importance of its axis
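
The vector space idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular engine: the cosine measure, the whitespace tokenizer, and the sample documents are assumptions, since the slide only speaks of a distance between search terms and documents.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: each distinct word is one dimension."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return document indices ordered by similarity, dropping non-matches."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

# Made-up sample captions, for illustration only.
docs = [
    "chest x-ray showing a fracture",
    "x-ray of the left hand",
    "photo of a mountain lake",
]
ranking = rank("hand x-ray", docs)
```

Unlike Boolean search, the result is an ordered list: the second caption shares two query words and ranks first, the first caption shares one, and the third is dropped.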

Zipf distribution (Wikipedia example)

•  X axis: rank of the word
•  Y axis: number of occurrences of the word
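
The two axes of such a plot can be computed directly from a text. The sketch below is only an illustration on a made-up sentence; under Zipf's law the count falls roughly in inverse proportion to the rank, which only becomes visible on a large collection.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    # Ties are broken alphabetically so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
```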

Principal ideas used in text IR

•  Word frequencies basically follow a Zipf distribution
•  Tf/idf weightings
–  A word that is frequent in a document describes it well
–  A word that is rare in the collection has high discriminative power
–  Many variations of tf/idf exist (see also the Salton/Buckley paper)
•  Use of inverted files for quick query responses
–  Relevance feedback, query expansion, …
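
A minimal sketch of an inverted file with tf/idf scoring follows. The weighting used here (raw term frequency times log(N/df)) is just one of the many variants mentioned above and is an assumption, as are the function names and sample documents.

```python
import math
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    """Plain log(N / df); zero for terms that occur in every document."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0

def search(index, query, n_docs):
    """Score only the documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        weight = idf(index, term, n_docs)
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up sample documents, for illustration only.
docs = [
    "lung x-ray image",
    "hand x-ray image",
    "lung ct image of the lung",
]
index = build_inverted_file(docs)
results = search(index, "lung image", len(docs))
```

The word "image" occurs in every document, so its idf is zero and it contributes nothing; "lung" is rarer and decides the ranking. The query touches only the posting lists of its own terms, which is why inverted files answer quickly.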

Techniques used in text retrieval

•  Bag of words approach
–  Or N-grams can be used
•  Stop words can be removed
•  Stemming can improve results
•  Named entity recognition
•  Spelling correction (also umlauts, accents, …)
–  Google had great success with this
•  Mapping of text to a controlled vocabulary/ontology
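
Two of these pre-treatment steps are easy to sketch: folding accents and umlauts, and building word N-grams. The function names are hypothetical, and stripping the diacritic is only one possible convention (German text is often normalized from ü to ue instead).

```python
import unicodedata

def fold_accents(text):
    """Strip combining marks so that accented and plain spellings match."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_ngrams(text, n):
    """Return the list of word N-grams; n=1 gives the plain bag of words."""
    words = fold_accents(text).lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

folded = fold_accents("Müller")
bigrams = word_ngrams("Magnetic résonance imaging", 2)
```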

Stop word removal

•  Very frequent words contain little information and can be removed
–  Done automatically in Google et al.
•  These words depend on the language
–  Stop word lists exist for many languages
•  Stop words often make up 40-50% of a text
–  The lists also contain less frequent words that carry no information
•  Or simply remove words above a certain frequency
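
Both variants, a fixed list and a frequency cut-off, can be sketched as follows. The tiny stop word list, the use of document frequency as the cut-off criterion, and the 0.9 threshold are assumptions for illustration.

```python
from collections import Counter

# Tiny illustrative list; real stop word lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]

def frequent_words(docs, max_doc_ratio):
    """Words that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return {w for w, df in doc_freq.items() if df / len(docs) > max_doc_ratio}

# Made-up sample documents.
docs = ["the x-ray of the hand", "the ct of the lung", "the photo of a lake"]
tokens = remove_stop_words(docs[0].split())
derived = frequent_words(docs, 0.9)
```

The frequency-based variant needs no language-specific list, which helps for languages where no stop word list is available.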

Stemming - conflation

•  Strongly dependent on the language
•  Basically suffix stripping based on a set of rules
–  cats, catty, catlike → cat as root or stem
•  Can also create errors or slightly change the meaning (reported error rates are often around 5%)
•  The Porter stemmer for English is one of the best-known algorithms, with a free implementation
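
Rule-based suffix stripping can be illustrated with a toy stemmer. This is explicitly not the Porter algorithm: the rule list and the minimum stem length are hypothetical choices made only to reproduce the cats/catty/catlike example.

```python
# Hypothetical rule list, longest suffix first. This is NOT the Porter algorithm.
SUFFIX_RULES = ["like", "ing", "ty", "es", "s"]

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["cats", "catty", "catlike"]]
overstemmed = naive_stem("wines")
```

All three example words conflate to "cat", but "wines" becomes "win", which shows how simple rules create the kind of errors mentioned above.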

Synonymy, polysemy

•  Synonymy
–  Several words can say the same thing: car, automobile
•  Polysemy
–  The same word can have several meanings
•  Latent Semantic Indexing (LSI)
–  Uses word co-occurrences in the entire collection
–  Can reduce the effects of synonyms

Query expansion vs. relevance feedback

•  Most queries contain only very few keywords
•  Add keywords to expand the original query
–  Can be automatic or manual
–  Semantically similar words, synonyms, discriminative words
•  Often used in a similar way to relevance feedback, but not with entire documents
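
Automatic expansion with synonyms might look like the sketch below. The synonym table is hypothetical; in practice it could come from a thesaurus or a terminology.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "picture": ["image", "photo"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Append synonyms of each query term, keeping order and avoiding duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query("red car picture")
```

A real system would usually give the added terms a lower weight than the original keywords so that the expansion does not drift away from the user's intent.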

Medical terminologies

•  MeSH and UMLS are frequently used
–  Mapping of free text to terminologies
•  Quality for the first few is very high
–  Links between items can be used
•  Hyponyms, hypernyms, …
–  Several axes exist (anatomy, pathology, …)
•  This can be used for making a query more discriminative
•  This can also be used for multilingual retrieval

WordNet
•  Hierarchy, links, definitions for the English language
–  Maintained at Princeton
•  Car, auto, automobile, machine, motorcar
–  motor vehicle, automotive vehicle
•  vehicle
–  conveyance, transport
»  instrumentality, instrumentation
»  artifact, artefact
»  object, physical object
»  entity, something
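
The hierarchy above can be walked programmatically. The sketch below does not use the WordNet database itself: it hard-codes the links from this slide, represents each synonym set by its first word, and assumes the indented list is a single chain from specific to general.

```python
# Hypernym links copied from the slide; each synonym set is named by its first word.
HYPERNYM = {
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
    "conveyance": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
}

def hypernym_chain(term):
    """Follow hypernym links from a term up to the most general concept."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_a(term, ancestor):
    """True if ancestor is a (transitive) hypernym of term."""
    return ancestor in hypernym_chain(term)[1:]

chain = hypernym_chain("car")
```

Such a chain lets a query for "vehicle" also match documents annotated with "car".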

Apache Lucene

•  Open source text retrieval system
–  Written in Java
•  Several tools available
–  Easy to use
•  Used in many research projects and in industry
•  Image retrieval plugin exists
–  LIRE (Lucene Image REtrieval)
–  Using simple MPEG-7 visual features

Multilingual retrieval

•  Many collections are inherently multilingual
–  Web, Flickr, medical teaching files, …
•  Translation resources exist on the web
–  TrebleCLEF has a survey of such resources in progress
–  Translate query into document language
–  Translate documents into query language
–  Map documents and queries onto a common terminology of concepts
•  We understand documents in other languages
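
The third option, mapping documents and queries onto a common terminology of concepts, can be sketched as follows. The term table and the concept identifiers are made up for illustration; a real system would use a terminology such as MeSH or UMLS.

```python
# Hypothetical term-to-concept table; the concept IDs are invented for this sketch.
CONCEPTS = {
    "lung": "C_LUNG", "lunge": "C_LUNG", "poumon": "C_LUNG",
    "fracture": "C_FRACTURE", "fraktur": "C_FRACTURE",
    "hand": "C_HAND", "main": "C_HAND",
}

def to_concepts(text):
    """Replace known terms by language-independent concept IDs; drop other words."""
    return {CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS}

def search(query, docs):
    """Rank documents by the number of concepts they share with the query."""
    q = to_concepts(query)
    scored = [(len(q & to_concepts(d)), i) for i, d in enumerate(docs)]
    return [i for overlap, i in sorted(scored, key=lambda s: (-s[0], s[1])) if overlap > 0]

# German, French and English sample captions (made up).
docs = ["Fraktur der Hand", "radiographie du poumon", "lung fracture"]
hits = search("fracture main", docs)
```

The French query finds the German caption first and the English one second, without translating either the query or the documents.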

Cross Language Evaluation Forum (CLEF)

•  Forum to compare multilingual retrieval in a variety of domains
–  GeoCLEF
–  QA CLEF
–  Domain-specific CLEF
–  …
•  The proceedings are a very good starting point for multilingual techniques

Challenges in multi-linguality

•  Language pairs vary strongly in difficulty
–  Languages from the same family are easier for multilingual retrieval
•  Available resources depend strongly on the languages used
–  English has many resources; German, Spanish and French have quite a few; rare languages have rather few

Multilingual tools

•  Many translation tools are accessible on the web
–  Yahoo! Babel fish
–  www.reverso.net
–  Google Translate
•  Named entity recognition
•  Word-sense disambiguation

Current challenges in text retrieval

•  Many are taken from the WWW or linked to it
•  Analysis of link structures to obtain information on potential relevance
–  Also in companies, social platforms, …
•  Question of diversity in results
–  You do not want the same result to show up ten times at the top
•  Retrieval in context (domain specific)
•  Question answering
Diversity
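
One common way to avoid near-duplicate results at the top is a greedy re-ranking that trades relevance against similarity to what has already been selected. The sketch below follows that general idea; it is not a method named in these slides, and the Jaccard word overlap, the trade-off value, and the sample results are assumptions.

```python
def jaccard(a, b):
    """Word overlap between two texts, between 0 and 1."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def diversify(ranked, k, trade_off=0.5):
    """ranked: list of (text, relevance), best first. Returns k diverse texts."""
    selected = []
    candidates = list(ranked)
    while candidates and len(selected) < k:
        def gain(item):
            text, relevance = item
            # Penalize candidates that resemble an already selected result.
            redundancy = max((jaccard(text, s) for s in selected), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(candidates, key=gain)
        selected.append(best[0])
        candidates.remove(best)
    return selected

# Made-up result list with relevance scores.
ranked = [
    ("eiffel tower at night", 0.9),
    ("eiffel tower at night paris", 0.85),
    ("eiffel tower construction drawing", 0.8),
]
top2 = diversify(ranked, 2)
```

The second result is almost a copy of the first, so the less relevant but different third result takes its place in the top two.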

Conclusions
•  Text retrieval is the basis of image retrieval
–  Many techniques come from this domain
•  Text has more semantics than visual features
–  But other problems as well
•  Text and image features combined have the best chance of success
–  Use text wherever available
•  Multilinguality is an important issue as most of the web is very multilingual
–  And it is also a research topic

References
•  G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5):513-523, 1988.
•  K. Sparck Jones and C. J. Van Rijsbergen, Progress in documentation, Journal of Documentation, 32:59-75, 1976.
•  J. J. Rocchio, Relevance feedback in information retrieval, The SMART Retrieval System, Experiments in Automatic Document Processing, pages 313-323.
•  M. Braschler, C. Peters, Cross-Language Evaluation Forum: Objectives, Results, Achievements, Information Retrieval, 2004.
•  J. Gobeill, H. Müller, P. Ruch, Translation by Text Categorization: Medical Image Retrieval in ImageCLEFmed 2006, Springer Lecture Notes in Computer Science (LNCS 4730), pages 706-710, 2007.
