
Unit V

Uploaded by

Sree Dhathri

Contents:

Text Search Algorithms


• Introduction to Text Search Techniques
• Software Text Search Algorithms
• Hardware Text Search Systems

Multimedia Information Retrieval


• Spoken Language Audio Retrieval
• Non-Speech Audio Retrieval
• Graph Retrieval
• Imagery Retrieval
• Video Retrieval
Software Text Search Algorithms

• In software streaming techniques, the item to be searched is read into memory and then the algorithm is applied.
• There are five major algorithms associated with software text search:
• Brute force approach
• Knuth-Morris-Pratt
• Boyer-Moore
• Shift-OR algorithm
• Rabin-Karp
Brute Force :

• This approach is the simplest string matching algorithm.
• The idea is to try to match the search string against the input text, one character at a time.
• As soon as a mismatch is detected in the comparison process, shift the input text one position (relative to the search string) and start the comparison process over.
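The brute-force steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `brute_force_search` and the choice to return an empty list for an empty pattern are assumptions made for the example.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every index in text at which pattern starts (naive scan)."""
    n, m = len(text), len(pattern)
    matches = []
    if m == 0:
        return matches  # assumption: an empty pattern yields no matches
    for start in range(n - m + 1):
        j = 0
        # Compare left to right and stop at the first mismatch.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
        # In either case the next iteration moves the window by one position.
    return matches


print(brute_force_search("THIS IS A TEST TEXT", "TEST"))
```

In the worst case every window is compared almost to its end, so the running time grows with the product of the text length and the pattern length.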
Flow Chart
Example
Advantages and Disadvantages
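KNUTH-MORRIS-PRATT

• Checks the characters from left to right.
• When a sub-pattern appears more than once within the pattern, KMP uses that property to avoid re-comparing characters that have already matched, which improves the time complexity.
• The time complexity of KMP is O(n + m): O(m) to preprocess a pattern of length m and O(n) to scan a text of length n.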
kmpAlgorithm(text, pattern)
Input: the main text and the pattern to be searched for
Output: the locations where the pattern is found
Begin
   n := size of text
   m := size of pattern
   call findPrefix(pattern, m, prefArray)
   i := 0
   j := 0

   while i < n, do
      if text[i] = pattern[j], then
         increase i and j by 1
         if j = m, then
            print the location (i-j) as there is the pattern
            j := prefArray[j-1]
      else if j ≠ 0 then
         j := prefArray[j - 1]
      else
         increase i by 1
   done
End
Worked example of the search algorithm

Input:
Main String: “AAAABAAAAABBBAAAAB”, Pattern: “AAAB”

Output:
Pattern found at location: 1
Pattern found at location: 7
Pattern found at location: 14

Input:
txt[] = “THIS IS A TEST TEXT”, pat[] = “TEST”

Output:
Pattern found at index: 10
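The KMP pseudocode can be turned into runnable Python roughly as follows. This is a sketch under assumptions: the helper `find_prefix` is one common way to build the prefix array that the pseudocode obtains from `findPrefix` (the source does not show that routine), and the Python names are chosen for the example.

```python
def find_prefix(pattern: str) -> list[int]:
    """pref[k] is the length of the longest proper prefix of
    pattern[:k + 1] that is also a suffix of it."""
    pref = [0] * len(pattern)
    length = 0  # length of the prefix currently being extended
    k = 1
    while k < len(pattern):
        if pattern[k] == pattern[length]:
            length += 1
            pref[k] = length
            k += 1
        elif length != 0:
            length = pref[length - 1]  # fall back to a shorter prefix
        else:
            pref[k] = 0
            k += 1
    return pref


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern yields no matches
    pref = find_prefix(pattern)
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                matches.append(i - j)
                j = pref[j - 1]  # keep going to find later occurrences
        elif j != 0:
            j = pref[j - 1]  # reuse the part of the pattern already matched
        else:
            i += 1
    return matches


print(kmp_search("AAAABAAAAABBBAAAAB", "AAAB"))
print(kmp_search("THIS IS A TEST TEXT", "TEST"))
```

The text index `i` never moves backwards, which is where the linear scan time comes from.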
Example
BOYER MOORE ALGORITHM

• String search is significantly enhanced because the comparison process starts at the end of the search pattern and proceeds right to left, rather than at the start of the search pattern.
• The advantage is that large jumps are possible when the mismatched character in the input stream does not occur in the search pattern, which happens frequently.
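The right-to-left comparison and the large jumps can be illustrated with a simplified Boyer-Moore variant. Note the assumption: this sketch uses only the bad-character rule and omits the good-suffix rule of the full algorithm, so it shows the idea rather than the complete method. The name `boyer_moore_bad_char` is chosen for the example.

```python
def boyer_moore_bad_char(text: str, pattern: str) -> list[int]:
    """Boyer-Moore search using only the bad-character rule."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost position of each character in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    matches = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare from the end of the pattern, right to left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the mismatched text character with its rightmost
            # occurrence in the pattern, or jump past it if it is absent.
            s += max(1, j - last.get(text[s + j], -1))
    return matches


print(boyer_moore_bad_char("THIS IS A TEST TEXT", "TEST"))
```

When the mismatched text character does not occur in the pattern at all, the shift is `j + 1`, which can be as large as the whole pattern length.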
Flow chart
Examples:
HARDWARE TEXT SEARCH ALGORITHMS

• A specialized hardware machine performs the searches and passes the results to the main computer, which supports the user interface and retrieval of hits.
• Since the searcher is hardware based, scalability is achieved by increasing the number of hardware search devices.
• The only limit on speed is the time it takes to flow the text off secondary storage. By having one search machine per disk, the maximum time it takes to search a database of any size will be the time to search one disk.
The Fast Data Finder (FDF)
• It is the most recent specialized hardware text search unit still in use in many
organizations.
• It was developed to search text and has been used to search English and
foreign languages.
• The early Fast Data Finders consisted of an array of programmable text
processing cells connected in series forming a pipeline hardware search
processor.
• The cells are implemented using a VLSI chip. In the tests each chip contained 24 processor cells, with a typical system containing 3600 cells (the FDF-3 has a rack mount configuration with 10,800 cells).
• Each cell is a comparator for a single character, limiting the total number of
characters in a query to the number of cells. The cells are interconnected with
an 8-bit data path and approximately 20-bit control path.
• The text to be searched passes through each cell in a pipeline fashion until the complete database has been searched.
• As data are analyzed at each cell, the states of the 20 control lines are modified depending upon their current state and the results from the comparator.
• A cell is composed of both a register cell (Rs) and a comparator (Cs).
• The input from the document database is controlled and buffered by the microprocessor/memory and fed through the comparators.
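A rough software model can make the cell pipeline easier to picture. This is only an illustrative simplification, not the actual FDF design: each cell is assumed to hold one query character, a single boolean "matched so far" flag stands in for the roughly 20 control lines, and each incoming character is presented to all cells in the same step. The names `pipeline_search` and `num_cells` are hypothetical.

```python
def pipeline_search(text: str, query: str, num_cells: int = 3600) -> list[int]:
    """Simplified model of a chain of single-character comparator cells."""
    if not query:
        raise ValueError("query must not be empty")
    if len(query) > num_cells:
        # One cell per query character limits the query length.
        raise ValueError("query is longer than the number of cells")
    # state[k] is True when cells 0..k matched the k + 1 most recent characters.
    state = [False] * len(query)
    hits = []
    for pos, ch in enumerate(text):
        # Update from the last cell to the first so that each cell sees
        # its neighbour's state from the previous character.
        for k in range(len(query) - 1, -1, -1):
            previous = True if k == 0 else state[k - 1]
            state[k] = previous and ch == query[k]
        if state[-1]:
            hits.append(pos - len(query) + 1)
    return hits


print(pipeline_search("THIS IS A TEST TEXT", "TEST"))
```

Each text character is examined once as it streams past, so in this model the search time depends on how fast the text can be delivered rather than on the query.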
Other Hardware

• The search characters are stored in the registers. The connection between
the registers reflects the control lines that are also passing state information.

• The earliest hardware text string search unit was the Rapid Search Machine, developed by General Electric. The machine consisted of a special purpose search unit where a single query was passed against a magnetic tape containing the documents.

• A more sophisticated search unit, called the Associative File Processor (AFP), was developed by Operating Systems Inc. (OSI). It is capable of searching against multiple queries at the same time.
• OSI, using a different approach, developed the High Speed Text Search (HSTS) machine. It uses a finite state machine algorithm and runs three parallel state machines.

• GE redesigned the Rapid Search Machine into the GESCAN system, which uses a text array processor (TAP) that simultaneously matches many terms and conditions against a given text stream. The TAP receives the query information from the user's computer and directly accesses the textual data from secondary storage.
GESCAN Text Array Processor
INFORMATION SYSTEM EVALUATION

• In recent years the evaluation of information retrieval systems (IRS) and of techniques for indexing, sorting, searching and retrieving information has become increasingly important.

• This growth in interest is due to two major reasons:


1. The growing number of retrieval systems being used
2. Additional focus on evaluation methods themselves

• There are many reasons to evaluate the effectiveness of an IRS:
1. To aid in the selection of a system to procure
2. To monitor and evaluate system effectiveness
3. To evaluate the query generation process for improvements
4. To determine the effects of changes made to an existing information system
• From a human judgment standpoint, relevancy can be considered:
1. Subjective: depends upon a specific user’s judgment
2. Situational: relates to a user’s requirements
3. Cognitive: depends on human perception and behaviour
4. Temporal: changes over time
5. Measurable: observable at points in time
• Ingwersen categorizes the information view into four types of “aboutness”:
1. Author Aboutness: determined by the author’s language as matched by the system in natural language retrieval
2. Indexer Aboutness: determined by the indexer’s transformation of the author’s natural language into a controlled vocabulary
3. Request Aboutness: determined by the user’s or intermediary’s processing of a search statement into a query
4. User Aboutness: determined by the indexer’s attempt to represent the document according to presuppositions about what the user will want to know
• Measures used in system evaluation: to define the measures that can be used in evaluating an IRS, it is useful to define the major functions associated with identifying relevant items in an information system.
Measures Used in System Evaluations

• Measurements can be made from two perspectives: user perspective and system perspective.
• Techniques for collecting measurements can also be objective or subjective.
• An objective measure is one that is well-defined and based upon numeric values derived from the system operation.
• A subjective measure can produce a number, but is based upon an individual user's judgments.
• Measurements with automatic indexing of items arriving at a system are
derived from standard performance monitoring associated with any program
in a computer (e.g., resources used such as memory and processing cycles)
and time to process an item from arrival to availability to a search process.
• When manual indexing is required, the measures are then associated with
the indexing process.
Multimedia Information Retrieval

The need to develop multimedia database management


• Efficient and effective storage and retrieval of multimedia information has become critical.
• A traditional DBMS is not capable of effectively handling multimedia data because it is designed to deal with alphanumeric data.
• Characteristics and requirements of alphanumeric data and multimedia
data are different
• A key issue in multimedia data is its multiple types such as text, audio,
video, graphics etc.
The fundamentals of Multimedia Database (Content) Management research cover:
• Feature extraction from these multiple media types to support information retrieval.
• Feature dimension reduction – high dimensional features
• Indexing and retrieval techniques for the feature space
   • Similarity measurement on query features
• How to integrate various indexing and retrieval techniques for effective retrieval of multimedia documents.
• As with a DBMS, efficient search is the main performance concern.
Multimedia Information Retrieval Systems (MIRS)

The need for MIRS

• A vast amount of multimedia data is captured and stored.
• The special characteristics and requirements of multimedia data are significantly different from those of alphanumeric data.
• Text document information retrieval (e.g., Google search) has limited capacity to handle multimedia data effectively.
Expected Query types and Applications
• Metadata-based queries
   • Example: the timestamp of a video and the authors’ names
• Annotation-based queries (event-based queries)
   • Example: a video segment of people picking up or dropping bags
• Queries based on data patterns or features
   • Example: color distribution, texture description and other low-level statistical information
• Query by example
   • Example: cut a region from a picture and try to find regions with the same or similar semantic meaning in other pictures or videos
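The first two query types can be illustrated with a small in-memory catalogue. Everything here is hypothetical: the record layout, the field names (`author`, `timestamp`, `annotations`), the sample records and the function names are assumptions made for the example, not part of any particular MIRS.

```python
from datetime import datetime

# Hypothetical catalogue records; field names and values are illustrative only.
CATALOGUE = [
    {"id": "clip-1", "author": "A. Rao",
     "timestamp": datetime(2024, 1, 5, 9, 30),
     "annotations": ["person picks up bag"]},
    {"id": "clip-2", "author": "B. Lee",
     "timestamp": datetime(2024, 1, 6, 14, 0),
     "annotations": ["person drops bag", "crowd"]},
    {"id": "clip-3", "author": "A. Rao",
     "timestamp": datetime(2024, 2, 1, 8, 0),
     "annotations": ["empty corridor"]},
]


def metadata_query(records, author=None, start=None, end=None):
    """Metadata-based query: filter on author and/or a timestamp range."""
    hits = []
    for rec in records:
        if author is not None and rec["author"] != author:
            continue
        if start is not None and rec["timestamp"] < start:
            continue
        if end is not None and rec["timestamp"] > end:
            continue
        hits.append(rec["id"])
    return hits


def annotation_query(records, keyword):
    """Annotation-based query: match a keyword against event annotations."""
    keyword = keyword.lower()
    return [rec["id"] for rec in records
            if any(keyword in note.lower() for note in rec["annotations"])]


print(metadata_query(CATALOGUE, author="A. Rao", end=datetime(2024, 1, 31)))
print(annotation_query(CATALOGUE, "bag"))
```

Feature-based queries and query by example need content features rather than stored fields, so they do not fit this simple filtering style.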
Introduction to Image Indexing and Retrieval

Four main approaches to image indexing and retrieval


• Low level features – Content based Image Retrieval (CBIR)
• Structured attributes – Traditional database management system
• Object recognition – Automatic object recognition
• Text – Manual annotation (Google search)
Four main approaches to image indexing and retrieval
• Content based Image Retrieval (CBIR) – low level features
• Extract low level image features (color, edge, texture and shape)
• Expand these image features towards semantic levels
• Index the images based on a similarity measure
• Use relevance feedback to refine the candidate images
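As an illustration of low-level feature extraction followed by similarity-based ranking, the sketch below uses a coarse color histogram and histogram intersection. These are one common choice, assumed here for the example; the slides do not prescribe a particular feature or measure. Images are represented as plain lists of (r, g, b) tuples, and all function names are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of (r, g, b) pixels with values 0-255."""
    step = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantize each channel into a bin index, clamped to the last bin.
        rb = min(int(r / step), bins_per_channel - 1)
        gb = min(int(g / step), bins_per_channel - 1)
        bb = min(int(b / step), bins_per_channel - 1)
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_similarity(query_pixels, collection):
    """Return image names ordered from most to least similar to the query."""
    query_hist = color_histogram(query_pixels)
    scored = [(histogram_intersection(query_hist, color_histogram(px)), name)
              for name, px in collection.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Tiny synthetic "images" for demonstration.
red = [(255, 0, 0)] * 4
dark_red = [(200, 10, 10)] * 4
blue = [(0, 0, 255)] * 4
print(rank_by_similarity(red, {"blue": blue, "dark_red": dark_red}))
```

Relevance feedback would sit on top of such a ranking, for example by re-weighting features according to which returned images the user marks as relevant.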
Image representation
• A visual content descriptor can be either global or local.
• The global descriptor uses the visual features of the whole image.
• A local descriptor uses the visual features of regions or objects to describe the image content, with the aid of region/object segmentation techniques.
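The difference between the two descriptor types can be shown on a tiny grayscale image stored as a list of rows. As a loudly stated assumption, a fixed grid of blocks stands in for real region/object segmentation, and mean intensity stands in for a richer visual feature; the function names are hypothetical.

```python
def global_descriptor(image):
    """One value for the whole image: its mean intensity."""
    values = [v for row in image for v in row]
    return sum(values) / len(values)


def local_descriptors(image, block=2):
    """Mean intensity per block; the fixed grid is a stand-in for segmentation."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height, block):
        row_desc = []
        for left in range(0, width, block):
            cells = [image[y][x]
                     for y in range(top, min(top + block, height))
                     for x in range(left, min(left + block, width))]
            row_desc.append(sum(cells) / len(cells))
        result.append(row_desc)
    return result


checker = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
print(global_descriptor(checker))
print(local_descriptors(checker))
```

The global value cannot distinguish this checkerboard from a uniformly gray image, while the local descriptors preserve where the bright and dark regions are.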
Spoken Language Audio Retrieval

• Spoken Content Retrieval (SCR) provides users with access to digitized audio-visual content with a spoken language component.
• In recent years, the phenomenon of “speech media,” media involving the spoken word, has developed in four important respects.
• First, and perhaps most often noted, is the unprecedented volume of stored digital spoken content that has accumulated online and in institutional, enterprise and other private contexts.
• Speech media collections contain valuable information, but their sheer
volume makes this information useless unless spoken audio can be
effectively browsed and searched.
• Second, the form taken by speech media has grown progressively
diverse.
• Most obviously, speech media includes spoken-word audio collections
and collections of video containing spoken content. However, a speech
track can accompany an increasingly broad range of media.
• For example, speech annotation can be associated with images
captured with smartphones.
• Current developments are characterized by dramatic growth in the
volume of spoken content that is spontaneous and is recorded outside
of the studio, often in conversational settings.
• Third, the different functions fulfilled by speech media have increased in
variety. The spoken word can be used as a medium for communicating
factual information.
• Examples of this function range from material that has been scripted and
produced explicitly as video, such as television documentaries, to material
produced for a live audience and then recorded, such as lectures. The
spoken word can be used as a historical record.
• Examples include speech media that records events directly, such as
meetings, as well as speech media that captures events that are recounted,
such as interviews. The spoken word can also be used as a form of
entertainment. The importance of the entertainment function is reflected in
creative efforts ranging from professional film to user-generated video on
the Internet.
• Fourth, user attitudes towards speech media and the use of speech
media have evolved greatly. Although privacy concerns dominate, the
acceptance of the creation of speech recordings, for example, of call
center conversations, has recently grown.
• Also, users are becoming increasingly acquainted with the concept of
the spoken word as a basis on which media can be searched and
browsed. The expectation has arisen that access to speech media
should be as intuitive, reliable and comfortable as access to
conventional text media.
Elements of AIR
