Software Program: The core of an IRS is the software that enables users to
search for and retrieve information.
Hardware: The system can run on standard computer hardware or specialized
hardware, which may include components for enhancing search capabilities or
converting non-textual data into searchable formats (e.g., transcribing audio to
text).
6. Efficiency and Overhead:
Information Retrieval Overhead: This refers to the time and effort required to find
relevant information, excluding the time spent reading the actual data. Key aspects
include:
o Search Composition: Preparing and setting up the search parameters.
o Search Execution: The process of running the search query.
o Reading Non-Relevant Items: Time spent dealing with results that do not
meet the user's needs.
A successful IRS minimizes this overhead by streamlining these processes, making it
faster and easier for users to locate relevant information.
7. Advancements and Internet Impact:
Growth of the Internet: The Internet has revolutionized information retrieval,
significantly increasing the volume of accessible data. Early systems like WAIS (Wide
Area Information Servers) laid the groundwork, while modern search engines such as
INFOSEEK and EXCITE have advanced these capabilities.
Current Search Technologies: Modern search engines and tools are capable of
handling vast amounts of textual data and providing efficient access to it. The
development in this field is ongoing, with substantial research and innovation in both
private and public sectors.
8. Multimedia Search:
Image Search: Technologies now enable searching through non-textual data such as
images. Websites like WEBSEEK, DITTO.COM, and ALTAVISTA/IMAGES offer image
search functionalities, allowing users to find visual content based on various criteria.
Difference between an Information Retrieval System and a DBMS:
Information Retrieval is concerned with the representation, storage, organization of,
and access to information items.
The main difference between databases and IR is that databases focus on structured
data while IR focuses mainly on unstructured data
Also, databases are concerned with data retrieval, not information retrieval.
Functional Overview:
• The total information storage and retrieval system consists of four major functional
processes:
– Item Normalization
– Selective dissemination of information (i.e. Mail)
– Archival Document Database Search
– Index database search along with Automatic File Build Process
• The next figure shows the logical view of these capabilities in a single integrated
information retrieval system.
Item normalization:
Item Normalization: The initial step in integrating items into a system is to convert them to
a standard format.
Key Operations:
Token Identification: Extracting individual units such as words from the item.
Token Characterization: Categorizing these tokens.
Stemming: Reducing tokens to their root forms (e.g., removing word endings).
Standardization:
Text: Standardizing different input formats to a system-acceptable format (e.g.,
translating foreign languages to Unicode).
Encoding: ISO-Latin is a standard encoding that includes multiple languages.
Multimedia Normalization:
Video: Common digital standards include MPEG-2, MPEG-1, AVI, and Real Media.
MPEG standards are used for higher quality, while Real Media is often used for lower
quality.
Audio: Typical standards include WAV and Real Media (Real Audio).
Images: Formats vary from JPEG to BMP.
Normalization process:
Parsing (Zoning): The process of dividing an item into meaningful logical sub-divisions called
"zones" (e.g., Title, Author, Abstract, Main Text, Conclusion, References).
Terminology: "Zone" is used instead of "field" to reflect the variable length and logical
nature of these sub-divisions, as opposed to the independent implication of "fields."
Identification and Storage:
Tokens: Identifying and categorizing tokens within these zones.
Storage: Organizing and storing tokens and their associated zones for easy retrieval.
Search and Review:
Search: Users can search within specific zones or categories.
Display Constraints: Limited screen size affects how many items can be reviewed at
once.
Optimization: Users often display minimal data (e.g., Title or Title and Abstract) to fit
more items per screen and expand items of interest for full review.
Processing Tokens: In search processes, "processing token" is used instead of "word" due to
its efficiency in search structures.
Word Identification:
Symbol Classes:
o Valid Word Symbols: Alphabetic characters and numbers.
o Inter-Word Symbols: Blanks, periods, and semicolons.
o Special Processing Symbols: Symbols requiring special handling, like hyphens.
Definition: A word is a contiguous sequence of word symbols separated by inter-
word symbols.
Language Considerations: The significance of symbols varies by language. For instance, an
apostrophe is crucial in foreign names but less so in English possessives.
Text Processing Design:
Symbol Prioritization: Decisions are based on search accuracy requirements and
language characteristics.
Special Processing: Symbols like hyphens may trigger specific rules to generate one
or more processing tokens.
Stop List/Algorithm: Applied to processing tokens to improve system efficiency.
Objective: To conserve system resources by removing tokens with minimal value,
such as very common words with little semantic meaning (e.g., "the," "and," "is").
Function:
Filtering: Excludes frequent, low-value words from indexing and search queries to
avoid irrelevant results.
Efficiency: Reduces index size and improves search query performance.
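To make the token identification and stop-list ideas concrete, here is a minimal Python sketch; the inter-word symbols and the stop list are illustrative choices, not a standard set.

```python
import re

# Illustrative stop list; real systems use much larger, language-specific lists.
STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in"}

def tokenize(text):
    """Split text into processing tokens using inter-word symbols
    (blanks, periods, semicolons, commas, etc.) as separators."""
    return [t.lower() for t in re.split(r"[ \t\n.;:,]+", text) if t]

def remove_stop_words(tokens):
    """Drop frequent, low-value words before indexing."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "The system indexes the text and removes common words."
print(remove_stop_words(tokenize(text)))
# ['system', 'indexes', 'text', 'removes', 'common', 'words']
```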
Zipf’s Law:
• According to Zipf's hypothesis (Zipf-49), when the frequency of occurrence of terms across a corpus of items is examined, most unique words appear only a few times.
Concept: The frequency of a word is inversely proportional to its rank in the frequency list.
Law: Frequency × Rank = constant (approximately)
o Frequency: Number of occurrences of a word.
o Rank: Position of the word in the frequency table.
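A small sketch, assuming the corpus is available as one text string, that tallies word frequencies and prints frequency × rank for the top terms; for a Zipf-like corpus the product stays roughly constant.

```python
from collections import Counter
import re

def zipf_table(corpus_text, top_n=10):
    """Rank words by frequency and show frequency * rank for the top terms."""
    words = re.findall(r"[a-z]+", corpus_text.lower())
    counts = Counter(words)
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        print(f"{rank:>3}  {word:<15} freq={freq:<6} freq*rank={freq * rank}")

# Usage with any large text loaded into a string:
# zipf_table(open("corpus.txt").read())
```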
Identification of Specific Word Characteristics:
Purpose: Helps distinguish different meanings of a word, such as identifying its part
of speech.
Examples:
o "Plane" as an adjective: "level or flat"
o "Plane" as a noun: "aircraft or facet"
o "Plane" as a verb: "to smooth or even"
Stemming:
Definition: Reduces words to their base or root form to group variants with the same
meaning (e.g., "running" and "runner" reduced to "run").
Benefits:
o Precision: Improves search precision by matching different word forms to a
single root.
o Standardization: Simplifies indexing and searching by reducing the number of
unique terms.
o Efficiency: Decreases computational overhead and memory usage.
Finalization:
Application: Once tokens are finalized through stemming, they are updated in the
searchable data structure.
Selective Dissemination of Information (Mail) Process (Push System):
Components:
Search Process: Compares each incoming item with every user's profile to
find matches.
User Profiles (Statements of Interest): Contain broad search criteria and
specify which mail files should receive matching items.
User Mail Files: Store items that match user profiles, typically viewed in the
order they are received.
Document Database Search (Pull System):
Components:
Search Process: Executes the search against the document database.
User Entered Queries: Typically ad hoc queries used for searching.
Document Database: Contains all processed and stored items.
Search Types:
Retrospective Searches: Involve searching for information already processed and
can cover a wide range of time periods.
Data Volume: Databases may contain vast amounts of data, sometimes hundreds of
millions of items.
Data Management:
Static Items: Items are usually not edited after receipt.
Time-Based Partitioning: Databases are often partitioned by time to facilitate
archiving and efficient retrieval.
Implementation:
Structured DBMS: Frequently used to create and manage Private and Public Index
Files.
Synchronization Methods:
Time Synchronization: Aligns multimedia elements with time, such as matching
transcribed text with audio or video segments.
Positional Synchronization: Links multimedia content with specific positions in
textual items, often through hyperlinks.
Relationship to Database Management Systems:
Information Retrieval Systems (IRS):
Focus: Handle "fuzzy" text, which lacks strict standards and can vary widely in
terminology and style.
Characteristics: Deal with diverse vocabulary and ambiguous language.
User Considerations: Must account for various search term possibilities due to the
variability in how information is presented.
Database Management Systems (DBMS):
Focus: Manage "structured" data, which is well-defined and organized into tables.
Characteristics: Each table attribute has a clear semantic description (e.g.,
"employee name," "employee salary").
Search Results Presentation:
IRS: Results are often relevance-ranked and may use features like relevance
feedback to refine searches.
DBMS: Queries yield specific results in a tabular format, facilitating straightforward
retrieval.
DBMS vs. IRS:
DBMS: Used to store structured data with clear attributes but lacks ranking and
relevance feedback.
IRS: Handles fuzzy, unstructured information with ranking and feedback features.
Integration:
Structured Data in IRS: Users need to be resourceful to extract management data
and reports similar to those easily accessed in DBMS.
Integrated Systems:
o INQUIRE DBMS: One of the first to integrate DBMS and IRS features.
o ORACLE DBMS: Includes CONVECTIS, an IRS capability with a thesaurus to
generate themes.
o INFORMIX DBMS: Links to RetrievalWare for integration of structured data
and information retrieval functions.
Digital Libraries and Data Warehouses:
Data Warehouses:
Purpose: Manage and analyze large volumes of structured data, primarily in the commercial sector.
Function: Serve as central repositories integrating data from various sources, supporting decision-making through data manipulation, analysis, and reporting.
Focus: Handle structured data for decision support.
Features: Include data mining capabilities, a key process for discovering hidden patterns and relationships in the data.
Information Retrieval System Capabilities: Search Capabilities,
Browse Capabilities, Miscellaneous Capabilities
Search Capabilities:
Objective: Connect a user’s need with relevant items in the information database.
Search Query Composition:
Types: Can include natural language text and/or Boolean logic indicators.
Term Weighting: Some systems allow numeric values (0.0 to 1.0) to prioritize search
terms (e.g., "automobile emissions(.9)" vs. "sulfur dioxide(.3)").
Search Parameters:
Scope: Can restrict searches to specific parts or zones within an item.
Benefit: Enhances precision by avoiding irrelevant sections, especially in large
documents.
Functions in Understanding Search Statements:
Term Relationships: Includes algorithms that define how terms relate (e.g., Boolean
logic, natural language processing, proximity searches, contiguous word phrases,
fuzzy searches).
Word Interpretation: Involves methods for interpreting terms (e.g., term masking,
numeric and date range searches, contiguous word phrases, concept/thesaurus
expansion).
Terminology:
Processing Token, Word, Term: These terms are used interchangeably to refer to
searchable units within documents.
Boolean Logic:
Purpose: Allows users to logically relate multiple concepts to define information
needs.
Operators:
o AND: Intersection of sets.
o OR: Union of sets.
o NOT: Difference between sets.
Parentheses: Used to specify the order of operations; without them, default
precedence (e.g., NOT, then AND, then OR) is followed.
Processing: Queries are processed from left to right unless parentheses alter the
order.
Special Case:
M of N Logic: Accepts items that contain any subset of a given set of search terms.
Example of Boolean Search:
Query: "Find any item containing any two of the following terms: 'AA,' 'BB,' 'CC'."
Boolean Expansion: Translates to: ((AA AND BB) OR (AA AND CC) OR (BB AND CC)).
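A hedged sketch of Boolean operations over inversion lists held as Python sets, including the M-of-N expansion shown above; the postings data is invented for illustration.

```python
from itertools import combinations

# Hypothetical inversion lists: term -> set of document identifiers.
postings = {
    "AA": {1, 2, 5, 9},
    "BB": {2, 3, 9},
    "CC": {4, 5, 9},
}

def boolean_and(*terms):
    return set.intersection(*(postings[t] for t in terms))

def boolean_or(*terms):
    return set.union(*(postings[t] for t in terms))

def boolean_not(term_a, term_b):
    return postings[term_a] - postings[term_b]

def m_of_n(m, terms):
    """Items containing at least m of the given terms:
    OR together the AND of every m-term combination."""
    hits = set()
    for combo in combinations(terms, m):
        hits |= boolean_and(*combo)
    return hits

# "any two of AA, BB, CC" expands to (AA AND BB) OR (AA AND CC) OR (BB AND CC)
print(m_of_n(2, ["AA", "BB", "CC"]))   # {2, 5, 9}
```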
System Capability:
Most systems: Support both Boolean operations and natural language interfaces.
Proximity Search:
Purpose: Restricts the distance between two search terms within a text, improving
precision by indicating how closely terms are related.
Semantic Concept: Terms found close together are more likely to be related to a
specific concept.
Format: TERM1 within “m” “units” of TERM2
o Distance Operator “m”: Integer number.
o Units: Characters, Words, Sentences, or Paragraphs.
Direction Operator: Specifies whether the second term must be before or after the
first term within the specified distance. Default is either direction.
Special Cases:
o Adjacent (ADJ) Operator: Distance of one, typically with a forward-only
direction.
o Distance of Zero: Terms must be within the same semantic unit.
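A minimal sketch of a word-level proximity test, assuming each term's inversion list records word positions per document (as described for inverted files later in these notes); the position data is invented.

```python
# Hypothetical position lists: term -> {doc_id: [word positions]}
positions = {
    "emission": {1: [4, 40], 2: [7]},
    "standards": {1: [6, 90], 2: [30]},
}

def within_words(term1, term2, m):
    """Return doc ids where term1 and term2 occur within m words
    of each other (either direction, the usual default)."""
    hits = set()
    common_docs = positions[term1].keys() & positions[term2].keys()
    for doc in common_docs:
        for p1 in positions[term1][doc]:
            for p2 in positions[term2][doc]:
                if abs(p1 - p2) <= m:
                    hits.add(doc)
    return hits

print(within_words("emission", "standards", 3))   # {1}  (positions 4 and 6)
```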
Contiguous Word Phrases (CWP):
Definition: Two or more words treated as a single semantic unit.
Example: “United States of America” as a single search term representing a specific
concept (a country).
Function: Acts as a unique search operator, similar to the Proximity (Adjacency)
operator but more specific.
Comparison:
o For two terms, CWP and Proximity are identical.
o For more than two terms, CWP cannot be directly replicated with Proximity
and Boolean operators because those operators are binary, whereas CWP is an
N-ary operator.
Terminology:
o WAIS: Calls them Literal Strings.
o RetrievalWare: Refers to them as Exact Phrases.
In WAIS: Multiple Adjacency (ADJ) operators define a Literal String, e.g., “United”
ADJ “States” ADJ “of” ADJ “America.”
Fuzzy Searches:
Purpose: Locate items whose words are similar to, but not exact matches of, the entered
search term, compensating for misspellings and for errors introduced when items are input
(e.g., by Optical Character Recognition).
OCR Process:
Error Rate: Typically achieves 90-99% accuracy but can introduce errors due to image
imperfections or recognition limitations.
Numeric Ranges:
o "125-425": Finds numbers between 125 and 425, inclusive.
o ">125": Finds numbers greater than 125.
o "<233": Finds numbers less than 233.
Date Ranges:
o "4/2/93-5/2/95": Finds dates from April 2, 1993, to May 2, 1995.
o ">4/2/93": Finds dates after April 2, 1993.
o "<5/2/95": Finds dates before May 2, 1995.
Processing: These queries are handled through normalization, which allows for
precise and meaningful searches beyond simple term-based methods.
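A sketch of the normalization step for date ranges, assuming the m/d/yy format used in the examples above and 20th-century two-digit years; the helper names are illustrative.

```python
from datetime import date

def parse_date(text):
    """Normalize an 'm/d/yy' token (as in the examples above) to a date object."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(1900 + year, month, day)   # assumes 20th-century two-digit years

def in_date_range(token, low=None, high=None):
    """True if the normalized date falls inside the (inclusive) range."""
    value = parse_date(token)
    if low is not None and value < parse_date(low):
        return False
    if high is not None and value > parse_date(high):
        return False
    return True

print(in_date_range("6/15/94", low="4/2/93", high="5/2/95"))   # True
print(in_date_range("6/15/96", low="4/2/93", high="5/2/95"))   # False
```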
Concept/Thesaurus Expansion:
Thesauri Types:
Usage:
Recall vs. Precision: Thesauri help broaden searches by including related terms,
improving recall. However, this can sometimes reduce precision by including
unrelated terms.
User Interaction: Some systems allow users to manually select and add terms from
thesauri or concept trees to refine searches, making searches more specific to user
needs.
Concept Classes:
Natural Language Queries allow users to type a full sentence or a prose statement in
everyday language to describe what they are searching for, instead of using specific search
terms and Boolean operators.
Benefits:
Accuracy: Longer and more detailed prose statements can lead to more accurate
search results as they provide more context.
Challenges:
System Functionality:
Translation: The system translates natural language input into a format it can
process, often involving parsing the sentence structure and understanding the
context.
Negation Handling: Correctly interpreting and applying negation is critical, as users
may want to exclude specific information or conditions.
User Behaviour:
Sentence Fragments: Users often enter sentence fragments rather than complete
sentences to save time. Systems need to handle these fragments effectively,
understanding the intended meaning despite the incomplete input.
Relevance feedback allows users to refine searches based on the relevance of items
they find, even without inputting full sentences. Users can select relevant items or
text segments, and the system adjusts its search based on this feedback.
Most commercial systems provide a user interface that supports both natural
language queries and Boolean logic, accommodating negation through the Boolean
portion. While natural language interfaces improve recall, they may decrease
precision when negation is required.
UNIT-2
Cataloging and Indexing
Indexing:
Converts items (e.g., documents) into a searchable format.
Can be done manually or automatically by software.
Facilitates direct or indirect search of items in the Document Database.
Concept-Based Representation:
Some systems transform items into a concept-based format rather than just text.
Information Extraction:
Extracts specific information to be normalized and entered into a structured
database.
Closely related to indexing but focuses on specific concepts.
Normalization:
Modifies extracted information to fit into a structured database management system
(DBMS).
Automatic File Build:
Refers to the process of transforming extracted information into a compatible
format for structured databases.
History of indexing:
Early History:
Cataloging: Originally known as cataloging, indexing has always aimed to help users
efficiently access item contents.
Pre-19th Century: Early cataloging focused on organizing information with basic
bibliographic details.
19th Century:
Hierarchical Methods: Introduction of more sophisticated systems like the Dewey
Decimal System, which organizes subjects into a hierarchical structure.
1960s:
Library of Congress: Began exploring computerization for cataloging.
MARC Standard: Developed the MARC (MAchine Readable Cataloging) standard to
create a standardized format for bibliographic records, facilitating electronic
management and sharing of catalog data.
1965:
DIALOG System: Developed by Lockheed for NASA, and commercialized in 1978. It
was one of the earliest commercial indexing systems, providing access to numerous
databases globally.
Objectives of Indexing:
The evolution of Information Retrieval (IR) systems has fundamentally changed the
approach and objectives of indexing. Here’s an overview of these changes:
Total Document Indexing:
o Digital Documents: The shift to digital formats allows indexing and searching
the entire text of documents, a method known as total document indexing.
o Searchable Data Format: This approach treats all words in a document as
potential index terms, making the entire text searchable.
Processing Tokens:
o Normalization: In modern systems, item normalization converts all possible
words into processing tokens—units that represent the document’s content
in a searchable format.
o Automatic Weighting: Systems can automatically weight these processing
tokens based on their importance, improving search relevance.
Manual Indexing:
o Value Addition: Despite the advantages of full-text search, manual indexing
remains valuable. It involves selecting and abstracting key concepts, offering
context and relevance that automated systems might miss.
Key procedural decisions in the indexing process for organizations with multiple indexers:
Scope of Indexing:
o Definition: Refers to the level of detail and breadth of coverage that the
index will provide.
o Implications: Determines how comprehensively the index will cover the
subject matter, affecting the granularity of terms and concepts included.
Linking Index Terms:
o Definition: Involves connecting related terms or concepts within the index.
o Purpose: Facilitates better navigation and understanding by showing
relationships between terms, helping users find related information more
easily.
o Decision Impact: Determines how terms will be organized and interrelated
in the index, influencing the usability and effectiveness of the index for end
users.
Scope of Indexing:
Indexing, particularly manual indexing, involves several challenges due to the interaction
between authors and indexers:
1. Terminology Differences:
o Author's Specialized Terms: Authors often use field-specific terminology that
might be unfamiliar to indexers. For instance, a medical researcher might use
technical terms like "myocardial infarction," which are not commonly known
outside the medical field.
o Indexer’s Vocabulary: Indexers may not always be familiar with the
specialized vocabulary used by authors. As a result, they might select more
general terms like "heart disease" instead of the precise term "myocardial
infarction," potentially leading to less accurate indexing.
2. Expertise Gap:
o Author’s Expertise: Authors are typically experts in their subject matter, and
their writing reflects complex and nuanced discussions based on their deep
knowledge.
o Indexer’s Expertise: Indexers might not have the same level of expertise in
the specific subject area as the authors. This can result in the indexers
missing the significance or context of certain concepts, affecting the precision
and depth of the indexing.
3. Deciding on Indexing Completeness:
o Balancing Thoroughness and Practical Constraints: Indexers must determine
when to stop adding terms to the index, balancing the need for
comprehensive coverage with practical constraints such as time and cost.
When deciding on the indexing level for an item's concepts, two key criteria come into play:
exhaustivity and specificity. Each of these criteria impacts the effectiveness and efficiency
of the indexing process:
1. Exhaustivity:
o Definition: This criterion relates to the extent to which concepts within the
document are fully indexed. It involves determining whether to include
every relevant detail or just the major concepts.
o Example: In a lengthy document about microprocessors, if a small section
discusses “on-board caches,” the indexer must decide if this specialized detail
should be included. High exhaustivity would mean including this term to
ensure even minor but relevant details are indexed, while low exhaustivity
might exclude it to focus on more prominent topics.
2. Specificity:
o Definition: Specificity refers to the level of detail in the index terms. It
involves choosing between very precise terms or more general ones.
o Example: For the same document, indexers must choose whether to use
broad terms like “processor” and “microcomputer” or a specific model like
“Pentium.” High specificity uses detailed terms that precisely match the
content, while low specificity employs broader terms.
Example: For a document discussing indexing practices, a detailed index (like the one in the
Kowalski textbook) might include terms such as “indexing,” “indexer knowledge,”
“exhaustivity,” and “specificity,” ensuring that users can find detailed and specific
information.
Choosing the right balance between exhaustivity and specificity is crucial to meet the needs
of users while managing resources effectively.
Indexing Process:
Indexing Portions
Title Only:
o Advantages: Cost-effective, less effort required.
o Disadvantages: May miss critical details; limited to what’s conveyed in the title.
Title and Abstract:
o Advantages: Provides more context than title-only indexing.
o Disadvantages: Still may omit important details found in the body of the document.
Without Weighting:
o Example: A paper on "Artificial Intelligence in Healthcare" covering topics like
Machine Learning Algorithms, Healthcare Data Security, Medical Imaging
Techniques, and Patient Privacy Issues will list each term equally in the index.
o Result: All terms are treated with equal prominence, regardless of how extensively
they are discussed in the document.
With Weighting:
o Example: If the paper focuses heavily on "Machine Learning Algorithms" but only
briefly mentions "Patient Privacy Issues," weighting would highlight "Machine
Learning Algorithms" more prominently.
o Result: Terms are indexed based on their importance and frequency in the
document.
Weighting of Index Terms
Example of Weighting:
Machine Learning Algorithms (Weight: 5): The term is extensively discussed, central
to the paper.
Healthcare Data Security (Weight: 3): Significant but secondary focus.
Medical Imaging Techniques (Weight: 2): Briefly mentioned, relevant but not
central.
Patient Privacy Issues (Weight: 1): Minor point, least emphasized.
Benefits of Weighting:
Improved Precision: Easier to find detailed information on major topics.
Enhanced Relevance: Major terms are presented more prominently.
Efficient Retrieval: Quickly locate key themes and important sections.
Challenges:
Increased Complexity: Requires additional effort in assigning and managing weights.
Advanced Data Structures: Storing and retrieving weighted terms involves more
complex systems.
Precoordination and Linkages
Linkages:
Definition: Connections made between index terms to show relationships or
attributes.
Example: In a document about "oil drilling in Mexico," linkages might connect terms
like "Mexico" and "oil drilling."
Precoordination:
Definition: Relationships between terms are established at the time of indexing.
Purpose: Allows for immediate recognition of relationships between terms,
facilitating more specific searches.
Example: Index terms "Mexico" and "oil drilling" are linked together at indexing
time.
Postcoordination:
Definition: Coordination of terms occurs at search time, using logical operators like
"AND" to combine terms.
Purpose: Allows flexibility in search queries but requires that all terms appear
together in the index.
Example: A search for "Mexico AND oil drilling" finds documents where both terms
are present, but their relationship is not predefined.
Factors in the Linkage Process:
1. Number of Terms:
o Determines how many terms can be linked together.
o Example: Linking "CITGO – oil drilling – Mexico" involves three terms.
2. Ordering Constraints:
o Defines if the sequence of terms matters.
o Example: The sequence "Mexico – oil drilling" might be different from "oil
drilling – Mexico" in terms of relevance.
3. Additional Descriptors:
o Provides extra information about the role of each term.
o Example: For "CITGO’s oil drilling in Mexico affecting local communities":
Descriptors: "CITGO (Source) – oil drilling (Activity) – Mexico
(Location)."
Purpose: Clarifies the role of each term, aiding in precise searches.
Relationship Between Multiple Terms
When multiple terms are used in indexing, the relationships between these terms can be
illustrated through various techniques. Each term needs to be qualified and linked to
another term to effectively describe a single semantic concept. Here’s how these
relationships can be managed:
1. Order of the Terms
Definition: The sequence in which terms are linked can provide additional context
about their relationships.
Example: In "CITGO – oil drilling – Mexico," the order implies that CITGO is
performing oil drilling activities in Mexico. The order helps clarify the relationship
and the context.
2. Positional Roles
Definition: This technique uses the position of terms in a sequence to define their
roles or relationships within an index entry.
Example: In "U.S. – oil refineries – Peru":
o Position 1: "U.S." (Source)
o Position 2: "oil refineries" (Activity)
o Position 3: "Peru" (Location)
Purpose: Each term’s position indicates its function or significance in relation to the
other terms.
3. Modifiers
Definition: Additional terms or qualifiers that provide more context or detail to the
primary terms.
Example: "CITGO’s new oil drilling project in Mexico" uses modifiers like "new" and
"project" to add detail about the main terms "CITGO," "oil drilling," and "Mexico."
Limitations of Fixed Positions
Fixed Number of Positions: If the sequence has a fixed number of positions,
including additional details like impact or timeframe might be challenging. The fixed
positions may not allow for flexibility in accommodating extra roles without causing
ambiguity.
Example: If the indexing system only allows for three fixed positions, adding an extra
role such as the impact of the drilling might be difficult unless the system supports
dynamic adjustments or additional descriptors.
By understanding and applying these methods—order of terms, positional roles, and
modifiers—indexers can create more meaningful and useful indexes, enhancing the
precision and relevance of search results.
Modifiers in Indexing
Modifiers provide a flexible approach to indexing by allowing additional context or details to
be associated with index terms. This method can streamline indexing and make it more
comprehensive, especially when dealing with multiple related terms.
Example Scenario
Consider a document about "U.S. introducing oil refineries into Peru, Bolivia, and
Argentina."
Without Modifiers (Positional Roles):
Separate entries are needed for each location:
o "U.S. – oil refineries – Peru"
o "U.S. – oil refineries – Bolivia"
o "U.S. – oil refineries – Argentina"
With Modifiers:
A single entry can be used with multiple locations and their respective roles:
o "U.S. – oil refineries – Peru (affected country) – Bolivia (affected country) –
Argentina (affected country)"
If the document discusses the impact of the oil refineries, modifiers can further
specify this:
o "U.S. – oil refineries – Peru (affected country, economic impact) – Bolivia
(affected country, economic impact) – Argentina (affected country, economic
impact)"
Advantages of Using Modifiers:
Efficiency: Reduces the need for multiple entries by combining related information
into a single index entry.
Flexibility: Allows the addition of various descriptors or roles to each term, providing
more context.
Clarity: Enhances the understanding of how different terms are related and their
significance within the document.
Challenges:
Complexity: More sophisticated indexing systems are required to handle and
interpret modifiers.
Consistency: Ensuring consistent application of modifiers across different documents
and contexts can be challenging.
Automatic Indexing
Automatic indexing is a process where systems autonomously select index terms for items
such as documents, eliminating the need for manual human intervention. This is in contrast
to manual indexing, where an individual determines the index terms based on their
expertise and understanding of the content.
Types of Automatic Indexing
1. Simple Indexing:
o Definition: Uses every word in the document as an index term. This method
is also known as total document indexing.
o Example: A document about "Artificial Intelligence" might be indexed with
terms like "Artificial," "Intelligence," "AI," and any other words from the
document.
2. Complex Indexing:
o Definition: Aims to replicate human indexing by selecting a limited number
of index terms that capture the major concepts of the item. This involves
more sophisticated algorithms to identify and prioritize key terms.
o Example: For a document on "Artificial Intelligence in Healthcare," complex
indexing might focus on terms like "Machine Learning," "Healthcare Data
Security," and "Medical Imaging" rather than including every term.
Advantages of Automatic Indexing
Cost Efficiency:
o Initial Costs: While there may be significant initial hardware and software
costs, the ongoing expenses are generally lower compared to the cost of
employing human indexers.
o Example: Automated systems can process thousands of documents at a
fraction of the cost it would take for human indexers to manually index the
same volume.
Processing Speed:
o Efficiency: Automatic indexing is typically much faster, with systems capable
of indexing a document in seconds, as opposed to the several minutes a
human indexer might require.
o Example: A large dataset of academic papers can be indexed overnight by an
automated system, whereas manual indexing could take weeks.
Consistency:
o Uniform Results: Algorithms provide consistent indexing results across all
documents, which helps maintain uniformity.
o Example: In TREC-2 experiments, different human indexers had about a 20%
discrepancy in indexing the same documents, whereas automated systems
can eliminate such variability.
Advantages of Human Indexing
Concept Abstraction: Human indexers can grasp and articulate the core ideas or
themes of a document beyond its literal text. This involves understanding the
underlying concepts and presenting them in a way that captures the essence of the
document.
o Example: A document discussing "climate change" can be indexed under the
broader category of "environmental issues," which reflects a higher level of
abstraction.
Contextual Understanding: Human indexers interpret the context in which terms are
used, distinguishing between different meanings based on contextual clues.
o Example: The term "bank" in a financial document would be interpreted as a
financial institution, whereas in a geographical context, it would refer to the
side of a river.
Judgment of Concept Value: Human indexers evaluate the significance of different
concepts within a document and prioritize them according to their relevance to the
intended audience or specific needs.
o Example: In a medical document, terms related to treatment may be
emphasized over general symptoms, reflecting their greater importance to
medical professionals.
Types of Indexing
Weighted Indexing: This approach assigns weights to index terms based on their
frequency and importance within the document. Higher weights indicate greater
significance.
o Method: Often uses normalized values to rank search results by relevance.
Unweighted Indexing: Records the presence (and sometimes location) of index
terms without differentiating their significance.
o Method: Early systems treated all terms equally, which can be less effective
in capturing the importance of specific concepts.
Concept Indexing: Maps the document into a representation based on conceptual
meanings rather than direct text. The index values are derived from these
conceptual representations.
Luhn’s Resolving Power: Luhn proposed that a term's importance within a document
correlates with its frequency. This implies that terms appearing more frequently are
considered more significant in the context of the document.
Distribution of Terms: Research by Bookstein, Klein, and Raita indicated that
important terms tend to cluster together in a document rather than being spread
out uniformly. This clustering of significant terms supports the notion that term
frequency and location are crucial in determining term importance.
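As a concrete illustration of weighted indexing in the spirit of Luhn's observation, this sketch assigns each processing token a weight equal to its frequency normalized by the maximum term frequency in the item; this normalization is one simple choice among several used in practice.

```python
from collections import Counter

def weighted_index(tokens):
    """Assign each processing token a weight in (0, 1]:
    its frequency divided by the maximum term frequency in the item."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

doc = ["machine", "learning", "machine", "learning", "privacy", "machine"]
print(weighted_index(doc))
# {'machine': 1.0, 'learning': 0.666..., 'privacy': 0.333...}
```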
UNIT-II
Data Structure: Introduction to Data Structure:
Two major data structures in information systems:
Document Manager Data Structure: Stores and manages the received items in their normalized form.
Searchable Data Structure: Supports search functions, storing processing tokens and related data.
These structures are shown in Figure 4.1, which expands on the document creation process from Chapter 1.
Not covered in this chapter (requires understanding of finite automata and languages like
regular expressions).
Stemming:
Reduces word variations to a base form, improving recall but may reduce precision.
N-gram Structure:
Enhances efficiency and supports complex concept searches compared to full word
inversion.
Treat text as a continuous stream, enabling unique search algorithms based on string
patterns.
Signature Files:
Reduces the searchable subset, which can be further refined with additional search
methods.
Hypertext Structure:
A structure in which items (nodes) are linked directly to one another; covered in detail later in this unit.
Stemming Algorithms:
History of Stemming:
Introduced in the 1960s to enhance performance by reducing the number of unique words.
With advancements in computing power, its relevance for performance has decreased; now
primarily focused on improving recall versus precision.
Trade-offs in Stemming:
Stemming requires additional processing but reduces search time by consolidating word
variants under a single index.
Offers an alternative to Term Masking, which involves merging indexes for each variant of a
search term.
The stem typically carries the word's primary meaning, while affixes add minor syntactical
changes.
Proper nouns and acronyms (e.g., names) are common and should not be stemmed.
Misspellings and exceptions further reduce the effectiveness of stemming in large corpora
(e.g., TREC database).
Helps ensure all relevant forms of a term are retrieved, boosting recall.
For example, “calculate” and its variants are stemmed to a common form, improving the
likelihood of retrieving all relevant items.
Impact on Precision:
While stemming can increase recall, it may decrease precision by generalizing the search.
Precision suffers if irrelevant items are included without relevance guarantees in the
retrieval process.
Systems must identify certain word categories (e.g., proper names, acronyms) that should
not be stemmed.
These categories lack a common core concept, so stemming could distort their meaning.
Stemming may lead to information loss, affecting higher-level NLP tasks like discourse
analysis.
Verb tenses, essential for understanding temporal context, can be lost in stemming (e.g.,
whether an action is past or future).
Affix Removal: Strips prefixes and suffixes, often iteratively, to find the root stem.
Table Lookup: Uses a large dictionary or thesaurus for predefined stem relationships.
Successor Stemming: Determines the optimal stem length based on prefix overlap,
balancing statistical and linguistic accuracy.
Porter Algorithm: Widely used, but can cause precision issues and user confusion.
Kstem Algorithm (INQUERY System): Combines simple rules with dictionary-based lookups
for accuracy.
Porter’s algorithm is particularly popular but may result in semantic shifts that confuse
users.
Stemming applies to both query terms and database text; transformations may
unintentionally shift meanings.
If a stemmed query term diverges in meaning, users might distrust the system due to
unexpected results.
The Porter Algorithm is based upon a set of conditions of the stem, suffix and prefix, and associated
actions given the condition. Some examples of stem conditions are: 1. The measure, m, of a stem is a
function of sequences of vowels (a, e, i, o, u, y) followed by a consonant. If V is a sequence of vowels
and C is a sequence of consonants, then every stem can be written in the form [C](VC)^m[V], where
the bracketed parts are optional and m, the number of VC sequences, is the measure of the stem.
1b1 rules are expansion rules that correct stems for proper conflation. For example, stemming
"skies" drops the "es", producing "ski", which is the wrong concept; the "i" should be changed to "y"
to give "sky".
Another rule in step 4, which removes "ic", cannot then be applied, since only one rule from
each step is allowed to be applied.
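A toy suffix-stripping sketch, not the Porter algorithm itself: it applies at most one rule per word, mirroring the one-rule-per-step constraint noted above, and its rough output shows why clean-up rules such as 1b1 are needed.

```python
# Ordered (suffix, replacement) rules, most specific first; illustrative only.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least two letters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["running", "computers", "skies", "glasses"]:
    print(w, "->", simple_stem(w))
# running -> runn    (a real stemmer needs clean-up rules like Porter's step 1b)
# computers -> computer
# skies -> ski       (a 1b1-style rule would then correct 'ski' to 'sky')
# glasses -> glass
```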
An alternative stemming approach uses a dictionary look-up mechanism. This approach supplements
basic stemming rules by consulting a dictionary to manage exceptions more accurately. Here's a
summary:
1. Dictionary Look-Up Mechanism: This approach reduces words to their base form by
checking against a dictionary, which helps avoid some pitfalls of pure algorithmic stemming.
Stemming rules are used, but exceptions (like irregular forms) are handled by looking up the
stemmed term in a dictionary to find an appropriate base form.
2. INQUERY and Kstem: The INQUERY system employs the Kstem stemmer, a morphological
analyzer that reduces words to a root form. Kstem seeks to prevent words with different
meanings from being conflated (e.g., "memorial" and "memorize" both reduce to "memory,"
but Kstem avoids conflating non-synonyms when possible). This stemmer uses six major data
files, including lexicons, exception lists, and direct conflation lists, to refine accuracy.
3. RetrievalWare System: This system leverages a large thesaurus/semantic network with over
400,000 words. It uses dictionary look-ups for morphological variants and handles common
word endings like suffixes or plural forms.
Successor Stemmers
Successor stemmers are based on the concept of successor varieties, which are determined by
analyzing the segments of a word and their distribution in a corpus. The method is inspired by
structural linguistics, focusing on how morphemes and word boundaries are identified by phoneme
patterns. The main idea is to segment a word into parts and select the appropriate segment as the
stem based on its successor variety.
Key Concepts:
1. Successor Variety:
o The successor variety of a word segment is the number of distinct letters that follow
it, plus one for the current word.
o For example, for the prefix "bag", the successor variety would be based on the
number of words that share the first three letters but differ in the fourth letter.
2. Symbol Tree:
o A symbol tree is a graphical representation of the words that shows the successor variety
for each prefix. For example, for the prefix "b" in the words "bag", "barn", "bring", "both",
"box", and "bottle", the successor variety of "b" is 3, since three distinct letters (a, o, r)
follow "b" in this word set.
3. Methods for Word Segmentation: The successor variety is used to determine where to
break a word. The following segmentation methods are used:
o Cutoff Method: A cutoff value is selected to define the stem length. The value can
vary depending on the word set.
o Peak and Plateau: A segment break is made after a character whose successor
variety is greater than that of both the preceding and the following character (a
peak), or where the variety levels off (a plateau).
o Complete Word Method: A break is made after a segment if that segment is itself a
complete word in the corpus.
o Entropy Method: Uses the distribution of successor varieties and Shannon's entropy
to determine where to break the word based on statistical patterns. Let |Dak| be
the number of words beginning with the k-length sequence of letters a, and let
|Dakj| be the number of words in Dak whose successor is the letter j. The probability
that a member of Dak has successor j is |Dakj|/|Dak|, and the entropy (Average
Information as defined by Shannon-51) of Dak is:
H(Dak) = Σj −(|Dakj|/|Dak|) · log2(|Dakj|/|Dak|),
where the sum runs over the possible successor letters j.
4. Selection of the Stem:
o After segmentation, the correct segment to use as the stem is chosen based on the
frequency of the segment in the corpus.
o The rule used by Hafer and Weiss is:
Example:
For the word "boxer" and the set of words “bag”, “barn”, “bring”, “both”, “box”, and “bottle”:
Peak and Plateau method: Cannot be applied as the successor variety monotonically
decreases.
The Peak and Plateau and Complete Word methods do not require a cutoff value, which is an
advantage in some cases.
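A sketch that computes successor varieties for the prefixes of "boxer" against the small word set above, counting only the distinct letters that follow each prefix (some formulations also add one when the prefix is itself a complete word in the corpus).

```python
def successor_varieties(word, corpus):
    """For each prefix of `word`, count the distinct letters that follow
    that prefix among the corpus words."""
    result = {}
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        result[prefix] = len(successors)
    return result

corpus = ["bag", "barn", "bring", "both", "box", "bottle"]
print(successor_varieties("boxer", corpus))
# {'b': 3, 'bo': 2, 'box': 0, 'boxe': 0, 'boxer': 0}  -- monotonically decreasing,
# so the Peak and Plateau method cannot be applied, as noted above.
```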
Advantages:
Combines techniques: The combination of multiple methods (e.g., cutoff, peak and plateau)
tends to produce more accurate results than using a single method.
Conclusions:
Frakes' Conclusions:
Little difference between stemmers, except the Hafer and Weiss stemmer.
ERRT compares stemming algorithms using the Understemming Index (UI) and the Overstemming
Index (OI).
UI: Measures variants of the same concept that fail to be conflated to the same stem.
OI: Measures distinct terms incorrectly grouped under the same stem.
Algorithm Comparison:
Porter algorithm: Higher UI, lower OI.
The comparison is less meaningful due to different objectives: Porter is a "light" stemmer,
Paice is a "heavy" stemmer.
General Observations:
Precision loss can be minimized by ranking terms, categorizing them, and excluding some
from stemming.
Stemming is not a major compression technique but can reduce dictionary sizes and
processing time for search terms.
Inverted File Structure:
A common data structure used in both Database Management and Information Retrieval
Systems. It consists of three basic files:
1. Document File: Stores the items (documents) themselves.
2. Inversion Lists (Posting Files): Store the list of document identifiers for each word.
3. Dictionary: A sorted list of all unique words and pointers to their corresponding
inversion lists.
The structure is called "inverted" because it stores a list of documents for each word, rather
than storing words for each document.
Each document is given a unique numerical identifier, which is stored in the inversion list for
the corresponding word.
Dictionary:
Optimization Techniques:
1. Zoning: The dictionary may be partitioned by different zones (e.g., "Abstract" vs. "Main
Body") in an item, increasing overhead when searching the entire item versus a specific
zone.
2. Short Inversion Lists: If an inversion list contains only one or two entries, those can be
stored directly in the dictionary.
3. Word Positions: For proximity searches, phrases, and term weighting algorithms, the
inversion list may store the positions of each word in the document (e.g., "bit" appearing at
positions 10, 12, and 18 in document #1).
4. Weights: Weights for words can also be stored in the inversion lists.
Words with special characteristics (e.g., dates, numbers) may be stored in their own
dictionaries for optimized internal representation and manipulation.
Search Process:
When a search is performed, the inversion lists for the terms in the query are retrieved and
processed.
The results are a final hit list of documents that satisfy the query.
Document numbers from the inversion list are used to retrieve the documents from the
Document File.
Query Example:
The Dictionary is used to find the inversion lists for "bit" and "computer."
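A minimal sketch of the dictionary plus inversion-list structure with word positions, followed by a two-term AND query in the spirit of the "bit" and "computer" example; the documents are invented.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [positions]}} --
    the keys form the dictionary, the values are the inversion lists."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

docs = {
    1: "the computer stores each bit in memory",
    2: "a bit of history about the computer industry",
    3: "database systems and information retrieval",
}
index = build_inverted_index(docs)

# AND query: intersect the inversion lists for "bit" and "computer".
hits = index["bit"].keys() & index["computer"].keys()
print(sorted(hits))   # [1, 2]
```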
Use of B-Trees:
Inversion lists may be stored at the leaf level or referenced in higher-level pointers.
o Leaf Nodes: All leaves are at the same level or differ by at most one level.
B-Trees in Heavy Updates:
Cutting and Pedersen (1990) described B-trees as an efficient inverted file storage
mechanism for data that undergoes heavy updates.
Items in information systems are seldom modified once produced, allowing for efficient
management of document files and inversion lists.
Document Files & Inversion Lists grow to a certain size and are then "frozen" to prevent
further modifications, starting a new structure for new data.
Archived databases containing older data are available for queries, reducing operations
overhead for newer queries.
The archived databases may be permanently backed up since older items are rarely deleted
or modified.
New inverted databases have overhead for adding new words and inversion lists, but
knowledge from archived databases can help establish the initial dictionary and inversion
structure.
Inversion List Structures provide optimal performance in large database queries by
minimizing data flow—only relevant data is retrieved from secondary storage.
Inversion list structures are effective for storing concepts and their relationships.
Each inversion list represents a concept and serves as a concordance of all items containing
that concept.
Finer resolution of concepts can be achieved by storing locations and weights of items in
inversion lists.
While inversion lists are useful for certain types of queries, other structures may be needed
for Natural Language Processing (NLP) algorithms to maintain the necessary semantic and
syntactic information.
N-Grams are a special technique for conflation (stemming) and a unique data structure in
information systems.
Unlike stemming, which aims to determine the stem of a word based on its semantic
meaning, n-grams do not consider semantics.
N-Gram Transformation:
The searchable data structure is transformed into overlapping n-grams, which are then used
to create the searchable database.
Examples of n-grams for the phrase "sea colony" include bigrams, trigrams, and pentagrams
(see the sketch below).
Interword Symbols:
For n-grams with n > 2, some systems allow interword symbols (e.g., space, period,
semicolon, colon, etc.) to be part of the n-gram set.
The symbol "#" is used to represent the interword symbol in such cases.
The interword symbol is typically excluded from the single-character n-gram option.
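A sketch generating the overlapping n-grams discussed above for "sea colony": word-internal bigrams, and a stream of trigrams that uses "#" as the interword symbol.

```python
def word_ngrams(text, n):
    """Overlapping n-grams generated within each word (no boundary crossing)."""
    grams = []
    for word in text.split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def stream_ngrams(text, n, interword="#"):
    """Overlapping n-grams over the whole stream, with '#' marking inter-word
    blanks (the notes say some systems allow this for n > 2)."""
    stream = text.replace(" ", interword)
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

print(word_ngrams("sea colony", 2))
# ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(stream_ngrams("sea colony", 3))
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```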
History of N-Grams:
The first use of n-grams dates back to World War II, where they were used by
cryptographers.
Fletcher Pratt mentioned that with the backing of bigram and trigram tables,
cryptographers could dismember a simple substitution cipher (Pratt-42).
Adamson (1974) described the use of bigrams as a method for conflating terms. However,
this does not align with the usual definition of stemming because n-grams produce word
fragments rather than semantically meaningful word stems.
Trigrams have been used in spelling error detection and correction by several researchers
(Angell-83, McIllroy-82, Morris-75, Peterson-80, Thorelli-62, Wang-77, Zamora-81).
N-grams (particularly trigrams) are used to analyze the probability of occurrence in the
English vocabulary, and any word containing rare or non-existent n-grams is flagged as a
potential error.
Zamora (1981) showed that trigram analysis was effective for identifying misspellings and
transposed characters.
D’Amore and Mah (1985) used various n-grams as index elements for inverted file systems.
N-grams have been core data structures for encoding profiles in the Logicon LMDS system
(Yochum-95) used for Selective Dissemination of Information.
The Acquaintance System (Damashek-95, Huffman-95) uses n-grams to store the searchable
document file for retrospective search in large textual databases.
PAT Trees (PAtricia Trees):
Each position in the input text serves as an anchor point, and substrings are created from
that point to the end of the text.
Sistrings (semi-infinite strings) are created by adding null characters when necessary to
handle substrings that extend beyond the input text's original length.
The structure is suited for search processing in applications involving text and images, as
well as genetic databases (e.g., Manber-90).
Key Features:
1. Sistrings:
o A sistring starts at any position within the text and extends to the end.
o Each substring is unique, and substrings can even extend beyond the input stream
by appending null characters.
o Example sistrings for the text "Economics for Warsaw is complex" are the full text,
the text starting at the second character, the text starting at the third character, and
so on.
2. Binary Tree Structure:
o The tree is binary, with left branches for zeros and right branches for ones, based on
the individual bits of the sistring.
o Each node in the tree uses a bit position to determine branching (left or right),
allowing efficient traversal and search.
3. Leaf Nodes:
o The leaf nodes (bottom-most nodes) store the key values (substrings).
o For a text input of size “n,” there are n leaf nodes and n-1 upper-level nodes.
4. Search Constraints:
o Searches can be constrained, for example to substrings that occur only after
interword symbols (spaces, punctuation, etc.), allowing more targeted searches.
o The PAT Tree provides a compact representation of the text as substrings. The size
of the tree is proportional to the number of possible substrings.
Text Search and Indexing: Used for efficient substring search within large texts or databases.
Genetic Databases: This structure has applications in genetic databases where sequences of
characters need efficient indexing and searching.
Image Processing: PAT trees can also be used in indexing image data by treating image data
as continuous text.
The PAT (PAtricia Tree) data structure, used for string searching, leverages sistrings (semi-infinite
strings) to efficiently index and search substrings. The process of creating the PAT tree involves
converting input data into binary strings (sistrings) and organizing them into a binary tree structure.
Below is an explanation of the examples and concepts introduced in the text:
For the input text "100110001101", sistrings are generated from each starting position within the
string (see the sketch below):
Each sistring represents a substring starting at a given position in the string and extending to the
end. These are used to define unique paths within the PAT Tree.
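A sketch that lists the sistrings of the binary input "100110001101", padding on the right (here with "0" standing in for the null character) so all sistrings have equal length; the padding character is an illustrative choice.

```python
def sistrings(text, pad="0"):
    """Generate the semi-infinite strings: one per starting position,
    each padded on the right so that all have equal length."""
    length = len(text)
    return [(text[i:] + pad * length)[:length] for i in range(length)]

for i, s in enumerate(sistrings("100110001101"), start=1):
    print(f"sistring {i}: {s}")
# sistring 1: 100110001101
# sistring 2: 001100011010
# ... and so on, one sistring per bit position.
```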
Binary Representation of Characters:
To illustrate the creation of the tree, each character of the word "home" is represented by its binary
equivalent:
"h" → 100
"o" → 110
"m" → 001
"e" → 101
The word "home" produces the input sequence 100110001101, which is then transformed into the
sistrings above. The PAT Tree is built by creating branches based on these binary sequences.
The full PAT Tree (Figure 4.11) is constructed by organizing the sistrings into nodes based on
the binary representation of the substrings.
Intermediate nodes can be optimized with skip values, which are represented in rectangles
(Figure 4.12). The skip value indicates the number of bits to skip before comparing the next
bit. This compression technique helps save space in the tree, making it more efficient.
Search Operations:
1. Prefix Search: The tree is well-suited for prefix searches because each sub-tree
contains sistrings for the prefix defined up to that node. This allows easy
identification of all strings that match a given prefix.
2. Range Search: The logically sorted structure facilitates range searches, where the
sub-trees within a specified range can be easily located.
3. Suffix Search: When the entire input stream is used to build the tree, suffix searches
become straightforward. These are useful for tasks like finding all occurrences of a
particular suffix in the text.
4. Imbedded String and Masked Search: This involves searching for fixed-length
substrings or masked patterns within the text.
While PAT Trees provide efficient searching for exact matches and structured patterns, fuzzy
searches (such as searching for terms with minor differences or errors) are challenging because
there may be a large number of sub-trees that could potentially match the search term.
The PAT Tree is compared with other traditional data structures like Signature Files and Inverted
Files. While Signature Files provide fast but imprecise searches, PAT Trees offer more accuracy and
flexibility for string-based searches. However, PAT Trees are not commonly used in major
commercial applications at this time.
Signature File Structure:
Space Efficiency: The text is represented in a highly compressed form, reducing space compared to
inverted file structures.
Search: Signature file search involves a linear scan of the compressed file, making response time
proportional to file size.
Item Addition: New items can be appended to the end without reindexing, making it efficient for
dynamic datasets.
Deleted Items: Deleted items are typically marked but not removed.
Superimposed Coding:
A word signature is a fixed-length binary code with bits set to "1" based on a hash function.
Items are partitioned into blocks (e.g., five words per block).
Word signatures are created for each word and combined for each block.
Bit Density Control: To avoid too many "1"s, a maximum number of bits set to "1" is allowed per
word.
Code Length: The binary signature has a fixed code length (e.g., 16 bits).
Example: For the text "Computer Science graduate students study," the process involves hashing
each word, creating a word signature, and ORing the signatures to form a block signature.
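A hedged sketch of superimposed coding: each word is hashed onto a few bit positions of a 16-bit word signature, the word signatures of a block are ORed into a block signature, and a query is template-matched against it. The hash function, code length, and bits-per-word values are illustrative, and false hits remain possible.

```python
import hashlib

CODE_LENGTH = 16   # bits in a signature
BITS_PER_WORD = 3  # maximum number of '1' bits contributed by one word

def word_signature(word):
    """Set BITS_PER_WORD bit positions chosen by hashing the word."""
    digest = hashlib.md5(word.lower().encode()).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (digest[i] % CODE_LENGTH)
    return sig

def block_signature(words):
    """OR together the word signatures of a block (e.g., five words)."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def matches(query_word, block_sig):
    """Template match: every bit of the query word's signature must be set
    in the block signature (false hits are still possible)."""
    q = word_signature(query_word)
    return block_sig & q == q

block = ["computer", "science", "graduate", "students", "study"]
sig = block_signature(block)
print(f"{sig:016b}")
print(matches("computer", sig))   # True
```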
Advantages:
Compact storage; new items can be appended without reindexing the existing file.
Limitations:
Linear search for querying, which can be slower for large datasets.
Applications: Ideal for document retrieval, text-based search engines, and database indexing.
Search is performed through template matching based on the bit positions specified by the
query's words.
The signature file is stored with each row representing a signature block.
Design Objective:
The goal is to balance the size of the data structure against the density of the signatures.
Longer code lengths reduce the likelihood of collisions in word hashing (i.e., two different
words hashing to the same value).
Fewer bits per code reduce false hits due to word patterns in the final block signature.
For instance, if the word "hard" has the signature 1000 0111 0010 0000, it might incorrectly
match a block signature, resulting in a false hit.
A study by Faloutsos and Christodoulakis showed that compressing the final data structure
optimizes the number of bits per word.
This approach makes the signature file resemble a binary-coded vector for each item,
ensuring no false hits unless two words hash to the same value.
Search Time:
Hashing: Block signatures are hashed to specific slots. A query with fewer terms maps to
multiple possible slots.
Sequential Indexing: Signatures are mapped to an index sequential file using the first “n”
bits of the signature.
B-Tree Structures: Similar signatures are clustered at leaf nodes of a B-tree (Deppisch-86).
Vertical Partitioning:
Signature matrices can be stored in column order (vertical partitioning) to optimize searches
on the columns.
This method is similar to using an inverted file structure, and it allows columns to be
searched by skipping those not relevant to the query.
Major overhead is from updates, as new “1”s must be added to the appropriate columns
when new items are added.
Applications:
Signature files are practical for medium-sized databases, databases with low-frequency
terms, WORM devices, parallel processing machines, and distributed environments
(Faloutsos-92).
The Internet has introduced new methods for representing information, leading to the
development of hypertext.
Hypertext Structure:
Hypertext is different from traditional information storage structures in both format and
usage.
Markup Languages:
HTML is an evolving standard that is updated as new display requirements for the Internet
emerge.
Detailed Descriptions:
Both HTML and XML provide detailed descriptions for subsets of text, similar to the zoning
concept discussed earlier.
Usage of Subsets:
Definition of Hypertext:
Hypertext is widely used on the Internet and requires electronic media storage.
Items in hypertext are called nodes, and the references between them are called links.
A node can reference another item of the same or different data type (e.g., text referencing
an image).
Each node is displayed by a viewer defined for the file type associated with it.
HTML defines the internal structure for information exchange on the World Wide Web.
A document in HTML consists of the text and HTML tags that describe how the document
should be displayed.
HTML tags like <title> and <strong> are used for formatting and structuring content.
The <a href=...> tag is used for hypertext linkages, linking to other files or URLs.
URL Components: A URL identifies the access protocol (e.g., http), the host name of the server, and the path to the file on that server.
Hypertext allows navigation through multiple paths, as users can follow links or continue
reading the document sequentially.
Hypertext is a non-sequential directed graph structure, allowing multiple links per node.
Nodes contain their own information, and each node may have several outgoing links
(anchors).
When an anchor is activated, the link navigates to the destination node, creating a hypertext
network.
Hypertext is dynamic; new links and updated node information can be added without
modifying referencing items.
Conventional items have a fixed logical and physical structure, while hypertext allows
dynamic structure with links to other nodes.
In a hypertext environment, users navigate through the network of nodes by following links.
Large information spaces can disorient users, but the concept allows managing loosely
structured information.
Hypertext often references other media types (e.g., graphics, audio, video).
When a referenced item is logically part of the node (e.g., a graphic), it is typically stored at
the same physical location.
Items referenced by other users may be located at different physical sites, which could lead
to linkage integrity issues when items are moved or deleted.
Dynamic HTML:
Dynamic HTML (DHTML), introduced with Navigator 4.0 and Internet Explorer 4.0,
combines HTML tags, style sheets, and programming to create interactive and animated web
pages.
o Use of Cascading Style Sheets (CSS) for page layout and design.
o Document Object Model (DOM) for managing page elements (Microsoft's version:
Dynamic HTML Object Model, Netscape's version: HTML Object Model).
Style Sheets and Layering:
Style sheets describe the default style of a document, such as layout and text size.
DHTML allows cascading style sheets, where new styles can override previous ones in a
document.
Layering involves using alternative style sheets to overlay and superimpose content on a
page.
Dynamic Fonts:
Netscape supports dynamic fonts, which let a page use fonts beyond those installed with the user's browser.
Hypertext history:
In 1945, Vannevar Bush published an article describing the Memex system, a microfilm-
based system allowing the storage and retrieval of information using links.
The term "hypertext" was coined by Ted Nelson in 1965 as part of his Xanadu System,
envisioning all the world’s literature interlinked via hypertext references.
Early commercial use of hypertext was seen in the Hypertext Editing System developed at
Brown University and used for Apollo mission documentation at the Houston Manned
Spacecraft Center.
Other systems like Aspen (MIT), KMS (Carnegie Mellon), Hyperties (University of Maryland),
and Notecards (Xerox PARC) contributed to the development of hypertext and hypermedia.
HyperCard, the first widespread hypertext product, was delivered with Macintosh
computers and included a simple metalanguage (HyperTalk) for authoring hypertext items.
Hypertext gained popularity in the early 1990s with its use in CD-ROMs for educational and
entertainment products.
Its high level of popularity emerged with its inclusion in the World Wide Web specification
by CERN in Switzerland, with the Mosaic browser enabling widespread access to hypertext
documents.
XML:
XML (Extensible Markup Language) became a standard data structure on the web with its
first recommendation (1.0) issued on February 10, 1998.
XML serves as a middle ground between the simplicity of HTML and the complexity of SGML
(ISO 8879).
Its main objective is to extend HTML with semantic information, allowing for more flexible
tag creation.
The logical structure within XML is defined by a Document Type Definition (DTD), which is more
flexible than HTML’s fixed tags and attributes.
Users can create any tags necessary to describe and manipulate their structure.
The W3C (World Wide Web Consortium) is redeveloping HTML into a suite of XML tags.
An example of XML tagging:
<company>Widgets Inc.</company>
<city>Boston</city>
<state>Mass</state>
<product>widgets</product>
W3C is developing a Resource Description Format (RDF) for representing properties of web
resources (e.g., images, documents).
XML links are being defined in the XLink (XML Linking Language) and XPointer (XML Pointer Language) specifications, allowing distinctions between different types of links (internal or external).
XLink and XPointer will help determine what needs to be retrieved to define the total item to be indexed.
XML will include XML Style Sheet Linking for defining how items are displayed and handled
through cascading style sheets, offering flexibility in how content is shown to users.
Hidden Markov Models (HMMs) have been applied in areas such as:
Speech recognition
Topic identification
Information retrieval
Dr. Lawrence Rabiner provided one of the first comprehensive descriptions of HMMs.
Markov process: A system where the next state depends only on the current state and not on past
states.
State transition matrix: Defines probabilities for moving between states, e.g., from S1 to S2.
Example sequence: Probability of market increasing for 4 days then falling = {S3, S3, S3, S3, S1}.
States in the model are not directly observable (hidden), but can be inferred through
observable outputs.
An input sequence provides results that can be used to deduce the most likely state
sequence.
Used to model systems where transitions between states are probabilistic, and each state
generates an observable output.
1. S = { S₀, ..., Sₙ₋₁}: A finite set of states. S₀ denotes the initial state.
2. V = { v₀, ..., vₘ₋₁}: A finite set of output symbols corresponding to observable outputs.
3. A = S × S: A state transition probability matrix, where aᵢⱼ is the probability of moving from state sᵢ to state sⱼ.
4. B = S × V: An output probability matrix where bⱼₖ is the probability of output vₖ in state sⱼ.
5. Initial State Distribution: Specifies the probability distribution over the initial states.
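Below is a minimal sketch of these elements for the three-state market example (the state labels S1 = falls, S2 = unchanged, S3 = rises, and every probability value are illustrative assumptions, not figures from the notes):

import numpy as np

# States: 0 = S1 (market falls), 1 = S2 (unchanged), 2 = S3 (market rises) -- assumed labels
pi = np.array([0.3, 0.3, 0.4])           # initial state distribution (illustrative)
A = np.array([[0.6, 0.2, 0.2],            # A[i, j] = P(next state j | current state i)
              [0.3, 0.4, 0.3],
              [0.2, 0.2, 0.6]])           # illustrative transition probabilities
V = ["down", "flat", "up"]                # observable output symbols (assumed)
B = np.array([[0.8, 0.1, 0.1],            # B[j, k] = P(output k | state j), illustrative
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Probability of the state sequence {S3, S3, S3, S3, S1} (market rises 4 days, then falls)
seq = [2, 2, 2, 2, 0]
p = pi[seq[0]]
for prev, nxt in zip(seq, seq[1:]):
    p *= A[prev, nxt]
print(f"P(S3,S3,S3,S3,S1) = {p:.5f}")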
HMM Process:
Used for modeling and generating sequences of observed outputs and their associated
probabilities.
Given an observed output sequence, it models its generation by identifying the appropriate
HMM.
Training Sequence: Process of tuning the HMM model to maximize the probability of the
observed sequence.
Optimality Criterion: Algorithm used to select the most likely model based on observed
outputs.
Challenges:
Model Selection: Determining which model best explains the observed output sequence.
State Sequence Identification: Finding the state sequence that best explains the output.
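The state sequence identification challenge is commonly addressed with the Viterbi algorithm. The sketch below, reusing the same illustrative matrices as the previous sketch, recovers the most likely hidden state sequence for an observed output sequence (all numbers remain assumptions):

import numpy as np

pi = np.array([0.3, 0.3, 0.4])                     # illustrative initial distribution
A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.2, 0.6]])                    # illustrative transition matrix
B = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])                    # illustrative output matrix

def viterbi(obs):
    """Return the most likely state sequence for the observed output indices."""
    delta = pi * B[:, obs[0]]                      # best score ending in each state
    back = []                                      # backpointers for each step
    for o in obs[1:]:
        scores = delta[:, None] * A * B[:, o]      # scores[i, j]: come from state i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    state = int(delta.argmax())                    # trace back the best path
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return list(reversed(path))

print(viterbi([2, 2, 2, 0]))   # observations: up, up, up, down (indices into the output alphabet)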
SUMMARY:
Data Structures in Information Retrieval (IR):
Can be used for searching text directly (e.g., signature files, PAT trees) or organizing
searchable data structures created from text.
Inverted File Structures:
Have the widest applicability for indexing and searching large collections of text.
N-grams:
Expected to be widely used for finding relevant information, especially on the Internet.
Stemming:
The trade-off between increased recall and decreased precision is still under study.
UNIT - III
Automatic Indexing:
Classes of Automatic Indexing:
Automatic Indexing involves analyzing an item to extract information for creating a permanent index.
This index is the data structure supporting the search functionality. The process includes various stages:
token identification, stop lists, stemming, and creating searchable data structures.
1. Statistical Indexing:
○ Definition: This is the most common technique, particularly in commercial systems, using
event frequency (such as word occurrences) within documents.
○ Techniques:
■ Probabilistic Indexing: Calculates the probability of an item’s relevance to a
query based on certain stored statistics.
■ Bayesian and Vector Space Models: Focus on assigning a confidence level to an
item's relevance rather than an exact probability.
■ Neural Networks: Uses dynamic structures that learn and adjust based on
concept classes in the document.
○ Static Approach: Simply stores frequency data (e.g., word counts) for use in calculating
relevance scores.
2. Natural Language Indexing:
○ Definition: Similar to statistical indexing in token identification but goes further by
adding levels of language parsing.
○ Parsing: This extra step disambiguates tokens, helping to distinguish present, past, or
future contexts within items.
○ Purpose: Adds depth to the index by including contextual information, improving search
precision.
3. Concept Indexing:
○ Definition: Correlates words in an item to broader, often abstract, concepts instead of
specific terms.
○ Automatic Concept Classes: These generalized concepts may lack explicit names but hold
statistical significance.
○ Application: Allows for indexing based on idea associations rather than exact
terminology.
4. Hypertext Linkage Indexing:
○ Definition: Establishes virtual linkages or threads between concepts across multiple
items.
○ Benefit: Facilitates browsing by connecting related concepts, forming an interconnected
web of ideas within the dataset.
Each indexing method has strengths and limitations. For optimal results, as seen in TREC conference
evaluations, multiple indexing algorithms can be applied to the same dataset, although this requires
significant storage and processing overhead.
Statistical Indexing:
Statistical indexing involves using term frequencies and probability to estimate the relevance of
documents in a search. The goal is to rank items based on the frequency and distribution of query terms
within documents and across the database.
Probabilistic weighting applies probability theory to assess and rank documents by their likelihood of
relevance to a query.
Key Principles:
The probabilistic model calculates a relevance score by determining the probability that a document D is
relevant to a query Q. Two common calculations are:
1. Odds of Relevance (O(R)): The odds of a document being relevant can be represented as:
O(R) = P(R) / (1 − P(R))
where P(R) is the probability of relevance.
2. Log-Odds Formula: In logistic regression, we often use the log-odds to simplify calculations. The
log-odds of relevance for a term t in a document can be given by:
log O(R) = Σ t∈Q (cₜ × weight of term t)
Here:
○ cₜ represents coefficients derived from logistic regression (based on term frequency,
inverse document frequency, and other attributes).
○ The sum is over all terms in the query Q that appear in the document.
3. Probability of Relevance Calculation: To obtain the probability of relevance, we apply the inverse
logistic transformation to the log-odds result:
P(R | D, Q) = 1 / (1 + e^(−log O(R)))
where e is the base of the natural logarithm. This provides a probability score that ranks
documents by relevance to the query.
log O(R) = c₀ + c₁ × QRF + c₂ × DRF + c₃ × RFAD
Where:
○ c₀ through c₃ are coefficients derived from logistic regression over training data.
○ QRF is the relative frequency of the term in the query.
○ DRF is the relative frequency of the term in the document.
○ RFAD is the relative frequency of the term across all documents in the database.
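A small sketch of this calculation follows; the coefficient values and the per-term (QRF, DRF, RFAD) statistics are illustrative placeholders, not values from the notes or from any trained model.

import math

# Illustrative logistic-regression coefficients (would normally be learned from training data)
c0, c1, c2, c3 = -3.5, 2.0, 4.0, -1.0

def term_log_odds(qrf, drf, rfad):
    """log O(R) contribution of one query term, per the formula above."""
    return c0 + c1 * qrf + c2 * drf + c3 * rfad

def probability_of_relevance(term_stats):
    """Sum the per-term log-odds, then apply the inverse logistic transform."""
    log_odds = sum(term_log_odds(*stats) for stats in term_stats)
    return 1.0 / (1.0 + math.exp(-log_odds))

# (QRF, DRF, RFAD) for each query term that appears in the document -- made-up numbers
stats = [(0.33, 0.02, 0.001), (0.33, 0.05, 0.0004)]
print(f"P(relevance) = {probability_of_relevance(stats):.3f}")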
Vector weighting represents each document as a vector in a multidimensional space. The SMART system
from Cornell University introduced this approach, using weighted vectors to improve retrieval accuracy
by capturing term importance.
Binary vector: (1, 1, 1, 0, 1, 0)
In the binary vector, terms "Taxes" and "Shipping" may be below the importance threshold and are
excluded (represented as 0). The weighted vector provides varying importance levels based on relevance
scores.
The Simple Term Frequency (TF) Algorithm is one of the foundational approaches to weighting terms in
information retrieval. It measures how frequently a term appears in a document, with the term
frequency serving as a basis for determining its importance in representing the document’s content.
In a statistical indexing system, the following data are essential for calculating term weights: the frequency of each term within an item, the number of items in the database that contain each term, and the total number of items in the database.
The idea that frequent terms in a document are content-bearing is based on principles by Luhn and
Brookstein, who suggested that a term's resolving power correlates with its occurrence frequency within
an item.
Basic TF Calculation
The simplest approach assigns a weight equal to the term frequency. For instance, if the term
"computer" appears 15 times in a document, its weight would be 15. However, this method introduces
biases toward longer documents since they naturally contain more occurrences of terms.
To mitigate the impact of document length on term weights, normalization techniques are applied to
achieve a more balanced weighting scheme.
In the SMART System, slope and pivot are constants, with the pivot set to the average number of unique
terms across the document set. The slope adjusts the impact of the pivot to balance the relative weight
of terms across different document lengths.
Combining these normalization techniques results in a formula that improves retrieval effectiveness,
especially when applied to large datasets (such as those used in TREC):
Adjusted Weight = Log TF / Pivoted Normalization
The logarithmic term frequency and pivoted normalization improve weight calculations by reducing bias
toward shorter documents and adjusting weights for longer documents.
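As a sketch of the combined scheme, assuming the commonly cited pivoted unique normalization (the slope and pivot constants are illustrative, and the exact form used in SMART may differ):

import math

SLOPE = 0.2   # illustrative constant
PIVOT = 75.0  # illustrative: average number of unique terms per document in the collection

def log_tf(tf):
    """Dampened term frequency: grows logarithmically rather than linearly."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def pivoted_normalization(unique_terms):
    """Penalize documents with many unique terms, boost those with few."""
    return (1.0 - SLOPE) * PIVOT + SLOPE * unique_terms

def adjusted_weight(tf, unique_terms):
    return log_tf(tf) / pivoted_normalization(unique_terms)

# The same raw frequency contributes less in a longer document (more unique terms)
print(adjusted_weight(tf=15, unique_terms=50))
print(adjusted_weight(tf=15, unique_terms=500))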
The Inverse Document Frequency (IDF) algorithm enhances the basic Term Frequency (TF) model by
accounting for the frequency of a term's occurrence across the entire document collection (the
database). This helps in identifying how significant a term is within the database context. If a term occurs
in every document, it is not useful for distinguishing between items, as it appears too frequently and will
likely return a large number of irrelevant results. On the other hand, terms that are unique to fewer
documents are more informative and thus should be weighted more heavily.
IDF Formula
The IDF algorithm works by assigning higher weights to terms that appear in fewer documents in the
database. The general formula for Inverse Document Frequency is:
IDFⱼ = log( n / (1 + DFⱼ) )
Where:
n is the total number of items (documents) in the database and DFⱼ is the number of items that contain term j.
Weight Calculation
The weight for term j in document i is computed as the product of TF and IDF:
WEIGHTᵢⱼ = TFᵢⱼ × IDFⱼ
Where:
TFᵢⱼ is the frequency of term j in item i and IDFⱼ is the inverse document frequency of term j.
This approach adjusts the importance of a term based on how widespread it is in the database. Terms
that are highly common across documents (like "computer") will have a lower IDF, reducing their overall
weight. In contrast, more specific terms (like "Mexico") that occur in fewer documents will receive higher
weights, thus helping to distinguish between items.
Example
If a new item appears containing all three terms, with the following term frequencies:
The unweighted term frequency for the new document vector is:
(4,8,10)
Thus, the weighted vector for the new item, using TF x IDF, becomes:
(8.08,26.32,6.7)
As seen in this example, the term "Mexico" has the highest weight, indicating its relative importance in
the context of the new document. The term "refinery", despite its higher frequency in the document,
has a much lower weight due to its higher prevalence across the database.
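A minimal sketch of the TF × IDF computation using the smoothed IDF formula above; the collection size, the document frequencies, and the term "oil" are invented for illustration and are not the figures behind the (8.08, 26.32, 6.7) vector in the notes.

import math

N_DOCS = 1000                                              # illustrative collection size
DOC_FREQ = {"oil": 200, "Mexico": 30, "refinery": 800}     # illustrative document frequencies

def idf(term):
    return math.log(N_DOCS / (1 + DOC_FREQ[term]))

def tf_idf_vector(term_freqs):
    return {t: tf * idf(t) for t, tf in term_freqs.items()}

new_item = {"oil": 4, "Mexico": 8, "refinery": 10}
print(tf_idf_vector(new_item))
# "Mexico" gets the largest weight; "refinery", despite the highest TF, gets the smallest.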
Dynamic Calculation
In systems with a dynamically changing document database, such as in the INQUERY system, the IDF
values are recalculated at retrieval time. The system stores the term frequency for each document but
calculates the IDF dynamically using the inverted list (which keeps track of which documents contain
each term) and a global count of the number of documents in the database.
5.2.2.3 Signal Weighting
The Signal Weighting algorithm enhances traditional weighting models like Term Frequency (TF) and
Inverse Document Frequency (IDF) by considering the distribution of term frequencies within
documents. While IDF adjusts for how widespread a term is across the database, it does not account for
how evenly or unevenly a term's occurrences are distributed within the documents in which it appears.
This can affect the ranking of documents when precision (maximizing the relevance of returned
documents) is important.
The idea behind Signal Weighting is to use the distribution of term frequencies in documents to adjust
the weight of a term further. If a term is highly concentrated in only a few documents or if it is
distributed unevenly, its weight should reflect this. A term that is evenly distributed across multiple
documents can be considered less "informative" than one that is more concentrated in fewer
documents.
The theoretical basis for Signal Weighting comes from Shannon's Information Theory. According to
Shannon, the information content of an event is inversely proportional to its probability of occurrence.
That is, an event that occurs frequently (high probability) provides less new information than an event
that occurs rarely (low probability). This can be mathematically expressed as:
INFORMATION = −log₂(p)
Where:
p is the probability of occurrence of the event (here, the term).
For instance:
● If a term occurs 50% of the time (i.e., p = 0.5), the information value is:
INFORMATION = −log₂(0.5) = 1 bit
● If a term occurs 0.5% of the time (i.e., p = 0.005), the information value is:
INFORMATION = −log₂(0.005) ≈ 7.64 bits
In the context of term frequency, Signal Weighting uses this principle to measure the variability of a
term's distribution across documents. Terms with a more uniform distribution across items are seen as
less informative (more predictable), while terms with a highly skewed distribution are considered more
informative.
Average Information Value (AVE_INFO)
The average information value (AVE_INFO) represents the mean level of unpredictability across the occurrences of a term within the documents that contain it. A term that occurs uniformly across all of those documents has the maximum AVE_INFO, while a term whose frequency varies sharply from document to document has a lower AVE_INFO.
AVE_INFO is computed from the ratios of a term's frequency in each document to the total occurrences of the term across the entire database, treating each ratio as a probability. The maximum AVE_INFO occurs when the term's frequency distribution is perfectly uniform across the documents it appears in.
The Signal Weighting factor is derived from AVE_INFO by, in effect, inverting it: because a uniform distribution yields the maximum AVE_INFO, a low AVE_INFO translates into a high signal value.
A higher Signal Weight is assigned to terms with more uneven distributions of occurrence across
documents.
Let's consider the distribution of terms SAW and DRILL across five items, as shown in Figure 5.5:
Item    SAW    DRILL
A       10      2
B       10      2
C       10     18
D       10     10
E       10     18
Both SAW and DRILL appear in the same number of items (5 items total), but their frequency
distributions differ.
● SAW appears uniformly across the items (each occurrence is 10), while DRILL has varying
frequencies (from 2 to 18). This suggests that DRILL's occurrences are more concentrated in
fewer items than SAW.
1. SAW: The distribution is uniform, so AVE_INFO for SAW is at its maximum; knowing that SAW occurs says little about which particular item is being looked at.
2. DRILL: The distribution is skewed, so AVE_INFO for DRILL is lower; its occurrences are concentrated and therefore more useful for distinguishing items.
Thus, DRILL will receive a higher Signal Weight than SAW, because its distribution across items is far less uniform.
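The sketch below computes AVE_INFO and a signal weight for SAW and DRILL from the table above, assuming the commonly cited formulation Signal = log₂(total occurrences) − AVE_INFO (the notes do not spell out the formula, so this exact form is an assumption):

import math

occurrences = {
    "SAW":   [10, 10, 10, 10, 10],
    "DRILL": [2, 2, 18, 10, 18],
}

def ave_info(freqs):
    """Average information (entropy) of the term's distribution over the items."""
    total = sum(freqs)
    return sum((f / total) * math.log2(total / f) for f in freqs if f > 0)

def signal(freqs):
    """Assumed signal formulation: log2(total occurrences) minus AVE_INFO."""
    return math.log2(sum(freqs)) - ave_info(freqs)

for term, freqs in occurrences.items():
    print(f"{term}: AVE_INFO = {ave_info(freqs):.2f}, signal = {signal(freqs):.2f}")
# SAW's uniform distribution gives the maximum AVE_INFO and the minimum signal weight;
# DRILL's skewed distribution gives a lower AVE_INFO and a higher signal weight.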
Signal Weighting can be used on its own or in combination with other techniques like TF-IDF. However,
the additional overhead of maintaining the data and performing the necessary calculations may not
always justify the potential improvements in results. The effectiveness of Signal Weighting has been
demonstrated in studies by researchers such as Harman, and Lochbaum and Streeter, but it is not commonly
implemented in real-world systems due to the complexity involved.
The Discrimination Value is another method for determining the weight of a term in an information
retrieval system. This approach is designed to enhance the ability of a search system to discriminate
between items, i.e., to help the system distinguish relevant items from irrelevant ones. If all items appear
very similar, it becomes difficult to identify which ones meet the user's needs. Thus, weighting terms
based on their discriminatory power can improve the precision of a search.
The Discrimination Value aims to measure how much a term contributes to distinguishing between items
in the database. A term that differentiates documents well has high discriminatory power, while a term
that makes items look more similar to one another has low discriminatory power.
Salton and Yang (1973) proposed this weighting algorithm, where the Discrimination Value of a term i is calculated based on the difference in similarity between all items in the database before and after removing the term i.
1. AVESIM: This represents the average similarity between all pairs of items in the database.
2. AVESIMᵢ: This represents the average similarity between all pairs of items when term i is removed from every item.
The Discrimination Value (DISCRIMᵢ) for term i is then calculated as:
DISCRIMᵢ = AVESIMᵢ − AVESIM
● If DISCRIMᵢ is positive, removing the term i increases the similarity between items. This means that term i is effective in distinguishing between items, and its presence is valuable for discrimination.
● If DISCRIMᵢ is close to zero, the term i neither increases nor decreases the similarity significantly. Removing or including the term does not affect the database's ability to distinguish items.
● If DISCRIMᵢ is negative, removing the term actually decreases the similarity between items, meaning the term is adding noise or reducing the distinction between items. This term might not be useful for discriminating between items.
Once the DISCRIMi value is computed, it is usually normalized to ensure that it is a positive number.
Normalization ensures that the discrimination value can be directly incorporated into a weighting
formula without negative values affecting the results.
After normalization, the Discrimination Value can be incorporated into a standard weighting formula to adjust the weight of term i in an item:
WEIGHTᵢ = DISCRIMᵢ × TFᵢ
Where:
● TF is the term frequency, which represents how often the term appears in an item.
● DISCRIMi helps adjust the weight based on how well the term discriminates between the items
in the database.
Example
Consider a scenario where term "computer" is used in several documents within a database:
● If removing the term "computer" from these documents leads to a reduction in similarity, it
means the term is distinguishing between items well and should be assigned a higher weight.
● If removing "computer" has little or no effect on similarity, it is not a strong discriminator and
should be assigned a lower weight.
● If its removal causes a higher similarity, the term might be counterproductive for distinguishing
items, suggesting it has low discriminatory value.
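A minimal sketch of the discrimination-value computation under the assumptions already stated (DISCRIMᵢ = AVESIMᵢ − AVESIM, cosine similarity as the similarity measure, and a tiny invented item-term matrix):

import math
from itertools import combinations

# Rows are items, columns are terms; values are term frequencies (invented for illustration)
TERMS = ["computer", "oil", "Mexico"]
ITEMS = [
    [5, 0, 1],
    [4, 2, 0],
    [6, 1, 3],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ave_sim(items):
    pairs = list(combinations(items, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(term_index):
    """AVESIM with the term removed, minus AVESIM with all terms present."""
    without = [[v for j, v in enumerate(row) if j != term_index] for row in ITEMS]
    return ave_sim(without) - ave_sim(ITEMS)

for i, term in enumerate(TERMS):
    print(f"DISCRIM({term}) = {discrimination_value(i):+.3f}")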
Weighting schemes like Inverse Document Frequency (IDF) and Signal Weighting are commonly used in
information retrieval systems to assign weights to terms based on their distribution across the database.
These schemes rely on factors such as term frequency and document frequency, which are influenced by
the distribution of processing tokens (terms) within the database. However, there are several challenges
when dealing with dynamic and constantly changing databases. As new items are added and existing
ones are modified or deleted, the distribution of terms also changes, causing fluctuations in the
weighting factors.
Several approaches have been proposed to mitigate the impact of these changing values and to reduce system overhead; for example, a system can store only the raw frequency counts and compute the collection-wide weighting factors dynamically at retrieval time, as described above for INQUERY.
Another challenge arises when information databases are partitioned by time, especially for data that
becomes less relevant as it ages (e.g., news articles, research publications). Terms in older documents
often have less value compared to newer documents, making time a crucial factor in weighting schemes.
1. Time-Based Partitioning:
○ One solution is to partition the database into time-based segments (e.g., by year) and
allow users to specify the time period they want to search.
○ Challenge: Different time-based partitions may have different term distributions, which
could affect the weight calculations for the same processing token in different time
periods. The system must account for how to handle these variations and how to merge
results from different time periods into a single, ranked set of results.
2. Integrating Results Across Time:
○ Ideally, the system should allow users to query across multiple time periods and
databases that may use different weighting algorithms.
○ The system would then need to integrate the results from different time periods and
databases, combining them into a single ranked list of items, while ensuring consistency
in the ranking despite the possible differences in weighting schemes and time-based
variations.
The vector space model is widely used in information retrieval systems for representing documents and
queries, where each document is represented as a vector of terms, and each term corresponds to a
dimension. While the vector model has several advantages, it also faces notable limitations, especially
when dealing with complex or multi-topic documents and term positional information.
1. Multi-Topic Documents:
○ A significant challenge in the vector model is the lack of semantic distinctions when
documents cover multiple topics. For example, consider a document that discusses both
"oil in Mexico" and "coal in Pennsylvania." In the vector model, all terms are treated
independently, and there is no mechanism to associate specific terms with particular
topics or regions.
○ Issue: The vector model would likely return a high similarity score for a search query like
"coal in Mexico," even though the document does not discuss this combination. This is
because the model does not consider correlation factors between terms such as "oil"
and "Mexico" or "coal" and "Pennsylvania."
2. Lack of Positional Information:
○ The vector model does not retain positional information about terms in a document. For
example, proximity searching (finding documents where one term occurs close to
another, e.g., "a" within 10 words of "b") is not possible with the basic vector model.
○ Issue: The model allows only one scalar value for each term in a document, which limits
its ability to distinguish between important and less important occurrences based on the
position of terms within the document. This can be detrimental to search precision,
especially when context or term proximity is crucial to understanding the content.
3. Reduced Precision:
○ The lack of context and term proximity information in the vector model can lead to
reduced precision in search results. For example, a query for "coal in Mexico" might
return a document where "coal" and "Mexico" appear together but are not in context,
making the result less relevant.
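To make the multi-topic problem concrete, the following sketch (with invented term weights) shows a bag-of-words cosine similarity scoring the query "coal Mexico" highly against a document that actually discusses oil in Mexico and coal in Pennsylvania:

import math

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Document discussing "oil in Mexico" and "coal in Pennsylvania" (invented weights)
doc = {"oil": 3, "mexico": 3, "coal": 3, "pennsylvania": 3}
query = {"coal": 1, "mexico": 1}

# The score is high even though "coal in Mexico" is never discussed as a single topic,
# because the vector model keeps no association between pairs of terms.
print(f"similarity = {cosine(doc, query):.2f}")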
5.2.3 Bayesian Model
To address the limitations of the vector model, a Bayesian approach can be employed. The Bayesian
model is based on conditional probabilities, offering a more nuanced way to maintain information about
processing tokens (terms) in documents.
● In a Bayesian framework, the goal is to calculate the probability of relevance for a given
document with respect to a search query. This can be represented as:
P(REL | DOCᵢ, Queryⱼ), the probability that document DOCᵢ is relevant to query Queryⱼ.
● The Bayesian model can be applied both to the indexing of documents and to the ranking of
search results, providing a way to assign probabilities to the relevance of documents based on
the terms they contain.
The Bayesian model can be particularly useful for handling documents that cover multiple topics. For
example, it can incorporate dependencies between topics and proximity of terms to better reflect the
document's semantic content. In this approach:
In Figure 5.6, a simple Bayesian network is used to represent the relationship between topics (like "oil in
Mexico") and the processing tokens (terms) present in a document.
A key simplifying assumption in this network is that the topics, and the processing tokens, are statistically independent of one another. While this assumption simplifies the model, it is often not true in practice. For example, terms like
"politics" and "economics" are often related, especially in discussions about government policy or trade
laws. Similarly, some terms (like "oil" and "Mexico") may be strongly correlated, while others (like "oil"
and "coal") may be independent.
Handling Dependencies Between Topics and Tokens
To address the issue of dependencies between topics or processing tokens, the Bayesian model can be
extended by adding additional layers to the network:
● An Independent Topics layer can be added above the topics layer to handle interdependencies
between topics.
● An Independent Processing Tokens layer can be added above the processing token layer to
capture dependencies between terms.
This extended model allows the system to handle complex interdependencies between terms and topics,
improving the accuracy of relevance determination and the ranking of search results. However, this
approach can increase the complexity of the model, potentially reducing the precision of the indexing
process.
While the extended Bayesian model can handle dependencies more effectively, it may sacrifice some
precision by reducing the number of independent variables available to define the semantics of a
document. Nonetheless, it is considered a more mathematically correct approach compared to the basic
vector model, especially for multi-topic documents or when handling term dependencies.
The goal of Natural Language Processing (NLP) in information retrieval systems is to enhance the
indexing of documents by leveraging semantic and statistical information, thus improving search
precision and reducing the number of irrelevant results (false hits). Instead of treating each word as an
independent unit, NLP aims to extract meaning from the language itself, enabling more accurate
searches.
1. Phrase Generation:
○ Simple word-based indexing may fail to capture the nuanced meanings of complex
concepts. For example, the term "field" could refer to a variety of concepts, but phrases
like "magnetic field" or "grass field" offer a more specific representation. NLP enhances
indexing by generating term phrases that more accurately reflect these concepts. This
allows the system to distinguish between terms that are semantically close (e.g.,
"magnetic" and "field") versus those that are not.
2. Statistical vs. Syntactic Analysis:
○ Statistical approaches often generate phrases based on proximity (e.g., adjacent words
or words appearing within the same sentence), but this can lead to errors. For instance,
"Venetian blind" and "blind Venetian" may appear to be related due to proximity but
actually represent different concepts.
○ Syntactic and semantic analysis can more accurately define phrases based on the
grammar and meaning of the terms involved, improving the quality of term phrase
generation.
3. Phrase Disambiguation:
○ Phrases such as "blind Venetian" and "Venetian who is blind" should ideally map to the
same phrase, since they refer to the same concept. By analyzing the syntactic structure
of the phrase, NLP can disambiguate between different meanings and standardize the
phrase to a canonical form, ensuring that searches capture the intended semantic
meaning.
4. Part-of-Speech Tagging:
○ Part-of-speech (POS) tagging is a fundamental part of NLP, where words in a sentence
are categorized into their grammatical roles (e.g., nouns, verbs, adjectives). This helps
identify the structure of sentences and generate noun phrases, which are more
meaningful for indexing than individual words.
5. Syntactic Parsing and Hierarchical Analysis:
○ A syntactic parser can analyze sentence structure to create a parse tree, which
represents the relationships between words in a sentence. This hierarchy allows NLP
systems to identify potential phrases based on grammatical roles, such as recognizing a
noun phrase (e.g., "nuclear reactor") and breaking it down into sub-phrases (e.g.,
"nuclear" and "reactor").
○ By analyzing the predicate-argument structure, NLP can identify more complex
relationships between terms.
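As a toy illustration of how part-of-speech tags can drive phrase generation (the tagged input and the adjective-plus-noun pattern are assumptions; a real system would use a full tagger and parser):

# Pre-tagged tokens: (word, part-of-speech). A real system would obtain these from a POS tagger.
tagged = [("industrious", "ADJ"), ("intelligent", "ADJ"), ("students", "NOUN"),
          ("study", "VERB"), ("nuclear", "ADJ"), ("reactors", "NOUN")]

def noun_phrases(tokens):
    """Pair each noun with the adjectives that precede it (a deliberately simple pattern)."""
    phrases = []
    adjectives = []
    for word, tag in tokens:
        if tag == "ADJ":
            adjectives.append(word)
        elif tag == "NOUN":
            phrases.extend(f"{adj} {word}" for adj in adjectives)
            adjectives = []
        else:
            adjectives = []
    return phrases

print(noun_phrases(tagged))
# ['industrious students', 'intelligent students', 'nuclear reactors']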
1. Lexical Analysis:
○ The first step in NLP is lexical analysis, where the text is processed to identify terms and
phrases. One approach to phrase generation uses statistical measures like the cohesion
factor proposed by Salton (1983), which considers the frequency of co-occurrence of
terms (e.g., adjacent terms, within sentences, etc.).
○ The SMART system, for example, identifies term pairs based on adjacency and
co-occurrence in at least 25 documents.
2. Statistical Analysis for Phrase Generation:
○ Statistical models tend to focus on two-term phrases (e.g., "magnetic field"), but NLP
allows for the creation of multi-term phrases, which can provide more precise semantic
meanings. For example, the phrase "industrious intelligent students" could be broken
down into several useful phrases like "intelligent student" and "industrious student."
3. Term Weighting:
○ After generating term phrases, weights need to be assigned to these phrases to reflect
their importance in document indexing. The typical approach uses term frequency and
inverse document frequency (TF-IDF) to assign weights to terms.
○ Since term phrases often appear less frequently than individual words, their weights
might be underrepresented. To compensate, more advanced weighting schemes, like the
one used at New York University, modify the IDF value based on the frequency of the
phrase, ensuring important phrases are adequately weighted.
4. Semantic Relationships:
○ NLP can also enhance phrase weighting by incorporating semantic relationships between
terms. For example, a semantic relationship could involve synonymy, where terms like
"computer" and "PC" are considered similar, or antonymy, where terms like "hot" and
"cold" are recognized as opposites.
○ The system can analyze phrase similarity using semantic categories (e.g., specialization,
antonymy, synonymy) to refine the indexing process.
5. Normalization and Canonical Forms:
○ To ensure that variations of the same concept are indexed under a single representation,
NLP systems aim to normalize phrases. For example, "blind Venetian" and "Venetian who
is blind" should be treated as the same concept. This normalization can help improve the
frequency of relevant phrases, ensuring that they meet the threshold for indexing.
6. Practical Applications:
○ The New York University NLP system, developed in collaboration with GE Corporate
Research and Development, is an example of an advanced NLP-based information
retrieval system. This system uses syntactic and statistical analysis to generate term
phrases and add them to the index. It also categorizes semantic relationships between
phrases (such as similarity or complementarity) to improve relevance ranking.
The Natural Language Processing (NLP) system outlined in section 5.3.2 builds on the basic task of
generating term phrases for indexing, which was discussed in section 5.3.1. This phase of NLP goes
beyond simple phrase extraction and focuses on deriving higher-level semantic information from the
text. It is aimed at identifying relationships between concepts and discourse-level structures, ultimately
improving the accuracy and relevance of information retrieval.
Concept indexing is an advanced method of indexing that moves beyond traditional keyword-based
indexing by focusing on concepts rather than specific terms. This approach leverages natural language
processing (NLP) techniques and controlled vocabularies to map terms to more general concepts,
improving both the accuracy and efficiency of information retrieval.
Concept indexing starts with a term-based indexing system but extends this by considering higher-level
concepts and relationships between concepts. This allows for a more generalized, semantic
understanding of documents, which helps overcome limitations associated with exact term matching.
In the DR-LINK system, for instance, terms within a document are replaced by an associated Subject
Code, which is part of a controlled vocabulary that maps specific terms to more general concepts. This
vocabulary often represents the key ideas or themes that an organization considers important for
indexing and retrieval.
A common way to implement concept indexing is through a controlled vocabulary, which is a predefined
set of terms or codes that represent specific concepts. By mapping terms to broader concept classes, an
indexing system can reduce the dimensionality of the index, making it easier to search and retrieve
relevant documents.
For example, the terms "car," "automobile," and "sedan" might all be mapped to a single subject code representing the broader concept "vehicle."
Rather than manually defining all possible concepts in advance, concept indexing can begin with a set of
unlabeled concept classes and allow the data itself to define the concept classes based on the content of
the documents. This process is similar to thesaurus creation, where semantic relationships between
terms help form broader concept categories.
Automatic creation of concept classes can be facilitated using machine learning techniques or statistical
methods. Over time, as the system processes more documents, it refines and adjusts its concept classes,
ensuring that they better reflect the topics and themes within the data.
Mapping a term to a concept can be complex because a single term may correspond to multiple
concepts. The challenge lies in determining the degree of association between a term and various
concepts. For example, the term "automobile" may be closely related to the concept of "vehicle", but
less strongly related to "fuel" or "environment".
This complexity is handled by assigning multiple concept codes to each term, with different weights
reflecting how strongly the term relates to each concept. The weight helps prioritize the more relevant
concepts when performing searches.
One practical implementation of concept indexing is the Convectis System by HNC Software Inc. This
system uses a neural network model to automatically group terms into concept classes based on their
context (i.e., how terms are used together in similar contexts within the text).
In Convectis:
● Context vectors are created for terms based on their proximity to other terms in a document. For
example, if the terms "automobile" and "vehicle" often appear together in similar contexts, they
will be mapped to the same concept class.
● A term may have multiple weights associated with different concepts, depending on the context
in which it appears. For instance, "automobile" might have a higher weight for the concept of
"vehicle" than for "fuel" or "environment".
● New terms can be mapped to existing concept classes by analyzing their proximity to
already-mapped terms. If a new term appears in a similar context to an existing term, it is likely
to be grouped with the same concept.
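The following is a highly simplified sketch of the context-vector idea, not the actual Convectis algorithm: each term is represented by counts of its neighboring terms, and similarity between these vectors suggests which concept class a term belongs with. The toy corpus and window size are assumptions.

import math
from collections import Counter, defaultdict

corpus = [
    "the automobile drove down the road",
    "the vehicle drove down the road",
    "fuel prices rose sharply this year",
    "the automobile needs fuel to run",
]
WINDOW = 2   # words on each side counted as context (assumed)

def context_vectors(sentences):
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - WINDOW), min(len(words), i + WINDOW + 1)):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = context_vectors(corpus)
# "automobile" shares far more context with "vehicle" than with "fuel",
# so it would be weighted more heavily toward a "vehicle"-like concept class.
print(cosine(vectors["automobile"], vectors["vehicle"]))
print(cosine(vectors["automobile"], vectors["fuel"]))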
6. Challenges in Concept Indexing
● Concept Space Dimensionality: Ideally, each concept in the indexing system should be
represented as an orthogonal vector in a high-dimensional vector space, where each concept
has a distinct dimension. However, it is challenging to create such orthogonal concept vectors
because concepts often share overlapping meanings or attributes.
● Trade-offs: Due to practical limitations (such as computational power and storage), the number
of concept classes is typically limited, which can lead to overlapping or ambiguous concept
classifications.
● Contextual Ambiguity: Terms that appear in multiple contexts can cause confusion in concept
classification. For example, the term "automobile" may refer to transportation in one context,
but to environmental impact in another. This ambiguity must be handled carefully by the
indexing system.
In the Convectis system, the process of mapping the term "automobile" would look like this:
● The system identifies that "automobile" is strongly related to "vehicle" and to a lesser extent to
"transportation", "mechanical device", "fuel", and "environment".
● It assigns weights to these relationships based on contextual proximity (how close the terms
appear together in the text).
● If the term "automobile" appears near terms like "fuel efficiency" or "carbon emissions", the
system might increase the weight for the environment concept, whereas it might strengthen the
vehicle or transportation concept when it appears near terms like "car" or "road".
Hypertext links give an item a second dimension beyond its own text, and this dimension has been underutilized in existing retrieval systems. The inclusion of hyperlinks introduces a new layer of contextual relationship that can enrich the
understanding of a document's subject matter. Hypertext links not only connect related pieces of
information but also create pathways for the user to explore related content.
To make use of this extra dimension in retrieval systems, it’s important to consider how links between
documents or within documents themselves can aid in the contextualization of search results.
Most traditional systems, such as Yahoo, use manually generated indexes to create a hyperlinked
hierarchy. These systems rely on users to navigate through predefined paths. For example, users can
expand a topic and follow hyperlinks to more detailed subtopics, eventually reaching the actual content.
In contrast, systems like Lycos and AltaVista use web crawlers to automatically index information by
crawling web pages and returning the text for indexing. However, these systems generally do not make
use of the relationships between linked documents.
Web crawlers like WebCrawler and OpenText, and intelligent agents like NetSeeker, are designed to
search the Internet for relevant information. However, they primarily focus on searching for keywords
within documents, without leveraging the hyperlink relationships between documents to improve the
accuracy and relevance of results.
An index that incorporates hyperlinks would treat the hyperlink as an extension of the document's
content. Rather than simply being a reference, the content of a linked document should influence the
current document's indexing. When a hyperlink points to related content, the concepts in the linked item
should be included in the index of the current item.
For example, if a document discusses the financial state of Louisiana and includes a hyperlink to another
document about crop damage due to droughts in the southern states, the index for the first document
should allow for a search result for “droughts in Louisiana”. The relationship is established by the
hyperlink, which introduces new concepts related to the original document.
This approach treats hyperlinked content as an additional layer of information, enriching the document’s
context and providing more relevant results when users search for specific topics.
The hyperlink relationship could be incorporated into the index with a reduced weight from the linked
document’s content. The strength of the link—whether it’s a strong or weak link—would influence how
much weight is given to the concepts from the linked document. The link could also be multilevel,
meaning that if a link points to another linked document, the content from both documents could be
incorporated into the current document’s index, though with reduced weighting.
Mathematically, the index weight of a term in the current document could be expressed as the term's own weight plus a reduced contribution from that term's weights in the documents it links to (a sketch follows below).
The weight associated with the hyperlink could be adjusted by normalization factors depending on the
number of links or the type of relationship between the items.
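Since the notes do not give a specific formula, the sketch below shows one plausible scheme only: a term's index weight for a document is its own weight plus a damped contribution from that term's weights in directly linked documents; the damping factor, document names, and all weights are invented.

LINK_DAMPING = 0.3   # fraction of a linked document's weight carried over (assumed)

# Term weights per document (invented) and hyperlinks between documents
doc_weights = {
    "louisiana_finance": {"louisiana": 5.0, "budget": 3.0},
    "southern_droughts": {"drought": 4.0, "crops": 2.0},
}
links = {"louisiana_finance": ["southern_droughts"]}

def indexed_weights(doc_id):
    """Combine a document's own term weights with damped weights from linked documents."""
    combined = dict(doc_weights[doc_id])
    for linked in links.get(doc_id, []):
        for term, weight in doc_weights[linked].items():
            combined[term] = combined.get(term, 0.0) + LINK_DAMPING * weight
    return combined

# A query for "droughts in Louisiana" can now match the finance document via the link.
print(indexed_weights("louisiana_finance"))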
Hyperlinks can also be generated automatically between related documents, typically in two phases:
1. Clustering phase: Documents are grouped based on similarity measures (such as the
cover-coefficient-based incremental clustering method (C2ICM)).
2. Link creation phase: Links are automatically created between documents (or document
sub-parts) that are within the same cluster and have similarity above a given threshold.
This automatic hyperlinking could help create a network of related documents, enriching the information
retrieval process by connecting content that might not have been explicitly linked by the author but is
related in meaning.
Several practical issues arise when exploiting hyperlinks in this way:
● Parsing Issues: Errors can occur during document segmentation, where punctuation or
formatting issues cause misinterpretation of document boundaries or sentence structures.
● Ambiguity in Linkage: Hyperlinks can refer to different types of relationships (e.g., reference,
citation, or causal link), and the nature of these relationships must be considered to accurately
reflect the link's relevance.
● Efficiency Concerns: Automatically generating hyperlinks or processing large numbers of dynamic
documents can be computationally expensive, especially for real-time systems.
Suppose two documents discuss Louisiana's financial state and drought effects on southern states. A
hyperlink between these documents can:
● Transfer the concepts related to drought and Louisiana from the second document into the first
document’s indexing.
● Enable a search for “droughts in Louisiana” to return relevant documents, even if the first
document does not explicitly mention "droughts" but includes a link to content on the topic.
The concept of clustering dates back to the creation of thesauri, which group synonyms and antonyms
together to help authors select the right vocabulary. The primary goal of clustering is to group similar
objects—such as terms, phrases, or documents—into "classes" or clusters, which can then be organized
under broader, more general categories. In the context of clustering, the term class is often synonymous
with cluster.
Clustering follows a sequence of steps to ensure that objects are grouped effectively, allowing for better
information retrieval.
a. Defining the Domain: The first step is to clearly define the domain or scope of the clustering effort.
This could involve creating a thesaurus for a specific field, such as medical terms, or clustering a set of
documents within a database. Defining the domain helps eliminate irrelevant data that could skew the
clustering results.
b. Determining Attributes: Once the domain is set, the next step is to define the attributes of the objects
to be clustered. In the case of a thesaurus, this could mean selecting the words to be clustered based on
their meaning or usage. For document clustering, the process may focus on specific parts of the
documents, such as the title, abstract, or main body. This focus ensures that only relevant information is
considered during clustering.
c. Determining Strength of Relationships: Next, we evaluate the relationships between the objects,
particularly the co-occurrence of attributes. For instance, in a thesaurus, this step involves identifying
synonyms—words that have similar meanings and thus belong together in the same cluster. In document
clustering, this might involve creating a similarity function based on the frequency with which words
appear together in the same document or sentence.
d. Postcoordination and Word Relationships: In the final phase, relationships are defined, and objects are
grouped based on these relationships. Human input may be needed to fine-tune the clusters, especially
when generating a thesaurus. Several types of relationships are commonly used in clustering; the most basic is synonymy, grouping words with equivalent or near-equivalent meanings.
In more advanced clustering schemes, the relationships between words may extend beyond synonyms
and include:
● Collocation: Words that frequently co-occur in proximity (e.g., "doctor" and "hospital").
● Paradigmatic: Words with similar semantic bases (e.g., "formula" and "equation").
Other relationships that are used in semantic networks include terms such as parent-of, child-of, part-of,
and contains-part-of, which define relationships between entities or concepts.
3. Challenges in Clustering
While clustering techniques can significantly improve information retrieval by improving recall, they
often do so at the cost of precision. The key challenge lies in balancing recall (finding as many relevant
documents or terms as possible) with precision (ensuring the results are relevant and manageable).
The process of automatic clustering also compounds these ambiguities. Since the process is automated,
it may produce imperfect groupings, and the language ambiguities inherent in human languages can
further complicate the process.
4. Handling Homographs
An additional challenge in clustering arises from homographs—words that share the same spelling but
have different meanings depending on context. For example, the word "field" could refer to an electronic
field, a field of grass, or a job field. Resolving homographs is difficult, but clustering systems often allow
users to interact with the system to specify the correct meaning, based on other search terms or context.
For example, if a user searches for "field", "hay", and "crops", the system could infer that the agricultural
meaning of "field" is the most relevant.
5. Vocabulary Constraints
In clustering, vocabulary constraints can play an important role in ensuring that the terms or documents
are properly grouped:
● Normalization: Refers to whether the system uses complete words or stems (root forms of
words). Normalization might involve standardizing all terms to their root forms (e.g., using "run"
instead of "running").
Thesaurus generation, particularly in the context of clustering terms, has been practiced for centuries.
The process initially involved manual clustering, but with the advent of electronic data, automated
statistical clustering methods emerged, allowing for the creation of more extensive and dynamic
thesauri. These automatically generated thesauri reflect the statistical relationships between words, but
the clusters typically lack meaningful names and are just groups of similar terms. While automated
techniques can be computationally intensive, there are methods to reduce the computational load,
though they may not always produce the most optimal results.
● Handcrafted: This method relies on human experts to manually curate the thesaurus, selecting
terms and determining relationships.
● Co-occurrence-based: This method involves creating relationships between terms based on their
frequent co-occurrence in documents.
● Header-modifier based: This method uses linguistic parsing to identify relationships between
words that appear in similar grammatical contexts, such as Subject-Verb, Verb-Object, and
Adjective-Noun structures.
While manually generated thesauri are valuable, particularly in domain-specific contexts, general
thesauri like WordNet are less useful in some cases due to the wide variety of meanings a single word
can have.
2. Co-occurrence-Based Clustering
Co-occurrence-based clustering focuses on identifying words that often appear together in the same
document or context. This method builds relationships based on the statistical significance of these
co-occurrences. One commonly used approach is the mutual information measure, which quantifies the
likelihood that two words appear together by comparing their observed co-occurrence with the
expected co-occurrence based on their individual frequencies.
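A small sketch of this measure (in its pointwise form), comparing observed co-occurrence with the co-occurrence expected if the two terms were independent; all counts are invented:

import math

N = 1000          # total number of documents (invented)
df_a = 50         # documents containing term A
df_b = 40         # documents containing term B
df_ab = 25        # documents containing both A and B

def mutual_information(n, dfa, dfb, dfab):
    """Log of the observed joint probability over the probability expected under independence."""
    p_a, p_b, p_ab = dfa / n, dfb / n, dfab / n
    return math.log2(p_ab / (p_a * p_b))

print(f"MI(A, B) = {mutual_information(N, df_a, df_b, df_ab):.2f} bits")
# A positive value means the terms co-occur more often than chance and are candidates
# for the same thesaurus class; values near zero suggest independence.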
In header-modifier based clustering, term relationships are identified using syntactic structures, with a
focus on how words are grammatically related in a sentence. For example, the relationship between a
noun and the verbs or adjectives it frequently co-occurs with can reveal its meaning or context. This
linguistic approach helps generate a thesaurus based on these relationships, with similarity scores
calculated using mutual information measures.
Manual clustering for thesaurus generation follows the general steps outlined in Section 6.1, but with a
more hands-on approach to selecting and grouping terms.
The first step in manual thesaurus creation is determining the domain for clustering. Defining the
domain reduces ambiguities, such as those caused by homographs (words with multiple meanings), and
ensures that only relevant terms are included. Often, existing resources like concordances, dictionaries,
and other domain-specific materials are used to compile a list of potential terms.
A concordance is an alphabetical list of words in a text, along with the frequency of their occurrence and
references to the specific documents in which they appear. This helps identify terms that are important
to the domain and worth clustering.
● KWOC (Key Word Out of Context): Another term for a concordance—each keyword is listed alphabetically, out of its original context, along with its frequency and references to the items in which it occurs.
● KWIC (Key Word In Context): This displays the word in its sentence or phrase context, which can
help resolve ambiguities (e.g., determining whether "chips" refers to memory chips or wood
chips).
● KWAC (Key Word And Context): Displays keywords along with their surrounding context, which
helps in better understanding the meaning of terms.
2. Selecting Terms
Once the domain and the relevant terms are identified, the next task is to select the terms to be included
in the thesaurus. This involves a careful examination of word frequency and relevance to the domain.
High-frequency terms, such as "computer" in a data processing thesaurus, may not hold significant
information value, and so should be excluded.
With the selected terms, the next step is to cluster them based on their relationships. This is where the
art of manual clustering comes into play. The relationships can be guided by various principles, such as
synonymy, hierarchical relationships, or co-occurrence patterns.
For instance, if the term "computer" often co-occurs with "processor," "motherboard," and "RAM," they
may be grouped together in a cluster under the broader term "computer hardware." The strength of
these relationships may also be assessed through human judgment, refining the clusters.
After the clustering process, the resulting thesaurus is reviewed by multiple editors to ensure its
accuracy and usefulness. This quality assurance phase ensures that the terms are grouped appropriately
and that the relationships reflect the real-world usage of the words.
Specificity: Involves ensuring that the vocabulary is either general or specific enough to make sense for
the domain being clustered.
Clustering must consider these constraints to avoid ambiguity and ensure that terms or documents are
clustered appropriately.
Clustering is both a science and an art. While scientific algorithms can assist in grouping terms or
documents, human intuition and domain expertise are often required to fine-tune the clusters and
ensure their relevance. Good clustering can enhance the effectiveness of a retrieval system, improving
recall and providing more accurate results.
However, clustering also comes with the inherent challenge of balancing recall and precision. Increasing
recall (finding more relevant documents) may decrease precision (resulting in irrelevant documents
being retrieved), making it essential to use the clustering techniques carefully to avoid overwhelming
users with irrelevant information.
Automatic term clustering involves using computational techniques to generate clusters of terms that
are related to one another. These clusters form the basis of a statistical thesaurus, and the general
principle is that the more often two terms co-occur in the same context, the more likely they are to be
related to the same concept. The core difference between various techniques lies in the completeness of
the correlations they compute and the associated computational cost. More comprehensive methods
provide more accurate clusters but require more processing power.
The clustering process can vary, with some methods starting with an arbitrary set of clusters and refining
them iteratively, while others involve a single pass through the data. As the number of clusters grows, it
may be necessary to use hierarchical clustering to create more abstract clusters, improving the overall
organization of terms.
The basic process for automatic thesaurus generation follows the steps described in Section 6.1, starting
with selecting a set of items (i.e., documents or text segments) that represent the vocabulary to be
clustered. These items provide the context for clustering the words, which are the processing tokens
(individual words). The algorithms used for clustering can differ in how they group the terms, but they all
rely on calculating the relationships (similarities) between the words based on their occurrences.
Typically, each term is assigned to only one class, although a threshold-based approach can allow a term
to appear in multiple classes if its similarity score exceeds a certain value. The polythetic clustering
method is often employed, where each cluster is defined by a set of words, and an item's inclusion in a
cluster depends on how similar its words are to those in other items within the cluster.
In the Complete Term Relation Method, the relationship between every pair of terms is calculated to
determine how strongly they are related and thus should be clustered together. One common approach
to this is the vector model.
1. Vector Model Representation
In this model, items are represented by a matrix where the rows correspond to individual items
(documents) and the columns correspond to unique terms (processing tokens). The entries in the matrix
represent how strongly a term in a document relates to a concept. For example, the matrix in Figure 6.2
shows a database with 5 items and 8 terms. The similarity between two terms can then be calculated by
comparing their vectors across items.
2. Term-Term Similarity Matrix
The relationship between terms is determined using a similarity measure, which quantifies how close
two terms are to each other in terms of their occurrences. One simple measure is the dot product of the
vectors representing the two terms. This results in a Term-Term Matrix, which provides a similarity score
between every pair of terms.
The Term-Term Matrix generated is symmetric and reflects the pairwise similarities between all terms.
The diagonal entries are ignored (conventionally set to zero), since a term's similarity with itself carries no
clustering information. To create clusters, a threshold value is applied. If the similarity between two terms
exceeds this threshold, they are considered similar enough to belong to the same cluster.
For example, with a threshold of 10, two terms are considered similar if their similarity score is 10 or
higher. This results in a Term Relationship Matrix that only contains binary values: 1 for terms that are
considered similar and 0 for terms that are not.
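As a concrete sketch, the Term-Term and Term Relationship matrices can be computed directly from an item-term weight matrix. The weights below are made up for illustration and are not the actual Figure 6.2 values:

```python
import numpy as np

# Hypothetical item-term matrix: rows = items, columns = terms (weights are illustrative only)
item_term = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# Term-Term Matrix: dot product between every pair of term columns
term_term = item_term.T @ item_term
np.fill_diagonal(term_term, 0)          # the diagonal is not used for clustering

# Term Relationship Matrix: binary matrix obtained by thresholding the similarities
THRESHOLD = 10
term_relationship = (term_term >= THRESHOLD).astype(int)
print(term_relationship)
```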
3. Clustering Techniques
Once the Term Relationship Matrix is created, the next step is to assign terms to clusters based on their
relationships. Several clustering algorithms can be used:
● Clique-based Clustering: In this approach, a cluster is created only if every term in the cluster is
similar to every other term. This is the most stringent technique and results in smaller, tighter
clusters.
○ Algorithm: Start by selecting a term, place it in a new cluster, then iteratively add other
terms that meet the similarity threshold with all existing terms in the cluster. Continue
this process until all terms are clustered.
● Single Link Clustering: This method relaxes the constraint that every term in the cluster must be
similar to all other terms. Instead, any term that is similar to any term in an existing cluster is
added to that cluster. This leads to a partitioning of the terms, where each term is assigned to
one and only one cluster.
○ Algorithm: Start with an unclustered term, place it in a new class, then add other similar
terms to that class. Repeat until all terms are assigned to a cluster.
● Star Technique: In this approach, a central "core" term is selected, and any term related to this
core is included in the cluster. Other terms not yet in any class are selected as new cores until all
terms are clustered.
● String Technique: This method begins with a term, adds the most similar term that isn't already
in any cluster, and continues until no more related terms can be added. Then a new class is
started with an unclustered term.
The choice of clustering algorithm depends on factors such as the density of the Term Relationship
Matrix (i.e., how many relationships are present between terms) and the specific objectives of the
thesaurus. Dense matrices, with many term relationships, require tighter constraints (like clique
clustering) to prevent overly broad clusters. Sparse matrices may benefit from more relaxed constraints,
such as those used in single-link clustering.
● Clique-based clustering offers the highest precision, producing small, highly cohesive clusters
that are more likely to correspond to a single concept.
● Single link clustering offers the highest recall but may result in very broad clusters that mix
concepts. It is computationally more efficient but can cause irrelevant terms to be included in
clusters.
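The clique-based and single-link assignments described above can be sketched as follows, assuming `term_relationship` is the binary Term Relationship Matrix computed earlier. This is a greedy, one-pass reading of the algorithms in these notes, not the only possible implementation:

```python
def clique_clusters(rel):
    """Every term added must be related to ALL terms already in the cluster."""
    n = len(rel)
    clusters, assigned = [], set()
    for seed in range(n):
        if seed in assigned:
            continue
        cluster = [seed]
        for t in range(seed + 1, n):
            if t not in assigned and all(rel[t][m] for m in cluster):
                cluster.append(t)
        assigned.update(cluster)
        clusters.append(cluster)
    return clusters


def single_link_clusters(rel):
    """A term joins a cluster if it is related to ANY term already in it."""
    n = len(rel)
    clusters, assigned = [], set()
    for seed in range(n):
        if seed in assigned:
            continue
        cluster, frontier = [], [seed]
        while frontier:
            t = frontier.pop()
            if t in assigned:
                continue
            assigned.add(t)
            cluster.append(t)
            frontier.extend(u for u in range(n) if rel[t][u] and u not in assigned)
        clusters.append(cluster)
    return clusters
```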
6.2.2.2 Clustering Using Existing Clusters
An alternative clustering methodology involves starting with existing clusters rather than calculating
relationships between every pair of terms from scratch. This method aims to reduce the number of
similarity calculations by iteratively adjusting the assignments of terms to clusters. The process
continues until there is minimal movement between clusters, which indicates that the clustering has
stabilized.
Key Concepts
1. Centroids:
○ The centroid of a cluster can be thought of as the "center" of the cluster, similar to the
center of mass in physics. In terms of vectors, the centroid is the average vector of all
the terms in a cluster. This vector represents a point in the N-dimensional space, where
N is the number of items (terms or documents).
○ The centroid acts as a reference point to which all other terms in the cluster are
compared for similarity.
2. Iterative Process:
○ Initially, terms are assigned to clusters, and centroids are calculated for each cluster
based on the current assignment of terms.
○ The similarity between each term and the centroids of all clusters is then computed.
Each term is reassigned to the cluster whose centroid is most similar to the term.
○ This process is repeated: after each reassignment, new centroids are computed, and the
terms are re-evaluated to ensure they are in the most appropriate clusters.
○ The iteration stops when the assignments stabilize, meaning that terms no longer move
between clusters, or the movement is minimal.
Efficiency
The process of adjusting term assignments and recalculating centroids is computationally efficient: each
pass over the data is on the order of O(n), where n is the number of terms or items being clustered. The
initial assignment of terms to clusters is not critical, as the iterative process refines the assignments
until it stabilizes.
● Reduced Calculations: Since terms are not compared directly to every other term in the dataset,
the number of similarity calculations is reduced.
● Faster Convergence: The iterative nature of the process means that the algorithm can converge
quickly, as it focuses on updating assignments based on the centroid rather than recalculating
pairwise relationships from scratch.
Visual Representation
A graphical representation of terms and centroids can illustrate how the iterative process works. In this
representation:
● Terms are plotted as points in an N-dimensional space (or reduced to 2D for simplicity).
● The centroids of each cluster are plotted as central points (or vectors) that are recalculated after
each iteration.
● The clusters will eventually "settle" around their centroids, with terms assigned to the cluster
whose centroid they are closest to in terms of similarity.
Process Steps
1. Initial Assignment: Terms are initially assigned to clusters, either randomly or based on some
predefined criteria.
2. Calculate Centroids: The centroid for each cluster is computed as the average of all vectors
(terms) in the cluster.
3. Reassign Terms: The similarity between each term and the centroids of all clusters is computed.
Each term is reassigned to the cluster whose centroid is most similar.
4. Iterate: Steps 2 and 3 are repeated until the term assignments stabilize, meaning that terms no
longer change clusters or the changes are minimal.
5. Convergence: Once minimal movement is detected, the clustering process stops.
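A compact sketch of this iterative procedure (essentially a k-means-style loop over term vectors; the number of classes, the random initial assignment, and dot-product similarity are assumptions made for illustration):

```python
import numpy as np

def cluster_with_centroids(term_vectors, n_classes=3, max_iters=20, seed=0):
    """Repeatedly move each term to the cluster whose centroid it is most similar to."""
    rng = np.random.default_rng(seed)
    assignment = rng.integers(0, n_classes, size=len(term_vectors))  # Step 1: initial assignment
    for _ in range(max_iters):
        # Step 2: each centroid is the average vector of the terms currently in that class
        centroids = np.array([
            term_vectors[assignment == c].mean(axis=0) if np.any(assignment == c)
            else np.zeros(term_vectors.shape[1])
            for c in range(n_classes)
        ])
        # Step 3: reassign each term to the most similar centroid (dot-product similarity)
        new_assignment = (term_vectors @ centroids.T).argmax(axis=1)
        if np.array_equal(new_assignment, assignment):  # Steps 4-5: convergence reached
            break
        assignment = new_assignment
    return assignment

# term_vectors can be taken as the columns of the item-term matrix (one row per term):
# assignment = cluster_with_centroids(item_term.T.astype(float))
```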
The clustering using existing clusters method, when visualized graphically, shows how the centroids of
the clusters move over multiple iterations as the terms are reassigned to more appropriate clusters.
Visual Explanation
● Centroids are represented by solid black boxes (or squares), and they move as the terms in the
clusters are reassigned based on similarity.
● The ovals in the diagram represent the ideal cluster assignments, which provide a reference for
where terms should ideally belong.
● Initially, clusters might not be well-formed, but through iterations, the centroids gradually shift
closer to these ideal assignments.
Consider the example where terms are assigned to three arbitrary clusters:
These initial cluster assignments generate centroids for each cluster, which are computed as the average
of the weights of terms in that cluster across the items (documents or terms) in the database. For
example:
● For Class 1, the first value in the centroid is the average of the weights for Term 1 and Term 2 in
Item 1, and similar calculations are done for other terms and items.
● For Class 2 and Class 3, centroids are calculated similarly, averaging the weights of the terms in
those clusters.
Once centroids are established, the next step is to calculate the similarity between the terms and the
centroids. This can be done using the similarity measure described earlier in Section 6.2.2.1.
● After the first iteration, terms are reassigned to clusters based on the similarity between each
term and the centroids of the clusters. For instance, if the similarity measure between Term 1
and the centroid of Class 1 is the highest, Term 1 stays in Class 1.
● However, in cases where multiple clusters have similar centroids for a term (such as Term 5), the
term is assigned to the class with the most similar weights across all items in the class.
● Term 7, although not strongly aligned with Class 1, is assigned to Class 1 based on the highest
similarity.
Iteration and Convergence
As iterations continue, the centroids of the clusters adjust as terms are reallocated. This process is shown
in Figure 6.8, where the updated centroids reflect the reassignment of terms:
● For example, Term 7 moves from Class 1 to Class 3 because it is more similar to the terms in
Class 3 than in Class 1. The shift of Term 7 represents the stabilization of the cluster
assignments.
The iteration continues until minimal changes are detected, at which point the clustering process has
converged.
This approach has several limitations:
1. Fixed Number of Classes: The number of clusters is fixed at the start of the process and cannot
grow during clustering. The initial number of clusters may not be appropriate for the data,
limiting the flexibility of the approach.
2. Forced Assignments: Since each term must be assigned to one and only one cluster, the process
might force some terms into clusters where their similarity to the other terms is weak. This is
particularly problematic when a term's similarity to all clusters is low, resulting in poor cluster
cohesion.
3. No New Cluster Formation: Because no new clusters can be created during the process, terms
that might naturally form their own group are constrained to merge with other terms,
potentially leading to less meaningful clusters.
6.2.2.3 One Pass Assignments
The One Pass Assignment method is an efficient clustering technique that minimizes computational
overhead by processing each term only once. This method works by assigning terms to pre-existing
classes based on their similarity to the centroids of those classes.
Process Description
The first term forms the first class by itself. Each subsequent term is compared against the centroids of
the classes created so far; if its highest similarity meets the threshold (for example, 10), the term is added
to that class and the centroid is recalculated as the average of the terms now in the class. Otherwise, the
term starts a new class of its own.
Advantages:
● Minimal Computational Overhead: The method operates with a time complexity of O(n), as it
only requires one pass over the terms to assign them to classes.
● Quick to Compute: It does not require multiple iterations like other clustering methods, making it
faster.
Disadvantages:
● Non-Optimal Clusters: This method does not always produce optimal clusters. The order in
which terms are analyzed affects the resulting classes. If the order of terms changes, the
resulting clusters might differ, as the centroids are recalculated with each term added.
○ For instance, terms that would naturally belong to the same cluster could end up in
different clusters because the centroid values change with each new term added.
● Threshold Dependency: The threshold value plays a critical role in determining whether a term
is assigned to an existing class or a new one. A poorly chosen threshold could lead to suboptimal
clustering results.
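A minimal sketch of the one-pass assignment just described, using dot-product similarity against running centroids (the threshold value of 10 and the term vectors are assumptions):

```python
import numpy as np

def one_pass_clusters(term_vectors, threshold=10.0):
    """Each term is examined once and compared only to the centroids of existing classes."""
    classes = []    # each class is a list of term indices
    centroids = []  # running centroid vector for each class
    for idx, vec in enumerate(term_vectors):
        sims = [float(vec @ c) for c in centroids]
        if not sims or max(sims) < threshold:
            classes.append([idx])                     # start a new class with this term
            centroids.append(vec.astype(float))
        else:
            best = int(np.argmax(sims))
            classes[best].append(idx)                 # join the most similar existing class
            centroids[best] = term_vectors[classes[best]].mean(axis=0)  # recompute centroid
    return classes
```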
Item clustering is a method of grouping items based on their similarities, which is often used in the
context of organizing information, such as in libraries, filing systems, or digital repositories. This process
is similar to term clustering but focuses on grouping entire items (such as documents, products, or other
entities) rather than individual terms.
In traditional systems, manual item clustering is an essential part of organizing items. A person reads
through each item and assigns it to one or more categories. These categories represent groups of items
that share common characteristics or topics. For physical systems, such as books in a library, an item (like
a book) is typically placed in a primary category, though it can often be cross-referenced in other
categories through index terms or tags.
In item clustering, the goal is to group similar items based on the similarity of the terms they contain.
This involves creating an Item-Item similarity matrix where each item is compared to every other item
based on the terms they share.
Similarity Calculation
The similarity between two items is calculated based on the number of terms they have in common. This
is done by comparing rows in an item-term matrix, where each row represents an item and each column
represents a term. When comparing two items, the focus is on their shared terms.
This is similar to term clustering, but instead of comparing terms (as in Sections 6.2.2.1 to 6.2.2.3), you
are now comparing entire items based on the terms they contain.
To illustrate item clustering, assume we are working with a set of items and their associated terms as
described in Figure 6.2 (not shown here). Here's how the process would work:
1. Item-Term Matrix: Construct an item-term matrix where each item is represented as a row, and
each term is a column. Each cell in the matrix holds the weight or frequency of the term in the
item.
2. Item-Item Matrix: Using the item-term matrix, calculate the similarity between each pair of
items. The similarity function might calculate, for example, the cosine similarity or the Jaccard
index, depending on the chosen metric.
3. Thresholding: Similar to term clustering, a threshold can be set to determine the strength of the
similarity between two items. If the similarity exceeds this threshold, the items are considered
related and can be grouped into the same cluster.
Once the similarities between items are calculated, the resulting matrix is an Item Relationship Matrix.
This matrix visually represents the relationships between items, where the values indicate the degree of
similarity between each pair of items.
● High values in the matrix indicate that two items are very similar to each other, and they might
be assigned to the same cluster.
● Low values indicate that the items are dissimilar and likely belong to different clusters.
Using the similarity equation and a threshold of 10, the system would produce an Item Relationship
Matrix (like Figure 6.10, not shown). This matrix would highlight which items are similar enough to be
grouped together, creating clusters of related items. These clusters can then be further refined or
adjusted manually or through automated methods, depending on the needs of the system.
1. Matrix Creation: Build an item-term matrix to represent the presence or absence (and weight)
of terms in items.
2. Similarity Calculation: Calculate the similarity between items based on their term content.
3. Clustering: Apply a threshold to assign items to clusters based on their similarity scores.
4. Iterative Refinement: The process can be iterated, adjusting clusters and centroids as needed,
similar to the techniques described in term clustering.
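Item clustering reuses the same machinery but compares rows (items) instead of columns (terms). A brief sketch, assuming `item_term` is an item-term weight matrix as before:

```python
import numpy as np

def item_relationship_matrix(item_term, threshold=10):
    """Item-Item similarity via dot products of item rows, then binary thresholding."""
    item_item = item_term @ item_term.T          # similarity between every pair of items
    np.fill_diagonal(item_item, 0)               # an item's similarity with itself is not used
    return (item_item >= threshold).astype(int)  # the Item Relationship Matrix
```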
Hierarchical Clustering (HACM)
Visual Representation:
● Dendrograms: A tree-like diagram that shows the clustering process and helps in visualizing the
similarity between items and clusters.
○ The size of the cluster (shown by the ellipse) can indicate the number of items in that
cluster.
○ Linkages between clusters can be visualized by dashed lines to show reduced similarity.
Lance-Williams Dissimilarity Update Formula:
The Lance-Williams dissimilarity update formula is central to many HACM approaches and is used to
calculate the dissimilarity (or distance) between clusters. It allows the combination of two clusters into a
new one and provides a general approach to updating dissimilarities between clusters.
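For reference, the Lance-Williams update is usually written as follows (this is the standard textbook form, with d a dissimilarity and clusters i and j merged into i ∪ j; it is not reproduced verbatim from these notes):

d(k, i \cup j) = \alpha_i \, d(k, i) + \alpha_j \, d(k, j) + \beta \, d(i, j) + \gamma \, \lvert d(k, i) - d(k, j) \rvert

Different choices of the coefficients \alpha_i, \alpha_j, \beta, \gamma yield single link, complete link, group average, Ward's method, and the other common HACM variants.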
● Group Average: One of the most effective methods for hierarchical clustering, where the
dissimilarity between two clusters is the average of all dissimilarities between items in the first
cluster and items in the second cluster.
○ This method has been shown to produce the best results in document clustering tasks.
Ward’s Method:
● Ward's Method: Focuses on minimizing the variance within clusters by using the Euclidean
distance between centroids of clusters.
○ The formula used is based on the variance (I) of the points in the clusters, and the
algorithm seeks to minimize the squared Euclidean distance between centroids
normalized by the number of items in each cluster.
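Ward's criterion in its usual centroid form (again a standard reference formula, not the exact expression from the source figure): merging clusters A and B is penalized by

d(A, B) = \frac{|A| \cdot |B|}{|A| + |B|} \, \lVert \bar{x}_A - \bar{x}_B \rVert^2

where \bar{x}_A and \bar{x}_B are the cluster centroids and |A|, |B| the number of items in each cluster, so the algorithm always merges the pair whose combination adds the least within-cluster variance.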
This method enhances the retrieval and organization of items by structuring them into hierarchical
clusters, facilitating easier browsing and retrieval.
Unit-IV
User Search Techniques:
Search Statements and Binding:
Search Statements:
Represent the information need of users, specifying the concepts they wish to locate.
May allow users to assign weights to different concepts based on their importance.
Binding: Transforming abstract forms into more specific forms (e.g., user's vocabulary or
past experiences).
The goal is to logically subset the total item space to find relevant information.
Some examples for Statistical Weighting in Search are Document Frequency and Total
Frequency for a specific term.
Document Frequency (DF): How many documents in the database contain a specific term.
Total Frequency (TF): How often a specific term appears across all documents in the
database.
These statistics are dynamic and depend on the current contents of the database being
searched.
Levels of Binding:
1. User's Binding: The initial stage where users define concepts based on their vocabulary and
understanding.
2. Search System Binding: The system translates the query into its own metalanguage (e.g.,
statistical systems, natural language systems, concept systems).
o Concept Systems: Map the search statement to specific concepts used in indexing.
3. Database Binding: The final stage where the search is applied to a specific database using
statistics (e.g., Document Frequency, Total Frequency).
o Concept Indexing: Concepts are derived from statistical analysis of the database.
Longer search queries improve the ability of IR systems to find relevant items.
Selective Dissemination of Information (SDI) systems use long profiles (75-100 terms).
In large systems, typical ad hoc queries are around 7 terms.
Short search queries highlight the need for automatic search expansion algorithms.
This formula sums the products of the corresponding term weights of two items, treating each item's index
as a vector:

SIM(Item_i, Item_j) = \sum_{k} Term_{i,k} \cdot Term_{j,k}

If Item_j is replaced with Query_q, the same formula generates the similarity between every Item_i and
Query_q. The problem with this simple measure is the normalization needed to account for variances in
the length of items. Additional normalization is also used to have the final results come out between zero
and +1.
Similarity Measures:
1) Cosine Similarity
Vector-Based:
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-
dimensional space. In general it ranges from -1 to +1; with non-negative term weights it lies between 0 and +1.
Value = 0: the vectors are orthogonal (no terms in common).
Value = 1: the vectors are coincident (identical direction).
Efficient Computation:
Can be calculated efficiently using dot products, making it a popular choice for IR systems.
2) Jaccard Similarity:
Set-Based:
The Jaccard similarity coefficient measures the similarity between two finite sets.
Range of Values:
Defined as the size of the intersection divided by the size of the union of the two sets, so the value
always lies between 0 and +1 and grows as the number of common elements increases.
Applications:
Useful for comparing the overlap between documents, tags, or other categorical data.
3) Dice Method:
The Dice measure simplifies the denominator used in the Jaccard measure and introduces a
factor of 2 in the numerator (equivalently, the factor of 2 can be folded into the denominator). Its
normalization does not depend on the number of terms the two vectors have in common. As long
as the vector values are the same, independent of their order, the Cosine and Dice normalization
factors do not change.
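The three measures can be compared side by side. A small sketch over two term-weight vectors (the vectors themselves are illustrative, and the Jaccard and Dice forms shown are the weighted-vector variants):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def dice(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 2 * dot / (sum(a * a for a in x) + sum(b * b for b in y))

query = [1, 0, 2, 0]
item  = [2, 1, 2, 0]
print(cosine(query, item), jaccard(query, item), dice(query, item))
```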
Used on its own, a similarity algorithm returns the complete database as search results, since every
item receives some score; many of the items have a similarity close or equal to zero (or the minimum
value the similarity measure produces). For this reason, thresholds are usually associated with the
search process. The threshold defines which items appear in the resultant Hit file for the query.
Thresholds are either a value that the similarity measure must equal or exceed, or a number that limits
the number of items in the Hit file.
Clustering Hierarchy:
The items are stored in clusters that are represented by the centroid for each cluster. The
hierarchy is used in search by performing a top-down process. The query is compared to the
centroids “A” and “B.” If the results of the similarity measure are above the threshold, the
query is then applied to the nodes’ children. If not, then that part of the tree is pruned and
not searched. This continues until the actual leaf nodes that are not pruned are compared.
The risk is that the average may not be similar enough to the query for continued search, but
specific items used to calculate the centroid may be close enough to satisfy the search.
In the example hierarchy, each letter at a leaf (bottom node) represents an item (i.e., K, L, M, N, D, E, F, G,
H, P, Q, R, J), while the letters at the higher nodes (A, C, B, I) represent the centroids of their immediate
children.
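A sketch of this top-down search with pruning; the node structure (a dictionary holding a centroid or item vector, an id, and optional children) and the similarity function are assumptions:

```python
def hierarchy_search(node, query_vec, similarity, threshold, hits):
    """Top-down search of a cluster hierarchy: prune any subtree whose centroid is not
    similar enough to the query; leaves that survive are added to the Hit list."""
    if similarity(query_vec, node["vector"]) < threshold:
        return                                  # prune: this branch is not searched further
    children = node.get("children", [])
    if not children:                            # leaf node: an actual item such as K, L, M ...
        hits.append(node["id"])
        return
    for child in children:                      # compare the query against each child centroid
        hierarchy_search(child, query_vec, similarity, threshold, hits)

# Usage: start from the top-level centroids ("A" and "B" in the example above)
# hits = []
# for top in (node_A, node_B):
#     hierarchy_search(top, query_vec, similarity, threshold=10, hits=hits)
```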
Hidden Markov Models Techniques:
In HMMs the documents are considered unknown statistical processes that can generate
output that is equivalent to the set of queries that would consider the document relevant.
Another way to look at it is by taking the general definition that a HMM is defined by output
that is produced by passing some unknown key via state transitions through a noisy channel.
The observed output is the query, and the unknown keys are the relevant documents
The development for a HMM approach begins with applying Bayes rule to the conditional
probability:
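The equation itself is elided in these notes; applying Bayes' rule in the standard way (with D a document and Q the observed query) would give

P(D \text{ is relevant} \mid Q) = \frac{P(Q \mid D \text{ is relevant}) \, P(D \text{ is relevant})}{P(Q)}

and, since P(Q) is the same for every document, documents can be ranked by the numerator alone.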
By applying Bayes rule to the conditional probability, we can derive an expression for the
posterior probability, which represents the probability of a document being relevant given
the query and the observed output. This posterior probability is then used to make decisions
on document relevance in HMMs. The goal is to find the most likely sequence of hidden
states (relevant documents) that generate the observed output (query).
A Hidden Markov Model is defined by a set of states, a transition matrix defining the
probability of moving between states, a set of output symbols and the probability of the
output symbols given a particular state. The set of all possible queries is the output symbol
set and the Document file defines the states.
Thus the HMM process traces itself through the states of a document (e.g., the words in the
document) and at each state transition has an output of query terms associated with the
new state.
The biggest problem in using this approach is to estimate the transition probability matrix
and the output (queries that could cause hits) for every document in the corpus. If there was
a large training database of queries and the relevant documents they were associated with
that included adequate coverage, then the problem could be solved using Estimation-
Maximization algorithms.
Ranking Algorithms:
A by-product of use of similarity measures for selecting Hit items is a value that can be used
in ranking the output. Ranking the output implies ordering the output from most likely items
that satisfy the query to least likely items. This reduces the user overhead by allowing the
user to display the most likely relevant items first.
The original Boolean systems returned items ordered by date of entry into the system versus
by likelihood of relevance to the user’s search statement. With the inclusion of statistical
similarity techniques into commercial systems and the large number of hits that originate
from searching diverse corpora, such as the Internet, ranking has become a common feature
of modern systems.
In most of the commercial systems, heuristic rules are used to assist in the ranking of items.
Generally, systems do not want to use factors that require knowledge across the corpus
(e.g., inverse document frequency) as a basis for their similarity or ranking functions because
it is too difficult to maintain current values as the database changes and the added
complexity has not been shown to significantly improve the overall weighting process.
RetrievalWare System:
RetrievalWare first uses indexes (inversion lists) to identify potential relevant items. It then
applies coarse grain and fine grain ranking. The coarse grain ranking is based on the
presence of query terms within items. In the fine grain ranking, the exact rank of the item is
calculated. The coarse grain ranking is a weighted formula that can be adjusted based on
completeness, contextual evidence or variety, and semantic distance.
Completeness is the proportion of the number of query terms (or related terms if a query
term is expanded using the RetrievalWare semantic network/thesaurus) found in the item
versus the number in the query. It sets an upper limit on the rank value for the item. If
weights are assigned to query terms, the weights are factored into the value. Contextual
evidence occurs when related words from the semantic network are also in the item.
Thus if the user has indicated that the query term “charge” has the context of “paying for an
object” then finding words such as “buy,” “purchase,” “debt” suggests that the term
“charge” in the item has the meaning the user desires and that more weight should be
placed in ranking the item. Semantic distance evaluates how close the additional words are
to the query term.
Synonyms add additional weight; antonyms decrease weight. The coarse grain process
provides an initial rank to the item based upon existence of words within the item. Since
physical proximity is not considered in coarse grain ranking, the ranking value can be easily
calculated.
Fine grain ranking considers the physical location of query terms and related words using
factors of proximity in addition to the other three factors in coarse grain evaluation. If the
related terms and query terms occur in close proximity (same sentence or paragraph) the
item is judged more relevant. A factor is calculated that maximizes at adjacency and
decreases as the physical separation increases.
Although ranking creates a ranking score, most systems try to use other ways of indicating
the rank value to the user as Hit lists are displayed. The scores have a tendency to be
misleading and confusing to the user. The differences between the values may be very close
or very large.
Relevance Feedback:
● Relevance feedback is a process where the search system uses feedback from users
about relevant and non-relevant items to adjust and refine future queries. The goal is to
improve search results by weighting the query terms based on user-provided relevance
judgments.
● Rocchio's Work (1965): The first major work on relevance feedback, published by
Rocchio, used a vector space model to adjust query terms based on relevance
feedback. The main idea was to:
○ Increase the weight of terms from relevant items.
○ Decrease the weight of terms from non-relevant items.
● The formula for adjusting the query vector based on relevance feedback includes:
○ The original query vector.
○ Vectors for relevant items and non-relevant items.
○ Constants to adjust the weights of terms.
● The formula used for adjusting the query vector is:
Q_{new} = \alpha Q_{old} + \beta \sum_{r} Vector_{r} - \gamma \sum_{nr} Vector_{nr}
where:
○ Q_{new} is the revised query vector.
○ Q_{old} is the original query vector.
○ r and nr represent relevant and non-relevant items, respectively.
○ \alpha, \beta, \gamma are constants used to adjust the weights.
(A code sketch of this adjustment follows this list.)
● Positive Feedback: This increases the weight of terms from relevant items and moves
the query closer to retrieving relevant documents.
● Negative Feedback: This decreases the weight of terms from non-relevant items but
does not necessarily push the query toward more relevant results. Positive feedback is
typically more effective than negative feedback.
● In many cases, only positive feedback is used because it more effectively improves the
query and results.
● A query modification example shows how relevance feedback impacts the similarity
measures of documents. A query initially produces both relevant and non-relevant items,
and the process adjusts the weights of query terms.
○ Non-relevant documents may contain terms that, due to their weight in the
original query, initially seem relevant.
○ Relevant documents get higher similarity scores after feedback is applied, while
non-relevant documents are reduced in weight.
○ In the example, a term (e.g., "Macintosh") that was not part of the original query
is added due to its appearance in relevant documents.
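A minimal sketch of this Rocchio-style adjustment; the constant values are illustrative defaults, not values given in these notes, and some variants average rather than sum the feedback vectors:

```python
import numpy as np

def rocchio(q_old, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Revise the query vector from user relevance judgments."""
    q_new = alpha * np.asarray(q_old, dtype=float)
    if len(relevant):
        q_new += beta * np.sum(relevant, axis=0)       # pull the query toward relevant items
    if len(non_relevant):
        q_new -= gamma * np.sum(non_relevant, axis=0)  # push it away from non-relevant items
    return np.clip(q_new, 0, None)                     # negative term weights are usually dropped

# Positive-only feedback is obtained simply by passing an empty non_relevant list.
```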
Pseudo-Relevance Feedback:
● In some cases, manual feedback is not required. Instead, the system assumes the
top-ranked results from an initial query are relevant, even without explicit user input.
This is known as pseudo-relevance feedback or blind feedback.
○ The system then uses these top-ranked items to modify and expand the query.
○ This technique can improve performance over no feedback at all and is used in
TREC (Text REtrieval Conference) evaluation tests, where systems improve on
subsequent query passes by using feedback from the first pass.
Key Points:
1. Relevance feedback is a powerful tool for improving search results by adjusting queries
based on the relevance of retrieved items.
2. The Rocchio formula helps modify queries by increasing or decreasing term weights
from relevant and non-relevant items.
3. Positive feedback is generally more effective than negative feedback in moving the
query closer to the user’s information needs.
4. Pseudo-relevance feedback can automate this process by assuming top-ranked items
are relevant, which can lead to better performance in systems with little user interaction.
Selective Dissemination of Information Search:
Key Concepts:
Relevance Feedback:
Classification Techniques:
1. Statistical Classification:
○ Some systems, like Schutze et al., use statistical techniques for classification,
categorizing items as relevant or non-relevant. They employ error minimization
techniques in high-dimensional spaces (many unique terms), often using
methods like linear discriminant analysis or logistic regression.
2. Latent Semantic Indexing (LSI):
○ To reduce dimensionality in the classification process, latent semantic indexing
(LSI) is applied, which uses patterns of term co-occurrence to identify the most
important features and reduce computational complexity.
3. Neural Networks:
○ Another technique is the use of neural networks, where the system learns to
classify items based on training data. This method is flexible but runs the risk of
overfitting, where the model performs well on training data but poorly on new
data.
One unique challenge in dissemination systems is how to organize and display items in the
user’s mail file. Since items are added continuously, the order of the items can change as new
items are processed and ranked. This constant reordering can confuse users who may have
relied on spatial or positional memory of items.
Weighted Searches of Boolean Systems:
● Boolean queries use logical operators (AND, OR, NOT) to combine search terms.
● Natural language queries are easier to work with in statistical models and are directly
compatible with similarity measures like cosine similarity, which are based on
comparing the relevance of documents.
However, there are challenges when mixing Boolean logic with weighted indexing systems,
where each term in a query has a weight representing its importance.
● The strict application of Boolean operators (AND, OR) is too restrictive or general when
used with weighted indexes:
○ AND: Typically too restrictive, as it only returns documents that contain all
specified terms.
○ OR: Too general, as it may return too many documents by including those that
contain at least one term.
● Lack of ranking: Boolean queries do not rank results by relevance, unlike weighted
systems, which assess how well documents match a query.
Another approach to integrating Boolean queries with weighted terms is the P-norm model.
This model views query terms and items as coordinates in an n-dimensional space:
● For OR queries, the ideal situation is the maximum distance from the origin (where all
values are zero).
● For AND queries, the ideal situation is the unit vector (where all values are 1).
The P-norm model uses these geometric principles to rank documents based on their distance
from the ideal points for each query.
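For reference, the P-norm similarity functions usually attributed to Salton, Fox, and Wu take the following standard form (d_i is the weight of term i in the document, a_i the query weight of that term, and p ≥ 1 controls the strictness; the exact expression is not reproduced in these notes):

SIM(Q_{or}, D) = \left( \frac{\sum_i a_i^{p} d_i^{p}}{\sum_i a_i^{p}} \right)^{1/p}
SIM(Q_{and}, D) = 1 - \left( \frac{\sum_i a_i^{p} (1 - d_i)^{p}}{\sum_i a_i^{p}} \right)^{1/p}

As p grows large these approach strict Boolean behaviour (maximum/minimum), while p = 1 reduces them to a purely weighted, vector-like interpretation.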
Salton proposed a method where traditional Boolean operations (AND, OR, NOT) are applied first, and
then the results are refined using the weights assigned to query terms. This algorithm ensures that the
final result set consists of the most relevant items based on both the strict Boolean logic and the
weighting of terms.
An example is given of a query with two terms (e.g., "Computer" AND "Sale") where the weight
for "Sale" varies. The algorithm adjusts the results as the weight for "Sale" changes from 0.0 to
1.0. The resulting set of documents will progressively include more or fewer items based on the
weight of each term.
Searching the INTERNET and Hypertext:
The primary methods for searching the Internet involve servers that create indexes of web
content to allow efficient searching. Popular search engines like Yahoo, AltaVista, and Lycos
automatically visit websites, retrieve textual data, and index it. These indexes contain URLs
(Uniform Resource Locators) and allow users to search for relevant documents.
These systems rely on ranking algorithms based on word occurrences within the indexed text to
display results.
2. Intelligent Agents:
Intelligent Agents are automated tools that assist users in searching for information on the
Internet. These agents can operate independently, traverse various websites, and gather
information based on a user’s query. The key characteristics of intelligent agents are:
3. Relevance Feedback:
To enhance search capabilities, automatic relevance feedback is used in intelligent agents. This
two-step process helps expand the search by incorporating corpora-specific terminology,
allowing the agent to better align the search with the language used by the website authors.
● As the agent moves from site to site, it adjusts its query based on the relevance
feedback it receives.
● There is a challenge regarding how much feedback from one site should be carried over
to the next site.
● Incremental relevance feedback can help address this, allowing the agent to adjust as it
learns more about the user’s needs and the terminology used across sites.
4. Hyperlink-Based Search:
Hyperlinks play a key role in web search. A hyperlink is a reference to another document, often
embedded in the displayed text. Hyperlinks can lead to additional information or related content,
which can be useful for a user’s search.
● Hyperlinks create a static network of linked items that a user can explore by following
links.
● The context of hyperlinks determines their relevance—some links point to directly related
content (e.g., embedded images or quoted text), while others may point to supporting or
tangentially related information.
● The challenge is to automatically follow hyperlinks and use the additional information
from linked items to resolve the user's search need. The linked items must be assigned
appropriate weights to determine their relevance.
5. New Search Capabilities on the Internet:
New systems are emerging that provide personalized information based on a user’s preferences
or interests. Examples include:
● Pointcast and FishWrap: These systems send users personalized content based on their
specified interests, continually updating as new information becomes available.
● SFGATE: A similar service providing news updates to users based on their preferences.
Some systems, like Firefly, use collaborative filtering to recommend products or content based
on a user’s preferences and the preferences of similar users. These systems continuously adapt
to the user’s evolving interests and compare the user’s behavior with others to predict relevant
items.
● Firefly tracks a user’s preferences for movies and music, compares them with other
users, and suggests items based on shared interests.
● Empirical Media: This system also uses collaborative intelligence by combining individual
user rankings and group behavior to rank items more accurately.
● Early research into using queries across multiple users for document classification
showed limited success. However, the large-scale interaction on the Internet today
provides opportunities for more effective clustering and learning algorithms, improving
information retrieval capabilities.
Key Points:
1. Internet search engines create indexes by crawling and extracting text from websites.
2. Intelligent agents enhance search by operating autonomously and adjusting based on
user feedback and reasoning.
3. Relevance feedback helps agents refine searches by learning from user interactions and
adjusting queries.
4. Hyperlinks provide a network for users to explore related items, and understanding
hyperlink context is critical for effective navigation.
5. Personalized information systems and collaborative filtering can adapt to user
preferences and recommend relevant content.
6. Learning algorithms on the Internet enable better document classification and retrieval,
especially when large-scale user interactions are taken into account.
Information Visualisation:
Focus on Information Retrieval Systems:
The primary focus has been on indexing, searching, and clustering, rather than information
display.
Visualization tools for IR should be able to:
Display changes in data while maintaining the same representation (e.g., showing new
linkages between clusters).
Enable interactive user input for dynamic movement between information spaces.
Enable interactive user input for dynamic movement between information spaces.
Perfect search algorithms achieving near 100% precision and recall have not been realized,
as shown by TREC and other information forums.
Information Visualization helps reduce user overhead by optimally displaying search results,
aiding users in selecting the most relevant items.
Visual displays consolidate search results into a form that is easier for users to process,
though they do not directly answer specific retrieval needs.
Cognitive engineering applies design principles based on human cognitive processes such as
attention, memory, imagery, and information processing.
Research from 1989 highlighted the importance of mental depiction in cognition, showing
that visual representation is as important as symbolic description.
Types of Visualization:
Historical Roots:
o The concept of visualization dates back over 2400 years to Plato, who believed that
perception involves translating physical energy from the environment into neural
signals, interpreted and categorized by the mind. The mind processes not only
physical stimuli but also computer-generated inputs.
o Text-only interfaces simplify user interaction but limit the mind's powerful
information-processing functions, which have evolved over time.
Development of Information Visualization:
o Information visualization emerged as a discipline in the 1970s, influenced by
debates about how the brain processes mental images.
o Advancements in technology and information retrieval were necessary for this field
to become viable.
o One of the earliest pioneers, Doyle (1962), introduced the idea of semantic road
maps to help users visualize the entire database. These maps show items related to
specific semantic themes, enabling focused querying of the database.
o In the late 1960s, the concept expanded to spatial organization of information,
leading to mapping techniques that visually represented data.
o Sammon (1969) implemented a non-linear mapping algorithm to reveal document
associations, further enhancing the spatial organization of data.
Advancements in the 1990s:
o The 1990s saw technological advancements and the exponential growth of available
information, leading to practical research and commercialization in information
visualization.
o The WIMP interface (Windows, Icons, Menus, and Pointing devices) revolutionized
how users interact with computers, although it still required human activities to be
optimized for computer understanding.
o Donald A. Norman (1996) advocated for reversing this trend, stating that
technology should conform to people, not the other way around.
Importance of Information Visualization:
o Information visualization has the potential to enhance the user’s ability to
efficiently find needed information, minimizing resource expenditure and optimizing
interaction with complex data.
Norman's emphasis: to optimize information retrieval, focus on understanding how users interface
with and process information, and adapt computer interfaces to those capabilities.
Challenges with text-only interfaces:
o Text reduces complexity but restricts the more powerful cognitive functions of the
mind.
o Textual search results often present large numbers of items, requiring users to
manually sift through them.
Information visualization aids the user by:
o Reducing time to understand search results and identify clusters of relevant
information.
o Revealing relationships between items, as opposed to treating them as
independent.
o Performing simple actions that enhance search functions.
Fox et al. study findings:
o Users wanted systems that allowed them to locate and explore patterns in
document databases.
o Visual tools should help users focus on areas of personal interest and identify
emerging topics.
o Desired interfaces to easily identify trends and topics of interest.
Benefits of visual representation:
o Cognitive parallel processing of multiple facts and data relationships.
o Provides high-level summaries of search results, allowing users to focus on relevant
items.
o Enables visualization of relationships between items that would be missed in text-
only formats.
Usage of aggregates in visualization:
o Allows users to view items in clusters, refining searches by focusing on relevant
categories.
o Helps correlate search terms with retrieved items, showing their influence on search
results.
Limitations of textual display:
o No mechanism to show relationships between items (e.g., "date" and "researcher"
for polio vaccine development).
o Text can only sort items by one attribute, typically relevance rank.
Human cognition as a basis for visualization:
o Visualization techniques are developed heuristically and are tested based on
cognitive principles.
o Commercial pressures drive the development of intuitive visualization techniques.
o The 1970s debate centered on whether vision is purely data collection or also
involves information processing.
o Arnheim proposed that treating perception and thinking as separate functions leads
to treating the mind as a serial automaton, where perception deals with instances
and thinking handles generalizations.
Visualization as a Process:
o Visualization involves transforming information into a visual form that enables users
to observe and understand it.
o Visual input is treated as part of the understanding process, not as discrete facts.
o Gestalt Psychology: The mind follows specific rules to combine stimuli into a
coherent mental representation:
o Shifting the cognitive load from slower cognitive processes to faster perceptual
systems enhances human-computer interactions.
Example: The visual system can easily detect borders between changes in orientation of the
same object.
Orientation and Grouping:
If information semantics are organized in orientations, the brain's clustering function can
detect groupings more easily than with different objects, assuming orientations are
meaningful.
This uses the feature detectors in the retina for maximum efficiency in recognizing patterns.
Pre-attentive Processing in Boundary Detection:
The preattentive process detects boundaries between orientation groups of the same
object.
Identifying rotated objects, such as a rotated square, requires more effort than detecting
boundaries between different orientations of the same object.
Character Recognition:
When dealing with characters, identifying a rotated character (in an uncommon direction) is
more challenging.
Conscious Processing:
Conscious processing capabilities come into play when detecting the different shapes and
borders, such as in distinguishing between different boundaries in Figure 8.1.
The time it takes to visualize and recognize different boundaries can be measured, showing
the difference between pre-attentive and conscious processing.
A light object on a dark background appears larger than when the object is dark and the
background is light.
This optical illusion suggests using bright colors for small objects in visual displays to
enhance their visibility.
Color Attributes:
o Saturation: The degree to which a hue differs from gray with the same lightness.
Complementary Colors: Two colors that combine to form white or gray (e.g., red/green,
yellow/blue).
Color in Visualization: Colors are frequently used to organize, classify, and enhance features.
o Humans are naturally attracted to primary colors (red, blue, green, yellow), and they
retain images associated with these colors longer.
o However, colors can evoke emotional responses and cause aversion in some people,
and color blindness should be considered when using color in displays.
Depth in Visualization:
Depth can be represented using monocular cues like shading, blurring, perspective, motion,
and texture.
Shading and contrast: These cues are more affected by lightness than by contrast, so using
bright colors against a contrasting background can enhance depth representation.
Innate Depth Perception: Depth and size recognition are learned early in life; for instance,
six-month-old children already understand depth (Gibson-60).
Depth is a natural cognitive process used universally to classify and process information in
the real world.
Configural Clues: These are arrangements of objects that allow easy recognition of high-
level abstract conditions, replacing more concentrated lower-level visual processes.
Example: In a regular polygon (like a square), modifying the sides creates a configural effect,
where visual processing quickly detects deviations from the normal shape.
These effects are useful in detecting changes in operational systems or visual displays that
require monitoring.
Human Visual System: The system is sensitive to spatial frequency, which refers to the
number of light-dark cycles per degree of visual field. A cycle represents one complete light-
dark change.
Sensitivity Limits: The human visual system is less sensitive to spatial frequencies above about 5-6
cycles per degree; higher spatial frequencies are harder to process. Reducing the level of fine detail can
therefore reduce cognitive processing time, allowing faster reactions to motion (e.g., by animals like cats).
Motion and Pattern Extraction: Distinct, well-defined images are easier to detect for motion
or changes than blurred images. Certain spatial frequencies aid in extracting patterns of
interest, especially when motion is used to convey information.
Application in Aircraft Displays: Dr. Mary Kaiser from NASA-AMES is researching
perceptually derived displays, focusing on human vision filters like spatial resolution,
stereopsis, and attentional focus, to improve aircraft information displays.
Learning from Usage: The human sensory system learns and adapts based on usage, making
it easier to process familiar orientations (e.g., horizontal and vertical) than others, which
require additional cognitive effort.
Bright Colors: Bright colors in displays naturally attract attention, similar to how brightly
colored flowers are noticed in a garden, enhancing focus.
Depth Representation: The cognitive system is adept at interpreting depth in objects, such
as the use of rectangular depth for information representation, which is easier for the visual
system to process than abstract three-dimensional forms (e.g., spheres).
Pre-existing Biases: For example, if a user has been working with clusters of items, they may
see non-existent clusters in new presentations. This could lead to misinterpretation.
Past Experiences Influence Interpretation: Users may interpret visual information based on
what they commonly encounter in their daily lives, which may differ from the designer’s
intended representation. This highlights the need to consider context when designing
visualizations to avoid confusion and misinterpretation.
1. Document Clustering:
o Goal: To visually represent the document space defined by search criteria. This
space contains clusters of documents grouped by their content.
o Visualization Techniques: The clusters are displayed with indications of their size
and topic, helping users navigate towards items of interest. This method is
analogous to browsing a library’s index and then exploring the relevant books in
various sections retrieved by the search.
o Benefits: Allows users to get an overview of related documents, helping them to
locate the most relevant items more efficiently.
2. Search Statement Analysis:
o Goal: To assist users in understanding why specific items were retrieved in response
to a query. This analysis provides insights into how search results are related to the
query terms, especially with modern search algorithms and ranking techniques.
o Challenges: Unlike traditional Boolean systems, modern search engines use
techniques like relevance feedback and thesaurus expansion, making it harder for
users to correlate the expanded terms in a query with the retrieved results.
o Visualization Techniques: Tools are used to display the full set of terms, including
additional ones introduced through relevance feedback or thesaurus expansion.
Alongside these terms, the retrieved documents are shown, with indications of how
important each term is in the ranking and retrieval process.
o Benefits: This approach helps users understand the reasoning behind the retrieval
results, aiding in refining queries and improving search accuracy.
Structured databases and link analysis play a crucial role in improving the effectiveness of
information retrieval (IR) systems. Both methods focus on organizing and correlating data in a way
that enhances users' ability to locate relevant documents and understand their context. Here's an
overview of these techniques and their applications in information retrieval:
Purpose: Structured databases are essential for storing and managing citation and semantic
data about documents. This data often includes metadata such as author, publication date,
topic, and other descriptors that help in categorizing and retrieving information.
Role: These structured files allow efficient access and manipulation of the data, which is
crucial when dealing with large datasets like academic citations or historical records. A well-
organized database improves the speed and accuracy of search results by ensuring that
information is easily accessible and related queries can be processed more effectively.
Link Analysis:
Purpose: Link analysis examines the relationships between multiple documents, recognizing
that information is often connected across different sources. This technique is particularly
useful when searching for topics that involve multiple interrelated documents.
Example: A time/event link analysis can correlate documents discussing a specific event,
such as an oil spill caused by a tanker. While individual documents may discuss aspects of
the spill, the link analysis identifies the temporal relationships between documents,
revealing patterns or sequences of events that are critical but may not be explicitly
mentioned in any single document.
Hierarchical structures are commonly used to represent information that has inherent relationships
or dependencies, such as time, organization, or classification systems. These structures are
especially useful when representing large datasets or complex topics. The following techniques are
designed to help users visualize and navigate hierarchical data more easily:
1. ConeTree:
o Advantages: The Cone-Tree allows users to quickly visualize the entire hierarchy,
making it easier to understand the size and relationships of different subtrees.
Compared to traditional 2D node-and-link trees, the Cone-Tree maximizes the
available space and provides a more intuitive perspective of large datasets.
2. Perspective Wall:
o Description: The Perspective Wall divides information into three visual areas, with
the primary area in focus and the others out of focus. This creates a layered
representation of the data, allowing users to focus on one part of the hierarchy
while keeping the context of the surrounding information visible.
o Advantages: This technique helps maintain an overview of the data while also
allowing in-depth exploration of a specific area. It's particularly useful when users
need to keep track of large datasets without losing context.
3. TreeMaps:
o Advantages: TreeMaps make optimal use of available screen space and provide an
efficient way to visualize large hierarchies. They are especially useful for displaying
categories that can be grouped into subcategories, such as topics within a document
collection.
When data has network-type relationships (i.e., items are interrelated), cluster-based approaches
can visually represent these relationships. Semantic scatterplots are a common method for
visualizing clustering patterns:
1. Vineta and Bead Systems: These systems display clustering using three-dimensional
scatterplots, helping users visualize the relationships between documents or items within a
multidimensional space.
2. Embedded Coordinate Spaces: Feiner’s “worlds within worlds” approach suggests that
larger coordinate spaces can be redefined with subspaces nested inside other spaces,
allowing for better organization of multidimensional data.
3. Other Methods: Techniques such as semantic regions (Kohonen), linked trees (Narcissus
system), or non-Euclidean landscapes (Lamping and Rao) aim to tackle the issue of
visualizing complex multidimensional data.
Search-Driven Visualizations:
Information Crystal: A visualization technique inspired by Venn diagrams, which helps users
detect patterns of term relationships in a search result file (referred to as a Hit file). It
constrains the data within a specific search space, allowing the user to better understand
how search terms relate to the results.
An important aspect of visualization is helping users refine their search statements and understand
the effects of different search terms. The challenge with systems that use similarity measures (e.g.,
relevance scoring) is that it can be difficult for users to discern how specific terms impact the
selection and ranking of documents in their search results. Visualization systems aim to address this
by:
Graphical Display of Item Characteristics: A visualization tool can display the characteristics
(such as term frequency, relevance, or weight) of the retrieved items to show how search
terms influenced the results.
VIBE System: A system designed for visualizing term relationships by spatially positioning
documents in a way that shows their relevance to specific query terms. This allows users to
see the term relationships and the distribution of documents based on their relevance to the
search terms (a positioning sketch of this idea appears below).
Self-Organization and Kohonen's Algorithm: Lin applied Kohonen's self-organizing map
algorithm to automatically generate a table of contents (TOC) for a document collection,
displayed in a map format. This technique helps visualize document groupings based on their
content.
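The following sketch captures the general positioning idea described for VIBE (it is an illustrative assumption, not the VIBE implementation): each query term is a fixed "point of interest," and a document is drawn at the weighted centroid of those points, so it gravitates toward the terms it scores highest on. The term layout and relevance weights are hypothetical.

    # Sketch of VIBE-style placement (illustrative assumption).
    pois = {"oil": (0.0, 0.0), "spill": (1.0, 0.0), "tanker": (0.5, 1.0)}   # hypothetical layout

    # Hypothetical per-term relevance weights for each retrieved document.
    docs = {
        "doc1": {"oil": 0.9, "spill": 0.8, "tanker": 0.1},
        "doc2": {"oil": 0.1, "spill": 0.2, "tanker": 0.9},
        "doc3": {"oil": 0.5, "spill": 0.5, "tanker": 0.5},
    }

    def place(weights):
        """Weighted centroid of the points of interest, weighted by term relevance."""
        total = sum(weights.values())
        x = sum(w * pois[t][0] for t, w in weights.items()) / total
        y = sum(w * pois[t][1] for t, w in weights.items()) / total
        return round(x, 2), round(y, 2)

    for doc, weights in docs.items():
        print(doc, "plotted at", place(weights))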
Envision System Overview:
The Envision system aims to present a user-friendly, graphical representation of search results,
making it accessible even on a variety of computer platforms. The system features three interactive
windows for displaying search results:
1. Query Window: This window provides an editable version of the user's query, allowing easy
modification of the search terms.
2. Item Summary Window: This window shows bibliographic citation information for items
selected in the Graphic View Window.
3. Graphic View Window: This window is a two-dimensional scatterplot, where each item in
the search results (Hit file) is represented by an icon. Items are depicted as circles (for single
items) or ellipses (for clusters of multiple items). The X-axis represents the estimated
relevance, and the Y-axis shows the author’s name. The weight of each item or cluster is
displayed below the icon or ellipse, and selecting an item reveals its bibliographic
information.
While the system is simple and user-friendly, it can face challenges when the number of items or
entries becomes large. To address this, Envision plans to introduce a "zoom" feature to allow users
to view larger areas of the scatterplot at lower detail.
Relevance and Cluster Representation: The system uses scatterplot graphs where circles
represent individual items, and ellipses represent clusters of items that share similar
attributes or relevance scores. This allows users to visually assess both individual items and
groups of related documents.
Interactive User Experience: Selecting an item in the Graphic View Window not only
highlights the item but also updates the Item Summary Window with detailed bibliographic
information. This interactive system enhances user engagement and ease of navigation.
A similar technique, though differing in format, is used by Veerasamy and Belkin (1996), which
involves a series of vertical columns of bars:
Rows Represent Index Terms: Each row corresponds to an index term (word or phrase) used
in the search.
Bar Height Indicates Term Weight: The height of the bar in each column shows the weight
of the corresponding index term in the document represented by that column.
Relevance Feedback: This approach also displays terms added by relevance feedback,
helping users identify which terms have the most significant impact on the retrieval of
specific documents.
This display allows users to:
1. Identify Key Terms: Quickly determine which search terms were most influential in
retrieving a specific document by scanning the columns.
2. Refine Search Terms: See how various terms contributed to the retrieval process, making it
easier to adjust query terms that are underperforming or causing false hits. Users can
remove or reduce the weight of terms that aren't contributing positively to the results.
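A small text-only sketch of the bar-column idea described above (the data and rendering are assumptions, not Veerasamy and Belkin's system): one column per retrieved document, one row per index term, with bar length proportional to the term's weight in that document.

    # Text rendering of the bar-column idea (illustrative; bars drawn horizontally for simplicity).
    weights = {   # hypothetical term weights per retrieved document (0..1)
        "oil":    {"doc1": 0.9, "doc2": 0.2, "doc3": 0.6},
        "spill":  {"doc1": 0.7, "doc2": 0.1, "doc3": 0.8},
        "tanker": {"doc1": 0.2, "doc2": 0.9, "doc3": 0.5},
    }
    docs = ["doc1", "doc2", "doc3"]

    print("term      " + "  ".join(f"{d:<6}" for d in docs))
    for term, per_doc in weights.items():
        bars = "  ".join(("#" * round(per_doc[d] * 5)).ljust(6) for d in docs)
        print(f"{term:<10}{bars}")

Scanning a document's column shows at a glance which terms drove its retrieval, which is what makes the display useful for refining a query.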
The following are several advanced visualization techniques used in commercial and research
systems for document analysis and retrieval. These systems are designed to enhance the user's
ability to understand and navigate complex information spaces. The different systems and
approaches are described below:
DCARS (Document Content Analysis and Retrieval System):
Visualization: DCARS displays query results as a histogram in which each row represents an
item and the width of a tile bar indicates the contribution of each term to that item's
selection. This provides insight into why a particular item was retrieved.
User Interface: The system offers a friendly interface but struggles with allowing users to
effectively modify search queries to improve results. While it shows term contributions, it’s
challenging for users to use this information to refine or improve their search statements.
Cityscape Visualization:
Structure: Concepts are represented as buildings (skyscrapers) laid out like a city. The
buildings are interconnected with lines whose representation varies to show the
interrelationships between concepts. The colors or fill designs of buildings can add another
layer of meaning, with buildings sharing the same color possibly representing higher-level
concepts.
Zooming: Users can move through the cityscape, zooming in on specific areas of interest,
and uncovering additional structures that may be hidden in other views.
Library Metaphor:
User-Friendly Navigation: Another widely understood metaphor is the library. In this model,
the information is represented as a space within a library, and users can navigate through
these areas.
Viewing Information: Once in an information room, users can view virtual “books” placed on
a shelf. After selecting a book, the user can scan related items, with each item represented
as a page within the book. The ability to fan the pages out allows users to explore the
content in a spatial manner. This is exemplified by the WebBook system.
Clustering and Statistical Term Relationships:
Matrix Representation: When correlating items or terms, a large matrix is often created
where each cell represents the similarity between two terms or items. However, displaying
this matrix in traditional table form is impractical due to its size and complexity.
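As a concrete (assumed) example of such a matrix, the sketch below builds a term-term cosine similarity matrix from term-document occurrence vectors; clustering or map displays would then be driven from a matrix like this rather than from the raw table. The counts are hypothetical.

    import math

    # Term-document occurrence counts (hypothetical data: one vector per term).
    term_vectors = {
        "oil":    [3, 0, 2, 1],
        "spill":  [2, 0, 3, 0],
        "tanker": [0, 4, 1, 0],
    }

    def cosine(u, v):
        """Cosine similarity between two term vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    terms = list(term_vectors)
    print("similarity " + " ".join(f"{t:>7}" for t in terms))
    for t1 in terms:
        row = " ".join(f"{cosine(term_vectors[t1], term_vectors[t2]):7.2f}" for t2 in terms)
        print(f"{t1:<11}{row}")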
The following are various information visualization techniques used in different types of systems
for analyzing and interacting with large datasets, focusing on text, citation data, structured
databases, hyperlinks, and pattern/linkage analysis. Below is a breakdown of the systems and
concepts described:
SeeSoft System: Created by AT&T Bell Laboratories, this system visualizes software code by
using columns and color codes to show changes in lines of code over time. This allows
developers to easily track modifications across large codebases.
DEC FUSE/SoftVis: Built on SeeSoft's approach, this tool provides small pictures of code files
with size proportional to the number of lines. It uses color coding (e.g., green for comments)
to help visualize the structure and complexity of code modules.
TileBars Tool: Developed at Xerox PARC, this tool visualizes the distribution of query terms
within each item in a Hit file, helping users quickly find the most relevant sections of
documents (a sketch of this idea follows the list below).
Query By Example: IBM’s early tool for structured database queries used a two-dimensional
table where users defined values of interest. The system would then complete the search
based on these values.
IVEE (Information Visualization and Exploration Environment): This tool uses three-
dimensional representations of structured databases. For example, a box representing a
department can have smaller boxes inside it representing employees. The system allows for
additional visualizations like maps and starfields, with interactive sliders and toggles for
manipulating the search.
HomeFinder System: Used for displaying homes for sale, this tool combines a starfield
display of homes with a city map to show the geographic location of each home.
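A minimal sketch of the TileBars idea mentioned above (assumed data and rendering, not Xerox PARC's code): each document is divided into fixed-size segments, and for each query term a row of tiles shows how often the term occurs in each segment, with darker characters for higher frequency.

    # TileBars-style display sketch (illustrative).
    def tile_bars(text, query_terms, segment_size=10):
        words = text.lower().split()
        segments = [words[i:i + segment_size] for i in range(0, len(words), segment_size)]
        shades = " .:*#"                                  # darker character = more occurrences
        for term in query_terms:
            counts = [seg.count(term) for seg in segments]
            row = "".join(shades[min(c, len(shades) - 1)] for c in counts)
            print(f"{term:<10}|{row}|")

    doc = ("the tanker ran aground and oil began to leak " * 3 +
           "cleanup crews contained the spill near the coast " * 2 +
           "officials discussed shipping regulations " * 3)
    tile_bars(doc, ["oil", "spill", "tanker"])

Reading across a row shows where in the document each term is concentrated, which helps a user jump to the most relevant section.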
Navigation and Hyperlink Visualization:
Navigational Visualization: Hyperlinks in information retrieval systems can lead users to feel
"lost in cyberspace." One solution is to provide a tree structure visualization of the
information space. MITRE Corporation has developed a tool for web browsers that visually
represents the path users have followed and helps them navigate effectively.
Pathfinder Project: A system sponsored by the Army that incorporates various visualization
techniques for analyzing patterns and linkages. It includes several tools:
o Document Browser: Uses color density in text to indicate the importance of an item
relative to a user's query.
o CAMEO: Models analytic processes with nodes and links, where the color of nodes
changes based on how well the found items satisfy the query.
o Counts: Plots statistical information about words and phrases over time, helping
visualize trends and developments.
o CrossField Matrix: A two-dimensional matrix that correlates data from two fields
(e.g., countries and products), using color to represent time span and showing how
long a country has been producing a specific product.
The use of maps and geographic data is another area of information visualization.
Geographic Information Systems (GIS) provide a way to visualize and analyze data in the
context of geographic locations.
SPIRE tool:
Information Visualization: A technique used to represent complex data visually to help users
understand relationships, trends, and important concepts.
Metaphors for Aggregation: Visualization often uses metaphors like peaks, valleys, or cityscapes to
highlight major concepts and relationships, providing users with an overview before diving into
details.
Cognitive Engineering: Visual representation must consider users' cultural and physiological
differences, such as color blindness or familiarity with certain environments, to ensure the
metaphor is understandable.
Complexity of Search Algorithms: As search algorithms evolve, the role of visualization will expand
to clarify not just the retrieved information but also the relationships between the search statement
and the items.
Hypertext Linkages: The growth of hyperlinks in information systems will require new visualization
tools to help users understand the network of relationships between linked items.
Technical Limitations:
Interactive Clustering: Users expect the ability to interact with clusters, drilling down from high-level
to more granular levels with increasing precision.
Real-time Processing: The system must provide near-instant feedback (e.g., under 20 seconds) to
meet user expectations for speed.
Key Challenges:
Increased Precision: As users drill deeper into clusters, the system needs to display results
with higher precision, often requiring natural language processing for better understanding
and categorization.
Unit- V
Text search Algorithms:
Introduction to text search techniques:
1. Concept:
○ A text streaming search system allows users to enter queries, which are then compared
to a database of text. As soon as an item matching a query is found, results can be
presented to the user.
2. System Components:
○ Term Detector: Identifies query terms in the text.
○ Query Resolver: Processes search statements, passes terms to the detector, and
determines if an item satisfies the query. It then communicates results to the user
interface.
○ User Interface: Updates the user with search status and retrieves matching items.
3. Search Process:
○ The system searches for occurrences of query terms in a text stream. It involves
detecting patterns (query terms) in the text.
○ Worst-case time complexity: optimized algorithms run in O(n), where n is the length of the
text, while brute-force matching takes O(n*m), where m is the length of the search pattern.
○ Hardware & Software: Multiple detectors may run in parallel for efficiency, either
searching the entire database or retrieving random items.
4. Index vs. Streaming Search:
○ Streaming Search: More efficient in terms of speed, as results are shown immediately
when they are found, and it requires no extra index storage.
○ Index Search: Requires the whole query to be processed before results appear and has
storage overhead but can be more efficient in some cases like fuzzy searches.
○ Disadvantages of Streaming: Dependent on the I/O speed, may not handle all search
types as efficiently as indexing systems.
5. Finite State Automata:
○ Many text searchers use finite state automata (FSA), which are logical machines used to
recognize patterns in input strings. The FSA consists of:
■ I: Set of input symbols.
■ S: Set of states.
■ P: Productions, defining state transitions.
■ Initial state: The starting point.
■ Final state(s): Where the machine ends when a pattern is found.
○ Example: An FSA can be used to detect the string "CPU" in a text stream by transitioning
through states based on the input symbols received.
6. Transition Representation:
○ The states and transitions in an FSA can be represented by a table, where rows represent
current states, and columns represent the input symbols that trigger state transitions.
This system allows for real-time search and retrieval of text items without needing extra storage, but may
face performance challenges with I/O speed and certain types of queries.
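A table-driven sketch of the FSA example from items 5 and 6 above; the transition table here is an illustrative reconstruction, with unlisted symbols returning to the start state (except that a 'C' always begins a potential new match).

    # Table-driven finite state automaton for detecting "CPU" in a text stream (sketch).
    # Rows = current state, columns = input symbol that triggers a transition.
    transitions = {
        0: {"C": 1},          # start: waiting for 'C'
        1: {"C": 1, "P": 2},  # saw "C"
        2: {"C": 1, "U": 3},  # saw "CP"
        3: {},                # final state: "CPU" recognized (reset before reading further)
    }
    FINAL = 3

    def scan(stream):
        state = 0
        for pos, ch in enumerate(stream):
            state = transitions[state].get(ch, 1 if ch == "C" else 0)
            if state == FINAL:
                print(f'"CPU" found ending at position {pos}')
                state = 0     # keep scanning the rest of the stream

    scan("THE CPU AND THE CCPU BOTH MATCH")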
● Both the Aho-Corasick and Knuth-Morris-Pratt (KMP) algorithms compare the same number of
characters.
● The new algorithm improves upon these by making state transitions independent of the number
of search terms, and the search operation is linear with respect to the number of characters in
the input stream.
● The comparison count is proportional to T (the length of the text) multiplied by a constant w > 1,
providing a significant improvement over KMP (which depends on the query size) and
Boyer-Moore (which handles only one search term).
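Since the passage above compares Aho-Corasick and Knuth-Morris-Pratt, here is a minimal single-pattern KMP sketch (the standard algorithm, written as illustrative code): the failure table lets the search advance through the text without re-examining characters, giving O(n + m) comparisons.

    def kmp_failure(pattern):
        """failure[i] = length of the longest proper prefix of pattern[:i+1] that is also a suffix."""
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        return failure

    def kmp_search(text, pattern):
        """Return start positions of pattern in text."""
        failure, k, hits = kmp_failure(pattern), 0, []
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = failure[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                hits.append(i - len(pattern) + 1)
                k = failure[k - 1]
        return hits

    print(kmp_search("abcabcabd", "abcabd"))   # -> [3]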
Baeza-Yates and Gonnet's Extension:
● This algorithm can handle "don’t care" symbols, complement symbols, and up to k mismatches.
● It uses a vector of m states, where m is the length of the search pattern, with each state
corresponding to a specific portion of the pattern matched to the input text.
● The algorithm provides a fuzzy search by determining the number of mismatches between the
search pattern and the text. If mismatches occur, the vector helps track the positions where
matches may happen, supporting fuzzy searches.
Shift-Add Algorithm:
● The Shift-Add algorithm utilizes this vector representation and performs a comparison by
updating the vector as it moves through the text.
● For each possible input character x, a table T(x) records the match/mismatch status of x
against every position of the pattern. When the state value for the final pattern position
reaches zero, a perfect match has been found.
● Don’t care symbols and complementary symbols can be included, making the algorithm flexible
for varied search types.
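A minimal sketch of the bit-parallel idea behind this family of algorithms, shown here as the exact-match (shift-or) special case attributed to Baeza-Yates and Gonnet; the mismatch-counting ("shift-add") and the Wu-Manber error-tolerant variants generalize the same state vector. The code and test data are illustrative.

    def shift_or_search(text, pattern):
        """Bit-parallel exact matching: one state bit per pattern position; a 0 bit at
        position i means pattern[:i+1] currently matches. Report a hit when bit m-1 is 0."""
        m = len(pattern)
        all_ones = (1 << m) - 1
        # T(x): bit i is 0 iff pattern[i] == x (the per-character match table).
        table = {}
        for i, ch in enumerate(pattern):
            table[ch] = table.get(ch, all_ones) & ~(1 << i)
        state, hits = all_ones, []
        for pos, ch in enumerate(text):
            state = ((state << 1) | table.get(ch, all_ones)) & all_ones
            if state & (1 << (m - 1)) == 0:          # last pattern position matched
                hits.append(pos - m + 1)
        return hits

    print(shift_or_search("the cpu and the gpu", "cpu"))   # -> [4]

Because the whole update is a shift, an OR, and a table lookup per input character, the method maps naturally onto hardware, which is the advantage noted above.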
Extensions by Wu and Manber:
● The Shift-Add algorithm was extended by Wu and Manber to handle insertions, deletions, and
positional mismatches as well.
Hardware Implementation:
● One of the advantages of the Shift-Add algorithm is its ease of implementation in hardware,
making it efficient for real-time applications.
● Limitations: Software text search is limited in its ability to handle many search terms
simultaneously and is constrained by I/O speeds.
● Hardware Solution: To offload the resource-intensive searching, specialized hardware search
machines were developed. These machines perform searches independently of the main
processor and send the results to the main computer.
Advantages of hardware text search machines include:
1. Scalability: Speed improves with the addition of more hardware devices (one per disk), and the
only limiting factor is the speed of data transfer from secondary storage (disks).
2. No Need for Indexing: Unlike traditional systems that need large indexes (often 70% the size of
the documents), hardware search machines do not require an index, allowing immediate
searches as new items arrive.
3. Deterministic Speed: While hardware searches may be slower than indexed searches, they
provide predictable search times, and results are available immediately as hits are found.
Representative hardware text search systems include:
1. Paracel Searcher (formerly the Fast Data Finder): A specialized hardware text search system
with an array of programmable processing cells. Each cell compares a single character of the
query, making the system scalable with the number of cells.
○ Fast Data Finder (FDF): The system uses cells interconnected in series, each handling a
single character comparison, and is used for complex searches, including Boolean logic,
proximity, and fuzzy matching (a software sketch of this cell-array idea appears below).
2. GESCAN: Uses a Text Array Processor (TAP) that matches multiple terms and conditions in
parallel. It supports various search features like exact term matches, don’t-care symbols, Boolean
logic, and proximity.
3. Associative File Processor (AFP): An early hardware search unit capable of handling multiple
queries simultaneously.
4. Content Addressable Segment Sequential Memory (CASSM): Developed as a general-purpose
search device, this system can be used for string searching across a database.
● The Fast Data Finder (FDF) has been adapted for genetic analysis, including DNA and protein
sequence matching. It is used in Smith-Waterman (S-W) dynamic programming for sequence
similarity searches and for identifying conserved regions in biological sequences.
● The Biology Tool Kit (BTK) integrates with the FDF to perform fuzzy matching for biological
sequence data.
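The following is a purely software simulation of the cell-array idea mentioned for the Fast Data Finder above; it is an illustrative assumption about the general architecture, not Paracel's design. Each cell holds one query character and passes a "partial match" signal to the next cell as the text streams past.

    # Software simulation of a serial array of character-comparison cells (sketch).
    # Cell i holds query character i; on each clock tick every cell looks at the current text
    # character and at whether its left neighbour had matched on the previous tick.
    def cell_array_search(text, query):
        m = len(query)
        matched = [False] * m          # matched[i]: cell i has matched query[:i+1] so far
        hits = []
        for pos, ch in enumerate(text):
            # update right-to-left so each cell sees its neighbour's state from the previous tick
            for i in range(m - 1, -1, -1):
                left_ok = (i == 0) or matched[i - 1]
                matched[i] = left_ok and (ch == query[i])
            if matched[m - 1]:
                hits.append(pos - m + 1)
        return hits

    print(cell_array_search("streaming search hardware", "ear"))   # -> [11]

In hardware all cells update in parallel on each character, so the search time depends only on how fast the data can be streamed off disk, which is the limiting factor noted above.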
Limitations:
● Expense: The cost and the need to stream entire databases for search have limited the
widespread adoption of hardware search systems.
● Database Size: The entire database must be streamed for a search, which can be
resource-intensive.
Multimedia Information Retrieval:
Spoken Language Audio Retrieval:
● Value of Speech Search: Just like text search, the ability to search audio sources such as
speeches, radio broadcasts, and conversations would be valuable for various applications,
including speaker verification, transcription, and command and control.
● Challenges in Speech Recognition:
○ Word Error Rates: Transcription of speech can be challenging due to high word error
rates, which can be as high as 50%, depending on factors like the speaker, the type of
speech (dictation vs. conversation), and environmental conditions. However, redundancy
in the source material can help offset these errors, still allowing effective retrieval.
○ Lexicon Size: While speech recognition systems often have lexicons of around 100,000
words, text systems typically contain much larger lexicons, sometimes over 500,000
words, which adds complexity to speech recognition.
○ Development Effort: A significant challenge is the effort needed to develop an
annotated corpus (e.g., a video mail corpus) to train and evaluate speech recognition
systems.
● Performance: Research by Jones et al. (1997) compared speech retrieval to text retrieval. The
results showed:
○ Speaker-dependent systems: Retain around 95% of the performance of text-based
retrieval.
○ Speaker-independent systems: Achieve about 75% of the performance of text-based
retrieval.
● System Scalability: Scalability remains a challenge in speech retrieval, partly due to the size of
the lexicon and the complexity of developing annotated corpora for training.
● Rough’n’Ready Prototype:
○ Purpose: Developed by BBN, Rough’n’Ready provides information access to spoken
language from audio and video sources, especially broadcast news. It generates a
summarization of speech for easy browsing.
○ Technology: The transcription is created by the BYBLOS™ system, a large vocabulary
speech recognition system that uses a continuous-density Hidden Markov Model
(HMM).
○ Performance: BYBLOS runs at three times real-time speed, with a 60,000-word
dictionary. The system has reported a word error rate (WER) of 18.8% for broadcast
news transcription (a sketch of how WER is computed appears below).
● Multilingual Efforts:
○ LIMSI North American Broadcast News System: Reported a 13.6% word error rate and
focused on multilingual information access.
○ Tokyo Institute of Technology & NHK: Joint research focused on transcription and topic
extraction from Japanese broadcast news. This project aims to improve accuracy by
modeling filled pauses, performing online incremental speaker adaptation, and using a
context-dependent language model.
● Filled Pauses: One area of focus in improving speech recognition systems is the handling of filled
pauses (e.g., "um," "uh") in natural speech.
● Speaker Adaptation: Improving speaker adaptation is crucial, as different speakers have different
speaking styles and accents. On-line incremental adaptation is a key strategy to improve
recognition over time.
● Language Models: Using context-dependent language models can significantly improve the
performance of speech recognition systems by considering the sequence and context of words,
including special characters such as Kanji (Chinese characters used in Japanese), Hira-gana, and
Kata-kana (Japanese syllabary).
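Word error rate figures such as the 18.8% cited above are computed by aligning the recognizer's output against a reference transcript. The sketch below uses the standard edit-distance formulation (the example sentences are illustrative): WER = (substitutions + deletions + insertions) / number of reference words.

    def word_error_rate(reference, hypothesis):
        """WER via a word-level edit-distance (Levenshtein) alignment."""
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between the first i reference words and first j hypothesis words
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i                     # i deletions
        for j in range(len(hyp) + 1):
            dist[0][j] = j                     # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j - 1] + sub,   # substitution or match
                                 dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1)         # insertion
        return dist[len(ref)][len(hyp)] / len(ref)

    ref = "the oil tanker ran aground near the coast"
    hyp = "the oil tanker run aground near coast"
    print(f"WER = {word_error_rate(ref, hyp):.1%}")   # one substitution + one deletion over 8 words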
Challenges in Multilingual Speech Processing:
● Multilingual Transcription: Efforts are being made to extend broadcast news transcription
systems to support multiple languages, including German, French, and Japanese.
● Contextual Models for Multilingual Support: Developing models that handle multiple languages
with different writing systems (e.g., Chinese characters, Japanese characters) poses an additional
challenge in improving the accuracy and performance of speech recognition systems across
languages.
Non-Speech Audio Retrieval:
● Content-Based Search: Users can search for sounds by their acoustic properties (e.g., pitch,
loudness, duration) or perceptual properties (e.g., “scratchy”).
○ Example: A user might search for "all AIFF encoded files with animal or human vocal
sounds that resemble barking" without specifying the exact duration or amplitude.
○ Query by Example: Users can also train the system by example, where the system learns
to associate perceptual properties (like "scratchiness" or "buzziness") with the sound
features.
○ Weighted Queries: The system supports complex weighted queries based on different
sound characteristics. For example, a query might specify a foreground sound with
metallic and plucked properties, and a specific pitch range.
● Training by Example: Users can train the system to recognize and retrieve sounds with indirectly
related perceptual properties, allowing for more flexible and nuanced searches.
● Performance Evaluation: The system was tested using a database of 400 sound files, including
sounds from nature, animals, instruments, and speech.
● Additional System Requirements:
○ Sound Displays: Visual representations of sound data may be necessary for better
understanding and refining searches.
○ Sound Synthesis: This refers to a query formulation or refinement tool that helps users
create or refine sound queries.
○ Sound Separation: This involves separating overlapping sound features or elements
within a given sound.
○ Matching Feature Trajectories Over Time: The system also supports tracking how
features (e.g., pitch, loudness) evolve over time in a sound, allowing for more dynamic
and sophisticated search queries.
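A small sketch of the weighted-query idea described above; the feature values, weights, and distance measure are all illustrative assumptions rather than the evaluated system's design. Each sound is reduced to a vector of acoustic features, and the query ranks sounds by weighted distance to the requested feature values.

    import math

    # Hypothetical acoustic feature vectors: (pitch in Hz, loudness in dB, duration in s).
    sounds = {
        "dog_bark.aiff":    (450.0, 70.0, 0.8),
        "door_slam.aiff":   (200.0, 85.0, 0.4),
        "violin_note.aiff": (660.0, 60.0, 2.5),
    }

    def rank(query, weights):
        """Rank sounds by weighted Euclidean distance to the query's feature values."""
        def dist(features):
            return math.sqrt(sum(w * (f - q) ** 2 for f, q, w in zip(features, query, weights)))
        return sorted(sounds, key=lambda name: dist(sounds[name]))

    # "Short, loud, mid-pitched sound," with loudness weighted most heavily.
    print(rank(query=(400.0, 80.0, 0.5), weights=(0.2, 1.0, 0.5)))

Training by example would amount to learning these feature weights from sounds the user labels as similar, rather than having the user set them directly.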
Graph Retrieval:
● Purpose: SageBook provides a comprehensive system for querying, indexing, and retrieving data
graphics, which includes charts, tables, and other types of visual representations of data. It
allows users to search based on both the graphical elements (e.g., bars, lines) and the underlying
data represented by the graphic.
● Graphical Querying:
○ Users can formulate queries through a graphical direct-manipulation interface called
SageBrush. This allows them to select and arrange various components of a graphic,
such as:
■ Spaces (e.g., charts, tables),
■ Objects (e.g., marks, bars),
■ Object properties (e.g., color, size, shape, position).
○ The left side of Figure 10.3 (not shown here) illustrates how the user can design a query
by manipulating these elements.
● Matching Criteria:
○ The system performs both exact and similarity-based matching of the graphics. For a
successful match, the graphemes (graphical elements such as bars or lines) must not
only belong to the same class (e.g., bars, lines) but also match specific properties (e.g.,
color, shape, size).
○ The retrieved data-graphics are ranked based on their degree of similarity to the query.
For example, in a “close graphics matching strategy”, SageBook will prioritize results that
closely resemble the structure and properties of the query.
● Graphic Adaptation:
○ The system also allows users to manually adapt the retrieved graphics. For instance,
users can modify or eliminate certain elements that do not match the specifications of
the query.
● Grouping and Clustering:
○ To facilitate browsing large collections, SageBook includes data-graphic grouping
techniques based on both graphical and data properties, enabling users to efficiently
browse large collections of graphics.
Internal Representation:
Search Strategies:
● SageBook offers multiple search strategies with varying levels of match relaxation:
○ For graphical properties, users can perform searches with different strategies (e.g., exact
match, partial match).
○ For data properties, there are also alternative search strategies to accommodate
different matching requirements.
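A rough sketch of the grapheme-matching idea with exact and relaxed strategies; the representation and scoring are assumptions for illustration, not SageBook's internal representation. A graphic is a set of graphemes with a class and properties, and a stored graphic is scored by how well the query's graphemes find same-class, property-compatible counterparts.

    # Illustrative grapheme matching: a graphic is a list of (class, {property: value}) graphemes.
    def grapheme_score(query_g, stored_g, relaxed=False):
        """1.0 for a same-class grapheme whose properties all match the query;
        with relaxation, partial property matches earn partial credit."""
        q_class, q_props = query_g
        s_class, s_props = stored_g
        if q_class != s_class:
            return 0.0
        matched = sum(1 for k, v in q_props.items() if s_props.get(k) == v)
        if not relaxed:
            return 1.0 if matched == len(q_props) else 0.0
        return matched / len(q_props) if q_props else 1.0

    def graphic_similarity(query, stored, relaxed=False):
        """Average best-match score of the query's graphemes against a stored graphic."""
        scores = [max((grapheme_score(qg, sg, relaxed) for sg in stored), default=0.0)
                  for qg in query]
        return sum(scores) / len(scores)

    query = [("bar", {"color": "red", "size": "small"}), ("line", {"color": "blue"})]
    stored = [("bar", {"color": "red", "size": "large"}), ("line", {"color": "blue"})]
    print(graphic_similarity(query, stored))                 # exact strategy   -> 0.5
    print(graphic_similarity(query, stored, relaxed=True))   # relaxed strategy -> 0.75

Ranking retrieved graphics by this score gives the kind of "close graphics matching strategy" described above.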
Applications:
● Versatility: SageBook’s capability to search and retrieve graphics by content is valuable across
various domains, including:
○ Business Graphics: for financial charts, reports, and presentations.
○ Cartography: for terrain, elevation, and feature maps.
○ Architecture: for blueprints and designs.
○ Communications and Networking: for routers, links, and network diagrams.
○ Systems Engineering: for component and connection diagrams.
○ Military Campaign Planning: for strategic maps and force deployment visualizations.
● Real-World Relevance: The system’s ability to handle complex graphical elements, relationships,
and data attributes makes it applicable in a broad range of fields where visual representations of
data are crucial for analysis, planning, and decision-making.
Imagery retrieval:
● Problem: Traditional image retrieval systems rely heavily on metadata, such as captions or tags,
but these often do not fully capture the visual content of images. As a result, there has been
significant research into content-based retrieval, where images are indexed and searched based
on their visual features.
● Early Approaches:
○ Initial efforts focused on indexing visual features such as color, texture, and shape to
allow for retrieving similar images without needing manual indexing. Notable works
include Niblack and Jain’s algorithm development for automatic indexing of visual
features.
○ QBIC System (Query By Image Content):
■ QBIC, developed by Flickner et al. (1997), is an example of a content-based image
retrieval system that supports queries based on visual properties such as color,
shape, texture, and even sketches.
■ For instance, users could query a collection of US stamps by selecting the color
red or searching for stamps associated with the keyword "president". QBIC
would retrieve images that match these criteria, allowing for more intuitive and
visual-based searching.
○ Refining Queries: Users can refine their search by adding additional constraints. For
example, a query might be refined to include images that contain a red round object
with coarse texture and a green square.
○ Automated and Semi-Automated Object Identification: Since manual annotation of
images is cumbersome, automated tools (e.g., foreground/background models) were
developed to help identify objects in images, facilitating the indexing process.
● Researchers extended the concepts from image retrieval to video retrieval. Flickner et al. (1997)
explored shot detection and keyframe extraction, allowing for queries such as “find me all shots
panning left to right” based on the content of video shots. The system retrieves a list of
keyframes (representative frames) that can then be used to retrieve the associated video shot.
● Informedia Digital Video Library: Developed by Wactlar et al. (2000), this system supports
content-based video retrieval by extracting information from both audio and video. It includes a
feature called named face that automatically associates a name with a face, enabling users to
search for faces by name or vice versa.
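To make the color-based matching idea concrete, the sketch below compares coarse color histograms and ranks images by histogram distance to a query color profile. The histograms and bin scheme are assumed for illustration; this is not QBIC's actual algorithm.

    # Content-based retrieval sketch: images reduced to normalized color histograms.
    BINS = ("red", "green", "blue", "other")

    images = {   # hypothetical precomputed histograms (fractions summing to 1.0)
        "stamp_red_president.jpg": {"red": 0.7, "green": 0.1, "blue": 0.1, "other": 0.1},
        "stamp_green_bird.jpg":    {"red": 0.1, "green": 0.6, "blue": 0.2, "other": 0.1},
        "stamp_blue_flag.jpg":     {"red": 0.2, "green": 0.1, "blue": 0.6, "other": 0.1},
    }

    def l1_distance(h1, h2):
        return sum(abs(h1[b] - h2[b]) for b in BINS)

    def search_by_color(query_hist, top_k=3):
        """Rank images by L1 distance between their histogram and the query histogram."""
        ranked = sorted(images, key=lambda name: l1_distance(images[name], query_hist))
        return ranked[:top_k]

    # Query: "mostly red" images.
    print(search_by_color({"red": 0.8, "green": 0.05, "blue": 0.05, "other": 0.1}))

Additional constraints (keywords such as "president", texture, or shape) would be combined with the color score to refine the ranking, as in the refinement example above.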
Video retrieval:
● Personalcasts and Video Mail: The growing availability of video content (e.g., video mail, taped
meetings, surveillance video, broadcast television) has created a demand for more efficient
access methods. Content-based systems allow users to search video based on its content rather
than relying on manually added tags or metadata.
● BNN System: This system helps create personalcasts from broadcast news, enabling users to
search for and retrieve specific news stories from a large repository of video data.
○ BNN performs automated processing of broadcast news, including capture, annotation,
segmentation, summarization, and visualization of stories.
○ It integrates text, speech, and image processing technologies to allow users to search
video content based on:
■ Keywords
■ Named entities (people, locations, organizations)
■ Time intervals (e.g., specific news broadcast dates)
○ This approach significantly reduces the need for manual video annotation, which can
often be inconsistent or error-prone.
● GeoNODE System: This system focuses on topic detection and tracking for broadcast news and
newswire sources. It allows users to analyze geospatial and temporal contexts for news stories.
○ GeoNODE provides visual analytics by displaying data on a time line of stories related to
specific topics, as well as cartographic visualizations that highlight news mentions of
specific locations (e.g., countries or cities).
○ For example, in the GeoNODE map, the saturation of color indicates the frequency of
news mentions in different regions (e.g., darker colors indicate more mentions).
● Geospatial Search and Data Mining:
○ Users can search for documents that mention specific locations or geospatial trends.
○ The system also supports data mining for discovering correlations among named
entities across multiple news sources.
● GeoNODE Performance: In preliminary evaluations, GeoNODE identified over 80% of
human-defined topics and detected 83% of stories within those topics with a very low
misclassification error (0.2%).
● Integration of Multiple Data Sources: The future of systems like GeoNODE will rely on the ability
to extract and analyze information from a variety of multimedia sources, including text, audio,
and video.
● Machine Learning and Evaluation: As these systems evolve, they will increasingly depend on
machine learning techniques, multimedia corpora, and common evaluation tasks to improve
their performance and capabilities.
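As a final illustration of the kind of processing these video systems rely on (shot detection and segmentation were mentioned above), here is a simplified sketch: consecutive frames are compared by histogram difference, and a shot boundary is declared where the difference exceeds a threshold. The frame data and threshold are assumptions for illustration, not any particular system's parameters.

    # Simplified shot-boundary detection: each frame is an intensity histogram, and a new
    # shot starts where consecutive histograms differ sharply.
    def histogram_diff(h1, h2):
        return sum(abs(a - b) for a, b in zip(h1, h2)) / 2.0    # 0 = identical, 1 = disjoint

    def detect_shots(frame_histograms, threshold=0.4):
        boundaries = [0]
        for i in range(1, len(frame_histograms)):
            if histogram_diff(frame_histograms[i - 1], frame_histograms[i]) > threshold:
                boundaries.append(i)                            # new shot starts at frame i
        return boundaries

    # Hypothetical 4-bin intensity histograms for a short clip (two visually distinct shots).
    frames = [
        [0.7, 0.2, 0.1, 0.0], [0.68, 0.22, 0.1, 0.0], [0.7, 0.2, 0.08, 0.02],   # anchor shot
        [0.1, 0.1, 0.3, 0.5], [0.12, 0.08, 0.3, 0.5],                            # field footage
    ]
    print(detect_shots(frames))    # -> [0, 3]

A keyframe for each detected shot could then be chosen (for example, the middle frame of the segment) and indexed for the kind of content-based queries described above.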