
UNIT 2

INVERTED FILES
INTRODUCTION

Three of the most commonly used file structures for information retrieval can be classified as:

 Lexicographical indices (indices that are sorted)
 Clustered file structures
 Indices based on hashing

One type of lexicographical index, the inverted file, is presented in this chapter, with a second
type of lexicographical index, the Patricia (PAT) tree.

The concept of the inverted file type of index is as follows:

 Assume a set of documents.
 Each document is assigned a list of keywords or attributes, with optional relevance
weights associated with each keyword (attribute).
 An inverted file is then the sorted list (or index) of keywords (attributes), with each
keyword having links to the documents containing that keyword, as shown in the figure below.
 This is the kind of index found in most commercial library systems. The use of an
inverted file improves search efficiency by several orders of magnitude, a necessity for
very large text files.

An Inverted File Implemented using Sorted Array


Usually there are some restrictions imposed on these indices and consequently on later searches.
Examples of these restrictions are:

 A controlled vocabulary, which is the collection of keywords that will be indexed. Words
in the text that are not in the vocabulary are not indexed, and hence are not
searchable.
 A list of stop words (articles, prepositions, etc.) that for reasons of volume or precision
and recall will not be included in the index, and hence are not searchable.
 A set of rules that decide the beginning of a word or a piece of text that is indexable.
These rules deal with the treatment of spaces, punctuation marks, or some standard
prefixes, and may have significant impact on what terms are indexed.
 A list of character sequences to be indexed (or not indexed). In large text databases, not
all character sequences are indexed; for example, character sequences consisting entirely
of numerics are often not indexed.

STRUCTURES USED IN INVERTED FILES

There are several structures that can be used in implementing inverted files:

 Sorted arrays
 B-trees
 Tries and various hashing structures, or combinations of these structures.

The first three are sorted indices and can efficiently support range queries, such as retrieving
all documents having keywords that start with "comput". (Hashing structures do not preserve
lexicographic order and cannot support such queries directly.)

Range Query: A common database operation that retrieves all records where some value falls
between a lower and an upper boundary.

The Sorted array

 It stores the list of keywords in a sorted array.
 The array includes the number of documents associated with each keyword and a link to the
documents containing that keyword.
 This array is commonly searched using a binary search.

Disadvantage: updating the index is expensive.

Advantage: easy to implement and reasonably fast.
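The sorted-array layout above can be sketched in a few lines of Python using the standard `bisect` module. The terms and document numbers here are invented purely for illustration; a real index would be built from a parsed collection.

```python
import bisect

# Parallel arrays: sorted keywords and, aligned by index, their document lists.
terms = ["apple", "banana", "cherry"]     # the sorted keyword array
postings = [[1, 4], [2], [1, 2, 3]]       # documents containing each keyword

def lookup(term):
    """Binary-search the sorted term array; return the posting list or []."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def range_query(prefix):
    """Return all (term, postings) pairs whose term starts with `prefix`."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return [(terms[i], postings[i]) for i in range(lo, hi)]
```

Because the array is sorted, both the exact lookup and the prefix range query cost O(log n) to locate the boundaries, which is why range queries such as "comput*" are cheap in this structure.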

B-trees: Another implementation structure for an inverted file is a B-tree

 It is efficient for dynamic data (heavily updated).


 A special case of the B-tree, the prefix B-tree, uses prefixes of words as primary keys in a
B-tree index (Bayer and Unterauer 1977) and is particularly suitable for the storage of
textual indices.
 Each internal node has a variable number of keys.
 Each key is the shortest word (in length) that distinguishes the keys stored at the next
level. The key does not need to be a prefix of an actual term in the index.
 The last (leaf) level stores the keywords themselves.
 Because the internal node keys and their lengths depend on the set of keywords, the order
(size) of each node of the prefix B-tree is variable.
 Updates are done similarly to those for a B-tree, to maintain a balanced tree.

Disadvantage: uses more space compared to a sorted array.

Advantage: updates are much easier and search is generally faster.

 The prefix B-tree method breaks down if there are many words with the same (long) prefix.

 In this case, common prefixes should be further divided to avoid wasting space.

Prefix B-tree
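The "shortest distinguishing key" idea can be illustrated with a small helper that computes a separator between two adjacent leaf keys, in the spirit of Bayer and Unterauer (1977). This is only a sketch of the key-selection rule, not a full prefix B-tree implementation, and it considers only prefixes of the right-hand key.

```python
def shortest_separator(left, right):
    """Return the shortest prefix k of `right` with left < k <= right.
    Such a key can serve as an internal-node separator in a prefix B-tree:
    it need not be an actual term, only distinguish the two subtrees.
    Assumes left < right lexicographically."""
    for i in range(1, len(right) + 1):
        candidate = right[:i]
        if candidate > left:
            return candidate
    return right
```

For example, between the terms "compress" and "computer" the separator "compu" suffices, which is shorter than either full term; storing such separators is what keeps prefix B-tree internal nodes compact.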

 Compared with sorted arrays, B-trees use more space.


 However, updates are much easier and the search time is generally faster, especially if
secondary storage is used for the inverted file (instead of memory).
 The implementation of inverted files using B-trees is more complex than using sorted
arrays, and therefore readers are referred to Knuth (1973) and Cutting and Pedersen
(1990) for details of the implementation of B-trees, and to Bayer and Unterauer (1977) for
details of the implementation of prefix B-trees.

Tries
 Inverted files can also be implemented using a trie structure.
 This structure uses the digital decomposition of the set of keywords to represent those
keywords.
 A special trie structure, the Patricia (PAT) tree, is especially useful in information
retrieval.

A trie for keys “A”, ”to”, ”tea”, ”ted”, ”ten”, ”i”, ”in” and “inn”
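The trie in the figure above can be reproduced with a minimal Python sketch, where each node holds a map from characters to child nodes and a flag marking the end of a key:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_key = False  # True if a key ends at this node

def insert(root, word):
    """Walk (and extend) the trie one character at a time."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_key = True

def contains(root, word):
    """Follow the digital decomposition of `word`; check the end-of-key flag."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_key

root = TrieNode()
for key in ["A", "to", "tea", "ted", "ten", "i", "in", "inn"]:
    insert(root, key)
```

Note that "te" is a path in the trie (shared by "tea", "ted", and "ten") but is not itself a key, which is why the end-of-key flag is needed.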

BUILDING AN INVERTED FILE USING A SORTED ARRAY

The production of sorted array inverted files can be divided into two or three sequential steps as
shown in Figure below. First, the input text must be parsed into a list of words along with their
location in the text. This is usually the most time consuming and storage consuming operation in
indexing. Second, this list must then be inverted, from a list of terms in location order to a list of
terms ordered for use in searching (sorted into alphabetical order, with a list of all locations
attached to each term). An optional third step is the post processing of these inverted files, such
as for adding term weights, or for reorganizing or compressing the files.

Overall schematic of sorted array inverted file creation


Creating the initial word list requires several different operations:

 The individual words must be recognized from the text.


 Each word is then checked against a stop list of common words.
 If it is not a stop word, it may optionally be passed through a stemming
algorithm. The resultant stem is then recorded in the word-within-location list.
 The word list resulting from the parsing operation (typically stored as a disk file) is then
inverted.
 This is usually done by sorting on the word (or stem), with duplicates retained.
 Even with the use of high-speed sorting utilities, however, this sort can be time
consuming for large data sets (on the order of n log n).
 One way to handle this problem is to break the data sets into smaller pieces, process each
piece, and then correctly merge the results.
 After sorting, the duplicates are merged to produce within-document frequency
statistics (a system not using within-document frequencies simply removes the duplicates).
 Note that although only record numbers are shown as locations in the figure, typically
inverted files store field locations and possibly even word locations.
 These additional locations are needed for field and proximity searching in Boolean
operations and cause higher inverted file storage overhead than if only record location
was needed. Inverted files for ranking retrieval systems usually store only record
locations and term weights or frequencies.
Inversion of word list
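The parse-sort-merge pipeline described above can be sketched as follows. The stop list and the stemmer here are deliberately crude stand-ins (a real system would use a full stop list and a stemmer such as Porter's); the point is the sequence of steps: parse into (stem, location) pairs, sort, then merge duplicates into within-document frequencies.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in"}   # illustrative stop list only

def stem(word):
    # Placeholder stemmer: strips a trailing "s". A real system would
    # use a proper stemming algorithm.
    return word[:-1] if word.endswith("s") else word

def build_inverted_file(docs):
    """docs: {doc_id: text}. Returns {stem: [(doc_id, freq), ...]}."""
    pairs = []                         # (stem, doc_id) list in text order
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                pairs.append((stem(word), doc_id))
    pairs.sort()                       # the O(n log n) inversion step
    counts = Counter(pairs)            # merge duplicates -> within-doc freqs
    index = {}
    for (term, doc_id), freq in sorted(counts.items()):
        index.setdefault(term, []).append((doc_id, freq))
    return index
```

The `pairs.sort()` call is the expensive step the text warns about for large data sets; splitting into pieces and merging, or the sort-free methods of the next section, exist to avoid exactly this cost.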

Although an inverted file could be used directly by the search routine, it is usually processed into
an improved final format. This format is based on the search methods and the (optional)
weighting methods used. A common search technique is to use a binary search routine on the file
to locate the query words. This implies that the file to be searched should be as short as possible,
and for this reason the single file shown containing the terms, locations, and (possibly)
frequencies is usually split into two pieces.

The first piece is the dictionary containing the term, statistics about that term such as number of
postings, and a pointer to the location of the postings file for that term. The second piece is the
postings file itself, which contains the record numbers (plus other necessary location
information) and the (optional) weights for all occurrences of the term. In this manner, the
dictionary used in the binary search has only one “line” per unique term.
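The dictionary/postings split can be sketched directly. The layout below mirrors the description: the dictionary holds one entry per unique term (term, number of postings, offset), and the postings are stored sequentially in a single structure, so the binary search touches only the short dictionary.

```python
import bisect

def split_index(index):
    """Split {term: [(doc, weight), ...]} into a dictionary array and one
    sequential postings list, mirroring the two-file layout described above."""
    dictionary = []   # (term, n_postings, offset into postings)
    postings = []
    for term in sorted(index):
        dictionary.append((term, len(index[term]), len(postings)))
        postings.extend(index[term])
    return dictionary, postings

def search(dictionary, postings, term):
    """Binary-search the dictionary, then read the postings slice."""
    terms = [entry[0] for entry in dictionary]
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        _, n, offset = dictionary[i]
        return postings[offset:offset + n]
    return []
```

On disk the two pieces would be separate files, with the offset being a byte position rather than a list index, but the access pattern is the same: one binary search, then one sequential read.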

MODIFICATIONS TO THE BASIC TECHNIQUE

 Two different techniques are presented as improvements on the basic inverted file
creation.
 The first technique is for working with very large data sets using secondary storage.
 The second technique uses multiple memory loads for inverting files.

Producing an Inverted File for Large Data Sets without Sorting:

Most computers cannot sort the very large disk files needed to hold the initial word list within a
reasonable time. Breaking the data set into smaller pieces, sorting each piece, and then
merging the results is possible, but the merge step itself carries a significant overhead.

The new indexing method is a two-step process that does not need the middle sorting step.
The first step produces the initial inverted file. The second step adds the term weights to that file
and reorganizes the file for maximum efficiency.

Flowchart of new indexing method

 The creation of the initial inverted file avoids the use of an explicit sort by using a right-
threaded binary tree.
 The data contained in each binary tree node is the current number of term
postings and the storage location of the postings list for that term.
 Each term is identified by the text parsing program.
 It is looked up in the binary tree, and either is added to the tree, along with related
data, or causes the tree data to be updated.
 The postings are stored as multiple linked lists, one variable length linked list for
each term, with the lists stored in one large file.
 Each element in the linked postings file consists of a record number, the term
frequency in that record, and a pointer to the next element in the linked list for
that given term.
 By storing the postings in a single file, no storage is wasted, and the files are
easily accessed by following the links.
 As the location of both the head and tail of each linked list is stored in the binary tree, the
entire list does not need to be read for each addition; it is read only once, when creating the
final postings file (step two).
 The binary tree and linked postings lists are saved for use by the term weighting routine
(step two). This routine walks the binary tree and the linked postings list to create an
alphabetical term list (dictionary) and a sequentially stored postings file.
 To do this, each term is consecutively read from the binary tree (this automatically puts
the list in alphabetical order), along with its related data.
 A new sequentially stored postings file is allocated, with two elements per
posting.
 The linked postings list is then traversed, with the frequencies being used to
calculate the term weights (if desired).
 The last step writes the record numbers and corresponding term weights to the
newly created sequential postings file.
 These sequentially stored postings files could not be created in step one because
the number of postings is unknown at that point in processing, and input order is
text order, not inverted file order. The final index files therefore consist of the
same dictionary and sequential postings file as for the basic inverted file.
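The two-step, sort-free method can be sketched as follows. As a simplification, a Python dict stands in for the right-threaded binary tree of the text (it gives the same term lookup, and sorting its keys at the end substitutes for the tree walk); the linked postings lists live in a single list standing in for the one large postings file.

```python
def index_without_sorting(docs):
    """Step one: build one linked postings list per term inside a single
    'file' (here a list), tracking head/tail per term. No explicit sort.
    A dict stands in for the right-threaded binary tree described above."""
    tree = {}            # term -> [n_postings, head_slot, tail_slot]
    postings_file = []   # each element: [doc_id, freq, next_slot or None]
    for doc_id, text in docs.items():
        freqs = {}
        for word in text.lower().split():
            freqs[word] = freqs.get(word, 0) + 1
        for term, freq in freqs.items():
            slot = len(postings_file)
            postings_file.append([doc_id, freq, None])
            if term in tree:
                tree[term][0] += 1
                postings_file[tree[term][2]][2] = slot  # link old tail forward
                tree[term][2] = slot                    # new tail
            else:
                tree[term] = [1, slot, slot]
    return tree, postings_file

def finalize(tree, postings_file):
    """Step two: visit terms in sorted order, traverse each linked list once,
    and emit the sequential dictionary and postings file."""
    dictionary, sequential = [], []
    for term in sorted(tree):
        n, head, _ = tree[term]
        dictionary.append((term, n, len(sequential)))
        slot = head
        while slot is not None:
            doc_id, freq, slot = postings_file[slot]
            sequential.append((doc_id, freq))
    return dictionary, sequential
```

Term weighting would slot into `finalize`, where the frequencies are in hand; it is omitted here to keep the sketch short.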

Fast Inversion Algorithm (FAST-INV)

The second technique to produce a sorted array inverted file is a fast inversion algorithm called
FAST –INV.

This technique takes advantage of two principles:

 The large primary memories available on today's computers.


 The inherent order of the input data.
 The first principle is important since personal computers with more than one megabyte of
primary memory are common, and mainframes may have more than 100 megabytes of
memory. Even if databases are on the order of 1 gigabyte, if they can be split into
memory loads that can be rapidly processed and then combined, the overall cost will be
minimized.
 The second principle is crucial since with large files it is very expensive to use
polynomial or even n log n sorting algorithms. These costs are further compounded if
memory is not used, since then the cost is for disk operations.
Overall Scheme of FAST-INV

 The input to FAST -INV is a document vector file containing the concept vectors for
each document in the collection to be indexed.
 The document numbers appear in the left -hand column and the concept numbers
of the words in each document appear in the right- hand column.
 This is similar to the initial word list for the basic method, except that the words
are represented by concept numbers, one concept number for each unique word
in the collection (i.e., 250,000 unique words implies 250,000 unique concept
numbers).
 Note however that the document vector file is in sorted order, so that concept
numbers are sorted within document numbers, and document numbers are sorted
within collection. This is necessary for FAST-INV to work correctly.
 Input to FAST-INV is a document vector file containing the concept vectors for each
document. A sample document vector file is as shown:
Document   Concept
1          3
1          5
1          12
2          3
2          6
3          1
3          8
Sample document vector

Splitting document vector file:

The load table indicates the range of concepts that should be processed for each primary memory
load. There are two approaches to handling the multiplicity of loads.

 One approach, which is currently used, is to make a pass through the document vector
file to obtain the input for each load. This has the advantage of not requiring additional
storage space (though that can be obviated through the use of magnetic tapes), but has the
disadvantage of requiring expensive disk I/O.
 The second approach is to build a new copy of the document vector collection, with the desired
separation into loads. This can easily be done using the load table, since the sizes of each
load are known, in one pass through the input. As each document vector is read, it is
separated into parts for each range of concepts in the load table, and those parts are
appended to the end of the corresponding section of the output document collection file.
With I/O buffering, the expense of this operation is proportional to the size of the files,
and essentially costs the same as copying the file.

Inverting each load

 When a load is to be processed, the appropriate section of the CONPTR file is needed.
 An output array of size equal to the input document vector file subset is needed.
 As each document vector is processed, the offset (previously recorded in CONPTR) for a
given concept is used to place the corresponding document/weight entry, and then that
offset is incremented.
 Thus, the CONPTR data allows the input to be directly mapped to the output, without any
sorting.
 At the end of the input load the newly constructed output is appended to the inverted file.
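The per-load inversion above is essentially a counting-sort placement: count postings per concept (the CONPTR data), turn the counts into offsets, then drop each document number directly into place. A sketch, using in-memory lists where FAST-INV would use disk files:

```python
def invert_load(doc_vectors, concept_range):
    """Invert one memory load of a (doc, concept) vector file without sorting.
    doc_vectors must already be ordered by document, as FAST-INV requires;
    concept_range is the (lo, hi) concept span assigned to this load."""
    lo, hi = concept_range
    load = [(d, c) for d, c in doc_vectors if lo <= c <= hi]
    # First pass: count postings per concept (this is the CONPTR data).
    counts = {}
    for _, c in load:
        counts[c] = counts.get(c, 0) + 1
    # Turn counts into starting offsets in the output array.
    offsets, total = {}, 0
    for c in sorted(counts):
        offsets[c] = total
        total += counts[c]
    # Second pass: place each document number directly at its offset,
    # then advance the offset. No sorting is performed.
    output = [None] * total
    for d, c in load:
        output[offsets[c]] = d
        offsets[c] += 1
    return output
```

Run on the sample document vector file above with a single load covering concepts 1 through 12, this yields the document lists for concepts 1, 3, 5, 6, 8, and 12 laid out consecutively, ready to be appended to the inverted file.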
