UNIT-2
INVERTED FILES
INTRODUCTION
Three of the most commonly used file structures for information retrieval can be classified as lexicographical indices (indices that are sorted), clustered file structures, and indices based on hashing.
One type of lexicographical index, the inverted file, is presented in this chapter, along with a second type of lexicographical index, the Patricia (PAT) tree.
Building the index requires several decisions about what will (and will not) be indexed:
A controlled vocabulary, which is the collection of keywords that will be indexed. Words
in the text that are not in the vocabulary will not be indexed, and hence are not
searchable.
A list of stop words (articles, prepositions, etc.) that for reasons of volume or precision
and recall will not be included in the index, and hence are not searchable.
A set of rules that decide the beginning of a word or a piece of text that is indexable.
These rules deal with the treatment of spaces, punctuation marks, or some standard
prefixes, and may have significant impact on what terms are indexed.
A list of character sequences to be indexed (or not indexed). In large text databases, not
all character sequences are indexed; for example, character sequences consisting entirely
of numeric characters are often not indexed.
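As an illustration of these parsing decisions, the following Python sketch filters stop words and all-numeric sequences while tokenizing. The stop list, the word pattern, and the function name are assumptions made up for this example, not part of any standard.

```python
import re

# Illustrative parsing rules (assumptions for this sketch, not a standard):
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")   # rule deciding where a word begins and ends

def indexable_terms(text):
    """Yield the terms of `text` that survive the indexing decisions."""
    for match in WORD_PATTERN.finditer(text):
        term = match.group().lower()
        if term in STOP_WORDS:       # stop words are not indexed
            continue
        if term.isdigit():           # all-numeric sequences are not indexed
            continue
        yield term

print(list(indexable_terms("The 1990 report on computing in libraries")))
# -> ['report', 'computing', 'libraries']
```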
There are several structures that can be used in implementing inverted files:
Sorted arrays
B-trees
Tries and various hashing structures, or combinations of these structures.
These are sorted indices and can efficiently support range queries, such as retrieving all
documents having keywords that start with “comput”.
Range Query: a common database operation that retrieves all records where some value lies
between an upper and a lower boundary.
The prefix B-tree method breaks down if there are many words with the same (long) prefix.
In this case, common prefixes should be further divided to avoid wasting space.
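The kind of range query mentioned above can be sketched with a sorted in-memory term array; here Python's bisect module stands in for the sorted index. This is not the B-tree itself, only an illustration of why sorted order makes prefix queries cheap.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms that start with `prefix` from an already sorted list.

    The lower bound of the range is the prefix itself; the upper bound is the
    prefix with its last character incremented, which sorts just past every
    term that begins with the prefix.
    """
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return sorted_terms[lo:hi]

terms = sorted(["compute", "computer", "computing", "data", "index", "query"])
print(prefix_range(terms, "comput"))   # ['compute', 'computer', 'computing']
```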
Figure: Prefix B-tree
Tries
Inverted files can also be implemented using a trie structure.
This structure uses the digital decomposition of the set of keywords to represent those
keywords.
A special trie structure, the Patricia (PAT) tree, is especially useful in information
retrieval.
Figure: A trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”
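A minimal trie sketch for the keys in the figure is given below. It shows only the digital (character-by-character) decomposition of the keywords, not the Patricia tree refinements.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # one branch per character
        self.is_key = False     # True if a keyword ends at this node

def insert(root, word):
    node = root
    for ch in word:             # digital decomposition: branch character by character
        node = node.children.setdefault(ch, TrieNode())
    node.is_key = True

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_key

root = TrieNode()
for key in ["A", "to", "tea", "ted", "ten", "i", "in", "inn"]:
    insert(root, key)
print(contains(root, "ten"), contains(root, "te"))   # True False
```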
The production of sorted array inverted files can be divided into two or three sequential steps, as
shown in the figure below. First, the input text must be parsed into a list of words along with their
locations in the text. This is usually the most time- and storage-consuming operation in
indexing. Second, this list must be inverted, from a list of terms in location order to a list of
terms ordered for use in searching (sorted into alphabetical order, with a list of all locations
attached to each term). An optional third step is the postprocessing of these inverted files, for
example to add term weights or to reorganize or compress the files.
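A minimal sketch of the first two steps follows. It keeps everything in memory and uses a deliberately simplified parsing rule (splitting on whitespace), so it only illustrates the parse-then-invert flow, not a production indexer.

```python
from collections import defaultdict

def parse(documents):
    """Step 1: parse the text into (term, record_number) pairs, in text order."""
    word_list = []
    for record_no, text in documents.items():
        for term in text.lower().split():     # simplified parsing rule
            word_list.append((term, record_no))
    return word_list

def invert(word_list):
    """Step 2: sort by term so that all locations for a term are attached to it."""
    inverted = defaultdict(list)
    for term, record_no in sorted(word_list):
        inverted[term].append(record_no)
    return dict(inverted)

docs = {1: "inverted file structures", 2: "file organization"}
print(invert(parse(docs)))
# {'file': [1, 2], 'inverted': [1], 'organization': [2], 'structures': [1]}
```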
Although an inverted file could be used directly by the search routine, it is usually processed into
an improved final format. This format is based on the search methods and the (optional)
weighting methods used. A common search technique is to use a binary search routine on the file
to locate the query words. This implies that the file to be searched should be as short as possible,
and for this reason the single file shown containing the terms, locations, and (possibly)
frequencies is usually split into two pieces.
The first piece is the dictionary containing the term, statistics about that term such as number of
postings, and a pointer to the location of the postings file for that term. The second piece is the
postings file itself, which contains the record numbers (plus other necessary location
information) and the (optional) weights for all occurrences of the term. In this manner, the
dictionary used in the binary search has only one “line” per unique term.
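The split into a dictionary and a postings file can be sketched as follows. Flat Python lists stand in for the two on-disk files, and the exact entry layout (term, posting count, offset) is an assumption made for the example.

```python
def split_index(inverted):
    """Split a term -> postings mapping into a dictionary and a postings file."""
    dictionary = []        # one "line" per unique term: (term, count, offset)
    postings_file = []     # record numbers, one consecutive run per term
    for term in sorted(inverted):
        records = inverted[term]
        dictionary.append((term, len(records), len(postings_file)))
        postings_file.extend(records)
    return dictionary, postings_file

dictionary, postings_file = split_index({"file": [1, 2], "inverted": [1]})
print(dictionary)       # [('file', 2, 0), ('inverted', 1, 2)]
print(postings_file)    # [1, 2, 1]
```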
Two different techniques are presented as improvements on the basic inverted file
creation.
The first technique is for working with very large data sets using secondary storage.
The second technique uses multiple memory loads for inverting files.
Most computers cannot sort the very large disk files needed to hold the initial word list within a
reasonable time. The data can be broken into smaller pieces, each piece processed separately, and
the results then merged; although this works, it carries a significant overhead.
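The break-into-pieces-and-merge idea can be sketched as follows, with in-memory lists standing in for the temporary disk files that would normally hold each sorted run.

```python
import heapq

def external_sort(word_list, chunk_size):
    """Sort a word list too large for one memory load: sort fixed-size pieces,
    then merge the sorted runs (which would normally be written to disk)."""
    runs = []
    for start in range(0, len(word_list), chunk_size):
        runs.append(sorted(word_list[start:start + chunk_size]))
    return list(heapq.merge(*runs))

word_list = [("zebra", 3), ("apple", 1), ("mango", 2), ("apple", 3), ("kiwi", 1)]
print(external_sort(word_list, chunk_size=2))
# [('apple', 1), ('apple', 3), ('kiwi', 1), ('mango', 2), ('zebra', 3)]
```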
The new indexing method is a two-step process that does not need the middle sorting step.
The first step produces the initial inverted file. The second step adds the term weights to that file
and reorganizes the file for maximum efficiency.
The creation of the initial inverted file avoids the use of an explicit sort by using a right-threaded
binary tree.
The data contained in each binary tree node is the current number of term
postings and the storage location of the postings list for that term.
As each term is identified by the text parsing program, it is looked up in the binary
tree, and it either is added to the tree, along with its related data, or causes the
tree data to be updated.
The postings are stored as multiple linked lists, one variable length linked list for
each term, with the lists stored in one large file.
Each element in the linked postings file consists of a record number, the term
frequency in that record, and a pointer to the next element in the linked list for
that given term.
By storing the postings in a single file, no storage is wasted, and the files are
easily accessed by following the links.
As the location of both the head and tail of each linked list is stored in the binary tree, the
entire list does not need to be read for each addition, but only once for use in creating the
final postings file (step two).
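Step one can be sketched in Python as follows. A plain (unthreaded) binary search tree stands in for the right-threaded tree, and a single list stands in for the large postings file; each list element holds a record number, a term frequency, and the index of the next element for the same term. Names and layout are assumptions for the example.

```python
class Node:
    """One term in the tree: its posting count plus the head and tail
    positions of its linked list within the shared postings file."""
    def __init__(self, term):
        self.term = term
        self.count = 0
        self.head = self.tail = -1
        self.left = self.right = None

postings = []   # shared "file": each element is [record_no, frequency, next_index]

def find_or_insert(node, term):
    """Standard binary-search-tree lookup/insert; returns (subtree_root, node_for_term)."""
    if node is None:
        node = Node(term)
        return node, node
    if term < node.term:
        node.left, found = find_or_insert(node.left, term)
    elif term > node.term:
        node.right, found = find_or_insert(node.right, term)
    else:
        found = node
    return node, found

def add_posting(root, term, record_no, frequency):
    root, node = find_or_insert(root, term)
    postings.append([record_no, frequency, -1])   # new tail element of the linked list
    new_index = len(postings) - 1
    if node.tail >= 0:
        postings[node.tail][2] = new_index        # link the old tail forward
    else:
        node.head = new_index                     # first posting for this term
    node.tail = new_index
    node.count += 1
    return root

root = None
for record_no, term, freq in [(1, "file", 2), (1, "inverted", 1), (2, "inverted", 3)]:
    root = add_posting(root, term, record_no, freq)
```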
The binary tree and linked postings lists are saved for use by the term weighting routine
(step two). This routine walks the binary tree and the linked postings list to create an
alphabetical term list (dictionary) and a sequentially stored postings file.
To do this, each term is consecutively read from the binary tree (this automatically puts
the list in alphabetical order), along with its related data.
A new sequentially stored postings file is allocated, with two elements per
posting.
The linked postings list is then traversed, with the frequencies being used to
calculate the term weights (if desired).
The last step writes the record numbers and corresponding term weights to the
newly created sequential postings file.
These sequentially stored postings files could not be created in step one because
the number of postings is unknown at that point in processing, and input order is
text order, not inverted file order. The final index files therefore consist of the
same dictionary and sequential postings file as for the basic inverted file.
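Step two, continuing the sketch above (it reuses the Node objects and the postings list built in step one), walks the tree in order and copies each linked list into a sequentially stored postings file. The raw frequency stands in for whatever term-weighting formula would actually be used.

```python
def build_final_index(root, postings):
    """Walk the tree in alphabetical (in-order) sequence and copy each term's
    linked postings into a sequentially stored postings file."""
    dictionary, seq_postings = [], []

    def walk(node):
        if node is None:
            return
        walk(node.left)                       # in-order walk -> alphabetical terms
        dictionary.append((node.term, node.count, len(seq_postings)))
        i = node.head
        while i >= 0:                         # follow this term's linked list
            record_no, freq, nxt = postings[i]
            weight = freq                     # stand-in for a real weighting formula
            seq_postings.append((record_no, weight))
            i = nxt
        walk(node.right)

    walk(root)
    return dictionary, seq_postings

dictionary, seq_postings = build_final_index(root, postings)
print(dictionary)      # [('file', 1, 0), ('inverted', 2, 1)]
print(seq_postings)    # [(1, 2), (1, 1), (2, 3)]
```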
The second technique to produce a sorted array inverted file is a fast inversion algorithm called
FAST-INV.
The input to FAST-INV is a document vector file containing the concept vectors for
each document in the collection to be indexed.
The document numbers appear in the left-hand column and the concept numbers
of the words in each document appear in the right-hand column.
This is similar to the initial word list for the basic method, except that the words
are represented by concept numbers, one concept number for each unique word
in the collection (i.e., 250,000 unique words implies 250,000 unique concept
numbers).
Note however that the document vector file is in sorted order, so that concept
numbers are sorted within document numbers, and document numbers are sorted
within collection. This is necessary for FAST-INV to work correctly.
A sample document vector file is shown below:

Document    Concept
    1           3
    1           5
    1          12
    2           3
    2           6
    3           1
    3           8

Figure: Sample document vector file
The load table indicates the range of concepts that should be processed for each primary memory
load. There are two approaches to handling the multiplicity of loads.
One approach, which is currently used, is to make a pass through the document vector
file to obtain the input for each load. This has the advantage of not requiring additional
storage space (though that can be obviated through the use of magnetic tapes), but has the
disadvantage of requiring expensive disk I/O.
The second approach is to build a new copy of the document vector collection, with the desired
separation into loads. This can easily be done in one pass through the input using the load table,
since the size of each load is known. As each document vector is read, it is
separated into parts for each range of concepts in the load table, and those parts are
appended to the end of the corresponding section of the output document collection file.
With I/O buffering, the expense of this operation is proportional to the size of the files,
and it costs essentially the same as copying the file.
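A sketch of this second approach is given below. The function names, the in-memory lists standing in for the document vector file, and the simple postings-per-load limit are all assumptions made for the example.

```python
from collections import Counter

def build_load_table(doc_vectors, max_postings_per_load):
    """Group consecutive concept numbers into loads so that no load holds more
    postings than the (simulated) primary memory limit allows."""
    counts = Counter(concept for _, concept in doc_vectors)
    load_table, start, in_load = [], None, 0
    for concept in sorted(counts):
        if start is None:
            start, in_load = concept, 0
        if in_load and in_load + counts[concept] > max_postings_per_load:
            load_table.append((start, concept - 1))   # close the current load
            start, in_load = concept, 0
        in_load += counts[concept]
    if start is not None:
        load_table.append((start, max(counts)))
    return load_table

def split_into_loads(doc_vectors, load_table):
    """One pass through the input: append each (document, concept) pair to the
    section of the copied collection that belongs to its load."""
    sections = [[] for _ in load_table]
    for doc, concept in doc_vectors:
        for i, (low, high) in enumerate(load_table):
            if low <= concept <= high:
                sections[i].append((doc, concept))
                break
    return sections

doc_vectors = [(1, 3), (1, 5), (1, 12), (2, 3), (2, 6), (3, 1), (3, 8)]
table = build_load_table(doc_vectors, max_postings_per_load=3)
print(table)                                 # [(1, 4), (5, 11), (12, 12)]
print(split_into_loads(doc_vectors, table))
```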
When a load is to be processed, the appropriate section of the CONPTR file is needed.
An output array of size equal to the input document vector file subset is needed.
As each document vector is processed, the offset (previously recorded in CONPTR) for a
given concept is used to place the corresponding document/weight entry, and then that
offset is incremented.
Thus, the CONPTR data allows the input to be directly mapped to the output, without any
sorting.
At the end of the input load the newly constructed output is appended to the inverted file.
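The offset-driven placement can be sketched as follows. Here a small dictionary stands in for the CONPTR file, holding the starting offset of each concept's postings in the output for one load; the names and the in-memory representation are assumptions made for the example.

```python
from collections import Counter

def invert_load(load_vectors, concepts_in_load):
    """Invert one memory load: place each (document, concept) pair directly at
    its final position using precomputed per-concept offsets (no sorting)."""
    counts = Counter(concept for _, concept in load_vectors)

    # CONPTR-style offsets: where each concept's postings begin in the output.
    conptr, offset = {}, 0
    for concept in sorted(concepts_in_load):
        conptr[concept] = offset
        offset += counts.get(concept, 0)

    output = [None] * len(load_vectors)      # same size as the input subset
    for doc, concept in load_vectors:        # input arrives in document order
        output[conptr[concept]] = (concept, doc)
        conptr[concept] += 1                 # advance this concept's offset
    return output                            # now in concept (inverted file) order

load = [(1, 3), (1, 5), (2, 3), (2, 6), (3, 8)]
print(invert_load(load, concepts_in_load=[3, 5, 6, 8]))
# [(3, 1), (3, 2), (5, 1), (6, 2), (8, 3)]
```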