Indexing Dbms
Primary Index
• A single-level index is an auxiliary file that makes it more efficient to
search for a record in the data file.
• The index is usually specified on one field of the file (although it could
be specified on several fields)
• One form of an index is a file of entries <field value, pointer to
record>, which is ordered by field value
• The index is called an access path on the field.
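As a rough illustration of the <field value, pointer to record> idea, the sketch below builds an ordered index over a small in-memory "data file" and searches it with binary search. The records, the `name` field, and the `lookup` helper are hypothetical and chosen only for the example; a real index would live in disk blocks rather than Python lists.

```python
from bisect import bisect_left

# Hypothetical data file: list position acts as the record pointer.
# The file is NOT ordered on "name".
data_file = [
    {"id": 17, "name": "Rao"},
    {"id": 4, "name": "Ahmed"},
    {"id": 42, "name": "Lee"},
    {"id": 8, "name": "Chen"},
]

# Index entries <field value, pointer to record>, ordered by field value.
index = sorted((rec["name"], pos) for pos, rec in enumerate(data_file))
keys = [value for value, _ in index]


def lookup(name):
    """Binary-search the index, then follow the pointer into the data file."""
    i = bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return data_file[index[i][1]]
    return None


print(lookup("Lee"))
```

The data file itself is never scanned: the search touches only the ordered index and then one record.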
Index
• Indexes can also be characterized as dense or sparse.
• A dense index has an index entry for every search key value (and
hence every record) in the data file.
• A sparse (or nondense) index, on the other hand, has index
entries for only some of the search values.
• A primary index is a nondense (sparse) index, since it
includes an entry for each disk block of the data file, rather
than for every search value.
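A sparse primary index can be sketched as one entry per block, holding the key of the block's first record. The block contents and the `find` helper below are made up for illustration, and blocks are modelled as in-memory lists rather than disk pages.

```python
from bisect import bisect_right

# Hypothetical ordered data file: blocks of (key, value) records sorted by key.
blocks = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(8, "d"), (9, "e"), (12, "f")],
    [(15, "g"), (20, "h")],
]

# Sparse primary index: one entry per block (key of the block's first record).
# The position in this list plays the role of the block pointer.
anchors = [block[0][0] for block in blocks]


def find(key):
    """Pick the block whose first key is the largest one <= key, then scan it."""
    b = bisect_right(anchors, key) - 1
    if b < 0:
        return None  # key is smaller than every key in the file
    for k, value in blocks[b]:
        if k == key:
            return value
    return None


print(find(9))
```

Here the index has 3 entries for 8 records, which is the sense in which it is nondense.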
Primary Index
The index file for a primary index occupies a much smaller space
than does the data file, for two reasons.
• First, there are fewer index entries than there are records in the data
file.
• Second, each index entry is typically smaller in size than a data
record because it has only two fields, both of which tend to be short
in size; consequently, more index entries than data records can fit in
one block.
Therefore, a binary search on the index file requires fewer block
accesses than a binary search on the data file.
• If the primary index file contains only bi blocks, then locating a
record with a given search key value requires a binary search of that
index plus one access to the block containing that record: a total of
⌈log2 bi⌉ + 1 block accesses.
Example Question
Suppose that we have an ordered file with r = 300,000 records
stored on a disk with block size B = 4,096 bytes. File records are
of fixed size and are unspanned, with record length R = 100
bytes. Compute the blocking factor and the number of blocks
needed for the file. Also, find the number of block accesses if
binary search is used on the data file.
Solution
• The blocking factor for the file is bfr = ⌊B/R⌋ = ⌊4,096/100⌋ =
40 records per block. The number of blocks needed for the file is b =
⌈r/bfr⌉ = ⌈300,000/40⌉ = 7,500 blocks.
• A binary search on the data file would need approximately ⌈log2 b⌉ =
⌈log2 7,500⌉ = 13 block accesses.
• A linear search would require b/2 = 3,750 block accesses on
average.
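The arithmetic in this solution can be checked with a few lines of Python. The variable names follow the symbols used above; the floor and ceiling steps are written out explicitly because records are unspanned and partial blocks still occupy a whole block.

```python
import math

r = 300_000  # number of records
B = 4_096    # block size in bytes
R = 100      # record length in bytes

bfr = B // R                     # floor: whole records per block
b = (r + bfr - 1) // bfr         # ceiling: blocks needed for the file
binary_accesses = math.ceil(math.log2(b))
linear_accesses = b / 2          # average case for a successful search

print(bfr, b, binary_accesses, linear_accesses)
```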
Example Question
• Now suppose that the ordering key field of the file is V = 9 bytes
long, a block pointer is P = 6 bytes long, and we have constructed a
primary index for the file.
• The size of each index entry is Ri = (9 + 6) = 15 bytes, so the
blocking factor for the index is bfri = ⌊B/Ri⌋ = ⌊4,096/15⌋ = 273 entries
per block.
• The total number of index entries ri is equal to the number of blocks
in the data file, which is 7,500. The number of index blocks is hence
bi = ⌈ri/bfri⌉ = ⌈7,500/273⌉ = 28 blocks. A binary search on
the index file would need ⌈log2 bi⌉ = ⌈log2 28⌉ = 5 block accesses.
• To search for a record using the index, we need one additional block
access to the data file, for a total of 5 + 1 = 6 block accesses. This is
an improvement over binary search on the data file, which required 13
disk block accesses.
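The same calculation can be wrapped in a small helper to experiment with other key and pointer sizes. The function name `primary_index_cost` and its parameters are hypothetical; it simply encodes the steps of the solution above.

```python
import math


def primary_index_cost(data_blocks, block_size, key_size, pointer_size):
    """Return (bfri, index blocks, block accesses) for a single-level primary index."""
    entry_size = key_size + pointer_size          # Ri
    bfri = block_size // entry_size               # index entries per block
    index_blocks = math.ceil(data_blocks / bfri)  # one entry per data block
    # Binary search on the index (at least one block read) plus one data block.
    accesses = max(1, math.ceil(math.log2(index_blocks))) + 1
    return bfri, index_blocks, accesses


print(primary_index_cost(data_blocks=7_500, block_size=4_096,
                         key_size=9, pointer_size=6))
```

The `max(1, ...)` guard is an addition for the edge case of a one-block index, where the logarithm alone would report zero index accesses.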
Clustering Index
• If file records are physically ordered on a non-key field, which
does not have a distinct value for each record, that field is
called the clustering field and the data file is called a
clustered file.
• A clustering index can be created on the clustering field to speed up
retrieval of all the records that have the same value for that field.
• This differs from a primary index, which requires that the
ordering field of the data file have a distinct value for each
record.
• It is another example of a nondense index, since it has one entry per
distinct value of the clustering field rather than one per record.
Insertion and deletion are relatively straightforward with a
clustering index.
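A minimal sketch of a clustering index follows, assuming a made-up file of (department number, name) records ordered on the non-key department field. A Python dict stands in for the ordered index file purely for brevity; the block layout and the `records_with` helper are illustrative assumptions.

```python
# Hypothetical clustered file: blocks of (dept, name), ordered on non-key dept.
blocks = [
    [(1, "Ann"), (1, "Bob"), (2, "Cy")],
    [(2, "Di"), (2, "Ed"), (3, "Flo")],
    [(3, "Gus"), (5, "Hal")],
]

# One index entry per DISTINCT clustering value, pointing at the first
# block that contains a record with that value.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for dept, _ in block:
        clustering_index.setdefault(dept, block_no)


def records_with(dept):
    """Collect every record with the given clustering value."""
    start = clustering_index.get(dept)
    if start is None:
        return []
    result = []
    for block in blocks[start:]:
        for d, name in block:
            if d == dept:
                result.append(name)
            elif d > dept:
                return result  # file is ordered, so no later matches exist
    return result


print(records_with(2))
```

Because matching records are physically adjacent, the scan starts at one block and stops as soon as a larger value appears.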
Secondary Index
• A secondary index provides a secondary means of accessing a
data file for which some primary access already exists. The data file
records could be ordered, unordered, or hashed.
• The secondary index may be created on a field that is a candidate
key and has a unique value in every record, or on a non-key field
with duplicate values.
• The index is again an ordered file with two fields. The first field is of
the same data type as some non-ordering field of the data file that is
an indexing field. The second field is either a block pointer or a
record pointer.
• A secondary index on a key field is an example of a dense index, as
an entry is required for each record.
Primary Index vs. Secondary Index
The main difference between a primary and a secondary index is that a
primary index is defined on a set of fields that includes the primary
key of the file and contains no duplicates, while a secondary index is
any index that is not a primary index and may contain duplicates.
Example Question
Assume we have an ordered file with r = 300,000 records stored on a disk
with block size B = 4,096 bytes. File records are of fixed size and are
unspanned, with record length R = 100 bytes. Compute the blocking factor
and the number of blocks needed for the file. Also, find the number of block
accesses if binary search is used on the data file.
Suppose we want to search for a record with a specific value for the
secondary key—a non-ordering key field of the file that is V = 9 bytes long.
Solution
• The blocking factor for the file is bfr = ⌊B/R⌋ = ⌊4,096/100⌋ = 40 records per
block. The number of blocks needed for the file is b = ⌈r/bfr⌉ = ⌈300,000/40⌉ =
7,500 blocks.
• A binary search on the data file would need approximately ⌈log2 b⌉ = ⌈log2 7,500⌉ =
13 block accesses. This applies only to searches on the ordering field; a search
on the non-ordering secondary key without an index would need a linear scan.
• A block pointer is P = 6 bytes long, so each index entry is Ri = (9 + 6) = 15 bytes,
and the blocking factor for the index is bfri = ⌊B/Ri⌋ = ⌊4,096/15⌋ = 273 index
entries per block.
• In a dense secondary index, the total number of index entries ri is equal to the
number of records in the data file, which is 300,000.
• The number of blocks needed for the index is hence bi = ⌈ri/bfri⌉ = ⌈300,000/273⌉ =
1,099 blocks.
• A binary search on this secondary index needs ⌈log2 bi⌉ = ⌈log2 1,099⌉ = 11 block
accesses. One additional block access to the data file gives a total of
11 + 1 = 12 block accesses.
Multilevel Index
• Because a single-level index is an ordered file, we can create a
primary index to the index itself; in this case, the original index file is
called the first-level index and the index to the index is called the
second-level index.
• We can repeat the process, creating a third, fourth, and further levels
until all entries of the top level fit in one disk block.
• A multi-level index can be created for any type of first-level index
(primary, secondary, clustering) as long as the first-level index consists
of more than one disk block
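To see how quickly the levels shrink, the sketch below repeatedly divides the number of index blocks by the fan-out (the index blocking factor) until one block remains. The helper name `index_levels` is hypothetical; the inputs reuse the 28-block primary index and the 1,099-block secondary index from the earlier examples, both with 273 entries per block.

```python
import math


def index_levels(first_level_blocks, fan_out):
    """Blocks at each index level, from the first level up to the one-block top."""
    blocks_per_level = [first_level_blocks]
    while blocks_per_level[-1] > 1:
        # Each higher level holds one entry per block of the level below.
        blocks_per_level.append(math.ceil(blocks_per_level[-1] / fan_out))
    return blocks_per_level


for name, first_level in [("primary", 28), ("dense secondary", 1_099)]:
    levels = index_levels(first_level, fan_out=273)
    # One block read per index level, plus one read of the data block.
    print(name, levels, "block accesses:", len(levels) + 1)
```

Under these assumptions a search reads one block per level, so the multilevel version of the secondary index needs 4 block accesses instead of the 12 required with a binary search on a single level.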
Multilevel Index
Such a multi-level index is a form of search tree; however, insertion
and deletion of index entries is a severe problem because every
level of the index is an ordered file.
A Search Tree with pointers to subtrees
Multilevel Index
• Because of the insertion and deletion problem, most multi-level
indexes use B-tree or B+-tree data structures, which leave space in
each tree node (disk block) to allow for new index entries.
• These data structures are variations of search trees that allow
efficient insertion and deletion of new search values.
• In B-Tree and B+ Tree data structures, each node corresponds to a
disk block.
• Each node is kept between half-full and completely full.
B Tree vs. B+ Tree
• In a B-tree, pointers to data records exist at all levels of the tree.
• In a B+-tree, all pointers to data records exist at the leaf-level nodes.
• A B+-tree can have fewer levels (or higher capacity of search values)
than the corresponding B-tree.
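The following sketch shows a B+-tree lookup in which internal nodes only route the search and record pointers are found solely in leaves. The `Leaf` and `Internal` classes, the tiny hand-built tree, and the convention that keys less than or equal to a separator go to the left subtree are assumptions made for the example; insertion, deletion, and disk I/O are left out.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Leaf:
    keys: list
    pointers: list  # one record pointer per key


@dataclass
class Internal:
    keys: list      # separator values only, no record pointers
    children: list  # len(children) == len(keys) + 1


def search(node, key):
    """Descend to a leaf, then look for the key there."""
    while isinstance(node, Internal):
        # Assumed convention: child i holds keys k with keys[i-1] < k <= keys[i].
        node = node.children[bisect_left(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i]
    return None


# Hand-built example tree with three leaves under one root.
root = Internal(
    keys=[5, 8],
    children=[
        Leaf([1, 5], ["r1", "r5"]),
        Leaf([7, 8], ["r7", "r8"]),
        Leaf([9, 12], ["r9", "r12"]),
    ],
)

print(search(root, 7))
```

In a B-tree, by contrast, the loop could stop early at an internal node, because internal entries would carry their own record pointers.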
Insertion in a B+ Tree
• q = 3 and pleaf = 2.
Deletion in a B+ Tree
Example
To calculate the order p of a B+-tree, suppose that the search key field
is V = 9 bytes long, the block size is B = 512 bytes, a record pointer is
Pr = 7 bytes, and a block pointer/tree pointer is P = 6 bytes. An
internal node of the B+-tree can have up to p tree pointers and p − 1
search field values; these must fit into a single block. Hence, we have:
(p * P) + ((p − 1) * V) ≤ B
(p * 6) + ((p − 1) * 9) ≤ 512
(15 * p) − 9 ≤ 512
(15 * p) ≤ 521
We can choose p to be the largest value satisfying the above
inequality, which gives p = 34.
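The inequality can be solved directly with integer division, as in the sketch below. The `internal_order` function follows the derivation above. The `leaf_order` function is an extension not covered in this example: it assumes the common leaf layout of p_leaf (key, record pointer) pairs plus one block pointer to the next leaf, which is where the record pointer size Pr would come into play.

```python
def internal_order(B, V, P):
    """Largest p with p*P + (p - 1)*V <= B, i.e. p <= (B + V) / (P + V)."""
    return (B + V) // (P + V)


def leaf_order(B, V, Pr, P):
    """Largest p_leaf with p_leaf*(Pr + V) + P <= B (assumed leaf layout)."""
    return (B - P) // (Pr + V)


B, V, Pr, P = 512, 9, 7, 6
p = internal_order(B, V, P)
p_leaf = leaf_order(B, V, Pr, P)

# Sanity checks: p fits in a block, p + 1 does not.
assert p * P + (p - 1) * V <= B < (p + 1) * P + p * V

print(p, p_leaf)
```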