03 UW Indexing
Database System Principles
Notes 4: Indexing
Chapter 4
Topics
• Conventional indexes
• B-trees
• Hashing schemes
• A single-level index is an auxiliary file that
makes it more efficient to search for a record
in the data file.
• The index file usually occupies considerably
fewer disk blocks than the data file because its
entries are much smaller.
• A binary search on the index yields a pointer
to the record in the data file.
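The binary search mentioned above can be sketched as follows. This is a minimal illustration, assuming the index is an in-memory sorted list of (key, pointer) pairs; the name `index_lookup` and the (block, slot) pointer format are hypothetical, not part of the notes.

```python
def index_lookup(index, key):
    """Binary-search a sorted list of (key, pointer) index entries.

    Returns the pointer stored with `key`, or None if the key is absent.
    """
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if index[mid][0] < key:
            lo = mid + 1          # key can only be to the right of mid
        else:
            hi = mid              # key is at mid or to its left
    if lo < len(index) and index[lo][0] == key:
        return index[lo][1]
    return None


# Hypothetical index entries: key -> (block number, slot in block)
example_index = [(10, (0, 0)), (20, (0, 1)), (30, (1, 0)), (40, (1, 1))]
print(index_lookup(example_index, 30))   # (1, 0)
print(index_lookup(example_index, 35))   # None
```

Each probe touches one index entry, so the number of index blocks read grows with the logarithm of the index size rather than with the size of the data file.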
Dense Index
[Figure: a dense index over a sequential file. The index has one entry per record key (10, 20, 30, ..., 120); the file stores its records in key order (10, 20, 30, ..., 100 shown).]
Sparse Index
[Figure: a sparse index over a sequential file. The index has one entry per data block (10, 30, 50, ..., 230); the file stores its records in key order (10, 20, 30, ..., 100 shown).]
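A sparse-index lookup can be sketched like this: find the last index entry whose key is not greater than the search key, then scan that one block. The names `sparse_lookup`, `index_keys` and `blocks` are illustrative, and the two-records-per-block layout is an assumption chosen to match the figure.

```python
from bisect import bisect_right


def sparse_lookup(index_keys, blocks, key):
    """index_keys[i] is the first key stored in blocks[i].

    Returns the block number holding `key`, or None if it is not in the file.
    """
    i = bisect_right(index_keys, key) - 1   # last entry with index key <= key
    if i < 0:
        return None                         # smaller than every indexed key
    return i if key in blocks[i] else None  # scan the single candidate block


# Assumed layout: two records per block, one index entry per block.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]
index_keys = [block[0] for block in blocks]   # [10, 30, 50, 70]
print(sparse_lookup(index_keys, blocks, 40))  # 1
print(sparse_lookup(index_keys, blocks, 45))  # None
```

Unlike a dense index, the sketch has to read the data block before it can say whether a key exists.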
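Sparse 2nd level
[Figure: a two-level sparse index over a sequential file. The second level (10, 90, 170, 250, ..., 570) has one entry per block of the first-level index (10, 30, 50, ..., 230), which in turn has one entry per block of the file (10, 20, 30, ..., 100 shown).]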
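The two-level arrangement can be sketched as two successive "largest entry not greater than the key" searches. This is only an illustration under assumed sizes (four first-level entries per index block, two records per data block); all names are hypothetical.

```python
from bisect import bisect_right


def floor_slot(keys, key):
    """Position of the largest entry <= key, or -1 if there is none."""
    return bisect_right(keys, key) - 1


def two_level_lookup(level2, level1_blocks, data_blocks, key):
    """level2[i] is the first key of level1_blocks[i]; every key inside a
    first-level block is the first key of one data block."""
    i = floor_slot(level2, key)
    if i < 0:
        return None
    j = floor_slot(level1_blocks[i], key)
    if j < 0:
        return None
    # Data blocks are numbered in the order their entries appear in level 1.
    block_no = sum(len(b) for b in level1_blocks[:i]) + j
    return block_no if key in data_blocks[block_no] else None


data_blocks = [[10, 20], [30, 40], [50, 60], [70, 80],
               [90, 100], [110, 120], [130, 140], [150, 160]]
level1_blocks = [[10, 30, 50, 70], [90, 110, 130, 150]]
level2 = [10, 90]
print(two_level_lookup(level2, level1_blocks, data_blocks, 100))  # 4
```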
Notes on pointers:
• BP: block pointer
• RP: record pointer
Sparse vs. Dense Tradeoff
(Later:
– sparse better for insertions
– dense needed for secondary indexes)
Terms
• Index sequential file
• Search key (not necessarily the primary key)
• Primary index (on ordering field)
• Secondary index (on non-ordering field)
• Dense index (every search key value appears in the index)
• Sparse index
• Multi-level index
Next:
• Duplicate keys
• Deletion/Insertion
• Secondary indexes
Duplicate keys
[Figure: a sequential file containing duplicate keys, stored in order: 10, 10, 10, 20, 20, 30, 30, 30, 40, 45.]
Duplicate keys
Dense index, one way to implement?
[Figure: the dense index has one entry per record, so duplicate keys are repeated in the index (10, 10, 10, 20, 20, 30, 30, 30, ...).]
Duplicate keys
Dense index, better way?
[Figure: the index has one entry per distinct key (10, 20, 30, 40, ...), each pointing to the first record with that key.]
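Duplicate keys
Sparse index, one way? (see previous page)
[Figure: the sparse index holds the first key of each block (10, 10, 20, 30, ...) over the file 10, 10 | 10, 20 | 20, 30 | 30, 30 | 40, 45.]
Careful if looking for 20 or 30! Records with these keys also sit in the block before the one their index entry points to.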
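Duplicate keys
Sparse index, another way?
– place first new key from block
[Figure: the index holds 10, 20, 30, 30, ... over the file 10, 10 | 10, 20 | 20, 30 | 30, 30 | 40, 45. The second 30 belongs to a block with no new key, annotated "should this be 40?".]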
Summary: duplicate values, primary index
• The index may point to the first instance of each value only
[Figure: index entries a, b, ... each point to the first record with that value in the file (a, a, a, ..., b, ...).]
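Pointing only at the first instance of each value can be sketched as below. The sketch assumes the file is an in-memory list of keys in sorted order and uses list positions as stand-ins for record pointers; the function names are hypothetical.

```python
def build_first_instance_index(records):
    """records: keys in sorted order, duplicates allowed.

    Returns one (key, position) entry per distinct key, where position is
    the first occurrence of that key in the file.
    """
    index = []
    for pos, key in enumerate(records):
        if not index or index[-1][0] != key:
            index.append((key, pos))
    return index


def find_all(records, index, key):
    """Follow the index to the first instance, then scan forward."""
    for k, pos in index:        # linear scan for brevity; binary search also works
        if k == key:
            end = pos
            while end < len(records) and records[end] == key:
                end += 1
            return list(range(pos, end))
    return []


records = [10, 10, 10, 20, 20, 30, 30, 30, 40, 45]
index = build_first_instance_index(records)
print(index)                         # [(10, 0), (20, 3), (30, 5), (40, 8), (45, 9)]
print(find_all(records, index, 30))  # [5, 6, 7]
```

Because the file is sorted on the key, all duplicates are adjacent, so a single pointer plus a forward scan finds every matching record.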
Deletion from sparse index
– delete record 40
[Figure: sparse index (10, 30, 50, ..., 150) over file blocks 10, 20 | 30, 40 | 50, 60 | 70, 80. Record 40 is removed from its block; the index stays as it was, since 40 is not the first key of a block.]
Deletion from sparse index
– delete record 30
[Figure: same index and file. Record 30 is removed, record 40 moves up to the front of the block, and the index entry 30 is changed to 40.]
Deletion from sparse index
– delete records 30 & 40
[Figure: same index and file. Both records of the block are removed, so the block is empty; the index entry 30 is dropped and the following entries (50, 70, ...) move up.]
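The three sparse-index deletion cases can be sketched in one function. This is a simplified model, not the slides' exact procedure: blocks are Python lists, an emptied block is dropped together with its index entry (the figure instead keeps the empty block and shifts index entries), and all names are hypothetical.

```python
def make_example():
    """Fresh copy of the example: two records per block, sparse index."""
    blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]
    index_keys = [block[0] for block in blocks]
    return index_keys, blocks


def delete_record(index_keys, blocks, key):
    """Delete `key`, keeping index_keys[i] equal to the first key of blocks[i]."""
    for i, block in enumerate(blocks):
        if key in block:
            block.remove(key)
            if not block:
                # Block is now empty: drop it and its index entry.
                del blocks[i]
                del index_keys[i]
            else:
                # First key may have changed (no-op if it did not).
                index_keys[i] = block[0]
            return True
    return False


idx, blk = make_example()
delete_record(idx, blk, 40)
print(idx)   # [10, 30, 50, 70]  (index unchanged)

idx, blk = make_example()
delete_record(idx, blk, 30)
print(idx)   # [10, 40, 50, 70]  (entry 30 replaced by 40)
```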
Deletion from dense index
– delete record 30
[Figure: dense index (10, 20, 30, ..., 80) over file blocks 10, 20 | 30, 40 | 50, 60 | 70, 80. Record 30 is removed and record 40 moves up in its block; the index entry 30 is removed and the following entries (40, ...) move up.]
Insertion, sparse index case
– insert record 34
[Figure: sparse index (10, 30, 40, 60) over file blocks 10, 20 | 30 | 40, 50 | 60. Record 34 goes into the free slot after 30; the index is unchanged.]
• Our lucky day! We have free space where we need it!
Insertion, sparse index case
– insert record 15
[Figure: same index and file. Record 15 goes into the first block, 20 is pushed into the second block (now 20, 30), and the index entry 30 is changed to 20.]
• Illustrated: immediate reorganization
• Variation:
– insert new block (chained file)
– update index
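Insertion, sparse index case
– insert record 25
[Figure: same index and file. The block 10, 20 is full, so record 25 is placed in an overflow block attached to it; the index is unchanged.]
• Overflow blocks (reorganize later...)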
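The "free space" and "overflow block" cases can be sketched as below. This is an assumption-laden illustration: block capacity is fixed at two records, overflow blocks are modelled as a dictionary keyed by block number, and the immediate-reorganization variant is deliberately not implemented.

```python
from bisect import bisect_right, insort

BLOCK_CAPACITY = 2  # assumed: two records per block, as in the figures


def insert_record(index_keys, blocks, overflow, key):
    """Insert `key` into the block chosen by the sparse index.

    Uses free space if the block has any; otherwise appends the key to that
    block's overflow list (to be reorganized later).
    """
    i = max(bisect_right(index_keys, key) - 1, 0)
    if len(blocks[i]) < BLOCK_CAPACITY:
        insort(blocks[i], key)
        index_keys[i] = blocks[i][0]   # only changes if key became the first
        return "in place"
    overflow.setdefault(i, []).append(key)
    return "overflow"


blocks = [[10, 20], [30], [40, 50], [60]]
index_keys = [10, 30, 40, 60]
overflow = {}
print(insert_record(index_keys, blocks, overflow, 34))  # in place
print(insert_record(index_keys, blocks, overflow, 25))  # overflow
```

Overflow keeps inserts cheap but costs extra block reads on lookup until the file is reorganized.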
Insertion, dense index case
• Similar
• Often more expensive . . .
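Secondary indexes
• Sparse index
[Figure: the file is ordered on its sequence field, not on the indexed key: 30, 50 | 20, 70 | 80, 40 | 100, 10 | 90, 60. A sparse index on the first key of each block (30, 20, 80, 100, 90, ...) does not make sense, because it says nothing about the other keys in a block.]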
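Secondary indexes
• Dense index
[Figure: the same unordered file (30, 50 | 20, 70 | 80, 40 | 100, 10 | 90, 60) with a dense index sorted on the key (10, 20, 30, 40, 50, 60, 70, ...), one entry per record.]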
Secondary indexes
• Dense index, with a sparse high level
[Figure: the same file and dense index, plus a sparse higher-level index (10, 50, 90, ...) with one entry per block of the dense index.]
With secondary indexes:
• Lowest level is dense
• Other levels are sparse
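The dense-then-sparse layering can be sketched as follows, using the file contents from the figures. The (block, slot) pointer format, the fanout of four entries per index block, and both function names are assumptions made for the example.

```python
def build_dense_level(file_blocks):
    """Lowest level: one (key, (block, slot)) entry per record, sorted by key."""
    entries = [(key, (b, s))
               for b, block in enumerate(file_blocks)
               for s, key in enumerate(block)]
    entries.sort()
    return entries


def build_sparse_level(entries, fanout):
    """Upper level: first key of each group of `fanout` lower-level entries."""
    return [entries[i][0] for i in range(0, len(entries), fanout)]


# File ordered on some other (sequence) field, not on the indexed key.
file_blocks = [[30, 50], [20, 70], [80, 40], [100, 10], [90, 60]]
dense = build_dense_level(file_blocks)
print(dense[0])                      # (10, (3, 1))
print(build_sparse_level(dense, 4))  # [10, 50, 90]
```

The upper level can be sparse because the dense level itself is sorted on the key, even though the data file is not.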
Duplicate values & secondary indexes
One option...
[Figure: every record gets its own index entry, so duplicate keys are repeated in the index (10, 10, 10, 20, 20, 30, 40, 40, 40, 40, ...).]
Problem: excess overhead!
• disk space
• search time
Duplicate values & secondary indexes
Another option...
[Figure: the index has one entry per distinct key (10, 20, 30, 40), each followed by a list of pointers to all records with that key.]
Problem: variable size records in index!
You can specify it in your SQL statement: CREATE INDEX.
Duplicate values & secondary indexes
Another idea: chain records with same key?
[Figure: the index (10, 20, 30, 40, 50, 60, ...) has one entry per key pointing to one record; records with the same key are linked to each other in the file.]
Problems:
• Need to add fields to records
• Need to follow chain to know records
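Duplicate values & secondary indexes
[Figure: the index (10, 20, 30, 40, 50, 60, ...) has one entry per key; each entry points to a bucket, and the bucket holds the pointers to all records with that key.]
Pointers can be stored in separate blocks: buckets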
Why the “bucket” idea is useful
Records: EMP (name, dept, floor, ...)
Indexes:
• Name: primary
• Dept: secondary
• Floor: secondary
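Query: Get employees in (Toy Dept) ∧ (2nd floor)
[Figure: the Dept. index entry for Toy and the Floor index entry for 2nd each lead to a bucket of pointers into EMP; the records wanted are those pointed to from both buckets.]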
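The query can be answered by intersecting the two pointer buckets before touching any EMP record. The sketch below assumes buckets are sets of record ids; the employee data and the names `build_buckets` and `employees_in` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical EMP records: record id -> (name, dept, floor)
emp = {
    1: ("Ann", "Toy", 2),
    2: ("Bob", "Toy", 1),
    3: ("Cy", "Shoe", 2),
    4: ("Dee", "Toy", 2),
    5: ("Eve", "Book", 3),
}


def build_buckets(records, field):
    """Secondary index: field value -> bucket (set) of record ids."""
    buckets = defaultdict(set)
    for rid, rec in records.items():
        buckets[rec[field]].add(rid)
    return buckets


dept_index = build_buckets(emp, 1)
floor_index = build_buckets(emp, 2)


def employees_in(dept, floor):
    """Intersect the two buckets, then fetch only the matching records."""
    matching = dept_index.get(dept, set()) & floor_index.get(floor, set())
    return [emp[rid][0] for rid in sorted(matching)]


print(employees_in("Toy", 2))  # ['Ann', 'Dee']
```

Only records that satisfy both conditions are read from the data file; the intersection itself works purely on pointers.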
• Conventional index
– Basic Ideas: sparse, dense, multi-
level…
– Duplicate Keys
– Deletion/Insertion
– Secondary indexes
Conventional indexes
Advantage:
- Simple
- Index is a sequential file, good for scans
Disadvantage:
- Inserts expensive, and/or
- Lose sequentiality & balance
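Example Index (sequential)
[Figure: the main area holds 10, 20, 30 | 40, 50, 60 | 70, 80, 90 in contiguous blocks with some free space; keys inserted later (31 to 39) end up scattered, many of them in an overflow area that is not sequential.]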
• Clustering Index (it’s a primary index)
• Clustering Index version 2
Outline:
• Conventional indexes
• B-Trees NEXT
• Hashing schemes
• NEXT: Another type of index
– Give up on sequentiality of index
– Try to get “balance”
B+Tree Example (n=3)
[Figure: root (100); non-leaf nodes (30) and (120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]
Lookup in a B+ tree (n=3)
Sample non-leaf
Keys: 57, 81, 95
Pointers, left to right:
• to keys k < 57
• to keys 57 ≤ k < 81
• to keys 81 ≤ k < 95
• to keys k ≥ 95
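The key ranges of a non-leaf node translate directly into a lookup loop: at each non-leaf, follow the pointer whose range contains the search key, until a leaf is reached. The sketch below is one possible in-memory rendering; the `Node` class, the `leaf` helper and the `"rec<key>"` record pointers are hypothetical, and the tree is the n=3 example from these notes.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children   # child nodes (non-leaf only)
        self.records = records     # record pointers (leaf only)


def leaf(*keys):
    # Stand-in record pointers, one per key.
    return Node(list(keys), records=[f"rec{k}" for k in keys])


def lookup(root, key):
    node = root
    while node.children is not None:
        # bisect_right counts the keys <= key, which is exactly the child
        # slot: slot 0 for key < keys[0], slot i for keys[i-1] <= key < keys[i].
        node = node.children[bisect_right(node.keys, key)]
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None


root = Node([100], children=[
    Node([30], children=[leaf(3, 5, 11), leaf(30, 35)]),
    Node([120, 150, 180], children=[
        leaf(100, 101, 110), leaf(120, 130),
        leaf(150, 156, 179), leaf(180, 200)]),
])
print(lookup(root, 156))  # rec156
print(lookup(root, 31))   # None
```

Every lookup visits one node per level, so the cost is the height of the tree regardless of which key is searched.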
Sample leaf node:
Reached from a non-leaf node
Keys: 57, 81, 95
Pointers, left to right:
• to record with key 57
• to record with key 81
• to record with key 95
• to next leaf in sequence
In textbook’s notation (n=3)
[Figure: a leaf with keys 30, 35 and a non-leaf with key 30, redrawn in the textbook's style.]
Size of nodes (fixed):
• n+1 pointers
• n keys
Don’t want nodes to be too empty
• Use at least
Non-leaf: ⌈(n+1)/2⌉ pointers
Leaf: ⌊(n+1)/2⌋ pointers to data
n=3
[Figure: full nodes: leaf (3, 5, 11) and non-leaf (120, 150, 180). Minimum nodes: leaf (30, 35) and non-leaf (30).]
(3) Number of pointers/keys for B+tree

                      Max ptrs   Max keys   Min ptrs→data   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉       ⌈(n+1)/2⌉ - 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋       ⌊(n+1)/2⌋
Root                  n+1        n          1               1
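The occupancy limits in the table can be computed for any n with integer arithmetic. The sketch assumes the usual textbook rounding, ceiling for non-leaf minimums and floor for leaf minimums; the function name and dictionary keys are illustrative.

```python
def bplus_bounds(n):
    """Pointer/key limits for a B+tree whose nodes hold n keys, n+1 pointers."""
    ceil_half = (n + 2) // 2    # equals ceil((n+1)/2)
    floor_half = (n + 1) // 2   # equals floor((n+1)/2)
    return {
        "non-leaf": {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": ceil_half, "min_keys": ceil_half - 1},
        "leaf":     {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs_to_data": floor_half, "min_keys": floor_half},
        "root":     {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": 1, "min_keys": 1},
    }


print(bplus_bounds(3)["non-leaf"]["min_ptrs"])  # 2
print(bplus_bounds(3)["leaf"]["min_keys"])      # 2
```

For n=3 this gives a minimum non-leaf of one key and a minimum leaf of two keys, matching the n=3 minimum nodes (30) and (30, 35) in these notes.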
Insert into B+tree
(a) Insert key = 32 (n=3)
[Figure: part of a tree with non-leaf keys 100 and 30 and leaves (3, 5, 11) and (30, 31). Key 32 fits in the leaf (30, 31), which becomes (30, 31, 32); no other node changes.]
(b) Insert key = 7 (n=3)
[Figure: same tree. The leaf (3, 5, 11) is full, so it splits into (3, 5) and (7, 11), and key 7 is added to the parent next to 30.]
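The leaf-level part of an insert (fit in place, or split and hand a key to the parent) can be sketched as below. This covers only the leaf step; updating the parent and splitting non-leaf nodes are left out, and the function name and return shape are assumptions.

```python
def insert_into_leaf(keys, key, n):
    """Insert `key` into a leaf holding at most n keys.

    Returns (left, right, separator). If no split is needed, right and
    separator are None. On a split, the left node keeps the first
    ceil((n+1)/2) keys and the separator passed to the parent is the first
    key of the new right node.
    """
    keys = sorted(keys + [key])
    if len(keys) <= n:
        return keys, None, None
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


print(insert_into_leaf([30, 31], 32, 3))   # ([30, 31, 32], None, None)
print(insert_into_leaf([3, 5, 11], 7, 3))  # ([3, 5], [7, 11], 7)
```

In a leaf split the separator key is copied up, so it still appears in the new right leaf.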
(c) Insert key = 160 (n=3)
[Figure: root (100); non-leaf (120, 150, 180) with leaves (150, 156, 179) and (180, 200) shown. The leaf (150, 156, 179) splits into (150, 156) and (160, 179). The full non-leaf (120, 150, 180) then splits into (120, 150) and (180), and key 160 moves up into the root, which becomes (100, 160).]
(d) New root, insert key = 45 (n=3)
[Figure: root (10, 20, 30) with leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40). The leaf (30, 32, 40) splits into (30, 32) and (40, 45). The full root (10, 20, 30) then splits into (10, 20) and (40), and a new root (30) is created above them.]
Deletion from B+tree
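(b) Coalesce with sibling (n=4)
– Delete 50
[Figure: parent keys 10, 40, 100; leaves (10, 20, 30) and (40, 50). After 50 is deleted the leaf (40) is too empty, so 40 moves into the sibling, giving (10, 20, 30, 40), and key 40 is removed from the parent.]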
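(c) Redistribute keys (n=4)
– Delete 50
[Figure: parent keys 10, 40, 100; leaves (10, 20, 30, 35) and (40, 50). After 50 is deleted the leaf (40) is too empty, so 35 moves over from the sibling, giving (10, 20, 30) and (35, 40), and the parent key 40 is changed to 35.]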
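The choice between coalescing and redistributing at the leaf level can be sketched as below. The rule used here (coalesce when both leaves fit in one node, otherwise borrow one key) is inferred from the two examples and is an assumption, as are the function name and return shape; parent updates are only described in comments.

```python
def fix_leaf_underflow(left, right, n):
    """`right` has just lost a key; `left` is its left sibling.

    Returns (action, new_left, new_right).
    """
    min_keys = (n + 1) // 2            # floor((n+1)/2) keys per non-root leaf
    if len(right) >= min_keys:
        return ("ok", left, right)
    if len(left) + len(right) <= n:
        # Everything fits in one node: merge, and the parent drops the
        # separator key and the pointer to the emptied leaf.
        return ("coalesce", left + right, None)
    # Otherwise borrow the largest key from the sibling; the parent's
    # separator becomes the first key of the new right leaf.
    return ("redistribute", left[:-1], [left[-1]] + right)


# n=4, after deleting 50 from the leaf (40, 50):
print(fix_leaf_underflow([10, 20, 30], [40], 4))
# ('coalesce', [10, 20, 30, 40], None)
print(fix_leaf_underflow([10, 20, 30, 35], [40], 4))
# ('redistribute', [10, 20, 30], [35, 40])
```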
(d) Non-leaf coalesce (n=4)
– Delete 37
[Figure: root (25); non-leaf nodes (10, 20) and (30, 40); leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). After 37 is deleted, key 30 moves into the leaf (25, 26), which leaves the non-leaf (40) too empty. It coalesces with (10, 20) and the old root key 25 into a new root (10, 20, 25, 40).]
B+tree deletions in practice
• Speaking of buffering…
Is LRU a good policy for B+tree buffers?
Of course not!
Should try to keep root in memory at all times
(and perhaps some nodes from second level)
Variation on B+tree: B-tree (no +)
• Idea:
– Avoid duplicate keys
– Have record pointers in non-leaf nodes
K1 P1 K2 P2 K3 P3
B-tree example (n=2)
[Figure: root (65, 125); non-leaf nodes (25, 45), (85, 105), (145, 165); leaves (10, 20), (30, 40), (50, 60), (70, 80), (90, 100), (110, 120), (130, 140), (150, 160), (170, 180). Each key appears exactly once in the tree.]
• Sequence pointers not useful now!
(but keep space for simplicity)
Note on inserts
• Say we insert record with key = 25 into the leaf (10, 20, 30), n=3
• Afterwards: the leaf is split into (10, –, –) and (25, 30, –), and key 20 moves up to the parent
So, for B-trees:

                     MAX                             MIN
                     Tree Ptrs  Rec Ptrs  Keys       Tree Ptrs   Rec Ptrs        Keys
Non-leaf, non-root   n+1        n         n          ⌈(n+1)/2⌉   ⌈(n+1)/2⌉ - 1   ⌈(n+1)/2⌉ - 1
Leaf, non-root       1          n         n          1           ⌊n/2⌋           ⌊n/2⌋
Root, non-leaf       n+1        n         n          2           1               1
Root, leaf           1          n         n          1           1               1
Tradeoffs:
• B-trees can have faster lookup than B+trees: a search may find the record pointer in a non-leaf node and stop before reaching a leaf
Outline/summary
• Conventional indexes
– Sparse vs. dense
– Primary vs. secondary
• B-trees
– B+trees vs. B-trees
– B+trees vs. indexed sequential
• Hashing schemes --> Next