09 Hashtable
09 Hashtable
1
Introduction
• A table has several fields (types of
information)
2
Tables: rows & columns of
information
• To find an entry in the table, you only need
know the contents of one of the fields (not
all of them).
4
Element: a key and its entry
For searching purposes, it is best to store the
key and the entry separately (even though the
key’s value may be inside the entry)
key entry
“Smith” “Smith”, “124 Hawkers Lane”, “9675846”
Element
“Yeo” “Yeo”, “1 Apple Crescent”, “0044 1970 622455”
5
Implementation 1:
Unsorted Sequential Array
• An array in which elements are
stored consecutively in any order key entry
0 14 <data>
• insert: add to back of array; O(1) 1 45 <data>
• find: search through the keys one at 2 22 <data>
a time, potentially all of the keys; 3 67 <data>
O(n) 4 17 <data>
• remove: find and remove node;
…
O(n)
and so on
6
Implementation 2:
Sorted Sequential Array
• An array in which elements are
stored consecutively, sorted by key key entry
0 15 <data>
• insert: add in sorted order; O(n) 1 17 <data>
• find: binary search; O(log n) 2 22 <data>
3 45 <data>
• remove: find, remove element;
O(log n) 4 67 <data>
…
and so on
7
Implementation 3:
Linked List (Unsorted or Sorted)
• Elements are again stored key entry
consecutively
14 <data>
and so on 8
Implementation 4:
Balanced Binary Tree
• A balanced tree, ordered by key
22 <data>
and so on
9
Implementation 5:
Hash Table
• An array in which elements are not key entry
stored consecutively - their place
of storage is calculated using the
key and a hash function 4 <key> <data>
10 <key> <data>
hash array
Key index
function
12
General Idea
• The hash table structure is an array of some fixed size,
containing the items.
• The size of the array is TableSize.
• The items that are stored in the hash table are indexed
by values from 0 to TableSize – 1.
• Each key is mapped into some number in the range 0 to
TableSize – 1
– this number is called hash value.
• The mapping is called a hash function.
13
Example Hash
Table
0
1
Items
2
john 25000
3 john 25000
phil 31250 key Hash 4 phil 31250
Function
dave 27500 5
mary 28200 6 dave 27500
7 mary 28200
key 8
9
hash
value
14
Hash Function
• The hash function
– must be simple to be computed.
– must distribute the keys evenly among the cells.
15
Hash Method
• The worst hash function maps all keys to
the same address
16
Hash function
Problems
• Keys may not be numeric.
• Number of possible keys is much larger than the
size of table.
• Different keys may map into the same location
– Collision issue
– If there are too many collisions, the performance of
the hash table is very low
17
Hash Function
• If the input keys are integers then function
(Key mod TableSize) is a general strategy.
18
Hashing methods
• Many hashing methods
– Truncation
– Division
– Mid-square
– Folding
– Digit analysis
– And so on
19
Truncation method
• Ignore part of the key and use the rest as the array
index (converting non-numeric parts)
• Example
If students have a 9-digit identification number,
take the last 3 digits as the table position
– e.g. 925371622 becomes 622
20
Division method
22
Mid-square method
• Example
– consider items whose keys are 4-digit numbers in base
10
– The goal is to hash these key values to a table of size
100
– This range is equivalent to two digits in base 10, that is,
the number of bits is 2
– For example, the input is the number 4567, squaring
gets an 8-digit number, 20857489
– The middle two digits of this result is 57
23
Folding method
24
Folding method
• Example
25
Digit analysis method
• All the keys in the table are known in advance
• The index is formed by extracting, and then
manipulating specific digits from the key
• For example, the key is 925371622, we select the
digits from 5 through 7 resulting 537
• The manipulation can then take many forms
– Reversing the digits (735)
– Performing a circular shift to the right (753)
– Performing a circular shift to the left (375)
– Swapping each pair of digits (357)
26
Hash Function 1
• Add up the ASCII values of all characters of the key.
int hash(const string &key, int tableSize)
{
int hasVal = 0;
28
Hash Function 3
KeySize -1
hash(key) = å Key
i =0
[ KeySize - i - 1] × 37 i
hashVal %= tableSize;
return hashVal;
};
29
Hash Function 3: example
98 108 105 key[i]
key a l i
0 1 2 i
KeySize = 3;
30
Collision Resolution
• If, when an element is inserted, it hashes to the
same address as an already inserted element, then
we have a collision and need to resolve it.
• There are several methods
– Separate chaining
– Open addressing
• Linear Probing
• Quadratic Probing
• Double Hashing
31
Collision Resolution: Separate
chaining
32
Separate Chaining
• The idea is to use a linked list of all elements that
hash to the same value.
– The array elements are pointers to the first nodes of the
linked lists.
– A new item is inserted into the front of the list.
• Advantages
– Better space utilization for large items.
– Simple collision handling: searching linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: deletion from the linked list.
33
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
0 0
1 81 1
2
4 64 4
5 25
6 36 16
7
9 49 9
34
Operations
• Initialization: all entries are set to NULL
• Find
– locate the cell using hash function.
– sequential search on the linked list in that cell.
• Insertion
– Locate the cell using hash function.
– If the item does not exist, insert it as the first item in the
list.
• Deletion
– Locate the cell using hash function.
– Delete the item from the linked list.
35
Hash Table Class for separate chaining
template <class HashedObj>
class HashTable
{
public:
HashTable(const HashedObj & notFound, int size=101 );
HashTable( const HashTable & rhs )
:ITEM_NOT_FOUND( rhs.ITEM_NOT_FOUND ),
theLists( rhs.theLists ) { }
void makeEmpty( );
void insert( const HashedObj & x );
void remove( const HashedObj & x );
if( p == NULL )
whichList.insert( x, whichList.zeroth( ) );
}
37
Remove routine
/**
* Remove item x from the hash table.
*/
template <class HashedObj>
void HashTable<HashedObj>::remove( const HashedObj & x )
{
theLists[hash(x, theLists.size())].remove( x );
}
38
Find routine
/**
* Find item x in the hash table.
* Return the matching item or ITEM_NOT_FOUND if not found
*/
template <class HashedObj>
const HashedObj & HashTable<HashedObj>::find( const
HashedObj & x ) const
{
HashedObj * itr;
itr = theLists[ hash( x, theLists.size( ) ) ].find( x );
if(itr==NULL)
return ITEM_NOT_FOUND;
else
return *itr;
}
39
Collision Resolution: Open
Addressing
43
Open Addressing
• Separate chaining has the disadvantage of
using linked lists.
– Requires the implementation of a second data
structure.
• For open addressing hashing, all the data go
inside the table.
– Thus, a bigger table is needed.
– If a collision occurs, alternative cells are tried
until an empty cell is found.
44
Open Addressing
• Formally
– Cells h0(x), h1(x), h2(x),… are tried sequentially where
hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– Function f is the collision resolution strategy.
• There are three common collision resolution
strategies (f)
– Linear Probing
– Quadratic probing
– Double hashing
45
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. f is a linear function of i, typically f(i)= i.
• Example
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
• f(i) = i;
46
Figure 20.4 Linear probing hash table after each insertion
47
Find and Delete
• The find algorithm follows the same probe
sequence as the insert algorithm.
– A find for 58 would involve 4 probes.
– A find for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking
items as deleted)
– Standard deletion (i.e. physically removing the
item) cannot be performed.
• When an item is deleted, the location must be marked in a
special way, so that the searches know that the spot used to
have something in it. 48
Clustering Problem
• As long as table is big enough, a free cell
can always be found, but the time to do so
can get quite large.
• Worse, even if the table is relatively empty,
blocks of occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will
require several attempts to resolve the
collision, and then it will add to the cluster.
49
Quadratic Probing
• Quadratic Probing eliminates primary clustering
problem of linear probing.
• Collision function is quadratic.
– The popular choice is f(i) = i2.
• If the hash function evaluates to h and a search in
cell h is inconclusive, we try cells h + 12, h+22, …
h + i2.
– i.e. It examines cells 1,4,9 and so on away from the
original probe.
• Remember that subsequent probe points are a
quadratic number of positions from the original
probe point.
54
Figure 20.6
A quadratic probing
hash table after each address = (x % 10 + i2) % 10
insertion (note that (where i = 0..9)
the table size was
poorly chosen
because it is not a
prime number).
55
Quadratic Probing
• Problem
– We may not be sure that we will probe all locations in
the table.
– If the hash table size is not prime this problem will be
much severe.
56
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically
analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known
as secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
58
Double Hashing
• A second hash function is used to solve the
collision.
– f(i) = i * hash2(x)
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
59
Hashing Applications
• Compilers use hash tables to implement the
symbol table (a data structure to keep track
of declared variables).
• Game programs use hash tables to keep
track of positions it has encountered
(transposition table)
• Online spelling checkers.
61
Summary
• Hash tables can be used to implement the insert
and find operations in constant average time.
• Collision resolution
– Separate chaining
– Open addressing
62