Hashing Part1
& Algorithms
Chapter 5
Let us assume that we want to search for a particular item in a database of 20,000,000 data items
How long would a successful search take?
How long would an unsuccessful search take?
It depends on the data structure
If the data structure is a linked list, the search time is O(N)
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing
Chapter 5: Hashing
Our goals: We will
See several methods of implementing the hash table
Compare these methods analytically
Show numerous applications of hashing
Compare hash tables with binary search trees
First some terminology
Hash table ADT is a data structure that supports only a subset of the operations allowed by binary search trees
Implementation of a hash table is called hashing
General Idea
The general idea behind hashing is to directly map each data item into an address in memory using some function
key → hash function → index to an array
Components of hashing
General Idea…
[Figure: a data item (Name: Salman Arain; University: NFC-IEFR; Office: First Floor; Mobile Number) is hashed on its name, h(Salman) = 1, and stored in cell 1 of the array]
General Idea…
Desired Properties of h(k)
simple to compute
uniform distribution of keys over {0, 1, …, m-1}
when h(k1) = h(k2) for two distinct keys k1, k2, we have a collision
[Figure: two data items in an array with cells 0 to m-1. Item 1: Name: Salman Arain; University: NFC-IEFR; Office: First Floor; Mobile Number. Item 2: Name: Hassan Hamid; University: UET; Office: room 8 EED; Mobile Number. Since h(Salman) = 1 and h(Hamid) = 1, a collision has occurred]
General Idea…
Two Important Topics in Hashing
How to select a hash function
How to resolve collisions
General Idea…
Hashing revisited
A hash table data structure is an array
Each data element contains a key
Each key is mapped to some number in the range from 0 to TableSize-1, with the help of a hash function
The hash function should be efficient to compute and should ensure that different data items get mapped to different numbers
The key and the hashing function are used both to insert the data into the table and to later find that data
General Idea…
PTCL is a large telephone company, and they want to maintain a database that provides caller ID
given a phone number, return the caller's name
phone numbers range from 0 to r = 10^7 - 1
want to do this as efficiently as possible
General Idea…
Solution 1
an array indexed by key
takes O(1) time,
O(r) space - huge amount of wasted space
General Idea…
Solution 2
Linked list
takes O(n) time, where n is the number of stored entries,
O(n) space (only as much space as is needed)
General Idea…
Solution 3
Hash table
O(1) expected time, O(n+m) space, where m is table size
Like an array, but come up with a function to map the
large range into one which we can manage
e.g. take the original key, modulo the (relatively small) size of the array, and use that as an index
6829229 mod 5 = 4
Hash Function
A simple hash function
If input keys (k) are integers
hash function, h( k ) = k mod m
where m is the table size
Suppose m = 10,
k = 10, 20, 30, 40
h(k) = 0, 0, 0, 0
A bad choice if the keys end in zeros
Hash Function…
Another simple hash function
If input keys (k) are integers
hash function, h( k ) = k mod m
where m is the table size and is a prime number
Suppose m = 11,
k = 10, 20, 30, 40
h(k) = 10, 9, 8, 7
Distributes the keys more uniformly
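The two table sizes above can be compared directly. The following is a minimal sketch in C, assuming non-negative integer keys; the names hash_mod and distinct_slots are made up for illustration and do not come from the slides.

```c
#include <assert.h>

/* Division-method hash: h(k) = k mod m.
   hash_mod is a hypothetical name; keys are assumed non-negative. */
unsigned int hash_mod(unsigned int k, unsigned int m)
{
    assert(m > 0);                      /* the table size must be positive */
    return k % m;
}

/* Count how many different slots a set of n keys lands in. */
unsigned int distinct_slots(const unsigned int *keys, unsigned int n,
                            unsigned int m)
{
    unsigned int used = 0;
    for (unsigned int i = 0; i < n; i++) {
        int seen = 0;
        /* has an earlier key already hashed to the same slot? */
        for (unsigned int j = 0; j < i; j++)
            if (hash_mod(keys[j], m) == hash_mod(keys[i], m))
                seen = 1;
        if (!seen)
            used++;
    }
    return used;
}
```

For the keys 10, 20, 30, 40, distinct_slots reports one slot when m = 10 and four slots when m = 11, which matches the two examples above.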
Hash Function…
A simple hash function
If the keys are strings, then the hash function can be some function of the characters in the strings
One possibility is to simply add the ASCII values of the characters:
h(str) = \left( \sum_{i=0}^{length-1} str[i] \right) \bmod m
h(ABC) = (65 + 66 + 67) % m
Hash Function…
Programming details
unsigned int
Hash1( const char *Key, int TableSize )
{
    unsigned int HashVal = 0;

    /* Add up the character codes of the key */
    while( *Key != '\0' )
        HashVal += *Key++;

    return HashVal % TableSize;
}
Hash Function…
If the table size is large, the function does not distribute the keys well
TableSize = 10,007 (prime number)
Keys are <= 8 characters
Each char is 1 byte long, but an ASCII value uses only 7 bits, so the highest value it can have is 2^7 - 1 = 127
Hash function will have range: 0 to (127*8) = 0 to 1016
~10K spaces in the table, but only the first ~1K cells are ever used
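The arithmetic behind this observation can be sketched in a few lines of C. The function and parameter names below are illustrative assumptions, not part of the slides.

```c
#include <assert.h>

/* Largest value the additive string hash can reach before the modulo,
   assuming keys of at most max_len characters, each at most max_char. */
unsigned int additive_hash_max(unsigned int max_len, unsigned int max_char)
{
    return max_len * max_char;
}

/* Number of table cells that can ever be hit: the values 0 .. max,
   capped at the table size. */
unsigned int reachable_slots(unsigned int max_len, unsigned int max_char,
                             unsigned int table_size)
{
    unsigned int max = additive_hash_max(max_len, max_char);
    assert(table_size > 0);
    return (max + 1 < table_size) ? max + 1 : table_size;
}
```

With 8 characters of at most 127 each, additive_hash_max gives 1016, so reachable_slots returns 1017 out of the 10,007 cells.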
Hash Function…
Another hash function
If the keys are strings
convert the string into some number in some arbitrary base b
h(str) = \left( \sum_{i=0}^{length-1} str[i] \cdot b^{i} \right) \bmod m
h(ABC) = (65 b^0 + 66 b^1 + 67 b^2) % m
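A minimal sketch of this base-b hash in C follows. The name poly_hash is hypothetical, and reducing the running power modulo m at every step is a choice made here to keep intermediate values small; the slides do not prescribe it.

```c
#include <assert.h>

/* Polynomial string hash: the sum of str[i] * b^i, reduced mod m.
   poly_hash is a hypothetical name used only for this sketch. */
unsigned int poly_hash(const char *str, unsigned int b, unsigned int m)
{
    unsigned long long sum = 0;
    unsigned long long power = 1;       /* holds b^i mod m */

    assert(m > 0);
    for (; *str != '\0'; str++) {
        sum = (sum + (unsigned char)*str * power) % m;
        power = (power * b) % m;
    }
    return (unsigned int)sum;
}
```

For example, with b = 27 and m = 10,007 the key "ABC" gives (65 + 66*27 + 67*729) mod 10,007 = 50,690 mod 10,007 = 655.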
Hash Function…
Examines first three characters of the input
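The value 27 represents the number of letters in the English alphabet, plus the blank
unsigned int
Hash2( const char *Key, int TableSize )
{
    /* 729 = 27 * 27; assumes Key holds at least two characters before its terminator */
    return ( Key[ 0 ] + 27 * Key[ 1 ] + 729 * Key[ 2 ] ) % TableSize;
}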
Hash Function…
Rule of Thumb
Separate Chaining
How to deal with two keys which hash to the same spot in the array?
Use chaining
All data items that hash to the same number are kept in a linked list
Set up an array of lists, indexed by hash value, where each list holds the items whose keys hash to that index
Separate Chaining…
Here the size of the hash table = 10
Keys are the first ten perfect squares 0, 1, 4, 9, 16, 25, 36, 49, 64, and 81
The hash function, h(k) = k mod 10
To insert an element
compute h(k) to determine which list to traverse
If T[h(k)] contains a null pointer, initialize this entry to point to a linked list that contains k alone
If T[h(k)] is a non-empty list, we add k at the beginning of this list
Separate Chaining…
To delete an element
compute h(k), then search for k within the list at T[h(k)]
delete k if it is found
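The insert and delete steps described above can be sketched as follows. This is a simplified illustration, not the course code: it assumes non-negative int keys, a fixed table of 10 cells, and lists without header nodes, and every name in it (Node, chain_insert, chain_delete and so on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 10                   /* assumed size, as in the example */

struct Node {
    int key;
    struct Node *next;
};

/* h(k) = k mod TABLE_SIZE, for non-negative keys */
unsigned int chain_hash(int k)
{
    assert(k >= 0);
    return (unsigned int)k % TABLE_SIZE;
}

/* Return the node holding k, or NULL if k is absent. */
struct Node *chain_find(struct Node *table[], int k)
{
    struct Node *p = table[chain_hash(k)];
    while (p != NULL && p->key != k)
        p = p->next;
    return p;
}

/* Add k at the front of its list; returns 0 only if allocation fails. */
int chain_insert(struct Node *table[], int k)
{
    if (chain_find(table, k) != NULL)
        return 1;                       /* already present, nothing to do */
    struct Node *cell = malloc(sizeof *cell);
    if (cell == NULL)
        return 0;
    unsigned int h = chain_hash(k);
    cell->key = k;
    cell->next = table[h];              /* NULL when the list was empty */
    table[h] = cell;
    return 1;
}

/* Unlink and free the node holding k; returns 1 if found, 0 otherwise. */
int chain_delete(struct Node *table[], int k)
{
    struct Node **link = &table[chain_hash(k)];
    while (*link != NULL && (*link)->key != k)
        link = &(*link)->next;
    if (*link == NULL)
        return 0;
    struct Node *victim = *link;
    *link = victim->next;
    free(victim);
    return 1;
}

/* Number of nodes in the list stored at the given index. */
int chain_length(struct Node *table[], unsigned int index)
{
    int n = 0;
    for (struct Node *p = table[index]; p != NULL; p = p->next)
        n++;
    return n;
}
```

Starting from an array of NULL pointers and inserting the ten perfect squares, cells 1, 4, 6 and 9 each hold two keys, cells 0 and 5 hold one, and the rest stay empty.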
Separate Chaining…
Programming Details
#ifndef _HashSep_H
#define _HashSep_H

/* ElementType is assumed to be defined before this header is included */
struct ListNode;
typedef struct ListNode *Position;
struct HashTbl;
typedef struct HashTbl *HashTable;

HashTable InitializeTable( int TableSize );
void DestroyTable( HashTable H );
Position Find( ElementType Key, HashTable H );
void Insert( ElementType Key, HashTable H );
ElementType Retrieve( Position P );

#endif /* _HashSep_H */

/* List node used by the implementation */
struct ListNode
{
    ElementType Element;
    Position Next;
};
Separate Chaining…
Programming Details
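/* List is assumed to be an alias for Position (a pointer to a list's header node) */
typedef Position List;

struct HashTbl
{
    int TableSize;
    List *TheLists;
};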
Separate Chaining…
Programming Details
HashTable
InitializeTable( int TableSize )
{
    HashTable H;
    int i;

    if( TableSize < MinTableSize )
    {
        Error( "Table size too small" );
        return NULL;
    }

    /* Allocate table */
    H = malloc( sizeof( struct HashTbl ) );
    if( H == NULL )
        FatalError( "Out of space!!!" );

    H->TableSize = NextPrime( TableSize );

    /* Allocate array of lists */
    H->TheLists = malloc( sizeof( List ) * H->TableSize );
    if( H->TheLists == NULL )
        FatalError( "Out of space!!!" );

    /* Allocate list headers */
    for( i = 0; i < H->TableSize; i++ )
    {
        H->TheLists[ i ] = malloc( sizeof( struct ListNode ) );
        if( H->TheLists[ i ] == NULL )
            FatalError( "Out of space!!!" );
        else
            H->TheLists[ i ]->Next = NULL;
    }

    return H;
}
Separate Chaining…
Programming Details
Position
Find( ElementType Key, HashTable H )
{
    Position P;
    List L;

    L = H->TheLists[ Hash( Key, H->TableSize ) ];
    P = L->Next;                    /* skip the header node */
    while( P != NULL && P->Element != Key )
        P = P->Next;
    return P;                       /* NULL if Key is not in the table */
}
Separate Chaining…
Programming Details
void
Insert( ElementType Key, HashTable H )
{
    Position Pos, NewCell;
    List L;

    Pos = Find( Key, H );
    if( Pos == NULL )   /* Key is not found */
    {
        NewCell = malloc( sizeof( struct ListNode ) );
        if( NewCell == NULL )
            FatalError( "Out of space!!!" );
        else
        {
            /* Link the new cell in right after the header node */
            L = H->TheLists[ Hash( Key, H->TableSize ) ];
            NewCell->Next = L->Next;
            NewCell->Element = Key;
            L->Next = NewCell;
        }
    }
}
Separate Chaining…
Analysing the performance of separate chaining hash tables
as we increase the number of elements N in the hash table, more and more items will be stored in linked lists, thus slowing everything down
Also, increasing the table size TableSize allows you to hold more data in an efficient manner
It turns out that the ratio λ = N / TableSize is the important quantity to analyze
This is called the load factor
Separate Chaining…
Analysing the performance of separate chaining hash tables
Time to perform a search = the constant time required to evaluate the hash function + the time to traverse the list
Note that, for separate chaining, the average length of a linked list is λ
Thus, an unsuccessful search traverses λ links on average
A successful search requires that about 1 + (λ/2) links be traversed
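These estimates are easy to evaluate for a given table. The sketch below assumes the load factor is the number of elements divided by the table size, as defined earlier; the function names are hypothetical.

```c
#include <assert.h>

/* Load factor: lambda = N / TableSize. */
double load_factor(unsigned int n, unsigned int table_size)
{
    assert(table_size > 0);
    return (double)n / (double)table_size;
}

/* Expected links traversed by an unsuccessful search: lambda. */
double chain_links_unsuccessful(double lambda)
{
    return lambda;
}

/* Expected links traversed by a successful search: about 1 + lambda / 2. */
double chain_links_successful(double lambda)
{
    return 1.0 + lambda / 2.0;
}
```

For instance, 10 keys in a table of 10 lists give λ = 1, so an unsuccessful search follows about 1 link and a successful one about 1.5.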
Separate Chaining…
Open Addressing
Separate chaining has the disadvantage of using linked lists, which slows the algorithm because of the time required to allocate new cells
Open addressing
relocate the key k to be inserted if it collides with an existing key
That is, we store k at an entry different from T[h(k)]
Open Addressing…
Open addressing hashing resolves collisions by trying alternative slots in the hash table, until an empty cell is found
cells h0(X), h1(X), h2(X), … are tried in succession
where hi(X) = (Hash(X) + F(i)) mod TableSize with F(0) = 0
The function, F, is the collision resolution strategy
Open Addressing…
Linear Probing
F(i) is a linear function of i, i.e. F(i) = i
h0(X) = (Hash(X) + 0) mod TableSize
h1(X) = (Hash(X) + 1) mod TableSize
h2(X) = (Hash(X) + 2) mod TableSize
cells are probed sequentially (with wraparound) in search of an empty cell
Open Addressing…
suppose that our hash function converts a 2-digit integer into a single digit by taking the least-significant digit
[Figure: animation of linear probing in a 10-cell table. Starting from 70, 81 and 97, the keys 60, 51, 38, 89, 68 and 24 are inserted, giving:]
Index: 0   1   2   3   4   5   6   7   8   9
Key:   70  81  60  51  68  24  -   97  38  89
[Later frames: the table with 89 removed and 68 placed after 38, then with 60 removed]
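The insertions in this example can be reproduced with a short sketch of linear probing in C. It assumes a 10-cell table, non-negative keys, the least-significant-digit hash, and -1 as an "empty" marker; all names are hypothetical and deletion is not handled.

```c
#include <assert.h>

#define LP_SIZE  10      /* assumed table size, as in the example */
#define LP_EMPTY (-1)    /* marker for an unused cell; keys are non-negative */

/* Mark every cell as empty. */
void lp_init(int table[LP_SIZE])
{
    for (int i = 0; i < LP_SIZE; i++)
        table[i] = LP_EMPTY;
}

/* i-th probe: (Hash(key) + F(i)) mod TableSize with F(i) = i,
   where Hash(key) is the least-significant digit of key. */
int lp_probe(int key, int i)
{
    return (key % 10 + i) % LP_SIZE;
}

/* Insert key; return the cell used, or -1 if the table is full. */
int lp_insert(int table[LP_SIZE], int key)
{
    assert(key >= 0);
    for (int i = 0; i < LP_SIZE; i++) {
        int cell = lp_probe(key, i);
        if (table[cell] == LP_EMPTY || table[cell] == key) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}

/* Return the cell holding key, or -1 if key is absent. */
int lp_find(const int table[LP_SIZE], int key)
{
    for (int i = 0; i < LP_SIZE; i++) {
        int cell = lp_probe(key, i);
        if (table[cell] == LP_EMPTY)
            return -1;              /* an empty cell ends the probe sequence */
        if (table[cell] == key)
            return cell;
    }
    return -1;
}
```

Inserting 70, 81, 97, 60, 51, 38, 89, 68 and 24 in that order places 60 in cell 2, 51 in cell 3, 68 in cell 4 (after wrapping around from cell 8) and 24 in cell 5, leaving only cell 6 empty.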
[Figure: four snapshots of a 25-cell table (indices 0 to 24) holding 5, 13, 18 and 21 keys]
First snapshot, in cell order: 76 81 63 70 97
Last snapshot, in cell order: 74 76 49 55 81 57 7 9 85 60 85 63 38 61 16 68 70 21 97 73 72
Open Addressing…
Asymptotic Performance…
The number of probes for an unsuccessful search or for an insertion is higher:
Number of probes \approx \frac{1}{2} \left( 1 + \frac{1}{(1-\lambda)^2} \right)
if λ = 0.75, 8.5 probes are expected
if λ = 0.9, about 50 probes are expected, which is unreasonable
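The probe estimate can be evaluated for any load factor with a small helper. This is a sketch under the assumption that 0 <= λ < 1; the function name is hypothetical.

```c
#include <assert.h>

/* Expected probes for an unsuccessful search or an insertion under
   linear probing: (1/2) * (1 + 1 / (1 - lambda)^2), for 0 <= lambda < 1. */
double lp_probes_unsuccessful(double lambda)
{
    assert(lambda >= 0.0 && lambda < 1.0);
    double gap = 1.0 - lambda;          /* fraction of cells still empty */
    return 0.5 * (1.0 + 1.0 / (gap * gap));
}
```

At λ = 0.75 the formula gives 8.5 probes, and at λ = 0.9 it gives 50.5, which is the figure of roughly 50 quoted above.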
Open Addressing…
The following plot shows how the number of required probes increases