Hashing Part1


EE 232 Data Structures & Algorithms
Session 18

Chapter 5
Hashing

Hashing 5-1
Motivation
• Suppose we want to search for a particular item in a database of 20,000,000 data items
  • How long would a successful search take?
  • How long would an unsuccessful search take?
• It depends on the data structure

Hashing 5-2
Motivation…
• If the data structure is a linked list,
  • the search time is O(N)
• If the data structure is a binary search tree,
  • the estimated running time is O(log N)
  • log₂ 20,000,000 ≈ 24
• Can we do even better than O(log N)?
  • the hash table ADT

Hashing 5-3
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing

Hashing 5-4
Chapter 5: Hashing
Our goals: We will
• See several methods of implementing the hash table
• Compare these methods analytically
• Show numerous applications of hashing
• Compare hash tables with binary search trees

Hashing 5-5
First some terminology
• The hash table ADT is a data structure that supports only a subset of the operations allowed by binary search trees
• The implementation of a hash table is called hashing
• Hashing is a technique for performing insertions, deletions, and finds in constant average time

Hashing 5-6
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing

Hashing 5-7
General Idea
• The general idea behind hashing is to directly map each data item into an address in memory using some function
  • key → hash function → index into an array
• Components of hashing
  • A hash table is an array of some fixed size m
  • A hash function h(k) maps a search key k to some location in the range [0 … m-1]
    h(k): S → {0, 1, …, m-1}, where S is the set of possible keys

Hashing 5-8
General Idea…
Data item:
  Name: Salman Arain
  University: NFC-IEFR
  Office: First Floor
  Mobile Number:
  Email:
  etc.

h(Arain) = 1, so the data item is stored at index 1 of the array (indices 0, 1, 2, …, m-1)

Here we are using a hash function that accepts my last name as the key and returns 1
Hashing 5-9
General Idea…
• Desired properties of h(k)
  • simple to compute
  • uniform distribution of keys over {0, 1, …, m-1}
• When h(k1) = h(k2) for two distinct keys k1 and k2, we have a collision

Copyright © Kashif Javed Hashing 5-10


General Idea…
Data item 1:
  Name: Salman Arain
  University: NFC-IEFR
  Office: First Floor
  Mobile Number:
  Email:
  etc.
  h(Arain) = 1

Data item 2:
  Name: Hassan Hamid
  University: UET
  Office: room 8 EED
  Mobile Number:
  Email:
  etc.
  h(Hamid) = 1

Both data items map to index 1 of the array (indices 0, 1, 2, …, m-1): a collision has occurred

Hashing 5-11
General Idea…
• Two important topics in hashing
  • How to select a hash function
  • How to resolve collisions

Hashing 5-12
General Idea…
• Hashing revisited
  • A hash table data structure is an array
  • Each data element contains a key
  • Each key is mapped to some number in the range from 0 to TableSize-1 with the help of a hash function
    • The hash function should be efficient to compute and should, as far as possible, map different data items to different numbers
  • The key and the hash function are used both to insert the data into the table and to find that data later
Hashing 5-13
General Idea…
• Example
  • PTCL is a large telephone company that wants to maintain a database providing caller ID capability
    • given a phone number, return the caller's name
    • phone numbers range from 0 to r = 10^7 - 1
    • we want to do this as efficiently as possible

Hashing 5-14
General Idea…
• Solution 1
  • an array indexed by key
    • takes O(1) time,
    • O(r) space: a huge amount of wasted space

  Name:    Umer Hamid   (null)    Hassan Hamid   (null)    (null)
  Number:  6829227      0000000   6829229        0000000   0000000

Hashing 5-15
General Idea…
• Solution 2
  • Linked list
    • takes O(n) time, where n is the number of stored entries,
    • O(n) space (only as much space as is needed)

  [Umer Hamid, 6829227] → [Hassan Hamid, 6829229]

Hashing 5-16
General Idea…
• Solution 3
  • Hash table
    • O(1) expected time, O(n+m) space, where n is the number of stored entries and m is the table size
  • Like an array, but with a function that maps the large key range into one we can manage
    • e.g. take the original key modulo the (relatively small) size of the array, and use that as an index
    • 6829229 mod 5 = 4

  Index:  0        1        2        3        4
  Entry:  (null)   (null)   (null)   (null)   Hassan Hamid

Hashing 5-17
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing

Hashing 5-18
Hash Function
• A simple hash function
  • If the input keys k are integers,
  • hash function h(k) = k mod m, where m is the table size
• Example
  • Suppose m = 10 and k = 10, 20, 30, 40
  • h(k) = 0, 0, 0, 0
  • A bad choice if the keys end in zeros

Hashing 5-19
Hash Function…
• Another simple hash function
  • If the input keys k are integers,
  • hash function h(k) = k mod m, where m is the table size and is a prime number
• Example
  • Suppose m = 11 and k = 10, 20, 30, 40
  • h(k) = 10, 9, 8, 7
  • Distributes the keys more uniformly
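
The two examples above can be checked with a short C sketch. The helper names `hash_mod` and `count_in_slot` are hypothetical and not part of the lecture code; keys are assumed to be non-negative integers.

```c
#include <assert.h>

/* Division-method hash from the slides: h(k) = k mod m. */
unsigned int hash_mod(unsigned int k, unsigned int m)
{
    assert(m > 0);              /* the table size must be positive */
    return k % m;
}

/* Counts how many of the n keys hash to `slot` in a table of size m. */
int count_in_slot(const unsigned int *keys, int n,
                  unsigned int m, unsigned int slot)
{
    int count = 0;

    for (int i = 0; i < n; i++)
        if (hash_mod(keys[i], m) == slot)
            count++;
    return count;
}
```

With the keys 10, 20, 30, 40, `count_in_slot(keys, 4, 10, 0)` finds all four keys in slot 0, while with m = 11 the same keys land in four different slots (10, 9, 8, 7).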

Hashing 5-20
Hash Function…
• A simple hash function
  • If the keys are strings, the hash function can be some function of the characters in the strings
  • One possibility is to simply add the ASCII values of the characters:
    $h(str) = \left( \sum_{i=0}^{length-1} str[i] \right) \bmod m$
• Example
  • h(ABC) = (65 + 66 + 67) % m

Hashing 5-21
Hash Function…
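• Programming details

typedef unsigned int Index;

/* Additive hash: sum the character codes of Key, then reduce modulo TableSize */
Index
Hash1( const char *Key, int TableSize )
{
    unsigned int HashVal = 0;

    while( *Key != '\0' )
        HashVal += (unsigned char) *Key++;   /* cast avoids negative values where char is signed */
    return HashVal % TableSize;
}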

Hashing 5-22
Hash Function…
• Problem
  • If the table size is large, the function does not distribute the keys well
  • TableSize = 10,007 (a prime number)
  • Keys are at most 8 characters long
  • Each ASCII character fits in 7 bits, so the highest value it can have is 2^7 - 1 = 127
  • The hash function therefore has range 0 to 127 × 8 = 1016
  • The table has about 10,000 slots, but only roughly the first 1,000 are ever used
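
A small sketch makes the limited range concrete. `sum_chars` is a hypothetical helper that computes the additive hash before the final modulo; it assumes 7-bit ASCII keys.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the character codes of key (the additive hash before "% TableSize"). */
unsigned int sum_chars(const char *key)
{
    unsigned int sum = 0;

    assert(key != NULL);
    while (*key != '\0')
        sum += (unsigned char)*key++;
    return sum;
}
```

For an 8-character key of lowercase letters the sum is at most 8 × 122 = 976 (the code of 'z' is 122), so with TableSize = 10,007 the modulo never changes the value and the upper part of the table stays empty.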
Hashing 5-23
Hash Function…
• Another hash function
  • If the keys are strings,
  • convert the string into a number in some arbitrary base b:
    $h(str) = \left( \sum_{i=0}^{length-1} str[i] \cdot b^{i} \right) \bmod m$
• Example
  • h(ABC) = (65·b^0 + 66·b^1 + 67·b^2) % m
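
One way to evaluate this polynomial without computing powers of b explicitly is Horner's rule. The sketch below is an illustration, not the lecture's code: the name `hash_poly` and the choice to reduce modulo m at every step (to keep intermediate values small) are assumptions.

```c
#include <assert.h>
#include <string.h>

/* h(str) = (sum over i of str[i] * b^i) mod m.
   Horner's rule, applied from the last character to the first,
   so that str[0] ends up multiplied by b^0. */
unsigned int hash_poly(const char *str, unsigned int b, unsigned int m)
{
    unsigned long long h = 0;

    assert(m > 0);
    for (size_t i = strlen(str); i > 0; i--)
        h = (h * b + (unsigned char)str[i - 1]) % m;
    return (unsigned int)h;
}
```

With b = 27 and m = 10,007, `hash_poly("ABC", 27, 10007)` evaluates (65 + 66·27 + 67·729) mod 10007 = 50690 mod 10007 = 655.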

Hashing 5-24
Hash Function…
• Examines the first three characters of the input
• The value 27 represents the number of letters in the English alphabet, plus the blank (729 = 27^2)

/* Assumes Key has at least three characters */
Index
Hash2( const char *Key, int TableSize )
{
    return ( Key[ 0 ] + 27 * Key[ 1 ] + 729 * Key[ 2 ] ) % TableSize;
}

Hashing 5-25
Hash Function…
• Rule of thumb
  • Hash functions should try to achieve uniform, full coverage of the hash table while minimizing collisions
  • Since this is usually impossible and collisions will almost always occur, an important design consideration is how collisions are resolved

Hashing 5-26
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing

Hashing 5-27
Separate Chaining
• How do we deal with two keys that hash to the same spot in the array?
• Use chaining
  • All data items that hash to the same number are kept in a linked list
    • Set up an array of lists indexed by hash value; each list holds the items whose keys hash to that index

Hashing 5-28
Separate Chaining…
• Example

  One slot of the array (indices 0, 1, 2, …, m-1) points to a linked list holding both entries:

  [Name: Kashif Javed, University: UET, Office: room 4 EED, Mobile Number:, Email:, etc.]
    → [Name: Hassan Hamid, University: UET, Office: room 8 EED, Mobile Number:, Email:, etc.]

  The two entries are now stored in a linked list

Hashing 5-29
Separate Chaining…
• Example
  • Here the size of the hash table = 10
  • The keys are the first ten perfect squares: 0, 1, 4, 9, 16, 25, 36, 49, 64, and 81
  • The hash function is h(k) = k mod 10

[Figure: A separate chaining hash table]


Hashing 5-30
Separate Chaining…
• To find an element
  • use the hash function to look up its position in the table
  • search for the element in the linked list of the hashed slot
• To insert an element
  • compute h(k) to determine which list to traverse
  • if T[h(k)] contains a null pointer, initialize this entry to point to a linked list that contains k alone
  • if T[h(k)] is a non-empty list, add k at the beginning of this list

Hashing 5-31
Separate Chaining…
• To delete an element
  • compute h(k), then search for k within the list at T[h(k)]
  • delete k if it is found
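
The lecture code covers Find and Insert but not deletion, so here is a minimal sketch of the delete step. Everything in it is an assumption made for illustration: integer keys, the hash h(k) = k mod TableSize, and the hypothetical names `ChainTable`, `chain_create`, `chain_insert`, `chain_contains` and `chain_delete`. Like the lecture's lists, each chain starts with a header node, which makes unlinking the first real node no different from unlinking any other.

```c
#include <assert.h>
#include <stdlib.h>

typedef int ElementType;            /* assumption: integer keys */

struct Node {
    ElementType Element;
    struct Node *Next;
};

/* Hypothetical minimal table: one header node per slot. */
struct ChainTable {
    int TableSize;
    struct Node *Headers;           /* Headers[i].Next is the first real node of chain i */
};

static int slot_of(ElementType key, int table_size)
{
    return (int)((unsigned int)key % (unsigned int)table_size);
}

struct ChainTable *chain_create(int table_size)
{
    struct ChainTable *t;

    assert(table_size > 0);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->Headers = malloc(sizeof *t->Headers * (size_t)table_size);
    if (t->Headers == NULL) {
        free(t);
        return NULL;
    }
    t->TableSize = table_size;
    for (int i = 0; i < table_size; i++)
        t->Headers[i].Next = NULL;
    return t;
}

/* Returns 1 if key is stored in the table, 0 otherwise. */
int chain_contains(const struct ChainTable *t, ElementType key)
{
    const struct Node *p = t->Headers[slot_of(key, t->TableSize)].Next;

    while (p != NULL && p->Element != key)
        p = p->Next;
    return p != NULL;
}

/* Inserts key at the front of its chain; returns 0 only if allocation fails. */
int chain_insert(struct ChainTable *t, ElementType key)
{
    struct Node *head, *cell;

    if (chain_contains(t, key))
        return 1;                   /* already present, nothing to do */
    cell = malloc(sizeof *cell);
    if (cell == NULL)
        return 0;
    head = &t->Headers[slot_of(key, t->TableSize)];
    cell->Element = key;
    cell->Next = head->Next;
    head->Next = cell;
    return 1;
}

/* Deletes key: compute h(k), search the chain, unlink the node if found.
   Returns 1 if a node was removed, 0 if key was not in the table. */
int chain_delete(struct ChainTable *t, ElementType key)
{
    struct Node *prev = &t->Headers[slot_of(key, t->TableSize)];
    struct Node *victim;

    while (prev->Next != NULL && prev->Next->Element != key)
        prev = prev->Next;
    if (prev->Next == NULL)
        return 0;
    victim = prev->Next;
    prev->Next = victim->Next;
    free(victim);
    return 1;
}
```

Walking the chain with a `prev` pointer, rather than a pointer to the matching node itself, is what allows the node to be unlinked in a singly linked list.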

Hashing 5-32
Separate Chaining…
• Programming details

#ifndef _HashSep_H
#define _HashSep_H

typedef int ElementType;   /* assumed here; the slides do not show this typedef */

struct ListNode;
typedef struct ListNode *Position;
struct HashTbl;
typedef struct HashTbl *HashTable;

HashTable InitializeTable( int TableSize );
void DestroyTable( HashTable H );
Position Find( ElementType Key, HashTable H );
void Insert( ElementType Key, HashTable H );
ElementType Retrieve( Position P );

#endif /* _HashSep_H */

/* Node definition (belongs in the implementation file) */
struct ListNode
{
    ElementType Element;
    Position Next;
};
Hashing 5-33
Separate Chaining…
• Programming details

typedef Position List;

/* List *TheLists will be an array of lists, allocated later. */
/* The lists use headers (for simplicity). */

struct HashTbl
{
    int TableSize;
    List *TheLists;
};

Hashing 5-34
Separate Chaining…
• Programming details

HashTable
InitializeTable( int TableSize )
{
    HashTable H;
    int i;

    if( TableSize < MinTableSize )
    {
        Error( "Table size too small" );
        return NULL;
    }

    /* Allocate table */
    H = malloc( sizeof( struct HashTbl ) );
    if( H == NULL )
        FatalError( "Out of space!!!" );

    /* Round the size up to a prime so that keys spread more uniformly */
    H->TableSize = NextPrime( TableSize );

Hashing 5-35
Separate Chaining…
• Programming details (InitializeTable, continued)

    /* Allocate array of lists */
    H->TheLists = malloc( sizeof( List ) * H->TableSize );
    if( H->TheLists == NULL )
        FatalError( "Out of space!!!" );

    /* Allocate list headers */
    for( i = 0; i < H->TableSize; i++ )
    {
        H->TheLists[ i ] = malloc( sizeof( struct ListNode ) );
        if( H->TheLists[ i ] == NULL )
            FatalError( "Out of space!!!" );
        else
            H->TheLists[ i ]->Next = NULL;
    }

    return H;
}
Hashing 5-36
Separate Chaining…
• Programming details

Position
Find( ElementType Key, HashTable H )
{
    Position P;
    List L;

    L = H->TheLists[ Hash( Key, H->TableSize ) ];   /* header of the list for Key */
    P = L->Next;                                    /* first real node */
    while( P != NULL && P->Element != Key )
        P = P->Next;
    return P;                                       /* NULL if Key is not in the table */
}

Hashing 5-37
Separate Chaining…
• Programming details

void
Insert( ElementType Key, HashTable H )
{
    Position Pos, NewCell;
    List L;

    Pos = Find( Key, H );
    if( Pos == NULL )   /* Key is not found */
    {
        NewCell = malloc( sizeof( struct ListNode ) );
        if( NewCell == NULL )
            FatalError( "Out of space!!!" );
        else
        {
            /* Link the new cell in right after the header (front of the list) */
            L = H->TheLists[ Hash( Key, H->TableSize ) ];
            NewCell->Next = L->Next;
            NewCell->Element = Key;
            L->Next = NewCell;
        }
    }
}
Hashing 5-38
Separate Chaining…
• Analysing the performance of a separate chaining hash table
  • As we increase the number of elements N in the hash table, more and more items are stored in the linked lists, slowing everything down
  • Increasing the table size TableSize lets the table hold more data efficiently
  • It turns out that the ratio λ = N / TableSize is the important quantity to analyze
  • This is called the load factor

Hashing 5-39
Separate Chaining…
• Analysing the performance of a separate chaining hash table…
  • Time to perform a search = the constant time required to evaluate the hash function + the time to traverse the list
  • Note that, for separate chaining, the average length of a linked list is λ
  • Thus, an unsuccessful search traverses λ links on average
  • A successful search requires that about 1 + (λ/2) links be traversed

Hashing 5-40
Separate Chaining…

• Analysing the performance of a separate chaining hash table…
  • Thus, lowering the load factor is a good thing from the time point of view
  • From the space point of view, lowering the load factor means increasing the table size
  • This can lead to largely wasted space
  • A reasonable compromise is λ ≈ 1
    • search times will then be roughly O(1)
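
The estimates from this analysis are simple enough to tabulate. The sketch below only restates the formulas from the slides in C; the function names are hypothetical.

```c
#include <assert.h>

/* Load factor: number of stored elements divided by the table size. */
double load_factor(int n, int table_size)
{
    assert(table_size > 0);
    return (double)n / (double)table_size;
}

/* Average links traversed by an unsuccessful search: the average list length. */
double links_unsuccessful(double lambda)
{
    return lambda;
}

/* Approximate links traversed by a successful search: 1 + lambda / 2. */
double links_successful(double lambda)
{
    return 1.0 + lambda / 2.0;
}
```

At the suggested compromise λ = 1, an unsuccessful search follows about 1 link and a successful search about 1.5, independent of the number of stored elements.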

Hashing 5-41
Chapter 5: Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing

Hashing 5-42
Open Addressing
• Separate chaining has the disadvantage of using linked lists, which slows the algorithm because of the time required to allocate new cells
• Open addressing
  • relocate the key k to be inserted if it collides with an existing key
    • that is, we store k at an entry different from T[h(k)]

Hashing 5-43
Open Addressing…
• Open addressing hashing resolves collisions by trying alternative slots in the hash table until an empty cell is found
  • cells h_0(X), h_1(X), h_2(X), … are tried in succession, where h_i(X) = (Hash(X) + F(i)) mod TableSize, with F(0) = 0
  • The function F is the collision resolution strategy

Hashing 5-44
Open Addressing…
• Linear probing
  • F(i) is a linear function of i, i.e. F(i) = i
    h_0(X) = (Hash(X) + 0) mod TableSize
    h_1(X) = (Hash(X) + 1) mod TableSize
    h_2(X) = (Hash(X) + 2) mod TableSize
    …
  • cells are probed sequentially (with wraparound) in search of an empty cell

Hashing 5-45
Open Addressing…
• *Example
  • suppose that our hash function converts a 2-digit integer into a single digit by taking the least-significant digit

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-46


Open Addressing…
• *Insertions
  • Insert the numbers 81, 70, 97, 60, 51, 38, 89, 68, 24 into the initially empty hash table:

   0   1   2   3   4   5   6   7   8   9
   -   -   -   -   -   -   -   -   -   -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-47


Open Addressing…
• *Insertions…
  • We can easily insert 81, 70, and 97 into their corresponding bins:

   0   1   2   3   4   5   6   7   8   9
  70  81   -   -   -   -   -  97   -   -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-48


Open Addressing…
• *Insertions…
  • Inserting 60 causes a collision in bin 0; therefore, we check:
    • bin 1 (also full), and
    • bin 2 (empty)

   0   1   2   3   4   5   6   7   8   9
  70  81  60   -   -   -   -  97   -   -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-49


Open Addressing…
• *Insertions…
  • Inserting 51 also causes a collision, this time in bin 1; therefore, we check:
    • bin 2 (also full), and
    • bin 3 (empty)

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51   -   -   -  97   -   -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-50


Open Addressing…
• *Insertions…
  • 38 and 89 can be placed into bins 8 and 9, respectively, without collisions

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51   -   -   -  97  38  89

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-51


Open Addressing…
• *Insertions…
  • Inserting 68 causes a collision in bin 8, and therefore we check bins:
    • 9, 0, 1, 2, 3, and finally 4, which is empty
    • insert 68 into bin 4

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68   -   -  97  38  89

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-52


Open Addressing…
• *Insertions…
  • Inserting 24 causes a collision in bin 4; however, the next bin is empty

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68  24   -  97  38  89
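
The insertion sequence above can be reproduced with a few lines of C. This is a sketch under stated assumptions: keys are non-negative, so -1 can mark an empty bin, and the names `lp_init` and `lp_insert` are hypothetical.

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY (-1)      /* assumption: keys are non-negative, so -1 marks a free bin */

/* Marks every bin as empty. */
void lp_init(int table[TABLE_SIZE])
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Inserts key using linear probing, F(i) = i, with wraparound.
   Returns the bin used, or -1 if the table is full. */
int lp_insert(int table[TABLE_SIZE], int key)
{
    int home;

    assert(key >= 0);
    home = key % TABLE_SIZE;            /* least-significant digit */
    for (int i = 0; i < TABLE_SIZE; i++) {
        int bin = (home + i) % TABLE_SIZE;
        if (table[bin] == EMPTY) {
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}
```

Inserting 81, 70, 97, 60, 51, 38, 89, 68, 24 in that order places them in bins 1, 0, 7, 2, 3, 8, 9, 4, 5, matching the tables on the slides.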

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-53


Open Addressing…
• *Searching
  • Testing for membership is similar to insertion
  • Start at the appropriate bin, and continue searching forward until either:
    • the item is found, or
    • an empty bin is found

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-54


Open Addressing…
• *Searching…
  • Searching for 68, we first examine bin 8, then 9, 0, 1, 2, 3, and 4, finding 68 in bin 4
  • Searching for 23, we search bins 3, 4, 5, and find that bin 6 is empty, so 23 is not in the table

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68  24   -  97  38  89
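
The two searches can be traced with a short sketch. As before, -1 is assumed to mark an empty bin and keys are assumed non-negative; `lp_find` and its optional `probes` output are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 10
#define EMPTY (-1)      /* assumption: keys are non-negative */

/* Linear-probing search. Returns the bin holding key, or -1 if key is absent.
   If probes is not NULL, it receives the number of bins examined. */
int lp_find(const int table[TABLE_SIZE], int key, int *probes)
{
    int home, found = -1, examined = 0;

    assert(key >= 0);
    home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; i++) {
        int bin = (home + i) % TABLE_SIZE;
        examined++;
        if (table[bin] == EMPTY)
            break;                      /* an empty bin ends the search */
        if (table[bin] == key) {
            found = bin;
            break;
        }
    }
    if (probes != NULL)
        *probes = examined;
    return found;
}
```

On the table shown above, searching for 68 examines seven bins (8, 9, 0, 1, 2, 3, 4) before succeeding, and searching for 23 examines four bins (3, 4, 5, 6) before giving up.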

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-55


Open Addressing…
• *Removing
  • We cannot simply remove elements from the hash table
  • For example, if we delete 89 by just emptying its bin, we can no longer find 68: the search for 68 starts at bin 8 and stops at the empty bin 9

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68  24   -  97  38  89

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-56


Open Addressing…
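• *Removing…
  • However, we also cannot simply move all following entries up to fill the gap
  • Moving 70 to bin 9 would make it impossible to find 70, because the search for 70 starts at bin 0

  Before:
   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68  24   -  97  38  89

  After removing 89 and moving every following entry up by one:
   0   1   2   3   4   5   6   7   8   9
  81  60  51  68  24   -   -  97  38  70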

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-57


Open Addressing…
• *Removing…
  • Instead, we must probe forward, moving an element into the gap only if that does not place it before the bin its hash value selects
  • For example, we remove 89

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  68  24   -  97  38   -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-58


Open Addressing…
• *Removing…
  • We probe forward until we find an entry which can be moved into bin 9
  • We cannot move 70, 81, 60, or 51, but we can move 68

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51   -  24   -  97  38  68

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-59


Open Addressing…
• *Removing…
  • Next, we probe forward again, and note that 24 can be moved into the vacated bin 4
  • The next bin is already empty, and therefore we are finished

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  24   -   -  97  38  68

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-60


Open Addressing…
• *Removing…
  • Suppose we now remove 60

   0   1   2   3   4   5   6   7   8   9
  70  81  60  51  24   -   -  97  38  68

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-61


Open Addressing…
• *Removing…
  • We find 60 in bin 2, and therefore we remove it
  • We search forward and find that we can move 51 into bin 2

   0   1   2   3   4   5   6   7   8   9
  70  81  51   -  24   -   -  97  38  68

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-62


Open Addressing…
• *Removing…
  • We cannot move 24 into bin 3, because that is before the bin its hash value selects (bin 4)
  • The next bin (5) is empty, therefore we are finished

   0   1   2   3   4   5   6   7   8   9
  70  81  51   -  24   -   -  97  38  68
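
The removal procedure walked through above can be sketched in C as follows. This is one possible rendering, not the lecture's code: -1 is assumed to mark an empty bin, keys are assumed non-negative, and `lp_remove` is a hypothetical name. The test "may this entry move into the hole?" is expressed with cyclic distances so that wraparound is handled.

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY (-1)      /* assumption: keys are non-negative */

/* Removes key from a linear-probing table and closes the gap it leaves.
   Returns 1 if the key was removed, 0 if it was not in the table. */
int lp_remove(int table[TABLE_SIZE], int key)
{
    int hole = -1, j;

    assert(key >= 0);

    /* Step 1: locate the key with an ordinary linear-probing search. */
    for (int i = 0; i < TABLE_SIZE; i++) {
        int bin = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[bin] == EMPTY)
            return 0;
        if (table[bin] == key) {
            hole = bin;
            break;
        }
    }
    if (hole < 0)
        return 0;

    /* Step 2: probe forward. An entry in bin j may move into the hole only
       if the hole is not before its home bin, i.e. the cyclic distance from
       its home bin to j is at least the distance from the hole to j. */
    j = hole;
    for (int step = 1; step < TABLE_SIZE; step++) {
        int from_home, from_hole;

        j = (j + 1) % TABLE_SIZE;
        if (table[j] == EMPTY)
            break;                      /* end of the cluster */
        from_home = (j - table[j] % TABLE_SIZE + TABLE_SIZE) % TABLE_SIZE;
        from_hole = (j - hole + TABLE_SIZE) % TABLE_SIZE;
        if (from_home >= from_hole) {
            table[hole] = table[j];     /* move the entry back into the hole */
            hole = j;                   /* its old bin becomes the new hole */
        }
    }
    table[hole] = EMPTY;
    return 1;
}
```

Applied to the example, removing 89 moves 68 into bin 9 and then 24 into bin 4; removing 60 afterwards moves 51 into bin 2 and leaves 24 where it is.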

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-63


Open Addressing…
• *Primary clustering
  • We have already observed the following phenomenon:
    • as we insert more elements into the hash table, the contiguous regions get larger
    • any key that hashes into the cluster will require several attempts to resolve the collision
  • This results in longer search times

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-64


Open Addressing…
• *Primary clustering…
  • Consider inserting the following entries: 81, 70, 97, 63, 76, 38, 85, 68, 21, 9, 55, 73, 57, 60, 72, 74, 85, 16, 61, 7, 49
  • Use the number modulo 25 to determine which bin it should occupy
  • The first five don't cause any collisions

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  - 76  -  -  -  - 81  -  -  -  -  -  - 63  -  -  -  -  -  - 70  - 97  -  -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-65


Open Addressing…
• *Primary clustering…
  • Inserting 38 causes a collision in bin 13
  • The next seven do not cause any further collisions

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  - 76  -  -  - 55 81 57  -  9 85  -  - 63 38  -  -  - 68  - 70 21 97 73  -

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-66


Open Addressing…
• *Primary clustering…
  • The next four insertions cause collisions:
    • 60 (bin 10)
    • 72 (bin 22)
    • 74 (bin 24)
    • 85 (bin 10)
  • We can safely insert 16 into bin 16

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 74 76  -  -  - 55 81 57  -  9 85 60 85 63 38  - 16  - 68  - 70 21 97 73 72

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-67


Open Addressing…
• *Primary clustering…
  • The remaining insertions all cause collisions:
    • 61 (bin 11)
    • 7 (bin 7)
    • 49 (bin 24)

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 74 76 49  -  - 55 81 57  7  9 85 60 85 63 38 61 16  - 68  - 70 21 97 73 72
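
To put a number on the clustering in this example, the sketch below inserts keys with linear probing and then measures the longest run of occupied bins, treating the table as circular. The names `lp_insert_n` and `longest_cluster` are hypothetical, and -1 is again assumed to mark an empty bin.

```c
#include <assert.h>

#define EMPTY (-1)      /* assumption: keys are non-negative */

/* Linear-probing insert into a table of n bins with h(k) = k mod n.
   Returns the bin used, or -1 if the table is full. */
int lp_insert_n(int *table, int n, int key)
{
    assert(n > 0 && key >= 0);
    for (int i = 0; i < n; i++) {
        int bin = (key % n + i) % n;
        if (table[bin] == EMPTY) {
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}

/* Length of the longest run of occupied bins, with wraparound. */
int longest_cluster(const int *table, int n)
{
    int start = -1, best = 0, run = 0;

    /* Start counting just after some empty bin so that no run is split. */
    for (int i = 0; i < n; i++) {
        if (table[i] == EMPTY) {
            start = i;
            break;
        }
    }
    if (start < 0)
        return n;                       /* no empty bin: one cluster of size n */
    for (int i = 1; i <= n; i++) {
        int bin = (start + i) % n;
        if (table[bin] != EMPTY) {
            run++;
            if (run > best)
                best = run;
        } else {
            run = 0;
        }
    }
    return best;
}
```

After all 21 insertions into the 25-bin table, bins 5 through 16 form one cluster of 12 occupied bins, and bins 20 through 24 together with bins 0 through 2 form a second cluster of 8 that wraps around the end of the array.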

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-68


Open Addressing…
• Asymptotic performance
  • Primary clustering affects the number of probes required to perform insertions, searches, or deletions
  • The average number of probes for a successful search can be estimated as
    Number of probes ≈ $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$
    where λ is the load factor: the fraction of the table that is in use

Hashing 5-69
Open Addressing…

• Asymptotic performance…
  • The number of probes for an unsuccessful search or for an insertion is higher:
    Number of probes ≈ $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$
    • if λ = 0.75, about 8.5 probes are expected
    • if λ = 0.9, about 50 probes are expected, which is unreasonable
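
Both estimates are easy to evaluate for other load factors. The sketch below simply codes the two formulas from the slides; the function names are hypothetical, and λ is assumed to lie in [0, 1).

```c
#include <assert.h>

/* Estimated probes for a successful search with linear probing:
   (1/2) * (1 + 1 / (1 - lambda)). */
double probes_successful(double lambda)
{
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

/* Estimated probes for an unsuccessful search or an insertion:
   (1/2) * (1 + 1 / (1 - lambda)^2). */
double probes_unsuccessful(double lambda)
{
    double free_fraction;

    assert(lambda >= 0.0 && lambda < 1.0);
    free_fraction = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (free_fraction * free_fraction));
}
```

For λ = 0.75 the formulas give 2.5 and 8.5 probes; for λ = 0.9 they give 5.5 and 50.5, which shows how quickly insertions degrade as the table fills.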

Hashing 5-70
Open Addressing…
• *The following plot shows how the number of required probes increases

[Plot not reproduced]

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-71


Open Addressing…
• *Primary clustering occurs with linear probing because every key follows the same linear probe pattern
  • if a bin is inside a cluster, then the next bin must either
    • also be in that cluster, or
    • expand the cluster
• Instead of searching forward in a linear fashion, consider searching forward using a quadratic function

*http://www.ece.uwaterloo.ca/~ece250/ Hashing 5-72


In Next Class

• Open addressing with quadratic probing
• Rehashing and extendible hashing

Hashing 5-73
