
Mekelle University Faculty of Business & Economics

Computer Science Department

ICT241: Data Structures and Algorithms

Handout 7 - Hashing

Handout Overview

This handout gives an introduction to the subject of hashing. Common hash
functions such as division, folding, mid-square function, extraction and radix
transformation are discussed. In addition, a number of collision resolution
techniques are described, such as open addressing, chaining and bucketing.

1. Hashing

All of the searching techniques we have seen so far operate by comparing the
value being searched for with the key value of each element. For
example, when searching for an integer val in a binary search tree, we compare
val with the integer (the key) stored at each node we visit. Such searching
techniques vary in their complexity, but will always be more than O(1).

Hashing is an alternative way of storing data that aims to greatly improve the
efficiency of search operations. With hashing, when adding a new data element,
the key itself is used to directly determine the location to store the element.
Therefore, when searching for a data element, instead of searching through a
sequence of key values to find the location of the data we want, the key value
itself can be used to directly determine the location in which the data is stored.
This means that the search time is reduced from O(n), as in sequential search, or
O(log n), as in binary search, to O(1), or constant complexity. Regardless of the
number of elements stored, the search time is the same.

The question is, how can we determine the position to store a data element using
only its key value? We need to find a function h that can transform a key value K
(e.g. an integer, a string, etc.) into an index into a table used for storing data. The
function h is called a hash function. If h transforms different keys into different
indices it is called a perfect hash function. (A non-perfect hash function may
transform two different key values into the same index.)

Consider the example of a compiler that needs to store the values of all program
variables. The key in this case is the name of the variable, and the data to be
stored is the variable’s value. What hash function could we use? One possibility
would be to add the ASCII codes of every letter in the variable name and use the
resulting integer to index a table of values. But in this case the two variables abc
and cba would have the same index. This problem is known as collision and will
be discussed later in this handout. The worth of a hash function depends to a
certain extent on how well it avoids collisions.

2. Hash Functions

Clearly there are a large number of potential hash functions. In fact, if we wish to
assign positions for n items in a table of size m, the number of potential hash
functions is m^n, and the number of perfect hash functions is m!/(m − n)!. Most of
these potential functions are not of practical use, so this section discusses a
number of popular types of hash function.

2.1. Division

A hash function must guarantee that the value of the index that it returns is a
valid index into the table used to store the data. In other words, it must be less
than the size of the table. Therefore an obvious way to accomplish this is to
perform a modulo (remainder) operation. If the key K is a number, and the
size of the table is TSize, the hash function is defined as h(K) = K mod TSize.
Division hash functions perform best if the value of TSize is a prime number.
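As a minimal illustration, a division hash function could be sketched in C++ as follows; the table size of 101 (a prime) is an assumption chosen for the example, not a value prescribed by the handout:

#include <iostream>

// Division method: the address is the key reduced modulo the table size.
// A prime table size tends to spread the keys more evenly over the table.
const int TSize = 101;   // assumed table size for this example

int hashDivision(int key) {
    return key % TSize;   // always in the range 0 .. TSize - 1
}

int main() {
    std::cout << hashDivision(123456789) << std::endl;
    return 0;
}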

2.2. Folding

Folding hash functions work by dividing the key into a number of parts. For
example, the key value 123456789 might be divided into three parts: 123, 456
and 789. Next these parts are combined together to produce the target
address. There are two ways in which this can be done: shift folding and
boundary folding.

In shift folding, the different parts of the key are left as they are, placed
underneath one another, and processed in some way. For example, the parts
123, 456 and 789 can be added to give the result 1368. To produce the target
address, this result can be divided modulo TSize.

In boundary folding, every other part of the key is reversed and the rest are left intact. In
the example given above, 123 is left intact, 456 is reversed to give 654, and
789 is left intact. So this time the numbers 123, 654 and 789 are summed to
give the result 1566. This result can be converted to the target address by
using the modulo operation.
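The following C++ sketch illustrates both shift folding and boundary folding for an integer key split into three-digit parts; the part size and the table size of 1000 are assumptions made for the example:

#include <iostream>

const int TSize = 1000;   // assumed table size for this example

// Reverse the decimal digits of a part (e.g. 456 becomes 654).
int reverseDigits(int part) {
    int result = 0;
    while (part > 0) {
        result = result * 10 + part % 10;
        part /= 10;
    }
    return result;
}

// Shift folding: split the key into 3-digit parts and add them as they are.
int shiftFold(long key) {
    int sum = 0;
    while (key > 0) {
        sum += key % 1000;   // take the next 3-digit part
        key /= 1000;
    }
    return sum % TSize;
}

// Boundary folding: reverse every second part before adding.
int boundaryFold(long key) {
    int sum = 0;
    bool reverse = false;   // parts are taken from the right; reverse alternate ones
    while (key > 0) {
        int part = key % 1000;
        sum += reverse ? reverseDigits(part) : part;
        reverse = !reverse;
        key /= 1000;
    }
    return sum % TSize;
}

int main() {
    std::cout << shiftFold(123456789L) << std::endl;     // (789 + 456 + 123) mod 1000 = 368
    std::cout << boundaryFold(123456789L) << std::endl;  // (789 + 654 + 123) mod 1000 = 566
    return 0;
}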

2.3. Mid-Square Function

In the mid-square method, the key is squared and the middle part of the result
is used as the address. For example, if the key is 2864, then the square of
2864 is 8202496, so we use 024 as the address, which is the middle part of
8202496. If the key is not a number, it can be pre-processed to convert it into
one.
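A possible C++ sketch of the mid-square method, assuming a four-digit key whose square has seven digits, as in the example above:

#include <iostream>

// Mid-square method: square the key and extract the middle digits of the result.
// Here we assume a 7-digit square and take the middle 3 digits as the address.
int hashMidSquare(long key) {
    long squared = key * key;        // e.g. 2864 * 2864 = 8202496
    return (squared / 100) % 1000;   // drop the last 2 digits, keep the next 3 -> 024
}

int main() {
    std::cout << hashMidSquare(2864) << std::endl;   // prints 24 (i.e. address 024)
    return 0;
}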

2.4. Extraction

In the extraction method, only a part of the key is used to generate the
address. For the key 123456789, this method might use the first four digits
(1234), or the last four (6789), or the first two and last two (1289). Extraction
methods can be satisfactory so long as the omitted portion of the key is not
significant in distinguishing the keys. For example, at Mekelle University
many student ID numbers begin with the letters “RDG”, so the first three
letters can be omitted and the following numbers used to generate the key
using one of the other hash function techniques.
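A small C++ sketch of extraction, taking the first two and last two digits of a nine-digit key as in the 123456789 example; the choice of digits is purely illustrative:

#include <iostream>

// Extraction: build the address from selected digits of the key only.
// For a 9-digit key such as 123456789 we take the first two and last two digits.
int hashExtract(long key) {
    int firstTwo = key / 10000000;   // 123456789 / 10^7 = 12
    int lastTwo  = key % 100;        // 123456789 mod 100 = 89
    return firstTwo * 100 + lastTwo; // combine to give 1289
}

int main() {
    std::cout << hashExtract(123456789L) << std::endl;   // prints 1289
    return 0;
}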

2.5. Radix Transformation

If TSize is 100, and a division technique is used to generate the target address,
then the keys 147 and 247 will produce the same address. Therefore this
would not be a perfect hash function. The radix transformation technique
attempts to avoid such collisions by changing the number base of the key
before generating the address. For example, if we convert the keys 147 and
247 (base 10) into base 9, we get 173 and 304. Therefore, after a modulo operation
the addresses used would be 73 and 04. Note, however, that radix
transformation does not completely avoid collisions: the two keys 147 and
66 (base 10) are converted to 173 and 73 in base 9, so they would both hash to
the same address, 73.
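A rough C++ sketch of radix transformation: the key is rewritten in base 9, the resulting digit string is read as a decimal number, and a modulo operation produces the address. The table size of 100 comes from the example above; the rest is an assumed implementation:

#include <iostream>

const int TSize = 100;   // table size from the example

// Radix transformation: write the key in base 9, read the resulting digit
// string as a decimal number, then reduce it modulo the table size.
int hashRadix(int key) {
    int digits = 0;      // base-9 digits of the key, assembled as a decimal number
    int place = 1;
    while (key > 0) {
        digits += (key % 9) * place;   // next base-9 digit
        place *= 10;
        key /= 9;
    }
    return digits % TSize;
}

int main() {
    std::cout << hashRadix(147) << std::endl;  // 147 = 173 (base 9), 173 mod 100 = 73
    std::cout << hashRadix(247) << std::endl;  // 247 = 304 (base 9), 304 mod 100 = 4
    std::cout << hashRadix(66)  << std::endl;  // 66 = 73 (base 9), also 73 -> collision
    return 0;
}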

3. Collision Resolution

If the hash function being used is not a perfect hash function (which is usually the
case), then the problem of collisions will arise. Collisions occur when two keys
hash to the same address. The chance of collisions occurring can be reduced by
choosing the right hash function, or by increasing the size of the table, but it can
never be completely eliminated. For this reason, any hashing system should adopt
a collision resolution strategy. This section examines some common strategies.

3.1. Open Addressing

In open addressing, if a collision occurs, an alternative address within the
table is found for the new data. If this address is also occupied, another
alternative is tried. The sequence of alternative addresses to try is known as
the probing sequence. In general terms, if position h(K) is occupied, the
probing sequence is

norm(h(K) + p(1)), norm(h(K) + p(2)), ..., norm(h(K) + p(i)), ...

where function p is the probing function and norm is a normalisation function
that ensures the address generated is within an acceptable range, for example
the modulo function.

The simplest method is linear probing. In this technique the probing sequence
is simply a series of consecutive addresses; in other words the probing
function p(i) = i. If one address is occupied, we try the next address in the
table, then the next, and so on. If the last address is occupied, we start again at
the beginning of the table. Linear probing has the advantage of simplicity, but
it has the tendency to produce clusters of data within the table. For example,
Figure 1 shows a sequence of insertions into a hash table using the following
key/value pairs:
Key   Value
15    A
2     B
33    C
5     D
19    E
22    F
9     G
32    H

The first three insertions (A, B and C) do not result in collisions. However,
when data D is inserted it hashes to the address 5, which is currently occupied
by A, so it is placed in the next address. Similarly, when data F is inserted at
address 2 it collides with B, so we try address 3 instead. Here it collides with
C, so we have to place it at address 4. Data G also collides with E at address
9, so because 9 is the last address in the table we place it at address 1. Finally
data H collides with 5 different elements before being successfully placed at
address 7.
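The following C++ sketch shows insertion and search with linear probing using the key/value pairs above. The table size of 10, the use of -1 to mark an empty slot, and wrap-around to address 0 are assumptions for the example, so the final layout may differ in detail from Figure 1:

#include <iostream>
#include <vector>

const int TSize = 10;    // assumed table size
const int EMPTY = -1;    // assumed marker for an unused slot

struct Entry { int key; char value; };

// Linear probing: if h(K) is occupied, try the next address, wrapping
// around to the start of the table when the end is reached.
class LinearHashTable {
public:
    LinearHashTable() : table(TSize, Entry{EMPTY, ' '}) {}

    bool insert(int key, char value) {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {
            int probe = (addr + i) % TSize;        // probing function p(i) = i
            if (table[probe].key == EMPTY) {
                table[probe] = Entry{key, value};
                return true;
            }
        }
        return false;                              // table is full
    }

    bool search(int key, char& value) const {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {
            int probe = (addr + i) % TSize;
            if (table[probe].key == EMPTY) return false;   // stop at the first empty slot
            if (table[probe].key == key) { value = table[probe].value; return true; }
        }
        return false;
    }

private:
    std::vector<Entry> table;
};

int main() {
    LinearHashTable t;
    int keys[]    = {15, 2, 33, 5, 19, 22, 9, 32};
    char values[] = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'};
    for (int i = 0; i < 8; ++i) t.insert(keys[i], values[i]);

    char v;
    if (t.search(32, v)) std::cout << "32 -> " << v << std::endl;   // prints 32 -> H
    return 0;
}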

Figure 1 – Collision resolution using linear probing.

We can see in Figure 1 that there is a cluster of 6 elements (from addresses 2
to 7) stored next to each other. The problem with clusters is that the
probability of a collision for a key is dependent on the address that it hashes
to. Clustering can be avoided by using a more careful choice of probing
function p. One possible choice is to use the sequence of addresses

h(K) + i², h(K) – i², for i = 1, 2, ... , (TSize – 1) / 2.

Including the original attempt to hash K, this formula results in the sequence
h(K), h(K) + 1, h(K) – 1, h(K) + 4, h(K) – 4, etc. All of these addresses
should be divided modulo TSize. For example, for the data H in Figure 1 (which hashes to address 2), we
first try address 2, then address 3 (2 + 1), and then address 1 (2 – 1), where
the data is successfully placed. This technique is known as quadratic probing.
Quadratic probing results in fewer clusters than linear probing, but because
the same probing sequence is used for every key, sometimes clusters can
build up away from the original address. These clusters are known as
secondary clusters.
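As a small illustration, the quadratic probing sequence for a key that hashes to address 2 (like H in Figure 1) could be generated as follows; the table size of 10 and the normalisation function are assumptions for the example:

#include <iostream>

const int TSize = 10;   // assumed table size

// Normalise an address into the range 0 .. TSize - 1, even if it is negative.
int norm(int addr) {
    return ((addr % TSize) + TSize) % TSize;
}

int main() {
    int home = 2;   // h(K) for a key that hashes to address 2
    std::cout << home;
    for (int i = 1; i <= (TSize - 1) / 2; ++i) {
        std::cout << " " << norm(home + i * i)    // h(K) + i^2
                  << " " << norm(home - i * i);   // h(K) - i^2
    }
    std::cout << std::endl;   // prints 2 3 1 6 8 1 3 8 6
    return 0;
}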

Another possibility, which avoids the problem of secondary clusters, is to use
a different probing sequence for each key. This can be achieved by using a
random number generator seeded by a value that is dependent on the key.
Remember that random number generators always require a seed value, and if
the same seed is used the same sequence of ‘random’ numbers will be
generated. So if, for example, the value of the key (if it is an integer), were to
be used, each different key would generate a different sequence of probes,
thus avoiding secondary clusters.

Another way to avoid secondary clusters is to use double hashing. Double
hashing uses two different hashing functions: one to find the primary position
of a key, and another for resolving conflicts. The idea is that if the primary
hashing function, h(K), hashes two keys K1 and K2 to the same address, then
the secondary hashing function, hp(K), will probably not. The probing
sequence is therefore

h(K), h(K) + hp(K), h(K) + 2·hp(K), ..., h(K) + i·hp(K), ...

Experiments indicate that double hashing generally eliminates secondary
clustering, but using a second hash function can be time-consuming.
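A hedged C++ sketch of a double-hashing probe sequence. The secondary function hp(K) = 1 + (K mod (TSize − 2)) is one common choice and is an assumption here, not a function prescribed by the handout:

#include <iostream>

const int TSize = 11;   // assumed (prime) table size

int h(int key)  { return key % TSize; }            // primary hash function
int hp(int key) { return 1 + key % (TSize - 2); }  // secondary hash function (never 0)

int main() {
    int key = 24;
    // Probing sequence: h(K), h(K) + hp(K), h(K) + 2*hp(K), ... (mod TSize)
    for (int i = 0; i < TSize; ++i)
        std::cout << (h(key) + i * hp(key)) % TSize << " ";
    std::cout << std::endl;   // for key 24: 2 9 5 1 8 4 0 7 3 10 6
    return 0;
}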

3.2. Chaining

In chaining, each address in the table refers to a list, or chain, of data values.
If a collision occurs the new data is simply added to the end of the chain. Figure 2
shows an example of using chaining for collision resolution.
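A minimal chaining sketch in C++, using a standard linked list at each address; the table size of 10 is an assumption for the example:

#include <iostream>
#include <list>
#include <vector>

const int TSize = 10;   // assumed table size

struct Entry { int key; char value; };

// Chaining: every table address holds a list of all entries that hash to it,
// so a collision is handled by simply appending to the list.
class ChainedHashTable {
public:
    ChainedHashTable() : table(TSize) {}

    void insert(int key, char value) {
        table[key % TSize].push_back(Entry{key, value});
    }

    bool search(int key, char& value) const {
        for (const Entry& e : table[key % TSize])
            if (e.key == key) { value = e.value; return true; }
        return false;
    }

private:
    std::vector<std::list<Entry>> table;
};

int main() {
    ChainedHashTable t;
    t.insert(15, 'A');
    t.insert(5, 'D');    // 15 and 5 both hash to address 5 and share one chain
    char v;
    if (t.search(5, v)) std::cout << "5 -> " << v << std::endl;   // prints 5 -> D
    return 0;
}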

Provided that the lists do not become very long, chaining is an efficient
technique. However, if there are many collisions the lists will become long
and retrieval performance can be severely degraded. Performance can be
improved by ordering the values in the list (so that an exhaustive search is not
necessary for unsuccessful searches) or by using self-organising lists.

An alternative version of chaining is called coalesced hashing, or coalesced
chaining. In this method, the link to the next value in the list actually points
to another table address. If a collision occurs, then a technique such as linear
probing is used to find an available address, and the data is placed there. In
addition, a link is placed at the original address indicating where the next data
element is stored. Figure 3 shows an example of this technique. When the
keys D5 and F2 collide in Figure 3b, linear probing is used to position the keys,
but links from their original hashed addresses are maintained. Variations on
coalesced hashing include always placing colliding keys at the end of the
table, or storing colliding keys in a special reserved area known as the cellar.
In both cases a link from the original hashed address will point to the new
location. The advantage of coalesced hashing is that it avoids the need to
make a sequential search through the table for the required data in the event
of collisions.

Figure 2 – Collision resolution using chaining.

Figure 3 – Collision resolution using coalesced hashing.

3.3. Bucket Addressing

Bucket addressing is similar to chaining, except that the data are stored in a
bucket at each table address. A bucket is a block of memory that can store a
number of items, but not an unlimited number as in the case of chaining.

Bucketing reduces the chance of collisions, but does not totally avoid them. If
the bucket becomes full, then an item hashed to it must be stored elsewhere.
Therefore bucketing is commonly combined with an open addressing
technique such as linear or quadratic probing. Figure 4 shows an example of
bucketing that uses a bucket size of 3 elements at each address.

Figure 4 – Collision resolution using bucketing.
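A rough C++ sketch of bucket addressing with a bucket size of 3, as in Figure 4, combined with linear probing when a bucket overflows; the table size and the overflow policy are assumptions for the example:

#include <iostream>
#include <vector>

const int TSize = 10;        // assumed number of bucket addresses
const int BucketSize = 3;    // bucket size used in Figure 4
const int EMPTY = -1;        // assumed marker for an unused slot

// Bucket addressing: each address holds a fixed-size bucket of slots.
// If a bucket is full, linear probing moves on to the next bucket.
class BucketHashTable {
public:
    BucketHashTable() : table(TSize, std::vector<int>(BucketSize, EMPTY)) {}

    bool insert(int key) {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {               // try each bucket in turn
            std::vector<int>& bucket = table[(addr + i) % TSize];
            for (int s = 0; s < BucketSize; ++s)
                if (bucket[s] == EMPTY) { bucket[s] = key; return true; }
        }
        return false;                                    // every bucket is full
    }

private:
    std::vector<std::vector<int>> table;
};

int main() {
    BucketHashTable t;
    int keys[] = {5, 15, 25, 35};   // all hash to address 5; 35 overflows the bucket
    for (int key : keys)
        std::cout << key << (t.insert(key) ? " stored" : " rejected") << std::endl;
    return 0;
}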

Summary of Key Points

The following points summarize the key concepts in this handout:

• Hashing is a data storage technique that aims to improve the efficiency of
search operations.
• Using hashing, a hash function h(K) is used to determine the address within a
table at which a key K will be stored.
• A perfect hash function is one that will generate different addresses for
different keys.
• If two keys hash to the same address a collision will occur.
• The simplest hash function is to use a modulo operation using the number of
addresses in the table as the divisor.
• Folding hash functions work by dividing the key into a number of parts and
then combining them to produce the target address.
• In shift folding the different parts of the key are left intact before being
combined.
• In boundary folding alternate parts of the key are reversed before combination.
• In the mid-square hash function, the key is squared and the middle part of the
result is used as the address.
• In the extraction method, only a part of the key is used to generate the address.
• In the radix transformation technique, the number base of the key is changed
to try to avoid collisions.
• Open addressing attempts to resolve collisions by finding an alternative
address at which to store collided keys.
• The probing sequence is the series of addresses tried by an open addressing
scheme.
• Linear probing uses a probing sequence consisting of consecutive addresses in
the table.
• Quadratic probing uses a probing sequence of the form h(K), h(K) + 1, h(K) –
1, h(K) + 4, h(K) – 4, etc.
• A cluster is a set of keys that are stored in addresses in the same part of the
table.
• A primary cluster occurs when many keys hash to the same (or similar)
primary address.
• A secondary cluster occurs when many keys hash to the same (or similar)
alternative address.
• Secondary clusters can be avoided by using a random number technique or by
using double hashing.
• In double hashing a different hash function is used to generate the probing
sequence.
• In chaining, each table address refers to a linked list of data elements.
• In coalesced chaining, or coalesced hashing, collided keys are stored in an
alternative position in the table but a link from the original hashed address is
maintained.
• In bucket addressing, each table address contains a bucket capable of storing
multiple data elements.

Exercises

1) Write a C++ program to implement a simple division hashing scheme. The
program should read in a sequence of key-value pairs from the keyboard – the key
should be a positive integer and the value should be a string. Each key-value pair
should be stored in a table of size 100. Use linear probing for collision resolution.
After the user has finished entering key-value pairs (e.g. they could enter a
negative key), they should be able to retrieve a sequence of values by entering
their keys.

2) Update the program you wrote in (1) to make it use quadratic probing instead of
linear probing.

Notes prepared by: FBE Computer Science Department.

Sources: Data Structures and Algorithms in C++, A. Drozdek, 2001

