
Mekelle University Faculty of Business & Economics

Computer Science Department

ICT241: Data Structures and Algorithms

Handout 7 - Hashing

Handout Overview

This handout gives an introduction to the subject of hashing. Common hash
functions such as division, folding, mid-square function, extraction and radix
transformation are discussed. In addition, a number of collision resolution
techniques are described, such as open addressing, chaining and bucketing.

1. Hashing

All of the searching techniques we have seen so far operate by comparing the
value being searched for with the key value of each element. For
example, when searching for an integer val in a binary search tree, we compare
val with the integer (the key) stored at each node we visit. Such searching
techniques vary in their complexity, but will always be more than O(1).

Hashing is an alternative way of storing data that aims to greatly improve the
efficiency of search operations. With hashing, when adding a new data element,
the key itself is used to directly determine the location to store the element.
Therefore, when searching for a data element, instead of searching through a
sequence of key values to find the location of the data we want, the key value
itself can be used to directly determine the location in which the data is stored.
This means that the search time is reduced from O(n), as in sequential search, or
O(log n), as in binary search, to O(1), or constant complexity. Regardless of the
number of elements stored, the search time is the same.

The question is, how can we determine the position to store a data element using
only its key value? We need to find a function h that can transform a key value K
(e.g. an integer, a string, etc.) into an index into a table used for storing data. The
function h is called a hash function. If h transforms different keys into different
indices it is called a perfect hash function. (A non-perfect hash function may
transform two different key values into the same index.)

Consider the example of a compiler that needs to store the values of all program
variables. The key in this case is the name of the variable, and the data to be
stored is the variable’s value. What hash function could we use? One possibility
would be to add the ASCII codes of every letter in the variable name and use the
resulting integer to index a table of values. But in this case the two variables abc
and cba would have the same index. This problem is known as collision and will
be discussed later in this handout. The worth of a hash function depends to a
certain extent on how well it avoids collisions.

2. Hash Functions

Clearly there are a large number of potential hash functions. In fact, if we wish to
assign positions for n items in a table of size m, the number of potential hash
functions is m^n, and the number of perfect hash functions is m!/(m − n)!. Most of
these potential functions are not of practical use, so this section discusses a
number of popular types of hash function.

2.1. Division

A hash function must guarantee that the value of the index that it returns is a
valid index into the table used to store the data. In other words, it must be less
than the size of the table. Therefore an obvious way to accomplish this is to
perform a modulo (remainder) operation. If the key K is a number, and the
size of the table is TSize, the hash function is defined as h(K) = K mod TSize.
Division hash functions perform best if the value of TSize is a prime number.
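As a minimal illustration, a division hash function could be sketched in C++ as follows; the table size of 101 (a prime) is an assumption chosen for the example, not a value prescribed by the handout:

#include <iostream>

// Division method: the address is the key reduced modulo the table size.
// A prime table size tends to spread the keys more evenly over the table.
const int TSize = 101;   // assumed table size for this example

int hashDivision(int key) {
    return key % TSize;   // always in the range 0 .. TSize - 1
}

int main() {
    std::cout << hashDivision(123456789) << std::endl;
    return 0;
}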

2.2. Folding

Folding hash functions work by dividing the key into a number of parts. For
example, the key value 123456789 might be divided into three parts: 123, 456
and 789. Next these parts are combined together to produce the target
address. There are two ways in which this can be done: shift folding and
boundary folding.

In shift folding, the different parts of the key are left as they are, placed
underneath one another, and processed in some way. For example, the parts
123, 456 and 789 can be added to give the result 1368. To produce the target
address, this result can be divided modulo TSize.

In boundary folding, every other part of the key is reversed and the rest are left intact. In
the example given above, 123 is left intact, 456 is reversed to give 654, and
789 is left intact. So this time the numbers 123, 654 and 789 are summed to
give the result 1566. This result can be converted to the target address by
using the modulo operation.
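The following C++ sketch illustrates both shift folding and boundary folding for an integer key split into three-digit parts; the part size and the table size of 1000 are assumptions made for the example:

#include <iostream>

const int TSize = 1000;   // assumed table size for this example

// Reverse the decimal digits of a part (e.g. 456 becomes 654).
int reverseDigits(int part) {
    int result = 0;
    while (part > 0) {
        result = result * 10 + part % 10;
        part /= 10;
    }
    return result;
}

// Shift folding: split the key into 3-digit parts and add them as they are.
int shiftFold(long key) {
    int sum = 0;
    while (key > 0) {
        sum += key % 1000;   // take the next 3-digit part
        key /= 1000;
    }
    return sum % TSize;
}

// Boundary folding: reverse every second part before adding.
int boundaryFold(long key) {
    int sum = 0;
    bool reverse = false;   // parts are taken from the right; reverse alternate ones
    while (key > 0) {
        int part = key % 1000;
        sum += reverse ? reverseDigits(part) : part;
        reverse = !reverse;
        key /= 1000;
    }
    return sum % TSize;
}

int main() {
    std::cout << shiftFold(123456789L) << std::endl;     // (789 + 456 + 123) mod 1000 = 368
    std::cout << boundaryFold(123456789L) << std::endl;  // (789 + 654 + 123) mod 1000 = 566
    return 0;
}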

2.3. Mid-Square Function

In the mid-square method, the key is squared and the middle part of the result
is used as the address. For example, if the key is 2864, then the square of
2864 is 8202496, so we use 024 as the address, which is the middle part of
8202496. If the key is not a number, it can be pre-processed to convert it into
one.
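A possible C++ sketch of the mid-square method, assuming a four-digit key whose square has seven digits, as in the example above:

#include <iostream>

// Mid-square method: square the key and extract the middle digits of the result.
// Here we assume a 7-digit square and take the middle 3 digits as the address.
int hashMidSquare(long key) {
    long squared = key * key;        // e.g. 2864 * 2864 = 8202496
    return (squared / 100) % 1000;   // drop the last 2 digits, keep the next 3 -> 024
}

int main() {
    std::cout << hashMidSquare(2864) << std::endl;   // prints 24 (i.e. address 024)
    return 0;
}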

2.4. Extraction

In the extraction method, only a part of the key is used to generate the
address. For the key 123456789, this method might use the first four digits
(1234), or the last four (6789), or the first two and last two (1289). Extraction
methods can be satisfactory so long as the omitted portion of the key is not
significant in distinguishing the keys. For example, at Mekelle University
many student ID numbers begin with the letters “RDG”, so the first three
letters can be omitted and the following numbers used to generate the key
using one of the other hash function techniques.
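A small C++ sketch of extraction, taking the first two and last two digits of a nine-digit key as in the 123456789 example; the choice of digits is purely illustrative:

#include <iostream>

// Extraction: build the address from selected digits of the key only.
// For a 9-digit key such as 123456789 we take the first two and last two digits.
int hashExtract(long key) {
    int firstTwo = key / 10000000;   // 123456789 / 10^7 = 12
    int lastTwo  = key % 100;        // 123456789 mod 100 = 89
    return firstTwo * 100 + lastTwo; // combine to give 1289
}

int main() {
    std::cout << hashExtract(123456789L) << std::endl;   // prints 1289
    return 0;
}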

2.5. Radix Transformation

If TSize is 100, and a division technique is used to generate the target address,
then the keys 147 and 247 will produce the same address. Therefore this
would not be a perfect hash function. The radix transformation technique
attempts to avoid such collisions by changing the number base of the key
before generating the address. For example, if we convert the keys 147 and
247 (base 10) into base 9, we get 173 and 304. Therefore, after a modulo operation
the addresses used would be 73 and 04. Note, however, that radix
transformation does not completely avoid collisions: the two keys 147 and
66 (base 10) are converted to 173 and 73 in base 9, so they would both hash to
the same address, 73.
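A rough C++ sketch of radix transformation: the key is rewritten in base 9, the resulting digit string is read as a decimal number, and a modulo operation produces the address. The table size of 100 comes from the example above; the rest is an assumed implementation:

#include <iostream>

const int TSize = 100;   // table size from the example

// Radix transformation: write the key in base 9, read the resulting digit
// string as a decimal number, then reduce it modulo the table size.
int hashRadix(int key) {
    int digits = 0;      // base-9 digits of the key, assembled as a decimal number
    int place = 1;
    while (key > 0) {
        digits += (key % 9) * place;   // next base-9 digit
        place *= 10;
        key /= 9;
    }
    return digits % TSize;
}

int main() {
    std::cout << hashRadix(147) << std::endl;  // 147 = 173 (base 9), 173 mod 100 = 73
    std::cout << hashRadix(247) << std::endl;  // 247 = 304 (base 9), 304 mod 100 = 4
    std::cout << hashRadix(66)  << std::endl;  // 66 = 73 (base 9), also 73 -> collision
    return 0;
}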

3. Collision Resolution

If the hash function being used is not a perfect hash function (which is usually the
case), then the problem of collisions will arise. Collisions occur when two keys
hash to the same address. The chance of collisions occurring can be reduced by
choosing the right hash function, or by increasing the size of the table, but it can
never be completely eliminated. For this reason, any hashing system should adopt
a collision resolution strategy. This section examines some common strategies.

3.1. Open Addressing

In open addressing, if a collision occurs, an alternative address within the
table is found for the new data. If this address is also occupied, another
alternative is tried. The sequence of alternative addresses to try is known as
the probing sequence. In general terms, if position h(K) is occupied, the
probing sequence is

norm(h(K) + p(1)), norm(h(K) + p(2)), ..., norm(h(K) + p(i)), ...

where function p is the probing function and norm is a normalisation function
that ensures the address generated is within an acceptable range, for example
the modulo function.

The simplest method is linear probing. In this technique the probing sequence
is simply a series of consecutive addresses; in other words the probing
function p(i) = i. If one address is occupied, we try the next address in the
table, then the next, and so on. If the last address is occupied, we start again at
the beginning of the table. Linear probing has the advantage of simplicity, but
it has the tendency to produce clusters of data within the table. For example,
Figure 1 shows a sequence of insertions into a hash table using the following
key/value pairs:
Key   Value
15    A
2     B
33    C
5     D
19    E
22    F
9     G
32    H

The first three insertions (A, B and C) do not result in collisions. However,
when data D is inserted it hashes to the address 5, which is currently occupied
by A, so it is placed in the next address. Similarly, when data F is inserted at
address 2 it collides with B, so we try address 3 instead. Here it collides with
C, so we have to place it at address 4. Data G also collides with E at address
9, so because 9 is the last address in the table we place it at address 1. Finally
data H collides with 5 different elements before being successfully placed at
address 7.
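The following C++ sketch shows insertion and search with linear probing using the key/value pairs above. The table size of 10, the use of -1 to mark an empty slot, and wrap-around to address 0 are assumptions for the example, so the final layout may differ in detail from Figure 1:

#include <iostream>
#include <vector>

const int TSize = 10;    // assumed table size
const int EMPTY = -1;    // assumed marker for an unused slot

struct Entry { int key; char value; };

// Linear probing: if h(K) is occupied, try the next address, wrapping
// around to the start of the table when the end is reached.
class LinearHashTable {
public:
    LinearHashTable() : table(TSize, Entry{EMPTY, ' '}) {}

    bool insert(int key, char value) {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {
            int probe = (addr + i) % TSize;        // probing function p(i) = i
            if (table[probe].key == EMPTY) {
                table[probe] = Entry{key, value};
                return true;
            }
        }
        return false;                              // table is full
    }

    bool search(int key, char& value) const {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {
            int probe = (addr + i) % TSize;
            if (table[probe].key == EMPTY) return false;   // stop at the first empty slot
            if (table[probe].key == key) { value = table[probe].value; return true; }
        }
        return false;
    }

private:
    std::vector<Entry> table;
};

int main() {
    LinearHashTable t;
    int keys[]    = {15, 2, 33, 5, 19, 22, 9, 32};
    char values[] = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'};
    for (int i = 0; i < 8; ++i) t.insert(keys[i], values[i]);

    char v;
    if (t.search(32, v)) std::cout << "32 -> " << v << std::endl;   // prints 32 -> H
    return 0;
}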

Figure 1 – Collision resolution using linear probing.

We can see in Figure 1 that there is a cluster of 6 elements (from addresses 2
to 7) stored next to each other. The problem with clusters is that the
probability of a collision for a key is dependent on the address that it hashes
to. Clustering can be avoided by using a more careful choice of probing
function p. One possible choice is to use the sequence of addresses

h(K) + i², h(K) – i², for i = 1, 2, ... , (TSize – 1) / 2.

Including the original attempt to hash K, this formula results in the sequence
h(K), h(K) + 1, h(K) – 1, h(K) + 4, h(K) – 4, etc. All of these addresses
should be divided modulo TSize. For example, for the data H in Figure 1 (which hashes to address 2), we
first try address 2, then address 3 (2 + 1), and then address 1 (2 – 1), where
the data is successfully placed. This technique is known as quadratic probing.
Quadratic probing results in fewer clusters than linear probing, but because
the same probing sequence is used for every key, sometimes clusters can
build up away from the original address. These clusters are known as
secondary clusters.
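As a small illustration, the quadratic probing sequence for a key that hashes to address 2 (like H in Figure 1) could be generated as follows; the table size of 10 and the normalisation function are assumptions for the example:

#include <iostream>

const int TSize = 10;   // assumed table size

// Normalise an address into the range 0 .. TSize - 1, even if it is negative.
int norm(int addr) {
    return ((addr % TSize) + TSize) % TSize;
}

int main() {
    int home = 2;   // h(K) for a key that hashes to address 2
    std::cout << home;
    for (int i = 1; i <= (TSize - 1) / 2; ++i) {
        std::cout << " " << norm(home + i * i)    // h(K) + i^2
                  << " " << norm(home - i * i);   // h(K) - i^2
    }
    std::cout << std::endl;   // prints 2 3 1 6 8 1 3 8 6
    return 0;
}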

Another possibility, which avoids the problem of secondary clusters, is to use
a different probing sequence for each key. This can be achieved by using a
random number generator seeded by a value that is dependent on the key.
Remember that random number generators always require a seed value, and if
the same seed is used the same sequence of ‘random’ numbers will be
generated. So if, for example, the value of the key (if it is an integer), were to
be used, each different key would generate a different sequence of probes,
thus avoiding secondary clusters.

Another way to avoid secondary clusters is to use double hashing. Double
hashing uses two different hashing functions: one to find the primary position
of a key, and another for resolving conflicts. The idea is that if the primary
hashing function, h(K), hashes two keys K1 and K2 to the same address, then
the secondary hashing function, hp(K), will probably not. The probing
sequence is therefore

h(K), h(K) + hp(K), h(K) + 2·hp(K), ..., h(K) + i·hp(K), ...

Experiments indicate that double hashing generally eliminates secondary
clustering, but using a second hash function can be time-consuming.
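A hedged C++ sketch of a double-hashing probe sequence. The secondary function hp(K) = 1 + (K mod (TSize − 2)) is one common choice and is an assumption here, not a function prescribed by the handout:

#include <iostream>

const int TSize = 11;   // assumed (prime) table size

int h(int key)  { return key % TSize; }            // primary hash function
int hp(int key) { return 1 + key % (TSize - 2); }  // secondary hash function (never 0)

int main() {
    int key = 24;
    // Probing sequence: h(K), h(K) + hp(K), h(K) + 2*hp(K), ... (mod TSize)
    for (int i = 0; i < TSize; ++i)
        std::cout << (h(key) + i * hp(key)) % TSize << " ";
    std::cout << std::endl;   // for key 24: 2 9 5 1 8 4 0 7 3 10 6
    return 0;
}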

3.2. Chaining

In chaining, each address in the table refers to a list, or chain, of data values.
If a collision occurs the new data is simply added to the end of the chain. Figure 2
shows an example of using chaining for collision resolution.
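A minimal chaining sketch in C++, using a standard linked list at each address; the table size of 10 is an assumption for the example:

#include <iostream>
#include <list>
#include <vector>

const int TSize = 10;   // assumed table size

struct Entry { int key; char value; };

// Chaining: every table address holds a list of all entries that hash to it,
// so a collision is handled by simply appending to the list.
class ChainedHashTable {
public:
    ChainedHashTable() : table(TSize) {}

    void insert(int key, char value) {
        table[key % TSize].push_back(Entry{key, value});
    }

    bool search(int key, char& value) const {
        for (const Entry& e : table[key % TSize])
            if (e.key == key) { value = e.value; return true; }
        return false;
    }

private:
    std::vector<std::list<Entry>> table;
};

int main() {
    ChainedHashTable t;
    t.insert(15, 'A');
    t.insert(5, 'D');    // 15 and 5 both hash to address 5 and share one chain
    char v;
    if (t.search(5, v)) std::cout << "5 -> " << v << std::endl;   // prints 5 -> D
    return 0;
}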

Provided that the lists do not become very long, chaining is an efficient
technique. However, if there are many collisions the lists will become long
and retrieval performance can be severely degraded. Performance can be
improved by ordering the values in the list (so that an exhaustive search is not
necessary for unsuccessful searches) or by using self-organising lists.

An alternative version of chaining is called coalesced hashing, or coalesced
chaining. In this method, the link to the next value in the list actually points
to another table address. If a collision occurs, then a technique such as linear
probing is used to find an available address, and the data is placed there. In
addition, a link is placed at the original address indicating where the next data
element is stored. Figure 3 shows an example of this technique. When the
keys D5 and F2 collide in Figure 3b, linear probing is used to position the keys,
but links from their original hashed addresses are maintained. Variations on
coalesced hashing include always placing colliding keys at the end of the
table, or storing colliding keys in a special reserved area known as the cellar.
In both cases a link from the original hashed address will point to the new
location. The advantage of coalesced hashing is that it avoids the need to
make a sequential search through the table for the required data in the event
of collisions.

Figure 2 – Collision resolution using chaining.

Figure 3 – Collision resolution using coalesced hashing.

3.3. Bucket Addressing

Bucket addressing is similar to chaining, except that the data are stored in a
bucket at each table address. A bucket is a block of memory that can store a
number of items, but not an unlimited number as in the case of chaining.

Bucketing reduces the chance of collisions, but does not totally avoid them. If
the bucket becomes full, then an item hashed to it must be stored elsewhere.
Therefore bucketing is commonly combined with an open addressing
technique such as linear or quadratic probing. Figure 4 shows an example of
bucketing that uses a bucket size of 3 elements at each address.

Figure 4 – Collision resolution using bucketing.
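A rough C++ sketch of bucket addressing with a bucket size of 3, as in Figure 4, combined with linear probing when a bucket overflows; the table size and the overflow policy are assumptions for the example:

#include <iostream>
#include <vector>

const int TSize = 10;        // assumed number of bucket addresses
const int BucketSize = 3;    // bucket size used in Figure 4
const int EMPTY = -1;        // assumed marker for an unused slot

// Bucket addressing: each address holds a fixed-size bucket of slots.
// If a bucket is full, linear probing moves on to the next bucket.
class BucketHashTable {
public:
    BucketHashTable() : table(TSize, std::vector<int>(BucketSize, EMPTY)) {}

    bool insert(int key) {
        int addr = key % TSize;
        for (int i = 0; i < TSize; ++i) {               // try each bucket in turn
            std::vector<int>& bucket = table[(addr + i) % TSize];
            for (int s = 0; s < BucketSize; ++s)
                if (bucket[s] == EMPTY) { bucket[s] = key; return true; }
        }
        return false;                                    // every bucket is full
    }

private:
    std::vector<std::vector<int>> table;
};

int main() {
    BucketHashTable t;
    int keys[] = {5, 15, 25, 35};   // all hash to address 5; 35 overflows the bucket
    for (int key : keys)
        std::cout << key << (t.insert(key) ? " stored" : " rejected") << std::endl;
    return 0;
}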

Summary of Key Points

The following points summarize the key concepts in this handout:

• Hashing is a data storage technique that aims to improve the efficiency of
search operations.
• Using hashing, a hash function h(K) is used to determine the address within a
table at which a key K will be stored.
• A perfect hash function is one that will generate different addresses for
different keys.
• If two keys hash to the same address a collision will occur.
• The simplest hash function is to use a modulo operation using the number of
addresses in the table as the divisor.
• Folding hash functions work by dividing the key into a number of parts and
then combining them to produce the target address.
• In shift folding the different parts of the key are left intact before being
combined.
• In boundary folding alternate parts of the key are reversed before combination.
• In the mid-square hash function, the key is squared and the middle part of the
result is used as the address.
• In the extraction method, only a part of the key is used to generate the address.
• In the radix transformation technique, the number base of the key is changed
to try to avoid collisions.
• Open addressing attempts to resolve collisions by finding an alternative
address at which to store collided keys.
• The probing sequence is the series of addresses tried by an open addressing
scheme.
• Linear probing uses a probing sequence consisting of consecutive addresses in
the table.
• Quadratic probing uses a probing sequence of the form h(K), h(K) + 1, h(K) –
1, h(K) + 4, h(K) – 4, etc.
• A cluster is a set of keys that are stored in addresses in the same part of the
table.
• A primary cluster occurs when many keys hash to the same (or similar)
primary address.
• A secondary cluster occurs when many keys hash to the same (or similar)
alternative address.
• Secondary clusters can be avoided by using a random number technique or by
using double hashing.
• In double hashing a different hash function is used to generate the probing
sequence.
• In chaining, each table address refers to a linked list of data elements.
• In coalesced chaining, or coalesced hashing, collided keys are stored in an
alternative position in the table but a link from the original hashed address is
maintained.
• In bucket addressing, each table address contains a bucket capable of storing
multiple data elements.

Exercises

1) Write a C++ program to implement a simple division hashing scheme. The
program should read in a sequence of key-value pairs from the keyboard – the key
should be a positive integer and the value should be a string. Each key-value pair
should be stored in a table of size 100. Use linear probing for collision resolution.
After the user has finished entering key-value pairs (e.g. they could enter a
negative key), they should be able to retrieve a sequence of values by entering
their keys.

2) Update the program you wrote in (1) to make it use quadratic probing instead of
linear probing.

Notes prepared by: FBE Computer Science Department.

Sources: Data Structures and Algorithms in C++, A. Drozdek, 2001

