Hash Tables
15-122: Principles of Imperative Computation
Frank Pfenning, Rob Simmons
Lecture 13
February 28, 2013
Introduction
Associative Arrays
Chains
If we reach the end of a chain without finding the key, then no entry with the given key exists. If we keep the chain unsorted, this gives us O(n) worst-case complexity for finding a key in a chain of length n, assuming that computing and comparing keys is constant time.
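As an illustration (in C rather than C0, with names of our own invention, not the lecture's code), a lookup in an unsorted chain simply walks the list until it finds a matching key or falls off the end:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chain node for string keys; the field and function
   names here are our own, not from the lecture's code. */
struct chain_node {
    const char *key;
    int value;
    struct chain_node *next;
};

/* Walks the chain front to back: O(n) key comparisons for a chain
   of length n. Returns a pointer to the value, or NULL if no entry
   with the given key exists. */
int *chain_lookup(struct chain_node *chain, const char *k) {
    for (struct chain_node *p = chain; p != NULL; p = p->next) {
        if (strcmp(p->key, k) == 0)
            return &p->value;
    }
    return NULL;  /* reached the end: no entry with the given key */
}

/* Tiny demonstration on a two-element chain; returns -1 for "absent". */
int chain_demo(const char *k) {
    struct chain_node snd = {"world", 2, NULL};
    struct chain_node fst = {"hello", 1, &snd};
    int *r = chain_lookup(&fst, k);
    return r == NULL ? -1 : *r;
}
```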
Given what we have seen so far in our search data structures, this seems
very poor behavior, but if we know our data collections will always be
small, it may in fact be reasonable on occasion.
Can we do better? One idea goes back to binary search. If keys are ordered, we may be able to arrange the elements in an array or in the form of a tree and then cut the search space roughly in half every time we make a comparison. We will begin thinking about this approach just before Spring Break, and it will occupy us for a few lectures after the break as well. Designing such data structures is a rich and interesting subject, but the best
we can hope for with this approach is O(log(n)), where n is the number of
entries. We have seen that this function grows very slowly, so this is quite
a practical approach.
Nevertheless, the question arises whether we can do better than O(log(n)), say, constant time O(1), to find an entry with a given key. We know that it can be done for arrays, indexed by integers, which allow constant-time access. Can we also do it, for example, for strings?
Hashing
The first idea behind hash tables is to exploit the efficiency of arrays. So:
to map a key to an entry, we first map a key to an integer and then use the
integer to index an array A. The first map is called a hash function. We write
it as hash( ). Given a key k, our access could then simply be A[hash(k)].
There is an immediate problem with this approach: there are 2^31 positive integers, so we would need a huge array, negating any possible performance advantages. But even if we were willing to allocate such a huge array, there are many more strings than ints, so there cannot be any hash function that always gives us different ints for different strings.
The solution is to allocate an array of smaller size, say m, and then look
up the result of the hash function modulo m, for example, A[hash(k)%m].
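One detail worth noting in a sketch (ours, not the lecture's code): in C, as in C0, the result of % can be negative when the hash value is negative, so the index must be normalized into the range [0, m) before using it to access the array:

```c
/* Maps an arbitrary (possibly negative) hash value into a valid
   array index in the range [0, m), assuming m > 0. The name
   index_of is our own, for illustration only. */
int index_of(int hashval, int m) {
    int i = hashval % m;       /* in C, this can be negative */
    return i < 0 ? i + m : i;  /* shift negative results into [0, m) */
}
```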
This creates a new problem: it is inevitable that multiple strings will map to the same array index. For example, if the array has size m and we have more than m elements, then at least two must map to the same index. In practice, collisions will happen much sooner than that.
If the hash function maps two different keys to the same integer value (modulo m), we say we have a collision. In general, we would like to avoid collisions, but we cannot rule them out entirely, so we need a strategy for dealing with them.
Separate Chaining
Randomness
The average-case analysis relies on the fact that the hash values of the keys are relatively evenly distributed. This can be restated as saying that the probability that a key maps to array index i should be about the same for each i, namely 1/m. In order to avoid systematically creating collisions, small changes in the input string should result in unpredictable changes in the output hash value, uniformly distributed over the range of C0 integers. We can achieve this with a pseudorandom number generator (PRNG).
A pseudorandom number generator is just a function that takes one number and obtains another in a way that is both unpredictable and easy to calculate. The C0 rand library is a pseudorandom number generator with a fairly simple interface:
/* library file rand.h0 */
typedef struct rand* rand_t;
rand_t init_rand(int seed);
int rand(rand_t gen);
One can create a random number generator (of type rand_t) by initializing it with an arbitrary seed. Then we can generate a sequence of pseudorandom numbers by repeatedly calling rand on such a generator.
The rand library in C0 is implemented as a linear congruential generator. A linear congruential generator takes a number x and finds the next number by calculating (a * x) + c modulo m. In C0, it's easiest to take m to be 2^32, since addition and multiplication in C0 are already defined modulo 2^32. The trick is finding a good multiplier a and summand c.
If we were using 4-bit numbers (from -8 to 7, where multiplication and addition are modulo 16), then we could set a to 5 and c to 7, and our pseudorandom number generator would generate the following series of numbers:

0, 7, -6, -7, 4, -5, -2, -3, -8, -1, 2, 1, -4, 3, 6, 5, 0, ...
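A quick sketch in C (our own illustration, not the lecture's code) of one step of this 4-bit generator, reducing modulo 16 and reinterpreting the result as a signed value in the range -8 to 7:

```c
/* One step of the 4-bit linear congruential generator with a = 5
   and c = 7: compute 5*x + 7 modulo 16, then reinterpret the 4-bit
   result as a signed value in -8..7. */
int next_4bit(int x) {
    int r = (5 * ((x % 16 + 16) % 16) + 7) % 16;  /* nonnegative residue mod 16 */
    return r >= 8 ? r - 16 : r;                   /* signed 4-bit interpretation */
}
```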
The PRNG used in C0's library sets a to 1664525 and c to 1013904223 and generates the following series of numbers starting from 0:

0, 1013904223, 1196435762, -775096599, -1426500812, ...
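The same step for the full generator can be sketched in C (the function name is ours); unsigned 32-bit arithmetic wraps around, which gives us the modulo-2^32 behavior for free:

```c
#include <stdint.h>

/* One step of the linear congruential generator with a = 1664525
   and c = 1013904223, modulo 2^32. Unsigned overflow is defined to
   wrap in C, and the result is then reinterpreted as a signed
   32-bit int, as in C0. */
int32_t lcg_step(int32_t x) {
    uint32_t u = 1664525u * (uint32_t)x + 1013904223u;  /* a*x + c mod 2^32 */
    return (int32_t)u;
}
```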
This kind of generator is fine for random testing or (indeed) as the basis for a hashing function, but the results are too predictable to use for cryptographic purposes such as encrypting a message. In particular, a linear
congruential generator will sometimes have repeating patterns in the lower
bits. If one wants numbers from a small range it is better to use the higher
bits of the generated results rather than just applying the modulus operation.
It is important to realize that these numbers just look random; they aren't really random. In particular, we can reproduce the exact same sequence if we give the generator the exact same seed. This property is important both for testing purposes and for hashing. If we discover a bug during testing with a particular seed, we can rerun the program with the same seed to reproduce the failure.
Exercises
Exercise 1 What happens when you replace the data structure for separate chaining by something other than a linked list? Discuss the changes and identify benefits and disadvantages when using a sorted list, a queue, a doubly-linked list, or
another hash table for separate chaining.
Exercise 2 Consider the situation of writing a hash function for strings of length two that use only the characters A to Z. There are 676 different such strings.
You were hoping to get away with implementing a hash table without collisions,
since you are only using 79 out of those 676 two-letter words. But you still see
collisions most of the time. Explain this phenomenon with the birthday problem.
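To see the numbers behind Exercise 2, here is a rough sketch (our own, not part of the exercise) that computes the birthday-problem probability of at least one collision when k keys are hashed uniformly into n slots:

```c
/* Birthday-problem estimate: the probability that k uniformly
   hashed keys land in k distinct slots out of n is the product of
   (n - i)/n for i = 0, ..., k-1; one minus that product is the
   probability of at least one collision. */
double collision_probability(int k, int n) {
    double all_distinct = 1.0;
    for (int i = 0; i < k; i++)
        all_distinct *= (double)(n - i) / (double)n;
    return 1.0 - all_distinct;
}
```

For 79 keys in 676 slots this probability is roughly 0.99, which is why collisions show up most of the time.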