Lect10 Hash Basics
Hash Functions: A good hashing function should have the following properties:
• It should be efficiently computable, say in constant time and using simple arithmetic
operations.
• It should produce few collisions. Two additional aspects of a hash function implied by
this are:
– It should be a function of every bit of the key (otherwise keys that differ only in
the ignored bits will collide).
– It should break up (scatter) naturally occurring clusters of key values.
As an example of the last rule, observe that in writing programs it is not uncommon to use
very similar variable names, “temp1”, “temp2”, and “temp3”. It is important that such similar
names be mapped to very different locations in the hash output space. By the way, the origin
of the name “hashing” is from this mixing aspect of hash functions (thinking of “hash” in
food preparation as a mixture of things).
We will think of hash functions as being applied to nonnegative integer keys. Keys that are
not integers will generally need to be converted into this form (e.g., by converting the key into
a bit string, such as an ASCII or Unicode representation of a string) and then interpreting the
bit string as an integer. Since the hash function’s output lies in the range [0..m − 1], an
obvious (but not very good) choice for a hash function is:
h(x) = x mod m.
This is called division hashing. It satisfies our first criterion of efficiency, but consecutive
keys are mapped to consecutive entries, and so it does not do a good job of breaking up clusters.
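The two steps above, converting a key's byte representation to an integer and then reducing modulo m, can be sketched as follows. The table size m = 16 and the helper names are illustrative choices, not from the lecture; note how the similar keys land in consecutive slots, which is exactly the clustering weakness just described.

```python
# Division hashing: h(x) = x mod m, applied to integer keys.
# Non-integer keys are first converted to integers, here by
# interpreting a string's UTF-8 bytes as one big integer.

def key_to_int(key: str) -> int:
    """Interpret the key's byte representation as a nonnegative integer."""
    return int.from_bytes(key.encode("utf-8"), byteorder="big")

def division_hash(x: int, m: int) -> int:
    return x % m

m = 16
# Similar keys land in consecutive (clustered) slots -- the weakness noted above.
slots = [division_hash(key_to_int(k), m) for k in ("temp1", "temp2", "temp3")]
print(slots)  # → [1, 2, 3]
```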
Some Common Hash Functions: Many different hash functions have been proposed. The topic
is quite deep, and we will not claim to have a definitive answer for the best hash function.
Here are three simple, commonly used hash functions:
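As an illustration, here are sketches of two widely used hash functions, the multiplication method (attributed to Knuth) and polynomial (Horner's rule) string hashing. These particular functions and parameter choices (the golden-ratio constant, the radix r = 31) are stand-ins chosen here, not necessarily the functions the lecture has in mind.

```python
import math

def multiplicative_hash(x: int, m: int) -> int:
    """Multiplication method: scatter keys using the fractional part of
    x times an irrational constant (the golden ratio works well)."""
    A = (math.sqrt(5) - 1) / 2          # ~0.618..., a common choice
    frac = (x * A) % 1.0                # fractional part of x * A
    return int(m * frac)                # scale into [0, m-1]

def polynomial_hash(s: str, m: int, r: int = 31) -> int:
    """Polynomial (Horner's rule) hashing of a string with radix r."""
    h = 0
    for ch in s:
        h = (h * r + ord(ch)) % m
    return h
```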
Randomization and Universal Hashing: Any deterministic hashing scheme runs the risk that
we may (very rarely) come across a set of keys that behaves badly for this choice. As we have
seen before, one way to evade attacks by a clever adversary is to employ randomization. Any
given hash function might be bad for a particular set of keys. So, it would seem that we can
never declare any one hash function to be “ideal.”
One way to approach this conundrum is to flip the question on its head. Rather than trying to
determine the chances that a fixed hash function works for a random set of keys, let us instead
fix two keys x and y, say, and then select our hash function h at random out of large bag of
possible hash functions. Then we ask: what is the probability that h(x) = h(y), given
that h was chosen at random? Since there are m table entries, a probability of 1/m
would be the best we could hope for.
This gives rise to the notion of universal hashing. A hashing scheme is said to be universal if
the hash function is selected randomly from a large class of functions, and the probability of
a collision between any two fixed keys is at most 1/m.
There are many different universal hash functions. Let’s consider one simple one (which was
first proposed by the inventors of universal hashing, Carter and Wegman). First, let p be any
prime number that is chosen to be larger than any input key x to the hash function. Next,
select two integers a and b at random where
a ∈ {1, 2, . . . , p − 1} and b ∈ {0, 1, . . . , p − 1}.
(Note that a ≠ 0.) Finally, consider the following linear hash function, which depends on the
choice of a and b.
ha,b (x) = ((ax + b) mod p) mod m.
As a and b vary, this defines a family of functions. Let Hp denote the class of hash functions
that arise by considering all possible choices of a and b (subject to the above restrictions).
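Drawing one function from this family can be sketched as follows. The prime P below (the Mersenne prime 2^31 − 1) and the table size m = 16 are illustrative choices; any prime larger than every input key works.

```python
import random

# Carter-Wegman universal hashing: fix a prime p larger than any key,
# then draw a and b at random to select one h_{a,b} from H_p.
P = 2_147_483_647  # the Mersenne prime 2^31 - 1

def make_universal_hash(m: int):
    """Return one hash function h_{a,b} drawn at random from H_p."""
    a = random.randint(1, P - 1)   # a != 0
    b = random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m

h = make_universal_hash(16)
# h is now one fixed function: applying it twice to the same key agrees.
print(h(12345) == h(12345))       # → True
```

The randomness lies entirely in the one-time choice of (a, b); once selected, h is deterministic, which is what lets a hash table use it consistently across insertions and lookups.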
The following theorem shows that Hp is a universal hashing system by showing that the
probability that two fixed keys collide is at most 1/m. The proof is not terribly deep, but it
involves
some nontrivial modular arithmetic. We present it for the sake of completeness.
Theorem: Consider any two integers x and y, where 0 ≤ y < x < p. Let ha,b be a hash
function chosen uniformly at random from Hp . Then the probability that ha,b (x) =
ha,b (y) is at most 1/m.
Proof: (Optional) Let us select a and b randomly as specified above and let h = ha,b . Observe
that h(x) = h(y) if and only if the two values (ax + b) mod p and (ay + b) mod p differ
from each other by a multiple of m. (Since a ≠ 0 and x ≠ y, these two values are distinct
modulo p, so this multiple is nonzero.) This is equivalent to saying that there exists a
nonzero integer i, where |i| ≤ (p − 1)/m, such that:
(ax + b) mod p = (ay + b) mod p + i · m.
Since y < x, their difference x − y is nonzero and (since p is prime) x − y has an inverse
modulo p. That is, there exists a number q such that (x − y) · q ≡ 1 (mod p). Subtracting
the two sides of the above equation yields a · (x − y) ≡ i · m (mod p). Multiplying
both sides by q, we have
a ≡ i · m · q (mod p).
By definition of our hashing system, there are p − 1 possible choices for a. By varying i
in the allowed range, there are at most ⌊(p − 1)/m⌋ possible nonzero values for the
right-hand side. Thus, the probability of collision is at most
⌊(p − 1)/m⌋ / (p − 1) ≤ ((p − 1)/m) / (p − 1) = 1/m.
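The theorem can also be checked empirically: fix two keys, draw (a, b) many times, and count how often the keys collide. The prime, keys, table size, and trial count below are arbitrary illustrative choices; the observed collision rate should come out near, and no larger than roughly, 1/m.

```python
import random

# Empirical check of the theorem: estimate Pr[h_{a,b}(x) = h_{a,b}(y)]
# over random draws of (a, b) for two fixed keys x and y.
P, m = 2_147_483_647, 16          # prime p = 2^31 - 1, table size m
x, y = 1234, 5678                 # two fixed keys
trials, collisions = 100_000, 0
rng = random.Random(1)            # fixed seed for reproducibility
for _ in range(trials):
    a = rng.randint(1, P - 1)
    b = rng.randint(0, P - 1)
    if ((a * x + b) % P) % m == ((a * y + b) % P) % m:
        collisions += 1
print(collisions / trials)        # near, and at most about, 1/16 ≈ 0.0625
```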
Like the other randomized structures we have seen this year, universal hash functions are
simple and provide good guarantees on the expected-case performance of hashing systems.
We will pick this topic up in our next lecture, focusing on methods for collision resolution,
under the assumption that our hashing function has a low probability of collisions.