Unit 4 - 4.4
Requirements ( 4.4 )
● Consider a Web site gathering statistics on how many unique users it has seen in each
given month.
● The universal set is the set of logins for that site, and a stream element is generated each time
someone logs in.
● This measure is appropriate for a site like Amazon, where the typical user logs in with their unique
login name.
However...
Possible Solutions
● If the number of distinct elements is too great, or if there are too many streams that need to be
processed at once, then we cannot store the needed data in main memory.
● There are several options. We could use more machines, each machine handling only one or
several of the streams.
● We could store most of the data structure in secondary memory and batch stream elements, so that
whenever we bring a disk block into main memory there are many tests and updates to be
performed on the data in that block.
Flajolet Martin Algorithm
If the stream contains n elements, m of them unique, this algorithm runs in O(n)
time and needs O(log(m)) memory.
It is possible to estimate the number of distinct elements by hashing the elements
of the universal set to a bit-string that is sufficiently long.
As we see more different hash-values, it becomes more likely that one of these
values will be “unusual.”
The particular unusual property we shall exploit is that the value ends in many 0’s.
Algorithm
Whenever we apply a hash function h to a stream element a, the bit string h(a) will end
in some number of 0’s, possibly none. Call this number the tail length for a and h. Let R
be the maximum tail length of any a seen so far in the stream. Then we shall use
estimate 2^R for the number of distinct elements seen in the stream.
1. If m is much larger than 2^r, then the probability that we shall find a tail of length at
least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at least r
approaches 0.
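The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `hash_to_int`, `tail_length`, and `fm_estimate` are hypothetical, using SHA-256 truncated to 32 bits as the hash function h is an assumption, and counting an all-zero hash as tail length 0 is a convention chosen to match the worked example in these notes.

```python
import hashlib

def hash_to_int(element, bits=32):
    """Map an element to a `bits`-bit integer (SHA-256 is an assumed choice of h)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def tail_length(value):
    """Number of trailing 0 bits in value; an all-zero value counts as 0 (assumed convention)."""
    if value == 0:
        return 0
    # value & -value isolates the lowest set bit; its position is the tail length.
    return (value & -value).bit_length() - 1

def fm_estimate(stream, bits=32):
    """Return 2^R, where R is the maximum tail length seen over the stream."""
    max_tail = 0  # R: the only state kept between elements
    for element in stream:
        max_tail = max(max_tail, tail_length(hash_to_int(element, bits)))
    return 2 ** max_tail
```

Note that repeated elements hash to the same value, so they can never raise R; only new distinct elements can. The estimate is always a power of 2, so a single hash function gives only a coarse answer.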
Overview
● If the stream contains n elements and m of them are unique, then the algorithm
runs in O(n) time and needs O(log(m)) memory.
Example
From the input, we can see that there are 4 distinct elements.
Calculating the Hashes
h(3) = 4 = 100
h(4) = 0 = 000
Find the Number of Trailing Zeros
h(3) = 4 = 100, tail length 2
h(4) = 0 = 000, tail length 0 (in this example an all-zero hash value is counted as tail length 0)
Distinct Elements
From the binary hash values, take the maximum number of trailing zeros.
Thus, r = 2, and the estimated number of distinct elements is 2^r = 2^2 = 4.
From the input we can verify that the number of distinct elements is indeed 4.
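The arithmetic of this example can be checked with a short Python sketch. It uses only the two hash values listed above, h(3) = 4 and h(4) = 0; the rest of the input and the hash function itself are not reproduced here. The helper name `tail_length` is hypothetical, and it follows the example's convention that a hash value of 0 has tail length 0.

```python
def tail_length(value):
    """Count trailing 0 bits; a value of 0 is given tail length 0, as in the example."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

# Only the (element, hash value) pairs shown in the example.
hash_values = {3: 4, 4: 0}

tails = {x: tail_length(h) for x, h in hash_values.items()}
R = max(tails.values())   # maximum tail length
estimate = 2 ** R         # estimated number of distinct elements

print(tails, R, estimate)  # {3: 2, 4: 0} 2 4
```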
Problem
● Observe that as we read the stream it is not necessary to store the elements
seen.
● The only thing we need to keep in main memory is one integer per hash
function; this integer records the largest tail length seen so far for that hash
function and any stream element.
● Only if we are trying to process many streams at the same time would main
memory constrain the number of hash functions we could associate with any
one stream.
● In practice, the time it takes to compute hash values for each stream element
would be the more significant limitation on the number of hash functions we
use.
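The space argument above can be made concrete with a small sketch that keeps exactly one integer per hash function. Everything named here is an assumption made for illustration: the class `MultiHashTailTracker`, the helper `salted_hash`, and the idea of deriving several hash functions by prefixing a salt to the input of SHA-256. How the per-hash estimates should be combined into a single answer is not covered in this section, so the sketch simply returns them all.

```python
import hashlib

def salted_hash(element, salt, bits=32):
    """One of several hash functions, obtained by salting SHA-256 (assumed construction)."""
    data = f"{salt}:{element}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def tail_length(value):
    """Number of trailing 0 bits; an all-zero value counts as 0 (assumed convention)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1

class MultiHashTailTracker:
    """Keeps one integer per hash function: the largest tail length seen so far."""

    def __init__(self, num_hashes):
        self.max_tails = [0] * num_hashes  # the entire in-memory state

    def add(self, element):
        # Cost per element grows linearly with the number of hash functions.
        for i in range(len(self.max_tails)):
            t = tail_length(salted_hash(element, i))
            if t > self.max_tails[i]:
                self.max_tails[i] = t

    def estimates(self):
        """One estimate 2^R per hash function."""
        return [2 ** r for r in self.max_tails]

tracker = MultiHashTailTracker(num_hashes=8)
for login in ["alice", "bob", "alice", "carol"]:
    tracker.add(login)
per_hash_estimates = tracker.estimates()
```

The state is a list of eight small integers no matter how long the stream is, while each call to `add` computes eight hashes, which illustrates why hashing time rather than memory tends to limit the number of hash functions.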