Unit 4 - 4.4

The document discusses counting distinct elements in a data stream using the Flajolet-Martin algorithm. It begins by introducing the count-distinct problem and potential solutions. It then describes the Flajolet-Martin algorithm, which uses hash functions to estimate the number of distinct elements in one pass using only O(log(m)) memory, where m is the number of distinct elements. It discusses combining estimates to improve accuracy and addresses problems and solutions related to the estimates and memory usage.

Uploaded by

King Bavisi

Unit 4: Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements (4.4)

4.4 Counting Distinct Elements in a Stream

From Mining of Massive Datasets


The Count-Distinct Problem
Understanding the problem
● In this section we look at a third simple kind of processing we might want to do on a
stream (besides sampling and filtering)
Where?

● Consider a Web site gathering statistics on how many unique users it has seen in each
given month.
● The universal set is the set of logins for that site, and a stream element is generated each time
someone logs in.
● This measure is appropriate for a site like Amazon, where the typical user logs in with their unique
login name.
However...
Possible Solutions

● The obvious way to solve the problem is to keep in main memory a
list of all the elements seen so far in the stream.
● Keep them in an efficient search structure such as a hash table or
search tree, so one can quickly add new elements and check
whether or not the element that just arrived on the stream was
already seen.
● As long as the number of distinct elements is not too great, this
structure can fit in main memory and there is little problem
obtaining an exact answer to the question of how many distinct
elements appear in the stream.
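
The exact approach above can be sketched in a few lines of Python. This is only an illustration, assuming the whole set of distinct elements fits in memory; the function name and the sample logins are hypothetical and not part of the original slides.

```python
def count_distinct_exact(stream):
    """Exact distinct count: remember every element seen so far."""
    seen = set()  # hash-based search structure; grows with the number of distinct elements
    for element in stream:
        seen.add(element)  # adding an element that is already present changes nothing
    return len(seen)


# Hypothetical login stream: three logins, two distinct users.
print(count_distinct_exact(["alice", "bob", "alice"]))
```

The memory used by `seen` is proportional to the number of distinct elements, which is exactly the cost the rest of this section tries to avoid.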
However...

● If the number of distinct elements is too great, or if there are too many streams that need to be
processed at once then we cannot store the needed data in main memory.

● There are several options. We could use more machines, each machine handling only one or
several of the streams.
● We could store most of the data structure in secondary memory and batch stream elements so
that whenever we brought a disk block to main memory there would be many tests and updates to be
performed on the data in that block.
Flajolet-Martin Algorithm

The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass.

If the stream contains n elements with m of them unique, this algorithm runs in O(n)
time and needs O(log(m)) memory.
It is possible to estimate the number of distinct elements by hashing the elements
of the universal set to a bit-string that is sufficiently long.

As we see more different hash-values, it becomes more likely that one of these
values will be “unusual.”

The particular unusual property we shall exploit is that the value ends in many 0’s.
Algorithm

Whenever we apply a hash function h to a stream element a, the bit string h(a) will end
in some number of 0’s, possibly none. Call this number the tail length for a and h. Let R
be the maximum tail length of any a seen so far in the stream. Then we shall use the
estimate $2^R$ for the number of distinct elements seen in the stream.
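
A minimal Python sketch of this estimator follows. It is an illustration under stated assumptions, not the slides' own code: the 32-bit hash built from SHA-256 with a seed prefix, the names `hash32`, `tail_length` and `fm_estimate`, and the choice to count an all-zero hash value as 32 trailing zeros are all assumptions made here.

```python
import hashlib


def hash32(element, seed=0):
    """Assumed hash function: first 32 bits of SHA-256 over 'seed:element'."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def tail_length(value, width=32):
    """Number of trailing 0 bits; an all-zero value counts as `width` zeros (assumption)."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit, e.g. 12 (1100) -> 4 (100)
    return (value & -value).bit_length() - 1


def fm_estimate(stream, seed=0):
    """Single-hash Flajolet-Martin estimate: 2 to the maximum tail length seen."""
    R = 0
    for a in stream:
        R = max(R, tail_length(hash32(a, seed)))
    return 2 ** R


print(fm_estimate(range(1000)))
```

Repeated elements hash to the same value, so they never change R; only the set of distinct elements matters.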

With m being the number of distinct elements, we can conclude:

1. If m is much larger than $2^r$, then the probability that we shall find a tail of length at
least r approaches 1.

2. If m is much less than $2^r$, then the probability of finding a tail of length at least r
approaches 0.
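
The two conclusions above can be justified with a short calculation. The sketch below assumes that hash values are uniformly distributed bit strings and that the hashes of distinct elements are independent; these assumptions are not spelled out on the slides.

```latex
% Probability that a single hash value ends in at least r zeros:
\Pr[\text{tail length of } h(a) \ge r] = 2^{-r}

% Probability that none of the m distinct elements has tail length at least r:
\Pr[\text{no tail of length} \ge r]
  = \left(1 - 2^{-r}\right)^{m}
  = \left(\left(1 - 2^{-r}\right)^{2^{r}}\right)^{m 2^{-r}}
  \approx e^{-m / 2^{r}}

% If m \gg 2^r, then e^{-m/2^r} \to 0, so some tail of length at least r is found with probability near 1.
% If m \ll 2^r, then e^{-m/2^r} \to 1, so a tail of length at least r is found with probability near 0.
```

The approximation uses the fact that $(1 - \epsilon)^{1/\epsilon}$ is close to $1/e$ when $\epsilon$ is small, here with $\epsilon = 2^{-r}$.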
Overview

● The Flajolet-Martin algorithm approximates the number of unique objects in a stream or
database in one pass.

● If the stream contains N elements and M of them are unique, then the algorithm
runs in O(N) time and needs O(log(M)) memory.
Example

Estimate the number of distinct elements in the stream using the FM algorithm.

Input: Stream of integers (x): 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1

Hash function: h(x) = (6x + 1) mod 5

From the input, we can see that there are 4 distinct elements.
Calculating the Hashes

Input: Stream of integers (x): 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1

Hash function: h(x) = (6x + 1) mod 5

Thus, hashing the individual elements of the input stream in order:

h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(4) = (6(4) + 1) mod 5 = 25 mod 5 = 0
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2

Binary Bit Calculation

For every hash value, write the binary equivalent (3 bits), in stream order:

h(1) = 2 = 010
h(3) = 4 = 100
h(2) = 3 = 011
h(1) = 2 = 010
h(2) = 3 = 011
h(3) = 4 = 100
h(4) = 0 = 000
h(3) = 4 = 100
h(1) = 2 = 010
h(2) = 3 = 011
h(3) = 4 = 100
h(1) = 2 = 010
Find the Number of Trailing Zeros

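The last number on each line is the count of trailing zeros (the tail length):

h(1) = 2 = 010 = 1
h(3) = 4 = 100 = 2
h(2) = 3 = 011 = 0
h(1) = 2 = 010 = 1
h(2) = 3 = 011 = 0
h(3) = 4 = 100 = 2
h(4) = 0 = 000 = 0
h(3) = 4 = 100 = 2
h(1) = 2 = 010 = 1
h(2) = 3 = 011 = 0
h(3) = 4 = 100 = 2
h(1) = 2 = 010 = 1

Note that the all-zero value 000 is assigned a tail length of 0 in this example. If it were counted as three trailing zeros, the maximum tail length would be 3 and the estimate would be $2^3 = 8$.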
Distinct Elements

From the trailing-zero counts of the binary equivalents, take the maximum number of trailing zeros.

Thus, the maximum tail length is R = 2.

The estimated number of distinct elements is $2^R = 2^2 = 4$.

From the input we can verify that the number of distinct elements is indeed 4, so the estimate is exact in this case.
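
The worked example can be reproduced with a short Python script. This is a sketch that follows the example's own convention of giving the hash value 0 a tail length of 0; the helper names `h` and `trailing_zeros` are chosen here for illustration.

```python
def h(x):
    """Hash function from the example: h(x) = (6x + 1) mod 5."""
    return (6 * x + 1) % 5


def trailing_zeros(value):
    """Trailing 0 bits of value; 0 is given tail length 0, as in the example."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
hashes = [h(x) for x in stream]
tails = [trailing_zeros(v) for v in hashes]
R = max(tails)
estimate = 2 ** R

print(hashes)       # [2, 4, 3, 2, 3, 4, 0, 4, 2, 3, 4, 2]
print(tails)        # [1, 2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 1]
print(R, estimate)  # 2 4
```

A hash function with only five possible outputs is far too small for real streams; it is used here only because it keeps the arithmetic easy to follow by hand.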
Problem

● Suppose we use many hash functions and average the resulting estimates $2^R$. The average is distorted by overestimates.
● Consider a value of r such that $2^r$ is much larger than m.
● There is some probability p that we shall discover r to be the largest number of 0’s at the end of
the hash value of any stream element.
● Then the probability of finding r + 1 to be the largest number of 0’s instead is at least p/2.
● However, if we do increase by 1 the number of 0’s at the end of a hash value, the value of $2^R$ doubles.
● Consequently, the contribution from each possible large R to the expected value of $2^R$ grows as R grows, and the expected value
of $2^R$ is actually infinite.
Solution: Combining Estimates

● Taking the median is another way to combine the estimates.
● Take the median of all estimates. The median is not affected by the occasional outsized value of $2^R$.
● The problem of the expected value of $2^R$ being infinite therefore does not carry over to the median.
Problem

● The median suffers from another problem: it is always a power of 2.
● Thus, no matter how many hash functions we use, should the correct value of m be between two
powers of 2, it will be impossible to obtain a close estimate.
Solution: Combining Estimates

● We can combine the two methods.
● First, group the hash functions into small groups, and take the average of the estimates within each group.
● Then, take the median of the averages.
● An occasional outsized $2^R$ will bias some of the groups and make their averages too large, but taking the median of the group averages reduces the influence of this effect almost to nothing.
● In order to guarantee that any possible average can be obtained, groups should be of size at least a
small multiple of $\log_2 m$.
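
The median-of-averages rule can be sketched as follows. The function name `combine_estimates` and the six sample estimates are hypothetical: the numbers are made up to show the effect of one outsized value and are not measured results.

```python
from statistics import mean, median


def combine_estimates(estimates, group_size):
    """Average the estimates within each group, then take the median of the averages."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    group_averages = [mean(group) for group in groups]
    return median(group_averages)


# Hypothetical per-hash-function estimates 2^R, with one outsized value (1024).
estimates = [4, 8, 4, 4, 2, 1024]

print(mean(estimates))                  # plain average, dominated by the outlier
print(median(estimates))                # plain median, always a power of 2 for an odd count
print(combine_estimates(estimates, 2))  # median of the group averages 6, 4 and 513
```

With groups of two, the group averages are 6, 4 and 513, so the combined estimate is 6: it is not dragged up by the outlier and it is not restricted to powers of 2. If the number of estimates is not a multiple of `group_size`, this sketch simply leaves the last group smaller.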
Space Optimization

● Observe that as we read the stream it is not necessary to store the elements
seen.
● The only thing we need to keep in main memory is one integer per hash
function; this integer records the largest tail length seen so far for that hash
function and any stream element.
● Only if we are trying to process many streams at the same time would main
memory constrain the number of hash functions we could associate with any
one stream.
● In practice, the time it takes to compute hash values for each stream element
would be the more significant limitation on the number of hash functions we
use.
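
The space argument above can be made concrete with a small streaming sketch. The class below is an illustration only: the name `FMSketch`, the seeded SHA-256 hash family, and the 32-bit hash width are assumptions made here, not something the slides prescribe.

```python
import hashlib


class FMSketch:
    """Keeps one integer (the largest tail length seen) per hash function."""

    def __init__(self, num_hashes):
        self.max_tail = [0] * num_hashes  # the only state stored; no stream elements are kept

    @staticmethod
    def _hash(element, seed):
        # Assumed hash family: first 32 bits of SHA-256 over 'seed:element'.
        digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
        return int.from_bytes(digest[:4], "big")

    def add(self, element):
        # One hash computation per hash function for every stream element.
        for seed in range(len(self.max_tail)):
            value = self._hash(element, seed)
            tail = 32 if value == 0 else (value & -value).bit_length() - 1
            if tail > self.max_tail[seed]:
                self.max_tail[seed] = tail

    def estimates(self):
        """Per-hash-function estimates 2^R, ready to be combined."""
        return [2 ** r for r in self.max_tail]


sketch = FMSketch(num_hashes=8)
for x in range(10_000):
    sketch.add(x)
print(sketch.max_tail)
print(sketch.estimates())
```

The loop inside `add` shows why hashing time, rather than memory, tends to limit the number of hash functions: each stream element costs one hash evaluation per function, while the stored state stays at a handful of small integers.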
