Unit 4 - 4.4

The document discusses counting distinct elements in a data stream using the Flajolet-Martin algorithm. It begins by introducing the count-distinct problem and potential solutions. It then describes the Flajolet-Martin algorithm, which uses hash functions to estimate the number of distinct elements in one pass using only O(log(m)) memory, where m is the number of distinct elements. It discusses combining estimates to improve accuracy and addresses problems and solutions related to the estimates and memory usage.

Uploaded by

King Bavisi


Unit 4: Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements (4.4)

4.4 Counting Distinct Elements in a Stream

From Mining of Massive Datasets

The Count-Distinct Problem
Understanding the problem
● In this section we look at a third simple kind of processing we might want to do on a
stream (besides sampling and filtering): counting the distinct elements seen so far.
Where?

● Consider a Web site gathering statistics on how many unique users it has seen in each
given month.
● The universal set is the set of logins for that site, and a stream element is generated each time
someone logs in.
● This measure is appropriate for a site like Amazon, where the typical user logs in with their unique
login name.
However...
Possible Solutions

● The obvious way to solve the problem is to keep in main memory a
list of all the elements seen so far in the stream.
● Keep them in an efficient search structure such as a hash table or
search tree, so one can quickly add new elements and check
whether or not the element that just arrived on the stream was
already seen.
● As long as the number of distinct elements is not too great, this
structure can fit in main memory and there is little problem
obtaining an exact answer to the question of how many distinct
elements appear in the stream.
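
The exact approach described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample login stream are hypothetical, and a built-in set stands in for the hash table.

```python
def count_distinct_exact(stream):
    """Return the exact number of distinct elements by remembering each one."""
    seen = set()           # hash-based set: average O(1) insert and lookup
    for element in stream:
        seen.add(element)  # adding an element already present has no effect
    return len(seen)


logins = ["ann", "bob", "ann", "cy", "bob"]  # hypothetical login stream
print(count_distinct_exact(logins))
```

Memory use grows with the number of distinct elements, which is the limitation discussed next.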
However...

● If the number of distinct elements is too great, or if there are too many streams that need to be
processed at once then we cannot store the needed data in main memory.

● There are several options. We could use more machines, each machine handling only one or
several of the streams.
● We could store most of the data structure in secondary memory and batch stream elements so that
whenever we brought a disk block to main memory there would be many tests and updates to be
performed on the data in that block.
Flajolet-Martin Algorithm

The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass.

If the stream contains n elements with m of them unique, this algorithm runs in O(n)
time and needs O(log(m)) memory.

It is possible to estimate the number of distinct elements by hashing the elements
of the universal set to a bit string that is sufficiently long.

As we see more different hash values, it becomes more likely that one of these
values will be “unusual.”

The particular unusual property we shall exploit is that the value ends in many 0’s.
Algorithm

Whenever we apply a hash function h to a stream element a, the bit string h(a) will end
in some number of 0’s, possibly none. Call this number the tail length for a and h. Let R
be the maximum tail length of any a seen so far in the stream. Then we shall use the
estimate 2^R for the number of distinct elements seen in the stream.
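
A minimal single-hash sketch of this idea in Python follows. The helper names, the use of SHA-256 as the hash, the 32-bit width, and the treatment of an all-zero hash as having num_bits trailing zeros are all assumptions made for illustration; the slides do not prescribe them.

```python
import hashlib


def hash_bits(element, num_bits=32):
    """Hash an element to an integer of num_bits bits (illustrative hash choice)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << num_bits)


def tail_length(value, num_bits=32):
    """Number of trailing 0 bits in value; an all-zero value counts as num_bits."""
    if value == 0:
        return num_bits
    # value & -value isolates the lowest set bit; its position is the tail length.
    return (value & -value).bit_length() - 1


def fm_estimate(stream, num_bits=32):
    """Single-hash Flajolet-Martin estimate: 2^R, R = maximum tail length seen."""
    max_tail = 0
    for element in stream:
        tail = tail_length(hash_bits(element, num_bits), num_bits)
        max_tail = max(max_tail, tail)
    return 2 ** max_tail
```

Repeated elements hash to the same value, so duplicates never change R; only the set of distinct elements matters.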

With m being the number of distinct elements, we can conclude:

1. If m is much larger than 2^r, then the probability that we shall find a tail of length at
least r approaches 1.

2. If m is much less than 2^r, then the probability of finding a tail of length at least r
approaches 0.
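
Both conclusions follow from a short calculation. The derivation below assumes that hash values behave like independent, uniformly random bit strings, which is an idealization rather than a property of any particular hash function.

```latex
\Pr[\text{tail length of } h(a) \ge r] = 2^{-r}
\qquad
\Pr[\text{no tail of length} \ge r \text{ among } m \text{ distinct elements}]
  = (1 - 2^{-r})^{m} \approx e^{-m/2^{r}}
```

If m is much larger than 2^r, then e^{-m/2^r} is close to 0, so a tail of length at least r is almost certainly found. If m is much less than 2^r, then e^{-m/2^r} is close to 1, so such a tail is almost certainly not found.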
Example

Estimate the number of distinct elements in the stream using the Flajolet-Martin (FM) algorithm.

Input: Stream of integers (x): 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1

Hash function: h(x) = (6x + 1) mod 5

From the input, we can see that there are 4 distinct elements.
Calculating the Hashes

Input: Stream of integers (x): 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1

Hash function: h(x) = (6x + 1) mod 5

Thus, hashing the individual elements of the input stream in order:

h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(4) = (6(4) + 1) mod 5 = 25 mod 5 = 0
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2
h(2) = (6(2) + 1) mod 5 = 13 mod 5 = 3
h(3) = (6(3) + 1) mod 5 = 19 mod 5 = 4
h(1) = (6(1) + 1) mod 5 = 7 mod 5 = 2


Binary Bit Calculation

For every hash value, write the binary equivalent (in stream order):

h(1) = 2 = 010
h(3) = 4 = 100
h(2) = 3 = 011
h(1) = 2 = 010
h(2) = 3 = 011
h(3) = 4 = 100
h(4) = 0 = 000
h(3) = 4 = 100
h(1) = 2 = 010
h(2) = 3 = 011
h(3) = 4 = 100
h(1) = 2 = 010
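Find the Number of Trailing Zeros

For every binary value, count the trailing zeros (in stream order):

h(1) = 2 = 010, trailing zeros = 1
h(3) = 4 = 100, trailing zeros = 2
h(2) = 3 = 011, trailing zeros = 0
h(1) = 2 = 010, trailing zeros = 1
h(2) = 3 = 011, trailing zeros = 0
h(3) = 4 = 100, trailing zeros = 2
h(4) = 0 = 000, trailing zeros = 0 (this example assigns tail length 0 to the all-zero value)
h(3) = 4 = 100, trailing zeros = 2
h(1) = 2 = 010, trailing zeros = 1
h(2) = 3 = 011, trailing zeros = 0
h(3) = 4 = 100, trailing zeros = 2
h(1) = 2 = 010, trailing zeros = 1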
Distinct Elements

From the trailing-zero counts, take the maximum number of trailing zeros.

Thus, the maximum tail length is R = 2.

The estimated number of distinct elements is 2^R.

Thus, 2^R = 2^2 = 4.

From the input we can verify that the number of distinct elements is 4.
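
The worked example can be reproduced with a short Python script. This is a sketch that follows the slides' convention of giving the all-zero hash value a tail length of 0; the function and variable names are chosen here for illustration.

```python
def h(x):
    """Hash function from the example: h(x) = (6x + 1) mod 5."""
    return (6 * x + 1) % 5


def trailing_zeros(value):
    """Trailing 0 bits in value; 0 is given tail length 0, as in the example."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
hashes = [h(x) for x in stream]
tails = [trailing_zeros(v) for v in hashes]
max_tail = max(tails)       # R, the maximum tail length
estimate = 2 ** max_tail    # 2^R

for x, v, t in zip(stream, hashes, tails):
    print(f"x={x}  h(x)={v}  binary={v:03b}  tail={t}")
print("max tail length:", max_tail, "estimate:", estimate)
```

With a hash range this small the exact match with the true count is a feature of the example, not a guarantee of the method.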
Problem

● Suppose we average the estimates 2^R obtained from many hash functions. The average is
overly sensitive to occasional very large values of R.
● Consider a value of r such that 2^r is much larger than m.
● There is some probability p that we shall discover r to be the largest number of 0’s at the end of
the hash value.
● Then the probability of finding r + 1 to be the largest number of 0’s instead is at least p/2.
● However, if we do increase the number of 0’s at the end of a hash value by 1, the estimate 2^R doubles.
● Consequently, the contribution from each possible large R to the expected value of 2^R grows as
R grows, and the expected value of 2^R is actually infinite.
Solution: Combining Estimates

● Another way to combine the estimates from several hash functions is to take their median
instead of their average.
● The median is not affected by the occasional outsized value of 2^R.
● So the problem of the expected value of 2^R being infinite does not affect the median.
Problem

● The median has a problem of its own: it is always a power of 2.

● Thus, no matter how many hash functions we use, should the correct value of m be between two
powers of 2, it will be impossible to obtain a close estimate.
Solution: Combining Estimates

● We can combine the two methods.

● First, group the hash functions into small groups, and take the average of the estimates in each group.
● Then, take the median of the averages.
● An occasional outsized 2^R will bias some of the groups and make them too large, but taking the
median of the group averages reduces this influence almost to nothing.
● In order to guarantee that any possible average can be obtained, groups should be of size at least a
small multiple of log2 m.
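
The median-of-averages combination might look like this in Python. This is a sketch under stated assumptions: the function name is hypothetical, estimates are grouped in consecutive runs, a trailing partial group is dropped, and the sample estimates are invented to show the effect of one outlier.

```python
from statistics import mean, median


def combine_estimates(estimates, group_size):
    """Median of group averages over consecutive groups of group_size estimates."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    # Drop a trailing partial group so every average uses the same number of values.
    groups = [g for g in groups if len(g) == group_size]
    if not groups:
        raise ValueError("need at least group_size estimates")
    return median(mean(g) for g in groups)


# Hypothetical 2^R values from nine hash functions; 1024 is an outsized outlier.
estimates = [4, 8, 4, 8, 4, 1024, 8, 4, 8]
print(mean(estimates))                  # pulled far upward by the outlier
print(median(estimates))                # always one of the powers of 2
print(combine_estimates(estimates, 3))  # median of the three group averages
```

Here the plain average is dominated by the outlier, the plain median can only be a power of 2, and the combined value can fall between powers of 2 while still ignoring the outlier's group.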
Space Optimization

● Observe that as we read the stream it is not necessary to store the elements
seen.
● The only thing we need to keep in main memory is one integer per hash
function; this integer records the largest tail length seen so far for that hash
function and any stream element.
● Only if we are trying to process many streams at the same time would main
memory constrain the number of hash functions we could associate with any
one stream.
● In practice, the time it takes to compute hash values for each stream element
would be the more significant limitation on the number of hash functions we
use.
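
The space argument above can be made concrete with a small streaming sketch. The class name, the salted SHA-256 hash family, and the 32-bit width are assumptions for illustration; the point is that the only stored state is one integer per hash function.

```python
import hashlib


class FMSketch:
    """Keeps one integer (largest tail length so far) per hash function."""

    def __init__(self, num_hashes=16, num_bits=32):
        self.num_bits = num_bits
        self.max_tails = [0] * num_hashes  # the only per-stream state

    def _hash(self, element, index):
        # Salting with the index simulates a family of hash functions (an assumption).
        data = f"{index}:{element}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.num_bits)

    def _tail_length(self, value):
        if value == 0:
            return self.num_bits
        return (value & -value).bit_length() - 1

    def add(self, element):
        # One hash evaluation per hash function: this is the per-element time cost.
        for i in range(len(self.max_tails)):
            tail = self._tail_length(self._hash(element, i))
            if tail > self.max_tails[i]:
                self.max_tails[i] = tail

    def estimates(self):
        """Return the estimate 2^R for each hash function."""
        return [2 ** r for r in self.max_tails]
```

Each call to add does work proportional to the number of hash functions, which matches the observation that hashing time, not memory, usually limits how many hash functions are practical.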
