
5.2.2. Application: Bucket Sort

• Bucket sort breaks the $\Omega(n \log n)$ lower bound for standard comparison-based sorting, under certain assumptions on the input
• We want to sort a set of $n = 2^m$ integers chosen independently and uniformly at random (i.u.a.r.) from the range $[0, 2^k)$, where $k \ge m$
• Using Bucket sort, we can sort the numbers in expected time $O(n)$
• Expectation is over the choice of the random input; Bucket sort is a deterministic algorithm


• Bucket sort works in two stages
• First we place the elements into buckets
• The $j$th bucket holds all elements whose first $m$ binary digits correspond to the number $j$
• E.g., if $n = 2^{10}$, bucket 3 contains all elements whose first 10 binary digits are 0000000011
• When $j < \ell$, the elements of the $j$th bucket all come before those in the $\ell$th bucket in the sorted order
• Assuming that each element can be placed in the appropriate bucket in $O(1)$ time, this stage requires only $O(n)$ time

• Because the elements to be sorted are chosen uniformly, the number of elements that land in a specific bucket follows a binomial distribution $B(n, 1/n)$
• Buckets can be implemented using linked lists
• In the second stage, each bucket is sorted using any standard quadratic time algorithm (e.g., Bubblesort or Insertion sort)
• Concatenating the sorted lists from each bucket in order gives the sorted order for the elements
• It remains to show that the expected time spent in the second stage is only $O(n)$ (a runnable sketch of the whole algorithm follows)
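To make the two stages concrete, here is a minimal Python sketch of the algorithm just described; the function name and the choice of insertion sort for the second stage are illustrative, as any quadratic-time sort works.

```python
import random

def bucket_sort(xs, k):
    """Sort n = 2**m integers drawn i.u.a.r. from [0, 2**k), with k >= m.

    Stage 1: place each element into one of n buckets according to its
    first m binary digits. Stage 2: sort each bucket with insertion sort,
    then concatenate the buckets in order.
    """
    n = len(xs)
    m = n.bit_length() - 1              # assumes n is a power of two
    buckets = [[] for _ in range(n)]
    for x in xs:                        # stage 1: O(n) total
        buckets[x >> (k - m)].append(x)
    out = []
    for b in buckets:                   # stage 2: expected O(n) total
        for i in range(1, len(b)):      # insertion sort on one bucket
            key, j = b[i], i - 1
            while j >= 0 and b[j] > key:
                b[j + 1] = b[j]
                j -= 1
            b[j + 1] = key
        out.extend(b)
    return out

xs = [random.randrange(2**20) for _ in range(2**10)]   # n = 2**10, k = 20
assert bucket_sort(xs, k=20) == sorted(xs)
```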

• The result relies on our assumption regarding the input distribution
• Under the uniform distribution, Bucket sort falls naturally into the balls-and-bins model:
  – the elements are balls, buckets are bins, and each ball falls uniformly at random into a bin
• Let $X_j$ be the number of elements that land in the $j$th bucket
• The time to sort the $j$th bucket is then at most $c X_j^2$ for some constant $c$

• The expected time spent sorting is at most
$$\mathbf{E}\left[\sum_{j=1}^{n} c X_j^2\right] = c \sum_{j=1}^{n} \mathbf{E}\left[X_j^2\right] = c\,n\,\mathbf{E}\left[X_1^2\right]$$
• The second equality follows from symmetry: $\mathbf{E}[X_j^2]$ is the same for all buckets
• Since $X_1 \sim B(n, 1/n)$, using earlier results yields
$$\mathbf{E}\left[X_1^2\right] = n(n-1)\frac{1}{n^2} + 1 = 2 - \frac{1}{n} < 2$$
• Hence the total expected time spent in the second stage is at most $2cn$, so Bucket sort runs in expected linear time


5.3. The Poisson Distribution

• We now consider the probability that a given bin is empty in the balls-and-bins model, as well as the expected number of empty bins
• For the first bin to be empty, it must be missed by all $m$ balls
• Since each ball hits the first bin with probability $1/n$, the probability the first bin remains empty is
$$\left(1 - \frac{1}{n}\right)^m \approx e^{-m/n}$$


• Symmetry: the probability is the same for all bins
• If $X_j$ is a RV that is 1 when the $j$th bin is empty and 0 otherwise, then $\mathbf{E}[X_j] = (1 - 1/n)^m$
• Let $X$ represent the number of empty bins
• Then, by the linearity of expectations,
$$\mathbf{E}[X] = n\left(1 - \frac{1}{n}\right)^m \approx n e^{-m/n}$$
• Thus, the expected fraction of empty bins is approximately $e^{-m/n}$
• This approximation is very good even for moderately sized values of $m$ and $n$
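The approximation is easy to check empirically; a small sketch (function name ours) that throws $m$ balls into $n$ bins at random and compares the observed fraction of empty bins with $e^{-m/n}$:

```python
import math
import random

def empty_bin_fraction(m, n, trials=100):
    """Average fraction of empty bins over repeated random experiments."""
    empty = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(m):
            loads[random.randrange(n)] += 1
        empty += loads.count(0)
    return empty / (trials * n)

# With m = n = 1000 the empirical fraction is close to e**-1 ~ 0.368
print(empty_bin_fraction(1000, 1000), math.exp(-1.0))
```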

• Generalize to find the expected fraction of bins with $r$ balls for any constant $r$
• The probability $p_r$ that a given bin has $r$ balls is
$$p_r = \binom{m}{r}\left(\frac{1}{n}\right)^r \left(1 - \frac{1}{n}\right)^{m-r} = \frac{1}{r!} \cdot \frac{m(m-1)\cdots(m-r+1)}{n^r}\left(1 - \frac{1}{n}\right)^{m-r}$$
• When $m$ and $n$ are large compared to $r$, the second factor on the RHS is approx. $(m/n)^r$, and the third factor is approx. $e^{-m/n}$


• Hence the probability $p_r$ that a given bin has $r$ balls is approximately
$$p_r \approx \frac{e^{-m/n}(m/n)^r}{r!}$$
and the expected number of bins with exactly $r$ balls is approximately $n p_r$

Definition 5.1: A discrete Poisson random variable $X$ with parameter $\mu$ is given by the following probability distribution on $j = 0, 1, 2, \ldots$:
$$\Pr(X = j) = \frac{e^{-\mu}\mu^j}{j!}$$

• The expectation of this random variable is $\mu$:
$$\mathbf{E}[X] = \sum_{j \ge 0} j \Pr(X = j) = \sum_{j \ge 1} j\,\frac{e^{-\mu}\mu^j}{j!} = \mu \sum_{j \ge 1} \frac{e^{-\mu}\mu^{j-1}}{(j-1)!} = \mu \sum_{k \ge 0} \frac{e^{-\mu}\mu^k}{k!} = \mu$$
• The last equality holds because the probabilities sum to 1

• In the context of throwing $m$ balls into $n$ bins, the distribution of the number of balls in a bin is approximately Poisson with $\mu = m/n$, which is exactly the average number of balls per bin, as one might expect

Lemma 5.2: The sum of a finite number of independent Poisson random variables is a Poisson random variable.


Lemma 5.3: The MGF of a Poisson RV $X$ with parameter $\mu$ is
$$M_X(t) = e^{\mu(e^t - 1)}.$$
Proof: For any $t$,
$$\mathbf{E}\left[e^{tX}\right] = \sum_{k \ge 0} \frac{e^{tk} e^{-\mu}\mu^k}{k!} = e^{-\mu} \sum_{k \ge 0} \frac{(\mu e^t)^k}{k!} = e^{-\mu} e^{\mu e^t} = e^{\mu(e^t - 1)}.$$


• Differentiating yields:
$$M_X'(t) = \mu e^t\, e^{\mu(e^t - 1)}, \qquad M_X''(t) = (\mu e^t)^2\, e^{\mu(e^t - 1)} + \mu e^t\, e^{\mu(e^t - 1)}$$
• Setting $t = 0$ gives
$$\mathbf{E}[X] = \mu, \qquad \mathbf{E}[X^2] = \mu(\mu + 1), \qquad \mathbf{Var}[X] = \mu(\mu + 1) - \mu^2 = \mu$$
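These derivative computations can be verified mechanically; a small sketch using sympy (assuming it is available), with symbols mirroring the ones above:

```python
import sympy as sp

t, mu = sp.symbols('t mu', positive=True)
M = sp.exp(mu * (sp.exp(t) - 1))        # MGF from Lemma 5.3

EX = sp.diff(M, t).subs(t, 0)           # M'(0)  = E[X]
EX2 = sp.diff(M, t, 2).subs(t, 0)       # M''(0) = E[X^2]
print(sp.simplify(EX))                  # mu
print(sp.expand(EX2))                   # mu**2 + mu
print(sp.simplify(EX2 - EX**2))         # Var[X] = mu
```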


• Given two independent Poisson RVs $X$ and $Y$ with means $\mu_1$ and $\mu_2$, apply Theorem 4.3 to prove
$$M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{(\mu_1 + \mu_2)(e^t - 1)}$$
• This is the MGF of a Poisson RV with mean $\mu_1 + \mu_2$
• By Theorem 4.2, the MGF uniquely defines the distribution, and hence the sum $X + Y$ is a Poisson RV with mean $\mu_1 + \mu_2$


Theorem 5.4: Let $X$ be a Poisson RV with parameter $\mu$.
1. If $x > \mu$, then
$$\Pr(X \ge x) \le \frac{e^{-\mu}(e\mu)^x}{x^x};$$
2. If $x < \mu$, then
$$\Pr(X \le x) \le \frac{e^{-\mu}(e\mu)^x}{x^x}.$$

Proof: For any $t > 0$ and $x > \mu$,
$$\Pr(X \ge x) = \Pr\left(e^{tX} \ge e^{tx}\right) \le \frac{\mathbf{E}\left[e^{tX}\right]}{e^{tx}}.$$


Plugging in the expression for the MGF of the Poisson distribution, we have
$$\Pr(X \ge x) \le e^{\mu(e^t - 1) - tx}.$$
Choosing $t = \ln(x/\mu) > 0$ gives
$$\Pr(X \ge x) \le e^{x - \mu - x \ln(x/\mu)} = \frac{e^{-\mu}(e\mu)^x}{x^x}.$$
The proof of 2 is similar.
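As a quick sanity check of part 1, the following sketch (function names ours) compares the exact Poisson tail with the bound of Theorem 5.4:

```python
import math

def poisson_tail(mu, x):
    """Exact Pr(X >= x) for X ~ Poisson(mu) and integer x >= 1."""
    term, cdf = math.exp(-mu), 0.0      # term = Pr(X = k), starting at k = 0
    for k in range(x):
        cdf += term
        term *= mu / (k + 1)
    return 1.0 - cdf

def poisson_chernoff(mu, x):
    """Theorem 5.4(1): e**(-mu) * (e*mu)**x / x**x, valid for x > mu."""
    return math.exp(-mu) * (math.e * mu) ** x / x ** x

# mu = 10, x = 20: exact tail ~ 0.0035, bound ~ 0.021
print(poisson_tail(10, 20), poisson_chernoff(10, 20))
```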


5.3.1. Limit of the Binomial Distribution

• The Poisson distribution is the limit distribution of the binomial distribution with parameters $n$ and $p$, when $n$ is large and $p$ is small

Theorem 5.5: Let $X_n \sim B(n, p)$, where $p$ is a function of $n$ and $\lim_{n \to \infty} np = \lambda$ is a constant that is independent of $n$. Then, for any fixed $k$,
$$\lim_{n \to \infty} \Pr(X_n = k) = \frac{e^{-\lambda}\lambda^k}{k!}.$$

• This theorem directly applies to the balls-and-bins scenario
• Consider the situation where there are $m$ balls and $n$ bins, where $m$ is a function of $n$ and $\lim_{n \to \infty} m/n = \lambda$
• Let $X_m$ be the number of balls in a specific bin
• Then $X_m \sim B(m, 1/n)$
• Theorem 5.5 thus applies and says that
$$\lim_{m \to \infty} \Pr(X_m = k) = \frac{e^{-\lambda}\lambda^k}{k!},$$
matching the earlier approximation (a numerical illustration follows)
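A short numerical illustration of Theorem 5.5 in the balls-and-bins setting, with $m = n$ (so $\lambda = 1$); the pmf helpers are ours:

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Pr(a fixed bin gets exactly 2 of n balls) converges to e**-1 / 2!
for n in (10, 100, 1000, 10000):
    print(n, binom_pmf(n, 1 / n, 2), poisson_pmf(1, 2))
```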

• Consider the # of spelling or grammatical mistakes in a book
• Model such mistakes s.t. each word is likely to have an error with some very small probability $p$
• The # of errors is a binomial RV with large $n$ and small $p$, and so can be treated as a Poisson RV
• As another example, consider the # of chocolate chips inside a chocolate chip cookie
• Model by splitting the volume of the cookie into a large # of small disjoint compartments, so that a chip lands in each with some small probability
• Now the # of chips in a cookie roughly follows a Poisson distribution

5.4. The Poisson Approximation

• The main difficulty in balls-and-bins problems is handling the dependencies that naturally arise
• If, e.g., bin 1 is empty, then it is less likely that bin 2 is empty, because the balls must now be distributed among $n - 1$ bins
• More concretely: if we know the number of balls in the first $n - 1$ bins, then the number of balls in the last bin is completely determined
• The loads of the bins are not independent

• The distribution of the number of balls in a given bin is approximately Poisson with mean $m/n$
• We would like to say that the joint distribution of the number of balls in all the bins is well approximated by assuming the load at each bin is an independent Poisson RV with mean $m/n$
• This would allow us to treat bin loads as independent RVs
• We show here that we can do this when we are concerned with sufficiently rare events

• Suppose that $m$ balls are thrown into $n$ bins i.u.a.r., and let $X_i^{(m)}$ be the number of balls in the $i$th bin, $1 \le i \le n$
• Let $Y_1^{(m)}, \ldots, Y_n^{(m)}$ be independent Poisson RVs with mean $m/n$
• In the first case, there are exactly $m$ balls in total
• In the second case we know only that $m$ is the expected number of balls in all of the bins
• If, using the Poisson distribution, we end up with $m$ balls, then we do indeed have that the distribution is the same as if we threw $m$ balls into $n$ bins randomly

Theorem 5.6: The distribution of $(Y_1^{(m)}, \ldots, Y_n^{(m)})$ conditioned on $\sum_i Y_i^{(m)} = k$ is the same as the distribution of $(X_1^{(k)}, \ldots, X_n^{(k)})$, regardless of the value of $m$.

Proof: When throwing $k$ balls into $n$ bins, the probability that $(X_1^{(k)}, \ldots, X_n^{(k)}) = (k_1, \ldots, k_n)$ for any $k_1, \ldots, k_n$ satisfying $\sum_i k_i = k$ is given by
$$\Pr\left(X_1^{(k)} = k_1; \ldots; X_n^{(k)} = k_n\right) = \frac{k!}{k_1!\, k_2! \cdots k_n!\, n^k}$$

Now, for any $k_1, \ldots, k_n$ with $\sum_i k_i = k$, consider the probability that
$$\left(Y_1^{(m)}, \ldots, Y_n^{(m)}\right) = (k_1, \ldots, k_n),$$
conditioned on $\left(Y_1^{(m)}, \ldots, Y_n^{(m)}\right)$ satisfying $\sum_i Y_i^{(m)} = k$:
$$\Pr\left(\left(Y_1^{(m)}, \ldots, Y_n^{(m)}\right) = (k_1, \ldots, k_n) \,\middle|\, \sum_i Y_i^{(m)} = k\right) = \frac{\Pr\left(Y_1^{(m)} = k_1 \cap \cdots \cap Y_n^{(m)} = k_n\right)}{\Pr\left(\sum_i Y_i^{(m)} = k\right)}$$


The probability that $Y_i^{(m)} = k_i$ is $e^{-m/n}(m/n)^{k_i}/k_i!$, since the $Y_i^{(m)}$ are independent Poisson RVs with mean $m/n$.
Also, by Lemma 5.2, the sum of the $Y_i^{(m)}$ is itself a Poisson RV with mean $m$. Hence we have:
$$\frac{\prod_{i=1}^{n} e^{-m/n}(m/n)^{k_i}/k_i!}{e^{-m} m^k / k!} = \frac{k!}{k_1! \cdots k_n!\, n^k},$$

proving the theorem.



• With this we can prove strong results about any function on the loads of the bins

Theorem 5.7: Let $f(x_1, \ldots, x_n)$ be a nonnegative function. Then
$$\mathbf{E}\left[f\left(X_1^{(m)}, \ldots, X_n^{(m)}\right)\right] \le e\sqrt{m}\; \mathbf{E}\left[f\left(Y_1^{(m)}, \ldots, Y_n^{(m)}\right)\right].$$

• This holds for any nonnegative function on the number of balls in the bins
• In particular, if $f$ is the indicator that is 1 if some event occurs and 0 otherwise, then the theorem gives bounds on the probability of events

• We call the scenario in which the numbers of balls in the bins are taken to be independent Poisson RVs with mean $m/n$ the Poisson case
• The scenario where $m$ balls are thrown into $n$ bins i.u.a.r. is the exact case

Corollary 5.9: Any event that takes place with probability $p$ in the Poisson case takes place with probability at most $p e \sqrt{m}$ in the exact case.

Proof: Let $f$ be the indicator function of the event. In this case, $\mathbf{E}[f]$ is just the probability that the event occurs, and the result follows immediately from Theorem 5.7.

• Any event that happens with small probability in the Poisson case also happens with small probability in the exact case
• In the analysis of algorithms we often want to show that certain events happen with small probability
  – This result says that we can utilize an analysis of the Poisson approximation to obtain a bound for the exact case
• The Poisson approximation is easier to analyze because the numbers of balls in each bin are independent random variables (see the comparison sketch below)
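For instance, consider the event "some bin is empty". In the Poisson case its probability has a closed form, while the exact case can be simulated; a sketch under our naming:

```python
import math
import random

def some_bin_empty_exact(m, n, trials=2000):
    """Empirical Pr(some bin empty) when m balls go into n bins i.u.a.r."""
    hits = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(m):
            loads[random.randrange(n)] += 1
        hits += 0 in loads
    return hits / trials

def some_bin_empty_poisson(m, n):
    """Independent Poisson(m/n) loads: 1 - (1 - e**(-m/n))**n."""
    return 1 - (1 - math.exp(-m / n)) ** n

print(some_bin_empty_exact(40, 10), some_bin_empty_poisson(40, 10))
```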

• We can actually do even a little bit better in many natural cases

Theorem 5.10: Let $f(x_1, \ldots, x_n)$ be a nonnegative function such that $\mathbf{E}\left[f\left(X_1^{(m)}, \ldots, X_n^{(m)}\right)\right]$ is either monotonically increasing or monotonically decreasing in $m$. Then
$$\mathbf{E}\left[f\left(X_1^{(m)}, \ldots, X_n^{(m)}\right)\right] \le 2\, \mathbf{E}\left[f\left(Y_1^{(m)}, \ldots, Y_n^{(m)}\right)\right]$$

The following corollary is immediate:



Corollary 5.11: Let $\mathcal{E}$ be an event whose probability is either monotonically increasing or monotonically decreasing in the number of balls. If $\mathcal{E}$ has probability $p$ in the Poisson case, then $\mathcal{E}$ has probability at most $2p$ in the exact case.

• Consider again the maximum load problem for the case $m = n$
• A union bound argument shows that the maximum load is at most $3 \ln n / \ln \ln n$ w.h.p.
• Using the Poisson approximation, we prove the following almost-matching lower bound on the maximum load

Lemma 5.12: When $n$ balls are thrown i.u.a.r. into $n$ bins, the maximum load is at least $M = \ln n / \ln \ln n$ with probability at least $1 - 1/n$ for $n$ sufficiently large.

Proof: In the Poisson case, the probability that bin 1 has load at least $M$ is at least $1/(e\,M!)$, which is the probability it has load exactly $M$: $\Pr(X = M) = \frac{e^{-1}}{M!}$ for a Poisson RV $X$ with mean 1. In the Poisson case, all bins are independent, so the probability that no bin has load at least $M$ is at most
$$\left(1 - \frac{1}{e\,M!}\right)^n \le e^{-n/(e\,M!)}$$

We need to choose $M$ so that $e^{-n/(e\,M!)} \le n^{-2}$, for then (by Thm 5.7) we will have that the probability that the maximum load is not at least $M$ in the exact case is at most $e\sqrt{n} \cdot n^{-2} < 1/n$.
This will give the lemma.
Because the maximum load is clearly monotonically increasing in the number of balls, we could also apply the slightly better Thm 5.10, but this would not affect the argument substantially.
It therefore suffices to show that $M! \le n/(2e \ln n)$, or equivalently that $\ln M! \le \ln n - \ln \ln n - \ln(2e)$.
From Lemma 5.8 (not shown), it follows that $\ln M! \le M \ln M - M + \ln M$

when $n$ (and hence $M = \ln n / \ln \ln n$) are suitably large. Hence, for $n$ suitably large,
$$\begin{aligned}
\ln M! &\le M \ln M - M + \ln M \\
&= \frac{\ln n}{\ln \ln n}\left(\ln \ln n - \ln \ln \ln n\right) - \frac{\ln n}{\ln \ln n} + \ln \ln n - \ln \ln \ln n \\
&\le \ln n - \frac{\ln n \cdot \ln \ln \ln n}{\ln \ln n} + \ln \ln n \\
&\le \ln n - 2 \ln \ln n \\
&\le \ln n - \ln \ln n - \ln(2e).
\end{aligned}$$
The last two inequalities use the fact that $\ln \ln n = o\!\left(\frac{\ln n \cdot \ln \ln \ln n}{\ln \ln n}\right)$.
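The lower bound is easy to observe in simulation; a sketch (names ours) comparing the realized maximum load with $\ln n / \ln \ln n$:

```python
import math
import random

def max_load(n):
    """Throw n balls into n bins i.u.a.r. and return the maximum load."""
    loads = [0] * n
    for _ in range(n):
        loads[random.randrange(n)] += 1
    return max(loads)

for n in (10**3, 10**4, 10**5):
    M = math.log(n) / math.log(math.log(n))     # ln n / ln ln n
    print(n, max_load(n), round(M, 2))          # max load exceeds M w.h.p.
```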

5.5. Application: Hashing

• Consider a password checker, which prevents people from using easily cracked passwords by keeping a dictionary of unacceptable ones
• The application would check if the requested password is unacceptable
• A checker could store the unacceptable passwords alphabetically and do a binary search on the dictionary to check a proposed password
• A binary search would require $\Theta(\log m)$ time for $m$ words

5.5.1. Chain Hashing

• Another possibility is to place the words into bins and search the appropriate bin for the word
• Words in a bin are represented by a linked list
• The placement of words into bins is accomplished by using a hash function
• A hash function $h$ from a universe $U$ into a range $[0, n-1]$ can be thought of as a way of placing items from the universe into $n$ bins

• Here the universe $U$ consists of possible password strings
• The collection of bins is called a hash table
• This approach to hashing is called chain hashing
• Using a hash table turns the dictionary problem into a balls-and-bins problem
• If our dictionary of unacceptable passwords consists of $m$ words and the range of the hash function is $[0, n-1]$, then we can model the distribution of words in bins with the same distribution as $m$ balls placed randomly in $n$ bins

• It is a strong assumption to presume that a hash function maps words into bins in a fashion that appears random, so that the location of each word is independent and identically distributed (i.i.d.)
• We assume that
  – for each $x$, the probability that $h(x) = j$ is $1/n$ (for $0 \le j \le n-1$) and
  – the values of $h(x)$ for each $x$ are independent of each other
• This does not mean that every evaluation of $h(x)$ yields a different random answer
• The value of $h(x)$ is fixed for all time; it is just equally likely to take on any value in the range

• Consider the search time when there are $n$ bins and $m$ words
• To search for an item, we first hash it to find the bin that it lies in and then search sequentially through the linked list for it
• If we search for a word that is not in our dictionary, the expected number of words in the bin the word hashes to is $m/n$
• If we search for a word that is in our dictionary, the expected number of other words in that word's bin is $(m-1)/n$, so the expected number of words in the bin is $1 + (m-1)/n$ (a minimal sketch follows)
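A minimal chain-hashing sketch in Python, using the built-in hash as a stand-in for the idealized fully random hash function assumed above; the class and method names are ours:

```python
class ChainHashTable:
    """n bins, each holding a chain (here a Python list) of words."""

    def __init__(self, n):
        self.bins = [[] for _ in range(n)]

    def _bin(self, word):
        # built-in hash stands in for the idealized random hash function
        return self.bins[hash(word) % len(self.bins)]

    def insert(self, word):
        chain = self._bin(word)
        if word not in chain:
            chain.append(word)

    def contains(self, word):
        # expected O(1 + m/n) comparisons under the uniform-hashing assumption
        return word in self._bin(word)

table = ChainHashTable(n=3)             # n = m bins for m dictionary words
for w in ("password", "123456", "qwerty"):
    table.insert(w)
print(table.contains("qwerty"), table.contains("hunter2"))   # True False
```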

• If we choose $n = m$ bins for our hash table, then the expected number of words we must search through in a bin is constant
• If the hashing takes constant time, then the total expected time for the search is constant
• The maximum time to search for a word, however, is proportional to the maximum number of words in a bin
• We have shown that when $n = m$ this maximum load is $\Theta(\ln n / \ln \ln n)$ with probability close to 1, and hence w.h.p. this is the maximum search time in such a hash table

• While this is still faster than the required time for standard binary search, it is much slower than the average, which can be a drawback for many applications
• Another drawback of chain hashing can be wasted space
• If we use $n$ bins for $n$ items, several of the bins will be empty, potentially leading to wasted space
• The space wasted can be traded off against the search time by making the average number of words per bin larger than 1

5.5.2. Hashing: Bit Strings

• Now save space instead of time
• Consider, again, the problem of keeping a dictionary of unsuitable passwords
• Assume that a password is restricted to be eight ASCII characters, which requires 64 bits (8 bytes) to represent
• Suppose we use a hash function to map each word into a 32-bit string
• This string is a short fingerprint for the word

• We keep the fingerprints in a sorted list
• To check if a proposed password is unacceptable, we calculate its fingerprint and look for it on the list, say by a binary search
• If the fingerprint is on the list, we declare the password unacceptable
• In this case, our password checker may not give the correct answer!
• It is possible that an acceptable password is rejected because its fingerprint matches the fingerprint of an unacceptable password (a sketch of such a checker follows)
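A sketch of such a fingerprint-based checker. The text only assumes some hash function into $b$-bit strings; the SHA-256-derived fingerprint below is purely an illustrative choice, as are the names:

```python
import bisect
import hashlib

def fingerprint(word, b=32):
    """A b-bit fingerprint of the word (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(word.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % (1 << b)

unacceptable = ("password", "123456", "qwerty")
fingerprints = sorted(fingerprint(w) for w in unacceptable)

def is_unacceptable(word):
    """May return a false positive, but never a false negative."""
    fp = fingerprint(word)
    i = bisect.bisect_left(fingerprints, fp)     # binary search
    return i < len(fingerprints) and fingerprints[i] == fp

print(is_unacceptable("qwerty"), is_unacceptable("correcthorse"))
```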

• Hence there is some chance that hashing will yield a false positive: it may falsely declare a match when there is not an actual match
• The fingerprints do not uniquely identify the associated word
• This is the only type of mistake this algorithm can make
• Allowing false positives means our algorithm is overly conservative, which is probably acceptable
• Letting easily cracked passwords through, however, would probably not be acceptable

• Place in a more general context: describe as an approximate set membership problem
• Suppose we have a set $S = \{s_1, \ldots, s_m\}$ of $m$ elements from a large universe $U$
• We want to be able to quickly answer queries of the form "Is $x$ an element of $S$?"
• We would also like the representation to take as little space as possible
• To save space, we are willing to allow occasional mistakes in the form of false positives
• Here the unallowable passwords correspond to our set $S$

• How large should the range of the hash function used to create the fingerprints be?
• I.e., how many bits $b$ should be in a fingerprint?
• Obviously, we want to choose the number of bits that gives an acceptable probability for a false positive match
• The probability that an acceptable password has a fingerprint that is different from any specific unallowable password in $S$ is $1 - 1/2^b$
• If the set $S$ has size $m$, then the probability of a false positive for an acceptable password is
$$1 - \left(1 - \frac{1}{2^b}\right)^m \ge 1 - e^{-m/2^b}$$

• If we want this probability of a false positive to be less than a constant $c$, we need
$$e^{-m/2^b} \ge 1 - c,$$
which implies that
$$b \ge \log_2 \frac{m}{\ln\left(1/(1-c)\right)}$$
• I.e., we need $b = \Omega(\log_2 m)$ bits
• If we, however, use $b = 2\log_2 m$ bits, then the probability of a false positive falls to
$$1 - \left(1 - \frac{1}{m^2}\right)^m < \frac{1}{m}$$
• If we have $m = 2^{16} = 65{,}536$ words, then using 32 bits yields a false positive probability of just less than $1/65{,}536$ (tabulated in the sketch below)
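The trade-off between fingerprint length and false positive probability is simple to tabulate; a sketch with our function name:

```python
def false_positive_prob(m, b):
    """Pr a fingerprint of an acceptable word matches one of m stored words."""
    return 1 - (1 - 2.0 ** (-b)) ** m

m = 2**16
for b in (16, 24, 32):                  # 32 = 2 * lg m
    print(b, false_positive_prob(m, b))
# b = 16 gives ~0.63; b = 32 gives just under 1/65536 ~ 1.5e-05
```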

5.6. Random Graphs

• There are many NP-hard computational problems defined on graphs: Hamiltonian cycle, independent set, vertex cover, …
• Are these problems hard for most inputs or just for a relatively small fraction of all graphs?
• Random graph models provide a probabilistic setting for studying such questions
• Most of the work on random graphs has focused on two closely related models, $G_{n,p}$ and $G_{n,N}$

5.6.1. Random Graph Models

• In $G_{n,p}$ we consider all undirected graphs on $n$ distinct vertices $v_1, \ldots, v_n$
• A graph with a given set of $m$ edges has probability
$$p^m (1-p)^{\binom{n}{2} - m}$$
• One way to generate a random graph in $G_{n,p}$ is to consider each of the $\binom{n}{2}$ possible edges in some order and then independently add each edge to the graph with probability $p$

• The expected number of edges in the graph is therefore $\binom{n}{2} p$, and each vertex has expected degree $(n-1)p$
• In the $G_{n,N}$ model, we consider all undirected graphs on $n$ vertices with exactly $N$ edges
• There are $\binom{\binom{n}{2}}{N}$ possible graphs, each selected with equal probability
• One way to generate a graph uniformly from the graphs in $G_{n,N}$ is to start with a graph with no edges

• Choose one of the $\binom{n}{2}$ possible edges uniformly at random and add it to the edges in the graph
• Now choose one of the remaining $\binom{n}{2} - 1$ possible edges i.u.a.r. and add it to the graph
• Continue similarly until there are $N$ edges
• The $G_{n,p}$ and $G_{n,N}$ models are related:
  – When $p = N / \binom{n}{2}$, the number of edges in a random graph in $G_{n,p}$ is concentrated around $N$, and conditioned on a graph from $G_{n,p}$ having $N$ edges, that graph is uniform over all the graphs from $G_{n,N}$ (a generator sketch for both models follows)
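Both generation procedures translate directly into code; a minimal sketch with function names of our choosing:

```python
import random
from itertools import combinations

def gnp(n, p):
    """G_{n,p}: include each of the C(n,2) possible edges independently."""
    return [e for e in combinations(range(n), 2) if random.random() < p]

def gnN(n, N):
    """G_{n,N}: choose exactly N distinct edges uniformly at random."""
    return random.sample(list(combinations(range(n), 2)), N)

g = gnp(100, 0.05)        # about C(100,2) * 0.05 ~ 247 edges in expectation
h = gnN(100, len(g))      # uniform over graphs with exactly that many edges
print(len(g), len(h))
```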

• There are many similarities between random graphs and the balls-and-bins models
• Throwing edges into the graph as in the $G_{n,N}$ model is like throwing balls into bins
• However, since each edge has two endpoints, each edge is like throwing two balls at once into two different bins
• The pairing adds a rich structure that does not exist in the balls-and-bins model
• Yet we can often utilize the relation between the two models to simplify analysis in random graph models
