0% found this document useful (0 votes)
4 views17 pages

Locally Differentially Private Frequent Itemset Mining

This paper introduces a new approach for frequent itemset mining under Local Differential Privacy (LDP) using a protocol called Set-Value Item Mining (SVIM). It addresses the challenges of accurately estimating frequencies from user data that consists of sets of items, proposing an adaptive method to select the best frequency oracle protocol based on the data characteristics. Experimental results demonstrate that SVIM significantly outperforms existing methods in identifying frequent items and estimating their frequencies while maintaining privacy.

Uploaded by

17043420
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views17 pages

Locally Differentially Private Frequent Itemset Mining

This paper introduces a new approach for frequent itemset mining under Local Differential Privacy (LDP) using a protocol called Set-Value Item Mining (SVIM). It addresses the challenges of accurately estimating frequencies from user data that consists of sets of items, proposing an adaptive method to select the best frequency oracle protocol based on the data characteristics. Experimental results demonstrate that SVIM significantly outperforms existing methods in identifying frequent items and estimating their frequencies while maintaining privacy.

Uploaded by

17043420
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

2018 IEEE Symposium on Security and Privacy

Locally Differentially Private


Frequent Itemset Mining
Tianhao Wang Ninghui Li Somesh Jha
Department of Computer Science Department of Computer Science Department of Computer Science
Purdue University Purdue University University of Wisconsin-Madison
West Lafayette, IN West Lafayette, IN Madison, WI
tianhaowang@purdue.edu ninghui@cs.purdue.edu jha@cs.wisc.edu

Abstract—The notion of Local Differential Privacy (LDP) also planning to build a “RAPPOR-like” system that collects
enables users to respond to sensitive questions while preserving frequent homepages.
their privacy. The basic LDP frequent oracle (FO) protocol
We assume that each user possesses an input value v ∈ D,
enables an aggregator to estimate the frequency of any value.
But when each user has a set of values, one needs an additional where D is the value domain. A party wants to learn the
padding and sampling step to find the frequent values and distribution of the input values of all users. We call this party
estimate their frequencies. In this paper, we formally define such the aggregator instead of the data curator, because it does
padding and sample based frequency oracles (PSFO). We further not see the raw data. Existing research [5], [18], [36] has
identify the privacy amplification property in PSFO. As a result,
developed multiple frequency oracle (FO) protocols, using
we propose SVIM, a protocol for finding frequent items in the
set-valued LDP setting. Experiments show that under the same which an aggregator can estimate the frequency of any chosen
privacy guarantee and computational cost, SVIM significantly value x ∈ D. In [30], Qin et al. considered the setting where
improves over existing methods. With SVIM to find frequent each user’s value is a set of items v ⊆ I, where I is the
items, we propose SVSM to effectively find frequent itemsets, item domain. Such a set-valued setting occurs frequently in
which to our knowledge has not been done before in the LDP
the situation where LDP is applied. For example, when Apple
setting.
wants to estimate the frequencies of the emoji’s typed everday
by the users, each user has a set of emoji’s that they typed [34].
I. I NTRODUCTION The LDPMiner protocol in [30] aims at finding the k most
frequent items and their frequencies.
In recent years, differential privacy [14], [16] has been This problem is challenging because the number of items
increasingly accepted as the de facto standard for data privacy each user has is different. To deal with this, a core technique
in the research community [2], [15], [17], [24]. In the standard in [30] is “padding and sampling”. That is, each user first
(or centralized) setting, a data curator collects personal data pads her set of values with dummy items to a fixed size , then
from each individual, and produces outputs based on the randomly samples one item from the padded set, and finally
dataset in a way that satisfies differential privacy. In this uses an FO protocol to report the item. When estimating the
setting, the data curator sees the raw input from all users and frequency of an item, one multiples the estimation from the FO
is trusted to handle these private data correctly. protocol by . Without padding, the probability that an item
Recently, techniques for avoiding a central trusted authority is sampled is difficult to assess, making accurate frequency
have been introduced. They use the concept of Differential estimation difficult.
Privacy in the Local setting, which we call LDP. Such In [30], the FO protocol is used in a black-box fashion.
techniques enable collection of statistics of users’ data while That is, in order to satisfy -LDP, the FO protocol is invoked
preserving privacy of participants, without relying on trust in with the same privacy parameter . We observe that, since the
a single data curator. For example, researchers from Google sampling step randomly selects an item, it has an amplification
developed RAPPOR [18], [20] and Prochlo [8], which are effect in terms of privacy. This effect has been observed and
included as part of Chrome. They enable Google to collect studied in the standard DP setting [25]. If one applies an
users’ answers to questions such as the default homepage algorithm to a dataset randomly sampled from the input with a
of their browser, the default search engine, and so on, in sampling rate of β < 1, to satisfy -DP, the algorithm can use
order to understand the unwanted or malicious hijacking of a privacy budget of  > ; more specifically, the relationship
user settings. Apple [33], [34] also uses similar methods to 
e −1
between  , , and β is e −1 = β1 .

help with predictions of spelling and other tasks. Samsung
proposed a similar system [28] which enables collection of Intuitively, one can apply the same observation here. Since
not only categorical answers but also numerical answers each item is selected with probability β = 1 , to satisfy
(e.g., time of usage, battery volume), although it is not clear -LDP,

one can invoke the FO protocol with  , such that
e −1 
e −1 =  (or, equivalently  = ln ( · (e − 1) + 1) ≥ ).

whether this has been deployed by Samsung. Firefox [1] is

© 2018, Tianhao Wang. Under license to IEEE. 127


DOI 10.1109/SP.2018.00035
Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
Surprisingly, in our study of padding-and-sampling-based and selects appropriate , and sends  to users. Third, users
frequency oracle (PSFO), we found that one cannot always use PSFO with the given  to report occurrences of items in
get this privacy amplification effect. Whether this benefit is the candidate set; the aggregator estimates the frequency of
applicable or not depends on the internal structure of the FO these items. Fourth, the aggregator selects the top k frequent
protocol. In [36], the three best performing FO protocols are items and use the size distribution in step two to further correct
Generalized Random Response, Optimized Unary Encoding, undercounts. Experimental results how that SVIM significantly
and Optimized Local Hash. The latter two offer the same outperforms LDPMiner in that it identifies more frequent items
accuracy, and Optimized Local Hash has lower communication as well as estimates the frequencies more accurately.
cost. It was found that Generalized Random Response offers In the setting where each user’s input data is a set of items,
the best accuracy when |D| < 3e + 2, and Optimized a natural problem is to find frequent itemsets. Frequent itemset
Local Hash offers the best accuracy when |D| ≥ 3e + 2. mining (FIM) is a well recognized data-mining problem. The
We found that, the privacy amplification effect exists for discovery of frequent itemsets can serve valuable economic
Generalized Random Response, but not for Optimized Lo- and research purposes, e.g., mining association rules [4],
cal Hash. Optimized Local Hash is able to provide better predicting user behavior [3], and finding correlations [9]. FIM
accuracy when |D| is large because each perturbed output while satisfying DP in the centralized setting has been studied
can be used to support multiple input values. However, the extensively, e.g., [7], [39], [26]. However, because of the
same feature makes Optimized Local Hash unable to benefit challenges of dealing with set-valued inputs in the LDP setting,
from sampling. The difference in the ability to benefit from no solution for the LDP setting has been proposed. Authors
sampling changes the criterion to decide which of Generalized of [30] consider only the identification of frequent items, and
Random Response and Optimized Local Hash to use. We thus leave FIM as an open problem. Using the PSFO technique,
propose to adaptively select the best FO protocol in PSFO, we are able to provide the first solution to FIM in the
based on |I|,  and the particular  value. Essentially, when LDP setting. We call the protocol Set-Value itemSet Mining
|I| > (42 − ) · e + 1, Generalized Random Response should (SVSM) protocol; experimental evaluations demonstrates its
be used. Replacing the FO protocol used in [30] with such an effectiveness.
adaptively chosen FO protocol greatly improves the accuracy To summarize, the main contributions of this paper are:
of the resulting frequent items. • We investigate padding-and-sample-based frequency or-
We also observe that the selection of an appropriate  acles (PSFO) and discover the interesting phenomenon
is crucial, and it can be different depending on the goal. that some FO protocols can benefit from the sampling
Essentially, each user pads her itemset to size , generating step, but others cannot. Based on this, we proposed to
two sources of errors: When  is small, one would under- adaptively select the best-performing FO protocol in each
estimate the frequency counts, since items in a set with more usage of PSFO.
than  items will be sampled with probability less than 1/. • We design and implement SVIM to find frequent values
On the other hand, since  is multiplied to a noisy estimate, together with their frequencies. Experimental results on
increasing  magnifies the noises. The LDPMiner protocol both empirical and real-world datasets demonstrate the
in [30] has two phases, the first phase selects 2k candidate significant improvement over previous techniques.
frequent items using a quite large , and the second phase • We provide the first FIM protocol under the LDP setting,
computes their frequencies using  = 2k. We observe that for and empirically demonstrate its effectiveness on real-
the purpose of identifying candidates for the frequent items, world datasets. This solves a problem left open by [30].
setting  = 1 is fine. While the resulting frequency counts Roadmap. In Section II, we present background knowledge
under-estimate the true counts, the frequencies of all items of LDP and FO. We then go over the problem definition
are under-estimated, and it is very unlikely that the true top and existing solutions in Section III. With an investigation of
k items are not among the 2k candidates. However, when the the sample-based frequency oracle in Section IV, we present
goal is to estimate frequency, one needs select a larger . But our proposed method in Section V. Experimental results are
 should not be increased to the point that there is absolutely presented in VI. Finally we discuss related work in Section VII
no under-estimation, because this increases the magnitude of and provide concluding remarks in in Section VIII.
noises. Selecting  is a trade-off between under-estimation and
noise. II. BACKGROUND
Following these insights, we propose Set-Value Item Mining We consider a setting where there are several users and
(SVIM) protocol, which handles set values under the LDP one aggregator. Each user possesses a value v from a domain
setting and provides much better accuracy than existing proto- D, and the aggregator wants to learn the distribution of values
cols within the same privacy constraints. There are four steps: among all users, in a way that protects the privacy of individual
First, users use PSFO with a small  to report; the aggregator users.
identifies frequent items as candidates, and sends this set to
users. Second, users report (using a standard FO protocol) the A. Differential Privacy in the Local Setting
number of candidate items they have; the aggregator estimates In the local setting, each user perturbs the input value v
the distribution of how many candidate items the users have using an algorithm Ψ and sends Ψ(v) to the aggregator. The

128

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
formal privacy requirement is that the algorithm Ψ(·) satisfies analyzed, optimized, and compared against each other, and it
the following property: was found that when d is large, the Optimized Local Hashing
(OLH) protocol provides the best accuracy while maintaining
Definition 1 ( Local Differential Privacy). An algorithm Ψ(·)
a low communication cost. In this paper, we use the OLH
satisfies -local differential privacy (-LDP), where  ≥ 0, if
protocol as a primitive and describe it below.
and only if for any input v1 , v2 ∈ D, we have
2) Optimized Local Hashing (OLH) [36]: The Optimized
∀T ⊆ Range(Ψ) : Pr [Ψ(v1 ) ∈ T ] ≤ e Pr [Ψ(v2 ) ∈ T ] , Local Hashing (OLH) protocol deals with a large domain
size d by first using a hash function to map an input value
where Range(Ψ) denotes the set of all possible outputs of the
into a smaller domain of size g (typically g d), and
algorithm Ψ.
then applying randomized response to the hashed value in the
Similar to the centralized setting, there is sequential com- smaller domain. In this protocol, both the hashing step and
position in the local setting. That is, if the user executes a set the randomization step result in information loss. The choice
of the parameter g is a tradeoff between losing information
 each satisfying i -LDP, then the whole process
of functions,
satisfies i -LDP. The value  is also called the privacy during the hashing step and losing information during the
budget. randomization step. In [36], it is found that the optimal
Compared to the centralized setting, the local version of DP (minimal variance) choice of g is e + 1 .
offers a stronger level of protection, because each user only In OLH, the reporting protocol is
reports the perturbed data. Each user’s privacy is still protected
ΨOLH() (v) := H, ΨGRR() (H(v)),
even if the aggregator is malicious.
where H is randomly chosen from a family of hash functions
B. Frequency Oracles
that hash each value in D to {1 . . . g}, and ΨGRR() is given
A frequency oracle (FO) protocol enables the estimation of in (1), while operating on the domain {1 . . . g}.
the frequency of any given value x ∈ D under LDP. It is Let H j , y j  be the report from the j’th user. For each value
specified by a pair algorithms: Ψ, Φ, where Ψ is used by x ∈ D, to compute its frequency, one first computes C(x) =
each user to perturb her input value, and Φ is used by the |{j | H j (x) = y j }|. That is, C(x) is the number of reports
aggregator; Φ takes as input the reports from all users, and that “supports” that the input is x. One then transforms C(x)
can be queried for the frequency of each value. to its unbiased estimation
1) Generalized Randomized Response (GRR): This FO
protocol generalizes the randomized response technique [38].
C(x) − n/g
In the special case where the value is one bit, i.e., when ΦOLH() (x) := . (4)
d = |D| = 2, ΨGRR() (v) keeps the bit unchanged with p − 1/g

probability ee+1 and flips it with probability e1+1 . In the The variance of this estimation is
general case, when d > 2, the perturbation function is defined 4e
Var[ΦOLH() (x)] = · n. (5)
as (e − 1)2
 e
  p = e +d−1 , if y = v Compared with (3), the factor d − 2 + e is replaced by 4e .
∀y∈D Pr ΨGRR() (v) = y = 1
q = e +d−1 , if y = v This suggests that for smaller d (such that d − 2 < 3e ), one
(1) is better off with GRR; but for large d, OLH is better and has
This satisfies -LDP since pq = e . To estimate the frequency a variance that does not depend on d.
of x ∈ D, one counts how many times x is reported as C(x), III. S ET VALUES UNDER LDP
and then computes
In [30], Qin et al. considered the problem where each user’s
C(x) − nq value is a set. Such set-valued settings occur frequently in
ΦGRR() (x) := (2)
p−q the situation where LDP is applied. For example, iOS users
where n is the total number of users. That is, the frequency type many emoji’s every day, and Apple wants to estimate the
estimate is a linear transformation of the noisy count C(x), in frequencies of the emoji’s [34].
order to account for the effect of randomized response. In [36], A. Problem Definition and Challenge
it is shown that this is an unbiased estimation of the true count,
Specifically, the aggregator knows a set I of items. There
and the variance for this estimation is
are n users. The j’th user has a value vj ⊆ I. We call this a
d − 2 + e
Var[ΦGRR() (x)] = ·n (3) transaction. For any item x ∈ I, its frequency is defined as the
(e − 1)2 number of transactions that include x, i.e., fx := |{vj | x ∈
The accuracy of this protocol deteriorates fast when the vj }|. Similarly, the frequency of any itemset x ⊆ I is defined
domain size d increases. This is reflected in the fact that (3) as the number of transactions that include x as a subset, i.e.,
is linear in d. fx := |{vj | x ⊆ vj }|.
More sophisticated frequency estimators have been studied With the constraint of LDP defined on each user’s value v,
before [18], [5], [36]. In [36], several such protocols are the goal in this setting is to find items and, more generally,

129

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
itemsets that are frequent in the population. An item (itemset) reported with probability L1 . Hence one needs to multiply the
is a top-k frequent item (itemset) if its frequency is among the frequency oracle’s estimation by a factor of L. Since 90%
k highest for all items (itemsets). of transactions will have length exactly L after padding, this
This problem is quite challenging even when one just estimation is reasonably accurate. From the estimates, the
tries to find frequent items. Encoding each transaction as a aggregator identifies S, the set of 2k items that have the
single value in the domain D = P(I) (i.e., D is the power highest estimated frequencies, and sends S to the users. Size
set of I), and using existing FO protocols does not work. of S is set to be twice that of the goal so that few candidates
While there exist protocols specifically designed for larges are missed in this step.
domains (such as [37], [6]), such techniques still doesn’t Phase 2: Frequency Estimation. On receiving S, each user
scale to the case where the binary encoding of the input intersects it with v, which results in a transaction of length no
domain has more than a few hundred bits. We want to more than |S| = 2k. She then pads her transaction v ∩ S to
be able deal with hundreds or thousands of items. An FO be of size 2k, selects at uniform random one item v from the
protocol can identify only values that are very frequent in the padded transaction, and sends ΨFO(/2) (v) to the aggregator.
population, because the scale of the added noises is linear to Since each user sends two things, each in a way that satisfies
square root of the population size [11]. It is quite possible (/2)-LDP, by sequential composition, the protocol satisfies
that each particular transaction appears relative infrequently, -LDP.
even though some items and itemsets appear very frequently.
The aggregator estimates frequency for each item x ∈ S:
When no value in P(I) is fre-
quent enough to be identified, us- Transaction ΦFO(/2) (x) · 2k
ing a direct encoding an aggregator a, c, e
can obtain only noises. b, d, e Since the size of all user’s transactions have size 2k after
See Table I for an example with a, b, e padding, the estimated frequencies are unbiased.
five transactions. While no transac- a, d, e
tion appears more than once, items IV. PADDING - AND -S AMPLING - BASED F REQUENCY
a, f
a and e each appears 4 times, and O RACLES
TABLE I
the itemset {a, e} appears 3 times. T RANSACTIONS E XAMPLE . The LDPMiner protocol deals with the challenge of set-
Thus the three most frequent item- valued inputs by using padding and sampling before applying
sets are {a}, {e}, {a, e}. an FO protocol to report. We call such protocols Padding-and-
B. The LDPMiner Sampling-based Frequency Oracle (PSFO) protocols. They
use a padding-and-sampling function, defined as follows.
To the best of our knowledge, LDPMiner [30] is the only
protocol for dealing with set values in the LDP setting. While Definition 2 (PS). The padding and sampling function PS is
finding frequent itemsets is a natural goal, LDPMiner finds specified by a positive integer  and takes a set v ⊆ I as input.
only frequent items (i.e., singleton itemsets) and leaves the It assumes the existence of  dummy items ⊥1 , ⊥2 , . . . , ⊥ ∈ I.
frequent itemset mining as an open problem. LDPMiner has PS (v) does the following: If |v| < , it adds  − |v| different
two phases. random dummy elements to v. It then selects an element v at
Phase 1: Candidate Set Identification. The goal of Phase 1 uniform random and outputs that element.
is to identify a candidate set for frequent items. The protocol A PSFO protocol then uses an FO protocol to transmit
requires as input a parameter L, which is the 90th percentile the element v. Note that the domain of the FO becomes
of transaction lengths . That is, about 90% of all transactions I ∪ {⊥1 , ⊥2 , . . . , ⊥ }. To estimate the frequency of an item
have length no more than L. When L is not known, it needs x, one obtains the frequency estimation of x from the FO
to be estimated. In [30], it is assumed that L is available. protocol, and then multiplies it by . More formally,
In Phase 1, each user whose transaction v has less than L
items first pads it with dummy items so that the transaction Definition 3 (PSFO). A padding-and-sample-based frequency
has size L. Then, the user selects at uniform random one item oracle (PSFO) protocol is specified by three parameters: a
v from the padded transaction (which could result in a dummy positive integer , a frequency oracle FO, and the privacy
item), and uses FO to report it with privacy budget /2. That budget . It is composed of a pair of algorithms: Ψ, Φ,
is, each user sends to the aggregator ΨFO(/2) (v). Note that defined as follows.
the FO can perturb the original value into any value including
PSFO(, FO, ) := ΨFO() (PS (·)), ΦFO() (·) · 
the dummy item.
The aggregator then computes, for each item x ∈ I, its Note that if one does not do the padding step, it is equivalent
estimated frequency as to setting  = 1. Doing so significantly under-estimates the
ΦFO(/2) (x) · L true counts. With padding to length  and then sampling,
one can multiply the estimated counts by  to correct the
The intuition behind the above estimation is that in each under estimation. However, items that appear in transactions
transaction of length L, each item x will be selected and longer than  can still be underestimated. At the same time,

130

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
multiplying the estimation by  will enlarge any error due to 
2 5 10 20 50 100
noise by a factor of . 
0.1 0.19 0.72 0.42
1.13 1.83 2.44
Using this notation, the two phases of LDPMiner can be cast 0.5 0.83 2.01 1.45
2.64 3.51 4.19
as using PSFO(L, FO, /2) in Phase 1 and PSFO(2k, FO, /2) 1.0 1.49 2.90 2.26
3.57 4.46 5.15
2.0 2.62 4.17 3.49
4.86 5.77 6.46
in Phase 2. 4.0 4.68 6.29 5.59
6.98 7.89 8.59
TABLE II
N UMERICAL VALUE OF  UNDER DIFFERENT  AND .
A. Privacy Amplification of GRR

LDPMiner uses the FO protocol in a black-box fashion.


2 10 50
That is, in order to satisfy -LDP, it invokes the FO protocol 5 20 100
with the same privacy parameter . We observe that, since the 7.0
sampling step randomly selects an item, it has an amplification 6.0
effect for privacy. This effect has been observed and studied 5.0
in the standard DP setting [25]: If one applies an algorithm to 4.0

ε’
a dataset randomly sampled from the input with a sampling 3.0
rate of β < 1, to satisfy -DP, the algorithm can use a privacy 2.0

−1
budget of  such that ee −1 = β1 . 1.0
We observe that the same privacy amplification effect exists 0.0
when using the Generalized Random Response (GRR) in 0.5 1 1.5 2 2.5 3 3.5
PSFO. ε

Fig. 1. Privacy amplification effect for different .


Theorem 1 (PSFO-GRR: Privacy Amplification).
ΨGRR( ) (PS (·)) satisfies -LDP, such that  =
ln ( · (e − 1) + 1). When t ∈ v2 , p2 = q  . Thus pp12 is maximized when p1 =
1   
Proof. Let d = |I| +  be the size of the new domain (I  = p + −1
 q and p2 = q . That is,
I ∪ {⊥1 , . . . , ⊥ }),  as the privacy budget used in GRR. As p1 p / + q  ( − 1)/

defined in (1), we have p = e +d e  1 ≤
 −1 and q = e +d −1 as p2 q
the perturbation probabilities.  1 −1
It suffices to prove that for any  ≥ 0, any v1 , v2 ⊆ I, and ≤ e +
 
any possible output t ∈ I  , pp12 ≤ e , where ( · (e − 1) + 1)  − 1
= + = e
   
p1 = Pr ΨGRR( ) (PS (v1 )) = t , and
 
p2 = Pr ΨGRR( ) (PS (v2 )) = t .
Approximately, the privacy budget will be amplified by a
factor of ln  (will be the same if  = 1). Table II and Figure 1
We first examine p1 . When t ∈ v1 , give the corresponding  value for  under different .

p1 =Pr [t is sampled] · p + Pr [t is not sampled] · q  B. No Privacy Amplification of other FO


1 max{|v1 |, } − 1  Interestingly, we found that this privacy amplification effect
= · p + ·q does not exist for OLH. The reason is that, in GRR, the output
max{|v1 |, } max{|v1 |, }
1 domain of the perturbation is the same as the input domain;
=q  + · (p − q  ) thus each reported value y can “support” a single input element
max{|v1 |, }
1 x = y in I. In OLH, however, the reported value takes the
≤q  + · (p − q  ) form of H, j and can support any element x in I such that

1  −1  H(x) = j. In case the chosen hash function H hashes all the
= p + q user’s items into the same value, no matter how we sample,
 
the hashed result after sampling will always be the same value.
When t ∈ v1 , p1 = q  . Similarly, for p2 , when t ∈ v2 , Therefore, there is no privacy amplification in the sampling.
Theorem 2 (PSFO-OLH: No Privacy Amplification).
p2 =Pr [t is sampled] · p + Pr [t is not sampled] · q  ΨOLH( ) (PS (·)) does not satisfy -LDP for any  <  when
=Pr [t is sampled] · (q  + p − q  ) the input domain I is sufficiently large.
+Pr [t is not sampled] · q  Proof. Let g be the output size of hash functions. Consider an
=q  + Pr [t is sampled] · (p − q  ) ≥ q  input domain I such that |I| ≥ g + 1. Let H be the chosen
hash function. By the pigeon hole principle, there exists a

131

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
value y such that H hashes at least  items into y. Let v1 Theorem 4 (PSFO Variance). PSFO(, FO, ) has variance
consists of  items that are hashed to y, and v2 consists of 2 times that of FO when  ≥ maxj∈[n] |vj |. That is,
items that are not hashed to y. Then
  Var[ΦPSFO(,FO,) (x)] = 2 · Var[ΦFO( ) (x)],
Pr ΨOLH() (PS (v1 )) = H, y
  where  = ln ( · (e − 1) + 1) if FO is GRR.
Pr ΨOLH() (PS (v2 )) = H, y
  Proof. We prove for GRR, and the proof for OLH can be easily
Pr ΨGRR() (H(PS (v1 ))) = y|H
=   derived.
Pr ΨGRR() (H(PS (v2 ))) = y|H
Pr [H is picked] · p p  Var[ΦPSFO(,GRR,) (x)] = Var[ΦGRR( ) (x) · ]
= = = e   
Pr [H is picked] · q  q C(x) − nq  2
=Var  
·  =   2
· Var[C(x)]
Therefore, ΨOLH( ) (PS (·)) is not -LDP for any  <  . p −q (p − q ) j

In [36], another FO protocol, Optimal Unary Encoding 2 1  −1  1  −1 
=  · nx p + q 1− p + q
(OUE), was proposed. It has similar accuracy as OLH. In OUE, (p − q  )2    

the reported value is a binary vector, each bit representing one
+(n − nx )q  (1 − q  )
possible input value. One reported value can have multiple bits
being 1, supporting multiple input values. Similar to OLH, in 2
case the reported vector supports all the user’s items, there is  · [nq  (1 − q  )] = 2 · Var[ΦGRR( ) (x)]
(p − q  )2
no privacy amplification.
Note that from each user’s point of view, the hash function
H is randomly chosen. Thus only when the user happens to D. Adaptive FO
choose a hash function H that hashes all the user’s items into
PSFO needs to use an FO protocol. In [36], it was shown
the same hash value, would there be no privacy amplification
that one should choose GRR when d < 3e + 2 (where
benefit at all. However, this can happen with only small
d = |D| is the size of the domain under consideration), and
probability. This observation suggests that (, δ)-LDP can
OLH otherwise. With sampling, GRR can benefit from privacy
be applied to obtain some amplification effect, as will be
amplification, but OLH benefit less. As a result, the criterion
discussed in Appendix A.
for choosing between GRR and OLH changes. For GRR, when
C. Utility of PSFO  is used in PSFO, the effective privacy budget GRR can use
We now analyze the accuracy of PSFO. We first show that becomes ln((e − 1) + 1). We use (3) (with domain size
PSFO is unbiased when each user’s itemset size is no more |I  | = d + ) and get:
than . Var[ΦGRR(ln((e −1)+1) (x) · ]
Theorem 3 (PSFO Expectation). PSFO(, FO, ) is unbiased d + l − 2 +  · (e − 1) + 1
=n · 2 ·
when  ≥ maxj∈[n] |vj |. That is, ( · (e − 1) + 1 − 1)2
  d + l − 1 +  · (e − 1)
E ΦPSFO(,FO,) (x) = nx , =n ·
(e − 1)2
where nx is the number of users who have item x. e ·+d−1

=n · (6)
Proof. We prove for GRR, using the aggregate function de- (e − 1)2
fined in (2). The proof for OLH (with aggregate function in (4)) For OLH, by (5) we have variance independent on d:
can also be derived similarly.
42 · e
  Var[ΦOLH() (x) · ] = n · (7)
E ΦPSFO(,GRR,) (x) (e − 1)2
 
  C(x) − nq  Comparing (6) and (7), when
=E ΦGRR( ) (x) ·  = E · 
p − q 
1  d < (4 − 1)e + 1, (8)
nx  p + nx  q + (n − nx )q  − nq 
−1 
= ·
p − q  using GRR itemset will lead to better accuracy. Note that by
1 
nx  (p − q ) + nx q  + (n − nx )q  − nq 
 taking  = 1, (8) is slightly different from the inequality of
= · d < 3e + 2 from [36]. This is because here we consider
p − q 
a more general setting where some user may have no item
=nx at all, while the setting of [36] is that each user has exactly
one item. We propose Adap, which becomes GRR or OLH
adaptively (with new budget) based on (8). That is,
The estimation is inherently noisy. We now calculate the 
GRR(ln((e − 1) + 1) if d < e (4 − 1) + 1,
variance of the estimation. Adap() :=
OLH() otherwise.

132

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
E. Choosing 
To use PSFO, one needs to decide what value of  to use. S
When  is small, there is less variance but more bias (in the
form of under estimation); when  is large, there is more
variance and less bias. To find the suitable , the high level S
idea is to find the right tradeoff between bias and variance. L
When identifying candidate items, the goal is find the most
S, L
frequent items (but not accurate frequencies for them), we
propose to use a small . The intuition is that, while the
bias is large when  is small, the bias tends to be the same
direction (namely under estimation) for all items. While the
IS
absolute values of the counts are very inaccurate, the relative IS
order remain mostly unchanged. Note that it is possible the L
order is reversed after sampling (if one item appears more
often in smaller transactions, and another item appears more IS, L
often in larger transactions). To reduce this risk, we identify
2k candidate items (the optimal size of the candidate set is
dependent on the data distribution; we tried different values
and 2k appears to be a reasonable choice).
When estimating the actual frequency, one should use a
Fig. 2. Illustration of SVIM and SVSM. The users to the left are partitioned
larger  to reduce bias. We propose to use the 90th percentile into five groups. The aggregator to the right first runs SVIM with the first
L of the length of the input itemsets. While under estimation three groups, and find the frequent items. Then the aggregator interacts with
can still occur, the degree is limited. Furthermore, when given the following two groups to find frequent itemsets.
the distribution of the lengths of input itemsets, we propose
to correct this under estimation by multiplying the estimation
The advantages of setting  = 1 are first, every user will report
by the factor:
an item, making the signal strong; second, there is no extra
N cost of obtaining the exact L value.
u(L) = d . (9)
N− n ( − L) The aggregator then estimates the frequency of the domain
=L+1
by
Here N denotes the total number of items,  ΦPSFO(1,Adap,) (x),
d n denotes the
number of users with itemset size , and =L+1 N  ( − L)
and obtains the set S of the 2k most frequent items. S is then
gives the total number of missed items.
sent to users who participate in Step 2. Note that this phase
is unnecessary when the original domain size close to or less
V. P ROPOSED M ETHOD
than 2k.
In this section, we propose solution for the frequent item Step 2: Size Estimation. Having narrowed down the domain
and itemset mining. We first present Set-Value Item Mining from I to S, the aggregator now estimates frequencies of items
(SVIM) protocol to find frequent items in the set-value setting. in S. As suggested by the analysis of PSFO (Section IV-E),
Based on the result from SVIM, we build Set-Value itemSet the aggregator first finds the 90-th percentile L (in this step)
Mining (SVSM) protocol to find frequent itemsets. The high and then uses it as the limit to estimate frequencies of S (next
level protocol structure is given in Figure 2. step).
To find L, each user in this task reports the size of the
A. Frequent Item Mining private set intersected with the candidate S, i.e.,
At a high level, SVIM works as follows: A set of candidate ΨOLH() (|v ∩ S|).
items are identified first. Then these items are estimated and
updated. The users are partitioned into three groups, each There is no sampling involved in this step, because each user
participating in a task. Given that each task requires privacy has only one value. Here OLH is as FO by default.
budget of , each user is protected by -LDP. The aggregator in this step estimates the length distribution
Step 1: Prune the Domain. When the domain is big (e.g., by calculating
tens of thousands), the aggregator has to first narrow down the ΦOLH() ()
focus to a small candidate set. Specifically, in Step 1, each for all  ∈ [1, 2, . . . , 2k], and finds the 90 percentile L.
user reports with a randomly selected value from her private
L is, the aggregator then finds the smallest L such that
That
set with length limit set to 1: =1 ΦOLH() ()
2k > 0.9. Information of S and L are then sent
=1 ΦOLH() ()
ΨPSFO(1,Adap,) (v). to the users for the next task.

133

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
Note that some of the estimates may be overwhelmed B. Frequent Itemset Mining
by noise, making it useless. For this reason,
√ we use the The problem of mining frequent itemsets is similar to
significance threshold T = F −1 1 − 0.05 2k Var to filter the mining frequent items. The desired result becomes a set of
estimates, where F −1 is the inverse of standard normal CDF, itemsets instead of items. These frequent itemsets can be
and Var is specified by (5). Specifically, the aggregator keeps used, for example, by websites, to mine assocition rules and
estimates that are greater than T , and replaces all the others make recommendations. However, the task is much more
with zeros. challenging, because there are exponentially more itemsets to
Step 3: Candidates Estimation. On receiving S and L, consider, and each user also has many more potential itemsets.
each of the rest of the users reports a value sampled from In this section, we introduce SVSM for finding frequent
the intersection of his private set v and the candidate set S, itemsets effectively. In the high level, the aggregator first
padded to L, i.e., obtains the frequent items by executing SVIM. The aggregator
ΨPSFO(L,Adap,) (v ∩ S). then constructs a candidate set of itemsets IS. Finally the set
IS is estimated in a fashion similar to the latter part of SVIM.
The aggregator can estimates the candidates by running Constructing Candidate Set. The challenging part of fre-
quent itemset mining is to construct IS. There are exponen-
ΦPSFO(L,Adap,) (x), tially many possible itemsets that can be frequent. If one can
reduce it to a manageable range (thousands), one can cast the
for all x ∈ S. Since the 90-th percentile L is used as limit, problem to the item mining problem and run SVIM. Moreover,
the estimates are slightly under-estimate the truth. Therefore, if size of IS is close to k, only the estimation of IS (latter
the estimates are updated in the next step. part of SVIM) suffices.
Step 4: Estimation Update. The update assumes that the Let S  be the k most frequent items returned by SVIM.
missed count follow similar distribution as the reported ones. To effectively further reduce the candidate size, we use infor-
Given that L is the 90 percentile, the difference will not be mation of the estimates of S  . Specifically, for an itemset x,
significant. Thus the estimate for each item x is multiplied we first guess its frequency, denoted by f˜x , as the product of
with a fixed update factor (the noisy version of (9)) the estimates for all its items, i.e., f˜x = x∈x Φ (x), where
2k Φ (x) = max0.9·Φ(x) is the normalized estimate. The 0.9
 =1 ΦOLH() () x∈S  Φ(x)
u (L) := 2k 2k factor of Φ (x) serves to lower the normalized estimates for the
=1 ΦOLH() () − =L+1 ΦOLH() () ( − L) most frequent item, because otherwise, the guessed frequency
(10) of any set without the most frequent item equals that of the set
Note that there is no privacy concern in this step because no plus the most frequent item, which is unlikely to be true. Then
user is involved. The information is obtained from Step 2 and 2k itemsets with highest guessing frequencies are selected to
3. construct IS. The intuition is that, it is very unlikely that
Difference from LDPMiner. The major differences between a frequent itemset is composed of several infrequent items
SVIM and LDPMiner are many. (1) In Phase 1 of SVIM, the (while it is theoretically possible). The guessing frequency is
limit is set to one, instead of the 90-th percentile of lengths of thus an effective measurement of the likelihood each itemset
full transactions. (2) In Phase 2 of SVIM, the limit is reduced is among the frequent ones.
from |S| to the 90-th percentile L of the length of transactions Formally, in SVSM, the domain IS is constructed as
when limited to items in S. (3) Phase 1 of LDPMiner uses 
IS := {x : x ⊆ S  , 1 < |x| < log2 k, Φ (x) > t},
the 90-th percentile; it was assumed that this is provided as
x∈x
input. In SVIM, the 90-th percentile of length is obtained in a
way that satisfies LDP. (4) SVIM uses Adap instead of black where t is choosen so that |IS| = 2k.
box FO. (5) SVIM has an update step at the end, which uses Mining Frequent Itemset. After the domain IS is defined,
the length distribution information to further reduces the bias. the following protocol works similar to SVIM for frequent item
(6) In SVIM, users are partitioned into groups, each answering mining. Note that step 1 is not necessary since IS is already
one separate question, instead of answering multiple questions small. For each user with value v, a set of values from the
each with part of . It is proved in [36] that this will make domain IS is obtained first:
the overall result more accurate. (7) SVIM uses OLH, a more
accurate FO introduced in [36]. Since improvements (6) and vs = {x : x ∈ IS, x ⊆ v}
(7) are not introduced in this paper, in the experiments, for
such that each itemset x ∈ vs is a value in IS.
a fair comparison, we evaluate on an improved version of
Then a group of users report the size of their vs’s with FO:
LDPMiner. Specifically, OLH is used as the FO, and users
are partitioned into groups. That is, the evaluation shows only ΨOLH() (|vs|).
differences due to (1), (2), (4), (5). Difference (3) means that
SVIM is end-to-end private, and LDPMiner needs a data- After the aggregator evaluates the number of users that has
dependent input.  itemsets for each  ∈ [1, 2, . . . , 2k], the aggregator finds the

134

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
90 percentile L and send it to the users in the final group, who We instantiate the quality function using x’s rank as follows:
then reports vs by the highest ranked value has a score of k (i.e., q(x1 ) = k),
the next one has score k − 1, and so on; the k-th value has a
ΨPSFO(L,Adap,) (vs).
score of 1, and all other values have scores of 0. To normalize
The aggregator obtains the estimates by evaluating this into a value between 0 and 1, we divide the sum of scores
by the maximum possible score, i.e., k(k+1) 2 . This gives rise
ΦPSFO(L,Adap,) (x) · u (L) to what we call the Normalized Cumulative Rank (NCR); this
for any itemset x ∈ IS, where u (L) is the update factor used metric uses the true rank information of the top-k values.
for correcting bias (same format as (10)), and get results for 2. Squared Error (Var): We measure the estimation accuracy
the heavy itemsets and their estimates. by squared errors. That is,
1  2
VI. E VALUATION Var = (fx · n − Φ(x)) ,
|xt ∩ xr | x∈x ∩x
Now we discuss experiments that evaluate different pro- t r

tocols. Basically, we want to answer the following questions: Note that we only account heavy hitters that are successfully
First, how many frequent items and itemsets can be effectively identified by the protocol, i.e., x ∈ xt ∩ xr .
identified. Second, how much do our proposed protocols
improve over existing ones. B. Evaluation of Item Mining
As a highlight, in the POS dataset, our protocols can cor-
rectly identify around 45 frequent items (while existing ones For the item mining problem, our main focus is to compare
can identify around 12), with much more accurate estimates the performance of our proposed method SVIM, and the
(error is 3 orders of magnitudes less). existing method, LDPMiner. We implemented them as follows:
LDPMiner is almost implemented as described in [30].
A. Experimental Setup For a fair comparison, we made two modifications. First, we
Environment. All algorithms are implemented in Python 2.7 partition the users into two groups. The first group focus on
and all the experiments are conducted on an Intel Core i7-4790 finding S, while the second focus on estimating S. Users
3.60GHz PC with 16GB memory. Each experiment is run 10 in each group use the full privacy budget  to report. It is
times, with mean and standard deviation reported. proven [36] that by this way, the overall utility is better,
Datasets. We run experiments on the following datasets: compared to keeping asking all the users multiple questions,
• POS: A dataset containing merchant transactions of half
with splited privacy budget. Second, to get the 90th percentile
a million users and 1657 categories. L, an additional group of users are assigned to report the size
• Kosarak: A dataset of click streams on a Hungarian
of their private set. As a result, there will be three groups,
website that contains around one million users and 42 10% of users report size in advance, 40% report in the first
thousand categories. phase, and 50% report in the second phase.
• Online: Similar to POS dataset, this is a dataset that
For SVIM, we do the similar thing. Half of the users report
contains merchant transactions of half a million users and based on the original itemsets to find the candidate set S,
2603 categories. and the other half report after seeing the candidate set to
• Synthesize: The dataset is generated by the IBM Syn-
estimate S. The difference is, the 90th percentile L is used
thetic Data Generation Code for Associations and Se- when estimating S. Therefore, 10% of all users are allocated
quential Patterns 1.8 million transactions was generated, to estimate L from the second half. That is, 50% report in
with 1000 categories. The average transaction size is 5. the first phase, 10% of users report size of the their itemsets
intersected with S, and 40% report one actual item.
For brevity, we only plot results for the one dataset (POS). The
To demonstrate the precise effect of each design detail, we
detailed results for other datasets are deferred to the appendix.
also line up several intermediate protocols between LDPMiner
Metrics. To measure utility, we use the following metrics.
and SVIM. We present them with synonyms (that specify the
Define xi as the i-th most frequent value (xi is an item in the
FO and  used in both tasks) to highlight the difference as
task of item mining and an itemset in itemset mining). Let
follows:
the ground truth for top k values as xt = {x1 , x2 , . . . , xk }.
Denote the k values identified by the protocol using xr . Then • (BLH, L), (SUE, 2k): LDPMiner. LDPMiner uses two

xt ∩ xr is the set of real top-k values that are identified by the FO’s BLH [5] and SUE [18]. It is proven in [36] that
protocol. the two performs not as good as OLH.
1. Normalized Cumulative Rank (NCR). For each value x, • (OLH, L), (OLH, 2k): The frequency oracles are replaced

we assign a quality function q(·) to each value, and use the with OLH.
Normalized Cumulative Gain (NCG) metric [22]: • (OLH, 1), (OLH, 2k): The first phase uses  = 1. Note
that L is no longer needed, so there are two groups each
 consists of half of the users.
q(x)
NCG = x∈xr . • (OLH, 1), (Adap, 2k): The second phase uses adaptive
x∈xt q(x) frequency oracle.

135

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
• (OLH, 1), (Adap, L): The second phase uses L. An extra C. Evaluation of Itemset Mining
group of 10% of users are assigned to estimate that. We evaluate the effectiveness of SVSM. We want to answer
• (OLH, 1), (Adap, L)(c): The final results are updated the questions how many itemsets can be identified, and the
based on the length distribution. This is the SVIM. effectiveness of using SVIM in SVSM.
Note that the allocation of 10% of users for length distribution We implement SVSM as follows, half of the users are
is not fully justified. This is because the optimal allocation allocated to find frequent items first. Then the set IS is
depends heavily on the dataset, and 10% seems a reasonable constructed and estimated, by taking each of the element of
choice. it as an independent item. To compare the effect of SVIM
over LDPMiner, we also instantiate SVSM using LDPMiner.
Detailed Results. In Figure 3, we evaluate the above six Specifically, half of the users are allocated to find frequent
protocols on POS dataset, and plot the NCR and Var scores. items using LDPMiner; then IS is constructed similarly;
Overall, the identification accuracy (indicated by NCR) in- finally, Phase 2 of LDPMiner is executed to estimate frequency
creases with , and decreases with k. Similarly, the estimation of IS and output the most frequent k itemsets. Note that
accuracy becomes better (as the indicator Var decreases) when the 50% − 50% allocation is used since mining singletons
 is larger, and worse (Var increases) if k is larger. Now we and itemsets are two goals. One can allocate more users to
analyze performance of each competitor in more detail. singletons if singleton mining is more important.
1. (BLH, L), (SUE, 2k) → (OLH, L), (OLH, 2k): First of Detailed Results. Figure 4 shows the results of mining
all, we observe the identification accuracy improves when the frequent itemsets. As we can see from the upper two sub-
FO in the first phase is changed from BLH to OLH. This figures, when fixing k = 64, the proposed SVSM protocol
happens because, by using OLH, a more accurate S will be (instantiated with SVIM, as default) can achieve the NCR
returned, and by using OLH in the second phase, one can better score of 0.7 at  = 1 and 0.9 when  = 2. As to when
identify the top k items. Note that the estimation accuracy LDPMiner is used to instantiate SVSM, the utility drops to
actually does not improve significantly, because better FO does around 0.2. When  is fixed at 2, the improvement of SVIM
a better job at reducing the noise for the lower ranked values over LDPMiner is also significant, especially when k is greater
(thus NCR is higher). The estimation improvement is nearly than 64 (SVSM-SVIM keeps NCR greater than 0.8, while
unnoticeable in the log based figures. NCR for SVSM-LDPMiner drops to below 0.2). This suggests
2. (OLH, L), (OLH, 2k) → (OLH, 1), (OLH, 2k): One ma- that SVSM with LDPMiner can effectively find only around
jor NCR improvement happens when the length limit is 10 most frequent itemsets, while SVSM with SVIM can find
changed from the 90th percentile L to 1. To this point, the around 70, demonstrating a 7× improvement.
top 2k items returned by the first phase contains most of the For the estimation accuracy shown by the bottom two sub-
top k items. The NCR bottle neck lies on the second phase, figures, we can see that the estimation error drops with , and
which cannot effectively identify the top k from the 2k items. increases with k. When using LDPMiner in SVSM the error is
Note that the estimation accuracy does not improve because two magnitudes greater than using SVIM. This effect is more
the same FO is used in the second phase. significant when k is greater than 64. This is because Var for
3. (OLH, 1), (OLH, 2k) → (OLH, 1), (Adap, 2k): The most LDPMiner is heavily dependent on k, while SVIM not.
significant improvement happens when changing from OLH to VII. R ELATED W ORK
Adap in the second phase. Both identification and estimation
Differential privacy has been the de facto notion protect-
accuracy significantly (NCR almost doubled, and Var reduced
ing privacy. In the centralized settings, many DP algorithms
by two magnitudes). This is because Adap significantly re-
have been proposed (see [17], [35] for theoretical treatments
duces the variance (from a factor of (2k)2 to 2k).
and [24] in a more practical perspective). Recently, Uber has
4. (OLH, 1), (Adap, L) and (OLH, 1), (Adap, L)(c): By re- deployed a system enforcing DP during SQL queries [23],
ducing 2k to the 90th percentile L in the second phase, the Google also proposed several works that combine DP with
results are further improved. Note that the improvement is machine learning, e.g., [29]. In the local setting, we have also
not that significant but still meaningful. This is partly because seen real world deployment: Google deployed RAP [18] as
an additional 10% of users are assigned to estimate the size an extension within Chrome, and Apple [34], [33] also uses
distribution (to find L and update the results). similar methods to help with predictions of spelling and other
Remark.
√ Because of the noisy nature (noise is in the order things.
of n) of the local setting of DP, in order to get meaningful Of all the problems, one basic mechanism in LDP is to
information, one has to increase  or n (or both). When the estimate frequencies of values. Wang et al. compare different
number of users is not sufficiently large, as in our experiment, mechanisms using estimation variance [36]. They conclude
the improvement is not significant in the small  range, as that when the domain size is small, the Generalized Ran-
being used by experiments of centralized DP (e.g., 0.1). dom Response provides best utility, and Optimal Local Hash
However, in the case of deployed LDP protocol (Google uses (OLH)/Optimal Unary Encoding (OUE) [36] perform better
 > 4 [18], and Apple uses  = 1 or 2 [32]), the advantage of when the domain is large. There also exist other mechanisms
the proposed protocol is profound. with higher variance: Binary Local Hash (BLH) by Bassily

136

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
(BLH,L),(SUE,2k) (OLH,1),(Adap,2k) (BLH,L),(SUE,2k) (OLH,1),(Adap,2k)
(OLH,L),(OLH,2k) (OLH,1),(Adap,L) (OLH,L),(OLH,2k) (OLH,1),(Adap,L)
(OLH,1),(OLH,2k) (OLH,1),(Adap,L)(c) (OLH,1),(OLH,2k) (OLH,1),(Adap,L)(c)

1.0 1.0

0.8 0.8

0.6 0.6
ncr

ncr
0.4 0.4

0.2 0.2

0.0 0.0

0.5 1 1.5 2 2.5 3 3.5 20 40 60 80 100


ε k
(a) POS, NCR, vary , k = 64 (b) POS, NCR vary k,  = 2

(BLH,L),(SUE,2k) (OLH,1),(Adap,2k) (BLH,L),(SUE,2k) (OLH,1),(Adap,2k)


(OLH,L),(OLH,2k) (OLH,1),(Adap,L) (OLH,L),(OLH,2k) (OLH,1),(Adap,L)
(OLH,1),(OLH,2k) (OLH,1),(Adap,L)(c) (OLH,1),(OLH,2k) (OLH,1),(Adap,L)(c)
1012 1012
11 11
10 10
10 10
10 10
9
10 109
var

var
108 108
7 7
10 10
6 6
10 10
105 105
0.5 1 1.5 2 2.5 3 3.5 20 40 60 80 100
ε k
(c) POS, Var, vary , k = 64 (d) POS, Var vary k,  = 2

Fig. 3. Singleton identification.

and Smith [5] can be viewed as OLH with hash range always private values are continuous.
equal to 2 and RAPPOR (SUE) by Erlingsson et al. [18],
whose simple version can be viewed as OUE with suboptimal VIII. C ONCLUSIONS
parameters. These protocols use ideas from earlier work [27],
[13]. In this paper, we investigate the LDP protocols in a setting
where each user has a set of items. We introduce PSFO,
The frequent itemset mining problem is to identify the
that enables users to sample one item from the set to report.
frequent set of items that appear simultaneously where each
The utility of PSFO is thoroughly analyzed and optimized,
user has a set of items. There exist protocols to handle this
resulting two key observations: First, we identify the additional
in the classic DP setting [39], [26]. In the local setting,
privacy gain provided by the sampling step, which we call
Evfimievski et al. [19] considered an easier setting where each
privacy amplification effect; Second, we observe that the
user has a fixed amount of items. The protocol cannot be
padding size  should be small when domain size is large,
applied for the general itemset problem. Qin et al. proposed
and  should be large when domain is small. Based on the
LDPMiner [30] that finds only the frequent singletons. In this
analysis, we propose SVIM that significantly outperforms the
paper, we propose and optimize PSFO, and thus be able to
existing protocol LDPMiner. Then we propose SVSM to find
identify both singletons and itemsets effectively.
frequent itemset, which is an open problem in [30], for the first
Besides the frequent itemset mining problem, there are other time. We demonstrate the effectiveness of SVIM and SVSM
problems in the LDP setting that rely on mechanisms for using empirical experiment on both synthetic and real-world
frequency estimation. The problem of finding heavy hitters in datasets.
a very large domain was exhaustively investigated [21], [27],
[20], [5], [33], [6], [37], [10].
ACKNOWLEDGMENT
Nguyên et al. [28] studied the problem of empirical risk
minimization. Smith et al. [31] also propose a protocol for the We would like to thank the anonymous reviewers for their
same problem but without interaction. Ding [12] and Nguyên helpful comments. The work is supported in part by a Purdue
et al. [28] studied mean estimation in the setting where the Research Foundation award, and the NSF grant No. 1640374.

137

Authorized licensed use limited to: University of Malaya. Downloaded on March 07,2024 at 11:24:34 UTC from IEEE Xplore. Restrictions apply.
[Figure omitted: Fig. 4. POS itemset mining results. Panels: (a) NCR, varying ε, k = 64; (b) NCR, varying k, ε = 2; (c) Var, varying ε, k = 64; (d) Var, varying k, ε = 2. Legend: SVSM(LDPMiner); SVSM(SVIM).]

REFERENCES

[1] Mozilla governance: Usage of differential privacy & RAPPOR. https://groups.google.com/forum/#!topic/mozilla.governance/81gMQeMEL0w.
[2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
[3] E. Adar, D. S. Weld, B. N. Bershad, and S. S. Gribble. Why we search: visualizing and predicting user behavior. In Proceedings of the 16th International Conference on World Wide Web, pages 161–170. ACM, 2007.
[4] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases, VLDB, volume 1215, pages 487–499, 1994.
[5] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127–135. ACM, 2015.
[6] R. Bassily, U. Stemmer, A. G. Thakurta, et al. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems, pages 2285–2293, 2017.
[7] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive data. In KDD, pages 503–512, 2010.
[8] A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 441–459. ACM, 2017.
[9] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In ACM SIGMOD Record, volume 26, pages 265–276. ACM, 1997.
[10] M. Bun, J. Nelson, and U. Stemmer. Heavy hitters and the structure of local privacy. arXiv preprint arXiv:1711.04740, 2017.
[11] T.-H. H. Chan, M. Li, E. Shi, and W. Xu. Differentially private continual monitoring of heavy hitters from distributed streams. In Privacy Enhancing Technologies, volume 7384, pages 140–159. Springer, 2012.
[12] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pages 3574–3583, 2017.
[13] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429–438, 2013.
[14] C. Dwork. Differential privacy. In ICALP, pages 1–12, 2006.
[15] C. Dwork. A firm foundation for private data analysis. Commun. ACM, 54(1):86–95, 2011.
[16] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[17] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[18] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
[19] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 211–222. ACM, 2003.
[20] G. Fanti, V. Pihur, and Ú. Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies (PoPETs), issue 3, 2016.
[21] J. Hsu, S. Khanna, and A. Roth. Distributed private heavy hitters. In International Colloquium on Automata, Languages, and Programming, pages 461–472. Springer, 2012.
[22] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

[23] N. Johnson, J. P. Near, and D. Song. Practical differential privacy for SQL queries using elastic sensitivity. arXiv preprint arXiv:1706.09479, 2017.
[24] N. Li, M. Lyu, D. Su, and W. Yang. Differential Privacy: From Theory to Practice. Synthesis Lectures on Information Security, Privacy, and Trust. Morgan & Claypool, 2016.
[25] N. Li, W. Qardaji, and D. Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In ASIACCS, 2012.
[26] N. Li, W. H. Qardaji, D. Su, and J. Cao. PrivBasis: Frequent itemset mining with differential privacy. VLDB, 5(11):1340–1351, 2012.
[27] N. Mishra and M. Sandler. Privacy via pseudorandom sketches. In Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 143–152. ACM, 2006.
[28] T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin. Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053, 2016.
[29] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson. Scalable private learning with PATE. 2018.
[30] Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy hitter estimation over set-valued data with local differential privacy. In CCS, 2016.
[31] A. Smith, A. Thakurta, and J. Upadhyay. Is interaction necessary for distributed private learning? In IEEE Symposium on Security and Privacy, 2017.
[32] J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang. Privacy loss in Apple's implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017.
[33] A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudiger, V. R. Sridhar, and D. Davidson. Learning new words, Mar. 14 2017. US Patent 9,594,741.
[34] A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudinger, V. V. Prakash, A. Legendre, and S. Duplinsky. Emoji frequency detection and deep link frequency, July 11 2017. US Patent 9,705,908.
[35] S. Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pages 347–450. Springer, 2017.
[36] T. Wang, J. Blocki, N. Li, and S. Jha. Locally differentially private protocols for frequency estimation. In 26th USENIX Security Symposium. USENIX Association, 2017.
[37] T. Wang, N. Li, and S. Jha. Locally differentially private heavy hitter identification. arXiv preprint arXiv:1708.06674, 2017.
[38] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[39] C. Zeng, J. F. Naughton, and J.-Y. Cai. On differentially private frequent itemset mining. Proceedings of the VLDB Endowment, 6(1):25–36, 2012.
APPENDIX

A. (ε, δ)-LDP and Limited Amplification Effect

In (ε, δ)-LDP, the value δ (which is typically very small) has an intuitive interpretation as a "failure" probability: with probability 1 − δ, Ψ is ε-LDP. When δ = 0, (ε, 0)-LDP becomes ε-LDP.

Definition 4 ((ε, δ)-Local Differential Privacy). An algorithm Ψ satisfies (ε, δ)-local differential privacy ((ε, δ)-LDP), where ε ≥ 0 and 0 ≤ δ < 1, if and only if for any input v1, v2 ⊆ I, we have

$$\forall T \subseteq \mathrm{Range}(\Psi):\ \Pr[\Psi(v_1) \in T] \le e^{\epsilon} \Pr[\Psi(v_2) \in T] + \delta,$$

where Range(Ψ) denotes the set of all possible outputs of the algorithm Ψ.

To apply the privacy amplification, one uses δ to measure the probability that a failure (multiple values being hashed to the same value) happens, and derives the corresponding ε′ that OLH can use. For example, when ℓ = 2, the probability that both of the user's items are hashed to the same value by the chosen hash function is δ = 1/g, where g = e^{ε′} + 1 is the range of the hash function. Under the condition that the user's two items are hashed to different values, OLH can be used with ε′ = ln(2e^ε − 1).

Theorem 5 ((ε, δ)-LDP by OLH(ε′)). Ψ_{OLH(ε′)}(PS(·, ℓ)) satisfies (ε, δ)-LDP, where ε′ = ln((ℓ/ℓ′) · (e^ε − 1) + 1), and ℓ′ is an integer such that

$$\binom{\ell}{\ell'+1} \cdot \frac{1}{\left(\frac{\ell}{\ell'} \cdot (e^{\epsilon} - 1) + 2\right)^{\ell'}} \le \delta.$$

That is, for any ε ≥ 0, any input v1, v2 ⊆ I, and any set of possible outputs T ⊆ Range(Ψ_{OLH(ε′)}(PS(·, ℓ))),

$$\Pr\left[\Psi_{\mathrm{OLH}(\epsilon')}(\mathrm{PS}(v_1, \ell)) \in T\right] \le e^{\epsilon} \cdot \Pr\left[\Psi_{\mathrm{OLH}(\epsilon')}(\mathrm{PS}(v_2, \ell)) \in T\right] + \delta. \qquad (11)$$
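As a numerical illustration of the theorem, one can search for the smallest ℓ′ whose failure bound stays below δ and take the corresponding amplified budget ε′. The following is a minimal Python sketch of this computation (our own illustration; the function name and the linear search over ℓ′ are our choices, not from the paper):

```python
from math import comb, exp, log

def amplified_budget(eps, ell, delta):
    """Smallest integer l' whose union-bound failure probability is at most
    delta, together with the amplified OLH budget eps' from Theorem 5."""
    for ell_prime in range(1, ell + 1):
        g = (ell / ell_prime) * (exp(eps) - 1) + 2        # hash range g = e^{eps'} + 1
        failure = comb(ell, ell_prime + 1) / g ** ell_prime  # union bound on max load > l'
        if failure <= delta:
            return ell_prime, log((ell / ell_prime) * (exp(eps) - 1) + 1)
    return ell, eps  # fallback: l' = l gives no amplification

# For eps = 2, ell = 5, delta = 1e-3 this selects l' = 4 and eps' ~ 2.20,
# consistent with the corresponding entry of Table III below.
print(amplified_budget(2.0, 5, 1e-3))
```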

[Figure omitted: Fig. 5. Privacy amplification effect for different ℓ. Panels: (a) amplification of OLH with δ = 10^{-3}; (b) amplification of OLH with δ = 10^{-9}. Curves for ℓ = 2, 5, 10, 20, 50, 100; x-axis ε, y-axis ε′.]
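The failure-probability condition in Theorem 5 is a balls-in-bins statement, which can also be sanity-checked by direct simulation before proving it. A minimal sketch (our own illustration; the function name and trial count are arbitrary choices):

```python
import random
from collections import Counter

def estimate_failure(ell, g, ell_prime, trials=100_000):
    """Estimate Pr[some bin receives more than l' of the l balls
    thrown uniformly at random into g bins]."""
    fails = 0
    for _ in range(trials):
        loads = Counter(random.randrange(g) for _ in range(ell))
        if max(loads.values()) > ell_prime:
            fails += 1
    return fails / trials

# For l = 5, g = 18, l' = 2, the estimate should fall below the
# union bound C(5,3) / 18^2 ~ 0.031 used in the proof below.
print(estimate_failure(5, 18, 2))
```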


Proof. To prove (11), it suffices to first prove that a "failure" event, in which more than ℓ′ items of v1 are hashed to the same value, happens with probability at most δ, and then prove that, conditioned on the "failure" event not happening, Ψ_{OLH(ε′)}(PS(·, ℓ)) satisfies ε-LDP.

Given that the hash function is chosen randomly and the hash family is random, bounding the "failure" probability is equivalent to bounding the probability that, when ℓ balls are thrown uniformly at random into g bins, the maximum load exceeds ℓ′. The probability can be calculated as follows. Let E_{i,a} be the event that bin i contains more than a balls; then

$$\Pr[E_{i,a}] = \binom{\ell}{a+1} \frac{1}{g^{a+1}}.$$

By the union bound, we know that

$$\delta = \Pr\Big[\bigcup_{i \in [g]} E_{i,\ell'}\Big] \le \sum_{i \in [g]} \Pr[E_{i,\ell'}] = \binom{\ell}{\ell'+1} \frac{1}{g^{\ell'}},$$

where g = e^{ε′} + 1 = (ℓ/ℓ′) · (e^ε − 1) + 2.

Now it suffices to prove that for any ε ≥ 0, any v1, v2 ⊆ I, any possible hash function H (such that at most ℓ′ items are hashed to the same value), and any t ∈ [g], we have p1/p2 ≤ e^ε, where

$$p_1 = \Pr\left[\Psi_{\mathrm{OLH}(\epsilon')}(\mathrm{PS}(v_1, \ell)) = \langle H, t\rangle\right], \ \text{and}\ \ p_2 = \Pr\left[\Psi_{\mathrm{OLH}(\epsilon')}(\mathrm{PS}(v_2, \ell)) = \langle H, t\rangle\right].$$

We first upper bound p1, writing p′ and q′ for the GRR(ε′) probabilities of reporting the true hash value and any other fixed value, respectively (so that p′/q′ = e^{ε′}):

$$\begin{aligned}
p_1 &= \Pr[H \text{ is picked}] \cdot \Pr\left[\Psi_{\mathrm{GRR}(\epsilon')}(H(\mathrm{PS}(v_1, \ell))) = t \mid H\right] \\
&= \Pr[H \text{ is picked}] \cdot \big(\Pr[v \text{ is sampled} \wedge H(v) = t] \cdot p' + \Pr[v \text{ is sampled} \wedge H(v) \ne t] \cdot q'\big) \\
&\le \Pr[H \text{ is picked}] \cdot \Big(\frac{\ell'}{\max\{|v_1|, \ell\}} \cdot p' + \frac{\max\{|v_1|, \ell\} - \ell'}{\max\{|v_1|, \ell\}} \cdot q'\Big).
\end{aligned}$$
The equality holds when H hashes ℓ′ of the items of v1 to t. Similarly, we lower bound p2:

$$\begin{aligned}
p_2 &= \Pr[H \text{ is picked}] \cdot \Pr\left[\Psi_{\mathrm{GRR}(\epsilon')}(H(\mathrm{PS}(v_2, \ell))) = t \mid H\right] \\
&= \Pr[H \text{ is picked}] \cdot \big(\Pr[v \text{ is sampled} \wedge H(v) = t] \cdot p' + \Pr[v \text{ is sampled} \wedge H(v) \ne t] \cdot q'\big) \\
&\ge \Pr[H \text{ is picked}] \cdot \Big(\frac{0}{\max\{|v_2|, \ell\}} \cdot p' + \frac{\max\{|v_2|, \ell\}}{\max\{|v_2|, \ell\}} \cdot q'\Big) = \Pr[H \text{ is picked}] \cdot q'.
\end{aligned}$$

The equality holds when none of the items of v2 are hashed to t by H. Thus, we now bound p1/p2:

$$\begin{aligned}
\frac{p_1}{p_2} &\le \frac{p'}{q'} \cdot \frac{\ell'}{\max\{|v_1|, \ell\}} + \frac{\max\{|v_1|, \ell\} - \ell'}{\max\{|v_1|, \ell\}} \\
&= 1 + \frac{\ell'}{\max\{|v_1|, \ell\}} \cdot \Big(\frac{p'}{q'} - 1\Big) \\
&\le 1 + \frac{\ell'}{\ell} \cdot \big(e^{\epsilon'} - 1\big) \\
&= 1 + \frac{\ell'}{\ell} \cdot \Big(\frac{\ell}{\ell'} \cdot (e^{\epsilon} - 1) + 1 - 1\Big) = e^{\epsilon}.
\end{aligned}$$

The equality is achieved when H(v) = t for ℓ′ of the items of v1 while H(v) ≠ t for all items of v2. ∎

TABLE III
Numerical value of ε′ under different ε and ℓ. The upper part is for δ = 10^{-3}, and the lower part is for δ = 10^{-9}.

δ = 10^{-3}:
  ε \ ℓ     2      5      10     20     50     100
  0.1      0.10   0.10   0.11   0.13   0.15   0.15
  0.5      0.50   0.50   0.54   0.62   0.68   0.80
  1.0      1.00   1.00   1.24   1.35   1.59   1.73
  2.0      2.00   2.20   2.62   3.10   3.71   4.28
  4.0      4.00   4.50   5.19   5.88   6.80   7.20

δ = 10^{-9}:
  ε \ ℓ     2      5      10     20     50     100
  0.1      0.10   0.10   0.10   0.10   0.12   0.14
  0.5      0.50   0.50   0.50   0.52   0.59   0.65
  1.0      1.00   1.00   1.00   1.07   1.30   1.51
  2.0      2.00   2.00   2.00   2.38   2.93   3.49
  4.0      4.00   4.00   4.50   5.04   5.82   6.51

The theorem above gives us the formula to calculate δ and ε′ for any ℓ′. Therefore, if δ is specified, we are able to come up with the highest ε′. Table III and Figure 5 give the values of ε′ for given ε and ℓ, under the condition that δ equals 10^{-3} and 10^{-9}, respectively. We can see that ε′ ≥ ε, and the difference becomes more significant when ε or ℓ is large. However, the increase is smaller than that for GRR, as shown in Table II and Figure 1. Note, however, that the (ε, δ)-LDP notion is strictly weaker (less secure) than ε-LDP, and thus the two are not directly comparable here.

B. Additional Results

Item Mining. We report experimental results of item mining for the Kosarak, Online, and Synthetic datasets in Figures 6 and 7. We can see trends similar to those in Figure 3. Note that performance differs slightly across datasets because of differences in size, distribution, etc. Specifically, NCR and Var are worse on the Kosarak dataset than on the others, because its original domain is large (42 thousand items, versus 1 to 3 thousand for the others). Overall, the proposed method SVIM consistently outperforms its competitors.

Itemset Mining. We also plot results for itemset mining in Figure 8. Results for the synthetic dataset are not included because it contains no frequent itemsets (the items produced by the generator are independent). For the other datasets, we can still see similar trends, and our proposed solution consistently performs better.
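For concreteness, we close with a client-side sketch of the mechanism Ψ_{OLH(ε′)}(PS(·, ℓ)) analyzed in Appendix A: pad the user's set with dummy items up to size ℓ, sample one item uniformly, and report it through OLH with the amplified budget ε′. This is our own illustration, not the authors' released code; the seeded SHA-256 hashing, the dummy-item naming, and the function names are our assumptions.

```python
import hashlib
import math
import random

def olh_report(item, seed, g, eps_prime):
    """Hash the item into [g] with a seeded hash, then apply GRR over [g]."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    h = int.from_bytes(digest[:8], "big") % g
    p = math.exp(eps_prime) / (math.exp(eps_prime) + g - 1)
    if random.random() < p:
        return h                                            # report true hash value
    return random.choice([t for t in range(g) if t != h])   # report another value

def psfo_olh_client(v, ell, eps_prime):
    """Pad-and-sample front end: pad v with dummy items up to size ell,
    sample one item uniformly (probability 1/max{|v|, ell}), report via OLH."""
    padded = list(v) + [f"dummy-{i}" for i in range(max(0, ell - len(v)))]
    item = random.choice(padded)
    g = max(2, round(math.exp(eps_prime)) + 1)   # hash range g = e^{eps'} + 1
    seed = random.randrange(2**32)               # the randomly chosen hash function
    return seed, olh_report(item, seed, g, eps_prime)
```

On the server side, a report ⟨seed, t⟩ supports every item x whose seeded hash equals t, and the resulting counts are debiased as in standard OLH.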
[Figure omitted: Fig. 6. More results on singleton identification (NCR). Panels: (a) Kosarak, varying ε, k = 64; (b) Kosarak, varying k, ε = 2; (c) Online, varying ε, k = 64; (d) Online, varying k, ε = 2; (e) Synthetic, varying ε, k = 64; (f) Synthetic, varying k, ε = 2. Legend as in Fig. 3.]

[Figure omitted: Fig. 7. More results on singleton estimation (Var). Panels: (a) Kosarak, varying ε, k = 64; (b) Kosarak, varying k, ε = 2; (c) Online, varying ε, k = 64; (d) Online, varying k, ε = 2; (e) Synthetic, varying ε, k = 64; (f) Synthetic, varying k, ε = 2. Legend as in Fig. 3.]

[Figure omitted: Fig. 8. More results on itemset mining (Kosarak and Online datasets). Panels: (a) Kosarak, NCR, varying ε, k = 64; (b) Kosarak, NCR, varying k, ε = 2; (c) Online, NCR, varying ε, k = 64; (d) Online, NCR, varying k, ε = 2; (e) Kosarak, Var, varying ε, k = 64; (f) Kosarak, Var, varying k, ε = 2; (g) Online, Var, varying ε, k = 64; (h) Online, Var, varying k, ε = 2. Legend: SVSM(LDPMiner); SVSM(SVIM).]

