A New Analysis of Differential Privacy's Generalization Guarantees
Christopher Jung
University of Pennsylvania, Philadelphia, PA, USA
chrjung@seas.upenn.edu
Katrina Ligett
The Hebrew University, Jerusalem, Israel
katrina@cs.huji.ac.il
Seth Neel
University of Pennsylvania, Philadelphia, PA, USA
sethneel@wharton.upenn.edu
Aaron Roth
University of Pennsylvania, Philadelphia, PA, USA
aaroth@cis.upenn.edu
Saeed Sharifi-Malvajerdi
University of Pennsylvania, Philadelphia, PA, USA
saeedsh@wharton.upenn.edu
Moshe Shenfeld
The Hebrew University, Jerusalem, Israel
moshe.shenfeld@mail.huji.ac.il
Abstract
We give a new proof of the “transfer theorem” underlying adaptive data analysis: that any mechanism
for answering adaptively chosen statistical queries that is differentially private and sample-accurate
is also accurate out-of-sample. Our new proof is elementary and gives structural insights that we
expect will be useful elsewhere. We show: 1) that differential privacy ensures that the expectation of
any query on the conditional distribution on datasets induced by the transcript of the interaction is
close to its expectation on the data distribution, and 2) sample accuracy on its own ensures that any
query answer produced by the mechanism is close to the expectation of the query on the conditional
distribution. This second claim follows from a thought experiment in which we imagine that the
dataset is resampled from the conditional distribution after the mechanism has committed to its
answers. The transfer theorem then follows by summing these two bounds, and in particular, avoids
the “monitor argument” used to derive high probability bounds in prior work.
An upshot of our new proof technique is that the concrete bounds we obtain are substantially
better than the best previously known bounds, even though the improvements are in the constants,
rather than the asymptotics (which are known to be tight). As we show, our new bounds outperform
the naive “sample-splitting” baseline at dramatically smaller dataset sizes compared to the previous
state of the art, bringing techniques from this literature closer to practicality.
2012 ACM Subject Classification Theory of computation → Sample complexity and generalization
bounds
Keywords and phrases Differential Privacy, Adaptive Data Analysis, Transfer Theorem
Acknowledgements We thank Adam Smith for helpful conversations at an early stage of this work,
and Daniel Roy for helpful feedback on the presentation of the result.
1 Introduction
Many data analysis pipelines are adaptive: the choice of which analysis to run next depends on
the outcome of previous analyses. Common examples include variable selection for regression
problems and hyper-parameter optimization in large-scale machine learning problems: in
both cases, common practice involves repeatedly evaluating a series of models on the same
dataset. Unfortunately, this kind of adaptive re-use of data invalidates many traditional
methods of avoiding over-fitting and false discovery, and has been blamed in part for the
recent flood of non-reproducible findings in the empirical sciences [14].
There is a simple way around this problem: don’t re-use data. This idea suggests a
baseline called data splitting: to perform k analyses on a dataset, randomly partition the
dataset into k disjoint parts, and perform each analysis on a fresh part. The standard
“holdout method” is the special case of k = 2. Unfortunately, this natural baseline makes
poor use of data: in particular, the data requirements of this method grow linearly with the
number of analyses k to be performed.
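For concreteness, here is a minimal sketch of this baseline (our illustration, not code from the paper), which answers each adaptively chosen linear query on its own disjoint chunk of the data:

import numpy as np

def sample_splitting(S, next_query, k):
    # Answer k adaptively chosen linear queries, each on a fresh disjoint chunk of S.
    # next_query: callable taking the list of answers so far and returning the next
    # query q, a function mapping a single record to [0, 1].
    chunks = np.array_split(np.asarray(S), k)       # roughly n/k records per query
    answers = []
    for chunk in chunks:
        q = next_query(answers)                     # the analyst may adapt to past answers
        answers.append(float(np.mean([q(x) for x in chunk])))
    return answers

By a standard Hoeffding and union bound argument, each chunk of size n/k yields an answer within roughly sqrt(k * ln(2k/beta) / (2n)) of its expectation with probability 1 - beta over all k queries, so holding the target width fixed forces n to grow linearly in k.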
A recent literature starting with Dwork et al. [6] shows how to give a significant asymptotic
improvement over this baseline via a connection to differential privacy: rather than computing
and reporting exact sample quantities, perturb these quantities with noise. This line of
work established a powerful transfer theorem, that informally says that any analysis that
is simultaneously differentially private and accurate in-sample will also be accurate out-of-
sample. The best analysis of this technique shows that for a broad class of analyses and a
target accuracy goal, the data requirements grow only with $\sqrt{k}$ – a quadratic improvement
over the baseline [1]. Moreover, it is known that in the worst case, this cannot be improved
asymptotically [15, 23]. Unfortunately, thus far this literature has had little impact on
practice. One major reason for this is that although the more sophisticated techniques from
this literature give asymptotic improvements over the sample-splitting baseline, the concrete
bounds do not actually improve on the baseline until the dataset is enormous. This remains
true even after optimizing the constants that arise from the arguments of Dwork et al. [6] or
Bassily et al. [1], and appears to be a fundamental limitation of their proof techniques [20].
In this paper, we give a new proof of the transfer theorem connecting differential privacy
and in-sample accuracy to out-of-sample accuracy. Our proof is based on a simple insight
that arises from imagining a “resampling” experiment, and in particular yields an improved
concrete bound that beats the sample-splitting baseline at dramatically smaller data set
sizes n compared to prior work. In fact, at reasonable dataset sizes, the magnitude of the
improvement arising from our new theorem is significantly larger than the improvement
between the bounds of Bassily et al. [1] and Dwork et al. [6]: see Figure 1.
[Figure 1: the number of queries k that can be answered (y-axis) as a function of the dataset size n (x-axis), for the Sample Splitting Baseline, Optimized BNSSSU, DFHPRR, and Our Bound, at width 0.1 and uniform coverage 95%.]
Figure 1 A comparison of the number of adaptive linear queries that can be answered using the
Gaussian mechanism as analyzed by our transfer theorem (Theorem 9), the numerically optimized
variant of the bound from Bassily et al. [1] (Optimized BNSSSU) as derived in [20], and the original
transfer theorem from Dwork et al. [6] (DFHPRR). We plot for each dataset size n, the number
of queries k that can be answered while guaranteeing confidence intervals around the answer that
have width α = 0.1 and uniform coverage probability 1 − β = 0.95. We compare with the naive
sample splitting baseline that simply splits the dataset into k pieces and answers each query with
the empirical answer on a fresh piece.
Prior Work
In-expectation guarantees of this sort do not by themselves yield confidence intervals, and so prior work has focused on strengthening the above observation
into a high probability bound. For small $\epsilon$, the optimal bound has the asymptotic form
$$\Pr_{q\sim M(S)}\left[\Big|\,\mathbb{E}_{x\sim\mathcal{P}}[q(x)] - \tfrac{1}{n}\textstyle\sum_{x\in S} q(x)\Big| \ge \epsilon\right] \le e^{-O(\epsilon^2 n)}$$
[1]. Note that this bound does not
refer to the estimated answers supplied to the data analyst: it says only that a differentially
private data analyst is unlikely to be able to find a query whose average value on the dataset
differs substantially from its expectation. Pairing this with a simultaneous high probability
bound on the in-sample accuracy of a mechanism – that it supplies answers $a$ such that with
high probability the empirical error is small: $\Pr_{a\sim M(S)}\left[\big|a - \tfrac{1}{n}\sum_{x\in S} q(x)\big| \ge \alpha\right] \le \beta$ – yields a high probability out-of-sample accuracy guarantee.
Our Approach
We take a fundamentally different approach by directly providing high probability bounds
on the out-of-sample accuracy |a − Ex∼P [q(x)]| of mechanisms that are both differentially
private and accurate in-sample. Our elementary approach is motivated by the following
thought experiment: in actuality, the dataset S is fixed before any interaction with M begins.
However, imagine that after the entire interaction with M is complete, the dataset S is
resampled from the conditional distribution Q on datasets conditioned on the output of M .
This thought experiment doesn’t alter the joint distribution on datasets and outputs, and
so any in-sample accuracy guarantees that M has continue to hold under this hypothetical
re-sampling experiment. But because the empirical value of the queries on the re-sampled
dataset are likely to be close to their expected value over the conditional distribution Q, the
only way the mechanism can promise to be sample-accurate with high probability is if it
provides answers that are close to their expected value over the conditional distribution with
high probability.
This focuses attention on the conditional distribution on datasets induced by differentially
private transcripts. But it is not hard to show that a consequence of differential privacy
is that the conditional expectation of any query must be close to its expectation over the
data distribution with high probability. In contrast to prior work, this argument directly
leverages high-probability in-sample accuracy guarantees of a private mechanism to derive
high-probability out-of-sample guarantees, without the need for additional machinery like
the monitor argument of Bassily et al. [1].
Feldman et al. [10] show that the risk of overfitting from test set reuse is smaller for multiclass prediction
problems, compared to binary prediction problems. Rogers et al. [20] give a method for
certifying the correctness of heuristically guessed confidence intervals, which they show often
out-perform the theoretical guarantees by orders of magnitude.
Finally, Elder [9, 8] proposes a Bayesian reformulation of the adaptive data analysis
problem. In the model of [9], the data distribution P is assumed to itself be drawn from a
prior that is commonly known to the data analyst and mechanism. In contrast, we work
in the standard adversarial setting originally introduced by Dwork et al. [6] in which the
mechanism must offer guarantees for worst case data distributions and analysts, and focus
our attention on conditional distributions purely as a proof technique.
2 Preliminaries
Let X be an abstract data domain, and let P be an arbitrary distribution over X . A dataset
of size n is a collection of n data records: $S = \{S_i\}_{i=1}^n \in \mathcal{X}^n$. We study datasets sampled
i.i.d. from $\mathcal{P}$: $S \sim \mathcal{P}^n$. We will write S to denote the random variable and x for realizations
of this random variable. A linear query is a function $q : \mathcal{X}^* \to [0, 1]$ that takes the following
empirical average form when acting on a dataset $S \in \mathcal{X}^n$:
$$q(S) = \frac{1}{n}\sum_{i=1}^{n} q(S_i).$$
For a distribution $\mathcal{D}$ over datasets, we write $q(\mathcal{D}) = \mathbb{E}_{S\sim\mathcal{D}}[q(S)]$ for the expected value of the
query over the dataset distribution. We note that for linear queries, when the dataset distribution is $\mathcal{D} = \mathcal{P}^n$, we have $q(\mathcal{P}^n) =
\mathbb{E}_{x\sim\mathcal{P}}[q(x)]$, which we write as $q(\mathcal{P})$ when the notation is clear from context. However, the
more general definition will be useful because we will need to evaluate the expectation of q
over other (non-product) distributions over datasets in our arguments, and we will generalize
beyond linear queries in Appendices A.1 and A.2.
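As a toy illustration of this notation (ours, with an arbitrarily chosen query), the empirical value q(S) of a linear query concentrates around its distributional value q(P):

import numpy as np

rng = np.random.default_rng(0)

# Toy domain X = [0, 1], with P uniform on [0, 1], and the linear query q(x) = 1[x > 0.5].
def q(x):
    return float(x > 0.5)

n = 10_000
S = rng.uniform(0.0, 1.0, size=n)                  # S ~ P^n

q_S = np.mean([q(x) for x in S])                   # q(S) = (1/n) * sum_i q(S_i)
q_P = 0.5                                          # q(P) = E_{x~P}[q(x)] for this toy query
print(f"q(S) = {q_S:.4f}, q(P) = {q_P:.4f}")       # close for large n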
Given a family of queries Q, a statistical estimator is a (possibly stateful) randomized
algorithm M : X n ×Q∗ → R∗ parameterized by a dataset S that interactively takes as input a
stream of queries qi ∈ Q, and provides answers ai ∈ R. An analyst is an arbitrary randomized
algorithm A : R∗ → Q∗ that generates a stream of queries and receives a stream of answers
(which can inform the next queries it generates). When an analyst interacts with a statistical
estimator, they generate a transcript of their interaction π ∈ Π where Π = (Q × R)∗ is the
space of all transcripts. Throughout we write Π to denote the transcript’s random variable
and π for its realizations.
The interaction is summarized in Algorithm 1, and we write Interact(M, A; S) to refer
to it. When M and A are clear from context, we will abbreviate this notation and write
simply I(S). When we refer to an indexed query qj , this is implicitly a function of the
transcript π. Given a transcript π ∈ Π, write Qπ to denote the conditional distribution
on datasets conditional on Π = π: Qπ = (P n )|Interact(M, A; S) = π. Note that Qπ will
no longer generally be a product distribution. We will be interested in evaluating uniform
accuracy bounds, which control the worst-case error over all queries:
▶ Definition 1 (Accuracy). M satisfies (α, β)-sample accuracy if for every data analyst A
and every data distribution P,
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |q_j(S) - a_j| \ge \alpha\right] \le \beta.$$
We say M satisfies (α, β)-distributional accuracy if for every data analyst A and every data
distribution P,
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |q_j(\mathcal{P}^n) - a_j| \ge \alpha\right] \le \beta.$$
If Interact(M, · ; ·) satisfies (ε, δ)-differential privacy, we will also say that M satisfies (ε, δ)-
differential privacy.
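The following sketch (our own harness; the callable names are ours, and Algorithm 1 itself is specified in the paper) runs the interaction and records the two error notions that Definition 1 bounds:

import numpy as np

def run_interaction(S, mechanism_answer, next_query, k, true_mean):
    # mechanism_answer(q, S) -> a_j : the statistical estimator M
    # next_query(answers)   -> q   : the (possibly adaptive) analyst A
    # true_mean(q)          -> q(P): oracle for the population value, used only for evaluation
    answers, sample_errs, dist_errs = [], [], []
    for _ in range(k):
        q = next_query(answers)
        a = mechanism_answer(q, S)
        answers.append(a)
        q_S = np.mean([q(x) for x in S])          # q(S)
        sample_errs.append(abs(q_S - a))          # in-sample error
        dist_errs.append(abs(true_mean(q) - a))   # out-of-sample error
    return max(sample_errs), max(dist_errs)       # compare these maxima with alpha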
We introduce a novel quantity that will be crucial to our argument: it captures the effect
of the transcript on the change in the expectation of a query contained in the transcript.
▶ Definition 3. An interaction Interact(M, A; ·) is called (ε, δ)-posterior stable if for every
data distribution P:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |q_j(\mathcal{P}^n) - q_j(\mathcal{Q}_\Pi)| \ge \epsilon\right] \le \delta.$$
The theorem follows easily from a change in perspective driven by an elementary observa-
tion. Imagine that after the interaction is run and results in a transcript π, the dataset S is
resampled from its conditional distribution Qπ . This does not change the joint distribution
on datasets and transcripts. This simple claim is formalized below: its elementary proof
appears in Appendix B.
▶ Lemma 5 (Resampling Lemma). For every event $E \subseteq \mathcal{X}^n \times \Pi$:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[(S, \Pi) \in E\right] = \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S),\,S'\sim\mathcal{Q}_\Pi}\left[(S', \Pi) \in E\right]$$
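To see the lemma in action, here is a small numerical check (our illustration on a toy mechanism, not part of the paper) in which the joint distribution over datasets and transcripts is enumerated exactly and the two probabilities are compared. Note that the lemma requires no differential privacy, so any mechanism works here.

import itertools
import numpy as np

# Toy setting: records in {0, 1}, datasets of size n = 2, P uniform on {0, 1}.
datasets = list(itertools.product([0, 1], repeat=2))
p_S = {x: 0.25 for x in datasets}                       # P^n, uniform here

def p_pi_given_S(pi, x):
    # A toy mechanism with a binary transcript pi in {0, 1}.
    p1 = 0.25 + 0.5 * np.mean(x)                        # Pr[pi = 1 | S = x]
    return p1 if pi == 1 else 1.0 - p1

transcripts = [0, 1]
joint = {(x, pi): p_S[x] * p_pi_given_S(pi, x)
         for x in datasets for pi in transcripts}
p_pi = {pi: sum(joint[(x, pi)] for x in datasets) for pi in transcripts}
Q = {(x, pi): joint[(x, pi)] / p_pi[pi]                 # conditional Q_pi(x)
     for x in datasets for pi in transcripts}

# Any event E over (dataset, transcript) pairs, e.g. "q(S) >= 1/2 and pi = 1".
E = {(x, pi) for x in datasets for pi in transcripts
     if np.mean(x) >= 0.5 and pi == 1}

lhs = sum(joint[(x, pi)] for (x, pi) in E)              # Pr[(S, Pi) in E]
rhs = sum(joint[(x, pi)] * Q[(x2, pi)]                  # Pr[(S', Pi) in E], S' ~ Q_Pi
          for x in datasets for pi in transcripts for x2 in datasets
          if (x2, pi) in E)
print(abs(lhs - rhs) < 1e-12)                           # True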
The change in perspective suggested by the resampling lemma makes it easy to see why
the following must be true: any sample-accurate mechanism must in fact be accurate with
respect to the conditional distribution it induces. If the mechanism can first commit
to its answers and still guarantee that they are sample-accurate after the dataset is resampled
from the conditional distribution, then the answers it committed to must be close to the conditional
means, because the empirical answers on the resampled dataset are likely to be close to those means. This
argument is generic and does not use differential privacy.
▶ Lemma 6. Suppose that M is (α, β)-sample accurate. Then for every c > 0 it also satisfies:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |a_j - q_j(\mathcal{Q}_\Pi)| > \alpha + c\right] \le \frac{\beta}{c}$$
Proof. Denote by $j^*(\pi) = \arg\max_j |a_j - q_j(\mathcal{Q}_\pi)|$. Given $\alpha \ge 0$ and $c > 0$, and expanding the
definition of $q_{j^*(\Pi)}(\mathcal{Q}_\Pi)$ we get:
\begin{align*}
\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[a_{j^*(\Pi)} - q_{j^*(\Pi)}(\mathcal{Q}_\Pi) > \alpha + c\right]
&= \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\mathbb{E}_{S'\sim\mathcal{Q}_\Pi}\left[a_{j^*(\Pi)} - q_{j^*(\Pi)}(S')\right] - \alpha > c\right] \\
&\le \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\mathbb{E}_{S'\sim\mathcal{Q}_\Pi}\left[\max\left(a_{j^*(\Pi)} - q_{j^*(\Pi)}(S') - \alpha,\, 0\right)\right] > c\right] \\
&\overset{(1)}{\le} \frac{1}{c}\,\mathbb{E}_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\mathbb{E}_{S'\sim\mathcal{Q}_\Pi}\left[\max\left(a_{j^*(\Pi)} - q_{j^*(\Pi)}(S') - \alpha,\, 0\right)\right]\right] \\
&\overset{(2)}{\le} \frac{1}{c}\,\mathbb{E}_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\Pr_{S'\sim\mathcal{Q}_\Pi}\left[a_{j^*(\Pi)} - q_{j^*(\Pi)}(S') - \alpha > 0\right]\right] \\
&= \frac{1}{c}\,\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S),\,S'\sim\mathcal{Q}_\Pi}\left[a_{j^*(\Pi)} - q_{j^*(\Pi)}(S') > \alpha\right] \\
&\overset{(3)}{=} \frac{1}{c}\,\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[a_{j^*(\Pi)} - q_{j^*(\Pi)}(S) > \alpha\right]
\end{align*}
Here, inequality (1) follows from Markov's inequality, inequality (2) follows from the fact that
$a_{j^*(\Pi)} - q_{j^*(\Pi)}(S') - \alpha \le 1$, and equality (3) follows from the Resampling Lemma (Lemma 5).
Repeating this argument for $q_{j^*(\Pi)}(\mathcal{Q}_\Pi) - a_{j^*(\Pi)}$ yields a symmetric bound, so by combining
the two with the guarantee of (α, β)-sample accuracy we get,
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\left|a_{j^*(\Pi)} - q_{j^*(\Pi)}(\mathcal{Q}_\Pi)\right| > \alpha + c\right] \le \frac{1}{c}\,\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\left|a_{j^*(\Pi)} - q_{j^*(\Pi)}(S)\right| > \alpha\right] \le \frac{\beta}{c} \qquad\blacktriangleleft$$
Because sample accuracy implies accuracy with respect to the conditional distribution,
together with a bound on posterior stability, the transfer theorem follows immediately from the triangle inequality:
$$\max_j |a_j - q_j(\mathcal{P})| \le \max_i |a_i - q_i(\mathcal{Q}_\Pi)| + \max_\ell |q_\ell(\mathcal{Q}_\Pi) - q_\ell(\mathcal{P})|.$$
Lemma 6 bounds the first term by $\alpha + c$ with probability $1 - \frac{\beta}{c}$ over Π, and the definition
of posterior stability bounds the second term by $\epsilon$ with probability $1 - \delta$ over Π, which
concludes the proof. ◀
▶ Lemma 7. If M is (ε, δ)-differentially private, then for any data distribution P, any
analyst A, and any constant c > 0:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |q_j(\mathcal{Q}_\Pi) - q_j(\mathcal{P})| > (e^\epsilon - 1) + 2c\right] \le \frac{\delta}{c}$$
i.e. it is (ε′, δ′)-posterior stable for every data analyst A, where $\epsilon' = e^\epsilon - 1 + 2c$ and $\delta' = \frac{\delta}{c}$.
Proof. Given a transcript $\pi \in \Pi$, let $j^*(\pi) \in \arg\max_j |q_j(\mathcal{Q}_\pi) - q_j(\mathcal{P})|$. Define for an $\alpha > 0$:
\begin{align*}
\Pi_\alpha &= \left\{\pi \in \Pi \;\middle|\; q_{j^*(\pi)}(\mathcal{Q}_\pi) - q_{j^*(\pi)}(\mathcal{P}) > \alpha\right\} \\
X^+(\pi) &= \left\{x \in \mathcal{X} \;\middle|\; \Pr_{S\sim\mathcal{Q}_\pi,\,S_i\sim S}[S_i = x] > \Pr_{S_i\sim\mathcal{P}}[S_i = x]\right\} \\
B^+_\alpha &= \bigcup_{\pi\in\Pi_\alpha} X^+(\pi) \times \{\pi\} \\
\Pi^+_\alpha(x) &= \left\{\pi \in \Pi \;\middle|\; (x, \pi) \in B^+_\alpha\right\}
\end{align*}
Fix any $\alpha$. Suppose that $\Pr\left[\left|q_{j^*(\Pi)}(\mathcal{Q}_\Pi) - q_{j^*(\Pi)}(\mathcal{P})\right| > \alpha\right] > \frac{\delta}{c}$. We must have that either
$\Pr\left[q_{j^*(\Pi)}(\mathcal{Q}_\Pi) - q_{j^*(\Pi)}(\mathcal{P}) > \alpha\right] > \frac{\delta}{2c}$ or $\Pr\left[q_{j^*(\Pi)}(\mathcal{P}) - q_{j^*(\Pi)}(\mathcal{Q}_\Pi) > \alpha\right] > \frac{\delta}{2c}$. Without
loss of generality, assume
$$\Pr\left[q_{j^*(\Pi)}(\mathcal{Q}_\Pi) - q_{j^*(\Pi)}(\mathcal{P}) > \alpha\right] = \Pr\left[\Pi \in \Pi_\alpha\right] > \frac{\delta}{2c} \tag{1}$$
Let $S_i$ be the random variable obtained by first sampling $S \sim \mathcal{P}^n$ and then sampling $S_i \in S$
uniformly at random. We compare the probability measure of $B^+_\alpha$ under the joint distribution
on $S_i$ and Π with its corresponding measure under the product distribution of $S_i$ and Π:
$$\cdots > \alpha \cdot \Pr\left[\Pi \in \Pi_\alpha\right]$$
On the other hand, using the definition of (ε, δ)-differential privacy (see Lemma 21 for the
elementary derivation of the first inequality):
▶ Remark 8. Note:
1. Since differential privacy is closed under post-processing, this claim can be generalized
beyond queries contained in the transcript to any query generated as a function of the
transcript.
2. In the case of (ε, 0)-differential privacy, choosing c = 0, the claim holds for every query
with probability 1.
Combined with our general transfer theorem (Theorem 4), this directly yields a transfer
theorem for differential privacy:
▶ Theorem 9 (Transfer Theorem for (ε, δ)-Differential Privacy). Suppose that M is (ε, δ)-
differentially private and (α, β)-sample accurate for linear queries. Then for every analyst A
and c, d > 0 it also satisfies:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |a_j - q_j(\mathcal{P})| > \alpha + (e^\epsilon - 1) + c + 2d\right] \le \frac{\beta}{c} + \frac{\delta}{d}$$
i.e. it is (α′, β′)-distributionally accurate for $\alpha' = \alpha + (e^\epsilon - 1) + c + 2d$ and $\beta' = \frac{\beta}{c} + \frac{\delta}{d}$.
▶ Remark 10. As we will see in Section 4, the Gaussian mechanism (and many other
differentially private mechanisms) has a sample accuracy bound that depends only on the
square root of the log of both 1/β and 1/δ. Thus, despite the Markov-like term $\beta' = \frac{\beta}{c} + \frac{\delta}{d}$
in the above transfer theorem, together with the sample accuracy bounds of the Gaussian
mechanism, it yields Chernoff-like concentration.
Our technique extends easily to reason about arbitrary low sensitivity queries and
minimization queries. See Appendix A.1 and A.2 for more details.
We now apply our new transfer theorem to derive the concrete bounds that we plotted in
Figure 1. The Gaussian mechanism is extremely simple and has only a single parameter σ:
for each query qi that arrives, the Gaussian mechanism returns the answer ai ∼ N (qi (S), σ 2 )
where N (qi (S), σ 2 ) denotes the Gaussian distribution with mean qi (S) and standard deviation
σ. First, we recall the differential privacy properties of the Gaussian mechanism.
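A minimal implementation sketch of this mechanism (ours; the class and method names are not from the paper) is:

import numpy as np

class GaussianMechanism:
    # Answer adaptively chosen linear queries with Gaussian noise.
    def __init__(self, S, sigma, seed=0):
        self.S = np.asarray(S)          # the fixed dataset of n records
        self.sigma = sigma              # noise scale (std. dev. of each answer)
        self.rng = np.random.default_rng(seed)

    def answer(self, q):
        # q maps a single record to [0, 1]; returns a_i ~ N(q(S), sigma^2).
        empirical = np.mean([q(x) for x in self.S])     # q(S)
        return empirical + self.rng.normal(0.0, self.sigma)

# Example usage: mech = GaussianMechanism(S, sigma=0.01); a = mech.answer(lambda x: float(x > 0.5))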
▶ Theorem 11 ([2]). When used to answer k linear queries, for every 0 < δ < 1, the
Gaussian mechanism with parameter σ satisfies (ε, δ)-differential privacy for:
$$\epsilon = \frac{k}{2n^2\sigma^2} + \sqrt{2\cdot\frac{k}{n^2\sigma^2}\cdot\log\!\left(\sqrt{\pi\cdot\frac{k}{2n^2\sigma^2}}\,\Big/\,\delta\right)}$$
It is also easy to see that the sample-accuracy of the Gaussian mechanism is characterized
by the CDF of the Gaussian distribution:
▶ Lemma 12. For any 0 < β < 1, the Gaussian mechanism with parameter σ is $(\alpha_G, \beta)$-
sample accurate for:
$$\alpha_G = \sqrt{2}\,\sigma\cdot\mathrm{erfc}^{-1}\!\left(2 - 2\left(1-\frac{\beta}{2}\right)^{1/k}\right) < \sqrt{2}\,\sigma\cdot\mathrm{erfc}^{-1}\!\left(\frac{\beta}{k}\right) < \sqrt{2}\,\sigma\sqrt{\log\frac{2k}{\sqrt{\pi}\,\beta}}\,.$$
Proof. For a query $q_j$, write $a_j = q_j(S) + Z_j$ where $Z_j \sim \mathcal{N}(0, \sigma^2)$. The sample error
is $\max_j |a_j - q_j(S)| = \max_j |Z_j|$. We have that $\Pr[\max_j |Z_j| \ge \alpha] \le \Pr[\max_j Z_j \ge \alpha] +
\Pr[\min_j Z_j \le -\alpha]$, and $\alpha_G$ is the value that solves the equation $\Pr[\max_j Z_j \ge \alpha] = \Pr[\min_j Z_j \le -\alpha] = \beta/2$. ◀
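Numerically, the exact width and its relaxation are easy to evaluate with scipy's erfcinv (our sketch, not code from the paper):

import numpy as np
from scipy.special import erfcinv

def sample_width(sigma, k, beta):
    # Exact alpha_G from Lemma 12, and the relaxation sqrt(2)*sigma*erfcinv(beta/k).
    exact = np.sqrt(2) * sigma * erfcinv(2 - 2 * (1 - beta / 2) ** (1 / k))
    relaxed = np.sqrt(2) * sigma * erfcinv(beta / k)
    return exact, relaxed

print(sample_width(sigma=0.01, k=1000, beta=0.05))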
With these quantities in hand, we can now apply Theorem 9 to derive distributional
accuracy bounds for the Gaussian mechanism:
▶ Theorem 13. Fix a desired confidence parameter 0 < β < 1. When σ is set optimally,
the Gaussian mechanism can be used to answer k linear queries while satisfying (α, β)-
distributional accuracy, where α is the solution to the following unconstrained minimization
problem:
$$\alpha = \min_{\sigma,\,\delta > 0}\left[\sqrt{2}\,\sigma\cdot\mathrm{erfc}^{-1}\!\left(\frac{\delta}{k}\right) + \exp\!\left(\frac{k}{2n^2\sigma^2} + \sqrt{2\cdot\frac{k}{n^2\sigma^2}\cdot\log\!\left(\sqrt{\pi\cdot\frac{k}{2n^2\sigma^2}}\,\Big/\,\delta\right)}\right) - 1 + 6\,\frac{\delta}{\beta}\right]$$
Proof. Using Theorem 9 and fixing $\beta' = \delta$ and $c = d$, we have that an (α′, β′)-sample
accurate, (ε, δ)-differentially private mechanism is (α, β)-distributionally accurate for $\alpha =
\alpha' + (e^\epsilon - 1) + 3c$ and $\beta = \frac{2\delta}{c}$, where c can be an arbitrary parameter. For any fixed value
of β, we can take $c = \frac{2\delta}{\beta}$, and see that we obtain (α, β)-distributional accuracy where
$\alpha = \alpha' + (e^\epsilon - 1) + 6(\delta/\beta)$. The theorem then follows from plugging in the privacy bound
from Theorem 11, the sample accuracy bound from Lemma 12, and optimizing over the
free variables σ and δ. ◀
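The minimization in Theorem 13 has no closed form, but it is straightforward to evaluate numerically. The following sketch (ours; the grid ranges and function names are our own choices) searches over σ and δ for given n, k, and β:

import numpy as np
from scipy.special import erfcinv

def eps_gaussian(sigma, n, k, delta):
    # Privacy of the Gaussian mechanism for k linear queries (Theorem 11).
    rho = k / (2 * n**2 * sigma**2)
    return rho + np.sqrt(4 * rho * np.log(np.sqrt(np.pi * rho) / delta))

def alpha_theorem13(n, k, beta, sigmas, deltas):
    # Distributional accuracy width from Theorem 13, by grid search over sigma and delta.
    best = np.inf
    for sigma in sigmas:
        for delta in deltas:
            eps = eps_gaussian(sigma, n, k, delta)
            alpha_g = np.sqrt(2) * sigma * erfcinv(delta / k)   # Lemma 12 with beta' = delta
            alpha = alpha_g + np.expm1(eps) + 6 * delta / beta
            if np.isfinite(alpha) and alpha < best:
                best = alpha
    return best

sigmas = np.logspace(-4, -1, 100)
deltas = np.logspace(-12, -2, 100)
print(alpha_theorem13(n=500_000, k=10_000, beta=0.05, sigmas=sigmas, deltas=deltas))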
5 Discussion
We have given a new proof of the transfer theorem for differential privacy that has several
appealing properties. Besides being simpler than previous arguments, it achieves substantially
better concrete bounds than previous transfer theorems, and uncovers new structural insights
about the role of differential privacy and sample accuracy. In particular, sample accuracy
serves to guarantee that the reported answers are close to their conditional means, and
differential privacy serves to guarantee that the conditional means are close to their true
answers. This focuses attention on the conditional data distribution as a key quantity of
interest, which we expect will be fruitful in future work. In particular, it may shed light on
what makes certain data analysts overfit less than worst-case bounds would suggest: because
they choose queries whose conditional means are closer to the prior than the worst-case query.
There seems to be one remaining place to look for improvement in our transfer theorem:
Lemmas 6 and 7 both exhibit a Markov-like tradeoff between a parameter c and β and
δ respectively. Although the dependence on β and δ in our ultimate bounds is only root-
logarithmic, it would still yield an improvement if this Markov-like dependence could be
replaced with a Chernoff-like dependence. It is possible to do this for the β parameter: we
give an alternative (and even simpler) proof of the transfer theorem for (ε, 0)-differential
privacy which shows that conditional distributions induced by private mechanisms exhibit
Chernoff-like concentration, in Appendix D. But the only way we know to extend this
argument to (ε, δ)-differential privacy requires dividing δ by a factor of n, which yields a
final theorem that is inferior to Theorem 9.
References
1 Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan
Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth
annual ACM symposium on Theory of Computing, pages 1046–1059. ACM, 2016.
2 Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions,
and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
3 Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive
learning with robust generalization guarantees. In Conference on Learning Theory, pages
772–814, 2016.
4 Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron
Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural
Information Processing Systems, pages 2350–2358, 2015.
5 Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science,
349(6248):636–638, 2015.
6 Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceed-
ings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126.
ACM, 2015.
7 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284.
Springer, 2006.
8 Sam Elder. Bayesian adaptive data analysis guarantees from subgaussianity. arXiv preprint,
2016. arXiv:1611.00065.
9 Sam Elder. Challenges in Bayesian adaptive data analysis. arXiv preprint, 2016. arXiv:1604.02492.
10 Vitaly Feldman, Roy Frostig, and Moritz Hardt. The advantages of multiple classes for
reducing overfitting from test set reuse. In International Conference on Machine Learning,
pages 1892–1900, 2019.
11 Vitaly Feldman and Thomas Steinke. Generalization for Adaptively-chosen Estimators via
Stable Median. In Conference on Learning Theory, pages 728–757, 2017.
12 Vitaly Feldman and Thomas Steinke. Calibrating Noise to Variance in Adaptive Data Analysis.
In Conference On Learning Theory, pages 535–544, 2018.
13 Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. In
Advances in Neural Information Processing Systems, pages 9747–9757, 2018.
14 Andrew Gelman and Eric Loken. The Statistical Crisis in Science. American Scientist,
102(6):460, 2014.
15 Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis
is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages
454–463. IEEE, 2014.
16 Katrina Ligett and Moshe Shenfeld. A necessary and sufficient stability notion for adaptive
generalization. arXiv preprint, 2019. arXiv:1906.00930.
17 Seth Neel and Aaron Roth. Mitigating Bias in Adaptive Data Gathering via Differential
Privacy. In International Conference on Machine Learning (ICML), 2018.
18 Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why Adaptively Collected
Data Have Negative Bias and How to Correct for It. In International Conference on Artificial
Intelligence and Statistics, pages 1261–1269, 2018.
19 Kobbi Nissim and Uri Stemmer. Concentration Bounds for High Sensitivity Functions Through
Differential Privacy. Journal of Privacy and Confidentiality, 9(1), 2019.
20 Ryan Rogers, Aaron Roth, Adam Smith, Nathan Srebro, Om Thakkar, and Blake Woodworth.
Guaranteed Validity for Empirical Approaches to Adaptive Data Analysis. arXiv preprint,
2019. arXiv:1906.09231.
21 Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. Max-information, differential
privacy, and post-selection hypothesis testing. In 2016 IEEE 57th Annual Symposium on
Foundations of Computer Science (FOCS), pages 487–494. IEEE, 2016.
22 Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information
theory. In Artificial Intelligence and Statistics, pages 1232–1240, 2016.
23 Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of
preventing false discovery. In Conference on Learning Theory, pages 1588–1628, 2015.
24 Thomas Steinke and Jonathan Ullman. Subgaussian tail bounds via stability arguments. arXiv
preprint, 2017. arXiv:1701.03493.
25 Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of
learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533,
2017.
26 Tijana Zrnic and Moritz Hardt. Natural Analysts in Adaptive Data Analysis. In International
Conference on Machine Learning, pages 7703–7711, 2019.
A Extensions
A.1 Low Sensitivity Queries
Our technique extends easily to reason about arbitrary low sensitivity queries. We only need
to generalize our lemma about posterior stability.
▶ Definition 14. A query $q : \mathcal{X}^n \to \mathbb{R}$ is called ∆-sensitive if for all pairs of neighbouring
datasets $S, S' \in \mathcal{X}^n$: $|q(S) - q(S')| \le \Delta$. Note that linear queries are (1/n)-sensitive.
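As a sanity check (our illustration, not code from the paper), the sensitivity of a small query can be verified by brute force on a tiny domain:

import itertools
import numpy as np

def sensitivity(q, domain, n):
    # Brute-force Delta: max over neighbouring datasets (differing in one record) of |q(S) - q(S')|.
    worst = 0.0
    for S in itertools.product(domain, repeat=n):
        for i in range(n):
            for y in domain:
                S_prime = list(S)
                S_prime[i] = y
                worst = max(worst, abs(q(list(S)) - q(S_prime)))
    return worst

def mean_query(S):
    return float(np.mean(S))          # a linear query: the empirical mean

print(sensitivity(mean_query, domain=[0.0, 1.0], n=3))   # prints 1/n = 0.333...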
▶ Lemma 15. If M is an (ε, δ)-differentially private mechanism for answering ∆-sensitive
queries, then for any data distribution P, analyst A, and any constant c > 0:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j |q_j(\mathcal{Q}_\Pi) - q_j(\mathcal{P}^n)| > (e^\epsilon - 1 + 4c)\,n\Delta\right] \le \frac{\delta}{c}$$
where $S^{i\leftarrow Y} = (S_1, \ldots, S_{i-1}, Y, S_{i+1}, \ldots, S_n)$, and the last equality follows from the obser-
vation that $(S, Y)$ and $(S^{i\leftarrow Y}, S_i)$ are identically distributed. Since $Y \sim \mathcal{P}$, independently
from Π, we get that $\mathbb{E}_{Y\sim\mathcal{P}}\left[\bar{q}_{j^*(\pi)}\big(S^{i\leftarrow Y}_{\le i}\big)\right] = \bar{q}_{j^*(\pi)}(S_{\le i-1})$, so
\begin{align*}
\mathbb{E}_{S\sim\mathcal{P}^n}\left[\sum_{\pi\in\Pi_\alpha}\Pr_{\Pi\sim I(S)}[\Pi = \pi]\left(\bar{q}_{j^*(\pi)}(S_{\le i}) - \bar{q}_{j^*(\pi)}(S_{\le i-1}) + \Delta\right)\right]
&\le \mathbb{E}_{S\sim\mathcal{P}^n}\left[\left(e^\epsilon \Pr_{\Pi\sim I(S)}[\Pi \in \Pi_\alpha] + 2\delta\right)\Delta\right] \\
&= \left(e^\epsilon \Pr[\Pi \in \Pi_\alpha] + 2\delta\right)\Delta
\end{align*}
We say that M satisfies (α, β)-distributional accuracy for minimization queries if for every
data analyst A and every data distribution P:
$$\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim\mathrm{Interact}(M,A;S)}\left[\max_j\left(\mathop{\mathbb{E}}_{S'\sim\mathcal{P}^n}\left[L_j(S', \theta_j)\right] - \min_{\theta\in\Theta}\mathop{\mathbb{E}}_{S'\sim\mathcal{P}^n}\left[L_j(S', \theta)\right]\right) \ge \alpha\right] \le \beta$$
▶ Theorem 20 (Transfer Theorem for Minimization Queries). Suppose that M is (ε, δ)-
differentially private and (α, β)-sample accurate for ∆-sensitive minimization queries. Then
for every analyst A and c, d > 0 it also satisfies:
$$\Pr\left[\max_j\left(\mathop{\mathbb{E}}_{S'\sim\mathcal{P}^n}\left[L_j(S', \theta_j)\right] - \min_{\theta\in\Theta}\mathop{\mathbb{E}}_{S'\sim\mathcal{P}^n}\left[L_j(S', \theta)\right]\right) > \alpha + c + 2(e^\epsilon - 1 + 4d)\,n\Delta\right] \le \frac{\beta}{c} + \frac{\delta}{d}$$
i.e. it is (α′, β′)-distributionally accurate for $\alpha' = \alpha + c + 2(e^\epsilon - 1 + 4d)n\Delta$ and $\beta' = \frac{\beta}{c} + \frac{\delta}{d}$.
\begin{align*}
\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S),\,S'\sim\mathcal{Q}_\Pi}\left[(S', \Pi) \in E\right]
&= \sum_x\sum_\pi\sum_{x'} \Pr[S = x]\,\Pr[\Pi = \pi \mid S = x]\,\Pr_{S'\sim\mathcal{Q}_\pi}[S' = x']\,\mathbb{1}[(x', \pi) \in E] \\
&= \sum_\pi\sum_{x'} \Pr[\Pi = \pi]\,\Pr_{S'\sim\mathcal{Q}_\pi}[S' = x']\,\mathbb{1}[(x', \pi) \in E] \\
&= \sum_\pi\sum_{x'} \Pr[\Pi = \pi]\,\Pr[S = x' \mid \Pi = \pi]\,\mathbb{1}[(x', \pi) \in E] \\
&= \sum_\pi\sum_{x'} \Pr[\Pi = \pi]\,\frac{\Pr[\Pi = \pi \mid S = x'] \cdot \Pr[S = x']}{\Pr[\Pi = \pi]}\,\mathbb{1}[(x', \pi) \in E] \\
&= \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[(S, \Pi) \in E\right] \qquad\blacktriangleleft
\end{align*}
$$\Pr_{S\sim\mathcal{P}^n,\,S_i\sim S,\,\Pi\sim I(S)}\left[\Pi \in E \mid S_i = x\right] \le e^\epsilon \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\Pi \in E\right] + \delta$$
Now fix any realization x, and consider each term $\mathbb{E}_{S\sim\mathcal{Q}_\pi}[q(S_i) \mid S_{<i} = x_{<i}]$. We have:
\begin{align*}
\mathbb{E}_{S\sim\mathcal{Q}_\pi}[q(S_i) \mid S_{<i} = x_{<i}] &= \sum_x q(x)\cdot\Pr_{S\sim\mathcal{P}^n}[S_i = x \mid \Pi = \pi,\, S_{<i} = x_{<i}] \\
&= \sum_x q(x)\cdot\frac{\Pr_{S\sim\mathcal{P}^n}[\Pi = \pi \mid S_i = x,\, S_{<i} = x_{<i}] \cdot \Pr_{S\sim\mathcal{P}^n}[S_i = x]}{\Pr[\Pi = \pi \mid S_{<i} = x_{<i}]} \\
&\le e^\epsilon \sum_x q(x)\cdot\Pr_{S\sim\mathcal{P}^n}[S_i = x] \\
&= e^\epsilon q(\mathcal{P})
\end{align*}
where the inequality follows from the definition of (ε, 0)-differential privacy. Symmetrically,
we can show that $\mathbb{E}_{S\sim\mathcal{Q}_\pi}[q(S_i) \mid S_{<i} = x_{<i}] \ge e^{-\epsilon} q(\mathcal{P})$. Therefore we have that:
$$e^{-\epsilon} q(\mathcal{P}) \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}[q(S_i) \mid S_{<i}] \le e^\epsilon q(\mathcal{P}).$$
Combining this with Equation 4 gives us that for any η > 0, with probability 1 − η when
$S \sim \mathcal{Q}_\pi$:
$$q(S) \le e^\epsilon q(\mathcal{P}) + \sqrt{\frac{2\ln(2/\eta)}{n}} \qquad\text{and}\qquad q(S) \ge e^{-\epsilon} q(\mathcal{P}) - \sqrt{\frac{2\ln(2/\eta)}{n}}. \qquad\blacktriangleleft$$
▶ Theorem 23. Suppose that M is (ε, 0)-differentially private and (α, β)-sample accurate.
Then for any η > 0 it is (α′, β′)-distributionally accurate for $\alpha' = \alpha + (e^\epsilon - 1) + \sqrt{\frac{2\ln(2/\eta)}{n}}$
and $\beta' = \beta + \eta$.
Proof. For a given π, let $j^*(\pi) = \arg\max_j |a_j - q_j(\mathcal{P})|$. By the triangle inequality we have:
\begin{align*}
|a_{j^*(\Pi)} - q_{j^*(\Pi)}(\mathcal{P})| &\le |a_{j^*(\Pi)} - q_{j^*(\Pi)}(S)| + |q_{j^*(\Pi)}(S) - q_{j^*(\Pi)}(\mathcal{P})| \\
&\le \max_j |a_j - q_j(S)| + |q_{j^*(\Pi)}(S) - q_{j^*(\Pi)}(\mathcal{P})|
\end{align*}
By the definition of (α, β)-sample accuracy, we have that with probability 1 − β, $\max_j |a_j - q_j(S)| \le \alpha$. The Resampling Lemma (Lemma 5) gives us that:
\begin{align*}
&\Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[|q_{j^*(\Pi)}(S) - q_{j^*(\Pi)}(\mathcal{P})| \ge (e^\epsilon - 1) + \sqrt{\tfrac{2\ln(2/\eta)}{n}}\right] \\
&\qquad= \Pr_{S\sim\mathcal{P}^n,\,\Pi\sim I(S),\,S'\sim\mathcal{Q}_\Pi}\left[|q_{j^*(\Pi)}(S') - q_{j^*(\Pi)}(\mathcal{P})| \ge (e^\epsilon - 1) + \sqrt{\tfrac{2\ln(2/\eta)}{n}}\right] \\
&\qquad= \mathbb{E}_{S\sim\mathcal{P}^n,\,\Pi\sim I(S)}\left[\Pr_{S'\sim\mathcal{Q}_\Pi}\left[|q_{j^*(\Pi)}(S') - q_{j^*(\Pi)}(\mathcal{P})| \ge (e^\epsilon - 1) + \sqrt{\tfrac{2\ln(2/\eta)}{n}}\right]\right] \\
&\qquad\le \eta
\end{align*}
A union bound over the two failure events completes the proof. ◀