DP-TBART
Abstract
1 Introduction
Though sharing tabular data about individuals can be broadly beneficial (for medical research, for
example), privacy concerns typically prevent such data sharing. Differentially private synthetic data
generation stands out as an appealing solution to this problem: it provides strong formal privacy
guarantees and avoids having to anticipate the exact analysis a downstream analyst might want to
conduct, while producing a synthetic dataset that “looks like” the real data from the perspective of
an analyst. This problem has received considerable attention from the research community, with a
wide variety of approaches available in the literature. Among these, approaches tend to fall into two
categories:
∗
Work done as an intern at Bloomberg.
• Deep learning-based approaches are inspired by successes in domains such as images and
text (Li et al., 2021) and train deep neural networks using differentially private optimizers
such as DP-SGD (Abadi et al., 2016) to directly fit the data.²
Differential privacy (DP), first introduced by Dwork et al. (2006), is a technique used to protect the
privacy of individuals in data analysis by preventing attackers from inferring specific information
from the data. While many formulations of differential privacy exist, we choose to follow the standard
formulation parameterized by ϵ and δ, with lower values providing stronger privacy guarantees. More
formally, a differentially-private algorithm is defined as follows:
Definition 2.1. Let A : 𝒟 → 𝒮 be a randomized algorithm, and let ϵ > 0, δ ∈ [0, 1]. We say that A
is (ϵ, δ)-differentially private if for any two neighboring datasets D, D′ ∈ 𝒟 differing by a single
record, we have that
$$P[A(D) \in S] \le e^{\epsilon} \, P[A(D') \in S] + \delta$$
for all S ⊂ 𝒮.
In this paper, we consider a “record” to be a row in a tabular dataset, but this varies depending on the
use case. Furthermore, we take “differing by a single record” to mean the addition/removal of an
arbitrary record from the dataset (this is known as “unbounded DP”), following the convention set in
previous works (Li et al., 2021; McKenna et al., 2022).
More recently, methods for training neural networks in a differentially private manner have been
proposed (Abadi et al., 2016; Papernot et al., 2016). In this work, we focus on DP-SGD, a method
that preserves privacy throughout training by clipping and perturbing gradients. The per-sample
gradients are first projected down to a radius C ball (with C designated as a hyperparameter) in
order to bound their sensitivities, and isotropic Gaussian noise is further added to ensure the privacy
guarantee is satisfied. The final gradient is obtained by taking the mean of the noised per-sample
gradients. Formally, the differentially private estimate of the gradient from a batch of n examples
with per-sample gradients gi is computed as
$$g := \frac{1}{n} \sum_{i=1}^{n} \min\left(\frac{C}{\|g_i\|}, 1\right) \cdot g_i + \mathcal{N}(0, C^2 \sigma^2 I)$$
where σ is pre-computed by performing a binary search for the smallest suitable σ that satisfies a
desired (ϵ, δ)-DP guarantee for a training run with a set batch size and number of training steps.
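The clipping and noising step above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in our experiments: the function names are hypothetical, σ is passed in directly instead of being found by binary search, and the noise is added to the sum of clipped gradients before dividing by n, which follows the formulation in Abadi et al. (2016) and is an assumption about how the formula should be read.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale grad by min(C / ||g||, 1) so that its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = 1.0 if norm == 0.0 else min(clip_norm / norm, 1.0)
    return [scale * x for x in grad]


def dp_gradient(per_sample_grads, clip_norm, sigma, rng):
    """Clip each per-sample gradient, sum, add N(0, C^2 sigma^2 I), divide by n.

    Assumption: noise is added once to the sum (as in Abadi et al., 2016).
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    total = [sum(g[j] for g in clipped) for j in range(dim)]
    noisy = [t + rng.gauss(0.0, clip_norm * sigma) for t in total]
    return [v / n for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
# With sigma = 0 the result is simply the mean of the clipped gradients.
print(dp_gradient(grads, clip_norm=1.0, sigma=0.0, rng=rng))
# With sigma > 0 each coordinate is perturbed by Gaussian noise of scale C * sigma / n.
print(dp_gradient(grads, clip_norm=1.0, sigma=1.0, rng=rng))
```

Because every clipped gradient has norm at most C, adding or removing one record changes the sum by at most C in L2 norm, which is what allows the Gaussian noise scale to be calibrated to C.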
2
In general, we follow the nomenclature set in Tao et al. (2021). For deep learning-based approaches, we
deviate because of the recent development of non-GAN approaches.
2.2 Tabular data generation with deep learning
Many methods for generating tabular data with deep learning have been proposed. Among these,
some of the most popular approaches include GANs (Xu et al., 2019; Zhao et al., 2021; 2022)
and, more recently, autoregressive models (Borisov et al., 2022; Canale et al., 2022), diffusion
models (Kotelnikov et al., 2022), normalizing flows (Lee et al., 2022), VAEs (Xu et al., 2019), and
MMD (Harder et al., 2021; Yang et al., 2023).
Unfortunately, while some of these works train with differential privacy, they often neglect to mention
the state-of-the-art marginal-based approaches, giving the false impression that these deep learning
approaches are state-of-the-art on the task. In contrast, our paper explicitly compares these two
approaches on a wide variety of datasets, shows that a performance gap between more recent deep
learning methods and marginal-based methods still exists, and demonstrates that autoregressive
models can bridge the gap.
Among the methods listed above, Canale et al. (2022) is the most similar to our method; it
presents a universal attention-based modeling strategy for generating synthetic hierarchical data.
However, despite the importance of differentially private tabular data generation, the paper presents
only a single experiment that validates that their model can achieve competitive performance on
this task on a single metric. We expand on these preliminary results and demonstrate that a similar
approach is applicable to a wider variety of datasets, and in some limited cases even outperforms
state-of-the-art marginal-based methods. Furthermore, we bolster our results with a theoretical
framework that sheds light on the specific limitations of marginal-based methods and on why deep
learning-based methods can sometimes outperform them. As such, this paper stands to be the first
to provide a comprehensive comparison between deep learning-based methods and marginal-based
methods on the task of differentially private tabular data generation.
3 Method
3.1 Preprocessing
Consistent pre-processing and consistent assumptions about what is considered public or private
information are crucial for fair comparisons between methods. In this work, we include datasets with both numeric
and categorical column types, and consider the column type (categorical or numerical) as well as
their “min” and “max” values to be public (in other words, no DP budget needs to be used to compute
max and min).
For the methods “AIM” (McKenna et al., 2022) and “DP-TBART”, we uniformly discretize the
numeric columns into 100 bins using the publicly available min and max values. Then, at inference
time we invert the discrete values to the midpoint of their original bin edges, rounding to integer
values if necessary. For the method “DPCTGAN” (Rosenblatt et al., 2020), values are normalized
based on the original min and max.
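The uniform discretization and its inversion can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, values equal to the public max are placed in the last bin, and the example range (17 to 90) is made up for illustration.

```python
def discretize(value, lo, hi, n_bins=100):
    """Map a numeric value to a bin index in [0, n_bins - 1] using public lo/hi."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    # Clamp so that value == hi falls into the last bin.
    return min(max(idx, 0), n_bins - 1)


def invert(idx, lo, hi, n_bins=100, as_integer=False):
    """Map a bin index back to the midpoint of its bin edges."""
    width = (hi - lo) / n_bins
    midpoint = lo + (idx + 0.5) * width
    return round(midpoint) if as_integer else midpoint


lo, hi = 17, 90  # hypothetical public min and max of a numeric column
for value in [17, 20.2, 53.5, 90]:
    idx = discretize(value, lo, hi)
    print(value, idx, invert(idx, lo, hi))
```

Under this scheme the round trip moves a value by at most half a bin width, (max − min) / 200, before any integer rounding.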
3.2 Autoregressive Modeling
We adopt a language modeling procedure (akin to many natural language modeling procedures
(Vaswani et al., 2017; Radford et al., 2019)) specifically tailored for the discrete tabular setting.
Let K be the number of columns in the given dataset. We factorize the probability distribution over a
row X = (X1 , X2 , . . . , XK ) with the chain rule:
$$P(X) = P(X_1) \prod_{k=2}^{K} P(X_k \mid X_{<k})$$
Note that here, values in each column are encoded as tokens in a shared vocabulary V while ensuring that
no two columns share an encoded value.³ As a consequence, we have that |V| = Σ_i |V_i|, where
V_i represents the token encodings for column i. The model is defined as a function f_θ that takes a
sequence in V^k (with 0 ≤ k < K) and outputs a vector of logits in ℝ^{|V|}. To force the model f_θ
to represent a probability distribution over only valid tokens P(X) and thus to output only valid
samples, we post-process the model output at each column i and mask out all elements irrelevant to
column i. That is, for some sequence x of length k, our revised logit outputs are
$$f_\theta(x) \odot m,$$
where ⊙ represents the element-wise (Hadamard) product and m sets all elements between index
Σ_{i=1}^{k} |V_i| and Σ_{i=1}^{k+1} |V_i| to one and −∞ everywhere else.
We sample from trained models directly, without any further post-processing on the distribution.
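The disjoint vocabulary, column masking, and column-by-column sampling described above can be sketched as follows. This is an illustration only: all function names are hypothetical, `dummy_logits` is a stand-in for a trained f_θ, and the mask is applied by overwriting out-of-column logits with −∞ rather than by a literal element-wise product (a substitution made here because multiplying negative logits by −∞ would flip their sign).

```python
import math
import random


def column_offsets(vocab_sizes):
    """Start index of each column's token block in the shared vocabulary."""
    offsets = [0]
    for size in vocab_sizes:
        offsets.append(offsets[-1] + size)
    return offsets  # offsets[-1] equals |V|


def mask_logits(logits, offsets, column):
    """Keep logits for tokens of `column`; set every other entry to -inf."""
    lo, hi = offsets[column], offsets[column + 1]
    return [x if lo <= j < hi else -math.inf for j, x in enumerate(logits)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]


def sample_row(logit_fn, vocab_sizes, rng):
    """Sample one row left to right, masking to the current column at each step."""
    offsets = column_offsets(vocab_sizes)
    row = []
    for column in range(len(vocab_sizes)):
        probs = softmax(mask_logits(logit_fn(row), offsets, column))
        token = rng.choices(range(offsets[-1]), weights=probs, k=1)[0]
        row.append(token)
    return row


def dummy_logits(prefix):
    """Hypothetical stand-in for f_theta: constant logits over |V| = 6 tokens."""
    return [0.0] * 6


print(sample_row(dummy_logits, [2, 3, 1], random.Random(0)))
```

Since masked tokens receive probability exactly zero after the softmax, every sampled token lies in the token block of its own column, so each generated row decodes to a valid record.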
We adopt a three-layer transformer decoder-only architecture (Vaswani et al., 2017) closely resembling
DistilGPT-2 (hidden dimension 768, 12 attention heads) with learned positional encodings and train
our models with DP-Adam, learning rate 3e-4, and batch size 256 for all of our experiments unless
otherwise indicated. During training, we randomly set aside 1% of the data for validation and treat
the rest as the training set. For more information, see the Appendix.
4 Experiments
4.1 Experimental Setup
4.1.1 Datasets
We use 10 datasets from various sources for our experiments. Please note that the columns designated
as the “Target Variable” are only used for evaluation purposes and are not given any special treatment
during training. In other words, the fact that a column is a target variable does not influence the
training process.
4.1.2 Methods
In addition to DP-TBART, we evaluate the following methods: AIM (McKenna et al., 2022), the
most recent marginal-based method; DP-NTK (Yang et al., 2023), the most recent deep learning-based
method; and DP-MERF (Harder et al., 2021), DP-HFlow (Lee et al., 2022), and DPCTGAN
(Rosenblatt et al., 2020), additional deep learning-based methods.
4.1.3 Evaluation
We evaluate using the following metrics, which are adapted from the Synthetic Data Vault
(SDV) (Patki et al., 2016).
3
Though we maintain this design choice for all datasets included in Table 2, we share tokens across columns
for the Dyck and character-level datasets.
Dataset       # of Rows   Categorical   Numeric   Problem          Target Variable
Adult         48,842      9             5         Classification   income>50K
King          21,613      7             13        Regression       price
Insurance     1,338       4             3         Regression       charges
Loan          5,000       7             7         Classification   Personal Loan
Credit        284,807     1             30        Classification   Class
Bank          45,211      10            7         Classification   y
Census        299,285     35            6         Classification   income
Car           1,728       7             0         Classification   class
Mushroom      8,124       23            0         Classification   poisonous
Poker Hands   25,010      11            0         Classification   10
• Kolmogorov-Smirnov Test (KS): For numeric columns, this metric computes the
Kolmogorov-Smirnov statistic KS (the maximum difference between the CDF of the
synthetic and real columns) and returns 1 − KS. This metric lies between 0 and 1, and
higher is better. We average this value across all numeric columns.
• Chi-Square Test (CS): For categorical columns, this metric returns the p-value from a
chi-square test that checks the null hypothesis that the synthetic data comes from the same
distribution as the real data. We average this value across all columns. This metric lies
between 0 and 1, and higher is better.
• Detection (DET): We train an XGBoost model to differentiate between real and synthetic
samples, and then report 1 − ROCAUC, where ROCAUC is computed as an average of
the ROC-AUC scores across several folds. This metric lies between 0 and 1, and higher is
better (we would like a classifier to not be able to distinguish synthetic data from real data).
• ML Efficacy (Regression): We fit a linear regression model to predict a target variable in
the dataset on synthetic data and evaluate its performance on the real dataset. This metric
returns the R² test score. This metric is at most 1, and higher is better.
• ML Efficacy (Classification): We fit a decision tree to predict a target variable in the dataset
on synthetic data and evaluate its performance on the real dataset. This metric returns the F1
test score.
Generally, we interpret the “KS” and “CS” metrics to capture how well a model is able to replicate
low-order marginals and the “DET” and “ML” metrics to capture the model’s ability to capture
higher-order, more complex correlations. We follow this rough intuition in our later analysis.
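As a concrete illustration of the KS metric, the sketch below computes 1 − KS for a single numeric column from the two empirical CDFs and averages it across columns. It is a plain-Python approximation of the definition given above, not the SDV implementation; the function names are hypothetical.

```python
import bisect


def ks_score(real, synthetic):
    """Return 1 - KS, where KS is the largest gap between the two empirical CDFs."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    ks = 0.0
    # The supremum of the gap between two step functions is attained at a data point.
    for x in sorted(set(real_sorted) | set(syn_sorted)):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect.bisect_right(syn_sorted, x) / len(syn_sorted)
        ks = max(ks, abs(cdf_real - cdf_syn))
    return 1.0 - ks


def mean_ks_score(real_columns, synthetic_columns):
    """Average 1 - KS over paired numeric columns."""
    scores = [ks_score(r, s) for r, s in zip(real_columns, synthetic_columns)]
    return sum(scores) / len(scores)


print(ks_score([1, 2, 3, 4], [1, 2, 3, 4]))      # identical columns
print(ks_score([1, 2, 3, 4], [1, 2, 3, 10]))     # one value shifted
print(ks_score([1, 2, 3], [10, 11, 12]))         # disjoint supports
```

Because the statistic depends only on each column's one-dimensional distribution, a high score says nothing about correlations between columns, which is why we treat KS as a measure of low-order marginal quality.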
4.2 Compute
Our experiments were conducted on a DGX box with 8 NVIDIA Tesla V100 GPUs, with each
individual model run executed on a single GPU. Each model required roughly 20 minutes to 3 hours
to train, depending on the input dataset size. This resulted in a total compute usage of approximately
200 GPU hours for our main experiments.
4.3 Benchmark
As shown in Table 2, DP-TBART substantially outperforms all four deep learning baselines included in our
study, reaching performance comparable with the leading marginal-based approach, AIM (McKenna
et al., 2022).
We use unbounded DP, following Li et al. (2021) and McKenna et al. (2022), for all approaches, with
ϵ = 1 and δ = 1e-9 for all datasets and approaches.
Method      KS ↑            CS ↑            DET ↑           ML (CLF) ↑      ML (REG) ↑
AIM         0.886 ± 0.001   0.997 ± 0.001   0.260 ± 0.002   0.594 ± 0.002   0.132 ± 0.057
DP-TBART    0.847 ± 0.001   0.978 ± 0.002   0.172 ± 0.004   0.554 ± 0.005   -0.264 ± 0.022
DPCTGAN     0.591 ± 0.006   0.831 ± 0.01    0.031 ± 0.001   0.412 ± 0.012   -2.83e4 ± 2.31e4
DP-MERF     0.505 ± 0.006   0.532 ± 0.003   0.0 ± 0.0       0.491 ± 0.014   -4.2e20 ± 2.5e20
DP-HFlow*   0.676 ± 0.004   0.622 ± 0.008   0.034 ± 0.004   0.543 ± 0.019   -0.603 ± 0.345
DP-NTK      0.349 ± 0.001   0.438 ± 0.006   0.0 ± 0.0       0.502 ± 0.011   -5.83e16 ± 4.76e16
Table 2: We evaluate AIM, DP-TBART, and four deep learning baselines against a variety of metrics,
and report the average scores (and their standard errors) across all datasets. Note that ML (REG)
may take on negative values if the regressor’s predictions perform worse on the real test data than a
baseline that knows only the true mean. (*) For DP-HFlow, we re-implement this method as no code
is publicly available.
Though AIM outperforms all other methods on our benchmark (see Table 2), a key limitation is that,
since it measures only low-order marginals, it cannot access the complex higher-order correlations
that deep learning-based methods can. In this section, we first propose a theoretical framework
that precisely characterizes this limitation and then use this framework to find two real-world settings
where marginal-based approaches fail to achieve good performance.
Preliminaries. To begin, assume a data-generating process yields a discrete distribution P̄(X)
defined on a sample space 𝒳 = (V_1, …, V_K), with the V_i mutually disjoint as defined in Section 3.2.
We refer to P̄(X) as the true distribution.
Next, a two-player adversarial game between two agents, modeler and nature, ensues. The modeler
is given access to the set of all marginals of order M, i.e., P̄(X_{i_1}, …, X_{i_M}) for all possible subsets
{i_1, …, i_M} ⊆ {1, 2, …, K}, but nothing else. This is consistent with a setting in which there is
infinite data and an infinite privacy budget.
The modeler must propose a distribution Q_M(X_1, X_2, …, X_K) such that all marginals match,
i.e., Q_M(X_{i_1}, …, X_{i_M}) = P̄(X_{i_1}, …, X_{i_M}) for all subsets {i_1, …, i_M}. We call the set of all
distributions such that all marginals match Π_M. Then, nature reacts to Q_M and chooses P_M ∈ Π_M.
The modeler is asked to minimize the log loss L(P_M, Q_M) = E_{P_M}[− log Q_M(x)], and nature
maximizes the same quantity. We further assume that when nature encounters ties along level sets
of f(P_M) = L(P_M, Q_M), they are broken by deferring to maximizing the quantity KL(Q_M || P_M)
(but that any choice based on a further tie is arbitrary).
Results. We first appeal to a set of results from Grünwald and Dawid (2004), which we restate as a
single lemma using our notation.
Lemma 4.1. (Grünwald and Dawid, 2004)
1. The modeler’s optimal estimate is uniquely given by the maximum entropy distribution
Q*_M = arg sup_{P ∈ Π_M} H(P), where H is Shannon entropy.
2. L(P, Q*_M) = L(P′, Q*_M) = H(Q*_M) for all P, P′ ∈ Π_M.
These initial results provide a couple of key insights. First, if a modeler acts optimally (in the sense
that they pick the QM that, under an adversarial response from nature, achieves the lowest log loss),
they will choose the maximum entropy distribution out of all of their available options: this in fact
mirrors the behavior of current state-of-the-art marginal-based methods (McKenna et al., 2019; 2022).
Second, given an optimal modeler, the log loss is the same no matter what distribution nature chooses,
and in fact the log loss is equal to the entropy of the modeler’s guess. This second insight will prove
useful in the proof for our central result, which we now state.
Theorem 4.2. KL(Q*_M || P) = H(Q*_M) − H(P) for any P ∈ Π_M.
Proof. By Lemma 4.1, we know that L(P, Q*_M) = H(Q*_M). The KL divergence is thus
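The step from Lemma 4.1 to the stated identity can be sketched as the following short derivation. It assumes that the divergence is evaluated with the expectation taken under P, since that is the quantity to which the log loss L(P, Q*_M) = E_P[−log Q*_M(x)] applies directly; this reading of the notation is an assumption rather than something stated explicitly in the text.

```latex
\begin{aligned}
\mathbb{E}_{P}\left[\log \frac{P(x)}{Q^{*}_{M}(x)}\right]
  &= \mathbb{E}_{P}\left[-\log Q^{*}_{M}(x)\right] - \mathbb{E}_{P}\left[-\log P(x)\right] \\
  &= L(P, Q^{*}_{M}) - H(P) \\
  &= H(Q^{*}_{M}) - H(P).
\end{aligned}
```

The first line splits the log ratio, the second substitutes the definitions of the log loss and of Shannon entropy, and the third applies part 2 of Lemma 4.1.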
Theorem 4.2 gives a closed-form expression for the KL divergence between the modeler’s estimated
distribution and a hypothetical real distribution, expressed in terms of the entropy of the estimated
distribution and the entropy of the real distribution. Theorem 4.2 thus highlights when we should
expect the estimate from a marginal-based method to be good or bad: is the real distribution
high-entropy or low-entropy? If the real distribution is high-entropy, then we should expect a
marginal-based approach to achieve low KL divergence; on the other hand, if the real distribution is
low-entropy, we should expect a marginal-based approach to provide a guess that is too “smoothed
out”, and far away from ground truth. Furthermore, it grounds the intuition that the less information
is provided to the modeler (i.e., the smaller M is), the larger the KL divergence: with fewer marginal
constraints, H(Q*_M) is larger, and thus so is the KL divergence. We continue
discussion of Theorem 4.2 in Sections 4.4.2 and 4.4.3.
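A small numerical check makes the high-entropy versus low-entropy contrast concrete. The sketch below takes M = 1 and two binary columns, where the maximum entropy distribution matching the first-order marginals is the product of those marginals. The function names and the two toy distributions are illustrative assumptions, and the divergence is computed with the expectation under the true distribution P, which is an assumed reading of the theorem's notation.

```python
import math
from itertools import product


def entropy(dist):
    """Shannon entropy in nats of a distribution given as {row: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def first_order_marginals(dist, n_cols):
    marginals = [dict() for _ in range(n_cols)]
    for row, p in dist.items():
        for k, value in enumerate(row):
            marginals[k][value] = marginals[k].get(value, 0.0) + p
    return marginals


def max_entropy_from_marginals(marginals):
    """Product of first-order marginals: the maximum entropy joint matching them."""
    joint = {}
    for row in product(*[sorted(m) for m in marginals]):
        joint[row] = math.prod(m[v] for m, v in zip(marginals, row))
    return joint


def expected_log_ratio(p, q):
    """E_P[log P(x) / Q(x)], with the expectation taken under p."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


# Low-entropy truth: the two columns are perfectly correlated.
p_low = {(0, 0): 0.5, (1, 1): 0.5}
# High-entropy truth: the two columns are independent and uniform.
p_high = {row: 0.25 for row in product([0, 1], repeat=2)}

for name, p_true in [("low-entropy", p_low), ("high-entropy", p_high)]:
    q_star = max_entropy_from_marginals(first_order_marginals(p_true, 2))
    gap = entropy(q_star) - entropy(p_true)
    print(name, gap, expected_log_ratio(p_true, q_star))
```

In the correlated case both quantities equal log 2, while in the independent case both are zero, matching the intuition that the marginal-based estimate is only far from the truth when the truth has low entropy relative to its marginals.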
With Theorem 4.2 in hand, we can now quickly derive an impossibility result that makes the key
limitation of marginal-based approaches precise.
Corollary 4.3. When M < K, KL(Q*_M || P_M) > 0.
Proof. When a marginal-based method is not given the full joint P̄(X_1, …, X_K), we have |Π_M| ≥ 2.
Since there is only one maximum entropy distribution Q*_M (Lemma 4.1), and since nature breaks
ties by choosing the distribution with the smallest entropy (Section 4.4.1 and Lemma 4.1), it
follows that H(Q*_M) > H(P_M), which implies that KL(Q*_M || P_M) > 0.
Since marginal-based approaches learn distributions from incomplete information (i.e., low-order
marginals), even in the most ideal of settings (infinite data and infinite privacy budget) they cannot
distinguish among the potentially many candidates for the true distribution. This leads to an error
that does not asymptotically vanish with more data or privacy budget.
In the next two sub-sections, we present two empirical settings that demonstrate this exact limitation
of marginal-based approaches.
Dyck Dataset. In this section, we compare DP-TBART to AIM on Dyck-k, a family of datasets we
define and artificially construct.⁴ Informally speaking, Dyck-k consists of all length-k strings that
belong to the Dyck formal language: the set of all strings with well-matched parentheses. Formally,
we can define Dyck as the set of all strings generated by the following context-free grammar:
X ↦ ( X ) X | ϵ
4
Note that in NLP, Dyck-k may refer to the language of nested brackets of k types. We consider only
languages with one bracket type.
Method % Valid ↑
AIM 0.517
DP-TBART 0.804
Table 3: We evaluate DP-TBART against AIM on Dyck-20 and report its performance with validity
as the metric.
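The validity metric in Table 3 amounts to checking whether each generated string has well-matched parentheses. A minimal sketch of such a check, together with an enumerator for all well-matched strings of a given length, is shown below; the function names are hypothetical, and the way strings are laid out as table rows is not modeled here.

```python
def is_dyck(s):
    """Return True if s consists only of well-matched parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with no matching opener
                return False
        else:
            return False
    return depth == 0


def fraction_valid(samples):
    """Fraction of samples that belong to the Dyck language."""
    return sum(is_dyck(s) for s in samples) / len(samples)


def dyck_strings(length):
    """Enumerate all well-matched strings of exactly the given length."""
    def extend(prefix, opened, closed):
        if len(prefix) == length:
            yield prefix
            return
        if opened < length // 2:
            yield from extend(prefix + "(", opened + 1, closed)
        if closed < opened:
            yield from extend(prefix + ")", opened, closed + 1)

    if length % 2 == 0:
        yield from extend("", 0, 0)


print(list(dyck_strings(4)))
print(fraction_valid(["()", ")(", "(()"]))
```

Validity depends on the whole string at once: no fixed subset of positions determines whether the brackets balance, which is the kind of higher-order structure that low-order marginals cannot capture.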
Table 4: We compare perplexities of outputs from DP-TBART with outputs from AIM on the
WikiText-103 dataset as evaluated by GPT-2.
[Figure 1 plot omitted; legend: AIM, DP-TBART; logarithmic x-axis spanning 10^2 to 10^5; y-axis spanning 0.74 to 0.86.]
Figure 1: We show the performance (DET metric) of DP-TBART and AIM when varying ϵ on the
Adult dataset.
5 Discussion
5.1 Deep Learning for Synthetic Data
Though synthetic data has been studied in the deep learning community, many studies overlook privacy
concerns by failing to implement differential privacy, leading to potential leakage of private data (Xu
et al., 2019; Zhao et al., 2021; 2022; Borisov et al., 2022; Kotelnikov et al., 2022). Additionally,
these studies have also often neglected to compare their methods with state-of-the-art approaches
(i.e., marginal-based).
In contrast, our work is the first to evaluate a deep learning method against the state-of-the-art across
several datasets under strict privacy guarantees. We contextualize our results with further analysis,
providing the first theoretical and empirical insights into the relative strengths of marginal-based
methods and deep learning-based methods for synthetic tabular data generation.
Though DP-TBART does not outperform AIM on standard benchmarks, our auxiliary results suggest
that deep learning-based methods can fill an important niche that marginal-based methods are not
able to. We are optimistic that, with further investigation, deep learning-based approaches can
surpass marginal-based methods in a broader range of scenarios, or even pave the way
for a hybrid method that incorporates the strengths of both approaches.
6 Conclusion
In this paper, we have presented an approach for differentially private synthetic tabular data generation
based on deep learning. We have shown that, in certain settings, our deep learning-based approach
is able to outperform state-of-the-art marginal-based approaches. We further provide a theoretical
framework that formalizes the limitations of marginal-based methods and two empirical settings
that demonstrate these limitations in action. Finally, we discuss the potential of deep learning-based
approaches for further exploration.
We believe that our work serves as a promising starting point for understanding the utility of deep
learning-based approaches for generating synthetic data under privacy guarantees, and hope that our
results may inspire further exploration in this area.
Though our theoretical framework allows us to understand the limitations of different modeling
strategies, it also rests on assumptions that are not satisfied in practice, e.g., infinite data and noiseless
marginal measurements. Despite these unrealistic assumptions, we believe that the framework
is intuitive enough to provide insight into practical settings (and confirm this with experiments,
see Section 4.4.3). As for DP-TBART itself, it currently handles only discrete (or discretized) data,
underperforms relative to the state-of-the-art marginal-based method, and requires GPU hardware to
run efficiently. Despite row-level differential privacy guarantees, outputs from DP-TBART could still
reflect underlying biases in the training data, which could lead to harm.
References
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep
learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on
computer and communications security, pages 308–318, 2016.
G. Arakelyan, G. Soghomonyan, and The Aim team. Aim, 6 2020. URL https://github.com/aimhubio/aim.
V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci. Language models are realistic
tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
C. M. Bowen and J. Snoke. Comparative study of differentially private synthetic data algorithms
from the nist pscr differential privacy synthetic data challenge. arXiv preprint arXiv:1911.12704,
2019.
L. Canale, N. Grislain, G. Lothe, and J. Leduc. Generative modeling of complex data. arXiv preprint
arXiv:2202.02145, 2022.
S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private
image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data
analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
W. Falcon and The PyTorch Lightning team. PyTorch Lightning, 3 2019. URL https://github.com/Lightning-AI/lightning.
P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust
bayesian decision theory. Annals of Statistics, 2004.
F. Harder, K. Adamczewski, and M. Park. Dp-merf: Differentially private mean embeddings with
random features for practical privacy-preserving data generation. In International conference on
artificial intelligence and statistics, pages 1819–1827. PMLR, 2021.
A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko. Tabddpm: Modelling tabular data with
diffusion models. arXiv preprint arXiv:2209.15421, 2022.
A. Kurakin, S. Chien, S. Song, R. Geambasu, A. Terzis, and A. Thakurta. Toward training at imagenet
scale with differential privacy. arXiv preprint arXiv:2201.12328, 2022.
J. Lee, M. Kim, Y. Jeong, and Y. Ro. Differentially private normalizing flows for synthetic tabular
data generation. 2022.
X. Li, F. Tramer, P. Liang, and T. Hashimoto. Large language models can be strong differentially
private learners. arXiv preprint arXiv:2110.05679, 2021.
T. Liu, G. Vietri, and S. Z. Wu. Iterative methods for private synthetic data: Unifying framework and
new methods. Advances in Neural Information Processing Systems, 34:690–702, 2021.
R. McKenna, D. Sheldon, and G. Miklau. Graphical-model based estimation and inference for
differential privacy. In International Conference on Machine Learning, pages 4435–4444. PMLR,
2019.
R. McKenna, B. Mullins, D. Sheldon, and G. Miklau. Aim: An adaptive and iterative mechanism for
differentially private synthetic data. arXiv preprint arXiv:2201.12677, 2022.
S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016.
S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing.
Decision Support Systems, 62:22–31, 2014.
N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge
transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
N. Papernot, S. Chien, S. Song, A. Thakurta, and U. Erlingsson. Making the shoe fit: Architectures,
initializations, and tuning for learning with privacy. 2019.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style, High-Performance
Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc,
E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages
8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International
Conference on Data Science and Advanced Analytics (DSAA), pages 399–410, Oct 2016. doi:
10.1109/DSAA.2016.49.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
L. Rosenblatt, X. Liu, S. Pouyanfar, E. de Leon, A. Desai, and J. Allen. Differentially private
synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537, 2020.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. In NeurIPS EMC2 Workshop, 2019.
Y. Tao, R. McKenna, M. Hay, A. Machanavajjhala, and G. Miklau. Benchmarking differentially
private synthetic data generation algorithms. arXiv preprint arXiv:2112.09238, 2021.
F. Tramer and D. Boneh. Differentially private learning needs better features (or much more data).
arXiv preprint arXiv:2011.11660, 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu,
C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-
Art Natural Language Processing. pages 38–45. Association for Computational Linguistics, 10
2020. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. Modeling tabular data using
conditional gan. Advances in Neural Information Processing Systems, 32, 2019.
Y. Yang, K. Adamczewski, D. J. Sutherland, X. Li, and M. Park. Differentially private neural tangent
kernels for privacy-preserving data generation. arXiv preprint arXiv:2303.01687, 2023.
J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. Privbayes: Private data release
via bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1–41, 2017.
Z. Zhao, A. Kunar, R. Birke, and L. Y. Chen. Ctab-gan: Effective table data synthesizing. In Asian
Conference on Machine Learning, pages 97–112. PMLR, 2021.
Z. Zhao, A. Kunar, R. Birke, and L. Y. Chen. Ctab-gan+: Enhancing tabular data synthesis. arXiv
preprint arXiv:2204.00401, 2022.
A Additional possible baselines
Despite the existence of CTAB-GAN+ (Zhao et al., 2022), a more recent deep learning approach that
claims differential privacy guarantees, we did not evaluate against this model.
The original paper does not contain a complete proof that it satisfies differential privacy, and in
fact there is reason to believe that it does not. The pre-processing pipeline is deterministic, even
for continuous and mixed-type columns, for which it learns a Gaussian mixture. Furthermore, the
privacy budget calculation remains the same even though CTAB-GAN+ both performs a modified
batch sampling scheme that upweights infrequent rows (training-by-sampling) and takes gradient
steps w.r.t. information loss, which is a deterministic function of real samples.
Note that while our paper does not contain an explicit proof of privacy, we have introduced no new
mechanisms (such as training-by-sampling or modified batch sampling above), and thus the proof
in Abadi et al. (2016) remains valid for our algorithm.
B.1 Adult
• Link: https://github.com/ryan112358/private-pgm/blob/master/data/adult.csv
• Provenance: We downloaded the above CSV file without any further pre-processing.
• Description: This dataset is originally sourced from the UCI Machine Learning repository
(Dua and Graff, 2017), but has been discretized in the process. The dataset contains
personal attributes of U.S. residents, sampled in a stratified manner. The data was extracted
from the 1994 U.S. Census.
• Licensing: Apache License 2.0
B.2 King
• Link: https://github.com/Team-TUD/CTAB-GAN-Plus/blob/main/Real_Datasets/king.csv
• Provenance: We downloaded the above CSV file without any further pre-processing.
• Description: The King dataset was originally obtained from this Kaggle dataset and contains
house attributes (sqft, # of bathrooms, etc.) labeled with their sale price between 2014 and
2015 for King County, Washington.
• Licensing: CC0: Public Domain
B.3 Insurance
• Link: https://www.kaggle.com/datasets/mirichoi0218/insurance
• Provenance: We downloaded ‘insurance.csv’ without any further pre-processing.
• Description: The Insurance dataset contains personal attributes (age, etc.) and U.S. medical
insurance costs.
• Licensing: Database Contents License
B.4 Loan
• Link: https://www.kaggle.com/datasets/itsmesunil/bank-loan-modelling
• Provenance: We downloaded ‘Bank_Personal_Loan_Modelling.xlsx’ and processed it into
a CSV.
• Description: The Loan dataset contains personal attributes from bank customers and a label
(whether they converted from a depositor to loan customer).
• Licensing: CC0: Public Domain
B.5 Credit
• Link: https://www.kaggle.com/mlg-ulb/creditcardfraud
• Provenance: We downloaded the ‘creditcard.csv’ with a Kaggle account with no additional
pre-processing.
• Description: The Credit dataset contains 30 numerical attributes describing credit card
transactions along with a label for whether they were fraudulent.
• Licensing: Database Contents License
B.6 Bank
• Link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• Provenance: We downloaded ‘bank-full.csv’ from the UCI Machine Learning repository
(Dua and Graff, 2017) with no additional pre-processing.
• Description: The Bank dataset (Moro et al., 2014) is derived from marketing campaigns from
a bank. It contains personal attributes about a client and whether the client has subscribed to
a product (bank term deposit).
• Licensing: Creative Commons Attribution 4.0 International
B.7 Census
• Link: https://archive.ics.uci.edu/ml/datasets/Census+Income
• Provenance: We downloaded ‘census.tar.gz’ from the UCI Machine Learning repository
(Dua and Graff, 2017) and extracted it. Then, we concatenated census-income.data
and census-income.test into a single file.
• Description: The Census dataset contains demographic and employment-related personal
attributes from the U.S. Census Bureau, where each row corresponds to a stratified sample
from the U.S. population.
• Licensing: Creative Commons Attribution 4.0 International
B.8 Car
• Link: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
• Provenance: We downloaded ‘car.data’ from the UCI Machine Learning repository (Dua
and Graff, 2017) and performed no further pre-processing.
• Description: The Car dataset has only categorical columns and each row describes attributes
of a different car model.
• Licensing: Creative Commons Attribution 4.0 International
B.9 Mushroom
• Link: https://archive.ics.uci.edu/ml/datasets/Mushroom
• Provenance: We downloaded ‘agaricus-lepiota.data’ from the UCI Machine Learning
repository (Dua and Graff, 2017) and performed no further pre-processing.
• Description: The Mushroom dataset contains descriptions of hypothetical samples
corresponding to 23 species of mushrooms.
• Licensing: Creative Commons Attribution 4.0 International
[Figure 2 image: a grid of heatmaps, one panel per dataset (the panel titles Adult, Credit, and Insurance survive extraction). x-axis: Batch Size (16, 64, 256, 1024); y-axis: Learning Rate (1e-05 to 0.01); each cell reports validation loss.]
Figure 2: We train models with varying learning rates and batch sizes and evaluate the validation loss
of each trained model.
C Hyperparameter Study
Given observations from past work in the image and text domains that networks trained with
differential privacy are sensitive to hyperparameters (Papernot et al., 2019; Li et al., 2021; De et al., 2022),
we investigate whether the same phenomena occur in the tabular domain as well.
Learning Rate & Batch Size Our learning rate and batch size sweeps reveal behaviors that mirror
those in networks trained on images (Kurakin et al., 2022; De et al., 2022) or text (Li et al., 2021):
performance is highly sensitive to learning rate and batch size, and increasing batch size almost never
hurts. See Figure 2 for a visualization of our results.
Clipping Norm We further study the behavior of the network under various clipping norms and
precisely explain the underlying mechanism that causes the observed behavior. As shown in Figure 3,
we find that the network behaves similarly to what has been empirically observed in images (Kurakin
et al., 2022) and text (Li et al., 2021): decreasing the clipping norm until all gradients are clipped
results in the best performance. Unlike prior work, however, we continue to evaluate the model at
smaller and smaller values of C, and observe that the performance of the network actually stays
constant and then begins to rapidly degrade.
While the plateauing in performance is explained by (1) the fact that Adam is gradient scale-invariant
(i.e., scaling each gradient g by some scalar does not impact parameter updates) and (2) the fact that
all gradients are clipped below C = 1, this does not adequately explain the rapid degradation of
performance when C reaches $10^{-5}$.
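The plateau argument can be illustrated with a minimal sketch of the clip-and-noise step, assuming the standard DP-SGD recipe in which each per-example gradient is scaled by min(1, C/‖g‖) and Gaussian noise with standard deviation σC is added to the sum. The function name `dp_sgd_gradient`, the toy gradients, and the seeds are hypothetical and are not taken from our implementation. Once every gradient is clipped, the noisy gradient is proportional to C, so a scale-invariant optimizer sees the same update direction for any such C.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, seed):
    """Clip each per-example gradient to clip_norm, sum, and add Gaussian noise.

    Returns the noisy summed gradient and the fraction of gradients clipped.
    """
    rng = random.Random(seed)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    num_clipped = 0
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # min(1, C / ||g||) leaves gradients with norm <= C untouched.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        num_clipped += norm > clip_norm
        for j, x in enumerate(g):
            total[j] += scale * x
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return noisy, num_clipped / len(per_example_grads)

# Hypothetical per-example gradients: 32 examples, 10 parameters.
data_rng = random.Random(0)
grads = [[data_rng.gauss(1.0, 1.0) for _ in range(10)] for _ in range(32)]

# Same noise seed in both calls, so the noise draws differ only by scale.
g_a, frac_a = dp_sgd_gradient(grads, clip_norm=1e-2, noise_multiplier=1.0, seed=1)
g_b, frac_b = dp_sgd_gradient(grads, clip_norm=1e-3, noise_multiplier=1.0, seed=1)
print(frac_a, frac_b)
print(all(math.isclose(a, 10.0 * b, rel_tol=1e-9) for a, b in zip(g_a, g_b)))
```

In this toy setting both clipping norms clip every gradient, and the two noisy gradients differ only by the factor of 10 between the clipping norms.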
While it may be tempting to conclude that such significant degradation of performance is due to
floating point imprecision, it turns out that the answer is much more subtle. Adam computes the
effective step $\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\epsilon = 10^{-8}$ in practice (this is Adam's
numerical-stability constant, not the privacy parameter) and $\hat{m}_t$ and $\hat{v}_t$ are numerical
estimates proportional to clipped gradient magnitude. When the gradient magnitudes reach the same
order of magnitude as $\epsilon$, the denominator of the fraction becomes dominated by $\epsilon$ and training begins
to falter, as seen in Figure 3.
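The effect of Adam's constant can be reproduced in isolation with a minimal scalar sketch of the Adam update, assuming the commonly used defaults β1 = 0.9, β2 = 0.999, and ϵ = 1e-8. The function name `adam_step_sizes`, the learning rate, and the toy gradient sequence are hypothetical, and DP noise is omitted for clarity.

```python
import math

def adam_step_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam step taken at each iteration for a scalar parameter."""
    m = 0.0
    v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

base_grads = [0.5, -1.0, 2.0, 1.5]  # hypothetical gradient sequence
reference = adam_step_sizes(base_grads)[-1]

# Rescale the whole gradient sequence, as shrinking the clipping norm would.
for scale in (1.0, 1e-3, 1e-6, 1e-8, 1e-10):
    step = adam_step_sizes([scale * g for g in base_grads])[-1]
    print(f"scale={scale:g}  step / reference step = {step / reference:.4f}")
```

The ratio stays close to 1 while the scaled gradients are much larger than ϵ and collapses toward 0 once they fall to the order of ϵ or below, matching the plateau followed by degradation described above.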
[Figure 3 image: Validation Loss and % clipped (0.0 to 1.0) plotted against Clipping Norm on a log scale from 10^-7 to 10^1.]
Figure 3: We show the impact of clipping norm on network performance on the Adult dataset.
D Software
For model implementation and training, we use PyTorch (Paszke et al., 2019) (BSD-3) and Lightning
(Falcon and The PyTorch Lightning team, 2019) (Apache License 2.0). For experiment tracking,
we use Aim (Arakelyan et al., 2020) (Apache License 2.0). For our transformer model, we adapted
DistilGPT-2 (Sanh et al., 2019) from Hugging Face Transformers (Wolf et al., 2020) (Apache License
2.0).
For our DP-SGD implementation, we use private-transformers (Apache License 2.0).
E Poisson sampling
While some previous works sample batches non-privately by shuffling the training dataset and picking
uniform batches (Li et al., 2021; Tramer and Boneh, 2020), we ensure end-to-end differential privacy
guarantees by implementing Poisson sampling. Under Poisson sampling, each of the n records is included
in a batch independently with probability p, so the batch size follows the Binomial distribution B(n, p).
Since B(n, p) concentrates tightly around its mean when n is large and p is small, we observed that
out-of-memory issues did not occur in practice, and micro-batching was not required.
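A minimal sketch of Poisson sampling follows, assuming each record is included in a batch independently with a fixed sampling rate. The function name `poisson_sample_batch` and the values of n and p are hypothetical and are not the settings used in our experiments.

```python
import random

def poisson_sample_batch(num_records, sampling_rate, rng):
    """Include each record index independently with probability sampling_rate."""
    return [i for i in range(num_records) if rng.random() < sampling_rate]

rng = random.Random(0)
n, p = 10_000, 0.01  # hypothetical dataset size and sampling rate

# Draw many batches and inspect how the (random) batch size is distributed.
sizes = [len(poisson_sample_batch(n, p, rng)) for _ in range(200)]
mean_size = sum(sizes) / len(sizes)
print(f"expected size = {n * p:.0f}, mean = {mean_size:.1f}, "
      f"min = {min(sizes)}, max = {max(sizes)}")
```

Unlike shuffling into fixed-size batches, the batch size here varies from step to step. Its standard deviation is sqrt(n p (1 - p)), roughly 10 for these values, so the largest batch stays close to the expected size of 100.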
F Column Ordering
One might observe that an autoregressive approach introduces an additional high-dimensional
hyperparameter: column ordering. In our experiments, we attempted several strategies (ordering columns
from ones with largest cardinalities to smallest, smallest to largest, and topologically sorting a graph
learned via structure learning), but observed no consistent impact on performance.
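The ordering strategies can be sketched as follows, assuming rows are given as dictionaries and the learned structure is given as a mapping from each column to its parent columns. The helper names, the toy table, and the toy graph are hypothetical, and the structure learning step itself is not shown.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def order_by_cardinality(rows, columns, descending=True):
    """Order columns by their number of distinct values."""
    cardinality = {c: len({row[c] for row in rows}) for c in columns}
    return sorted(columns, key=lambda c: cardinality[c], reverse=descending)

def order_by_toposort(parents):
    """Order columns so that every column appears after its parents."""
    return list(TopologicalSorter(parents).static_order())

# Hypothetical toy table.
rows = [
    {"age": 31, "sex": "F", "state": "WA"},
    {"age": 45, "sex": "M", "state": "WA"},
    {"age": 52, "sex": "F", "state": "NY"},
    {"age": 60, "sex": "M", "state": "CA"},
]
columns = ["age", "sex", "state"]
largest_first = order_by_cardinality(rows, columns, descending=True)
smallest_first = order_by_cardinality(rows, columns, descending=False)

# Hypothetical learned graph: each column maps to the columns it depends on.
parents = {"income": {"age", "education"}, "education": {"age"}, "age": set()}
topo_order = order_by_toposort(parents)
print(largest_first, smallest_first, topo_order)
```

Each function returns a column order that would then determine the sequence in which the autoregressive model generates the columns.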