DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation

Rodrigo Castellon∗
Department of Computer Science, Stanford University, Stanford, CA USA
rjcaste@stanford.edu

Achintya Gopal
Bloomberg, New York, NY USA
agopal6@bloomberg.net

Brian Bloniarz
Bloomberg, San Francisco, CA USA
bbloniarz@bloomberg.net

David Rosenberg
Bloomberg, Toronto, ON Canada
drosenberg44@bloomberg.net

arXiv:2307.10430v1 [cs.LG] 19 Jul 2023

Abstract

The generation of synthetic tabular data that preserves differential privacy is a problem of
growing importance. While traditional marginal-based methods have achieved impressive results,
recent work has shown that deep learning-based approaches tend to lag behind. In this work, we
present Differentially-Private TaBular AutoRegressive Transformer (DP-TBART), a transformer-based
autoregressive model that maintains differential privacy and achieves performance competitive
with marginal-based methods on a wide variety of datasets, capable of even outperforming
state-of-the-art methods in certain settings. We also provide a theoretical framework for
understanding the limitations of marginal-based approaches and where deep learning-based
approaches stand to contribute most. These results suggest that deep learning-based techniques
should be considered as a viable alternative to marginal-based methods in the generation of
differentially private synthetic tabular data.

1 Introduction

Though sharing tabular data about individuals can be broadly beneficial (for medical research, for
example), privacy concerns typically prevent such data sharing. Differentially private synthetic data
generation stands out as an appealing solution to this problem: it provides strong formal privacy
guarantees and avoids having to anticipate the exact analysis a downstream analyst might want to
conduct, while producing a synthetic dataset that “looks like” the real data from the perspective of
an analyst. This problem has received considerable attention from the research community, with a
wide variety of approaches available in the literature. Among these, approaches tend to fall into two
categories:

• Marginal-based approaches make differentially private measurements of low-order
marginals, typically represented as histograms, and then fit a generative model to these
histograms.
• Deep learning-based approaches are inspired by successes in domains such as images and
text (Li et al., 2021) and train deep neural networks using differentially private optimizers
such as DP-SGD (Abadi et al., 2016) to directly fit the data.²

∗ Work done as an intern at Bloomberg.

Recent works unambiguously establish marginal-based approaches as state-of-the-art, with
performance far surpassing that of approaches using neural networks, which typically fail to
adequately model even one-dimensional marginals (Tao et al., 2021; Bowen and Snoke, 2019).
Given the lack of success of deep learning relative to marginal-based approaches, we ask the following
question: is this significant gap due to a fundamental inadequacy in the deep learning approach to the
task? If not, how can we close this gap?
To answer this question, we present Differentially-Private TaBular AutoRegressive Transformer
(DP-TBART), a transformer-based autoregressive model explicitly tailored for the tabular domain, and we
show that this simple approach significantly bridges the gap on 10 separate datasets. In addition to
narrowing this gap, we provide preliminary results suggesting that this modeling approach serves
as a promising direction for downstream use cases that require accurately modeling higher-order
interactions among columns.

2 Background & Related Work

2.1 Differential Privacy

Differential privacy (DP), first introduced by Dwork et al. (2006), is a technique used to protect the
privacy of individuals in data analysis by preventing attackers from inferring specific information
from the data. While many formulations of differential privacy exist, we choose to follow the standard
formulation parameterized by ϵ and δ, with lower values providing stronger privacy guarantees. More
formally, a differentially-private algorithm is defined as follows:
Definition 2.1. Let $A : \mathcal{D} \to \mathcal{S}$ be a randomized algorithm, and let
$\epsilon > 0$, $\delta \in [0, 1]$. We say that $A$ is $(\epsilon, \delta)$-differentially
private if for any two neighboring datasets $D, D' \in \mathcal{D}$ differing by a single
record, we have that
$$P[A(D) \in S] \le e^{\epsilon} \, P[A(D') \in S] + \delta$$
for all $S \subseteq \mathcal{S}$.

In this paper, we consider a “record” to be a row in a tabular dataset, but this varies depending on the
use case. Furthermore, we take “differing by a single record” to mean the addition/removal of an
arbitrary record from the dataset (this is known as “unbounded DP”), following the convention set in
previous works (Li et al., 2021; McKenna et al., 2022).
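
To make Definition 2.1 concrete, the following is a minimal sketch of a mechanism that satisfies it with δ = 0 under unbounded DP: a row count released with Laplace noise. It is an illustration only and is not part of DP-TBART; the function names `private_count` and `laplace_density` are hypothetical. Adding or removing one row changes the count by at most 1, so noise of scale 1/ϵ bounds the density ratio between neighboring datasets by e^ϵ.

```python
import math
import random

def laplace_density(x, center, scale):
    """Density of a Laplace distribution with the given center and scale."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)

def private_count(rows, epsilon, rng):
    """Release len(rows) with Laplace noise of scale 1/epsilon.

    Under unbounded DP (add or remove one row), the count has sensitivity 1.
    The difference of two i.i.d. exponentials with rate epsilon is
    Laplace-distributed with scale 1/epsilon.
    """
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return len(rows) + noise

def max_density_ratio(n, epsilon, grid):
    """Largest ratio between output densities for datasets of size n and n + 1."""
    scale = 1.0 / epsilon
    ratios = []
    for x in grid:
        p = laplace_density(x, n, scale)
        q = laplace_density(x, n + 1, scale)
        ratios.append(max(p / q, q / p))
    return max(ratios)

if __name__ == "__main__":
    rng = random.Random(0)
    print(private_count(list(range(10)), epsilon=1.0, rng=rng))
    grid = [i * 0.25 for i in range(100)]
    print(max_density_ratio(10, 1.0, grid), "<=", math.exp(1.0))
```

The ratio check is the pointwise form of the inequality in Definition 2.1; integrating it over any output set S gives the stated bound.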
More recently, methods for training neural networks in a differentially private manner have been
proposed (Abadi et al., 2016; Papernot et al., 2016). In this work, we focus on DP-SGD, a method
that preserves privacy throughout training by clipping & perturbing gradients. The per-sample
gradients are first projected down to a radius C ball (with C designated as a hyperparameter) in
order to bound their sensitivities, and isotropic Gaussian noise is further added to ensure the privacy
guarantee is satisfied. The final gradient is obtained by taking the mean of the noised per-sample
gradients. Formally, the differentially private estimate of the gradient from a batch of n examples
with per-sample gradients gi is computed as

$$g := \frac{1}{n} \sum_{i=1}^{n} \min\left(\frac{C}{\lVert g_i \rVert}, 1\right) \cdot g_i + \mathcal{N}(0, C^2 \sigma^2 I)$$

where σ is pre-computed by performing a binary search for the smallest suitable σ that satisfies a
desired (ϵ, δ)-DP guarantee for a training run with a set batch size and number of training steps.
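
The clipping and noising step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code used in the paper: `dp_gradient` and `clip` are hypothetical names, gradients are plain lists of floats, and the noise is applied as the formula above is written (a single Gaussian draw with standard deviation Cσ added to the mean of the clipped gradients). Implementations differ in whether the noise is added before or after dividing by n, so treat that placement as an assumption.

```python
import math
import random

def clip(grad, C):
    """Scale grad so that its Euclidean norm is at most C."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(C / norm, 1.0) if norm > 0 else 1.0
    return [factor * g for g in grad]

def dp_gradient(per_sample_grads, C, sigma, rng):
    """Mean of clipped per-sample gradients plus isotropic Gaussian noise."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip(g, C) for g in per_sample_grads]
    mean = [sum(g[j] for g in clipped) / n for j in range(dim)]
    # Noise with standard deviation C * sigma in every coordinate.
    return [m + rng.gauss(0.0, C * sigma) for m in mean]

if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
    print(dp_gradient(grads, C=1.0, sigma=0.0, rng=rng))  # noise disabled
    print(dp_gradient(grads, C=1.0, sigma=1.0, rng=rng))
```

With C = 1 the first gradient is rescaled to unit norm while the second, already inside the ball, is left unchanged; this is what bounds each record's influence on the update.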

² In general, we follow the nomenclature set in Tao et al. (2021). For deep learning-based approaches, we
deviate because of the recent development of non-GAN approaches.
2.2 Tabular data generation with deep learning

Many methods for generating tabular data with deep learning have been proposed. Among these,
some of the most popular approaches include GANs (Xu et al., 2019; Zhao et al., 2021; 2022)
and, more recently, autoregressive models (Borisov et al., 2022; Canale et al., 2022), diffusion
models (Kotelnikov et al., 2022), normalizing flows (Lee et al., 2022), VAEs (Xu et al., 2019), and
MMD (Harder et al., 2021; Yang et al., 2023).
Unfortunately, while some of these works train with differential privacy, they often neglect to mention
the state-of-the-art marginal-based approaches, giving the false impression that these deep learning
approaches are state-of-the-art on the task. In contrast, our paper explicitly compares these two
approaches on a wide variety of datasets, shows that a performance gap between more recent deep
learning methods and marginal-based methods still exists, and demonstrates that autoregressive
models can bridge the gap.
Among the methods listed above, the one most similar to ours is that of Canale et al. (2022),
who present a universal attention-based modeling strategy for generating synthetic hierarchical
data. However, despite the importance of differentially private tabular data generation, their
paper presents only a single experiment, which validates on a single metric that their model can
achieve competitive performance on this task. We expand on these preliminary results and
demonstrate that a similar approach is applicable to a wider variety of datasets, and in some
limited cases even outperforms state-of-the-art marginal-based methods. Furthermore, we bolster
our results with a theoretical framework that sheds light on the specific limitations of
marginal-based methods, and on why deep learning-based methods can sometimes outperform them. As
such, this paper stands to be the first to provide a comprehensive comparison between deep
learning-based methods and marginal-based methods on the task of differentially private tabular
data generation.

2.3 Tabular data generation with marginal-based approaches

In general, marginal-based approaches make differentially-private measurements (e.g., column
histograms) and then fit these measurements with a model; each step may be performed exactly
once (McKenna et al., 2019) or, more often, the two are interleaved many times (Zhang et al.,
2017; Liu et al., 2021; McKenna et al., 2022).
Methods that fall under this approach distinguish themselves by (1) which measurements they make
and (2) the model used. For example, McKenna et al. (2019) propose using probabilistic graphical
models (PGMs) to represent the distribution. Other methods use Bayesian networks (Zhang et al.,
2017) or neural networks (Liu et al., 2021).
The inherent weakness of these methods is that they access information about the data
distribution only in the form of low-order marginals, because the cost of measuring marginals
grows exponentially with their order. This limitation forces marginal-based methods to prioritize
simple correlations at the expense of more complex interactions among columns. We make this
intuition more concrete in Section 4.4.1.

3 Method

3.1 Preprocessing

Consistent pre-processing and consistent assumptions about what is considered public or private
information are crucial for fair comparisons between methods. In this work, we include datasets
with both numeric and categorical column types, and consider each column's type (categorical or
numerical) as well as its “min” and “max” values to be public (in other words, no DP budget needs
to be used to compute max and min).
For the methods “AIM” (McKenna et al., 2022) and “DP-TBART”, we uniformly discretize the
numeric columns into 100 bins using the publicly available min and max values. Then, at inference
time we invert the discrete values to the midpoint of their original bin edges, rounding to integer
values if necessary. For the method “DPCTGAN” (Rosenblatt et al., 2020), values are normalized
based on the original min and max.
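
The discretization used for AIM and DP-TBART can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, values equal to the public maximum are assigned to the last bin, and integer rounding is exposed as an explicit flag.

```python
def discretize(value, lo, hi, n_bins=100):
    """Map a numeric value to one of n_bins uniform bins over the public range [lo, hi]."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * n_bins)
    # Clamp so that value == hi (and any out-of-range value) lands in a valid bin.
    return min(max(index, 0), n_bins - 1)

def invert(bin_index, lo, hi, n_bins=100, as_integer=False):
    """Map a bin index back to the midpoint of its bin edges."""
    width = (hi - lo) / n_bins
    midpoint = lo + (bin_index + 0.5) * width
    return round(midpoint) if as_integer else midpoint

if __name__ == "__main__":
    lo, hi = 17, 90  # hypothetical public min and max of an "age" column
    for age in (17, 38, 90):
        b = discretize(age, lo, hi)
        print(age, "->", b, "->", invert(b, lo, hi), invert(b, lo, hi, as_integer=True))
```

The round trip loses at most half a bin width per value, which is the price of treating every numeric column as a categorical column with 100 categories.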

3.2 Autoregressive Modeling

We adopt a language modeling procedure (akin to many natural language modeling procedures
(Vaswani et al., 2017; Radford et al., 2019)) specifically tailored for the discrete tabular
setting. Let $K$ be the number of columns in the given dataset. We factorize the probability
distribution over a row $X = (X_1, X_2, \ldots, X_K)$ with the chain rule:

$$P(X) = P(X_1) \prod_{k=2}^{K} P(X_k \mid X_{<k})$$

Note that here, values in each column encode to tokens in a shared vocabulary $V$ while ensuring
that no two columns share an encoded value.³ As a consequence, we have that $|V| = \sum_i |V_i|$,
where $V_i$ represents the token encodings for column $i$. The model is defined as a function
$f_\theta$ that takes a sequence in $V^k$ (with $0 \le k < K$) and outputs a vector of logits in
$\mathbb{R}^{|V|}$. To force the model $f_\theta$ to represent a probability distribution over
only valid tokens $P(X)$ and thus to output only valid samples, we post-process the model output
at each column $i$ and mask out all elements irrelevant to column $i$. That is, for some sequence
$x$ of length $k$, our revised logit outputs are

$$f'_\theta(x) := f_\theta(x) \odot m$$

where $\odot$ represents the element-wise (Hadamard) product and $m$ sets all elements between
index $\sum_{i=1}^{k} |V_i|$ and $\sum_{i=1}^{k+1} |V_i|$ to one and $-\infty$ everywhere else.
We sample from trained models directly, without any further post-processing on the distribution.
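
The column-wise encoding and masked sampling loop described in this section can be sketched as below. This is an illustrative sketch under assumptions, not the paper's implementation: `uniform_logits` is a hypothetical stand-in for the trained model, all function names are invented for the example, and the mask is applied additively (logits outside the current column's token range are set to −∞ before the softmax), which is one common way to realize this kind of masking.

```python
import math
import random

def build_offsets(column_sizes):
    """Give each column a disjoint token range in the shared vocabulary."""
    offsets, total = [], 0
    for size in column_sizes:
        offsets.append(total)
        total += size
    return offsets, total

def mask_logits(logits, start, end):
    """Keep logits for tokens in [start, end); set all others to -inf."""
    return [x if start <= j < end else -math.inf for j, x in enumerate(logits)]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def uniform_logits(prefix, vocab_size):
    """Hypothetical stand-in for the trained model: every token gets logit 0."""
    return [0.0] * vocab_size

def sample_row(logit_fn, column_sizes, rng):
    """Sample one row left to right, one token per column."""
    offsets, vocab_size = build_offsets(column_sizes)
    prefix = []
    for k, size in enumerate(column_sizes):
        logits = logit_fn(prefix, vocab_size)
        probs = softmax(mask_logits(logits, offsets[k], offsets[k] + size))
        token = rng.choices(range(vocab_size), weights=probs)[0]
        prefix.append(token)
    return prefix

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_row(uniform_logits, [3, 2, 4], rng))
```

Because masked tokens receive probability exactly zero, every sampled row is valid by construction: the token at position k always decodes to a category of column k.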

3.3 Implementation Details

We adopt a three-layer transformer decoder-only architecture (Vaswani et al., 2017) closely resembling
DistilGPT-2 (hidden dimension 768, 12 attention heads) with learned positional encodings and train
our models with DP-Adam, learning rate 3e-4, and batch size 256 for all of our experiments unless
otherwise indicated. During training, we randomly set aside 1% of the data for validation and treat
the rest as the training set. For more information, see the Appendix.

4 Experiments
4.1 Experimental Setup

4.1.1 Datasets
We use 10 datasets from various sources for our experiments. Please note that the columns designated
as the “Target Variable” are only used for evaluation purposes and are not given any special treatment
during training. In other words, the fact that a column is a target variable does not influence the
training process.

4.1.2 Methods
In addition to DP-TBART, we evaluate the following methods: AIM (McKenna et al., 2022), the
most recent marginal-based method; DP-NTK (Yang et al., 2023), the most recent deep
learning-based method; and DP-MERF (Harder et al., 2021), DP-HFlow (Lee et al., 2022), and
DPCTGAN (Rosenblatt et al., 2020), additional deep learning-based methods.

Dataset       # of Rows   # Categorical   # Numeric   Problem          Target Variable
Adult         48,842      9               5           Classification   income>50K
King          21,613      7               13          Regression       price
Insurance     1,338       4               3           Regression       charges
Loan          5,000       7               7           Classification   Personal Loan
Credit        284,807     1               30          Classification   Class
Bank          45,211      10              7           Classification   y
Census        299,285     35              6           Classification   income
Car           1,728       7               0           Classification   class
Mushroom      8,124       23              0           Classification   poisonous
Poker Hands   25,010      11              0           Classification   10

Table 1: Summary of datasets used in our experiments.

³ Though we maintain this design choice for all datasets included in Table 2, we share tokens across columns
for the Dyck and character-level datasets.

4.1.3 Evaluation
We evaluate using the following metrics, which are adapted from the Synthetic Data Vault
(SDV) (Patki et al., 2016).

• Kolmogorov-Smirnov Test (KS): For numeric columns, this metric computes the
Kolmogorov-Smirnov statistic KS (the maximum difference between the CDF of the
synthetic and real columns) and returns 1 − KS. This metric lies between 0 and 1, and
higher is better. We average this value across all numeric columns.
• Chi-Square Test (CS): For categorical columns, this metric returns the p-value from a
chi-square test that checks the null hypothesis that the synthetic data comes from the same
distribution as the real data. We average this value across all columns. This metric lies
between 0 and 1, and higher is better.
• Detection (DET): We train an XGBoost model to differentiate between real and synthetic
samples, and then report 1 − ROC-AUC, where ROC-AUC is computed as an average of
the ROC-AUC scores across several folds. This metric lies between 0 and 1, and higher is
better (we would like a classifier to not be able to distinguish synthetic data from real data).
• ML Efficacy (Regression): We fit a linear regression model to predict a target variable in
the dataset on synthetic data and evaluate its performance on the real dataset. This metric
returns the $R^2$ test score, which is at most 1; higher is better.
• ML Efficacy (Classification): We fit a decision tree to predict a target variable in the dataset
on synthetic data and evaluate its performance on the real dataset. This metric returns the F1
test score.

Generally, we interpret the “KS” and “CS” metrics to capture how well a model is able to replicate
low-order marginals and the “DET” and “ML” metrics to capture the model’s ability to capture
higher-order, more complex correlations. We follow this rough intuition in our later analysis.
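
As an illustration of the simplest of these metrics, the sketch below computes the KS score (1 − KS) for a single numeric column from two samples. It is a plain-Python approximation of the idea, not the SDV implementation, and the function name `ks_score` is hypothetical.

```python
import bisect

def ks_score(real, synthetic):
    """Return 1 - KS, where KS is the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synthetic_sorted = sorted(synthetic)
    ks = 0.0
    # The gap between two step functions can only change at observed values.
    for x in sorted(set(real_sorted) | set(synthetic_sorted)):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synthetic = bisect.bisect_right(synthetic_sorted, x) / len(synthetic_sorted)
        ks = max(ks, abs(cdf_real - cdf_synthetic))
    return 1.0 - ks

if __name__ == "__main__":
    print(ks_score([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
    print(ks_score([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping samples
    print(ks_score([1, 2, 3], [10, 11, 12]))     # disjoint samples
```

Averaging this score over all numeric columns gives a number in [0, 1] of the kind reported in the KS column.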

4.2 Compute

Our experiments were conducted on a DGX box with 8 NVIDIA Tesla V100 GPUs, with each
individual model run executed on a single GPU. Each model required roughly 20 minutes to 3 hours
to train, depending on the input dataset size. This resulted in a total compute usage of approximately
200 GPU hours for our main experiments.

4.3 Benchmark

As shown in Table 2, we substantially outperform all four deep learning baselines included in our
study, reaching performance comparable with the leading marginal-based approach, AIM (McKenna
et al., 2022).
We use unbounded DP (following Li et al. (2021) and McKenna et al. (2022)) for all approaches,
with a choice of ϵ = 1 and δ = 1e-9 for all datasets and approaches.

Method      KS ↑            CS ↑            DET ↑           ML (CLF) ↑      ML (REG) ↑
AIM         0.886 ± 0.001   0.997 ± 0.001   0.260 ± 0.002   0.594 ± 0.002   0.132 ± 0.057
DP-TBART    0.847 ± 0.001   0.978 ± 0.002   0.172 ± 0.004   0.554 ± 0.005   -0.264 ± 0.022
DPCTGAN     0.591 ± 0.006   0.831 ± 0.01    0.031 ± 0.001   0.412 ± 0.012   -2.83e4 ± 2.31e4
DP-MERF     0.505 ± 0.006   0.532 ± 0.003   0.0 ± 0.0       0.491 ± 0.014   -4.2e20 ± 2.5e20
DP-HFlow*   0.676 ± 0.004   0.622 ± 0.008   0.034 ± 0.004   0.543 ± 0.019   -0.603 ± 0.345
DP-NTK      0.349 ± 0.001   0.438 ± 0.006   0.0 ± 0.0       0.502 ± 0.011   -5.83e16 ± 4.76e16

Table 2: We evaluate AIM, DP-TBART, and four deep learning baselines against a variety of metrics,
and report the average scores (and their standard errors) across all datasets. Note that ML (REG)
may take on negative values if the regressor’s predictions perform worse on the real test data than a
baseline that knows only the true mean. (*) For DP-HFlow, we re-implement this method as no code
is publicly available.

4.4 Limitations of Marginal-based Approaches

Though AIM outperforms all other methods on our benchmark (see Table 2), a key limitation is that
since it measures only low-order marginals, it cannot access the complex higher-order correlations
that deep learning-based methods can. In this section, we will first propose a theoretical framework
that precisely characterizes this limitation and then use this framework to find two real-world settings
where marginal-based approaches fail to achieve good performance.

4.4.1 Theoretical Framework


We provide a theoretical analysis of marginal-based methods, which currently perform well on
standard benchmarks but are limited by construction in their ability to encode joint distributions
with multiple interactions between variables. Our information-theoretic framework demonstrates
this limitation and gives a closed-form expression for the model’s divergence in the ideal setting of
infinite data and privacy budget.

Preliminaries To begin, assume a data-generating process yields a discrete distribution
$\bar{P}(X)$ defined on a sample space $\mathcal{X} = (V_1, \ldots, V_K)$ with the $V_i$ disjoint
w.r.t. each other as defined in Section 3.2. We refer to $\bar{P}(X)$ as the true distribution.
Next, a two-player adversarial game containing two agents, modeler and nature, ensues. The modeler
is given access to the set of all marginals of order $M$, i.e., $\bar{P}(X_{i_1}, \ldots, X_{i_M})$
for all possible subsets $\{i_1, \ldots, i_M\} \subseteq \{1, 2, \ldots, K\}$, but nothing else.
This is consistent with a setting in which there is infinite data and an infinite privacy budget.
The modeler must propose a distribution $Q_M(X_1, X_2, \ldots, X_K)$ such that all marginals match,
i.e., $Q_M(X_{i_1}, \ldots, X_{i_M}) = \bar{P}(X_{i_1}, \ldots, X_{i_M})$ for all subsets
$\{i_1, \ldots, i_M\}$. We call the set of all distributions such that all marginals match $\Pi_M$.
Then, nature reacts to $Q_M$ and chooses $P_M \in \Pi_M$.
The modeler is asked to minimize the log loss $L(P_M, Q_M) = \mathbb{E}_{P_M}[-\log Q_M(x)]$, and
nature maximizes the same quantity. We further assume that when nature encounters ties along level
sets of $f(P_M) = L(P_M, Q_M)$, they are broken by deferring to maximizing the quantity
$\mathrm{KL}(Q_M \| P_M)$ (but that any choice based on a further tie is arbitrary).

Results We first appeal to a set of results from Grünwald and Dawid (2004), which we restate as a
single lemma using our notation.
Lemma 4.1. (Grünwald and Dawid, 2004)

1. The modeler’s optimal estimate is uniquely given by the maximum entropy distribution
$Q^*_M = \arg\sup_{P \in \Pi_M} H(P)$, where $H$ is Shannon entropy.
2. $L(P, Q^*_M) = L(P', Q^*_M) = H(Q^*_M)$ for all $P, P' \in \Pi_M$.

These initial results provide a couple of key insights. First, if a modeler acts optimally (in the sense
that they pick the $Q_M$ that, under an adversarial response from nature, achieves the lowest log loss),
they will choose the maximum entropy distribution out of all of their available options: this in fact
mirrors the behavior of current state-of-the-art marginal-based methods (McKenna et al., 2019; 2022).
Second, given an optimal modeler, the log loss is the same no matter what distribution nature chooses,
and in fact the log loss is equal to the entropy of the modeler’s guess. This second insight will prove
useful in the proof for our central result, which we now state.

Theorem 4.2. $\mathrm{KL}(Q^*_M \| P) = H(Q^*_M) - H(P)$ for any $P \in \Pi_M$.

Proof. By Lemma 4.1, we know that $L(P, Q^*_M) = H(Q^*_M)$. The KL divergence is thus

$$\mathrm{KL}(Q^*_M \| P) = L(P, Q^*_M) - H(P) = H(Q^*_M) - H(P),$$

proving the theorem.
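
The identity can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it takes K = 2 binary columns and M = 1, uses a hypothetical correlated distribution P, and relies on the standard fact that the maximum entropy distribution with fixed one-way marginals is the product of those marginals. The divergence is computed as the quantity from the proof, L(P, Q*) − H(P), which equals the expectation under P of log(P/Q*).

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def log_loss(p, q):
    """L(P, Q) = E_P[-log Q(x)]."""
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0)

def one_way_marginal(dist, axis):
    out = {}
    for x, p in dist.items():
        out[x[axis]] = out.get(x[axis], 0.0) + p
    return out

# Hypothetical true distribution over two correlated binary columns.
P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

# Maximum entropy distribution matching the first-order marginals of P.
m1, m2 = one_way_marginal(P, 0), one_way_marginal(P, 1)
Q_star = {(a, b): m1[a] * m2[b] for a, b in product([0, 1], repeat=2)}

# Divergence from the proof: L(P, Q*) - H(P) = E_P[log(P / Q*)].
divergence = sum(p * math.log(p / Q_star[x]) for x, p in P.items() if p > 0)
gap = entropy(Q_star) - entropy(P)

if __name__ == "__main__":
    print("L(P, Q*) =", log_loss(P, Q_star), " H(Q*) =", entropy(Q_star))
    print("divergence =", divergence, " H(Q*) - H(P) =", gap)
```

In this M = 1 case the gap is exactly the mutual information between the two columns, which is the dependence that first-order marginals cannot see.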

Theorem 4.2 gives a closed-form expression for the KL divergence between the modeler’s estimated
distribution and a hypothetical real distribution, expressed in terms of the entropy of the estimated
distribution and the entropy of the real distribution. Theorem 4.2 thus highlights when we should
expect the estimate from a marginal-based method to be good or bad: is the real distribution
high-entropy or low-entropy? If the real distribution is high-entropy, then we should expect a
marginal-based approach to achieve low KL divergence; on the other hand, if the real distribution is
low-entropy, we should expect a marginal-based approach to provide a guess that is too “smoothed
out”, and far away from ground truth. Furthermore, it grounds the intuition that the less information
that is provided to the model ($M$ is small), the larger the KL divergence. If less information is
provided, then $H(Q^*_M)$ will be larger, and thus the KL divergence will also be larger. We continue
discussion of Theorem 4.2 in Sections 4.4.2 and 4.4.3.
With Theorem 4.2 in hand, we can now quickly derive an impossibility result that makes the key
limitation of marginal-based approaches precise.
Corollary 4.3. When $M < K$, $\mathrm{KL}(Q^*_M \| P_M) > 0$.

Proof. When a marginal-based method is not given the full joint $\bar{P}(X_1, \ldots, X_K)$, then
$|\Pi_M| \ge 2$. Since there is only one maximum entropy distribution $Q^*_M$ (Lemma 4.1), and
since nature breaks ties by maximizing $\mathrm{KL}(Q^*_M \| P_M)$, which by Theorem 4.2 amounts to
choosing the distribution with the smallest entropy (Section 4.4.1 and Lemma 4.1), we have
$H(Q^*_M) > H(P_M)$, which implies that $\mathrm{KL}(Q^*_M \| P_M) > 0$.

Since marginal-based approaches learn distributions via incomplete information (i.e., low-order
marginals), even in the most ideal of settings (infinite data and infinite privacy budget) they
cannot tell which of the potentially many candidate distributions consistent with those marginals
is the true one: this leads to an error that does not asymptotically vanish with more data or
privacy budget.
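
A small worked case shows how large this irreducible error can be. The sketch below is an illustrative example that does not appear in the paper: three binary columns where the third is the XOR of the first two, with all second-order marginals (M = 2) known exactly. Every pairwise marginal of this distribution is uniform, so the uniform distribution over all eight outcomes also lies in Π_M, and since it is the maximum entropy distribution overall it is Q*. The entropy gap is then log 2, regardless of how much data is available.

```python
import math
from itertools import combinations, product

def entropy(dist):
    """Shannon entropy in nats of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(dist, idx):
    """Marginal distribution over the columns listed in idx."""
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

outcomes = list(product([0, 1], repeat=3))

# True distribution: X1, X2 are fair coins and X3 = X1 XOR X2.
P = {x: (0.25 if x[2] == (x[0] ^ x[1]) else 0.0) for x in outcomes}

# Uniform distribution over all 8 outcomes (the maximum entropy candidate).
Q_star = {x: 1.0 / 8.0 for x in outcomes}

pairs_match = all(
    marginal(P, idx) == marginal(Q_star, idx)
    for idx in combinations(range(3), 2)
)
gap = entropy(Q_star) - entropy(P)

if __name__ == "__main__":
    print("all pairwise marginals match:", pairs_match)
    print("H(Q*) - H(P) =", gap, " log 2 =", math.log(2))
```

Half of the probability mass of Q* falls on rows that violate the XOR constraint, which mirrors the invalid samples a marginal-based method produces on structured data.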
In the next two sub-sections, we present two empirical settings that demonstrate this exact limitation
of marginal-based approaches.

4.4.2 Selected Datasets


In this section, we present two datasets with complex joint interactions, Dyck-k and
WikiText-103 (Merity et al., 2016), in which DP-TBART is able to achieve better performance than
AIM under tight privacy guarantees. Both selected datasets are ideal testbeds for our approach
because they serve as worst-case scenarios for the marginal-based approach. Seen through the lens
of Theorem 4.2, measuring only low-order marginals leads to an estimated distribution with
significantly higher entropy than the ground truth distribution (which contains much redundancy):
this leads to a large KL divergence. Intuitively speaking, it is difficult for the marginal-based
method to guess the higher-order correlations accurately, so we would expect its performance to be
quite far from optimal.

Dyck Dataset In this section, we compare DP-TBART to AIM on Dyck-k, a family of datasets we
define and artificially construct.⁴ Informally speaking, Dyck-k consists of all length k strings that
belong to the Dyck formal language: the set of all strings with well-matched parentheses. Formally,
we can define Dyck as the set of all strings generated by the following context-free grammar:

$$X \mapsto ( X ) X \;\mid\; \epsilon,$$

where ϵ is the empty string.
Dyck-k is thus the set of all strings in Dyck with length k. To give a concrete example, “((()))” and
“(()())” belong to Dyck-6. On the other hand, “)(())(” does not belong to Dyck-6; it has length 6 but
does not have well-matched parentheses.

⁴ Note that in NLP, Dyck-k may refer to the language of nested brackets of k types. We consider only
languages with one bracket type.

Method      Valid ↑ (fraction of samples)
AIM         0.517
DP-TBART    0.804

Table 3: We evaluate DP-TBART against AIM on Dyck-20 and report its performance with validity
as the metric.

Method                   GPT-2 Perplexity ↓
AIM                      3948
DP-TBART                 2945
DP-TBART (no privacy)    2334

Table 4: We compare perplexities of outputs from DP-TBART with outputs from AIM on the
WikiText-103 dataset as evaluated by GPT-2.

To study the performance of our approach, we construct a comprehensive dataset consisting of all
strings in Dyck-20. Then, we evaluate each method by fitting it to the real dataset in a differentially
private manner (ϵ = 1 and δ = 1e-9), sampling from the fitted model, and measuring its performance
by counting the frequency with which it samples syntactically correct strings.
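
The dataset construction and the validity metric can be sketched as follows. This is a minimal illustration rather than the authors' code; the function names are hypothetical. The enumeration produces 16,796 strings for k = 20, the 10th Catalan number.

```python
def dyck_strings(k):
    """Enumerate all well-matched parenthesis strings of length k."""
    if k % 2:
        return []
    results = []

    def extend(prefix, open_count, remaining):
        if remaining == 0:
            results.append(prefix)
            return
        # Open a bracket only if enough characters remain to close it later.
        if open_count + 2 <= remaining:
            extend(prefix + "(", open_count + 1, remaining - 1)
        # Close a bracket only if one is currently open.
        if open_count > 0:
            extend(prefix + ")", open_count - 1, remaining - 1)

    extend("", 0, k)
    return results

def is_valid(s):
    """Check that s consists of well-matched parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False
        if depth < 0:
            return False
    return depth == 0

def valid_fraction(samples):
    """Fraction of sampled strings that are syntactically correct."""
    return sum(is_valid(s) for s in samples) / len(samples)

if __name__ == "__main__":
    print(len(dyck_strings(6)), len(dyck_strings(20)))
    print(valid_fraction(["((()))", "(()())", ")(())("]))
```

Each enumerated string becomes one row with k columns, one per character position, and `valid_fraction` applied to a model's samples gives the validity score.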
In Table 3, we show the performance of our approach along with AIM. We find that our approach
outperforms the competing marginal-based approach by a wide margin, achieving 80.4% accuracy in
sampling syntactically correct strings.
While Dyck-k is an artificially constructed dataset, it serves as a concrete empirical example of the
limitations of marginal-based methods: even in the best of circumstances, they may be unable to
capture the full complexity of a given joint distribution.

WikiText-103 We also compare DP-TBART to AIM on WikiText-103 (Merity et al., 2016), a
real-world text dataset of Wikipedia articles. We split WikiText-103 into contiguous 50-character
substrings and treat each substring as a row in our tabular dataset, where each column represents a
distinct character index.
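
The conversion from text to rows can be sketched as below. This is a minimal illustration with a hypothetical function name; how the authors treat a trailing remainder shorter than 50 characters is not stated, so dropping it here is an assumption.

```python
def text_to_rows(text, width=50):
    """Split text into contiguous, non-overlapping rows of `width` characters.

    Each row is a list of single characters, one per column. A trailing
    remainder shorter than `width` is dropped (an assumption of this sketch).
    """
    n_rows = len(text) // width
    return [list(text[i * width:(i + 1) * width]) for i in range(n_rows)]

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. " * 3
    rows = text_to_rows(sample)
    print(len(rows), "rows of", len(rows[0]), "columns")
    print("".join(rows[0]))
```

Viewed this way, every column is a categorical variable over characters, and the strong dependence between neighboring positions is what makes the dataset hard for methods that see only low-order marginals.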
We evaluate each model by fitting to the processed dataset, sampling, and then taking the average
of the perplexity of the sampled rows computed with GPT-2 (Radford et al., 2019). In Table 4, we
show the performance of our approach, along with AIM. We find that our approach substantially
outperforms AIM, achieving 25% lower perplexity, which suggests that the full joint distribution of
characters is better captured by our approach than by AIM.

4.4.3 Low Privacy Setting


As pointed out in Corollary 4.3, even under the most generous assumptions (i.e., perfect marginal
measurements, infinite data), marginal-based methods can fail to accurately capture the target
distribution. On the other hand, deep learning-based methods have no such intrinsic limitation. In
this section, we analyze what happens in practice when we approach this ideal noiseless setting by
letting the privacy budget grow to ϵ = 100 or greater.
In Figure 1, we show the performance of our approach and the leading marginal-based approach
(McKenna et al., 2022) when varying ϵ on the Adult dataset. We find that while both
approaches benefit from the increased privacy budget, our approach consistently outperforms the
marginal-based approach across values of ϵ exceeding 1000 on the DET metric, suggesting that our
approach is able to better capture higher-order correlations in the data when the privacy budget is
relaxed.

[Figure 1 plot omitted. Legend: AIM, DP-TBART. x-axis ticks from 10^2 to 10^5; y-axis ticks from 0.74 to 0.86.]

Figure 1: We show the performance (DET metric) of DP-TBART and AIM when varying ϵ on the
Adult dataset.

5 Discussion
5.1 Deep Learning for Synthetic Data

Though synthetic data has been studied in the deep learning community, many studies overlook privacy
concerns by failing to implement differential privacy, leading to potential leakage of private data (Xu
et al., 2019; Zhao et al., 2021; 2022; Borisov et al., 2022; Kotelnikov et al., 2022). Additionally,
these studies have also often neglected to compare their methods with state-of-the-art approaches
(i.e., marginal-based).
In contrast, our work is the first to evaluate a deep learning method against the state-of-the-art across
several datasets under strict privacy guarantees. We contextualize our results with further analysis,
providing the first theoretical and empirical insights into the relative strengths of marginal-based
methods and deep learning-based methods for synthetic tabular data generation.

5.2 The promise of deep learning-based approaches

Though DP-TBART does not outperform AIM on standard benchmarks, our auxiliary results suggest
that deep learning-based methods can fill an important niche that marginal-based methods are not
able to. We are optimistic that, through further investigation, deep learning-based approaches have
the potential to surpass marginal-based methods in a broader range of scenarios or even pave the way
for a hybrid method that incorporates the strengths of both approaches.

6 Conclusion
In this paper, we have presented an approach for differentially private synthetic tabular data generation
based on deep learning. We have shown that, in certain settings, our deep learning-based approach
is able to outperform state-of-the-art marginal-based approaches. We further provide a theoretical
framework that formalizes the limitations of marginal-based methods and two empirical settings
that demonstrate these limitations in action. Finally, we discuss the potential of deep learning-based
approaches for further exploration.
We believe that our work serves as a promising starting point for understanding the utility of deep
learning-based approaches for generating synthetic data under privacy guarantees, and hope that our
results may inspire further exploration in this area.

6.1 Limitations & Broader Impact

Though our theoretical framework allows us to understand the limitations of different modeling
strategies, it also rests on assumptions that are not satisfied in practice, e.g., infinite data and noiseless
marginal measurements. Despite these unrealistic assumptions, we believe that the framework
is intuitive enough to provide insight into practical settings (and confirm this with experiments,
see Section 4.4.3). As for DP-TBART itself, it currently handles only discrete (or discretized) data,
underperforms relative to the state-of-the-art marginal-based method, and requires GPU hardware to
run efficiently. Despite row-level differential privacy guarantees, outputs from DP-TBART could still
reflect underlying biases in the training data, which could lead to harm.

References
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep
learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on
computer and communications security, pages 308–318, 2016.
G. Arakelyan, G. Soghomonyan, and The Aim team. Aim, 6 2020. URL
https://github.com/aimhubio/aim.
V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci. Language models are realistic
tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
C. M. Bowen and J. Snoke. Comparative study of differentially private synthetic data algorithms
from the nist pscr differential privacy synthetic data challenge. arXiv preprint arXiv:1911.12704,
2019.
L. Canale, N. Grislain, G. Lothe, and J. Leduc. Generative modeling of complex data. arXiv preprint
arXiv:2202.02145, 2022.
S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private
image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
D. Dua and C. Graff. UCI machine learning repository, 2017. URL
http://archive.ics.uci.edu/ml.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data
analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
W. Falcon and The PyTorch Lightning team. PyTorch Lightning, 3 2019. URL
https://github.com/Lightning-AI/lightning.
P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust
bayesian decision theory. Annals of Statistics, 2004.
F. Harder, K. Adamczewski, and M. Park. Dp-merf: Differentially private mean embeddings with
random features for practical privacy-preserving data generation. In International conference on
artificial intelligence and statistics, pages 1819–1827. PMLR, 2021.
A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko. Tabddpm: Modelling tabular data with
diffusion models. arXiv preprint arXiv:2209.15421, 2022.
A. Kurakin, S. Chien, S. Song, R. Geambasu, A. Terzis, and A. Thakurta. Toward training at imagenet
scale with differential privacy. arXiv preprint arXiv:2201.12328, 2022.
J. Lee, M. Kim, Y. Jeong, and Y. Ro. Differentially private normalizing flows for synthetic tabular
data generation. 2022.
X. Li, F. Tramer, P. Liang, and T. Hashimoto. Large language models can be strong differentially
private learners. arXiv preprint arXiv:2110.05679, 2021.
T. Liu, G. Vietri, and S. Z. Wu. Iterative methods for private synthetic data: Unifying framework and
new methods. Advances in Neural Information Processing Systems, 34:690–702, 2021.
R. McKenna, D. Sheldon, and G. Miklau. Graphical-model based estimation and inference for
differential privacy. In International Conference on Machine Learning, pages 4435–4444. PMLR,
2019.
R. McKenna, B. Mullins, D. Sheldon, and G. Miklau. Aim: An adaptive and iterative mechanism for
differentially private synthetic data. arXiv preprint arXiv:2201.12677, 2022.
S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016.
S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing.
Decision Support Systems, 62:22–31, 2014.

N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge
transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
N. Papernot, S. Chien, S. Song, A. Thakurta, and U. Erlingsson. Making the shoe fit: Architectures,
initializations, and tuning for learning with privacy. 2019.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style, High-Performance
Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc,
E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages
8024–8035. Curran Associates, Inc., 2019. URL
http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International
Conference on Data Science and Advanced Analytics (DSAA), pages 399–410, Oct 2016. doi:
10.1109/DSAA.2016.49.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
L. Rosenblatt, X. Liu, S. Pouyanfar, E. de Leon, A. Desai, and J. Allen. Differentially private
synthetic data: Applied evaluations and enhancements. arXiv preprint arXiv:2011.05537, 2020.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. In NeurIPS EMC2 Workshop, 2019.
Y. Tao, R. McKenna, M. Hay, A. Machanavajjhala, and G. Miklau. Benchmarking differentially
private synthetic data generation algorithms. arXiv preprint arXiv:2112.09238, 2021.
F. Tramer and D. Boneh. Differentially private learning needs better features (or much more data).
arXiv preprint arXiv:2011.11660, 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu,
C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-
Art Natural Language Processing. pages 38–45. Association for Computational Linguistics, 10
2020. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. Modeling tabular data using
conditional gan. Advances in Neural Information Processing Systems, 32, 2019.
Y. Yang, K. Adamczewski, D. J. Sutherland, X. Li, and M. Park. Differentially private neural tangent
kernels for privacy-preserving data generation. arXiv preprint arXiv:2303.01687, 2023.
J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. Privbayes: Private data release
via bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1–41, 2017.
Z. Zhao, A. Kunar, R. Birke, and L. Y. Chen. Ctab-gan: Effective table data synthesizing. In Asian
Conference on Machine Learning, pages 97–112. PMLR, 2021.
Z. Zhao, A. Kunar, R. Birke, and L. Y. Chen. Ctab-gan+: Enhancing tabular data synthesis. arXiv
preprint arXiv:2204.00401, 2022.

A Additional possible baselines
Despite the existence of CTAB-GAN+ (Zhao et al., 2022), a more recent deep learning approach that
claims differential privacy guarantees, we did not evaluate against this model.
The original paper does not contain a complete proof that it satisfies differential privacy, and in
fact there is reason to believe that it does not. The pre-processing pipeline is deterministic, even
for continuous and mixed-type columns, for which it learns a Gaussian mixture. Furthermore, the
privacy budget calculation remains the same even though CTAB-GAN+ both performs a modified
batch sampling scheme that upweights infrequent rows (training-by-sampling) and takes gradient
steps w.r.t. information loss, which is a deterministic function of real samples.
Note that while our paper does not contain an explicit proof of privacy, we have introduced no new
mechanisms (such as training-by-sampling or modified batch sampling above), and thus the proof
in Abadi et al. (2016) remains valid for our algorithm.

B Dataset sources and provenance


For ease of reproducibility and to help others build on our work, we describe the provenance of our
datasets: where they came from and what pre-processing steps we performed (if any).

B.1 Adult
• Link: https://github.com/ryan112358/private-pgm/blob/master/data/adult.csv
• Provenance: We downloaded the above CSV file without any further pre-processing.
• Description: This dataset is originally sourced from the UCI Machine Learning repository
(Dua and Graff, 2017), but has been discretized in the process. The dataset contains
personal attributes of U.S. residents, sampled in a stratified manner. The data was extracted
from the 1994 U.S. Census.
• Licensing: Apache License 2.0

B.2 King
• Link: https://github.com/Team-TUD/CTAB-GAN-Plus/blob/main/Real_Datasets/king.csv
• Provenance: We downloaded the above CSV file without any further pre-processing.
• Description: The King dataset was originally obtained from a Kaggle dataset and contains
house attributes (sqft, # of bathrooms, etc.) labeled with their sale price between 2014 and
2015 for King County, Washington.
• Licensing: CC0: Public Domain

B.3 Insurance
• Link: https://www.kaggle.com/datasets/mirichoi0218/insurance
• Provenance: We downloaded ‘insurance.csv’ without any further pre-processing.
• Description: The Insurance dataset contains personal attributes (age, etc.) and U.S. medical
insurance costs.
• Licensing: Database Contents License

B.4 Loan
• Link: https://www.kaggle.com/datasets/itsmesunil/bank-loan-modelling
• Provenance: We downloaded ‘Bank_Personal_Loan_Modelling.xlsx’ and processed it into
a CSV.
• Description: The Loan dataset contains personal attributes from bank customers and a label
(whether they converted from a depositor to loan customer).
• Licensing: CC0: Public Domain

B.5 Credit
• Link: https://www.kaggle.com/mlg-ulb/creditcardfraud
• Provenance: We downloaded ‘creditcard.csv’ using a Kaggle account, with no additional
pre-processing.
• Description: The Credit dataset contains 30 numerical attributes describing credit card
transactions along with a label for whether they were fraudulent.
• Licensing: Database Contents License

B.6 Bank
• Link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• Provenance: We downloaded ‘bank-full.csv’ from the UCI Machine Learning repository
(Dua and Graff, 2017) with no additional pre-processing.
• Description: The Bank dataset (Moro et al., 2014) is derived from marketing campaigns from
a bank. It contains personal attributes about a client and whether the client has subscribed to
a product (bank term deposit).
• Licensing: Creative Commons Attribution 4.0 International

B.7 Census
• Link: https://archive.ics.uci.edu/ml/datasets/Census+Income
• Provenance: We downloaded ‘census.tar.gz’ from the UCI Machine Learning repository
(Dua and Graff, 2017) and extracted it. Then, we combined census-income.data
and census-income.test by concatenating both files together.
• Description: The Census dataset contains demographic and employment-related personal
attributes from the U.S. Census Bureau, where each row corresponds to a stratified sample
from the U.S. population.
• Licensing: Creative Commons Attribution 4.0 International

B.8 Car
• Link: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
• Provenance: We downloaded ‘car.data’ from the UCI Machine Learning repository (Dua
and Graff, 2017) and performed no further pre-processing.
• Description: The Car dataset has only categorical columns and each row describes attributes
of a different car model.
• Licensing: Creative Commons Attribution 4.0 International

B.9 Mushroom
• Link: https://archive.ics.uci.edu/ml/datasets/Mushroom
• Provenance: We downloaded ‘agaricus-lepiota.data’ from the UCI Machine Learning repository
(Dua and Graff, 2017) and performed no further pre-processing.
• Description: The Mushroom dataset contains descriptions of hypothetical samples corresponding
to 23 species of mushrooms.
• Licensing: Creative Commons Attribution 4.0 International

B.10 Poker Hands


• Link: https://archive.ics.uci.edu/ml/datasets/Poker+Hand
• Provenance: We downloaded ‘poker-hand-training-true.data’ from the UCI Machine Learning
repository (Dua and Graff, 2017) and performed no further pre-processing.
• Description: The Poker Hands dataset contains hands of five playing cards and their corresponding
class (in Poker).
• Licensing: Creative Commons Attribution 4.0 International

[Figure 2 panels: Adult, Credit, Insurance, Bank, King, Poker Hands, Car, Loan, Mushroom. Each panel
is a heatmap of validation loss, with Learning Rate (1e-05, 0.0001, 0.0003, 0.001, 0.003, 0.01) on the
vertical axis and Batch Size (16, 64, 256, 1024) on the horizontal axis.]

Figure 2: We train models with varying learning rates and batch sizes and evaluate the validation loss
of each trained model.

C Hyperparameter Study

Given observations from past work from image and text domains that networks trained with differential
privacy are sensitive to hyperparameters (Papernot et al., 2019; Li et al., 2021; De et al., 2022),
we investigate whether the same phenomena occur in the tabular domain as well.

Learning Rate & Batch Size. Our learning rate and batch size sweeps reveal behaviors that mirror
those in networks trained on images (Kurakin et al., 2022; De et al., 2022) or text (Li et al., 2021):
performance is highly sensitive to learning rate and batch size, and increasing batch size almost never
hurts. See Figure 2 for a visualization of our results.

Clipping Norm. We further study the behavior of the network under various clipping norms and
precisely explain the underlying mechanism that causes the observed behavior. As shown in Figure 3,
we find that the network behaves similarly to what has been empirically observed in images (Kurakin
et al., 2022) and text (Li et al., 2021): decreasing the clipping norm until all gradients are clipped
results in the best performance. Unlike prior work, however, we continue to evaluate the model at
smaller and smaller values of C, and observe that the performance of the network actually stays
constant and then begins to rapidly degrade.
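As a rough illustration of the quantity plotted as "% clipped" in Figure 3, the following sketch clips per-example gradients to a norm C and reports the fraction that were clipped. It follows the generic DP-SGD recipe of Abadi et al. (2016) and is not the paper's implementation; the function name, the toy gradients, and the `noise_multiplier` argument are assumptions made for illustration.

```python
import math
import random


def clip_and_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm `clip_norm`, sum the clipped
    gradients, add Gaussian noise, and average over the batch.

    Returns (noisy average gradient, fraction of examples that were clipped).
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    num_clipped = 0
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale factor is 1 when the gradient is already within the bound.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        if scale < 1.0:
            num_clipped += 1
        for i, x in enumerate(grad):
            total[i] += scale * x
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    batch_size = len(per_example_grads)
    return [v / batch_size for v in noisy], num_clipped / batch_size


# Hypothetical toy gradients: smaller clipping norms clip more examples.
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
for c in (10.0, 1.0, 0.1):
    _, fraction = clip_and_aggregate(toy_grads, c, 1.0, random.Random(0))
    print(f"C={c:g}  fraction clipped={fraction}")
```

Once C is small enough that every example is clipped, both the summed signal and the noise scale linearly with C, which is the regime the following paragraphs analyze.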
While the plateauing in performance is explained by (1) the fact that Adam is gradient scale-invariant
(i.e., scaling each gradient g by some scalar does not impact parameter updates) and (2) the fact that
all gradients are clipped below C = 1, this does not adequately explain the rapid degradation of
performance when C reaches $10^{-5}$.
While it may be tempting to conclude that such significant degradation of performance is due to
floating point imprecision, it turns out that the answer is much more subtle. Adam computes the
effective step $\Delta_t = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\epsilon = 10^{-8}$ in practice and $\hat{m}_t$ and $\sqrt{\hat{v}_t}$ are numerical
estimates proportional to clipped gradient magnitude. When the gradient magnitudes reach the same
order of magnitude as $\epsilon$, the denominator of the fraction becomes dominated by $\epsilon$ and training begins
to falter, as seen in Figure 3.
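The effect can be reproduced with a few lines of arithmetic. The sketch below computes the size of Adam's first update for a scalar gradient, using the standard Adam update rule with commonly used default hyperparameters (the specific values of `lr`, `beta1`, and `beta2` are assumptions, not the settings used in our experiments), and prints the step relative to the learning rate as the gradient shrinks.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first parameter update (t = 1) for a scalar gradient."""
    m = (1 - beta1) * grad           # first-moment estimate
    v = (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# While the gradient is much larger than eps, the step is nearly independent
# of the gradient scale; once it is comparable to eps, the step shrinks.
for g in (1.0, 1e-3, 1e-8, 1e-10):
    print(f"grad={g:g}  step/lr={adam_first_step(g) / 1e-3:.4f}")
```

For a single step the ratio reduces to g / (g + eps), so it stays close to 1 for gradients well above eps, falls to about one half when the gradient equals eps, and approaches zero below that.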

[Figure 3 plot: Validation Loss and the fraction of gradients clipped (% clipped, 0.0 to 1.0) plotted
against Clipping Norm on a log scale from $10^{-7}$ to $10^{1}$.]

Figure 3: We show the impact of clipping norm on network performance on the Adult dataset.

D Software
For model implementation and training, we use PyTorch (Paszke et al., 2019) (BSD-3) and Lightning
(Falcon and The PyTorch Lightning team, 2019) (Apache License 2.0). For experiment tracking,
we use Aim (Arakelyan et al., 2020) (Apache License 2.0). For our transformer model, we adapted
DistilGPT-2 (Sanh et al., 2019) from Hugging Face Transformers (Wolf et al., 2020) (Apache License
2.0).
For our DP-SGD implementation, we use private-transformers (Apache License 2.0).

E Poisson sampling
While some previous works sample batches non-privately by shuffling the training dataset and picking
uniform batches (Li et al., 2021; Tramer and Boneh, 2020), we ensure end-to-end differential privacy
guarantees by implementing Poisson sampling. Since the Binomial distribution B(n, p) concentrates
tightly around its mean when n is large and p is small, we observe that out-of-memory issues do not
occur in practice, and micro-batching was not required.
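A minimal sketch of Poisson sampling, using only the Python standard library, is shown below. Each row is included in a batch independently with probability p, so the batch size follows B(n, p). The function name and the values of n and p are hypothetical and chosen only to illustrate the concentration of batch sizes; this is not the code used in our experiments.

```python
import random


def poisson_sample_batch(num_rows, sampling_rate, rng):
    """Return the indices of rows selected for one batch.

    Every row is included independently with probability `sampling_rate`,
    so the batch size is random rather than fixed.
    """
    return [i for i in range(num_rows) if rng.random() < sampling_rate]


# Hypothetical dataset size and sampling rate (expected batch size n * p = 500).
rng = random.Random(0)
n, p = 50_000, 0.01
sizes = [len(poisson_sample_batch(n, p, rng)) for _ in range(20)]
print("smallest batch:", min(sizes), "largest batch:", max(sizes))
```

With these values the standard deviation of the batch size is sqrt(n p (1 - p)), roughly 22, so batches far larger than the expected 500 rows are very unlikely.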

F Column Ordering
One might observe that an autoregressive approach introduces an additional high-dimensional
hyperparameter: column ordering. In our experiments, we attempted several strategies (ordering columns
from ones with largest cardinalities to smallest, smallest to largest, and toposorting a graph learned
via structure learning), but observed no consistent impact on performance.
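For concreteness, the two cardinality-based orderings could be computed as in the sketch below. The helper name and the toy rows are hypothetical, and the structure-learning variant is not shown.

```python
def order_columns_by_cardinality(rows, descending=True):
    """Sort column names by their number of distinct values.

    `rows` is a list of dicts that all share the same keys.
    """
    columns = list(rows[0].keys())
    cardinality = {c: len({row[c] for row in rows}) for c in columns}
    return sorted(columns, key=lambda c: cardinality[c], reverse=descending)


# Hypothetical toy table with cardinalities age=3, sex=2, job=1.
toy_rows = [
    {"job": "a", "age": 30, "sex": "F"},
    {"job": "a", "age": 41, "sex": "M"},
    {"job": "a", "age": 52, "sex": "F"},
]
print(order_columns_by_cardinality(toy_rows, descending=True))
print(order_columns_by_cardinality(toy_rows, descending=False))
```

The resulting list would determine the order in which the autoregressive model generates the columns of each row.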
