SoK: A Review of Differentially Private Linear Models
Abstract—Linear models are ubiquitous in data science, but are particularly prone to overfitting and data memorization in high dimensions. To guarantee the privacy of training data, differential privacy can be used. Many papers have proposed optimization techniques for high-dimensional differentially private linear models, but a systematic comparison between these methods does not exist. We close this gap by providing a comprehensive review of optimization methods for private high-dimensional linear models. Empirical tests on all methods demonstrate robust and coordinate-optimized algorithms perform best, which can inform future research. Code for implementing all methods is released online.

Index Terms—differential privacy, high-dimensional, linear regression, logistic regression

I. INTRODUCTION

Linear models, like linear and logistic regression, are ubiquitous in current data science and data analytics efforts. Their simple structure enables them to train quickly and generalize well on simple problems. Additionally, they can be interpreted easily to understand model decision making, which is essential in regulated fields like medicine and finance. For example, linear regression has been used to predict future consumer demand, corporate resource requirements, and house prices [1], [2]. Logistic regression has been used for disease prediction and fraud detection [3], [4].

In the low-dimensional regime, where a dataset has many more datapoints than features, linear models typically generalize well without significant tuning. From an information-theoretic perspective, this is because the model has enough data to learn major trends in the dataset, which should generalize to future instances. This is why linear models are often most effective on simple, large datasets [5].

However, many modern datasets are high-dimensional, having more features than datapoints. This is common in genomics or finance, where the expression of many genes or prices of many assets outnumber individual observations. In this case, linear models can overfit to the data, producing poor generalization to future inputs [6].

A common solution to this problem uses regularization, which constrains the weight vector of these models. For example, a common goal is to constrain the number of nonzero coefficients of the model's weight. This both promotes generalization and produces a weight vector which highlights the features which are most strongly related to the model's output [7].

However, linear models' solutions can produce weights which rely on information contained in one or a few datapoints. This can be problematic for models trained on sensitive data, such as those used in medicine or banking, since the parameters of these models can leak information about these datapoints. Specifically, extensive research has demonstrated that membership inference attacks can reconstruct datapoints used in models with high accuracy given access to only the model's outputs or parameters. This is especially true in the high-dimensional regime, where models are easily capable of overfitting to data [8].

A theoretical solution to this problem employs differential privacy. Differential privacy is a statistical guarantee of the privacy of an algorithm's outputs, which ensures that an algorithm does not rely too heavily on information in any one datapoint. Differential privacy prevents membership inference and dataset reconstruction attacks on a trained model [9]. Differentially private linear and logistic regression have been studied extensively in the past two decades, but primarily from a theoretical perspective. Many optimization strategies and heuristics have been developed, but for most important statistical tasks, systematic comparisons between different algorithms have not been conducted.

This paper reviews methods to develop differentially private high-dimensional linear models. This task is fundamental in private statistics since linear models are typically the first models tested in a data science pipeline. Since differential privacy works best with simple models, using an appropriately trained linear model can avoid false conclusions on linear models' ineffectiveness for certain problems. Finally, differential privacy typically struggles with high-dimensional problems since noise can overwhelm the signals in these problems. This review seeks to identify whether specific optimization methods consistently improve performance on private high-dimensional problems.
(Approved for Public Release; Distribution Unlimited. PA Number: AFRL-2023-5408.)

Specifically, this review contributes the following:

1) We provide the first centralized review of all methods performing high-dimensional linear modeling with differential privacy, and provide insights about methods' strengths, weaknesses, and assumptions.

2) We implement methods reviewed in code and release all code at https://github.com/afrl-ri/differential-privacy-review. Many reviewed works do not release code, which can make it challenging for their methods to be tested and improved upon by other works. We close this gap.

3) We provide a systematic empirical comparison of each method's performance on a variety of datasets. Current differential privacy literature typically compares methods' theoretical utility. While this is a useful metric, empirical performance is often overlooked. In conducting these empirical tests, we find surprising and previously unidentified trends in performance which can influence future research.
Finally, we remark that in this article, the term "linear models" refers to both linear regression, where the output is an unconstrained numeric prediction (i.e., the mean-squared error loss), and logistic regression, where a binary label is desired. Most proposed methods in the literature support both, but some linear regression and logistic regression specific methods will also be evaluated.

II. PRELIMINARIES

In this section we provide brief overviews of differential privacy and nonprivate optimization methods for high-dimensional linear models. This provides context for our literature review in section III.
A. Differential Privacy

Differential privacy (DP) is a statistical guarantee of the privacy of an algorithm's outputs. Specifically, given an algorithm A, it is (ϵ, 0) DP if for all datasets D and D′ differing on one datapoint and any event E,

    P[A(D) ∈ E] / P[A(D′) ∈ E] ≤ e^ϵ.

In other words, DP bounds the amount A's output can depend on a single datapoint. Since it cannot rely too much on any individual datapoint, it cannot reveal significant information about a single individual in a dataset [9].

Implementing the above definition of DP for a nonprivate algorithm B requires bounding its L1 sensitivity. This is the maximum amount that B's output can change between two neighboring datasets as measured by the L1 norm. Mathematically, this is written as

    max_{D,D′ : dist(D,D′)=1} ∥B(D) − B(D′)∥1.
Once the sensitivity is known, Laplacian noise with appropriate scale can be added to B to produce a DP version of B [10]. However, adding Laplacian noise to B using its L1 sensitivity can destroy its efficacy. The noise can corrupt its outputs so much that it becomes useless. To address this issue, approximate DP was developed.
Approximate differential privacy, otherwise known as (ϵ, δ) DP, modifies the above definition of DP to

    P[A(D) ∈ E] ≤ e^ϵ P[A(D′) ∈ E] + δ,

where 0 < δ ≤ 1. In this case, a nonprivate algorithm B can be made (ϵ, δ) DP by adding noise from a normal distribution with scale proportional to B's L2 sensitivity. This method can add significantly less noise to B's outputs because the Gaussian distribution's tails are significantly lighter than the Laplace distribution's [11].
Finally, we mention some important properties of DP employed in the methods reviewed in section III. We begin with the post-processing property, which states that if A(D) is DP, then for any function g which does not re-access the dataset D, g(A(D)) is DP with the same privacy parameters [10].

Next, we discuss composition. Say A(D) is (ϵ, δ) DP, where 0 ≤ δ ≤ 1. Then under parallel composition, where D is broken into disjoint sets such that D1 ∪ D2 ∪ · · · ∪ Dk = D, releasing all outputs A(D1), . . . , A(Dk) is DP with the same parameters as A(D) [12]. In contrast, under sequential composition, if A1(D) is (ϵ1, δ1) DP and A2(D) is (ϵ2, δ2) DP, then releasing A1(D) and A2(D) is (ϵ1 + ϵ2, δ1 + δ2) DP [13].
Note that the above bound for sequential composition is called the basic composition theorem. Employing concentration inequalities, using a small value δ′, the advanced composition theorem states that we can produce a sublinear increase in ϵ. For example, for k-fold adaptive composition under (ϵ, 0) DP, the advanced composition theorem states that the output is (ϵ′, δ′) DP, where ϵ′ = 2ϵ√(2k log(1/δ′)). Similarly, for adaptive composition of (ϵ, δ) DP algorithms, the advanced composition theorem shows a k-fold composition is (ϵ′, kδ + δ′) DP [13].

The advanced composition theorem is tight for (ϵ, 0) DP, meaning that it produces the lowest privacy parameters which can generally be applied to any (ϵ, 0) DP mechanism. However, advanced composition is not tight for (ϵ, δ) DP where δ > 0 [14]. This has led to the adoption of Rényi and zero-concentrated DP, reformulations of DP which are naturally able to produce tight Gaussian mechanism composition for (ϵ, δ) mechanisms [15], [16].
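To make the gains concrete, the sketch below evaluates both bounds for k repeated (ϵ, 0) DP releases, using the simplified advanced composition form stated above; it illustrates the formulas only and is not a general-purpose privacy accountant:

    import numpy as np

    def basic_composition(epsilon, k):
        # Basic composition: privacy parameters add linearly.
        return k * epsilon

    def advanced_composition(epsilon, k, delta_prime):
        # Simplified advanced composition bound for k-fold (epsilon, 0) DP.
        return 2 * epsilon * np.sqrt(2 * k * np.log(1 / delta_prime))

    # For epsilon = 0.1 and k = 1000 releases with delta' = 1e-5, basic
    # composition yields epsilon' = 100, while advanced composition yields
    # epsilon' of roughly 30, at the cost of the small additive delta'.
    print(basic_composition(0.1, 1000), advanced_composition(0.1, 1000, 1e-5))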
Finally, many DP mechanisms have been developed, but here we mention one specifically: the report-noisy-max mechanism. The report-noisy-max mechanism is one application of the exponential mechanism, and it takes in a set of scores and is used to choose the object with the highest score. It is typically implemented by adding Laplacian noise to each of the scores and then choosing the one which is largest [17]. The report-noisy-max mechanism has found wide use in DP literature, and is used by some of the algorithms we review in section III.
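A minimal sketch of report-noisy-max follows, assuming each datapoint can change each score by at most `sensitivity`; the Lap(2Δ/ϵ) scale used here is a conservative calibration for general scores:

    import numpy as np

    def report_noisy_max(scores, sensitivity, epsilon):
        # Add Laplace noise to every score and release only the argmax index.
        scale = 2 * sensitivity / epsilon
        noisy = np.asarray(scores, dtype=float)
        noisy = noisy + np.random.laplace(0.0, scale, size=len(noisy))
        return int(np.argmax(noisy))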
1) Global and Local Differential Privacy: Traditional DP algorithms assume that individuals' raw datapoints are colocated on one machine. Only resulting analyses or models built from the data will be shared publicly with untrusted agents, so only these analyses or models must be noised. This framework is called global or central DP, in reference to the assumption that all individuals trust one entity to curate their raw data [18].

There are a few common methods to achieving DP under the global assumption. The first is output perturbation, in which the output of an algorithm is noised based on its sensitivity. This method is used often with traditional statistical estimators such as the mean, but it can be difficult to calculate the sensitivity of more complicated outputs like machine learning models [19], [20].

Another method is objective perturbation, which noises the objective function of an optimization problem. This method is flexible to modeling algorithms, but original works required the optimization problem to be solved exactly, which is challenging. More recent frameworks have extended objective perturbation to approximate solutions of optimization problems [20], [21].

Finally, gradient perturbation privatizes algorithms using gradient descent by noising each step of the optimization process. This method has been shown to work better than output and objective perturbation on many problems, and it is easily modified for the stochastic gradient descent case [22]. Its popularity is bolstered by the developments of Rényi and zero-concentrated DP, which allow for tight composition of gradient perturbation methods in the non-stochastic and stochastic settings [15], [16].
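The core of gradient perturbation is a clipped, noised gradient step. Below is a minimal sketch for linear regression with mean-squared error loss; the clipping bound and noise multiplier are illustrative hyperparameters, and a complete implementation would also track cumulative privacy loss with a Rényi or zero-concentrated DP accountant:

    import numpy as np

    def noisy_gradient_step(w, X, y, lr, clip_norm, sigma):
        # Per-example gradients for the MSE loss.
        residuals = X @ w - y                   # shape (n,)
        grads = residuals[:, None] * X          # shape (n, d)
        # Clip each per-example gradient to L2 norm clip_norm.
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip_norm)
        # Add Gaussian noise calibrated to the clip bound, then average.
        noise = np.random.normal(0.0, sigma * clip_norm, size=w.shape)
        noisy_grad = (grads.sum(axis=0) + noise) / len(y)
        return w - lr * noisy_grad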
However, when individuals do not trust a central actor to curate their data, local DP must be used. In this case, individuals noise data prior to sending it to a central server, thus ensuring that curators of the central server cannot re-identify the data's sensitive attributes. A number of noising algorithms exist, but the main drawback of local DP is reduced accuracy and utility [18].

To combat reduced accuracy and utility while avoiding trust of the central actor, shuffle differential privacy was developed. In shuffle differential privacy, each individual's data is shuffled prior to being provided to the actor so it cannot be readily linked back to a single user through a specific connection. This allows for reducing the noise added to each individual's data, which can improve accuracy and utility. It can be shown that shuffle differential privacy is stronger than global differential privacy and weaker than local differential privacy [23]. One common specific application of shuffle differential privacy is aggregate differential privacy, which sums many local differentially private metrics prior to providing this result to an untrusted actor. Aggregate differential privacy has been used in real-world systems for tracking user statistics and application performance [24].

In this review, we mention methods for high-dimensional private linear modeling under both the global and local models. However, since methods for high-dimensional linear modeling under local DP are nascent, we only test models built under global DP. Finally, note that we are not aware of any methods which employ shuffle differential privacy for high-dimensional linear models.
2) Sparsity, Stability, and Differential Privacy: Heuristically, an algorithm which is stable produces similar outputs on similar datasets. Algorithmic stability is a well-studied property in statistics and machine learning as it enables a mathematical analysis of algorithm performance [25].

DP algorithms are inherently related to algorithmic stability, as a DP algorithm cannot have a large change in its output when one of its datapoints changes. Indeed, Dwork and Lei formalized this intuition when they studied the relationship between DP and robust statistics [12].

Unfortunately, Xu et al. found that sparse algorithms cannot be stable [25]. Applying this to our case of DP, this means that, in some sense, DP algorithms are at odds with sparsity.

We include this discussion here to highlight the fundamental challenge of producing a sparse DP estimator. While both sparsity and DP are useful tools for data analysis, an effective combination of both is a difficult and open problem. Some of the methods reviewed in this work take steps to resolve this problem.
B. High-Dimensional Optimization for Linear Models

In this section, we provide a brief overview of high-dimensional optimizers for linear and logistic regression. Optimization methods for high-dimensional problems are an important open problem, so we focus on well-established methods relevant to section III here.

First, note that in high-dimensional problems, it is easy for models to overfit the training data. If n is the number of datapoints and d is the number of features, high-dimensional problems have n < d, so the number of coefficients in the weight vector exceeds the number of targets [26]. For this reason, we must constrain the weight space to be sparse to avoid overfitting [27]. Note that literature on nonprivate optimization often considers sparsity to be a de-facto requirement for high-dimensional problems since it improves utility of solutions greatly; however, as discussed in the prior section, maintaining sparsity with DP can be challenging.
Two common approaches are used to avoid overfitting in high-dimensional linear models. The first is an L0 constraint on the weight vector, which controls the number of nonzero components in the weight vector. Under this constraint, a linear problem with per-example loss function ℓ(x_i, y_i; w) can be written as

    arg min_{w : ∥w∥0 ≤ k} Σ_{i=1}^{N} ℓ(x_i, y_i; w).

To solve this problem, iterative gradient hard thresholding can be used. This is a greedy method which retains only the top-k coefficients of the weight vector after each gradient step. Since it is greedy, it is not guaranteed to converge to an optimal solution. However, it approaches the L0 constrained problem in a computationally efficient manner [28].
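A minimal nonprivate sketch of iterative gradient hard thresholding for the MSE loss follows; private variants reviewed later additionally clip and noise the gradient before the thresholding step:

    import numpy as np

    def hard_threshold(w, k):
        # Keep the k largest-magnitude coefficients of w; zero out the rest.
        out = np.zeros_like(w)
        top_k = np.argsort(np.abs(w))[-k:]
        out[top_k] = w[top_k]
        return out

    def ight(X, y, k, lr=0.1, iters=100):
        # Iterative gradient hard thresholding for the MSE loss.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            grad = X.T @ (X @ w - y) / n
            w = hard_threshold(w - lr * grad, k)
        return w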
L1 constraints are used more often. In this setting, we solve

    arg min_{w : ∥w∥1 ≤ k} Σ_{i=1}^{N} ℓ(x_i, y_i; w).

This is a relaxation of the L0 constraint which often produces sparse solutions and can be solved efficiently with a number of optimizers; however, the nonzero components of its solutions are biased towards 0, which can reduce the utility of the output [29].

Solving this problem can be done with a number of optimizers. Most commonly, it is solved with the Frank-Wolfe algorithm, which iteratively takes steps towards the vertex of the L1-ball of radius k which has minimum loss [30]. Compressed learning, in which the dimensionality of the dataset is reduced by multiplication with a random matrix, can also be used [31].
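A minimal nonprivate Frank-Wolfe sketch for the MSE loss on the L1-ball of radius k is below. The minimizing vertex is ±k·e_j for the coordinate j with the largest-magnitude gradient; private variants typically replace this selection with a noisy mechanism such as report-noisy-max:

    import numpy as np

    def frank_wolfe_l1(X, y, k, iters=100):
        # Frank-Wolfe over the L1-ball of radius k for the MSE loss.
        n, d = X.shape
        w = np.zeros(d)
        for t in range(iters):
            grad = X.T @ (X @ w - y) / n
            j = np.argmax(np.abs(grad))     # vertex with minimum linearized loss
            vertex = np.zeros(d)
            vertex[j] = -k * np.sign(grad[j])
            step = 2.0 / (t + 2.0)          # standard Frank-Wolfe step size
            w = (1 - step) * w + step * vertex
        return w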
Note that the L1 constrained problem is often written as an L1 regularized problem. Specifically, a linear problem can be written as

    arg min_w Σ_{i=1}^{N} ℓ(x_i, y_i; w) + λ∥w∥1.

There is an exact duality between the constrained and regularized setting, where a specific value of k corresponds to a value of λ. However, the closed form conversion from k to λ is not known [32].

To solve the regularized problem, coordinate descent can be used. This algorithm iterates through each coefficient of the weight vector, making gradient updates for only that coefficient. More recent algorithms like the alternating direction method of multipliers can also be used, which split the smooth and nonsmooth parts of the regularized optimization problem for better initial optimization [33], [34].
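The private methods reviewed later use greedy coordinate selection; the cyclic version below is a minimal nonprivate sketch of coordinate descent for the L1-regularized (lasso) linear regression problem above, using the standard soft-thresholding update and assuming the columns of X are normalized:

    import numpy as np

    def soft_threshold(z, lam):
        # Proximal operator of the L1 penalty.
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def coordinate_descent_lasso(X, y, lam, iters=100):
        # Minimizes (1/2n) * ||Xw - y||^2 + lam * ||w||_1, assuming each
        # column satisfies (1/n) * X[:, j] @ X[:, j] = 1.
        n, d = X.shape
        w = np.zeros(d)
        residual = y.astype(float).copy()
        for _ in range(iters):
            for j in range(d):
                residual += X[:, j] * w[j]    # remove coordinate j's contribution
                rho = X[:, j] @ residual / n  # partial correlation with residual
                w[j] = soft_threshold(rho, lam)
                residual -= X[:, j] * w[j]    # restore with updated weight
        return w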
III. REVIEW

This section provides a literature review of algorithms used for high-dimensional DP linear models. The section is organized by optimization technique, and is in roughly chronological order of the first paper considering each technique. Figure 1 provides an overview of these methods, and employs a color-coding we will use in coming sections. Certain optimization techniques have been considered by multiple works, and as such, their subsections are longer than others. Throughout the sections, we highlight potential strengths and weaknesses of each algorithm. These algorithms are then tested empirically in section V.

In conducting this literature review, we searched for papers considering "sparse," "high-dimensional," "L1," or "L0" DP regression. After extensive reading, we identified the works presented in this section as relevant to our review.

Finally, all methods presented in this section employ the unbounded Hamming distance for their adjacency relation. In other words, the definition of differential privacy is employed when D and D′ differ by one (extra or removed) row. This ensures that the privacy produced by all methods is directly comparable.
A. Model Selection

High-dimensional DP linear models were first considered in 2012 by Kifer et al. [21]. Their work is split into two parts: the first developed a generalized objective perturbation mechanism which employs L2 regularization to achieve a strongly-convex loss function. They then derive the utility of their algorithm under both (ϵ, 0) DP and (ϵ, δ) DP, and show that using (ϵ, δ) DP saves a factor of √d in utility.

Their utility analysis shows that this mechanism is not directly applicable to high-dimensional convex optimization. In the second part of their work, they assume that high-dimensional data can be explained by a model with sparsity s, and derive algorithms to select a support set of size s prior to running their generalized objective perturbation mechanism. Note that they make a number of additional assumptions when calculating utility, including those on the norm of the dataset and targets along with the restricted strong convexity of the dataset. However, these assumptions are not necessary to guarantee privacy.

The first algorithm they analyze is derived from the exponential mechanism. Roughly, this method calculates the loss of each support set with size s, and uses the exponential mechanism to choose the support set with minimum loss. After choosing the support set, these features are passed into the generalized objective perturbation mechanism. They prove utility of their method when n = ω(s³ log d), but it is computationally inefficient since the support selection step must calculate the models of all (d choose s) possible support sets. This can be compared to the inefficiency of L0-constrained minimization in the nonprivate setting.

Their second algorithm uses the sample and aggregate framework to split the training examples into √n blocks and compute the optimal support on each block. After all √n blocks are run, each of the d features has a score associated with it: the number of blocks which chose to include it in their support. A private mechanism can be used to return the top-s features, which are then passed into the generalized objective perturbation mechanism. They prove utility of this method when √n = ω(s² log d), but require an assumption that each of the √n blocks follows the restricted strong convexity assumption, which is strictly stronger than the assumption for the whole dataset.

Since their method for high-dimensional regression involves two stages, we call it TS. Note that because the method is split into two steps, the privacy budget must also be split between these steps. In their paper, Kifer et al. choose to split the budget into (ϵ/2, 0) for the support selection step and (ϵ/2, δ) for the perturbation mechanism. We follow the same convention in our experiments. Additionally, our experiments only test the sample and aggregate support selection mechanism since it is computationally efficient.

Thakurta & Smith (2013) follow up Kifer et al.'s work by developing DP model selection mechanisms based on perturbation and subsampling stability [35]. In short, they call a function f perturbation stable if f outputs the same value on an input dataset D and all of D's neighbors. Under this condition, they require an algorithm A_dist which outputs the distance from D to its nearest unstable instance. Given the output of A_dist, they use the Propose-Test-Release mechanism to output f(D) in a DP manner.

However, this mechanism is typically not practical since
Fig. 1. Overview of the reviewed optimization methods for private high-dimensional linear models.

Model Selection
Summary: Choose a subset of features privately, and then use traditional DP optimization to find a weight.
Assumptions: Feature selection is algorithmically stable.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: TS [21]

Thresholding
Summary: Use IGHT to produce a sparse weight. Privatize with gradient perturbation or output perturbation.
Assumptions: IGHT must identify coefficients efficiently. For heavy-tailed methods, truncated gradients must not be overwhelmed by noise.
Privacy: (ϵ, δ)
Methods: DPIGHT [43], DPSLKT [44], HTSL [8], HTSO [8]

Frank-Wolfe
Summary: Iteratively choose to move towards a vertex of a polytope constraint in a private manner.
Assumptions: Loss is Lipschitz and smooth. Solutions are found in few iterations. For heavy-tailed methods, truncated gradients must provide effective signal.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: FW [37], POLYFW [38], VRFW [39], HTFW [8], HTPL [8]

Coordinate Descent
Summary: Use greedy coordinate descent to privately select a single component of the weight to update.
Assumptions: Greedy coordinate descent can be implemented efficiently. Lipschitz constants for each feature are known.
Privacy: (ϵ, δ)
Methods: GCDGSQ [46], GCDGSR [46], GCDGSS [46]

Compressed Learning
Summary: Compress the high-dimensional dataset to low dimensions by multiplying with a random matrix. Optimize in low dimensions and find the corresponding weight in the original space.
Assumptions: Loss is Lipschitz. Random matrix does not destroy important information in the dataset.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: PROJERM [40]

Mirror Descent
Summary: Use iteratively stronger regularization to solve a constrained optimization problem.
Assumptions: Composing multiple private optimizations is numerically stable.
Privacy: (ϵ, δ)
Methods: NM [39]

ADMM
Summary: Privatize the ADMM algorithm with objective perturbation.
Assumptions: Search through a large hyperparameter space is possible. ADMM converges.
Privacy: (ϵ, 0)
Methods: ADMM [42], ADMMHALF [42]