
SoK: A Review of Differentially Private Linear
Models For High-Dimensional Data

Amol Khanna
Booz Allen Hamilton
Annapolis Junction, MD, USA
Khanna_Amol@bah.com

Edward Raff
Booz Allen Hamilton
University of Maryland, Baltimore County
Annapolis Junction, MD, USA
Raff_Edward@bah.com

Nathan Inkawhich
Air Force Research Laboratory
Rome, NY, USA
Nathan.Inkawhich@us.af.mil

arXiv:2404.01141v1 [cs.LG] 1 Apr 2024

Approved for Public Release; Distribution Unlimited. PA Number: AFRL-2023-5408

Abstract—Linear models are ubiquitous in data science, but are particularly prone to overfitting and data memorization in high dimensions. To guarantee the privacy of training data, differential privacy can be used. Many papers have proposed optimization techniques for high-dimensional differentially private linear models, but a systematic comparison between these methods does not exist. We close this gap by providing a comprehensive review of optimization methods for private high-dimensional linear models. Empirical tests on all methods demonstrate that robust and coordinate-optimized algorithms perform best, which can inform future research. Code for implementing all methods is released online.

Index Terms—differential privacy, high-dimensional, linear regression, logistic regression

I. INTRODUCTION

Linear models, like linear and logistic regression, are ubiquitous in current data science and data analytics efforts. Their simple structure enables them to train quickly and generalize well on simple problems. Additionally, they can be interpreted easily to understand model decision making, which is essential in regulated fields like medicine and finance. For example, linear regression has been used to predict future consumer demand, corporate resource requirements, and house prices [1], [2]. Logistic regression has been used for disease prediction and fraud detection [3], [4].

In the low-dimensional regime, where a dataset has many more datapoints than features, linear models typically generalize well without significant tuning. From an information-theoretic perspective, this is because the model has enough data to learn major trends in the dataset, which should generalize to future instances. This is why linear models are often most effective on simple, large datasets [5].

However, many modern datasets are high-dimensional, having more features than datapoints. This is common in genomics or finance, where the expression of many genes or prices of many assets outnumber individual observations. In this case, linear models can overfit to the data, producing poor generalization to future inputs [6].

A common solution to this problem uses regularization, which constrains the weight vector of these models. For example, a common goal is to constrain the number of nonzero coefficients of the model's weight. This both promotes generalization and produces a weight vector which highlights the features which are most strongly related to the model's output [7].

However, linear models' solutions can produce weights which rely on information contained in one or a few datapoints. This can be problematic for models trained on sensitive data, such as those used in medicine or banking, since the parameters of these models can leak information about these datapoints. Specifically, extensive research has demonstrated that membership inference attacks can reconstruct datapoints used in models with high accuracy given access to only the model's outputs or parameters. This is especially true in the high-dimensional regime, where models are easily capable of overfitting to data [8].

A theoretical solution to this problem employs differential privacy. Differential privacy is a statistical guarantee of the privacy of an algorithm's outputs, which ensures that an algorithm does not rely too heavily on information in any one datapoint. Differential privacy prevents membership inference and dataset reconstruction attacks on a trained model [9]. Differentially private linear and logistic regression have been studied extensively in the past two decades, but primarily from a theoretical perspective. Many optimization strategies and heuristics have been developed, but for most important statistical tasks, systematic comparisons between different algorithms have not been conducted.

This paper reviews methods to develop differentially private high-dimensional linear models. This task is fundamental in private statistics since linear models are typically the first models tested in a data science pipeline. Since differential privacy works best with simple models, using an appropriately trained linear model can avoid false conclusions on linear models' ineffectiveness for certain problems. Finally, differential privacy typically struggles with high-dimensional problems since noise can overwhelm the signals in these problems. This review seeks to identify whether specific optimization methods consistently improve performance on private high-dimensional problems.

Specifically, this review contributes the following:
1) We provide the first centralized review of all methods performing high-dimensional linear modeling with differential privacy, and provide insights about methods' strengths, weaknesses, and assumptions.
2) We implement methods reviewed in code and release all code at https://github.com/afrl-ri/differential-privacy-review. Many reviewed works do not release code, which can make it challenging for their methods to be tested and improved upon by other works. We close this gap.
3) We provide a systematic empirical comparison of each method's performance on a variety of datasets. Current differential privacy literature typically compares methods' theoretical utility. While this is a useful metric, empirical performance is often overlooked. In conducting these empirical tests, we find surprising and previously unidentified trends in performance which can influence future research.

Finally, we remark that in this article, the term "linear models" refers to both linear regression, where the output is an unconstrained numeric prediction (i.e., the mean-squared error loss), and logistic regression, where a binary label is desired. Most proposed methods in the literature support both, but some linear regression and logistic regression specific methods will also be evaluated.

II. PRELIMINARIES

In this section we provide brief overviews of differential privacy and nonprivate optimization methods for high-dimensional linear models. This provides context for our literature review in section III.

A. Differential Privacy

Differential privacy (DP) is a statistical guarantee of the privacy of an algorithm's outputs. Specifically, given an algorithm A, it is (ϵ, 0) DP if for all datasets D and D′ differing on one datapoint and any event E,

P[A(D) ∈ E] / P[A(D′) ∈ E] ≤ e^ϵ.

In other words, DP bounds the amount A's output can depend on a single datapoint. Since it cannot rely too much on any individual datapoint, it cannot reveal significant information about a single individual in a dataset [9].

Implementing the above definition of DP for a nonprivate algorithm B requires bounding its L1 sensitivity. This is the maximum amount that B's output can change between two neighboring datasets as measured by the L1 norm. Mathematically, this is written as

max_{D,D′: dist(D,D′)=1} ∥B(D) − B(D′)∥₁.

Once the sensitivity is known, Laplacian noise with appropriate scale can be added to B to produce a DP version of B [10]. However, adding Laplacian noise to B using its L1 sensitivity can destroy its efficacy. The noise can corrupt its outputs so much that it becomes useless. To address this issue, approximate DP was developed.

Approximate differential privacy, otherwise known as (ϵ, δ) DP, modifies the above definition of DP to

P[A(D) ∈ E] ≤ e^ϵ P[A(D′) ∈ E] + δ,

where 0 < δ ≤ 1. In this case, a nonprivate algorithm B can be made (ϵ, δ) DP by adding noise from a normal distribution with scale proportional to B's L2 sensitivity. This method can add significantly less noise to B's outputs because the Gaussian distribution's tails are significantly lighter than the Laplace distribution's [11].
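To make the two mechanisms above concrete, the following is a minimal sketch of output perturbation with Laplace and Gaussian noise. The function names and the classic Gaussian calibration σ = Δ₂√(2 ln(1.25/δ))/ϵ (valid for ϵ ≤ 1; tighter calibrations exist) are our illustration, not code from the reviewed works.

```python
import numpy as np

def laplace_mechanism(value, l1_sensitivity, epsilon, rng=np.random.default_rng()):
    """(epsilon, 0)-DP release of `value` with the given L1 sensitivity."""
    scale = l1_sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=np.random.default_rng()):
    """(epsilon, delta)-DP release via the classic Gaussian mechanism."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(loc=0.0, scale=sigma, size=np.shape(value))
```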
Finally, we mention some important properties of DP employed in the methods reviewed in section III. We begin with the post-processing property, which states that if A(D) is DP, then for any function g which does not re-access the dataset D, g(A(D)) is DP with the same privacy parameters [10].

Next, we discuss composition. Say A(D) is (ϵ, δ) DP, where 0 ≤ δ ≤ 1. Then under parallel composition, where D is broken into disjoint sets such that D_1 ∪ D_2 ∪ ··· ∪ D_k = D, releasing all outputs A(D_1), ..., A(D_k) is DP with the same parameters as A(D) [12]. In contrast, under sequential composition, if A_1(D) is (ϵ_1, δ_1) DP and A_2(D) is (ϵ_2, δ_2) DP, then releasing A_1(D) and A_2(D) is (ϵ_1 + ϵ_2, δ_1 + δ_2) DP [13].

Note that the above bound for sequential composition is called the basic composition theorem. Employing concentration inequalities, using a small value δ′, the advanced composition theorem states that we can produce a sublinear increase in ϵ. For example, for k-fold adaptive composition under (ϵ, 0) DP, the advanced composition theorem states that the output is (ϵ′, δ′) DP, where ϵ′ = ϵ√(2k log(1/δ′)). Similarly, for adaptive composition of (ϵ, δ) DP algorithms, the advanced composition theorem shows a k-fold composition is (ϵ′, kδ + δ′) DP [13].

The advanced composition theorem is tight for (ϵ, 0) DP, meaning that it produces the lowest privacy parameters which can generally be applied to any (ϵ, 0) DP mechanism. However, advanced composition is not tight for (ϵ, δ) DP where δ > 0 [14]. This has led to the adoption of Rényi and zero-concentrated DP, reformulations of DP which are naturally able to produce tight Gaussian mechanism composition for (ϵ, δ) mechanisms [15], [16].
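As a quick numeric illustration of composition, the sketch below compares the basic bound kϵ with the advanced-composition ϵ′ quoted above; the full Dwork–Rothblum–Vadhan statement also carries a kϵ(e^ϵ − 1) correction term, which we include for completeness. This helper is our own, not from the released code.

```python
import math

def basic_composition(epsilon, k):
    # k-fold basic composition of (epsilon, 0)-DP mechanisms.
    return k * epsilon

def advanced_composition(epsilon, k, delta_prime):
    # k-fold advanced composition: (epsilon', delta')-DP, with the
    # sqrt(2k log(1/delta')) leading term quoted in the text plus the
    # k*eps*(e^eps - 1) correction from the full theorem.
    return (epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_prime))
            + k * epsilon * (math.exp(epsilon) - 1.0))

# For epsilon = 0.1, k = 100, delta' = 1e-6: basic composition gives 10.0,
# while advanced composition gives roughly 6.3, a sublinear improvement.
```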
Finally, many DP mechanisms have been developed, but here we mention one specifically: the report-noisy-max mechanism. The report-noisy-max mechanism is one application of the exponential mechanism, and it takes in a set of scores and is used to choose the object with the highest score. It is typically implemented by adding Laplacian noise to each of the scores and then choosing the one which is largest [17]. The report-noisy-max mechanism has found wide use in DP literature, and is used by some of the algorithms we review in section III.
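Since report-noisy-max recurs in several reviewed algorithms, a minimal sketch is given below. We assume each score has the stated sensitivity and use the common Laplace(2Δ/ϵ) calibration; the exact noise scale in any given paper may differ.

```python
import numpy as np

def report_noisy_max(scores, sensitivity, epsilon, rng=np.random.default_rng()):
    """epsilon-DP selection of the index with the (noisily) largest score."""
    noise = rng.laplace(scale=2.0 * sensitivity / epsilon, size=len(scores))
    return int(np.argmax(np.asarray(scores, dtype=float) + noise))
```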
1) Global and Local Differential Privacy: Traditional DP algorithms assume that individuals' raw datapoints are colocated on one machine. Only resulting analyses or models built from the data will be shared publicly with untrusted agents, so only these analyses or models must be noised. This framework is called global or central DP, in reference to the assumption
that all individuals trust one entity to curate their raw data [18].

There are a few common methods to achieving DP under the global assumption. The first is output perturbation, in which the output of an algorithm is noised based on its sensitivity. This method is used often with traditional statistical estimators such as the mean, but it can be difficult to calculate the sensitivity of more complicated outputs like machine learning models [19], [20].

Another method is objective perturbation, which noises the objective function of an optimization problem. This method is flexible to modeling algorithms, but original works required the optimization problem to be solved exactly, which is challenging. More recent frameworks have extended objective perturbation to approximate solutions of optimization problems [20], [21].

Finally, gradient perturbation privatizes algorithms using gradient descent by noising each step of the optimization process. This method has been shown to work better than output and objective perturbation on many problems, and it is easily modified for the stochastic gradient descent case [22]. Its popularity is bolstered by the developments of Rényi and zero-concentrated DP, which allow for tight composition of gradient perturbation methods in the non-stochastic and stochastic settings [15], [16].

However, when individuals do not trust a central actor to curate their data, local DP must be used. In this case, individuals noise data prior to sending it to a central server, thus ensuring that curators of the central server cannot re-identify the data's sensitive attributes. A number of noising algorithms exist, but the main drawback of local DP is reduced accuracy and utility [18].

To combat reduced accuracy and utility while avoiding trust of the central actor, shuffle differential privacy was developed. In shuffle differential privacy, each individual's data is shuffled prior to being provided to the actor so it cannot be readily linked back to a single user through a specific connection. This allows for reducing the noise added to each individual's data, which can improve accuracy and utility. It can be shown that shuffle differential privacy is stronger than global differential privacy and weaker than local differential privacy [23]. One common specific application of shuffle differential privacy is aggregate differential privacy, which sums many local differentially private metrics prior to providing this result to an untrusted actor. Aggregate differential privacy has been used in real-world systems for tracking user statistics and application performance [24].

In this review, we mention methods for high-dimensional private linear modeling under both the global and local models. However, since methods for high-dimensional linear modeling under local DP are nascent, we only test models built under global DP. Finally, note that we are not aware of any methods which employ shuffle differential privacy for high-dimensional linear models.

2) Sparsity, Stability, and Differential Privacy: Heuristically, an algorithm which is stable produces similar outputs on similar datasets. Algorithmic stability is a well-studied property in statistics and machine learning as it enables a mathematical analysis of algorithm performance [25].

DP algorithms are inherently related to algorithmic stability, as a DP algorithm cannot have a large change in its output when one of its datapoints changes. Indeed, Dwork and Lei formalized this intuition when they studied the relationship between DP and robust statistics [12].

Unfortunately, Xu et al. found that sparse algorithms cannot be stable [25]. Applying this to our case of DP, this means that, in some sense, DP algorithms are at odds with sparsity.

We include this discussion here to highlight the fundamental challenge of producing a sparse DP estimator. While both sparsity and DP are useful tools for data analysis, an effective combination of both is a difficult and open problem. Some of the methods reviewed in this work take steps to resolve this problem.

B. High-Dimensional Optimization for Linear Models

In this section, we provide a brief overview of high-dimensional optimizers for linear and logistic regression. Optimization methods for high-dimensional problems are an important open problem, so we focus on well-established methods relevant to section III here.

First, note that in high-dimensional problems, it is easy for models to overfit the training data. If n is the number of datapoints and d is the number of features, high-dimensional problems have n < d, so the number of coefficients in the weight vector exceeds the number of targets [26]. For this reason, we must constrain the weight space to be sparse to avoid overfitting [27]. Note that literature on nonprivate optimization often considers sparsity to be a de-facto requirement for high-dimensional problems since it improves utility of solutions greatly; however, as discussed in the prior section, maintaining sparsity with DP can be challenging.

Two common approaches are used to avoid overfitting in high-dimensional linear models. The first is an L0 constraint on the weight vector, which controls the number of nonzero components in the weight vector. Under this constraint, a linear problem with per-example loss function ℓ(x_i, y_i; w) can be written as

arg min_{w: ∥w∥₀ ≤ k} Σ_{i=1}^{N} ℓ(x_i, y_i; w).

To solve this problem, iterative gradient hard thresholding can be used. This is a greedy method which retains only the top-k coefficients of the weight vector after each gradient step. Since it is greedy, it is not guaranteed to converge to an optimal solution. However, it approaches the L0 constrained problem in a computationally efficient manner [28].
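The iterative gradient hard thresholding (IGHT) loop referenced throughout this review is simple enough to sketch in full; this nonprivate version is our own minimal illustration, with a fixed learning rate and step count as assumed defaults.

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude coordinates of w and zero the rest."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]
    out[keep] = w[keep]
    return out

def ight(grad_fn, d, k, lr=0.1, steps=200):
    """Nonprivate IGHT for the L0-constrained problem: a gradient step,
    followed by projection onto the set of k-sparse vectors."""
    w = np.zeros(d)
    for _ in range(steps):
        w = hard_threshold(w - lr * grad_fn(w), k)
    return w
```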
L1 constraints are used more often. In this setting, we solve

arg min_{w: ∥w∥₁ ≤ k} Σ_{i=1}^{N} ℓ(x_i, y_i; w).

This is a relaxation of the L0 constraint which often produces sparse solutions and can be solved efficiently with a number of
optimizers; however, the nonzero components of its solutions are biased towards 0, which can reduce the utility of the output [29].

Solving this problem can be done with a number of optimizers. Most commonly, it is solved with the Frank-Wolfe algorithm, which iteratively takes steps towards the vertex of the k-norm L1-ball which has minimum loss [30]. Compressed learning, in which the dimensionality of the dataset is reduced by multiplication with a random matrix, can also be used [31]. Note that the L1 constrained problem is often written as an L1 regularized problem. Specifically, a linear problem can be written as

arg min_w Σ_{i=1}^{N} ℓ(x_i, y_i; w) + λ∥w∥₁.

There is an exact duality between the constrained and regularized setting, where a specific value of k corresponds to a value of λ. However, the closed form conversion from k to λ is not known [32].

To solve the regularized problem, coordinate descent can be used. This algorithm iterates through each coefficient of the weight vector, making gradient updates for only that coefficient. More recent algorithms like the alternating direction method of multipliers can also be used, which split the smooth and nonsmooth parts of the regularized optimization problem for better initial optimization [33], [34].
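For concreteness, below is a minimal nonprivate cyclic coordinate descent for the L1 regularized linear regression problem above (the lasso), using the standard soft-thresholding update; the objective scaling and sweep count are our assumed conventions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n   # per-coordinate curvature
    r = y - X @ w                       # running residual
    for _ in range(sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * w[j]                  # remove coordinate j from the fit
            rho = X[:, j] @ r / n                # partial correlation
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * w[j]                  # add the updated coordinate back
    return w
```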
III. REVIEW

This section provides a literature review of algorithms used for high-dimensional DP linear models. The section is organized by optimization technique, and is in roughly chronological order of the first paper considering each technique. Figure 1 provides an overview of these methods, and employs a color-coding we will use in coming sections. Certain optimization techniques have been considered by multiple works, and as such, their subsections are longer than others. Throughout the sections, we highlight potential strengths and weaknesses of each algorithm. These algorithms are then tested empirically in section V.

In conducting this literature review, we searched for papers considering "sparse," "high-dimensional," "L1," or "L0" DP regression. After extensive reading, we identified the works presented in this section as relevant to our review.

Finally, all methods presented in this section employ the unbounded Hamming distance for their adjacency relation. In other words, the definition of differential privacy is employed when D and D′ differ by one (extra or removed) row. This ensures that the privacy produced by all methods is directly comparable.

A. Model Selection

High-dimensional DP linear models were first considered in 2012 by Kifer et al. [21]. Their work is split into two parts: the first developed a generalized objective perturbation mechanism which employs L2 regularization to achieve a strongly-convex loss function. They then derive the utility of their algorithm under both (ϵ, 0) DP and (ϵ, δ) DP, and show that using (ϵ, δ) DP saves a factor of √d in utility.

Their utility analysis shows that this mechanism is not directly applicable to high-dimensional convex optimization. In the second part of their work, they assume that high-dimensional data can be explained by a model with sparsity s, and derive algorithms to select a support set of size s prior to running their generalized objective perturbation mechanism. Note that they make a number of additional assumptions when calculating utility, including those on the norm of the dataset and targets along with the restricted strong convexity of the dataset. However, these assumptions are not necessary to guarantee privacy.

The first algorithm they analyze is derived from the exponential mechanism. Roughly, this method calculates the loss of each support set with size s, and uses the exponential mechanism to choose the support set with minimum loss. After choosing the support set, these features are passed into the generalized objective perturbation mechanism. They prove utility of their method when n = ω(s³ log d), but it is computationally inefficient since the support selection step must calculate the models of all (d choose s) possible support sets. This can be compared to the inefficiency of L0-constrained minimization in the nonprivate setting.

Their second algorithm uses the sample and aggregate framework to split the training examples into √n blocks and compute the optimal support on each block. After all √n blocks are run, each of the d features has a score associated with it: the number of blocks which chose to include it in their support. A private mechanism can be used to return the top-s features, which are then passed into the generalized objective perturbation mechanism. They prove utility of this method when n = ω(s² log d), but require an assumption that each of the √n blocks follows the restricted strong convexity assumption, which is strictly stronger than the assumption for the whole dataset.

Since their method for high-dimensional regression involves two stages, we call it TS. Note that because the method is split into two steps, the privacy budget must also be split between these steps. In their paper, Kifer et al. choose to split the budget into (ϵ/2, 0) for the support selection step and (ϵ/2, δ) for the perturbation mechanism. We follow the same convention in our experiments. Additionally, our experiments only test the sample and aggregate support selection mechanism since it is computationally efficient.

Thakurta & Smith (2013) follow up Kifer et al.'s work by developing DP model selection mechanisms based on perturbation and subsampling stability [35]. In short, they call a function f perturbation stable if f outputs the same value on an input dataset D and all of D's neighbors. Under this condition, they require an algorithm A_dist which outputs the distance between D and its nearest unstable instance. Given the output of A_dist, they use the Propose-Test-Release mechanism to output f(D) in a DP manner.

However, this mechanism is typically not practical since
Model Selection
Summary: Choose a subset of features privately, and then use traditional DP optimization to find a weight.
Assumptions: Feature selection is algorithmically stable.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: TS [21]

Thresholding
Summary: Use IGHT to produce a sparse weight. Privatize with gradient perturbation or output perturbation.
Assumptions: IGHT must identify coefficients efficiently. For heavy-tailed methods, truncated gradients must not be overwhelmed by noise.
Privacy: (ϵ, δ)
Methods: DPIGHT [43], DPSLKT [44], HTSL [8], HTSO [8]

Frank-Wolfe
Summary: Iteratively choose to move towards a vertex of a polytope constraint in a private manner.
Assumptions: Loss is Lipschitz and smooth. Solutions are found in few iterations. For heavy-tailed methods, truncated gradients must provide effective signal.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: FW [37], POLYFW [38], VRFW [39], HTFW [8], HTPL [8]

Coordinate Descent
Summary: Use greedy coordinate descent to privately select a single component of the weight to update.
Assumptions: Greedy coordinate descent can be implemented efficiently. Lipschitz constants for each feature are known.
Privacy: (ϵ, δ)
Methods: GCDGSQ [46], GCDGSR [46], GCDGSS [46]

Compressed Learning
Summary: Compress a high-dimensional dataset to low dimensions by multiplying with a random matrix. Optimize in low dimensions and find the corresponding weight in the original space.
Assumptions: Loss is Lipschitz. Random matrix does not destroy important information in the dataset.
Privacy: (ϵ, 0) or (ϵ, δ)
Methods: PROJERM [40]

Mirror Descent
Summary: Use iteratively stronger regularization to solve a constrained optimization problem.
Assumptions: Composing multiple private optimizations is numerically stable.
Privacy: (ϵ, δ)
Methods: NM [39]

ADMM
Summary: Privatize the ADMM algorithm with objective perturbation.
Assumptions: Search through a large hyperparameter space is possible. ADMM converges.
Privacy: (ϵ, 0)
Methods: ADMM [42], ADMMHALF [42]

Fig. 1. A taxonomy of optimization techniques used for high-dimensional DP linear models.


calculating A_dist requires searching over all datasets around D, of which there may be infinitely many. Secondly, the mechanism relies on the stability of f at the dataset D, which may change for different functions f.

Thakurta & Smith then analyze subsampling stable functions. For a dataset D, they define D̃ as a random subset of D in which each element appears independently with probability q. Then f is called q-subsampling stable if f(D̃) = f(D) with probability at least 3/4. Under this condition, they derive an (ϵ, δ)-DP algorithm which outputs f(D) with high probability when f is q-subsampling stable.

While computationally tractable, this algorithm is still inefficient in that it requires O(1/q²) runs of f. Given the computational complexity of model selection procedures and the fact that q can be very small for low values of ϵ and δ, this method can be difficult to use in practice. Additionally, there is a nonzero probability that the algorithm outputs ⊥, a symbol indicating that the model selection procedure cannot be outputted while maintaining DP. Finally, as in the TS approach, using this procedure would require arbitrarily splitting the privacy budget between the model selection and optimization steps. For these reasons, we do not implement this method in our experiments.

Another potential approach within model selection could involve modifying nonprivate screening rules to operate with DP. Recent work has attempted to do this but has shown a negative result [36]. It remains an open question whether an effective screening rule can be derived for model selection with high probability.

B. Frank-Wolfe

The Frank-Wolfe algorithm (FW) is a greedy optimization technique which works well on problems with polytope constraints. At each iteration, FW finds a first-order approximation of the objective function at the current iterate. It then finds a vertex of the polytope constraint which minimizes the first-order approximation, and takes a step in this direction. The algorithm was first developed for quadratic programming, but can be efficiently implemented for many loss functions including mean-squared error and binary cross-entropy.

Talwar et al. (2015) privatize FW for L1 constrained optimization [37]. The algorithm works well in this setting since the L1 constraint corresponds to a polytope. In order to produce private FW, at each iteration they treat the values of the first-order approximation at each vertex as scores, and use the exponential mechanism to choose a direction to move in. Using the advanced composition theorem for (ϵ, 0) DP, they can compute the exact amount of noise to add, achieving (ϵ, δ) DP.
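A skeletal version of this private Frank-Wolfe loop is sketched below. The L1 ball of radius r has 2d vertices ±r·e_j, which are scored by the linear approximation ⟨∇L(w), v⟩ and selected with report-noisy-max. The even per-step budget split and the noise scale are simplifications of the paper's accounting, so this is an illustration rather than the authors' exact algorithm.

```python
import numpy as np

def private_frank_wolfe(grad_fn, d, radius, steps, epsilon, sensitivity,
                        rng=np.random.default_rng()):
    """Sketch of DP Frank-Wolfe over the L1 ball of the given radius.
    grad_fn(w) is the gradient of the empirical loss; `sensitivity` is the
    per-step score sensitivity (illustrative, not the paper's calibration)."""
    w = np.zeros(d)
    eps_step = epsilon / steps
    for t in range(steps):
        g = grad_fn(w)
        # Utility of moving to +radius*e_j (first d entries) or -radius*e_j
        # (last d): the vertex minimizing <g, v> maximizes -<g, v>.
        util = np.concatenate([-g, g]) * radius
        noisy = util + rng.laplace(scale=2.0 * sensitivity / eps_step, size=2 * d)
        idx = int(np.argmax(noisy))
        v = np.zeros(d)
        v[idx % d] = radius if idx < d else -radius
        step = 2.0 / (t + 2.0)
        w = (1.0 - step) * w + step * v
    return w
```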
Their utility analysis does not require restricted strong convexity or restricted strong smoothness. Instead, it relies on the curvature constant of the loss function, which is bounded for both linear and logistic regression. This constant, along with the Lipschitzness of the loss function with respect to the L1 norm, is the only information necessary to produce the utility bound.

Due to the lack of assumptions in producing a utility bound, private FW was explored by a number of subsequent works. These works identified that private FW requires gradients with respect to the entire dataset at each iteration, which can waste privacy budget. According to the principles of subsampling in DP, if the entire dataset is not accessed at each iteration, less information from the data is used in each update, and less noise is required for privacy. The two following methods consider stochastic private FW methods.

Bassily et al. (2021) consider a stochastic private FW method which reduces variance of updates through a recursive gradient estimator [38]. Specifically, their method computes the gradient of the objective function with respect to n/2 of the datapoints. Then it iterates through the remaining n/2 points and updates the gradient with a weighted average. The weight is updated after each iteration in a private manner using the exponential mechanism, and advanced composition yields (ϵ, δ)-DP. Using the authors' convention, we call this method POLYFW.

POLYFW has error of O(√(log d/(ϵn))), which is not optimal according to the work below. Their analysis does not explicitly rely on the curvature constant but rather requires L1 Lipschitzness and a smooth loss function with respect to the L1 norm.

Asi et al. (2021) develop a variance-reduced FW method (VRFW) for smooth functions using DP binary trees [39]. Nodes in these trees closer to the root are assigned more samples, making their gradient more accurate, while nodes closer to the leaves have fewer samples. Iterating over the leaves, their algorithm makes FW updates which consider the gradients collected over the entire path from the root to the leaf. By reusing gradients calculated from more datapoints, the algorithm is able to add less noise through the exponential mechanism. Their method can achieve both (ϵ, 0)-DP and (ϵ, δ)-DP.

Asi et al. calculate the error of this method, which is Õ(1/√n + 1/(nϵ)^(2/3)) for (ϵ, δ)-DP. They then calculate lower bounds of error, and show that this error matches the lower bound. For this reason, their algorithm has optimal utility for smooth loss functions.

Finally, Hu et al. (2022) develop two stochastic FW-based algorithms for heavy tailed high-dimensional data [8]. Each of their utility analyses assumes bounded fourth order moments, which is common in heavy tailed statistics.

The first method they use involves a soft truncation and scaling of the gradient. They do this because heavy-tailed data may not have bounded gradients, which can lead to unbounded sensitivity. With these truncated gradients, the exponential mechanism is used to choose a direction for minimization. This algorithm is (ϵ, 0)-DP. We call this method HTFW.

When specifically considering linear regression, they find that truncating each element of the design matrix and target vector to be in [−K, K] can bound sensitivity. After this data processing step, they use the exponential mechanism to choose a direction for minimization. This algorithm is (ϵ, δ)-DP, and Hu et al. abbreviate it as HTPL.
C. Compressed Learning

Another approach to constrained optimization is through compressed learning, which reduces the dimensionality of the input space by multiplying the design matrix by a random matrix Φ ∈ R^(m×d). Typically Φ is chosen to be a subgaussian random matrix with norm O(1) and the feasible set C is the scaled L1 ball.

Kasiviswanathan & Jin (2016) employ compressed sensing for high-dimensional DP linear modelling [40]. After choosing Φ and C, they solve the problem

arg min_{ϑ ∈ ΦC} (1/n) Σ_{i=1}^{n} ℓ(⟨Φx_i, ϑ⟩; y_i)

with a DP optimizer. To obtain an estimate for the parameter vector in the data space, they solve

θ ∈ arg min_{θ ∈ R^d: Φθ = ϑ} ∥θ∥_C,

where ∥·∥_C represents the Minkowski norm induced by the feasible set. When C is the scaled L1 ball, this is equivalent to the L1 norm. Note that the output of this procedure is DP with the same parameters as ϑ since θ is only a function of Φ, which does not depend on the data, and ϑ, which is DP.

Since their procedure only requires a DP optimizer to guarantee DP, it can be both (ϵ, 0)-DP and (ϵ, δ)-DP. In our experiments, we implement an (ϵ, δ)-DP optimizer to work with their procedure. We refer to their procedure as PROJERM, in reference to its projected empirical risk minimization step.

The utility bounds which Kasiviswanathan & Jin derive follow directly from those known from the Johnson-Lindenstrauss lemma. Their bounds hold under conditions on the size of m.
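The compression step and the post-processing recovery described above can be sketched as follows; the Gaussian choice of Φ and its scaling are our assumptions (any subgaussian matrix of the right norm works), and the L1 recovery is left to a generic convex solver.

```python
import numpy as np

def compress_design(X, m, rng=np.random.default_rng()):
    """Project an n x d design matrix to m dimensions with a random
    subgaussian matrix Phi, as in the compressed-learning setup above."""
    n, d = X.shape
    Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))
    return X @ Phi.T, Phi   # compressed n x m data, and Phi itself

# After a DP weight `vartheta` is fit in the m-dimensional space, a weight in
# the original space is recovered by (approximately) solving
#     min ||theta||_1   subject to   Phi @ theta = vartheta.
# Since this uses only Phi and the DP output, the recovery is post-processing
# and keeps vartheta's privacy guarantee.
```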
Zheng et al. (2017) used compressed learning to develop an algorithm for high-dimensional linear regression under local DP [41]. Instead of using gradient perturbation like Kasiviswanathan & Jin, Zheng et al. noise the projection of each datapoint Φx_i prior to optimization. Using the noised datapoints, they correct the quadratic loss function of linear regression to ensure it is unbiased, and optimize over this corrected loss function. Their algorithm is (ϵ, δ)-DP.

Zheng et al. use matrix concentration inequalities to derive bounds on utility in the value of the loss. However, they make no claims in directly comparing the output w_priv from their algorithm to the optimal w*.

D. ADMM

The alternating direction method of multipliers (ADMM) is a common algorithm used for optimization of nonsmooth or nonconvex regularizers by transforming an optimization problem into two simpler problems. Under assumptions on the strong convexity of the loss function and separability of the regularizer, ADMM can be shown to converge to a stationary point.

Wang & Zhang (2020) apply the ADMM algorithm to achieve DP L1 and L1/2 regularized logistic regression (ADMM and ADMMHALF) [42]. First, they show that these methods access the data in only the second of their three steps. Then, to privatize the algorithm, they employ objective perturbation in every second step. Finally, they use the basic composition theorem to show that with appropriate noise, their algorithm is (ϵ, 0)-DP after K iterations.

Although Wang & Zhang do not provide a utility analysis of ADMM and ADMMHALF, they do analyze the three steps of each iteration and find that only the first step is unstable with sparsity. Since the data is only accessed in the second step, privatizing the second step does not require excessive noise since this step is convex and stable. For this reason, they intuitively claim that their algorithm is well-suited for DP, which outputs accurate results on stable functions.

E. Thresholding

A simple approach to sparse optimization is to enforce sparsity throughout training. This can be done with hard thresholding operations, in which only the top s components of a partially learned weight are retained after each optimizer step. Hard thresholding based approaches have been considered in many forms for high-dimensional DP optimization. We cover these methods below.

Wang & Gu (2019) consider DP iterative gradient hard thresholding (DPIGHT), which retains the top s components of the parameter vector after each gradient step [43]. Although their method is simple, when they assume that each row of the design matrix is sub-Gaussian, they achieve the same utility as Kifer et al. in the TS procedure without requiring an extra support selection step. However, their utility does assume the sparse eigenvalue condition, which implies both restricted strong convexity and restricted strong smoothness. They analyze the privacy of their algorithm with respect to zero-concentrated DP as this allows for tight composition of gradient descent. Note that zero-concentrated DP can easily be converted to (ϵ, δ)-DP.

Wang & Gu (2020) follow up this work by considering a DP knowledge transfer framework (DPSLKT) to produce sparse linear models [44]. First, they use IGHT to train a teacher model on private data without DP. Next, they generate an auxiliary training set and use output perturbation when passing it through the initially trained model. Now this dataset is DP. Finally, to produce a sparse model, they use IGHT to find a student model which fits the DP data well. Due to the post-processing property of DP, this model is also DP. Since this method also uses gradient descent, Wang & Gu use zero-concentrated DP, which can be converted to (ϵ, δ)-DP.

The utility guarantees of DPSLKT outperform those of previous methods because they rely on the L∞ norm of the data, which can be up to O(√d) times less than the L2 norm. However, their utility analysis requires restricted strong convexity and restricted strong smoothness. For this reason, Wang & Gu give an option to define the loss function with an L2 regularization term, since this will ensure that the loss is strongly convex. This approach is similar to the TS method of Kifer et al. However, unlike the TS method,
this regularization term is not necessary, but without it, an assumption of restricted strong convexity and restricted strong smoothness is necessary, which typically does not hold on real-world datasets.

Hu et al. (2022) employ iterative hard thresholding to develop a private linear regression algorithm for heavy tailed high-dimensional data, which they call HTSL [8]. To do this, they truncate each element of the design matrix and target vector to be in [−K, K], where K is a value set by the user. Then they use gradient descent for optimization, but after each step they privately retain the top s elements of the weight vector. Since their data and targets are truncated, they can calculate the sensitivity of this step, which is passed into the exponential mechanism. Using the advanced composition theorem, they achieve (ϵ, δ)-DP.

Note that truncation is required because the data is heavy tailed. Heavy tailed data is typically not bounded or subgaussian, and its loss function is typically not O(1) Lipschitz. Some combination of these assumptions is required to compute the sensitivity for non-heavy tailed algorithms. To avoid this, Hu et al. truncate the "outlier" elements in the design matrix and target vector. However, the utility bound which Hu et al. derive requires that the dataset and targets have bounded fourth order moments. This assumption is similar to those made by other works in heavy tailed optimization.

For general convex optimization, Hu et al. use a soft truncation and scaling operation on the gradient. Then they use gradient descent, privately retaining the top s elements of the weight vector. In this case, they bound the optimization sensitivity through the truncation and scaling of the gradient. Again, their utility bound requires bounded fourth order moments. This method produces an (ϵ, δ)-DP result. They call this method HTSO, since it performs heavy-tailed sparse optimization.
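The two primitives these heavy-tailed methods combine, entrywise truncation and private top-s selection, are easy to sketch. The noise scale in the selection step is illustrative (a simple budget split across the s coordinates), not the calibration from the paper.

```python
import numpy as np

def truncate(A, K):
    """Clamp each entry of a design matrix or target vector to [-K, K]."""
    return np.clip(A, -K, K)

def private_top_s(w, s, sensitivity, epsilon, rng=np.random.default_rng()):
    """Privately keep the s largest-magnitude coordinates of w by noising
    the magnitudes and thresholding (one simple instantiation of the
    exponential-mechanism selection described above)."""
    noisy_mag = np.abs(w) + rng.laplace(scale=2.0 * s * sensitivity / epsilon,
                                        size=w.shape)
    keep = np.argsort(noisy_mag)[-s:]
    out = np.zeros_like(w)
    out[keep] = w[keep]
    return out
```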
Finally, we mention Wang & Xu's 2019 work on iterative hard thresholding methods for local DP [45]. In this work, they assume that each datapoint x_i is sampled from {−1, +1}^d, which is a very strong condition.

However, this work is valuable since it shows that even under such strong conditions, the utility must have order Ω̃(d log d). Their iterative hard thresholding algorithm achieves this, but in general, orders greater than Ω̃(log d) are considered unlearnable in the high-dimensional setting. They then demonstrate that if only the labels y_i are kept private, iterative hard thresholding can achieve a bound of O(log d). However, this setting is dissimilar from the DP setting we consider in this paper, and thus we do not implement this algorithm.

F. Coordinate Descent

Coordinate descent is an optimization algorithm which iteratively updates only one parameter of a model's weight using its gradient. The algorithm typically chooses parameters to update cyclically or at random, but greedy coordinate descent chooses to update the parameter with the highest absolute gradient.

In the nonprivate setting, all of these methods perform well on high-dimensional linear problems. Non-greedy methods are typically used since they are computationally efficient, only requiring the gradient with respect to one parameter.

Mangold et al. (2022) develop a DP coordinate descent algorithm for high-dimensional ERM [46]. They choose to use greedy coordinate descent (GCD), which updates the coordinate with the largest absolute value gradient, because under DP fewer computations enable less added noise. They acknowledge that this is an inherent tradeoff to computational efficiency, since their procedure must compute the gradient with respect to every parameter at every iteration.

Since they update only one parameter per iteration, Mangold et al. are able to employ per-feature smoothness and Lipschitz constants. In some datasets, some features can have significantly lower smoothness and Lipschitz constants as compared to the entire dataset. For these features, less noise is added during coefficient selection and optimization. Finally, using the advanced composition theorem, their method achieves (ϵ, δ)-DP.

Their utility bound demonstrates that their algorithm converges to a ball around the optimal weight, where the radius of the ball is determined by the level of privacy. However, it is worth noting that their utility bounds consider only smooth loss functions, and are thus not applicable to L1 regularizers. To use L1 regularizers, Mangold et al. develop DP proximal coordinate descent algorithms with a variety of proximal operators. These algorithms are named GCDGSQ, GCDGSR, and GCDGSS, to represent three different proximal operators, and empirical results indicate that these modifications greatly improve performance on high-dimensional datasets. However, they do not consider the utility of the algorithm with proximal operators.
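One step of such a private greedy coordinate method might look like the sketch below: select a coordinate with report-noisy-max on scaled gradient magnitudes, then update only that coordinate with noise. The budget split and noise scales here are illustrative placeholders rather than the calibration used by GCDGSQ/GCDGSR/GCDGSS.

```python
import numpy as np

def dp_greedy_cd_step(w, grad, lipschitz, eps_select, eps_update,
                      grad_sensitivity, rng=np.random.default_rng()):
    """One illustrative DP greedy coordinate descent step.
    `lipschitz` holds per-coordinate smoothness constants, so coordinates
    with small constants receive proportionally less selection noise."""
    scaled = np.abs(grad) / np.sqrt(lipschitz)
    noisy = scaled + rng.laplace(scale=2.0 * grad_sensitivity / eps_select,
                                 size=len(grad))
    j = int(np.argmax(noisy))
    noisy_gj = grad[j] + rng.laplace(scale=grad_sensitivity / eps_update)
    w = w.copy()
    w[j] -= noisy_gj / lipschitz[j]
    return w
```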
G. Mirror Descent

Asi et al. (2021) privatize an iterative localization framework with private optimization. They call their method private mirror descent with noise multipliers (NM) [39]. This method finds the optimal parameter vector for a sequence of optimization problems with increasing regularization. By sequentially increasing the regularization, the method finds iteratively smaller regions for optimization, making the optimization problem easier. In addition, the regularization makes the optimization problem strongly convex, making it easy to use well-known optimal optimization techniques for private strongly convex problems. They prove that NM achieves optimal utility for non-smooth loss functions by demonstrating that using private optimization with iterative localization produces a loss which matches a known lower bound up to logarithmic factors. This algorithm is (ϵ, δ)-DP.
IV. IMPLEMENTATION DETAILS

In the following section, we perform experiments on the central DP methods previously discussed. We release all code for these experiments online. This is one of the major
contributions of this work, as previous works do not release code or test algorithms empirically. While theoretical utility bounds can provide an even ground for algorithm comparison, these bounds often make significant (and different) assumptions; empirical performance on a variety of datasets better demonstrates how these algorithms would perform if used in application. In this section, we list challenges we faced when implementing some of these methods.

• Without an optimized linear algebra implementation, the FW algorithm is very inefficient. An optimized implementation can be found in [47]. This holds for the other variants of FW as well.
• VRFW required smoothness constants to be greater than a threshold. Since a k-smooth function is also l-smooth when l ≥ k, when implementing this method we set the smoothness constant to max(k, l), where k is the smoothness constant of our loss function and l is the required lower bound on smoothness.
• For heavy-tailed methods requiring the robust gradient, the expression of the correction factor can be found in Lemma 3.2 in [48]. The authors considering heavy-tailed private optimizers claimed to provide this expression in their appendix, but we could not find it.
• The private optimization method recommended by the authors of PROJERM requires n² steps of stochastic gradient descent, where n is the number of datapoints. Additionally, the method requires solving an optimization problem in d variables. We did not modify these steps, but for larger datasets PROJERM did not converge within 48 hours.
• ADMM and ADMMHALF have a significant number of hyperparameters and can be unstable for many hyperparameters and datasets. We had to refactor portions of their algorithm into numerically stable mathematical equivalents. Refactoring improved performance but certain datasets were still unstable.
• We did not implement GCD methods but instead used code released with the GCD paper to run these methods.
• When implementing NM, we tried to use CVXPY to solve each optimization problem [49], [50]. However, the constraints in this problem are unstable, and the free ECOS solver included in CVXPY would produce errors prior to convergence [51]. The free SCS solver would take up to 10 seconds for a single iteration, and the cumulative sequence of problems would require over 10,000 iterations, making it infeasible [52]. For this reason, we approximated the solution to each optimization problem using gradient descent.
• We employed differentially private stochastic gradient descent (DPSGD) as a baseline for each dataset [53]. In addition to being a common DP optimization method, works have shown that the noise added during DPSGD can be an effective regularization strategy which prevents overfitting. This may cause it to perform well in high-dimensional settings [54].
Finally, it is important to note that all experiments were conducted with Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. For each dataset-algorithm combination, 20 trials were performed on different CPUs to speed up training. A 48 hour wall-clock time limit on computation time was set for each dataset-algorithm combination.

V. EXPERIMENTS

This section runs experiments on six datasets to test the methods described in Section III. These experiments provide a systematic comparison between algorithms' performance on varying privacy budgets. Our study focuses on two of the most prevalent linear models: linear regression and logistic regression.

All datasets were chosen from the libsvm and OpenML libraries [55], [56]. For linear regression, we used the Bodyfat, PAH, and E2006 datasets. The raw datasets for Bodyfat and PAH were used, but for computational efficiency, 500 datapoints from the E2006 dataset were chosen and their dimensionality was reduced to 500 with PCA. Similarly, the Heart, DBworld-subjects-stemmed, and RCV1 datasets were used for logistic regression. We used the raw Heart and DBworld datasets, but chose 500 datapoints of the RCV1 dataset and reduced their dimensionality to 500 with PCA. These datasets were chosen to identify algorithms' performance on different scales of dimensionality while retaining computational efficiency. Table I provides a summary of the datasets' dimensionalities.

All features in datasets were demeaned, and then samples in each dataset were rescaled so the maximum L1-norm of a sample in any given dataset was 1. This was done to bound the Lipschitz and smoothness constants of the datasets under linear and logistic regression, a common requirement in DP optimization literature and detailed well in [40]. Finally, linear or logistic regression models were trained on the datasets under a variety of hyperparameters. Each set of hyperparameters was tested for 20 trials, with a unique 60%/20%/20% split between training, validation, and testing datasets used for each trial. A detailed list of the hyperparameters used for each algorithm can be found in the appendix.
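The rescaling scheme above is a single global operation; a minimal version is given below for clarity (the function name and the handling of the all-zero edge case are ours).

```python
import numpy as np

def preprocess(X):
    """Demean each feature, then rescale all samples by one constant so the
    maximum L1 norm of any sample is 1."""
    Xc = X - X.mean(axis=0)
    max_l1 = np.abs(Xc).sum(axis=1).max()
    return Xc / max_l1 if max_l1 > 0 else Xc
```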
Note that algorithms were tested with privacy parameters ϵ ∈ {0.1, 0.5, 1, 2, 5} with δ = 1/n²_train, where n_train is the number of training samples. These privacy parameters were chosen according to common recommendations in DP literature. Finally, each algorithm was run with ϵ = 100 and δ = 0.999 to approximate a nonprivate solution. We did this for each algorithm so that we find the best nonprivate solution using the same choices of hyperparameters used for private training.

A. Linear Regression

Table II, Table III, and Table IV demonstrate the mean of 20 trials of mean-squared errors for the test portions of the Bodyfat, PAH, and E2006 datasets, respectively. Additionally, subscripts designating two times the standard error of the mean are provided. For each value of ϵ, the best performing algorithm's results are bolded.
TABLE I
SUMMARY OF DATASETS

DATASET   N    D    REGRESSION TYPE
BODYFAT   252  14   LINEAR
PAH       80   112  LINEAR
E2006     500  500  LINEAR
HEART     270  13   LOGISTIC
DBWORLD   64   229  LOGISTIC
RCV1      500  500  LOGISTIC

Note that ProjERM is not included in Table IV since it did not converge within 48 hours. We now discuss notable performance trends which are seen in these tables. Note that in the appendix, Figure 2, Figure 3, and Figure 4 provide boxplots of the mean-squared errors for the trials on each dataset.

Perhaps the most significant trend among all three tables is that HTSO, or heavy-tailed sparse optimization, performs well on all values of ϵ. HTSO performs iterative gradient hard thresholding using the robust gradient of the data. Intuitively, this makes sense; while most other methods tested require the loss to be Lipschitz, this constant is often a worst-case bound. For example, although bounding the Lipschitz constant of the loss requires rescaling the datapoints, datapoints with larger rescaled norms might be outliers, and most datapoints have norms near 0. In contrast, heavy tailed optimizers do not assume Lipschitzness of the loss, and instead use parameters to calculate a robust gradient which is resistant to outliers. Thus, even in settings where the Lipschitz constant of the loss can be set, heavy-tailed private gradient optimizers can perform better.

Next, we note that GCD, or greedy coordinate descent, can also perform well. Similar to HTSO, these methods do not rely on the Lipschitz constant for optimization of the entire weight vector but rather use different Lipschitz constants for each coefficient. For this reason, private optimization can add less noise to coefficients which are smaller than others, thus improving the utility of the method. Finally, as was reported in [46], different greedy selection rules can perform very differently depending on the dataset.

Note that GCD methods can be made to better approximate the behavior of HTSO through a clipping parameter. Specifically, by aggressively clipping gradients before applying coordinate descent, GCD can resist outlier influence on gradients. However, our goal was to observe the performance of algorithms without additional modifications, and we did not try to use significant clipping here.

From these results, we identify that future research on private statistical models which incorporate measures of scale and robustness could be particularly promising. Specifically, robust calculations are statistically stable, which reduces the noise required for DP. Considering the scale of features in a linear model allows the information from features "squashed" after normalization to not be lost.

B. Logistic Regression

Table V, Table VI, and Table VII demonstrate the mean of 20 trials of accuracies for the test portions of the Heart, DBworld, and RCV1 datasets, respectively. Subscripts are provided denoting two times the standard error of the mean. Note that ADMM, ADMMHalf, and ProjERM are not included in Table VII since they did not converge within 48 hours. Additionally, the ECOS solver failed to converge in the TS procedure in Table VI. These three tables are presented in the appendix due to the page limit of the main body of the paper. Note that in the appendix, Figure 5, Figure 6, and Figure 7 provide boxplots of the accuracies for the trials on each dataset.

From these three tables, it is clear that GCD, or greedy coordinate descent, performs well on all values of ϵ. As discussed in the previous subsection, GCD methods do not rely on the Lipschitz constant for optimization of the entire weight vector but instead use different Lipschitz constants for each coefficient. This allows GCD to add less noise than other optimizers.

Thus we conclude this section similar to that of linear regression: future research on private logistic regression models should incorporate the scale of each feature separately when adding private noise. This allows for improved empirical utility, in which features with different scales can be noised effectively.

VI. TRENDS IN RESULTS

Having completed the implementation and empirical evaluation of each surveyed method, we note three high-level themes in the research that may form valuable directions for future work. Each of these is significant in that they inhibit practitioner adoption of DP methods by imposing unexpected costs or behaviors compared to non-private methods.

First is the extreme computational cost of using DP linear models compared to their original non-private counterparts, taking orders of magnitude more time to train. This limits real-world utility where datasets can not be subsampled for convenience. A related issue is that regularization penalties do not have the same (mostly) intuitive behavior in a DP model, and there appears to be no clear guidance on how and when to apply regularizers to a DP linear model. Finally, we find that accuracy does not necessarily decrease with stricter privacy parameters. We discuss these issues in the following subsections.

A. Computational Inefficiency

We have evaluated six values of ϵ for ≈ 16 algorithms for each table. Filling each cell required 12-36 hours of computing time for datasets that are considered small by modern machine learning standards. Indeed, training a linear model via scikit-learn or other APIs takes minutes. This extreme cost disparity is perhaps the biggest hindrance to the adoption of DP methods in practice, as a practitioner has no reasonable expectation that they could train a model on a real-world larger dataset.
TABLE II
BODYFAT: MEAN SQUARED ERROR

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
TS 0.1142(0.0067) 0.1133(0.0066) 0.1122(0.0065) 0.1103(0.0064) 0.1051(0.0060) 0.0112(0.0070)
FW 0.0947(0.0053) 0.0947(0.0054) 0.0947(0.0054) 0.0947(0.0054) 0.0947(0.0054) 0.0844(0.0047)
POLYFW 0.0955(0.0053) 0.0955(0.0053) 0.0955(0.0053) 0.0955(0.0053) 0.0955(0.0053) 0.0955(0.0053)
VRFW 0.0965(0.0057) 0.0965(0.0057) 0.0965(0.0057) 0.0965(0.0057) 0.0965(0.0057) 0.0892(0.0058)
HTFW 0.0946(0.0053) 0.0946(0.0053) 0.0946(0.0053) 0.0937(0.0058) 0.0896(0.0049) 0.0886(0.0049)
HTPL 0.0946(0.0053) 0.0946(0.0053) 0.0946(0.0053) 0.0946(0.0053) 0.0946(0.0053) 0.0847(0.0047)
PROJERM 0.0945(0.0053) 0.0945(0.0053) 0.0945(0.0053) 0.0945(0.0053) 0.0945(0.0053) 0.0890(0.0051)
DPIGHT 0.0891(0.0052) 0.0926(0.0054) 0.0896(0.0055) 0.0892(0.0056) 0.0916(0.0052) 0.0932(0.0053)
DPSLKT 0.0902(0.0060) 0.0902(0.0060) 0.0902(0.0060) 0.0902(0.0060) 0.0908(0.0052) 0.0744(0.0075)
HTSL 0.0896(0.0065) 0.0885(0.0059) 0.0894(0.0059) 0.0886(0.0054) 0.0886(0.0056) 0.0942(0.0055)
HTSO 0.0685(0.0091) 0.0727(0.0153) 0.0699(0.0458) 0.0521(0.0131) 0.0607(0.0104) 0.0729(0.0068)
GCDGSQ 1.0307(0.1095) 0.0768(0.0082) 0.0661(0.0046) 0.0701(0.0046) 0.0755(0.0057) 0.0162(0.0022)
GCDGSR 0.0956(0.0053) 0.0041(0.0018) 0.0312(0.0024) 0.0582(0.0033) 0.0697(0.0053) 0.0615(0.0064)
GCDGSS 0.0956(0.0053) 0.0779(0.0048) 0.0894(0.0051) 0.0956(0.0069) 0.0462(0.0090) 0.0435(0.0042)
NM 0.0956(0.0053) 0.0956(0.0053) 0.0953(0.0053) 0.0973(0.0053) 0.0948(0.0054) 0.0980(0.0052)
SGD 0.0916(0.0051) 0.0916(0.0051) 0.0916(0.0051) 0.0916(0.0051) 0.0916(0.0051) 0.0983(0.0055)

TABLE III
PAH: MEAN SQUARED ERROR

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
TS 0.0607(0.0252) 0.0572(0.0245) 0.0607(0.0257) 0.0608(0.0262) 0.0611(0.0275) 0.0837(0.0224)
FW 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0902(0.0166)
POLYFW 0.0909(0.0168) 0.0909(0.0168) 0.0909(0.0168) 0.0909(0.0168) 0.0909(0.0168) 0.0911(0.0168)
VRFW 0.0918(0.0170) 0.0918(0.0170) 0.0918(0.0170) 0.0918(0.0170) 0.0918(0.0170) 0.0911(0.0169)
HTFW 0.0907(0.0167) 0.0907(0.0167) 0.0907(0.0167) 0.0907(0.0167) 0.0905(0.0167) 0.0910(0.0168)
HTPL 0.0907(0.0167) 0.0908(0.0168) 0.0907(0.0167) 0.0907(0.0167) 0.0907(0.0167) 0.0905(0.0167)
PROJERM 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0910(0.0168) 0.0911(0.0168)
DPIGHT 0.0892(0.0163) 0.0897(0.0164) 0.0892(0.0163) 0.0904(0.0167) 0.0904(0.0166) 0.0910(0.0168)
DPSLKT 0.0538(0.0068) 0.0850(0.0146) 0.0538(0.0068) 0.0852(0.0138) 0.0850(0.0146) 0.0828(0.0170)
HTSL 0.0905(0.0168) 0.0901(0.0166) 0.0903(0.0166) 0.0905(0.0167) 0.0908(0.0168) 0.0911(0.0168)
HTSO 0.0581(0.0157) 0.0352(0.0063) 0.0432(0.0062) 0.0530(0.0077) 0.0647(0.0102) 0.0909(0.0168)
GCDGSQ 429.1658(195.0152) 15.6425(4.2077) 3.4062(0.9211) 1.1643(1.2433) 0.0591(0.0231) 0.0474(0.0076)
GCDGSR 1366.9646(1497.3976) 56.6219(61.293) 38.2438(49.2854) 1.5510(0.8699) 0.1607(0.0751) 0.0855(0.0169)
GCDGSS 1.4834(0.9205) 0.2219(0.0879) 0.1485(0.0459) 0.1188(0.0301) 0.1179(0.0522) 0.0395(0.0065)
NM 0.0911(0.0168) 0.0911(0.0168) 0.0911(0.0168) 0.0911(0.0168) 0.0911(0.0168) 0.0911(0.0168)
SGD 0.0914(0.0170) 0.0914(0.0170) 0.0914(0.0170) 0.0914(0.0170) 0.0914(0.0170) 0.0910(0.0168)

TABLE IV
E2006: MEAN SQUARED ERROR

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
TS 0.0392(0.0033) 0.0393(0.0033) 0.0393(0.0033) 0.0393(0.0033) 0.0380(0.0035) 0.0444(0.0052)
FW 0.0390(0.0033) 0.0390(0.0033) 0.0390(0.0033) 0.0390(0.0033) 0.0390(0.0033) 0.0382(0.0033)
POLYFW 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
VRFW 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
HTFW 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
HTPL 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0385(0.0033)
DPIGHT 0.0390(0.0033) 0.0390(0.0033) 0.0390(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
DPSLKT 0.0391(0.0033) 0.0400(0.0035) 0.0394(0.0034) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
HTSL 0.0384(0.0033) 0.0388(0.0033) 0.0388(0.0033) 0.0389(0.0033) 0.0390(0.0033) 0.0391(0.0033)
HTSO 0.0308(0.0031) 0.0247(0.0043) 0.0229(0.0044) 0.0246(0.0040) 0.0282(0.0034) 0.0392(0.0034)
GCDGSQ 0.0909(0.0132) 0.0658(0.0064) 0.0657(0.0064) 0.0656(0.0064) 0.0656(0.0064) 0.0566(0.0058)
GCDGSR 0.0656(0.0064) 0.0656(0.0064) 0.0656(0.0064) 0.0656(0.0064) 0.0656(0.0064) 0.0656(0.0064)
GCDGSS 0.6121(0.1639) 0.0834(0.0085) 0.0690(0.0061) 0.0660(0.0064) 0.0656(0.0064) 0.0656(0.0064)
NM 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
SGD 4.0000(0.0000) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033) 0.0391(0.0033)
As it exists today, only one work has attempted to make the FW algorithm scalable to sparse datasets [57]. While they were able to obtain significant speedups of up to four orders of magnitude, the FW algorithm never performed best in any of the ϵ ≤ 5 experiments. It remains to be determined how many DP algorithms for linear models can be made computationally tractable while balancing their efficacy. A key issue in this task is the nature of DP itself; adding noise at each step of the process is a computationally demanding task in-and-of itself that tends to remove all sparsity that one would exploit in high-dimensional problems.
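The sparsity point can be made concrete with a minimal NumPy sketch: one Gaussian-mechanism-style update touches every coordinate, so a sparse iterate (and any sparse-matrix speedup built on it) becomes fully dense after a single step. The noise scale below is an arbitrary placeholder, not a calibrated DP parameter.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10_000                                     # high-dimensional problem
    w = np.zeros(d)
    w[rng.choice(d, size=10, replace=False)] = 1.0
    print(np.count_nonzero(w))                     # 10 nonzero coordinates

    sigma = 0.1                  # placeholder; DP would calibrate this to the
    w += rng.normal(0.0, sigma, size=d)            # sensitivity and (eps, delta)
    print(np.count_nonzero(w))                     # 10000: all sparsity is gone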
B. Unclear Impact of Regularization

In the linear model case, the L1 penalty is particularly desirable because it is robust to spurious uninformative features and produces sparse solutions [58]. For both L1 and L2 models, a regularization path is a common desideratum, where the penalty λ is varied from small-to-large and the smooth relationship between the coefficients (and accuracy) and the value of λ is used to inform model selection and interrogate the data [59]. This is often achieved via a "warm start", where the solution at one value of λ is used as the starting solution for an adjacent value λ′ ≠ λ [60].
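A minimal non-private sketch of this workflow, using scikit-learn's Lasso on synthetic data (the λ grid and the data are arbitrary and for illustration only):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))          # synthetic high-dimensional data
    y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=200)

    # Regularization path with warm starts: each fit is seeded with the
    # previous solution, and sparsity varies smoothly with the penalty.
    model = Lasso(warm_start=True)
    for lam in [0.01, 0.03, 0.1, 0.3, 1.0]:
        model.set_params(alpha=lam)
        model.fit(X, y)                      # reuses the last coef_ as a start
        print(f"lambda={lam}: {np.count_nonzero(model.coef_)} nonzero coefficients")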
When training differentially private linear models, we do not see these common benefits occur. In all cases, even when using an L1 penalty on the weight vector, the solution is completely dense. This simultaneously removes the benefit of sparse solutions (larger models) and mitigates the benefit of robustness in the face of irrelevant features (because all features are always used). Warm starting the solution via a previous value of λ is also non-viable, as it would require expending additional privacy budget.

The nature of how to regularize DP linear models and their ultimate benefit is an open problem. One possible hypothesis is that the noise added by DP itself acts as a regularizer. This is supported by prior results in deep learning on the utility of noise [61]. Other works in DP learning, with a backbone network trained on non-private data, also support an implicit regularization effect in obtaining good results without any additional regularization [54], though the use of a backbone confounds drawing a conclusion.

C. Accuracy Asperity with Privacy

It is natural that as ϵ decreases, the accuracy should also decrease, as the limit of ϵ → 0 implies a constant prediction. However, this is only a weakly observable trend in our experiments, despite running each trial 20 times and averaging the result. This has also been observed by [62], [63], and appears to be a broader issue with DP learning. To some degree the issue is unavoidable because noise is added to the model, and so noise necessarily increases in the output. However, it remains to be determined if the amount of asperity in results for a given change in ϵ is "optimal" and how much it could be reduced.

One possible issue is the intersection of the optimization algorithm used and that of the noise added. Running a classical optimizer until convergence to a global minimum (in the case of strongly convex losses) is possible regardless of the convergence rate of the algorithm being tested. This is not the case for DP, as each access to the data must be accounted for in the total privacy budget. This creates an interplay between the target ϵ and convergence, where it may be best to perform fewer optimization iterations to achieve a lower total ϵ, but thus result in an irregular effective convergence rate. Another possibility is implied by the results of [64], who observe that there is a finite practical range of ϵ values that have a meaningful impact on privacy. Values of ϵ below or above this range add/reduce the noise without changing the identifiability of the data, and thus could lead to lower ϵ values obtaining better accuracy by chance, because chance was the only factor in relative ranking.

VII. CONCLUSION

In this paper, we provide the first unified review of optimization methods for high-dimensional DP linear models. In doing so, we give an overview of the strengths and weaknesses of many methods, highlighting the different approaches taken in prior literature. Next, we conduct empirical experiments for a systematic comparison of the optimization methods, which has not been done in prior work. Finally, we release our code for easier future use and better analysis of future algorithms.

Our empirical experiments highlight surprising and previously unobserved trends in optimization for high-dimensional DP linear models. Specifically, we find that methods which are able to take into account the scale of each feature perform better than those which rely on the Lipschitz constant of the loss function. Indeed, even when the Lipschitz constant of the loss function is bounded, heavy-tailed or coordinate-based optimization techniques can perform better since they are more robust and add less noise.

We believe that this paper and the surprising result highlighted above can influence future research on differentially private optimization, which is an active research field. Further study on heavy-tailed or coordinate-wise optimization can improve performance on the tasks listed in this review, and may even translate to more complicated models such as neural networks.

REFERENCES

[1] P. Lyssiotou, P. Pashardes, and T. Stengos, "Age effects on consumer demand: an additive partially linear regression model," Canadian Journal of Economics/Revue canadienne d'économique, vol. 35, no. 1, pp. 153–165, 2002.
[2] C. R. Madhuri, G. Anuradha, and M. V. Pujitha, "House price prediction using regression techniques: A comparative study," in 2019 International Conference on Smart Structures and Systems (ICSSS). IEEE, 2019, pp. 1–5.
[3] I. Kurt, M. Ture, and A. T. Kurum, "Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease," Expert Systems with Applications, vol. 34, no. 1, pp. 366–374, 2008.
[4] Y. Sahin and E. Duman, "Detecting credit card fraud by ANN and logistic regression," in 2011 International Symposium on Innovations in Intelligent Systems and Applications. IEEE, 2011, pp. 315–319.
[5] K. P. Murphy, Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
[6] P. Bühlmann, "Statistical significance in high-dimensional linear models," 2013.
[7] P. Rigollet, "18.S997: High dimensional statistics," Lecture Notes, Cambridge, MA, USA: MIT OpenCourseWare, 2015.
[8] L. Hu, S. Ni, H. Xiao, and D. Wang, "High dimensional differentially private stochastic optimization with heavy-tailed data," in Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2022, pp. 227–236.
[9] C. Dwork, "Differential privacy," in International Colloquium on Automata, Languages, and Programming. Springer, 2006, pp. 1–12.
[10] C. Dwork, A. Roth et al., "The algorithmic foundations of differential privacy," Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
[11] A. Beimel, K. Nissim, and U. Stemmer, "Private learning and sanitization: Pure vs. approximate differential privacy," in International Workshop on Approximation Algorithms for Combinatorial Optimization. Springer, 2013, pp. 363–378.
[12] C. Dwork and J. Lei, "Differential privacy and robust statistics," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009, pp. 371–380.
[13] P. Kairouz, S. Oh, and P. Viswanath, "The composition theorem for differential privacy," in International Conference on Machine Learning. PMLR, 2015, pp. 1376–1385.
[14] J. P. Near and C. Abuah, "Programming differential privacy," 2021.
[15] M. Bun and T. Steinke, "Concentrated differential privacy: Simplifications, extensions, and lower bounds," in Theory of Cryptography Conference. Springer, 2016, pp. 635–658.
[16] I. Mironov, "Rényi differential privacy," in 2017 IEEE 30th Computer Security Foundations Symposium (CSF). IEEE, 2017, pp. 263–275.
[17] F. McSherry and K. Talwar, "Mechanism design via differential privacy," in 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). IEEE, 2007, pp. 94–103.
[18] X. Xiong, S. Liu, D. Li, Z. Cai, and X. Niu, "A comprehensive survey on local differential privacy," Security and Communication Networks, vol. 2020, pp. 1–29, 2020.
[19] B. I. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft, "Learning in a large function space: Privacy-preserving mechanisms for SVM learning," arXiv preprint arXiv:0911.5708, 2009.
[20] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, "Differentially private empirical risk minimization," Journal of Machine Learning Research, vol. 12, no. 3, 2011.
[21] D. Kifer, A. Smith, and A. Thakurta, "Private convex empirical risk minimization and high-dimensional regression," in Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012, pp. 25–1.
[22] R. Bassily, A. Smith, and A. Thakurta, "Private empirical risk minimization: Efficient algorithms and tight error bounds," in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 2014, pp. 464–473.
[23] A. Cheu, "Differential privacy in the shuffle model: A survey of separations," arXiv preprint arXiv:2107.11839, 2021.
[24] A. McMillan, O. Javidbakht, K. Talwar, E. Briggs, M. Chatzidakis, J. Chen, J. Duchi, V. Feldman, Y. Goren, M. Hesse et al., "Private federated statistics in an interactive setting," arXiv preprint arXiv:2211.10082, 2022.
[25] H. Xu, C. Caramanis, and S. Mannor, "Sparse algorithms are not stable: A no-free-lunch theorem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 187–193, 2011.
[26] J. Subramanian and R. Simon, "Overfitting in prediction models–is it a problem only in high dimensions?" Contemporary Clinical Trials, vol. 36, no. 2, pp. 636–641, 2013.
[27] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019, vol. 48.
[28] P. Jain, A. Tewari, and P. Kar, "On iterative hard thresholding methods for high-dimensional m-estimation," Advances in Neural Information Processing Systems, vol. 27, 2014.
[29] D. Bertsimas, A. King, and R. Mazumder, "Best subset selection via a modern optimization lens," 2016.
[30] M. Frank, P. Wolfe et al., "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.
[31] O. Maillard and R. Munos, "Compressed least-squares regression," Advances in Neural Information Processing Systems, vol. 22, 2009.
[32] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[33] S. J. Wright, "Coordinate descent algorithms," Mathematical Programming, vol. 151, no. 1, pp. 3–34, 2015.
[34] R. Glowinski, "On alternating direction methods of multipliers: a historical perspective," Modeling, Simulation and Optimization for Science and Technology, pp. 59–82, 2014.
[35] A. G. Thakurta and A. Smith, "Differentially private feature selection via stability arguments, and the robustness of the lasso," in Conference on Learning Theory. PMLR, 2013, pp. 819–850.
[36] A. Khanna, F. Lu, and E. Raff, "The challenge of differentially private screening rules," arXiv preprint arXiv:2303.10303, 2023.
[37] K. Talwar, A. Guha Thakurta, and L. Zhang, "Nearly optimal private lasso," Advances in Neural Information Processing Systems, vol. 28, 2015.
[38] R. Bassily, C. Guzmán, and A. Nandi, "Non-euclidean differentially private stochastic convex optimization," in Conference on Learning Theory. PMLR, 2021, pp. 474–499.
[39] H. Asi, V. Feldman, T. Koren, and K. Talwar, "Private stochastic convex optimization: Optimal rates in l1 geometry," in International Conference on Machine Learning. PMLR, 2021, pp. 393–403.
[40] S. P. Kasiviswanathan and H. Jin, "Efficient private empirical risk minimization for high-dimensional learning," in International Conference on Machine Learning. PMLR, 2016, pp. 488–497.
[41] K. Zheng, W. Mou, and L. Wang, "Collect at once, use effectively: Making non-interactive locally private learning possible," in International Conference on Machine Learning. PMLR, 2017, pp. 4130–4139.
[42] P. Wang and H. Zhang, "Differential privacy for sparse classification learning," Neurocomputing, vol. 375, pp. 91–101, 2020.
[43] L. Wang and Q. Gu, "Differentially private iterative gradient hard thresholding for sparse learning," in 28th International Joint Conference on Artificial Intelligence, 2019.
[44] ——, "A knowledge transfer framework for differentially private sparse learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6235–6242.
[45] D. Wang and J. Xu, "On sparse linear regression in the local differential privacy model," in International Conference on Machine Learning. PMLR, 2019, pp. 6628–6637.
[46] P. Mangold, A. Bellet, J. Salmon, and M. Tommasi, "High-dimensional private empirical risk minimization by greedy coordinate descent," arXiv preprint arXiv:2207.01560, 2022.
[47] A. Khanna, F. Lu, E. Raff, and B. Testa, "Differentially private logistic regression with sparse solutions," in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, ser. AISec '23. New York, NY, USA: Association for Computing Machinery, 2023, p. 1–9. [Online]. Available: https://doi.org/10.1145/3605764.3623910
[48] O. Catoni and I. Giulini, "Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression," arXiv preprint arXiv:1712.02747, 2017.
[49] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[50] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, "A rewriting system for convex optimization problems," Journal of Control and Decision, vol. 5, no. 1, pp. 42–60, 2018.
[51] A. Domahidi, E. Chu, and S. Boyd, "ECOS: An SOCP solver for embedded systems," in 2013 European Control Conference (ECC). IEEE, 2013, pp. 3071–3076.
[52] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd, "Conic optimization via operator splitting and homogeneous self-dual embedding," Journal of Optimization Theory and Applications, vol. 169, no. 3, pp. 1042–1068, June 2016.
[53] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[54] M. Tobaben, A. Shysheya, J. Bronskill, A. Paverd, S. Tople, S. Zanella-Beguelin, R. E. Turner, and A. Honkela, "On the efficacy of differentially private few-shot image classification," arXiv preprint arXiv:2302.01190, 2023.
[55] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
[56] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: Networked science in machine learning," SIGKDD Explorations, vol. 15, no. 2, pp. 49–60, 2013.
[57] E. Raff, A. Khanna, and F. Lu, "Scaling up differentially private lasso regularized logistic regression via faster Frank-Wolfe iterations," in Advances in Neural Information Processing Systems, vol. 36. Curran Associates, Inc., 2023.
[58] A. Y. Ng, "Feature selection, l1 vs. l2 regularization, and rotational invariance," Twenty-First International Conference on Machine Learning - ICML '04, p. 78, 2004.
[59] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, p. 1–22, 2010.
[60] C.-H. Tsai, C.-Y. Lin, and C.-J. Lin, "Incremental and decremental training for linear classification," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '14. New York, NY, USA: Association for Computing Machinery, 2014, p. 343–352. [Online]. Available: https://doi.org/10.1145/2623330.2623661
[61] H. Noh, T. You, J. Mun, and B. Han, "Regularizing deep neural networks by noise: Its interpretation and optimization," in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS'17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5115–5124.
[62] M. S. Alvim, N. Fernandes, A. McIver, C. Morgan, and G. H. Nunes, "A novel analysis of utility in privacy pipelines, using Kronecker products and quantitative information flow," in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '23. New York, NY, USA: Association for Computing Machinery, 2023, p. 1718–1731. [Online]. Available: https://doi.org/10.1145/3576915.3623081
[63] A. Khanna, V. Schaffer, G. Gürsoy, and M. Gerstein, "Privacy-preserving model training for disease prediction using federated learning with differential privacy," in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2022, pp. 1358–1361.
[64] T. LeBlond, J. Munoz, F. Lu, M. Fuchs, E. Zaresky-Williams, E. Raff, and B. Testa, "Probing the transition to dataset-level privacy in ML models using an output-specific and data-resolved privacy profile," in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, ser. AISec '23. New York, NY, USA: Association for Computing Machinery, 2023, p. 23–33. [Online]. Available: https://doi.org/10.1145/3605764.3623904
APPENDIX

First, we provide tabular results for the logistic regression datasets which did not fit into the main body of the paper:
TABLE V
HEART: ACCURACY

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
TS 0.4639(0.0823) 0.4713(0.0829) 0.6509(0.0475) 0.6722(0.0347) 0.7333(0.0258) 0.8009(0.0237)
FW 0.5481(0.0451) 0.5778(0.0530) 0.5806(0.0540) 0.6130(0.0262) 0.6509(0.0304) 0.7602(0.0253)
POLYFW 0.5759(0.0577) 0.5759(0.0577) 0.5759(0.0577) 0.5759(0.0577) 0.5759(0.0577) 0.5620(0.0658)
VRFW 0.3815(0.0264) 0.3815(0.0264) 0.3815(0.0264) 0.3815(0.0264) 0.3815(0.0264) 0.5630(0.0437)
HTFW 0.6602(0.0544) 0.6648(0.0519) 0.6602(0.0544) 0.6602(0.0544) 0.6630(0.0530) 0.8000(0.0295)
PROJERM 0.4537(0.0724) 0.4537(0.0724) 0.4481(0.0741) 0.4481(0.0741) 0.4509(0.0739) 0.7741(0.0229)
ADMM 0.6519(0.0308) 0.6815(0.0453) 0.6843(0.0507) 0.6815(0.0526) 0.6861(0.0357) 0.6380(0.0282)
ADMMHALF 0.7861(0.0232) 0.7963(0.0237) 0.7972(0.0237) 0.7944(0.0245) 0.8019(0.0240) 0.7602(0.0205)
DPIGHT 0.5787(0.0508) 0.5944(0.0423) 0.6880(0.0428) 0.7380(0.0493) 0.7991(0.0228) 0.8278(0.0223)
DPSLKT 0.5981(0.0396) 0.5843(0.0390) 0.5981(0.0396) 0.5981(0.0396) 0.5935(0.0384) 0.7611(0.0245)
HTSO 0.6722(0.0294) 0.6731(0.0301) 0.6731(0.0301) 0.6759(0.0292) 0.6769(0.0313) 0.8287(0.0240)
GCDGSQ 0.7852(0.0235) 0.8306(0.0261) 0.8111(0.0229) 0.8111(0.0229) 0.7954(0.0216) 0.7722(0.0325)
GCDGSR 0.7778(0.0206) 0.7778(0.0206) 0.7778(0.0206) 0.7778(0.0206) 0.7778(0.0206) 0.7778(0.0201)
GCDGSS 0.7574(0.0203) 0.7574(0.0203) 0.7574(0.0203) 0.7574(0.0203) 0.7574(0.0203) 0.7574(0.0203)
NM 0.5324(0.0540) 0.3000(0.0424) 0.3704(0.0294) 0.3111(0.0362) 0.4259(0.0415) 0.3778(0.0451)
SGD 0.6269(0.0464) 0.6269(0.0464) 0.6269(0.0464) 0.6269(0.0464) 0.6269(0.0464) 0.6713(0.0306)

TABLE VI
DBWORLD: ACCURACY

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
FW 0.4769(0.0637) 0.5077(0.0504) 0.4769(0.0637) 0.4808(0.06) 0.4769(0.0637) 0.7038(0.0477)
POLYFW 0.4692(0.0557) 0.4692(0.0557) 0.4692(0.0557) 0.5077(0.054) 0.4692(0.0557) 0.4769(0.038)
VRFW 0.6038(0.0561) 0.6038(0.0561) 0.6038(0.0561) 0.6038(0.0561) 0.6038(0.0561) 0.5077(0.0573)
HTFW 0.4731(0.0464) 0.4731(0.0464) 0.4731(0.0464) 0.4731(0.0464) 0.4731(0.0464) 0.5154(0.0462)
PROJERM 0.5500(0.0614) 0.5808(0.0655) 0.5808(0.0655) 0.5808(0.0655) 0.5808(0.0655) 0.6423(0.0572)
ADMM 0.4808(0.0669) 0.4808(0.0669) 0.4808(0.0669) 0.5115(0.0653) 0.4808(0.0669) 0.5615(0.0724)
ADMMHALF 0.5462(0.0472) 0.5500(0.0538) 0.5500(0.0464) 0.5731(0.0646) 0.4885(0.0358) 0.5615(0.0724)
DPIGHT 0.5308(0.0546) 0.4731(0.0450) 0.5885(0.0490) 0.5077(0.0573) 0.4808(0.043) 0.5462(0.074)
DPSLKT 0.4654(0.0467) 0.4654(0.0467) 0.4654(0.0467) 0.4654(0.0467) 0.5462(0.059) 0.4962(0.0541)
HTSO 0.6038(0.0561) 0.6038(0.0561) 0.6115(0.0683) 0.6192(0.0626) 0.6192(0.0626) 0.5154(0.0592)
GCDGSQ 0.6885(0.0378) 0.7154(0.0448) 0.7154(0.0448) 0.6769(0.077) 0.7154(0.0652) 0.6885(0.0505)
GCDGSR 0.5615(0.0536) 0.5615(0.0536) 0.5615(0.0536) 0.5615(0.0536) 0.5615(0.0536) 0.5615(0.0536)
GCDGSS 0.6577(0.0736) 0.6577(0.0736) 0.6577(0.0736) 0.6577(0.0736) 0.6577(0.0736) 0.7(0.0431)
NM 0.5192(0.0485) 0.5192(0.0485) 0.5654(0.0743) 0.4692(0.062) 0.4154(0.0846) 0.4808(0.0545)
SGD 0.4154(0.0504) 0.4154(0.0504) 0.4154(0.0504) 0.4154(0.0504) 0.4154(0.0504) 0.5154(0.0536)

TABLE VII
RCV1: ACCURACY

ϵ 0.1 0.5 1.0 2.0 5.0 NONPRIVATE
TS 0.5095(0.0578) 0.4550(0.0540) 0.4550(0.0540) 0.4575(0.0541) 0.4585(0.0540) 0.4670(0.0531)
FW 0.5175(0.0217) 0.4995(0.0204) 0.4950(0.0176) 0.4930(0.0167) 0.5175(0.0217) 0.6440(0.0323)
POLYFW 0.5055(0.0183) 0.5015(0.0218) 0.5055(0.0183) 0.5030(0.0202) 0.5055(0.0183) 0.4895(0.0230)
VRFW 0.5170(0.0190) 0.5170(0.0190) 0.5170(0.0190) 0.5170(0.0190) 0.5170(0.0190) 0.4930(0.0263)
HTFW 0.5185(0.0168) 0.5185(0.0168) 0.5185(0.0168) 0.5185(0.0164) 0.5185(0.0168) 0.5805(0.0555)
DPIGHT 0.5255(0.0212) 0.5140(0.0274) 0.5140(0.0274) 0.5140(0.0274) 0.5140(0.0274) 0.5315(0.0248)
DPSLKT 0.5660(0.0170) 0.5620(0.0167) 0.5660(0.0170) 0.5485(0.0295) 0.5620(0.0167) 0.5225(0.0208)
HTSO 0.5060(0.0220) 0.5195(0.0266) 0.5360(0.0180) 0.5360(0.0180) 0.5195(0.0266) 0.5145(0.0236)
GCDGSQ 0.5790(0.0222) 0.5805(0.0222) 0.5945(0.0213) 0.5940(0.0210) 0.5955(0.0201) 0.6125(0.0185)
GCDGSR 0.5305(0.0237) 0.5305(0.0237) 0.5305(0.0237) 0.5305(0.0237) 0.5285(0.0222) 0.5860(0.0269)
GCDGSS 0.5920(0.0184) 0.5920(0.0184) 0.5925(0.0185) 0.5940(0.0187) 0.5970(0.0195) 0.6180(0.0154)
NM 0.4985(0.0239) 0.5010(0.0232) 0.4745(0.0208) 0.5135(0.0197) 0.4615(0.0204) 0.5020(0.0233)
SGD 0.0000(0.0000) 0.5295(0.0216) 0.5295(0.0216) 0.5295(0.0216) 0.5295(0.0216) 0.4870(0.0180)
Here we detail the hyperparameters tested for each algorithm, and the acronyms employed in the paper:

TABLE VIII
HYPERPARAMETERS

TS        Sparsity ∈ {1, 2, 5, 10}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
FW        Iter ∈ {1, 2, 5, 10, 20, 50, 100}
POLYFW    None
VRFW      None
HTFW      Iter ∈ {1, 2, 5, 10, 20, 50, 100}; s ∈ {1, 10, 100}
HTPL      Iter ∈ {1, 2, 5, 10, 20, 50, 100}
PROJERM   Latent ∈ {2, 5, 10, 20}
ADMM      Iter ∈ {1, 2, 5, 10, 20, 50, 100}; γ ∈ {0.001, 0.01, 0.1, 1}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
ADMMHALF  Iter ∈ {1, 2, 5, 10, 20, 50, 100}; γ ∈ {0.001, 0.01, 0.1, 1}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
DPIGHT    Sparsity ∈ {1, 2, 5, 10}; LR ∈ {0.001, 0.01, 0.1}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}
DPSLKT    Sparsity ∈ {1, 2, 5, 10}; LR ∈ {0.001, 0.01, 0.1}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}; λ ∈ {0.001, 0.01, 0.1, 1}
HTSL      Sparsity ∈ {1, 2, 5, 10}; LR ∈ {0.001, 0.01, 0.1}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}
HTSO      Sparsity ∈ {1, 2, 5, 10}; LR ∈ {0.001, 0.01, 0.1}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}; s ∈ {1, 10, 100}
GCDGSQ    Sparsity ∈ {1, 2, 5, 10}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
GCDGSR    Sparsity ∈ {1, 2, 5, 10}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
GCDGSS    Sparsity ∈ {1, 2, 5, 10}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}; Reg ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
NM        None
SGD       Batch Size ∈ {32, 64, 128}; LR ∈ {0.001, 0.01, 0.1}; Iter ∈ {1, 2, 5, 10, 20, 50, 100}

TABLE IX
ACRONYMS AND DEFINITIONS

Acronym    Definition
DP         Differential Privacy
TS         Two-Stage [21]
FW         Frank-Wolfe [37]
POLYFW     Variance-Reduced Frank-Wolfe [38]
VRFW       Variance-Reduced Frank-Wolfe [39]
HTFW       Heavy-Tailed Frank-Wolfe [8]
HTPL       Heavy-Tailed Private Lasso [8]
PROJERM    Projected Empirical Risk Minimization [40]
ADMM       Alternating Direction Method of Multipliers [42]
ADMMHALF   Alternating Direction Method of Multipliers [42]
DPIGHT     DP Iterative Gradient Hard Thresholding [43]
DPSLKT     DP Knowledge Transfer Framework [44]
HTSL       Heavy-Tailed Sparse Lasso [8]
HTSO       Heavy-Tailed Sparse Optimization [8]
GCDGSQ     Greedy Coordinate Descent [46]
GCDGSR     Greedy Coordinate Descent [46]
GCDGSS     Greedy Coordinate Descent [46]
NM         Noise Multiplier [39]
GCD        Greedy Coordinate Descent [46]
CVXPY      Convex Programming Python Package [49]
SCS        Splitting Conic Solver [50]
ECOS       Embedded Conic Solver [51]
ERM        Empirical Risk Minimization
CPU        Central Processing Unit
GHz        Gigahertz
PCA        Principal Components Analysis
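The GCDGSQ, GCDGSR, and GCDGSS variants above differ only in the greedy rule used to select a coordinate. As a rough, non-private skeleton of the shared structure, the sketch below uses a simple largest-gradient selection heuristic; the exact GSQ, GSR, and GSS scoring rules are those of [46], and a private variant would add calibrated noise both to the selection step (e.g., via the exponential mechanism) and to the coordinate update.

    import numpy as np

    def greedy_cd(X, y, iters=20, lr=0.1):
        """Non-private greedy coordinate descent for least squares."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            grad = X.T @ (X @ w - y) / n      # gradient of the squared loss
            j = int(np.argmax(np.abs(grad)))  # greedy rule: pick the coordinate
            w[j] -= lr * grad[j]              #   with the largest gradient and
        return w                              #   update only that coordinate

    X = np.random.default_rng(0).normal(size=(50, 200))
    y = X[:, 0] - X[:, 1]
    print(np.count_nonzero(greedy_cd(X, y)))  # at most `iters` nonzero entries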
Next, we provide boxplots summarizing the information
provided in the results tables:
Fig. 2. Bodyfat: Mean Squared Error

Fig. 3. PAH: Mean Squared Error


Fig. 4. E2006: Mean Squared Error

Fig. 5. Heart: Accuracy


Fig. 6. DBworld: Accuracy

Fig. 7. RCV1: Accuracy
