Paper I
Published in: 2021 IEEE 31st International Workshop on Machine Learning for Signal
Processing (MLSP)
DOI: https://doi.org/10.1109/MLSP52302.2021.9596115
AURA:
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing this
material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other
works.
Machine Learning for Signal Reconstruction from Streaming Time-series Data
A.1 Introduction
Regression problems are among the most important problems due to their numerous
applications and relevance across a wide range of fields. In practice, regression problems
are usually formulated as convex optimization problems with strongly convex objectives
over convex feasible sets. Besides being one of the most benign settings, this
formulation includes significant instances of interest, such as those arising in regularized
regression [19], for example, to reduce the complexity of the reconstruction
by promoting smoothness. For this reason, such strongly convex objectives are commonly
set as the sum of a convex loss, which reflects how far the solution lies from
the data samples, and a strongly convex regularizer, which controls the complexity
of the solution. On the other hand, most real-world scenarios where regression
techniques are useful involve dynamic environments. This fact motivates
the design of online methods, which track the underlying target signals over time
in a recursive manner with reduced memory and computational requirements.
In particular, this paper focuses on sequentially streamed quantized signals.
When the underlying physical process generating the signal data samples is un-
known, as is usual in practice, instead of blindly selecting a certain ad-hoc parametric
regression model, the target signal can be estimated from the data samples. This
can be done by means of non-parametric regression methods at the expense of a
certain memory and computational cost that can be controlled.
Under the mathematical framework of Reproducing Kernel Hilbert Spaces (RKHSs)
and thanks to the Representer Theorem [80], such a non-parametric estimate can
be constructed from a pre-selected reproducing kernel with a complexity that grows
linearly with the number of data samples. Regression with kernels and its online vari-
ants have been widely studied in the literature [60, 22]. Their main strength is that
they are able to find non-linear patterns at a reasonable computational cost. The
Naive Online regularized Risk Minimization Algorithm (NORMA) [12] is arguably
the most representative algorithm from the stochastic approximation kernel-based
perspective. In its standard form, it concentrates all the novelty in the new expan-
sion coefficient of the signal estimate. However, intuitively, it seems reasonable to
distribute the novelty among several expansion coefficients that contribute to the
signal estimate instead. In this way, the novelty and the correction of previous estimation
errors are integrated more naturally into the signal estimate.
To the best of our knowledge, most of the existing literature has focused on con-
trolling the signal estimate complexity rather than focusing on strategies to control
the error in the estimates. Examples of research works controlling complexity are
truncation [12] and model-order control via dictionary refining [81], among others
[82, 83]. Only some works have studied reducing the signal estimate errors by means
of a sliding window scheme [84, 23, 85]. However, in [84], the selection criterion used to
choose among all possible function estimates is least squares, making it unsuitable
for more general settings, such as incorporating quantization intervals instead of
signal values. Similarly, in [23], even though its selection criterion allows a certain
freedom, regularization is not encouraged and, therefore, the smoothness of the
underlying physical signal is not fully promoted. Lastly, in [85], the selection criterion
is constructed as a regularized augmentation of instantaneous loss-data pairs. As a
result, it naturally extends NORMA in a sliding window scheme. Nonetheless, in
this work, we present a novel algorithm, consisting of a robust selection criterion
alongside a conveniently engineered optimization method, which outperforms all these
algorithms in the task of regression-based tracking of quantized signals.
The paper is structured as follows: Sec. A.2 presents the windowed cost and
formulates the problem from a learner-adversary perspective. Then, in Sec. A.3, we
provide our main contribution: a novel method to minimize the windowed cost via
proximal average functional gradient descent. The resulting approach, a novel algo-
rithm called WORM, is used for the practical use case of regression-based tracking
of quantized signals. Next, in Sec. A.4, we provide its tracking guarantees through
a dynamic regret analysis. Finally, in Sec. A.5, we analyze the experimental perfor-
mance of our algorithm using synthetic data, and Sec. A.6 concludes the paper.
In contrast to an instantaneous functional cost, i.e., a functional evaluated over one
data sample, we formulate the hypothesis that a concurrent functional cost, i.e., a
functional that considers up to L ∈ N data samples simultaneously, may lead to better
performance at the expense of a higher but bounded computational cost.
In order to test our hypothesis, we first consider a proper convex instantaneous
loss ℓ_n : H → R ∪ {∞} given by
\[
  \ell_n(f) \triangleq \ell(f(x_n), y_n) = \ell\big(\langle f, k(x_n, \cdot)\rangle_{\mathcal H},\, y_n\big),
  \tag{A.1}
\]
where k(x_n, ·) is the reproducing kernel associated with the RKHS H centered at
x_n. Notice that the equality in (A.1) holds thanks to the reproducing property [12].
Consequently, we define the so-called windowed cost as the composite of a weighted
arithmetic mean of instantaneous losses as in (A.1), computed over L consecutive
data samples, and the squared Hilbert norm associated with H as the regularizer, i.e.,
\[
  C_n(f) \triangleq L_n(f) + \frac{\lambda}{2}\|f\|_{\mathcal H}^2,
  \tag{A.2}
\]
with regularization parameter λ > 0 and where the windowed loss L_n : H → R ∪ {∞}
is given by
\[
  L_n(f) = \sum_{i=l_n}^{n} \omega_i^{(n)} \ell_i(f),
  \tag{A.3}
\]
where l_n = max{1, n − L + 1} and \sum_{i=l_n}^{n} \omega_i^{(n)} = 1 with \omega_i^{(n)} \ge 0. Finally, the RKHS
H, the instantaneous loss ℓ, the regularization parameter λ, and the tuning routine of
the convex weights \{\omega_i^{(n)}\}_{i=l_n}^{n} are specified by the user.
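For concreteness, the following is a minimal Python sketch of the windowed cost (A.2)-(A.3) for a kernel expansion f = Σ_j α_j k(x_j, ·). The Gaussian kernel, the squared-error instantaneous loss, and all names are illustrative assumptions rather than the paper's final choices (Sec. A.3.2 adopts a quantization-interval loss instead); `weights` holds the convex weights ω_i^{(n)} of the current window.

```python
import numpy as np

def gaussian_kernel(x, t, sigma=3.0):
    # Gaussian reproducing kernel; works element-wise or as a Gram matrix via outer subtraction.
    return np.exp(-np.subtract.outer(np.asarray(x, float), np.asarray(t, float)) ** 2
                  / (2.0 * sigma ** 2))

def windowed_cost(alphas, centers, xs, ys, n, L, weights, lam, sigma=3.0):
    """Windowed cost C_n(f) = L_n(f) + (lam / 2) * ||f||_H^2 for f = sum_j alphas[j] * k(centers[j], .).

    A squared-error instantaneous loss is assumed purely for illustration.
    """
    alphas = np.asarray(alphas, float)
    l_n = max(0, n - L + 1)                 # 0-indexed analogue of l_n = max{1, n - L + 1}
    window = range(l_n, n + 1)
    # Weighted arithmetic mean of instantaneous losses over the window, cf. (A.3).
    L_n = sum(w * (alphas @ gaussian_kernel(centers, xs[i], sigma) - ys[i]) ** 2
              for w, i in zip(weights, window))
    # Squared RKHS norm ||f||_H^2 = alpha^T K alpha over the expansion centers.
    K = gaussian_kernel(centers, centers, sigma)
    return L_n + 0.5 * lam * alphas @ K @ alphas
```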
The dynamic regret captures how well the sequence of function estimates {fn }N n=1
matches the sequence of optimal decisions in environments that may change unpre-
dictably over time. In general, obtaining a bound on the dynamic regret may not be
possible [66]. However, under some mild assumptions on the sequence of functional
costs, it is possible to derive worst-case bounds in terms of the cumulative variation
of the optimal function estimates
\[
  C_N = \sum_{n=2}^{N} \|f_n^* - f_{n-1}^*\|_{\mathcal H}.
  \tag{A.5}
\]
for all h ∈ H. Notice that since the objective in (A.8) is strongly convex, the
proximal map is single-valued.
Next, we denote by L_n^η the so-called proximal average functional of the windowed
loss in (A.3) at instant n, with real parameter η > 0, as the unique closed proper
convex functional such that
\[
  M_{\eta} L_n^{\eta} = \sum_{i=l_n}^{n} \omega_i^{(n)} M_{\eta} \ell_i,
  \tag{A.9}
\]
where ℓ_i ≜ ℓ(f(x_i), y_i) for all f ∈ H. Even though it is possible to derive an explicit
expression for the proximal average functional from its definition (definition 4.1,
[65]), for the sake of clarity, and since only its existence is needed for the algorithm,
we do not include its explicit form here.

¹Its proximal operator can be computed efficiently.
At each iteration n, our algorithm executes the steps:
\[
  \bar f_n = f_n - \eta\, \partial_f \left. \tfrac{\lambda}{2}\|f\|_{\mathcal H}^2 \right|_{f = f_n},
  \tag{A.10a}
\]
\[
  f_{n+1} = \operatorname{prox}_{\eta L_n^{\eta}}(\bar f_n),
  \tag{A.10b}
\]
with 0 < η ≤ λ^{-1}. The first algorithm step (A.10a) is equivalent to \bar f_n = ρ f_n with
ρ ≜ (1 − ηλ) ∈ [0, 1). The proximal operator prox_{η L_n^η} : H → H can be readily
computed by differentiating both sides of the definition in (A.9) while applying the
Moreau envelope property given by (A.7), yielding
\[
  \operatorname{prox}_{\eta L_n^{\eta}} = \sum_{i=l_n}^{n} \omega_i^{(n)} \operatorname{prox}_{\eta \ell_i}.
  \tag{A.11}
\]
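To illustrate the structure of steps (A.10a)-(A.10b) together with the decomposition (A.11), here is a deliberately simplified scalar sketch, not the RKHS setting of the paper: absolute-value losses ℓ_i(f) = |f − y_i| stand in for the instantaneous loss, and the function names, constants, and synthetic drifting target are illustrative assumptions.

```python
import numpy as np

def prox_abs(v, y, eta):
    # Proximal operator of eta * |v - y| (illustrative scalar loss): clip toward y.
    return v - np.clip(v - y, -eta, eta)

def pa_fgd_step(f, window_y, weights, eta, lam):
    """One proximal-average step in a scalar toy setting.

    Mirrors (A.10a)-(A.10b) with (A.11): a gradient step on the regularizer,
    followed by a convex combination of the individual proximal operators.
    """
    f_bar = (1.0 - eta * lam) * f                       # (A.10a): f_bar = rho * f
    return sum(w * prox_abs(f_bar, y, eta)              # (A.10b) via (A.11)
               for w, y in zip(weights, window_y))

# Usage: track a drifting scalar target from a window of noisy samples.
rng = np.random.default_rng(0)
f = 0.0
for n in range(1, 101):
    target = np.sin(0.1 * n)
    window_y = target + 0.05 * rng.standard_normal(5)   # toy window of L = 5 samples
    weights = np.full(5, 1.0 / 5)                       # equal convex weights
    f = pa_fgd_step(f, window_y, weights, eta=0.5, lam=0.01)
```

The point is only that the update is a convex combination of the individual proximal maps applied to the regularizer-shrunk iterate, exactly as prescribed by (A.11).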
The remaining steps depend on the choice of the instantaneous loss. In particular,
since we are interested in quantized signals, an adequate functional instantaneous
loss must not penalize the function estimates that pass through the intervals. We
develop this reasoning further in Sec. A.3.2.
that contains all the functions in H passing through the ith quantization interval,
and use the metric distance functional to the ith hyperslab
Regarding the tuning routine of the convex weights in (A.3), recall that if the
set {i ∈ [l_n, n] : \bar f_n ∉ H_i} = ∅, any choice of convex weights incurs zero windowed
loss. If not, each convex weight is tuned as
\[
  \omega_i^{(n)}
  = \frac{d_i(\bar f_n)^m}{\sum_{j=l_n}^{n} d_j(\bar f_n)^m}
  = \frac{|\bar\beta_i^{(n)}|^m\, k(x_i, x_i)^{m/2}}{\sum_{j=l_n}^{n} |\bar\beta_j^{(n)}|^m\, k(x_j, x_j)^{m/2}},
  \tag{A.15}
\]
where \bar\beta_i^{(n)} comes from the metric projection map P_{H_i}(\bar f_n) and m is a user-predefined
non-negative real power. In this way, if m = 0, the convex weights are all equal. On
the other hand, as m tends to infinity, only the weight associated with the largest
distance is considered. Thus, the power m provides a flexible way to weigh more
heavily those windowed loss terms in which the intermediate update \bar f_n incurs
a larger loss.
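The weight-tuning rule (A.15) reduces to a few lines of code; in the sketch below the function name is hypothetical, and returning uniform weights when every window distance is zero is one possible handling of the case discussed above.

```python
import numpy as np

def convex_weights(distances, m):
    """Convex weights per (A.15) from the hyperslab distances d_i(f_bar_n).

    distances : array of d_i(f_bar_n) for the indices in the current window.
    m         : user-predefined non-negative real power.
    """
    d = np.asarray(distances, dtype=float)
    if np.all(d == 0.0):                       # f_bar_n lies in every hyperslab: any choice
        return np.full(d.size, 1.0 / d.size)   # incurs zero loss, so use uniform weights.
    powered = d ** m                           # m = 0 gives equal weights; large m emphasizes
    return powered / powered.sum()             # the terms with the largest distances.
```

For instance, convex_weights([0.0, 0.3, 0.1], m=2) returns (0, 0.9, 0.1), emphasizing the window term with the largest distance.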
Accordingly, from the proximal operator of the metric distance (Chapter 6, [89])
with parameter η, i.e.,
\[
  \operatorname{prox}_{\eta d_i}(\bar f_n) = \bar f_n + \min\!\left\{1,\, \frac{\eta}{d_i(\bar f_n)}\right\} \left(P_{H_i}(\bar f_n) - \bar f_n\right)
  \tag{A.16}
\]
and the proximal average decomposition in (A.11), we can rewrite the algorithm
step (A.10b) as
\[
  f_{n+1} = \bar f_n - \sum_{i=l_n}^{n} \omega_i^{(n)} \min\!\left\{1,\, \frac{\eta}{d_i(\bar f_n)}\right\} \bar\beta_i^{(n)}\, k(x_i, \cdot).
  \tag{A.17}
\]
Finally, assuming that the algorithm does not have access to any a priori infor-
mation when it encounters the first data sample, we can set f1 = 0. Then, from the
algorithm step (A.10a), substituting each function estimate by its kernel expansion,
i.e., f_n = \sum_{i=1}^{n-1} \alpha_i^{(n)} k(x_i, \cdot), and identifying terms in (A.17), we obtain the following
closed-form update rule for the non-parametric coefficients
\[
  \alpha_i^{(n+1)} =
  \begin{cases}
    \rho\,\alpha_i^{(n)} - \omega_i^{(n)} \Gamma_{\eta,i}^{(n)} & \text{if } i \in [1, n-1],\\
    -\,\omega_i^{(n)} \Gamma_{\eta,i}^{(n)} & \text{if } i = n,
  \end{cases}
  \tag{A.18}
\]
where \Gamma_{\eta,i}^{(n)} \triangleq \min\{|\bar\beta_i^{(n)}|,\, \eta\, k(x_i, x_i)^{-1/2}\}\,\operatorname{sign}(\bar\beta_i^{(n)}) if i ∈ [l_n, n] and equals zero otherwise.
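The closed-form update (A.18) translates almost directly into code. The sketch below is illustrative rather than a reference implementation: the hyperslab projection coefficient β̄_i^{(n)}, whose explicit expression is not reproduced in this excerpt, is assumed to take the standard form for the hyperslab {f ∈ H : f(x_i) ∈ [lower_i, upper_i]}, consistent with d_i(f̄_n) = |β̄_i^{(n)}| k(x_i, x_i)^{1/2} in (A.15); all function and variable names are hypothetical, and the truncation of Sec. A.3.2.1 is omitted.

```python
import numpy as np

def gaussian_kernel(x, t, sigma=3.0):
    return np.exp(-np.subtract.outer(np.asarray(x, float), np.asarray(t, float)) ** 2
                  / (2.0 * sigma ** 2))

def worm_step(alphas, centers, intervals, x_new, interval_new, L, eta, lam, m, sigma=3.0):
    """One WORM iteration on a new quantization interval (a sketch, no truncation).

    alphas, centers : current expansion f_n = sum_i alphas[i] * k(centers[i], .)
    intervals       : one (lower, upper) quantization interval per stored center
    """
    rho = 1.0 - eta * lam
    centers = list(centers) + [x_new]
    intervals = list(intervals) + [interval_new]
    alphas = [rho * a for a in alphas] + [0.0]            # (A.10a): f_bar_n = rho * f_n

    n = len(centers)
    win = list(range(max(0, n - L), n))                   # 0-indexed window [l_n, n]

    # Assumed hyperslab projection: beta_bar_i = (f_bar(x_i) - clip(f_bar(x_i))) / k(x_i, x_i),
    # consistent with d_i(f_bar) = |beta_bar_i| * sqrt(k(x_i, x_i)) used in (A.15).
    x_win = [centers[i] for i in win]
    f_bar_vals = gaussian_kernel(x_win, centers, sigma) @ np.asarray(alphas)
    k_diag = np.array([gaussian_kernel(x, x, sigma) for x in x_win], dtype=float)
    clipped = np.array([np.clip(v, *intervals[i]) for v, i in zip(f_bar_vals, win)])
    beta = (f_bar_vals - clipped) / k_diag
    dists = np.abs(beta) * np.sqrt(k_diag)

    # Convex weights via (A.15); uniform weights if every window interval is already met.
    weights = (np.full(len(win), 1.0 / len(win)) if np.all(dists == 0.0)
               else dists ** m / np.sum(dists ** m))

    # Coefficient update (A.18): Gamma = min{|beta_i|, eta * k(x_i, x_i)^(-1/2)} * sign(beta_i).
    gamma = np.minimum(np.abs(beta), eta / np.sqrt(k_diag)) * np.sign(beta)
    for j, i in enumerate(win):
        alphas[i] -= weights[j] * gamma[j]
    return alphas, centers, intervals
```

Starting from f_1 = 0 (empty lists) and calling worm_step once per streamed sample and quantization interval reproduces the recursion described above.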
A.3.2.1 Sparsification
The WORM algorithm, like many other kernel-based algorithms, suffers from the
curse of kernelization [82], i.e., unbounded linear growth in model size and update
time with the amount of data. For the considered application in Sec. A.3.2, a simple
complexity-control mechanism such as kernel series truncation allows preserving, to some
extent, both performance and theoretical tracking guarantees, as we show in
Secs. A.4 and A.5. Thus, given a user-defined truncation parameter τ ∈ N such
that τ > L, if the number of effective coefficients constituting the function estimate
f_n exceeds τ, we remove the oldest expansion term, i.e.,
\[
  e_n = \alpha_{n-\tau}^{(n)}\, k(x_{n-\tau}, \cdot),
  \tag{A.19}
\]
For the sake of notation, we omit the subscript H in inner products and norms
since the RKHS is clear by context. Considering the assumption in (A.20) and the
Hence, from the relation (A.22), the firm non-expansiveness of the proximal operator
[37], and the method step (A.10a) with truncation, we obtain the following
inequality
where the step (A.24a) comes after using the relation (A.23), the definition of
cumulative variation in (A.5), and renaming the cumulative truncation error
E_N ≜ \sum_{n=2}^{N} \|e_n\|. In step (A.24b), we rename the summation index and add the positive
term ρ\|f_N − f_N^*\| to the right-hand side of the inequality.
Regrouping the terms in (A.24) leads to
\[
  \sum_{n=1}^{N} \|f_n - f_n^*\| \le \frac{1}{1-\rho} \left( \|f_1 - f_1^*\| + C_N + \rho E_N \right)
  \tag{A.25}
\]
and substituting the relation obtained in (A.25) into the inequality (A.21) allows us
to upper-bound the dynamic regret as
\[
  \operatorname{Reg}_N \le \frac{G}{1-\rho} \left( \|f_1 - f_1^*\| + C_N + \rho E_N \right).
  \tag{A.26}
\]
[Figure A.1 plot omitted: average \sum_{i=q_n}^{n} d_i(f_n) versus n for Augmented NORMA, KAPSM, WORM, and Truncated WORM.]

Figure A.1: Average q-inconsistency of the sequence of function estimates \{f_n\}_{n=1}^{100} over 500 different quantized signals.
This result explicitly shows the trade-off between tracking accuracy and model
complexity [85]. In other words, without truncation, the dynamic regret reduces to
RegN ≤ O(1 + CN ), depending entirely on the environment. On the other hand, if
we control the complexity of the function estimates via any truncation strategy such
that the norm of the truncation error is upper-bounded by a positive constant, i.e.,
sup_{n \in [1,N]} \|e_n\| \le \delta, the dynamic regret reduces to Reg_N ≤ O(C_N + δN), leading to
a steady tracking error in well-behaved environments.
[Figure A.2 plot omitted: average squared norm \|f_n\|^2 versus n for Augmented NORMA, KAPSM, WORM, and Truncated WORM.]

Figure A.2: Average complexity of the sequence of function estimates \{f_n\}_{n=1}^{100} over 500 different quantized signals.
[Figure A.3 plot omitted: signal value versus time, showing the quantization intervals and the reconstructions by Augmented NORMA, KAPSM, WORM, and Truncated WORM.]

Figure A.3: Comparison of regression plots for the last function estimate f_100, over the last 45 data samples of a synthetically generated quantized signal.
The timestamps {x_n}_{n=1}^{100} are uniformly arranged. For the sake of illustration, we use a Gaussian
reproducing kernel, i.e., k(x, t) = exp(−(x − t)²/(2σ²)), with σ = 3. All four
algorithms use the same window length L = 10. The augmented NORMA, the
WORM algorithm, and its truncated version all use the same learning rate η = 1.5
and regularization parameter λ = 0.005. We restrict the expansion of the truncated
WORM function estimates to a maximum of 30 terms, i.e., τ = 30. Both versions
of WORM use the power m = 2. For the augmented NORMA, the instantaneous
loss terms within the nth window are equally weighted with the weight min{n, L}^{-1},
and \partial_f d_i(f_n) = \operatorname{sign}(\beta_i^{(n)})\, k(x_i, x_i)^{-1/2}\, k(x_i, \cdot) is used as a valid functional subgradient.
We also define the q-inconsistency, i.e., \sum_{i=q_n}^{n} d_i(f_n), with q_n = max{n − q + 1, 1}
and q = 20, and use the squared Hilbert norm, \|f_n\|_{\mathcal H}^2 = \sum_{i,j=\tau_n}^{n} \alpha_i^{(n)} \alpha_j^{(n)} k(x_i, x_j),
with τ_n = max{n − τ + 1, 1}, as performance metrics for the function estimates.
The first metric measures how far the function estimate is from falling into the last q
received quantization intervals. The second metric measures the function estimate
complexity.
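As a complement, here is a small Python sketch of how these two metrics can be computed from the stored coefficients and centers; the helper names are hypothetical, and the hyperslab distance d_i is evaluated in the same assumed form as before, since the paper's exact expression is not reproduced in this excerpt.

```python
import numpy as np

def gaussian_kernel(x, t, sigma=3.0):
    return np.exp(-np.subtract.outer(np.asarray(x, float), np.asarray(t, float)) ** 2
                  / (2.0 * sigma ** 2))

def q_inconsistency(alphas, centers, xs, intervals, n, q=20, sigma=3.0):
    # Sum of hyperslab distances d_i(f_n) over the last q received quantization intervals,
    # with the assumed distance d_i(f) = |f(x_i) - clip(f(x_i))| / sqrt(k(x_i, x_i)).
    q_n = max(n - q + 1, 1)
    idx = range(q_n - 1, n)                                   # 0-indexed samples q_n, ..., n
    vals = gaussian_kernel([xs[i] for i in idx], centers, sigma) @ np.asarray(alphas, float)
    total = 0.0
    for v, i in zip(vals, idx):
        lower, upper = intervals[i]
        k_ii = float(gaussian_kernel(xs[i], xs[i], sigma))    # equals 1 for the Gaussian kernel
        total += abs(v - np.clip(v, lower, upper)) / np.sqrt(k_ii)
    return total

def squared_norm(alphas, centers, sigma=3.0):
    # ||f_n||_H^2 = alpha^T K alpha over the currently active (possibly truncated) centers.
    a = np.asarray(alphas, float)
    return float(a @ gaussian_kernel(centers, centers, sigma) @ a)
```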
As shown in Fig. A.1 and Fig. A.2, there is a trade-off between q-inconsistency
and complexity, and the WORM algorithm successfully balances the two. As for
its truncated version, the same experimental results show that the complexity
can be successfully controlled at the expense of only a small loss in accuracy. Finally, Fig. A.3
shows a snapshot of the last function estimate f_100 for each algorithm.
A.6 Conclusion
In this paper, we propose a novel algorithm, WORM, for regression-based tracking
of quantized signals. We derive a theoretical dynamic regret bound for WORM that
provides tracking guarantees. Our experiments show that WORM achieves better
signal reconstruction, in terms of both consistency and smoothness, compared
to the state of the art.