Brian Karrer
briankarrer@fb.com
Facebook, Menlo Park, California, USA
Abstract
Motivated by the needs of online large-scale recommender systems, we specialize the decoupled extended Kalman filter (DEKF) to factorization models, including factorization machines, matrix and tensor factorization, and illustrate the effectiveness of the approach through numerical experiments on synthetic and on real-world data. Online learning of model parameters through the DEKF makes factorization models more broadly useful by (i) allowing for more flexible observations through the entire exponential family, (ii) modeling parameter drift, and (iii) producing parameter uncertainty estimates that can enable explore/exploit and other applications. We use different parameter dynamics than the standard DEKF, allowing parameter drift while encouraging reasonable values. We also present an alternate derivation of the extended Kalman filter and DEKF that highlights the role of the Fisher information matrix in the EKF.
Keywords: approximate online inference, Kalman filter, matrix factorization, factorization machines, explore-exploit.
1. Introduction
Today there are many examples of large-scale recommender systems that serve hundreds of millions
of people every day, e.g., Netflix, Spotify, or Amazon. The scale is such that computationally efficient
approaches are necessary. In addition, product preferences can change over time so ideal methods
should naturally learn and respond to such changes. Lastly, an individual may only try a service a
few times before deciding whether to continue to interact with it, e.g., see Gomez-Uribe and Hunt
(2015). Learning an individual's preferences quickly is then also important. One useful strategy to address this cold-start situation involves explore-exploit strategies that rely on uncertainty estimates.
The obvious need for computationally efficient methods to generate recommendations in situations
with time-varying preferences and that can enable explore-exploit strategies is the main motivation
for this work. The method we describe here extends matrix factorization, perhaps the most popular
approach to recommendations, and other factorization models to meet these requirements: it makes
these approaches online and dynamic, and provides uncertainty estimates to enable explore-exploit
strategies. In addition, we develop our method to naturally handle a wide range of data types (e.g.,
positive integers, binary outcomes, or continuous outcomes) by working directly with the exponential
family.
Our method utilizes Kalman filtering. The Kalman filter (KF) was initially introduced in Kalman
(1960) for state estimation in linear systems driven by Gaussian noise, and with observations that
depend linearly on the state and on additional Gaussian noise. The KF iteratively computes the
exact posterior of the state as new observations become available. Many variants of the KF have
since been developed and applied to a wide variety of models, e.g., for parameter learning. See
Simon (2006) and Haykin et al. (2001) for good overviews of Kalman filters; the latter is focused on
neural network applications.
Regression, matrix and tensor factorization, factorization machines, and many other statistical
models can be viewed as variations of a general model with exponential family observations. An
approximate Gaussian posterior of the parameters for this general model can be learned online,
even when the parameters drift over time, through a KF called the extended Kalman filter (EKF),
developed to handle non-Gaussian observations. However, maintaining a full covariance matrix
of the parameters, as prescribed by the EKF, can often be prohibitive in terms of memory and
computation. The decoupled EKF (DEKF) can alleviate this limitation.
The DEKF was introduced in Puskorius and Feldkamp (1991) to train neural networks. It
approximates the covariance matrix of the parameters in the EKF as block-diagonal. We will argue
that this approximation is particularly relevant for models with a large number of parameters, such as
factorization models, where only a relatively small subset of them is relevant to any given observation.
Developing and applying the DEKF to factorization models has not been done before, and is the
main contribution of this paper. Specifically, we assume that model parameters can be naturally
grouped into subsets we call entities,1 such that few entities are involved in each observation. E.g.,
in matrix factorization exactly two subsets of parameters define each observation, those for the user
and the item interacting, so we can let each user and each item correspond to an entity.
When a new observation arrives, we show the DEKF only requires updating the parameters of
entities involved in the new observation. This leads to a particularly efficient implementation of the
DEKF for factorization models. Because the DEKF produces a posterior distribution of the pa-
rameters, it also enables applications that require uncertainty estimates, e.g., where explore/exploit
trade-offs are important. For example, we show that the DEKF enables Thompson sampling in
factorization models.
The DEKF we present here is different from the standard DEKF in several ways. First, we
specialize it to exponential family models, motivated by models with typically few entities per
observation. Second, the standard DEKF was formulated for static parameters, or for parameters
that undergo a simple random walk. The latter choice can result in parameter values that become
too large and lead to badly behaved models. Here, we consider parameter dynamics that allow for
parameter drift while encouraging reasonable values. Modeling parameter drift can be desirable in
situations where the underlying data is non-stationary, as is often the case in recommender systems,
where user preferences and item popularities can change over time. To keep our paper self-contained,
we assume no familiarity with Kalman filtering.
The rest of this paper is organized as follows. Section 2 introduces the general model we study,
and describes several kinds of factorization models as special cases. Section 3 derives and describes
our DEKF for factorization models with exponential family observations. We then discuss connec-
tions of the EKF and DEKF to other related methods. Section 4 describes numerical results on
simulated and on real data, obtained from the application of our DEKF to a variety of models for
the tasks of prediction and of reward maximization (explore/exploit). Section 5 concludes with a
discussion about limitations, and suggests possible research directions.
1. These subsets are called nodes in the original DEKF paper, but we find entity more descriptive for factorization
models.
1. Initialize parameters.

   (a) Initialize reference vectors:
       $$r_i \sim \mathcal{N}(\pi_i, \Pi_i). \qquad (1)$$
   (b) Initialize entity parameters from the steady-state distribution of the dynamics:
       $$\xi_{i,0} \sim \mathcal{N}\left(r_i, (1 - \alpha_i^2)^{-1}\Omega_i\right). \qquad (2)$$

2. At each time step, evolve the parameters of each entity $i$ according to
   $$\xi_{i,t} = \alpha_i(\xi_{i,t-1} - r_i) + r_i + \omega_{i,t}, \qquad (3)$$
   where $\omega_{i,t} \sim \mathcal{N}(0, \Omega_i)$ and $\alpha_i \in [0, 1]$ is a memory parameter.

3. Generate each observation $y_t \in \mathbb{R}^d$ from an exponential family distribution with density
   $$p(y_t|\eta) = \exp\left(y_t'\Phi^{-1}\eta - b(\eta) + c(y_t)\right). \qquad (4)$$
Here, η ∈ Rd is the natural parameter of the distribution, a known function of ξi,t , and
in some cases, of context xt . E.g., xt can be the predictors in a regression model or
factorization machine, or the indices corresponding to the user and item involved in an
observation for matrix factorization. The natural parameter η is the connection between
the model parameters we seek to estimate and the observations. Also, Φ ∈ Rd×d is a
symmetric positive definite matrix that is a known nuisance parameter. The functions
b() and c() depend on the specific member of the exponential family chosen for the
model. Importantly, in a typical factorization model, very few entities are involved in
each observation, i.e., η is an explicit function of very few entities, e.g., one user and
one item in matrix factorization. The symbol $'$ denotes the vector or matrix transpose operation.
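To make the generative process concrete, the following minimal Python sketch simulates a single entity under Equations 1-3; the function, argument names, and specific numbers are illustrative, not from the paper:

```python
import numpy as np

def simulate_entity(pi, Pi, alpha, Omega, T, rng):
    """Sample one entity's trajectory from the generative model above:
    draw the reference vector (Eq. 1), initialize the entity parameters at
    the steady state of the dynamics (Eq. 2), then apply the mean-reverting
    dynamics (Eq. 3) for T steps."""
    r = rng.multivariate_normal(pi, Pi)                        # Eq. 1
    xi = rng.multivariate_normal(r, Omega / (1 - alpha ** 2))  # Eq. 2
    path = [xi]
    for _ in range(T):
        w = rng.multivariate_normal(np.zeros_like(r), Omega)   # drift noise
        xi = alpha * (xi - r) + r + w                          # Eq. 3
        path.append(xi)
    return r, np.array(path)

rng = np.random.default_rng(0)
r, path = simulate_entity(np.zeros(3), 0.1 * np.eye(3), alpha=0.999,
                          Omega=1e-4 * np.eye(3), T=100, rng=rng)
```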
As we will see, the model above includes regression and factorization models with static or dynamic
parameters. In factorization models, typically $d \ll k$. The observation model in Equation 4 is a
generalization of the Generalized Linear Model (or GLM, see Hastie, 2017; Nelder and Baker, 1972)
based on a moderately more complex mapping between the model parameters and the parameters of
the distribution that generates the observations in order to handle factorization models.3 Typically
the nuisance parameter is the identity matrix, though in linear regression with known covariance, Φ
is the covariance of the observations.
Our goal is to estimate the distribution of θt , or equivalently, of ξi,t for all entities i, given all
the observations up to time t in an online fashion. This estimation problem is generally analytically
intractable, but we obtain approximate algorithms through Kalman filtering. A Kalman filter (KF)
has two distinct steps that need to be performed in each timestep: a prediction of the parameters
in light of their dynamics, and an update step that incorporates the information from the latest
2. The exponential family is often defined using T (y) instead of y where T (y) indicates the vector of sufficient
statistics for an underlying vector of observations y. To avoid additional notation, we consider our observation
vector y to just be the vector of sufficient statistics.
3. If the mapping from parameters to signal was arbitrary, this would be the Generalized Non-linear Model, but
factorization models only require multi-linear maps.
observation. To simplify the exposition, we first consider the special case where the model parameters
are static, i.e., where Ωi = 0 and αi = 1 for all i, so Equation 3 simply becomes ξi,t = ξi,t−1 . The
reference vectors then only serve to initialize the model parameters through Equation 2, and the KF
only consists of an update step that depends strongly on the observation model. We consider the
more general model with dynamic parameters again in Section 3.4.
We denote the mean and covariance of yt given η by µy (η) and Σy (η), though we may omit the
dependence on η for improved readability. We also often omit the time subscript of y, θ, x, and
other time-varying quantities for similar reasons. For distributions in the form of Equation 4, it can
be shown that
$$\mu_y(\eta) = \Phi\,\frac{\partial b}{\partial \eta}' = h(\eta), \qquad (5)$$
$$\Sigma_y(\eta) = \Phi\,\frac{\partial^2 b}{\partial \eta^2}\,\Phi = \frac{\partial h}{\partial \eta}\,\Phi, \qquad (6)$$
where h(η), defined in the first equation, is called the response function. Throughout our paper, our
notation for vector and matrix derivatives is consistent with the notation of tensor calculus, which
often results in the transposed vectors and matrices of other notations. E.g., here $\frac{\partial b}{\partial \eta}$ is a row vector in $\mathbb{R}^d$, and $\frac{\partial \eta}{\partial \theta}$ is a $d$-by-$k$ matrix. To connect the observations to the model parameters, we assume
that η is a deterministic and possibly non-linear function of θ, with finite second derivatives. Often,
η is also a function of context denoted by x. It is typical and helpful to think of an intermediate and
simple function λ of θ and x that the natural parameter is a function of, i.e., η = η(λ(θ, x)). This
intermediate function λ is called the signal, and outputs values in Rd . To avoid notation clutter, we
suppress all dependencies on x. We will need to evaluate the mean and covariance of y for specific
values of θ ∈ Rk . Abusing notation for improved readability, we will write h(θ) and Σy (θ) instead
of h(η(θ)) and Σy (η(θ)) to denote the mean and covariance of y at a specific value of θ.
The model also needs an invertible function called the link function4 $g(\lambda)$ that maps the signal to $\mu_y = h(\eta)$, so $\eta = h^{-1}(g(\lambda))$. Depending on the family, $\mu_y$ can have a restricted range of values (e.g.
µy > 0), and for ease of exposition, we only consider link functions that obey these ranges without
restricting the signal. A particularly useful choice for the link function is the canonical link function
(g = h) that makes λ = η, and simplifies relevant mathematics. Because the specific distribution
within the exponential family determines h(η), different distributions have different canonical links.
We will write g(θ) rather than g(λ(θ)) for improved readability. To summarize, θ determines η, but
only through the signal λ. Then η determines the mean and covariance of y via Equations 5 and 6.
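As a quick illustration of Equations 5 and 6 (our own check, not from the paper), for a univariate Bernoulli with canonical link we have $b(\eta) = \log(1 + e^\eta)$ and $\Phi = 1$, so $h(\eta)$ is the sigmoid and $\Sigma_y(\eta) = h(\eta)(1 - h(\eta))$:

```python
import numpy as np

# Finite-difference check of Eqs. 5-6 for a univariate Bernoulli (Phi = 1):
# b(eta) = log(1 + e^eta), so mu_y = b'(eta) = sigmoid(eta) and
# Sigma_y = b''(eta) = sigmoid(eta) * (1 - sigmoid(eta)).
eta, eps = 0.3, 1e-4
b = lambda e: np.log1p(np.exp(e))
h = lambda e: 1.0 / (1.0 + np.exp(-e))
db = (b(eta + eps) - b(eta - eps)) / (2 * eps)                 # Eq. 5
d2b = (b(eta + eps) - 2 * b(eta) + b(eta - eps)) / eps ** 2    # Eq. 6
assert np.isclose(db, h(eta))
assert np.isclose(d2b, h(eta) * (1 - h(eta)), atol=1e-6)
```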
Table 1 summarizes for convenience the main notation introduced so far, as well as some symbols
that are introduced later.
Symbol Variable
y observation
x context
θ model parameters
λ(θ, x) signal
η(λ) natural parameter of y
Φ nuisance parameter of the observation
h(θ) response function; mean of y given θ
Σy (θ) covariance of y given θ
ξi parameters of entity i, with $\theta' = [\xi_1' \ldots \xi_n']$
e(θ) prediction error y − h(θ)
F(θ) Fisher information matrix
µ, Σ mean and covariance of θ
µi , Σi mean and covariance of ξi
l(y) log-likelihood of y
ωi,t noise driving the dynamics of entity i
Ωi covariance of ωi,t
αi memory of dynamics for entity i
ri reference vector of entity i
πi , Πi initial mean and covariance of ri
ρi , Pi current mean and covariance of ri
Ri current covariance between ri and ξi
Table 1: Notation.
dot-product of the user and item vectors involved in an observation. Sometimes user and item
bias terms are added to the signal too.
MF models typically assume that the observations are univariate Gaussian, or occasionally
Bernoulli, e.g., see Mnih and Salakhutdinov (2008) and Koren et al. (2009), so our setup
generalizes these models to observations in other exponential family distributions that can be
more natural for different kinds of data. In addition, applying the DEKF to these models
allows for user and item vector drift, and enables explore/exploit applications.
4. Factorization machines (FM). These models, introduced in Rendle (2010), typically have
univariate responses, and include univariate regression, MF, and tensor models as special cases.
Assume there are n entities, e.g., user or items that can be involved in any of the observations,
and let xi be non-zero only when entity i is involved in the observation, with x = [x1 . . . xn ]0 .
Let ξi be the parameters corresponding to entity i. In a factorization machine (FM) of order
2, $\xi_i' = [w_i \; v_i']$, where $w_i \in \mathbb{R}$ and $v_i \in \mathbb{R}^{a_2}$, with $a_2$ a positive integer, so $\xi_i \in \mathbb{R}^{a_2 + 1}$. Then the
signal becomes
$$\lambda = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} v_i' v_j\, x_i x_j. \qquad (7)$$
When x has exactly two non-zero entries set to 1, then Equation 7 becomes identical to the
signal in MF, with a user, item and a general bias term. Higher-order factorization machines
are described in Rendle (2010). FMs are learned via stochastic gradient descent, Markov Chain
Monte Carlo, or alternating least squares or coordinate ascent (Rendle, 2012). Our treatment
extends FMs beyond Bernoulli and Gaussian observations, allows for dynamic parameters, and
provides parameter uncertainty estimates.
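As an illustration, the order-2 FM signal in Equation 7 can be computed in $O(n a_2)$ time using the standard pairwise-interaction identity; this Python sketch (with our own variable names) also recovers the MF signal when $x$ has exactly two unit entries:

```python
import numpy as np

def fm_signal(w0, w, V, x):
    """Order-2 factorization machine signal (Equation 7). w0: global bias;
    w: (n,) linear weights; V: (n, a2) entity vectors, so entity i has
    parameters xi_i = [w_i, v_i]; x: (n,) context, mostly zeros."""
    # sum_{i<j} v_i'v_j x_i x_j
    #   = 0.5 * (||sum_i x_i v_i||^2 - sum_i x_i^2 ||v_i||^2)
    Vx = V.T @ x
    pairwise = 0.5 * (Vx @ Vx - (x ** 2) @ np.sum(V ** 2, axis=1))
    return w0 + w @ x + pairwise

# MF special case: two unit entries (user u, item m) give
# lambda = w0 + w_u + w_m + v_u' v_m.
```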
$$B(\theta) = \Phi^{-1}\Sigma_y(\theta)\Phi^{-1}\,\frac{\partial \eta}{\partial \theta}\,\Sigma\,\frac{\partial \eta}{\partial \theta}'.$$
Given a value of $\theta$, $B(\theta) \in \mathbb{R}^{d \times d}$. The mean and covariance of the approximate Gaussian posterior are then found via:
$$\mu_{\text{new}} = \mu + \Sigma\,\frac{\partial \eta}{\partial \theta}'\Big|_\mu \left[I + B(\mu)\right]^{-1}\Phi^{-1}\left(y - h(\mu)\right), \qquad (8)$$
$$\Sigma_{\text{new}} = \Sigma - \Sigma\,\frac{\partial \eta}{\partial \theta}'\Big|_\mu \left[I + B(\mu)\right]^{-1}\Phi^{-1}\Sigma_y(\mu)\Phi^{-1}\,\frac{\partial \eta}{\partial \theta}\Big|_\mu\,\Sigma. \qquad (9)$$
Here $\frac{\partial \eta}{\partial \theta}\big|_\mu$ denotes $\frac{\partial \eta}{\partial \theta}$ evaluated at $\theta = \mu$, and we use that notation elsewhere for some function evaluations. Note that the matrix in the square brackets above, whose inverse is needed, is only of size $d$-by-$d$. Also, we see that the update to the mean in Equation 8 is proportional to the error $e(\mu) = y - h(\mu)$. Applying these equations to a specific model requires specifying the distribution of the observation, and the link function, to determine $\Phi$, $\Sigma_y(\mu)$, $h(\mu)$, and $\frac{\partial \eta}{\partial \lambda}$. The latter is needed to compute $\frac{\partial \eta}{\partial \theta} = \frac{\partial \eta}{\partial \lambda}\frac{\partial \lambda}{\partial \theta}$. The last quantity, $\frac{\partial \lambda}{\partial \theta}$, comes from the specific model being used, e.g., regression, MF, etc.
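A minimal numpy sketch of this update step under the notation above (the helper and variable names are ours, not the paper's; the usage example assumes a Bernoulli observation with canonical link and a linear signal):

```python
import numpy as np

def ekf_update(mu, Sigma, y, h_mu, Sigma_y, J, Phi):
    """One EKF update (Equations 8 and 9). J is the Jacobian d(eta)/d(theta)
    evaluated at theta = mu (d x k); h_mu = h(mu); Sigma_y = Sigma_y(mu)."""
    Phi_inv = np.linalg.inv(Phi)
    A = Phi_inv @ Sigma_y @ Phi_inv               # d x d
    B = A @ J @ Sigma @ J.T                       # B(mu), d x d
    M = np.linalg.inv(np.eye(len(y)) + B)         # only a d x d inverse
    gain = Sigma @ J.T @ M                        # k x d
    mu_new = mu + gain @ Phi_inv @ (y - h_mu)     # Eq. 8
    Sigma_new = Sigma - gain @ A @ J @ Sigma      # Eq. 9
    return mu_new, Sigma_new

# Example: Bernoulli observation with canonical link (eta = lambda, Phi = 1)
# and linear signal lambda = x' theta, so J = d(eta)/d(theta) = x'.
mu, Sigma = np.zeros(3), np.eye(3)
x, y = np.array([1.0, -0.5, 2.0]), np.array([1.0])
p = 1.0 / (1.0 + np.exp(-x @ mu))                 # h(mu)
mu, Sigma = ekf_update(mu, Sigma, y, np.array([p]),
                       np.array([[p * (1 - p)]]), x[None, :], np.eye(1))
```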
A reader familiar with the extended Kalman filter may find it difficult to map the above ex-
pressions onto the standard EKF expressions. To clarify the relationship, we convert from the
exponential family’s canonical to mean parameterization. To do so, recall that µy = g(λ) = h(η).
Thus $\frac{\partial \eta}{\partial \theta} = \left(\frac{\partial h}{\partial \eta}\right)^{-1}\frac{\partial g}{\partial \theta}$, and applying Eq. 6, $\frac{\partial \eta}{\partial \theta} = \Phi\,\Sigma_y^{-1}\,\frac{\partial g}{\partial \theta}$. Inserting this relationship and simplifying
3.1.1 Derivation
A standard derivation of the EKF proceeds as follows: first, $y$ is approximated as a Gaussian according to $y \sim \mathcal{N}\left(h(\theta), \Sigma_y(\mu)\right)$. Notice that the variance is evaluated at the mean of the prior, while
the mean is allowed to depend on θ. To make the log-likelihood l(y) a quadratic function of θ, h(θ) is
approximated through a first-order Taylor expansion around µ. We present an alternative derivation
of the EKF update step for our general model that brings connections to other methods and statistical
concepts more directly. This derivation directly illustrates why the DEKF is particularly appropriate
for factorization models.
We start by approximating l(y) as a quadratic function of θ through a second-order Taylor
expansion about the prior mean µ. We then take the expectation of the corresponding Hessian over
the distribution of y given η to guarantee that the covariance matrix remains positive definite. Lastly,
we do some algebra to obtain the desired EKF equations. In the special case of Gaussian observations
and linear response function, the EKF approximations become equalities, and the update step of
the EKF is identical to that of the KF.
To start, we note that
$$\frac{\partial l(y)'}{\partial \theta} = \frac{\partial \eta}{\partial \theta}'\,\frac{\partial l(y)'}{\partial \eta} = \frac{\partial \eta}{\partial \theta}'\,\Phi^{-1}e(\theta),$$
where $\frac{\partial \eta}{\partial \theta} = \frac{\partial \eta}{\partial \lambda}\frac{\partial \lambda}{\partial \theta} \in \mathbb{R}^{d \times k}$ is the derivative of the natural parameter with respect to $\theta$. The (conditional) Fisher information matrix plays a prominent role in our derivation. It is given by
$$F(\theta) = E_{y|\theta}\!\left[\frac{\partial l(y)'}{\partial \theta}\,\frac{\partial l(y)}{\partial \theta}\right] = \frac{\partial \eta}{\partial \theta}'\,\Phi^{-1}\Sigma_y(\theta)\Phi^{-1}\,\frac{\partial \eta}{\partial \theta}, \qquad (12)$$
where the first equality is a definition, and the last equality is specific to our model assumptions.
We use the notation Ey|θ to emphasize that this expectation is over samples of y from the statistical
model with parameters θ.5
The Hessian of the log-likelihood is
$$\frac{\partial^2 l(y)}{\partial \theta^2} = \frac{\partial \eta}{\partial \theta}'\,\frac{\partial}{\partial \theta}\!\left[\Phi^{-1}e(\theta)\right] + \sum_{j=1}^{d} \frac{\partial^2 \eta_j}{\partial \theta^2}\left[\Phi^{-1}e(\theta)\right]_j = -F(\theta) + \sum_{j=1}^{d} \frac{\partial^2 \eta_j}{\partial \theta^2}\left[\Phi^{-1}e(\theta)\right]_j, \qquad (13)$$
an explicit function of the Fisher information matrix. Here, $\left[\Phi^{-1}e(\theta)\right]_j$ is just the $j$-th entry of the vector $\Phi^{-1}e(\theta)$. The first term in the last equation is a negative definite matrix. The second term
5. Recall that the natural parameter η, through the signal λ, can be a function of the context x that accompanied
the observation y. The true Fisher information matrix is hence an average over the unknown distribution of
contexts x and over the model distribution for y given x and θ. The above Fisher information is the conditional
Fisher information considered for a fixed context x.
is not necessarily negative definite, and we will see below that this could result in invalid covariance matrices that are not positive definite. To avoid this situation, in our second-order Taylor expansion, we will replace the Hessian $\frac{\partial^2 l(y)}{\partial \theta^2}$ in Equation 13 by its average over $y$ given $\eta$, i.e., by $-F(\theta)$. This is consistent with Equation 13, which uses $y$ only in the second term on the right, through $e(\theta)$, and the error averaged over $y$ given $\eta$ is zero.6
Combining these results we obtain our second-order approximation of the log-likelihood about the prior mean $\mu$:
$$l(y) \approx l(y, \mu) + \frac{\partial l(y)}{\partial \theta}\Big|_\mu (\theta - \mu) + \frac{1}{2}(\theta - \mu)'\, E_{y|\mu}\!\left[\frac{\partial^2 l(y)}{\partial \theta^2}\Big|_\mu\right](\theta - \mu)$$
$$= l(y, \mu) + e(\mu)'\Phi^{-1}\frac{\partial \eta}{\partial \theta}\Big|_\mu (\theta - \mu) - \frac{1}{2}(\theta - \mu)'\,F(\mu)\,(\theta - \mu).$$
Plugging this approximation into Bayes' rule, as well as writing the Gaussian prior of $\theta$, while dropping terms independent of $\theta$ yields
$$\log P(\theta|y) \propto -\frac{1}{2}(\theta - \mu)'\left[\Sigma^{-1} + F(\mu)\right](\theta - \mu) + e(\mu)'\Phi^{-1}\frac{\partial \eta}{\partial \theta}\Big|_\mu (\theta - \mu)$$
$$= -\frac{1}{2}\left(\theta - \mu - \delta\right)'\,\Sigma_{\text{new}}^{-1}\,\left(\theta - \mu - \delta\right), \qquad (15)$$
with
$$\Sigma_{\text{new}}^{-1} = \Sigma^{-1} + F(\mu), \qquad (16)$$
$$\delta = \Sigma_{\text{new}}\,\frac{\partial \eta}{\partial \theta}'\Big|_\mu\,\Phi^{-1}e(\mu). \qquad (17)$$
The last equality in Equation 15 is obtained by completing squares. The result shows that the
approximate posterior distribution is Gaussian with mean µ + δ and covariance Σnew .
The EKF covariance update then follows from applying the Woodbury identity (see Petersen
et al., 2008, sec. 3.2) to Equation 16, and some re-arrangement. Plugging the updated covariance
into Equation 17 yields the EKF mean update, also after some re-arrangement.
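The equivalence between the information-form update (Equations 16 and 17) and the Woodbury form (Equations 8 and 9) is easy to verify numerically; a small self-contained check of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2, 5
J = rng.standard_normal((d, k))                  # d(eta)/d(theta) at mu
Phi_inv = np.linalg.inv(np.diag(rng.uniform(0.5, 2.0, d)))   # Phi^-1
Sy = np.diag(rng.uniform(0.5, 2.0, d))           # Sigma_y(mu)
S = np.eye(k) + 0.1 * np.ones((k, k))            # prior covariance Sigma
e = rng.standard_normal(d)                       # error y - h(mu)
A = Phi_inv @ Sy @ Phi_inv

F = J.T @ A @ J                                  # Eq. 12
S_new = np.linalg.inv(np.linalg.inv(S) + F)      # Eq. 16
delta = S_new @ J.T @ Phi_inv @ e                # Eq. 17

M = np.linalg.inv(np.eye(d) + A @ J @ S @ J.T)   # [I + B(mu)]^-1
assert np.allclose(S_new, S - S @ J.T @ M @ A @ J @ S)   # Eq. 9
assert np.allclose(delta, S @ J.T @ M @ Phi_inv @ e)     # Eq. 8
```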
all the time. The parameters for entities that have not been involved in any observations can just
be appended into the set of parameters when the entity is first observed.
Consider the evaluation of Equations 8 and 9, both of which rely upon the computation of $\frac{\partial \eta}{\partial \theta}\Sigma$ and $\frac{\partial \eta}{\partial \theta}\Sigma\frac{\partial \eta}{\partial \theta}'$, evaluated at $\theta = \mu$. Without loss of generality, assume only the first $m$ entities are involved in the observation, so we have that
$$\frac{\partial \eta}{\partial \theta} = \left[\frac{\partial \eta}{\partial \xi_1} \;\ldots\; \frac{\partial \eta}{\partial \xi_m} \;\; 0\right],$$
where $0$ is a matrix with entries set to zero of dimensions $d \times (k - \sum_{i=1}^{m} k_i)$. Combined with the block-diagonal structure of $\Sigma$, this yields
$$\frac{\partial \eta}{\partial \theta}\Sigma = \left[\frac{\partial \eta}{\partial \xi_1}\Sigma_1 \;\ldots\; \frac{\partial \eta}{\partial \xi_m}\Sigma_m \;\; 0\right],$$
$$\frac{\partial \eta}{\partial \theta}\Sigma\frac{\partial \eta}{\partial \theta}' = \sum_{i=1}^{m} \frac{\partial \eta}{\partial \xi_i}\Sigma_i\frac{\partial \eta}{\partial \xi_i}',$$
where 0 is again defined to have the appropriate dimensions. Substituting the first of these equations
into the EKF update equations shows that only the m entities involved in the observation are
updated, whereas all others remain the same. Examining the terms, we see that only means and
covariances involved in the observation are used to compute the updates as well.
Evaluation of the expressions above at θ = µ leaves little extra work to compute the updated
parameters µnew and Σnew . The resulting EKF posterior covariance Σnew , however, is typically not
block-diagonal over the entities. Letting Σij,new denote the updated block for entities i and j in the
observation, we have that
$$\Sigma_{ij,\text{new}} = \Sigma_{ij} - \Sigma_i\,\frac{\partial \eta}{\partial \xi_i}'\Big|_\mu \left[I + B(\mu)\right]^{-1}\Phi^{-1}\Sigma_y(\mu)\Phi^{-1}\,\frac{\partial \eta}{\partial \xi_j}\Big|_\mu\,\Sigma_j.$$
The updated Σij,new will generally be non-zero for any pair of entities i and j involved in the observa-
tion, even when Σij = 0. To retain the desired block-diagonal covariance, the DEKF approximates
the posterior by zeroing out any off-diagonal covariance blocks. In practice, we simply never compute
off-diagonal blocks. This finishes the update step for the DEKF that reflects the new observation in
the parameter estimates. For models with static parameters, the DEKF only has an update step,
resulting in Algorithm 1. The memory storage is $O(k^2)$ and the computation per observation is $O(k^2 + d^3)$ for the EKF. The DEKF, in contrast, is $O(\sum_{i=1}^{n} k_i^2)$ for storage and $O(\sum_{i \in \xi_\lambda} k_i^2 + d^3)$ for computation per observation, where $\xi_\lambda$ are the indices of the $m$ entities involved in the observation.
The reduction in both memory storage and computation for factorization models where the number
of entities is large can thus be significant.
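For concreteness, a sketch of the per-observation DEKF update implied by these expressions (our own data layout, with dictionaries keyed by entity id; not code from the paper):

```python
import numpy as np

def dekf_update(mus, Sigmas, involved, jacobians, y, h_mu, Sigma_y, Phi):
    """DEKF update step: only the entities listed in `involved` change.
    jacobians[i] holds d(eta)/d(xi_i) at the current means (d x k_i).
    A sketch consistent with Equations 8-9 restricted to entity blocks."""
    Phi_inv = np.linalg.inv(Phi)
    A = Phi_inv @ Sigma_y @ Phi_inv
    # B(mu) reduces to a sum over the m involved entities.
    B = A @ sum(jacobians[i] @ Sigmas[i] @ jacobians[i].T for i in involved)
    M = np.linalg.inv(np.eye(len(y)) + B)
    err = Phi_inv @ (y - h_mu)
    for i in involved:
        gain = Sigmas[i] @ jacobians[i].T @ M             # k_i x d
        mus[i] = mus[i] + gain @ err
        # Diagonal block of Eq. 9; off-diagonal blocks are never formed.
        Sigmas[i] = Sigmas[i] - gain @ A @ jacobians[i] @ Sigmas[i]
```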
$$\Sigma_{t+1}^{-1} = \Sigma_t^{-1} + F(\mu_t, x_t), \qquad (18)$$
where $\Sigma_0^{-1}$ is the initial inverse prior covariance. For large enough $T$, the first term can become
irrelevant. The second term has a non-zero contribution from observation t only for matrix entries
corresponding to parameters involved in the observation, i.e., entry $i,j$ of $F(\mu_t, x_t)$ is non-zero only if the gradients of the natural parameter $\eta$ with respect to parameters $i$ and $j$ are both non-zero for observation $t$. So roughly speaking, the inverse covariance per observation for parameters $i$ and $j$ is proportional to the co-occurrence frequency of parameters $i$ and $j$ in observations. Similarly, the inverse covariance per observation for parameter $i$ is roughly proportional to the marginal frequency of parameter $i$'s involvement in an observation.
We suggest to group subsets of parameters that have high co-occurrence in F(µt , xt ) into entities.
The resulting entities will then correspond to blocks with substantially larger values in the inverse
covariance per observation, because the co-occurrence frequency of parameters belonging to different
entities is typically smaller than the within-entity frequency. This also indicates why a fully diagonal
approximation to the covariance may be worse than the DEKF block-diagonal approximation: off-
diagonal co-occurrence frequencies similar (or even equal) to the marginal frequencies would be
ignored in its inverse.
$$\theta_t = G_t\theta_{t-1} + u_t + \epsilon_t.$$
Here $\epsilon_t$ is additive Gaussian noise, and the dynamics matrix $G_t$ and the vector $u_t$ are known. In the EKF (and the original DEKF), the true dynamics are defined by non-linear functions that
are approximated through a first order Taylor expansion about the mean of the current posterior,
resulting in essentially the same linear dynamics above.
For our purposes, these dynamics are too general, since the parameters Gt and ut are typically
unknown in machine learning applications. We consider parameter dynamics here only as a means
to incorporate data non-stationarity. So we assume each entity i evolves independently of the others
according to Equation 3. There, the memory parameter αi provides a form of regularization towards
the reference vector ri .
Our motivation for adding reference vectors and the memory parameter αi is two-fold. First, if
αi = 1, the entity parameters undergo a random walk, and can accumulate a large covariance. In
MF models such a random walk often leads to user and item vectors that produce absurdly large
signals. In contrast, Equation 3 implies the steady-state distribution $\xi_i \sim \mathcal{N}\left(r_i, (1 - \alpha_i^2)^{-1}\Omega_i\right)$,
which we use to initialize the entity vectors (Equation 2). Second, these dynamics allow predicting
a reasonable mean, i.e., the reference vector, for entities that have not been observed in a long time.
This is particularly relevant for factorization models, where entities may be observed infrequently.
With parameter dynamics, θ includes both the reference and the entity vectors. We expand our
notation to let ρi denote the current mean of ri , Ri the covariance between ri and ξi , and Pi the
covariance matrix of ri . An entity now refers to both the subset of current model parameters, ξi ,
and its associated reference vector ri . The DEKF posterior maintains a block-diagonal covariance
over these augmented entities because Ri is generally non-zero for any entity i.
The update step in Algorithm 1 is still valid now that there are parameter dynamics, after replacing $\frac{\partial \eta}{\partial \xi_i}$ with a gradient with respect to the complete set of entity $i$'s parameters, both current and reference, i.e., $\left[\frac{\partial \eta}{\partial \xi_i}, \frac{\partial \eta}{\partial r_i}\right]$, and similarly replacing $\Sigma_i$. However, this replacement is inefficient since the gradient of the log-likelihood with respect to the reference vectors is always zero, because $\frac{\partial \eta}{\partial r_i} = 0$. Our full variant DEKF, in Algorithm 2, modifies the update step of Algorithm 1 to remove this inefficiency. In Algorithm 2, $\frac{\partial \eta}{\partial \xi_i}$ is just the gradient of $\eta$ with respect to the entity's current parameters $\xi_i$.
The main change in a Kalman filter when adding parameter dynamics is the presence of the
predict step. Because the DEKF update step only requires means and covariances for the entities
involved in the observation, we are only required to apply the predict step for those entities when
observed. In particular, the predict step can be applied immediately before the update step for the
set of entities in an observation. This is possible because our dynamics is completely independent
across entities. As opposed to laboriously maintaining a posterior over all parameters at time t, we
can just maintain a lazy posterior over each entity by recording only the most recent posterior for
each entity, and the last time that entity was updated. This is statistically identical to an inference
procedure that would update the posterior for all entities at every time step.
Consider a particular entity i. When we predict for this entity at time t, we first check whether
we already have a past mean and covariance for the parameters of this entity. If not, we assume the
current parameters are drawn from the steady-state distribution of the dynamics and set the means
and covariances to
$$\begin{bmatrix} \mu_i \\ \rho_i \end{bmatrix} = \begin{bmatrix} \pi_i \\ \pi_i \end{bmatrix},$$
and
$$\begin{bmatrix} \Sigma_i & R_i \\ R_i' & P_i \end{bmatrix} = \begin{bmatrix} \Pi_i + (1 - \alpha_i^2)^{-1}\Omega_i & \Pi_i \\ \Pi_i & \Pi_i \end{bmatrix}.$$
If entity i has a posterior that was last updated at time k, we can write down the entire dynamics
for the corresponding parameters between time k and the current time t, as
$$\xi_{i,t} = \alpha_i^{t-k}(\xi_{i,k} - r_i) + r_i + \sum_{r=0}^{t-k-1} \alpha_i^r\,\omega_{i,r+k+1},$$
which implies we can directly update the entity’s posterior at time k to the posterior at time t. For
the means, we have
$$\begin{bmatrix} \mu_{i,\text{new}} \\ \rho_{i,\text{new}} \end{bmatrix} = \begin{bmatrix} \alpha_i^{t-k}(\mu_i - \rho_i) + \rho_i \\ \rho_i \end{bmatrix}.$$
Because the predict step for entities can predict across any number of discrete time-steps with the
same computational cost, our particular choice of entity dynamics allows us to incorporate parameter
drift efficiently. We summarize the complete algorithm with the predict-update cycle in Algorithm 2.
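A sketch of this lazy predict step in Python (our own layout; the mean update is the one displayed above, while the covariance propagation is our derivation from applying the unrolled dynamics to the blocks $\Sigma_i$, $R_i$, and $P_i$):

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    mu: np.ndarray       # mean of xi_i
    rho: np.ndarray      # mean of r_i
    Sigma: np.ndarray    # Cov(xi_i)
    R: np.ndarray        # Cov(xi_i, r_i)
    P: np.ndarray        # Cov(r_i)
    last_update: Optional[int] = None

def lazy_predict(ent, t, alpha, Omega, pi, Pi):
    """Bring one entity's posterior from ent.last_update to time t."""
    if ent.last_update is None:      # unseen entity: use the steady state
        ent.mu, ent.rho = pi.copy(), pi.copy()
        ent.Sigma = Pi + Omega / (1 - alpha ** 2)
        ent.R, ent.P = Pi.copy(), Pi.copy()
    else:                            # xi_t = a*xi_k + (1-a)*r_i + noise
        a = alpha ** (t - ent.last_update)
        q = (1 - a ** 2) / (1 - alpha ** 2)   # accumulated noise scale
        ent.mu = a * (ent.mu - ent.rho) + ent.rho
        ent.Sigma = (a ** 2) * ent.Sigma + (1 - a) ** 2 * ent.P \
            + a * (1 - a) * (ent.R + ent.R.T) + q * Omega
        ent.R = a * ent.R + (1 - a) * ent.P   # rho and P are unchanged
    ent.last_update = t
```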
stochastic gradient Fisher scoring (SGFS) in Ahn et al. (2012) is somewhat similar to our algorithm, and resembles online Fisher scoring driven by Gaussian noise. However, compared to our algorithm, it is not specifically online, does not maintain a distribution over the parameters, and has not been developed for entities.
4. Numerical Results
We first apply the DEKF to simulated data, both with static and with dynamic parameters. For
simplicity of exposition, we define a single observation model, and couple it to the model parameters
through different signal definitions to obtain regression, matrix and tensor factorization models.
Consider a stream of univariate binary observations yt and context xt provided at time t. We
generate this stream according to the generative model described in Section 2. That is, we simulate
the entity dynamics over each time step explicitly, and sample an observation yt by randomly
selecting the entities involved, and possibly additional context. We learn the model parameters
from this stream of observations via the EKF and DEKF algorithms. When Algorithm 2, which takes
parameter dynamics into account, is applied for parameter learning, we use the true values for Ωi
and αi in it, i.e., we use the same values to generate the data and to learn the parameters. Similarly,
the same values for the entity priors πi and Πi are used for data generation and parameter learning.
We model the binary observations using the Bernoulli exponential family with the canonical link. With this choice, as in logistic regression, the probability of an observation is $p_y(\eta) = \frac{e^\eta}{1 + e^\eta}$. So $h(\mu) = p_y(\mu)$, and the variance $\sigma_y^2$ is $p_y(\mu)(1 - p_y(\mu))$. With the canonical link, $\frac{\partial \eta}{\partial \lambda} = 1$. Finally, the Bernoulli log-likelihood is $y\eta + \log(1 - p_y)$, so $\Phi = 1$. The prior covariance $\Pi_i$ for an arbitrary entity $i$ with $k_i$ entries was obtained as follows in every simulation. First, we construct a $k_i$ by $k_i$ matrix $U_1$ by sampling its entries independently from the uniform distribution in $[0, 1]$. Then, we obtain a positive definite matrix with non-negative entries by setting $U_2 = U_1 U_1'/k_i^2$. We choose the latter because we want the entries of the initial parameter vectors to have positive correlation. We finally re-scale the matrix to have reasonable magnitude via $\Pi_i = \frac{s_p}{u} U_2$, where $s_p$ is a typical value, to be specified later, that we want to achieve in the diagonal entries of the prior $\Pi_i$, and $u$ is the average of the diagonal entries of $U_2$.
Simulations and inference with dynamic parameters also require a description of the dynamics. The covariance $\Omega_i$ of the Gaussian noise that drives the dynamics for entity $i$ was obtained similarly to $\Pi_i$, but with a few differences to achieve both positive and negative covariance entries. The entries of $U_1$ are now sampled from a standard normal distribution, then $U_2 = U_1 U_1'/k_i^2$ as before, and finally $\Omega_i = \frac{s_d}{u} U_2$, where $s_d$ is a typical value of a parameter drift per observation we want to achieve, and where the normalizer $u$ is given by $(\det U_2)^{1/k_i}$. The memory parameter $\alpha_i$ for entity $i$ is set by first choosing the desired half life $t_i^h$ of the entity, i.e., the number of observations after which any difference from the reference vector should decay in half assuming simple geometric decay. Then, $\alpha_i = \exp\left(\log(0.5)/t_i^h\right)$; a sketch constructing these hyperparameters follows the model list below. The rest of the parameters for the different models we study are described next.
1. Regression. We considered a ‘sparse’ regression model roughly equivalent to matrix fac-
torization with known item vectors. We created 10 entities corresponding to 10 users with
ki = 30, and prior mean πi with all entries equal to −0.00405. In each observation only one
of these entities was randomly selected. We also created an additional entity with 50 en-
tries that appears in every observation and prior mean with all entries equal to −0.0068, to
model purely item-dependent effects (like the genre of an item); purely user-dependent or other
context-dependent effects like time of day or day of week can be modeled similarly. In each
observation, the signal is the dot product of a context vector with 80 entries and a vector that
concatenates the two entities in the observation. So equivalently, this can be thought of as
a regression model with 350 = 10 × 30 + 50 parameters, and where the context vector has
one ‘dense’ portion of 50 entries that is typically non-zero in all observations, and 10 ‘sparse’
14
DEKF for Factorization Models
portions with 30 entries each, only one of which is non-zero in each observation. All entities
used the same dynamics parameters: the half life was set to 10000, and the covariance scales
were set to sp = 0.005 and sd = 0.0028. A fixed set of 100 context vectors with 80 entries each
were constructed by sampling each entry independently from N (1, 1). In each observation,
one of these 100 context vectors is randomly selected, and combined with the 80-dimensional
parameter vector resulting from selecting one sparse entity and the dense entity, to produce
the signal λ for the observation.
2. Matrix factorization. We generated 10 user and 10 item entities, each with 10 entries, for a total of 200 parameters. We set the prior mean entries of $\pi_i$ to 0.2 for user entities, and to $-0.2$ for item entities (perturbed very slightly per simulation). The half life of all user and item entities was set to 10000, and their covariance scales to $s_p = 0.144$ and $s_d = 2.45\text{e-}5$.
3. Tensor factorization. We decomposed a multi-way array with four modes with dimensions $[3, 3, 4, 4]$, so the number of entities is 14. We used a rank 20 decomposition, so each entity vector was in $\mathbb{R}^{20}$. We set the entries of $\pi_i$ for all entities to $-0.405465$ (with small random perturbations per simulation). We set the half life of all entities to 10000, and the covariance scales to $s_p = 0.23$ and $s_d = 3.8\text{e-}5$.
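As referenced above, a sketch (our own code) constructing the simulation hyperparameters $\Pi_i$, $\Omega_i$, and $\alpha_i$ exactly as described in the two preceding paragraphs:

```python
import numpy as np

def make_prior_cov(k_i, s_p, rng):
    """Pi_i: positive definite with non-negative entries, rescaled so the
    diagonal entries equal s_p on average."""
    U1 = rng.uniform(0.0, 1.0, size=(k_i, k_i))
    U2 = U1 @ U1.T / k_i ** 2
    return (s_p / np.mean(np.diag(U2))) * U2

def make_drift_cov(k_i, s_d, rng):
    """Omega_i: mixed-sign entries, normalized by (det U2)^(1/k_i)."""
    U1 = rng.standard_normal((k_i, k_i))
    U2 = U1 @ U1.T / k_i ** 2
    return (s_d / np.linalg.det(U2) ** (1.0 / k_i)) * U2

def memory_from_half_life(t_h):
    """alpha_i = exp(log(0.5) / t_h): deviations from the reference vector
    halve after t_h observations."""
    return np.exp(np.log(0.5) / t_h)
```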
Figure 1: Parameter estimation. The solid lines show the cumulative average absolute error at iteration $t$: $\frac{1}{t}\sum_{i=1}^{t} |p_{\text{true},i} - p_{\text{predicted},i}|$. There is one observation per iteration. All lines are averages over 10 simulations.
three entities of the three remaining types to define the observation. Once a recommendation has
been made, the corresponding observation is sampled and the outcome recorded. To perform the
recommendation, the algorithm begins by applying the predict step for the means and covariances
of every entity in the set of contexts.9 We then generate a recommendation through Thompson
sampling (see Russo et al., 2018, for an overview)10 and apply the update step after receiving a new
observation yt for the recommendation from the underlying process.
We evaluate recommendation quality by measuring cumulative regret, the sum over recommen-
dations of the true probability for the best context minus the true probability for the chosen context.
We compare Thompson sampling against random recommendations as well as recommending the
context with the highest prediction h(µ), an approach that does not require posterior uncertainty.
E.g., for traditional MF this strategy recommends the item vector with the largest dot-product with
the user vector. To reduce the computational requirements, we changed some of the parameters of
our models. Specifically, for sparse regression, we now let the sparse entities have only 10 entries,
and the dense one 20. For TF, we reduced the number of entries per entity to 5. For MF, we slowed
down parameter dynamics by setting the half life of all entities to 100000.
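To make the recommendation loop concrete, here is a minimal sketch of the Thompson sampling step described above (see footnote 10); it assumes entity objects carrying posterior means and covariances, and a model-specific `signal_fn` mapping sampled parameters and a context to the signal $\lambda$. All names are ours:

```python
import numpy as np

def thompson_recommend(entities, contexts, signal_fn, rng):
    """Sample once from each candidate entity's posterior, then recommend
    the context whose sampled Bernoulli success probability h is largest."""
    sampled = {i: rng.multivariate_normal(e.mu, e.Sigma)
               for i, e in entities.items()}
    def sampled_prob(x):
        return 1.0 / (1.0 + np.exp(-signal_fn(sampled, x)))
    return max(contexts, key=sampled_prob)
```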
The left-hand and right-hand side of Figure 2 respectively show the cumulative regret for static
and dynamic parameters, normalized through division by the cumulative regret achieved by random
recommendations. We see that leveraging the uncertainty through Thompson sampling eventually
results in a substantially lower cumulative regret than using the (approximate) posterior mean for
static parameters, in all cases. We also see that the EKF and DEKF perform similarly when using
Thompson sampling, while the diagonal DEKF performs worse. For dynamic parameters, we do
not see much difference in the performance between the diagonal DEKF, the DEKF, and the EKF.
Thompson sampling eventually performs better than recommending based on the posterior mean
in these examples, just like with static parameters. However, such behavior is not general, and in
fact, the relative performance of Thompson sampling and posterior mean recommendation depends
strongly on the balance between the average information obtained from an observation and the
average information lost because of parameter drift. Thompson sampling can be inefficient because
of over-exploration in situations when the system changes over time faster than the observations
provide useful information to determine the optimal action, e.g., see section 8.2 in Russo et al.
(2018). As a simple example, Figure 3 shows what happens when we increase parameter drift in
our MF model by reducing the half life to 10000, and then further to 1000. The performance of
Thompson sampling worsens indeed, eventually making recommendations based on the posterior
mean a better strategy.
9. In a practical application, this predict-sample cycle would not need to be applied on every recommendation.
10. We generate one posterior sample θsampled from the joint posterior over every entity possibly in the observation,
and then select a valid set of sampled entities and context x to maximize h(θsampled , x).
Figure 2: Explore-exploit. Each line is the normalized cumulative regret at iteration $t$, defined as $\sum_{i=1}^{t}(p_i - q_i)$, divided by the same quantity for random recommendations, where $p_i$ and $q_i$ are the probability of the highest-probability context and the probability of the recommended context at time $i$, respectively. Figures show averages over 20 simulations.
(a) Dynamic matrix factorization with half life of 10000. (b) Dynamic matrix factorization with half life of 1000.
Figure 3: Explore-exploit with stronger parameter drift. Dynamic MF with half lives of 10000 and 1000, instead of 100000.
analyzing the data points chronologically. The resulting metrics are then not directly comparable
to other batch approaches, e.g., that have the benefit of having learned from data points across the
entire time range of the data sets.
Because most user-item pairs do not have ratings, we focus on prediction, and not explore-exploit.
Both data sets are too large for the EKF in models where each user and each item require their own
parameters. But the DEKF still allows us to learn a matrix factorization model where each user
and each item is an entity. Each user and movie vector is chosen to be ten-dimensional. We model
ratings in two ways, resulting in two different models: as observations from a Gaussian distribution
(Gaussian-MF) with standard deviation chosen to be a quarter-star, and as Bernoulli observations
(Bernoulli-MF) corresponding to whether the star rating was greater than or equal to four.
We consider four DEKF versions, depending on whether we treat each parameter entry as an
entity (i.e., a diagonal DEKF) or each user and item vectors as an entity (i.e., the standard DEKF),
and on whether we consider the parameters to be static or dynamic. We initialize each variant to
have the same priors, utilizing the mean and standard deviation of the observations to set prior
mean π and prior covariance scale sp . For dynamics, we use half-lives of one year and five years
for users and movies, respectively, for MovieLens-20M, and one year for both users and movies for
NetflixPrize. In contrast to our study based on simulated data, here we set Ω to be diagonal with
identical entries equal to sd , allowing for different values for users and items.11
4.3.1 MovieLens-20M
In Figure 4a, the cumulative average out-of-sample root-mean-square-error (RMSE) is shown as a
function of time for Gaussian observations. Vertical blue lines denote dates for significant changes
to the MovieLens platform from Table 1 in Harper and Konstan (2015). The effects of these changes
are clearly visible in the algorithm’s curves. In particular, the switch to include half-star ratings
in February 2003 substantially lowered RMSE. In this figure, the diagonal DEKF performs worst,
and modeling parameter dynamics was helpful. To interpret these RMSE values, we reference Table
1 from Strub et al. (2016) where the best supervised learning MovieLens-20M result had a batch
RMSE evaluated on a test set of 0.7652, and a rank 10 Bayesian probabilistic matrix factorization
(BPMF) achieved 0.8123. Differences between approaches were on the order of hundredths of RMSE,
11. We tried several values for these dynamic parameters, and chose the particular dynamic parameters via examining
predictions on the first five thousand observations, a minute fraction of the observations.
indicating that the observed differences between DEKF versions are substantial. Our best performing
algorithm achieves RMSE of 0.8082, comparable to the BPMF, despite our approach and RMSE
evaluation being online, as already stressed before. In Figure 4b, for the Bernoulli observations,
we show the cumulative average out-of-sample normalized cross-entropy (NE).12 Blue lines again
indicate substantial platform changes. The effects of platform changes are still visible, although less
so for the Bernoulli observations. Dynamics is again helpful, although surprisingly, here we observe
the diagonal DEKF performs best.
4.3.2 NetflixPrize
We repeat these calculations for NetflixPrize. In Figure 4c, we show the cumulative average out-of-
sample RMSE achieved as a function of time for the Gaussian observations. Similar to MovieLens-
20M, the diagonal version performs poorly compared to the DEKF, with a benefit from dynamics.
We compare against supervised learning results, where in Table 3 from Zheng et al. (2016), we
observe a range of test RMSE across algorithms from 0.803 to 0.874. Our best performing version
has RMSE 0.856, comfortably within this range, but achieved for online learning. Again for the
Bernoulli observations, in Figure 4d, we show the cumulative average out-of-sample normalized
cross-entropy (NE). Like MovieLens-20M, dynamics performed best for Bernoulli-MF, and diagonal
entities showed a slight edge.
4.3.3 Conclusions
Overall these results indicate that the DEKF produces reasonable predictions on real-world data
sets, and that incorporating dynamics can improve predictions. The best choice of entities for
prediction accuracy was inconsistent across the experiments, e.g., setting each parameter as an
independent entity was often superior for Bernoulli observations. This diagonal option is appealing
from a complexity standpoint, and for very large data sets, may be the only option available.
5. Discussion
We have specialized the EKF to a model with observations in the exponential family, which includes
the GLM, MF, TF and factorization machines. This treatment results in more flexible observation
models than are typically considered in these models. It also enables parameter dynamics to account
for data drift. In addition, the uncertainty around the estimates the EKF provides can enable
applications where uncertainty is necessary, such as explore/exploit. However, when the number
of parameters is large, as is often the case in modern applications, the memory and computation
requirements of the EKF can be prohibitive. To address this, we specialize the DEKF to our model.
We show that in both the EKF and the DEKF, only parameters involved in an observation need to
be updated, and develop an optimized version of the DEKF that is particularly well suited for the
kinds of models we consider, which are naturally defined to only involve a relatively small subset of
the parameters in each observation.
Of course, the EKF is an approximate inference algorithm and has been observed to sometimes
produce badly behaved parameter estimates when the response function is sufficiently non-linear and
the initial prior is not sufficiently well-specified. The DEKF inherits those problems, and examples
can be found by applying the DEKF to the Poisson distribution with the canonical link (h(η) = eη ),
which often displays enormous predictive errors for early iterations. Fortunately, these problems have known, simple solutions: one can strengthen the prior, add a learning rate to slow down the initial parameter updates, or utilize the iterated decoupled EKF (IDEKF), as described in Appendix A, instead of the DEKF. A safe default procedure for highly non-linear responses may
Figure 4: Static and dynamic DEKF, with diagonal and user / item entity choices, applied to
MovieLens-20M and NetflixPrize. Dynamic half-lives for users and items are included in each legend,
and expressed in years. Vertical blue lines in MovieLens-20M correspond to significant changes in
that platform detailed in Harper and Konstan (2015). Let π be the value of every entry of πi for
all entities. For (a), π = 0.5916, sp = 0.0924, sd,user = 1.3585e-9, and sd,movie = 2.717e-10. For
(b), π = 4.4721e-5, sp = 0.2133, sd,user = 7.8633e-9, and sd,movie = 1.5727e-9. For (c), π = 0.6003,
sp = 0.0916, sd = 2.8279e-9. For (d), π = 0.1630, sp = 0.1908, sd = 6.8433e-9.
be to start with the IDEKF and later switch to the DEKF, but as shown in Section 4, this was
unnecessary for our numerical results. A more serious problem occurs when the true posterior is
multi-modal and not well-approximated as a Gaussian. Like the EKF (and related methods), we
expect the DEKF to not perform well in this situation.
Our approach contains hyperparameters per entity given by π, Π, α, and Ω. The latter two
are only relevant in situations with dynamic parameters, while the first two are always relevant. In
specific applications, it is typically unclear a priori whether including dynamics (through $\alpha \neq 1.0$, and $\Omega \neq 0$) will result in more useful models. Indeed, our simulations suggest that the model
choices that better match the true data generation process, which is typically unknown, work best.
On the MovieLens-20M and NetflixPrize data sets, however, we observed that adding dynamics with
reasonable settings was helpful. One way to specify πi and Πi would be to analyze offline data about
the entity. An easier approach is to first specify the prior per entity type (e.g., all items are given
the same prior). Then, we recommend sampling entities from these priors (and possibly simulated
context if needed) for the signal, and then sampling observations. Reasonable entity priors should
produce a reasonable distribution of observations.
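For a Bernoulli MF model, this prior sanity check might look like the following sketch (our own code, with illustrative names):

```python
import numpy as np

def prior_predictive_check(pi_user, Pi_user, pi_item, Pi_item, n, rng):
    """Sample user and item vectors from their priors, form the dot-product
    signal, and report the spread of implied success probabilities."""
    u = rng.multivariate_normal(pi_user, Pi_user, size=n)
    v = rng.multivariate_normal(pi_item, Pi_item, size=n)
    probs = 1.0 / (1.0 + np.exp(-np.sum(u * v, axis=1)))
    return np.percentile(probs, [5, 50, 95])   # should all look reasonable
```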
When entities can be logically grouped into types, we can also drastically reduce the importance
of π and Π by warm-starting a new entity’s reference vector distribution based on similar entities
(e.g. other users for a new user). We can sample reference vectors from the current posterior of
similar entities, and use the empirical mean and covariance of those samples as the reference vector
prior for the new entity. Hence the hyperparameter priors would just be used for the initial entities
of each type, and afterwards the observed data becomes influential. We leave developing this idea
further to future work.
Specifying the dynamics is more difficult and likely problem-specific. As rough guidance, we
suggest that hyperparameters can again be shared across entities of the same type. Then the
memory can be intuitively set via considering the half-life of the dynamics. Finally, let Ω, specified
last, be a constant times the identity matrix. These constants can be roughly determined via
sampling reference and current vectors from the steady-state distribution, sampling observations for
the reference signal and steady-state signal, and measuring the typical change in observation due to
drift.
To summarize, this guidance involves setting these hyperparameters by considering answers to the
following questions. What is a typical reference observation? What is a typical reference deviation in
observation? How long until an entity’s parameters drift halfway back to their reference parameters
in expectation? What is a typical deviation in observation due to drift? This guidance is a starting
point, and analyzing a subset of data, perhaps repeatedly through cross-validation, could produce
a better initialization. Developing online solutions for fitting the hyperparameters is another area
of future work. Future research could also consider online Kalman-filter-like algorithms for models
that have latent variables, such as mixture and topic models.
Acknowledgments
CGU wants to thank Vijay Bharadwaj for suggesting the idea of extending the EKF to matrix
factorization models, and Danny Ferrante and Nico Stier for supporting this work. We also want to
thank Ben van Roy and Yann Ollivier for useful feedback.
very likely according to the posterior are relatively far from µ. This suggests improving the accuracy
of the EKF approximation by Taylor expanding about the MAP value of θ, i.e., about the most
likely value of θ according to the posterior. The iterated EKF (IEKF), described next, pursues this
strategy.
Consider approximating $l(y)$ about an arbitrary value $\gamma$, rather than about $\mu$:
$$l(y) \approx l(y, \gamma) + \frac{\partial l(y)}{\partial \theta}\Big|_\gamma (\theta - \gamma) + \frac{1}{2}(\theta - \gamma)'\,\frac{\partial^2 l(y)}{\partial \theta^2}\Big|_\gamma\,(\theta - \gamma).$$
Working through the rest of the EKF derivation in the same way as before results in the following update equations:
$$\Sigma_{\text{new}}^{-1} = \Sigma^{-1} - \frac{\partial^2 l(y)}{\partial \theta^2}\Big|_\gamma, \qquad (20)$$
$$\delta = \Sigma_{\text{new}}\left[\frac{\partial l(y)'}{\partial \theta}\Big|_\gamma - \frac{\partial^2 l(y)}{\partial \theta^2}\Big|_\gamma\,(\gamma - \mu)\right].$$
Note that the column vector that multiplies $\Sigma_{\text{new}}$ on the right to determine the mean update $\delta$ now has two terms, and the second term goes to zero when $\gamma = \mu$. Also note that Equation 20 may lead to a "covariance" that is not positive-definite. Using the Fisher information matrix, like the EKF does, instead of the Hessian, is one alternative, and results in the update
$$\Sigma_{\text{new}}^{-1} = \Sigma^{-1} + F(\gamma), \qquad (21)$$
$$\delta = \Sigma_{\text{new}}\left[\frac{\partial l(y)'}{\partial \theta}\Big|_\gamma + F(\gamma)(\gamma - \mu)\right]. \qquad (22)$$
Now consider the reference point $\gamma$ that is self-consistent, i.e., that results in $\delta = \gamma - \mu$. Under these circumstances, we get from Equation 22 that
$$\frac{\partial l(y)'}{\partial \theta}\Big|_\gamma - \Sigma^{-1}(\gamma - \mu) = 0,$$
where the left side is identical to the gradient of the log posterior evaluated at $\gamma$. Therefore, a self-consistent $\gamma$ is a stationary point of the log posterior. In particular, the MAP estimate of $\theta$ satisfies this equation. The IEKF computes a MAP estimate by iterating
$$\gamma_{\text{new}} = \gamma + s\left(\Sigma^{-1} + F(\gamma)\right)^{-1}\left[\frac{\partial l(y)'}{\partial \theta}\Big|_\gamma - \Sigma^{-1}(\gamma - \mu)\right],$$
initialized from $\gamma = \mu$, using a line-search with step size $s \in [0, 1]$ to ensure that the log posterior is increasing on each iteration (see Skoglund et al., 2015). Upon convergence, the updated mean is $\gamma$ and the updated covariance comes from Equation 21 evaluated at the converged $\gamma$.13
After applying the Woodbury identity and some re-arrangement, $\gamma_{\text{new}} - \gamma$ can be written for our exponential family models as
$$s\left[\mu - \gamma + \Sigma\,\frac{\partial \eta}{\partial \theta}'\Big|_\gamma \left(I + B(\gamma)\right)^{-1}\Phi^{-1}\left(y - h(\gamma) + \Sigma_y \Phi^{-1}\,\frac{\partial \eta}{\partial \theta}\Big|_\gamma\,(\gamma - \mu)\right)\right].$$
This equation can be evaluated similarly to Equation 8. The block-diagonal entity approximation
to the covariance still implies that only parameters associated with entities in an observation are
updated. So our computational machinery can also be directly adapted for an iterated decoupled
EKF.
13. Typically it is the second-to-last γ that is used for the covariance, and the relevant terms have already been
computed.
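A compact sketch of the resulting iterated update (the helpers `parts`, returning $(h(\gamma), \Sigma_y(\gamma), \partial\eta/\partial\theta|_\gamma, \Phi)$, and `log_post` are our own conventions, not the paper's):

```python
import numpy as np

def iekf_update(mu, Sigma, y, parts, log_post, iters=20):
    """Iterated EKF update. Each pass evaluates the damped step
    gamma_new - gamma displayed above, halving s until the log posterior
    increases, then forms the covariance at the final gamma (Eq. 21)."""
    gamma = mu.copy()
    for _ in range(iters):
        h, Sy, J, Phi = parts(gamma)
        Phi_inv = np.linalg.inv(Phi)
        A = Phi_inv @ Sy @ Phi_inv
        M = np.linalg.inv(np.eye(len(y)) + A @ J @ Sigma @ J.T)
        inner = Phi_inv @ (y - h + Sy @ Phi_inv @ J @ (gamma - mu))
        step = mu - gamma + Sigma @ J.T @ M @ inner
        s = 1.0
        while s > 1e-4 and log_post(gamma + s * step) < log_post(gamma):
            s *= 0.5                      # simple backtracking line search
        gamma = gamma + s * step
    h, Sy, J, Phi = parts(gamma)
    Phi_inv = np.linalg.inv(Phi)
    F = J.T @ Phi_inv @ Sy @ Phi_inv @ J
    return gamma, np.linalg.inv(np.linalg.inv(Sigma) + F)    # Eq. 21
```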
References
Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1771–1778, 2012.
Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Comput., 10(2):251–276,
February 1998. ISSN 0899-7667.
James Bennett and Stan Lanning. The Netflix prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
Carlos Gómez-Uribe. Online algorithms for parameter mean and variance estimation in dynamic regression. arXiv preprint arXiv:1605.05697, 2016.
Carlos A Gomez-Uribe and Neil Hunt. The netflix recommender system: Algorithms, business
value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):
1–19, 2015.
Roger Grosse and Ruslan Salakhudinov. Scaling up natural gradient by sparsely factorizing the
inverse fisher matrix. In International Conference on Machine Learning, pages 2304–2313, 2015.
F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM
Trans. Interact. Intell. Syst., 5(4):19:1–19:19, December 2015. ISSN 2160-6455.
Trevor J Hastie. Generalized additive models. In Statistical models in S, pages 249–307. Routledge,
2017.
Simon S Haykin et al. Kalman filtering and neural networks. Wiley Online Library, 2001.
Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of basic
Engineering, 82(1):35–45, 1960.
Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):
455–500, August 2009. ISSN 0036-1445.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender
systems. Computer, 42(8), 2009.
Takio Kurita. Iterative weighted least squares algorithms for neural networks classifiers. New
Generation Computing, 12(4):375–394, Sep 1994. ISSN 1882-7055.
James Martens. New insights and perspectives on the natural gradient method. arXiv preprint
arXiv:1412.1193, 2014.
James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate
curvature. In International conference on machine learning, pages 2408–2417, 2015.
Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in neural
information processing systems, pages 1257–1264, 2008.
John Ashworth Nelder and R Jacob Baker. Generalized linear models. Wiley Online Library, 1972.
Yann Ollivier et al. Online natural gradient as a kalman filter. Electronic Journal of Statistics, 12
(2):2930–2961, 2018.
Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.
Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University
of Denmark, 7(15):510, 2008.
Gintaras V Puskorius and Lee A Feldkamp. Decoupled extended kalman filter training of feedforward layered networks. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 1, pages 771–777. IEEE, 1991.
Steffen Rendle. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 995–1000. IEEE, 2010.
Steffen Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and
Technology (TIST), 3(3):57, 2012.
Nicolas L Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gra-
dient algorithm. In Advances in neural information processing systems, pages 849–856, 2008.
Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
Dan Simon. Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley
& Sons, 2006.