A Survey of Reinforcement Learning Algorithms

Sindhu P R is with the Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka, 560012, India. E-mail: sindhupr@iisc.ac.in.

Abstract—Reinforcement learning (RL) algorithms find applications in inventory control, recommender systems, vehicular traffic management, cloud computing and robotics. The real-world complications of many tasks arising in these domains make them difficult to solve with the basic assumptions underlying classical RL algorithms. RL agents in these applications often need to react and adapt to changing operating conditions. A significant part of research on single-agent RL techniques focuses on developing algorithms for the case when the underlying assumption of a stationary environment model is relaxed. This paper provides a survey of RL methods developed for handling dynamically varying environment models. The goal of methods not limited by the stationarity assumption is to help autonomous agents adapt to varying operating conditions. This is possible either by minimizing the rewards lost while the RL agent learns, or by finding a suitable policy for the RL agent which leads to efficient operation of the underlying system. A representative collection of these algorithms is discussed in detail in this work, along with their categorization and their relative merits and demerits. Additionally, we also review works which are tailored to application domains. Finally, we discuss future enhancements for this field.

Index Terms—Reinforcement learning, Sequential Decision-Making, Non-Stationary Environments, Markov Decision Processes, Regret Computation, Meta-learning, Context Detection.

I. INTRODUCTION

The resurgence of artificial intelligence (AI) and advancements in it have led to the automation of physical and cyber-physical systems [1], cloud computing [2], communication networks [3], robotics [4], etc. Intelligent automation through AI requires that these systems be controlled by smart autonomous agents with the least manual intervention. Many of the tasks in the above listed applications are of a sequential decision-making nature, in the sense that the autonomous agent monitors the state of the system and decides on an action for that state. This action, when exercised on the system, changes the state of the system. Further, in the new state, the agent again needs to choose an action (or control). This repeated interaction between the autonomous agent and the system is sequential, and the change in the state of the system depends on the action chosen. However, this change is uncertain and the future state of the system cannot be predicted. For example, a recommender system [5] controlled by an autonomous agent seeks to predict the "rating" or "preference" of users for commercial items/movies. Based on the prediction, it recommends items-to-buy/videos-to-watch to the user. Recommender systems are popular on online stores, video-on-demand service providers, etc. In a recommender application, the state is the current genre of videos watched or books purchased, and the agent decides on the set of items to be recommended to the user. Based on this, the user chooses the recommended content or just ignores it. After ignoring the recommendation, the user may go ahead and browse some more content. In this manner, the state evolves, and every action chosen by the agent captures additional information about the user.

It is important to understand that there must be a feedback mechanism which recognizes when the autonomous agent has chosen the right action. Only then can the autonomous agent learn to select the right actions. This is achieved through a reward (or cost) function which ranks an action selected in a particular state of the system. Since the agent's interaction with the system (or environment) produces a sequence of actions, this sequence is also ranked by a pre-fixed performance criterion. Such a criterion is usually a function of the rewards (or costs) obtained throughout the interaction. The goal of the autonomous agent is to find a sequence of actions for every initial state of the system such that this performance criterion is optimized in an average sense. Reinforcement learning (RL) [6] algorithms provide a mathematical framework for sequential decision making by autonomous agents.

In this paper, we consider an important challenge in developing autonomous agents for real-life applications [7]. This challenge is concerned with the scenario when the environment undergoes changes. Such changes necessitate that the autonomous agent continually track the environment characteristics and adapt/change the learnt actions in order to ensure efficient system operation. For example, consider a vehicular traffic signal junction managed by an autonomous agent. This is an example of an intelligent transportation system, wherein the agent selects the green signal duration for every lane. The traffic inflow rate on the lanes varies according to the time of day, special events in a city, etc. If we consider the lane occupation levels as the state, then the lane occupation levels are influenced by the traffic inflow rate as well as the number of vehicles allowed to clear the junction based on the green signal duration. Thus, based on the traffic inflow rate, some particular lane occupation levels will be more probable. If this inflow rate varies, some other set of lane occupation levels will become more probable. Thus, as this rate varies, so does the state evolution distribution. It is important that under such conditions the agent selects an appropriate green signal duration based on the traffic pattern, and it must be adaptive enough to change the selection based on varying traffic conditions.

Formally, the system or environment is characterized by a model or context. The model or context comprises the state evolution probability distribution and the reward function:
the first component models the uncertainty in state evolution, while the second component helps the agent learn the right sequence of actions. The problem of varying environments implies that the environment context changes with time. This is illustrated in Fig. 1, where the environment model chooses the reward and next state based on the current "active" context i, 1 ≤ i ≤ n. More formal notation is described in Section II.

Fig. 1: Reinforcement learning with dynamically varying environments. The environment is modeled as a set of contexts, and the evolution of these contexts is indicated by blue arrows. At time t, the current state is s_t; the RL agent's action a_t changes the state to s_{t+1} and the reward r_t is generated.

A. Contribution and Related Work

This paper provides a detailed discussion of the reinforcement learning techniques for tackling dynamically changing environment contexts in a system. The focus is on a single autonomous RL agent learning a sequence of actions for controlling such a system. Additionally, we provide an overview of the challenges and benefits of developing new algorithms for dynamically changing environments. The benefit of such an endeavour is highlighted in the application domains where the effect of varying environments is clearly observed. We identify directions for future research as well.

Many streams of work in the current RL literature attempt to solve the same underlying problem: that of learning policies which ensure proper and efficient system operation in the case of dynamically varying environments. This problem will be formally defined in Section III. However, here we give an overview of the following streams of work: continual learning and meta-learning. In Section V, a detailed analysis of prior works in these streams is provided.

• Continual learning: Continual learning [8] is the ability of a model to learn continually from a stream of data, building on what was learnt previously, as well as being able to remember previously seen tasks. The desired properties of a continual learning algorithm are that it must be able to learn at every moment, transfer learning from previously seen data/tasks to new tasks, and be resistant to catastrophic forgetting. Catastrophic forgetting refers to the loss of knowledge acquired from older data/tasks when the model is trained on new data.
• Meta-learning: Meta-learning involves reusing the experience and skills gained from earlier tasks in order to learn new skills.

B. Overview

The remainder of the paper is organized as follows. Section II presents the basic mathematical foundation for modelling a sequential decision-making problem in the Markov decision process (MDP) framework. It also briefly states the assumptions which are the building blocks of RL algorithms. In Section III, we formally introduce the problem and provide a rigorous problem statement and the associated notation. Section IV describes the benefits of developing algorithms for dynamically varying environments. It also identifies challenges that lie in this pursuit. Section V describes the solution approaches proposed till now for the problem described in Section III. This section discusses two prominent categories of prior works. Section VI discusses relevant works in continual and meta learning. In both Sections V and VI, we identify the strengths of the different works as well as the aspects that they do not address. Section VII gives a brief overview of application domains which have been specifically targeted by some authors. Section VIII concludes the work and elaborates on possible future enhancements with respect to the prior work. Additionally, it also describes challenges that research in this area should address.

II. PRELIMINARIES

Reinforcement learning (RL) algorithms are based on a stochastic modelling framework known as the Markov decision process (MDP) [11], [12]. In this section, we describe the MDP framework in detail.

A. Markov Decision Process: A Stochastic Model

An MDP is formally defined as a tuple M = ⟨S, A, P, R⟩, where S is the set of states of the system and A is the set of actions (or decisions). P : S × A → P(S) is the conditional transition probability function. Here, P(S) is the set of probability distributions over the state space S. The transition function P models the uncertainty in the evolution of the states of the system based on the action exercised by the agent. Given the current state s and the action a, the system evolves to the next state according to the probability distribution P(·|s, a) over the set S. At every state, the agent selects a feasible action for every decision epoch. The decision horizon is determined by the
number of decision epochs. If the number of decision epochs is finite (or infinite), the stochastic process is referred to as a finite (or infinite)-horizon MDP, respectively. R : S × A → ℝ is the reward (or cost) function which helps the agent learn. The environment context comprises the transition probability and reward functions. If environments vary, they share the state and action spaces but differ only in these functions.
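To make the notation concrete, the following is a small illustrative sketch (ours, not from the survey) of a finite MDP in Python; the sizes of S and A, the Dirichlet-sampled kernel P and the uniform rewards R are arbitrary assumptions.

```python
import numpy as np

# A small illustrative MDP: 3 states and 2 actions (all values below are assumptions).
S, A = 3, 2
rng = np.random.default_rng(0)

# P[s, a] is a probability distribution over next states, i.e., P(.|s, a).
P = rng.dirichlet(np.ones(S), size=(S, A))          # shape (S, A, S)
# R[s, a] is a bounded reward for taking action a in state s (cf. Assumption 1 below).
R = rng.uniform(-1.0, 1.0, size=(S, A))

def step(s, a):
    """Sample s' ~ P(.|s, a) and return it together with the reward R(s, a)."""
    s_next = rng.choice(S, p=P[s, a])
    return int(s_next), R[s, a]

s = 0
s, r = step(s, a=1)   # one decision epoch of the agent-environment interaction
```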
B. Decision Rules and Policies

The evolution of states, based on the actions selected by the agent until time t, is captured by the "history" variable h_t. This is an element of the set H_t, which is the set of all plausible histories up to time t. Thus, H_t = {h_t = (s_0, a_0, s_1, a_1, . . . , s_t) : s_i ∈ S, a_i ∈ A, 0 ≤ i ≤ t}. The sequence of decisions taken by the agent is referred to as a policy, wherein a policy is comprised of decision rules. A randomized, history-dependent decision rule at time t is defined as u_t : H_t → P(A), where P(A) is the set of all probability distributions on A. Given u_t, the next action at the current state s_t is picked by sampling an action from the probability distribution u_t(h_t). If this probability distribution is a degenerate distribution, then the decision rule is called a deterministic decision rule. Additionally, if the decision rule does not vary with time t, we refer to the rule as a stationary decision rule. A decision rule at time t that depends only on the current state s_t is known as a state-dependent decision rule and is denoted as d_t : S → P(A). A deterministic, state-dependent and stationary decision rule is denoted as d : S → A. Such a rule maps a state to a feasible action. When the agent learns to make decisions, it basically learns the appropriate decision rule for every decision epoch. A policy is formally defined as a sequence of decision rules. The type of a policy depends on the common type of its constituent decision rules.

C. Value Function: Performance Measure of a Policy

Each policy is assigned a "score" based on a pre-fixed performance criterion (as explained in Section I). For ease of exposition, we consider state-dependent deterministic decision rules only. For a finite-horizon MDP with horizon T, the often used performance measure is the expected total reward criterion. Let π : S → A be a deterministic policy such that for a state s, π(s) = (d_1(s), . . . , d_T(s)), ∀s ∈ S. The value function of a state s with respect to this policy is defined as follows:

$$V^{\pi}(s) = E\left[\sum_{t=0}^{T} R(s_t, d_t(s_t)) \,\Big|\, s_0 = s\right], \qquad (1)$$

where the expectation is with respect to all sample paths under policy π. A policy π* is optimal w.r.t. the expected total reward criterion if it maximizes (1) for all states and over all policies.

For infinite-horizon MDPs, the often used performance measures are the expected sum of discounted rewards of a policy and the average reward per step of a policy. Under the expected sum of discounted rewards criterion, the value function of a state s under a given policy π = (d_1, d_2, . . .) is defined as follows:

$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, d_t(s_t)) \,\Big|\, s_0 = s\right]. \qquad (2)$$

Here, 0 ≤ γ < 1 is the discount factor; it measures the current value of a unit reward that is received one epoch in the future. A policy π* is optimal w.r.t. this criterion if it maximizes (2). Under the average reward per step criterion, the value function of a state s under a given policy π = (d_1, d_2, . . .) is defined as follows (if it exists):

$$V^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N}\, E\left[\sum_{t=0}^{N-1} R(s_t, d_t(s_t)) \,\Big|\, s_0 = s\right]. \qquad (3)$$

The goal of the autonomous agent (as explained in Section I) is to find a policy π* such that either (2) or (3) is maximized in the case of an infinite horizon, or (1) in the case of a finite horizon, for all s ∈ S.
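As a worked illustration of criterion (2), the sketch below (assuming the toy MDP and the `step` function from the previous snippet) estimates V^π(s) for a fixed stationary deterministic policy by Monte Carlo rollouts; the rollout length and episode count are arbitrary choices of ours.

```python
def mc_value(policy, s0, gamma=0.9, episodes=2000, horizon=200):
    """Monte Carlo estimate of the discounted value (2) of a stationary deterministic policy.

    `policy` maps a state to an action, i.e., a decision rule d : S -> A applied at every epoch.
    """
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):        # truncate the infinite sum; gamma**200 is negligible
            a = policy(s)
            s, r = step(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

print(mc_value(lambda s: 0, s0=0))      # value of the "always play action 0" policy at state 0
```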
D. Algorithms and their Assumptions

RL algorithms are developed with basic underlying assumptions on the transition probability and reward functions. Such assumptions are necessary since RL algorithms are instances of stochastic approximation [13] algorithms. Convergence of RL algorithms to the optimal value functions holds when the following assumptions are satisfied.

Assumption 1: |R(s, a)| < B < ∞, ∀a ∈ A, ∀s ∈ S.

Assumption 2: Stationary P and R, i.e., the functions P and R do not vary over time.

Assumption 1 states that the reward values are bounded. Assumption 2 implies that the transition probability and reward functions do not vary with time.

We focus on both model-based and model-free RL algorithms in this survey. Model-based RL algorithms learn optimal policies and optimal value functions by estimating P and R from state and reward samples. Model-free algorithms do not estimate the P and R functions. Instead, they either find the value function of a policy and then improve it, or directly find the optimal value function. RL algorithms utilize function approximation to approximate either the value function of a policy or the optimal value function. Function approximation is also utilized in the policy space. Deep neural network architectures are also a form of function approximation for RL algorithms [14].
where the expectation is w.r.t all sample paths under policy π. section, we formally describe the problem of non-stationary
A policy π ∗ is optimal w.r.t the expected total reward criterion environments using the notation defined in this section. Addi-
if it maximizes (1) for all states and over all policies. tionally, we also highlight the performance criterion commonly
For infinite horizon MDP, the often used performance mea- used in prior works for addressing learning capabilities in
sures are the expected sum of discounted rewards of a policy dynamically varying environments.
and the average reward per step for a given policy. Under
the expected sum of discounted rewards criterion, the value III. P ROBLEM F ORMULATION
function of a state s under a given policy π = (d1 , d2 , . . .) is
In this section, we formulate the problem of learning opti-
defined as follows:
"∞ # mal policies in non-stationary RL environments and introduce
π
X
t the notation that will be used in the rest of the paper. Since
V (s) = E γ R(st , dt (st ))|s0 = s . (2)
t=0
the basic stochastic modeling framework of RL is MDP, we
will describe the problem using the notation introduced in Section II.

We define a family of MDPs as {M_k}_{k∈ℕ+}, where M_k = ⟨S, A, P_k, R_k⟩. Here S and A are the state and action spaces, while P_k is the conditional transition probability kernel and R_k is the reward function of MDP M_k. The autonomous RL agent observes a sequence of states {s_t}_{t≥0}, where s_t ∈ S. For each state, an action a_t is chosen based on a policy. For each pair (s_t, a_t), the next state s_{t+1} is observed according to the distribution P_k(·|s_t, a_t) and the reward R_k(s_t, a_t) is obtained. Here 0 < k ≤ t. Note that, when Assumption 2 is true, P_k(·|s_t, a_t) = P(·|s_t, a_t), ∀k ∈ ℕ+ (as in Section II). The RL agent must learn optimal behaviour when the system is modeled as a family of MDPs {M_k}_{k∈ℕ+}.

The decision epoch at which the environment model/context changes is known as a changepoint, and we denote the set of changepoints using the notation {T_i}_{i≥1}, which is an increasing sequence of random integers. Thus, for example, at time T_1 the environment model will change from, say, M_{k_0} to M_{k_1}; at T_2 it will change from M_{k_1} to, say, M_{k_2}, and so on. With respect to these model changes, the non-stationary dynamics for t ≥ 0 will be

$$P(s_{t+1} = s' \mid s_t = s, a_t = a) = \begin{cases} P_{k_0}(s' \mid s, a), & t < T_1, \\ P_{k_1}(s' \mid s, a), & T_1 \le t < T_2, \\ \;\vdots \end{cases}$$

The extreme cases of the above formulation occur when either T_{i+1} = T_i + 1, ∀i ≥ 1, or T_1 = ∞. The former represents a scenario where the model dynamics change in every epoch. The latter is the stationary case. Thus, the above formulation is a generalization of MDPs as defined in Section II. Depending on the decision-making horizon, the number of such changes will be either finite or infinite.
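The family-of-MDPs formulation can be mimicked directly in simulation. The sketch below (assuming the toy state and action spaces from Section II and two hand-built contexts with fixed changepoints) switches the active transition kernel and reward function at the changepoints T_i.

```python
# Two illustrative contexts (MDPs M_1, M_2) sharing the state and action spaces of Section II.
P1, R1 = P, R
P2 = rng.dirichlet(np.ones(S), size=(S, A))
R2 = rng.uniform(-1.0, 1.0, size=(S, A))
contexts = [(P1, R1), (P2, R2)]
changepoints = [400, 900]                  # T_1, T_2: epochs at which the active context switches

def active_context(t):
    """Index k of the MDP M_k governing the dynamics at epoch t (alternating pattern assumed)."""
    return sum(t >= T for T in changepoints) % len(contexts)

def ns_step(s, a, t):
    """One transition of the non-stationary process: sample from the active P_k(.|s, a)."""
    Pk, Rk = contexts[active_context(t)]
    return int(rng.choice(S, p=Pk[s, a])), Rk[s, a]
```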
With changes in context, the performance criterion differs, but (1)-(3) give some hints as to what it can be. Additionally, since Assumption 2 does not hold, it is natural to expect that a stationary policy may not be optimal. Hence, it is important to expand the policy search space to the set of all history-dependent, randomized, time-varying policies.

Given the family of MDPs {M_k}_{k∈ℕ+}, one objective is to learn a policy π = (u_1, u_2, . . .) such that the long-run expected sum of discounted rewards, i.e., $E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, u_t(H_t)) \,\middle|\, H_0 = h_0\right]$, is maximized for all initial histories h_0 ∈ H_0. For finite-horizon MDPs, the objective equivalent to (1) can be stated in a similar fashion. The same follows for (3), where the policy search space will be randomized, history-dependent and time-varying. Another widely used performance criterion is called regret. This performance criterion is directly concerned with the rewards gained during system evolution, i.e., its emphasis is more on the rewards collected rather than on finding the policy which optimally controls the system. The regret is usually defined for a finite-horizon system as follows:

$$\text{Regret} = V_T^{*}(s_0) - \sum_{t=0}^{T-1} R(s_t, a_t), \qquad (6)$$

where T is the time horizon and V_T^*(s_0) is the optimal expected T-step reward that can be achieved by any policy when the system starts in state s_0.
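Computing (6) from a logged trajectory is straightforward once V_T^*(s_0) is available (for example, from dynamic programming on a known model); the numbers below are hypothetical.

```python
def empirical_regret(v_star_T, rewards):
    """Regret (6): the optimal expected T-step reward minus the reward actually collected.

    `rewards` holds R(s_t, a_t) for t = 0, ..., T-1 along the trajectory of the learning agent.
    """
    return v_star_T - sum(rewards)

# Hypothetical numbers: optimal 3-step value 2.1, collected rewards 0.4 + 0.7 + 0.1 = 1.2.
print(empirical_regret(2.1, [0.4, 0.7, 0.1]))   # prints 0.9 (up to floating point)
```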
It should be noted that the space of history-dependent, randomized policies is a large, intractable space. Searching this space for a suitable policy is hard. Additionally, in the model-free RL case, how do we learn value functions with only state and reward samples? In the next section, we explore these issues and discuss prior approaches in connection with the problem of non-stationary environments in RL. Some are methods designed for the case when model information is known, while others are based on model-free RL. Regret-based approaches are usually model-based RL approaches which work with finite-horizon systems. Approaches based on infinite-horizon systems are usually control methods, i.e., the main aim in such works is to find an approximately optimal policy for a system exposed to changing environment parameters.

IV. BENEFITS AND CHALLENGES

A. Benefits

RL is a machine learning paradigm which is closer to human intelligence than supervised and unsupervised learning. This is because, unlike in supervised learning, the RL autonomous agent is not given samples indicating what classifies as good behaviour and what does not. Instead, the environment only gives feedback recognizing when the action chosen by the agent is good and when it is not. Making RL algorithms efficient is the first step towards realizing general artificial intelligence [15]. Dealing with ever-changing environment dynamics is the next step in this progression, eliminating the drawback that RL algorithms are applicable only in domains with low risk, e.g., video games [46] and pricing [60].

Multi-agent RL [16] is concerned with learning in the presence of multiple agents. It can be considered as an extension of single-agent RL, but it encompasses unique problem formulations that draw from game-theoretic concepts as well. When multiple agents learn, they can be either competitive, to achieve conflicting goals, or cooperative, to achieve a common goal. In either case, the agent actions are no longer seen in isolation, when compared to single-agent RL. Instead, the actions are ranked based on the effect an individual agent's action has on the collective decision making. This implies
that the dynamics observed by an individual agent change based on the other agents' learning. So, as agents continually learn, they face dynamically varying environments, where the environments in this case depend on the joint actions of all agents. Unlike the change in transition probability and reward functions (Section III), when multiple agents learn, the varying conditions are a result of different regions of the state-action space being explored. Thus, non-stationary RL methods developed for single-agent RL can be extended to multi-agent RL as well.

B. Challenges

• Sample efficiency: Algorithms for handling varying environment conditions will definitely have issues w.r.t. sample efficiency. When the environment changes, learning needs to be quick, but the speed will depend on the state-reward samples obtained. Hence, if these samples are not informative of the change, then algorithms might take longer to learn new policies from these samples.
• Computation power: Single-agent RL algorithms face the curse of dimensionality as the size of the state-action space increases. Deep RL [14] uses graphical processing unit (GPU) hardware for handling large problem sizes. Detecting changing operating conditions puts an additional burden on computation. Hence, this will present a formidable challenge.
• Theoretical results: As stated in Section II, without Assumption 2 it is difficult to obtain convergence results for model-free RL algorithms in non-stationary environments. Thus, providing any type of guarantee on their performance becomes hard.

V. CURRENT SOLUTION APPROACHES

Solution approaches proposed till now have looked at both the finite-horizon (see Section V-A) as well as the infinite-horizon (Section V-B) cases. Prior approaches falling into these categories are discussed below.

A. Finite Horizon Approaches

1) Overview: The works falling into this category are [17]-[21].
• While the infinite-horizon approaches of Section V-B maximize the long-run objective and also provide methods to find an optimal policy corresponding to this optimal objective value, regret-based learning approaches minimize regret during the learning phase only. There are no known theoretical results for obtaining a policy from this optimized regret value. Moreover, the regret value is based on the horizon length T.
• The works [17]-[21] differ slightly with regard to the assumptions on the pattern of environment changes. [17] assumes that the number of changes is known, while [20], [21] do not impose restrictions on it. The work on Contextual MDP [19] assumes a finite, known number of environment contexts. [18] assumes that only the cost functions change and that they vary arbitrarily.
• Other than the mathematical tools used, the above works also differ with respect to the optimal time-dependent policy used in the computation of the regret. The optimal policy is average-reward optimal in [17], while it is total-reward optimal in [18]-[21]. [19] differs by letting the optimal policy be a piecewise stationary policy, where each stationary policy is total-reward optimal for a particular environment context.

2) Details: We now describe each of the above works in detail. The Contextual MDP (CMDP) is introduced by [19]. A CMDP is a tuple ⟨C, S, A, Y(C)⟩, where C is the set of contexts and c ∈ C is the context variable. Y(C) maps a context c to an MDP M_c = ⟨S, A, P_c, R_c, ξ_0^c⟩. Here, P_c and R_c are the same as P_k, R_k respectively, as defined in Section III. ξ_0^c is the distribution of the initial state s_0. The time horizon T is divided into H episodes, with an MDP context c ∈ C picked at the start of each episode. The chosen context is latent information for the RL controller. After the context is picked (possibly by an adversary), a start state s_0 is picked according to the distribution ξ_0^c, and the episode sample path and rewards are obtained according to P_c, R_c. Suppose the episode variable is h and r_{ht} is the reward obtained in step t of episode h. Let T_h, which is a stopping time, be the episode horizon. The regret is then defined by comparing, over the H episodes, the rewards r_{ht} collected by the learner with those of the optimal policy for the active context.

UCRL2 [17] and its variation-aware extension [20] are model-based algorithms whose regret guarantees depend on the diameter of the MDP context, defined as

$$D_M = \max_{s \neq s'} \min_{\pi} E\left[T(s' \mid M, \pi, s)\right], \qquad (7)$$

where M is the environment context and T is the first time step in which s′ is reached from the initial state s. Both algorithms keep track of the number of visits as well as the empirical average of rewards for all state-action pairs. Using a
confidence parameter, confidence intervals for these estimates are maintained and improved. The regret is defined as follows:

$$\text{Regret}_{\text{UCRL2}} = T\rho^{*} - \sum_{t=1}^{T} r_t,$$

where r_t is the reward obtained at every step t and ρ* is the optimal average reward, defined as follows:

$$\rho^{*} = \lim_{T \to \infty} \frac{1}{T}\, E\left[\sum_{t=1}^{T} r_t^{*}\right],$$

where r_t^* is the reward obtained at every step when the optimal policy π* is followed. When the environment model changes at most L times, learning is restarted with a confidence parameter that depends on L. Variation-aware UCRL2 [20] modifies this restart schedule, with the confidence parameter additionally depending on the MDP context variation. The context variation parameter depends on the maximum difference between the single-step rewards as well as the maximum difference between the transition probability functions over the time horizon T. When the environment changes, the estimation restarts, leading to a loss of the information collected. Both algorithms give sublinear regret upper bounds dependent on the diameter of the MDP contexts. The regret upper bound in [20] is additionally dependent on the MDP context variation.
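The confidence-set machinery behind UCRL2-type methods can be sketched as follows; this is a simplified illustration of ours (not the algorithm of [17] or [20]), keeping per-pair visit counts, empirical means and a Hoeffding-style radius that shrinks with the number of visits.

```python
import numpy as np

class ModelEstimate:
    """Per state-action empirical model with a simple confidence radius (illustration only)."""

    def __init__(self, n_states, n_actions, delta=0.05):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))
        self.delta = delta

    def update(self, s, a, r, s_next):
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def estimate(self, s, a):
        n = max(1.0, self.counts[s, a].sum())
        p_hat = self.counts[s, a] / n                            # empirical P(.|s, a)
        r_hat = self.reward_sum[s, a] / n                        # empirical R(s, a)
        radius = np.sqrt(2.0 * np.log(1.0 / self.delta) / n)     # Hoeffding-style radius (sketch)
        return p_hat, r_hat, radius
```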
Online-learning [22] based approaches for non-stationary environments are proposed by [18], [21]. MD2 [18] assumes that the transition probabilities are stationary and known to the agent, while the cost functions vary (denoted L_t) and are picked by an adversary. The goal of the RL agent is to select a sequence of vectors w_t ∈ CV, where CV ⊂ ℝ^d is a convex and compact subset of ℝ^d. The chosen vectors must reduce the regret, which is defined as follows:

$$\text{Regret}_{\text{MD2}} = \sum_{t=1}^{T} \langle L_t, w_t \rangle - \min_{w \in CV} \sum_{t=1}^{T} \langle L_t, w \rangle,$$

where ⟨·, ·⟩ is the usual Euclidean inner product. Thus, without information about L_t, the vector w_t can be chosen only by observing the history of cost samples obtained. For this, the authors propose solution methods based on Mirror Descent and exponential weights algorithms. [21] considers time-varying reward functions and develops a distance measure for reward functions based on total variation. Using this, a regret upper bound is derived which depends on this distance measure. Further, [21] adapts the Follow the Leader algorithm for online learning in MDPs.
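The online-learning viewpoint of [18], [21] can be illustrated with the classical exponential-weights update over a finite set of candidate decisions; this generic sketch (not the MD2 or Follow the Leader implementation) uses an arbitrary learning rate and adversarially chosen loss vectors.

```python
import numpy as np

def exponential_weights(loss_rounds, eta=0.1, seed=0):
    """Maintain a distribution over K candidate decisions and reweight it from observed losses.

    `loss_rounds` is a sequence of length-K loss vectors L_t, possibly chosen adversarially.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(len(loss_rounds[0]))
    picks = []
    for L_t in loss_rounds:
        p = w / w.sum()                              # current randomized decision rule
        picks.append(int(rng.choice(len(w), p=p)))
        w *= np.exp(-eta * np.asarray(L_t))          # down-weight decisions that incurred high loss
    return picks

print(exponential_weights([[0.2, 0.9], [0.1, 0.8], [0.7, 0.1]]))
```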
3) Remarks:
• Contextual MDP [19] necessitates a way to measure "closeness" between MDPs, which enables the proposed CECE algorithm to cluster MDP models and classify any new model observed. The clustering and classification of MDPs requires a distance metric for measuring how close two trajectories are to each other. [19] defines this distance using the transition probabilities of the MDPs. Using this distance metric and other theoretical assumptions, this work derives an upper bound on the regret, which is linear in T. The mathematical machinery used to show this is complex. Moreover, the distance measure used considers only the distance between probability distributions. However, the reward functions are important components of an MDP and vary with the policy. It is imperative that a distance measure depend on the reward functions too.
• UCRL2 and variation-aware UCRL2 restart learning with changes in the confidence parameter. This implies that in simple cases where the environment model alternates between two contexts, these methods restart with large confidence sets, leading to increased regret. Even if this information is provided, these algorithms will necessarily require a number of iterations to improve the confidence sets for estimating the transition probability and reward functions.
• UCRL2, variation-aware UCRL2 and the online learning approaches proposed in [18], [21] are model-based approaches, which do not scale well to large state-action space MDPs. The diameter D_M (see (7)) varies with the model and in many cases can be quite high, especially if the MDP problem size is huge. In this case, the regret upper bound might be very high.

B. Infinite Horizon Approaches

Works based on the infinite horizon are [23]-[32]. These are oriented towards developing algorithms which learn a good control policy under non-stationary environment models.

1) Details: [23] proposes a stochastic model for MDPs with non-stationary environments. These are known as hidden-mode MDPs (HM-MDPs). Each mode corresponds to an MDP with a stationary environment model. When a system is modeled as an HM-MDP, the transitions between modes are hidden from the learning agent. State and action spaces are common to all modes, but each mode differs from the other modes w.r.t. the transition probability and reward functions. The algorithm for solving HM-MDPs [24] assumes that model information is known. It is based on a Bellman equation developed for the HM-MDP, which is further used to design a value iteration algorithm based on dynamic programming principles for this model.

A model-free algorithm for non-stationary environments is proposed by [25]. It is a context detection based method known as RLCD. Akin to UCRL2, RLCD estimates transition probability and reward functions from simulation samples. However, unlike UCRL2, it attempts to infer whether the underlying MDP environment parameters have changed or not. The active model/context is tracked using a predictor function. This function utilizes an error score to rank the contexts that have already been observed. The error score dictates which context is designated as "active", based on the observed trajectory. At every decision epoch, the error score of all contexts is computed and the one with the least error score is labeled as the current "active" model. A threshold value on the error score is used to decide when to instantiate data structures for a new context, i.e., a context which has not yet been observed by the learning agent. If all the contexts have an error score greater than this threshold
value, then data structures for a new context are initialized. This new context is then selected as the active context model. Thus, new model estimates and the associated data structures are created on-the-fly.

If the environment model changes are negligible, we expect that the value functions also do not change much across the models. This is formally shown by [26]. If the accumulated changes in the transition probability or reward function remain bounded over time and such changes are insignificant, then the value functions of the policies of all contexts are "close" enough. Hence, [26] gives a theoretical framework highlighting conditions on context evolution. It also indicates when the pursuit of non-stationary RL algorithms is worthwhile.

Change detection-based approaches for learning/planning in non-stationary RL are proposed by [27]-[29]. The identification of the active context based on the error score is the crux of the RLCD method. [27] improves RLCD by incorporating change detection techniques for the identification of the active context. Similar to RLCD, this method estimates the transition and reward functions for all contexts. Suppose the number of active context estimates maintained by [27] is j. At time t, a number S_{i,t}, ∀i, 1 ≤ i ≤ j, is computed. Let P̃_i and R̃_i be the transition probability and reward function estimates of context i, where 1 ≤ i ≤ j. S_{i,t} is updated as follows:

$$S_{i,t} = \max\left(0,\; S_{i,t-1} + \ln \frac{\tilde{P}_i(s_{t+1} \mid s_t, a_t)\, \tilde{R}_i(s_t, a_t, s_{t+1})}{P_0(s_{t+1} \mid s_t, a_t)\, R_0(s_t, a_t, s_{t+1})}\right),$$

where P_0 is the fixed transition function of a uniform model, i.e., one which gives equal probability of transition between all states for all actions, and R_0 is set to 0 for all state-action pairs. A change is detected if max_{1≤i≤j} S_{i,t} > c, where c is a threshold value. R̃_i is updated as the moving average of the simulated reward samples. P̃_i is updated based on maximum likelihood estimation. The updates of P̃_i and R̃_i are the same as in [25].
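The following is a sketch of the per-context score update described above, written with a simplified callable interface (the estimates P̃_i, R̃_i and the reference model P_0, R_0 are assumed to be given; the small floor `eps` is our own guard against degenerate reference values):

```python
import math

def update_scores(scores, models, reference, s, a, s_next, eps=1e-8):
    """One update of the per-context scores S_{i,t} from the latest transition (sketch).

    `models[i] = (P_i, R_i)` are callables giving the estimated transition probability and
    reward of context i; `reference = (P0, R0)` is the fixed uniform reference model.
    """
    P0, R0 = reference
    denom = max(P0(s, a, s_next) * max(R0(s, a, s_next), eps), eps)   # guard degenerate values
    updated = []
    for S_prev, (P_i, R_i) in zip(scores, models):
        num = max(P_i(s, a, s_next) * max(R_i(s, a, s_next), eps), eps)
        updated.append(max(0.0, S_prev + math.log(num / denom)))
    return updated

def change_detected(scores, threshold):
    """A change is flagged when the largest context score exceeds the threshold c."""
    return max(scores) > threshold
```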
[28] shows that in the full-information case, i.e., when complete model information is known, the change detection approach of [27] leads to a loss in performance due to delayed detection. Based on this, [28] proposes a two-threshold switching (TTS) strategy built around a CUSUM statistic SR_t with thresholds c_1 and c_2. Until a change is confirmed, actions at time t + 1 are obtained using π_KL, while SR_t is simultaneously updated. When SR_t crosses the threshold c_2, where c_2 > c_1, TTS switches to π_j, which is the optimal policy for the MDP with P_j as the transition probability function. The CUSUM statistic SR_t helps in detecting changes in the environment context.

[29] proposes a model-free RL method for handling non-stationary environments based on a novel change detection method for multivariate data [35]. Similar to [28], this work assumes that the context change pattern is known. However, unlike [28], [29] carries out change detection on the state-reward samples obtained during simulation and not on the transition probability functions. The Q-learning (QL) algorithm (see [6], [33]) is used for learning policies, but a separate Q-value table is maintained for each of the environment contexts. During learning, the state-reward samples, known as experience tuples, are analyzed using the multivariate change detection method known as ODCP. When a change is detected, based on the known pattern of changes, the RL controller starts updating the Q values of the appropriate context. This method is known as Context QL and is more efficient at learning in dynamically varying environments when compared to QL.
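The core idea of Context QL, one Q table per detected context with updates routed to the active one, can be sketched as follows; the change-detection call (e.g., ODCP) is left abstract and all hyperparameters are our own assumptions.

```python
import numpy as np

class ContextQL:
    """Q-learning with a separate Q table per environment context (illustrative sketch)."""

    def __init__(self, n_states, n_actions, gamma=0.9, alpha=0.1, seed=0):
        self.q_tables = [np.zeros((n_states, n_actions))]   # one table per known context
        self.active = 0
        self.gamma, self.alpha = gamma, alpha
        self.rng = np.random.default_rng(seed)

    def on_change(self, unseen_context):
        """Route future updates to another table when the change detector (e.g., ODCP) fires."""
        if unseen_context:
            self.q_tables.append(np.zeros_like(self.q_tables[0]))
            self.active = len(self.q_tables) - 1
        else:                                    # assumed known (e.g., alternating) change pattern
            self.active = (self.active + 1) % len(self.q_tables)

    def update(self, s, a, r, s_next):
        Q = self.q_tables[self.active]
        Q[s, a] += self.alpha * (r + self.gamma * Q[s_next].max() - Q[s, a])

    def act(self, s, eps=0.1):
        Q = self.q_tables[self.active]
        return int(self.rng.integers(Q.shape[1])) if self.rng.random() < eps else int(Q[s].argmax())
```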
A variant of QL, called Repeated Update QL (RUQL), is proposed in [30]. This adaptation of QL repeats the updates to the Q values of a state-action pair by altering the learning rate sequence of QL. Though this variant is simple to implement, it has the same disadvantage as QL, i.e., poor learning efficiency in non-stationary environments.

An online-learning based variant of QL for arbitrarily varying reward and transition probability functions in MDPs is proposed by [31]. This algorithm, known as Q-FPL, is model-free and requires only the state-reward sample trajectories. With this information, the objective of the algorithm is to control the MDP in a manner such that regret is minimized. The regret is defined as the difference between the average reward per step obtained by the best stationary, deterministic policy and that obtained by the Q-FPL algorithm. Formally, we have

$$\text{Regret}_{\text{Q-FPL}} = \sup_{\sigma: S \to A} \frac{1}{T} \sum_{t=1}^{T} E\left[r_t(s_t, \sigma(s_t))\right] - \frac{1}{T} \sum_{t=1}^{T} E\left[r_t(s_t, a_t)\right],$$

where a_t denotes the action chosen by Q-FPL at time t.

[32] introduces the Non-Stationary MDP (NSMDP), which is a generalization of the MDP. At every instant, an RL agent has access to the current "snapshot" of the environment in the NSMDP model. Given this snapshot and the current state, the RATS algorithm utilizes a tree-search algorithm to decide the optimal action to be exercised for the current state.

2) Remarks:
• The algorithms for solving HM-MDPs [24] are computationally intensive and are not practically applicable. With advances in deep RL [14], there are better tools to make these computationally more feasible.
• RLCD [25] does not require a priori knowledge about the number of environment contexts and the context change pattern, but it is highly memory intensive, since it stores and updates estimates of the transition probabilities and rewards corresponding to all detected contexts.
• [28] is a model-based algorithm and hence it is impossible to use when model information cannot be obtained. However, this algorithm can be utilized in model-based RL. But certain practical issues limit its use even in model-based RL. One is that the pattern of model changes needs to be known a priori. Additionally, its two-threshold switching strategy depends on the CUSUM statistic for change detection and, more importantly, on the threshold values chosen. Since [28] does not provide a method to pre-fix suitable threshold values, they always need to be selected by trial and error. This is impossible to do since it will depend on the reward values, sample paths, etc.
• Extensive experiments while assessing the two-threshold switching strategy bring out the following issue. This issue is with reference to (8), where the fraction P_j^{π_i}(s_{t+1}|s_t, π_i(s_t)) / P_i^{π_i}(s_{t+1}|s_t, π_i(s_t)) is computed. Suppose that for the policy π_i it so happens that P_j^{π_i}(s_{t+1}|s_t, π_i(s_t)) = P_i^{π_i}(s_{t+1}|s_t, π_i(s_t)) and the optimal policy of P_j is π_j ≠ π_i. Then we will have P_j^{π_i}(s_{t+1}|s_t, π_i(s_t)) / P_i^{π_i}(s_{t+1}|s_t, π_i(s_t)) = 1 and SR_t will grow uncontrollably and cross every pre-fixed threshold value. Thus, in this normal case itself, the detection fails, unless the threshold value is pre-fixed with knowledge of the changepoint! Thus, [28] is not practically applicable in many scenarios.
• Numerical experiments in [26] show that QL and asynchronous value iteration are adaptive in nature. So, even if environment contexts change, these learn the policies for the new context with the help of samples from the new context. However, once new samples are obtained and the new context is sufficiently explored, the policies corresponding to the older models are lost. Thus, QL does not have memory retaining capability.
• RUQL [30] faces the same issues as QL: it can learn optimal policies for only one environment model at a time and cannot retain the policies learnt earlier. This is mainly because both QL and RUQL update the same set of Q values, even if the environment model changes. Further, QL and RUQL cannot monitor changes in context; this requires additional tools, as proposed by [29]. The Context QL method retains the policies learnt earlier in the form of Q values for all contexts observed. This eliminates the need to re-learn a policy, leading to better sample efficiency. This sample efficiency is, however, attained at the cost of memory requirements: Q values need to be stored for every context and hence the method is not scalable.
• The RATS algorithm approximates a worst-case NSMDP. However, due to the planning algorithms required for the tree search, this algorithm is not scalable to larger problems.

The prior approaches discussed in this section are summarized in Table I. The columns assess the decision horizon, model information requirements, mathematical tools used and policy retention capability. A '-' indicates that the column heading is not applicable to the algorithm.
Algorithm | Decision Horizon | Model Information Requirements | Mathematical Tool Used | Policy Retention
CECE [19] | Finite | Model-based | Clustering and Classification | -
UCRL2 [17] | Finite | Model-based | Confidence Sets | -
Variation-aware UCRL2 [20] | Finite | Model-based | Confidence Sets | -
MD2 [18] | Finite | Partially Model-based | Online learning | -
FTL [21] | Finite | Model-based | Online learning | -
RLCD [25] | Infinite | Model-free | Error score | Yes
TTS [28] | Infinite | Model-based | Change detection | -
Context QL [29] | Infinite | Model-free | Change detection | Yes
RUQL [30] | Infinite | Model-free | Step-size manipulation | No
Q-FPL [31] | Infinite | Model-free | Online learning | Yes
RATS [32] | Infinite | Model-free | Decision tree search | Yes
TABLE I: Summary of the prior approaches for non-stationary environments discussed in Section V.
In the next section, we describe works in related areas which are focused on learning across different tasks or on using experience gained in simple tasks to learn optimal control of more complex tasks. We also discuss how these are related to the problem we focus on.

VI. RELATED AREAS

A. Continual Learning

Continual learning algorithms [8] have been explored in the context of deep neural networks. However, continual learning is still in its nascent stage in RL. The goal in continual learning is to learn across multiple tasks. The tasks can vary in difficulty, but mostly they belong to the same problem domain. For example, consider a grid world task, wherein the RL agent must reach a goal position from a starting position by learning the possible movements, any forbidden regions, etc. Note that the goal position matters in this task, since the agent learns to reach a given goal position. If the goal position changes, then it is a completely new task for the RL agent, which now has to find the path to the new goal position. Thus, both tasks, though being in the same problem domain, are different. When the RL agent has to learn the optimal policy for the new grid world, it should make sure not to forget the policy for the old task. Hence, continual learning places emphasis on resisting forgetting [8].

An agent capable of continual, hierarchical, incremental learning and development (CHILD) is proposed in [36]. This work introduces continual learning by stating the properties expected of such an RL agent and combines the temporal transition hierarchies (TTH) algorithm with QL. The TTH method is a constructive neural network based approach that predicts the probabilities of events and creates new neuronal units to predict these events and their contexts. This method updates the weights and activations of the existing neuronal units and also creates new ones. It takes as input the reward signal obtained along the sample path. The output gives the Q values, which are further utilized to pick actions. This work provides extensive experimental results on grid world problems, where learning from previous experience is seen to outperform learning from scratch. The numerical experiments also analyze TTH's capability of acquiring new skills, as well as retaining learnt policies.

[37] derives motivation from the synaptic plasticity of the human brain, which is the ability of the neurons in the brain to strengthen their connections with other neurons. These connections (or synapses) and their strengths form the basis of learning in the brain. Further, each of the neurons can simultaneously store multiple memories, which implies that synapses are capable of storing connection strengths for multiple tasks. [37] intends to replicate synaptic plasticity in the neural network architectures used as function approximators in RL. For this, the authors use a biologically plausible synaptic model [38]. According to this model, the synaptic weight depends on a weighted average of its previous changes, which can further be approximated using a particular chain model. This chain model, which gives the synaptic weight at the current time by accounting for all previous changes, is incorporated to tune the parameters of the neural networks. Experiments on simple grid world problems show that QL with the above model has better performance in changing tasks when compared to classical QL.

A policy consolidation-based approach [39] is developed to tackle the forgetting of policies. It operates on the same synaptic model as [37], but consolidates memory at the policy level. Policy consolidation means that the current behavioural policy is distilled into a cascade of hidden networks that record policies at multiple timescales. The recorded policies also affect the behavioural policy by feeding into the policy network. The policies are encoded by the parameters of the neural network, and the distance between the parameters of two such networks can be used as a substitute for the distance between the policies (represented by the networks). This substitute measure is also incorporated in the loss function used for training the policy network. This method is tested on some benchmark RL problems.

1) Remarks:
• Developing biologically inspired algorithms [37], [39] is a novel idea. This has also been explored in many areas of supervised learning. However, to develop robust and reliable performance, adequate experimentation and theoretical justification are needed. The above works lack this and at best can be considered as initial advancements in this stream of work.
• We would like to compare continual learning algorithms with the approaches in Section V. Algorithms like [20], [30] do not resist catastrophic forgetting, because training on new data quickly erases knowledge acquired from older data. These algorithms restart with a fixed confidence parameter schedule. In comparison to this, [29] adapts Q-learning to non-stationary environments. It resists catastrophic forgetting by maintaining separate Q values for each model. This work provides empirical evidence that the policies for all models are retained. However, there are issues with computational efficiency, and the method needs to be adapted for function-approximation based RL algorithms.
• The CHILD [36] method is akin to RLCD [25] and Context QL [29], both of which also have separate data structures for each model. Thus, in combination with change detection, the CHILD algorithm can be used for dynamically varying environments as well.

B. Learning to Learn: Meta-learning Approaches

Meta-learning, as defined in Section I, involves reusing experience and skills from earlier tasks to learn new skills. If RL agents must meta-learn, then we need to define what constitutes experience and what previous data is useful in skill development. Thus, we need to understand what constitutes meta-data and how to learn using meta-data. Most of the prior works are targeted towards deep reinforcement learning (DRL), where deep neural network architectures are used for function approximation of value functions and policies.

A general model-agnostic meta-learning (MAML) algorithm is proposed in [40]. The algorithm can be applied to any learning problem and model that is trained using a gradient-descent procedure, but it is mostly tested on deep neural architectures, since their parameters are trained by backpropagating gradients. The main idea is to get hold of an internal representation in these architectures that is suitable for a wide variety of tasks. Further, using samples from new tasks, this internal representation (in terms of the network parameters) is fine-tuned for each task. Thus, there is no "learning from scratch"; learning from a basic internal representation is the main idea. The assumptions are that such representations are functions of some parameters and that the loss function is differentiable w.r.t. those parameters. The method is evaluated on classification, regression and RL benchmark problems. However, it is observed by [43] that the gradient estimates of MAML have high variance. This is mitigated by introducing surrogate objective functions which are unbiased.
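The gradient-based meta-learning loop of [40] can be summarized in a few lines; the following is a simplified first-order sketch of ours with a generic parameter vector and an assumed per-task gradient oracle, not the authors' released implementation.

```python
import numpy as np

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One meta-update: adapt theta to each task with a gradient step, then update shared theta.

    Each `task` is assumed to expose grad(theta) -> gradient of its loss at theta.
    """
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        phi = theta - inner_lr * task.grad(theta)        # task-specific fine-tuning (inner loop)
        meta_grad += task.grad(phi)                      # first-order approximation of the meta-gradient
    return theta - outer_lr * meta_grad / len(tasks)     # update the shared internal representation
```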
A probabilistic view of MAML is given by [41]. A fixed number of trajectories from a task T is obtained according to a policy parameterized by θ. The loss function defined on these trajectories is then used to update the task-specific parameters φ. This is carried out using the gradient of the loss function, which is obtained from either the policy gradient [71] or TRPO [44]. The same formulation is extended to non-stationary settings, where [41] assumes that the tasks themselves evolve as a Markov chain.

Learning good directed exploration strategies via meta-learning is the focus of [42]. The algorithm developed by the authors, known as MAESN, uses prior experience in related tasks to initialize policies for new tasks and also to learn appropriate exploration strategies. This is in comparison to [40], where only the policy is fine-tuned. The method assumes that the neural network parameters are denoted θ and the per-task variational parameters are denoted ω_i, 1 ≤ i ≤ N, where N is the number of tasks. On every iteration through the task training set, N tasks are sampled from this training set according to a distribution ρ. For each task, the RL agent gets state and reward samples. These are used to update the variational parameters. Further, after the iteration, θ is updated using the TRPO [44] algorithm. Numerical experiments for MAESN are carried out on robotic manipulation and locomotion tasks.

1) Remarks:
• Explainability of the generalization power of deep neural network architectures is still an open problem. The meta-RL approaches are all based on using these architectures. Thus, the performance of these algorithms can only be validated empirically. Also, most of the works described in this section lack theoretical justification. Only the problem formulation involves some mathematical ideas, but none of the results are theoretical in nature. However, applied works like [55] can be encouraged, but only if such works provide some performance analysis of meta-RL algorithms.
• The experimental results in most of the above works are still preliminary. These can be improved by facilitating more analysis.

VII. APPLICATION DOMAINS

Reinforcement learning finds its use in a number of domains, e.g., operations research [45], games [46], robotics [47], and intelligent transportation and traffic systems [48]. However, in most of the prior works in these applications, the assumption is that the environment characteristics and dynamics remain stationary. The number of prior works developing application-specific non-stationary RL algorithms is limited. This is due to the fact that adapting RL to problems with stationary environments is the first simple step towards more general RL controllers, for which scalability is still an issue. Only recent advances in deep RL [14] have improved their scalability to large state-action space MDPs. Improved computation power and hardware, due to advances in high-performance computing, have led to better off-the-shelf software packages for RL. Advancements in computing have led to better implementations of RL; these use deep neural architectures [14] and parallelization [49], [50] for making algorithms scalable to large problem sizes. Single-agent RL algorithms are now being deployed in a variety of applications owing to the improved computing infrastructure. One would also expect that easing the stationarity assumptions on RL environment models would further increase the need for high computation power. But, due to these advances in computing infrastructure, there is hope of extending RL to applications where non-stationary settings can make the system inefficient (unless there is adaptation).

In this section, we survey the following representative application domains: transportation and traffic systems, cyber-physical systems, digital marketing and inventory pricing, recommender systems and robotics. In these representative domains, we cover works which propose algorithms to specifically deal with dynamically varying environments. Most of these prior works are customized to their respective applications.

A. Transportation and Traffic Systems

Traffic systems are either semi-automated or fully automated physical infrastructure systems that manage the vehicular traffic in urban areas. These are installed to improve the flow of vehicular traffic and relieve congestion on urban roads. With the resurgence of AI, these systems are being fully automated using AI techniques. AI-based autonomous traffic systems use computer vision, data analytics and machine learning techniques for their operation. Improvement in computing power for RL has catapulted its use in traffic systems, and RL-based traffic signal controllers are being designed [51], [52], [29]. Non-stationary RL traffic controllers are proposed by [52], [29].

Soilse [52] is an RL-based intelligent urban traffic signal controller (UTC) tailored to fluctuating vehicular traffic patterns. It is phase-based, wherein a phase is a set of traffic lanes on which vehicles are allowed to proceed at any given time. Along with the reward feedback signal, the UTC obtains a degree-of-pattern-change signal from a pattern change detection (PCD) module. This module tracks the incoming traffic count of the lanes at a traffic junction. It detects a change in the flow pattern using moving average filters and the CUSUM [53] tool. When a significant change in traffic density is detected, the learning rates and exploration parameters are changed to facilitate learning.

Context QL [29], as described in Section V, tackles non-stationary environments. This method is evaluated in an autonomous UTC application. The difference in the performance of QL [33] and Context QL is highlighted by numerical experiments [29]. This difference indicates that designing new methods for varying operating conditions is indeed beneficial.

Intelligent transportation systems (ITS) employ information and communication technologies for road transport infrastructure, mobility management and for interfaces with other modes of transport as well. This field of research also includes new business models for smart transportation. Urban aerial transport devices like unmanned aerial systems (UAS) are also part of ITS. For urban services like package delivery, law enforcement and outdoor surveys, a UAS is equipped with cameras and other sensors. To carry out these tasks efficiently, the UAS takes photos and videos of densely human
populated areas. Though information gathering is vital, there minimize transmission of stale information from the base
are high chances that the UAS intrudes privacy. [54] considers stations to the user equipments. For this, an infinite-state,
this conflicting privacy-information criteria. The problem is average-reward MDP is formulated. Optimizing this MDP
that UAS may fly over areas which are densely populated is infeasible and hence this work finds a simple heuristic
and take pictures of humans in various locations. Though scheduling policy which is capable of achieving the lowest
the UAS can use human density datasets to avoid densely AoI.
populated locations, human concentration is still unpredictable
and may change depending on weather, time of day, events
C. Digital Marketing and Inventory Pricing
etc. Thus, the model of human population density tends to be
non-stationary. [54] proposes a model-based RL path planning Digital marketing and inventory pricing are connected
problem that maintains and learns a separate model for each strategies for improving sale of goods and services. In current
distribution of human density. times, many online sites complement inventory pricing with
B. Cyber-Physical Systems and Wireless Networks

Cyber-physical systems (CPS) are integrations of physical processes with their associated networking and computation systems. A physical process, e.g., a manufacturing or energy plant, is controlled by embedded computers and networks in a closed feedback loop, where the physical processes affect computation and vice versa. Thus, autonomous control forms an innate part of CPS. Many prior works address anomaly detection in CPS [56], since abnormal operation of a CPS forces its controllers to deal with non-stationary environments.

CPS security [57] also builds on anomaly detection. The computation and networking systems of a CPS are liable to denial-of-service (DoS) and malware attacks. These attacks can be unearthed only if the sensors and/or the CPS controller can detect anomalies in CPS operation. In this respect, [57] proposes a statistical method for operational CPS security: a modification of the Shiryaev-Roberts-Pollak procedure is used to detect changes in the operational variables of the CPS, which can reveal DDoS and malware attacks.
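For intuition, the classical (unmodified) Shiryaev-Roberts recursion that underlies such schemes is sketched below; the Gaussian pre/post-change densities and the alarm threshold are illustrative assumptions, not the modified Shiryaev-Roberts-Pollak procedure of [57].

import numpy as np
from scipy.stats import norm

def shiryaev_roberts(observations, f0, f1, threshold):
    """Classical Shiryaev-Roberts change detection.

    f0/f1: pre- and post-change densities evaluated at an observation.
    Raises an alarm when the running statistic R_n exceeds the threshold."""
    r = 0.0
    for n, x in enumerate(observations, start=1):
        lr = f1(x) / f0(x)          # likelihood ratio of the new sample
        r = (1.0 + r) * lr          # SR recursion: R_n = (1 + R_{n-1}) * LR_n
        if r >= threshold:
            return n                # alarm time
    return None                     # no change declared

# Example: detect a mean shift in a monitored CPS operating variable.
pre = norm(loc=0.0, scale=1.0)
post = norm(loc=1.5, scale=1.0)     # assumed post-attack distribution
data = np.concatenate([pre.rvs(200, random_state=0),
                       post.rvs(50, random_state=1)])
print("alarm raised at sample", shiryaev_roberts(data, pre.pdf, post.pdf, threshold=1e4))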
Data from urban-infrastructure CCTV networks can be leveraged to monitor and detect events like fire hazards in buildings, organized marathons on urban roads, crime hotspots, etc. [58] uses CCTV data along with social media post data to detect events in an urban environment. This multimodal dataset exhibits changes in its properties before, during and after an event. Specifically, [58] tracks the counts of social media posts from locations in the vicinity of a geographical area and the counts of persons and cars on the road. These counts are modeled as Poisson random variables, and it is assumed that the mean rates of the observed counts change before, during and after a running marathon event. A hidden Markov model (HMM) is proposed with the mean count rates as the hidden states. This HMM is extended to a stopping-time POMDP, and the structure of the optimal policies for this event detection model is obtained.
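The core filtering step can be illustrated with a small Poisson-rate HMM whose hidden state indicates whether the event is ongoing; the rates, transition matrix and two-modality structure below are assumed for illustration and are not the exact model or the stopping-time POMDP of [58].

import numpy as np
from math import lgamma, log

# Hidden states: 0 = "no event", 1 = "event ongoing" (illustrative labels).
TRANSITION = np.array([[0.99, 0.01],
                       [0.02, 0.98]])
RATES = np.array([[5.0, 40.0],     # mean social-media post count per slot
                  [20.0, 120.0]])  # mean person/vehicle count per slot
                                   # columns correspond to states 0 and 1

def poisson_log_pmf(k, lam):
    # Exact log pmf of a Poisson(lam) distribution evaluated at count k.
    return k * log(lam) - lam - lgamma(k + 1)

def filter_event_probability(count_stream, prior=(0.95, 0.05)):
    """Forward filtering: P(event ongoing | counts observed so far)."""
    belief = np.array(prior, dtype=float)
    history = []
    for counts in count_stream:             # counts = (posts, persons + cars)
        belief = TRANSITION.T @ belief      # predict step
        likelihood = np.array([
            np.exp(sum(poisson_log_pmf(c, RATES[m, s])
                       for m, c in enumerate(counts)))
            for s in range(2)
        ])
        belief *= likelihood                # update with the multimodal counts
        belief /= belief.sum()
        history.append(float(belief[1]))
    return history

print(filter_event_probability([(6, 22), (7, 25), (35, 110), (42, 130)]))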
[59] considers improving user experience in cellular wireless networks by minimizing the Age of Information (AoI) metric. This metric measures the freshness of the information transmitted to end users (“user equipments”) in a wireless cellular network. A multi-user scheduling problem is formulated which does not restrict the characteristics of the wireless channel model. Thus, a non-stationary channel model is assumed for the multi-user scheduling problem, and the objective is to minimize the transmission of stale information from the base stations to the user equipments. For this, an infinite-state, average-reward MDP is formulated. Optimizing this MDP directly is infeasible, and hence this work finds a simple heuristic scheduling policy which is capable of achieving the lowest AoI.
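As a point of reference for such heuristics, a common age-based scheduling rule (not necessarily the specific policy analyzed in [59]) serves, in each slot, the user whose current age weighted by an estimated channel success probability is largest:

import numpy as np

def age_weighted_schedule(ages, success_prob):
    """Serve the user with the largest age x estimated success probability.

    ages: current AoI of each user; success_prob: running estimate of each
    user's channel reliability (which may drift, so it should be re-estimated
    online under a non-stationary channel)."""
    return int(np.argmax(np.asarray(ages) * np.asarray(success_prob)))

def step(ages, success_prob, rng):
    """One scheduling slot: transmit to the chosen user, then update all ages."""
    user = age_weighted_schedule(ages, success_prob)
    delivered = rng.random() < success_prob[user]
    ages = [1 if (i == user and delivered) else a + 1 for i, a in enumerate(ages)]
    return ages, user

rng = np.random.default_rng(0)
ages, probs = [3, 7, 1], [0.9, 0.4, 0.7]
for _ in range(5):
    ages, served = step(ages, probs, rng)
    print("served user", served, "ages now", ages)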
C. Digital Marketing and Inventory Pricing

Digital marketing and inventory pricing are connected strategies for improving the sale of goods and services. Nowadays, many online sites complement inventory pricing with digital marketing to attract more buyers and hence improve sales. Digital marketing moves away from conventional marketing strategies in the sense that it uses digital technologies like social media, websites, display advertising, etc., to promote products and attract customers. Thus, it has more avenues for an improved reach when compared to conventional marketing. Inventory pricing is concerned with pricing the goods/services that are produced/developed to be sold. To gain profits, it is important that the manufacturer prices products/services according to the uncertain demand for the product, the production rate, etc.

A pricing policy that maximizes revenue for a given inventory of items is the focus of [60]. The objective of the automated pricing agent is to sell a given inventory before a fixed time and to maximize the total revenue from that inventory. This work assumes that the demand distribution is unknown and varies with time, which gives rise to non-stationary environment dynamics. The work employs QL with eligibility traces [6] to learn a pricing policy.
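A minimal sketch of what tabular Q-learning with eligibility traces looks like for such a pricing agent is given below (Watkins's Q(lambda) variant); the state/action discretization and all parameters are assumptions for illustration rather than the exact algorithm of [60].

import numpy as np

def q_lambda_update(Q, E, s, a, r, s_next, took_greedy_next,
                    alpha=0.1, gamma=0.95, lam=0.8):
    """One Watkins's Q(lambda) backup for a tabular pricing agent.

    Q, E: |states| x |prices| arrays of action values and eligibility traces.
    s encodes a discretized (remaining inventory, time left) pair, a is the
    index of the posted price, and r is the revenue collected this period."""
    a_star = int(np.argmax(Q[s_next]))        # greedy action at the next state
    delta = r + gamma * Q[s_next, a_star] - Q[s, a]
    E[s, a] += 1.0                            # accumulating eligibility trace
    Q += alpha * delta * E                    # credit recently used (s, a) pairs
    if took_greedy_next:
        E *= gamma * lam                      # decay all traces
    else:
        E[:] = 0.0                            # Watkins's cut after exploration
    return Q, E

# Typical usage inside an episode (epsilon-greedy price selection):
#   choose a from Q[s]; observe demand, revenue r and next state s_next;
#   Q, E = q_lambda_update(Q, E, s, a, r, s_next, took_greedy_next=...)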
[61] studies an off-policy policy evaluation method for digital marketing. The users of an online product site are shown customized promotions, and every such marketing promotion strategy uses customer information to decide which promotions to display on the website. [61] proposes an MDP model with the user information as the state and the promotions to be displayed as the action. The reward gained from promotions is measured by tracking the number of orders per customer visit. The proposed method is shown to reduce errors in the policy evaluation of the promotion strategy.
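The off-policy idea can be conveyed with a bare-bones per-decision importance-sampling evaluator; this is a generic OPE sketch assuming the logging (behaviour) policy's action probabilities are known, not the specific nonstationarity-aware estimator developed in [61].

import numpy as np

def per_decision_is(trajectories, target_probs, logging_probs, gamma=0.99):
    """Per-decision importance-sampling estimate of a target promotion policy.

    trajectories: list of [(state, action, reward), ...] logged episodes.
    target_probs(s, a) / logging_probs(s, a): action probabilities under the
    policy being evaluated and the policy that generated the logs."""
    estimates = []
    for episode in trajectories:
        rho, value, discount = 1.0, 0.0, 1.0
        for state, action, reward in episode:
            rho *= target_probs(state, action) / logging_probs(state, action)
            value += discount * rho * reward   # weight each reward by the
            discount *= gamma                  # cumulative importance ratio
        estimates.append(value)
    return float(np.mean(estimates))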
D. Recommender Systems

Recommender systems/platforms are information filtering systems that predict the preference a user would assign to a product/service. These systems have revolutionized online marketing, online shopping, online question-answer forums, etc. Their predictions are aimed at suggesting relevant products, movies, books, etc., to online users, and they now form the backbone of digital marketing and promotion. Many content providers like Netflix, YouTube, Spotify and Quora use them as content recommenders and playlist generators.

A concept-drift-based model management scheme for recommender systems is proposed by [62]. This work utilizes RL for handling concept drift in supervised learning tasks. Supervised learning tasks see shifts in the input-label correspondence and the feature distribution due to the ever-changing dynamics of real-world data. Each feature distribution and input-label correspondence is represented as a model, and whenever there is a shift in the underlying data, this model needs to be retrained. An MDP is formulated for taking decisions about model retraining, i.e., for deciding when to update a model. This decision is necessary because the model of a given system influences the ability to act upon the current data, and any change in it will affect its influence on current as well as future data. If new models are learned too quickly, then the learning agent may simply be underfitting the data and wasting computational resources by training too frequently. However, if the agent delays model retraining, then the prediction performance of the model might decrease drastically. Thus, given the current model and the current data, the MDP-based RL agent decides when and how to update the model. A similar work using variants of deep Q-networks (DQN) [63] is proposed in [64].
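The retrain-or-keep decision can be pictured as a small MDP in which the action trades a retraining cost against the cost of serving a stale model; the state features, cost constants and threshold baseline below are illustrative assumptions and not the formulation used in [62] or [64].

from dataclasses import dataclass

@dataclass
class DriftState:
    """Compact state for the model-management agent (illustrative features)."""
    validation_error: float   # recent error of the deployed model
    drift_score: float        # e.g., a distribution-distance estimate
    steps_since_retrain: int

RETRAIN_COST = 1.0            # assumed compute/engineering cost of retraining
ERROR_PENALTY = 5.0           # assumed cost per unit of served prediction error

def reward(state: DriftState, action: str) -> float:
    """Negative cost: pay for prediction errors always, pay extra to retrain."""
    cost = ERROR_PENALTY * state.validation_error
    if action == "retrain":
        cost += RETRAIN_COST
    return -cost

def baseline_policy(state: DriftState) -> str:
    """A simple threshold baseline that an RL agent trained on this MDP
    would be expected to improve upon."""
    if state.drift_score > 0.3 or state.validation_error > 0.15:
        return "retrain"
    return "keep"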
E. Robotics

Robotics is the design, development, testing, operation and use of robots. Its objective is to build machines that are intelligent, can assist humans in various tasks and can also perform tasks which are beyond human reach. Robots can also be designed to provide safety in human operations. Robots are now being utilized in outer-space missions, medical surgery, meal delivery in hospitals [65], etc. However, robots often need to adapt to non-stationary operating conditions; for example, a ground robot/rover must adapt its walking gait to changing terrain conditions [66] or surface friction coefficients [67]. Robotic environments characterized by changing conditions and sparse rewards are particularly hard to learn in because, often, the reinforcement given to the RL agent is a small value that is obtained only at the end of the task. [67] focuses on learning with robotic arms, where object manipulation is characterized by sparse-reward environments. The robotic arm is tasked with moving or manipulating objects which are placed at fixed positions on a table. In these tasks, dynamic adaptation to the surface friction and to the changed placement of objects on the table is often tough. [67] adapts the TRPO algorithm for dealing with these changing operating conditions. The robotic-arm RL agent is modeled as a continuous state-action space MDP, and in this setting the policy is parameterized by a Gaussian distribution. [67] proposes a strategy to adjust the variance of this Gaussian policy in order to adapt to environment changes.
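One simple way to realize such variance adaptation, sketched purely for illustration (the schedule actually proposed in [67] differs), is to inflate the Gaussian exploration noise whenever recent returns drop sharply and to decay it as performance recovers:

import numpy as np

class AdaptiveGaussianPolicy:
    """Gaussian policy N(mu(s), sigma^2) whose variance reacts to performance.

    Illustrative rule: if an episode return falls well below the moving
    average (a likely sign that the environment changed), inflate sigma to
    explore more; otherwise decay it toward a base level. Returns are assumed
    non-negative here; a different drop test is needed otherwise."""

    def __init__(self, mean_fn, sigma=0.2, sigma_min=0.05, sigma_max=1.0):
        self.mean_fn = mean_fn          # e.g., a network mapping state -> mu
        self.sigma = sigma
        self.sigma_min, self.sigma_max = sigma_min, sigma_max
        self.return_avg = None

    def act(self, state, rng):
        return rng.normal(self.mean_fn(state), self.sigma)

    def adapt(self, episode_return, beta=0.9, inflate=1.5, decay=0.95):
        if self.return_avg is None:
            self.return_avg = episode_return
            return
        if episode_return < 0.8 * self.return_avg:   # sudden drop: widen search
            self.sigma = min(self.sigma * inflate, self.sigma_max)
        else:
            self.sigma = max(self.sigma * decay, self.sigma_min)
        self.return_avg = beta * self.return_avg + (1 - beta) * episode_return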
Hexapod locomotion in complex terrain is the focus of [68]. This approach assumes that the terrain is modeled using N discrete distributions, each of which captures the difficulties of one terrain type. For each such terrain, an expert policy is obtained using deep RL. Conditioned on the state history, a policy from this set of experts is then picked, leading to an adaptive gait for the hexapod.
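The resulting two-level controller can be sketched as a classifier over recent state history that gates a library of pre-trained experts; the interfaces and the toy classifier below are hypothetical placeholders rather than the actual architecture of [68].

from collections import deque
import numpy as np

class GatedExpertController:
    """Pick one of N pre-trained terrain experts from recent state history."""

    def __init__(self, experts, classifier, history_len=50):
        self.experts = experts            # list of policies: state -> action
        self.classifier = classifier      # state history -> terrain index
        self.history = deque(maxlen=history_len)

    def act(self, state):
        self.history.append(state)
        terrain = self.classifier(list(self.history))   # infer terrain type
        return self.experts[terrain](state)             # delegate to expert

# Toy usage with stand-in experts and a trivial gating rule.
experts = [lambda s: -0.1 * s, lambda s: +0.1 * s]       # placeholder policies
classifier = lambda hist: int(np.mean(hist) > 0.0)        # placeholder gating
ctrl = GatedExpertController(experts, classifier)
print(ctrl.act(np.array([0.3, -0.2])))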
F. Remarks

All prior works discussed in this section are specifically designed for their respective applications. For example, Soilse [52] predicts the change in lane inflow rates and uses this to infer whether the environment context has changed or not. This technique is limited to the traffic model, and more so if lane occupation levels are the states of the model. The same is the case with a majority of the other works as well; it is tough to extend them to more general settings. Some works which are generalizable are [29], [54], [61], [67], [68]. The methods suggested in these works can be adapted to other applications as well, provided some changes are incorporated. For example, [29] should be extended to continuous state-action space settings by incorporating function approximation techniques; this will improve its applicability to tougher problems. [54] utilizes Gaussian process tools to build a model-based RL path planner for UAS; this can be extended to model-free settings using [69] or other works along similar lines. Extending [61] to policy improvement techniques like actor-critic [70] and policy gradient [71] is also a good direction for future work.
VIII. FUTURE DIRECTIONS

The previous sections of this survey introduced the problem, presented the benefits and challenges of non-stationary RL algorithms, and reviewed and categorized prior works. In this section, we describe possible directions in which the prior works can be enhanced. Following this, we also enumerate challenges which are not addressed by the prior works and which warrant our attention.

Prior approaches can be improved in the following manner:
• The regret-based approaches described in Section V-A are useful in multi-armed bandit-type settings where efficient learning with minimal loss is the focus. Since they are not geared towards finding good policies, these works are not directly useful in RL settings where control is the main focus. However, the ideas they propose can be incorporated to guide the initial exploration of actions in approaches like [29], [30].
• Relaxing certain theoretical assumptions, such as those on non-communicating MDPs [72] and multi-chain MDPs [73], can further improve the applicability of regret-based methods to control-based approaches.
• Most of the model-based and model-free approaches in Section V are not scalable to large problem sizes. This is because each of these methods either consumes a lot of memory for storing estimates of model information [17]-[27] or consumes compute power for detecting changes [28], [29]; [32] also uses compute power for building large decision trees. These heavy compute and memory requirements render these approaches inapplicable in practical systems, which typically function with restricted resources. One option is to offload the compute and memory requirements onto a central server. Another option is to incorporate function approximation in the representation of value functions and policies.
• Tools from statistics, such as quickest change detection [34] and anomaly detection, can prove indispensable in the problem of non-stationary RL. Also, introducing memory-retaining capacity in deep neural network architectures can be a remedy for resisting catastrophic forgetting.
• Works [28], [29], [52] assume that the pattern of environment changes is known and can be tracked. However, in practice it is often difficult to track such changes. For this tracking, methods such as [74] can be used.

Next, we discuss additional challenges in this area.
• There is a need to develop algorithms that are sensitive to changes in environment dynamics and that adapt to changing operating conditions seamlessly. Such algorithms can be extended to continual RL settings.
• In the literature, there is a lack of deep RL approaches that handle non-stationary environments and scale with the problem size. Meta-learning approaches [40], [41], [42], [43] exist, but these are still in the initial research stages; they are not sufficiently analyzed and utilized and, more importantly, they are not explainable algorithms.
• Some applications like robotics [47] impose additional desired capabilities, e.g., sample efficiency. When dealing with non-stationary environment characteristics, the number of samples the RL agent obtains for every environment model can be quite limited. In the extreme case, the agent may obtain only one sample trajectory, as observed in robotic-arm manipulation exercises. In such a case, we expect the learning algorithm to be data efficient and to utilize the available data for multiple purposes, i.e., to learn good policies as well as to detect changes in environment statistics.
• While encountering abnormal conditions, an RL autonomous agent might violate safety constraints, because the delay in efficiently controlling the system under abnormal conditions can lead to physical harm. For example, in self-driving cars, a sudden or abrupt change in weather conditions can lead to impaired visual information from the car sensors. Such scenarios mandate that the RL agent, though still learning new policies, must maintain some nominal safe behaviour. Thus, this can lead to works at the intersection of safe RL [75] and non-stationary RL algorithms.
REFERENCES

[1] H. Mirzaei Buini, S. Peter, and T. Givargis, "Adaptive embedded control of cyber-physical systems using reinforcement learning," IET Cyber-Physical Systems: Theory & Applications, vol. 2, no. 3, pp. 127–135, 2017.
[2] C.-Z. Xu, J. Rao, and X. Bu, "URL: A unified reinforcement learning approach for autonomic cloud management," Journal of Parallel and Distributed Computing, vol. 72, no. 2, pp. 95–105, 2012.
[3] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang, "Survey on Reinforcement Learning Applications in Communication Networks," Journal of Communications and Information Networks, vol. 4, no. 2, pp. 30–39, June 2019.
[4] H. Liu, S. Liu, and K. Zheng, "A Reinforcement Learning-Based Resource Allocation Scheme for Cloud Robotics," IEEE Access, vol. 6, pp. 17215–17222, 2018.
[5] C. A. Gomez-Uribe and N. Hunt, "The Netflix Recommender System: Algorithms, Business Value, and Innovation," ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, Dec 2016.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[7] G. Dulac-Arnold, D. J. Mankowitz, and T. Hester, "Challenges of Real-World Reinforcement Learning," CoRR, vol. abs/1904.12901, 2019.
[8] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, vol. 113, pp. 54–71, 2019.
[9] J. Vanschoren, Meta-Learning. Springer International Publishing, 2019, pp. 35–61.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[11] D. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Belmont, MA: Athena Scientific, 2013, vol. II.
[12] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd ed. New York, NY, USA: John Wiley & Sons, Inc., 2005.
[13] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Springer, 2009, vol. 48.
[14] V. Francois-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, An Introduction to Deep Reinforcement Learning, 2018, vol. 11, no. 3-4.
[15] P. Wang and B. Goertzel, Theoretical Foundations of Artificial General Intelligence. Springer, 2012, vol. 4.
[16] L. Busoniu, R. Babuska, and B. De Schutter, "A Comprehensive Survey of Multiagent Reinforcement Learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[17] T. Jaksch, R. Ortner, and P. Auer, "Near-optimal regret bounds for reinforcement learning," Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1563–1600, 2010.
[18] T. Dick, A. Gyorgy, and C. Szepesvari, "Online learning in Markov decision processes with changing cost sequences," in International Conference on Machine Learning, 2014, pp. 512–520.
[19] A. Hallak, D. D. Castro, and S. Mannor, "Contextual Markov Decision Processes," in Proceedings of the 12th European Workshop on Reinforcement Learning (EWRL 2015), 2015.
[20] R. Ortner, P. Gajane, and P. Auer, "Variational Regret Bounds for Reinforcement Learning," in Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, 2019.
[21] Y. Li and N. Li, "Online Learning for Markov Decision Processes in Nonstationary Environments: A Dynamic Regret Analysis," in 2019 American Control Conference (ACC), July 2019, pp. 1232–1237.
[22] S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
[23] S. P. Choi, D.-Y. Yeung, and N. L. Zhang, "An Environment Model for Nonstationary Reinforcement Learning," in Advances in Neural Information Processing Systems, 2000, pp. 987–993.
[24] S. P. Choi, D.-Y. Yeung, and N. L. Zhang, "Hidden-Mode Markov Decision Processes for Nonstationary Sequential Decision Making," in Sequence Learning. Springer, 2000, pp. 264–287.
[25] B. C. da Silva et al., "Dealing with Non-stationary Environments Using Context Detection," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 217–224.
[26] B. C. Csáji and L. Monostori, "Value Function Based Reinforcement Learning in Changing Markovian Environments," Journal of Machine Learning Research, vol. 9, pp. 1679–1709, Jun 2008.
[27] E. Hadoux, A. Beynier, and P. Weng, "Sequential Decision-Making under Non-stationary Environments via Sequential Change-point Detection," in Learning over Multiple Contexts (LMCE), Nancy, France, Sep 2014.
[28] T. Banerjee, M. Liu, and J. P. How, "Quickest change detection approach to optimal control in Markov decision processes with model changes," in 2017 American Control Conference (ACC), Seattle, WA, USA, 2017, pp. 399–405.
[29] S. Padakandla, K. J. Prabuchandran, and S. Bhatnagar, "Reinforcement Learning in Non-Stationary Environments," CoRR, 2019. [Online]. Available: http://arxiv.org/abs/1905.03970
[30] S. Abdallah and M. Kaisers, "Addressing Environment Non-Stationarity by Repeating Q-learning Updates," Journal of Machine Learning Research, vol. 17, no. 46, pp. 1–31, 2016.
[31] J. Y. Yu and S. Mannor, "Arbitrarily modulated Markov decision processes," in Proceedings of the 48th IEEE Conference on Decision and Control (CDC), 2009, pp. 2946–2953.
[32] E. Lecarpentier and E. Rachelson, "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning," in Advances in Neural Information Processing Systems, 2019, pp. 7214–7223.
[33] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[34] A. Shiryaev, "On Optimum Methods in Quickest Detection Problems," Theory of Probability and Its Applications, vol. 8, no. 1, pp. 22–46, 1963.
[35] K. J. Prabuchandran, N. Singh, P. Dayama, and V. Pandit, "Change Point Detection for Compositional Multivariate Data," arXiv, 2019.
[36] M. B. Ring, "CHILD: A first step towards continual learning," in Learning to Learn. Springer, 1998, pp. 261–292.
[37] C. Kaplanis, M. Shanahan, and C. Clopath, "Continual Reinforcement Learning with Complex Synapses," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 10–15 Jul 2018, pp. 2497–2506.
[38] M. K. Benna and S. Fusi, "Computational principles of synaptic memory consolidation," Nature Neuroscience, vol. 19, no. 12, p. 1697, 2016.
[39] C. Kaplanis et al., "Policy Consolidation for Continual Reinforcement Learning," in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 09–15 Jun 2019, pp. 3242–3251.
[40] C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML'17. JMLR.org, 2017, pp. 1126–1135.
[41] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, "Continuous adaptation via meta-learning in nonstationary and competitive environments," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=Sk2u1g-0-
[42] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, "Meta-Reinforcement Learning of Structured Exploration Strategies," in Advances in Neural Information Processing Systems, 2018, pp. 5302–5311.
[43] H. Liu, R. Socher, and C. Xiong, "Taming MAML: Efficient Unbiased Meta-Reinforcement Learning," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 4061–4071.
[44] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[45] M. Schneckenreither and S. Haeussler, "Reinforcement learning methods for operations research applications: The order release problem," in Machine Learning, Optimization, and Data Science, G. Nicosia, P. Pardalos, G. Giuffrida, R. Umeton, and V. Sciacca, Eds. Springer International Publishing, 2019, pp. 545–559.
[46] K. Shao, Z. Tang, Y. Zhu, N. Li, and D. Zhao, "A Survey of Deep Reinforcement Learning in Video Games," 2019.
[47] J. Kober and J. Peters, Reinforcement Learning in Robotics: A Survey. Springer International Publishing, 2014, pp. 9–67.
[48] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A Survey on Reinforcement Learning Models and Algorithms for Traffic Signal Control," ACM Comput. Surv., vol. 50, no. 3, June 2017.
[49] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica, "Ray: A Distributed Framework for Emerging AI Applications," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI'18. USENIX Association, 2018, pp. 561–577.
[50] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica, "RLlib: Abstractions for Distributed Reinforcement Learning," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 3053–3062.
[51] K. J. Prabuchandran, A. N. Hemanth Kumar, and S. Bhatnagar, "Multi-agent reinforcement learning for traffic signal control," in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2014, pp. 2529–2534.
[52] A. Salkham and V. Cahill, "Soilse: A decentralized approach to optimization of fluctuating urban traffic using reinforcement learning," in 13th International IEEE Conference on Intelligent Transportation Systems, Sept 2010, pp. 531–538.
[53] E. S. Page, "Continuous Inspection Schemes," Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.
[54] R. Allamaraju, H. Kingravi, A. Axelrod, G. Chowdhary, R. Grande, J. P. How, C. Crick, and W. Sheng, "Human aware UAS path planning in urban environments using nonstationary MDPs," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1161–1167.
[55] Y. Jaafra, A. Deruyver, J. L. Laurent, and M. S. Naceur, "Context-Aware Autonomous Driving Using Meta-Reinforcement Learning," in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), Dec 2019, pp. 450–455.
[56] C. O'Reilly, A. Gluhak, M. A. Imran, and S. Rajasegarar, "Anomaly Detection in Wireless Sensor Networks in a Non-Stationary Environment," IEEE Communications Surveys & Tutorials, vol. 16, no. 3, pp. 1413–1432, 2014.
[57] A. Coluccia and A. Fascista, "An alternative procedure to cumulative sum for cyber-physical attack detection," Internet Technology Letters, vol. 1, no. 3, p. e2, 2018.
[58] T. Banerjee, G. Whipps, P. Gurram, and V. Tarokh, "Sequential Event Detection Using Multimodal Data in Nonstationary Environments," in 2018 21st International Conference on Information Fusion (FUSION), 2018, pp. 1940–1947.
[59] S. Banerjee, R. Bhattacharjee, and A. Sinha, "Fundamental Limits of Age-of-Information in Stationary and Non-stationary Environments," 2020.
[60] R. Rana and F. S. Oliveira, "Real-time dynamic pricing in a non-stationary environment using model-free reinforcement learning," Omega, vol. 47, pp. 116–126, 2014.
[61] P. S. Thomas, G. Theocharous, M. Ghavamzadeh, I. Durugkar, and E. Brunskill, "Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, ser. AAAI'17. AAAI Press, 2017, pp. 4740–4745.
[62] E. Liebman, E. Zavesky, and P. Stone, "A Stitch in Time - Autonomous Model Management via Reinforcement Learning," in Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS '18. International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 990–998.
[63] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[64] S.-Y. Chen, Y. Yu, Q. Da, J. Tan, H.-K. Huang, and H.-H. Tang, "Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1187–1196.
[65] M. A. Okyere, R. Forson, and F. Essel-Gaisey, "Positive externalities of an epidemic: The case of the coronavirus (COVID-19) in China," Journal of Medical Virology.
[66] G. Mesesan, J. Englsberger, G. Garofalo, C. Ott, and A. Albu-Schäffer, "Dynamic Walking on Compliant and Uneven Terrain using DCM and Passivity-based Whole-body Control," in 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), 2019, pp. 25–32.
[67] X. Lin, P. Guo, C. Florensa, and D. Held, "Adaptive Variance for Changing Sparse-Reward Environments," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 3210–3216.
[68] T. Azayev and K. Zimmerman, "Blind Hexapod Locomotion in Complex Terrain with Gait Adaptation Using Deep Reinforcement Learning and Classification," Journal of Intelligent & Robotic Systems, pp. 1–13, 2020.
[69] M. Turchetta, A. Krause, and S. Trimpe, "Robust Model-free Reinforcement Learning with Multi-objective Bayesian Optimization," 2019.
[70] V. R. Konda and J. N. Tsitsiklis, "On Actor-Critic Algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, April 2003.
[71] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS'99. MIT Press, 1999, pp. 1057–1063.
[72] R. Fruit, M. Pirotta, and A. Lazaric, "Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18. Curran Associates Inc., 2018, pp. 2998–3008.
[73] T. Sun, Q. Zhao, and P. B. Luh, "A Rollout Algorithm for Multichain Markov Decision Processes with Average Cost," in Positive Systems. Springer, 2009, pp. 151–162.
[74] O. N. Granichin and V. A. Erofeeva, "Cyclic Stochastic Approximation with Disturbance on Input in the Parameter Tracking Problem based on a Multiagent Algorithm," Automation and Remote Control, vol. 79, no. 6, pp. 1013–1028, 2018.
[75] J. García and F. Fernández, "A Comprehensive Survey on Safe Reinforcement Learning," Journal of Machine Learning Research, vol. 16, no. 42, pp. 1437–1480, 2015.