RL algorithms or rapidly prototype ideas. To address this, we introduce Acme, a tool to simplify the devel-
opment of novel RL algorithms that is specifically designed to enable simple agent implementations that
can be run at various scales of execution. Our aim is also to make the results of various RL algorithms
developed in academia and industrial labs easier to reproduce and extend. To this end we are releasing
baseline implementations of various algorithms, created using our framework. In this work we introduce
the major design decisions behind Acme and show how these are used to construct these baselines. We
also experiment with these agents at different scales of both complexity and computation—including dis-
tributed versions. Ultimately, we show that the design decisions behind Acme lead to agents that can be
scaled both up and down and that, for the most part, greater levels of parallelization result in agents with
equivalent performance, just faster.
1. Introduction
Reinforcement learning (RL) provides an elegant formalization of the problem of intelligence (Russell, 2016;
Sutton and Barto, 2018). In combination with advances in deep learning and computational resources, this
formulation has led to dramatic results in acting from perception (Mnih et al., 2015), game playing (Silver et al.,
2016), and robotics (OpenAI et al., 2018) among others. A central goal of much of this work is to create a general
agent that can learn to achieve goals across a wide range of environments (Legg and Hutter, 2007). In pursuit
of this objective, the scale and complexity of agents developed by the research community has grown over time;
OpenAI Five (Berner et al., 2019) and AlphaStar (Vinyals et al., 2019) serve as just two recent examples of this
phenomenon.
A characteristic of much of recent RL research has been an integrationist perspective on agent design, involving
the combination of various algorithmic components. Agents may incorporate ideas such as intrinsic rewards
and auxiliary tasks (Jaderberg et al., 2017; Jaques et al., 2019), ensembling (Osband et al., 2018), prioritized
experience replay (Schaul et al., 2015), distributional backups (Bellemare et al., 2017), specialised neural network
architectures (Wang et al., 2015), policy improvement search methods (Silver et al., 2018), learning from
demonstrations (Hester et al., 2018; Nair et al., 2018; Gulcehre et al., 2020), variance reduction (Wang et al.,
2016; Schulman et al., 2017; Espeholt et al., 2018), hierarchy (Kulkarni et al., 2016; Vezhnevets et al., 2017) or
meta-learning (Al-Shedivat et al., 2017; Finn et al., 2017; Xu et al., 2018), to name a few examples. This has led to
many state-of-the-art agents incorporating numerous heterogeneous components, contributing to their increased
complexity and to growing concerns about the reproducibility of research (Pineau et al., 2020).
Numerous recent advances in machine learning systems have been attributable to increases in scale along two
principal dimensions: function approximation capacity (number of trainable parameters) and amount of data
(number and quality of training examples). In the context of RL, we focus our discussion on the latter. In contrast to
most supervised and unsupervised learning settings, an RL agent must interact with an environment to generate its
own training data. This motivates interacting with multiple instances of an environment (simulated or otherwise)
in parallel to generate more experience to learn from. This has led to the widespread use of increasingly large-scale
distributed systems in RL agent training (Mnih et al., 2016; Horgan et al., 2018a; Espeholt et al., 2018; Kapturowski et al., 2019). This approach introduces numerous engineering and algorithmic challenges, and relies on significant amounts of infrastructure which can impede the reproducibility of research. It also motivates agent designs that may represent dramatic departures from canonical abstractions laid out in the reinforcement learning literature (Sutton and Barto, 2018). This often means that “scaling up” from a simple, single-process prototype of an algorithm to a full distributed system may require a re-implementation of the agent.

Figure 1 | A simple, high-level illustration of an actor interacting with its environment. Here we illustrate the flow of information between an actor which produces actions and the environment which consumes those actions in order to produce rewards and novel observations.
Acme is a software library and light-weight framework for expressing and training RL agents which attempts
to address both the issues of complexity and scale within a unified framework, allowing for fast iteration of
research ideas and scalable implementation of state-of-the-art agents. Acme does this by providing tools and
components for constructing agents at various levels of abstraction, from the lowest (e.g. networks, losses, policies)
through to workers (actors, learners, replay buffers), and finally entire agents complete with the experimental
apparatus necessary for robust measurement and evaluation, such as training loops, logging, and checkpointing.
The agents written in the framework are state-of-the-art implementations that promote the use of common tools
and components, hopefully leading to common community benchmarks. Our modular design of Acme’s agents
makes them easily scalable to large distributed systems, all while maintaining clear and straightforward abstractions
and simultaneously supporting training in the non-distributed setting.
In what remains of this section, we give a brief overview of modern reinforcement learning and discuss various
software frameworks used to tackle such problems. Section 2 goes on to introduce the key structural contributions
of our approach to designing RL agents. In Section 3 we build upon this design to show how this framework is
used to implement a number of modern agent implementations. Finally, in Section 4 we experiment with these
agents and demonstrate that they can be used to obtain state-of-the-art performance across a variety of domains.
The standard setting for reinforcement learning consists of a learning agent—an entity that perceives and acts—
interacting with an unknown environment in discrete time (Figure 1). An agent is primarily characterized by its
policy π, which maps its experienced history of observations (o_0, ..., o_t) to an action a_t. The functional form of an agent's mapping might, for example, be represented using a feed-forward or recurrent neural network whose inputs include previous observations and actions. Once the agent has acted in the environment it then receives a reward signal r_t, makes an observation o_{t+1}, and this cycle continues. The agent's goal is to maximize an aggregate
of future rewards it expects to receive by acting upon the environment. Note that this definition of the RL problem
is very broad and encompassing. For instance, the environment’s dynamics can be stochastic or deterministic,
stationary or non-stationary, and include other agents. Likewise, the agent can learn from experiences generated
by other behaviour policies (off-policy) or by its own policy (on-policy).
Throughout this work we will also refer to the data generation processes which interact with the environment
as actor processes or more simply as actors. This is in contrast with the concept of learners, i.e. the processes which
consume data in order to update policy parameters, typically by stochastic gradient descent. Classically, these two
processes proceed in lockstep with one another. However, by making this explicit actor/learner distinction, we can
also design agents which consist either of a single actor or many distributed actors which feed data to one or more
learner processes. Overall, any agent interacting within this setting has to master two formidable challenges which
align with these two processes.
First, an agent must explore its environment effectively so as to obtain useful experiences. Second, it has to
learn effectively from these experiences. In online RL, both challenges are attacked simultaneously. As a result,
vast numbers of interactions are often required to learn policies represented as deep neural networks. The need
for data of this magnitude motivates the use of distributed agents as described above with many parallel actors.
This is particularly important in simulated environments and games where massive amounts of experience can be
gathered in a distributed manner and at rates substantially faster than real-time. At the other end of the spectrum
lies offline RL—also known as batch RL—which focuses on the challenge of learning policies from a fixed dataset of
experiences. This situation arises often in settings where online experimentation is impossible or impractical, e.g.
industrial control and healthcare. Frequently, the goal of this setting is to learn a policy that outperforms those
used to generate the dataset of past experiences. Of course there is also a great deal of work in between these two
extremes, which is where most of the work on off-policy agents lies.
Acme is designed to greatly simplify the construction of agents in each of these settings. In Section 2 we
introduce natural modules to the design of agents which correspond to the acting, dataset, and learning components
introduced above. These allow us to tackle simple, classical on- and off-policy agents by combining all of the
above in a synchronous setting. We can also separate the acting and learning components and replicate the actor
processes to arrive at modern, distributed agents. And by removing acting completely and making use of a fixed
dataset we can tackle the offline RL setting directly. Finally, in order to exemplify this split we will also detail in
Section 3 a number of example learning components built using Acme and show how these can be combined to
arrive at different algorithms.
Numerous open-source software libraries and frameworks have been developed in recent years. In this section we
give a brief review of recent examples, and situate Acme within the broader context of similar projects. OpenAI
baselines (Dhariwal et al., 2017) and TF-Agents (Guadarrama et al., 2018) are both examples of established
deep RL frameworks written in TensorFlow 1.X. They both strive to express numerous algorithms in single-process
format. Dopamine (Castro et al., 2018) is a framework focusing on single-process agents in the DQN (Mnih et al.,
2015) family, and various distributional variants including Rainbow (Hessel et al., 2018), and Implicit Quantile
Networks (Dabney et al., 2018). Fiber (Zhi et al., 2020) and Ray (Moritz et al., 2017) are both generic tools
for expressing distributed computations, similar to Launchpad, described below. ReAgent (Gauci et al., 2018) is
primarily aimed at offline/batch RL from large datasets in production settings. SEED RL (Espeholt et al., 2019) is
a highly scalable implementation of IMPALA (Espeholt et al., 2018) that uses batched inference on accelerators
to maximize compute efficiency and throughput. Similarly, TorchBeast (Küttler et al., 2019) is another IMPALA
implementation written in Torch. SURREAL (Fan et al., 2018) expresses continuous control agents in a distributed
training framework. Arena (Song et al., 2019) is targeted at expressing multi-agent reinforcement learning.
The design philosophy behind Acme is to strike a balance between simplicity on the one hand and modularity and scale on the other. This is often a difficult target to hit—it is much easier to lean heavily into one and neglect the other. Instead,
in Acme we have designed a framework and collection of agents that can be easily modified and experimented
with at small scales, or expanded to high levels of data throughput at the other end of the spectrum. While we are
focusing for the moment on releasing the single-process variants of these agents, the design philosophy behind the
large-scale distributed versions remains the same—as we will detail in the following section.
2. Acme
Acme is a library and framework for building readable, efficient, research-oriented reinforcement learning algo-
rithms. At its core Acme is designed to enable simple descriptions of RL agents that can be run at many different
scales of execution. While this usually culminates in running many separate (parallel) acting and learning processes in one large distributed system, we first describe Acme in a simpler, single-process setting where acting and
learning are perfectly synchronized. A key feature of Acme is that the agents can be run in both the single-process
and highly distributed regimes using the exact same modules or building blocks with very limited differences. We achieve this by factoring the code into components that make sense at both ends of the scale. In what remains of this section we will discuss several of these components and how they interact.

Figure 2 | Expansion of the environment loop to display the mechanism by which an actor interacts with the environment. Also shown for comparison is pseudocode describing this interaction as well as the actual implementation of this loop.
One of the core concepts within reinforcement learning is that of the environment with which an agent interacts.
We will assume an environment which maintains its own state and is interacted with sequentially such that taking action a_t produces a tuple (r_t, o_{t+1}, e_{t+1}) consisting of a reward, a new observation, and an end-of-episode indicator. Importantly, note that we have chosen to subscript each reward such that it coincides with the action that produced it. Acme assumes that the environment adheres to the dm_env.Environment interface. However, readers familiar with the dm_env.TimeStep interface may notice that we've deliberately omitted the environmental discount factor to simplify notation, as it often simply takes binary values to signal the end of an episode.
In Acme, the component that interacts most closely with the environment is the actor. At a high level, an
actor consumes observations produced by the environment and produces actions that are in turn fed into the
environment. Then, after observing the ensuing transition, we give the actor an opportunity to update its internal
state; this most often relates to its action-selection policy, but we will return to this point shortly.
The interaction between an actor and its environment is mediated by an environment loop. Custom loops
can easily be implemented but we provide a generic one that meets most of our needs and provides a simple
entry point for interacting with any of the actors or agents implemented within Acme. In Figure 2 we illustrate
this interaction in further detail by expanding the loop (shown earlier in Figure 1) to include the exact interface
by which an actor interacts with its environment. Given an observation o_t we must first be able to evaluate the
actor’s action-selection policy a t = π (ot ), where a t can also represent a sample of some random variable in the
case of a stochastic policy. Once an action is taken the actor must be able to record the reward and subsequent
observation obtained from the environment—e.g. one might insert this data into a replay table or collect an entire
trajectory to be processed at the end of an episode. These two methods are included in our illustration, and show
the life-cycle of a single iteration of an environment loop. This figure also shows the pseudocode and includes
a (slightly simplified) example of Acme’s implementation of this loop. We stress here, however, that while this
formulation is not a new concept—it can be found in any introductory text (e.g. Sutton and Barto, 2018)—it does
serve to highlight one of the key design goals of Acme: wherever possible there should be a one-to-one mapping
between typical RL pseudocode and its implementation.
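To make this mapping concrete, the following is a small sketch of such a loop. It approximates Acme's actual implementation; the environment follows the dm_env interface, and the actor method names (select_action, observe_first, observe, update) mirror the interface described above but are written here for illustration only.

```python
class EnvironmentLoop:
    """A simplified sketch of the actor-environment loop described above."""

    def __init__(self, environment, actor):
        self._environment = environment  # a dm_env.Environment
        self._actor = actor              # exposes select_action/observe/update

    def run_episode(self):
        episode_return = 0.0
        timestep = self._environment.reset()
        self._actor.observe_first(timestep)                    # record o_0
        while not timestep.last():
            # a_t = π(o_t), possibly a sample for a stochastic policy.
            action = self._actor.select_action(timestep.observation)
            timestep = self._environment.step(action)          # yields (r_t, o_{t+1}, e_{t+1})
            self._actor.observe(action, next_timestep=timestep)
            self._actor.update()                               # e.g. pull fresh weights
            episode_return += timestep.reward
        return episode_return
```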
Thus far we’ve focused our attention on components that are relevant for data generation: acting in the environment
and observing ensuing transitions. We now introduce the learner component, which consumes this data in order to
obtain a better policy. This component often contains the bulk of the code relevant to any specific RL algorithm
and, in deep RL, takes the form of optimizing the weights of a neural network to minimize some algorithm-specific
loss(es). More precise mathematical descriptions for a variety of algorithms will be detailed in Section 3. While it
is possible to run a learner without further interaction with the environment (see e.g. Section 2.6), in RL we are
often interested in concurrent learning and acting. Therefore we introduce a special type of actor that includes
both an acting and a learning component; we refer to these as agents to distinguish them from their non-learning
counterparts.
While an agent defers its action selection to its own acting component, its update method, elided from the
previous figure, is where an agent triggers some number of learning steps within its learner component. In contrast,
a generic actor’s update method simply pulls neural network weights from a variable source if it is provided one
at initialization. Since a learner component is a valid variable source, the actor component may query a learner
directly for its latest network weights. This will be particularly relevant when we discuss distributed agents in
Section 2.4.
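As a sketch of this arrangement, the following simplified agent defers action selection to an internal actor and steps its learner inside update(). The min_observations and observations_per_step knobs are illustrative controls over the acting-to-learning ratio, not necessarily Acme's exact constructor arguments.

```python
class Agent:
    """Sketch of a learning agent: an actor whose update() also steps a learner."""

    def __init__(self, actor, learner, min_observations, observations_per_step):
        self._actor = actor
        self._learner = learner
        self._min_observations = min_observations
        self._observations_per_step = observations_per_step
        self._num_observations = 0

    def select_action(self, observation):
        return self._actor.select_action(observation)   # defer acting to the actor

    def observe(self, action, next_timestep):
        self._num_observations += 1
        self._actor.observe(action, next_timestep)       # forwards data to an adder

    def update(self):
        # Only start learning once enough data has been collected, then take one
        # learner step every `observations_per_step` environment steps.
        if self._num_observations < self._min_observations:
            return
        if self._num_observations % self._observations_per_step == 0:
            self._learner.step()
            self._actor.update()   # pull the latest weights from the learner
```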
In Figure 3 we again show the environment loop, where we have expanded the interaction to show the internals
of a learning agent we just described. While redundant, we sometimes use the term learning agent to emphasize
that the agent contains a learner component. The illustration includes the actor and learner components and
depicts how they interact. In particular, the actor pulls weights from the learner component in order to keep its
action-selection up-to-date. Meanwhile, the learner pulls experiences observed by the actor through a dataset,
which is yet another important component.
Note that having a dataset component sit between the actor and learner components is quite a general view that
includes on- and off-policy learning, and experience replay—prioritized or otherwise—depending on how the
dataset is configured. From the learner’s perspective data is provided simply as a stream of sampled mini-batches;
the dataset can be configured to hold on to stale data, and/or the actors can be programmed to add noise to the
learner-specified policy. While we have generally standardized on TensorFlow’s Dataset object to provide efficient
buffering and iteration over data, this does not mandate the use of TensorFlow for the update step implemented
by a learner. The dataset itself is backed by a low-level data storage system called Reverb (Cassirer et al., 2020), which is released concurrently. Reverb can be roughly described as a storage system which enables efficient insertion and routing of items, together with a flexible sampling mechanism that allows first-in-first-out, last-in-first-out, uniform, and weighted sampling schemes.
Acme also provides a simple common interface for insertion into the low-level storage system in the form of
adders. Adders provide add methods which are analogous to the observe methods found on an actor—in fact
most actors’ observations are forwarded directly onto an adder object. These objects exist in order to allow for
different styles of pre-processing and aggregation of observational data that occurs before insertion into the dataset.
For example a given agent implementation might rely on sampling transitions, n -step transitions, sequences
(overlapping or not), or entire episodes—all of which have existing adder implementations in Acme.
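The sketch below illustrates the idea behind an n-step transition adder under the simplifying assumptions of a fixed discount and no episode boundaries; the replay_client.insert call stands in for whatever storage backend is in use and is not Acme's exact API.

```python
import collections

Transition = collections.namedtuple(
    'Transition', ['observation', 'action', 'reward', 'discount', 'next_observation'])


class NStepTransitionAdder:
    """Illustrative n-step transition adder: buffers the last n steps and writes
    (o_t, a_t, sum_{i<n} γ^i r_{t+i}, o_{t+n}) for every step once full, producing
    overlapping n-step transitions. Episode-boundary handling is omitted."""

    def __init__(self, replay_client, n_step, discount):
        self._client = replay_client
        self._n = n_step
        self._discount = discount
        self._buffer = collections.deque(maxlen=n_step)

    def add(self, observation, action, reward, next_observation):
        self._buffer.append((observation, action, reward, next_observation))
        if len(self._buffer) == self._n:     # full buffer: emit one n-step transition
            self._write()

    def _write(self):
        first_obs, first_action = self._buffer[0][0], self._buffer[0][1]
        n_step_return = sum(self._discount ** i * step[2]
                            for i, step in enumerate(self._buffer))
        last_next_obs = self._buffer[-1][3]
        self._client.insert(Transition(first_obs, first_action, n_step_return,
                                       self._discount ** self._n, last_next_obs))
```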
Relying on different adder implementations to carry the workload of adding data once it is observed has also
allowed us to design very general actor modules that support a wide variety of learning agents. While any agent is
able to implement its own internal actor—or indeed bypass its actor component entirely and implement its own acting/observing methods directly, as an agent is an actor in its own right—most agents are able to use one of these standard actors. Actors in Acme predominantly fall into one of two styles: feed-forward and recurrent. As their names suggest, these actors primarily differ in how they maintain state (or do not) between calls to the action-selection method, and the exact form of network used for these actors must be passed in at construction.

Figure 3 | The environment loop expanded to show the internals of a learning agent: an actor and a learner connected through a dataset, with the actor interacting with the environment.
Note that in Acme we have also taken pains to ensure that the communication between different components is
agnostic to the underlying framework (e.g. TensorFlow) used. However, as the actors themselves must interact
directly with this framework we also provide different implementations for both TensorFlow and JAX—and similar
accommodations could be made for other frameworks.
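A minimal feed-forward actor might look as follows; the set_weights and get_variables methods are placeholders for whichever framework-specific mechanism is used to copy parameters from a variable source such as a learner.

```python
class FeedForwardActor:
    """Sketch of a feed-forward actor: evaluates a policy network, forwards
    observations to an adder, and refreshes its weights from a variable source."""

    def __init__(self, policy_network, adder=None, variable_source=None):
        self._policy_network = policy_network   # any callable observation -> action
        self._adder = adder
        self._variable_source = variable_source

    def select_action(self, observation):
        return self._policy_network(observation)

    def observe(self, action, next_timestep):
        if self._adder:
            self._adder.add(action, next_timestep)   # delegate storage to the adder

    def update(self):
        if self._variable_source:
            # Copy the learner's latest weights into the policy network.
            self._policy_network.set_weights(self._variable_source.get_variables())
```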
Given these different modules we can easily construct novel algorithms by varying one or more components.
For example, the distributional deterministic policy gradient method we will introduce later consists of a feed-forward actor, an n-step transition adder, the distributional DPG losses, and either uniform or prioritized sampling under
the dataset. However, easily composing modules in order to create novel agents is not the primary purpose of these
components. Instead, in the next section we will describe how these modules can easily be pulled apart at the
boundaries in order to enable distributed agents that can run at much larger scales.
Up to this point we have primarily described the interaction between an actor—or agent—and its environment
using a simple, synchronous setting. However, a common use case is to generate data asynchronously from the
learning process, often by interacting with multiple environments in parallel (Nair et al., 2015; Mnih et al., 2016;
Horgan et al., 2018a; Barth-Maron et al., 2018; Kapturowski et al., 2019). In Acme we accomplish this by splitting
the acting, learning, and storage components introduced earlier into different threads or processes. This has two
benefits: the first being that environment interactions can occur asynchronously with the learning process, i.e.
we allow the learning process to proceed as quickly as possible regardless of the speed of data gathering. The
other benefit gained by structuring an agent in this way is that by making use of more actors in parallel we can
accelerate the data generation process.
An example of a distributed agent is shown in Figure 4. By examining this figure we can see that this largely
maintains the same structure introduced earlier. However, in the previous section links between different modules
were used merely to indicate function calls. Instead, in the distributed variant each module is launched in its own
process, where the links between different modules are now used to illustrate remote procedure calls (RPC). In the
illustrated example this distributed agent consists of a data storage process, a learner process, and one or more
distributed actors, each with their own environment loop. In order to simplify this construction, we also frequently use an additional sub-module on the actor process: a variable client. This serves purely to allow the actor to poll the learner for variable updates and simplifies the code (which is more cumbersome if the learner is required to push to every actor).

Figure 4 | Example of a distributed, asynchronous agent. In contrast to Figure 3 we have moved the replay and update components into external processes, here designated by grey nodes. We have also replicated the environment loop and replaced the actor itself by a thin proxy actor which pulls parameter updates from a learner.
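A variable client of the kind described above can be sketched as a small helper that polls its source on a background thread; the method names below are assumptions for illustration, not Acme's exact API.

```python
import threading
import time


class VariableClient:
    """Illustrative variable client: periodically polls a (possibly remote)
    variable source for fresh parameters on a background thread."""

    def __init__(self, variable_source, update_period_s=1.0):
        self._source = variable_source
        self._period = update_period_s
        self.params = variable_source.get_variables()
        threading.Thread(target=self._poll, daemon=True).start()

    def _poll(self):
        while True:
            time.sleep(self._period)
            # Over RPC this call looks identical to a local method call.
            self.params = self._source.get_variables()
```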
In this work we focus on describing the single-process variants of Acme agents. As a result, a full description of
the distributed agents is somewhat out of scope. However, here we briefly describe the tool we have developed,
Launchpad, which enables these distributed variants. The agents introduced in the previous section were themselves
composed of different sub-modules, e.g. an actor, learner, and data storage system. In the same way, Launchpad can
be thought of as a mechanism for composing these modules in the distributed setting. Roughly speaking, Launchpad
provides a mechanism for creating a distributed program as a graph consisting of nodes and edges. Nodes exactly
correspond to the modules—represented as class instances as described above—whereas the edges represent a
client/server channel allowing communication between two modules. Once this graph has been constructed the
program can then be launched to start its underlying computation. The key innovation of Launchpad is that it
handles the creation of these edges in such a way that from the perspective of any module there is no distinction
between a local and remote communication, e.g. for an actor retrieving parameters from a learner in both instances
this just looks like a method call.
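The following toy sketch illustrates the node/edge construction just described using plain Python stand-ins; it is not the Launchpad API, only a picture of how the distributed agent of Figure 4 is assembled as a graph.

```python
def make_replay():                     # stand-ins for real module constructors
    return 'replay'

def make_learner(replay):
    return 'learner'

def make_actor_loop(replay, learner):
    return 'actor'


class Program:
    """Toy illustration of a distributed program as a graph of nodes and edges."""

    def __init__(self):
        self.nodes = []

    def add_node(self, constructor, *args):
        # Adding a node returns a handle that other nodes can hold as a client;
        # here the handle is simply a deferred constructor call.
        handle = (constructor, args)
        self.nodes.append(handle)
        return handle


# Distributed agent of Figure 4: one replay node, one learner node, several actors.
program = Program()
replay = program.add_node(make_replay)
learner = program.add_node(make_learner, replay)
actors = [program.add_node(make_actor_loop, replay, learner) for _ in range(4)]
```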
As a result, in what remains of this work we will primarily describe either the individual modules, or their
single-process combinations. Our results, however, will show both single-process and distributed variants and in
both cases the same underlying learning and acting code is being used. We will leave a further detailed description
of Launchpad for later work.
First introduced by Lin (1992), experience replay has since been successfully applied to deep reinforcement
learning (Mnih et al., 2013), allowing agents to reuse previous experiences. In Acme observations are added to the
replay buffer through the actor’s observe() method, which is called with each time step. Batches of transitions
used to train the agent are then sampled from the buffer using the learner’s update() method. Designing a replay
buffer requires careful consideration regarding how to package experience into elementary items, how to sample
these items, and how to remove them when the buffer is full. With Reverb, these features are easily configured, allowing the agent code to focus on what behaviour to use rather than how to achieve it.
In a synchronous learning loop, you may prescribe how many steps of acting in the environment your agent should perform between each learning step. This ratio between acting and learning has a dramatic effect on not only the sample efficiency (the number of environment steps required to reach a given performance) but also the long-term learning performance and stability. The same is true in distributed settings, although there it is more difficult to maintain a fixed ratio. Indeed, if an agent is distributed for the sake of computational efficiency, then it is clearly not desirable to block the learning process while the actor processes are
gathering data. On the other hand, running both processes independently easily results in higher variance. The
variance is often attributable to differences in the computational substrate (e.g. different hardware and network
connectivity) between seeds but pinpointing precise sources can be extremely challenging.
In Acme, these scaling issues are mitigated through the use of Reverb’s RateLimiter. By adopting rate
limitation, one can enforce a desired relative rate of learning to acting, allowing the actor and learner processes to
run unblocked so long as they remain within some defined tolerance of the prescribed rate. In an ideal setting,
both processes are given enough computational resources to run unblocked by the rate limiter. However, if due to network issues, insufficient resources, or otherwise, one of the processes starts lagging behind the other, the rate limiter will block the faster process while the lagging one catches up. While this may waste computational resources by keeping the faster process idle, it does so only for as long as is necessary to ensure the relative rate of learning to acting stays within tolerance.
Indeed the replay buffer is a good place to enforce such a rate. Notice that the first step in a learning process is to sample data from the dataset; similarly, a key step in the environment loop (which runs on the actor processes) is to observe transitions and insert data into the dataset. Since both the learning and acting processes must communicate with the dataset component, it is the natural place to mediate between them: if the learner is sampling experiences too quickly, the buffer blocks sampling requests until the actor catches up; if the actor(s) are inserting experiences too quickly, the buffer blocks insert requests until the learner catches up.
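As an illustration, a replay table with such a rate limiter might be configured roughly as follows using Reverb's Python API; the particular table name and numeric values are arbitrary choices for this sketch.

```python
import reverb

# The SampleToInsertRatio limiter keeps the learner's sampling within
# `error_buffer` of the prescribed 8 samples per insert.
replay_table = reverb.Table(
    name='priority_table',
    sampler=reverb.selectors.Uniform(),      # or Prioritized(...) for PER
    remover=reverb.selectors.Fifo(),         # evict the oldest items when full
    max_size=1_000_000,
    rate_limiter=reverb.rate_limiters.SampleToInsertRatio(
        samples_per_insert=8.0,
        min_size_to_sample=1_000,
        error_buffer=100.0),
)
server = reverb.Server(tables=[replay_table], port=None)
```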
An additional benefit of the structure we have adopted for Acme agents is that it is trivial to apply any learning component to the offline setting, which assumes a fixed dataset of experiences that can be learned from in a purely supervised manner. In Acme this is as simple as applying a given learner module to a given dataset of experiences.
As there is a great deal of overlap between the pure batch setting and off-policy learning, many off-policy agents
perform quite well and/or can be adapted to work when not allowed to interact with the environment. As we
will describe in Section 3, however, agents purpose-built for this setting can have the edge when considering the
underlying distribution of generated data. We defer any further discussion of this use-case to Section 3.7.
3. Agent Implementations
In this section, we describe a set of agents whose implementations we include in the initial release of Acme. Our
intention is for these agents to serve both as clear and succinct reference implementations of their respective RL
algorithms, as well as strong research baselines in their own right. More agents may become available in the future, but we hope that these serve as readable examples for the RL community to base their own designs and implementations on. We will also relate these agents back to the underlying components of Acme used to construct them. However, for most of this section the primary differences between agents will be contained within
the learner, i.e. the module which consumes data and updates the agent’s policy parameters.
In what follows, we will describe the underlying learning algorithms for the agents that we are releasing along
with Acme. While this is not intended as a full tutorial, we will first provide a brief introductory background in order
to keep this section reasonably self-contained. We will then roughly organize our discussion around the following
groupings of agents: temporal difference learning and value-based agents, followed by policy optimization and
actor-critic agents.
The interaction between an agent and its environment can typically be formalized as a (partially observable) Markov decision process ((PO)MDP; Cassandra et al., 1994; Puterman, 1994). This formalism corresponds to the environment loop introduced earlier (see Figure 2). However, for clarity we repeat our earlier discussion: at time t the agent makes an observation o_t given by the environment and selects an action a_t = π(o_t) as determined by its policy. Upon taking this action the environment emits a reward r_t and transitions to a new state, resulting in observation o_{t+1}. This interaction continues indefinitely or until the environment produces an episode termination signal e_{t+1}.
For simplicity we will also denote both deterministic and stochastic policies with π and we will simplify such
policies to depend only on the most recent observation. To fully solve this problem in general partially-observable domains, the policy may be required to depend on the entire history of observations o_{0:t}. To address this, modern implementations frequently take the approach of using a learned summary statistic, i.e. a recurrent state, that is also output by the policy network. Some algorithms we introduce below will make use of this more complicated formulation; however, for the sake of simplicity we will only do so when necessary.
While each algorithm we introduce below consumes data generated by an environment loop as above, one way
in which they differ is in the format that data takes as it is presented to the learner. The simplest—and perhaps most classical—form in which data is exposed is as a transition (o_t, a_t, r_t, o_{t+1}). By collecting an entire sequence of transitions until the termination signal e_t is true we can also form episodes (o_t, a_t, r_t)_{t≥0}, or sub-slices of the episode which we refer to as sequences. As noted above these individual elements will be exposed to the learning algorithm using the interface of a dataset, typically by sampling at random (uniformly or with some probability).
However, we will also see that it is possible to process elements in the order in which they are observed, in which
case the dataset takes the form of a queue. In Acme each of these data formats is handled by a relevant adder
object.
The objective of all the agents we will cover is to maximize some form of its expected return. Although other
aggregation methods are possible, we will focus on the sum of discounted future rewards

    R_t = Σ_{i ≥ 0} γ^i r_{t+i} = r_t + γ R_{t+1},    (1)
where R_t is a random variable that depends on the future trajectory of the agent. While the policy π is probably the most important component of any RL algorithm, for all of the agents we will discuss the state-action value or Q-function is almost as important. In the first set of algorithms we will examine, this will provide either a direct parameterization of the policy or will be indirectly used to optimize the policy. At its heart, this function, in conjunction with the policy π, maps any observation/action pair to the future rewards of taking that action and then following the given policy. This can be written as

    Q^π(o_t, a_t) = E[R_t | o_t, a_t]    (2)
                  = r_t + γ E_π[ Q^π(o_{t+1}, a_{t+1}) | o_t, a_t ],    (3)
where by returning to the recursive definition of the returns given in (1) we arrive at the celebrated Bellman
equation. This definition allows us to start with an arbitrary approximation to Q π and repeatedly, recursively
improve upon that estimate. Under certain regularity conditions (see Sutton and Barto (2018) for more details)
we can combine this update with updates to the policy in order to obtain the optimal Q -function and hence the
optimal policy. In the next section we will first discuss algorithms for which the value function and the underlying
policy are one-and-the-same.
While exactly computing Q^π using the Bellman equation is not possible when the underlying environment dynamics are unknown (and must be sampled), we can instead empirically approximate this function using observed data. This is commonly addressed in deep RL by using a neural network Q_ϕ to approximate Q^π. In order to optimize the parameters of this network, it is common to introduce a bootstrap target

    y^{π,ϕ′} = r_t + γ Q_{ϕ′}(o_{t+1}, π(o_{t+1})).    (4)
This roughly corresponds to one step of the backup operation introduced in the recursive Bellman equation.
As has been common practice since Mnih et al. (2015), these targets use an identical network—dubbed the target network—with “stale” parameters ϕ′. The online network (so-called in order to distinguish it from the target network) has parameters ϕ and can be fit to this bootstrap target by minimizing the squared temporal difference (TD) error

    L(ϕ) = E[ ( y^{π,ϕ′} − Q_ϕ(o, a) )^2 ].    (5)

The expectation above is taken with respect to transitions (o, a, r, o′) generated by the agent's policy; in practice this typically means that the loss is formed empirically from samples taken from a replay buffer filled by the actor process(es). Note that as the learner trains the value function Q_ϕ, it periodically communicates these weights to the actor. We next consider value-based agents, whose greedy deterministic policies are directly derived from the Q-function.
Deep Q-Networks. The first algorithm we consider is that of Deep Q-Networks (DQN) (Mnih et al., 2013,
2015). Equipped with the Q-function and facing an observation o, a greedy actor can simply select the action that maximizes its value. This indirectly defines the policy as π(o) = arg max_a Q(o, a), where typically actions a
are restricted to a finite, integer-valued set. By plugging this policy into Equation (5) we arrive at the loss used
by DQN, where in this case the bootstrap target is simply a function of the sampled transition and the target parameters ϕ′ (this is a point that will become more important shortly). In order to optimize this loss, DQN fills a replay buffer with transitions generated in an ϵ-greedy manner (i.e. with probability ϵ we generate actions purely at random). This buffer is then sampled uniformly at random to form a minibatch of sampled transitions which are
then used to perform stochastic gradient descent on the given loss.
Note that DQN is an off-policy algorithm, meaning that it is capable of learning from data generated off-policy, i.e. from a policy separate from the one it is optimizing. This is in contrast to on-policy algorithms, for which
data generation uses the policy being optimized. While most of the algorithms that we will consider are off-policy,
we will note variations from this norm as necessary.
In our implementation of DQN, and following in the spirit of Rainbow DQN (Hessel et al., 2018), we also include a number of recent enhancements. The first of these is the use of Double Q-learning (van Hasselt et al., 2015) to combat over-estimation of the Q-function. While subtle, this corresponds to a modification of the bootstrap target to

    y_{ϕ,ϕ′} = r_t + γ Q_{ϕ′}(o_{t+1}, arg max_a Q_ϕ(o_{t+1}, a)).    (6)

Note that here we have also been careful with the subscripted parameters in use by each component. We see that the policy selecting the action uses the online network weights ϕ while the network used to evaluate said action uses the target weights ϕ′. Importantly, we do not allow gradients to flow through the bootstrap target, and since the target now depends on ϕ it is necessary to employ a stop gradient. This was not necessary with the simpler DQN variant, where the target only depended on ϕ′.
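The following NumPy sketch computes this double Q-learning target for a batch of transitions; in a real TensorFlow or JAX learner the target would additionally be wrapped in a stop-gradient as discussed above.

```python
import numpy as np

def double_q_target(q_online_next, q_target_next, rewards, discount):
    """Double Q-learning bootstrap target, as in Equation (6).

    q_online_next: (batch, num_actions) online-network values Q_ϕ(o_{t+1}, ·).
    q_target_next: (batch, num_actions) target-network values Q_{ϕ'}(o_{t+1}, ·).
    rewards:       (batch,) rewards r_t.
    """
    best_actions = np.argmax(q_online_next, axis=1)    # action selection: online net
    bootstrap = q_target_next[np.arange(len(best_actions)), best_actions]  # evaluation: target net
    # In TF/JAX this return value would be passed through tf.stop_gradient or
    # jax.lax.stop_gradient so no gradient flows through the target.
    return rewards + discount * bootstrap
```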
Our implementation also makes use of n-step targets to fit the agent's value function—this allows the algorithm to use longer sequences of the observed reward signal when bootstrapping. In order to enable this the actor must store overlapping n-step transitions of the form (o_t, a_t, Σ_{i<n} γ^i r_{t+i}, o_{t+n}) rather than a single transition. In this setting the target becomes

    y_{ϕ,ϕ′} = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q_{ϕ′}(o_{t+n}, π_ϕ(o_{t+n})),    (7)
and in Acme there is an adder built expressly for this purpose. The use of n-step transitions is also useful computationally because they are functionally equivalent to single-step transitions—which we can see by letting n equal one—and use the same amount of storage. This is a standard improvement we will see in use by many algorithms whose original variants were based on transitions. Additional enhancements to our implementation include duelling networks (Wang et al., 2015) and prioritized experience replay (Schaul et al., 2015), wherein priorities are used to sample transitions from replay proportional to their TD error rather than uniformly. Finally, the distributed variant, which is not yet open-sourced, resembles that of Ape-X DQN (Horgan et al., 2018b), with each actor running its own ϵ-greedy behavior policy and ϵ drawn from a log-uniform distribution.
Recurrent DQN. The recurrent replay distributed DQN (R2D2) algorithm (Kapturowski et al., 2019) further extends the work of Ape-X DQN by making use of a recurrent network—in particular by incorporating an additional LSTM layer in the Q-network. This results in a value function of the form Q_ϕ(o, a, s) which additionally takes a recurrent state s that must also be initialized at the beginning of an episode. The change to a recurrent network leads to several other modifications to the Q-learning process. First, rather than learning from transitions, R2D2 instead relies on full n-step sequences of the form (o_{t:t+n}, a_{t:t+n+1}, r_{t:t+n−1}), using strided double Q-learning over these fixed-length sequences. In Acme this is accomplished simply by making use of a sequence adder as opposed to the transition adder used by DQN. Additionally, these sequences are sampled from replay with priorities given by a convex combination of their mean and maximum absolute TD-errors. R2D2 also makes use of a transformed loss, introduced in (Pohlen et al., 2018), rather than using clipped rewards.
Finally, in order to learn the value function, R2D2 also requires a sequence of recurrent states s_{t:t+n}. This can prove problematic for sequences which do not begin from the initial state. In order to solve this problem R2D2 stores the recurrent states in the replay buffer alongside the sequences of observations, but with this solution comes another problem: stale recurrent states, i.e. states which are different from those currently being generated by the network. This problem is solved by storing such sequences with an extra “burn-in” period at the beginning, during which no learning is done and which is used only to initialize the recurrent state.
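Schematically, the burn-in scheme can be expressed as follows; unroll_fn and loss_fn are placeholders for the recurrent Q-network unroll and the sequence loss, not Acme functions, and sequence.start_state denotes the (possibly stale) recurrent state stored in replay.

```python
def r2d2_sequence_loss(params, sequence, burn_in, unroll_fn, loss_fn):
    """Sketch of the burn-in scheme described above."""
    # Burn-in: unroll over the first `burn_in` steps purely to refresh the
    # recurrent state; no loss (and, in practice, no gradient) is taken here.
    _, state = unroll_fn(params, sequence.start_state,
                         sequence.observations[:burn_in])
    # Learn only on the remainder of the sequence, starting from the refreshed state.
    q_values, _ = unroll_fn(params, state, sequence.observations[burn_in:])
    return loss_fn(q_values,
                   sequence.actions[burn_in:],
                   sequence.rewards[burn_in:])
```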
Note that while we will also present results on the distributed variant of this algorithm, the version released is
not distributed. However, for simplicity we will still refer to it as R2D2—a convention we will maintain for other
algorithms introduced later which were originally published under a “distributed” moniker.
The agents we have discussed thus far have relied upon the value function Q_ϕ to indirectly parameterize their policies, i.e. by selecting the value-maximizing action for a given observation. We will now discuss methods for which the policy itself is directly parameterized by weights θ and denoted π_θ. In doing so we will introduce the so-called actor-critic learning paradigm, wherein a value function (hereafter referred to as the critic) is learned in tandem with the policy and is used to define the policy's loss.
Typically, actor-critic methods rely on the same TD error minimization introduced in the previous section to optimize the critic. However, while different agents optimize different policy losses, these are generally derived from a common goal, namely to maximize the expected return J(θ) given the value-function Q^{π_θ} associated with this policy. By making use of the policy gradient theorem (Sutton et al., 2000), an unbiased estimate of the policy gradient can be written as

    ∇J(θ) = E[ Q^{π_θ}(o, a) ∇_θ log π_θ(a|o) ],    (9)

where o, a and their ensuing return are sampled from the experiences of π_θ. Though unbiased, these gradients can exhibit high variance. A common strategy to attack this variance is to subtract a baseline that is independent of the action.
The most common baseline is the state value function V^{π_θ}(o) = E_{a∼π_θ}[ Q^{π_θ}(o, a) ]. Unlike the state-action value, V^{π_θ} estimates the expected return starting from the observation o and acting according to π_θ thereafter—it integrates over all actions according to π_θ rather than allowing for an initial deviation. The difference between these two value estimates, i.e. Q^{π_θ}(o, a) − V^{π_θ}(o), is known as the advantage, and by introducing this baseline we arrive at a family of advantage actor-critic algorithms (Mnih et al., 2016; Espeholt et al., 2018) using the following policy gradient

    ∇J(θ) = E[ ( Q^{π_θ}(o, a) − V^{π_θ}(o) ) ∇_θ log π_θ(a|o) ].    (10)

Algorithms in this family primarily differ in how they estimate this advantage; see Schulman et al. (2016) for a few examples.
Importance Weighted Actor-Learner Architecture (IMPALA). IMPALA (Espeholt et al., 2018) is a distributed
advantage actor-critic agent which makes use of the loss introduced above. In order to do so it estimates the
state-action value function using Monte Carlo rollouts and produces an estimate Vϕ of the state value function by
minimizing a variant of the TD error. As a result, IMPALA is very close to an on-policy algorithm and is able to
make use of very long sequences. The catch is that IMPALA is designed to work in the distributed setting where
the behavior policy generating this data may not exactly match the policy being evaluated.
In order to counteract this off-policy bias, a typical approach is to employ an importance sampling correction.
Given a trajectory generated by a behaviour policy π_b, the importance sampling ratio is the ratio between the probability of the action under the target policy π_θ and the behaviour policy: ρ_t = π_θ(a_t|o_t) / π_b(a_t|o_t). This ratio allows us to adjust the probabilities for any generated action, reweighting them as if they were actually sampled according to the target policy π_θ. Although this allows us to have an unbiased estimate of the gradients, it suffers from high
variance. To compensate for this, IMPALA introduces a V-trace correction which further truncates the importance
sampling weights. This has the effect of cutting a trajectory once it becomes too far off-policy. This results in a
recursive definition of the Bellman update at step t which can be rewritten for the state value function as
    v_t = V_ϕ(o_t) + Σ_{k=t}^{t+n−1} γ^{k−t} ( Π_{i=t}^{k−1} c_i ) ρ_k ( r_k + γ V_ϕ(o_{k+1}) − V_ϕ(o_k) ).    (11)
We can see that the final term is simply the TD error for step k and c_t = min(1, ρ_t) is the truncated importance sampling weight. As with DQN, the critic for IMPALA is updated to minimize the squared TD error, i.e. the difference between the current estimate V_ϕ and the updated version v_t,

    L(ϕ) = E[ ( v_t − V_ϕ(o_t) )^2 ].    (12)
To update the policy, IMPALA follows the entropy-regularized policy gradient under this advantage estimate. For a single step t this gradient corresponds to the first term of the following:

    ∇J(θ) ≈ E[ ∇_θ log π_θ(a_t|o_t) ( r_t + γ v_{t+1} − V_ϕ(o_t) ) − β ∇_θ Σ_a π_θ(a|o_t) log π_θ(a|o_t) ].    (13)
The additional entropy regularization term is introduced to prevent instability caused by the policy moving too
quickly and is a common technique used to stabilize policy gradient methods. Off-policy correction and slowing
down the policy for stability are the two key ingredients of existing powerful actor-critic agents, including Actor-
Critic with Experience Replay (ACER) by Wang et al. (2016), Trust Region Policy Optimization (TRPO) by Schulman
et al. (2015) and Proximal Policy Optimization (PPO) by Schulman et al. (2017).
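The V-trace targets of Equation (11) can be computed with a simple backward recursion, as in the NumPy sketch below, which follows the simplified truncation c_t = min(1, ρ_t) used in the text.

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, discount):
    """Compute the v_t targets of Equation (11) via the equivalent recursion
       v_t = V(o_t) + δ_t + γ c_t (v_{t+1} − V(o_{t+1})),
       δ_t = ρ_t (r_t + γ V(o_{t+1}) − V(o_t)),   c_t = min(1, ρ_t).

    values:  (T+1,) value estimates V_ϕ(o_t) for t = 0..T.
    rewards: (T,)   rewards r_t.
    rhos:    (T,)   importance ratios π_θ(a_t|o_t) / π_b(a_t|o_t).
    """
    T = len(rewards)
    cs = np.minimum(1.0, rhos)
    deltas = rhos * (rewards + discount * values[1:] - values[:-1])
    vs = np.zeros(T + 1)
    vs[T] = values[T]                      # bootstrap from the final value estimate
    for t in reversed(range(T)):
        vs[t] = values[t] + deltas[t] + discount * cs[t] * (vs[t + 1] - values[t + 1])
    return vs[:-1]
```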
We now turn to a collection of agents where the actions taken by the agent are real valued and therefore continuous. The set of agents we consider are also actor-critic methods, but in this case making use of the state-action value function. While this is not strictly necessary, in this setting taking an arg max over values is often not tractable, so actor-critic learning is an attractive alternative. The following learning algorithms alternate gradient steps optimizing the critic and policy losses. As with the DQN agent, these maintain target parameters ϕ′ and θ′ which are periodically copied from their online counterparts for training stability. Meanwhile one or more actors will fill the replay buffer from which samples are taken in order to optimize the associated loss function. Departing from the DQN work introduced earlier, these actors make use of exploration noise given by a Gaussian of fixed scale (rather than an ϵ-greedy strategy).
Deep Deterministic Policy Gradient (DDPG). Unlike IMPALA where the policy is stochastic with discrete actions,
DDPG (Lillicrap et al., 2016) uses a deterministic policy with continuous actions. Our implementation employs the
standard n-step TD loss in Equation (5) to train the critic Q_ϕ. However, because of the deterministic policy, Silver et al. (2014) derived a new policy gradient theorem resulting in the following:

    ∇J(θ) = E_o[ ∇_θ π_θ(o) ∇_a Q_ϕ(o, a)|_{a=π_θ(o)} ].    (14)
As noted above, our DDPG implementation roughly follows the same strategy as that of DQN given above, with a replaced learner and Gaussian exploration noise; i.e. the mechanism by which data is added is equivalent. Our implementation uses uniform sampling from replay, as we have found that prioritization provides minimal (if any) benefit. We will follow this same strategy for the remaining continuous control algorithms, for which the only changes necessary come to their learning strategy and losses.
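A convenient way to see Equation (14) in code is to minimize −E[Q_ϕ(o, π_θ(o))] and let automatic differentiation do the rest; the toy linear networks in the JAX sketch below are purely illustrative.

```python
import jax
import jax.numpy as jnp

def policy_network(theta, obs):
    """Toy deterministic policy a = π_θ(o): a single linear layer."""
    return obs @ theta

def critic_network(phi, obs, act):
    """Toy critic Q_ϕ(o, a): a bilinear form, for illustration only."""
    return jnp.sum((obs @ phi) * act, axis=-1)

def ddpg_policy_loss(theta, phi, observations):
    """Minimizing -E[Q_ϕ(o, π_θ(o))] and differentiating through the critic
    recovers the deterministic policy gradient of Equation (14)."""
    actions = policy_network(theta, observations)
    return -jnp.mean(critic_network(phi, observations, actions))

# ∇_θ of this loss is exactly -E_o[∇_θ π_θ(o) ∇_a Q_ϕ(o, a)|_{a=π_θ(o)}].
policy_grad = jax.grad(ddpg_policy_loss, argnums=0)
```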
Maximum a posteriori Policy Optimization (MPO). Introduced by Abdolmaleki et al. (2018), the MPO algo-
rithm takes a two-pronged approach to policy optimization, inspired by the classical expectation-maximization
(EM) algorithm. Because of this particular approach, its expected return takes the following peculiar form:

    J_{η,α}(θ) = E_o E_{a∼π_{θ′}}[ exp( Q_{ϕ′}(o, a) / η ) log π_θ(a|o) ] + α [ ϵ − E_o D_KL( π_{θ′}(·|o) ‖ π_θ(·|o) ) ],    (15)

which includes a Kullback–Leibler (KL) divergence regularization that targets a hyperparameter ϵ. This regularization makes sure (i) that the online policy π_θ does not move too far from the target network π_{θ′} and (ii) that the online policy keeps adapting if necessary. Finally, η and α are not hyperparameters; they are dual variables with losses of their own and are adapted automatically.
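For intuition, the sketch below shows the two steps for a single observation: exponentiated-Q weights over sampled actions (the E-step) and a weighted log-likelihood objective for the policy fit (the M-step). The KL term handled by α is omitted, and all shapes and names are illustrative rather than taken from Acme's implementation.

```python
import numpy as np

def mpo_weighted_likelihood(q_values, log_probs, eta):
    """q_values: (num_samples,) values Q_{ϕ'}(o, a_j) for actions a_j sampled
    from the target policy π_{θ'}; log_probs: (num_samples,) log π_θ(a_j|o)."""
    # E-step: non-parametric improved policy, with weights proportional to exp(Q/η).
    weights = np.exp(q_values / eta)
    weights /= weights.sum()
    # M-step: fit π_θ to the re-weighted actions by maximizing the weighted
    # log-likelihood (the KL regularization of Eq. (15) is omitted for brevity).
    return np.sum(weights * log_probs)
```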
Distributional critics, D4PG, and DMPO. Introduced by Bellemare et al. (2017), the C51 agent's critic Z_ϕ estimates the distribution over returns, in contrast to the critics described so far, which only estimate the expected return. Naturally, we therefore have Q_ϕ(o, a) = E[Z_ϕ(o, a)]. The D4PG (Barth-Maron et al., 2018) and our novel DMPO agent are adaptations of DDPG and MPO, respectively, to use this distributional critic. Their policy losses remain almost unchanged—the critic simply needs to be averaged. For instance, the D4PG policy gradient becomes:

    ∇_θ J(θ) = E_o[ ∇_θ π_θ(o) ∇_a E[Z_ϕ(o, a)]|_{a=π_θ(o)} ].    (16)
Meanwhile the critic loss has to be changed from (5) to take into account the fact that we are now working with distributions:

    L(ϕ) = E_τ[ H( Y_{ϕ′}(τ), Z_ϕ(o, a) ) ],    (17)
    Y_{ϕ′}(τ) = Π( r + γ Z_{ϕ′}(o′, π_{θ′}(o′)) ),    (18)

where H(·, ·) represents the cross-entropy. The projection Π is needed since the bootstrap target is now a distribution, denoted by a capital Y_{ϕ′}, and it must be projected onto the fixed support of the critic Z_ϕ before the cross-entropy can be computed.
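The projection Π onto a fixed categorical support can be implemented as in the following NumPy sketch (a standard C51-style projection, written for clarity rather than speed).

```python
import numpy as np

def categorical_projection(support, target_probs, rewards, discounts):
    """Project the target distribution r + γ·Z onto the fixed support.

    support:      (num_atoms,) fixed support of the critic.
    target_probs: (batch, num_atoms) probabilities of Z_{ϕ'}(o', π_{θ'}(o')).
    rewards:      (batch,) rewards r; discounts: (batch,) discounts γ (zero at episode end).
    Returns the projected probabilities Y_{ϕ'} of shape (batch, num_atoms)."""
    v_min, v_max = support[0], support[-1]
    delta_z = support[1] - support[0]
    # Shift and scale each atom, then clip to the support.
    tz = np.clip(rewards[:, None] + discounts[:, None] * support[None, :], v_min, v_max)
    b = (tz - v_min) / delta_z                 # fractional atom index
    lower, upper = np.floor(b), np.ceil(b)
    projected = np.zeros_like(target_probs)
    for i in range(target_probs.shape[0]):
        for j in range(target_probs.shape[1]):
            l, u = int(lower[i, j]), int(upper[i, j])
            if l == u:                          # landed exactly on an atom
                projected[i, l] += target_probs[i, j]
            else:                               # split mass between neighbours
                projected[i, l] += target_probs[i, j] * (upper[i, j] - b[i, j])
                projected[i, u] += target_probs[i, j] * (b[i, j] - lower[i, j])
    return projected
```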
This agent implements planning with an environment model (learned or otherwise), with search guided by policy
and value networks. This can be thought of as a scaled-down and simplified version of the AlphaZero algorithm
(Silver et al., 2018), optionally coupled with an environment transition model, similar to the set-up in (van Hasselt
et al., 2019). We learn a value function V_ϕ via TD-learning and a policy π_θ via imitation of a model-based MCTS policy π_MCTS, whose search is guided by the model-free policy π_θ:

    a_search = arg max_a [ Q(o_t, a) + β ( √N(o_t) / (N(o_t, a) + 1) ) π_θ(a | o_t) ],    (19)

where N is a simple visit-count for a given node in the tree, and β is the UCT hyperparameter controlling how greedy the search should be with respect to action-value estimates. When doing tree search roll-outs, we truncate at some fixed search depth and bootstrap from our model-free value estimate V_ϕ. The agent's final policy is a softmax with respect to value over the children of the root node of the search tree, and the model-free policy loss trains π_θ to imitate this search policy.
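The search-time action selection of Equation (19) amounts to computing one score per action at a node, as in this small sketch.

```python
import numpy as np

def select_search_action(q, visit_counts, prior, beta):
    """Select an action during tree search, following Equation (19).

    q:            (num_actions,) action-value estimates Q(o_t, ·) at this node.
    visit_counts: (num_actions,) visit counts N(o_t, ·); their sum is N(o_t).
    prior:        (num_actions,) model-free prior π_θ(·|o_t).
    beta:         UCT constant trading off exploration against exploitation."""
    total_visits = visit_counts.sum()
    scores = q + beta * np.sqrt(total_visits) / (visit_counts + 1) * prior
    return int(np.argmax(scores))
```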
When a task requires a long sequence of correct actions before obtaining any reward from the environment, it is
said to be a hard exploration task. Exploration in such sparse reward tasks is particularly challenging due to the
problems of reward discovery and credit assignment. Promising recent attempts at overcoming exploration in such
tasks leverage demonstrations of successful trajectories (Kim et al., 2013; Hester et al., 2017; Vecerik et al., 2017;
Nair et al., 2018; Gulcehre et al., 2020), an approach dubbed RL with Expert Demonstrations (RLED; Piot et al., 2014). In
the RLED framework, an expert teacher (e.g. a human or a scripted agent) demonstrates the task using the same
action space. The resulting trajectories are stored in an expert replay buffer, which is then interleaved with the
learning agent’s own experiences in its minibatches. These agents must therefore necessarily learn off-policy from
a replay buffer, which in practice—and in our case—is often prioritized (Schaul et al., 2015). Applying the RLED approach to DQN and R2D2 produces the DQfD and R2D3 agents, respectively (Hester et al., 2017; Gulcehre
et al., 2020).
In situations where interactions with the environment are expensive or dangerous, such as robotics, self-driving
cars, and health-care, it is not possible to directly use RL algorithms, which gather data while they learn—classically
known as online learning. As a result, there has been increasing interest in methods for learning policies from logged data, known as offline or batch RL (Lagoudakis and Parr, 2003; Lange et al., 2012). Recently,
these methods have produced promising results in simple domains and there are ongoing research efforts to scale
these algorithms to more challenging problems (Fujimoto et al., 2018; Agarwal et al., 2019; Cabi et al., 2019).
As mentioned earlier, one of the advantages of the modularity of Acme agents is that any of our agents can be
used in the offline setting by simply providing our learner components with a fixed dataset of experience. Indeed,
given a dataset, the learning algorithm is independent of the data-generation process, which makes it amenable to
both online and offline learning. In this case, there are no actors in the distributed system and only the dataset
and learner remain. However, it may be helpful to keep an evaluator process (an actor that does not add its
experience to replay) to continually quantify the performance of the learning algorithm. This does not necessarily
mean that a learning process designed for online, or even off-policy, data will perform well in the purely offline
setting; often, we see that methods purpose-built to account for the distribution of the data perform better.
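As a concrete illustration of this offline configuration, the sketch below keeps only a learner consuming a fixed dataset plus an occasional evaluation episode; the learner.step() and evaluator.run_episode() calls are placeholder names rather than a prescribed Acme API.

```python
# Minimal sketch of offline training with Acme-style components: a learner
# consumes minibatches from a fixed dataset, and an evaluator actor (which
# never writes to replay) periodically measures performance. All names here
# are illustrative placeholders.

def run_offline(learner, evaluator, num_learner_steps, eval_every=1_000):
  returns = []
  for step in range(num_learner_steps):
    learner.step()  # One gradient update on a minibatch from the fixed dataset.
    if (step + 1) % eval_every == 0:
      # The evaluator interacts with the environment only to measure returns;
      # its experience is never added back into the training dataset.
      returns.append(evaluator.run_episode())
  return returns
```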
One notable learning algorithm that serves as a baseline for all offline methods is Behaviour Cloning (BC;
Pomerleau, 1989; Michie, 1993). This is the simplest form of imitation learning: the agent learns to mimic the
demonstrations by learning a mapping from observations to actions via supervised learning. While BC can be quite
competitive when the dataset is large and includes high-quality demonstrations, many interesting applications do
not have that luxury.
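For concreteness, a behaviour-cloning update reduces to a supervised regression (or classification, for discrete actions) of the policy output onto the demonstrated action. The sketch below shows a mean-squared-error loss for continuous actions; policy_fn and the array shapes are placeholders.

```python
# Minimal behaviour cloning sketch for continuous actions: regress the
# policy output onto demonstrated actions with a mean-squared-error loss.
# `policy_fn` is any differentiable mapping from observations to actions;
# all names are placeholders.
import numpy as np

def bc_loss(policy_fn, observations: np.ndarray, expert_actions: np.ndarray) -> float:
  predicted = policy_fn(observations)  # Shape [batch, action_dim].
  return float(np.mean((predicted - expert_actions) ** 2))
```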
4. Experiments
Here, we present the performance of several Acme agents on various simulated environments. We have put significant
effort into implementing readable, modular, and scalable agents, and we present these results to demonstrate that
this has not come at the expense of performance. Indeed, our agents achieve returns that are comparable to the
published state-of-the-art. In presenting these results, we are less interested in comparing agents against one
another than in comparing performance across distributed scales. In addition to performance plots, we also study
the effect of rate limitation on both sample efficiency and walltime. Before diving into the results of our
benchmarks, let us briefly introduce the relevant environments and measurements that we report on.
4.1. Environments
DeepMind Control suite. The DeepMind Control Suite (Tassa et al., 2018) provides a set of continuous control
tasks in MuJoCo (Todorov et al., 2012) and has been widely used as a benchmark to assess performance of
continuous control algorithms. The tasks range from simple control problems with a single degree of freedom
(DOF), such as the cartpole and pendulum, to the control of complex multi-joint bodies such as the humanoid
(21 DOF). We consider two variants of these tasks: learning from raw features or from pixels. When learning from
raw features, the observations are scalars representing positions and velocities; these vary in size from 3 to 137
dimensions depending on the task. When learning from pixels, the observations are stacks of 3
consecutive RGB images of size 72 × 96, stacked along the channel axis so that the agents can learn to use dynamic
information such as velocity and acceleration. We clustered these tasks into 4 categories according to the complexity
of control (trivial, easy, medium, and hard), running each for a sufficient number of actor steps.
Arcade Learning Environment. The Arcade Learning Environment (ALE) (Bellemare et al., 2013) provides
a simulator for Atari 2600 games. ALE is one of the most commonly used benchmark environments for RL
research. The action space ranges from 4 to 18 discrete actions (joystick controls) depending on the game.
The observation space consists of 210 × 160 RGB images. We use a representative subset of ten Atari games to
assess the performance of our discrete-action agents. We also apply several standard pre-processing steps to the
Atari frames, including a zero discount on life loss, action repeats with frame pooling, greyscaling, rescaling to
84 × 84, reward clipping, and observation stacking, following Mnih et al. (2015).
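As an illustration of this pipeline, the sketch below implements the greyscaling, rescaling, frame pooling, frame stacking, and reward clipping steps using plain numpy; the helper names are placeholders, and the released code relies on standard environment wrappers rather than this exact code.

```python
# Illustrative sketch of standard Atari frame pre-processing (frame pooling,
# greyscaling, rescaling to 84x84, frame stacking, reward clipping).
# All function and class names here are hypothetical.
import collections
import numpy as np


def to_grey(rgb_frame: np.ndarray) -> np.ndarray:
  """Converts a (210, 160, 3) uint8 RGB frame to greyscale via luminance."""
  return (rgb_frame @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)


def resize_nearest(frame: np.ndarray, size=(84, 84)) -> np.ndarray:
  """Nearest-neighbour resize; real implementations typically use bilinear."""
  rows = (np.arange(size[0]) * frame.shape[0] / size[0]).astype(int)
  cols = (np.arange(size[1]) * frame.shape[1] / size[1]).astype(int)
  return frame[rows][:, cols]


def preprocess(frame_t: np.ndarray, frame_tm1: np.ndarray) -> np.ndarray:
  """Max-pools two consecutive frames (removes flicker), greys and resizes."""
  pooled = np.maximum(frame_t, frame_tm1)
  return resize_nearest(to_grey(pooled))


class FrameStacker:
  """Keeps the last `k` processed frames, stacked along the channel axis."""

  def __init__(self, k: int = 4):
    self._frames = collections.deque(maxlen=k)

  def append(self, frame: np.ndarray) -> np.ndarray:
    self._frames.append(frame)
    while len(self._frames) < self._frames.maxlen:
      self._frames.append(frame)  # Pad at episode start.
    return np.stack(self._frames, axis=-1)  # Shape (84, 84, k).


clipped_reward = lambda r: float(np.clip(r, -1.0, 1.0))
```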
Behaviour Suite. The Behaviour Suite for Reinforcement Learning (bsuite) (Osband et al., 2020) provides
a collection of experiments that investigate the core capabilities of RL agents. These tasks are intended to be
challenging but simple and interpretable tests of specific axes of capability for a given RL agent. Each experiment
studies the scalability and robustness of agents in various types of tasks, such as exploration, memory, generalization,
and robustness to noise. This provides a powerful experimental apparatus for testing hypotheses about the
capabilities of various RL algorithms and agents. For example, we will see later that agents with recurrence
(e.g. R2D2) are the only ones able to perform properly on tasks that require memory. While this conjecture can
readily be made from an algorithmic perspective, bsuite allows us to quantify the gains brought by such an
algorithmic component.
4.2. Measurements
All our agents (including the single-process ones, for fair walltime comparison) are equipped with a background
evaluator node that periodically queries the learner for policy network weights and evaluates them by running an
episode in the environment and logging the observed episode return. For all tasks in the control suite, an episode
corresponds to 1,000 steps and the per-step reward lies in [0, 1], so 1,000 is the theoretical limit on the episode
return. There is no such common episode-return scale across Atari levels, but we do limit the maximum episode
length on the evaluator to 108,000 environment steps, as is common.
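A sketch of such a background evaluator loop is shown below; the environment follows the dm_env interface, while policy.set_weights, learner.get_weights, and logger.write are placeholder names rather than the exact interfaces in the codebase.

```python
# Sketch of a background evaluator loop: periodically fetch the latest policy
# weights from the learner, run one full episode, and log its return.
# The policy, learner and logger interfaces are placeholders.

def evaluator_loop(environment, policy, learner, logger):
  while True:
    policy.set_weights(learner.get_weights())  # Sync with the latest learner state.
    timestep = environment.reset()
    episode_return = 0.0
    while not timestep.last():                 # dm_env-style episode termination.
      action = policy.select_action(timestep.observation)
      timestep = environment.step(action)
      episode_return += timestep.reward
    logger.write({'episode_return': episode_return})
```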
Sample efficiency and actor steps. In many applications, sample efficiency is a critical obstacle to an RL
algorithm's viability. For example, stepping the environment could involve a computationally expensive physics
simulator, or a real robot that is slow and suffers wear and tear. In our possibly distributed agents, the most
accurate measure of environment interactions during training is what we refer to as actor steps, to distinguish
them from evaluator steps. In the Atari environments, since we repeat actions 4 times, as is now standard, our
measurement of actor steps is exactly one quarter of the number of environment frames, which is the more common
measure in the ALE literature.
Speed and learner walltime. Of course in some applications the environment may be a cheap simulator, of
which any number of copies can be quickly spun up and driven in parallel by distributed actors. Though sample
efficiency may still come into play, in these cases one is often more interested in how quickly results can be obtained
by increasing the number of distributed actors. Unfortunately, simply measuring walltime via timestamps is a rather
noisy way of keeping time, as uncontrollable factors can lead to vastly different results. Indeed, distributed
computation on a shared cloud service brings its own challenges in the form of interrupted processes and/or
communication, but in the following benchmarks we are interested in the performance of distributed agents
on dedicated hardware. In order to simulate dedicated hardware, we accumulate time within the learner from
immediately after the first learner step; we refer to this measure of time as the learner walltime. Since the learner
is checkpointed along with the networks, this timekeeping persists through interruptions on the learner.
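This bookkeeping can be sketched as follows: elapsed time is accumulated from immediately after the first learner step onward, and the accumulated value is stored with the checkpoint so that interruptions do not inflate the measurement; the class and method names are illustrative.

```python
# Sketch of learner-walltime accounting. The counter accumulates time from
# immediately after the first learner step onward and is intended to be
# saved/restored with the learner checkpoint. Names are illustrative.
import time


class LearnerWalltime:

  def __init__(self, accumulated: float = 0.0):
    self.accumulated = accumulated  # Restored from the checkpoint on restart.
    self._last = None               # Timestamp of the previous reading.

  def update(self) -> float:
    """Call once per learner step; returns total learner walltime so far."""
    now = time.monotonic()
    if self._last is not None:      # Skip everything up to and including step 1.
      self.accumulated += now - self._last
    self._last = now
    return self.accumulated
```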
We begin by reporting our results on the control suite from features. For practical reasons, in this section we
highlight a challenging subset of tasks, but a comprehensive set of benchmarks can be found in the appendix.
Figure 5 | Performance of control agents on the control suite from raw features. Comparing the single process
agents. Depicted curves represent rolling averages over 10 seeds. Performance as measured by episode return with
respect to actor steps.
Figure 6 | Performance of D4PG on the control suite from raw features. Comparing the single-process and two
distributed variants (2, 4 actors). Depicted curves represent rolling averages over 10 seeds. Top: Sample efficiency
as measured by episode return with respect to actor steps. Bottom: Learning speed as measured by episode return
with respect to learner walltime.
Figure 5 shows that our released single-process agents achieve state-of-the-art performance. Note that we have
made reasonable efforts to tune these agents so that their defaults are suited for solving control suite tasks from
features.
Let us focus on one of the agents from Figure 5, say D4PG, and let us overlay the performance of the exact same
code running in a distributed setup with 2 and 4 actors, resulting in Figure 6. The top half of the figure shows
that, with the exception of manipulator:bring_ball (a very challenging task), all variants of the agent achieve
nearly identical results when measured against actor steps. Indeed this is the rate limiter’s function. Meanwhile,
the bottom half of the figure shows the exact same training curves plotted against learner walltime; these show
the benefit of distributed acting. Together these provide compelling evidence that rate limitation is a powerful
tool for reproducibility and for the fair comparison of agents across disparate computational distribution strategies.
Next we compare the performance of our D4PG agent while varying the relative rate of learning to acting. (Note that
a similar analysis could easily be carried out for any of our agents.) The rate is measured as the number of samples
drawn from replay per insert into the replay buffer: the samples-per-insert (SPI) ratio. For reference, the default
D4PG agent uses a batch size of 256 and an SPI of 32; since each gradient update consumes a batch of 256 samples,
this corresponds to one gradient update every 256/32 = 8 acting steps. These results strongly suggest that setting
this relative rate of learning to acting can dramatically affect sample efficiency and performance.
Figure 7 | Sensitivity analysis of the effect of rate limitation on the sample efficiency of D4PG. All curves correspond
to D4PG with 4 distributed actors and represent rolling averages over 10 seeds. See Figure 18 in the appendix for
results on more tasks.
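As a rough illustration of how such a samples-per-insert constraint can be enforced, the toy limiter below blocks whichever side (inserting or sampling) is running ahead of the target ratio. It is a simplified stand-in for the rate limiters provided by Reverb (Cassirer et al., 2020), with invented names; with samples_per_insert=32 and a batch size of 256, the learner's 256-sample updates would be permitted at a rate of one per 8 inserted transitions.

```python
# Toy samples-per-insert (SPI) limiter: each side is allowed to proceed only
# while the realised sample/insert ratio stays within a tolerance of the
# target. A real limiter would also enforce a minimum buffer size before
# sampling; all names here are invented for illustration.

class SpiLimiter:

  def __init__(self, samples_per_insert: float, tolerance: float):
    self.spi = samples_per_insert
    self.tol = tolerance
    self.inserts = 0
    self.samples = 0

  def can_insert(self) -> bool:
    # Block the actors if the learner has fallen too far behind the ratio.
    return self.samples + self.tol >= self.inserts * self.spi

  def can_sample(self) -> bool:
    # Block the learner if it is consuming data faster than the target ratio.
    return self.samples + 1 <= self.inserts * self.spi + self.tol

  def insert(self):
    self.inserts += 1

  def sample(self):
    self.samples += 1
```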
Figure 7 shows a representative subset of our experiments. Overall we notice a trend: lower SPI ratios lead to more
wasteful agents, in the sense that they need many more interactions with the environment to reach the same
performance. However, we can also distinguish three distinct effects on the results depending on the task.
On the acrobot:swingup task, the higher the ratio the better—more learning per acting leads to a more sample
efficient agent. Meanwhile on the humanoid tasks, this is not the case. Indeed, we see that our chosen default of
32 SPI seems to be the most efficient choice in the long run. Though the 128 SPI agent learns fastest to begin with,
it seems to have settled on a sub-optimal solution; whereas the rate of 512 SPI is clearly unstable.
Finally, notice that with no rate limitation at all, performance is similar to our default of 32 SPI. This is
due to our choice of number of distributed actors (4) in this experiment, which proved to be an efficient use of
computational resources.
We would like to investigate this phenomenon further in the future and, hopefully, thanks to the agents and
tools we are releasing, the community can help. Our current hypothesis is as follows. In the acrobot tasks,
it is possible to accurately evaluate the value of a state-action pair independent of the policy used thereafter;
therefore with a reasonable number of transitions, the task can be effectively learned offline and certainly off-policy.
Meanwhile, in the humanoid tasks, due to the complex composition of joints and the temporally extended nature
of the tasks, accurately evaluating a state-action pair (which is the critic’s function) is tightly coupled with the
policy being evaluated. For instance, in humanoid:run, moving a humanoid’s ankle to let its body fall forward is
only good if a hip flexor will later bring the opposite leg forward to support the falling body.
Off-policy experience and the role of replay buffer size. What complicates matters further is the role of the
replay buffer capacity: the larger the buffer, the more stale the experience it contains. Therefore, combining a
large replay buffer with a large SPI ratio leads to very off-policy learning. At the opposite end of the two spectra,
with a relatively small replay buffer and SPI ratio, learning can be nearly on-policy. In order to simplify this
analysis, in all these experiments we kept the replay buffer fixed at a large size of 1 million items (in this case
an item is an n-step transition).
We shift our attention to learning control suite tasks from pixel observations. In this setting we only ran D4PG and
only on the trivial, easy, and medium tasks. Remarkably, by simply adding a residual network (ResNet) torso to
the exact same network architecture as above, many complex tasks were learned without any additional tuning.
Naturally, there is still a lot of room for improvement as many of the tasks are not learned. Here we highlight a few
tasks on which D4PG performed very well; the full set of results can be found in Figures 19–20 in the appendix.
Once again we see that, when measured with respect to actor steps, performance and sample efficiency are very
similar across the variants of the agent: single-process and distributed with 2, 8, and 16 actors. However, when
measured with respect to the learner's walltime, the advantage of having more actors is clear. In fact the benefits
are more dramatic than in Figure 6, a consequence of the added computational load on each individual actor, due
to the environment rendering each frame and the additional ResNet forward pass required to process it. This
experiment shows that parallelism helps accelerate training, but that the same performance can still be achieved
with fewer actors and more time.
Figure 8 | Performance of D4PG on the control suite from pixel observations. Comparing the single-process and
three distributed variants (2, 8, 16 actors). Depicted curves represent rolling averages over 10 seeds. Top: Sample
efficiency as measured by episode return with respect to actor steps. Bottom: Learning speed as measured by episode
return with respect to learner walltime.
4.6. Atari
We evaluate the performance of DQN, R2D2 and IMPALA agents trained on 10 individual games of varying difficulty
over 3 seeds. We use 256 actors in all cases. We have performed agent-specific hyperparameter tuning; for each
agent the hyperparameters are the same across all games and are set as the default parameters in Acme.
Figure 9 | Comparison of performance (sample complexity) of DQN, R2D2 and IMPALA on a subset of Atari
tasks. Here we plot rolling averages across 3 seeds as measured by episode return with respect to actor steps.
Note the qualitatively different learning traits for each algorithm: IMPALA typically learns quickly but is prone
to instabilities; in contrast R2D2 learns slowly but typically attains higher final performance with less variance;
finally DQN, being a feed-forward agent, tends to get “off the ground” faster than R2D2 but performance plateaus
at a much lower level.
We report performance measured by episode return with respect to actor steps in Figure 9. Each actor step
results in 4 environment steps here, as we set the action repeat to 4 for all agents. All agents are trained for
approximately the same duration. Note that for IMPALA we use the deep architecture proposed by Espeholt
et al. (2018), which is more costly at runtime and hence performs fewer updates. We also measure episode return
with respect to training time, which can be found in Figure 21 in the appendix.
4.7. bsuite
We evaluate the performance of DQN, R2D2, IMPALA and MCTS agents on the bsuite benchmark. We aggregate
the performance of these agents across all tasks shown in the “radar plot” in Figure 10. We did not tune the
hyperparameters and believe these results can be improved with careful tuning. In particular, IMPALA performs
poorly in bandit-like domains without extra tuning due to instabilities arising from short sequences. In these
experiments, MCTS has access to a perfect simulator for all tasks.
Figure 10 | Comparing aggregate performance of DQN, R2D2, MCTS and IMPALA on bsuite. Note (a) in this
experiment MCTS has access to a perfect simulator; (b) IMPALA performs poorly in bandit-like domains without
extra hyperparameter tuning, due to instabilities arising from short sequences.
Figure 11 | Comparing aggregate performance of DQN, DQfD, R2D2 and R2D3 on bsuite. DQfD and R2D3 have
access to demonstrations for the exploration tasks, so this comparison is only meant to quantify how well these
algorithms can leverage that additional information.
To verify our DQfD and R2D3 agents, we test them on bsuite (Figure 11). We only include demonstrations for
the exploration tasks. These demonstrations were generated using the optimal policy, which has knowledge of
the action mapping of the environment (see code for details). For Deep Sea, one demonstration is sufficient.
For Deep Sea Stochastic, because of the stochastic nature of the environment, we need more demonstrations: we
generate num_demos = environment size * 10. Because the optimal policy does not always solve the task in this
environment, we include 80% successful trajectories and 20% unsuccessful trajectories.
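This recipe amounts to a few lines of bookkeeping; the sketch below uses hypothetical names for the trajectory lists.

```python
# Illustrative construction of the Deep Sea Stochastic demonstration set:
# num_demos = environment_size * 10, with an 80/20 split of successful and
# unsuccessful optimal-policy trajectories. All names are placeholders.

def build_demos(environment_size, successful_runs, unsuccessful_runs):
  num_demos = environment_size * 10
  num_success = int(0.8 * num_demos)
  demos = successful_runs[:num_success]
  demos += unsuccessful_runs[:num_demos - num_success]
  return demos
```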
4.9. Offline RL
We provide results on control-from-features and Atari offline RL datasets. The details of those datasets will be
available in an upcoming offline RL benchmark release, RL Unplugged. RL Unplugged will include a diverse set of
challenging, Acme-compatible offline RL datasets along with a proper evaluation method for each dataset. In this
paper, we focus on the easier version of the control dataset and only a small subset of the Atari games that will
be released with RL Unplugged. Here, we are only interested in showing that it is possible to run offline RL agents
with the Acme infrastructure; RL Unplugged will provide a more in-depth analysis of these and other tasks.
Control from features. In this section we briefly show the results of the BC and D4PG algorithms used in the offline
setting on two control suite tasks, cartpole:swingup and fish:swim, from feature observations (similar to Sec. 4.3).
To generate the dataset, we ran D4PG in online mode with three random seeds until convergence (we call the
resulting policy the data generation policy), collected all the data experienced by these three runs (which includes
low-quality data, e.g. from the beginning of training), and subsampled this data, leaving 200 transitions for the
cartpole:swingup environment and 8,000 transitions for the fish:swim environment. This data was then assembled
into a dataset and used to train BC and D4PG agents from scratch in offline mode, i.e. without access to the
environment during training (but still using the environment for evaluation). D4PG was able to almost match the
performance of the data generation policy on both domains, but in the cartpole:swingup experiment its performance
decreased after a while, probably due to overfitting to the small dataset (Fig. 12).
Figure 12 | Comparing performance of BC and D4PG in an offline setting with a fixed dataset as measured by
episode return with respect to learner steps. The performance of the data generation policy is shown with the
dashed horizontal line.
Atari. Atari has been an established benchmark for offline RL (Agarwal et al., 2019; Fujimoto et al., 2019).
Here, we report results on nine Atari games to show that our offline Acme DQN agent can match the performance
of the best behaviour policy that generated the data during its online training. We use the Atari offline RL dataset
generated by Agarwal et al. (2019) and train a Double DQN (van Hasselt et al., 2015) with the Adam optimizer
(Kingma and Ba, 2014). The nine games span a range of difficulty, as determined by the performance of the online
agent that generated the datasets: RoadRunner, IceHockey, Zaxxon, and DemonAttack are easy games; BeamRider,
MsPacman, Robotank, and Pooyan are of medium difficulty; and DoubleDunk is a hard game. We trained our agents
for 5 million learner steps with minibatches of size 256 and report the results in Figure 13. Our results closely
match those reported by Agarwal et al. (2019) for offline DQN on most games with the same network architecture.
Figure 13 | Comparing performance of DQN in an offline setting with a fixed dataset as measured by episode
return with respect to learner steps. The best performance over the course of training of the online policy that
generated the data is shown with the dashed horizontal line.
5. Conclusion
In this work we have introduced Acme, a modular light-weight framework that supports scalable and fast iteration
of research ideas in RL. Acme naturally supports both single-actor and distributed training paradigms and provides
a variety of agent baselines with state-of-the-art performance. In this release we are focusing on the single-actor
setting, which is more in line with the standard needs of the academic community. However, this work also describes
the design decisions behind Acme that enable the same agent components to be used in both settings. Although we
provide results for both settings, we also show that the same results can be obtained using the single-process
agents enabled by our framework.
By providing these tools, we hope that Acme will help improve the status of reproducibility in RL, and empower
the academic research community with simple building blocks to create new RL agents. Additionally, our baselines
should provide additional yardsticks to measure progress in the field. We are excited to share Acme with the
research community and look forward to contributions from everyone, as well as feedback to keep improving and
extending Acme.
Acknowledgements
We’d like to thank Yannis Assael and Vaggelis Margaritis for a great deal of graphical feedback on this work. We’d
also like to thank Jackie Kay, David Budden, and Siqi Liu for help on an earlier version of this codebase.
References
Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a
posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
Agarwal, R., Schuurmans, D., and Norouzi, M. (2019). An optimistic perspective on offline reinforcement learning.
NeurIPS Deep Reinforcement Learning Workshop.
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. (2017). Continuous adaptation
via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641.
Cassirer, A., Barth-Maron, G., Kroiss, M., and Brevdo, E. (2020). Reverb: An efficient data storage and
transport system for ML research. https://github.com/deepmind/reverb. [Online; accessed 01-June-2020].
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap,
T. P. (2018). Distributed distributional deterministic policy gradients. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In
Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, ICML
2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages
449–458. PMLR.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation
platform for general agents. J. Artif. Intell. Res., 47:253–279.
Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse,
C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T.,
Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. (2019). Dota 2 with large
scale deep reinforcement learning. CoRR, abs/1912.06680.
Cabi, S., Gómez Colmenarejo, S., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zolna, K., Aytar, Y., Budden,
D., Vecerik, M., Sushkov, O., Barker, D., Scholz, J., Denil, M., de Freitas, N., and Wang, Z. (2019). Scaling
data-driven robotics with reward sketching and batch reinforcement learning. In Robotics, Science and Systems.
Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting optimally in partially observable stochastic
domains. In AAAI, volume 94, pages 1023–1028.
Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. (2018). Dopamine: A research framework
for deep reinforcement learning. arXiv preprint arXiv:1812.06110.
Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018). Implicit quantile networks for distributional reinforce-
ment learning. In Dy, J. G. and Krause, A., editors, Proceedings of the 35th International Conference on Machine
Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of
Machine Learning Research, pages 1104–1113. PMLR.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and
Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines.
Espeholt, L., Marinier, R., Stanczyk, P., Wang, K., and Michalski, M. (2019). SEED RL: Scalable and efficient deep-RL
with accelerated central inference.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning,
I., Legg, S., and Kavukcuoglu, K. (2018). IMPALA: scalable distributed deep-rl with importance weighted
actor-learner architectures. In Dy, J. G. and Krause, A., editors, Proceedings of the 35th International Conference
on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of
Proceedings of Machine Learning Research, pages 1406–1415. PMLR.
Fan, L., Zhu, Y., Zhu, J., Liu, Z., Zeng, O., Gupta, A., Creus-Costa, J., Savarese, S., and Fei-Fei, L. (2018). Surreal:
Open-source reinforcement learning framework and robot manipulation benchmark. In Conference on Robot
Learning.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In
International Conference on Machine Learning, pages 1126–1135.
Fujimoto, S., Conti, E., Ghavamzadeh, M., and Pineau, J. (2019). Benchmarking batch deep reinforcement learning
algorithms. arXiv preprint arXiv:1910.01708.
Fujimoto, S., Meger, D., and Precup, D. (2018). Off-policy deep reinforcement learning without exploration. arXiv
preprint arXiv:1812.02900.
Gauci, J., Conti, E., Liang, Y., Virochsiri, K., Chen, Z., He, Y., Kaden, Z., Narayanan, V., and Ye, X. (2018). Horizon:
Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260.
Gulcehre, C., Paine, T. L., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz,
N., Williams, D., et al. (2020). Making efficient use of demonstrations to solve hard exploration problems. In 8th
International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26 - May 1, 2020,
Conference Track Proceedings.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and
Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In McIlraith, S. A. and
Weinberger, K. Q., editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18),
the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational
Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3215–3222.
AAAI Press.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband,
I., et al. (2018). Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I.,
Agapiou, J., Leibo, J. Z., and Gruslys, A. (2017). Learning from demonstrations for real world reinforcement
learning. CoRR, abs/1704.03732.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018a). Distributed
prioritized experience replay. In International Conference on Learning Representations.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018b). Distributed
prioritized experience replay.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforce-
ment learning with unsupervised auxiliary tasks. In 5th International Conference on Learning Representations,
ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P., Strouse, D., Leibo, J. Z., and De Freitas, N. (2019).
Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference
on Machine Learning, pages 3040–3049.
Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J., and Munos, R. (2019). Recurrent experience replay in
distributed reinforcement learning. In International Conference on Learning Representations.
Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. (2013). Learning from limited demonstrations. In Advances
in Neural Information Processing Systems, pages 2859–2867.
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning:
Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems,
pages 3675–3683.
Küttler, H., Nardelli, N., Lavril, T., Selvatici, M., Sivakumar, V., Rocktäschel, T., and Grefenstette, E. (2019).
TorchBeast: A PyTorch platform for distributed RL. arXiv preprint arXiv:1910.03552.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of machine learning research,
4(Dec):1107–1149.
Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages
45–73. Springer.
Legg, S. and Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines,
17(4):391–444.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous
control with deep reinforcement learning. In Bengio, Y. and LeCun, Y., editors, 4th International Conference on
Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine
learning, 8(3-4):293–321.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement learning. In Balcan, M. F. and Weinberger, K. Q., editors,
Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine
Learning Research, pages 1928–1937, New York, New York, USA. PMLR.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing
Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature,
518(7540):529.
Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Paul, W., Jordan, M. I., and Stoica, I. (2017).
Ray: A distributed framework for emerging AI applications. CoRR, abs/1712.05889.
Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Overcoming exploration in
reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, pages
6292–6299.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M.,
Beattie, C., Petersen, S., et al. (2015). Massively parallel methods for deep reinforcement learning. arXiv preprint
arXiv:1507.04296.
OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J. W., Pachocki, J., Petron,
A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W.
(2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning.
In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page
8626–8638, Red Hook, NY, USA. Curran Associates Inc.
Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C.,
Singh, S., Roy, B. V., Sutton, R. S., Silver, D., and van Hasselt, H. (2020). Behaviour suite for reinforcement
learning. In International Conference on Representation Learning.
Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Larochelle, H.
(2020). Improving reproducibility in machine learning research (A report from the neurips 2019 reproducibility
program). CoRR, abs/2003.12206.
Piot, B., Geist, M., and Pietquin, O. (2014). Boosted Bellman residual minimization handling expert demonstrations.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 549–564. Springer.
Pohlen, T., Piot, B., Hester, T., Azar, M. G., Horgan, D., Budden, D., Barth-Maron, G., van Hasselt, H., Quan, J.,
Vecerík, M., Hessel, M., Munos, R., and Pietquin, O. (2018). Observe and look further: Achieving consistent
performance on atari. CoRR, abs/1805.11593.
Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. In NIPS, pages 305–313.
Puterman, M. (1994). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Russell, S. (2016). Rationality and intelligence: A brief update. In Fundamental issues of artificial intelligence,
pages 7–28.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In
International conference on machine learning, pages 1889–1897.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using
generalized advantage estimation. In International Conference on Learning Representations.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree
search. Nature, 529(7587):484–489.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D.,
Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2018). A general reinforcement learning algorithm
that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient
algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 21-26 June 2014,
Beijing, China.
Song, Y., Wang, J., Lukasiewicz, T., Xu, Z., Xu, M., Ding, Z., and Wu, L. (2019). Arena: A general evaluation
platform and building toolkit for multi-agent intelligence. arXiv preprint arXiv:1905.08085.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement
learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq,
A., Lillicrap, T. P., and Riedmiller, M. A. (2018). Deepmind control suite. CoRR, abs/1801.00690.
Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461.
van Hasselt, H., Hessel, M., and Aslanides, J. (2019). When to use parametric models in reinforcement learning?
In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R., editors, Advances
in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019,
NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 14322–14333.
Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller,
M. A. (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse
rewards. CoRR, abs/1707.08817.
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal
networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine
Learning - Volume 70, ICML’17, page 3540–3549. JMLR.org.
Vinyals, O., Babuschkin, I., Czarnecki, W., Mathieu, M., Dudzik, A., Chung, J., Choi, D., Powell, R., Ewalds, T.,
Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J., Jaderberg, M.,
and Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient
actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2015). Dueling network
architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
Xu, Z., van Hasselt, H. P., and Silver, D. (2018). Meta-gradient reinforcement learning. In Advances in Neural
Information Processing Systems, pages 2396–2407.
Zhi, J., Wang, R., Clune, J., and Stanley, K. O. (2020). Fiber: A platform for efficient development and distributed
training for reinforcement learning and population-based methods.
Appendix
Figure 14 | Comparison of single process agent performance as measured by episode return with respect to actor
steps.
Figure 15 | Comparison of distributed agent performance as measured by episode return with respect to actor
steps.
Figure 16 | Comparison of D4PG performance with different number of actors as measured by episode return with
respect to actor steps.
Figure 17 | Comparison of D4PG performance with different number of actors as measured by episode return with
respect to learner walltime.
Figure 18 | Sensitivity analysis of the effect of rate limitation on the sample efficiency and performance of D4PG.
All curves correspond to D4PG with 4 distributed actors. Episode returns are averaged over 10 seeds.
Figure 19 | Comparison of D4PG performance with pixel observations with different number of actors as measured
by episode return with respect to actor steps.
Figure 20 | Comparison of D4PG performance with pixel observations with different number of actors as measured
by episode return with respect to learner walltime.
Figure 21 | Comparing performance of DQN, R2D2 and IMPALA on Atari tasks as measured by episode return with
respect to training time.