SHUO SUN, RUNDONG WANG, and BO AN, Nanyang Technological University, Singapore
Quantitative trading (QT), which refers to the use of mathematical models and data-driven techniques in
analyzing the financial market, has been a popular topic in both academia and the financial industry since the 1970s.
In the last decade, reinforcement learning (RL) has garnered significant interest in many domains such as
robotics and video games, owing to its outstanding ability to solve complex sequential decision-making
problems. RL’s impact is pervasive, and it has recently demonstrated its ability to tackle many challenging QT tasks.
Exploring the potential of RL techniques on QT tasks is a flourishing research direction. This paper aims to
provide a comprehensive survey of research efforts on RL-based methods for QT tasks. More concretely,
we devise a taxonomy of RL-based QT models, along with a comprehensive summary of the state of the art.
Finally, we discuss current challenges and propose future research directions in this exciting field.
Additional Key Words and Phrases: Reinforcement learning, quantitative finance, stock market, survey
Quantitative trading (QT) is a type of market strategy that relies on mathematical and statistical
models to automatically identify investment opportunities [19]. With the advent of the AI age, it
has become more popular and accounts for more than 70% and 40% of trading volume in developed
markets (e.g., U.S.) and developing markets (e.g., China), respectively. In general, QT research can
be divided into two directions. In the finance community, designing theories and models to understand
and explain the financial market is the main focus. The famous capital asset pricing model
(CAPM) [115], Almgren-Chriss model [2], Markowitz portfolio theory [89], and Fama & French
factor model [38] are a few representative examples. On the other hand, computer scientists apply
data-driven machine learning (ML) techniques to analyze financial data [32, 104]. Recently, deep
learning [72] has become an appealing approach owing not only to its stellar performance but also
to its attractive property of learning meaningful representations from scratch.
Reinforcement Learning for Quantitative Trading 44:3
focus of Tesfatsion and Judd [130]. There are also many surveys on deep learning for finance. Wong
and Selvi [145] covered early works. Sezer et al. [114] is a recent survey with a focus on
financial time series forecasting. Ozbayoglu et al. [102] surveyed the development of DL
in financial applications. Fischer [43] presented a brief review of RL methods in the financial market.
Hambly et al. [53] presented a comprehensive survey on recent advances of RL in finance.
Considering the increasing popularity and potential of RL-based QT applications, a comprehensive
survey will be of high scientific and practical value. More than 100 high-quality papers are
shortlisted and categorized in this survey. Furthermore, we analyze the current situation of this
area and point out future research directions.
Notation          Description
$h$               length of a holding period
$\mathbf{p}_i$    the time series vector of asset $i$’s price
$p_{i,t}$         the price of asset $i$ at time $t$
$p'_{i,t}$        the price of asset $i$ after a holding period $h$ from time $t$
$p_t$             the price of a single asset at time $t$
$s_t$             position of an asset at time $t$
$u_t^i$           trading volume of asset $i$ at time $t$
$\mathbf{n}$      the time series vector of net value
$n_t$             net value at time $t$
$n'_t$            net value after a holding period $h$ from time $t$
$w_t^i$           portfolio weight of asset $i$ at time $t$
$\mathbf{w}_t$    portfolio vector at time $t$
$\mathbf{w}'_t$   portfolio vector after a holding period $h$ from time $t$
$v_t$             portfolio value at time $t$
$v'_t$            portfolio value after a holding period $h$ from time $t$
$f_t^i$           transaction fee for asset $i$ at time $t$
$\xi$             transaction fee rate
$q$               the quantity of a limit order
$Q$               total quantity required to be executed
$\mathbf{r}$      the time series vector of return rate
$r_t$             return rate at time $t$
2.1 Overview
The financial market, an ecosystem involving transactions between businesses and investors, had
a global market capitalization exceeding $90 trillion as of 2020.¹ For many countries,
the financial industry has become a paramount pillar, giving rise to many financial
centres. The International Monetary Fund (IMF) categorizes financial centres as follows:
international financial centres, such as New York, London, and Tokyo; regional financial centres,
such as Shanghai, Shenzhen, and Sydney; and offshore financial centres, such as Hong Kong, Singapore,
and Dublin. At the core of financial centres are trading exchanges, where trading activities
involving trillions of dollars take place every day. Trading exchanges can be divided
into stock exchanges such as NYSE, Nasdaq, and Euronext; derivatives exchanges such as CME; and
cryptocurrency exchanges such as Coinbase and Huobi. Participants in the financial market can be
generally categorized as financial intermediaries (e.g., banks and brokers), issuers (e.g., companies
and governments), institutional investors (e.g., investment managers and hedge funds), and individual
investors. With the development of electronic trading platforms, quantitative trading, which has
been demonstrated to be quite profitable by many leading trading companies (e.g., Renaissance, Two
Sigma, Citadel, D.E. Shaw), is becoming a dominant trading style in global financial markets.
In 2020, quantitative trading accounted for over 70% and 40% of trading volume in developed markets
(e.g., the U.S. and Europe) and emerging markets (e.g., China and India), respectively.² We introduce
some basic QT concepts as follows:
• Financial Asset. A financial asset refers to a liquid asset, which can be converted into
cash immediately during trading time. Classic financial assets include stocks, futures, bonds,
foreign exchanges, and cryptocurrencies.
• Holding Period. Holding period h refers to the time period where traders just hold the
financial assets without any buying or selling actions.
• Asset Price. The price of a financial asset $i$ is defined as a time series
$\mathbf{p}_i = \{p_{i,1}, p_{i,2}, p_{i,3}, \dots, p_{i,t}\}$, where $p_{i,t}$ denotes the price of asset $i$ at time $t$.
$p'_{i,t}$ is the price of asset $i$ after a holding period $h$ from time $t$. $p_t$ is used to denote the
price at time $t$ when there is only one financial asset.
• OHLC. OHLC is the abbreviation of open price, high price, low price, and close price.
The candlestick, which consists of OHLC, is widely used to analyze the financial market.
• Volume. Volume is the amount of a financial asset that changes hands. $u_t^i$ is the trading
volume of asset $i$ at time $t$.
• Technical Indicator. A technical indicator indicates a feature calculated by a formulaic
combination of OHLC and volume. Technical indicators are usually designed by finance
experts to uncover the underlying pattern of the financial market.
• Return Rate. Return rate is the percentage change of capital, where $r_t = (p_{t+1} - p_t)/p_t$
denotes the return rate at time $t$. The time series of return rate is denoted as $\mathbf{r} = (r_1, r_2, \dots, r_t)$.
• Transaction Fee. Transaction fee is the expense incurred when trading financial assets:
$f_t^i = p_{i,t} \times u_t^i \times \xi$, where $\xi$ is the transaction fee rate.
• Liquidity. Liquidity refers to the efficiency with which a financial asset can be converted
into cash without having an evident impact on its market price. Cash itself is the asset with
the most liquidity.
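The return rate and transaction fee definitions above can be illustrated with a minimal Python sketch. The function names and the sample prices, volume, and fee rate are hypothetical values chosen for illustration and do not come from the survey.

```python
from typing import List


def return_rates(prices: List[float]) -> List[float]:
    """Return r_t = (p_{t+1} - p_t) / p_t for each consecutive pair of prices."""
    return [(nxt - cur) / cur for cur, nxt in zip(prices, prices[1:])]


def transaction_fee(price: float, volume: float, fee_rate: float) -> float:
    """Return f_t^i = p_{i,t} * u_t^i * xi."""
    return price * volume * fee_rate


# Hypothetical single-asset price series p_1, p_2, p_3.
prices = [100.0, 110.0, 99.0]
rates = return_rates(prices)

# Hypothetical trade: 20 units at price 110 with a 0.1% fee rate.
fee = transaction_fee(price=110.0, volume=20.0, fee_rate=0.001)
```

A series of t prices yields t - 1 return rates, since each rate needs the next price.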
Series Momentum [95], and Cross Sectional Momentum [20] are three classic momentum strategies.
In contrast, mean reversion strategies such as Bollinger bands [13] assume the price of financial
assets will finally revert to the long-term mean. Although traditional methods capture some of the
underlying patterns of the financial market, these simple rule-based methods exhibit
limited generalization ability across different market conditions. We introduce some basic AT concepts
as follows:
• Position. Position $s_t$ is the amount of a financial asset owned by traders at time $t$. It represents
a long (short) position when $s_t$ is positive (negative).
• Long Position. A long position makes a positive profit when the price of the asset increases.
For long trading actions, which buy a financial asset $i$ at time $t$ first and then sell it at $t+1$,
the profit is $u_t^i (p_{i,t+1} - p_{i,t})$, where $u_t^i$ is the buying volume of asset $i$ at time $t$.
• Short Position. A short position makes a positive profit when the price of the asset decreases.
For short trading actions, which sell a financial asset $i$ at time $t$ first and then buy it back at $t+1$,
the profit is $u_t^i (p_{i,t} - p_{i,t+1})$, where $u_t^i$ is the selling volume of asset $i$ at time $t$.
• Net Value. Net value represents a fund’s per-share value. It is defined as a time series
$\mathbf{n} = \{n_1, n_2, \dots, n_t\}$, where $n_t$ denotes the net value at time $t$. The initial net value is always
set to 1.
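As a rough illustration of the position concepts above, the following Python sketch computes long and short profits and builds a net value series. Compounding the net value by each period's return rate is an assumption made here for illustration, since the text only states that the initial net value is 1. All names and numbers are hypothetical.

```python
from typing import List


def long_profit(volume: float, p_t: float, p_next: float) -> float:
    """Profit of buying `volume` units at p_t and selling them at p_next."""
    return volume * (p_next - p_t)


def short_profit(volume: float, p_t: float, p_next: float) -> float:
    """Profit of selling `volume` units at p_t and buying them back at p_next."""
    return volume * (p_t - p_next)


def net_value_series(returns: List[float]) -> List[float]:
    """Start at 1 and compound by each period's return rate (assumed rule)."""
    net = [1.0]
    for r in returns:
        net.append(net[-1] * (1.0 + r))
    return net


# Hypothetical price move from 100 to 105 with a volume of 10 units.
gain_long = long_profit(10.0, 100.0, 105.0)
gain_short = short_profit(10.0, 100.0, 105.0)

# Hypothetical fund returns of +10% and then -50%.
net_values = net_value_series([0.1, -0.5])
```

The long and short profits are mirror images of each other for the same price move and volume.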
[25]; Follow-the-Winner approaches such as Exponential Gradient (EG) [54] and Winner [45];
Follow-the-Loser approaches such as Robust Mean Reversion (RMR) [59], Passive Aggressive
Mean Reversion (PAMR) [79], and Anti-Correlation [14]; Pattern-Matching-based approaches
such as correlation-driven nonparametric learning (CORN) [78] and $B^K$ [51]; and Meta-Learning
algorithms such as Online Newton Step (ONS). Readers can refer to the survey [77]
for more details. We introduce some basic PM concepts as follows:
• Portfolio. A portfolio can be represented as:
$\mathbf{w}_t = [w_t^0, w_t^1, \dots, w_t^M] \in \mathbb{R}^{M+1}$ and $\sum_{i=0}^{M} w_t^i = 1$,
where $M+1$ is the number of the portfolio’s constituents, including one risk-free asset, i.e., cash,
and $M$ risky assets. $w_t^i$ represents the ratio of the total portfolio value (money) invested at
the beginning of the holding period $t$ on asset $i$. Specifically, $w_t^0$ represents the cash in hand.
• Portfolio Value. We define $v_t$ and $v'_t$ as the portfolio value at the beginning and end of the
holding period. So we can get the change of portfolio value during the holding period and
the change of portfolio weights:
$v'_t = v_t \sum_{i=0}^{M} \frac{w_t^i \, p'_{i,t}}{p_{i,t}}$,
$w'^{\,i}_t = \frac{w_t^i \, p'_{i,t} / p_{i,t}}{\sum_{j=0}^{M} w_t^j \, p'_{j,t} / p_{j,t}}$ for $i \in [0, M]$
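The portfolio value and weight update over one holding period can be sketched in Python as follows. This is a minimal illustration that ignores transaction fees. The function name and the two-asset example (cash plus one risky asset) are hypothetical.

```python
from typing import List, Tuple


def update_portfolio(
    value: float,
    weights: List[float],
    prices: List[float],
    next_prices: List[float],
) -> Tuple[float, List[float]]:
    """Apply one holding period to a portfolio.

    Each asset grows by its price relative p'_{i,t} / p_{i,t}. The new value is
    the old value times the weighted sum of price relatives, and the new
    weights are each asset's share of that grown value.
    """
    relatives = [p_next / p for p, p_next in zip(prices, next_prices)]
    growth = sum(w * x for w, x in zip(weights, relatives))
    new_value = value * growth
    new_weights = [w * x / growth for w, x in zip(weights, relatives)]
    return new_value, new_weights


# Hypothetical example: asset 0 is cash (price stays at 1), asset 1 rises 20%.
new_value, new_weights = update_portfolio(
    value=1000.0,
    weights=[0.5, 0.5],
    prices=[1.0, 10.0],
    next_prices=[1.0, 12.0],
)
```

The updated weights still sum to 1, and the weight of the asset that rose in price drifts upward even though no trade took place.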
Weighted Average Price (VWAP) distributes orders in proportion to the (empirically estimated)
market transaction volume. The goal of VWAP is to track the market average execution price [67].
However, traditional solutions are not effective in the real market because of the inconsistency
between the assumptions and reality.
Formally, OE is to trade a fixed amount of shares within a predetermined time horizon (e.g., one
hour or one day). At each time step $t$, traders can propose to trade a quantity of $q_t \ge 0$ shares at
the current market price $p_t$. The matching system will then return the execution results at time $t+1$.
Taking the sell side as an example, assuming a total of $Q$ shares is required to be executed during the
whole time horizon, the OE task can be formulated as:
$\arg\max_{q_1, q_2, \dots, q_T} \sum_{t=1}^{T} (q_t \cdot p_t)$, s.t. $\sum_{t=1}^{T} q_t = Q$
OE must not only complete the liquidation requirement but also maximize/minimize the average execution
price for sell/buy side execution, respectively. We introduce basic OE concepts as follows:
• Market Order. A market order refers to submitting an order to buy or sell a financial asset
at the current market price, which expresses the desire to trade at the best available price.
• Limit Order. A limit order is an order placed to buy or sell a number of shares at a specified
price during a specified time frame. It can be modeled as a tuple $(p_{target}, \pm q_{target})$, where
$p_{target}$ represents the submitted target price, $q_{target}$ represents the submitted target quantity,
and $\pm$ represents the trading direction (buy/sell).
• Limit Order Book. A limit order book (LOB) is a list containing all the information about
the current limit orders in the market. An example of LOB is shown in Figure 3.
• Average Execution Price. Average execution price (AEP) is defined as $\bar{p} = \sum_{t=1}^{T} \frac{q_t}{Q} \cdot p_t$.
• Order Matching System. The electronic system that matches buy and sell orders for a
financial market is called the order matching system. The matching system is the core of all
electronic exchanges, which decides the execution results of orders in the market. The most
common matching mechanism is first-in-first-out, which means limit orders at the same
price will be executed in the order in which the orders were submitted.
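Two of the concepts above, the average execution price and first-in-first-out matching, can be sketched in Python. This is a simplified illustration, not a model of any real exchange: it handles a single price level, and the function names, order identifiers, and quantities are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def average_execution_price(quantities: List[float], prices: List[float]) -> float:
    """Quantity-weighted average price: the sum of (q_t / Q) * p_t over all steps."""
    total = sum(quantities)
    return sum(q / total * p for q, p in zip(quantities, prices))


def match_fifo(
    resting: Deque[Tuple[str, int]], incoming_quantity: int
) -> List[Tuple[str, int]]:
    """Fill an incoming order against resting limit orders at one price level.

    `resting` holds (order_id, quantity) pairs in submission order, so the
    earliest order is filled first. A partially filled order keeps its place
    at the front of the queue.
    """
    fills = []
    remaining = incoming_quantity
    while remaining > 0 and resting:
        order_id, qty = resting.popleft()
        filled = min(qty, remaining)
        fills.append((order_id, filled))
        remaining -= filled
        if qty > filled:
            resting.appendleft((order_id, qty - filled))
    return fills


# Hypothetical sell execution split over three time steps.
aep = average_execution_price([30.0, 30.0, 40.0], [10.0, 10.2, 9.9])

# Hypothetical queue at one price: order "a" arrived before order "b".
queue = deque([("a", 30), ("b", 50)])
fills = match_fifo(queue, incoming_quantity=60)
```

In this example order "a" is filled completely before order "b" receives any fill, which is the time priority that first-in-first-out matching describes.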
probability, and $\gamma$ is the discount factor. The goal of an RL agent is to find a policy $\pi(a \mid s)$ that
takes action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$ in order to maximize the expected discounted cumulative reward:
$\max \mathbb{E}[R(\tau)]$, where $R(\tau) = \sum_{t \ge 0} \gamma^t r(a_t, s_t)$ and $0 \le \gamma \le 1$
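For a single finite trajectory, the discounted cumulative reward can be computed as in the following Python sketch. The reward values and the discount factor are hypothetical and serve only to make the formula concrete.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one trajectory, with t starting at 0.

    The loop runs backwards so that each step reuses the discounted
    return of the remaining steps: G_t = r_t + gamma * G_{t+1}.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards for three steps with gamma = 0.5:
# 1.0 + 0.5 * 2.0 + 0.25 * 3.0
episode_return = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
```

With gamma = 1 the function reduces to a plain sum of rewards, and with gamma = 0 only the first reward counts.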
Sutton and Barto [127] summarise RL’s main components as: (i) Policy, which refers to the probability
of taking action $a$ when the agent is in state $s$. From the policy perspective, RL algorithms
are categorized into on-policy and off-policy methods. On-policy RL methods evaluate or improve
the same policy that is used to make decisions, whereas off-policy RL
methods evaluate or improve a policy that is different from the one used to
generate data. (ii) Reward: after the agent takes the selected actions, the environment sends back a
numerical reward signal to inform the agent how good or bad the selected actions are. (iii) Value function,
which is the expected return if the agent starts in state $s$ or state-action pair $(s, a)$ and
then acts according to a particular policy $\pi$ consistently. The value function tells how good or bad the
agent’s current position is in the long run. (iv) Model, which is an inference about the behaviour
of the environment in different states.
Plenty of algorithms have been proposed to solve RL problems. Tabular methods and approximation
methods are two mainstream directions. For tabular algorithms, a table is used to
represent the value function for every action and state pair. The exact optimal policy can be
found by checking the table. Due to the curse of dimensionality, tabular methods only work
well when the action and state spaces are small. Dynamic programming (DP), Monte Carlo
(MC), and temporal difference (TD) are a few widely studied tabular methods. Under the
assumption of a perfect model of the environment, DP uses a value function to search for good policies.
Policy iteration and value iteration are two classic DP algorithms. MC methods try to learn good
policies from sampled sequences of states, actions, and rewards from the environment. For MC
methods, the assumption of perfect environment understanding is not required. TD methods are
a combination of DP and MC methods. While they do not need a model of the environment,
they can bootstrap, which is the ability to update estimates based on other estimates. From this
family, Q-learning [142] and SARSA [107] are popular algorithms, which belong to off-policy and
on-policy methods, respectively.
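The difference between the off-policy Q-learning update and the on-policy SARSA update can be seen in a small tabular sketch in Python. The update rules are the standard textbook forms. The state names, action names, learning rate, and initial table values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

QTable = Dict[Tuple[str, str], float]


def q_learning_update(q: QTable, state: str, action: str, reward: float,
                      next_state: str, actions: List[str],
                      alpha: float, gamma: float) -> None:
    """Off-policy: bootstrap from the best action in the next state."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def sarsa_update(q: QTable, state: str, action: str, reward: float,
                 next_state: str, next_action: str,
                 alpha: float, gamma: float) -> None:
    """On-policy: bootstrap from the action actually taken in the next state."""
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])


actions = ["buy", "hold", "sell"]  # hypothetical trading actions


def make_table() -> QTable:
    """Hypothetical table: unseen entries default to 0."""
    table: QTable = defaultdict(float)
    table[("s1", "buy")] = 2.0
    table[("s1", "hold")] = 1.0
    return table


q_off = make_table()
q_learning_update(q_off, "s0", "buy", 1.0, "s1", actions, alpha=0.5, gamma=0.9)

q_on = make_table()
sarsa_update(q_on, "s0", "buy", 1.0, "s1", "hold", alpha=0.5, gamma=0.9)
```

Both updates see the same transition, but Q-learning uses the value of the greedy next action ("buy") while SARSA uses the value of the next action that was actually chosen ("hold"), so the two tables end up with different estimates.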
On the other hand, approximation methods try to find a good approximate function with limited
computation. Learning to generalize from previous experiences (already seen states) to unseen
states is a reasonable direction. Policy gradient methods are popular approximate solutions.
REINFORCE [143] and actor-critic [71] are two important examples. With the popularity of deep
learning, RL researchers use neural networks as function approximators. DRL is the combination
of DL and RL, which has led to great success in many domains [91, 133]. Popular DRL algorithms
for the QT community include deep Q-network (DQN) [91], deterministic policy gradient
(DPG) [120], deep deterministic policy gradient (DDPG) [82], and proximal policy optimization
(PPO) [112]. Recurrent reinforcement learning (RRL) is another widely used RL
approach for QT. Here, “recurrent” means the previous output is fed into the model as part of the input.
RRL achieves more stable performance when exposed to noisy data such as financial data. In
the rest of this section, we briefly introduce popular RL algorithms in quantitative trading
following [127]:
DQN [91] is a Q-learning algorithm that approximates the action-value function (Q-function) with a deep
neural network. It is usually used in conjunction with experience replay, which stores episode
steps in memory for off-policy learning. Moreover, the Q-network is optimized towards a frozen
target network that is periodically updated to make the learning process more stable.
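The two stabilizing ingredients of DQN, experience replay and a frozen target network, can be sketched without any deep learning library. In the Python sketch below, `frozen_target_q` is a hypothetical stand-in for the target network (a fixed function returning one value per action), and the transitions are invented for illustration.

```python
import random
from collections import deque
from typing import Callable, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]


class ReplayBuffer:
    """Fixed-size memory of past transitions for off-policy learning."""

    def __init__(self, capacity: int) -> None:
        self.buffer = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        return rng.sample(list(self.buffer), batch_size)


def td_targets(batch: List[Transition],
               target_q: Callable[[int], List[float]],
               gamma: float) -> List[float]:
    """Regression targets r + gamma * max_a Q_target(s', a), with no bootstrap at episode end."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


def frozen_target_q(state: int) -> List[float]:
    """Hypothetical frozen target network with two actions."""
    return [0.0, float(state)]


memory = ReplayBuffer(capacity=1000)
memory.push((0, 1, 1.0, 2, False))
memory.push((2, 0, 5.0, 3, True))
batch = memory.sample(batch_size=2, rng=random.Random(0))
targets = td_targets(batch, frozen_target_q, gamma=0.9)
```

In a full implementation the online Q-network would be trained to regress onto these targets, and the target network's parameters would be overwritten with the online network's parameters every fixed number of steps.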
PPO [112] is a policy gradient method, which alternates between sampling data through interaction
with the environment and optimizing a “surrogate” objective function using stochastic gradient
ascent. A novel objective function is proposed to enable multiple epochs of minibatch updates. PPO
is easy to implement, has better sample complexity, and shares many benefits of the classic trust
region policy optimization.
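The survey does not spell out the surrogate objective. The Python sketch below shows the clipped form commonly associated with PPO, in which the probability ratio between the new and old policy is clipped to the interval [1 - epsilon, 1 + epsilon]. The value epsilon = 0.2 and the sample ratios and advantages are assumptions chosen for illustration.

```python
from typing import List


def clipped_surrogate(ratios: List[float], advantages: List[float],
                      epsilon: float = 0.2) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples.

    `ratios` are pi_new(a|s) / pi_old(a|s) for each sampled action, and
    `advantages` are the corresponding advantage estimates.
    """
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Hypothetical minibatch: one action became more likely with a positive
# advantage, another became less likely with a negative advantage.
objective = clipped_surrogate(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

Taking the minimum removes the incentive to move the ratio far outside the clipping interval, which is what makes several epochs of minibatch updates on the same sampled data reasonably safe.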
A2C [90] is the synchronous variant of asynchronous advantage actor-critic (A3C), a simple and
lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to
optimize deep neural network controllers. Instead of experience replay, A3C asynchronously executes
multiple agents in parallel on multiple instances of the environment, and A2C keeps these parallel
agents but applies their updates synchronously. This parallelism also decorrelates the agents’ data
into a more stationary process, since at any given time step the parallel agents will be experiencing
a variety of different states. This simple idea enables a much larger spectrum of fundamental
on-policy RL algorithms.
SAC [52] is an off-policy actor-critic method based on the maximum entropy RL framework [165],
which maximizes a weighted objective of the reward and the policy entropy to encourage
robustness to noise and exploration. For parameter updating, SAC alternates between a soft policy
evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function,
which is modeled as a neural network with parameters $\theta$, is updated by minimizing the
soft Bellman residual. To handle continuous action spaces, the policy is modeled as a Gaussian
with mean and covariance given by neural networks.
DDPG [82] is a model-free off-policy algorithm for learning continuous actions, which combines
ideas from DPG and DQN. It uses experience replay and slow-learning target networks from DQN,
and it is based on DPG, which can operate over continuous action spaces.
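One common way to realize a slow-learning target network is a soft (Polyak) update, in which the target parameters move a small step towards the online parameters after every training step. The Python sketch below treats the parameters as flat lists of floats. The step size tau and the parameter values are hypothetical.

```python
from typing import List


def soft_update(target_params: List[float], online_params: List[float],
                tau: float) -> List[float]:
    """Return (1 - tau) * target + tau * online for each parameter."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


# Hypothetical two-parameter networks and a small step size.
target = [0.0, 1.0]
online = [1.0, 3.0]
target = soft_update(target, online, tau=0.1)
```

A small tau makes the target network change slowly, which keeps the bootstrapped targets stable, while tau = 1 copies the online parameters in one step.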
learn non-linear relationships between features. In recent years, deep learning models including
multi-layer perceptron (MLP) [31], recurrent neural network (RNN) [113], long short-term
memory (LSTM) [113], and convolutional neural network (CNN) [56] have become prevalent owing
to their outstanding ability to learn hidden relationships between features. An efficient
Mixture-of-Experts framework [124] is proposed to mimic the efficient bottom-up trading strategy design
workflow in real-world trading firms.
Besides different ML models, there is also a trend to utilize alternative data for improving predic-
tion performance. For instance, economic news [57], frequency of prices [158], social media [151],
financial events [33], investment behaviors [23], and weather information [164] have been used
as extra information to learn the intrinsic pattern of financial assets. Graph neural networks have
been introduced to model the relationship between stocks [24, 80, 110, 150]. Hybrid methods are
also proposed to further improve prediction performance [58, 86].
fine-grained data (e.g., second-level and minute-level) are often used to simulate the micro-level
financial market.
technical indicators as extra market information. Layer two evaluates trading actions from layer
one with consideration of risk factors. The goal of layer three is to search for optimal values
of hyperparameters in layer two. With the three-layer architecture, it outperforms baselines on
Euro/US Dollar exchange data. Vittori et al. [136] proposed a risk-averse algorithm called Trust
Region Volatility Optimization (TRVO) for option hedging. TRVO trains a sheaf of agents
characterized by different risk aversion methods and is able to span an efficient frontier on the
volatility-P&L space. Simulation results demonstrate that TRVO outperforms the classic Black &
Scholes delta hedge [12].
With the development of deep learning, a few DRL methods are proposed for algorithmic trading.
FDDR [30] enhanced the classic RRL method [94] with deep neural networks. An RNN layer is
used to learn meaningful recurrent representations of the market. In addition, a fuzzy extension
is proposed to further reduce the uncertainty. FDDR achieves great performance on both stock
index and commodity futures. To balance between profit and risk, a multi-objective RL method
with LSTM layers [118] is proposed. Through optimizing profit and Sharpe Ratio simultaneously,
the agent achieves better performance on three Chinese stock index futures.
Value-based methods. QSR [46] uses Q-learning to optimize absolute profit and relative risk-
adjusted profit, respectively. A combination of two networks is employed to improve performance
on US Dollar/German Deutschmark exchange data. Lee and Jangmin [74] proposed a multi-agent
Q-learning framework for stock trading. Four cooperative agents are designed to generate trading
signals and order prices for both buy and sell side. Through sharing training episodes and learned
policies with each other, this method achieves better performance in terms of both profit and
risk management on the Korean stock market compared to supervised learning baselines. In [62],
the authors first design some local traders based on dynamic programming and heuristic rules.
Later on, they apply Q-learning to learn a meta policy of these local traders on the Korean stock
market. de Oliveira et al. [27] implemented a SARSA-based RL method and tested it on 10 stocks in
the Brazilian market.
DQN is used to enhance trading systems by considering trading frequencies, market confusion,
and transfer learning [64]. The trading frequency is determined in three ways: (1) a heuristic function
related to Q-value, (2) an action-dependent NN regressor, and (3) an action-independent NN
regressor. Another heuristic function is applied to add a filter based on the agent’s certainty about the
market condition. Moreover, the authors train the agent on selected component stocks and apply the
pre-trained weights as the starting point for different stock indexes. Experiments on four different
stock indexes demonstrate the effectiveness of the proposed framework. DeepScalper [125] applied
a branching dueling Q-network [141] for intraday trading. An encoder-decoder architecture is
proposed to learn a market embedding incorporating both micro-level and macro-level market information.
A novel hindsight bonus is added to the reward function to encourage a long-term horizon.
The authors also design an auxiliary task of predicting future volatility. DeepScalper significantly
outperforms many baselines on Chinese treasury bond and stock index markets. Riva et al. [106]
proposed a value-based RL algorithm to train agents for FX trading via fitted-Q iteration. The
importance of tuning control frequency is studied in order to obtain effective trading policies.
Other methods. iRDPG [87] is an adaptive DPG-based framework. Due to the noisy nature of
financial data, the authors formulate algorithmic trading as a Partially Observable Markov De-
cision Process (POMDP). GRU layers are introduced in iRDPG to learn recurrent market embed-
ding. In addition, the authors apply behavior cloning with expert trading actions to guide iRDPG
and achieve great performance on two Chinese stock index futures. There are also some works
focusing on evaluating the performance of different RL algorithms on their own data. Zhang et al.
[161] evaluated DQN, PG, and A2C on the 50 most liquid futures contracts. Yuan et al. [155] tested
PPO, DQN, and SAC on three selected stocks. Based on these two works, DQN achieves the best
overall performance among different financial assets.
Summary. Although existing works demonstrate the potential of RL for quantitative trading,
there is seemingly no consensus on a general ranking of different RL algorithms (notably, we
acknowledge that the no-free-lunch theorem applies). The summary of algorithmic trading publications is
in Table 4. In addition, most existing RL-based works only focus on general AT, which tries to make
a profit through trading one asset. In finance, extensive trading strategies have been designed based
on trading frequency (e.g., high-frequency trading) and asset types (e.g., stock and cryptocurrency).
the experiments on the Chinese stock market, Investor-Imitator successfully extracts interpretable
knowledge of portfolio management that can help human traders better understand the financial
market. Alphastock [138] is another policy-based RL method for portfolio management. An LSTM
with a history state attention model is used to learn better stock representations. A cross-asset
attention network (CAAN) incorporating a price rising rank prior is added to further describe the
interrelationships among stocks. Later on, the output of CAAN (the winning score of each stock) is
fed into a heuristic portfolio generator to construct the final portfolio. Policy gradient is used to
optimize the Sharpe Ratio. Experiments on both U.S. and Chinese stock markets show that Alphastock
achieves robust performance over different market states. EI$^3$ [117] is another RRL-based
method, which tries to build profitable cryptocurrency portfolios by extracting multi-scale patterns
in the financial market. Inspired by the success of Inception networks [128], the authors design
a multi-scale temporal feature aggregation convolution framework with two CNN branches
to extract short-term and mid-term market embeddings and a max pooling branch to extract the
highest price information. To bridge the gap between the traditional Markowitz portfolio and RL-based
methods, Benhamou et al. [7] applied PG with a delayed reward function and showed better
performance than the classic Markowitz efficient frontier.
Zhang et al. [160] proposed a cost-sensitive PM framework based on direct policy gradient. To
learn a more robust market representation, a novel two-stream portfolio policy network is designed
to extract both the price series pattern and the relationship between different financial assets. In addition,
the authors design a new cost-sensitive reward function to take the trading cost constraint
into consideration with a theoretical near-optimal guarantee. Finally, the effectiveness of the
cost-sensitive framework is demonstrated on real-world cryptocurrency datasets. Xu et al. [149] proposed
a novel relation-aware transformer (RAT) under the classic RRL paradigm. RAT is structurally
innovated to capture both sequential patterns and the inner correlations between financial
assets. Specifically, RAT follows an encoder-decoder structure, where the encoder is for sequential
feature extraction and the decoder is for decision making. Experiments on two cryptocurrency
datasets and one stock dataset not only show RAT’s superior performance over existing baselines but also
demonstrate that RAT can effectively learn better representations and benefit from the leverage operation.
Bisi et al. [11] derived a PG theorem with a novel objective function, which exploited the
mean-volatility relationship. The new objective could be used in actor-only algorithms such as
TRPO with monotonic improvement guarantees. Wang et al. [140] proposed DeepTrader, a PG-based
DRL method, to tackle the risk-return balancing problem in PM. The model simultaneously
uses negative maximum drawdown and price rising rate as reward functions to balance between
profit and risk. The authors propose an asset scoring unit with a graph convolution layer to capture
temporal and spatial interrelations among stocks. Moreover, a market scoring unit is designed
to evaluate the market condition. DeepTrader achieves great performance across three different
Actor-critic methods. Jiang et al. [66] proposed a DPG-based RL framework for portfolio
management. The framework consists of three novel components: (1) the Ensemble of Identical
Independent Evaluators (EIIE) topology; (2) a Portfolio Vector Memory (PVM); and (3) an
Online Stochastic Batch Learning (OSBL) scheme. Specifically, the idea of EIIE is that the embedding
concatenation of output from different NN layers can learn better market representation
effectively. In order to take transaction costs into consideration, PVM uses the output portfolio at
the last time step as part of the input of the current time step. The OSBL training scheme makes sure
that all data points in the same batch are trained in the original time order. To demonstrate the
effectiveness of the proposed components, extensive experiments using different NN architectures are
conducted on cryptocurrency data. Later on, more comprehensive experiments are conducted in
an extended version [65]. To model the data heterogeneity and environment uncertainty in PM, Ye
et al. [153] proposed a State-Augmented RL (SARL) framework based on DPG. SARL learns
the price movement prediction with financial news as additional states. Extensive experiments on
both cryptocurrency and U.S. stock markets validate that SARL outperforms previous approaches
in terms of return rate and risk-adjusted criteria. Another popular actor-critic RL method for
portfolio management is DDPG. Xiong et al. [148] constructed a highly profitable portfolio with
DDPG on the Chinese stock market. PROFIT [111] is another DDPG-based approach that makes
time-aware decisions on PM with text data. The authors make use of a custom policy network
that hierarchically and attentively learns time-aware representations of news and tweets for PM,
which is generalizable among various actor-critic RL methods. PROFIT shows promising performance
on both the China and U.S. stock markets. Murray et al. [96] proposed a novel actor-critic
algorithm for solving general risk-averse stochastic control problems and used it to learn hedging
strategies for portfolio management across multiple risk aversion levels simultaneously.
Other methods. Neuneier [97] made an attempt to formalize portfolio management as an MDP
and trained an RL agent with Q-learning. Experiments on the German stock market demonstrate its
superior performance over a heuristic benchmark policy. Later on, a shared value function for different
assets and model-free policy iteration are applied to further improve the performance of
Q-learning in [98]. There are a few model-based RL methods that attempt to learn some models of
the financial market for portfolio management. Yu et al. [154] proposed the first model-based RL
framework for portfolio management, which supports both off-policy and on-policy settings. The
authors design an Infused Prediction Module (IPM) to predict future price, a Data Augmen-
tation Module (DAM) with recurrent adversarial networks to mitigate the data deficiency issue,
and a Behavior Cloning Module (BCM) to reduce the portfolio volatility. MetaTrader [101] is a
novel two-stage RL-based approach for portfolio management, which learns to integrate diverse
trading policies to adapt to various market conditions. In the first stage, MetaTrader incorporates
an imitation learning objective into the reinforcement learning framework. Through imitating dif-
ferent expert demonstrations, MetaTrader acquires a set of trading policies with great diversity. In
the second stage, MetaTrader learns a meta-policy to recognize the market conditions and decides
on the most proper learned policy to follow.
Portfolio management has also been formulated as a multi-agent RL problem. MAPS [73] is a cooperative
multi-agent RL system in which each agent is an independent “investor” creating its own
portfolio. The authors design a novel loss function to guide each agent to act as diversely as possible
while maximizing its long-term profit. MAPS outperforms most baselines on 12 years
of U.S. stock market data. In addition, the authors find that adding more agents to MAPS can lead
to a more diversified portfolio with a higher Sharpe ratio. MSPM [61] is a multi-agent RL framework
with a modularized and scalable architecture for PM. MSPM consists of the Evolving Agent
Module (EAM), which learns market embeddings from heterogeneous inputs, and the Strategic Agent
Module (SAM), which produces profitable portfolios based on the output of EAM.
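To make the Sharpe ratio comparison concrete, the following minimal sketch computes an annualized Sharpe ratio from per-period returns. The function names, the 252-periods-per-year convention, the per-period risk-free rate, and the toy return data are illustrative assumptions and are not taken from MAPS [73].

```python
import math
from statistics import mean, stdev


def portfolio_returns(weights, asset_returns):
    """Per-period portfolio returns for fixed weights (rebalanced every period)."""
    return [sum(w * r for w, r in zip(weights, period)) for period in asset_returns]


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period simple returns.

    risk_free_rate is expressed per period (an assumption of this sketch).
    """
    excess = [r - risk_free_rate for r in returns]
    sigma = stdev(excess)  # sample standard deviation
    if sigma == 0:
        raise ValueError("returns have zero volatility; Sharpe ratio is undefined")
    return mean(excess) / sigma * math.sqrt(periods_per_year)


# Toy data (hypothetical): each tuple holds one period's returns for two assets.
asset_returns = [(0.02, -0.01), (-0.01, 0.02), (0.03, 0.00), (0.00, 0.03)]
concentrated = portfolio_returns((1.0, 0.0), asset_returns)
diversified = portfolio_returns((0.5, 0.5), asset_returns)
print(sharpe_ratio(concentrated), sharpe_ratio(diversified))
```

In this toy data the two assets tend to move in opposite directions, so the equally weighted portfolio keeps the same mean return with lower volatility and therefore obtains a higher Sharpe ratio than the single-asset portfolio.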
Some works compare the profitability of portfolios constructed by different RL algorithms on
their own data. Liang et al. [81] compared the performance of DDPG, PPO, and PG on the Chinese
stock market. Yang et al. [152] first tested the performance of PPO, A2C, and DDPG on the
U.S. stock market. Later on, the authors found that an ensemble strategy of these three algorithms
can integrate their best features and shows more robust performance when adjusting to different market
Summary. Since a portfolio is a vector of weights for different financial assets, which naturally
corresponds to a policy, policy-based methods are the most widely used RL methods for PM. There
are also many successful examples based on actor-critic algorithms. Table 5 summarizes the portfolio
management publications. We point out two issues with existing methods: (1) Most of
them ignore the interrelationships between different financial assets, which are valuable information for human
ACM Transactions on Intelligent Systems and Technology, Vol. 14, No. 3, Article 44. Publication date: March 2023.
portfolio managers. (2) Existing works construct portfolios from a relatively small pool of stocks
(e.g., 20 in total). However, the real market contains thousands of stocks and common RL methods
are vulnerable when the action space is very large [35].
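The correspondence between a portfolio and a policy output can be sketched as follows: a policy network emits one unnormalized score per asset, and a softmax maps the scores to long-only weights that sum to one. The softmax parameterization, the function name, and the example scores are assumptions for illustration; individual papers may use other parameterizations (e.g., ones that allow short positions).

```python
import math


def softmax_weights(scores):
    """Map unnormalized per-asset scores to long-only portfolio weights.

    Subtracting the maximum score keeps exp() numerically stable.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical policy output for three assets (e.g., cash plus two stocks).
scores = [2.0, 1.0, -1.0]
weights = softmax_weights(scores)
print(weights)
```

Since the action has one dimension per asset, the action space grows with the size of the stock pool, which is one way to see why pools of thousands of stocks are challenging.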
PPO is another widely used RL method for OE. Lin and Beling [85] proposed an end-to-end
PPO-based framework, in which MLP and LSTM networks are tested for capturing temporal dependencies. The
authors design a sparse reward function in place of the previously used implementation shortfall (IS) or
shaped reward functions, which leads to state-of-the-art performance on 14 stocks in the U.S.
market. Fang et al. [40] proposed another PPO-based framework to bridge the gap between the
noisy and imperfect market states and the optimal action sequences for OE. The framework leverages
a policy distillation method with an entropy regularization term in the loss function to guide
the student agent toward the policy learned by an oracle teacher that has perfect information
about the financial market. Moreover, the authors design a normalized reward function to encourage
universal learning across different stocks. Extensive experiments on the Chinese stock market
demonstrate that the proposed method significantly outperforms various baselines with reasonable
trading actions.
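As a rough illustration of the reward choices discussed above, the sketch below computes implementation shortfall under one common convention (shortfall relative to the arrival price, as a fraction of the benchmark value) and wraps it in a sparse reward that is zero until the final step. The function names, the sign convention, and the example fills are assumptions of this sketch; the exact reward functions of [85] and [40] are not reproduced here.

```python
def implementation_shortfall(arrival_price, fills, side="sell"):
    """Shortfall relative to the arrival price, as a fraction of benchmark value.

    fills is a list of (price, quantity) pairs. Positive values mean the
    execution was worse than trading everything at the arrival price.
    """
    total_qty = sum(qty for _, qty in fills)
    if total_qty <= 0:
        raise ValueError("fills must contain a positive total quantity")
    executed_value = sum(price * qty for price, qty in fills)
    benchmark_value = arrival_price * total_qty
    sign = 1.0 if side == "sell" else -1.0
    return sign * (benchmark_value - executed_value) / benchmark_value


def sparse_reward(step, last_step, arrival_price, fills, side="sell"):
    """Zero reward at intermediate steps; negative shortfall at the final step."""
    if step < last_step:
        return 0.0
    return -implementation_shortfall(arrival_price, fills, side)


# Hypothetical sell order: 100 shares filled in two slices below the arrival price.
fills = [(99.0, 50), (98.0, 50)]
print(implementation_shortfall(100.0, fills))
print(sparse_reward(1, 1, 100.0, fills))
```

A shaped reward would instead emit a signal at every step, which gives denser feedback but requires more hand design.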
We present a summary of existing RL-based order execution publications in Table 6. Although
there are a few successful examples using either Q-learning or PPO for order execution, existing
works share several limitations. First, most algorithms are only tested on stock data. Their performance
on other financial assets (e.g., futures and cryptocurrency) is still unclear. Second, the
execution time window (e.g., one day) is too long, which makes the task easier. In practice, professional
traders usually finish the execution process in a much shorter time window (e.g., 10 minutes).
Third, existing works will fail when the trading volume is large, because all of them assume there
is no significant market impact, which is unrealistic in large-volume settings. In the real world,
institutional investors need to execute a large number of shares in a relatively short
time window. There is still a long way to go for researchers to tackle these limitations.
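The following toy calculation suggests why the no-impact assumption breaks down for large orders and short windows. It assumes a linear temporary impact model in which each slice moves its own execution price by eta times the slice size; the model, the function name, and all parameter values are assumptions of this sketch rather than results from the surveyed works.

```python
def linear_impact_cost(total_shares, num_slices, eta):
    """Total cost of temporary impact when an order is split into equal slices.

    Assumed model: each slice of size s executes at a price worse than the
    unaffected price by eta * s per share, so the total cost is
    num_slices * s * (eta * s) = eta * total_shares**2 / num_slices.
    """
    slice_size = total_shares / num_slices
    per_share_penalty = eta * slice_size
    return num_slices * slice_size * per_share_penalty


eta = 0.001  # hypothetical impact coefficient (price units per share traded)
small_order = linear_impact_cost(1_000, 10, eta)
large_order = linear_impact_cost(10_000, 10, eta)
short_window = linear_impact_cost(10_000, 2, eta)
print(small_order, large_order, short_window)
```

Under this assumed model the cost grows quadratically with order size and inversely with the number of slices, so a ten times larger order costs about one hundred times more, and compressing the schedule into fewer slices raises the cost further.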
off-policy Q-learning algorithm to develop a trading strategy implemented with a simple lookup
table. The method achieves strong performance on event-by-event LOB data, as confirmed by a professional
trading firm. To train robust market making agents, Spooner and Savani [122] introduced
a game-theoretic adaptation of the traditional mathematical market making model. The authors
thoroughly investigate the impact in three environmental settings with adversarial RL. Zhao
and Linetsky [162] proposed a high-frequency feature called Book Exhaustion Rate (BER),
which can serve as a direct measurement of the adverse selection risk from an equilibrium point
of view. The authors train a market making agent via RL using three years of LOB data on Chicago
Mercantile Exchange S&P 500 and achieve stable performance.
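The lookup-table Q-learning strategy mentioned at the beginning of this paragraph relies on the standard tabular update, which can be sketched as below. The state and action labels, the reward value, and the hyperparameters are hypothetical placeholders and do not describe the setup of any cited work.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One off-policy Q-learning step on a lookup table.

    The target uses the greedy value of the next state, regardless of which
    action the behavior policy takes next.
    """
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical discretized market making setup.
actions = ["quote_tight", "quote_wide"]
q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0
state = ("inventory_flat", "spread_wide")
next_state = ("inventory_long", "spread_wide")
first = q_learning_update(q_table, state, "quote_tight", 1.0, next_state,
                          actions, alpha=0.5, gamma=0.9)
second = q_learning_update(q_table, state, "quote_tight", 1.0, next_state,
                           actions, alpha=0.5, gamma=0.9)
print(first, second)
```

A lookup table of this kind is only practical when the state and action spaces are small and discretized, which is one reason later works move to function approximation.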
Even though market making is a fundamental task in quantitative trading, research on RL-based
market making is still at an early stage. The few existing works simply apply different RL methods
to their own data. The summary of market making publications is in Table 7. To fully realize
the potential of RL for market making, one major obstacle is the lack of high-fidelity micro-level
market simulators. At present, there is still no reasonable way to simulate the ubiquitous market
impact. This non-negligible gap between simulation and the real market limits the usage of RL in market
RL-based QT models’ generalization performance across different financial assets or markets. Fifth,
for high-risk decision-making tasks such as QT, we need to explain the agent's actions to human traders
as a condition for their full acceptance of the algorithm. Hierarchical RL methods decompose the
main goal into subgoals for low-level agents. By learning the optimal subgoals for the low-level
agent, the high-level agent forms a representation of the financial market that is interpretable by
human traders. Sixth, for QT, learning through direct interaction with the real market is risky
and impractical. RL-based QT methods normally use historical data to learn a policy, which fits the offline RL
setting. Offline RL techniques can help to model the distribution shift and risk of the financial market
while training RL agents.
tried to take market impact into consideration for MM with event-level data. Byrd et al. [15] proposed
ABIDES, an agent-based financial market simulator, to model market impact. Vyetrenko et al.
[137] surveyed the current status of market simulation and proposed a series of stylized metrics
to test the quality of simulation. Building high-fidelity market simulators is a very challenging
but important research direction.
In this article, we provided a comprehensive review of the most notable works on RL-based QT
models. We proposed a classification scheme for organizing and clustering existing works, and we
highlighted a number of influential research prototypes. We also discussed the pros and cons of utilizing
RL techniques for QT tasks. In addition, we pointed out some of the most pressing open problems
and promising future directions. Both RL and QT have been active research topics over the past few
decades, with many newly developed techniques and emerging models each year. We hope
that this survey can provide readers with a comprehensive understanding of the key aspects of
this field, clarify the most notable advances, and shed some light on future research.
