Reinforcement Learning For Quantitative Trading: Shuo Sun Rundong Wang Bo An


Reinforcement Learning for Quantitative Trading

SHUO SUN, RUNDONG WANG, and BO AN, Nanyang Technological University, Singapore

Quantitative trading (QT), which refers to the usage of mathematical models and data-driven techniques in
analyzing the financial market, has been a popular topic in both academia and the financial industry since the
1970s. In the last decade, reinforcement learning (RL) has garnered significant interest in many domains such as
robotics and video games, owing to its outstanding ability to solve complex sequential decision-making
problems. RL’s impact is pervasive, and it has recently demonstrated its ability to conquer many challenging QT
tasks. Exploring the potential of RL techniques on QT tasks is a flourishing research direction. This paper aims to
provide a comprehensive survey of research efforts on RL-based methods for QT tasks. More concretely,
we devise a taxonomy of RL-based QT models, along with a comprehensive summary of the state of the art.
Finally, we discuss current challenges and propose future research directions in this exciting field.

CCS Concepts: • Computing methodologies → Machine learning; • Applied computing → Electronic
commerce; • Information systems → Expert systems;

Additional Key Words and Phrases: Reinforcement learning, quantitative finance, stock market, survey

ACM Reference format:


Shuo Sun, Rundong Wang, and Bo An. 2023. Reinforcement Learning for Quantitative Trading. ACM Trans.
Intell. Syst. Technol. 14, 3, Article 44 (March 2023), 29 pages.
https://doi.org/10.1145/3582560

1 INTRODUCTION
Quantitative trading (QT) is a type of market strategy that relies on mathematical and statistical
models to automatically identify investment opportunities [19]. With the advent of the AI age, it
has become more popular and accounts for more than 70% and 40% of trading volume in developed
markets (e.g., U.S.) and developing markets (e.g., China), respectively. In general, QT research can
be divided into two directions. In the finance community, designing theories and models to understand
and explain the financial market is the main focus. The famous capital asset pricing model
(CAPM) [115], Almgren-Chriss model [2], Markowitz portfolio theory [89], and Fama & French
factor model [38] are a few representative examples. On the other hand, computer scientists apply
data-driven machine learning (ML) techniques to analyze financial data [32, 104]. Recently, deep
learning [72] has become an appealing approach owing not only to its stellar performance but also
to the attractive property of learning meaningful representations from scratch.

This project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund - Pre-
positioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this
material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
Authors’ address: S. Sun, R. Wang, and B. An, School of Computer Science and Engineering, Nanyang Technological Uni-
versity, Singapore 639798, Singapore; emails: {shuo003, rundong001}@e.ntu.edu.sg, boan@ntu.edu.sg.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
© 2023 Copyright held by the owner/author(s).
2157-6904/2023/03-ART44 $15.00
https://doi.org/10.1145/3582560

ACM Transactions on Intelligent Systems and Technology, Vol. 14, No. 3, Article 44. Publication date: March 2023.

RL is an emerging subfield of ML, which provides a mathematical formulation of learning-based
control to extract knowledge from trial and error. With the usage of RL, we can train agents with
near-optimal behaviour policy through optimizing task-specific reward functions [127]. In the last
decade, we have witnessed many significant artificial intelligence (AI) milestones achieved by
RL approaches in domains such as Go [119], video games [91], robotics [76], and science [41].
RL-based methods also have achieved state-of-the-art performance on many QT tasks such as
algorithmic trading (AT) [87], portfolio management (PM) [139], order execution (OE) [40],
and market making (MM) [121]. It is a promising research direction to address QT tasks with
RL techniques [3, 43, 134].

1.1 Why Reinforcement Learning for Quantitative Trading?


The overall objective of QT tasks is to maximize long-term profit under certain risk tolerance.
Specifically, algorithmic trading makes profit through consistently buying and selling one given
financial asset; portfolio management tries to maintain a well-balanced portfolio with weighted
proportion of multiple financial assets; order execution aims at fulfilling a specific trading order
in time with minimum execution cost; market making provides liquidity to the market and makes
profit from the tiny price spread between buy and sell orders. Traditional QT strategies such as
momentum [63, 95] and mean reversion [105] methods discover trading opportunities based on
heuristic rules. Finance expert knowledge is incorporated to capture the underlying pattern of
the financial market. However, rule-based methods exhibit poor generalization ability and only
perform well in certain market conditions [30]. Another paradigm is to trade based on signals from
financial prediction. Different types of supervised learning methods such as linear models [4, 10],
tree-based models [69, 70], and deep neural networks [31, 113] are applied for financial prediction.
Nevertheless, the high volatility and noisy nature of the financial market make it extremely hard
to predict future prices accurately [37]. In addition, there is a noticeable gap between prediction
signals and profitable trading actions [109]. Thus, the overall performance of prediction-based
methods is also unsatisfactory.
To design profitable QT strategies, the advantages of RL methods are four-fold: (i) RL allows
training an end-to-end agent, which takes available market information as input state and outputs
trading actions directly; (ii) RL-based methods bypass the extremely difficult task of predicting future
prices and optimize overall profit directly; (iii) task-specific constraints (e.g., transaction cost and
slippage) can be incorporated into RL objectives easily; and (iv) RL methods have the potential to
generalize to any market condition.

1.2 Difference from Existing Surveys


To the best of our knowledge, this survey is the first comprehensive survey on RL-based quantita-
tive trading applications. Although there are some existing works trying to explore the usage of
RL techniques in QT tasks, none of them has provided an in-depth taxonomy of existing works,
analyzed current challenges of this research field, or proposed future directions in this area. The
goal of this survey is to provide a summary of existing RL-based methods for QT applications
from both RL algorithm perspective and application domain perspective, to analyze current status
of this field, and to point out future research directions.
A number of survey papers on ML in finance have been presented in recent years. For example,
Rundo et al. [108] presented a brief survey on ML for QT. Emerson et al. [36] focused on
the trend and applications. Bahrammirzaee [5] introduced hybrid methods in financial applications.
Gai et al. [44] presented a review of Fintech from both ML and general perspectives. Zhang
and Zhou [157] discussed data mining approaches in Fintech. Chalup and Mitschele [18] discussed
kernel methods in financial applications. Agent-based computational finance is the
focus of Tesfatsion and Judd [130]. There are also many surveys on deep learning for finance. Wong
and Selvi [145] covered early works. Sezer et al. [114] is a recent survey paper with a focus on
financial time series forecasting. Ozbayoglu et al. [102] surveyed the development of DL
in financial applications. Fischer [43] presented a brief review of RL methods in the financial
market. Hambly et al. [53] presented a comprehensive survey on recent advances of RL in finance.
Considering the increasing popularity and potential of RL-based QT applications, a comprehensive
survey will be of high scientific and practical value. More than 100 high-quality papers are
shortlisted and categorized in this survey. Furthermore, we analyze the current situation of this
area and point out future research directions.

1.3 How Do We Collect Papers?


Google Scholar is used as the main search engine to collect relevant papers. In addition, we screened
related top conferences such as NeurIPS, ICML, IJCAI, AAAI, and KDD, to name just a few, to
collect high-quality relevant publications. Moreover, we also collected papers from ICAIF, a new
conference focusing on combining AI and finance. The major keywords we used are reinforcement
learning, quantitative finance, algorithmic trading, portfolio management, order execution, market
making, and stock.

1.4 Contribution of This Survey


This survey aims at thoroughly reviewing existing works on RL-based QT applications. We hope
it will provide a panorama, which will help readers quickly get a full picture of research works
in this field. In summary, the main contribution of this survey is three-fold: (i) We propose a
comprehensive survey on RL-based QT applications and categorize existing works from different
perspectives. (ii) We analyze the advantages and disadvantages of RL techniques for QT and
highlight the pathway of current research. (iii) We discuss current challenges and point out future
research directions.

1.5 Article Organization


The remainder of this article is organized as follows: Section 2 introduces the background of QT.
Section 3 provides a brief description of RL. Section 4 discusses the usage of supervised learning
methods in QT. Section 5 makes a comprehensive review of RL methods in QT. Section 6 discusses
current challenges and open research directions in this area. Section 7 concludes this paper.

2 QUANTITATIVE TRADING BACKGROUND


Before diving into details of this survey, we introduce background knowledge of QT in this
section. Figure 1 illustrates the relationships of the four mainstream QT tasks: algorithmic
trading (AT), portfolio management (PM), order execution (OE), and market making (MM). In the
following section, we first provide a brief overview of financial markets and quantitative trading.
Then, we introduce the preliminaries and definition of each QT task in order. A summary of
notations is illustrated in Table 1.

Fig. 1. Tree relationships of different QT tasks.


Table 1. A Summary of Notations

Notation      Description
$h$           length of a holding period
$p_i$         the time series vector of asset i’s price
$p_{i,t}$     the price of asset i at time t
$p'_{i,t}$    the price of asset i after a holding period h from time t
$p_t$         the price of a single asset at time t
$s_t$         position of an asset at time t
$u_t^i$       trading volume of asset i at time t
$n$           the time series vector of net value
$n_t$         net value at time t
$n'_t$        net value after a holding period h from time t
$w_t^i$       portfolio weight of asset i at time t
$w_t$         portfolio vector at time t
$w'_t$        portfolio vector after a holding period h from time t
$v_t$         portfolio value at time t
$v'_t$        portfolio value after a holding period h from time t
$f_t^i$       transaction fee for asset i at time t
$\xi$         transaction fee rate
$q$           the quantity of a limit order
$Q$           total quantity required to be executed
$r$           the time series vector of return rate
$r_t$         return rate at time t

2.1 Overview
The financial market, an ecosystem involving transactions between businesses and investors, had
a global market capitalization exceeding $90 trillion as of the year 2020.¹ For many countries,
the financial industry has become a paramount pillar, which has given rise to many financial
centres. The International Monetary Fund (IMF) categorizes financial centres as follows:
international financial centres, such as New York, London and Tokyo; regional financial centres,
such as Shanghai, Shenzhen, and Sydney; and offshore financial centres, such as Hong Kong,
Singapore, and Dublin. At the core of financial centres are trading exchanges, where trading
activities involving trillions of dollars take place every day. Trading exchanges can be divided
into stock exchanges such as NYSE, Nasdaq, and Euronext, derivatives exchanges such as CME, and
cryptocurrency exchanges such as Coinbase and Huobi. Participants in the financial market can be
generally categorized as financial intermediaries (e.g., banks and brokers), issuers (e.g., companies
and governments), institutional investors (e.g., investment managers and hedge funds) and individual
investors. With the development of electronic trading platforms, quantitative trading, which has
been demonstrated as quite profitable by many leading trading companies (e.g., Renaissance, Two
Sigma, Citadel, D.E. Shaw), is becoming a dominating trading style in the global financial markets.
In 2020, quantitative trading accounted for over 70% and 40% of trading volume in developed markets
(e.g., U.S. and Europe) and emerging markets (e.g., China and India), respectively.² We introduce
some basic QT concepts as follows:

1 https://data.worldbank.org/indicator/CM.MKT.LCAP.CD/.
2 https://therobusttrader.com/what-percentage-of-trading-is-algorithmic/.


• Financial Asset. A financial asset refers to a liquid asset, which can be converted into
cash immediately during trading time. Classic financial assets include stocks, futures, bonds,
foreign exchanges, and cryptocurrencies.
• Holding Period. Holding period h refers to the time period where traders just hold the
financial assets without any buying or selling actions.
• Asset Price. The price of a financial asset i is defined as a time series
$p_i = \{p_{i,1}, p_{i,2}, p_{i,3}, \ldots, p_{i,t}\}$, where $p_{i,t}$ denotes the price of asset i at time t.
$p'_{i,t}$ is the price of asset i after a holding period h from time t. $p_t$ is used to denote the
price at time t when there is only one financial asset.
• OHLC. OHLC is the abbreviation of open price, high price, low price and close price.
The candlestick, which consists of OHLC, is widely used to analyze the financial market.
• Volume. Volume is the amount of a financial asset that changes hands. $u_t^i$ is the trading
volume of asset i at time t.
• Technical Indicator. A technical indicator is a feature calculated by a formulaic
combination of OHLC and volume. Technical indicators are usually designed by finance
experts to uncover the underlying pattern of the financial market.
• Return Rate. Return rate is the percentage change of capital, where $r_t = (p_{t+1} - p_t)/p_t$
denotes the return rate at time t. The time series of return rate is denoted as $r = (r_1, r_2, \ldots, r_t)$.
• Transaction Fee. Transaction fee is the expense incurred when trading financial assets:
$f_t^i = p_{i,t} \times u_t^i \times \xi$, where $\xi$ is the transaction fee rate.
• Liquidity. Liquidity refers to the efficiency with which a financial asset can be converted
into cash without having an evident impact on its market price. Cash itself is the asset with
the most liquidity.
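
The return rate and transaction fee definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the survey: the function names `return_rates` and `transaction_fee` and the sample prices are hypothetical, and the fee is assumed to be charged on the traded notional exactly as in the formula above.

```python
from typing import List


def return_rates(prices: List[float]) -> List[float]:
    """Return r_t = (p_{t+1} - p_t) / p_t for each consecutive price pair."""
    return [(nxt - cur) / cur for cur, nxt in zip(prices, prices[1:])]


def transaction_fee(price: float, volume: float, fee_rate: float) -> float:
    """Return f = price * volume * fee_rate (fee_rate plays the role of xi)."""
    return price * volume * fee_rate


if __name__ == "__main__":
    closes = [100.0, 110.0, 99.0]  # hypothetical close prices of one asset
    print(return_rates(closes))
    print(transaction_fee(price=110.0, volume=50, fee_rate=0.001))
```

A series of t prices yields t - 1 return rates, since the last price has no successor to compare against.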

2.2 Algorithmic Trading


Algorithmic trading (AT) refers to the process in which traders consistently buy and sell one given
financial asset to make profit. It is widely applied in trading stocks, commodity futures and foreign
exchanges. For AT, time is split into discrete time steps. At the beginning of a trading period, traders
are allocated some cash and the net value is set to 1. Then, at each time step t, traders have the options
to buy, hold, or sell some amount of shares to change positions. Net value and position are used
to represent traders’ status at each time step. The objective of AT is to maximize the final net
value at the end of the trading period. Based on trading styles, algorithmic trading is generally
divided into five categories: position trading, swing trading, day trading, scalping trading, and
high-frequency trading. Specifically, position trading involves holding the financial asset for a long
period of time, which is unconcerned with short-term market fluctuations and only focuses on
the overarching market trend. Swing trading is a medium-term style that holds financial assets
for several days or weeks. The goal of swing trading is to spot a trend and then capitalise on dips
and peaks that provide entry points. Day trading tries to capture the fleeting intraday pattern
in the financial market and all positions will be closed at the end of the day to avoid overnight
risk. Scalping trading aims at discovering micro-level trading opportunities and makes profit by
holding financial assets for only a few minutes. High-frequency trading is a type of trading style
characterized by high speeds, high turnover rates, and high order-to-trade ratios. A summary of
different trading styles is illustrated in Table 2.
Traditional AT methods discover trading signals based on technical indicators or mathematical
models. The Buy and Hold (BAH) strategy, which invests all capital at the beginning and holds
until the end of the trading period, is proposed to reflect the average market condition. Momentum
strategies, which assume the trend of financial assets in the past has the tendency to continue
in the future, are other well-known AT strategies. Buying-Winner-Selling-Loser [63], Time
Series Momentum [95], and Cross Sectional Momentum [20] are three classic momentum strategies.
In contrast, mean reversion strategies such as Bollinger bands [13] assume the price of financial
assets will finally revert to the long-term mean. Although traditional methods capture the
underlying patterns of the financial market to some extent, these simple rule-based methods exhibit
limited generalization ability among different market conditions.

Table 2. A Summary of Algorithmic Trading Styles

Trading Style            Time Frame           Holding Period
Position trading         Long term            Months to years
Swing trading            Medium term          Days to weeks
Day trading              Short term           Within a trading day
Scalping trading         Short term           Seconds to minutes
High-frequency trading   Extreme short term   Milliseconds to seconds

We introduce some basic AT concepts as follows:
• Position. Position $s_t$ is the amount of a financial asset owned by traders at time t. It represents
a long (short) position when $s_t$ is positive (negative).
• Long Position. A long position makes positive profit when the price of the asset increases.
For long trading actions, which buy a financial asset i at time t first and then sell it at t + 1,
the profit is $u_t^i (p_{i,t+1} - p_{i,t})$, where $u_t^i$ is the buying volume of asset i at time t.
• Short Position. A short position makes positive profit when the price of the asset decreases.
For short trading actions, which sell a financial asset i at time t first and then buy it back at t + 1,
the profit is $u_t^i (p_{i,t} - p_{i,t+1})$, where $u_t^i$ is the selling volume of asset i at time t.
• Net Value. Net value represents a fund’s per share value. It is defined as a time series
$n = \{n_1, n_2, \ldots, n_t\}$, where $n_t$ denotes the net value at time t. The initial net value is always
set to 1.
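
The long and short profit formulas above differ only in the order of the two trades, which a short sketch makes concrete. The function names below are hypothetical. The net value helper additionally assumes that net value compounds per-step strategy returns; the text itself only fixes the initial value at 1, so that compounding rule is an illustrative assumption.

```python
from typing import List


def long_profit(volume: float, buy_price: float, sell_price: float) -> float:
    """Buy at time t, sell at t + 1: profit = u * (p_{t+1} - p_t)."""
    return volume * (sell_price - buy_price)


def short_profit(volume: float, sell_price: float, buy_back_price: float) -> float:
    """Sell at time t, buy back at t + 1: profit = u * (p_t - p_{t+1})."""
    return volume * (sell_price - buy_back_price)


def net_value_series(step_returns: List[float]) -> List[float]:
    """Net value path starting at 1, assuming returns compound step by step."""
    values = [1.0]
    for r in step_returns:
        values.append(values[-1] * (1.0 + r))
    return values


if __name__ == "__main__":
    # The same 5-point price drop is a loss for the long and a gain for the short.
    print(long_profit(10, buy_price=100.0, sell_price=95.0))
    print(short_profit(10, sell_price=100.0, buy_back_price=95.0))
    print(net_value_series([0.5, -0.5]))
```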

2.3 Portfolio Management


Portfolio management (PM) is a fundamental QT task, where investors hold a number of financial
assets and reallocate them periodically to maximize long-term profit. In the literature, it
is also called portfolio optimization, portfolio selection, and portfolio allocation. In the real
market, portfolio managers work closely with traders, where portfolio managers assign a percentage
weighting to every stock in the portfolio periodically, and traders focus on finishing portfolio
reallocation at a favorable price to minimize the trading cost. For PM, time is split into two types
of periods: holding period and trading period as shown in Figure 2. At the beginning of a holding
period, the agent holds a portfolio $w_t$ which consists of pre-selected financial assets with a
corresponding portfolio value $v_t$. With the fluctuation of the market, the assets’ prices would change
during the holding period. At the end of the holding period, the agent will get a new portfolio value
$v'_t$ and decide a new portfolio weight $w_{t+1}$ of the next holding period. During the trading period,
the agent buys or sells some shares of assets to achieve the new portfolio weights. The lengths of
the holding period and trading period are based on specific settings and can change over time. In
some previous works, the trading period is set to 0, which means the change of portfolio weight is
achieved immediately for convenience. The objective is to maximize the final portfolio value given
a long time horizon.
PM has been a fundamental problem for both the finance and ML communities for decades. Existing
approaches can be grouped into five major categories: benchmarks such as Constant
Rebalanced Portfolio (CRP) and Uniform Constant Rebalanced Portfolio (UCRP)
[25]; Follow-the-Winner approaches such as Exponential Gradient (EG) [54] and Winner [45];
Follow-the-Loser approaches such as Robust Mean Reversion (RMR) [59], Passive Aggressive
Mean Reversion (PAMR) [79], and Anti-Correlation [14]; Pattern-Matching-based approaches
such as correlation-driven nonparametric learning (CORN) [78] and $B^K$ [51]; and Meta-Learning
algorithms such as Online Newton Step (ONS). Readers can refer to the survey in [77]
for more details.

Fig. 2. Portfolio management process.

We introduce some basic PM concepts as follows:
• Portfolio. A portfolio can be represented as:
$$w_t = [w_t^0, w_t^1, \ldots, w_t^M]^T \in \mathbb{R}^{M+1} \quad \text{and} \quad \sum_{i=0}^{M} w_t^i = 1$$
where M+1 is the number of the portfolio’s constituents, including one risk-free asset, i.e., cash,
and M risky assets. $w_t^i$ represents the ratio of the total portfolio value (money) invested at
the beginning of the holding period t on asset i. Specifically, $w_t^0$ represents the cash in hand.
• Portfolio Value. We define $v_t$ and $v'_t$ as the portfolio value at the beginning and end of the
holding period. So we can get the change of portfolio value during the holding period and
the change of portfolio weights:
$$v'_t = v_t \sum_{i=0}^{M} \frac{w_t^i \, p'_{i,t}}{p_{i,t}}, \qquad w'^{\,i}_t = \frac{w_t^i \, p'_{i,t} / p_{i,t}}{\sum_{j=0}^{M} w_t^j \, p'_{j,t} / p_{j,t}} \quad \text{for } i \in [0, M]$$
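
The two holding-period formulas above (portfolio value growth and the resulting drift of the weights) can be sketched as follows. The function name `holding_period_update` and the example numbers are hypothetical. Asset 0 is treated as cash with a constant price of 1, which is an assumption made for the example rather than something the formulas require.

```python
from typing import List, Tuple


def holding_period_update(
    value: float,
    weights: List[float],
    start_prices: List[float],
    end_prices: List[float],
) -> Tuple[float, List[float]]:
    """Return (v'_t, w'_t) after one holding period.

    Each asset's contribution grows by its price relative p'_{i,t} / p_{i,t};
    the new weights are those contributions renormalised to sum to 1.
    """
    if not (len(weights) == len(start_prices) == len(end_prices)):
        raise ValueError("weights and price lists must have equal length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("portfolio weights must sum to 1")
    grown = [w * pe / ps for w, ps, pe in zip(weights, start_prices, end_prices)]
    total = sum(grown)
    return value * total, [g / total for g in grown]


if __name__ == "__main__":
    # Half cash (price fixed at 1), half in a risky asset that rises from 10 to 12.
    new_value, new_weights = holding_period_update(
        1000.0, [0.5, 0.5], [1.0, 10.0], [1.0, 12.0]
    )
    print(new_value, new_weights)
```

Because the risky asset appreciates while cash does not, its weight drifts above 0.5 even though no trade took place, which is why a trading period is needed to restore the target weights.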

2.4 Order Execution


While adjusting the new portfolio, investors need to buy (or sell) some amount of shares by executing
an order of acquisition (or liquidation). Essentially, the objectives of order execution are two-fold:
it requires not only fulfilling the whole order but also targeting a more economical execution
that maximizes profit (or minimizes cost). As mentioned in [16], the major challenge of order
execution lies in the trade-off between avoiding harmful market impact caused by large transactions
in a short period and restraining price risk, which means missing good trading windows due
to slow execution. Traditional OE solutions are usually designed based on some stringent assumptions
about the market, from which model-based methods are derived with stochastic control theory.
For instance, Time Weighted Average Price (TWAP) evenly splits the whole order and executes at
each time step with the assumption that the market price follows Brownian motion [9]. The
Almgren-Chriss model [2] incorporates temporary and permanent price impact functions, also with
the Brownian motion assumption. Volume Weighted Average Price (VWAP) distributes orders in
proportion to the (empirically estimated) market transaction volume. The goal of VWAP is to track
the market average execution price [67]. However, traditional solutions are not effective in the
real market because of the inconsistency between the assumptions and reality.

Fig. 3. A snapshot of Limit Order Book.

Formally, OE is to trade a fixed amount of shares within a predetermined time horizon (e.g., one
hour or one day). At each time step t, traders can propose to trade a quantity of $q_t \geq 0$ shares at
the current market price $p_t$. The matching system will then return the execution results at time t + 1.
Taking the sell side as an example, assuming a total of Q shares required to be executed during the
whole time horizon, the OE task can be formulated as:
$$\arg\max_{q_1, q_2, \ldots, q_T} \sum_{t=1}^{T} (q_t \cdot p_t), \quad \text{s.t.} \quad \sum_{t=1}^{T} q_t = Q$$
OE not only completes the liquidation requirement but also maximizes (minimizes) the average
execution price for sell-side (buy-side) execution. We introduce basic OE concepts as follows:
• Market Order. A market order refers to submitting an order to buy or sell a financial asset
at the current market price, which expresses the desire to trade at the best available price
immediately.
• Limit Order. A limit order is an order placed to buy or sell a number of shares at a specified
price during a specified time frame. It can be modeled as a tuple $(p_{target}, \pm q_{target})$, where
$p_{target}$ represents the submitted target price, $q_{target}$ represents the submitted target quantity,
and $\pm$ represents the trading direction (buy/sell).
• Limit Order Book. A limit order book (LOB) is a list containing all the information about
the current limit orders in the market. An example of LOB is shown in Figure 3.
• Average Execution Price. Average execution price (AEP) is defined as $\bar{p} = \sum_{t=1}^{T} \frac{q_t}{Q} \cdot p_t$.
• Order Matching System. The electronic system that matches buy and sell orders for a
financial market is called the order matching system. The matching system is the core of all
electronic exchanges, which decides the execution results of orders in the market. The most
common matching mechanism is first-in-first-out, which means limit orders at the same
price will be executed in the order in which the orders were submitted.
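
To make the TWAP baseline and the average execution price concrete, the sketch below splits an order evenly over the horizon and then evaluates the AEP of the resulting schedule. The function names and the sample prices are hypothetical, and the example assumes every child order fills completely at the quoted price with no market impact, which is a strong simplification of a real matching system.

```python
from typing import List


def twap_schedule(total_quantity: int, num_steps: int) -> List[int]:
    """Split Q shares as evenly as possible over T steps (TWAP-style)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    base, remainder = divmod(total_quantity, num_steps)
    # The first `remainder` steps take one extra share so the sum equals Q.
    return [base + (1 if t < remainder else 0) for t in range(num_steps)]


def average_execution_price(quantities: List[int], prices: List[float]) -> float:
    """AEP = sum_t (q_t / Q) * p_t, with Q = sum_t q_t."""
    total = sum(quantities)
    if total == 0:
        raise ValueError("nothing was executed")
    return sum(q * p for q, p in zip(quantities, prices)) / total


if __name__ == "__main__":
    schedule = twap_schedule(total_quantity=10, num_steps=4)
    prices = [10.0, 11.0, 12.0, 13.0]  # hypothetical fill prices per step
    print(schedule, average_execution_price(schedule, prices))
```

An RL-based execution agent would replace `twap_schedule` with a learned policy that chooses each $q_t$ from the observed market state, while the AEP remains a natural evaluation signal.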

2.5 Market Making


Market makers are traders who continually quote prices at which they are willing to trade on
both the buy and sell sides for one financial asset. They provide liquidity and make profit from the
tiny price spread between buy and sell orders. The main challenge for market making is non-zero
inventory. When limit orders are submitted on both sides, there is no guarantee that all the orders
can be successfully executed. It is risky when non-zero inventory accumulates to a high level because
this means the market maker will have to close the inventory at the current market price, which could
lead to a significant loss. In practice, some market makers keep their inventory at a low level to avoid
market exposure and only make profits by repeatedly earning their quoted spread. On the other
hand, some more advanced market makers may choose to hold a non-zero inventory to capture
the market trend, while exploiting the quoted spread simultaneously. Traditional finance methods
consider market making as a stochastic optimal control problem [17]. Agent-based methods [47]
and RL [121] have also been applied to market making.
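
The inventory risk described above can be illustrated with a small bookkeeping sketch. It is not a market making strategy: the function name `market_maker_pnl`, the fill tuples, and the prices are hypothetical, and the remaining inventory is assumed to be closed at a single mark price with no fees.

```python
from typing import List, Tuple

Fill = Tuple[str, float, int]  # (side, price, quantity), side is "buy" or "sell"


def market_maker_pnl(fills: List[Fill], mark_price: float) -> Tuple[int, float]:
    """Return (inventory, profit) after the fills, closing inventory at mark_price."""
    cash = 0.0
    inventory = 0
    for side, price, quantity in fills:
        if side == "buy":
            cash -= price * quantity
            inventory += quantity
        elif side == "sell":
            cash += price * quantity
            inventory -= quantity
        else:
            raise ValueError(f"unknown side: {side}")
    return inventory, cash + inventory * mark_price


if __name__ == "__main__":
    # Both quotes fill: inventory returns to zero and the spread is earned.
    print(market_maker_pnl([("buy", 99.0, 10), ("sell", 101.0, 10)], 100.0))
    # Only the bid fills and the price falls: the leftover inventory loses money.
    print(market_maker_pnl([("buy", 99.0, 10)], 95.0))
```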

2.6 Evaluation Metrics


In this subsection, we discuss common profit metrics, risk metrics, and risk-adjusted metrics for
evaluation in this field.


2.6.1 Profit Metrics.
• Profit rate (PR). PR is the percent change of net value over time horizon h. The formal
definition is:
$$\mathrm{PR} = (n_{t+h} - n_t)/n_t$$
• Win rate (WR). WR evaluates the proportion of trading days with positive profit among
all trading days.
2.6.2 Risk Metrics.
• Volatility (VOL). VOL is the standard deviation of the return vector r. It is widely used to measure
the uncertainty of return rate and reflects the risk level of strategies. The formal definition
is:
$$\mathrm{VOL} = \sigma[r]$$
• Maximum drawdown (MDD). MDD [88] measures the largest decline from the peak in
the whole trading period to show the worst case. The formal definition is:
$$\mathrm{MDD} = \max_{\tau \in (0,t)} \left[ \max_{t \in (0,\tau)} \frac{n_t - n_\tau}{n_t} \right]$$
• Downside deviation (DD). DD refers to the standard deviation of trade returns that are
negative.
• Gain-loss ratio (GLR). GLR is a downside risk measure. It represents the relative relationship
of trades with a positive return and trades with a negative return. The formula is:
$$\mathrm{GLR} = \frac{\mathbb{E}[r \mid r > 0]}{\mathbb{E}[-r \mid r < 0]}$$
2.6.3 Risk-adjusted Metrics.
• Sharpe ratio (SR). SR [116] is a risk-adjusted profit measure, which refers to the return per
unit of deviation:
$$\mathrm{SR} = \frac{\mathbb{E}[r]}{\sigma[r]}$$
• Sortino ratio (SoR). SoR is a variant of risk-adjusted profit measure, which applies DD as
risk measure:
$$\mathrm{SoR} = \frac{\mathbb{E}[r]}{\mathrm{DD}}$$
• Calmar ratio (CR). CR is another variant of risk-adjusted profit measure, which applies
MDD as risk measure:
$$\mathrm{CR} = \frac{\mathbb{E}[r]}{\mathrm{MDD}}$$

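
Several of the metrics above can be computed directly from a net value series and a return series, as sketched below. The function names are hypothetical. The sketch follows the formulas as written, so it assumes no annualisation and no risk-free rate, and it uses the population standard deviation (divisor N), which is one possible reading of $\sigma[r]$.

```python
import math
from typing import List


def profit_rate(net_values: List[float]) -> float:
    """PR: relative change from the first to the last net value."""
    return (net_values[-1] - net_values[0]) / net_values[0]


def volatility(returns: List[float]) -> float:
    """VOL: population standard deviation of the return series."""
    mean = sum(returns) / len(returns)
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))


def max_drawdown(net_values: List[float]) -> float:
    """MDD: largest relative decline from a running peak to a later value."""
    peak = net_values[0]
    worst = 0.0
    for n in net_values:
        peak = max(peak, n)
        worst = max(worst, (peak - n) / peak)
    return worst


def sharpe_ratio(returns: List[float]) -> float:
    """SR: mean return per unit of volatility; raises if volatility is zero."""
    vol = volatility(returns)
    if vol == 0:
        raise ZeroDivisionError("volatility is zero, Sharpe ratio undefined")
    return (sum(returns) / len(returns)) / vol


if __name__ == "__main__":
    net_values = [1.0, 1.2, 0.9, 1.1]  # hypothetical net value path
    print(profit_rate(net_values), max_drawdown(net_values))
    print(sharpe_ratio([0.02, 0.04]))
```

The running-peak loop in `max_drawdown` visits each net value once, which gives the same result as the nested maximum in the MDD definition without comparing every pair of time steps.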
3 OVERVIEW OF REINFORCEMENT LEARNING
RL is a popular subfield of ML that studies complex decision making problems. Sutton and Barto
[127] distinguish RL problems by three key characteristics: (i) the problem is closed-loop, (ii) the
agent figures out what to do through trial-and-error, and (iii) actions have an impact on both short
term and long term results. The decision maker is called agent and the environment is everything
else except the agent. At each time step, the agent obtains some observations of the environment,
which is called state. Later on, the agent takes an action based on the current state. The environ-
ment will then return a reward and a new state to the agent. Formally, an RL problem is typically
formulated as a Markov decision process (MDP) in the form of a tuple $M = (S, A, R, P, \gamma)$, where
$S$ is a set of states $s \in S$, $A$ is a set of actions $a \in A$, $R$ is the reward function, $P$ is the transition
probability, and $\gamma$ is the discount factor. The goal of an RL agent is to find a policy $\pi(a \mid s)$ that
takes action $a \in A$ in state $s \in S$ in order to maximize the expected discounted cumulative reward:
$$\max \mathbb{E}[R(\tau)], \quad \text{where } R(\tau) = \sum_{t=0}^{\tau} \gamma^t r(a_t, s_t) \text{ and } 0 \leq \gamma \leq 1$$

Sutton and Barto [127] summarise RL’s main components as: (i) Policy, which refers to the probability
of taking action a when the agent is in state s. From the policy perspective, RL algorithms
are categorized into on-policy and off-policy methods. On-policy RL methods aim to evaluate or
improve the policy that they are currently using to make decisions, whereas off-policy RL
methods aim to improve or evaluate a policy that is different from the one used to
generate data. (ii) Reward: after the agent takes the selected actions, the environment sends back a
numerical reward signal to inform the agent how good or bad the selected actions are. (iii) Value
function, which means the expected return if the agent starts in state s or state-action pair (s, a), and
then acts according to a particular policy π consistently. The value function tells how good or bad the
agent’s current position is in the long run. (iv) Model, which is an inference about the behaviour
of the environment in different states.
Plenty of algorithms have been proposed to solve RL problems. Tabular methods and ap-
proximation methods are two mainstream directions. For tabular algorithms, a table is used to
represent the value function for every action and state pair. The exact optimal policy can be
found through checking the table. Due to the curse of dimensionality, tabular methods only work
well when the action and state spaces are small. Dynamic programming (DP), Monte Carlo
(MC), and temporal difference (TD) are a few widely studied tabular methods. Under the
perfect model of environment assumption, DP uses a value function to search for good policies.
Policy iteration and value iteration are two classic DP algorithms. MC methods try to learn good
policies through sample sequences of states, actions, and reward from the environment. For MC
methods, the assumption of perfect environment understanding is not required. TD methods are
a combination of DP and MC methods. While they do not need a model from the environment,
they can bootstrap, which is the ability to update estimates based on other estimates. From this
family, Q-learning [142] and SARSA [107] are popular algorithms, which belong to off-policy and
on-policy methods, respectively.
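The difference between the off-policy and on-policy TD updates mentioned above can be sketched with a tabular value function. The action set, state names, and hyperparameter values below are hypothetical and only serve the illustration.

```python
from collections import defaultdict

ACTIONS = ("sell", "hold", "buy")  # hypothetical discrete action set


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in the next state.
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action that was actually taken next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)          # table of state-action values, initialised to 0
Q[("up", "buy")] = 1.0          # pretend this entry was learned earlier
q_learning_update(Q, "flat", "buy", 0.5, "up")
print(Q[("flat", "buy")])
```

Both updates bootstrap from an existing estimate; they differ only in which next action is used to form the target.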
On the other hand, approximation methods try to find a great approximate function with lim-
ited computation. Learning to generalize from previous experiences (already seen states) to unseen
states is a reasonable direction. Policy gradient methods are popular approximate solutions. RE-
INFORCE [143] and actor-critic [71] are two important examples. With the popularity of deep
learning, RL researchers use neural networks as function approximators. DRL is the combination
of DL and RL, which lead to great success in many domains [91, 133]. Popular DRL algorithms
for the QT community include deep Q-network (DQN) [91], deterministic policy gradient
(DPG) [120], deep deterministic policy gradient (DDPG) [82], and proximal policy opti-
mization (PPO) [112]. Recurrent reinforcement learning (RRL) is another widely used RL
approach for QT. “Recurrent” means the previous output is fed into the model as part of the input
here. RRL achieves more stable performance when exposed to noisy data such as financial data. In
the following of this section, we briefly introduce popular RL algorithms in quantitative trading
following [127]:
DQN [91] is a Q-learning algorithm that approximates the action-value function with a deep
neural network. It is usually used in conjunction with experience replay, which stores episode
steps in memory for off-policy learning. Moreover, the Q-network is optimized towards a frozen
target network that is periodically updated to make the learning process more stable.
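The two stabilizing ingredients described above, experience replay and a frozen target network, can be sketched without any deep learning library. Here `target_q` is a hypothetical callable standing in for the frozen network; it is assumed to map a state to a list of action values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)


def td_targets(batch, target_q, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), or y = r at terminal steps."""
    targets = []
    for (_s, _a, r, s_next, done) in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(s_next))
        targets.append(r + bootstrap)
    return targets
```

In a full implementation, the online Q-network would be regressed towards these targets, and the weights behind `target_q` would be refreshed only periodically.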

PPO [112] is a policy gradient method, which alternates between sampling data through interaction
with the environment and optimizing a “surrogate” objective function using stochastic gradient
ascent. A novel objective function is proposed to enable multiple epochs of minibatch updates. PPO
is easy to implement, has better sample complexity, and shares many benefits of the classic trust
region policy optimization.
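A commonly used form of the PPO surrogate is the clipped objective, sketched below in plain Python. The function name, the default clipping range of 0.2, and the use of per-sample log-probabilities are assumptions for illustration rather than details given in this survey.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the ratio
        # far outside the clipping range in a single update.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the objective stays pessimistic once the policy has moved away from the data-collecting policy, the same batch can be reused for several minibatch epochs.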
A2C [90] is a simple and lightweight framework for deep reinforcement learning that uses
asynchronous gradient descent for optimization of deep neural network controllers. Instead of
experience replay, A2C asynchronously executes multiple agents in parallel, on multiple instances
of the environment. This parallelism also decorrelates the agents’ data into a more stationary
process, since at any given time-step the parallel agents will be experiencing a variety of different
states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms.
SAC [52] is an off-policy actor-critic method based on the maximum entropy RL framework [165],
which maximizes a weighted objective of the reward and the policy entropy to encourage
robustness to noise and exploration. For parameter updating, SAC alternates between a soft policy
evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function,
which is modeled as a neural network with parameters θ, is updated by minimizing the
soft Bellman residual. To handle continuous action spaces, the policy is modeled as a Gaussian
with mean and covariance given by neural networks.
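For reference, the soft Bellman residual is commonly written in the SAC literature in the form below. The replay distribution $\mathcal{D}$, the target parameters $\bar{\theta}$, and the entropy temperature $\alpha$ are notation assumed here and are not defined elsewhere in this survey.

```latex
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}
  \left[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t)
  - \big( r(a_t, s_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V(s_{t+1}) \right] \big) \Big)^{2} \right],
\qquad
V(s) = \mathbb{E}_{a \sim \pi}
  \left[ Q_{\bar{\theta}}(s, a) - \alpha \log \pi(a \mid s) \right]
```

The entropy term $-\alpha \log \pi(a \mid s)$ inside the soft value is what distinguishes this residual from the standard Bellman error.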
DDPG [82] is a model-free off-policy algorithm for learning continuous actions, which combines
ideas from DPG and DQN. It uses experience replay and slow-learning target networks from DQN,
and it is based on DPG, which can operate over continuous action spaces.
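One common way to realize a slow-learning target network is Polyak averaging of the parameters, sketched below. Representing the networks as flat lists of floats and the value of `tau` are simplifying assumptions for illustration.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


# Hypothetical flattened weights of the target and online networks.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.1)
print(target)
```

With a small `tau`, the target network trails the online network slowly, which keeps the bootstrapped targets from changing abruptly.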

4 SUPERVISED LEARNING FOR QUANTITATIVE TRADING


Supervised learning techniques have been widely used in the pipeline of QT research. In this section,
we provide a brief review of research efforts on supervised learning for QT. We introduce
existing works from three perspectives: feature engineering, financial forecasting, and enhancing
traditional methods with ML.

4.1 Feature Engineering


Discovering a series of high-quality features is the foundation of ML algorithms’ success. In QT,
features that have the ability to explain and predict future prices are also called indicators
or alpha factors. Traditionally, alpha factors are designed and tested by finance experts based on
domain knowledge. However, this way of mining alpha is very costly and not realistic for individual
investors. There are many attempts to automatically discover alpha factors. Alpha101 [68]
introduced a set of 101 public alpha factors. Autoalpha [159] combined a genetic algorithm and
principal component analysis (PCA) to search for alpha factors with low correlation. ADNN [39]
proposed an alpha discovery neural network framework for mining alpha factors. In general, it
is harmful to feed all available features into ML models directly. Feature selection approaches are
applied to reduce irrelevant and redundant features in QT applications [75, 132, 156]. Another para-
digm is to use dimension reduction techniques such as PCA [144] and latent Dirichlet allocation
(LDA) [131] to extract meaningful features.
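As one simple illustration of removing redundant features, a greedy correlation filter can be sketched as follows. This is only an assumed example and not the method of any cited work; the feature names, values, and the 0.95 threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if |corr| with every already kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "momentum_5d": [0.1, 0.3, -0.2, 0.4, 0.0],
    "momentum_5d_scaled": [0.2, 0.6, -0.4, 0.8, 0.0],  # rescaled copy, fully redundant
    "volume_change": [0.5, -0.1, 0.2, -0.3, 0.1],
}
print(drop_redundant(features))
```

The rescaled copy is discarded because it carries no information beyond the first feature.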

4.2 Financial Forecasting


The usage of supervised learning methods in financial forecasting is pervasive. Researchers for-
mulate return prediction as a regression task and price trend prediction as a classification task.
Linear models such as linear regression [10], LASSO [103], and elastic net [146] are used for finan-
cial prediction. Non-linear models including random forest [70], decision tree [6], support vector
machine (SVM) [60], and LightGBM [126] outperform linear models owing to their ability to
learn non-linear relationships between features. In recent years, deep learning models including
multi-layer perceptron (MLP) [31], recurrent neural network (RNN) [113], long short term
memory (LSTM) [113], and convolutional neural network (CNN) [56] are prevailing owing
to their outstanding ability to learn hidden relationships between features. An efficient Mixture-
of-Experts framework [124] is proposed to mimic the efficient bottom-up trading strategy design
work flow in real-world trading firms.
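The regression and classification formulations described above differ mainly in how the prediction target is built from prices. The sketch below shows one assumed construction; the function names, the zero threshold, and the price series are hypothetical.

```python
def next_period_returns(prices):
    """Regression target: simple return from each period to the next."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices[:-1], prices[1:])]


def trend_labels(returns, threshold=0.0):
    """Classification target: 1 if the return exceeds the threshold, else 0."""
    return [1 if r > threshold else 0 for r in returns]


prices = [100.0, 102.0, 101.0, 101.0]   # hypothetical closing prices
returns = next_period_returns(prices)
labels = trend_labels(returns)
print(returns, labels)
```

Any of the linear, tree-based, or deep models listed above could then be fitted to features aligned with these targets.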
Besides different ML models, there is also a trend to utilize alternative data for improving predic-
tion performance. For instance, economic news [57], frequency of prices [158], social media [151],
financial events [33], investment behaviors [23], and weather information [164] have been used
as extra information to learn the intrinsic pattern of financial assets. Graph neural networks have
been introduced to model the relationship between stocks [24, 80, 110, 150]. Hybrid methods are
also proposed to further improve prediction performance [58, 86].

4.3 Enhancing Traditional Methods with ML


Another research direction is to enhance traditional rule-based methods with ML techniques. Lim
et al. [83] enhanced time-series momentum with deep learning. Takeuchi and Lee [129] applied
NNs to enhance cross-sectional momentum. Chauhan et al. [22] took uncertainty and look-ahead
into account based on factor models. Alphastock [138] proposed a deep reinforcement attention network
to improve the classic buying-winners-and-selling-losers strategy [63]. Gu et al. explored the ability
of ML techniques on asset pricing. In [49], an autoencoder architecture was proposed for asset pricing.
Compared to pure ML methods, these methods keep the original financial insight and have better
explainability.
Even though supervised ML methods achieve great success in financial forecasting with the
combination of feature engineering techniques, there is still a non-negligible gap between accurate
prediction and profitable trading actions. RL methods can tackle this obstacle through learning an
end-to-end agent, which maps market information into trading actions directly. In the next section,
we will discuss notable RL methods for QT tasks and why they are superior to traditional methods.

5 REINFORCEMENT LEARNING FOR QUANTITATIVE TRADING


In this section, we present a comprehensive review of notable RL-based methods for QT. We go
through existing works across four mainstream QT tasks with a summary table at the end of each
subsection.

5.1 Categories of RL-based QT Models


In order to provide a bird's-eye view of this field, existing works are classified from different
perspectives. Table 3 summarizes existing works from the RL algorithm perspective. Q-learning and
recurrent RL are the most popular RL algorithms for QT. Recent trends indicate DRL methods
such as DQN, DDPG, and PPO outperform traditional RL methods. In addition, we use three pie
charts to provide taxonomies of existing works based on financial markets, financial assets, and
data frequencies. The percentage numbers shown in the pie charts are calculated by dividing the
number of papers belonging to each type by the total number of papers. We classify existing
works based on financial markets (illustrated in Figure 4(a)). The US market is the most studied
market in the literature. The Chinese market has become popular in recent years. The study of the
European market is mainly in the early era. We classify existing works based on financial assets
(illustrated in Figure 4(b)). Stock data is used in more than 40% of publications. Stock index is the
second most popular option. There are also some works focusing on cryptocurrency in recent
years. We classify existing works based on data frequencies (illustrated in Figure 4(c)). About
half of the papers use day-level data since it is easy to access. For order execution and market making,
fine-grained data (e.g., second-level and minute-level) are often used to simulate the micro-level
financial market.

Table 3. Publications Based on Different Reinforcement Learning Algorithms

Category       RL algorithm     Publication
Value-Based    Q-learning       [8, 46, 55, 62, 74, 84, 97–99, 106, 121, 163]
Value-Based    SARSA            [21, 27, 121, 122]
Value-Based    DQN              [26, 64, 100, 101, 125, 139, 155, 161]
Policy-Based   Recurrent RL     [29, 30, 34, 92–94, 117, 118, 149]
Policy-Based   REINFORCE        [139]
Policy-Based   PG               [7, 81, 140, 160, 161]
Policy-Based   TRPO             [11, 136]
Actor-Critic   DPG              [65, 66, 87, 153]
Actor-Critic   PPO              [26, 40, 81, 85, 152, 155]
Actor-Critic   DDPG             [81, 111, 148, 152]
Actor-Critic   SAC              [155]
Actor-Critic   A2C              [152, 161]
Others         Model-based RL   [50, 154]
Others         Multi-Agent RL   [61, 73, 74]

Fig. 4. Categorization of existing works.

5.2 RL in Algorithmic Trading


Algorithmic trading refers to trading one particular financial asset with signals generated automatically
by computer programs. It has been widely applied in trading all kinds of financial assets. In
this subsection, we present a review of most RL-based algorithmic trading papers dating back
to the 1990s.
Policy-based methods. To tackle the limitations of supervised learning methods, Moody and
Wu [93] made the first attempt to apply RL in algorithmic trading. In this paper, an agent is trained
with recurrent RL (RRL) to optimize the overall profit directly. A novel evaluation metric called
Differential Sharpe Ratio is designed as the optimization objective to improve the performance. Em-
pirical study on artificial price data shows that it outperforms previous forecasting-based methods.
Based on the same algorithm, further experiments are conducted using monthly S&P stock index
data [94] and US Dollar/British Pound exchange data [92]. As an extension of RRL [93], Dempster
and Leemans [29] proposed an adaptive RRL framework with three layers. Layer one adds 14
technical indicators as extra market information. Layer two evaluates trading actions from layer
one with consideration of risk factors. The goal of layer three is to search for optimal values
of hyperparameters in layer two. With the three-layer architecture, it outperforms baselines on
Euro/US Dollar exchange data. Vittori et al. [136] proposed a risk-averse algorithm called Trust
Region Volatility Optimization (TRVO) for option hedging. TRVO trains a sheaf of agents
characterized by different risk aversion methods and is able to span an efficient frontier on the
volatility-p&l space. Simulation results demonstrate that TRVO outperforms the classic Black &
Scholes delta hedge [12].
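The recurrent structure of RRL, in which the previous position is fed back as an input, can be sketched for a single scalar feature as follows. This is a simplified illustration under stated assumptions: the parameters `w`, `u`, and `b` and the proportional `cost` term are hypothetical, and the plain Sharpe ratio stands in for the Differential Sharpe Ratio used in [93].

```python
import math


def rrl_positions(returns, w, u, b):
    """F_t = tanh(w * r_t + u * F_{t-1} + b), with positions in [-1, 1]."""
    positions, prev = [], 0.0
    for r in returns:
        prev = math.tanh(w * r + u * prev + b)  # previous output is part of the input
        positions.append(prev)
    return positions


def trading_returns(returns, positions, cost=0.0):
    """R_t = F_{t-1} * r_t - cost * |F_t - F_{t-1}|, starting from a flat position."""
    out, prev = [], 0.0
    for r, f in zip(returns, positions):
        out.append(prev * r - cost * abs(f - prev))
        prev = f
    return out


def sharpe_ratio(xs):
    """Mean divided by standard deviation (0.0 when the series is constant)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean / math.sqrt(var) if var > 0 else 0.0
```

Training would adjust `w`, `u`, and `b` to maximize the risk-adjusted performance of `trading_returns`, so that the trading objective is optimized directly rather than through a forecasting loss.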
With the development of deep learning, a few DRL methods are proposed for algorithmic trading.
FDDR [30] enhanced the classic RRL method [94] with deep neural networks. An RNN layer is
used to learn meaningful recurrent representations of the market. In addition, a fuzzy extension
is proposed to further reduce the uncertainty. FDDR achieves great performance on both stock
index and commodity futures. To balance between profit and risk, a multi-objective RL method
with LSTM layers [118] is proposed. Through optimizing profit and Sharpe Ratio simultaneously,
the agent achieves better performance on three Chinese stock index futures.
Value-based methods. QSR [46] uses Q-learning to optimize absolute profit and relative risk-
adjusted profit, respectively. A combination of two networks is employed to improve performance
on US Dollar/German Deutschmark exchange data. Lee and Jangmin [74] proposed a multi-agent
Q-learning framework for stock trading. Four cooperative agents are designed to generate trading
signals and order prices for both buy and sell side. Through sharing training episodes and learned
policies with each other, this method achieves better performance in terms of both profit and
risk management on the Korea stock market compared to supervised learning baselines. In [62],
the authors firstly design some local traders based on dynamic programming and heuristic rules.
Later on, they apply Q-learning to learn a meta policy of these local traders on Korea stock mar-
kets. de Oliveira et al. [27] implemented a SARSA-based RL method and tested it on 10 stocks in
the Brazil market.
DQN is used to enhance trading systems by considering trading frequencies, market confusion,
and transfer learning [64]. The trading frequency is determined in three ways: (1) a heuristic func-
tion related to Q-value, (2) an action-dependent NN regressor, and (3) an action-independent NN
regressor. Another heuristic function is applied to add a filter as the agent’s certainty on market
condition. Moreover, the authors train the agent on selected component stocks and apply the
pre-trained weights as the starting point for different stock indexes. Experiments on four different
stock indexes demonstrate the effectiveness of the proposed framework. DeepScalper [125] applied
a branching dueling Q-network [141] for intraday trading. An encoder-decoder architecture is
proposed to learn market embedding incorporating both micro-level and macro-level market information.
A novel hindsight bonus is added in the reward function to encourage a long-term horizon.
The authors also design an auxiliary task by predicting future volatility. DeepScalper significantly
outperforms many baselines on Chinese treasury bond and stock index markets. Riva et al. [106]
proposed a value-based RL algorithm to train agents for FX trading via fitted-Q iteration. The
importance of tuning control frequency is studied in order to obtain effective trading policies.
Other methods. iRDPG [87] is an adaptive DPG-based framework. Due to the noisy nature of
financial data, the authors formulate algorithmic trading as a Partially Observable Markov De-
cision Process (POMDP). GRU layers are introduced in iRDPG to learn recurrent market embed-
ding. In addition, the authors apply behavior cloning with expert trading actions to guide iRDPG
and achieve great performance on two Chinese stock index futures. There are also some works
focusing on evaluating the performance of different RL algorithms on their own data. Zhang et al.
[161] evaluated DQN, PG, and A2C on the 50 most liquid futures contracts. Yuan et al. [155] tested
PPO, DQN, and SAC on three selected stocks. Based on these two works, DQN achieves the best
overall performance among different financial assets.

Table 4. Summary of RL for Algorithmic Trading

Reference   RL method        Data Source                                  Asset Type                   Market                          Data frequency
[93]        RRL              -                                            Artificial                   -                               -
[94]        RRL              -                                            Stock Index                  USA                             1 Month
[46]        Q-learning       Hand-Crafted                                 FX                           -                               1 Day
[92]        RRL              Lagged Return                                FX                           -                               30 Min
[74]        Multi-agent RL   Hand-Crafted                                 Stock Index                  Korea                           -
[62]        Q-learning       Hand-Crafted                                 Stock Index                  Korea                           -
[29]        RRL              Lagged Return, Technical Indicator           FX                           -                               1 Min
[8]         Q-learning       Lagged Return                                Artificial                   -                               -
[8]         Q-learning       Lagged Return                                Stock                        Italy                           1 Day
[30]        RRL              Price                                        Stock Index, Commodity       China                           1 Min
[118]       RRL              Lagged Return                                Stock Index                  China                           1 Min
[64]        DQN              Lagged Return                                Stock Index                  USA, Hong Kong, Europe, Korea   1 Day
[27]        SARSA            OHLC, Technical Indicator                    Stock                        Brazil                          15 Min
[87]        DDPG             OHLC, Technical Indicator                    Stock Index                  China                           1 Min
[136]       TRPO             Hand-crafted                                 Artificial                   -                               -
[161]       DQN, PG, A2C     Price, Lagged Return, Technical Indicators   Stock Index, Commodity       -                               -
[155]       PPO, DQN, SAC    OHLC                                         Stock                        China                           1 Day
[125]       DQN              Price, Order Book                            Stock Index, Treasury Bond   China                           1 Min
[106]       Q-learning       -                                            FX                           -                               1 Min

Summary. Although existing works demonstrate the potential of RL for quantitative trading,
there is seemingly no consensus on a general ranking of different RL algorithms (notably, we
acknowledge the no-free-lunch theorem). The summary of algorithmic trading publications is
in Table 4. In addition, most existing RL-based works only focus on general AT, which tries to make
profit through trading one asset. In finance, extensive trading strategies have been designed based
on trading frequency (e.g., high-frequency trading) and asset types (e.g., stock and cryptocurrency).

5.3 RL in Portfolio Management


Portfolio management, which studies the art of balancing between a collection of different financial
assets, has become a popular topic for RL researchers. In this subsection, we survey the most
notable existing works on RL-based portfolio management.
Policy-based methods. Since a portfolio is essentially a weight distribution among different
financial assets, policy-based methods are the most widely applied RL methods for PM. Almahdi
and Yang [1] proposed an RRL-based algorithm for portfolio management. Maximum drawdown
is applied as the objective function to measure downside risk. In addition, an adaptive version
is designed with a transaction cost and market condition stop-loss retraining mechanism. In or-
der to extract information from historical trading records, Investor-Imitator [34] formalizes the
trading knowledge by imitating the behavior of an investor with a set of logic descriptors. More-
over, to instantiate specific logic descriptors, the authors introduce a Rank-Invest model that can
keep the diversity of different logic descriptors through optimizing a variety of evaluation metrics
with RRL. Investor-Imitator attempts to imitate three types of investors (oracle investor, collab-
orator investor, public investor) by designing investor-specific reward function for each type. In
the experiments on the Chinese stock market, Investor-Imitator successfully extracts interpretable
knowledge of portfolio management that can help human traders better understand the financial
market. Alphastock [138] is another policy-based RL method for portfolio management. An LSTM
with a history state attention model is used to learn better stock representations. A cross-asset
attention network (CAAN) incorporating price rising rank prior is added to further describe the
interrelationships among stocks. Later on, the output of CAAN (winning score of each stock) is
fed into a heuristic portfolio generator to construct the final portfolio. Policy gradient is used to
optimize the Sharpe Ratio. Experiments on both U.S. and Chinese stock markets show that Alphastock
achieves robust performance over different market states. EI³ [117] is another RRL-based
method, which tries to build profitable cryptocurrency portfolios by extracting multi-scale pat-
terns in the financial market. Inspired by the success of Inception networks [128], the authors de-
sign a multi-scale temporal feature aggregation convolution framework with two CNN branches
to extract short-term and mid-term market embedding and a max pooling branch to extract the
highest price information. To bridge the gap between the traditional Markowitz portfolio and RL-
based methods, Benhamou et al. [7] applied PG with a delayed reward function and showed better
performance than the classic Markowitz efficient frontier.
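Two building blocks recur in the policy-based methods above: mapping raw policy scores to a valid weight distribution, and measuring downside risk with maximum drawdown. A minimal sketch of both follows; using a softmax to obtain long-only weights is an assumption for illustration, not the construction of any particular cited work.

```python
import math


def softmax(scores):
    """Map raw scores to non-negative portfolio weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def max_drawdown(wealth):
    """Largest peak-to-trough decline of a wealth curve, as a fraction of the peak."""
    peak, worst = wealth[0], 0.0
    for v in wealth:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


print(softmax([0.0, 0.0]))                  # equal scores give equal weights
print(max_drawdown([1.0, 1.2, 0.9, 1.1]))   # drop from 1.2 to 0.9
```

A policy network would output the scores, and a negative maximum drawdown could then enter the reward as a downside-risk term.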
Zhang et al. [160] proposed a cost-sensitive PM framework based on direct policy gradient. To
learn more robust market representation, a novel two-stream portfolio policy network is designed
to extract both price series pattern and the relationship between different financial assets. In addi-
tion, the authors design a new cost-sensitive reward function to take the trading cost constraint
into consideration with theoretically near-optimal guarantee. Finally, the effectiveness of the cost-
sensitive framework is demonstrated on real-world cryptocurrency datasets. Xu et al. [149] proposed
a novel relation-aware transformer (RAT) under the classic RRL paradigm. RAT is structurally
innovated to capture both sequential patterns and the inner correlations between financial
assets. Specifically, RAT follows an encoder-decoder structure, where the encoder is for sequential
feature extraction and the decoder is for decision making. Experiments on two cryptocurrency
datasets and one stock dataset not only show RAT’s superior performance over existing baselines but also
demonstrate that RAT can effectively learn better representation and benefit from leverage operation.
Bisi et al. [11] derived a PG theorem with a novel objective function, which exploited the
mean-volatility relationship. The new objective could be used in actor-only algorithms such as
TRPO with monotonic improvement guarantees. Wang et al. [140] proposed DeepTrader, a PG-
based DRL method, to tackle the risk-return balancing problem in PM. The model simultaneously
uses negative maximum drawdown and price rising rate as reward functions to balance between
profit and risk. The authors propose an asset scoring unit with graph convolution layer to capture
temporal and spatial interrelations among stocks. Moreover, a market scoring unit is designed
to evaluate the market condition. DeepTrader achieves great performance across three different
markets.
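To make the idea of a cost-sensitive reward concrete, the sketch below charges a proportional fee on portfolio turnover. This is an assumed, simplified cost model and not the exact reward function of [160]; the function name and the default `cost_rate` are hypothetical.

```python
def net_portfolio_return(weights, prev_weights, asset_returns, cost_rate=0.0025):
    """Gross portfolio return minus a proportional cost on turnover."""
    gross = sum(w * r for w, r in zip(weights, asset_returns))
    # Turnover: total absolute change in weights caused by rebalancing.
    turnover = sum(abs(w - p) for w, p in zip(weights, prev_weights))
    return gross - cost_rate * turnover


# Rebalancing from a single asset to an equal split costs 1.0 of turnover.
print(net_portfolio_return([0.5, 0.5], [1.0, 0.0], [0.02, 0.0], cost_rate=0.01))
```

Because the reward depends on the previous weights, a cost-aware agent needs its last portfolio as part of the state.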
Actor-critic methods. Jiang et al. [66] proposed a DPG-based RL framework for portfolio
management. The framework consists of three novel components: (1) the Ensemble of Identical
Independent Evaluators (EIIE) topology; (2) a Portfolio Vector Memory (PVM); and (3) an
Online Stochastic Batch Learning (OSBL) scheme. Specifically, the idea of EIIE is that the em-
bedding concatenation of output from different NN layers can learn better market representation
effectively. In order to take transaction costs into consideration, PVM uses the output portfolio at
the last time step as part of the input of current time step. The OSBL training scheme makes sure
that all data points in the same batch are trained in the original time order. To demonstrate the
effectiveness of proposed components, extensive experiments using different NN architectures are
conducted on cryptocurrency data. Later on, more comprehensive experiments are conducted in
an extended version [65]. To model the data heterogeneity and environment uncertainty in PM, Ye
et al. [153] proposed a State-Augmented RL (SARL) framework based on DPG. SARL learns
the price movement prediction with financial news as additional states. Extensive experiments on
both cryptocurrency and U.S. stock markets validate that SARL outperforms previous approaches
in terms of return rate and risk-adjusted criteria. Another popular actor-critic RL method for
portfolio management is DDPG. Xiong et al. [148] constructed a highly profitable portfolio with
DDPG on the Chinese stock market. PROFIT [111] is another DDPG-based approach that makes
time-aware decisions on PM with text data. The authors make use of a custom policy network
that hierarchically and attentively learns time-aware representations of news and tweets for PM,
which is generalizable among various actor-critic RL methods. PROFIT shows promising performance
on both the China and U.S. stock markets. Murray et al. [96] proposed a novel actor-critic
algorithm for solving general risk-averse stochastic control problems and used it to learn hedging
strategies for portfolio management across multiple risk aversion levels simultaneously.
Other methods. Neuneier [97] made an attempt to formalize portfolio management as an MDP
and trained an RL agent with Q-learning. Experiments on German stock market demonstrate its
superior performance over heuristic benchmark policy. Later on, a shared value-function for dif-
ferent assets and model-free policy-iteration are applied to further improve the performance of
Q-learning in [98]. There are a few model-based RL methods that attempt to learn some models of
the financial market for portfolio management. Yu et al. [154] proposed the first model-based RL
framework for portfolio management, which supports both off-policy and on-policy settings. The
authors design an Infused Prediction Module (IPM) to predict future price, a Data Augmen-
tation Module (DAM) with recurrent adversarial networks to mitigate the data deficiency issue,
and a Behavior Cloning Module (BCM) to reduce the portfolio volatility. MetaTrader [101] is a
novel two-stage RL-based approach for portfolio management, which learns to integrate diverse
trading policies to adapt to various market conditions. In the first stage, MetaTrader incorporates
an imitation learning objective into the reinforcement learning framework. Through imitating dif-
ferent expert demonstrations, MetaTrader acquires a set of trading policies with great diversity. In
the second stage, MetaTrader learns a meta-policy to recognize the market conditions and decides
on the most proper learned policy to follow.
Portfolio management is also formulated as a multi-agent RL problem. MAPS [73] is a cooper-
ative multi-agent RL system in which each agent is an independent “investor” creating its own
portfolio. The authors design a novel loss function to guide each agent to act as diversely as possible
while maximizing its long-term profit. MAPS outperforms most baselines with 12 years
of U.S. stock market data. In addition, the authors find that adding more agents to MAPS can lead
to a more diversified portfolio with higher Sharpe Ratio. MSPM [61] is a multi-agent RL frame-
work with a modularized and scalable architecture for PM. MSPM consists of the Evolving Agent
Module (EAM) to learn market embedding with heterogeneous input and the Strategic Agent
Module (SAM) to produce profitable portfolios based on the output of EAM.
Some works compare the profitability of portfolios constructed by different RL algorithms on
their own data. Liang et al. [81] compared the performance of DDPG, PPO, and PG on the Chinese
stock market. Yang et al. [152] first tested the performance of PPO, A2C, and DDPG on the
U.S. stock market. Later on, the authors find that the ensemble strategy of these three algorithms
can integrate the best features and show more robust performance adjusting to different market
situations.
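One simple way to build such an ensemble is to select, for each trading window, the agent with the best risk-adjusted performance on a preceding validation window. The sketch below assumes Sharpe ratio as the selection criterion and uses made-up validation returns; neither detail is specified in this survey.

```python
import math


def sharpe_ratio(returns):
    """Mean divided by standard deviation (0.0 when the series is constant)."""
    mean = sum(returns) / len(returns)
    var = sum((x - mean) ** 2 for x in returns) / len(returns)
    return mean / math.sqrt(var) if var > 0 else 0.0


def pick_agent(validation_returns):
    """Return the name of the agent with the highest validation Sharpe ratio."""
    return max(validation_returns,
               key=lambda name: sharpe_ratio(validation_returns[name]))


# Hypothetical per-period returns of three trained agents on a validation window.
validation_returns = {
    "PPO": [0.01, 0.02, 0.01],
    "A2C": [0.05, -0.05, 0.0],
    "DDPG": [-0.01, -0.02, -0.01],
}
print(pick_agent(validation_returns))
```

Repeating the selection on a rolling basis lets the ensemble switch agents as market conditions change.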
Summary. Since a portfolio is a vector of weights for different financial assets, which naturally
corresponds to a policy, policy-based methods are the most widely-used RL methods for PM. There
are also many successful examples based on actor-critic algorithms. The summary of portfolio
management publications is in Table 5. We point out two issues of existing methods: (1) Most of
them ignore the interrelationship between different financial assets, which is valuable for human
portfolio managers. (2) Existing works construct portfolios from a relatively small pool of stocks
(e.g., 20 in total). However, the real market contains thousands of stocks and common RL methods
are vulnerable when the action space is very large [35].

Table 5. Summary of RL for Portfolio Management

Reference   RL method         Data Source                       Asset Type       Market                  Data frequency
[97]        Q-learning        Technical Indicator               Stock Index      Germany                 1 Day
[98]        Q-learning        Hand-crafted                      Stock Index      Germany                 1 Day
[66]        DPG               Price, Hand-crafted               Cryptocurrency   -                       30 Min
[1]         RRL               -                                 Stock Index      -                       -
[65]        DPG               Lagged Portfolio                  Cryptocurrency   -                       30 Min
[81]        DDPG, PPO, PG     OHLC, Technical Indicator         Stock            China                   1 Day
[34]        RRL               Hand-crafted                      Stock            China                   1 Day
[148]       DDPG              Price, Hand-crafted               Stock            USA                     1 Day
[138]       -                 Technical Indicator               Stock            China, USA              -
[154]       Model-based RL    OHLC, Hand-crafted                Stock            USA                     1 Hour
[117]       RRL               Price, Lagged Portfolio           Cryptocurrency   -                       -
[7]         PG                Price, Hand-crafted               -                -                       1 Day
[160]       PG                OHLC, Lagged Return               Cryptocurrency   -                       -
[152]       PPO, A2C, DDPG    Price, Hand-crafted               Stock            USA                     1 Day
[73]        Multi-agent RL    Technical Indicator               Stock            USA                     1 Day
[111]       DDPG              Financial Text, Hand-crafted      Stock            USA, China              1 Min
[153]       DPG               Financial Text, OHLC              Stock            USA                     1 Day
[153]       DPG               Financial Text, OHLC              Cryptocurrency   -                       30 Min
[149]       RRL               OHLC                              Stock            USA                     30 Min
[149]       RRL               OHLC                              Cryptocurrency   -                       30 Min
[139]       REINFORCE, DQN    OHLC, Lagged Portfolio            Stock            USA, China              1 Day
[11]        TRPO              Lagged Return, Lagged Portfolio   Artificial       -                       -
[11]        TRPO              Lagged Return, Lagged Portfolio   Stock Index      USA                     1 Day
[140]       PG                Technical Indicator               Stock            USA, China, Hong Kong   -
[61]        Multi-agent RL    Price, Financial Text             Stock            USA                     1 Day
[101]       DQN               Technical Indicator               Stock            USA, China, Hong Kong   -
[96]        Actor-Critic      Technical Indicator               Artificial       -

5.4 RL in Order Execution


Different from AT and PM, order execution (OE) is a micro-level QT task, which tries to trade a
fixed number of shares in a given time horizon and minimize the execution cost. In the real financial
market, OE is extremely important for institutional traders whose trading volume is large enough
to have an obvious impact on the market price.
Nevmyvaka et al. [99] proposed the first RL-based method for large-scale order execution. The
authors use Q-learning to train the agent with real-world LOB data. With carefully designed
state, action, and reward functions, the Q-learning framework can significantly outperform
traditional baselines. Hendricks and Wilcox [55] implemented another Q-learning-based RL method
on the South African market by extending the popular Almgren-Chriss model with linear price
impact. Ning et al. [100] proposed an RL framework using Double DQN and evaluated its performance
on nine different U.S. stocks. Dabérius et al. [26] implemented DDQN and PPO and compared their
performance with TWAP.
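The TWAP baseline and the implementation shortfall measure that recur in these works can be illustrated with a minimal sketch. The function names, the per-share convention, and the sign convention (positive means worse than the arrival price) are assumptions chosen for illustration and are not taken from any of the cited papers.

```python
def twap_schedule(total_shares, num_slices):
    """Split a parent order into near-equal child orders, one per time slice.

    Any remainder is assigned to the earliest slices so the total is preserved.
    """
    base, extra = divmod(total_shares, num_slices)
    return [base + (1 if i < extra else 0) for i in range(num_slices)]


def implementation_shortfall(arrival_price, fills, side="sell"):
    """Per-share execution cost relative to the arrival price.

    fills is a list of (shares, price) pairs. A positive value means the
    execution was worse than trading everything at the arrival price.
    """
    shares = sum(q for q, _ in fills)
    avg_price = sum(q * p for q, p in fills) / shares
    sign = 1.0 if side == "sell" else -1.0
    return sign * (arrival_price - avg_price)


# Hypothetical example: sell 1000 shares in 4 slices while the price drifts down.
schedule = twap_schedule(1000, 4)          # [250, 250, 250, 250]
fills = list(zip(schedule, [10.0, 9.9, 9.8, 9.7]))
print(schedule, implementation_shortfall(10.0, fills, side="sell"))
```

An RL agent for OE is typically judged by whether its learned schedule achieves a lower shortfall than such a fixed schedule on the same price path.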

Table 6. Summary of RL for Order Execution

Reference RL method Data Source Asset Type Market Data frequency


[99] Q-learning Price, Hand-crafted Stock USA Millisecond
[55] Q-learning Hand-crafted Stock South Africa 5 Min
[100] DQN LOB Stock USA 1 Second
[26] DDQN, PPO Hand-crafted Artificial - -
[85] PPO LOB Stock USA Millisecond
[40] PPO OHLC, Hand-crafted Stock China 1 Min

PPO is another widely used RL method for OE. Lin and Beling [85] proposed an end-to-end
PPO-based framework, in which MLP and LSTM networks are tested for capturing temporal dependencies.
The authors design a sparse reward function instead of the previously used implementation shortfall
(IS) or a shaped reward function, which leads to state-of-the-art performance on 14 stocks in the
U.S. market. Fang et al. [40] proposed another PPO-based framework to bridge the gap between
noisy, imperfect market states and the optimal action sequences for OE. The framework leverages
a policy distillation method with an entropy regularization term in the loss function to guide
the student agent toward learning a policy similar to that of an oracle teacher with perfect
information of the financial market. Moreover, the authors design a normalized reward function to
encourage universal learning among different stocks. Extensive experiments on the Chinese stock
market demonstrate that the proposed method significantly outperforms various baselines with
reasonable trading actions.
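The idea of distilling an oracle teacher into a student policy can be sketched as follows. This is an assumed, generic form of a distillation loss (negative log-likelihood of the teacher's action under the student policy, minus an entropy bonus), not the exact loss used in [40]; the names `distillation_loss` and `entropy_coef` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_action, entropy_coef=0.01):
    """Assumed loss: imitate the teacher's action while keeping the policy stochastic.

    The first term pulls the student toward the action chosen by the oracle
    teacher; the entropy term discourages premature collapse to one action.
    """
    probs = softmax(student_logits)
    nll = -math.log(probs[teacher_action])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return nll - entropy_coef * entropy


# The loss is lower when the student already prefers the teacher's action.
print(distillation_loss([2.0, 0.0, 0.0], teacher_action=0))
print(distillation_loss([0.0, 2.0, 0.0], teacher_action=0))
```

In a PPO-based setup, a term of this kind would be added to the usual policy and value losses, with the teacher trained on information (e.g., future prices) that the student cannot observe.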
We present a summary of existing RL-based order execution publications in Table 6. Although
there are a few successful examples using either Q-learning or PPO on order execution, existing
works share a few limitations. First, most algorithms are only tested on stock data. Their
performance on other financial assets (e.g., futures and cryptocurrency) is still unclear. Second,
the execution time window (e.g., one day) is too long, which makes the task easier. In practice,
professional traders usually finish the execution process in a much shorter time window (e.g., 10
minutes). Third, existing works will fail when the trading volume is huge, because all of them
assume there is no obvious market impact, which is unrealistic in large-volume settings. In the real
world, institutional investors need to execute a large number of shares in a relatively short
time window. There is still a long way to go for researchers to tackle these limitations.
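To see why the no-impact assumption breaks down for large orders, consider a toy model with linear temporary price impact, in which each child order of q shares is filled at a price worse than the mid price by impact_coef * q. The model and the name `impact_cost` are illustrative assumptions, not the setting of any surveyed paper.

```python
def impact_cost(child_orders, impact_coef):
    """Total cost (in currency units) caused by linear temporary impact.

    Each child order of q shares is filled impact_coef * q away from the mid
    price, so it costs impact_coef * q * q in total.
    """
    return sum(impact_coef * q * q for q in child_orders)


# Hypothetical numbers: 1000 shares, impact of 0.0001 per share traded.
one_shot = impact_cost([1000], 1e-4)     # about 100
sliced = impact_cost([250] * 4, 1e-4)    # about 25
print(one_shot, sliced)
```

Because the cost grows quadratically with child order size, a backtest that replays historical prices without any impact term understates the cost of large orders and removes the incentive to slice them, which is the core of the OE problem.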

5.5 RL in Market Making


Market making refers to trading activities that buy and sell one given asset simultaneously at a
desired price. The goal of a market maker is to provide liquidity to the market and make a profit
from the tiny price spread between buy and sell orders. In this subsection, we discuss existing
RL-based methods for market making.
Table 7. Summary of RL for Market Making

Reference RL method Data Source Asset Type Market Data frequency
[21] SARSA Hand-crafted Artificial - -
[121] Q-learning, SARSA Hand-crafted - - Millisecond
[84] Q-learning Hand-crafted Artificial - -
[50] Model-based RL Hand-crafted Bond Europe -
[163] Q-learning LOB - - Event
[122] SARSA Hand-crafted Artificial - -
[162] - Hand-crafted Stock & Treasury US 5 Min

Chan and Shelton [21] made the first attempt to apply RL to market making without any
assumption about the market. Simulations showed that the RL method successfully converged to
optimal strategies in a few controlled environments. Spooner et al. [121] focused on designing and
analyzing temporal-difference (TD) RL methods for market making. The authors first build a
realistic, data-driven simulator with millisecond LOB data for market making. With an
asymmetrically dampened reward function and a linear combination of tile codings as the state, both
Q-learning and SARSA outperform previous baselines. Lim and Gorse [84] proposed a Q-learning-based
algorithm with a novel usage of CARA utility as the terminal reward for market making. Guéant
and Manziuk [50] proposed a model-based actor-critic RL algorithm, which focuses on market
making optimization for multiple corporate bonds. Zhong et al. [163] proposed a model-free and
off-policy Q-learning algorithm to develop a trading strategy implemented with a simple lookup
table. The method achieves great performance on event-by-event LOB data, as confirmed by a
professional trading firm. For training robust market making agents, Spooner and Savani [122]
introduced a game-theoretic adaptation of the traditional mathematical market making model. The
authors thoroughly investigate the impact in three environmental settings with adversarial RL. Zhao
and Linetsky [162] proposed a high-frequency feature called Book Exhaustion Rate (BER),
which can serve as a direct measurement of the adverse selection risk from an equilibrium point
of view. The authors train a market making agent via RL using three years of LOB data on Chicago
Mercantile Exchange S&P 500 and achieve stable performance.
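The two reward designs mentioned in this subsection can be sketched in a few lines. The survey does not spell out either formula, so the code assumes a commonly cited form of the asymmetrically dampened reward (the change in PnL minus a fraction eta of any positive inventory gain) and the standard CARA utility U(w) = -exp(-gamma * w). Parameter names and default values are hypothetical.

```python
import math


def asymmetrically_dampened_reward(pnl_change, inventory, mid_price_change, eta=0.5):
    """Assumed form: penalize gains that come from holding inventory.

    Profits from a favorable mid-price move on the inventory are dampened by
    eta, while losses from an unfavorable move are left untouched, which
    discourages the agent from speculating on price direction.
    """
    inventory_pnl = inventory * mid_price_change
    return pnl_change - max(0.0, eta * inventory_pnl)


def cara_utility(wealth, risk_aversion=0.1):
    """Constant absolute risk aversion utility, usable as a terminal reward."""
    return -math.exp(-risk_aversion * wealth)


# A gain driven by inventory is dampened; a loss is passed through unchanged.
print(asymmetrically_dampened_reward(1.0, inventory=10, mid_price_change=0.1))
print(asymmetrically_dampened_reward(-1.0, inventory=10, mid_price_change=-0.1))
print(cara_utility(0.0), cara_utility(10.0))
```

Both designs encode the same preference: a market maker should earn the spread rather than carry directional risk.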
Even though market making is a fundamental task in quantitative trading, research on RL-based
market making is still at an early stage. The few existing works simply apply different RL methods
to their own data. The summary of market making publications is in Table 7. To fully realize
the potential of RL for market making, one major obstacle is the lack of a high-fidelity micro-level
market simulator. At present, there is still no reasonable way to simulate the ubiquitous market
impact. This non-negligible gap between simulation and the real market limits the usage of RL in
market making.

6 OPEN ISSUES AND FUTURE DIRECTIONS


Although existing works have demonstrated the success of RL methods on QT tasks, there is still
considerable room for improvement. In this section, we point out a few prospective research
directions and elaborate on several critical open issues and potential solutions.

6.1 Advanced RL Techniques on QT


Most existing works are only straightforward applications of classic RL methods to QT tasks. The
effectiveness of more advanced RL techniques on financial data is not well explored. We point out
a few promising directions in this subsection.
First, data scarcity is a major challenge in applying RL to QT tasks. Model-based RL can speed
up the training process by learning a model of the financial market [154]. The worst-case scenario
(e.g., a financial crisis) can be used as a regularizer when maximizing the accumulated reward.
Second, the key objective of QT is to balance maximizing profit against minimizing risk.
Multi-objective RL techniques provide a tool for managing the trade-off between profit and risk.
Training diversified trading policies with different risk tolerances is an interesting direction.
Third, graph learning [147] has shown promising results in modeling the ubiquitous relationships
between stocks in supervised learning [42, 110]. Combining graph learning with RL to model the
internal relationships between different stocks or financial markets is an interesting future
direction. Fourth, the severe distribution shift of the financial market makes RL-based methods
generalize poorly to new market conditions. Meta-RL and transfer learning techniques can help improve
RL-based QT models’ generalization performance across different financial assets or markets. Fifth,
for high-risk decision-making tasks such as QT, an agent's actions need to be explained to human
traders as a condition for their full acceptance of the algorithm. Hierarchical RL methods decompose
the main goal into subgoals for low-level agents. By learning the optimal subgoals for the low-level
agent, the high-level agent forms a representation of the financial market that is interpretable by
human traders. Sixth, for QT, learning through directly interacting with the real market is risky
and impractical. RL-based QT methods normally use historical data to learn a policy, which fits the
offline RL setting. Offline RL techniques can help to model the distribution shift and risk of the
financial market while training RL agents.

6.2 Alternative Data and New QT Settings


Intuitively, alternative data can provide extra information for learning a better representation of
the financial market. Economic news [57], frequency of prices [158], social media [151], financial
events [33], and investment behaviors [23] have been applied to improve the performance of financial
prediction. For RL-based methods, price movement embedding [153] and market condition
embedding [140] are incorporated as extra information to improve an RL agent’s performance. However,
existing works simply concatenate extra features or embeddings from multiple data sources as the
market representation. One interesting forward-looking direction is to utilize multi-modality
learning techniques to learn more meaningful representations from both original price and alternative
data while training RL agents. Besides alternative data, there are still some important QT settings
unexplored by RL researchers. Intraday trading, high-frequency trading, and pairs trading are a few
examples. Intraday trading tries to capture price fluctuation patterns within the same trading day;
high-frequency trading aims at capturing fleeting micro-level trading opportunities; and pairs
trading focuses on analyzing the relative trend of two highly correlated assets.
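As a concrete picture of the pairs trading setting, the relative trend of two assets is often summarized by the z-score of their price spread, which could serve as a state feature or a rule-based baseline for an RL agent. The sketch below is a simple illustration under assumed choices (a fixed hedge ratio, a sample standard deviation, and an entry threshold of 2), none of which come from the surveyed works.

```python
from statistics import mean, stdev


def spread_zscore(prices_a, prices_b, hedge_ratio=1.0):
    """Z-score of the latest spread between two price series."""
    spread = [a - hedge_ratio * b for a, b in zip(prices_a, prices_b)]
    sigma = stdev(spread)
    if sigma == 0:
        return 0.0
    return (spread[-1] - mean(spread)) / sigma


def pairs_signal(z, entry=2.0):
    """Return -1 (short A, long B), +1 (long A, short B), or 0 (no trade)."""
    if z > entry:
        return -1
    if z < -entry:
        return 1
    return 0


# Hypothetical prices: asset A jumps away from asset B on the last step.
z = spread_zscore([10, 10, 10, 10, 14], [10, 10, 10, 10, 10])
print(z, pairs_signal(z, entry=1.5))
```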

6.3 Enhance with Auto-ML


Due to the noisy nature of financial data and the brittleness of RL methods, the success of RL-based
QT models relies heavily on carefully designed RL components (e.g., reward function) and properly
tuned hyperparameters (e.g., network architecture). As a result, it is still difficult for people
without in-depth knowledge of RL, such as economists and professional traders, to design profitable
RL-based trading strategies. Auto-ML, which tries to design high-quality ML applications
automatically, can enhance the development of RL-based QT from three perspectives: (i) For feature
engineering, auto-ML can automatically construct, select, and collect meaningful features. (ii) For
hyperparameter tuning, auto-ML can automatically search for proper hyperparameters such as update
rule, learning rate, and reward function. (iii) For neural architecture search, auto-ML can
automatically search for suitable neural network architectures for training RL agents. With the
assistance of auto-ML, RL-based QT models can be more usable for people without in-depth knowledge
of RL. We believe that it is a promising research direction to facilitate the development of
RL-based QT models with auto-ML techniques.

6.4 More Realistic Simulation


High-fidelity simulation is the key foundation of RL methods’ success. Although existing works
take many practical constraints such as transaction fee [135], execution cost [139], and
slippage [87] into consideration, current simulation is far from realistic. The ubiquitous market
impact, which refers to the effect of one trader’s actions on other traders, is ignored. For leading
trading firms, their trading volume can account for over 10% of the total volume, with a significant
impact on other traders in the market. As a result, simulation with only historical market data is
not enough. There are some research efforts focusing on dealing with the market impact. Spooner et
al. [121] tried to take market impact into consideration for MM with event-level data. Byrd et al.
[15] proposed ABIDES, an agent-based financial market simulator to model market impact. Vyetrenko
et al. [137] surveyed the current status of market simulation and proposed a series of stylized
metrics to test the quality of simulation. Building high-fidelity market simulators is a very
challenging but important research direction.

6.5 The Field Needs More Unified and Harder Evaluation


When a new RL-based QT method is proposed, the authors are expected to compare their method
with SOTA baselines on some financial datasets. At present, the selection of baselines and datasets
is seemingly arbitrary, which leads to inconsistent reporting of revenues. As a result, there
is no wide consensus on the general ranking of RL-based methods for QT tasks, which makes it
extremely challenging to benchmark new RL algorithms in this field. How can this be solved? We can
borrow some experience from neighbouring ML fields such as computer vision and natural language
processing. A suite of standardized evaluation datasets and implementations of SOTA methods could
be a good solution to this problem. As for evaluation criteria, most existing works only evaluate
the profitability of RL algorithms with financial metrics such as total return, which ignores
several critical axes, such as risk control, explainability, diversity, reliability, and
universality [123]. In practice, due to the low signal-to-noise nature of financial markets, FinRL
methods that only show high profit in backtesting are likely to overfit historical data and fail in
real-world deployment [28]. We also note that the split into training, validation, and test sets in
most QT papers is rather arbitrary. Since there is significant distribution shift over time in the
financial market, it is better to split data on a rolling basis. In addition, it is well known that
the performance of RL methods is very sensitive to hyperparameters such as the learning rate. To
provide a more reliable evaluation of RL methods, authors should spend roughly the same time tuning
hyperparameters for both the baselines and their own methods. In practice, some authors put much
more effort into tuning their own methods than the baselines, which makes the reported revenue less
convincing. As David Shaw (founder of a world-class hedge fund) said, he will never trade with a
method that does not prove itself through a systematic evaluation. We urge the QT community to
conduct stricter evaluation of newly proposed methods. With proper datasets, baseline
implementations, and evaluation schemes, research on RL-based QT could develop faster.
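Splitting data on a rolling basis can be sketched as a walk-forward generator over time-ordered samples. The function name, the window sizes, and the default step (one test window) are assumptions for illustration; the only property that matters is that every validation and test window lies strictly after its training window.

```python
def rolling_splits(n_samples, train_size, val_size, test_size, step=None):
    """Yield (train, val, test) index ranges that move forward through time.

    Samples are assumed to be sorted chronologically. Each fold trains on a
    window, validates on the period right after it, and tests on the period
    after that, so no future data leaks into training.
    """
    step = step or test_size
    start = 0
    while start + train_size + val_size + test_size <= n_samples:
        train_end = start + train_size
        val_end = train_end + val_size
        test_end = val_end + test_size
        yield (range(start, train_end),
               range(train_end, val_end),
               range(val_end, test_end))
        start += step


# Hypothetical example: 10 time steps, 4 for training, 2 for validation, 2 for testing.
for train, val, test in rolling_splits(10, 4, 2, 2):
    print(list(train), list(val), list(test))
```

Reporting metrics averaged over all folds, rather than on a single hand-picked test period, makes results less dependent on one particular market regime.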

7 CONCLUSION
In this article, we provided a comprehensive review of the most notable works on RL-based QT
models. We proposed a classification scheme for organizing and clustering existing works, and we
highlighted a number of influential research prototypes. We also discussed the pros and cons of
utilizing RL techniques for QT tasks. In addition, we pointed out some of the most pressing open
problems and promising future directions. Both RL and QT have been active research topics over the
past few decades, with many newly developed techniques and emerging models each year. We hope
that this survey can provide readers with a comprehensive understanding of the key aspects of
this field, clarify the most notable advances, and shed some light on future research.

REFERENCES
[1] Saud Almahdi and Steve Y. Yang. 2017. An adaptive portfolio trading system: A risk-return portfolio optimization
using recurrent reinforcement learning with expected maximum drawdown. Expert Systems with Applications 87
(2017), 267–279.
[2] Robert Almgren and Neil Chriss. 2001. Optimal execution of portfolio transactions. Journal of Risk 3 (2001), 5–40.
[3] Bo An, Shuo Sun, and Rundong Wang. 2022. Deep reinforcement learning for quantitative trading: Challenges and
opportunities. IEEE Intelligent Systems 37, 2 (2022), 23–26.

[4] Adebiyi A. Ariyo, Adewumi O. Adewumi, and Charles K. Ayo. 2014. Stock price prediction using the ARIMA model.
In Proceedings of the 6th International Conference on Computer Modelling and Simulation (ICCMS). 106–112.
[5] Arash Bahrammirzaee. 2010. A comparative survey of artificial intelligence applications in finance: Artificial neural
networks, expert system and hybrid intelligent systems. Neural Computing and Applications 19, 8 (2010), 1165–1195.
[6] Suryoday Basak, Saibal Kar, Snehanshu Saha, Luckyson Khaidem, and Sudeepa Roy Dey. 2019. Predicting the direc-
tion of stock market prices using tree-based classifiers. The North American Journal of Economics and Finance 47
(2019), 552–567.
[7] Eric Benhamou, David Saltiel, Sandrine Ungari, and Abhishek Mukhopadhyay. 2020. Bridging the gap between
Markowitz planning and deep reinforcement learning. arXiv preprint arXiv:2010.09108 (2020).
[8] Francesco Bertoluzzo and Marco Corazza. 2012. Testing different reinforcement learning configurations for financial
trading: Introduction and applications. Procedia Economics and Finance 3 (2012), 68–77.
[9] Dimitris Bertsimas and Andrew W. Lo. 1998. Optimal control of execution costs. Journal of Financial Markets 1,
1 (1998), 1–50.
[10] Dinesh Bhuriya, Girish Kaushal, Ashish Sharma, and Upendra Singh. 2017. Stock market predication using a linear
regression. In Proceedings of 1st International Conference of Electronics, Communication and Aerospace Technology
(ICECA). 510–513.
[11] Lorenzo Bisi, Luca Sabbioni, Edoardo Vittori, Matteo Papini, and Marcello Restelli. 2019. Risk-averse trust region
optimization for reward-volatility reduction. arXiv preprint arXiv:1912.03193 (2019).
[12] Fischer Black and Myron Scholes. 1973. The pricing of options and corporate liabilities. The Journal of Political
Economy 81, 3 (1973), 637–654.
[13] John Bollinger. 2002. Bollinger on Bollinger Bands. McGraw-Hill New York.
[14] Allan Borodin, Ran El-Yaniv, and Vincent Gogan. 2004. Can we learn to beat the best stock. Journal of Artificial
Intelligence Research 21 (2004), 579–594.
[15] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. 2019. Abides: Towards high-fidelity market simulation
for AI research. arXiv preprint arXiv:1904.12066 (2019).
[16] Álvaro Cartea, Sebastian Jaimungal, and José Penalva. 2015. Algorithmic and High-frequency Trading.
[17] Álvaro Cartea, Sebastian Jaimungal, and Jason Ricci. 2014. Buy low, sell high: A high frequency trading perspective.
SIAM Journal on Financial Mathematics 5, 1 (2014), 415–444.
[18] Stephan K. Chalup and Andreas Mitschele. 2008. Kernel Methods in Finance. Chapter 27, 655–687.
[19] Ernest P. Chan. 2021. Quantitative Trading: How to Build Your Own Algorithmic Trading Business. Wiley.
[20] Louis K. C. Chan, Narasimhan Jegadeesh, and Josef Lakonishok. 1996. Momentum strategies. The Journal of Finance
51, 5 (1996), 1681–1713.
[21] Nicholas Tung Chan and Christian Shelton. 2001. An Electronic Market-maker. Technical Report. (2001).
[22] Lakshay Chauhan, John Alberg, and Zachary Lipton. 2020. Uncertainty-aware lookahead factor models for quanti-
tative investing. In Proceedings of the 37th International Conference on Machine Learning (ICML). 1489–1499.
[23] Chi Chen, Li Zhao, Jiang Bian, Chunxiao Xing, and Tie-Yan Liu. 2019. Investment behaviors can tell what inside:
Exploring stock intrinsic properties for stock trend prediction. In Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining (KDD). 2376–2384.
[24] Yingmei Chen, Zhongyu Wei, and Xuanjing Huang. 2018. Incorporating corporation relationship via graph con-
volutional neural networks for stock price prediction. In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (CIKM). 1655–1658.
[25] Thomas M. Cover. 1991. Universal portfolios. Mathematical Finance 1, 1 (1991), 1–29.
[26] Kevin Dabérius, Elvin Granat, and Patrik Karlsson. 2019. Deep execution-value and policy based reinforcement learn-
ing for trading and beating market benchmarks. Available at SSRN 3374766 (2019).
[27] Renato Arantes de Oliveira, Heitor S. Ramos, Daniel Hasan Dalip, and Adriano César Machado Pereira. 2020. A
tabular SARSA-based stock market agent. In Proceedings of the 1st ACM International Conference on AI in Finance
(ICAIF).
[28] Marcos Lopez De Prado. 2018. Advances in Financial Machine Learning. John Wiley & Sons.
[29] Michael A. H. Dempster and Vasco Leemans. 2006. An automated FX trading system using adaptive reinforcement
learning. Expert Systems with Applications 30, 3 (2006), 543–552.
[30] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2016. Deep direct reinforcement learning for
financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28, 3 (2016),
653–664.
[31] A. Victor Devadoss and T. Antony Alphonnse Ligori. 2013. Forecasting of stock prices using multi layer perceptron.
International Journal of Computing Algorithm 2 (2013), 440–449.
[32] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Pro-
ceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI). 2327–2333.

[33] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction.
In Proceedings of the 26th International Conference on Computational Linguistics. 2133–2142.
[34] Yi Ding, Weiqing Liu, Jiang Bian, Daoqiang Zhang, and Tie-Yan Liu. 2018. Investor-Imitator: A framework for trading
knowledge extraction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining (KDD). 1310–1319.
[35] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy
Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete
action spaces. arXiv preprint arXiv:1512.07679 (2015).
[36] Sophie Emerson, Ruairí Kennedy, Luke O’Shea, and John O’Brien. 2019. Trends and applications of machine learning
in quantitative finance. In Proceedings of the 8th International Conference on Economics and Finance Research (ICEFR).
[37] Eugene F. Fama. 2021. Efficient capital markets: A review of theory and empirical work. The Fama Portfolio (2021),
76–121.
[38] Eugene F. Fama and Kenneth R. French. 1993. Common risk factors in the returns on stocks and bonds. Journal of
Financial Economics 33, 1 (1993), 3–56.
[39] Jie Fang, Shutao Xia, Jianwu Lin, Zhikang Xia, Xiang Liu, and Yong Jiang. 2019. Alpha discovery neural network
based on prior knowledge. arXiv preprint arXiv:1912.11761 (2019).
[40] Yuchen Fang, Kan Ren, Weiqing Liu, Dong Zhou, Weinan Zhang, Jiang Bian, Yong Yu, and Tie-Yan Liu. 2021. Universal
trading for order execution with oracle policy distillation. In Proceedings of the 35th AAAI Conference on Artificial
Intelligence (AAAI).
[41] Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin
Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. 2022. Discovering
faster matrix multiplication algorithms with reinforcement learning. Nature 610, 7930 (2022), 47–53.
[42] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. 2019. Temporal relational ranking
for stock prediction. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 27.
[43] Thomas G. Fischer. 2018. Reinforcement Learning in Financial Markets-A Survey. Technical Report. FAU Discussion
Papers in Economics.
[44] Keke Gai, Meikang Qiu, and Xiaotong Sun. 2018. A survey on fintech. Journal of Network and Computer Applications
103 (2018), 262–273.
[45] Alexei A. Gaivoronski and Fabio Stella. 2000. Stochastic nonstationary optimization for finding universal portfolios.
Annals of Operations Research 100, 1 (2000), 165–188.
[46] Xiu Gao and Laiwan Chan. 2000. An algorithm for trading and portfolio management using q-learning and Sharpe
Ratio maximization. In Proceedings of the 14th International Conference on Neural Information Processing (NIPS).
832–837.
[47] Dhananjay K. Gode and Shyam Sunder. 1993. Allocative efficiency of markets with zero-intelligence traders: Market
as a partial substitute for individual rationality. Journal of Political Economy 101, 1 (1993), 119–137.
[48] Shihao Gu, Bryan Kelly, and Dacheng Xiu. 2020. Empirical asset pricing via machine learning. The Review of Financial
Studies 33, 5 (2020), 2223–2273.
[49] Shihao Gu, Bryan Kelly, and Dacheng Xiu. 2021. Autoencoder asset pricing models. Journal of Econometrics 222,
1 (2021), 429–450.
[50] Olivier Guéant and Iuliia Manziuk. 2019. Deep reinforcement learning for market making in corporate bonds: Beating
the curse of dimensionality. Applied Mathematical Finance 26, 5 (2019), 387–452.
[51] László Györfi, Gábor Lugosi, and Frederic Udina. 2006. Nonparametric kernel-based sequential investment strate-
gies. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics 16, 2 (2006),
337–357.
[52] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum en-
tropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
[53] Ben Hambly, Renyuan Xu, and Huining Yang. 2021. Recent advances in reinforcement learning in finance. arXiv
preprint arXiv:2112.04553 (2021).
[54] David P. Helmbold, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. 1998. On-line portfolio selection
using multiplicative updates. Mathematical Finance 8, 4 (1998), 325–347.
[55] Dieter Hendricks and Diane Wilcox. 2014. A reinforcement learning extension to the Almgren-Chriss framework for
optimal trade execution. In Proceedings of the IEEE Conference on Computational Intelligence for Financial Engineering
& Economics. 457–464.
[56] Ehsan Hoseinzade and Saman Haratizadeh. 2019. CNNpred: CNN-based stock market prediction using a diverse set
of variables. Expert Systems with Applications 129 (2019), 273–285.
[57] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning
framework for news-oriented stock trend prediction. In Proceedings of the 11th ACM International Conference on Web
Search and Data Mining (WSDM). 261–269.

[58] Chien-Feng Huang. 2012. A hybrid stock selection model using genetic algorithms and support vector regression.
Applied Soft Computing 12, 2 (2012), 807–818.
[59] Dingjiang Huang, Junlong Zhou, Bin Li, Steven Hoi, and Shuigeng Zhou. 2013. Robust median reversion strategy for
on-line portfolio selection. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI).
2006–2012.
[60] Wei Huang, Yoshiteru Nakamori, and Shou-Yang Wang. 2005. Forecasting stock market movement direction with
support vector machine. Computers & Operations Research 32, 10 (2005), 2513–2522.
[61] Zhenhan Huang and Fumihide Tanaka. 2021. A modularized and scalable multi-agent reinforcement learning-based
system for financial portfolio management. arXiv preprint arXiv:2102.03502 (2021).
[62] O. Jangmin, Jongwoo Lee, Jae Won Lee, and Byoung-Tak Zhang. 2006. Adaptive stock trading with dynamic asset
allocation using reinforcement learning. Information Sciences 176, 15 (2006), 2121–2147.
[63] Narasimhan Jegadeesh and Sheridan Titman. 1993. Returns to buying winners and selling losers: Implications for
stock market efficiency. The Journal of Finance 48, 1 (1993), 65–91.
[64] Gyeeun Jeong and Ha Young Kim. 2019. Improving financial trading decisions using deep Q-learning: Predicting the
number of shares, action strategies, and transfer learning. Expert Systems with Applications 117 (2019), 125–138.
[65] Zhengyao Jiang and Jinjun Liang. 2017. Cryptocurrency portfolio management with deep reinforcement learning.
In 2017 Intelligent Systems Conference (IntelliSys). 905–913.
[66] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. 2017. A deep reinforcement learning framework for the financial
portfolio management problem. arXiv preprint arXiv:1706.10059 (2017).
[67] Sham M. Kakade, Michael Kearns, Yishay Mansour, and Luis E. Ortiz. 2004. Competitive algorithms for VWAP and
limit order trading. In Proceedings of the 5th ACM Conference on Electronic Commerce (EC). 189–198.
[68] Zura Kakushadze. 2016. 101 formulaic alphas. Wilmott 2016, 84 (2016), 72–81.
[69] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Light-
GBM: A highly efficient gradient boosting decision tree. In Proceedings of the 30th Neural Information Processing
Systems. 3146–3154.
[70] Luckyson Khaidem, Snehanshu Saha, and Sudeepa Roy Dey. 2016. Predicting the direction of stock market prices
using random forest. arXiv preprint arXiv:1605.00003 (2016).
[71] Vijay R. Konda and John N. Tsitsiklis. 2000. Actor-critic algorithms. In Proceedings of the 14th Neural
Information Processing Systems (NIPS). 1008–1014.
[72] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[73] Jinho Lee, Raehyun Kim, Seok-Won Yi, and Jaewoo Kang. 2020. MAPS: Multi-agent reinforcement learning-based
portfolio management system. arXiv preprint arXiv:2007.05402 (2020).
[74] Jae Won Lee and O. Jangmin. 2002. A multi-agent Q-learning framework for optimizing stock trading systems. In
Proceedings of the 13th International Conference on Database and Expert Systems Applications (DESA). 153–162.
[75] Ming-Chi Lee. 2009. Using support vector machine with a hybrid feature selection method to the stock trend predic-
tion. Expert Systems with Applications 36, 8 (2009), 10896–10904.
[76] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end training of deep visuomotor policies.
Journal of Machine Learning Research 17, 1 (2016), 1334–1373.
[77] Bin Li and Steven C. H. Hoi. 2014. Online portfolio selection: A survey. Comput. Surveys 46, 3 (2014), 1–36.
[78] Bin Li, Steven C. H. Hoi, and Vivekanand Gopalkrishnan. 2011. Corn: Correlation-driven nonparametric learning
approach for portfolio selection. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 1–29.
[79] Bin Li, Peilin Zhao, Steven C. H. Hoi, and Vivekanand Gopalkrishnan. 2012. PAMR: Passive aggressive mean reversion
strategy for portfolio selection. Machine Learning 87, 2 (2012), 221–258.
[80] Wei Li, Ruihan Bao, Keiko Harimoto, Deli Chen, Jingjing Xu, and Qi Su. 2020. Modeling the stock relation with
graph network for overnight stock movement prediction. In Proceedings of the 29th International Joint Conference on
Artificial Intelligence (IJCAI). 4541–4547.
[81] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. 2018. Adversarial deep reinforcement learn-
ing in portfolio management. arXiv preprint arXiv:1808.09940 (2018).
[82] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and
Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[83] Bryan Lim, Stefan Zohren, and Stephen Roberts. 2019. Enhancing time-series momentum strategies using deep neural
networks. The Journal of Financial Data Science 1, 4 (2019), 19–38.
[84] Ye-Sheen Lim and Denise Gorse. 2018. Reinforcement learning for high-frequency market making. In Proceedings
of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
(ESANN).
[85] Siyu Lin and Peter A. Beling. 2020. An end-to-end optimal trade execution framework based on proximal policy
optimization. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI). 4548–4554.

ACM Transactions on Intelligent Systems and Technology, Vol. 14, No. 3, Article 44. Publication date: March 2023.

[86] Guang Liu, Yuzhao Mao, Qi Sun, Hailong Huang, Weiguo Gao, Xuan Li, JianPing Shen, Ruifan Li, and Xiaojie Wang.
2020. Multi-scale two-way deep neural network for stock trend prediction. In Proceedings of the 29th International
Joint Conference on Artificial Intelligence (IJCAI). 4555–4561.
[87] Yang Liu, Qi Liu, Hongke Zhao, Zhen Pan, and Chuanren Liu. 2020. Adaptive quantitative trading: An imitative
deep reinforcement learning approach. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI).
2128–2135.
[88] Malik Magdon-Ismail and Amir F. Atiya. 2004. Maximum drawdown. Risk Magazine 17, 10 (2004), 99–102.
[89] Harry Markowitz. 1959. Portfolio Selection. Yale University Press, New Haven.
[90] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference
on Machine Learning. 1928–1937.
[91] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves,
Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement
learning. Nature 518, 7540 (2015), 529–533.
[92] John Moody and Matthew Saffell. 2001. Learning to trade via direct reinforcement. IEEE Transactions on Neural
Networks 12, 4 (2001), 875–889.
[93] John Moody and Lizhong Wu. 1997. Optimization of trading systems and portfolios. In Proceedings of the IEEE/IAFE
1997 Computational Intelligence for Financial Engineering. 300–307.
[94] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. 1998. Performance functions and reinforcement
learning for trading systems and portfolios. Journal of Forecasting 17, 5–6 (1998), 441–470.
[95] Tobias J. Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. 2012. Time series momentum. Journal of Financial
Economics 104, 2 (2012), 228–250.
[96] Phillip Murray, Ben Wood, Hans Buehler, Magnus Wiese, and Mikko Pakkanen. 2022. Deep hedging: Continuous
reinforcement learning for hedging of general portfolios across multiple risk aversions. In 3rd ACM International
Conference on AI in Finance. 361–368.
[97] Ralph Neuneier. 1996. Optimal asset allocation using adaptive dynamic programming. In Proceedings of the 10th Neural
Information Processing Systems (NIPS). 952–958.
[98] Ralph Neuneier. 1998. Enhancing Q-learning for optimal asset allocation. In Proceedings of the 12th Neural Information
Processing Systems (NIPS). 936–942.
[99] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. 2006. Reinforcement learning for optimized trade execution. In
Proceedings of the 23rd International Conference on Machine Learning (ICML). 673–680.
[100] Brian Ning, Franco Ho Ting Lin, and Sebastian Jaimungal. 2018. Double deep Q-learning for optimal execution. arXiv
preprint arXiv:1812.06600 (2018).
[101] Hui Niu, Siyuan Li, and Jian Li. 2022. MetaTrader: An reinforcement learning approach integrating diverse policies
for portfolio optimization. In Proceedings of the 31st ACM International Conference on Information & Knowledge
Management. 1573–1583.
[102] Ahmet Murat Ozbayoglu, Mehmet Ugur Gudelek, and Omer Berat Sezer. 2020. Deep learning for financial applications:
A survey. Applied Soft Computing (2020), 106384.
[103] Theodore Panagiotidis, Thanasis Stengos, and Orestis Vravosinos. 2018. On the determinants of Bitcoin returns: A
LASSO approach. Finance Research Letters 27 (2018), 235–240.
[104] Jigar Patel, Sahil Shah, Priyank Thakkar, and Ketan Kotecha. 2015. Predicting stock and stock price index movement
using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications 42,
1 (2015), 259–268.
[105] James M. Poterba and Lawrence H. Summers. 1988. Mean reversion in stock prices: Evidence and implications.
Journal of Financial Economics 22, 1 (1988), 27–59.
[106] Antonio Riva, Lorenzo Bisi, Pierre Liotet, Luca Sabbioni, Edoardo Vittori, Marco Pinciroli, Michele Trapletti, and
Marcello Restelli. 2021. Learning FX trading strategies with FQI and persistent actions. In Proceedings of the Second
ACM International Conference on AI in Finance. 1–9.
[107] Gavin A. Rummery and Mahesan Niranjan. 1994. On-line Q-learning Using Connectionist Systems. University of
Cambridge, Department of Engineering, Cambridge, UK.
[108] Francesco Rundo, Francesca Trenta, Agatino Luigi di Stallo, and Sebastiano Battiato. 2019. Machine learning for
quantitative finance applications: A survey. Applied Sciences 9, 24 (2019), 5574.
[109] Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, Tyler Derr, and Rajiv Ratn Shah. 2021. Stock selection via spatiotemporal
hypergraph attention network: A learning to rank approach. In Proceedings of the AAAI Conference on
Artificial Intelligence, Vol. 35. 497–504.
[110] Ramit Sawhney, Shivam Agarwal, Arnav Wadhwa, and Rajiv Shah. 2021. Exploring the scale-free nature of stock
markets: Hyperbolic graph learning for algorithmic trading. In Proceedings of the Web Conference 2021. 11–22.

[111] Ramit Sawhney, Arnav Wadhwa, Shivam Agarwal, and Rajiv Shah. 2021. Quantitative day trading from natural
language using reinforcement learning. In Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. 4018–4030.
[112] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347 (2017).
[113] Sreelekshmy Selvin, R. Vinayakumar, E. A. Gopalakrishnan, Vijay Krishna Menon, and K. P. Soman. 2017. Stock price
prediction using LSTM, RNN and CNN-sliding window model. In Proceedings of the 6th International Conference on
Advances in Computing, Communications and Informatics (ICACCI). 1643–1647.
[114] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. 2020. Financial time series forecasting with
deep learning: A systematic literature review: 2005–2019. Applied Soft Computing 90 (2020), 106181.
[115] William F. Sharpe. 1964. Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal
of Finance 19, 3 (1964), 425–442.
[116] William F. Sharpe. 1994. The Sharpe Ratio. Journal of Portfolio Management 21, 1 (1994), 49–58.
[117] Si Shi, Jianjun Li, Guohui Li, and Peng Pan. 2019. A multi-scale temporal feature aggregation convolutional neural
network for portfolio management. In Proceedings of the 28th ACM International Conference on Information and
Knowledge Management (CIKM). 1613–1622.
[118] Weiyu Si, Jinke Li, Peng Ding, and Ruonan Rao. 2017. A multi-objective deep reinforcement learning approach for
stock index future’s intraday trading. In Proceedings of the 10th International Symposium on Computational Intelligence
and Design (ISCID). 431–436.
[119] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser,
Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep
neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[120] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic
policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML). 387–395.
[121] Thomas Spooner, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. Market making via reinforcement
learning. arXiv preprint arXiv:1804.04216 (2018).
[122] Thomas Spooner and Rahul Savani. 2020. Robust market making via adversarial reinforcement learning. arXiv
preprint arXiv:2003.01820 (2020).
[123] Shuo Sun, Molei Qin, Xinrun Wang, and Bo An. 2022. PRUDEX-compass: Towards systematic evaluation of reinforcement
learning in financial markets. (2022).
[124] Shuo Sun, Rundong Wang, and Bo An. 2022. Quantitative stock investment by routing uncertainty-aware trading
experts: A multi-task learning approach. arXiv preprint arXiv:2207.07578 (2022).
[125] Shuo Sun, Wanqi Xue, Rundong Wang, Xu He, Junlei Zhu, Jian Li, and Bo An. 2022. DeepScalper: A risk-aware
reinforcement learning framework to capture fleeting intraday trading opportunities. In Proceedings of the 31st ACM
International Conference on Information & Knowledge Management. 1858–1867.
[126] Xiaolei Sun, Mingxi Liu, and Zeqian Sima. 2020. A novel cryptocurrency price trend forecasting model based on
LightGBM. Finance Research Letters 32 (2020), 101084.
[127] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[128] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1–9.
[129] Lawrence Takeuchi and Yu-Ying Albert Lee. 2013. Applying Deep Learning to Enhance Momentum Trading Strategies
in Stocks. Technical report.
[130] Leigh Tesfatsion and Kenneth L. Judd. 2006. Handbook of Computational Economics: Agent-based Computational
Economics.
[131] Alaa Tharwat, Tarek Gaber, Abdelhameed Ibrahim, and Aboul Ella Hassanien. 2017. Linear discriminant analysis: A
detailed tutorial. AI Communications 30, 2 (2017), 169–190.
[132] Chih-Fong Tsai and Yu-Chieh Hsiao. 2010. Combining multiple feature selection methods for stock prediction: Union,
intersection, and multi-intersection approaches. Decision Support Systems 50, 1 (2010), 258–269.
[133] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung,
David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. 2019. Grandmaster level in StarCraft II using
multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354.
[134] Edoardo Vittori. 2022. Augmenting Traders with Learning Machines.
[135] Edoardo Vittori, Martino Bernasconi de Luca, Francesco Trovò, and Marcello Restelli. 2020. Dealing with transaction
costs in portfolio optimization: Online gradient descent with momentum. In Proceedings of the 1st ACM International
Conference on AI in Finance (ICAIF). 1–8.

[136] Edoardo Vittori, Michele Trapletti, and Marcello Restelli. 2020. Option hedging with risk averse reinforcement learning.
In Proceedings of the First ACM International Conference on AI in Finance. 1–8.
[137] Svitlana Vyetrenko, David Byrd, Nick Petosa, Mahmoud Mahfouz, Danial Dervovic, Manuela Veloso, and Tucker Hybinette
Balch. 2019. Get real: Realism metrics for robust limit order book market simulations. arXiv preprint
arXiv:1912.04941 (2019).
[138] Jingyuan Wang, Yang Zhang, Ke Tang, Junjie Wu, and Zhang Xiong. 2019. AlphaStock: A buying-winners-and-
selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the
25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1900–1908.
[139] Rundong Wang, Hongxin Wei, Bo An, Zhouyan Feng, and Jun Yao. 2020. Commission fee is not enough: A hierarchical
reinforced framework for portfolio management. arXiv preprint arXiv:2012.12620 (2020).
[140] Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. 2021. DeepTrader: A deep reinforcement learning
approach to risk-return balanced portfolio management with market conditions embedding. In Proceedings of the
35th AAAI Conference on Artificial Intelligence (AAAI).
[141] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. 2016. Dueling network
architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.
1995–2003.
[142] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3–4 (1992), 279–292.
[143] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning 8, 3–4 (1992), 229–256.
[144] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent
Laboratory Systems 2, 1–3 (1987), 37–52.
[145] Bo K. Wong and Yakup Selvi. 1998. Neural network applications in finance: A review and analysis of literature
(1990–1996). Information & Management 34, 3 (1998), 129–139.
[146] Lan Wu and Yuehan Yang. 2014. Nonnegative elastic net and application in index tracking. Appl. Math. Comput. 227
(2014), 541–552.
[147] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2020. A comprehensive
survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
[148] Zhuoran Xiong, Xiao-Yang Liu, Shan Zhong, Hongyang Yang, and Anwar Walid. 2018. Practical deep reinforcement
learning approach for stock trading. arXiv preprint arXiv:1811.07522 (2018).
[149] Ke Xu, Yifan Zhang, Deheng Ye, Peilin Zhao, and Mingkui Tan. 2020. Relation-aware transformer for portfolio policy
learning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI). 4647–4653.
[150] Wentao Xu, Weiqing Liu, Chang Xu, Jiang Bian, Jian Yin, and Tie-Yan Liu. 2021. REST: Relational event-driven stock
trend forecasting. In Proceedings of the Web Conference 2021. 1–10.
[151] Yumo Xu and Shay B. Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (ACL). 1970–1979.
[152] Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. 2020. Deep reinforcement learning for automated
stock trading: An ensemble strategy. Available at SSRN (2020).
[153] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. 2020. Reinforcement-learning based
portfolio management with augmented asset movement prediction states. In Proceedings of the 34th AAAI Conference
on Artificial Intelligence (AAAI). 1112–1119.
[154] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. 2019. Model-based deep reinforcement
learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740 (2019).
[155] Yuyu Yuan, Wen Wen, and Jincui Yang. 2020. Using data augmentation based reinforcement learning for daily stock
trading. Electronics 9, 9 (2020), 1384.
[156] Chuheng Zhang, Yuanqi Li, Xi Chen, Yifei Jin, Pingzhong Tang, and Jian Li. 2020. DoubleEnsemble: A new ensemble
method based on sample reweighting and feature selection for financial data analysis. arXiv preprint arXiv:2010.01265
(2020).
[157] Dongsong Zhang and Lina Zhou. 2004. Discovering golden nuggets: Data mining in financial application. IEEE
Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34, 4 (2004), 513–522.
[158] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. 2017. Stock price prediction via discovering multi-frequency trading
patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
(KDD). 2141–2149.
[159] Tianping Zhang, Yuanqi Li, Yifei Jin, and Jian Li. 2020. AutoAlpha: An efficient hierarchical evolutionary algorithm
for mining alpha factors in quantitative investment. arXiv preprint arXiv:2002.08245 (2020).
[160] Yifan Zhang, Peilin Zhao, Bin Li, Qingyao Wu, Junzhou Huang, and Mingkui Tan. 2020. Cost-sensitive portfolio
selection via deep reinforcement learning. IEEE Transactions on Knowledge and Data Engineering (2020).

[161] Zihao Zhang, Stefan Zohren, and Stephen Roberts. 2020. Deep reinforcement learning for trading. The Journal of
Financial Data Science 2, 2 (2020), 25–40.
[162] Muchen Zhao and Vadim Linetsky. 2021. High frequency automated market making algorithms with adverse
selection risk control via reinforcement learning. In Proceedings of the Second ACM International Conference on AI in
Finance. 1–9.
[163] Yueyang Zhong, YeeMan Bergstrom, and Amy Ward. 2020. Data-driven market-making via model-free learning. In
Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI). 2327–2333.
[164] Dawei Zhou, Lecheng Zheng, Yada Zhu, Jianbo Li, and Jingrui He. 2020. Domain adaptive multi-modality neural
attention network for financial forecasting. In Proceedings of the 29th Web Conference (WWW). 2230–2240.
[165] Brian D. Ziebart. 2010. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy.
Carnegie Mellon University.

Received 1 October 2021; revised 2 November 2022; accepted 9 January 2023
