
The Momentum Transformer: An Intelligent and Interpretable Deep Learning Trading Strategy

Stefan Zohren

Oxford Man Institute of Quantitative Finance


University of Oxford

September 23, 2022


Outline

Classical Time-Series Momentum Strategies

Deep Momentum Networks

Deep Momentum Networks with Changepoint Detection

Momentum Transformer

Conclusions
Classical Time-Series Momentum Strategies
Momentum Strategies

▶ Time-series Momentum (TSMOM) (Moskowitz et al. [1]) is derived from the philosophy that strong price trends have a tendency to persist.
▶ Often known as ‘follow the winner’ because it is assumed that
winners will continue to be winners in the subsequent period.
▶ TSMOM is a univariate approach as opposed to cross-sectional
(Jegadeesh et al. [2]) momentum strategies, which trade assets
against each other and select a portfolio based on relative
ranking.
▶ Strategies involve 1) estimation of a trend, and 2) sizing
positions accordingly.
▶ Momentum strategies are an important part of alternative
investments and are at the heart of commodity trading advisors
(CTAs).
Classical Strategies
▶ Volatility scaling has been proven to play a crucial role in the positive performance of TSMOM strategies (Kim et al. [3]).
▶ Where X_t^{(i)} is the position size of the i-th asset, N the number of assets in our portfolio, σ_t^{(i)} the ex-ante volatility estimate and σ_tgt the target volatility,

R_{t+1}^{TSMOM} = (1/N) Σ_{i=1}^{N} R_{t+1}^{(i)},   R_{t+1}^{(i)} = X_t^{(i)} (σ_tgt / σ_t^{(i)}) r_{t+1}^{(i)}.    (1)

▶ Moskowitz et al. [1] select the position as X_t^{(i)} = sgn(r_{t−252,t}), where we are using the volatility scaling framework and r_{t−252,t} is the annual return (see the sketch below).
▶ Moving Average Convergence Divergence (MACD) is a volatility
normalised trend-following momentum indicator that describes
the relationship between two moving averages of a security’s
price, functioning as a trigger for buy and sell signals (see Baz et
al. [4]).
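To make Eq. (1) and the sign rule concrete, here is a minimal pandas sketch of a volatility-scaled TSMOM portfolio. The variable names, the 60-day exponentially weighted volatility estimate and the 15% target volatility are illustrative assumptions rather than values taken from the slides.

import numpy as np
import pandas as pd

SIGMA_TGT = 0.15      # annualised target volatility (assumption)
VOL_SPAN = 60         # span of the EWM volatility estimate (assumption)
TRADING_DAYS = 252

def tsmom_returns(returns: pd.DataFrame) -> pd.Series:
    """Volatility-scaled time-series momentum portfolio returns, Eq. (1).

    returns: DataFrame of daily returns, one column per asset.
    """
    # Ex-ante volatility estimate sigma_t^(i), annualised
    sigma = returns.ewm(span=VOL_SPAN).std() * np.sqrt(TRADING_DAYS)
    # Moskowitz et al. position rule: sign of the trailing one-year return
    positions = np.sign(returns.rolling(TRADING_DAYS).sum())
    # Lag positions and volatility by one day to avoid look-ahead,
    # then scale by sigma_tgt / sigma_t^(i) as in Eq. (1)
    asset_returns = positions.shift(1) * (SIGMA_TGT / sigma.shift(1)) * returns
    # Equal-weight average over the N assets
    return asset_returns.mean(axis=1)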
Deep Momentum Networks
Deep Momentum Networks

▶ Previous deep learning approaches either only learnt the trend or alternatively performed classification, taking a maximum long or short position.
▶ In our first work with B. Lim and S. Roberts, we proposed a framework termed Deep Momentum Networks (DMNs), which resulted in significantly better risk-adjusted returns.
▶ Rather than estimating the trend and then using a rule-based approach to size positions, DMNs learn the trend in a data-driven manner and directly output position sizes.
▶ A squashing function tanh(·) directly outputs positions X_t^{(i)} ∈ (−1, 1).
▶ DMNs also benefit from the volatility scaling framework.
▶ Inputs are normalised returns at different timescales and
different MACD indicators.
Deep Momentum Networks

▶ We found that the LSTM, a type of Recurrent Neural Network used for sequence modelling, produced the best results.
▶ Since we want to maximise risk-adjusted returns, DMNs use a Sharpe ratio loss function (sketched below),

L_{sharpe}(θ) = − √252 · E_Ω[R_t^{(i)}] / √(Var_Ω[R_t^{(i)}]).    (2)

Figure: LSTM Deep Momentum Network architecture
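As a minimal PyTorch-style sketch of the Sharpe-ratio loss in Eq. (2) together with the tanh position head: the use of torch, the epsilon for numerical stability and the commented pipeline are illustrative assumptions, not the authors' exact implementation.

import torch

def sharpe_loss(strategy_returns: torch.Tensor) -> torch.Tensor:
    """Negative annualised Sharpe ratio of realised strategy returns, Eq. (2)."""
    mean = strategy_returns.mean()
    std = strategy_returns.std()
    return -(252.0 ** 0.5) * mean / (std + 1e-9)   # epsilon guards against zero variance

# Positions are squashed into (-1, 1) by tanh on the network output; the loss
# is then computed on the volatility-scaled returns those positions generate:
#   positions = torch.tanh(lstm_output)                        # X_t^(i) in (-1, 1)
#   strategy_returns = positions * vol_scaling * next_day_returns
#   loss = sharpe_loss(strategy_returns)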
Interpreting the results of DMNs
▶ The DMN's strong performance is attributed to simultaneously learning to exploit slow momentum and fast reversion.
▶ Following small reversions in regimes of strong trend can lead to larger transaction costs.
Deep Momentum Networks with Changepoint Detection
Momentum Turning Points and Changepoint Detection
▶ Ideally we identify regimes – trending and reverting markets – and then learn to switch between them to exploit each state.
▶ This can be done with changepoint detection, as proposed in earlier work with K. Wood and S. Roberts.
Motivation from Momentum Turning Points
▶ Immediately after momentum turning points, where a trend
reverses from an uptrend (downtrend) to a downtrend
(uptrend), time-series momentum (TSMOM) strategies are prone
to making bad bets.
▶ We require an approach which strikes a balance between responding quickly enough to turning points and not over-reacting to noise.
▶ Garg et al. [5] proposed an intermediate strategy where a slow momentum signal based on a long lookback window, such as one year, is blended with a fast momentum signal based on a short lookback window, such as one month (a short code sketch of this blend follows below).

Xt = (1 − w) sgn(rt−252,t ) + w sgn(rt−21,t ). (3)

▶ MACD can often produce false positives and signal a possible reversal without one actually happening.
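A short pandas sketch of the blend in Eq. (3), assuming a DataFrame of daily returns; the function name and default weight are illustrative.

import numpy as np
import pandas as pd

def intermediate_positions(returns: pd.DataFrame, w: float = 0.5) -> pd.DataFrame:
    """Blend slow (one-year) and fast (one-month) momentum signals, Eq. (3)."""
    slow = np.sign(returns.rolling(252).sum())   # sgn(r_{t-252,t})
    fast = np.sign(returns.rolling(21).sum())    # sgn(r_{t-21,t})
    return (1.0 - w) * slow + w * fast

Setting w = 0 recovers the slow annual signal, w = 1 the fast monthly one, and intermediate values trade off responsiveness to turning points against over-reaction to noise.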
Changepoint Detection

▶ Changepoint detection (CPD) is a field which involves the identification of abrupt changes in sequential data.
▶ To enable us to respond to changepoints in real time, we require an ‘online’ algorithm, which processes each data point as it becomes available.
▶ First introduced by Adams et al. [6], Bayesian approaches to online CPD, which naturally accommodate noisy, uncertain and incomplete time-series data, have proven to be very successful.
▶ We focus on approaches using Gaussian Processes (GPs) (Williams et al. [7]), a Bayesian non-parametric model which has a proven track record for time-series forecasting, is principled and is robust to noisy inputs.
Changepoint Detection with Gaussian Processes

▶ For the daily return r̂_t^{(i)}, normalised over some look-back window (we'll revisit this), we define the GP as a distribution over functions,

r̂_t^{(i)} = f(t) + ϵ_t,   f ∼ GP(0, k_ξ),   ϵ_t ∼ N(0, σ_n^2),    (4)

where ϵ_t is an additive noise process and the GP is specified by a covariance function k_ξ(·, ·), which is in turn parameterised by a set of hyperparameters ξ. The noise variance σ_n helps to deal with noisy outputs which are uncorrelated.
▶ The Matérn 3/2 kernel is a good choice of covariance function for noisy financial data, with kernel hyperparameters ξ_M = (λ, σ_h, σ_n), with λ the input scale and σ_h the output scale (see the sketch at the end of this slide).
▶ A changepoint can either be a drastic change in covariance, a sudden change in the input scale, or a sudden change in the output scale.
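One way to fit the GP in Eq. (4) is scikit-learn's Gaussian process regressor with a Matérn 3/2 kernel plus a white-noise term; the slides do not prescribe a library, and the kernel initialisation below is an assumption.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def fit_matern_gp(returns_window: np.ndarray) -> GaussianProcessRegressor:
    """Fit f(t) + eps to a window of standardised daily returns, Eq. (4)."""
    t = np.arange(len(returns_window), dtype=float).reshape(-1, 1)
    # Matern(nu=1.5) is the Matern 3/2 kernel; WhiteKernel models sigma_n^2
    kernel = Matern(length_scale=10.0, nu=1.5) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel)
    gp.fit(t, returns_window)   # fitting maximises the log marginal likelihood
    return gp

# The fitted negative log marginal likelihood, used by the changepoint score later,
# is -gp.log_marginal_likelihood_value_.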
Changepoint Kernel
▶ Garnett et al. introduced the Region-switching kernel, where it
is assumed there is a drastic change, or changepoint, at
c ∈ {t − l + 1, t − l + 2, . . . , t − 1}, after which all observations
before c are completely uninformative about the observations
after this point,

k_ξ^R(x, x′) = k_{ξ_1}(x, x′) if x, x′ < c,   k_{ξ_2}(x, x′) if x, x′ ≥ c,   0 otherwise.    (5)

▶ The lookback window (LBW) l for this approach needs to be prespecified and it is assumed to contain a single changepoint.
▶ A more flexible approach is the Changepoint kernel (a numpy sketch follows below), where c ∈ (t − l, t) is the changepoint location, s > 0 is the steepness parameter, σ(x, x′) = σ(x)σ(x′) and σ̄(x, x′) = (1 − σ(x))(1 − σ(x′)),

kξC (x, x′ ) = kξ1 (x, x′ )σ(x, x′ ) + kξ2 (x, x′ )σ̄(x, x′ ). (6)

▶ We use Matérn 3/2 for the left and right kernels.
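A numpy sketch of the Changepoint kernel in Eq. (6), combining two Matérn 3/2 kernels with a sigmoid switch of steepness s at location c. Parameter layout, function names and the sign convention of the sigmoid are illustrative assumptions; in practice all hyperparameters are fitted by maximising the marginal likelihood.

import numpy as np

def matern32(x, xp, input_scale, output_scale):
    """Matérn 3/2 covariance between time indices x and x'."""
    d = np.abs(x[:, None] - xp[None, :]) / input_scale
    return output_scale ** 2 * (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

def changepoint_kernel(x, xp, c, s, params1, params2):
    """Eq. (6): k_1(x,x') sigma(x)sigma(x') + k_2(x,x') (1-sigma(x))(1-sigma(x'))."""
    # With this sign convention sigma -> 1 before the changepoint and -> 0 after it,
    # so k_1 governs observations before c and k_2 those after it.
    sig_x = 1.0 / (1.0 + np.exp(s * (x - c)))
    sig_xp = 1.0 / (1.0 + np.exp(s * (xp - c)))
    k1 = matern32(x, xp, *params1)   # left (pre-changepoint) Matérn 3/2 kernel
    k2 = matern32(x, xp, *params2)   # right (post-changepoint) Matérn 3/2 kernel
    return k1 * np.outer(sig_x, sig_xp) + k2 * np.outer(1.0 - sig_x, 1.0 - sig_xp)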


Changepoint Detection Module Outputs

▶ We consider the series {r_{t′}^{(i)}}_{t′=t−l}^{t}, with lookback horizon l from time t. For every CPD window T = {t − l, t − l + 1, . . . , t}, we standardise our returns for consistency.
▶ For each time step, our changepoint detection module outputs (see the sketch below):
1. the changepoint location γ_t^{(i)} ∈ (0, 1), indicating how far in the past the changepoint is, and
2. the changepoint score ν_t^{(i)} ∈ (0, 1), which measures the level of disequilibrium via the reduction in negative log marginal likelihood (nlml) achieved by introducing the changepoint kernel hyperparameters:

ν_t^{(i)}(l) = 1 − 1/(1 + exp(−(nlml_{ξ_C} − nlml_{ξ_M}))),   γ_t^{(i)}(l) = (c − (t − l))/l.    (7)
▶ Both values are normalised to help improve stability and
performance of our LSTM module.
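A small sketch of the two outputs in Eq. (7). Here nlml_matern and nlml_changepoint stand for the negative log marginal likelihoods of the single-Matérn and Changepoint-kernel GP fits over the lookback window; the function names are illustrative.

import numpy as np

def changepoint_score(nlml_matern: float, nlml_changepoint: float) -> float:
    """Eq. (7), nu_t(l): sigmoid of the likelihood improvement from the CP kernel."""
    return 1.0 - 1.0 / (1.0 + np.exp(-(nlml_changepoint - nlml_matern)))

def changepoint_location(c: float, t: int, l: int) -> float:
    """Eq. (7), gamma_t(l): fitted changepoint position rescaled to (0, 1)."""
    return (c - (t - l)) / l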
Changepoint Kernel

Figure: Plots of daily returns for the S&P 500 composite ratio-adjusted continuous futures contract during the first quarter of 2020, where returns have been standardised. The top plot fits a GP using the Matérn 3/2 kernel and the bottom one uses the Changepoint kernel specified in (6).
DMNs with Changepoint Detection Model

▶ In our second paper with K. Wood and S. Roberts, we introduce online CPD based on GPs into DMNs.
▶ Precisely, the input u_t^{(i)} for each time step of the LSTM sequence consists of past returns, MACD signals, as well as changepoint location and severity scores (assembled in the sketch at the end of this slide):
1. normalised returns r_{t−t′,t}^{(i)} / (σ_t^{(i)} √t′) for t′ ∈ {1, 21, 63, 126, 252},
2. MACD(i, t, S, L) for (S, L) ∈ {(8, 24), (16, 48), (32, 96)},
3. ν_t^{(i)}(l) and γ_t^{(i)}(l) for l ∈ {10, 21, 63, 126, 252}.
▶ The LSTM is not complex enough to handle multiple CPD lookback windows (LBWs), so we optimise l as part of the hyperparameter tuning process.
▶ Later work also demonstrates that multiple LBWs (short and long) work well in conjunction with a variable selection network.
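A minimal sketch of assembling the per-time-step input u_t^(i) listed above. The helper macd and the containers holding returns and CPD outputs are hypothetical placeholders for whatever feature pipeline is used; the (S, L) pairs follow the standard Baz et al. time-scale combinations.

import numpy as np

RETURN_HORIZONS = [1, 21, 63, 126, 252]           # horizons t' in trading days
MACD_SCALES = [(8, 24), (16, 48), (32, 96)]       # (S, L) pairs from Baz et al.

def dmn_cpd_features(returns_by_horizon, sigma_t, macd, nu, gamma):
    """Assemble the input u_t^(i) for one LSTM time step (illustrative only).

    returns_by_horizon: dict mapping horizon t' to the return r_{t-t',t}
    sigma_t:            daily ex-ante volatility estimate
    macd:               callable returning the MACD(S, L) signal at time t
    nu, gamma:          changepoint score and location for the chosen CPD LBW
    """
    normalised_returns = [returns_by_horizon[h] / (sigma_t * np.sqrt(h))
                          for h in RETURN_HORIZONS]
    macd_signals = [macd(s, l) for (s, l) in MACD_SCALES]
    return np.array(normalised_returns + macd_signals + [nu, gamma])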
Data and Experimental Setting
▶ Portfolio consisting of 50 of the most liquid, ratio-adjusted
continuous futures contracts over the period 1990–2020.
▶ Includes daily Commodities, FX, Fixed Income and Equities data,
extracted from the Pinnacle Data Corp CLC database.
▶ We use an expanding window approach, where we start by using
1990–1995 for training/validation, then test out-of-sample on
the period 1995–2000. With each successive iteration, we expand
the training/validation window by an additional five years.
▶ We use a 90%/10% split for training/validation data, training on
the Sharpe loss function via minibatch Stochastic Gradient
Descent (SGD), using the validation set to tune the
hyper-parameters and for early stopping.
▶ The outer optimisation loop tunes dropout rate, hidden layer
size, minibatch size, learning rate, max gradient norm and CPD
LBW length, with 50 iterations of random grid search.
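The expanding-window evaluation protocol described above can be sketched as follows; the helper name and tuple layout are illustrative, with the year boundaries taken from the slide.

def expanding_windows(first_year=1990, first_test=1995, last_year=2020, step=5):
    """Yield (train_start, train_end, test_start, test_end) year boundaries."""
    test_start = first_test
    while test_start < last_year:
        test_end = min(test_start + step, last_year)
        # training/validation always starts in first_year and grows by `step` years
        yield first_year, test_start, test_start, test_end
        test_start = test_end

# Produces (1990, 1995, 1995, 2000), (1990, 2000, 2000, 2005), ..., (1990, 2015, 2015, 2020).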
Performance Results

Figure: Benchmarking DMNs against the Intermediate strategy with w ∈ {0, 0.5, 1}, Long Only and MACD.
Performance Results

Figure: Strategy performance benchmark for raw signal output.


Slow Momentum with Fast Reversion

Figure: Slow momentum and fast reversion happening simultaneously.


Momentum Transformer
Transformers

▶ Based on the concept ‘attention is all you need’, doing away with
convolutions and recurrent neural networks (RNNs).
▶ The attention-based architecture allows the network to focus on significant time steps in the past and longer-term patterns.
▶ Have led to state-of-the-art performance in diverse fields, such as natural language processing, computer vision, and speech processing (see Lin et al. [8]).
▶ Have recently been harnessed for time-series modelling (Li et al. [9], Lim et al. [10], Zhou et al. [11]).
▶ Naturally adapt to new market regimes, such as during the SARS-CoV-2 crisis.
Base Architectures Tested in the Momentum Transformer
▶ Transformer: (Vaswani et al. [12]) consists of encoder and
decoder – each consisting of l identical layers of a (multi)
self-attention mechanism, followed by a position-wise
feed-forward network and a residual connection between these
two components.
▶ Decoder-Only Transformer: (Li et al. [9]) only the decoder side.
▶ Convolutional Transformer: Li et al. [9] incorporates
convolutional and log-sparse self-attention.
▶ Informer Transformer: Zhou et al. [11] replaces the naive
sparsity rule of the Conv. Transformer with a measurement based
on the Kullback-Leibler divergence to distinguish essential
queries, referred to as ProbSparse self-attention.
▶ Decoder-Only Temporal Fusion Transformer (TFT): an
attention-LSTM hybrid which uses recurrent LSTM layers for
local processing and interpretable self-attention layers for
long-term dependencies. We consider the Decoder-Only version
of the original TFT (Lim et al. [10]).
Momentum Transformer (Decoder-Only TFT)

Figure: Decoder-Only TFT


Results

Figure: Strategy Performance Benchmark – Raw Signal Output


Results

Figure: These plots benchmark our strategy performance for the 2015–2020 scenario (left) and the SARS-CoV-2 scenario (right). For each plot we start with $100 and we re-scale returns to 15% volatility. Since we ran each experiment five times, we plot the repeat which resulted in the median Sharpe ratio across the entire experiment.
Attention Patterns

Figure: Lumber future price during the SARS-CoV-2 crisis and the associated attention pattern when making a prediction at 1 March 2020 (blue), 21 April 2020 (orange), and 2 July 2020 (green).
Attention Patterns

▶ We observe significant
structure in attention
patterns.
▶ The attention on
momentum turning
points is pronounced,
segmenting the time
series into regimes.
▶ Our model focuses on
previous time-steps
which are in a similar
regime.

Figure: FTSE 100 future prior to 2008.


Variable Importance

▶ Our model intelligently blends different classical strategies at different points in time.
▶ We observe that the strategy changes with the addition of CPD, placing less emphasis on returns at timescales in between daily (shortest) and annual (longest).

Figure: Variable importance for Cocoa future for Decoder-Only TFT (middle) and with CPD (bottom).
Transaction Cost Impact

Figure: Transaction cost impact on Sharpe over 2015–2020 for individual assets, averaged by asset class, and for diversified portfolio.
Conclusions

▶ Deep Momentum Networks are novel models which directly output trading signals optimised for the Sharpe ratio.
▶ The original Deep Momentum Networks, based on LSTMs, perform well by exploiting a blend of momentum and mean reversion.
▶ We introduce changepoint detection into this model to adapt more intelligently to changes from trending to more reverting regimes.
▶ We further improve the model by considering transformer-based architectures.
▶ The attention-based architectures, which we tested, are
robust to significant events, such as during the SARS-CoV-2
market crash and tend to focus less on mean-reversion and
more on longer term trends.
Thank you!

Papers:
Deep Momentum Networks [1904.04912]
DMNs with Changepoints [2105.13727]
Momentum Transformer [2112.08534]

stefan.zohren@eng.ox.ac.uk
[1] T. J. Moskowitz, Y. H. Ooi, and L. H. Pedersen, “Time series momentum,” Journal of Financial Economics,
vol. 104, no. 2, pp. 228 – 250, 2012. Special Issue on Investor Sentiment.
[2] N. Jegadeesh and S. Titman, “Returns to buying winners and selling losers: Implications for stock market
efficiency,” The Journal of Finance, vol. 48, no. 1, pp. 65–91, 1993.
[3] A. Y. Kim, Y. Tse, and J. K. Wald, “Time series momentum and volatility scaling,” Journal of Financial Markets,
vol. 30, pp. 103 – 124, 2016.
[4] J. Baz, N. Granger, C. R. Harvey, N. Le Roux, and S. Rattray, “Dissecting investment strategies in the cross
section and time series,” SSRN, 2015.
[5] A. Garg, C. L. Goulding, C. R. Harvey, and M. Mazzoleni, “Momentum turning points,” Available at SSRN
3489539, 2021.
[6] R. P. Adams and D. J. MacKay, “Bayesian online changepoint detection,” arXiv preprint arXiv:0710.3742, 2007.
[7] C. K. Williams and C. E. Rasmussen, “Gaussian processes for regression,” 1996.
[8] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of Transformers,” arXiv preprint arXiv:2106.04554, 2021.
[9] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the
memory bottleneck of Transformer on time series forecasting,” Advances in Neural Information Processing
Systems (NeurIPS), vol. 32, pp. 5243–5253, 2019.
[10] B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister, “Temporal fusion transformers for interpretable multi-horizon time
series forecasting,” International Journal of Forecasting, 2021.
[11] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient Transformer
for long sequence time-series forecasting,” in The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI
2021, Virtual Conference, vol. 35, pp. 11106–11115, AAAI Press, 2021.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention
is all you need,” arXiv preprint arXiv:1706.03762, 2017.
