Mlfinlab Release Hudson & Thames
Release 0.4.1
1 Notes
2 Built With
3 Getting Started
3.1 Installation
3.2 Barriers to Entry
3.3 Requirements
4 Implementations
4.1 Data Structures
4.2 Filters
4.3 Labeling
4.4 Sampling
4.5 Fractionally Differentiated Features
4.6 Cross Validation
4.7 Sequentially Bootstrapped Bagging Classifier/Regressor
4.8 Feature Importance
4.9 Bet Sizing
4.10 Portfolio Optimisation
5 Additional Information
5.1 Contact
5.2 Contributing
5.3 License
Index
mlfinlab, Release 0.4.1
mlfinlab is an open-source package based on the research of Dr Marcos Lopez de Prado in his new book Advances in
Financial Machine Learning. This implementation started out as a springboard for a research project in the Masters in
Financial Engineering programme at WorldQuant University and has grown into a mini research group called Hudson
and Thames (not affiliated with the university).
CHAPTER
ONE
NOTES
mlfinlab is a living, breathing project, and new functionalities are consistently being added.
The implementations that will be added in the future as well as the implementations that are currently supported can
be seen below:
• Part 4: Useful Financial Features
    • Working on Chapter 19: Microstructural Features (Maksim)
• Part 3: Backtesting
    • Done Chapter 16: ML Asset Allocation
    • Done Chapter 10: Bet Sizing
• Part 2: Modelling
    • Done Chapter 8: Feature Importance
    • Done Chapter 7: Cross-Validation
    • Done Chapter 6: Ensemble Methods
    • Done Sequential Bootstrap Ensemble
CHAPTER
TWO
BUILT WITH
THREE
GETTING STARTED
3.1 Installation
• Anaconda 3
• Python 3.6
The package can be installed from the PyPI index via the console. Launch the terminal and run:
Clone the package repo to your local machine then follow the steps below.
1. Make sure you install the latest version of the Anaconda 3 distribution. To do this you can follow the install and
update instructions found on this link
2. Launch a terminal
3. Create a New Conda Environment. From terminal:
5. From Terminal: go to the directory where you have saved the file, example:
cd Desktop/mlfinlab
Windows
6. From Anaconda Prompt: go to the directory where you have saved the file, example:
cd Desktop/mlfinlab
3.2 Barriers to Entry
As most of you know, getting through the first 3 chapters of the book is challenging, as it relies on HFT data to create
the new financial data structures. Sourcing HFT data is very difficult, so we resorted to purchasing the
full history of S&P500 Emini futures tick data from TickData LLC.
We are not affiliated with TickData in any way but would like to recommend others to make use of their service. The
full history cost us about $750 and is worth every penny. They have really done a great job at cleaning the data and
providing it in a user friendly manner.
TickData does offer about 20 days' worth of raw tick data, which can be sourced from their website link.
For those of you interested in working with two years of sample tick, volume, and dollar bars, these are provided in
the research repo.
You should be able to work on a few implementations of the code with this set.
Searching for free tick data can be a challenging task. The following three sources may help:
1. Dukascopy. Offers free historical tick data for some futures, though you do have to register.
2. Most crypto exchanges offer tick data but not historical (see Binance API). So you’d have to run a script for a
few days.
3. Blog Post: How and why I got 75Gb of free foreign exchange “Tick” data.
3.3 Requirements
• codecov==2.0.15
• coverage==4.5.2
• pandas==0.24.2
• pylint==2.3.0
• numpy==1.16.4
• xmlrunner==1.7.7
• numba==0.43.0
• scikit-learn==0.21
FOUR
IMPLEMENTATIONS
4.1 Data Structures
When analyzing financial data, unstructured datasets are commonly transformed into a structured format referred to
as bars, where a bar represents a row in a table. mlfinlab implements tick, volume, and dollar bars using traditional
standard bar methods as well as the less common information-driven bars.
# Required Imports
import numpy as np
import pandas as pd
data = pd.read_csv('FILE_PATH')
Data Formatting
In order to utilize the bar sampling methods presented below, our data must first be formatted properly. Many data
vendors will let you choose the format of your raw tick data files. We want to focus on only the following 3 columns:
date_time, price, volume. The reason for this is to minimise the size of the csv files and the time spent
reading them in.
Our data is sourced from TickData LLC, which provides TickWrite 7 to aid in the formatting of saved files. This allows
us to save csv files in the format date_time, price, volume.
For this tutorial we will assume that you need to first do some preprocessing and then save your data to a csv file:
Initially, your instinct may be to pass the mlfinlab package an in-memory DataFrame object, but when you are
running the function in production, your raw tick data csv files will be far too large to hold in memory. We used the
subset 2011 to 2019 and it was more than 25 GB. It is for this reason that the mlfinlab package requires a file path to
read the raw data files from disk:
# Save to csv
new_data.to_csv('FILE_PATH', index=False)
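The new_data frame saved above could be prepared along the following lines. This is only a sketch: the raw column names (Date, Time, Price, Volume) and the timestamp format are assumptions made for illustration, not the actual TickWrite export layout.

```python
import pandas as pd

# Hypothetical raw vendor export; these column names are assumptions.
raw = pd.DataFrame({
    "Date": ["01/03/2011", "01/03/2011", "01/03/2011"],
    "Time": ["09:30:00.123", "09:30:00.456", "09:30:01.002"],
    "Price": [1250.25, 1250.50, 1250.25],
    "Volume": [3, 1, 7],
})

# Merge date and time into a single timestamp column.
date_time = pd.to_datetime(raw["Date"] + " " + raw["Time"],
                           format="%m/%d/%Y %H:%M:%S.%f")

# Keep only the three columns expected by the bar sampling functions.
new_data = pd.DataFrame({
    "date_time": date_time,
    "price": raw["Price"],
    "volume": raw["Volume"],
})
```

The resulting frame has exactly the date_time, price, volume layout described in the Data Formatting section.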
The three standard bar methods implemented share a similar underlying idea in that we want to sample a bar after a
certain threshold is reached.
1. For tick bars, we sample a bar after a certain number of ticks.
2. For volume bars, we sample a bar after a certain volume amount is traded.
3. For dollar bars, we sample a bar after a certain dollar amount is traded.
These bars are used throughout the text book (Advances in Financial Machine Learning, By Marcos Lopez de Prado,
2018, pg 25) to build the more interesting features for predicting financial time series data.
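The threshold idea shared by the three bar types can be sketched in a few lines. This is an in-memory illustration only, not the mlfinlab implementation (which reads the csv file in batches); the function name sample_bars and its metric argument are hypothetical.

```python
import pandas as pd

def sample_bars(ticks, threshold, metric="tick"):
    """Illustrative threshold sampling over date_time, price, volume ticks."""
    bars, prices = [], []
    cum_value, cum_volume = 0.0, 0.0
    for row in ticks.itertuples(index=False):
        prices.append(row.price)
        cum_volume += row.volume
        # The quantity accumulated is what distinguishes the three bar types.
        if metric == "tick":
            cum_value += 1
        elif metric == "volume":
            cum_value += row.volume
        else:  # dollar
            cum_value += row.price * row.volume
        if cum_value >= threshold:
            bars.append({"date_time": row.date_time, "open": prices[0],
                         "high": max(prices), "low": min(prices),
                         "close": prices[-1], "volume": cum_volume})
            prices, cum_value, cum_volume = [], 0.0, 0.0
    return pd.DataFrame(bars)

ticks = pd.DataFrame({
    "date_time": pd.date_range("2020-01-01 09:30", periods=6, freq="s"),
    "price": [10.0, 11.0, 9.0, 10.0, 12.0, 11.0],
    "volume": [1, 2, 3, 1, 2, 3],
})
tick_bars = sample_bars(ticks, threshold=3, metric="tick")
volume_bars = sample_bars(ticks, threshold=4, metric="volume")
dollar_bars = sample_bars(ticks, threshold=50, metric="dollar")
```

Ticks left over after the last completed bar are simply discarded in this sketch.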
Tick Bars
# Tick Bars
tick = standard_data_structures.get_tick_bars('FILE_PATH', threshold=5500,
batch_size=1000000, verbose=False)
Volume Bars
• batch_size – The number of rows per batch. Less RAM = smaller batch size.
• verbose – Print out batch numbers (True or False)
• to_csv – Save bars to csv after every batch run (True or False)
• output_path – Path to csv file, if to_csv is True
Returns Dataframe of volume bars
# Volume Bars
volume = standard_data_structures.get_volume_bars('FILE_PATH', threshold=28000,
batch_size=1000000, verbose=False)
Dollar Bars
# Dollar Bars
dollar = standard_data_structures.get_dollar_bars('FILE_PATH', threshold=70000000,
batch_size=1000000, verbose=True)
Statistical Properties
It can be seen below that tick, volume, and dollar bars all exhibit a distribution significantly closer to normal than
standard time bars:
Information-driven bars are based on the notion of sampling a bar when new information arrives to the market. The
two types of information-driven bars implemented are imbalance bars and run bars. For each type, tick, volume, and
dollar bars are included.
Imbalance Bars
Let’s discuss the generation of imbalance bars using volume imbalance bars as an example. As described in the book
Advances in Financial Machine Learning:
First, let’s define the tick rule. For any given $t$, where $p_t$ is the price associated with $t$ and $v_t$ is the volume, the tick rule $b_t$ is defined as:
$$b_t = \begin{cases} b_{t-1}, & \Delta p_t = 0 \\ |\Delta p_t| / \Delta p_t, & \Delta p_t \neq 0 \end{cases}$$
The tick rule is used as a proxy for trade direction. However, some data providers already supply customers with tick
direction; in that case we don’t need to calculate the tick rule and can use the provided tick direction instead.
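The tick rule translates directly into code. The sketch below is illustrative; the function name and the initial argument (the value assigned to the very first tick, which has no previous price) are assumptions, not part of the mlfinlab API.

```python
def tick_rule(prices, initial=0):
    """Return b_t for each price; `initial` seeds the first tick (an assumption)."""
    if not prices:
        return []
    signs = [initial]
    for prev, curr in zip(prices, prices[1:]):
        delta = curr - prev
        if delta == 0:
            # No price change: carry the previous sign forward.
            signs.append(signs[-1])
        else:
            # |delta| / delta is +1 for an uptick and -1 for a downtick.
            signs.append(1 if delta > 0 else -1)
    return signs

signs = tick_rule([100, 101, 101, 100, 100, 102])
```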
Algorithm Logic
Now that we have understood the logic of imbalance bar generation, let’s look at the process in detail:
num_prev_bars = 3
expected_num_ticks_init = 100000
expected_num_ticks = expected_num_ticks_init
cum_theta = 0
num_ticks = 0
imbalance_array = []
imbalance_bars = []
bar_length_array = []
for row in data.rows:
    # track high, low, close, volume info
    num_ticks += 1
    tick_rule = get_tick_rule(price, prev_price)
    volume_imbalance = tick_rule * row['volume']
    imbalance_array.append(volume_imbalance)
    cum_theta += volume_imbalance
    if len(imbalance_bars) == 0 and len(imbalance_array) >= expected_num_ticks_init:
        expected_imbalance = ewma(imbalance_array, window=expected_num_ticks_init)
Note that in the algorithm pseudo-code we reset $\theta_t$ when a bar is formed. In our case the formula for $\theta_t$ is:
$$\theta_t = \sum_{t=t^*}^{T} b_t v_t$$
Let’s look at the dynamics of $|\theta_t|$ and $E_0[T] \cdot |2v^+ - E_0[v_t]|$ to understand why we decided to reset $\theta_t$ when a bar is formed.
The dynamics when theta value is reset:
Note that on the first ticks, the threshold condition is not stable. Remember that before the first bar is generated, the expected
imbalance is calculated on every tick with window = expected_num_ticks_init, which is why it changes with every tick.
After the first bar has been generated, both the expected number of ticks ($E_0[T]$) and the expected volume imbalance ($2v^+ - E_0[v_t]$)
are updated only when the next bar is generated.
When theta is not reset:
The reason is that theta keeps accumulating: when several bars are generated and the theta value is not reset
⇒ the condition is met after a small number of ticks ⇒ the length of the next bar converges to 1 ⇒ a bar is sampled on the next
consecutive tick.
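The effect of not resetting theta can be reproduced with a toy loop. The sketch below deliberately uses a fixed threshold, which is a simplifying assumption (in the algorithm above the threshold is updated from the expected number of ticks and the expected imbalance); bar_lengths is a hypothetical helper name.

```python
def bar_lengths(signed_volumes, threshold, reset):
    """Return the number of ticks in each sampled bar."""
    lengths, theta, num_ticks = [], 0.0, 0
    for signed_volume in signed_volumes:
        theta += signed_volume
        num_ticks += 1
        if abs(theta) >= threshold:
            lengths.append(num_ticks)
            num_ticks = 0
            if reset:
                theta = 0.0  # start accumulating from scratch for the next bar
    return lengths

flow = [1] * 10  # ten consecutive buy ticks of volume 1
with_reset = bar_lengths(flow, threshold=3, reset=True)
without_reset = bar_lengths(flow, threshold=3, reset=False)
```

With the reset, bars keep a stable length of three ticks. Without it, theta stays above the threshold after the first bar and a new bar is sampled on every following tick.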
The logic described above is implemented in the mlfinlab package under ImbalanceBars.
Examples
Run Bars
Run bars share the same mathematical structure as imbalance bars. However, instead of looking at each individual
trade, we look at sequences of trades in the same direction. The idea is that we are trying to detect order flow
imbalance caused by actions such as large traders sweeping the order book or iceberg orders.
Examples of implementations of run bars can be seen below:
get_tick_run_bars(file_path, num_prev_bars, exp_num_ticks_init=100000, batch_size=2e7, verbose=True,
to_csv=False, output_path=None)
Creates the tick run bars: date_time, open, high, low, close.
Parameters
• file_path – File path pointing to csv data.
• num_prev_bars – Number of previous bars used for EWMA window expected # of ticks
• exp_num_ticks_init – initial expected number of ticks per bar
• batch_size – The number of rows per batch. Less RAM = smaller batch size.
• verbose – Print out batch numbers (True or False)
• to_csv – Save bars to csv after every batch run (True or False)
• output_path – Path to csv file, if to_csv is True
Returns DataFrame of tick bars
The following research notebooks can be used to better understand the previously discussed data structures
Standard Bars
• Getting Started
• Sample Techniques
Imbalance Bars
• Imbalance Bars
4.2 Filters
Filters are used to filter events based on some kind of trigger. For example a structural break filter can be used to filter
events where a structural break occurs. This event is then used to measure the return from the event to some event
horizon, say a day.
An example showing how the CUSUM filter can be used to downsample a time series of close prices can be seen
below:
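A minimal sketch of a symmetric CUSUM filter is given here for orientation. It is written from the general description of the technique and operates on raw price differences, which is an assumption; the function name cusum_filter_sketch is hypothetical and this is not the mlfinlab function.

```python
import pandas as pd

def cusum_filter_sketch(close, threshold):
    """Return the timestamps where cumulative moves exceed the threshold."""
    t_events, s_pos, s_neg = [], 0.0, 0.0
    for timestamp, change in close.diff().dropna().items():
        # Accumulate upward and downward drift separately, floored/capped at zero.
        s_pos = max(0.0, s_pos + change)
        s_neg = min(0.0, s_neg + change)
        if s_neg < -threshold:
            s_neg = 0.0
            t_events.append(timestamp)
        elif s_pos > threshold:
            s_pos = 0.0
            t_events.append(timestamp)
    return pd.DatetimeIndex(t_events)

close = pd.Series([100.0, 101.0, 102.0, 103.0, 100.0, 97.0],
                  index=pd.date_range("2020-01-01", periods=6))
cusum_events = cusum_filter_sketch(close, threshold=2.5)
```

Only the timestamps where the accumulated move crosses the threshold are kept, which is how the filter downsamples the series.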
4.3 Labeling
The primary labeling method used in financial academia is the fixed-time horizon method. While ubiquitous, this
method has many faults which are remedied by the triple-barrier method discussed below. The triple-barrier method
can be extended to incorporate meta-labeling which will also be demonstrated and discussed below.
The idea behind the triple-barrier method is that we have three barriers: an upper barrier, a lower barrier, and a vertical
barrier. The upper barrier represents the threshold an observation’s return needs to reach in order to be considered a
buying opportunity (a label of 1), the lower barrier represents the threshold an observation’s return needs to reach in
order to be considered a selling opportunity (a label of -1), and the vertical barrier represents the amount of time an
observation has to reach its given return in either direction before it is given a label of 0. This concept can be better
understood visually and is shown in the figure below taken from Advances in Financial Machine Learning (reference):
One of the major faults with the fixed-time horizon method is that observations are given a label with respect to a
certain threshold after a fixed interval regardless of their respective volatilities. In other words, the expected returns of
every observation are treated equally regardless of the associated risk. The triple-barrier method tackles this issue by
dynamically setting the upper and lower barriers for each observation based on their given volatilities.
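The labeling rule for a single observation can be sketched as follows. The helper name, the pt and sl multipliers, and the use of simple returns are assumptions for illustration; the price path is assumed to run from the observation's start to its vertical barrier.

```python
import pandas as pd

def triple_barrier_label(path, daily_vol, pt=1.0, sl=1.0):
    """Label one observation: 1 (upper), -1 (lower), 0 (vertical barrier)."""
    upper = pt * daily_vol    # profit-taking barrier scaled by volatility
    lower = -sl * daily_vol   # stop-loss barrier scaled by volatility
    returns = path / path.iloc[0] - 1.0
    for ret in returns.iloc[1:]:
        if ret >= upper:
            return 1
        if ret <= lower:
            return -1
    return 0  # neither horizontal barrier was touched in time

rally = pd.Series([100.0, 101.0, 103.0])
sell_off = pd.Series([100.0, 99.0, 97.0])
flat = pd.Series([100.0, 100.5, 99.5])
labels = [triple_barrier_label(p, daily_vol=0.02) for p in (rally, sell_off, flat)]
# The same rally no longer reaches the barrier when volatility is higher.
high_vol_label = triple_barrier_label(rally, daily_vol=0.05)
```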
4.3.2 Meta-Labeling
Binary classification problems present a trade-off between type-I errors (false positives) and type-II errors (false nega-
tives). In general, increasing the true positive rate of a binary classifier will tend to increase its false positive rate. The
receiver operating characteristic (ROC) curve of a binary classifier measures the cost of increasing the true positive
rate, in terms of accepting higher false positive rates.
The image illustrates the so-called “confusion matrix.” On a set of observations, there are items that exhibit a condition
(positives, left rectangle), and items that do not exhibit a condition (negatives, right rectangle). A binary classifier
predicts that some items exhibit the condition (ellipse), where the TP area contains the true positives and the TN area
contains the true negatives. This leads to two kinds of errors: false positives (FP) and false negatives (FN). “Precision”
is the ratio between the TP area and the area in the ellipse. “Recall” is the ratio between the TP area and the area in the
left rectangle. This notion of recall (aka true positive rate) is, in the context of classification problems, analogous
to “power” in the context of hypothesis testing. “Accuracy” is the sum of the TP and TN areas divided by the overall
set of items (square). In general, decreasing the FP area comes at a cost of increasing the FN area, because higher
precision typically means fewer calls, hence lower recall. Still, there is some combination of precision and recall that
maximizes the overall efficiency of the classifier. The F1-score measures the efficiency of a classifier as the harmonic
average between precision and recall.
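The definitions above reduce to a few ratios of confusion-matrix counts. The counts used below are made up for illustration.

```python
def classification_scores(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # TP area over the ellipse
    recall = tp / (tp + fn)      # TP area over the left rectangle
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic average
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

scores = classification_scores(tp=40, fp=10, fn=20, tn=30)
```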
Meta-labeling is particularly helpful when you want to achieve higher F1-scores. First, we build a model that
achieves high recall, even if the precision is not particularly high. Second, we correct for the low precision by applying
meta-labeling to the positives predicted by the primary model.
Meta-labeling will increase your F1-score by filtering out the false positives, where the majority of positives have
already been identified by the primary model. Stated differently, the role of the secondary ML algorithm is to determine
whether a positive from the primary (exogenous) model is true or false. It is not its purpose to come up with a betting
opportunity. Its purpose is to determine whether we should act or pass on the opportunity that has been presented.
Meta-labeling is a very powerful tool to have in your arsenal, for four additional reasons. First, ML algorithms
are often criticized as black boxes. Meta-labeling allows you to build an ML system on top of a white box (like a
fundamental model founded on economic theory). This ability to transform a fundamental model into an ML model
should make meta-labeling particularly useful to “quantamental” firms. Second, the effects of overfitting are limited
when you apply meta-labeling, because ML will not decide the side of your bet, only the size. Third, by decoupling the
side prediction from the size prediction, meta-labeling enables sophisticated strategy structures. For instance, consider
that the features driving a rally may differ from the features driving a sell-off. In that case, you may want to develop an
ML strategy exclusively for long positions, based on the buy recommendations of a primary model, and an ML strategy
exclusively for short positions, based on the sell recommendations of an entirely different primary model. Fourth,
achieving high accuracy on small bets and low accuracy on large bets will ruin you. As important as identifying good
opportunities is to size them properly, so it makes sense to develop an ML algorithm solely focused on getting that
critical decision (sizing) right. We will retake this fourth point in Chapter 10. In my experience, meta-labeling ML
models can deliver more robust and reliable outcomes than standard labeling models.
Model Architecture
The following image explains the model architecture. First, train a primary model (binary classification).
Second, determine a threshold level at which the primary model has a high recall; in the coded example you will
find that 0.30 is a good threshold, and ROC curves can be used to help determine a good level. Third, concatenate the features from
the first model with the predictions from the first model into a new feature set for the secondary
model. Meta-labels are used as the target variable in the second model. Now fit the second model. Fourth, combine the
prediction from the secondary model with the prediction from the primary model: your final prediction is true only where both
are true. For example, if your primary model predicts a 3 and your secondary model says there is
a high probability of the primary model being correct, your final prediction is a 3; otherwise it is not a 3.
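The four steps can be sketched with scikit-learn on synthetic data. The choice of models, the synthetic data set, and the in-sample fit are all assumptions made to keep the example short; in practice the secondary model should be evaluated out of sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Step 1: train the primary model (binary classification).
primary = LogisticRegression(max_iter=1000).fit(X, y)

# Step 2: use a low threshold (0.30) so the primary model has high recall.
primary_pred = (primary.predict_proba(X)[:, 1] > 0.30).astype(int)

# Step 3: meta-labels are 1 where a primary positive was actually correct;
# the secondary feature set is the original features plus the primary prediction.
meta_labels = ((primary_pred == 1) & (y == 1)).astype(int)
X_meta = np.column_stack([X, primary_pred])
secondary = RandomForestClassifier(n_estimators=50, random_state=0)
secondary.fit(X_meta, meta_labels)

# Step 4: the final prediction is positive only where both models agree.
secondary_pred = secondary.predict(X_meta)
final_pred = primary_pred & secondary_pred
```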
4.3.3 Implementation
The following functions are used for the triple-barrier method which works in tandem with meta-labeling.
get_daily_vol(close, lookback=100)
Snippet 3.1 computes the daily volatility at intraday estimation points, applying a span of lookback days to an
exponentially weighted moving standard deviation.
See the pandas documentation for details on the pandas.Series.ewm function.
Note: This function is used to compute dynamic thresholds for profit taking and stop loss limits.
Parameters
• close – Closing prices
• lookback – lookback period to compute volatility
Returns series of daily volatility value
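The idea of an exponentially weighted moving standard deviation of returns can be sketched with pandas. This is a simplification that assumes one price per business day; it does not reproduce the intraday alignment performed by get_daily_vol, and the name daily_vol_sketch is hypothetical.

```python
import numpy as np
import pandas as pd

def daily_vol_sketch(close, lookback=100):
    """EWM standard deviation of simple returns (assumes daily-spaced prices)."""
    returns = close.pct_change()
    return returns.ewm(span=lookback).std()

# Synthetic daily closing prices, for illustration only.
rng = np.random.RandomState(0)
index = pd.date_range("2020-01-01", periods=250, freq="B")
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=250))),
                  index=index)
vol = daily_vol_sketch(close, lookback=50)
```

The resulting series can then serve as the dynamic target for the profit-taking and stop-loss limits.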
get_bins(triple_barrier_events, close)
Snippet 3.7, page 51, Labeling for Side & Size with Meta Labels
Compute event’s outcome (including side information, if provided). events is a DataFrame where:
Now the possible values for labels in out[‘bin’] are {0,1}, as opposed to the previous feasible values {-1,0,1}.
The ML algorithm will be trained to decide whether to take the bet or pass, a purely binary prediction. When the
predicted label is 1, we can use the probability of this secondary prediction to derive the size of the bet, where
the side (sign) of the position has been set by the primary model.
Parameters
• triple_barrier_events – (data frame)
events.index is event’s starttime
events[‘t1’] is event’s endtime
events[‘trgt’] is event’s target
events[‘side’] (optional) implies the algo’s position side
Case 1: (‘side’ not in events): bin in (-1,1) <-label by price action
Case 2: (‘side’ in events): bin in (0,1) <-label by pnl (meta-labeling)
• close – (series) close prices
Returns (data frame) of meta-labeled events
drop_labels(events, min_pct=.05)
Snippet 3.8 page 54 This function recursively eliminates rare observations.
Parameters
• events – (data frame) events
• min_pct – (float) a fraction used to decide if the observation occurs less than that fraction
Returns (data frame) of events
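The recursive elimination can be sketched as a loop that drops the rarest label until every remaining label is frequent enough or only two classes are left. The function name is hypothetical and the events frame is synthetic; only the ‘bin’ column is assumed.

```python
import pandas as pd

def drop_labels_sketch(events, min_pct=0.05):
    """Drop the rarest label repeatedly until all labels exceed min_pct."""
    while True:
        frequencies = events["bin"].value_counts(normalize=True)
        # Stop when the rarest label is frequent enough or two classes remain.
        if frequencies.min() > min_pct or frequencies.shape[0] < 3:
            break
        events = events[events["bin"] != frequencies.idxmin()]
    return events

events = pd.DataFrame({"bin": [1] * 50 + [-1] * 48 + [0] * 2})
filtered = drop_labels_sketch(events, min_pct=0.05)
```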
4.3.4 Example
Suppose we use a mean reverting strategy as our primary model, giving each observation a label of 1 or -1. We can
then use meta-labeling to act as a filter for the bets of our primary model.
import mlfinlab as ml
import numpy as np
import pandas as pd
# Read in data
data = pd.read_csv('FILE_PATH')
Assuming we have a pandas series with the timestamps of our observations and their respective labels given by the
primary model, the process to generate meta-labels goes as follows.
# Compute daily volatility
daily_vol = ml.util.get_daily_vol(close=data['close'], lookback=50)
Once we have computed our daily volatility along with our vertical time barriers and have downsampled our series
using the CUSUM filter, we can use the triple-barrier method to compute our meta-labels by passing in the side
predicted by the primary model.
pt_sl = [1, 2]
min_ret = 0.005
triple_barrier_events = ml.labeling.get_events(close=data['close'],
                                               t_events=cusum_events,
                                               pt_sl=pt_sl,
                                               target=daily_vol,
                                               min_ret=min_ret,
                                               num_threads=3,
                                               vertical_barrier_times=vertical_barriers,
                                               side_prediction=data['side'])
As can be seen above, we have scaled our lower barrier and set our minimum return to 0.005.
Meta-labels can then be computed using the time that each observation touched its respective barrier
This example ends with creating the meta-labels. To see a further explanation of using these labels in a secondary
model to help filter out false positives, see the research notebooks below.
The following research notebooks can be used to better understand the triple-barrier method and meta-labeling
Triple-Barrier Method
Meta-Labeling
4.4 Sampling
In financial machine learning, samples are not independent. Most traditional machine learning algorithms
assume that samples are IID, but in financial machine learning samples are neither identically distributed nor
independent. In this section we tackle the problem of sample dependency. As you may remember, we mostly label
our data sets using the triple-barrier method. Each triple-barrier event has a label index and a label end time (t1),
which corresponds to the time when one of the barriers was touched.
import pandas as pd
import numpy as np
from mlfinlab.sampling.concurrent import get_av_uniqueness_from_triple_barrier
We would like to build our model in such a way that it takes label concurrency into account. In order to do that we
need to look at the bootstrapping algorithm of Random Forest.
The key power of ensemble learning techniques is bagging (which is bootstrapping with replacement). The key idea
behind bagging is to randomly choose samples for each decision tree. The trees then become diverse, and by
averaging the predictions of diverse trees built on randomly selected samples and random subsets of features, data scientists
make the algorithm much less prone to overfitting.
However, in our case we would like not only to choose samples randomly but also to choose samples that are unique
and non-concurrent. How can we solve this problem? This is where the Sequential Bootstrapping algorithm comes in.
The key idea behind Sequential Bootstrapping is to select samples in such a way that on each iteration we maximize
the average uniqueness of the selected subsamples.
Implementation
The core functions behind Sequential Bootstrapping are implemented in mlfinlab and can be seen below:
get_ind_matrix(triple_barrier_events, price_bars)
Snippet 4.3, page 65, Build an Indicator Matrix
Get indicator matrix. The book implementation uses bar_index as input, however there is no explanation how
to form it. We decided that using triple_barrier_events and price bars by analogy with concurrency is the best
option.
Parameters
• samples_info_sets – (pd.Series): triple barrier events.t1 from labeling.get_events
• price_bars – (pd.DataFrame): price bars which were used to form triple barrier events
Returns (np.array) indicator binary matrix indicating what (price) bars influence the label for each
observation
get_ind_mat_average_uniqueness(ind_mat)
Snippet 4.4, page 65, Compute Average Uniqueness. Average uniqueness from indicator matrix.
Parameters ind_mat – (np.matrix) indicator binary matrix
Returns (float) average uniqueness
get_ind_mat_label_uniqueness(ind_mat)
An adaptation of Snippet 4.4, page 65, which returns the indicator matrix element uniqueness.
Parameters ind_mat – (np.matrix) indicator binary matrix
Returns (np.matrix) element uniqueness
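To make the notion of uniqueness concrete, the sketch below computes element uniqueness and average uniqueness with plain numpy on a small indicator matrix (rows are bars, columns are labels). The helper names are hypothetical and the code is not the mlfinlab implementation.

```python
import numpy as np

def label_uniqueness(ind_mat):
    """Element uniqueness: each entry divided by the concurrency of its bar."""
    concurrency = ind_mat.sum(axis=1, keepdims=True)  # labels active per bar
    return np.divide(ind_mat, concurrency,
                     out=np.zeros(ind_mat.shape, dtype=float),
                     where=concurrency > 0)

def average_uniqueness(ind_mat):
    """Average uniqueness of each label over the bars where it is active."""
    uniqueness = label_uniqueness(ind_mat)
    return uniqueness.sum(axis=0) / ind_mat.sum(axis=0)

ind_mat = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 0, 1]])
avg_u = average_uniqueness(ind_mat)
```

Labels 0 and 1 overlap on the third bar, so their average uniqueness falls below 1, while label 2 overlaps with nothing and stays at 1.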
seq_bootstrap(ind_mat, sample_length=None, warmup_samples=None, compare=False, verbose=False,
random_state=np.random.RandomState())
Snippet 4.5, Snippet 4.6, page 65, Return Sample from Sequential Bootstrap. Generate a sample via sequential
bootstrap.
Note: Moved from pd.DataFrame to np.matrix for performance increase
Parameters
• ind_mat – (data frame) indicator matrix from triple barrier events
• sample_length – (int) Length of bootstrapped sample
• warmup_samples – (list) list of previously drawn samples
• compare – (boolean) flag to print standard bootstrap uniqueness vs sequential bootstrap
uniqueness
• verbose – (boolean) flag to print updated probabilities on each step
• random_state – (np.random.RandomState) random state
Returns (array) of bootstrapped samples indexes
Example
An example of Sequential Bootstrap using a toy example from the book can be seen below.
Consider a set of labels {𝑦𝑖 }𝑖=0,1,2 where:
• label 𝑦0 is a function of return 𝑟0,2
• label 𝑦1 is a function of return 𝑟2,3
• label 𝑦2 is a function of return 𝑟4,5
The first thing we need to do is to build an indicator matrix. Columns of this matrix correspond to samples and rows
correspond to the price return timestamps which were used during sample labelling. In our case the indicator matrix is:
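# Empty frame: 6 return timestamps (rows) by 3 labels (columns)
ind_mat = pd.DataFrame(index=range(0, 6), columns=range(0, 3))
ind_mat.loc[:, 0] = [1, 1, 1, 0, 0, 0]
ind_mat.loc[:, 1] = [0, 0, 1, 1, 0, 0]
ind_mat.loc[:, 2] = [0, 0, 0, 0, 1, 1]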
One can use get_ind_matrix method from mlfinlab to build indicator matrix from triple-barrier events.
triple_barrier_ind_mat = get_ind_matrix(barrier_events)
We can get average label uniqueness on indicator matrix using get_ind_mat_average_uniqueness function from mlfin-
lab.
ind_mat_uniqueness = get_ind_mat_average_uniqueness(triple_barrier_ind_mat)
Let’s get the first sample average uniqueness (we need to filter out zeros to get unbiased result).
first_sample = ind_mat_uniqueness[0]
first_sample[first_sample > 0].mean()
>> 0.26886446886446885
av_unique.iloc[0]
>> tW 0.238776
As you can see it is quite close to values generated by get_av_uniqueness_from_triple_barrier function call.
Let’s move back to our example. In the Sequential Bootstrapping algorithm we start with an empty array of samples (𝜑)
and loop through all samples to get the probability of choosing each sample, based on the average uniqueness of the reduced
indicator matrix constructed from [previously chosen columns] + sample:
phi = []
while length(phi) < number of samples to bootstrap:
    average_uniqueness_array = []
    for sample in samples:
        previous_columns = phi
        ind_mat_reduced = ind_mat[previous_columns + sample]
        average_uniqueness_array[sample] = get_ind_mat_average_uniqueness(ind_mat_reduced)
To increase performance, we optimized and parallelised the for-loop using numba; this corresponds to the
bootstrap_loop_run function.
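For reference, the whole loop can be written as a short, unoptimized numpy sketch. The helper names are hypothetical, no numba is used, and this is not the mlfinlab implementation; it only mirrors the pseudo-code on the toy indicator matrix.

```python
import numpy as np

def last_column_uniqueness(reduced):
    """Average uniqueness of the last column of a reduced indicator matrix."""
    concurrency = reduced.sum(axis=1)
    active = reduced[:, -1] > 0
    return (reduced[active, -1] / concurrency[active]).mean()

def draw_probabilities(ind_mat, phi):
    """Probability of drawing each sample given the already chosen ones."""
    avg_u = np.array([last_column_uniqueness(ind_mat[:, phi + [i]])
                      for i in range(ind_mat.shape[1])])
    return avg_u / avg_u.sum()

def seq_bootstrap_sketch(ind_mat, sample_length=None, random_state=None):
    rng = np.random.RandomState(random_state)
    n_samples = ind_mat.shape[1]
    sample_length = n_samples if sample_length is None else sample_length
    phi = []
    while len(phi) < sample_length:
        prob = draw_probabilities(ind_mat, phi)
        phi.append(int(rng.choice(n_samples, p=prob)))
    return phi

ind_mat = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0],
                    [0, 1, 0], [0, 0, 1], [0, 0, 1]])
second_step = draw_probabilities(ind_mat, [1])
third_step = draw_probabilities(ind_mat, [1, 2])
samples = seq_bootstrap_sketch(ind_mat, random_state=0)
```

The probabilities for phi = [1] and phi = [1, 2] agree with the prob_array values printed for the 2nd and 3rd iterations of the worked example.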
Now let’s finish the example:
To stay as close as possible to the mlfinlab implementation, let’s convert ind_mat to a numpy matrix:
ind_mat = ind_mat.values
1st iteration:
On the first step all labels will have equal probabilities, as the average uniqueness of a matrix with 1 column is 1. Say we have
chosen 1 on the first step.
2nd iteration:
prob_array
>> array([0.35714285714285715, 0.21428571428571427, 0.42857142857142855],
dtype=object)
The second chosen sample will most likely be 2 (prob_array[2] = 0.42857, which is the largest probability). As you can
see, up to now the algorithm has chosen the two least concurrent labels (1 and 2).
3rd iteration:
phi = [1,2]
uniqueness_array = np.array([None, None, None])
for i in range(0, 3):
    ind_mat_reduced = ind_mat[:, phi + [i]]
    label_uniqueness = get_ind_mat_average_uniqueness(ind_mat_reduced)[-1]
    uniqueness_array[i] = (label_uniqueness[label_uniqueness > 0].mean())
prob_array = uniqueness_array / sum(uniqueness_array)
prob_array
>> array([0.45454545454545453, 0.2727272727272727, 0.2727272727272727],
dtype=object)
Sequential Bootstrapping tries to minimise the probability of repeated samples so as you can see the most probable
sample would be 0 with 1 and 2 already selected.
4th iteration:
phi = [1, 2, 0]
uniqueness_array = np.array([None, None, None])
for i in range(0, 3):
    ind_mat_reduced = ind_mat[:, phi + [i]]
    label_uniqueness = get_ind_mat_average_uniqueness(ind_mat_reduced)[-1]
    uniqueness_array[i] = (label_uniqueness[label_uniqueness > 0].mean())
prob_array = uniqueness_array / sum(uniqueness_array)
prob_array
>> array([0.32653061224489793, 0.3061224489795918, 0.36734693877551017],
dtype=object)
samples
>> [1, 2, 0, 2]
As you can see, the first 2 iterations of the algorithm yield the same probabilities. However, the algorithm sometimes
randomly chooses a sample other than 2 on the 2nd iteration, which is why the subsequent probabilities differ from the example
above. If you repeat the process several times, you will see that on average the drawn sample is equal to the one from
the example.
Monte-Carlo Experiment
Let’s see how sequential bootstrapping increases average label uniqueness on this example. We generate 3 samples
using sequential bootstrapping and 3 samples using standard random choice, repeat the experiment 10000 times, and
record the corresponding label uniqueness in each experiment:
standard_unq_array[i] = random_unq_mean
seq_unq_array[i] = sequential_unq_mean
KDE plots of label uniqueness support the fact that sequential bootstrapping gives higher average label uniqueness
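The experiment can be reproduced on the toy indicator matrix with the numpy sketch below. The helper names are hypothetical, and the number of trials is reduced to 2000 (an arbitrary choice) to keep the run short; no plotting is included.

```python
import numpy as np

ind_mat = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0],
                    [0, 1, 0], [0, 0, 1], [0, 0, 1]])

def column_uniqueness(mat):
    """Average uniqueness of every column of mat (columns may repeat)."""
    concurrency = mat.sum(axis=1)
    result = []
    for j in range(mat.shape[1]):
        active = mat[:, j] > 0
        result.append((mat[active, j] / concurrency[active]).mean())
    return np.array(result)

def sequential_draw(ind_mat, size, rng):
    """Draw `size` samples, favouring those that overlap least with phi."""
    phi = []
    n_samples = ind_mat.shape[1]
    while len(phi) < size:
        avg_u = np.array([column_uniqueness(ind_mat[:, phi + [i]])[-1]
                          for i in range(n_samples)])
        phi.append(int(rng.choice(n_samples, p=avg_u / avg_u.sum())))
    return phi

rng = np.random.RandomState(42)
n_trials = 2000
standard_unq_array = np.zeros(n_trials)
seq_unq_array = np.zeros(n_trials)
for i in range(n_trials):
    random_samples = list(rng.choice(3, size=3))     # standard bootstrap
    seq_samples = sequential_draw(ind_mat, 3, rng)   # sequential bootstrap
    standard_unq_array[i] = column_uniqueness(ind_mat[:, random_samples]).mean()
    seq_unq_array[i] = column_uniqueness(ind_mat[:, seq_samples]).mean()
```

Comparing seq_unq_array.mean() with standard_unq_array.mean() gives a numerical counterpart to the KDE plots.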
We can compare the average label uniqueness of sequential bootstrap with that of standard random sampling
by setting the compare parameter to True. We have massively increased the performance of the Sequential Bootstrapping
described in the book. For comparison, generating 50 samples from 8000 barrier events would take 3 days;
we have reduced the time to 10-12 seconds, and it decreases further as the number of CPUs increases.
Let’s apply sequential bootstrapping to our full data set and draw 50 samples:
Sometimes standard bootstrapping will give higher uniqueness. However, as was shown in the Monte-Carlo
example, on average the Sequential Bootstrapping algorithm yields higher average uniqueness.
mlfinlab supports two methods of applying sample weights. The first is weighting an observation based on its given
return as well as average uniqueness. The second is weighting an observation based on a time decay.
The following function utilizes a sample's average uniqueness and its return to compute sample weights:
get_weights_by_return(triple_barrier_events, close_series, num_threads=5)
Parameters
• triple_barrier_events – (data frame) of events from labeling.get_events()
import pandas as pd
import numpy as np
from mlfinlab.sampling.attribution import get_weights_by_return
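To illustrate the idea behind weighting by return and uniqueness, here is a minimal self-contained sketch. It does not call mlfinlab; the function name return_attribution_weights and the toy data are assumptions made for illustration. Each event's weight is the absolute sum of the log returns over its lifespan, with each bar's return divided by the number of events active at that bar, and the weights are then scaled to sum to the number of events.

```python
import numpy as np
import pandas as pd


def return_attribution_weights(t1, close):
    """t1: pd.Series of event end times indexed by event start times."""
    # Number of events that are active at each bar
    num_co_events = pd.Series(0, index=close.index)
    for t_in, t_out in t1.items():
        num_co_events.loc[t_in:t_out] += 1

    log_ret = np.log(close).diff()
    weights = pd.Series(index=t1.index, dtype=float)
    for t_in, t_out in t1.items():
        # Returns shared by several events are split between them
        attributed = log_ret.loc[t_in:t_out] / num_co_events.loc[t_in:t_out]
        weights.loc[t_in] = attributed.sum()

    weights = weights.abs()
    return weights * len(weights) / weights.sum()


# Hypothetical toy data: 6 daily closes and 3 overlapping events
dates = pd.date_range("2020-01-01", periods=6, freq="D")
close = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0, 104.0], index=dates)
t1 = pd.Series([dates[2], dates[4], dates[5]], index=[dates[0], dates[2], dates[3]])

weights = return_attribution_weights(t1, close)
print(weights)
```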
By Time Decay
The following function assigns sample weights using a time decay factor
get_weights_by_time_decay(triple_barrier_events, close_series, num_threads=5, decay=1)
Parameters
• triple_barrier_events – (data frame) of events from labeling.get_events()
• close_series – (pd.Series) close prices
• num_threads – (int) the number of threads concurrently used by the function.
• decay – (int) decay factor
  - decay = 1 means there is no time decay
  - 0 < decay < 1 means that weights decay linearly over time, but every observation still receives a strictly positive
    weight, regardless of how old
  - decay = 0 means that weights converge linearly to zero as they become older
  - decay < 0 means that the oldest portion c of the observations receive zero weight (i.e. they are erased from memory)
This function can be utilized as shown below assuming we have already found our barrier events
import pandas as pd
import numpy as np
from mlfinlab.sampling.attribution import get_weights_by_time_decay
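The effect of the decay parameter can be sketched without mlfinlab. The function below is a hypothetical stand-alone helper (time_decay_weights is not a library name) that applies a piecewise-linear decay to the cumulative average uniqueness, assuming the newest observation always receives a weight of 1. Note that decay = -1 is not handled, since it would divide by zero.

```python
import pandas as pd


def time_decay_weights(avg_uniqueness, decay=1.0):
    """Linear time decay applied on cumulative uniqueness; newest weight is 1."""
    cum_unq = avg_uniqueness.sort_index().cumsum()
    if decay >= 0:
        slope = (1.0 - decay) / cum_unq.iloc[-1]
    else:
        slope = 1.0 / ((decay + 1.0) * cum_unq.iloc[-1])
    const = 1.0 - slope * cum_unq.iloc[-1]
    weights = const + slope * cum_unq
    weights[weights < 0] = 0.0      # the oldest observations are erased
    return weights


# Four observations with an average uniqueness of 1 each (toy input)
unq = pd.Series([1.0, 1.0, 1.0, 1.0])
print(time_decay_weights(unq, decay=1.0).tolist())   # no decay
print(time_decay_weights(unq, decay=0.5).tolist())   # strictly positive weights
print(time_decay_weights(unq, decay=0.0).tolist())   # converges linearly to zero
print(time_decay_weights(unq, decay=-0.5).tolist())  # oldest half gets zero weight
```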
The following research notebooks can be used to better understand the previously discussed sampling methods
Sequential Bootstrapping
• Sequential Bootstrapping
One of the challenges of quantitative analysis in finance is that price time series have trends or a non-constant mean.
This makes the time series non-stationary. Non-stationary time series are hard to work with when we want to do
inferential analysis, such as estimating the average and variance of returns or the probability of loss. Stationary series also help in
supervised learning methods. Specifically, in supervised learning one needs to map hitherto unseen observations to
a set of labeled examples and determine the label of the new observation. As Marcos Lopez de Prado (MLdP) says
in Chapter 5, “if the features are not stationary we cannot map the new observation to a large number of known
examples”. However, making a time series (or a feature) stationary often requires data transformations such as computing
changes (change in price, yields or volatility). These transformations leave the time series bereft of any memory,
thereby reducing or eliminating its predictive capability. Fractionally differentiated features tackle this problem
by fractionally differentiating a time series to the point where the series is stationary, but without
over-differencing to the point where all predictive power is lost.
The following graph shows a fractionally differenced series plotted over the original closing price series:
4.5.1 Implementation
The following function implemented in mlfinlab can be used to derive fractionally differentiated features.
Parameters
• series – (pd.Series) a time series that needs to be differenced
• diff_amt – (float) Differencing amount
• thresh – (float) threshold or epsilon
Returns (pd.DataFrame) data frame of differenced series
Given that we know the amount we want to difference our price series, fractionally differentiated features can be
derived as follows:
import numpy as np
import pandas as pd
from mlfinlab.features.fracdiff import frac_diff_ffd
data = pd.read_csv('FILE_PATH')
frac_diff_series = frac_diff_ffd(data['close'], 0.5)
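To show what happens under the hood, here is a minimal sketch of fixed-width window fractional differencing. It is an illustration under stated assumptions rather than the mlfinlab code: ffd_weights and frac_diff_fixed_window are hypothetical names, and the weights are generated iteratively and truncated once their absolute value falls below the threshold.

```python
import numpy as np
import pandas as pd


def ffd_weights(diff_amt, thresh=1e-5, max_len=10_000):
    """Weights of the fractional differencing operator, oldest observation first."""
    weights = [1.0]
    k = 1
    while k < max_len:
        w_k = -weights[-1] * (diff_amt - k + 1) / k
        if abs(w_k) < thresh:
            break
        weights.append(w_k)
        k += 1
    return np.array(weights[::-1])


def frac_diff_fixed_window(series, diff_amt, thresh=1e-5):
    """Apply the truncated weights as a rolling dot product."""
    w = ffd_weights(diff_amt, thresh)
    width = len(w)
    values = series.to_numpy(dtype=float)
    out = pd.Series(np.nan, index=series.index)
    for i in range(width - 1, len(values)):
        out.iloc[i] = np.dot(w, values[i - width + 1:i + 1])
    return out


# Toy random-walk price series; a coarse threshold keeps the window short
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(size=500).cumsum())
frac_diff = frac_diff_fixed_window(prices, diff_amt=0.5, thresh=1e-3)
print(frac_diff.dropna().head())
```

With diff_amt = 1 the weights reduce to [-1, 1], so the result equals the ordinary first difference; smaller values of diff_amt keep more of the memory of the original series.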
The following research notebook can be used to better understand fractionally differentiated features.
This implementation is based on Chapter 7 of the book. The purpose of performing cross validation is to reduce
the probability of over-fitting and the book recommends it as the main tool of research. There are two innovations
compared to the classical K-Fold Cross Validation implemented in sklearn.
1. The first one is a process called purging, which removes from the training set those samples that are built with
information that overlaps samples in the testing set. More details on this in section 7.4.1, page 105.
2. The second innovation is a process called embargo, which removes from the training set a number of observations that
immediately follow the test set. This further prevents leakage where the purging process is not enough. More details on
this in section 7.4.2, page 107.
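The two steps can be sketched as a small index generator. This is a simplified illustration, not the mlfinlab cross-validation class: purged_kfold_indices is a hypothetical name, the label end times t1 are assumed to be sorted by start time, and integer bar numbers are used instead of timestamps to keep the example easy to follow.

```python
import numpy as np
import pandas as pd


def purged_kfold_indices(t1, n_splits=3, pct_embargo=0.0):
    """Yield (train, test) positions. t1: label end times indexed by start times."""
    positions = np.arange(len(t1))
    embargo = int(len(t1) * pct_embargo)
    for test in np.array_split(positions, n_splits):
        test_start = t1.index[test[0]]
        test_end = t1.iloc[test].max()
        # Purging: keep only training labels that end before the test set starts ...
        ends_before_test = (t1 < test_start).to_numpy()
        # ... or that start after the last test label ends, plus the embargo
        first_after = t1.index.searchsorted(test_end, side="right")
        starts_after_test = positions >= first_after + embargo
        yield positions[ends_before_test | starts_after_test], test


# Ten labels; label i starts at bar i and ends at bar i + 2
t1 = pd.Series(np.arange(10) + 2, index=np.arange(10))

for train, test in purged_kfold_indices(t1, n_splits=5, pct_embargo=0.1):
    print("test:", test.tolist(), "train:", train.tolist())
```

For the test fold [4, 5], labels 2 and 3 are purged because they end inside the test period, labels 6 and 7 because they start before the last test label ends, and label 8 is dropped by the embargo.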
Implements the book chapter 7 on Cross Validation for financial data.
Fig. 1: Image showing the process of purging. Figure taken from page 107 of the book.
Fig. 2: Image showing the process of embargo. Figure taken from page 108 of the book.
Parameters
• classifier – A sk-learn Classifier object instance.
• X – The dataset of records to evaluate.
• y – The labels corresponding to the X dataset.
• cv_gen – Cross Validation generator object instance.
• sample_weight – A numpy array of weights for each record in the dataset.
• scoring – A metric name to use for scoring; currently supports neg_log_loss, accuracy,
f1, precision, recall, and roc_auc.
Returns The computed score as a numpy array.
In the Sampling section we have shown that sampling should be done by Sequential Bootstrapping.
SequentiallyBootstrappedBaggingClassifier and SequentiallyBootstrappedBaggingRegressor extend sklearn’s
BaggingClassifier/Regressor by using Sequential Bootstrapping instead of random sampling.
In order to build the indicator matrix we need Triple Barrier Events (samples_info_sets) and the price bars used to label
the training data set. That is why samples_info_sets and price bars are input parameters for the classifier/regressor.
Implementation of Sequentially Bootstrapped Bagging Classifier using sklearn’s library as base class
class SequentiallyBootstrappedBaggingClassifier(samples_info_sets, price_bars,
                                                base_estimator=None,
                                                n_estimators=10, max_samples=1.0,
                                                max_features=1.0, bootstrap_features=False,
                                                oob_score=False, warm_start=False,
                                                n_jobs=None, random_state=None,
                                                verbose=0)
A Sequentially Bootstrapped Bagging classifier is an ensemble meta-estimator that fits base classifiers each on
random subsets of the original dataset generated using the Sequential Bootstrapping sampling procedure and then
aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a
meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision
tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
Parameters
• samples_info_sets – pd.Series, The information range on which each record is constructed from
  samples_info_sets.index: Time when the information extraction started.
  samples_info_sets.value: Time when the information extraction ended.
• price_bars – pd.DataFrame Price bars used in samples_info_sets generation
• base_estimator – object or None, optional (default=None) The base estimator to fit on
random subsets of the dataset. If None, then the base estimator is a decision tree.
• n_estimators – int, optional (default=10) The number of base estimators in the ensemble.
• max_samples – int or float, optional (default=1.0) The number of samples to draw from
X to train each base estimator. If int, then draw max_samples samples. If float, then draw
max_samples * X.shape[0] samples.
• max_features – int or float, optional (default=1.0) The number of features to draw from
X to train each base estimator. If int, then draw max_features features. If float, then draw
max_features * X.shape[1] features.
• bootstrap_features – boolean, optional (default=False) Whether features are drawn
with replacement.
• oob_score – bool, optional (default=False) Whether to use out-of-bag samples to estimate
the generalization error.
• warm_start – bool, optional (default=False) When set to True, reuse the solution of the
previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new
ensemble.
• n_jobs – int or None, optional (default=None) The number of jobs to run in parallel for
both fit and predict. None means 1 unless in a joblib.parallel_backend context.
-1 means using all processors.
• random_state – int, RandomState instance or None, optional (default=None) If int,
random_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is the
RandomState instance used by np.random.
• verbose – int, optional (default=0) Controls the verbosity when fitting and predicting.
Variables
• base_estimator – estimator The base estimator from which the ensemble is grown.
• estimators – list of estimators The collection of fitted base estimators.
• estimators_samples – list of arrays The subset of drawn samples (i.e., the in-bag
samples) for each base estimator. Each subset is defined by an array of the indices selected.
• estimators_features – list of arrays The subset of drawn features for each base
estimator.
• classes – array of shape = [n_classes] The classes labels.
• n_classes – int or list The number of classes.
• oob_score – float Score of the training dataset obtained using an out-of-bag estimate.
• oob_decision_function – array of shape = [n_samples, n_classes] Decision function
computed with out-of-bag estimate on the training set. If n_estimators is small it
might be possible that a data point was never left out during the bootstrap. In this case,
oob_decision_function_ might contain NaN.
class SequentiallyBootstrappedBaggingRegressor(samples_info_sets, price_bars,
                                               base_estimator=None,
                                               n_estimators=10, max_samples=1.0,
                                               max_features=1.0, bootstrap_features=False,
                                               oob_score=False, warm_start=False,
                                               n_jobs=None, random_state=None,
                                               verbose=0)
A Sequentially Bootstrapped Bagging regressor is an ensemble meta-estimator that fits base regressors each on
random subsets of the original dataset using Sequential Bootstrapping and then aggregates their individual
predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used
as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization
into its construction procedure and then making an ensemble out of it.
Parameters
class_weight='balanced_subsample')
clf = SequentiallyBootstrappedBaggingClassifier(base_estimator=base_est,
                                                samples_info_sets=triple_barrier_events.t1,
                                                price_bars=price_bars, oob_score=True)
clf.fit(X, y)
One of the key research principles of Advances in Financial Machine Learning is:
Backtesting is not a research tool. Feature importance is.
There are three ways to get feature importance scores:
1) Mean Decrease Impurity (MDI). This score can be obtained from tree-based classifiers and corresponds to
sklearn’s feature_importances_ attribute. MDI uses in-sample (IS) performance to estimate feature importance.
2) Mean Decrease Accuracy (MDA). This method can be applied to any classifier, not only tree-based ones.
MDA uses out-of-sample (OOS) performance in order to estimate feature importance.
3) Single Feature Importance (SFI). MDI and MDA both suffer from substitution effects: if two features are
highly correlated, one of them will be considered important while the other one will be considered redundant. SFI is
an OOS feature importance estimator which doesn’t suffer from substitution effects because it estimates each feature’s
importance separately.
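The three scores can be contrasted on synthetic data with plain sklearn. The sketch below is a simplified illustration and not the mlfinlab API: it uses a single train/test split for MDA and ordinary K-fold for SFI, whereas financial data would call for the purged cross-validation described earlier, and it measures MDA as the raw drop in accuracy after shuffling one feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the first 3 of 6 features are informative
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)

# MDI: in-sample impurity-based importance stored by the fitted forest
mdi = clf.feature_importances_

# MDA: drop in out-of-sample accuracy after shuffling one feature at a time
rng = np.random.default_rng(0)
base_score = clf.score(X_test, y_test)
mda = np.empty(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    mda[j] = base_score - clf.score(X_perm, y_test)

# SFI: out-of-sample score of a model trained on each feature alone
sfi = np.array([
    cross_val_score(RandomForestClassifier(n_estimators=25, random_state=0),
                    X[:, [j]], y, cv=3, scoring="accuracy").mean()
    for j in range(X.shape[1])
])

for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  MDA={mda[j]:.3f}  SFI={sfi[j]:.3f}")
```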
class_weight='balanced_subsample')
clf = SequentiallyBootstrappedBaggingClassifier(base_estimator=base_est,
                                                samples_info_sets=triple_barrier_events.t1,
                                                price_bars=price_bars, oob_score=True)
clf.fit(X_train, y_train)
scoring='accuracy').mean()
A partial solution to substitution effects is to orthogonalize the features by applying PCA to them. However, PCA can be
used not only to reduce the dimension of your data set, but also to understand whether the patterns detected by feature
importance are valid. Suppose that you derive orthogonal features using PCA. Your PCA analysis has determined
that some features are more ‘principal’ than others, without any knowledge of the labels (unsupervised learning). That
is, PCA has ranked features without any possible overfitting in a classification sense. When your MDI, MDA or SFI
analysis selects as most important (using label information) the same features that PCA chose as principal (ignoring
label information), this constitutes confirmatory evidence that the pattern identified by the ML algorithm is not entirely
overfit. Here is an example plot of MDI feature importance vs PCA eigenvalues:
Module which implements feature PCA compression and PCA analysis of feature importance
get_orthogonal_features(feature_df, variance_thresh=0.95)
Snippet 8.5, page 119. Computation of Orthogonal Features.
Get PCA orthogonal features
Parameters
• feature_df – (pd.DataFrame): with features
• variance_thresh – (float): % of overall variance which compressed vectors should
explain
Returns (pd.DataFrame): compressed PCA features which explain %variance_thresh of variance
get_pca_rank_weighted_kendall_tau(feature_imp, pca_rank)
Snippet 8.6, page 121. Computation of Weighted Kendall’s Tau Between Feature Importance and Inverse PCA
Ranking
Parameters
• feature_imp – (np.array): with feature mean importance
• pca_rank – (np.array): PCA based feature importance rank
Returns (float): weighted Kendall tau of feature importance and inverse PCA rank with p_value
feature_pca_analysis(feature_df, feature_importance, variance_thresh=0.95)
Perform correlation analysis between feature importance (MDI for example, supervised) and PCA eigen values
(unsupervised). High correlation means that probably the pattern identified by the ML algorithm is not entirely
overfit.
Parameters
• feature_df – (pd.DataFrame): with features
• feature_importance – (pd.DataFrame): individual MDI feature importance
• variance_thresh – (float): % of overall variance which compressed vectors should
explain in PCA compression
Returns (dict): with kendall, spearman, pearson and weighted_kendall correlations and p_values
Let’s see how PCA feature extraction and analysis are done using mlfinlab functions:
import pandas as pd
from mlfinlab.feature_importance.orthogonal import get_orthogonal_features, feature_pca_analysis
pca_features = get_orthogonal_features(X_train)
correlation_dict = feature_pca_analysis(X_train, feat_imp)
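For intuition, the two building blocks can be sketched with numpy and scipy alone. This is an illustrative approximation, not the library code: orthogonal_features is a hypothetical helper that standardizes the features, diagonalizes their correlation matrix and keeps the leading components up to the variance threshold, and the importance and rank arrays below are made-up numbers.

```python
import numpy as np
import pandas as pd
from scipy.stats import weightedtau


def orthogonal_features(feature_df, variance_thresh=0.95):
    """Project standardized features onto the leading principal components."""
    z = ((feature_df - feature_df.mean()) / feature_df.std()).to_numpy()
    corr = z.T @ z / (len(z) - 1)
    eig_val, eig_vec = np.linalg.eigh(corr)
    order = np.argsort(eig_val)[::-1]                 # largest eigenvalue first
    eig_val, eig_vec = eig_val[order], eig_vec[:, order]
    cum_var = np.cumsum(eig_val) / eig_val.sum()
    n_keep = min(int(np.searchsorted(cum_var, variance_thresh)) + 1, len(eig_val))
    columns = [f"PC_{i + 1}" for i in range(n_keep)]
    pcs = pd.DataFrame(z @ eig_vec[:, :n_keep], index=feature_df.index,
                       columns=columns)
    return pcs, cum_var[n_keep - 1]


# Toy features: four independent columns plus a near copy of the first one
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 4))
features = pd.DataFrame(np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=300)]),
                        columns=list("abcde"))
pca_features, explained = orthogonal_features(features)
print(pca_features.shape, round(explained, 4))

# Weighted Kendall's tau between importance and inverse PCA rank (made-up inputs)
feature_imp = np.array([0.55, 0.33, 0.07, 0.05])
pca_rank = np.array([1, 2, 4, 3])
tau, _ = weightedtau(feature_imp, 1.0 / pca_rank)
print(tau)
```

A tau close to 1 indicates that the supervised importance ranking agrees with the unsupervised PCA ranking.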
“There are fascinating parallels between strategy games and investing. Some of the best portfolio managers I have
worked with are excellent poker players, perhaps more so than chess players. One reason is bet sizing, for which Texas
Hold’em provides a great analogue and training ground. Your ML algorithm can achieve high accuracy, but if you do
not size your bets properly, your investment strategy will inevitably lose money. In this chapter we will review a few
approaches to size bets from ML predictions.” Advances in Financial Machine Learning, Chapter 10: Bet Sizing, pg
141.
The code in this directory falls under 3 submodules:
1. Bet Sizing: We have extended the code from the book in an easy to use format for practitioners to use going
forward.
2. EF3M: An implementation of the EF3M algorithm.
3. Chapter10_Snippets: Documented and adjusted snippets from the book for users to experiment with.
Functions for bet sizing are implemented based on the approaches described in chapter 10.
Assuming a machine learning algorithm has predicted a series of investment positions, one can use the probabilities
of each of these predictions to derive the size of that specific bet.
bet_size_probability(events, prob, num_classes, pred=None, step_size=0.0, average_active=False,
num_threads=1)
Calculates the bet size using the predicted probability. Note that if ‘average_active’ is True, the returned
pandas.Series will be twice the length of the original since the average is calculated at each bet’s open and close.
Parameters
• events – (pandas.DataFrame) Contains at least the column ‘t1’, the expiry datetime of the
product, with a datetime index, the datetime the position was taken.
• prob – (pandas.Series) The predicted probability.
• num_classes – (int) The number of predicted bet sides.
• pred – (pd.Series) The predicted bet side. Default value is None which will return a relative
bet size (i.e. without multiplying by the side).
• step_size – (float) The step size at which the bet size is discretized, default is 0.0 which
imposes no discretization.
• average_active – (bool) Option to average the size of active bets, default value is False.
• num_threads – (int) The number of processing threads to utilize for multiprocessing,
default value is 1.
Returns (pandas.Series) The bet size, with the time index.
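The core transformation from probability to bet size can be sketched as follows. This is a minimal stand-alone version under the assumption that the size is derived from a test statistic of the predicted probability against the uniform probability 1/num_classes, mapped through the normal CDF; size_from_probability is a hypothetical name, and probabilities of exactly 0 or 1 are not handled.

```python
import numpy as np
from scipy.stats import norm


def size_from_probability(prob, num_classes, side=1.0, step_size=0.0):
    """Map predicted probabilities to bet sizes in [-1, 1]."""
    prob = np.asarray(prob, dtype=float)
    # Test statistic of the predicted probability against a uniform guess
    z = (prob - 1.0 / num_classes) / np.sqrt(prob * (1.0 - prob))
    size = side * (2.0 * norm.cdf(z) - 1.0)
    if step_size > 0:
        # Discretize to avoid trading on tiny changes in the signal
        size = np.round(size / step_size) * step_size
    return size


probs = np.array([0.5, 0.55, 0.7, 0.9])
print(size_from_probability(probs, num_classes=2))
print(size_from_probability(probs, num_classes=2, side=-1.0))
print(size_from_probability(probs, num_classes=2, step_size=0.1))
```

A probability equal to 1/num_classes produces a bet size of zero, and the size grows towards 1 as the prediction becomes more confident.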
Assuming one has a series of forecasted prices for a given investment product, that forecast and the current market
price and position can be used to dynamically calculate the bet size.
bet_size_dynamic(current_pos, max_pos, market_price, forecast_price, cal_divergence=10,
cal_bet_size=0.95, func='sigmoid')
Calculates the bet sizes, target position, and limit price as the market price and forecast price fluctuate. The
current position, maximum position, market price, and forecast price can be passed as separate pandas.Series
(with a common index), as individual numbers, or a combination thereof. If any one of the aforementioned
arguments is a pandas.Series, the other arguments will be broadcast to a pandas.Series of the same length and
index.
Parameters
• current_pos – (pandas.Series, int) Current position.
• max_pos – (pandas.Series, int) Maximum position
• market_price – (pandas.Series, float) Market price.
• forecast_price – (pandas.Series, float) Forecast price.
• cal_divergence – (float) The divergence to use in calibration.
• cal_bet_size – (float) The bet size to use in calibration.
• func – (string) Function to use for dynamic calculation. Valid options are: ‘sigmoid’,
‘power’.
Returns (pandas.DataFrame) Bet size (bet_size), target position (t_pos), and limit price (l_p).
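The sigmoid variant of this calculation can be sketched in a few lines. The helpers below are local illustrations written for this example (sigmoid_w, sigmoid_bet_size and sigmoid_target_pos are hypothetical names, not library functions), and the prices are made-up numbers. The coefficient w is first calibrated so that a divergence of 10 maps to a bet size of 0.95, matching the defaults cal_divergence and cal_bet_size.

```python
def sigmoid_w(price_div, bet_size):
    """Calibrate w so that the given divergence yields the given bet size."""
    return price_div ** 2 * (bet_size ** -2 - 1)


def sigmoid_bet_size(w_param, price_div):
    """Bet size in (-1, 1) as a sigmoid-shaped function of the divergence."""
    return price_div * (w_param + price_div ** 2) ** -0.5


def sigmoid_target_pos(w_param, forecast_price, market_price, max_pos):
    """Target position: bet size scaled by the maximum position, truncated."""
    return int(sigmoid_bet_size(w_param, forecast_price - market_price) * max_pos)


# Calibration: a divergence of 10 should give a bet size of 0.95
w = sigmoid_w(price_div=10, bet_size=0.95)

# Hypothetical market state
market_price, forecast_price, max_pos = 100.0, 115.0, 100
size = sigmoid_bet_size(w, forecast_price - market_price)
t_pos = sigmoid_target_pos(w, forecast_price, market_price, max_pos)
print(round(w, 4), round(size, 4), t_pos)
```

Because the divergence of 15 exceeds the calibration divergence of 10, the resulting bet size is larger than 0.95.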
These approaches consider the number of concurrent active bets and their sides, and set the bet size in such a way that
some cash is reserved for the possibility that the trading signal strengthens before it weakens.
bet_size_budget(events_t1, sides)
Calculates a bet size from the bet sides and start and end times. These sequences are used to determine the
number of concurrent long and short bets, and the resulting strategy-independent bet sizes are the difference
between the average long and short bets at any given time. This strategy is based on the section 10.2 in “Advances
in Financial Machine Learning”. This creates a linear bet sizing scheme that is aligned to the expected number
of concurrent bets in the dataset.
Parameters
• events_t1 – (pandas.Series) The end datetime of the position with the start datetime as
the index.
• sides – (pandas.Series) The side of the bet with the start datetime as index. Index must
match the ‘events_t1’ argument exactly. Bet sides less than zero are interpreted as short,
bet sides greater than zero are interpreted as long.
Returns (pandas.DataFrame) The ‘events_t1’ and ‘sides’ arguments as columns, with the number
of concurrent active long and short bets, as well as the bet size, in additional columns.
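The budgeting idea can be illustrated with a short stand-alone sketch. It is not the library implementation: budget_bet_sizes is a hypothetical helper, a bet is assumed to be active from its start time up to (but excluding) its end time, and the bet size is taken as the fraction of the maximum concurrent longs minus the fraction of the maximum concurrent shorts.

```python
import pandas as pd


def budget_bet_sizes(events_t1, sides):
    """Count concurrent long/short bets at each start time and size the bet."""
    starts = events_t1.index.to_numpy()
    ends = events_t1.to_numpy()
    side_vals = sides.to_numpy()
    active_long, active_short = [], []
    for t in starts:
        active = (starts <= t) & (ends > t)        # bets open at time t
        active_long.append(int((active & (side_vals > 0)).sum()))
        active_short.append(int((active & (side_vals < 0)).sum()))
    out = pd.DataFrame({"t1": events_t1, "side": sides,
                        "active_long": active_long,
                        "active_short": active_short}, index=events_t1.index)
    max_long = out["active_long"].max()
    max_short = out["active_short"].max()
    frac_long = out["active_long"] / max_long if max_long > 0 else 0.0
    frac_short = out["active_short"] / max_short if max_short > 0 else 0.0
    out["bet_size"] = frac_long - frac_short
    return out


# Toy example: four bets, each lasting two days; the third one is a short
dates = pd.date_range("2020-01-01", periods=6, freq="D")
events_t1 = pd.Series(dates[[2, 3, 4, 5]], index=dates[[0, 1, 2, 3]])
sides = pd.Series([1, 1, -1, 1], index=events_t1.index)

budget = budget_bet_sizes(events_t1, sides)
print(budget[["active_long", "active_short", "bet_size"]])
```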
bet_size_reserve(events_t1, sides, fit_runs=100, epsilon=1e-05, factor=5, variant=2, max_iter=10000,
num_workers=1, return_parameters=False)
Calculates the bet size from bet sides and start and end times. These sequences are used to determine the number
of concurrent long and short bets, and the difference between the two at each time step, c_t. A mixture of two
Gaussian distributions is fit to the distribution of c_t, which is then used to determine the bet size. This strategy
results in a sigmoid-shaped bet sizing response aligned to the expected number of concurrent long and short bets
in the dataset.
Note that this function creates a <mlfinlab.bet_sizing.ef3m.M2N> object and makes use of the parallel
fitting functionality. As such, this function accepts and passes fitting parameters to the
mlfinlab.bet_sizing.ef3m.M2N.mp_fit() method.
Parameters
• events_t1 – (pandas.Series) The end datetime of the position with the start datetime as
the index.
• sides – (pandas.Series) The side of the bet with the start datetime as index. Index must
match the ‘events_t1’ argument exactly. Bet sides less than zero are interpreted as short,
bet sides greater than zero are interpreted as long.
• fit_runs – (int) Number of runs to execute when trying to fit the distribution.
• epsilon – (float) Error tolerance.
• factor – (float) Lambda factor from equations.
• variant – (int) Which algorithm variant to use, 1 or 2.
• max_iter – (int) Maximum number of iterations after which to terminate loop.
• num_workers – (int) Number of CPU cores to use for multiprocessing execution, set to
-1 to use all CPU cores. Default is 1.
• return_parameters – (bool) If True, function also returns a dictionary of the fitted
mixture parameters.
Returns (pandas.DataFrame) The ‘events_t1’ and ‘sides’ arguments as columns, with the number
of concurrent active long, short bets, the difference between long and short, and the bet size in
additional columns. Also returns the mixture parameters if ‘return_parameters’ is set to True.
confirm_and_cast_to_df(d_vars)
Accepts either pandas.Series (with a common index) or integer/float values, casts all non-pandas.Series
values to Series, and returns a pandas.DataFrame for further calculations. This is a helper function to the
‘bet_size_dynamic’ function.
Parameters d_vars – (dict) A dictionary where the values are either pandas.Series or single
int/float values. All pandas.Series passed are assumed to have the same index. The keys of
the dictionary will be used for column names in the returned pandas.DataFrame.
Returns (pandas.DataFrame) The values from the input dictionary in pandas.DataFrame format,
with dictionary keys as column names.
get_concurrent_sides(events_t1, sides)
Given the side of the position along with its start and end timestamps, this function returns two pandas.Series
indicating the number of concurrent long and short bets at each timestamp.
Parameters
• events_t1 – (pandas.Series) The end datetime of the position with the start datetime as
the index.
• sides – (pandas.Series) The side of the bet with the start datetime as index. Index must
match the ‘events_t1’ argument exactly. Bet sides less than zero are interpreted as short,
bet sides greater than zero are interpreted as long.
Returns (pandas.DataFrame) The ‘events_t1’ and ‘sides’ arguments as columns, with two additional
columns indicating the number of concurrent active long and active short bets at each timestamp.
cdf_mixture(x_val, parameters)
The cumulative distribution function of a mixture of 2 normal distributions, evaluated at x_val.
Parameters
• x_val – (float) Value at which to evaluate the CDF.
• parameters – (list) The parameters of the mixture, [mu_1, mu_2, sigma_1, sigma_2,
p_1]
Returns (float) CDF of the mixture.
single_bet_size_mixed(c_t, parameters)
Returns the single bet size based on the description provided in question 10.4(c), provided the difference in
concurrent long and short positions, c_t, and the fitted parameters of the mixture of two Gaussian distributions.
Parameters
• c_t – (int) The difference in the number of concurrent long bets minus short bets.
• parameters – (list) The parameters of the mixture, [mu_1, mu_2, sigma_1, sigma_2,
p_1]
Returns (float) Bet size.
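As an illustration, the mixture CDF and the resulting bet size can be written with scipy as follows. This is a sketch under the assumption that the bet size rescales the mixture CDF around its value at zero, so that c_t = 0 maps to a size of 0 and the extremes map to -1 and 1; mixture_cdf and reserve_bet_size are hypothetical names, and the parameter values are made up.

```python
from scipy.stats import norm


def mixture_cdf(x_val, parameters):
    """CDF of a mixture of two normals; parameters = [mu_1, mu_2, sigma_1, sigma_2, p_1]."""
    mu_1, mu_2, sigma_1, sigma_2, p_1 = parameters
    return (p_1 * norm.cdf(x_val, mu_1, sigma_1)
            + (1.0 - p_1) * norm.cdf(x_val, mu_2, sigma_2))


def reserve_bet_size(c_t, parameters):
    """Bet size from the difference between concurrent long and short bets."""
    f_0 = mixture_cdf(0, parameters)
    f_c = mixture_cdf(c_t, parameters)
    if c_t >= 0:
        return (f_c - f_0) / (1.0 - f_0)
    return (f_c - f_0) / f_0


# Made-up mixture parameters: [mu_1, mu_2, sigma_1, sigma_2, p_1]
params = [-2.0, 3.0, 1.5, 2.0, 0.4]
for c_t in (-6, -2, 0, 2, 6):
    print(c_t, round(reserve_bet_size(c_t, params), 4))
```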
The EF3M algorithm was introduced in a paper by Marcos Lopez de Prado and Matthew D. Foreman, titled “A mixture
of Gaussians approach to mathematical portfolio oversight: the EF3M algorithm”.
The abstract reads: “An analogue can be made between: (a) the slow pace at which species adapt to an environment,
which often results in the emergence of a new distinct species out of a once homogeneous genetic pool, and (b) the
slow changes that take place over time within a fund, mutating its investment style. A fund’s track record provides a
sort of genetic marker, which we can use to identify mutations. This has motivated our use of a biometric procedure
to detect the emergence of a new investment style within a fund’s track record. In doing so, we answer the question:
“What is the probability that a particular PM’s performance is departing from the reference distribution used to allocate
her capital?” The EF3M approach, inspired by evolutionary biology, may help detect early stages of an evolutionary
divergence in an investment style, and trigger a decision to review a fund’s capital allocation.”
The Exact Fit of the first 3 Moments (EF3M) algorithm allows the parameters of a mixture of Gaussian distributions
to be estimated given the first 5 moments of the mixture distribution, as well as the assumption that the mixture
distribution is composed of a number of Gaussian distributions.
A more thorough investigation into the algorithm can be found within our Research repository
M2N
A class for determining the means, standard deviations, and mixture proportion of a given distribution from its first
four or five statistical moments.
class M2N(moments, epsilon=1e-05, factor=5, n_runs=1, variant=1, max_iter=100000, num_workers=-1)
M2N - A Mixture of 2 Normal distributions. This class is used to contain parameters and equations for the EF3M
algorithm, when fitting parameters to a mixture of 2 Gaussian distributions.
Parameters
• moments – (list) The first five (1. . . 5) raw moments of the mixture distribution.
• epsilon – (float) Fitting tolerance
• factor – (float) Lambda factor from equations
• n_runs – (int) Number of times to execute ‘singleLoop’
• variant – (int) The EF3M variant to execute, options are 1: EF3M using first 4 moments,
2: EF3M using first 5 moments
• max_iter – (int) Maximum number of iterations to perform in the ‘fit’ method
• num_workers – (int) Number of CPU cores to use for multiprocessing execution. Default
is -1 which sets num_workers to all cores.
fit(mu_2)
Fits the parameters that describe the mixture of the 2 Normal distributions for a given set of initial
parameter guesses.
Parameters mu_2 – (float) An initial estimate for the mean of the second distribution.
get_moments(parameters, return_result=False)
Calculates and returns the first five (1. . . 5) raw moments corresponding to the newly estimated parameters.
Parameters
• parameters – (list) List of parameters in the specific order [mu_1, mu_2, sigma_1,
sigma_2, p_1]
• return_result – (bool) If True, method returns a result instead of setting the
‘self.new_moments’ attribute.
Returns (list) List of the first five moments
iter_4(mu_2, p_1)
Evaluation of the set of equations that make up variant #1 of the EF3M algorithm (fitting using the first
four moments).
Parameters
• mu_2 – (float) Initial parameter value for mu_2
• p_1 – (float) Probability defining the mixture; p_1, 1 - p_1
Returns (list) List of estimated parameters if no invalid values are encountered (e.g. complex
values, divide-by-zero), otherwise an empty list is returned.
iter_5(mu_2, p_1)
Evaluation of the set of equations that make up variant #2 of the EF3M algorithm (fitting using the first
five moments).
Parameters
• mu_2 – (float) Initial parameter value for mu_2
• p_1 – (float) Probability defining the mixture; p_1, 1-p_1
Returns (list) List of estimated parameters if no invalid values are encountered (e.g. complex
values, divide-by-zero), otherwise an empty list is returned.
mp_fit()
Parallelized implementation of the ‘single_fit_loop’ method. Makes use of dask.delayed to execute multi-
ple calls of ‘single_fit_loop’ in parallel.
Returns (pd.DataFrame) Fitted parameters and error
single_fit_loop(epsilon=0)
A single scan through the list of mu_2 values, cataloging the successful fittings in a DataFrame.
Parameters epsilon – (float) Fitting tolerance.
Returns (pd.DataFrame) Fitted parameters and error
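The moment calculations that the fitting relies on can be sketched independently of the class. The helpers below are hypothetical stand-alone functions, not methods of M2N: they compute the first five raw moments of a mixture of two normals from the closed-form raw moments of each component, and convert raw moments to a moment about the mean using the binomial expansion.

```python
from math import comb


def normal_raw_moments(mu, sigma):
    """First five raw moments E[X^k], k = 1..5, of a normal distribution."""
    return [
        mu,
        mu ** 2 + sigma ** 2,
        mu ** 3 + 3 * mu * sigma ** 2,
        mu ** 4 + 6 * mu ** 2 * sigma ** 2 + 3 * sigma ** 4,
        mu ** 5 + 10 * mu ** 3 * sigma ** 2 + 15 * mu * sigma ** 4,
    ]


def mixture_raw_moments(parameters):
    """Raw moments of the mixture; parameters = [mu_1, mu_2, sigma_1, sigma_2, p_1]."""
    mu_1, mu_2, sigma_1, sigma_2, p_1 = parameters
    m_1 = normal_raw_moments(mu_1, sigma_1)
    m_2 = normal_raw_moments(mu_2, sigma_2)
    return [p_1 * a + (1 - p_1) * b for a, b in zip(m_1, m_2)]


def centered_from_raw(raw_moments, order):
    """Moment of the given order about the mean, from moments about the origin."""
    mean = raw_moments[0]
    raw = [1.0] + list(raw_moments)            # raw[0] is the zeroth moment
    return sum(comb(order, k) * (-mean) ** (order - k) * raw[k]
               for k in range(order + 1))


moments = mixture_raw_moments([-2.0, 3.0, 1.5, 2.0, 0.4])
print(moments)
print(centered_from_raw(moments, 2))           # variance of the mixture
```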
centered_moment(moments, order)
Compute a single moment of a specific order about the mean (centered) given moments about the origin (raw).
Parameters
Chapter 10 of “Advances in Financial Machine Learning” contains a number of Python code snippets, many of which
are used to create the top level bet sizing functions. These functions can be found in
mlfinlab.bet_sizing.ch10_snippets.py.
bet_size_sigmoid(w_param, price_div)
Part of SNIPPET 10.4 Calculates the bet size from the price divergence and a regulating coefficient. Based on a
sigmoid function for a bet size algorithm.
Parameters
• w_param – (float) Coefficient regulating the width of the bet size function.
• price_div – (float) Price divergence, forecast price - market price.
Returns (float) The bet size.
get_target_pos_sigmoid(w_param, forecast_price, market_price, max_pos)
Part of SNIPPET 10.4 Calculates the target position given the forecast price, market price, maximum position
size, and a regulating coefficient. Based on a sigmoid function for a bet size algorithm.
Parameters
• w_param – (float) Coefficient regulating the width of the bet size function.
• forecast_price – (float) Forecast price.
• market_price – (float) Market price.
• max_pos – (int) Maximum absolute position size.
Returns (int) Target position.
• w_param – (float) Coefficient regulating the width of the bet size function.
• forecast_price – (float) Forecast price.
• market_price – (float) Market price.
• max_pos – (int) Maximum absolute position size.
• func – (string) Function to use for dynamic calculation. Valid options are: ‘sigmoid’,
‘power’.
Returns (int) Target position.
inv_price(forecast_price, w_param, m_bet_size, func)
Derived from SNIPPET 10.4 Calculates the inverse of the bet size with respect to the market price. The ‘func’
argument allows the user to choose between bet sizing functions.
Parameters
• forecast_price – (float) Forecast price.
• w_param – (float) Coefficient regulating the width of the bet size function.
• m_bet_size – (float) Bet size.
Returns (float) Inverse of bet size with respect to market price.
limit_price(target_pos, pos, forecast_price, w_param, max_pos, func)
Derived from SNIPPET 10.4 Calculates the limit price. The ‘func’ argument allows the user to choose between
bet sizing functions.
Parameters
• target_pos – (int) Target position.
• pos – (int) Current position.
• forecast_price – (float) Forecast price.
• w_param – (float) Coefficient regulating the width of the bet size function.
• max_pos – (int) Maximum absolute position size.
• func – (string) Function to use for dynamic calculation. Valid options are: ‘sigmoid’,
‘power’.
Returns (float) Limit price.
get_w(price_div, m_bet_size, func)
Derived from SNIPPET 10.4. Calculates the inverse of the bet size with respect to the regulating coefficient ‘w’.
The ‘func’ argument allows the user to choose between bet sizing functions.
Parameters
• price_div – (float) Price divergence, forecast price - market price.
• m_bet_size – (float) Bet size.
• func – (string) Function to use for dynamic calculation. Valid options are: ‘sigmoid’,
‘power’.
Returns (float) Inverse of bet size with respect to the regulating coefficient.
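For the sigmoid function, the calibration of ‘w’ has a closed form: solving m = x / sqrt(w + x^2) for w gives w = x^2 (m^-2 - 1), as in Snippet 10.4 of Advances in Financial Machine Learning. The sketch below uses illustrative function names and checks the result by a round trip; it is not the mlfinlab source.

```python
import math


def bet_size_sigmoid(w_param, price_div):
    """Sigmoid bet size m = x / sqrt(w + x^2)."""
    return price_div / math.sqrt(w_param + price_div ** 2)


def get_w_sigmoid(price_div, m_bet_size):
    """Solve m = x / sqrt(w + x^2) for w; m_bet_size must be non-zero and |m_bet_size| < 1."""
    return price_div ** 2 * (m_bet_size ** -2 - 1.0)


# Calibrate w so that a divergence of 10 maps to a bet size of 0.95 (illustrative values).
w = get_w_sigmoid(price_div=10.0, m_bet_size=0.95)
print(w)
print(bet_size_sigmoid(w, 10.0))
```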
The following research notebooks can be used to better understand bet sizing.
4.10 Portfolio Optimisation
The portfolio optimisation module contains some classic algorithms used for asset allocation and for optimising
strategies. These algorithms are discussed in detail below.
Hierarchical Risk Parity (HRP) is a portfolio optimisation method developed by Marcos López de Prado. The algorithm
works in three steps:
1. Based on the correlations between asset returns, the assets are grouped into clusters via hierarchical tree clustering.
2. Based on these clusters, the covariance matrix of the returns is quasi-diagonalised, so that assets within the
same cluster are placed next to each other.
3. Finally, weights are assigned to the clusters by recursive bisection. At each node, the weight is split between
the two sub-clusters until every individual asset has been assigned a weight.
Although it is a simple algorithm, HRP has been found to be very stable compared to its older counterparts.
This is because HRP does not invert the covariance matrix, which makes it robust to
small changes in the covariances of the asset returns.
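The weight assignment in the final step can be sketched in plain Python as follows. This is a simplified illustration in the spirit of the recursive bisection described in Chapter 16 of Advances in Financial Machine Learning, not the mlfinlab implementation. It assumes the assets have already been ordered by the quasi-diagonalisation step; the helper names and covariance values are made up for the example.

```python
def cluster_variance(cov, items):
    """Variance of a cluster when its assets are weighted by inverse variance."""
    inv_var = [1.0 / cov[i][i] for i in items]
    total = sum(inv_var)
    w = [v / total for v in inv_var]
    return sum(w[a] * cov[i][j] * w[b]
               for a, i in enumerate(items)
               for b, j in enumerate(items))


def recursive_bisection(cov, sorted_items):
    """Split the ordered assets in half repeatedly, giving the lower-variance half more weight."""
    weights = {i: 1.0 for i in sorted_items}
    clusters = [list(sorted_items)]
    while clusters:
        next_clusters = []
        for cluster in clusters:
            if len(cluster) < 2:
                continue  # a single asset cannot be split further
            mid = len(cluster) // 2
            left, right = cluster[:mid], cluster[mid:]
            var_left = cluster_variance(cov, left)
            var_right = cluster_variance(cov, right)
            alpha = 1.0 - var_left / (var_left + var_right)
            for i in left:
                weights[i] *= alpha
            for i in right:
                weights[i] *= 1.0 - alpha
            next_clusters += [left, right]
        clusters = next_clusters
    return weights


# Hypothetical 3-asset covariance matrix, already in quasi-diagonal order.
cov = [[0.04, 0.02, 0.00],
       [0.02, 0.05, 0.00],
       [0.00, 0.00, 0.09]]
print(recursive_bisection(cov, [0, 1, 2]))
```

Only the diagonal and the cluster sub-matrices are used, and no matrix inverse is ever computed.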
Implementation
This module implements the HRP algorithm described in the following paper: López de Prado, Marcos, Building
Diversified Portfolios that Outperform Out-of-Sample (May 23, 2016), Journal of Portfolio Management, 2016. The
code is reproduced with modification from his book Advances in Financial Machine Learning, Chapter 16.
class HierarchicalRiskParity
The HRP algorithm is a robust algorithm which tries to overcome the limitations of the CLA algorithm. It has
three important steps: hierarchical tree clustering, quasi-diagonalisation and recursive bisection. Because it
does not invert the covariance matrix, HRP is very stable and insensitive to small changes in covariances.
__init__()
Initialize self. See help(type(self)) for accurate signature.
allocate(asset_prices, resample_by='B', use_shrinkage=False)
Calculate asset allocations using the HRP algorithm.
Parameters
• asset_prices – (pd.DataFrame) a dataframe of historical asset prices (daily close)
indexed by date
• resample_by – (str) specifies how to resample the prices (daily, weekly, monthly, etc.).
Defaults to 'B', meaning business days, which is equivalent to no resampling for daily data
• use_shrinkage – (bool) specifies whether to shrink the covariances
plot_clusters(assets)
Plot a dendrogram of the hierarchical clusters
The Critical Line Algorithm (CLA) is a robust alternative to the quadratic optimisation used to find mean-variance
optimal portfolios. The major difference between classic Mean-Variance and CLA is the type of optimisation problem
solved. A typical mean-variance optimisation problem looks like this:
\[ \min_{w} \; w^{T} \Sigma w \]
where $\sum_{i} w_i = 1$ and $0 \le w_i \le 1$. Here $w$ is the vector of portfolio weights and $\Sigma$ is the
covariance matrix of asset returns. CLA solves the same problem with more general constraints: each asset weight can
have its own lower and upper bound. The optimisation objective remains the same, but the second constraint becomes
$l_i \le w_i \le u_i$, which increases the number of constraints to be handled.
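To make the effect of the per-asset bounds concrete, the following sketch solves the minimisation problem above for two assets by brute-force grid search. This is deliberately not the Critical Line Algorithm; it only shows how an upper bound on one weight changes the minimum-variance solution. The covariance values and function names are made up for the example.

```python
def portfolio_variance(weights, cov):
    """Return w^T * Sigma * w for a list of weights and a nested-list covariance matrix."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


def min_variance_two_assets(cov, bounds, steps=1000):
    """Grid search over w1 (with w2 = 1 - w1), keeping only weights inside their bounds."""
    best_weights, best_var = None, None
    for k in range(steps + 1):
        w1 = k / steps
        weights = [w1, 1.0 - w1]
        feasible = all(lo <= w <= hi for w, (lo, hi) in zip(weights, bounds))
        if not feasible:
            continue
        var = portfolio_variance(weights, cov)
        if best_var is None or var < best_var:
            best_weights, best_var = weights, var
    return best_weights, best_var


cov = [[0.04, 0.006],
       [0.006, 0.09]]
# Classic constraint 0 <= w_i <= 1 for both assets.
print(min_variance_two_assets(cov, bounds=[(0.0, 1.0), (0.0, 1.0)]))
# Tighter upper bound on the first asset: the bound becomes binding.
print(min_variance_two_assets(cov, bounds=[(0.0, 0.6), (0.0, 1.0)]))
```

With the tighter bound the first weight is pushed to its upper limit and the achievable variance is higher, which is the kind of boundary behaviour a bounded solver has to track.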
The current CLA implementation in the package supports the following solutions:
1. CLA Turning Points
2. Maximum Sharpe Portfolio
3. Minimum Variance Portfolio
4. Efficient Frontier Solution
Implementation
This module implements the Critical Line Algorithm for mean-variance portfolio optimisation. It is reproduced
with modification from the following paper: D.H. Bailey and M. López de Prado, “An Open-Source Implementation of the
Critical-Line Algorithm for Portfolio Optimization”, Algorithms, 6 (2013), 169-196.
class CLA(weight_bounds=(0, 1), calculate_returns='mean')
CLA is a well-known portfolio optimisation algorithm used for calculating the optimal allocation weights for a given
portfolio. It solves the optimisation problem with lower and upper bounds on each weight. This class can compute
several types of solution: the CLA turning points, the minimum variance solution, the maximum Sharpe solution
and the efficient frontier.
__init__(weight_bounds=(0, 1), calculate_returns='mean')
Initialise the storage arrays and perform some preprocessing.
Parameters
• weight_bounds – (tuple) a tuple specifying the lower and upper bounds for the
portfolio weights
• calculate_returns – (str) the method to use for calculation of expected returns.
Currently supports “mean” and “exponential”
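The difference between the two expected-return options can be illustrated with a small standalone sketch. The weighting scheme below (a geometric decay applied to older returns) is an assumption chosen for illustration; the exact formula, annualisation and parameters used inside the package are not described in this section, and the function names are hypothetical.

```python
def simple_returns(prices):
    """Period-over-period simple returns from a list of prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def mean_return(prices):
    """Equally weighted average of the historical returns."""
    r = simple_returns(prices)
    return sum(r) / len(r)


def exponential_return(prices, decay=0.5):
    """Weighted average where the latest return has weight 1 and older ones decay geometrically."""
    r = simple_returns(prices)
    weights = [decay ** age for age in range(len(r) - 1, -1, -1)]
    return sum(w * x for w, x in zip(weights, r)) / sum(weights)


prices = [100.0, 100.0, 110.0]  # flat first period, then a 10% rise
print(mean_return(prices))
print(exponential_return(prices))
```

Because the most recent return carries more weight, the exponential estimate reacts faster to recent price moves than the plain mean.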
allocate(asset_prices, solution='cla_turning_points', resample_by='B')
Calculate the portfolio asset allocations using the method specified.
Parameters
• asset_prices – (pd.DataFrame) a dataframe of historical asset prices (adjusted close)
• solution – (str) specifies the type of solution to compute. Options are:
cla_turning_points, max_sharpe, min_volatility, efficient_frontier
• resample_by – (str) specifies how to resample the prices (daily, weekly, monthly, etc.).
Defaults to 'B', meaning business days, which is equivalent to no resampling for daily data
This class contains the classic Mean-Variance optimisation techniques, which use quadratic optimisation to solve
the portfolio allocation problem. Currently, it only supports the basic inverse-variance allocation strategy
(IVP), but we aim to add more functions to tackle other optimisation objectives such as maximum Sharpe, minimum
volatility and targeted-risk return maximisation.
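The inverse-variance allocation mentioned above has a simple closed form: each asset receives a weight proportional to the reciprocal of its return variance, normalised so that the weights sum to one. The sketch below is a standalone illustration with made-up variances and a hypothetical function name, not the mlfinlab code.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, normalised to sum to one."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [x / total for x in inv]


# Hypothetical return variances for three assets; the least volatile asset gets the most weight.
print(inverse_variance_weights([0.04, 0.09, 0.01]))
```

Correlations between assets are ignored entirely, which is what makes the strategy so simple.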
Implementation
This module implements the classic mean-variance optimisation techniques for calculating the efficient frontier. It
uses typical quadratic optimisers to generate optimal portfolios for different objective functions.
class MeanVarianceOptimisation
This class contains a variety of methods dealing with different solutions to the mean variance optimisation
problem.
__init__()
Initialize self. See help(type(self)) for accurate signature.
allocate(asset_prices, solution='inverse_variance', resample_by='B')
Calculate the portfolio asset allocations using the method specified.
Parameters
• asset_prices – (pd.DataFrame) a dataframe of historical asset prices (daily close)
• solution – (str) the type of solution/algorithm to use to calculate the weights
• resample_by – (str) specifies how to resample the prices (daily, weekly, monthly, etc.).
Defaults to 'B', meaning business days, which is equivalent to no resampling for daily data
4.10.4 Examples
Let's see how to import and use the three portfolio optimisation classes.
import pandas as pd

# Read in data. The date column may be named differently in your input.
stock_prices = pd.read_csv('FILE_PATH', parse_dates=True, index_col='Date')
One important thing to remember: all three classes require the stock prices dataframe to be indexed by date,
because the index is used internally to calculate the expected returns.
For HRP and IVP, you can access the computed weights as shown above. The weights are returned as a dataframe,
which can be sorted in descending order of weight.
from mlfinlab.portfolio_optimization.cla import CLA

# Turning Points
cla = CLA()
cla.allocate(asset_prices=stock_prices, resample_by='W', solution='cla_turning_points')
cla_weights = cla.weights
means, sigma = cla.efficient_frontier_means, cla.efficient_frontier_sigma
The following research notebooks can be used to better understand how the algorithms within this module can be used
on real stock data.
• Chapter 16 Exercise Notebook
5 Additional Information
5.1 Contact
At the moment the project is still rather small, so we recommend getting in touch with us over email so
that we can discuss the areas of contribution that interest you the most. We also have a Slack channel where we all
communicate.
For now you can get hold of us at: research@hudsonthames.org
Looking forward to hearing from you!
5.2 Contributing
Currently we have a live project board that follows the principles of Agile Project Management. This board is available
to the public and lets everyone know what the maintainers are currently working on. The board has an ice bucket filled
with new features and documentation that have priority.
Needless to say, if you are interested in working on something not on the project board, just drop us an email. We
would be very excited to discuss this further.
The typical contributions are:
• Answering the questions at the back of a chapter in a Jupyter Notebook. Research repo
• Adding new features and core functionality to the mlfinlab package.
5.2.2 Templates
5.3 License
Python Module Index
mlfinlab.cross_validation.cross_validation
mlfinlab.ensemble.sb_bagging
mlfinlab.feature_importance.importance
mlfinlab.feature_importance.orthogonal
mlfinlab.portfolio_optimization.cla
mlfinlab.portfolio_optimization.hrp
mlfinlab.portfolio_optimization.mean_variance