JRFM 14 00215
Article
New Dataset for Forecasting Realized Volatility: Is the Tokyo
Stock Exchange Co-Location Dataset Helpful for Expansion of
the Heterogeneous Autoregressive Model in the Japanese
Stock Market?
Takuo Higashide 1,2, Katsuyuki Tanaka 3, Takuji Kinkyo 3 and Shigeyuki Hamori 3,*
Abstract: This study analyzes the importance of the Tokyo Stock Exchange Co-Location dataset (TSE Co-Location dataset) to forecast the realized volatility (RV) of Tokyo stock price index futures. The heterogeneous autoregressive (HAR) model is a popular linear regression model used to forecast RV. This study expands the HAR model using the TSE Co-Location dataset, stock full-board dataset and market volume dataset based on the random forest method, which is a popular machine learning algorithm and a nonlinear model. The TSE Co-Location dataset is a new dataset and the only information that shows the transaction status of high-frequency traders. In contrast, the stock full-board dataset shows the status of buying and selling dominance. The market volume dataset is used as a proxy for liquidity and is recognized as important information in finance. To the best of our knowledge, this study is the first to use the TSE Co-Location dataset. The experimental results show that our model yields a higher out-of-sample forecast accuracy of RV than the HAR model. Moreover, we find that the TSE Co-Location dataset has become more important in recent years, along with the increasing importance of high-frequency trading.

Keywords: realized volatility; Tokyo Stock Exchange Co-Location dataset; heterogeneous autoregressive model; random forest method; high-frequency traders

Citation: Higashide, Takuo, Katsuyuki Tanaka, Takuji Kinkyo, and Shigeyuki Hamori. 2021. New Dataset for Forecasting Realized Volatility: Is the Tokyo Stock Exchange Co-Location Dataset Helpful for Expansion of the Heterogeneous Autoregressive Model in the Japanese Stock Market? Journal of Risk and Financial Management 14: 215. https://doi.org/10.3390/jrfm14050215

Academic Editors: Robert Brooks and Thanasis Stengos
model is well known as a long-term memory process model. In contrast, the HAR model
is not a long-term memory process model but well approximates the long-term memory
process with a few explanatory variables, which are past daily, weekly and monthly RV, in
a linear modeling framework. Baillie et al. (2019) report that RV series are quite complex
and can involve both HAR components and long memory components. Watanabe (2020) remarks that the HAR model has been the most commonly used model for RV time-series modeling in recent years, as it can predict RV with high accuracy using only a few explanatory variables. Qiu et al. (2019) remark that the HAR model has computational simplicity (e.g., the ordinary least squares method) and excellent out-of-sample performance compared to ARFIMA. Various previous studies have followed Corsi (2009),
expanding and generalizing the HAR model in many directions (Andersen et al. 2007;
Ubukata and Watanabe 2014; Bekaert and Hoerova 2014; Bollerslev et al. 2016; Luong and
Dokuchaev 2018; Qiu et al. 2019; Motegi et al. 2020; Watanabe 2020). In particular, Luong
and Dokuchaev (2018) introduced a nonlinear model using the random forest method,
which is a well-known machine learning method introduced by Breiman (2001). They
apply the random forest method for forecasting the direction (“up” or “down”) of RV in a
binary classification problem framework using a technical indicator of RV.
Linton and Mahmoodzadeh (2018) report that high-frequency trading (HFT) is the
predominant feature in current financial markets due to technological advances and market
structure development. Iwaisako (2017) reports that HFT has become an essential function
in the stock markets of developed countries since the latter half of the 2000s. According
to Iwaisako (2017), there were 81 academic papers related to HFT between 2000 and 2010, but the number increased to 334 from 2011 to 2016. In the Japanese stock market, as well as in other
developed countries’ stock markets, the influence of high-frequency traders (HFTs) is being
watched. Some previous studies examine the relationship between HFTs and volatility (Zhang 2010; Haldane 2011; Benos and Sagade 2012; Caivano 2015; Myers and Gerig 2015; Kirilenko et al. 2017; Malceniece et al. 2019). These existing studies report that HFTs affect volatility. In addition, HFTs can overamplify volatility and disrupt the market through system errors. Considering this situation, the Japanese Financial Services Agency
introduced the high-frequency trade participants registration system in 2018 to carefully
observe these influences (The Japanese Government Financial Services Agency 2018). Ac-
cording to a report issued by the Japanese Financial Services Agency in August 2020, 55
investors were registered as HFT participants. In addition, 54 of the 55 investors are foreign
investors. This ratio may be surprising but is only natural because about 70% of the trading in the Japanese stock market is executed by overseas investors, such as hedge funds, and they adopt HFT as a leading-edge investment strategy.
This study analyzes the importance of the Tokyo Stock Exchange Co-Location dataset
(TSE Co-Location dataset) to forecast the RV of Tokyo stock price index futures. Existing
studies define the HFTs to analyze the impact of the HFTs on the volatility (Zhang 2010;
Haldane 2011; Benos and Sagade 2012; Caivano 2015; Myers and Gerig 2015; Kirilenko
et al. 2017; Malceniece et al. 2019). However, these existing studies may have a limitation in terms of generalization, because there is no correct answer in the definition of HFTs (Iwaisako 2017) and definitional ambiguity remains. In this study, we respond to this
problem by using the TSE Co-Location dataset. The TSE Co-Location dataset is detailed
information on HFT taken by the participants who trade via a server located in the TSE
Co-Location area. This server only allows participants to perform HFT. Hence, the TSE
Co-Location dataset is generated with no ambiguity in the definition of HFTs. This is the
only dataset that can show the actual situation of HFT of stocks in Japan. Although the
HFT research is becoming more important, no analysis has been performed using the TSE
Co-Location dataset. To the best of our knowledge, this study is the first to use the TSE
Co-Location dataset.
We propose a new framework for forecasting the RV direction (“up” or “down”) of
Tokyo stock price index (TOPIX) futures in Tokyo time (9:00–15:00) using the random forest
method inspired by Luong and Dokuchaev (2018). Including Luong and Dokuchaev (2018), most
of the previous studies on RV forecasting use only explanatory variables related directly to RV (e.g., RV viewed over different time horizons, technical indicators of RV). However, in our framework, we use the past RV together with the TSE Co-Location dataset, stock full-board dataset and market volume dataset as explanatory variables. In particular, the TSE
Co-Location dataset is one of the main characteristics of our model. The TSE Co-Location
dataset is a new dataset provided by the Japan Exchange Group. This is the only dataset
that can determine the activity status of HFTs in the Japanese stock market. The stock full-
board dataset provides information on the potential of market liquidity and the strength
of demand and supply. The market volume dataset is used as a proxy for liquidity and
is recognized as important financial information. By expanding the explanatory variable
space by adding these three datasets, we show that our model yields a higher out-of-sample accuracy (hereafter, simply referred to as accuracy) of the direction of RV forecast than the
HAR model through experimental results.
In summary, the main contributions of this study are as follows: First, we experimentally show the importance of the TSE Co-Location dataset to forecast the RV of TOPIX.
To the best of our knowledge, this study is the first to use the TSE Co-Location dataset
and show its importance. Second, our proposed model provides higher forecast accuracy
than the HAR model. This is beneficial to both researchers and practitioners because it allows them to make a better model selection for the financial problem in advance. Third,
we found that the random forest method framework works effectively and can be superior
to the linear model in the framework of RV forecast, which is in line with the previous
studies that used the random forest method for building bankruptcy models of companies
(Tanaka et al. 2016, 2018a, 2018b, 2019). Our study uses a sufficiently long observation
period (2012 to 2019) to consider the change in market quality affected by the HFT system
and participants. Our observation period contains an essential period, which was around
2015. The HFT system named “Arrowhead” was introduced in 2010 by the Japan Exchange
Group. In 2015, Arrowhead was renewed to provide a better trading system that allowed
the participants to trade more frequently.
The remainder of this paper is organized as follows. In Section 2, we summarily
review the previous literature. In Section 3, we briefly review the overall process of our
study and introduce the details of datasets, preprocessing of datasets and the random forest
method. In Section 4, we provide out-of-sample experimental results of the RV forecast
accuracy. Section 5 presents the discussion and the conclusion.
2. Literature Review
2.1. Literature Review of Volatility Forecasting Models
There are many previous studies of time-series modeling for volatility forecasting, such
as the autoregressive conditional heteroskedasticity (ARCH) model (Engle 1982), stochas-
tic volatility model (Taylor 1982), generalized ARCH (GARCH) model (Bollerslev 1986),
Glosten–Jagannathan–Runkle GARCH model (Glosten et al. 1993) considering the asymmetry of volatility fluctuations, exponential GARCH (EGARCH) model (Nelson 1991)
and asymmetric power GARCH (Ding et al. 1993). In addition, the fractionally integrated EGARCH model (Baillie et al. 1996) and the stochastic volatility model with fractionally integrated order (Harvey 1998) consider long-term memory. Many other forecasting models
have been studied; for example, Poon and Granger (2003) comprehensively summarize a
wide range of previous studies regarding forecasting models of volatility.
Most of the volatility forecasting models are expanded based on the ARCH type
modeling framework or the stochastic volatility modeling framework. On the contrary,
most RV forecasting modeling is expanded based on the ARFIMA modeling framework
(Andersen et al. 2001) or the HAR modeling framework (Corsi 2009). Baillie et al. (2019)
assess the separate roles of fractionally integrated long memory models, extended HAR
models and time varying parameter HAR models. According to Baillie et al. (2019),
their experimental results suggest that RV series are quite complex and can involve both
HAR components and long memory components. Recently, as we noted in the previous
section, Watanabe (2020) reports that various previous studies have followed Corsi (2009), expanding and generalizing the HAR model in many directions. This is because the HAR model can predict RV with high prediction accuracy (Watanabe 2020) and has computational simplicity and excellent out-of-sample performance compared to ARFIMA (Qiu et al. 2019). Andersen et al. (2007) proposed the HAR with continuous volatility and
jumps (HAR-CJ) model, which decomposes RV into continuous and jump components,
respectively, in explanatory space. Bollerslev et al. (2016) introduced the HAR quarticity
(HARQ) model, which can handle the time-varying coefficients of the HAR model. In terms
of asymmetric modeling, Ubukata and Watanabe (2014) and Bekaert and Hoerova (2014)
proposed an asymmetric HAR model. Watanabe (2020) proposed an asymmetric HAR-CJ
model and an asymmetric HARQ model. Both asymmetric models are differentiated from
the symmetric model by adding a return term to the explanatory variables with a dummy.
Qiu et al. (2019) proposed a versatile HAR model that applies the least-squares model
averaging approach to HAR-type models with signed realized semi-variance to account
for model uncertainty and to allow for a more flexible lag structure. Motegi et al. (2020)
propose moving average threshold HAR models as a combination of HAR and threshold
autoregression. In contrast to these linear models for RV forecasting, Luong and Dokuchaev
(2018) introduced a nonlinear model using the random forest method.
In the dataset formation process, we created several combinations of explanatory variable datasets to analyze the contribution of each variable to the improvement of prediction accuracy. Finally, we built models for each dataset using the logistic and random forest methods to compare their RV forecast accuracy. We provide the prediction accuracy from different methods and tasks. Below, we describe the details of each preprocessing and dataset.

3.1. NEEDS Tick Data File Preprocessing
We use the NEEDS Tick Data File provided by NIKKEI Media Marketing (NIKKEI Media Marketing NEEDS Tick Data n.d.) for RV calculation (Section 3.1.1) and stock full-board dataset preprocessing (Section 3.1.2). Before calculation and preprocessing, we thin out every 5 min according to previous studies (Ubukata and Watanabe 2014).
Most previous researchers report that the smaller the sampling interval, the larger the market microstructure noise that may be contained in the RV calculation. We extract the
following information: traded price, traded volume and stock full-board dataset, which is
composed of the 1st best quote to the 10th best quote quantity and price on both the bid
and offer sides. In the morning session, the data points we extracted were 09:01, 09:05, . . . ,
11:25. In the afternoon session, 12:31, 12:35, . . . , 14:55. Note that there is only a morning session on both the grand opening and closing days of the year. Therefore, we extracted only the morning session on these two days.
RV_t^{(d)} = \alpha \sum_{i=0}^{n-1} r_{t+i/n}^2 \quad (1)

where

\alpha = \sum_{t=1}^{T} \left( R_t - \bar{R} \right)^2 \Big/ \sum_{t=1}^{T} RV_t \quad (2)
Here, the subscript t indexes the day, while T indexes the endpoint within the observation period; α denotes the adjustment coefficient for non-trading (evening) hours. The superscript (d) in Equation (1) indexes daily. We follow Watanabe (2020) to calculate Equation (2), as
proposed by Hansen and Lunde (2005). Note that we calculate the return for RV based on
the trade price. If there are no transactions, we use the previously traded price.
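The calculation in Equations (1) and (2) can be sketched as follows. This is an illustrative implementation with synthetic returns, not the code used in this study; the function names are ours.

```python
import random

def intraday_rv(returns):
    """Equation (1) without alpha: the sum of squared intraday returns."""
    return sum(r * r for r in returns)

def hansen_lunde_alpha(daily_returns, raw_rvs):
    """Equation (2): sum of squared demeaned daily returns divided by
    the sum of raw RV over the T days of the observation period."""
    T = len(daily_returns)
    r_bar = sum(daily_returns) / T
    return sum((r - r_bar) ** 2 for r in daily_returns) / sum(raw_rvs)

# Synthetic example: 20 days of 5-min returns and the matching daily returns.
random.seed(0)
intraday = [[random.gauss(0, 0.001) for _ in range(66)] for _ in range(20)]
daily_returns = [random.gauss(0, 0.01) for _ in range(20)]

raw = [intraday_rv(day) for day in intraday]
alpha = hansen_lunde_alpha(daily_returns, raw)
rv_daily = [alpha * x for x in raw]   # RV_t^(d) in Equation (1)
```

The adjustment by α scales intraday RV so that, on average over the observation period, it matches the variance of close-to-close daily returns, compensating for the hours when the market is closed.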
Cum\_Plus_t = \sum_{i=0}^{n-1} \left( Bid_{t+i/n} + Offer_{t+i/n} \right), \quad (3)

Cum\_Minus_t = \sum_{i=0}^{n-1} \left( Offer_{t+i/n} - Bid_{t+i/n} \right) \quad (4)
Equations (3) and (4) describe the liquidity and the demand and supply of the market, respectively.
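Equations (3) and (4) are simple sums over the n intraday observation points and can be sketched as follows; the bid/offer quantities below are made up for illustration.

```python
def cum_plus(bids, offers):
    """Equation (3): cumulative depth on both sides, a liquidity proxy."""
    return sum(b + o for b, o in zip(bids, offers))

def cum_minus(bids, offers):
    """Equation (4): cumulative offer-minus-bid imbalance, a proxy for
    the demand and supply of the market."""
    return sum(o - b for b, o in zip(bids, offers))

# Hypothetical summed 10-level quantities at three observation points.
bids = [120, 90, 110]
offers = [100, 95, 130]
print(cum_plus(bids, offers), cum_minus(bids, offers))  # 645 5
```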
that can show the actual situation of HFT of stocks in Japan. Therefore, it is clear that
the co-location information is closely related to high-frequency price data. Thus, it is an
important variable that generates RV. As mentioned earlier, the importance of HFTs actions
has been increasing annually since 2010. Under this trend, the TSE Co-Location dataset
is the only important bridge for researchers and practitioners to consider the influence of
HFTs in the Japanese stock market.
In this study, we use three explanatory variables on day t. Note that there are only two
ways to trade Japanese stocks on the Tokyo Stock Exchange market: via TSE Co-Location
or the other. Thus, each denominator of Equations (5)–(7) is the total number taken by
these two methods. In contrast to the denominator, the numerator shows only the number
taken through the TSE Co-Location server.
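The exact forms of Equations (5)-(7) are not reproduced in this section, but they share the structure described above: the quantity routed through the TSE Co-Location server divided by the total over the two possible routes. A minimal sketch with made-up numbers:

```python
def colo_ratio(via_colo, via_other):
    """Shared structure of Equations (5)-(7): TSE Co-Location quantity
    in the numerator, total of the two trading routes in the denominator."""
    return via_colo / (via_colo + via_other)

# Hypothetical daily order quantities by route.
print(colo_ratio(700, 300))  # 0.7
```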
Table 1. Dataset.
Incidentally, the HAR model is known as the linear regression model in Equation (9), where

\log RV_t^{(l)} = l^{-1} \sum_{s=1}^{l} \log RV_{t-s}.

This equation is a linear combination of the constant term, daily RV (which is denoted by RV_t^{(d)}) calculated by Equations (1) and (2), weekly RV and monthly RV. Indicating the aggregation period as a superscript, the notation for the weekly RV is RV_t^{(w)}, while the monthly RV is denoted by RV_t^{(m)}. For instance, the weekly RV at time t is given by the average

RV_t^{(w)} = \frac{1}{5} \left( RV_t^{(d)} + RV_{t-1d}^{(d)} + \cdots + RV_{t-4d}^{(d)} \right).
The HAR model is not a long-term memory process but has three different types of
autoregressive terms to approximate long-term memory processes.
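The aggregation \log RV_t^{(l)} = l^{-1} \sum_{s=1}^{l} \log RV_{t-s} can be sketched as below for the conventional horizons l = 1, 5 and 22 (daily, weekly and monthly; Corsi 2009). The series is synthetic and the function name is ours.

```python
import math, random

def har_features(log_rv, t, horizons=(1, 5, 22)):
    """log RV_t^(l) = (1/l) * sum_{s=1}^{l} log RV_{t-s} for each horizon l."""
    return tuple(sum(log_rv[t - s] for s in range(1, l + 1)) / l
                 for l in horizons)

# Synthetic log RV series; in practice this comes from Equations (1)-(2).
random.seed(1)
series = [math.log(random.uniform(0.5, 2.0)) for _ in range(60)]
daily, weekly, monthly = har_features(series, 30)
```

The three aggregates are exactly the explanatory variables of the HAR regression: a single linear model over short-, medium- and long-horizon averages of past log RV.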
As shown in Table 1, we prepare different types of autoregressive terms for the
explanatory variables. Note that these explanatory variables are the rate of change. Using
these datasets, we build five different types of models and forecast the RV direction (“up”
or “down”). For this, we define the RV direction as follows:
\delta_t = \begin{cases} 1 \ (\text{up}) & \text{if } RV_t / RV_{t-1} > 1 \\ 0 \ (\text{down}) & \text{if } RV_t / RV_{t-1} < 1 \end{cases}

If RV_t / RV_{t-1} = 1, the case is omitted from the training dataset. In this study, there is no data point for this case.
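The labeling rule above can be sketched directly; the RV values here are illustrative.

```python
def direction_labels(rv):
    """delta_t as defined above: 1 if RV_t / RV_{t-1} > 1, 0 if < 1;
    a ratio exactly equal to 1 is omitted from the training dataset."""
    labels = []
    for prev, cur in zip(rv, rv[1:]):
        if cur / prev > 1:
            labels.append(1)
        elif cur / prev < 1:
            labels.append(0)
    return labels

print(direction_labels([1.0, 1.2, 0.9, 0.9, 1.5]))  # [1, 0, 1]
```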
3.4. Methods
Random forest is a popular machine learning method used for classification and
regression tasks with high-dimensional data (Breiman 2001). Random forests are applied
in various areas, including computer vision, finance and bioinformatics, because they
provide strong classification and regression performance. The random forest is a form of ensemble learning in the field of machine learning because it combines and aggregates the predictions output by several randomized decision trees. Each decision tree
corresponds to a weak discriminator in ensemble learning. Random forests are structured
through an ensemble of d decision trees with the following algorithm:
1. Create subsets of training data with random sampling by bootstrap.
2. Train a decision tree for each subset of training data.
3. Choose the best split of a variable from only the randomly selected m variables at
each node of the tree and derive the split function.
4. Repeat steps 1, 2 and 3 to produce d decision trees.
5. For test data, make predictions by voting or by averaging the most popular class
among all of the output from the d decision trees.
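The five steps above can be sketched with one-split trees (decision stumps) standing in for full decision trees. This is a self-contained illustration of the algorithm, not the implementation used in this study; the function names and the use of stumps are ours.

```python
import random
from collections import Counter

def train_stump(X, y, m):
    """Steps 2-3: fit a one-split tree, choosing the best split among m
    randomly selected variables by the weighted Gini impurity of children."""
    def gini(labels):
        return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())
    majority = lambda labels: Counter(labels).most_common(1)[0][0]
    best = None
    for j in random.sample(range(len(X[0])), m):
        for thr in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[j] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, thr, majority(left), majority(right))
    return best[1:]  # (variable, threshold, left class, right class)

def train_forest(X, y, d=25, m=2):
    forest = []
    for _ in range(d):                               # step 4: repeat for d trees
        idx = [random.randrange(len(X)) for _ in X]  # step 1: bootstrap sample
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], m))  # step 2
    return forest

def predict(forest, x):
    votes = [lc if x[j] <= thr else rc for j, thr, lc, rc in forest]
    return Counter(votes).most_common(1)[0][0]       # step 5: majority vote

# Demo on separable synthetic data: the class depends on variable 0.
random.seed(0)
X = [[i / 10, random.random(), random.random()] for i in range(40)]
y = [1 if row[0] > 2.0 else 0 for row in X]
forest = train_forest(X, y, d=15, m=2)
print(predict(forest, [3.5, 0.5, 0.5]))
```

Setting m below the total number of variables decorrelates the trees, which is the property behind the robustness to over-fitting discussed below.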
The Gini index, named after the economist Corrado Gini, is a popular evaluation criterion for constructing decision trees (Breiman et al. 1984), where the Gini index is used to measure the impurity of each node for the best split. The criterion for the best split is determined to maximize the rate of decline in impurity at each node. The Gini index is an essential criterion for selecting the optimal splitting variable and the corresponding threshold value
at each node. Suppose M_n is the number of data points reaching node n and M_n^i is the number of data points belonging to class C_i. The Gini index, GI_n, of node n is

GI_n = 1 - \sum_{i=1}^{k} \left( p_n^i \right)^2, \quad \text{where } p_n^i = \frac{M_n^i}{M_n}.

A higher Gini index value for node n represents higher impurity. Hence, a decreasing Gini index is an important criterion for node splitting.
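A quick numerical check of the formula: a node holding 8 data points, 6 of one class and 2 of the other.

```python
# Worked check of GI_n = 1 - sum_i (p_n^i)^2 for a node with M_n = 8
# data points: 6 of one class and 2 of the other.
counts = [6, 2]
M_n = sum(counts)
GI_n = 1 - sum((M_i / M_n) ** 2 for M_i in counts)
print(GI_n)  # 0.375, versus 0.5 for a maximally impure 4-4 split
```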
While the random forest method is not state-of-the-art, such as deep learning techniques, we choose this method as it has several preferable features. First, random forests provide higher classification accuracy because they integrate a large number of decision trees. Second, random forests are robust to over-fitting because of the bootstrap sampling of data and random sampling of variables to build each decision tree; hence, the correlation between decision trees is low. As a result, the effect of overfitting is extremely small and the generalization ability is enhanced. Third, random forests can handle large datasets without needing too much calculation time because they enable researchers to train multiple trees efficiently in parallel. Moreover, unlike deep learning, the number of hyper-parameters is small and researchers do not need to puzzle over the hyper-parameter settings; researchers only need to choose the number of decision trees to build a model. Finally, random forests can be used to rank the importance of variables, which helps researchers identify the influential variables in the model. Therefore, researchers can manage the model efficiently and explain the contents of the model to stakeholders.

4. Experimental Results
In this section, we present the experimental results. The TSE Co-Location dataset may not be familiar to both practitioners and researchers; hence, we show the time series chart of the TSE Co-Location dataset and RV in Figure 2 and discuss its implications.
Figure 2. Time series of the TSE Co-Location dataset and RV. In this figure, RV denotes RV_t^{(d)}, while Colo_C, Colo_Y and Colo_B denote co-location ratios: the ratio of order quantity via the TSE Co-Location area to total order quantity (defined in Equation (5)), the ratio of execution quantity via the TSE Co-Location area to total execution quantity (defined in Equation (6)) and the ratio of the value traded via the TSE Co-Location area to total value traded (defined in Equation (7)), respectively.
As can be seen from this figure, in the early 2010s, each TSE Co-Location index
gradually increased because Arrowhead was introduced in 2010. This implies that the
number of HFTs gradually increased during the system transition period. That is, the
adjustment time to a new system varies from practitioner to practitioner. Thus, each index
continues to increase for a certain period. A few years after its introduction, the Arrowhead
system was revised in 2015. This revision allows trading participants to trade faster and
more frequently. As a result, Colo_C continued to rise from the middle of the 2010s to the
end of the 2010s. In contrast, Colo_B and Colo_Y have been flat since 2015.
As noted in the previous section, most HFTs in the Japanese stock market are foreign
investors. Thus, we are of the opinion that these three TSE Co-Location numbers refer
mainly to foreign investors. Since the early 2010s, the Abenomics policy has attracted foreign investors' interest in the Japanese stock market. It is known that foreign investors account for
approximately 70% of the Japanese stock market (HFT and regular trading).
Here, precision measures the number of correct positive predictions divided by all positive predictions, and sensitivity measures the number of correct positive predictions divided by all actually positive instances.
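Under the standard definitions assumed here (precision = TP/(TP+FP), sensitivity = TP/(TP+FN)), the F-measure reported later in Table 5 is their harmonic mean. A minimal sketch:

```python
def precision(tp, fp):
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Correct positive predictions over all actual positives (recall)."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and sensitivity."""
    p, s = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * s / (p + s)

# e.g. 60 correct "up" calls, 40 false "up" calls, 40 missed "up" days
print(round(f_measure(60, 40, 40), 2))  # 0.6
```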
Table 3 shows the prediction accuracy of the RV during the total observation period for
each model.
In the explanatory variable space of the HAR model (denoted as No. 1 in Table 3; below, we unify this notation rule for the others as well), the logistic method and our suggested
random forest method yield almost the same prediction accuracy. However, with an increase in explanatory variables, the difference in prediction accuracy between the logistic method and the random forest method becomes larger. For instance, there is a 22% difference in accuracy in the HAR + Volume + TSE Co-Location + Stock full-board model. These
results are consistent with those of previous studies (Ohlson 1980; Shirata 2003; Hastie
et al. 2008; Alpaydin 2014), which reported that linear models do not work well when there are many explanatory variables in the space.
Next, we shift to the differences between the models based on the random forest method. The
HAR model yielded a forecast accuracy of 60%. On the contrary, the HAR + Volume
model yielded a 4% higher accuracy than the HAR model. In addition to the HAR +
Volume model, the HAR + TSE Co-Location model and the HAR + stock full-board model
yielded 3% and 6% higher accuracy, respectively. Moreover, the HAR + Volume + TSE
Co-Location + Stock full-board model was 8% higher. From these results, it is evident that
each explanatory variable helps the HAR model to improve forecast accuracy. In particular,
the finding of an 8% increase in the prediction accuracy is a major contribution. However,
Table 3 implies that the features of these three variables may partially overlap: the simple sum of the individual effects of the three explanatory variables is 13%, whereas the combined HAR + Volume + TSE Co-Location + Stock full-board model improves accuracy by 8%, which is 5% lower than that sum. Future studies could work toward elucidating the aspects in which each variable has
overlapping features despite having different data generating processes.
[Figure 3 bar chart: variables ranked by Gini importance in descending order: RV_daily, RV_monthly, RV_weekly, Colo_B_monthly, market volume_daily, Colo_C_daily, Cum_Plus_daily, Colo_C_weekly, Colo_Y_monthly, Colo_C_monthly, Cum_Plus_fiveday, Cum_Minus_fiveday, Colo_B_weekly, Cum_Plus_monthly, Colo_B_daily, market volume_monthly, Colo_Y_daily, market volume_weekly, Cum_Minus_monthly, Cum_Minus_daily, Colo_Y_weekly.]

Figure 3. Important variables in the total observation period.
The top three important variables are RV’s autoregressive terms in different time
horizons. This result is natural because there is a clustering effect on volatility. The RV’s
past data are beneficial information for the forecast itself. Interestingly, five of the Top 10 important variables come from the TSE Co-Location dataset, accounting for over 70% of the Top 10 variables once RV's own autoregressive terms are excluded. We can state that the TSE Co-Location dataset is especially important
for improving the prediction accuracy of the HAR + Volume + TSE Co-Location + Stock
full-board model. Especially for Colo_C, all three time horizons rank in the Top 10. In contrast, for Colo_B and Colo_Y, only the monthly time horizon is ranked. The results indicate that long-term trends are more important than short-term trends for these two variables. Market
volume_daily and Cum_Plus_daily were also ranked in the Top 10 important variables.
Both variables are related to market liquidity. The difference between the two variables is whether they measure the amount actually traded or the amount that can be traded. As mentioned above,
liquidity is closely related to volatility. Higher liquidity stabilizes the market environment,
where prices are less likely to jump and volatility is more stable. This result is consistent
with a practical point of view. Contrarily, the results of the importance of Cum_minus are
rather surprising to practitioners. From a practical point of view, Cum_minus is known
as an important variable in looking at the future direction of the market and indicates
the strength of supply and demand between selling and buying. Our results suggest
that the price direction may not be directly related to the up and down forecast of RV
since the ranking of Cum_Minus_weekly, Cum_Minus_monthly and Cum_Minus_daily
is not very high; they are ranked 12, 19 and 20 out of 21 variables, respectively, in the
important variables.
From another perspective, we look at the important variables from the time horizon:
daily, weekly and monthly. Table 4 shows the rank of the periods in descending order by
the Gini index for each category. In most categories, the daily period was ranked at the
top. However, we cannot necessarily say that the shorter the period, the more important it is: in the comparison between weekly and monthly horizons, monthly data are more important than weekly data. From this, it is evident that very short periods and slightly longer periods play a more important role in the model. It depends on the category, but the tendency is
as noted above. One possible explanation is that the decay of the information aggregated over these horizons is nonlinear in RV forecasting in the Japanese stock market.
Detailed discussions require larger and more extensive analyses, such as comparing trends
among countries.
However, in the categories excluding RV, the importance of the TSE Co-Location dataset
increased overall and ranked higher from the first half period to the second half period. In
particular, the increase in Colo_C was remarkable. Colo_C occupies the second position in
the second half period, following RV. Colo_B ranks in the top three, but this category is
less important in the TSE Co-Location dataset than in the first half period. This suggests
that information on the order status of HFT among market participants is more valuable
than that of what is bought or sold. It is interesting to recognize this trend as an increase in
HFTs. In contrast, Colo_Y was not as important in both periods. In fact, remember the flow
of order → execution → trading volume: when an order is filled, that number is reflected in the trading volume. From this, we think that it is possible to interpret that Colo_Y is not an important variable because its meaning lies between order and trading volume.
Figure 4. Variable importance changes from the first half period to the second half period, sorted by first-half-period base.
RV_daily
RV_weekly
RV_monthly
Colo_C_daily
Colo_C_weekly
Colo_B_monthly
Cum_Minus_fiveday
Colo_Y_daily
Colo_B_weekly
Colo_Y_monthly
Cum_Plus_daily
Cum_Minus_daily
Cum_Plus_monthly
Cum_Plus_fiveday
Colo_C_monthly
market volume_daily
Colo_B_daily
market volume_weekly
Cum_Minus_monthly
Colo_Y_weekly
market volume_monthly
0 5 10 15 20 25 30 35 40
First half period Second half period
5. Importance
FigureFigure variable
5. Importance changes
variable fromfrom
changes the the
firstfirst
halfhalf
period to the
period second
to the half
second period
half periodsorted
sortedby
bysecond-half
second-half period
period base.
base.
Table 5. Prediction accuracy of RV in the first half period and second half period.

                    F-Measure
                    First Half Period    Second Half Period
Random Forest       0.56                 0.61
Logistic            0.54                 0.39
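For reference, the F-measure in Table 5 is the harmonic mean of precision and recall on a binary RV classification. A minimal sketch with synthetic labels (the predictions below are hypothetical and do not reproduce the paper's numbers):

```python
# Sketch: computing an F-measure like those in Table 5 for two classifiers.
# Labels and predictions are synthetic, for illustration only.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_rf   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # hypothetical random forest output
y_log  = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]  # hypothetical logistic output

print(round(f1_score(y_true, y_rf), 2))   # → 0.83
print(round(f1_score(y_true, y_log), 2))  # → 0.55
```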
Regarding the other variables, the importance of market volume decreased, while
Cum_Minus, which captures changes in supply and demand, became more important.
Although this differs from the result in the previous subsection, it goes without saying
that the results change depending on the analysis period.
These results indicate that our proposed datasets contribute to higher prediction
accuracy and have a high affinity with the random forest method. As we expected, the
random forest method outperforms the logistic model on average, consistent with the
model accuracy in the previous section. We consider this natural and in line with previous
research with similar frameworks (Tanaka et al. 2018a; Tanaka et al. 2019). As noted in
the previous section, linear models do not work well in a large explanatory-variable space;
indeed, Table 3 shows that the logistic method appears to overfit. Practitioners
and researchers should therefore select nonlinear methods, such as the random forest method. The
quality of the Japanese stock market changed with the introduction of Arrowhead in
2010, and the effect has remained strong. Moreover, we found that the TSE
Co-Location dataset plays important roles deeply related to RV. We expect datasets associated
with HFT, such as the TSE Co-Location dataset, to play an important part in model
building as trading systems continue to evolve. Adding new datasets to the HAR model
inevitably enlarges the explanatory-variable space; by combining our proposed datasets
with the random forest method, we present novel experimental results on RV forecasting
around two crucial events: the introduction and revision of Arrowhead. Overall, our
proposed model is superior to the HAR model and can be expected to yield high prediction
accuracy. If a dataset similar to the TSE Co-Location dataset is available, our framework
can be applied to other stock markets as well.
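The modeling contrast discussed above can be sketched as follows: a linear HAR-style regression on daily, weekly and monthly lagged RV versus a random forest that also receives additional features. The data are synthetic and the extra columns are illustrative stand-ins, not the actual TSE Co-Location variables:

```python
# Sketch: HAR-style linear regression vs. a random forest with extra features.
# All data are synthetic; the extra columns stand in for Colo_B / Colo_C / Colo_Y.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 600
rv = np.abs(rng.normal(size=n)) + 0.5   # synthetic stand-in for daily RV

def har_features(rv, t):
    """Daily, weekly and monthly averages of lagged RV (the HAR components)."""
    return [rv[t - 1], rv[t - 5:t].mean(), rv[t - 22:t].mean()]

ts = range(22, n - 1)
X_har = np.array([har_features(rv, t) for t in ts])
y = np.array([rv[t + 1] for t in ts])

extra = rng.normal(size=(len(y), 3))    # illustrative co-location-style features
X_ext = np.hstack([X_har, extra])

# Fit on the first 400 observations, evaluate on the held-out tail.
har = LinearRegression().fit(X_har[:400], y[:400])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_ext[:400], y[:400])
print(har.score(X_har[400:], y[400:]), rf.score(X_ext[400:], y[400:]))
```

With purely random extra columns the forest gains nothing, of course; the paper's point is that informative datasets such as the TSE Co-Location variables raise out-of-sample accuracy through this nonlinear route.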
Nevertheless, the forecasting problem may vary depending on conditions such as the
selection of sampling periods. In particular, when the quality of the market changes due
to new regulation, a model with high forecast accuracy in the past may need to be revised
in anticipation of the upcoming data. In such cases, practitioners will need to search for
similar events in history, grasp the strengths and weaknesses of the patterns the dataset
captured at that time and correct for them. A model yielding high prediction accuracy in
the training period can be completely useless in the test period. Dividing the sample period
appropriately, which is a universal problem, remains an issue in this study. In future work, we
would like to find a way to divide it automatically and conveniently.
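One conventional way to divide a sample period without leaking future information is expanding walk-forward splitting. A minimal sketch (the split sizes are illustrative, not the paper's actual choice):

```python
# Sketch: expanding walk-forward splits for time-series forecasting.
# Each test window strictly follows its training window in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # No future observation enters the training set.
    assert train_idx.max() < test_idx.min()
    print(len(train_idx), len(test_idx))
```

Automating the choice of regime boundaries (e.g., around events such as the Arrowhead introduction) would go beyond this fixed scheme.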
There are pros and cons to HFT not only in Japan but also globally. Linton and
Mahmoodzadeh (2018) indicate that fast algorithmic transactions can place unexpectedly large
orders due to program errors, and that algorithms behaving differently than the programmer
intended tend to cause chain reactions, increase market volatility and disrupt market order
(for instance, the May 2010 Flash Crash, the August 2012 erroneous orders by Knight Capital,
the October 2018 Tokyo Stock Exchange Arrowhead system trouble triggered by Merrill Lynch
and the September 2020 Tokyo Stock Exchange Arrowhead system trouble).
On the contrary, IOSCO reports that there is a close relationship between liquidity and
volatility, in the sense that greater liquidity can better absorb shocks to stock prices. HFT
firms engaged in official market-making may help mitigate intraday volatility by providing
liquidity. In fact, HFT is thought to account for more than 40% of the trading volume in
the Japanese equity market in 2019, as shown in Figure 2. From the perspective of market
liquidity and the pursuit of alpha by hedge funds through HFT, we consider that HFT will
play an increasingly important role in any case. Our proposed dataset contributes to
improving prediction accuracy across various types of experiments, and we have shown
experimentally that its degree of influence has increased in recent years. We believe this
trend will intensify in the future. Recently, the Financial Services Agency of Japan has been
strengthening its monitoring and legal systems, such as requiring registration for high-speed
trading. Along with this, the environment for HFT in the Japanese stock market continues to
improve, with further system enhancements by the Tokyo Stock Exchange, the development
of private exchanges and dark pools at securities firms and upgrades of securities firms'
systems to connect customers and securities firms more quickly.
J. Risk Financial Manag. 2021, 14, 215 16 of 18
In future work, we would like to examine this in more detail by decomposing the
effect of each variable on the improvement in RV forecasts. Furthermore, considering Baillie
et al. (2019), there may be room to extend our model to account for long-memory
processes. Chen et al. (2018) and Ma et al. (2019) provide helpful starting points for
extending our random forest-based model to integrate long short-term memory networks. In
addition, we would like to extend our framework to higher-order moments, an important
research area in finance. Hollstein and Prokopczuk (2018) showed that volatility of volatility
affects stock returns, and Amaya et al. (2015) argue that higher-order moments provide
beneficial information for asset-price modeling. Hence, we believe that verifying the
effectiveness of our approach for higher-order moments would be helpful for both
practitioners and researchers.
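The higher-order realized moments mentioned above can be computed from intraday returns in the same spirit as RV. A minimal sketch following the standard realized-moment definitions used by Amaya et al. (2015), with a synthetic return series:

```python
# Sketch: realized volatility and realized skewness from intraday returns.
# The 5-minute return series is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
r = rng.normal(scale=0.001, size=78)   # e.g., 5-minute log returns over one day

rv = np.sum(r ** 2)                     # realized variance
realized_vol = np.sqrt(rv)
# Realized skewness, scaled by the number of intraday observations
realized_skew = np.sqrt(len(r)) * np.sum(r ** 3) / rv ** 1.5

print(realized_vol, realized_skew)
```

Feeding such realized skewness (or kurtosis) series into the same random forest framework is one concrete way to carry out the extension proposed here.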
Author Contributions: Formal analysis, T.H.; investigation, T.H.; data curation, T.H.; writing–
original draft preparation, T.H.; writing–review and editing, T.H.; supervision, K.T., T.K. and S.H.;
funding acquisition, S.H. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by JSPS KAKENHI Grant Number (A) 17H00983.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data source of our study is Japan Exchange Group and Nikkei Needs.
Acknowledgments: We are grateful to Japan Exchange Group for providing the Tokyo Stock Co-
Location dataset. Also, we are grateful to the editor and three anonymous reviewers for their helpful
comments and suggestions.
Conflicts of Interest: The authors declare no conflict of interest. This paper gives the views of the
authors, and not necessarily the position of Nissay Asset Management.
References
Alpaydin, Ethem. 2014. Introduction to Machine Learning. London: The MIT Press.
Amaya, Diego, Peter Christoffersen, Kris Jacobs, and Aurelio Vasquez. 2015. Does realized skewness predict the cross-section of equity
returns? Journal of Financial Economics 118: 135–67. [CrossRef]
Andersen, Torben G., and Tim Bollerslev. 1998. Answering the skeptics: Yes, standard volatility models do provide accurate forecasts.
International Economic Review 39: 885–905. [CrossRef]
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. 2001. The distribution of realized exchange rate volatility.
Journal of the American Statistical Association 96: 42–55. [CrossRef]
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. 2003. Modeling and forecasting realized volatility. Econometrica
71: 579–625. [CrossRef]
Andersen, Torben G., Tim Bollerslev, and Francis X. Diebold. 2007. Roughing it up: Including jump components in the measurement,
modeling, and forecasting of return volatility. Review of Economics and Statistics 89: 701–20. [CrossRef]
Baillie, Richard T., Tim Bollerslev, and Hans O. Mikkelsen. 1996. Fractionally integrated generalized autoregressive conditional
heteroskedasticity. Journal of Econometrics 74: 3–30. [CrossRef]
Baillie, Richard T., Fabio Calonaci, Dooyeon Cho, and Seunghwa Rho. 2019. Long memory, realized volatility and heterogeneous
autoregressive models. Journal of Time Series Analysis 40: 609–28. [CrossRef]
Barndorff-Nielsen, Ole E., and Neil Shephard. 2002. Estimating quadratic variation using realized variance. Journal of Applied
Econometrics 17: 457–77. [CrossRef]
Bekaert, Geert, and Marie Hoerova. 2014. The VIX, the variance premium, and stock market volatility. Journal of Econometrics 183:
181–92. [CrossRef]
Benos, Evangelos, and Satchit Sagade. 2012. High-Frequency Trading Behaviour and Its Impact on Market Quality: Evidence from the UK
Equity Market. BoE Working Paper No. 469. London: Bank of England.
Bollerslev, Tim. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31: 307–27. [CrossRef]
Bollerslev, Tim, Andrew Patton, and Rogier Quaedvlieg. 2016. Exploiting the errors: A simple approach for improved volatility
forecasting. Journal of Econometrics 192: 1–18. [CrossRef]
Breiman, Leo. 2001. Random forests. Machine Learning 45: 5–32. [CrossRef]
Breiman, Leo, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. Monterey:
Wadsworth; London: Chapman & Hall.
Caivano, Valeria. 2015. The Impact of High-Frequency Trading on Volatility. Evidence from the Italian Market. CONSOB Working
Papers No. 80. Available online: https://ssrn.com/abstract=2573677 (accessed on 4 April 2021).
Chen, Zhen, Ningning He, Yu Huang, Wen Tao Qin, Xuhan Liu, and Lei Li. 2018. Integration of a deep learning classifier with a
random forest approach for predicting malonylation sites. Genomics, Proteomics & Bioinformatics 16: 451–59.
Corsi, Fulvio. 2009. A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics 7: 174–96.
[CrossRef]
Croft, W. Bruce, Donald Metzler, and Trevor Strohman. 2010. Search Engines: Information Retrieval in Practice. Boston: Addison-Wesley,
p. 310.
Ding, Zhuanxin, Clive W. J. Granger, and Robert F. Engle. 1993. A Long Memory Property of Stock Market Returns and a New Model.
Journal of Empirical Finance 1: 83–106. [CrossRef]
Engle, Robert F. 1982. Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation.
Econometrica 50: 987–1007. [CrossRef]
Glosten, Lawrence R., Ravi Jagannathan, and David E. Runkle. 1993. On the Relation between the Expected Value and the Volatility of
the Nominal Excess Return on Stocks. Journal of Finance 48: 1779–802. [CrossRef]
Haldane, Andy. 2011. The race to zero. Paper presented at International Economic Association Sixteenth World Congress, Beijing,
China, July 4–8.
Hansen, Peter R., and Asger Lunde. 2005. A realized variance for the whole day based on intermittent high-frequency data. Journal of
Financial Econometrics 3: 525–54. [CrossRef]
Harvey, Andrew C. 1998. Long Memory in Stochastic Volatility. In Forecasting Volatility in Financial Markets. Edited by Stephen Satchell
and John Knight. Amsterdam: Elsevier.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2008. The Elements of Statistical Learning. New York: Springer.
Hollstein, Fabian, and Marcel Prokopczuk. 2018. How aggregate volatility-of-volatility affects stock returns. Review of Asset Pricing
Studies 8: 253–92. [CrossRef]
Iwaisako, Tokuo. 2017. Nihon ni okeru kohindo torihiki no genjo ni tsuite (Current Status of High-Frequency Trading in Japan). Japan
Securities Dealers Association. Available online: https://www.jsda.or.jp/about/iwaisakoronbun.pdf (accessed on 17 August
2020). (In Japanese).
Japan Exchange Group Connectivity Services. 2021. Available online: https://www.jpx.co.jp/english/systems/connectivity/index.
html (accessed on 24 February 2021).
Kirilenko, Andrei, Albert S. Kyle, Mehrdad Samadi, and Tugkan Tuzun. 2017. The flash crash: High-frequency trading in an electronic
market. The Journal of Finance 72: 967–98. [CrossRef]
Linton, Oliver, and Soheil Mahmoodzadeh. 2018. Implication of high-frequency trading for security markets. Annual Review of
Economics 10: 237–59. [CrossRef]
Luong, Chuong, and Nikolai Dokuchaev. 2018. Forecasting of realised volatility with the random forests algorithm. Journal of Risk and
Financial Management 11: 61. [CrossRef]
Ma, Yillin, Ruizhu Han, and Xiaoling Fu. 2019. Stock prediction based on random forest and LSTM neural network. Paper presented at
19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Korea, October 15–18; pp. 126–30.
Malceniece, Laura, Kārlis Malcenieks, and Tālis J. Putniņš. 2019. High frequency trading and comovement in financial markets. Journal
of Financial Economics 134: 381–99. [CrossRef]
Motegi, Kaiji, Xiaojing Cai, Shigeyuki Hamori, and Heifeng Xu. 2020. Moving average threshold heterogeneous autoregressive
(MAT-HAR) models. Journal of Forecasting 39: 1035–42. [CrossRef]
Müller, Ulrich A., Michel M. Dacorogna, Rakhal D. Davé, Richard B. Olsen, Olivier V. Pictet, and Jacob E. von Weizsäcker. 1997.
Volatilities of different time resolutions: Analyzing the dynamics of market components. Journal of Empirical Finance 4: 213–39.
[CrossRef]
Myers, Benjamin, and Austin Gerig. 2015. Simulating the synchronizing behavior of high-frequency trading in multiple markets. In
Financial Econometrics and Empirical Market Microstructure. Cham: Springer, pp. 207–13.
Nelson, Daniel B. 1991. Conditional heteroskedasticity in asset returns: A new approach. Econometrica 59: 347–70. [CrossRef]
NIKKEI Media Marketing NEEDS Tick Data. n.d. Available online: https://www.nikkeimm.co.jp/service/detail/id=.317 (accessed on
24 February 2021).
Ohlson, James A. 1980. Financial ratios and the probabilistic prediction on bankruptcy. Journal of Accounting Research 18: 109–31.
[CrossRef]
Patterson, Josh, and Adam Gibson. 2017. Deep Learning: A Practitioner’s Approach, 1st ed. Newton: O’Reilly Media, Inc., p. 39.
Poon, Ser-Huang, and Clive W. J. Granger. 2003. Forecasting Volatility in Financial Markets: A Review. Journal of Economic Literature 41:
478–539. [CrossRef]
Qiu, Yue, Xinyu Zhang, Tian Xie, and Shangwei Zhao. 2019. Versatile HAR model for realized volatility: A least-square model
averaging perspective. Journal of Management Science and Engineering 4: 55–73. [CrossRef]
Shirata, Yoshiko C. 2003. Predictors of Bankruptcy after Bubble Economy in Japan: What Can You Learn from Japan Case? Paper
presented at 15th Asian-Pacific Conference on International Accounting Issues, Thailand, November 1.
Tanaka, Katsuyuki, Takuji Kinkyo, and Shigeyuki Hamori. 2016. Random forests-based early warning system for bank failures.
Economics Letters 148: 118–21. [CrossRef]
Tanaka, Katsuyuki, Takuo Higashide, Takuji Kinkyo, and Shigeyuki Hamori. 2018a. Forecasting the vulnerability of industrial
economic activities: Predicting the bankruptcy of companies. Journal of Management Information and Decision Sciences 20: 1–24.
Tanaka, Katsuyuki, Takuji Kinkyo, and Shigeyuki Hamori. 2018b. Financial hazard map: Financial vulnerability predicted by a random
forests classification model. Sustainability 10: 1530. [CrossRef]
Tanaka, Katsuyuki, Takuo Higashide, Takuji Kinkyo, and Shigeyuki Hamori. 2019. Analyzing industry-level vulnerability by predicting
financial bankruptcy. Economic Inquiry 57: 2017–34. [CrossRef]
Taylor, Stephen J. 1982. Financial returns modeled by the products of two stochastic processes, a study of daily sugar prices 1961–1979.
In Time Series Analysis: Theory and Practice 1. Edited by Oliver Duncan Anderson. Amsterdam: North-Holland, pp. 203–26.
The Japanese Government Financial Services Agency. 2018. Available online: https://www.fsa.go.jp/en./regulated/hst/index.html
(accessed on 1 April 2021).
Toriumi, Fujio, Hirokazu Nishioka, Toshimitsu Umeoka, and Kenichiro Ishii. 2012. Analysis of the market difference using the stock
board. The Japanese Society for Artificial Intelligence 27: 143–50. (In Japanese). [CrossRef]
Ubukata, Masato, and Toshiaki Watanabe. 2014. Market variance risk premiums in Japan for asset predictability. Empirical Economics
47: 169–98. [CrossRef]
Watanabe, Toshiaki. 2020. Heterogeneous Autoregressive Models: Survey with the Application to the Realized Volatility of Nikkei 225
Stock Index. Hiroshima University of Economics, Keizai Kenkyu 42: 5–18. (In Japanese).
Zhang, Frank. 2010. High-Frequency Trading, Stock Volatility, and Price Discovery. Social Science Research Network. Available online:
http://ssrn.com/abstract=1691679 (accessed on 4 April 2021). [CrossRef]