Financial Forecasting and Gradient Descent
Abstract
Multilayer neural networks have been successfully applied to time series forecasting. Steepest descent, a
popular learning algorithm for backpropagation networks, converges slowly and makes it difficult to
determine the network parameters. In this paper, a conjugate gradient learning algorithm with a restart
procedure is introduced to overcome these problems. In addition, the commonly used random weight initialization
does not guarantee a set of initial connection weights close to the optimal weights, which leads to slow
convergence. Multiple linear regression (MLR) provides a better alternative for weight initialization.
Daily trading data of listed companies from the Shanghai Stock Exchange are collected for technical analysis
by means of neural networks. Two learning algorithms and two weight initializations are compared.
The results show that neural networks can model the time series satisfactorily, regardless of which learning
algorithm and weight initialization are adopted. However, the proposed conjugate gradient with MLR weight
initialization requires a lower computation cost and learns better than steepest descent with random
initialization.
Keywords: time series forecasting, technical analysis, learning algorithm, conjugate gradient, multiple linear regression weight
initialization, backpropagation neural network
To compute the gradient in steps 2 and 7b, the objective function is first defined. The aim is to minimize the
network error, which depends on the connection weights. The objective function is defined by the error function:

f(w) = \frac{1}{2N} \sum_{n} \sum_{j} \big( t_{nj} - y_{nj}(w) \big)^2    (11)

where N is the number of patterns in the training set; w is a one-dimensional weight vector in which the
weights are ordered by layer and then by neuron; t_{nj} and y_{nj}(w) are the desired and actual outputs of the
j-th output neuron for the n-th pattern, respectively.

With the arguments in [34], the gradient component for the weight connecting the i-th neuron to the j-th neuron is

g(w) = \frac{1}{N} \sum_{n} \delta_{nj} \, y_{ni}(w)    (12)

For output nodes,

\delta_{nj} = -\big( t_{nj} - y_{nj}(w) \big) \, s'_j(net_{nj})    (13)

where s'_j(net_{nj}) is the derivative of the activation function evaluated at net_{nj}, the input of the j-th neuron.
For hidden nodes,

\delta_{nj} = s'_j(net_{nj}) \sum_{k} \delta_{nk} \, w_{jk}    (14)

where w_{jk} is the weight from the j-th to the k-th neuron.

... where f is a transfer function. The output value of the output node can be calculated as

y_s = f\Big( \sum_{j} v_j R_{sj} \Big)    (16)

where v_j is the weight between the hidden layer and the output layer.

Assume the sigmoid function f(x) = \frac{1}{1 + e^{-x}} is used as the transfer function of the network. By Taylor's
expansion,

f(x) \cong \frac{1}{2} + \frac{x}{4}    (17)

Applying the linear approximation in (17) to (16), we have the following approximated linear relationship between the
output y_s and the v_j's:

y_s = \frac{1}{2} + \frac{1}{4} \sum_{j=1}^{m} v_j R_{sj}    (18)

or

4 y_s - 2 = v_1 R_{s1} + v_2 R_{s2} + \dots + v_m R_{sm}, \quad s = 1, 2, \dots, N    (19)

where m is the number of hidden nodes and N is the total number of training samples.

The set of equations in (19) is a typical multiple linear regression model. The R_{sj}'s are considered as the
regressors, and the v_j's can be estimated by a standard regression method.
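To make this concrete, the following Python sketch (not from the paper; it uses numpy, hypothetical function names, and assumes a single output node with the desired outputs t_s substituted for y_s when solving (19)) estimates the hidden-to-output weights v_j by ordinary least squares, and also shows the corresponding delta-rule gradient of (12)-(13):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_gradient(R, t, v):
    """Gradient of the error (11) with respect to the hidden-to-output
    weights v for a single sigmoid output node, following (12)-(13).
    R: (N, m) hidden-node outputs R_sj; t: (N,) desired outputs."""
    N = R.shape[0]
    y = sigmoid(R @ v)                  # actual outputs y_s, as in (16)
    delta = -(t - y) * y * (1.0 - y)    # (13): s'(net) = y(1 - y) for the sigmoid
    return (R.T @ delta) / N            # (12): average of delta_s * R_sj over the N patterns

def mlr_init_output_weights(R, t):
    """MLR weight initialization: least-squares solution of (19),
    4*y_s - 2 = sum_j v_j R_sj, with the desired outputs t_s in place of y_s."""
    target = 4.0 * t - 2.0
    v, *_ = np.linalg.lstsq(R, target, rcond=None)
    return v

# Toy usage with random data (shapes only, not the paper's data set):
rng = np.random.default_rng(0)
R = sigmoid(rng.standard_normal((200, 5)))   # 200 patterns, 5 hidden nodes
t = rng.uniform(0.2, 0.8, size=200)          # desired outputs scaled into (0, 1)
v0 = mlr_init_output_weights(R, t)           # regression-based starting point
g0 = output_gradient(R, t, v0)               # gradient at that starting point
```

Solving (19) in this way costs a single least-squares fit, so the regression-based starting point adds little overhead compared with the subsequent iterative training.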
Figure 1: A sample result from the neural network. (a) Predicted ∆EMA(t) vs actual ∆EMA(t); (b) predicted stock price vs actual stock price.
In figure 1a, although the predicted ∆EMA(t) and the actual ∆EMA(t) deviate considerably in some regions, the network can still model the actual EMA reasonably well. On the other hand, after the transformation of ∆EMA(t) to exact price values, the deviation between the actual price and the predicted price is small; the two curves in figure 1b nearly coincide. This indicates that the selection of the network forecaster was appropriate.

The performance of the scenarios mentioned above is evaluated by the average number of iterations required for training, the average MSE in the testing phase, and the percentage of correct direction predictions in the testing phase. The results are summarized in Table 1.

Scenario     Average number of iterations   Average MSE   % of correct direction prediction
CG / RI            56.636                     0.001753         73.055
CG / MLRI          30.273                     0.001768         73.545
SD / RI           497.818                     0.001797         72.564
SD / MLRI         729.367                     0.002580         69.303

Table 1: Performance evaluation for the four scenarios (CG: conjugate gradient; SD: steepest descent; RI: random initialization; MLRI: MLR weight initialization)
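For reference, the two testing-phase measures reported in Table 1 could be computed along the following lines. This is an illustrative sketch only, not the authors' evaluation code; it assumes that a direction prediction is counted as correct when the predicted and actual changes have the same sign:

```python
import numpy as np

def test_metrics(actual, predicted):
    """Average MSE and percentage of correct direction predictions on a test series."""
    mse = np.mean((actual - predicted) ** 2)
    same_direction = np.sign(np.diff(predicted)) == np.sign(np.diff(actual))
    return mse, 100.0 * np.mean(same_direction)
```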
All scenarios, except for steepest descent with MLR initialization, achieve a similar average MSE and percentage of correct direction predictions. All scenarios perform satisfactorily: the mean square error produced is on average below 0.258%, and more than 69% of the direction predictions are correct.

Conjugate gradient learning on average requires significantly fewer iterations than steepest descent learning. Owing to the complexity of its line search, conjugate gradient requires a longer computation time per iteration than steepest descent. However, the overall convergence of the conjugate gradient network is still faster than that of the steepest descent network.

5. Conclusion & Discussion

The experimental results show that it is possible to model stock prices based on historical trading data using a three-layer neural network. In general, both the steepest descent network and the conjugate gradient network produce the same level of error and reach the same level of direction prediction accuracy.

The conjugate gradient approach has advantages over the steepest descent approach. It does not require empirical determination of the network parameters. As opposed to the zigzag motion of the steepest descent approach, its orthogonal search prevents a good point from being spoiled. Theoretically, the convergence of the second-order conjugate gradient method is faster than that of the first-order steepest descent approach. This is verified in our experiment.

With regard to the initial starting point, the experimental results show that the good starting point generated by multiple linear regression weight initialization is spoiled by the subsequent search directions in the steepest descent network. On the contrary, regression initialization provides a good starting point, improving the convergence of conjugate gradient learning.

To sum up, the efficiency of backpropagation can be improved by conjugate gradient learning with multiple linear regression weight initialization.

It is believed that the computation time of conjugate gradient can be reduced by Shanno's approach [7]. The initialization scheme may be improved by also estimating the weights between the input nodes and the hidden nodes, instead of initializing them randomly. Enriching the inputs with more relevant data, such as fundamental data and data from derivative markets, may improve the predictability of the network. Finally, more sophisticated network architectures can be attempted for price prediction.
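As an illustration of the learning procedure discussed above, a generic nonlinear conjugate gradient loop with periodic restart might look as follows. This is a sketch under stated assumptions, not the paper's exact procedure: it uses the Fletcher-Reeves update, a simple backtracking line search in place of the paper's line search, and a fixed restart period n_restart; the function names and parameters are hypothetical.

```python
import numpy as np

def conjugate_gradient_restart(f, grad, w0, n_restart, max_iter=500, tol=1e-6):
    """Minimize f(w) given its gradient, using nonlinear conjugate gradient
    (Fletcher-Reeves) with a periodic restart to the steepest descent direction."""
    w = w0.copy()
    g = grad(w)
    d = -g                                   # initial direction: steepest descent
    for k in range(max_iter):
        step, fw = 1.0, f(w)                 # backtracking (Armijo) line search along d
        while f(w + step * d) > fw + 1e-4 * step * (g @ d):
            step *= 0.5
            if step < 1e-12:
                break
        w = w + step * d
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        if (k + 1) % n_restart == 0:         # restart: drop back to steepest descent
            d = -g_new
        else:
            beta = (g_new @ g_new) / (g @ g) # Fletcher-Reeves coefficient
            d = -g_new + beta * d
        g = g_new
    return w

# Example: minimize a quadratic as a stand-in for the network error f(w)
A = np.diag([1.0, 10.0, 100.0])
f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
w_star = conjugate_gradient_restart(f, grad, np.ones(3), n_restart=3)
```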