Data Traffic Prediction Using Fuzzy Methods: Anthony Chiaratti Pedro Dias de Oliveira Carvalho


Data traffic prediction using Fuzzy methods

Anthony Chiaratti (1), Pedro Dias de Oliveira Carvalho (2)

Universidade Federal de Minas Gerais.
Av. Pres. Antônio Carlos, 6627 - Pampulha, Belo Horizonte - MG, 31270-901.

(1) Email: achiaratti@gmail.com
(2) Email: pedrodoc@gmail.com

Abstract: Data traffic in high-speed networks currently carries all kinds of
information, from personal internet usage to telephone and TV services. These
networks can be modeled essentially as the output data of each router
feeding a computer, a server or even another router, forming a complex
chain. This paper is concerned with predicting TCP (Transmission Control
Protocol) connection throughput in a wide area network. Data traffic has some
special characteristics, detailed in this paper, such as high peaks and
seasonality. Using fuzzy methods to predict the data flow through a link could be
simpler and cheaper than developing the statistical models that are currently used.

Keywords: Fuzzy time series, data traffic prediction.

1. Introduction
Managing the flow on data links is essential to avoid congestion and the loss
of valuable information during periods of high traffic. This problem dates back
to telephone links, which used to connect each subscriber directly to another.
As the network expanded, it became impractical to dedicate a channel to each
user, and the natural solution was to multiplex the signal and share the medium.

The Erlang became an internationally accepted unit for predicting
probabilistically whether a link at full capacity will deny the next access, or
whether a call will be accepted when the link has room for more traffic [1].
As widespread as Erlang's work may be, it is highly concentrated on pure
telephone usage and its characteristics, which can differ widely from those of
TCP connections. The Erlang distribution is a special case of the gamma
distribution [2].

On the other hand, the use of FTS (Fuzzy Time Series) for data traffic
prediction is expected to cope with the inherent diversity of the information.
There is also a very large number of publications on improvements to FTS
methods, and each method imposes its own restrictions on the data. This paper
provides an overview comparison of some FTS methods applied to traffic
throughput. Large changes in value between samples, seasonality and
information diversity make data traffic prediction a great challenge.

FTS is currently applied successfully to a variety of forecasting problems,
some as complex as the stock market, and is also capable of producing good
results even for nonlinear problems.

2. Data set analysis


As previously discussed, network traffic includes several types of data,
presents high peaks during a few hours and low traffic most of the time, and
additionally shows seasonal characteristics.

Seasonality is highly expected due to the nature of human resource usage:
demand for telephone, internet traffic, server access, etc. is high during rush
hours, which produces a correlation with a 24-hour period. It is also expected
that each day of the week will be somewhat different. A clear example is that
companies reduce their internet and telephone usage during weekends,
causing weekly seasonality as well.

We extracted hourly link traffic from a publicly distributed data set [3].
The series is shown in Figure 1 and its ACF (AutoCorrelation Function) in
Figure 2.

Figure 1 – Time-Series
Figure 2 – ACF

The first and highest correlation is with the previous hour; in other words,
whether traffic grows or falls depends on the last sample. As discussed
previously, a daily season can be noticed at each 24-hour period. The
correlation of each period decreases as it gets farther from the most recent
data. After reaching its minimum, however, the correlation starts to increase
until a new peak 169 hours later, very close to the expected weekly season of
24 × 7 = 168 hours.

Table 1 – Terms of seasonality

Parameter   ACF value   Description
Yt-1        0.756       One hour before
Yt-2        0.435       Two hours before
Yt-24       0.344       One day before
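
The ACF values discussed above can be reproduced in principle with a few lines of code. The following is a minimal sketch of the sample autocorrelation, applied to a synthetic series with a pure 24-hour cycle; the function name `acf` and the synthetic data are assumptions for illustration and are not the LBL traffic data.

```python
import math

def acf(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    # Sum of squared deviations (lag-0 term used for normalization).
    var = sum((y - mean) ** 2 for y in series)
    # Sum of products of deviations `lag` samples apart.
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

# Synthetic hourly series with a daily cycle (illustrative only).
hours = range(24 * 28)
traffic = [10.0 + 5.0 * math.sin(2 * math.pi * h / 24) for h in hours]

for lag in (1, 2, 24, 168):
    print(f"lag {lag:3d}: ACF = {acf(traffic, lag):.3f}")
```

On real traffic the peaks at multiples of 24 hours would be weaker than in this idealized series, as Table 1 indicates.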

To remove the seasonality, the data were differenced. Evaluation of the data
shows that this transformation effectively removed the season, as expected and
as demonstrated in Figure 3.
Figure 3 – Differential ACF
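
The text does not state which lag was used for the differencing. As a sketch only, the transformation can be written for an arbitrary lag; the helper names `difference` and `restore` are hypothetical, and the lag of 24 in the usage example is an assumption chosen to match the daily season.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def restore(reference_value, predicted_difference):
    """Undo the differencing for one forecast: add the predicted
    difference back to the value observed `lag` steps earlier."""
    return reference_value + predicted_difference

# Assumed example: a perfectly periodic series with period 24.
daily_pattern = list(range(24))
series = daily_pattern * 3
seasonal_diff = difference(series, lag=24)
print(set(seasonal_diff))
```

A forecast made on the differenced series has to be passed through `restore` before it can be compared with the actual traffic.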

3. Methodology
FTS methods are based on the same logic, so, for elucidation only, the
algorithm of Chen (1996) is presented. For the other algorithms, their
differences from Chen's method are discussed.

Step 1 is defining the universe of discourse U from the historical data, which
shall include the minimum value Dmin and the maximum value Dmax. U is defined
as U = [Dmin – D1 ; Dmax + D2], where D1 and D2 are positive numbers.

Step 2 is splitting U into intervals. How to split the universe of discourse
is an open issue in the FTS area; some proposed methods are Entropy [4] and
K-Cluster [5]. This paper uses the effective length proposed by Huarng (2001).
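
A minimal sketch of an average-based length in the spirit of Huarng (2001) follows. It takes half of the mean absolute first difference and rounds it to a base. The rule used here for choosing the base (the power of ten just below the half-mean, so that values in 0.1 to 1 use base 0.1, values in 1 to 10 use base 1, and so on) is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def average_based_length(series):
    """Half of the mean absolute first difference, rounded to a base."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    half = (sum(diffs) / len(diffs)) / 2
    # Assumed base rule: power of ten just below `half`.
    base = 10 ** (math.ceil(math.log10(half)) - 1)
    # Round half up to the nearest multiple of the base.
    return math.floor(half / base + 0.5) * base

# Toy series: differences 500, 500, 600 -> mean 533.3 -> half 266.7.
print(average_based_length([0, 500, 1000, 400]))
```

The function requires a series with at least two samples and at least one nonzero difference.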

Step 3 is to define the fuzzy sets on U. In this step there is no restriction
on how many linguistic variables can be fuzzy sets [6]. This step also defines
the membership function of each fuzzy set. The most used is triangular,
although any logical proposal is valid; this choice should draw on previous
experience and the best available knowledge.
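
As an illustration of a triangular membership function, the sketch below assumes that each fuzzy set peaks at the midpoint of its interval and falls to zero at the midpoints of the neighboring intervals. The text does not specify the exact shape used, so the parameters are assumptions.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at `left` and `right`, 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Assumed set with peak 0.6 and feet at 0.2 and 1.0 (all in Mbytes).
for value in (0.2, 0.4, 0.6, 0.9):
    print(value, round(triangular(value, 0.2, 0.6, 1.0), 3))
```

With this shape, the set with the highest membership for a value is always the one whose interval contains it, so fuzzification reduces to finding the interval.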

Step 4 is to fuzzify the historical data, which means assigning the data
values to fuzzy sets.

Step 5 establishes the FLRs (Fuzzy Logical Relationships). These FLRs define
the relation between the fuzzy sets at consecutive times. In Chen's method,
repeated relationships are eliminated.

Step 6 calculates the forecasted value following simple rules. If the FLR is
one-to-one, the forecasted value is the midpoint of the following interval. If
the FLR is one-to-many, the forecasted value is the mean of the midpoints of
the following intervals.
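
Steps 4 to 6 can be sketched as follows for equal-length intervals. This is an illustrative first-order implementation, not the authors' code: the function names and the toy data are hypothetical, and the fallback to the state's own midpoint when no rule exists is an assumption not stated in the text.

```python
def fuzzify(value, lower, length, n_sets):
    """Step 4: 0-based index of the interval (fuzzy set) holding `value`."""
    index = int((value - lower) // length)
    return min(max(index, 0), n_sets - 1)

def chen_rules(states):
    """Step 5: map each state to the set of distinct states that
    followed it (repeated relationships are discarded)."""
    rules = {}
    for current, following in zip(states, states[1:]):
        rules.setdefault(current, set()).add(following)
    return rules

def chen_forecast(state, rules, midpoints):
    """Step 6: mean of the midpoints of the following intervals."""
    followers = rules.get(state)
    if not followers:
        # Assumed fallback: no rule observed for this state.
        return midpoints[state]
    return sum(midpoints[j] for j in followers) / len(followers)

# Toy example: five intervals of length 4 over [0, 20].
midpoints = [2, 6, 10, 14, 18]
history = [1, 5, 9, 5, 13, 5]
states = [fuzzify(v, 0, 4, 5) for v in history]
rules = chen_rules(states)
print(chen_forecast(states[-1], rules, midpoints))
```

In the toy data the last state was followed once by the third interval and once by the fourth, so the forecast is the mean of their midpoints.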

The other method adopted for comparison was first proposed by Yu to improve
the forecast by addressing two main simplifications in Chen's method: the
elimination of recurrence, which discards information that should be kept, and
the equal weight given to data regardless of their distance in time. To
address these aspects, her algorithm proposes the following steps.

Steps 1 through 4 are the same as in Chen's algorithm. In step 5,
redundancies are kept and their order of appearance is preserved.

Step 6 defines weights for each observed FLRG. The classical method uses
linear weights from 1 to N (the number of relationships), normalized by the
sum of the weights. A further comparison is made using another weighting
logic, the exponential weights first proposed by Lee and Javedani (2011).

Step 7 forecasts the value by multiplying the midpoint vector by the
transposed weight vector.
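
The weighted steps can be sketched as below. The linear weights follow the description above; the exponential form (weights 1, c, c^2, ... normalized by their sum) is an assumption about the scheme attributed to Lee and Javedani, and all function names and toy data are hypothetical.

```python
def weighted_rules(states):
    """Step 5: map each state to the list of states that followed it,
    keeping repeats in order of appearance."""
    rules = {}
    for current, following in zip(states, states[1:]):
        rules.setdefault(current, []).append(following)
    return rules

def equal_weights(n):
    return [1 / n] * n

def linear_weights(n):
    """Weights 1..n normalized by their sum; the most recent is largest."""
    total = n * (n + 1) / 2
    return [k / total for k in range(1, n + 1)]

def exponential_weights(n, c=2.0):
    """Assumed form: c**0 .. c**(n-1), normalized by their sum."""
    raw = [c ** k for k in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_forecast(state, rules, midpoints, weight_fn=linear_weights):
    """Step 7: dot product of the midpoint vector and the weight vector."""
    followers = rules.get(state)
    if not followers:
        return midpoints[state]  # assumed fallback when no rule exists
    weights = weight_fn(len(followers))
    return sum(w * midpoints[j] for w, j in zip(weights, followers))

midpoints = [2, 6, 10, 14, 18]
states = [0, 1, 2, 1, 3, 1, 2]
rules = weighted_rules(states)
for fn in (equal_weights, linear_weights, exponential_weights):
    print(fn.__name__, round(weighted_forecast(1, rules, midpoints, fn), 3))
```

Passing `equal_weights` keeps the repeated relationships but weighs them uniformly, which corresponds to the "Yu with equal weight" variant compared in the results.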

For comparison with FTS, an Auto Regressive (AR) method is evaluated using
terms in accordance with Table 1. It is important to note that this results in
a third-order AR, whereas the FTS methods are used at first order. Merging AR
with FTS using Yu's algorithm presented the best results.

4. Evaluation
Motivated by the reasons presented in the abstract and introduction, the
classic methods are evaluated first on their own and then compared with the
naïve method. It is important to observe that, given our motivation, the most
important forecast is not the one step ahead but the free-running forecast. We
chose a horizon of 24 hours, so that network engineers can be alerted when
congestion is predicted and can allocate resources preventively.

Seasonality is also discussed and evaluated, as it is expected to improve the
results.

Hourly traffic from a wide area network was extracted from the database and
divided into two sets. The first was used for training the algorithms, running
the methods presented in the previous section and their variants for
seasonality.

The second set was used for validation, running one step ahead and a 24-hour
free run.

For all algorithms, the universe of discourse was defined with the parameters
in Table 2, with all terms in bytes.

Table 2 - Data definitions

Dmin      D1        Dmax         D2        Universe
185.799   185.799   16.704.916   295.084   [0, 17.000.000]
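
As a quick check of Table 2, the universe follows from U = [Dmin – D1 ; Dmax + D2]. The sketch below reads the table values as integers in bytes, taking the dots as thousands separators, which is an assumption about the notation; the function name is hypothetical.

```python
def universe_of_discourse(d_min, d_max, d1, d2):
    """Return the bounds (Dmin - D1, Dmax + D2) of the universe U."""
    return (d_min - d1, d_max + d2)

# Table 2 values in bytes (dots read as thousands separators).
lower, upper = universe_of_discourse(
    d_min=185_799, d_max=16_704_916, d1=185_799, d2=295_084)
print(lower, upper)
```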

For simplicity, this universe was divided into equal intervals using the
effective length of Huarng (2001). His paper proposes using half of the mean
of the absolute differences, rounded to a defined base. Table 3 shows the
intervals obtained from the training database, in Megabytes:

Table 3 – Intervals

Effective Length 0.4Mbytes Number of intervals 43


Begin End Midpoint Fuzzy set Begin End Midpoint Fuzzy set
0 0.4 0.2 A1 8.8 9.2 9.0 A23
0.4 0.8 0.6 A2 9.2 9.6 9.4 A24
0.8 1.2 1.0 A3 9.6 10.0 9.8 A25
1.2 1.6 1.4 A4 10.0 10.4 10.2 A26
1.6 2.0 1.8 A5 10.4 10.8 10.6 A27
2.0 2.4 2.2 A6 10.8 11.2 11.0 A28
2.4 2.8 2.6 A7 11.2 11.6 11.4 A29
2.8 3.2 3.0 A8 11.6 12.0 11.8 A30
3.2 3.6 3.4 A9 12.0 12.4 12.2 A31
3.6 4.0 3.8 A10 12.4 12.8 12.6 A32
4.0 4.4 4.2 A11 12.8 13.2 13.0 A33
4.4 4.8 4.6 A12 13.2 13.6 13.4 A34
4.8 5.2 5.0 A13 13.6 14.0 13.8 A35
5.2 5.6 5.4 A14 14.0 14.4 14.2 A36
5.6 6.0 5.8 A15 14.4 14.8 14.6 A37
6.0 6.4 6.2 A16 14.8 15.2 15.0 A38
6.4 6.8 6.6 A17 15.2 15.6 15.4 A39
6.8 7.2 7.0 A18 15.6 16.0 15.8 A40
7.2 7.6 7.4 A19 16.0 16.4 16.2 A41
7.6 8.0 7.8 A20 16.4 16.8 16.6 A42
8.0 8.4 8.2 A21 16.8 17.2 17.0 A43
8.4 8.8 8.6 A22 -- -- -- --
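
The rows of Table 3 can be generated from the effective length and the number of intervals. This sketch works in integer bytes to avoid floating-point drift, assuming 1 Mbyte = 1,000,000 bytes; the function name is hypothetical.

```python
def build_intervals(lower, length, count):
    """Return (begin, end, midpoint) for `count` equal intervals."""
    return [(lower + i * length,
             lower + (i + 1) * length,
             lower + i * length + length // 2)
            for i in range(count)]

# 43 intervals of 0.4 Mbytes, expressed in bytes.
intervals = build_intervals(lower=0, length=400_000, count=43)
for label in (1, 23, 43):
    begin, end, mid = intervals[label - 1]
    print(f"A{label}: {begin / 1e6:.1f} {end / 1e6:.1f} {mid / 1e6:.1f}")
```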

To compare the methods, several metrics are proposed, e.g., MAPE (Mean
Absolute Percentage Error), MAE (Mean Absolute Error) and MASE (Mean
Absolute Scaled Error).

MASE was proposed by Hyndman (2005), is the most widely applicable [7] and
can be used for comparison across a variety of time series. Given the lack of
studies comparing methods on data traffic, MASE is presented for the
evaluation of results.

When seasonality is disregarded:

$$MASE = \frac{\sum_{t=1}^{T} |e_t|}{\frac{T}{T-1} \sum_{t=2}^{T} |Y_t - Y_{t-1}|}$$

where $T$ is the total number of samples and $|e_t|$ is the absolute error,
defined as the difference between the actual value $Y_t$ and the forecasted
value $\hat{Y}_t$. The denominator is $T$ times the mean absolute error of the
one-step-ahead naïve forecast, so MASE is the mean absolute error of the
method scaled by the naïve error.

With seasonality of period $m$, the formula becomes:

$$MASE = \frac{\sum_{t=1}^{T} |e_t|}{\frac{T}{T-m} \sum_{t=m+1}^{T} |Y_t - Y_{t-m}|}$$
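
Both MASE formulas can be covered by one function, since the seasonal form reduces to the non-seasonal one for m = 1. The following is a minimal sketch of the formulas as written in this section, with the naïve error computed on the same series as the forecast error; the function name is hypothetical.

```python
def mase(actual, forecast, m=1):
    """Mean Absolute Scaled Error with seasonal period `m` (m=1: none)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    t = len(actual)
    # Numerator: sum of absolute forecast errors.
    abs_errors = sum(abs(y - f) for y, f in zip(actual, forecast))
    # Sum of absolute errors of the (seasonal) naive forecast.
    naive_errors = sum(abs(actual[i] - actual[i - m]) for i in range(m, t))
    return abs_errors / (t / (t - m) * naive_errors)

actual = [1, 2, 3, 4]
print(mase(actual, [1, 1, 2, 3]))        # non-seasonal
print(mase(actual, [1, 2, 1, 2], m=2))   # seasonal with period 2
```

A value below 1 means the method has a lower mean absolute error than the naïve forecast on the same data.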

The seasonality study was based on the ACF; Table 1 shows the parameters.

5. Results
After careful analysis, it was decided to compare the one-step-ahead forecasts
of the naïve method, Chen, Yu with equal weights, Yu with linear weights, Yu
with exponential weights (C=2) and the differential transformation.

All tests were performed with an interval length of 400 Kbytes, giving 43
intervals and a universe of discourse from 0 to 17.2 Mbytes, with 3 weeks of
training and 1 week of prediction.

Table 4 – Forecast error 1 step ahead

Method              MASE   MAPE      MAE
Naive               0.99   55.20%    1085
Chen                1.49   137.00%   1632
Yu Equal Weight     1.04   60.00%    1141
Yu Linear Weight    1.07   63.80%    1168
Exponential (c=2)   1.26   86.00%    1377
Differential        1.4    92.00%    1537

Part of the forecast data is shown in Figure 4.


Figure 4 – Forecast data 1 hour ahead (series: Real, Naive, Yu_EqualWeight, Yu_LinearWeight; vertical axis from 0 to 18000000; horizontal axis samples 1 to 172)

As previously discussed, the other important forecast is for one day, that is,
24 hourly steps. This means that recursive code is used, and forecasted data
are used for the following hours instead of actual data. This analysis
compares the naïve method, Yu with equal weights, Yu with linear weights,
Exponential (C=2) and AR-FTS using Yt-24, Yt-144 and Yt-168.
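
The recursive free run can be sketched as a loop that feeds each forecast back as the next input. This illustration covers first-order models only; `free_run` and the `one_step` callables are hypothetical names, and a model using lags such as Yt-24 would need a history window instead of a single value.

```python
def free_run(last_value, steps, one_step):
    """Forecast `steps` values ahead, reusing each forecast as input."""
    forecasts = []
    current = last_value
    for _ in range(steps):
        current = one_step(current)  # forecast replaces the actual value
        forecasts.append(current)
    return forecasts

# Naive model: the next value equals the last known value, so a
# 24-hour free run stays flat at the last observation.
naive_day = free_run(last_value=5.0, steps=24, one_step=lambda y: y)
print(len(naive_day), naive_day[0], naive_day[-1])
```

Because errors are fed back into the model, free-run errors are generally expected to be larger than one-step-ahead errors.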

The tests present the following results:

Table 5 – Forecast error 24 hour run

Method              MASE   MAPE      MAE
Naive               1.11   141.00%   2049
Yu Equal Weight     1.03   144.00%   1906
Yu Linear Weight    1.04   145.00%   1932
Exponential (c=2)   1.12   153.00%   2089
Yu Seasonal AR      0.74   86.00%    1381

Figure 5 – Forecast data 24 hour run (series: Real, Naive, Yu_EqualWeight, Yu_LinearWeight; vertical axis from 0 to 18000000; horizontal axis samples 1 to 154)


6. Conclusion
Even with errors higher than expected, the results provide useful information
for traffic congestion prediction and network management: the methods can
successfully predict rush-hour spikes, whose effects can be avoided by
reallocating resources.

For the 24-hour free run, AR-FTS presents the best results compared with the
first-order methods. The data show very difficult behavior and remain a
challenge to forecast well.

7. References
[1] Ian Angus, Telemanagement #187

[2] https://en.wikipedia.org/wiki/Erlang_distribution

[3] http://ita.ee.lbl.gov/html/contrib/LBL-CONN-7.html

[4] Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost

[5] Fuzzy Time Series Forecasting Based On K-Means Clustering

[6] A fuzzy time series-Markov chain model with an application to forecast the exchange rate between the Taiwan and US dollar (2011)

[7] Hyndman, R. J. (2006). "Another look at measures of forecast accuracy", FORESIGHT, Issue 4, June 2006, p. 46
