Exercise 6: Time Series Analysis and Stochastic Modelling
Exercise 6: Time Series Analysis and Stochastic Modelling
Exercise 6: Time Series Analysis and Stochastic Modelling
Course-work-and-data-analysis (/github/bikasbhattarai/Course-work-and-data-analysis/tree/master)
/ Hydrology-Course (/github/bikasbhattarai/Course-work-and-data-analysis/tree/master/Hydrology-Course)
/
GEO4310_2015 (/github/bikasbhattarai/Course-work-and-data-analysis/tree/master/Hydrology-Course/GEO4310_2015)
/
EX6 (/github/bikasbhattarai/Course-work-and-data-analysis/tree/master/Hydrology-Course/GEO4310_2015/EX6)
Date: 2015/10/12
The annual mean runoff data in a station are given in the file Qyearlymean.txt.
Xi = a + bTi
Ho : b = 0; Ha : b ≠ 0
b−o
Test statistic t =
sb
Where
−−−−−−−−− n 2
2 ∑ (χi −χ
^i )
S
sb = √ and S 2 =
i=1
(1)
n ¯ 2 n−2
∑ (T t −T )
t=1
(2)
n−1
ρk = =
1 n ¯ 2
V ar(X t ) ∑ (X t −X )
n−1 t=1
(−1−z1−α/2 √n−2)
l =
n−1
and
(−1+z1−α/2 √n−2)
u =
n−1
Ho : ρ k = 0 ; Ha : ρ k ≠ 0
If the calculated ρ k falls outside these confidence limits, then the hypothesis is rejected, i.e., ρ k is significant for
significant level α and time lag k.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from scipy.stats.stats import pearsonr
from pandas import Series
import statsmodels.tsa.stattools
In [3]:
q=pd.read_table('Qyearlymean.txt')
q.columns = ([u'YEAR', u'T', u'discharge'])
discharge = q['discharge']
print q.head(3)
print len(q)
YEAR T discharge
0 1940 1 398.0
1 1941 2 429.1
2 1942 3 690.2
51
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 2/18
9/26/2020 Jupyter Notebook Viewer
In [4]:
q= pd.read_table('Qyearlymean.txt')
# create a fitted model between discharge as dependent variable and time (T) as indip
fd = smf.ols(formula='discharge ~ T', data= q).fit()
#print summary statistics
print(fd.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is co
Now formulating the test hypothesis for the coefficients to test wheather the coefficients are significantly different or
not and the test hypothesis can be formulated as given below:
H0 : b = 0
Ha : b ≠ 0
In [5]:
# Obtaining the value of 't' from t-table using the python function at alpha = 10, de
round(scipy.stats.t.ppf(0.90,len(q)-2),3)
Out[5]:
1.299
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 3/18
9/26/2020 Jupyter Notebook Viewer
Conclusion: From the analysis, |t| value is greater than tcritical so H0 is rejected and concluded that, there is a
trend.
2. If the trend is significant, remove the trend from the data series.
To remove the trend, the difference between observed values and estimated values, the error, is calculated. The
error is always without trend. To get the time series of discharge without trend, interception and each error value is
added.
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 4/18
9/26/2020 Jupyter Notebook Viewer
In [6]:
# Defining x-axis
year = q['T']
#creating plot
fig, ax = plt.subplots(figsize=(15,7))
y = 2.11x + 491.233
y_without_trend = -0.00x + 491.233
3) Calculate and plot the autocorrelation ρ k and 95% confidence limits for time shifts, k, up to 5 years.
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 5/18
9/26/2020 Jupyter Notebook Viewer
(−1−z1−α/2 √n−2)
l =
n−1
and
(−1+z1−α/2 √n−2)
u =
n−1
Ho : ρ k = 0 ; Ha : ρ k ≠ 0
If the calculated ρ k falls outside these confidence limits, then the hypothesis is rejected, i.e., ρ k is significant for
significant level α and time lag k.
In [7]:
Out[7]:
Conclustion: From above calculation H0 : pk = 0 is rejected up to time lag of 3 years. This means the
autocorrelation coefficient is significant for α =5% up to 3 years of time lag. The autocorrelation for time lag up to 5
years together with limits for confidence interval with 5% leel of significance is given below:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 6/18
9/26/2020 Jupyter Notebook Viewer
In [8]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(auto, marker='o')
plt.grid()
plt.legend(['Auto correlation', 'Lower CI' , 'Uppper CI'])
plt.ylabel('rL', fontsize =18)
plt.xlabel('Time lag (Years)',fontsize=18)
plt.show()
In python there is a inbuild function instide statsmodels.graphics.tsaplots called plot_acf to calculate and plot the
auto-correlation with 95% significance level. The plot given below is plotted by using the plot_acf function where
area under the shaded region is the confidence intervals.
In [9]:
Fit the first order Markov model for annual data (after the removal of trend if there is any)
Parameter estimation:
^
β 1 = ρ1 where ρ 1 is calculated using equation (2).
n 2
2 ∑ (χ −χ̄ )
2 ^
2
^ = σ x (1 − β )
σ ϵ 1
where σ x2 =
i=1 i i
n−1
and
1 n
^ = x̄ =
μ ∑ xi
n i=1
Note that in step (4), one simulation sample is compared with the observed data, while in step (5) the mean values
of 100 samples are compared with the observed values.
Solution:
In [10]:
Out[10]:
0 51 491.22 128.23
Now Fitting of first order Markov model for annual data after trend is removed. Equation (4) to simulate
discharge.Autocorrelation for time lag of 1 is calculated by using the python function:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 8/18
9/26/2020 Jupyter Notebook Viewer
In [11]:
0 0.528
dtype: float64
β1 = ρ 1 = auto1 = 0.528
mean = x̄ = 491.22
2
std = σ x = 128.23
and z t is series generated from random normal distribution: N(0,1). The initial value for Xt−1 is selected as the
3
1959 value i.e 552.63 m /s .
In [12]:
mean = 491.22
std = 128.23
#Selecting the data series 1959 to 1990
no_trend = without_trend[19:]
#generating normally distributed random data series
z= np.random.normal(0, 1, 32)
# Simulating the discharge by using equation (4)
simulated = pd.DataFrame(mean+(0.557*(no_trend -mean))+(z*std*(np.sqrt(1-0.557**2))))
calculating the auto-correlation upto 2 year time lag for observed and simulated datasets.
In [13]:
1 2
0 0.595307 0.614706
In [14]:
1 2
0 0.252234 0.267889
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_20… 9/18
9/26/2020 Jupyter Notebook Viewer
In [15]:
index = ['Observed','Simulated']
statistic = pd.DataFrame(index = index)
statistics = pd.DataFrame({'Auto_1': [0.595, 0.128], 'Auto_2': [0.6147, 0.0735]},inde
statistics['Mean'] = round(no_trend.mean(),2),round(simulated.mean(),2)
statistics['std'] = round(no_trend.std(),2), round(simulated.std(),2)
statistics['median'] = round(no_trend.median(),2), round(simulated.median(),2)
statistics
Out[15]:
In [16]:
fig, ax = plt.subplots(figsize=(13,5))
ax.plot(no_trend, label='observed')
ax.plot(simulated, label='simulated')
plt.grid()
ax.set_xlim([1 ,31])
plt.legend()
plt.xlabel('Years',fontsize =18)
plt.ylabel('Discharge ($m^3/s$)',fontsize =18)
plt.show()
Conclusion: The statistics for observed and modelled time series (1960 -1990) is in above table. And aslo it is
clear from the plot that the variation between the observed and simulated data series is large. Comparision shows
that the Markov medeled time series is a more ramdom series as is expected when including the random z values
from a normal distribution in the Markov model. Simulated times series has lower mean, and lower standard
deviation then that of the observed series. And the value of mean and standard deviation changes everytimes after
the model because of random number.
5 Use AR(1) model to simulate 100 samples, each having 33 years, calculate the mean values of the statistics
(average, standard deviation, r1) from the 100 samples and compare them with the statistics calculated from
observed 33 years data series
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 10/18
9/26/2020 Jupyter Notebook Viewer
In [395]:
def f(iteration,df):
g = np.random.normal(0, 1, 32)
mm = Series(no_trend)
x = 491.22+(0.557*(mm -491.22))+(g*128.23*(np.sqrt(1-0.557**2)))
df[iteration] = x
repeat_fun(100,f,df)
#Dataframe with 100 columns and 31 rows will be obtained by printing df.
#calculating the mean for 100 times run datasets
Simulated_100 =df.mean(axis=1)
Calculating auto-correlation up to 2 year time lag for mean which is calculated from 100 years run datasets.
In [397]:
1 2
0 0.219762 0.432154
Calculating and comparing the different statistics for the mean of 100 times run and observed data without trend.
In [398]:
ind = ['Observed','Simulated']
simulated_100 = pd.DataFrame(index = ind)
simulated_100 = pd.DataFrame({'Auto_1': [0.595, 0.128], 'Auto_2': [0.219, 0.432]},ind
simulated_100['Mean'] = round(no_trend.mean(),2),round(Simulated_100.mean(),2)
simulated_100['std'] = round(no_trend.std(),2), round(Simulated_100.std(),2)
simulated_100['median'] = round(no_trend.median(),2), round(Simulated_100.median(),2)
simulated_100
Out[398]:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 11/18
9/26/2020 Jupyter Notebook Viewer
In [401]:
fig, ax = plt.subplots(figsize=(13,5))
ax.plot(Simulated_100,label='Simulated_100')
ax.plot(no_trend, label='observed')
plt.grid()
ax.set_xlim([1 ,31])
plt.ylabel('Discharge($m^3/s$)', fontsize =18)
plt.xlabel('Years',fontsize=18)
plt.legend()
plt.show()
Conclusion mean and standard deviation for mean simulated discharge is lower then that of observed one. On the
basis of standard deviation we can say that the variation in simulated data sets is comperatively less then that of
observed datasets.
Thomas-Fiering is a generalization of the first order Markov model. In general for the stochastic models (MA, AR,
ARMA) stationarity is assumed. For example, the first order Markov model assumes stationarity in the first three
moments. However, there are generalizations of these models that account for non-stationarity. For data sets which
are seasonally non-stationary, you can use Thomas-Fiering.
1. Fit the Thomas-Fiering Model to the data and simulate the monthly discharge from January 1949 to December
1975 using the value of December 1948 as the initial value. Plot the simulated and observed data on the same
graph and compare the statistics of the two data series
2. Simulate the 100-year discharge series from 1976 to 2075 using the value of December 1975 as initial value
and compare the statistics of the simulated data series with the observed data series.
Calculation procedure:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 12/18
9/26/2020 Jupyter Notebook Viewer
1
(a) The mean flow: q¯j =
n
∑ qj,i ,
i
where i = {j, 12 + j, 24 + j, ⋯}
−−−−−−−−
2
∑ (q j,i −q
¯ )
(b) The standard deviation: Sj = √
i j
n−1
∑ (q j,i −q
¯ j)(q j+1,i −q
¯ j+1 )
i
rj =
2 2
√∑ (q −q
¯ ) ∑ (q j+1,i −q
¯ )
i j,i j i j+1
(d) the slope of the regression equation relating the month’s flow to flow in the preceding month:
Sj+1
bj = rj
Sj
Solution
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 13/18
9/26/2020 Jupyter Notebook Viewer
In [18]:
q = q_hist
# create a random generator, note we have to call this '()' and take the
# first element returned [0]... so in use it is Z()[0]
Z = np.random.randn
# the model, not vectorized. assumes each of the variables above are
# of length 12 ... and note that Python is zero-indexed (e.g. first element is '0
def tf1(qi, month):
i = month-1 #account for zero offset
return q_mean_j1[i] + (rj[i] * (sj1[i] / sj[i])) * (qi - q_mean_j[i]) + Z(1)[
# enumerate (e.g. makes 'i' a counter) and get each month in the simulation perio
for i, m in enumerate(period.month[1:]):
#append to the list by calling the model with the prior value and the month
simq.append(tf1(simq[i-1], m))
if __name__ == "__main__":
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 14/18
9/26/2020 Jupyter Notebook Viewer
In [19]:
index = ['Observed','Simulated']
stat = pd.DataFrame(index = index)
stat['Mean'] = round(q.discharge.mean(),2),round(simq.mean(),2)
stat['std'] = round(q.discharge.std(),2), round(simq.std(),2)
stat['median'] = round(q.discharge.median(),2), round(simq.median(),2)
stat
Out[19]:
Conclusion: From the plot above it is clear that the there is a variation between observed and simulated discharge
from thomas fiering model. Simulated mean is higher than that of observed while the simulated standard deviation
is lower then that of observed datasets. And also the mean is greater than the median in both case. This is common
for a distribution that is skewed to the right (that is, bunched up toward the left and with a "tail" stretching toward the
right)
Simulating discharge for monthly 100 years by using the Thomas-Fiering model
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 15/18
9/26/2020 Jupyter Notebook Viewer
In [20]:
In [21]:
sim=pd.DataFrame(sim_100['discharge'])
mean_100 = pd.DataFrame(sim.mean().round(2))
std_100 = pd.DataFrame(sim.std().round(2))
median_100 =pd.DataFrame(sim.median().round(2))
stats_100= (pd.concat([mean_100, std_100,median_100], axis =1))
my_columns = ["mean", "std", "median"]
stats_100.columns = my_columns
stats_100
Out[21]:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 16/18
9/26/2020 Jupyter Notebook Viewer
In [30]:
plot_acf(sim_yearly)
Out[30]:
In [29]:
sim_yearly = sim_100.groupby(sim_100.index.year).sum()
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 17/18
9/26/2020 Jupyter Notebook Viewer
In [40]:
In [ ]:
https://nbviewer.jupyter.org/github/bikasbhattarai/Course-work-and-data-analysis/blob/master/Hydrology-Course/GEO4310_2… 18/18