Conditional Normalization in Time Series Analysis
Puwasala Gamakumara
Department of Econometrics & Business Statistics
Monash University, Clayton VIC 3800, Australia
Email: puwasala.gamakumara@gmail.com
Corresponding author
arXiv:2305.12651v1 [stat.ME] 22 May 2023
Edgar Santos-Fernandez
School of Mathematical Sciences
Queensland University of Technology, Brisbane QLD 4000, Australia
Email: edgar.santosfernandez@qut.edu.au
Rob J Hyndman
Department of Econometrics & Business Statistics
Monash University, Clayton VIC 3800, Australia
Email: Rob.Hyndman@monash.edu
Kerrie Mengersen
Science and Engineering Faculty
School of Mathematical Sciences
Queensland University of Technology, Brisbane QLD 4000, Australia
Email: k.mengersen@qut.edu.au
Catherine Leigh
Biosciences and Food Technology Discipline
School of Science
RMIT University, Bundoora VIC 3082, Australia
Email: catherine.leigh@rmit.edu.au
23 May 2023
Abstract
Time series often reflect variation associated with other related variables. Controlling for the
effect of these variables is useful when modeling or analysing the time series. We introduce a
novel approach to normalize time series data conditional on a set of covariates. We do this by
modeling the conditional mean and the conditional variance of the time series with generalized
additive models using a set of covariates. The conditional mean and variance are then used to
normalize the time series. We illustrate the use of conditionally normalized series using two
applications involving river network data. First, we show how these normalized time series can
be used to impute missing values in the data. Second, we show how the normalized series can
be used to estimate the conditional autocorrelation function and conditional cross-correlation
functions via additive models. Finally, we use the conditional cross-correlations to estimate the
time it takes water to flow between two locations in a river network.
1 Introduction
Normalization of some variables is often required prior to using a statistical or machine learning
algorithm. In this study, we introduce a novel normalization method for time series data which
aims primarily to remove the conditional variation in the time series that is induced by other
sources of variation.
In practice, it is common to work with multiple time series that are inter-related and non-
stationary. We propose a method to normalize univariate time series conditional on a set of
covariates. This method can be considered as a variation of z-score normalization, but where the
mean and variance are functions of the covariates. Thus we refer to this method as conditional
normalization.
In the proposed method, we first estimate the conditional mean of the time series using a
generalized additive model (GAM) (Hastie & Tibshirani 1990) with a set of covariates. The
conditional variance is then estimated via a different GAM fitted to the squared errors from
the conditional mean model, with respect to the same set of covariates. Finally, the estimated
conditional mean and variance are used to standardize the time series. One can choose the most
relevant set of covariates that can explain maximum variation in the time series.
It is relatively common to subtract a conditional mean in order to adjust data for subsequent
analysis, and sometimes this is called “normalization” (e.g., Xie et al. 2019). However, our
approach is much more general as both the conditional mean and conditional variance are
modeled, and we allow for non-linear relationships between the response and covariates in both
models.
We show two possible uses of conditionally normalized time series. First, we describe how the
conditionally normalized time series can be used to impute missing values in a univariate time
series. To do this, we model the normalized series, and use the model to impute the missing
values. The resulting imputations are then “unnormalized” to give estimates on the original
scale.
Second, we show how the conditionally normalized time series can be used to estimate the
conditional Autocorrelation Function (ACF) and conditional Cross-Correlation Function (CCF).
We can define the conditional ACF at lag k as the conditional expectation of the cross-product of the
conditionally normalized time series and its kth lagged series. Similarly, the conditional CCF at lag k
can be defined as the conditional expectation of the cross-product between two conditionally normalized
time series at k lags apart. To estimate the conditional expectations, we propose to fit GAMs for
the cross-product of the normalized time series using the same set of covariates used in the
conditional normalization.
(USA), which has many missing values. We describe how we can impute these missing values
using Bayesian machinery for modeling.
The second application uses the conditional CCF to estimate the lag time between two sensor
locations in the Pringle Creek river network in Texas, USA. The lag time is the time it takes
water to flow downstream from an upstream location. This lag time often depends on the
upstream river behavior. For example, when the upstream water level increases, water flow
will typically be increased and hence the lag time will decline. On the other hand, when the
level is low, water may be flowing more slowly and hence the lag time will increase. Lag time
has been estimated using different approaches in many hydrological applications (see Van
der Velde et al. (2010), Hrachowitz et al. (2016) and Li et al. (2018) for example). We propose to
estimate the lag time as the lag that gives the maximum conditional cross-correlation between
two water-quality variables observed at upstream and downstream locations, conditional on
other, related water-quality variables measured at an upstream location. This will allow the lag
time to be estimated conditional on the upstream river behavior.
The rest of the paper is organized as follows. The underlying methods for conditional normalization
are described in Section 2. Section 3 contains the empirical application on missing value
imputation, while Section 4 discusses the application to estimating lag times between sensor
locations. Finally, we discuss the results and provide some concluding remarks in Section 5.
We estimate $m(z_t)$ and $v(z_t)$ using GAMs (Hastie & Tibshirani 1990). First, we fit the model
$$y_t = \alpha_0 + \sum_{i=1}^{p} f_i(z_{i,t}) + \varepsilon_t,$$
where $f_i(\cdot)$ are smooth functions, and $\varepsilon_1, \dots, \varepsilon_T$ have mean 0 and variance $v(z_t)$, giving
$$\hat{m}(z_t) = \hat{\alpha}_0 + \sum_{i=1}^{p} \hat{f}_i(z_{i,t}). \qquad (2)$$
In estimating the model, we ignore any heteroskedasticity and autocorrelation, and use penalized
splines for each f i function.
where each $g_i(\cdot)$ is a smooth function, and the Gamma parameterization has the first argument,
$v(z_t)$, as the mean, and the second argument, $r$, as the shape parameter. The Gamma family is
not essential here, and it may be replaced by another distribution whose support is in $(0, \infty)$.
The resulting variance estimate is
$$\hat{v}(z_t) = \exp\Big(\hat{\beta}_0 + \sum_{i=1}^{p} \hat{g}_i(z_{i,t})\Big). \qquad (3)$$
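As a rough illustration of Equations (2) and (3), the following Python sketch normalizes a simulated series on a single covariate. It is not the implementation used in this paper, which fits penalized-spline GAMs in R. Here a cubic polynomial basis stands in for the smooth functions, the Gamma model with log link is fitted by a hand-written iteratively reweighted least squares loop, and all function names and the simulated data are assumptions made for the example.

```python
import numpy as np

def poly_basis(z, degree=3):
    """Design matrix [1, z, z^2, ...]; a crude stand-in for a penalized spline basis."""
    z = np.asarray(z, dtype=float)
    return np.vander(z, degree + 1, increasing=True)

def fit_gamma_log(X, y, n_iter=50):
    """Gamma GLM with log link via IRLS (the working weights are constant for this link)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())            # start from a constant-variance fit
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        working = eta + (y - mu) / mu     # working response for the log link
        beta = np.linalg.lstsq(X, working, rcond=None)[0]
    return beta

def conditional_normalize(y, z, degree=3):
    """Return the normalized series with the fitted conditional mean and variance."""
    X = poly_basis(z, degree)
    alpha = np.linalg.lstsq(X, y, rcond=None)[0]   # mean model, ignoring heteroskedasticity
    m_hat = X @ alpha
    sq_resid = (y - m_hat) ** 2                    # response of the variance model
    v_hat = np.exp(X @ fit_gamma_log(X, sq_resid))
    return (y - m_hat) / np.sqrt(v_hat), m_hat, v_hat

# Simulated example: mean and variance both depend on the covariate z.
rng = np.random.default_rng(1)
z = rng.uniform(-1, 1, 2000)
y = 1 + 2 * z + np.sqrt(np.exp(0.5 + z)) * rng.standard_normal(2000)
y_star, m_hat, v_hat = conditional_normalize(y, z)
```

Values obtained on the normalized scale can be mapped back to the original scale as `m_hat + np.sqrt(v_hat) * y_star`.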
where $\hat{y}^*_t$ is the imputed value of $y^*_t$, computed using a Kalman smoother.
The function $r_k(\cdot)$ can be estimated using a separate GAM for each $k$:
$$\eta^{-1}(u) = \frac{e^u - 1}{e^u + 1}. \qquad (5)$$
Other smooth monotonic link functions, $\eta$, that map $[-1, 1]$ to the real line, $(-\infty, \infty)$, may also
be used.
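Solving Equation (5) for $u$ gives $\eta(c) = \log\{(1 + c)/(1 - c)\}$, and the inverse can equivalently be written as $\tanh(u/2)$. This closed form is derived here rather than quoted from the text above, and the function names in the short Python check below are illustrative only.

```python
import numpy as np

def eta(c):
    """Link mapping a correlation c in (-1, 1) to the real line."""
    c = np.asarray(c, dtype=float)
    return np.log((1 + c) / (1 - c))

def eta_inv(u):
    """Inverse link of Equation (5), (e^u - 1) / (e^u + 1), written as tanh(u / 2)."""
    return np.tanh(np.asarray(u, dtype=float) / 2)

c = np.array([-0.9, -0.3, 0.0, 0.5, 0.99])
u = eta(c)            # values on the linear-predictor scale
c_back = eta_inv(u)   # maps back into (-1, 1)
```

Fitting on the $\eta$ scale and back-transforming keeps every fitted correlation strictly inside $(-1, 1)$.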
where $m_x(z_t)$ and $m_y(z_t)$ are estimated using (2), and $v_x(z_t)$ and $v_y(z_t)$ are estimated using (3).
$$\eta(c_k(z_t)) = \phi_0 + \sum_{i=1}^{p} s_i(z_{i,t}). \qquad (6)$$
Figure 1 shows the time series of stream temperature and air temperature. It is known that the
air temperatures are strongly correlated with stream temperatures (Bal et al. 2014), and so these
will be used as a covariate in our models.
Figure 1: Stream temperatures (black) and air temperatures (red) (°C) from the 42 spatial locations.
$$y^*_{st} = \frac{y_{st} - \mu_{st}}{\sigma_t} \sim \text{AR}(p),$$
where $\mu_{st}$ and $\sigma_t$ are the mean and the standard deviation respectively (formulated as functions
of covariates). To avoid overparameterization (i.e. having more parameters than what we can
learn from the model), we assume that sites have a common standard deviation $\sigma_t$ at time $t$. The
standardized response variable $y^*_{st}$ is modeled using an autoregressive process of order $p$ to
account for the remaining serial correlation in the data.
The covariates $\mathrm{sin}_{kt} = \sin(2\pi t k/m)$ and $\mathrm{cos}_{kt} = \cos(2\pi t k/m)$, $k = 1, \dots, 5$, are the first five
pairs of Fourier terms (harmonic regression parameters), where $m$ is the seasonal period (Section
7.4, Hyndman & Athanasopoulos 2021).
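The Fourier covariates can be generated directly from their definition. The Python sketch below is only an illustration: the function name is hypothetical, and the seasonal period `m = 365.25` is an assumed value for daily data rather than one stated above.

```python
import numpy as np

def fourier_terms(t, m, K):
    """Columns sin(2*pi*k*t/m), cos(2*pi*k*t/m) for k = 1, ..., K."""
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, K + 1):
        angle = 2 * np.pi * k * t / m
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

t = np.arange(1, 731)                  # two years of daily time indices
X = fourier_terms(t, m=365.25, K=5)    # 730 rows, 10 columns (5 sine/cosine pairs)
```

Each pair adds one harmonic of the seasonal cycle, so a small `K` gives a smooth seasonal pattern and a larger `K` allows sharper seasonal features.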
3.4 Results
The data are missing 1654 temperature values out of 15330 observation periods. To illustrate our
approach, we also remove 20% of the non-missing observations to form a test set. We aim to
estimate these missing values and compare the estimates with the original values. Thus, the
model is trained using 80% of the non-missing data, with 20% used for testing the prediction
accuracy. Figure 2 shows the water temperature values in the training data for each of the spatial
locations.
The model is estimated in Stan using a Hamiltonian Monte Carlo procedure. We use 3 chains
each composed of 6,000 samples and we discard a burn-in of 4,000 samples.
We found that an AR model of order eight offered the best fit in terms of root mean square prediction
error (RMSPE). Figure 3 shows the observed water temperature (training set = blue) and the
estimated values (testing set = red) in each of the 42 spatial locations. We note that the model
captures the periodicity in the data well and produces good estimates of the missing temperature
values when comparing the predictions with the hold-out data.
Figure 2: Daily mean temperature (°C) values by spatial location and date. The gray areas represent
periods that are missing from the training data.
We also assessed the uncertainty in the estimates using highest posterior density intervals.
Figure 4 shows these for the first two spatial locations. In both locations, the model captures
the periodic patterns, including backcasting to predict the initial missing sections of the time
series in location two. Overall, the proportion of observations in which the nominal 95% highest
density interval of the estimated mean temperature contains the true value is 0.946.
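The two accuracy summaries used here, RMSPE and the empirical coverage of highest posterior density intervals, can be computed from posterior draws along the following lines. This Python sketch is an assumption about one reasonable way to do it (the shortest interval containing the requested share of sorted draws); the function names and the simulated draws are hypothetical, and it is not the code used to obtain the 0.946 figure.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing at least `prob` of the sorted draws."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    k = int(np.ceil(prob * n))             # number of draws inside the interval
    widths = s[k - 1:] - s[:n - k + 1]     # width of every window of k draws
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

def coverage(y_true, draws, prob=0.95):
    """Share of held-out values inside their interval; draws has shape (n_draws, n_obs)."""
    y_true = np.asarray(y_true, dtype=float)
    bounds = np.array([hpd_interval(draws[:, j], prob) for j in range(draws.shape[1])])
    return np.mean((y_true >= bounds[:, 0]) & (y_true <= bounds[:, 1]))

# A skewed example, where the HPD interval is shorter than the equal-tailed one.
rng = np.random.default_rng(0)
skewed = rng.exponential(size=10_000)
lo, hi = hpd_interval(skewed)
```

A coverage close to the nominal level, as reported above, suggests that the posterior intervals are reasonably calibrated on the hold-out data.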
The posterior distributions of the regression coefficients β indicate that the spatial covariates
and air temperature substantially affect the stream temperature (Figure 5), with the three pairs
of Fourier terms explaining seasonality changes in the response variable well.
The posterior means of the daily standard deviation indicate that the standard deviation in
summer was three times higher than in the winter (Figure 6).
An ACF plot of the standardized time series corresponding to site 2 is presented in Appendix A.
Figure 3: Time series of stream temperature in the 42 spatial locations. Points in blue represent the
training set, while the predictions for the missing periods are given in red.
In river systems, these in-situ sensors are typically placed at one or more locations to measure
multiple water-quality variables semi-continuously. The resultant data can be used for different
kinds of analyses such as identifying trends in water quality and predicting sediment and
nutrient concentrations through space and time (Leigh et al. 2019).
Figure 4: Time series of two spatial locations. Points in blue represent the training set, while the
predictions for the missing periods are given in red along with the 95% highest posterior
density intervals.
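[Figure 5: posterior distributions of the regression coefficients βintercept, βslope, βelev, βcum_drain,
βair_temp, βsin1, βcos1, βsin2, βcos2, βsin3 and βcos3; horizontal axis from −2 to 4.]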
Estimating lag time is necessary in order to compute lagged, explanatory water-quality variables
which can be used as predictors in models of the downstream response variable of interest. One
approach for estimating lag time is to use empirical equations based on the length and slope of
the flow path and other catchment features (Green & Nelson 2002; Li & Chibber 2008). Another
approach uses water level and flow (i.e., discharge) data. For example, Seyam & Othman (2014)
used this approach to estimate lag time between four upstream locations and a downstream
location in the Selangor River basin. Their method involved plotting the hydrograph for the
downstream location and then water level at the upstream location during high flow events and
estimating the time difference between peaks of the two plots. The average time difference was
then considered as the lag time between the two locations.
Field-based methods to estimate the lag time include injecting salt tracers (usually Sodium
Chloride or Sodium Bromide) at an upstream location and measuring the salt concentration
through time at a downstream location, from which the travel time is then estimated. This
manual process has to be carried out several times a year during both high flows and low flows,
which is costly and time-consuming.
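Recall that we have fitted the following separate GAMs for each $k$:
$$y^*_t x^*_{t-k} = \eta^{-1}\Big(\phi_0 + \sum_{i=1}^{p} s_i(z_{i,t})\Big) + \varepsilon_{t,k}.$$
The residuals $\varepsilon_{t,k}$ are modeled as an autoregressive process of order $p_k$:
$$\varepsilon_{t,k} = \mu_k + \sum_{i=1}^{p_k} \psi_{i,k}\,\varepsilon_{t-i,k} + \zeta_{t,k}, \qquad \zeta_{t,k} \sim N(0, \sigma_k^2).$$
For each model, the order $p_k$ is determined by minimizing the corrected Akaike Information
Criterion (AICc) using the auto.arima function from the forecast package (Hyndman et al.
2022; Hyndman & Khandakar 2008). Then we resample from $\hat{\zeta}_{t,k}$ to generate our bootstrap
sample following these steps.
1. Randomly select with replacement a sample of size $T$ from
   $\hat{\zeta}_{t,k} = \varepsilon_{t,k} - \hat{\mu}_k - \sum_{i=1}^{p_k} \hat{\psi}_{i,k}\,\varepsilon_{t-i,k}$.
   Denote this sample by $\hat{\zeta}^b_{t,k}$.
2. Compute $\varepsilon^b_{t,k} = \hat{\mu}_k + \sum_{i=1}^{p_k} \hat{\psi}_{i,k}\,\varepsilon^b_{t-i,k} + \hat{\zeta}^b_{t,k}$ for $k = 1, \dots, K$.
3. Compute $(y^*_t x^*_{t-k})^b = \eta^{-1}\Big(\hat{\phi}_0 + \sum_{i=1}^{p} \hat{s}_i(z_{i,t})\Big) + \varepsilon^b_{t,k}$ for $k = 1, \dots, K$.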
$$(y^*_t x^*_{t-k})^b = \eta^{-1}\Big(\phi_0 + \sum_{i=1}^{p} s_i(z_{i,t})\Big) + \varepsilon_{t,k}.$$
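The resampling part of this procedure (steps 1 and 2) can be sketched in Python for a single lag $k$. This is a simplified illustration, not the implementation used here: the AR order `p` is fixed by hand instead of being selected by AICc with auto.arima, the AR coefficients are estimated by ordinary least squares, the first `p` bootstrap values are simply copied from the observed residuals, and the function names are hypothetical.

```python
import numpy as np

def fit_ar(eps, p):
    """Least-squares fit of eps_t = mu + sum_i psi_i * eps_{t-i} + zeta_t."""
    eps = np.asarray(eps, dtype=float)
    T = len(eps)
    # Column i holds eps_{t-i} for t = p, ..., T-1.
    lags = np.column_stack([eps[p - i:T - i] for i in range(1, p + 1)])
    X = np.column_stack([np.ones(T - p), lags])
    coef = np.linalg.lstsq(X, eps[p:], rcond=None)[0]
    zeta = eps[p:] - X @ coef              # estimated innovations
    return coef[0], coef[1:], zeta

def sieve_bootstrap(eps, p, rng):
    """One bootstrap replicate of the residual series."""
    eps = np.asarray(eps, dtype=float)
    mu, psi, zeta = fit_ar(eps, p)
    T = len(eps)
    zeta_b = rng.choice(zeta, size=T, replace=True)   # step 1: resample innovations
    eps_b = np.empty(T)
    eps_b[:p] = eps[:p]                               # assumed initial values
    for t in range(p, T):                             # step 2: rebuild the AR recursion
        past = eps_b[t - p:t][::-1]                   # eps_b[t-1], ..., eps_b[t-p]
        eps_b[t] = mu + psi @ past + zeta_b[t]
    return eps_b

# Simulated AR(1) residuals as a stand-in for the GAM residuals at one lag.
rng = np.random.default_rng(42)
T = 2000
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.6 * eps[t - 1] + rng.standard_normal()
mu_hat, psi_hat, zeta_hat = fit_ar(eps, p=1)
eps_b = sieve_bootstrap(eps, p=1, rng=rng)
```

For step 3, the bootstrap residuals `eps_b` would be added to the fitted values of the corresponding GAM on the response scale.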
Water quality is measured in Pringle Creek using two sensor locations situated about 200 m apart,
with a small tributary entering the main creek between the two sensors. The variables measured
by these sensors include turbidity (Formazin Nephelometric Unit), specific conductance, pH,
dissolved oxygen, and chlorophyll. Measurements are available at 1-minute frequencies and
can be retrieved from NEON Data Portal (National Ecological Observatory Network (NEON)
2021d). Surface water level and water temperature are also available from both locations at
5-minute frequencies and can be retrieved from National Ecological Observatory Network
(NEON) (2021a) and National Ecological Observatory Network (NEON) (2021c), respectively.
The data we consider were collected from 1 October 2019 to 31 December 2019. This time span
avoids the summer period in which surface pools of water disconnect and contains the
fewest missing observations after removing the anomalies.
We will use turbidity to compute the cross-correlations between upstream and downstream
sensors. Turbidity is chosen because it is heavily influenced by fresh inputs of water from
upstream, and hence there should be a strong relationship between upstream and downstream
turbidity. We choose water level and temperature as the covariates from the upstream sensor to
model the cross-correlation between upstream and downstream turbidity.
Appendix B discusses the data pre-processing, anomaly detection, missing value imputation,
and variable selection steps of the analysis.
1 https://www.neonscience.org/field-sites/prin
Let $y_t$ denote the time series of turbidity measured at the downstream sensor. Following
Section 2.4, we first normalize $x_t$ and $y_{t+k}$ for $k = 1, \dots, 24$ conditional on $z_t$. Figures 7 and 8
visualize the fitted mean and variance models for $y_{t+1}$, respectively.
Figure 7: Visualizing the fitted smooth functions in the conditional mean model for turbidity downstream
with the predictors, water level and temperature from upstream sensor. Each panel visualizes
the relationship between the response and predictor while holding other predictors at their
medians (251.6 m and 9.926 °C for water level and temperature, respectively). The smooth
function is shown in blue with 95% confidence bands. The degrees of the smoothing are shown
in the y-axis label for each plot.
Following Equation (6), the fitted conditional cross-correlation function between turbidity at the
upstream and downstream sensors at lag k can be written as
Figure 8: Visualizing the fitted smooth functions in the conditional variance model for turbidity down-
stream with the predictors, water level and temperature from upstream sensor. Each panel
visualizes the relationship between the response and predictor while holding other predictors at
their medians (251.6 m and 9.926 °C for water level and temperature, respectively). The smooth
function is shown in blue with 95% confidence bands. The degrees of the smoothing are shown
in the y-axis label for each plot.
where {ŝ1,k , ŝ2,k } denote natural cubic splines. Similar to when fitting mean and variance models,
the degrees of freedom for each spline are chosen by examining the relationship between the
response and each covariate.
At lag 1, temperatures above 10 °C have a slightly negative effect on the cross-correlation
between upstream and downstream turbidity, after controlling for water level (see
Figure 9). Plots that visualize relationships at other lags can be obtained similarly.
The 80% and 95% confidence intervals for the relationship between estimated dt and each
upstream covariate zi,t used in our conditional cross-correlation models (see Figure 10) are
computed using the Sieve bootstrap approach (Bühlmann 1997) and the algorithm is described
in Section 4.4.
Figure 9: Visualizing the fitted smooth functions for conditional cross-correlation between
turbidity-upstream and turbidity-downstream at lag 1 with the predictors, water level and temperature
from the upstream sensor. Each plot visualizes the relationship between the response and predictor
while holding other predictors at their medians (251.6 m and 9.926 °C for upstream water level and
temperature, respectively). The top panel shows the smooth terms on the predictor
scale, whereas the bottom panel is on the response scale.
To visualize the relationship between $d_t$ and each covariate, we replace the remaining covariates
with their medians in the original data, and then estimate $d_t$ from the fitted model using this
modified data. Figure 10 displays the relationship between $d_t$ and each covariate. It is clear from
Figure 10 that upstream water level has a negative effect on the lag time. That is, when water
level increases, the lag time decreases. An increasing water level implies large freshwater inputs
and more flow, so water moves downstream in less time and the lag time decreases.
When the water level is between 251.6 and 251.8 m, the lag time is very low. In fact, there was
only one incident in November that showed a water level within this range, which occurred
during a freshwater inflow event. However, when the water level exceeds 251.8 m, the lag
time increases, deviating from its previous pattern. It is unclear what exactly happens at
that point; however, the original data indicate that the water level was higher than 251.8 m
only during a single event in November 2019 (see Figure 19 in Section B.4). On
the other hand, when the upstream temperature is below 10 °C, it has a positive effect on the lag
time; that is, when the water temperature increases, the lag time also increases. This pattern is
consistent with river behavior, as water temperature can increase during dry seasons when there
is less inflow to the system, particularly if dry seasons occur in the warmer months. Low inflow
causes the water to move downstream more slowly, resulting in an increase in the lag time.
However, for temperatures greater than 10 °C, which mostly occur during early October and
freshwater inflow events, the lag time remains consistently low. Figure 11 shows the maximum
conditional cross-correlation and the lag time between upstream and downstream turbidity, with
water level and temperature from the upstream sensor as predictors.
Figure 10: Visualizing dt with 80% and 95% bootstrap confidence intervals. Each panel visualizes dt vs
each upstream covariate while holding the remaining upstream covariates at their medians
(251.6 m and 9.926 °C for water level and temperature, respectively).
4.8 Evaluation
We can use the estimated $d_t$ to compute the lead variable, i.e., $y_{t+d_t}$ (lag variable, i.e., $x_{t-d_t}$)
from the downstream (upstream) sensor. $y_{t+d_t}$ is expected to have the maximum conditional
cross-correlation with $x_t$ compared to any $y_{t+k}$ for $k = 1, \dots, 24$. That is, ideally we expect
$E[y^*_{t+d_t} x^*_t \mid z_t] > E[y^*_{t+k} x^*_t \mid z_t]$ for all lags $k$ and time $t$, where $x^*_t$ and $y^*_{t+d_t}$ are the conditionally
normalized series of $x_t$ and $y_{t+d_t}$ with respect to $z_t$. To evaluate this, we first fit a GAM to $y^*_{t+d_t} x^*_t$
using $z_t$ as the predictors and follow Section 2.4 to compute $E[y^*_{t+d_t} x^*_t \mid z_t]$. These conditional
cross-correlations are then compared with $E[y^*_{t+k} x^*_t \mid z_t]$ for all $k$ and $t$, which were obtained
using Equation (8). The resultant conditional cross-correlations are shown in Figure 12. We can
see that $E[y^*_{t+d_t} x^*_t \mid z_t]$ is greater than $E[y^*_{t+k} x^*_t \mid z_t]$ for the majority of the time.
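Once the conditional cross-correlations have been estimated on a grid of lags, picking the lag time and summarizing this comparison is a matter of simple array operations. The Python sketch below uses a tiny made-up matrix in place of fitted GAM output. The function names, the toy numbers, and the use of "at least as large" in the comparison are all assumptions made for illustration.

```python
import numpy as np

def lag_time(ccf):
    """ccf[t, k-1] is the estimated conditional cross-correlation at time t and lag k."""
    ccf = np.asarray(ccf, dtype=float)
    return np.argmax(ccf, axis=1) + 1      # lags are numbered from 1

def share_dominant(ccf_at_dt, ccf):
    """Share of time points where the value at lag d_t is at least every fixed-lag value."""
    ccf = np.asarray(ccf, dtype=float)
    ccf_at_dt = np.asarray(ccf_at_dt, dtype=float)
    return np.mean(ccf_at_dt >= ccf.max(axis=1))

# Toy input: 3 time points, lags 1 to 3 (hypothetical values).
ccf = np.array([[0.2, 0.7, 0.4],
                [0.6, 0.1, 0.3],
                [0.1, 0.2, 0.9]])
d = lag_time(ccf)                          # estimated lag time at each time point
ccf_at_dt = np.array([0.75, 0.55, 0.90])   # hypothetical refitted values at lag d_t
share = share_dominant(ccf_at_dt, ccf)
```

Because the values at lag $d_t$ come from a separately fitted model, they need not coincide with the row-wise maxima, which is why the comparison is informative.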
Figure 11: Time series plots of water-quality variables and lag time between upstream and downstream
sensors for the period 01-Oct-2019 to 31-Dec-2019. (a) Time series plot of turbidity-downstream.
(b) Time series plot of turbidity-upstream. (c) Lag time between upstream
and downstream sensors with the predictors, water level and temperature from the upstream
sensor. (d) Maximum conditional cross-correlation between upstream and downstream sensors
with the predictors, water level and temperature from the upstream sensor.
5 Discussion
In this study we introduce a novel approach to normalize univariate time series conditional
on a set of covariates. The proposed approach uses generalized additive models to estimate
the conditional mean and variance of the time series, given a set of covariates. The conditional
mean is estimated via an additive model fitted to the time series with respect to the covariates.
The residuals from this model are then used to estimate the conditional variance, by fitting a
separate generalized additive model to the squared residuals from the conditional mean model
using the same set of covariates. We assume a gamma family with a log link in the latter model.
The estimated conditional means and variances are then used to normalize the time series.
Normalizing a given time series in this manner will reduce some of the variation induced
through the covariates. Thus it will help to effectively model the autocorrelation of the series
via appropriate time series models. Using an empirical application, we have shown that these
normalized time series can be used to impute missing values and make predictions in stream
temperature.
Figure 12: Time plot of the conditional CCF estimated at different lags, i.e., $E[x^*_t y^*_{t+k} \mid z_t]$ for $k = 1, \dots, 24$.
The black line represents the conditional CCF at lag $d_t$, i.e., $E[x^*_t y^*_{t+d_t} \mid z_t]$. Approximately 96%
of the time, $E[x^*_t y^*_{t+d_t} \mid z_t] > E[x^*_t y^*_{t+k} \mid z_t]$.
The conditionally normalized time series can also be used to compute conditional autocorrelation
and conditional cross-correlation functions at different lags. To compute the conditional ACF at lag
$k$, we have proposed to fit an additive model to the cross product of the normalized time series
and its lagged series at k, using the same set of covariates used in the normalization. Similarly,
the conditional CCF at k can be estimated via an additive model fitted to the cross product of
two conditionally normalized time series at k lags apart.
We have further shown that the conditional cross-correlations can be used to estimate the water
travel time between two locations in a river. This lag time between two river locations varies in
response to the upstream river behavior. Thus we proposed to estimate this lag time conditional
on the upstream river behavior as observed by the water-quality variables measured at the
upstream location. We first computed the cross-correlation between the same water-quality
variable measured at both upstream and downstream locations at different lags, conditional
on a set of water-quality variables measured at the upstream location. Then the lag time is
computed as the lag that gives the maximum conditional cross-correlation. The significance of
the maximum conditional cross-correlation was evaluated in a probabilistic way by computing
standard errors of the predictions in the link space and then computing t statistics. We used
this approach to estimate the water travel time between two locations in Pringle Creek, one of
the NEON aquatic sites located in Texas, USA. The results show that the estimated time lag
captures the highest correlation between the two water-quality variables measured at upstream
and downstream locations. Lag time estimation using the conditional behavior of the river
and the correlation between variables is useful in developing statistical methods for predicting
other water-quality variables of interest such as sediment and nutrient concentrations in river
networks (Leigh et al. 2019). Such data-driven approaches are also useful to complement or
replace expensive and time-consuming field-based methods such as salt tracer experiments.
Further research could extend these approaches, for example, considering vector autoregressions
for multivariate time series problems. Similarly, models can be extended to account for spatial
dependence between sites.
Reproducibility
All code to reproduce the results in this paper is available at
https://github.com/PuwasalaG/Conditional_normalisation_in_TSA. All analysis was conducted using R (R Core Team 2021)
and Stan (Stan Development Team 2021). The methods discussed are available in the conduits
package for R (Gamakumara, Talagala & Hyndman 2023).
Acknowledgments
This project is funded by the Australian Research Council (ARC) Linkage project (grant number:
LP180101151) “Revolutionising high resolution water-quality monitoring in the information
age”. The authors acknowledge the staff members from Aquatic Instruments Science team and
the National Ecological Observatory Network (NEON), especially Guy Litt, Bobby Hensley
and Gary Henson, for their valuable explanations on the background of the Pringle Creek site
and the relationships between water-quality variables. Further, we convey our gratitude to
Erin Peterson and Claire Kermorvant for valuable discussions on the project and water-quality
characteristics.
[Figure: ACF plot of the standardized time series for site 2; horizontal axis: lag (0 to 25),
vertical axis: ACF.]
Figure 14: Pringle Creek sensor locations. The creek is shown in the blue line and the two pink circles
denote the upstream and downstream sensor locations. Image courtesy National Ecological
Observatory Network.
Figure 15: Turbidity at downstream sensor colored by the quality flags. Note that only turbidity series
had technical anomalies. Other series did not show any technical anomalies during the study
period we chose.
Figure 16: Time plots of the variables used in the study for the period spanning from 01-Oct-2019 to
31-Dec-2019. The anomalous points we identified are colored in red.
Apart from these anomalous points, we also noticed that every fifth observation in turbidity
at both sites is anomalous as a result of the wiper on the optical turbidity sensor operating
(i.e. wiping any biofouling off the sensor probe) every five minutes (Figure 17). Therefore we
also discarded these anomalous points in the turbidity series from both sites.
Figure 17: Wiper anomalies in turbidity downstream sensor (turbidity in FNU against timestamp). This
plot only shows data for 14-Oct-2019 to distinguish the scale difference between wiper anomalies
and typical data.
Turbidity tends to increase when freshwater flows into the river (when water level rises) as
this will increase the suspended particles in water. In contrast, conductance tends to decrease
with fresh water inflows as the water becomes diluted (Leigh et al. 2019) (see Figure 18). This
behavior also explains why the relationship between water level at the upstream location and
turbidity at the upstream site is much stronger than that between the upstream and downstream
locations (see Figure 20).
We then chose the set of covariates from the upstream sensor to model the cross-correlation
between upstream and downstream turbidity by visually analysing the relationship between
the variables. From Figure 20, we can see that the upstream water level, temperature and
conductance show non-linear relationships with both turbidity series. In contrast, dissolved-
oxygen does not show much relationship with turbidity. We also see that water level and
conductance have a strong non-linear relationship. Hence, choosing both water level and
conductance as covariates could lead to multicollinearity problems. Given these observations,
we chose water level and temperature as the covariates to compute conditional cross-correlations
between upstream and downstream turbidity series.
Prior to the remaining analysis, we impute the missing values in upstream level and temperature
(see Figure 21 to visualize the percentage of missing values). The missing values in water level
are imputed using linear interpolation whereas the missing values in temperature are imputed
using a Kalman-smoother implemented in the imputeTS R package (Moritz & Bartz-Beielstein
2017), based on a state space representation of the ARIMA model chosen by auto.arima
implemented in the forecast R package (Hyndman et al. 2022; Hyndman & Khandakar 2008).
Figure 22 plots the time series for level and temperature with the imputations.
Figure 18: Time plots of water-quality variables for the period 01-Oct-2019 to 31-Dec-2019. All variables
are aggregated into 5-minute frequencies, post anomaly removal (See Section B.2). Turbidity
and water level are shown on the log scale. Observations highlighted in orange show examples
of patterns in water quality when there are fresh water inflows (water level rises).
Figure 19: Time plots of water-quality variables for the period 01-Oct-2019 to 31-Dec-2019. All variables
are aggregated into 5-minute frequencies, post anomaly removal (See Appendix Section B.2).
All water-quality variables are plotted in their original scale. A couple of instances are
highlighted in orange to show the patterns in water-quality variables when there are fresh
water inflows.
Figure 20: Pairwise scatter plots between the turbidity and other covariates from the upstream sensor.
The upper triangle shows the Pearson correlation coefficient for each pair.
Figure 21 shows that upstream turbidity has a considerable number of missing values, which
might affect the lag time estimation. Therefore, we first impute these missing values following
the method explained in Section 2.2. Let upstream turbidity be denoted by $x_t$, and let its
conditionally normalized series be
$$x_t^* = \frac{x_t - \hat{m}_x(z_t)}{\sqrt{\hat{v}_x(z_t)}},$$
where $z_t$ contains the water level and temperature measured at the upstream sensor. The
conditional means, $\hat{m}_x(z_t)$, and variances, $\hat{v}_x(z_t)$, are computed using
Equations (2) and (3) respectively. We use the generalized additive
models implemented in the mgcv R package (Wood 2020; Wood 2017). Thin plate regression
splines (Wood 2003) are used as the smooth function for each covariate. We can set the dimension
k of the smoother by observing the relationship between the response and the predictor. Caution
should be taken when choosing k, because larger values can lead to overfitting due to the
underlying autocorrelation in the time series. See Wood (2017) for a discussion on choosing k in
GAMs.
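Once the conditional mean and variance have been estimated, the normalization itself is a pointwise operation. The sketch below assumes the fitted values are already available as plain lists (in the paper they come from the GAM fits); the function name `conditionally_normalize` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List, Optional


def conditionally_normalize(
    x: List[Optional[float]],
    mean_hat: List[float],
    var_hat: List[float],
) -> List[Optional[float]]:
    """Return (x_t - m_hat_t) / sqrt(v_hat_t) for each t.
    Missing observations (None) stay missing in the normalized series."""
    if not (len(x) == len(mean_hat) == len(var_hat)):
        raise ValueError("all inputs must have the same length")
    normalized: List[Optional[float]] = []
    for xt, m, v in zip(x, mean_hat, var_hat):
        if v <= 0:
            raise ValueError("conditional variance must be positive")
        normalized.append(None if xt is None else (xt - m) / math.sqrt(v))
    return normalized


# Hypothetical upstream turbidity with one missing value, and
# hypothetical fitted conditional means and variances.
turbidity = [4.0, None, 9.0, 2.0]
m_hat = [3.0, 5.0, 6.0, 4.0]
v_hat = [4.0, 9.0, 9.0, 1.0]
x_star = conditionally_normalize(turbidity, m_hat, v_hat)
```

Because the conditional mean and variance depend only on the covariates, they are available at every time point, including those where turbidity is missing. This is what later allows imputed values of the normalized series to be mapped back to the original scale.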
From Figure 23 it can be seen that turbidity has a positive relationship with water level when
adjusted for temperature, and the relationship is stronger when the water level is higher. However,
temperature does not seem to have much effect on turbidity when adjusted for water level.
To compute the conditional variance, we model the squared residuals from the conditional
mean models assuming a Gamma family with a log link as in Equation (3). Figure 24 shows the
relationship between the response and each predictor in the fitted conditional variance model.
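To make the variance step concrete, the sketch below fits a Gamma regression with a log link to squared residuals by iteratively reweighted least squares. It is a simplified stand-in, not the model used in the paper: the smooth terms of the GAM are replaced by a single linear term in one covariate, and the function name `fit_gamma_log_link` and the synthetic, noise-free data are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def fit_gamma_log_link(z: List[float], y: List[float], n_iter: int = 50) -> Tuple[float, float]:
    """Fit log E[y] = b0 + b1 * z for a Gamma response by IRLS.
    With a Gamma family and log link the working weights are constant,
    so each iteration is an ordinary least-squares fit to the working
    response eta + (y - mu) / mu. The response y must be strictly positive."""
    n = len(y)
    z_bar = sum(z) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    # Start from the intercept-only fit.
    b0, b1 = math.log(sum(y) / n), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * zi for zi in z]
        mu = [math.exp(e) for e in eta]
        work = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w_bar = sum(work) / n
        b1 = sum((zi - z_bar) * (wi - w_bar) for zi, wi in zip(z, work)) / s_zz
        b0 = w_bar - b1 * z_bar
    return b0, b1


# Synthetic, noise-free "squared residuals" whose mean grows with the covariate.
level = [0.0, 1.0, 2.0, 3.0, 4.0]
sq_resid = [math.exp(0.5 + 0.3 * zi) for zi in level]
b0, b1 = fit_gamma_log_link(level, sq_resid)

# Fitted conditional variances; the log link keeps them positive.
v_hat = [math.exp(b0 + b1 * zi) for zi in level]
```

The log link is what guarantees positive fitted variances, so the square root in the normalization is always defined.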
[Figure 21: Missing-value plot. Variables: level_upstream, temperature_upstream, turbidity_upstream, turbidity_downstream. Vertical axis: Observations (0 to 20000). Legend: Missing (2.1%), Present (97.9%).]
Figure 22: Time plot of upstream level and temperature with imputed values. Points in orange denote
the imputed missing values.
Figure 23: Visualizing the fitted smooth functions in the conditional mean model for upstream turbidity
with the predictors, water level and temperature from the upstream sensor. Each panel visualizes
the relationship between the response and predictor while holding other predictors at their
medians (251.6 m and 9.926 °C for water level and temperature, respectively). The smooth
function is shown in blue and the black points are the partial residuals. The degrees of the
smoothing are shown in the y-axis label for each plot.
Figure 24: Visualizing the fitted smooth functions in the conditional variance model for upstream
turbidity with the predictors, water level and temperature from the upstream sensor. Each panel
visualizes the relationship between the response and predictor while holding other predictors at
their medians (251.6 m and 9.926 °C for upstream water level and temperature, respectively).
The smooth function is shown in blue and the black points are the partial residuals. The
degrees of the smoothing are shown in the y-axis label for each plot.
The normalized series for upstream turbidity is shown in Figure 25. To impute the missing
values of upstream turbidity, we fit an ARIMA model to $x_t^*$ using the auto.arima function
from the forecast R package (Hyndman et al. 2022; Hyndman & Khandakar 2008). Following
Equation (4), we then impute the missing values of $x_t$ as
$\hat{x}_t^* \sqrt{\hat{v}_x(z_t)} + \hat{m}_x(z_t)$, where $\hat{x}_t^*$ is computed
using the Kalman smoother implemented in the imputeTS R package (Moritz & Bartz-Beielstein
2017). Note that this method can produce negative imputed values. Since turbidity
cannot be negative, we removed those points from the remaining analysis (see Figure 26).
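The back-transformation and the removal of negative imputations can be sketched as below. The smoothed values of the normalized series are taken as given here (in the paper they come from the Kalman smoother applied to the fitted ARIMA model); the function name `impute_on_original_scale` and all numbers are hypothetical.

```python
import math
from typing import List, Optional


def impute_on_original_scale(
    x: List[Optional[float]],
    x_star_hat: List[float],
    mean_hat: List[float],
    var_hat: List[float],
) -> List[Optional[float]]:
    """Fill missing x_t with x_star_hat_t * sqrt(v_hat_t) + m_hat_t.
    Observed values are kept unchanged. Negative imputations are
    discarded (left as None), since turbidity cannot be negative."""
    result: List[Optional[float]] = []
    for xt, xs, m, v in zip(x, x_star_hat, mean_hat, var_hat):
        if xt is not None:
            result.append(xt)
            continue
        imputed = xs * math.sqrt(v) + m
        result.append(imputed if imputed >= 0 else None)
    return result


# Hypothetical series with two missing values; the second imputation
# is negative on the original scale and is therefore dropped.
turbidity = [4.0, None, None, 2.0]
x_star_hat = [0.5, 1.0, -2.0, -2.0]
m_hat = [3.0, 5.0, 1.0, 4.0]
v_hat = [4.0, 9.0, 4.0, 1.0]
turbidity_imputed = impute_on_original_scale(turbidity, x_star_hat, m_hat, v_hat)
```

In this toy example the first gap is filled with 1.0 × 3 + 5 = 8, while the second gives −2 × 2 + 1 = −3 and is left missing, mirroring the removal of negative imputations described above.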
[Figure: upstream turbidity panels. Axes: Turbidity upstream (FNU), Count.]
Figure 26: Time plot of upstream turbidity and temperature with imputed values. Points in orange
denote the imputed missing values.
References
Bal, G, E Rivot, JL Baglinière, J White & E Prévost (2014). A hierarchical Bayesian model to
quantify uncertainty of stream water temperature forecasts. PLoS One 9(12), e115659.
Bühlmann, P (1997). Sieve Bootstrap for Time Series. Bernoulli 3(2), 123–148.
http://www.jstor.org/stable/3318584.
Cawley, KM (2021). NEON Algorithm Theoretical Basis Document (ATBD): Water Quality.
https://data.neonscience.org/api/v0/documents/NEON.DOC.004931vB.
Gamakumara, P, PD Talagala & RJ Hyndman (2023). conduits: CONDitional UI for Time Series
normalisation. R package version 1.0.0. https://github.com/PuwasalaG/conduits.
Green, JI & EJ Nelson (2002). Calculation of time of concentration for hydrologic design and
analysis using geographic information system vector objects. Journal of Hydroinformatics 4(2),
75–81.
Hastie, TJ & RJ Tibshirani (1990). Generalized additive models. Vol. 43. CRC press.
Hrachowitz, M, P Benettin, BM Van Breukelen, O Fovet, NJ Howden, L Ruiz, Y Van Der Velde
& AJ Wade (2016). Transit times—The link between hydrology and water quality at the
catchment scale. Wiley Interdisciplinary Reviews: Water 3(5), 629–657.
Hyndman, R, G Athanasopoulos, C Bergmeir, G Caceres, L Chhay, M O’Hara-Wild, F Petropoulos,
S Razbash, E Wang & F Yasmeen (2022). forecast: Forecasting functions for time series and
linear models. R package version 8.16. https://pkg.robjhyndman.com/forecast/.
Hyndman, RJ & G Athanasopoulos (2021). Forecasting: principles and practice. 3rd ed. Melbourne,
Australia: OTexts. OTexts.org/fpp3.
Hyndman, RJ & Y Khandakar (2008). Automatic time series forecasting: the forecast package for
R. Journal of Statistical Software 26(3), 1–22.
Isaak, DJ, SJ Wenger, EE Peterson, JM Ver Hoef, DE Nagel, CH Luce, SW Hostetler, JB Dunham,
BB Roper, SP Wollrab, et al. (2017). The NorWeST summer stream temperature model and
scenarios for the western US: A crowd-sourced database and new geospatial tools foster a
user community and predict broad climate warming of rivers and streams. Water Resources
Research 53(11), 9181–9205.
Leigh, C, S Kandanaarachchi, JM McGree, RJ Hyndman, O Alsibai, K Mengersen & EE Peterson
(2019). Predicting sediment and nutrient concentrations from high-frequency water-quality
data. PLoS One 14(8), e0215503.
Li, MH & P Chibber (2008). Overland flow time of concentration on very flat terrains.
Transportation Research Record 2060(1), 133–140.
Li, Y, Y Zhu, L Chen & Z Shen (2018). The time delay of flow and sediment in the Middle and
Lower Yangtze River and its response to the Three Gorges Dam. Journal of Hydrometeorology
19(3), 625–638.
Moritz, S & T Bartz-Beielstein (2017). imputeTS: Time Series Missing Value Imputation in R. The
R Journal 9(1), 207–218.
National Ecological Observatory Network (NEON) (2021a). Elevation of surface water
(DP1.20016.001). en. https://data.neonscience.org/data-products/DP1.20016.001/RELEASE-2021.
National Ecological Observatory Network (NEON) (2021b). Reaeration field and lab collection
(DP1.20190.001). en. https://data.neonscience.org/data-products/DP1.20190.001.
National Ecological Observatory Network (NEON) (2021c). Temperature (PRT) in surface water
(DP1.20053.001). en. https://data.neonscience.org/data-products/DP1.20053.001/RELEASE-2021.
National Ecological Observatory Network (NEON) (2021d). Water quality (DP1.20288.001). en.
https://data.neonscience.org/data-products/DP1.20288.001/RELEASE-2021.
Seyam, M & F Othman (2014). The influence of accurate lag time estimation on the performance
of stream flow data-driven based models. Water resources management 28(9), 2583–2597.
Stan Development Team (2021). Stan Modeling Language Users Guide and Reference Manual. Version
2.28. https://mc-stan.org.
Vafaeipour, M, O Rahbari, MA Rosen, F Fazelpour & P Ansarirad (2014). Application of sliding
window technique for prediction of wind velocity time series. International Journal of Energy
and Environmental Engineering 5(2), 1–7.
Van der Velde, Y, G De Rooij, J Rozemeijer, F Van Geer & H Broers (2010). Nitrate response
of a lowland catchment: On the relation between stream concentration and travel time
distribution dynamics. Water Resources Research 46(11).
Wanielista, M, R Kersten, R Eaglin, et al. (1997). Hydrology: Water quantity and quality control. John
Wiley and Sons.
Wood, S (2020). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R
package version 1.8-33. https://cran.r-project.org/package=mgcv.
Wood, SN (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 65(1), 95–114.
Wood, SN (2017). Generalized additive models: an introduction with R. CRC press.
Xie, R, M Zhang, P Venkatraman, X Zhang, G Zhang, R Carmer, SA Kantola, CP Pang, P Ma, M
Zhang, et al. (2019). Normalization of large-scale behavioural data collected from zebrafish.
PLoS One 14(2), e0212234.