
Conditional normalization in time series analysis

Puwasala Gamakumara
Department of Econometrics & Business Statistics
Monash University, Clayton VIC 3800, Australia
Email: puwasala.gamakumara@gmail.com
Corresponding author
arXiv:2305.12651v1 [stat.ME] 22 May 2023

Edgar Santos-Fernandez
School of Mathematical Sciences
Queensland University of Technology, Brisbane QLD 4000, Australia
Email: edgar.santosfernandez@qut.edu.au

Priyanga Dilini Talagala


Department of Computational Mathematics, University of Moratuwa,
Bandaranayake Mawatha, Moratuwa, 10400, Sri Lanka
Email: priyangad@uom.lk

Rob J Hyndman
Department of Econometrics & Business Statistics
Monash University, Clayton VIC 3800, Australia
Email: Rob.Hyndman@monash.edu

Kerrie Mengersen
Science and Engineering Faculty
School of Mathematical Sciences
Queensland University of Technology, Brisbane QLD 4000, Australia
Email: k.mengersen@qut.edu.au

Catherine Leigh
Biosciences and Food Technology Discipline
School of Science
RMIT University, Bundoora VIC 3082, Australia
Email: catherine.leigh@rmit.edu.au

23 May 2023

Abstract

Time series often reflect variation associated with other related variables. Controlling for the
effect of these variables is useful when modeling or analysing the time series. We introduce a
novel approach to normalize time series data conditional on a set of covariates. We do this by
modeling the conditional mean and the conditional variance of the time series with generalized
additive models using a set of covariates. The conditional mean and variance are then used to
normalize the time series. We illustrate the use of conditionally normalized series using two
applications involving river network data. First, we show how these normalized time series can
be used to impute missing values in the data. Second, we show how the normalized series can
be used to estimate the conditional autocorrelation function and conditional cross-correlation
functions via additive models. Finally we use the conditional cross-correlations to estimate the
time it takes water to flow between two locations in a river network.

Keywords: conditional normalization, missing value imputation, conditional autocorrelation,

conditional cross-correlation, lag time estimation, stream data, water quality

1 Introduction
Normalization of some variables is often required prior to using a statistical or machine learning
algorithm. In this study, we introduce a novel normalization method for time series data which
aims primarily to remove the conditional variation in the time series that is induced by other
sources of variation.

Common data normalization methods, such as the min-max transformation or standardization (also called z-score normalization), are not always applicable to time series data: they implicitly assume a stationary process, and the normalizing constants may change in the future. For non-stationary data, sliding-window normalization has been proposed (e.g., Ogasawara et al. 2010; Vafaeipour et al. 2014), where the time series is divided into windows of a specified length and the data are normalized within each window. However, these methods do not account for any external variables that can influence the variation in the time series.


In practice, it is common to work with multiple time series that are inter-related and non-stationary. We propose a method to normalize a univariate time series conditional on a set of covariates. This method can be considered a variation of z-score normalization in which the mean and variance are functions of the covariates. Thus we refer to this method as conditional normalization.

In the proposed method, we first estimate the conditional mean of the time series using a
generalized additive model (GAM) (Hastie & Tibshirani 1990) with a set of covariates. The
conditional variance is then estimated via a different GAM fitted to the squared errors from
the conditional mean model, with respect to the same set of covariates. Finally, the estimated
conditional mean and variance are used to standardize the time series. One can choose the most
relevant set of covariates that can explain maximum variation in the time series.

It is relatively common to subtract a conditional mean in order to adjust data for subsequent
analysis, and sometimes this is called “normalization” (e.g., Xie et al. 2019). However, our
approach is much more general as both the conditional mean and conditional variance are
modeled, and we allow for non-linear relationships between the response and covariates in both
models.

We show two possible uses of conditionally normalized time series. First, we describe how the
conditionally normalized time series can be used to impute missing values in a univariate time
series. To do this, we model the normalized series, and use the model to impute the missing
values. The resulting imputations are then “unnormalized” to give estimates on the original
scale.

Second, we show how the conditionally normalized time series can be used to estimate the
conditional Autocorrelation Function (ACF) and conditional Cross-Correlation Function (CCF).
We can define the conditional ACF at lag k as the conditional expectation of the cross-product of the
conditionally normalized time series and its kth lagged series. Similarly, the conditional CCF at lag k
can be defined as the conditional expectation of the cross-product between two conditionally normalized
time series at k lags apart. To estimate the conditional expectations, we propose to fit GAMs for
the cross-product of the normalized time series using the same set of covariates used in the
conditional normalization.

We also highlight two straightforward empirical applications of conditional normalization of


time series. The first application involves a time series of mean daily stream temperatures
observed in multiple locations in Boise River, in the northwestern United States of America


(USA), which has many missing values. We describe how to impute these missing values within a Bayesian modeling framework.

The second application uses the conditional CCF to estimate the lag time between two sensor locations in the Pringle Creek river network in Texas, USA. The lag time is the time it takes water to flow downstream from an upstream location, and it often depends on upstream river behavior. For example, when the upstream water level increases, water flow will typically increase and hence the lag time will decline. On the other hand, when the level is low, water may flow more slowly and hence the lag time will increase. Lag time has been estimated using different approaches in many hydrological applications (see, for example, Van der Velde et al. 2010; Hrachowitz et al. 2016; Li et al. 2018). We propose to estimate the lag time as the lag that gives the maximum conditional cross-correlation between two water-quality variables observed at upstream and downstream locations, conditional on other, related water-quality variables measured at the upstream location. This allows the lag time to be estimated conditional on upstream river behavior.

The rest of the paper is organized as follows. The underlying methods for conditional normalization are described in Section 2. Section 3 contains the empirical application on missing value imputation, while Section 4 discusses the application to estimating lag times between sensor locations. Finally, we discuss the results and provide some concluding remarks in Section 5.

2 Conditional estimation via GAMs


In this section, we describe our approach to conditional normalization of a time series, and then discuss two scenarios in which conditionally normalized time series can be helpful in data analysis.

2.1 Conditional normalization


Let yt be a variable observed at times t = 1, . . . , T, and zt = (z1,t , . . . , z p,t ) be a p dimensional
vector of variables measured at the same times. We assume the mean and variance of yt are
functions of zt ; that is, E(yt | zt ) = m(zt ) and V(yt | zt ) = v(zt ). Our aim is to normalize yt
conditional on zt , giving
$$y_t^* = \frac{y_t - \hat m(z_t)}{\sqrt{\hat v(z_t)}}. \tag{1}$$


We estimate m(zt ) and v(zt ) using GAMs (Hastie & Tibshirani 1990). First, we fit the model

$$y_t = \alpha_0 + \sum_{i=1}^{p} f_i(z_{i,t}) + \varepsilon_t,$$

where $f_i(\cdot)$ are smooth functions, and $\varepsilon_1, \ldots, \varepsilon_T$ have mean 0 and variance $v(z_t)$, giving

$$\hat m(z_t) = \hat\alpha_0 + \sum_{i=1}^{p} \hat f_i(z_{i,t}). \tag{2}$$

In estimating the model, we ignore any heteroskedasticity and autocorrelation, and use penalized
splines for each f i function.

Next, we fit the model

$$[y_t - \hat m(z_t)]^2 \sim \text{Gamma}(v(z_t), r),$$

$$\log(v(z_t)) = \beta_0 + \sum_{i=1}^{p} g_i(z_{i,t}),$$

where each gi (·) is a smooth function, and the Gamma parameterization has the first argument,
v(zt ), as the mean, and the second argument, r, as the shape parameter. The Gamma family is
not essential here, and it may be replaced by another distribution whose support is in (0, ∞).
The resulting variance estimate is

$$\hat v(z_t) = \exp\Big(\hat\beta_0 + \sum_{i=1}^{p} \hat g_i(z_{i,t})\Big). \tag{3}$$
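The two-stage estimation above can be sketched numerically. The following is a minimal, dependency-free illustration in which a Nadaraya–Watson kernel smoother stands in for the penalized-spline GAMs of Equations (2) and (3), and the squared residuals are smoothed directly rather than through a log-link Gamma GAM; the simulated data and bandwidth are invented for illustration.

```python
import numpy as np

def kernel_smooth(z, y, z_eval, bandwidth=0.5):
    """Nadaraya-Watson kernel regression, used here as a dependency-free
    stand-in for one penalized-spline GAM component."""
    # Gaussian kernel weights between evaluation and training points
    w = np.exp(-0.5 * ((z_eval[:, None] - z[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

def conditional_normalize(y, z, bandwidth=0.5):
    """Conditional normalization of Equation (1) for a single covariate z:
    estimate m(z) as in Equation (2), estimate v(z) by smoothing the
    squared residuals (a simplification of Equation (3)), then standardize."""
    m_hat = kernel_smooth(z, y, z, bandwidth)            # conditional mean
    sq_resid = (y - m_hat) ** 2
    # Clip below to keep the variance estimate strictly positive
    v_hat = np.maximum(kernel_smooth(z, sq_resid, z, bandwidth), 1e-8)
    return (y - m_hat) / np.sqrt(v_hat), m_hat, v_hat

# Simulated series whose mean and variance both depend on z
rng = np.random.default_rng(1)
z = rng.uniform(-2, 2, 500)
y = np.sin(z) + (0.5 + 0.4 * z ** 2) * rng.normal(size=500)
y_star, m_hat, v_hat = conditional_normalize(y, z)
```

In the paper's actual procedure the two models are penalized-spline GAMs (with a Gamma family and log link for the variance model); the kernel smoother here only mimics that behavior to keep the sketch self-contained.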

2.2 Imputation of missing values


The conditionally normalized series can be used when imputing missing values in a univariate
time series, assuming y∗t is a (possibly seasonal) Autoregressive Integrated Moving Average
(ARIMA) process. The normalization removes some of the sources of variation in the data,
allowing the ARIMA model to handle any remaining serial correlation.

We can then impute $y_t$ using

$$\hat y_t = \hat y_t^* \sqrt{\hat v(z_t)} + \hat m(z_t), \tag{4}$$

where $\hat y_t^*$ is the imputed value of $y_t^*$, computed using a Kalman smoother.
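The normalize–impute–unnormalize cycle can be sketched on toy data. Linear interpolation stands in for the ARIMA/Kalman-smoother step of the paper (an assumption made to keep the example self-contained), and the constant conditional mean and variance are hypothetical.

```python
import numpy as np

def impute_conditional(y, m_hat, v_hat):
    """Impute NaNs in y following Equation (4): normalize, fill the gaps on
    the normalized scale, then unnormalize back to the original scale."""
    y_star = (y - m_hat) / np.sqrt(v_hat)
    t = np.arange(len(y))
    miss = np.isnan(y_star)
    # np.interp fills missing normalized values from neighbouring observations
    # (a simple stand-in for the ARIMA model + Kalman smoother)
    y_star_filled = y_star.copy()
    y_star_filled[miss] = np.interp(t[miss], t[~miss], y_star[~miss])
    # Equation (4): map the imputed normalized values back to the data scale
    return y_star_filled * np.sqrt(v_hat) + m_hat

m_hat = np.full(10, 5.0)   # toy conditional mean
v_hat = np.full(10, 4.0)   # toy conditional variance
y = m_hat + np.sqrt(v_hat) * np.array(
    [0.0, 1.0, np.nan, -1.0, 0.0, np.nan, np.nan, 1.0, 0.0, 0.0])
y_imp = impute_conditional(y, m_hat, v_hat)
```

Observed values pass through unchanged; only the NaN positions receive imputed values.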


2.3 Conditional autocorrelation function


We can also use the normalized time series to compute a conditional ACF as

$$r_k(z_t) = \mathrm{E}[y_t^* y_{t-k}^* \mid z_t] \quad \text{for } k = 1, 2, \ldots$$

The function $r_k(\cdot)$ can be estimated using a separate GAM for each $k$:

$$y_t^* y_{t-k}^* \sim N(r_k(z_t), \sigma_k^2),$$

$$\eta(r_k(z_t)) = \gamma_0 + \sum_{i=1}^{p} h_i(z_{i,t}),$$

where $h_i(\cdot)$ are smooth functions and

$$\eta^{-1}(u) = \frac{e^u - 1}{e^u + 1}. \tag{5}$$

Other smooth monotonic link functions, η, that map [−1, 1] to the real line, (−∞, ∞), may also
be used.
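The inverse link in Equation (5) is algebraically $\tanh(u/2)$, with forward transform $\eta(r) = \log((1+r)/(1-r))$, i.e. twice the Fisher z-transform. A quick numerical check:

```python
import numpy as np

def eta_inv(u):
    """Inverse link of Equation (5): maps the real line to (-1, 1).
    Algebraically identical to tanh(u / 2)."""
    return (np.exp(u) - 1.0) / (np.exp(u) + 1.0)

def eta(r):
    """Forward link: maps a correlation in (-1, 1) to the real line.
    This is log((1 + r) / (1 - r)), twice the Fisher z-transform."""
    return np.log((1.0 + r) / (1.0 - r))

r = np.array([-0.95, -0.5, 0.0, 0.5, 0.95])
u = eta(r)   # round-trips back to r under eta_inv
```

Modeling $\eta(r_k(z_t))$ on the real line and inverting guarantees that the fitted conditional correlations stay inside $(-1, 1)$.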

2.4 Conditional cross-correlation function


We can use a similar approach to estimate conditional cross-correlation functions. Suppose we have a variable $x_t$ observed at the same times as $y_t$. We are interested in estimating the cross-correlation between $x_t$ and $y_{t+k}$ for $k = 1, 2, \ldots$, conditional on a set of variables $z_t$. First we normalize $x_t$ and $y_t$ with respect to $z_t$ using (1), giving

$$x_t^* = \frac{x_t - \hat m_x(z_t)}{\sqrt{\hat v_x(z_t)}} \quad \text{and} \quad y_t^* = \frac{y_t - \hat m_y(z_t)}{\sqrt{\hat v_y(z_t)}},$$

where $m_x(z_t)$ and $m_y(z_t)$ are estimated using (2), and $v_x(z_t)$ and $v_y(z_t)$ are estimated using (3).

Then we can estimate the conditional cross-correlation

$$c_k(z_t) = \mathrm{E}[y_{t+k}^* x_t^* \mid z_t] \quad \text{for } k = 1, 2, \ldots$$

using the GAMs

$$y_{t+k}^* x_t^* \sim N(c_k(z_t), u_k^2),$$

$$\eta(c_k(z_t)) = \phi_0 + \sum_{i=1}^{p} s_i(z_{i,t}). \tag{6}$$


3 Application: Stream temperature imputation

3.1 Temperature data


In this case study, we use a dataset comprising daily mean stream temperatures (°C) recorded in a large-scale dendritic network in the northwestern USA (Isaak et al. 2017). The five-year dataset includes data from 42 in-situ sensors, each deployed in a unique spatial location, giving a total of 1825 daily observations per time series. The original data contain missing values due to sensor issues, and the goal is to impute those values. For illustration purposes, we took a subset of the data at evenly spaced intervals, keeping every 5th observation, resulting in 73 observations per year at each spatial location.

Figure 1 shows the time series of stream temperature and air temperature. Air temperature is known to be strongly correlated with stream temperature (Bal et al. 2014), so it will be used as a covariate in our models.

Figure 1: Stream temperatures (black) and air temperatures (red) (°C) from the 42 spatial locations.

3.2 The conditional normalization model


Let $y_{st}$ represent the temperature at spatial locations $s = 1, 2, \ldots, S$, and time points $t = 1, 2, \ldots, T$, where $S = 42$ and $T = 1825$. We will use the conditional normalization model:

$$y_{st}^* = \frac{y_{st} - \mu_{st}}{\sigma_t} \sim \mathrm{AR}(p),$$


where $\mu_{st}$ and $\sigma_t$ are the mean and the standard deviation, respectively, formulated as functions of covariates. To avoid overparameterization (i.e., having more parameters than we can learn from the data), we assume that all sites share a common standard deviation $\sigma_t$ at time $t$. The standardized response variable $y_{st}^*$ is modeled using an autoregressive process of order $p$ to account for the remaining serial correlation in the data.

3.3 GAM models


We formulate the expected value ($\mu_{st}$) of stream temperature as a function of air temperature ($at$), as well as spatial covariates and landscape factors that are constant over time: stream slope, elevation (elev) and cumulative drainage area (cd). The resulting mean is the linear function

$$\mu_{st} = \beta_0 + \beta_1 \text{slope}_s + \beta_2 \text{elev}_s + \beta_3 \text{cd}_s + \beta_4 at_{st} + \beta_5 \sin_{1,t} + \beta_6 \cos_{1,t} + \cdots + \beta_{13} \sin_{5,t} + \beta_{14} \cos_{5,t}.$$

The covariates $\sin_{k,t} = \sin(2\pi kt/m)$ and $\cos_{k,t} = \cos(2\pi kt/m)$, $k = 1, \ldots, 5$, are the first five pairs of Fourier terms (harmonic regression terms), where $m$ is the seasonal period (Section 7.4, Hyndman & Athanasopoulos 2021).
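The Fourier covariates can be constructed directly. This sketch assumes the subsampled data, for which the seasonal period is m = 73 observations per year:

```python
import numpy as np

def fourier_terms(t, m, K=5):
    """First K pairs of Fourier terms sin(2*pi*k*t/m), cos(2*pi*k*t/m),
    used as seasonal covariates in the mean model (K = 5 in the paper)."""
    cols = []
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t / m))
        cols.append(np.cos(2 * np.pi * k * t / m))
    return np.column_stack(cols)

t = np.arange(73)               # one seasonal period of the subsampled series
X = fourier_terms(t, m=73)      # 73 x 10 design matrix of seasonal covariates
```

By construction each column repeats with period m, so the fitted seasonal pattern is the same in every year.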

We model $\sigma_t$ using a gamma distribution with common shape parameter $a$ and time-specific rate parameter $b_t$: $\sigma_t^2 \sim \text{Gamma}(a, b_t)$. We use a non-informative uniform prior $a \sim U(0, 100)$ and set $b_t = a / \exp(X\gamma)$, where $X$ is a design matrix containing the Fourier covariates and $\gamma$ is a vector of regression coefficients.
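Under the shape/rate convention this choice of $b_t$ makes $\exp(X\gamma)$ the mean of the distribution, since the Gamma mean is shape divided by rate. A simulation check (the shape and target mean are invented; note that NumPy parameterizes the Gamma by shape and scale = 1/rate):

```python
import numpy as np

rng = np.random.default_rng(42)
a = 8.0              # common shape parameter (the paper places a U(0, 100) prior on a)
target_mean = 2.5    # plays the role of exp(X @ gamma) at one time point
b = a / target_mean  # rate parameter b_t = a / exp(X @ gamma)

# numpy's Gamma uses shape and *scale* = 1 / rate, so
# mean = shape * scale = a / b = exp(X @ gamma), as intended
draws = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
```

The sample mean of the draws should be close to the target mean of 2.5, confirming the parameterization.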

3.4 Results
The data are missing 1654 temperature values out of 15330 observation periods. To illustrate our
approach, we also remove 20% of the non-missing observations to form a test set. We aim to
estimate these missing values and compare the estimates with the original values. Thus, the
model is trained using 80% of the non-missing data, with 20% used for testing the prediction
accuracy. Figure 2 shows the water temperature values in the training data for each of the spatial
locations.

The model is estimated in Stan using Hamiltonian Monte Carlo. We run 3 chains of 6,000 samples each and discard the first 4,000 samples of each chain as burn-in.

We found that an AR process of order eight offered the best fit in terms of root mean square prediction error (RMSPE). Figure 3 shows the observed water temperatures (training set, blue) and the

Figure 2: Daily mean temperature (°C) values. The gray areas represent periods that are missing from the training data.

estimated values (test set, red) at each of the 42 spatial locations. The model captures the periodicity in the data well and produces accurate estimates of the missing temperature values when the predictions are compared against the hold-out data.

We also assessed the uncertainty in the estimates using highest posterior density intervals.
Figure 4 shows these for the first two spatial locations. In both locations, the model captures
the periodic patterns, including backcasting to predict the initial missing sections of the time
series in location two. Overall, the proportion of observations in which the nominal 95% highest
density interval of the estimated mean temperature contains the true value is 0.946.

The posterior distributions of the regression coefficients β indicate that the spatial covariates and air temperature substantially affect the stream temperature (Figure 5), with the three pairs of Fourier terms explaining the seasonal changes in the response variable well.

The posterior means of the daily standard deviation indicate that the standard deviation in summer was three times higher than in winter (Figure 6).

An ACF plot of the standardized time series corresponding to site 2 is presented in Appendix A.


Figure 3: Time series of stream temperature in the 42 spatial locations. Points in blue represent the training set, while the predictions for the missing periods are given in red.

4 Application: Predicting lag time on river flow

4.1 Automated in-situ sensors


Traditional methods of monitoring water quality include collecting water samples at low frequencies (monthly, bi-monthly) and conducting lab-based assessments to measure water quality. With advancements in technology, this manual process is being replaced or augmented by automated, high-frequency in-situ sensors.

In river systems, these in-situ sensors are typically placed at one or more locations to measure
multiple water-quality variables semi-continuously. The resultant data can be used for different
kinds of analyses such as identifying trends in water quality and predicting sediment and
nutrient concentrations through space and time (Leigh et al. 2019).


Figure 4: Time series of two spatial locations. Points in blue represent the training set, while the predictions for the missing periods are given in red along with the 95% highest posterior density intervals.

Figure 5: Posterior means of the regression coefficients.

4.2 Lag time


When analysing sensor data from multiple locations along the same water flow path, it is useful
to know the lag time between sensor locations, as this facilitates prediction of downstream water
quality using information collected upstream. Lag time can be defined hydrologically in many
ways. For example, Wanielista, Kersten, Eaglin, et al. (1997) defined it as “time from the centroid of
rainfall excess to the time of peak runoff for a watershed”. Here we define the lag time specifically as
“the time it takes water to flow downstream from an upstream location”.


Figure 6: Posterior means of the standard deviation (σ).

Estimating lag time is necessary in order to compute lagged, explanatory water-quality variables
which can be used as predictors in models of the downstream response variable of interest. One
approach for estimating lag time is to use empirical equations based on the length and slope of
the flow path and other catchment features (Green & Nelson 2002; Li & Chibber 2008). Another
approach uses water level and flow (i.e., discharge) data. For example, Seyam & Othman (2014)
used this approach to estimate lag time between four upstream locations and a downstream
location in the Selangor River basin. Their method involved plotting the hydrograph for the
downstream location and then water level at the upstream location during high flow events and
estimating the time difference between peaks of the two plots. The average time difference was
then considered as the lag time between the two locations.

Field-based methods to estimate the lag time include injecting salt tracers (usually Sodium
Chloride or Sodium Bromide) at an upstream location and measuring the salt concentration
through time at a downstream location, from which the travel time is then estimated. This
manual process has to be carried out several times a year during both high flows and low flows,
which is costly and time-consuming.

4.3 Estimating lag time via conditional cross-correlations


Lag time can be influenced by various environmental conditions upstream, such as the water
level, discharge, temperature and other water-quality variables. Therefore, we propose a
method to estimate the lag time between two sensor locations in a river using conditional
cross-correlations.


Suppose $x_t$ and $y_t$, observed at times $t = 1, \ldots, T$, denote the same water-quality variable measured at the upstream and downstream sensors, respectively. Let $z_t$ be a $p$-dimensional vector of other water-quality variables measured at the upstream sensor at time $t$. We estimate $c_k(z_t)$, the cross-correlation between $y_{t+k}$ and $x_t$ at lags $k = 1, \ldots, K$, conditional on $z_t$, using the model described in Section 2.4. Then we can use the estimates of the conditional cross-correlations to estimate the river lag time between the two locations. We define this lag time as $d_t$, and estimate it using

$$\hat d_t(z_t) = \operatorname*{argmax}_{k} \hat c_k(z_t). \tag{7}$$
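Given fitted conditional cross-correlations $\hat c_k(z_t)$ over a grid of lags, Equation (7) reduces to a per-time argmax. The values below are invented for illustration: a conditional CCF peaking at lag 10 under one covariate setting and at lag 6 under another.

```python
import numpy as np

def lag_time(cc_hat):
    """Equation (7): for each time t, the estimated lag time is the lag k
    that maximizes the fitted conditional cross-correlation c_k(z_t).
    cc_hat is a (T, K) array whose column k-1 holds c_k(z_t)."""
    return np.argmax(cc_hat, axis=1) + 1   # lags are indexed 1..K

# Hypothetical conditional CCF curves over lags 1..24 for two covariate settings
lags = np.arange(1, 25)
cc_low_level  = np.exp(-0.5 * ((lags - 10) / 4.0) ** 2)  # peak at lag 10
cc_high_level = np.exp(-0.5 * ((lags - 6) / 4.0) ** 2)   # peak at lag 6
cc_hat = np.vstack([cc_low_level, cc_high_level])
d_hat = lag_time(cc_hat)
```

Because the CCF curves depend on $z_t$, the estimated lag time varies over time with the upstream conditions.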

4.4 Computing bootstrapped confidence intervals for dt


Computing the standard errors and confidence intervals for dt is not straightforward, so we use
a bootstrap method. We resample the residuals from the various models used in the conditional
cross-correlation calculation to generate new data. Because these residuals are serially correlated,
we use a sieve bootstrap approach (Bühlmann 1997) to capture the autocorrelation structure
in the data in our bootstrap samples. The following algorithm describes our approach for
computing these confidence intervals.

Algorithm (Sieve bootstrap confidence intervals for dt )

Recall that we have fitted the following separate GAMs for each $k$:

$$y_t^* x_{t-k}^* = \eta^{-1}\Big(\phi_0 + \sum_{i=1}^{p} s_i(z_{i,t})\Big) + \varepsilon_{t,k}.$$

Since the $\varepsilon_{t,k}$ are serially correlated, we fit a $p_k$-th order autoregressive model for $\varepsilon_{t,k}$ for each $k$:

$$\varepsilon_{t,k} = \mu_k + \sum_{i=1}^{p_k} \psi_{i,k} \varepsilon_{t-i,k} + \zeta_{t,k}, \qquad \zeta_{t,k} \sim N(0, \sigma_k^2).$$

For each model, the order $p_k$ is determined by minimizing the corrected Akaike Information Criterion (AICc) using the auto.arima function from the forecast package (Hyndman et al. 2022; Hyndman & Khandakar 2008). Then we resample from $\hat\zeta_{t,k}$ to generate our bootstrap sample following these steps.

1. Randomly select, with replacement, a sample of size $T$ from $\hat\zeta_{t,k} = \varepsilon_{t,k} - \hat\mu_k - \sum_{i=1}^{p_k} \hat\psi_{i,k} \varepsilon_{t-i,k}$. Denote this sample as $\hat\zeta_{t,k}^b$.

2. Compute $\varepsilon_{t,k}^b = \hat\mu_k + \sum_{i=1}^{p_k} \hat\psi_{i,k} \varepsilon_{t-i,k}^b + \hat\zeta_{t,k}^b$ for $k = 1, \ldots, K$.

3. Compute $(y_t^* x_{t-k}^*)^b = \eta^{-1}\big(\hat\phi_0 + \sum_{i=1}^{p} \hat s_i(z_{i,t})\big) + \varepsilon_{t,k}^b$ for $k = 1, \ldots, K$.

13
Conditional normalization in time series analysis

4. Fit the following GAM to the bootstrapped data for each $k$:

$$(y_t^* x_{t-k}^*)^b = \eta^{-1}\Big(\phi_0 + \sum_{i=1}^{p} s_i(z_{i,t})\Big) + \varepsilon_{t,k}.$$

5. Use the models in step 4 to compute $d_t$ for a given set of $z_t$.

6. Repeat steps 1 to 5 for $b = 1, \ldots, m$, where $m = 1000$. This gives a sample of $d_t$ of size $m$, which forms an empirical distribution of $d_t$. Use this sample to compute the $(\alpha/2)$th and $(1 - \alpha/2)$th quantiles, which represent the lower and upper bounds of the $100(1 - \alpha)\%$ confidence interval for $d_t$.
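The residual-resampling core of the algorithm (steps 1 and 2) can be sketched as follows. A least-squares AR fit replaces the AICc-selected auto.arima fit, and the AR order is fixed rather than selected; both are simplifications of the paper's procedure.

```python
import numpy as np

def fit_ar(eps, p):
    """Least-squares AR(p) fit with intercept: a stand-in for the
    AICc-selected autoregression used in the paper's sieve bootstrap."""
    X = np.column_stack(
        [np.ones(len(eps) - p)]
        + [eps[p - i - 1:len(eps) - i - 1] for i in range(p)])  # lag-(i+1) column
    coefs, *_ = np.linalg.lstsq(X, eps[p:], rcond=None)
    innov = eps[p:] - X @ coefs     # estimated innovations (the zeta-hats)
    return coefs, innov

def sieve_bootstrap(eps, p, rng):
    """One sieve-bootstrap replicate of the residual series:
    resample the AR innovations with replacement (step 1), then rebuild the
    series recursively with the fitted AR coefficients (step 2)."""
    coefs, innov = fit_ar(eps, p)
    zeta_b = rng.choice(innov, size=len(eps), replace=True)  # step 1
    eps_b = np.empty(len(eps))
    eps_b[:p] = eps[:p]             # warm start with the observed values
    for t in range(p, len(eps)):    # step 2: recursive reconstruction
        lags = eps_b[t - p:t][::-1]           # [eps_b[t-1], ..., eps_b[t-p]]
        eps_b[t] = coefs[0] + coefs[1:] @ lags + zeta_b[t]
    return eps_b

# Simulate a serially correlated residual series (AR(2)) and bootstrap it
rng = np.random.default_rng(0)
eps = np.zeros(500)
for t in range(2, 500):
    eps[t] = 0.5 * eps[t - 1] - 0.3 * eps[t - 2] + rng.normal()
eps_b = sieve_bootstrap(eps, p=2, rng=rng)
```

Each bootstrap replicate preserves the autocorrelation structure of the original residuals, which is the point of using a sieve bootstrap rather than resampling the residuals independently.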

4.5 Study area and water-quality data


We consider Pringle Creek, one of the NEON (National Ecological Observatory Network) aquatic sites, located in Wise County, Texas, and managed by the U.S. Forest Service1 . A detailed description of the study site is given in Appendix B.

Water quality is measured in Pringle Creek using two sensor locations situated about 200 m apart,
with a small tributary entering the main creek between the two sensors. The variables measured
by these sensors include turbidity (Formazin Nephelometric Unit), specific conductance, pH,
dissolved oxygen, and chlorophyll. Measurements are available at 1-minute frequencies and
can be retrieved from NEON Data Portal (National Ecological Observatory Network (NEON)
2021d). Surface water level and water temperature are also available from both locations at
5-minute frequencies and can be retrieved from National Ecological Observatory Network
(NEON) (2021a) and National Ecological Observatory Network (NEON) (2021c), respectively.

The data we consider were collected from 1 October 2019 to 31 December 2019. This time span avoids the summer period, in which surface pools of water disconnect, and contains the fewest missing observations after removing anomalies.

We will use turbidity to compute the cross-correlations between upstream and downstream
sensors. Turbidity is chosen because it is heavily influenced by fresh inputs of water from
upstream, and hence there should be a strong relationship between upstream and downstream
turbidity. We choose water level and temperature as the covariates from the upstream sensor to
model the cross-correlation between upstream and downstream turbidity.

Appendix B discusses the data pre-processing, anomaly detection, missing value imputation,
and variable selection steps of the analysis.
1 https://www.neonscience.org/field-sites/prin


4.6 Conditional cross-correlation between turbidity at upstream and downstream sensors
Based on the NEON reaeration sampling protocol (National Ecological Observatory Network (NEON) 2021b) and information gathered from field experts, the time it takes water to travel between the two sensor locations at Pringle Creek is typically about 45–60 minutes, though it may be shorter than 45 minutes during high flows and longer than 60 minutes during low flows. Considering this information, we compute the cross-correlations up to 24 lags, which allows for a maximum of two hours of travel time (as the frequency of the water-quality data is 5 minutes).

Let $y_t$ denote the time series of turbidity measured at the downstream sensor. Following Section 2.4, we first normalize $x_t$ and $y_{t+k}$ for $k = 1, \ldots, 24$ conditional on $z_t$. Figures 7 and 8 visualize the fitted mean and variance models for $y_{t+1}$, respectively.

Figure 7: Visualizing the fitted smooth functions in the conditional mean model for turbidity downstream with the predictors water level and temperature from the upstream sensor. Each panel visualizes the relationship between the response and a predictor while holding the other predictors at their medians (251.6 m and 9.926°C for water level and temperature, respectively). The smooth function is shown in blue with 95% confidence bands. The degrees of smoothing are shown in the y-axis label of each plot.

Following Equation (6), the fitted conditional cross-correlation function between turbidity at the upstream and downstream sensors at lag $k$ can be written as

$$\hat c_k(z_t) = \eta^{-1}\big(\hat\phi_0 + \hat s_{1,k}(\text{level\_upstream}_t) + \hat s_{2,k}(\text{temperature\_upstream}_t)\big), \tag{8}$$


Figure 8: Visualizing the fitted smooth functions in the conditional variance model for turbidity downstream with the predictors water level and temperature from the upstream sensor. Each panel visualizes the relationship between the response and a predictor while holding the other predictors at their medians (251.6 m and 9.926°C for water level and temperature, respectively). The smooth function is shown in blue with 95% confidence bands. The degrees of smoothing are shown in the y-axis label of each plot.

where $\hat s_{1,k}$ and $\hat s_{2,k}$ denote natural cubic splines. As when fitting the mean and variance models, the degrees of freedom for each spline are chosen by examining the relationship between the response and each covariate.

At lag 1, temperatures greater than 10°C have a slight negative effect on the cross-correlation between upstream and downstream turbidity, after controlling for water level (see Figure 9). Plots visualizing the relationships at other lags can be obtained similarly.

4.7 Lag time prediction


As described in Equation (7), the lag time is estimated as the lag that gives the maximum cross-correlation conditional on the upstream variables observed at time $t$. This allows $d_t$ to vary according to upstream river behavior.

The 80% and 95% confidence intervals for the relationship between the estimated $d_t$ and each upstream covariate $z_{i,t}$ used in our conditional cross-correlation models (see Figure 10) are computed using the sieve bootstrap approach (Bühlmann 1997) described in Section 4.4.

To visualize the relationship between dt and each covariate, we replace the remaining covariates
with their medians in the original data, and then estimate dt from the fitted model using this


Figure 9: Visualizing the fitted smooth functions for the conditional cross-correlation between turbidity upstream and turbidity downstream at lag 1, with the predictors water level and temperature from the upstream sensor. Each plot visualizes the relationship between the response and a predictor while holding the other predictors at their medians (251.6 m and 9.926°C for water level and temperature upstream, respectively). The top panel shows the smooth terms on the predictor scale, whereas the bottom panel is on the response scale.

modified data. Figure 10 displays the relationship between $d_t$ and each covariate. It is clear from Figure 10 that upstream water level has a negative effect on the lag time: when the water level increases, the lag time decreases. An increasing water level implies high freshwater inputs and greater flow, with water moving downstream in less time, hence the lag time decreases. When the water level is between 251.6 and 251.8 m, the lag time is very low; in fact, there was only one event in November with a water level in this range, which occurred during a freshwater inflow event. However, when the water level exceeds 251.8 m, the lag time increases, deviating from the previous pattern. It is unclear what exactly happens in that instance; however, the original data indicate that the water level exceeded 251.8 m only during a single event in November 2019 (see Figure 19 in Section B.4). On the other hand, when the upstream temperature is below 10°C, it has a positive effect on the lag time; that is, as the water temperature increases, the lag time also increases. This pattern is consistent with river behavior, as water temperature can increase during dry seasons when there is less inflow to the system, particularly if dry seasons occur in the warmer months. Low inflow


causes the water to move downstream more slowly, resulting in an increase in the lag time. However, for temperatures greater than 10°C, which mostly occur during early October and freshwater inflow events, the lag time remains consistently low. Figure 11 shows the maximum conditional cross-correlation and the lag time between turbidity upstream and turbidity downstream, with water level and temperature from the upstream sensor as predictors.

Figure 10: Visualizing dt with 80% and 95% bootstrap confidence intervals. Each panel visualizes dt vs
each upstream covariate while holding the remaining upstream covariates at their medians
(251.6 m and 9.926°C for water level and temperature, respectively).

4.8 Evaluation
We can use the estimated d_t to compute the lead variable, y_{t+d_t}, from the downstream sensor
(or the lag variable, x_{t−d_t}, from the upstream sensor). y_{t+d_t} is expected to have the
maximum conditional cross-correlation with x_t compared to any y_{t+k} for k = 1, . . . , 24. That is,
ideally we expect E[y*_{t+d_t} x*_t | z_t] > E[y*_{t+k} x*_t | z_t] for all lags k and times t, where
x*_t and y*_{t+d_t} are the conditionally normalized series of x_t and y_{t+d_t} with respect to z_t.
To evaluate this, we first fit a GAM to y*_{t+d_t} x*_t using z_t as the predictors and follow
Section 2.4 to compute E[y*_{t+d_t} x*_t | z_t]. These conditional cross-correlations are then
compared with E[y*_{t+k} x*_t | z_t] for all k and t, obtained using Equation (8). The resultant
conditional cross-correlations are shown in Figure 12. We can see that E[y*_{t+d_t} x*_t | z_t]
is greater than E[y*_{t+k} x*_t | z_t] for the majority of time points.
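The selection and evaluation of d_t can be sketched numerically. In the illustrative Python snippet below (the paper's analysis is in R with GAMs; here a hypothetical matrix of conditional cross-correlation estimates with a smooth peak at lag 8 plus noise stands in for two independent rounds of GAM output), d_t is the lag maximising the fitted conditional CCF at each time point, and the evaluation checks how often that lag also dominates in a second round of estimates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical conditional cross-correlations E[x*_t y*_{t+k} | z_t] for
# T time points and lags k = 1..24: a smooth peak at lag 8 plus noise.
T, K = 500, 24
true_ccf = 0.8 * np.exp(-0.5 * ((np.arange(1, K + 1) - 8) / 3.0) ** 2)
ccf_fit = true_ccf + 0.05 * rng.standard_normal((T, K))   # estimation round
ccf_eval = true_ccf + 0.05 * rng.standard_normal((T, K))  # evaluation round

# Estimated lag time d_t: the lag maximising the fitted conditional CCF.
d_t = ccf_fit.argmax(axis=1) + 1  # +1 because lags start at k = 1

# Evaluation: how often does the CCF at lag d_t dominate every other lag
# in the second round of estimates?
ccf_at_dt = ccf_eval[np.arange(T), d_t - 1]
prop = np.mean(ccf_at_dt >= ccf_eval.max(axis=1))
print(round(prop, 2))
```

With estimation noise the proportion falls below one, which mirrors why the paper reports dominance for the majority (not all) of time points.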

5 Discussion
In this study we introduce a novel approach to normalize univariate time series conditional
on a set of covariates. The proposed approach uses generalized additive models to estimate


Figure 11: Time series plots of water-quality variables and lag time between upstream and downstream
sensors for the period 01-Oct-2019 to 31-Dec-2019. (a) Time series plot of turbidity-
downstream. (b) Time series plot of turbidity-upstream. (c) Lag time between upstream
and downstream sensors, with water level and temperature from the upstream sensor as
predictors. (d) Maximum conditional cross-correlation between upstream and downstream
sensors, with water level and temperature from the upstream sensor as predictors.

the conditional mean and variance of the time series, given a set of covariates. The conditional
mean is estimated via an additive model fitted to the time series with respect to the covariates.
The residuals from this model are then used to estimate the conditional variance, by fitting a
separate generalized additive model to the squared residuals from the conditional mean model
using the same set of covariates. We assume a gamma family with a log link in the latter model.
The estimated conditional means and variances are then used to normalize the time series.
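The three-step pipeline just described can be sketched numerically. In this illustrative Python snippet, simple polynomial least squares stands in for the thin plate spline smooths of the paper's GAMs, and a plain least-squares fit to the squared residuals stands in for the gamma/log-link variance model; the synthetic series and covariate are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series whose mean and variance both depend on a covariate z_t.
n = 2000
z = rng.uniform(0.0, 1.0, n)
x = 2 + 3 * z**2 + np.sqrt(0.1 + z) * rng.standard_normal(n)

# Cubic polynomial design in z: a crude stand-in for a GAM smooth.
Z = np.vander(z, 4)

# Step 1: conditional mean model m(z).
beta = np.linalg.lstsq(Z, x, rcond=None)[0]
m_hat = Z @ beta

# Step 2: conditional variance model v(z), fitted to the squared residuals.
# (The paper uses a gamma GAM with a log link; a least-squares fit to the
# squared residuals is used here for simplicity.)
r2 = (x - m_hat) ** 2
gamma = np.linalg.lstsq(Z, r2, rcond=None)[0]
v_hat = np.maximum(Z @ gamma, 1e-6)  # guard against negative fitted variances

# Step 3: normalize the series.
x_star = (x - m_hat) / np.sqrt(v_hat)
print(round(x_star.mean(), 2), round(x_star.std(), 2))  # close to 0 and 1
```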

Normalizing a given time series in this manner reduces some of the variation induced
by the covariates, helping to effectively model the autocorrelation of the series
via appropriate time series models. Using an empirical application, we have shown that these
normalized time series can be used to impute missing values and make predictions of stream
temperature.

The conditionally normalized time series can also be used to compute conditional autocorrelation
and conditional cross-correlation functions at different lags. To compute conditional ACF at lag


Figure 12: Time plot of the conditional CCF estimated at different lags, i.e., E[x*_t y*_{t+k} | z_t] for k = 1, . . . , 24.
The black line represents the conditional CCF at lag d_t, i.e., E[x*_t y*_{t+d_t} | z_t]. Approximately 96%
of the time, E[x*_t y*_{t+d_t} | z_t] > E[x*_t y*_{t+k} | z_t].

k, we have proposed fitting an additive model to the cross product of the normalized time series
and its lagged series at lag k, using the same set of covariates used in the normalization. Similarly,
the conditional CCF at lag k can be estimated via an additive model fitted to the cross product of
two conditionally normalized time series k lags apart.
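As a toy illustration of estimating a conditional CCF at lag k, the Python snippet below forms the cross products of two normalized series k lags apart and smooths them on the covariate; a quadratic least-squares fit stands in for the paper's additive model, and the data-generating process (with true conditional correlation 0.9 z_t) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two (already normalized) series whose lag-k dependence varies with z_t:
# by construction, corr(x*_t, y*_{t+k} | z_t) = 0.9 z_t.
n, k = 5000, 3
z = rng.uniform(0.0, 1.0, n)
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
y = np.empty(n)
rho = 0.9 * z[:-k]
y[k:] = rho * x[:-k] + np.sqrt(1 - rho**2) * eps[k:]
y[:k] = eps[:k]

# Conditional CCF at lag k: smooth the cross products x*_t y*_{t+k} on z_t.
c = x[:-k] * y[k:]
Z = np.vander(z[:-k], 3)
beta = np.linalg.lstsq(Z, c, rcond=None)[0]
ccf_z = Z @ beta  # estimate of E[x*_t y*_{t+k} | z_t]

# The fitted conditional CCF rises with z, tracking the true 0.9 z.
lo = ccf_z[z[:-k] < 0.2].mean()
hi = ccf_z[z[:-k] > 0.8].mean()
print(round(lo, 2), round(hi, 2))
```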

We have further shown that the conditional cross-correlations can be used to estimate the water
travel time between two locations in a river. This lag time between two river locations varies in
response to the upstream river behavior. Thus we proposed to estimate this lag time conditional
on the upstream river behavior as observed by the water-quality variables measured at the
upstream location. We first computed the cross-correlation between the same water-quality
variable measured at both upstream and downstream locations at different lags, conditional
on a set of water-quality variables measured at the upstream location. Then the lag time is
computed as the lag that gives the maximum conditional cross-correlation. The significance of
the maximum conditional cross-correlation was evaluated in a probabilistic way by computing
standard errors of the predictions in the link space and then computing t statistics. We used
this approach to estimate the water travel time between two locations in Pringle Creek, one of
the NEON aquatic sites located in Texas, USA. The results show that the estimated time lag


captures the highest correlation between the two water-quality variables measured at the upstream
and downstream locations. Lag time estimation using the conditional behavior of the river
and the correlation between variables is useful in developing statistical methods for predicting
other water-quality variables of interest, such as sediment and nutrient concentrations in river
networks (Leigh et al. 2019). Such data-driven approaches are also useful to complement or
replace expensive and time-consuming field-based methods such as salt tracer experiments.
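The significance check via link-space t statistics can be sketched as follows. All numbers are hypothetical stand-ins for GAM predictions and their standard errors in the link space, and 1.96 is the usual approximate 5% two-sided threshold:

```python
import numpy as np

# Hypothetical GAM output in the link space: predicted values eta_hat with
# standard errors se for the maximum conditional cross-correlation at a
# few time points (all numbers are illustrative).
eta_hat = np.array([1.10, 0.45, 0.08])
se = np.array([0.20, 0.15, 0.12])

# t statistic for H0: the conditional cross-correlation is zero.
t_stat = eta_hat / se
significant = np.abs(t_stat) > 1.96  # approximate 5% two-sided threshold
print(t_stat.round(2), significant)  # [5.5  3.   0.67] [ True  True False]
```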

Further research could extend these approaches, for example, considering vector autoregressions
for multivariate time series problems. Similarly, models can be extended to account for spatial
dependence between sites.

Reproducibility
All code to reproduce the results in this paper is available at https://github.com/PuwasalaG/
Conditional_normalisation_in_TSA. All analysis was conducted using R (R Core Team 2021)
and Stan (Stan Development Team 2021). The methods discussed are available in the conduits
package for R (Gamakumara, Talagala & Hyndman 2023).

Acknowledgments
This project is funded by the Australian Research Council (ARC) Linkage project (grant number:
LP180101151) "Revolutionising high resolution water-quality monitoring in the information
age". The authors acknowledge the staff members from the Aquatic Instruments Science team and
the National Ecological Observatory Network (NEON), especially Guy Litt, Bobby Hensley
and Gary Henson, for their valuable explanations of the background of the Pringle Creek site
and the relationships between water-quality variables. Further, we convey our gratitude to
Erin Peterson and Claire Kermorvant for valuable discussions on the project and water-quality
characteristics.


A Other results from the stream temperature application

Figure 13: Autocorrelation plot of the standardized series y*_{s=2,t}.

B Data cleaning and preliminary analysis

B.1 Study area


Pringle Creek drains a catchment of 48.9 km² and experiences high flows in spring, when rainfall
is heaviest, and low flows during the typically dry summers. Rainfall can occur in winter,
typically December to January, but snow and ice do not. The average annual temperature is about
17.5°C and the average annual precipitation is about 898 mm.

Figure 14: Pringle Creek sensor locations. The creek is shown in the blue line and the two pink circles
denote the upstream and downstream sensor locations. Image courtesy National Ecological
Observatory Network.


B.2 Data cleaning


The NEON data we use in Application 2 (Section 4) undergo an automated quality-assurance
process as part of the rigorous NEON protocols, and each observation is given a quality flag.
The most commonly used quality flags include range flags, spike flags and
step flags (see Cawley (2021) for a detailed explanation of these quality flags). Range flags indicate
obvious technical anomalies, as they identify out-of-range observations of each water-quality
variable. We therefore treat points with range flags as anomalies. However,
other quality flags do not necessarily imply technical anomalies. For example, Figure 15 shows
a section of data for turbidity at the downstream sensor at Pringle Creek, colored by the quality
flags given by the automated process. In the event of freshwater inflows, turbidity tends to
increase and then gradually decrease; this is the natural behavior of turbidity in many river systems.
However, these types of points (i.e., sudden increases in turbidity) are flagged by the automated
process, even though they are unlikely to be technical anomalies. We therefore did not directly
use any quality flags other than range flags in this study, and treat the points shown in Figure 16
as technical anomalies.


Figure 15: Turbidity at downstream sensor colored by the quality flags. Note that only turbidity series
had technical anomalies. Other series did not show any technical anomalies during the study
period we chose.


Figure 16: Time plots of the variables used in the study for the period spanning from 01-Oct-2019 to
31-Dec-2019. The anomalous points we identify were colored in red.


Apart from these anomalous points, we also noticed that every fifth observation in turbidity
at both sites is anomalous, as a result of the wiper on the optical turbidity sensor operating
(i.e., wiping any biofouling off the sensor probe) every five minutes (Figure 17). We therefore
also discarded these anomalous points from the turbidity series at both sites.
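Removing the wiper artifacts amounts to dropping every fifth reading. A minimal Python sketch, with illustrative turbidity values:

```python
import numpy as np

# Toy 1-minute turbidity record in which every fifth reading is inflated
# by the sensor's wiper cycle (values are illustrative).
turbidity = np.array([1.2, 1.3, 1.1, 1.2, 9.8,
                      1.4, 1.2, 1.3, 1.1, 10.2,
                      1.3, 1.2, 1.4, 1.3, 9.5])

# Discard every fifth observation (indices 4, 9, 14, ...).
keep = np.arange(turbidity.size) % 5 != 4
cleaned = turbidity[keep]
print(cleaned.max())  # 1.4: the wiper spikes are gone
```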


Figure 17: Wiper anomalies in turbidity downstream sensor. This plot only shows data for 14-Oct-2019
to distinguish the scale difference between wiper anomalies and typical data.

B.3 Preliminary analysis


Prior to the analysis, we removed all the anomalous points (see Appendix B.2), then aggregated
the observations for all water-quality variables into 5-minute frequency data (Figure 18). The
resultant time series reflect the typical behavior of water quality in many rivers. Figure 19
is similar to Figure 18, except that turbidity and water level are plotted on the log scale in Figure 18.
The changed scales for these variables allow us to see the internal variation in the individual
series.

Turbidity tends to increase when freshwater flows into the river (when water level rises) as
this will increase the suspended particles in water. In contrast, conductance tends to decrease
with fresh water inflows as the water becomes diluted (Leigh et al. 2019) (see Figure 18). This
behavior also explains why the relationship between water level at the upstream location and


turbidity at the upstream site is much stronger than that between the upstream and downstream
locations (see Figure 20).

We then chose the set of covariates from the upstream sensor to model the cross-correlation
between upstream and downstream turbidity by visually analysing the relationships between
the variables. From Figure 20, we can see that the upstream water level, temperature and
conductance show non-linear relationships with both turbidity series. In contrast, dissolved
oxygen does not show much relationship with turbidity. We also see that water level and
conductance have a strong non-linear relationship; hence, choosing both water level and
conductance as covariates could lead to multicollinearity problems. Given these observations,
we chose water level and temperature as the covariates for computing conditional cross-correlations
between the upstream and downstream turbidity series.

Prior to the remaining analysis, we impute the missing values in upstream water level and temperature
(see Figure 21 for the percentage of missing values). The missing values in water level
are imputed using linear interpolation, whereas the missing values in temperature are imputed
using a Kalman smoother implemented in the imputeTS R package (Moritz & Bartz-Beielstein
2017), based on a state space representation of the ARIMA model chosen by auto.arima
from the forecast R package (Hyndman et al. 2022; Hyndman & Khandakar 2008).
Figure 22 plots the time series for water level and temperature with the imputations.
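The linear interpolation used for water level can be sketched with numpy's `interp`, filling each gap from the nearest observed neighbours; the values below are illustrative:

```python
import numpy as np

# Water level series with missing values (NaN), imputed by linear
# interpolation between the nearest observed neighbours.
level = np.array([251.60, 251.62, np.nan, np.nan, 251.70, 251.68])

idx = np.arange(level.size)
missing = np.isnan(level)
level[missing] = np.interp(idx[missing], idx[~missing], level[~missing])
print(level)
```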

B.4 Imputing missing values of turbidity upstream via conditional normalization
To estimate the lag time between the two sensors, we compute the conditional cross-correlations,
assuming that the lag time depends on the water quality at the upstream location. First we
normalize the two turbidity series conditional on the upstream water level and temperature.

Figure 21 shows that upstream turbidity has a considerable amount of missing values, which
might affect the lag time estimation. Therefore, we first impute these missing values following
the method explained in Section 2.2. Let upstream turbidity be denoted by x_t; its conditionally
normalized series is given by x*_t = (x_t − m̂_x(z_t)) / √(v̂_x(z_t)), where z_t contains the water level and
temperature measured at the upstream sensor. The conditional means, m̂_x(z_t), and variances,
v̂_x(z_t), are computed using Equations (2) and (3) respectively. We use the generalized additive
models implemented in the mgcv R package (Wood 2020; Wood 2017). Thin plate regression
splines (Wood 2003) are used as the smooth function for each covariate. We can set the dimension
k of the smoother by observing the relationship between the response and the predictor. Caution
should be taken when choosing k, because larger values can lead to overfitting due to the


Figure 18: Time plots of water-quality variables for the period 01-Oct-2019 to 31-Dec-2019. All variables
are aggregated into 5-minute frequencies, post anomaly removal (See Section B.2). Turbidity
and water level are shown on the log scale. Observations highlighted in orange show examples
of patterns in water quality when there are fresh water inflows (water level rises).


Figure 19: Time plots of water-quality variables for the period 01-Oct-2019 to 31-Dec-2019. All variables
are aggregated into 5-minute frequencies, post anomaly removal (see Appendix Section B.2).
All water-quality variables are plotted in their original scale. A couple of instances are
highlighted in orange to show the patterns in water-quality variables when there are freshwater
inflows.


Figure 20: Pairwise scatter plots between turbidity and other covariates from the upstream sensor.
The upper triangle shows the Pearson correlation coefficient for each pair.

underlying autocorrelation in the time series. See Wood (2017) for a discussion on choosing k in
GAMs.

From Figure 23 it can be seen that turbidity has a positive relationship with water level when
adjusted for temperature, and the relationship is stronger when the water level is higher. However,
temperature does not seem to have much effect on turbidity when adjusted for water level.

To compute the conditional variance, we model the squared residuals from the conditional
mean models assuming a Gamma family with a log link as in Equation (3). Figure 24 shows the
relationship between the response and each predictor in the fitted conditional variance model.


[Bar chart of observation counts for each variable (turbidity, water level and temperature at the upstream and downstream sensors), distinguishing missing from present values; overall, 2.1% of observations are missing and 97.9% are present.]

Figure 21: Visualization of the missing values in the variables.

Figure 22: Time plot of upstream level and temperature with imputed values. Points in orange denote
the imputed missing values.


Figure 23: Visualizing the fitted smooth functions in the conditional mean model for turbidity upstream,
with water level and temperature from the upstream sensor as predictors. Each panel visualizes
the relationship between the response and predictor while holding the other predictor at its
median (251.6 m and 9.926°C for water level and temperature, respectively). The smooth
function is shown in blue and the black points are the partial residuals. The degrees of
smoothing are shown in the y-axis label of each plot.

Figure 24: Visualizing the fitted smooth functions in the conditional variance model for turbidity
upstream, with water level and temperature from the upstream sensor as predictors. Each panel
visualizes the relationship between the response and predictor while holding the other predictor
at its median (251.6 m and 9.926°C for upstream water level and temperature, respectively).
The smooth function is shown in blue and the black points are the partial residuals. The
degrees of smoothing are shown in the y-axis label of each plot.


The normalized series for turbidity upstream is shown in Figure 25. To impute the missing
values of turbidity upstream, we fit an ARIMA model to x*_t using the auto.arima function
from the forecast R package (Hyndman et al. 2022; Hyndman & Khandakar 2008). Following
Equation (4), we then impute the missing values of x_t as x̂*_t √(v̂(z_t)) + m̂(z_t), where x̂*_t is computed
using the Kalman smoother implemented in the imputeTS R package (Moritz & Bartz-Beielstein
2017). It should be noted that this method can produce negative imputed values. Since turbidity
cannot be negative, we removed those points from the remaining analysis (see Figure 26).
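The back-transformation and the removal of negative imputations can be sketched as follows; the smoother output and the fitted conditional moments are hypothetical numbers:

```python
import numpy as np

# Imputed values on the normalized scale (e.g., from a Kalman smoother)
# and the fitted conditional moments at the missing time points.
x_star_hat = np.array([0.3, -2.5, 0.1])   # hypothetical smoother output
m_hat = np.array([1.0, 0.8, 1.2])         # conditional means at those points
v_hat = np.array([0.25, 0.16, 0.36])      # conditional variances

# Back-transform to the original turbidity scale: x_hat = x*_hat sqrt(v) + m.
x_hat = x_star_hat * np.sqrt(v_hat) + m_hat

# Turbidity cannot be negative, so negative imputations are discarded.
x_hat_valid = x_hat[x_hat >= 0]
print(x_hat, x_hat_valid)
```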

Figure 25: Conditionally normalized upstream and downstream turbidity.


Figure 26: Time plot of upstream turbidity and temperature with imputed values. Points in orange
denote the imputed missing values.

References
Bal, G, E Rivot, JL Baglinière, J White & E Prévost (2014). A hierarchical Bayesian model to
quantify uncertainty of stream water temperature forecasts. PLoS One 9(12), e115659.
Bühlmann, P (1997). Sieve Bootstrap for Time Series. Bernoulli 3(2), 123–148. http://www.jstor.org/stable/3318584.

Cawley, KM (2021). NEON Algorithm Theoretical Basis Document (ATBD): Water Quality. https://data.neonscience.org/api/v0/documents/NEON.DOC.004931vB.

Gamakumara, P, PD Talagala & RJ Hyndman (2023). conduits: CONDitional UI for Time Series
normalisation. R package version 1.0.0. https://github.com/PuwasalaG/conduits.
Green, JI & EJ Nelson (2002). Calculation of time of concentration for hydrologic design and
analysis using geographic information system vector objects. Journal of Hydroinformatics 4(2),
75–81.
Hastie, TJ & RJ Tibshirani (1990). Generalized additive models. Vol. 43. CRC press.
Hrachowitz, M, P Benettin, BM Van Breukelen, O Fovet, NJ Howden, L Ruiz, Y Van Der Velde
& AJ Wade (2016). Transit times—The link between hydrology and water quality at the
catchment scale. Wiley Interdisciplinary Reviews: Water 3(5), 629–657.
Hyndman, R, G Athanasopoulos, C Bergmeir, G Caceres, L Chhay, M O'Hara-Wild, F Petropoulos,
S Razbash, E Wang & F Yasmeen (2022). forecast: Forecasting functions for time series and
linear models. R package version 8.16. https://pkg.robjhyndman.com/forecast/.


Hyndman, RJ & G Athanasopoulos (2021). Forecasting: principles and practice. 3rd ed. Melbourne,
Australia: OTexts. OTexts.org/fpp3.
Hyndman, RJ & Y Khandakar (2008). Automatic time series forecasting: the forecast package for
R. Journal of Statistical Software 26(3), 1–22.
Isaak, DJ, SJ Wenger, EE Peterson, JM Ver Hoef, DE Nagel, CH Luce, SW Hostetler, JB Dunham,
BB Roper, SP Wollrab, et al. (2017). The NorWeST summer stream temperature model and
scenarios for the western US: A crowd-sourced database and new geospatial tools foster a
user community and predict broad climate warming of rivers and streams. Water Resources
Research 53(11), 9181–9205.
Leigh, C, S Kandanaarachchi, JM McGree, RJ Hyndman, O Alsibai, K Mengersen & EE Peterson
(2019). Predicting sediment and nutrient concentrations from high-frequency water-quality
data. PloS one 14(8), e0215503.
Li, MH & P Chibber (2008). Overland flow time of concentration on very flat terrains. Transporta-
tion Research Record 2060(1), 133–140.
Li, Y, Y Zhu, L Chen & Z Shen (2018). The time delay of flow and sediment in the Middle and
Lower Yangtze River and its response to the Three Gorges Dam. Journal of Hydrometeorology
19(3), 625–638.
Moritz, S & T Bartz-Beielstein (2017). imputeTS: Time Series Missing Value Imputation in R. The
R Journal 9(1), 207–218.
National Ecological Observatory Network (NEON) (2021a). Elevation of surface water
(DP1.20016.001). https://data.neonscience.org/data-products/DP1.20016.001/RELEASE-2021.

National Ecological Observatory Network (NEON) (2021b). Reaeration field and lab collection
(DP1.20190.001). https://data.neonscience.org/data-products/DP1.20190.001.

National Ecological Observatory Network (NEON) (2021c). Temperature (PRT) in surface water
(DP1.20053.001). https://data.neonscience.org/data-products/DP1.20053.001/RELEASE-2021.

National Ecological Observatory Network (NEON) (2021d). Water quality (DP1.20288.001).
https://data.neonscience.org/data-products/DP1.20288.001/RELEASE-2021.

Ogasawara, E, LC Martinez, D De Oliveira, G Zimbrão, GL Pappa & M Mattoso (2010). Adaptive
normalization: A novel data normalization approach for non-stationary time series. In: The
2010 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–8.
R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing. Vienna, Austria. https://www.R-project.org/.


Seyam, M & F Othman (2014). The influence of accurate lag time estimation on the performance
of stream flow data-driven based models. Water resources management 28(9), 2583–2597.
Stan Development Team (2021). Stan Modeling Language Users Guide and Reference Manual. Version
2.28. https://mc-stan.org.
Vafaeipour, M, O Rahbari, MA Rosen, F Fazelpour & P Ansarirad (2014). Application of sliding
window technique for prediction of wind velocity time series. International Journal of Energy
and Environmental Engineering 5(2), 1–7.
Van der Velde, Y, G De Rooij, J Rozemeijer, F Van Geer & H Broers (2010). Nitrate response
of a lowland catchment: On the relation between stream concentration and travel time
distribution dynamics. Water Resources Research 46(11).
Wanielista, M, R Kersten, R Eaglin, et al. (1997). Hydrology: Water quantity and quality control. John
Wiley and Sons.
Wood, S (2020). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R
package version 1.8-33. https://cran.r-project.org/package=mgcv.
Wood, SN (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 65(1), 95–114.
Wood, SN (2017). Generalized additive models: an introduction with R. CRC press.
Xie, R, M Zhang, P Venkatraman, X Zhang, G Zhang, R Carmer, SA Kantola, CP Pang, P Ma, M
Zhang, et al. (2019). Normalization of large-scale behavioural data collected from zebrafish.
Plos one 14(2), e0212234.

