by
Xuetao Lu
December 2020
ABSTRACT
Spatial regression is one of the central topics in spatial statistics. Spatial regression models fall into two categories, linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real-world applications, and new methods and models were proposed to overcome challenges encountered in practice. There are three parts.
In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges, zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out-of-sample prediction problem. The results and discussion showed that the nonlinear prediction had the advantages of high accuracy, low bias, and good performance at multiple resolutions.
In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. In the first stage, a spatial linear mixed model captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability and a prediction accuracy competitive with popular machine learning models such as random forest.
A new model combining the Gaussian process and BART (Bayesian additive regression trees) was proposed in the third part. Combining the advantages of both BART and the Gaussian process, the model could capture the nonlinear effects of both observed and latent covariates. To develop the model, the failure of the traditional BART-based Markov chain Monte Carlo (MCMC) in parameter estimation was first discussed. Then, based on the idea of analysis of variation, two techniques, back comparing and tuning range, were proposed to tackle this failure. Finally, the effectiveness of the new model was examined on simulated and real data.
ACKNOWLEDGMENTS
The present dissertation was undertaken under the joint supervision of Dr. Robert McCulloch and Dr. Steven Saul. I would like to express my deepest gratitude and appreciation to them. During my research, Robert always trusted me with his many great ideas and, at the same time, encouraged me to confidently further develop and explore them in my own ways, always while offering invaluable advice. Steven is my co-advisor as well as a good friend. Four years ago he introduced me to spatial statistics, which was a totally new area for me. Without his inspirational guidance, constant support, and patient encouragement this dissertation would not have been completed. Overall, they have inspired me both as an academic and as a person.
I am very grateful to my mentors, Dr. Julie Bessac and Dr. Umakant Mishra; I worked under their supervision during an internship at Argonne this summer, and they offered generous guidance and friendship. I would also like to thank Dr. Hahn, Dr. Shiwei Lan, and Dr. Shuang Zhou for their helpful support on my dissertation.
This work was supported by the National Oceanic and Atmospheric Administration (NOAA) National Marine Fisheries Service (NMFS) through the University of Miami's Cooperative Institute for Marine and Atmospheric Studies (CIMAS), and by the National Science Foundation.
I thank my family, especially my sister, Chao, for sending their love and support from thousands of miles away. A special thanks to my wife, Dr. Yuxia Shen, for her endless support, love, and patience.
TABLE OF CONTENTS
Page
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 A TWO-STAGE MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
APPENDIX
LIST OF TABLES
Table Page
LIST OF FIGURES
Figure Page
3.9 Discoveries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Fitting Comparison with Real Value and the Old BART . . . . . . . . . . . . . . 83
Chapter 1
INTRODUCTION
Spatial statistics concerns the analysis of geographic data. These data are prevalent in many scientific disciplines such as meteorology. With the use of spatial statistics becoming more popular across different disciplines, it is currently one of the most active research areas in statistics. Gelfand et al. (2010) viewed spatial statistics as being comprised of three major categories: continuous spatial variation, discrete spatial variation, and spatial point patterns. Continuous spatial variation, which focuses on the study of continuous spatial processes, includes topics such as spectral methods, hierarchical modeling, and spatial design. Spatial regression stays at the center of this category and connects with all the other topics.
Regression analysis estimates the relationships between a dependent variable and specified covariates. When the data have a spatial component, the regression model has to recognize and adapt to it; in this case, we call it a spatial regression model. A general form of the spatial regression model that we studied in the dissertation is as follows:

y(s) = f(X(s)) + w(s) + ε(s)    (1.1)

where s := {s_1, ..., s_n} is the set of spatial locations; y(s) is the observed dependent variable; X(s) are the observed covariates, whose effects are modeled by the function f(·); w(s) is a stochastic process; ε(s) are the i.i.d. errors.
According to the general form (1.1), spatial regression models can be divided into two categories.
In the first category, the data exhibit spatial dependence that is modeled by a stochastic process w(s), while the function f(·) that models the effect of observed covariates has a linear form. The classical Kriging models (Cressie, 1993), which focus on estimating the first-order (large-scale or global trend) and second-order (small-scale or local) structure of y(s), fall into this category. For example, at any new spatial location s_0, the stochastic term in Kriging models is

w(s_0) = \sum_{i=1}^{n} λ(s_i) y(s_i).
Spatial linear mixed regression models have a long history in spatial statistics (Cressie, 1993). With the advantages of a solid theoretical foundation, simple mathematical formulas, and good interpretability, they are widely applied in different disciplines, such as geography (Haining et al., 2010), ecology (Robertson, 1987), and meteorology.
In the second category, w(s) = 0, meaning that only the effect of observed covariates is considered in the model. The spatial dependence effect is modeled by the function f(·). Since the linear regression model is trivial here, we are interested in the nonlinear ones, e.g., the popular machine learning models: ensemble models (Random Forest, XGBoost), kernel-based models (Support Vector Machine), neural networks, etc.
With the rise of machine learning, their applications in spatial analysis have grown rapidly, especially in deriving spatial predictions for spatial regression (Appelhans et al., 2015; Li et al., 2011) and detecting spatial patterns (Williamson et al., 2020).
Similar to ordinary statistical regression, there are two major goals in spatial regression: prediction and interpretation. Figure 1.1 illustrates the relative positions of the two categories of models in the coordinate system of prediction and interpretation. In real applications, if our goal is to get a good prediction, then spatial nonlinear regression models are good choices, while, if understanding the relationships between Y and X is a top priority, we prefer the spatial linear mixed regression models, which are much easier to explain than the nonlinear ones. However, there is a trade-off between prediction and interpretation.
1.2 Gaussian Process
In the spatial regression model (1.1), the term w(s) usually is a Gaussian process. A Gaussian process can be used to infer a distribution over functions: first, the Gaussian process defines a prior over functions; then, this prior is converted into a Gaussian process posterior after obtaining some data. Formally, a Gaussian process is a collection of random variables, any finite subset of which follows a joint Gaussian distribution.
Figure 1.2 (a) shows 5 samples from a Gaussian process prior distribution, while (b) illustrates 5 samples from its posterior distribution after obtaining 8 new observations.
Figure 1.2: Sampling from Gaussian Process Prior and Posterior Distributions

The zero-mean Gaussian process can be denoted by GP(0, C(·, ·|θ)), where θ is the set of parameters of the covariance function C(·, ·|θ). In order to model the spatial dependence, we assume the covariance function follows some spatial correlation structure. For example, a popular choice is the Matérn covariance function

C_ν(||s_i − s_j||) = σ² (2^{1−ν}/Γ(ν)) (||s_i − s_j||/ρ)^ν K_ν(||s_i − s_j||/ρ)
where Γ(·) is the gamma function; K_ν(·) is the modified Bessel function of the second kind; ||s_i − s_j|| is the Euclidean distance between spatial points s_i and s_j; and the set of parameters θ = {σ², ρ, ν} controls the variance, decay rate, and smoothness of the spatial correlation, respectively. There are two popular special cases:
• ν = 1/2: C_ν(||s_i − s_j||) = σ² exp{−||s_i − s_j||/ρ}; this is the exponential covariance function.

• ν → ∞: C_ν(||s_i − s_j||) = σ² exp{−||s_i − s_j||²/(2ρ²)}; this is the Gaussian covariance function, also called the squared-exponential covariance function.
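To make the covariance concrete, the following is a minimal Python sketch (not from the dissertation) that evaluates the Matérn covariance under the parameterization above and draws samples from the zero-mean Gaussian process prior; the grid and parameter values are illustrative.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(d, sigma2=1.0, rho=1.0, nu=0.5):
    """Matern covariance as a function of distance, using the
    parameterization above; nu=0.5 reduces to the exponential case."""
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)          # C(0) = sigma^2
    nz = d > 0
    scaled = d[nz] / rho
    out[nz] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    return out

# Draw 5 samples from the GP prior on a 1-D grid of locations
s = np.linspace(0, 10, 200)
D = np.abs(s[:, None] - s[None, :])                   # pairwise distances
C = matern_cov(D) + 1e-8 * np.eye(len(s))             # jitter for stability
prior_samples = np.random.multivariate_normal(np.zeros(len(s)), C, size=5)
```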
With the advantages mentioned above, the Gaussian process is prevalent in spatial statistical modeling today. However, a computational issue arises when the data become large: inference over n spatial locations has to calculate the inverse and determinant of the covariance matrix, whose exact calculation requires O(n³) operations and O(n²) storage. In recent years, due to advances in technology, massive spatial data are collected in various disciplines, and there is a rich literature on this problem. Basically, the studies proceed on two tracks, low rank and sparsity. Low-rank approximation is a very active field in numerical linear algebra. Hackbusch (2015) developed the theory of hierarchical matrices, which can reduce the cost of matrix operations to O(k^α n log^β n), where k is the rank parameter controlling the accuracy of the approximation and α, β ∈ {1, 2, 3}.
There are also low-rank modeling methods in the spatial statistics community, e.g., Fixed Rank Kriging (Cressie and Johannesson, 2008), the predictive process model (Banerjee et al., 2008), and stochastic partial differential equations (Lindgren et al., 2011). On the other track, studies seek to introduce sparseness into the covariance or precision matrix. For example, Furrer et al. (2006) applied covariance tapering to create a sparse approximate linear system that can then be solved using sparse matrix algorithms, and Datta et al. (2016) extended Vecchia's approximation (Vecchia, 1988) to a Gaussian process, creating sparse precision matrices by using conditional independence given nearest neighbors.
1.3 Overview
Figure 1.3 shows the two categories of spatial regression models and how my studies relate to them. The rest of the dissertation is organized as follows.
• Chapter 2 introduces a real-world application that applies nonlinear regression models to predict the spatial distributions of reef fish abundance in the Gulf of Mexico.
• Chapter 3 aims to develop a spatial regression model that can break the trade-off between prediction and interpretation. A two-stage model combining universal regression Kriging and the generalized additive model is built to achieve this goal. The model's interpretability and prediction accuracy are tested on both real and simulated data.
• Chapter 4 proposes a new BART model and combines it with the Gaussian process. A simulation example and a real-data test are given to examine the effectiveness of the Gaussian process BART model.
Chapter 2
Mapping organism abundance is important for understanding and managing organism populations. However, the task becomes difficult for marine species due to the low detection rates encountered when sampling underwater. In this chapter, we embedded nonlinear spatial regression models into a multistage workflow to tackle this problem. In Section 2.1, we introduce the problem and the data. The challenges are described in Section 2.2; the methods to solve the zero-inflated data problem are discussed in that section as well. In Section 2.3, we apply nonlinear spatial regression models to predict the spatial distributions of reef fish abundance, and three strategies are proposed to handle the out-of-sample prediction problem. In Section 2.4, the nonlinear prediction results are compared to those of linear regression and catch-per-unit-effort models.
2.1 Introduction
The ability to map the abundance of organisms across space is a critical precursor for many applied research applications that support sustainable environmental management, such as mapping habitat use (Mateo-Sánchez et al., 2015), establishing protected areas (Lin et al., 2017), building population and ecosystem models (Stratford et al., 2016), using such models to inform policies (Guisan et al., 2013), etc. However, in most cases, the real data needed to develop such maps either do not exist or are zero-inflated or unevenly distributed across space (Prosser et al., 2018). This is because field sampling efforts can be expensive or often must collect information for multiple applications. For example, collecting data from the marine environment can be particularly difficult due to cost and logistics, the inability to receive satellite signals underwater, and visual limitations associated with water clarity and water column light attenuation. This results in fewer observations of species than desired.
Many methods have been developed over the years to maximize the use of the zero-inflated sampling data, such as variogram estimation and random field simulation (Saul and Purkis, 2015), generalized additive models (Drexler and Ainsworth, 2013), additive beta regression (Ros-Pena et al., 2018), etc. However, most of them assume that samples are evenly distributed in the study region or that their values follow specified distributions, e.g., a conditional normal distribution. In reality, most organisms are distributed in patchy patterns across the landscape or seascape (Ainsworth et al., 2016). The underlying reason is that organism abundance directly responds to environmental conditions, which themselves show patchy spatial patterns. Although, for some organisms, approaches working on commercial data, such as catch per unit effort data (McDonald et al., 2001), can extract useful maps of abundance, models with environmental covariates are more promising thanks to their high interpretability and low bias (Streich et al., 2017). Many methods to map organism spatial abundance use the relationship between abundance and environmental covariates, and these models can be divided into two categories, linear models and nonlinear models.
Linear models have been extensively studied in predicting organism abundance (Guisan et al., 2002). They have many advantages, such as ease of interpretation and low cost in both time and resources. Linear models perform well at large spatial scales, identifying the overall trend or gradient of abundance (Guisan et al., 2002). However, most relationships in the real world are intrinsically nonlinear rather than linear in nature. In recent years, with the rise of machine learning, nonlinear models have been widely applied in organism population prediction (Ye et al., 2019). They are good at capturing the subtle nonlinear relations between abundance and environmental covariates and the complex interactions among covariates, so nonlinear models also have the ability to identify local trends at fine spatial scales. Compared to linear models, nonlinear ones normally provide more accurate predictions at the cost of sacrificing interpretability.
2.1.1 Data
The study region was the area with depth within 1–200 meters in the northern Gulf of Mexico (Figure 2.1). Two types of datasets were used: video survey data and interpolated habitat covariate maps.

Figure 2.1: Study region and video survey data. The zoomed area shows the zero-inflated feature of the sampling data. Red dots represent positive samples; the value on each dot is the number of fish sampled.
Three independent fishery video surveys are carried out annually to collect information on the abundance of shallow water reef fish species throughout the Gulf of Mexico.
Mexico. The first one is sponsored by the National Marine Fisheries Service (NMFS)
Panama City Florida laboratory, the second is part of the Southeast Area Monitor-
ing and Assessment Program (SEAMAP), and the third is sponsored by the State of
Florida Fish and Wildlife Commission (FWC). Each survey is methodologically standardized with the others, which allows us to merge them into a single dataset with trivial effort. The video surveys target commercially important species, such as red grouper (Epinephelus morio), gag (Mycteroperca microlepis), mutton snapper (Lutjanus analis), etc. Surveys were carried out in two stages: primary sampling units (PSUs), spatial blocks located in the most probable habitat areas, and ultimate sampling sites (USS), point locations randomly chosen within the PSUs. The sampling gear consisted of four cameras mounted orthogonally to each other. Cameras were deployed at each location for 20 minutes and recorded every species encountered. The camera sampling protocol included the use of
bait at the center of the four-camera array to increase the positive detection rate.
The video footage was read by several technicians to identify and enumerate the species observed. In order to avoid double counting, the count value was set as the maximum number of the species recorded in a frame during the 20-minute sampling period (Somerton and Glendhill, 2005). The sampling design wasn't optimal due to budget constraints. For example, the short sampling time may be the prime reason for the zero-inflated sampling result (Figure 2.1). Another defect is that the sampling sites were not evenly distributed across the study region. In the prediction stage, this causes the out-of-sample prediction problem in the blank areas (Figure 2.1).
The dbSEABED project produced detailed mappings of the sea floor in various locations by interpolating from all available point datasets. Individual raw data points were screened for quality control before being used for interpolation. Isotropic, binned semivariograms were used to interpolate the point data to raster maps (Goff et al., 2008). In this study, the environmental covariate maps (percentage content) used for prediction include sand, gravel, mud, sediment grain size, carbonate, clay, and rock. One defect of the dbSEABED database is that the benthic samples collected over the years were more concentrated in nearshore areas than offshore ones. This gives the data lower variation, and thus higher accuracy, in nearshore areas than in offshore areas. Despite this, the dbSEABED dataset is the most spatially comprehensive habitat data publicly available.
The multistage workflow (Figure 2.2) starts by simulating the video survey process to generate simulated sampling outcomes under different settings. In the second stage, a method named empirical maximum likelihood analysis works with the simulated sampling data to find the relationship between the video survey data (catch ratio) and fish density, namely the empirical maximum likelihood density function, which is the key to addressing the zero-inflated issue. Then, taking the real video survey data and the empirical maximum likelihood density function as input, a two-step random smoothing method is employed in the third stage. In step one, spatial abundance is estimated in random smoothing windows; in step two, the estimates are combined to produce the block spatial abundance that will work as training data for the next stage's models. In the final stage, working on the training data and environmental covariates
(habitat data), nonlinear spatial regression (machine learning) models coming from three different families, support vector machine, neural networks, and random forest, are assembled to generate a high-accuracy and low-bias prediction of the abundance spatial distribution.
Figure 2.2: There are four stages, with different methods/models and data in each of them. Generated data means that the data were generated by the model or prior knowledge. In contrast, real data are collected from the real world.
In the rest of this section, we will go through Stage I to Stage III. The methods/models in these stages work together to tackle the zero-inflated problem of the video survey data.
To make the most use of the video survey data, an individual-based discrete event simulation was developed to model the video survey process (Pfeffermann, 2013). It was implemented using the PyGame library in the Python programming language (Kelly).
Red grouper excavates benthic material to create nests or pits in which they live, and from which they exhibit high site fidelity (Harter et al., 2017). Red snapper exhibits less site fidelity than red grouper but spends continuous periods of time at one site, on the order of months or years, before moving to another habitat location. Thus, at short time intervals, such as the length of the camera sampling protocol, both species exhibit site fidelity.
In order to model site fidelity, we assumed that each fish moves around near its home. In the simulation, wandering was implemented by a Markov chain Monte Carlo random walk whose movement followed an isotropic bi-normal spatial distribution around the home (Figure 2.3). It was meant to represent activities such as food foraging, similar to central
place foraging theory (Schoener, 1987). Fish had a 68% probability of moving within one standard deviation of the isotropic bi-normal spatial distribution, and a 95% probability of moving within two standard deviations. The home range was defined as the spatial distance of two standard deviations. For two neighboring homes, the average percentage of overlap is less than 50% (Farmer and Ault, 2011). The settings of parameters such as home range and the fish's movement frequency, speed, and turning angle followed Farmer's papers (Farmer and Ault, 2011; Farmer and Ault, 2014), which conducted a thorough investigation of the movement of reef fish species in the Gulf of Mexico.
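As a rough illustration of this wandering mechanism (not the dissertation's code), the sketch below runs a random-walk Metropolis sampler whose target is the isotropic bi-normal distribution around the home; all parameter values and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def wander(home, sigma, n_steps, step_sd):
    """Random-walk Metropolis sampler whose stationary distribution is an
    isotropic bi-normal centered at `home` with standard deviation `sigma`.
    Parameter values here are illustrative, not the calibrated settings
    from Farmer and Ault."""
    home = np.asarray(home, dtype=float)
    pos = home.copy()
    log_density = lambda p: -0.5 * np.sum((p - home) ** 2) / sigma ** 2
    path = [pos.copy()]
    for _ in range(n_steps):
        proposal = pos + rng.normal(scale=step_sd, size=2)
        if np.log(rng.uniform()) < log_density(proposal) - log_density(pos):
            pos = proposal
        path.append(pos.copy())
    return np.array(path)

# One fish wandering around its home for 400 steps (one step = 3 seconds)
trace = wander(home=(0.0, 0.0), sigma=25.0, n_steps=400, step_sd=5.0)
```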
In the video survey, the vulnerability of fish to being sampled by the camera gear was enhanced by placing bait at the location of the camera array to attract fish. We modeled this bait effect on video sampling. The chance that a fish would detect the bait and come to it is determined by two factors: the diffusion rate of the scent of the bait, and the probability that a fish within the range of bait detectability could detect the bait (Stoner, 2004). We assumed that the bait odor had the highest intensity at the
Figure 2.3: Video survey simulation. The green dots represent fish home locations; the green concentric circles around each dot represent the first and second standard deviations; the red dot represents the location of the camera; the red circle represents the distance of bait detectability, which expands throughout the 20 minutes; the winding trails represent the traces of the fishes, with one color per fish.
camera sampling gear and spread with intensity diminishing exponentially moving away from the camera. The attenuation of bait odor through the water column is an understudied complex process, as is the probability that a fish nearby will detect it. The few studies that have been done suggest a wide range of distances from which fish can detect bait, and the research suggests it is species dependent (Sainte-Marie and Hargrave, 1987). As a result, we made two assumptions: (a) that the radius of detectability from the camera array, meaning the maximum distance of odor spread, was 50 meters in 20 minutes, and (b) a fixed value for the shape parameter of the exponential decay of odor intensity.
(4) Interactions
In each simulation step, the program checks the location of each fish in relation to the location of the bait and the range of bait odor dissipation. If a fish entered the range of bait odor, it was assigned a probability of being sampled by the camera. Once a fish was sampled by the camera, it was removed from the simulation to avoid double counting.
The most important parameter in the simulation model was the number of fish homes. Since the simulation area was invariant and the number of fishes in each home followed a known uniform discrete distribution, the number of fish homes scaled the abundance of fish at each sampling station. We tested a range of numbers of fish homes in the simulation: the number was increased in steps of three, up to a sufficiently large number constrained by the rule of less than 50% habitat overlap and the size of the simulation area. For each number, the simulation was run 5000 times. One simulation step corresponds to 3 seconds in real time, and the camera works for 20 minutes. The data generated by the video survey simulation are used in the empirical maximum likelihood analysis.
Maximum likelihood analysis finds the maximum likelihood estimator (MLE) of an unknown parameter. This method typically includes three components: an analytic mathematical model, the target parameter, and the observed outcomes. Although our workflow contained a model (the video simulation model), target parameters (fish density), and observed outcomes (the real video survey data), the video survey model is a programmed simulation model rather than an analytic mathematical one. We therefore proposed an empirical maximum likelihood analysis to tackle this issue. The method includes four parts, beginning with re-sampling; the procedure is as follows.
(1) Initialize the number of fish homes as n = 3.
(2) Sample 100 outcomes without replacement from the results of the video survey simulation. Calculate the ratio between the number of detected fishes and the total number of fishes (the catch ratio).
(3) Calculate the empirical probability mass function (pmf) of the discretized catch ratio.
(4) Increase the home number by three (n = n + 3) and repeat steps (2) and (3).
Once the probability mass functions were obtained, we can calculate the empirical likelihood function for each catch ratio. Figure 2.4 gives an example of how to build an empirical likelihood function from probability mass functions. The empirical maximum likelihood estimator of the home number under each catch ratio is then the global maximum of the empirical likelihood function. Since the number of fishes in each home followed a uniform discrete distribution, the maximum likelihood estimate of the number of fishes follows from the estimated home number, and we obtain the empirical maximum likelihood density as the ratio of the maximum likelihood estimate of the number of fishes to the size of the area. This allowed us to build a function between the empirical maximum likelihood density and the catch ratio (Figure 2.5). This function handles the spatial sparsity and zero-inflated characteristics of the video survey data: even if the catch ratio is close to zero, a density estimate is still obtained.
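A minimal sketch of the estimator's lookup step, assuming the simulated pmfs have already been tabulated; the names (pmf_by_home, fishes_per_home) are illustrative, and the curve fitting that interpolates between home numbers spaced three apart is omitted.

```python
import numpy as np

def empirical_mle_density(catch_ratio, pmf_by_home, fishes_per_home, area):
    """pmf_by_home[n] maps a discretized catch ratio to its empirical
    probability for n fish homes. Return the empirical maximum likelihood
    density for one observed catch ratio."""
    home_numbers = sorted(pmf_by_home)
    likelihood = [pmf_by_home[n].get(catch_ratio, 0.0) for n in home_numbers]
    n_hat = home_numbers[int(np.argmax(likelihood))]   # global maximum
    # Fish count scales with the home number via the mean count per home
    return n_hat * fishes_per_home / area
```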
There is a linear relation between the empirical maximum likelihood density and the catch ratio. Changing the parameter values in the video simulation model, such as the parameter values of the camera, the bait, and the fish behavior, only affects the coefficient (slope) of this linear relation. Moreover, the coefficient cancels out when we transform the absolute value of abundance to the relative value of abundance, i.e., the spatial distribution.
Figure 2.4: For each home number, there is a corresponding probability mass function for the discrete catch ratios; (a), (b), and (c) are three examples of pmfs. Given a value of the catch ratio, the discrete empirical likelihood function can be obtained (d). Since the gap between home numbers was 3, we can fit a curve (d) to get the discrete empirical likelihood function with gap one. Finally, the empirical maximum likelihood estimator can be found from this discrete empirical likelihood function. In this example, when the catch ratio was 0.05, the empirical MLE of the home number was 19.
For the purposes of this study, we were only interested in the abundance spatial distribution, so we did not really care about the settings of the parameters in the video simulation model: they influence the linear coefficient rather than the final spatial distribution. When the interest is in estimating the real abundance, however, the parameter settings do matter.
Spatial abundance was estimated by random smoothing (Figure 2.6). First, the sampling area needs to be rasterized into grid cells, each approximately 0.25 square kilometers. Then, random smoothing was carried out by randomly drawing circular windows in the area. In each smoothing window, the catch ratio was calculated from the video survey data. Working with the empirical maximum likelihood density (EMLD) function (Figure 2.5), we can assign the empirical maximum likelihood density (abundance) to all the grid cells in the smoothing window.

Figure 2.5: Empirical Maximum Likelihood Density Function
The smoothing windows may overlap with each other; therefore, more than one density value may be assigned to the same grid cell. In order to combine all the different values of a grid cell, a weighted mean empirical maximum likelihood density was taken. Weights were determined by calculating a credibility statistic for each window:

c(x) = x/N    (2.1)

where c(x) is the credibility, x is the number of samples in the smoothing window, and N is the total sample size.
In order to penalize the windows with low credibility, we calculate the weight of each window:

w_i = c_i² / \sum_{j=1}^{n} c_j²,  i = 1, 2, ..., n    (2.2)
Figure 2.6: The procedure of random smoothing estimation starts with rasterizing, (a) to (b). Then, windows are randomly drawn in the sampling area up to a large enough number, (c) to (d). This number can be determined by checking the convergence of gemld.
where w_i is the weight of the i-th window, c_i is the credibility of the i-th window, and n is the number of smoothing windows.
Thus, the weighted mean empirical maximum likelihood density of a grid cell can be denoted as follows:

gemld_k = \sum_{i=1}^{n} w_{ik} · wemld_i,  k = 1, 2, ..., N    (2.3)

where gemld_k is the weighted mean empirical maximum likelihood density of the k-th grid cell; w_{ik} is the weight of the i-th random smoothing window covering the k-th grid cell; wemld_i is the empirical maximum likelihood density of the i-th random smoothing window; n is the number of random smoothing windows covering the k-th grid cell; and N is the number of grid cells.
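A compact sketch of Eqs. (2.2) and (2.3) for a single grid cell; the numeric inputs are made-up examples.

```python
import numpy as np

def combine_windows(wemld, credibility):
    """Weighted mean EMLD for one grid cell; wemld[i] and credibility[i]
    belong to the i-th smoothing window covering the cell."""
    c = np.asarray(credibility, dtype=float)
    w = c ** 2 / np.sum(c ** 2)                  # Eq. (2.2)
    return float(np.sum(w * np.asarray(wemld)))  # Eq. (2.3)

# e.g. a cell covered by three windows
gemld_k = combine_windows(wemld=[2.1, 1.8, 2.4], credibility=[0.9, 0.3, 0.6])
```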
2.2.4 Reducing Uncertainty
For each grid cell, we can calculate its mean credibility as follows:

gmc_k = (1/n) \sum_{i=1}^{n} gc_{ik},  k = 1, 2, ..., N    (2.4)

where gmc_k is the mean credibility of the k-th grid cell and gc_{ik} is the credibility of the i-th window covering the k-th grid cell. Grid cells were then filtered by comparing such per-cell statistics against thresholds.
G_FS = {gv_k : gv_k ≤ Scissors_F, k = 1, ..., N}    (2.9)

where G_BS is the set of grid cells after Bayesian scissors cutting, G_FS is the set of grid cells after Frequentist scissors cutting, and N is the number of grid cells. The final set of grid cells, G_L, is obtained as the intersection of G_BS and G_FS:

G_L = G_BS ∩ G_FS    (2.10)
G_L is the block spatial abundance with low uncertainty. Figure 2.7 shows the high-credibility, low-variance block spatial abundance data, which will be used as the training data in the next section.
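A sketch of the final filtering in Eqs. (2.9) and (2.10), assuming both scissors act as upper thresholds on a per-cell statistic whose values and thresholds are computed elsewhere; all names are illustrative.

```python
def low_uncertainty_cells(gv, scissors_b, scissors_f):
    """gv maps a grid-cell id to its filtering statistic; scissors_b and
    scissors_f are the Bayesian and Frequentist thresholds, assumed to be
    computed elsewhere."""
    g_bs = {k for k, v in gv.items() if v <= scissors_b}   # Bayesian cut
    g_fs = {k for k, v in gv.items() if v <= scissors_f}   # Eq. (2.9)
    return g_bs & g_fs                                     # G_L, Eq. (2.10)
```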
2.3 Non-linear Models and Out-of-Sample Prediction
As discussed in Section 2.1, our final goal is to get a good prediction of the abundance spatial distribution. Nonlinear spatial regression (machine learning) models are good choices for this task. With the ability to capture the subtle nonlinear relations between abundance and environmental covariates and the complex interactions among covariates, nonlinear models are able to produce high-accuracy and low-bias predictions which reflect the near-real patchy patterns at both large and fine scales. We chose nonlinear models from three families: multilayer perceptron, random forest, and support vector machine. Multilayer perceptrons are feed-forward neural networks that carry out supervised learning via a back-propagation training algorithm (Ramchoun et al., 2016). By bootstrapping the input data, random forest models use an ensemble of decision trees whose averaged outputs form the prediction. Support vector machines make predictions by partitioning the input dataset with optimal hyperplanes (Cristianini and Shawe-Taylor, 2000). The machine learning models took the block spatial abundance as their training dataset. Predictors were habitat environmental covariates that include location (latitude and longitude), depth, rugosity, sand, gravel, mud, sediment grain size, carbonate, clay, and rock. As discussed in Section 2.1, the video survey data are clustered rather than evenly distributed in the study region. This leaves some blank areas in which block spatial abundance (training data) is absent. Predicting in these blank areas encounters the out-of-sample prediction problem, which can introduce great uncertainty and produce erroneous, high-risk predictions (Wenger and Olden, 2012). Therefore, we proposed three strategies: prior knowledge, aggregation, and iteration.
The goal of involving prior knowledge is to extend the set of training data into areas where video survey data are missing. First, we need to identify the areas where prior knowledge can be engaged confidently. For example, it is well known that red grouper is predominately distributed with high abundance in the eastern portion and very low abundance in the western portion of the Gulf of Mexico. Second, we created several initial predictive maps by running different machine learning models (Figure 2.8 b, c, d). Then, the prior knowledge about the spatial distribution was represented by assigning weights to each initial prediction (Table 2.1). Finally, the new training data were generated by combining the initial predictions according to these weights.
Table 2.1: An example of the prior knowledge (weights) applied in Figure 2.8. Each numbered area is assigned prior weights for the LR, MLP, and SVM predictions.
The limitation of this strategy comes from the lack of prior knowledge. For example, we can't use it for red grouper in the gap areas of the eastern Gulf of Mexico, because red groupers live there and any inaccurate prior knowledge would cause bias. The same applies to red snapper, which lives throughout the entire Gulf of Mexico.
Figure 2.8: The top panel shows the video survey of red grouper throughout the Gulf of Mexico; red points indicate positive samples and green points indicate negative (zero) samples. The three panels at the bottom show initial predictions of spatial abundance in the western portion of the Gulf, generated from linear regression, multilayer perceptron, and support vector machine, respectively. Numbered polygons are the areas where prior knowledge (weights) was applied; the weights are shown in Table 2.1.
2.3.2 Aggregation
Aggregation is an ensemble strategy that combines predictions from different models to stabilize the final prediction. In this study, three popular machine learning algorithms were used: mul-
tilayer perceptron, random forest, and support vector machine. For each algorithm, a range of tuning parameters was tested over a range of possible values. With the multilayer perceptron algorithm, the number of hidden layers and the number of neurons in each hidden layer were tried. With the random forest algorithm, the maximum number of features allowed in each decision tree, the number of trees to build before averaging for prediction, the number of levels in each decision tree (maximum depth), and the minimum sample leaf size (size of the end node) were tested. With the support vector machine algorithm, the kernel parameters, including the type and shape of the hyperplane, were tested. Based on the mean test score obtained from cross-validation, we narrowed down the number of candidate models with the criterion that the mean test score fall in the range between 0.65 and 0.95. This criterion worked well for model combination, keeping the balance between underfitting and overfitting. Table 2.2 shows the selected models for red grouper.
With the 33 selected models in Table 2.2, we fit them on the training data and then combined their outputs into an aggregative prediction using a summary statistic of the individual predictions, such as the mean, weighted mean, or median. In this study, we chose the median.
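A minimal scikit-learn sketch of this selection-and-aggregation scheme; GridSearchCV's default regression score is R², matching the mean test scores reported in Table 2.2, but the parameter grids below are illustrative rather than the full set actually tested.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

candidates = [
    (MLPRegressor(max_iter=2000), {"hidden_layer_sizes": [(20,), (30,), (50,), (80,)]}),
    (RandomForestRegressor(), {"n_estimators": [100, 300], "max_depth": [5, 10]}),
    (SVR(), {"kernel": ["rbf", "poly"], "C": [1, 10]}),
]

def select_models(X, y, lo=0.65, hi=0.95):
    """Keep every parameterization whose mean CV test score lies in [lo, hi]."""
    selected = []
    for base, grid in candidates:
        search = GridSearchCV(base, grid, cv=5).fit(X, y)
        for params, score in zip(search.cv_results_["params"],
                                 search.cv_results_["mean_test_score"]):
            if lo <= score <= hi:
                selected.append(clone(base).set_params(**params).fit(X, y))
    return selected

def aggregate_predict(models, X_new):
    """Median of the individual predictions, the statistic chosen in the text."""
    return np.median([m.predict(X_new) for m in models], axis=0)
```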
Table 2.2: Parameter settings and mean test scores of the selected models for aggregative prediction of red grouper's abundance spatial distribution (excerpt).

Model                   Tuning parameters   Mean test score
Multilayer perceptron   (20,)               0.70046559
Multilayer perceptron   (30,)               0.74232753
Multilayer perceptron   (50,)               0.76654593
Multilayer perceptron   (80,)               0.79432473
2.3.3 Iteration
For each grid cell, the selected models generated predictions. Based on these predictions, the mean, standard deviation, and coefficient of variation (CV) can be calculated. By choosing a threshold for the CV (CV ≤ 0.5), we can filter all the grid cells to get the ones with low variance. The newly selected cells can then be added to the original training data (Figure 2.9). We can iterate this process to expand the training data; however, in order to avoid systematic bias, it is better to use fewer than three iterations. Figure 2.10 shows the distributions of the coefficient of variation (CV) before and after the additional training data were added. It indicates that a one-time iteration can greatly reduce the overall CV and thus stabilize the final prediction.
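A sketch of the CV-based filtering, assuming the selected models' predictions for the candidate grid cells are stacked row-wise; the array names are illustrative.

```python
import numpy as np

def expand_training(preds, cell_ids, cv_max=0.5):
    """preds has shape (n_models, n_cells): each model's prediction for every
    candidate grid cell. Keep cells whose coefficient of variation across
    models is at most cv_max; their mean prediction becomes the new label."""
    mean = preds.mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = preds.std(axis=0) / np.abs(mean)
    keep = np.isfinite(cv) & (cv <= cv_max)
    return cell_ids[keep], mean[keep]
```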
Figure 2.9: Map of the original training data (light blue) and the new training data (dark blue). The new training data were added by a one-time iteration with criterion CV ≤ 0.5.
Figure 2.10: The distributions of coefficient of variation (CV) before and after one-
time iteration.
2.4 Results and Discussion
Our workflow worked well not only at large scale but also at fine scale. To demonstrate this, Figure 2.11 provides a comparison with the linear regression model. The figure shows that the linear regression model can capture the overall trend across the entire Gulf; however, at fine scale, the linear regression prediction was too smooth to capture the patchy pattern. In comparison, the prediction of our workflow, hereinafter referred to as the non-linear prediction, was able to capture the overall trend as well as the patchy pattern at high resolution. In addition, at some locations, the non-linear prediction was more consistent with biological knowledge. For example, it is well known by biologists that red grouper occurs at higher abundance in areas with higher levels of rugosity and gravel over the sea bottom, because of their affinity for structure and their role as ecosystem engineers excavating pits in which to live (Harter et al., 2017; Coleman et al., 2011). Circle number one in Figure 2.11 (panels b-1 and b-2) shows that the non-linear prediction correctly gave higher abundance in areas of higher rugosity (panel c), while the linear regression prediction showed the opposite. Similarly, when considering the locations of gravel habitat, circles numbered two, three, and four in Figure 2.11 (panels b-1 and b-2) show that the non-linear prediction correctly demonstrates the positive relation between abundance and the level of gravel, while the linear regression prediction shows inconsistent results.
The nonlinear predicted abundance maps of red grouper and red snapper were further evaluated. For comparison, we chose the maps of spatial catch per unit effort (CPUE) obtained from fishery data (Figures 2.12(a) and 2.13(a)) (McDonald et al., 2001). Figures 2.12(c) and 2.13(c) show the predictions corrected by considering the impact of pollution and overfishing.
Figure 2.11: Comparison between the linear regression prediction and the non-linear prediction. Panels a-1 and a-2 show the two predictions at a large scale. Panels b-1 and b-2 show the two predictions at a small scale, 10 times finer than the large scale (the area in the blue rectangle in panels a-1 and a-2). Panels c and d show the spatial distributions of rugosity and gravel, respectively, corresponding to the area in panels b-1 and b-2.
Overall, the non-linear prediction is consistent with the CPUE map. However, the quality, quantity, and location of fishery-dependent data are influenced by fishing activity and management, which can bias CPUE maps. In the catch per unit effort (CPUE) map, bias can be introduced from the amount
Figure 2.12: Abundance spatial distribution of red grouper. The top panel represents the spatial catch per unit effort map obtained from logbook data. The middle panel represents the non-linear prediction. The bottom panel shows the prediction corrected by considering the impact of pollution and overfishing.
of catch. For example, Figure 2.14(b) shows relatively low abundance in the areas circled 3 and 4. In fact, there is a zone in the middle of these two areas (Figure 2.14(a)). The sea bottom of this zone has high-level rugosity and is covered by hard and soft corals; the environmental conditions are desirable for red grouper (Coleman et al., 2011). As part of the Florida Middle Grounds habitat area of particular concern project, this zone is protected from some fishing gear types,
Figure 2.13: Abundance spatial distribution of red snapper. The top panel represents the spatial catch per unit effort map obtained from logbook data. The middle panel represents the non-linear prediction. The bottom panel shows the prediction corrected by considering the impact of pollution and overfishing.
including bottom longlines, trawls, dredges, pots, and traps (Lembke et al., 2017). The CPUE prediction in this zone is highly biased, since it depends on the amount actually fished. Contrarily, the nonlinear models, which directly employ habitat information, are not distorted by fishing conditions. Figure 2.14(a) illustrates that the nonlinear prediction was able to capture the high-abundance patchy area in this zone.
Figure 2.14: Comparison of red grouper abundance maps between the non-linear prediction and CPUE. The non-linear prediction can catch the high-abundance patchy area (between circles numbered 3 and 4). This area is highly suitable for red grouper because it is covered by hard and soft corals and protected from some fishing gear types, including bottom longlines, trawls, dredges, pots, and traps. However, the CPUE map shows a highly biased prediction because it can be distorted by fishery policy.
The catch per unit effort (CPUE) map may also introduce bias from the amount of effort. For example, in Figure 2.15(b), the non-linear prediction of red snapper's abundance in the western Gulf of Mexico shows that the abundance in the region circled 1 was higher than in the regions circled 2 and 3. This is reasonable because the region circled 1 in Figure 2.15(a) has a higher level of mud on the sea bottom, and it is well known by biologists that red snapper occupies mud bottom during much of its life history. However, the CPUE map in Figure 2.15(c) gives the opposite answer. We can see there are only two seaports (in blue circles) in the western Gulf of Mexico; for both of them, the cost of fishing in the region circled 1 is higher than in the regions circled 2 and 3. The high cost, or effort, distorts the CPUE prediction in this area far from the real abundance. To this end, by depending on fishery-independent environmental data, nonlinear prediction maps can be less biased than CPUE maps.
Figure 2.15: Maps representing the spatial distribution of mud levels (panel a), the non-linear prediction (panel b), and the CPUE map (panel c). Blue triangles in blue circles on panel c indicate the locations of fishing ports.
Our workflow is flexible and broadly applicable. First, the distribution of many organisms is patchy across the landscape or seascape; a strength of our workflow and its nonlinear predictions (Figure 2.2) is the ability to capture patch dynamics from sparse data, which renders them applicable to many organisms. Second, video surveys are widely used to collect abundance data, both in marine and terrestrial ecosystems. As a result, our workflow is well suited to the many already existing video survey datasets, and other survey designs can be adapted to this workflow without changes to the remaining stages.
Third, as mentioned at the end of Section 2.2.2, if what matters is the spatial distribution rather than the absolute abundance, the only thing you need to know is the relationship (function) between the catch ratio and the population density. If the relationship is a linear function, the first two stages of the workflow can be omitted; even the exact form of the linear function is not required, and you can assign an arbitrary value to the coefficient of the linear function and continue with the subsequent stages of the workflow. In most cases, the relationship between catch ratio and population density is a linear function; otherwise, you have to find the exact form of this function, and if you can, you may still skip the first two stages of the workflow. Finally, the flexibility of the
workflow also comes from the loose coupling among its stages. For example, a moving-window smoothing can take the place of the random-window smoothing; as another example, you can add more nonlinear (machine learning) models or different parameterizations. Note that if you change the set of nonlinear (machine learning) models, the final prediction will change as well; however, when you apply the techniques in Section 2.3, such as controlling the level of predictive accuracy, the final prediction remains stable.
In summary, we developed a multistage workflow embedding nonlinear regression models to predict maps of abundance spatial distribution for reef fish species. This workflow can effectively handle zero-inflated sampling data without strong assumptions, and the nonlinear prediction has the advantages of high accuracy, low bias, and good performance at multiple resolutions.
Chapter 3
A TWO-STAGE MODEL
The purpose of the study in this chapter is to develop a spatial regression model for analyzing soil carbon stock (SOC) data. Different from the application in Chapter 2, the desired model should perform well in both prediction and interpretation. Unfortunately, as mentioned in Chapter 1, there is a trade-off between the two goals. For example, generally speaking, the linear regression model has good interpretability but poor prediction accuracy; in contrast, the nonlinear models are good at prediction but weak in interpretability. We propose a two-stage model that tries to break the trade-off between prediction and interpretation. Section 3.1 introduces the data we used in this study. In Section 3.2, the two-stage model is proposed, and its abilities in interpretation and prediction are discussed from a conceptual view. The results and discussion are presented in Section 3.3.
3.1 Data
The soil carbon stock (SOC) data come from the rapid carbon assessment study initiated by the Natural Resources Conservation Service's Soil Science Division of the U.S. Department of Agriculture (USDA) (Staff and Loecke, 2016). More than 6200 sites across the conterminous United States were established according to a multilevel stratified random sampling scheme. The SOC stock for a fixed soil depth (0–30 cm) was calculated using (3.1) (Adhikari et al., 2020). Figure 3.1 shows the map of the SOC data.

SOC_stk = SOC × BD × D × (1 − CF/100)    (3.1)

where SOC_stk is the SOC stock (Mg ha⁻¹), SOC is the SOC content (g 100 g⁻¹), BD is the soil bulk density (Mg m⁻³), D is the given soil layer thickness (cm), and CF is the coarse fragment content (%).
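As a quick worked example of Eq. (3.1) (with made-up input values):

```python
def soc_stock(soc, bd, d, cf):
    """Eq. (3.1): soc in g/100 g, bd in Mg/m^3, d in cm, cf in percent."""
    return soc * bd * d * (1 - cf / 100)

print(soc_stock(2.0, 1.3, 30, 5))  # 74.1 Mg/ha
```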
Figure 3.1: Soil carbon stock (SOC) data. The scale of the SOC data was transformed by the natural log function; this transformation normalized the SOC data (Figure 3.2) for the convenience of modeling.
A wide range of environmental covariates (31 variables) were collected and evaluated as SOC predictors. Table 3.1 lists their names, brief descriptions, and sources. Figure 3.2 shows a summary of the SOC data and eight environmental covariates. Different from the application in Chapter 2, this dataset is neat and tidy. As Figure 3.1 shows, the samples are well scattered across the whole area, which is good for prediction. Also, there are 31 covariates, which is relatively sufficient for the study of interpretation.
Table 3.1: Environmental Variables Description and Data Source
Figure 3.2: Summary of the SOC data and 8 environmental covariates. The numbers are correlation values between variables.
3.2 Methods
As mentioned in the beginning of this chapter, the challenge of this study is the trade-off between prediction and interpretation. We propose a novel two-stage statistical method that combines global, mostly linear effects (Stage-1) with nonlinear effects (Stage-2). In particular, Stage-1 relies on universal regression kriging and Stage-2 on a generalized additive model. The two-stage model is built on the basis of these two well-studied statistical models.
1) Universal regression kriging. The universal regression kriging relies on the expression of the quantity of interest Y as follows:

Y(s) = f(s) + X(s)β + λ(s) + ε(s)    (3.2)

where s is the spatial location; f(s) is a low-degree polynomial function that captures the global spatial trend; X(s)β is the linear regression part that captures the global linear relationship between the dependent variable Y(s) and the explanatory covariates X(s); λ(s) is a stochastic part that captures the spatial structure of the variable Y(s) and is generally assumed to be a stationary Gaussian process; and ε(s) are i.i.d. errors, normally distributed.
2) Generalized additive model. The generalized additive model (GAM) was introduced in 1986 (Hastie and Tibshirani (1986), Hastie and Tibshirani (1990)). GAM assumes the relationships between the individual predictors X and the dependent variable Y follow smooth functions that can be linear or nonlinear:

Y = α + \sum_{i=1}^{p} f_i(X_i) + ε    (3.3)

where the smooth functions f_i are represented through a basis decomposition such as splines, and ε are i.i.d. errors. Meanwhile, the smoothness of the predictor functions can be controlled in the fitting.
3) The two-stage model. The two-stage model is a workflow in which we apply the universal regression kriging in the first stage and the generalized additive model in the second stage. The first-stage model is

Y(s) = f_spl(s) + X(s)β + λ(s) + δ(s)    (3.4)

where f_spl(s) is a linear function of spatial coordinates capturing the global linear spatial trend; X(s)β is the linear regression of covariates representing the global linear effects of the covariates X(s) on Y(s); and λ(s) is a zero-mean stationary Gaussian process which explains the global stationary spatial dependence of the data. The residuals δ(s) are then modeled in the second stage:

δ(s) = f_sps(s) + \sum_{i=1}^{p} f_i(X_i(s)) + ε(s)    (3.5)
where f_sps(s) is a spatial smoother that handles the nonlinear and nonstationary spatial dependence; \sum_{i=1}^{p} f_i(X_i(s)) are the additive nonlinear univariate functions of the covariates; and ε(s) are i.i.d. errors.
A common measure of the overall (first-level) interpretability of the model is the R² coefficient, expressed as the ratio of the model-explained variation to the total variation:

R² = 1 − SSE/SST = Explained Variation / Total Variation
In the two-stage model, (3.4) and (3.5), all the components are additive. This is an elegant and powerful assumption that offers a natural way to decompose the interpretability of the model into its components. The key idea in the definition of R² is the variation explained by the model; similarly, we can generalize this idea to the second-level (component-level) interpretability (Figure 3.3). In the first stage, the Universal Regression Kriging (URK) models the global variations of the data. In particular, the URK decomposes the total variation into four parts: the variation explained by a global linear spatial trend, the variation explained by the linear regression of the covariates, the variation explained by the stationary Gaussian process, and the remaining unresolved variation. In the second stage, the residuals of the URK model become the input of the Generalized Additive Model (GAM). The remaining variation is decomposed into the parts explained by the GAM components and a pure error component which can't be explained by either URK or GAM.
The third-level interpretability concerns the relationships between the elements of each component cited above and the response variable Y. For example, the global linear relationship between the covariates X(s) and the response variable Y(s) can be explained by the coefficients β, and the global stationary zero-mean Gaussian process λ(s) can be characterized and explained by its estimated covariance parameters.
Figure 3.3: Two-stage Universal Regression Kriging Generalized Additive Model
GAM is an additive model in which the contribution of each covariate to the response variable is clearly separated, and the relationships between covariates and the dependent variable are not assumed to be linear. Since the marginal impact of a single covariate X_i does not depend on the values of the other covariates in GAM, we can simply interpret its relationship to the response variable by exhibiting the univariate function f_i(X_i). For example, in the synthetic example of Figure 3.4, we can read off how the expected value of the first-stage residuals δ(s) increases or decreases with each covariate. Another feature of GAM, which also plays an important role in model interpretation, is the ability to control the smoothness of the predictor functions. With GAMs, we can impose the prior belief that the predictive function is inherently smooth in nature, even when the data are noisy.
Figure 3.4: Generalized Additive Model Demo
As Figure 3.3 shows, the model is fitted in two stages, which leads to an analysis flow with two layers. In spatial data analysis, extracting the global linear trend (with covariates) and the global stationary spatial dependence corresponds to the first layer, i.e., the first stage of the proposed model. The second-layer analysis, which aims at revealing the nonlinear relationships between the response variable Y(s) and the covariates X(s), coincides with the second stage of the analysis flow. The second-layer analysis is subtle and builds on the basis of the first-layer analysis. The order of these two layers is meaningful: we may obtain misleading relationships between Y(s) and X(s) in the second layer if this order is not followed. For example, the existing machine learning models applied in a spatial context do not include an independent stochastic process like the λ(s) in the first stage.
The universal regression kriging model and the generalized additive model are both well studied, and we connect them by following the simple rule that the former's output will be the latter's input. Moreover, there are existing R packages that implement these two models respectively. In this paper, we use the R packages "fields" (Nychka et al., 2017) for URK and "mgcv" (Simon Wood, 2019) for GAM. The differences among packages mainly come from the target problems they want to solve and the algorithms used by the model. For example, fields and FRK both provide kriging-type models; the main issue that FRK focuses on is large spatial data, while fields is a versatile tool for spatial analysis with moderate-size data. The algorithms they use differ as well; fields, for instance, uses REML and GCV algorithms. The various choices of packages offer great flexibility for analyzing data, and we can select the most suitable packages for the problem at hand.
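For illustration only, here is a compact Python sketch of the two-stage workflow using scikit-learn, with an additive spline regression standing in for the mgcv GAM; this is an approximation under those substitutions, not the dissertation's R implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

def fit_two_stage(coords, X, y):
    """Stage 1 (URK-like): global linear trend in coordinates and covariates,
    plus a stationary Matern GP on the linear residuals, cf. Eq. (3.4).
    Stage 2: additive spline regression on the stage-1 residuals delta(s),
    a stand-in for the GAM of Eq. (3.5)."""
    Z = np.hstack([coords, X])
    # Stage 1a: f_spl(s) + X(s)beta, the global linear part
    linear = LinearRegression().fit(Z, y)
    r = y - linear.predict(Z)
    # Stage 1b: lambda(s), a zero-mean stationary spatial process
    gp = GaussianProcessRegressor(Matern(nu=0.5) + WhiteKernel()).fit(coords, r)
    delta = r - gp.predict(coords)
    # Stage 2: one spline basis per input column gives an additive fit
    gam_like = make_pipeline(SplineTransformer(n_knots=8), LinearRegression())
    gam_like.fit(Z, delta)
    return linear, gp, gam_like
```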
In predictive modelling, and especially with the increasing use of machine learning, there is a trade-off between prediction accuracy and interpretability. One of the major goals of this paper is to find an optimal framework to balance this trade-off. As Figure 3.3 shows, the two-stage model can accommodate linear and nonlinear effects; we compared it with popular machine learning (nonlinear) models on simulated data and the soil organic carbon data.
Figure 3.5 illustrates the framework of model comparison. Five models are evaluated and compared: an ordinary linear regression model, the proposed two-stage model, a random forest model, a gradient boosting model, and a support vector machine model. Since some models, like gradient boosting and support vector machine, cannot handle categorical data, an encoding process or feature engineering can be applied.

Figure 3.5: Framework of Model Comparison
Then, all five models were compared through a shared cross-validation scheme, as illustrated in Figure 3.5 (right). Statistics on the combined test data, like the predictive root mean square error (RMSE) or predictive R², were used to evaluate and compare the models. The rest of this section presents the results of the proposed two-stage model fitted on the SOC data and its covariates described in Section 3.2.
The principle of Occam's Razor states that among several plausible explanations for a phenomenon, the simplest is best. Simplicity plays an important role in a model's interpretability: we want to explain the data in the simplest way, so redundant predictors should be removed. Moreover, unnecessary predictors will add noise to the estimation of the other quantities that we are interested in. So, the first thing we need to do is variable selection. Since there are several categorical variables in the data, we choose the group lasso (Yuan and Lin, 2006) to select the important variables to be included in the model.
With the previously selected covariates, the two-stage model is fitted to the organic soil carbon data. In each stage, we summarize the model information and elucidate its interpretation.

Stage-1: Universal regression kriging

The estimated regression coefficients are exhibited in Table 3.2. Because all the covariates were scaled before model fitting, the coefficients β are comparable with each other and provide the relative importance of each covariate to the soil carbon stock. The coefficients α_0, α_long, and α_lat belong to f_spl(s), which is a linear surface trend function of the spatial coordinates (s_long, s_lat).
Each component of the URK model (3.4) can be visualized in Figure 3.7, which provides further interpretation. The fitted R² of the first-stage URK model is about 67.3%; in other words, approximately 32.7% of the variation of the data is left in the residuals δ(s) and will be dealt with by the GAM in the second stage.
Stage-2: Generalized additive model
The fitted information of the GAM can be found in Table 3.3. At the significance level 0.01, three covariates have non-zero effects on the response variable: REDL14, NDVI14, and SoilOrder. All the other covariates are not significant; in other words, they have no detectable effects on the response variable. Figure 3.8 shows the estimated GAM predictor functions (smoothers).
Figure 3.8: Estimated GAM Predictor Functions (Smoothers)
The two-stage model helps reveal the underlying mechanisms of the system. The first-stage URK model reveals the global linear relationships between the covariates and the response variable; then, the second-stage GAM corrects the first-stage understanding with more subtle nonlinear details. Finally, we obtain the overall understanding by adding up the results from the two stages. Equation (3.6) shows the estimated relationship f̂(X_i) between a covariate X_i and the response variable Y:

f̂(X_i) = β̂_i X_i + f̂_i(X_i)    (3.6)

where β̂_i is the linear coefficient estimated by URK and f̂_i(X_i) is the nonlinear smoothing function fitted by GAM. These functional relationships bring to light the underlying mechanisms.
In Figure 3.9, the top row of plots shows the nonlinear fitted functions for three covariates. One can check the part of each function where the 95% confidence interval is away from zero (the segment between the blue vertical lines), which indicates a significant contribution of the covariate to the soil carbon prediction. After representing the significant contributions in the spatial context (the bottom row of Figure 3.9), one can visualize the spatial distribution of the significant predictors. First, the significant covariate data tend to exhibit some spatial clustering patterns. Second, the spatial regions of the three clusters overlap and are located in the southwest of the United States. A reasonable hypothesis is that there is a latent variable which influences the three covariates but is not included in the data; this needs further investigation, and the clusters provide clues for it.
extension of the traditional regression framework and allows the regression coefficients
to vary across space. GWR is a very popular geostatistical tool to explore possible
spatial patterns of the covariates effects (regression coefficients) and acquire valuable
compare the interpretation of the components of each GWR and Two-Stage models
First, some covariate effects claimed to be globally linear in the Two-Stage model are spatially varying in the GWR model. For example, in Figure 3.9 (top row), the effect of covariate NDVI14 shows non-linearity. Comparing the two GWR coefficient maps (Figure 3.10) of NDVI14 and TMEANAA30, both show spatial variation: not only is the coefficient of NDVI14 spatially varying (non-linear), the GWR model tells us that the coefficient of TMEANAA30 is spatially varying as well. In other words, according to GWR the effect of covariate TMEANAA30 is not globally linear. However, Table 3.3 shows that the covariate TMEANAA30 has no significant effect on the response variable in the second stage GAM (p-value = 0.806594). It indicates that TMEANAA30 only has a global linear (constant coefficient) relationship with the response variable, captured in the first stage URK model. In sum, the GWR model and the Two-Stage model provide contradictory explanations for covariate TMEANAA30. The reason hides in the stochastic process term λ(s) in the first stage URK model. As Figure 3.10 (right) shows, the value of λ(s) varies spatially, compensating for errors in the global linear term. Since the coefficients of GWR are estimated locally, λ(s) will cause their estimated values to vary across space.
Second, the spatial clustering patterns of the covariate effects differ between the GWR and Two-Stage models. The maps of REDL14 in Figure 3.11 (left) show non-negligible differences between GWR and the Two-Stage model. The maps of NDVI14 (middle) present some similarities but also differences, for example, in the junction region between Arizona and Utah, the Northeast states and Florida. For SoilOrder (right), the two maps show similarity in the west of the United States but differences in other regions. The reason for the differences is similar to the analysis of Figure 3.10. In addition, the insignificant parts that we removed in the GAM also contribute to the differences.
Figure 3.11: Comparing Two-stage model and GWR

In summary, the power of interpretability of the Two-Stage model comes from its ability to decompose the analysis: the two analysis layers can be easily and clearly decoupled. The second layer (GAM) analysis relies on the first layer (URK) having extracted out all of its influences, and each layer can be further broken down into interpretable components. For instance, in the analysis of the spatial cluster patterns of REDL14 (Figure 3.11), the conclusions are based on the condition that all other influences have been extracted out, such as the influences from global linearity, global stationary spatial dependence, non-stationary spatial dependence and the other covariates. The GWR model, in contrast, mixes up all of these influences.
In this section, we compare the prediction results of different models. Different from interpretability, the goal of prediction is accuracy and the ability to reproduce the data characteristics. In order to compare the predictive capabilities of the models, we evaluate the predictive RMSE and predictive R² of each model.
To work with categorical data, we can use a variable encoding approach. However, this approach didn't work well on this real dataset: it caused the predictive R² of the SVM model (≈ 49%) to be even lower than that of the ordinary linear regression model (≈ 54%). To solve this issue, we adopted feature engineering on the categorical variables, for example, using the median value of the response variable within each category to replace the nominal category label, which can be accommodated by any model. After feature engineering (applied to all categorical covariates), we found that the predictive R² of the SVM increased from 49% to 58%. The encoding of SoilOrder is shown below.
SoilOrder   1       2       3       4       5       6       7       8       9       10      11
y.median    3.9276  3.5199  4.0103  4.0338  3.6944  5.9676  4.0752  4.3900  4.7165  3.0662  4.8314
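The following pandas sketch illustrates this median-based encoding; the tiny data frame and its values are hypothetical stand-ins, not the SOC data.

    import pandas as pd

    df = pd.DataFrame({
        "SoilOrder": [1, 1, 2, 2, 3, 3],
        "y":         [3.8, 4.0, 3.4, 3.6, 4.1, 3.9],
    })
    # Replace each nominal category label with the median of the response
    # within that category (computed on training data only, to avoid leakage).
    medians = df.groupby("SoilOrder")["y"].median()
    df["SoilOrder_enc"] = df["SoilOrder"].map(medians)
    print(df)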
Table 3.5 lists the predictive RMSE and predictive R² for the compared models. The Two-Stage model stays competitive with the popular machine learning models: Random Forest, XGBoost and SVM. However, compared to the URK model, the Two-Stage model shows only a modest improvement. The reason lies in the nature of the data: the purely random variation takes a large proportion (approximately 40%) of the total variation, which limits the achievable accuracy of all the models. To further demonstrate that the Two-Stage model can compete with the popular machine learning models and improve on the URK and linear regression models, we simulate a dataset and conduct the comparison described below.
Table 3.5: Prediction Comparison on Real Data

Model   Predictive RMSE   Predictive R²
LM      0.6978181         0.5392069
RF      0.6534186         0.5958378
The simulated data follow the model

Y(s) = Y_x(s) + λ(s) + p(s) + ε(s)

where the components are generated as follows and their visualizations can be found in Figure 3.12.

Y_x(s) represents the nonlinear relationship between the two covariates X_1(s) and X_2(s), with X_2(s) ~ unif(0, 2π).

λ(s) is a zero mean Gaussian process capturing the isotropic stationary spatial dependence, with Matérn covariance function

C(s_1, s_2) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) |s_1 − s_2| / ρ)^ν K_ν(√(2ν) |s_1 − s_2| / ρ)

where Γ(·) is the Gamma function, K_ν(·) is the modified Bessel function of the second kind, |s_1 − s_2| is the Euclidean distance between spatial points s_1 and s_2, and σ² = 1.

p(s) = 0.005 · s_x · s_y represents non-stationary dependence in the spatial coordinates s_x and s_y.

ε(s) ~ i.i.d. N(0, 10.24) is the pure error, also called the nugget effect in geostatistics.
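As a sketch of this recipe, the following Python code simulates one such dataset. Since the exact form of Y_x(s) and the Matérn parameters (ν, ρ) were not preserved above, the sine surface and the values ν = 1.5, ρ = 2 below are assumptions.

    import numpy as np
    from scipy.special import gamma, kv

    rng = np.random.default_rng(1)
    n = 400
    s = rng.uniform(0, 10, size=(n, 2))               # spatial locations

    def matern_cov(D, sigma2=1.0, nu=1.5, rho=2.0):
        """Matern covariance; returns sigma2 on the diagonal (D == 0)."""
        scaled = np.sqrt(2 * nu) * D / rho
        with np.errstate(invalid="ignore"):
            C = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
        C[D == 0] = sigma2                            # kv is singular at zero distance
        return C

    D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    C = matern_cov(D)
    L = np.linalg.cholesky(C + 1e-8 * np.eye(n))      # small jitter for stability
    lam = L @ rng.normal(size=n)                      # stationary GP lambda(s)

    x2 = rng.uniform(0, 2 * np.pi, size=n)
    y_x = np.sin(x2)                                  # assumed form of Y_x(s)
    p = 0.005 * s[:, 0] * s[:, 1]                     # nonstationary term p(s)
    eps = rng.normal(0.0, np.sqrt(10.24), size=n)     # nugget, variance 10.24
    y = y_x + lam + p + eps                           # assembled response Y(s)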
The results of the model comparison are shown in Table 3.6. The URK model shows a predictive R² ≈ 49.3%, which is close to the linear regression model (48.4%) but far away from the random forest (80.3%). The additional predictive R² contributed by the second stage GAM is 31.4%, which is a substantial improvement: the Two-Stage model has a much higher predictive R² and results similar to the random forest. The comparison on synthesized data again proves the predictive ability of the Two-Stage model.
Table 3.6: Prediction Comparison on Simulated Data

Model   Predictive RMSE   Predictive R²
LM      5.695840          0.4843744
RF      3.522183          0.8026377
In summary, the two-stage model has good interpretability, close to that of the linear models, and competitive predictive accuracy, close to that of the nonlinear (machine learning) models, like random forest, XGBoost and support vector machine. This combination makes the two-stage model stand out from the rest (Figure 3.13).
Chapter 4
The Bayesian Additive Regression Trees (BART) model is rarely used in spatial applications. One of the reasons is that the error term in the BART model is restricted to be independent and identically distributed. In this chapter, we get rid of this constraint and propose a Gaussian process BART model for spatial regression. BART and the motivation are introduced in Section 4.1. Then, in Section 4.2, we develop a new BART model that can accommodate correlated errors. In Section 4.3, the Gaussian process BART model is studied.
4.1 Introduction
BART models an unknown function as a sum of regression trees,

y = g(X; T_1, M_1) + ... + g(X; T_m, M_m) + ε,   ε ~ i.i.d. N(0, σ²)     (4.1)

where each tree T consists of a set of internal nodes with decision rules and a set of terminal nodes, and M = {µ_1, ..., µ_b}, where b is the number of terminal (bottom) nodes, collects the parameters attached to them.
BART is inspired by the idea of boosting, which sums the contributions of sequential weak learners (trees) to get a much more accurate prediction. Different from
Figure 4.1: (Left) An example of single binary tree, with internal nodes labelled by
their splitting rules, terminal nodes labelled with the corresponding parameters µi
and the observations associated with it. (Right) The corresponding partition of the
sample space and the step function.
other boosting methods, like gradient boosting trees, BART works in a Bayesian framework, using a prior and likelihood to generate a posterior distribution of the prediction. The posterior distribution provides much richer information than a point estimate. In addition, the prior regularizes the fit, reducing the sensitivity to hyperparameters, like the max tree size, which normally have to be tuned via cross-validation in other models.
An experimental study (Chipman et al., 2010) showed that BART outperforms other popular machine learning methods, including Neural Nets, Gradient Boosting Trees and Random Forest. Recall the spatial nonlinear regression model, which excludes the spatial random effect w(s):

y(s) = f(s; X(s)) + ε(s),   ε(s) ~ i.i.d. N(0, σ²)     (4.2)
The BART model (4.1), of course, is a good candidate in this category. But we want to be more ambitious. Since the term w(s) in model (1.1) captures the effects of latent covariates, incorporating it into the BART model can benefit us in both prediction and interpretation (see Section 4.3.1).
4.2 BART for Correlated Errors
In the BART model (4.1), the error term is assumed independent and identically distributed, ε ~ i.i.d. N(0, σ²). We can generalize this assumption and allow the errors to be correlated:

y = g(X; T_1, M_1) + ... + g(X; T_m, M_m) + ε,   ε ~ N(0, Σ)     (4.3)

We will build the new model (4.3) and illustrate how it works in this section. But, first of all, the problem can be simplified to a single tree model by taking advantage of the residuals R_j = y − Σ_{k≠j} g(X; T_k, M_k):

R_j = g(X; T_j, M_j) + ε,   ε ~ N(0, Σ)     (4.4)
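To make this reduction concrete, the following minimal Python sketch computes R_j for one tree; the array tree_fits is a hypothetical stand-in for the current fits g(X; T_k, M_k) of all m trees.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 100
    y = rng.normal(size=n)                          # observed responses
    tree_fits = rng.normal(scale=0.1, size=(m, n))  # stand-in per-tree fits

    j = 2
    # R_j = y - sum over k != j of g(X; T_k, M_k)
    R_j = y - tree_fits.sum(axis=0) + tree_fits[j]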
Hereafter, we remove the subscripts and discuss the single tree model (4.4).

4.2.1 Dummy Representation

To understand model (4.4), the first and most important step is the dummy representation. Simply speaking, the dummy representation provides a matrix form for the single tree,

g(X; T, M) = Dµ     (4.5)
where

µ = [µ_1, µ_2, ..., µ_b]^T

and

D = [ d_11  d_12  ...  d_1b
      d_21  d_22  ...  d_2b
      ...
      d_n1  d_n2  ...  d_nb ]
D is called the dummy matrix, which is an n × b matrix; n is the number of observations and b is the number of bottom nodes. Each row of D has exactly one entry equal to 1:

[d_i1, ..., d_i,j−1, d_ij, d_i,j+1, ..., d_ib] = [0, ..., 0, 1, 0, ..., 0]     (4.6)

is the i-th row of D with its j-th entry equal to 1. The matrix D can be viewed as a map that sends the observations to the bottom nodes of the tree: the row (4.6) maps the i-th observation to the j-th bottom node. An example is as follows: the dummy representation of a tree with three bottom nodes and five observations is
R = g(X; T, M) = Dµ = [ 0 0 1
                        1 0 0
                        0 1 0   × [ µ_1, µ_2, µ_3 ]^T
                        0 1 0
                        0 0 1 ]
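The dummy representation is trivial to build in code. The following Python sketch constructs D from each observation's bottom-node index and evaluates g(X; T, M) = Dµ for the five-observation example above; the µ values are arbitrary illustrations.

    import numpy as np

    leaf = np.array([2, 0, 1, 1, 2])       # bottom-node index of each observation
    b = 3                                   # number of bottom nodes
    mu = np.array([0.5, -1.0, 2.0])         # bottom-node parameters (illustrative)

    D = np.zeros((leaf.size, b))
    D[np.arange(leaf.size), leaf] = 1.0     # exactly one 1 per row, as in (4.6)
    print(D)                                # matches the 5 x 3 example above
    print(D @ mu)                           # fitted value for each observation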
Based on the dummy representation, the tree model (4.4) can be re-expressed in matrix form,

R = Dµ + ε,   ε ~ N(0, Σ)     (4.7)

The matrix form makes mathematical derivation possible. Moreover, if X and T are fixed, the dummy matrix D is uniquely determined regardless of the values in M. This property allows us to integrate out µ and obtain the marginal likelihood p(R|X, T), which is the pivot of the MCMC transitions. The details will be discussed in the next section.
4.2.2 Metropolis-Hastings Search
In BART, each tree is updated at every MCMC iteration. Recalling (4.4), to update a tree we need to update its components T and M. Naturally, the structure T is updated by a Metropolis-Hastings search that generates a sequence of trees

T^0, T^1, T^2, ...

The sequence starts with an initial tree T^0 and iteratively simulates the transitions: a proposal T^* is drawn from the kernel q(T^i, ·) and accepted with probability

α(T^i, T^*) = min{ 1, [q(T^*, T^i) / q(T^i, T^*)] [p(T^*) / p(T^i)] [p(R|X, T^*) / p(R|X, T^i)] }     (4.8)

In (4.8), the transition kernel q(·, ·) and the prior p(T) are the same in both the traditional BART and the new BART; their contribution to the acceptance ratio is

[q(T^*, T^i) / q(T^i, T^*)] × [p(T^*) / p(T^i)]     (4.9)
On the other hand, the correlated errors enter through (4.10), the marginal likelihood ratio. This ratio is where the traditional BART and the new BART differ:

p(R|X, T^{i+1}) / p(R|X, T^i)     (4.10)

From the discussion of the dummy representation, we know that given X and T a dummy matrix D is uniquely determined. So, the marginal likelihood p(R|X, T) is equal to p(R|D), and the ratio becomes

p(R|D^{i+1}) / p(R|D^i)     (4.11)
Now the question reduces to calculating p(R|D). By (4.7), the likelihood of R given D and µ is

p(R|D, µ) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(R − Dµ)^T Σ^{−1} (R − Dµ)}     (4.12)

Then the marginal likelihood is obtained by integrating out µ,

p(R|D) = ∫ p(R|D, µ) p(µ) dµ     (4.13)

where the prior is µ ~ N(µ̄, Q^{−1}); µ̄ and Q are the mean and precision matrix of the Gaussian prior distribution, respectively. (4.14) shows the result of the integration (4.13); the proof can be found in Appendix A.1.
p(R|D) = [ (2π)^{−n/2} |Σ|^{−1/2} |Q|^{1/2} / |Q + D^T Σ^{−1} D|^{1/2} ]
         × exp{−(1/2)(−v^T (Q + D^T Σ^{−1} D) v + µ̄^T Q µ̄ + R^T Σ^{−1} R)}     (4.14)

where

v = (Q + D^T Σ^{−1} D)^{−1} (D^T Σ^{−1} R + Q µ̄)     (4.15)
Finally, we plug (4.14) and (4.15) into the marginal likelihood ratio (4.11) and get (4.16). The computational complexity of (4.16) will be studied in Section 4.2.4, and the details of its calculation can be found in Appendix A.3. With prior mean µ̄ = 0, the ratio is

p(R|D^{i+1}) / p(R|D^i) = ( |Q^{i+1}| |Q^i + (D^i)^T Σ^{−1} D^i| / |Q^i| |Q^{i+1} + (D^{i+1})^T Σ^{−1} D^{i+1}| )^{1/2}
  × exp{ (1/2) R^T Σ^{−1} [D^{i+1}(Q^{i+1} + (D^{i+1})^T Σ^{−1} D^{i+1})^{−1}(D^{i+1})^T − D^i(Q^i + (D^i)^T Σ^{−1} D^i)^{−1}(D^i)^T] Σ^{−1} R }     (4.16)
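To make (4.14)-(4.16) concrete, the following Python sketch evaluates the log of (4.14); the ratio (4.16) is then the difference of two such values for D^i and D^{i+1}. The inputs (Σ, Q, µ̄ and the leaf assignments) are small synthetic stand-ins, and the dense inverses are for clarity only, not efficiency.

    import numpy as np

    def log_marginal(R, D, Sigma, Q, mu_bar):
        """Log of p(R|D) in (4.14), via v from (4.15)."""
        Si = np.linalg.inv(Sigma)
        A = Q + D.T @ Si @ D                                # the matrix in (4.25)
        v = np.linalg.solve(A, D.T @ Si @ R + Q @ mu_bar)   # v from (4.15)
        n = R.size
        quad = -v @ A @ v + mu_bar @ Q @ mu_bar + R @ Si @ R
        _, ldA = np.linalg.slogdet(A)
        _, ldS = np.linalg.slogdet(Sigma)
        _, ldQ = np.linalg.slogdet(Q)
        return -0.5 * (n * np.log(2 * np.pi) + ldS - ldQ + ldA + quad)

    rng = np.random.default_rng(2)
    n, b = 50, 3
    D = np.zeros((n, b))
    D[np.arange(n), rng.integers(0, b, n)] = 1.0            # random leaf assignment
    Sigma = 0.1 * np.eye(n)
    R = D @ np.array([1.0, -1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), Sigma)
    print(log_marginal(R, D, Sigma, Q=4.0 * np.eye(b), mu_bar=np.zeros(b)))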
4.2.3 Posterior Distribution of µ
In Section 4.2.2, the tree structure T was updated. Given the new T, we can update M, the set of µ values at the bottom nodes. Since X and the new T are known, the dummy matrix D is determined, and the posterior of µ is proportional to the product of its likelihood and prior probability density functions,

p(µ|R) ∝ p(R|D, µ) π(µ)     (4.18)

with the likelihood given by (4.12) and the Gaussian prior π(µ) ~ N(µ̄, Q^{−1}). The posterior distribution is then

p(µ|R) ~ N(v, (Q + D^T Σ^{−1} D)^{−1})     (4.20)

with v as in (4.15); the proof can be found in Appendix A.1.
4.2.4 Computational Complexity
The new BART works with the covariance matrix Σ, whose dimension is n × n, where n is the number of observations. When the data are big, the huge covariance matrix can be a burden: we have to calculate |Σ| and Σ^{−1}, whose exact calculation requires O(n³) operations; this becomes an impossible mission for a personal computer when n is greater than, for example, one million. In this section, we investigate the computational complexity of the new BART model. Since a preprocessing step called reordering can greatly simplify the discussion, before digging into the computational details, it's worth introducing it first.
Suppose a tree has b bottom nodes. The dummy matrix D maps the n observations into b disjoint sets. Reordering means that we reorder all the observations so that they are ordered successively in each partition. Since any reordering is a map and can be achieved by a permutation matrix P, multiplication by P (or the transpose of P) realizes the reordering. Then the reordered dummy matrix, D_P, can be denoted as follows:

P^T D = D_P,   D = P D_P     (4.21)
where

D_P = [ d'_11  d'_12  ...  d'_1b
        d'_21  d'_22  ...  d'_2b
        ...
        d'_n1  d'_n2  ...  d'_nb ]

and

d'_ij = 1 if i ∈ n_j, 0 otherwise,   i = 1, ..., n;  j = 1, ..., b,

where n_j is the index set of observations associated with the j-th bottom node.
Recalling the example in Section 4.2.1, it's easy to find the reordered matrix D_P and the permutation matrix P:

D = [ 0 0 1        P = [ 0 0 0 0 1        D_P = [ 1 0 0
      1 0 0              1 0 0 0 0                0 1 0
      0 1 0              0 1 0 0 0                0 1 0
      0 1 0              0 0 1 0 0                0 0 1
      0 0 1 ]            0 0 0 1 0 ]              0 0 1 ]

so that D = P D_P.
Similar to (4.21), R and Σ also have their reordered counterparts:

R = P R_P,   Σ = P Σ_P P^T     (4.22)
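The following Python sketch verifies (4.21) on the earlier example: a stable argsort of the leaf assignments yields one valid reordering (within-node order is arbitrary), from which P and D_P follow.

    import numpy as np

    leaf = np.array([2, 0, 1, 1, 2])           # example from Section 4.2.1
    order = np.argsort(leaf, kind="stable")    # observations grouped node by node
    n = leaf.size
    P = np.zeros((n, n))
    P[order, np.arange(n)] = 1.0               # column k picks the k-th reordered obs

    D = np.zeros((n, 3))
    D[np.arange(n), leaf] = 1.0
    DP = P.T @ D                               # reordered dummy matrix D_P
    assert np.allclose(D, P @ DP)              # D = P D_P, as in (4.21)
    print(DP)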
In Appendix A.2, we prove that reordering does not change the values of the marginal likelihood ratio (4.16) or the posterior distribution (4.20). So, the discussion of computational complexity can be carried out on the reordered counterparts of (4.16) and (4.20), denoted (4.23) and (4.24). For the new BART, we assume the precision matrices Σ^{−1} and Σ_P^{−1} are known. The possible computational burden comes from the underlined term in (4.23) and (4.24),

A = Q + D_P^T Σ_P^{−1} D_P     (4.25)
In both (4.23) and (4.24) we need to calculate |A| and A^{−1}, which requires O(b³) operations, where b is the number of bottom nodes. Fortunately, in the new BART the size of a tree, i.e., its number of bottom nodes, is small (usually less than 20). So, if the number of nonzero entries in Σ_P^{−1} is O(n), the MCMC update of a single tree needs O(n) operations. The details of calculating (4.23) and (4.24) can be found in Appendix A.3. However, we still have to compute Σ^{−1} itself for back comparing and the buildup of the tuning range in Section 4.4.4; a sparse representation of Σ^{−1} with O(n) non-zero entries keeps this feasible.
4.2.5 Example
In order to compare the new and old BART, we construct an example to demonstrate their similarities and differences. The simulated data were created as

y_i = f(x_i) + η_i,   i = 1, ..., n

where f(x_i) = x_i³, x_i ∈ (−1, 1), and n = 200. We assumed the error term η_i followed a normal distribution η ~ N(0, Σ). There are two scenarios for the structure of the error term.
(1) Independent errors. In this scenario, Σ = σ² I, and the new BART should be identical to the old BART:

η_i = ε_i,   ε_i ~ i.i.d. N(0, σ²),   i = 1, ..., n

(2) Correlated errors. In this scenario, the errors are correlated, η = Aε, where

η = [η_1, ..., η_n]^T,   ε = [ε_1, ..., ε_n]^T,

A = [ 1  0  0  ...  0  0
      ρ  1  0  ...  0  0
      0  ρ  1  ...  0  0
      ...
      0  0  0  ...  ρ  1 ]  (n × n)     (4.26)
The inverse of Σ can be calculated from Σ = Var(η) = σ² A A^T, which gives Σ^{−1} = σ^{−2} (A^{−1})^T A^{−1}.
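For this example the matrices are easy to form explicitly, as in the following Python sketch; the assertion checks the identity for Σ^{−1} just stated.

    import numpy as np

    n, sigma, rho = 200, 0.1, 0.8
    A = np.eye(n) + rho * np.eye(n, k=-1)       # 1 on the diagonal, rho just below
    Sigma = sigma ** 2 * A @ A.T
    Ainv = np.linalg.inv(A)
    Sigma_inv = Ainv.T @ Ainv / sigma ** 2
    assert np.allclose(Sigma @ Sigma_inv, np.eye(n))

    rng = np.random.default_rng(3)
    eta = A @ rng.normal(0, sigma, size=n)      # one draw of correlated errors
    x = rng.uniform(-1, 1, size=n)
    y = x ** 3 + eta                            # the example's f(x) = x^3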
Letting σ = 0.1 and ρ = 0.8, we examined the new and old BART under the above two settings of the error term. Figure 4.2 shows the results. When the errors are i.i.d., the two BART models are consistent. But when the errors are correlated, the two models are different from each other. In Table 4.1, we measured the differences between the two models. Compared to the old BART, the new BART fitted the training data worse but outperformed the old BART in restoring the function f(x). It means that if the correlated structure of the errors is known, the new BART can direct its fit toward the real signal f(x) rather than the noise, using the information in the covariance matrix Σ.
Figure 4.2: The left panel shows that when the errors were i.i.d. the new BART degenerated to the old BART. The right panel shows that the new BART differed from the old BART when the errors were correlated.
4.3 Gaussian Process BART Model
In the new BART model (4.3), the correlation structure of the error term is arbitrary. In practice, however, one usually assumes the error term satisfies some special correlation structure, and there are different ways to model it. In spatial statistics, as discussed in Section 1.2, one of the most popular ways is the Gaussian process. So, by combining a Gaussian process and the new BART we propose a new nonlinear spatial regression model,

y(s) = f_BART(X(s)) + w(s) + ε(s)     (4.27)

where

• f_BART(X(s)) is the sum-of-trees function of the new BART;

• w(s) is a Gaussian process capturing the spatially dependent effects of latent covariates;

• ε(s) ~ i.i.d. N(0, τ²) denotes the independent and identically distributed noise.

The existing nonlinear spatial regression models don't include the term w(s), which causes two problems. One is that the models can't take into account the latent covariates. The other is that the models overfit the effects of the observed covariates. On the other hand, the spatial linear mixed models can capture the latent covariates' effects with w(s), but they can't model the nonlinear effects of the observed covariates. The Gaussian
process BART model (4.27) is the first spatial regression model that is able to handle both issues simultaneously.

4.3.1 Analysis of Variation

The idea of analysis of variation provides a deep-level understanding of the Gaussian process BART model (4.27). Figure 4.3 illustrates the idea. The discussion can be divided into three parts.
First, the data generating process. From a physics point of view, the observed data are generated by an underlying physical process plus pure error; we call the underlying physical process the data generating process. Based on this idea, the total variation in the observations can be divided into the variation explained by the data generating process and the variation of the pure error. In Figure 4.3, we denote the variations by sums of squares. The data generating process f_process(X(s), Z(s)) can include observed covariates X(s) and latent covariates Z(s). So, the total variation in the data can be divided into three parts: SSf_process^X, SSf_process^Z and SSE_process.
Second, the ideal case. Figure 4.3 shows the ideal case in which the Gaussian process BART model (4.27) perfectly explains the three parts of the variation in the data: f_BART^new(X(s)) catches SSf_process^X, w(s) catches SSf_process^Z, and SSE_process goes into ε^new(s). It also indicates that nonlinear models without w(s), like the old BART model, will overfit the observed covariate process f_process(X(s)), because f_BART^old(X(s)) will fit some variation belonging to the latent covariate process f_process(Z(s)) (part of SSf_process^Z).

Third, the normal case. In practice, the ideal case rarely happens. The reasons may include: the new BART is still overfitting; the effects of the latent covariates are not fully captured; the setting of the Gaussian process w(s) is not suitable for the real data; etc. However, compared to the old BART, if w(s) with its explained variation SSw can shrink the variations SSf_BART^new and SSE^new, the existence of w(s) is preferable. The shrinkage of SSf_BART^new reduces the overfitting of the new BART and restores the real underlying process of the observed covariates X(s) more closely. The example in Section 4.2.5 already demonstrated this behavior.
4.3.2 Likelihood Based MCMC

At first glance, Gibbs sampling is a good choice to estimate f_BART and the parameters θ of the Gaussian process w(s) and τ². (4.28) and (4.29) are the two steps in the MCMC updating:

θ | f_BART     (4.28)

f_BART | θ     (4.29)

First, given f_BART, model (4.27) can be converted to a Bayesian hierarchical model (4.30). Letting Σ = C + τ² I, we get the posterior distribution p(θ|y) (4.31), which can be sampled with standard MCMC methods:

p(θ|y) ∝ p(θ) × (1/√|Σ|) exp{−(1/2)(y − f_BART)^T Σ^{−1} (y − f_BART)}     (4.31)

Second, if the parameters θ are known, the precision matrix Σ^{−1} can be calculated as well. The problem of updating (4.29) given Σ^{−1} was already solved in Section 4.2.
Everything looks good so far. However, the devil is in the details. Let's look at an example generated from model (4.27) with σ = 1, τ = 1 and φ = 6, where x(s_i) ~ unif(1, 3), i.e., x(s_i) follows a uniform distribution on (1, 3). The MCMC samples of τ², σ² and φ are shown in Figure 4.4 (a): the Markov chain couldn't burn in to stationarity. If we fixed one of the parameters, τ = 1, then the chain achieved stationarity, as in Figure 4.4 (b). But, in this case, the estimated parameter φ is big (≈ 150), which makes the covariance matrix effectively diagonal, Σ ≈ σ² I, so the new BART model degenerates to the old BART.
Figure 4.4: The failure of Likelihood based MCMC
For the problem shown in Figure 4.4 (a), the reason is that both BART and the Gaussian process are nonparametric, nonlinear models. They are sensitive to changes in the data: a disturbance in the data may cause dramatic turbulence in the Markov chain. When we fix one or several parameters, this problem may be solved, as in Figure 4.4 (b). However, the degeneration issue then comes out. To explain this issue, recall the example in Section 4.2.5. Table 4.1 tells us that, working with the true parameters ρ and Σ, the new BART fitted the training data poorly; actually, the fit gets worse as |ρ| approaches 1. Suppose we know nothing about the parameters of Σ and all the information comes from the data, which determines the likelihood. In likelihood based MCMC, the likelihood guides the search through the parameter space. As a result, the data/likelihood will lead the parameter search toward the best in-sample fit, and Σ will degenerate to σ² I. The same thing happens in the spatial context: if there is no prior information about the parameters of the covariance function C(·), the data/likelihood will lead the MCMC search to eliminate the spatially dependent structure in C(·) and make Σ = σ² I. Figure 4.4 (b) showed exactly this situation. To fit the model properly, the parameters must be known, yet the parameters are exactly what we want to estimate: it looks like we are locked in a dead loop. In the next section, we will introduce a key to open this lock, back comparing and the tuning range.
4.3.3 Back Comparing and Tuning Range
In Section 4.3.2, we discussed the failure of likelihood based MCMC. The reason for the failure is that the data lead the search in parameter space and tend to eliminate the correlation structure. So, the solution should pull the parameter search in the opposite direction: instead of letting the data totally control the parameter search, we restrict it to a reasonable range in parameter space. We propose a strategy, back comparing, to select the good candidates. Figure 4.5 demonstrates the idea of back comparing. First, we propose a candidate θ. Second, we use this candidate to fit the new BART model. With the fitted BART model f_BART^new, the variation of the residuals, SSE_real^new, can be calculated. Meanwhile, with the value of the candidate θ, it's easy to calculate the proposed variation of the mixed errors, SSME_proposed^new, which includes the errors coming from w(s) and ε(s). Then, we compare SSE_real^new and SSME_proposed^new. There are three possible cases.
(1) SSE_real^new < SSME_proposed^new: over-estimation. The proposed θ claims more variation (SSME_proposed^new) than the new BART f_BART^new actually leaves in the residuals.

(2) SSE_real^new > SSME_proposed^new: under-estimation. The proposed θ claims less variation (SSME_proposed^new) than the new BART f_BART^new actually leaves in the residuals.

(3) SSE_real^new ≈ SSME_proposed^new: good estimation. The proposed θ is consistent with the fitted new BART and is regarded as a good estimation of the parameters.

Figure 4.5: Back Comparing

Figure 4.6 illustrates the parameter searching process. After proposing a set of parameters {θ^(0), ..., θ^(n)}, we apply back comparing to each of them. The good estimations are picked out and put into a new set, called the tuning range; a sketch of this loop is given after Figure 4.6. All the proposed parameters in the tuning range are good for both the model (4.27) and the data, and people can select the one that best fits their goals. A more intuitive analogy is a speaker volume control knob: you can tune the knob to get a volume comfortable to you, but the best volume differs from person to person, and even then you will adjust it when the situation changes, for example, when the environment becomes noisier.
Figure 4.6: Parameter space searching for the buildup of tuning range
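A minimal Python sketch of this search loop follows. Since the new BART is not reimplemented here, a generalized least squares cubic fit stands in for f_BART^new, and SSME_proposed^new is computed as tr(Σ(θ)), consistent with SSE_proposed^new = nσ² in Section 4.4.1; both stand-ins, and the acceptance threshold, are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    x = rng.uniform(-1, 1, size=n)

    def make_sigma(sigma, rho):
        A = np.eye(n) + rho * np.eye(n, k=-1)
        return sigma ** 2 * A @ A.T

    # Data generated with the true parameters sigma = 0.1, rho = 0.8.
    y = x ** 3 + (np.eye(n) + 0.8 * np.eye(n, k=-1)) @ rng.normal(0, 0.1, n)

    X = np.vander(x, 4)                        # cubic basis, stand-in learner
    tuning_range = []
    for sigma in np.linspace(0.05, 0.15, 10):
        for rho in np.linspace(0.05, 0.95, 10):
            Si = np.linalg.inv(make_sigma(sigma, rho))
            beta = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)   # GLS fit
            sse_real = np.sum((y - X @ beta) ** 2)               # realized
            ssme_prop = np.trace(make_sigma(sigma, rho))         # proposed
            if abs(sse_real - ssme_prop) < 0.5:   # loose criterion; tighten
                tuning_range.append((sigma, rho)) # to shrink the tuning range
    print(tuning_range)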
There is still a question: how can we properly propose a parameter set {θ^(0), ..., θ^(n)} for searching out the tuning range? The answer is that we can use the information obtained from the old BART or from the linear mixed models. All the approaches in this section will be demonstrated in the experiments of the next section.
4.4 Experiments and Results
In this section, we show applications of model (4.27) to two types of problems: one-dimensional problems and two-dimensional problems. The methods, back comparing and the tuning range, are discussed carefully. The idea of analysis of variation introduced in Section 4.3.1 provides guidance for the parameter selection within the tuning range.
4.4.1 One Dimension Experiment

In one dimension, the class of autoregressive (AR) processes, and its extensions, is widely used to model serial dependence. An AR(p) process is a special Gaussian process, called a Gaussian linear process, if it satisfies the recursions

y_t = φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t

where {ε_t} is an i.i.d. sequence of N(0, σ²) random variables, and the polynomial φ(z) = 1 − φ_1 z − ... − φ_p z^p has no zeros inside or on the unit circle (Brockwell and Davis, 2002). It means that the Gaussian process BART model (4.27) can also be applied to one-dimensional, time-series-like data.

Recall the example in Section 4.2.5. It's not exactly an AR(1) process but an AR(1) error process. However, it's also a Gaussian linear process, because the errors can be written as η = Aε, where the matrix A was defined in (4.26). In this case, the Gaussian process BART model (4.27) becomes (4.33). Next, we treat this model with the previously proposed
methods.
y(t_i) = f_BART^new(X(t_i)) + η(t_i),   i = 1, ..., n     (4.33)

Note: in this experiment, the parameters of model (4.33) are θ = {σ, ρ}. Their real values are σ = 0.1 and ρ = 0.8 (the green point in Figure 4.7).
In order to apply back comparing to find the tuning range, we first need to propose a searching set {θ^(0), ..., θ^(n)} in parameter space. The estimation from the old BART model provides a clue: σ̂ = 0.1042189, marked by the yellow dashed line in Figure 4.7. We can search σ in a neighborhood of σ̂, here the interval (0.05, 0.15) shown in Figure 4.7. For the parameter ρ, its value must be constrained to (0, 1) to keep the Gaussian process η(t) stationary. We divided each interval into 10 segments and selected the centers as the searching set, so the searching set included 100 candidates {θ^(0), ..., θ^(99)}, as shown in Figure 4.7. As discussed in Section 4.3.3, to apply back comparing we need to compare SSE_real^new and SSME_proposed^new. Figure 4.7 shows the back comparing results SSE_real^new − SSME_proposed^new. The cells with small absolute differences are selected into the tuning range (the green cells); we can change this criterion to control the size of the tuning range. In Figure 4.7, the left panel shows the results that were obtained by working with the fitted training data: the model was fitted with all the observed data, and then the observed data and their model fitted values were used to calculate the variations SSE_real^new and SSME_proposed^new. In this case, the BART model tends to overfit the data, causing the parameters to be underestimated. To tackle this issue, instead of fitting the training data we can use the predicted testing data and the observed testing data to compute the variations. The predicted testing data were generated by a 4-fold cross-validation. The right panel of Figure 4.7 presents the back comparing results and the tuning range in this case. Now, you
may have a question: why is the real value (green point) of the parameters not included in the tuning range? Let's look at the results of back comparing. Compared to the real value, all the values in the tuning range are underestimated, which indicates that their corresponding variations SSw (see Figure 4.3) are less than in the real-value case. Recall the analysis of variation and Figure 4.3: the real value corresponds to the ideal case, which rarely happens, while the values in the tuning range correspond to the normal case. Last but not least, the number of folds in the cross-validation should not be too small (less than 4), since too few folds damage the correlation structure of the covariance matrix Σ.
Figure 4.7: Back comparing and tuning range. The green dot is the real value of the parameters. The vertical yellow dashed line shows the estimation of σ from the old BART. The green cells indicate the tuning range. The left panel shows the tuning range selected using the fitted training data, while the tuning range in the right panel was selected using the predicted testing data from a 4-fold cross-validation.
As discussed in Section 4.3.3, any candidate in the tuning range is good for the model and the data, but maybe not for your purpose. Besides personal purpose, there is guidance for selecting the parameter values in the tuning range according to the idea of analysis of variation in Section 4.3.1. In Figure 4.3, compared to the old BART model, the more SSf_BART^new and SSE^new are shrunk, the better the Gaussian process BART model (4.27) performs in both interpretation and prediction. Figure 4.8 shows the values SSf_BART^old − SSf_BART^new and SSE^old − SSE^new. So, under the guidance of analysis of variation, we should select candidates with big values of these two differences, which indicates strong shrinkage.
Moreover, with the proposed parameter values, we can decompose the variation in the Gaussian process η(t) into the pure error variation SSE_proposed^new and the correlated variation SSw_proposed^new:

SSE_proposed^new = n σ²,   SSw_proposed^new = SSME_proposed^new − SSE_proposed^new

where SSME_proposed^new is defined in Figure 4.5. Their values are shown in Figure 4.9. The analysis of variation (Figure 4.3) suggests selecting a big value of SSw_proposed^new and a small value of SSE_proposed^new, which is consistent with the previous guidance.
Figure 4.9: The variance decomposition of Gaussian process η(t).
(3) Results
The analysis of variation recommended θ_top = {σ, ρ} = {0.085, 0.95} at the top of the tuning range. To study the effect of different values in the tuning range, we selected another candidate, θ_bottom = {σ, ρ} = {0.125, 0.05}, at the bottom of the tuning range. Their comparison in Figure 4.10 shows significant differences. In Figure 4.11, we compare the top candidate with the real value (left) and the bottom candidate with the old BART (right). Although the top candidate {σ, ρ} = {0.085, 0.95} is different from the real value {σ, ρ} = {0.1, 0.8}, their fits are quite similar. It indicates that the guidance from the analysis of variation is effective and can lead us close to the real value. On the other hand, the fit of the bottom candidate {σ, ρ} = {0.125, 0.05} is very close to the fit of the old BART ({σ, ρ} = {0.1042189, 0}). Based on the comparisons, it's easy to imagine that if we scan the tuning range from top to bottom, the model fit degenerates from the new BART with a (near) real correlation structure to the (near) old BART. In other words, the tuning range spans a family of fits between the two models.
Figure 4.10: Two extreme candidates from the tuning range. The comparison shows they impose different influences on the model fitting.

Figure 4.11: Comparing the two extreme candidates in the tuning range to the real value and the old BART. The left panel shows the similarity of model fitting between the candidate at the top of the tuning range and the real value. The right panel shows the similarity of model fitting between the candidate at the bottom of the tuning range and the old BART.
4.4.2 Two Dimension Experiment
The two-dimensional experiment is created following the Gaussian process BART model (4.27),

y(s_i) = f_BART^new(x(s_i)) + w(s_i) + ε(s_i),   i = 1, ..., n     (4.34)

where

• w(s_i) ~ GP(0, C(·, ·|σ, φ)), with C(s_j, s_k|σ, φ) = σ² exp{−φ d(s_j, s_k)},

• ε(s_i) ~ i.i.d. N(0, τ²).
We can explore the created data in Figure 4.12. Similar to the one-dimensional experiment, we apply back comparing to build the tuning range.

Figure 4.12: Experiment data exploration. Left and middle panels show the spatial maps of y(s_i) and w(s_i) respectively. Right panel shows the relation between x and y.
First, we need to propose the searching set in parameter space. The linear mixed model (4.35), fitted to the data, provides the clues. The parameters estimated by (4.35) are {σ̂, φ̂, τ̂} = {1.18, 5.46, 1.05}. According to these estimates, we created the searching set as follows: the interval (0.5, 1.4) was divided equally into 10 values for σ; the interval (1, 10) into 10 values for φ; and the interval (0.4, 1.6) into 7 values for τ. Then, back comparing was applied to build the tuning range. Figure 4.13 (left) shows the back comparing results and the selected tuning range. In this experiment, the mean square error (MSE) was used instead of the sum of squared errors (SSE) to avoid large values. Figure 4.13 (right) indicates that we adopted a strict criterion, |MSE_real^new − MSME_proposed^new| < 1, to select the tuning range. The exact values of the selected candidates are listed in Table 4.2.

Figure 4.13: Back comparing and tuning range (left). Back comparing MSE density and selection criterion (right).

As in the one-dimensional experiment, the differences MSf_BART^old − MSf_BART^new and MSE^old − MSE^new can be used as guidance to select the good candidates in the tuning range.
Table 4.2: Back Comparing and Candidates Selection Guidance

τ   σ   φ   Back Comparing MSE   MSf_BART^old − MSf_BART^new   MSE^old − MSE^new
Figure 4.14 illustrates their relative values (color) and positions in the tuning range; their exact values are listed in Table 4.2. According to the first guidance, MSf_BART^old − MSf_BART^new, the candidates {τ, σ, φ} = {1.6, 0.6, 1}, {1.4, 0.9, 3} and {1, 0.9, 3} are the top three. But when we check the second guidance, MSE^old − MSE^new, it gives the opposite order, and the value is negative for the candidate {1.6, 0.6, 1}. In this case, {1.4, 0.9, 3} and {1, 0.9, 3} are both good. We chose the second one, {τ, σ, φ} = {1, 0.9, 3}, because its sum of the two differences, MSf_BART^old − MSf_BART^new + MSE^old − MSE^new, is the largest.
(3) Results
Figure 4.14: The left panel shows the values of MSf_BART^old − MSf_BART^new in the tuning range. The right panel shows the values of MSE^old − MSE^new in the tuning range.

The motivation for developing the Gaussian process BART model (4.27) is to gain the advantages of both the spatial linear mixed regression models and the spatial nonlinear regression models. On the one hand, compared to the linear mixed regression models, the Gaussian process BART model should be capable of handling the nonlinear relationships between the observed variables y(s) and x(s). On the other
hand, compared to the nonlinear regression models, the Gaussian process BART model should be able to understand the spatial dependence, which may be caused by latent variables. Obviously, we have already achieved the second goal. For the first goal, let's compare the results between the Gaussian process BART model (4.34) and the linear mixed model (4.35).

• First, they have similar ability to understand the spatially dependent structure in the data, because their estimated parameters are both close to the real values.

• Second, they have different abilities to understand the relationship between y(s) and x(s). Figure 4.15 demonstrates the differences: the Gaussian process BART model captured the nonlinear relation f(x) = x³ well, while the linear mixed model, by construction, could not.
• Last but not least, the failure to fit the nonlinear trend may cause the linear mixed model to violate its assumption that the Gaussian process w(s) is stationary. In this experiment, the linear mixed model failed to extract the nonlinear trend, leaving nonstationary structure for w(s) to absorb. There is much literature working on this problem, trying to estimate nonstationary spatial processes via, for example, process convolution, low rank splines or basis functions, etc. In contrast, the Gaussian process BART model (4.27), which is able to capture both linear and nonlinear trends, naturally makes the stationarity assumption much more robust than it is in the linear mixed models.
Figure 4.15: The fitting results of the Gaussian process BART, the old BART and the linear mixed model.
Moreover, we can compare the Gaussian process BART to the old BART. In Figure 4.15, they look similar but still have some differences, because the Gaussian process BART takes the spatial dependence into account when it fits the data. As in the example in Section 4.2.5, the Gaussian process BART model should perform better in fitting the underlying process f(x) than the old BART. To illustrate this, we calculated the mean square errors (MSE) between each fit and the true function f(x). The result confirms the claim that the Gaussian process BART performs better in restoring the underlying process.
In this section, we test the Gaussian process BART on real data: the soil carbon stock data of Chapter 3. In order to visually compare the results among different models, we chose two environmental covariates for the test. From the results in Chapter 3, the covariates NDVI14 and REDL14 have nonlinear relationships with the response variable y (see Figure 3.8). So, we fit the linear mixed model (4.36) and the Gaussian process BART model (4.37),

y(s_i) = f_LMX(X(s_i)) + w(s_i) + ε(s_i)     (4.36)

y(s_i) = f_BART^new(X(s_i)) + w(s_i) + ε(s_i)     (4.37)

where

• We test two scenarios: one uses the variable NDVI14 only; the other uses both NDVI14 and REDL14.

• In the first scenario, f_LMX(X(s_i)) in the linear mixed model (4.36) is β_0 + β_1 NDVI14(s_i). In the second scenario, f_LMX(X(s_i)) in the linear mixed model (4.36) is β_0 + β_1 NDVI14(s_i) + β_2 REDL14(s_i).

• w(s_i) ~ GP(0, C(·, ·|σ, φ)), with C(s_j, s_k|σ, φ) = σ² exp{−φ d(s_j, s_k)}, where d(s_j, s_k) is the Euclidean distance between points s_j and s_k. We use the R package "fields" (Nychka et al., 2017) to fit this model. In that package the parameter φ is set to 1 by default, so only the unknown parameter σ will be estimated.

• ε(s_i) ~ i.i.d. N(0, τ²). The unknown parameter τ will be estimated.
The estimates of the linear mixed model (4.36) are shown in Table 4.3. We use the estimates of σ and τ to fit the Gaussian process BART model (4.37). Figure 4.16 shows the fitting results of these 3 models, and we can see the differences among them. Since the Gaussian process BART model (4.37) uses the same values of the covariance parameters as the linear mixed model (4.36), the fit of the Gaussian process BART shrinks more toward the linear mixed model than the old BART does. Meanwhile, the Gaussian process BART keeps its non-linearity compared to the linear mixed model.

Table 4.3: Linear Mixed Model Estimations (the First Scenario)

β_0   β_1   σ   τ

Figure 4.16: The fitting results of the Gaussian process BART, the old BART and the linear mixed model on real data with one covariate, NDVI14.
The real soil carbon stock data and the two environmental covariates NDVI14 and REDL14 are shown in Figure 4.17. The estimates of the linear mixed model (4.36) are shown in Table 4.4. Similar to the first scenario, we use the estimates of σ and τ to fit the Gaussian process BART model (4.37).

Figure 4.17: The real data with two environmental covariates NDVI14 and REDL14.

Table 4.4: Linear Mixed Model Estimations (the Second Scenario)

β_0   β_1   β_2   σ   τ
Figure 4.18 illustrates the different model fits. Compared to the linear mixed model, both the Gaussian process BART and the old BART successfully captured the nonlinear trend. Figure 4.19 shows the differences among the three models. Similar to the first scenario, the Gaussian process BART shrinks more toward the linear mixed model than the old BART does, because the Gaussian process BART model (4.37) uses the same values of the covariance parameters as the linear mixed model (4.36).

Figure 4.18: The model fits on the real data.
All the computational issues with the Gaussian process BART model (4.27) are related to the parameter space search for the buildup of the tuning range. The issues and possible solutions are as follows.

• To build the tuning range, we have to propose a searching set in parameter space. This searching set suffers from the curse of dimensionality as the number of parameters grows. One possible solution is using low-dimensional parametric Gaussian process models, for example, the Matérn family. A second possible solution is using random search instead of grid search: the ad hoc information from random search can be used to locate the promising areas in parameter space. Another possible solution is parallel computing: in theory, all the points in the searching set can be tested in parallel, and since computational resources are limited, we can partition the searching set into batches.

• For every search, we have to invert the covariance matrix Σ implied by the proposed parameters, because the algorithm of the new BART needs Σ^{−1} rather than Σ (see Appendix A). Since the dimension of Σ is n × n, exact inversion requires O(n³) operations. A possible solution is creating a sparse matrix with O(n) non-zero entries to approximate Σ^{−1}, for example, via the nearest neighbor Gaussian process (see Appendix B).
Chapter 5
CONCLUSION
This chapter summarizes the key ideas and contributions of the dissertation. Ideas for future work are also introduced.

• In Chapter 2, a multistage workflow built on nonlinear regression models was proposed for the spatial prediction problem in the reef species abundance study. Strategies based on prior knowledge, aggregation and iteration were introduced to help the nonlinear models with out-of-sample prediction.

• Chapter 3 developed a novel two-stage model for the spatial regression problems in soil carbon stock (SOC) analysis. In the first stage, a universal regression Kriging model captures the linear and stationary effects of the covariates. In the second stage, a generalized additive model explains the nonlinear and nonstationary effects.

• In Chapter 4, the traditional BART model was extended to a new BART model which can accommodate general correlated errors. A novel nonlinear spatial regression model, called Gaussian process BART, can then be built by combining the new BART and a Gaussian process. Because of the failure of likelihood based MCMC in parameter estimation, the methods back comparing and tuning range were proposed for parameter selection.

Ideas for future work include:

• Solving the computational issue of the parameter space search for the buildup of the tuning range.

• Updating the R package "BART" with the new algorithm for accommodating correlated errors.
REFERENCES
Diesing, M. and D. Stephens, “A multi-model ensemble approach to seabed mapping”,
Journal of Sea Research 100, 62 – 69, meshAtlantic: Mapping Atlantic Area Seabed
Habitats for Better Marine Management (2015).
Dobesch, H., P. Dumolard and I. Dyras, Spatial Interpolation for Climate Data: The
Use of GIS in Climatology and Meteorology (ISTE Ltd, 2007).
Drexler, M. and C. H. Ainsworth, “Generalized additive models used to predict species
abundance in the gulf of mexico: an ecosystem modeling tool”, PloS one 8, 5 (2013).
Farmer, N. A. and J. S. Ault, “Grouper and snapper movements and habitat use in
dry tortugas, florida”, Mar Ecol Prog Ser 433, 169–184 (2011).
Farmer, N. A. and J. S. Ault, “Modeling coral reef fish home range movements in dry
tortugas, florida”, The Scientific World Journal 2014, 14 (2014).
Finley, A. O., A. Datta, B. D. Cook, D. C. Morton, H. E. Andersen and S. Banerjee,
“Efficient algorithms for bayesian nearest neighbor gaussian processes”, Journal of
Computational and Graphical Statistics 28, 2, 401–414 (2019).
Furrer, R., M. G. Genton and D. Nychka, “Covariance tapering for interpolation of
large spatial datasets”, Journal of Computational and Graphical Statistics 15, 3,
502–523 (2006).
Gelfand, A. E., P. J. Diggle, M. Fuentes and P. Guttorp, Handbook of Spatial Statistics
(Chapman & Hall/CRC, 2010).
Geoga, C. J., M. Anitescu and M. L. Stein, “Scalable gaussian process computations
using hierarchical matrices”, (2019).
Goff, J. A., C. J. Jenkins and S. Williams, “Seabed mapping and characterization of
sediment variability using the usseabed data base”, Continental Shelf Research 28,
4, 614 – 633 (2008).
Guisan, A., T. C. Edwards and T. Hastie, “Generalized linear and generalized additive
models in studies of species distributions: setting the scene”, Ecological Modelling
157, 2, 89 – 100 (2002).
Guisan, A., R. Tingley, J. B. Baumgartner, I. Naujokaitis-Lewis, P. R. Sutcliffe,
A. I. T. Tulloch, T. J. Regan, L. Brotons, E. McDonald-Madden, C. Mantyka-
Pringle, T. G. Martin, J. R. Rhodes, R. Maggini, S. A. Setterfield, J. Elith, M. W.
Schwartz, B. A. Wintle, O. Broennimann, M. Austin, S. Ferrier, M. R. Kearney,
H. P. Possingham and Y. M. Buckley, “Predicting species distributions for conser-
vation decisions”, Ecology Letters 16, 12, 1424–1435 (2013).
Hackbusch, W., Hierarchical Matrices: Algorithms and Analysis (2015).
Haining, R. P., R. Kerry and M. A. Oliver, “Geography, spatial data analysis, and
geostatistics: An overview”, Geographical Analysis 42, 731 (2010).
Harter, S., H. Moe, J. Reed and A. David, “Fish assemblages associated with red
grouper pits at pulley ridge, a mesophotic reef in the gulf of mexico”, Fishery
Bulletin 115, 419–432 (2017).
Hastie, T. and R. Tibshirani, “Generalized additive models”, Statistical Science Vol.
1, 297–318 (1986).
Hastie, T. and R. Tibshirani, Generalized Additive Models (Chapman and Hall, New
York, 1990).
Kelly, S., “Basic introduction to pygame”, (2016).
Lembke, C., S. Grasty, A. Silverman, H. A. Broadbent, S. E. Butcher and S. Mu-
rawski, “The camera-based assessment survey system (c-bass): A towed camera
platform for reef fish abundance surveys and benthic habitat characterization in
the gulf of mexico”, Continental Shelf Research 151, 62–71 (2017).
Li, J., A. D. Heap, A. Potter and J. J. Daniell, “Application of machine learning meth-
ods to spatial interpolation of environmental variables”, Environmental Modelling
& Software 26, 12, 1647 – 1659 (2011).
Lin, Y.-P., W.-C. Lin, Y.-C. Wang, W.-Y. Lien, T. Huang, C.-C. Hsu, D. S. Schmeller
and N. D. Crossman, “Systematically designating conservation areas for protect-
ing habitat quality and multiple ecosystem services”, Environmental Modelling &
Software 90, 126 – 146 (2017).
Lindgren, F., H. Rue and J. Lindström, “An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach”, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 4, 423–498 (2011).
Mateo-Sánchez, M. C., A. Gastón, C. Ciudad, J. I. Garcı́a-Viñas, J. Cuevas, C. López-
Leiva, A. Fernández-Landa, N. Algeet-Abarquero, M. Marchamalo, M. Fortin and
S. Saura, “Seasonal and temporal changes in species use of the landscape: how do
they impact the inferences from multi-scale habitat modeling?”, Landscape Ecology
31, 1261–1276 (2015).
McDonald, A., J. S. Parslow and A. J. Davidson, “Interpretation of a modified linear
model of catch-per-unit-effort data in a spatially-dynamic fishery”, Environmental
Modelling & Software 16, 2, 167 – 181, environmental Modelling and Socioeco-
nomics (2001).
Nychka, D., R. Furrer, J. Paige and S. Sain, “fields: Tools for spatial data”, R package
version 10.3 (2017).
Pfeffermann, D., “New important developments in small area estimation”, Statistical Science 28, 1, 40–68 (2013).
Prosser, D., C. Ding, R. Erwin, T. Mundkur, J. Sullivan and E. C. Ellis, “Species
distribution modeling in regions of high need and limited data: waterfowl of china”,
Avian Research 9, 1–14 (2018).
Ramchoun, H., M. J. Idrissi, Y. Ghanou and M. Ettaouil, “Multilayer perceptron:
Architecture optimization and training”, Int. J. Interact. Multim. Artif. Intell. 4,
26–30 (2016).
Robertson, G. P., “Geostatistics in ecology: Interpolating with known variance”,
Ecology 68, 3, 744–748 (1987).
Ríos-Pena, L., T. Kneib, C. Cadarso-Suárez, N. Klein and M. Marey-Pérez, “Studying the occurrence and burnt area of wildfires using zero-one-inflated structured additive beta regression”, Environmental Modelling & Software 110, 107 – 118, special Issue on Environmental Data Science and Decision Support: Applications in Climate Change and the Ecological Footprint (2018).
Sainte-Marie, B. and B. Hargrave, “Estimation of scavenger abundance and distance
of attraction to bait”, Marine Biology 94, 431–443 (1987).
Saul, S. and S. Purkis, “Semi-automated object-based classification of coral reef habi-
tat using discrete choice models”, Remote Sensing 7, 12, 15894–15916 (2015).
Saul, S., J. Walter, D. Die, D. Naar and B. Donahue, “Modeling the spatial distri-
bution of commercially important reef fishes on the west florida shelf”, Fisheries
Research 143, 12 – 20 (2013).
Schoener, T., “A brief history of optimal foraging ecology”, (1987).
Simon Wood, “mgcv: Mixed gam computation vehicle with automatic smoothness
estimation”, R package version 1.8-31 (2019).
Somerton, D. and C. T. Glendhill, “Report of the national marine fisheries service
workshop on underwater video analysis, august 4-6, 2004”, (2005).
Staff, S. S. and T. Loecke, “Rapid carbon assessment: Methodology, sampling and
summary”, (2016).
Stein, M. L., Statistical Interpolation of Spatial Data: Some Theory for Kriging
(Springer, New York, 1999).
Stohlgren, T. J., P. Ma, S. Kumar, M. Rocca, J. T. Morisette, C. S. Jarnevich and
N. Benson, “Ensemble habitat mapping of invasive plant species”, Risk Analysis
30, 2, 224–235 (2010).
Stoner, A. W., “Effects of environmental variables on fish feeding ecology: implica-
tions for the performance of baited fishing gear and stock assessment”, Journal of
Fish Biology 65, 6, 1445–1471 (2004).
Stratford, D. S., C. A. Pollino and A. E. Brown, “Modelling population responses to
flow: The development of a generic fish population model”, Environmental Mod-
elling & Software 79, 96 – 119 (2016).
Streich, M. K., M. J. Ajemian, J. J. Wetz and G. W. Stunz, “A comparison of fish
community structure at mesophotic artificial reefs and natural banks in the western
gulf of mexico”, Marine and Coastal Fisheries 9, 1, 170–189 (2017).
Vecchia, A. V., “Estimation and model identification for continuous spatial pro-
cesses”, Journal of the Royal Statistical Society. Series B (Methodological) 50,
2, 297–312 (1988).
Wenger, S. J. and J. D. Olden, “Assessing transferability of ecological models: an un-
derappreciated aspect of statistical validation”, Methods in Ecology and Evolution
3, 2, 260–267 (2012).
Williamson, D. J., G. L. Burn, S. Simoncelli, J. Griffié, R. Peters, D. M. Davis and D. M. Owen, “Machine learning for cluster analysis of localization microscopy data”, Nat Commun 11, 1, 1493 (2020).
Ye, L., L. Gao, R. Marcos-Martinez, D. Mallants and B. A. Bryan, “Projecting aus-
tralia’s forest cover dynamics and exploring influential factors using deep learning”,
Environmental Modelling & Software 119, 407 – 417 (2019).
Yuan, M. and Y. Lin, “Model selection and estimation in regression with grouped variables”, Journal of the Royal Statistical Society, Series B 68, 1, 49–67 (2006).
APPENDIX A
A.1 Marginal Likelihood and Posterior Distribution

Since

(∗) = R^T Σ^{−1} R − 2 R^T Σ^{−1} D µ + µ^T D^T Σ^{−1} D µ + µ^T Q µ − 2 µ̄^T Q µ + µ̄^T Q µ̄
    = µ^T (D^T Σ^{−1} D + Q) µ − 2 (R^T Σ^{−1} D + µ̄^T Q) µ + R^T Σ^{−1} R + µ̄^T Q µ̄,

we can introduce the variable

v = (Q + D^T Σ^{−1} D)^{−1} (D^T Σ^{−1} R + Q µ̄)

and complete the square. Finally,

(∗) = (µ − v)^T (Q + D^T Σ^{−1} D)(µ − v) + C

where

C = −v^T (Q + D^T Σ^{−1} D) v + R^T Σ^{−1} R + µ̄^T Q µ̄.
Plugging (A.1) and (A.2) into the integral term,

∫ p(R|D, µ) p(µ) dµ
  = (2π)^{−(n+b)/2} |Σ|^{−1/2} |Q|^{1/2} exp{−C/2} ∫ exp{−(1/2)(µ − v)^T (Q + D^T Σ^{−1} D)(µ − v)} dµ
  = (2π)^{−(n+b)/2} |Σ|^{−1/2} |Q|^{1/2} exp{−C/2} (2π)^{b/2} |Q + D^T Σ^{−1} D|^{−1/2}
      × ∫ (2π)^{−b/2} |Q + D^T Σ^{−1} D|^{1/2} exp{−(1/2)(µ − v)^T (Q + D^T Σ^{−1} D)(µ − v)} dµ
  = [ (2π)^{−n/2} |Σ|^{−1/2} |Q|^{1/2} / |Q + D^T Σ^{−1} D|^{1/2} ] exp{−C/2},

since the last integral is that of a normal density and equals 1.

Based on the proof of the marginal likelihood above, it's easy to get the posterior distribution p(µ|R). By (4.18), we have

p(µ|R) ∝ p(R|D, µ) π(µ),

with p(R|D, µ) given in (A.1) and π(µ) in (A.2). Then, by (A.3) and (A.4), we can directly prove that

p(µ|R) ~ N(v, (Q + D^T Σ^{−1} D)^{−1}).
A.2 Invariant under Reordering

D^T Σ^{−1} R = (P D_P)^T (P Σ_P P^T)^{−1} (P R_P)
             = D_P^T P^T P Σ_P^{−1} P^T P R_P     (A.7)
             = D_P^T Σ_P^{−1} R_P

Q + D^T Σ^{−1} D = Q + (P D_P)^T P Σ_P^{−1} P^T P D_P = Q + D_P^T Σ_P^{−1} D_P     (A.8)

and

p(µ|R) ~ N((Q + D^T Σ^{−1} D)^{−1} D^T Σ^{−1} R, (Q + D^T Σ^{−1} D)^{−1}).

By applying (A.7) and (A.8), obviously, (4.16) and (4.20) are invariant under reordering.
A.3 On the Calculation of the Marginal Likelihood Ratio

Let

A = Q + D_P^T Σ_P^{−1} D_P,   Q = τ^{−2} I,

A = [ a_11 + τ^{−2}   a_12            ...  a_1b
      a_21            a_22 + τ^{−2}   ...  a_2b
      ...
      a_b1            a_b2            ...  a_bb + τ^{−2} ]     (A.9)

where

a_ji = a_ij = Σ_{h ∈ n_i} Σ_{l ∈ n_j} q_hl,   i ≤ j,  i, j ∈ {1, ..., b},

n_k, k ∈ {1, ..., b}, is the index set of observations associated with bottom node k, and q_hl is the entry in row h and column l of

Σ_P^{−1} = [ q_11  q_12  ...  q_1n
             q_21  q_22  ...  q_2n
             ...
             q_n1  q_n2  ...  q_nn ].
To understand the form of E (defined below), we have to consider the birth and death operations respectively. Without loss of generality, we can assume that the birth or death operation occurs in the (i + 1)-th MCMC iteration. Since the dummy matrix D has a very special form (see Section 4.2.1), we developed the following algorithm to achieve computational efficiency.

(1) Birth

In this scenario, the tree has b bottom nodes at step i and b + 1 nodes at step i + 1, so (A^i)^{−1} and (A^{i+1})^{−1} are b × b and (b + 1) × (b + 1) matrices. We denote them by block matrices,

(A^{i+1})^{−1} = [ V_11^{i+1}  V_12^{i+1}
                   V_21^{i+1}  V_22^{i+1} ],   (A^i)^{−1} = [ V_11^i  v_12^i
                                                              v_21^i  v_22^i ],

where, without loss of generality, the node that splits is the last one, so v_22^i is a scalar. Expanding (A^i)^{−1} by duplicating the row and column of the split node and subtracting it from (A^{i+1})^{−1} gives

B = (A^{i+1})^{−1} − [ V_11^i  v_12^i  v_12^i
                       v_21^i  v_22^i  v_22^i
                       v_21^i  v_22^i  v_22^i ] = [ b_11       b_12       ...  b_1(b+1)
                                                    b_21       b_22       ...  b_2(b+1)
                                                    ...
                                                    b_(b+1)1   b_(b+1)2   ...  b_(b+1)(b+1) ].

(2) Death

Similar to the birth scenario, we denote the matrices (A^i)^{−1} and (A^{i+1})^{−1} as

(A^i)^{−1} = [ V_11^i  V_12^i
               V_21^i  V_22^i ],   (A^{i+1})^{−1} = [ V_11^{i+1}  v_12^{i+1}
                                                      v_21^{i+1}  v_22^{i+1} ],

and B is formed analogously.
We can denote E as a block matrix,

E = D_P^{i+1} (A^{i+1})^{−1} (D_P^{i+1})^T − D_P^i (A^i)^{−1} (D_P^i)^T = [ E_11   E_12   ...  E_1b'
                                                                            E_21   E_22   ...  E_2b'
                                                                            ...
                                                                            E_b'1  E_b'2  ...  E_b'b' ]

where

b' = b + 1 for a birth, and b' = b for a death.

Each block has a special form,

E_ij = E_ji^T = b_ij J_ij,   i ≤ j,  i, j ∈ {1, ..., b'},

where b_ij is the (i, j) element of the matrix B calculated in the birth or death step, and J_ij is a card(n_i) × card(n_j) matrix whose entries are all 1 (card(n_k) is the cardinality of the set n_k).

Let's set

R_P^T Σ_P^{−1} = [ω_1 ω_2 ... ω_b'],   ω_i = [ω_ij], j ∈ n_i,

and

u = R_P^T Σ_P^{−1} E Σ_P^{−1} R_P
  = [ω_1 ω_2 ... ω_b'] E [ω_1 ω_2 ... ω_b']^T
  = Σ_{i=1}^{b'} Σ_{j=1}^{b'} ω_i E_ij ω_j^T
  = Σ_{i=1}^{b'} Σ_{j=1}^{b'} (ω_i J_ij ω_j^T) b_ij
  = Σ_{i=1}^{b'} Σ_{j=1}^{b'} [(Σ_{h ∈ n_i} ω_ih)(Σ_{l ∈ n_j} ω_jl) b_ij].
APPENDIX B
An easy way to understand the nearest neighbor Gaussian process is through the linear form of a Gaussian process (B.1),

w = H w + η     (B.1)

where w is an instance of the Gaussian process W ~ GP(0, K(·, ·|θ)), w ~ N(0, C), and C is the covariance matrix calculated from K(·, ·|θ). The structure of H is strictly lower triangular:

H = [ 0      0      0    ...   0
      h_21   0      0    ...   0
      h_31   h_32   0    ...   0
      ...
      h_n1   h_n2   ...  h_n(n−1)   0 ]
Row by row, (B.1) reads

w_1 = 0 + η_1
w_2 = h_21 w_1 + η_2
w_3 = h_31 w_1 + h_32 w_2 + η_3
...
w_n = h_n1 w_1 + h_n2 w_2 + ... + h_n(n−1) w_(n−1) + η_n

and

η ~ N(0, Λ)

where Λ is diagonal with entries Λ_11 = var(w_1) and Λ_ii = var(w_i | {w_j : j < i}) for i = 2, ..., n.
Since I − H is nonsingular,

I − H = [ 1       0       0    ...   0
          −h_21   1       0    ...   0
          −h_31   −h_32   1    ...   0
          ...
          −h_n1   −h_n2   ...  −h_n(n−1)   1 ],

we can write w = (I − H)^{−1} η, and hence C^{−1} = (I − H)^T Λ^{−1} (I − H).
Note: for any matrix M and sets of indices I_1, I_2 ⊆ {1, 2, ..., n}, let M[I_1, I_2] denote the submatrix of M formed by the rows indexed by I_1 and the columns indexed by I_2. Let

var(w_1, ..., w_{i+1}) = C.

Then

var(w_1, ..., w_i) = C[1:i, 1:i]

and

C = [ C[1:i, 1:i]   C[1:i, i+1]
      C[i+1, 1:i]   C[i+1, i+1] ].

By equation (B.3), standard Gaussian conditioning then gives the entries of H and Λ:

[h_{(i+1)1}, ..., h_{(i+1)i}] = C[i+1, 1:i] C[1:i, 1:i]^{−1},
Λ_{(i+1)(i+1)} = C[i+1, i+1] − C[i+1, 1:i] C[1:i, 1:i]^{−1} C[1:i, i+1].
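The following Python sketch carries out this sequential construction (without the nearest-neighbor truncation, which would simply zero out all but the closest entries in each row of H, making the implied precision matrix sparse) and checks the identity C^{−1} = (I − H)^T Λ^{−1} (I − H); the exponential covariance is an assumed example.

    import numpy as np

    n = 60
    s = np.linspace(0, 10, n)                  # ordered 1-d locations
    D = np.abs(s[:, None] - s[None, :])
    C = np.exp(-0.5 * D)                       # an assumed covariance K

    H = np.zeros((n, n))
    Lam = np.zeros(n)
    Lam[0] = C[0, 0]                           # Lambda_11 = var(w_1)
    for i in range(1, n):
        h = np.linalg.solve(C[:i, :i], C[:i, i])   # regression coefficients
        H[i, :i] = h
        Lam[i] = C[i, i] - C[i, :i] @ h            # conditional variance

    IH = np.eye(n) - H
    # Check: C^{-1} = (I - H)^T Lambda^{-1} (I - H)
    assert np.allclose(np.linalg.inv(C), IH.T @ (IH / Lam[:, None]))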
111