
Spatial Regression and Gaussian Process BART

by

Xuetao Lu

A Dissertation Presented in Partial Fulfillment


of the Requirements for the Degree
Doctor of Philosophy

Approved November 2020 by the


Graduate Supervisory Committee:

Robert McCulloch, Chair


Steven Saul
Paul Hahn
Shiwei Lan
Shuang Zhou

ARIZONA STATE UNIVERSITY

December 2020
ABSTRACT

Spatial regression is one of the central topics in spatial statistics. Based on their goals, interpretation or prediction, spatial regression models can be classified into two categories: linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real-world applications. New methods and models were proposed to overcome challenges that arise in practice. There are three major parts in the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges: zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out-of-sample prediction problem. The results and discussion showed that the nonlinear prediction had the advantages of high accuracy and low bias, and performed well at multiple resolutions.

In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. In the first stage, a spatial linear mixed model captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability for understanding the effects of covariates while keeping prediction accuracy competitive with popular machine learning models such as random forest, XGBoost, and support vector machine.

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression trees), was proposed in the third part. Combining the advantages of both BART and the Gaussian process, the model could capture the nonlinear effects of both observed and latent covariates. To develop the model, the traditional BART was first generalized to accommodate correlated errors. Then, the failure of likelihood-based Markov chain Monte Carlo (MCMC) in parameter estimation was discussed. Based on the idea of analysis of variation, back comparing and tuning range were proposed to tackle this failure. Finally, the effectiveness of the new model was examined through experiments on both simulated and real data.

ACKNOWLEDGMENTS

The present dissertation was undertaken under the joint supervision of Dr. Robert McCulloch and Dr. Steven Saul. I would like to express my deepest gratitude and appreciation to them. During my research, Robert always trusted me with his many great ideas and, at the same time, encouraged me to confidently develop and explore them further in my own ways, all while offering invaluable advice. Steven is my co-advisor as well as a good friend. Four years ago he introduced me to spatial statistics, which was a totally new area for me. Without his inspirational guidance, constant support, and patient encouragement this dissertation would not have been completed. Overall, they have inspired me both as an academic and as a person, and I therefore feel very privileged to have worked under their supervision.

I am very grateful to my mentors, Dr. Julie Bessac and Dr. Umakant Mishra, at Argonne National Laboratory. My research presented in Chapter 3 was accomplished under their supervision during my internship at Argonne this summer. They made it an unforgettable experience that has evolved into an ongoing collaboration and friendship.

I also would like to extend my appreciation to my committee members, Dr. Paul Hahn, Dr. Shiwei Lan, and Dr. Shuang Zhou, for their helpful support of my dissertation. Their efforts made this dissertation better.

I am grateful to the funding bodies that enabled me to pursue my studies: the Gulf of Mexico Research Initiative (GoMRI), the National Oceanic and Atmospheric Administration's (NOAA) National Marine Fisheries Service (NMFS) through the University of Miami's Cooperative Institute for Marine and Atmospheric Studies (CIMAS), and the National Science Foundation.

Finally, my deepest thanks go to my parents-in-law, Wenhua and Wenjiang, for helping us take care of my daughter; to my parents, Junjiang and Hongxia, and my sister, Chao, for sending their love and support from thousands of miles away. A special thanks to my wife, Dr. Yuxia Shen, for her endless support, love, and patience. And to my daughter Julie, for reminding me that happiness is in the simple things.

This dissertation is dedicated to my wife Yuxia and my daughter Julie, my loves for the rest of my life.

TABLE OF CONTENTS

Page

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Spatial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 REAL DATA CHALLENGES AND NONLINEAR MODELS . . . . . . . . . . . 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Multistage Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Video Survey Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.2 Empirical Maximum Likelihood Analysis . . . . . . . . . . . . . . . . . . 17

2.2.3 Random Smoothing Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2.4 Reducing Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 Non-linear Models and out of Sample Prediction . . . . . . . . . . . . . . . . . . 24

2.3.1 Prior Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.2 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3.3 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 A TWO-STAGE MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2.1 The Two-stage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


3.2.2 Model Interpretability and Analysis Flow . . . . . . . . . . . . . . . . . . 42

3.2.3 Prediction Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.3.1 Interpretation of Fitted Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.3.2 Prediction Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 GAUSSIAN PROCESS BART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 BART for Correlated Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2.1 Dummy Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2.2 Metropolis-Hastings Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2.3 Posterior Distribution of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.2.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.2.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.3 Gaussian Process BART Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.3.1 Analysis of Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3.2 The Failure of Likelihood Based MCMC . . . . . . . . . . . . . . . . . . . 73

4.3.3 Back Comparing and Tuning Range . . . . . . . . . . . . . . . . . . . . . . . 76

4.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.4.1 One Dimension Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.4.2 Two Dimension Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.4.3 Testing on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.4.4 Discussion on Computation Issues . . . . . . . . . . . . . . . . . . . . . . . . . 93

5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

APPENDIX

A BART FOR CORRELATED DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

A.1 Marginal Likelihood and Posterior Distribution . . . . . . . . . . . . . . . . . . . 103

A.1.1 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

A.1.2 Posterior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

A.2 Invariant under Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A.3 On the Calculation of Marginal Likelihood Ratio . . . . . . . . . . . . . . . . . . 106

A.3.1 Calculate Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

A.3.2 The Block Form of Matrix E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

A.3.3 Calculate Marginal Likelihood Ratio . . . . . . . . . . . . . . . . . . . . . . 108

B NEAREST NEIGHBOR GAUSSIAN PROCESS . . . . . . . . . . . . . . . . . . . . . . . 109

LIST OF TABLES

Table Page

2.1 Prior Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Models for Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.1 Environmental Variables Description and Data Source . . . . . . . . . . . . . . . . 39

3.2 Estimated Coefficients by URK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.3 Importance of Smoothers Fitted by GAM in Second Stage . . . . . . . . . . . . 50

3.4 Example of Feature Engineering for the Covariate Soilorder . . . . . . . . . . . 55

3.5 Prediction Comparison on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.6 Prediction Comparison on Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.1 Comparing BARTnew and BARTold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.2 Back Comparing and Candidates Selection Guidance . . . . . . . . . . . . . . . . . 86

4.3 Linear Mixed Model Estimations (the First Scenario) . . . . . . . . . . . . . . . . . 91

4.4 Linear Mixed Model Estimations (the Second Scenario) . . . . . . . . . . . . . . . 92

LIST OF FIGURES

Figure Page

1.1 Prediction Vs Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Sampling from Gaussian Process Prior and Posterior Distributions . . . . 5

1.3 Study Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Study Region and Video Survey Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Multistage Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Video Survey Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Empirical Maximum Likelihood Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5 Empirical Maximum Likelihood Density Function . . . . . . . . . . . . . . . . . . . . 20

2.6 Random Smoothing Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.7 Reducing Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.8 Prior knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.9 New Training Data from Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.10 Stabilization Effect of Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.11 Comparison Between Linear Regression and Non-linear Prediction . . . . . 31

2.12 Abundance Spatial Distribution of Red Grouper . . . . . . . . . . . . . . . . . . . . . 32

2.13 Abundance Spatial Distribution of Red Snapper . . . . . . . . . . . . . . . . . . . . . . 33

2.14 CPUE Bias from Catch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.15 CPUE Bias from Efforts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.1 The Soil Carbon Stock (SOC) Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2 The Summary of SOC Data and Some Environmental Covariates . . . . . . 40

3.3 Two-stage Universal Regression Kriging Generalized Additive Model . . . 44

3.4 Generalized Additive Model Demo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.5 Framework of Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.6 Group Lasso Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


3.7 Fitted Universal Regression Kriging Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.8 Estimated GAM Predictor Functions (Smoothers) . . . . . . . . . . . . . . . . . . . . 51

3.9 Discoveries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.10 Constant or Varying Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.11 Comparing Two-stage model and GWR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.12 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.13 Breaking the Trade-off Between Prediction and Interpretation . . . . . . . . . 58

4.1 Single Binary Regression Tree Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2 Comparing New BART and Old BART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.3 Analysis of Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.4 The Failure of Likelihood Based MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Back Comparing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.6 Parameter Space Searching for the Buildup of Tuning Range . . . . . . . . . . 77

4.7 Back Comparing and Tuning Range (1D Experiment) . . . . . . . . . . . . . . . . 80

4.8 Guidance of Parameter Selection in Tuning Range (1D Experiment) . . . 81

4.9 Variation Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.10 Fitting Comparison of Two Candidates in Tuning Range . . . . . . . . . . . . . . 83

4.11 Fitting Comparison with Real Value and the Old BART . . . . . . . . . . . . . . 83

4.12 Data Exploration of 2D Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.13 Back Comparing and Tuning Range (2D Experiment) . . . . . . . . . . . . . . . . 85

4.14 Guidance of Candidate Selection in Tuning Range (2D Experiment) . . . 87

4.15 Results Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4.16 Results Comparison on Real Data (One Covariate) . . . . . . . . . . . . . . 91

4.17 The Real Data with Two Environmental Covariates . . . . . . . . . . . . . . 92


4.18 Model Fittings on Real Data (Two Covariates) . . . . . . . . . . . . . . . . 93

4.19 Results Comparison among Models (Two Covariates) . . . . . . . . . . . . . . . . . 93

Chapter 1

INTRODUCTION

Spatial statistics is a branch of statistics developed specifically for geographic data. Such data are prevalent in many scientific disciplines, such as meteorology, oceanography, soil science, agriculture, geology, natural resources, epidemiology, etc. With the use of spatial statistics becoming more popular across different disciplines, it is currently one of the most active research areas in statistics. Gelfand et al. (2010) viewed spatial statistics as being comprised of three major categories: continuous spatial variation, discrete spatial variation, and spatial point patterns. Continuous spatial variation, which focuses on the study of continuous spatial processes, includes topics such as geostatistical modeling and inference, likelihood-based approaches, spectral methods, hierarchical modeling, spatial design, etc. Spatial regression lies at the center of this category and connects with all the other topics.

1.1 Spatial Regression

Regression is a technique used to examine the relation of a dependent variable to specified covariates. When the data have a spatial component, the regression model has to recognize and adapt to this change. In this case, we call it a spatial regression model. A general form of the spatial regression model that we studied in the dissertation is as follows:

y(s) = f(s; X(s)) + w(s) + ε(s)    (1.1)

where s := {s_1, ..., s_n} is the set of spatial locations; y(s) is the observed dependent variable at s; X(s) are the observed covariates at s; f(·) is an arbitrary function; w(s) is a stochastic process; ε(s) are the i.i.d. errors.

According to the general form (1.1), spatial regression models can be divided into two categories.

(1) Spatial Linear Mixed Regression Model

In this case, w(s) ≠ 0: the effect of unobserved covariates is exhibited as spatial dependence, which is modeled by a stochastic process w(s). The function f(·) that models the effect of observed covariates has a linear form. The classical Kriging models (Cressie, 1993), which focus on estimating the first-order (large-scale or global trend) and second-order (small-scale or local) structure of y(s), fall in this category. For example, at any new spatial location s_0, the stochastic term in Kriging models is

w(s_0) = Σ_{i=1}^{n} λ(s_i) y(s_i).

Then,

• if f(s_0; X(s_0)) = 0, (1.1) is a simple kriging model.

• if f(s_0; X(s_0)) = α_0 + α_1 s_{0x} + α_2 s_{0y}, (1.1) is a universal kriging model.

• if f(s_0; X(s_0)) = X(s_0)β, (1.1) is a regression kriging model.

Spatial linear mixed regression models have a long history in spatial statistics (Cressie, 1993). With the advantages of a solid theoretical foundation, simple mathematical formulas, and good interpretability, they are widely applied in different disciplines, such as geography (Haining et al., 2010), ecology (Robertson, 1987), meteorology (Dobesch et al., 2007), etc.

(2) Spatial Nonlinear Regression Model

In this case, w(s) = 0, meaning that only the effect of observed covariates is considered in the model. The spatial dependence effect is modeled by the function f(·). Since the linear regression model is trivial, we are interested in the nonlinear ones, e.g., the popular machine learning models such as ensemble models (Random Forest, XGBoost), kernel-based models (Support Vector Machine), Neural Networks, etc. With the rise of machine learning, their application in spatial analysis is growing rapidly, especially in the direction of deriving spatial predictions for spatial regression (Appelhans et al., 2015) (Li et al., 2011) and detecting spatial patterns (Williamson et al., 2020).

Similar to ordinary statistical regression, there are two major goals in spatial regression: prediction and interpretation. Figure 1.1 illustrates the relative positions of the two categories of models in the coordinate system of prediction and interpretation. In real applications, if our goal is to get a good prediction, then spatial nonlinear regression models are good choices. If, instead, understanding the relationships between Y and X is a top priority, we prefer the spatial linear mixed regression models, which are much easier to explain than the nonlinear ones. However, there is a trade-off between prediction and interpretation. We will discuss it in Chapter 3.

Figure 1.1: Prediction Vs Interpretation

1.2 Gaussian Process

In the spatial regression model (1.1), the term w(s) is usually a Gaussian process. A Gaussian process is the well-known extension of the multivariate Gaussian distribution to infinite-sized collections of real-valued variables. This extension can be used to infer a distribution over functions: first, the Gaussian process defines a prior over functions; then, after obtaining some data, this prior is converted into a Gaussian process posterior.

Suppose we choose a particular finite subset of these random function variables, f = {f_1, ..., f_N}, and the data {Y = {y_1, ..., y_N}, X_Y = {X_{Y,1}, ..., X_{Y,N}}} as the prior distribution. By the properties of the Gaussian process, f follows a multivariate Gaussian distribution:

p(f | Y, X_Y) ∼ N(0, K_Y)

where (K_Y)_{ij} = C(X_{Y,i}, X_{Y,j}) and C(·, ·) is a covariance function. In spatial regression models, we always assume the mean of the Gaussian process is zero.

If some new data {Z = {z_1, ..., z_M}, X_Z = {X_{Z,1}, ..., X_{Z,M}}} are observed, we can get the posterior distribution by the properties of the conditional multivariate Gaussian distribution:

p(f | Z, X_Z) ∼ N(μ, Σ)

where μ = K_{ZY} (K_Y)^{-1} Y and Σ = K_Z − K_{ZY} (K_Y)^{-1} K_{YZ}, with K_{ZY} = C(X_Z, X_Y) = (K_{YZ})^T an M × N matrix and K_Z = C(X_Z, X_Z) an M × M matrix.

Figure 1.2 (a) shows 5 samples from a Gaussian process prior distribution, while (b) illustrates 5 samples from its posterior distribution after obtaining 8 new observations.

Figure 1.2: Sampling from Gaussian Process Prior and Posterior Distributions
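To make the prior-to-posterior update above concrete, the following sketch draws samples from a zero-mean Gaussian process prior and from the posterior conditioned on a handful of observations. It is a minimal illustration in Python with numpy, assuming a squared-exponential covariance; the parameter values and the toy data are illustrative and are not the settings behind Figure 1.2.

    import numpy as np

    def sq_exp_cov(a, b, sigma2=1.0, rho=0.5):
        # Squared-exponential covariance: sigma^2 exp(-||a - b||^2 / (2 rho^2))
        d = a[:, None] - b[None, :]
        return sigma2 * np.exp(-d ** 2 / (2.0 * rho ** 2))

    rng = np.random.default_rng(0)
    xs = np.linspace(0.0, 1.0, 200)            # grid on which f is sampled

    # Prior: f ~ N(0, K); small jitter keeps the covariance numerically PSD.
    K = sq_exp_cov(xs, xs) + 1e-9 * np.eye(xs.size)
    prior_draws = rng.multivariate_normal(np.zeros(xs.size), K, size=5)

    # Posterior after observing (X_Y, Y), here 8 toy observations.
    X_Y = rng.uniform(0.0, 1.0, 8)
    Y = np.sin(2.0 * np.pi * X_Y)
    K_Y = sq_exp_cov(X_Y, X_Y) + 1e-9 * np.eye(8)
    K_ZY = sq_exp_cov(xs, X_Y)
    mu = K_ZY @ np.linalg.solve(K_Y, Y)                  # K_ZY K_Y^{-1} Y
    Sigma = K - K_ZY @ np.linalg.solve(K_Y, K_ZY.T)      # K_Z - K_ZY K_Y^{-1} K_YZ
    post_draws = rng.multivariate_normal(mu, Sigma + 1e-9 * np.eye(xs.size), size=5)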

The zero-mean Gaussian process can be denoted by GP(0, C(·, ·|θ)), where θ is the set of parameters of the covariance function; the process is completely determined by its covariance function C(·, ·|θ). In order to model the spatial dependence, we assume the covariance function follows some spatial correlation structure. For example, a low-dimensional parametric correlation structure can be specified by the Matérn covariance function family (Stein, 1999) as follows:

C(s_i, s_j) = C_ν(||s_i − s_j||) = σ² (2^(1−ν) / Γ(ν)) (√(2ν) ||s_i − s_j|| / ρ)^ν K_ν(√(2ν) ||s_i − s_j|| / ρ)

where Γ(·) is the gamma function; K_ν(·) is the modified Bessel function of the second kind; ||s_i − s_j|| is the Euclidean distance between spatial points s_i and s_j. The set of parameters is θ = {σ², ρ, ν}, where ρ and ν are positive real numbers controlling the decay rate and the smoothness of the spatial correlation, respectively. There are two popular candidates from the Matérn covariance function family in spatial statistics.

• ν = 1/2: C_ν(||s_i − s_j||) = σ² exp{−||s_i − s_j|| / ρ}; this is the exponential covariance function.

• ν → ∞: C_ν(||s_i − s_j||) = σ² exp{−||s_i − s_j||² / (2ρ²)}; this is the Gaussian covariance function, also called the squared-exponential covariance function.
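The ν = 1/2 case can be checked numerically against the general Matérn form. A small sketch follows (all parameter values are arbitrary, for illustration only):

    import numpy as np
    from scipy.special import gamma, kv   # Gamma function and modified Bessel K_nu

    def matern_cov(d, sigma2=1.0, rho=1.0, nu=0.5):
        # General Matern covariance; C(0) = sigma^2 by continuity.
        d = np.asarray(d, dtype=float)
        scaled = np.sqrt(2.0 * nu) * np.maximum(d, 1e-12) / rho
        c = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
        return np.where(d == 0.0, sigma2, c)

    d = np.linspace(0.0, 3.0, 100)
    exp_cov = np.exp(-d / 1.0)                    # exponential covariance (nu = 1/2)
    sq_exp = np.exp(-d ** 2 / (2.0 * 1.0 ** 2))   # Gaussian covariance (nu -> infinity)
    assert np.allclose(matern_cov(d, nu=0.5), exp_cov, atol=1e-8)

For very large ν the general form approaches sq_exp, although evaluating K_ν at large orders is numerically delicate, which is one reason the closed-form special cases are preferred in practice.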

With the advantages mentioned above, the Gaussian process is prevalent in spatial statistical modeling today. However, computational issues arise when the data become big. This is because the likelihood computation for a Gaussian process observed at n spatial locations requires the inverse and determinant of the covariance matrix, whose exact calculation requires O(n³) operations and O(n²) storage. In recent years, due to advances in technology, massive spatial data are collected in various disciplines, so novel methods are required to overcome this challenge. Fortunately, there is a rich literature on this problem. Basically, the studies proceed along two tracks: low rank and sparsity. Low-rank approximation is a very active field in numerical linear algebra. Hackbusch (2015) developed the theory of hierarchical matrices, which can provide a low-rank approximation requiring only O(nk log(n)) units of storage and O(nk^α log(n)^β) operations for matrix multiplication, inversion, or determinant, where k is the rank parameter controlling the accuracy of the approximation and α, β ∈ {1, 2, 3}. Geoga et al. (2019) presented a kernel-independent method that applies hierarchical matrices to the problem of maximum likelihood estimation for Gaussian processes. There are also low-rank modeling methods in the spatial statistics community, e.g., Fixed Rank Kriging (Cressie and Johannesson, 2008), the predictive process model (Banerjee et al., 2008), and stochastic partial differential equations (Lindgren et al., 2011). On the other track, studies seek to introduce sparseness into the covariance or precision matrix. For example, Furrer et al. (2006) applied covariance tapering to create a sparse approximate linear system that can then be solved using sparse matrix algorithms, and Datta et al. (2016) extended Vecchia's approximation (Vecchia, 1988) to a Gaussian process, creating sparse precision matrices by using conditional independence given information from neighboring locations.

1.3 Overview

Figure 1.3 shows the two categories of spatial regression models and my study roadmap. My research involves both application and methodology problems. They are organized in the rest of the dissertation as follows.

Figure 1.3: Study Roadmap

• Chapter 2 introduces a real-world application that applies nonlinear regression models to predict the spatial distributions of reef fish abundance in the Gulf of Mexico. A multistage workflow is proposed to overcome the challenges of zero-inflated data and out-of-sample prediction. The nonlinear predictions are compared to the predictions of other methods.

• Chapter 3 aims to develop a spatial regression model that can break the trade-off between prediction and interpretation. A two-stage universal regression Kriging and generalized additive model is built to achieve this goal. The model's interpretability and prediction accuracy are tested on both real and simulated data.

• Chapter 4 proposes a new BART model and combines it with the Gaussian process to build a nonlinear spatial regression model, named Gaussian process BART. The methods for parameter estimation are discussed. Two experiments and a real-data test are given to examine the effectiveness of the Gaussian process BART model.

• Chapter 5 summarizes the key contributions of the dissertation and discusses ideas for future work.

Chapter 2

REAL DATA CHALLENGES AND NONLINEAR MODELS

Understanding the spatial distribution of abundance is fundamental to assessing and managing organism populations. However, the task becomes difficult for marine species due to the low detection rates that come with sampling underwater. In this chapter, we propose a multistage statistical workflow that applies nonlinear spatial regression models to tackle this problem. Section 2.1 introduces the problem, the data, and the challenges. Section 2.2 presents the multistage statistical workflow we proposed to overcome these challenges, along with the methods used to solve the zero-inflated data problem. In Section 2.3, we apply nonlinear spatial regression models to predict the spatial distributions of reef fish abundance and propose three strategies to handle the out-of-sample problem. In Section 2.4, the nonlinear prediction results are compared to those of linear regression and catch-per-unit-effort models.

2.1 Introduction

The ability to map the abundance of organisms across space is a critical precursor for many applied research applications that support sustainable environmental resource management. This includes understanding linkages between species and habitat use (Mateo-Sánchez et al., 2015), establishing protected areas (Lin et al., 2017), building population and ecosystem models (Stratford et al., 2016), using such ecological analyses to develop biologically and economically sustainable management policies (Guisan et al., 2013), etc. However, in most cases, the real data needed to develop such maps either do not exist or are zero-inflated or unevenly distributed across space (Prosser et al., 2018). This is because field sampling efforts can be expensive or often must collect information for multiple applications. For example, collecting data from the marine environment can be particularly difficult due to cost and logistical considerations in accessing remote locations, the inability to transmit radio or satellite signals from underwater, and visual limitations associated with water clarity and water-column light attenuation. The result is fewer observations of species from the marine environment with which to develop distribution maps.

Independent of the ecosystem, a variety of techniques have been employed over the years to maximize the use of zero-inflated sampling data, such as variogram estimation and random field simulation (Saul and Purkis, 2015), generalized additive models (Drexler and Ainsworth, 2013), additive beta regression (Ros-Pena et al., 2018), etc. However, most of them assume that samples are evenly distributed in the study region or that their values follow specified distributions, e.g., a conditional normal distribution or a beta distribution. In practice, these assumptions may be violated, because most organisms are distributed in patchy patterns across the landscape or seascape (Ainsworth et al., 2016). The underlying reason is that organism abundance directly depends on the habitat environment, and most environmental covariates follow patchy spatial patterns. Although, for some organisms, approaches that work on commercial data, like catch-per-unit-effort data (McDonald et al., 2001), can extract useful maps of abundance, models with environmental covariates are more promising owing to their high interpretability and low bias (Streich et al., 2017). Many methods that map organism spatial abundance use the relationship between abundance and environmental covariates. These models can be divided into two categories: linear models and nonlinear models. Linear models have been extensively studied for predicting organism abundance (Guisan et al., 2002). They have many advantages: the models are easy to construct, the parameters and results are highly interpretable, and the computation is efficient in both time and resources. Linear models perform well on large spatial scales, identifying the overall trend or gradient of abundance (Guisan et al., 2002). However, most relationships in the real world are intrinsically nonlinear rather than linear in nature. In the past few years, with the rise of machine learning, nonlinear models have been widely applied in organism population prediction (Ye et al., 2019). They are good at capturing the subtle nonlinear relations between abundance and environmental covariates and the complex interactions among covariates, so nonlinear models also have the ability to identify local trends at fine spatial scales. Compared to linear models, nonlinear ones can normally provide more accurate predictions at the cost of sacrificing interpretability.

2.1.1 Data

The study region was the area with depth between 1 and 200 meters in the northern Gulf of Mexico (Figure 2.1). Two types of datasets, video survey data and interpolated bottom habitat data, were used in the study.

Figure 2.1: Study region and video survey data. The zoomed area shows the zero-inflated feature of the sampling data. Red dots represent positive samples; the value on each dot is the number of fish sampled.

Three independent fishery video surveys are carried out annually to collect information on the abundance of shallow-water reef fish species throughout the Gulf of Mexico. The first is sponsored by the National Marine Fisheries Service (NMFS) Panama City, Florida laboratory; the second is part of the Southeast Area Monitoring and Assessment Program (SEAMAP); and the third is sponsored by the State of Florida Fish and Wildlife Commission (FWC). Each survey is methodologically standardized to the others, which allows us to merge them into a single dataset with trivial effort. The video surveys target commercially important species, such as red grouper (Epinephelus morio), red snapper (Lutjanus campechanus), gag grouper (Mycteroperca microlepis), mutton snapper (Lutjanus analis), etc. The video surveys followed a two-stage sampling design. The first-stage or primary sampling units (PSUs), located in the most probable habitat areas, were spatial blocks of 10 minutes of latitude by 10 minutes of longitude. The second-stage or ultimate sampling sites (USS) were point locations randomly chosen within the PSUs. The sampling gear consisted of four cameras mounted orthogonally to each other. The cameras were deployed at each location for 20 minutes and recorded every species encountered. The camera sampling protocol included the use of bait at the center of the four-camera array to increase the positive detection rate. The video footage was read by several technicians to identify and enumerate the species observed. In order to avoid double counting, the count value was set as the maximum number of the species recorded in a frame during the 20-minute sampling period (Somerton and Glendhill, 2005). The sampling design wasn't optimal due to budget constraints. For example, the short sampling time may be the prime reason for the zero-inflated sampling results (Figure 2.1). Another defect is that the sampling sites were not evenly distributed across the study region; in the prediction stage, this causes the out-of-sample prediction problem in the blank areas (Figure 2.1).

Bottom habitat information was provided by the dbSEABED database. The dbSEABED project produced detailed mappings of the sea floor in various locations by interpolating from all available point datasets. Individual raw data points were screened for quality control before being used for interpolation. Isotropic, binned semivariograms were used to interpolate the point data to raster maps (Goff et al., 2008). Separate maps can be generated to describe each benthic environmental variable. In this study, the environmental covariate maps (percentage content) used for prediction include sand, gravel, mud, sediment grain size, carbonate, clay, and rock. One defect of the dbSEABED database is that the benthic samples collected over the years were more concentrated in nearshore areas than offshore ones. This gives the data lower variance, and thus higher accuracy, in nearshore areas than in offshore areas. Despite this, the dbSEABED dataset is the most spatially comprehensive habitat data publicly available for the Gulf of Mexico at this time.

2.2 Multistage Workflow

The multistage workflow (Figure 2.2) starts by simulating the video survey process to generate simulated sampling outcomes under different settings. In the second stage, a method named empirical maximum likelihood analysis works with the simulated sampling data to find a relationship between the video survey data (catch ratio) and fish abundance (empirical maximum likelihood density). The relationship is represented by an empirical maximum likelihood density function, which is the key to addressing the zero-inflated issue. Then, taking as input the real video survey data and the empirical maximum likelihood density function, a two-step random smoothing method is employed in the third stage. In step one, spatial abundance is estimated in the sampling areas. In step two, uncertainties in the abundance estimates are effectively removed to produce the block spatial abundance that serves as training data for the next stage's models. In the final stage, working on the training data and environmental covariates (habitat data), nonlinear spatial regression (machine learning) models coming from three different families, support vector machine, neural networks, and random forest, are assembled to generate a high-accuracy, low-bias prediction of the abundance spatial distribution.

Figure 2.2: There are four stages, with different methods/models and data in each. Generated data means the data were generated by the model or prior knowledge; in contrast, real data were collected from the real world.

In the rest of this section, we will go through Stage I to Stage III. The methods and models in these stages work together to tackle the zero-inflated problem of the video survey data.

2.2.1 Video Survey Simulation

To make the most use of the video survey data, an individual-based discrete event simulation was developed to model the video survey process (Pfeffermann, 2013). It was developed using the PyGame library in the Python programming language (Kelly, 2016). The following assumptions were made for the simulation:

(1) Site fidelity

Red grouper excavate benthic material to create nests or pits in which they live, and from which they exhibit high site fidelity (Harter et al., 2017). Red snapper exhibit less site fidelity than red grouper but spend continuous periods of time at one site, on the order of months or years, before moving to another habitat location. Thus, at short time intervals, such as the length of the camera sampling protocol, individuals were assumed to have strong site fidelity in the simulation.

(2) Fish home and behavior

In order to model site fidelity, we assumed each fish moves around near its home. A fish was able to explore surrounding places by wandering in a random fashion. In the simulation, wandering was implemented by a Markov chain Monte Carlo process whose random fashion followed an isotropic bi-normal spatial distribution around the home (Figure 2.3). It was meant to represent activities such as food foraging, similar to central place foraging theory (Schoener, 1987). A fish had a 68% probability of moving within one standard deviation of the isotropic bi-normal spatial distribution, and a 95% probability of moving within two standard deviations. Home range was defined as the spatial distance of two standard deviations. For two neighboring homes, the average percentage of overlap is less than 50% (Farmer and Ault, 2011). The settings of parameters such as home range, fish movement frequency, speed, and turning angle followed Farmer's papers (Farmer and Ault, 2011), (Farmer and Ault, 2014), which conducted a thorough investigation of the movement of reef fish species in the Gulf of Mexico.

(3) Camera and bait

In the video survey, the vulnerability of fish to being sampled by the camera gear was enhanced by placing bait at the location of the camera array to attract fish. We modeled the bait effect on video sampling. The chance that a fish would detect the bait and come to it is determined by two factors: the diffusion rate of the scent of the bait, and the probability that a fish within the range of bait detectability could detect the bait (Stoner, 2004). We assumed that the bait odor had the highest intensity at the camera sampling gear and spread with intensity diminishing exponentially moving away from the camera. The attenuation of bait odor through the water column is an understudied, complex process, as is the probability that a fish nearby will detect it. The few studies that have been done suggest a wide range of distances from which fish can detect bait, and the research suggests it is species dependent (Sainte-Marie and Hargrave, 1987). As a result, we made two assumptions: (a) that the radius of detectability from the camera array, meaning the maximum distance of odor spread, was 50 meters in 20 minutes, and (b) that the value of the shape parameter of the exponential bait detectability distribution was 0.05.

Figure 2.3: Video survey simulation. The green dots represent the locations of fish homes, the green concentric circles around each dot represent the first and second standard deviations, the red dot represents the location of the camera, the red circle represents the distance of bait detectability, which expands throughout the 20 minutes, and the winding trails represent the traces of the fish, each color corresponding to one fish.

(4) Interactions

In each simulation step, the program checks the location of each fish in relation to the location of the bait and the range of bait odor dissipation. If a fish entered the range of bait odor, it was assigned a probability of being sampled by the camera. Once a fish was sampled by the camera, it was removed from the simulation to avoid double counting.

The most important parameter in the simulation model was the number of fish homes. Since the simulation area was invariant and the number of fish in each home followed a known uniform discrete distribution, the number of fish homes scaled the abundance of fish at each sampling station. We tested a range of numbers of fish homes in the simulation. It was increased in steps of three, up to a number large enough to be constrained by the rule of less than 50% habitat overlap and the size of the simulation area. For each number, the simulation was run 5000 times; one simulation step corresponds to 3 seconds in real time, and the camera works for 20 minutes. The data generated by the video survey simulation are used in the empirical maximum likelihood analysis. A simplified sketch of one simulated camera deployment is given below.
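The following is a minimal sketch of the core of one simulated deployment: a fish wanders around its home via a Metropolis-style random walk targeting the isotropic bi-normal distribution, and bait detection succeeds with a probability that decays exponentially with distance inside the expanding odor radius. All names, the step size, and the decision rules are hypothetical simplifications of the full PyGame implementation.

    import numpy as np

    rng = np.random.default_rng(42)

    HOME_SD = 25.0           # one standard deviation of the bi-normal home range (m)
    MAX_ODOR_RADIUS = 50.0   # assumed maximum odor spread after 20 minutes (m)
    SHAPE = 0.05             # assumed exponential bait-detectability shape parameter

    def wander(home, pos):
        # Metropolis step targeting the isotropic bi-normal centered at the home.
        proposal = pos + rng.normal(0.0, 2.0, size=2)
        log_ratio = (np.sum((pos - home) ** 2)
                     - np.sum((proposal - home) ** 2)) / (2.0 * HOME_SD ** 2)
        return proposal if np.log(rng.uniform()) < log_ratio else pos

    def detects_bait(pos, camera, elapsed_frac):
        # Odor radius expands over the deployment; detection probability decays
        # exponentially with distance from the camera array.
        dist = np.linalg.norm(pos - camera)
        if dist > MAX_ODOR_RADIUS * elapsed_frac:
            return False
        return rng.uniform() < np.exp(-SHAPE * dist)

    camera = np.array([0.0, 0.0])
    home = rng.uniform(-200.0, 200.0, size=2)
    pos = home.copy()
    for step in range(400):                        # 400 steps x 3 s = 20 minutes
        pos = wander(home, pos)
        if detects_bait(pos, camera, (step + 1) / 400):
            break                                  # sampled fish leaves the simulation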

2.2.2 Empirical Maximum Likelihood Analysis

In statistics, traditional maximum likelihood analysis produces the maximum likelihood estimator (MLE) of an unknown parameter. This method typically involves three components: an analytic mathematical model, the target parameter, and the observed outcomes. Although our workflow contained a model (the video simulation model), target parameters (fish density), and observed outcomes (real video survey data), we are unable to apply the traditional maximum likelihood methodology because the video survey model is a programmed simulation model rather than an analytic mathematical model. As a result, we developed a novel method called empirical maximum likelihood analysis to tackle this issue. The method includes four parts: re-sampling, empirical probability mass functions, the empirical likelihood function, and the empirical maximum likelihood density function.

The procedure for re-sampling and creating empirical probability mass functions is as follows (a code sketch follows the list).

(1) Initialize the number of fish homes as n = 3.

(2) Sample 100 outcomes without replacement from the results of the video survey simulation. Calculate the ratio between the number of detected fish and the number of fish homes. We name this ratio the catch ratio (CR).

(3) Calculate the empirical probability mass function (pmf) of the discretized catch ratio values for each value of the number of homes.

(4) Increase the home number by n = n + 3 and repeat steps (2) and (3).
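A sketch of this procedure, assuming the simulation results are already collected as arrays of detection counts per home number (the curve-fitting step that interpolates the likelihood from gap three to gap one is omitted, and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_pmfs(sim_results, bin_width=0.01, n_draws=100):
        # sim_results: {home_number: array of simulated detection counts (5000 runs)}.
        # Returns {home_number: (catch-ratio bins, empirical pmf)}.
        pmfs = {}
        for n_homes, counts in sim_results.items():
            sample = rng.choice(counts, size=n_draws, replace=False)
            binned = np.round((sample / n_homes) / bin_width) * bin_width
            values, freq = np.unique(binned, return_counts=True)
            pmfs[n_homes] = (values, freq / freq.sum())
        return pmfs

    def empirical_mle(pmfs, observed_cr, bin_width=0.01):
        # Empirical likelihood of each home number at the observed catch ratio;
        # the empirical MLE is the home number that maximizes it.
        target = np.round(observed_cr / bin_width) * bin_width
        likelihood = {}
        for n_homes, (values, probs) in pmfs.items():
            hit = np.flatnonzero(np.isclose(values, target))
            likelihood[n_homes] = probs[hit[0]] if hit.size else 0.0
        return max(likelihood, key=likelihood.get)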

Once the probability mass functions were obtained, we can calculate the empirical likelihood function for each catch ratio. Figure 2.4 gives an example of how to build an empirical likelihood function from probability mass functions. The empirical maximum likelihood estimator of the home number under each catch ratio is then the global maximum of the empirical likelihood function. Since the number of fish in each home followed a uniform discrete distribution, the maximum likelihood estimator of fish homes can easily be converted to a maximum likelihood estimator of density as the ratio of the maximum likelihood estimate of the number of fish to the size of the area. This allowed us to build a function between the empirical maximum likelihood density and the catch ratio (Figure 2.5). This function handles the spatial sparsity and zero-inflated characteristics of the video survey data: even if the catch ratio is close to zero, a non-zero maximum likelihood density can be calculated.

There is a linear relationship between the empirical maximum likelihood density and the catch ratio. Changing parameter values in the video simulation model, such as the parameter values for the camera, bait, and fish behavior, only affects the coefficient (slope) of this linear relation. However, the coefficient cancels out when we transform the absolute value of abundance to the relative value of abundance, i.e., the spatial distribution. For the purposes of this study, we were only interested in the abundance spatial distribution, so we did not need to worry about the parameter settings in the video simulation model: they influence the linear coefficient rather than the final spatial distribution. When the goal is to estimate the real abundance, however, the parameter setting is essential.

Figure 2.4: For each home number, there is a corresponding probability mass function for the discrete catch ratios. (a), (b), and (c) are 3 examples of pmfs. Given a value of the catch ratio, the discrete empirical likelihood function can be obtained (d). Since the gap between home numbers was 3, we can fit a curve (d) to get the discrete empirical likelihood function with gap one. Finally, the empirical maximum likelihood estimator can be found from this discrete empirical likelihood function. In this example, when the catch ratio was 0.05, the empirical MLE of the home number was 19.

2.2.3 Random Smoothing Estimation

The spatial abundance, or empirical maximum likelihood density, was estimated by random smoothing (Figure 2.6). First, the sampling area was rasterized into grid cells, each approximately 0.25 square kilometers. Then, random smoothing was carried out by randomly drawing circular windows in the area (Figure 2.6). In each smoothing window, the catch ratio was calculated from the video survey data. Using the empirical maximum likelihood density (EMLD) function, we can then assign the empirical maximum likelihood density (abundance) to all the grid cells in the smoothing window.

Figure 2.5: Empirical Maximum Likelihood Density Function

The smoothing windows may overlap with each other, so a grid cell can be covered by different windows, and hence different empirical maximum likelihood densities may be assigned to the same grid cell. In order to combine all the different values for a grid cell, a weighted mean empirical maximum likelihood density was taken. The weights were determined by calculating a credibility statistic for each window. The credibility was defined as follows:

c(x) = x / N    (2.1)

where c(x) is the credibility; x is the number of samples in the smoothing window;

N is sample size.

In order to penalize the windows with low credibility, we calculate the weight of

each window:
w_i = c_i² / Σ_{j=1}^{n} c_j²,   i = 1, 2, ..., n    (2.2)

Figure 2.6: The procedure of random smoothing estimation starts with rasterizing, (a) to (b). Then, windows are randomly drawn in the sampling area up to a large enough number, (c) to (d). This number can be determined by checking the convergence of gemld.

where w_i is the weight of the i-th window; c_i is the credibility of the i-th window; n is the number of random smoothing windows.

Thus, the weighted mean empirical maximum likelihood density of a grid cell can be denoted as follows:

gemld_k = Σ_{i=1}^{n} w_{ik} · wemld_i,   k = 1, 2, ..., N    (2.3)

where gemld_k is the weighted mean empirical maximum likelihood density of the k-th grid cell; w_{ik} is the weight of the i-th random smoothing window covering the k-th grid cell; wemld_i is the empirical maximum likelihood density of the i-th random smoothing window; n is the number of random smoothing windows covering the k-th grid cell; N is the number of grid cells.
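A vectorized sketch of equations (2.1)-(2.3), assuming the per-window credibilities and density estimates have already been computed, and interpreting the normalization in (2.2) as running over the windows covering each cell, consistent with the definitions above (array names are illustrative):

    import numpy as np

    def grid_emld(window_emld, window_cred, cover):
        # window_emld: (n,) EMLD estimate of each smoothing window
        # window_cred: (n,) credibility c_i = x_i / N of each window (eq. 2.1)
        # cover: (n, N_grid) boolean; cover[i, k] is True if window i covers cell k
        c2 = window_cred ** 2
        num = (cover * (c2 * window_emld)[:, None]).sum(axis=0)
        den = (cover * c2[:, None]).sum(axis=0)       # per-cell normalizer, eq. 2.2
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(den > 0, num / den, np.nan)   # gemld_k of eq. 2.3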

2.2.4 Reducing Uncertainty

In random smoothing, uncertainties were introduced by the randomized smoothing windows. We consider two concepts to quantify these uncertainties: credibility and variance. Credibility measures uncertainty from a Bayesian perspective, while variance measures uncertainty from a frequentist perspective.

For each grid cell, we can calculate its credibility mean as follows:

gmc_k = (1/n) Σ_{i=1}^{n} gc_{ik},   k = 1, 2, ..., N    (2.4)

where gmc_k is the mean credibility of the k-th grid cell; gc_{ik} is the credibility of the i-th window covering the k-th grid cell; n is the number of random smoothing windows covering the k-th grid cell; N is the number of grid cells.

The variance can be calculated as:

gv_k = var(S_k),   k = 1, 2, ..., N    (2.5)

where gv_k is the variance of the empirical maximum likelihood density of the k-th grid cell; S_k = {gemld_1, ..., gemld_n}; n is the number of random smoothing windows covering the k-th grid cell; N is the number of grid cells.

Based on the above definitions, we developed a method, hereafter referred to as the Bayesian and frequentist scissors, to eliminate uncertainties using a priori determined thresholds:

Scissors_B = quantile_gmc{gmc_i, i = 1, ..., N}    (2.6)

Scissors_F = quantile_gv{gv_i, i = 1, ..., N}    (2.7)

where N is the number of grid cells.

The scissors worked as follows.

G_BS = {gmc_k : gmc_k ≥ Scissors_B, k = 1, ..., N}    (2.8)

G_FS = {gv_k : gv_k ≤ Scissors_F, k = 1, ..., N}    (2.9)

where G_BS is the set of grid cells after the Bayesian scissors cut; G_FS is the set of grid cells after the frequentist scissors cut; N is the number of grid cells.

The final set of grid cells, G_L, is obtained as the intersection of G_BS and G_FS:

G_L = G_BS ∩ G_FS    (2.10)

G_L is the block spatial abundance with low uncertainty. Figure 2.7 shows how random smoothing and the Bayesian/frequentist scissors are applied to get the block spatial abundance.
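A compact sketch of the scissors (2.6)-(2.10); the quantile levels are illustrative placeholders for the a priori determined thresholds:

    import numpy as np

    def scissors(gmc, gv, q_bayes=0.5, q_freq=0.5):
        # gmc: (N,) per-cell mean credibility; gv: (N,) per-cell variance of gemld.
        scissors_b = np.quantile(gmc, q_bayes)    # eq. (2.6)
        scissors_f = np.quantile(gv, q_freq)      # eq. (2.7)
        g_bs = gmc >= scissors_b                  # eq. (2.8): high credibility
        g_fs = gv <= scissors_f                   # eq. (2.9): low variance
        return np.flatnonzero(g_bs & g_fs)        # eq. (2.10): G_L = G_BS ∩ G_FS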

Figure 2.7: An example of applying random smoothing and the Bayesian/frequentist scissors to get a low-uncertainty set of label data. Panel (a) shows the video survey data of red grouper in a small region; the red dots with small numbers identify the number of red grouper found in the survey. The result of the random smoothing is shown in panel (b). The spatial mean credibility and the spatial sample variance are shown in panels (c) and (d) respectively. Panels (e), (f), and (g) show G_BS, G_FS, and G_L respectively.

Up to now, we have successfully converted the zero-inflated video survey data into high-credibility, low-variance block spatial abundance data. In the next section, the latter will be used as the training data for the spatial nonlinear regression models.

2.3 Non-linear Models and out of Sample Prediction

As discussed in Section 2.1, our final goal is to get a good prediction of the abundance spatial distribution. Nonlinear spatial regression (machine learning) models are good choices for this task. With the ability to capture the subtle nonlinear relations between abundance and environmental covariates and the complex interactions among covariates, nonlinear models are able to produce high-accuracy, low-bias predictions that reflect the near-real patchy patterns at both large and fine scales. In addition, there are no preconditions on the distributional assumptions of the data, a feature that greatly enhances their adaptability to different datasets. We chose nonlinear models from three families: multilayer perceptron, random forest, and support vector machine. The multilayer perceptron is a type of artificial neural network that carries out supervised learning via a back-propagation training algorithm (Ramchoun et al., 2016). By bootstrapping the input data, random forest models use a combination of multiple decision trees to implement prediction (Breiman, 2001). The support vector machine makes predictions by classifying the input dataset into discrete classes across a separating hyperplane; moreover, it can incorporate multiple variables to map correlations in nonlinear space to improve predictions (Cristianini and Shawe-Taylor, 2000). The machine learning models took the block spatial abundance as their training dataset. The predictors were habitat environmental covariates, including location (latitude and longitude), depth, rugosity, sand, gravel, mud, sediment grain size, carbonate, clay, and rock. As discussed in Section 2.1, the video survey data are clustered rather than evenly distributed in the study region. This caused some blank areas in which block spatial abundance (training data) was absent. Predicting in these blank areas encounters the out-of-sample prediction problem, which can introduce great uncertainty and produce erroneous predictions at high risk (Wenger and Olden, 2012). Therefore, we proposed three strategies, prior knowledge, aggregation, and iteration, to overcome this problem.

2.3.1 Prior Knowledge

The goal of involving prior knowledge is to extend the set of training data into the areas where video survey data were missing. First, we need to identify the areas where prior knowledge can be engaged confidently. For example, it is well known that red grouper is predominantly distributed with high abundance in the eastern portion and very low abundance in the western portion of the Gulf of Mexico. Second, we created several initial predictive maps by running different machine learning models (Figure 2.8 b, c, d). Then, the prior knowledge about the spatial distribution was represented by assigning weights to each initial prediction (Table 2.1). Finally, the new training data can be calculated from the weighted combination, as sketched after Table 2.1.

Table 2.1: An example of the prior knowledge (weights) applied in Figure 2.8.

              Prior Weights
    Area      LR     MLP    SVM
    Area 1    0.1    0      0.9
    Area 2    0.1    0.45   0.45
    Area 3    1/3    1/3    1/3
    Area 4    0.1    0.45   0.45
    Area 5    0.3    0.4    0.4
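A minimal sketch of the weighted combination for one area, using the Table 2.1 weights for Area 1 (the prediction arrays are random placeholders standing in for the initial predictive maps):

    import numpy as np

    rng = np.random.default_rng(1)
    n_cells = 500
    # Initial predictions for the grid cells of Area 1 (placeholder values).
    pred_lr = rng.uniform(0.0, 1.0, n_cells)    # linear regression
    pred_mlp = rng.uniform(0.0, 1.0, n_cells)   # multilayer perceptron
    pred_svm = rng.uniform(0.0, 1.0, n_cells)   # support vector machine

    w_lr, w_mlp, w_svm = 0.1, 0.0, 0.9          # Table 2.1 prior weights for Area 1
    new_training = w_lr * pred_lr + w_mlp * pred_mlp + w_svm * pred_svm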

The limitation of this strategy comes from the lack of prior knowledge. For example, we can't use it for red grouper in the gap areas of the eastern Gulf of Mexico, because red grouper live there and any inaccurate prior knowledge would cause bias. The same thing happens with red snapper, which lives throughout the entire Gulf of Mexico.

Figure 2.8: The top panel shows the video survey of red grouper throughout the Gulf of Mexico. Red points indicate positive samples and green points indicate negative (zero) ones. The three panels at the bottom show initial predictions of spatial abundance in the western portion of the Gulf, generated by linear regression, multilayer perceptron, and support vector machine respectively. The numbered polygons are the areas where prior knowledge (weights) was applied; the weights are shown in Table 2.1.

2.3.2 Aggregation

Aggregation is a model ensemble approach (Diesing and Stephens, 2015)(Stohlgren

et al., 2010) that combines predictions from different models to stabilize the final

prediction. In this study, three popular machine learning algorithms were used: mul-

tilayer perceptron, random forest and support vector machine. In each category,

a range of tuning parameters for that algorithm were tested in a range of possible

values. With multilayer perceptron algorithm, the number of hidden layers and the

number of neurons in each hidden layer was tried. With random forest algorithm,

the maximum number of features allowed in each decision tree, the number of trees

to build before averaging for prediction, the number of levels in each decision tree

(maximum depth), and the minimum sample leaf size (size of the end node) were

tested. With the support vector machine algorithm, the kernel parameters, including

the type and the shape of the hyperplane, were tested. Based on the

mean test score obtained from cross-validation, we narrowed down the number of

candidate models using the criterion that the mean test score fell in the range between

0.65 and 0.95. This criterion worked well with model combination, keeping the

balance between underfitting and overfitting. Table 2.2 shows the selected models for

aggregative prediction of red grouper spatial abundance.

With the 33 selected models in Table 2.2, we fit them to the training data and then

made predictions. The final prediction is estimated by a suitable statistic of the 33

predictions, such as the mean, weighted mean, or median. In this study, we chose the median.
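
A minimal R sketch of this selection and aggregation step might look as follows, assuming preds is a matrix of per-cell predictions (one column per candidate model) and scores holds the mean cross-validation test scores; the names are ours:

# A minimal sketch of the aggregation step.
# preds: n_cells x n_models matrix of predictions (hypothetical)
# scores: mean cross-validation test score of each candidate model
aggregate_prediction <- function(preds, scores, lo = 0.65, hi = 0.95) {
  keep <- scores >= lo & scores <= hi               # balance under/overfitting
  apply(preds[, keep, drop = FALSE], 1, median)     # row-wise median prediction
}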

Table 2.2: Parameter settings and mean test scores of the selected models for ag-
gregative prediction of red grouper abundance spatial distribution.

Model: MLP (tuning parameter: number of neurons in each hidden layer)
  (20,)           0.70046559
  (30,)           0.74232753
  (50,)           0.76654593
  (80,)           0.79432473
  (250,)          0.84556754
  (300,)          0.8656241
  (8, 8)          0.74759073
  (10, 10)        0.75592852
  (15, 15)        0.80448259
  (30, 30)        0.87254421
  (50, 50)        0.91866864
  (5, 5, 5)       0.70332429
  (7, 7, 7)       0.7580669
  (10, 10, 10)    0.80493967
  (13, 13, 13)    0.85251525
  (20, 20, 20)    0.89251978
  (30, 30, 30)    0.92922169

Model: Random Forest (tuning parameters: max features, number of estimators,
max depth, min samples per leaf)
  (0.3, 1000, 7, 150)    0.75829703
  (0.5, 1000, 7, 150)    0.78100404
  (0.5, 1000, 8, 150)    0.81256017
  (0.7, 1000, 8, 100)    0.84390548
  (0.5, 1000, 9, 100)    0.85742639
  (0.7, 1000, 9, 100)    0.87075123
  (0.7, 1000, 5, 50)     0.71127043
  (0.7, 1000, 5, 150)    0.70476749
  (0.7, 1000, 15, 100)   0.91193176

Model: Support Vector Machine (tuning parameters: C, γ)
  (30, 0.5)     0.92571115
  (50, 0.05)    0.74480628
  (50, 0.1)     0.79185051
  (50, 0.15)    0.8290371
  (50, 0.25)    0.88165398
  (70, 0.5)     0.93602774
  (100, 0.1)    0.80533548

2.3.3 Iteration

For each grid cell, the selected models generated predictions. Based on these

predictions, the mean, standard deviation, and coefficient of variation (CV) can be calcu-

lated. By choosing a threshold for the CV (e.g., CV ≤ 0.5), we can filter all the grid cells to keep

the ones with low variance. Then, the newly selected cells can be added to the original

training data (Figure 2.9). We can iterate this process to expand the training data.

However, in order to avoid systematic bias, it is better to use fewer than three iterations.

Figure 2.10 shows the distributions of the coefficient of variation (CV) before and

after the additional training data was added. It indicates that the one-time iteration

can greatly reduce the overall CV, thus stabilizing the final prediction.
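
A minimal R sketch of one such iteration, where preds holds the per-cell predictions of the selected models and train_x, train_y and blank_x are placeholder objects for the training data and the blank-area covariates:

# A minimal sketch of one iteration of training-data expansion.
cell_mean <- apply(preds, 1, mean)
cell_sd   <- apply(preds, 1, sd)
cell_cv   <- cell_sd / cell_mean             # coefficient of variation per cell

stable <- which(cell_cv <= 0.5)              # criterion CV <= 0.5
# Append the stable cells, with their mean prediction as pseudo-response,
# to the original training data before refitting the models.
train_x <- rbind(train_x, blank_x[stable, ])
train_y <- c(train_y, cell_mean[stable])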

Figure 2.9: Map of original training data (light blue) and new training data (dark
blue). The new training data was extended by one-time iteration with criterion
CV ≤ 0.5.

Figure 2.10: The distributions of coefficient of variation (CV) before and after one-
time iteration.

2.4 Results and Discussion

Our workflow worked well not only at large scale but also at fine scale. To demon-

strate this, Figure 2.11 compares it with the linear regression model.

The figure shows that the linear regression model can capture the overall trend across

the entire Gulf. However, at fine scale, the linear regression prediction was too smooth

to capture the patchy pattern. In comparison, the prediction of our workflow, here-

inafter referred to as non-linear prediction, was able to capture the overall trend as

well as the patchy pattern under high resolution. In addition, at some locations, the

linear regression prediction showed contradictions with the non-linear prediction. For

example, it is well known by biologists that red grouper occur at higher abundance

in areas with higher rugosity and more gravel on the sea bottom, because of their

affinity for structure and their role as ecosystem engineers excavating pits in which

to live (Harter et al., 2017) (Coleman et al., 2011). Circle number one in Figure 2.11

(panels b-1 and b-2) shows that the non-linear prediction correctly gave higher abundance

in areas of higher rugosity (panel c), while the linear regression prediction showed the

opposite. Similarly, when considering the locations of gravel habitat, circles numbered

two, three, and four in Figure 2.11 (panels b-1 and b-2) show that the non-linear pre-

diction correctly captures the positive relation between abundance and the level

of gravel, while the linear regression prediction shows inconsistent results. In particular,

the linear regression prediction at circle number two is self-contradictory.

The nonlinear prediction abundance maps of red grouper and red snapper are

shown in Figures 2.12(b) and 2.13(b). To validate the goodness of the predictions, we

chose the maps of spatial catch per unit effort (CPUE) obtained from fishery data

(Figures 2.12(a) and 2.13(a)) (McDonald et al., 2001). Figures 2.12(c) and 2.13(c)

show the predictions corrected by considering the impact of pollution and over-

Figure 2.11: Comparison between linear regression prediction and non-linear pre-
diction. Panels a-1 and a-2 show the two predictions in a large scale. Panels b-1 and
b-2 show the two predictions in a small scale which is 10 times finer than the large
scale (area in blue rectangle in panels a-1 and a-2). Panels c and d show the spatial
distributions of rugosity and gravel respectively corresponding to the area in panels
b-1 and b-2.

fishing. Overall, the non-linear prediction is consistent with the CPUE map. However,

it is important to acknowledge that they do not coincide completely, because

the quality, quantity, and location of fishery-dependent data were influenced by the

decision-making behaviors of commercial fishers (Saul et al., 2013). This biased the

CPUE map away from the real abundance distribution.

In the catch per unit effort (CPUE) map, bias can be introduced by the amount

Figure 2.12: Abundance spatial distribution of red grouper. The top panel rep-
resents spatial catch per unit effort map obtained from logbook data. The middle
panel represents the non-linear prediction. The bottom panel shows the prediction
corrected by considering the impact of pollution and overfishing.

of catch. For example, Figure 2.14(b) shows relatively low abundance in the areas

circled and numbered 3 and 4. In fact, there is a zone in the middle of these two areas

(Figure 2.14(a)). The sea bottom of this zone has high rugosity and is covered

by hard and soft corals. The environmental conditions are desirable for

red grouper (Coleman et al., 2011). As a part of the Florida Middle Grounds habitat

of particular concern project, this zone is protected from some fishing gear types

Figure 2.13: Abundance spatial distribution of red snapper. The top panel rep-
resents spatial catch per unit effort map obtained from logbook data. The middle
panel represents the non-linear prediction. The bottom panel shows the prediction
corrected by considering the impact of pollution and overfishing.

including bottom longlines, trawls, dredges, pots and traps (Lembke et al., 2017).

The CPUE prediction in this zone is highly biased, since it depends on the amount actually

fished. In contrast, the nonlinear models, which directly employed habitat information,

were able to appropriately predict near-real abundance based on real environmental

conditions. Figure 2.14(a) illustrates that the nonlinear prediction was able to capture

the high abundance in the protected zone.

Figure 2.14: Comparison of red grouper abundance maps between the non-linear pre-
diction and CPUE. The non-linear prediction can capture the high abundance patchy area
(between circles numbered 3 and 4). This area is highly suitable for red grouper
because it is covered by hard and soft corals and protected from some fishing gear
types including bottom longlines, trawls, dredges, pots and traps. However, the CPUE
map shows a highly biased prediction because it can be distorted by fishery policy.

The catch per unit effort (CPUE) map may also be biased by the amount

of effort. For example, in Figure 2.15(b), the non-linear prediction of red snapper

abundance in the western Gulf of Mexico shows that the abundance in the region num-

bered 1 was higher than in the regions numbered 2 and 3. This is reasonable because

the region numbered 1 in Figure 2.15(a) has a higher level of mud on the sea bottom, and

it is well known by biologists that red snapper occupy mud bottom during much of

their life history. However, the CPUE map in Figure 2.15(c) gives the opposite answer. We

can see there are only two seaports (in blue circles) in the western Gulf of Mexico. For

both of them, the cost of fishing in the region numbered 1 is higher than in the regions

numbered 2 and 3. The high cost, or effort, distorts the CPUE prediction in this

area far from the real abundance. To this end, by depending on fishery-independent

environmental data, nonlinear prediction maps can be less biased than CPUE maps.

Figure 2.15: Maps representing the spatial distribution of mud levels (panel a), the
non-linear prediction (panel b), and CPUE map (panel c). Blue triangles in blue
circles on panel c indicate the locations of fishing ports.

The generalizability of the multistage workflow can be explained as follows.

First, the distribution of many organisms is patchy across the landscape or seascape.

A strength of our workflow and nonlinear predictions (Figure 2.2) is the ability to

capture patch dynamics from sparse data, which will render them applicable to many

organisms.

Second, video survey methodology is commonly used to capture presence and

abundance data, both in marine and terrestrial ecosystems. As a result, our work-

flow is well suited for applications of many already existing video survey datasets

collected in a variety of ecosystems. In fact, different video survey simulations can flexibly

be adapted to this workflow without having to make any changes to the remaining stages.

Third, as mentioned at the end of Section 2.2.2, suppose what matters is the spatial distribu-

tion rather than the absolute abundance. Then the only thing you need to know is the

relationship (function) between the catch ratio and population density. If the rela-

tionship is a linear function, the first two stages of the workflow can be omitted. Even

the exact form of the linear function is not required. You can assign an arbitrary

value to the coefficient of the linear function and continue the subsequent stages of the

workflow. The effect of the coefficient will be canceled automatically. Fortunately, in

most cases, the relationship between catch ratio and population density is a linear

function. Otherwise, you have to find the exact form of this function; if that is possible,

you can still get rid of the first two stages of the workflow. Finally, the flexibility of the

workflow also comes from the loose coupling among its stages. For exam-

ple, a moving window smoothing can take the place of the random window smoothing.

As another example, you can add more nonlinear (machine learning) models or differ-

ent parameterizations. Note that if you change the set of nonlinear (machine

learning) models, the final prediction will change as well. However, when you apply

the techniques in Section 2.3, such as controlling the level of predictive accuracy,

aggregation, and iteration, the final prediction will stabilize.

In this chapter, we proposed a generalizable multistage workflow for the nonlin-

ear regression models to predict maps of abundance spatial distribution for reef fish

species. This workflow can effectively handle zero-inflated sampling data without

strong assumptions. The nonlinear prediction has the advantages of high accuracy,

low bias, and good performance across resolutions. Moreover, the high adaptivity of the

workflow makes it suitable for different applications and datasets.

Chapter 3

A TWO-STAGE MODEL

The purpose of the study in this chapter is to develop a spatial regression model for

analyzing the soil carbon stock (SOC) data. Different from the application in Chapter

2, the desired model should perform well in both prediction and interpretation. Un-

fortunately, as mentioned in Chapter 1, there is a trade-off between the two goals. For

example, generally speaking, the linear regression model has good interpretability but

bad prediction accuracy. In contrast, the nonlinear models are good at predicting but

the black-box property harms their interpretability. In this chapter, we propose a

two-stage model that tries to break the trade-off between prediction and interpretation.

Section 3.1 introduces the data we used in this study. In Section 3.2, a two-stage

model is proposed. The model's abilities in interpretation and prediction are discussed

from a conceptual point of view. The results and discussion are presented in Section 3.3.

3.1 Data

The soil carbon stock (SOC) data comes from the rapid carbon assessment study

initiated by the Natural Resources Conservation Services Soil Science Division of the

U.S. Department of Agriculture (USDA) (Staff and Loecke, 2016). More than 6200

sites across the conterminous United States were established according to a multilevel

stratified random sampling scheme. SOC stock for a fixed soil depth (0 - 30 cm) was

calculated using (3.1) (Adhikari et al., 2020). Figure 3.1 shows the map of SOC data

in a log transformed scale.

SOCstk = SOC × BD × D × (1 − CF/100)    (3.1)

where SOCstk is the SOC stock (Mg ha−1), SOC is the SOC content (g 100 g−1),

BD is the soil bulk density (Mg m−3), D is the given soil layer thickness (cm), and

CF is the volumetric fraction of the coarse fragments.

Figure 3.1: Soil carbon stock (SOC) data. The scale of the SOC data was transformed
by the natural log function. This transformation normalized the SOC data (Figure 3.2)
for the convenience of modeling.

A wide range of environmental covariates (31 variables) were collected and evalu-

ated as SOC predictors. Table 3.1 lists their name, a brief description and their source.

Figure 3.2 shows a summary of SOC data and eight environmental covariates.

Different from the application in Chapter 2, this data is neat and tidy. As Figure

3.1 shows, the samples are well scattered across the whole area, which is good for

prediction. In addition, there are 31 covariates, which is relatively sufficient for the

study of interpretation.

Table 3.1: Environmental Variables Description and Data Source

Environmental variable | Brief description | Data source
Precipitation (PPT) | 30-yr (1981 to 2010) annual average | http://www.prism.oregonstate.edu/normals
Precipitation of the driest season (PDRY) | 30-yr (1971 to 2000) annual average precipitation of the driest month | http://worldclim.org/bioclim
Potential evapotranspiration (PET) | 30-yr (1971 to 2000) potential evapotranspiration | https://doi.org/10.6084/m9.figshare.7504448.v3
Precipitation of the wettest season (PWET) | 30-yr (1971 to 2000) annual average precipitation of the wettest month | http://worldclim.org/bioclim
Dew point temperature (TD) | 30-yr (1981 to 2010) annual average dew point temperature | http://www.prism.oregonstate.edu/normals
Minimum temperature (TMIN) | 30-yr (1981 to 2010) annual average minimum temperature | http://www.prism.oregonstate.edu/normals
Mean temperature (TMEAN) | 30-yr (1981 to 2010) annual average temperature | http://www.prism.oregonstate.edu/normals
Maximum temperature (TMAX) | 30-yr (1981 to 2010) annual average maximum temperature | http://www.prism.oregonstate.edu/normals
Ecological region (ECOL3) | Ecological zone map at level 3 legend | Derived from gSSURGO
Net primary production (NETPP) | Annual terrestrial primary production | Derived from Landsat
Landsat Band 3 (RED) | Landsat Band 3 for 2014 | http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
Landsat Band 5 (SW1) | Landsat Band 5 for 2014 | http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
Landsat Band 7 (SW2) | Landsat Band 7 for 2014 | http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
National land cover database (NLCD) | Land cover of the United States for 2011 |
Potential vegetation (PVEG) | U.S. potential natural vegetation | Original Kuchler Types, v2.0
Normalized difference vegetation index (NDVI) | Calculated as (NIR − RED)/(NIR + RED), where NIR is the near-infrared band (Landsat Band 4) | http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
Elevation (DEM) | Land surface elevation | Derived from the national digital elevation dataset (NDEM) from U.S. Geological Survey
Slope aspect (ASPECT) | Direction of the steepest slope from the north | Derived from the DEM
Slope length factor (LSFACTOR) | Slope length factor calculated as in the USLE (universal soil-loss equation) | Derived from the DEM
Multi-resolution valley bottom flatness index (MRVBF) | Potential depositional areas | Derived from the DEM
Melton ruggedness number (MRN) | Melton ruggedness number | Derived from the DEM
Mid-slope position (MSPOS) | Covers the warmer zones of slopes | Derived from the DEM
Wetness index (SAGAWI) | Topographic wetness index with modified catchment area | Derived from the DEM
Slope height (SLOPEHT) | Height of the local slope | Derived from the DEM
Slope gradient (SLOPE) | Local slope gradient in percent | Derived from the DEM
Valley depth (VALDEP) | Calculates the extent of valley depth | Derived from the DEM
Drainage class (DRNG) | Natural soil drainage class | Derived from gSSURGO
Surface geology (GEOSUR) | Surficial geology class | Derived from gSSURGO
Hydrological group (HYDRO) | Hydrologic soil group class | Derived from gSSURGO
Soil order class (SOIL) | Taxonomy soil order class | Derived from gSSURGO
Soil temperature regime (SOILTR) | Soil temperature regime class | Derived from gSSURGO

Figure 3.2: The summary of SOC data and 8 environmental covariates. The num-
bers are correlation values between variables.

3.2 Methods

As mentioned in the beginning of this chapter, the challenge of this study is the

trade-off between prediction and interpretation. To break this trade-off, we propose a

novel two-stage statistical method that combines global, mostly linear effects (Stage-

1) with non-linear effects (Stage-2). In particular, Stage-1 relies on universal

regression kriging whereas Stage-2 is based on a generalized additive model with

splines to capture non-linear effects.

3.2.1 The Two-stage Model

The two-stage model is built on the basis of two well studied statistical models,

universal regression Kriging and generalized additive model.

1) Universal regression kriging (URK)

The Universal regression kriging relies on the expression of the quantity of

interest Y as follows.

Y(s) = f(s) + X(s)β + λ(s) + ε(s)    (3.2)

where s is the spatial location; f(s) is a low-degree polynomial function that

can capture the deterministic spatial trend of the dependent variable; X(s)β is a

regression part that captures the global linear relationship between the dependent

variable Y(s) and the explanatory covariates X(s); λ(s) is a stochastic part that

captures the spatial structure of the variable Y(s) and is generally assumed to

be a zero-mean stationary Gaussian process; and ε(s) is the nugget effect, usually

assumed independent across locations and identically distributed.

2) Generalized additive model (GAM)

Generalized additive models were originally invented by Hastie and Tibshirani

in 1986 (Hastie and Tibshirani (1986), Hastie and Tibshirani (1990)). GAM

assumes that the relationships between the individual predictors X and the depen-

dent variable Y follow smooth functions that can be linear or nonlinear. These

smooth functional relationships can be estimated and added up as the predictors

of E(Y|X), expressed as follows.

Y = β0 + f1(X1) + ... + fp(Xp) + ε    (3.3)

where fi(Xi) is an arbitrary smooth univariate function of Xi, usually based on

a basis decomposition such as splines, and ε are i.i.d. errors. Meanwhile, each predictor

function is constrained to have mean zero:

E(fi(Xi)) = 0, i = 1, ..., p

3) The two-stage model

The proposed two-stage universal regression kriging generalized additive model

is a workflow in which we apply the universal regression kriging in the first stage

and the generalized additive model in the second stage.

[Stage-1:] Universal regression kriging model

Y(s) = fspl(s) + X(s)β + λ(s) + δ(s)    (3.4)

where fspl(s) is a linear function of the spatial coordinates capturing the global

linear spatial trend; X(s)β is the linear regression of covariates representing the

global linear effects of the covariates X(s) on Y(s); λ(s) is a zero-mean stationary

Gaussian process which explains the global stationary spatial dependence of the

process Y; and δ(s) are the residuals of the first stage.

[Stage-2:] Generalized additive model

δ(s) = fsps(s) + Σ_{i=1}^{p} fi(Xi(s)) + ε(s)    (3.5)

where fsps(s) is a spatial smoother that handles the nonlinear and nonstation-

ary spatial dependence; Σ_{i=1}^{p} fi(Xi(s)) are the additive nonlinear univariate

functions of the covariates; and ε(s) is pure white-noise error.
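
As a rough illustration, the two stages can be chained with the R packages fields and mgcv (the packages used later in Section 3.2.2). Here locs (an n x 2 coordinate matrix), X (a covariate data frame with columns such as X1, X2) and y are placeholder names, and component or argument names may differ slightly across package versions:

library(fields)  # Stage-1: universal regression kriging
library(mgcv)    # Stage-2: generalized additive model

fit1  <- spatialProcess(x = locs, y = y, Z = as.matrix(X))  # spatial trend + Xb + GP
delta <- fit1$residuals                                     # first-stage residuals delta(s)

dat  <- data.frame(delta = delta, long = locs[, 1], lat = locs[, 2], X)
fit2 <- gam(delta ~ s(long, lat) + s(X1) + s(X2), data = dat)  # spatial smoother + splines
summary(fit2)   # smoother significance, as reported later in Table 3.3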

3.2.2 Model Interpretability and Analysis Flow

There is no mathematical definition of a model's interpretability. However, we can

consider it as the degree to which a human can understand/explain the model. In

this section, we discuss 3 levels of interpretability.

The concept that corresponds to the first-level (model-level) interpretability of a

model is the R² coefficient, expressed as the ratio of the model-explained variation to the

total variation.

R² = 1 − SSE/SST = Explained Variation / Total Variation
In the two-stage model, (3.4) and (3.5), all the components are additive. This is

an elegant and powerful assumption that offers a natural way to decompose the in-

terpretability of the model into its components. The key idea in the definition of R²

is the variation explained by the model. Similarly, we can generalize this idea to

the second-level (component-level) interpretability (Figure 3.3). In the first stage, the

Universal Regression Kriging (URK) models the global variations of the data. In par-

ticular, the URK decomposes the total variation into 4 parts: the variations explained

by a global linear spatial trend, the variations explained by the linear regression of

covariates, the variations explained by a zero-mean stationary spatial random process

and the remaining unresolved variations. In the second stage, the residuals of the URK

model become the input of the generalized additive model (GAM). The variations that

cannot be captured by URK will be handled by GAM. Similarly, GAM decomposes

the variations into a non-stationary or non-linear spatial component explained by a

spatial smoother, a nonlinear covariates component explained by spline smoothers,

and a pure error component which can be explained by neither URK nor GAM.

The third-level (element-level) interpretability is the explanation of the relationship

between the elements and the response variable Y in each component cited above at

the second level. For example, the global linear relationship between the covariates

X(s) and response variable Y (s) can be explained by the coefficients β, the global

stationary zero mean Gaussian process λ(s) can be characterized and explained by

the parameters of its covariance function. Since the element-level interpretability of

URK is simple and straightforward, we put more effort into GAM.

In general, GAM has the interpretability advantages of the multiple linear regression

Figure 3.3: Two-stage Universal Regression Kriging Generalized Additive Model

model where the contribution of each covariate to the response variable is clearly

encoded. In addition, GAM is substantially more flexible since the relationships

between covariates and dependent variables are not assumed to be linear. Since the

marginal impact of a single covariate Xi does not depend on the values of the other

covariates in GAM, we can simply interpret its relationship to the response variable

by exhibiting the univariate function fi(Xi). For example, in the synthetic example of

Figure 3.4, we can say that the expected value of the first stage residuals δ(s) increases

exponentially as X1(s) increases, holding everything else constant. Another important

feature of GAM, which also plays an important role in model interpretation, is the

ability to control the smoothness of the predictor functions. With GAMs, we can

impose the prior belief that the predictive function is inherently smooth in nature, even

though the dataset may suggest a more noisy function.

Figure 3.4: Generalized Additive Model Demo

As Figure 3.3 shows, the model is fitted in two stages, which leads to an analysis

conducted in two stages and brings the following advantages.

(1) Layers of analysis

In spatial data analysis, extracting global linear trend (with covariates) and

stationary spatial dependence is the first interest of geostatistical studies, which

we consider as the first layer analysis corresponding to the first-stage of the

proposed model. The second layer analysis that aims at revealing the nonlinear

relationships between response variable Y (s) and covariates X(s) coincides with

the second stage of the analysis flow. The second layer analysis is subtle and

builds on the basis of the first layer analysis. The order of these two layers is meaningful,

since it is challenging to separate the effects of global stationarity from the nonlinear

relationships between Y(s) and X(s) in the second layer if this order is not followed.

For example, the existing machine learning models that are applied in a spatial

context do not include an independent stochastic process, like the λ(s) in the first

stage, to capture the spatial dependence.

(2) Simplicity and flexibility

The universal regression kriging model and generalized additive model are well

studied in the statistical community. In the proposed two-stage model, we

connect them by following the simple rule that the former’s output will be

the latter’s input. Moreover, there are existing R packages that implement

these two models respectively. In this paper, we use the R packages "fields" (Nychka

et al., 2017) for URK and "mgcv" (Wood, 2019) for GAM. The

differences among packages mainly come from the target problems they aim

to solve and the algorithms used by the models. For example, fields and

FRK (Zammit-Mangion, 2020) are two R packages implementing the

universal regression kriging model. The main issue that FRK focuses on is the

computational intensity of large spatial data, while fields is designed as a

versatile tool for spatial analysis with moderate-size data. The algorithms they

utilize to estimate parameters differ as well: FRK uses the EM algorithm,

while fields uses the REML and GCV algorithms. The various choices of packages offer

great flexibility for analyzing data. We can select the most suitable packages

according to the requirements in practice.

3.2.3 Prediction Setup

In predictive modelling and especially with the increasing use of machine learning

techniques, a trade-off emerges between interpretability and accuracy of prediction.

One of the major goals of this paper is to find an optimal framework to balance

this trade-off. As Figure 3.3 shows, the two-stage model can accommodate linear

and nonlinear, stationary and non-stationary variations. In the following section,

we assess the predictive accuracy of the proposed two-stage model by comparing it

with popular machine learning (nonlinear) models on simulated data and the soil organic

carbon data.

Figure 3.5 illustrates the framework of model comparison. Five models are eval-

uated and compared: an ordinary linear regression model, the proposed two-stage

model, a random forest model, a gradient boost model and a support vector ma-

chine model. Since some models, like gradient boost and support vector machine,

Figure 3.5: Framework of Model Comparison

can not handle categorical data directly, an encoding process or feature engineering can

be performed, as discussed in Section 3.3.2. In order to avoid data leakage, nested

resampling was applied to tune the hyperparameters of the machine learning models.

Then, all five models were compared through a shared cross-validation scheme, as

illustrated in Figure 3.5 (right). Statistics of combined test data, like the predictive

root mean square error (RMSE) or predictive R2 , were used to evaluate and compare

the accuracy of predictions.
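
A minimal R sketch of the shared cross-validation scheme, where dat holds the response y and all covariates, and fit_fun and pred_fun are placeholder wrappers around any one of the five compared models:

set.seed(42)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
obs <- pred <- numeric(0)
for (i in 1:k) {
  fit  <- fit_fun(dat[folds != i, ])                 # train on k-1 folds
  obs  <- c(obs, dat$y[folds == i])
  pred <- c(pred, pred_fun(fit, dat[folds == i, ]))  # predict held-out fold
}
rmse <- sqrt(mean((obs - pred)^2))                           # predictive RMSE
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # predictive R^2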

3.3 Results and Discussion

In this section, we exhibit the results in terms of interpretability and prediction

of the proposed two-stage model fitted on the SOC data and its covariates described

in Section 3.2.

3.3.1 Interpretation of Fitted Model

(1) Variable selection

The principle of Occam's razor states that among several plausible explanations

for a phenomenon, the simplest is best. Simplicity plays an important role in a model's

interpretability. Since we want to explain the data in the simplest way, redundant pre-

dictors should be removed. Moreover, unnecessary predictors will add noise to the

estimation of other quantities that we are interested in. So, the first thing we need

to do is variable selection. Since there are several categorical variables in the data,

we choose the group lasso (Yuan and Lin, 2006) to select the important variables to be

included in the regression model.
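
A minimal R sketch of this selection step, assuming the grpreg package (one of several R implementations of the group lasso); X, y and grp are placeholder names:

# X: model matrix with dummy columns for the categorical covariates;
# grp: vector assigning every column of X (e.g., all dummies of SoilOrder)
# to its covariate group.
library(grpreg)
cvfit    <- cv.grpreg(X, y, group = grp, penalty = "grLasso")
beta     <- coef(cvfit)                          # coefficients at the CV-selected lambda
selected <- unique(grp[beta[-1] != 0])           # groups with non-zero coefficients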

Figure 3.6: Group Lasso Variable Selection

The covariates selected by group lasso are as follows.

ASPECT, REDL14, TMEANAA30, PWETCL5, NDVI14, MIDSLPPOS, LSFACTOR,
DRNGSS7, NLCD2011, SoilOrder, SOILMREGIM, DEMNED6, Landsat NPP, PET,
SLOPEHT, TMAXAA30, VALDEP

(2) Components of the fitted model on data

With the previously selected covariates, the two-stage model is fitted to the organic

soil carbon data. In each stage, we summarize the model information and elucidate

the structure of data interpreted by the model.

Stage-1: Universal regression kriging model

The estimated regression coefficients are exhibited in Table 3.2. Because all the

covariates were scaled before model fitting, the coefficients β are comparable with

each other and provide the relative importance of each covariate to the soil carbon

stock. The coefficients α0, αlong and αlat belong to fspl(s), which is a linear surface

trend function of the spatial coordinates (slong, slat):

fspl(s) = α0 + αlong · slong + αlat · slat

Table 3.2: Estimated Coefficients by URK

α0            αlong       αlat         βASPECT       βREDL14
4.3597        0.0025      -0.0025      -0.0284       -0.1703

βTMEANAA30    βPWETCL5    βNDVI14      βMIDSLPPOS    βLSFACTOR
-0.2100       0.1360      0.1439       0.0621        -0.0373

βDEMNED6      βDRNGSS7    βNLCD2011    βSoilOrder    βSOILMREGIM
0.0429        0.2361      0.1180       0.1172        0.0629

βLandsat_NPP  βPET        βSLOPEHT     βTMAXAA30     βVALDEP
-0.0248       -0.0940     0.0015       0.0218        -0.0256

Each component of the URK model (3.4) can be visualized in Figure 3.7, which

provides further interpretation. The fitted R² of the first stage URK model is about

67.3%. In other words, approximately 32.7% of the variation in the data is

left in the residuals δ(s) and will be dealt with by the GAM in the second stage.

Figure 3.7: Fitted Universal Regression Kriging Model

Stage-2: Generalized additive model

The fitted information of the GAM can be found in Table 3.3. Under the significance

level 0.01, three covariates have a non-zero effect on the response variable:

REDL14, NDVI14 and SoilOrder. All of the other covariates are not significant;

in other words, they have no detectable effect on the response variable. Figure 3.8 shows the

significant fitted predictor functions (smoothers).

Table 3.3: Importance of Smoothers Fitted by GAM in Second Stage

Smoother edf Ref.df F p-value

s(Long,Lat) 2.000 2.000 0.530 0.588885

s(ASPECT) 2.617 2.892 2.661 0.028600

s(REDL14) 2.733 2.950 8.553 4.20e-05 ***

s(TMEANAA30) 1.000 1.000 0.060 0.806594

s(PWETCL5) 2.084 2.488 1.267 0.197143

s(NDVI14) 3.000 3.000 20.432 3.62e-13 ***

s(MIDSLPPOS) 1.560 1.909 0.562 0.524011

s(LSFACTOR) 1.658 2.037 0.631 0.508336

s(DEMNED6) 1.000 1.000 0.078 0.780581

s(Landsat NPP) 2.300 2.688 3.391 0.012992

s(PET) 1.000 1.000 0.151 0.697162

s(SLOPEHT) 1.000 1.000 0.149 0.699483

s(TMAXAA30) 1.000 1.000 0.039 0.843379

s(VALDEP) 1.000 1.000 0.061 0.804975

s(DRNGSS7) 2.461 2.774 3.776 0.028268

s(NLCD2011) 1.905 2.283 2.966 0.047557

s(SoilOrder) 2.330 2.700 6.752 0.000277 ***

s(SOILMREGIM) 1.105 1.202 0.450 0.472465

Figure 3.8: Estimated GAM Predictor Functions (Smoothers)

(3) Analysis of spatial patterns of significant predictors

The most obvious benefit of interpretability is understanding the underlying mech-

anisms of the system. The first stage URK model reveals the global linear relation-

ships between the covariates and the response variable; then, the second stage GAM corrects

the first stage understanding with more subtle non-linear details. Finally, we obtain

the overall understanding by adding up results from the two stages. The following

Equation (3.6) shows the estimated relationship f̂(Xi) between covariate Xi and the

response variable Y.

f̂(Xi) = Xi β̂i + f̂i(Xi)    (3.6)

where β̂i is the linear coefficient estimated by URK and f̂i(Xi) is the nonlinear

smoothing function fitted by GAM. These functional relationships bring to light the

underlying dependencies between the covariates and soil carbon.

In Figure 3.9, the top row plots show the nonlinear fitted functions for covariates,

REDL14, NDVI14 and SoilOrder. The visualization of a predictor function leads us to

check the part of the function where the 95% confidence interval is away from zero (the seg-

ment between the blue vertical lines), which indicates a significant contribution of the

covariate to the soil carbon prediction. After representing the significant contributions

in the spatial context (the bottom row of Figure 3.9), one can visualize the

spatial repartition of the significant predictors. First, significant covariate data tend

to exhibit some spatial clustering patterns. Second, the spatial regions of the 3 clusters

overlap and are located in the southwest of the United States. A reasonable hypoth-

esis is that there is a latent variable which influences the 3 covariates but

is not included in the data and needs further investigation. The clusters provide

information on the locations where further investigation should be conducted.

Figure 3.9: Spatial Clusters of Significant Non-linear Predictors

(4) Comparison to GWR model

The geographically weighted regression (GWR) model (Brunsdon et al., 1998) is an

extension of the traditional regression framework and allows the regression coefficients

to vary across space. GWR is a very popular geostatistical tool to explore possible

spatial patterns of the covariate effects (regression coefficients) and acquire valuable

information for further analysis, such as cluster detection. In the following, we

compare the interpretation of the components of the GWR and Two-Stage models

fitted on the soil carbon data.

First, there are some covariates’ effect claimed to be global linear in Two-Stage

model but are spatially varying in GWR model. For example, in Figure 3.9 (top

row), the effects of covariate NDVI14 shows non-linearity. By comparing the two

GWR coefficients maps (Figure 3.10), NDVI14 and TMEANAA30, the spatial vari-

ability of TMEANAA30 is larger than NDVI14. So, if we assume the coefficient of

NDVI14 is spatially varying (non-linear), GWR model tell us that the coefficient of

52
TMEANAA30 will be spatially varying as well. In other word, the effect of covari-

ate TMEANAA30 is not globally linear. While Table 3.3 shows that the covariate

TMEANAA30 in the second stage GAM has no significant effects on response variable

(p-value = 0.806594). It indicates that TMEANAA30 only has the global linear (con-

stant coefficient) relationship with response variable in the first stage URK model.

In sum, GWR model and Two-Stage model provide contradicting explanation to co-

variate TMEANAA30. The reason hides in the stochastic process term λ(s) in the

first stage URK model. As Figure 3.10 (right) showing, the value λ(s) varies spatially

compensating for errors in the linear global term. Since the coefficients of GWR are

estimated locally, λ(s) will cause their estimated values varying across space.

Figure 3.10: Constant or Varying Coefficients

Second, the spatial clustering patterns of the covariate effects differ between GWR

and the Two-Stage model. In Figure 3.11, the repartition of REDL14 significance

(left) shows non-negligible differences between GWR and the Two-Stage model. The

maps of NDVI14 (middle) present some similarities but also differences, for example,

in the junction region between Arizona and Utah, the Northeast states, and Florida.

For SoilOrder (right), the two maps show similarity in the west of the United States

but differences in other regions. The reason causing the differences is similar to the

analysis of Figure 3.10. In addition, the insignificant parts that we removed from the GAM

plots may also cause the GWR coefficients to vary spatially.

In summary, the power of interpretability of the Two-Stage model comes from its abil-

ity to decompose. First, the Two-Stage model conducts the analysis hierarchically, and

Figure 3.11: Comparing Two-stage model and GWR

the two analysis layers can be easily and clearly decoupled. The second layer (GAM)

analysis relies on extracting out all the influences of the first layer (URK).

Second, in each layer, the additivity of components ensures the decomposition of inter-

pretable components. For instance, for the spatial cluster patterns of REDL14 (Figure

3.11), the analysis is based on the condition that all other influences have been extracted

out, such as the influences from global linearity, global stationary spatial dependence, non-

stationary spatial dependence, and other covariates. The GWR model, in contrast, mixes up

all of those influences, which makes the interpretation ambiguous.

3.3.2 Prediction Results

In this section, we compare the prediction results of different models. The frame-

work for comparison (Figure 3.5) was introduced in Section 3.2.3.

(1) Real data

Different from interpretability, the goal of prediction is accuracy, i.e., the ability to pre-

dict the data characteristics. In order to compare the models' predictive capabilities,

we include all predictors (covariates) in the models.

To work with categorical data, we can use a variable encoding approach. However,

this approach didn't work well on this real dataset: it caused the predictive R² of the SVM

model (≈ 49%) to be even lower than that of the ordinary linear regression model (≈ 54%). To

solve this issue, we adopted feature engineering on the categorical variables. For example,

SoilOrder is a nominal categorical variable. Feature engineering can be applied by

using the median value of the response variable within each category in place of the nominal

category number (Table 3.4). This transforms the nominal categorical variable into a numerical

variable which can be accommodated by any model. After feature engineering (on all

categorical covariates), we found that the predictive R² increased from 49% to 58%

with the SVM model but changed negligibly with the other models.

Table 3.4: Example of Feature Engineering for the Covariate Soilorder

SoilOrder 1 2 3 4 5 6 7 8 9 10 11

y.median 3.9276 3.5199 4.0103 4.0338 3.6944 5.9676 4.0752 4.3900 4.7165 3.0662 4.8314
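
A minimal R sketch of this encoding, with placeholder names; the medians are computed on the training data only, in keeping with the leakage concerns of Section 3.2.3:

# soil_train / soil_test: nominal SoilOrder codes; y_train: (log) SOC response.
med_by_level   <- tapply(y_train, soil_train, median)   # Table 3.4, in effect
soil_train_num <- as.numeric(med_by_level[as.character(soil_train)])
soil_test_num  <- as.numeric(med_by_level[as.character(soil_test)])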

Table 3.5 lists the predictive RMSE and predictive R² for the compared models. The

Two-Stage model stays competitive with the popular machine learning models, random

forest, XGBoost and SVM. However, compared to the URK model, the Two-Stage model

only improves the predictive R² by 0.5%, which is negligible in some circumstances.

The reason comes from the nature of the data: purely random variation takes a large

proportion (approximately 40%) of the total variation of the data. This makes all the models,

except the linear regression model, obtain a similar predictive R². Regardless of the only

0.5% improvement in predictive R², the Two-Stage model discovered much more useful

information compared to the URK model (see Section 3.3.1).

(2) Simulation data

In order to demonstrate the ability of the two-stage model to compete with

popular machine learning models and to improve on the URK and

linear regression models, we simulated a dataset and conducted the comparison described

Table 3.5: Prediction Comparison on Real Data

Model Predictive RMSE Predictive R2

LM 0.6978181 0.5392069

URK 0.6729866 0.5732470

Two-Stage 0.6687347 0.5786224

RF 0.6534186 0.5958378

XGB 0.6689105 0.5765102

SVM 0.6693818 0.5757998

in Figure 3.5. The response variable Y (s) is simulated as

Y(s) = Yx(s) + λ(s) + p(s) + ε(s)    (3.7)

where the components are generated as follows and their visualizations can be found

in Figure 3.12.

Yx(s) represents the nonlinear relationship between the two covariates X1(s) and X2(s)

and the response variable:

Yx(s) = 0.1 · X1(s)³ + 10 · sin(X2(s) + 3)

X1(s) ∼ Unif(1, 4),  X2(s) ∼ Unif(0, 2π)

λ(s) is a zero-mean Gaussian process capturing the isotropic stationary spatial

dependence. λ(s) is entirely characterized by a Matérn covariance function as follows:

Cν(|s1 − s2|) = σ² · (2^(1−ν) / Γ(ν)) · (√(2ν) |s1 − s2| / ρ)^ν · Kν(√(2ν) |s1 − s2| / ρ)

where Γ(·) is the Gamma function, Kν(·) is the modified Bessel function of the second

kind, |s1 − s2| is the Euclidean distance between spatial points s1 and s2, and σ² = 1,

ρ = 0.5 and ν = 1.5.

p(s) = 0.005 · sx · sy represents non-stationary dependence in the spatial coordi-

nates sx and sy.

ε(s) ~ i.i.d. N(0, 10.24) is the pure error, also called the nugget effect in geostatistics.
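
A minimal R sketch of generating this simulated data, using the Matern covariance from the fields package; sampling locations on the unit square are our assumption, since the simulation domain is not material here:

library(fields)
set.seed(1)
n    <- 1000
locs <- cbind(runif(n), runif(n))                 # spatial coordinates (sx, sy)
X1   <- runif(n, 1, 4)
X2   <- runif(n, 0, 2 * pi)
Yx   <- 0.1 * X1^3 + 10 * sin(X2 + 3)             # nonlinear covariate effects

K      <- Matern(rdist(locs), range = 0.5, smoothness = 1.5)     # sigma^2 = 1
lambda <- drop(t(chol(K + 1e-8 * diag(n))) %*% rnorm(n))         # GP draw
p      <- 0.005 * locs[, 1] * locs[, 2]                          # non-stationary term
eps    <- rnorm(n, sd = sqrt(10.24))                             # nugget
Y      <- Yx + lambda + p + eps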

Figure 3.12: Simulated Data

The results of the model comparison are shown in Table 3.6. The URK model shows a

predictive R² ≈ 49.3%, which is close to the linear regression model's 48.4% but far

from the Two-Stage model's 80.7%. From the relationship between the URK and Two-Stage models, we

know the predictive R² contributed by the second stage GAM is 31.4%, which is a

great improvement. Moreover, compared to random forest, XGBoost and SVM, the

Two-Stage model has a higher predictive R², with results closest to the random forest.

The comparison on synthesized data again shows that the predictive ability of the Two-Stage

model is competitive with popular machine learning models.

Table 3.6: Prediction Comparison on Simulated Data

Model Predictive RMSE Predictive R2

LM 5.695840 0.4843744

URK 5.667649 0.4932645

Two-Stage 3.499716 0.8067849

RF 3.522183 0.8026377

XGB 3.645922 0.7884789

SVM 3.564608 0.7981013

In summary, the two-stage model has good interpretability, close to that of

linear regression. Meanwhile, it keeps high prediction accuracy that is competitive

with the nonlinear (machine learning) models, like random forest, xgboost and support

vector machine. This makes the two-stage model stand out from the rest (Figure 3.13).

Figure 3.13: Breaking the Trade-off Between Prediction and Interpretation

Chapter 4

GAUSSIAN PROCESS BART

The Bayesian Additive Regression Trees (BART) model is rarely used in spatial

applications. One of the reasons is that the error term in the BART model is restricted

to be independently distributed, which is unusual in spatial problems. In this chapter,

we get rid of this constraint and propose a Gaussian process BART model for spatial

regression problems. First, the traditional BART model is introduced in Section

4.1. Then, in Section 4.2, we develop a new BART model that can accommodate

correlated errors. In Section 4.3, the Gaussian process BART model is studied.

Section 4.4 presents two experiments and a test on real data.

4.1 Introduction

Bayesian Additive Regression Trees (BART), proposed by Chipman et al. (2010),

can be viewed as a sum-of-trees model as follows.

y = g(X; T1, M1) + ... + g(X; Tm, Mm) + ε,  ε ~ i.i.d. N(0, σ²)    (4.1)

where y, X are the observed dependent and independent variables; ε is independent and

identically distributed random error; T denotes a tree, consisting of a set of interior

nodes with decision rules and a set of terminal nodes; M = {µ1, ..., µb}, where b is

the number of terminal nodes of T; and g(X; Ti, Mi), i = 1, ..., m denotes a single binary

regression tree that assigns the µj, j = 1, ..., b in M to the observations through T. An

example of a single binary regression tree is illustrated in Figure 4.1.

BART is inspired by the idea of boosting that sums the contribution of sequen-

tial weak learners (trees) to get a much more accurate prediction. Different from

Figure 4.1: (Left) An example of single binary tree, with internal nodes labelled by
their splitting rules, terminal nodes labelled with the corresponding parameters µi
and the observations associated with it. (Right) The corresponding partition of the
sample space and the step function.

other boosting methods, like gradient boosting trees, BART works in a Bayesian

framework using a prior and likelihood to generate a posterior distribution of the pre-

diction. The posterior distribution provides much richer information than the point

estimates of classical regression models. In addition, the Bayesian framework has

a built-in complexity penalty mechanism that automatically controls hyperparameters,

like the maximum tree size, which would normally be tuned via cross-validation in

other models.

An experimental study (Chipman et al., 2010) showed that BART outperforms other

popular machine learning methods, including neural nets, gradient boosting trees

and random forests. Recall the spatial nonlinear regression model, which excludes

the stochastic process term w(s) in model (1.1):

y(s) = f(s; X(s)) + ε(s),  ε(s) ~ i.i.d. N(0, σ²)    (4.2)

The BART model (4.1), of course, is a good candidate in this category. But we

want to be more ambitious. Since the term w(s) in model (1.1) models the effects

of unobserved independent variables, keeping it in the spatial nonlinear regression

model can benefit us in both prediction and interpretation (see Section 4.3.1).

4.2 BART for Correlated Errors

In the BART model (4.1), the error term ε is assumed independent and identically

distributed, ε ~ i.i.d. N(0, σ²). We can generalize this assumption and allow the error

term to have a general correlation structure, ε ∼ N(0, Σ):

y = g(X; T1, M1) + ... + g(X; Tm, Mm) + ε,  ε ∼ N(0, Σ)    (4.3)

We will build the new model (4.3) and illustrate how it works in this section. But,

first of all, the question can be simplified to a single tree model by taking advantage

of the reductions Rj = y − Σ_{k≠j} g(X; Tk, Mk):

Rj = g(X; Tj, Mj) + ε,  ε ∼ N(0, Σ)

Hereafter, we remove the subscripts and discuss the single tree model (4.4):

R = g(X; T, M) + ε,  ε ∼ N(0, Σ)    (4.4)

4.2.1 Dummy Representation

To understand model (4.4), the first and most important step is the dummy represen-

tation. Simply speaking, the dummy representation provides a matrix form of the single

tree model (4.4). The tree g(X; T, M) can be denoted as follows:

g(X; T, M) = Dµ    (4.5)

where

µ = [µ1, µ2, ..., µb]^T

and

D = [ d11  d12  ...  d1b
      d21  d22  ...  d2b
      ...  ...  ...  ...
      dn1  dn2  ...  dnb ]

D is called the dummy matrix and is an n×b matrix, where n is the number of observations

and b is the number of bottom nodes. In each row of D, there is only one entry

equal to 1 and the rest are equal to 0. For example,

[di1, ..., di,j−1, di,j, di,j+1, ..., dib] = [0, ..., 0, 1, 0, ..., 0]    (4.6)

is the ith row of D, whose jth column is 1. The matrix D can be viewed as a map that maps

the observations to the bottom nodes of the tree. The row (4.6) maps

the ith observation to the jth bottom node. In the following example, the dummy

matrix D maps r2 to node 1, r3 and r4 to node 2, and r1 and r5 to node 3:

g(X; T, M) = Dµ = [ 0 0 1
                    1 0 0
                    0 1 0
                    0 1 0
                    0 0 1 ] [ µ1
                              µ2
                              µ3 ]

Based on the dummy representation, the tree model (4.4) can be rewritten in

matrix form:

R = g(X; T, M) + ε = Dµ + ε,  ε ∼ N(0, Σ)    (4.7)

The matrix form makes mathematical derivation possible. Moreover, given X, this

representation perfectly decouples the components T and M in the tree model. It

means that if X and T are fixed, the dummy matrix D is uniquely determined regardless of the

value of µ in M. This decoupling will benefit the calculation of the marginal likelihood

p(R|X, T), which is the pivot of the MCMC transitions. The details will be discussed in

the next section.
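
A minimal R sketch of constructing D from the bottom-node assignments (function and variable names are ours):

# node: vector giving the bottom-node index (1..b) of each observation.
make_dummy <- function(node, b = max(node)) {
  D <- matrix(0, nrow = length(node), ncol = b)
  D[cbind(seq_along(node), node)] <- 1     # exactly one 1 per row
  D
}

# The example above: r1 -> node 3, r2 -> node 1, r3, r4 -> node 2, r5 -> node 3
D <- make_dummy(c(3, 1, 2, 2, 3))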

4.2.2 Metropolis-Hastings Search

In BART, each tree is updated at every MCMC iteration. Recalling (4.4), to

update a tree we need to update its components T and M. Naturally, the structure

of the tree, T, should be updated first. Chipman et al. (1998) proposed a Metropolis-Hastings

algorithm to draw a sequence of trees,

T^0, T^1, T^2, ...

The sequence starts with an initial tree T^0 and iteratively simulates the transitions

from T^i to T^{i+1}, i = 0, 1, 2, ..., by the following two steps:

(1) Generate a candidate value T^* with probability distribution q(T^i, T^*).

(2) Set T^{i+1} = T^* with probability

α(T^i → T^*) = min{ [q(T^*, T^i) p(R|X, T^*) p(T^*)] / [q(T^i, T^*) p(R|X, T^i) p(T^i)], 1 }    (4.8)

Otherwise, set T^{i+1} = T^i.

In (4.8), the transition kernel q(·, ·) and the prior p(T) are the same in both the traditional

BART and the new BART, so the factor (4.9) does not change either:

[q(T^*, T^i) p(T^*)] / [q(T^i, T^*) p(T^i)]    (4.9)

On the other hand, the correlated data enters through (4.10), the marginal like-

lihood ratio, and this ratio is where the traditional BART and the new BART differ:

p(R|X, T^{i+1}) / p(R|X, T^i)    (4.10)

In the discussion of the dummy representation, we saw that given X and T, a dummy

matrix D is uniquely determined, so the marginal likelihood p(R|X, T) is equal

to p(R|D). Then, (4.10) is equal to (4.11) as well:

p(R|D^{i+1}) / p(R|D^i)    (4.11)

Now the question is converted into calculating p(R|D). By (4.7), we get the joint

likelihood (4.12):

p(R|D, µ) ∼ N(Dµ, Σ)    (4.12)

The marginal likelihood can then be obtained by integrating out µ. The only thing

we need is a prior distribution π(µ):

p(R|D) = ∫ p(R|D, µ) π(µ) dµ    (4.13)

A Gaussian prior π(µ) ∼ N(µ̄, Q⁻¹) is preferred because it is conjugate to (4.12);

µ̄ and Q are the mean and precision matrix of the Gaussian prior distribution respec-

tively. (4.14) shows the result of the integration (4.13); the proof can be found in

Appendix A.1.1.

p(R|D) = [(2π)^{-n/2} |Σ|^{-1/2} |Q|^{1/2} / |Q + D^T Σ^{-1} D|^{1/2}]
         × exp{ -(1/2) [ -v^T (Q + D^T Σ^{-1} D) v + µ̄^T Q µ̄ + R^T Σ^{-1} R ] }    (4.14)

where v = (Q + D^T Σ^{-1} D)^{-1} (Qµ̄ + D^T Σ^{-1} R).

Letting µ̄ = 0, (4.14) can be simplified to (4.15):

p(R|D) = [(2π)^{-n/2} |Σ|^{-1/2} |Q|^{1/2} / |Q + D^T Σ^{-1} D|^{1/2}]
         × exp{ (1/2) [ R^T Σ^{-1} D (Q + D^T Σ^{-1} D)^{-1} D^T Σ^{-1} R − R^T Σ^{-1} R ] }    (4.15)

Finally, we plug (4.15) into the marginal likelihood ratio (4.11) and get (4.16).

The computational complexity of (4.16) will be studied in Section 4.2.4, and the

details of the calculation can be found in Appendix A.3.

p(R|D^{i+1}) / p(R|D^i)
    = [ |Q^{i+1}|^{1/2} |Q^i + (D^i)^T Σ^{-1} D^i|^{1/2} ] / [ |Q^i|^{1/2} |Q^{i+1} + (D^{i+1})^T Σ^{-1} D^{i+1}|^{1/2} ]
    × exp{ (1/2) R^T Σ^{-1} [ D^{i+1} (Q^{i+1} + (D^{i+1})^T Σ^{-1} D^{i+1})^{-1} (D^{i+1})^T
                              − D^i (Q^i + (D^i)^T Σ^{-1} D^i)^{-1} (D^i)^T ] Σ^{-1} R }    (4.16)

4.2.3 Posterior Distribution of µ

In Section 4.2.2, the tree structure T was updated. Given the new T, we can

update M, the set of µ values at the bottom nodes. Since X and the new T are known,

the likelihood of µ is easily obtained from (4.12):

p(R|µ) = p(R|D, µ) ∼ N(Dµ, Σ)    (4.17)

According to Bayes' theorem, the posterior probability density function of µ is pro-

portional to the product of its likelihood and prior probability density function (4.18):

p(µ|R) ∝ p(R|µ) π(µ)    (4.18)

Similar to the calculation of the marginal likelihood (4.13), we choose the conjugate

prior π(µ) ∼ N(µ̄, Q⁻¹). The posterior distribution p(µ|R) is given by (4.19); the proof

can be found in Appendix A.1.2.

p(µ|R) ∼ N( (Q + D^T Σ^{-1} D)^{-1} (Qµ̄ + D^T Σ^{-1} R) , (Q + D^T Σ^{-1} D)^{-1} )    (4.19)

Furthermore, if we let µ̄ = 0, (4.19) can be simplified to (4.20):

p(µ|R) ∼ N( (Q + D^T Σ^{-1} D)^{-1} D^T Σ^{-1} R , (Q + D^T Σ^{-1} D)^{-1} )    (4.20)
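
A minimal R sketch of (4.15) and (4.20) with µ̄ = 0, in base R with placeholder inputs; determinant() returns log-determinants, which keeps the computation numerically stable:

# D: n x b dummy matrix; Sigma_inv: n x n precision matrix; Q: b x b prior
# precision; log_det_Sigma: log|Sigma|, precomputed since Sigma is fixed.
log_marg_lik <- function(R, D, Sigma_inv, Q, log_det_Sigma) {
  M    <- Q + t(D) %*% Sigma_inv %*% D                # the b x b matrix (4.25)
  SR   <- Sigma_inv %*% R
  quad <- t(SR) %*% D %*% solve(M, t(D) %*% SR) - t(R) %*% SR
  as.numeric(-0.5 * length(R) * log(2 * pi) - 0.5 * log_det_Sigma +
             0.5 * determinant(Q)$modulus - 0.5 * determinant(M)$modulus +
             0.5 * quad)
}

draw_mu <- function(R, D, Sigma_inv, Q) {
  M <- Q + t(D) %*% Sigma_inv %*% D
  m <- solve(M, t(D) %*% Sigma_inv %*% R)             # posterior mean in (4.20)
  U <- chol(solve(M))                                 # posterior covariance M^{-1}
  as.numeric(m + t(U) %*% rnorm(ncol(D)))
}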

4.2.4 Computational Complexity

The new BART works with the covariance matrix Σ, whose dimension is n×n, where n is

the number of observations. When the data is big, this huge covariance matrix can

cause computational problems; for example, the computation of the likelihood requires

|Σ| and Σ^{-1}, whose exact calculation requires O(n³) operations. This

becomes an impossible mission for a personal computer when n is greater than, for

example, one million. In this section, we investigate the computational complexity

of the new BART model. Since a preprocessing step called reordering can greatly

simplify the discussion, before digging into the computational details, it is worth

spending some time to understand the reordering.

Suppose a tree has b bottom nodes. The dummy matrix D maps the n observations

to them. Based on this mapping, the observations can be partitioned into at most b

sets. Reordering means that we reorder all the observations so that they are ordered

successively within each partition. Any reordering is a map and can be achieved by

multiplying by a permutation matrix. Let's assume the permutation matrix P^T (transpose

of P) realizes the reordering. Then, the reordered dummy matrix, DP, can be

denoted as follows:

P^T D = DP,  D = P DP    (4.21)

where

DP = [ d′11  d′12  ...  d′1b
       d′21  d′22  ...  d′2b
       ...   ...   ...  ...
       d′n1  d′n2  ...  d′nb ]

and

d′ij = 1 if i ∈ nj, and 0 if i ∉ nj,  for i = 1, ..., n; j = 1, ..., b,

where nj, j = 1, ..., b is the index set of the observations that are mapped to the jth bottom

node.

Intuitively, DP has the following block structure, with the rows mapped to the same
bottom node stacked together:

DP = [ 1 0 ... 0
       .........
       1 0 ... 0
       0 1 ... 0
       .........
       0 1 ... 0
       .........
       0 0 ... 1
       .........
       0 0 ... 1 ]

Recall the example in Section 4.2.1; it is easy to find the reordered matrix DP and the

permutation matrix P:

D = [ 0 0 1        P = [ 0 0 0 0 1        DP = [ 1 0 0
      1 0 0              1 0 0 0 0               0 1 0
      0 1 0              0 1 0 0 0               0 1 0
      0 1 0              0 0 1 0 0               0 0 1
      0 0 1 ]            0 0 0 1 0 ]             0 0 1 ],   with D = P DP

Similar to (4.21), R and Σ also have reordered counterparts:

RP = P^T R,  ΣP = P^T Σ P,  so that R = P RP,  Σ = P ΣP P^T    (4.22)

In Appendix A.2, we prove that reordering does not change the values of the marginal

likelihood ratio (4.16) or the posterior distribution (4.20). So, the discussion of compu-

tational complexity can be carried out on their reordered expressions:

p(RP|DP^{i+1}) / p(RP|DP^i)
    = [ |Q^{i+1}|^{1/2} |Q^i + (DP^i)^T ΣP^{-1} DP^i|^{1/2} ] / [ |Q^i|^{1/2} |Q^{i+1} + (DP^{i+1})^T ΣP^{-1} DP^{i+1}|^{1/2} ]
    × exp{ (1/2) RP^T ΣP^{-1} [ DP^{i+1} (Q^{i+1} + (DP^{i+1})^T ΣP^{-1} DP^{i+1})^{-1} (DP^{i+1})^T
                                − DP^i (Q^i + (DP^i)^T ΣP^{-1} DP^i)^{-1} (DP^i)^T ] ΣP^{-1} RP }    (4.23)

and

p(µP|RP) ∼ N( (Q + DP^T ΣP^{-1} DP)^{-1} DP^T ΣP^{-1} RP , (Q + DP^T ΣP^{-1} DP)^{-1} )    (4.24)

For the new BART, we assume the precision matrices Σ^{-1} and ΣP^{-1} are known. The

possible computational burden comes from the following term appearing in (4.23) and (4.24):

Q + DP^T ΣP^{-1} DP    (4.25)

In Appendix A.1.1, we show that (4.25) is a b × b symmetric matrix whose calcu-

lation amounts to summing the non-zero entries in ΣP^{-1}. In (4.23) and (4.24), we need to

calculate the determinant and inverse of this matrix, which need O(b³) operations, where

b is the number of bottom nodes. Fortunately, in the new BART, the size of a tree, i.e.,

its number of bottom nodes, is small (usually less than 20). So, if the number of nonzero

entries in ΣP^{-1} is O(n), the MCMC update of a single tree needs O(n) operations. The

details of calculating (4.23) and (4.24) can be found in Appendix A.3. However, we still

have to compute Σ^{-1} for back comparing and building up the tuning range in Section 4.4.4.

A sparse approximation approach is adopted and introduced in Appendix B.
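
A minimal R sketch of forming (4.25) with a sparse precision matrix via the Matrix package; the tridiagonal ΣP^{-1} below is only an example of a precision matrix with O(n) non-zero entries:

library(Matrix)
n <- 1e5; b <- 10
Q <- Diagonal(b)                                   # prior precision (example value)
Sigma_inv <- bandSparse(n, k = -1:1,               # tridiagonal: O(n) non-zeros
                        diagonals = list(rep(-0.4, n - 1),
                                         rep(1, n),
                                         rep(-0.4, n - 1)))
D <- sparseMatrix(i = 1:n, j = sample(b, n, replace = TRUE), x = 1, dims = c(n, b))
M <- Q + crossprod(D, Sigma_inv %*% D)             # the b x b matrix (4.25)
# |M| and M^{-1} then cost only O(b^3), with b small (< 20 bottom nodes).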

4.2.5 Example

In order to compare the new and old BART, we construct an example to demonstrate

their similarities and differences. The simulation data was created as follows:

yi = f(xi) + ηi,  i ∈ {1, ..., n}

where f(xi) = xi³, xi ∈ (−1, 1), and n = 200. We assumed the error term η followed a

normal distribution η ∼ N(0, Σ). There are two scenarios for the structure of the

error term.

(1) ηi are independent and identically distributed (i.i.d.)

In this scenario, Σ = σ²I, and the new BART should be identical to the old

BART. Figure 4.2 (left) confirms this claim.

(2) ηi are correlated

In this scenario, ηi was created as follows:

εi ∼ N(0, σ²), i = 1, ..., n

η1 = ε1,  ηj = ρ εj−1 + εj,  0 < ρ < 1, j = 2, ..., n

We can write $\eta$ in matrix form as

$$\eta = A\epsilon$$

where

$$\eta = \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_n \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \rho & 1 & 0 & \cdots & 0 & 0 \\ 0 & \rho & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \rho & 1 \end{pmatrix}_{n \times n}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \tag{4.26}$$

The inverse of $\Sigma = \sigma^2 A A^T$ can then be calculated as

$$\Sigma^{-1} = \sigma^{-2} A^{-T} A^{-1}$$
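As a minimal illustration of this error structure (our own sketch, not the dissertation's implementation; the seed is arbitrary), the following Python code builds the bidiagonal matrix A of (4.26), simulates the example data, and forms $\Sigma^{-1} = \sigma^{-2}A^{-T}A^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, rho = 200, 0.1, 0.8

# bidiagonal A from (4.26): ones on the diagonal, rho on the subdiagonal
A = np.eye(n) + rho * np.eye(n, k=-1)

eps = rng.normal(0.0, sigma, size=n)  # eps_i ~ N(0, sigma^2)
eta = A @ eps                         # eta_1 = eps_1, eta_j = rho*eps_{j-1} + eps_j

x = rng.uniform(-1.0, 1.0, size=n)
y = x**3 + eta                        # y_i = f(x_i) + eta_i with f(x) = x^3

# Sigma = sigma^2 A A^T, so Sigma^{-1} = sigma^{-2} A^{-T} A^{-1}
A_inv = np.linalg.inv(A)
prec = (A_inv.T @ A_inv) / sigma**2
assert np.allclose(prec @ (sigma**2 * A @ A.T), np.eye(n))
```

Since A is unit lower bidiagonal, in practice its inverse never needs to be formed explicitly; solves against A are O(n).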

Setting σ = 0.1 and ρ = 0.8, we examined the new and old BART under the above two settings of the error term. Figure 4.2 shows the results. When the errors are i.i.d., the two BART models are consistent; but when the errors are correlated, the two models differ from each other. In Table 4.1, we measured the differences between the two models. Compared to the old BART, the new BART fit the training data worse but outperformed it in restoring the function f(x). This means that, if the correlation structure of the errors is known, the new BART can correct its fit toward the real signal f(x) rather than the noise, using the information in the covariance matrix Σ.

Figure 4.2: The left panel shows that when the errors are i.i.d. the new BART degenerates to the old BART. The right panel shows that the new BART differs from the old BART when the errors are correlated.

Table 4.1: Comparing BART_new and BART_old

Model                               Training Data MSE   Training Data R²   Restoring f(x) MSE
BART_old                            0.008474055         94.0%              0.01849994
BART_new                            0.02189554          84.5%              0.01547938
(BART_new − BART_old)/BART_old      158.4%              -10.1%             -16.3%

4.3 Gaussian Process BART Model

In the new BART model (4.3), the correlation structure of the error term is arbitrary, because Σ is a general covariance matrix. However, in real world applications, one typically assumes that the error term satisfies some special correlation structure, and there are different ways to model it. In spatial statistics, as discussed in Section 1.2, one of the most popular ways is to use a Gaussian process. So, by combining a Gaussian process with the new BART, we propose a new nonlinear spatial regression model (4.27), named Gaussian process BART.

$$y(s_i) = f_{BART}(X(s_i)) + w(s_i) + \epsilon(s_i) \tag{4.27}$$

where

• y(s_i) denotes the response variable observed at location s_i, i = 1, 2, ..., n.

• X(s_i) are covariates observed at s_i.

• f_BART(X(s_i)) is the mean spatial trend function of X(s_i).

• w(s_i) is a Gaussian process modeling the effect of unobserved covariates.

• $\epsilon(s_i) \overset{i.i.d.}{\sim} N(0, \tau^2)$ denotes the independent and identically distributed noise.

In Chapter 1, we introduced the two categories of spatial regression models. The nonlinear spatial regression models do not include the term w(s), which causes two problems: the models cannot take the latent covariates into account, and they overfit the effects of the observed covariates. On the other hand, the spatial linear mixed models can capture the latent covariates' effects with w(s), but they cannot model the nonlinear effects of the observed covariates. The Gaussian process BART model (4.27) is the first spatial regression model that is able to handle the nonlinear effects of both observed and latent covariates.

4.3.1 Analysis of Variation

In this section, we propose a method, called analysis of variation, to gain a deeper understanding of the Gaussian process BART model (4.27). Figure 4.3 illustrates the idea of analysis of variation. The discussion can be divided into three parts.

Figure 4.3: Analysis of Variation

First, the data generating process. From a physical point of view, observations are generated by an underlying physical process plus pure error. We call this underlying physical process the data generating process. Based on this idea, the total variation in the observations can be divided into two parts: the variation explained by the data generating process and the variation of the pure error. In Figure 4.3, we denote the variations by sums of squares. The data generating process $f_{process}(X(s), Z(s))$ can include observed covariates X(s) and latent covariates Z(s). So the total variation in the data can be divided into three parts: $SS^{X}_{f_{process}}$, $SS^{Z}_{f_{process}}$ and $SSE_{process}$.

Second, the ideal case. Figure 4.3 shows the ideal case in which the Gaussian process BART model (4.27) perfectly explains the three parts of the variation in the data: $f^{new}_{BART}(X(s))$ catches $SS^{X}_{f_{process}}$, w(s) catches $SS^{Z}_{f_{process}}$, and $SSE_{process}$ goes into $\epsilon^{new}(s)$. It also indicates that nonlinear models without w(s), like the old BART model, will overfit the observed covariates process $f_{process}(X(s))$, because $f^{old}_{BART}(X(s))$ will fit some of the variation belonging to the latent covariates process $f_{process}(Z(s))$ ($SS^{Z}_{f_{process}}$).

Third, the normal case. In practice the ideal case rarely happens. The reasons may include: the new BART still overfits; the effects of the latent covariates do not behave as spatial dependence; or the assumptions (stationarity, isotropy) or parameter settings of the Gaussian process w(s) are not suitable for the real data. However, compared to the old BART, if w(s), with its explained variation $SS_w$, can shrink the variations $SS^{new}_{f_{BART}}$ and $SSE^{new}$, then the existence of w(s) is preferable. The shrinkage of $SS^{new}_{f_{BART}}$ can reduce the overfitting of the new BART and restore the fit closer to the real underlying process of the observed covariates X(s). The example in Section 4.2.5 supports this claim.

4.3.2 The Failure of Likelihood Based MCMC

At first glance, Gibbs sampling looks like a good choice for estimating f_BART and the parameters θ in model (4.27), where θ includes the parameters of the covariance function of the Gaussian process w(s) and τ². (4.28) and (4.29) are the two steps of the MCMC updating.

$$\theta \mid f_{BART} \tag{4.28}$$

$$f_{BART} \mid \Sigma^{-1}(\theta) \tag{4.29}$$

First, given f_BART, model (4.27) can be converted to a Bayesian hierarchical model (4.30).

$$p(\theta \mid y) \propto p(\theta) \times N(w(s) \mid 0, C) \times N(y \mid f_{BART} + w(s), \tau^2 I) \tag{4.30}$$

Furthermore, the Gaussian process w(s) can be integrated out:

$$p(\theta \mid y) \propto p(\theta) \times N(y \mid f_{BART}, C + \tau^2 I)$$

Letting $\Sigma = C + \tau^2 I$, we get the posterior distribution p(θ|y) in (4.31), which can be used to update (4.28).

$$p(\theta \mid y) \propto p(\theta) \times \frac{1}{\sqrt{|\Sigma|}} \exp\Big\{ -\frac{1}{2} (y - f_{BART})^T \Sigma^{-1} (y - f_{BART}) \Big\} \tag{4.31}$$
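For concreteness, here is a small sketch of how the log of (4.31) can be evaluated with a Cholesky factorization. This is our illustration, not the dissertation's code; the exponential covariance matches the experiment below, and log_prior is a placeholder for p(θ).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.spatial.distance import cdist

def log_posterior(theta, y, f_bart, coords, log_prior):
    """log p(theta | y) up to a constant, with Sigma = C(theta) + tau^2 I
    and exponential covariance C_ij = sigma^2 exp(-phi * d_ij)."""
    sigma, phi, tau = theta
    d = cdist(coords, coords)
    Sigma = sigma**2 * np.exp(-phi * d) + tau**2 * np.eye(len(y))
    L, lower = cho_factor(Sigma, lower=True)
    r = y - f_bart
    quad = r @ cho_solve((L, lower), r)         # r^T Sigma^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log |Sigma|
    return log_prior(theta) - 0.5 * (logdet + quad)

# usage: log_posterior((1.0, 6.0, 1.0), y, f_hat, coords, lambda th: 0.0)
```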

Second, if the parameters θ are known, the precision matrix $\Sigma^{-1}$ can be calculated as well, and the problem of updating (4.29) given $\Sigma^{-1}$ was already solved in Section 4.2. Everything looks good so far; however, the devil is in the details. Let's look at an experiment first. The simulation data are created as follows:

$$f(x(s_i)) = x(s_i)^3, \quad x(s_i) \sim \mathrm{unif}(1, 3), \quad i = 1, \ldots, n$$

$$w(s) \sim GP(0, C(\cdot)), \quad C(|s_i - s_j|) = \sigma^2 \exp\{-\phi |s_i - s_j|\} \tag{4.32}$$

$$\epsilon(s_i) \overset{i.i.d.}{\sim} N(0, \tau^2)$$

where σ = 1, τ = 1 and φ = 6. Here x(s_i) ∼ unif(1, 3) means that x(s_i) follows a uniform distribution on the interval (1, 3).
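A minimal simulation of (4.32) might look as follows. This is our sketch: locations on the unit square and the sample size n = 200 are assumptions, since the text does not fix the spatial domain here.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
n, sigma, phi, tau = 200, 1.0, 6.0, 1.0

coords = rng.uniform(0.0, 1.0, size=(n, 2))          # locations s_i (assumed domain)
x = rng.uniform(1.0, 3.0, size=n)                    # x(s_i) ~ unif(1, 3)

C = sigma**2 * np.exp(-phi * cdist(coords, coords))  # exponential covariance
w = rng.multivariate_normal(np.zeros(n), C)          # w ~ GP(0, C)
eps = rng.normal(0.0, tau, size=n)                   # i.i.d. nugget
y = x**3 + w + eps                                   # y = f(x) + w + eps
```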

The MCMC samples of τ², σ² and φ are shown in Figure 4.4 (a). The Markov chain could not reach stationarity. If we fixed one of the parameters, τ = 1, then the chain did reach stationarity, as in Figure 4.4 (b). But in this case the estimated parameter φ is large (≈ 150), which makes the covariance matrix effectively diagonal, Σ ≈ σ²I. So the new BART model degenerated to the old BART model.

Figure 4.4: The failure of Likelihood based MCMC

For the problem shown in Figure 4.4 (a), the reason is that both BART and the Gaussian process are nonparametric, nonlinear models. They are sensitive to changes in the data: a disturbance in the data may cause dramatic turbulence in the Markov chain. When we fix one or several parameters, this problem may be solved, as in the case of Figure 4.4 (b). However, the degeneration issue then appears. To explain this issue, recall the example in Section 4.2.5. Table 4.1 tells us that, working with the true parameters ρ and Σ, the new BART fit the training data poorly; in fact, the fit gets worse as |ρ| approaches 1. Suppose we know nothing about the parameters of Σ and all the information comes from the data, which determines the likelihood. In likelihood based MCMC, the likelihood guides the search behavior in parameter space. As a result, the data/likelihood will drive the parameter ρ toward zero. In other words, the correlation structure of Σ will be eliminated and Σ will degenerate to σ²I. The same thing happens in the spatial context. If there is no prior information about the parameters of the covariance function C(·), the data/likelihood will lead the MCMC search to eliminate the spatially dependent structure in C(·) and make Σ = σ²I. Figure 4.4 (b) shows exactly this situation. We want to estimate the parameters, yet they must be known first; it looks like we are locked in a dead loop. In the next section, we will introduce a key to open this lock, called back comparing.

4.3.3 Back Comparing and Tuning Range

In Section 4.3.2, we discussed the failure of likelihood based MCMC. The failure occurs because the data lead the search in parameter space and tend to eliminate the correlation structure. So the solution should pull the parameter search in the opposite direction. Instead of letting the data totally control the parameter searching process, we intervene by proposing candidates that scatter over a larger range in parameter space. We propose a strategy, back comparing, to select the good candidates. Figure 4.5 demonstrates the idea of back comparing. First, we propose a candidate θ. Second, we use this candidate to fit the new BART model. With the fitted BART model $f^{new}_{BART}$, the variation of the residuals, $SSE^{new}_{real}$, can be calculated. Meanwhile, with the value of the candidate θ, it is easy to calculate the proposed variation of the mixed errors, $SSME^{new}_{proposed}$, which includes the errors coming from w(s) and ε(s). Then we compare $SSE^{new}_{real}$ and $SSME^{new}_{proposed}$. There are three possible cases.

(1) $SSE^{new}_{real} < SSME^{new}_{proposed}$

Over-estimation. The proposed θ estimates more variation ($SSME^{new}_{proposed}$) than it actually accounts for when it works with the new BART $f^{new}_{BART}$.

(2) $SSE^{new}_{real} > SSME^{new}_{proposed}$

Under-estimation. The proposed θ estimates less variation ($SSME^{new}_{proposed}$) than it actually accounts for when it works with the new BART $f^{new}_{BART}$.

(3) $SSE^{new}_{real} \approx SSME^{new}_{proposed}$

Good estimation. The proposed θ provides a good estimate of the variation $SSME^{new}_{proposed}$ when it works with the new BART $f^{new}_{BART}$.

Back comparing provides a criterion to identify good estimates of the parameters. Figure 4.6 illustrates the parameter searching process.

Figure 4.5: Back Comparing

After proposing a set of parameters {θ^(0), ..., θ^(n)}, we apply back comparing to each of them. The good estimates are picked out and put into a new set, called the tuning range. All the proposed parameters in the tuning range are good for both the model (4.27) and the data, and one can select the one that best fits one's goals. A more intuitive analogy is a speaker's volume control knob: you can tune the knob until the volume is comfortable to you, but the best volume differs from person to person, and you may even adjust it when the situation changes, for example, when the environment changes from quiet to noisy.

Figure 4.6: Parameter space searching for the buildup of tuning range

One question remains: how can we properly propose a parameter set {θ^(0), ..., θ^(n)} for searching out the tuning range? The answer is that we can use information obtained from the old BART or from linear mixed models. All the approaches in this section will be demonstrated with concrete examples in Section 4.4, and a schematic of the search is sketched below.
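The sketch below is our schematic of the back-comparing search, not the dissertation's code: fit_new_bart stands in for fitting the new BART given Σ(θ), and the closed-form mixed-error variance is written for the one-dimensional example of Section 4.2.5, where var(η_j) ≈ σ²(1 + ρ²).

```python
import numpy as np

def proposed_ssme(theta, n):
    """Proposed variation of the mixed errors under theta (assumed closed
    form for the 1-D example with eta_j = rho*eps_{j-1} + eps_j)."""
    sigma, rho = theta
    return n * sigma**2 * (1.0 + rho**2)

def tuning_range(candidates, y, fit_new_bart, tol):
    """Keep candidates theta with |SSE_real - SSME_proposed| < tol."""
    keep = []
    for theta in candidates:
        f_hat = fit_new_bart(y, theta)        # new BART fit given Sigma(theta)
        sse_real = np.sum((y - f_hat) ** 2)   # realized residual variation
        ssme = proposed_ssme(theta, len(y))   # variation implied by theta
        if abs(sse_real - ssme) < tol:
            keep.append(theta)
    return keep
```

The tolerance tol plays the role of the selection criterion used in the experiments, e.g. |SSE − SSME| < 0.5 below.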

4.4 Experiments and Results

In this section, we show applications of model (4.27) to two types of problems: one-dimensional and two-dimensional problems. The methods, back comparing and tuning range, will be discussed carefully. The idea of analysis of variation introduced in Section 4.3.1 will provide guidance for the parameter selection in the tuning range.

4.4.1 One Dimension Experiment

In one dimension, the class of autoregressive (AR) processes and its extension, autoregressive moving-average (ARMA) processes, are popular choices for modeling time-varying processes. By the Wold decomposition theorem, an AR(p) process is a special Gaussian process, called a Gaussian linear process, if it satisfies the recursions

$$y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \epsilon_t$$

where $\{\epsilon_t\}$ is an i.i.d. sequence of N(0, σ²) random variables and the polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ has no zeros inside or on the unit circle (Brockwell and Davis, 2002). This means that the Gaussian process BART model (4.27) can be used for one-dimensional problems.

Recall the example in Section 4.2.5. It is not exactly an AR(1) process but a correlated error process; nevertheless, it is also a Gaussian linear process, because the sequence η = {η_t} always follows a multivariate Gaussian distribution:

$$\eta \sim MVN(0, \Sigma), \qquad \Sigma = \sigma^2 A A^T$$

where the matrix A was defined in (4.26). In this case, the Gaussian process BART model (4.27) becomes (4.33), and next we treat this model with the previously proposed methods.

$$y(t_i) = f^{new}_{BART}(X(t_i)) + \eta(t_i), \qquad i = 1, \ldots, n \tag{4.33}$$

Note: In this experiment, the parameters in model (4.33) are θ = {σ, ρ}. Their real values are σ = 0.1 and ρ = 0.8 (the green point in Figure 4.7).

(1) Back comparing and Tuning range

To apply back comparing to find the tuning range, we first need to propose a searching set {θ^(0), ..., θ^(n)} in parameter space. The estimate from the old BART model provides a clue: σ̂ = 0.1042189, which is where the yellow dashed line is located in Figure 4.7. We can search σ in a neighboring interval of σ̂, namely (0.05, 0.15), shown in Figure 4.7. For the parameter ρ, its value must be constrained to (0, 1) to keep the Gaussian process η(t) stationary. We divided each interval into 10 segments and selected the centers as the searching set, so the searching set included 100 candidates {θ^(0), ..., θ^(99)}, as shown in Figure 4.7. As discussed in Section 4.3.3, to apply back comparing we need to compare $SSE^{new}_{real}$ and $SSME^{new}_{proposed}$. Figure 4.7 shows the back comparing results $SSE^{new}_{real} - SSME^{new}_{proposed}$; cells with negative (positive) values represent over-estimation (under-estimation). The tuning range (green cells) was selected under the criterion $|SSE^{new}_{real} - SSME^{new}_{proposed}| < 0.5$, and we can change this criterion to control the size of the tuning range. In Figure 4.7, the left panel shows the results obtained by working with the fitted training data: first the model was fitted with all the observed data, and then the observed data and their fitted values were used to calculate the variations $SSE^{new}_{real}$ and $SSME^{new}_{proposed}$. In this case, the BART model tends to overfit the data, causing the parameters to be under-estimated. To tackle this issue, instead of fitting the training data, we can use the predicted testing data and the observed testing data to compute the variations; the predicted testing data were generated by a 4-fold cross-validation. The right panel of Figure 4.7 presents the back comparing results and the tuning range in this case. Now you may have a question: why is the real value (green point) of the parameters not included in the tuning range? Look at the results of back comparing: compared to the real value, all the values in the tuning range are under-estimated, which indicates that their corresponding variations $SS_w$ (see Figure 4.3) are smaller than in the real-value case. Recall the analysis of variation and Figure 4.3: the real value corresponds to the ideal case, which rarely happens, while the values in the tuning range correspond to the normal case. Last but not least, the number of folds in cross-validation should not be too small (fewer than 4), lest it damage the correlation structure of the covariance matrix Σ.

Figure 4.7: Back comparing and tuning range. The green dot is the real value of the parameters. The vertical yellow dashed line shows the estimate of σ from the old BART. The green cells indicate the tuning range. The left panel shows the tuning range selected using the fitted training data, while the tuning range in the right panel was selected using the predicted testing data from a 4-fold cross-validation.

(2) Guidance of parameter selection in tuning range

As discussed in Section 4.3.3, any candidate in the tuning range is good for the model and the data, but maybe not for your purpose. Beyond personal purpose, there is guidance for selecting the parameter values in the tuning range according to the idea of analysis of variation in Section 4.3.1. In Figure 4.3, compared to the old BART model, the more $SS^{new}_{f_{BART}}$ and $SSE^{new}$ are shrunk, the better the Gaussian process BART model (4.27) performs in both interpretation and prediction. Figure 4.8 shows the values $SS^{old}_{f_{BART}} - SS^{new}_{f_{BART}}$ and $SSE^{old} - SSE^{new}$. So, under the guidance of analysis of variation, we should select large values of these two differences, which indicates a preference for correlation structure (large ρ).

Figure 4.8: Guidance of parameter selection in tuning range

Moreover, with the proposed parameter values we can decompose the variation in the Gaussian process η(t) into the pure error variation $SSE^{new}_{proposed}$ and the correlated variation $SSw^{new}_{proposed}$:

$$SSE^{new}_{proposed} = n\sigma^2, \qquad SSw^{new}_{proposed} = SSME^{new}_{proposed} - SSE^{new}_{proposed}$$

where $SSME^{new}_{proposed}$ is defined in Figure 4.5. Their values are shown in Figure 4.9. The analysis of variation (Figure 4.3) suggests selecting a large value of $SSw^{new}_{proposed}$ and a small value of $SSE^{new}_{proposed}$, which is consistent with the previous guidance.

Figure 4.9: The variance decomposition of Gaussian process η(t).

(3) Results

The analysis of variation recommended θ_top = {σ, ρ} = {0.085, 0.95} at the top of the tuning range. To study the effect of different values in the tuning range, we selected another candidate, θ_bottom = {σ, ρ} = {0.125, 0.05}, at the bottom of the tuning range. Their comparison in Figure 4.10 shows significant differences. In Figure 4.11, we compared the top candidate with the real value (left) and the bottom candidate with the old BART (right). Although the top candidate {σ, ρ} = {0.085, 0.95} differs from the real value {σ, ρ} = {0.1, 0.8}, their fits are quite similar. This indicates that the guidance from the analysis of variation is effective and can lead us close to the real value. On the other hand, the fit of the bottom candidate {σ, ρ} = {0.125, 0.05} is very close to the fit of the old BART {σ, ρ} = {0.1042189, 0}. Based on these comparisons, it is easy to imagine that if we scan the tuning range from top to bottom, the model fit will degenerate from the new BART with a (near) real correlation structure to the (near) old BART. In other words, the covariance matrix Σ will approximately degenerate to σ²I.

Figure 4.10: Two extreme candidates from the tuning range. The comparison shows that they impose different influences on the model fitting.

Figure 4.11: Comparing the two extreme candidates in the tuning range to the real value and the old BART. The left panel shows the similarity of the model fits between the candidate at the top of the tuning range and the real value. The right panel shows the similarity of the model fits between the candidate at the bottom of the tuning range and the old BART.

4.4.2 Two Dimension Experiment

The two-dimensional experiment is created following the Gaussian process BART model (4.27):

$$y(s_i) = f(x(s_i)) + w(s_i) + \epsilon(s_i), \qquad i \in \{1, \ldots, 400\} \tag{4.34}$$

where

• $f(x(s_i)) = x(s_i)^3$, $x(s_i) \sim \mathrm{unif}(1, 3)$.

• $w(s_i) \sim GP(0, C(\cdot, \cdot \mid \sigma, \phi))$, $C(s_j, s_k \mid \sigma, \phi) = \sigma^2 \exp\{-\phi\, d(s_j, s_k)\}$, where d(s_j, s_k) is the Euclidean distance between points s_j and s_k.

• $\epsilon(s_i) \overset{i.i.d.}{\sim} N(0, \tau^2)$.

• The real values of the parameters are {σ, φ, τ} = {1, 6, 1}.

We can explore the created data in Figure 4.12. Similar to the one-dimensional experiment, we examine this experiment in three parts.

Figure 4.12: Experiment data exploration. The left and middle panels show the spatial maps of y(s_i) and w(s_i) respectively. The right panel shows the relation between x and y.

(1) Back comparing and Tuning range

First, we need to propose the searching set in parameter space. The linear mixed model regression Kriging (4.35) can provide clues.

$$y(s_i) = \beta_0 + x(s_i)\beta_1 + w(s_i) + \epsilon(s_i) \tag{4.35}$$

The parameters estimated by (4.35) are {σ̂, φ̂, τ̂} = {1.18, 5.46, 1.05}. According to these estimates, we created the searching set as follows: 10 equally spaced values in the interval (0.5, 1.4) for σ; 10 equally spaced values in the interval (1, 10) for φ; and 7 equally spaced values in the interval (0.4, 1.6) for τ. Then back comparing was applied to build the tuning range. Figure 4.13 (left) shows the back comparing results and the selected tuning range. In this experiment, the mean square error (MSE) was used instead of the sum of squared errors (SSE) to avoid large values. Figure 4.13 (right) indicates that we adopted a strict criterion, $|MSE^{new}_{real} - MSME^{new}_{proposed}| < 1$, to select the tuning range. The exact values of back comparing are listed in Table 4.2.

Figure 4.13: Back comparing and tuning range (left). Back comparing MSE density and selection criterion (right).

(2) Guidance of parameter selection in tuning range

As discussed in the one-dimensional experiment, $MSf^{old}_{BART} - MSf^{new}_{BART}$ and $MSE^{old} - MSE^{new}$ can be used as guidance to select the good candidates in the tuning range.

Table 4.2: Back Comparing and Candidate Selection Guidance

τ     σ     φ    Back Comparing MSE    MSf_BART^old − MSf_BART^new    MSE^old − MSE^new
0.4   0.9   3    0.97500489            50.90653                       1.11422824
0.4   1.4   6    0.16215768            50.29464                       1.11422824
0.6   0.9   3    -0.02155375           50.02426                       0.99994253
0.6   1.4   6    -0.45026158           49.79650                       0.99994253
0.8   0.9   3    -0.10393848           50.10187                       0.83994253
1.0   0.9   3    0.54719834            50.95872                       0.63422824
1.0   1.4   6    -0.92835327           49.68413                       0.63422824
1.2   0.9   3    -0.04935067           50.61360                       0.38279967
1.4   0.9   3    0.14343539            51.10353                       0.08565681
1.4   1.2   5    0.80226033            49.45199                       0.08565681
1.6   0.6   1    0.85251483            51.78393                       -0.25720033

Figure 4.14 illustrates their relative values (color) and positions in the tuning range; their exact values are listed in Table 4.2. According to the first guidance, $MSf^{old}_{BART} - MSf^{new}_{BART}$, the candidates {τ, σ, φ} = {1.6, 0.6, 1}, {1.4, 0.9, 3} and {1, 0.9, 3} are the top three. But when we check the second guidance, $MSE^{old} - MSE^{new}$, it gives the opposite order, and its value is negative for the candidate {1.6, 0.6, 1}. In this case, {1.4, 0.9, 3} and {1, 0.9, 3} are both good. We choose the second one, {τ, σ, φ} = {1, 0.9, 3}, because the sum of the two differences, $(MSf^{old}_{BART} - MSf^{new}_{BART}) + (MSE^{old} - MSE^{new})$, is bigger than the first one's.

Figure 4.14: The left panel shows the values of $MSf^{old}_{BART} - MSf^{new}_{BART}$ in the tuning range. The right panel shows the values of $MSE^{old} - MSE^{new}$ in the tuning range.

(3) Results

The motivation for developing the Gaussian process BART model (4.27) is to gain the advantages of both the spatial linear mixed regression models and the spatial nonlinear regression models. On one hand, compared to the linear mixed regression models, the Gaussian process BART model should be capable of handling the nonlinear relationships between the observed variables y(s) and x(s). On the other hand, compared to the nonlinear regression models, the Gaussian process BART model should be able to understand the spatial dependence, which may be caused by the latent variables. Obviously, we have already achieved the second goal. For the first goal, let's compare the results between the Gaussian process BART model (4.34) and the linear mixed model (4.35).

• First, they have similar abilities to understand the spatially dependent structure in the data, because their estimated parameters are both close to the real values.

• Second, they have different abilities to understand the relationship between y(s) and x(s). Figure 4.15 demonstrates the differences: obviously, the Gaussian process BART model captured the nonlinear relation f(x) = x³ between y(s) and x(s) very well.

• Last but not least, the failure to fit the nonlinear trend may cause the linear mixed model to violate its assumption that the Gaussian process w(s) is stationary. In this experiment, the linear mixed model failed to extract the nonlinear trend f(x) = x³, so the stationarity assumption must be violated. There is in fact a sizable literature on this problem, which tries to estimate a non-stationary Gaussian process using methods such as spatial partitioning, process convolution, and low-rank splines or basis functions. In contrast, the Gaussian process BART model (4.27), which is able to capture both linear and nonlinear trends, naturally makes the stationarity assumption much more robust than it is in the linear mixed model.

Figure 4.15: The fitting results of Gaussian process BART, old BART and linear mixed model.

Moreover, we can compare the Gaussian process BART to the old BART. In Figure 4.15 they look similar but still show some differences, because the Gaussian process BART accounts for the spatial dependence when it fits the data. As in the example of Section 4.2.5, the Gaussian process BART model should perform better than the old BART in fitting the underlying process f(x). To illustrate this, we can calculate the mean square errors (MSE) as follows:

$$MSE_f^{old} = \frac{\sum_{i=1}^{n} \big(f(x_i) - \hat f^{old}(x_i)\big)^2}{n}, \qquad MSE_f^{GP} = \frac{\sum_{i=1}^{n} \big(f(x_i) - \hat f^{GP}(x_i)\big)^2}{n}$$

From the experiment, we obtained $MSE_f^{old}$ = 0.3594093 and $MSE_f^{GP}$ = 0.3418461. This result supports the claim that the Gaussian process BART performs better than the old BART in restoring the underlying process f(x).
4.4.3 Testing on Real Data

In this section, we test the Gaussian process BART on real data, namely the soil carbon stock data of Chapter 3. In order to visually compare the results among different models, we chose two environmental covariates for the test. From the results in Chapter 3, we know that the environmental covariates NDVI14 and REDL14 have nonlinear relationships with the response variable y (see Figure 3.8). So we choose them and construct the models as follows.

The linear mixed model:

$$y(s_i) = f_{LMX}(X(s_i)) + w(s_i) + \epsilon(s_i) \tag{4.36}$$

The Gaussian process BART model:

$$y(s_i) = f_{GPBART}(X(s_i)) + w(s_i) + \epsilon(s_i) \tag{4.37}$$

The old BART model:

$$y(s_i) = f_{BART}(X(s_i)) + \epsilon(s_i) \tag{4.38}$$

where

• s_i, i = 1, 2, ..., 6213; there are 6213 observations at different locations.

• We test two scenarios: one using the variable NDVI14 only, the other using the two variables NDVI14 and REDL14. So, in the first scenario,

$$X(s_i) = \{x_{NDVI14}(s_i)\}$$

and in the second scenario,

$$X(s_i) = \{x_{NDVI14}(s_i), x_{REDL14}(s_i)\}$$

• In the first scenario, $f_{LMX}(X(s_i))$ in the linear mixed model (4.36) is

$$f_{LMX}(X(s_i)) = \beta_0 + x_{NDVI14}(s_i)\beta_1$$

and in the second scenario it is

$$f_{LMX}(X(s_i)) = \beta_0 + x_{NDVI14}(s_i)\beta_1 + x_{REDL14}(s_i)\beta_2$$

• $w(s_i) \sim GP(0, C(\cdot, \cdot \mid \sigma, \phi))$, $C(s_j, s_k \mid \sigma, \phi) = \sigma^2 \exp\{-\phi\, d(s_j, s_k)\}$, where d(s_j, s_k) is the Euclidean distance between points s_j and s_k. We use the R package "fields" (Nychka et al., 2017) to check this model. In that package the parameter φ is set to 1 by default, so only the unknown parameter σ will be estimated.

• $\epsilon(s_i) \overset{i.i.d.}{\sim} N(0, \tau^2)$. The unknown parameter τ will be estimated.

The results will be presented in two scenarios as well.

(1) The first scenario (NDVI14)

The estimates from the linear mixed model (4.36) are shown in Table 4.3.

Table 4.3: Linear Mixed Model Estimations (the First Scenario)

β0           β1           σ            τ
4.0475534    0.4237016    0.5282045    0.6557

We use the estimates of σ and τ to fit the Gaussian process BART model (4.37). Figure 4.16 shows the fitting results of the three models, and we can see the differences among them. Since the Gaussian process BART model (4.37) used the same values of the covariance parameters as the linear mixed model (4.36), its fit shrinks more toward the linear mixed model than the old BART does. Meanwhile, the Gaussian process BART keeps its non-linearity compared to the linear mixed model.

Figure 4.16: The fitting results of Gaussian process BART, old BART and linear mixed model on real data with one covariate, NDVI14.

(2) The second scenario (NDVI14 and REDL14)

The real soil carbon stock data and the two environmental covariates NDVI14 and REDL14 are shown in Figure 4.17.

Figure 4.17: The real data with two environmental covariates NDVI14 and REDL14.

The estimates from the linear mixed model (4.36) are shown in Table 4.4. Similar to the first scenario, we use the estimates of σ and τ to fit the Gaussian process BART model (4.37).

Table 4.4: Linear Mixed Model Estimations (the Second Scenario)

β0           β1          β2            σ            τ
4.0413586    0.163074    -0.2934858    0.5040833    0.6539

Figure 4.18 illustrates the different model fits. Compared to the linear mixed model, both the Gaussian process BART and the old BART successfully captured the nonlinear relationships between the covariates and the response variable.

Figure 4.19 shows the differences among the three models. Similar to the first scenario, the Gaussian process BART shrinks more toward the linear mixed model than the old BART does, because the Gaussian process BART model (4.37) used the same values of the covariance parameters as the linear mixed model (4.36).

Figure 4.18: The model fits on the real data.

Figure 4.19: The results comparison among different models.

4.4.4 Discussion on Computation Issues

All the computation issues with the Gaussian process BART model (4.27) are related to the parameter space search for building the tuning range. The issues and possible solutions are discussed in this section.

(1) The curse of dimensionality

As shown in the experiments, to construct the tuning range we have to propose a searching set in parameter space. This searching set suffers from the curse of dimensionality as the number of parameters increases. The first possible solution is to use low-dimensional parametric Gaussian process models, for example, the Matérn family. The second possible solution is to use random search instead of grid search; the ad hoc information from random search can be used to locate the promising areas in parameter space. Another possible solution is parallel computing: in theory, all the points in the searching set can be tested in parallel. Since computational resources are limited, we can partition the searching set into subsets and deploy different computational resources to each of them.

(2) The inverse of the covariance matrix

For every search, we have to invert the covariance matrix Σ built from the proposed parameters in the searching set, because the algorithm of the new BART needs $\Sigma^{-1}$ rather than Σ (see Appendix A). Since the dimension of Σ is n × n, where n is the number of observations, the calculation of $\Sigma^{-1}$ will become the computational bottleneck of model (4.27) as the data grow. A possible solution is to create a sparse matrix with O(n) nonzero entries to approximate $\Sigma^{-1}$. A popular approach, the nearest neighbor Gaussian process (Finley et al., 2019), is presented in Appendix B.

Chapter 5

CONCLUSION

This chapter summarizes the key ideas and contributions of the dissertation. Ideas

for further research are also discussed.

5.1 Summary of Contributions

• Chapter 1 provided a unifying view of existing models for spatial regression. A

classification based on their capability of modeling latent variables was intro-

duced.

• In Chapter 2, a multistage workflow equipped with nonlinear models was proposed for the spatial prediction problem in a reef species abundance study. The methods, empirical maximum likelihood analysis and random smoothing, were developed to solve the zero-inflated issue in the sampling data. Three strategies, prior knowledge, aggregation and iteration, were introduced to help the nonlinear models overcome the out-of-sample prediction issue.

• Chapter 3 developed a novel two-stage model for the spatial regression problems

in soil carbon stock (SOC) analysis. In the first stage, a universal regression

Kriging model captures the linear and stationary effects of covariates. The

remaining nonlinear and non-stationary effects are modeled by a generalized

additive model in the second stage.

• In Chapter 4, the traditional BART model was extended to a new BART model that can accommodate general correlated errors. A novel nonlinear spatial regression model, called Gaussian process BART, can then be built by combining the new BART and a Gaussian process. Because of the failure of likelihood based MCMC in parameter estimation, two methods, back comparing and tuning range, were proposed based on the idea of analysis of variation.

5.2 Future Work

Promising paths for future work involve:

• Applying the Gaussian process BART model to real world problems.

• Solving the computational issue of the parameter space search for building the tuning range.

• Updating the R package "BART" with the new algorithm for accommodating correlated errors.

REFERENCES

Adhikari, K., U. Mishra, P. Owens, Z. Libohova, S. Wills, W. Riley, F. Hoffman and


D. Smith, “Importance and strength of environmental controllers of soil organic
carbon changes with scale”, Geoderma 375, 114472 (2020).
Ainsworth, L. M., C. B. Dean and R. Joy, “Zero-inflated spatial models: Application
and interpretation”, (2016).
Zammit-Mangion, A., "FRK: Fixed rank kriging", R package version 0.2.2.1 (2020).
Appelhans, T., E. Mwangomo, D. R. Hardy, A. Hemp and T. Nauss, “Evaluating
machine learning approaches for the interpolation of monthly air temperature at
mt. kilimanjaro, tanzania”, Spatial Statistics 14, 91–113 (2015).
Banerjee, S., A. E. Gelfand, A. O. Finley and H. Sang, “Gaussian predictive process
models for large spatial data sets”, Journal of the Royal Statistical Society: Series
B (Statistical Methodology) 70, 4, 825–848 (2008).
Breiman, L., "Random forests", Machine Learning 45, 1, 5–32 (2001).
Brockwell, P. J. and R. A. Davis, Introduction to Time Series and Forecasting
(Springer-Verlag New York, 2002).
Brunsdon, C., S. Fotheringham and M. Charlton, “Geographically weighted
regression-modelling spatial non-stationarity”, Journal of the Royal Statistical So-
ciety. Series D (The Statistician) 47, 3, 431–443 (1998).
Chipman, H. A., E. I. George and R. E. McCulloch, "BART: Bayesian additive regression trees", The Annals of Applied Statistics 4(1), 266–298 (2010).
Coleman, F. C., K. M. Scanlon and C. C. Koenig, “Groupers on the edge: Shelf
edge spawning habitat in and around marine reserves of the northeastern gulf of
mexico”, The Professional Geographer 63, 4, 456–474 (2011).
Cressie, N. and G. Johannesson, “Fixed rank kriging for very large spatial data sets”,
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 1,
209–226 (2008).
Cressie, N. A. C., Statistics for Spatial Data (John Wiley & Sons, Inc., 1993).
Cristianini, N. and J. Shawe-Taylor, An Introduction to Support Vector Machines and
Other Kernel-based Learning Methods (Cambridge University Press, 2000).
Datta, A., S. Banerjee, A. O. Finley and A. E. Gelfand, “Hierarchical nearest-neighbor
gaussian process models for large geostatistical datasets”, Journal of the American
Statistical Association 111, 514, 800–812 (2016).

Diesing, M. and D. Stephens, “A multi-model ensemble approach to seabed mapping”,
Journal of Sea Research 100, 62 – 69, meshAtlantic: Mapping Atlantic Area Seabed
Habitats for Better Marine Management (2015).
Dobesch, H., P. Dumolard and I. Dyras, Spatial Interpolation for Climate Data: The
Use of GIS in Climatology and Meteorology (ISTE Ltd, 2007).
Drexler, M. and C. H. Ainsworth, “Generalized additive models used to predict species
abundance in the gulf of mexico: an ecosystem modeling tool”, PloS one 8, 5 (2013).
Farmer, N. A. and J. S. Ault, “Grouper and snapper movements and habitat use in
dry tortugas, florida”, Mar Ecol Prog Ser 433, 169–184 (2011).
Farmer, N. A. and J. S. Ault, “Modeling coral reef fish home range movements in dry
tortugas, florida”, The Scientific World Journal 2014, 14 (2014).
Finley, A. O., A. Datta, B. D. Cook, D. C. Morton, H. E. Andersen and S. Banerjee,
“Efficient algorithms for bayesian nearest neighbor gaussian processes”, Journal of
Computational and Graphical Statistics 28, 2, 401–414 (2019).
Furrer, R., M. G. Genton and D. Nychka, “Covariance tapering for interpolation of
large spatial datasets”, Journal of Computational and Graphical Statistics 15, 3,
502–523 (2006).
Gelfand, A. E., P. J. Diggle, M. Fuentes and P. Guttorp, Handbook of Spatial Statistics
(Chapman & Hall/CRC, 2010).
Geoga, C. J., M. Anitescu and M. L. Stein, “Scalable gaussian process computations
using hierarchical matrices”, (2019).
Goff, J. A., C. J. Jenkins and S. Williams, “Seabed mapping and characterization of
sediment variability using the usseabed data base”, Continental Shelf Research 28,
4, 614 – 633 (2008).
Guisan, A., T. C. Edwards and T. Hastie, “Generalized linear and generalized additive
models in studies of species distributions: setting the scene”, Ecological Modelling
157, 2, 89 – 100 (2002).
Guisan, A., R. Tingley, J. B. Baumgartner, I. Naujokaitis-Lewis, P. R. Sutcliffe,
A. I. T. Tulloch, T. J. Regan, L. Brotons, E. McDonald-Madden, C. Mantyka-
Pringle, T. G. Martin, J. R. Rhodes, R. Maggini, S. A. Setterfield, J. Elith, M. W.
Schwartz, B. A. Wintle, O. Broennimann, M. Austin, S. Ferrier, M. R. Kearney,
H. P. Possingham and Y. M. Buckley, “Predicting species distributions for conser-
vation decisions”, Ecology Letters 16, 12, 1424–1435 (2013).
Hackbusch, W., Hierarchical Matrices: Algorithms and Analysis (2015).
Haining, R. P., R. Kerry and M. A. Oliver, “Geography, spatial data analysis, and
geostatistics: An overview”, Geographical Analysis 42, 731 (2010).

Harter, S., H. Moe, J. Reed and A. David, “Fish assemblages associated with red
grouper pits at pulley ridge, a mesophotic reef in the gulf of mexico”, Fishery
Bulletin 115, 419–432 (2017).
Hastie, T. and R. Tibshirani, “Generalized additive models”, Statistical Science Vol.
1, 297–318 (1986).
Hastie, T. and R. Tibshirani, Generalized Additive Models (Chapman and Hall, New
York, 1990).
Kelly, S., “Basic introduction to pygame”, (2016).
Lembke, C., S. Grasty, A. Silverman, H. A. Broadbent, S. E. Butcher and S. Mu-
rawski, “The camera-based assessment survey system (c-bass): A towed camera
platform for reef fish abundance surveys and benthic habitat characterization in
the gulf of mexico”, Continental Shelf Research 151, 62–71 (2017).
Li, J., A. D. Heap, A. Potter and J. J. Daniell, “Application of machine learning meth-
ods to spatial interpolation of environmental variables”, Environmental Modelling
& Software 26, 12, 1647 – 1659 (2011).
Lin, Y.-P., W.-C. Lin, Y.-C. Wang, W.-Y. Lien, T. Huang, C.-C. Hsu, D. S. Schmeller
and N. D. Crossman, “Systematically designating conservation areas for protect-
ing habitat quality and multiple ecosystem services”, Environmental Modelling &
Software 90, 126 – 146 (2017).
Lindgren, F., H. Rue and J. Lindström, "An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach", Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 4, 423–498 (2011).
Mateo-Sánchez, M. C., A. Gastón, C. Ciudad, J. I. García-Viñas, J. Cuevas, C. López-
Leiva, A. Fernández-Landa, N. Algeet-Abarquero, M. Marchamalo, M. Fortin and
S. Saura, “Seasonal and temporal changes in species use of the landscape: how do
they impact the inferences from multi-scale habitat modeling?”, Landscape Ecology
31, 1261–1276 (2015).
McDonald, A., J. S. Parslow and A. J. Davidson, “Interpretation of a modified linear
model of catch-per-unit-effort data in a spatially-dynamic fishery”, Environmental
Modelling & Software 16, 2, 167 – 181, environmental Modelling and Socioeco-
nomics (2001).
Nychka, D., R. Furrer, J. Paige and S. Sain, “fields: Tools for spatial data”, R package
version 10.3 (2017).
Pfeffermann, D., "New important developments in small area estimation", Statistical Science 28, 1, 40–68 (2013).
Prosser, D., C. Ding, R. Erwin, T. Mundkur, J. Sullivan and E. C. Ellis, “Species
distribution modeling in regions of high need and limited data: waterfowl of china”,
Avian Research 9, 1–14 (2018).

Ramchoun, H., M. J. Idrissi, Y. Ghanou and M. Ettaouil, “Multilayer perceptron:
Architecture optimization and training”, Int. J. Interact. Multim. Artif. Intell. 4,
26–30 (2016).
Robertson, G. P., “Geostatistics in ecology: Interpolating with known variance”,
Ecology 68, 3, 744–748 (1987).
Ríos-Pena, L., T. Kneib, C. Cadarso-Suárez, N. Klein and M. Marey-Pérez, "Studying the occurrence and burnt area of wildfires using zero-one-inflated structured additive beta regression", Environmental Modelling & Software 110, 107–118, special issue on Environmental Data Science and Decision Support: Applications in Climate Change and the Ecological Footprint (2018).
Sainte-Marie, B. and B. Hargrave, “Estimation of scavenger abundance and distance
of attraction to bait”, Marine Biology 94, 431–443 (1987).
Saul, S. and S. Purkis, “Semi-automated object-based classification of coral reef habi-
tat using discrete choice models”, Remote Sensing 7, 12, 15894–15916 (2015).
Saul, S., J. Walter, D. Die, D. Naar and B. Donahue, “Modeling the spatial distri-
bution of commercially important reef fishes on the west florida shelf”, Fisheries
Research 143, 12 – 20 (2013).
Schoener, T., “A brief history of optimal foraging ecology”, (1987).
Wood, S., "mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation", R package version 1.8-31 (2019).
Somerton, D. and C. T. Glendhill, “Report of the national marine fisheries service
workshop on underwater video analysis, august 4-6, 2004”, (2005).
Staff, S. S. and T. Loecke, “Rapid carbon assessment: Methodology, sampling and
summary”, (2016).
Stein, M. L., Statistical Interpolation of Spatial Data: Some Theory for Kriging
(Springer, New York, 1999).
Stohlgren, T. J., P. Ma, S. Kumar, M. Rocca, J. T. Morisette, C. S. Jarnevich and
N. Benson, “Ensemble habitat mapping of invasive plant species”, Risk Analysis
30, 2, 224–235 (2010).
Stoner, A. W., “Effects of environmental variables on fish feeding ecology: implica-
tions for the performance of baited fishing gear and stock assessment”, Journal of
Fish Biology 65, 6, 1445–1471 (2004).
Stratford, D. S., C. A. Pollino and A. E. Brown, “Modelling population responses to
flow: The development of a generic fish population model”, Environmental Mod-
elling & Software 79, 96 – 119 (2016).
Streich, M. K., M. J. Ajemian, J. J. Wetz and G. W. Stunz, “A comparison of fish
community structure at mesophotic artificial reefs and natural banks in the western
gulf of mexico”, Marine and Coastal Fisheries 9, 1, 170–189 (2017).

Vecchia, A. V., “Estimation and model identification for continuous spatial pro-
cesses”, Journal of the Royal Statistical Society. Series B (Methodological) 50,
2, 297–312 (1988).
Wenger, S. J. and J. D. Olden, “Assessing transferability of ecological models: an un-
derappreciated aspect of statistical validation”, Methods in Ecology and Evolution
3, 2, 260–267 (2012).
Williamson, D. J., G. L. Burn, S. Simoncelli, J. Griffié, R. Peters, D. M. Davis and D. M. Owen, "Machine learning for cluster analysis of localization microscopy data", Nat Commun 11(1), 1493 (2020).
Ye, L., L. Gao, R. Marcos-Martinez, D. Mallants and B. A. Bryan, “Projecting aus-
tralia’s forest cover dynamics and exploring influential factors using deep learning”,
Environmental Modelling & Software 119, 407 – 417 (2019).
Yuan, M. and Y. Lin, "Model selection and estimation in regression with grouped variables", Journal of the Royal Statistical Society, Series B 68(1), 49–67 (2006).

APPENDIX A

BART FOR CORRELATED DATA

A.1 Marginal Likelihood and Posterior Distribution

A.1.1 Marginal Likelihood

The marginal likelihood p(R|D) can be derived as follows.


First, by (4.12), we know

$$p(R \mid D, \mu) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\Big\{ -\frac{1}{2} (R - D\mu)^T \Sigma^{-1} (R - D\mu) \Big\} \tag{A.1}$$

Given the prior $\pi(\mu) \sim N(\bar\mu, Q^{-1})$, we have

$$\pi(\mu) = (2\pi)^{-b/2} |Q|^{1/2} \exp\Big\{ -\frac{1}{2} (\mu - \bar\mu)^T Q (\mu - \bar\mu) \Big\} \tag{A.2}$$
The marginal distrionbution of p(R|D) can be calculated by integrated out µ.
Z
p(R|D) = p(R|D, µ)π(µ)dµ

Let's check the product of the likelihood and the prior:

$$p(R \mid D, \mu)\pi(\mu) = (2\pi)^{-\frac{n+b}{2}} |\Sigma|^{-\frac12} |Q|^{\frac12} \exp\Big\{ -\frac{1}{2} \underbrace{\big[(R - D\mu)^T \Sigma^{-1} (R - D\mu) + (\mu - \bar\mu)^T Q (\mu - \bar\mu)\big]}_{(*)} \Big\} \tag{A.3}$$

Since

$$(*) = R^T\Sigma^{-1}R - 2R^T\Sigma^{-1}D\mu + \mu^T D^T\Sigma^{-1}D\mu + \mu^T Q\mu - 2\bar\mu^T Q\mu + \bar\mu^T Q\bar\mu = \mu^T(D^T\Sigma^{-1}D + Q)\mu - 2(R^T\Sigma^{-1}D + \bar\mu^T Q)\mu + R^T\Sigma^{-1}R + \bar\mu^T Q\bar\mu$$

we can introduce a variable v and consider the following term:

$$(\mu - v)^T(D^T\Sigma^{-1}D + Q)(\mu - v) = \mu^T(D^T\Sigma^{-1}D + Q)\mu - 2v^T(D^T\Sigma^{-1}D + Q)\mu + v^T(D^T\Sigma^{-1}D + Q)v$$

To make the linear coefficients equal, we let

$$v^T(D^T\Sigma^{-1}D + Q) = R^T\Sigma^{-1}D + \bar\mu^T Q, \qquad v^T = (R^T\Sigma^{-1}D + \bar\mu^T Q)(D^T\Sigma^{-1}D + Q)^{-1}$$

and hence

$$v = (Q + D^T\Sigma^{-1}D)^{-1}(Q\bar\mu + D^T\Sigma^{-1}R) \tag{A.4}$$

Finally,

$$(*) = (\mu - v)^T(Q + D^T\Sigma^{-1}D)(\mu - v) + C, \qquad C = -v^T(Q + D^T\Sigma^{-1}D)v + R^T\Sigma^{-1}R + \bar\mu^T Q\bar\mu$$

Plugging (A.1) and (A.2) into the integral gives

$$\int p(R \mid D, \mu)\pi(\mu)\, d\mu = (2\pi)^{-\frac{n+b}{2}} |\Sigma|^{-\frac12} |Q|^{\frac12} \exp\{-\tfrac12 C\} \int \exp\{-\tfrac12 (\mu - v)^T(Q + D^T\Sigma^{-1}D)(\mu - v)\}\, d\mu$$
$$= (2\pi)^{-\frac{n+b}{2}} |\Sigma|^{-\frac12} |Q|^{\frac12} \exp\{-\tfrac12 C\}\, (2\pi)^{\frac b2} |Q + D^T\Sigma^{-1}D|^{-\frac12} \int (2\pi)^{-\frac b2} |Q + D^T\Sigma^{-1}D|^{\frac12} \exp\{-\tfrac12 (\mu - v)^T(Q + D^T\Sigma^{-1}D)(\mu - v)\}\, d\mu$$
$$= \frac{(2\pi)^{-\frac n2} |\Sigma|^{-\frac12} |Q|^{\frac12}}{|Q + D^T\Sigma^{-1}D|^{\frac12}} \exp\{-\tfrac12 C\}$$

After simplifying, we get (A.5), which is the same as (4.14):

$$p(R \mid D) = \frac{(2\pi)^{-\frac n2} |\Sigma|^{-\frac12} |Q|^{\frac12}}{|Q + D^T\Sigma^{-1}D|^{\frac12}} \exp\Big\{-\frac12 \big(-v^T(Q + D^T\Sigma^{-1}D)v + \bar\mu^T Q\bar\mu + R^T\Sigma^{-1}R\big)\Big\} \tag{A.5}$$

where $v = (Q + D^T\Sigma^{-1}D)^{-1}(Q\bar\mu + D^T\Sigma^{-1}R)$.

A.1.2 Posterior Distribution

Based on the proof of the marginal likelihood above, it is easy to get the posterior distribution p(µ|R). By (4.18), we have

$$p(\mu \mid R) \propto p(R \mid \mu)\pi(\mu) = p(R \mid D, \mu)\pi(\mu)$$

Given p(R|D, µ) in (A.1) and π(µ) in (A.2), then by (A.3) and (A.4) we can directly prove that

$$p(\mu \mid R) \sim N\big( (Q + D^T\Sigma^{-1}D)^{-1}(Q\bar\mu + D^T\Sigma^{-1}R),\ (Q + D^T\Sigma^{-1}D)^{-1} \big) \tag{A.6}$$

A.2 Invariance under Reordering

If P is a permutation matrix, it has the property $P^{-1} = P^T$. Then, according to (4.21) and (4.22), we can prove (A.7):

$$D^T\Sigma^{-1}R = (PD_P)^T(P\Sigma_P P^T)^{-1}(PR_P) = D_P^T P^T P \Sigma_P^{-1} P^T P R_P = D_P^T \Sigma_P^{-1} R_P \tag{A.7}$$

Given $Q = \tau^{-2}I$, similarly to (A.7) we can prove that

$$Q + D^T\Sigma^{-1}D = Q + (PD_P)^T P\Sigma_P^{-1}P^T (PD_P) = Q + D_P^T \Sigma_P^{-1} D_P \tag{A.8}$$

Recall (4.16) and (4.20):

$$\frac{p(R \mid D^{i+1})}{p(R \mid D^{i})} = \frac{|Q^{i+1}|^{\frac12}\, |Q^{i} + (D^{i})^T\Sigma^{-1}D^{i}|^{\frac12}}{|Q^{i}|^{\frac12}\, |Q^{i+1} + (D^{i+1})^T\Sigma^{-1}D^{i+1}|^{\frac12}} \cdot \exp\Big\{\frac12 R^T\Sigma^{-1}\big[D^{i+1}(Q^{i+1} + (D^{i+1})^T\Sigma^{-1}D^{i+1})^{-1}(D^{i+1})^T - D^{i}(Q^{i} + (D^{i})^T\Sigma^{-1}D^{i})^{-1}(D^{i})^T\big]\Sigma^{-1}R\Big\}$$

and

$$p(\mu \mid R) \sim N\big((Q + D^T\Sigma^{-1}D)^{-1}D^T\Sigma^{-1}R,\ (Q + D^T\Sigma^{-1}D)^{-1}\big)$$

By applying (A.7) and (A.8), (4.16) and (4.20) are obviously invariant under reordering.

A.3 On the Calculation of Marginal Likelihood Ratio

A.3.1 Calculate Matrix A

We use the matrix A to denote (4.25):

$$A = Q + D_P^T \Sigma_P^{-1} D_P, \qquad Q = \tau^{-2} I$$

According to the discussion of reordering in Section 4.2.4, it is easy to see that A is a symmetric matrix. We can write it as follows:

$$A = \begin{pmatrix} a_{11} + \tau^{-2} & a_{12} & \cdots & a_{1b} \\ a_{21} & a_{22} + \tau^{-2} & \cdots & a_{2b} \\ \vdots & \vdots & \ddots & \vdots \\ a_{b1} & a_{b2} & \cdots & a_{bb} + \tau^{-2} \end{pmatrix} \tag{A.9}$$

where

$$a_{ji} = a_{ij} = \sum_{h \in n_i} \sum_{l \in n_j} q_{hl}, \qquad i \le j,\ i, j \in \{1, \ldots, b\},$$

$n_k$, k ∈ {1, ..., b}, is the index set of the observations associated with bottom node k, and $q_{hl}$ is the entry in the h-th row and l-th column of $\Sigma_P^{-1}$:

$$\Sigma_P^{-1} = \begin{pmatrix} q_{11} & q_{12} & \cdots & q_{1n} \\ q_{21} & q_{22} & \cdots & q_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nn} \end{pmatrix}$$

So the work required to calculate A is the summation of the nonzero entries of $\Sigma_P^{-1}$.

A.3.2 The Block Form of Matrix E

Plugging A into (4.23), we get

$$\frac{p(R \mid D^{i+1})}{p(R \mid D^{i})} = \frac{|Q^{i+1}|^{1/2}\, |A^{i}|^{1/2}}{|Q^{i}|^{1/2}\, |A^{i+1}|^{1/2}} \cdot \exp\Big\{\frac12 R_P^T \Sigma_P^{-1} \underbrace{\big[D_P^{i+1}(A^{i+1})^{-1}(D_P^{i+1})^T - D_P^{i}(A^{i})^{-1}(D_P^{i})^T\big]}_{E} \Sigma_P^{-1} R_P\Big\}$$

To understand the form of E, we have to consider the birth and death operations separately. Without loss of generality, we assume that the birth or death operation occurs in the (i+1)-th MCMC iteration. Since the dummy matrix D has a very special form (see Section 4.2.1), we developed the following algorithm to achieve computational efficiency.

(1) Birth

In this scenario, the tree has b bottom nodes at step i and b + 1 nodes at step i + 1, so $(A^i)^{-1}$ and $(A^{i+1})^{-1}$ are b × b and (b+1) × (b+1) matrices. We can write them as block matrices:

$$(A^{i+1})^{-1} = \begin{pmatrix} V_{11}^{i+1} & V_{12}^{i+1} \\ V_{21}^{i+1} & V_{22}^{i+1} \end{pmatrix}, \qquad (A^{i})^{-1} = \begin{pmatrix} V_{11}^{i} & v_{12}^{i} \\ v_{21}^{i} & v_{22}^{i} \end{pmatrix}$$

where $V_{11}^{i+1}$ and $V_{11}^{i}$ are (b−1) × (b−1) matrices; $V_{12}^{i+1} = (V_{21}^{i+1})^T$ is a (b−1) × 2 matrix; $v_{12}^{i} = (v_{21}^{i})^T$ is a (b−1)-dimensional column vector; and $v_{22}^{i}$ is a scalar.

We create the matrix

$$(A^{i})^{-1}_{ex} = \begin{pmatrix} V_{11}^{i} & v_{12}^{i} & v_{12}^{i} \\ v_{21}^{i} & v_{22}^{i} & v_{22}^{i} \\ v_{21}^{i} & v_{22}^{i} & v_{22}^{i} \end{pmatrix}$$

Let $B = (A^{i+1})^{-1} - (A^{i})^{-1}_{ex}$; then

$$B = \begin{pmatrix} V_{11}^{i+1} - V_{11}^{i} & V_{12}^{i+1} - \begin{pmatrix} v_{12}^{i} & v_{12}^{i} \end{pmatrix} \\ V_{21}^{i+1} - \begin{pmatrix} v_{21}^{i} \\ v_{21}^{i} \end{pmatrix} & V_{22}^{i+1} - \begin{pmatrix} v_{22}^{i} & v_{22}^{i} \\ v_{22}^{i} & v_{22}^{i} \end{pmatrix} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1(b+1)} \\ b_{21} & b_{22} & \cdots & b_{2(b+1)} \\ \vdots & \vdots & \ddots & \vdots \\ b_{(b+1)1} & b_{(b+1)2} & \cdots & b_{(b+1)(b+1)} \end{pmatrix}$$

(2) Death

Similarly to the birth scenario, we write the matrices $(A^i)^{-1}$ and $(A^{i+1})^{-1}$ as

$$(A^{i})^{-1} = \begin{pmatrix} V_{11}^{i} & V_{12}^{i} \\ V_{21}^{i} & V_{22}^{i} \end{pmatrix}, \qquad (A^{i+1})^{-1} = \begin{pmatrix} V_{11}^{i+1} & v_{12}^{i+1} \\ v_{21}^{i+1} & v_{22}^{i+1} \end{pmatrix}$$

where $V_{11}^{i+1}$ and $V_{11}^{i}$ are (b−2) × (b−2) matrices; $V_{12}^{i} = (V_{21}^{i})^T$ is a (b−2) × 2 matrix; $v_{12}^{i+1} = (v_{21}^{i+1})^T$ is a (b−2)-dimensional column vector; and $v_{22}^{i+1}$ is a scalar.

Create the matrix

$$(A^{i+1})^{-1}_{ex} = \begin{pmatrix} V_{11}^{i+1} & v_{12}^{i+1} & v_{12}^{i+1} \\ v_{21}^{i+1} & v_{22}^{i+1} & v_{22}^{i+1} \\ v_{21}^{i+1} & v_{22}^{i+1} & v_{22}^{i+1} \end{pmatrix}$$

Different from the birth scenario, $B = (A^{i+1})^{-1}_{ex} - (A^{i})^{-1}$:

$$B = \begin{pmatrix} V_{11}^{i+1} - V_{11}^{i} & \begin{pmatrix} v_{12}^{i+1} & v_{12}^{i+1} \end{pmatrix} - V_{12}^{i} \\ \begin{pmatrix} v_{21}^{i+1} \\ v_{21}^{i+1} \end{pmatrix} - V_{21}^{i} & \begin{pmatrix} v_{22}^{i+1} & v_{22}^{i+1} \\ v_{22}^{i+1} & v_{22}^{i+1} \end{pmatrix} - V_{22}^{i} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1b} \\ b_{21} & b_{22} & \cdots & b_{2b} \\ \vdots & \vdots & \ddots & \vdots \\ b_{b1} & b_{b2} & \cdots & b_{bb} \end{pmatrix}$$

Block form of matrix E

We can write E as a block matrix:

$$E = D_P^{i+1}(A^{i+1})^{-1}(D_P^{i+1})^T - D_P^{i}(A^{i})^{-1}(D_P^{i})^T = \begin{pmatrix} E_{11} & E_{12} & \cdots & E_{1b'} \\ E_{21} & E_{22} & \cdots & E_{2b'} \\ \vdots & \vdots & \ddots & \vdots \\ E_{b'1} & E_{b'2} & \cdots & E_{b'b'} \end{pmatrix}$$

where

$$b' = \begin{cases} b + 1, & \text{Birth}, \\ b, & \text{Death}. \end{cases}$$

Each block has the special form

$$E_{ij} = E_{ji}^T = b_{ij} J_{ij}, \qquad i \le j,\ i, j \in \{1, \ldots, b'\},$$

where $b_{ij}$ is the (i, j) element of the matrix B calculated in the birth or death step, and $J_{ij}$ is a card(n_i) × card(n_j) matrix of ones (card(n_k) is the cardinality of the set n_k).

A.3.3 Calculate Marginal Likelihood Ratio

Let's set

$$R_P^T \Sigma_P^{-1} = [\omega_1\ \omega_2\ \ldots\ \omega_{b'}], \qquad \omega_i = [\omega_{ij}],\ j \in n_i,$$

and

$$u = R_P^T \Sigma_P^{-1} E\, \Sigma_P^{-1} R_P.$$

Then u can be calculated as

$$u = [\omega_1\ \omega_2\ \ldots\ \omega_{b'}]\, E \begin{pmatrix} \omega_1^T \\ \omega_2^T \\ \vdots \\ \omega_{b'}^T \end{pmatrix} = \sum_{i=1}^{b'} \sum_{j=1}^{b'} \omega_i E_{ij} \omega_j^T = \sum_{i=1}^{b'} \sum_{j=1}^{b'} (\omega_i J_{ij} \omega_j^T)\, b_{ij} = \sum_{i=1}^{b'} \sum_{j=1}^{b'} \Big[ \Big(\sum_{h \in n_i} \omega_{ih}\Big) \Big(\sum_{l \in n_j} \omega_{jl}\Big) b_{ij} \Big]$$

Finally, we get the marginal likelihood ratio:

$$\frac{p(R \mid D^{i+1})}{p(R \mid D^{i})} = \begin{cases} \tau^{-1} \sqrt{\dfrac{|A^{i}|}{|A^{i+1}|}}\; \exp\{\tfrac12 u\}, & \text{Birth} \\[2mm] \tau\, \sqrt{\dfrac{|A^{i}|}{|A^{i+1}|}}\; \exp\{\tfrac12 u\}, & \text{Death} \end{cases}$$
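Computationally, once the per-node sums of ω are available, the double sum for u is only an O(b'²) quadratic form. A short sketch (our illustration, with hypothetical names) follows:

```python
import numpy as np

def quad_form_u(omega, node_of, B):
    """u = sum_{i,j} (sum_{h in n_i} omega_h)(sum_{l in n_j} omega_l) * b_ij.

    omega   : length-n vector R_P^T Sigma_P^{-1}
    node_of : length-n array mapping each observation to its bottom node
    B       : b' x b' matrix from the birth/death step
    """
    s = np.zeros(B.shape[0])
    np.add.at(s, node_of, omega)  # s[i] = sum of omega over node i
    return s @ B @ s
```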

APPENDIX B

NEAREST NEIGHBOR GAUSSIAN PROCESS

An easy way to understand the nearest neighbor Gaussian process is through the linear form of a Gaussian process (B.1):

$$w = Hw + \eta \tag{B.1}$$

where w is an instance of the Gaussian process $W \sim GP(0, K(\cdot, \cdot \mid \theta))$ and $w \sim N(0, C)$, with C the covariance matrix calculated from $K(\cdot, \cdot \mid \theta)$. The structure of H is

$$H = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ h_{21} & 0 & 0 & \cdots & 0 \\ h_{31} & h_{32} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ h_{n1} & h_{n2} & \cdots & h_{n(n-1)} & 0 \end{pmatrix}$$

so that

$$w_1 = \eta_1, \quad w_2 = h_{21}w_1 + \eta_2, \quad w_3 = h_{31}w_1 + h_{32}w_2 + \eta_3, \quad \ldots, \quad w_n = h_{n1}w_1 + \cdots + h_{n(n-1)}w_{n-1} + \eta_n,$$

and

$$\eta \sim N(0, \Lambda)$$

where Λ is diagonal with entries $\Lambda_{11} = \mathrm{var}(w_1)$ and $\Lambda_{ii} = \mathrm{var}(w_i \mid \{w_j : j < i\})$ for i = 2, ..., n.

Since I − H is nonsingular,

$$I - H = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -h_{21} & 1 & 0 & \cdots & 0 \\ -h_{31} & -h_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ -h_{n1} & -h_{n2} & \cdots & -h_{n(n-1)} & 1 \end{pmatrix}$$

Then (B.1) can be transformed to $w = (I - H)^{-1}\eta$, so

$$C = (I - H)^{-1} \Lambda (I - H)^{-T} \tag{B.2}$$

Recall that

$$w_{i+1} = h_{(i+1)1}w_1 + h_{(i+1)2}w_2 + \cdots + h_{(i+1)i}w_i + \eta_{i+1} \tag{B.3}$$

Let $h_{i+1} = (h_{(i+1)1}, \ldots, h_{(i+1)i})$ and $w_{i+1} = (w_1, \ldots, w_i, w_{i+1}) = (w_i, w_{i+1})$, where $w_i = (w_1, \ldots, w_i)$.

Note: For any matrix M and index sets $I_1, I_2 \subseteq \{1, 2, \ldots, n\}$, let $M[I_1, I_2]$ denote the submatrix of M formed by the rows indexed by $I_1$ and the columns indexed by $I_2$. Let

$$\mathrm{var}(w_1, \ldots, w_{i+1}) = C$$

Then $\mathrm{var}(w_1, \ldots, w_i) = C[1{:}i, 1{:}i]$ and

$$C = \begin{pmatrix} C[1{:}i, 1{:}i] & C[1{:}i, i+1] \\ C[i+1, 1{:}i] & C[i+1, i+1] \end{pmatrix}$$

By equation (B.3), we get

$$C[i+1, 1{:}i] = h_{i+1} \cdot C[1{:}i, 1{:}i], \qquad C[i+1, i+1] = \Lambda_{i+1,i+1} + h_{i+1} \cdot C[1{:}i, i+1]$$

Then $h_{i+1}$ and $\Lambda_{i+1,i+1}$ can be calculated as follows:

$$h_{i+1} = C[i+1, 1{:}i]\, C[1{:}i, 1{:}i]^{-1} \tag{B.4}$$

$$\Lambda_{i+1,i+1} = C[i+1, i+1] - h_{i+1} \cdot C[1{:}i, i+1] \tag{B.5}$$


Using (B.4) and (B.5), the covariance matrix C can be decomposited by (B.2).
However, the computational complexity of (B.4) still increases as the dimension of
C[1 : i, 1 : i] increasing (O(n3 )). In order to achieve the sparsity, we permit no more
than m elements in hi (the i-th row of matrix H) to be nonzero.
Let ne(i) to represent the number of nearest neighbors of point i = 1, ..., n and
ne(i) ≤ m. Then equation (B.4) and (B.5) become

hne[i+1] = C[i + 1, ne(i + 1)]C[ne(i + 1), ne(i + 1)]−1 (B.6)

Λi+1,i+1 = C[i + 1, i + 1] − hne[i+1] · C[ne(i + 1), i + 1] (B.7)


The size of linear system { (B.6), (B.7) } is at most m × m. So, the computational
complexity decreases from O(n3 ) to O(nm3 ).
From (B.2), (B.6) and (B.7), we get

$$\tilde C = (I - H)^{-1} \Lambda (I - H)^{-T} \tag{B.8}$$

$$\tilde C^{-1} = (I - H)^{T} \Lambda^{-1} (I - H) \tag{B.9}$$

where H and Λ are computed from (B.6) and (B.7), respectively.
where H and Λ are computed from (B.6) and (B.7) respectively.
Since $\Sigma = C + \tau^2 I$, by the Sherman-Woodbury-Morrison (SWM) identity we get

$$\Sigma^{-1} = (C + \tau^2 I)^{-1} = \tau^{-2} I - \tau^{-4} (C^{-1} + \tau^{-2} I)^{-1}$$

Then the approximation of $\Sigma^{-1}$ is

$$\tilde\Sigma^{-1} = \tau^{-2} I - \tau^{-4} (\tilde C^{-1} + \tau^{-2} I)^{-1} \tag{B.10}$$

The calculation of $(\tilde C^{-1} + \tau^{-2} I)^{-1}$ can exploit the sparsity of $\tilde C^{-1}$.
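A compact sketch of the whole construction follows. This is our illustration under stated assumptions: neighbor sets ne(i) are taken as the m nearest previously ordered points (a common NNGP convention), and dense matrices are used only to keep the sketch short, whereas the point of the method is that I − H and Λ are sparse.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nngp_precision(coords, cov_fn, m, tau):
    """NNGP approximation of Sigma^{-1} = (C + tau^2 I)^{-1}.

    Builds H and Lambda row by row from (B.6)-(B.7), then applies
    (B.9) and (B.10)."""
    n = len(coords)
    d = cdist(coords, coords)
    C = cov_fn(d)
    H = np.zeros((n, n))
    lam = np.zeros(n)
    lam[0] = C[0, 0]                                      # Lambda_11 = var(w_1)
    for i in range(1, n):
        ne = np.argsort(d[i, :i])[:m]                     # neighbor set ne(i)
        h = np.linalg.solve(C[np.ix_(ne, ne)], C[ne, i])  # (B.6)
        H[i, ne] = h
        lam[i] = C[i, i] - h @ C[ne, i]                   # (B.7)
    IH = np.eye(n) - H
    C_tilde_inv = IH.T @ (IH / lam[:, None])              # (B.9)
    # (B.10) via the SWM identity
    Sigma_tilde_inv = (np.eye(n) / tau**2
                       - np.linalg.inv(C_tilde_inv + np.eye(n) / tau**2) / tau**4)
    return C_tilde_inv, Sigma_tilde_inv

# example: exponential covariance on random 2-D locations
rng = np.random.default_rng(3)
coords = rng.uniform(size=(100, 2))
Ci, Si = nngp_precision(coords, lambda d: np.exp(-6.0 * d), m=10, tau=1.0)
```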
